BitDistiller: Unleashing the Potential of Sub-4-Bit LLMs via Self-Distillation

Dayou Du, Yijia Zhang, Shijie Cao, Jiaqi Guo, Ting Cao, Xiaowen Chu, Ningyi Xu

Introduction

Scaling up model sizes has been pivotal to the success of large language models (LLMs), yielding unprecedented performance across diverse natural language processing tasks Brown et al. (2020); Touvron et al. (2023); Kaplan et al. (2020). However, such escalating model size poses significant challenges in deployment, particularly on resource-constrained devices, due to the substantial memory footprint and computational requirements.

Weight quantization has emerged as a popular strategy to enhance the efficiency and accessibility of LLMs by reducing model size with minimal performance loss Gholami et al. (2022). In practice, 4-bit quantization has been widely adopted, offering a balance between a considerable compression ratio and the preservation of LLM capabilities Lin et al. (2023); Frantar et al. (2022); Liu et al. (2023a).

However, sub-4-bit quantization significantly degrades the fidelity of model weights, leading to deteriorated model performance, especially in smaller models or tasks requiring complex reasoning Dettmers and Zettlemoyer (2023). To address this, researchers have developed various Post-Training Quantization (PTQ) and Quantization-Aware Training (QAT) methods Chee et al. (2023); Shao et al. (2023). PTQ is appealing because it requires no retraining, but it struggles to preserve model performance at very low precisions. In contrast, QAT incorporates quantization into the training loop, enabling dynamic adaptation to reduced precision and thus maintaining higher accuracy Liu et al. (2023b); Kim et al. (2023a). Despite this early promise, extreme low-bit QAT still faces two fundamental challenges: how to maximally preserve weight fidelity during quantization, and how to effectively learn low-bit representations during training.

In this work, we present BitDistiller, a novel framework that synergizes QAT with Knowledge Distillation (KD) to significantly boost the performance of sub-4-bit quantized LLMs. To minimize quantization error, BitDistiller employs a tailored asymmetric quantization and clipping strategy to maintain the capabilities of the full-precision model as much as possible, particularly at ultra-low-bit levels. For efficient and effective low-bit representation learning, BitDistiller leverages a simple yet effective self-distillation approach, wherein the full-precision model acts as its own teacher to refine the low-bit student model. Notably, BitDistiller introduces a Confidence-Aware Kullback-Leibler divergence (CAKLD) objective that improves the efficacy of knowledge transfer, enabling faster convergence and enhanced model performance.

Our empirical evaluations, conducted on a diverse suite of general language understanding and complex reasoning tasks including mathematics and coding, demonstrate that BitDistiller significantly outperforms existing PTQ and QAT methods in the realm of sub-4-bit quantization. As illustrated in Figure 1, BitDistiller achieves the most favorable scaling law in both 3-bit and 2-bit configurations on the code reasoning benchmark. Moreover, BitDistiller is demonstrated to be more cost-effective, requiring less training data and fewer training resources, thereby marking a significant advancement toward deploying robust Large Language Models on resource-constrained devices.

Background and Related Work

PTQ is directly applied to pre-trained models without additional training. PTQ for LLMs typically employs techniques that either adjust quantization error Frantar et al. (2022); Chee et al. (2023) or prioritize salient weights Dettmers et al. (2023b); Lin et al. (2023); Kim et al. (2023b). However, the lack of retraining in PTQ may cause notable decreases in model performance at extremely low precisions. In contrast, QAT integrates quantization into the training phase, enabling the model to learn better representations for low-bit weights, as demonstrated by approaches like LLM-QAT Liu et al. (2023b), OmniQuant Shao et al. (2023), PB-LLM Shang et al. (2023), and BitNet Wang et al. (2023). Despite improved model performance, QAT is still challenged by the need for extensive training and data, leaving significant room for further optimization and enhancement. In this work, we harness the synergy of QAT and KD to enhance the performance of quantized LLMs, especially at sub-4-bit settings.

Granularity and Format Optimizations

Extensive research indicates that adopting finer-grained quantization approaches, such as group-wise quantization, can achieve higher accuracy compared to layer-wise or channel-wise methods Shen et al. (2020); Frantar et al. (2022). Floating-point formats (FP8/FP4/NF4) have been demonstrated to deliver superior accuracy compared to integer formats (INT8/INT4) in LLM quantization Kuzmin et al. (2022); Dettmers and Zettlemoyer (2023); Zhang et al. (2023b). Notably, asymmetric quantization methods, particularly for floating-point formats, outperform their symmetric counterparts by better accommodating the distribution of model weights Zhang et al. (2023a). BitDistiller aligns with these insights, employing finer granularity and asymmetric techniques for quantization.

Knowledge Distillation for LLMs

In the realm of LLMs, white-box knowledge distillation (KD) has become increasingly prevalent due to the accessible distribution of the teacher model, which facilitates the transmission of knowledge representations to the student model Hinton et al. (2015); Zhu et al. (2023). Notably, MINILLM Gu et al. (2023) utilizes the reverse KLD to ensure the accuracy and fidelity of language generation. GKD Agarwal et al. (2023) explores an alternative divergence, the generalized Jensen–Shannon divergence (JSD), and addresses the distribution mismatch by sampling outputs from the student model during training.

To attain exceedingly high compression ratios, a promising method is to combine KD with model quantization, where KD can be effectively used to mitigate the accuracy decline of quantized models Zhang et al. (2020); Kim et al. (2022). Among recent work applying QAT-based KD to LLMs, TSLD Kim et al. (2023a) considers the risk of overfitting and conducts logit distillation with a ground-truth loss. Similarly, LLM-QAT leverages data randomly generated by the teacher for data-free distillation. In contrast to TSLD and LLM-QAT, we achieve better performance and cost-efficiency at extremely low-bit quantization levels.

Methodology

In this section, we introduce BitDistiller, a QAT with self-distillation framework for LLMs, as illustrated in Figure 2. To maximally preserve weight fidelity during quantization, we first present an asymmetric quantization and clipping method (see Section 3.1). Second, to counteract the performance degradation caused by precision reduction, we adopt Knowledge Distillation and propose a novel Confidence-Aware KL divergence (CAKLD) objective, in which the full-precision model acts as the teacher and the low-precision one as the student (see Section 3.2).

Algorithm 1 outlines the process of BitDistiller. Given the full-precision weight w, BitDistiller adopts asymmetric clipping to alleviate outliers in w (Line 4), prior to the training loop. Then, in each training step, BitDistiller forwards the model with the quantized weights ( $w^{t}_{Q}$ ), computes the loss with the proposed CAKLD objective (Lines 8-9), and updates the full-precision weights (Lines 11-12), with gradients passed through the quantizer by the straight-through estimator Bengio et al. (2013). When the training finishes, BitDistiller returns the final quantized weights.
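The training loop described above can be sketched on a toy problem. The following is a minimal illustration under loudly stated assumptions, not the authors' implementation: a single linear layer stands in for the LLM, a mean-squared error against the teacher outputs is swapped in for the CAKLD objective to keep the gradient short, the initial clipping step is omitted, and the names `fake_quant` and `qat_self_distill` are hypothetical. What it does show is the core pattern: the forward pass uses quantized weights, while the update is applied to latent full-precision weights as if the quantizer were the identity.

```python
import numpy as np

def fake_quant(w, bits=2):
    """Quantize-dequantize w on a uniform grid spanning [w.min(), w.max()]."""
    qmax = 2**bits - 1
    lo, hi = w.min(), w.max()
    s = max((hi - lo) / qmax, 1e-8)
    return np.round((w - lo) / s) * s + lo

def qat_self_distill(w_teacher, X, steps=200, lr=1e-2, bits=2):
    """Toy QAT loop: the full-precision layer is its own teacher.

    Assumption: MSE on layer outputs replaces the CAKLD loss used in the paper.
    """
    w = w_teacher.copy()               # latent full-precision student weights
    target = X @ w_teacher.T           # teacher outputs
    for _ in range(steps):
        w_q = fake_quant(w, bits)      # forward pass uses quantized weights
        err = X @ w_q.T - target
        grad_wq = 2.0 * err.T @ X / err.size  # d(MSE)/d(w_q)
        w -= lr * grad_wq              # straight-through: d(w_q)/d(w) taken as identity
    return fake_quant(w, bits)         # final quantized weights
```

Because rounding has zero gradient almost everywhere, the straight-through update is what lets the latent weights move at all; the loss is not guaranteed to decrease monotonically in such a loop.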

The adoption of finer granularities, or smaller group sizes, in weight quantization of LLMs inherently leads to asymmetrical distributions and the presence of outliers in weight groups. Proper management of asymmetry is crucial to maintaining model performance in low-bit PTQ regimes. Our investigation reveals that the effects of asymmetry are more prominent in extremely low-bit QAT, such as 3-bit and 2-bit configurations, necessitating tailored strategies to address these challenges. Therefore, in BitDistiller, we adopt asymmetric quantization techniques coupled with asymmetric clipping strategies to enhance the representational fidelity of quantized weights and maximally preserve the capabilities of the full-precision model.

Previous studies have shown that floating-point formats (e.g., FP, NF) often outperform integer formats (INT) in LLM quantization Dettmers et al. (2023a); Liu et al. (2023a). This advantage is attributed to their non-uniform nature, which can capture a wider range of values and aligns better with the natural distribution of weight tensors in LLMs. However, as the quantization level falls to 2-bit, we observe a notable decline in the effectiveness of FP/NF formats. In 2-bit cases, the limited representational capacity, offering only four distinct values, undermines the benefits of a non-uniform distribution and impedes the efficient utilization of each numerical value. In light of these findings, we employ NF formats for quantization above 2-bit, while opting for the INT format at the 2-bit level.

For NF formats (e.g., NF3), we adopt the AFPQ method Zhang et al. (2023a) to enable asymmetric quantization, which establishes separate scales, $s_{pos}$ for positive weights $w_{pos}$ and $s_{neg}$ for negative weights $w_{neg}$ , as shown in Equation 1. For INT formats (e.g., INT2), we utilize conventional asymmetric methods with a single scale and a designated zero point, as detailed in Equation 2.
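As an illustration of the two schemes, the sketch below quantizes and dequantizes one weight group in both styles. It is a hedged example rather than the paper's code: `CODEBOOK` is a hypothetical non-uniform 3-bit level table chosen for readability (it is not the real NF3 table), the function names are invented, and the INT variant uses the common integer zero-point convention, which may differ in detail from the paper's Equation 2.

```python
import numpy as np

# Hypothetical non-uniform 3-bit codebook on [-1, 1]; NOT the actual NF3 table.
CODEBOOK = np.array([-1.0, -0.55, -0.25, 0.0, 0.2, 0.45, 0.7, 1.0])

def nearest_level(x, levels):
    """Map every entry of x to the closest value in levels."""
    idx = np.abs(x[..., None] - levels).argmin(axis=-1)
    return levels[idx]

def nf_asym_quant_dequant(w, levels=CODEBOOK):
    """Separate scales: s_pos for positive weights, s_neg for the rest."""
    s_pos = max(w.max(), 1e-8)
    s_neg = max(-w.min(), 1e-8)
    scale = np.where(w > 0, s_pos, s_neg)
    return nearest_level(w / scale, levels) * scale

def int_asym_quant_dequant(w, bits=2):
    """Single scale s plus an integer zero point z."""
    qmax = 2**bits - 1
    s = max((w.max() - w.min()) / qmax, 1e-8)
    z = np.round(-w.min() / s)
    q = np.clip(np.round(w / s) + z, 0, qmax)
    return (q - z) * s
```

With two scales, both the largest positive and the most negative weight of a skewed group map exactly onto the ends of the codebook, which a single symmetric scale cannot achieve.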

Asymmetric Clipping

The strategy of clipping, which involves constraining the range of weight values, has been recognized for its contribution to maintaining high accuracy after quantization Sakr et al. (2022); Shao et al. (2023). However, naive clipping methods often fall short in effectiveness, while advanced clipping techniques come at a high computational cost which is prohibitive for practical QAT use Li et al. (2019); Jung et al. (2019). To circumvent these limitations, we propose the use of asymmetric clipping solely during the initial phase, prior to the commencement of QAT. Asymmetric clipping at initialization provides a good starting point that significantly contributes to the final overall quantized model accuracy without incurring the prohibitive costs associated with iterative clipping optimization.

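To enable asymmetric clipping for QAT initialization, given input features $X$ cached from a small calibration set, we conduct an automatic search for two optimal clipping values, $\alpha$ and $\beta$ , for each layer of the model. These values aim to minimize the output difference after quantization. Formally, the objective is to optimize the following:

$$\alpha^{*},\beta^{*}=\arg\min_{\alpha,\beta}\left\lVert Q(w_{c})X-wX\right\rVert,\qquad w_{c}=\mathrm{Clip}(w,\alpha,\beta),$$

where $Q$ denotes the quantization function, and $\alpha<0$ and $\beta>0$ are the lower and upper clipping bounds, respectively.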
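The search for the two clipping values can be sketched as a plain grid search. This is a minimal sketch under assumptions: the grid resolution `n_grid`, the maximum shrink factor `max_shrink`, the use of mean-squared output error, and the function names are all illustrative choices, and a single pair of bounds is searched for the whole matrix, whereas a real implementation would likely search per group or per channel. It also assumes the weights contain both negative and positive values.

```python
import numpy as np

def int_quant_dequant(w, lo, hi, bits=2):
    """Uniform asymmetric quantize-dequantize of w over the range [lo, hi]."""
    qmax = 2**bits - 1
    s = max((hi - lo) / qmax, 1e-8)
    q = np.clip(np.round((w - lo) / s), 0, qmax)
    return q * s + lo

def search_asym_clip(w, X, bits=2, n_grid=20, max_shrink=0.5):
    """Grid-search a lower bound alpha and an upper bound beta independently.

    w: (out_features, in_features) weights; X: (n_samples, in_features) inputs.
    Returns (best_error, alpha, beta).
    """
    ref = X @ w.T                        # full-precision layer output
    w_min, w_max = w.min(), w.max()
    best = (np.inf, w_min, w_max)
    for i in range(n_grid):
        alpha = w_min * (1 - max_shrink * i / n_grid)
        for j in range(n_grid):
            beta = w_max * (1 - max_shrink * j / n_grid)
            w_c = np.clip(w, alpha, beta)
            w_q = int_quant_dequant(w_c, alpha, beta, bits)
            err = np.mean((X @ w_q.T - ref) ** 2)
            if err < best[0]:
                best = (err, alpha, beta)
    return best
```

Searching the two bounds independently is what makes the clipping asymmetric: a group with a single large positive outlier can shrink `beta` aggressively while leaving `alpha` almost untouched.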

To demonstrate the efficacy of asymmetric quantization and clipping, we conduct a tensor-wise analysis. We selected a random weight tensor from the LLaMA-2-7B model and focused on a single output channel. As illustrated in Figure 3, our approach to asymmetric quantization and clipping achieves higher fidelity preservation compared to symmetric quantization. A more detailed ablation study on the impact of asymmetric quantization and clipping on model performance is presented in Table 3 in Section 4.4.

Self Distillation with CAKLD

To better counteract the performance degradation caused by precision reduction, we propose to adopt Knowledge Distillation (KD) in QAT, where the full-precision model acts as the teacher and its quantized variant as the student:

$$\mathcal{L}=\mathcal{D}(P_{T}\parallel P_{S}),$$

where $\mathcal{D}$ is a divergence measure between two distributions, and $P_{T}$ and $P_{S}$ denote the output distributions of the full-precision and quantized models, respectively.

The intuition for KD is two-fold. First, learning the token-level probability distributions potentially helps the quantized model better imitate its full-precision counterpart Hinton et al. (2015), thereby regaining the strong downstream performance. Second, owing to the generative nature of LLMs, it is easy to scale up the data size for QAT with the full-precision model.

The divergence $\mathcal{D}$ chosen for distillation plays a crucial role. Agarwal et al. (2023) find that the mode-seeking behavior encouraged by the Reverse KL divergence (i.e., $\mathcal{D}_{KL}(P_{S}\parallel P_{T})$ ) leads to better performance than Forward KL (i.e., $\mathcal{D}_{KL}(P_{T}\parallel P_{S})$ ) on instruction tuning Chung et al. (2022), while Forward KL promotes mode-covering and is superior on general text generation tasks like summarization Narayan et al. (2018). To provide a general recipe for QAT, we seek a way to trade off the mode-seeking and mode-covering behaviors automatically, instead of selecting one manually according to an empirical understanding of the downstream task.

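To this end, we propose a novel Confidence-Aware KL divergence, abbreviated as CAKLD. It blends the Reverse KL and Forward KL with a coefficient $\gamma$ estimated by the averaged token probability, so that the mode-seeking and mode-covering behaviors can be automatically traded off according to the full-precision model’s confidence on the training data:

$$\mathcal{D}_{CAKLD}(P_{T}\parallel P_{S})=\gamma\,\mathcal{D}_{KL}(P_{S}\parallel P_{T})+(1-\gamma)\,\mathcal{D}_{KL}(P_{T}\parallel P_{S}),\qquad\gamma=\mathbb{E}_{(x,y)\sim\mathbb{D}}\left[\frac{1}{|y|}\sum_{i=1}^{|y|}P_{T}(y_{i}\mid x,y_{<i})\right]$$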

Intuitively, when the full-precision model is confident on the training data, CAKLD leans toward the mode-seeking behavior. Otherwise, CAKLD leans toward the mode-covering behavior, as the full-precision model is not certain about the data and modeling only its single mode is suboptimal. Figure 4 visualizes the difference between Reverse KLD, Forward KLD and CAKLD when a Gaussian distribution tries to fit a Gaussian mixture. It is clear that CAKLD manages to trade off mode-seeking and mode-covering behaviors with the coefficient. For a detailed performance comparison and in-depth analysis, please refer to Figure 6 and Appendix A.2.
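The blending of the two divergences can be made concrete with a small NumPy sketch. It is offered as an illustration under assumptions rather than as the authors' implementation: logits are taken to have shape (tokens, vocabulary), the coefficient is passed in as a precomputed scalar, and the function names `teacher_confidence` and `cakld` are hypothetical.

```python
import numpy as np

def log_softmax(z):
    """Numerically stable log-softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def teacher_confidence(teacher_logits, targets):
    """gamma: mean probability the teacher assigns to the observed next tokens."""
    p = np.exp(log_softmax(teacher_logits))
    return float(p[np.arange(len(targets)), targets].mean())

def cakld(teacher_logits, student_logits, gamma):
    """gamma * reverse KL + (1 - gamma) * forward KL, averaged over tokens."""
    lt = log_softmax(teacher_logits)
    ls = log_softmax(student_logits)
    forward = (np.exp(lt) * (lt - ls)).sum(axis=-1).mean()  # KL(P_T || P_S)
    reverse = (np.exp(ls) * (ls - lt)).sum(axis=-1).mean()  # KL(P_S || P_T)
    return gamma * reverse + (1.0 - gamma) * forward
```

A `gamma` close to 1 reduces the loss to the Reverse KL (mode-seeking), while a `gamma` close to 0 reduces it to the Forward KL (mode-covering). Keeping `gamma` as a fixed scalar matches the description in Appendix A.2, where the coefficient is pre-calculated from a few batches before training.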

Experiments

We evaluate BitDistiller on the LLaMA-2 Touvron et al. (2023) families and domain-specific LLMs with sub-4-bit quantization. We have set up comparative experiments to demonstrate the proficiency of our method against existing PTQ and QAT methods. Our findings illustrate that BitDistiller substantially enhances both the general language performance and the accuracy of reasoning tasks.

Following Frantar et al. (2022); Lin et al. (2023), we benchmark LLaMA-2 Touvron et al. (2023) on general language tasks, including language modeling tasks (WikiText-2 Merity et al. (2016)), common sense QA benchmarks (PIQA Bisk et al. (2020), HellaSwag Zellers et al. (2019), WinoGrande Sakaguchi et al. (2021), ARC Clark et al. (2018)) and in-context learning ability (MMLU Hendrycks et al. (2020)) under a few-shot setting. We also consider complex reasoning tasks and evaluate various sizes of domain-specific LLMs, including WizardCoder Luo et al. (2023) on LLM-Humaneval-Benchmarks Chen et al. (2021) with greedy decoding, and MetaMath Yu et al. (2023) on GSM8K Cobbe et al. (2021). To evaluate domain-specific LLMs of smaller sizes, we finetune OpenLLaMA-3B Geng and Liu (2023) with domain-specific datasets.

Baselines

PTQ baselines include vanilla round-to-nearest (RTN), GPTQ Frantar et al. (2022), AWQ Lin et al. (2023), OmniQuant Shao et al. (2023) and QuIP Chee et al. (2023). QAT baselines include LLM-QAT Liu et al. (2023b) and TSLD Kim et al. (2023a). Detailed PTQ and QAT settings can be found in Appendix A.1.

Quantization and Distillation

Training Datasets

We use the instruction-tuning data from Alpaca Taori et al. (2023) and the training set of WikiText-2 for general language tasks. For code understanding and generation, we use Evol-Instruct-Code Rosh (2023). For math reasoning we use MetaMathQA Yu et al. (2023).

Given the instruction prompt $x$ , the output $y\sim p(\cdot|x)$ of a training sequence $s=\{x,y\}$ can come from three different sources: the Ground Truth $y_{g}$ , the Student-generated Output $y_{q}$ and the Teacher-generated Output $y_{p}$ . As suggested by Agarwal et al. (2023); Zhou et al. (2023), we opt to generate the Teacher-generated Output $y_{p}$ using sampling with a temperature of 0.7 Yuan et al. (2023). We conduct an ablation study on the choice of output $y$ in Section 4.4. (See Appendix A.3 for more details on the composition of the training datasets.)
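For readers unfamiliar with temperature sampling, the snippet below shows how a single next token could be drawn from teacher logits at temperature 0.7. It is a generic sketch of the standard technique, not code from the paper; the function name and the one-token-at-a-time interface are assumptions made for illustration.

```python
import numpy as np

def sample_with_temperature(logits, temperature=0.7, rng=None):
    """Draw one token id from softmax(logits / temperature)."""
    if rng is None:
        rng = np.random.default_rng()
    z = np.asarray(logits, dtype=np.float64) / temperature
    z = z - z.max()                      # stabilize the exponentials
    p = np.exp(z) / np.exp(z).sum()
    return int(rng.choice(len(p), p=p))
```

A temperature below 1 sharpens the distribution, so the sampled outputs stay close to what the teacher considers likely while still varying across draws.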

Training Implementation

We leverage DeepSpeed Rasley et al. (2020) and the HuggingFace repository Wolf et al. (2020) to devise a QAT-based KD framework enabling the distillation of models up to 34B. The model is optimized with the AdamW optimizer Loshchilov and Hutter (2017) with zero weight decay. We use a constant learning rate of 8e-6 and set the sequence length to 1024 for the code-related task and 512 for others.

Evaluation on Language Modeling Tasks

Table 1 presents a comparative analysis of BitDistiller’s performance against previous PTQ and QAT methods on general language tasks. BitDistiller surpasses competing methods in terms of WikiText-2 perplexity and MMLU (5-shot) accuracy. Furthermore, BitDistiller demonstrates consistent performance across various QA benchmarks. Notably, in 2-bit weight quantization, BitDistiller substantially increases the average accuracy by +3.54% over LLM-QAT Liu et al. (2023b) and by +12.43% compared to the leading PTQ method Shao et al. (2023). Similar results on LLaMA-2-13B can be found in Table 9 in the Appendix A.4.

Evaluation on Reasoning Tasks

Table 2 demonstrates the superior performance of BitDistiller on reasoning-based benchmarks, including HumanEval and GSM8K, across a range of domain-specific language model families. BitDistiller achieves improvements over other methods in both 3-bit and 2-bit quantization. Especially in 2-bit quantization, while other methods exhibit significant performance drops, BitDistiller maintains a commendable level of accuracy. Specifically, our method outperforms LLM-QAT by a remarkable margin of 24.69%, achieving an accuracy of 61.33% on complex mathematical reasoning tasks. These outcomes bolster the potential for deploying ultra-low-precision inference in practical reasoning tasks without substantially compromising performance.

Ablation Studies

In this ablation study, we evaluate the efficacy of quantization strategies on the LLaMA-2-7B model. Our approach examines the impact of asymmetric quantization and clipping techniques within QAT. We specifically assess the 3-bit and 2-bit quantization levels, reporting our findings in terms of Perplexity (PPL) and MMLU (5-shot).

As demonstrated in Table 3, asymmetric quantization significantly enhances model performance. Notably, under a 2-bit configuration, PPL is reduced from 3.4e2 to 16.94 after training. Furthermore, the application of asymmetric clipping during initialization yields additional performance gains upon training completion. See Appendix A.5 for integration with other PTQ methods.

Data Generation

In our analysis, we meticulously evaluated the logit information of the teacher model by computing the cross-entropy loss (CELoss) for various outputs $y$ . Figure 5(a) illustrates that the data generated by the teacher model $y_{p}$ exhibits low CELoss, indicative of a high-confidence logit distribution, which in turn facilitates better convergence with our proposed CAKLD. The comparative performance results depicted in Figure 5(b) reveal that the use of teacher-generated data in conjunction with CAKLD yields superior outcomes when compared to employing either a fixed dataset or student-generated data $y_{q}$ .

Distillation Objectives

In Figure 6, we demonstrate the effectiveness of our proposed Confidence-Aware KL Divergence (CAKLD) by showcasing performance indicators for reasoning tasks under different objective functions. Our findings show that CAKLD outperforms other objective functions. Though JSD also has a bounded coefficient for interpolation, in practice we observe that it has a weak ability to converge for QAT.

Analysis and Discussion

QuIP enhances 2-bit PTQ for LLMs through incoherence processing. Its subsequent iteration, QuIP# (https://cornell-relaxml.github.io/quip-sharp/), refines this approach by shifting from scalar quantization to vector quantization via lattice codebooks, significantly narrowing the performance gap with 16-bit models. For a consistent comparison, we utilize the BF16 pretrained model and then apply QuIP(#) and BitDistiller. As shown in Table 4, our BitDistiller surpasses QuIP across all benchmarks. In comparison with QuIP#, BitDistiller retains its superior performance in language modeling and programming, while QuIP# outperforms in mathematical reasoning. Being orthogonal to QAT with distillation, PTQ incorporating incoherence processing and vector quantization could potentially serve as an effective initialization method for BitDistiller. We intend to explore whether the integration of QuIP(#) into BitDistiller can further improve the performance of low-bit models.

Comparison with TSLD

Prior work Kim et al. (2023a) introduced Token-Scaled Logit Distillation (TSLD) to alleviate overfitting during QAT. To facilitate a direct and fair comparison between TSLD and our CAKLD, we incorporate TSLD into the BitDistiller framework by replacing CAKLD with TSLD while keeping all other settings unchanged. As depicted in Figure 7, CAKLD not only converges more rapidly but also delivers superior overall performance compared to TSLD.

Effectiveness of Self-Distillation

Table 5 compares 2-bit QAT performance using the LLaMA-2-7B or the larger LLaMA-2-13B as the teacher model. Surprisingly, in practice the larger 13B model did not improve accuracy, hinting that a teacher with the same model architecture as the student may enhance weight alignment and probability distribution matching, thereby improving model effectiveness. Further investigation and deeper analysis are needed in future work to fully understand the implications of different teacher-student sizes and architectures in QAT.

Training Efficiency

Table 6 highlights the efficiency of BitDistiller compared to LLM-QAT Liu et al. (2023b) in quantizing the WizardCoder-7B model. The results demonstrate a dramatic reduction in the total time required for quantization: BitDistiller completes the process in approximately 3 hours on a single A100-80G GPU, as opposed to the hundreds of GPU hours required by LLM-QAT. (Original LLM-QAT uses 64 GPUs. For a direct and fair comparison, we evaluate the GPU hours needed for LLM-QAT on a single GPU.)

Conclusion

BitDistiller leverages QAT with self-distillation to boost sub-4-bit LLM performance. The asymmetric quantization and clipping strategies, coupled with the innovative CAKLD objective, facilitate faster learning and superior performance. BitDistiller outperforms existing PTQ and QAT methods, achieving notable improvements in 3/2-bit settings across diverse language and reasoning tasks. Moreover, BitDistiller is more cost-efficient with fewer data and training resources required.

Limitations

Despite the promising results demonstrated by BitDistiller, it is important to acknowledge certain limitations and areas for future investigation.

A key limitation lies in the empirical nature of our findings. For instance, we do not yet understand the reason behind the counterintuitive outcome where a 7B model outperforms a 13B model as a teacher during the distillation of a 2-bit 7B student model. Sharing the same model architecture may be the reason, but this has not been explained or understood in detail. This highlights the need for a deeper investigation and theoretical exploration to complement our empirical observations.

Looking ahead, we aim to extend BitDistiller to the realm of 1-bit (binary) quantization. While this presents a more challenging scenario, it also offers the potential for significant advancements in efficient LLM inference, as binary weights enable computation with only additions and without multiplications.

Moreover, the current iteration of BitDistiller applies exclusively to scalar quantization. As future work, we plan to explore the adaptation of BitDistiller to vector quantization. Preliminary research in this area indicates that vector quantization could yield substantial benefits, and incorporating it into our framework represents a natural and promising progression of our research.

Acknowledgements

We would like to thank the HPC-AI-Integrated Intelligent Computing center of HKUST(GZ) for providing some of the hardware platforms in this project.

References

Appendix

We evaluate PTQ methods by examining the impact of different calibration dataset distributions. As illustrated in Figure 8, calibrating with domain-specific data significantly enhances task-specific performance. For a fair comparison, all PTQ methods utilize the default calibration datasets for general language tasks and domain-specific calibration datasets Rosh (2023); Yu et al. (2023) for reasoning tasks.

Regarding QAT methods, it should be noted that the use of symmetric quantization in LLM-QAT results in degradation when grouped quantization is applied. To ensure a fair comparison, we replicate the approach with our setup and employ asymmetric uniform quantization.

A.2 Implementation Details and Analysis of Confidence-Aware KLD

We use a straightforward method to pre-calculate the coefficient $\gamma$ . We utilize ten batches of training data to perform forward passes without updating parameters. Subsequently, we obtain the logits from the teacher model to compute the average token probability. In Figure 9, we examine the confidence scores of the teacher model in various tasks during next-word prediction. This analysis reveals that confidence levels can vary in text generation tasks, in contrast to reasoning tasks where each step is critical. Notably, in text generation tasks using LLMs, relying solely on the highest conditional probability through Greedy Search may result in local optima, overlooking more optimal sequences. These observations advocate for a mean-seeking (mode-covering) Kullback-Leibler (KL) approach, encouraging the student model to encompass all potential modes of the teacher, thereby more effectively capturing the teacher’s general generative capabilities. In reasoning tasks, where the teacher model shows high confidence in next-word predictions, the student model should concentrate on learning the predominant mode from the teacher. Our proposed method, CAKLD, is designed to balance these two distinct behaviors effectively.

A.3 Training Datasets Examples

For general language tasks, we mix token sequences from Alpaca and WikiText-2 datasets with a ratio of 2:1. Since WikiText-2 lacks explicit instructions, we utilize the first 128 tokens from the corpus as the input prompt for the teacher model’s generation process, setting the temperature to 0.7. For tasks related to code understanding and generation, we employ the Evol-Instruct-Code dataset. For mathematical reasoning, we utilize MetaMathQA. Examples of the training data utilized are shown in Table 8.

It is essential to highlight that our self-distillation process utilizes only a small portion of the involved datasets.

A.4 Evaluation of General Language Tasks on LLaMA-2-13B

Additional results of the General Language Tasks for LLaMA-2-13B are shown in Table 9.

A.5 Integration with AWQ For Quantization Strategies

As shown in Table 7, we explore the efficacy of combining asymmetric clipping with AWQ during the self-distillation process. Our results indicate that asymmetric clipping significantly enhances robustness in sub-4-bit quantization scenarios. For instance, at the 2-bit quantization level, both INT-Asym and AWQ methods are unable to complete the task. Conversely, Clip-Asym not only succeeds but also achieves a marked improvement in perplexity. It is also noteworthy that while integrating AWQ prior to QAT yields improvements initially, there is no additional performance gain after training. This suggests that a straightforward clipping approach is sufficiently effective for initializing QAT.