LLM-FP4: 4-Bit Floating-Point Quantized Transformers
Shih-yang Liu, Zechun Liu, Xijie Huang, Pingcheng Dong, Kwang-Ting Cheng
Introduction
Since the introduction of the transformer architecture Vaswani et al. (2017), transformers have superseded recurrent neural networks, emerging as the dominant architecture in numerous natural language processing (NLP) tasks Kenton and Toutanova (2019); Lewis et al. (2020). The transformative impact of the transformer has been further propelled by the emergence of models like GPT Brown et al. (2020); OpenAI (2023), catapulting the popularity of this architecture to new heights. Meanwhile, the versatility of transformers extends beyond NLP, encompassing diverse domains such as vision Dosovitskiy et al. ; Touvron et al. (2021), audio Akbari et al. (2021), etc. This trend towards a unified architecture for different modalities represents a groundbreaking development within the realm of deep learning.
However, the advancements in transformer performance are accompanied by a corresponding increase in model size and computational costs Kaplan et al. (2020). This poses significant challenges when attempting to leverage the full potential of transformer models in use cases where memory or computational resources are limited. Despite the extensive research and widespread adoption of transformers, the field of transformer compression remains relatively underexplored. To address this gap, our study focuses on the compression of transformers, especially through floating-point post-training quantization techniques.
Post-training quantization (PTQ) offers the advantage of being simple to use, with minimal fine-tuning requirements Nagel et al. (2020); Cai et al. (2020). Existing PTQ solutions for transformers primarily focus on integer (INT) quantization Liu et al. (2021); Yuan et al. (2022), which can be effective in certain scenarios but often breaks down when the bit width is below 8 bits. On the other hand, floating-point (FP) quantization has gained significant traction as a more flexible alternative, capable of better accommodating various activation and weight distributions. In fact, FP8 has emerged as the default choice in various hardware platforms, including the NVIDIA H100.
Unlike integer (INT) quantization, floating-point (FP) quantization poses the particular challenge of selecting appropriate exponent bits and scale parameters. Improper parameter choices can lead to subpar or divergent quantization results. To tackle this challenge, we introduce a robust recipe for FP quantization, which leverages layer-wise reconstruction to jointly search for optimal exponent bits and maximum values. Compared to previous approaches that utilize gradient updates for exponent bits Kuzmin et al. (2022), our search-based method proves to be more stable and consistently delivers desirable quantization results, which establishes a strong baseline for FP-PTQ.
Furthermore, our investigation uncovers an intriguing pattern of activation distributions in transformers, characterized by high inter-channel variance and low intra-channel variance. Similar patterns have been observed in previous works Xiao et al. (2022); Dettmers et al. (2022), but we argue that this pattern is inherent to transformer architectures and not limited to specific tasks, as we observe it consistently not only in large language models but also in BERT models and even vision transformers. Motivated by these findings, we introduce a novel pre-shifted exponent bias for FP quantization of transformers. Concretely, we leverage the per-channel activation variance computed from calibration data and reparameterize these scales as the exponent bias of the corresponding FP quantized weight vectors. This approach effectively addresses the challenge posed by high inter-channel variance while incurring negligible computational cost.
In summary, we study floating-point post-training quantization (PTQ) for transformer architectures, and the contributions of this paper include:
We propose a search-based framework for determining the optimal exponent bias and maximal quantization value. This method outperforms existing techniques in terms of stability and performance, establishing a strong baseline for floating-point post-training quantization.
We propose a novel technique, pre-shifted exponent bias, which effectively addresses the challenge of high inter-channel variance in the transformer with negligible computational overhead.
Experimental results demonstrate that the proposed method yields the first usable FP4 weight and activation quantized LLaMA-13B model with a mere 5.8-point degradation in zero-shot reasoning tasks against the full-precision model, reducing the gap by 70% compared to the previous SoTA.
We further extend our method to BERT and vision transformers. It surpasses the previous best 4-bit quantized BERT by 7.8 points on the GLUE dataset and achieves 31.4 points higher accuracy compared to the previous SoTA ViT quantization method for 4-bit DeiT-S on the ImageNet dataset.
Related Works
Model quantization can be mainly categorized into quantization-aware training (QAT) and post-training quantization (PTQ), depending on whether it involves additional training for weight fine-tuning or not. Most PTQ studies are primarily focused on convolutional neural networks (CNNs) Nagel et al. (2020); Li et al. (2021); Wu et al. (2020); Cai et al. (2020); Nagel et al. (2019). However, despite the growing popularity of transformer-based models, only a limited number of works Bondarenko et al. (2021); Yuan et al. (2022); Ding et al. (2022) have been conducted to realize PTQ on transformers. Moreover, the existing works primarily focus on vision transformer models and exhibit inferior performance when the bit width is below 8. Therefore, in this work, we delve into the challenges of low-bit PTQ for language transformers.
2.2 Floating-Point Quantization
Floating-point (FP) quantization has emerged as a promising alternative to integer quantization due to its ability to handle long-tail distributions and its increased flexibility Kuzmin et al. (2022). Additionally, modern GPUs such as H100 Micikevicius et al. (2022) now support FP quantization. Nonetheless, minimal research has been conducted on FP quantization. Only Kuzmin et al. (2022) proposes a general FP8 quantization scheme primarily for vision tasks, and Zhang et al. (2023) adopts a mixture of FP and INT formats for quantizing LLMs. In this work, we propose the FPQ baseline as a general guideline for low-bit floating-point PTQ to compress language transformer models.
Preliminaries
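A standard floating-point number is represented as:
$$X_{\mathrm{FP}} = (-1)^{s}\, 2^{\,p-b} \left(1 + \frac{d_1}{2} + \frac{d_2}{2^2} + \dots + \frac{d_m}{2^m}\right) \tag{1}$$
where $s \in \{0, 1\}$ is the sign bit, $d_i \in \{0, 1\}$ is the $i$-th mantissa bit, and $m$ denotes the number of mantissa bits. $p$ is an integer in $[0, 2^e - 1]$, where $e$ denotes the number of exponent bits, and $b$ is an integer exponent bias. A floating-point format with $j$ exponent bits and $k$ mantissa bits is denoted as FP format $\mathrm{E}j\mathrm{M}k$.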
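To make the format notation concrete, the following sketch enumerates the positive magnitudes that a sign/exponent/mantissa representation of this kind produces for a given format and bias. It is only an illustration under stated assumptions: the helper name `fp_values` is hypothetical, and the formula is applied literally, so zero, subnormal numbers, and the NaN/Inf encodings that hardware formats reserve are not modeled.

```python
from itertools import product


def fp_values(e, m, b):
    """List the positive values of a format with e exponent bits, m mantissa bits, bias b.

    Each value is 2**(p - b) * (1 + d_1/2 + ... + d_m/2**m), with the exponent
    code p in [0, 2**e - 1] and every mantissa bit d_i in {0, 1}.
    """
    values = set()
    for p in range(2 ** e):
        for bits in product((0, 1), repeat=m):
            mantissa = 1.0 + sum(d / 2 ** (i + 1) for i, d in enumerate(bits))
            values.add(2.0 ** (p - b) * mantissa)
    return sorted(values)


# E2M1 with bias 1 (an assumed example): four exponent codes times two mantissa codes.
e2m1 = fp_values(e=2, m=1, b=1)
print(e2m1)
```

Changing `b` shifts the whole grid by a power of two, which is why the exponent bias can play a role similar to a scaling factor.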
3.2 Floating-Point Quantization Process
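In integer quantization, the real-valued variable $X_{\mathrm{R}}$ is quantized to an integer $X_{\mathrm{INT}}$ with the following formula:
$$X_{\mathrm{INT}} = \alpha \left\lfloor \mathrm{Clip}\!\left(\frac{X_{\mathrm{R}}}{\alpha},\, Q_{min},\, Q_{max}\right) \right\rceil \tag{2}$$
where $\lfloor \cdot \rceil$ is the rounding function, $X_{\mathrm{R}}$ is the real-valued variable, $\alpha$ represents the full-precision scaling factor, and $Q_{min}$, $Q_{max}$ are the min/max values of the quantization range. Similarly, a real-valued variable $X_{\mathrm{R}}$ can be converted to floating-point $X_{\mathrm{FP}}$ in two steps.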
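(1) Scale and clip. In FP quantization, we also scale and clip the real-valued variable before quantization as:
$$X_{\mathrm{R}}' = \mathrm{Clip}\left(X_{\mathrm{R}},\, Q_{min},\, Q_{max}\right) \tag{3}$$
where the min/max value range of signed floating-point quantization can be calculated from Eq. 1:
$$Q_{max} = -Q_{min} = (2 - 2^{-m})\, 2^{\,2^e - b - 1} \tag{4}$$
Here the integer exponent bias $b$ is another adjustable hyperparameter controlling $Q_{max}$ and $Q_{min}$, which has similar functionality as $\alpha$. Therefore, for simplicity, we reformulate Eq. 3 as:
$$X_{\mathrm{R}}'' = \mathrm{Clip}\left(X_{\mathrm{R}},\, \tilde{Q}_{min},\, \tilde{Q}_{max}\right), \qquad \tilde{Q}_{max} = -\tilde{Q}_{min} = \alpha\, Q_{max} = 2^{-\tilde{b}}\, (2 - 2^{-m})\, 2^{\,2^e - 1}$$
where the real-valued scaling factor $\alpha$ and the integer bias $b$ are merged into a single real-valued exponent bias $\tilde{b}$, giving the combined scaling factor $\tilde{\alpha} = 2^{-\tilde{b}} = \alpha \cdot 2^{-b}$. Solving for the bias in terms of the desired clipping value gives $\tilde{b} = 2^e - \log_2 \tilde{Q}_{max} + \log_2(2 - 2^{-m}) - 1$.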
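(2) Compare and quantize. Different from integer quantization, which simply utilizes the rounding function to convert the real-valued variables to quantized ones, in floating-point quantization there is an additional step of comparing the clipped variable $X_{\mathrm{R}}''$ with the quantization levels before quantizing:
$$X_{\mathrm{FP}} = \tilde{\alpha} \cdot v \cdot \left\lfloor \frac{X_{\mathrm{R}}''}{\tilde{\alpha} \cdot v} \right\rceil$$
where $\tilde{\alpha} = 2^{-\tilde{b}}$ is the tensor-wise scaling factor with real-valued exponent bias $\tilde{b}$, and $v$ is an integer power of 2 selected according to the magnitude of $X_{\mathrm{R}}'' / \tilde{\alpha}$:
$$v = \begin{cases} 2^{\lfloor \log_2 |X_{\mathrm{R}}''| + \tilde{b} \rfloor - m} & \text{if } \lfloor \log_2 |X_{\mathrm{R}}''| + \tilde{b} \rfloor \geq 1 \\ 2^{\,1-m} & \text{otherwise} \end{cases}$$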
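The integer formula and the two floating-point steps can be simulated on scalars as follows. This is a minimal sketch rather than the authors' implementation: the function names are hypothetical, the clipping value `q_max` is passed in directly, and the real-valued bias is derived from it under the assumption that the largest representable magnitude is $(2 - 2^{-m})\,2^{2^e - 1}$ scaled by $2^{-\text{bias}}$.

```python
import math


def int_quantize(x, alpha, q_min, q_max):
    """Integer scheme: scale by 1/alpha, clip to [q_min, q_max], round, rescale."""
    clipped = min(max(x / alpha, q_min), q_max)
    return alpha * round(clipped)  # Python's round() sends exact ties to the even integer


def relaxed_bias(e, m, q_max):
    """Real-valued exponent bias for which the largest EeMm magnitude equals q_max."""
    return 2 ** e - math.log2(q_max) + math.log2(2 - 2 ** -m) - 1


def fp_quantize(x, e, m, q_max):
    """Floating-point scheme: (1) clip to [-q_max, q_max], (2) round to the nearest level."""
    bias = relaxed_bias(e, m, q_max)
    x = min(max(x, -q_max), q_max)                    # step 1: clip
    if x == 0.0:
        return 0.0
    # Step 2: the exponent bin of |x| fixes the spacing between neighbouring levels.
    p = max(math.floor(math.log2(abs(x)) + bias), 1)  # bins below 1 share the subnormal spacing
    step = 2.0 ** (p - m - bias)
    return step * round(x / step)


samples = [0.2, 0.3, 1.3, 2.6, 100.0, -0.7]
print([fp_quantize(v, e=2, m=1, q_max=6.0) for v in samples])
print([int_quantize(v, alpha=6.0 / 7, q_min=-7, q_max=7) for v in samples])
```

With the same clipping value, the integer grid is uniform while the floating-point grid is denser near zero and coarser near the maximum, which is the property that helps with long-tailed distributions.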
3.3 Floating-Point Matrix Multiplication
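With the floating-point quantized variables, the matrix multiplication is formulated as:
$$O_{out}^{i,k} = X_{\mathrm{FP}}^{i,:}\, W_{\mathrm{FP}}^{:,k} = \tilde{\alpha}_X\, \tilde{\alpha}_W^{k}\, \tilde{X}_{\mathrm{FP}}^{i,:}\, \tilde{W}_{\mathrm{FP}}^{:,k} \tag{12}$$
Here, with per-tensor quantized activations and per-channel quantized weights, $X_{\mathrm{FP}}^{i,:}$ denotes the $i$-th row of the activation matrix and $W_{\mathrm{FP}}^{:,k}$ denotes the $k$-th column of the weight matrix, so that each output element $O_{out}^{i,k}$ is the product of the two real-valued scalars $\tilde{\alpha}_X$ and $\tilde{\alpha}_W^{k}$ with the dot product of the corresponding low-bit activation and weight vectors.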
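The reason this matters for efficiency is that a scale shared along the reduction dimension can be pulled out of the dot product. The sketch below checks that identity on a tiny example with one activation scale and one weight scale per output channel; all values and names are hypothetical, and plain Python lists stand in for low-bit tensors.

```python
def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


alpha_x = 0.25                                       # per-tensor activation scale
alpha_w = [0.5, 2.0]                                 # one weight scale per output channel k
x_tilde = [[1.0, -2.0, 3.0], [0.5, 4.0, -1.0]]       # unscaled activations: tokens x inputs
w_tilde = [[1.0, 0.5], [2.0, -1.0], [-3.0, 1.5]]     # unscaled weights: inputs x outputs

# Reference: apply the scales to every element first, then multiply.
x_fp = [[alpha_x * v for v in row] for row in x_tilde]
w_fp = [[alpha_w[k] * v for k, v in enumerate(row)] for row in w_tilde]
slow = matmul(x_fp, w_fp)

# Efficient form: multiply the unscaled operands, then scale each output element once.
acc = matmul(x_tilde, w_tilde)
fast = [[alpha_x * alpha_w[k] * v for k, v in enumerate(row)] for row in acc]

print(slow)
print(fast)
```

A scale that varied along the input dimension (for example one per activation channel) could not be moved outside the sum in this way.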
Method
In this section, we begin by introducing our joint format and max value search, which establishes our strong baseline and already achieves state-of-the-art results at 8-bit and 6-bit quantization. Then we present an efficient pre-shifted exponent bias to tackle the catastrophically high inter-channel activation variance in transformer models and push the quantization limit to 4-bit.
The objective of post-training quantization is to minimize the perturbation ($\delta X_{\mathrm{R}} = X_{\mathrm{FP}} - X_{\mathrm{R}}$) introduced by quantization to the pre-trained real-valued network:
$$\min\; \mathbb{E}\left[\mathcal{L}(X_{\mathrm{R}} + \delta X_{\mathrm{R}}) - \mathcal{L}(X_{\mathrm{R}})\right] \tag{13}$$
In this study, we adopt the setting presented in Choukroun et al. (2019); Wu et al. (2020), which assumes a positive correlation between the change in the intermediate output of the quantized model and Eq. 13. Therefore, minimizing the distance between the intermediate output of the quantized layer ($\hat{O}$) and the output of the original layer ($O$) leads to minimizing Eq. 13. Hence, the objective loss metric is formulated as:
$$\min\; (\hat{O} - O)^2 \tag{14}$$
which is used to search for the optimal FP quantization function in the following proposed framework.
The challenges in FP quantization arise from its sensitivity to the quantization format and clipping range. Undesirable format selection will result in a catastrophic error rate. In addition, we observe that the optimal clipping range varies depending on the format used. Previous work Kuzmin et al. (2022) on floating-point (FP) quantization-aware training (QAT) proposed to learn both the FP format and maximum value with gradients. However, we find that this method suffers from over-fitting in PTQ, with accuracy even worse than the naïve MinMax method; details can be found in Appendix E. Instead, we propose a search-based algorithm that jointly determines the optimal format and its associated clipping range to address this challenge.
The search is conducted layer by layer, with the objective of minimizing Eq. 14. The output of the matrix multiplication corresponding to each sub-module is denoted as $O = XY$, where $Y$ can be either a weight tensor $W$ or another activation tensor.
The search process is outlined in Alg. 1. We search the quantization scheme in all the matrix multiplication layers in parallel following Yuan et al. (2022); Bai et al. (2022). The algorithm can be divided into two parts. (1) Do forward propagation to store the intermediate raw output of each layer $l$. (2) Iteratively update the optimal format and biases for each layer for three rounds by minimizing the reconstruction metric (Eq. 14). We name this search-based framework the Floating Point Quantization Baseline (FPQ baseline), and it can already achieve state-of-the-art results on both 8-bit and 6-bit settings.
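The idea of jointly searching the format and the clipping value against a reconstruction loss can be sketched for a single layer as below. This is a simplified illustration, not Alg. 1 itself: it quantizes only the layer inputs, runs a single exhaustive pass instead of three alternating rounds over activations and weights, and the candidate range `gamma_1`, `gamma_2` and grid size `n` are assumed hyperparameters. All function names are hypothetical.

```python
import math
import random


def fp_quantize(x, e, m, q_max):
    """Round x to the nearest level of format EeMm whose largest magnitude is q_max."""
    bias = 2 ** e - math.log2(q_max) + math.log2(2 - 2 ** -m) - 1
    x = min(max(x, -q_max), q_max)
    if x == 0.0:
        return 0.0
    p = max(math.floor(math.log2(abs(x)) + bias), 1)
    step = 2.0 ** (p - m - bias)
    return step * round(x / step)


def layer_output(x, w):
    """Output of one linear layer for a single input vector x; w holds one row per output."""
    return [sum(a * b for a, b in zip(x, row)) for row in w]


def reconstruction_loss(xs, w, e, m, q_max):
    """Squared error between the raw outputs and the outputs computed from quantized inputs."""
    loss = 0.0
    for x in xs:
        x_q = [fp_quantize(v, e, m, q_max) for v in x]
        loss += sum((o_q - o) ** 2 for o_q, o in zip(layer_output(x_q, w), layer_output(x, w)))
    return loss


def candidate_maxes(abs_max, gamma_1, gamma_2, n):
    """n evenly spaced clipping values between gamma_1 * abs_max and gamma_2 * abs_max."""
    return [abs_max * (gamma_1 + (gamma_2 - gamma_1) * i / (n - 1)) for i in range(n)]


def search_format_and_max(xs, w, bits=4, gamma_1=0.1, gamma_2=1.0, n=25):
    """Try every (exponent bits, clipping value) pair and keep the one with the lowest loss."""
    abs_max = max(abs(v) for x in xs for v in x)
    best = None
    for e in range(1, bits):                 # one bit is the sign, so e + m = bits - 1
        m = bits - 1 - e
        for q_max in candidate_maxes(abs_max, gamma_1, gamma_2, n):
            loss = reconstruction_loss(xs, w, e, m, q_max)
            if best is None or loss < best[0]:
                best = (loss, e, m, q_max)
    return best


rng = random.Random(0)
xs = [[rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(16)]   # calibration inputs
w = [[rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(4)]     # fixed layer weights
best_loss, best_e, best_m, best_max = search_format_and_max(xs, w)
print(f"best format E{best_e}M{best_m}, max value {best_max:.3f}, loss {best_loss:.4f}")
```

Because the loss is evaluated on layer outputs rather than on the tensor being quantized, the selected clipping value may sit well below the observed maximum when rare large values contribute little to the output.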
4.2 Pre-Shifted Exponent Bias
In transformer architectures, we observed an intriguing phenomenon of high inter-channel variance. As shown in Fig. 2, the magnitudes of values within the same channel are close to each other but exhibit significant differences across different channels. This phenomenon is not only observed in language models (i.e., LLaMA and BERT) but is also significant in vision transformer models. Since outlier channels are often orders of magnitude bigger than the rest, they will dominate the quantization precision of the quantized tensor, resulting in less representation capacity for those channels with smaller magnitudes Xiao et al. (2022). This makes tensor-wise or token-wise scaling factors insufficient for accurate activation quantization.
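The pattern described above can be quantified with two simple statistics: the variance of the per-channel means (inter-channel) and the average of the per-channel variances (intra-channel). The sketch below computes both on synthetic data; the channel magnitudes and the 5% within-channel noise are made-up values chosen only to mimic the described pattern, not measurements from any model.

```python
import random
import statistics

rng = random.Random(0)
channel_scales = [0.1, 1.0, 50.0, 0.5]          # hypothetical per-channel magnitudes
# Synthetic activations (tokens x channels): values in a channel stay close to its scale.
acts = [[s * (1.0 + 0.05 * rng.gauss(0.0, 1.0)) for s in channel_scales] for _ in range(256)]

columns = list(zip(*acts))                       # one tuple of values per channel
channel_means = [statistics.fmean(c) for c in columns]
inter = statistics.pvariance(channel_means)      # spread between channels
intra = statistics.fmean(statistics.pvariance(c) for c in columns)  # spread within channels

print(f"inter-channel variance: {inter:.2f}")
print(f"intra-channel variance: {intra:.2f}")
```

When the first number dwarfs the second, a single scale per tensor or per token is set by the outlier channel and leaves few quantization levels for the remaining channels.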
However, applying per-channel scaling factors for activations poses challenges to efficient matrix multiplication, because the scaling factor is not a shared constant along the multiplication direction and cannot be factored out as in Eq. 12. To address this challenge, we introduce pre-shifted exponent bias, which allows us to calculate per-channel scaling factors from activations. These scaling factors are then re-parameterized as the exponent biases of the corresponding weights. This method effectively handles high inter-channel variance while maintaining nearly identical efficiency to per-tensor quantization.
Note that the bias is constrained to integers within $[0, 2^e - 1]$, compatible with the standard floating-point number calculation. Nevertheless, adding different biases for each channel during inference may still cause some extra hardware operations. Thus, we re-parameterize the per-channel activation bias into a weight tensor and pre-compute the weights using the calibration set. This way, the exponent bias shifting only happens in the calibration stage. Then, an element in channel $j$ of the activation tensor becomes:
and the corresponding weight element in row $j$ of the weight tensor becomes:
As a result, the efficient matrix multiplication in Eq. 12 is reformulated as:
Combining the pre-shifted exponent bias method with the joint format and max-value search framework (FPQ baseline), we name our method FPQ, short for Floating Point Quantization.
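The algebra behind folding per-channel activation shifts into the weights can be checked on a small example. The sketch below is an illustration under assumptions, not the paper's procedure: the rule used to pick the integer shifts (rounded log2 ratio to the largest channel, clipped to $[0, 2^e - 1]$) is hypothetical, quantization itself is omitted, and only the identity that makes the pre-computation possible is shown.

```python
import math


def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


e = 2                                                # exponent bits, so shifts lie in [0, 2**e - 1]
acts = [[0.11, 8.0, -0.9], [-0.09, -7.5, 1.1]]       # calibration activations: tokens x channels
weights = [[1.0, -2.0], [0.5, 0.25], [3.0, 1.0]]     # channels x outputs

# Integer shift per channel, measured against the largest channel (illustrative rule).
channel_max = [max(abs(row[j]) for row in acts) for j in range(len(acts[0]))]
shifts = [min(max(round(math.log2(max(channel_max) / c)), 0), 2 ** e - 1) for c in channel_max]

# Shifted activations have similar magnitudes, so one tensor-wise scale can serve all channels.
acts_shifted = [[v * 2.0 ** s for v, s in zip(row, shifts)] for row in acts]
# The inverse shift is folded into the matching weight rows once, before inference.
weights_shifted = [[v * 2.0 ** -s for v in row] for row, s in zip(weights, shifts)]

reference = matmul(acts, weights)
folded = matmul(acts_shifted, weights_shifted)
print(shifts)
print(reference)
print(folded)
```

Since each shift is a power of two, applying it only changes an exponent, and because the shifts depend on calibration statistics alone, the shifted weights can be stored ahead of time.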
Experiments
To validate the effectiveness of the proposed method, we conduct experiments on LLaMA Touvron et al. (2023) and BERT Devlin et al. (2019) models in Sections 5.2.1 and 5.2.2. Further, in Section 5.2.3 we show that our method also generalizes well to vision transformer architectures. We present ablation studies on the calibration size and search range in Section 5.3, and analyze the hardware costs of implementing FP operators in Section 5.4.
We adopt per-tensor quantization for activations and per-channel quantization for weights. We employ layer reconstruction following the settings of Yuan et al. (2022); Nagel et al. (2020), and parallel quantization based on the approach outlined in Bai et al. (2022); Yuan et al. (2022). A more detailed discussion regarding our implementation decisions can be found in Appendix F. For LLaMA models, we quantize all the weight and activation tensors in fully-connected layers for a fair comparison with previous work Xiao et al. (2022); Liu et al. (2023). For BERT and ViT models, both fully-connected layers and activation-activation multiplication tensors in the self-attention module are quantized. Note that for FPQ on BERT Devlin et al. (2019) and ViT models, the reconstruction metric Eq. 14 is substituted with a Hessian approximation loss metric. This substitution is further detailed in Appendix A.
5.2 Main Results
We evaluate the effectiveness of FPQ for LLaMA-7B/LLaMA-13B Touvron et al. (2023) on common sense zero-shot reasoning tasks. For the calibration data, we sample 32 random segments of 2048 tokens each from the C4 Raffel et al. (2020) dataset, following the setting of GPTQ Frantar et al. (2023). The data preprocessing and score calculation are based on the EleutherAI evaluation harness (https://github.com/EleutherAI/lm-evaluation-harness). In Table 1, we compare FPQ to the floating-point PTQ baselines and to state-of-the-art PTQ and QAT methods, including SmoothQuant Xiao et al. (2022), GPTQ Frantar et al. (2023), and LLM-QAT Liu et al. (2023).
In general, all methods, except for the naïve MinMax INT Quantization, produce comparable outcomes in the 8-bit setting on both LLaMA-7B and LLaMA-13B. Additionally, we observe that the naïve MinMax FP Quantization achieves nearly lossless results and even surpasses the state-of-the-art integer post-training quantization method, SmoothQuant (Xiao et al., 2022), which indicates that floating-point quantization naturally has a strong capability in handling the distributions in transformers. However, both MinMax FP Quant and FPQ baseline fail when pushing the quantization precision to ultra-low 4/4/4 bit setting, with and accuracy degradation on LLaMA-7B, respectively. In this extreme case, the previous state-of-the-art PTQ and QAT methods, SmoothQuant Xiao et al. (2022) and LLM-QAT Liu et al. (2023) also suffer severe accuracy downgrade. In comparison, FPQ demonstrates a strong capability of handling extra-low bit settings and achieves only / accuracy drop on LLaMA-7B/13B with 4/4/4 bit-width, outperforming SmoothQuant Xiao et al. (2022) by a large margin, yet with less bit-width and smaller calibration size. Moreover, FPQ even achieves 5.3% accuracy improvements compared to LLM-QAT Liu et al. (2023) in the 4/4/4 setting and 1.5% over GPTQ Frantar et al. (2023) in the 4/4/16 configuration on LLaMA-7B.
For practitioners, a crucial consideration is determining the appropriate quantization method for each bit-width. Therefore, based on our findings, we offer two recommendations that balance the trade-off between accuracy and search/optimization efficiency. First, since the difference between MinMax FP Quant and the rest of the methods is marginal for the 8/8/8 setting, we recommend simply using the MinMax FP Quant method for the 8/8/8 setting, as the MinMax method does not involve a search process. However, for more demanding scenarios, especially with activation quantization to 4 bits, we recommend employing FPQ to minimize accuracy degradation with negligible inference overhead.
5.2.2 BERT Model
We evaluate the proposed quantization techniques for the BERT model on GLUE tasks Wang et al. (2019). Full-precision BERT-base models fine-tuned on GLUE datasets are obtained from the Huggingface public repository (https://huggingface.co/textattack/bert-base-uncased-{TASK_NAME}). We randomly sample 128 data points from the training set as the calibration set. In Table 2, FPQ demonstrates remarkable performance, achieving absolute average accuracy improvements of compared to BrecQ Li et al. (2021) and over QDrop Wei et al. (2022) with 4/4/4 bit setting. Further, with 4-bit weight and 8-bit activation, MREM-S/MREM-P Bai et al. (2022) present a 1.6/1.5% accuracy gap to the full-precision model with 4096 calibration data points, while FPQ achieves almost no accuracy loss with only 128 calibration data points.
5.2.3 Generalizability on Vision Transformer
Based on our finding that vision transformers exhibit the same activation distribution pattern as language transformers, characterized by high inter-channel variance and low intra-channel variance (Fig. 2), we extend our proposed methods to ViT and compare FPQ with floating-point PTQ baselines and state-of-the-art PTQ methods for ViT on the ImageNet classification task. Table 3 shows that the findings on ViT are consistent with those on language models: previous state-of-the-art integer-based methods struggle to maintain reasonable accuracy when quantizing the transformer to lower bit widths. In comparison, the proposed FPQ outperforms both PTQ4ViT and APQ-ViT at 6 bits, and also achieves 40.9% and 31.5% absolute accuracy improvements over PTQ4ViT and APQ-ViT on DeiT-S in the 4-bit configuration.
5.3 Ablation Study
In this section, we first compare the influence of different calibration sizes on FPQ. We vary the calibration size in and test on MNLI, QQP, and CoLA. Table 4 shows that the evaluation on MNLI and QQP is more robust to different settings, and the variance is more significant on CoLA. We observe that FPQ performs well with a calibration set size of 128 data points. However, we also find that it remains robust and maintains competitive accuracy even with limited access to calibration data, such as when using as few as 32 data points.
We investigate the robustness of FPQ to different search ranges . Table 5 presents the results of FPQ using three sets of : , on MNLI, QQP, and CoLA. It is observed that no single search range outperforms the others consistently across all tasks. For instance, the search range performs better than on MNLI and QQP, but slightly worse on CoLA in the 4-bit configuration. Overall, FPQ exhibits robustness to various and , as long as the search range is not overly aggressive.
5.4 Hardware Cost
We further examine the hardware utilization of low-bit INT, FP, and mixed-format FP multiplication operators, including adder, multiplier, and multiply-accumulate (MAC) units, in terms of hardware area. Mixed-format FP refers to the multiplication of floating-point numbers with different formats, e.g., E2M1 multiplied by E1M2. We implemented the MAC operator in Verilog HDL and used Cadence Genus to obtain the synthesized area under TSMC 40nm technology and a 0.5GHz clock frequency.
Table 6 illustrates the hardware cost of the INT and FP operators, with the multiplier being the primary cost for INT and the adder for FP. Notably, the disparity between FP4 and INT4 adders is small, while INT has twice the hardware cost for the multiplier. Moreover, the mixed-format FP4 operator has a hardware area comparable to that of the standard FP4 operator. These findings indicate that the proposed FPQ approach imposes negligible overhead in terms of hardware implementation when compared to the standard FP operators, and that the hardware cost for FP is comparable with INT.
Conclusion
This paper presents the first successful demonstration of 4-bit floating-point post-training quantization for weights, activations, and embeddings in natural language transformer architectures, including both large language models and BERT models. We also extend our method to vision transformers and observe its robust generalization ability. Our approach involves a practical search-based technique which establishes a strong baseline and achieves state-of-the-art results for 6-bit and 8-bit quantization. Furthermore, we address the challenge of high inter-channel variance in transformers by proposing pre-shifted exponent bias, which proves highly effective in achieving accurate 4-bit quantization.
Acknowledgement
This research is supported by National Natural Science Foundation of China/ HKSAR Research Grants Council Joint Research Scheme under Grant , and Foshan HKUST Projects under Grant .
Limitations
Our experiments were conducted on publicly available datasets with finite sentence lengths, and the generalizability of our method to extremely long sequences or streaming data has not been verified and may require further investigation. In addition, it remains to be seen how our proposed method can generalize to other domains beyond language and vision, such as audio. It would also be interesting to see the applicability of our method to generative tasks and other applications.
References
Appendix A Hessian-Based Loss Metric
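The objective of post-training quantization is to minimize the perturbation ($\delta X_{\mathrm{R}} = X_{\mathrm{FP}} - X_{\mathrm{R}}$) introduced by quantization to the pre-trained real-valued network:
$$\min\; \mathbb{E}\left[\mathcal{L}(X_{\mathrm{R}} + \delta X_{\mathrm{R}}) - \mathcal{L}(X_{\mathrm{R}})\right]$$
Following the Taylor series expansion, we have
$$\mathbb{E}\left[\mathcal{L}(X_{\mathrm{R}} + \delta X_{\mathrm{R}}) - \mathcal{L}(X_{\mathrm{R}})\right] \approx \delta X_{\mathrm{R}}^{T}\, \bar{g}^{(X_{\mathrm{R}})} + \frac{1}{2}\, \delta X_{\mathrm{R}}^{T}\, \bar{H}^{(X_{\mathrm{R}})}\, \delta X_{\mathrm{R}}$$
Here, $\bar{g}^{(X_{\mathrm{R}})}$ is the gradient and $\bar{H}^{(X_{\mathrm{R}})}$ is the Hessian matrix. Since the pre-trained model is well-converged, we can assume that $\bar{g}^{(X_{\mathrm{R}})}$ has a near-zero value in every element, and thus the term $\delta X_{\mathrm{R}}^{T}\, \bar{g}^{(X_{\mathrm{R}})}$ can be neglected.
The Hessian matrix $\bar{H}^{(X_{\mathrm{R}})}$ is computed as:
$$\bar{H}^{(X_{\mathrm{R}})} = J_{O}^{T}(X_{\mathrm{R}})\, \bar{H}^{(O)}\, J_{O}(X_{\mathrm{R}})$$
where $J_{O}(X_{\mathrm{R}})$ denotes the Jacobian matrix of the layer output $O$ w.r.t. $X_{\mathrm{R}}$, and $\bar{H}^{(O)}$ is the Hessian matrix w.r.t. $O$. We then substitute the above equation back into the second-order term of the Taylor expansion:
$$\delta X_{\mathrm{R}}^{T}\, \bar{H}^{(X_{\mathrm{R}})}\, \delta X_{\mathrm{R}} = \left(J_{O}(X_{\mathrm{R}})\, \delta X_{\mathrm{R}}\right)^{T} \bar{H}^{(O)} \left(J_{O}(X_{\mathrm{R}})\, \delta X_{\mathrm{R}}\right) \approx (\hat{O} - O)^{T}\, \bar{H}^{(O)}\, (\hat{O} - O)$$
Here $\hat{O}$ is the intermediate output of the quantized layer and $O$ is the original layer output. Note that under the assumption that $\delta X_{\mathrm{R}}$ is relatively small Li et al. (2021), we can approximate $(\hat{O} - O)$ as $J_{O}(X_{\mathrm{R}})\, \delta X_{\mathrm{R}}$ using a first-order Taylor expansion.
Nevertheless, the calculation of $\bar{H}^{(O)}$ is still burdensome. Therefore, we use the diagonal entries of the Fisher Information Matrix of $O$ to substitute $\bar{H}^{(O)}$ following Li et al. (2021); Yuan et al. (2022), and the new Hessian-based metric becomes:
$$\min\; \mathbb{E}\left[(\hat{O} - O)^{T}\, \mathrm{diag}\!\left(\left(\frac{\partial \mathcal{L}}{\partial O_1}\right)^{2}, \dots, \left(\frac{\partial \mathcal{L}}{\partial O_n}\right)^{2}\right) (\hat{O} - O)\right]$$
Here, each entry of $O$ is assumed to be independent and $n$ denotes the total number of elements in $O$. In this study, this Hessian-based metric is used as the reconstruction metric to search for the optimal FP quantization function for both the weight and activation when performing layer-wise reconstruction in BERT and Vision Transformer models.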
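For a single calibration sample, a diagonal Fisher weighting of this kind reduces to a sum of squared output errors, each scaled by the squared gradient of the task loss with respect to that output element. The sketch below assumes the gradients have already been obtained from a backward pass; the function name and the numbers are hypothetical.

```python
def fisher_weighted_loss(o_quant, o_ref, grad_o):
    """Sum over output elements of (quantized - original)**2 * (dL/dO)**2."""
    return sum(((q - r) * g) ** 2 for q, r, g in zip(o_quant, o_ref, grad_o))


def plain_squared_error(o_quant, o_ref):
    """Unweighted counterpart, for comparison."""
    return sum((q - r) ** 2 for q, r in zip(o_quant, o_ref))


o_ref = [1.5, 2.0, -0.5]       # original layer outputs
o_quant = [1.0, 2.0, -1.0]     # outputs of the quantized layer
grad_o = [2.0, 10.0, 0.0]      # dL/dO for each output element

weighted = fisher_weighted_loss(o_quant, o_ref, grad_o)
unweighted = plain_squared_error(o_quant, o_ref)
print(weighted, unweighted)
```

The third element has the same error as the first but a zero gradient, so it contributes nothing to the weighted metric: errors only count where the task loss is sensitive to them.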
Appendix B Quantization Error of Different Floating-Point Formats
Figure 4 compares the quantization error of different formats in 8-bit quantization, including , , , , and . We apply these formats to different BERT modules in the first, fifth, and last layers. The figures demonstrate that the optimal FP formats differs depending on the specific module that we are quantizing.
Appendix C Inter-Channel Variance Visualization
Figures 5 and 6 depict the output of different fully-connected layers in BERT for the MNLI task, DeiT-S for the ImageNet-1K task, and LLaMA-7B for the zero-shot reasoning task. The visualizations reveal a noticeable inter-channel variance present in both language and vision transformers.
Appendix D Efficient Matrix Multiplication
Figure 7 displays a comprehensive list of all the granularity options that allow for efficient matrix multiplication. While per-token quantization theoretically provides greater precision in terms of quantization granularity, the accuracy gains achieved through this method are minimal and do not justify the additional computational overhead required. As a result, we have opted to use per-tensor quantization when quantizing activations.
Appendix E Learning Format and Maximum Value
We compare the previous gradient-based method Kuzmin et al. (2022) with the proposed search-based method for finding the optimal format and maximum value. On DeiT-S, the learnable method only achieves 74.38% accuracy for an 8-bit quantized model on ImageNet. In contrast, FPQ attains an almost lossless result of 79.88%. We analyze the gradients for the number of exponent bits derived in Kuzmin et al. (2022) and observe that each time the exponent bits change, the gradients experience exponential variations, leading to high instability. Based on this observation, we assert that employing a search-based method to determine the optimal formats is crucial in post-training quantization (PTQ).
Appendix F Reconstruction Choices
Previous work on integer post-training quantization breaks the target model down into sub-modules and reconstructs them separately Nagel et al. (2020); Li et al. (2021); Bai et al. (2022); Yuan et al. (2022). This addresses the problem of over-fitting, given that only a limited amount of unlabeled calibration data is available. In this study, we find that layer-wise reconstruction and parallel quantization work best for floating-point PTQ:
Layer Reconstruction: Recent research Li et al. (2021); Bai et al. (2022) suggests increasing the reconstruction granularity from layer reconstruction Nagel et al. (2020) to block reconstruction Li et al. (2021) or even larger granularity Lee et al. (2023). This is achieved by jointly optimizing all the linear layers or matrix multiplication components within each module to prevent the propagation of reconstruction errors among the layers. Despite this, we have observed that increasing the reconstruction granularity does not improve the accuracy of the FPQ baseline and sometimes even leads to worse results. Therefore, we choose layer reconstruction.
Parallel Quantization: Sequential quantization is the most commonly used approach Wu et al. (2020); Nagel et al. (2020); Li et al. (2021), where modules are quantized consecutively based on their sequential order, and the input for the module currently being calibrated is generated using all the previously quantized modules. However, some recent works Yuan et al. (2022); Bai et al. (2022) proposed a new parallel quantization framework. This framework uses the raw output of the full-precision modules as input and makes the calibration of each module independent of the others. In this work, we use parallel quantization, as it yields better results than its sequential counterpart.
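The difference between the two calibration schemes comes down to which activations each module receives as input. The toy sketch below makes that explicit with scalar "modules"; the function name and the stand-in layers are hypothetical, and rounding is used only as a placeholder for a quantized module.

```python
def calibration_inputs(layers, quantized_layers, x, parallel=True):
    """Return the input that each module sees while it is being calibrated."""
    inputs = []
    h_fp, h_q = x, x
    for layer, q_layer in zip(layers, quantized_layers):
        # Parallel: feed the raw full-precision activation.
        # Sequential: feed the activation produced by the already-quantized modules.
        inputs.append(h_fp if parallel else h_q)
        h_fp = layer(h_fp)
        h_q = q_layer(h_q)
    return inputs


# Two toy modules and crude "quantized" versions that round their outputs.
layers = [lambda v: 2 * v, lambda v: v + 1]
quantized = [lambda v: float(round(2 * v)), lambda v: float(round(v + 1))]

par = calibration_inputs(layers, quantized, 0.3, parallel=True)
seq = calibration_inputs(layers, quantized, 0.3, parallel=False)
print(par)
print(seq)
```

In the parallel case every module can be calibrated independently from stored full-precision activations, whereas the sequential case must quantize modules in order because each input depends on the modules before it.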