OmniQuant: Omnidirectionally Calibrated Quantization for Large Language Models

Wenqi Shao, Mengzhao Chen, Zhaoyang Zhang, Peng Xu, Lirui Zhao, Zhiqian Li, Kaipeng Zhang, Peng Gao, Yu Qiao, Ping Luo

Introduction

Large language models (LLMs) such as GPT-4 (Bubeck et al., 2023) and LLaMA (Touvron et al., 2023a) have demonstrated impressive performance across various natural language benchmarks (Hendrycks et al., 2020; Bisk et al., 2020; Zellers et al., 2019). Furthermore, the language understanding capabilities inherent in LLMs can be successfully transferred into multimodal models (Mu et al., 2023; Xu et al., 2023; Zhang et al., 2023). LLMs can thus be regarded as precursors to artificial general intelligence (Bubeck et al., 2023). However, the considerable computational and memory requirements of LLMs pose substantial challenges. For instance, the GPT-3 model (Brown et al., 2020) requires 350 GB of memory to load its parameters in FP16 format, which corresponds to at least five A100-80G GPUs for inference. This significant demand for computational resources and the associated communication overhead impede the practical deployment of LLMs in real-world applications.
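The GPT-3 figure above can be checked with a few lines of arithmetic. The sketch below assumes the commonly cited size of 175 billion parameters (the parameter count is not stated in the text) and counts only the weights, ignoring activations and the KV cache; the function names are illustrative.

```python
import math

def fp16_weight_memory_gb(num_params: float, bytes_per_param: int = 2) -> float:
    """Memory needed just to hold the parameters, in decimal gigabytes."""
    return num_params * bytes_per_param / 1e9

def min_gpus(memory_gb: float, gpu_memory_gb: float) -> int:
    """Smallest number of GPUs whose combined memory covers the weights."""
    return math.ceil(memory_gb / gpu_memory_gb)

# Assumption: GPT-3 has 175e9 parameters (not stated in the text above).
gpt3_params = 175e9
weights_gb = fp16_weight_memory_gb(gpt3_params)
gpus_needed = min_gpus(weights_gb, gpu_memory_gb=80)

print(weights_gb)   # 350.0
print(gpus_needed)  # 5
```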

Quantization has been shown to be a promising way to mitigate both the computational and memory overhead of LLMs. In general, it comes in two types: post-training quantization (PTQ) and quantization-aware training (QAT). Although QAT can lead to more competitive accuracy than PTQ, it is not practical due to its high training cost, because the whole model is trained with awareness of the quantization process. As a result, PTQ is commonly used in existing quantization methods for LLMs. For example, many PTQ methods (Frantar et al., 2022; Lin et al., 2023; Dettmers et al., 2023b; Lee et al., 2023) reduce memory consumption by weight-only quantization, which quantizes the weights while keeping activations in full precision. To further reduce the computational overhead, another line of work (Xiao et al., 2023; Wei et al., 2022; Yuan et al., 2023; Wei et al., 2023) employs weight-activation quantization, which quantizes both weights and activations into low-bit values so that matrix multiplication can be executed in low-bit arithmetic.

Existing quantization methods have demonstrated significant achievements in various scenarios, including W4A16 (i.e., 4-bit weight and 16-bit activation) weight-only quantization (Lin et al., 2023; Dettmers et al., 2023b; Lee et al., 2023), as well as W8A8 weight-activation quantization (Wei et al., 2023). However, they usually exhibit significant performance degradation when confronted with low-bit quantization, such as W2A16 and W4A4, as illustrated in Figure 1 (b & c). This performance shortfall in low-bit quantization can be attributed to the fact that these methods (Frantar et al., 2022; Lin et al., 2023; Wei et al., 2023) primarily rely on handcrafted quantization parameters such as migration strength (Xiao et al., 2023) and scaling parameters (Wei et al., 2023), which often leads to lower performance. Although quantization-aware training (QAT) (Liu et al., 2023a) is effective in determining the optimal quantization configurations, it introduces substantial overhead in both training time and data requirements. It is thus hard to quantize LLMs efficiently with QAT-based techniques such as LLM-QAT (Liu et al., 2023a). For instance, GPTQ (Frantar et al., 2022), a PTQ approach, can complete the quantization of LLaMA-13B in an hour using 128 samples on a single A100 GPU, while LLM-QAT (Liu et al., 2023a) requires 100k samples and hundreds of GPU hours. This leads us to a central question: can we attain the performance of QAT while maintaining the time and data efficiency of PTQ?

This paper introduces a novel quantization technique, OmniQuant, which effectively addresses the above question. OmniQuant achieves state-of-the-art performance across various quantization scenarios, particularly in low-bit settings, while preserving the time and data efficiency of PTQ, as illustrated in Figure 1. Unlike QAT (Liu et al., 2023a), which involves cumbersome weight optimization, OmniQuant freezes the original full-precision weights and only incorporates a few learnable quantization parameters. As shown in Figure 2, OmniQuant consists of two key components that incorporate different types of learnable quantization parameters: Learnable Weight Clipping (LWC) and Learnable Equivalent Transformation (LET). Specifically, LWC modulates the extreme values of weights by optimizing the clipping threshold. Meanwhile, LET tackles activation outliers by learning mathematically equivalent transformations in a transformer encoder.

Instead of jointly optimizing all parameters across the LLM, OmniQuant sequentially quantizes the parameters of one layer before moving on to the next under a block-wise quantization error minimization framework. In this way, OmniQuant can be optimized efficiently using a simple Stochastic Gradient Descent (SGD) algorithm. Thanks to the differentiable optimization, LWC and LET can be seamlessly integrated into the quantization. We find that LWC can mitigate the difficulty in quantizing weights and LET further shifts the challenge of quantization from activations to weights, making OmniQuant a versatile quantization framework for both weight-only and weight-activation quantization. Notably, OmniQuant introduces no extra computation or parameters for the quantized model because the clipping threshold in LWC and the equivalent factors in LET can be fused into the quantized weights.

As depicted in Figure 2, OmniQuant is easy to implement even with limited resources. Taking the LLaMA-2 model family (7B-70B) as an example, all models can be quantized on a single A100-40G GPU using only 128 training samples. The training time ranges from 1 to 16 hours, depending on the size of the quantized model, which ranges from 7B to 70B. Owing to the seamless integration of LWC and LET achieved by differentiable optimization, OmniQuant exhibits superior performance compared to prior PTQ-based methods in various quantization settings. For example, when LLaMA-13B is quantized into W2A16, OmniQuant achieves a perplexity of 13.21, while GPTQ incurs a significant increase in perplexity to 3832, as demonstrated in Figure 1. A similar performance advancement is also observed in W4A4 quantization.

The contributions of OmniQuant are summarized as follows. 1) We formulate a novel quantization pipeline for LLMs, OmniQuant, which freezes the original full-precision weights while incorporating a restrained set of learnable parameters. OmniQuant imbues quantization with gradient updates while preserving the time and data efficiency of PTQ methods. 2) OmniQuant consists of Learnable Weight Clipping (LWC) and Learnable Equivalent Transformation (LET). These strategies make full-precision weights and activations more amenable to quantization. 3) Through extensive experiments, we demonstrate that OmniQuant outperforms previous methods across a spectrum of quantization settings (W4A16, W3A16, W2A16, W6A6, W4A4), various model families (OPT, LLaMA, LLaMA-2, LLaMA-2-chat, Falcon), and a range of model sizes (125M-180B). The computation speedup and memory reduction of OmniQuant are also demonstrated on real devices.

Related Work

Quantization reduces neural network bit-precision, leading to smaller models and faster inference. Current methods are largely divided into Quantization-Aware Training (QAT) (Liu et al., 2023a) and Post-Training Quantization (PTQ) (Xiao et al., 2023; Frantar et al., 2022). While QAT maintains performance by simulating quantization during training, its training cost makes it unsuitable for LLMs. PTQ techniques like AdaRound (Nagel et al., 2020) and BRECQ (Li et al., 2021) use gradient optimization to determine optimal rounding, but tuning all weights is time-intensive for larger models. Thus, most LLM quantization methods (Xiao et al., 2023; Frantar et al., 2022; Dettmers et al., 2023b; Lee et al., 2023; Wei et al., 2023) prioritize training-free PTQ, which limits performance in lower-bit situations. Our goal is to integrate gradient updates into LLM quantization, mirroring QAT's approach, while retaining PTQ's efficiency.

2.2 Quantization of LLM

Considering the quantized object, existing LLM quantization methods can be classified into two categories: weight-only quantization and weight-activation quantization.

Weight-only quantization. Weight-only quantization focuses on converting weights to low-bit values. For instance, GPTQ (Frantar et al., 2022) uses block-wise reconstruction for 3/4-bit quantization. SpQR (Dettmers et al., 2023b), OWQ (Lee et al., 2023), and AWQ (Lin et al., 2023) emphasize the significance of weights tied to higher-magnitude activations. Therefore, SpQR and OWQ employ mixed-precision quantization to safeguard vital weights, while AWQ opts for channel-wise scaling to avoid the hardware inefficiency of mixed precision. QLoRA (Dettmers et al., 2023a) and INT2.1 (Chee et al., 2023) restore the capabilities of the quantized model through parameter-efficient fine-tuning. Our method, in contrast, enhances the quantization process directly, making OmniQuant complementary to QLoRA and INT2.1.

Weight-activation quantization. Weight-activation quantization compresses both weights and activations. SmoothQuant (Xiao et al., 2023), LLM.int8() (Dettmers et al., 2022), and Outlier Suppression (Wei et al., 2022) achieve W8A8 quantization by managing activation outliers. LLM.int8() uses mixed-precision decomposition, while the other two employ channel-wise scaling. Furthermore, Outlier Suppression+ (Wei et al., 2023) adds channel-wise shifting to enable W6A6 quantization. Unlike previous heuristic designs, we use gradient optimization and extend equivalent transformations to attention mechanisms, further improving K/V cache quantization. Recently, RPTQ (Yuan et al., 2023) and LLM-QAT (Liu et al., 2023a) have achieved W4A4 quantization. However, RPTQ adopts deployment-unfriendly group-wise activation quantization, and LLM-QAT employs time-consuming QAT. In contrast to RPTQ and LLM-QAT, we achieve W4A4 quantization through deployment-friendly per-token quantization while maintaining PTQ efficiency.

OmniQuant

Challenge of LLM quantization. There are two main difficulties in quantizing an LLM. First, the activations are hard to quantize due to the existence of outlier channels. Considering that the weight distribution is flat and uniform, SmoothQuant (Xiao et al., 2023) and Outlier Suppression+ (Wei et al., 2023) tackle this issue by migrating the quantization difficulty from activations to weights with a pre-defined migration strength. Second, the quantization error of weights also plays a pivotal role in the final performance due to the importance of weights corresponding to activations. SpQR (Dettmers et al., 2023b) and OWQ (Lee et al., 2023) propose to retain crucial weights in full precision, while AWQ (Lin et al., 2023) safeguards these weights using grid-searched channel-wise scaling. Although these methods have achieved certain success in compressing various LLMs, they often lead to suboptimal performance and fail to deal with extremely low-bit quantization due to the crude design of hand-crafted quantization parameters such as migration strength and scaling factors.

In this section, we introduce a differentiable quantization technique for LLMs called OmniQuant, in which quantization parameters are learned with better flexibility. Towards this goal, OmniQuant is implemented with a block-wise quantization error minimization framework as presented in Sec. 3.1. To tackle the aforementioned challenges of LLM quantization, we devise two novel strategies for additional learnable quantization parameters: a learnable weight clipping (LWC) to mitigate the difficulty in quantizing weights, and a learnable equivalent transformation (LET) to further shift the challenge of quantization from activations to weights. We introduce LWC and LET in Sec. 3.2 and Sec. 3.3, respectively.

3.1 Block-wise Quantization Error Minimization

Previous PTQ methods with gradient optimization, such as AdaRound (Nagel et al., 2020) and BRECQ (Li et al., 2021), cannot be applied to models with billions of parameters because they are hard to optimize due to the huge solution space. Instead of tuning the whole model, we propose a new optimization pipeline with block-wise quantization error minimization, in which the additional quantization parameters can be optimized in a differentiable manner. We formulate the optimization goal as follows.
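Written out in a form consistent with the symbol definitions that follow (a reconstruction, so the choice of norm and the exact argument order are assumptions), the objective is:

```latex
\arg\min_{\Theta_1,\Theta_2}
\left\lVert
  \mathcal{F}(\mathbf{W},\mathbf{X})
  - \mathcal{F}\bigl(Q_w(\mathbf{W};\Theta_1,\Theta_2),\; Q_a(\mathbf{X},\Theta_2)\bigr)
\right\rVert
\tag{1}
```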

where $\mathcal{F}$ represents the mapping function for a transformer block in the LLM, $\mathbf{W}$ and $\mathbf{X}$ are the full-precision weights and activations, $Q_w(\cdot)$ and $Q_a(\cdot)$ represent the weight and activation quantizers, respectively, and $\Theta_1$ and $\Theta_2$ are the quantization parameters in learnable weight clipping (LWC) and learnable equivalent transformation (LET), respectively. The block-wise quantization in Eqn. (1) sequentially quantizes the parameters of one transformer block before moving on to the next.

Block-wise minimization in Eqn. (1) has two advantages. First, it allows OmniQuant to optimize the quantization parameters in LWC and LET jointly, making it capable of covering both weight-only and weight-activation quantization. Second, block-wise minimization is easy to optimize with minimal resource requirements. OmniQuant only needs to optimize a small number of quantization parameters, which is easier than optimizing all the weights as in previous PTQ-based methods (Nagel et al., 2020; Li et al., 2021). Empirically, we find that all models from the LLaMA-2 family (Touvron et al., 2023b) can be quantized on a single A100-40G GPU using only 128 training samples.

3.2 Learnable Weight Clipping

OmniQuant employs a learnable weight clipping (LWC) module to reduce the difficulty of quantizing the weights in an LLM. Similar to previous methods with a learnable clipping threshold (Esser et al., 2019; Liu et al., 2022; Choi et al., 2018), LWC determines the optimal dynamic range of the weights by optimizing a clipping threshold. However, we find that directly employing prior art such as PACT (Choi et al., 2018) and LSQ (Esser et al., 2019) in quantization produces unsatisfactory performance, as demonstrated in LLM-QAT (Liu et al., 2023a). A similar result is also observed in Table A8 in the Appendix.

Instead of directly learning a clipping threshold as done in previous methods (Esser et al., 2019; Choi et al., 2018), LWC optimizes a clipping strength, formulated as follows.
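A form of Eqn. (2) consistent with the surrounding description is sketched below. The symbols $N$ (target bit width), $h$ (step size, or normalization factor) and $z$ (zero-point) are introduced here for illustration, $\lfloor\cdot\rceil$ denotes rounding to the nearest integer, and $\gamma, \beta \in [0, 1]$ are the learnable clipping strengths for the upper and lower bounds of the weights.

```latex
\mathbf{W}_q = \mathrm{clamp}\!\left(
  \left\lfloor \frac{\mathbf{W}}{h} \right\rceil + z,\; 0,\; 2^{N}-1
\right),
\qquad
h = \frac{\gamma \max(\mathbf{W}) - \beta \min(\mathbf{W})}{2^{N}-1},
\qquad
z = -\left\lfloor \frac{\beta \min(\mathbf{W})}{h} \right\rceil
\tag{2}
```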

Note that LWC degrades into the vanilla MinMax quantization scheme used in existing works (Xiao et al., 2023; Frantar et al., 2022) when $\gamma=1$ and $\beta=1$. By inheriting the benefits of MinMax quantization, LWC only needs to adjust the clipping strengths to determine an optimal clipping threshold, which reduces the optimization difficulty. Once clipped by an optimal threshold, the original weights become easy to quantize. As indicated by the experiments in Table 1, our proposed learnable weight clipping method significantly outperforms previous weight-only quantization techniques (Frantar et al., 2022; Lin et al., 2023).
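To make the role of the clipping strengths concrete, the following is a minimal sketch of a clipped MinMax quantizer on a plain list of weights. It assumes a uniform asymmetric quantizer with an integer zero-point; the function name `lwc_quantize` and the example weights are hypothetical, and here $\gamma$ and $\beta$ are fixed by hand rather than learned.

```python
def lwc_quantize(weights, bits, gamma=1.0, beta=1.0):
    """Quantize then dequantize a list of floats with clipped MinMax quantization.

    gamma and beta scale the max and min of the weights; with both equal to
    1.0 this reduces to plain MinMax quantization.
    """
    levels = 2 ** bits - 1
    w_max, w_min = gamma * max(weights), beta * min(weights)
    h = (w_max - w_min) / levels      # step size
    z = -round(w_min / h)             # integer zero-point
    out = []
    for w in weights:
        # Python's round() sends exact halves to the even integer, which is
        # close enough to round-to-nearest for a sketch.
        q = min(max(round(w / h) + z, 0), levels)   # clamp to the N-bit grid
        out.append((q - z) * h)                     # map back to real values
    return out

# Hypothetical weights: mostly small values plus one larger entry.
weights = [-0.5, -0.3, -0.1, 0.0, 0.1, 0.2, 0.4, 1.2]
plain = lwc_quantize(weights, bits=2)                # gamma = beta = 1: MinMax
clipped = lwc_quantize(weights, bits=2, gamma=0.5)   # upper range halved

print(plain)
print(clipped)
```

With `gamma=0.5` the largest weight is saturated at the top of the grid, while the step size for the remaining weights shrinks. In this tiny example that trade does not lower the weight error, which is why the strengths are learned against the block output error rather than fixed by hand.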

3.3 Learnable Equivalent Transformation

Beyond LWC, which makes weights quantization-friendly by optimizing the clipping threshold, we further reduce the difficulty of weight-activation quantization with a learnable equivalent transformation (LET). Considering that outliers in the activation map are systematic and unique to specific channels, previous methods such as SmoothQuant (Xiao et al., 2023) migrate the difficulty of quantization from activations to weights with a mathematically equivalent transformation. However, they hand-craft the equivalent parameters, leading to suboptimal results.

Thanks to the inclusion of block-wise quantization error minimization, our LET can determine the optimal equivalent parameters in a differentiable way. Inspired by SmoothQuant (Xiao et al., 2023) and Outlier Suppression+ (Wei et al., 2023), we adopt channel-wise scaling and channel-wise shifting to manipulate the activation distribution, providing an effective solution for the outlier issue. Specifically, we investigate the equivalent transformation across both the linear layer and the attention operation, as illustrated in Figure 3.
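For a linear layer, the channel-wise scaling and shifting described above can be sketched as follows. The symbol names are assumptions introduced here: $\mathbf{X}$ is the input, $\mathbf{W}$ and $\mathbf{B}$ are the weight and bias, $s$ and $\delta$ are the channel-wise scaling and shifting vectors, and $\oslash$ and $\odot$ denote element-wise division and multiplication. The first line states the equivalence, and the second applies the quantizers to the transformed activation and weight.

```latex
\mathbf{Y} = \mathbf{X}\mathbf{W} + \mathbf{B}
= \underbrace{[(\mathbf{X}-\delta)\oslash s]}_{\tilde{\mathbf{X}}}
  \cdot \underbrace{[s \odot \mathbf{W}]}_{\tilde{\mathbf{W}}}
  + \underbrace{[\mathbf{B} + \delta\mathbf{W}]}_{\tilde{\mathbf{B}}}

\mathbf{Y} = Q_a(\tilde{\mathbf{X}})\, Q_w(\tilde{\mathbf{W}}) + \tilde{\mathbf{B}}
```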

where $Q_a$ is the vanilla MinMax quantizer and $Q_w$ is the MinMax quantizer with learnable weight clipping (i.e., our LWC).
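The claim that scaling and shifting leave the layer output unchanged (before quantization) can be checked numerically. The sketch below uses plain Python lists and hand-picked values; the helper names, the example matrices, and the particular scale and shift are all hypothetical, and in OmniQuant the scale and shift would be learned rather than fixed.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def linear(x, w, bias):
    """Y = XW + B, with the bias row broadcast over tokens."""
    return [[v + b for v, b in zip(row, bias)] for row in matmul(x, w)]

def equivalent_transform(x, w, bias, scale, shift):
    """Return (X~, W~, B~) for per-input-channel scale and shift vectors."""
    x_t = [[(v - d) / s for v, d, s in zip(row, shift, scale)] for row in x]
    w_t = [[s * v for v in row] for s, row in zip(scale, w)]   # scale each input-channel row
    b_t = [b + e for b, e in zip(bias, matmul([shift], w)[0])]  # B + shift @ W
    return x_t, w_t, b_t

# Two tokens, two input channels; the second channel carries the outliers.
x = [[1.0, 50.0], [2.0, -40.0]]
w = [[0.5, -1.0], [0.25, 2.0]]
bias = [0.1, -0.2]
scale, shift = [1.0, 16.0], [0.0, 5.0]

y_ref = linear(x, w, bias)
x_t, w_t, b_t = equivalent_transform(x, w, bias, scale, shift)
y_eq = linear(x_t, w_t, b_t)

max_diff = max(abs(a - b) for ra, rb in zip(y_ref, y_eq) for a, b in zip(ra, rb))
print(max_diff)   # numerically zero up to floating-point rounding
print(x_t)        # the outlier channel now has a much smaller magnitude
```

The transformed activation has a far smaller dynamic range, while the corresponding weight rows grow by the same factor, which is the sense in which the quantization difficulty is moved from activations to weights.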

Attention operation. Beyond the linear layer, the attention operation also accounts for a significant proportion of the computation. Additionally, the auto-regressive pattern of LLMs necessitates storing the key-value (KV) cache for each token, which results in substantial memory demands for long sequences. Therefore, we also quantize the $\mathbf{Q}/\mathbf{K}/\mathbf{V}$ matrices into low-bit values in the weight-activation quantization setting. Specifically, the learnable equivalent transform of the self-attention affinity matrix can be written as:
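One form consistent with this description is given below, where $s_a$ is an assumed name for the scaling vector applied to the affinity matrix and $\mathbf{P}$ for the attention probabilities. The scaling cancels inside the product, so the softmax input is unchanged.

```latex
\mathbf{P} = \mathrm{Softmax}(\mathbf{Q}\mathbf{K}^{T})
= \mathrm{Softmax}\bigl(
  \underbrace{(\mathbf{Q} \oslash s_a)}_{\tilde{\mathbf{Q}}}\,
  \underbrace{(s_a \odot \mathbf{K}^{T})}_{\tilde{\mathbf{K}}^{T}}
\bigr)
```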

Experiments

Quantization. We experiment with both weight-only and weight-activation quantization. For the former, the default settings are INT4/INT3/INT2 per-channel weight quantization. Group-wise weight quantization is denoted by 'g', e.g., W3A16g128 means 3-bit weight-only quantization with a group size of 128. In weight-activation quantization, the defaults are INT6/INT4 per-channel weight and per-token activation quantization (Dettmers et al., 2022). All intermediate activations are quantized to low-bit values, except the SoftMax output, which is kept at full precision because its long-tail distribution is unsuitable for uniform quantization.
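The difference between per-channel and group-wise weight quantization can be illustrated with a small sketch. It assumes plain MinMax quantization with a floating-point offset instead of an integer zero-point (a simplification), and the helper names and the example row are hypothetical. A "channel" here is one row of weights sharing a single range, while group-wise quantization gives each consecutive group of `group_size` weights its own range.

```python
def minmax_quantize(values, bits):
    """Quantize then dequantize a list of floats onto a uniform grid over [min, max]."""
    levels = 2 ** bits - 1
    lo, hi = min(values), max(values)
    step = (hi - lo) / levels if hi > lo else 1.0
    return [lo + round((v - lo) / step) * step for v in values]

def groupwise_quantize(row, bits, group_size):
    """Quantize each consecutive group of group_size weights with its own range."""
    out = []
    for start in range(0, len(row), group_size):
        out.extend(minmax_quantize(row[start:start + group_size], bits))
    return out

def abs_error(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

# One weight row with a single large value in its second half.
row = [0.1, -0.2, 0.3, -0.1, 0.2, 8.0, -0.3, 0.0]

channel_err = abs_error(row, minmax_quantize(row, bits=3))                    # one range per row
group_err = abs_error(row, groupwise_quantize(row, bits=3, group_size=4))    # one range per group

print(channel_err, group_err)
```

In this example the large value only stretches the range of its own group, so the first four weights keep a fine grid and the total error drops.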

Training. The channel-wise scaling factor is initialized with SmoothQuant (Xiao et al., 2023), and the channel-wise shifting factor is initialized using Outlier Suppression+ (Wei et al., 2023). To optimize the learnable parameters, we use the AdamW optimizer with zero weight decay. The learning rates for learnable weight clipping and equivalent transformation are set to 5e-3 and 1e-2, respectively. We employ a calibration dataset consisting of 128 randomly selected 2048-token segments from WikiText2 (Merity et al., 2016). The entire training process is run on a single Nvidia A100 GPU, using a batch size of 1 over 20 epochs, except for W2A16 quantization, which uses 40 epochs. For weight-activation quantization, both learnable weight clipping and equivalent transformation are activated. For weight-only quantization, both are used for OPT, but only the clipping is used for LLaMA, as Table A1 shows negligible benefits from the equivalent transformation for LLaMA.

Models. We test on OPT (125M-66B) (Zhang et al., 2022), LLaMA (7B-65B) (Touvron et al., 2023a), LLaMA-2 (7B-70B) (Touvron et al., 2023b), Falcon-180B (Penedo et al., 2023), and instruction-tuned LLaMA-2-chat (Touvron et al., 2023b) for generalizability. While the main paper highlights the LLaMA results, comprehensive details for other models are available in Sec. A6 of the Appendix.

Evaluation. Following previous work (Lin et al., 2023; Frantar et al., 2022), we evaluate quantized models by reporting the perplexity of language generation experiments, specifically on WikiText2 (Merity et al., 2016), PTB (Marcus et al., 1994), and C4 (Raffel et al., 2020). Moreover, accuracy is evaluated on zero-shot tasks including PIQA (Bisk et al., 2020), ARC (Clark et al., 2018), BoolQ (Clark et al., 2019), and HellaSwag (Zellers et al., 2019). We adhere to the GPTQ (Frantar et al., 2022) settings for language generation experiments, and use lm-eval-harness (Gao et al., 2021) to run all zero-shot tasks.
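For reference, perplexity is the exponential of the mean negative log-likelihood per token. The sketch below uses the standard definition with natural logarithms; the function name and the toy log-probabilities are illustrative and are not tied to the GPTQ evaluation scripts.

```python
import math

def perplexity(token_log_probs):
    """exp of the mean negative log-likelihood over tokens (natural log)."""
    nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(nll)

# Toy case: a model that assigns probability 1/4 to every observed token
# has a perplexity of (approximately, up to rounding) 4.
uniform_log_probs = [math.log(0.25)] * 8
ppl = perplexity(uniform_log_probs)
print(ppl)
```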

Baselines. For weight-only quantization, we compare with vanilla round-to-nearest quantization (RTN), GPTQ (Frantar et al., 2022), and AWQ (Lin et al., 2023). For weight-activation quantization, we compare our method with SmoothQuant (Xiao et al., 2023), RPTQ (Yuan et al., 2023), and the recent QAT method LLM-QAT (Liu et al., 2023a). Note that we reproduce SmoothQuant with per-channel weight quantization and per-token activation quantization for fair comparisons.

4.2 Weight-only Quantization Results

The results of the LLaMA family can be found in Table 1, while the results for OPT are presented in Sec. A6 of the Appendix. As illustrated by the tables, OmniQuant consistently outperforms prior LLM weight-only quantization methods across various LLM families (OPT, LLaMA-1, LLaMA-2) and diverse quantization configurations, including W2A16, W2A16g128, W2A16g64, W3A16, W3A16g128, W4A16, and W4A16g128. These findings suggest that OmniQuant is versatile and adaptable to a multitude of quantization configurations. For instance, while AWQ (Lin et al., 2023) is particularly effective with group-wise quantization, OmniQuant demonstrates superior performance across both channel-wise and group-wise quantization. Furthermore, the performance benefits of OmniQuant become more pronounced as the quantization bit width decreases.

4.3 Weight-Activation Quantization Results

In weight-activation quantization, our main focus lies on W6A6 and W4A4 quantization. We exclude W8A8 quantization because SmoothQuant already achieves nearly lossless W8A8 quantized models compared with their full-precision counterparts. The results of the LLaMA family can be found in Table 2, while the results for OPT are presented in Table A16 of the Appendix. Table 2 reports the zero-shot task accuracy of LLaMA under weight-activation quantization. Notably, OmniQuant improves the average accuracy by +4.99% to +11.80% across various models at W4A4 quantization. On LLaMA-7B, OmniQuant even surpasses the recent QAT method LLM-QAT (Liu et al., 2023a) by a margin of +6.22%. This improvement demonstrates the efficacy of incorporating additional learnable parameters, which proves to be more beneficial than the global weight tuning used by QAT.

4.4 Quantization of instruction-tuned models

To validate the generalization capability of our method, we test the quantization on LLaMA-2-chat (Touvron et al., 2023b), an instruction-tuned model for chatbots. Using the GPT-4 evaluation protocol (Chiang et al., 2023), performance is assessed on the Vicuna benchmark (Chiang et al., 2023), which comprises 80 questions. To negate position bias (Zheng et al., 2023), each pair of answers is compared in both orders, totaling 160 trials per comparison. Figure 4 compares RTN, AWQ (Lin et al., 2023), and OmniQuant. On LLaMA-2-7b-chat, OmniQuant matches AWQ with a 50% win rate but surpasses RTN by a larger margin (80.3% vs. 69.4%). On LLaMA-2-13b-chat, while AWQ lags behind RTN, OmniQuant consistently improves the performance of the quantized model.
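The effect of comparing each pair in both orders can be sketched with a toy judge. Everything here is hypothetical: the judge below is a stand-in that always prefers whichever answer is shown first (pure position bias), and the win rate simply counts wins over all trials, since the exact handling of ties in the GPT-4 protocol is not described in the text.

```python
def both_orders(questions, judge):
    """Query the judge twice per question, swapping the answer positions."""
    verdicts = []
    for q in questions:
        verdicts.append(judge(q, candidate_first=True))
        verdicts.append(judge(q, candidate_first=False))
    return verdicts

def win_rate(verdicts):
    """Fraction of trials won by the candidate ('win', 'tie' or 'loss' per trial)."""
    return sum(v == "win" for v in verdicts) / len(verdicts)

def position_biased_judge(question, candidate_first):
    """Toy judge that always prefers the first answer it is shown."""
    return "win" if candidate_first else "loss"

questions = [f"question-{i}" for i in range(80)]   # 80 benchmark questions
verdicts = both_orders(questions, position_biased_judge)

print(len(verdicts))       # 160
print(win_rate(verdicts))  # 0.5
```

A purely position-biased judge ends up at exactly 50%, so any deviation from 50% in the two-order protocol reflects a preference between the models rather than between positions.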

4.5 Acceleration on Real Device

MLC-LLM (https://github.com/mlc-ai/mlc-llm) provides a versatile deployment solution for diverse language models across various hardware. It particularly excels in deploying quantized models on CUDA. One of OmniQuant's strengths lies in its ability to avoid extra operations for quantized models, allowing MLC-LLM to seamlessly run models created with OmniQuant. Table 3 shows the memory requirements and inference speeds of the LLaMA family on an NVIDIA A100-80G. 'Weights Memory (WM)' represents quantized weight storage, and 'Running Memory (RM)' indicates the memory for inference, with the latter being higher due to certain retained activations. Inference speed is gauged by generating 512 tokens. It is evident that quantized models significantly reduce memory usage compared to 16-bit full-precision models. Moreover, models with W4A16g128 and W2A16g128 quantization almost double the inference speed. However, MLC-LLM's support for INT3/INT2 is currently suboptimal, particularly for INT3. Enhancements to INT3/INT2 quantization speed are on our future roadmap. Additionally, we only explore the deployment of weight-only quantization in this study because W4A4 and W6A6 quantization lack out-of-the-box hardware support.

Conclusion

We present OmniQuant, a method advancing weight-only and weight-activation quantization to low-bit formats. OmniQuant's core principle is to retain the original full-precision weights while adding learnable parameters. It uses learnable weight clipping and learnable equivalent transformation to make weights and activations more amenable to quantization. While incorporating gradient updates, OmniQuant maintains training efficiency comparable to existing PTQ methods. It outperforms current methods in language generation and zero-shot tasks, and is suited for instruction-tuned LLMs. OmniQuant also ensures hardware compatibility, as its added parameters can be absorbed into the quantized weights.

We thank Wentao Liu from SenseTime for his valuable insights and discussions regarding LLM deployment. We also acknowledge Siyuan Feng from Apache TVM for assisting in the successful deployment of our OmniQuant in the MLC LLM project.

References

Appendix A1 Overall algorithm

The comprehensive training algorithm of OmniQuant is illustrated in Algorithm 1. We employ a block-wise calibration strategy comprising three steps: initializing the learnable parameters (Lines 4-5), training these learnable parameters (Lines 6-15), and transforming the model with the learned parameters followed by quantization (Lines 16-18). The OmniQuant algorithm finds the optimal transformation to enhance the quantization compatibility of the LLM. Additionally, because only a few parameters are learned per block, OmniQuant can achieve rapid convergence using a small calibration dataset.
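The three steps can be illustrated with a deliberately simplified toy, which is not Algorithm 1 itself. The assumptions are: each block is a single bias-free linear layer, the only learnable parameter is one clipping strength `gamma` per block (LET is omitted), a grid search stands in for the AdamW updates, and the quantized block is fed the output of the previously quantized blocks while the target is computed from full-precision inputs. All function names and values are hypothetical.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def quantize(w, bits, gamma):
    """Quantize then dequantize a matrix with its min/max range shrunk by gamma."""
    flat = [v for row in w for v in row]
    lo, hi = gamma * min(flat), gamma * max(flat)
    step = (hi - lo) / (2 ** bits - 1)
    return [[lo + round((min(max(v, lo), hi) - lo) / step) * step for v in row]
            for row in w]

def sq_error(a, b):
    return sum((x - y) ** 2 for ra, rb in zip(a, b) for x, y in zip(ra, rb))

def calibrate(blocks, x, bits, grid):
    """Quantize blocks one at a time, choosing gamma per block on output error."""
    x_fp, x_q, result = x, x, []
    for w in blocks:
        target = matmul(x_fp, w)                     # full-precision block output
        # Step 1 (init) and step 2 (train): grid search replaces gradient descent.
        best = min(grid, key=lambda g: sq_error(target, matmul(x_q, quantize(w, bits, g))))
        w_q = quantize(w, bits, best)                # step 3: apply and quantize
        result.append((best, w_q))
        x_fp, x_q = target, matmul(x_q, w_q)         # inputs for the next block
    return result

blocks = [[[0.3, -0.2], [1.5, 0.1]], [[-0.4, 0.25], [0.05, -1.2]]]
x = [[1.0, 0.5], [-0.5, 2.0]]
grid = [0.5, 0.6, 0.7, 0.8, 0.9, 1.0]

result = calibrate(blocks, x, bits=3, grid=grid)
print([gamma for gamma, _ in result])
```

Because `gamma = 1.0` is part of the grid, the selected setting is never worse on the calibration data than plain MinMax quantization for the same block inputs.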

Appendix A2 Ablation studies

Efficacy of each component. In Table A1, the baseline model incorporates both LWC and LET and is labeled 'LWC+LET'. We further investigate their individual contributions by removing each component. Both components positively influence performance, but LET proves essential for weight-activation quantization. Disabling it for W4A4 results in a sharp rise in perplexity to the order of 1e3, primarily due to outliers in activation quantization. For weight-only quantization, LET significantly boosts OPT's performance but offers only a slight enhancement for LLaMA, which is explained by LLaMA having few weight outliers. For example, in naive W3A16 quantization (-LWC-LET), LLaMA reaches a perplexity of 10.68, while OPT's spikes to 4.6e3. Consequently, given its limited advantage, LET is turned off for LLaMA in weight-only quantization to speed up training.

Design choices of learnable equivalent transformation. In comparison to the equivalent transformation incorporated in SmoothQuant (Xiao et al., 2023), our approach additionally implements channel-wise shifting and attention transformation. The effects of these innovations are evaluated in Table A2. We can observe that both modifications enhance the performance of weight-activation quantization. However, the incremental benefit from the equivalent transformation in the attention operation is comparatively minor. This discrepancy arises primarily because the majority of outliers exist in the output of the normalization layer and are less prevalent in the $Q/K/V$ matrices.

Training time. As illustrated in Table A3, LLaMA-7B was trained for various numbers of epochs to determine the optimal convergence time. Most quantization configurations converge within 20 epochs, with the exception of W2A16, which requires 80 epochs. Consequently, we use 20 training epochs for all configurations except W2A16, for which we use 40 epochs as a compromise on training time.

Calibration data. OmniQuant uses gradient optimization on a constrained calibration dataset, sourced from WikiText2 and comprising 128 segments of 2048 tokens each. This raises concerns about potential overfitting to the calibration dataset. To explore this, we evaluated the calibration dataset's influence using two other datasets: Pile (Gao et al., 2020) and C4 (Raffel et al., 2020). As shown in Table A4, the variance in perplexity across diverse calibration datasets is marginal, fluctuating between 0.0006 and 0.17. This underscores OmniQuant's robustness to the distribution of the calibration set. Furthermore, the data efficiency of OmniQuant was gauged by varying the number of training samples, as presented in Table A5. Remarkably, OmniQuant converges with as few as 16 samples. Our selection of 128 samples aligns with established practice in prior work (Frantar et al., 2022; Lin et al., 2023).

Appendix A3 Training Time

As shown in Table A6, we report the training time of the proposed OmniQuant within the LLaMA family. Note that for LLaMA, we only activate learnable weight clipping for weight-only quantization. Therefore, the training time for weight-only quantization is shorter than for weight-activation quantization, given the fewer learnable parameters involved. While our proposed method requires a training time that is approximately 5× greater than GPTQ, it remains markedly faster than QAT methods, which demand hundreds of GPU hours.

Appendix A4 Performance Analysis

In this section, we investigate the internal mechanisms of learnable weight clipping and learnable equivalent transformation, respectively. Further, we show that with OmniQuant, 3-bit and 4-bit quantization achieve a similar trade-off between model bits and perplexity.

Learnable weight clipping. In addition to perplexity and accuracy, the quality of a quantization method can intuitively be evaluated by calculating the distance between quantized models and their full-precision counterparts. This is demonstrated in Table A7, where we detail the $\ell_1$ distance of weights and activations for LLaMA-7B's weight-only quantization. We can observe that the proposed Learnable Weight Clipping (LWC) substantially decreases the $\ell_1$ distance for both weights and activations. It is noteworthy that, in certain instances, the weight $\ell_1$ distance for quantized models without LWC is similar to that of models using LWC. However, models incorporating LWC exhibit markedly lower activation $\ell_1$ distances. This observation supports the argument that LWC can effectively balance quantization precision between outlier and regular values.

Additionally, Figure A1 illustrates the distribution of the learned clipping scales ($\gamma$ and $\beta$) defined in Eq. (2). It is apparent that LWC can learn different clippings for diverse quantization configurations. For instance, with per-channel weight quantization W3A16, as depicted in Figure A1(a), the learned clipping scale follows a normal distribution. This suggests that approximately half of the outliers are being clipped. In the case of group-wise quantization, the learned clipping scale exhibits a long-tailed distribution, implying that most quantized groups are associated with minimal clipping. Note that lower bit widths exhibit more pronounced clipping. For example, in W2A16g128, 50% of the learned clipping scales are larger than 0.95, whereas in W3A16g128 this percentage rises to 70%.

Learnable equivalent transformation. Figure A2 provides visualizations of the intermediate activation in the linear layer. It is apparent that several outlier channels in the original activation (Figure A2(a)) possess significantly larger magnitudes than the regular channels, making the activation hard to quantize. Although SmoothQuant mitigates this issue to some degree, for example by reducing the outlier magnitude from 70 to 2, Figure A2(b) reveals that the magnitude of outlier channels still remains notably larger than that of other regular channels after SmoothQuant. This phenomenon can be attributed to SmoothQuant's heuristic approach to deriving channel-wise scaling, which makes it challenging to discover an optimal solution. The impact of the proposed LET is depicted in Figure A2(c). It is noteworthy that the magnitude disparity between the outlier and regular channels is markedly diminished. This homogenization of the activation distribution, facilitated by LET, enables OmniQuant to push weight-activation quantization towards a low-bit scheme.

Scaling laws. Quantization serves as a potent strategy to curtail the total model bits, thereby facilitating the deployment of LLMs on edge or consumer devices with restricted memory. However, the total model bits depend on both the number of parameters in the original model and the quantization bit width. Therefore, given a constraint on model bits, the challenge arises: how does one optimally choose the number of parameters of the full-precision model and the quantization bit width? Dettmers & Zettlemoyer (2023) demonstrated that 4-bit quantization establishes a universally optimal balance between total model bits and zero-shot accuracy. Nonetheless, as shown in Figure A3, we find that OmniQuant allows 3-bit quantization to achieve performance comparable to 4-bit quantization in the trade-off between model bits and perplexity.
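The trade-off can be made concrete by enumerating which combinations of model size and bit width fit under a fixed budget. The sketch below treats total model bits as parameters times bits per weight, uses nominal parameter counts (7e9, 13e9, 30e9) as stand-ins for the LLaMA sizes, and ignores the extra storage for group-wise scales and zero-points; the function names and the 40-gigabit budget are illustrative.

```python
def model_bits(num_params, bits_per_weight):
    """Total bits needed to store the quantized weights."""
    return num_params * bits_per_weight

def candidates_within(budget_bits, sizes, bit_widths):
    """All (model name, bit width) pairs whose weights fit in the budget."""
    return [(name, b)
            for name, n in sizes.items()
            for b in bit_widths
            if model_bits(n, b) <= budget_bits]

# Nominal parameter counts, used here only for illustration.
sizes = {"7B": 7e9, "13B": 13e9, "30B": 30e9}
fitting = candidates_within(budget_bits=40e9, sizes=sizes, bit_widths=(2, 3, 4))
print(fitting)
```

Under this budget a 13B model fits at 3 bits but not at 4 bits, so whether 3-bit quantization preserves quality decides whether the larger model is a viable choice.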

Appendix A5 Comparisons with clipping-based method

In this paper, we proposed a novel method, learnable weight clipping (LWC), designed to adaptively determine the weight clipping threshold. LWC sets the threshold by scaling the original minimum and maximum values to delineate the solution space. We compare LWC against existing clipping-based methods: PACT and LSQ. While PACT directly determines the clipping threshold, LSQ focuses on the direct derivation of the scaling factor and zero-point. Both PACT and LSQ were initially formulated as QAT methods, accounting for both weight and activation clipping. For an equitable comparison, our examination is restricted to weight clipping. We integrated PACT and LSQ into our optimization pipeline in lieu of LWC. Table A8 illustrates that while PACT and LSQ enhance the performance of weight-only quantization in comparison to MinMax quantization, their efficacy diminishes in the weight-activation quantization setting. This decline can be attributed to the proposed LET during activation quantization, which alters the weight distribution in each training iteration, undermining the convergence of both LSQ and PACT. In contrast, LWC defines relative scaling values instead of absolute metrics, making it proficient in handling changes in weight distribution.

Appendix A6 Full Results

In this section, we provide a comprehensive presentation of our results across various datasets to complement the main paper. Specifically, the results include:

Experimental results on the extremely large model Falcon-180B (Table A9).

C4 perplexity with weight-only quantization in the LLaMA families (Table A10).

PTB perplexity with weight-only quantization in OPT families (Table A12).

C4 perplexity with weight-only quantization in OPT families (Table A13).

WikiText2 perplexity for weight-activation quantization in the LLaMA families (Table A14).

C4 perplexity for weight-activation quantization in the LLaMA families (Table A15).

WikiText2/PTB/C4 perplexity for weight-activation quantization in the OPT families (Table A16).