Outlier Suppression: Pushing the Limit of Low-bit Transformer Language Models

Xiuying Wei, Yunchen Zhang, Xiangguo Zhang, Ruihao Gong, Shanghang Zhang, Qi Zhang, Fengwei Yu, Xianglong Liu

Introduction

The Transformer has become one of the most common architectures in natural language processing, underlying many popular self-supervised models such as BERT, RoBERTa, XLNet, and BART. While these pre-trained models deliver markedly superior performance, their memory and computation overheads remain a major concern, particularly in real deployment. Model compression has therefore attracted much attention from both academia and industry. Among compression techniques, quantization, which operates in low-precision arithmetic, is one of the key approaches for compressing large models and fitting them onto lightweight devices.

Research on quantizing Transformer-based models has grown recently. Prior work proposes an 8-bit quantization scheme for BERT-like models, introduces a group-wise quantization technique and analyzes mixed precision using second-order Hessian information, combines distillation with quantization, and approximates nonlinear operations to implement integer-only quantization. Nonetheless, few studies investigate the inherent bottleneck of quantizing Transformer-based models.

Recently, some papers have indicated that Transformer-based models hold significantly large outliers (even close to 100) and that these extreme outliers behave in structured patterns: they mainly gather at a few embedding dimensions and become even larger on unique tokens. These special outliers can severely damage quantization performance (e.g., a 12% drop even for 8-bit quantization). To combat this challenge, existing methods choose workarounds such as a finer quantization granularity. However, this scheme increases the computation cost and unavoidably hinders the acceleration effect.

In this paper, to suppress the outliers rather than work around them, we analyze in depth both the inducement of the outliers and the impact of clipping them. For the inducement, we find that the scaling parameter $\bm{\gamma}$ in the LayerNorm structure works as an outlier amplifier, which amplifies the outliers in the output. By extracting it, the activation becomes more robust to quantization. By further studying the clipping impact, we discover that the effect of clipping on the final performance varies greatly: some more aggressive outliers covering a large area can be clipped safely without accuracy degradation, but the accuracy can drop suddenly when the important outliers are clipped. More interestingly, although the less important outliers might present in a long-tail form, they are provided by only a few tokens.

Motivated by the analysis, we propose an outlier suppression framework to push the limit of low-bit Transformer language models. Such framework contains two key components: Gamma Migration and Token-Wise Clipping, corresponding to the above two findings. The Gamma Migration produces a more quantization-friendly model by migrating the outlier amplifier $\bm{\gamma}$ into subsequent modules in an equivalent transformation and bringing more robust activation for quantization without extra computation burden. The Token-Wise Clipping further efficiently finds a suitable clipping range with minimal final quantization loss in a coarse-to-fine procedure. The coarse-grained stage, which leverages the fact that those less important outliers only belong to a few tokens, can obtain a preliminary clipping range quickly in a token-wise manner. The fine-grained stage then optimizes it. Our proposed framework can be applied to different models and tasks, and coupled with existing methods. More essentially, the thought of outlier suppression shall shed new light on the study of NLP quantization.

To summarize, our contributions are as follows:

We delve into the inducement and clipping impact of outliers in the NLP models and draw two critical findings that help handle the bottleneck of transformer quantization.

Based on the findings, an outlier suppression framework containing Gamma Migration and Token-Wise Clipping is proposed. This framework is efficient, easy to implement, and plug-and-play.

The Gamma Migration suppresses the outliers from the inducement aspect and produces a more quantization-friendly model without any extra inference time. It transfers the outlier amplifier in LayerNorm to the subsequent modules in an equivalent transformation and contributes to activation with less quantization error.

The Token-Wise Clipping scheme suppresses the outliers from the aspect of importance and produces a superior clipping range efficiently. It can skip over those unimportant outliers quickly leveraging the large variance of token range and then focus on the influential area.

Extensive experiments on various NLP models (BERT, RoBERTa, BART) and tasks (text classification, question answering, and summarization) show that our outlier suppression framework sets a new state of the art for transformer quantization and, for the first time, pushes the 6-bit post-training quantization (PTQ) and 4-bit quantization-aware training (QAT) accuracy of BERT to the full-precision level.

Preliminaries

Basic Notations. We mark matrices as ${\bm{X}}$ and vectors as ${\bm{x}}$ . Operator $\cdot$ denotes scalar multiplication, and $\odot$ is adopted for element-wise multiplication on matrices or vectors. Also, we use ${\bm{W}}{\bm{x}}$ as matrix-vector multiplication. Specifically, considering the tokens in NLP tasks, ${\bm{X}}_{t,j}$ stands for the element at token $t$ and embedding $j$ , and ${\bm{x}}_{t}$ represents the embedding of token $t$ .

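Quantizer. Quantization usually includes two operations.

$$\text{Quant}:\ \bar{x} = \mathrm{clip}\left(\left\lfloor \frac{x}{s} \right\rceil + z,\ 0,\ 2^{b}-1\right), \qquad \text{DeQuant}:\ \hat{x} = (\bar{x} - z) \cdot s \tag{1}$$

where $s$ (step size) and $z$ (zero point) are quantization parameters, $b$ is the bit setting, and $\lfloor \cdot \rceil$ rounds to the nearest integer. The first operation, called "Quant", maps continuous numbers ( $x$ ) to discrete points ( $\bar{x}$ ) for integer-arithmetic-only matrix computation. The second operation, called "DeQuant", recovers the result to $\hat{x}$ after multiplication.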
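As a minimal sketch of the "Quant" and "DeQuant" operations described above, the snippet below assumes the common uniform affine form (round, shift by the zero point, clip to the $b$-bit integer range). The function names and the numeric values of `s`, `z`, and `b` are illustrative assumptions, not settings from the paper.

```python
import numpy as np

def quant(x, s, z, b):
    """'Quant': map real values to integers in [0, 2^b - 1]."""
    return np.clip(np.round(x / s) + z, 0, 2 ** b - 1)

def dequant(x_bar, s, z):
    """'DeQuant': map integers back to approximate real values."""
    return (x_bar - z) * s

# Illustrative values only (assumed for this sketch).
x = np.array([-1.0, -0.26, 0.0, 0.49, 3.0])
s, z, b = 0.5, 2, 3

x_bar = quant(x, s, z, b)      # integers on the 3-bit grid
x_hat = dequant(x_bar, s, z)   # 3.0 saturates at the top of the representable range
```

With these values the representable range is $[-1, 2.5]$, so the input 3.0 is clipped while the other inputs are only rounded to the nearest grid point.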

Outlier analysis

For Transformer-based models, standard 6/8-bit PTQ or 4-bit QAT causes severe accuracy degradation. Investigating each quantizer, we find that the outputs of LayerNorm structures and GELU functions hold some sharp outliers, which are responsible for the large quantization error. Evidence and experimental results are given in Sec. B.2.

To investigate the relationship between the harmful outliers and quantization performance, we explore the underlying inducement of the outliers and the impact of clipping them. Before that, we briefly describe the outliers (see Sec. C.1 for details) to help understand the following two parts. The outliers show structured characteristics: they mainly gather at certain embedding dimensions, and on these dimensions, the outliers provided by unique tokens such as the separator token and the comma hold even more aggressive values.

For the inducement of outliers, we find that the scaling parameter in LayerNorm amplifies the outliers at certain embedding dimensions. The phenomenon that some tokens have sharper outliers might be caused by the uneven token frequency in the pre-training phase (see Sec. C.2). In this part, we mainly explain the first inducement in order to address these outliers at their origin. For the second one, because adjusting the pre-training is costly, we discuss the clipping impact in the next part and suppress these outliers from the clipping perspective.

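Considering the challenges of quantizing the LayerNorm, the natural action is to dive into its internal structure. For token $t$ at the $j^{th}$ embedding dimension, it first normalizes the input using the mean ( ${\bm{u}}_{t}$ ) and variance ( $\bm{\sigma}_{t}^{2}$ ) in each forward pass, then scales and shifts the value with parameters $\bm{\gamma}_{j}$ and $\bm{\beta}_{j}$ :

$$\widetilde{{\bm{X}}}_{t,j} = \frac{{\bm{X}}_{t,j} - {\bm{u}}_{t}}{\sqrt{\bm{\sigma}_{t}^{2} + \epsilon}} \cdot \bm{\gamma}_{j} + \bm{\beta}_{j} \tag{2}$$

where $\epsilon$ is a small constant for numerical stability.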

Then, by observing the parameter distribution of LayerNorm, we surprisingly find that the multiplier $\bm{\gamma}$ (Fig. 1(b)) and the output $\widetilde{{\bm{X}}}$ (Fig. 1(a)) hold outliers at the same embedding dimensions. Besides, the adder $\bm{\beta}$ has a smaller range (e.g., (0, 3)) compared to the output range (e.g., (-60, 0)), so we ignore it when identifying the key point. That is to say, $\bm{\gamma}$ plays a crucial part in the outliers in Fig. 1(a); in particular, as a parameter shared across tokens, it amplifies the outliers of every token.

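This observation suggests removing the amplification effect by extracting $\bm{\gamma}$ from Eq. (2) and using the Non-scaling LayerNorm of Eq. (3):

$${\bm{X}}^{\prime}_{t,j} = \frac{{\bm{X}}_{t,j} - {\bm{u}}_{t}}{\sqrt{\bm{\sigma}_{t}^{2} + \epsilon}} + \frac{\bm{\beta}_{j}}{\bm{\gamma}_{j}} \tag{3}$$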

Fig. 1(c) and Fig. 1(a) show that the output of the Non-scaling LayerNorm has a milder distribution with weaker outliers than the normal one. This not only confirms that $\bm{\gamma}$ strengthens the outliers but also reveals that ${\bm{X}}^{\prime}$ is more quantization-friendly than $\widetilde{{\bm{X}}}$ .
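The relationship between the two LayerNorm variants can be sketched numerically. The snippet assumes the Non-scaling LayerNorm is the normalized input plus $\bm{\beta}/\bm{\gamma}$, so that multiplying its output by $\bm{\gamma}$ recovers the standard output. The toy input and the single large entry in `gamma` are made-up values chosen only to mimic an amplifying dimension.

```python
import numpy as np

def layernorm(x, gamma, beta, eps=1e-5):
    """Standard LayerNorm over the last (embedding) axis."""
    u = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - u) / np.sqrt(var + eps) * gamma + beta

def non_scaling_layernorm(x, gamma, beta, eps=1e-5):
    """LayerNorm with gamma extracted: normalize, then add beta / gamma."""
    u = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - u) / np.sqrt(var + eps) + beta / gamma

# Toy input: 2 tokens, 4 embedding dimensions (illustrative values only).
x = np.array([[1.0, 2.0, 3.0, 4.0],
              [4.0, 0.0, -2.0, 1.0]])
gamma = np.array([1.0, 1.0, 1.0, 10.0])   # last dimension plays the "amplifier"
beta = np.array([0.1, 0.0, -0.1, 0.5])

x_tilde = layernorm(x, gamma, beta)               # normal output
x_prime = non_scaling_layernorm(x, gamma, beta)   # output without gamma

range_tilde = np.abs(x_tilde).max()
range_prime = np.abs(x_prime).max()
```

In this toy case `x_prime * gamma` reproduces `x_tilde`, while `range_prime` is much smaller than `range_tilde`, which is the kind of range reduction the text attributes to extracting $\bm{\gamma}$.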

To quantitatively validate that ${\bm{X}}^{\prime}$ holds a more quantization-friendly distribution, we adopt the cosine similarity metric to evaluate the quantization loss. In Table 1, the second row shows higher similarity, namely less quantization error, which indicates that the quantization performance can be improved by using the Non-scaling LayerNorm.

3.2 Impact of outlier clipping

In this part, we explore the impact of clipping the outliers in order to design a method that can find an appropriate clipping range for quantization. The experiments examine the clipping impact on FP models from two angles: accuracy and tokens.

Impact on accuracy. When clipping the outliers and evaluating the final performance, we find that the importance of outliers varies greatly. Taking the outliers after GELU as an example (others are in Sec. D.2), Fig. 2 shows that sharply clipping the more aggressive outliers (clipping signals in 10-100 to 10) does not hurt the full-precision performance, with accuracy still at 91.02, while the accuracy drops suddenly to 85.93 when too many outliers are cut.

Impact on token. Another key point is that the unimportant outliers, which can be clipped without any accuracy drop in FP models, correspond to only a few tokens. Motivated by prior work reporting that the separator token [SEP] attends to larger values, we also examine the different ranges provided by different tokens. The red points in Fig. 2, which represent the proportion of clipped tokens, clearly show that the more aggressive outliers, though they occupy a large range from 10 to 100, correspond to only 3% of tokens. Destroying those sharper outliers belonging to a few tokens does not affect the performance.
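The token-level view can be made concrete with a small sketch that measures which fraction of tokens is touched by a given clipping value. The activation matrix below is synthetic and was constructed so that 3 of 100 tokens carry large values; it mirrors the proportion mentioned above but does not reproduce any measured distribution. The helper name `clipped_token_ratio` is an assumption of this sketch.

```python
import numpy as np

def clipped_token_ratio(X, c_upper):
    """Fraction of tokens (rows) with at least one value above c_upper."""
    token_max = X.max(axis=1)
    return float(np.mean(token_max > c_upper))

# Synthetic activations: 100 tokens x 6 dims with values in [0, 5];
# three tokens carry a large outlier in dimension 2.
X = np.tile(np.linspace(0.0, 5.0, 6), (100, 1))
X[[4, 50, 97], 2] = [80.0, 60.0, 95.0]

ratio = clipped_token_ratio(X, c_upper=10.0)   # share of tokens affected
range_share = (X.max() - 10.0) / X.max()       # share of the value range above 10
```

Here clipping at 10 removes most of the value range while affecting only a small share of tokens, which is the asymmetry the token perspective exploits.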

The investigation of accuracy impact suggests taking the final performance into account when searching for a superior clipping range, so local optimization methods are not suitable here. The finding on token impact encourages us to leverage the tokens' indication to quickly skip over the unimportant area, especially when it presents in a long-tail form, where some existing methods suffer from low efficiency. Based on these, we introduce our method in Sec. 4.2.

Method

In this section, we propose our outlier suppression framework based on the above analysis. Firstly, the Gamma Migration technique is adopted to obtain a more quantization-friendly model by migrating the gamma into subsequent modules. Secondly, the Token-Wise Clipping further finds a suitable clipping range efficiently by leveraging the large variance of the token range.

As pointed out in Sec. 3.1, activation without going through the scaling parameter provides less quantization error. In this way, we split the LayerNorm function, migrate $\bm{\gamma}$ into follow-up structures and quantize the output of the Non-scaling LayerNorm. The transformation is equivalent for the FP model and brings more robust activation for the low-bit one. The overall flow is illustrated in Fig. 3.

Migration equivalence on FP model. Naturally, as in Eq. (3), we extract the parameter $\bm{\gamma}$ and transform the LayerNorm into the Non-scaling one, thus separating ${\bm{X}}^{\prime}_{t,j}$ from $\widetilde{{\bm{X}}}_{t,j}$ :

$$\widetilde{{\bm{X}}}_{t,j} = {\bm{X}}^{\prime}_{t,j} \cdot \bm{\gamma}_{j} \tag{4}$$

Since the residual connection is frequently adopted after LayerNorm, it is necessary to illustrate how to migrate the parameter $\bm{\gamma}$ into two branches. To be specific, considering the LayerNorm after Multi-Head Attention (Fig. 3), $\bm{\gamma}$ is excluded from the LayerNorm and moved to the shortcut branch and to the weight of the next layer. The LayerNorm then becomes the Non-scaling one, the shortcut branch gains a new parameter $\bm{\gamma}$ , and the weight of the next layer absorbs $\bm{\gamma}$ .

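Now, we show how the weight absorbs $\bm{\gamma}$ . For linear layers, we have the following equation:

$${\bm{W}}\left({\bm{x}} \odot \begin{bmatrix}\bm{\gamma}_{1}\\ \bm{\gamma}_{2}\\ \vdots\\ \bm{\gamma}_{n}\end{bmatrix}\right) = \left({\bm{W}} \odot \begin{bmatrix}\bm{\gamma}_{1} & \bm{\gamma}_{2} & \cdots & \bm{\gamma}_{n}\\ \vdots & \vdots & & \vdots\\ \bm{\gamma}_{1} & \bm{\gamma}_{2} & \cdots & \bm{\gamma}_{n}\end{bmatrix}\right){\bm{x}} \tag{5}$$

where ${\bm{x}}$ is a column vector and $\bm{\gamma} \in \mathbb{R}^{n}$ .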

Quantization after migration. Deriving from the above equivalent transformation, we outline the quantization pattern after the migration process. From Fig. 3, the "Quant" process is employed at ${\bm{X}}^{\prime}$ , then the quantized output engages in the matrix multiplication on one branch, multiplies parameter $\bm{\gamma}$ and experiences the "DeQuant" process on another branch. In fact, this means delaying the $\bm{\gamma}$ calculation from LayerNorm to the shortcut branch. Hence, this new design will not increase the computation overhead.
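The two-branch flow can be sketched as follows. This is a simplified stand-in, not the full Transformer block: the "next layer" is a single linear map `W` with no bias, the residual is added directly to its output, and the quantizer is a per-tensor MinMax one. All values and names are assumptions of this sketch. On integer hardware the matrix multiplication would consume the integers directly; here it is simulated with dequantized values.

```python
import numpy as np

def normalize(x, eps=1e-5):
    """Zero-mean, unit-variance normalization over the embedding axis."""
    u = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - u) / np.sqrt(var + eps)

x = np.array([[1.0, 2.0, 3.0, 4.0],
              [4.0, 0.0, -2.0, 1.0]])
gamma = np.array([1.0, 1.0, 1.0, 10.0])
beta = np.array([0.1, 0.0, -0.1, 0.5])
W = np.array([[0.5, -0.2, 0.1, 0.3],
              [0.0, 0.4, -0.3, 0.2],
              [0.1, 0.1, 0.6, -0.1],
              [-0.4, 0.2, 0.0, 0.5]])   # next-layer weight, shape (out, in)

# Original FP path: LayerNorm output feeds the linear layer and the shortcut.
x_tilde = normalize(x) * gamma + beta
ref = x_tilde @ W.T + x_tilde

# Migrated FP path: Non-scaling LayerNorm, gamma absorbed into W and the shortcut.
x_prime = normalize(x) + beta / gamma
W_absorbed = W * gamma                  # scales column j of W by gamma_j
migrated = x_prime @ W_absorbed.T + x_prime * gamma

# Quantized path: "Quant" is applied to x_prime, gamma is applied afterwards.
bits = 8
lo, hi = x_prime.min(), x_prime.max()
s = (hi - lo) / (2 ** bits - 1)
z = np.round(-lo / s)
x_bar = np.clip(np.round(x_prime / s) + z, 0, 2 ** bits - 1)
x_hat = (x_bar - z) * s
quantized = x_hat @ W_absorbed.T + x_hat * gamma
```

In this sketch `migrated` matches `ref` up to floating-point error, and the quantizer only ever sees the narrower range of `x_prime`.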

Effect of migration. We then analyze the effect of Gamma Migration on weight and activation, respectively, to show that the activation quantization burden is greatly alleviated with a relatively slight influence on weight. To begin with, suppose that the absolute max range of the output of the original LayerNorm is $|\max({\bm{X}}^{\prime})| \cdot |\max(\bm{\gamma})|$ , because outliers emerge at the same embedding dimensions in $\bm{\gamma}$ , in the activation before scaling ( ${\bm{X}}^{\prime}$ ), and in the activation after scaling ( $\widetilde{{\bm{X}}}$ ). For activation, extracting $\bm{\gamma}$ reduces the activation range by a factor of $|\max(\bm{\gamma})|$ , and the results in Table 1 have already validated the benefit the transformation brings to activation. For weight, the weight matrix does not have the same embedding outlier phenomenon as the activation. Therefore, the weight range is not amplified $|\max(\bm{\gamma})|$ times after the migration. Experimentally, we also calculate the cosine similarity for the changed weight and observe that $\bm{\gamma}$ has little impact on weight (Table 2).

4.2 Token-Wise Clipping

Based on the analysis, we propose the Token-Wise Clipping method which considers the final loss when finding a clipping range and takes a coarse-to-fine paradigm to minimize it efficiently in a token-wise manner.

Given the very different accuracy impact of clipping the outliers, we search for the clipping range, equivalently the step size $s$ , that minimizes the distance between the final quantized output $\hat{f}(s)$ and the real one $f$ , defined as Eq. (6). To implement the process efficiently, especially when the unimportant outliers cover a wide area, a coarse-to-fine paradigm is designed below.

$$L(s) = \|\hat{f}(s) - f\|_{F}^{2} \tag{6}$$

Coarse-grained Stage. At this stage, our aim is to quickly skip over the area where clipping has little influence on accuracy. According to Sec. 3.2, the long-tail area corresponds to only a few tokens. Therefore, we use the max value of the embedding at token $t$ as its representative (and the min value as the representative for negative outliers). A new tensor with $T$ elements can be constructed by taking out the maximum signal for each token:

$${\bm{o}}^{u} = \{\max_{j}{\bm{X}}_{1,j},\ \max_{j}{\bm{X}}_{2,j},\ \ldots,\ \max_{j}{\bm{X}}_{T,j}\}, \qquad {\bm{o}}^{l} = \{\min_{j}{\bm{X}}_{1,j},\ \min_{j}{\bm{X}}_{2,j},\ \ldots,\ \min_{j}{\bm{X}}_{T,j}\} \tag{7}$$

where ${\bm{o}}^{u}$ is the collection of upper bounds and ${\bm{o}}^{l}$ the collection of lower bounds.

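Then, for a clipping ratio $\alpha$ on ${\bm{o}}^{u}$ , we calculate the corresponding clipping value $c^{u}$ and use it to cut the tensor:

$$c^{u} = \mathrm{quantile}({\bm{o}}^{u}, \alpha), \qquad {\bm{X}} = \mathrm{clip}({\bm{X}}, c^{l}, c^{u}) \tag{8}$$

where the quantile function computes the $\alpha\text{-}th$ quantile of ${\bm{o}}^{u}$ , and the lower clipping value $c^{l}$ is obtained analogously from ${\bm{o}}^{l}$ .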

Through a grid search over the token-wise clipping ratio, the step size $s=\frac{c^{u}-c^{l}}{2^{b}-1}$ ( $b$ is the bit-width) with minimal quantization loss Eq. (6) is obtained. We mark it as $s_{0}$ for later optimization.
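A minimal sketch of the coarse-grained search is given below. Several pieces are assumptions made for illustration: the downstream function `f` is a single linear layer, the lower clipping value uses the $(1-\alpha)$ quantile of the per-token minima, the ratio grid is arbitrary, and the synthetic data zeroes the weight column of the outlier dimension as a stand-in for "unimportant" outliers. The helper names `fake_quant` and `coarse_stage` are hypothetical.

```python
import numpy as np

def fake_quant(x, lo, hi, bits):
    """Quantize-dequantize on a uniform grid spanning roughly [lo, hi]."""
    s = (hi - lo) / (2 ** bits - 1)
    z = np.round(-lo / s)
    x_bar = np.clip(np.round(x / s) + z, 0, 2 ** bits - 1)
    return (x_bar - z) * s, s

def coarse_stage(X, f, bits, ratios):
    """Grid search over token-wise clipping ratios; returns (loss, alpha, s0)."""
    o_u = X.max(axis=1)          # per-token upper bounds, shape (T,)
    o_l = X.min(axis=1)          # per-token lower bounds, shape (T,)
    ref = f(X)                   # full-precision output
    best = None
    for alpha in ratios:
        c_u = np.quantile(o_u, alpha)
        c_l = np.quantile(o_l, 1.0 - alpha)   # assumption: mirrored ratio
        x_hat, s = fake_quant(X, c_l, c_u, bits)
        loss = float(np.sum((f(x_hat) - ref) ** 2))
        if best is None or loss < best[0]:
            best = (loss, alpha, s)
    return best

# Synthetic data: 64 tokens x 16 dims, two tokens carry a large outlier.
rng = np.random.default_rng(0)
X = rng.standard_normal((64, 16))
X[[3, 40], 5] = 60.0
W = rng.standard_normal((8, 16)) / 4.0
W[:, 5] = 0.0                    # downstream layer ignores the outlier dimension
f = lambda x: x @ W.T

ratios = [1.0, 0.99, 0.97, 0.95, 0.9, 0.8]
loss, alpha, s0 = coarse_stage(X, f, bits=6, ratios=ratios)
```

Each candidate ratio only requires a quantile over the $T$ token representatives rather than a pass over a histogram of all values, which is where the efficiency argument in the text comes from.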

Fine-grained Stage. At this stage, our aim is to make fine-grained adjustments in the critical area to further safeguard the final effect. In detail, starting from the initialization $s_{0}$ , a learning procedure based on gradient descent updates the parameter $s$ toward lower loss $L(s)$ with learning rate $\eta$ , as described in Eq. (9).

$$s = s - \eta \frac{\partial L(s)}{\partial s} \tag{9}$$
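The update on $s$ can be sketched with plain gradient descent. Because rounding has zero gradient almost everywhere, this sketch assumes a straight-through estimator in the style of LSQ for the derivative of the quantized value with respect to $s$. It also assumes a symmetric quantizer without zero point, a single linear layer as the downstream function, and it keeps the best step size seen so far. None of these choices is confirmed by the text above.

```python
import numpy as np

def fake_quant_sym(x, s, bits):
    """Symmetric uniform quantize-dequantize with step size s."""
    qmax = 2 ** (bits - 1) - 1
    qmin = -qmax - 1
    return np.clip(np.round(x / s), qmin, qmax) * s

def dxhat_ds(x, s, bits):
    """Straight-through estimate of d x_hat / d s (LSQ-style assumption)."""
    qmax = 2 ** (bits - 1) - 1
    qmin = -qmax - 1
    v = x / s
    g = np.round(v) - v                 # values inside the clipping range
    g = np.where(v <= qmin, qmin, g)    # values clipped at the lower bound
    g = np.where(v >= qmax, qmax, g)    # values clipped at the upper bound
    return g

def output_loss(X, W, s, bits):
    """L(s): squared distance between quantized and FP linear-layer outputs."""
    return float(np.sum((fake_quant_sym(X, s, bits) @ W.T - X @ W.T) ** 2))

def fine_stage(X, W, s0, bits, lr, steps):
    """Gradient descent on s starting from s0; returns the best (s, loss) seen."""
    s = float(s0)
    best_s, best_loss = s, output_loss(X, W, s, bits)
    for _ in range(steps):
        err = fake_quant_sym(X, s, bits) @ W.T - X @ W.T   # shape (T, out)
        dL_dxhat = 2.0 * err @ W                           # shape (T, in)
        grad = float(np.sum(dL_dxhat * dxhat_ds(X, s, bits)))
        s = max(s - lr * grad, 1e-8)                       # s <- s - eta * dL/ds
        cur = output_loss(X, W, s, bits)
        if cur < best_loss:
            best_s, best_loss = s, cur
    return best_s, best_loss

rng = np.random.default_rng(0)
X = rng.standard_normal((32, 8))
W = rng.standard_normal((4, 8)) / 4.0
s0 = 0.5                                # stands in for the coarse-stage result
best_s, best_loss = fine_stage(X, W, s0, bits=4, lr=1e-3, steps=100)
```

Tracking the best step size guarantees the returned loss is no worse than at the initialization, which keeps the sketch well behaved even if the learning rate is poorly chosen.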

Benefits. We mainly explain the benefits of the coarse-grained stage here from efficiency and quantization performance, where the experimental comparisons with other existing approaches are put in Sec. D.3. For efficiency, because the wide range of outliers only corresponds to a few tokens, passing through the unimportant area from the token perspective needs much fewer iterations than from the value perspective. Moreover, the representative collection reduces the size of the tensor ( ${\bm{o}}^{u}$ distilled from ${\bm{X}}$ ), so the method can run very fast each iteration. For quantization performance, the first coarse step has already produced a suitable clipping range (Sec. 5.2), which offers a good initialization point for upcoming tuning.

Experiments

In this section, we conduct two sets of experiments to verify the effectiveness of our outlier suppression framework. Sec. 5.2 shows the effect of each component. Sec. 5.3 lists the results compared with other existing approaches across text classification, question answering, and summarization tasks. On the whole, we evaluate on the GLUE benchmark, SQuAD, XSum, and CNN/DailyMail across BERT, RoBERTa, and BART models. Here, 4-4-4 denotes 4-bit weight, embedding, and activation. The model size under each bit setting is given in Table 17.

Implementation details. To begin with, we identify the quantization nodes and adopt a reasonable scheme like the one in FasterTransformer (see Sec. B.1 for details). For PTQ, equipped with our framework, we use 256 samples to calibrate the model. For QAT, our methods work in the calibration phase and are then combined with LSQ+, a strong baseline, for the training phase. For training, hyper-parameters such as the learning rate are searched for both our methods and the baseline techniques for fair comparison. See Appendix F for details.

Baseline. For PTQ, we compare with the prevalent calibration mechanisms including MinMax, OMSE, Percentile, EasyQuant, and PEG. For QAT, we present the results of Q-BERT, Q8BERT, and PEG. Also, because our framework applied in QAT is coupled with LSQ+, we show the results of pure LSQ+ and of another canonical quantization approach, PACT. Last but not least, the results combined with knowledge distillation (KD) proposed in TernaryBERT are included as well.

5.2 Ablation Study

In this subsection, we ablate the design elements in the proposed framework (Table 3). As a general plug-in module, Gamma Migration helps both MinMax and Token-Wise Clipping. Token-Wise Clipping also surpasses the baseline by a large margin: 17.53% on QNLI and 13.22% on MRPC (see Sec. D.3 for comparisons with other calibration algorithms). As for the phenomenon that the fine-grained stage sometimes does not improve much upon the coarse-grained one, we think it is because the coarse step already produces good enough results.

Besides, Fig. 5 conveys that with a good initialization point provided by our framework, the training of QAT becomes much faster and easier.

5.3 Main Results

PTQ. Table 4 shows the results of PTQ on GLUE tasks. For 8-bit BERT models, although previous methods generally behave well, our methods still achieve gains even on small datasets such as CoLA (4.49% improvement) and STS-B (1.33% improvement). To fully exploit the limit, we try a more challenging setting with weight and activation quantized to 6-bit. Ours is indeed close to the FP value, within 2.64% overall. Meanwhile, we also compare with PEG fairly by taking their quantization nodes. Note that their per-embedding-group (PEG) quantization brings extra computation overhead and might not be available in real deployment, while ours brings favorable results and can enjoy lossless acceleration on hardware. Besides, the experimental results on RoBERTa and BART consistently demonstrate our superiority, whereas existing methods suffer from a non-negligible accuracy drop. On average, ours achieves up to 8.64% and 11.79% better accuracy on RoBERTa and BART. To conclude, our proposed methods push the limit of 6-bit quantization to a new state of the art.

QAT. We also demonstrate the compatibility of our methods with QAT. Table 5 lists the results on BERT; others are in Sec. D.4. In a much harder setting (4-4-4 bit quantization), our outlier suppression framework achieves near-floating-point performance, with a reduction of 2.70% on average on 4-bit quantization. With a good initialization, ours incurs an acceptable accuracy drop (0.7% on QQP, 1.7% on MNLI) without any distillation or data augmentation trick, versus 4.19% and 3.16% for LSQ+. Furthermore, ours still enables performance improvements when working with knowledge distillation, especially at 2-bit weight and embedding.

5.3.2 Results on question answering tasks

To demonstrate the wider applicability of our methods, we evaluate them on SQuAD datasets. When going down to 6-bit quantization, the performance of other methods drastically drops. Ours still outperforms them by over 4.73% and 15.55% on BERT and RoBERTa on SQuAD v1.1. Also, the boost can be 12.31% and 4.96% on RoBERTa and BART on SQuAD v2.0.

5.3.3 Results on summarization tasks

It is of high value to validate the effect of our methods on summarization tasks. We choose classical datasets CNN/DailyMail and XSum and report the ROUGE 1/2/L score of BART. Table 7 illustrates that our approaches also benefit the encoder-decoder models, and can bring a near-floating-point performance on 8-bit and about 4% enhancement on 6-bit.

Conclusions and Discussions of Limitations

In this paper, we analyze the outlier phenomenon on transformer language models from the perspectives of inducement and clipping impact. Based on these, we establish an outlier suppression framework to suppress the outliers. Some open problems remain worthy of more in-depth investigation. For example, it is valuable to systematically explore whether the conclusions of this paper benefit other fields such as computer vision. Besides, since, as shown in the Appendix, the outliers occur not only in the fine-tuned (BERT) models but also in the pre-trained ones, it is also meaningful to dive into the pre-training process for a better understanding.

Acknowledgment

We sincerely thank the anonymous reviewers for their serious reviews and valuable suggestions to make this work better. This work was supported in part by the National Natural Science Foundation of China under Grant 62022009 and Grant 61872021, the Beijing Nova Program of Science and Technology under Grant Z191100001119050, and the Fundamental Research Funds for the Central Universities.

Checklist

Do the main claims made in the abstract and introduction accurately reflect the paper’s contributions and scope? [Yes]

Did you describe the limitations of your work? [Yes] In Discussions we leave some topics as future work.

Did you discuss any potential negative societal impacts of your work? [N/A]

Have you read the ethics review guidelines and ensured that your paper conforms to them? [Yes]

If you are including theoretical results…

Did you state the full set of assumptions of all theoretical results? [Yes]

Did you include complete proofs of all theoretical results? [Yes] Detailed proofs can be found in the supplementary materials.

Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [Yes] We provide code of experiment as part of our supplementary materials.

Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? [Yes] We defer detailed training settings in the supplementary materials.

Did you report error bars (e.g., with respect to the random seed after running experiments multiple times)? [No] Since we comprehensively evaluate the robust generalization for various models on different datasets, it would be computationally expensive to have the error bar.

Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? [Yes]

If you are using existing assets (e.g., code, data, models) or curating/releasing new assets…

If your work uses existing assets, did you cite the creators? [Yes]

Did you mention the license of the assets? [Yes]

Did you include any new assets either in the supplemental material or as a URL? [Yes]

Did you discuss whether and how consent was obtained from people whose data you’re using/curating? [N/A]

Did you discuss whether the data you are using/curating contains personally identifiable information or offensive content? [N/A]

If you used crowdsourcing or conducted research with human subjects…

Did you include the full text of instructions given to participants and screenshots, if applicable? [N/A]

Did you describe any potential participant risks, with links to Institutional Review Board (IRB) approvals, if applicable? [N/A]

Did you include the estimated hourly wage paid to participants and the total amount spent on participant compensation? [N/A]

Appendix

Due to the space limitation of the main paper, we will provide supplementary analysis and experimental details in the appendix, including proof of equivalent transformation in Gamma Migration, illustration of quantization challenge, more analysis of outliers, supplementary experiments to better support our observations and methods, related works and implementation details.

Appendix A Supplementary illustration of Gamma Migration

In this section, we first give the proof of the equivalent transformation Eq. (5). Then the detailed migration procedures for the LayerNorm after the Feed Forward network (FFN) and the Cross-Attention module are given. In particular, we denote the LayerNorm after FFN as FFN-LN and the one after Multi-Head Attention as MHA-LN.

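To prove Eq. (5), we look at each element in the output of the matrix multiplication. In detail, we mark the output as ${\bm{h}}$ .

$${\bm{h}}_{i} = \sum_{j} {\bm{W}}_{i,j} \cdot (\bm{\gamma}_{j} \cdot {\bm{x}}_{j}) = \sum_{j} ({\bm{W}}_{i,j} \cdot \bm{\gamma}_{j}) \cdot {\bm{x}}_{j}$$

Thus, for all the elements in ${\bm{h}}$ , we have:

$${\bm{h}} = {\bm{W}}\left({\bm{x}} \odot \begin{bmatrix}\bm{\gamma}_{1}\\ \bm{\gamma}_{2}\\ \vdots\\ \bm{\gamma}_{n}\end{bmatrix}\right) = \left({\bm{W}} \odot \begin{bmatrix}\bm{\gamma}_{1} & \bm{\gamma}_{2} & \cdots & \bm{\gamma}_{n}\\ \vdots & \vdots & & \vdots\\ \bm{\gamma}_{1} & \bm{\gamma}_{2} & \cdots & \bm{\gamma}_{n}\end{bmatrix}\right){\bm{x}}$$

Because the parameter $\bm{\gamma}$ is shared across samples and tokens, the above equation always holds, and the weight in the next layer can absorb $\bm{\gamma}$ naturally.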
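The element-wise argument can be checked numerically on a small example. The matrix, vector, and $\bm{\gamma}$ values below are arbitrary; NumPy broadcasting of `W * gamma` scales column $j$ of `W` by $\bm{\gamma}_{j}$, which corresponds to absorbing $\bm{\gamma}$ into the weight.

```python
import numpy as np

W = np.array([[1.0, -2.0, 0.5],
              [0.0, 3.0, -1.0]])      # shape (m, n)
x = np.array([2.0, -1.0, 4.0])        # input vector, shape (n,)
gamma = np.array([0.5, 10.0, 2.0])    # scaling parameter, shape (n,)

lhs = W @ (x * gamma)                 # scale the input, then multiply
rhs = (W * gamma) @ x                 # absorb gamma into the weight columns

# Explicit element-wise sums, following the proof.
h = np.array([sum(W[i, j] * gamma[j] * x[j] for j in range(3))
              for i in range(2)])
```

All three computations give the same vector, here $[25, -38]$.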

A.2 Gamma Migration on other structures

Appendix B Quantization nodes

For the position to insert quantization nodes, we find that different papers often have different choices, particularly at activation. This would bring difficulties for fair comparisons across methods and practical development on hardware.

After surveying multiple industry and academic solutions, we take the one in FasterTransformer: token (position, token type) embeddings are quantized to reduce the memory storage. Weights and activations engaged in matrix multiplication are also quantized. Note that we give only one quantizer to the same activation because it is friendly to hardware. Thus we quantize the shortcut branch and use the same quantization parameter for the inputs of the Query, Key, and Value modules, whereas some papers do not and might suffer some problems on hardware.

A clear illustration about the position of activation quantization is depicted in Fig. 8. Here for ease of understanding, we mark each "Quant" node with a serial number and match them with the related module names in Table 8.

B.2 Problematic quantization nodes

In this subsection, we give some simple and direct studies to elaborate on the most problematic tensors (outputs of LayerNorm structures and GELU). Verifications are done on fine-tuned BERT, RoBERTa, and encoder-decoder model BART.

On the one hand, we compare the cosine similarity between the FP value and the quantized one for each output. Activation nodes with cosine similarity lower than 0.99 are viewed as problematic positions (results in Table 9 and Table 10). On the other hand, we observe the final accuracy recovery obtained by disabling the quantization of each kind of activation. Both experiments indicate the obstacles when quantizing the outputs of LayerNorm and GELU.
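The cosine similarity screening can be sketched as below. The node names and tensors are made up for illustration, and the helper `problematic_nodes` is a hypothetical name; only the 0.99 threshold comes from the text.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two tensors, flattened to vectors."""
    a = np.asarray(a, dtype=float).ravel()
    b = np.asarray(b, dtype=float).ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def problematic_nodes(fp_outputs, quant_outputs, threshold=0.99):
    """Names of nodes whose quantized output drifts below the threshold."""
    return [name for name, fp in fp_outputs.items()
            if cosine_similarity(fp, quant_outputs[name]) < threshold]

# Hypothetical FP and quantized outputs of two activation nodes.
fp_outputs = {"node_a": [1.0, 2.0, 3.0], "node_b": [1.0, 0.0, 0.0]}
quant_outputs = {"node_a": [1.0, 2.0, 3.1], "node_b": [0.5, 0.5, 0.0]}

flagged = problematic_nodes(fp_outputs, quant_outputs)
```

In this toy case only `node_b` falls below the threshold and is flagged.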

Appendix C Analysis of outliers

By going deeper into the above problematic activations, we find that large outliers in them cause the large quantization error, and these outliers present some structured features from the embedding and token perspectives. Activations of almost all tokens attend to outliers in specific embedding dimensions, such as embedding dimensions 308 and 381 in Fig. 9. On these dimensions, some tokens, like the [SEP] token in Fig. 10, attend to even more aggressive outliers than other tokens. In fact, we find this often happens on the tokens [SEP] and [CLS], punctuation such as commas and periods, and other high-frequency tokens like "the", "and", "of".

C.2 Detailed discussion about the inducement

Here, we discuss the inducement of the outlier phenomenon from embedding and token perspectives.

For the embedding phenomenon, Sec. 3.1 has explained that the scaling parameter amplifies the outliers at certain embedding dimensions. In fact, we find that this not only emerges in fine-tuned models but is also obvious in the pre-trained ones. Even when injecting constraints such as weight decay or kurtosis regularization on LayerNorm’s parameter while fine-tuning the FP model, it is still hard to suppress the aggressive values in the scaling parameter without affecting FP performance. Hence, we conjecture that this phenomenon is beneficial to the FP performance, though it indeed brings challenges to quantization.

Moreover, we think the huge deviation in the token range is caused by the token frequency in the pre-training phase, because the tokens that hold more aggressive signals occur frequently during pre-training: [SEP] and [CLS] occur in each example, and ’.’ is often used in an expression. We also notice that these tokens’ word (token) embeddings have larger values than others. A possible explanation is that the frequency information biases the word embedding space and brings different features. The sharper outliers spread to subsequent layers and seem to be less important, as indicated in Sec. 3.2. Therefore, we conjecture that a good word embedding that is not biased by frequency information can behave better in quantization. Alternatively, we can find those less important outliers efficiently and clip them, which better suits post-training quantization without large-scale re-training.

For the inducement of outliers, note that prior work also mentioned the connection between the scaling parameter and the outliers in the last LayerNorm of each BERT layer. We instead emphasize the amplification effect of the scaling parameter, especially for the LayerNorm after Multi-Head Attention. This naturally leads to the finding that removing the scaling parameter yields a quantization-friendly distribution. As for the unbalanced token frequency, a concurrent work explores it carefully from the FP performance perspective.

Appendix D Supplementary experiments

We show more evidence of the same outlier phenomenon in LayerNorm and illustrate that the output of the Non-scaling LayerNorm is more quantization-friendly than the normal one. Firstly, Fig. 11 and Fig. 12 are presented to build an intuitive understanding, where $X^{\prime}$ has weaker outliers. Furthermore, more quantitative results on cosine similarity are put in Table 12 to indicate the improvement on the most problematic tensors (Sec. B.2) brought by extracting the scaling parameter $\bm{\gamma}$ .

D.2 Supplementary evidence of clipping impact

We provide more evidence of accuracy and token impact by clipping the outputs to different levels in Table 13.

First, different outliers differ greatly in importance: some very large values can be clipped sharply without introducing large accuracy degradation, whereas the performance decreases quickly when certain others are clipped. For example, for the outputs of MHA-LN, clipping them from -60 to -45 seems reliable in the FP model and is of course friendly to the quantized one. However, clipping from -40 to -35 induces about a 5% performance loss.

Another key point is that, given the large divergence in the token range, those large outliers belong to only a few tokens. For example, for values in (-60, -45), the clipped tokens are still 3% for most of the layers. Thus, finding the clipping range from the token perspective helps to skip over the less important area quickly.
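The token-perspective measurement can be sketched as follows. This is a minimal illustration on synthetic data, not the measurement code behind Table 13: the activation matrix, the three outlier tokens, and the function names are all hypothetical.

```python
def make_toy_activations(num_tokens=100, dim=8):
    # Synthetic token-by-embedding matrix with values in [-2.5, 2.5].
    acts = [[((t * 7 + j * 3) % 11 - 5) / 2.0 for j in range(dim)]
            for t in range(num_tokens)]
    # Hypothetical structure: only three tokens carry a large negative outlier.
    for t in range(3):
        acts[t][3] = -55.0
    return acts

def clip(acts, lower, upper):
    # Clamp every activation to the range [lower, upper].
    return [[min(max(v, lower), upper) for v in row] for row in acts]

def clipped_token_fraction(acts, lower, upper):
    # Fraction of tokens that have at least one value outside [lower, upper].
    hit = sum(1 for row in acts if any(v < lower or v > upper for v in row))
    return hit / len(acts)

acts = make_toy_activations()
clipped = clip(acts, -45.0, 45.0)

print("smallest value before:", min(min(row) for row in acts))
print("smallest value after :", min(min(row) for row in clipped))
print("fraction of tokens touched:", clipped_token_fraction(acts, -45.0, 45.0))
```

Counting affected tokens rather than affected values makes it visible that a wide value interval can correspond to a very small set of tokens.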

D.3 Comparisons among Token-Wise Clipping and existing methods

We compare the coarse stage of Token-Wise Clipping with OMSE, percentile, and direct step size learning and argue that ours is more effective (Table 14) and more efficient (Table 15).

Our Token-Wise Clipping searches for a superior clipping ratio with respect to the final performance and works in a remarkably efficient way (the reasons are explained in Sec. 4.2), taking about 2 minutes to evaluate 30 ratios on GLUE tasks.
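A rough sketch of such a ratio search is given below. It assumes that the coarse stage takes the maximum and minimum of each token, uses a quantile of these per-token statistics as the clipping range, and scores each candidate ratio by a loss on the final output. The toy data, the stand-in loss `toy_final_loss` (which deliberately gives zero weight to the outlier dimension, so that the outliers are unimportant by construction), and all function names are hypothetical.

```python
def quantile(sorted_vals, q):
    # Nearest-rank quantile on a pre-sorted list.
    idx = round(q * (len(sorted_vals) - 1))
    return sorted_vals[min(len(sorted_vals) - 1, max(0, idx))]

def fake_quant(v, lower, upper, bits):
    # Clip to [lower, upper], then quantize uniformly and dequantize.
    step = (upper - lower) / (2 ** bits - 1)
    v = min(max(v, lower), upper)
    return round((v - lower) / step) * step + lower

def quantize_all(acts, lower, upper, bits):
    return [[fake_quant(v, lower, upper, bits) for v in row] for row in acts]

def coarse_search(acts, ratios, final_loss, bits=6):
    # Per-token statistics: one max and one min per token.
    tok_max = sorted(max(row) for row in acts)
    tok_min = sorted(min(row) for row in acts)
    best = None
    for ratio in ratios:
        upper = quantile(tok_max, ratio)
        lower = quantile(tok_min, 1.0 - ratio)
        loss = final_loss(quantize_all(acts, lower, upper, bits))
        if best is None or loss < best[0]:
            best = (loss, ratio, lower, upper)
    return best

# Synthetic activations: three tokens hold a -55 outlier in dimension 3.
acts = [[((t * 7 + j * 3) % 11 - 5) / 2.0 for j in range(8)] for t in range(100)]
for t in range(3):
    acts[t][3] = -55.0

# Stand-in for the final loss: dimension 3 does not affect the toy output.
weights = [1.0, 1.0, 1.0, 0.0, 1.0, 1.0, 1.0, 1.0]

def toy_final_loss(quantized):
    total = 0.0
    for q_row, fp_row in zip(quantized, acts):
        diff = sum(w * (q - f) for w, q, f in zip(weights, q_row, fp_row))
        total += diff * diff
    return total

ratios = [1.0, 0.99, 0.98, 0.97, 0.96, 0.95]
best_loss, best_ratio, lower, upper = coarse_search(acts, ratios, toy_final_loss)
print(best_ratio, lower, upper, best_loss)
```

Because each candidate ratio removes whole tokens from the range, a handful of candidates is enough to step past the outlier tokens in this toy example.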

In contrast, OMSE only minimizes the local quantization error and performs poorly. For instance, it calculates 40 as the best clipping range for the distribution presented in Fig. 2, while 10 is much better. Also, OMSE runs very slowly even with the fast golden section search.
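The tendency of a purely local criterion to keep large outliers can be seen in a small sketch. It is written in the spirit of a local-MSE objective and is not the OMSE algorithm itself: a plain search over a few candidate thresholds replaces the golden section search, and the data, candidates, and function names are hypothetical.

```python
def fake_quant_symmetric(v, clip_value, bits):
    # Clip to [-clip_value, clip_value], then quantize uniformly and dequantize.
    step = 2.0 * clip_value / (2 ** bits - 1)
    v = min(max(v, -clip_value), clip_value)
    return round((v + clip_value) / step) * step - clip_value

def local_mse(values, clip_value, bits=6):
    # Mean squared error between the tensor and its quantized version.
    errors = [(fake_quant_symmetric(v, clip_value, bits) - v) ** 2 for v in values]
    return sum(errors) / len(errors)

def best_local_clip(values, candidates, bits=6):
    # Pick the candidate threshold with the smallest local MSE.
    return min(candidates, key=lambda c: local_mse(values, c, bits))

# Synthetic tensor: 797 values in [-2.5, 2.5] plus three outliers at -55.
values = [((t * 7 + j * 3) % 11 - 5) / 2.0 for t in range(100) for j in range(8)]
for i in (3, 11, 19):
    values[i] = -55.0

candidates = [2.5, 5.0, 10.0, 20.0, 40.0, 55.0]
for c in candidates:
    print(c, local_mse(values, c))
print("chosen by local MSE:", best_local_clip(values, candidates))
```

Here the squared error of clipping three large values outweighs the resolution gained for all remaining values, so the local objective keeps the full range regardless of whether those outliers matter for the final output.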

For the direct step size learning and Percentile methods, although they consider the final loss when choosing the clipping range, they still struggle when the unimportant outliers cover a large area. Direct step size learning without a good initialization point needs a proper learning rate and much tuning time to reach the key part. As an extreme example, in QAT the step size has been tuned sufficiently, yet we still notice that the quantized model can be further clipped. Besides, since Percentile builds a histogram of the activation and searches for the best clipping ratio from the value perspective, it is time-consuming to skip over the relatively unimportant outliers.

D.4 Supplementary results of QAT

We apply our methods to RoBERTa and BART in the quantization-aware training setting. From Table 16, on RoBERTa, ours still surpasses LSQ+ by 2.54% on QNLI and 7.53% on STS-B. On BART models, we achieve an absolute improvement of 1.73–32.11 points against the best baseline. The outlier suppression framework can also be extended to other applications, such as integer-only quantization, which uses polynomial approximation of non-linear operations for Transformer-based models.

Appendix E Supplementary related works

Quantization algorithms are usually grouped into two categories: (1) Quantization-Aware Training (QAT) and (2) Post-Training Quantization (PTQ). The former fine-tunes the FP model to low-bit and achieves good results because the model is aware of quantization during training. Apart from learning weight for better performance, propose to learn the quantization parameters. The latter, PTQ, usually conducts fast calibration on the FP model with much less computation and fewer data. transforms quantization to a Minimum Mean Squared Error problem. alternately optimizes the step size of weight and activation towards the matrix multiplication output.

Recently, quantization has become popular in Transformer-based models. For quantization-aware training, explores 8-bit quantization on BERT-like models. adopts group-wise quantization and applies mixed-precision quantization based on the Hessian information. investigates various distillation losses on BERT and combines the distillation with quantization. approximates the nonlinear function in Transformer architectures to enjoy integer-only inference. quantizes a different random subset of weights each forward pass during training to decrease quantization noise. Moreover, explores underlying difficulties of quantizing generative models. Due to the sequential computation nature of this type of model, they find that word embeddings tend to become homogeneous and devise a token-level contrastive distillation method to combat this obstacle. For post-training quantization, notices the structured outliers in Transformer-based models, which occur at a few embedding dimensions and at the special separator token. They point out that the high dynamic ranges hurt even the 8-bit quantization performance and suggest per-embedding-group quantization for this unique challenge. Whereas they work around the problem and their method brings an extra computation burden, we explore the inducement and clipping impact of these structured outliers and solve them without computation overhead.

Appendix F Supplementary implementation details

For quantizer details, we insert quantization nodes as described in Sec. B.1. We adopt symmetric per-channel quantization for weights and asymmetric per-layer quantization for activations.
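The two quantizer types named above can be sketched with generic uniform quantizers. This is an illustrative sketch under common conventions (one scale per output channel for weights, one scale and zero point per tensor for activations, min-max ranges); it is not the implementation used in the experiments, and all names and values are hypothetical.

```python
def quantize_weight_per_channel(weight, bits=8):
    # Symmetric: one scale per output channel (row), zero point fixed at 0.
    qmax = 2 ** (bits - 1) - 1
    q_rows, scales = [], []
    for row in weight:
        scale = max(max(abs(v) for v in row), 1e-12) / qmax
        q_rows.append([min(max(round(v / scale), -qmax), qmax) for v in row])
        scales.append(scale)
    return q_rows, scales

def quantize_activation_per_layer(act, bits=8):
    # Asymmetric: a single scale and zero point shared by the whole tensor.
    qmax = 2 ** bits - 1
    lo, hi = min(act), max(act)
    scale = max(hi - lo, 1e-12) / qmax
    zero_point = round(-lo / scale)
    q = [min(max(round(v / scale) + zero_point, 0), qmax) for v in act]
    return q, scale, zero_point

# Toy weight matrix with two output channels of very different magnitude.
weight = [[0.1, -0.2, 0.05], [4.0, -3.0, 1.0]]
q_w, w_scales = quantize_weight_per_channel(weight)
w_hat = [[q * s for q in row] for row, s in zip(q_w, w_scales)]

# Toy activation tensor with an asymmetric range.
act = [-1.0, 0.0, 0.5, 3.0]
q_a, a_scale, a_zp = quantize_activation_per_layer(act)
act_hat = [(q - a_zp) * a_scale for q in q_a]

print(q_w, w_scales)
print(q_a, a_scale, a_zp)
```

Per-channel scales let the small-magnitude weight row keep fine resolution, while the asymmetric activation quantizer spends its whole integer range on the observed interval.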

For PTQ experiments, we sample 256 examples as the calibration dataset, with the batch size set to 32 on the GLUE benchmark and SQuAD, and to 4 on CNN/DailyMail and XSum. For learning in the fine-grained stage of the Token-Wise Clipping, we always tune for 3 epochs with learning rate 1e-5 across datasets because the first step already produces good outcomes.

For QAT experiments on the GLUE benchmark, we equip our methods with LSQ+ . The coarse-grained stage of Token-Wise Clipping is used to initialize the quantization parameters, and the fine-grained stage is removed because LSQ+ is already equipped with step size learning. As for hyper-parameters, the learning rate is searched in {1e-5, 2e-5, 3e-5, 4e-5, 5e-5}. The batch size is usually set to 32, and smaller ones (8 and 16) are also tried on small datasets including CoLA, MRPC, RTE, and STS-B. As for epochs, we follow on BERT (3 epochs for MNLI and QQP, 6 epochs for others), on RoBERTa (6 epochs for MNLI and QQP, 12 epochs for others), and take 6 or 12 epochs on BART as well. Other hyper-parameters are inspected and kept fixed across datasets, including self-attention dropout rate 0.1, hidden states dropout rate 0.0, weight decay 0.0, and warmup ratio 10%. For baseline mechanisms like LSQ+ and PACT, we conduct the above learning rate and batch size search as well for fair comparisons.