FP8 Quantization: The Power of the Exponent

Andrey Kuzmin, Mart Van Baalen, Yuwei Ren, Markus Nagel, Jorn Peters, Tijmen Blankevoort

Introduction

Neural network quantization is one of the most effective ways to improve the efficiency of neural networks. Quantization allows weights and activations to be represented in low bit-width formats, e.g. 8 bit integers (INT8). When executing networks on any device, this leads to a reduction in data movement and enables the use of low bit-width computations, resulting in significantly faster inference and lower energy consumption.

Roughly speaking, values in neural networks are represented in either integer (INT) or floating-point (FP) formats. Most quantization research has gone into low-bit integer formats such as INT8 and INT4, as the corresponding hardware is widely available. However, some research suggests there might be benefits to using low bit-width floating-point formats for neural network computations.

In this paper, we set out to thoroughly investigate the potential benefits of the floating point format for neural network inference. We show that floating point numbers add an extra dimension on top of the INT8 format that makes outliers in the quantized distribution less harmful to the overall performance. We investigate the effect of the quantization formats on neural network quantization on three levels: 1) Analytically for several common data and weight distributions, 2) practically in INT8 and FP8 post-training quantization (PTQ) settings, and 3) in quantization-aware training (QAT) settings with both INT8 and different FP8 formats. We will show there is a strong agreement between our theoretical results and our practical results on real networks.

In order to compare the two formats, we introduce a novel floating-point quantization simulation implementation that enables us to quickly run PTQ and QAT experiments with many floating point formats. Furthermore, this quantizer enables us to learn the FP8 exponent bias value, as well as the best trade-off between the number of bits used for the exponent and the mantissa, making good use of the flexibility the floating-point format provides and providing a way to learn this automatically without manual intervention. Our conclusion from the study presented in this paper is that floating-point quantization can be better than integer quantization for inference, but it needs a carefully tuned bias term and correct bit-settings between the exponent and mantissa.

Background

2.2 Floating point number system

Floating point numbers can be seen as a uniform $m$-bit grid between two consecutive (integer) powers of two $2^{a}, 2^{a+1}$. The distance between grid points in the range $[2^{a}, 2^{a+1}]$ is $2^{a-m}$. Increasing the number of mantissa bits thus increases the number of grid points in each range $[2^{a}, 2^{a+1}]$. In the definition provided earlier in this section, the number of ranges $[2^{a}, 2^{a+1}]$ that can be represented by a floating point number system is determined by the number of exponent bits $e$. Increasing the number of exponent bits thus increases the dynamic range (i.e. the ratio between the largest and smallest non-zero value) of values that can be represented. Any fixed bit-width floating point number system must make a trade-off between the dynamic range of representable values ($e$) and the precision ($m$). For example, IEEE-754 32-bit ‘full-precision’ floating point numbers (FP32) use 1 sign bit, 23 mantissa bits, and 8 exponent bits. The resulting effect is that, compared to integer formats, floating point formats have more precision close to zero, as the ranges $[2^{a}, 2^{a+1}]$ will be smaller for lower values of $a$, and less precision away from 0. This is visualized in the left plot in Fig. 1. Intuitively, they are a better match for peaked distributions like Gaussians that have more density around 0, and a better fit for distributions with large tails and outliers like the Student’s t-distribution. The right plot in Fig. 1 shows the different distributions of values for various FP8 formats.

Note that this definition does not allow for a representation of 0. To allow 0 to be represented, the exponent value $p=0$ is reserved to indicate subnormal numbers. In this case, the exponent value is implicitly set to $1$, and $f=(-1)^{s}2^{1-b}\left(0+\frac{d_{1}}{2}+\frac{d_{2}}{2^{2}}+\cdots+\frac{d_{m}}{2^{m}}\right)$. Besides allowing the exact representation of 0, subnormal numbers also allow a graceful representation of values close to 0. See Fig. 1 for an intuition behind this. In Section 4 we show (de)quantization operations analogous to those of INT8.
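To make the grid structure concrete, the following minimal sketch enumerates the non-negative values of a small floating point format. It assumes that no bit patterns are reserved for NaN or infinity (only $p=0$ is treated specially, as described above); the function name `fp_grid` is introduced here for illustration only.

```python
def fp_grid(m, e, b):
    """Sorted non-negative values representable with m mantissa bits,
    e exponent bits and exponent bias b.

    Assumption: no encodings are reserved for NaN/Inf; the stored
    exponent p = 0 marks subnormal numbers, as in the text.
    """
    values = set()
    for p in range(2 ** e):          # stored exponent field
        for d in range(2 ** m):      # mantissa field read as an integer
            frac = d / 2 ** m        # d_1/2 + d_2/4 + ... + d_m/2^m
            if p == 0:
                # subnormal: exponent implicitly 1, no leading one
                values.add(2.0 ** (1 - b) * frac)
            else:
                # normal: implicit leading one
                values.add(2.0 ** (p - b) * (1.0 + frac))
    return sorted(values)


grid = fp_grid(m=3, e=4, b=8)
# Grid points in [1, 2) are spaced 2^(0 - m) = 0.125 apart.
octave = [v for v in grid if 1.0 <= v < 2.0]
```

For 3 mantissa bits and 4 exponent bits with bias 8, the grid contains 0 (via the subnormals) and its largest element is 240, which matches the maximum value quoted for this format in the toy experiment of Section 4.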

2.3 Assumptions and extensions

In this work we make a number of assumptions and consider several extensions of the standard floating point format. We assume that the exponent bias $b$ is not restricted to the value $2^{e-1}$. Instead, we introduce a (per-tensor or per-channel) quantization scale $\gamma$, similar to the quantization scale used in integer quantization. Since there is no standard allocation of mantissa and exponent bits for 8 bit floating point formats, we consider various allocations of mantissa and exponent bits, as we assume that the choice of trade-off between precision and dynamic range will have more impact for lower floating point bit-widths. We use the notation xMyE to denote a format with x mantissa bits and y exponent bits. Specifically, we consider the formats 5M2E, 4M3E, 3M4E and 2M5E.

Expected quantization error

In this section we perform an analytical analysis of the FP8 and INT8 formats and show that formats with more exponent bits perform better when outliers are a factor. This theoretical result lays a foundation for the comparison between the formats that is predictive of when to use the FP8 format and how many exponent bits to use.

We investigate the error induced by quantizing values drawn from several insightful distributions, considering the expected MSE of these values as it has been shown to be indicative of the final loss in a neural network.

Given a quantization grid $\alpha=\{\alpha_{1},\alpha_{2},\dots,\alpha_{k}\}$, we can define the quantization operation $Q_{\alpha}(w)$ and the corresponding quantization error $R_{\alpha}(w)$:
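As a sketch, assuming nearest-grid-point rounding and the sign convention $R_{\alpha}(w)=Q_{\alpha}(w)-w$ (the sign is an assumption; only the squared error is used in what follows), these two quantities can be written as:

$$Q_{\alpha}(w)=\alpha_{i^{*}},\qquad i^{*}=\arg\min_{i}\,|w-\alpha_{i}|,\qquad R_{\alpha}(w)=Q_{\alpha}(w)-w.$$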

Examples of the rounding error function for the 3M4E, 4M3E, and INT8 formats are shown in fig. 2 (left). We model neural network weights as a random variable $W\sim p_{w}(w)$. The expected value of the quantization MSE can be expressed as follows:
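As a sketch consistent with the description that follows (one rounding term inside the grid range and two clipping terms outside it), and assuming $R_{\alpha}$ denotes the nearest-grid-point rounding error, the expected MSE takes the form:

$$\mathbb{E}\left[R_{\alpha}(W)^{2}\right]=\int_{\alpha_{min}}^{\alpha_{max}}R_{\alpha}^{2}(w)\,p_{w}(w)\,dw+\int_{-\infty}^{\alpha_{min}}(w-\alpha_{min})^{2}\,p_{w}(w)\,dw+\int_{\alpha_{max}}^{\infty}(w-\alpha_{max})^{2}\,p_{w}(w)\,dw,$$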

where $\alpha_{min}=\min_{i}\alpha_{i}$ and $\alpha_{max}=\max_{i}\alpha_{i}$ are the limits of the quantization range. The first term corresponds to the rounding error, while the remaining two terms correspond to the clipping error. We will use this formulation to analytically compute the quantization errors for different distributions. This is done by splitting the integration into sub-intervals corresponding to each point of the quantization grid. We present further details of this procedure in Appendix A.1. An example of the rounding error function weighted by the probability density is given in fig. 2 (middle).

3.2 Scalar product quantization error

We denote the output tensor $Y=WX$, and its quantized version as $Y_{Q}$. The quantization error $\Delta Y$ can then be expressed as
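As a sketch, assuming each quantized operand equals the original value plus its rounding error, i.e. $W_{Q}=W+R_{\alpha}(W)$ and $X_{Q}=X+R_{\beta}(X)$ (the symbol $\beta$ for the activation grid is introduced here for illustration only), expanding the product gives:

$$\Delta Y=Y_{Q}-Y=W_{Q}X_{Q}-WX=X\,R_{\alpha}(W)+W\,R_{\beta}(X)+R_{\alpha}(W)\,R_{\beta}(X).$$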

We use these formulations to analytically compute the quantization errors for several distributions. We consider the uniform distribution, the Gaussian distribution and the Student’s t-distribution. The latter is essentially a more heavy-tailed Gaussian, allowing us to model what happens when outliers are added into the picture.

It is important to choose the bias term in the FP8 format properly. We see in fig. 2 (right) that the standard fixed format of setting the bias to $2^{e-1}$ can fail. This happens when values are either clipped too much, or not enough grid points are used because the representable range is too big. We also see that having an integer bias performs worse than having a floating-point bias. This is similar to what happens in INT8 quantization, where setting the scale parameter correctly can have a large impact on the network’s performance. As the bias shifts the dynamic range, this effect is less strong for the FP8 formats with a high number of exponent bits, which have a high dynamic range. For fewer exponent bits, one would preferably use a per-channel bias/scale parameter, similar to what is done in per-channel quantization. Further justification for this can be found in section 5.2.

We also analyze the effect of different settings of the number of mantissa and exponent bits. The results of this are presented in fig. 3. We observe a clear pattern. For a uniform distribution, the INT8 format performs the best. For the Gaussian distribution, the format with 2 exponent bits performs best. Then, when increasing the relative outliers by decreasing the degrees of freedom in the Student’s t-distribution, the best format tends towards having more exponent bits.

In neural networks, most weights and activations can be modeled relatively well by a Gaussian distribution. So based on this analysis, we would expect the 5M2E format to work the best for most well-behaved layers with few outliers. Specifically in the weights, outliers are infrequent, as the weights are often explicitly, and otherwise implicitly, regularized. For example, we consider a layer of a ResNet18 model pre-trained on ImageNet. We fit Gaussian distributions to the weight and activation samples, and compute the expected MSE analytically in fig. 3 (right). We see that the 5M2E format works the best in this case. A more detailed per-layer study is given in appendix B.1.

However, the stronger the outliers are in the distribution, the more an FP8 format with a higher number of exponent bits will help with its quantized representation. We would thus predict that for networks with severe outliers, formats like 4M3E would perform better. We will show in section 5 that these predictions hold for real networks, and that networks that are known to have larger activation outliers, such as transformers, benefit the most from more exponent bits.

FP8 quantization simulation

In the previous section our analysis showed that FP8 quantization can theoretically yield better tensor reconstruction MSE than INT8 quantization, but careful selection of the division between exponent and mantissa bits, as well as the value of the exponent bias, is crucial. We investigate whether these findings can be extended to FP8 quantization of neural networks. In order to do so, we need an efficient simulation of FP8 quantization. Methods commonly used for FP8 quantization are either too slow, e.g. nearest grid point search or bit-masking, or too complicated, e.g. custom CUDA kernels, for quick experimentation.

In this section we introduce a novel method of FP8 quantization simulation. This method can easily be implemented in common deep learning frameworks. An added benefit of this method is that it exposes the parameters of the FP8 quantizer (i.e., the number of mantissa/exponent bits and the exponent bias), thus allowing the parameters to be learned by back-propagation.

In our FP8 quantizer we exploit the fact that FP8 quantization can be seen as the union of $m$-bit uniform quantization grids between consecutive integer powers of two $[2^{p-b},2^{p+1-b}]$. This means we can simulate FP8 quantization of an input vector $\bm{x}$ (this method extends readily to matrices or higher-dimensional tensors) using the same method as for simulation of uniform quantization as described in Section 2.1, with the distinction that each element $x_{i}$ in $\bm{x}$ has its own associated scale $s_{i}$:

where $x_{i}^{(q)}$ denotes $x_{i}$ quantized to floating point. The scale $s_{i}$ depends on the number of mantissa bits $m$ and the range $[2^{p-b},2^{p+1-b})$ in which $x_{i}$ falls. This is given by $\log_{2}s_{i}=p_{i}=\left\lfloor\log_{2}|x_{i}|\right\rfloor-m$.

To ensure that $x_{i}^{(q)}$ can be represented given $m$, $e$ and $b$, both $x_{i}^{(q)}$ and $s_{i}$ need to be clipped. Values of $x_{i}^{(q)}$ greater than the maximum value $c$ or smaller than $-c$ are clipped to $c$ and $-c$ respectively, where $c=(2-2^{-m})2^{2^{e}-b-1}$ is the largest representable value for a given floating point format. Since $2^{1-b-m}$ is the smallest representable value, values of $p_{i}$ smaller than $1-b-m$ are clipped to $1-b-m$.

Note that this approach is identical to the quantization operation defined in Eq. 3, provided that the rounding mode in Eq. 8 matches the tie-breaking procedure in Eq. 3 for numbers equidistant to two numbers in FF. See Fig. 4 for an intuition.

In case the scaling factor $\gamma\neq 1$, this needs to be reflected in $p_{i}$. In order to accommodate the scaling factor, we first fold it into a reparameterized bias value $\widehat{b}=b-\log_{2}\gamma$. We then compute $p_{i}$ as follows:
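The steps of this section can be sketched for a single scalar as follows. This is a minimal illustration rather than the authors' implementation: the function name `fp8_quantize` is hypothetical, the form $p_{i}=\lfloor\log_{2}|x_{i}|+\widehat{b}\rfloor-\widehat{b}-m$ is assumed from the floor expression used in the QAT discussion, and Python's built-in `round` (ties to even) is assumed as the rounding mode.

```python
import math


def fp8_quantize(x, m, e, b_hat):
    """Simulate floating point quantization of a scalar x.

    m, e  : number of mantissa / exponent bits
    b_hat : reparameterized bias b - log2(gamma); may be non-integer
    """
    if x == 0.0:
        return 0.0
    # Largest representable magnitude for this format.
    c = (2 - 2.0 ** -m) * 2.0 ** (2 ** e - b_hat - 1)
    # log2 of the per-element scale s_i (assumed form, see lead-in).
    p = math.floor(math.log2(abs(x)) + b_hat) - b_hat - m
    # Below the smallest normal range the scale stays fixed (subnormals).
    p = max(p, 1 - b_hat - m)
    s = 2.0 ** p
    # Uniform quantization with scale s; round() breaks ties to even.
    x_q = s * round(x / s)
    # Clip to the representable range [-c, c].
    return max(-c, min(c, x_q))


# 3M4E with integer bias 8 (gamma = 1): grid spacing in [1, 2) is 0.125.
q_down = fp8_quantize(1.06, m=3, e=4, b_hat=8)
q_up = fp8_quantize(1.07, m=3, e=4, b_hat=8)
# A scale gamma = 2 lowers the bias by log2(2) = 1 and doubles the range.
q_scaled = fp8_quantize(1000.0, m=3, e=4, b_hat=8 - math.log2(2.0))
```

Because the scale is computed per element from `log2(abs(x))`, the same few operations vectorize directly over a tensor in a deep learning framework, which is what makes this formulation convenient for quick experimentation.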

4.2 Quantization-aware training with FP8

To enable QAT using this quantizer, we need to make a few changes. First, to allow gradients to flow through each step of the quantizer, we use the straight-through estimator (STE) for gradients of non-differentiable rounding operations. Second, we find that learning the maximum clipping value $c$ instead of $\widehat{b}$ improves training stability. $\widehat{b}$ can be found from $c$ by inverting $c=(2-2^{-m})2^{2^{e}-\widehat{b}-1}$, which gives $\widehat{b}=2^{e}-\log_{2}c+\log_{2}(2-2^{-m})-1$. Lastly, we treat $\left\lfloor\log_{2}|x_{i}|+\widehat{b}\right\rfloor$ as a constant that receives no gradient. This prevents the (sometimes extremely large) gradients of this operation w.r.t. $x_{i}$ from propagating backwards. The result of this is that $\bm{x}$ receives the ‘straight-through’ gradient for the full quantization procedure, i.e. $\frac{\partial}{\partial x_{i}}F(x_{i},m,c)=1$, where $F(\cdot,\cdot,\cdot)$ denotes the FP8 quantizer.
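The relation between the clipping value $c$ and the bias $\widehat{b}$ can be checked numerically. The sketch below solves the expression for the largest representable value, $c=(2-2^{-m})2^{2^{e}-\widehat{b}-1}$, for $\widehat{b}$; the helper names `max_value` and `bias_from_max` are hypothetical.

```python
import math


def max_value(m, e, b_hat):
    """Largest representable value c for m mantissa bits, e exponent bits."""
    return (2 - 2.0 ** -m) * 2.0 ** (2 ** e - b_hat - 1)


def bias_from_max(m, e, c):
    """Bias b_hat for which the largest representable value equals c.

    Obtained by taking log2 of the expression in max_value and
    solving for b_hat.
    """
    return 2 ** e - 1 - math.log2(c) + math.log2(2 - 2.0 ** -m)


# 3M4E with bias 8 has maximum value 240, and the inverse recovers 8.
c_3m4e = max_value(m=3, e=4, b_hat=8)
b_back = bias_from_max(m=3, e=4, c=c_3m4e)
# A learned, non-power-of-two maximum maps to a fractional bias.
b_frac = bias_from_max(m=5, e=2, c=4.37)
```

Learning $c$ directly therefore yields a non-integer bias in general, which is consistent with the earlier observation that a floating-point bias outperforms an integer one.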

4.3 Toy experiment: Learning minimal MSE on common distributions

To investigate whether our quantizer can indeed learn the maximum value $c$ and number of mantissa bits $m$, we run a toy experiment. We sample $10^{5}$ values from $\mathcal{N}(0,1)$, and initialize an FP8 quantizer with 3M4E and a bias of 8, which corresponds to a maximum value of $c=240$. We then use SGD to learn the values of $c$ and $m$ that minimize the MSE reconstruction loss: $\mathcal{L}(m,c)=\frac{1}{N}\sum_{i}\left(x_{i}-F(x_{i},m,c)\right)^{2}$. After 500 iterations, $c$ has converged to 4.35 and $m$ oscillates around 5.5. The oscillation behavior can be explained by the fact that MSE can be further minimized by increasing precision through higher $m$; however, increasing $m$ to 6 yields a uniform quantizer, which results in higher MSE. Similar behavior has been observed in uniform quantization. A line search shows that indeed $m=5$ and $c=4.37$ minimize the MSE on the target data, meaning our algorithm found the optimal values in this example. See Fig. 5 for details.

Experiments

In this section, we aim to empirically validate the analytical findings from section 3 on full neural networks. We use our analytical results to investigate various FP8 formats for weight and activation tensors in neural networks. Finally, we perform quantization-aware training (QAT) on FP8 quantized models.

We run our experiments on a variety of different tasks, datasets, and models: we experiment on ResNet18, MobileNetV2, and ViT for ImageNet classification; BERT-base for language understanding on the GLUE benchmark; HRNet for semantic segmentation on the Cityscapes dataset; DeepLabV3 for semantic segmentation on the Pascal VOC dataset; and SalsaNext for LIDAR point cloud segmentation on the SemanticKITTI dataset. For each model, we report the 32-bit floating point (FP32) results on the same model instance (trained weights and code base) used in our other experiments.

As baselines we compare against models quantized using INT8 quantization. We do not apply batch normalization folding, and we re-estimate the batch normalization statistics (running mean and variance) before final validation, as this improved results for every model we considered. For each model, we consider several methods for range estimation separately for weights and activations, as well as per-channel and per-tensor range estimation for weights. Results reported are those for the best combination of range estimation setting and per-channel or per-tensor range estimation. Our code is written in PyTorch and all our experiments are performed using NVIDIA Tesla V100 and A100 GPUs.

5.2 Post-training quantization results

We compare post-training quantization using INT8 and FP8 quantization. To allow an apples-to-apples comparison, we do not use any of the methods for improving post-training quantization that have been developed in recent years, other than the methods described in Section 5.1.

For our FP8 results, we report three settings: the best fully fixed format, i.e. the same combination of $m$, $e$ and $b$ used for the full network; the best flexible bias format, i.e. $m$ and $e$ fixed for the full network, but $\widehat{b}$ applied per channel or per tensor, whichever yields the best results; and a fully flexible format where $m$ and $e$ are set for each tensor, and $\widehat{b}$ is set for each channel, to minimize MSE. The full procedure for this approach is detailed in Section E. Figures 11 and 12 show examples of FP8 formats that minimize MSE for each tensor in ResNet18 and BERT, respectively.

PTQ results can be found in Table 1. In this table we can see that networks with large activation outliers (ViT, BERT, SalsaNext and HRNet) show better results for formats with more exponent bits, and thus a larger dynamic range. Furthermore, we see that, as predicted in Section A.1, formats with more mantissa bits are better able to represent weight and activation tensors in convolutional neural networks. However, since increasing the number of mantissa bits implies reducing dynamic range, finding the right value for the bias for each channel is important. The full set of results for our PTQ experiments is detailed in Table 3 in Appendix Section I.

Lastly, we see that fully flexible formats only sometimes outperform fixed $m/e$ formats, with slim improvements when they do. This is surprising, as a more flexible format should be able to better fit a network’s tensors. We attribute this discrepancy to our local greedy method for assigning $m$, $e$ and $\widehat{b}$, which may not find settings that are optimal globally.

5.3 Quantization-aware training

We investigate whether we can improve on the FP8 PTQ results by performing training with FP8 quantization in mind (quantization-aware training, QAT). We compare against INT8 QAT, which has previously been shown to result in INT8 models with accuracy near the original FP32 models.

We consider three different initializations based on our PTQ results from Table 1: best fixed format, best flexible bias format and best fully flexible format. For each of these initializations we perform QAT with the format fixed at initialization, with learnable maximum value $c$ (cf. range learning in INT8 QAT), and with learning both $c$ and the number of mantissa bits $m$. We run experiments with various learning rates for model and quantization parameters, as well as per-tensor and per-channel quantization, and report results for the best learning setup. We train our models for 20 epochs and use Adam for the model parameters and SGD for the quantization parameters. Results of these experiments can be found in Table 2. Full experimental details and results can be found in Appendix G.

From these results, we see that QAT can improve performance of FP8 quantized models. We also see that, for ResNet18 and MobileNetV2, learning both the maximum value $c$ and the number of mantissa bits $m$ improves results compared to just learning weights, for fully fixed and flexible bias formats. See also Figure 6 for an example of how initialized $c$ and $m$ differ from learned $c$ and $m$ after 20 epochs of training. However, this difference disappears for fully flexible PTQ initialization, and learning $c$ and $m$ slightly harms performance for BERT, although the difference is too small to be significant.

Generally, we see that QAT reduces the accuracy gap between different formats, as the weights of the network can adapt to the distributions the quantizers represent.

Related work

Integer quantization, sometimes also called fixed-point quantization, has been an active field of research since neural networks became more popular and the need for efficient inference became important. On a high level there are two lines of work: post-training quantization (PTQ) and quantization-aware training (QAT). For more details we refer the interested reader to existing surveys. Our approach to learning the FP8 configuration is inspired by prior work that assumes the straight-through estimator (STE) on the rounding operation in order to derive a gradient for the scaling factor of the fixed-point format. Other work extends this to jointly learn the bit width and scaling factor, which we explore for learning the trade-off between the number of mantissa and exponent bits.

Floating point formats

The accuracy and stability of floating point formats and algorithms are well studied in computer science and electrical engineering; we refer the interested reader to the literature on the topic. However, these studies usually focus on high bit-width floating point formats (e.g. FP32 or FP64) and do not consider the impact on neural networks, which are significantly more robust to noise.

Early work showed that convolutional networks can be trained using FP16 and FP8 using the 2M5E format. Later work introduced a hybrid FP8 format and first discovered that 3M4E has better inference performance than 2M5E. While that work uses a different FP8 format for the backward path, it keeps the bias and the mantissa/exponent division fixed for all layers in the forward path. Another work introduces a fully flexible FP8 format similar to the one we consider, performing a layer-wise optimization to find the best configuration (mantissa, exponent, sign bit, bias). That work does not go in-depth in the analysis of why some formats are better, and it has a fairly limited results section with easy-to-quantize models like ResNet18 and VGG. Finally, a custom FP4 format has been used for the backward path together with INT4 for the forward path to enable full 4-bit training with a small drop in performance.

To the best of our knowledge we are the first work with an extensive study of the different FP8 formats based on both analytical insights and empirical results across several tasks and data modalities. As opposed to our new FP8 simulation method, other works rely on dedicated FP8 simulations that do not allow for efficient gradient based learning of the bias or mantissa/exponent bit configuration.

Impact and Limitations

FP8 is becoming widespread (http://bit.ly/3Sd8wey) as an alternative to INT8, with multiple chips from vendors like Nvidia, Intel, Graphcore, AMD and IBM moving to support 3M4E and/or 2M5E FP8 formats, often with fixed bias values. In this paper we show that these formats are not optimal and that many networks benefit from FP8 formats with more mantissa bits and bias values that can be flexibly set per channel. We hope that the results in this paper will help guide FP8 hardware design decisions in the broader community.

Limitations

In this paper, we restrict ourselves to studying the impact of various FP8 formats on model accuracy and ignore hardware implementation-specific impact on power consumption and latency. Assessing the difference in hardware impact between INT8 and FP8 for all networks is not trivial. From a data transfer bandwidth perspective, the two 8-bit formats incur similar overhead, while for compute-limited models, the relative overheads depend on the exact implementation. Generally, FP8 units use more power for additions and multiplications; however, surrounding logic (e.g. accumulator design, NaN/overflow checks, etc.) might amortize this difference, and in multi-purpose designs which support not only FP8 but also, e.g., FP16/32 and INT8/16, the difference in overheads might disappear. Therefore we limited this work to comparing the different formats on a bit-width level. Hardware design teams can use our accuracy analysis to make trade-offs on the hardware side for their specific use-case and design.

Conclusion

In our analysis of the FP8 format and its benefits for quantized neural network inference we have touched on many points. We have shown analytically that the FP8 format can improve on the INT8 format for Gaussian distributions that are common in neural networks, and that more exponent bits work well when outliers occur. This paper introduced a new way to simulate FP8 quantization on FP32 hardware that speeds up FP8 quantization simulation and makes it possible to learn the bias and the mantissa-exponent bit-width trade-off. We validated the FP8 format for many networks in a post-training quantization setting, showing that generally for neural networks the 5M2E and 4M3E FP8 formats work best, and that for networks with more outliers, like transformers, increasing the number of exponent bits works best. We have also shown that when doing quantization-aware training, many of these benefits of the format disappear, as the network learns to perform well for the INT8 quantization grid as well.

Acknowledgments

We would like to thank Christos Louizos and Yin Huang for helpful discussions and valuable feedback.

References

Appendix A Analytical computation of the expected quantization error

Equation 4 (the quantization error) can be split into two terms corresponding to the rounding error $E_{rw}$ and the clipping error $E_{cw}$:

As the weight and activation tensor values are bounded, we assume that the distribution $p_{w}(w)$ is clipped within the interval $(w_{min},w_{max})$. Thus the clipping error can be written as:

where $\mathds{1}_{\alpha_{max}<w_{max}}$ is the indicator function. The calculation of the rounding error $E_{rw}$ can be split into two sub-intervals for each interval $(\alpha_{i},\alpha_{i+1})$, where the first sub-interval (below the midpoint) corresponds to rounding down to $\alpha_{i}$ and the second sub-interval (above the midpoint) corresponds to rounding up to $\alpha_{i+1}$:

In order to simplify the computation, we introduce the following function:

The rounding error $E_{rw}$ is then the sum of the rounding errors over each interval between two adjacent representable grid points. We note that the clipping error $E_{cw}$ can also be expressed using $I_{w}(a,b,w_{0})$:

The analytical expressions for $I(w_{min},\alpha_{min},\alpha_{min})$ for different distributions are given in Appendix A.3. Thus, given the explicit definition of the quantization grid and the probability density function, we can analytically compute the rounding error for different distributions, for example the Gaussian, uniform, or Student’s t-distribution.
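The procedure can be illustrated for a standard Gaussian. The sketch below assumes $I_{w}(a,b,w_{0})=\int_{a}^{b}(w-w_{0})^{2}p_{w}(w)\,dw$, by analogy with the definition of $I_{x}$ given in Appendix A.2, and uses the elementary closed form of this integral for $\mathcal{N}(0,1)$ (derived here from $\int w^{2}\varphi=\Phi-w\varphi$ and $\int w\varphi=-\varphi$, not taken from Appendix A.3). It integrates over the whole real line rather than a clipped interval $(w_{min},w_{max})$, and the function names are hypothetical.

```python
import math


def _pdf(w):
    """Standard normal density; 0 at +/- infinity."""
    return 0.0 if math.isinf(w) else math.exp(-0.5 * w * w) / math.sqrt(2 * math.pi)


def _cdf(w):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(w / math.sqrt(2.0)))


def _w_pdf(w):
    """w * pdf(w), with the limit 0 at +/- infinity."""
    return 0.0 if math.isinf(w) else w * _pdf(w)


def i_gauss(a, b, w0):
    """Integral of (w - w0)^2 * N(w; 0, 1) over [a, b] for finite w0."""
    return ((1.0 + w0 * w0) * (_cdf(b) - _cdf(a))
            - (_w_pdf(b) - _w_pdf(a))
            + 2.0 * w0 * (_pdf(b) - _pdf(a)))


def expected_mse(grid):
    """Rounding plus clipping error of N(0, 1) quantized to `grid`."""
    g = sorted(grid)
    # Clipping error: mass below the smallest / above the largest point.
    err = i_gauss(-math.inf, g[0], g[0]) + i_gauss(g[-1], math.inf, g[-1])
    # Rounding error: lower half of each interval maps to its left grid
    # point, upper half to its right grid point.
    for lo, hi in zip(g, g[1:]):
        mid = 0.5 * (lo + hi)
        err += i_gauss(lo, mid, lo) + i_gauss(mid, hi, hi)
    return err


# Uniform 8-bit grid on [-4, 4] as an INT8-like example.
uniform_grid = [-4.0 + 8.0 * i / 255 for i in range(256)]
mse_uniform = expected_mse(uniform_grid)
```

For this fine uniform grid the result is close to the familiar $\Delta^{2}/12$ approximation with $\Delta=8/255$, plus a small clipping contribution from the tails beyond $\pm 4$. Passing a floating point grid instead allows the formats to be compared in the same way.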

A.2 Scalar product quantization error.

We note that the first term is nothing but the rounding error on $W$ (see equation (4)) weighted by a non-central second moment on $X$ which does not depend on the quantization grid. The structure of the second term is very similar, with $W$ and $X$ interchanged. As the first two terms are the only integrals of non-negative functions, in practice they become dominant and mostly determine the MSE magnitude.

In order to compute the scalar product error analytically, we rewrite equation (19) in the following form:

where each term is defined below. $M_{w}$ and $M_{x}$ are the second non-central moments for $W$ and $X$, respectively. These terms can be computed using the functions $I_{w}(a,b,w_{0})$ and $I_{x}(a,b,x_{0})\coloneqq\int_{a}^{b}(x-x_{0})^{2}p_{x}(x)\,dx$:

$E_{rw}$ and $E_{rx}$ are rounding errors on $W$ and $X$ similar to the expression in equation (4):

Finally, $E_{sw}$ and $E_{sx}$ are the following integrals.

Similar to the rounding error calculation in equation 14, the computation of $E_{sw}$ and $E_{sx}$ can be split into sub-intervals. For $E_{sw}$:

Thus we can express the term $E_{sw}$ in equation 24 as:

The term $E_{sx}$ can be computed in a similar way. The formulas for $J_{w}(a,b,w_{0})$ for different distributions are given in Appendix A.3.

In this section we give formulas for the functions $I_{x}(a,b,x_{0})$ and $J_{x}(a,b,x_{0})$ for different distributions, which are necessary for the analytical computation of the rounding error and the scalar product error. The formulas for $I_{w}(a,b,w_{0})$ and $J_{w}(a,b,w_{0})$ are similar, with $p_{w}(w)$ used as the probability density function. The formulas were obtained using symbolic computation.

Appendix B Ablations

In this section we analyze different combinations of floating point formats for weights and activations based on the analytical computation of the scalar product error. We consider two Gaussian distributions fitted to the weight and activation distributions of a layer of a pre-trained ResNet18 model. The results are shown in fig. 7. We observe that the optimal format for both weights and activations is 5M2E. In order to facilitate visual comparison of MSE values of different magnitude, we plot SQNR, which is defined as follows:

B.2 Comparison of the analytical and empirical rounding error.

In this section we compare the expected rounding error computed analytically to the empirical rounding error. The results are given in fig. 8. Note that, for visual purposes, we plot the signal-to-quantization-noise ratio (SQNR) instead of MSE in these plots. SQNR is log-proportional to MSE; i.e. a value that minimizes MSE will maximize SQNR. SQNR for an input tensor $\bm{X}$ is defined as:
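As a sketch, assuming the common definition of SQNR as ten times the base-10 logarithm of the ratio between the signal energy and the quantization noise energy (the exact expression used for the plots is an assumption here), the quantity can be computed as follows; `sqnr_db` is a hypothetical helper name.

```python
import math


def sqnr_db(x, x_q):
    """Signal-to-quantization-noise ratio in dB.

    x   : original values
    x_q : quantized values (same length)
    Assumed definition: 10 * log10(sum(x^2) / sum((x - x_q)^2)).
    """
    signal = sum(v * v for v in x)
    noise = sum((v - q) ** 2 for v, q in zip(x, x_q))
    return 10.0 * math.log10(signal / noise)


# Halving the error amplitude raises SQNR by about 6 dB.
coarse = sqnr_db([1.0, 1.0], [1.1, 0.9])
fine = sqnr_db([1.0, 1.0], [1.05, 0.95])
```

Since the signal energy is fixed for a given tensor, a lower MSE always yields a higher SQNR, which is why the two can be used interchangeably for ranking formats.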

Appendix C Examples of tensors in Bert

In this section we give examples of strong outliers in the tensors in Bert (fig. 9).

Appendix D Importance of the outliers

In this ablation we demonstrate the influence of outliers on the choice of the optimal floating point format. The experiment is based on the analytical computation of the rounding error. We consider a Student’s t-distribution with $\nu=2$ and an increasing quantization range. A min-max quantizer range estimator is used, and the distribution is clipped at the quantization range. The results are given in fig. 10. As the quantization range is increased, the optimal exponent bit-width grows from zero (the INT8 format) to a 5-bit exponent.

Appendix E MSE-based mantissa bits and bias

In our quantization setup, each tensor has an individual quantizer. The quantizers can use per-tensor or per-channel quantization. In case per-channel quantization is used, each channel has its own clipping value $c$ (and thus its own value for $\widehat{b}$), while the number of mantissa bits is set for a full tensor, and is thus shared across all channels.

During quantizer initialization we use the input weight tensor or one batch of activations, both referred to as $\bm{X}$, and find values for $m$ and $c$ that minimize reconstruction MSE. In order to do so, we perform a grid search over values of $m$ and values of $c$. For $m$ we consider 1, 2, 3, 4, 5, and 6 bits. For $c$ we find the absolute maximum value $\sigma$ of $\bm{X}$ (or, in case of per-channel quantization, each channel in $\bm{X}$), and run the grid search over 111 evenly spaced values between $0.1\sigma$ and $1.2\sigma$. We then record the values of $m$ and $c$ that minimize MSE and initialize the quantizer with these values.

In case per-channel quantization is used, we run this procedure for each channel individually. On each channel, for each value of $m$, we store the value of $c$ that minimized quantization MSE on that channel. We then choose a per-tensor value of $m$ by majority vote, i.e. the value of $m$ that occurred most often over all channels. In case of a tie we choose the value of $m$ that has the lowest cumulative MSE. Lastly, we set $c$ to the value that minimized MSE for the per-tensor value of $m$. We also experimented with choosing $m$ based on lowest cumulative MSE directly, but found negligible difference in resulting accuracy. We decided to use majority vote to ensure that channels with relatively large magnitude do not dominate the choice of $m$. Figures 11 and 12 show a per-layer analysis of the bitwidth choices that minimize MSE.
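The per-channel selection step can be sketched as follows. The sketch assumes the grid search has already produced, for every channel and every candidate $m$, the best clipping value and its MSE; the data layout (one dictionary per channel, keyed by $m$) and the function name `choose_mantissa_bits` are hypothetical.

```python
from collections import Counter


def choose_mantissa_bits(best_c, mse):
    """Pick a shared m by majority vote and the matching per-channel c.

    best_c[ch][m] : clipping value that minimized MSE on channel ch for m
    mse[ch][m]    : the corresponding minimal MSE
    """
    # Each channel votes for the m that gives it the lowest MSE.
    votes = Counter(min(ch_mse, key=ch_mse.get) for ch_mse in mse)
    top = max(votes.values())
    tied = [m for m, n in votes.items() if n == top]
    # Tie-break: lowest MSE summed over all channels.
    m_star = min(tied, key=lambda m: sum(ch_mse[m] for ch_mse in mse))
    # Per-channel clipping values for the chosen m.
    c_star = [ch_c[m_star] for ch_c in best_c]
    return m_star, c_star


mse = [{3: 0.10, 4: 0.05}, {3: 0.20, 4: 0.30}, {3: 0.01, 4: 0.02}]
best_c = [{3: 1.0, 4: 1.1}, {3: 2.0, 4: 2.2}, {3: 0.5, 4: 0.6}]
m_star, c_star = choose_mantissa_bits(best_c, mse)
```

In this example two of the three channels prefer $m=3$, so it is chosen for the whole tensor, while every channel keeps its own clipping value for that $m$.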

Appendix F Correlation between MSE in the output activations and the model accuracy

In this section we provide an example of a correlation between MSE in output activations of a layer and the full model accuracy. We take a pre-trained ResNet18 model, and consider one of its layers. We inject Gaussian noise of increasing amplitude into its weight values, and measure the MSE in the output activations of the layer and the final top-1 classification accuracy. The results are shown in figure 13. The MSE value and the final accuracy exhibit strong correlation, i.e. the normalized correlation coefficient value for this experiment is 0.98.

Appendix G Full experimental details

For our QAT experiments, we initialized our models using the settings that gave the best results for the fixed format, the flexible bias format, and the fully flexible format. See Table 1 and the tables in Appendix I for details. We then trained ResNet18 for 20 epochs, and MobileNetV2 for 10. We used Adam to optimize the weights. We considered starting learning rates of $10^{-5}$ and $10^{-6}$, as these gave the best results in pilot experiments. Learning rates were decayed to a factor of $10^{-2}$ of the starting learning rate using cosine decay.

In experiments where FP8 parameters ($c$ and $m$) were learned as well, we used SGD without momentum. We considered learning rates $10^{-2}$, $10^{-3}$, $10^{-4}$, and $10^{-5}$. No weight decay was applied to any of the models.

Baseline INT8 QAT results followed the procedure as described in .

Appendix H Gradients of the FP8 quantizer

As stated previously, the FP8 quantizer gives the ‘straight-through’ gradient w.r.t. the values to be quantized:

Lastly, the gradient w.r.t. $m$ is as follows:

Appendix I PTQ results

The best results for INT8, FP8 with fixed bias (per Mantissa/Exponent division), FP8 with flexible bias, and fully flexible FP8, on all models considered in Section 5.2 are shown in Table 3.

Appendix J DeepLabV3 weight distribution

DeepLabV3 shows a larger degradation in the ‘fixed format’ PTQ setting than other models considered. Figure 14 shows the distribution of the values in each weight tensor in DeepLabV3. We believe the low performance on fixed format PTQ to be due to the fact that some layers early in the DeepLabV3 backbone have relatively large outliers (e.g. backbone.features.2.conv.0.weight, row 1 column 4), necessitating a relatively large number of exponent bits, while some layers later in the backbone and in the decoder have distributions that require few exponent bits (e.g. backbone.high_level_features.15.conv.3.weight; backbone.high_level_features.16.conv.3.weight; decoder.last_conv.8.weight). This discrepancy is present in other networks, most notably MobileNetV2; however, it is not as prominent as in this network.