Up or Down? Adaptive Rounding for Post-Training Quantization

Markus Nagel, Rana Ali Amjad, Mart van Baalen, Christos Louizos, Tijmen Blankevoort

Introduction

Deep neural networks are being used in many real-world applications as the standard technique for solving tasks in computer vision, machine translation, voice recognition, ranking, and many other domains. Owing to this success and widespread applicability, making these neural networks efficient has become an important research topic. Improved efficiency translates into reduced cloud-infrastructure costs and makes it possible to run these networks on heterogeneous devices such as smartphones, internet-of-things applications, and even dedicated low-power hardware.

One effective way to optimize neural networks for inference is neural network quantization (Krishnamoorthi, 2018; Guo, 2018). In quantization, neural network weights and activations are kept in a low-bit representation for both memory transfer and calculations in order to reduce power consumption and inference time. The process of quantizing a network generally introduces noise, which results in a loss of performance. Various prior works adapt the quantization procedure to minimize the loss in performance while going as low as possible in the number of bits used.

As Nagel et al. (2019) explained, the practicality of neural network quantization methods is important to take into consideration. Although many methods exist that do quantization-aware training (Jacob et al., 2018; Louizos et al., 2019) and get excellent results, these methods require a user to spend significant time on re-training models and hyperparameter tuning.

On the other hand, much attention has recently been dedicated to post-training quantization methods (Nagel et al., 2019; Cai et al., 2020; Choukroun et al., 2019; Banner et al., 2019), which can be more easily applied in practice. These types of methods allow for network quantization to happen on-the-fly when deploying models, without the user of the model spending time and energy on quantization. Our work focuses on this type of network quantization.

Rounding-to-nearest is the predominant approach for all neural network weight quantization work that came out thus far. This means that the weight vector w\mathbf{w} is rounded to the nearest representable quantization grid value in a fixed-point grid by

where s denotes the quantization scale parameter and, n and p denote the negative and positive integer thresholds for clipping. We could round any weight down by replacing \lfloor\cdot\rceil with \lfloor\cdot\rfloor, or up using \lceil\cdot\rceil. But, rounding-to-nearest seems the most sensible, as it minimizes the difference per-weight in the weight matrix. Perhaps surprisingly, we show that for post-training quantization, rounding-to-nearest is not optimal.

Our contributions in this work are threefold:

We establish a theoretical framework to analyze the effect of rounding in a way that considers the characteristics of both the input data as well as the task loss. Using this framework, we formulate rounding as a per-layer Quadratic Unconstrained Binary Optimization (QUBO) problem.

We propose AdaRound, a novel method that finds a good solution to this per-layer formulation via a continuous relaxation. AdaRound requires only a small amount of unlabelled data, is computationally efficient, and applicable to any neural network architecture with convolutional or fully-connected layers.

In a comprehensive study, we show that AdaRound defines a new state-of-the-art for post-training quantization on several networks and tasks, including Resnet18, Resnet50, MobilenetV2, InceptionV3 and DeeplabV3.

Motivation

To gain an intuitive understanding for why rounding-to-nearest may not be optimal, let’s look at what happens when we perturb the weights of a pretrained model. Consider a neural network parametrized by the (flattened) weights w\mathbf{w}. Let Δ ⁣w\Delta\!\mathbf{w} denote a small perturbation and L(x,y,w)\mathcal{L(\mathbf{x},\mathbf{y},\mathbf{w})} denote the task loss that we want to minimize. Then

where (a) uses the second order Taylor series expansion. g(w)\mathbf{g}^{(\mathbf{w})} and H(w)\mathbf{H}^{(\mathbf{w})} denote the expected gradient and Hessian of the task loss L\mathcal{L} w.r.t. w\mathbf{w}, i.e.,

All the gradient and Hessian terms in this paper are of task loss L\mathcal{L} with respect to the specified variables. Ignoring the higher order terms in the Taylor series expansion is a good approximation as long as Δ ⁣w\Delta\!\mathbf{w} is not too large. Assuming the network is trained to convergence, we can also ignore the gradient term as it will be close to . Therefore, H(w)\mathbf{H}^{(\mathbf{w})} defines the interactions between different perturbed weights in terms of their joint impact on the task loss L(x,y,w+Δ ⁣w)\mathcal{L}\left(\mathbf{x},\mathbf{y},\mathbf{w}+\Delta\!\mathbf{w}\right). The following toy example illustrates how rounding-to-nearest may not be optimal.

Assume Δ ⁣wT=[Δw1Δw2]\Delta\!\mathbf{w}^{T}=[\Delta w_{1}\quad\Delta w_{2}] and

then the increase in task loss due to the perturbation is (approximately) proportional to

For the terms corresponding to the diagonal entries Δw12\Delta\mathbf{w}_{1}^{2} and Δw22\Delta\mathbf{w}_{2}^{2}, only the magnitude of the perturbations matters. Hence rounding-to-nearest is optimal when we only consider these diagonal terms in this example. However, for the terms corresponding to the Δw1Δw2\Delta\mathbf{w}_{1}\Delta\mathbf{w}_{2}, the sign of the perturbation matters, where opposite signs of the two perturbations improve the loss. To minimize the overall impact of quantization on the task loss, we need to trade-off between the contribution of the diagonal terms and the off-diagonal terms. Rounding-to-nearest ignores the off-diagonal contributions, making it often sub-optimal.

The previous analysis is valid for the quantization of any parametric system. We show that this effect also holds for neural networks. To illustrate this, we generate 100 stochastic rounding (Gupta et al., 2015) choices for the first layer of Resnet18 and evaluate the performance of the network with only the first layer quantized. The results are presented in Table 1. Among 100100 runs, we find that 4848 stochastically sampled rounding choices lead to a better performance than rounding-to-nearest. This implies that many rounding solutions exist that are better than rounding-to-nearest. Furthermore, the best among these 100 stochastic samples provides more than 10%10\% improvement in the accuracy of the network. We also see that accidentally rounding all values up, or all down, has an catastrophic effect. This implies that we can gain a lot by carefully rounding weights when doing post-training quantization. The rest of this paper is aimed at devising a well-founded and computationally efficient rounding mechanism.

Method

In this section, we propose AdaRound, a new rounding procedure for post-training quantization that is theoretically well-founded and shows significant performance improvement in practice. We start by analyzing the loss due to quantization theoretically. We then formulate an efficient per-layer algorithm to optimize it.

Finding the optimal rounding procedure can be formulated as the following binary optimization problem

Evaluating the cost in (11) requires a forward pass of the input data samples for each new Δ ⁣w\Delta\!\mathbf{w} during optimization. To avoid the computational overhead of repeated forward passes throught the data, we utilize the second order Taylor series approximation. Additionally, we ignore the interactions among weights belonging to different layers. This, in turn, implies that we assume a block diagonal H(w)\mathbf{H}^{(\mathbf{w})}, where each non-zero block corresponds to one layer. We thus end up with the following per-layer optimization problem

To verify that (13) serves as a good proxy for optimizing task loss due to quantization, we plot the cost in (13) vs validation accuracy for 100 stochastic rounding vectors when quantizing only the first layer of Resnet18. Fig. 1 shows a clear correlation between the two quantities. This justifies our approximation for optimization, even for 44 bit quantization. Optimizing (13) show significant performance gains, however its application is limited by two problems:

In section 3.2 and section 3.3 we tackle the first and the second problem, respectively.

2 From Taylor expansion to local loss

3 AdaRound

where F2\left\lVert\cdot\right\rVert^{2}_{F} denotes the Frobenius norm and W~\mathbf{\widetilde{W}} are the soft-quantized weights that we optimize over

In the case of a convolutional layer the Wx\mathbf{W}\mathbf{x} matrix multiplication is replaced by a convolution. Vi,j\mathbf{V}_{i,j} is the continuous variable that we optimize over and h(Vi,j)\mathnormal{h}\left(\mathbf{V}_{i,j}\right) can be any differentiable function that takes values between and 11, i.e., h(Vi,j)\mathnormal{h}\left(\mathbf{V}_{i,j}\right)\in. The additional term freg(V)\mathnormal{f_{reg}}\left(\mathbf{V}\right) is a differentiable regularizer that is introduced to encourage the optimization variables h(Vi,j)\mathnormal{h}\left(\mathbf{V}_{i,j}\right) to converge towards either or 11, i.e., at convergence h(Vi,j){0,1}\mathnormal{h}\left(\mathbf{V}_{i,j}\right)\in\{0,1\}.

We employ a rectified sigmoid as h(Vi,j)\mathnormal{h}\left(\mathbf{V}_{i,j}\right), proposed in (Louizos et al., 2018). The rectified sigmoid is defined as

where σ()\sigma(\cdot) is the sigmoid function and, ζ\zeta and γ\gamma are stretch parameters, fixed to 1.11.1 and 0.1-0.1, respectively. The rectified sigmoid has non-vanishing gradients as h(Vi,j)\mathnormal{h}\left(\mathbf{V}_{i,j}\right) approaches or 11, which helps the learning process when we encourage h(Vi,j)\mathnormal{h}\left(\mathbf{V}_{i,j}\right) to move to the extremities. For regularization we use

where we anneal the parameter β\beta. This allows most of the h(Vi,j)\mathnormal{h}\left(\mathbf{V}_{i,j}\right) to adapt freely in the initial phase (higher β\beta) to improve the MSE and encourages it to converge to or 11 in the later phase of the optimization (lower β\beta), to arrive at the binary solution that we are interested in. The effect of annealing β\beta is illustrated in Fig. 2. Fig. 3 shows how this combination of rectified sigmoid and freg\mathnormal{f_{reg}} leads to many weights learning a rounding that is different from rounding to the nearest, to improve the performance, while ultimately converging close to or 11.

This method of optimizing (21) is a specific instance of the general family of Hopfield methods used for binary constrained optimization problems. These types of methods are commonly used as an efficient approximation algorithm for large scale combinatorial problems (Hopfield & Tank, 1985; Smith et al., ).

To quantize the whole model, we optimize (21) layer-by-layer sequentially. However, this does not account for the quantization error introduced due to the previous layers. In order to avoid the accumulation of quantization error for deeper networks as well as to account for the activation function, we use the following asymmetric reconstruction formulation

where x^\mathbf{\hat{x}} is the layer’s input with all preceding layers quantized and fa\mathnormal{f_{a}} is the activation function. A similar formulation of the loss has been used previously in (Zhang et al., 2016; He et al., 2017), albeit for different purposes. (25) defines our final objective that we can optimize via stochastic gradient descent. We call this algorithm AdaRound, as it adapts to the statistics of the input data as well as to (an approximation of) the task loss. In section 5 we elaborate on the influence of our design choices as well as the asymmetric reconstruction loss on the performance.

Background and related work

In the 1990s, with the resurgence of the field of neural networks, several works designed hardware and optimization methods for running low-bit neural networks on-device. Hammerstrom (1990) created hardware for 8 and 16-bit training of networks, Holi & Hwang (1993) did an empirical analysis on simple neural networks to show that 8 bits are sufficient in most scenarios, and Hoehfeld & Fahlman (1992) developed a stochastic rounding scheme to push neural networks below 8 bits.

More recently, much attention has gone to quantizing neural networks for efficient inference. This is often done by simulating quantization during training, as described in Jacob et al. (2018) and Gupta et al. (2015), and using a straight-through estimator to approximate the gradients. Many methods have since then extended these training frameworks. Choi et al. (2018) learns the activations to obey a certain quantization range, while Esser et al. (2020); Jain et al. (2019) learn the quantization min and max ranges during training so that they do not have to be set manually. Louizos et al. (2019) also learn the grid and formulate a probabilistic version of the quantization training procedure. Uhlich et al. (2020) learn both the quantization grid, and the bit-width per layer, resulting in automatic bit-width selection during training. Works like Kim et al. (2019); Mishra & Marr (2017) exploit student-teacher training to improve quantized model performance during training. Although quantization-aware training is potent and often gives good results, the process is often tedious and time-consuming. Our work seeks to get high accuracy models without this hassle.

Several easy-to-use methods for quantization of networks without quantization-aware training have been proposed as of recent. These methods are often referred to as post-training quantization methods. Krishnamoorthi (2018) show several results of network quantization without fine-tuning. Works like Banner et al. (2019); Choukroun et al. (2019) optimize the quantization ranges for clipping to find a better loss trade-off per-layer. Zhao et al. (2019) improve quantization performance by splitting channels into more channels, increasing computation but achieving lower bit-widths in the process. Lin et al. (2016); Dong et al. (2019) set different bit-widths for different layers, through the information of the per-layer SQNR or the Hessian. Nagel et al. (2019); Cai et al. (2020) even do away with the requirement of needing any data to optimize a model for quantization, making their procedures virtually parameter and data-free. These methods are all solving the same quantization problem as in this paper, and some like Zhao et al. (2019) and Dong et al. (2019) could even be used in conjunction with AdaRound. We compare to the methods that improve weight quantization for 4/8 and 4/32 bit-widths without end-to-end fine-tuning, Banner et al. (2019); Choukroun et al. (2019); Nagel et al. (2019), but leave out comparisons to the mixed-precision methods Cai et al. (2020); Dong et al. (2019) since they improve networks on a different axis.

Experiments

To evaluate the performance of AdaRound, we conduct experiments on various computer vision tasks and models. In section 5.1 we study the impact of the approximations and design choices made in section 3. In section 5.2 we compare AdaRound to other post-training quantization methods.

1 Ablation study

We make various approximations and assumptions in section 3.1 and section 3.2 to simplify our optimization problem. In Table 2, we look at their impact systematically. First, we note that optimizing based on the Hessian of the task loss (cf. (13)) provides a significant performance boost compared to rounding-to-nearest. This verifies that the Taylor expansion based rounding serves as a much better alternative for the task loss when compared to rounding-to-nearest. Similarly, we show that, although moving from the optimization of Taylor expansion of the task loss to the local MSE loss (cf. (20)) requires strong assumptions, it does not degrade the performance. Unlike the Taylor series expansion, the local MSE loss makes it feasible to optimize all layers in the network. We use the cross-entropy method (Rubinstein, 1999) to solve the QUBO problems in (13) and (20), where we initialize the sampling distribution for the binary random variables w^i\widehat{\mathbf{w}}_{i} as in (Gupta et al., 2015)In the supplementary material we compare the performance of different QUBO solvers on our problem.. Finally, the continuous relaxation for the local MSE optimization problem (cf. (21)) not only reduces the optimization time from several hours to a few minutes but also slightly improves our performance.

Design choices for AdaRound

As discussed earlier, our approach to solve (21) closely resembles a Hopfield method. These methods optimize h(Vi,j)=σ(Vi,jT)\mathnormal{h}\left(\mathbf{V}_{i,j}\right)=\sigma\left(\frac{\mathbf{V}_{i,j}}{T}\right) with a version of gradient descent with respect to Vi,jV_{i,j}, and annealing the temperature TT (Hopfield & Tank, 1985; Smith et al., ). This annealing acts as an implicit regularization that allows h(Vi,j)\mathnormal{h}\left(\mathbf{V}_{i,j}\right) to optimize for the MSE loss initially unconstrained, while encouraging h(Vi,j)\mathnormal{h}\left(\mathbf{V}_{i,j}\right) to converge towards or 11 in the later phase of optimization. In Table 3 we show that even after an extensive hyper-parameter search for the annealing schedule of TT, using the sigmoid function with our explicit regularization term (24) outperforms the classical method. Using explicit regularization also makes the optimization more stable, leading to lower variance as shown in Table 3. Furthermore, we see that the use of the rectified sigmoid also provides a consistent small improvement in accuracy for different models.

Table 4 shows the gain of using the asymmetric reconstruction MSE (cf. section 3.3). We see that this provides a noticeable accuracy improvement when compared to (21). Similarly, accounting for the activation function in the optimization problem provides a small gain.

Optimization using STE

Another option we considered is to optimize W~\mathbf{\widetilde{W}} directly by using the straight-through estimator (STE) (Bengio et al., 2013). This is inspired by quantization-aware training (Jacob et al., 2018), which optimizes a full network with this procedure. We use the STE to minimize the MSE loss in (21). This method technically allows more flexible movement of the quantized weights W^\mathbf{\widehat{W}}, as they are no longer restricted to just rounding up or down. In Table 5 we compare the STE optimization with AdaRound. We can see that AdaRound clearly outperforms STE-based optimization. We believe this is due to the biased gradients of the STE, which hinder the optimization in this restricted setting.

Influence of quantization grid

We studied how the choice of weight quantization grid affects the performance gain that AdaRound brings vs rounding-to-nearest. We looked at three different options for determining the scale parameter s; using minimum and maximum values of the weight tensor W\mathbf{W}, minimizing the MSE WWF2\left\lVert\mathbf{W}-\overline{\mathbf{W}}\right\rVert^{2}_{F} introduced in the weights, and minimizing the MSE WxWx^F2\left\lVert\mathbf{W}\mathbf{x}-\overline{\mathbf{W}}\mathbf{\widehat{x}}\right\rVert^{2}_{F} introduced in the preactivations. W\overline{\mathbf{W}} denotes the quantized weight tensor obtained through rounding-to-nearest for a given s. Note, we do not optimize step size and AdaRound jointly as it is non-trivial to combine the two tasks: any change in the step size would result in a different QUBO problem. The results in Table 6 clearly show that AdaRound significantly improves over rounding-to-nearest, independent of the choice of the quantization grid. Both MSE based approaches are superior to the Min-Max method for determining the grid. Since there is no clear winner between the two MSE formulations for AdaRound, we continue the use of WWF2\left\lVert\mathbf{W}-\overline{\mathbf{W}}\right\rVert^{2}_{F} formulation for all other experiments.

Optimization robustness to data

We also investigate how little data is necessary to allow AdaRound to achieve good performance and investigate if this could be done with data from different datasets. The results can be seen in Fig. 4. We see that the performance of AdaRound is robust to the number of images required for optimization. Even with as little as 256256 images, the method optimizes the model to within 2%2\% of the original FP32 accuracy. We also see that when using unlabelled images that are from a similar domain but do not belong to the original training data, AdaRound achieves competitive performance. Here, we observe a less than 0.2%0.2\% degradation on average. It is worthwhile to note that both Pascal VOC and MS COCO only contain a small subset of the classes from Imagenet, implying that the optimization data for AdaRound does not need to be fully representative of the original training set.

2 Literature comparison

Our method solves this same problem, but in a better way. In Table 8 we compare empirical bias correction from Nagel et al. (2019) to AdaRound, under the exact same experimental setup, on ResNet18. While bias correction improves performance over vanilla quantization without bias correction, we see that for 4 bits it only achieves 38.87%38.87\% accuracy, where AdaRound recovers accuracy to 68.60%68.60\%.

ImageNet

In Table 7, we compare AdaRound to several recent post-training quantization methods. We use the same experimental setup as described earlier, with the exception of optimizing AdaRound with 2048 images for 20k iterations. For both Resnet18 and Resnet50, AdaRound is within 1% of the FP32 accuracy for 4-bit weight quantization and outperforms all competing methods, even though some rely on the more favorable per-channel quantization and do not quantize the first and the last layer. Similarly, on the more challenging networks, InceptionV3 and MobilenetV2, AdaRound stays within 2% of the original accuracy and outperforms any competing method.

To be able to compare to methods that also do activation quantization, we report results of AdaRound with all activation tensors quantized to 8 bits. For this scenario, we quantized the activations to 8 bits and set the scaling factor for the activation quantizers based on the minimum and maximum activations observed. We notice that activation quantization, in most cases, does not significantly harm the validation accuracy. AdaRound again outperforms the competing methods such as DFQ (Nagel et al., 2019) and bias correction (Banner et al., 2019).

Semantic segmentation

To demonstrate the wider applicability of AdaRound, we apply it to DeeplabV3+ (Chen et al., 2018) evaluated on Pascal VOC (Everingham et al., 2015). Since the input images here are significantly bigger, we only use 512 images to optimize AdaRound. All other aspects of the experimental setup stay the same. To the best of our knowledge, there are no other post-training quantization methods doing 4-bit quantization for semantic segmentation. DFQ works well for 8 bits, however performance drastically drops when going down to 4-bit weight quantization. AdaRound still performs well for 4 bits and has only a 2% performance decrease for 4-bit weights and 8-bit activations quantization.

Conclusion

In this paper we proposed AdaRound, a new rounding method for post-training quantization of neural network weights. AdaRound improves significantly over rounding-to-nearest, which has poor performance for lower bit widths. We framed and analyzed the rounding problem theoretically and by making appropriate approximations we arrive at a practical method. AdaRound is computationally fast, uses only a small number of unlabeled data examples, does not need end-to-end fine-tuning, and can be applied to any neural network that has convolutional or fully-connected layers without any restriction. AdaRound establishes a new state-of-the-art for post-training weight quantization with significant gains. It can push networks like Resnet18 and Resnet50 to 4-bit weights while keeping the accuracy drop within 1%.

References

Appendix A Comparison among QUBO solvers

We compared optimizing task loss Hessian using the cross-entropy method vs QUBO solver from the publicly available package qbsolvhttps://docs.ocean.dwavesys.com/projects/qbsolv/. We chose this qbsolv QUBO solver for comparison due to its ease of use for our needs as well its free availability for any researcher to reproduce our work. Table 10 presents the comparison between the two solvers. We see that cross-entropy method significantly outperforms the qbsolv QUBO solver. Furthermore the qbsolv QUBO solver has worse performance than rounding-to-nearest. We believe this is mainly due to the reason that the API does not allow us to provide a smart initialization (as we do for cross-entropy method). The performance of random rounding choices is significantly worse, on average, when compared to the rounding choices in the neighbourhood of rounding-to-nearest. Hence this initialization can provide a significant advantage in finding a better local minimum in this large problem space. We did not conduct an extensive search for better QUBO solvers as our own implementation of the cross-entropy method provided very good results with very little tweaking and allowed us to exploit GPU and memory resources more efficiently. Furthermore the choice of QUBO solver does not impact our final method AdaRound while clearly showing the gains that we can exploit via optimized rounding.

Appendix B From Taylor expansion to local loss (conv. layer)

Under the assumptions in (30) there are no interactions between weights in the same layer that affect two different output filters (c1oc2o)\left(c^{o}_{1}\neq c^{o}_{2}\right). We then reformulate the Hessian QUBO optimization

where (a) follows from the assumption in (30). Hence the Hessian optimization problem, under the assumptions in (30), is the same as MSE optimization for the output feature map. Furthermore, it breaks down to an optimization problem for each individual output channel separately (each element in the summation in (35) is independent of the other elements in the summation for optimization purposes as they involve disjoint sets of variables).