ZeroQ: A Novel Zero Shot Quantization Framework

Yaohui Cai, Zhewei Yao, Zhen Dong, Amir Gholami, Michael W. Mahoney, Kurt Keutzer

I Introduction

Despite the great success of deep Neural Network (NN) models in various domains, the deployment of modern NN models at the edge has been challenging due to their prohibitive memory footprint, inference time, and/or energy consumption. With the current hardware support for low-precision computations, quantization has become a popular procedure to address these challenges. By quantizing the floating point values of weights and/or activations in a NN to integers, the model size can be shrunk significantly, without any modification to the architecture. This also allows one to use reduced-precision Arithmetic Logic Units (ALUs), which are faster and more power-efficient than floating point ALUs. More importantly, quantization reduces memory traffic volume, which is a significant source of energy consumption.

However, quantizing a model from single precision to low precision often results in significant accuracy degradation. One way to alleviate this is to perform so-called quantization-aware fine-tuning to reduce the performance gap between the original model and the quantized model. This is a retraining procedure that is performed for a few epochs to adjust the NN parameters and reduce the accuracy drop. However, quantization-aware fine-tuning can be computationally expensive and time-consuming. For example, in online learning situations, where a model needs to be constantly updated on new data and deployed every few hours, there may not be enough time for the fine-tuning procedure to finish. More importantly, in many real-world scenarios, the training dataset is sensitive or proprietary, meaning that it is not possible to access the dataset that was used to train the model. Good examples are medical data, biometric data, or user data used in recommendation systems.

To address this, recent work has proposed post-training quantization, which directly quantizes NN models without fine-tuning. However, as mentioned above, these methods result in non-trivial performance degradation, especially for low-precision quantization. Furthermore, previous post-training quantization methods usually require limited (unlabeled) data to assist the quantization. However, for cases such as MLaaS (e.g., Amazon AWS and Google Cloud), it may not be possible to access any of the training data from users. An example application is health care information, which cannot be uploaded to the cloud due to various privacy issues and/or regulatory constraints. Another shortcoming is that post-quantization methods often focus only on standard NNs such as ResNet and InceptionV3 for image classification, and they do not consider more demanding tasks such as object detection.

In this work, we propose ZeroQ, a novel zero-shot quantization scheme to overcome the issues mentioned above. ZeroQ allows quantization of NN models without access to any training/validation data. It uses a novel approach to automatically compute a mixed-precision configuration without any expensive search. In particular, our contributions are as follows.

We propose an optimization formulation to generate Distilled Data, i.e., synthetic data engineered to match the statistics of batch normalization layers. This reconstruction has a small computational overhead. For example, it only takes 3s (0.05% of one epoch training time) to generate 32 images for ResNet50 on ImageNet on an 8-V100 system.

We use the above reconstruction framework to perform sensitivity analysis between the quantized and the original model. We show that the Distilled Data matches the sensitivity of the original training data (see Figure 1 and Table IV for details). We then use the Distilled Data, instead of original/real data, to perform post-training quantization. The entire sensitivity computation here only costs 12s (0.2% of one epoch training time) in total for ResNet50. Importantly, we never use any training/validation data for the entire process.

Our framework supports both uniform and mixed-precision quantization. For the latter, we propose a novel automatic precision selection method based on a Pareto frontier optimization (see Figure 4 for illustration). This is achieved by computing the quantization sensitivity based on the Distilled Data with small computational overhead. For example, we are able to determine automatically the mixed-precision setting in under 14s for ResNet50.

We extensively test our proposed ZeroQ framework on a wide range of NNs for image classification and object detection tasks, achieving state-of-the-art quantization results in all tests. In particular, we present quantization results for both standard models (e.g., ResNet18/50/152 and InceptionV3) and efficient/compact models (e.g., MobileNetV2, ShuffleNet, and SqueezeNext) for the image classification task. Importantly, we also test ZeroQ for object detection on the Microsoft COCO dataset with RetinaNet. Among other things, we show that ZeroQ achieves 1.71% higher accuracy on MobileNetV2 as compared to the recently proposed DFQ method.

II Related work

Here we provide a brief (and by no means extensive) review of the related work in the literature. There is a wide range of methods besides quantization which have been proposed to address the prohibitive memory footprint and inference latency/power of modern NN architectures. These methods are typically orthogonal to quantization, and they include efficient neural architecture design, knowledge distillation, model pruning, and hardware and NN co-design. Here we focus on quantization, which compresses the model by reducing the bit precision used to represent parameters and/or activations. An important challenge with quantization is that it can lead to significant performance degradation, especially in ultra-low bit precision settings. To address this, existing methods propose quantization-aware fine-tuning to recover lost performance. Importantly, this requires access to the full dataset that was used to train the original model. Not only can this be very time-consuming, but often access to training data is not possible.

To address this, several papers focused on developing post-training quantization methods (also referred to as post-quantization), without any fine-tuning/training. In particular, the OMSE method optimizes the $L_2$ distance between the quantized tensor and the original tensor. Moreover, the so-called ACIQ method analytically computes the clipping range, as well as the per-channel bit allocation for NNs, and it achieves relatively good testing performance. However, it uses per-channel quantization for activations, which is difficult to implement efficiently in hardware. In addition, the outlier channel splitting (OCS) method was proposed to solve the outlier channel problem. However, these methods require access to limited data to reduce the performance drop.

Recent work proposed Data Free Quantization (DFQ), which further pushes post-quantization to zero-shot scenarios, where neither training nor testing data are accessible during quantization. DFQ uses a weight equalization scheme to remove outliers in both weights and activations, and with layer-wise quantization it achieves results similar to those of previous post-quantization work with channel-wise quantization. However, its performance significantly degrades when NNs are quantized to 6-bit or lower.

A recent paper concurrent to ours independently proposed to use Batch Normalization statistics to reconstruct input data. That work proposes a knowledge-distillation based method to boost the accuracy further, by generating input data that is similar to the original training dataset, using the so-called Inceptionism. However, it is not clear how the latter approach can be used for tasks such as object detection or image segmentation. Furthermore, this knowledge-distillation process adds to the computational time required for zero-shot quantization. As we will show in our work, it is possible to use batch norm statistics combined with mixed-precision quantization to achieve state-of-the-art accuracy, and importantly this approach is not limited to the image classification task. In particular, we will present results on object detection using RetinaNet-ResNet50, besides testing ZeroQ on a wide range of models for image classification (using ResNet18/50/152, MobileNetV2, ShuffleNet, SqueezeNext, and InceptionV3). We show that for all of these cases ZeroQ exceeds state-of-the-art quantization performance. Importantly, our approach has a very small computational overhead. For example, we can finish ResNet50 quantization in under 30 seconds on an 8-V100 system (corresponding to 0.5% of one epoch training time of ResNet50 on ImageNet).

Directly quantizing all NN layers to low precision can lead to significant accuracy degradation. A promising approach to address this is to perform mixed-precision quantization, where different bit precision is used for different layers. The key idea behind mixed-precision quantization is that not all layers of a convolutional network are equally “sensitive” to quantization. A naïve mixed-precision quantization method can be computationally expensive, as the search space for determining the precision of each layer is exponential in the number of layers. To address this, prior work uses NAS/RL-based search algorithms to explore the configuration space. However, these search methods can be expensive and are often sensitive to the hyper-parameters and the initialization of the RL-based algorithm. Alternatively, recent work introduces a Hessian-based method, where the bit precision setting is based on the second-order sensitivity of each layer. However, this approach requires access to the original training set, a limitation which we address in ZeroQ.

III Methodology

For a typical supervised computer vision task, we seek to minimize the empirical risk loss, i.e.,

The ZeroQ framework supports both fixed-precision and mixed-precision quantization. In the latter scheme, different layers of the model could have different bit precisions (different $k$). The main idea behind mixed-precision quantization is to keep more sensitive layers at higher precision, and more aggressively quantize less sensitive layers, without increasing overall model size. As we will show later, this mixed-precision quantization is key to achieving high accuracy for ultra-low precision settings such as 4-bit quantization. Typical choices of $k$ for each layer are $\{2,4,8\}$ bits. Note that this mixed-precision quantization leads to an exponentially large search space, as every layer could have one of these bit precision settings. It is possible to avoid this prohibitive search space if we can measure the sensitivity of the model to the quantization of each layer. For the case of post-training quantization (i.e., without fine-tuning), a good sensitivity metric is the Kullback–Leibler (KL) divergence between the original model and the quantized model, defined as:

$$\Omega_i(k)=\frac{1}{N}\sum_{j=1}^{N}\mathrm{KL}\left(\mathcal{M}(\theta;x_j),\ \mathcal{M}(\tilde{\theta}_i(k\text{-bit});x_j)\right), \tag{2}$$

where $\mathcal{M}(\theta;x_j)$ denotes the output of the model with parameters $\theta$ on input $x_j$, $\tilde{\theta}_i(k\text{-bit})$ denotes the parameters with the $i$-th layer quantized to $k$-bit precision, and $\Omega_i(k)$ measures how sensitive the $i$-th layer is when quantized to $k$ bits.

For zero-shot quantization, we do not have access to any of the training/validation data. This poses two challenges. First, we need to know the range of values for activations of each layer so that we can clip the range for quantization (the $[a,b]$ range mentioned above). However, we cannot determine this range without access to the training dataset. This is a problem for both uniform and mixed-precision quantization. Second, for mixed-precision quantization, we need to compute $\Omega_i$ in Eq. 2, but we do not have access to training data $x_j$. A very naïve method to address these challenges is to create random input data drawn from a Gaussian distribution with zero mean and unit variance and feed it into the model. However, this approach cannot capture the correct statistics of the activation data corresponding to the original training dataset. This is illustrated in Figure 2 (left), where we plot the sensitivity of each layer of ResNet50 on ImageNet measured with the original training dataset (shown in black) and Gaussian based input data (shown in red). As one can see, the Gaussian data clearly does not capture the correct sensitivity of the model. For instance, for the first three layers, the sensitivity order of the red line is actually the opposite of the original training data.
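
The naïve Gaussian baseline and the per-layer sensitivity measurement can be sketched on a toy model. This is only an illustration under stated assumptions: the two-layer dense network, the min/max `quantize` helper, and all function names are hypothetical and are not taken from the ZeroQ implementation; the sensitivity is assumed to be the KL divergence between the softmax outputs of the original model and of the model with a single layer quantized, averaged over the inputs.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    # KL(p || q) for two discrete distributions of equal length.
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def quantize(values, k):
    # Uniform min/max quantization to k bits, followed by de-quantization.
    lo, hi = min(values), max(values)
    if hi == lo:
        return list(values)
    step = (hi - lo) / (2 ** k - 1)
    return [lo + round((v - lo) / step) * step for v in values]

def forward(layers, x):
    # Toy model: dense layers (lists of weight rows) with ReLU in between.
    for idx, weights in enumerate(layers):
        x = [sum(w * v for w, v in zip(row, x)) for row in weights]
        if idx < len(layers) - 1:
            x = [max(0.0, v) for v in x]
    return softmax(x)

def layer_sensitivity(layers, inputs, i, k):
    # Quantize only layer i to k bits; average KL(original || quantized).
    cols = len(layers[i][0])
    flat = quantize([w for row in layers[i] for w in row], k)
    q_layer = [flat[r * cols:(r + 1) * cols] for r in range(len(layers[i]))]
    q_layers = layers[:i] + [q_layer] + layers[i + 1:]
    kls = [kl_divergence(forward(layers, x), forward(q_layers, x)) for x in inputs]
    return sum(kls) / len(kls)

random.seed(0)
layers = [[[random.gauss(0, 1) for _ in range(4)] for _ in range(6)],
          [[random.gauss(0, 1) for _ in range(6)] for _ in range(3)]]
# Naive zero-shot baseline: inputs drawn from N(0, 1) instead of training data.
gaussian_inputs = [[random.gauss(0, 1) for _ in range(4)] for _ in range(16)]
sensitivity = {(i, k): layer_sensitivity(layers, gaussian_inputs, i, k)
               for i in range(len(layers)) for k in (2, 4, 8)}
```

Replacing `gaussian_inputs` with a different input batch changes only the data on which the divergence is averaged, which is exactly the degree of freedom discussed in the text.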

To address this problem, we propose a novel method to “distill” input data from the NN model itself, i.e., to generate synthetic data carefully engineered based on the properties of the NN. In particular, we solve a distillation optimization problem, in order to learn an input data distribution that best matches the statistics encoded in the BN layer of the model. In more detail, we solve the following optimization problem:

$$\min_{x^r}\ \sum_{i=0}^{L}\left\|\mu_i^r-\mu_i\right\|_2^2+\left\|\sigma_i^r-\sigma_i\right\|_2^2, \tag{3}$$

where $x^r$ is the reconstructed (distilled) input data, $\mu_i^r$/$\sigma_i^r$ are the mean/standard deviation of the Distilled Data’s distribution at layer $i$, and $\mu_i$/$\sigma_i$ are the corresponding mean/standard deviation parameters stored in the BN layer at layer $i$. In other words, after solving this optimization problem, we can distill input data which, when fed into the network, has a statistical distribution that closely matches that of the original model. Please see Algorithm 1 for a description. This Distilled Data can then be used to address the two challenges described earlier. First, we can use the Distilled Data’s activation range to determine quantization clipping parameters (the $[a,b]$ range mentioned above). Note that some prior work addresses this by using limited (unlabeled) data to determine the activation range. However, this contradicts the assumptions of zero-shot quantization, and may not be applicable for certain applications. Second, we can feed the Distilled Data into Eq. 2 to determine the quantization sensitivity ($\Omega_i$). The latter is plotted for ResNet50 in Figure 2 (left), shown in solid blue. As one can see, the Distilled Data closely matches the sensitivity of the model as compared to using Gaussian input data (shown in red). We show a visualization of the random Gaussian data as well as the Distilled Data for ResNet50 in Figure 3. We can see that the Distilled Data can capture fine-grained local structures.
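
The idea of matching BN statistics by optimizing the input can be illustrated with a deliberately tiny example. The sketch below assumes a single affine layer $z = w x + c$ followed by one BN layer with stored statistics `bn_mean`/`bn_std`, a squared-error objective, and hand-derived gradients with plain gradient descent; a real network would need automatic differentiation, and none of the names here come from the ZeroQ code.

```python
import math
import random

def batch_stats(values):
    # Population mean and standard deviation of a batch.
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return mean, std

def distill(w, c, bn_mean, bn_std, batch=32, steps=200, lr=1.0, seed=0):
    # Start from Gaussian noise and update the *input* batch x (not the weights)
    # so that the statistics of z = w * x + c match the stored BN statistics.
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(batch)]
    for _ in range(steps):
        z = [w * v + c for v in x]
        mean, std = batch_stats(z)
        # Loss = (mean - bn_mean)^2 + (std - bn_std)^2, differentiated w.r.t. x_j.
        grad = [2 * (mean - bn_mean) * w / batch
                + 2 * (std - bn_std) * w * (zj - mean) / (batch * std)
                for zj in z]
        x = [v - lr * g for v, g in zip(x, grad)]
    return x

# Hypothetical layer and BN statistics, chosen only for illustration.
w, c = 2.0, 1.0
distilled = distill(w, c, bn_mean=0.5, bn_std=3.0)
mean, std = batch_stats([w * v + c for v in distilled])
```

The learning rate is tuned to this toy setting (the per-step contraction depends on `w` and `batch`); other values of `w` may need a smaller `lr`.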

III-B Pareto Frontier

As mentioned before, the main challenge for mixed-precision quantization is to determine the exact bit precision configuration for the entire NN. For an $L$-layer model with $m$ possible precision options, the mixed-precision search space, denoted as $\mathcal{S}$, has an exponential size of $m^L$. For example, for ResNet50 with just three bit precisions of $\{2,4,8\}$ (i.e., $m=3$), the search space contains $7.2\times 10^{23}$ configurations. However, we can use the sensitivity metric in Eq. 2 to reduce this search space. The main idea is to use higher bit precision for layers that are more sensitive, and lower bit precision for layers that are less sensitive. This gives us a relative ordering on the number of bits. To compute the precise bit precision setting, we propose a Pareto frontier approach similar to the method used in prior work.
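
The quoted search-space size can be checked with a one-line computation. The layer count `L = 50` is an assumption made here because it reproduces the stated $7.2\times 10^{23}$ figure.

```python
m = 3   # bit-width options {2, 4, 8}
L = 50  # assumed number of quantized layers in ResNet50
search_space = m ** L
formatted = f"{search_space:.1e}"
print(formatted)  # 7.2e+23
```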

The Pareto frontier method works as follows. For a target quantized model size of $S_{target}$, we measure the overall sensitivity of the model for each bit precision configuration that results in the $S_{target}$ model size. We choose the bit-precision setting that corresponds to the minimum overall sensitivity. In more detail, we solve the following optimization problem:

$$\min_{\{k_i\}_{i=1}^{L}}\ \Omega_{sum}=\sum_{i=1}^{L}\Omega_i(k_i)\quad \text{s.t.}\quad \sum_{i=1}^{L}P_i\,k_i\le S_{target}, \tag{4}$$

where $k_i$ is the quantization precision of the $i$-th layer, and $P_i$ is the parameter size for the $i$-th layer. Note that here we make the simplifying assumption that the sensitivities of different layers are independent of the choice of bits for other layers (hence $\Omega_i$ only depends on the bit precision for the $i$-th layer). Please see Section -A, where we describe how we relax this assumption without having to perform an exponentially large computation for the sensitivity for each bit precision setting. Using a dynamic programming method, we can solve for the best setting for different $S_{target}$ together, and then we plot the Pareto frontier. An example is shown in Figure 4 for the ResNet50 model, where the x-axis is the model size for each bit precision configuration, and the y-axis is the overall model perturbation/sensitivity. Each blue dot in the figure represents a mixed-precision configuration. In ZeroQ, we choose the bit precision setting that has the smallest perturbation under a specific model size constraint.

Importantly, note that the computational overhead of computing the Pareto frontier is $\mathcal{O}(mL)$. This is because we compute the sensitivity of each layer separately from other layers. That is, we compute the sensitivity $\Omega_i$ ($i=1,2,\ldots,L$) with respect to all $m$ different precision options, which leads to the $\mathcal{O}(mL)$ computational complexity. We should note that this Pareto frontier approach (including the dynamic programming optimizer) is not theoretically guaranteed to result in the best possible configuration out of all possibilities in the exponentially large search space. However, our results show that the final mixed-precision configuration achieves state-of-the-art accuracy with small performance loss, as compared to the original model in single precision.
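
Under the independence assumption, the size-constrained selection can be sketched as a small dynamic program over layers. This is a minimal sketch, not the authors' optimizer: it keys states by exact model size in bits (which can grow large for real networks, where sizes would typically be bucketed), and the sensitivity table and parameter counts in the example are made up for illustration.

```python
def pareto_frontier(sens, sizes, bit_options=(2, 4, 8)):
    # sens[i][k]: sensitivity of layer i at k bits; sizes[i]: parameter count P_i.
    # best maps total size in bits -> (minimal summed sensitivity, bit assignment).
    best = {0: (0.0, ())}
    for i, p in enumerate(sizes):
        nxt = {}
        for size, (omega, bits) in best.items():
            for k in bit_options:
                s, o = size + p * k, omega + sens[i][k]
                if s not in nxt or o < nxt[s][0]:
                    nxt[s] = (o, bits + (k,))
        best = nxt
    # Keep non-dominated points: a larger model must strictly lower sensitivity.
    frontier, lowest = [], float("inf")
    for size in sorted(best):
        omega, bits = best[size]
        if omega < lowest:
            frontier.append((size, omega, bits))
            lowest = omega
    return frontier

def select(frontier, s_target):
    # Lowest-sensitivity configuration whose size does not exceed s_target.
    feasible = [p for p in frontier if p[0] <= s_target]
    return min(feasible, key=lambda p: p[1]) if feasible else None

# Hypothetical 3-layer example: per-layer sensitivities and parameter counts.
sens = [{2: 0.9, 4: 0.3, 8: 0.0},
        {2: 0.2, 4: 0.1, 8: 0.0},
        {2: 0.5, 4: 0.05, 8: 0.0}]
sizes = [10, 40, 20]
frontier = pareto_frontier(sens, sizes)
choice = select(frontier, s_target=280)  # budget: 4 bits per weight on average
```

In this example the large but insensitive middle layer is pushed to 2 bits, which frees enough budget to keep the small, sensitive first layer at 8 bits.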

IV Results

In this section, we extensively test ZeroQ on a wide range of models and datasets. We start by discussing the zero-shot quantization of ResNet18/50, MobileNetV2, and ShuffleNet on ImageNet in Section IV-A. Additional results for quantizing ResNet152, InceptionV3, and SqueezeNext on ImageNet, as well as ResNet20 on CIFAR-10, are provided in Appendix -C. We also present results for object detection using RetinaNet tested on the Microsoft COCO dataset in Section IV-B. We emphasize that all of the results achieved by ZeroQ are 100% zero-shot without any need for fine-tuning.

We also emphasize that we used exactly the same hyper-parameters (e.g., the number of iterations to generate Distilled Data) for all experiments, including the results on Microsoft COCO dataset.

We start by discussing the results on the ImageNet dataset. For each model, after generating Distilled Data based on Eq. 3, we compute the sensitivity of each layer using Eq. 2 for different bit precision. Next, we use Eq. 4 and the Pareto frontier introduced in Section III-B to get the best bit-precision configuration based on the overall sensitivity for a given model size constraint. We denote the quantized results as WwAh where w and h denote the bit precision used for weights and activations of the NN model.

We present zero-shot quantization results for ResNet50 in Table Ia. As one can see, for W8A8 (i.e., 8-bit quantization for both weights and activations), ZeroQ results in only 0.05% accuracy degradation. Further quantizing the model to W6A6, ZeroQ achieves 77.43% accuracy, which is 2.63% higher than OCS, even though our model is slightly smaller (18.27MB as compared to 18.46MB for OCS). Importantly, note that OCS requires access to the training data, while ZeroQ does not use any training/validation data. We show that we can further quantize ResNet50 down to just 12.17MB with mixed-precision quantization, and we obtain 75.80% accuracy. Note that this is 0.82% higher than OMSE with access to training data and 5.74% higher than the zero-shot version of OMSE. Importantly, note that OMSE keeps activations at 32 bits, while for this comparison our results use 8 bits for the activations (i.e., a $4\times$ smaller activation memory footprint than OMSE). For comparison, we include results for PACT, a standard quantization method that requires access to training data and also requires fine-tuning.

An important feature of the ZeroQ framework is that it can perform the quantization with very low computational overhead. For example, the end-to-end quantization of ResNet50 takes less than 30 seconds on a system with 8 Tesla V100 GPUs (one epoch of training on this system takes 100 minutes). In terms of timing breakdown, it takes 3s to generate the Distilled Data, 12s to compute the sensitivity for all layers of ResNet50, and 14s to perform Pareto frontier optimization.
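
The timing breakdown can be cross-checked against the percentages quoted elsewhere in the paper (0.05%, 0.2%, and 0.5% of one epoch). The snippet below only redoes that arithmetic from the numbers stated in the text; the dictionary keys are illustrative labels.

```python
epoch_seconds = 100 * 60  # one training epoch of ResNet50 on the 8-V100 system
stages = {"distilled_data": 3, "sensitivity": 12, "pareto_frontier": 14}

total_seconds = sum(stages.values())
# Each stage as a percentage of one epoch of training time.
percent_of_epoch = {name: 100 * s / epoch_seconds for name, s in stages.items()}
total_percent = 100 * total_seconds / epoch_seconds

print(total_seconds, round(total_percent, 2))  # 29 0.48
```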

We also show ZeroQ results on MobileNetV2 and compare them with both DFQ and fine-tuning based methods, as shown in Table Ib. For W8A8, ZeroQ has less than 0.12% accuracy drop as compared to the baseline, and it achieves 1.71% higher accuracy as compared to the DFQ method.

Further compressing the model to W6A6 with mixed-precision quantization for weights, ZeroQ can still outperform Integer-Only by 1.95% accuracy, even though ZeroQ does not use any data or fine-tuning. ZeroQ can achieve 68.83% accuracy even when the weight compression is $8\times$, which corresponds to using 4-bit quantization for weights on average.

We also experimented with percentile based clipping to determine the quantization range (please see Section -D for details). The results corresponding to percentile based clipping are denoted as ZeroQ$^{\dagger}$ and reported in Table I. We found that using percentile based clipping is helpful for low precision quantization. Other choices for clipping methods have been proposed in the literature. Here we note that our approach is orthogonal to these improvements and that ZeroQ could be combined with these methods.

We also apply ZeroQ to quantize efficient and highly compact models such as ShuffleNet, whose model size is only 5.94MB. To the best of our knowledge, there exist no prior zero-shot quantization results for this model. ZeroQ achieves a small accuracy drop of 0.13% for W8A8. We can further quantize the model down to an average of 4 bits for weights, which achieves a model size of only 0.73MB, with an accuracy of 58.96%.

We also compare with the recent Data-Free Compression (DFC) method. There are two main differences between ZeroQ and DFC. First, DFC proposes a fine-tuning method to recover accuracy for ultra-low precision cases. This can be time-consuming and, as we show, it is not necessary. In particular, we show that with mixed-precision quantization one can actually achieve higher accuracy without any need for fine-tuning. This is shown in Table III for ResNet18 quantization on ImageNet. In particular, note the results for W4A4, where the DFC method without fine-tuning results in more than 15% accuracy drop with a final accuracy of 55.49%. For this reason, the authors propose a method with post-quantization training, which can boost the accuracy to 68.05% using W4A4 for intermediate layers, and 8 bits for the first and last layers. In contrast, ZeroQ achieves a higher accuracy of 69.05% without any need for fine-tuning. Furthermore, the end-to-end zero-shot quantization of ResNet18 takes only 12s on an 8-V100 system (equivalent to 0.4% of the 45 minutes needed for one epoch of training of ResNet18 on ImageNet). Second, the DFC method uses Inceptionism to facilitate the generation of data with random labels, but it is hard to extend this to object detection and image segmentation tasks.

We include additional results of quantized ResNet152, InceptionV3, and SqueezeNext on ImageNet, as well as ResNet20 on CIFAR-10, in Appendix -C.

IV-B Microsoft COCO

Object detection is often much more complicated than ImageNet classification. To demonstrate the flexibility of our approach, we also test ZeroQ on an object detection task on the Microsoft COCO dataset. RetinaNet is a state-of-the-art single-stage detector, and we use the pretrained model with ResNet50 as the backbone, which can achieve 36.4 mAP. Here we use the standard mAP 0.5:0.05:0.95 metric on the COCO dataset.

One of the main differences between RetinaNet and the NNs we tested previously on ImageNet is that some convolutional layers in RetinaNet are not followed by BN layers. This is because of the presence of a feature pyramid network (FPN), and it means that the number of BN layers is slightly smaller than that of convolutional layers. However, this is not a limitation and the ZeroQ framework still works well. Specifically, we extract the backbone of RetinaNet and create Distilled Data. Afterwards, we feed the Distilled Data into RetinaNet to measure the sensitivity as well as to determine the activation range for the entire NN. This is followed by optimizing for the Pareto frontier, as discussed earlier.

The results are presented in Table II. We can see that for W8A8 ZeroQ has no performance degradation. For W6A6, ZeroQ achieves 35.9 mAP. Further quantizing the model to an average of 4 bits for the weights, ZeroQ achieves 33.7 mAP. Our results are comparable to the recent results of FQN, even though it is not a zero-shot quantization method (i.e., it uses the full training dataset and requires fine-tuning). However, it should be mentioned that ZeroQ keeps the activations at 8 bits, while FQN uses 4-bit activations.

V Ablation Study

Here, we present an ablation study for the two components of ZeroQ: (i) the Distilled Data generated by Eq. 3 to help sensitivity analysis and determine activation clipping range; and (ii) the Pareto frontier method for automatic bit-precision assignment. Below we discuss the ablation study for each part separately.

In this work, all the sensitivity analysis and the activation range are computed on the Distilled Data. Here, we perform an ablation study on the effectiveness of Distilled Data as compared to using just Gaussian data. We use three different types of data sources, (i) Gaussian data with mean “0” and variance “1”, (ii) data from training dataset, (iii) our Distilled Data, as the input data to measure the sensitivity and to determine the activation range. We quantize ResNet50 and MobileNetV2 to an average of 4-bit for weights and 8-bit for activations, and we report results in Table IV.

For ResNet50, using training data results in 75.95% testing accuracy. With Gaussian data, the performance degrades to 75.44%. ZeroQ can alleviate the gap between Gaussian data and training data and achieves 75.80%. For more compact/efficient models such as MobileNetV2, the gap between using Gaussian data and using training data increases to 2.33%. ZeroQ can still achieve 68.83%, which is only 0.23% lower than using training data. Additional results for ResNet18, ShuffleNet and SqueezeNext are shown in Table VIII.

V-B Sensitivity Analysis

Here, we perform an ablation study to show that the bit-precision selection of the Pareto frontier method works well. To test this, we compare ZeroQ with two cases: one where we choose a bit configuration that corresponds to maximizing $\Omega_{sum}$ (which is the opposite of the minimization that we do in ZeroQ), and one where we use random bit precision for different layers. We denote these two methods as Inverse and Random. The results for quantizing weights to an average of 4 bits and activations to 8 bits are shown in Table V. We report the best and worst testing accuracy as well as the mean and variance of the results out of 20 tests. It can be seen that ZeroQ results in significantly better testing performance as compared to Inverse and Random. Another noticeable point is that the best configuration (i.e., minimum $\Omega_{sum}$) outperforms the worst case among the top-20 configurations from ZeroQ by 0.18%, which reflects the advantage of the Pareto frontier method. Also, notice the small variance of all configurations generated by ZeroQ.

VI Conclusions

We have introduced ZeroQ, a novel post-training quantization method that does not require any access to the training/validation data. Our approach uses a novel method to distill an input data distribution to match the statistics in the batch normalization layers of the model. We show that this Distilled Data is very effective in capturing the sensitivity of different layers of the network. Furthermore, we present a Pareto frontier method to select automatically the bit-precision configuration for mixed-precision settings. An important aspect of ZeroQ is its low computational overhead. For example, the end-to-end zero-shot quantization time of ResNet50 is less than 30 seconds on an 8-V100 GPU system. We extensively test ZeroQ on various datasets and models. This includes various ResNets, InceptionV3, MobileNetV2, ShuffleNet, and SqueezeNext on ImageNet, ResNet20 on CIFAR-10, and even RetinaNet for object detection on the Microsoft COCO dataset. We consistently achieve higher accuracy with the same or smaller model size compared to previous post-training quantization methods. All results show that ZeroQ can exceed previous zero-shot quantization methods. We have open-sourced the ZeroQ framework.


-A Pareto Frontier

In Section III-B, we presented how we compute the overall sensitivity incurred by performing mixed-precision quantization. In particular, in Eq. 4 we made the simplifying assumption that the sensitivity of each layer to quantization is independent of the sensitivity of other layers (we refer to this as the independence assumption). This is clearly not the case in practice. One could instead directly compute the sensitivity for each possible bit-precision configuration without any approximation, but this is not feasible as there are $m^L$ possible bit-precision configurations. Here we discuss our approach, which falls in between these two extremes. Instead of computing the sensitivity of the entire network at once, we break the network into $L/a$ groups, with each group containing $a$ layers. Furthermore, we break the x-axis (model size) of the Pareto frontier plot into $b$ intervals in each of the steps mentioned below.

-B Results on CIFAR-10

In this section, we show the results of our ZeroQ on CIFAR-10 dataset with ResNet20. See Table VI.

-C Extra Results on ImageNet

In this section, we show extra results for our ZeroQ on ImageNet with ResNet152, InceptionV3, and SqueezeNext in Table VII. We also show more results to illustrate the effect of Distilled Data compared with Gaussian noise in Table VIII.

-D Clipping

Quantization maps a single-precision tensor $z$ to a low-precision tensor $Q(z)$. This includes two steps: 1) clipping the original tensor to the range $[a, b]$, and then 2) mapping this range to the integer range $[0, 2^k-1]$. A simple choice, used by conventional quantization methods, is to set $[a, b]=[\min(z), \max(z)]$. Recently, more effort has been spent on choosing the “optimal” range $[a, b]$; these are the so-called clipping methods.
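
The two steps above can be sketched as follows. This is a minimal illustration assuming round-to-nearest and a uniform step size; the function names `quantize` and `dequantize` are hypothetical and the exact rounding rule of the original method is not specified in this section.

```python
def quantize(z, k, a=None, b=None):
    # Step 1: clip to [a, b]; by default the full range [min(z), max(z)].
    a = min(z) if a is None else a
    b = max(z) if b is None else b
    if b <= a:
        raise ValueError("clipping range must satisfy a < b")
    scale = (b - a) / (2 ** k - 1)
    clipped = [min(max(v, a), b) for v in z]
    # Step 2: map the clipped values to integers in [0, 2^k - 1].
    return [round((v - a) / scale) for v in clipped], scale, a

def dequantize(q, scale, a):
    # Map integer codes back to approximate real values.
    return [a + scale * v for v in q]

z = [-1.0, -0.5, 0.0, 0.5, 1.0]
q, scale, a = quantize(z, k=2)
z_hat = dequantize(q, scale, a)
```

With min/max clipping, the reconstruction error of every element is at most half a quantization step, which is why a single outlier that stretches $[a, b]$ hurts all other values.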

In all of our experiments above, we use the simplest way, i.e., $[a, b]=[\min(z), \max(z)]$, to conduct the quantization. The reason behind this is two-fold: (i) we want to show the efficacy of ZeroQ without the assistance of any other technique; (ii) some of the proposed methods need hyper-parameter tuning to get the optimal $a$ and $b$, which can be costly. However, we show that the performance of ZeroQ can be further boosted by the weight clipping method, if the slightly higher computational overhead can be afforded. In particular, we use the previously proposed “percentile” method. This method directly clips a single-precision weight tensor to its $\gamma$-th and $(1-\gamma)$-th percentiles (we refer the reader to the original work for more details). As shown in Table I and Table III, ZeroQ can be further improved by weight clipping.
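
Percentile clipping can be sketched as below. The sketch assumes that $\gamma$ is given as a fraction (e.g., 0.01) and uses a simple nearest-rank percentile; the exact percentile definition and the value of $\gamma$ used in the experiments are not stated here, and the function names are hypothetical.

```python
def percentile_range(weights, gamma):
    # Nearest-rank gamma-th and (1 - gamma)-th percentiles, gamma in [0, 0.5).
    ordered = sorted(weights)
    n = len(ordered)
    lo_idx = int(gamma * (n - 1))
    hi_idx = n - 1 - lo_idx
    return ordered[lo_idx], ordered[hi_idx]

def clip(weights, a, b):
    # Saturate values outside [a, b] instead of stretching the range for them.
    return [min(max(w, a), b) for w in weights]

# Illustrative weights: a dense cluster plus two extreme outliers.
weights = [float(v) for v in range(-50, 51)] + [1000.0, -1000.0]
a, b = percentile_range(weights, gamma=0.01)
clipped = clip(weights, a, b)
```

Here the two outliers are saturated to the cluster boundary, so the quantization step is set by the bulk of the weights rather than by the extremes.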