Incremental Network Quantization: Towards Lossless CNNs with Low-Precision Weights

Aojun Zhou, Anbang Yao, Yiwen Guo, Lin Xu, Yurong Chen

Introduction

Despite these tremendous advances, CNN quantization still remains an open problem due to two critical issues which have not been well resolved yet, especially under scenarios of using low-precision weights for quantization. The first issue is the non-negligible accuracy loss for CNN quantization methods, and the other issue is the increased number of training iterations for ensuring convergence. In this paper, we attempt to address these two issues by presenting a novel incremental network quantization (INQ) method.

In our INQ, there is no assumption on the CNN architecture, and its basic goal is to efficiently convert any pre-trained full-precision (i.e., 32-bit floating-point) CNN model into a low-precision version whose weights are constrained to be either powers of two or zero. The advantage of this kind of low-precision model is that the original floating-point multiplication operations can be replaced by cheaper binary bit shift operations on dedicated hardware such as FPGAs. We noticed that most existing network quantization methods adopt a global strategy in which all the weights are simultaneously converted to low-precision ones (which are usually still of floating-point type). That is, they do not consider the different importance of network weights, which limits the room for retaining network accuracy. In sharp contrast to existing methods, our INQ carefully handles the drop in model accuracy caused by network quantization. To be more specific, it incorporates three interdependent operations: weight partition, group-wise quantization and re-training. Weight partition uses a pruning-inspired measure (Han et al., 2015; Guo et al., 2016) to divide the weights in each layer of a pre-trained full-precision CNN model into two disjoint groups which play complementary roles in our INQ. The weights in the first group are quantized to be either powers of two or zero by a variable-length encoding method, forming a low-precision base for the original model. The weights in the other group are re-trained while keeping the quantized weights fixed, compensating for the accuracy loss resulting from the quantization. Furthermore, these three operations are repeated on the latest re-trained weight group in an iterative manner until all the weights are quantized, acting as an incremental network quantization and accuracy enhancement procedure (as illustrated in Figure 1).

The main insight of our INQ is that a compact combination of the proposed weight partition, group-wise quantization and re-training operations has the potential to yield a lossless low-precision CNN model from any full-precision reference. We conduct extensive experiments on the ImageNet large scale classification task using almost all known deep CNN architectures to validate the effectiveness of our method. We show that: (1) For AlexNet, VGG-16, GoogleNet and ResNets with 5-bit quantization, INQ achieves improved accuracy in comparison with their respective full-precision baselines. The absolute top-1 accuracy gain ranges from 0.13% to 2.28%, and the absolute top-5 accuracy gain ranges from 0.23% to 1.65%. (2) INQ converges easily in training. In general, re-training for fewer than 8 epochs consistently generated a lossless model with 5-bit weights in the experiments. (3) Taking ResNet-18 as an example, our quantized models with 4-bit, 3-bit and 2-bit ternary weights also have improved or very similar accuracy compared with the 32-bit floating-point baseline. (4) Taking AlexNet as an example, the combination of our network pruning and INQ outperforms the deep compression method (Han et al., 2016) by significant margins.

Incremental Network Quantization

In this section, we clarify the insight of our INQ, describe its key components, and detail its implementation.

Suppose a pre-trained full-precision (i.e., 32-bit floating-point) CNN model can be represented by $\{\mathbf{W}_{l}:1\leq l\leq L\}$, where $\mathbf{W}_{l}$ denotes the weight set of the $l^{th}$ layer, and $L$ denotes the number of learnable layers in the model. To simplify the explanation, we only consider convolutional layers and fully connected layers. For CNN models like AlexNet, VGG-16, GoogleNet and ResNets as tested in this paper, $\mathbf{W}_{l}$ can be a 4D tensor for a convolutional layer, or a 2D matrix for a fully connected layer. For simplicity, the dimension difference is not considered in the expressions here. Given a pre-trained full-precision CNN model, the main goal of our INQ is to convert all 32-bit floating-point weights to be either powers of two or zero without loss of model accuracy. Besides, we also attempt to explore the limit of the expected bit-width under the premise of guaranteeing lossless network quantization. Here, we start with our basic network quantization method on how to convert $\mathbf{W}_{l}$ to a low-precision version $\widehat{\mathbf{W}}_{l}$, each of whose entries is chosen from

$$\mathbf{P}_{l}=\{\pm 2^{n_{1}},\cdots,\pm 2^{n_{2}},0\},\tag{1}$$

where $n_{1}$ and $n_{2}$ are two integer numbers, and they satisfy $n_{2}\leq n_{1}$. Mathematically, $n_{1}$ and $n_{2}$ help to bound $\mathbf{P}_{l}$ in the sense that its non-zero elements are constrained to be in the range of either $[-2^{n_{1}},-2^{n_{2}}]$ or $[2^{n_{2}},2^{n_{1}}]$. That is, network weights with absolute values smaller than $2^{n_{2}}$ will be pruned away (i.e., set to zero) in the final low-precision model. Obviously, the problem is how to determine $n_{1}$ and $n_{2}$. In our INQ, the expected bit-width $b$ for storing the indices in $\mathbf{P}_{l}$ is set beforehand, thus the only hyper-parameter that needs to be determined is $n_{1}$, because $n_{2}$ can be naturally computed once $b$ and $n_{1}$ are available. Here, $n_{1}$ is calculated by using a tricky yet practically effective formula as

$$n_{1}=\mathrm{floor}\big(\log_{2}(4s/3)\big),\tag{2}$$

where $\mathrm{floor}(\cdot)$ denotes the round-down operation and $s$ is calculated by using

$$s=\max\big(\mathrm{abs}(\mathbf{W}_{l})\big).\tag{3}$$

With $b$ and $n_{1}$ available, $n_{2}$ follows as $n_{2}=n_{1}+1-2^{(b-1)}/2$. For instance, $b=3$ and $n_{1}=-1$ give $n_{2}=-2$.

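Once $\mathbf{P}_{l}$ is determined, we further use the ladder of powers to convert every entry of $\mathbf{W}_{l}$ into a low-precision one by using

$$\widehat{\mathbf{W}}_{l}(i,j)=\begin{cases}\beta\,\mathrm{sgn}(\mathbf{W}_{l}(i,j)) & \text{if } (\alpha+\beta)/2\leq \mathrm{abs}(\mathbf{W}_{l}(i,j))<3\beta/2\\ 0 & \text{otherwise},\end{cases}\tag{4}$$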

where $\alpha$ and $\beta$ are two adjacent elements in the sorted $\mathbf{P}_{l}$, so that the above equation acts as a numerical rounding to the quantum values. It should be emphasized that the factor $4/3$ in Equation (2) is set to make sure that all the elements in $\mathbf{P}_{l}$ correspond with the quantization rule defined in Equation (4). In other words, the factor $4/3$ in Equation (2) highly correlates with the factor $3/2$ in Equation (4).
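The rounding rule can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation: the function names are hypothetical, and it assumes $n_{1}=\mathrm{floor}(\log_{2}(4s/3))$ with $s=\max(\mathrm{abs}(\mathbf{W}_{l}))$, $n_{2}=n_{1}+1-2^{(b-1)}/2$, and that zero acts as the neighbor $\alpha$ of the smallest power of two, so magnitudes below $2^{n_{2}}/2$ map to zero. A different lowest threshold (for example $2^{n_{2}}$ itself) would be an equally plausible reading of the text.

```python
import numpy as np

def candidate_exponents(weights, b):
    """Return (n1, n2) for one layer under the assumptions stated above."""
    s = np.max(np.abs(weights))
    n1 = int(np.floor(np.log2(4.0 * s / 3.0)))
    # 2**(b-1) non-zero values are split evenly between the two signs.
    n2 = n1 + 1 - 2 ** (b - 1) // 2
    return n1, n2

def quantize(weights, n1, n2):
    """Round each weight to sign * 2**k with n2 <= k <= n1, or to zero."""
    w = np.asarray(weights, dtype=np.float64)
    mag = np.abs(w)
    out = np.zeros_like(w)
    for k in range(n2, n1 + 1):
        beta = 2.0 ** k
        # alpha is 0 for the smallest level (assumption), beta / 2 otherwise,
        # so (alpha + beta) / 2 is beta / 2 or 3 * beta / 4.
        lower = beta / 2.0 if k == n2 else 0.75 * beta
        upper = 1.5 * beta
        hit = (mag >= lower) & (mag < upper)
        out[hit] = np.sign(w[hit]) * beta
    return out

w = np.array([0.9, -0.4, 0.26, 0.05, -0.01])
n1, n2 = candidate_exponents(w, b=3)
print(n1, n2, quantize(w, n1, n2))
```

Because $s<3\cdot 2^{n_{1}}/2$ follows from the choice of $n_{1}$, the largest weight always falls inside the top interval, which is one way to see why the factors $4/3$ and $3/2$ belong together.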

Here, an important thing we want to clarify is the definition of the expected bit-width $b$. Taking 5-bit quantization as an example, since the zero value cannot be written as a power of two, we use 1 bit to represent the zero value, and the remaining 4 bits to represent at most 16 different values for the powers of two. That is, the number of candidate quantum values is at most $2^{b-1}+1$, so our quantization method actually adopts a variable-length encoding scheme. It is clear that the quantization described above is performed in a linear scale. An alternative solution is to perform the quantization in the log scale. Although it may also be effective, it should be a little more difficult to implement and may cause some extra computational overhead in comparison to our method.
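As a quick check of the counting argument, the bound $2^{b-1}+1$ can be tabulated for the bit-widths used later in the paper. The helper name is hypothetical.

```python
def num_candidate_values(b):
    """Upper bound on distinct quantum values for expected bit-width b:
    2**(b-1) signed powers of two, plus zero."""
    return 2 ** (b - 1) + 1

for b in (5, 4, 3, 2):
    print(b, num_candidate_values(b))
```

For $b=2$ the bound is three values, which matches the ternary case.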

2.2 Incremental Quantization Strategy

We can naturally use the above described method to quantize any pre-trained full-precision CNN model. However, noticeable accuracy loss appeared in the experiments when using small bit-width values (e.g., 5-bit, 4-bit, 3-bit and 2-bit).

In the literature, there are many existing network quantization works such as HashedNet (Chen et al., 2015b), vector quantization (Gong et al., 2014), fixed-point representation (Vanhoucke et al., 2011; Gupta et al., 2015), BinaryConnect (Courbariaux et al., 2015), BinaryNet (Courbariaux et al., 2016), XNOR-Net (Rastegari et al., 2016), TWN (Li & Liu, 2016), DoReFa-Net (Zhou et al., 2016) and QNN (Hubara et al., 2016). Similar to our basic network quantization method, they also suffer from non-negligible accuracy loss on deep CNNs, especially when being applied on the ImageNet large scale classification dataset. For all these methods, a common fact is that they adopt a global strategy in which all the weights are simultaneously converted into low-precision ones, which in turn causes accuracy loss. Compared with the methods focusing on the pre-trained models, accuracy loss becomes worse for the methods such as XNOR-Net, TWN, DoReFa-Net and QNN which intend to train low-precision CNNs from scratch.

Recall that our main goal is to achieve lossless low-precision quantization for any pre-trained full-precision CNN model with no assumption on its architecture. To this end, our INQ pays special attention to suppressing the loss in model accuracy caused by quantization. We are partially inspired by the latest progress in network pruning (Han et al., 2015; Guo et al., 2016). In these methods, the accuracy loss from removing less important network weights of a pre-trained neural network model can be well compensated by subsequent re-training steps. Therefore, we conjecture that the nature of changing network weight importance is critical to achieve lossless network quantization.

Based on this assumption, we present INQ, which incorporates three interdependent operations: weight partition, group-wise quantization and re-training. Weight partition divides the weights in each layer of a pre-trained full-precision CNN model into two disjoint groups which play complementary roles in our INQ. The weights in the first group are responsible for forming a low-precision base for the original model, thus they are quantized by using Equation (4). The weights in the second group adapt to compensate for the loss in model accuracy, thus they are the ones to be re-trained. Once the first run of the quantization and re-training operations is finished, all three operations are further conducted on the second weight group in an iterative manner, until all the weights are converted to be either powers of two or zero, acting as an incremental network quantization and accuracy enhancement procedure. As a result, accuracy loss under low-precision CNN quantization can be well suppressed by our INQ. Illustrative results at iterative steps of our INQ are provided in Figure 2.

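For the $l^{th}$ layer, weight partition can be defined as

$$\mathbf{A}^{(1)}_{l}\cup\mathbf{A}^{(2)}_{l}=\{\mathbf{W}_{l}(i,j)\},\quad \mathbf{A}^{(1)}_{l}\cap\mathbf{A}^{(2)}_{l}=\emptyset,\tag{5}$$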

where $\mathbf{A}^{(1)}_{l}$ denotes the first weight group that needs to be quantized, and $\mathbf{A}^{(2)}_{l}$ denotes the other weight group that needs to be re-trained. We leave the strategies for group partition to be chosen in the experiment section. Here, we define a binary matrix $\mathbf{T}_{l}$ to help distinguish the above two categories of weights. That is, $\mathbf{T}_{l}(i,j)=0$ means $\mathbf{W}_{l}(i,j)\in\mathbf{A}^{(1)}_{l}$, and $\mathbf{T}_{l}(i,j)=1$ means $\mathbf{W}_{l}(i,j)\in\mathbf{A}^{(2)}_{l}$.
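A minimal sketch of how the binary matrix $\mathbf{T}_{l}$ could be maintained under a magnitude-based (pruning-inspired) partition is given below. The function name and the `accumulated_portion` argument are hypothetical; the sketch assumes that the largest-magnitude weights still in the re-trained group are moved to the quantized group first, and that entries already marked 0 stay fixed.

```python
import numpy as np

def pruning_inspired_mask(weights, mask, accumulated_portion):
    """Return an updated T (0 = quantized group, 1 = re-trained group).

    Entries of `mask` that are already 0 stay 0. The largest-magnitude weights
    among the rest are moved to the quantized group until
    `accumulated_portion` of all entries belong to it.
    """
    mag = np.abs(np.asarray(weights, dtype=np.float64)).ravel()
    t = np.asarray(mask).ravel().copy()
    target = int(round(accumulated_portion * mag.size))
    need = target - int(np.sum(t == 0))
    if need > 0:
        free = np.flatnonzero(t == 1)
        # Sort the still-free entries by decreasing magnitude.
        order = free[np.argsort(-mag[free], kind="stable")]
        t[order[:need]] = 0
    return t.reshape(np.shape(weights))

w = np.array([[0.9, -0.1], [0.3, -0.6]])
t = pruning_inspired_mask(w, np.ones((2, 2), dtype=int), 0.5)
print(t)
```

Calling the function again with a larger accumulated portion only touches entries that are still 1, which mirrors the requirement that the two groups stay disjoint across iterations.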

2.3 Incremental Network Quantization Algorithm

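Now, we come to the training method. Taking the $l^{th}$ layer as an example, the basic optimization problem of making its weights either powers of two or zero can be expressed as

$$\min_{\mathbf{W}_{l}}\; E(\mathbf{W}_{l})=L(\mathbf{W}_{l})+\lambda R(\mathbf{W}_{l})\quad \text{s.t.}\quad \mathbf{W}_{l}(i,j)\in\mathbf{P}_{l},\; 1\leq l\leq L,\tag{6}$$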

where $L(\mathbf{W}_{l})$ is the network loss, $R(\mathbf{W}_{l})$ is the regularization term, $\lambda$ is a positive coefficient, and the constraint term indicates that each weight entry $\mathbf{W}_{l}(i,j)$ should be chosen from the set $\mathbf{P}_{l}$ consisting of a fixed number of powers of two plus zero. Directly solving the above optimization problem when training from scratch is challenging, since it easily runs into convergence problems.

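By performing weight partition and group-wise quantization operations beforehand, the optimization problem defined in (6) can be reshaped into an easier version. That is, we only need to optimize the following objective function

$$\min_{\mathbf{W}_{l}}\; E(\mathbf{W}_{l})=L(\mathbf{W}_{l})+\lambda R(\mathbf{W}_{l})\quad \text{s.t.}\quad \mathbf{W}_{l}(i,j)\in\mathbf{P}_{l}\ \text{if}\ \mathbf{T}_{l}(i,j)=0,\; 1\leq l\leq L,\tag{7}$$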

where $\mathbf{P}_{l}$ is determined at the group-wise quantization operation, and the binary matrix $\mathbf{T}_{l}$ acts as a mask which is determined by the weight partition operation. Since $\mathbf{P}_{l}$ and $\mathbf{T}_{l}$ are known, the optimization problem (7) can be solved using the popular stochastic gradient descent (SGD) method. That is, in INQ, we can get the update scheme for the re-training as

$$\mathbf{W}_{l}(i,j)\leftarrow\mathbf{W}_{l}(i,j)-\gamma\frac{\partial E}{\partial(\mathbf{W}_{l}(i,j))}\mathbf{T}_{l}(i,j),\tag{8}$$

where $\gamma$ is a positive learning rate. Note that the binary matrix $\mathbf{T}_{l}$ forces zero update to the weights that have been quantized. That is, only the weights that still hold floating-point values are updated, akin to the latest pruning methods (Han et al., 2015; Guo et al., 2016) in which only the weights that are not currently removed are re-trained to enhance network accuracy. The whole procedure of our INQ is summarized as Algorithm 1.
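The masked update is a one-line change to a plain SGD step. The sketch below uses hypothetical names and omits momentum and weight decay for brevity; it only shows how multiplying the gradient by $\mathbf{T}_{l}$ freezes the quantized entries.

```python
import numpy as np

def masked_sgd_step(weights, grad, mask, lr):
    """W <- W - lr * grad * T; entries with T == 0 (already quantized) do not move."""
    return weights - lr * grad * mask

w = np.array([0.5, 0.3])        # first entry already quantized to 2**-1
g = np.array([1.0, 1.0])        # gradient of E with respect to each entry
t = np.array([0.0, 1.0])        # T: 0 = fixed, 1 = still being re-trained
print(masked_sgd_step(w, g, t, lr=0.1))
```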

We would like to highlight that the merits of our INQ lie in three aspects: (1) Weight partition introduces importance-aware weight quantization. (2) Group-wise weight quantization introduces much less accuracy loss than simultaneously quantizing all the network weights, thus giving re-training more room to recover model accuracy. (3) By integrating the operations of weight partition, group-wise quantization and re-training into a nested loop, our INQ has the potential to obtain a lossless low-precision CNN model from the pre-trained full-precision reference.
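To make the nested loop concrete, here is a toy sketch on a single weight vector. It is not Algorithm 1 itself: a least-squares loss stands in for the network loss, plain gradient descent replaces SGD with momentum, the candidate set is fixed once from the initial weights, and magnitudes above the top level are clipped to $2^{n_{1}}$. All names and default values are hypothetical.

```python
import numpy as np

def quantize(w, n1, n2):
    """Round to sign * 2**k (n2 <= k <= n1) or zero; magnitudes above the top
    level are clipped to 2**n1 (an assumption of this sketch)."""
    mag = np.minimum(np.abs(w), 2.0 ** n1)
    out = np.zeros_like(w)
    for k in range(n2, n1 + 1):
        beta = 2.0 ** k
        lower = beta / 2.0 if k == n2 else 0.75 * beta
        hit = (mag >= lower) & (mag < 1.5 * beta)
        out[hit] = np.sign(w[hit]) * beta
    return out

def inq_toy(w, X, y, b=5, portions=(0.5, 0.75, 0.875, 1.0), lr=0.05, steps=200):
    """Partition, quantize and re-train a 1-D weight vector step by step."""
    w = np.array(w, dtype=np.float64)
    s = np.max(np.abs(w))
    n1 = int(np.floor(np.log2(4.0 * s / 3.0)))
    n2 = n1 + 1 - 2 ** (b - 1) // 2
    mask = np.ones_like(w)                       # T: 1 = still full precision
    for p in portions:
        # Weight partition: pick the largest remaining weights.
        free = np.flatnonzero(mask == 1)
        need = int(round(p * w.size)) - (w.size - free.size)
        chosen = free[np.argsort(-np.abs(w[free]))[:max(need, 0)]]
        # Group-wise quantization of the newly selected weights.
        w[chosen] = quantize(w[chosen], n1, n2)
        mask[chosen] = 0
        # Re-training: masked gradient steps on the remaining weights.
        for _ in range(steps):
            grad = X.T @ (X @ w - y) / len(y)
            w -= lr * grad * mask
    return w, (n1, n2)

rng = np.random.default_rng(0)
X = rng.normal(size=(64, 8))
w0 = rng.normal(scale=0.5, size=8)
w_q, (n1, n2) = inq_toy(w0, X, X @ w0)
print(n1, n2, w_q)
```

Once the last accumulated portion reaches 1, the mask is all zeros and every entry is either zero or a signed power of two between $2^{n_{2}}$ and $2^{n_{1}}$.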

Experimental Results

To analyze the performance of our INQ, we perform extensive experiments on the ImageNet large scale classification task, which is known as the most challenging image classification benchmark so far. The ImageNet dataset has about 1.2 million training images and 50 thousand validation images. Each image is annotated as one of 1000 object classes. We apply our INQ to AlexNet, VGG-16, GoogleNet, ResNet-18 and ResNet-50, covering almost all known deep CNN architectures. Using the center crops of validation images, we report the results with two standard measures: top-1 error rate and top-5 error rate. For fair comparison, all pre-trained full-precision (i.e., 32-bit floating-point) CNN models except ResNet-18 are taken from the Caffe model zoo (https://github.com/BVLC/caffe/wiki/Model-Zoo). Note that He et al. (2016) do not release their pre-trained ResNet-18 model to the public, so we use a publicly available re-implementation by Facebook (https://github.com/facebook/fb.resnet.torch/tree/master/pretrained). Since our method is implemented with Caffe, we make use of an open source tool (https://github.com/zhanghang1989/fb-caffe-exts) to convert the pre-trained ResNet-18 model from Torch to Caffe.

Setting the expected bit-width to 5, the first set of experiments is performed to verify the efficacy of our INQ on different CNN architectures. Regarding weight partition, there are several candidate strategies that we tried in our previous work on efficient network pruning (Guo et al., 2016). In Guo et al. (2016), we found that random partition and pruning-inspired partition are the two best choices compared with the others. Thus in this paper, we directly compare these two strategies for weight partition. In the random strategy, the weights in each layer of any pre-trained full-precision deep CNN model are randomly split into two disjoint groups. In the pruning-inspired strategy, the weights are divided into two disjoint groups by comparing their absolute values with layer-wise thresholds which are automatically determined by a given splitting ratio. Here we directly use the pruning-inspired strategy, and the experimental results in Section 3.2 will show why. After re-training for no more than 8 epochs over each pre-trained full-precision model, we obtain the results shown in Table 1. It can be concluded that the 5-bit CNN models generated by our INQ show consistently improved top-1 and top-5 recognition rates compared with the respective full-precision references. Parameter settings are described below.

AlexNet: AlexNet has 5 convolutional layers and 3 fully-connected layers. We set the accumulated portions of quantized weights at iterative steps as {0.3, 0.6, 0.8, 1}, the batch size as 256, the weight decay as 0.0005, and the momentum as 0.9.

VGG-16: Compared with AlexNet, VGG-16 has 13 convolutional layers and more parameters. We set the accumulated portions of quantized weights at iterative steps as {0.5, 0.75, 0.875, 1}, the batch size as 32, the weight decay as 0.0005, and the momentum as 0.9.

GoogleNet: Compared with AlexNet and VGG-16, GoogleNet is more difficult to quantize due to a smaller number of parameters and the increased network width. We set the accumulated portions of quantized weights at iterative steps as {0.2, 0.4, 0.6, 0.8, 1}, the batch size as 80, the weight decay as 0.0002, and the momentum as 0.9.

ResNet-18: Different from the above three networks, ResNets have batch normalization layers and relieve the vanishing gradient problem by using shortcut connections. We first test the 18-layer version for exploratory purposes and test the 50-layer version later on. The network architectures of ResNet-18 and ResNet-34 are very similar. The only difference is the number of filters in every convolutional layer. We set the accumulated portions of quantized weights at iterative steps as {0.5, 0.75, 0.875, 1}, the batch size as 80, the weight decay as 0.0005, and the momentum as 0.9.

ResNet-50: Besides significantly increased network depth, ResNet-50 has a more complex network architecture in comparison to ResNet-18. However, regarding network architecture, ResNet-50 is very similar to ResNet-101 and ResNet-152. The only difference is the number of filters in every convolutional layer. We set the accumulated portions of quantized weights at iterative steps as {0.5, 0.75, 0.875, 1}, the batch size as 32, the weight decay as 0.0005, and the momentum as 0.9.
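The accumulated portions listed above can be turned into the share of still-unquantized weights handled at each step, which is often the more intuitive number. The helper below is a hypothetical illustration of that arithmetic.

```python
def per_step_fractions(accumulated):
    """For each step, the share of still-unquantized weights that gets quantized."""
    fractions, done = [], 0.0
    for p in accumulated:
        fractions.append((p - done) / (1.0 - done))
        done = p
    return fractions

print(per_step_fractions([0.5, 0.75, 0.875, 1]))
print(per_step_fractions([0.3, 0.6, 0.8, 1]))
```

For the schedule {0.5, 0.75, 0.875, 1}, each of the first three steps quantizes half of the weights that are still in full precision, and the last step quantizes whatever remains.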

3.2 Analysis of Weight Partition Strategies

In our INQ, the first operation is weight partition, whose result directly affects the following group-wise quantization and re-training operations. Therefore, the second set of experiments is conducted to analyze two candidate strategies for weight partition. As mentioned in the previous section, we use the pruning-inspired strategy for weight partition. Unlike the random strategy, in which all the weights have equal probability of falling into the two disjoint groups, the pruning-inspired strategy considers that the weights with larger absolute values are more important than the smaller ones for forming a low-precision base for the original CNN model. We use ResNet-18 as a test case to compare the performance of these two strategies. In the experiments, the parameter settings are exactly the same as described in Section 3.1. We set 4 epochs for weight re-training. Table 2 summarizes the results of our INQ with 5-bit quantization. It can be seen that our INQ achieves a top-1 error rate of 32.11% and a top-5 error rate of 11.73% by using random partition. Comparatively, pruning-inspired partition brings 1.09% and 0.83% decreases in top-1 and top-5 error rates, respectively. Pruning-inspired partition is thus better than random partition, and this is the reason why we use it in this paper. For future work, weight partition based on quantization error could also be an option worth exploring.

3.3 The Trade-off between Expected Bit-width and Model Accuracy

The third set of experiments is performed to explore the limit of the expected bit-width under which our INQ can still achieve lossless network quantization. Similar to the second set of experiments, we also use ResNet-18 as a test case, and the parameter settings for the batch size, the weight decay and the momentum are exactly the same. Finally, lower-precision models with 4-bit, 3-bit and even 2-bit ternary weights are generated for comparison. As the expected bit-width goes down, the number of candidate quantum values decreases significantly, so we increase the number of iterative steps accordingly to enhance the accuracy of the final low-precision model. Specifically, we set the accumulated portions of quantized weights at iterative steps as {0.3, 0.5, 0.8, 0.9, 0.95, 1}, {0.2, 0.4, 0.6, 0.7, 0.8, 0.9, 0.95, 1} and {0.2, 0.4, 0.6, 0.7, 0.8, 0.85, 0.9, 0.95, 0.975, 1} for 4-bit, 3-bit and 2-bit ternary models, respectively. The required number of epochs also increases when the expected bit-width goes down, and it reaches 30 when training our 2-bit ternary model. Although our 4-bit model shows slightly decreased accuracy when compared with the 5-bit model, its accuracy is still better than that of the pre-trained full-precision model. Comparatively, even when the expected bit-width goes down to 3, our low-precision model shows only 0.19% and 0.33% losses in top-1 and top-5 recognition rates, respectively. As for our 2-bit ternary model, although it incurs a 2.25% increase in top-1 error rate and a 1.56% increase in top-5 error rate in comparison to the pre-trained full-precision reference, its accuracy is considerably better than state-of-the-art results reported for the binary-weight network (BWN) (Rastegari et al., 2016) and the ternary weight network (TWN) (Li & Liu, 2016). Detailed results are summarized in Table 3 and Table 4.

3.4 Low-Bit Deep Compression

In the literature, the recently proposed deep compression method (Han et al., 2016) reports the best results so far on network compression without loss of model accuracy. Therefore, the last set of experiments is conducted to explore the potential of our INQ for much better deep compression. Note that Han et al. (2016) is a hybrid network compression solution combining three different techniques, namely network pruning (Han et al., 2015), vector quantization (Gong et al., 2014) and Huffman coding. Taking AlexNet as an example, network pruning gets 9× compression; however, this result is mainly obtained from the fully connected layers. Its compression performance on the convolutional layers is actually less than 3× (as can be seen in Table 4 of Han et al. (2016)). Besides, network pruning is realized by separately performing pruning and re-training in an iterative way, which is very time-consuming. It takes at least several weeks to compress AlexNet. We solved this problem with our dynamic network surgery (DNS) method (Guo et al., 2016), which achieves about 7× speed-up in training and improves the performance of network pruning from 9× to 17.7×. In Han et al. (2016), after network pruning, vector quantization further improves the compression ratio from 9× to 27×, and Huffman coding finally boosts the compression ratio up to 35×. For fair comparison, we combine our proposed INQ and DNS, and compare the resulting method with Han et al. (2016). Detailed results are summarized in Table 5. When combining our proposed INQ and DNS, we achieve much better compression results compared with Han et al. (2016). Specifically, with 5-bit quantization, we can achieve 53× compression with slightly larger gains in both top-5 and top-1 recognition rates, yielding 51.43%/96.30% relative improvement in compression performance compared with the full version/fair version (i.e., the combination of network pruning and vector quantization) of Han et al. (2016), respectively. Consistently better results have also been obtained for our 4-bit and 3-bit models.
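The 51.43% and 96.30% figures follow from the compression factors quoted above (53× against 35× and 27×). A short check, using a hypothetical helper name:

```python
def relative_gain(ours, baseline):
    """Improvement of one compression ratio over another, in percent of the baseline."""
    return 100.0 * (ours - baseline) / baseline

print(round(relative_gain(53, 35), 2))  # against deep compression with Huffman coding (35x)
print(round(relative_gain(53, 27), 2))  # against pruning plus vector quantization (27x)
```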

Besides, we also perform a set of experiments on AlexNet to compare the performance of our INQ and vector quantization (Gong et al., 2014). For fair comparison, re-training is also used to enhance the performance of vector quantization, and we set the number of cluster centers for all 5 convolutional layers and 3 fully connected layers to 32 (i.e., 5-bit quantization). In the experiment, vector quantization incurs over 3% loss in model accuracy. When we change the number of cluster centers for convolutional layers from 32 to 128, it gets an accuracy loss of 0.98%. This is consistent with the results reported in Gong et al. (2014). Comparatively, vector quantization is mainly proposed to compress the parameters in the fully connected layers of a pre-trained full-precision CNN model, while our INQ addresses all network layers simultaneously and has no accuracy loss for 5-bit and 4-bit quantization. Therefore, it is evident that our INQ is much better than vector quantization. Last but not least, the final weights for vector quantization (Gong et al., 2014), network pruning (Han et al., 2015) and deep compression (Han et al., 2016) are still floating-point values, but the final weights for our INQ are in the form of either powers of two or zero. The direct advantage of our INQ is that the original floating-point multiplication operations can be replaced by cheaper binary bit shift operations on dedicated hardware like FPGA.
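The bit-shift argument can be illustrated with integer arithmetic. The sketch below assumes integer (fixed-point) activations, which the text above does not specify, and represents a weight as a sign in {-1, 0, 1} plus an exponent; the function name is hypothetical.

```python
def shift_multiply(x, sign, exponent):
    """Multiply integer x by the weight sign * 2**exponent using only shifts.

    A negative exponent becomes a right shift, which floors the result.
    A zero weight is encoded as sign == 0.
    """
    shifted = x << exponent if exponent >= 0 else x >> -exponent
    return sign * shifted

print(shift_multiply(96, -1, -3))  # weight -2**-3
print(shift_multiply(96, 1, 2))    # weight  2**2
print(shift_multiply(96, 0, 0))    # weight  0
```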

Conclusions

In this paper, we present INQ, a new network quantization method, to address the problem of how to convert any pre-trained full-precision (i.e., 32-bit floating-point) CNN model into a lossless low-precision version whose weights are constrained to be either powers of two or zero. Unlike existing methods which usually quantize all the network weights simultaneously, INQ is a more compact quantization framework. It incorporates three interdependent operations: weight partition, group-wise quantization and re-training. Weight partition splits the weights in each layer of a pre-trained full-precision CNN model into two disjoint groups which play complementary roles in INQ. The weights in the first group are directly quantized by a variable-length encoding method, forming a low-precision base for the original CNN model. The weights in the other group are re-trained while keeping all the quantized weights fixed, compensating for the accuracy loss from network quantization. More importantly, the operations of weight partition, group-wise quantization and re-training are repeated on the latest re-trained weight group in an iterative manner until all the weights are quantized, acting as an incremental network quantization and accuracy enhancement procedure. On the ImageNet large scale classification task, we conduct extensive experiments and show that our quantized CNN models with 5-bit, 4-bit, 3-bit and even 2-bit ternary weights have improved or at least comparable accuracy against their full-precision baselines, including AlexNet, VGG-16, GoogleNet and ResNets. As for future work, we plan to extend the incremental idea behind INQ from low-precision weights to low-precision activations and low-precision gradients (we have already made some good progress on this, as shown in our supplementary materials). We will also investigate computation and power efficiency by implementing our low-precision CNN models on hardware platforms.

Appendix A: Statistical Analysis of the Quantized Weights

Taking our 5-bit AlexNet model as an example, we analyze the distribution of the quantized weights. Detailed statistical results are summarized in Table 6. We can find: (1) in the 1st and 2nd convolutional layers, the values of $\{-2^{-6}, -2^{-5}, -2^{-4}, 2^{-6}, 2^{-5}, 2^{-4}\}$ and $\{-2^{-8}, -2^{-7}, -2^{-6}, -2^{-5}, 0, 2^{-8}, 2^{-7}, 2^{-6}, 2^{-5}\}$ occupy over 60% and 94% of all quantized weights, respectively; (2) the distributions of the quantized weights in the 3rd, 4th and 5th convolutional layers are similar to that of the 2nd convolutional layer, and more weights are quantized into zero in the 2nd, 3rd, 4th and 5th convolutional layers compared with the 1st convolutional layer; (3) in the 1st fully connected layer, the values of $\{-2^{-10}, -2^{-9}, -2^{-8}, -2^{-7}, 0, 2^{-10}, 2^{-9}, 2^{-8}, 2^{-7}\}$ occupy about 98% of all quantized weights, and similar results can be seen for the 2nd fully connected layer; (4) generally, the distributions of the quantized weights in the convolutional layers are more scattered than those in the fully connected layers. This may partially explain why it is much easier to get good compression performance on fully connected layers than on convolutional layers when using methods such as network hashing (Chen et al., 2015b) and vector quantization (Gong et al., 2014); (5) for the 5-bit AlexNet model, the required bit-width for each layer is actually 4 rather than 5.

Appendix B: Lossless CNNs with Low-Precision Weights and Low-Precision Activations

Recently, we have made some good progress on developing our INQ for lossless CNNs with both low-precision weights and low-precision activations. According to the results summarized in Table 7, our VGG-16 model with 5-bit weights and 4-bit activations shows improved top-5 and top-1 recognition rates in comparison to the pre-trained reference with 32-bit floating-point weights and 32-bit floating-point activations. To the best of our knowledge, these are the best results reported on the VGG-16 architecture so far.