An Analysis of Deep Neural Network Models for Practical Applications
Alfredo Canziani, Adam Paszke, Eugenio Culurciello
Introduction
Since the breakthrough in 2012 ImageNet competition (Russakovsky et al., 2015) achieved by AlexNet (Krizhevsky et al., 2012) — the first entry that used a Deep Neural Network (DNN) — several other DNNs with increasing complexity have been submitted to the challenge in order to achieve better performance.
In the ImageNet classification challenge, the ultimate goal is to obtain the highest accuracy in a multi-class classification problem framework, regardless of the actual inference time. We believe that this has given rise to several problems. Firstly, it is now normal practice to run several trained instances of a given model over multiple similar instances of each validation image. This practice, also know as model averaging or ensemble of DNNs, dramatically increases the amount of computation required at inference time to achieve the published accuracy. Secondly, model selection is hindered by the fact that different submissions are evaluating their (ensemble of) models a different number of times on the validation images, and therefore the reported accuracy is biased on the specific sampling technique (and ensemble size). Thirdly, there is currently no incentive in speeding up inference time, which is a key element in practical applications of these models, and affects resource utilisation, power-consumption, and latency.
This article aims to compare state-of-the-art DNN architectures, submitted for the ImageNet challenge over the last 4 years, in terms of computational requirements and accuracy. We compare these architectures on multiple metrics related to resource utilisation in actual deployments: accuracy, memory footprint, parameters, operations count, inference time and power consumption. The purpose of this paper is to stress the importance of these figures, which are essential hard constraints for the optimisation of these networks in practical deployments and applications.
Methods
In order to compare the quality of different models, we collected and analysed the accuracy values reported in the literature. We immediately found that different sampling techniques do not allow for a direct comparison of resource utilisation. For example, central-crop (top-5 validation) errors of a single run of VGG-16 In the original paper this network is called VGG-D, which is the best performing network. Here we prefer to highlight the number of layer utilised, so we will call it VGG-16 in this publication. (Simonyan & Zisserman, 2014) and GoogLeNet (Szegedy et al., 2014) are and respectively, revealing that VGG-16 performs better than GoogLeNet. When models are run with 10-crop sampling, From a given image multiple patches are extracted: four corners plus central crop and their horizontal mirrored twins. then the errors become and respectively, and therefore VGG-16 will perform worse than GoogLeNet, using a single central-crop. For this reason, we decided to base our analysis on re-evaluations of top-1 accuracies Accuracy and error rate always sum to , therefore in this paper they are used interchangeably. for all networks with a single central-crop sampling technique (Zagoruyko, 2016).
Results
In this section we report our results and comparisons. We analysed the following DDNs: AlexNet (Krizhevsky et al., 2012), batch normalised AlexNet (Zagoruyko, 2016), batch normalised Network In Network (NIN) (Lin et al., 2013), ENet (Paszke et al., 2016) for ImageNet (Culurciello, 2016), GoogLeNet (Szegedy et al., 2014), VGG-16 and -19 (Simonyan & Zisserman, 2014), ResNet-18, -34, -50, -101 and -152 (He et al., 2015), Inception-v3 (Szegedy et al., 2015) and Inception-v4 (Szegedy et al., 2016) since they obtained the highest performance, in these four years, on the ImageNet (Russakovsky et al., 2015) challenge.
Figure 2 shows one-crop accuracies of the most relevant entries submitted to the ImageNet challenge, from the AlexNet (Krizhevsky et al., 2012), on the far left, to the best performing Inception-v4 (Szegedy et al., 2016). The newest ResNet and Inception architectures surpass all other architectures by a significant margin of at least .
Figure 2 provides a different, but more informative view of the accuracy values, because it also visualises computational cost and number of network’s parameters. The first thing that is very apparent is that VGG, even though it is widely used in many applications, is by far the most expensive architecture — both in terms of computational requirements and number of parameters. Its 16- and 19-layer implementations are in fact isolated from all other networks. The other architectures form a steep straight line, that seems to start to flatten with the latest incarnations of Inception and ResNet. This might suggest that models are reaching an inflection point on this data set. At this inflection point, the costs — in terms of complexity — start to outweigh gains in accuracy. We will later show that this trend is hyperbolic.
2 Inference Time
Figure 4 reports inference time per image on each architecture, as a function of image batch size (from 1 to 64). We notice that VGG processes one image in a fifth of a second, making it a less likely contender in real-time applications on an NVIDIA TX1. AlexNet shows a speed up of roughly going from batch of 1 to 64 images, due to weak optimisation of its fully connected layers. It is a very surprising finding, that will be further discussed in the next subsection.
3 Power
In figure 4 we see that the power consumption is mostly independent with the batch size. Low power values for AlexNet (batch of 1) and VGG (batch of 2) are associated to slower forward times per image, as shown in figure 4.
4 Memory
5 Operations
Operations count is essential for establishing a rough estimate of inference time and hardware circuit size, in case of custom implementation of neural network accelerators. In figure 7, for a batch of 16 images, there is a linear relationship between operations count and inference time per image. Therefore, at design time, we can pose a constraint on the number of operation to keep processing speed in a usable range for real-time applications or resource-limited deployments.
6 Operations and Power
7 Accuracy and Throughput
We note that there is a non-trivial linear upper bound between accuracy and number of inferences per unit time. Figure 9 illustrates that for a given frame rate, the maximum accuracy that can be achieved is linearly proportional to the frame rate itself. All networks analysed here come from several publications, and have been independently trained by other research groups. A linear fit of the accuracy shows all architecture trade accuracy vs. speed. Moreover, chosen a specific inference time, one can now come up with the theoretical accuracy upper bound when resources are fully utilised, as seen in section 3.6. Since the power consumption is constant, we can even go one step further, and obtain an upper bound in accuracy even for an energetic constraint, which could possibly be an essential designing factor for a network that needs to run on an embedded system.
As the spoiler in section 3.1 gave already away, the linear nature of the accuracy vs. throughput relationship translates into a hyperbolical one when the forward inference time is considered instead. Then, given that the operations count is linear with the inference time, we get that the accuracy has an hyperbolical dependency on the amount of computations that a network requires.
8 Parameters Utilisation
DNNs are known to be highly inefficient in utilising their full learning power (number of parameters / degrees of freedom). Prominent work (Han et al., 2015) exploits this flaw to reduce network file size up to , using weights pruning, quantisation and variable-length symbol encoding. It is worth noticing that, using more efficient architectures to begin with may produce even more compact representations. In figure 10 we clearly see that, although VGG has a better accuracy than AlexNet (as shown by figure 2), its information density is worse. This means that the amount of degrees of freedom introduced in the VGG architecture bring a lesser improvement in terms of accuracy. Moreover, ENet (Paszke et al., 2016) — which we have specifically designed to be highly efficient and it has been adapted and retrained on ImageNet (Culurciello, 2016) for this work — achieves the highest score, showing that less parameters are sufficient to provide state-of-the-art results.
Conclusions
In this paper we analysed multiple state-of-the-art deep neural networks submitted to the ImageNet challenge, in terms of accuracy, memory footprint, parameters, operations count, inference time and power consumption. Our goal is to provide insights into the design choices that can lead to efficient neural networks for practical application, and optimisation of the often-limited resources in actual deployments, which lead us to the creation of ENet — or Efficient-Network — for ImageNet. We show that accuracy and inference time are in a hyperbolic relationship: a little increment in accuracy costs a lot of computational time. We show that number of operations in a network model can effectively estimate inference time. We show that an energy constraint will set a specific upper bound on the maximum achievable accuracy and model complexity, in terms of operations counts. Finally, we show that ENet is the best architecture in terms of parameters space utilisation, squeezing up to more information per parameter used respect to the reference model AlexNet, and respect VGG-19.
This paper would have not look so pretty without the Python Software Foundation, the matplotlib library and the communities of stackoverflow and TeX of StackExchange which I ought to thank. This work is partly supported by the Office of Naval Research (ONR) grants N00014-12-1-0167, N00014-15-1-2791 and MURI N00014-10-1-0278. We gratefully acknowledge the support of NVIDIA Corporation with the donation of the TX1, Titan X, K40 GPUs used for this research.