LCNN: Lookup-based Convolutional Neural Network

Hessam Bagherinezhad, Mohammad Rastegari, Ali Farhadi

Introduction

In recent years convolutional neural networks (CNN) have played major roles in improving the state of the art across a wide range of problems in computer vision, including image classification , object detection , segmentation , etc. These models are very expensive in terms of computation and memory. For example, AlexNet has 61M parameters and performs 1.5B high precision operations to classify a single image. These numbers are even higher for deeper networks, e.g.,VGG . The computational burden of learning and inference for these models is significantly higher than what most compute platforms can afford.

Recent advancements in virtual reality (VR by Oculus) , augmented reality (AR by HoloLens) , and smart wearable devices increase the demand for getting our state of the art deep learning algorithm on these portable compute platforms. Porting deep learning methods to these platforms is challenging mainly due to the gap between what these platforms can offer and what our deep learning methods require. More efficient approaches to deep neural networks is the key to this challenge.

Recent work on efficient deep learning have focused on model compression and reducing the computational precision of operations in neural networks . CNNs suffer from over-parametrization and often encode highly correlated parameters , resulting in inefficient computation and memory usage. Our key insight is to leverage the correlation between the parameters and represent the space of parameters by a compact set of weight vectors, called dictionary. In this paper, we introduce LCNN, a lookup-based convolutional neural network that encodes convolutions by few lookups to a dictionary that is trained to cover the space of weights in CNNs. Training LCNN involves jointly learning a dictionary and a small set of linear combinations. The size of the dictionary naturally traces a spectrum of trade-offs between efficiency and accuracy. Our experimental results using AlexNet on ImageNet challenge show that LCNN can offer $3.2\times$ speedup while achieving $55.1\%$ top-1 accuracy. Our fastest LCNN offers $37.6\times$ speed up over CNN while maintaining $44.3\%$ top-1 accuracy. In the ResNet-18, the most accurate LCNN offers $5\times$ speedup with $62.2\%$ accuracy and the fastest LCNN offers $29.2\times$ speedup with $51.8\%$ accuracy

In addition, LCNN enables efficient training; almost all the work in efficient deep learning have focused on efficient inference on resource constrained platforms . Training on these platforms is even more challenging and requires addressing two major problems: i. few-shot learning:the settings of on-device training dictates that there won’t be enough training examples for new categories. In fact, most training needs to be done with very few training examples; ii. few-iteration learning:the constraints in computation and power require the training to be light and quick. This imposes hard constraints on the number of iterations in training. LCNN offers solutions for both of these problems in deep on-device training.

Few-shot learning, the problem of learning novel categories from few examples (sometimes even one example), have been extensively studies in machine learning and computer vision. The topic is, however, relatively new for deep learning, where the main challenge is to avoid overfitting. The number of parameters are significantly higher than what can be learned from few examples. LCNN, by virtue of having fewer parameters to learn (only around 7% of parameters of typical networks), offers a simple solution to this challenge. Our dictionary can be learned offline from training data where enough training examples per category exists. When facing new categories, all we need to learn is the set of sparse reconstruction weights. Our experimental evaluations show significant gain in few-shot learning; $6.3\%$ in one training example per category.

Few-iteration learning is the problem of getting highest possible accuracy in few iterations that a resource constrained platform can offer. In a typical CNN, training often involves hundreds of thousands of iterations. This number is even higher for recent deeper architectures. LCNN offers a solution: dictionaries in LCNN are architecture agnostic and can be transferred across architectures or layers. This allows us to train a dictionary using a shallow network and transfer it to a deeper one. As before, all we need to learn are the few reconstruction weights; dictionaries don’t need to be trained again. Our experimental evaluations on ImageNet challenge show that using LCNN we can train an 18-layer ResNet with a pre-trained dictionary from a 10-layer ResNet and achieve $16.2\%$ higher top-1 accuracy on $10K$ iterations.

In this paper, we 1) introduce LCNN; 2) show state of the art efficient inference in CNNs using LCNN; 3) demonstrate possibilities of training deep CNNs using as few as one example per category 4) show results for few iteration learning .

Related Work

A wide range of methods have been proposed to address efficient training and inference in deep neural networks. Here, we briefly study these methods under the topics that are related to our approach.

Weight compression: Several attempts have been made to reduce the number of parameters of deep neural networks. Most of such methods are based on compressing the fully connected layers, which contain most of the weights. These methods do not achieve much improvement on speed. In , a small DNN architecture is proposed which is fully connected free and has 50x fewer parameters in compare to AlexNet . However, their model is slower than AlexNet. Recently reduced the number of parameters by pruning. All of these approaches update a pre-trained CNN, whereas we propose to train a compact structure that enables faster inference.

Low Rank Assumption: Approximating the weights of convolutional layers with low-rank tensor expansion has been explored by . They only demonstrated speedup in the case of large convolutions. uses SVD for tensor decomposition to reduce the computation in the lower layers on a pre-trained CNN. minimizes the reconstruction error of the nonlinear responses in a CNN, subject to a low-rank constraint which helps to reduce the complexity of filters. Notably, all of these methods are a post processing on the weights of a trained CNN, and none of them train a lower rank network from scratch.

Low Precision Networks: A fixed-point implementation of 8-bit integer was compared with 32-bit floating point activations in . Several network quantization methods are proposed by . Most recently, binary networks has shown to achieve relatively strong result on ImageNet . They have trained a network that computes the output with mostly binary operations, except for the first and the last layer. uses the real-valued version of the weights as a key reference for the binarization process. is an extension of , where both weights and activations are binarized. retrains a previously trained neural network with binary weights and binary inputs. Our approach is orthogonal to this line of work. In fact, any of these methods can be applied in our model to reduce the precision.

Sparse convolutions: Recently, several attempts have been made to sparsify the weights of convolutional layers . shows how to reduce the redundancy in parameters of a CNN using a sparse decomposition. proposed a framework to simultaneously speed up the computation and reduce the storage of CNNs. proposed a Structured Sparsity Learning (SSL) method to regularize the structures (i.e., filters, channels, filter shapes, and layer depth) of CNNs. Only in a sparse CNN is trained from scratch which makes it more similar to our approach. However, our method provides a rich set of dictionary that enables implementing convolution with lookup operations.

Few-Shot Learning: The problem of learning novel categories has been studied in . Learning from few examples per category explored by . proposed a method to learn from one training example per category, known as one-shot learning. Learning without any training example, zero-shot learning, is studied by .

Our Approach

Overview: In a CNN, each convolutional layer consists of $n$ cubic weight filters of size $m\times k_{w}\times k_{h}$ , where $m$ and $n$ are the number of input and output channels, respectively, and $k_{w}$ and $k_{h}$ are the width and the height of the filter. Therefore, the weights in each convolutional layer is composed of $nk_{w}k_{h}$ vectors of length $m$ . These vectors are shown to have redundant information. To avoid this redundancy, we build a relatively small set of vectors for each layer, to which we refer as dictionary, and enforce each vector in the weight filter to be a linear combination of a few elements from this set. Figure 1 shows an overview of our model. The gray matrix at the left of the figure is the dictionary. The dashed lines show how we lookup a few vectors from the dictionary and linearly combine them to build up a weight filter. Using this structure, we devise a fast inference algorithm for CNNs. We then show that the dictionaries provide a strong prior on the visual data and enables us to learn from few examples. Finally, we show that the dictionaries can be transferred across different network architectures. This allows us to speedup the training of a deep network by transferring the dictionaries from a shallower model.

This procedure is illustrated in Figure 1. In LCNN, instead of storing the weight tensors $\mathbf{W}$ for convolutional layers, we store $\mathbf{D}$ , $\mathbf{I}$ and $\mathbf{C}$ , the building blocks of the weight tensors. As a result, we can reduce the number of parameters in a convolutional layer by reducing $k$ , the dictionary size, and $s$ , the number of components in the linear combinations. In the next section, we will discuss how LCNN uses this representation to speedup the inference.

A forward pass in a convolutional layer consists of $n$ convolutions between the input $\mathbf{X}$ and each of the weight filters $\mathbf{W}$ . We can write a convolution between an $m\times k_{w}\times k_{h}$ weight filter and the input $\mathbf{X}$ as a sum of $k_{w}k_{h}$ separate $(1\times 1)$ -convolutions:

, where $\text{shift}_{r,c}$ is the matrix shift function along rows and columns with zero padding relative to the filter size. Now we use the LCNN representation of weights (equation 1) to rewrite each $1\times 1$ convolution:

Once the values of $\mathbf{S}$ are computed, we can reconstruct the output of convolution by lookups over the channels of $\mathbf{S}$ according to $\mathbf{I}$ , then scale them by the values in $\mathbf{C}$ :

This is shown in Figure 2 (left). Reducing the size of the dictionary $k$ lowers the cost of computing $\mathbf{S}$ and makes the forward pass faster. Since $\mathbf{S}$ is computed by a dense matrix multiplication, we are still able to use OpenBlas for fast matrix multiplication. In addition, by pushing the value of $s$ to be small, we can reduce the number of lookups and floating point operations.

1.2 Training LCNN

So far we have discussed how LCNN represents a weight filter by linear combinations of a subset of elements in a shared dictionary. We have also shown that how LCNN performs convolutions efficiently in two stages: 1- Small convolutions: convolving the input with a set of $1\times 1$ filters (equation 4). 2- Lookup and scale: few lookups over the channels of a tensor followed by a linear combination (equation 5) . Now, we explain how one can jointly train the dictionary and the lookup parameters, $\mathbf{I}$ and $\mathbf{C}$ . Direct training of the proposed lookup based convolution leads to a combinatorial optimization problem, where we need to find the optimal values for the integer tensor $\mathbf{I}$ . To get around this, we reformulate the lookup and scale stage (equation 5) using a standard convolution with sparsity constraints.

Note that this conversion is reversible, i.e.,we can create $\mathbf{I}$ and $\mathbf{C}$ from the position and the values of the non-zero entries in $\mathbf{P}$ . With this conversion, the lookup and scale stage (equation 5) becomes:

over the values in $\mathbf{P}$ during training. We also backpropagate through this threshold function to compute the gradients with respect to $\mathbf{P}$ . The derivative of the threshold function is $1$ everywhere except at $|x|<\epsilon$ , which is . Hence, if any of the entries of $\mathbf{P}$ becomes at some iteration, they stay forever. Using the threshold function, we let each vector to be a combination of arbitrary vectors. At the end of the training, the sparsity parameter $s$ at each spatial position $(r,c)$ is determined by the number of non-zero values in $\mathbf{P}{[:,r,c]}$ .

Although the focus of our work is to speedup convolutional layers where most of the computations are, our lookup based convolution model can also be applied on fully connected (FC) layers. An FC layer that goes from $m$ inputs to $n$ outputs can be viewed as a convolutional layer with input tensor $m\times 1\times 1$ and $n$ weight filters, each of size $m\times 1\times 1$ . We take the same approach to speedup fully connected layers.

After training, we convert $\mathbf{P}$ to the indices and the coefficients tensors $\mathbf{I}$ and $\mathbf{C}$ for each layer. At test time, we follow equation 5 to efficiently compute the output of each convolutional layer.

2 Few-shot learning

The shared dictionary in LCNN allows a neural network to learn from very few training examples on novel categories, which is known as few-shot learning. A good model for few-shot learning should have two properties: a) strong priors on the data, and b) few trainable parameters . LCNN has both of these properties. An LCNN trained on a large dataset of images (e.g. ImageNet ) will have a rich dictionary $\mathbf{D}$ at each convolutional layer. This dictionary provides a powerful prior on visual data. At the time of fine-tuning for a new set of categories with few training examples, we only update the coefficients in $\mathbf{C}$ . This reduces the number of trainable parameters significantly.

In a standard CNN, to use a pre-trained network to classify a set of novel categories, we need to reinitialize the classification layer randomly. This introduces a large number of parameters, on which we don’t have any prior, and they should be trained solely by a few examples. LCNN, in contrast, can use the dictionary of the classification layer of the pre-trained model, and therefore only needs to learn $\mathbf{I}$ and $\mathbf{C}$ from scratch, which form a much smaller set of parameters. Furthermore, for all other layers, we only fine-tune the coefficients $\mathbf{C}$ , i.e.,only update the non-zero entries of $\mathbf{P}$ . Note that the dictionary $\mathbf{D}$ is fixed across all layers during the training with few examples.

3 Few-iteration learning

Training very deep neural networks are computationally expensive and require hundreds of thousands of iterations. This is mainly due to the complexity of these models. In order to constrain the complexity, we should limit the number of learnable parameters in the network. LCNN has a suitable setting that allows us to limit the number of learnable parameters without changing the architecture. This can be done by transferring the shared dictionaries $\mathbf{D}$ from a shallower network to a deeper one.

Experiments

We evaluate the accuracy and the efficiency of LCNN under different settings. We first evaluate the accuracy and speedup of our model for the task of object classification, evaluated on the standard image classification challenge of ImageNet, ILSRVC2012 . We then evaluate the accuracy of our model under few-shot setting. We show that given a set of novel categories with as small as $1$ training example per category, our model is able to learn a classifier that is both faster and more accurate than the CNN baseline. Finally we show that the dictionaries trained in LCNN are generalizable and can be transferred to other networks. This leads to a higher accuracy in small number of iterations compared to standard CNN.

We follow the common way of initializing the convolutional layers by Gaussian distributions introduced in , including for the sparse tensor $\mathbf{P}$ . We set the threshold in equation 9 for each layer in such a way that we maintain the same initial sparsity across all the layers. That is, we set the threshold of each layer to be $\epsilon=c\cdot\sigma$ , where $c$ is constant across layers and $\sigma$ is the standard deviation of Gaussian initializer for that layer. We use $c=0.01$ for AlexNet and $c=0.001$ for ResNet. Similarly, to maintain the same level of sparsity across layers we need a $\lambda$ (equation 8) that is proportional to the standard deviation of the Gaussian initializers. We use $\lambda=\lambda^{\prime}\epsilon$ , where $\lambda^{\prime}$ is constant across layers and $\epsilon$ is the threshold value for that layer. We try $\lambda^{\prime}\in\{0.1,0.2,0.3\}$ for both AlexNet and ResNet to get different sparsities in $\mathbf{P}$ .

The dictionary size $k$ , the regularizer coefficient $\lambda$ , and threshold value $\epsilon$ are the three important hyperparameters for gaining speedup. The larger the dictionary is, the more accurate (but slower) the model becomes. The size of the the dictionary for the first layer does not need to be very large as it’s representing a $3$ -dimensional space. We observed that for the first layer, a dictionary size as small as $3$ vectors is sufficient for both AlexNet and ResNet. In contrast, fully connected layers of AlexNet are of higher dimensionality and a relatively large dictionary is needed to cover the input space. We found dictionary sizes $512$ and $1024$ to be proper for fully connected layers. In AlexNet we use the same dictionary size across other layers, which we vary from $100$ to $500$ for different experiments. In ResNet, aside from the very first layer, all the other convolutional layers are grouped into $4$ types of ResNet blocks. The dimensionality of input is equal between same ResNet block types, and is doubled for consecutive different block types. In a similar way we set the dictionary size for different ResNet blocks: equal between the same block types, and doubles for different consecutive block types. We vary the dictionary size of the first block from $16$ to $128$ in different experiments.

2 Image Classification

In this section we evaluate the efficiency and the accuracy of LCNN for the task of image classification. Our proposed lookup based convolution is general and can be applied on any CNN architecture. We use AlexNet and ResNet architectures in our experiments. We use ImageNet challenge ILSVRC2012 to evaluate the accuracy of our model. We report standard top- $1$ and top- $5$ classification accuracy on $1$ K categories of objects in natural scenes. To evaluate the efficiency, we compare the number of floating point operations as a representation for speedup. The speed and the accuracy of our model depend on two hyperparameters: 1) $k$ , the dictionary size and 2) $\lambda$ , which controls the sparsity of $\mathbf{P}$ ; i.e.,the average number of dictionary components in the linear combination . One can set a trade-off between the accuracy and the efficiency of LCNN by adjusting these two parameters. We compare our model with several baselines: 1- XNOR-Net , which reduces the precision of weights and outputs to $1$ -bit, and therefore multiplications can be replaced by binary operations. In XNOR-Net, all the layers are binarized except the first and the last layer (in AlexNet, they contain $9.64\%$ of the computation). 2- Wen et al. , which speeds up the convolutions by sparsifying the weight filters .

Table 1 compares the top- $1$ and top- $5$ classification accuracy of LCNN with baselines on AlexNet architecture. It shows that with small enough dictionaries and sparse linear combinations, LCNN offers $37.6\times$ speedup with the accuracy of XNOR-Net. On the other hand, if we set the dictionaries to be large enough, LCNN can be as accurate as slower models like Wen et al. In LCNN-fast, the dictionary size of the mid-layer convolutions is $30$ and for the fully connected layers is $512$ . In LCNN-accurate, the mid-layer convolutions have a dictionary of size $500$ and the size of dictionary in fully connected layers is $1024$ . The reguralizer constant (Section 4.1) $\lambda^{\prime}$ for LCNN-fast and LCNN-accurate is $0.3$ and $0.1$ , respectively.

Depending on the dictionary size and $\lambda^{\prime}$ , LCNN can achieve various speedups and accuracies. Figure 3 shows different accuracies vs. speedups that our model can achieve. The accuracy is computed by top-1 measure and the speedup is relative to the original CNN model. It is interesting to see that the trend is nearly linear. The best fitted line has a slope of $-3.08$ , i.e.,for each one percent accuracy that we sacrifice in top-1, we gain $3.08$ more speedup.

We also evaluate the performance of LCNN on ResNet-18 architecture. ResNet-18 is a compact architecture, which has $5\times$ fewer parameters in compare to AlexNet while it achieves $12.7\%$ higher top-1 accuracy. That makes it a much more challenging architecture for further compression. Yet we show that we can gain large speedups with a few points drop in the accuracy. Table 2 compares the accuracy of LCNN, XNOR-Net , and the original model (CNN). LCNN-fast is getting the same accuracy as XNOR-Net while getting a much larger speedup. Moreover, LCNN-accurate is getting a much higher accuracy yet maintaining a relatively large speedup. LCNN-fast has dictionaries of size $16$ , $32$ , $64$ , and $128$ for different block types. LCNN-accuracte has larger dictionaries: $128$ , $256$ , $512$ and $1024$ for different block types.

3 Few-shot Learning

In this section we evaluate the performance of LCNN on the task of few-shot learning. To evaluate the performance of LCNN on this task, we split the categories of ImageNet challenge ILSVRC2012 into two sets: i) base categories, a set of $990$ categories which we use for pre-training, and ii) novel categories, a set of $10$ categories that we use for few-shot learning. We do experiments under $1$ , $2$ , and $4$ samples per category. We take two strategies for splitting the categories. One is random splitting, where we randomly split the dataset into $990$ and $10$ categories. We repeat the random splitting $5$ times and report the average over all. The other strategy is to hold out all cats ( $7$ categories), bicycles ( $2$ categories) and sofa ( $1$ category) for few-shot learning, and use the other $990$ categories for pre-training. With this strategy we make sure that base and novel categories do not share similar objects, like different breeds of cats. For each split, we repeat the random sampling of $1$ , $2$ , and $4$ training images per category $20$ times, and get the average over all. Repeating the random sampling of the few examples is crucial for any few-shot learning experiment, since a model can easily overfit to a specific sampling of images.

We compare the performance of CNN and LCNN on few-shot learning in Figure 4. We first train an original AlexNet and an LCNN AlexNet on all training images of base categories ( $990$ categories, $1000$ images per category). We then replace the $990$ -way classification layer with a randomly initialized $10$ -way linear classifier. In CNN, this produces $10\times 4096$ randomly initialized weights, on which we don’t have any prior. These parameters need to be trained merely from the few examples. In LCNN, however, we transfer the dictionary trained in the $990$ -way classification layer to the new $10$ -way classifier. This reduces the number of randomly initialized parameters by at least a factor of $4$ . We use AlexNet LCNN-accurate model (same as the one in Table 1) for few-shot learning. At the time of fine-tuning for few-shot categories, we keep the dictionaries in all layers fixed and only fine-tune the sparse $\mathbf{P}$ tensor. This reduces the total number of parameters that need to be fine-tuned by a factor of $14\times$ . We use different learning rates $\eta$ and $\eta^{\prime}$ for the randomly initialized classification layer (which needs to be fully trained) and the previous pre-trained layers (which only need to be fine-tuned). We tried $\eta^{\prime}=\eta$ , $\eta^{\prime}=\frac{\eta}{10}$ , $\eta^{\prime}=\frac{\eta}{100}$ and $\eta^{\prime}=0$ for both CNN and LCNN, then picked the best for each configuration.

Figure 4 shows the top-1 accuracies of our model and the baseline in the two splitting strategies of our few-shot learning experiment. In Figure 4 (a) we are holding out all cat, sofa, and bicycle categories ( $10$ categories in total) for few-shot learning. LCNN is beating the baseline consistently in $\{1,2,4\}$ examples per category. Figure 4 (b) shows the comparison in the random splitting strategy. We repeat randomly splitting the categories into $990$ and $10$ categories $5$ times, and report the average over all. Here LCNN gets a larger improvement in the top-1 accuracy compared to the baseline for $\{1,2,4\}$ images per category.

4 Few-iteration Learning

In section 3.3 we discussed that the dictionaries in LCNN can be transferred from a shallower network to a deeper one. As a result, one can train fewer parameters–only $\mathbf{I}$ and $\mathbf{C}$ –in the deeper network with few iterations obtaining a higher test accuracy compared to a standard CNN. In this experiment we train a ResNet with $1$ block of each type, 10 layers total. We then transfer the dictionaries of each layer to its corresponding layer of ResNet-18 (with 18 layers). After transfer, we keep the dictionaries fixed. We show that we get higher accuracy in small number of iterations compared to standard CNN. Figure 5 illustrates the learning curves on top-1 accuracy for both LCNN and standard CNN. The test accuracy of LCNN is $16.2\%$ higher than CNN at iteration $10K$ . The solid lines denote the training accuracy and the dashed lines denote the test accuracy.

Conclusion

With recent advancements in virtual reality, augmented reality, and smart wearable devices, the need for getting the state of the art deep learning algorithms onto these resource constrained compute platforms increases. Porting state of the art deep learning algorithms to resource constrained compute platforms is extremely challenging. We introduce LCNN, a lookup-based convolutional neural network that encodes convolutions by few lookups to a dictionary that is trained to cover the space of weights in CNNs. Training LCNN involves jointly learning a dictionary and a small set of linear combinations. The size of the dictionary naturally traces a spectrum of trade-offs between efficiency and accuracy.

LCCN enables efficient inference; our experimental results on ImageNet challenge show that LCNN can offer $3.2\times$ speedup while achieving $55.1\%$ top-1 accuracy using AlexNet architecture. Our fastest LCNN offers $37.6\times$ speed up over AlexNet while maintaining $44.3\%$ top-1 accuracy. LCNN not only offers dramatic speed ups at inference, but it also enables efficient training. On-device training of deep learning methods requires algorithms that can handle few-shot and few-iteration constrains. LCNN can simply deal with these problems because our dictionaries are architecture agnostic and transferable across layers and architectures, enabling us to only learn few linear combination weights. Our future work involves exploring low-precision dictionaries as well as compact data structures for the dictionaries.

Acknowledgments: This work is in part supported by ONR N00014-13-1-0720, NSF IIS-1338054, NSF-1652052, NRI-1637479, Allen Distinguished Investigator Award, and the Allen Institute for Artificial Intelligence.