How Can We Be So Dense? The Benefits of Using Highly Sparse Representations
Subutai Ahmad, Luiz Scheinkman
Introduction
The literature on sparse representations in neural networks dates back many decades, with neuroscience as one of the primary motivations. In 1988 Kanerva proposed the use of sparse distributed memories (Kanerva, 1988) to model the highly sparse representations seen in the brain. In 1997, (Olshausen & Field, 1997) showed that incorporating sparse priors and sparse cost functions in encoders can lead to receptive field representations that are remarkably close to what is observed in the primate visual cortex. More recently (Lee et al., 2008; Chen et al., 2018) showed hierarchical sparse representations that qualitatively lead to natural looking hierarchical feature detectors. (Lee et al., 2009; Nair & Hinton, 2009; Srivastava et al., 2013; Rawlinson et al., 2018) showed that introducing sparsity terms can sometimes lead to improved test set accuracies.
Despite the above literature the majority of neural networks today rely on dense representations. One exception is the pervasive use of dropout (Srivastava et al., 2014) as a regularizer. Dropout randomly “kills” a percentage of the units (in practice usually ) on every training input presentation. Variational dropout techniques tune the dropout rates individually per weight (Molchanov et al., 2017). Dropout introduces random sparse representations during learning, and has been shown to be an effective regularizer in many contexts.
In this paper we discuss certain inherent benefits of high dimensional sparse representations. We focus on robustness and sensitivity to interference. These are central issues with today’s neural network systems where even small (Szegedy et al., 2013) and large (Rosenfeld et al., 2018) perturbations can cause dramatic changes to a network’s output. We offer two main contributions. First, we analyze high dimensional sparse representations, and show that such representations are naturally more robust to noise and interference from random inputs. When matching sparse patterns, corrupted versions of a pattern are “close” to the original whereas random patterns are exponentially hard to match.
Our second contribution is an efficient construction of sparse deep networks that is designed to exploit the above properties. We implement networks where the weights for each unit in a layer randomly sample from a sparse subset of the source layer below. In addition the output of each layer is constrained such that only the most active units are allowed to be non-zero, where is much smaller than the number of units in that layer. In these networks, the number of non-zero products for each layer is approximately . This formulation results in simple differentiable sparse layers that can be dropped into both standard linear and convolutional layers.
We demonstrate significantly improved robustness to noise for MNIST and the Google Speech Commands dataset, while maintaining competitive accuracy in the standard zero noise scenario. We discuss the number of weights used by sparse networks in these datasets, and the impact of additional pruning. Our work extends the existing literature on sparse networks and pruning (see Section 5 for a comparison with some prior work). At the end of the paper we discuss some possible areas for future work.
High Dimensional Sparse Representations
In this section we develop some basic properties of sparse representations as they relate to noise robustness and interference. In a typical neural network an input vector is matched against a stored weight vector using a dot product. This is then followed by a threshold-like non-linearity such as or .
Ideally we would like the outputs of each layer to be invariant to noise or corrupted inputs. When comparing two sparse vectors via a dot product, the results are unaffected by the zero components of either vector. A key quantity we consider is the ratio of the matching volume around a prototype vector divided by the volume of the whole space. The larger the match volume around a vector, the more robust it is to noise. The smaller the ratio, the less likely it is that random inputs can affect the match.
We quantify the above ratio using binary vectors (following our previous work in (Ahmad & Hawkins, 2016)). In this section we show that the ratio decreases exponentially with increased dimensionality, while maintaining a large match volume. Let be a binary vector of length , and let denote the number of non-zero entries. The dot product counts the overlap, or number of shared bits, between two such vectors. We would like to understand the probability of two vectors having significant overlap, i.e. overlap greater than some threshold .
We define the overlap set, , as the set of all vectors of size that have exactly bits of overlap with . The number of such vectors can be calculated as:
The left half of the above product counts all the ways we can select exactly bits out of active bits in . The right half counts the number of ways we can select the remaining bits from the components of that are zero. The product of these two quantities represents the number of all vectors with exactly bits of overlap with . We can now count the number of vectors that match , i.e. where as:
If we select vectors from a uniform random distribution, the probability of significant overlap can be calculated as:
where is the set of all possible comparison vectors.
2 Impact of Dimensionality and Sparsity
Two key factors in Eq. 3 are the number of non-zero components, , and the dimensionality, . Figure 1 provides an intuitive description of their impact. Assume we have prototype vectors, and we want to match noisy versions of these vectors. Around each prototype there is a set of matching vectors. If the threshold is very high, the set of matching vectors is small (illustrated by the small circles in Figure 1A) and there will be quite a bit of space between these sets. As you decrease matching is less strict and you can match noisier versions of each prototype. The cost is that the chance of matching the other vectors also increases because there is less free space in between (Figure 1B). It turns out that for sparse vectors, this cost is offset as you increase . That is, as increases, the denominator in Eq. 3 (and the corresponding ”free” space) increases much faster than the numerator. For a fixed sparsity level, you can maintain highly tolerant matches without the cost of additional false positives simply by increasing the dimensionality.
Fig 2 illustrates this trend for some example sparsities. In this figure we simulated matching with random vectors and plotted match rates with random vectors as a function of the number of active bits and the underlying dimensionality. In the simulation we repeatedly generated a random prototype vector with bits on and then attempted to match against random test vectors with bits on. We matched using a threshold of 12 which meant that even vectors that were up to 50% different from would match. We varied and the dimensionality of the vectors, .
The chart shows that for sparse binary vectors, match rates with random vectors drop rapidly as the underlying dimensionality increases. The horizontal line indicates the probability of matching against dense vectors, with . The probability of dense matches stays relatively high and unaffected by dimensionality, indicating that both sparseness and high dimensionality are key to robust matches. In (Ahmad & Hawkins, 2016) we develop additional properties, including the probability of false negatives.
3 Matching Sparse Scalar Vectors
Deep networks operate on scalar vectors, and in this section we consider how the above ideas apply to sparse scalar representations. Binary and scalar vectors are similar in that the components containing zero do not affect the dot product, and thus the combinatorics in Eq. 3 are still applicable. Eq. 1 represents the set of scalar vectors where the number of non-zero multiplies in the dot product is exactly , and Eq. 3 represents the probability that the number of non-zero multiplies is . However, an additional factor is the distribution of scalar values. If components in one vector are extremely large relative to , the likelihood of a significant match will be high even with a single shared non-zero component.
We wanted to see if the exponential drop in random matches for binary vectors, demonstrated by Figure 2, can be obtained using scalar vectors, and if so, the conditions under which they hold. Let and represent two sparse vectors such that and counts the number of non-zero entries in each. Let each non-zero component be independent and sampled from the distributions and . The probability of a significant match is then:
where is the probability that the dot product is given that the overlap is exactly components:
There does not appear to be a closed form way to compute for normal or uniform distributions so we resort to simulations that mimic our network structure.
As before, we generated a large number of random vectors and , and plotted the frequency of random matches. With , we focus on simulations where the non-zero entries in are uniform in , and the non-zero entries in are uniform in . We focus on this formulation because of the relationship to common network structures and weight initialization. is a putative weight vector and is an input vector to this layer from the previous layer (we assume unit activations are positive, the result of a ReLU-like non-linearity). controls the scale of relative to .
Figure 3 (left) shows the behavior with and . We varied the activity of the input vectors and the dimensionality of the vectors, . We set . The chart demonstrates that under these conditions we can achieve robust behavior similar to that of binary vectors. Figure 3 (right) plots the effect of on the match probabilities with a fixed . As this chart shows, the error increases significantly as increases. Taken together, these results show that the fundamental robustness properties of binary sparse vectors can also hold for sparse scalar vectors, as long as the overall scaling of vectors are in a similar range.
4 Non-uniform Distribution of Vectors
Eq. 3 assumes the ideal case where vectors are chosen with a uniform random distribution. With a non-uniform distribution the error rates will be higher. The more non-uniform the distribution the worse the error rates. For example, if you mostly end up observing 10 inputs, your error rates will be bounded at around . Thus, to optimize error rates, it is important to be as close to a uniform distribution as possible.
Sparse Network Description
Here we discuss a particular sparse network implementation that is designed to exploit Eq. 3. This implementation is an extension of our previous work on the HTM Spatial Pooler, a binary sparse coding algorithm that models sparse code generation in the neocortex (Hawkins et al., 2011; Cui et al., 2017). Specifically, we formulate a version of the Spatial Pooler that is designed to be a drop-in layer for neural networks trained with back-propagation. Our work is also closely related to previous literature on k-winner take all networks (Majani et al., 1989) and fixed sparsity networks (Makhzani & Frey, 2015).
Consider a network with hidden layers. Let denote the vector of outputs from layer , respectively, with as the input vector. and are the weights and biases for each layer. In a standard neural network the weights are typically dense and initialized using a uniform random distribution. The feed forward outputs are then calculated as follows:
where is any activation function, such as or . (Figure 4 left.)
To implement our sparse networks, we make two modifications to this basic formulation (Figure 4 right.). First, we initialize the weights using a sparse random distribution, such that only a fraction of the weights contain non-zero values. Non-zero weights are initialized using standard Kaiming initialization (He et al., 2015b). The rest of the connections are treated as non-existent, i.e. the corresponding weights are zero throughout the life of the network. Second, only the top-k active units within each layer are maintained in , and the rest set to zero. This -winners step is non-linear and can be thought of as a substitute for the ReLU function. Instead of a threshold of , the threshold here is adaptive and corresponds to the ’th largest activation (Makhzani & Frey, 2013).
The layer can be trained using standard gradient descent. Similar to ReLU, the gradient of the layer is calculated as above the threshold and elsewhere. During inference we increase by 50%, which led to slightly better accuracies. In all our simulations the last layer of each network is a standard linear output layer with log-softmax activation function.
One practical issue with the above formulation is that it is possible for a small number of units to initially dominate and then, through learning, become active for a large percentage of patterns (this was also noted in (Makhzani & Frey, 2015; Cui et al., 2017)). Having a small number of active units negatively impacts the available representational volume. It is desirable for every unit to be equally active in order to maximize the robustness of the representation in Eq. 3. To address this we employ a boosting term (Hawkins et al., 2011; Cui et al., 2017) which favors units that have not been active recently. We compute a running average of each unit’s duty cycle (i.e. how frequently it has been one of the top units):
A boost coefficient is then calculated for each unit based on the target duty cycle and the current average duty cycle:
The target duty cycle is a constant reflecting the percentage of units that are expected to be active, i.e. . The boost factor, , is a positive parameter that controls the strength of boosting. implies no boosting (), and higher numbers lead to larger boost coefficients. In (Hawkins et al., 2011; Cui et al., 2017) we showed that Eq. 7 encourages each unit to have equal activation frequency and effectively maximizes the entropy of the layer.
The boost coefficients are used during the k-winners step to select which units remain active for this input. Through boosting, units which have not been active recently have a disproportionately higher impact and are more likely to win, whereas overly active units are de-emphasized. To determine the output of the layer, the non-boosted activity of each winning unit is kept and the remaining units are set to zero. The duty cycle is then updated. The complete pseudo-code description for the -winners layer is described in Algorithm 1. In our simulations we used for all sparse simulations.
2 Sparse Convolutional Layers
We can apply the above algorithm to convolutional networks (CNNs) (LeCun et al., 1989). A canonical CNN layer uses a linear convolutional layer containing a number of filters, followed by a max-pooling (downsampling) layer, followed by ReLU. In order to implement sparse CNN layers, the k-winners layer is applied to the output of the max-pooling layer instead of ReLU (just as in our non-convolutional layers). However, since each filter in a CNN shares weights across the image, duty cycles are accumulated per filter. In our simulations dense and sparse CNN nets both have a hidden layer (which is dense or sparse, respectively) after the last convolutional layer, followed by a linear plus softmax layer. We used filters throughout with a stride of . In our tests, the weight sparsity of CNN layers did not impact the results. We suspect this is due to the small size of each kernel and did not use sparse weights for the CNN filters in our experiments.
Results
We first trained our networks on MNIST (LeCun et al., 1998). We trained both dense and sparse implementations. Each network consisted of one or two convolutional layers, followed by a hidden layer, followed by a linear + softmax output layer. Sparse nets consisted of sparse convolutional layers followed by a sparse hidden layer.
Networks were trained using standard stochastic gradient descent to minimize cross entropy loss. We used starting learning rates in the range , and the learning rate was decreased by a factor between 0.5 and 0.9 after each epoch. We also tried batch normalization (Ioffe & Szegedy, 2015) and found it did not help for MNIST (it did help significantly for Google Speech Commands results - see below). For sparse networks, we used a small mini-batch size (around 4), for the first epoch only, in order to let duty cycle calculations update frequently and settle. Hyperparameters such as the learning rate and network size were chosen using a validation set consisting of randomly chosen training samples. We then report final results on the test set using networks trained on the full training set.
Results Without Noise: State of the art accuracies on MNIST using convolutional neural networks (without distortions or other training augmentation) are in the range respectivelySource: http://yann.lecun.com/exdb/mnist. Table 1 (left column) lists the classification accuracies for the networks in our experiments. Our accuracies are in the same range, for both sparse and dense networks. Table 3 lists the key parameters for each of the listed networks (see also the next section for a more in-depth discussion).
Results With Noise: In order to test noise robustness we generated MNIST images with varying levels of additive noise. For each test image we randomly set of the pixels to a constant value near white (the constant value was two standard deviations over the mean pixel intensity). Figure 5 (A) shows sample images for different noise levels. We generated different noise levels with ranging between and in increments of . We also computed an overall noise score which counted the total number of correct classifications across all noise levels.
The right column of Table 1 shows the noise scores for each of the architectures. Networks in the top section of the table (Dense CNN-1 and Dense CNN-2) are composed of standard dense convolutional and hidden layers. Networks in the middle section (Sparse CNN-1 and Sparse CNN-2) are composed of sparse convolutional and sparse hidden layers. Networks in the last section contain a mixture of dense and sparse layers. Overall the architectures with sparse layers performed significantly better on the noise score than the fully dense networks. Sparse CNN-2, the two layer completely sparse network, had the best noise score. The two fully dense networks performed substantially worse than the others on noise, even though their test accuracies were comparable. Figure 5 plots the accuracy of fully dense and sparse networks at different noise levels. Note that raw test score was not a predictor of noise robustness, suggesting that focusing on pure test set accuracy alone is not sufficient for gauging performance under adverse conditions.
Ablation studies: In order to judge the relative contributions of sparse layers we ran experiments where we replaced various sparse components with their dense counterparts, i.e. dense CNNs with sparse hidden layers, and vice versa. Dense CNN-2 SP3 contained two dense CNN layers followed by the sparse third layer from Sparse CNN-2. Sparse CNN-2 D3 contained the same CNN layers as Sparse CNN-2 followed by the dense third layer from Dense CNN-2. Sparse CNN-2 W1 was identical to Sparse CNN-2 except that the weight sparsity was 1 (i.e. fully dense weights). Sparse CNN-2 DSW contained a third layer with dense outputs, but with a weight sparsity of .
The results of these networks are shown in the bottom third of Table 1. From a noise robustness perspective, most of the variants (except for Sparse CNN-2 DSW) performed well, better than the best pure dense network. This supports the idea that sparsity in many forms may be helpful with robustness. It is interesting to note that the standard deviation of the noise score in these variants was also higher than that of the pure sparse networks. Overall the results with mixed networks were encouraging, and suggest a clear benefit to introducing sparsity at any level.
Impact of Dropout: The above results did not use dropout (Srivastava et al., 2014), which is generally thought to improve robustness. We found that dropout did occasionally improve the robustness of dense networks, but any improvements were modest and the dropout percentage had to be tuned carefully. For sparse nets dropout consistently reduced accuracies. Even with the optimal dropout percentage, the noise scores of dense networks were significantly lower than sparse nets.
2 Google Speech Commands Dataset
In order to test sparse nets on a different domain, we applied them to the Google Speech Commands dataset (GSC). This audio dataset was made publicly available in 2017 (Warden, 2017) and consists of 65,000 one-second long utterances of 30 keywords spoken by thousands of individuals. The dataset contains predefined training, validation, and test sets.
Reference convolutional nets using ten of the keyword categories (plus artificial ”silence” and ”unknown” categories created during training augmentation) achieve accuracies in the range (Sainath & Parada, 2015; Tang & Lin, 2017). In (Tang & Lin, 2017) they demonstrated improved accuracies in the range of using residual networks (ResNets (He et al., 2015b, a)).
A Kaggle competition using GSC (also limited to 10 categories) took place between November 2017 and early 2018https://www.kaggle.com/c/tensorflow-speech-recognition-challenge. For our simulations we use the preprocessing code provided by one of the top-10 contestants (Tuguldur, 2018) who achieved around accuracies using variants of ResNet and VGG (Simonyan & Zisserman, 2014) architectures. Following this implementation, audio samples in our simulations are converted to 32-band Mel spectograms before being fed to the network. During training we augment the data by randomly adjusting the amplitude, speed, and pitch of each training sample, and by randomly shifting and stretching samples in the frequency domain. No data augmentation is performed on the validation or test sets.
We trained dense and sparse convolutional networks, with hyperparameters chosen based on the validation set. We were able to achieve reasonable accuracies using two convolutional layers, followed by a hidden layer and then a linear + softmax output layer. Our sparse networks had sparse convolutional layers as well as a sparse hidden layer. Unlike MNIST we found that batch normalization (Ioffe & Szegedy, 2015) accelerated learning significantly, and we used it for every layer.
Using the above setup we were able to achieve test set accuracies in the range of classifying the ten categories corresponding to the digits ”zero” through ”nine”. Table 2 (left column) shows mean accuracy on the test set. Both dense and sparse networks had about the same accuracy. Dropout had a negative effect on the accuracy. Table 3 lists the key parameters in each network.
Results With Noise: As with MNIST, we again created noisy versions of the test set. For each test audio sample we generated a random white noise sample and blended them together:
We generated 11 different noise levels, with ranging from 0 to 0.5 in increments of 0.05. Our overall noise score counted the total number of classifications across all noise levels.
As can be seen in Table 2 sparse networks performed significantly better than the best dense network. We included a ”Super-Sparse CNN-2” with a significantly sparser hidden layer. The hidden layer for this network had weight sparsity, and a lower output sparsity (Table 3). This network had slightly lower noise score, but its score was still significantly higher than that of the dense networks. Overall these results demonstrate that the robustness of sparse networks seen with MNIST can scale to other domains.
3 Computational Considerations
In standard networks, the size of each weight matrix is and the order of complexity of the feed-forward operation can be approximated by the number of multiplications, . The computational efficiency of sparse systems is closely related to the fraction of non-zeros. In our sparse hidden layers, both activations and weight values are sparse and the number of non-zero product terms in the forward computation is proportional to , where is the fraction of non-zero weights. In our convolutional layers, only activations values are sparse and the number of non-zero product terms in the forward computation is proportional to , where is the kernel width of each filter.
As an example, the number of non-zero multiplies between the first two convolutional layers in the GSC Sparse CNN-2 network is , about 10.5X smaller than the corresponding dense network. The number of non-zero multiplies between the second convolutional layer and the hidden layer in the same network is , about 20X smaller than the dense network. For Super Sparse CNN-2, that ratio is 35X as compared to the dense version.
As can be seen, the number of non-zeros products is significantly smaller in the sparse net implementations. Unfortunately we found that current versions of deep learning frameworks, including PyTorch and Tensorflow do not have adequate support for sparse matrices to exploit these properties, and our implementations ran at the same speed as the corresponding dense networks. We suspect this is due to the fact that highly sparse networks are not sufficiently popular in practice. We hope that studies such as this one will encourage highly optimized sparse implementations. (Note that such optimizations may be non-trivial as the set of -winners changes on every step.) When this becomes feasible our numbers suggest there is a strong possibility for large performance gains and/or improvements in power usage. It is also worth noting that this reduction in computational complexity does not come at a cost. Rather, our experiments showed that sparse representations can lead to improved accuracies under noisy conditions.
Discussion
In this paper we illustrated benefits of sparse representations. We developed intuitions and theory for the structure of vector matching in the context of binary sparse representations. We then constructed efficient neural network formulations of sparse networks that place internal representations in the sweet spot suggested by the theory. In particular we aim to match sparse activations with sparse weights in relatively high dimensional settings. A boosting rule was used to increase the overall entropy of the internal layers in order to maximize the utilization of the representational space. We showed that this formulation increases the overall robustness of the system to noisy inputs using MNIST and the Google Speech Command Dataset. Both dense and sparse networks showed high accuracies, but the sparse nets were significantly more robust. These results suggest that it is important to look beyond pure test set performance as test accuracy by itself is not a reliable indicator of overall robustness.
Our work extends the existing literature on sparsity and pruning. A very recent theoretical paper showed that simple linear sparse networks may be more robust to adversarial attacks (Guo et al., 2018). A number of papers have shown that it is possible to effectively introduce sparsity through pruning and retraining (Han et al., 2015; Frankle & Carbin, 2018; Lee et al., 2018). The mechanisms introduced here can be seen as complementary to those techniques. Our network enforces sparse weights from the beginning by construction, and sparse weights are learned as part of the training process. In addition, we reduce the overall computational complexity by enforcing sparse activations, which in turn significantly reduces the number of overall non-zero products. This should produce significant power savings for optimized hardware implementations.
We demonstrated increased robustness in our networks whereas the papers on pruning typically do not explicitly test robustness. It is possible that such networks are also more robust, though this remains to be tested. Pruning techniques in general are quite orthogonal to ours, and it may be feasible to combine them with the mechanisms discussed here.
In our work we did not attempt to introduce sparsity into the convolutional filters themselves. (Li et al., 2016) have shown it is sometimes possible to remove entire filters from large CNNs suggesting that sparsifying filter weights may also be possible, particularly in networks with larger filters. Introducing sparse convolutions within the context of the techniques in this paper is an area of future exploration. The techniques described here are straightforward to implement and can be extended to other architectures including RNNs. This is yet another promising area for future research.
All code and experiments are available at https://github.com/numenta/htmpapers as open source.
Acknowledgements
We thank Jeff Hawkins, Ali Rahimi, and John Berkowitz for helpful discussions and comments.