Improving Deep Neural Networks with Probabilistic Maxout Units

Jost Tobias Springenberg, Martin Riedmiller

Introduction

Regularization of large neural networks through stochastic model averaging was recently shown to be an effective tool against overfitting in supervised classification tasks. Dropout was the first of these stochastic methods which led to improved performance on several benchmarks ranging from small to large scale classification problems . The idea behind dropout is to randomly drop the activation of each unit within the network with a probability of $50\%$ . This can be seen as an extreme form of bagging in which parameters are shared among models, and the number of trained models is exponential in the number of these model parameters. During testing an approximation is used to average over this large number of models without instantiating each of them. When combined with efficient parallel implementations this procedure opened the possibility to train large neural networks with millions of parameters via back-propagation .

Inspired by this success a number of other stochastic regularization techniques were recently developed. This includes the work on dropconnect, a generalization of dropout, in which connections between units rather than their activation are dropped at random. Adaptive dropout is a recently introduced variant of dropout in which the stochastic regularization is performed through a binary belief network that is learned alongside the neural network to decrease the information content of its hidden units. Stochastic pooling is a technique applicable to convolutional networks in which the pooling operation is replaced with a sampling procedure.

Instead of changing the regularizer the authors in searched for an activation function for which dropout performs well. As a result they introduced the maxout unit, which can be seen as a generalization of rectified linear units (ReLUs) , that is especially suited for the model averaging performed by dropout. The success of maxout can partly be attributed to the fact that maxout aids the optimization procedure by partially preventing units from becoming inactive; an artifact caused by the thresholding performed by the rectified linear unit. Additionally, similar to ReLUs, they are piecewise linear and – in contrast to e.g. sigmoid units – typically do not saturate, which makes networks containing maxout units easier to optimize.

We argue that an equally important property of the maxout unit however is that its activation function can be seen as performing a pooling operation over a subspace of $k$ linear feature mappings (in the following referred to as subspace pooling). As a result of this subspace pooling operation each maxout unit is partially invariant to changes within its input. A natural question arising from this observation is thus whether it could be beneficial to replace the maximum operation used in maxout units with other pooling operations, such as L2 pooling. The utility of different subspace pooling operations has already been explored in the context of unsupervised learning where e.g. L2-pooling is known give rise to interesting invariances . While work on generalizing maxout by replacing the max-operation with general $Lp$ -pooling exists , a deviation from the standard maximum operation comes at the price of discarding some of the desirable properties of the maxout unit. For example abandoning piecewise linearity, restricting units to positive values and the introduction of saturation regimes, which potentially worsen the accuracy of the approximate model averaging performed by dropout.

Based on these observations we propose a stochastic generalization of the maxout unit that preserves its desirable properties while improving the subspace pooling operation of each unit. As an additional benefit when training a neural network using our proposed probabilistic maxout units the gradient of the training error is more evenly distributed among the linear feature mappings of each unit. In contrast, a maxout network helps gradient flow through each of the maxout units but not through their k linear feature mappings. Compared to maxout our probabilistic units thus learn to better utilize their full k-dimensional subspace. We evaluate the classification performance of a model consisting of these units and show that it matches the state of the art performance on three challenging classification benchmarks.

Model Description

Before defining the probabilistic maxout unit we briefly review the notation used in the following for defining deep neural network models. We adopt the standard feed-forward neural network formulation in which given an input $\mathbf{x}$ and desired output $y$ (a class label) the network realizes a function computing a $C$ -dimensional vector $\mathbf{o}$ – where $C$ is the number of classes – predicting the desired output. The prediction is computed by first sequentially mapping the input to a hierarchy of $N$ hidden layers $\mathbf{h}^{(1)},\dots,\mathbf{h}^{(N)}$ . Each unit $h_{i}^{(l)}$ within hidden layer $l\in[1,N]$ in the hierarchy realizes a function $h_{i}^{(l)}(\mathbf{v};\mathbf{w}^{(l)}_{i},b^{(l)}_{i})$ mapping its inputs $\mathbf{v}$ (given either as the input $\mathbf{x}$ or the output of the previous layer $h^{(l-1)}$ ) to an activation using weight and bias parameters $\mathbf{w}^{(l)}_{i}$ and $b^{(l)}_{i}$ . Finally the prediction is computed based on the last layer output $\mathbf{h}^{N}$ . This prediction is realized using a softmax layer $\mathbf{o}=softmax(\mathbf{W}^{N+1}\mathbf{h}^{(N)}+\mathbf{b}^{N+1})$ with weights $\mathbf{W}^{N+1}$ and bias $\mathbf{b}^{N+1}$ . All parameters $\theta=\{W^{(1)},b^{(1)},\dots,W^{(N+1)},b^{(N+1)}\}$ are then learned by minimizing the cross entropy loss between output probabilities $\mathbf{o}$ and label $y$ : $\mathcal{L}(o,y;\mathbf{x})=-\sum_{i=1}^{C}y_{i}\log(o_{i})+(1-y_{i})log(1-o_{i})$ .

and k is the number of linear sub-units combined by one maxout unit. Afterwards the output $h_{maxout}$ of the maxout hidden unit is given as the maximum over the k feature mappings:

When formalized like this it becomes clear that (in contrast to conventional activation functions) the maxout unit can be interpreted as performing a pooling operation over a k-dimensional subspace of linear units $[z_{1},\dots,z_{k}]$ each representing one transformation of the input $\mathbf{v}$ . This is similar to spatial max-pooling which is commonly employed in convolutional neural networks. However, unlike in spatial pooling the maxout unit pools over a subspace of k different linear transformations applied to the same input $\mathbf{v}$ . In contrast to this, spatial max-pooling of linear feature maps would compute a pooling over one linear transformation applied to k different inputs. A schematic of the difference between several pooling operations is given in Fig. 1 .

As such maxout is thus more similar to the subspace pooling operations used for example in topographic ICA which is known to result in partial invariance to changes within its input. On the basis of this observation we propose a stochastic generalization of the maxout unit that preserves its desirable properties while improving gradient propagation among the $k$ linear feature mappings as well as the invariance properties of each unit. In the following we call these generalized units probout units since they are a direct probabilistic generalization of maxout.

We derive the probout unit activation function from the maxout formulation by replacing the maximum operation in Eq. (2) with a probabilistic sampling procedure. More specifically we assume a Boltzmann distribution over the $k$ linear feature mappings and sample the activation $h(\mathbf{v})$ from the activation of the corresponding subspace units. To this end we first define a probability for each of the k linear units in the subspace as:

where $\lambda$ is a hyperparameter (referred to as an inverse temperature parameter) controlling the variance of the distribution. The activation $h_{probout}(\mathbf{x})$ is then sampled as

Comparing Eq. (4) to Eq. (2) we see that both, are not bounded from above or below and their activation is always given as one of the linear feature mappings within their subspace. The probout unit hence preserves most of the properties of the maxout unit, only replacing the sub-unit selection mechanism.

We can further see that Eq. (4) reduces to the maxout activation for $\lambda\rightarrow\infty$ . For other values of $\lambda$ the probout unit will behave similarly to maxout when the activation of one linear unit in the subspace dominates. However, if the activation of multiple linear units differs only slightly they will be selected with almost equal probability. Futhermore, each active linear unit will have a chance to be selected. The sampling approach therefore ensures that gradient flows through each of the $k$ linear subspace units of a given probout unit for some examples (given that $\lambda$ is sufficiently small). We hence argue that probout units can learn to better utilize their full k-dimensional subspace.

In practice we want to combine the probout units described by Eq. (4) with dropout for regularizing the learned model. To achieve this we directly include dropout in the probabilistic sampling step by re-defining the probabilities as:

Consequently, we sample the probout activation function including dropout $\hat{h}_{probout}(\mathbf{v})$ as

2 Relation to other pooling operations

The idea of using a stochastic pooling operation has been explored in the context of spatial pooling within the machine learning literature before. Among this work the approach most similar to ours is . There the authors introduced a probabilistic pooling approach in order to derive a convolutional deep believe network (DBN). They also use a Boltzmann distribution based on unit activations to calculate a sampling probability. The main difference between their work and ours is that they calculate the probability of sampling one unit at different spatial locations whereas we calculate the probability of sampling a unit among k units forming a subspace at one spatial location. Another difference is that we forward propagate the sampled activation $z_{i}$ whereas they use the calculated probability to activate a binary stochastic unit.

Another approach closely related to our work is the stochastic pooling presented in . Their stochastic pooling operation samples the activation of a pooling unit $p_{i}$ proportionally to the activation $a$ of a rectified linear unit computed at different spatial positions. This is similar to Eq. (4) in the sense that the activation is sampled from a set of different activations. Similar to it however differs in that the sampling is performed over spatial locations rather than activations of different units.

It should be noted that our work also bears some resemblance to recent work on training stochastic units, embedded in an autoencoder network, via back-propagation . In contrast to their work, which aims at using stochastic neurons to train a generative model, we embrace stochasticity in the subspace pooling operation as an effective means to regularize a discriminative model.

3 Inference

At test time we need to account for the stochastic nature of a neural network containing probout units. During a forward pass through the network the value of each probout unit is sampled from one of $k$ values according to their probability. The output of such a forward pass thus always represents only one of $k^{M}$ different instantiations of the trained probout network; where $M$ is the number of probout units in the network. When combined with dropout the number of possible instantiations increases to ${(k+1)}^{M}$ . Evaluating all possible models at test time is therefore clearly infeasible. The Dropout formulation from deals with this large amount of possible models by removing dropout at test time and halving the weights of each unit. If the network consists of only one softmax layer then this modified network performs exact model averaging . For general models this computation is merely an approximation of the true model average which, however, performs well in practice for both deep ReLU networks and the maxout model .

We adopt the same procedure of halving the weights for removing the influence of dropout at test-time and rescale the probabilities such that $\sum_{i=1}^{k}\hat{p}_{i}=1$ and $\hat{p}_{0}=0$ , effectively replacing the sampling from Eq .(7) with Eq. (4). We further observe that from the $k^{M}$ models remaining after removing dropout only few models will be instantiated with high probability. We therefore resort to sampling a small number of outputs $\mathbf{o}$ from the networks softmax layer and average their values. An evaluation of the exact effect of this model averaging can be found in Section 3.1.1 .

Evaluation

We evaluate our method on three different image classification datasets (CIFAR-10, CIFAR-100 and SVHN) comparing it against the basic maxout model as well as the current state of the art on all datasets. All experiments were performed using an implementation based on Theano and the pylearn2 library using the fast convoltion code of . We use mini-batch stochastic gradient descent with a batch size of 100. For each of the datasets we start with the same network used in – retaining all of their hyperparameter choices – to ensure comparability between results. We replace the maxout units in the network with probout units and choose one $\lambda^{(l)}$ via crossvalidation for each layer $l$ in a preliminary experiment on CIFAR-10.

We begin our experiments with the CIFAR-10 dataset. It consists of $50,000$ training images and $10,000$ test images that are grouped into $10$ categories. Each of these images is of size $32\times 32$ pixels and contains $3$ color channels. Maxout is known to yield good performance on this dataset, making it an ideal starting point for evaluating the difference between maxout and probout units.

We conducted a preliminary experiment to evaluate the effect of the probout parameters $\lambda^{(l)}$ on the performance and compare it to the standard maxout model. For this purpose we use a five layer model consisting of three convolutional layers with 48, 128 and 128 probout units respectively which pool over 2 linear units each. The penultimate layer then consists of 240 probout units pooling over a subspace of 5 linear units. The final layer is a standard softmax layer mapping from the 240 units in the penultimate layer to the 10 classes of CIFAR-10. The receptive fields of units in the convolutional layers are 8, 8 and 5 respectively. Additionally, spatial max-pooling is performed after each convolutional layer with pooling size of $4\times 4$ , $4\times 4$ and $2\times 2$ using a stride of 2 in all layers. We split the CIFAR-10 training data retaining the first 40000 samples for training and using the last 10000 samples as a validation set.

We start our evaluation by using probout units everywhere in the network and cross-validate the choice of the inverse-temperature parameters $\lambda^{(l)}\in\{0.1,0.5,1,2,3,4\}$ keeping all other hyperparameters fixed. We find that annealing the $\lambda^{(l)}$ parameter during training to a lower value improved performance for all $\lambda^{(l)}>0.5$ and hence linearly decrease $\lambda^{(l)}$ to a value that is $0.9$ lower than the initial $\lambda$ in these cases. As shown in Fig. 3(a) the best classification performance is achieved when $\lambda$ is set to allow higher variance sampling for the first two layers, specifically when $\lambda^{(1)}=1$ and $\lambda^{(2)}=2$ . For the third as well as the fully connected layer we observe a performance increase when $\lambda^{(3)}$ is chosen as $\lambda^{(3)}=3$ and $\lambda^{(4)}=4$ , meaning that the sampling procedure selects the maximum value with high probability. This indicates that the probabilistic sampling is most effective in lower layers. We verified this by replacing the probout units in the last two layers with maxout units which did not significantly decrease classification accuracy.

We hypothesize that increasing the probability of sampling a non maximal linear unit in the subspace pulls the units in the subspace closer together and forces the network to become “more invariant” to changes within this subspace. This is a property that is desired in lower layers but might turn to be detrimental in higher layers where the model averaging effect of maxout is more important than achieving invariance. Here sampling units with non-maximal activation could result in unwanted correlation between the “submodels”. To qualitatively verify this claim we plot the first layer linear filters learned using probout units alongside the filters learned by a model consisting only of maxout units in Fig. 2. When inspecting the filters we can see that many of the filters belonging to one subspace formed by a probout unit seem to be transformed versions of each other, with some of then resembling “quadrature pairs” of filters. Among the linear filters learned by the maxout model some also appear to encode invariance to local transformations. Most of the filters contained in a subspace however are seemingly unrelated. To support this observation empirically we probed for changes in the feature vectors of different layers (extracted from both maxout and probout models) when they are applied to translated and rotated images from the validation set. Similar to we calculate the normalized Euclidean distance between feature vectors extracted from an unchanged image and a transformed version. We then plot these distances for several exemplary images as well as the mean over 100 randomly sampled images. The result of this experiment is given in Fig. 4, showing that introducing probout units into the network has a moderate positive effect on both invariance to translation and rotations.

Finally, we evaluate the computational cost of the model averaging procedure described in Section 2.3 at test time. As depicted in Fig. 3(b) the classification error for the probout model decreases with more model evaluations saturating when a moderate amount of 50 evaluations is reached. Conversely, using sampling at test time in conjunction with the standard maxout model significantly decreases performance. This indicates that the maxout model is highly optimized for the maximum responses and cannot deal with the noise introduced through the sampling procedure. We additionally also tried to replace the model averaging mechanism with cheaper approximations. Replacing the sampling in the probout units with a maximum operation at test time resulted in a decrease in performance, reaching $14.13\%$ . We also tried to use probability weighting during testing which however performed even worse, achieving $15.21\%$ .

1.2 Evaluation of Classification Performance

As the next step, we evaluate the performance of our model on the full CIFAR-10 benchmark. We follow the same protocol as in to train the probout model. That is, we first preprocess all images by applying contrast normalization followed by ZCA whitening. We then train our model using the first 40000 examples from the training set using the last 10000 examples as a validation set. Training then proceeds until the validation error stops decreasing. We then retrain the model on the complete training set for the same amount of epochs it took to reach the best validation error.

To comply with the experiments in we used a larger version of the model from Section 3.1.1 in all experiments. Compared to the preliminary experiment the size of the convolutional layers was increased to $96$ , $192$ and $192$ units respectively. The size of the fully connected layer was increased to 500 probout units pooling over a 5 dimensional subspace.

The top half of Table 1 shows the result of training this model as well as other recent results. We achieve an error of $11.35\%$ , slightly better than – but statistically tied to – the previous state of the art given by the maxout model. We also evaluated the performance of this model when the training data is augmented with additional transformed training examples. For this purpose we train our model using the original training images as well as add randomly translated and horizontally flipped versions of the images. The bottom half of Table 1 shows a comparison of different results for training on CIFAR-10 with additional data augmentation. Using this augmentation process we achieve a classification error of $9.39\%$ , matching, but not outperforming the maxout result.

2 CIFAR-100

The images contained in the CIFAR-100 dataset are – just as the CIFAR-10 images – taken from a subset of the 10-million images database. The dataset contains $50,000$ training and $10,000$ test examples of size $32\times 32$ pixels each. The dataset is hence similar to CIFAR-10 in both size and image content. It, however, differs from CIFAR-10 in its label distribution. Concretely, CIFAR-100 contains images of 100 classes grouped into 20 “super-classes”. The training data therefore contains 500 training images per class – 10 times less examples per class than in CIFAR-10 – which are accompanied by 100 examples in the test-set.

We do not make use of the 20 super-classes and train a model using a similar setup to the experiments we carried out on CIFAR-10. Specifically, we use the same preprocessing and training procedure (determining the amount of epochs using a validation set and then retraining the model on the complete data). The same network as in Section 3.1.1 was used for this experiment (adapted to classify 100 classes). Again, this is the same architecture used in thus ensuring comparability between results. During testing we use 50 model evaluations to average over the sampled probout units.

The result of this experiment is given in Table 2. In agreement with the CIFAR-10 results our model performs marginally better than the maxout model (by $0.45\%$ While we were writing this manuscript it came to our attention that the experiments on CIFAR-100 in were carried out using a different preprocessing than mentioned in the original paper. To ensure that this does not substantially effect our comparison we ran their experiment using the same preprocessing used in our experiments. This resulted in a slightly improved classification error of $38.50\%$ .). As also shown in the table the current best method on CIFAR-100 achieves a classification error of $36.85\%$ , using a larger convolutional neural network together with a tree-based prior on the classes formed by utilizing the super-classes. A similar performance increase could potentially be achieved by combining their tree-based prior with our model.

3 SVHN

The street view house numbers dataset is a collection of images depicting digits which were obtained from google street view images. The dataset comes in two variants of which we restrict ourselves to the one containing cropped $32\times 32$ pixel images. Similar to the well known MNIST dataset the task for this dataset is to classify each image as one of 10 digits in the range from 0 to 9. The task is considerably more difficult than MNIST since the images are cropped out of natural image data. The images thus contain color information and show significant contrast variation. Furthermore, although centered on one digit, several images contain multiple visible digits, complicating the classification task.

The training and test set contain $73,257$ and $20,032$ labeled examples respectively. In addition to this data there is an “extra” set of $531,131$ labeled digits which are somewhat less difficult to differentiate and can be used as additional training data. As in we build a validation set by selecting 400 examples per class from the training and 200 examples per class from the extra dataset. We conflate all remaining training images to a large set of $598,388$ images which we use for training.

The model trained for this task consists of three convolutional layers containing 64, 128 and 128 units respectively, pooling over a 2 dimensional subspace. These are followed by a fully connected and a softmax layer of which the fully connected layer contains 400 units pooling over a 5 dimensional subspace. This yields a classification error of $2.39\%$ (using 50 model evaluations at test-time), matching the current state of the art for a model trained on SVHN without data augmentation achieved by the maxout model ( $2.47\%$ ). A comparison to other results can be found in Table 3 . This includes the current best result with data augmentation which was obtained using a generalization of dropout in conjunction with a large network containing rectified linear units .

Conclusion

We presented a probabilistic version of the recently introduced maxout unit. A model built using these units was shown to yield competitive performance on three challenging datasets (CIFAR-10, CIFAR-100, SVHN). As it stands, replacing maxout units with probout units is computationally expensive at test time. This problem could be diminished by developing an approximate inference scheme similar to which we see as an interesting possibility for future work.

We see our approach as part of a larger body of work on exploring the utility of learning “complex cell like” units which can give rise to interesting invariances in neural networks. While this paradigm has extensively been studied in unsupervised learning it is less explored in the supervised scenario. We believe that work towards building activation functions incorporating such invariance properties, while at the same time designed for use with efficient model averaging techniques such as dropout, is a worthwhile endeavor for advancing the field.

The authors want to thank Alexey Dosovistkiy for helpful discussions and comments, as well as Thomas Brox for generously providing additional computing resources.