Improved Dropout for Shallow and Deep Learning

Zhe Li, Boqing Gong, Tianbao Yang

Introduction

Dropout has been widely used to avoid overfitting of deep neural networks with a large number of parameters. It typically samples neurons independently and identically at random and sets their outputs to zero. Extensive experiments have shown that dropout can help obtain state-of-the-art performance on a range of benchmark data sets. Recently, dropout has also been found to improve the performance of logistic regression and other single-layer models for natural language tasks such as document classification and named entity recognition.

In this paper, instead of zeroing out features or neurons independently and identically at random, we propose to use multinomial sampling for dropout, i.e., sampling features or neurons according to a multinomial distribution with different probabilities for different features/neurons. Intuitively, it makes more sense to use non-uniform multinomial sampling than identical and independent sampling for different features/neurons. For example, in shallow learning, if input features are centered, we can drop out features with small variance more frequently or completely, allowing the training to focus on more important features and consequently enabling faster convergence. To justify the multinomial sampling for dropout and reveal the optimal sampling probabilities, we conduct a rigorous analysis on the risk bound of shallow learning by stochastic optimization with multinomial dropout, and demonstrate that a distribution-dependent dropout leads to a smaller expected risk (i.e., faster convergence and smaller generalization error).

Inspired by the distribution-dependent dropout, we propose a data-dependent dropout for shallow learning, and an evolutional dropout for deep learning. For shallow learning, the sampling probabilities are computed from the second-order statistics of features of the training data. For deep learning, the sampling probabilities of dropout for a layer are computed on-the-fly from the second-order statistics of the layer’s outputs based on a mini-batch of examples. This is particularly suited for deep learning because (i) the distribution of each layer’s outputs is evolving over time, which is known as internal covariate shift; (ii) passing through all the training data in deep neural networks (in particular deep convolutional neural networks) is much more expensive than through a mini-batch of examples. For a mini-batch of examples, we can leverage parallel computing architectures to accelerate the computation of sampling probabilities.

We note that the proposed evolutional dropout achieves an effect similar to the batch normalization technique (Z-normalization based on a mini-batch of examples) but with a different flavor. Both approaches can be considered to tackle the issue of internal covariate shift for accelerating the convergence. Batch normalization tackles the issue by normalizing the output of neurons to zero mean and unit variance and then performing dropout independently; its authors also reported that in some cases dropout is not even necessary. In contrast, our proposed evolutional dropout tackles this issue from another perspective by exploiting a distribution-dependent dropout, which adapts the sampling probabilities to the evolving distribution of a layer’s outputs. In other words, it uses normalized sampling probabilities based on the second-order statistics of internal distributions. Indeed, we notice that for shallow learning with Z-normalization (normalizing each feature to zero mean and unit variance) the proposed data-dependent dropout reduces to uniform dropout that acts similarly to the standard dropout. Because of this connection, the presented theoretical analysis also sheds some light on the power of batch normalization from the angle of theory. Compared to batch normalization, the proposed distribution-dependent dropout is still attractive because (i) it is rooted in theoretical analysis of the risk bound; (ii) it introduces no additional parameters or layers and does not complicate the back-propagation and the inference; (iii) it facilitates further research because it shares the same mathematical foundation as standard dropout (e.g., it is equivalent to a form of data-dependent regularizer).

We summarize the main contributions of the paper below.

We propose a multinomial dropout and demonstrate that a distribution-dependent dropout leads to a faster convergence and a smaller generalization error through the risk bound analysis for shallow learning.

We propose an efficient evolutional dropout for deep learning based on the distribution-dependent dropout.

We justify the proposed dropouts for both shallow learning and deep learning by experimental results on several benchmark datasets.

In the remainder, we first review some related work and preliminaries. We present the main results in Section 4 and experimental results in Section 5.

Related Work

In this section, we review some related work on dropout and optimization algorithms for deep learning.

Dropout is a simple yet effective technique to prevent overfitting in training deep neural networks. It has recently received much attention from researchers studying its practical and theoretical properties. Notably, Wager et al. and Baldi and Sadowski have analyzed dropout from a theoretical viewpoint and found that dropout is equivalent to a data-dependent regularizer. The simplest form of dropout is to multiply hidden units by i.i.d. Bernoulli noise. Several recent works also found that other types of noise (e.g., Gaussian noise) work as well as Bernoulli noise, which could lead to a better approximation of the marginalized loss. Some works tried to optimize the hyper-parameters that define the noise level in a Bayesian framework. Graham et al. used the same noise across a batch of examples in order to speed up the computation. Adaptive dropout overlays a binary belief network over a neural network, which adds computational overhead to dropout because one has to train the additional binary belief network. In contrast, the present work proposes a new dropout with noise sampled according to distribution-dependent sampling probabilities. To the best of our knowledge, this is the first work that rigorously studies this type of dropout with theoretical analysis of the risk bound. It is demonstrated that the new dropout can improve the speed of convergence.

Stochastic gradient descent with back-propagation has been used extensively in optimizing deep neural networks. However, it is notorious for its slow convergence, especially for deep learning. Recently, a battery of studies has emerged that try to accelerate the optimization of deep learning, tackling the problem from different perspectives. Among them, we notice that the developed evolutional dropout for deep learning achieves an effect similar to batch normalization in addressing the internal covariate shift issue (i.e., evolving distributions of internal hidden units).

Preliminaries

In this section, we present some preliminaries, including the framework of risk minimization in machine learning and learning with dropout noise. We also introduce the multinomial dropout, which allows us to construct a distribution-dependent dropout as revealed in the next section.

In this paper, we are interested in learning with dropout, i.e., the feature vector $\mathbf{x}$ is corrupted by a dropout noise. In particular, let $\boldsymbol{\epsilon}\sim\mathcal{M}$ denote a dropout noise vector of dimension $d$, and the corrupted feature vector is given by $\widehat{\mathbf{x}}=\mathbf{x}\circ\boldsymbol{\epsilon}$, where the operator $\circ$ represents the element-wise multiplication. Let $\widehat{\mathcal{P}}$ denote the joint distribution of the new data $(\widehat{\mathbf{x}},y)$ and $\widehat{\mathcal{D}}$ denote the marginal distribution of $\widehat{\mathbf{x}}$. With the corrupted data, the risk minimization becomes

$$\min_{f\in\mathcal{H}}\;\mathrm{E}_{\widehat{\mathcal{P}}}\left[\ell(f(\widehat{\mathbf{x}}),y)\right],$$

where $\ell$ denotes the loss function and $\mathcal{H}$ the class of prediction functions.

To this end, we introduce the following multinomial dropout.

Definition 1 (Multinomial Dropout) A multinomial dropout is defined as $\widehat{\mathbf{x}}=\mathbf{x}\circ\boldsymbol{\epsilon}$, where $\epsilon_{i}=\frac{m_{i}}{kp_{i}},i\in[d]$ and $\{m_{1},\ldots,m_{d}\}$ follow a multinomial distribution $Mult(p_{1},\ldots,p_{d};k)$ with $\sum_{i=1}^{d}p_{i}=1$ and $p_{i}\geq 0$.
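The definition above can be illustrated with a short NumPy sketch. It draws the counts $m$ from a multinomial distribution and rescales them into the noise vector $\boldsymbol{\epsilon}$. The function name `multinomial_dropout_noise`, the probabilities and the value of $k$ are illustrative assumptions, not part of the paper, and the sketch assumes every $p_{i}>0$.

```python
import numpy as np

def multinomial_dropout_noise(p, k, rng, size=None):
    """Draw eps with eps_i = m_i / (k * p_i), where m ~ Mult(p_1, ..., p_d; k).

    Assumes every p_i > 0; a zero probability would divide by zero.
    """
    p = np.asarray(p, dtype=float)
    m = rng.multinomial(k, p, size=size)  # integer counts; each draw sums to k
    return m / (k * p)

rng = np.random.default_rng(0)
p = np.array([0.5, 0.3, 0.2])  # illustrative probabilities, not from the paper
k = 2

x = np.array([1.0, -2.0, 4.0])
x_hat = x * multinomial_dropout_noise(p, k, rng)  # one corrupted feature vector

# Monte Carlo estimate of E[eps]; each coordinate should be close to 1.
eps = multinomial_dropout_noise(p, k, rng, size=200_000)
mean_eps = eps.mean(axis=0)
```

Since $\mathrm{E}[m_{i}]=kp_{i}$, the scaling by $1/(kp_{i})$ keeps the corrupted features unbiased, i.e., $\mathrm{E}_{\boldsymbol{\epsilon}}[\widehat{\mathbf{x}}]=\mathbf{x}$, and at most $k$ coordinates are non-zero in each draw.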

Dropout as a regularizer has been studied for logistic regression in prior work; the result is stated in the following proposition for ease of discussion later.

Remark: It is notable that $R_{\mathcal{D},\mathcal{M}}\geq 0$ due to Jensen's inequality. Using the second-order Taylor expansion, prior work showed that the following approximation of $R_{\mathcal{D},\mathcal{M}}(\mathbf{w})$ is easy to manipulate and understand:

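where $q(\mathbf{w}^{\top}\mathbf{x})=\frac{1}{1+\exp(-\mathbf{w}^{\top}\mathbf{x}/2)}$, and $C_{\mathcal{M}}$ denotes the covariance matrix in terms of $\boldsymbol{\epsilon}$. In particular, if $\boldsymbol{\epsilon}$ is the standard dropout noise, then $C_{\mathcal{M}}[\mathbf{x}\circ\boldsymbol{\epsilon}]=diag(x_{1}^{2}\delta/(1-\delta),\ldots,x_{d}^{2}\delta/(1-\delta))$, where $diag(s_{1},\ldots,s_{d})$ denotes a $d\times d$ diagonal matrix with the $i$-th diagonal entry equal to $s_{i}$. If $\boldsymbol{\epsilon}$ is the multinomial dropout noise in Definition 1, we have

$$C_{\mathcal{M}}[\mathbf{x}\circ\boldsymbol{\epsilon}]_{ii}=\frac{x_{i}^{2}(1-p_{i})}{kp_{i}},\qquad C_{\mathcal{M}}[\mathbf{x}\circ\boldsymbol{\epsilon}]_{ij}=-\frac{x_{i}x_{j}}{k}\quad(i\neq j),$$

which follows from $\mathrm{Var}(m_{i})=kp_{i}(1-p_{i})$ and $\mathrm{Cov}(m_{i},m_{j})=-kp_{i}p_{j}$ for the multinomial distribution.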
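The covariance of the corrupted features under both noise models can be checked numerically. The sketch below assumes that the standard dropout noise is $\epsilon_{i}=b_{i}/(1-\delta)$ with $b_{i}$ a Bernoulli variable with success probability $1-\delta$, and it uses the multinomial covariance implied by Definition 1, namely diagonal entries $x_{i}^{2}(1-p_{i})/(kp_{i})$ and off-diagonal entries $-x_{i}x_{j}/k$. The vector `x`, the probabilities `p` and the sample size are illustrative values only.

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.array([1.0, -2.0, 0.5])   # illustrative feature vector
p = np.array([0.2, 0.5, 0.3])    # illustrative sampling probabilities
k, delta, n = 3, 0.5, 400_000

# Standard dropout (assumed form): eps_i = b_i / (1 - delta), b_i ~ Bernoulli(1 - delta).
b = rng.random((n, x.size)) < (1.0 - delta)
cov_std = np.cov(x * b / (1.0 - delta), rowvar=False)
cov_std_theory = np.diag(x**2 * delta / (1.0 - delta))

# Multinomial dropout: eps_i = m_i / (k * p_i), m ~ Mult(p; k).
m = rng.multinomial(k, p, size=n)
cov_mult = np.cov(x * m / (k * p), rowvar=False)
cov_mult_theory = -np.outer(x, x) / k                      # off-diagonal entries
cov_mult_theory[np.diag_indices(x.size)] = x**2 * (1.0 - p) / (k * p)
```

Unlike the standard dropout, the multinomial noise is correlated across coordinates, because the counts compete for the same $k$ draws.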

Learning with Multinomial Dropout

In this section, we analyze a stochastic optimization approach for minimizing the dropout loss in (2). Assume the sampling probabilities are known. We first obtain a risk bound of learning with multinomial dropout for stochastic optimization. Then we try to minimize the factors in the risk bound that depend on the sampling probabilities. We would like to emphasize that our goal here is not to show that using dropout renders a smaller risk than not using dropout, but rather to focus on the impact of different sampling probabilities on the risk. Let the initial solution be $\mathbf{w}_{1}$. At iteration $t$, we sample $(\mathbf{x}_{t},y_{t})\sim\mathcal{P}$ and $\boldsymbol{\epsilon}_{t}\sim\mathcal{M}$ as in Definition 1 and then update the model by

$$\mathbf{w}_{t+1}=\mathbf{w}_{t}-\eta\,\mathbf{g}_{t},$$

where $\mathbf{g}_{t}$ denotes the (sub)gradient of the loss on the corrupted example $(\mathbf{x}_{t}\circ\boldsymbol{\epsilon}_{t},y_{t})$ with respect to $\mathbf{w}_{t}$ and $\eta$ is the step size.

The following theorem establishes a risk bound of $\widehat{\mathbf{w}}_{n}$ in expectation.

Remark: In the above theorem, we can choose $\mathbf{w}_{*}$ to be the best model that minimizes the expected risk in (1). Since $R_{\mathcal{D},\mathcal{M}}(\mathbf{w})\geq 0$, the upper bound in the theorem above is also an upper bound of the risk of $\widehat{\mathbf{w}}_{n}$, i.e., $\mathcal{L}(\widehat{\mathbf{w}}_{n})$, in expectation. The proof of the above theorem follows the standard analysis of stochastic gradient descent. The detailed proof of the theorem is included in the appendix.
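The stochastic optimization procedure analyzed in this section can be sketched as follows for the logistic loss. This is a minimal illustration under stated assumptions rather than the authors' implementation: it assumes labels in $\{-1,+1\}$, a fixed step size `eta`, the initial solution $\mathbf{w}_{1}=0$, and that the returned solution is the average of the iterates. The function name `sgd_multinomial_dropout` and the synthetic data are hypothetical.

```python
import numpy as np

def sgd_multinomial_dropout(X, y, p, k, eta, n_steps, rng):
    """SGD on the logistic loss with multinomial dropout noise.

    X: (N, d) features; y: (N,) labels in {-1, +1}; p: (d,) sampling
    probabilities (all > 0); k: number of multinomial trials.
    Returns the average of the iterates w_1, ..., w_n (an assumption here).
    """
    N, d = X.shape
    w = np.zeros(d)                 # w_1 = 0
    w_sum = np.zeros(d)
    for _ in range(n_steps):
        w_sum += w                  # accumulate w_t before it is updated
        i = rng.integers(N)         # draw (x_t, y_t)
        eps = rng.multinomial(k, p) / (k * p)   # draw eps_t as in Definition 1
        x_hat = X[i] * eps          # corrupted features
        margin = y[i] * (w @ x_hat)
        # Gradient of log(1 + exp(-margin)) w.r.t. w; the factor
        # 0.5 * (1 - tanh(margin / 2)) equals 1 / (1 + exp(margin)).
        g = -y[i] * x_hat * 0.5 * (1.0 - np.tanh(margin / 2.0))
        w = w - eta * g
    return w_sum / n_steps

# Synthetic usage example (illustrative data, not from the paper).
rng = np.random.default_rng(0)
N, d = 500, 5
X = rng.normal(size=(N, d))
w_true = np.array([2.0, -1.0, 0.5, 0.0, 0.0])
y = np.where(X @ w_true >= 0, 1.0, -1.0)
p = np.full(d, 1.0 / d)             # uniform sampling probabilities
w_hat = sgd_multinomial_dropout(X, y, p, k=3, eta=0.05, n_steps=5000, rng=rng)
train_acc = np.mean(np.where(X @ w_hat >= 0, 1.0, -1.0) == y)
```

Only the line that draws `eps` depends on the sampling probabilities, so different choices of `p` can be compared by changing that single argument.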

Let $\boldsymbol{\epsilon}$ follow the distribution $\mathcal{M}$ defined in Definition 1. Then

Next, we examine $R_{\mathcal{D},\mathcal{M}}(\mathbf{w}_{*})$ . Since direct manipulation on $R_{\mathcal{D},\mathcal{M}}(\mathbf{w}_{*})$ is difficult, we try to minimize the second order Taylor expansion $\widehat{R}_{\mathcal{D},\mathcal{M}}(\mathbf{w}_{*})$ for logistic loss. The following theorem establishes an upper bound of $\widehat{R}_{\mathcal{D},\mathcal{M}}(\mathbf{w}_{*})$ .

Remark: By minimizing the relaxed upper bound in Proposition 4, we obtain the same sampling probabilities as in (8). We note that a tighter upper bound can be established; however, it yields sampling probabilities that depend on the unknown $\mathbf{w}_{*}$.

where $[\mathbf{x}_{j}]_{i}$ denotes the $i$ -th feature of the $j$ -th example, which gives us a data-dependent dropout. We state it formally in the following definition.

Definition 2 (Data-dependent Dropout) Given a set of training examples $(\mathbf{x}_{1},y_{1}),\ldots,(\mathbf{x}_{n},y_{n})$, a data-dependent dropout is defined as $\widehat{\mathbf{x}}=\mathbf{x}\circ\boldsymbol{\epsilon}$, where $\epsilon_{i}=\frac{m_{i}}{kp_{i}},i\in[d]$ and $\{m_{1},\ldots,m_{d}\}$ follow a multinomial distribution $Mult(p_{1},\ldots,p_{d};k)$ with $p_{i}$ given by (9).

Remark: Note that if the data is normalized such that each feature has zero mean and unit variance (i.e., according to Z-normalization), the data-dependent dropout reduces to uniform dropout. It implies that the data-dependent dropout achieves an effect similar to Z-normalization plus uniform dropout. In this sense, our theoretical analysis also explains why Z-normalization usually speeds up the training.
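The data-dependent probabilities and the remark on Z-normalization can be illustrated with a small sketch. It assumes that the probabilities in (9) are proportional to the square root of the empirical second moment of each feature, i.e., $p_{i}\propto\sqrt{\frac{1}{n}\sum_{j}[\mathbf{x}_{j}]_{i}^{2}}$, which is the form suggested by the second-order statistics described above. The function name `data_dependent_probs` and the synthetic data are hypothetical.

```python
import numpy as np

def data_dependent_probs(X):
    """Sampling probabilities from empirical second moments (assumed form of (9)).

    X: (n, d) training features. Returns p with p_i proportional to
    sqrt(mean_j X[j, i]**2), normalized to sum to one.
    """
    root = np.sqrt(np.mean(np.square(X), axis=0))
    return root / root.sum()

rng = np.random.default_rng(0)
# Four features with very different scales (illustrative data).
X = rng.normal(size=(1000, 4)) * np.array([0.1, 1.0, 2.0, 5.0])
p_raw = data_dependent_probs(X)      # larger-scale features are sampled more often

# After Z-normalization every feature has unit second moment,
# so the probabilities become uniform.
Z = (X - X.mean(axis=0)) / X.std(axis=0)
p_z = data_dependent_probs(Z)
```

A feature that is identically zero would receive probability zero under this form, so an implementation would need to exclude such features or add a small constant before dividing by $kp_{i}$.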

4.2 Evolutional Dropout for Deep Learning

Next, we discuss how to implement the distribution-dependent dropout for deep learning. In training deep neural networks, the dropout is usually added to the intermediate layers (e.g., fully connected layers and convolutional layers). Let $\mathbf{x}^{l}=(x^{l}_{1},\ldots,x^{l}_{d})$ denote the outputs of the $l$-th layer (with the index of data omitted). Adding dropout to this layer is equivalent to multiplying $\mathbf{x}^{l}$ by a dropout noise vector $\boldsymbol{\epsilon}^{l}$, i.e., feeding $\widehat{\mathbf{x}}^{l}=\mathbf{x}^{l}\circ\boldsymbol{\epsilon}^{l}$ as the input to the next layer. Inspired by the data-dependent dropout, we can generate $\boldsymbol{\epsilon}^{l}$ according to a distribution given in Definition 1 with sampling probabilities $p^{l}_{i}$ computed from $\{\mathbf{x}_{1}^{l},\ldots,\mathbf{x}^{l}_{n}\}$ in a manner similar to (9). However, deep learning is usually trained with big data, and a deep neural network is optimized by mini-batch stochastic gradient descent. Therefore, passing through all examples at each iteration would be too expensive. To address this issue, we propose to use a mini-batch of examples to calculate the second-order statistics, similar to what was done in batch normalization. Let $X^{l}=(\mathbf{x}^{l}_{1},\ldots,\mathbf{x}^{l}_{m})$ denote the outputs of the $l$-th layer for a mini-batch of $m$ examples. Then we can calculate the probabilities for dropout by

which define the evolutional dropout, named as such because the probabilities $p_{i}^{l}$ also evolve as the distribution of the layer’s outputs evolves. We describe the evolutional dropout as applied to a layer of a deep neural network in Figure 1.
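The evolutional dropout applied to one layer over a mini-batch can be sketched as below. The sketch assumes that the mini-batch probabilities take the same form as the data-dependent ones, $p^{l}_{i}\propto\sqrt{\frac{1}{m}\sum_{j}[\mathbf{x}^{l}_{j}]_{i}^{2}}$, and that an independent noise vector is drawn for each example. The function name `evolutional_dropout` and the small constant `tiny`, which guards against units that are zero on the whole mini-batch, are additions for illustration and not part of the paper.

```python
import numpy as np

def evolutional_dropout(X, k, rng, tiny=1e-12):
    """Apply multinomial dropout with mini-batch dependent probabilities.

    X: (m, d) outputs of a layer for a mini-batch of m examples.
    k: number of multinomial trials (dropout level).
    Returns the corrupted outputs X_hat and the noise mask, both (m, d).
    """
    n_examples = X.shape[0]
    # Root of the second moment of each unit over the mini-batch.
    root = np.sqrt(np.mean(np.square(X), axis=0)) + tiny
    p = root / root.sum()
    counts = rng.multinomial(k, p, size=n_examples)  # one draw per example
    mask = counts / (k * p)
    return X * mask, mask

rng = np.random.default_rng(0)
X = np.maximum(rng.normal(size=(128, 10)), 0.0)  # ReLU-like activations
k = 5                                            # k = 0.5 d
X_hat, mask = evolutional_dropout(X, k, rng)
```

The probabilities are recomputed on every call, so they track the current distribution of the layer's outputs without storing any running statistics.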

Finally, we would like to compare the evolutional dropout with batch normalization. Similar to batch normalization, evolutional dropout can also address the internal covariate shift issue by adapting the sampling probabilities to the evolving distribution of layers’ outputs. However, different from batch normalization, evolutional dropout is a randomized technique, which enjoys many of the benefits of standard dropout, including: (i) the back-propagation is simple to implement (just multiplying the gradient of $\widehat{X}^{l}$ by the dropout mask to get the gradient of $X^{l}$); (ii) the inference (i.e., testing) remains the same (different from some implementations of standard dropout, which do not scale by $1/(1-\delta)$ in training but scale by $1-\delta$ in testing, here we do scale in training and thus do not need any scaling in testing); (iii) it is equivalent to a data-dependent regularizer with a clear mathematical explanation; (iv) it prevents co-adaptation of neurons, which facilitates generalization. Moreover, the evolutional dropout has its root in distribution-dependent dropout, which has a theoretical guarantee to accelerate the convergence and improve the generalization for shallow learning.
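Points (i) and (ii) can be made concrete with a minimal layer sketch. It treats the mask as a constant during back-propagation, as described in (i), and returns the input unchanged at test time, as described in (ii). The class name `EvolutionalDropoutLayer`, its interface and the assumed form of the probabilities (root of the mini-batch second moment, plus a small constant for numerical safety) are illustrative assumptions.

```python
import numpy as np

class EvolutionalDropoutLayer:
    """Illustrative dropout layer with forward, backward and test-time modes."""

    def __init__(self, k, seed=0):
        self.k = k
        self.rng = np.random.default_rng(seed)
        self.mask = None

    def forward(self, X, train=True):
        if not train:
            return X  # no rescaling at test time: scaling was done in training
        root = np.sqrt(np.mean(np.square(X), axis=0)) + 1e-12
        p = root / root.sum()
        counts = self.rng.multinomial(self.k, p, size=X.shape[0])
        self.mask = counts / (self.k * p)
        return X * self.mask

    def backward(self, grad_X_hat):
        # The mask is treated as a constant, so the gradient of X is the
        # gradient of X_hat multiplied element-wise by the mask.
        return grad_X_hat * self.mask

layer = EvolutionalDropoutLayer(k=4, seed=0)
X = np.abs(np.random.default_rng(1).normal(size=(32, 8)))
out_train = layer.forward(X, train=True)
grad_X = layer.backward(np.ones_like(X))
out_test = layer.forward(X, train=False)
```

Because no parameters are learned and no statistics are stored, the layer adds nothing to the model that has to be carried over to inference.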

Experimental Results

In this section, we present some experimental results to justify the proposed dropouts. In all experiments, we set $\delta=0.5$ in the standard dropout and $k=0.5d$ in the proposed dropouts for fair comparison, where $d$ represents the number of features or neurons of the layer that dropout is applied to. For the sake of clarity, we divide the experiments into three parts. In the first part, we compare the performance of the data-dependent dropout (d-dropout) to the standard dropout (s-dropout) for logistic regression. In the second part, we compare the performance of evolutional dropout (e-dropout) to the standard dropout for training deep convolutional neural networks. Finally, we compare e-dropout with batch normalization.

We implement the presented stochastic optimization algorithm. To evaluate the performance of data-dependent dropout for shallow learning, we use three data sets: real-sim, news20 and RCV1 (https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/). In this experiment, we use a fixed step size, tune the step size in $[0.1,0.05,0.01,0.005,0.001,0.0005,0.0001]$ and report the best results in terms of convergence speed on the training data for both standard dropout and data-dependent dropout. The left three panels in Figure 2 show the obtained results on these three data sets. In each panel, we plot both the training error and the testing error. We can see that both the training and testing errors using the proposed data-dependent dropout decrease much faster than using the standard dropout, and that a smaller testing error is achieved by using the data-dependent dropout.

5.2 Evolutional Dropout for Deep Learning

We would like to emphasize that we are not aiming to obtain better prediction performance by trying different network structures and different engineering tricks such as data augmentation, whitening, etc., but rather focus on the comparison of the proposed dropout to the standard dropout using Bernoulli noise on the same network structure. In our experiments, we use the default splitting of training and testing data in all data sets. We directly optimize the neural networks using all training images, without further splitting them into a validation set that is added back into the training at later stages, which explains some marginal gaps from the results in the literature that we observed (e.g., on CIFAR-10).

We conduct experiments on four benchmark data sets for comparing e-dropout and s-dropout: MNIST, SVHN, CIFAR-10 and CIFAR-100. We use the same or similar network structures as in the literature for the four data sets. In general, the networks consist of convolution layers, pooling layers, locally connected layers, fully connected layers, softmax layers and a cost layer. For the detailed neural network structures and their parameters, please refer to the supplementary materials. The dropout is added to some fully connected layers or locally connected layers. The rectified linear activation function is used for all neurons. All the experiments are conducted using the cuda-convnet library (https://code.google.com/archive/p/cuda-convnet/). The training procedure follows previous works, using mini-batch SGD with momentum (0.9). The size of the mini-batch is fixed to 128. The weights are initialized from a Gaussian distribution with mean zero and standard deviation $0.01$. The learning rate (i.e., step size) is decreased after a number of epochs, similar to what was done in previous works. We tune the initial learning rates for s-dropout and e-dropout separately from $\{0.001,0.005,0.01,0.1\}$ and report the best result on each data set that yields the fastest convergence.

Figure 3 shows the training and testing error curves in the optimization process on the four data sets using the standard dropout and the evolutional dropout. For SVHN data, we only report the first 12000 iterations, after which the error curves of the two methods almost overlap. We can see that using the evolutional dropout generally converges faster than using the standard dropout. On CIFAR-100 data, we have observed significant speed-up. In particular, the evolutional dropout achieves relative improvements over 10% on the testing performance and over 50% on the convergence speed compared to the standard dropout.

5.3 Comparison with Batch Normalization (BN)

Finally, we make a comparison between the evolutional dropout and the batch normalization. For batch normalization, we use the implementation in Caffe (https://github.com/BVLC/caffe/). We compare the evolutional dropout with the batch normalization on the CIFAR-10 data set. The network structure is from the Caffe package and can be found in the supplement; it is different from the one used in the previous experiment. It contains three convolutional layers and one fully connected layer. Each convolutional layer is followed by a pooling layer. We compare four methods: (1) No BN and No dropout, i.e., without using batch normalization and dropout; (2) BN; (3) BN with standard dropout; (4) Evolutional Dropout. The rectified linear activation is used in all methods. We also tried BN with the sigmoid activation function, which gives worse results. For the methods with BN, three batch normalization layers are inserted before or after each pooling layer, following the architecture given in the Caffe package (see supplement). For the evolutional dropout training, only one layer of dropout is added to the last convolutional layer. The mini-batch size is set to 100, the default value in Caffe. The initial learning rates for the four methods are set to the same value ($0.001$), and they are decreased once by a factor of ten. The testing accuracy versus the number of iterations is plotted in the right panel of Figure 2, from which we can see that the evolutional dropout training achieves comparable performance with BN + standard dropout, which justifies our claim that evolutional dropout also addresses the internal covariate shift issue.

Conclusion

In this paper, we have proposed a distribution-dependent dropout for both shallow learning and deep learning. Theoretically, we proved that the new dropout achieves a smaller risk and faster convergence. Based on the distribution-dependent dropout, we developed an efficient evolutional dropout for training deep neural networks that adapts the sampling probabilities to the evolving distributions of layers’ outputs. Experimental results on various data sets verified that the proposed dropouts can dramatically improve the convergence and also reduce the testing error.

We thank anonymous reviewers for their comments. Z. Li and T. Yang are partially supported by National Science Foundation (IIS-1463988, IIS-1545995). B. Gong is supported in part by NSF (IIS-1566511) and a gift from Adobe.

Supplement

Let $\mathbf{w}_{t+1}=\mathbf{w}_{t}-\eta\mathbf{g}_{t}$ and $\mathbf{w}_{1}=0$ . Then for any $\|\mathbf{w}_{*}\|_{2}\leq r$ we have

By taking expectation on both sides over the randomness in $(\mathbf{x}_{t},y_{t},\boldsymbol{\epsilon}_{t})$ and noting the bound on $\|\mathbf{g}_{t}\|_{2}$ , we have

2 Proof of Lemma 1

By summing the above inequality over $t=1,\ldots,n$ , we obtain

By noting that $\mathbf{w}_{1}=0$ and $\|\mathbf{w}_{*}\|_{2}\leq r$ , we obtain the inequality in Lemma 1.
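The argument behind Lemma 1 follows the usual analysis of the gradient update. The derivation below is a sketch of that textbook argument written with the symbols used above; the exact form and constants of the lemma as stated by the authors may differ.

```latex
\begin{align*}
\|\mathbf{w}_{t+1}-\mathbf{w}_{*}\|_{2}^{2}
  &= \|\mathbf{w}_{t}-\eta\mathbf{g}_{t}-\mathbf{w}_{*}\|_{2}^{2} \\
  &= \|\mathbf{w}_{t}-\mathbf{w}_{*}\|_{2}^{2}
     - 2\eta\,\mathbf{g}_{t}^{\top}(\mathbf{w}_{t}-\mathbf{w}_{*})
     + \eta^{2}\|\mathbf{g}_{t}\|_{2}^{2}, \\
\mathbf{g}_{t}^{\top}(\mathbf{w}_{t}-\mathbf{w}_{*})
  &= \frac{\|\mathbf{w}_{t}-\mathbf{w}_{*}\|_{2}^{2}
           -\|\mathbf{w}_{t+1}-\mathbf{w}_{*}\|_{2}^{2}}{2\eta}
     + \frac{\eta}{2}\|\mathbf{g}_{t}\|_{2}^{2}, \\
\sum_{t=1}^{n}\mathbf{g}_{t}^{\top}(\mathbf{w}_{t}-\mathbf{w}_{*})
  &\leq \frac{\|\mathbf{w}_{1}-\mathbf{w}_{*}\|_{2}^{2}}{2\eta}
     + \frac{\eta}{2}\sum_{t=1}^{n}\|\mathbf{g}_{t}\|_{2}^{2}
   \;\leq\; \frac{r^{2}}{2\eta}
     + \frac{\eta}{2}\sum_{t=1}^{n}\|\mathbf{g}_{t}\|_{2}^{2}.
\end{align*}
```

The third line sums the second over $t=1,\ldots,n$, where the distance terms telescope and the non-positive term $-\|\mathbf{w}_{n+1}-\mathbf{w}_{*}\|_{2}^{2}/(2\eta)$ is dropped; the last step uses $\mathbf{w}_{1}=0$ and $\|\mathbf{w}_{*}\|_{2}\leq r$.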

3 Proof of Proposition 2

Since $\{m_{1},\ldots,m_{d}\}$ follows a multinomial distribution $Mult(p_{1},\ldots,p_{d};k)$ , we have

The result in the Proposition follows by combining the above two equations.
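The moments used in this proof can be written out explicitly. The derivation below uses only Definition 1 and the standard moments of the multinomial distribution; the final expression for the second moment of the corrupted features is a sketch of the quantity the proposition appears to concern, consistent with the observation in the next proof that only the first term depends on $p_{i}$.

```latex
\begin{align*}
\mathrm{E}[m_{i}] &= kp_{i}, \qquad
\mathrm{Var}(m_{i}) = kp_{i}(1-p_{i}), \qquad
\mathrm{Cov}(m_{i},m_{j}) = -kp_{i}p_{j}\ (i\neq j), \\
\mathrm{E}[\epsilon_{i}] &= \frac{\mathrm{E}[m_{i}]}{kp_{i}} = 1, \\
\mathrm{E}[\epsilon_{i}^{2}]
  &= \frac{\mathrm{Var}(m_{i})+(\mathrm{E}[m_{i}])^{2}}{k^{2}p_{i}^{2}}
   = \frac{1-p_{i}}{kp_{i}} + 1
   = \frac{1}{kp_{i}} + \frac{k-1}{k}, \\
\mathrm{E}_{\boldsymbol{\epsilon}}\big[\|\mathbf{x}\circ\boldsymbol{\epsilon}\|_{2}^{2}\big]
  &= \sum_{i=1}^{d}x_{i}^{2}\,\mathrm{E}[\epsilon_{i}^{2}]
   = \frac{1}{k}\sum_{i=1}^{d}\frac{x_{i}^{2}}{p_{i}}
     + \frac{k-1}{k}\sum_{i=1}^{d}x_{i}^{2}.
\end{align*}
```

Taking the expectation over $\mathbf{x}\sim\mathcal{D}$ replaces $x_{i}^{2}$ by $\mathrm{E}_{\mathcal{D}}[x_{i}^{2}]$ in both sums.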

4 Proof of Proposition 3

Note that only the first term in the R.H.S of Eqn. (7) depends on $p_{i}$ . Thus,

The result then follows the KKT conditions.
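The KKT step can be sketched as follows. The derivation assumes that the term to be minimized has the form $\sum_{i}a_{i}/p_{i}$ with $a_{i}=\mathrm{E}_{\mathcal{D}}[x_{i}^{2}]>0$, which is the form suggested by the second-order statistics used throughout the paper; the symbols $a_{i}$ and $\lambda$ are introduced here only for illustration.

```latex
\begin{align*}
&\min_{p_{1},\ldots,p_{d}}\ \sum_{i=1}^{d}\frac{a_{i}}{p_{i}}
 \quad\text{s.t.}\quad \sum_{i=1}^{d}p_{i}=1,\ \ p_{i}\geq 0, \\
&L(\mathbf{p},\lambda)
 = \sum_{i=1}^{d}\frac{a_{i}}{p_{i}}
   + \lambda\Big(\sum_{i=1}^{d}p_{i}-1\Big), \qquad
 \frac{\partial L}{\partial p_{i}}
 = -\frac{a_{i}}{p_{i}^{2}}+\lambda = 0
 \ \Longrightarrow\ p_{i}=\sqrt{a_{i}/\lambda}, \\
&\sum_{i=1}^{d}p_{i}=1
 \ \Longrightarrow\ \sqrt{\lambda}=\sum_{j=1}^{d}\sqrt{a_{j}}
 \ \Longrightarrow\ p_{i}^{*}=\frac{\sqrt{a_{i}}}{\sum_{j=1}^{d}\sqrt{a_{j}}}.
\end{align*}
```

The non-negativity constraints are inactive at this solution because every $p_{i}^{*}$ is strictly positive, and the objective is convex on the positive orthant, so the stationary point is the minimizer.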

5 Proof of Proposition 4

We prove the first upper bound first. From Eqn. (4) in the paper, we have

where we use the fact $\sqrt{ab}\leq\frac{a+b}{2}$ for $a,b\geq 0$ . Using Eqn. (5) in the paper, we have

This gives a tight bound of $\widehat{R}_{\mathcal{D},\mathcal{M}}(\mathbf{w}_{*})$ , i.e.,

By minimizing the above upper bound over $p_{i}$, we obtain the following probabilities

which depend on the unknown $\mathbf{w}_{*}$. To address this issue, we derive a relaxed upper bound. We note that

where $I_{d}$ denotes the identity matrix of dimension $d$ . Thus

By noting the result in Proposition 2 in the paper, we have

which proves the upper bound in Proposition 4.

6 Neural Network Structures

In this section we present the neural network structures and the number of filters, filter size, padding and stride parameters for MNIST, SVHN, CIFAR-10 and CIFAR-100, respectively. Note that in Table 2, Table 3 and Table 4, the rnorm layer is the local response normalization layer and the local layer is the locally-connected layer with unshared weights.

6.1 MNIST

We used a neural network structure similar to that in the literature: two convolution layers, two fully connected layers, a softmax layer and a cost layer at the end. The dropout is added to the first fully connected layer. Table 1 presents the neural network structure and the number of filters, filter size, padding and stride parameters for MNIST.

6.2 SVHN

The neural network structure used for this data set is taken from the literature and includes 2 convolutional layers, 2 max pooling layers, 2 local response normalization layers, 2 fully connected layers, a softmax layer and a cost layer, with one dropout layer. Table 2 presents the neural network structure and the number of filters, filter size, padding and stride parameters used for the SVHN data set.

6.3 CIFAR-10

The neural network structure is adopted from the literature and consists of 2 convolutional layers, 2 pooling layers, 2 local response normalization layers, 2 locally connected layers, 2 fully connected layers, a softmax layer and a cost layer. Table 3 presents the detailed neural network structure and the number of filters, filter size, padding and stride parameters used.

6.4 CIFAR-100

The network structure for this data set is similar to a neural network structure in the literature and consists of 2 convolution layers, 2 max pooling layers, 2 local response normalization layers, 2 locally connected layers, 3 fully connected layers, a softmax layer and a cost layer. Table 4 presents the neural network structure and the number of filters, filter size, padding and stride parameters used for the CIFAR-100 data set.

6.5 The Neural Network Structure used for BN

Tables 5 and 6 present the network structures of different methods in subsection 5.3 in the paper. The layer pool(ave) in Table 5 and Table 6 represents the average pooling layer.