Curriculum Dropout

Pietro Morerio, Jacopo Cavazza, Riccardo Volpi, Rene Vidal, Vittorio Murino

Introduction

Since , deep neural networks have become ubiquitous in most computer vision applications. The reason is generally ascribed to the powerful hierarchical feature representations directly learnt from data, which usually outperform classical hand-crafted feature descriptors.

As a drawback, deep neural networks are difficult to train because of the non-convex optimization and the intensive computation required to learn the network parameters. Given the availability of both massive data and hardware resources, these training challenges can be tackled empirically, and deep architectures can be effectively trained in an end-to-end fashion by exploiting parallel GPU computation.

However, overfitting remains an issue. Indeed, such a gigantic number of parameters is likely to produce weights that are so specialized to the training examples that the network’s generalization capability may be extremely poor.

The seminal work of Hinton et al. argues that overfitting occurs as the result of excessive co-adaptation of feature detectors which manage to perfectly explain the training data. This leads to overcomplicated models which fit unseen test data points unsatisfactorily. To address this issue, the Dropout algorithm was proposed and investigated, and it is nowadays extensively used in training neural networks. The method consists in randomly suppressing neurons during training according to the values $r$ sampled from a Bernoulli distribution. More specifically, if $r=1$ the unit is kept unchanged, while if $r=0$ the unit is suppressed. The effect of suppressing a neuron is that the value of its output is set to zero during the forward pass of training, and its weights are not updated during the backward pass. Once one forward-backward pass is completed, a new sample of $r$ is drawn for each neuron, and another forward-backward pass is done, and so on until convergence. At test time, no neuron is suppressed and all activations are scaled by the mean value of the Bernoulli distribution. The resulting model is in fact often interpreted as an average of multiple models, and it is argued that this improves its generalization ability.
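The mechanism described above can be sketched in a few lines of plain Python. This is only an illustration under simplifying assumptions: a layer is represented as a flat list of activations, and the names `dropout_forward`, `retain_prob` and `rng` are hypothetical, not taken from any released implementation.

```python
import random


def dropout_forward(activations, retain_prob, rng, train=True):
    """Apply Bernoulli dropout to a flat list of activations (illustrative)."""
    if train:
        # r = 1 keeps the unit unchanged, r = 0 suppresses it (output set to zero).
        mask = [1 if rng.random() < retain_prob else 0 for _ in activations]
        return [a * r for a, r in zip(activations, mask)], mask
    # Test time: no unit is suppressed; activations are scaled by the Bernoulli mean.
    return [a * retain_prob for a in activations], None


rng = random.Random(0)
train_out, mask = dropout_forward([1.0] * 8, 0.5, rng, train=True)
test_out, _ = dropout_forward([2.0, 4.0], 0.5, rng, train=False)
print(mask, train_out, test_out)
```

A fresh mask is drawn at every call, which mirrors the resampling of $r$ after each forward-backward pass.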

Building on the Dropout idea, many works have proposed variations of the original strategy. However, it is still unclear which variation improves the most with respect to the original Dropout formulation. In many works there is no real theoretical justification of the proposed approach other than favorable empirical results. Therefore, providing a sound justification remains an open challenge. In addition, the lack of publicly available implementations makes fair comparisons problematic.

The point of departure of our work is the intuition that the excessive co-adaptation of feature detectors, which leads to overfitting, is very unlikely to occur in the early epochs of training. Thus, Dropout seems unnecessary at the beginning of training. Inspired by these considerations, in this work we propose to dynamically increase the number of units that are suppressed as a function of the number of gradient updates. Specifically, we introduce a generalization of the dropout scheme consisting of a temporal scheduling - a curriculum - for the expected number of suppressed units. By adapting in time the parameter of the Bernoulli distribution used for sampling, we smoothly increase the suppression rate as training evolves, thereby improving the generalization of the model.

In summary, the main contributions of this paper are the following.

We address the problem of overfitting in deep neural networks by proposing a novel regularization strategy called Curriculum Dropout that dynamically increases the expected number of suppressed units in order to improve the generalization ability of the model.

We draw connections between the original Dropout framework, regularization theory and curriculum learning. This provides an improved justification of (Curriculum) Dropout training, relating it to existing machine learning methods.

We complement our foundational analysis with a broad experimental validation, where we compare our Curriculum Dropout with the original Dropout and with anti-Curriculum paradigms, for (convolutional) neural network-based image classification. We evaluate the performance on standard datasets (MNIST, SVHN, CIFAR-10/100, Caltech-101/256). As the results show, the proposed method generally achieves superior classification performance.

The remainder of the paper is organized as follows. Relevant related works are summarized in §2, and Curriculum Dropout is presented in §3 and §4, together with its foundational interpretations. The experimental evaluation is carried out in §5. Conclusions and future work are presented in §6.

Related Work

As previously mentioned, dropout was introduced by Hinton et al. and Srivastava et al. Therein, the method is detailed and evaluated with different types of deep learning models (Multi-Layer Perceptrons, Convolutional Neural Networks, Restricted Boltzmann Machines) and datasets, confirming the effectiveness of this approach against overfitting. Since then, many works have investigated the topic.

Wan et al. propose Drop-Connect, a more general version of Dropout. Instead of directly setting units to zero, only some of the network connections are suppressed. This generalization is reported to perform better but to be slower to train than standard Dropout. Li et al. introduce data-dependent and Evolutional-dropout for shallow and deep learning, respectively. These versions are based on sampling neurons from a multinomial distribution with different probabilities for different units. Results show faster training and sometimes better accuracies. Wang et al. accelerate dropout. In their method, hidden units are dropped out using approximated sampling from a Gaussian distribution. Results show that this leads to fast convergence without deteriorating the accuracy. Bayer et al. carry out a fine analysis, showing that dropout can be proficiently applied to Recurrent Neural Networks. Wu and Gu analyze the effect of dropout on the convolutional layers of a CNN: they define a probabilistic weighted pooling, which effectively acts as a regularizer. Zhai and Zhang investigate the idea of dropout applied to matrix factorization. Ba and Frey introduce a binary belief network which is overlaid on a neural network to selectively suppress hidden units. The two networks are jointly trained, making the overall process more computationally expensive. Wager et al. apply Dropout to generalized linear models and approximately prove the equivalence between data-dependent $L^{2}$ regularization and dropout training with the AdaGrad optimizer. Rennie et al. propose to adjust the dropout rate, linearly decreasing the unit suppression rate during training, until the network experiences no dropout.

While some of the aforementioned methods can be applied in tandem, there is still a lack of understanding about which one is superior - this is also due to the lack of publicly released code. In this respect, the work of Rennie et al. is the most similar to ours. A few papers do not go beyond a bare experimental evaluation of the proposed dropout variation, omitting to justify the soundness of their approach. Conversely, while some works are much more formal than ours, all of them rely on approximations to carry out their analysis, which is biased towards shallow models (logistic or linear regression and matrix factorization). In contrast, in addition to showing its experimental effectiveness, we provide several natural justifications to corroborate the proposed dropout generalization for deep neural networks.

A Time Scheduling for the Dropout Rate

Deep Neural Networks display co-adaptations between units in terms of concurrent activations of highly organized clusters of neurons. During training, the latter specialize in detecting certain details of the image to be classified, as shown by Zeiler and Fergus. They visualize the high sensitivity of certain filters in different layers in detecting dogs, people’s faces, wheels and more general ordered geometrical patterns [32, Fig. 2]. Moreover, such co-adaptations are highly generalizable across different datasets, as shown by Torralba’s work. Indeed, the filter responses provided in AlexNet within the conv1, pool2/5 and fc7 layers are very similar [34, Fig. 5], even though the images used for training are very different: objects from ImageNet versus scenes from the Places dataset.

These arguments support the existence of some positive co-adaptations between neurons in the network. Nevertheless, as training proceeds, some co-adaptations can also be negative if they are excessively specific to the training images exploited for updating the gradients. Consequently, exaggerated co-adaptations between neurons weaken the network's generalization capability, ultimately resulting in overfitting. To prevent this, Dropout specifically counteracts those negative co-adaptations.

The latter can be removed by randomly suppressing neurons of the architecture, restoring an improved situation where the neurons are more “independent”. This is empirically reflected in a better generalization capability.

Although the previous interpretation is totally sound, the original Dropout algorithm cannot precisely accommodate it. Indeed, the suppression of a neuron in a given layer is modeled by a Bernoulli($\theta$) random variable, $0<\theta\leq 1$ (to avoid confusion in our notation, please note that $\theta$ is the equivalent of $p$ in [25], i.e. the probability of retaining a neuron). Employing such a distribution is very natural, since it statistically models binary activation/inhibition processes. In spite of that, it seems suboptimal that $\theta$ should be fixed during the whole training stage. With this operative choice, the original Dropout is actually treating the negative co-adaptation phenomena as uniformly distributed over the whole training time.

Differently, our intuition is that, at the beginning of the training, if any co-adaptation between units is displayed, this should be preserved as positively representing the self-organization of the network parameters towards their optimal configuration.

We can understand this by considering the random initialization of the network’s weights. They are statistically independent and actually not co-adapted at all. Also, it is quite unnatural for a neural network with random weights to overfit the data. On the other hand, the risk of overdone co-adaptations increases as the training proceeds since the loss minimization can achieve a small objective value by overcomplicating the hierarchical representation learnt from data. This implies that overfitting caused by excessive co-adaptations appears only after a while.

Since a fixed parameter $\theta$ is not able to handle increasing levels of negative co-adaptations, in this work we tackle this issue by proposing a time-dependent parameter $\theta(t)$. Here, $t$ denotes the training time, measured in gradient updates $t\in\{0,1,2,\dots\}$. Since $\theta(t)$ models the probability for a given neuron to be retained, $D\cdot\theta(t)$ counts the average number of units which remain active out of the total number $D$ in a given layer. Intuitively, such a quantity must be higher for the first gradient updates, and then start decreasing as soon as training gets under way. In the late stages of training, such a decrease should be stopped. We thus constrain $\theta(t)$ so that $\theta(t)\geq\overline{\theta}$ for any $t$, where $\overline{\theta}$ is a limit value, to be taken as $0.5\leq\overline{\theta}\leq 0.9$ as prescribed by the original dropout scheme [25, §A.4] (the higher the layer in the hierarchy, the lower the retain probability).

Inspired by the previous considerations, we propose the following definition for a curriculum function $\theta(t)$ aimed at improving dropout training (as will become clear in Section 4, from now on we will often use the terms curriculum and scheduling interchangeably).

Definition 1. Any function $t\mapsto\theta(t)$ such that $\theta(0)=1$ and $\theta(t)\searrow\overline{\theta}$ as $t\to\infty$ is said to be a curriculum function to generalize the original dropout formulation with retain probability $\overline{\theta}$.

Starting from the initial condition $\theta(0)=1$, where no unit suppression is performed, dropout is gradually introduced in such a way that $\theta(t)\geq\overline{\theta}$ for any $t$. Eventually (i.e. when $t$ is big enough), the convergence $\theta(t)\to\overline{\theta}$ models the fact that we retrieve the original Dropout formulation as a particular case of our curriculum.

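Among the functions as in Def. 1, in our work we fix

$$\theta_{\rm curriculum}(t) = (1-\overline{\theta})\exp(-\gamma t) + \overline{\theta}, \qquad \gamma>0. \tag{1}$$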

By considering Figure 2, we can provide intuitive and straightforward motivations regarding our choice.

The blue curves in Fig. 2 are polynomials of increasing degree $\delta\in\{1,\dots,10\}$ (left to right). Despite fulfilling the initial constraint $\theta(0)=1$, they have to be manually thresholded to impose $\theta(t)\to\overline{\theta}$ when $t\to\infty$. This introduces two more (undesired) parameters ($\delta$ and the threshold) with respect to the original Dropout, where the only quantity to be selected is $\overline{\theta}$.

The very same argument discourages the replacement of the variable $t$ by $t^{\alpha}$ in (1) (green curves in Fig. 2, $\alpha\in\{2,\dots,10\}$, left to right). Moreover, by evaluating the area under the curve, we can intuitively measure how aggressively the green curves delay the dropout scheme they eventually converge to (as $\theta(t)\to\overline{\theta}$). Precisely, that convergence becomes faster when moving towards the leftmost green curves, the fastest being achieved by our scheduling function (1) (red curve, Fig. 2).

One could still argue that the parameter $\gamma>0$ is inconvenient since it requires cross-validation. This is not necessary: in fact, $\gamma$ can be fixed according to the following heuristic. Although Def. 1 considers the limit of $\theta(t)$ for $t\to\infty$, such a condition has to be operatively replaced by $t\approx T$, where $T$ is the total number of gradient updates needed for optimization. It is thus totally reasonable to assume that the order of magnitude of $T$ is known a priori and fixed to be some power of $10$, such as $10^{4}$ or $10^{5}$. Therefore, for a curriculum function as in Def. 1, we are interested in additionally imposing $\theta(t)\approx\overline{\theta}$ when $t\approx T$. Actually, a rule of thumb such as

$$\gamma = \frac{10}{T} \tag{2}$$

implies $|\theta_{\rm curriculum}(T)-\overline{\theta}|<10^{-4}$ and was used for all the experiments in §5. Additionally, from Figure 2, we can gain some intuition about the fact that the asymptotic convergence to $\overline{\theta}$ is indeed realized for a substantial part of the training and well before $t\approx T$. This means that during a big portion of the training we are actually dropping out neurons as prescribed by the original Dropout, addressing the overfitting issue. In addition to these arguments, we will provide complementary insights on our scheduled implementation for dropout training.
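As a minimal sketch of the scheduling discussed above, the snippet below evaluates an exponential curriculum of the form $\theta(t)=(1-\overline{\theta})\exp(-\gamma t)+\overline{\theta}$ with the rule of thumb $\gamma=10/T$. Both the functional form and the function and argument names (`curriculum_theta`, `theta_bar`, `total_updates`) are assumptions made for illustration.

```python
import math


def curriculum_theta(t, theta_bar, total_updates):
    """Retain probability after t gradient updates (assumed exponential schedule)."""
    gamma = 10.0 / total_updates  # rule of thumb: gamma = 10 / T
    return (1.0 - theta_bar) * math.exp(-gamma * t) + theta_bar


T = 10_000          # assumed order of magnitude of the training length
THETA_BAR = 0.5     # limit retain probability of the layer

for t in (0, T // 10, T // 2, T):
    print(t, curriculum_theta(t, THETA_BAR, T))
```

With this choice the gap to the limit value at $t=T$ is $(1-\overline{\theta})e^{-10}$, which is below $10^{-4}$ for any $\overline{\theta}\in[0,1]$.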

Smarter initialization for the network weights.

The problem of optimizing deep neural networks is non-convex due to the non-linearities (ReLUs) and pooling steps. In spite of that, a few theoretical papers have investigated this issue from a sound mathematical perspective. For instance, under mild assumptions, Haeffele and Vidal derive sufficient conditions ensuring that a local minimum is also a global one, and guaranteeing that it can be found when starting from any initialization. The theory presented therein cannot be straightforwardly applied to the dropout case, due to the purely deterministic framework of the theoretical analysis that is carried out. Therefore, it is still an open question whether all initializations are equivalent for dropout training and, if not, which ones are preferable. Far from providing any theoretical insight of this kind, we posit that Curriculum Dropout can be interpreted as a smarter initialization. Indeed, we implement a soft transition between a classical dropout-free training of a network and the dropout one. From this perspective, our curriculum seems equivalent to performing dropout training of a network whose weights have already been slightly optimized, evidently resulting in a better initialization for them.

As a naive approach, one could perform regular training for a certain number of gradient updates and then apply dropout during the remaining ones. We call this Switch-Curriculum. It actually induces a discontinuity in the objective value which can damage the performance with respect to the smooth transition performed by our curriculum (1) (see Fig. 4).
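The difference between the two transitions can be made concrete by comparing the largest change of the retain probability between two consecutive gradient updates. The sketch below assumes an exponential smooth schedule with $\gamma=10/T$ and a hypothetical switch point at one fifth of the training; all names are illustrative.

```python
import math


def smooth_theta(t, theta_bar, total_updates):
    """Assumed exponential schedule with gamma = 10 / T."""
    gamma = 10.0 / total_updates
    return (1.0 - theta_bar) * math.exp(-gamma * t) + theta_bar


def switch_theta(t, theta_bar, switch_update):
    """Switch-Curriculum: no dropout before the switch, fixed dropout afterwards."""
    return 1.0 if t < switch_update else theta_bar


def largest_step(schedule, total_updates):
    """Largest change of the retain probability between consecutive updates."""
    values = [schedule(t) for t in range(total_updates + 1)]
    return max(abs(b - a) for a, b in zip(values, values[1:]))


T, THETA_BAR = 10_000, 0.5
smooth_jump = largest_step(lambda t: smooth_theta(t, THETA_BAR, T), T)
switch_jump = largest_step(lambda t: switch_theta(t, THETA_BAR, T // 5), T)
print(smooth_jump, switch_jump)
```

Under these assumptions the step schedule changes the retain probability by $1-\overline{\theta}$ in a single update, whereas the smooth one spreads the same total change over many small steps.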

Curriculum Dropout as adaptive regularization.

Several connections have been established between Dropout and model training with noise addition. The common trend discovered is that optimizing an unregularized loss function to fit artificially corrupted data is actually equivalent to minimizing the same loss augmented by a data-dependent penalty term. In both [28, Table 2] for linear/logistic regression and [25, §9.1] for least squares, it is proved that Dropout induces a regularizer which is scaled by $\theta(1-\theta)$.

When $\theta=\overline{\theta}$, the impact of the regularization is just fixed, therefore raising potential over- and under-fitting issues. But, for $\theta=\theta_{\rm curriculum}(t)$, when $t$ is small, the regularizer is set to zero ($\theta_{\rm curriculum}(0)=1$) and we do not perform any regularization at all. Indeed, the latter is simply not necessary: the network weights still have values which are close to their random and statistically independent initialization. Hence, overfitting is unlikely to occur at early training steps. In contrast, we should expect it to occur as training proceeds: by using (1), the regularizer is now weighted by

$$\theta_{\rm curriculum}(t)\,\big(1-\theta_{\rm curriculum}(t)\big) = (1-\overline{\theta})\,\big(1-\exp(-\gamma t)\big)\,\big((1-\overline{\theta})\exp(-\gamma t)+\overline{\theta}\big), \tag{3}$$

which is an increasing function of $t$. Therefore, the more gradient updates $t$ are performed, the heavier the effect of the regularization. This is the reason why overfitting is better tackled by the proposed curriculum. Although the overall idea of an adaptive selection of parameters is not novel for either regularization theory or the tuning of network hyper-parameters (e.g. the learning rate), to the best of our knowledge this is the first time that this concept of time-adaptive regularization is applied to deep neural networks.
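The growth of the regularization strength can be checked numerically. The sketch below assumes that the dropout-induced penalty is weighted by $\theta(1-\theta)$, as in the least-squares analysis of [25, §9.1], and that $\theta$ follows an exponential schedule with $\gamma=10/T$; the helper names are hypothetical.

```python
import math


def curriculum_theta(t, theta_bar, total_updates):
    """Assumed exponential schedule with gamma = 10 / T."""
    gamma = 10.0 / total_updates
    return (1.0 - theta_bar) * math.exp(-gamma * t) + theta_bar


def regularizer_weight(theta):
    """Assumed scaling theta * (1 - theta) of the dropout-induced penalty."""
    return theta * (1.0 - theta)


T, THETA_BAR = 10_000, 0.5
weights = [regularizer_weight(curriculum_theta(t, THETA_BAR, T))
           for t in range(0, T + 1, 1_000)]
print(weights)
```

Since $\theta(1-\theta)$ decreases in $\theta$ only on $[0.5,1]$, the monotone growth in $t$ relies on $\overline{\theta}\geq 0.5$, which agrees with the range $0.5\leq\overline{\theta}\leq 0.9$ used in the text.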

Compendium.

Let us conclude with some general comments. We posit that there is no overfitting at the beginning of the network training. Therefore, differently from the original Dropout, we allow for a scheduled retain probability $\theta(t)$ which gradually drops neurons out. Among other plausible curriculum functions as in Def. 1, the proposed choice (1) introduces no additional parameter to be tuned and implicitly provides a smarter weight initialization for dropout training.

The superiority of (1) also relates to i) the smoothly increasing number of suppressed units and ii) the soft adaptive regularization performed to counteract overfitting.

Throughout these interpretations, we can recognize a common idea of smoothly changing the difficulty of the training which is applied to the network. This fact can be better understood by finding the connections with Curriculum Learning, as we explain in the next section.

Curriculum Learning, Curriculum Dropout

For the sake of clarity, let us recall the concept of curriculum learning. Within a classical machine learning algorithm, all training examples are presented to the model in an unordered manner, frequently applying a random shuffling. Actually, this is very different from what happens in the human training process, that is, education. Indeed, the latter is highly structured, so that the level of difficulty of the concepts to learn is proportional to the age of the learner, with easier knowledge handled by babies and harder knowledge by adults. This “start small” paradigm will likely guide the learning process.

Following the same intuition, curriculum learning proposes to subdivide the training examples based on their difficulty. Then, the learning is configured so that easier examples come first, gradually increasing the difficulty and processing the hardest ones at the end of the training. This concept is formalized by introducing a learning time $\lambda\in[0,1]$, so that training begins at $\lambda=0$ and ends at $\lambda=1$. At time $\lambda$, $Q_{\lambda}(z)$ denotes the distribution from which a training example $z$ is drawn. The notion of curriculum learning is formalized by requiring that $Q_{\lambda}$ ensures a sampling of examples $z$ which are easier than the ones sampled from $Q_{\lambda+\varepsilon}$, $\varepsilon>0$. Mathematically, this is formalized by assuming

$$Q_{\lambda}(z) \propto W_{\lambda}(z)\,P(z). \tag{4}$$

In (4), $P(z)$ is the target training distribution, accounting for all examples, both easy and hard ones. The sampling from $P$ is corrected by the factor $0\leq W_{\lambda}(z)\leq 1$ for any $\lambda$ and $z$. The interpretation of $W_{\lambda}(z)$ is the measure of the difficulty of the training example $z$. The maximal complexity for a training example is fixed to $1$ and reached at the end of the training, i.e. $W_{1}(z)=1$, i.e. $Q_{1}(z)=P(z)$. The relationship

$$W_{\lambda}(z) \leq W_{\lambda+\varepsilon}(z) \tag{5}$$

represents the increased complexity of training examples from instant $\lambda$ to $\lambda+\varepsilon$. Moreover, the weights $W_{\lambda}(z)$ must be chosen in such a way that

$$H(Q_{\lambda}) < H(Q_{\lambda+\varepsilon}), \tag{6}$$

where Shannon’s entropy $H(Q_{\lambda})$ models the fact that the quantity of information exploited by the model during training increases with respect to $\lambda$.

In order to prove that our scheduled dropout fulfills this definition, for simplicity, we will consider it as applied to the input layer only. This is not restrictive since the same considerations apply to any intermediate layer, by considering that each layer trains the feature representation used as input by the subsequent one.

As the images exploited for training, consider the partitions in the dataset including all the (original) clean data and all the possible ways of corrupting them through the Bernoulli multiplicative noise (see Fig. 1). Let $\pi$ denote the probability of sampling an uncorrupted $d$-dimensional image within an image dataset (nothing more than a uniform distribution over the available training examples). Let us fix the gradient update $t$. The case of sampling a dropped-out $z$ is equivalent to sampling the corresponding uncorrupted image $z_{0}$ from $\pi$ and then overlapping it with a binary mask $b$ (of size $d$), where each entry of $b$ is zero with probability $1-\theta(t)$. By mapping $b$ to the number $i$ of its zeros,

$$\mathbb{P}[z] = {d\choose i}\,(1-\theta(t))^{i}\,\theta(t)^{d-i}\cdot\pi(z_{0}). \tag{7}$$

Indeed, $(1-\theta(t))^{i}\theta(t)^{d-i}$ is the probability of sampling one binary mask $b$ with $i$ zeros and ${d\choose i}$ accounts for all the possible combinations. Re-parameterizing the training time as $t=\lambda T$, we get

$$Q_{\lambda}(z) = {d\choose i}\,(1-\theta(\lambda T))^{i}\,\theta(\lambda T)^{d-i}\cdot\pi(z_{0}). \tag{8}$$

One can easily prove that the definition of curriculum learning is fulfilled by the choice (8) for the curriculum learning distribution $Q_{\lambda}(z)$.
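The entropy requirement of curriculum learning can be illustrated numerically. The sketch below is not a proof: it only tracks the entropy of the binary mask, under the assumptions that the $d$ mask entries are independent Bernoulli($\theta$) variables (so that the mask entropy is $d$ times the binary entropy of $\theta$), that $\theta$ follows an exponential schedule with $\gamma=10/T$, and that the contribution of $\pi$ does not depend on $\lambda$. All names are hypothetical.

```python
import math


def curriculum_theta(t, theta_bar, total_updates):
    """Assumed exponential schedule with gamma = 10 / T."""
    gamma = 10.0 / total_updates
    return (1.0 - theta_bar) * math.exp(-gamma * t) + theta_bar


def binary_entropy(p):
    """Shannon entropy (in bits) of a Bernoulli(p) variable."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)


def mask_entropy(theta, d):
    """Entropy of d independent Bernoulli(theta) mask entries."""
    return d * binary_entropy(theta)


T, THETA_BAR, D = 10_000, 0.5, 784  # D = 28 * 28 input pixels, as for MNIST
lambdas = [k / 10 for k in range(11)]
entropies = [mask_entropy(curriculum_theta(lam * T, THETA_BAR, T), D)
             for lam in lambdas]
print(entropies)
```

At $\lambda=0$ the mask is deterministic (zero entropy); as $\theta(\lambda T)$ decreases towards $\overline{\theta}\geq 0.5$, the entropy grows, in line with the increasing-information reading of the curriculum.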

To conclude, we give an additional interpretation of Curriculum Dropout. At $\lambda=0$, $\theta(0)=1$ and no entry of $z_{0}$ is set to zero. This clearly corresponds to the easiest available example, since the learning starts at $t=0$ by considering all the available visual information. When $\theta$ starts decreasing, e.g. to $\theta(\lambda T)\approx 0.99$, only 1% of the entries of $z_{0}$ are suppressed (on average) and almost all the information of the original dataset $\mathcal{Z}_{0}$ is still available for training the network. But, as $\lambda$ grows, $\theta(\lambda T)$ decreases and a larger number of entries are set to zero. This complicates the task, requiring a greater effort from the model to capitalize on the reduced uncorrupted information which is available at that stage of the training process.

After all, this connection between Dropout and Curriculum Learning was possible thanks to our generalization through Def. 1. Consequently, the original Dropout can be interpreted as considering the single specific value $\overline{\lambda}$ such that $\theta(\overline{\lambda}T)=\overline{\theta}$, where $\overline{\theta}$ is the constant retain probability of the original Dropout. This means that, as previously found for the adaptive regularization (see §3), the level of difficulty $W_{\overline{\lambda}}(z)$ of the training examples $z$ is fixed in the original Dropout. This carries the concrete risk of either oversimplifying or overcomplicating the learning, with detrimental effects on the model’s generalization capability. Hence, the proposed method allows us to set up a progressive curriculum $Q_{\lambda}(z)$, complicating the examples $z$ in a smooth and adaptive manner, as opposed to the original Dropout, where such complication is fixed to equal the maximal one from the very beginning (Fig. 1).

To conclude, let us note that the aforementioned work of Rennie et al. proposes a linear increase of the retain probability. According to equations (4-6), this implements what the curriculum learning literature calls an anti-curriculum: this is shown to perform slightly better or worse than the no-curriculum strategy and always worse than any curriculum implementation. Our experiments confirm this finding.

Experiments

In this section, we apply Curriculum Dropout to neural networks for image classification problems on different datasets, using Convolutional Neural Network (CNN) architectures and Multi-Layer Perceptrons (MLPs) (code available at https://github.com/pmorerio/curriculum-dropout). In particular, we used two different CNN architectures: LeNet and a deeper one (conv-maxpool-conv-maxpool-conv-maxpool-fc-fc-softmax), further called CNN-1 and CNN-2, respectively. In the following, we detail the datasets used and the network architectures adopted in each case.

MNIST - A dataset of grayscale images of handwritten digits (from 0 to 9), of resolution $28\times 28$. Training and test sets contain 60,000 and 10,000 images, respectively. For this dataset, we used a three-layer MLP, with 2,000 units in each hidden layer, and CNN-1.

Double MNIST - This is a static version of , generated by superimposing two random images of two digits (either distinct or equal), in order to generate $64\times 64$ images. The total number of images is 70,000, with 55 total classes (10 unique digit classes + ${10\choose 2}=45$ unordered pairs of digits). Training and test sets contain 60,000 and 10,000 images, respectively. Training set images were generated using MNIST training images, and test set images were generated using MNIST test images. We used CNN-2.
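The class count for Double MNIST can be verified by enumerating the unordered pairs of digits, repetitions included. The label encoding below (an index over sorted digit pairs) and the helper name `double_mnist_label` are assumptions for illustration; the text does not specify how the 55 classes are indexed.

```python
from itertools import combinations_with_replacement
from math import comb

DIGITS = range(10)

# Unordered pairs of digits, with equal digits allowed: 10 + C(10, 2) classes.
labels = list(combinations_with_replacement(DIGITS, 2))
label_index = {pair: k for k, pair in enumerate(labels)}


def double_mnist_label(digit_a, digit_b):
    """Hypothetical class index for an image showing digit_a and digit_b."""
    return label_index[tuple(sorted((digit_a, digit_b)))]


print(len(labels), 10 + comb(10, 2))
print(double_mnist_label(3, 7), double_mnist_label(7, 3))
```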

SVHN - Real-world RGB images of street view house numbers. We used the cropped $32\times 32$ images representing a single digit (from 0 to 9). We exploited a randomly selected subset of the dataset, consisting of 6,000 images for training and 1,000 images for testing. We used CNN-2 also in this case.

CIFAR-10 and CIFAR-100 - These datasets collect $32\times 32$ tiny RGB natural images, with 6,000 and 600 images for each of the 10 and 100 classes, respectively. In both datasets, training and test sets contain 50,000 and 10,000 images, respectively. We used CNN-1 for both datasets.

Caltech-101 - $300\times 200$ resolution RGB images of 101 classes. For each of them, a variable number of instances is available: from 30 to 800. To have a balanced dataset, we used 20 and 10 images per class for training and testing, respectively. Images were resized to $128\times 128$ pixels. We used CNN-2 again here.

Caltech-256 - 31,000 RGB images for 256 total classes. For each class, we used 50 and 20 images for training and testing, respectively. Images were resized to $128\times 128$ pixels. We used CNN-2.

For training CNN-1, CNN-2 and MLP, we used a cross-entropy cost function with the Adam optimizer and a momentum term of 0.95. We used mini-batches of 128 images and fixed the learning rate to $10^{-4}$.

We applied Curriculum Dropout using the function (1), where $\gamma$ is picked using the heuristic (2) and $\overline{\theta}$ is fixed as follows. For both CNN-1 and CNN-2, the retain probability for the input layer was set to $\overline{\theta}_{\rm input}=0.9$, selecting $\overline{\theta}_{\rm conv}=0.75$ and $\overline{\theta}_{\rm fc}=0.5$ for convolutional and fully connected layers, respectively. For the MLP, $\overline{\theta}_{\rm input}=0.8$ and $\overline{\theta}_{\rm hidden}=0.5$. In all cases, we adopted the recommended values [25, §A.4].
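The layer-wise configuration above can be sketched as follows. The limit values are the ones quoted for CNN-1 and CNN-2; the exponential schedule with $\gamma=10/T$, the dictionary layout and the helper names are assumptions for illustration and are not taken from the released code.

```python
import math
import random

# Limit retain probabilities quoted in the text for CNN-1 and CNN-2.
LIMIT_RETAIN = {"input": 0.9, "conv": 0.75, "fc": 0.5}


def curriculum_theta(t, theta_bar, total_updates):
    """Assumed exponential schedule with gamma = 10 / T."""
    gamma = 10.0 / total_updates
    return (1.0 - theta_bar) * math.exp(-gamma * t) + theta_bar


def retain_probs(t, total_updates, limits=LIMIT_RETAIN):
    """Retain probability of every layer type at gradient update t."""
    return {name: curriculum_theta(t, theta_bar, total_updates)
            for name, theta_bar in limits.items()}


def sample_mask(n_units, theta, rng):
    """Bernoulli(theta) mask: 1 keeps a unit, 0 suppresses it."""
    return [1 if rng.random() < theta else 0 for _ in range(n_units)]


rng = random.Random(0)
T = 10_000
for t in (0, 1_000, T):
    probs = retain_probs(t, T)
    print(t, probs, sample_mask(8, probs["fc"], rng))
```

Every layer starts from a retain probability of 1 and converges to its own limit value, so the ordering between layers (lower retain probability higher in the hierarchy) is preserved throughout training.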

Before reporting our results, let us emphasize that our aim is to improve the standard dropout framework, not to compete for state-of-the-art performance in image classification tasks. For this reason, we did not use engineering tricks such as data augmentation or any particular pre-processing, nor did we try more complex (or deeper) network architectures.

In Fig. 3, we qualitatively compared Curriculum Dropout (green) versus the original Dropout (blue), anti-Curriculum Dropout (red) and an unregularized, i.e. no Dropout, training of a network (black). Since CNN-1, CNN-2 and MLP are trained from scratch, in order to ensure a more robust experimental evaluation, we have repeated the weight optimization 10 times for all the cases. Hence, in Fig. 3, we report the mean accuracy value curves, representing with shadows the standard deviation errors.

Additionally, we report in Table 1 the percentage accuracy improvements of Dropout, anti-Curriculum Dropout and Curriculum Dropout (proposed) versus a baseline network where no neuron is suppressed. To do that, we selected the average of the 10 highest mean accuracies obtained by each paradigm during each trial; then we averaged them over the 10 runs. We accommodated the metric of to measure the boost in accuracy over . Also, we reproduced for two datasets the cases of fixed layer size $n$ or fixed $n\overline{\theta}$ as in [25, §7.3]. Here the network layers’ size $n$ is preliminarily increased by a factor $1/\overline{\theta}$, since on average only a fraction $\overline{\theta}$ of the units is retained. However, we notice that those bigger architectures tend to overfit the data.
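One possible reading of the aggregation described above is sketched below: for each run, the $k$ highest accuracies along the training curve are averaged, and the resulting per-run scores are averaged over the runs. The function names and the toy numbers are hypothetical, and the exact procedure used for Table 1 may differ.

```python
def top_k_mean(accuracies, k=10):
    """Mean of the k highest values of one accuracy curve."""
    best = sorted(accuracies, reverse=True)[:k]
    return sum(best) / len(best)


def aggregate_score(runs, k=10):
    """Average over runs of the per-run top-k mean accuracy."""
    per_run = [top_k_mean(curve, k) for curve in runs]
    return sum(per_run) / len(per_run)


# Toy example with 2 runs and k = 2 (made-up numbers, for illustration only).
toy_runs = [[0.1, 0.5, 0.9], [0.2, 0.4, 0.6]]
print(aggregate_score(toy_runs, k=2))
```

Averaging the top accuracies rather than taking the single best value makes the score less sensitive to isolated spikes in the test curve.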

Figure 4 shows the results obtained on the Double MNIST dataset by scheduling the dropout with a step function, i.e. no suppression is performed until a certain switch-epoch is reached (§3). Precisely, we switched at 10, 20 and 50 epochs. This curriculum is similar to the one induced by the polynomial functions of Figure 2: in fact, both curves have a similar shape and share the drawback of a threshold to be introduced. Yet, Switch-Curriculum shows an additional shortcoming: as highlighted by the spikes of both training and test accuracies, the sudden change in the network connections, induced by the sharp shift in the retain probabilities, makes the network lose some of the concepts learned up to that moment. While early switches are able to recover quickly to good performance, late ones are deleterious. Moreover, we were not able to find any heuristic rule for the switch-epoch, which would then be a parameter to be validated. This makes Switch-Curriculum a less powerful option compared to a smoothly-scheduled curriculum.

Discussion.

The proposed Curriculum Dropout, implemented through the scheduling function (1), improves the generalization performance of the original Dropout in almost all cases. As the only exception, on MNIST with the MLP, the scheduling is just equivalent to the original dropout framework. Our guess is that the simpler the learning task, the less effective Curriculum Learning is. After all, for a task which is relatively easy in itself, there is less need for “starting easy”. In any case, this is done at no additional cost or training time.

As expected, our scheduling improves on anti-Curriculum by a larger margin. Also, an anti-Curriculum strategy sometimes performs even worse than a non-regularized network (e.g., on Caltech-256). This is consistent with the anti-curriculum findings recalled in §4 and with our discussion there concerning Annealed Dropout, of which anti-Curriculum represents a generalization. In addition, while neither regular nor Curriculum Dropout ever needs early stopping, anti-Curriculum often does.

Conclusions and Future Work

In this paper we have proposed a scheduling for dropout training applied to deep neural networks. By softly increasing the number of units to be suppressed layerwise, we achieve an adaptive regularization and provide a better, smooth initialization for weight optimization. This allows us to implement a mathematically sound curriculum and justifies the proposed generalization of Dropout.

Through a broad experimental evaluation on 7 image classification tasks, the proposed Curriculum Dropout has proved to be more effective than both the original Dropout and Annealed Dropout, the latter being an example of anti-Curriculum and therefore achieving inferior performance compared to our more disciplined approach to easing dropout training. Globally, we outperform the original Dropout using various architectures, and we improve on the idea of Annealed Dropout by a margin.

We have tested Curriculum Dropout on image classification tasks only. However, our guess is that, like standard Dropout, our method is very general and thus applicable to different domains. As future work, we will apply our scheduling to other computer vision tasks, also extending it to the case of inter-neural connection inhibitions and Recurrent Neural Networks.

Acknowledgment

We gratefully acknowledge the support of NVIDIA Corporation with the donation of one Tesla K40 GPU used for part of this research.

References