Dropout Training as Adaptive Regularization

Stefan Wager, Sida Wang, Percy Liang

Introduction

Dropout training was introduced by Hinton et al. as a way to control overfitting by randomly omitting subsets of features at each iteration of a training procedure. Hinton et al. introduced dropout in the context of neural networks specifically, and also advocated omitting random hidden units during training. In this paper, we study feature dropout as a generic training method that can be applied to any learning algorithm. Although dropout has proved to be a very successful technique, the reasons for its success are not yet well understood at a theoretical level.

Dropout training falls into the broader category of learning methods that artificially corrupt training data to stabilize predictions. There is a well-known connection between artificial feature corruption and regularization. For example, Bishop showed that the effect of training with features that have been corrupted with additive Gaussian noise is equivalent to a form of $L_{2}$-type regularization in the low-noise limit. In this paper, we take a step towards understanding how dropout training works by analyzing it as a regularizer. We focus on generalized linear models (GLMs), a class of models for which feature dropout reduces to a form of adaptive model regularization.

Using this framework, we show that dropout training is first-order equivalent to $L_{2}$-regularization after transforming the input by $\operatorname{diag}(\hat{\mathcal{I}})^{-1/2}$, where $\hat{\mathcal{I}}$ is an estimate of the Fisher information matrix. This transformation effectively makes the level curves of the objective more spherical, and so balances out the regularization applied to different features. In the case of logistic regression, dropout can be interpreted as a form of adaptive $L_{2}$-regularization that favors rare but useful features.

The problem of learning with rare but useful features is discussed in the context of online learning by Duchi et al., who show that their AdaGrad adaptive descent procedure achieves better regret bounds than regular stochastic gradient descent (SGD) in this setting. Here, we show that AdaGrad and dropout training have an intimate connection: just as SGD progresses by repeatedly solving linearized $L_{2}$-regularized problems, a close relative of AdaGrad advances by solving linearized dropout-regularized problems.

Our formulation of dropout training as adaptive regularization also leads to a simple semi-supervised learning scheme, where we use unlabeled data to learn a better dropout regularizer. The approach is fully discriminative and does not require fitting a generative model. We apply this idea to several document classification problems, and find that it consistently improves the performance of dropout training. On the benchmark IMDB reviews dataset, dropout logistic regression with a regularizer tuned on unlabeled data outperforms the previous state of the art. In follow-up research, we extend the results from this paper to more complicated structured prediction problems, such as multi-class logistic regression and linear chain conditional random fields.

Artificial Feature Noising as Regularization

We begin by discussing the general connections between feature noising and regularization in generalized linear models (GLMs). We will apply the machinery developed here to dropout training in Section 4.

Additive Gaussian noise: $\nu(x_{i},\,\xi_{i})=x_{i}+\xi_{i}$, where $\xi_{i}\sim\mathcal{N}(0,\,\sigma^{2}I_{d\times d})$.
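To make the noising function $\nu$ concrete, the following is a minimal NumPy sketch of the two noising schemes studied in this paper. The additive version follows the definition above. The dropout version is an assumption on our part: it zeroes each coordinate independently with probability `delta` and rescales the survivors by $1/(1-\delta)$ so that the noised features stay unbiased. The function names are hypothetical.

```python
import numpy as np

def additive_gaussian_noise(x, sigma, rng):
    """Return x + xi with xi ~ N(0, sigma^2 I)."""
    return x + rng.normal(0.0, sigma, size=x.shape)

def dropout_noise(x, delta, rng):
    """Zero each coordinate with probability delta (assumed scheme) and
    rescale the survivors by 1 / (1 - delta) so that E[noised x] = x."""
    keep = rng.random(x.shape) >= delta
    return x * keep / (1.0 - delta)

rng = np.random.default_rng(0)
x = np.array([1.0, -2.0, 0.0, 3.0])

# Averaging many noised copies should recover x for both schemes.
draws_add = np.stack([additive_gaussian_noise(x, 0.5, rng) for _ in range(20000)])
draws_drop = np.stack([dropout_noise(x, 0.5, rng) for _ in range(20000)])
mean_add = draws_add.mean(axis=0)
mean_drop = draws_drop.mean(axis=0)
```

Both schemes leave the features unchanged in expectation; they differ only in how the noise variance depends on $x_{i}$, which is what drives the different regularizers derived later.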

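Integrating over the feature noise gives us a noised maximum likelihood parameter estimate:

$$\hat{\beta}=\operatorname{argmin}_{\beta\in\mathbb{R}^{d}}\sum_{i=1}^{n}\mathbb{E}_{\xi}\left[\ell_{\tilde{x}_{i},\,y_{i}}(\beta)\right],$$

where $\tilde{x}_{i}=\nu(x_{i},\,\xi_{i})$ is the noised feature vector, $\ell_{x,\,y}(\beta)$ is the negative log-likelihood of the GLM, and $\mathbb{E}_{\xi}$ is the expectation taken with respect to the artificial feature noise $\xi=(\xi_{1},\,\dots,\,\xi_{n})$. Similar expressions have been studied previously.

For GLMs, the noised empirical loss takes on a simpler form. Up to a term that does not depend on $\beta$, the GLM loss is $\ell_{x,\,y}(\beta)=-y\,x\cdot\beta+A(x\cdot\beta)$, so when the noise is unbiased, i.e., $\mathbb{E}_{\xi}[\tilde{x}_{i}]=x_{i}$, we get

$$\sum_{i=1}^{n}\mathbb{E}_{\xi}\left[\ell_{\tilde{x}_{i},\,y_{i}}(\beta)\right]=\sum_{i=1}^{n}\ell_{x_{i},\,y_{i}}(\beta)+R(\beta),\qquad R(\beta)=\sum_{i=1}^{n}\left(\mathbb{E}_{\xi}\left[A(\tilde{x}_{i}\cdot\beta)\right]-A(x_{i}\cdot\beta)\right).$$

Here, $R(\beta)$ acts as a regularizer that incorporates the effect of artificial feature noising. In GLMs, the log-partition function $A$ must always be convex, and so $R$ is always non-negative by Jensen’s inequality.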

The key observation here is that the effect of artificial feature noising reduces to a penalty $R(\beta)$ that does not depend on the labels $\{y_{i}\}$. Because of this, artificial feature noising penalizes the complexity of a classifier in a way that does not depend on the accuracy of a classifier. Thus, for GLMs, artificial feature noising is a regularization scheme on the model itself that can be compared with other forms of regularization such as ridge ($L_{2}$) or lasso ($L_{1}$) penalization. In Section 6, we exploit the label-independence of the noising penalty and use unlabeled data to tune our estimate of $R(\beta)$.
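The quadratic approximation ${R^{\text{q}}}$ used in the rest of the paper can be motivated by a second-order Taylor expansion of the log-partition function. The following is a sketch rather than a full derivation. It assumes that the noising penalty has the form $R(\beta)=\sum_{i}\mathbb{E}_{\xi}[A(\tilde{x}_{i}\cdot\beta)]-A(x_{i}\cdot\beta)$, that $\tilde{x}_{i}=\nu(x_{i},\,\xi_{i})$ denotes the noised features, and that the noise is unbiased, $\mathbb{E}_{\xi}[\tilde{x}_{i}]=x_{i}$.

```latex
\begin{align*}
\mathbb{E}_{\xi}\!\left[A(\tilde{x}_{i}\cdot\beta)\right]-A(x_{i}\cdot\beta)
  &\approx A'(x_{i}\cdot\beta)\,\mathbb{E}_{\xi}\!\left[(\tilde{x}_{i}-x_{i})\cdot\beta\right]
   +\tfrac{1}{2}A''(x_{i}\cdot\beta)\,\mathbb{E}_{\xi}\!\left[\big((\tilde{x}_{i}-x_{i})\cdot\beta\big)^{2}\right] \\
  &= \tfrac{1}{2}A''(x_{i}\cdot\beta)\,\operatorname{Var}_{\xi}\!\left[\tilde{x}_{i}\cdot\beta\right], \\
R^{\text{q}}(\beta)
  &:= \frac{1}{2}\sum_{i=1}^{n}A''(x_{i}\cdot\beta)\,\operatorname{Var}_{\xi}\!\left[\tilde{x}_{i}\cdot\beta\right].
\end{align*}
```

The first-order term vanishes because the noise is unbiased, so only the curvature $A''$ and the variance of the noised score remain.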

The quadratic approximation also appears to hold up on real datasets. In Figure 1(b), we compare the evolution during training of both $R$ and ${R^{\text{q}}}$ on the 20 newsgroups alt.atheism vs soc.religion.christian classification task. We see that the quadratic approximation is accurate most of the way through the learning procedure, only deteriorating slightly as the model converges to highly confident predictions.

In practice, we have found that fitting logistic regression with the quadratic surrogate ${R^{\text{q}}}$ gives similar results to actual dropout-regularized logistic regression. We use this technique for our experiments in Section 6.

Regularization based on Additive Noise

Having established the general quadratic noising regularizer ${R^{\text{q}}}$, we now turn to studying the effects of ${R^{\text{q}}}$ for various likelihoods (linear and logistic regression) and noising models (additive and dropout). In this section, we warm up with additive noise; in Section 4 we turn to our main target of interest, namely dropout noise.

Thus, we recover the well-known result that linear regression with additive feature noising is equivalent to ridge regression. Note that, with linear regression, the quadratic approximation ${R^{\text{q}}}$ is exact and so the correspondence with $L_{2}$-regularization is also exact.
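The ridge correspondence can be checked numerically. The sketch below assumes the Gaussian GLM log-partition function $A(z)=z^{2}/2$ and the noising penalty $R(\beta)=\sum_{i}\mathbb{E}_{\xi}[A(\tilde{x}_{i}\cdot\beta)]-A(x_{i}\cdot\beta)$, and compares a Monte Carlo estimate of $R$ under additive Gaussian noise with the ridge penalty $\tfrac{1}{2}\sigma^{2}n\lVert\beta\rVert_{2}^{2}$. The sizes and seed are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, sigma = 5, 3, 0.7
X = rng.normal(size=(n, d))
beta = rng.normal(size=d)

def A(z):
    # Log-partition function of the Gaussian (linear regression) GLM.
    return 0.5 * z ** 2

# Monte Carlo estimate of R(beta) = sum_i E[A(x~_i . beta)] - A(x_i . beta).
num_draws = 200000
noise = rng.normal(0.0, sigma, size=(num_draws, n, d))
noised_scores = (X + noise) @ beta            # shape (num_draws, n)
R_mc = (A(noised_scores).mean(axis=0) - A(X @ beta)).sum()

# Ridge form: (1/2) * sigma^2 * n * ||beta||^2.
R_ridge = 0.5 * sigma ** 2 * n * (beta @ beta)
```

The two quantities agree up to Monte Carlo error, which is what one expects when the quadratic approximation is exact.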

Logistic regression

The situation gets more interesting when we move beyond linear regression. For logistic regression, $A^{\prime\prime}(x_{i}\cdot\beta)=p_{i}(1-p_{i})$, where $p_{i}=(1+\exp(-x_{i}\cdot\beta))^{-1}$ is the predicted probability of $y_{i}=1$. The quadratic noising penalty is then

$$R^{\text{q}}(\beta)=\frac{1}{2}\,\sigma^{2}\,\lVert\beta\rVert_{2}^{2}\sum_{i=1}^{n}p_{i}(1-p_{i}).$$

In other words, the noising penalty now simultaneously encourages parsimonious modeling as before (by encouraging $\lVert\beta\rVert_{2}^{2}$ to be small) as well as confident predictions (by encouraging the $p_{i}$’s to move away from $\frac{1}{2}$).
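As a small illustration of the two effects, the sketch below evaluates the additive-noise penalty for logistic regression, assuming it takes the product form $\tfrac{1}{2}\sigma^{2}\lVert\beta\rVert_{2}^{2}\sum_{i}p_{i}(1-p_{i})$ described in the text. The helper names and the random design are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def total_curvature(X, beta):
    """Sum of p_i (1 - p_i); small when predictions are confident."""
    p = sigmoid(X @ beta)
    return np.sum(p * (1.0 - p))

def additive_noise_penalty_logistic(X, beta, sigma):
    """Assumed form: 0.5 * sigma^2 * ||beta||^2 * sum_i p_i (1 - p_i)."""
    return 0.5 * sigma ** 2 * (beta @ beta) * total_curvature(X, beta)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
beta = np.array([1.0, -0.5, 0.25])

curvature_base = total_curvature(X, beta)
# Scaling beta pushes every p_i away from 1/2, so the curvature factor shrinks.
curvature_confident = total_curvature(X, 4.0 * beta)
penalty_base = additive_noise_penalty_logistic(X, beta, sigma=0.5)
```

The curvature factor falls as predictions become more confident, while the $\lVert\beta\rVert_{2}^{2}$ factor grows; the penalty trades these two off.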

Regularization based on Dropout Noise
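The quadratic dropout penalty discussed in this section can be sketched as follows. We assume that dropout zeroes each coordinate independently with probability $\delta$ and rescales the survivors by $(1-\delta)^{-1}$, that $R^{\text{q}}(\beta)=\frac{1}{2}\sum_{i}A''(x_{i}\cdot\beta)\operatorname{Var}_{\xi}[\tilde{x}_{i}\cdot\beta]$, and that $V(\beta)$ denotes the diagonal matrix with entries $A''(x_{i}\cdot\beta)$.

```latex
\begin{align*}
\tilde{x}_{ij} &= x_{ij}\,\xi_{ij}, \qquad
  \xi_{ij}\in\{0,\,(1-\delta)^{-1}\}, \quad \mathbb{P}[\xi_{ij}=0]=\delta, \\
\operatorname{Var}_{\xi}\!\left[\tilde{x}_{i}\cdot\beta\right]
  &= \sum_{j=1}^{d}x_{ij}^{2}\,\beta_{j}^{2}\,\operatorname{Var}[\xi_{ij}]
   = \frac{\delta}{1-\delta}\sum_{j=1}^{d}x_{ij}^{2}\,\beta_{j}^{2}, \\
R^{\text{q}}(\beta)
  &= \frac{1}{2}\,\frac{\delta}{1-\delta}\sum_{i=1}^{n}A''(x_{i}\cdot\beta)\sum_{j=1}^{d}x_{ij}^{2}\,\beta_{j}^{2}
   = \frac{1}{2}\,\frac{\delta}{1-\delta}\,\beta^{\top}\operatorname{diag}\!\left(X^{\top}V(\beta)\,X\right)\beta .
\end{align*}
```

The variance step uses $\mathbb{E}[\xi_{ij}^{2}]=(1-\delta)^{-1}$ and $\mathbb{E}[\xi_{ij}]=1$, which gives $\operatorname{Var}[\xi_{ij}]=\delta/(1-\delta)$.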

For linear regression, the curvature $A^{\prime\prime}$ is constant, so $V$ is the identity matrix and the dropout objective is equivalent to a form of ridge regression where each column of the design matrix is normalized before applying the $L_{2}$ penalty. (Normalizing the columns of the design matrix before performing penalized regression is standard practice, and is implemented by default in software like glmnet for R.) This connection has been noted previously.

Logistic Regression

The form of dropout penalties becomes much more intriguing once we move beyond the realm of linear regression. The case of logistic regression is particularly interesting. Here, we can write the quadratic dropout penalty from (10) as

$$R^{\text{q}}(\beta)=\frac{1}{2}\,\frac{\delta}{1-\delta}\sum_{i=1}^{n}\sum_{j=1}^{d}p_{i}(1-p_{i})\,x_{ij}^{2}\,\beta_{j}^{2},$$

where $\delta$ is the dropout probability.

Thus, just like additive noising, dropout generally gives an advantage to confident predictions and small $\beta$. However, unlike all the other methods considered so far, dropout may allow for some large $p_{i}(1-p_{i})$ and some large $\beta_{j}^{2}$, provided that the corresponding cross-term $x_{ij}^{2}$ is small.

Our analysis shows that dropout regularization should be better than $L_{2}$-regularization for learning weights for features that are rare (i.e., often 0) but highly discriminative, because dropout effectively does not penalize $\beta_{j}$ over observations for which $x_{ij}=0$. Thus, in order for a feature to earn a large $\beta_{j}^{2}$, it suffices for it to contribute to a confident prediction with small $p_{i}(1-p_{i})$ each time that it is active. (To be precise, dropout does not reward all rare but discriminative features. Rather, dropout rewards those features that are rare and positively co-adapted with other features in a way that enables the model to make confident predictions whenever the feature of interest is active.) Dropout training has been empirically found to perform well on tasks such as document classification where rare but discriminative features are prevalent. Our result suggests that this is no mere coincidence.
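The effect on rare features can be seen on a toy example. The sketch below assumes the quadratic dropout penalty for logistic regression has the form $\frac{1}{2}\frac{\delta}{1-\delta}\sum_{i,j}p_{i}(1-p_{i})\,x_{ij}^{2}\,\beta_{j}^{2}$ and splits it into per-feature terms. The design matrix, weights, and function name are made up for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def dropout_penalty_per_feature(X, beta, delta):
    """Per-feature terms of the (assumed) quadratic dropout penalty.

    Entry j equals 0.5 * delta/(1-delta) * beta_j^2 * sum_i p_i(1-p_i) x_ij^2.
    """
    p = sigmoid(X @ beta)
    curvature = p * (1.0 - p)               # A''(x_i . beta)
    weights = curvature @ (X ** 2)          # sum_i p_i(1-p_i) x_ij^2, one per feature
    return 0.5 * delta / (1.0 - delta) * weights * beta ** 2

# Feature 0 is rare (active in one of four rows); feature 1 is always on.
X = np.array([[3.0, 1.0],
              [0.0, 1.0],
              [0.0, 1.0],
              [0.0, 1.0]])
beta = np.array([2.0, 0.1])

per_feature = dropout_penalty_per_feature(X, beta, delta=0.5)
l2_per_feature = 0.5 * beta ** 2
# Effective multiplier that dropout places on each beta_j^2, relative to plain L2.
relative_weight = per_feature / l2_per_feature
```

In this example the rare feature is active only on a row where the prediction is confident, so its effective multiplier is far smaller than that of the always-on feature.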

We summarize the relationship between $L_{2}$-penalization, additive noising and dropout in Table 2. Additive noising introduces a product-form penalty depending on both $\beta$ and $A^{\prime\prime}$. However, the full potential of artificial feature noising only emerges with dropout, which allows the penalty terms due to $\beta$ and $A^{\prime\prime}$ to interact in a non-trivial way through the design matrix $X$ (except for linear regression, in which all the noising schemes we consider collapse to ridge regression).

A Simulation Example

The above discussion suggests that dropout logistic regression should perform well with rare but useful features. To test this intuition empirically, we designed a simulation study where all the signal is grouped in 50 rare features, each of which is active only 4% of the time. We then added 1000 nuisance features that are always active to the design matrix, for a total of $d=1050$ features. To make sure that our experiment was picking up the effect of dropout training specifically and not just normalization of $X$, we ensured that the columns of $X$ were normalized in expectation.

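The dropout penalty for logistic regression can be written as a matrix product

$$R^{\text{q}}(\beta)=\frac{1}{2}\,\frac{\delta}{1-\delta}\begin{pmatrix}\cdots & p_{i}(1-p_{i}) & \cdots\end{pmatrix}\begin{pmatrix}\cdots & \cdots & \cdots\\ \cdots & x_{ij}^{2} & \cdots\\ \cdots & \cdots & \cdots\end{pmatrix}\begin{pmatrix}\vdots\\ \beta_{j}^{2}\\ \vdots\end{pmatrix}.$$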

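We designed the simulation study in such a way that, at the optimal $\beta$, the dropout penalty should have structure

$$\begin{pmatrix}\text{small} & \text{big}\end{pmatrix}\begin{pmatrix}\cdots & \cdots\\ 0 & \cdots\end{pmatrix}\begin{pmatrix}\text{big}\\ \text{small}\end{pmatrix},$$

where the row vector holds $p_{i}(1-p_{i})$ for confident and uncertain predictions respectively, the column vector holds $\beta_{j}^{2}$ for useful and nuisance features respectively, and the zero block collects the entries $x_{ij}^{2}$ of the useful features on the examples with uncertain predictions.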

A dropout penalty with such a structure should be small. Although there are some uncertain predictions with large $p_{i}(1-p_{i})$ and some big weights $\beta_{j}^{2}$, these terms cannot interact because the corresponding terms $x_{ij}^{2}$ are all 0 (these are examples without any of the rare discriminative features and thus have no signal). Meanwhile, $L_{2}$ penalization has no natural way of penalizing some $\beta_{j}$ more and others less. Our simulation results, given in Table 3, confirm that dropout training outperforms $L_{2}$-regularization here as expected. See Appendix A.1 for details.

Dropout Regularization in Online Learning

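Stochastic gradient descent (SGD) updates the weights after each example via $\hat{\beta}_{t+1}=\hat{\beta}_{t}-\eta\,g_{t}$, where $g_{t}$ is the gradient of the loss $\ell_{x_{t},\,y_{t}}$ on the $t$-th example at $\hat{\beta}_{t}$ and $\eta$ is the learning rate. This update is the solution to

$$\hat{\beta}_{t+1}=\operatorname{argmin}_{\beta}\left\{\ell_{x_{t},\,y_{t}}(\hat{\beta}_{t})+g_{t}\cdot(\beta-\hat{\beta}_{t})+\frac{1}{2\eta}\lVert\beta-\hat{\beta}_{t}\rVert_{2}^{2}\right\},$$

where the first two terms form a linear approximation to the loss and the third term is an $L_{2}$-regularizer. Thus, SGD progresses by repeatedly solving linearized $L_{2}$-regularized problems.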
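The claim that an SGD step solves a linearized $L_{2}$-regularized problem is easy to check numerically. The sketch below builds the objective $g_{t}\cdot(\beta-\hat{\beta}_{t})+\lVert\beta-\hat{\beta}_{t}\rVert_{2}^{2}/2\eta$ (dropping the constant loss term) and verifies that random perturbations of $\hat{\beta}_{t}-\eta\,g_{t}$ never decrease it. The gradient and starting point are random placeholders, not values from any experiment.

```python
import numpy as np

def linearized_l2_objective(beta, beta_hat, grad, eta):
    """Linearized loss (constant term dropped) plus the L2 proximity penalty."""
    step = beta - beta_hat
    return grad @ step + step @ step / (2.0 * eta)

rng = np.random.default_rng(0)
beta_hat = rng.normal(size=4)
grad = rng.normal(size=4)
eta = 0.1

# Candidate minimizer: the plain SGD step.
sgd_step = beta_hat - eta * grad
best = linearized_l2_objective(sgd_step, beta_hat, grad, eta)

# Perturbing the SGD step in random directions should only increase the objective.
perturbed = [
    linearized_l2_objective(sgd_step + 0.01 * rng.normal(size=4), beta_hat, grad, eta)
    for _ in range(100)
]
```

Because the objective is a strictly convex quadratic, setting its gradient $g_{t}+(\beta-\hat{\beta}_{t})/\eta$ to zero gives the SGD step as the unique minimizer.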

As discussed by Duchi et al., a problem with classic SGD is that it can be slow at learning weights corresponding to rare but highly discriminative features. This problem can be alleviated by running a modified form of SGD with $\hat{\beta}_{t+1}=\hat{\beta}_{t}-\eta\,A_{t}^{-1}g_{t}$, where the transformation $A_{t}$ is also learned online; this leads to the AdaGrad family of stochastic descent rules. Duchi et al. use $A_{t}=\operatorname{diag}(G_{t})^{1/2}$, where $G_{t}=\sum_{i=1}^{t}g_{i}g_{i}^{\top}$, and show that this choice achieves desirable regret bounds in the presence of rare but useful features. At least superficially, AdaGrad and dropout seem to have similar goals: for logistic regression, they can both be understood as adaptive alternatives to methods based on $L_{2}$-regularization that favor learning rare, useful features. As it turns out, they have a deeper connection.
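A minimal sketch of the diagonal AdaGrad update with $A_{t}=\operatorname{diag}(G_{t})^{1/2}$ is given below. The small constant `eps` is added here for numerical stability and is not part of the update as stated above. The gradient sequence is a toy example in which one coordinate receives a gradient at every step and the other only once.

```python
import numpy as np

def adagrad_step(beta, grad, sum_sq_grads, eta, eps=1e-8):
    """One diagonal AdaGrad update.

    sum_sq_grads holds diag(G_t), the running sum of squared gradients.
    eps is a stability constant (an assumption of this sketch).
    """
    sum_sq_grads = sum_sq_grads + grad ** 2
    new_beta = beta - eta * grad / (np.sqrt(sum_sq_grads) + eps)
    return new_beta, sum_sq_grads

beta = np.zeros(2)
sum_sq = np.zeros(2)
# Coordinate 0 sees a gradient at every step; coordinate 1 only at the last step.
grads = [np.array([1.0, 0.0])] * 9 + [np.array([1.0, 1.0])]
for g in grads:
    beta, sum_sq = adagrad_step(beta, g, sum_sq, eta=0.1)

# Step sizes taken on the final iteration by the frequent and the rare coordinate.
last_step_common = 0.1 / np.sqrt(10.0)
last_step_rare = 0.1 / np.sqrt(1.0)
```

On the final iteration the rarely active coordinate takes a step roughly three times larger than the frequently active one, which is the mechanism that helps AdaGrad with rare features.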

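The natural way to incorporate dropout regularization into SGD is to replace the penalty term $\lVert\beta-\hat{\beta}_{t}\rVert_{2}^{2}/2\eta$ in (15) with the dropout regularizer, giving us an update rule

$$\hat{\beta}_{t+1}=\operatorname{argmin}_{\beta}\left\{\ell_{x_{t},\,y_{t}}(\hat{\beta}_{t})+g_{t}\cdot(\beta-\hat{\beta}_{t})+{R^{\text{q}}}(\beta-\hat{\beta}_{t};\,\hat{\beta}_{t})\right\},$$

where ${R^{\text{q}}}(\cdot;\,\hat{\beta}_{t})$ is the quadratic noising regularizer centered at $\hat{\beta}_{t}$:

$${R^{\text{q}}}(\beta-\hat{\beta}_{t};\,\hat{\beta}_{t})=\frac{1}{2}\,(\beta-\hat{\beta}_{t})^{\top}\operatorname{diag}(H_{t})\,(\beta-\hat{\beta}_{t}),\qquad H_{t}=\sum_{i=1}^{t}\nabla^{2}\ell_{x_{i},\,y_{i}}(\hat{\beta}_{t}).$$

(This expression is equivalent to (11) except that we used $\hat{\beta}_{t}$ and not $\beta-\hat{\beta}_{t}$ to compute $H_{t}$.)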

This implies that dropout descent is first-order equivalent to an adaptive SGD procedure with $A_{t}=\operatorname{diag}(H_{t})$. To see the connection between AdaGrad and this dropout-based online procedure, recall that for GLMs both of the expressions

$$\mathbb{E}_{\beta^{*}}\left[\nabla^{2}\ell_{x,\,y}(\beta^{*})\right]\quad\text{and}\quad\mathbb{E}_{\beta^{*}}\left[\nabla\ell_{x,\,y}(\beta^{*})\,\nabla\ell_{x,\,y}(\beta^{*})^{\top}\right]$$

are equal to the Fisher information $\mathcal{I}$. In other words, as $\hat{\beta}_{t}$ converges to $\beta^{*}$, $G_{t}$ and $H_{t}$ are both consistent estimates of the Fisher information. Thus, by using dropout instead of $L_{2}$-regularization to solve linearized problems in online learning, we end up with an AdaGrad-like algorithm.
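For logistic regression, the dropout-based update can be sketched in a few lines. The code below is an illustration under assumptions, not the procedure used in our experiments: it takes $H_{t}$ to be the sum of the logistic-loss Hessians of the examples seen so far, evaluated at the current weights, and preconditions the gradient by $\operatorname{diag}(H_{t})^{-1}$. The constant `eps` and the function name are hypothetical additions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def dropout_descent_step(beta, X_seen, x_t, y_t, eta, eps=1e-8):
    """One diag(H_t)-preconditioned step for logistic regression.

    X_seen holds the feature vectors x_1, ..., x_t observed so far (as rows).
    """
    p_seen = sigmoid(X_seen @ beta)
    # diag(H_t): diagonal of the summed logistic-loss Hessians at the current beta.
    diag_H = (p_seen * (1.0 - p_seen)) @ (X_seen ** 2)
    # Gradient of the logistic loss on the current example.
    grad = (sigmoid(x_t @ beta) - y_t) * x_t
    return beta - eta * grad / (diag_H + eps)

# Feature 0 is rare (first seen on the current example); feature 1 is always on.
X_seen = np.array([[0.0, 1.0],
                   [0.0, 1.0],
                   [0.0, 1.0],
                   [1.0, 1.0]])
beta = np.zeros(2)
new_beta = dropout_descent_step(beta, X_seen, X_seen[-1], 1.0, eta=0.1)
```

Starting from zero weights, the rare feature moves by 0.2 and the always-on feature by 0.05, so the preconditioner gives the rare feature a step four times larger.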

Of course, the connection between AdaGrad and dropout is not perfect. In particular, AdaGrad allows for a more aggressive learning rate by using $A_{t}^{-1}=\operatorname{diag}(G_{t})^{-1/2}$ instead of $\operatorname{diag}(G_{t})^{-1}$. But, at a high level, AdaGrad and dropout appear to both be aiming for the same goal: scaling the features by the Fisher information to make the level curves of the objective more circular. In contrast, $L_{2}$-regularization makes no attempt to sphere the level curves, and AROW (another popular adaptive method for online learning) only attempts to normalize the effective feature matrix but does not consider the sensitivity of the loss to changes in the model weights. In the case of logistic regression, AROW also favors learning rare features, but unlike dropout and AdaGrad does not privilege confident predictions.

Semi-Supervised Dropout Training

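Recall that the regularizer $R(\beta)$ in (5) is independent of the labels $\{y_{i}\}$. As a result, we can use additional unlabeled training examples to estimate it more accurately. Suppose we have an unlabeled dataset $\{z_{i}\}$ of size $m$, and let $\alpha\in(0,1]$ be a discount factor for the unlabeled data. Then we can define a semi-supervised penalty estimate

$$R_{*}(\beta)=\frac{n}{n+\alpha m}\Big(R(\beta)+\alpha\,R_{\text{Unlabeled}}(\beta)\Big),$$

where $R(\beta)$ is the original penalty estimate on the $n$ labeled examples and $R_{\text{Unlabeled}}(\beta)=\sum_{i=1}^{m}\mathbb{E}_{\xi}\left[A(\tilde{z}_{i}\cdot\beta)\right]-A(z_{i}\cdot\beta)$ is the same penalty computed over the unlabeled examples, with $\tilde{z}_{i}$ the noised version of $z_{i}$.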
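The semi-supervised penalty is simple to compute once a per-dataset penalty is available. The sketch below uses the quadratic dropout surrogate for logistic regression in place of the exact penalty and assumes the weighting $\frac{n}{n+\alpha m}\big(R+\alpha R_{\text{Unlabeled}}\big)$. The function names and the toy data are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def quadratic_dropout_penalty(X, beta, delta):
    """Assumed surrogate: 0.5 * delta/(1-delta) * sum_ij p_i(1-p_i) x_ij^2 beta_j^2."""
    p = sigmoid(X @ beta)
    per_feature = (p * (1.0 - p)) @ (X ** 2) * beta ** 2
    return 0.5 * delta / (1.0 - delta) * np.sum(per_feature)

def semi_supervised_penalty(X_labeled, Z_unlabeled, beta, delta, alpha):
    """Blend labeled and unlabeled penalties with discount factor alpha."""
    n, m = X_labeled.shape[0], Z_unlabeled.shape[0]
    r_labeled = quadratic_dropout_penalty(X_labeled, beta, delta)
    r_unlabeled = quadratic_dropout_penalty(Z_unlabeled, beta, delta)
    return n / (n + alpha * m) * (r_labeled + alpha * r_unlabeled)

rng = np.random.default_rng(0)
X_lab = rng.normal(size=(20, 5))
Z_unl = rng.normal(size=(200, 5))
beta = rng.normal(size=5)

r_star = semi_supervised_penalty(X_lab, Z_unl, beta, delta=0.5, alpha=0.3)
# Sanity check: if the unlabeled set equals the labeled set, nothing changes.
r_same = semi_supervised_penalty(X_lab, X_lab, beta, delta=0.5, alpha=0.3)
r_lab = quadratic_dropout_penalty(X_lab, beta, delta=0.5)
```

The factor $n/(n+\alpha m)$ keeps the penalty on the same scale as the labeled-only estimate, so adding unlabeled data changes its shape rather than its overall strength.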

Most approaches to semi-supervised learning rely either on a generative model or on various assumptions about the relationship between the predictor and the marginal distribution over inputs. Our semi-supervised approach is based on a different intuition: we would like to set weights to make confident predictions on unlabeled data as well as on the labeled data, an intuition shared by entropy regularization and transductive SVMs.

We apply this semi-supervised technique to text classification. Results on several datasets are shown in Table 4(a); Figure 2 illustrates how the use of unlabeled data improves the performance of our classifier on a single dataset. Overall, we see that using unlabeled data to learn a better regularizer $R_{*}(\beta)$ consistently improves the performance of dropout training.

Table 9 shows our results on the IMDB dataset. The dataset contains 50,000 unlabeled examples in addition to the labeled train and test sets of size 25,000 each. Whereas the train and test examples are either positive or negative, the unlabeled examples contain neutral reviews as well. We train a dropout-regularized logistic regression classifier on unigram/bigram features, and use the unlabeled data to tune our regularizer. Our method benefits from unlabeled data even in the presence of a large amount of labeled data, and achieves state-of-the-art accuracy on this dataset.

Conclusion

We analyzed dropout training as a form of adaptive regularization. This framework enabled us to uncover close connections between dropout training, adaptively balanced $L_{2}$-regularization, and AdaGrad, and led to a simple yet effective method for semi-supervised training. There seem to be multiple opportunities for digging deeper into the connection between dropout training and adaptive regularization. In particular, it would be interesting to see whether the dropout regularizer takes on a tractable and/or interpretable form in neural networks, and whether similar semi-supervised schemes could be used to improve on previously reported results.

Appendix A

Section 4.1 gives the motivation for and a high-level description of our simulation study. Here, we give a detailed description of the study.

Our simulation has 1050 features. The first 50 discriminative features form 5 groups of 10; the last 1000 features are nuisance terms. Each $x_{i}$ was independently generated as follows:

Pick a group number $g\in\{1,\,\dots,\,25\}$ and a sign $sgn=\pm 1$.

Draw the last 1000 entries of $x_{i}$ independently from $\mathcal{N}(0,1)$ .

Notice that this procedure guarantees that the columns of $X$ all have the same expected second moments.

Generating labels.

Given an $x_{i}$ , we generate $y_{i}$ from the Bernoulli distribution with parameter $\sigma(x_{i}\cdot\beta)$ , where the first 50 coordinates of $\beta$ are 0.057 and the remaining 1000 coordinates are 0. The value 0.057 was selected to make the average value of $|x_{i}\cdot\beta|$ in the presence of signal be 2.

Training.

For each simulation run, we generated a training set of size $n=75$. For this purpose, we cycled over the group number $g$ deterministically. The penalization parameters were set to roughly optimal values. For dropout, we used $\delta=0.9$, while for $L_{2}$-penalization we used $\lambda=32$.
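The data-generating procedure can be sketched as follows. Several details are assumptions made for this illustration and are flagged in the comments: the active group's entries are drawn from a signed exponential distribution, the scale is chosen so that every column has expected second moment 1, and the sign is drawn uniformly at random. The function names and default arguments are hypothetical.

```python
import numpy as np

def generate_features(n, rng, n_groups=5, group_size=10, n_choices=25, n_nuisance=1000):
    """Generate n rows with 50 rare signal features and 1000 nuisance features."""
    n_signal = n_groups * group_size
    # Each signal feature is active with probability 1/25 = 4%.
    active_prob = 1.0 / n_choices
    # Assumed scale: an Exp(scale) draw has second moment 2 * scale^2, so this
    # choice gives every signal column an expected second moment of 1.
    scale = np.sqrt(1.0 / (2.0 * active_prob))
    X = np.zeros((n, n_signal + n_nuisance))
    for i in range(n):
        g = i % n_choices                       # cycle over group numbers deterministically
        sgn = rng.choice([-1.0, 1.0])           # assumed: sign drawn uniformly
        if g < n_groups:                        # only 5 of the 25 choices carry signal
            cols = slice(group_size * g, group_size * (g + 1))
            X[i, cols] = sgn * rng.exponential(scale, size=group_size)
        X[i, n_signal:] = rng.normal(size=n_nuisance)
    return X

def generate_labels(X, rng, signal_weight=0.057, n_signal=50):
    """Draw y_i ~ Bernoulli(sigmoid(x_i . beta)) with beta nonzero on signal features."""
    beta = np.zeros(X.shape[1])
    beta[:n_signal] = signal_weight
    p = 1.0 / (1.0 + np.exp(-(X @ beta)))
    return rng.binomial(1, p), beta

rng = np.random.default_rng(0)
X = generate_features(75, rng)
y, beta = generate_labels(X, rng)
```

Under these assumptions an active group has ten entries with mean magnitude of about 3.54, so $|x_{i}\cdot\beta|$ averages roughly $10\times 3.54\times 0.057\approx 2$ when signal is present, in line with the value quoted above.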