Regularization for Deep Learning: A Taxonomy

Jan Kukačka, Vladimir Golkov, Daniel Cremers

Introduction

Regularization is one of the key elements of machine learning, particularly of deep learning (Goodfellow et al., 2016), allowing models to generalize well to unseen data even when training on a finite training set or with an imperfect optimization procedure. In the traditional sense of optimization and also in older neural networks literature, the term “regularization” is reserved solely for a penalty term in the loss function (Bishop, 1995a). Recently, the term has adopted a broader meaning: Goodfellow et al. (2016, Chap. 5) loosely define it as “any modification we make to a learning algorithm that is intended to reduce its test error but not its training error”. We find this definition slightly restrictive, since many techniques considered as regularization do reduce the training error (e.g. weight decay in AlexNet (Krizhevsky et al., 2012)). We therefore present our working definition of regularization.

Regularization is any supplementary technique that aims at making the model generalize better, i.e. produce better results on the test set.

This can include various properties of the loss function, the loss optimization algorithm, or other techniques. Note that this definition is more in line with machine learning literature than with inverse problems literature, the latter using a more restrictive definition.

Before we proceed to the presentation of our taxonomy, we revisit some basic machine learning theory in Section 2. This will provide a justification of the top level of the taxonomy. In Sections 3–7, we continue with a finer division of the individual classes of the regularization techniques, followed by our practical recommendations in Section 8. We are aware that the many research works discussed in this taxonomy cannot be summarized in a single sentence. For the sake of structuring the multitude of papers, we decided to merely describe a certain subset of their properties according to the focus of our taxonomy.

Theoretical framework

The central task of our interest is model fitting: finding a function $f$ that approximates well a desired mapping from inputs $x$ to desired outputs $f(x)$. A given input $x$ can have an associated target $t$ which dictates the desired output $f(x)$ directly (or in some applications indirectly (Ulyanov et al., 2016; Johnson et al., 2016)). A typical example of having available targets $t$ is supervised learning. Data samples $(x,t)$ then follow a ground truth probability distribution $P$.

Usually the loss function takes the form of expected risk:

$$\mathcal{L}=\mathbb{E}_{(x,t)\sim P}\Big[E\big(f_{w}(x),t\big)+R(\ldots)\Big],\tag{2}$$

where $f_{w}$ denotes the model $f$ with trainable weights $w$, and where we identify two parts, an error function $E$ and a regularization term $R$. The error function depends on the targets and assigns a penalty to model predictions according to their consistency with the targets. The regularization term assigns a penalty to the model based on other criteria. It may depend on anything except the targets, for example on the weights (see Section 6).

The expected risk cannot be minimized directly since the data distribution $P$ is unknown. Instead, a training set $\mathcal{D}$ sampled from the distribution is given. The minimization of the expected risk can then be approximated by minimizing the empirical risk $\hat{\mathcal{L}}$:

$$\hat{\mathcal{L}}=\frac{1}{|\mathcal{D}|}\sum_{(x_{i},t_{i})\in\mathcal{D}}E\big(f_{w}(x_{i}),t_{i}\big)+R(\ldots),\tag{3}$$

where $(x_{i},t_{i})$ are samples from $\mathcal{D}$.
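The split of the empirical risk into an error term and a regularization term can be illustrated with a minimal sketch. The linear model, the squared error, the L2 penalty, and the names `empirical_risk` and `lam` are assumptions chosen for illustration only; the text does not prescribe them.

```python
import numpy as np

def empirical_risk(w, X, T, lam=0.0):
    """Mean error over the training set plus a regularization term.

    Assumptions for illustration: f_w is linear, E is the squared error,
    and R is an L2 penalty on the weights, weighted by `lam`.
    """
    Y = X @ w                              # model outputs f_w(x_i)
    error = np.mean((Y - T) ** 2)          # (1/|D|) * sum_i E(f_w(x_i), t_i)
    penalty = 0.5 * lam * np.sum(w ** 2)   # R, independent of the targets
    return error + penalty

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))               # 50 training inputs
w_true = np.array([1.0, -2.0, 0.5])
T = X @ w_true                             # noise-free targets

risk_at_truth = empirical_risk(w_true, X, T)            # error term vanishes
risk_with_penalty = empirical_risk(w_true, X, T, 0.1)   # only R remains
```

At the true weights the error term is zero, so any remaining risk comes from the regularizer alone, which shows that the two terms can pull the minimizer in different directions.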

Now we have the minimal background to formalize the division of regularization methods into a systematic taxonomy. In the minimization of the empirical risk, Eq. (3), we can identify the following elements that are responsible for the value of the learned weights, and thus can contribute to regularization:

$\mathcal{D}$ : The training set, discussed in Section 3

$f$ : The selected model family, discussed in Section 4

$E$ : The error function, briefly discussed in Section 5

$R$ : The regularization term, discussed in Section 6

The optimization procedure itself, discussed in Section 7

Ambiguity regarding the splitting of methods into these categories and their subcategories is discussed in Appendix A using notation from Section 3.

Regularization via data

The quality of a trained model depends largely on the training data. Apart from acquisition/selection of appropriate training data, it is possible to employ regularization via data. This is done by applying some transformation to the training set $\mathcal{D}$ , resulting in a new set $\mathcal{D}_{R}$ . Some transformations perform feature extraction or pre-processing, modifying the feature space or the distribution of the data to some representation simplifying the learning task. Other methods allow generating new samples to create a larger, possibly infinite, augmented dataset. These two principles are somewhat independent and may be combined. The goal of regularization via data is either one of them, or the other, or both. They both rely on transformations with (stochastic) parameters:

Transformation with stochastic parameters is a function $\tau_{\theta}$ with parameters $\theta$ which follow some probability distribution.

In this context we consider $\tau_{\theta}$ which can operate on network inputs, activations in hidden layers, or targets. An example of a transformation with stochastic parameters is the corruption of inputs by Gaussian noise (Bishop, 1995b; An, 1996):

$$\tau_{\theta}(x)=x+\theta,\qquad\theta\sim\mathcal{N}(\mathbf{0},\Sigma).\tag{4}$$

The stochasticity of the transformation parameters is responsible for generating new samples, i.e. data augmentation. Note that the term data augmentation often refers specifically to transformations of inputs or hidden activations, but here we also list transformations of targets for completeness. The exception to the stochasticity is when $\theta$ follows a delta distribution, in which case the transformation parameters become deterministic and the dataset size is not augmented.
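As a minimal sketch of a transformation with stochastic parameters, the Gaussian input corruption can be written as follows. The isotropic covariance $\sigma^{2}\mathbf{I}$ and the function name `gaussian_noise_transform` are assumptions of this example; setting `sigma=0` mimics the delta-distribution case.

```python
import numpy as np

def gaussian_noise_transform(x, sigma, rng):
    """tau_theta(x) = x + theta with theta ~ N(0, sigma^2 I)."""
    theta = rng.normal(loc=0.0, scale=sigma, size=x.shape)
    return x + theta

rng = np.random.default_rng(0)
x = np.ones(4)

# Stochastic parameters: every call yields a new augmented sample.
a = gaussian_noise_transform(x, sigma=0.1, rng=rng)
b = gaussian_noise_transform(x, sigma=0.1, rng=rng)

# Delta distribution (sigma = 0): the transformation is deterministic
# and the dataset is not augmented.
c = gaussian_noise_transform(x, sigma=0.0, rng=rng)
```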

We can categorize the data-based methods according to the properties of the used transformation and of the distribution of its parameters. We identify the following criteria for categorization (some of them later serve as columns in Tables 1–2):

Deterministic parameters: Parameters $\theta$ follow a delta distribution, size of the dataset remains unchanged

Stochastic parameters: Allow generation of a larger, possibly infinite, dataset. Various strategies for sampling of $\theta$ exist:

Random: Draw a random $\theta$ from the specified distribution

Adaptive: Value of $\theta$ is the result of an optimization procedure, usually with the objective of maximizing the network error on the transformed sample (such a “challenging” sample is considered to be the most informative one at the current training stage), or minimizing the difference between the network prediction and a predefined fake target $t^{\prime}$

Constrained optimization: $\theta$ found by maximizing error under hard constraints (support of the distribution of $\theta$ controls the strongest allowed transformation)

Unconstrained optimization: $\theta$ found by maximizing modified error function, using the distribution of $\theta$ as weighting (proposed herein for completeness, not yet tested)

Stochastic: $\theta$ found by taking a fixed number of samples of $\theta$ and using the one yielding the highest error

Effect on the data representation

Representation-preserving transformations: Preserve the feature space and attempt to preserve the data distribution

Representation-modifying transformations: Map the data to a different representation (different distribution or even new feature space) that may disentangle the underlying factors of the original representation and make the learning problem easier

Transformation space

Hidden-feature space: Transformation is applied to some deep-layer representation of samples (this also uses parts of $f$ and $w$ to map the input into the hidden-feature space; such transformations act inside the network $f_{w}$ and thus can be considered part of the architecture, additionally fitting Section 4)

Target: Transformation is applied to $t$ (can only be used during the training phase since labels are not shown to the model at test time)

Universality

Domain-specific: Specific (handcrafted) for the problem at hand, for example image rotations

Dependence of the distribution of $\theta$

$p(\theta)$ : distribution of $\theta$ is the same for all samples

$p(\theta|t)$ : distribution of $\theta$ can be different for each target (class)

$p(\theta|t^{\prime})$ : distribution of $\theta$ depends on desired (fake) target $t^{\prime}$

$p(\theta|x)$ : distribution of $\theta$ can be different for each input vector (with implicit dependence on $f$ and $w$ if the transformation is in hidden-feature space)

$p(\theta|\mathcal{D})$ : distribution of $\theta$ depends on the whole training dataset

$p(\theta|\mathbf{x})$ : distribution of $\theta$ depends on a batch of training inputs (for example (parts of) the current mini-batch, or also previous mini-batches)

$p(\theta|\pi)$ : distribution of $\theta$ depends on some trainable parameters $\pi$ subject to loss minimization (i.e. the parameters $\pi$ evolve during training along with the network weights $w$ )

Combinations of the above, e.g. $p(\theta|x,t)$ , $p(\theta|x,\pi)$ , $p(\theta|x,t^{\prime})$ , $p(\theta|x,\mathcal{D})$ , $p(\theta|t,\mathcal{D})$ , $p(\theta|x,t,\mathcal{D})$

Phase

Training: Transformation of training samples

Test: Transformation of test samples, for example multiple augmented variants of a sample are classified and the result is aggregated over them
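The test-phase case in the list above, where several augmented variants of a sample are classified and the results aggregated, might look like the following sketch. The classifier `toy_model`, the transformation `jitter`, and averaging as the aggregation rule are hypothetical choices for illustration.

```python
import numpy as np

def predict_with_tta(model, x, transform, n_variants, rng):
    """Average model outputs over several randomly transformed copies of x."""
    outputs = [model(transform(x, rng)) for _ in range(n_variants)]
    return np.mean(outputs, axis=0)

def toy_model(x):
    # Hypothetical classifier: softmax over two fixed linear scores.
    scores = np.array([x.sum(), -x.sum()])
    e = np.exp(scores - scores.max())
    return e / e.sum()

def jitter(x, rng):
    # Hypothetical representation-preserving transformation.
    return x + rng.normal(scale=0.05, size=x.shape)

rng = np.random.default_rng(0)
probs = predict_with_tta(toy_model, np.array([0.3, 0.2]), jitter, 8, rng)
```

Other aggregation rules (for example majority voting) fit the same pattern by replacing the mean.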

A review of existing methods that use generic transformations can be found in Table 1. Dropout in its original form (Hinton et al., 2012; Srivastava et al., 2014) is one of the most popular methods from the generic group, but several variants of Dropout have also been proposed that provide additional theoretical motivation and improved empirical results (Standout (Ba and Frey, 2013), Random dropout probability (Bouthillier et al., 2015), Bayesian dropout (Maeda, 2014), Test-time dropout (Gal and Ghahramani, 2016)).

Table 2 contains a list of some domain-specific methods focused especially on the image domain. Here the most widely used methods are rigid and elastic image deformations.

Target-preserving data augmentation

In the following, we discuss an important group of methods: target-preserving data augmentation. These methods use stochastic transformations in input and hidden-feature spaces, while preserving the original target $t$ . As can be seen in the respective two columns in Tables 1–2, most of the listed methods have exactly these properties. These methods transform the training set to a distribution $Q$ , which is used for training instead. In other words, the training samples $(x_{i},t_{i})\in\mathcal{D}$ are replaced in the empirical risk loss function (Eq. (3)) by augmented training samples $(\tau_{\theta}(x_{i}),t_{i})\sim Q$ . By randomly sampling the transformation parameters $\theta$ and thus creating many new samples $(\tau_{\theta}(x_{i}),t_{i})$ from each original training sample $(x_{i},t_{i})$ , data augmentation attempts to bridge the limited-data gap between the expected and the empirical risk, Eqs. (2)–(3). While unlimited sampling from $Q$ provides more data than the original dataset $\mathcal{D}$ , both of them usually are merely approximations of the ground truth data distribution or of an ideal training dataset; both $\mathcal{D}$ and $Q$ have their own distinct biases, advantages and disadvantages. For example, elastic image deformations result in images that are not perfectly realistic; this is not necessarily a disadvantage, but it is a bias compared to the ground truth data distribution; in any case, the advantages (having more training data) often prevail. In some cases, it may be even desired for $Q$ to be deliberately different from the ground truth data distribution. For example, in case of class imbalance (unbalanced abundance or importance of classes), a common regularization strategy is to undersample or oversample the data, sometimes leading to a less realistic $Q$ but better models. This is how an ideal training dataset may be different from the ground truth data distribution.

If the transformation is additionally representation-preserving, then the distribution $Q$ created by the transformation $\tau_{\theta}$ attempts to mimic the ground truth data distribution $P$ . Otherwise, the notion of a “ground truth data distribution” in the modified representation may be vague. We provide more details about the transition from $\mathcal{D}$ to $Q$ in Appendix B.
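The class-imbalance case mentioned above, where $Q$ deliberately differs from the ground truth distribution, can be sketched with simple oversampling. Resampling every class up to the size of the largest class and the name `oversample_indices` are assumptions of this example; other resampling schemes are equally possible.

```python
import numpy as np

def oversample_indices(targets, rng):
    """Resample indices with replacement so every class is equally frequent."""
    classes, counts = np.unique(targets, return_counts=True)
    n_per_class = counts.max()
    picks = [
        rng.choice(np.flatnonzero(targets == c), size=n_per_class, replace=True)
        for c in classes
    ]
    return np.concatenate(picks)

rng = np.random.default_rng(0)
t = np.array([0] * 90 + [1] * 10)      # imbalanced targets
idx = oversample_indices(t, rng)
balanced_counts = np.bincount(t[idx])  # class frequencies under Q
```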

Summary of data-based methods

Data-based regularization is a popular and very useful way to improve the results of deep learning. In this section we formalized this group of methods and showed that seemingly unrelated techniques such as Target-preserving data augmentation, Dropout, or Batch normalization are methodologically surprisingly close to each other. In Section 8 we discuss future directions that we find promising.

Regularization via the network architecture

A network architecture $f$ can be selected to have certain properties or match certain assumptions in order to have a regularizing effect. The network architecture is represented by a function $f:(w,x)\mapsto y$, and together with the set $W$ of all its possible weight configurations defines a set of mappings that this particular architecture can realize: $\{f_{w}:x\mapsto y\mid\forall w\in W\}$.

An input-output mapping $f_{w}$ must have certain properties in order to fit the data $P$ well. Although it may be intractable to enforce the precise properties of an ideal mapping, it may be possible to approximate them by simplified assumptions about the mapping. These properties and assumptions can then be imposed upon model fitting in a hard or soft manner. This limits the search space of models and allows finding better solutions. An example is the decision about the number of layers and units, which allows the mapping to be neither too simple nor too complex (thus avoiding underfitting and overfitting). Another example are certain invariances of the mapping, such as locality and shift-equivariance of feature extraction hardwired in convolutional layers. Overall, the approach of imposing assumptions about the input-output mapping discussed in this section is the selection of the network architecture $f$ . The choice of architecture $f$ on the one hand hardwires certain properties of the mapping; additionally, in an interplay between $f$ and the optimization algorithm (Section 7), certain weight configurations are more likely accessible by optimization than others, further limiting the likely search space in a soft way. A complementary way of imposing certain assumptions about the mapping are regularization terms (Section 6), as well as invariances present in the (augmented) data set (Section 3).

Assumptions can be hardwired into the definition of the operation performed by certain layers, and/or into the connections between layers. This distinction is made in Table 3, where these and other methods are listed.

In Section 3 about data, we mentioned regularization methods that transform data in the hidden-feature space. They can be considered part of the architecture. In other words, they fit both Sections 3 (data) and 4 (architecture). These methods are listed in Table 1 with hidden features as their transformation space.

Weight sharing

Reusing a certain trainable parameter in several parts of the network is referred to as weight sharing. This usually makes the model less complex than using separately trainable parameters. Convolutional networks (LeCun et al., 1989) are an example. Here the weight sharing does not merely reduce the number of weights that need to be learned; it also encodes the prior knowledge about the shift-equivariance and locality of feature extraction. Another example is weight sharing in autoencoders.
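A small one-dimensional convolution illustrates both effects of weight sharing described above: fewer parameters and shift-equivariance. The signal, the kernel values, and the helper `conv1d_valid` are made up for this sketch.

```python
import numpy as np

def conv1d_valid(x, kernel):
    """Apply the same kernel weights at every position (no padding)."""
    k = len(kernel)
    return np.array([x[i:i + k] @ kernel for i in range(len(x) - k + 1)])

x = np.array([0.0, 1.0, 3.0, 2.0, 0.0, 0.0, 0.0])
kernel = np.array([1.0, -1.0, 0.5])               # 3 shared weights

y = conv1d_valid(x, kernel)
y_shifted = conv1d_valid(np.roll(x, 1), kernel)   # input shifted by one step

# A dense layer with the same input and output sizes would need 7 * 5 weights.
n_shared, n_dense = kernel.size, x.size * y.size
```

Shifting the input by one step shifts the output by one step (`y_shifted[1:]` equals `y[:-1]`), which a dense layer with independent weights would not guarantee.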

Activation functions

Noisy models

Stochastic pooling is one example of a stochastic generalization of a deterministic model. Other models are made stochastic by injecting random noise into various parts of the model. The most frequently used noisy model is Dropout (Hinton et al., 2012; Srivastava et al., 2014).
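A minimal sketch of Dropout as noise injection on hidden activations follows. The rescaling by `1 / (1 - p_drop)` (often called inverted dropout) is an assumption of this example rather than something stated in the text; it keeps the expected activation unchanged between training and test.

```python
import numpy as np

def dropout(h, p_drop, rng, training=True):
    """Zero each activation with probability p_drop during training.

    The rescaling by 1 / (1 - p_drop) is an assumption of this sketch;
    it keeps the expected activation unchanged, so no rescaling is
    needed at test time.
    """
    if not training or p_drop == 0.0:
        return h
    mask = rng.random(h.shape) >= p_drop   # theta: random binary mask
    return h * mask / (1.0 - p_drop)

rng = np.random.default_rng(0)
h = np.ones(10_000)
h_train = dropout(h, p_drop=0.5, rng=rng)
h_test = dropout(h, p_drop=0.5, rng=rng, training=False)
```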

Multi-task learning

A special type of regularization is multi-task learning (see Caruana, 1998; Ruder, 2017). It can be combined with semi-supervised learning to utilize unlabeled data on an auxiliary task (Rasmus et al., 2015). A similar concept of sharing knowledge between tasks is also utilized in meta-learning, where multiple tasks from the same domain are learned sequentially, using previously gained knowledge as bias for new tasks (Baxter, 2000); and in transfer learning, where knowledge from one domain is transferred into another domain (Pan and Yang, 2010).

Model selection

The best among several trained models (e.g. with different architectures) can be selected by evaluating the predictions on a validation set. It should be noted that this holds for selecting the best combination of all techniques (Sections 3–7), not just architecture; and that the validation set used for model selection in the “outer loop” should be different from the validation set used e.g. for Early stopping (Section 7), and different from the test set (Cawley and Talbot, 2010). However, there are also model selection methods that specifically target the selection of the number of units in a specific network architecture, e.g. using network growing and network pruning (see Bishop, 1995a, Sec. 9.5), or additionally do not require a validation set, e.g. the Network information criterion to compare models based on the training error and second derivatives of the loss function (Murata et al., 1994).

Regularization via the error function

Ideally, the error function $E$ reflects an appropriate notion of quality, and in some cases some assumptions about the data distribution. Typical examples are mean squared error or cross-entropy. The error function $E$ can also have a regularizing effect. An example is Dice coefficient optimization (Milletari et al., 2016), which is robust to class imbalance. Moreover, the overall form of the loss function can be different from Eq. (3). For example, in certain loss functions that are robust to class imbalance, the sum is taken over pairwise combinations $\mathcal{D}\times\mathcal{D}$ of training samples (Yan et al., 2003), rather than over training samples. But such alternatives to Eq. (3) are rather rare, and similar principles apply. If additional tasks are added for a regularizing effect (multi-task learning (see Caruana, 1998; Ruder, 2017)), then targets $t$ are modified to consist of several tasks, the mapping $f_{w}$ is modified to produce a corresponding output $y$, and $E$ is modified to account for the modified $t$ and $y$. Besides, there are regularization terms that depend on $\partial E/\partial x$. They depend on $t$ and thus in our definition are considered part of $E$ rather than of $R$, but they are listed in Section 6 among $R$ (rather than here) for a better overview.
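To make the robustness of a Dice-based error function to class imbalance concrete, the sketch below uses one common soft Dice formulation with squared terms in the denominator. The smoothing constant `eps` and the toy mask are assumptions of this example.

```python
import numpy as np

def soft_dice_loss(pred, target, eps=1e-7):
    """1 - Dice overlap between predicted probabilities and a binary mask.

    Uses the squared-denominator form; `eps` (an assumption of this
    sketch) avoids division by zero for empty masks.
    """
    intersection = np.sum(pred * target)
    denom = np.sum(pred ** 2) + np.sum(target ** 2)
    return 1.0 - (2.0 * intersection + eps) / (denom + eps)

target = np.zeros(1000)
target[:10] = 1.0                       # foreground is only 1% of the pixels

all_background = np.zeros(1000)         # trivial prediction
perfect = target.copy()

loss_trivial = soft_dice_loss(all_background, target)
loss_perfect = soft_dice_loss(perfect, target)
pixel_accuracy_trivial = np.mean(all_background == target)
```

The trivial all-background prediction reaches 99% pixel accuracy on this mask but receives a Dice loss close to 1, so the rare foreground class is not ignored.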

Regularization via the regularization term

Regularization can be achieved by adding a regularizer $R$ into the loss function. Unlike the error function $E$ (which expresses consistency of outputs with targets), the regularization term is independent of the targets. Instead, it is used to encode other properties of the desired model, to provide inductive bias (i.e. assumptions about the mapping other than consistency of outputs with targets). The value of $R$ can thus be computed for an unlabeled test sample, whereas the value of $E$ cannot.

The independence of $R$ from $t$ has an important implication: it allows additionally using unlabeled samples (semi-supervised learning) to improve the learned model based on its compliance with some desired properties (Sajjadi et al., 2016). For example, semi-supervised learning with ladder networks (Rasmus et al., 2015) combines a supervised task with an unsupervised auxiliary denoising task in a “multi-task” learning fashion. (For alternative interpretations, see Appendix A.) Unlabeled samples are extremely useful when labeled samples are scarce. A Bayesian perspective on the combination of labeled and unlabeled data in a semi-supervised manner is offered by Lasserre et al. (2006).

A classical regularizer is weight decay (see Plaut et al., 1986; Lang and Hinton, 1990; Goodfellow et al., 2016, Chap. 7):

$$R(w)=\frac{\lambda}{2}\lVert w\rVert_{2}^{2},\tag{5}$$

where $\lambda$ is a weighting term controlling the importance of the regularization over the consistency. From the Bayesian perspective, weight decay corresponds to using a symmetric multivariate normal distribution as prior for the weights: $p(w)=\mathcal{N}(w|\mathbf{0},\lambda^{-1}\mathbf{I})$ (Nowlan and Hinton, 1992). Indeed, $-\log\mathcal{N}(w|\mathbf{0},\lambda^{-1}\mathbf{I})\propto-\log\exp\left(-\frac{\lambda}{2}\lVert w\rVert_{2}^{2}\right)=\frac{\lambda}{2}\lVert w\rVert_{2}^{2}=R(w)$. Weight decay has become very popular and is used successfully; Krizhevsky et al. (2012) even observe a reduction of the error on the training set.
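The name weight decay can be motivated with a short sketch: a gradient step on $R(w)=\frac{\lambda}{2}\lVert w\rVert_{2}^{2}$ alone multiplies every weight by a constant factor below one. The step size `lr` and the numerical values are assumptions chosen for illustration.

```python
import numpy as np

def weight_decay(w, lam):
    """R(w) = (lam / 2) * ||w||_2^2."""
    return 0.5 * lam * np.sum(w ** 2)

def weight_decay_grad(w, lam):
    """Gradient of R with respect to w."""
    return lam * w

w = np.array([2.0, -4.0])
lam, lr = 0.1, 0.5

# One gradient step on R alone shrinks every weight by the factor (1 - lr * lam).
w_next = w - lr * weight_decay_grad(w, lam)
```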

Another common prior assumption that can be expressed via the regularization term is “smoothness” of the learned mapping (see Bengio et al., 2013, Section 3.2): if $x_{1}\approx x_{2}$, then $f_{w}(x_{1})\approx f_{w}(x_{2})$. It can be expressed by the following loss term:

$$R(f_{w},x)=\frac{\lambda}{2}\lVert J_{f_{w}}(x)\rVert_{F}^{2},\tag{6}$$

where $\lVert\cdot\rVert_{F}$ denotes the Frobenius norm, and $J_{f_{w}}(x)$ is the Jacobian of the neural network input-to-output mapping $f_{w}$ for some fixed network weights $w$. This term penalizes mappings with large derivatives, and is used in contractive autoencoders (Rifai et al., 2011c).
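For a single layer the Jacobian penalty can be written in closed form. The sketch below assumes a hypothetical mapping $y=\tanh(Wx+b)$, for which the Jacobian is $\mathrm{diag}(1-y^{2})\,W$, and checks it against finite differences; the layer sizes and `lam` are arbitrary.

```python
import numpy as np

def jacobian_tanh_layer(W, b, x):
    """Jacobian dy/dx of y = tanh(W x + b) for fixed weights."""
    y = np.tanh(W @ x + b)
    return (1.0 - y ** 2)[:, None] * W

def jacobian_penalty(W, b, x, lam):
    """(lam / 2) * squared Frobenius norm of the Jacobian at x."""
    J = jacobian_tanh_layer(W, b, x)
    return 0.5 * lam * np.sum(J ** 2)

rng = np.random.default_rng(0)
W, b, x = rng.normal(size=(3, 4)), rng.normal(size=3), rng.normal(size=4)

# Central finite-difference check of the analytic Jacobian, column by column.
eps = 1e-6
J_num = np.stack(
    [(np.tanh(W @ (x + eps * e) + b) - np.tanh(W @ (x - eps * e) + b)) / (2 * eps)
     for e in np.eye(4)],
    axis=1,
)
penalty = jacobian_penalty(W, b, x, lam=0.1)
```

Note that the penalty is computed from $w$ and $x$ only; no target is needed, in line with the definition of $R$.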

The domain of loss regularizers is very heterogeneous. We propose a natural way to categorize them by their dependence. We saw in Eq. (5) that weight decay depends on $w$ only, whereas the Jacobian penalty in Eq. (6) depends on $w$ , $f$ , and $x$ . More precisely, the Jacobian penalty uses the derivative $\partial y/\partial x$ of output $y=f_{w}(x)$ w.r.t. input $x$ . (We use vector-by-vector derivative notation from matrix calculus, i.e. $\partial y/\partial x=\partial f_{w}(x)/\partial x=J_{f_{w}}$ is the Jacobian of $f_{w}$ with fixed weights $w$ .) We identify the following dependencies of $R$ :

Dependence on the network output $y=f_{w}(x)$

Dependence on the derivative $\partial y/\partial w$ of the output $y=f_{w}(x)$ w.r.t. the weights $w$

Dependence on the derivative $\partial y/\partial x$ of the output $y=f_{w}(x)$ w.r.t. the input $x$

Dependence on the derivative $\partial E/\partial x$ of the error term $E$ w.r.t. the input $x$ ( $E$ depends on $t$ , and according to our definition such methods belong to Section 5, but they are listed here for overview)

A review of existing methods can be found in Table 4. Weight decay still seems to be the most popular of the regularization terms. Some of the methods are equivalent or nearly equivalent to other methods from different taxonomy branches. For example, Tangent prop simulates minimal data augmentation (Simard et al., 1992); Injection of small-variance Gaussian noise (Bishop, 1995b; An, 1996) is an approximation of Jacobian penalty (Rifai et al., 2011c); and Fast dropout (Wang and Manning, 2013) is (in shallow networks) a deterministic approximation of Dropout. This is indicated in the Equivalence column in Table 4.

Regularization via optimization

The last class of the regularization methods according to our taxonomy is the regularization through optimization. Stochastic gradient descent (SGD) (see Bottou, 1998), along with its variants, is the most frequently used optimization algorithm in the context of deep neural networks and is the center of our attention. We also list some alternative methods below.

Stochastic gradient descent is an iterative optimization algorithm using the following update rule:

$$w_{t+1}=w_{t}-\eta_{t}\nabla\mathcal{L}(w_{t},d_{t}),\tag{7}$$

where $\eta_{t}$ is the learning rate and $\nabla\mathcal{L}(w_{t},d_{t})$ is the gradient of the loss $\mathcal{L}$ evaluated on a mini-batch $d_{t}$ from the training set $\mathcal{D}$. It is frequently used in combination with momentum and other tweaks improving the convergence speed (see Wilson et al., 2017). Moreover, the noise induced by the varying mini-batches helps the algorithm escape saddle points (Ge et al., 2015); this can be further reinforced by adding supplementary gradient noise (Neelakantan et al., 2015; Chaudhari and Soatto, 2015).
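The mini-batch update can be sketched on a toy problem. The linear model, the mean squared error, the constant learning rate `lr`, and the batch size are assumptions of this example; momentum and other tweaks are omitted.

```python
import numpy as np

def sgd(X, T, lr=0.1, batch_size=10, epochs=50, seed=0):
    """Mini-batch SGD for a linear model with mean squared error."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        order = rng.permutation(len(X))        # new mini-batches every epoch
        for start in range(0, len(X), batch_size):
            idx = order[start:start + batch_size]
            residual = X[idx] @ w - T[idx]
            grad = 2.0 * X[idx].T @ residual / len(idx)   # gradient on mini-batch
            w = w - lr * grad                  # gradient step with step size lr
    return w

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
w_true = np.array([1.0, -2.0, 0.5])
T = X @ w_true                                 # noise-free targets
w_hat = sgd(X, T)
```

Because each step sees a different random mini-batch, the gradient is a noisy estimate of the full-batch gradient, which is the source of the noise discussed above.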

If the algorithm reaches a low training error in a reasonable time (linear in the size of the training set, allowing multiple passes through $\mathcal{D}$), the solution generalizes well under certain mild assumptions; in that sense SGD works as an implicit regularizer: a short training time prevents overfitting even without any additional regularizer used (Hardt et al., 2016). This is in line with Zhang et al. (2017), who find in a series of experiments that regularization (such as Dropout, data augmentation, and weight decay) is by itself neither necessary nor sufficient for good generalization.

We divide the methods into three groups: initialization/warm-start methods, update methods, and termination methods, discussed in the following.

Initialization and warm-start methods

These methods affect the initial selection of the model weights. Currently the most frequently used method is sampling the initial weights from a carefully tuned distribution. There are multiple strategies based on the architecture choice, aiming at keeping the variance of activations in all layers around $1$, thus preventing vanishing or exploding activations (and gradients) in deeper layers (Glorot and Bengio, 2010, Sec. 4.2; He et al., 2015).
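The effect of a variance-preserving initialization can be sketched by tracking the mean squared pre-activation through a random ReLU network. The variance $2/\text{fan\_in}$ follows the scheme of He et al. (2015); the width, depth, batch size, and the fixed-scale baseline `naive_normal` are assumptions of this example.

```python
import numpy as np

def he_normal(fan_in, fan_out, rng):
    """Zero-mean Gaussian weights with variance 2 / fan_in."""
    return rng.normal(scale=np.sqrt(2.0 / fan_in), size=(fan_in, fan_out))

def naive_normal(fan_in, fan_out, rng):
    # Baseline for comparison: fixed small standard deviation.
    return rng.normal(scale=0.01, size=(fan_in, fan_out))

def forward_second_moments(init, width=512, depth=20, seed=0):
    """Mean squared pre-activation in each layer of a random ReLU network."""
    rng = np.random.default_rng(seed)
    h = rng.normal(size=(256, width))       # batch of unit-variance inputs
    moments = []
    for _ in range(depth):
        z = h @ init(width, width, rng)     # pre-activations
        moments.append(np.mean(z ** 2))
        h = np.maximum(z, 0.0)              # ReLU
    return moments

scaled = forward_second_moments(he_normal)
naive = forward_second_moments(naive_normal)
```

With the scaled initialization the second moment stays of order one in every layer, whereas with the fixed small standard deviation it shrinks geometrically with depth.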

Another (complementary) option is pre-training on different data, or with a different objective, or with partially different architecture. This can prime the learning algorithm towards a good solution before the fine-tuning on the actual objective starts. Pre-training the model on a different task in the same domain may lead to learning useful features, making the primary task easier. However, pre-trained models are also often misused as a lazy approach to problems where training from scratch or using thorough domain adaptation, transfer learning, or multi-task learning methods would be worth trying. On the other hand, pre-training or similar techniques may be a useful part of such methods.

Finally, with some methods such as Curriculum learning (Bengio et al., 2009), the transition between pre-training and fine-tuning is smooth. We refer to them as warm-start methods.

Random weight initialization (Rumelhart et al., 1986, p. 330; Glorot and Bengio, 2010; He et al., 2015; Hendrycks and Gimpel, 2016)

Orthogonal weight matrices (Saxe et al., 2013)

Data-dependent weight initialization (Krähenbühl et al., 2015)

Greedy layer-wise pre-training (Hinton et al., 2006; Bengio et al., 2007; Erhan et al., 2010) (has become less important due to advances (e.g. ReLUs) in effective end-to-end training that optimizes all parameters simultaneously)

Curriculum learning (Bengio et al., 2009)

Spatial contrasting (Hoffer et al., 2016)

Subtask splitting (Gülçehre and Bengio, 2016)

Update methods

This class of methods affects individual weight updates. There are two complementary subgroups: update rules modify the form of the update formula; weight and gradient filters affect the value of the gradient or weights used in the update formula, e.g. by injecting noise into the gradient (Neelakantan et al., 2015).
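A gradient filter that injects noise can be sketched as below. The decaying variance schedule $\sigma_{t}^{2}=\eta/(1+t)^{\gamma}$ and the default values of `eta` and `gamma` are assumptions of this example, chosen only to illustrate annealed noise; they are not prescribed by the text.

```python
import numpy as np

def noisy_gradient(grad, step, rng, eta=0.3, gamma=0.55):
    """Add zero-mean Gaussian noise whose variance decays with the step count.

    The schedule sigma_t^2 = eta / (1 + t)^gamma and the default values
    are assumptions of this sketch.
    """
    sigma = np.sqrt(eta / (1.0 + step) ** gamma)
    return grad + rng.normal(scale=sigma, size=grad.shape)

rng = np.random.default_rng(0)
g = np.zeros(10_000)                      # placeholder gradient
early = noisy_gradient(g, step=0, rng=rng)
late = noisy_gradient(g, step=10_000, rng=rng)
```

The filtered gradient would then replace the raw gradient in the update formula, leaving the update rule itself unchanged.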

Again, it is not entirely clear which of the methods only speed up the optimization and which actually help the generalization. Wilson et al. (2017) show that some of the methods such as AdaGrad or Adam even lose the regularization abilities of SGD.

Momentum, Nesterov’s accelerated gradient method, AdaGrad, AdaDelta, RMSProp, Adam (overview in Wilson et al., 2017)

Learning rate schedules (Girosi et al., 1995; Hoffer et al., 2017)

Online batch selection (Loshchilov and Hutter, 2015)

SGD alternatives: L-BFGS (Liu and Nocedal, 1989; Le et al., 2011), Hessian-free methods (Martens, 2010), Sum-of-functions optimizer (Sohl-Dickstein et al., 2014), ProxProp (Frerix et al., 2017)

Annealed Langevin noise (Neelakantan et al., 2015)

Dropout (Hinton et al., 2012; Srivastava et al., 2014) corresponds to optimization steps in subspaces of weight space, see Figure 1

Annealed noise on targets (Wang and Principe, 1999) (works as noise on gradient, but belongs rather to data-based methods, Section 3)

Termination methods

There are numerous possible stopping criteria, and selecting the right moment to stop the optimization procedure may improve the generalization by reducing the error caused by the discrepancy between the minimizers of expected and empirical risk: the network first learns general concepts that work for all samples from the ground truth distribution $P$ before fitting the specific sample $\mathcal{D}$ and its noise (Krueger et al., 2017).

The most successful and popular termination methods put a portion of the labeled data aside as a validation set and use it to evaluate performance (validation error). The most prominent example is Early stopping (see Prechelt, 1998). In scenarios where the training data are scarce, it is possible to resort to termination methods that do not use a validation set. The simplest case is fixing the number of passes through the training set.
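A validation-based termination rule can be sketched as follows. The patience criterion (stop after a fixed number of epochs without improvement) and the validation curve are assumptions of this example; many other stopping criteria exist.

```python
def early_stopping_epoch(val_errors, patience=3):
    """Return the index of the epoch whose weights would be kept.

    Training stops once the validation error has not improved for
    `patience` consecutive epochs (`patience` is an assumption of this
    sketch).
    """
    best_epoch, best_error = 0, float("inf")
    for epoch, error in enumerate(val_errors):
        if error < best_error:
            best_epoch, best_error = epoch, error
        elif epoch - best_epoch >= patience:
            break                      # no improvement for `patience` epochs
    return best_epoch

# Hypothetical validation curve: improves, then rises as overfitting sets in.
curve = [0.90, 0.60, 0.45, 0.40, 0.42, 0.41, 0.43, 0.50, 0.30]
stop_at = early_stopping_epoch(curve)
```

In this curve training stops before the final value is ever evaluated, which illustrates that such a rule trades a possibly better later optimum for a shorter training time.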

Early stopping (see Morgan and Bourlard, 1990; Prechelt, 1998)

Choice of validation set size based on test set size (Amari et al., 1997)

Termination without using a validation set

Optimized approximation algorithm (Liu et al., 2008)

Recommendations, discussion, conclusions

We see the main benefits of our taxonomy to be two-fold: Firstly, it provides an overview of the existing techniques to the users of regularization methods and gives them a better idea of how to choose the ideal combination of regularization techniques for their problem. Secondly, it is useful for development of new methods, as it gives a comprehensive overview of the main principles that can be exploited to regularize the models. We summarize our recommendations in the following paragraphs:

Overall, using the information contained in data as well as prior knowledge as much as possible, and primarily starting with popular methods, the following procedure can be helpful:

Common recommendations for the first steps:

Deep learning is about disentangling the factors of variation. An appropriate data representation should be chosen; known meaningful data transformations should not be outsourced to the learning. Redundantly providing the same information in several representations is okay.

Output nonlinearity and error function should reflect the learning goals.

Techniques that usually work well (e.g. ReLU, successful architectures) are a good starting point. Hyperparameters (and architecture) can be tuned jointly, but “lazily” (interpolating/extrapolating from experience instead of trying too many combinations).

Often it is helpful to start with a simplified dataset (e.g. fewer and/or easier samples) and a simple network, and after obtaining promising results to gradually increase the complexity of both data and network while tuning hyperparameters and trying regularization methods.

When not working with nearly infinite/abundant data:

Gathering more real data (and using methods that take its properties into account) is advisable if possible:

Labeled samples are best, but unlabeled ones can also be helpful (compatible with semi-supervised learning).

Samples from the same domain are best, but samples from similar domains can also be helpful (compatible with domain adaptation and transfer learning).

Reliable high-quality samples are best, but lower-quality ones can also be helpful (their confidence/importance can be adjusted accordingly).

Labels for an additional task can be helpful (compatible with multi-task learning).

Additional input features (from additional information sources) and/or data preprocessing (i.e. domain-specific data transformations) can be helpful (the network architecture needs to be adjusted accordingly).

Data augmentation (e.g. target-preserving handcrafted domain-specific transformations) can compensate well for limited data. If natural ways to augment data (mimicking natural transformations sufficiently well) are known, they can be tried (and combined).

If natural ways to augment data are unknown or turn out to be insufficient, it may be possible to infer the transformation from data (e.g. learning image-deformation fields) if a sufficient amount of data is available for that.

Popular generic methods (e.g. advanced variants of Dropout) often also help.

Knowledge about possible meaningful properties of the mapping can be used to e.g. hardwire invariances (to certain transformations) into the architecture, or be formulated as regularization terms.

Popular methods may help as well (see Tables 3–4), but should be chosen to match the assumptions about the mapping (e.g. convolutional layers are fully appropriate only if local and shift-equivariant feature extraction on regular-grid data is desired).

Initialization: Even though pre-trained ready-made models greatly speed up prototyping, training from a good random initialization should also be considered.

Optimizers: Trying a few different ones, including advanced ones (e.g. Nesterov momentum, Adam, ProxProp), may lead to improved results. Correctly chosen parameters, such as learning rate, usually make a big difference.
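As an illustration of the data augmentation recommendation above, the following minimal sketch samples transformation parameters $\theta$ and applies a target-preserving transformation $\tau_{\theta}$ to each input while leaving the target unchanged. The specific transformations (horizontal flip, circular horizontal shift), the distribution of $\theta$, and all function names are assumptions chosen for illustration; they are not prescribed by the taxonomy.

```python
import numpy as np


def sample_theta(rng, max_shift=2):
    """Draw transformation parameters theta ~ p(theta).

    The choice of a fair coin flip and a uniform integer shift is a
    hypothetical example of a parameter distribution.
    """
    return {
        "flip": bool(rng.integers(0, 2)),
        "shift": int(rng.integers(-max_shift, max_shift + 1)),
    }


def tau(x, theta):
    """Apply tau_theta to a 2D image x of shape (H, W).

    Uses a horizontal flip followed by a circular horizontal shift
    (np.roll wraps pixels around the border).
    """
    out = x[:, ::-1] if theta["flip"] else x
    return np.roll(out, theta["shift"], axis=1)


def augment_batch(xs, ts, rng):
    """Return augmented inputs; targets are passed through unchanged."""
    xs_aug = np.stack([tau(x, sample_theta(rng)) for x in xs])
    return xs_aug, ts
```

Whether a flip or shift is actually target-preserving depends on the task (e.g. flips change the meaning of some handwritten digits), so the transformation set should be chosen from domain knowledge.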

Recommendations for developers of novel regularization methods

Getting an overview and understanding the reasons for the success of the best methods is a great foundation. Promising empty niches (certain combinations of taxonomy properties) exist that can be addressed. The assumptions to be imposed upon the model can have a strong impact on most elements of the taxonomy. Data augmentation is more expressive than loss terms (loss terms enforce properties only in an infinitesimally small neighborhood of the training samples; data augmentation can use rich transformation parameter distributions). Data and loss terms impose assumptions and invariances in a rather soft manner, and their influence can be tuned, whereas hardwiring the network architecture is a harsher way to impose assumptions. Different assumptions and options to impose them have different advantages and disadvantages.

Future directions for data-based methods

There are several promising directions that in our opinion require more investigation. First, adaptive sampling of $\theta$ might lead to lower errors and shorter training times (Fawzi et al., 2016); in turn, shorter training times may additionally work as implicit regularization (Hardt et al., 2016; see also Section 7). Second, learning class-dependent transformations (i.e. $p(\theta|t)$) might, in our opinion, lead to more plausible samples. Furthermore, the field of adversarial examples (and network robustness to them) is gaining increased attention after the recently sparked discussion on real-world adversarial examples and their robustness/invariance to transformations such as the change of camera position (Lu et al., 2017; Athalye and Sutskever, 2017). Countering strong adversarial examples may require better regularization techniques.

Summary

In this work we proposed a broad definition of regularization for deep learning, identified five main elements of neural network training (data, architecture, error term, regularization term, optimization procedure), described regularization via each of them, including a further, finer taxonomy for each, and presented example methods from these subcategories. Instead of attempting to explain referenced works in detail, we merely pinpointed their properties relevant to our categorization. Our work demonstrates some links between existing methods. Moreover, our systematic approach enables the discovery of new, improved regularization methods by combining the best properties of the existing ones.

Acknowledgments

We thank Antonij Golkov for valuable discussions. Grant support: ERC Consolidator Grant “3DReloaded”.

References

Appendix A Ambiguities in the taxonomy

Although our proposed taxonomy seems intuitive, there are some ambiguities: Certain methods have multiple interpretations matching various categories. Viewed from the exterior, a neural network maps inputs $x$ to outputs $y$. We formulate this as $y=f_{w}(\tau_{\theta}(x))$ for transformations $\tau_{\theta}$ in input space (and similarly for hidden-feature space, where $\tau_{\theta}$ is applied in between layers of the network $f_{w}$). However, how to split this $x$-to-$y$ mapping into “the $\tau_{\theta}$ part” and “the $f_{w}$ part”, and thus into Section 3 vs. Section 4, is ambiguous and up to one’s taste and goals. In our choices (marked with “☑” below), we attempt to use common notions and Occam’s razor.

Ambiguity of attributing noise to $f$, or to $w$, or to data transformations $\tau_{\theta}$:

Stochastic methods such as Stochastic depth (Huang et al., 2016b) can have several interpretations if stochastic transformations are allowed for $f$ or $w$:

Stochastic transformation of the architecture $f$ (randomly dropping some connections), Table 3

Stochastic transformation of the weights $w$ (setting some weights to $0$ in a certain random pattern)

Stochastic transformation $\tau_{\theta}$ of data in hidden-feature space; dependence is $p(\theta)$, described in Table 1 for completeness

Ambiguity of splitting $\tau_{\theta}$ into $\tau$ and $\theta$:

Parameters $\theta$ are the dropout mask; dependence is $p(\theta)$; transformation $\tau$ applies the dropout mask to the hidden features

Parameters $\theta$ are the seed state of a pseudorandom number generator; dependence is $p(\theta)$; transformation $\tau$ internally generates the random dropout mask from the random seed and applies it to the hidden features

Projecting dropout noise into input space (Bouthillier et al., 2015, Sec. 3) can fit our taxonomy in different ways by defining $\tau$ and $\theta$ accordingly. It can have similar interpretations as Dropout above (if $\tau$ is generalized to allow for dependence on $x,f,w$), but we prefer the third interpretation without such generalizations:

Parameters $\theta$ are the dropout mask (to be applied in a hidden layer); dependence is $p(\theta)$; transformation $\tau$ transforms the input to mimic the effect of the mask

Parameters $\theta$ describe the transformation of the input in any formulation; dependence is $p(\theta|x,f,w)$; transformation $\tau$ merely applies the transformation in input space
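The two ways of splitting Dropout into $\tau$ and $\theta$ described above (mask as $\theta$ versus seed as $\theta$) can be sketched as follows. This is only an illustrative sketch: the function names, the use of NumPy's `default_rng`, and the inverted-dropout rescaling by `keep_prob` are assumptions made for the example. Both formulations yield the same transformed hidden features when the mask in the first is the one the seed in the second would generate.

```python
import numpy as np


def apply_mask(h, mask, keep_prob):
    """Zero out dropped units and rescale the kept ones (inverted dropout)."""
    return h * mask / keep_prob


def tau_mask(h, theta_mask, keep_prob=0.5):
    """Interpretation 1: theta is the dropout mask itself."""
    return apply_mask(h, theta_mask, keep_prob)


def tau_seed(h, theta_seed, keep_prob=0.5):
    """Interpretation 2: theta is the seed; tau generates the mask internally."""
    rng = np.random.default_rng(theta_seed)
    mask = rng.random(h.shape) < keep_prob
    return apply_mask(h, mask, keep_prob)


# Usage: the same stochastic transformation expressed in both ways.
h = np.arange(1.0, 9.0)            # hypothetical hidden features
seed = 0
mask = np.random.default_rng(seed).random(h.shape) < 0.5
same = np.allclose(tau_mask(h, mask), tau_seed(h, seed))
```

The difference lies only in where the boundary between "parameters" and "transformation" is drawn, which is why the choice is a matter of convention rather than of behavior.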

Ambiguity of splitting the network operation $f_{w}$ into layers: There are several possibilities to represent a function (neural network) as a composition (or directed acyclic graph) of functions (layers).

Many of the input and hidden-feature transformations (Section 3) can be considered layers of the network (Section 4). In fact, the term “layer” is not uncommon for Dropout or Batch normalization.

The usage of a trainable parameter in several parts of the network is called weight sharing. However, some mappings can be expressed with two equivalent formulas such that a parameter appears only once in one formulation, and several times in the other.
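The weight-sharing ambiguity above can be illustrated with a deliberately trivial example, chosen here as an assumption for illustration: a scalar parameter `w` that appears twice in one formula and once in an algebraically equivalent one. Both functions compute the same mapping, yet only the first would typically be described as "sharing" `w`.

```python
import numpy as np


def f_shared(w, x1, x2):
    """w appears twice: reads as weight sharing across two branches."""
    return w * x1 + w * x2


def f_single(w, x1, x2):
    """w appears once: the same mapping after factoring out w."""
    return w * (x1 + x2)


# Usage with hypothetical inputs.
w = 0.5
x1 = np.array([1.0, -2.0, 3.0])
x2 = np.array([4.0, 0.0, -1.0])
equivalent = np.allclose(f_shared(w, x1, x2), f_single(w, x1, x2))
```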

Ambiguity of $E$ vs. $R$: Auxiliary denoising task in ladder networks (Rasmus et al., 2015) and similar autoencoder-style loss terms can be interpreted in different ways:

Regularization term $R$ without given auxiliary targets $t$

The ideal reconstructions can be considered as targets $t$ (if the definition of “targets” is slightly modified) and thus the denoising task becomes part of the error term $E$

Appendix B Data-augmented loss function

To understand the success of target-preserving data augmentation methods, we consider the data-augmented loss function, which we obtain by replacing the training samples $(x_{i},t_{i})\in\mathcal{D}$ in the empirical risk loss function (Eq. (3)) by augmented training samples $(\tau_{\theta}(x_{i}),t_{i})$ :

When $Q=P$, Eq. (11) becomes the expected risk (2). We can show how this is related to importance sampling:

The difference between $\mathcal{L}$ and $\hat{\mathcal{L}}_{A}$ is the re-weighting term $p(x,t)/q(x,t)$ identical to the one known from importance sampling (see Bishop, 1995a). The more similar $Q$ is to $P$ (i.e. the closer $Q$ models the ground truth distribution $P$), the more similar the augmented-data loss $\hat{\mathcal{L}}_{A}$ is to the expected loss $\mathcal{L}$. We see that data augmentation tries to simulate the real distribution $P$ by creating new samples from the training set $\mathcal{D}$, bridging the gap between the expected and the empirical risk.
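The relation to importance sampling described above can be sketched as a short derivation. It rests on assumptions that are stated here rather than taken from the text: $Q$ is taken to be the distribution of augmented samples with density $q(x,t)$, $P$ has density $p(x,t)$, $E$ denotes the error function, $q(x,t)>0$ wherever $p(x,t)>0$, and the augmented-data loss is written as an expectation over $Q$.

```latex
% Augmented-data loss, assumed to take the form of an expectation over Q:
\hat{\mathcal{L}}_{A}
  = \mathbb{E}_{(x,t)\sim Q}\big[ E(f_{w}(x),t) \big]
  = \int E(f_{w}(x),t)\, q(x,t)\, \mathrm{d}x\, \mathrm{d}t

% Expected risk, rewritten as an expectation over Q by multiplying and
% dividing by q(x,t) (the importance-sampling identity):
\mathcal{L}
  = \mathbb{E}_{(x,t)\sim P}\big[ E(f_{w}(x),t) \big]
  = \int E(f_{w}(x),t)\, \frac{p(x,t)}{q(x,t)}\, q(x,t)\, \mathrm{d}x\, \mathrm{d}t
  = \mathbb{E}_{(x,t)\sim Q}\!\left[ \frac{p(x,t)}{q(x,t)}\, E(f_{w}(x),t) \right]
```

Under these assumptions the two expressions differ only by the factor $p(x,t)/q(x,t)$ inside the expectation, and they coincide when $Q=P$.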