How to Start Training: The Effect of Initialization and Architecture
Boris Hanin, David Rolnick
Introduction
Despite the growing number of practical uses for deep learning, training deep neural networks remains a challenge. Among the many possible obstacles to training, it is natural to distinguish two kinds: problems that prevent a given neural network from ever achieving better-than-chance performance and problems that have to do with later stages of training, such as escaping flat regions and saddle points, reaching spurious local minima, and overfitting. This paper focuses specifically on two failure modes related to the first kind of difficulty:
(FM1) The mean length scale in the final layer increases/decreases exponentially with the depth.
(FM2) The empirical variance of length scales across layers grows exponentially with the depth.
Our main contributions and conclusions are:
The mean and variance of activations in a neural network are both important in determining whether training begins. If both failure modes FM1 and FM2 are avoided, then a deeper network need not take longer to start training than a shallower network.
FM1 is dependent on weight initialization. Initializing weights with the correct variance (in fully connected and convolutional networks) and correctly weighting residual modules (in residual networks) prevents the mean size of activations from becoming exponentially large or small as a function of the depth, allowing training to start for deeper architectures.
For fully connected and convolutional networks, FM2 is dependent on architecture. Wider layers prevent FM2, again allowing training to start for deeper architectures. In the case of constant-width networks, the width should grow approximately linearly with the depth to avoid FM2.
For residual networks, FM2 is largely independent of the architecture. Provided that residual modules are weighted to avoid FM1, FM2 can never occur. This qualitative difference between fully connected and residual networks can help to explain the empirical success of the latter, allowing deep and relatively narrow networks to be trained more readily.
FM1 for fully connected networks has been previously studied. Training may fail to start, in this failure mode, since the difference between network outputs may exceed machine precision even for moderate depth. For ReLU activations, FM1 has been observed to be overcome by the initializations of He et al. We prove this fact rigorously (see Theorems 5 and 6). We find empirically that for poor initializations, training fails more frequently as networks become deeper (see Figures 1 and 4).
Aside from , there appears to be less literature studying FM1 for residual networks (ResNets) . We prove that the key to avoiding FM1 in ResNets is to correctly rescale the contributions of individual residual modules (see Theorems 5 and 6). Without this, we find empirically that training fails for deeper ResNets (see Figure 2).
FM2 is more subtle and does not seem to have been widely studied (see for a notable exception). We find that FM2 indeed impedes early training (see Figure 3). Let us mention two possible explanations. First, if the variance between activations at different layers is very large, then some layers may have very large or small activations that exceed machine precision. Second, the backpropagated SGD update for a weight in a given layer includes a factor that corresponds to the size of activations at the previous layer. A very small update of the weight essentially keeps it at its randomly initialized value. A very large update, on the other hand, essentially re-randomizes the weight. Thus, we conjecture that FM2 causes the stochasticity of parameter updates to outweigh the effect of the training loss.
Our analysis of FM2 reveals an interesting difference between fully connected and residual networks. Namely, for fully connected and convolutional networks, FM2 is a function of architecture, rather than just of initialization, and can occur even if FM1 does not. For residual networks, we prove by contrast that FM2 never occurs once FM1 is avoided (see Corollary 2 and Theorem 6).
Related Work
Also related to this work is that of He et al. already mentioned above, as well as . The authors in the latter group show that information can be propagated in infinitely wide nets so long as weights are initialized independently according to an appropriately normalized distribution (see condition (ii) in Definition 1). One notable difference between this collection of papers and the present work is that we are concerned with a rigorous computation of finite width effects.
These finite size corrections were also studied by Schoenholz et al., who give exact formulas for the distribution of pre-activations in the case when the weights and biases are Gaussian. For more on the Gaussian case, we also point the reader to Giryes et al. The idea that controlling means and variances of activations at various hidden layers in a deep network can help with the start of training was previously considered in Klambauer et al. This work introduced the scaled exponential linear unit (SELU) activation, which is shown to cause the mean values of neuron activations to converge to 0 and the average squared length to converge to 1. A different approach to this kind of self-normalizing behavior was suggested in Wu et al. There, the authors suggest adding a linear hidden layer that has no learnable parameters but directly normalizes activations to have mean 0 and variance 1. Activation lengths can also be controlled by constraining weight matrices to be orthogonal or unitary (see e.g. ).
We would also like to point out a previous appearance in Hanin of the sum of reciprocals of layer widths, which we here show determines the variance of the sizes of the activations (see Theorem 5) in randomly initialized fully connected nets. The article studied the more delicate question of the variance for gradients computed by random nets. Finally, we point the reader to the discussion around Figure 6 in , which also finds that time to convergence is better for wider networks.
Results
In this section, we will (1) provide an intuitive motivation and explanation of our mathematical results, (2) verify empirically that our predictions hold, and (3) show by experiment the implications for training neural networks.
Consider a depth-$d$, fully connected ReLU net $\mathcal{N}$ with hidden layer widths $n_j$, $j = 0, \ldots, d$, and random weights and biases (see Definition 1 for the details of the initialization). As $\mathcal{N}$ propagates an input vector $\mathrm{act}^{(0)} \in \mathbb{R}^{n_0}$ from one layer to the next, the lengths of the resulting vectors of activations $\mathrm{act}^{(j)} \in \mathbb{R}^{n_j}$ change in some manner, eventually producing an output vector whose length is potentially very different from that of the input. These changes in length are summarized by
$$M_j := \frac{1}{n_j}\,\|\mathrm{act}^{(j)}\|^2, \qquad j = 0, \ldots, d,$$
where here and throughout the squared norm of a vector is the sum of the squares of its entries. We prove in Theorem 5 that the mean of the normalized output length $M_d$, which controls whether failure mode FM1 occurs, is determined by the variance of the distribution used to initialize weights. We emphasize that all our results hold for any fixed input, which need not be random; we average only over the weights and the biases. Thus, FM1 cannot be directly solved by batch normalization, which renormalizes by averaging over inputs to $\mathcal{N}$, rather than averaging over initializations for $\mathcal{N}$.
In Figure 1, we compare the effects of different initializations in networks with varying depth, where the width is equal to the depth (this is done to prevent FM2, see §3.3). Figure 1(a) shows that, as predicted, initializations for which the variance of weights is smaller than the critical value of 2/fan-in lead to a dramatic decrease in output length, while variance larger than this value causes the output length to explode. Figure 1(b) compares the ability of differently initialized networks to start training; it shows the average number of epochs required to achieve 20% test accuracy on MNIST. It is clear that those initializations which preserve output length are also those which allow for fast initial training; in fact, we see that it is faster to train a suitably initialized depth-100 network than it is to train a depth-10 network. Datapoints in (a) represent the statistics over random unit inputs for 1,000 independently initialized networks, while (b) shows the number of epochs required to achieve 20% accuracy on vectorized MNIST, averaged over 5 training runs with independent initializations, where networks were trained using stochastic gradient descent with a fixed learning rate of 0.01 and batch size of 1024, for up to 100 epochs. Note that changing the learning rate depending on depth could be used to compensate for FM1; choosing the right initialization is equivalent and much simpler.
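The qualitative behavior described for Figure 1(a) can be explored with a small simulation. The following is a minimal sketch, not the code used for the figure: it assumes Gaussian weights, zero biases, a constant width, and much smaller networks than in the experiments, and the function names `squared_length_ratio` and `mean_length_ratio` are ours.

```python
import math
import random


def squared_length_ratio(depth, width, var_scale, rng):
    """Return ||output||^2 / ||input||^2 for one random constant-width ReLU net.

    Weights are i.i.d. Gaussian with variance var_scale / fan_in and biases
    are zero (a simplifying assumption of this sketch).
    """
    act = [1.0 / math.sqrt(width)] * width  # fixed input of unit length
    std = math.sqrt(var_scale / width)      # fan-in equals width here
    for _ in range(depth):
        act = [
            max(0.0, sum(rng.gauss(0.0, std) * a for a in act))
            for _ in range(width)
        ]
    return sum(a * a for a in act)


def mean_length_ratio(depth, width, var_scale, trials=200, seed=0):
    """Average the squared length ratio over independent initializations."""
    rng = random.Random(seed)
    total = sum(
        squared_length_ratio(depth, width, var_scale, rng) for _ in range(trials)
    )
    return total / trials


if __name__ == "__main__":
    # var_scale = 2.0 is the critical value 2/fan-in discussed above.
    for var_scale in (1.0, 2.0, 3.0):
        ratio = mean_length_ratio(depth=10, width=20, var_scale=var_scale)
        print(f"weight variance {var_scale}/fan-in: mean ratio {ratio:.4g}")
```

With weight variance c/fan-in and zero biases, the expected ratio is (c/2) raised to the depth, so one would expect the printed averages to shrink for c = 1, stay of order one for c = 2, and grow for c = 3, up to sampling noise.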
It is worth noting that the factor of 2 in our optimal variance 2/fan-in arises from the ReLU, which zeros out symmetrically distributed input with probability 1/2, thereby effectively halving the variance at each layer. (For linear activations, the factor of 2 would disappear.) The initializations described above may preserve output lengths for activation functions other than ReLU. However, ReLU is one of the most common activation functions for feed-forward networks and various initializations are commonly used blindly with ReLUs without recognizing the effect upon ease of training. An interesting systematic approach to predicting the correct multiplicative constant in the variance of weights as a function of the non-linearity is proposed in, e.g., Poole et al. (see the definition around (7) there). For non-linearities other than ReLU, however, this constant seems difficult to compute directly.
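The halving effect of the ReLU can be made explicit with a one-layer computation. The following is a sketch under the simplifying assumption of zero biases; the notation is ours: $\mathrm{act}^{(j)}$ is the vector of activations at layer $j$, $n_j$ its width, $\sigma_j^2$ the variance of the symmetric, independent weights feeding layer $j$, and $M_j$ the normalized squared length.

```latex
% Assumptions of this sketch: zero biases, independent symmetric weights
% with variance sigma_j^2, and M_j := ||act^{(j)}||^2 / n_j.
\begin{align*}
\mathbb{E}\big[(\mathrm{preact}^{(j)}_\beta)^2 \,\big|\, \mathrm{act}^{(j-1)}\big]
  &= \sigma_j^2\,\|\mathrm{act}^{(j-1)}\|^2, \\
\mathbb{E}\big[\mathrm{ReLU}(\mathrm{preact}^{(j)}_\beta)^2 \,\big|\, \mathrm{act}^{(j-1)}\big]
  &= \tfrac{1}{2}\,\sigma_j^2\,\|\mathrm{act}^{(j-1)}\|^2
  && \text{(the pre-activation is symmetric around } 0\text{)}, \\
\mathbb{E}\big[M_j \,\big|\, \mathrm{act}^{(j-1)}\big]
  &= \frac{\sigma_j^2\, n_{j-1}}{2}\, M_{j-1}.
\end{align*}
```

Under these assumptions the mean of $M_j$ is preserved from layer to layer exactly when $\sigma_j^2 = 2/n_{j-1}$, that is, 2/fan-in; without the ReLU the factor $\tfrac{1}{2}$ is absent and the corresponding value is 1/fan-in.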
3.2 Avoiding FM1 for Residual Networks: Weights of Residual Modules
3.3 FM2 for Fully Connected Networks: The Effect of Architecture
In the notation of §3.1, failure mode FM2 is characterized by a large expected value for
$$\widehat{\mathrm{Var}}[M] := \frac{1}{d}\sum_{j=1}^{d} M_j^2 - \Big(\frac{1}{d}\sum_{j=1}^{d} M_j\Big)^2,$$
the empirical variance of the normalized squared lengths of activations $M_j$ among all the hidden layers in the network. Our main theoretical result about FM2 for fully connected networks is that the mean of this empirical variance is exponential in $\sum_j 1/n_j$, the sum of the reciprocals of the hidden layer widths $n_j$.
For a formal statement see Theorem 5. It is well known that deep but narrow networks are hard to train, and this result provides theoretical justification, since for such nets $\sum_j 1/n_j$ is large. More than that, this sum of reciprocals gives a definite way to quantify the effect of “deep but narrow” architectures on the volatility of the scale of activations at various layers within the network. We note that this result also implies that for a given depth and fixed budget of neurons or parameters, constant width is optimal, since by the Power Mean Inequality, $\sum_j 1/n_j$ is minimized for all $n_j$ equal if $\sum_j n_j$ (number of neurons) or $\sum_j n_j^2$ (approximate number of parameters) is held fixed.
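The claim that constant width minimizes the sum of reciprocals under a fixed neuron budget is easy to check numerically. The sketch below is illustrative only: the three architectures are hypothetical choices of ours, each spending 120 neurons on 6 hidden layers, and `sum_reciprocal_widths` is our own helper name.

```python
def sum_reciprocal_widths(widths):
    """Sum of 1/n_j over the hidden layer widths n_j."""
    return sum(1.0 / n for n in widths)


# Hypothetical architectures with the same depth (6) and neuron budget (120).
architectures = {
    "constant": [20, 20, 20, 20, 20, 20],
    "tapered": [35, 30, 25, 15, 10, 5],
    "bottleneck": [30, 30, 5, 5, 30, 20],
}

if __name__ == "__main__":
    for name, widths in architectures.items():
        total = sum_reciprocal_widths(widths)
        print(f"{name:>10}: neurons={sum(widths)}, sum of 1/n_j={total:.4f}")
```

Among these three, the constant-width network has the smallest sum, in line with the Power Mean Inequality argument; the narrow layers dominate the sum in the other two.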
We experimentally verify that the sum of reciprocals of layer widths (FM2) is an astonishingly good predictor of the speed of early training. Figure 3(a) compares training performance on MNIST for fully connected networks of five types:
The first half of the layers of width 30, then the second half of width 10,
The first half of the layers of width 10, then the second half of width 30,
Note that types (i)-(iii) have the same layer widths, but differently permuted. As predicted by our theory, the order of layer widths does not affect FM2. We emphasize that type (iv) networks have, for each depth, the same sum of reciprocals of layer widths as types (i)-(iii). As predicted, early training dynamics for networks of type (i)-(iv) were similar for each fixed depth. By contrast, networks of type (v), which had a lower sum of reciprocals of layer widths, trained faster for every depth than the corresponding networks of types (i)-(iv).
In all cases, training becomes harder with greater depth, since the sum of reciprocals of layer widths increases with depth for constant-width networks. In Figure 3(b), we plot the same data with this sum on the $x$-axis, showing this quantity’s power in predicting the effectiveness of early training, irrespective of the particular details of the network architecture in question.
Each datapoint is averaged over 100 independently initialized training runs, with training parameters as in Figure 1. All networks are initialized with He normal weights to prevent FM1.
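The link between narrow layers and volatile activation lengths can also be probed directly at initialization, without any training. The following is a minimal sketch under our own assumptions (Gaussian weights of variance 2/fan-in, zero biases, constant width, small networks); it estimates the mean empirical variance across layers of the normalized squared activation lengths and is not the experiment behind Figure 3. The function names are ours.

```python
import math
import random


def empirical_variance_of_lengths(depth, width, rng):
    """Empirical variance across layers of M_j = ||act^(j)||^2 / width.

    Uses one ReLU net of constant width with Gaussian weights of variance
    2 / fan_in and zero biases (assumptions of this sketch).
    """
    act = [1.0] * width                 # input normalized so that M_0 = 1
    std = math.sqrt(2.0 / width)
    lengths = []
    for _ in range(depth):
        act = [
            max(0.0, sum(rng.gauss(0.0, std) * a for a in act))
            for _ in range(width)
        ]
        lengths.append(sum(a * a for a in act) / width)
    mean = sum(lengths) / depth
    return sum(m * m for m in lengths) / depth - mean * mean


def mean_empirical_variance(depth, width, trials=100, seed=0):
    """Average the empirical variance over independent initializations."""
    rng = random.Random(seed)
    total = sum(
        empirical_variance_of_lengths(depth, width, rng) for _ in range(trials)
    )
    return total / trials


if __name__ == "__main__":
    for width in (5, 50):
        value = mean_empirical_variance(depth=10, width=width)
        print(f"width {width:>2}: sum of 1/n_j = {10 / width:.2f}, "
              f"mean empirical variance = {value:.3g}")
```

Both networks have the same depth and the same critical weight variance, so any difference in the printed averages comes from the width alone. The estimate for the narrow network is heavy-tailed, so its value can change noticeably with the seed and the number of trials.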
3.4 FM2 for Residual Networks
In the notation of §3.2, failure mode FM2 is equivalent to a large expected value for the empirical variance of the normalized squared lengths of activations among the residual modules in the network. Our main theoretical result about FM2 for ResNets (see Theorem 6 for the precise statement) is that, once the residual modules are weighted so as to avoid FM1, FM2 does not occur.
3.5 Convolutional Architectures
Our above results were stated for fully connected networks, but the logic of our proofs carries over to other architectures. In particular, similar statements hold for convolutional neural networks (ConvNets). Note that the fan-in for a convolutional layer is not given by the width of the preceding layer, but instead is equal to the number of features multiplied by the kernel size.
In Figure 4, we show that the output length behavior we observed in fully connected networks also holds in ConvNets. Namely, mean output length equals input length for weights drawn i.i.d. from a symmetric distribution of variance 2/fan-in, while other variances lead to exploding or vanishing output lengths as the depth increases. In our experiments, networks were purely convolutional, with no pooling or fully connected layers. By analogy to Figure 1, the fan-in was set to approximately the depth of the network by fixing kernel size and setting the number of features at each layer to one tenth of the network’s total depth. For each datapoint, the network was allowed to vary over 1,000 independent initializations, with input a fixed image from the dataset CIFAR-10.
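The fan-in of a convolutional layer, and the corresponding critical weight scale, can be computed as follows. This is a small sketch with helper names of our own (`conv_fan_in`, `critical_weight_std`); it reads "number of features" as the number of input channels of the layer, which is an assumption on our part, and the example layer sizes are hypothetical.

```python
import math


def conv_fan_in(in_channels, kernel_size):
    """Fan-in of a conv layer: input channels times entries per kernel."""
    return in_channels * math.prod(kernel_size)


def critical_weight_std(fan_in):
    """Standard deviation giving weight variance 2 / fan_in."""
    return math.sqrt(2.0 / fan_in)


if __name__ == "__main__":
    # Hypothetical layer: 16 input channels and a 3x3 kernel.
    fan_in = conv_fan_in(in_channels=16, kernel_size=(3, 3))
    print("fan-in:", fan_in)
    print("critical weight std:", critical_weight_std(fan_in))
```

The width of the preceding layer plays no direct role here; only the number of input channels and the kernel dimensions enter the fan-in.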
Notation
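To state our results formally, we first give the precise definition of the networks we study and introduce some notation. For every $d \ge 1$ and every tuple $\mathbf{n} = (n_0, \ldots, n_d)$ of positive integers, we define
$$\mathfrak{N}(\mathbf{n}, d) = \{\text{fully connected feed-forward nets with ReLU activations, depth } d, \text{ and layer widths } n_0, \ldots, n_d\}.$$
Note that $n_0$ is the dimension of the input. Given $\mathcal{N} \in \mathfrak{N}(\mathbf{n}, d)$, the function it computes is determined by its weights and biases
$$\{w^{(j)}_{\alpha,\beta},\ b^{(j)}_{\beta} \;:\; 1 \le \alpha \le n_{j-1},\ 1 \le \beta \le n_j,\ 1 \le j \le d\}.$$
For every input $\mathrm{act}^{(0)} \in \mathbb{R}^{n_0}$ to $\mathcal{N}$ we write, for all $j = 1, \ldots, d$,
$$\mathrm{preact}^{(j)}_\beta = b^{(j)}_\beta + \sum_{\alpha=1}^{n_{j-1}} \mathrm{act}^{(j-1)}_\alpha\, w^{(j)}_{\alpha,\beta}, \qquad \mathrm{act}^{(j)}_\beta = \mathrm{ReLU}\big(\mathrm{preact}^{(j)}_\beta\big), \qquad 1 \le \beta \le n_j.$$
The vectors $\mathrm{preact}^{(j)}, \mathrm{act}^{(j)}$ are thus the inputs and outputs of nonlinearities in the $j$-th layer of $\mathcal{N}$.
Definition 1. Fix positive integers $n_0, \ldots, n_d$ and two collections of probability measures $\mu = (\mu^{(1)}, \ldots, \mu^{(d)})$ and $\nu = (\nu^{(1)}, \ldots, \nu^{(d)})$ on $\mathbb{R}$ such that $\mu^{(j)}, \nu^{(j)}$ are symmetric around $0$ for every $1 \le j \le d$, and such that the variance of $\mu^{(j)}$ is $2/n_{j-1}$.
A random network is obtained by requiring that the weights and biases for neurons at layer $j$ are drawn independently from $\mu^{(j)}, \nu^{(j)}$, respectively.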
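The definition only asks for symmetric weight distributions with the right variance, not for Gaussian weights. The sketch below illustrates this with uniform weights on an interval whose variance equals 2/fan-in, and uniform biases on a symmetric interval. It is one possible instance of such a random network, written with our own function names (`sample_random_net`, `forward`) and with a ReLU applied at every layer, including the last, which is an assumption of this sketch.

```python
import math
import random


def sample_random_net(widths, bias_half_width=0.0, seed=0):
    """Sample weights and biases for a fully connected net with given widths.

    Weights are uniform on [-a, a] with a = sqrt(6 / fan_in), so that their
    variance a^2 / 3 equals 2 / fan_in. Biases are uniform on a symmetric
    interval (zero by default). Both choices are illustrative assumptions.
    """
    rng = random.Random(seed)
    layers = []
    for fan_in, fan_out in zip(widths[:-1], widths[1:]):
        a = math.sqrt(6.0 / fan_in)
        weights = [
            [rng.uniform(-a, a) for _ in range(fan_in)] for _ in range(fan_out)
        ]
        biases = [
            rng.uniform(-bias_half_width, bias_half_width) for _ in range(fan_out)
        ]
        layers.append((weights, biases))
    return layers


def forward(layers, x):
    """Return the list of activation vectors, starting with the input."""
    acts = [list(x)]
    for weights, biases in layers:
        prev = acts[-1]
        pre = [
            b + sum(w * a for w, a in zip(row, prev))
            for row, b in zip(weights, biases)
        ]
        acts.append([max(0.0, p) for p in pre])  # ReLU nonlinearity
    return acts


if __name__ == "__main__":
    net = sample_random_net([4, 50, 50, 3], seed=1)
    for j, act in enumerate(forward(net, [0.5, -0.5, 0.5, -0.5])):
        print(f"layer {j}: normalized squared length "
              f"{sum(a * a for a in act) / len(act):.4f}")
```

Each entry of the list returned by `forward` is the activation vector of one layer, so the normalized squared lengths can be read off layer by layer.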
Formal statements
We begin by stating our results about fully connected networks. Given a random network and an input to we write as in §3.1, for the normalized square length of activations at layer Our first theoretical result, Theorem 5, concerns both the mean and variance of To state it, we denote for any probability measure on its moments by
For each fix For each let be a fully connected net with depth , hidden layer widths as well as random weights and biases as in Definition 1. Fix also an input to with We have almost surely
Moreover, if (4) holds, then exists a random variable (that is almost surely finite) such that as pointwise almost surely. Further, suppose for all and that Then
where the following finite constants:
If and and is Gaussian for all , then
In particular, is exponential in and if then the convergence of to is in and
The proof of Theorem 5 is deferred to the Supplementary Material. Although we state our results only for fully connected feed-forward nets, the proof techniques carry over essentially verbatim to any feed-forward network in which only weights in the same hidden layer are tied. In particular, our results apply to convolutional networks in which the kernel sizes are uniformly bounded. In this case, the constants in Theorem 5 depend on the bound for the kernel dimensions, and denotes the fan-in for neurons in the hidden layer (i.e. the number of channels in layer multiplied by the size of the appropriate kernel). We also point out the following corollary, which follows immediately from the proof of Theorem 5.
With notation as in Theorem 5, suppose that for all the weights in layer of have variance for some Then the average squared size of activations at layer will grow or decay exponentially unless :
Our final result about fully connected networks is a corollary of Theorem 5, which explains precisely when failure mode FM2 occurs (see §3.3). It is proved in the Supplementary Material.
Take the same notation as in Theorem 5. There exist so that
Finally, our main result about residual networks is the following:
Conclusion
In this article, we give a rigorous analysis of the layerwise length scales in fully connected, convolutional, and residual networks at initialization. We find that a careful choice of initial weights is needed for well-behaved mean length scales. For fully connected and convolutional networks, this entails a critical variance for i.i.d. weights, while for residual nets this entails appropriately rescaling the residual modules. For fully connected nets, we prove that to control not merely the mean but also the variance of layerwise length scales requires choosing a sufficiently wide architecture, while for residual nets nothing further is required. We also demonstrate empirically that both the mean and variance of length scales are strong predictors of early training dynamics. In the future, we plan to extend our analysis to other (e.g. sigmoidal) activations, recurrent networks, weight initializations beyond i.i.d. (e.g. orthogonal weights), and the joint distributions of activations over several inputs.
Appendix A Proof of Theorem 5
Let us first verify that is a submartingale for the filtration with being the sigma algebra generated by all weights and biases up to and including layer (for background on sigma algebras and martingales we refer the reader to Chapters 2 and 37 in ). Since is a fixed non-random vector, it is clear that is measurable with respect to We have
where we can replace the sigma algebra by the sigma algebra generated by since the computation done by a feed-forward neural net is a Markov chain with respect to activations at consecutive layers (for background see Chapter 8 in ). Next, recall that by assumption the weights and biases are symmetric in law around Note that for each changing the signs of all the weights and biases causes to change sign. Hence, we find
Symmetrizing the expression in (9), we obtain
where in the second equality we used that the weights and biases are independent of with mean and in the last equality that The above computation also yields that for each
It also shows that is a martingale. Taking the limit in (11) proves (4). Next, assuming condition (4), we find that
which is finite. Hence, we may apply Doob’s pointwise martingale convergence theorem (see Chapter 35 in ) to conclude that the limit
exists and is finite almost surely. Indeed, Doob’s result states that if our martingale is bounded in $L^1$ uniformly in $d$, then, almost surely, it has a finite pointwise limit as $d \to \infty$. To show (5) we will need the following result.
and, conditioned on the random variables are i.i.d. Hence,
We apply the same symmetrization trick as in the derivation of (10) to obtain
which after using that the odd moments of and vanish becomes
where we recall that Putting together the preceding computations and using that
Recall that the excess kurtosis of is bounded below by for any probability measure (see Chapter 4 in ) and observe that Therefore, using that we obtain
To conclude the proof of Theorem 5, we write
and combine Lemma 1 with the expression (11) to obtain with as in Lemma 1
Taking expectations of both sides in the inequalities above yields with as in Lemma 1
Iterating the lower bound in this inequality yields the lower bound in (5). Similarly, using that , we iterate the upper bound to obtain
Using the above estimate for , gives the upper bound in (5) and completes the proof of Theorem 5.
Appendix B Proof of Corollary 2
Fix a fully connected net $\mathcal{N}$ with depth $d$ and hidden layer widths $n_0, \ldots, n_d$. We fix an input to $\mathcal{N}$ and study the empirical variance of the squared sizes of activations. Since the biases in $\mathcal{N}$ are zero, the squared activations are a martingale (see (10)) and we find
To see that this sum is exponential in as in (7), let us consider the special case of equal widths . Then, writing
This proves the lower bounds in (6) and (7). The upper bounds are similar.
Appendix C Proof of Theorem 6
To understand the sizes of activations produced by , we need the following Lemma.
Let be a feed-forward, fully connected net with depth and hidden layer widths having random weights as in Definition 1 and biases set to . Then for each we have
where , and we have used the fact that (see (10)) as well as the positive homogeneity of nets with zero biases:
Let us also write for the input to , similarly set for the activations at layer . We denote by the row of the weights at layer in . We have:
Therefore, using that is a supermartingale (since its square is a martingale by (10)):
Combining this with (13) completes the proof. ∎
The Lemma implies part (i) of the Theorem as follows: