How to Start Training: The Effect of Initialization and Architecture

Boris Hanin, David Rolnick

Introduction

Despite the growing number of practical uses for deep learning, training deep neural networks remains a challenge. Among the many possible obstacles to training, it is natural to distinguish two kinds: problems that prevent a given neural network from ever achieving better-than-chance performance and problems that have to do with later stages of training, such as escaping flat regions and saddle points, reaching spurious local minima, and overfitting. This paper focuses specifically on two failure modes related to the first kind of difficulty:

FM1: The mean length scale in the final layer increases/decreases exponentially with the depth.

FM2: The empirical variance of length scales across layers grows exponentially with the depth.

Our main contributions and conclusions are:

The mean and variance of activations in a neural network are both important in determining whether training begins. If both failure modes FM1 and FM2 are avoided, then a deeper network need not take longer to start training than a shallower network.

FM1 is dependent on weight initialization. Initializing weights with the correct variance (in fully connected and convolutional networks) and correctly weighting residual modules (in residual networks) prevents the mean size of activations from becoming exponentially large or small as a function of the depth, allowing training to start for deeper architectures.

For fully connected and convolutional networks, FM2 is dependent on architecture. Wider layers prevent FM2, again allowing training to start for deeper architectures. In the case of constant-width networks, the width should grow approximately linearly with the depth to avoid FM2.

For residual networks, FM2 is largely independent of the architecture. Provided that residual modules are weighted to avoid FM1, FM2 can never occur. This qualitative difference between fully connected and residual networks can help to explain the empirical success of the latter, allowing deep and relatively narrow networks to be trained more readily.

FM1 for fully connected networks has been previously studied. Training may fail to start, in this failure mode, since the difference between network outputs may exceed machine precision even for moderate $d$. For $\operatorname{ReLU}$ activations, FM1 has been observed to be overcome by initializations of He et al. We prove this fact rigorously (see Theorems 5 and 6). We find empirically that for poor initializations, training fails more frequently as networks become deeper (see Figures 1 and 4).

Aside from , there appears to be less literature studying FM1 for residual networks (ResNets). We prove that the key to avoiding FM1 in ResNets is to correctly rescale the contributions of individual residual modules (see Theorems 5 and 6). Without this, we find empirically that training fails for deeper ResNets (see Figure 2).

FM2 is more subtle and does not seem to have been widely studied (see for a notable exception). We find that FM2 indeed impedes early training (see Figure 3). Let us mention two possible explanations. First, if the variance between activations at different layers is very large, then some layers may have very large or small activations that exceed machine precision. Second, the backpropagated SGD update for a weight $W$ in a given layer includes a factor that corresponds to the size of activations at the previous layer. A very small update of $W$ essentially keeps it at its randomly initialized value. A very large update, on the other hand, essentially re-randomizes $W$. Thus, we conjecture that FM2 causes the stochasticity of parameter updates to outweigh the effect of the training loss.

Our analysis of FM2 reveals an interesting difference between fully connected and residual networks. Namely, for fully connected and convolutional networks, FM2 is a function of architecture, rather than just of initialization, and can occur even if FM1 does not. For residual networks, we prove by contrast that FM2 never occurs once FM1 is avoided (see Corollary 2 and Theorem 6).

Related Work

Also related to this work is that of He et al. already mentioned above, as well as . The authors in the latter group show that information can be propagated in infinitely wide $\operatorname{ReLU}$ nets so long as weights are initialized independently according to an appropriately normalized distribution (see condition (ii) in Definition 1). One notable difference between this collection of papers and the present work is that we are concerned with a rigorous computation of finite width effects.

These finite size corrections were also studied by Schoenholz et al., which gives exact formulas for the distribution of pre-activations in the case when the weights and biases are Gaussian. For more on the Gaussian case, we also point the reader to Giryes et al. The idea that controlling means and variances of activations at various hidden layers in a deep network can help with the start of training was previously considered in Klambauer et al. This work introduced the scaled exponential linear unit (SELU) activation, which is shown to cause the mean values of neuron activations to converge to $0$ and the average squared length to converge to $1.$ A different approach to this kind of self-normalizing behavior was suggested in Wu et al. There, the authors suggest adding a linear hidden layer that has no learnable parameters but directly normalizes activations to have mean $0$ and variance $1.$ Activation lengths can also be controlled by constraining weight matrices to be orthogonal or unitary (see e.g. ).

We would also like to point out a previous appearance in Hanin of the sum of reciprocals of layer widths, which we here show determines the variance of the sizes of the activations (see Theorem 5) in randomly initialized fully connected $\operatorname{ReLU}$ nets. The article studied the more delicate question of the variance for gradients computed by random $\operatorname{ReLU}$ nets. Finally, we point the reader to the discussion around Figure 6 in , which also finds that time to convergence is better for wider networks.

Results

In this section, we will (1) provide an intuitive motivation and explanation of our mathematical results, (2) verify empirically that our predictions hold, and (3) show by experiment the implications for training neural networks.

Consider a depth-$d$, fully connected $\operatorname{ReLU}$ net $\mathcal{N}$ with hidden layer widths $n_{j},\,j=0,\ldots,d,$ and random weights and biases (see Definition 1 for the details of the initialization). As $\mathcal{N}$ propagates an input vector $\operatorname{act}^{(0)}\in{\bf R}^{n_{0}}$ from one layer to the next, the lengths of the resulting vectors of activations $\operatorname{act}^{(j)}\in{\bf R}^{n_{j}}$ change in some manner, eventually producing an output vector whose length is potentially very different from that of the input. These changes in length are summarized by

$$M_{j}:=\frac{1}{n_{j}}||\operatorname{act}^{(j)}||^{2},\qquad j=0,\ldots,d,$$

where here and throughout the squared norm of a vector is the sum of the squares of its entries. We prove in Theorem 5 that the mean of the normalized output length $M_{d}$, which controls whether failure mode FM1 occurs, is determined by the variance of the distribution used to initialize weights. We emphasize that all our results hold for any fixed input, which need not be random; we average only over the weights and the biases. Thus, FM1 cannot be directly solved by batch normalization, which renormalizes by averaging over inputs to $\mathcal{N}$, rather than averaging over initializations for $\mathcal{N}$.

In Figure 1, we compare the effects of different initializations in networks with varying depth, where the width is equal to the depth (this is done to prevent FM2, see §3.3). Figure 1(a) shows that, as predicted, initializations for which the variance of weights is smaller than the critical value of $2/\text{fan-in}$ lead to a dramatic decrease in output length, while variance larger than this value causes the output length to explode. Figure 1(b) compares the ability of differently initialized networks to start training; it shows the average number of epochs required to achieve 20% test accuracy on MNIST. It is clear that those initializations which preserve output length are also those which allow for fast initial training; in fact, we see that it is faster to train a suitably initialized depth-100 network than it is to train a depth-10 network. Datapoints in (a) represent the statistics over random unit inputs for 1,000 independently initialized networks, while (b) shows the number of epochs required to achieve 20% accuracy on vectorized MNIST, averaged over 5 training runs with independent initializations, where networks were trained using stochastic gradient descent with a fixed learning rate of 0.01 and batch size of 1024, for up to 100 epochs. Note that changing the learning rate depending on depth could be used to compensate for FM1; choosing the right initialization is equivalent and much simpler.

It is worth noting that the $2$ in our optimal variance $2/\text{fan-in}$ arises from the ReLU, which zeros out symmetrically distributed input with probability $1/2$, thereby effectively halving the variance at each layer. (For linear activations, the $2$ would disappear.) The initializations described above may preserve output lengths for activation functions other than ReLU. However, ReLU is one of the most common activation functions for feed-forward networks, and various initializations are commonly used blindly with ReLUs without recognizing the effect upon ease of training. An interesting systematic approach to predicting the correct multiplicative constant in the variance of weights as a function of the non-linearity is proposed in Poole et al. (see, e.g., the definition of $\chi_{1}$ around (7) there). For non-linearities other than $\operatorname{ReLU}$, however, this constant seems difficult to compute directly.
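The exponential dependence of the mean output length on depth can be checked numerically. The following is a minimal sketch, not the experimental code behind Figure 1: it assumes Gaussian weights with variance $\kappa\cdot 2/\text{fan-in}$, zero biases, and a fixed all-ones input (so that the normalized input length is $1$). The function names and the parameter `kappa` are introduced here for illustration only.

```python
import math
import random


def relu_net_output_length(widths, kappa, rng):
    """Return ||act^(d)||^2 / n_d for one randomly initialized ReLU net.

    widths = [n_0, ..., n_d]. Weights into a layer with fan-in n are drawn
    from N(0, kappa * 2 / n); biases are zero (an assumption of this sketch).
    The input is the all-ones vector, whose normalized squared length is 1.
    """
    act = [1.0] * widths[0]
    for fan_in, fan_out in zip(widths[:-1], widths[1:]):
        sigma = math.sqrt(kappa * 2.0 / fan_in)
        # One neuron per outer iteration: ReLU of a random linear combination.
        act = [
            max(0.0, sum(rng.gauss(0.0, sigma) * a for a in act))
            for _ in range(fan_out)
        ]
    return sum(a * a for a in act) / widths[-1]


def mean_output_length(widths, kappa, trials, seed=0):
    """Average the normalized output length over independent initializations."""
    rng = random.Random(seed)
    total = sum(relu_net_output_length(widths, kappa, rng) for _ in range(trials))
    return total / trials
```

Calling `mean_output_length([20] * 7, kappa, trials=200)` for `kappa` equal to `0.5`, `1.0`, and `2.0` should show vanishing, roughly preserved, and exploding output lengths respectively. With a shared seed the three runs use the same underlying Gaussian draws, so by positive homogeneity their ratios are $\kappa^{d}$ up to rounding.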

3.2 Avoiding FM1 for Residual Networks: Weights of Residual Modules

3.3 FM2 for Fully Connected Networks: The Effect of Architecture

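In the notation of §3.1, failure mode FM2 is characterized by a large expected value for

$$\widehat{\operatorname{Var}}[M]:=\frac{1}{d}\sum_{j=1}^{d}M_{j}^{2}-\left(\frac{1}{d}\sum_{j=1}^{d}M_{j}\right)^{2},$$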

the empirical variance of the normalized squared lengths of activations among all the hidden layers in $\mathcal{N}.$ Our main theoretical result about FM2 for fully connected networks is the following.

For a formal statement see Theorem 5. It is well known that deep but narrow networks are hard to train, and this result provides theoretical justification, since for such nets $\sum 1/n_{j}$ is large. More than that, this sum of reciprocals gives a definite way to quantify the effect of “deep but narrow” architectures on the volatility of the scale of activations at various layers within the network. We note that this result also implies that for a given depth and fixed budget of neurons or parameters, constant width is optimal, since by the Power Mean Inequality, $\sum_{j}1/n_{j}$ is minimized when all $n_{j}$ are equal if $\sum n_{j}$ (number of neurons) or $\sum n_{j}^{2}$ (approximate number of parameters) is held fixed.
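The optimality of constant width can be verified in two lines. The following is a sketch using only the Cauchy-Schwarz and power mean inequalities; the symbols $N=\sum_{j}n_{j}$ and $P=\sum_{j}n_{j}^{2}$ are introduced here as names for the two budgets, and all sums run over the same $d$ layers.

$$d^{2}=\left(\sum_{j=1}^{d}\sqrt{n_{j}}\cdot\frac{1}{\sqrt{n_{j}}}\right)^{2}\leq\left(\sum_{j=1}^{d}n_{j}\right)\left(\sum_{j=1}^{d}\frac{1}{n_{j}}\right)\quad\Longrightarrow\quad\sum_{j=1}^{d}\frac{1}{n_{j}}\geq\frac{d^{2}}{N},$$

with equality exactly when all $n_{j}$ are equal. If instead $P$ is held fixed, the power mean inequality gives $N\leq\sqrt{dP}$, so

$$\sum_{j=1}^{d}\frac{1}{n_{j}}\geq\frac{d^{2}}{N}\geq\frac{d^{3/2}}{\sqrt{P}},$$

again with equality exactly when all $n_{j}$ are equal.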

We experimentally verify that the sum of reciprocals of layer widths (FM2) is an astonishingly good predictor of the speed of early training. Figure 3(a) compares training performance on MNIST for fully connected networks of five types:

The first half of the layers of width 30, then the second half of width 10,

The first half of the layers of width 10, then the second half of width 30,

Note that types (i)-(iii) have the same layer widths, but differently permuted. As predicted by our theory, the order of layer widths does not affect FM2. We emphasize that type (iv) networks have, for each depth, the same sum of reciprocals of layer widths as types (i)-(iii). As predicted, early training dynamics for networks of type (i)-(iv) were similar for each fixed depth. By contrast, networks of type (v), which had a lower sum of reciprocals of layer widths, trained faster for every depth than the corresponding networks of types (i)-(iv).
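The permutation invariance of the sum of reciprocals is easy to confirm with exact arithmetic. The sketch below is illustrative only: the depth of 10 and the two constant-width comparison architectures (widths 15 and 20) are assumptions chosen here, the first because $\frac{1}{2}\left(\frac{1}{30}+\frac{1}{10}\right)=\frac{1}{15}$, and are not claimed to be the architectures used in Figure 3.

```python
from fractions import Fraction


def sum_reciprocal_widths(widths):
    """Exact value of sum_j 1/n_j for a list of layer widths."""
    return sum(Fraction(1, n) for n in widths)


depth = 10  # illustrative depth, assumed here
half = depth // 2

wide_then_narrow = [30] * half + [10] * half
narrow_then_wide = [10] * half + [30] * half
constant_15 = [15] * depth  # hypothetical comparison architecture
constant_20 = [20] * depth  # same neuron budget as the mixed 30/10 networks

sums = {
    "wide_then_narrow": sum_reciprocal_widths(wide_then_narrow),
    "narrow_then_wide": sum_reciprocal_widths(narrow_then_wide),
    "constant_15": sum_reciprocal_widths(constant_15),
    "constant_20": sum_reciprocal_widths(constant_20),
}
```

The first three entries of `sums` coincide, while the constant-width-20 network, which uses the same total number of neurons as the mixed 30/10 networks, has a strictly smaller sum.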

In all cases, training becomes harder with greater depth, since $\sum 1/n_{j}$ increases with depth for constant-width networks. In Figure 3(b), we plot the same data with $\sum 1/n_{j}$ on the $x$-axis, showing this quantity’s power in predicting the effectiveness of early training, irrespective of the particular details of the network architecture in question.

Each datapoint is averaged over 100 independently initialized training runs, with training parameters as in Figure 1. All networks are initialized with He normal weights to prevent FM1.

3.4 FM2 for Residual Networks

In the notation of §3.2, failure mode FM2 is equivalent to a large expected value for the empirical variance

of the normalized squared lengths of activations among the residual modules in $\mathcal{N}.$ Our main theoretical result about FM2 for ResNets is the following (see Theorem 5 for the precise statement).

3.5 Convolutional Architectures

Our above results were stated for fully connected networks, but the logic of our proofs carries over to other architectures. In particular, similar statements hold for convolutional neural networks (ConvNets). Note that the fan-in for a convolutional layer is not given by the width of the preceding layer, but instead is equal to the number of features multiplied by the kernel size.

In Figure 4, we show that the output length behavior we observed in fully connected networks also holds in ConvNets. Namely, mean output length equals input length for weights drawn i.i.d. from a symmetric distribution of variance $2/\text{fan-in}$, while other variances lead to exploding or vanishing output lengths as the depth increases. In our experiments, networks were purely convolutional, with no pooling or fully connected layers. By analogy to Figure 1, the fan-in was set to approximately the depth of the network by fixing kernel size $3\times 3$ and setting the number of features at each layer to one tenth of the network’s total depth. For each datapoint, the network was allowed to vary over 1,000 independent initializations, with input a fixed image from the dataset CIFAR-10.
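The fan-in bookkeeping for this convolutional setup can be sketched as follows. The helper names are hypothetical and introduced only for illustration; the depth of 100 is an example value, not a statement about which depths appear in Figure 4.

```python
import math


def conv_fan_in(in_features, kernel_height, kernel_width):
    """Fan-in of a convolutional neuron: input features times kernel size."""
    return in_features * kernel_height * kernel_width


def critical_std(fan_in):
    """Standard deviation corresponding to the critical variance 2 / fan-in."""
    return math.sqrt(2.0 / fan_in)


depth = 100                  # example depth, assumed here
features = depth // 10       # one tenth of the depth, as in the text
fan_in = conv_fan_in(features, 3, 3)
weight_std = critical_std(fan_in)
```

With these values the fan-in is 90, which is approximately the depth, mirroring the width-equals-depth choice made for fully connected networks.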

Notation

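To state our results formally, we first give the precise definition of the networks we study and introduce some notation. For every $d\geq 1$ and ${\bf n}=\left(n_{i}\right)_{i=0}^{d}\in{\bf Z}_{+}^{d+1}$, we define

$$\mathfrak{N}({\bf n},d)=\left\{\text{fully connected feed-forward }\operatorname{ReLU}\text{ nets with depth }d\text{ whose layer }j\text{ has width }n_{j},\ j=0,\ldots,d\right\}.$$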

Note that $n_{0}$ is the dimension of the input. Given $\mathcal{N}\in\mathfrak{N}({\bf n},d)$ , the function $f_{\mathcal{N}}$ it computes is determined by its weights and biases

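For every input $\operatorname{act}^{(0)}=\left(\operatorname{act}_{\alpha}^{(0)}\right)_{\alpha=1}^{n_{0}}\in{\bf R}^{n_{0}}$ to $\mathcal{N},$ we write for all $j=1,\ldots,d$

$$\operatorname{preact}_{\beta}^{(j)}=b_{\beta}^{(j)}+\sum_{\alpha=1}^{n_{j-1}}\operatorname{act}_{\alpha}^{(j-1)}w_{\alpha,\beta}^{(j)},\qquad\operatorname{act}_{\beta}^{(j)}=\operatorname{ReLU}\left(\operatorname{preact}_{\beta}^{(j)}\right),\qquad 1\leq\beta\leq n_{j}.$$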

The vectors $\operatorname{preact}^{(j)},\,\operatorname{act}^{(j)}$ are thus the inputs and outputs of nonlinearities in the $j^{th}$ layer of $\mathcal{N}.$

Fix $d\geq 1,$ positive integers ${\bf n}=\left(n_{0},\ldots,n_{d}\right)\in{\bf Z}_{+}^{d+1},$ and two collections of probability measures ${\bf\mu}=\left(\mu^{(1)},\ldots,\mu^{(d)}\right)$ and ${\bf\nu}=\left(\nu^{(1)},\ldots,\nu^{(d)}\right)$ on ${\bf R}$ such that $\mu^{(j)},\nu^{(j)}$ are symmetric around $0$ for every $1\leq j\leq d$, and such that the variance of $\mu^{(j)}$ is $2/(n_{j-1})$.

A random network $\mathcal{N}\in\mathfrak{N}_{{\bf\mu},{\bf\nu}}\left({\bf n},d\right)$ is obtained by requiring that the weights and biases for neurons at layer $j$ are drawn independently from $\mu^{(j)},\nu^{(j)}$ , respectively.
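Definition 1 does not require Gaussian weights, only a symmetric distribution with the right variance. The sketch below illustrates this with one concrete choice that is an assumption of the example: weights uniform on $[-a,a]$ with $a=\sqrt{6/n_{j-1}}$, which has variance $a^{2}/3=2/n_{j-1}$, and biases identically zero (a degenerate but symmetric choice of $\nu^{(j)}$). All function names are hypothetical.

```python
import math
import random


def sample_layer(fan_in, fan_out, rng):
    """Return a fan_out x fan_in weight matrix with i.i.d. uniform entries.

    Entries are uniform on [-a, a] with a = sqrt(6 / fan_in), a distribution
    that is symmetric around 0 and has variance a^2 / 3 = 2 / fan_in.
    """
    a = math.sqrt(6.0 / fan_in)
    return [[rng.uniform(-a, a) for _ in range(fan_in)] for _ in range(fan_out)]


def random_network(widths, rng):
    """Sample one weight matrix per layer for widths = [n_0, ..., n_d]."""
    return [
        sample_layer(n_in, n_out, rng)
        for n_in, n_out in zip(widths[:-1], widths[1:])
    ]


def forward(layers, x):
    """Return the activations act^(0), ..., act^(d) of a zero-bias ReLU net."""
    acts = [list(x)]
    for weights in layers:
        preact = [sum(w * a for w, a in zip(row, acts[-1])) for row in weights]
        acts.append([max(0.0, p) for p in preact])
    return acts
```

For example, `forward(random_network([8, 6, 4], random.Random(0)), [1.0] * 8)` returns three activation vectors of lengths 8, 6, and 4.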

Formal statements

We begin by stating our results about fully connected networks. Given a random network $\mathcal{N}\in\mathfrak{N}_{\mu,\nu}\left({\bf n},d\right)$ and an input $\operatorname{act}^{(0)}$ to $\mathcal{N},$ we write as in §3.1, $M_{j}$ for the normalized square length of activations $\frac{1}{n_{j}}||\operatorname{act}^{(j)}||^{2}$ at layer $j.$ Our first theoretical result, Theorem 5, concerns both the mean and variance of $M_{d}.$ To state it, we denote for any probability measure $\lambda$ on ${\bf R}$ its moments by

$$\lambda_{k}:=\int_{{\bf R}}x^{k}\,d\lambda(x),\qquad k\geq 1.$$

For each $j\geq 0,$ fix $n_{j}\in{\bf Z}_{+}.$ For each $d\geq 1,$ let $\mathcal{N}\in\mathfrak{N}_{\mu,\nu}\left({\bf n},d\right)$ be a fully connected $\operatorname{ReLU}$ net with depth $d$, hidden layer widths ${\bf n}=\left(n_{j}\right)_{j=0}^{d}$ as well as random weights and biases as in Definition 1. Fix also an input $\operatorname{act}^{(0)}\in{\bf R}^{n_{0}}$ to $\mathcal{N}$ with $||\operatorname{act}^{(0)}||=1.$ We have almost surely

Moreover, if (4) holds, then there exists a random variable $M_{\infty}$ (that is almost surely finite) such that $M_{d}\rightarrow M_{\infty}$ as $d\rightarrow\infty$ pointwise almost surely. Further, suppose $\mu_{4}^{(j)}<\infty$ for all $j\geq 1$ and that $\sum_{j=1}^{\infty}\left(\nu_{2}^{(j)}\right)^{2}<\infty.$ Then

where $C_{1},C_{2}$ are the following finite constants:

If $M_{0}=1$ and $\sum_{j=1}^{\infty}\nu_{2}^{(j)},\,\nu_{4}^{(d)}\leq 1$ and $\mu^{(j)}$ is Gaussian for all $j\geq 1$ , then

In particular, $\operatorname{Var}[M_{d}]$ is exponential in $\sum_{j=1}^{d}1/n_{j}$ and if $\sum_{j}1/n_{j}<\infty,$ then the convergence of $M_{d}$ to $M_{\infty}$ is in $L^{2}$ and $\operatorname{Var}[M_{\infty}]<\infty.$
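The exponential dependence on $\sum_{j}1/n_{j}$ can be seen by hand in the simplest setting. The following is a sketch under assumptions stronger than those of the theorem: zero biases and Gaussian weights with variance $2/n_{j-1}$; the symbols $g_{\beta}$ and $X_{j}$ are introduced here for illustration. Conditioned on $\operatorname{act}^{(j-1)}\neq 0$, the pre-activations at layer $j$ are i.i.d. centered Gaussians with variance $\frac{2}{n_{j-1}}||\operatorname{act}^{(j-1)}||^{2}=2M_{j-1}$, so with $g_{\beta}$ i.i.d. standard Gaussians,

$$M_{j}=M_{j-1}X_{j},\qquad X_{j}:=\frac{2}{n_{j}}\sum_{\beta=1}^{n_{j}}\operatorname{ReLU}(g_{\beta})^{2}.$$

Since ${\bf E}[\operatorname{ReLU}(g)^{2}]=\frac{1}{2}$ and ${\bf E}[\operatorname{ReLU}(g)^{4}]=\frac{3}{2}$, we have ${\bf E}[X_{j}]=1$ and $\operatorname{Var}[X_{j}]=\frac{4}{n_{j}}\left(\frac{3}{2}-\frac{1}{4}\right)=\frac{5}{n_{j}}$. Because the law of $X_{j}$ does not depend on the earlier layers,

$${\bf E}[M_{d}^{2}]=M_{0}^{2}\prod_{j=1}^{d}\left(1+\frac{5}{n_{j}}\right),\qquad 6^{\sum_{j=1}^{d}1/n_{j}}\leq\prod_{j=1}^{d}\left(1+\frac{5}{n_{j}}\right)\leq\exp\left(5\sum_{j=1}^{d}\frac{1}{n_{j}}\right).$$

The upper bound uses $1+x\leq e^{x}$, and the lower bound uses the concavity of $\log(1+x)$ on $[0,5]$, which gives $\log(1+5/n)\geq\log(6)/n$ for every integer $n\geq 1$.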

The proof of Theorem 5 is deferred to the Supplementary Material. Although we state our results only for fully connected feed-forward $\operatorname{ReLU}$ nets, the proof techniques carry over essentially verbatim to any feed-forward network in which only weights in the same hidden layer are tied. In particular, our results apply to convolutional networks in which the kernel sizes are uniformly bounded. In this case, the constants in Theorem 5 depend on the bound for the kernel dimensions, and $n_{j}$ denotes the fan-in for neurons in the $(j+1)^{st}$ hidden layer (i.e. the number of channels in layer $j$ multiplied by the size of the appropriate kernel). We also point out the following corollary, which follows immediately from the proof of Theorem 5.

With notation as in Theorem 5, suppose that for all $j=1,\ldots,d,$ the weights in layer $j$ of $\mathcal{N}_{d}$ have variance $\kappa\cdot 2/n_{j}$ for some $\kappa>0.$ Then the average squared size $M_{d}$ of activations at layer $d$ will grow or decay exponentially unless $\kappa=1$ :

Our final result about fully connected networks is a corollary of Theorem 5, which explains precisely when failure mode FM2 occurs (see §3.3). It is proved in the Supplementary Material.

Take the same notation as in Theorem 5. There exist $c,C>0$ so that

Finally, our main result about residual networks is the following:

Conclusion

In this article, we give a rigorous analysis of the layerwise length scales in fully connected, convolutional, and residual $\operatorname{ReLU}$ networks at initialization. We find that a careful choice of initial weights is needed for well-behaved mean length scales. For fully connected and convolutional networks, this entails a critical variance for i.i.d. weights, while for residual nets this entails appropriately rescaling the residual modules. For fully connected nets, we prove that to control not merely the mean but also the variance of layerwise length scales requires choosing a sufficiently wide architecture, while for residual nets nothing further is required. We also demonstrate empirically that both the mean and variance of length scales are strong predictors of early training dynamics. In the future, we plan to extend our analysis to other (e.g. sigmoidal) activations, recurrent networks, weight initializations beyond i.i.d. (e.g. orthogonal weights), and the joint distributions of activations over several inputs.


Appendix A Proof of Theorem 5

Let us first verify that $M_{d}$ is a submartingale for the filtration $\{\mathcal{F}_{d}\}_{d\geq 1}$ with $\mathcal{F}_{d}$ being the sigma algebra generated by all weights and biases up to and including layer $d$ (for background on sigma algebras and martingales we refer the reader to Chapters 2 and 37 in ). Since $\operatorname{act}^{(0)}$ is a fixed non-random vector, it is clear that $M_{d}$ is measurable with respect to $\mathcal{F}_{d}.$ We have

where we can replace the sigma algebra $\mathcal{F}_{d-1}$ by the sigma algebra generated by $\operatorname{act}^{(d-1)}$ since the computation done by a feed-forward neural net is a Markov chain with respect to activations at consecutive layers (for background see Chapter 8 in ). Next, recall that by assumption the weights and biases are symmetric in law around $0.$ Note that for each $\beta,$ changing the signs of all the weights $w_{\alpha,\beta}^{(d)}$ and biases $b_{\beta}^{(d)}$ causes $\operatorname{preact}_{\beta}^{(d)}$ to change sign. Hence, we find

Symmetrizing the expression in (9), we obtain

where in the second equality we used that the weights $w_{\alpha,\beta}^{(d)}$ and biases $b_{\beta}^{(d)}$ are independent of $\mathcal{F}_{d-1}$ with mean $0$ and in the last equality that $\operatorname{Var}[w_{\alpha,\beta}^{(d)}]=2/n_{d-1}.$ The above computation also yields that for each $d\geq 1,$

It also shows that $\widehat{M}_{d}=M_{d}-\sum_{j=1}^{d}\frac{1}{2}\nu_{2}^{(j)}$ is a martingale. Taking the limit $d\rightarrow\infty$ in (11) proves (4). Next, assuming condition (4), we find that

which is finite. Hence, we may apply Doob’s pointwise martingale convergence theorem (see Chapter 35 in ) to conclude that the limit

exists and is finite almost surely. Indeed, Doob’s result states that if our martingale $\widehat{M}_{d}$ is bounded in $L^{1}$ uniformly in $d$, then, almost surely, $\widehat{M}_{d}$ has a finite pointwise limit as $d\rightarrow\infty.$ To show (5) we will need the following result.

and, conditioned on $\operatorname{act}^{(d-1)},$ the random variables $\{\operatorname{act}_{\beta}^{(d)}\}_{\beta}$ are i.i.d. Hence,

We apply the same symmetrization trick as in the derivation of (10) to obtain

which after using that the odd moments of $w_{\alpha,1}^{(d)}$ and $b_{1}^{(d)}$ vanish becomes

where we recall that $\widetilde{\mu}_{4}^{(d)}=\mu_{4}^{(d)}/\left(\mu_{2}^{(d)}\right)^{2}.$ Putting together the preceding computations and using that

Recall that the excess kurtosis $\widetilde{\mu}_{4}^{(d)}-3$ of $\mu^{(d)}$ is bounded below by $-2$ for any probability measure (see Chapter 4 in ) and observe that $||\operatorname{act}^{(d-1)}||_{4}^{4}\leq||\operatorname{act}^{(d-1)}||^{4}.$ Therefore, using that $\frac{1}{2}\nu_{4}^{(d)}-\frac{1}{4}(\nu_{2}^{(d)})^{2}\geq 0,$ we obtain

To conclude the proof of Theorem 5, we write

and combine Lemma 1 with the expression (11) to obtain with $C$ as in Lemma 1

Taking expectations of both sides in the inequalities above yields with $C$ as in Lemma 1

Iterating the lower bound in this inequality yields the lower bound in (5). Similarly, using that $1+C/n_{d}>1$ , we iterate the upper bound to obtain

Using the above estimate for $\sum_{j}a_{j}$ gives the upper bound in (5) and completes the proof of Theorem 5.
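For the reader's convenience, the symmetrization step at the start of this proof can be written out as follows. This is a sketch that assumes the pre-activations have the form $\operatorname{preact}_{\beta}^{(d)}=b_{\beta}^{(d)}+\sum_{\alpha}\operatorname{act}_{\alpha}^{(d-1)}w_{\alpha,\beta}^{(d)}$. Since $\operatorname{preact}_{\beta}^{(d)}$ and $-\operatorname{preact}_{\beta}^{(d)}$ have the same conditional law given $\operatorname{act}^{(d-1)}$, and $\operatorname{ReLU}(t)^{2}+\operatorname{ReLU}(-t)^{2}=t^{2}$,

$${\bf E}\left[\left(\operatorname{act}_{\beta}^{(d)}\right)^{2}\,\Big|\,\operatorname{act}^{(d-1)}\right]=\frac{1}{2}{\bf E}\left[\left(\operatorname{preact}_{\beta}^{(d)}\right)^{2}\,\Big|\,\operatorname{act}^{(d-1)}\right]=\frac{1}{2}\nu_{2}^{(d)}+\frac{1}{n_{d-1}}||\operatorname{act}^{(d-1)}||^{2},$$

where the cross terms vanish because the weights and biases are independent with mean $0$, and $\operatorname{Var}[w_{\alpha,\beta}^{(d)}]=2/n_{d-1}$. Averaging over $\beta=1,\ldots,n_{d}$ gives

$${\bf E}\left[M_{d}\,\big|\,\mathcal{F}_{d-1}\right]=M_{d-1}+\frac{1}{2}\nu_{2}^{(d)},$$

which is the submartingale property of $M_{d}$ and the martingale property of $M_{d}-\sum_{j=1}^{d}\frac{1}{2}\nu_{2}^{(j)}$.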

Appendix B Proof of Corollary 2

Fix a fully connected $\operatorname{ReLU}$ net $\mathcal{N}$ with depth $d$ and hidden layer widths $n_{0},\ldots,n_{d}.$ We fix an input $\operatorname{act}^{(0)}$ to $\mathcal{N}$ and study the empirical variance $\widehat{\operatorname{Var}}[M]$ of the squared sizes of activations $M_{j},\,j=1,\ldots,d.$ Since the biases in $\mathcal{N}$ are $0,$ the squared activations $M_{j}$ are a martingale (see (10)) and we find

To see that this sum is exponential in $\sum 1/n_{j}$ as in (7), let us consider the special case of equal widths $n_{j}=n$ . Then, writing

This proves the lower bounds in (6) and (7). The upper bounds are similar.

Appendix C Proof of Theorem 6

To understand the sizes of activations produced by $\mathcal{N}^{res}_{L}$ , we need the following Lemma.

Let $\mathcal{N}$ be a feed-forward, fully connected $\operatorname{ReLU}$ net with depth $d$ and hidden layer widths $n_{0},\ldots,n_{d}$ having random weights as in Definition 1 and biases set to $0$. Then for each $\eta\in(0,1),$ we have

where $\widehat{x}=\frac{x}{\left\lVert x\right\rVert}$ , and we have used the fact that $\left\lVert\mathcal{N}(x)\right\rVert^{2}=\left\lVert x\right\rVert^{2}$ (see (10)) as well as the positive homogeneity of $\operatorname{ReLU}$ nets with zero biases:

Let us also write $x=x^{(0)}$ for the input to $\mathcal{N}$ , similarly set $x^{(j)}$ for the activations at layer $j$ . We denote by $W_{\beta}^{(j)}$ the $\beta^{th}$ row of the weights $W^{(j)}$ at layer $j$ in $\mathcal{N}$ . We have:

Therefore, using that $\left\lVert x^{(j)}\right\rVert$ is a supermartingale (since its square is a martingale by (10)):

Combining this with (13) completes the proof. ∎
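The positive homogeneity used in this proof, namely that a zero-bias $\operatorname{ReLU}$ net satisfies $\mathcal{N}(\lambda x)=\lambda\mathcal{N}(x)$ for $\lambda>0$, can be checked numerically. The sketch below is illustrative: the widths, the scaling factor `lam`, and the Gaussian weights are arbitrary choices made here.

```python
import random


def relu_net(weights, x):
    """Apply a zero-bias fully connected ReLU net: x -> ReLU(W x), layer by layer."""
    for matrix in weights:
        x = [max(0.0, sum(w * a for w, a in zip(row, x))) for row in matrix]
    return x


rng = random.Random(0)
widths = [5, 7, 6, 4]  # arbitrary small architecture for the check
weights = [
    [[rng.gauss(0.0, (2.0 / n_in) ** 0.5) for _ in range(n_in)] for _ in range(n_out)]
    for n_in, n_out in zip(widths[:-1], widths[1:])
]
x = [rng.uniform(-1.0, 1.0) for _ in range(widths[0])]
lam = 3.5  # any positive scaling factor

scaled_input_output = relu_net(weights, [lam * v for v in x])
scaled_output = [lam * v for v in relu_net(weights, x)]
```

The two vectors `scaled_input_output` and `scaled_output` agree up to floating-point rounding. The identity relies on $\operatorname{ReLU}(\lambda t)=\lambda\operatorname{ReLU}(t)$ for $\lambda>0$ and fails in general once biases are nonzero.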

The Lemma implies part (i) of the Theorem as follows: