Deep Residual Networks and Weight Initialization

Masato Taki

Introduction

In the last few years, developments in deep neural networks have brought tremendous improvements in image recognition and other machine learning tasks. Increasing depth yields significant accuracy gains, but a deeper model generally suffers from the serious problem of vanishing/exploding gradients. Initializing weight parameters by sampling them from an appropriate distribution is now a standard way to address this problem. The first modern initialization method was proposed for symmetric activation functions, and it has been widely applied to various architectures. ReLU, which is not a symmetric function, has recently become a popular choice of activation function. He et al. generalized the initialization to such activation functions. Since then, much research has been done to extend this initialization method.

The residual network (ResNet; see also the highway network) is another idea for realizing deep neural networks with high performance. The intuitive idea is that shortcut connections in ResNet provide bypaths for propagating signals, which prevents the loss of signal through propagation.

In this paper, we provide another theoretical explanation of why ResNet works well. We show that ResNet is relatively insensitive to weight initialization compared to usual neural networks. This fact enables us to train deep ResNets successfully and easily. ResNet is comparatively robust to the initialization distribution, but the variance of the initialization distribution must be smaller than that of the earlier initialization methods:

$$\textrm{Var}\big{[}{w}\big{]}=\frac{c}{nL},$$

where $L$ is the depth of ResNet (in this paper, $L$ actually corresponds to the number of residual blocks in ResNet), $n$ is the fan-in, and $c$ is a coefficient of order $\mathcal{O}(1)$. For $L\gg 1$, this requirement leads to a theoretical upper bound on the possible value of $L$ when we realize the initialization in practical floating-point arithmetic. To address this difficulty, we investigate a well-known modification of ResNet in which batch normalization layers are inserted. The explosion of gradients is then relaxed.

With batch normalization, the explosion is linear in the depth and is not as serious as the exponential explosion of plain networks. Batch normalization significantly improves the problem, but gradients still diverge when ResNet is extremely deep. This can be another theoretical limitation on the possible depth of ResNet.

A work closely related to this paper is Balduzzi et al.: a similar analysis for special cases was done from the viewpoint of the shattered gradients problem.

Weight Initialization for Residual Networks

Careful weight initialization is the mainstream technique for training deep neural networks without the problem of vanishing/exploding gradients. The behavior of a plain neural network is highly sensitive to the initial weights when the network is very deep, and their distribution directly affects the magnitudes of the outputs and gradients of the network. We therefore need to tune the initial weight distribution carefully to avoid vanishing and explosion.

Using residual networks is another way to realize smooth convergence when training a deep model. Shortcut connections in ResNet keep signals finite through propagation, and this has been an intuitive explanation of why ResNet can avoid the problem of vanishing/exploding gradients.

In this section, we point out that these two approaches are actually related and that deep ResNet is a special model from the viewpoint of weight initialization. We also propose a new weight initialization that works for deep ResNets.

We can therefore solve this relation easily, and the expectation value at the output layer is given by

by choosing a weight initialization distribution whose variance is

$$\textrm{Var}\big{[}{w}\big{]}=\frac{c}{nL}.$$

Here $c$ is some small number. This weight initialization distribution relies not only on the local layer information $n$ (for a sparse network such as a convolutional network, the fan-in $n$ is not the actual number of units) but also on global information, namely the depth $L$.

The relevant difference from plain feedforward neural networks is that the prefactor in that case behaves as $\left(n\,\textrm{Var}\big{[}{w}\big{]}\right)^{L}$, and it is therefore highly sensitive to deviation from the recommended value $n\,\textrm{Var}\big{[}{w}\big{]}=1$. If the variance $\textrm{Var}\big{[}{w}\big{]}$ is twice the recommended value, then the prefactor grows to $2^{L}$. This number is extremely large for a deep model: $2^{20}\approx 10^{6}$ for $L=20$. The residual case (10), on the other hand, is more robust to such deviation. If $c=1$, the grown prefactor is just $e^{2}\approx 7$. This robustness eases the training of very deep ResNets under our initialization.
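The contrast between the two prefactors can be made concrete with a short numerical sketch. The plain-network factor $\left(n\,\textrm{Var}[w]\right)^{L}$ is taken from the text; the residual factor is assumed here to have the form $\left(1+n\,\textrm{Var}[w]\right)^{L}$, which is consistent with the $e^{2}$ figure quoted above but is an assumption of this example, as are the function names.

```python
import math


def plain_prefactor(n_var_w, depth):
    """Prefactor (n Var[w])^L of a plain feedforward network, as in the text."""
    return n_var_w ** depth


def residual_prefactor(n_var_w, depth):
    """Assumed residual prefactor (1 + n Var[w])^L (an assumption of this sketch)."""
    return (1.0 + n_var_w) ** depth


depth = 20
c = 1.0

# Plain network: the recommended value n Var[w] = 1 is doubled to 2.
plain_doubled = plain_prefactor(2.0, depth)

# Residual network: the recommended value n Var[w] = c / L is doubled to 2c / L.
residual_doubled = residual_prefactor(2.0 * c / depth, depth)

print(f"plain, doubled variance:    {plain_doubled:.3e}")
print(f"residual, doubled variance: {residual_doubled:.3f}")
print(f"large-depth limit e^2:      {math.exp(2.0):.3f}")
```

Under these assumptions the plain prefactor is of order $10^{6}$ at $L=20$, while the residual prefactor stays below $e^{2}$ for every depth.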

ReLU

The structure of this lower bound is the same as that of (7), and the initialization distribution that is a necessary condition for avoiding this divergence is again $\textrm{Var}\big{[}{w}\big{]}={c}/{nL}$. This condition guarantees $\textrm{Var}\big{[}y\big{]}\geq e^{\frac{c}{4}}\,\textrm{Var}\big{[}x\big{]}$.

Backpropagation

Let us move on to evaluating the decay/growth of gradients in ResNet. In the backpropagation method, the gradients are given by the deltas as

where $\otimes$ is the Kronecker product and $\odot$ is the Hadamard product. The chain rule then leads to the following recursion relation for deltas between adjoining layers

Solving this recursion starting from the initial value $\boldsymbol{\delta}^{L}_{{z}}={\partial E}/{\partial\boldsymbol{y}}$ and substituting the solution into (15) gives the gradients for all layers.

and $f^{\prime}(u)$ becomes $0$ with probability $1/2$ and $1$ with probability $1/2$. The relation (15) then gives

We use (16) to derive the first equality. Hence we obtain the following expression

We can conclude that the simplest choice of the initialization is again $\textrm{Var}\big{[}{w}\big{]}={c}/{nL}$.

Identity

When the activation function is approximated by the identity map $f(u)=u$, its derivative is of course $f^{\prime}(u)=1$. This simplification leads to

Let us assume $\textrm{E}\big{[}{\delta}_{z}^{L}\big{]}=0$ again. The variances of deltas then satisfy

Using (16), we obtain the recursion relation of this variance

This relation immediately leads to the following formula

Therefore, this growth factor requires the same initialization distribution as the ReLU case.
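The identity-activation claim can be checked numerically. The following is a minimal sketch, assuming a linear residual block of the form $z^{l+1}=z^{l}+W^{l}z^{l}$, so that deltas propagate backward as $\delta^{l}=\delta^{l+1}+(W^{l})^{\top}\delta^{l+1}$. This block form, the function name `gradient_growth`, and the chosen sizes are assumptions made for illustration and are not taken from the derivation above.

```python
import math
import random


def gradient_growth(n, depth, weight_var, trials=5, seed=0):
    """Average ||delta^0||^2 / ||delta^L||^2 over random linear residual chains."""
    rng = random.Random(seed)
    std = math.sqrt(weight_var)
    total = 0.0
    for _ in range(trials):
        # Random delta at the output layer.
        delta = [rng.gauss(0.0, 1.0) for _ in range(n)]
        start = sum(d * d for d in delta)
        for _ in range(depth):
            # Fresh i.i.d. Gaussian weights for this block.
            w = [[rng.gauss(0.0, std) for _ in range(n)] for _ in range(n)]
            # Backward rule of the assumed block: delta <- delta + W^T delta.
            delta = [
                delta[j] + sum(w[i][j] * delta[i] for i in range(n))
                for j in range(n)
            ]
        total += sum(d * d for d in delta) / start
    return total / trials


n, depth, c = 32, 20, 1.0
proposed = gradient_growth(n, depth, c / (n * depth))  # Var[w] = c / (n L)
unit = gradient_growth(n, depth, 1.0 / n)              # n Var[w] = 1

print(f"Var[w] = c/(nL): growth of squared delta norm = {proposed:.2f}")
print(f"Var[w] = 1/n:    growth of squared delta norm = {unit:.3e}")
```

In this toy setting, the depth-scaled variance keeps the squared norm of the deltas of order one, whereas a variance that does not shrink with $L$ lets it grow by several orders of magnitude.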

Limitation of depth

In the previous subsection, we proposed a new weight initialization distribution with the variance

$$\textrm{Var}\big{[}{w}\big{]}=\frac{c}{nL},$$

which is required to prevent exploding gradients. Theoretically we can always prepare such initial weights for any $n$ and $L$, but of course the mantissa and exponent available in practical floating-point numbers are limited. Implementations of deep learning sometimes prefer a small bit width to save computational cost. Using very small weight values causes cancellation and loss of trailing digits, and thus there is an obstruction to realizing the variance $\textrm{Var}\big{[}{w}\big{]}={c}/{nL}$ for a very deep network. This fact restricts the possible depth of ResNet when we implement our initialization in floating-point arithmetic. In the next section, we compare our initialization with another improvement known as batch normalization.
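The floating-point obstruction can be illustrated with a small sketch that rounds the standard deviation $\sqrt{c/(nL)}$ to IEEE 754 half precision. Half precision, the fan-in $n=1024$, and the helper names `to_half` and `init_std` are choices made here purely for illustration; the text does not specify a bit width.

```python
import math
import struct


def to_half(x):
    """Round a Python float to IEEE 754 half precision and convert it back."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def init_std(n, depth, c=1.0):
    """Standard deviation corresponding to Var[w] = c / (n L)."""
    return math.sqrt(c / (n * depth))


n = 1024  # illustrative fan-in
for depth in (10**2, 10**5, 10**8, 10**12):
    std = init_std(n, depth)
    stored = to_half(std)
    rel_err = abs(stored - std) / std
    print(f"L = {depth:.0e}: std = {std:.3e}, "
          f"half precision = {stored:.3e}, relative error = {rel_err:.1e}")
```

As the depth grows, the target scale falls into the subnormal range of the format and the relative rounding error increases; sufficiently small values are rounded to zero. Individual weights sampled around zero are typically smaller than the standard deviation, so they are affected even earlier.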

Batch Normalization

In this section, we extend our previous analysis to batch-normalized ResNets. It is experimentally known that introducing batch normalization layers drastically improves the performance of ResNets. This result, however, depends strongly on how these layers are inserted, and we therefore pick a typical arrangement. Our theoretical treatment in this section explains how batch normalization cures problems inherent in ResNets.

Feedforward propagation has a very simple property. Let us focus on the identity activation function for simplicity. The expectation value of the layer output is

The derivation of the following backpropagation relation is parallel to that in the previous section

The chain rule again gives the derivative coefficient appearing in the second term

Substituting these expressions into (34), we can rewrite the right-hand side of (33) as

Substituting it into (38) leads to the following simple formula

Experiments

To evaluate our initialization (28), we conduct small experiments on ResNets with one hundred blocks. Residual blocks without and with a batch normalization layer are illustrated in Fig.1. Since our purpose is not to achieve significant performance but to check the effect of initializations at the first stage of training, we use simple models whose residual block consists of an $8\times 8$ convolution layer with 16 channels and a ReLU activation function. These models are trained on the CIFAR-10 dataset.

Learning curves for the first 20 epochs are shown in Fig.2. The gray and yellow lines are learning curves of plain ResNets without batch normalization. The gray line corresponds to He’s initialization, and the yellow line corresponds to the proposed initialization (28) (we used the normal distribution with variance $1/(8\times 8\times 16\times 100)$; this distribution is not truncated). Learning curves with He’s initialization under batch normalization are shown as blue lines. Fig.2a shows the validation accuracies, and Fig.2b shows the values of the loss functions at each epoch.
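The variance quoted for the experiment follows from $\textrm{Var}[w]=c/(nL)$ with $c=1$, fan-in $n=8\times 8\times 16$ and $L=100$ blocks. A minimal sketch of the corresponding sampling is given below. He’s initialization is taken here in its standard fan-in form $\textrm{Var}[w]=2/n$, which is an assumption about the exact variant used; the function names are hypothetical, and the code is not the implementation used in the experiments.

```python
import math
import random


def proposed_std(kernel_size, channels, num_blocks, c=1.0):
    """Std for Var[w] = c / (n L), with fan-in n = kernel_size^2 * channels."""
    fan_in = kernel_size * kernel_size * channels
    return math.sqrt(c / (fan_in * num_blocks))


def he_std(kernel_size, channels):
    """Std for He's initialization in its fan-in form, Var[w] = 2 / n (assumed)."""
    fan_in = kernel_size * kernel_size * channels
    return math.sqrt(2.0 / fan_in)


def sample_kernel(std, kernel_size, in_channels, out_channels, rng):
    """Draw a flat list of untruncated Gaussian weights for one convolution layer."""
    count = out_channels * in_channels * kernel_size * kernel_size
    return [rng.gauss(0.0, std) for _ in range(count)]


rng = random.Random(0)
std_ours = proposed_std(kernel_size=8, channels=16, num_blocks=100)
std_he = he_std(kernel_size=8, channels=16)
weights = sample_kernel(std_ours, 8, 16, 16, rng)

print(f"proposed variance: {std_ours ** 2:.3e}")
print(f"He variance:       {std_he ** 2:.3e}")
print(f"variance ratio:    {(std_he / std_ours) ** 2:.0f}")
```

With these settings the proposed variance is smaller than He’s by a factor of $2L=200$.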

As Fig.2 shows, training of the plain ResNet with He’s initialization (gray lines) is trapped on a plateau from the beginning, and convergence is hampered. This initialization is very unstable in this model and is easily trapped on a plateau: Fig.3 shows nine learning curves of ResNets initialized in this way. On the other hand, the same ResNet with our initialization (yellow lines) is more stable: repeated experiments showed that the mean and variance of the validation accuracy after the first epoch are $0.434$ and $0.017$. In Fig.2, our initialization shows performance comparable to that of ResNet with batch normalization (blue lines). The performance of batch normalization is slightly better than that of our initialization, and it stabilizes the convergence throughout the entire training period. But batch normalization incurs more computational cost than our simple initialization. In this sense, our initialization can serve as a substitute for batch normalization.

Discussion

In this paper, we study how weight initialization and batch normalization improve the training of ResNet. We also propose a new weight initialization (28) that works for deep ResNets. A simple experimental test shows that its effect at the first stage of training is comparable to that of batch normalization.

The ResNets in this paper are simpler than practical models, and generalizing our analysis to more complicated ResNets is an interesting direction for further work. It is especially important to understand the roles of including an activation function after the shortcut connection, inserting many layers in a single block, and changing the position of the batch normalization layer. We leave these questions open for future research.

This work was supported by RIKEN interdisciplinary Theoretical & Mathematical Sciences Program (iTHEMS) and JSPS KAKENHI Grant Number JP17K19989.

References

LeCun, Y., et al. (1989). Backpropagation applied to handwritten zip code recognition. Neural computation, 1(4), 541-551.

Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems (pp. 1097-1105).

Russakovsky, O., et al. (2015). Imagenet large scale visual recognition challenge. International Journal of Computer Vision, 115(3), 211-252.

Szegedy, C., et al. (2015). Going deeper with convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 1-9).

Simonyan, K., & Zisserman, A. (2014). Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.

Hochreiter, S. (1991). Untersuchungen zu dynamischen neuronalen Netzen. Diploma thesis, Institut für Informatik, Lehrstuhl Prof. Brauer, Technische Universität München.

Bengio, Y., Simard, P., & Frasconi, P. (1994). Learning long-term dependencies with gradient descent is difficult. IEEE transactions on neural networks, 5(2), 157-166.

Glorot, X., & Bengio, Y. (2010, March). Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics (pp. 249-256).

Nair, V., & Hinton, G. E. (2010). Rectified linear units improve restricted boltzmann machines. In Proceedings of the 27th international conference on machine learning (ICML-10) (pp. 807-814).

He, K., Zhang, X., Ren, S., & Sun, J. (2015). Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In Proceedings of the IEEE international conference on computer vision (pp. 1026-1034).

He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 770-778).

Srivastava, R. K., Greff, K., & Schmidhuber, J. (2015). Highway networks. arXiv preprint arXiv:1505.00387.

Ioffe, S., & Szegedy, C. (2015, June). Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning (pp. 448-456).

LeCun, Y. A., et al. (2012). Efficient backprop. In Neural networks: Tricks of the trade (pp. 9-48). Springer Berlin Heidelberg.

Saxe, A. M., McClelland, J. L., & Ganguli, S. (2013). Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. arXiv preprint arXiv:1312.6120.

Mishkin, D., & Matas, J. (2015). All you need is a good init. arXiv preprint arXiv:1511.06422.

Veit, A., Wilber, M. J., & Belongie, S. (2016). Residual networks behave like ensembles of relatively shallow networks. In Advances in Neural Information Processing Systems (pp. 550-558).

Greff, K., Srivastava, R. K., & Schmidhuber, J. (2016). Highway and residual networks learn unrolled iterative estimation. arXiv preprint arXiv:1612.07771.

He, K., Zhang, X., Ren, S., & Sun, J. (2016, October). Identity mappings in deep residual networks. In European Conference on Computer Vision (pp. 630-645). Springer International Publishing.

Gross, S., & Wilber, M. (2016). Training and investigating residual nets. http://torch.ch/blog/2016/02/04/resnets.html

Krizhevsky, A. & Hinton, G. (2009). Learning multiple layers of features from tiny images.

Zhang, C., et al. (2016). Understanding deep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530.

Balduzzi, D., et al. (2017). The Shattered Gradients Problem: If resnets are the answer, then what is the question? arXiv preprint arXiv:1702.08591.

In this appendix, we study another way of inserting the batch normalization layer (Fig.4). A related computation has been done for a special case.

The feedforward propagation rule in this case (Fig.4) is

We therefore obtain the following expression for variance

Let us consider backpropagation next. Gradients are now given by deltas for $z$ as

Applying the chain rule leads to the following backpropagation rule

Using (34), we can rewrite it into the following form

To obtain the last equality, we use (47). This formula immediately leads to the following evolution equation

for both choices of activation functions.

Let us approximate this numerical coefficient by $a$ . Then the final formula is

This behavior is completely the same as that in Section 3, and therefore we can expect this property is universal for batch normalized ResNets.