Extremely Large Minibatch SGD: Training ResNet-50 on ImageNet in 15 Minutes

Takuya Akiba, Shuji Suzuki, Keisuke Fukuda

Introduction

Training deep neural networks is computationally expensive. Acceleration by distributed computing is required for higher scalability (larger datasets and more complex models) and for higher productivity (shorter training time and quicker trial and error). This paper demonstrates that, on carefully designed software and hardware systems, highly parallel training with a large minibatch size is possible without losing accuracy.

We used the 90-epoch ResNet-50 training on ImageNet as our benchmark. This task has been used extensively to evaluate the performance of distributed deep learning. Table 1 summarizes these previous attempts along with our new results. We achieved a total training time of 15 minutes while maintaining a comparable accuracy of 74.9%.

The technical challenge is twofold: on the algorithm side, we have to design training methods that prevent loss of accuracy with large minibatch sizes, while on the system side, we have to design stable and practical combinations of available hardware and software components.

Training Procedure for Large Minibatches

We build on the training procedure proposed by Goyal et al., and the same settings are used unless otherwise specified. We briefly highlight the differences in this section. For further details, please see Appendix A.

RMSprop Warm-up.

We found that the primary challenge is the optimization difficulty at the start of training. To address this issue, we start the training with RMSprop, then gradually transition to SGD.

Slow-Start Learning Rate Schedule.

To further overcome the initial optimization difficulty, we use a slightly modified learning rate schedule with a longer initial phase and lower initial learning rate.

Batch Normalization without Moving Averages.

With the larger minibatch sizes, the batch normalization moving averages of the mean and variance became inaccurate estimates of the actual mean and variance. To cope with this problem, we used only the statistics of the last minibatch instead of the moving averages, and applied all-reduce communication to these statistics to obtain their average over all workers before validation.
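The averaging step can be illustrated with a small simulation. This is a minimal sketch in plain Python, not the ChainerMN implementation: the helper `allreduce_mean` and the per-worker statistics are hypothetical, and a real all-reduce would run across processes rather than over a list.

```python
from typing import List


def allreduce_mean(per_worker: List[List[float]]) -> List[float]:
    """Simulate an averaging all-reduce: return the element-wise mean over workers."""
    n_workers = len(per_worker)
    return [sum(values) / n_workers for values in zip(*per_worker)]


# Hypothetical last-minibatch BN statistics: 4 workers, 2 channels each.
worker_means = [[0.10, -0.20], [0.14, -0.18], [0.08, -0.22], [0.12, -0.20]]
worker_vars = [[1.00, 0.50], [1.10, 0.46], [0.90, 0.54], [1.00, 0.50]]

# After the all-reduce, every worker would hold the same statistics for validation.
bn_mean = allreduce_mean(worker_means)
bn_var = allreduce_mean(worker_vars)
```

Averaging the per-worker variances, as done here, follows the description above literally; it does not account for differences between the per-worker means.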

Software and Hardware Systems

Software.

We used Chainer and ChainerMN. Chainer is an open-source deep learning framework featuring the define-by-run approach. ChainerMN is an add-on package for Chainer enabling multi-node distributed deep learning with synchronous data-parallelism. We used development branches based on versions 3.0.0rc1 and 1.0.0, respectively. As the underlying communication libraries, we used NCCL version 2.0.5 and Open MPI version 1.10.2. While computation was generally done in single precision, we used half-precision floats for communication in order to reduce the communication overhead during all-reduce operations. In our preliminary experiments, we observed that using half precision in communication had a relatively small effect on the final model accuracy.
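To give a feel for the trade-off, the following sketch rounds a few gradient values to IEEE 754 half precision and measures the rounding error and payload size. It uses only Python's `struct` module; the gradient values and the helper `to_half` are hypothetical, and the sketch does not reproduce how NCCL or ChainerMN perform the cast.

```python
import struct


def to_half(x: float) -> float:
    """Round a float to the nearest IEEE 754 half-precision value.

    Raises OverflowError for magnitudes beyond the half-precision range.
    """
    return struct.unpack("<e", struct.pack("<e", x))[0]


# Hypothetical gradient values, all inside the normal half-precision range.
grads = [0.0123, -0.5, 3.1415e-3, 7.0e-4]
sent = [to_half(g) for g in grads]

# Relative rounding error introduced by the cast.
rel_err = [abs(s - g) / abs(g) for s, g in zip(sent, grads)]

# Bytes on the wire for the same number of elements.
bytes_fp32 = struct.calcsize("<f") * len(grads)
bytes_fp16 = struct.calcsize("<e") * len(grads)
```

For values in the normal half-precision range, the relative error of the cast is bounded by $2^{-11}$, while the payload is halved.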

Hardware.

We used MN-1, an in-house cluster owned by Preferred Networks, Inc., designed to facilitate research and development of deep learning. It consists of 128 nodes, where each node has two Intel Xeon E5-2667 processors (3.20 GHz, eight cores), 256 GB of memory, and eight NVIDIA Tesla P100 GPUs. The nodes are interconnected by Mellanox InfiniBand FDR.

Experimental Results

For running time and accuracy, the mean and standard deviation from five independent runs are reported. The per-worker minibatch size was 32, and the total minibatch size was 32k with 1024 workers.
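The batch-size arithmetic can be checked directly. In this sketch the node and GPU counts and the per-worker batch size come from the text, while the training-set size (`TRAIN_IMAGES`, the ILSVRC-2012 training set) is an assumption not stated here; whether the final partial minibatch of an epoch is kept or dropped is also an assumption.

```python
import math

NODES = 128              # from the hardware description
GPUS_PER_NODE = 8        # from the hardware description
PER_WORKER_BATCH = 32    # from the text
TRAIN_IMAGES = 1_281_167  # assumption: ILSVRC-2012 training-set size

workers = NODES * GPUS_PER_NODE
total_batch = workers * PER_WORKER_BATCH

# Iterations per epoch if the final partial minibatch is kept (assumption).
iters_per_epoch = math.ceil(TRAIN_IMAGES / total_batch)
```

Under these assumptions, one epoch takes only a few dozen iterations, which is why the early phase of training receives special treatment in Appendix A.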

Using 1024 GPUs, the training time was $897.9\pm 3.3$ seconds for 90 epochs, including validation after each epoch. Figure 1 illustrates the average communication time (i.e., all-reduce operations) and time to complete a whole iteration (i.e., forward and backward computation, communication, and optimization) over 100 iterations. Our scaling efficiency when using 1024 GPUs is 70% and 80% in comparison to single-GPU and single-node (i.e., 8 GPUs) baselines, respectively.
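As a rough check of these figures, the sketch below converts the reported time to minutes and translates the efficiencies into implied speedups. It assumes the common definition of scaling efficiency as the achieved speedup divided by the ideal linear speedup; the text does not state which definition was used, and the helper name is hypothetical.

```python
def speedup_from_efficiency(efficiency: float, n_units: int) -> float:
    """Implied speedup over one unit, assuming efficiency = speedup / n_units."""
    return efficiency * n_units


train_seconds = 897.9  # mean training time reported in the text
epochs = 90

minutes = train_seconds / 60
seconds_per_epoch = train_seconds / epochs

# 1024 GPUs compared with one GPU, and 128 nodes compared with one node.
speedup_vs_single_gpu = speedup_from_efficiency(0.70, 1024)
speedup_vs_single_node = speedup_from_efficiency(0.80, 128)
```

Under this definition, the mean training time is just under 15 minutes, or roughly 10 seconds per epoch including validation.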

Accuracy.

After training for 90 epochs using 1024 GPUs with the training procedure described in Section 2, the top-1 single-crop accuracy on the validation images was $74.94\%\pm 0.09\%$. As we can observe from Table 1, this accuracy is comparable to that of previous results using ResNet-50. This shows that ResNet-50 can be trained on ImageNet with a minibatch size of 32k without severely degrading the accuracy, which validates our claim that training of ResNet-50 can be successfully completed in 15 minutes.

Acknowledgements

The authors thank Y. Doi, G. Watanabe, R. Okuta, T. Kikuchi, and M. Sakata for help on experiments, T. Miyato and S. Tokui for fruitful discussions, and H. Maruyama, R. Calland, and C. Loomis for helping to improve the manuscript.

Appendix A Details of Training Procedure

A.1 RMSprop Warm-up

Our update rule is a simple combination of momentum SGD and RMSprop (a variant with momentum), defined as follows:
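The rule can be written in the following form, reconstructed from the symbol definitions in this section. The placement of $\varepsilon$ outside the square root and the sign convention for $\Delta_{t}$ are assumptions:

$$
\begin{aligned}
m_{t} &= \mu_{2}\, m_{t-1} + (1-\mu_{2})\, g_{t}^{2},\\
\Delta_{t} &= \mu_{1}\, \Delta_{t-1} - \left(\alpha_{\text{SGD}} + \frac{\alpha_{\text{RMSprop}}}{\sqrt{m_{t}}+\varepsilon}\right) g_{t},\\
\theta_{t} &= \theta_{t-1} + \eta\, \Delta_{t}.
\end{aligned}
$$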

Here, $t$ denotes the index of the current iteration. The weights, gradients, momentum, and moving average of the second moment of the gradient at the $i$-th iteration are represented by $\theta_{i}$, $g_{i}$, $\Delta_{i}$, and $m_{i}$, respectively. The inputs are $g_{t}$, $\theta_{t-1}$, $\Delta_{t-1}$, and $m_{t-1}$, and the outputs are $\theta_{t}$, $\Delta_{t}$, and $m_{t}$. The hyperparameters are $\eta$, $\mu_{1}$, $\mu_{2}$, $\varepsilon$, $\alpha_{\text{SGD}}$, and $\alpha_{\text{RMSprop}}$: $\eta$ is the learning rate, $\mu_{1}$ determines the amount of momentum, $\mu_{2}$ is the coefficient for the moving average of the gradient second moment, and $\varepsilon$ is a small number added for numerical stability. We used $\mu_{1}=0.9$, $\mu_{2}=0.99$, and $\varepsilon=10^{-8}$ throughout our experiments. The parameters $\alpha_{\text{SGD}}$ and $\alpha_{\text{RMSprop}}$ determine the balance between momentum SGD and RMSprop: when $\alpha_{\text{RMSprop}}=0$, the rule corresponds to standard momentum SGD, and when $\alpha_{\text{SGD}}=0$, it matches RMSprop.
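The two limiting cases can be checked numerically. The sketch below is a scalar, single-parameter version of one plausible form of the update, written from the symbol definitions above; the exact form (in particular where $\varepsilon$ enters and the sign of $\Delta_{t}$) is an assumption, and `hybrid_step` is a hypothetical name.

```python
import math


def hybrid_step(theta, grad, delta, m, eta, alpha_sgd, alpha_rmsprop,
                mu1=0.9, mu2=0.99, eps=1e-8):
    """One scalar step of an assumed momentum-SGD / RMSprop blend.

    Returns the updated (theta, delta, m).
    """
    # Moving average of the squared gradient (RMSprop statistic).
    m = mu2 * m + (1.0 - mu2) * grad * grad
    # Momentum buffer; the learning rate is deliberately not folded in here.
    delta = mu1 * delta - (alpha_sgd + alpha_rmsprop / (math.sqrt(m) + eps)) * grad
    # The learning rate is applied only when the weights are updated.
    theta = theta + eta * delta
    return theta, delta, m


# alpha_rmsprop = 0: the step reduces to momentum SGD.
sgd_theta, sgd_delta, sgd_m = hybrid_step(
    1.0, 0.5, 0.0, 0.0, eta=0.1, alpha_sgd=1.0, alpha_rmsprop=0.0)

# alpha_sgd = 0: the gradient is scaled by 1 / (sqrt(m) + eps), as in RMSprop.
rms_theta, rms_delta, rms_m = hybrid_step(
    1.0, 0.5, 0.0, 0.0, eta=0.1, alpha_sgd=0.0, alpha_rmsprop=1.0)
```

Keeping $\eta$ out of the momentum buffer means that a change in the learning rate rescales the whole accumulated update rather than only the newest gradient.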

We start with RMSprop (i.e., $\alpha_{\text{SGD}}\approx 0$), and then smoothly switch to SGD (i.e., $\alpha_{\text{SGD}}=1$). For the transition schedule, we use a function that is similar to the exponential linear unit (ELU) activation function defined as follows:
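One function that matches the behavior described in this section is given below, with $x$ denoting the epoch. The exponential rate $2/\beta_{\text{period}}$ is an assumption, chosen so that the slope is continuous at $x=\beta_{\text{center}}$ as it is for ELU:

$$
\alpha_{\text{SGD}}(x) =
\begin{cases}
\frac{1}{2}\exp\!\left(\dfrac{2\,(x-\beta_{\text{center}})}{\beta_{\text{period}}}\right) & \text{if } x < \beta_{\text{center}},\\[2ex]
\dfrac{1}{2} + \dfrac{x-\beta_{\text{center}}}{\beta_{\text{period}}} & \text{if } \beta_{\text{center}} \le x < \beta_{\text{center}}+\frac{1}{2}\beta_{\text{period}},\\[2ex]
1 & \text{otherwise.}
\end{cases}
$$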

Here, $\beta_{\text{center}}$ and $\beta_{\text{period}}$ are hyperparameters. First, $\alpha_{\text{SGD}}$ increases exponentially. At the $\beta_{\text{center}}$-th epoch, $\alpha_{\text{SGD}}$ reaches $\frac{1}{2}$. After that, it increases linearly until epoch $\beta_{\text{center}}+\frac{1}{2}\beta_{\text{period}}$, at which point $\alpha_{\text{SGD}}$ becomes 1, and we keep $\alpha_{\text{SGD}}=1$ for the remainder of the training. We set $\beta_{\text{center}}=10$ and $\beta_{\text{period}}=5$ throughout our experiments.
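The schedule can be evaluated at the epochs named above. This is a minimal sketch: the exponential rate is an assumption chosen to make the slope continuous at $\beta_{\text{center}}$, and the function name `alpha_sgd` is hypothetical.

```python
import math


def alpha_sgd(epoch: float, beta_center: float = 10.0,
              beta_period: float = 5.0) -> float:
    """ELU-like transition weight for SGD at a (possibly fractional) epoch."""
    if epoch < beta_center:
        # Exponential phase: close to 0 early on, reaching 1/2 at beta_center.
        return 0.5 * math.exp(2.0 * (epoch - beta_center) / beta_period)
    # Linear phase, capped at 1 from beta_center + beta_period / 2 onward.
    return min(1.0, 0.5 + (epoch - beta_center) / beta_period)


# Values at the start, at beta_center, at the end of the ramp, and afterwards.
schedule = {epoch: alpha_sgd(epoch) for epoch in (0, 10, 12.5, 50)}
```

With $\beta_{\text{center}}=10$ and $\beta_{\text{period}}=5$, the weight stays below 0.01 at epoch 0, equals $\frac{1}{2}$ at epoch 10, and is fixed at 1 from epoch 12.5 onward.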

We used $\eta_{\text{RMSprop}}=0.0003$ for the learning rate of RMSprop. Let $\eta_{\text{SGD}}$ be the learning rate of SGD, which is discussed in the next subsection. To accommodate the different learning rates of SGD and RMSprop, we set $\eta=\eta_{\text{SGD}}$ and $\alpha_{\text{RMSprop}}=(1-\alpha_{\text{SGD}})\eta_{\text{RMSprop}}/\eta_{\text{SGD}}$. One might think that the rule would be simpler if we multiplied $\alpha_{\text{SGD}}$ by $\eta_{\text{SGD}}$ beforehand, but $\Delta_{t}$ must remain independent of the varying learning rate for the momentum correction proposed by Goyal et al.
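A short worked example shows why dividing by $\eta_{\text{SGD}}$ gives RMSprop its own learning rate. The values below assume the midpoint of the transition ($\alpha_{\text{SGD}}=\frac{1}{2}$) and $\eta_{\text{SGD}}=0.5\cdot 12.8$, the slow-start value for the early epochs; the function name is hypothetical.

```python
def alpha_rmsprop(alpha_sgd: float, eta_sgd: float,
                  eta_rmsprop: float = 0.0003) -> float:
    """RMSprop coefficient from the SGD weight and the two learning rates."""
    return (1.0 - alpha_sgd) * eta_rmsprop / eta_sgd


eta_sgd = 0.5 * 12.8   # SGD learning rate in the early epochs
a_sgd = 0.5            # SGD weight at the midpoint of the transition
a_rms = alpha_rmsprop(a_sgd, eta_sgd)

# The global learning rate eta = eta_sgd cancels in the product, so the
# RMSprop term is effectively scaled by (1 - a_sgd) * eta_rmsprop.
effective_rmsprop_lr = eta_sgd * a_rms
```

The product $\eta\,\alpha_{\text{RMSprop}}$ does not depend on $\eta_{\text{SGD}}$, so the RMSprop contribution is governed by $\eta_{\text{RMSprop}}$ alone and vanishes once $\alpha_{\text{SGD}}=1$.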

A method similar to our RMSprop warm-up is used by Wu et al. for a machine translation task. They use the Adam optimizer at the beginning, then switch to SGD. In our preliminary experiments, we found that RMSprop performs better for our task. In addition, Wu et al. switch abruptly from Adam to SGD, whereas we found that an abrupt transition severely disrupts training and degrades the final results. Therefore, we designed a smooth transition from RMSprop to SGD. We examined a few transition functions, including linear and sigmoid functions. Linear functions have a similar problem at the beginning of the transition. ELU and sigmoid performed similarly, but ELU performed slightly better, so we opted for ELU.

A.2 Slow-Start Learning Rate Schedule

Let $\eta_{\text{base}}$ be the initial learning rate under the linear rule by Goyal et al. Specifically, $\eta_{\text{base}}=0.1\cdot\frac{b_{\text{total}}}{256}=0.1\cdot\frac{nb_{\text{local}}}{256}$, where $n$ is the number of workers, $b_{\text{local}}$ is the local batch size for each worker, and $b_{\text{total}}$ is the total batch size among all workers (i.e., $b_{\text{total}}=nb_{\text{local}}$). In our experiments, $n=1024$ and $b_{\text{local}}=32$, and thus $\eta_{\text{base}}=12.8$. Goyal et al.'s learning rate schedule is as follows: $\eta_{\text{base}}$ for the first 30 epochs, $0.1\cdot\eta_{\text{base}}$ for the next 30 epochs, $0.01\cdot\eta_{\text{base}}$ for the following 20 epochs, and $0.001\cdot\eta_{\text{base}}$ for the last 10 epochs.
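The linear rule and the baseline schedule can be written out as follows. This is a minimal sketch: the function names are hypothetical, and zero-based epoch indexing is an assumption.

```python
def eta_base(n_workers: int, b_local: int,
             ref_lr: float = 0.1, ref_batch: int = 256) -> float:
    """Linear scaling rule: learning rate proportional to the total batch size."""
    return ref_lr * (n_workers * b_local) / ref_batch


def baseline_lr(epoch: int, base: float) -> float:
    """Step schedule of the baseline: 30, 30, 20, and 10 epochs (zero-based)."""
    if epoch < 30:
        return base
    if epoch < 60:
        return 0.1 * base
    if epoch < 80:
        return 0.01 * base
    return 0.001 * base


base = eta_base(n_workers=1024, b_local=32)
lr_at_boundaries = [baseline_lr(e, base) for e in (0, 30, 60, 80)]
```

With 1024 workers and a local batch size of 32, the total batch size is 128 times the reference of 256, which gives the base learning rate of 12.8.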

To overcome the initial optimization difficulty, we used a slow-start schedule; our learning rate for SGD was $0.5\cdot\eta_{\text{base}}$ for the first 40 epochs, $0.075\cdot\eta_{\text{base}}$ for the next 30 epochs, $0.01\cdot\eta_{\text{base}}$ for the following 15 epochs, and $0.001\cdot\eta_{\text{base}}$ for the last 5 epochs.
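For comparison with the baseline, the slow-start schedule can be tabulated the same way. The segment lengths and factors come from the text; the function name, zero-based epoch indexing, and the default base rate of 12.8 are assumptions of this sketch.

```python
# (number of epochs, multiplier of eta_base) for each phase, from the text.
SLOW_START = [(40, 0.5), (30, 0.075), (15, 0.01), (5, 0.001)]


def slow_start_lr(epoch: int, base: float = 12.8) -> float:
    """SGD learning rate at a zero-based epoch under the slow-start schedule."""
    end = 0
    for length, factor in SLOW_START:
        end += length
        if epoch < end:
            return factor * base
    raise ValueError("epoch lies beyond the 90-epoch schedule")


total_epochs = sum(length for length, _ in SLOW_START)
lr_at_boundaries = [slow_start_lr(e) for e in (0, 40, 70, 85)]
```

The four phases still sum to 90 epochs; relative to the baseline, the first phase runs 10 epochs longer at half the learning rate, and the final two phases are shortened.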