MegDet: A Large Mini-Batch Object Detector

Chao Peng, Tete Xiao, Zeming Li, Yuning Jiang, Xiangyu Zhang, Kai Jia, Gang Yu, Jian Sun

Introduction

Tremendous progresses have been made on CNN-based object detection, since seminal work of R-CNN , Fast/Faster R-CNN series , and recent state-of-the-art detectors like Mask R-CNN and RetinaNet . Taking COCO dataset as an example, its performance has been boosted from $19.7$ AP in Fast R-CNN to $39.1$ AP in RetinaNet , in just two years. The improvements are mainly due to better backbone network , new detection framework , novel loss design , improved pooling method , and so on .

A recent trend on CNN-based image classification uses very large min-batch size to significantly speed up the training. For example, the training of ResNet-50 can be accomplished in an hour or even in 31 minutes , using mini-batch size 8,192 or 16,000, with little or small sacrifice on the accuracy. In contract, the mini-batch size remains very small (e.g., 2-16) in object detection literatures. Therefore in this paper, we study the problem of mini-batch size in object detection and present a technical solution to successfully train a large mini-batch size object detector.

What is wrong with the small mini-batch size? Originating from the object detector R-CNN series, a mini-batch involving only $2$ images is widely adopted in popular detectors like Faster R-CNN and R-FCN. Though in state-of-the-art detectors like RetinaNet and Mask R-CNN the mini-batch size is increased to $16$ , which is still quite small compared with the mini-batch size (e.g., 256) used in current image classification. There are several potential drawbacks associated with small mini-batch size. First, the training time is notoriously lengthy. For example, the training of ResNet-152 on COCO takes 3 days, using the mini-bath size 16 on a machine with 8 Titian XP GPUs. Second, training with small mini-batch size fails to provide accurate statistics for batch normalization (BN). In order to obtain a good batch normalization statistics, the mini-batch size for ImageNet classification network is usually set to 256, which is significantly larger than the mini-batch size used in current object detector setting.

Last but not the least, the number of positive and negative training examples within a small mini-batch are more likely imbalanced, which might hurt the final accuracy. Figure 2 gives some examples with imbalanced positive and negative proposals. And Table 1 compares the statistics of two detectors with different mini-batch sizes, at different training epochs on COCO dataset.

What is the challenge to simply increase the min-batch size? As in the image classification problem, the main dilemma we are facing is: the large min-batch size usually requires a large learning rate to maintain the accuracy, according to “equivalent learning rate rule” . But a large learning rate in object detection could be very likely leading to the failure of convergence; if we use a smaller learning rate to ensure the convergence, an inferior results are often obtained.

To tackle the above dilemma, we propose a solution as follows. First, we present a new explanation of linear scaling rule and borrow the “warmup” learning rate policy to gradually increase the learning rate at the very early stage. This ensures the convergence of training. Second, to address the accuracy and convergence issues, we introduce Cross-GPU Batch Normalization (CGBN) for better BN statistics. CGBN not only improves the accuracy but also makes the training much more stable. This is significant because we are able to safely enjoy the rapidly increased computational power from industry.

Our MegDet (ResNet-50 as backbone) can finish COCO training in 4 hours on 128 GPUs, reaching even higher accuracy. In contrast, the small mini-batch counterpart takes 33 hours with lower accuracy. This means that we can speed up the innovation cycle by nearly an order-of-magnitude with even better performance, as shown in Figure 1. Based on MegDet, we secured 1st place of COCO 2017 Detection Challenge.

Our technical contributions can be summarized as:

We give a new interpretation of linear scaling rule, in the context of object detection, based on an assumption of maintaining equivalent loss variance.

We are the first to train BN in the object detection framework. We demonstrate that our Cross-GPU Batch Normalization not only benefits the accuracy, but also makes the training easy to converge, especially for the large mini-batch size.

We are the first to finish the COCO training (based on ResNet-50) in 4 hours, using 128 GPUs, and achieving higher accuracy.

Our MegDet leads to the winning of COCO 2017 Detection Challenge.

Related Work

CNN-based detectors have been the mainstream in current academia and industry. We can roughly divide existing CNN-based detectors into two categories: one-stage detectors like SSD , YOLO and recent Retina-Net , and two-stage detectors like Faster R-CNN , R-FCN and Mask-RCNN .

For two-stage detectors, let us start from the R-CNN family. R-CNN was first introduced in 2014. It employs Selective Search to generate a set of region proposals and then classifies the warped patches through a CNN recognition model. As the computation of the warp process is intensive, SPPNet improves the R-CNN by performing classification on the pooled feature maps based on a spatial pyramid pooling rather than classifying on the resized raw images. Fast-RCNN simplifies the Spatial Pyramid Pooling (SPP) to ROIPooling. Although reasonable performance has been obtained based on Fast-RCNN, it still replies on traditional methods like selective search to generate proposals. Faster-RCNN replaces the traditional region proposal method with the Region Proposal Network (RPN), and proposes an end-to-end detection framework. The computational cost of Faster-RCNN will increase dramatically if the number of proposals is large. In R-FCN , position-sensitive pooling is introduced to obtain a speed-accuracy trade-off. Recent works are more focusing on improving detection performance. Deformable ConvNets uses the learned offsets to convolve different locations of feature maps, and forces the networks to focus on the objects. FPN introduces the feature pyramid technique and makes significant progress on small object detection. As FPN provides a good trade-off between accuracy and implementation, we use it as the default detection framework. To address the alignment issue, Mask R-CNN introduces the ROIAlign and achieves state-of-the-art results for both object detection and instance segmentation.

Different from two-stage detectors, which involve a proposal and refining step, one-stage detectors usually run faster. In YOLO , a convolutional network is followed with a fully connected layer to obtain classification and regression results based on a $7\times 7$ grid. SSD presents a fully convolutional network with different feature layers targeting different anchor scales. Recently, RetinaNet is introduced in based on the focal loss, which can significantly reduce false positives in one-stage detectors.

Large mini-batch training has been an active research topic in image classification. In , imagenet training based on ResNet50 can be finished in one hour. presents a training setting which can finish the ResNet50 training in 31 minutes without losing classification accuracy. Besides the training speed, investigates the generalization gap between large mini-batch and small mini-batch, and propose the novel model and algorithm to eliminate the gap. However, the topic of large mini-batch training for object detection is rarely discussed so far.

Approach

In this section, we present our Large Mini-Batch Detector (MegDet), to finish the training in less time while achieving higher accuracy.

The early generation of CNN-based detectors use very small mini-batch size like $2$ in Faster-RCNN and R-FCN. Even in state-of-the-art detectors like RetinaNet and Mask R-CNN, the batch size is set as $16$ . There exist a few problems when training with a small mini-batch size. First, we have to pay much longer training time if a small mini-batch size is utilized for training. As shown in Figure 1, the training of a ResNet-50 detector based on a mini-batch size of 16 takes more than 30 hours. With the original mini-batch size $2$ , the training time could be more than one week. Second, in the training of detector, we usually fix the statistics of Batch Normalization and use the pre-computed values on ImageNet dataset, since the small mini-batch size is not applicable to re-train the BN layers. It is a sub-optimal tradeoff since the two datasets, COCO and ImageNet, are much different. Last but not the least, the ratio of positive and negative samples could be very imbalanced. In Table 1, we provide the statistics for the ratio of positive and negative training examples. We can see that a small mini-batch size leads to more imbalanced training examples, especially at the initial stage. This imbalance may affect the overall detection performance.

As we discussed in the introduction, simply increasing the mini-batch size has to deal with the tradeoff between convergence and accuracy. To address this issue, we first discuss the learning rate policy for the large mini-batch.

2 Learning Rate for Large Mini-Batch

The learning rate policy is strongly related to the SGD algorithm. Therefore, we start the discussion by first reviewing the structure of loss for object detection network,

where $N$ is the min-batch size, $l(x,w)$ is the task specific loss and $l(w)$ is the regularization loss. For Faster R-CNN framework and its variants , $l(x_{i},w)$ consists of RPN prediction loss, RPN bounding-box regression loss, prediction loss, and bounding box regression loss.

According to the definition of mini-batch SGD, the training system needs to compute the gradients with respect to weights $w$ , and updates them after every iteration. When the size of mini-batch changes, such as $\hat{N}\leftarrow k\cdot N$ , we expect that the learning rate $r$ should also be adapted to maintain the efficiency of training. Previous works use Linear Scaling Rule, which changes the new learning rate to $\hat{r}\leftarrow k\cdot r$ . Since one step in large mini-batch $\hat{N}$ should match the effectiveness of $k$ accumulative steps in small mini-batch $N$ , the learning rate $r$ should be also multiplied by the same ratio $k$ to counteract the scaling factor in loss. This is based on a gradient equivalence assumption in the SGD updates. This rule of thumb has been well-verified in image classification, and we find it is still applicable for object detection. However, the interpretation is is different for a weaker and better assumption.

In image classification, every image has only one annotation and $l(x,w)$ is a simple form of cross-entropy. As for object detection, every image has different number of box annotations, resulting in different ground-truth distribution among images. Considering the differences between two tasks, the assumption of gradient equivalence between different mini-batch sizes might be less likely to be hold in object detection. So, we introduce another explanation based on the following variance analysis.

Variance Equivalence. Different from the gradient equivalence assumption, we assume that the variance of gradient remain the same during $k$ steps. Given the mini-batch size $N$ , if the gradient of each sample $\nabla l(x_{i},w)$ obeying i.i.d., the variance of gradient on $l(x,w)$ is:

Similarly, for the large mini-batch $\hat{N}=k\cdot N$ , we can get the following expression:

Instead of expecting equivalence on weight update, here we want to maintain the variance of one update in large mini-batch $\hat{N}$ equal to $k$ accumulative steps in small mini-batch $N$ . To achieve this, we have:

Within Equation (2) and (3), the above equality holds if and only if $\hat{r}=k\cdot r$ , which gives the same linear scaling rule for $\hat{r}$ .

Although the final scaling rule is the same, our variance equivalence assumption on Equation (4) is weaker because we just expect that the large mini-batch training can maintain equivalent statistics on the gradients. We hope the variance analysis here can shed light on deeper understanding of learning rate in wider applications.

Warmup Strategy. As discussed in , the linear scaling rule may not be applicable at the initial stage of the training, because the weights changing are dramatic. To address this practical issue, we borrow Linear Gradual Warmup in . That is, we set up the learning rate small enough at the beginning, such as $r$ . Then, we increase the learning rate with a constant speed after every iteration, until to $\hat{r}$ .

The warmup strategy can help the convergence. But as we demonstrated in the experiments later, it is not enough for larger mini-batch size, e.g., 128 or 256. Next, we introduce the Cross-GPU Batch Normalization, which is the main workhorse of large mini-batch training.

3 Cross-GPU Batch Normalization

Batch Normalization is an important technique for training a very deep convolutional neural network. Without batch normalization, training such a deep network will consume much more time or even fail to converge. However, previous object detection frameworks, such as FPN , initialize models with an ImageNet pre-trained model, after which the batch normalization layer is fixed during the whole fine-tuning procedure. In this work, we make an attempt to perform batch normalization for object detection.

It is worth noting that the input image of classification network is often $224\times{}224$ or $299\times{}299$ , and a single NVIDIA TITAN Xp GPU with 12 Gigabytes memory is enough for 32 or more images. In this way, batch normalization can be computed on each device alone. However, for object detection, a detector needs to handle objects of various scales, thus higher resolution images are needed as its input. In , input of size $800\times{}800$ is used, significantly limiting the number of possible samples on one device. Thus, we have to perform batch normalization crossing multiple GPUs to collect sufficient statistics from more samples.

To implement batch normalization across GPUs, we need to compute the aggregated mean/variance statistics over all devices. Most existing deep learning frameworks utilize the BN implementation in cuDNN that only provides a high-level API without permitting modification of internal statistics. Therefore we need to implement BN in terms of preliminary mathematical expressions and use an “AllReduce” operation to aggregate the statistics. These fine-grained expressions usually cause significant runtime overhead and the AllReduce operation is missing in most frameworks.

Our implementation of Cross-GPU Batch Normalization is sketched in Figure 3. Given $n$ GPU devices in total, sum value $s_{k}$ is first computed based on the training examples assigned to the device $k$ . By averaging the sum values from all devices, we obtain the mean value $\mu_{\mathcal{B}}$ for current mini-batch. This step requires an AllReduce operation. Then we calculate the variance for each device and get $\sigma^{2}_{\mathcal{B}}$ . After broadcasting $\sigma^{2}_{\mathcal{B}}$ to each device, we can perform the standard normalization by $y=\gamma\frac{x-\mu_{\mathcal{B}}}{\sqrt{\sigma^{2}_{\mathcal{B}}+\epsilon}}+\beta$ . Algorithm 1 gives the detailed flow. In our implementation, we use NVIDIA Collective Communication Library (NCCL) to efficiently perform AllReduce operation for receiving and broadcasting.

Note that we only perform BN across GPUs on the same machine. So, we can calculate BN statistics on 16 images if each GPU can hold 2 images. To perform BN on 32 or 64 images, we apply sub-linear memory to save the GPU memory consumption by slightly compromising the training speed.

In next section, our experimental results will demonstrate the great impacts of CGBN on both accuracy and convergence.

Experiments

We conduct experiments on COCO Detection Dataset , which is split into train, validation, and test, containing 80 categories and over $250,000$ images. We use ResNet-50 pre-trained on ImageNet as the backbone network and Feature Pyramid Network (FPN) as the detection framework. We train the detectors over 118,000 training images and evaluate on 5000 validation images. We use the SGD optimizer with momentum 0.9, and adopts the weight decay 0.0001. The base learning rate for mini-batch size 16 is $0.02$ . For other settings, the linear scaling rule described in Section 3.2 is applied. As for large mini-batch, we use the sublinear memory and distributed training to remedy the GPU memory constraints.

We have two training policies in following: 1) normal, decreasing the learning rate at epoch 8 and 10 by multiplying scale 0.1, and ending at epoch 11; 2) long, decreasing the learning rate at epoch 11 and 14 by multiplying scale 0.1, halving the learning rate at epoch 17, and ending at epoch 18. Unless specified, we use the normal policy.

We start our study through the different mini-batch size settings, without batch normalization. We conduct the experiments with mini-batch size 16, 32, 64, and 128. For mini-batch sizes 32, we observed that the training has some chances to fail, even we use the warmup strategy. For mini-batch size 64, we are not able to manage the training to converge even with the warmup. We have to lower the learning rate by half to make the training to converge. For mini-batch size 128, the training failed with both warmup and half learning rate. The results on COCO validation set are shown in Table 2. We can observe that: 1) mini-batch size 32 achieved a nearly linear acceleration, without loss of accuracy, compared with the baseline using 16; 2) lower learning rate (in mini-batch size 64) results in noticeable accuracy loss; 3) the training is harder or even impossible when the mini-batch size and learning rate are larger, even with the warmup strategy.

2 Large mini-batch size, with CGBN

This part of experiments is trained with batch normalization. Our first key observation is that all trainings easily converge, no matter of the mini-batch size, when we combine the warmup strategy and CGBN. This is remarkable because we do not have to worry about the possible loss of accuracy caused by using smaller learning rate.

The main results are summarized in Table 3. We have the following observations. First, within the growth of mini-batch size, the accuracy almost remains the same level, which is consistently better than the baseline (16-base). In the meanwhile, a larger mini-batch size always leads to a shorter training cycle. For instance, the 256 mini-batch experiment with 128 GPUs finishes the COCO training only in 4.1 hours, which means a $8\times$ acceleration compared to the $33.2$ hours baseline.

Second, the best BN size (number of images for BN statistics) is 32. With too less images, e.g. 2, 4, or 8, the BN statistics are very inaccurate, thus resulting a worse performance. However, when we increase the size to 64, the accuracy drops. This demonstrates the mismatch between image classification and object detection tasks.

Third, in the last part of Table 3, we investigate the long training policy. Longer training time slightly boots the accuracy. For example, “32 (long)” is better that its counterpart (37.8 v.s. 37.3). When the mini-batch size is larger than 16, the final results are very consist, which indicates the true convergence.

Last, we draw epoch-by-epoch mmAP curves of 16 (long) and 256 (long) in Figure 4. 256 (long) is worse at early epochs but catches up 16 (long) at the last stage (after second learning rate decay). This observation is different from those in image classification , where both the accuracy curves and convergent scores are very close between different mini-batch size settings. We leave the understanding of this phenomenon as the future work.

Concluding Remarks

We have presented a large mini-batch size detector, which achieved better accuracy in much shorter time. This is remarkable because our research cycle has been greatly accelerated. As a result, we have obtained 1st place of COCO 2017 detection challenge. The details are in Appendix.

Appendix

Based on our MegDet, we integrate the techniques including OHEM , atrous convolution , stronger base models , large kernel , segmentation supervision , diverse network structure , contextual modules , ROIAlign and multi-scale training and testing for COCO 2017 Object Detection Challenge. We obtained 50.5 mmAP on validation set, and 50.6 mmAP on the test-dev. The ensemble of four detectors finally achieved 52.5. Table 4 summarizes the entries from the leaderboard of COCO 2017 Challenge. Figure 5 gives some exemplar results.