SM-NAS: Structural-to-Modular Neural Architecture Search for Object Detection

Lewei Yao, Hang Xu, Wei Zhang, Xiaodan Liang, Zhenguo Li

Introduction

Real-time object detection is a core and challenging task: localizing and recognizing objects in an image on a given device. It benefits applications such as autonomous driving, video surveillance, and facial recognition on mobile phones, to name a few. A state-of-the-art detection system usually consists of four modules: backbone, feature fusion neck, region proposal network (in two-stage detection), and RCNN head. Recent progress in this area has produced various designs for each module: backbone, region proposal network, feature fusion neck, and RCNN head.

However, how to select the best combination of modules under hardware resource constraints remains unknown. This problem draws much attention from industry because, in practice, manually adjusting each module of a standard detection model is inefficient and sub-optimal. It is hard to weigh the trade-off between inference time and accuracy, as well as the representation capacity of each module, across different datasets. For instance, we found empirically that the combination of Cascade-RCNN with ResNet18 (not a standard detection model) is even faster and more accurate than FPN with ResNet50 on COCO and BDD (an autonomous driving dataset). However, this is not true on VOC.

There has been a growing trend of automatically designing neural network architectures instead of relying heavily on human effort and experience. For image classification, searched networks surpass the performance of hand-crafted networks. For the detection task, existing NAS works focus on optimizing a single component of the detection system instead of considering the whole system. For example, one line of work only transfers the architecture searched on the classification task (ImageNet) to the detector backbone. DetNAS searches for better backbones on a pre-trained super-net for object detection. NAS-FPN, Auto-FPN, and NAS-FCOS use NAS to find a better feature fusion neck and a more powerful RCNN head. However, those pipelines only partially solve the problem by changing one component while neglecting the balance and efficiency of the whole system. In contrast, our work aims to develop a multi-objective NAS scheme specifically designed to find an optimal and efficient whole architecture.

In this work, we make the first effort to search the whole structure of object detectors. By investigating state-of-the-art designs, we found that three factors are crucial for the performance of a detection system: 1) the size of the input images; 2) the combination of modules in the detector; 3) the architecture within each module. To find an optimal trade-off between inference time and accuracy over these three factors, we propose a coarse-to-fine searching strategy: 1) the structural-level searching stage (Stage-one) first aims to find an efficient combination of different modules as well as the model-matching input sizes; 2) the modular-level search stage (Stage-two) then evolves each specific module and pushes forward to an efficient task-specific network.

We consider a multi-objective search that directly targets GPU devices and outputs a Pareto front showing the optimal detector designs under different resource constraints. During Stage-one, the search space includes different choices of modules to cover many popular one-stage and two-stage detector designs. We also put the input image size into the search space since it greatly impacts latency and accuracy. During Stage-two, we further optimize and evolve the modules (e.g. the backbone) following the optimal combination found in the previous stage. Previous works find that backbones originally designed for the classification task might be sub-optimal for object detection. The resulting modular-level search thus adapts the width and depth of the overall architecture to the detection task. With the improved training strategy, our search can be conducted directly on the detection datasets without ImageNet pre-training. For an efficient search, we combine evolutionary algorithms with the Partial Order Pruning technique, and parallelize the whole searching algorithm in a distributed training system to further speed up the process.

Extensive experiments are conducted on widely used detection benchmarks, including Pascal VOC, COCO, and BDD. As shown in Figure 1, SM-NAS yields a state-of-the-art speed/accuracy trade-off and outperforms existing detection methods, including FPN, Cascade-RCNN, and the most recent work NAS-FPN. Compared to FPN, our E2 halves the inference time while improving mAP by an additional 1%. E5 reaches 46% mAP with an inference time similar to that of MaskRCNN (mAP: 39.4%).

To sum up, we make the following contributions to NAS for detection:

We are among the first to investigate the speed/accuracy trade-off of an object detection system across different combinations of modules.

We develop a coarse-to-fine searching strategy that decouples the search into a structural level and a modular level to efficiently lift the Pareto front. The searched models reach state-of-the-art speed/accuracy, dominating existing methods by a large margin.

We make the first attempt to directly search a detection backbone without pre-trained models or any proxy task by exploring a fast training-from-scratch strategy.

Related Work

Object Detection. Object detection is a core problem in computer vision. State-of-the-art anchor-based detection approaches usually consist of four modules: backbone, feature fusion neck, region proposal network (in two-stage detectors), and RCNN head. Most previous progress focuses on developing better architectures for each module. For example, some works develop a backbone for detection; FPN and PANet modify the multi-level feature fusion module; others try to make the RPN more powerful. On the other hand, R-FCN and Light-head RCNN design different structures for the bbox head. However, the community lacks literature comparing the efficiency and performance of different combinations of modules.

Neural Architecture Search. NAS aims at automatically finding an efficient neural network architecture for a given task and dataset without the labor of manual network design. Most works search CNN architectures for image classification, while only a few focus on more complicated vision tasks such as semantic segmentation and detection. There are three main categories of searching strategies in NAS: 1) Reinforcement learning (RL) based methods train an RNN policy controller to generate a sequence of actions that specifies a CNN architecture; 2) Evolutionary Algorithm (EA) based methods and Network Morphism try to “evolve” architectures by mutating the current best architectures; 3) Gradient based methods define an architecture parameter for continuous relaxation of the discrete search space, thus allowing differentiable optimization of the architecture. Among those approaches, gradient based methods are fast but not so reliable, since weight sharing creates a big gap between searching and final training. RL methods usually require massive numbers of samples to converge, which is not practical for detection. Thus we use an EA based method in this paper.

The Proposed Approach

With preliminary empirical experiments, we have found some interesting facts:

1) A one-stage detector is not always faster than a two-stage detector. Although RetinaNet is faster than FPN on VOC (Exp 3&4), it is slower and less accurate than FPN on COCO (Exp 1&2).

2) A reasonable combination of modules and input resolution can lead to an efficient detection system. Generally, Cascade-RCNN is slower than FPN with the same backbone since it has 2 more cascade heads. However, with a better combination of modules and input resolution, Cascade-RCNN with ResNet50 can be faster and more accurate than FPN with ResNet101 (Exp 5 & 6).

These observations show that customizing the modules and the input size is crucial for a real-time object detection system on task-specific datasets. Thus we present SM-NAS for searching an efficient combination of modules and a better modular-level architecture for object detection.

2 NAS Pipeline

As shown in Figure 2, we propose a coarse-to-fine searching pipeline: 1) the structural-level searching stage first aims to find an efficient combination of different modules; 2) the modular-level search stage then evolves each specific module and pushes forward to a faster task-specific network. Moreover, we explore a strategy of fast training from scratch for the detection task, which makes it possible to directly search a detection backbone without pre-trained models or any proxy task.

Modern object detection systems can be decoupled into four components: backbone, feature fusion neck, region proposal network (RPN), and RCNN head. We put popular and recent choices for each module into the search space to cover many popular designs.

Backbone. Commonly used backbones are included in the search space: ResNet (ResNet18, ResNet34, ResNet50 and ResNet101), ResNeXt (ResNeXt50, ResNeXt101) and MobileNet V2. During Stage-one, we load backbones pre-trained on ImageNet for fast convergence.

Feature fusion neck. Features from different layers are commonly used to predict objects of various sizes. The feature fusion neck fuses these features for better prediction. Here, we use $\{P_{1}, P_{2}, P_{3}, P_{4}\}$ to denote the feature levels generated by the backbone, e.g. ResNet. From $P_{1}$ to $P_{4}$, the spatial size is gradually down-sampled by a factor of 2. We further add two smaller feature maps, $P_{5}$ and $P_{6}$, down-sampled from $P_{4}$ following RetinaNet. The search space contains: no FPN (the original Faster RCNN setting) and FPN with different choices of input and output feature levels (ranging from $P_{1}$ to $P_{6}$).

Region proposal network (RPN). The RPN generates multiple foreground anchor proposals within each feature map and only exists in two-stage detectors. Our search space is chosen to be: no RPN (one-stage detectors); with RPN; with Guided Anchoring RPN.

RCNN head. The RCNN head refines object locations and predicts the final classification results. Cascade RCNN heads were proposed to iteratively refine the detection results, which has proved useful but requires more computational resources. Thus, we consider the regular RCNN head, the RetinaNet head, and cascade RCNN heads with different numbers of heads (2 to 4) as our search space to examine the accuracy/speed trade-off. Note that our search space covers both one-stage and two-stage detection systems.

Input Resolution. Furthermore, the input resolution is closely related to accuracy and speed. Prior work also suggested that the input resolution should match the capability of the backbone, which is not measurable in practice. We thus add the input resolution to our search space to find the best match for different models: 512x512, 800x600, 1080x720 and 1333x800.

Inference time is then evaluated for each combination of modules. Together with the accuracy on validation dataset, a Pareto front is then generated showing the optimal structures of the detector under different resource constraints.
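To make the structure of the Stage-one search space concrete, the following is a minimal sketch that enumerates module combinations from the option lists above. The string labels, the `Structure` type, and the validity rule (a detector without RPN is treated as one-stage and paired with the RetinaNet head) are illustrative assumptions, not the authors' implementation. The FPN input/output level choices are omitted, so this toy space is far smaller than the one searched in the paper.

```python
from dataclasses import dataclass
from itertools import product

# Option lists follow the text; the labels themselves are hypothetical.
BACKBONES = ["ResNet18", "ResNet34", "ResNet50", "ResNet101",
             "ResNeXt50", "ResNeXt101", "MobileNetV2"]
NECKS = ["no_fpn", "fpn"]  # FPN level choices are left out of this sketch
RPNS = ["none", "rpn", "ga_rpn"]
HEADS = ["rcnn", "retinanet", "cascade_2", "cascade_3", "cascade_4"]
RESOLUTIONS = [(512, 512), (800, 600), (1080, 720), (1333, 800)]


@dataclass(frozen=True)
class Structure:
    backbone: str
    neck: str
    rpn: str
    head: str
    resolution: tuple


def is_valid(s):
    # Assumption: "no RPN" means a one-stage detector with a RetinaNet head,
    # and every two-stage head requires some form of RPN.
    one_stage = s.rpn == "none"
    return one_stage == (s.head == "retinanet")


def enumerate_structures():
    """List every valid combination of the options above."""
    combos = product(BACKBONES, NECKS, RPNS, HEADS, RESOLUTIONS)
    return [s for s in (Structure(*c) for c in combos) if is_valid(s)]


if __name__ == "__main__":
    space = enumerate_structures()
    print(len(space), "candidate structures, for example:", space[0])
```

Each candidate produced this way would then be trained briefly and timed, giving one (inference time, accuracy) point for the Pareto front.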

2.2 Stage-two: Modular-level Search

On the Pareto front generated by Stage-one, we can pick several efficient detection structures with different combinations of modules. Then in Stage-two, we search the detailed architecture of each module and lift the speed/accuracy trade-off boundary of the selected structures.

Prior work suggested that, in a detection backbone, early-stage feature maps are larger and carry low-level features that describe spatial details, while late-stage feature maps are smaller and carry high-level features that are more discriminative. The localization subtask is sensitive to low-level features, while high-level features are crucial for classification. Thus, a natural question is how to allocate the computational cost over different stages to obtain an optimal design for detection. Therefore, inside the backbone, we design a flexible search space to find the optimal base channel size, as well as the positions of down-sampling and channel-raising.

As shown in Figure 2, the Stage-two backbone search space consists of 5 stages, each of which refers to a group of convolutional blocks fed by features of the same resolution. The spatial size from stage 1 to stage 5 is gradually down-sampled by a factor of 2. As suggested in prior work, we fix stage 1 and the first layer of stage 2 to be a 3x3 conv (stride=2). We use the same block setting (basic/bottleneck residual block, ResNeXt block or MBblock) as the structures selected from the result of Stage-one. For example, if the candidate model selected from Stage-one’s Pareto front uses ResNet101 as the backbone, we use the corresponding bottleneck residual block in its search space.

Furthermore, the backbone architecture is encoded as a string such as “basicblock 54 1211-211-1111-12111”, where the first placeholder encodes the block setting; 54 is the base channel size; “-” separates stages with different resolutions; “1” means a regular block with no change of channels; and “2” indicates that the number of base channels is doubled in this block. The base channel size is chosen from $\{48,56,64,72\}$. Since no pre-trained model is available for customized backbones, we use a fast train-from-scratch technique instead, which is elaborated in the next section.
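The encoding can be decoded mechanically. The sketch below parses the example string and derives the channel width after every block. It assumes the three fields are separated by whitespace exactly as printed here; the names `BackboneSpec`, `parse_encoding`, and `block_channels` are hypothetical helpers introduced for illustration.

```python
from dataclasses import dataclass


@dataclass
class BackboneSpec:
    block: str          # block setting, e.g. "basicblock"
    base_channels: int  # base channel size
    stages: list        # one list of 1/2 flags per stage


def parse_encoding(code):
    """Split an encoding string into block type, base width, and stage layout."""
    block, base, layout = code.split()
    stages = [[int(ch) for ch in stage] for stage in layout.split("-")]
    if any(flag not in (1, 2) for stage in stages for flag in stage):
        raise ValueError("each block flag must be 1 (keep) or 2 (double)")
    return BackboneSpec(block, int(base), stages)


def block_channels(spec):
    """Channel width after every block; a flag of 2 doubles the width."""
    width = spec.base_channels
    widths = []
    for stage in spec.stages:
        row = []
        for flag in stage:
            if flag == 2:
                width *= 2
            row.append(width)
        widths.append(row)
    return widths


if __name__ == "__main__":
    spec = parse_encoding("basicblock 54 1211-211-1111-12111")
    print(sum(len(s) for s in spec.stages), "blocks")
    for row in block_channels(spec):
        print(row)
```

For the example string, the three "2" flags double the base width three times, so the last blocks run at 8 times the base channel size.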

Besides the flexible backbone, we also adjust the channel size of the FPN during the Stage-two search. The input channel size is chosen from $\{128,256,512\}$ and the channels of the head are adjusted correspondingly. Thus, the objective of Stage-two is to further refine the detailed modular structure of the selected efficient architectures.

3 Train from scratch and fast evaluate the architecture

Most detection models require the backbone to be initialized from ImageNet pre-trained models during training. Any modification of the backbone structure requires training again on ImageNet, which makes it harder to evaluate the performance of a customized backbone. This paradigm hinders the development of efficient NAS for the detection problem. Earlier work first explores the possibility of training a detector from scratch with deeply supervised networks and dense connections. Later works, including ScratchDet, find that normalization plays a significant role in training from scratch and that longer training can then help to catch up with pre-trained counterparts. Inspired by those works, we conjecture that the difficulty comes from two factors and try to fix them:

1) Inaccurate Batch Normalization because of smaller batch size: During training, the batch size is usually very small because of high GPU memory consumption, which leads to inaccurate estimation of the batch statistics and increases the model error dramatically. To alleviate this problem, we use Group Normalization (GN) instead of standard BN since GN is not sensitive to the batch size.

2) Complexity of the loss landscape: Prior work suggested that the multiple losses and the ROI pooling layer in detection hinder region-level gradients from propagating back to the backbone. Significant loss jitter or gradient explosion is often observed when training from scratch. BN has been shown to be an effective solution to this problem because it significantly smooths the optimization landscape. Instead of using BN, which is not suitable for small-batch training, we adopt Weight Standardization (WS) for the weights in the convolution layers to further smooth the loss landscape.

Experiments in the later section show that with GN and WS, a much larger learning rate can be adopted, thus enabling us to train a detection network from scratch even faster than the pre-trained counterparts.
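As an illustration of what Weight Standardization does to a convolution layer, the sketch below normalizes each output-channel filter to zero mean and unit variance over its fan-in before the weights are used. It is a dependency-free toy on plain Python lists; the function name, the flattened weight layout, and the `eps` value are assumptions made for this example rather than details taken from the paper.

```python
import math


def standardize_weights(weight, eps=1e-5):
    """Standardize each output-channel filter over its fan-in.

    weight: list of filters, one per output channel, each given as a flat
            list of floats (in_channels * kernel_h * kernel_w values).
    """
    standardized = []
    for filt in weight:
        n = len(filt)
        mean = sum(filt) / n
        var = sum((w - mean) ** 2 for w in filt) / n
        scale = math.sqrt(var + eps)  # eps guards against constant filters
        standardized.append([(w - mean) / scale for w in filt])
    return standardized


if __name__ == "__main__":
    raw = [[1.0, 2.0, 3.0, 4.0], [10.0, 10.0, 10.0, 12.0]]
    for filt in standardize_weights(raw):
        print([round(w, 3) for w in filt])
```

Because the statistics are computed over the weights rather than over the activations of a mini-batch, the operation does not depend on the batch size, which is the property that matters when only a few images fit on each GPU.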

4 Multi-objective Searching Algorithm

For each stage, we aim at generating a Pareto front showing the optimal trade-off between accuracy and different computation constraints. To generate the Pareto front, we use non-dominated sorting to determine whether one model dominates another in terms of both efficiency and accuracy. In Stage-one, we use the inference time on one V100 GPU as the efficiency metric to roughly compare the actual performance of different structures. In Stage-two, we use FLOPs instead of actual time, since FLOPs are more reliable than inference time for comparing different backbones built from the same kind of block (the inference time varies somewhat with the GPU condition). Moreover, FLOPs keep the ranking consistent when BN is changed to GN+WS during searching in Stage-two.
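The dominance test behind the Pareto front can be sketched in a few lines. In this minimal example each model is a `(cost, mAP)` pair, where cost stands for inference time in Stage-one or FLOPs in Stage-two; the sample numbers are made up for illustration and are not results from the paper.

```python
def dominates(a, b):
    """True if model a is no worse than b on both objectives and better on one.

    Each model is a (cost, accuracy) pair: lower cost and higher accuracy win.
    """
    no_worse = a[0] <= b[0] and a[1] >= b[1]
    strictly_better = a[0] < b[0] or a[1] > b[1]
    return no_worse and strictly_better


def pareto_front(models):
    """Return the models that no other model dominates, sorted by cost."""
    front = [m for m in models if not any(dominates(o, m) for o in models)]
    return sorted(front)


if __name__ == "__main__":
    # Hypothetical (inference time in ms, mAP) measurements.
    evaluated = [(20.0, 30.0), (25.0, 33.0), (30.0, 32.0),
                 (40.0, 38.0), (45.0, 37.5)]
    print(pareto_front(evaluated))
```

This quadratic scan is enough for a few hundred evaluated architectures; a full non-dominated sorting would additionally rank the dominated models into successive fronts.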

The architecture search step is based on: 1) an evolutionary algorithm that mutates the best architectures on the Pareto front; 2) the Partial Order Pruning method, which prunes the architecture search space using the prior knowledge that deeper and wider models are more accurate. Our algorithm can be parallelized over multiple computation nodes (each with 8 V100 GPUs) that lift the Pareto front simultaneously.
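The two ingredients can be sketched as follows. The mutation operators (add a block, remove a block, toggle a channel-doubling flag) and the pruning rule are assumptions chosen for illustration; the text does not specify them. The pruning rule encodes the stated prior in a simple form: if an already evaluated architecture missed an accuracy floor, any candidate that is no deeper in every stage and no wider in base channels is skipped without training.

```python
import random


def mutate_layout(layout, rng):
    """Return a copy of a stage layout (lists of 1/2 flags) with one random edit."""
    child = [list(stage) for stage in layout]
    stage = child[rng.randrange(len(child))]
    op = rng.choice(["add", "remove", "toggle"])
    if op == "add":
        stage.insert(rng.randrange(len(stage) + 1), 1)
    elif op == "remove" and len(stage) > 1:
        del stage[rng.randrange(len(stage))]
    else:
        i = rng.randrange(len(stage))
        stage[i] = 3 - stage[i]  # swap 1 <-> 2
    return child


def precedes(a, b):
    """Partial order on (base_channels, per-stage depths): a is no larger than b."""
    (base_a, depths_a), (base_b, depths_b) = a, b
    return (base_a <= base_b
            and len(depths_a) == len(depths_b)
            and all(da <= db for da, db in zip(depths_a, depths_b)))


def can_prune(candidate, failed):
    """Skip a candidate that precedes any architecture known to miss the floor."""
    return any(precedes(candidate, f) for f in failed)


if __name__ == "__main__":
    rng = random.Random(0)
    parent = [[1, 2, 1, 1], [2, 1, 1], [1, 1, 1, 1], [1, 2, 1, 1, 1]]
    print(mutate_layout(parent, rng))
    failed = [(56, (3, 3, 4, 4))]  # hypothetical architecture below the floor
    print(can_prune((48, (3, 2, 4, 4)), failed))
```

The partial order here looks only at depth and base width; a fuller version would also compare where the channel-doubling flags sit, since those positions change the width of later blocks.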

Experiments

We conduct architecture search on the well-known COCO dataset, which contains 80 object classes with 118K images for training and 5K for evaluation. For Stage-one, we consider a total of $1.1\times 10^{4}$ combinations of modules. For Stage-two, the search space is much larger, containing about $5.0\times 10^{12}$ unique paths. We conduct all experiments using PyTorch on multiple computational nodes with 8 V100 cards per server. To measure the inference speed, we run all the testing images on one V100 GPU and take the average inference time for comparison. All experiments are performed under CUDA 9.0 and CUDNN 7.0.

During searching, we first generate some initial models with random combinations of modules. Then the evolutionary algorithm mutates the best architectures on the Pareto front to provide candidate models. During architecture evaluation, we use the SGD optimizer with a cosine-decay learning rate from 0.04 to 0.0001, momentum 0.9, and weight decay $10^{-4}$. Models pre-trained on ImageNet are used as our backbones for fast convergence. Empirically, we found that training for 5 epochs can separate good models from bad models. In this stage, we evaluate about 500 architectures, and the whole searching process takes about 2000 GPU hours.
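The cosine-decay schedule mentioned above can be written as a short function. This sketch uses the common half-cosine form, $\eta_t = \eta_{\min} + \frac{1}{2}(\eta_{\max}-\eta_{\min})(1+\cos(\pi t/T))$, with the endpoints 0.04 and 0.0001 taken from the text; whether the authors step the schedule per iteration or per epoch is not stated, so `step` and `total_steps` are left generic.

```python
import math


def cosine_lr(step, total_steps, lr_max=0.04, lr_min=0.0001):
    """Half-cosine decay from lr_max at step 0 to lr_min at total_steps."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    total = 5  # e.g. one value per epoch of the short Stage-one training
    for step in range(total + 1):
        print(step, round(cosine_lr(step, total), 5))
```

Passing `lr_max=0.24` gives the larger range used with GN and WS in Stage-two.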

Intermediate results for Stage-one. The first two plots in Figure 3 compare the mAP and inference time of the architectures searched on COCO. From Figure 3-1, it can be seen that different input resolutions change both speed and accuracy. We also found that MobileNet V2 is dominated by other models although it has far fewer FLOPs (Figure 3-2). This is because it has a higher memory access cost and is thus slower in practice. Therefore, using the direct metric, i.e. inference time, rather than an approximate metric such as FLOPs is necessary for achieving the best speed/accuracy trade-off. From Figure 3-3, it can be seen that our search already finds some structures that dominate classic detectors. On the generated Pareto front, we pick 6 models (C0 to C5) and further search for better modular-level architectures in Stage-two.

1.2 Implementation Details for Stage-two.

During Stage-two, we use the training strategy with GN and WS discussed in the previous section. We use a cosine-decay learning rate ranging from 0.24 to 0.0001 with a batch size of 8 on each GPU. Each model is trained for 9 epochs to fully explore the different modular-level structures. It is worth mentioning that we search directly on COCO without pre-trained models. In Stage-two, we evaluate about 300 architectures for each group and use about 2500 GPU hours.

Intermediate results for Stage-two. Figure 4 shows mAP/speed improvement of the searched models compared to the optimal model selected in Stage-one. It can be found that SM-NAS can further push the Pareto front to a better trade-off of speed/accuracy.

2 Object Detection Results

On the COCO dataset, the optimal architectures E0 to E5 are identified with our two-stage search. We switch the backbone back to BN without Weight Standardization, since GN and WS slow down inference. We first pre-train the searched backbones on ImageNet, following common practice, for a fair comparison with other methods. Then stochastic gradient descent (SGD) is used to train the full model on 8 GPUs with 4 images on each GPU. Following the 2x schedule setting, the initial learning rate is 0.04 (with a linear warm-up) and is reduced twice (by $\times 0.1$) during fine-tuning; the weight decay is $10^{-4}$ and the momentum is $0.9$. Training and testing are conducted with the searched optimal input resolutions. Image flipping and scale jitter are adopted for augmentation during training, and the evaluation procedure follows the official COCO setting.

Detailed architectures of the final searched models. Table 2 shows the architecture details of the final searched models E0 to E5. Comparing the searched backbones with the classical ResNet/ResNeXt, we find that the early stages in our models are very short, which is more efficient since feature maps in an early stage are very large and computationally expensive. We also found that for the high-performance detectors E3-E5, channel-raising usually happens in a very early stage, which means that lower-level features play an important role in localization. The classification performance of the backbones of E0 to E5 on ImageNet can be found in the supplementary materials. The searched backbones are also efficient on the classification task.

Comparison with the state-of-the-art. In Table 3, we make a detailed comparison with existing detectors: YOLOv3, DSSD, RetinaNet, FSAF, CornerNet, CenterNet, AlignDet, GA-FasterRCNN, Faster-RCNN, Mask-RCNN, Cascade-RCNN, TridentNet, and NAS-FPN. Most reported results are tested on a single V100 GPU (some models are marked with other GPU devices, following the original papers). For a fair comparison, multi-scale testing is not adopted for any method. From E0 to E5, SM-NAS constructs a Pareto front that dominates most state-of-the-art models, as shown in Figure 1, demonstrating that SM-NAS is able to find efficient real-time object detection systems.

Ablative study for strategies of training from scratch. Since the backbone structure keeps changing in the modular-level searching stage, we need to find an optimal set of training strategies for efficiently training a detection network from scratch. Table 4 shows an ablative study of FPN with ResNet-50 trained with different strategies and evaluated on COCO. Exp-0 and Exp-1 follow the standard 1x and 2x FPN training procedures. Comparing Exp-2&3 with Exp-4, it can be seen that a smaller batch size leads to inaccurate batch normalization statistics. Using group normalization alleviates this problem and improves the mAP from 24.8 to 29.4. In Exp-5, adding WS further smooths the training and improves mAP by 1.3. Furthermore, enlarging the learning rate and batch size increases the mAP to 37.5 with 16-epoch training (see Exp 5&6&7). Thus, we can train a detection network from scratch using fewer epochs than the pre-trained counterparts.

Architecture Transfer: VOC and BDD. To evaluate the domain transferability of the searched models, we transfer the searched architectures E0-E3 from COCO to Pascal VOC and BDD. For the Pascal VOC dataset with 20 object classes, training is performed on the union of VOC 2007 trainval and VOC 2012 trainval (10K images) and evaluation is on VOC 2007 test (4.9K images). We only report mAP at an IoU threshold of 0.5. Berkeley Deep Drive (BDD) is an autonomous driving dataset with 10 object classes, containing about 70K images for training and 10K for evaluation. We use the same training and testing configurations for a fair comparison. As shown in Table 5, on Pascal VOC, E0 halves the inference time compared to FPN while achieving a higher mAP. For BDD, E3 is 17.4ms faster than FPN. The searched architectures show good transferability.
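For reference, the IoU criterion used in the VOC evaluation can be computed as below. This is a minimal sketch for axis-aligned boxes given as `(x1, y1, x2, y2)` corner coordinates; the box format and the function name are assumptions for illustration, and the full mAP computation (matching, ranking by score, averaging precision) is not shown.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


if __name__ == "__main__":
    prediction = (10.0, 10.0, 50.0, 50.0)
    ground_truth = (20.0, 20.0, 60.0, 60.0)
    overlap = iou(prediction, ground_truth)
    print(round(overlap, 3), "match" if overlap >= 0.5 else "no match")
```

A detection counts as a true positive at this threshold only when its IoU with a ground-truth box of the same class is at least 0.5.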

Correlation Analysis of the Architecture and mAP. It is interesting to analyze the correlation between the factors of the backbone architecture and mAP. Figure 5 shows the correlation between these factors for the models searched on the COCO dataset. The left figure shows the results for Pareto front 4 in Stage-two. It can be seen that, under a FLOPs constraint, a better architecture should decrease the depth and put the computation budget in the low-level stage. The right figure shows the correlation for all the searched models. Depth shows a strong positive relation with mAP, and raising channels in an early stage is good for detection. It is better to have a longer high-level stage and a shorter low-level stage.

Conclusion

We propose a detection NAS framework for searching both an efficient combination of modules and better modular-level architectures for object detection on a target device. The searched SM-NAS networks achieve state-of-the-art speed/accuracy trade-off. The SM-NAS pipeline can keep updating and adding new modules in the future.

Supplementary

Classification performance of our searched backbone of E0 to E5 on ImageNet

We further compare the classification performance of our searched backbones of E0 to E5 on ImageNet. We compare the FLOPs, memory access cost (MAC) and total number of parameters with those of their counterparts ResNet18, ResNet34, ResNet101 and ResNeXt101 in Table 6. It can be seen that all the searched backbones have lower FLOPs, MAC and total parameters with a higher Top-1 accuracy. More specifically, the searched architectures cut nearly half of the FLOPs and total number of parameters for E2, E3, E4 and E5. This explains why they are so efficient on the GPU. We can conclude that the searched architectures are not only good at the detection task, but also efficient at classification.

More intermediate results for Stage-two

In Stage-two, we conduct a further backbone search based on the module combinations and input sizes searched in Stage-one. As a preliminary, at the beginning of Stage-two we first train the candidate architectures with vanilla backbones under the GN+WS setting to obtain baselines. Table 7 shows the comparison between our searched architectures (E0-E5) and the baselines. It can be found that the searched backbones can considerably reduce FLOPs while keeping a comparable mAP. Modular-search helps to further push the candidate architectures to a better trade-off of speed/accuracy.

More Correlation Results

Figure 8 shows the correlation coefficients between factors of all the searched models on the COCO dataset for all the Pareto fronts in Stage-two. From Pareto front 0 to Pareto front 5, the correlation coefficients become more significant, which indicates that larger models tend to have specific patterns. For small models, the mAP is positively correlated with the depth. However, when the model becomes larger, the depth is negatively correlated with the mAP. It can also be seen that, under a FLOPs constraint, a better architecture should decrease the depth and put the computation budget in the low-level stage.

Qualitative Results and Comparison

More qualitative comparisons on multiple datasets (MSCOCO, BDD, and Pascal VOC) can be found in Figures 9, 10, and 11. SM-NAS E3 is our full model, trained on each of the three datasets. The visualization threshold is 0.5. Figure 9 shows that our searched model is superior to the baseline FPN model at detecting objects that are tiny, occluded, or ambiguous. Figure 10 shows that our E3 can detect very small cars. Figure 11 shows that on an easier dataset, Pascal VOC, our E3 also performs very well.