CBNet: A Composite Backbone Network Architecture for Object Detection

Tingting Liang, Xiaojie Chu, Yudong Liu, Yongtao Wang, Zhi Tang, Wei Chu, Jingdong Chen, Haibin Ling

I Introduction

Object detection aims to locate each object instance from a predefined set of classes in an arbitrary image. It serves a wide range of applications such as autonomous driving, intelligent video surveillance, and remote sensing. In recent years, great progress has been made in object detection thanks to the rapid development of deep convolutional networks, and excellent detectors have been proposed, e.g., SSD, YOLO, Faster R-CNN, RetinaNet, ATSS, Mask R-CNN, and Cascade R-CNN.

Typically, in a neural network (NN)-based detector, a backbone network is used to extract basic features for detecting objects; in most cases it was originally designed for image classification and is pre-trained on ImageNet. Intuitively, the more representative the features extracted by the backbone, the better the performance of its host detector. To obtain higher accuracy, deeper and wider backbones have been exploited by mainstream detectors (e.g., from mobile-size models and ResNet to ResNeXt and Res2Net). Recently, Transformer-based backbones have shown very promising performance. Overall, advances in large backbone pre-training demonstrate a trend towards more effective multi-scale representations in object detection.

Encouraged by the results achieved by detectors based on large pre-trained backbones, we seek further improvement by constructing high-performance detectors from existing well-designed backbone architectures and their pre-trained weights. Although one may design a new, improved backbone, the overhead in expertise and computing resources can be high. On the one hand, designing a new backbone architecture requires expert experience and a great deal of trial and error. On the other hand, pre-training a new backbone (especially a large model) on ImageNet requires substantial computational resources, which makes it costly to obtain better detection performance under the pre-training and fine-tuning paradigm. Alternatively, training detectors from scratch saves the cost of pre-training but requires even more computing resources and training skills.

In this paper, we present a simple and novel composition approach to use existing pre-trained backbones under the pre-training fine-tuning paradigm. Unlike most previous methods, which focus on modular crafting and require pre-training on ImageNet to strengthen the representation, we improve the representation ability of existing backbones without additional pre-training. As shown in Fig. 1, our solution, named Composite Backbone Network (CBNet), groups multiple identical backbones together. Specifically, parallel backbones (named assisting backbones and lead backbone) are connected via composite connections. From left to right in Fig. 1, the output of each stage in an assisting backbone flows to the parallel and lower-level stages of its succeeding sibling. Finally, the features of the lead backbone are fed to the neck and detection head for bounding box regression and classification. In contrast to simple network deepening or widening, CBNet integrates the high- and low-level features of multiple backbone networks and progressively expands the receptive field for more effective object detection. Notably, each composed backbone of CBNet is initialized with the weights of an existing open-source pre-trained individual backbone (e.g., CB-ResNet50 is initialized with the weights of ResNet50, which are available in the open-source community). In addition, to further exploit the potential of CBNet, we propose an effective training strategy with supervision for the assisting backbones, achieving higher detection accuracy without sacrificing inference speed. We also propose a pruning strategy that reduces model complexity without sacrificing accuracy.

We present two versions of CBNet. The first, named CBNetV1, connects only the adjacent stages of parallel backbones, providing a simple implementation of our composite backbone that is easy to follow. The other, CBNetV2, combines the dense higher-level composition strategy, the auxiliary supervision, and a special pruning strategy to fully explore the potential of CBNet for object detection. We empirically demonstrate the superiority of CBNetV2 over CBNetV1.

We demonstrate the effectiveness of our framework by conducting experiments on the challenging MS COCO benchmark. Experiments show that CBNet generalizes well across different backbones and head designs of the detector architecture, which enables us to train detectors that significantly outperform detectors based on larger backbones. Specifically, CBNet can be applied to various backbones, from convolution-based to Transformer-based. Compared to the original backbones, CBNet boosts their performance by 3.4% to 3.5% AP, demonstrating the effectiveness of the proposed CBNet. At comparable model complexity, our CBNet still improves by 1.1% to 2.1% AP, indicating that the composed backbone is more efficient than the pre-trained wider and deeper networks. Moreover, CBNet can be flexibly plugged into mainstream detectors (e.g., RetinaNet, ATSS, Faster R-CNN, Mask R-CNN, Cascade R-CNN, and Cascade Mask R-CNN), and consistently improves the performance of these detectors by 3% to 3.8% AP, demonstrating its strong adaptability to various head designs of detectors. In addition, CBNet is compatible with feature-enhancing networks and model ensemble methods. Remarkably, it presents a general and resource-friendly framework to raise the accuracy ceiling of high-performance detectors. Without bells and whistles, our CB-Swin-L achieves an unparalleled single-model, single-scale result of 59.4% box AP and 51.6% mask AP on COCO test-dev, surpassing the state-of-the-art result (i.e., 57.7% box AP and 50.2% mask AP obtained by Swin-L), while reducing the training schedule by 6×. With multi-scale testing, we push the current best single-model result to a new record of 60.1% box AP and 52.3% mask AP.

The main contributions of this paper are listed as follows:

We propose a general, efficient and effective framework, CBNet (Composite Backbone Network), to construct high-performance backbone networks for object detection without additional pre-training.

We propose a Dense Higher-Level Composition (DHLC) strategy, auxiliary supervision, and a pruning strategy to efficiently use existing pre-trained weights for object detection under the pre-training fine-tuning paradigm.

Our CB-Swin-L achieves a new record for single-model, single-scale results on COCO with a training schedule 6× shorter than that of Swin-L. With multi-scale testing, our method achieves the best-known result without extra training data.

II Related work

Object detection aims to locate each object instance from a predefined set of classes in an input image. With the rapid development of convolutional neural networks (CNNs), a popular paradigm has emerged for deep learning-based object detectors: the backbone network (typically designed for classification and pre-trained on ImageNet) extracts basic features from the input image, the neck (e.g., a feature pyramid network) then enhances the multi-scale features from the backbone, and finally the detection head predicts the object bounding boxes with position and classification information. Based on their detection heads, cutting-edge methods for generic object detection can be broadly categorized into two major branches. The first branch contains one-stage detectors such as YOLO, SSD, RetinaNet, NAS-FPN, and EfficientDet. The other branch contains two-stage methods such as Faster R-CNN, FPN, Mask R-CNN, Cascade R-CNN, and Libra R-CNN. Recently, academic attention has shifted toward anchor-free detectors, due partly to the emergence of FPN and Focal Loss, and more elegant end-to-end detectors have been proposed. On the one hand, FSAF, FCOS, ATSS, and GFL improve RetinaNet with center-based anchor-free methods. On the other hand, CornerNet, CenterNet, and FoveaBox detect object bounding boxes with a keypoint-based method. In addition to the above CNN-based detectors, the Transformer has also been utilized for detection. DETR proposes a fully end-to-end detector by combining a CNN with Transformer encoder-decoders.

More recently, Neural Architecture Search (NAS) has been applied to automatically search the architecture for a specific detector. NAS-FPN, NAS-FCOS, and SpineNet use reinforcement learning to control the architecture sampling and obtain promising results. SM-NAS uses an evolutionary algorithm and a partial order pruning method to search for the optimal combination of different parts of the detector. Auto-FPN uses a gradient-based method to search for the best detector. OPANAS uses a one-shot method to search for an efficient neck for object detection.

II-B Backbones for Object Detection

Starting from AlexNet, deeper and wider backbones have been exploited by mainstream detectors, such as VGG, ResNet, DenseNet, ResNeXt, and Res2Net. Since the backbone network is usually designed for classification, it requires substantial computational resources and is difficult to optimize, whether it is pre-trained on ImageNet and fine-tuned on a given detection dataset or trained from scratch on the detection dataset. Recently, two non-trivially designed backbones, DetNet and FishNet, have been designed specifically for the detection task. However, they still require pre-training on the classification task before fine-tuning on the detection task. Res2Net achieves impressive results in object detection by representing multi-scale features at the granular level. HRNet maintains high-resolution representations and achieves promising results in human pose estimation, semantic segmentation, and object detection. In addition to manually designed backbone architectures, DetNAS and Joint-DetNAS use NAS to search for a better backbone for object detection, thereby reducing the cost of manual design. Swin Transformer and PVT use Transformer modules to build the backbone and achieve impressive results, despite the need for expensive pre-training.

It is well known that designing and pre-training a new and robust backbone incurs significant computational costs. As an alternative, we propose a more economical and efficient solution for building a more powerful object detection backbone: grouping multiple identical existing backbones (e.g., ResNet, ResNeXt, Res2Net, HRNet, and Swin Transformer).

II-C Recurrent Convolution Neural Network

Unlike the feed-forward architecture of a CNN, a Recurrent CNN (RCNN) incorporates recurrent connections into each convolution layer to enhance the model's ability to integrate contextual information. As shown in Fig. 3, our proposed Composite Backbone Network shares some similarities with the unfolded RCNN, but the two differ in important ways. First, the connections between the parallel stages in CBNet are unidirectional, while they are bidirectional in RCNN. Second, in RCNN the parallel stages at different time steps share parameter weights, while in the proposed CBNet the parallel stages of the backbones are independent of each other. Moreover, an RCNN must be pre-trained on ImageNet before it can be used as the backbone of a detector. By contrast, CBNet does not require additional pre-training because it directly uses existing pre-trained weights.

II-D Model Ensemble

It is well known that a combination of many different predictors can lead to more accurate predictions; for example, ensemble methods are considered the state-of-the-art solution for many machine learning challenges. A model ensemble improves on the prediction performance of a single model by training multiple different models and combining their prediction results through post-processing.

There are two key characteristics of a model ensemble: model diversity and voting. Model diversity means that models with different architectures or training techniques are trained separately, and its importance for the model ensemble is well established. Most ensemble methods need voting strategies to compare the outputs of different models and refine the final predictions. In terms of these two characteristics, our CBNet is very different from a model ensemble. In fact, CBNet benefits from grouping identical backbones and from recurrent-style feature enhancement through joint training. Furthermore, the output of the lead backbone is used directly for the final prediction, without needing to be assembled with the outputs of other backbones. More practical analysis can be found in Sec. IV-E2.

In practice, leading approaches to challenging object detection benchmarks such as MS COCO or OpenImage are based on model ensembles. For example, one such approach separately trains 28 models with different architectures, heads, data splits, class sampling strategies, augmentation strategies, and supervision, and aggregates the outputs of these detectors with an ensembling method. Another proposes the Probabilistic Ranking Aware Ensemble (PRAE), which refines the confidence of bounding boxes from different detectors. Our CBNet is compatible with such model ensemble methods, as are other conventional backbones. More details can be found in Sec. IV-F6.

II-E Our Approach.

Our network groups multiple identical backbones in parallel. It integrates the high- and low-level features of multiple identical backbones and gradually expands the receptive field to perform object detection more efficiently. This paper is a substantial extension of our previous conference paper, with results under recently developed state-of-the-art object detection frameworks. The main technical novelties compared with the conference version lie in three aspects. (1) We extend the network proposed there (named CBNetV1) with three modifications: a specialized training method, a better composite strategy, and a pruning strategy, which respectively optimize the training process, enhance the feature representation more efficiently, and reduce the model complexity of CBNetV2. (2) We show the strong generalization capabilities of CBNetV2 for various backbones and head designs of the detector architecture. (3) We show the superiority of CBNetV2 over CBNetV1 and present the state-of-the-art result of CBNetV2 in object detection.

III Proposed method

This section describes the proposed CBNet in detail. In Sec. III-A and Sec. III-B, we describe its basic architecture and variants, respectively. In Sec. III-C, we propose a training strategy for CBNet-based detectors. In Sec. III-D, we briefly introduce the pruning strategy. In Sec. III-E, we summarize the detection framework of CBNet.

The proposed CBNet consists of $K$ identical backbones ($K \geq 2$). In particular, we refer to the case $K=n$ as CB-Backbone-K$n$, where "-K$n$" is omitted when $K=2$.

As in Fig. 1, the CBNet architecture includes two types of backbones: the lead backbone $B_{K}$ and the assisting backbones $B_{1}, B_{2}, \ldots, B_{K-1}$. Each backbone comprises $L$ stages (usually $L=5$), and each stage consists of several convolutional layers with feature maps of the same size. The $l$-th stage of the backbone implements the non-linear transformation $F^{l}(\cdot)$ ($l = 1, 2, \ldots, L$).

Most conventional convolutional networks follow the design of encoding the input images into intermediate features with monotonically decreasing resolution. In particular, the $l$-th stage takes the output (denoted as $x^{l-1}$) of the previous $(l-1)$-th stage as input, which can be expressed as follows:

$$x^{l} = F^{l}(x^{l-1}), \quad l \geq 2. \qquad (1)$$

In contrast, we adopt assisting backbones $B_{1}, B_{2}, \ldots, B_{K-1}$ to improve the representative ability of the lead backbone $B_{K}$. We pass the features of a backbone to its successor in a stage-by-stage fashion. Thus, Equation (1) can be rewritten as:

$$x_{k}^{l} = F_{k}^{l}\left(x_{k}^{l-1} + g^{l-1}(\boldsymbol{x}_{k-1})\right), \quad l \geq 2, \; k = 2, 3, \ldots, K, \qquad (2)$$

where $g^{l-1}(\cdot)$ represents the composite connection, which takes the features (denoted as $\boldsymbol{x}_{k-1} = \{x_{k-1}^{i} \mid i = 1, 2, \ldots, L\}$) from assisting backbone $B_{k-1}$ as input and outputs features of the same size as $x_{k}^{l-1}$. Therefore, the output features of $B_{k-1}$ are transformed and contribute to the input of each stage in $B_{k}$. Note that $x_{1}^{1}, x_{2}^{1}, \ldots, x_{K}^{1}$ are computed with shared weights.

For the object detection task, only the output features of the lead backbone $\{x_{K}^{i}, i = 2, 3, \ldots, L\}$ are fed into the neck and then the RPN/detection head, while the outputs of each assisting backbone are forwarded to its succeeding sibling. It is worth noting that $B_{1}, B_{2}, \ldots, B_{K-1}$ can use various backbone architectures (e.g., ResNet, ResNeXt, Res2Net, and Swin Transformer) and are initialized directly from the pre-trained weights of a single backbone.
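
The stage-by-stage data flow described above can be illustrated with a small toy sketch. It is not the authors' implementation: feature maps are stood in for by 1-D Python lists, `make_stage` is a hypothetical stage that simply halves the resolution, and `same_level` is a placeholder composite connection that passes the same-level feature through unchanged. The weight sharing of the first stage is not modeled.

```python
from typing import Callable, List, Optional, Sequence

Feature = List[float]  # toy 1-D stand-in for a feature map
Stage = Callable[[Feature], Feature]


def make_stage(scale: float = 1.0) -> Stage:
    """Hypothetical stage F^l: halve the resolution, then scale the values."""
    def stage(x: Feature) -> Feature:
        return [scale * v for v in x[::2]]
    return stage


def same_level(prev_feats: Sequence[Feature], l: int) -> Feature:
    """Placeholder composite connection g^l: reuse x^l of the previous backbone."""
    return list(prev_feats[l - 1])


def cbnet_forward(image: Feature,
                  backbones: Sequence[Sequence[Stage]],
                  g: Callable[[Sequence[Feature], int], Feature] = same_level
                  ) -> List[Feature]:
    """Run B_1..B_K in order; return the lead backbone features [x_K^1..x_K^L]."""
    prev_feats: Optional[List[Feature]] = None
    for stages in backbones:
        feats: List[Feature] = []
        x = image
        for l, stage in enumerate(stages, start=1):
            if prev_feats is not None and l >= 2:
                # Add g^{l-1}(x_{k-1}), which has the same size as x_k^{l-1}.
                extra = g(prev_feats, l - 1)
                x = [a + b for a, b in zip(x, extra)]
            x = stage(x)
            feats.append(x)
        prev_feats = feats
    return prev_feats if prev_feats is not None else []


# K = 2 backbones with L = 3 toy stages each.
backbones = [[make_stage() for _ in range(3)] for _ in range(2)]
lead_feats = cbnet_forward([1.0] * 16, backbones)
```

Only `lead_feats` would be handed to the neck and detection head; the features of the first backbone are consumed solely through the composite connection.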

III-B Possible Composite Strategies

For the composite connection $g^{l}(\boldsymbol{x})$, which takes $\boldsymbol{x} = \{x^{i} \mid i = 1, 2, \ldots, L\}$ from an assisting backbone as input and outputs a feature of the same size as $x^{l}$ (omitting $k$ for simplicity), we propose the following five composite strategies.

An intuitive and simple way of compositing is to fuse the output features from the same stage of the backbones, which we call Same Level Composition (SLC). As shown in Fig. 2.a, the operation of SLC can be formulated as:

$$g^{l}(\boldsymbol{x}) = \mathbf{w}(x^{l}), \quad l \geq 2, \qquad (3)$$

where $\mathbf{w}$ represents a $1 \times 1$ convolution layer and a batch normalization layer.

III-B2 Adjacent Higher-Level Composition (AHLC)

In Feature Pyramid Networks, the top-down pathway introduces spatially coarser but semantically stronger higher-level features to enhance the lower-level features of the bottom-up pathway. Motivated by this, we introduce AHLC to feed the output of the adjacent higher-level stage of the previous backbone to the subsequent one (from left to right in Fig. 2.b):

$$g^{l}(\boldsymbol{x}) = \mathbf{U}(\mathbf{w}(x^{l+1})), \quad l \geq 1, \qquad (4)$$

where $\mathbf{U}(\cdot)$ indicates the up-sampling operation.

III-B3 Adjacent Lower-Level Composition (ALLC)

In contrast to AHLC, we introduce a bottom-up pathway to feed the output of the adjacent lower-level stage of the previous backbone to the succeeding one. We show ALLC in Fig. 2.c, which is formulated as:

$$g^{l}(\boldsymbol{x}) = \mathbf{D}(\mathbf{w}(x^{l-1})), \quad l \geq 2, \qquad (5)$$

where $\mathbf{D}(\cdot)$ denotes the down-sampling operation.

III-B4 Dense Higher-Level Composition (DHLC)

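In DenseNet, each layer is connected to all subsequent layers to build comprehensive features. Inspired by this, we utilize dense composite connections in our CBNet architecture. The operation of DHLC is expressed as follows:

$$g^{l}(\boldsymbol{x}) = \mathbf{U}\left(\sum_{i=l+1}^{L} \mathbf{w}_{i}(x^{i})\right), \quad l \geq 1. \qquad (6)$$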

As shown in Fig. 2.d, when $K=2$, we compose the features from all the higher-level stages in the previous backbone and add them to the lower-level stages in the latter one.
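
To make the difference between the adjacent and the dense higher-level strategies concrete, the following toy sketch composes features from 1-D lists. It is only an illustration under stated assumptions: the per-stage transform `w` (a 1×1 convolution plus batch normalization in the text) defaults to the identity, up-sampling is nearest-neighbour and is applied to each higher-level feature before summation, and the names `upsample`, `ahlc`, and `dhlc` are hypothetical.

```python
from typing import Callable, List, Sequence

Feature = List[float]  # toy 1-D stand-in for a feature map


def identity(i: int, x: Feature) -> Feature:
    """Assumed stand-in for the per-stage transform w_i (1x1 conv + BN in the text)."""
    return x


def upsample(x: Feature, size: int) -> Feature:
    """Nearest-neighbour up-sampling of a 1-D feature to `size` entries."""
    n = len(x)
    return [x[j * n // size] for j in range(size)]


def ahlc(feats: Sequence[Feature], l: int,
         w: Callable[[int, Feature], Feature] = identity) -> Feature:
    """Adjacent higher-level composition: use only x^{l+1} (valid for l < L)."""
    return upsample(w(l + 1, feats[l]), len(feats[l - 1]))


def dhlc(feats: Sequence[Feature], l: int,
         w: Callable[[int, Feature], Feature] = identity) -> Feature:
    """Dense higher-level composition: sum all x^i with i > l at the size of x^l."""
    target = len(feats[l - 1])
    out = [0.0] * target
    for i in range(l + 1, len(feats) + 1):
        up = upsample(w(i, feats[i - 1]), target)
        out = [a + b for a, b in zip(out, up)]
    return out


# feats[l - 1] holds x^l; the resolution halves at every stage.
feats = [[1.0] * 8, [2.0] * 4, [3.0] * 2]
dense_1 = dhlc(feats, 1)     # combines x^2 and x^3
adjacent_1 = ahlc(feats, 1)  # uses x^2 only
```

In this toy setting the dense variant at stage 1 receives contributions from both higher stages, whereas the adjacent variant sees only the next stage; at the top stage the dense sum is empty.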

III-B5 Full-connected Composition (FCC)

As shown in Fig. 2.e, we compose features from all the stages in the previous backbone and feed them to each stage in the following one. Compared to DHLC, we add connections in the low-to-high-level case. The operation of FCC can be expressed as:

$$g^{l}(\boldsymbol{x}) = \sum_{i=2}^{L} \mathbf{I}(\mathbf{w}_{i}(x^{i})), \quad l \geq 1, \qquad (7)$$

where $\mathbf{I}(\cdot)$ denotes scale-resizing: $\mathbf{I}(\cdot) = \mathbf{U}(\cdot)$ when $i > l$, and $\mathbf{I}(\cdot) = \mathbf{D}(\cdot)$ when $i < l$.

III-C Auxiliary Supervision

Although increasing the depth usually leads to performance improvement, it may introduce additional optimization difficulties, as in the case of image classification. Prior studies introduce auxiliary classifiers at intermediate layers to improve the convergence of very deep networks. In the original CBNet, although the composite backbones are parallel, the latter backbone (e.g., the lead backbone in Fig. 4.a) deepens the network through its adjacent connections to the previous backbone (e.g., the assisting backbone in Fig. 4.a). To better train the CBNet-based detector, we propose to generate initial results from the assisting backbones, supervised through an auxiliary neck and detection head, to provide additional regularization.

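An example of our supervised CBNet when $K=2$ is illustrated in Fig. 4.b. Apart from the original loss that uses the lead backbone features to train detection head 1, another detection head 2 takes the assisting backbone features as input, producing auxiliary supervision. Note that detection head 1 and detection head 2 share weights, as do the two necks. The auxiliary supervision helps to optimize the learning process, while the original loss for the lead backbone takes the greatest responsibility. We add weights to balance the auxiliary supervision, where the total loss is defined as:

$$\mathcal{L} = \mathcal{L}_{\text{Lead}} + \sum_{i=1}^{K-1} \lambda_{i} \cdot \mathcal{L}_{\text{Assist}}^{i}, \qquad (8)$$

where $\mathcal{L}_{\text{Lead}}$ is the loss of the lead backbone, $\mathcal{L}_{\text{Assist}}^{i}$ is the loss of the $i$-th assisting backbone, and $\lambda_{i}$ is its loss weight.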

During the inference phase, we abandon the auxiliary supervision branch and only utilize the output features of the lead backbone in CBNet (Fig. 4.b). Consequently, auxiliary supervision does not affect the inference speed.
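
The weighted combination of the lead loss and the auxiliary losses can be sketched in a few lines. This is a minimal illustration, not the training code of the paper: `total_loss` is a hypothetical helper, and the numeric loss values and the weight of 0.5 are made-up inputs chosen only to show the arithmetic.

```python
from typing import Sequence


def total_loss(lead_loss: float,
               assist_losses: Sequence[float],
               weights: Sequence[float]) -> float:
    """Lead-backbone loss plus the weighted losses of the K-1 assisting backbones."""
    if len(assist_losses) != len(weights):
        raise ValueError("need exactly one weight per assisting backbone")
    return lead_loss + sum(w * a for w, a in zip(weights, assist_losses))


# K = 2: one assisting backbone, with a hypothetical auxiliary weight of 0.5.
training_loss = total_loss(1.0, [2.0], [0.5])

# At inference the auxiliary branch is dropped, so only the lead term remains.
lead_only = total_loss(1.0, [], [])
```

Because the auxiliary terms enter only the training objective, removing them at inference changes nothing about the forward pass of the lead backbone.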

III-D Pruning Strategy for CBNet

To reduce the model complexity of CBNet, we explore the possibility of pruning different numbers of stages in the $2, 3, \ldots, K$-th backbones instead of composing the backbones in a holistic manner. For simplicity, we show five pruning methods when $K=2$ in Fig. 5. Here $s_{i}$ indicates that there are $i$ stages $\{x_{j} \mid j \geq 6-i \text{ and } j \leq 5\}$, $i = 0, 1, 2, 3, 4$, in the $2, 3, \ldots, K$-th backbones, and the pruned stages are filled by the features of the same stages in the first backbone.
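
The index set in the definition of $s_i$ can be enumerated directly. The sketch below is only an illustration: `kept_stages` is a hypothetical helper, and writing the bound as `num_stages + 1 - i` generalizes the text's $6 - i$ under the assumption that the 6 comes from $L + 1$ with $L = 5$.

```python
from typing import List


def kept_stages(i: int, num_stages: int = 5) -> List[int]:
    """Stage indices j retained under pruning setting s_i (6 - i <= j <= 5 when L = 5)."""
    if not 0 <= i <= num_stages - 1:
        raise ValueError("i must lie in 0..num_stages-1")
    return list(range(num_stages + 1 - i, num_stages + 1))


def pruned_stages(i: int, num_stages: int = 5) -> List[int]:
    """Stage indices whose features are taken from the first backbone instead."""
    kept = set(kept_stages(i, num_stages))
    return [j for j in range(1, num_stages + 1) if j not in kept]


# The five settings s_0 .. s_4 for a five-stage backbone.
settings = {f"s{i}": kept_stages(i) for i in range(5)}
```

For example, `settings["s2"]` keeps only the two highest stages, while the three lower stages reuse the features of the first backbone.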

III-E Architecture of Detection Network with CBNet

CBNet can be applied to various off-the-shelf detectors without additional modifications to their network architectures. In practice, we attach functional networks, e.g., FPN and the detection head, to the lead backbone. The inference phase of CBNet is shown in Fig. 1. Note that we present two versions of CBNet. The first, named CBNetV1, uses only the AHLC composition strategy, providing a simple implementation of the composite backbone that is easy to follow. The other, CBNetV2, combines the DHLC composition strategy, the auxiliary supervision, and a special pruning strategy to fully explore the potential of CBNet for object detection. We empirically demonstrate the superiority of CBNetV2 over CBNetV1 in Sec. IV. In the following experiments, CBNet denotes CBNetV2 unless otherwise specified.

IV Experiments

In this section, we evaluate our CBNet through extensive experiments. In Sec. IV-A, we detail the experimental setup. In Sec. IV-B, we compare CBNet with state-of-the-art detection methods. In Sec. IV-C, we demonstrate the generality of our method over different backbones and detectors. In Sec. IV-E, we show the compatibility of CBNet with DCN and model ensemble. In Sec. IV-F, we conduct an extensive ablation study to investigate individual components of our framework.

We conduct experiments on the COCO benchmark. The training is conducted on the 118k training images, and ablation studies on the 5k minival images. We also report the results on the 20k images in test-dev for comparison with the state-of-the-art (SOTA) methods. For evaluation, we adopt the metrics from the COCO detection evaluation criteria, including the mean Average Precision (AP) across IoU thresholds ranging from 0.5 to 0.95 at different scales.
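
As a small illustration of the evaluation criterion, the sketch below lists the ten IoU thresholds (0.50 to 0.95 in steps of 0.05) over which COCO AP is averaged and checks at which of them a single predicted box would count as a match. It does not compute AP itself; the helper names and the example boxes are hypothetical.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)

# COCO averages AP over IoU thresholds 0.50, 0.55, ..., 0.95.
IOU_THRESHOLDS: List[float] = [round(0.5 + 0.05 * t, 2) for t in range(10)]


def area(box: Box) -> float:
    """Area of an axis-aligned box; degenerate boxes have zero area."""
    return max(0.0, box[2] - box[0]) * max(0.0, box[3] - box[1])


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0


def matched_thresholds(pred: Box, gt: Box) -> List[float]:
    """Thresholds at which `pred` is close enough to `gt` to count as a match."""
    overlap = iou(pred, gt)
    return [t for t in IOU_THRESHOLDS if overlap >= t]


# A prediction that covers 80% of the ground-truth box.
hits = matched_thresholds((0.0, 0.0, 10.0, 8.0), (0.0, 0.0, 10.0, 10.0))
```

A detection with IoU 0.8 therefore contributes as a true positive at the looser thresholds but not at 0.85 and above, which is why the averaged metric rewards precise localization.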

IV-A2 Training and Inference Details

Our experiments are based on the open-source detection toolbox MMDetection. For ablation studies and simple comparisons, we resize the input to $800 \times 500$ during training and inference unless otherwise specified. We choose Faster R-CNN (ResNet50) with FPN as the baseline. We use the SGD optimizer with an initial learning rate of 0.02, a momentum of 0.9, and a weight decay of $10^{-4}$. We train detectors for 12 epochs, with the learning rate decreased by $10\times$ at epochs 8 and 11. We use only random flip for data augmentation and set the batch size to 16. Note that experiments related to special backbones and detectors, e.g., Swin Transformer, HRNet, PVT, PVTv2, and DetectoRS, follow the hyper-parameters of the original papers where not specifically highlighted. The inference speed in FPS (frames per second) of each detector is measured on a machine with one V100 GPU.
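
The step schedule used for the baseline can be written as a short function. This is a sketch rather than the toolbox's scheduler: `learning_rate` is a hypothetical helper, warm-up is ignored, and it assumes that `epoch` counts completed epochs starting from 0, so the first decay applies from index 8 onward.

```python
from typing import List, Sequence


def learning_rate(epoch: int,
                  base_lr: float = 0.02,
                  milestones: Sequence[int] = (8, 11),
                  gamma: float = 0.1) -> float:
    """Step schedule: multiply the rate by `gamma` at every milestone already reached."""
    decays = sum(1 for m in milestones if epoch >= m)
    return base_lr * gamma ** decays


# Learning rate for each of the 12 training epochs (0-based epoch index assumed).
schedule: List[float] = [learning_rate(e) for e in range(12)]
```

Under this convention the rate stays at 0.02 for the first eight epochs, drops to roughly 0.002 for the next three, and ends at roughly 0.0002 for the final epoch.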

To compare with state-of-the-art detectors, we use multi-scale training (the short side resized to between 400 and 1400, with the long side at most 1600) and a longer training schedule (details can be found in Sec. IV-B). During the inference phase, we use Soft-NMS with a threshold of 0.001, and the input size is set to $1600 \times 1400$. All other hyper-parameters in this paper follow MMDetection unless otherwise specified.
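
The multi-scale resizing rule (a sampled short side, capped by a maximum long side, with the aspect ratio kept) can be sketched as follows. The helper names are hypothetical, and the rounding convention and the uniform integer sampling of the short side are assumptions made for illustration, not details taken from the paper.

```python
import random
from typing import Tuple


def rescale_size(width: int, height: int,
                 short_side: int, max_long_side: int = 1600) -> Tuple[int, int]:
    """Resize so the short side reaches `short_side` unless the long side would exceed its cap."""
    scale = min(short_side / min(width, height),
                max_long_side / max(width, height))
    # Round to the nearest integer pixel (assumed convention).
    return int(width * scale + 0.5), int(height * scale + 0.5)


def sample_training_size(width: int, height: int, rng: random.Random,
                         short_range: Tuple[int, int] = (400, 1400),
                         max_long_side: int = 1600) -> Tuple[int, int]:
    """Draw a short side uniformly from `short_range` and rescale accordingly."""
    short_side = rng.randint(short_range[0], short_range[1])
    return rescale_size(width, height, short_side, max_long_side)


# A 640x480 image: the long-side cap takes over once the short side exceeds 1200.
capped = rescale_size(640, 480, 1400)
small = rescale_size(640, 480, 400)
sampled = sample_training_size(640, 480, random.Random(0))
```

For a 4:3 image, any requested short side above 1200 is limited by the 1600-pixel cap on the long side, so the largest training resolution in this example is 1600 by 1200.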

IV-B Comparison with State-of-the-Art

We compare our methods with cutting-edge detectors. We divide the results into object detection (Table I) and instance segmentation (Table II) according to whether or not the instance segmentation annotations are used during training. Following prior work, we improve the detector heads of Cascade R-CNN, Cascade Mask R-CNN, and HTC in the above two tables by adding four convolution layers to each bounding box head and using GIoU loss instead of Smooth L1.

For detectors trained with only bounding box annotations, we group them into two categories in Table I: anchor-based and anchor-free. We select ATSS as the anchor-free representative and Cascade R-CNN as the anchor-based representative.

Anchor-free. CB-Res2Net101-DCN equipped with ATSS is trained for 20 epochs, where the learning rate is decayed by $10\times$ at the 16th and 19th epochs. Notably, our CB-Res2Net101-DCN achieves 52.8% AP, outperforming previous anchor-free methods under the single-scale testing protocol.

Anchor-based. Our CB-Res2Net101-DCN achieves 55.6% AP, surpassing other anchor-based detectors. It is worth noting that our CBNet is trained for only 32 epochs (the first 20 epochs are regular training and the remaining 12 epochs are trained with Stochastic Weights Averaging), which is 16× and 12× shorter than EfficientDet and YOLOv4, respectively.

IV-B2 Instance Segmentation

We further compare our method with state-of-the-art results using both bounding box and instance segmentation annotations in Table II. Following prior work, we provide results with the backbone pre-trained on regular ImageNet-1K and on ImageNet-22K to show the high capacity of CBNet.

Results with regular ImageNet-1K pre-training. Following prior work, a 3x schedule (36 epochs with the learning rate decayed by $10\times$ at epochs 27 and 33) is used for CB-Swin-S. Using Cascade Mask R-CNN, our CB-Swin-S achieves 56.3% box AP and 48.6% mask AP on COCO minival, showing significant gains of +4.4% box AP and +3.6% mask AP over Swin-B with similar model size and the same training protocol. In addition, CB-Swin-S achieves 56.9% box AP and 49.1% mask AP on COCO test-dev, outperforming other detectors based on ImageNet-1K pre-trained backbones.

Results with ImageNet-22K pre-training. Our CB-Swin-B achieves a single-scale result of 58.4% box AP and 50.7% mask AP on COCO minival, which is 1.3% box AP and 1.2% mask AP higher than that of Swin-L (HTC++), while the number of parameters is decreased by 17% and the training schedule is reduced by 3.6×. Notably, with only 12 epochs of training (6× shorter than Swin-L), our CB-Swin-L achieves 59.4% box AP and 51.6% mask AP on COCO test-dev, outperforming prior art. With multi-scale testing, we push the current best result to a new record of 60.1% box AP and 52.3% mask AP. The results demonstrate that CBNet provides an efficient, effective, and resource-friendly framework for building high-performance detectors.

IV-C Generalization Capability of CBNet

CBNet expands the receptive field by combining the backbones in parallel rather than simply increasing the depth of the network. To demonstrate the generality of our design strategy, we perform experiments on various backbones and different head designs of the detector architecture.

Effectiveness. To demonstrate the effectiveness of CBNet, we conduct experiments on Faster R-CNN with different backbone architectures. As shown in Table III, for CNN-based backbones (e.g., ResNet, ResNeXt-32x4d, and Res2Net), our method boosts the baseline by over 3.4% AP.

Efficiency. Note that the number of parameters in CBNet is larger than that of the baseline. To better demonstrate the efficiency of the composite architecture, we compare CBNet with deeper and wider backbone networks. As shown in Table IV, with a comparable number of parameters and comparable inference speed, CBNet improves on ResNet101, ResNeXt101-32x4d, and Res2Net101 by 1.7%, 2.1%, and 1.1% AP, respectively. Additionally, CB-ResNeXt50-32x4d is 1.1% AP higher than ResNeXt101-64x4d with only 70% of its parameters. The results demonstrate that our composite backbone architecture is more efficient and effective than simply increasing the depth and width of the network.

IV-C2 Generality for Swin Transformer

The Transformer is notable for its use of attention to model long-range dependencies in data, and Swin Transformer is one of the most representative recent works. We conduct experiments on Swin Transformer to show the model generality of CBNet. For a fair comparison, we follow the same training strategy as prior work, with multi-scale training (the short side resized to between 480 and 800, with the long side at most 1333), the AdamW optimizer (initial learning rate of 0.0001, weight decay of 0.05, and batch size of 16), and a 3x schedule (36 epochs). As shown in Table V, the accuracy of the model increases slowly as the Swin Transformer is deepened and widened, and saturates at Swin-S. Swin-B is only 0.1% AP higher than Swin-S, but its number of parameters increases by 38M. With CB-Swin-T, we achieve 53.6% box AP and 46.2% mask AP, improving on Swin-T by 3.1% box AP and 2.5% mask AP. Surprisingly, our CB-Swin-T is 1.7% box AP and 1.2% mask AP higher than the deeper and wider Swin-B, while its model complexity is lower (e.g., FLOPs 836G vs. 975G, Params 113.8M vs. 145.0M, FPS 6.5 vs. 5.9). These results show that CBNet can also improve architectures that are not purely convolutional. They also demonstrate that CBNet pushes the upper limit of accuracy for high-performance detectors more effectively than simply increasing the depth and width of the network.

IV-C3 Generality for special backbones

To further show the generality of CBNet for various backbones, we conduct experiments on CBNet equipped with different backbones, including MobileNetV2, HRNet, PVT, and PVTv2. For a fair comparison, we choose the publicly available pre-trained backbones, and all experiment settings (e.g., choice of detectors, training, and inference details) follow their settings in MMDetection. Results are shown in Table VI. For mobile settings with YOLOV3, our CB-MobileNetV2 improves on MobileNetV2 by 3.1% AP and is 1% AP higher than MobileNetV2(1.4x) with comparable model complexity. For backbones with high-resolution representations, our CB-HRNetv2p_w32 improves on HRNetv2p_w32 by 2.4% AP and is 0.6% AP higher than HRNetv2p_w48 with lower model complexity. For global Transformer backbones, we choose RetinaNet as the detector, following the original paper. Our CB-PVT-Small improves on PVT-Small by 3% AP and is 0.8% AP higher than PVT-Large with only 83% of its parameters. Furthermore, our CB-PVTv2-B2 improves on PVTv2-B2 by 3.1% AP and is 1.6% AP higher than PVTv2-B5 with only 66% of its parameters. The results show that our CBNet improves a wide variety of backbones and achieves better accuracy with comparable or fewer parameters and FLOPs, which verifies the effectiveness and efficiency of CBNet.

IV-C4 Model Adaptability for Mainstream Detectors

We evaluate the adaptability of CBNet by plugging it into mainstream detectors such as RetinaNet, ATSS, Faster R-CNN, Mask R-CNN, and Cascade R-CNN. These methods present a variety of detector head designs (e.g., two-stage vs. one-stage, anchor-based vs. anchor-free). As shown in Table VII, our CBNet significantly boosts all popular object detectors by over 3% AP. The instance segmentation accuracy of Mask R-CNN is also improved by 2.9% AP. These results demonstrate the robust adaptability of CBNet to various head designs of detectors.

IV-D Comparison with Relevant Works

There are several relevant detectors, such as DetectoRS, which composites both the backbone and the FPN, and Joint-DetNAS, which searches for the model scaling strategy. We compare CBNet with these two methods.

Joint-DetNAS integrates neural architecture search (NAS), pruning, and knowledge distillation for optimizing detectors. Similarly, our CBNet also uses a pruning strategy but focuses more on scaling backbones using a composite strategy. Thanks to its strong generalization ability, our CBNet can boost the performance of advanced high-performance detectors (e.g., YOLOX ). As shown in Table VIII, our CB-CSPNet-L improves CSPNet-L by 2.6% AP and is 1.1% AP higher than CSPNet-X with only 85% of the parameters. We further compare our CBNet using an existing hand-designed detector (i.e., YOLOX) with Joint-DetNAS, which uses an advanced knowledge distillation training strategy. Our CBNet achieves 52% AP with 118 GFLOPs, superior to Joint-DetNAS (X101-FPN based) at 45.7% AP with 266 GFLOPs. Note that a fair comparison is hard to make because our CBNet focuses on the architecture design of the backbone, while Joint-DetNAS focuses on the joint optimization of the architecture and training for the entire detector.

DetectoRS adopts a design similar to CBNet, but it composites both the backbone and the FPN. We compare CBNetV2 and DetectoRS with different backbones on Faster R-CNN in Table IX. Under the same training strategy as DetectoRS with $1333\times 800$ as input size, CBNet achieves comparable or higher AP with fewer FLOPs. Specifically, with the advanced backbone Swin-Tiny, our CBNet outperforms DetectoRS by 0.8% AP with only 84% of the FLOPs.

IV-E Compatibility of CBNet

Deformable convolution enhances the transformation modeling capability of CNNs and is widely used for accurate object detectors (e.g., simply adding DCN improves Faster R-CNN ResNet50 from 34.6% to 37.4% AP). To show the compatibility of the CBNet architecture with deformable convolution, we perform experiments on ResNet and ResNeXt equipped with Faster R-CNN. As shown in Table X, DCN is still effective on CBNet with an improvement of 2.3% $\sim$ 2.7% AP. This improvement is greater than the 2.0% AP and 1.3% AP increments on ResNet152 and ResNeXt101-64x4d. On the other hand, CB-ResNet50-DCN increases the AP of ResNet50-DCN and the deeper ResNet152-DCN by 3.0% and 0.6%, respectively. In addition, CB-ResNeXt50-32x4d-DCN increases the AP of ResNeXt50-32x4d-DCN and the deeper and wider ResNeXt101-64x4d-DCN by 3.7% and 1.3%, respectively. The results show that the effects of CBNet and deformable convolution can be superimposed without conflicting with each other.

IV-E2 Compatibility with Model Ensemble

Model ensembling improves on the prediction performance of a single model by training multiple different models and combining their prediction results through post-processing . Probabilistic Ranking Aware Ensemble (PRAE) refines the confidence of bounding boxes from different detectors and outperforms other ensemble learning methods for object detection by significant margins (e.g., ensembling Faster R-CNN ResNet50 and Faster R-CNN ResNeXt50 improves the single model best AP from 36.3% to 37.3%). Note that ensembling two identical detectors (i.e., two Faster R-CNN ResNeXt50) does not improve the performance (same as the single detector, 36.3% AP). To show the compatibility of our CBNet architecture with the model ensemble method PRAE, we perform experiments on traditional backbones (i.e., ResNet, ResNeXt, Res2Net) and their Composite Backbones equipped with Faster R-CNN. As shown in Table XI, PRAE is still effective for ensembling detectors with CBNet, with an improvement of 0.8% $\sim$ 1.7% AP, which is consistent with the case of ensembling detectors with traditional backbones. In addition, CBNet is more effective than the model ensembling method PRAE, e.g., Faster R-CNN CB-R50 achieves 38.0% AP, superior to the 37.3% AP of ensembling Faster R-CNN ResNet50 and Faster R-CNN ResNeXt50. The results show that the effects of CBNet and model ensemble can be superimposed without conflicting with each other, suggesting that a detector equipped with CBNet should be considered a single detector/model despite being composed of multiple identical backbones.

IV-F Ablation Studies

We ablate various design choices for our proposed CBNet. For simplicity, all accuracy results here are on the COCO validation set with $800\times 500$ input size if not specified.

We conduct experiments to compare the proposed composite strategies in Fig. 2, including SLC, AHLC, ALLC, DHLC and FCC. All these experiments are conducted based on the Faster R-CNN CB-ResNet50 architecture. Results are shown in Table XII.

SLC slightly improves the accuracy of the single-backbone baseline (35% vs. 34.6% AP). The features extracted by the same stage of both backbones are similar, and thus SLC can only learn slightly more semantic information than a single backbone does.

AHLC raises the baseline by 1.4% AP, which verifies our motivation in Sec. III-B2, i.e., the semantic information in the higher-level features of the former backbone enhances the representation ability of the latter backbone.

ALLC degrades the performance of the baseline by 2.2% AP. We infer that directly adding the lower-level features of the assisting backbone to the higher-level ones of the lead backbone impairs the representation ability of the latter.

DHLC improves the performance of the baseline by a large margin (by 2.7% AP, from 34.6% to 37.3% AP). More composite connections of the high-low cases enrich the representation ability of features to some extent.

FCC achieves the best performance of 37.4% AP while being 7% slower than DHLC (19.9 vs. 21.4 FPS).

In summary, FCC and DHLC achieve the two best results. Considering computational simplicity, we recommend using DHLC for CBNet. All the above composite strategies have a similar number of parameters, but the accuracy varies greatly. The results show that simply increasing the number of parameters or adding a backbone network does not guarantee a better result, and that the composite connection plays a crucial part. These results show that the suggested DHLC composite strategy is effective and nontrivial.
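To make the recommended DHLC connection more tangible, the following NumPy sketch fuses the features of an assisting backbone into the input of one stage of the lead backbone. It is a simplified illustration under stated assumptions, not the authors' implementation: we assume that the input of lead stage $l$ is the output of lead stage $l-1$ plus the projected and upsampled outputs of all assisting stages at level $l$ or higher, the 1x1 convolution is replaced by a random channel-mixing matrix, batch normalization is omitted, upsampling is nearest-neighbour, and all function names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)


def project(x, out_channels):
    """Stand-in for a 1x1 convolution: mix channels with a random matrix."""
    w = rng.standard_normal((out_channels, x.shape[0])) / np.sqrt(x.shape[0])
    return np.einsum("oc,chw->ohw", w, x)


def upsample_to(x, height, width):
    """Nearest-neighbour upsampling by an integer factor."""
    fh, fw = height // x.shape[1], width // x.shape[2]
    return x.repeat(fh, axis=1).repeat(fw, axis=2)


def dhlc_input(lead_prev, assist_feats, level):
    """Input to stage `level` of the lead backbone under the assumed DHLC rule.

    lead_prev    : output of lead stage `level - 1`, shape (C, H, W)
    assist_feats : {stage index: output of that stage in the assisting backbone}
    """
    c, h, w = lead_prev.shape
    fused = lead_prev.copy()
    for stage, feat in assist_feats.items():
        if stage >= level:  # only stages at or above `level` are connected
            fused += upsample_to(project(feat, c), h, w)
    return fused


# Toy pyramid: channels double while resolution halves from stage to stage.
assist = {
    2: rng.standard_normal((8, 32, 32)),
    3: rng.standard_normal((16, 16, 16)),
    4: rng.standard_normal((32, 8, 8)),
    5: rng.standard_normal((64, 4, 4)),
}
lead_x2 = rng.standard_normal((8, 32, 32))      # output of lead stage 2
x3_in = dhlc_input(lead_x2, assist, level=3)    # fuses assisting stages 3, 4, 5
print(x3_in.shape)
```

In this sketch, restricting the loop to the single stage `stage == level` would correspond to the AHLC-style connection, which is one way to see why DHLC carries more connections at a similar parameter count.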

Note that there is a minor performance gap between the original CBNetV1 in and this paper for DHLC and AHLC. The reason is that CBNetV1 and this paper were implemented on different deep learning platforms (Caffe vs. PyTorch), which use different model initialization strategies and result in different model performances. We compare DHLC and AHLC on different backbones in Table XIII. DHLC outperforms AHLC by 0.8% AP and 1.1% AP on ResNet101 and ResNeXt101-64x4d, respectively, showing the generality of DHLC for different backbones.

We conduct a grid search on a proxy task to search for better composite strategies. To reduce the search cost, we simplify the search space by only searching the connections among the $x_{3}, x_{4}, x_{5}$ stages in composite backbones, and design a proxy task with 1/5 of the COCO training set with input size set to $800\times 500$. In this way, we only need to train $(2^{3})^{3}=512$ detectors for 205 GPU days. The best-searched strategy is a simplified DHLC ($s_{3}$) without the connection from $x_{4}$ of the former backbone to the input of $x_{3}$ of the latter one. The searched strategy achieves 37.3% AP with 69.1M parameters and 126 GFLOPs, and performs on par with our designed DHLC (37.3% AP with 69.7M parameters and 127 GFLOPs), further validating the necessity of high-to-low connections in our handcrafted design.
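The size of this search space can be reproduced by enumeration. The sketch below assumes one reading of the count: each of the three stages in the latter backbone may independently receive a connection from each of the three stages in the former backbone, giving nine on/off switches. The variable names are hypothetical and only serve the illustration.

```python
from itertools import product

stages = ("x3", "x4", "x5")

# One on/off switch per (source stage in the former backbone,
# target stage in the latter backbone) pair: 3 x 3 = 9 switches.
pairs = [(src, dst) for dst in stages for src in stages]
candidates = list(product((0, 1), repeat=len(pairs)))
print(len(candidates))  # 512

# Connections restricted to these stages where the source is at the same
# or a higher level than the target (the high-to-low pattern of DHLC).
dhlc = {(s, d) for (s, d) in pairs if stages.index(s) >= stages.index(d)}

# The best-searched strategy described in the text: the same pattern
# without the connection from x4 (former) to the input of x3 (latter).
searched = dhlc - {("x4", "x3")}
print(sorted(searched))
```

Under this reading, the searched strategy differs from the handcrafted pattern by a single connection, which is consistent with the two reaching the same AP at nearly identical cost.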

IV-F2 Weights for Auxiliary Supervision

Experimental results related to weighting the auxiliary supervision are presented in Table XIV. For simplicity, we use the DHLC composite strategy on CBNet. The first setting is the Faster R-CNN CB-ResNet50 baseline and the second is the CB-ResNet50-K3 ($K=3$ in CBNet) baseline, where the $\lambda$ for the assisting backbone in Equation (8) is set to zero. For the case $K=2$, the baseline can be improved by 0.8% AP by setting $\lambda_{1}$ to 0.5. For the case $K=3$, the baseline can be improved by 1.8% AP by setting $\{\lambda_{1},\lambda_{2}\}$ to $\{0.5,1.0\}$. The experimental results verify that the auxiliary supervision forms an effective training strategy that improves the performance of CBNet.
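The role of the weights can be illustrated with a small sketch. It assumes that Equation (8) has the form of the lead-backbone loss plus a $\lambda$-weighted sum of the assisting-backbone losses; the function name `total_loss` and the loss values are hypothetical, and only the weights 0.5 and $\{0.5, 1.0\}$ are taken from the text.

```python
def total_loss(lead_loss, assist_losses, lambdas):
    """Lead loss plus weighted assisting-backbone losses (assumed form of Eq. (8))."""
    if len(assist_losses) != len(lambdas):
        raise ValueError("need exactly one weight per assisting backbone")
    return lead_loss + sum(w * l for w, l in zip(lambdas, assist_losses))


# K = 2: one assisting backbone, weight 0.5 as in the text.
loss_k2 = total_loss(1.0, [2.0], [0.5])
# K = 3: two assisting backbones, weights {0.5, 1.0} as in the text.
loss_k3 = total_loss(1.0, [2.0, 1.5], [0.5, 1.0])
# All weights set to zero: only the lead loss remains (the baseline setting).
loss_base = total_loss(1.0, [2.0, 1.5], [0.0, 0.0])

print(loss_k2, loss_k3, loss_base)  # 2.0 3.5 1.0
```

Because the assisting losses only enter the training objective, a formulation of this kind adds no computation at inference time.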

IV-F3 Efficiency of Pruning Strategy

As shown in Fig. 6(a), with the pruning strategy, our CB-ResNet50 family and CB-ResNet50-K3 family achieve better FLOPs-accuracy trade-offs than the ResNet family. This also illustrates the efficiency of our pruning strategy. In particular, the number of FLOPs in $s_{3}$ is reduced by 10% compared to $s_{4}$, but the accuracy is decreased by only 0.1%. This is because the weights of the pruned stage are fixed during the detector training, so pruning this stage does not sacrifice detection accuracy. Hence, when speed and memory cost need to be prioritized, we suggest pruning the fixed stages in the $2,3,\ldots,K$-th backbones in CBNet.

IV-F4 Number of Backbones in CBNet

To further explore the ability of CBNet to construct high-performance detectors, we evaluate the efficiency of our CBNet by controlling the number of backbones. As shown in Fig. 6(b), we vary the number of backbones ($K = 1,2,3,4,5$) and compare their accuracy and efficiency (GFLOPs) with the ResNet family. Note that the accuracy continues to increase as the complexity of the model increases. Compared with ResNet152, our method obtains higher accuracy at $K=2$ while the computation cost is lower. Meanwhile, the accuracy can be further improved for $K=3,4,5$. CBNet provides an effective and efficient alternative to improve the model performance rather than simply increasing the depth or width of the backbone.

IV-F5 Comparison of CBNetV1 and CBNetV2

To fairly compare CBNetV1 and CBNetV2, we progressively apply the DHLC composite strategy, auxiliary supervision, and pruning strategy to CBNetV1, where AHLC is the default composite strategy in Table XV. As in the 1st and 2nd rows of Table XV, the composite backbone structure CBNetV1 improves the Faster R-CNN ResNet50 baseline by 1.4% AP. As in the 2nd and 3rd rows, the accelerated version of CBNetV1 ($s_{2}$ pruning version in Fig. 5) improves the inference speed from 22.4 FPS to 26.6 FPS while decreasing the accuracy by 0.4% AP. As in the 2nd and 4th rows, the auxiliary supervision brings a 0.9% AP increment to CBNetV1, thanks to the better training strategy that improves the representative ability of the lead backbone. Note that the auxiliary supervision does not introduce extra parameters during the inference phase. As in the 2nd and 5th rows, the DHLC composite strategy improves the detection performance of CBNetV1 by 1.3% AP with higher model complexity. The results confirm that DHLC enables a larger receptive field, with features at each level obtaining rich semantic information from all higher-level features. As in the 1st and 6th rows, when combining DHLC and the auxiliary supervision, there is a significant improvement of 2.1% AP over the baseline. As in the 2nd and last rows, when we apply our default pruning strategy ($s_{3}$ version in Fig. 5), CBNetV2 is faster (23.3 vs. 22.4 FPS) and much more accurate (38.0% vs. 36.0% AP) than CBNetV1 . DHLC slows down the detector, while the pruning strategy effectively speeds up the inference of CBNetV2.

IV-F6 Importance of Identical Backbones for CBNet

To verify the necessity of identical backbones in CBNet, we explore diverse backbones by compositing ResNet50, ResNet101, Res2Net50, and Res2Net101. Note that no pruning is conducted when compositing diverse backbones, and backbones from different families do not share the stem layer (Conv1 in Fig. 3). As shown in Table XVI, for backbones belonging to the same family, compositing identical backbones outperforms compositing diverse ones. For example, CB-ResNet50 achieves higher AP with fewer parameters than both ResNet50-C-ResNet101 and ResNet101-C-ResNet50. Similarly, CB-Res2Net50 gains higher or comparable AP with fewer parameters than both Res2Net50-C-Res2Net101 and Res2Net101-C-Res2Net50. For backbones from different families, the observation still holds. For example, CB-Res2Net50 achieves better performance than ResNet50-C-Res2Net101, Res2Net101-C-ResNet50, ResNet101-C-Res2Net50, and Res2Net50-C-ResNet101. These experimental results indicate that increasing the diversity of composite models is not the most efficient way for CBNet. We believe the reason is that different backbones need different optimization strategies and usually output very different learned features, which makes joint training difficult. CBNet intends to learn similar features for each grouped backbone, and the stronger the former backbones are, the more representative features the lead backbone outputs. Our experiments show that such a joint-training strategy works best for identical backbone grouping. This validates the necessity of the identical backbones in CBNet and further distinguishes our approach from ensemble methods, where diversity is a key characteristic.

V Conclusion

In this paper, we propose a novel and flexible backbone framework, called Composite Backbone Network (CBNet), to improve the performance of cutting-edge object detectors.

CBNet consists of a series of backbones with the same network architecture in parallel, the Dense Higher-Level composition strategy, and the auxiliary supervision. Together they construct a robust representative backbone network that uses existing pre-trained backbones under the pre-training fine-tuning paradigm. CBNet has strong generalization capabilities for different backbones and head designs of the detector architecture. Extensive experimental results demonstrate that the proposed CBNet is compatible with various backbone networks, including CNN-based (ResNet, ResNeXt, Res2Net) and Transformer-based (Swin-Transformer) ones. At the same time, CBNet is more effective and efficient than simply increasing the depth and width of the network. Furthermore, CBNet can be flexibly plugged into most mainstream detectors, including one-stage (e.g., RetinaNet) and two-stage (Faster R-CNN, Mask R-CNN, Cascade R-CNN, and Cascade Mask R-CNN) detectors, as well as anchor-based (e.g., Faster R-CNN) and anchor-free (ATSS) ones. CBNet is compatible with feature enhancing networks (DCN and HRNet) and model ensemble methods. Specifically, the performances of the above detectors are increased by over 3% AP. In particular, our CB-Swin-L achieves a new record of 59.4% box AP and 51.6% mask AP on COCO test-dev, outperforming prior single-model single-scale results. With multi-scale testing, we achieve a new state-of-the-art result of 60.1% box AP and 52.3% mask AP without extra training data.

Acknowledgment

This work was supported by National Natural Science Foundation of China under Grant 62176007. This work was also a research achievement of Key Laboratory of Science, Technology, and Standard in Press Industry (Key Laboratory of Intelligent Press Media Technology). We thank Dr. Han Hu, Prof. Ming-Ming Cheng and Shang-Hua Gao for the insightful discussions.

References