ResNeSt: Split-Attention Networks

Hang Zhang, Chongruo Wu, Zhongyue Zhang, Yi Zhu, Haibin Lin, Zhi Zhang, Yue Sun, Tong He, Jonas Mueller, R. Manmatha, Mu Li, Alexander Smola

Introduction

Deep convolutional neural networks (CNNs) have become the fundamental approach for image classification and other transfer learning tasks in computer vision. As the key component of CNNs, a convolutional layer learns a set of filters that aggregate neighborhood information through spatial and channel connections. This operation is well suited to capturing correlated features, since the output channels are densely connected to each input channel. Inception models explore multi-path representations to learn independent features: the input is split into a few lower-dimensional embeddings, transformed by different sets of convolutional filters, and then merged by concatenation. This strategy encourages feature exploration by decoupling the input channel connections.

The neuron connections in the visual cortex have inspired the development of CNNs over the past decades. The main theme of visual representation learning is discovering salient features for a given task. Prior work has modeled spatial and channel dependencies and incorporated attention mechanisms. SE-like channel-wise attention employs global pooling to squeeze the channel statistics and predicts a set of attention factors that are applied to the original featuremaps by channel-wise multiplication. This mechanism models the interdependencies of featuremap channels, using global context information to selectively highlight or de-emphasize features. It is similar to the attentional selection stage of the human primary visual cortex, which finds the informative parts for recognizing objects. Humans and animals perceive various visual patterns using separate regions of the cortex that respond to different, particular visual features. This strategy makes it easy to identify subtle but dominant differences between similar objects in the neural perception system. Similarly, if we can build a CNN architecture that captures individual salient attributes for different visual features, we would improve the network representation for image classification.

In this paper, we present a simple architecture that combines the channel-wise attention strategy with a multi-path network layout. Our method captures cross-channel feature correlations while preserving independent representations in the meta structure. A module of our network performs a set of transformations on low-dimensional embeddings and concatenates their outputs as in a multi-path network. Each transformation incorporates a channel-wise attention strategy to capture interdependencies of the featuremap. We further simplify the architecture so that each transformation shares the same topology (e.g. Fig 2 (Right)), which lets us parameterize the network architecture with only a few variables. In addition, this setting allows us to accelerate training using an identical implementation with unified CNN operators. We refer to such a computation block as a Split-Attention Block. Stacking several Split-Attention blocks in ResNet style, we create a new ResNet variant which we refer to as Split-Attention Network (ResNeSt).

We benchmark the performance of the proposed ResNeSt networks on the ImageNet dataset. The proposed ResNeSt achieves better speed-accuracy trade-offs than state-of-the-art CNN models produced via neural architecture search, as shown in Table 2. In addition, we study transfer learning results on object detection, instance segmentation and semantic segmentation. The proposed ResNeSt achieves superior performance on several gold-standard benchmarks when serving as the backbone network. For example, our Cascade-RCNN model with a ResNeSt-101 backbone achieves 48.3% box mAP and 41.56% mask mAP on MS-COCO instance segmentation. Our DeepLabV3 model, again using a ResNeSt-101 backbone, achieves an mIoU of 46.9% on the ADE20K scene parsing validation set, which surpasses the previous best result by more than 1% mIoU. Furthermore, ResNeSt has been adopted by the winning entries of the 2020 COCO-LVIS challenge.

Related Work

Since AlexNet, deep convolutional neural networks have dominated image classification. With this trend, research has shifted from engineering handcrafted features to engineering network architectures. NIN first uses a global average pooling layer to replace the heavy fully connected layers, and adopts $1\times 1$ convolutional layers to learn non-linear combinations of the featuremap channels, which is the first kind of featuremap attention mechanism. VGG-Net proposes a modular network design strategy, stacking the same type of network blocks repeatedly, which simplifies both the workflow of network design and transfer learning for downstream applications. Highway networks introduce highway connections, which let information flow across several layers without attenuation and help the network converge. Building on the success of this pioneering work, ResNet introduces an identity skip connection, which alleviates the vanishing gradient problem in deep neural networks and allows the network to learn improved feature representations. ResNet has become one of the most successful CNN architectures and has been adopted in various computer vision applications.

Multi-path and Featuremap Attention.

Multi-path representation has shown success in GoogLeNet, in which each network block consists of different convolutional kernels. ResNeXt adopts group convolution in the ResNet bottleneck block, which converts the multi-path structure into a unified operation. SE-Net introduces a channel-attention mechanism by adaptively recalibrating the channel feature responses. Recently, SK-Net brings featuremap attention across two network branches. Inspired by these methods, our network integrates channel-wise attention with a multi-path network representation.

Neural Architecture Search.

With increasing computational power, research interest has begun shifting from manually designed architectures to systematically searched architectures. Recent work has explored efficient neural architecture search via parameter sharing and has achieved great success in low-latency and low-complexity CNN models. However, searching a large-scale neural network is still challenging because sharing parameters with other architectures incurs high GPU memory usage. EfficientNet first searches in a small setting and then scales up the network complexity systematically. Instead, we build our model with the ResNet meta architecture to scale up the network to deeper versions (from 50 to 269 layers). Our approach also augments the search space for neural architecture search and may improve overall performance, which can be studied in future work.

Split-Attention Networks

We now introduce the Split-Attention block, which enables featuremap attention across different featuremap groups in Section 3.1. Later, we describe our network instantiation and how to accelerate this architecture via standard CNN operators in Section 3.2.

Our Split-Attention block is a computational unit, consisting of featuremap group and split attention operations. Figure 2 (Right) depicts an overview of a Split-Attention Block.

Featuremap Group. As in ResNeXt blocks, the feature can be divided into several groups, and the number of featuremap groups is given by a cardinality hyperparameter $K$. We refer to the resulting featuremap groups as cardinal groups. In this paper, we introduce a new radix hyperparameter $R$ that indicates the number of splits within a cardinal group, so the total number of feature groups is $G=KR$. We may apply a series of transformations $\{\mathcal{F}_{1},\mathcal{F}_{2},...\mathcal{F}_{G}\}$ to each individual group, then the intermediate representation of each group is $U_{i}=\mathcal{F}_{i}(X)$, for $i\in\{1,2,...G\}$.

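Split Attention in Cardinal Groups. A combined representation for each cardinal group is obtained by fusing its splits via element-wise summation: $\hat{U}^{k}=\sum_{j=R(k-1)+1}^{Rk}U_{j}$ for $k\in\{1,2,...K\}$. Global contextual information $s^{k}$ is then gathered by global average pooling of $\hat{U}^{k}$ over the spatial dimensions. The cardinal group representation $V^{k}$ is aggregated using channel-wise soft attention, where each featuremap channel is a weighted combination over the splits. The $c$-th channel is calculated as:

$$V^{k}_{c}=\sum_{i=1}^{R}a_{i}^{k}(c)\,U_{R(k-1)+i},$$

where $a_{i}^{k}(c)$ denotes a (soft) assignment weight given by:

$$a_{i}^{k}(c)=\begin{cases}\dfrac{\exp(\mathcal{G}_{i}^{c}(s^{k}))}{\sum_{j=1}^{R}\exp(\mathcal{G}_{j}^{c}(s^{k}))} & \text{if } R>1,\\ \dfrac{1}{1+\exp(-\mathcal{G}_{i}^{c}(s^{k}))} & \text{if } R=1,\end{cases}$$

and mapping $\mathcal{G}_{i}^{c}$ determines the weight of each split for the $c$-th channel based on the global context representation $s^{k}$.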
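As an illustration of the soft assignment weights described above, the following sketch computes per-channel attention weights across the $R$ splits of one cardinal group and uses them to fuse the splits. It is a minimal pure-Python sketch rather than the authors' implementation: it assumes a softmax over the splits for $R>1$ and a sigmoid gate for $R=1$, and the function names `split_attention_weights` and `fuse_splits`, as well as the nested-list data layout, are hypothetical.

```python
import math


def split_attention_weights(logits):
    """Turn attention logits into soft assignment weights.

    logits[i][c] is the score of split i for channel c within one cardinal
    group (the output of the attention function applied to the pooled context).
    Assumption: softmax across splits when radix > 1, sigmoid when radix == 1.
    """
    radix = len(logits)
    channels = len(logits[0])
    if radix == 1:
        return [[1.0 / (1.0 + math.exp(-z)) for z in logits[0]]]
    weights = [[0.0] * channels for _ in range(radix)]
    for c in range(channels):
        # Subtract the per-channel maximum for numerical stability.
        m = max(logits[i][c] for i in range(radix))
        exps = [math.exp(logits[i][c] - m) for i in range(radix)]
        total = sum(exps)
        for i in range(radix):
            weights[i][c] = exps[i] / total
    return weights


def fuse_splits(splits, weights):
    """Weighted sum over splits, applied channel by channel.

    splits[i][c] is the flattened (H*W) list of values of channel c in split i;
    weights[i][c] is the attention weight of split i for channel c.
    """
    radix = len(splits)
    channels = len(splits[0])
    fused = []
    for c in range(channels):
        positions = len(splits[0][c])
        fused.append([
            sum(weights[i][c] * splits[i][c][p] for i in range(radix))
            for p in range(positions)
        ])
    return fused
```

With equal logits for two splits, each split receives weight 0.5 for that channel, so the fused channel is the plain average of the two splits.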

ResNeSt Block. The cardinal group representations are then concatenated along the channel dimension: $V=\text{Concat}\{V^{1},V^{2},...V^{K}\}$. As in standard residual blocks, the final output $Y$ of our Split-Attention block is produced using a shortcut connection: $Y=V+X$, if the input and output featuremaps share the same shape. For blocks with a stride, an appropriate transformation $\mathcal{T}$ is applied to the shortcut connection to align the output shapes: $Y=V+\mathcal{T}(X)$. For example, $\mathcal{T}$ can be strided convolution or combined convolution-with-pooling.

Instantiation and Computational Costs. Figure 2 (Right) shows an instantiation of our Split-Attention block, in which the group transformation $\mathcal{F}_{i}$ is a $1\times 1$ convolution followed by a $3\times 3$ convolution, and the attention weight function $\mathcal{G}$ is parameterized using two fully connected layers with ReLU activation. The number of parameters and FLOPS of a Split-Attention block are roughly the same as those of a standard residual block with the same cardinality and number of channels.

Relation to Existing Attention Methods. First introduced in SE-Net, the idea of squeeze-and-attention (called excitation in the original paper) is to employ a global context to predict channel-wise attention factors. With $\text{radix}=1$, our Split-Attention block applies a squeeze-and-attention operation to each cardinal group, while SE-Net operates on top of the entire block regardless of multiple groups. SK-Net introduces feature attention between two network streams. Setting $\text{radix}=2$, the Split-Attention block applies SK-like attention to each cardinal group. Our method generalizes prior work on featuremap attention within a cardinal group setting, and its implementation remains computationally efficient. Figure 2 shows an overall comparison with SE-Net and SK-Net blocks.

3.2 Efficient Radix-major Implementation

We refer to the layout described in the previous section as cardinality-major implementation, where the featuremap groups with the same cardinal index reside next to each other physically (Figure 2 (Right)). The cardinality-major implementation is straightforward and intuitive, but is difficult to modularize and accelerate using standard CNN operators. For this, we introduce an equivalent radix-major implementation.

Figure 4 gives an overview of the Split-Attention block in radix-major layout. The input featuremap is first divided into $RK$ groups, in which each group has a cardinality-index and radix-index. In this layout, the groups with the same radix-index reside next to each other. Then, we can conduct a summation across different splits, so that the featuremap groups with the same cardinality-index but different radix-index are fused together. A global pooling layer aggregates over the spatial dimension while keeping the channel dimension separated, which is identical to applying global pooling to each individual cardinal group and then concatenating the results. Then two consecutive fully connected (FC) layers, with the number of groups equal to the cardinality, are added after the pooling layer to predict the attention weights for each split. The use of grouped FC layers makes this identical to applying each pair of FCs separately on top of each cardinal group.

With this implementation, the first $1\times 1$ convolutional layers can be unified into one layer, and the $3\times 3$ convolutional layers can be implemented using a single grouped convolution with $RK$ groups. Therefore, the Split-Attention block is modularized using standard CNN operators.
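The relationship between the two layouts can be sketched as a permutation of group indices. The snippet below is an illustrative sketch only: it uses 0-based indices, stands in plain numbers for featuremap groups, and all function names are hypothetical. It shows that summing over splits gives the same per-cardinal-group result in either layout, while in the radix-major layout the groups to be summed are strided chunks that map naturally onto a reshape-and-sum.

```python
def cardinality_major_index(k, r, cardinality, radix):
    """Position of (cardinal index k, radix index r) when groups sharing a
    cardinal index are stored next to each other."""
    return k * radix + r


def radix_major_index(k, r, cardinality, radix):
    """Position of (k, r) when groups sharing a radix index are stored next
    to each other."""
    return r * cardinality + k


def to_radix_major(groups, cardinality, radix):
    """Reorder a cardinality-major list of groups into radix-major order."""
    out = [None] * (cardinality * radix)
    for k in range(cardinality):
        for r in range(radix):
            src = cardinality_major_index(k, r, cardinality, radix)
            dst = radix_major_index(k, r, cardinality, radix)
            out[dst] = groups[src]
    return out


def fuse_cardinality_major(groups, cardinality, radix):
    """Sum the splits of each cardinal group (contiguous slices)."""
    return [sum(groups[k * radix:(k + 1) * radix]) for k in range(cardinality)]


def fuse_radix_major(groups, cardinality, radix):
    """Sum the splits of each cardinal group (one entry per radix chunk)."""
    return [
        sum(groups[r * cardinality + k] for r in range(radix))
        for k in range(cardinality)
    ]
```

For example, with cardinality 2 and radix 3, the cardinality-major list `[1, 2, 3, 10, 20, 30]` becomes `[1, 10, 2, 20, 3, 30]` in radix-major order, and both fusion functions return `[6, 60]`.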

Network and Training

We now describe the network design and training strategies used in our experiments. First, we detail a couple of tweaks that further improve performance, some of which have been empirically validated in prior work.

Average Downsampling. For transfer learning on dense prediction tasks such as detection or segmentation, it becomes essential to preserve spatial information. Recent ResNet implementations usually apply the strided convolution at the $3\times 3$ layer instead of the preceding $1\times 1$ layer to better preserve such information. Convolutional layers require handling featuremap boundaries with zero-padding strategies, which is often suboptimal when transferring to other dense prediction tasks. Instead of using strided convolution at the transitioning block (in which the spatial resolution is downsampled), we use an average pooling layer with a kernel size of $3\times 3$.

Tweaks from ResNet-D. We also adopt two simple yet effective ResNet modifications introduced by ResNet-D: (1) The first $7\times 7$ convolutional layer is replaced with three consecutive $3\times 3$ convolutional layers, which have the same receptive field size and a similar computation cost as the original design. (2) A $2\times 2$ average pooling layer is added to the shortcut connection prior to the $1\times 1$ convolutional layer for the transitioning blocks with a stride of two.

4.2 Training Strategy

Large Mini-batch Distributed Training. To train deep CNN models effectively, we follow prior work and train our models using 8 servers (64 GPUs in total) in parallel. (Note that large mini-batch training does not improve network accuracy; instead, it often degrades the results.) Our learning rates are adjusted according to a cosine schedule. Following common practice, we scale the initial learning rate linearly with the mini-batch size: $\eta=\frac{B}{256}\eta_{base}$, where $B$ is the mini-batch size and $\eta_{base}=0.1$ is the base learning rate. A warm-up strategy is applied over the first 5 epochs, gradually increasing the learning rate linearly from 0 to the initial value for the cosine schedule. The batch normalization (BN) parameter $\gamma$ is initialized to zero in the final BN operation of each block, as has been suggested for large batch training.
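The learning-rate rule described here can be sketched as a small function. This is a hedged illustration, not the training code used in the paper: the function name `learning_rate` is hypothetical, the schedule is evaluated per epoch (fractional epochs may be passed to emulate per-iteration updates), and the cosine phase is assumed to decay to zero at the end of training.

```python
import math


def learning_rate(epoch, total_epochs, batch_size, base_lr=0.1, warmup_epochs=5):
    """Linear-scaling rule, linear warm-up, then cosine decay.

    epoch may be fractional. Assumption: the cosine schedule decays to zero
    at total_epochs.
    """
    # Linear scaling: initial learning rate = (B / 256) * base learning rate.
    peak = base_lr * batch_size / 256
    if epoch < warmup_epochs:
        # Warm-up: increase linearly from 0 to the initial (peak) value.
        return peak * epoch / warmup_epochs
    # Cosine decay over the remaining epochs.
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return 0.5 * peak * (1.0 + math.cos(math.pi * progress))
```

For instance, a mini-batch size of 8192 gives an initial learning rate of about 3.2 under this rule, reached at the end of the fifth warm-up epoch.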

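Label Smoothing. Label smoothing was first used to improve the training of Inception-V2. Recall that the cross entropy loss incurred by our network's predicted class probabilities $q$ is computed against the ground truth $p$ as:

$$\ell(p,q)=-\sum_{i=1}^{K}p_{i}\log q_{i},$$

where $K$ here denotes the total number of classes, $p_{i}$ is the ground-truth probability of the $i$-th class, and $q_{i}$ is the network's predicted probability for the $i$-th class. With hard labels, $p_{i}=1$ if $i$ equals the ground-truth class $c$ and $p_{i}=0$ otherwise. Rather than assigning hard labels as targets, label smoothing uses a smoothed ground-truth probability:

$$p_{i}=\begin{cases}1-\varepsilon & \text{if } i=c,\\ \varepsilon/(K-1) & \text{otherwise,}\end{cases}$$

with small constant $\varepsilon>0$. This mitigates network overconfidence and overfitting.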
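A minimal sketch of label smoothing in code may help make the idea concrete. It assumes one common formulation in which the true class receives probability $1-\varepsilon$ and the remaining $\varepsilon$ is spread uniformly over the other classes; the value `eps=0.1` and the function names are illustrative assumptions, not values taken from the text.

```python
import math


def smoothed_targets(true_class, num_classes, eps=0.1):
    """Smoothed target distribution: 1 - eps on the true class, with eps
    shared equally among the other classes (assumed formulation)."""
    off_value = eps / (num_classes - 1)
    return [1.0 - eps if i == true_class else off_value
            for i in range(num_classes)]


def cross_entropy(targets, logits):
    """Cross entropy between a target distribution and softmax(logits)."""
    m = max(logits)
    # log of the softmax normalizer, computed stably.
    log_z = m + math.log(sum(math.exp(z - m) for z in logits))
    return -sum(p * (z - log_z) for p, z in zip(targets, logits))
```

Because every class keeps a nonzero target probability, the loss no longer rewards pushing the true-class logit toward infinity, which is the overconfidence that smoothing is meant to curb.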

Auto Augmentation. Auto-Augment is a strategy that augments the training data with transformed images, where the transformations are learned adaptively. It introduces 16 different types of image jittering transformations and augments the data based on 24 different combinations of two consecutive transformations, such as shift, rotation, and color jittering. The magnitude of each transformation can be controlled with a relative parameter (e.g. rotation angle), and transformations may be probabilistically skipped.

Mixup Training. Mixup is another data augmentation strategy that generates a weighted combination of random image pairs from the training data. Given two images and their ground truth labels: $(x^{(i)},y^{(i)}),(x^{(j)},y^{(j)})$, a synthetic training example $(\hat{x},\hat{y})$ is generated as:

$$\hat{x}=\lambda x^{(i)}+(1-\lambda)x^{(j)},\qquad \hat{y}=\lambda y^{(i)}+(1-\lambda)y^{(j)},$$

where $\lambda\sim\text{Beta}(\alpha=0.2)$ is independently sampled for each augmented example.
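The mixup construction can be sketched in a few lines. This is an illustrative sketch under stated assumptions: images are flattened lists of floats, labels are one-hot (or soft) probability vectors, both Beta parameters are set to $\alpha$, and the function name `mixup` and its `rng` parameter are hypothetical.

```python
import random


def mixup(x_i, y_i, x_j, y_j, alpha=0.2, rng=random):
    """Blend two (image, label) pairs with a Beta(alpha, alpha) coefficient.

    Images are flat lists of pixel values and labels are probability vectors.
    Returns the mixed image, the mixed label, and the sampled coefficient.
    """
    lam = rng.betavariate(alpha, alpha)
    x_hat = [lam * a + (1.0 - lam) * b for a, b in zip(x_i, x_j)]
    y_hat = [lam * a + (1.0 - lam) * b for a, b in zip(y_i, y_j)]
    return x_hat, y_hat, lam
```

With a small $\alpha$ such as 0.2, the sampled coefficient tends to fall near 0 or 1, so most synthetic examples stay close to one of the two source images.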

Large Crop Size. Image classification research typically compares the performance of different networks operating on images that share the same crop size. ResNet variants usually use a fixed training crop size of 224, while the Inception-Net family uses a training crop size of 299. Recently, the EfficientNet method has demonstrated that increasing the input image size for a deeper and wider network may better trade off accuracy vs. FLOPS. For fair comparison, we use a crop size of 224 when comparing our ResNeSt with ResNet variants, and a crop size of 256 when comparing with other approaches.

Regularization. Very deep neural networks tend to overfit even for large datasets. To prevent this, dropout regularization randomly masks out some neurons during training (but not during inference) to form an implicit network ensemble. A dropout layer with a dropout probability of 0.2 is applied before the final fully connected layer in networks with more than 200 layers. We also apply DropBlock layers to the convolutional layers at the last two stages of the network. As a structured variant of dropout, DropBlock randomly masks out local block regions, and is more effective than dropout for specifically regularizing convolutional layers.

Finally, we also apply weight decay (i.e. L2 regularization), which additionally helps stabilize training. We only apply weight decay to the weights of convolutional and fully connected layers.

Image Classification Results

Our first experiments study the image classification performance of ResNeSt on the ImageNet 2012 dataset with 1.28M training images and 50K validation images (from 1000 different classes). As is standard, networks are trained on the training set and we report their top-1 accuracy on the validation set.

We use data sharding for distributed training on ImageNet, evenly partitioning the data across GPUs. At each training iteration, a mini-batch of training data is sampled from the corresponding shard (without replacement). We apply the transformations from the learned Auto Augmentation policy to each individual image. Then we further apply standard transformations including: random size crop, random horizontal flip, color jittering, and changing the lighting. Finally, the image data are RGB-normalized via mean/standard-deviation rescaling. For mixup training, we simply mix each sample from the current mini-batch with its reversed-order sample. Batch Normalization is used after each convolutional layer before ReLU activation. Network weights are initialized using Kaiming Initialization. A dropout layer with dropout ratio $0.2$ is inserted before the final classification layer. Training is done for 270 epochs with a weight decay of 0.0001 and momentum of 0.9, using a cosine learning rate schedule with the first 5 epochs reserved for warm-up. We use a mini-batch of size 8192 for ResNeSt-50, 4096 for ResNeSt-101, and 2048 for ResNeSt-{200, 269}. For evaluation, we first resize each image to 1/0.875 of the crop size along the short edge and apply a center crop. Our code implementation for ImageNet training uses GluonCV with MXNet.
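The evaluation-time preprocessing (resize the short edge to 1/0.875 of the crop size, then center crop) can be illustrated with a small geometry helper. The sketch below only computes sizes and the crop box and does not touch pixel data; the function name and the choice to round to the nearest integer are assumptions.

```python
def eval_resize_and_crop_box(width, height, crop_size, crop_ratio=0.875):
    """Return the resized (width, height) and the center-crop box.

    The short edge is resized to crop_size / crop_ratio while keeping the
    aspect ratio; the box is (left, top, right, bottom) in the resized image.
    Assumption: sizes are rounded to the nearest integer.
    """
    short_target = int(round(crop_size / crop_ratio))
    scale = short_target / min(width, height)
    new_w = int(round(width * scale))
    new_h = int(round(height * scale))
    left = (new_w - crop_size) // 2
    top = (new_h - crop_size) // 2
    return (new_w, new_h), (left, top, left + crop_size, top + crop_size)
```

For a crop size of 224, the short edge is resized to 256, so a 500 by 375 image becomes 341 by 256 before a centered 224 by 224 crop is taken.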

5.2 Ablation Study

ResNeSt is based on the ResNet-D model. Mixup training improves the accuracy of ResNetD-50 from 78.31% to 79.15%. Auto augmentation further improves the accuracy by 0.26%. When employing our Split-Attention block to form a ResNeSt-50-fast model, accuracy is further boosted to 80.64%. In this ResNeSt-fast setting, the effective average downsampling is applied prior to the $3\times 3$ convolution to avoid introducing extra computational costs in the model. With the downsampling operation moved after the convolutional layer, ResNeSt-50 achieves 81.13% accuracy.

Radix vs. Cardinality. We conduct an ablation study on ResNeSt variants with different radix/cardinality. In each variant, we adjust the network's width appropriately so that its overall computational cost remains similar to the ResNet variants. The results are shown in Table 1, where $s$ denotes the radix, $x$ the cardinality, and $d$ the network width ($0s$ represents the use of a standard residual block as in ResNet-D). We empirically find that increasing the radix from 0 to 4 continuously improves the top-1 accuracy, while also increasing latency and memory usage. Although we expect further accuracy improvements with even greater radix/cardinality, we employ Split-Attention with the $2s1x64d$ setting in subsequent experiments, to ensure these blocks scale to deeper networks with a good trade-off between speed, accuracy and memory usage.
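The variant naming used in this ablation (radix, then cardinality, then width) can be decoded mechanically. The helper below is a hypothetical convenience for reading names such as `2s1x64d`; its name, the returned dictionary keys, and the strict pattern it accepts are assumptions made for illustration.

```python
import re


def parse_variant(name):
    """Decode a '<radix>s<cardinality>x<width>d' variant name.

    A radix of 0 denotes a standard residual block without Split-Attention.
    """
    match = re.fullmatch(r"(\d+)s(\d+)x(\d+)d", name)
    if match is None:
        raise ValueError("unrecognized variant name: " + repr(name))
    radix, cardinality, width = (int(g) for g in match.groups())
    return {
        "radix": radix,
        "cardinality": cardinality,
        "width": width,
        "uses_split_attention": radix > 0,
    }
```

For example, `parse_variant("2s1x64d")` yields radix 2, cardinality 1 and width 64, the setting adopted in the subsequent experiments.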

5.3 Comparing against the State-of-the-Art

To compare with CNN models trained using different crop size settings, we increase the training crop size for deeper models. We use a crop size of $256\times 256$ for ResNeSt-200 and $320\times 320$ for ResNeSt-269. A bicubic upsampling strategy is employed for input sizes greater than 256. The results are shown in Table 2, where we compare the inference speed in addition to the number of parameters. We find that, despite its advantage in the parameter-accuracy trade-off, the widely used depth-wise convolution is not optimized for inference speed. In this benchmark, all inference speeds are measured using a mini-batch of 16 with the implementation from the original authors on a single NVIDIA V100 GPU. The proposed ResNeSt has a better accuracy and latency trade-off than models found via neural architecture search.

Transfer Learning Results

We report our detection result on MS-COCO in Table 9. All models are trained on the COCO-2017 training set with 118k images, and evaluated on the COCO-2017 validation set with 5k images (a.k.a. minival) using the standard single-scale COCO AP metric. We train all models with FPN, synchronized batch normalization and image scale augmentation (the short side of an image is picked randomly from 640 to 800). A 1x learning rate schedule is used. We conduct Faster-RCNN and Cascade-RCNN experiments using Detectron2. For comparison, we simply replace the vanilla ResNet backbones with our ResNeSt, while using the default settings for the hyper-parameters and detection heads.

Compared to the baselines using standard ResNet, our backbone boosts mean average precision by around 3% on both Faster-RCNN and Cascade-RCNN. The result demonstrates that our backbone generalizes well and can be easily transferred to the downstream task. Notably, our ResNeSt50 outperforms ResNet101 on both Faster-RCNN and Cascade-RCNN detection models, using significantly fewer parameters. Detailed results are given in Table 9. We also evaluate our Cascade-RCNN with ResNeSt101-deformable, trained using a 1x learning rate schedule, on the COCO test-dev set. It yields a box mAP of 49.2 using single scale inference.

6.2 Instance Segmentation

To explore the generalization ability of our novel backbone, we also apply it to instance segmentation tasks. Besides the bounding box and category probability, instance segmentation also predicts object masks, for which a more accurate dense image representation is desirable.

We evaluate the Mask-RCNN and Cascade-Mask-RCNN models with ResNeSt-50 and ResNeSt-101 as their backbones. All models are trained along with FPN and synchronized batch normalization. For data augmentation, the shorter side of each input image is randomly scaled to one of (640, 672, 704, 736, 768, 800). For a fair comparison with other methods, a 1x learning rate schedule is applied, and other hyper-parameters remain the same. We re-train the baseline with the same setting described above, but with the standard ResNet. All our experiments are trained on the COCO-2017 dataset using Detectron2. For the baseline experiments, the backbone we use by default is the MSRA version of ResNet, which has stride 2 on the $1\times 1$ conv layer. Both bounding box and mask mAP are reported on the COCO-2017 validation dataset.

As shown in Table 4, our new backbone achieves better performance. For Mask-RCNN, ResNeSt50 outperforms the baseline with a gain of 2.85%/2.09% for box/mask performance, and ResNeSt101 exhibits even better improvement of 4.03%/3.14%. For Cascade-Mask-RCNN, the gains produced by switching to ResNeSt50 or ResNeSt101 are 3.13%/2.36% or 3.51%/3.04%, respectively. This suggests a model will be better if it consists of more Split-Attention modules. As observed in the detection results, the mAP of our ResNeSt50 exceeds the result of the standard ResNet101 backbone, which indicates a higher capacity of the small model with our proposed module. Finally, we also train a Cascade-Mask-RCNN with ResNeSt101-deformable using a 1x learning rate schedule. We evaluate it on the COCO test-dev set, yielding 50.0 box mAP, and 43.1 mask mAP respectively. Additional experiments under different settings are included in the supplementary material.

6.3 Semantic Segmentation

In transfer learning for semantic segmentation, we use the GluonCV implementation of DeepLabV3 as a baseline approach. Here a dilated network strategy is applied to the backbone network, resulting in a stride-8 model. Synchronized Batch Normalization is used during training, along with a polynomial-like learning rate schedule (with an initial learning rate of $0.1$). For evaluation, the network prediction logits are upsampled 8 times to calculate the per-pixel cross entropy loss against the ground truth labels. We use multi-scale evaluation with flipping.

We first consider the Cityscapes dataset, which consists of 5K high-quality labeled images. We train each model on 2,975 images from the training set and report its mIoU on 500 validation images. Following prior work, we only consider 19 object/stuff categories in this benchmark. We have not used any coarse labeled images or any extra data in this benchmark. Our ResNeSt backbone boosts the mIoU achieved by DeepLabV3 models by around 1% while maintaining a similar overall model complexity. Notably, the DeepLabV3 model using our ResNeSt-50 backbone already achieves better performance than DeepLabV3 with a much larger ResNet-101 backbone.

ADE20K is a large scene parsing dataset with 150 object and stuff classes containing 20K training, 2K validation, and 3K test images. All networks are trained on the training set for 120 epochs and evaluated on the validation set. Table 6 shows the resulting pixel accuracy (pixAcc) and mean intersection-over-union (mIoU). The performance of the DeepLabV3 models is dramatically improved by employing our ResNeSt backbone. Analogous to previous results, the DeepLabV3 model using our ResNeSt-50 backbone already outperforms DeepLabV3 using a deeper ResNet-101 backbone. DeepLabV3 with a ResNeSt-101 backbone achieves 82.07% pixAcc and 46.91% mIoU, which, to our knowledge, is the best single-model result that has been presented for ADE20K.
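For readers less familiar with the two metrics, the following sketch computes pixAcc and mIoU from a class confusion matrix. It is a generic illustration rather than the evaluation code behind Table 6: the function name is hypothetical, and classes that appear in neither the ground truth nor the predictions are assumed to be skipped when averaging IoU.

```python
def pixel_accuracy_and_miou(confusion):
    """Compute (pixAcc, mIoU) from a square confusion matrix.

    confusion[i][j] counts pixels whose ground-truth class is i and whose
    predicted class is j.
    """
    num_classes = len(confusion)
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(num_classes))
    ious = []
    for i in range(num_classes):
        tp = confusion[i][i]
        fn = sum(confusion[i]) - tp
        fp = sum(confusion[j][i] for j in range(num_classes)) - tp
        union = tp + fp + fn
        if union > 0:  # skip classes absent from both labels and predictions
            ious.append(tp / union)
    return correct / total, sum(ious) / len(ious)
```

Pixel accuracy weights every pixel equally, whereas mIoU averages over classes, so rare classes influence mIoU far more than pixAcc.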

Conclusion

This work proposes the ResNeSt architecture, which combines channel-wise attention with multi-path representation in a single unified Split-Attention block. The model universally improves the learned feature representations to boost performance across image classification, object detection, instance segmentation and semantic segmentation. Our Split-Attention block is easy to work with (i.e., a drop-in replacement for a standard residual block), computationally efficient (i.e., 32% less latency than EfficientNet-B7 but with better accuracy), and transfers well. We believe ResNeSt can have an impact across multiple vision tasks, as it has already been adopted by multiple winning entries in the 2020 COCO-LVIS challenge and the 2020 DAVIS-VOS challenge.

References

Appendix

We investigate the effect of the backbone on the pose estimation task. The baseline model is SimplePose with ResNet50 and ResNet101 as implemented in GluonCV. For comparison, we replace the backbone with ResNeSt50 and ResNeSt101 respectively while keeping other settings unchanged. The input image size is fixed to 256x192 for all runs. We use the Adam optimizer with batch size 32 and an initial learning rate of 0.001 with no weight decay. The learning rate is divided by 10 at the 90th and 120th epochs. The experiments are conducted on the COCO Keypoints dataset, and we report the OKS AP for results without and with flip test. Flip test first makes predictions on both the original and horizontally flipped images, and then averages the predicted keypoint coordinates as the final output.
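The flip test can be sketched as follows. This is a simplified illustration operating directly on keypoint coordinates; it assumes that x coordinates mirror as `image_width - 1 - x` and that left/right joints must be swapped back after mirroring. The function name and the `flip_pairs` argument are hypothetical.

```python
def flip_test_average(kps_orig, kps_flipped, image_width, flip_pairs):
    """Average keypoints predicted on an image and on its horizontal flip.

    kps_orig and kps_flipped are lists of (x, y) per joint; flip_pairs lists
    (left_index, right_index) joint pairs that exchange roles under flipping.
    """
    # Map predictions from the flipped image back to the original frame.
    restored = [(image_width - 1 - x, y) for x, y in kps_flipped]
    # A left joint in the flipped image corresponds to a right joint in the
    # original image, so swap each symmetric pair back.
    for left, right in flip_pairs:
        restored[left], restored[right] = restored[right], restored[left]
    return [
        ((x0 + x1) / 2.0, (y0 + y1) / 2.0)
        for (x0, y0), (x1, y1) in zip(kps_orig, restored)
    ]
```

When the two predictions agree perfectly after un-flipping, the average simply reproduces the original keypoints; otherwise it smooths out left/right asymmetries in the model's predictions.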

From Table 7, we see that models with ResNeSt50/ResNeSt101 backbones significantly outperform their ResNet counterparts. Moreover, with a ResNeSt50 backbone the model achieves performance similar to that with a ResNet101 backbone.

Object Detection and Instance Segmentation

For object detection, we add deformable convolution to our Cascade-RCNN model with a ResNeSt-101 backbone and train the model on the MS-COCO training set with a 1x schedule. The resulting model achieves 49.2% mAP on the COCO test-dev set, which surpasses all previous methods, including those employing multi-scale evaluation. Detailed results are shown in Table 9.

We include more results of instance segmentation, shown in Table 10, from the models trained with 1x/3x learning rate schedules and with/without SyncBN. All results are reported on the COCO val dataset. For both 50/101-layer settings, our ResNeSt backbones still outperform the corresponding baselines with different lr schedules. As in Table 6 of the main text, our ResNeSt50 also exceeds the result of the standard ResNet101.

We also evaluate our ResNeSt with and without deformable convolution v2. With its help, we are able to obtain a higher performance, shown in Table 11. It indicates our designed module is compatible with deformable convolution.