Squeeze-and-Excitation Networks

Jie Hu, Li Shen, Samuel Albanie, Gang Sun, Enhua Wu

Introduction

Convolutional neural networks (CNNs) have proven to be useful models for tackling a wide range of visual tasks. At each convolutional layer in the network, a collection of filters expresses neighbourhood spatial connectivity patterns along input channels, fusing spatial and channel-wise information together within local receptive fields. By interleaving a series of convolutional layers with non-linear activation functions and downsampling operators, CNNs are able to produce image representations that capture hierarchical patterns and attain global theoretical receptive fields. A central theme of computer vision research is the search for more powerful representations that capture only those properties of an image that are most salient for a given task, enabling improved performance. Because CNNs are a widely-used family of models for vision tasks, the development of new neural network architecture designs now represents a key frontier in this search. Recent research has shown that the representations produced by CNNs can be strengthened by integrating learning mechanisms into the network that help capture spatial correlations between features. One such approach, popularised by the Inception family of architectures, incorporates multi-scale processes into network modules to achieve improved performance. Further work has sought to better model spatial dependencies and incorporate spatial attention into the structure of the network.

In this paper, we investigate a different aspect of network design - the relationship between channels. We introduce a new architectural unit, which we term the Squeeze-and-Excitation (SE) block, with the goal of improving the quality of representations produced by a network by explicitly modelling the interdependencies between the channels of its convolutional features. To this end, we propose a mechanism that allows the network to perform feature recalibration, through which it can learn to use global information to selectively emphasise informative features and suppress less useful ones.

It is possible to construct an SE network (SENet) by simply stacking a collection of SE blocks. Moreover, these SE blocks can also be used as a drop-in replacement for the original block at a range of depths in the network architecture (Section 6.4). While the template for the building block is generic, the role it performs at different depths differs throughout the network. In earlier layers, it excites informative features in a class-agnostic manner, strengthening the shared low-level representations. In later layers, the SE blocks become increasingly specialised, and respond to different inputs in a highly class-specific manner (Section 7.2). As a consequence, the benefits of the feature recalibration performed by SE blocks can be accumulated through the network.

The design and development of new CNN architectures is a difficult engineering task, typically requiring the selection of many new hyperparameters and layer configurations. By contrast, the structure of the SE block is simple and can be used directly in existing state-of-the-art architectures by replacing components with their SE counterparts, where the performance can be effectively enhanced. SE blocks are also computationally lightweight and impose only a slight increase in model complexity and computational burden.

To provide evidence for these claims, we develop several SENets and conduct an extensive evaluation on the ImageNet dataset. We also present results beyond ImageNet that indicate that the benefits of our approach are not restricted to a specific dataset or task. By making use of SENets, we ranked first in the ILSVRC 2017 classification competition. Our best model ensemble achieves a 2.251% top-5 error on the test set (http://image-net.org/challenges/LSVRC/2017/results). This represents roughly a 25% relative improvement when compared to the winning entry of the previous year (top-5 error of 2.991%).

Related Work

Deeper architectures. VGGNets and Inception models showed that increasing the depth of a network could significantly increase the quality of representations that it was capable of learning. By regulating the distribution of the inputs to each layer, Batch Normalization (BN) added stability to the learning process in deep networks and produced smoother optimisation surfaces. Building on these works, ResNets demonstrated that it was possible to learn considerably deeper and stronger networks through the use of identity-based skip connections. Highway networks introduced a gating mechanism to regulate the flow of information along shortcut connections. Following these works, there have been further reformulations of the connections between network layers, which show promising improvements to the learning and representational properties of deep networks.

An alternative, but closely related line of research has focused on methods to improve the functional form of the computational elements contained within a network. Grouped convolutions have proven to be a popular approach for increasing the cardinality of learned transformations. More flexible compositions of operators can be achieved with multi-branch convolutions, which can be viewed as a natural extension of the grouping operator. In prior work, cross-channel correlations are typically mapped as new combinations of features, either independently of spatial structure or jointly by using standard convolutional filters with 1×1 convolutions. Much of this research has concentrated on the objective of reducing model and computational complexity, reflecting an assumption that channel relationships can be formulated as a composition of instance-agnostic functions with local receptive fields. In contrast, we claim that providing the unit with a mechanism to explicitly model dynamic, non-linear dependencies between channels using global information can ease the learning process, and significantly enhance the representational power of the network.

Algorithmic Architecture Search. Alongside the works described above, there is also a rich history of research that aims to forgo manual architecture design and instead seeks to learn the structure of the network automatically. Much of the early work in this domain was conducted in the neuro-evolution community, which established methods for searching across network topologies with evolutionary methods. While often computationally demanding, evolutionary search has had notable successes which include finding good memory cells for sequence models and learning sophisticated architectures for large-scale image classification. With the goal of reducing the computational burden of these methods, efficient alternatives to this approach have been proposed based on Lamarckian inheritance and differentiable architecture search.

By formulating architecture search as hyperparameter optimisation, random search and other more sophisticated model-based optimisation techniques can also be used to tackle the problem. Topology selection as a path through a fabric of possible designs and direct architecture prediction have been proposed as additional viable architecture search tools. Particularly strong results have been achieved with techniques from reinforcement learning. SE blocks can be used as atomic building blocks for these search algorithms, and were demonstrated to be highly effective in this capacity in concurrent work.

Attention and gating mechanisms. Attention can be interpreted as a means of biasing the allocation of available computational resources towards the most informative components of a signal. Attention mechanisms have demonstrated their utility across many tasks including sequence learning, localisation and understanding in images, image captioning and lip reading. In these applications, it can be incorporated as an operator following one or more layers representing higher-level abstractions for adaptation between modalities. Some works provide interesting studies into the combined use of spatial and channel attention. Wang et al. introduced a powerful trunk-and-mask attention mechanism based on hourglass modules that is inserted between the intermediate stages of deep residual networks. By contrast, our proposed SE block comprises a lightweight gating mechanism which focuses on enhancing the representational power of the network by modelling channel-wise relationships in a computationally efficient manner.

Squeeze-and-Excitation Blocks

In order to tackle the issue of exploiting channel dependencies, we first consider the signal to each channel in the output features. Each of the learned filters operates with a local receptive field and consequently each unit of the transformation output U is unable to exploit contextual information outside of this region.

Discussion. The output of the transformation U can be interpreted as a collection of the local descriptors whose statistics are expressive for the whole image. Exploiting such information is prevalent in prior feature engineering work. We opt for the simplest aggregation technique, global average pooling, noting that more sophisticated strategies could be employed here as well.

3.2 Excitation: Adaptive Recalibration

To make use of the information aggregated in the squeeze operation, we follow it with a second operation which aims to fully capture channel-wise dependencies. To fulfil this objective, the function must meet two criteria: first, it must be flexible (in particular, it must be capable of learning a nonlinear interaction between channels) and second, it must learn a non-mutually-exclusive relationship since we would like to ensure that multiple channels are allowed to be emphasised (rather than enforcing a one-hot activation). To meet these criteria, we opt to employ a simple gating mechanism with a sigmoid activation:

Discussion. The excitation operator maps the input-specific descriptor z to a set of channel weights. In this regard, SE blocks intrinsically introduce dynamics conditioned on the input, which can be regarded as a self-attention function on channels whose relationships are not confined to the local receptive field the convolutional filters are responsive to.
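The squeeze, excitation and rescaling steps described above can be sketched in a few lines of NumPy. This is an illustrative sketch rather than the authors' implementation: the function name `se_block`, the weight names `w1` and `w2`, the ReLU between the two FC layers and the omission of biases are assumptions made for the example, as are the random inputs.

```python
import numpy as np

def se_block(u, w1, w2):
    """Recalibrate a feature map u of shape (C, H, W).

    w1 has shape (C // r, C) and w2 has shape (C, C // r).
    Biases are omitted (an assumption of this sketch).
    """
    # Squeeze: global average pooling gives one descriptor per channel.
    z = u.mean(axis=(1, 2))                      # shape (C,)
    # Excitation: bottleneck FC, ReLU (assumed), FC, then a sigmoid gate.
    hidden = np.maximum(w1 @ z, 0.0)             # shape (C // r,)
    s = 1.0 / (1.0 + np.exp(-(w2 @ hidden)))     # shape (C,)
    # Scale: reweight every channel by its gate value.
    return u * s[:, None, None], s

rng = np.random.default_rng(0)
C, r = 64, 16                                    # r is the reduction ratio
u = rng.standard_normal((C, 7, 7))
w1 = rng.standard_normal((C // r, C)) * 0.1
w2 = rng.standard_normal((C, C // r)) * 0.1
out, gates = se_block(u, w1, w2)
print(out.shape, float(gates.min()), float(gates.max()))
```

Because the sigmoid is applied to each channel independently, several channels can be emphasised at once, which is the non-mutually-exclusive behaviour the text asks for.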

3.3 Instantiations

The SE block can be integrated into standard architectures such as VGGNet by insertion after the non-linearity following each convolution. Moreover, the flexibility of the SE block means that it can be directly applied to transformations beyond standard convolutions. To illustrate this point, we develop SENets by incorporating SE blocks into several examples of more complex architectures, described next.

We first consider the construction of SE blocks for Inception networks. Here, we simply take the transformation F_tr to be an entire Inception module (see Fig. 2) and by making this change for each such module in the architecture, we obtain an SE-Inception network. SE blocks can also be used directly with residual networks (Fig. 3 depicts the schema of an SE-ResNet module). Here, the SE block transformation F_tr is taken to be the non-identity branch of a residual module. Squeeze and Excitation both act before summation with the identity branch. Further variants that integrate SE blocks with ResNeXt, Inception-ResNet, MobileNet and ShuffleNet can be constructed by following similar schemes. For concrete examples of SENet architectures, a detailed description of SE-ResNet-50 and SE-ResNeXt-50 is given in Table I.
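As a rough illustration of the SE-ResNet wiring described above, the sketch below applies the gate to the output of the non-identity branch and only then adds the identity branch. The names `se_gate` and `se_residual_module`, the stand-in branch `toy_branch`, the ReLU between the FC layers and the ReLU after the summation are assumptions of this example, not details taken from the text.

```python
import numpy as np

def se_gate(u, w1, w2):
    """Per-channel gate in (0, 1) computed from a (C, H, W) feature map."""
    z = u.mean(axis=(1, 2))                       # squeeze
    hidden = np.maximum(w1 @ z, 0.0)              # FC + ReLU (assumed)
    return 1.0 / (1.0 + np.exp(-(w2 @ hidden)))   # FC + sigmoid

def se_residual_module(x, f_tr, w1, w2):
    """Recalibrate the residual branch, then sum with the identity branch."""
    u = f_tr(x)                                   # non-identity branch
    s = se_gate(u, w1, w2)
    recalibrated = u * s[:, None, None]
    # Final ReLU after the summation is an assumption of this sketch.
    return np.maximum(x + recalibrated, 0.0)

rng = np.random.default_rng(0)
C, r = 32, 16
x = rng.standard_normal((C, 8, 8))
w1 = rng.standard_normal((C // r, C)) * 0.1
w2 = rng.standard_normal((C, C // r)) * 0.1
toy_branch = lambda t: 0.5 * t                    # hypothetical stand-in for F_tr
y = se_residual_module(x, toy_branch, w1, w2)
print(y.shape)
```

Swapping `toy_branch` for an entire Inception module would give the SE-Inception arrangement; the identity path is untouched in either case.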

One consequence of the flexible nature of the SE block is that there are several viable ways in which it could be integrated into these architectures. Therefore, to assess sensitivity to the integration strategy used to incorporate SE blocks into a network architecture, we also provide ablation experiments exploring different designs for block inclusion in Section 6.5.

Model and Computational Complexity

For the proposed SE block design to be of practical use, it must offer a good trade-off between improved performance and increased model complexity. To illustrate the computational burden associated with the module, we consider a comparison between ResNet-50 and SE-ResNet-50 as an example. ResNet-50 requires ~3.86 GFLOPs in a single forward pass for a 224×224 pixel input image. Each SE block makes use of a global average pooling operation in the squeeze phase and two small FC layers in the excitation phase, followed by an inexpensive channel-wise scaling operation. In the aggregate, when setting the reduction ratio r (introduced in Section 3.2) to 16, SE-ResNet-50 requires ~3.87 GFLOPs, corresponding to a 0.26% relative increase over the original ResNet-50. In exchange for this slight additional computational burden, the accuracy of SE-ResNet-50 surpasses that of ResNet-50 and indeed, approaches that of a deeper ResNet-101 network requiring ~7.58 GFLOPs (Table II).

In practical terms, a single pass forwards and backwards through ResNet-50 takes 190 ms, compared to 209 ms for SE-ResNet-50 with a training minibatch of 256 images (both timings are performed on a server with 8 NVIDIA Titan X GPUs). We suggest that this represents a reasonable runtime overhead, which may be further reduced as global pooling and small inner-product operations receive further optimisation in popular GPU libraries. Due to its importance for embedded device applications, we further benchmark CPU inference time for each model: for a 224×224 pixel input image, ResNet-50 takes 164 ms in comparison to 167 ms for SE-ResNet-50. We believe that the small additional computational cost incurred by the SE block is justified by its contribution to model performance.
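The relative overheads implied by these figures can be checked with simple arithmetic. The helper name `relative_increase` is hypothetical; the inputs are the GFLOP counts and timings quoted above.

```python
def relative_increase(baseline, new):
    """Percentage increase of `new` over `baseline`."""
    return 100.0 * (new - baseline) / baseline

flops_overhead = relative_increase(3.86, 3.87)    # GFLOPs, forward pass
train_overhead = relative_increase(190.0, 209.0)  # ms, forward + backward
cpu_overhead = relative_increase(164.0, 167.0)    # ms, CPU inference

print(f"FLOPs: {flops_overhead:.2f}%")
print(f"GPU training step: {train_overhead:.1f}%")
print(f"CPU inference: {cpu_overhead:.1f}%")
```

On these numbers the measured training-time overhead is noticeably larger than the FLOP overhead, which is consistent with the remark that global pooling and small inner-product operations may still benefit from further optimisation in GPU libraries.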

We next consider the additional parameters introduced by the proposed SE block. These additional parameters result solely from the two FC layers of the gating mechanism and therefore constitute a small fraction of the total network capacity. Concretely, the total number introduced by the weight parameters of these FC layers is given by:

where r denotes the reduction ratio, S refers to the number of stages (a stage refers to the collection of blocks operating on feature maps of a common spatial dimension), C_s denotes the dimension of the output channels and N_s denotes the number of repeated blocks for stage s (when bias terms are used in FC layers, the introduced parameters and computational cost are typically negligible). SE-ResNet-50 introduces ~2.5 million additional parameters beyond the ~25 million parameters required by ResNet-50, corresponding to a ~10% increase. In practice, the majority of these parameters come from the final stage of the network, where the excitation operation is performed across the greatest number of channels. However, we found that this comparatively costly final stage of SE blocks could be removed at only a small cost in performance (<0.1% top-5 error on ImageNet), reducing the relative parameter increase to ~4%, which may prove useful in cases where parameter usage is a key consideration (see Sections 6.4 and 7.2 for further discussion).
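The parameter expression itself does not appear in this text. Assuming each SE block contains two bias-free FC layers of shapes C_s × C_s/r and C_s/r × C_s, the count would be (2/r) · Σ_s N_s · C_s². The sketch below evaluates that assumed form for the standard ResNet-50 stage layout (3, 4, 6 and 3 blocks with 256, 512, 1024 and 2048 output channels); the function name `se_extra_params` is hypothetical.

```python
def se_extra_params(channels, blocks, r):
    """Weights added by SE blocks: two FC layers of C * (C / r) weights per block."""
    return sum(2 * n * c * c // r for c, n in zip(channels, blocks))

# Standard ResNet-50 stage layout: output channels and block counts per stage.
channels = [256, 512, 1024, 2048]
blocks = [3, 4, 6, 3]

total = se_extra_params(channels, blocks, r=16)
without_last = se_extra_params(channels[:-1], blocks[:-1], r=16)

print(f"all stages:          {total:,} extra weights")
print(f"final stage removed: {without_last:,} extra weights")
print(f"share from final stage: {100 * (total - without_last) / total:.1f}%")
```

Under these assumptions the total comes to about 2.5 million weights, and dropping the final stage leaves a little under 1 million, in line with the ~10% and ~4% relative increases quoted above.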

Experiments

In this section, we conduct experiments to investigate the effectiveness of SE blocks across a range of tasks, datasets and model architectures.

To evaluate the influence of SE blocks, we first perform experiments on the ImageNet 2012 dataset which comprises 1.28 million training images and 50K validation images from 1000 different classes. We train networks on the training set and report the top-1 and top-5 error on the validation set.

Each baseline network architecture and its corresponding SE counterpart are trained with identical optimisation schemes. We follow standard practices and perform data augmentation with random cropping using scale and aspect ratio to a size of 224×224 pixels (or 299×299 for Inception-ResNet-v2 and SE-Inception-ResNet-v2) and perform random horizontal flipping. Each input image is normalised through mean RGB-channel subtraction. All models are trained on our distributed learning system ROCS which is designed to handle efficient parallel training of large networks. Optimisation is performed using synchronous SGD with momentum 0.9 and a minibatch size of 1024. The initial learning rate is set to 0.6 and decreased by a factor of 10 every 30 epochs. Models are trained for 100 epochs from scratch, using the weight initialisation strategy described in . The reduction ratio r (in Section 3.2) is set to 16 by default (except where stated otherwise).

When evaluating the models we apply centre-cropping so that 224×224 pixels are cropped from each image, after its shorter edge is first resized to 256 (299×299 from each image whose shorter edge is first resized to 352 for Inception-ResNet-v2 and SE-Inception-ResNet-v2).
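The step schedule described above (initial rate 0.6, divided by 10 every 30 epochs over 100 epochs) can be written as a small function. This is a sketch of the schedule only; the name `step_lr` and its parameter names are hypothetical, and details such as warm-up are not mentioned in the text and are not modelled.

```python
def step_lr(epoch, base_lr=0.6, decay=10.0, step=30):
    """Learning rate for a zero-indexed epoch under a step decay schedule."""
    return base_lr / (decay ** (epoch // step))

# Print the rate at the first epoch of each plateau and at the last epoch.
for epoch in (0, 30, 60, 90, 99):
    print(epoch, step_lr(epoch))
```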
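The cropping step of this evaluation protocol can be sketched as follows. The example assumes the image has already been resized so that its shorter edge is 256 pixels (resizing needs an image library and is not shown); the function name `centre_crop` and the dummy image are hypothetical.

```python
import numpy as np

def centre_crop(image, size):
    """Return the central size x size window of an (H, W, ...) array."""
    h, w = image.shape[:2]
    top = (h - size) // 2
    left = (w - size) // 2
    return image[top:top + size, left:left + size]

# Dummy landscape image whose shorter edge is already 256 pixels.
resized = np.zeros((256, 341, 3), dtype=np.uint8)
crop = centre_crop(resized, 224)
print(crop.shape)
```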

Network depth. We begin by comparing SE-ResNet against ResNet architectures with different depths and report the results in Table II. We observe that SE blocks consistently improve performance across different depths with an extremely small increase in computational complexity. Remarkably, SE-ResNet-50 achieves a single-crop top-5 validation error of 6.62%, improving on ResNet-50 (7.48%) by 0.86% and approaching the performance achieved by the much deeper ResNet-101 network (6.52% top-5 error) with only half of the total computational burden (3.87 GFLOPs vs. 7.58 GFLOPs). This pattern is repeated at greater depth, where SE-ResNet-101 (6.07% top-5 error) not only matches, but outperforms the deeper ResNet-152 network (6.34% top-5 error) by 0.27%. While it should be noted that the SE blocks themselves add depth, they do so in an extremely computationally efficient manner and yield good returns even at the point at which extending the depth of the base architecture achieves diminishing returns. Moreover, we see that the gains are consistent across a range of different network depths, suggesting that the improvements induced by SE blocks may be complementary to those obtained by simply increasing the depth of the base architecture.

Integration with modern architectures. We next study the effect of integrating SE blocks with two further state-of-the-art architectures, Inception-ResNet-v2 and ResNeXt (using the setting of 32×4d), both of which introduce additional computational building blocks into the base network. We construct SENet equivalents of these networks, SE-Inception-ResNet-v2 and SE-ResNeXt (the configuration of SE-ResNeXt-50 is given in Table I) and report results in Table II. As with the previous experiments, we observe significant performance improvements induced by the introduction of SE blocks into both architectures. In particular, SE-ResNeXt-50 has a top-5 error of 5.49% which is superior to both its direct counterpart ResNeXt-50 (5.90% top-5 error) as well as the deeper ResNeXt-101 (5.57% top-5 error), a model which has almost twice the total number of parameters and computational overhead. We note a slight difference in performance between our re-implementation of Inception-ResNet-v2 and the result reported in . However, we observe a similar trend with regard to the effect of SE blocks, finding that the SE counterpart (4.79% top-5 error) outperforms our reimplemented Inception-ResNet-v2 baseline (5.21% top-5 error) by 0.42% as well as the reported result in .

We also assess the effect of SE blocks when operating on non-residual networks by conducting experiments with the VGG-16 and BN-Inception architectures. To facilitate the training of VGG-16 from scratch, we add Batch Normalization layers after each convolution. We use identical training schemes for both VGG-16 and SE-VGG-16. The results of the comparison are shown in Table II. Similarly to the results reported for the residual baseline architectures, we observe that SE blocks bring improvements in performance on the non-residual settings.

To provide some insight into the influence of SE blocks on the optimisation of these models, example training curves for runs of the baseline architectures and their respective SE counterparts are depicted in Fig. 4. We observe that SE blocks yield a steady improvement throughout the optimisation procedure. Moreover, this trend is fairly consistent across a range of network architectures considered as baselines.

Mobile setting. Finally, we consider two representative architectures from the class of mobile-optimised networks, MobileNet and ShuffleNet. For these experiments, we used a minibatch size of 256 and slightly less aggressive data augmentation and regularisation as in . We trained the models across 8 GPUs using SGD with momentum (set to 0.9) and an initial learning rate of 0.1 which was reduced by a factor of 10 each time the validation loss plateaued. The total training process required ~400 epochs (enabling us to reproduce the baseline performance of ). The results reported in Table III show that SE blocks consistently improve the accuracy by a large margin at a minimal increase in computational cost.

Additional datasets. We next investigate whether the benefits of SE blocks generalise to datasets beyond ImageNet. We perform experiments with several popular baseline architectures and techniques (ResNet-110, ResNet-164, WideResNet-16-8, Shake-Shake and Cutout) on the CIFAR-10 and CIFAR-100 datasets. These comprise a collection of 50k training and 10k test 32×32 pixel RGB images, labelled with 10 and 100 classes respectively. The integration of SE blocks into these networks follows the same approach that was described in Section 3.3. Each baseline and its SENet counterpart are trained with standard data augmentation strategies. During training, images are randomly horizontally flipped and zero-padded on each side with four pixels before taking a random 32×32 crop. Mean and standard deviation normalisation is also applied. The settings of the training hyperparameters (e.g. minibatch size, initial learning rate, weight decay) match those suggested by the original papers. We report the performance of each baseline and its SENet counterpart on CIFAR-10 in Table IV and performance on CIFAR-100 in Table V. We observe that in every comparison SENets outperform the baseline architectures, suggesting that the benefits of SE blocks are not confined to the ImageNet dataset.
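The CIFAR training augmentation described above (random horizontal flip, four pixels of zero padding per side, random 32×32 crop) can be sketched in NumPy. The function name `augment`, the flip probability of 0.5 and the dummy image are assumptions of this example; the mean and standard deviation normalisation step is not shown.

```python
import numpy as np

def augment(image, rng, pad=4, crop=32):
    """Flip, zero-pad and randomly crop an (H, W, 3) image."""
    # Random horizontal flip; a probability of 0.5 is assumed here.
    if rng.random() < 0.5:
        image = image[:, ::-1, :]
    # Zero-pad each spatial side by `pad` pixels (32x32 becomes 40x40).
    padded = np.pad(image, ((pad, pad), (pad, pad), (0, 0)), mode="constant")
    # Pick the top-left corner of a random crop inside the padded image.
    top = rng.integers(0, 2 * pad + 1)
    left = rng.integers(0, 2 * pad + 1)
    return padded[top:top + crop, left:left + crop, :]

rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(32, 32, 3), dtype=np.uint8)
sample = augment(image, rng)
print(sample.shape, sample.dtype)
```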

5.2 Scene Classification

We also conduct experiments on the Places365-Challenge dataset for scene classification. This dataset comprises 8 million training images and 36,500 validation images across 365 categories. Relative to classification, the task of scene understanding offers an alternative assessment of a model's ability to generalise well and handle abstraction. This is because it often requires the model to handle more complex data associations and to be robust to a greater level of appearance variation.

We opted to use ResNet-152 as a strong baseline to assess the effectiveness of SE blocks and follow the training and evaluation protocols described in . In these experiments, models are trained from scratch. We report the results in Table VI, comparing also with prior work. We observe that SE-ResNet-152 (11.01% top-5 error) achieves a lower validation error than ResNet-152 (11.61% top-5 error), providing evidence that SE blocks can also yield improvements for scene classification. This SENet surpasses the previous state-of-the-art model Places-365-CNN which has a top-5 error of 11.48% on this task.

5.3 Object Detection on COCO

We further assess the generalisation of SE blocks on the task of object detection using the COCO dataset. As in previous work, we use the minival protocol, i.e., training the models on the union of the 80k training set and a 35k val subset and evaluating on the remaining 5k val subset. Weights are initialised by the parameters of the model trained on the ImageNet dataset. We use the Faster R-CNN detection framework as the basis for evaluating our models and follow the hyperparameter setting described in (i.e., end-to-end training with the '2x' learning schedule). Our goal is to evaluate the effect of replacing the trunk architecture (ResNet) in the object detector with SE-ResNet, so that any changes in performance can be attributed to better representations. Table VII reports the validation set performance of the object detector using ResNet-50, ResNet-101 and their SE counterparts as trunk architectures. SE-ResNet-50 outperforms ResNet-50 by 2.4% (a relative 6.3% improvement) on COCO's standard AP metric and by 3.1% on AP@IoU=0.5. SE blocks also benefit the deeper ResNet-101 architecture, achieving a 2.0% improvement (5.0% relative improvement) on the AP metric. In summary, this set of experiments demonstrates the generalisability of SE blocks. The induced improvements can be realised across a broad range of architectures, tasks and datasets.

5.4 ILSVRC 2017 Classification Competition

SENets formed the foundation of our submission to the ILSVRC competition where we achieved first place. Our winning entry comprised a small ensemble of SENets that employed a standard multi-scale and multi-crop fusion strategy to obtain a top-5 error of 2.251% on the test set. As part of this submission, we constructed an additional model, SENet-154, by integrating SE blocks with a modified ResNeXt (the details of the architecture are provided in the Appendix). We compare this model with prior work on the ImageNet validation set in Table VIII using standard crop sizes (224×224 and 320×320). We observe that SENet-154 achieves a top-1 error of 18.68% and a top-5 error of 4.47% using a 224×224 centre crop evaluation, which represents the strongest reported result.

Following the challenge there has been a great deal of further progress on the ImageNet benchmark. For comparison, we include the strongest results that we are currently aware of in Table IX. The best performance using only ImageNet data was recently reported by . This method uses reinforcement learning to develop new policies for data augmentation during training to improve the performance of the architecture searched by . The best overall performance was reported by using a ResNeXt-101 32×48d architecture. This was achieved by pretraining their model on approximately one billion weakly labelled images and finetuning on ImageNet. The improvements yielded by more sophisticated data augmentation and extensive pretraining may be complementary to our proposed changes to the network architecture.

Ablation Study

In this section we conduct ablation experiments to gain a better understanding of the effect of using different configurations on components of the SE blocks. All ablation experiments are performed on the ImageNet dataset on a single machine (with 8 GPUs). ResNet-50 is used as the backbone architecture. We found empirically that on ResNet architectures, removing the biases of the FC layers in the excitation operation facilitates the modelling of channel dependencies, and use this configuration in the following experiments. The data augmentation strategy follows the approach described in Section 5.1. To allow us to study the upper limit of performance for each variant, the learning rate is initialised to 0.1 and training continues until the validation loss plateaus (~300 epochs in total). The learning rate is then reduced by a factor of 10 and then this process is repeated (three times in total). Label-smoothing regularisation is used during training. For reference, training with a 270 epoch fixed schedule (reducing the learning rate at 125, 200 and 250 epochs) achieves top-1 and top-5 error rates for ResNet-50 and SE-ResNet-50 of (23.21%, 6.53%) and (22.20%, 6.00%) respectively.

The reduction ratio r introduced in Eqn. 5 is a hyperparameter which allows us to vary the capacity and computational cost of the SE blocks in the network. To investigate the trade-off between performance and computational cost mediated by this hyperparameter, we conduct experiments with SE-ResNet-50 for a range of different r values. The comparison in Table X shows that performance is robust to a range of reduction ratios. Increased complexity does not improve performance monotonically while a smaller ratio dramatically increases the parameter size of the model. Setting r=16 achieves a good balance between accuracy and complexity. In practice, using an identical ratio throughout a network may not be optimal (due to the distinct roles performed by different layers), so further improvements may be achievable by tuning the ratios to meet the needs of a given base architecture.

6.2 Squeeze Operator

We examine the significance of using global average pooling as opposed to global max pooling as our choice of squeeze operator (since this worked well, we did not consider more sophisticated alternatives). The results are reported in Table XI. While both max and average pooling are effective, average pooling achieves slightly better performance, justifying its selection as the basis of the squeeze operation. However, we note that the performance of SE blocks is fairly robust to the choice of specific aggregation operator.
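The two squeeze operators compared here differ only in the reduction applied over the spatial dimensions. A minimal sketch, with the function name `squeeze` and the `mode` argument being hypothetical:

```python
import numpy as np

def squeeze(u, mode="avg"):
    """Collapse a (C, H, W) feature map to a length-C channel descriptor."""
    if mode == "avg":
        return u.mean(axis=(1, 2))   # global average pooling
    if mode == "max":
        return u.max(axis=(1, 2))    # global max pooling
    raise ValueError(f"unknown squeeze mode: {mode}")

# Two channels of 2x2 activations: values 0..3 and 4..7.
u = np.arange(8.0).reshape(2, 2, 2)
print(squeeze(u, "avg"))
print(squeeze(u, "max"))
```

Either descriptor can be fed to the same excitation layers, so swapping the operator leaves the parameter count unchanged.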

6.3 Excitation Operator

We next assess the choice of non-linearity for the excitation mechanism. We consider two further options: ReLU and tanh, and experiment with replacing the sigmoid with these alternative non-linearities. The results are reported in Table XII. We see that exchanging the sigmoid for tanh slightly worsens performance, while using ReLU is dramatically worse and in fact causes the performance of SE-ResNet-50 to drop below that of the ResNet-50 baseline. This suggests that for the SE block to be effective, careful construction of the excitation operator is important.
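To make the three alternatives concrete, the sketch below swaps only the final non-linearity of the excitation step. The names `GATES` and `excite`, the ReLU between the two FC layers and the random weights are assumptions of this example; it illustrates the output ranges of the gates and does not reproduce the reported accuracy differences.

```python
import numpy as np

# Candidate final non-linearities for the gate.
GATES = {
    "sigmoid": lambda a: 1.0 / (1.0 + np.exp(-a)),
    "tanh": np.tanh,
    "relu": lambda a: np.maximum(a, 0.0),
}

def excite(z, w1, w2, gate="sigmoid"):
    """Map a channel descriptor z of shape (C,) to per-channel weights."""
    hidden = np.maximum(w1 @ z, 0.0)   # bottleneck FC + ReLU (assumed)
    return GATES[gate](w2 @ hidden)

rng = np.random.default_rng(0)
C, r = 64, 16
z = rng.standard_normal(C)
w1 = rng.standard_normal((C // r, C))
w2 = rng.standard_normal((C, C // r))
for name in GATES:
    s = excite(z, w1, w2, gate=name)
    print(f"{name:8s} min={s.min():.3f} max={s.max():.3f}")
```

The sigmoid keeps every weight between 0 and 1, tanh can flip the sign of a channel, and ReLU is unbounded above while setting some weights exactly to zero.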

6.4 Different stages

We explore the influence of SE blocks at different stages by integrating SE blocks into ResNet-50, one stage at a time. Specifically, we add SE blocks to the intermediate stages: stage_2, stage_3 and stage_4, and report the results in Table XIII. We observe that SE blocks bring performance benefits when introduced at each of these stages of the architecture. Moreover, the gains induced by SE blocks at different stages are complementary, in the sense that they can be combined effectively to further bolster network performance.

6.5 Integration strategy

Finally, we perform an ablation study to assess the influence of the location of the SE block when integrating it into existing architectures. In addition to the proposed SE design, we consider three variants: (1) SE-PRE block, in which the SE block is moved before the residual unit; (2) SE-POST block, in which the SE unit is moved after the summation with the identity branch (after ReLU) and (3) SE-Identity block, in which the SE unit is placed on the identity connection in parallel to the residual unit. These variants are illustrated in Figure 5 and the performance of each variant is reported in Table XIV. We observe that the SE-PRE, SE-Identity and proposed SE block each perform similarly well, while usage of the SE-POST block leads to a drop in performance. This experiment suggests that the performance improvements produced by SE units are fairly robust to their location, provided that they are applied prior to branch aggregation.
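The four placements can be compared side by side in a short sketch. The wiring follows the verbal descriptions above; the names `se`, `block` and `toy_unit`, the ReLU inside the gate and the position of the final ReLU in the non-POST variants are assumptions of this example.

```python
import numpy as np

def relu(a):
    return np.maximum(a, 0.0)

def se(u, w1, w2):
    """Squeeze, excite and rescale a (C, H, W) feature map."""
    z = u.mean(axis=(1, 2))
    s = 1.0 / (1.0 + np.exp(-(w2 @ relu(w1 @ z))))
    return u * s[:, None, None]

def block(x, f, w1, w2, variant="standard"):
    """Residual block with the SE unit placed according to `variant`."""
    if variant == "standard":   # SE on the residual branch, before the sum
        return relu(x + se(f(x), w1, w2))
    if variant == "pre":        # SE before the residual unit
        return relu(x + f(se(x, w1, w2)))
    if variant == "post":       # SE after the sum and ReLU
        return se(relu(x + f(x)), w1, w2)
    if variant == "identity":   # SE on the identity connection
        return relu(se(x, w1, w2) + f(x))
    raise ValueError(f"unknown variant: {variant}")

rng = np.random.default_rng(0)
C, r = 32, 16
x = rng.standard_normal((C, 8, 8))
w1 = rng.standard_normal((C // r, C)) * 0.1
w2 = rng.standard_normal((C, C // r)) * 0.1
toy_unit = lambda t: 0.5 * t    # hypothetical stand-in for the residual unit
for v in ("standard", "pre", "post", "identity"):
    print(v, block(x, toy_unit, w1, w2, variant=v).shape)
```

Only the POST variant rescales the already-aggregated output of both branches, which is the placement the text reports as performing worse.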

In the experiments above, each SE block was placed outside the structure of a residual unit. We also construct a variant of the design which moves the SE block inside the residual unit, placing it directly after the 3×3 convolutional layer. Since the 3×3 convolutional layer possesses fewer channels, the number of parameters introduced by the corresponding SE block is also reduced. The comparison in Table XV shows that the SE_3×3 variant achieves comparable classification accuracy with fewer parameters than the standard SE block. Although it is beyond the scope of this work, we anticipate that further efficiency gains will be achievable by tailoring SE block usage for specific architectures.

Role of SE blocks

Although the proposed SE block has been shown to improve network performance on multiple visual tasks, we would also like to understand the relative importance of the squeeze operation and how the excitation mechanism operates in practice. A rigorous theoretical analysis of the representations learned by deep neural networks remains challenging; we therefore take an empirical approach to examining the role played by the SE block, with the goal of attaining at least a primitive understanding of its practical function.

To assess whether the global embedding produced by the squeeze operation plays an important role in performance, we experiment with a variant of the SE block that adds an equal number of parameters, but does not perform global average pooling. Specifically, we remove the pooling operation and replace the two FC layers in the excitation operator with corresponding 1×1 convolutions with identical channel dimensions. In this variant, which we call NoSqueeze, the excitation output maintains the same spatial dimensions as the input. In contrast to the SE block, these point-wise convolutions can only remap the channels as a function of the output of a local operator. While in practice the later layers of a deep network will typically possess a (theoretical) global receptive field, global embeddings are no longer directly accessible throughout the network in the NoSqueeze variant. The accuracy and computational complexity of both models are compared to a standard ResNet-50 model in Table XVI. We observe that the use of global information has a significant influence on the model performance, underlining the importance of the squeeze operation. Moreover, in comparison to the NoSqueeze design, the SE block allows this global information to be used in a computationally parsimonious manner.
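The difference between the two designs can be made concrete with a small sketch. It is an illustration under simplifying assumptions rather than the experimental code: feature maps are lists of channels holding flat lists of spatial values, the same weight matrices `w1` and `w2` serve as FC layers in one case and as 1×1 convolutions in the other, and all function names are hypothetical.

```python
import math

def excite(v, w1, w2):
    """Bottleneck gate: linear map, ReLU, linear map, sigmoid."""
    h = [max(0.0, sum(a * b for a, b in zip(row, v))) for row in w1]
    return [1.0 / (1.0 + math.exp(-sum(a * b for a, b in zip(row, h)))) for row in w2]

def se_block(x, w1, w2):
    """One gate vector per image, computed from the globally pooled descriptor."""
    z = [sum(ch) / len(ch) for ch in x]
    s = excite(z, w1, w2)
    return [[g * v for v in ch] for ch, g in zip(x, s)]

def no_squeeze_block(x, w1, w2):
    """One gate vector per spatial position, computed from local values only."""
    out_positions = []
    for p in zip(*x):                      # p holds all channels at one position
        s = excite(p, w1, w2)              # a 1x1 convolution is a per-position linear map
        out_positions.append([g * v for g, v in zip(s, p)])
    return [list(ch) for ch in zip(*out_positions)]

# Toy example (values are assumptions): 2 channels, 2 positions, bottleneck of width 1.
x = [[1.0, 3.0], [2.0, 2.0]]
w1 = [[1.0, 0.0]]
w2 = [[1.0], [-1.0]]
se_out = se_block(x, w1, w2)
ns_out = no_squeeze_block(x, w1, w2)
```

Both blocks hold the same weights, but the gate is evaluated once per image in the SE block and once per spatial position in NoSqueeze, which is where the additional computation of the latter comes from.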

Role of Excitation

To provide a clearer picture of the function of the excitation operator in SE blocks, in this section we study example activations from the SE-ResNet-50 model and examine their distribution with respect to different classes and different input images at various depths in the network. In particular, we would like to understand how excitations vary across images of different classes, and across images within a class.

We first consider the distribution of excitations for different classes. Specifically, we sample four classes from the ImageNet dataset that exhibit semantic and appearance diversity, namely goldfish, pug, plane and cliff (example images from these classes are shown in the Appendix). We then draw fifty samples for each class from the validation set and compute the average activations for fifty uniformly sampled channels in the last SE block of each stage (immediately prior to downsampling) and plot their distribution in Fig. 6. For reference, we also plot the distribution of the mean activations across all of the 1000 classes.
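The per-class statistic described above amounts to a simple average over images. The sketch below assumes that the excitation vectors of one SE block have already been collected as one list of gate values per image, and it interprets "uniformly sampled" as evenly spaced channel indices, which is our assumption; the function names are hypothetical.

```python
def uniform_channel_indices(num_channels, num_sampled=50):
    """Evenly spaced channel indices (one possible reading of uniform sampling)."""
    step = num_channels / num_sampled
    return [int(i * step) for i in range(num_sampled)]

def mean_activation_per_channel(excitations, indices):
    """Average the gate value of each selected channel over a set of images."""
    n = len(excitations)
    return [sum(e[i] for e in excitations) / n for i in indices]

# Toy example (values are assumptions): 2 images of one class, 4 channels, 2 sampled.
class_excitations = [[0.0, 1.0, 0.5, 0.2],
                     [1.0, 1.0, 0.5, 0.4]]
indices = uniform_channel_indices(num_channels=4, num_sampled=2)
class_profile = mean_activation_per_channel(class_excitations, indices)
```

Repeating this for each class, and once more over images from all classes, yields the curves that are compared per SE block.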

We make the following three observations about the role of the excitation operation. First, the distribution across different classes is very similar at the earlier layers of the network, e.g. SE_2_3. This suggests that the importance of feature channels is likely to be shared by different classes in the early stages. The second observation is that at greater depth, the value of each channel becomes much more class-specific as different classes exhibit different preferences for the discriminative value of features, e.g. SE_4_6 and SE_5_1. These observations are consistent with findings in previous work, namely that earlier layer features are typically more general (e.g. class agnostic in the context of the classification task) while later layer features exhibit greater levels of specificity.

Next, we observe a somewhat different phenomenon in the last stage of the network. SE_5_2 exhibits an interesting tendency towards a saturated state in which most of the activations are close to one. At the point at which all activations take the value one, an SE block reduces to the identity operator. At the end of the network, in SE_5_3 (which is immediately followed by global pooling prior to the classifiers), a similar pattern emerges over different classes, up to a modest change in scale (which could be tuned by the classifiers). This suggests that SE_5_2 and SE_5_3 are less important than previous blocks in providing recalibration to the network. This finding is consistent with the result of the empirical investigation in Section 4, which demonstrated that the additional parameter count could be significantly reduced by removing the SE blocks for the last stage with only a marginal loss of performance.
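The reduction to the identity operator follows directly from the channel-wise scaling, as the following small sketch shows. The feature values and the pre-sigmoid logits are hypothetical numbers chosen for illustration, and the helper names are ours.

```python
import math

def apply_gates(x, gates):
    """Channel-wise rescaling performed at the end of an SE block."""
    return [[g * v for v in ch] for ch, g in zip(x, gates)]

def max_deviation_from_identity(gates):
    """Largest distance of any gate from one; zero means the block is the identity."""
    return max(abs(1.0 - g) for g in gates)

x = [[1.0, -2.0], [0.5, 4.0]]

# Gates exactly equal to one leave the feature map untouched.
unchanged = apply_gates(x, [1.0, 1.0])

# Large positive logits (hypothetical values) give gates close to one.
saturated = [1.0 / (1.0 + math.exp(-logit)) for logit in (6.0, 8.0)]
deviation = max_deviation_from_identity(saturated)
```

A deviation measure of this kind is one simple way to flag blocks that contribute little recalibration and are therefore candidates for removal.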

Finally, we show the means and standard deviations of the activations for image instances within the same class for two sample classes (goldfish and plane) in Fig. 7. We observe a trend consistent with the inter-class visualisation, indicating that the dynamic behaviour of SE blocks varies over both classes and instances within a class. Particularly in the later layers of the network, where there is considerable diversity of representation within a single class, the network learns to take advantage of feature recalibration to improve its discriminative performance. In summary, SE blocks produce instance-specific responses which nevertheless function to support the increasingly class-specific needs of the model at different layers in the architecture.

Conclusion

In this paper we proposed the SE block, an architectural unit designed to improve the representational power of a network by enabling it to perform dynamic channel-wise feature recalibration. A wide range of experiments show the effectiveness of SENets, which achieve state-of-the-art performance across multiple datasets and tasks. In addition, SE blocks shed some light on the inability of previous architectures to adequately model channel-wise feature dependencies. We hope this insight may prove useful for other tasks requiring strong discriminative features. Finally, the feature importance values produced by SE blocks may be of use for other tasks such as network pruning for model compression.

Acknowledgments

The authors would like to thank Chao Li and Guangyuan Wang from Momenta for their contributions to the training system optimisation and the experiments on the CIFAR dataset. We would also like to thank Andrew Zisserman, Aravindh Mahendran and Andrea Vedaldi for many helpful discussions. The work is supported in part by NSFC Grants (61632003, 61620106003, 61672502, 61571439), National Key R&D Program of China (2017YFB1002701), and Macao FDCT Grant (068/2015/A2). Samuel Albanie is supported by EPSRC AIMS CDT EP/L015897/1.

Appendix: Details of SENet-154

SENet-154 is constructed by incorporating SE blocks into a modified version of the 64×4d ResNeXt-152, which extends the original ResNeXt-101 by adopting the block stacking strategy of ResNet-152. Further differences to the design and training of this model (beyond the use of SE blocks) are as follows: (a) The number of the first 1×1 convolutional channels for each bottleneck building block was halved to reduce the computational cost of the model with a minimal decrease in performance. (b) The first 7×7 convolutional layer was replaced with three consecutive 3×3 convolutional layers. (c) The 1×1 down-sampling projection with stride-2 convolution was replaced with a 3×3 stride-2 convolution to preserve information. (d) A dropout layer (with a dropout ratio of 0.2) was inserted before the classification layer to reduce overfitting. (e) Label-smoothing regularisation was used during training. (f) The parameters of all BN layers were frozen for the last few training epochs to ensure consistency between training and testing. (g) Training was performed with 8 servers (64 GPUs) in parallel to enable large batch sizes (2048). The initial learning rate was set to 1.0.

References