Attention Augmented Convolutional Networks

Irwan Bello, Barret Zoph, Ashish Vaswani, Jonathon Shlens, Quoc V. Le

Introduction

Convolutional Neural Networks have enjoyed tremendous success in many computer vision applications, especially in image classification. The design of the convolutional layer imposes 1) locality via a limited receptive field and 2) translation equivariance via weight sharing. Both of these properties prove to be crucial inductive biases when designing models that operate over images. However, the local nature of the convolutional kernel prevents it from capturing global context in an image, which is often necessary for better recognition of objects.

Self-attention, on the other hand, has emerged as a recent advance to capture long range interactions, but has mostly been applied to sequence modeling and generative modeling tasks. The key idea behind self-attention is to produce a weighted average of values computed from hidden units. Unlike the pooling or the convolutional operator, the weights used in the weighted average operation are produced dynamically via a similarity function between hidden units. As a result, the interaction between input signals depends on the signals themselves rather than being predetermined by their relative location as in convolutions. In particular, this allows self-attention to capture long range interactions without increasing the number of parameters.

In this paper, we consider the use of self-attention for discriminative visual tasks as an alternative to convolutions. We develop a novel two-dimensional relative self-attention mechanism that maintains translation equivariance while being infused with relative position information, making it well suited for images. Our self-attention formulation proves competitive for replacing convolutions entirely; however, we find in control experiments that the best results are obtained when combining both. We therefore do not completely abandon the idea of convolutions, but instead propose to augment convolutions with this self-attention mechanism. This is achieved by concatenating convolutional feature maps, which enforce locality, to self-attentional feature maps capable of modeling longer range dependencies (see Figure 2).

We test our method on the CIFAR-100 and ImageNet classification tasks and the COCO object detection task, across a wide range of architectures at different computational budgets, including a state-of-the-art resource constrained architecture. Attention Augmentation yields systematic improvements with minimal additional computational burden and notably outperforms the popular Squeeze-and-Excitation channelwise attention approach in all experiments. In particular, Attention Augmentation achieves a 1.3% top-1 accuracy improvement on ImageNet on top of a ResNet50 baseline and a 1.4 mAP increase in COCO object detection on top of a RetinaNet baseline. Surprisingly, experiments also reveal that fully self-attentional models, a special case of Attention Augmentation, only perform slightly worse than their fully convolutional counterparts on ImageNet, indicating that self-attention is a powerful stand-alone computational primitive for image classification.

Related Work

Modern computer vision has been built on powerful image featurizers learned on image classification tasks such as CIFAR-10 and ImageNet. These datasets have been used as benchmarks for delineating better image featurizations and network architectures across a broad range of tasks. For example, improving the “backbone” network typically leads to improvements in object detection and image segmentation. These observations have inspired the research and design of new architectures, which are typically derived from the composition of convolution operations across an array of spatial scales and skip connections. Indeed, automated search strategies for designing architectures based on convolutional primitives result in state-of-the-art accuracy on large-scale image classification tasks that translate across a range of tasks.

2.2 Attention mechanisms in networks

Attention has enjoyed widespread adoption as a computational module for modeling sequences because of its ability to capture long distance interactions. Most notably, Bahdanau et al. first proposed to combine attention with a Recurrent Neural Network for alignment in Machine Translation. Attention was further extended by Vaswani et al., where the self-attentional Transformer architecture achieved state-of-the-art results in Machine Translation. Using self-attention in cooperation with convolutions is a theme shared by recent work in Natural Language Processing and Reinforcement Learning. For example, the QANet and Evolved Transformer architectures alternate between self-attention layers and convolution layers for Question Answering applications and Machine Translation respectively. Additionally, multiple attention mechanisms have been proposed for visual tasks to address the weaknesses of convolutions. For instance, Squeeze-and-Excitation and Gather-Excite reweigh feature channels using signals aggregated from entire feature maps, while BAM and CBAM refine convolutional features independently in the channel and spatial dimensions. In non-local neural networks, improvements are shown in video classification and object detection via the additive use of a few non-local residual blocks that employ self-attention in convolutional architectures. However, non-local blocks are only added to the architecture after ImageNet pretraining and are initialized in such a way that they do not break pretraining.

In contrast, our attention augmented networks do not rely on pretraining of their fully convolutional counterparts and employ self-attention along the entire architecture. The use of multi-head attention allows the model to attend jointly to both spatial and feature subspaces. Additionally, we enhance the representational power of self-attention over images by extending relative self-attention to two-dimensional inputs, allowing us to model translation equivariance in a principled way. Finally, our method produces additional feature maps, rather than recalibrating convolutional features via addition or gating. This property allows us to flexibly adjust the fraction of attentional channels and consider a spectrum of architectures, ranging from fully convolutional to fully attentional models.

Methods

We now formally describe our proposed Attention Augmentation method. We use the following naming conventions: $H$, $W$ and $F_{in}$ refer to the height, width and number of input filters of an activation map. $N_{h}$, $d_{v}$ and $d_{k}$ respectively refer to the number of heads, the depth of values and the depth of queries and keys in multihead-attention (MHA). We further assume that $N_{h}$ divides $d_{v}$ and $d_{k}$ evenly and denote by $d_{v}^{h}$ and $d_{k}^{h}$ the depth of values and queries/keys per attention head.

Without explicit information about positions, self-attention is permutation equivariant:

$$\text{MHA}(\pi(X))=\pi(\text{MHA}(X))$$

for any permutation $\pi$ of the pixel locations of the input $X$, making it ineffective for modeling highly structured data such as images. Multiple positional encodings that augment activation maps with explicit spatial information have been proposed to alleviate related issues. In particular, the Image Transformer extends the sinusoidal waves first introduced in the original Transformer to two-dimensional inputs and CoordConv concatenates positional channels to an activation map.
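The permutation equivariance property can be checked numerically. The following is a minimal NumPy sketch, assuming a single-head, position-unaware self-attention over flattened pixels; the function names, weight matrices and sizes are illustrative choices, not taken from our implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Single-head self-attention without position information.

    X has shape (H*W, F_in): one row per pixel of the flattened activation map.
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    logits = Q @ K.T / np.sqrt(Wk.shape[1])
    return softmax(logits) @ V

rng = np.random.default_rng(0)
H, W, F_in, d_k, d_v = 4, 4, 8, 6, 5  # illustrative sizes
X = rng.normal(size=(H * W, F_in))
Wq = rng.normal(size=(F_in, d_k))
Wk = rng.normal(size=(F_in, d_k))
Wv = rng.normal(size=(F_in, d_v))

# Shuffling the pixels before attention gives the same result as
# shuffling the outputs afterwards: the layer cannot see spatial layout.
perm = rng.permutation(H * W)
out_then_perm = self_attention(X, Wq, Wk, Wv)[perm]
perm_then_out = self_attention(X[perm], Wq, Wk, Wv)
equivariant = np.allclose(out_then_perm, perm_then_out)
```

Here `equivariant` evaluates to `True`, which illustrates why some form of position information is needed before such a layer can exploit image structure.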

However, these encodings did not help in our experiments on image classification and object detection (see Section 4.5). We hypothesize that this is because such positional encodings, while not permutation equivariant, do not satisfy translation equivariance, which is a desirable property when dealing with images. As a solution, we propose to extend the use of relative position encodings to two dimensions and present a memory efficient implementation based on the Music Transformer.

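Originally introduced for the purpose of language modeling, relative self-attention augments self-attention with relative position embeddings and enables translation equivariance while preventing permutation equivariance. We implement two-dimensional relative self-attention by independently adding relative height information and relative width information. The attention logit for how much pixel $i=(i_{x},i_{y})$ attends to pixel $j=(j_{x},j_{y})$ is computed as:

$$l_{i,j}=\frac{q_{i}^{T}}{\sqrt{d_{k}^{h}}}\left(k_{j}+r_{j_{x}-i_{x}}^{W}+r_{j_{y}-i_{y}}^{H}\right)$$

where $q_{i}$ is the query vector for pixel $i$ (the i-th row of $Q$), $k_{j}$ is the key vector for pixel $j$ (the j-th row of $K$) and $r_{j_{x}-i_{x}}^{W}$ and $r_{j_{y}-i_{y}}^{H}$ are learned embeddings for relative width $j_{x}-i_{x}$ and relative height $j_{y}-i_{y}$, respectively. The output of head $h$ now becomes:

$$O_{h}=\text{Softmax}\left(\frac{QK^{T}+S_{H}^{rel}+S_{W}^{rel}}{\sqrt{d_{k}^{h}}}\right)V$$

where $V$ is the matrix of values and $S_{H}^{rel},S_{W}^{rel}\in\mathbb{R}^{HW\times HW}$ are matrices of relative position logits along the height and width dimensions that satisfy $S_{H}^{rel}[i,j]=q_{i}^{T}r_{j_{y}-i_{y}}^{H}$ and $S_{W}^{rel}[i,j]=q_{i}^{T}r_{j_{x}-i_{x}}^{W}$.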

The relative attention algorithm as originally proposed explicitly stores all relative embeddings $r_{ij}$ in a tensor of shape $(HW,HW,d_{k}^{h})$, thus incurring an additional memory cost of $O((HW)^{2}d_{k}^{h})$. This compares to $O((HW)^{2}N_{h})$ for the position-unaware version of self-attention that does not use position encodings. As we typically have $N_{h}<d_{k}^{h}$, such an implementation can prove extremely prohibitive and restrict the number of images that can fit in a minibatch. Instead, we extend the memory efficient relative masked attention algorithm presented in the Music Transformer to unmasked relative self-attention over two-dimensional inputs. Our implementation has a memory cost of $O(HWd_{k}^{h})$. We leave the TensorFlow code of the algorithm in the Appendix.
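To make the cost of the explicit approach concrete, the sketch below computes two-dimensional relative attention logits the naive way, by materializing the full $(HW,HW,d_{k}^{h})$ tensor of relative embeddings. It assumes the logit for pixel $i$ attending to pixel $j$ is $q_{i}^{T}(k_{j}+r^{W}_{j_{x}-i_{x}}+r^{H}_{j_{y}-i_{y}})/\sqrt{d_{k}^{h}}$ and that pixels are flattened in row-major order; the function name `relative_logits_naive` and the embedding-table layout are hypothetical. This is not the memory efficient algorithm, only a reference for what that algorithm must reproduce.

```python
import numpy as np

def relative_logits_naive(Q, K, rel_w, rel_h, H, W):
    """Naive 2D relative attention logits for one head.

    Q, K:   (H*W, d) queries and keys, pixels flattened row-major.
    rel_w:  (2W-1, d) embeddings for width offsets  -(W-1) .. (W-1).
    rel_h:  (2H-1, d) embeddings for height offsets -(H-1) .. (H-1).
    Returns logits of shape (H*W, H*W).
    """
    d = Q.shape[1]
    ys, xs = np.divmod(np.arange(H * W), W)    # pixel index = y * W + x
    # Offsets j - i, shifted so they index into the embedding tables.
    dx = xs[None, :] - xs[:, None] + (W - 1)   # (HW, HW), values in [0, 2W-2]
    dy = ys[None, :] - ys[:, None] + (H - 1)   # (HW, HW), values in [0, 2H-2]
    R = rel_w[dx] + rel_h[dy]                  # (HW, HW, d): the costly tensor
    content = Q @ K.T
    position = np.einsum("id,ijd->ij", Q, R)
    return (content + position) / np.sqrt(d)

rng = np.random.default_rng(0)
H, W, d = 3, 4, 5  # illustrative sizes
Q = rng.normal(size=(H * W, d))
K = rng.normal(size=(H * W, d))
rel_w = rng.normal(size=(2 * W - 1, d))
rel_h = rng.normal(size=(2 * H - 1, d))
logits = relative_logits_naive(Q, K, rel_w, rel_h, H, W)
```

The intermediate `R` grows quadratically with the number of pixels, which is the term a memory efficient implementation avoids storing.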

The relative positional embeddings $r^{H}$ and $r^{W}$ are learned and shared across heads but not layers. For each layer, we add $(2(H+W)-2)d_{k}^{h}$ parameters to model relative distances along height and width.
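As a rough illustration of these costs, the following sketch evaluates the per-layer parameter count of the relative embeddings and compares the size of an explicit relative-embedding tensor with the order of magnitude of a memory efficient implementation. The helper names and the example values ($H=W=28$, $d_{k}^{h}=20$) are illustrative assumptions, and constants hidden by the $O(\cdot)$ notation are ignored.

```python
def relative_embedding_params(H, W, d_k_h):
    """Parameters added per layer: (2W-1) width offsets plus (2H-1) height
    offsets, each an embedding of depth d_k_h."""
    return (2 * (H + W) - 2) * d_k_h

def explicit_relative_tensor_elems(H, W, d_k_h):
    """Elements of an explicit (HW, HW, d_k_h) tensor of relative embeddings."""
    return (H * W) ** 2 * d_k_h

def efficient_relative_elems(H, W, d_k_h):
    """Order of magnitude of the memory efficient variant: HW * d_k_h."""
    return H * W * d_k_h

H = W = 28      # illustrative spatial size
d_k_h = 20      # illustrative key depth per head

n_params = relative_embedding_params(H, W, d_k_h)          # 2200
n_explicit = explicit_relative_tensor_elems(H, W, d_k_h)   # 12293120
n_efficient = efficient_relative_elems(H, W, d_k_h)        # 15680
```

For these values the explicit tensor is $HW=784$ times larger than the memory efficient bound, while the number of added parameters stays in the low thousands.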

3.2 Attention Augmented Convolution

Multiple previously proposed attention mechanisms over images suggest that the convolution operator is limited by its locality and lack of understanding of global contexts. These methods capture long-range dependencies by recalibrating convolutional feature maps. In particular, Squeeze-and-Excitation (SE) and Gather-Excite (GE) perform channelwise reweighing while BAM and CBAM reweigh both channels and spatial positions independently. In contrast to these approaches, we 1) use an attention mechanism that can attend jointly to spatial and feature subspaces (each head corresponding to a feature subspace) and 2) introduce additional feature maps rather than refining them. Figure 2 summarizes our proposed augmented convolution.

Formally, consider an original convolution operator with kernel size $k$, $F_{in}$ input filters and $F_{out}$ output filters. The corresponding attention augmented convolution can be written as

$$\text{AAConv}(X)=\text{Concat}\left[\text{Conv}(X),\text{MHA}(X)\right].$$

We denote by $\upsilon=\frac{d_{v}}{F_{out}}$ the ratio of attentional channels to the number of original output filters and by $\kappa=\frac{d_{k}}{F_{out}}$ the ratio of key depth to the number of original output filters. Similarly to the convolution, the proposed attention augmented convolution 1) is equivariant to translation and 2) can readily operate on inputs of different spatial dimensions. We include TensorFlow code for the proposed attention augmented convolution in Appendix A.3.
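The channel bookkeeping implied by these ratios can be sketched as follows. This is an illustrative helper, not our implementation: the name `aa_conv_channels`, the rounding of depths down to a multiple of $N_{h}$, and the optional minimum key depth per head are assumptions made for the example. The convolutional branch produces the remaining $F_{out}-d_{v}$ channels, so that concatenating it with the $d_{v}$ attentional channels yields $F_{out}$ output channels.

```python
import math

def aa_conv_channels(f_out, kappa, upsilon, n_heads, min_d_k_per_head=0):
    """Split F_out output channels between convolution and attention.

    Depths are rounded down to a multiple of n_heads so that every head
    gets the same depth (an assumption of this sketch).
    """
    d_v = math.floor(upsilon * f_out + 1e-9)
    d_k = math.floor(kappa * f_out + 1e-9)
    d_v -= d_v % n_heads
    d_k -= d_k % n_heads
    # Optionally enforce a minimum key depth per head.
    d_k = max(d_k, min_d_k_per_head * n_heads)
    return {
        "conv_out": f_out - d_v,   # channels produced by the convolution
        "d_v": d_v,                # channels produced by attention
        "d_k": d_k,
        "d_v_per_head": d_v // n_heads,
        "d_k_per_head": d_k // n_heads,
    }

split = aa_conv_channels(f_out=256, kappa=0.2, upsilon=0.1, n_heads=8)
split_min = aa_conv_channels(f_out=160, kappa=0.2, upsilon=0.1, n_heads=8,
                             min_d_k_per_head=20)
```

With $F_{out}=256$, $\kappa=0.2$, $\upsilon=0.1$ and 8 heads, this rounding gives 232 convolutional channels and 24 attentional channels.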

Multihead attention introduces a 1x1 convolution with $F_{in}$ input filters and $(2d_{k}+d_{v})=F_{out}(2\kappa+\upsilon)$ output filters to compute queries, keys and values and an additional 1x1 convolution with $d_{v}=F_{out}\upsilon$ input and output filters to mix the contribution of different heads. Considering the decrease in filters in the convolutional part, this leads to the following change in parameters:

$$\Delta_{params}\sim F_{in}F_{out}\left(2\kappa+(1-k^{2})\upsilon+\frac{F_{out}}{F_{in}}\upsilon^{2}\right),$$

where we ignore the parameters introduced by relative position embeddings for simplicity as these are negligible. In practice, this causes a slight decrease in parameters when replacing 3x3 convolutions and a slight increase in parameters when replacing 1x1 convolutions. Interestingly, we find in experiments that attention augmented networks still significantly outperform their fully convolutional counterparts while using fewer parameters.
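The sign of the parameter change can be verified with a short calculation. The sketch below counts the parameters described in this paragraph directly (the query/key/value projection, the head-mixing projection, and the convolutional filters that are removed) and compares the total with the closed form obtained by factoring out $F_{in}F_{out}$. Biases and relative position embeddings are ignored, depths are treated as real numbers rather than rounded integers, and the function names are illustrative.

```python
import math

def delta_params_exact(f_in, f_out, k, kappa, upsilon):
    """Parameter change from counting each component separately."""
    d_v = upsilon * f_out
    d_k = kappa * f_out
    qkv = f_in * (2 * d_k + d_v)   # 1x1 conv producing queries, keys, values
    proj = d_v * d_v               # 1x1 conv mixing the heads
    removed = k * k * f_in * d_v   # k x k conv filters replaced by attention
    return qkv + proj - removed

def delta_params_formula(f_in, f_out, k, kappa, upsilon):
    """Same quantity with F_in * F_out factored out."""
    return f_in * f_out * (
        2 * kappa + (1 - k * k) * upsilon + (f_out / f_in) * upsilon ** 2
    )

# Illustrative setting: F_in = F_out = 256, kappa = 2 * upsilon = 0.2.
delta_3x3 = delta_params_exact(256, 256, 3, 0.2, 0.1)   # about -25559
delta_1x1 = delta_params_exact(256, 256, 1, 0.2, 0.1)   # about +26870
agree = math.isclose(delta_3x3, delta_params_formula(256, 256, 3, 0.2, 0.1))
```

For this setting, augmenting a 3x3 convolution removes roughly 25.6k parameters while augmenting a 1x1 convolution adds roughly 26.9k.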

In all our experiments, the augmented convolution is followed by a batch normalization layer which can learn to scale the contribution of the convolution feature maps and the attention feature maps. We apply our augmented convolution once per residual block similarly to other visual attention mechanisms and along the entire architecture as memory permits (see Section 4 for more details).

Since the memory cost $O(N_{h}(HW)^{2})$ can be prohibitive for large spatial dimensions, we augment convolutions with attention starting from the last layer (with the smallest spatial dimension) until we hit memory constraints. To reduce the memory footprint of augmented networks, we typically resort to a smaller batch size and sometimes additionally downsample the inputs to self-attention in the layers with the largest spatial dimensions where it is applied. Downsampling is performed by applying 3x3 average pooling with stride 2 while the following upsampling (required for the concatenation) is obtained via bilinear interpolation.
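The effect of spatial size on the attention-map memory can be estimated with a few lines of arithmetic. This sketch assumes 2 bytes per element (bfloat16, as in Appendix A.2) and that the stride-2 pooling uses "same" padding so that the output size is the input size divided by two and rounded up; both the padding choice and the helper names are assumptions of the example.

```python
import math

BYTES_PER_BFLOAT16 = 2

def attention_map_bytes(h, w, n_heads, bytes_per_elem=BYTES_PER_BFLOAT16):
    """Memory needed to store N_h attention maps of shape (HW, HW)."""
    return n_heads * (h * w) ** 2 * bytes_per_elem

def downsampled_size(h, w, stride=2):
    """Spatial size after strided pooling, assuming 'same' padding."""
    return math.ceil(h / stride), math.ceil(w / stride)

full = attention_map_bytes(28, 28, n_heads=8)        # 9834496 bytes per image
h2, w2 = downsampled_size(28, 28)                    # (14, 14)
reduced = attention_map_bytes(h2, w2, n_heads=8)     # 614656 bytes per image
saving = full // reduced                             # 16
```

Halving each spatial dimension divides the number of pixels by 4 and therefore the attention-map memory by 16, which is why downsampling is only needed at the largest resolutions.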

Experiments

In the subsequent experiments, we test Attention Augmentation on standard computer vision architectures such as ResNets and MnasNet on the CIFAR-100, ImageNet and COCO datasets. Our experiments show that Attention Augmentation leads to systematic improvements on both image classification and object detection tasks across a broad array of architectures and computational demands. We validate the utility of the proposed two-dimensional relative attention mechanism in ablation experiments. In all experiments, we substitute convolutional feature maps with self-attention feature maps as it makes for an easier comparison against the baseline models. Unless specified otherwise, all results correspond to our two-dimensional relative self-attention mechanism. Experimental details can be found in the Appendix.

We first investigate how Attention Augmentation performs on CIFAR-100, a standard benchmark for low-resolution imagery, using a Wide ResNet architecture. The Wide-ResNet-28-10 architecture comprises 3 stages of 4 residual blocks, each using two $3\times 3$ convolutions. We modify the Wide-ResNet-28-10 by augmenting the first convolution of all residual blocks with relative attention using $N_{h}=8$ heads and $\kappa=2\upsilon=0.2$ and a minimum of 20 dimensions per head for the keys. We compare Attention Augmentation (AA) against other forms of attention including Squeeze-and-Excitation (SE) and the parameter-free formulation of Gather-Excite (GE). Table 1 shows that Attention Augmentation improves performance both over the baseline network and Squeeze-and-Excitation at a similar parameter and complexity cost.

4.2 ImageNet image classification with ResNet

We next examine how Attention Augmentation performs on ImageNet, a standard large-scale dataset for high resolution imagery, across an array of architectures. We start with the ResNet architecture because of its widespread use and its ability to easily scale across several computational budgets. The building block in ResNet-34 comprises two 3x3 convolutions with the same number of output filters. ResNet-50 and its larger counterparts use a bottleneck block comprising 1x1, 3x3 and 1x1 convolutions, where the first pointwise convolution contracts the number of filters and the last one expands it. We modify all ResNets by augmenting the 3x3 convolutions as this decreases the number of parameters. We found that augmenting the pointwise expansions works just as well but does not save parameters or computations. We apply Attention Augmentation in each residual block of the last 3 stages of the architecture (when the spatial dimensions of the activation maps are 28x28, 14x14 and 7x7) and downsample only in the first of these stages. All attention augmented networks use $\kappa=2\upsilon=0.2$, except for ResNet-34 which uses $\kappa=\upsilon=0.25$. The number of attention heads is fixed to $N_{h}=8$.

Table 2 benchmarks Attention Augmentation against the channel and spatial attention mechanisms BAM, CBAM and GALA with channel reduction ratio $\sigma=16$ on the ResNet50 architecture. Despite the lack of specialized kernels (see Appendix A.3), Attention Augmentation offers a competitive accuracy/computational trade-off compared to previously proposed attention mechanisms. Table 3 compares the non-augmented networks and Squeeze-and-Excitation (SE) across different network scales. In all experiments, Attention Augmentation significantly increases performance over the non-augmented baseline and notably outperforms Squeeze-and-Excitation (SE) while being more parameter efficient (Figure 1). Remarkably, our AA-ResNet-50 performs comparably to the baseline ResNet-101 and our AA-ResNet-101 outperforms the baseline ResNet-152. These results suggest that attention augmentation is preferable to simply making networks deeper. We include and discuss attention map visualizations from different pixel positions in the Appendix.

4.3 ImageNet classification with MnasNet

In this section, we inspect the use of Attention Augmentation in a resource constrained setting by conducting ImageNet experiments with the MnasNet architecture, which is an extremely parameter-efficient architecture. In particular, the MnasNet was found by neural architecture search, using only the highly optimized mobile inverted bottleneck block and the Squeeze-and-Excitation operation as the primitives in its search space. We apply Attention Augmentation to the mobile inverted bottleneck by replacing convolutional channels in the expansion pointwise convolution using $\kappa=2\upsilon=0.1$ and $N_{h}=4$ heads. Our augmented MnasNets use augmented inverted bottlenecks in the last 13 blocks out of 18 in the MnasNet architecture, starting when the spatial dimension is 28x28. We downsample only in the first stage where Attention Augmentation is applied. We leave the final pointwise convolution, also referred to as the “head”, unchanged.

In Table 4, we report ImageNet accuracies for the baseline MnasNet and its attention augmented variants at different width multipliers. Our experiments show that Attention Augmentation yields accuracy improvements across all width multipliers. Augmenting MnasNets with relative self-attention incurs a slight parameter increase; however, we verify in Figure 3 that the accuracy improvements are not just explained by the parameter increase. Additionally, we note that the MnasNet architecture employs Squeeze-and-Excitation at multiple locations that were optimally selected via architecture search, further suggesting the benefits of our method.

4.4 Object Detection with COCO dataset

We next investigate the use of Attention Augmentation on the task of object detection on the COCO dataset. We employ the RetinaNet architecture with ResNet-50 and ResNet-101 backbones, using the open-sourced RetinaNet codebase (https://github.com/tensorflow/tpu/tree/master/models/official/retinanet). We apply Attention Augmentation only to the ResNet backbone, modifying it in the same way as in our ImageNet classification experiments.

Our relative self-attention mechanism improves the performance of the RetinaNet on both ResNet-50 and ResNet-101 as shown in Table 5. Most notably, Attention Augmentation yields a 1.4 mAP improvement over a strong RetinaNet baseline. In contrast to the success of Squeeze-and-Excitation in image classification with ImageNet, our experiments show that adding Squeeze-and-Excitation operators in the backbone network of the RetinaNet significantly hurts performance, in spite of grid searching over the squeeze ratio $\sigma\in\{4,8,16\}$. We hypothesize that localization requires precise spatial information which SE discards during the spatial pooling operation, thereby negatively affecting performance. Self-attention, on the other hand, maintains spatial information and is likely to be able to identify object boundaries successfully. Visualizations of attention maps (see Figures 9 and 10 in the Appendix) reveal that some heads are indeed delineating objects from their background, which might be important for localization.

4.5 Ablation Study

In this section, we investigate the performance of Attention Augmentation as a function of the fraction of attentional channels. As we increase this fraction to 100%, we begin to replace a ConvNet with a fully attentional model, only leaving pointwise convolutions and the stem unchanged. Table 6 presents the performance of Attention Augmentation on the ResNet-50 architecture for varying ratios $\kappa$ = $\upsilon$ $\in\{0.25,0.5,0.75,1.0\}$ . Performance slightly degrades as the ratio of attentional channels increases, which we hypothesize is partly explained by the average pooling operation for downsampling at the first stage where Attention Augmentation is applied. Attention Augmentation proves however quite robust to the fraction of attentional channels. For instance, AA-ResNet-50 with $\kappa$ = $\upsilon$ =0.75 outperforms its ResNet-50 counterpart, while being more parameter and flops efficient, indicating that mostly employing attentional channels is readily competitive.

Perhaps surprisingly, these experiments also reveal that our proposed self-attention mechanism is a powerful stand-alone computational primitive for image classification and that fully attentional models are viable for discriminative visual tasks. In particular, AA-ResNet-50 with $\kappa=\upsilon=1$, which uses exclusively attentional channels, is only 2.5% worse in accuracy than its fully convolutional counterpart, in spite of downsampling with average pooling and having 25% fewer parameters. Notably, this fully attentional architecture also outperforms ResNet-34 while being more parameter and FLOPS efficient (see Table 6). (Here we consider pointwise convolutions as dense layers. The architecture employs 4 non-pointwise convolutions in the stem and the first stage, but we believe such operations can be replaced by attention too.)

In Figure 4, we show the effect of our proposed two-dimensional relative position encodings as a function of the fraction of attentional channels. As expected, experiments demonstrate that our relative position encodings become increasingly more important as the architecture employs more attentional channels. In particular, the fully self-attentional ResNet-50 gains 2.8% top-1 ImageNet accuracy when using relative position encodings, which indicates the necessity of maintaining position information for fully self-attentional vision models.

We additionally compare our proposed two-dimensional relative position encodings to other position encoding schemes. We apply Attention Augmentation using the same hyperparameters as in Section 4.2 with the following position encoding schemes: 1) the position-unaware version of self-attention (referred to as None), 2) a two-dimensional implementation of the sinusoidal positional waves (referred to as 2d Sine) as used in the Image Transformer, 3) CoordConv, for which we concatenate (x,y,r) coordinate channels to the inputs of the attention function, and 4) our proposed two-dimensional relative position encodings (referred to as Relative).

In Tables 7 and 8, we present the results on ImageNet classification and the COCO object detection task respectively. On both tasks, Attention Augmentation without position encodings already yields improvements over the fully convolutional non-augmented variants. Our experiments also reveal that the sinusoidal encodings and the coordinate convolution do not provide improvements over the position-unaware version of Attention Augmentation. We obtain additional improvements when using our two-dimensional relative attention, demonstrating the utility of preserving translation equivariance while preventing permutation equivariance.

Discussion and future work

In this work, we consider the use of self-attention for vision models as an alternative to convolutions. We introduce a novel two-dimensional relative self-attention mechanism for images that enables training of competitive fully self-attentional vision models on image classification for the first time. We propose to augment convolutional operators with this self-attention mechanism and validate the superiority of this approach over other attention schemes. Extensive experiments show that Attention Augmentation leads to systematic improvements on both image classification and object detection tasks across a wide range of architectures and computational settings.

Several open questions from this work remain. In future work, we will focus on the fully attentional regime and explore how different attention mechanisms trade off computational efficiency versus representational power. For instance, identifying a local attention mechanism may result in an efficient and scalable computational mechanism that could remove the need for downsampling with average pooling. Additionally, it is plausible that architectural design choices that are well suited when exclusively relying on convolutions are suboptimal when using self-attention mechanisms. As such, it would be interesting to see if using Attention Augmentation as a primitive in automated architecture search procedures proves useful to find even better models than those previously found in image classification, object detection, image segmentation and other domains. Finally, one can ask to which degree fully attentional models can replace convolutional networks for visual tasks.

Acknowledgements

The authors would like to thank Tsung-Yi Lin, Prajit Ramachandran, Mingxing Tan, Yanping Huang and the Google Brain team for insightful comments and discussions.

Appendix A

Unless specified otherwise, we use the default hyperparameters found in reference baseline codebases without tuning. $\kappa$ was searched in {0.1, 0.2, 0.5}, $\upsilon$ in {0.0, 0.1, 0.25, 0.5, 0.75, 1.0} and the number of heads was chosen based on memory constraints (starting from 8 and decreasing when necessary). We report the final accuracy for each run without performing early stopping.

Given the low resolution of CIFAR-100 images, we do not downsample feature maps before the attention operation and instead resort to a smaller batch size. We train all networks for 500 epochs using synchronous SGD with momentum 0.9 distributed across 8 Tesla V100 GPUs. The learning rate is linearly scaled from 0 to $0.2B/256$, where $B$ is the total batch size, for the first $5\%$ of training epochs and then annealed with cosine decay. We use standard CIFAR preprocessing: mean normalizing, random flipping and cropping. Non-augmented architectures are trained with a batch size of 1024 and a weight decay of 2e-4. Augmented architectures are trained with a batch size of 256 and a weight decay of 5e-4.
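The CIFAR-100 learning rate schedule described above can be sketched as a function of the epoch. This is a minimal illustration under stated assumptions: the cosine phase is assumed to decay to zero at the final epoch, the schedule is evaluated per epoch rather than per step, and the function name `warmup_cosine_lr` is hypothetical.

```python
import math

def warmup_cosine_lr(epoch, total_epochs, batch_size, warmup_frac=0.05):
    """Linear warmup from 0 to 0.2 * B / 256, then cosine annealing.

    Assumes the cosine phase ends at a learning rate of zero.
    """
    peak = 0.2 * batch_size / 256
    warmup = warmup_frac * total_epochs
    if epoch < warmup:
        return peak * epoch / warmup
    progress = (epoch - warmup) / (total_epochs - warmup)
    return 0.5 * peak * (1.0 + math.cos(math.pi * progress))

# Augmented networks: batch size 256, 500 epochs, so warmup lasts 25 epochs.
lr_start = warmup_cosine_lr(0, 500, 256)
lr_peak = warmup_cosine_lr(25, 500, 256)
lr_end = warmup_cosine_lr(500, 500, 256)
```

With a batch size of 256 the peak learning rate is 0.2, reached after 25 epochs; with a batch size of 1024 the same function gives a peak of 0.8.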

For ImageNet, we train all ResNet architectures for 100 epochs using synchronous SGD with momentum 0.9 across 8 Tesla V100 GPUs and a weight decay of 1e-4. We use the largest batch size per worker $B\in\{32,64,128,256\}$ that fits in memory. The initial learning rate is scaled linearly according to the total batch size using a base learning rate of 0.128 for a total batch size of 256. During training, we linearly scale the learning rate from 0 to this value for the first 5% of training epochs and divide it by 10 at epochs 30, 60, 80 and 90. We use standard Inception data augmentation.
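The step schedule used for the ResNet experiments can be written compactly as follows. The sketch evaluates the learning rate per epoch and assumes each drop takes effect at the start of the listed epoch; the function name `warmup_step_lr` and its defaults are illustrative.

```python
import math

def warmup_step_lr(epoch, total_batch_size, total_epochs=100, base_lr=0.128,
                   warmup_frac=0.05, boundaries=(30, 60, 80, 90)):
    """Linear warmup to base_lr * B / 256, then divide by 10 at each boundary."""
    peak = base_lr * total_batch_size / 256
    warmup = warmup_frac * total_epochs
    if epoch < warmup:
        return peak * epoch / warmup
    drops = sum(epoch >= b for b in boundaries)   # boundaries already passed
    return peak / (10 ** drops)

lr_after_warmup = warmup_step_lr(10, total_batch_size=256)    # 0.128
lr_after_first_drop = warmup_step_lr(30, total_batch_size=256)  # 0.0128
lr_final = warmup_step_lr(95, total_batch_size=1024)          # 0.512 / 10**4
```

Scaling the total batch size from 256 to 1024 multiplies every value of the schedule by 4, following the linear scaling rule stated above.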

For MnasNet, we follow the training setup of the baseline and train all networks for 350 epochs with the RMSProp optimizer using exponential learning rate decay. When training our augmented MnasNets, we divide the learning rate by 2 and adjust the learning rate decay so that the final learning rate stays the same.

For object detection, we follow the reference baseline setup and train the RetinaNet from scratch for 150 epochs without using ImageNet pretraining for the ResNet backbone.

A.2 Computational & Memory costs

Table 9 provides the breakdown of self-attention related computational costs per image. Storing attention maps in each layer induces a memory cost of $N_{h}(HW)^{2}$ bfloat16. At inference, the memory cost for storing attention maps is only 1.2% of the memory required to store model parameters (49MB).

Figures 5 and 6 show the accuracies of our attention augmented networks across FLOPS counts, which correlate with running times across hardware platforms.

A.3 2D Relative Self-Attention implementation

While our method is simple and only requires matrix multiplication, addition and the softmax operation (Equations 3 and 4), our implementation relies on non-trivial operations (e.g. tiling, transposing and reshaping) because no low-level kernels currently exist for hardware platforms. Future work may develop specialized kernels as previously done for convolutions. Therefore, we believe that current latency times (Table 2) reflect the lack of dedicated engineering as opposed to inefficiency in the proposed method.

A.4 Attention visualizations

In Figure 10, we present attention map visualizations for the input image shown in Figure 9. We see that attention heads learn to specialize to different content and notably can delineate object boundaries.