Axial-DeepLab: Stand-Alone Axial-Attention for Panoptic Segmentation
Huiyu Wang, Yukun Zhu, Bradley Green, Hartwig Adam, Alan Yuille, Liang-Chieh Chen
Introduction
Convolution is a core building block in computer vision. Early algorithms employ convolutional filters to blur images, extract edges, or detect features. Convolution has been heavily exploited in modern neural networks due to its efficiency and generalization ability, in comparison to fully connected models. The success of convolution mainly comes from two properties: translation equivariance and locality. Translation equivariance, although not exact, aligns well with the nature of imaging and thus generalizes the model to different positions or to images of different sizes. Locality, on the other hand, reduces parameter counts and multiply-adds (M-Adds). However, it makes modeling long range relations challenging.
A rich body of literature has discussed approaches to modeling long range interactions in convolutional neural networks (CNNs). Some employ atrous convolutions, larger kernels, or image pyramids, either designed by hand or searched by algorithms. Another line of work adopts attention mechanisms. Attention has shown its ability to model long range interactions in language modeling, speech recognition, and neural captioning. Attention has since been extended to vision, giving significant boosts to image classification, object detection, semantic segmentation, video classification, and adversarial defense. These works enrich CNNs with non-local or long-range attention modules.
Recently, stacking attention layers as stand-alone models without any spatial convolution has been proposed and has shown promising results. However, naive attention is computationally expensive, especially on large inputs. Applying local constraints to attention reduces the cost and enables building fully attentional models. However, local constraints limit the model’s receptive field, which is crucial to tasks such as segmentation, especially on high-resolution inputs. In this work, we propose to adopt axial-attention, which not only allows efficient computation, but also recovers the large receptive field in stand-alone attention models. The core idea is to factorize 2D attention into two 1D attentions along the height- and width-axis sequentially. Its efficiency enables us to attend over large regions and build models that learn long range or even global interactions. Additionally, most previous attention modules do not utilize positional information, which degrades attention’s ability to model position-dependent interactions, like shapes or objects at multiple scales. Recent works introduce positional terms to attention, but in a context-agnostic way. In this paper, we augment the positional terms to be context-dependent, making our attention position-sensitive, with marginal costs.
We show the effectiveness of our axial-attention models on ImageNet for classification, and on three datasets (COCO, Mapillary Vistas, and Cityscapes) for panoptic segmentation, instance segmentation, and semantic segmentation. In particular, on ImageNet, we build an Axial-ResNet by replacing the 3×3 convolution in all residual blocks with our position-sensitive axial-attention layer, and we further make it fully attentional by adopting axial-attention layers in the ‘stem’. As a result, our Axial-ResNet attains state-of-the-art results among stand-alone attention models on ImageNet. For segmentation tasks, we convert Axial-ResNet to Axial-DeepLab by replacing the backbones in Panoptic-DeepLab. On COCO, our Axial-DeepLab outperforms the current bottom-up state-of-the-art, Panoptic-DeepLab, by 2.8% PQ on the test-dev set. We also show state-of-the-art segmentation results on Mapillary Vistas and Cityscapes.
To summarize, our contributions are four-fold:
The proposed method is the first attempt to build stand-alone attention models with large or global receptive field.
We propose a position-sensitive attention layer that makes better use of positional information without adding much computational cost.
We show that axial attention works well, not only as a stand-alone model on image classification, but also as a backbone on panoptic segmentation, instance segmentation, and semantic segmentation.
Our Axial-DeepLab improves significantly over the bottom-up state-of-the-art on COCO, achieving performance comparable to two-stage methods. We also surpass previous state-of-the-art methods on Mapillary Vistas and Cityscapes.
Related Work
Top-down panoptic segmentation: Most state-of-the-art panoptic segmentation models employ a two-stage approach in which object proposals are first generated and each proposal is then processed sequentially. We refer to such approaches as top-down or proposal-based methods. Mask R-CNN is commonly deployed in the pipeline for instance segmentation, paired with a light-weight stuff segmentation branch. For example, Panoptic FPN adds a semantic segmentation head to Mask R-CNN, while Porzi et al. append a light-weight DeepLab-inspired module to the multi-scale features from FPN. Additionally, some extra modules are designed to resolve the overlapping instance predictions made by Mask R-CNN. TASCNet and AUNet propose a module to guide the fusion between ‘thing’ and ‘stuff’ predictions, while Liu et al. adopt a Spatial Ranking module. UPSNet develops an efficient parameter-free panoptic head for fusing ‘thing’ and ‘stuff’, which is further explored by Li et al. for end-to-end training of panoptic segmentation models. AdaptIS uses point proposals to generate instance masks.
Bottom-up panoptic segmentation: In contrast to top-down approaches, bottom-up or proposal-free methods for panoptic segmentation typically start with a semantic segmentation prediction and then group ‘thing’ pixels into clusters to obtain instance segmentation. DeeperLab predicts the four corners of bounding boxes and object centers for class-agnostic instance segmentation. SSAP exploits a pixel-pair affinity pyramid enabled by an efficient graph partition method. BBFNet obtains instance segmentation results by the Watershed transform and Hough voting. Recently, Panoptic-DeepLab, a simple, fast, and strong approach for bottom-up panoptic segmentation, employs a class-agnostic instance segmentation branch involving a simple instance center regression, coupled with DeepLab semantic segmentation outputs. Panoptic-DeepLab has achieved state-of-the-art results on several benchmarks, and our method builds on top of it.
Self-attention: Attention, introduced for the encoder-decoder in a neural sequence-to-sequence model, was developed to capture the correspondence of tokens between two sequences. In contrast, self-attention is defined as applying attention to a single context instead of across multiple modalities. Its ability to directly encode long-range interactions, together with its parallelizability, has led to state-of-the-art performance on various tasks. Recently, self-attention has been applied to computer vision by augmenting CNNs with non-local or long-range modules. Non-local neural networks show that self-attention is an instantiation of non-local means and achieve gains on many vision tasks such as video classification and object detection. Additionally, combining features from self-attention and convolution has been shown to improve image classification. State-of-the-art results on video action recognition tasks are also achieved in this way. On semantic segmentation, self-attention is developed as a context aggregation module that captures multi-scale context. Efficient attention methods are proposed to reduce its complexity. Additionally, CNNs augmented with non-local means are shown to be more robust to adversarial attacks. Besides discriminative tasks, self-attention is also applied to generative modeling of images. Recently, it has been shown that self-attention layers alone can be stacked to form a fully attentional model by restricting the receptive field of self-attention to a local square region. Encouraging results are shown on both image classification and object detection. In this work, we follow this direction of research and propose a stand-alone self-attention model with a large or global receptive field, making self-attention models non-local again. Our models are evaluated on bottom-up panoptic segmentation and show significant improvements.
Method
We begin by formally introducing our position-sensitive self-attention mechanism. Then, we discuss how it is applied to axial-attention and how we build stand-alone Axial-ResNet and Axial-DeepLab with axial-attention layers.
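Self-attention computes the output $y_o$ at each location $o$ by pooling the values $v_p = W_V x_p$ over all locations $p$ of the input feature map $x$, weighted by the softmax-normalized affinities $q_o^\top k_p = x_o^\top W_Q^\top W_K x_p$ between the query $q_o = W_Q x_o$ and the keys $k_p = W_K x_p$. This allows us to capture related but non-local context in the whole feature map, as opposed to convolution, which only captures local relations.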
However, self-attention is extremely expensive to compute ($\mathcal{O}(h^2 w^2)$ for an $h \times w$ feature map) when the spatial dimension of the input is large, restricting its use to only high levels of a CNN (i.e., downsampled feature maps) or small images. Another drawback is that the global pooling does not exploit positional information, which is critical to capture spatial structures or shapes in vision tasks.
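These two issues are mitigated in prior stand-alone self-attention work by adding local constraints and positional encodings to self-attention. For each location $o$, a local $m \times m$ square region is extracted to serve as a memory bank for computing the output $y_o$. This significantly reduces its computation to $\mathcal{O}(hwm^2)$, allowing self-attention modules to be deployed as stand-alone layers to form a fully self-attentional neural network. Additionally, a learned relative positional encoding term is incorporated into the affinities, yielding a dynamic prior of where to look in the receptive field (i.e., the local $m \times m$ square region). Formally, this local attention computes

$$y_o = \sum_{p \in \mathcal{N}_{m \times m}(o)} \mathrm{softmax}_p\left(q_o^\top k_p + q_o^\top r_{p-o}\right) v_p, \tag{2}$$

where $\mathcal{N}_{m \times m}(o)$ is the local $m \times m$ square region centered around location $o$, and the learnable vector $r_{p-o}$ is the added relative positional encoding.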
In practice, the query dimension $d_q$ and the output dimension $d_{out}$ are much smaller than the input dimension $d_{in}$, and one could extend single-head attention in Eq. (2) to multi-head attention to capture a mixture of affinities. In particular, multi-head attention is computed by applying $N$ single-head attentions in parallel on $x_o$ (with different $W_Q^n$, $W_K^n$, $W_V^n$, and $r^n_{p-o}$ for the $n$-th head), and then obtaining the final output $z_o$ by concatenating the results from each head, i.e., $z_o = \mathrm{concat}_n(y_o^n)$. Note that positional encodings are often shared across heads, so that they introduce marginal extra parameters.
Position-Sensitivity: We notice that the previous positional bias only depends on the query pixel $x_o$, not the key pixel $x_p$. However, the keys could also carry information about which location to attend to. We therefore add a key-dependent positional bias term $k_p^\top r^k_{p-o}$, besides the query-dependent bias $q_o^\top r^q_{p-o}$.
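Similarly, the values $v_p$ do not contain any positional information in Eq. (2). In the case of large receptive fields or memory banks, it is unlikely that $y_o$ contains the precise location from which $v_p$ comes. Thus, previous models have to trade off between using smaller receptive fields (i.e., small $m \times m$ regions) and throwing away precise spatial structures. In this work, we enable the output $y_o$ to retrieve relative positions $r^v_{p-o}$, besides the content $v_p$, based on query-key affinities $q_o^\top k_p$. Formally,

$$y_o = \sum_{p \in \mathcal{N}_{m \times m}(o)} \mathrm{softmax}_p\left(q_o^\top k_p + q_o^\top r^q_{p-o} + k_p^\top r^k_{p-o}\right)\left(v_p + r^v_{p-o}\right), \tag{3}$$

where $r^q_{p-o}$, $r^k_{p-o}$, and $r^v_{p-o}$ are the learnable relative positional encodings for queries, keys, and values, respectively.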
We call this design position-sensitive self-attention, which captures long range interactions with precise positional information at a reasonable computation overhead, as verified in our experiments.
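As an illustration of the description above, the following is a minimal single-head sketch over a 1D sequence in plain Python. It assumes the affinity is the sum of a content term, a query-dependent positional bias, and a key-dependent positional bias, and that the relative positional encoding for values is added to the content before pooling. The function name, the dictionary-based storage of encodings, and the toy inputs are hypothetical choices for this sketch, not the authors' implementation.

```python
import math

def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(l - peak) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def position_sensitive_attention_1d(q, k, v, r_q, r_k, r_v):
    """Single-head attention over a 1D sequence of n positions.

    q, k: lists of n query/key vectors; v: list of n value vectors.
    r_q, r_k, r_v: dicts mapping a relative offset (p - o) to a vector
    (hypothetical containers chosen for this sketch).
    """
    n = len(q)
    d_out = len(v[0])
    outputs = []
    for o in range(n):
        # Affinity = content term + query-dependent bias + key-dependent bias.
        logits = [
            dot(q[o], k[p]) + dot(q[o], r_q[p - o]) + dot(k[p], r_k[p - o])
            for p in range(n)
        ]
        weights = softmax(logits)
        # The output pools both the content v_p and the relative position term.
        outputs.append([
            sum(weights[p] * (v[p][c] + r_v[p - o][c]) for p in range(n))
            for c in range(d_out)
        ])
    return outputs

# Toy usage: zero queries/keys give uniform weights over the two positions.
zeros = {off: [0.0] for off in (-1, 0, 1)}
offsets = {off: [float(off)] for off in (-1, 0, 1)}
q = k = [[0.0], [0.0]]
v = [[1.0], [3.0]]
plain = position_sensitive_attention_1d(q, k, v, zeros, zeros, zeros)
shifted = position_sensitive_attention_1d(q, k, v, zeros, zeros, offsets)
```

With all positional encodings set to zero, the sketch reduces to plain content-based attention. In the second call the value encodings make the two output positions differ even though the attention weights are identical, which is the sense in which the output can carry relative position.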
Axial-Attention
The local constraint, proposed by the stand-alone self-attention models, significantly reduces the computational cost in vision tasks and enables building fully self-attentional models. However, such a constraint sacrifices the global connection, making attention’s receptive field no larger than that of a depthwise convolution with the same kernel size. Additionally, the local self-attention, performed in local square regions, still has complexity quadratic in the region length, introducing another hyper-parameter to trade off between performance and computational complexity. In this work, we propose to adopt axial-attention in stand-alone self-attention, ensuring both global connection and efficient computation. Specifically, we first define an axial-attention layer on the width-axis of an image as simply a one-dimensional position-sensitive self-attention, and use a similar definition for the height-axis. To be concrete, the axial-attention layer along the width-axis is defined as

$$y_o = \sum_{p \in \mathcal{N}_{1 \times m}(o)} \mathrm{softmax}_p\left(q_o^\top k_p + q_o^\top r^q_{p-o} + k_p^\top r^k_{p-o}\right)\left(v_p + r^v_{p-o}\right), \tag{4}$$

where $\mathcal{N}_{1 \times m}(o)$ is the $1 \times m$ region along the width-axis centered around location $o$.
One axial-attention layer propagates information along one particular axis. To capture global information, we employ two axial-attention layers consecutively for the height-axis and width-axis, respectively. Both of the axial-attention layers adopt the multi-head attention mechanism, as described above.
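The claim that two consecutive axial layers reach the whole feature map can be checked with a small dependency-tracking sketch. It does not compute attention at all; it only records which input positions can influence each output position after a height-axis pass followed by a width-axis pass. The function names and the toy 5×7 grid are illustrative assumptions.

```python
def axial_pass(deps, axis, span=None):
    """Propagate dependency sets along one axis of an h x w grid.

    deps[i][j] is the set of input positions that position (i, j) depends on.
    axis=0 mixes along the height-axis, axis=1 along the width-axis.
    span=None means the span covers the whole axis.
    """
    h, w = len(deps), len(deps[0])
    out = [[set() for _ in range(w)] for _ in range(h)]
    for i in range(h):
        for j in range(w):
            length = h if axis == 0 else w
            center = i if axis == 0 else j
            radius = length if span is None else span // 2
            lo = max(0, center - radius)
            hi = min(length, center + radius + 1)
            for t in range(lo, hi):
                src = deps[t][j] if axis == 0 else deps[i][t]
                out[i][j] |= src
    return out

def receptive_field_sizes(h, w, span=None):
    """Receptive field size per position after height- then width-axis passes."""
    deps = [[{(i, j)} for j in range(w)] for i in range(h)]
    deps = axial_pass(deps, axis=0, span=span)
    deps = axial_pass(deps, axis=1, span=span)
    return [[len(deps[i][j]) for j in range(w)] for i in range(h)]

full = receptive_field_sizes(5, 7)           # span covers the whole axis
local = receptive_field_sizes(5, 7, span=3)  # small fixed span
```

With a full span every position depends on all 35 inputs, while a span of 3 limits the centre position to a 3×3 neighbourhood.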
Axial-attention reduces the complexity to $\mathcal{O}(hwm)$. This enables a global receptive field, which is achieved by setting the span $m$ directly to the whole input features. Optionally, one could also use a fixed $m$ value, in order to reduce memory footprint on huge feature maps.
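To make the scaling concrete, the sketch below counts query-key pairs for three layouts on an h×w feature map: global 2D attention, local attention over an m×m square, and a pair of axial layers with span m. Counting pairs is a simplification assumed here for illustration (border effects and channel dimensions are ignored), so the numbers indicate growth rates rather than actual M-Adds.

```python
def attention_pair_counts(h, w, m):
    """Count query-key pairs per layout on an h x w map (borders ignored).

    global_2d: every position attends to every position.
    local_2d:  every position attends to an m x m square.
    axial:     one height-axis layer plus one width-axis layer, span m each.
    """
    positions = h * w
    return {
        "global_2d": positions * positions,
        "local_2d": positions * m * m,
        "axial": 2 * positions * m,
    }

# A 56 x 56 feature map with the axial span covering the whole axis.
counts = attention_pair_counts(56, 56, 56)
ratio = counts["global_2d"] // counts["axial"]
```

In this setting the axial pair needs 28 times fewer query-key pairs than global 2D attention while still reaching every position.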
Axial-ResNet: To transform a ResNet into an Axial-ResNet, we replace the 3×3 convolution in the residual bottleneck block with two multi-head axial-attention layers (one for the height-axis and the other for the width-axis). Optional striding is performed on each axis after the corresponding axial-attention layer. The two 1×1 convolutions are kept to shuffle the features. This forms our (residual) axial-attention block, as illustrated in Fig. 2, which is stacked multiple times to obtain Axial-ResNets. Note that we do not use a 1×1 convolution in-between the two axial-attention layers, since matrix multiplications ($W_Q$, $W_K$, $W_V$) follow immediately. Additionally, the stem (i.e., the first strided 7×7 convolution and 3×3 max-pooling) in the original ResNet is kept, resulting in a conv-stem model where convolution is used in the first layer and attention layers are used everywhere else. In conv-stem models, we set the span $m$ to the whole input from the first block, where the feature map is 56×56.
In our experiments, we also build a full axial-attention model, called Full Axial-ResNet, which further applies axial-attention to the stem. Instead of designing a special spatially-varying attention stem, we simply stack three axial-attention bottleneck blocks. In addition, we adopt local constraints (i.e., a local $m \times m$ square region, as in stand-alone self-attention models) in the first few blocks of Full Axial-ResNets, in order to reduce computational cost.
Axial-DeepLab: To further convert Axial-ResNet to Axial-DeepLab for segmentation tasks, we make several changes as discussed below.
Firstly, to extract dense feature maps, DeepLab changes the stride and atrous rates of the last one or two stages in ResNet. Similarly, we remove the stride of the last stage but we do not implement the ‘atrous’ attention module, since our axial-attention already captures global information for the whole input. In this work, we extract feature maps with output stride (i.e., the ratio of input resolution to the final backbone feature resolution) 16. We do not pursue output stride 8, since it is computationally expensive.
Secondly, we do not adopt the atrous spatial pyramid pooling module (ASPP), since our axial-attention block could also efficiently encode the multi-scale or global information. We show in the experiments that our Axial-DeepLab without ASPP outperforms Panoptic-DeepLab with and without ASPP.
Lastly, following Panoptic-DeepLab, we adopt exactly the same stem of three convolutions, dual decoders, and prediction heads. The heads produce semantic segmentation and class-agnostic instance segmentation, and they are merged by majority voting to form the final panoptic segmentation.
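The majority-voting merge mentioned above can be sketched as follows. This is a simplified illustration under stated assumptions, not the Panoptic-DeepLab code: instance id 0 is assumed to mean "no instance", each instance takes the most frequent ‘thing’ class among its pixels, and all other pixels keep their semantic label. The function name and the toy inputs are hypothetical.

```python
from collections import Counter

def merge_by_majority_vote(semantic, instance_ids, thing_classes):
    """Merge per-pixel semantic labels with class-agnostic instance ids.

    semantic, instance_ids: 2D lists of the same shape.
    Returns a 2D list of (class, instance_id) pairs; id 0 marks 'stuff'.
    """
    # Collect votes: which 'thing' classes appear inside each instance mask.
    votes = {}
    for sem_row, ins_row in zip(semantic, instance_ids):
        for cls, ins in zip(sem_row, ins_row):
            if ins != 0 and cls in thing_classes:
                votes.setdefault(ins, Counter())[cls] += 1
    majority = {ins: c.most_common(1)[0][0] for ins, c in votes.items()}

    # Relabel every pixel of an instance with that instance's majority class.
    panoptic = []
    for sem_row, ins_row in zip(semantic, instance_ids):
        row = []
        for cls, ins in zip(sem_row, ins_row):
            if ins in majority:
                row.append((majority[ins], ins))
            else:
                row.append((cls, 0))
        panoptic.append(row)
    return panoptic

semantic = [[1, 1, 2],
            [1, 2, 2]]
instance_ids = [[1, 1, 1],
                [0, 2, 2]]
panoptic = merge_by_majority_vote(semantic, instance_ids, thing_classes={1, 2})
```

In the toy input, instance 1 contains two pixels of class 1 and one of class 2, so the whole instance is labeled class 1.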
In cases where the inputs are extremely large and memory is constrained, we resort to a large fixed span in all our axial-attention blocks. Note that we do not consider the axial span as a hyper-parameter because this span is already sufficient to cover long range or even global context on several datasets, and setting a smaller span does not significantly reduce M-Adds.
Experimental Results
We conduct experiments on four large-scale datasets. We first report results with our Axial-ResNet on ImageNet. We then convert the ImageNet pretrained Axial-ResNet to Axial-DeepLab, and report results on COCO, Mapillary Vistas, and Cityscapes for panoptic segmentation, evaluated by panoptic quality (PQ). We also report average precision (AP) for instance segmentation, and mean IoU for semantic segmentation on Mapillary Vistas and Cityscapes. Our models are trained using TensorFlow on 128 TPU cores for ImageNet and 32 cores for panoptic segmentation.
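For readers unfamiliar with the metric, the sketch below computes panoptic quality for a single class using its standard definition: the sum of IoUs over matched segment pairs divided by the number of true positives plus half the false positives and half the false negatives. It assumes matching has already been done (a predicted and a ground-truth segment match when their IoU exceeds 0.5); the function name and the example numbers are illustrative only. The reported PQ averages this quantity over classes.

```python
def panoptic_quality(matched_ious, num_false_positives, num_false_negatives):
    """PQ for one class, given IoUs of matched (true-positive) segment pairs."""
    num_true_positives = len(matched_ious)
    denominator = (num_true_positives
                   + 0.5 * num_false_positives
                   + 0.5 * num_false_negatives)
    if denominator == 0:
        # No segments of this class in prediction or ground truth.
        return 0.0
    return sum(matched_ious) / denominator

# Two matched segments, one spurious prediction, one missed ground truth.
pq = panoptic_quality([0.8, 0.6], num_false_positives=1, num_false_negatives=1)
```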
Training protocol: On ImageNet, we adopt the same training protocol as the stand-alone self-attention baseline for a fair comparison, except that we use batch size 512 for Full Axial-ResNets and 1024 for all other models, with learning rates scaled accordingly.
For panoptic segmentation, we strictly follow Panoptic-DeepLab, except that we use a RAdam optimizer with Lookahead and linear warm-up (with the same learning rate of 0.001). All our results on panoptic segmentation use this setting. We note that this change does not improve the results, but smooths our training curves. Panoptic-DeepLab yields similar results in this setting.
For ImageNet, we build Axial-ResNet-L from ResNet-50. In detail, we fix the channel dimensions of the first stage after the ‘stem’ and double them whenever the spatial resolution is reduced by a factor of 2. Additionally, we multiply all the channels by 0.5, 0.75, and 2, resulting in Axial-ResNet-{S, M, XL}, respectively. Finally, Stand-Alone Axial-ResNets are further generated by replacing the ‘stem’ with three axial-attention blocks where the first block has stride 2. Due to the computational cost introduced by the early layers, we use a fixed axial span in all blocks of Stand-Alone Axial-ResNets. We always use the same number of heads. In order to avoid careful initialization of the attention projection matrices and positional encodings, we use batch normalizations in all attention layers.
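The width scaling described above can be written as a small helper. The multipliers 0.5, 0.75, and 2 come from the text (with L as the unscaled reference); the base channel count of 128 and the four stages used in the example are placeholder assumptions for illustration, not values taken from this section.

```python
# Multipliers from the text; "L" is the unscaled reference model.
WIDTH_MULTIPLIERS = {"S": 0.5, "M": 0.75, "L": 1.0, "XL": 2.0}

def stage_channels(base_channels, variant, num_stages=4):
    """Channels per stage: scaled by the variant, doubled at each downsampling.

    base_channels and num_stages are hypothetical inputs for this sketch.
    """
    scale = WIDTH_MULTIPLIERS[variant]
    return [int(base_channels * scale * 2 ** stage) for stage in range(num_stages)]

small = stage_channels(128, "S")
medium = stage_channels(128, "M")
xlarge = stage_channels(128, "XL")
```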
Tab. 1 summarizes our ImageNet results. The baselines ResNet-50 (as reproduced in prior work) and Conv-Stem + Attention are also listed. In the conv-stem setting, adding BN to the attention layers of the Conv-Stem + Attention baseline slightly improves the performance by 0.3%. Our proposed position-sensitive self-attention (Conv-Stem + PS-Attention) further improves the performance by 0.4% at the cost of marginal extra computation. Our Conv-Stem + Axial-Attention performs on par with Conv-Stem + Attention while being more parameter- and computation-efficient. When comparing with other full self-attention models, our Full Axial-Attention outperforms Full Attention by 0.5%, while being 1.44× more parameter-efficient and 1.09× more computation-efficient.
Following prior work, we experiment with different network widths (i.e., Axial-ResNets-{S,M,L,XL}), exploring the trade-off between accuracy, model parameters, and computational cost (in terms of M-Adds). As shown in Fig. 3, our proposed Conv-Stem + PS-Attention and Conv-Stem + Axial-Attention already outperform ResNet-50 and attention models (both Conv-Stem + Attention and Full Attention) at all settings. Our Full Axial-Attention further attains the best accuracy-parameter and accuracy-complexity trade-offs.
COCO
The ImageNet pretrained Axial-ResNet model variants (with different channels) are then converted to Axial-DeepLab model variants for panoptic segmentation tasks. We first demonstrate the effectiveness of our Axial-DeepLab on the challenging COCO dataset, which contains objects with various scales (from less than 32×32 to larger than 96×96).
Val set: In Tab. 2, we report our validation set results and compare with other bottom-up panoptic segmentation methods, since our method also belongs to the bottom-up family. As shown in the table, our single-scale Axial-DeepLab-S outperforms DeeperLab by 8% PQ, multi-scale SSAP by 5.3% PQ, and single-scale Panoptic-DeepLab by 2.1% PQ. Interestingly, our single-scale Axial-DeepLab-S also outperforms multi-scale Panoptic-DeepLab by 0.6% PQ while being 3.8× more parameter-efficient and 27× more computation-efficient (in M-Adds). Increasing the backbone capacity (via larger channels) continuously improves the performance. Specifically, our multi-scale Axial-DeepLab-L attains 43.9% PQ, outperforming Panoptic-DeepLab by 2.7% PQ.
Test-dev set: As shown in Tab. 3, our Axial-DeepLab variants show consistent improvements with larger backbones. Our multi-scale Axial-DeepLab-L attains the performance of 44.2% PQ, outperforming DeeperLab by 9.9% PQ, SSAP by 7.3% PQ, and Panoptic-DeepLab by 2.8% PQ, setting a new state-of-the-art among bottom-up approaches. We also list several top-performing methods adopting the top-down approaches in the table for reference.
Scale Stress Test: In order to verify that our model learns long range interactions, we perform a scale stress test besides standard testing. In the stress test, we train Panoptic-DeepLab (X-71) and our Axial-DeepLab-L with the standard setting, but test them on out-of-distribution resolutions (i.e., resize the input to different resolutions). Fig. 4 summarizes our relative improvements over Panoptic-DeepLab on PQ, PQ (thing) and PQ (stuff). When tested on huge images, Axial-DeepLab shows large gain (30%), demonstrating that it encodes long range relations better than convolutions. Besides, Axial-DeepLab improves 40% on small images, showing that axial-attention is more robust to scale variations.
Mapillary Vistas
We evaluate our Axial-DeepLab on the large-scale Mapillary Vistas dataset. We only report validation set results, since the test server is not available.
Val set: As shown in Tab. 4, our Axial-DeepLab-L outperforms all the state-of-the-art methods in both single-scale and multi-scale cases. Our single-scale Axial-DeepLab-L performs 2.4% PQ better than the previous best single-scale Panoptic-DeepLab (X-71). In the multi-scale setting, our lightweight Axial-DeepLab-L performs better than Panoptic-DeepLab (Auto-DeepLab-XL++), not only on panoptic segmentation (0.8% PQ) and instance segmentation (0.3% AP), but also on semantic segmentation (0.8% mIoU), the task that Auto-DeepLab was searched for. Additionally, to the best of our knowledge, our Axial-DeepLab-L attains the best single-model semantic segmentation result.
Cityscapes
Val set: In Tab. 5 (a), we report our Cityscapes validation set results. Without using extra data (i.e., only Cityscapes fine annotation), our Axial-DeepLab achieves 65.1% PQ, which is 1% better than the current best bottom-up Panoptic-DeepLab and 3.1% better than proposal-based AdaptIS. When using extra data (e.g., Mapillary Vistas), our multi-scale Axial-DeepLab-XL attains 68.5% PQ, 1.5% better than Panoptic-DeepLab and 3.5% better than Seamless. Our instance segmentation and semantic segmentation results are respectively 1.7% and 1.5% better than Panoptic-DeepLab.
Test set: Tab. 5 (b) shows our test set results. Without extra data, Axial-DeepLab-XL attains 62.8% PQ, setting a new state-of-the-art result. Our model further achieves 66.6% PQ, 39.6% AP, and 84.1% mIoU with Mapillary Vistas pretraining. Note that Panoptic-DeepLab adopts the trick of output stride 8 during inference on test set, making their M-Adds comparable to our XL models.
Ablation Studies
We perform ablation studies on Cityscapes validation set.
Importance of Position-Sensitivity and Axial-Attention: In Tab. 1, we experiment with attention models on ImageNet. In this ablation study, we transfer them to Cityscapes segmentation tasks. As shown in Tab. 6, all variants outperform ResNet-50. Position-sensitive attention performs better than previous self-attention, which aligns with the ImageNet results in Tab. 1. However, employing axial-attention, which is on par with position-sensitive attention on ImageNet, gives a boost of more than 1% on all three segmentation tasks (in PQ, AP, and mIoU), without ASPP, and with fewer parameters and M-Adds. This suggests that the ability of axial-attention to encode long range context significantly improves the performance on segmentation tasks with large input images.
Importance of Axial-Attention Span: In Tab. 7, we vary the span (i.e., spatial extent of local regions in an axial block), without ASPP. We observe that a larger span consistently improves the performance at marginal costs.
Conclusion and Discussion
In this work, we have shown the effectiveness of the proposed position-sensitive axial-attention on image classification and segmentation tasks. On ImageNet, our Axial-ResNet, formed by stacking axial-attention blocks, achieves state-of-the-art results among stand-alone self-attention models. We further convert Axial-ResNet to Axial-DeepLab for bottom-up segmentation tasks, and also show state-of-the-art performance on several benchmarks, including COCO, Mapillary Vistas, and Cityscapes. We hope our promising results could establish that axial-attention is an effective building block for modern computer vision models.
Our method bears a similarity to decoupled convolution, which factorizes a depthwise convolution into a column convolution and a row convolution. This operation could also theoretically achieve a large receptive field, but its convolutional template matching nature limits the capacity of modeling multi-scale interactions. Another related method is deformable convolution, where each point attends to a few points dynamically on an image. However, deformable convolution does not make use of key-dependent positional bias or content-based relation. In addition, axial-attention propagates information densely, and more efficiently, along the height- and width-axis sequentially.
Although our axial-attention model saves M-Adds, it runs slower than its convolutional counterparts, as also observed in prior work. This is due to the lack of specialized kernels on various accelerators for the time being. This might well be improved if the community considers axial-attention as a plausible direction.
Acknowledgments
We thank Niki Parmar for discussion and support; Ashish Vaswani, Xuhui Jia, Raviteja Vemulapalli, Zhuoran Shen for their insightful comments and suggestions; Maxwell Collins and Blake Hechtman for technical support. This work is supported by Google Faculty Research Award and NSF 1763705.
Appendix A Runtime
In this section, we profile our Conv-Stem Axial-ResNet-L in a common setting: 224×224 feed-forward with batch size 1, on a V100 GPU, averaged over 5 runs. The time includes input standardization and the last projection to 1000 logits. Our model takes 16.54 ms. For comparison, we list our TensorFlow runs of some popular models at hand (with comparable FLOPs). To provide more context, we take entries from a previously published benchmark for reference (a Titan X Pascal is used there, but the PyTorch code is more optimized). Our runtime is roughly at the same level as ResNeXt-101 (32x4d), SE-ResNet-101, ResNet-152, and DenseNet-201 (k=32).
Note that we directly benchmark with our code optimized for TPU execution, with channels being the last dimension. Empirically, the generated graph involves transposing between NCHW and NHWC, before and after almost every conv2d operation. (This effect also puts Xception-71 at a disadvantage because of its separable conv design.) Further optimizing this could lead to faster inference.
We observe that our Conv-Stem Axial-ResNet-L runs faster than Conv-Stem Stand-Alone-L, although we split one layer into two. This is because our axial-attention makes better use of existing kernels:
The width-axis attention is parallelizable over height-axis, i.e. this is a large batch of 1d row operations (the batch size is the height of the input).
Axial attention avoids extracting 2d memory blocks with pads, splits and concatenations, which are not efficient on accelerators.
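The two points above can be illustrated with a toy sketch in which width-axis attention is a batch of independent 1D row operations and height-axis attention reuses the same 1D routine on the transposed map, with no extraction of 2D blocks. The scalar attention used here (query, key, and value all equal to the input) is a stand-in chosen for brevity, and the function names are hypothetical.

```python
import math

def attend_1d(seq):
    """Toy content-based 1D self-attention on scalars (q = k = v = x)."""
    out = []
    for query in seq:
        logits = [query * key for key in seq]
        peak = max(logits)
        weights = [math.exp(l - peak) for l in logits]
        total = sum(weights)
        out.append(sum(wt * val for wt, val in zip(weights, seq)) / total)
    return out

def width_axis(feature_map):
    """Each row is an independent batch element of the 1D operation."""
    return [attend_1d(row) for row in feature_map]

def height_axis(feature_map):
    """Transpose, run the same row routine on columns, transpose back."""
    columns = [list(col) for col in zip(*feature_map)]
    mixed = [attend_1d(col) for col in columns]
    return [list(row) for row in zip(*mixed)]

fmap = [[1.0, 1.0],
        [2.0, 2.0]]
rows_mixed = width_axis(fmap)
cols_mixed = height_axis(fmap)
# Changing one row leaves the width-axis output of the other rows untouched.
other = width_axis([[1.0, 1.0], [5.0, -3.0]])
```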
Appendix B Axial-Decoder
Axial-DeepLab employs dual convolutional decoders. In this section, we explore a setting with a single axial-decoder instead. In the axial-decoder module, we apply one axial-attention block at each upsampling stage. In Fig. 5, we show an example axial-decoder in Axial-DeepLab-L from output stride 8 to output stride 4. We apply three such blocks, analogous to the three 5×5 convolutions in Panoptic-DeepLab.
Importance of Output Stride and Axial-Decoder: In Tab. 9, we experiment with the effect of output stride and axial-decoder (i.e., replacing dual decoders with axial-attention blocks). As shown in the table, our models are robust to output stride, and using axial-decoder is able to yield similar results. Our simple axial-decoder design works as well as dual convolutional decoders.
Appendix C COCO Visualization
In Fig. 6, we visualize some panoptic segmentation results on COCO val set. Our Axial-DeepLab-L demonstrates robustness to occlusion, compared with Panoptic-DeepLab (Xception-71).
In Fig. 7 and Fig. 8, we visualize the attention maps of our Axial-DeepLab-L on COCO val set. We visualize a low level block (stage 3 block 2) and a high level block (stage 4 block 3), which are respectively the first block and the last block with resolution 65×65, in the setting of output stride 16. We notice that in our multi-head axial-attention, some heads learn to focus on local details while some others focus on long range context. Additionally, we find that some heads are able to capture positional information and some others learn to correlate with semantic concepts.
In Fig. 9, we compare Axial-DeepLab with Panoptic-DeepLab, in terms of the three training loss functions defined in Panoptic-DeepLab. We observe that Axial-DeepLab is able to fit data better, especially on the offset prediction task. This also demonstrates the effectiveness of our position-sensitive attention design, and the long range modeling ability of axial-attention.
Appendix D Raw Data
In companion to Fig. 3 of the main paper where we compare parameters and M-Adds against accuracy on ImageNet classification, we also show the performance of our models in Tab. 10.
In companion to Fig. 4 of the main paper where we demonstrate the relative improvements of Axial-DeepLab-L over Panoptic-DeepLab (Xception-71) in our scale stress test on COCO, we also show the raw performance of both models in Fig. 10.