Pale Transformer: A General Vision Transformer Backbone with Pale-Shaped Attention

Sitong Wu, Tianyi Wu, Haoru Tan, Guodong Guo

Introduction

Inspired by the success of Transformer (Vaswani et al. 2017) on a wide range of tasks in natural language processing (NLP) (McCann et al. 2017; Howard and Ruder 2018), Vision Transformer (ViT) (Dosovitskiy et al. 2021) first employed a pure Transformer architecture for image classification, which shows the promising performance of Transformer architecture for vision tasks. However, the quadratic complexity of the global self-attention results in expensive computation costs and memory usage especially for high-resolution scenarios, making it unaffordable for applications in various vision tasks.

A typical way to improve the efficiency is to replace the global self-attention with local ones. A crucial and challenging issue is how to enhance the modeling capability under the local settings. For example, Swin (Liu et al. 2021) and Shuffle Transformer (Huang et al. 2021) proposed shifted window and shuffled window, respectively (Figure 1(b)), and alternately used two different window partitions (i.e., regular window and the proposed window) in consecutive blocks to build cross-window connections. MSG Transformer (Fang et al. 2021) manipulated the messenger tokens to exchange information across windows. Axial self-attention (Wang et al. 2020) treated the local attention region as a single row or column of the feature map (Figure 1(c)). CSWin (Dong et al. 2021) proposed cross-shaped window self-attention (Figure 1(d)), which can be regarded as a multiple row and column expansion of axial self-attention. Although these methods achieve excellent performance and are even superior to the CNN counterparts, the dependencies in each self-attention layer are not rich enough for capturing sufficient contextual information.

[Figure 1. Panels: (c) Axial Self-Attention; (d) Cross-Shaped Window Self-Attention; (e) Pale-Shaped Self-Attention (ours).]

In this work, we propose a Pale-Shaped self-Attention (PS-Attention) to capture richer contextual dependencies efficiently. Specifically, the input feature maps are first split spatially into multiple pale-shaped regions. Each pale-shaped region (abbreviated as pale) is composed of the same number of interlaced rows and columns of the feature map. The intervals between adjacent rows or columns are equal for all the pales. For example, the pink shadow in Figure 1(e) indicates one of the pales. Then, self-attention is performed within each pale. Any token can directly interact with the other tokens within the same pale, which endows our method with the capacity to capture richer contextual information in a single PS-Attention layer. To further improve the efficiency, we develop a more efficient parallel implementation of the PS-Attention. Benefiting from the larger receptive fields and stronger context modeling capability, our PS-Attention shows superiority over the existing local self-attention mechanisms illustrated in Figure 1.

Based on the proposed PS-Attention, we design a general vision transformer backbone with a hierarchical architecture, named Pale Transformer. We scale our approach up to get a series of models, including Pale-T (22M), Pale-S (48M), and Pale-B (85M), reaching significantly better performance than previous approaches. Our Pale-T achieves 83.4% Top-1 classification accuracy on ImageNet-1k, 50.4% single-scale mIoU on ADE20K (semantic segmentation), 47.4 box mAP (object detection) and 42.7 mask mAP (instance segmentation) on COCO, outperforming the state-of-the-art backbones by +0.7%, +1.1%, +0.7, and +0.5, respectively. Furthermore, our largest variant Pale-B is also superior to the previous methods, achieving 84.9% Top-1 accuracy on ImageNet-1K, 52.2% single-scale mIoU on ADE20K, 49.3 box mAP and 44.2 mask mAP on COCO.

Related Work

ViT (Dosovitskiy et al. 2021), which takes the input image as a sequence of patches, has paved a new way and shown promising performance for many vision tasks dominated by CNNs over the years. A line of previous Vision Transformer backbones mainly focused on the following two aspects to better adapt to vision tasks: (1) Enhancing the locality of Vision Transformers. (2) Seeking a better trade-off between performance and efficiency.

Different from CNNs, the original Transformer does not involve an inductive bias for local connections, which may lead to insufficient extraction of local structures, such as lines, edges, and color conjunctions. Many works are devoted to strengthening the local feature extraction of Vision Transformers. The earliest approach is to replace the single-scale architecture of ViT with a hierarchical one to obtain multi-scale features (Wang et al. 2021b). Such a design is followed by many later works (Liu et al. 2021; Huang et al. 2021; Yang et al. 2021; Dong et al. 2021). Another way is to combine CNNs and Transformers. Mobile-Former (Chen et al. 2021b), Conformer (Peng et al. 2021) and DS-Net (Mao et al. 2021) integrated the CNN and Transformer features through dual-branch structures. In contrast, Local ViT (Li et al. 2021b), CvT (Wu et al. 2021a) and Shuffle Transformer (Huang et al. 2021) only inserted several convolutions into some components of the Transformer. Besides, some works obtain richer features by fusing multiple branches with different scales (Chen, Fan, and Panda 2021) or cooperating with local attention (Han et al. 2021; Zhang et al. 2021; Chu et al. 2021a; Li et al. 2021a; Yuan et al. 2021b).

Efficient Vision Transformers

The mainstream research on improving the efficiency of Vision Transformer backbones is twofold: reducing redundant calculations via pruning strategies and designing more efficient self-attention mechanisms.

For pruning, the existing methods can be divided into three categories: (1) Token Pruning. DVT (Wang et al. 2021d) proposed a cascade Transformer architecture to adaptively adjust the number of tokens according to how hard the input image is to classify. Considering that tokens with irrelevant or even confusing information may be detrimental to image classification, some works proposed to locate discriminative regions and progressively drop less informative tokens by learnable sampling (Rao et al. 2021; Yue et al. 2021) and reinforcement learning (Pan et al. 2021) strategies. However, such unstructured sparsity results in incompatibility with dense prediction tasks. Some structure-preserving token selection strategies were implemented via token pooling (Chen et al. 2021a) and slow-fast updating (Xu et al. 2021). (2) Channel Pruning. VTP (Zhu et al. 2021a) presented a simple but effective framework to remove the redundant channels. (3) Attention Sharing. Based on the observation that attention maps from consecutive blocks are highly correlated, PSViT (Chen et al. 2021a) was proposed to reuse the attention calculation process between adjacent layers.

Considering that the quadratic computation complexity is caused by self-attention, many methods are committed to improving its efficiency while avoiding performance decay (Wang et al. 2021b; Zhu et al. 2021b; Liu et al. 2021; Huang et al. 2021). One way is to reduce the sequence length of key and value. PVT (Wang et al. 2021b) proposed a spatial reduction attention to downsample the scale of key and value before computing attention. Deformable attention (Zhu et al. 2021b) used a linear layer to select several keys from the full set, which can be regarded as a sparse version of global self-attention. However, excessive downsampling will lead to information confusion, and deformable attention relies heavily on a high-level feature map learned by a CNN and may not be directly usable on the original input image. Another way is to replace the global self-attention with local self-attention, which limits the range of each self-attention layer to a local region. As shown in Figure 1(b), the feature maps are first divided into several non-overlapping regular square windows (indicated with diverse colors), and the self-attention is performed within each window individually. The key challenge for the design of local self-attention mechanisms is to bridge the gap between local and global receptive fields. A typical manner is to build connections across regular square windows, for example, by alternately using the regular window and another newly designed window partition (shifted window (Liu et al. 2021) or shuffled window (Huang et al. 2021) in Figure 1(b)) in consecutive blocks, or by manipulating messenger tokens to exchange information across windows (Fang et al. 2021). Besides, axial self-attention (Wang et al. 2020) achieves longer-range dependencies in the horizontal and vertical directions respectively by performing self-attention in each single row or column of the feature map. CSWin (Dong et al. 2021) proposed a cross-shaped window self-attention region including multiple rows and columns. Although these existing local attention mechanisms can break through the local receptive fields to some extent, their dependencies are not rich enough to capture sufficient contextual information in a single self-attention layer, which limits the modeling capacity of the whole network.

The work most related to ours is CSWin (Dong et al. 2021), which developed a cross-shaped window self-attention mechanism for computing self-attention in horizontal and vertical stripes, while our proposed PS-Attention computes self-attention in pale-shaped regions. Moreover, the receptive field of each token in our method is much wider than in CSWin, which also endows our approach with stronger context modeling capacity.

Methodology

In this section, we first present our Pale-Shaped self-Attention (PS-Attention) and its efficient parallel implementation. Then, the composition of the Pale Transformer block is given. Finally, we describe the overall architecture and variant configurations of our Pale Transformer backbone.

To capture dependencies ranging from short-range to long-range, we propose Pale-Shaped self-Attention (PS-Attention), which computes self-attention within a pale-shaped region (abbreviated as pale). As shown in the pink shadow of Figure 1(e), one pale contains $s_{r}$ interlaced rows and $s_{c}$ interlaced columns, which covers a region containing $(s_{r}w+s_{c}h-s_{r}s_{c})$ tokens. We define $(s_{r},s_{c})$ as the pale size. Given an input feature map $X\in\mathcal{R}^{h\times w\times c}$, we first split it into multiple pales $\{P_{1},...,P_{N}\}$ with the same size $(s_{r},s_{c})$, where $P_{i}\in\mathcal{R}^{(s_{r}w+s_{c}h-s_{r}s_{c})\times c}$, $i\in\{1,2,...,N\}$. The number of pales is $N=\frac{h}{s_{r}}=\frac{w}{s_{c}}$, which can be ensured by a padding or interpolation operation. For all pales, the intervals between adjacent rows or columns are the same. The self-attention is then performed within each pale individually. As illustrated in Figure 1, the receptive field of PS-Attention is significantly wider and richer than those of all the previous local self-attention mechanisms, enabling more powerful context modeling capacity.
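The partition described above can be illustrated with a minimal NumPy sketch. It is not the authors' code: the function name `pale_mask` is hypothetical, and the sketch assumes that the $i$-th pale takes rows $i, i+N, i+2N, \dots$ and columns $i, i+N, \dots$, which is one way to realise "equal intervals" when $N=h/s_{r}=w/s_{c}$.

```python
import numpy as np

def pale_mask(h, w, s_r, s_c, i):
    """Boolean (h, w) mask of the i-th pale: s_r interlaced rows plus s_c interlaced columns.

    Assumption: rows/columns of one pale are spaced N = h / s_r = w / s_c apart.
    """
    assert h % s_r == 0 and w % s_c == 0, "pad the feature map first"
    n_r, n_c = h // s_r, w // s_c
    assert n_r == n_c, "the vanilla partition needs h / s_r == w / s_c"
    mask = np.zeros((h, w), dtype=bool)
    mask[i::n_r, :] = True  # rows i, i+N, i+2N, ... (s_r rows in total)
    mask[:, i::n_c] = True  # columns i, i+N, ...    (s_c columns in total)
    return mask

mask = pale_mask(h=8, w=8, s_r=2, s_c=2, i=0)
# Token count matches s_r*w + s_c*h - s_r*s_c = 16 + 16 - 4.
print(mask.sum())  # 28
```

The subtraction of $s_{r}s_{c}$ in the token count corresponds to the intersections of the selected rows and columns, which the mask counts only once.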

To further improve the efficiency, we decompose the vanilla PS-Attention mentioned above into row-wise and column-wise attention, which perform self-attention within row-wise and column-wise token groups, respectively. Specifically, as shown in Figure 2(c), we first divide the input feature $X\in\mathcal{R}^{h\times w\times c}$ into two independent parts $X_{r}\in\mathcal{R}^{h\times w\times\frac{c}{2}}$ and $X_{c}\in\mathcal{R}^{h\times w\times\frac{c}{2}}$ in the channel dimension, which are then split into multiple groups for row-wise and column-wise attention respectively:

$$X=[X_{r},X_{c}],\quad X_{r}=[X_{r}^{1},...,X_{r}^{N_{r}}],\quad X_{c}=[X_{c}^{1},...,X_{c}^{N_{c}}], \tag{1}$$

where $N_{r}=h/s_{r}$, $N_{c}=w/s_{c}$, $X_{r}^{i}\in\mathcal{R}^{s_{r}\times w\times\frac{c}{2}}$ contains $s_{r}$ interlaced rows, and $X_{c}^{j}\in\mathcal{R}^{h\times s_{c}\times\frac{c}{2}}$ contains $s_{c}$ interlaced columns.

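Then, the self-attention is conducted within each row-wise and column-wise token group, respectively. Similar to (Wu et al. 2021a), we use three separable convolution layers $\phi_{Q}$, $\phi_{K}$, and $\phi_{V}$ to generate the query, key, and value:

$$Y_{r}^{i}=\text{MSA}(\phi_{Q}(X_{r}^{i}),\phi_{K}(X_{r}^{i}),\phi_{V}(X_{r}^{i})),\quad Y_{c}^{j}=\text{MSA}(\phi_{Q}(X_{c}^{j}),\phi_{K}(X_{c}^{j}),\phi_{V}(X_{c}^{j})), \tag{2}$$

where $i\in\{1,2,...,N_{r}\}$, $j\in\{1,2,...,N_{c}\}$, and MSA indicates the Multi-head Self-Attention (Dosovitskiy et al. 2021).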

Finally, the outputs of row-wise and column-wise attention are concatenated along the channel dimension, resulting in the final output $Y\in\mathcal{R}^{h\times w\times c}$,

$$Y=\text{Concat}(Y_{r},Y_{c}), \tag{3}$$

where $Y_{r}=[Y_{r}^{1},...,Y_{r}^{N_{r}}]$ and $Y_{c}=[Y_{c}^{1},...,Y_{c}^{N_{c}}]$.
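The split, per-group attention, and concatenation steps can be sketched in NumPy as follows. This is an illustrative simplification rather than the authors' implementation: it uses single-head attention, replaces the separable convolutions that generate query, key, and value with identity mappings, and assumes that group $i$ holds rows (or columns) $i, i+N_{r}, i+2N_{r}, \dots$. The function names are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def group_attention(tokens):
    """Single-head self-attention inside each group; tokens has shape (groups, n, d).

    Assumption: identity Q/K/V projections stand in for the separable convolutions.
    """
    d = tokens.shape[-1]
    attn = softmax(tokens @ tokens.transpose(0, 2, 1) / np.sqrt(d))
    return attn @ tokens

def parallel_ps_attention(x, s_r, s_c):
    h, w, c = x.shape
    assert h % s_r == 0 and w % s_c == 0 and c % 2 == 0
    n_r, n_c, d = h // s_r, w // s_c, c // 2
    x_r, x_c = x[..., :d], x[..., d:]  # channel split into two halves

    # Row branch: group i holds rows i, i+n_r, ... (s_r interlaced rows, s_r*w tokens).
    g_r = x_r.reshape(s_r, n_r, w, d).transpose(1, 0, 2, 3).reshape(n_r, s_r * w, d)
    y_r = group_attention(g_r)
    y_r = y_r.reshape(n_r, s_r, w, d).transpose(1, 0, 2, 3).reshape(h, w, d)

    # Column branch: group j holds columns j, j+n_c, ... (s_c interlaced columns).
    g_c = x_c.reshape(h, s_c, n_c, d).transpose(2, 0, 1, 3).reshape(n_c, h * s_c, d)
    y_c = group_attention(g_c)
    y_c = y_c.reshape(n_c, h, s_c, d).transpose(1, 2, 0, 3).reshape(h, w, d)

    # Concatenate the two branches along the channel dimension.
    return np.concatenate([y_r, y_c], axis=-1)

x = np.random.default_rng(0).standard_normal((4, 6, 4))
print(parallel_ps_attention(x, s_r=2, s_c=3).shape)  # (4, 6, 4)
```

Because each branch only reshapes and transposes before a batched matrix product, all groups are processed in one call, and the two branches are independent of each other.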

Compared to the vanilla implementation of PS-Attention within the whole pale, such a parallel mechanism has a lower computation complexity. Furthermore, the padding operation only needs to ensure that $h$ is divisible by $s_{r}$ and $w$ is divisible by $s_{c}$, rather than $\frac{h}{s_{r}}=\frac{w}{s_{c}}$. Therefore, it also helps to avoid excessive padding.
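The difference between the two padding constraints can be made concrete with a small sketch. The helper names and the $50\times 84$ feature map are hypothetical and chosen only for illustration; padding is assumed to round each side up to the nearest admissible size.

```python
import math

def pad_parallel(h, w, s_r, s_c):
    """Padded (h, w) such that h % s_r == 0 and w % s_c == 0."""
    return math.ceil(h / s_r) * s_r, math.ceil(w / s_c) * s_c

def pad_vanilla(h, w, s_r, s_c):
    """Padded (h, w) such that h / s_r == w / s_c == N, the number of pales."""
    n = max(math.ceil(h / s_r), math.ceil(w / s_c))
    return n * s_r, n * s_c

# A non-square feature map with pale size (7, 7).
print(pad_parallel(50, 84, 7, 7))  # (56, 84)
print(pad_vanilla(50, 84, 7, 7))   # (84, 84)
```

In this example the parallel form pads 6 rows, whereas the vanilla form has to pad 34 rows to make the map square in units of the pale size.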

Complexity Analysis.

Given the input feature of size $h\times w\times c$ and pale size $(s_{r},s_{c})$, the standard global self-attention has a computational complexity of

$$\mathcal{O}_{Global}=4hwc^{2}+2c(hw)^{2}, \tag{4}$$

whereas our proposed PS-Attention under the parallel implementation has a computational complexity of

$$\mathcal{O}_{Pale}=hwc(4c+s_{c}h+s_{r}w+27), \tag{5}$$

which clearly alleviates the computation and memory burden compared with the global one, since $2hw\gg(s_{c}h+s_{r}w+27)$ holds whenever the pale size is much smaller than the feature map. The detailed derivations of Eq. (4) and Eq. (5) are provided in the supplementary material.
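As a rough numerical check, the sketch below evaluates both costs, assuming the expressions $4hwc^{2}+2c(hw)^{2}$ for global self-attention and $hwc(4c+s_{c}h+s_{r}w+27)$ for parallel PS-Attention, which are consistent with the comparison between $2hw$ and $(s_{c}h+s_{r}w+27)$ stated above. The function names and the example sizes ($56\times 56$ map, 64 channels) are illustrative assumptions, not values taken from the paper's tables.

```python
def global_attention_cost(h, w, c):
    """Assumed cost of global self-attention: 4hwc^2 + 2c(hw)^2."""
    return 4 * h * w * c**2 + 2 * c * (h * w) ** 2

def pale_attention_cost(h, w, c, s_r, s_c):
    """Assumed cost of parallel PS-Attention: hwc(4c + s_c*h + s_r*w + 27)."""
    return h * w * c * (4 * c + s_c * h + s_r * w + 27)

h = w = 56  # hypothetical feature-map size
c = 64      # hypothetical channel count
g = global_attention_cost(h, w, c)
p = pale_attention_cost(h, w, c, s_r=7, s_c=7)
print(f"global: {g:,}  pale: {p:,}  ratio: {g / p:.2f}")
```

The gap shrinks as the feature map approaches the pale size: for a $7\times 7$ map with $s_{r}=s_{c}=7$, the term $s_{c}h+s_{r}w+27$ is no longer small compared with $2hw$.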

Pale Transformer Block

As shown in Figure 2(b), our Pale Transformer block consists of three sequential parts: the conditional position encoding (CPE) for dynamically generating the positional embedding, the proposed PS-Attention module for capturing contextual information, and the MLP module for feature projection. The forward pass of the $l$-th block can be formulated as follows:
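A plausible form of these three steps is sketched below. It assumes that each part is wrapped in a residual connection and that layer normalization is applied before the PS-Attention and MLP modules (a pre-norm layout); both are assumptions rather than details stated in the text, and the symbols $\tilde{X}^{l}$ and $\hat{X}^{l}$ are introduced here for the intermediate features.

```latex
\begin{aligned}
\tilde{X}^{l} &= X^{l-1} + \mathrm{CPE}(X^{l-1}), && (6)\\
\hat{X}^{l}   &= \tilde{X}^{l} + \text{PS-Attention}\big(\mathrm{LN}(\tilde{X}^{l})\big), && (7)\\
X^{l}         &= \hat{X}^{l} + \mathrm{MLP}\big(\mathrm{LN}(\hat{X}^{l})\big). && (8)
\end{aligned}
```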

where $\mathrm{LN}(\cdot)$ refers to layer normalization (Ba, Kiros, and Hinton 2016). The CPE (Chu et al. 2021b) is implemented as a simple depth-wise convolution, which is widely used in previous works (Wu et al. 2021b; Chu et al. 2021a) for its compatibility with an arbitrary size of input. The PS-Attention module defined in Eq. (7) is constructed by sequentially performing Eq. (1) to Eq. (3). The MLP module defined in Eq. (8) consists of two linear projection layers to expand and contract the embedding dimension sequentially, which is the same as (Dosovitskiy et al. 2021) for fair comparisons.

Overall Architecture and Variants

As illustrated in Figure 2(a), the Pale Transformer consists of four hierarchical stages for capturing multi-scale features, following the popular design in CNNs (He et al. 2016) and Transformers (Liu et al. 2021; Dong et al. 2021). Each stage contains a patch merging layer and multiple Pale Transformer blocks. The patch merging layer spatially downsamples the input features by a certain ratio and doubles the channel dimension for a better representation capacity. For fair comparisons, we use overlapping convolution for patch merging, the same as (Wu et al. 2021a; Dong et al. 2021). Specifically, the spatial downsampling ratio is set to 4 for the first stage and 2 for the last three stages, implemented by a $7\times 7$ convolution with stride 4 and a $3\times 3$ convolution with stride 2, respectively. The outputs of the patch merging layer are fed into the subsequent Pale Transformer blocks, with the number of tokens kept constant. Following (Liu et al. 2021; Dong et al. 2021), we simply apply an average pooling operation on top of the last block to obtain a representative token for the final classification head, which is composed of a single linear projection layer.
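The spatial resolution at each stage follows from the standard convolution output-size formula. The sketch below assumes paddings of 2 for the $7\times 7$ stride-4 convolution and 1 for the $3\times 3$ stride-2 convolutions; these padding values are not given in the text and are chosen so that the sizes divide evenly. The function names are hypothetical.

```python
def conv_out(size, kernel, stride, padding):
    """Output length of a convolution along one spatial axis."""
    return (size + 2 * padding - kernel) // stride + 1

def stage_resolutions(size=224):
    # Stage 1: 7x7 convolution, stride 4 (padding 2 is an assumption).
    sizes = [conv_out(size, kernel=7, stride=4, padding=2)]
    # Stages 2-4: 3x3 convolution, stride 2 (padding 1 is an assumption).
    for _ in range(3):
        sizes.append(conv_out(sizes[-1], kernel=3, stride=2, padding=1))
    return sizes

print(stage_resolutions(224))  # [56, 28, 14, 7]
```

Under these assumptions every stage resolution for a $224\times 224$ input is a multiple of 7, so a pale size of 7 needs no padding during classification.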

The definitions of model hyper-parameters for the $i$-th stage are listed below:

$P_{i}$: the spatial reduction factor for the patch merging layer,

$C_{i}$: the embedding dimension of tokens,

$S_{i}$: the pale size for the PS-Attention,

$H_{i}$: the head number for the PS-Attention,

$R_{i}$: the expansion ratio for the MLP module.

By varying the hyper-parameters $H_{i}$ and $C_{i}$ in each stage, we design three variants of our Pale Transformer, named Pale-T (Tiny), Pale-S (Small), and Pale-B (Base), respectively. Table 1 shows the detailed configurations of all variants. Note that all variants have the same depth in the four stages (see Table 1). In each stage of these variants, we set the pale size $s_{r}=s_{c}=S_{i}=7$, and use the same MLP expansion ratio of $R_{i}=4$. Thus, the main differences among Pale-T, Pale-S, and Pale-B lie in the embedding dimension of tokens and the head number for the PS-Attention in the four stages, i.e., variants vary from narrow to wide.

Experiments

We first compare our Pale Transformer with the state-of-the-art Transformer backbones on ImageNet-1K (Russakovsky et al. 2015) for image classification. To further demonstrate the effectiveness and generalization of our backbone, we conduct experiments on ADE20K (Zhou et al. 2019) for semantic segmentation (Wu et al. 2021b, 2020; Zhang et al. 2019; Wu et al. 2021c), and COCO (Lin et al. 2014) for object detection & instance segmentation. Finally, we dig into the design of key components of our Pale Transformer to better understand the method.

All the variants are trained from scratch for 300 epochs on 8 V100 GPUs with a total batch size of 1024. Both the training and evaluation are conducted with the input size of $224\times 224$ on the ImageNet-1K dataset. Detailed configurations are provided in the supplementary material.

Results.

Semantic Segmentation on ADE20K

To demonstrate the superiority of our Pale Transformer for dense prediction tasks, we conduct experiments on ADE20K with the widely-used UperNet (Xiao et al. 2018) as decoder for fair comparisons to other backbones. Detailed settings are described in the supplementary material.

Results.

Table 4 shows the comparisons of UperNet with various excellent Transformer backbones on ADE20K validation set. We report both the single-scale (SS) and multi-scale (MS) mIoU for better comparison. Our Pale variants are consistently superior to the state-of-the-art method by a large margin. Specifically, our Pale-T and Pale-S outperform the state-of-the-art CSWin by +1.1% and +1.2% SS mIoU, respectively. Besides, our Pale-B achieves 52.5%/53.0% SS/MS mIoU, surpassing the previous best by +1.3% and +1.2%, respectively. These results demonstrate the stronger context modeling capacity of our Pale Transformer for dense prediction tasks.

Object Detection and Instance Segmentation on COCO

We evaluate the performance of our Pale Transformer backbone on COCO benchmark for object detection and instance segmentation, utilizing Mask R-CNN (He et al. 2017) framework under 1x schedule (12 training epochs). Details can be found in the supplementary material.

Results.

As shown in Table 3, our Pale-T, Pale-S, and Pale-B achieve 47.4, 48.4, and 49.2 box mAP for object detection, surpassing the previous best CSWin Transformer by +0.7, +0.5, and +0.6, respectively. Besides, our variants also show consistent improvements on instance segmentation, with +0.5, +0.5, and +0.3 higher mask mAP than the previous best backbone.

Ablation Study

We conduct ablation studies for the key designs of our Pale Transformer on image classification and downstream tasks. All the experiments are performed with the Tiny variant under the same training settings as mentioned above. We also analyze the influence of position encoding in the supplementary material.

The pale sizes of the four stages $\{S_{1},S_{2},S_{3},S_{4}\}$ control the trade-off between the richness of contextual information and computation costs. As shown in Table 7, increasing the pale size from 1 to 7 continuously improves performance across all tasks, while further increasing it to 9 brings no clear or consistent improvement but more FLOPs. Therefore, we use $S_{i}=7$, $i\in\{1,2,3,4\}$, for all the tasks by default.

Comparisons with Different Implementations of PS-Attention.

We compare three implementations of our PS-Attention. The vanilla PS-Attention directly conducts self-attention within the whole pale region, and it can be approximated by two more efficient implementations, sequential and parallel. The sequential one computes self-attention in row and column directions alternately in consecutive blocks, while the parallel one performs row-wise and column-wise attention in parallel within each block. As shown in Table 8, the parallel PS-Attention achieves the best results on all tasks, even slightly better than the vanilla one by +0.3/+0.4 box/mask mAP on COCO. We attribute this to the excessive padding that the vanilla PS-Attention requires for non-square inputs, which results in slight performance degradation.

Comparisons with other Axial-based Attentions.

To compare our PS-Attention directly with the most related axial-based self-attention mechanisms, we replace the PS-Attention of our Pale-T with the axial self-attention (Wang et al. 2020) and the cross-shaped window self-attention (Dong et al. 2021), respectively. As shown in Table 8, our PS-Attention clearly outperforms both mechanisms.

Conclusion

This work presented a new effective and efficient self-attention mechanism, named Pale-Shaped self-Attention (PS-Attention), which performs self-attention in a pale-shaped region. PS-Attention can model richer contextual dependencies than the previous local self-attention mechanisms. In order to further improve its efficiency, we designed a parallel implementation for PS-Attention, which decomposes the self-attention within the whole pale into row-wise and column-wise attention. It is also conducive to avoiding excessive padding operations. Based on the proposed PS-Attention, we developed a general Vision Transformer backbone, called Pale Transformer, which can achieve state-of-the-art performance on ImageNet-1K for image classification. Furthermore, our Pale Transformer is superior to the previous Vision Transformer backbones on ADE20K for semantic segmentation, and COCO for object detection & instance segmentation.

References

Appendix A Appendix

In this appendix, we first provide the detailed experimental settings for classification, semantic segmentation, object detection, and instance segmentation, respectively. Then, we study the effect of the position encoding method of our Pale Transformer, and provide more detailed comparisons of the ablation studies in the body of our paper in terms of the model size and computation costs. Finally, the detailed derivations of computation complexity for the global self-attention and our PS-Attention are given.

Appendix B Detailed Experimental Settings

We follow most of the settings in DeiT (Touvron et al. 2021), Swin (Liu et al. 2021) and CSWin (Dong et al. 2021) for fair comparisons. In detail, we use the AdamW (Loshchilov and Hutter 2019) optimizer with a weight decay of 0.05. The initial learning rate is set to 1e-3 and progressively decays after each iteration by a cosine schedule. The linear warmup takes up 20 epochs. We use random horizontal flipping (Szegedy et al. 2015), color jitter, Mixup (Zhang et al. 2018), CutMix (Yun et al. 2019) and AutoAugment (Cubuk et al. 2019) as data augmentation. We also adopt some common regularizations, such as Label-Smoothing (Szegedy et al. 2016) and stochastic depth (Huang et al. 2016). The maximal stochastic depth rate is set to 0.1, 0.3, and 0.5 for Pale-T, Pale-S, and Pale-B, respectively. All the variants are trained from scratch for 300 epochs on 8 V100 GPUs with the input size of $224\times 224$ and a total batch size of 1024. During the evaluation, the images are first resized to $256\times 256$ and then center-cropped to $224\times 224$.
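The learning-rate schedule described above (linear warmup for 20 epochs, then per-iteration cosine decay from 1e-3 over 300 epochs) can be sketched as follows. The function name `lr_at` is hypothetical, and two details are assumptions because the text does not state them: the warmup ramps linearly from roughly zero, and the cosine decays to zero. The value of about 1251 iterations per epoch corresponds to the ImageNet-1K training set at batch size 1024.

```python
import math

def lr_at(step, steps_per_epoch, base_lr=1e-3, warmup_epochs=20, total_epochs=300):
    """Learning rate at a given iteration (0-indexed)."""
    warmup_steps = warmup_epochs * steps_per_epoch
    total_steps = total_epochs * steps_per_epoch
    if step < warmup_steps:
        # Assumed: linear ramp that reaches base_lr at the end of warmup.
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    # Assumed: cosine decay down to zero.
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))

steps_per_epoch = 1251  # approximately 1,281,167 images / 1024
for epoch in (0, 20, 160, 299):
    print(epoch, lr_at(epoch * steps_per_epoch, steps_per_epoch))
```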

Semantic Segmentation on ADE20K

We conduct experiments on the widely-used and challenging ADE20K (Zhou et al. 2019) scene parsing dataset, which contains 20210, 2000, and 3352 images for training, validation, and testing, respectively, with 150 fine-grained object categories. For fair comparisons, we use our ImageNet-1K pretrained Pale Transformer as the backbone and UperNet as the decoder, and follow the same training settings as (Liu et al. 2021). Specifically, all the models are trained for a total of 160k iterations with a batch size of 16. The AdamW (Loshchilov and Hutter 2019) optimizer with weight decay 0.01 is used. The initial learning rate is set to 6e-5 and decays with a polynomial scheduler after a 1500-iteration warmup. The stochastic depth rate is set to 0.3, 0.3, and 0.5 for Pale-T, Pale-S, and Pale-B, respectively. Both single-scale and multi-scale inference are reported for performance comparison. For multi-scale inference, the scale factors vary from 0.75 to 1.75 with an interval of 0.25. Auxiliary losses are added to the output of stage 3 of the backbone with factor 0.4, which is the same as the previous works (Liu et al. 2021; Dong et al. 2021) for fair comparisons. For data augmentation during training, we follow the default configurations of mmsegmentation, such as random crop, random flipping, random rescaling (with ratio range from 0.5 to 2.0), and random photometric distortion.
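A minimal sketch of the segmentation schedule and the multi-scale factors is given below. The function name `poly_lr` is hypothetical; the polynomial power of 1.0, the linear shape of the warmup, and the decay to zero are assumptions, since the text only specifies the initial rate, the warmup length, and the total number of iterations.

```python
def poly_lr(it, base_lr=6e-5, warmup_iters=1500, total_iters=160_000, power=1.0):
    """Learning rate at iteration `it` (0-indexed)."""
    if it < warmup_iters:
        # Assumed: linear warmup that reaches base_lr after 1500 iterations.
        return base_lr * (it + 1) / warmup_iters
    frac = (it - warmup_iters) / (total_iters - warmup_iters)
    # Assumed: polynomial decay to zero with the given power.
    return base_lr * (1.0 - frac) ** power

# Multi-scale inference factors: 0.75 to 1.75 in steps of 0.25.
scales = [0.75 + 0.25 * k for k in range(5)]
print(scales)  # [0.75, 1.0, 1.25, 1.5, 1.75]
```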

Object Detection & Instance Segmentation on COCO

We compare the performance of our Pale Transformer backbone on the COCO benchmark for object detection and instance segmentation, with the typical Mask R-CNN (He et al. 2017) framework. We follow the same training strategies as (Dong et al. 2021). In detail, we train and evaluate our Pale Transformer under the normal 1x schedule, and all the models are trained for 12 epochs on 8 GPUs with a total batch size of 16 and single-scale input (the shorter side is resized to 800 and the longer side is no more than 1333). We use AdamW (Loshchilov and Hutter 2019) as the optimizer with a weight decay of 0.001 for Pale-T and Pale-S and 0.05 for Pale-B. For all models, the learning rate is set to 0.0001 initially and decays at epochs 8 and 11 with a ratio of 0.1. We set the stochastic depth rate to 0.2, 0.3, and 0.5 for Pale-T, Pale-S, and Pale-B, respectively. FLOPs are compared under the input size of $1280\times 800$.
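The input resizing rule and the step schedule can be sketched as below. Both helper names are hypothetical. The sketch assumes that the aspect ratio is preserved when resizing, that output sizes are rounded to the nearest integer, and that "epoch 8 and 11" means the rate is reduced once 8 and then 11 epochs have been completed; detection frameworks may differ in these details.

```python
def resize_shape(h, w, short=800, long_max=1333):
    """Aspect-preserving resize: shorter side to 800 unless the longer side would exceed 1333."""
    scale = min(short / min(h, w), long_max / max(h, w))
    return round(h * scale), round(w * scale)

def step_lr(epoch, base_lr=1e-4, milestones=(8, 11), gamma=0.1):
    """Learning rate after `epoch` completed epochs (assumed 0-indexed milestones)."""
    return base_lr * gamma ** sum(epoch >= m for m in milestones)

print(resize_shape(480, 640))    # shorter side limits the scale
print(resize_shape(400, 1000))   # longer side limits the scale
print([step_lr(e) for e in (0, 8, 11)])
```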

Appendix C Further Ablation Study

In this section, we first provide the complete version of Table 5 and Table 6 in the body of our paper, including the parameters and FLOPs comparisons, shown in Table 7 and Table 8. Then, we analyze the effect of different position encoding methods for our Pale Transformer backbone.

The position encoding plays an important role in Transformers, as it can introduce the spatial location awareness for feature aggregation of self-attention. Here, we compare several widely-used position encoding methods, e.g., no position encoding (no pos.), absolute position encoding (APE) (Dosovitskiy et al. 2021) and conditional position encoding (CPE) (Chu et al. 2021b). As shown in Table 9, CPE performs best. Not using any position encoding will cause serious performance degradation, which demonstrates the effectiveness of the position encoding in Vision Transformer models.

Appendix D Derivations of the Computational Complexity

In this section, we derive the computation complexity of the global self-attention and our PS-Attention in detail.

Suppose that the input feature map has size $h\times w\times c$. The global self-attention (Dosovitskiy et al. 2021) has three parts. Firstly, the input feature $X\in\mathcal{R}^{h\times w\times c}$ is sent into three independent linear layers to generate the query $Q\in\mathcal{R}^{h\times w\times c}$, key $K\in\mathcal{R}^{h\times w\times c}$, and value $V\in\mathcal{R}^{h\times w\times c}$, respectively. Thus, the computational complexity of the generation of $Q$, $K$, and $V$ is

$$\mathcal{O}_{QKV}=3hwc^{2}.$$

Secondly, the attention map $A$ is computed by $\text{softmax}(QK^{T}/\sqrt{d})$. Then, the aggregated feature is obtained by the matrix multiplication between the normalized attention map $A$ and the value $V$. The computational complexity of these two processes is

$$\mathcal{O}_{Attn}=(hw)^{2}c+(hw)^{2}c=2c(hw)^{2}.$$

Finally, the aggregated feature generally also needs to pass through a linear projection layer, with the complexity of

$$\mathcal{O}_{Proj}=hwc^{2}.$$

Thus, the overall computational complexity of the global self-attention is

$$\mathcal{O}_{Global}=\mathcal{O}_{QKV}+\mathcal{O}_{Attn}+\mathcal{O}_{Proj}=4hwc^{2}+2c(hw)^{2}.$$

Computational Complexity of Our PS-Attention

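Similarly, given the input feature of size $h\times w\times c$ and pale size $(s_{r},s_{c})$, our parallel PS-Attention also contains three processes. Firstly, three individual $3\times 3$ separable convolutions are used to generate the query $Q\in\mathcal{R}^{h\times w\times c}$, key $K\in\mathcal{R}^{h\times w\times c}$, and value $V\in\mathcal{R}^{h\times w\times c}$, respectively. Each separable convolution costs $9hwc$ for its $3\times 3$ depth-wise part and $hwc^{2}$ for its point-wise part, giving the complexity of

$$\mathcal{O}_{QKV}=3hwc^{2}+27hwc.$$

Secondly, we decompose the self-attention within the whole pale region into row-wise and column-wise self-attention. The row-wise branch has $N_{r}=h/s_{r}$ groups of $s_{r}w$ tokens, the column-wise branch has $N_{c}=w/s_{c}$ groups of $s_{c}h$ tokens, and each branch operates on $\frac{c}{2}$ channels. The computational complexities of these two parallel branches are as follows

$$\mathcal{O}_{Row}=2(s_{r}w)^{2}\cdot\frac{c}{2}\cdot N_{r}=hwc\cdot s_{r}w,\qquad \mathcal{O}_{Col}=2(s_{c}h)^{2}\cdot\frac{c}{2}\cdot N_{c}=hwc\cdot s_{c}h.$$

Finally, the linear projection layer has the complexity of

$$\mathcal{O}_{Proj}=hwc^{2}.$$

Therefore, the overall complexity of our parallel PS-Attention is

$$\mathcal{O}_{Pale}=\mathcal{O}_{QKV}+\mathcal{O}_{Row}+\mathcal{O}_{Col}+\mathcal{O}_{Proj}=hwc(4c+s_{c}h+s_{r}w+27).$$

Compared with the global self-attention, our parallel PS-Attention has lower complexity, since $2hw\gg(s_{c}h+s_{r}w+27)$ holds whenever the pale size is much smaller than the feature map.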