CSWin Transformer: A General Vision Transformer Backbone with Cross-Shaped Windows
Xiaoyi Dong, Jianmin Bao, Dongdong Chen, Weiming Zhang, Nenghai Yu, Lu Yuan, Dong Chen, Baining Guo
Introduction
Transformer-based architectures have recently achieved competitive performances compared to their CNN counterparts in various vision tasks. By leveraging the multi-head self-attention mechanism, these vision Transformers demonstrate a high capability in modeling the long-range dependencies, which is especially helpful for handling high-resolution inputs in downstream tasks, e.g., object detection and segmentation. Despite the success, the Transformer architecture with full-attention mechanism is computationally inefficient.
To improve the efficiency, one typical way is to limit the attention region of each token from full-attention to local/windowed attention. To bridge the connection between windows, researchers further proposed halo and shift operations to exchange information through nearby windows. However, the receptive field is enlarged quite slowly, and a great number of blocks must be stacked to achieve global self-attention. A sufficiently large receptive field is crucial to the performance, especially for downstream tasks (e.g., object detection and segmentation). Therefore, it is important to achieve a large receptive field efficiently while keeping the computation cost low.
In this paper, we present the Cross-Shaped Window (CSWin) self-attention, which is illustrated in Figure 1 and compared with existing self-attention mechanisms. With CSWin self-attention, we perform the self-attention calculation in the horizontal and vertical stripes in parallel, with each stripe obtained by splitting the input feature into stripes of equal width. This stripe width is an important parameter of the cross-shaped window because it allows us to achieve strong modelling capability while limiting the computation cost. Specifically, we adjust the stripe width according to the depth of the network: small widths for shallow layers and larger widths for deep layers. A larger stripe width encourages a stronger connection between long-range elements and achieves better network capacity with a small increase in computation cost. We will provide a mathematical analysis of how the stripe width affects the modeling capability and computation cost.
It is worthwhile to note that with the CSWin self-attention mechanism, the self-attention in horizontal and vertical stripes is calculated in parallel. We split the multi-heads into parallel groups and apply different self-attention operations to different groups. This parallel strategy introduces no extra computation cost while enlarging the area for computing self-attention within each Transformer block. This strategy is fundamentally different from existing self-attention mechanisms that apply the same attention operation across multi-heads (Figure 1 b,c,d,e) and perform different attention operations sequentially (Figure 1 c,e). We will show through ablation analysis that this difference makes CSWin self-attention much more effective for general vision tasks.
Based on the CSWin self-attention mechanism, we follow the hierarchical design and propose a new vision Transformer architecture named “CSWin Transformer” for general-purpose vision tasks. This architecture provides significantly stronger modeling power while limiting computation cost. To further enhance this vision Transformer, we introduce an effective positional encoding, Locally-enhanced Positional Encoding (LePE), which is especially effective for downstream tasks with varying input resolutions, such as object detection and segmentation. Compared with previous positional encoding methods, our LePE imposes the positional information within each Transformer block and directly operates on the attention results instead of the attention calculation. The LePE makes CSWin Transformer more effective and friendly for the downstream tasks.
As a general vision Transformer backbone, the CSWin Transformer demonstrates strong performance on image classification, object detection and semantic segmentation tasks. Under similar FLOPs and model size, CSWin Transformer variants significantly outperform previous state-of-the-art (SOTA) vision Transformers. For example, our base variant CSWin-B achieves 85.4% Top-1 accuracy on ImageNet-1K without any extra training data or labels, 53.9 box AP and 46.4 mask AP on the COCO detection task, and 51.7 mIoU on the ADE20K semantic segmentation task, surpassing the previous state-of-the-art Swin Transformer counterpart by +1.2, +2.0, +1.4 and +2.0 respectively. Under a smaller FLOPs setting, our tiny variant CSWin-T shows even larger performance gains, i.e., +1.4 points on ImageNet classification, +3.0 box AP and +2.0 mask AP on COCO detection, and +4.6 on ADE20K segmentation. Furthermore, when pretraining CSWin Transformer on the larger dataset ImageNet-21K, we achieve 87.5% Top-1 accuracy on ImageNet-1K and high segmentation performance on ADE20K with 55.7 mIoU.
Related Work
Vision Transformers. Convolutional neural networks (CNNs) have dominated the computer vision field for many years and achieved tremendous success. Recently, the pioneering work ViT demonstrated that pure Transformer-based architectures can also achieve very competitive results, indicating the potential of handling vision tasks and natural language processing (NLP) tasks under a unified framework. Built upon the success of ViT, many efforts have been devoted to designing better Transformer-based architectures for various vision tasks, including low-level image processing, image classification, object detection and semantic segmentation. Rather than concentrating on one special task, some recent works try to design a general vision Transformer backbone for general-purpose vision tasks. They all follow the hierarchical Transformer architecture but adopt different self-attention mechanisms. The main benefit of the hierarchical design is to utilize multi-scale features and reduce the computation complexity by progressively decreasing the number of tokens. In this paper, we propose a new hierarchical vision Transformer backbone by introducing cross-shaped window self-attention and locally-enhanced positional encoding.
Efficient Self-attentions. In the NLP field, many efficient attention mechanisms have been designed to improve the Transformer efficiency for handling long sequences. Since the image resolution is often very high in vision tasks, designing efficient self-attention mechanisms is also crucial. However, many existing vision Transformers still adopt the original full self-attention, whose computation complexity is quadratic in the image size. To reduce the complexity, recent vision Transformers adopt the local self-attention mechanism and its shifted/haloed versions to add interaction across different local windows. Besides, axial self-attention and criss-cross attention propose calculating attention within stripe windows along the horizontal and/or vertical axis. While the performance of axial attention is limited by its sequential mechanism and restricted window size, criss-cross attention is inefficient in practice due to its overlapped window design and ineffective due to its restricted window size. These are the works most closely related to our CSWin, which can be viewed as a more general and efficient form of them.
Positional Encoding. Since self-attention is permutation-invariant and ignores the token positional information, positional encoding is widely used in Transformers to add such positional information back. Typical positional encoding mechanisms include absolute positional encoding (APE), relative positional encoding (RPE) and conditional positional encoding (CPE). APE and RPE are often defined as sinusoidal functions of a series of frequencies or as learnable parameters, which are designed for a specific input size and are not friendly to varying input resolutions. CPE takes the feature as input and can generate the positional encoding for arbitrary input resolutions. The generated positional encoding is then added onto the input feature. Our LePE shares a similar spirit with CPE, but adds the positional encoding as a parallel module to the self-attention operation and operates on the projected values in each Transformer block. This design decouples positional encoding from the self-attention calculation and can enforce a stronger local inductive bias.
Method
The overall architecture of CSWin Transformer is illustrated in Figure 2. For an input image with size of $H \times W \times 3$, we follow prior work and leverage the overlapped convolutional token embedding ($7 \times 7$ convolution layer with stride 4) to obtain $\frac{H}{4} \times \frac{W}{4}$ patch tokens, and the dimension of each token is $C$. To produce a hierarchical representation, the whole network consists of four stages. A convolution layer ($3 \times 3$, stride 2) is used between two adjacent stages to reduce the number of tokens and double the channel dimension. Therefore, the constructed feature maps have $\frac{H}{2^{i+1}} \times \frac{W}{2^{i+1}}$ tokens for the $i$-th stage, which is similar to traditional CNN backbones like VGG/ResNet. Each stage consists of sequential CSWin Transformer Blocks and maintains the number of tokens. The CSWin Transformer Block has a topology similar to that of the vanilla multi-head self-attention Transformer block, with two differences: 1) it replaces the self-attention mechanism with our proposed Cross-Shaped Window Self-Attention; 2) in order to introduce the local inductive bias, LePE is added as a parallel module to the self-attention branch.
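The token counts of the four stages can be checked with a few lines of arithmetic. The following is a minimal sketch, assuming a stride-4 token embedding followed by a stride-2 convolution between adjacent stages, and an input whose sides are divisible by 32. The function name `stage_shapes` and the base channel value 64 used in the example are illustrative assumptions, not taken from the text above.

```python
def stage_shapes(height, width, base_channels, num_stages=4):
    """Return (token_rows, token_cols, channels) for each stage.

    Assumes a stride-4 token embedding and a stride-2 convolution
    between adjacent stages that doubles the channel dimension.
    """
    rows, cols, channels = height // 4, width // 4, base_channels
    shapes = []
    for _ in range(num_stages):
        shapes.append((rows, cols, channels))
        # Stride-2 convolution: halve each spatial side, double the channels.
        rows, cols, channels = rows // 2, cols // 2, channels * 2
    return shapes


if __name__ == "__main__":
    # Illustrative setting: 224 x 224 input, hypothetical base channel of 64.
    for index, (rows, cols, channels) in enumerate(stage_shapes(224, 224, 64), start=1):
        print(f"stage {index}: {rows} x {cols} tokens, {channels} channels")
```

The number of tokens drops by a factor of four from one stage to the next, which is what keeps attention in the later stages cheap.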
Cross-Shaped Window Self-Attention
Despite the strong long-range context modeling capability, the computation complexity of the original full self-attention mechanism is quadratic in the feature map size. Therefore, it suffers from huge computation cost for vision tasks that take high-resolution feature maps as input, such as object detection and segmentation. To alleviate this issue, existing works suggest performing self-attention in a local attention window and applying halo or shifted windows to enlarge the receptive field. However, each token within a Transformer block still has a limited attention area, and more blocks must be stacked to achieve a global receptive field. To enlarge the attention area and achieve global self-attention more efficiently, we present the cross-shaped window self-attention mechanism, which performs self-attention in horizontal and vertical stripes in parallel; together these stripes form a cross-shaped window.
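Horizontal and Vertical Stripes. According to the multi-head self-attention mechanism, the input feature $X \in \mathbb{R}^{(H \times W) \times C}$ is first linearly projected to $K$ heads, and then each head performs local self-attention within either the horizontal or the vertical stripes.
For horizontal stripes self-attention, $X$ is evenly partitioned into non-overlapping horizontal stripes $[X^1, \ldots, X^M]$ of equal width $sw$, and each of them contains $sw \times W$ tokens. Here, $sw$ is the stripe width and can be adjusted to balance the learning capacity and computation complexity. Formally, suppose the projected queries, keys and values of the $k$-th head all have dimension $d_k$; then the output of the horizontal stripes self-attention for the $k$-th head is defined as:
$$X = [X^1, X^2, \ldots, X^M], \qquad Y_k^i = \mathrm{Attention}(X^i W_k^Q,\; X^i W_k^K,\; X^i W_k^V), \qquad \text{H-Attention}_k(X) = [Y_k^1, Y_k^2, \ldots, Y_k^M] \tag{1}$$
where $X^i \in \mathbb{R}^{(sw \times W) \times C}$, $M = H / sw$ and $i = 1, \ldots, M$. $W_k^Q, W_k^K, W_k^V \in \mathbb{R}^{C \times d_k}$ are the projection matrices of queries, keys and values for the $k$-th head, and $d_k = C / K$. The vertical stripes self-attention is derived in the same way, and its output for the $k$-th head is denoted $\text{V-Attention}_k(X)$.
Assuming natural images do not have directional bias, we equally split the $K$ heads into two parallel groups (each has $K/2$ heads; $K$ is often an even value). The first group of heads performs horizontal stripes self-attention while the second group of heads performs vertical stripes self-attention. Finally, the outputs of these two parallel groups are concatenated back together:
$$\text{CSWin-Attention}(X) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_K)\, W^O, \qquad \mathrm{head}_k = \begin{cases} \text{H-Attention}_k(X), & k = 1, \ldots, K/2 \\ \text{V-Attention}_k(X), & k = K/2 + 1, \ldots, K \end{cases} \tag{2}$$
where $W^O \in \mathbb{R}^{C \times C}$ is the commonly used projection matrix that projects the self-attention results into the target output dimension (set as $C$ by default). As described above, one key insight in our self-attention mechanism design is splitting the multi-heads into different groups and applying different self-attention operations accordingly. In other words, the attention area of each token within one Transformer block is enlarged via multi-head grouping. By contrast, existing self-attention mechanisms apply the same self-attention operation across all heads. In the experiments, we will show that this design brings better performance.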
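To make the geometry of the stripes concrete, the sketch below enumerates which tokens a given token attends to. It is an illustration only, not the authors' implementation: it assumes tokens are laid out on a `height` x `width` grid, that the stripe width `sw` divides both sides, and that the first half of the heads is assigned to horizontal stripes and the second half to vertical stripes. All function names are hypothetical.

```python
def horizontal_stripe(row, col, height, width, sw):
    """Tokens in the horizontal stripe (sw rows, full width) containing (row, col)."""
    top = (row // sw) * sw
    return {(r, c) for r in range(top, min(top + sw, height)) for c in range(width)}


def vertical_stripe(row, col, height, width, sw):
    """Tokens in the vertical stripe (full height, sw columns) containing (row, col)."""
    left = (col // sw) * sw
    return {(r, c) for r in range(height) for c in range(left, min(left + sw, width))}


def cross_shaped_window(row, col, height, width, sw):
    """Union of both stripes: the area covered by the two head groups together."""
    return (horizontal_stripe(row, col, height, width, sw)
            | vertical_stripe(row, col, height, width, sw))


def head_groups(num_heads):
    """Assign the first half of the heads to horizontal stripes, the rest to vertical."""
    half = num_heads // 2
    return ["horizontal"] * half + ["vertical"] * (num_heads - half)


if __name__ == "__main__":
    area = cross_shaped_window(row=3, col=5, height=8, width=8, sw=2)
    print("tokens in the cross-shaped window:", len(area))
    print("head assignment for 4 heads:", head_groups(4))
```

Each head still attends within a single stripe only; the cross shape emerges because the two groups run side by side in the same block and their outputs are concatenated.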
Computation Complexity Analysis. The computation complexity of CSWin self-attention is:
$$\Omega(\text{CSWin}) = HWC \cdot (4C + sw \cdot H + sw \cdot W) \tag{3}$$
For high-resolution inputs, considering that $H$ and $W$ will be larger than $C$ in the early stages and smaller than $C$ in the later stages, we choose a small $sw$ for early stages and a larger $sw$ for later stages. In other words, adjusting $sw$ provides the flexibility to enlarge the attention area of each token in later stages in an efficient way. Besides, to make the intermediate feature map size divisible by $sw$ for $224 \times 224$ input, we empirically set $sw$ to $1, 2, 7, 7$ for the four stages by default.
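The trade-off controlled by the stripe width can be explored numerically. The sketch below assumes the CSWin cost takes the form $HWC(4C + sw \cdot H + sw \cdot W)$: $4HWC^2$ for the four linear projections, plus the attention cost of the two head groups, each using half of the channels. For comparison it uses the standard cost of full self-attention, $4HWC^2 + 2(HW)^2C$. The stage sizes and the base channel of 64 in the example are illustrative assumptions.

```python
def cswin_attention_flops(height, width, channels, sw):
    """Cost of CSWin self-attention: HWC * (4C + sw*H + sw*W)."""
    return height * width * channels * (4 * channels + sw * height + sw * width)


def full_attention_flops(height, width, channels):
    """Cost of global multi-head self-attention: 4*HW*C^2 + 2*(HW)^2*C."""
    tokens = height * width
    return 4 * tokens * channels ** 2 + 2 * tokens ** 2 * channels


if __name__ == "__main__":
    # Illustrative stages for a 224 x 224 input with a hypothetical base channel of 64.
    stages = [(56, 56, 64, 1), (28, 28, 128, 2), (14, 14, 256, 7), (7, 7, 512, 7)]
    for height, width, channels, sw in stages:
        cswin = cswin_attention_flops(height, width, channels, sw)
        full = full_attention_flops(height, width, channels)
        print(f"{height}x{width}, C={channels}, sw={sw}: "
              f"CSWin/full cost ratio = {cswin / full:.3f}")
```

When the stripe width equals the feature map side, each stripe covers the whole map and the two expressions coincide, which is a useful sanity check on the formula.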
Locally-Enhanced Positional Encoding. Since the self-attention operation is permutation-invariant, it ignores the important positional information within the 2D image. To add such information back, different positional encoding mechanisms have been utilized in existing vision Transformers. In Figure 3, we show some typical positional encoding mechanisms and compare them with our proposed locally-enhanced positional encoding. In detail, APE and CPE add the positional information to the input tokens before feeding them into the Transformer blocks, while RPE and our LePE incorporate the positional information within each Transformer block. But different from RPE, which adds the positional information within the attention calculation (i.e., inside $\mathrm{Softmax}(QK^T)$), we consider a more straightforward manner and impose the positional information upon the linearly projected values. Meanwhile, we notice that RPE introduces a bias in a per-head manner, while our LePE is a per-channel bias, which may show more potential to serve as a positional embedding.
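Mathematically, we denote the input sequence as $x = (x_1, \ldots, x_n)$ of $n$ elements, and the output of the attention as $z = (z_1, \ldots, z_n)$ of the same length, where $x_i, z_i \in \mathbb{R}^C$. Self-attention computation can be formulated as:
$$z_i = \sum_{j=1}^{n} \alpha_{ij} v_j, \qquad \alpha_{ij} = \frac{\exp\!\left(q_i^\top k_j / \sqrt{d}\right)}{\sum_{m=1}^{n} \exp\!\left(q_i^\top k_m / \sqrt{d}\right)} \tag{4}$$
where $q_i, k_i, v_i$ are the query, key and value obtained by a linear transformation of the input $x_i$, and $d$ is the feature dimension. Our Locally-Enhanced Positional Encoding then acts as a learnable per-element bias, and Eq. 4 can be formulated as:
$$z_i^k = \sum_{j=1}^{n} \left(\alpha_{ij} + \beta_{ij}^k\right) v_j^k \tag{5}$$
where $z_i^k$ represents the $k$-th element of vector $z_i$. To make the LePE suitable for varying input sizes, we set a distance threshold for the LePE and set $\beta_{ij}^k$ to $0$ if the Chebyshev distance between tokens $i$ and $j$ is greater than a threshold $\tau$ ($\tau = 3$ in the default setting).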
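The effect of the per-element bias can be sketched for a single value channel. The code below is a simplified illustration rather than the authors' implementation: it assumes softmax-normalized attention weights, a bias that depends only on the relative offset between two tokens (stored in a hypothetical dictionary `bias`), and a bias of zero whenever the Chebyshev distance exceeds `tau`. The example values are made up.

```python
import math


def chebyshev(p, q):
    """Chebyshev distance between two (row, col) grid positions."""
    return max(abs(p[0] - q[0]), abs(p[1] - q[1]))


def attention_with_lepe(queries, keys, values, positions, bias, tau):
    """Single-channel sketch of z_i = sum_j (alpha_ij + beta_ij) * v_j.

    queries, keys: lists of equal-length vectors; values: list of scalars
    (one channel); positions: list of (row, col); bias: dict mapping a
    relative offset (d_row, d_col) to a learnable scalar (assumed form).
    """
    dim = len(queries[0])
    outputs = []
    for i, query in enumerate(queries):
        scores = [sum(a * b for a, b in zip(query, key)) / math.sqrt(dim) for key in keys]
        peak = max(scores)  # subtract the maximum for numerical stability
        exps = [math.exp(score - peak) for score in scores]
        total = sum(exps)
        z = 0.0
        for j, value in enumerate(values):
            alpha = exps[j] / total
            offset = (positions[j][0] - positions[i][0], positions[j][1] - positions[i][1])
            # The positional bias is only active inside the local neighbourhood.
            beta = bias.get(offset, 0.0) if chebyshev(positions[i], positions[j]) <= tau else 0.0
            z += (alpha + beta) * value
        outputs.append(z)
    return outputs


if __name__ == "__main__":
    grid = [(0, 0), (0, 1), (1, 0), (1, 1)]
    zeros = [[0.0, 0.0] for _ in grid]  # identical queries/keys give uniform attention
    print(attention_with_lepe(zeros, zeros, [1.0, 2.0, 3.0, 4.0], grid, {(0, 0): 1.0}, tau=1))
```

Because the bias multiplies the values directly and depends only on local offsets, it does not change with the total number of tokens, which is the property that matters for variable input resolutions.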
CSWin Transformer Block
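Equipped with the above self-attention mechanism and positional embedding mechanism, the CSWin Transformer block is formally defined as:
$$\hat{X}^l = \text{CSWin-Attention}\left(\mathrm{LN}(X^{l-1})\right) + X^{l-1}, \qquad X^l = \mathrm{MLP}\left(\mathrm{LN}(\hat{X}^l)\right) + \hat{X}^l$$
where $X^l$ denotes the output of the $l$-th Transformer block or of the preceding convolutional layer of each stage.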
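The block follows the usual pre-normalization residual pattern: normalize, apply attention, add the input back, then repeat with the MLP. A minimal sketch of that data flow is given below. It assumes layer normalization without learnable scale and shift, and it takes the attention and MLP as plain callables (`attention` and `mlp` are placeholders, not the actual modules).

```python
import math


def layer_norm(vector, eps=1e-5):
    """Normalize one token to zero mean and unit variance (no learnable affine)."""
    mean = sum(vector) / len(vector)
    variance = sum((value - mean) ** 2 for value in vector) / len(vector)
    return [(value - mean) / math.sqrt(variance + eps) for value in vector]


def add(tokens_a, tokens_b):
    """Element-wise sum of two token lists (the residual connection)."""
    return [[a + b for a, b in zip(row_a, row_b)] for row_a, row_b in zip(tokens_a, tokens_b)]


def transformer_block(tokens, attention, mlp):
    """Pre-norm residual block: attention branch first, then the MLP branch.

    attention: maps a list of tokens to a list of tokens (mixes tokens).
    mlp: maps a single token to a single token (applied per token).
    """
    hidden = add(attention([layer_norm(token) for token in tokens]), tokens)
    return add([mlp(layer_norm(token)) for token in hidden], hidden)


if __name__ == "__main__":
    tokens = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
    averaged = lambda xs: [[sum(col) / len(xs) for col in zip(*xs)] for _ in xs]
    doubled = lambda x: [2.0 * value for value in x]
    print(transformer_block(tokens, averaged, doubled))
```

Any attention function with the same input and output shape can be dropped in, which is how the ablations later in the paper swap attention mechanisms while keeping the rest of the block fixed.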
Architecture Variants
For a fair comparison with other vision Transformers under similar settings, we build four different variants of CSWin Transformer as shown in Table 1: CSWin-T (Tiny), CSWin-S (Small), CSWin-B (Base), CSWin-L (Large). They are designed by changing the base channel dimension $C$ and the block number of each stage. In all these variants, the expansion ratio of each MLP is set to $4$. The head numbers of the four stages are shared by the first three variants and differ for the last variant (see Table 1).
Experiments
To show the effectiveness of CSWin Transformer as a general vision backbone, we conduct experiments on ImageNet-1K classification, COCO object detection, and ADE20K semantic segmentation. We also perform comprehensive ablation studies to analyze each component of CSWin Transformer. As most of the methods we compared did not report downstream inference speed, we use an extra section to report it for simplicity.
For fair comparison, we follow the training strategy in DeiT, as do other baseline Transformer architectures. Specifically, all our models are trained for 300 epochs with the input size of $224 \times 224$. We use the AdamW optimizer with weight decay of 0.05 for CSWin-T/S and 0.1 for CSWin-B. The default batch size and initial learning rate are set to 1024 and 0.001, and the cosine learning rate scheduler with 20 epochs of linear warm-up is used. We apply increasing stochastic depth augmentation for CSWin-T, CSWin-S, and CSWin-B with the maximum rate as 0.1, 0.3, 0.5 respectively. When reporting the results of $384 \times 384$ input, we fine-tune the models for 30 epochs with the weight decay of , learning rate of , batch size of .
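The learning-rate and stochastic-depth schedules described above can be written down in a few lines. This is a sketch under stated assumptions: the warm-up is taken to rise linearly to the base rate over the first 20 epochs, the cosine decay is taken to end at a minimum rate of zero, and "increasing" stochastic depth is interpreted as a drop rate growing linearly from 0 at the first block to the maximum at the last block. The function names and the block count in the example are illustrative.

```python
import math


def learning_rate(epoch, base_lr=1e-3, warmup_epochs=20, total_epochs=300, min_lr=0.0):
    """Linear warm-up followed by cosine decay (epoch is 0-indexed)."""
    if epoch < warmup_epochs:
        return base_lr * (epoch + 1) / warmup_epochs
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


def drop_path_rates(max_rate, num_blocks):
    """Stochastic depth rate per block, increasing linearly up to max_rate."""
    steps = max(num_blocks - 1, 1)
    return [max_rate * index / steps for index in range(num_blocks)]


if __name__ == "__main__":
    for epoch in (0, 19, 20, 160, 299):
        print(f"epoch {epoch:3d}: lr = {learning_rate(epoch):.6f}")
    # Hypothetical network with 6 blocks and a maximum drop rate of 0.1.
    print([round(rate, 3) for rate in drop_path_rates(0.1, 6)])
```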
In Table 11, we compare our CSWin Transformer with state-of-the-art CNN and Transformer architectures. Due to the page limit, we only compare with a few classical methods here and provide a comprehensive comparison in the supplemental materials.
The results show that our CSWin Transformers outperform previous state-of-the-art vision Transformers by large margins. For example, CSWin-T achieves 82.7% Top-1 accuracy with only 4.3G FLOPs, surpassing CvT-13, Swin-T and DeiT-S by 1.1%, 1.4% and 2.9% respectively. For the small and base model settings, our CSWin-S and CSWin-B also achieve the best performance. When fine-tuned on the $384 \times 384$ input, a similar trend is observed, which demonstrates the powerful learning capacity of our CSWin Transformers.
Compared with state-of-the-art CNNs, we find our CSWin Transformer is the only Transformer-based architecture that achieves comparable or even better results than EfficientNet under the small and base settings, while using less computation. It is also worth noting that neural architecture search is used in EfficientNet but not in our CSWin Transformer design.
We further pre-train CSWin Transformer on the ImageNet-21K dataset, which contains 14.2M images and 21K classes. Models are trained for 90 epochs with the input size of $224 \times 224$. We use the AdamW optimizer with weight decay of 0.1 for CSWin-B and 0.2 for CSWin-L, and the default batch size and initial learning rate are set to 2048 and 0.001. When fine-tuning on ImageNet-1K, we train the models for 30 epochs with the weight decay of , learning rate of , batch size of . The increasing stochastic depth augmentation for both CSWin-B and CSWin-L is set to 0.1.
Table 3 reports the results of pre-training on ImageNet-21K. Compared to the results of CSWin-B pre-trained on ImageNet-1K, the large-scale data of ImageNet-21K brings a gain of 1.6% to 1.7%. CSWin-B and CSWin-L achieve 87.0% and 87.5% top-1 accuracy, surpassing previous methods.
COCO Object Detection
Next, we evaluate CSWin Transformer on the COCO object detection task with the Mask R-CNN and Cascade Mask R-CNN frameworks. Specifically, we pretrain the backbones on the ImageNet-1K dataset and follow the fine-tuning strategy used in Swin Transformer on the COCO training set.
We compare CSWin Transformer with various backbones: the CNN backbones ResNet and ResNeXt (X), and the Transformer backbones PVT, Twins, and Swin. Table 4 reports the results of the Mask R-CNN framework with the “1×” (12 training epochs) and “3×+MS” (36 training epochs with multi-scale training) schedules. It shows that our CSWin Transformer variants clearly outperform all the CNN and Transformer counterparts. In detail, our CSWin-T outperforms Swin-T by +4.5 box AP and +3.1 mask AP with the 1× schedule, and by +3.0 box AP and +2.0 mask AP with the 3× schedule. We also achieve similar performance gains on the small and base configurations.
Table 5 reports the results with the Cascade Mask R-CNN framework. Though Cascade Mask R-CNN is overall stronger than Mask R-CNN, we observe CSWin Transformers still surpass the counterparts by promising margins under different model configurations.
ADE20K Semantic Segmentation
We further investigate the capability of CSWin Transformer for semantic segmentation on the ADE20K dataset. Here we employ Semantic FPN and UperNet as the basic frameworks. For fair comparison, we follow previous works and train Semantic FPN for 80k iterations with a batch size of 16, and UperNet for 160k iterations with a batch size of 16; more details are provided in the supplementary material. In Table 6, we report the results of different methods in terms of mIoU and multi-scale tested mIoU (MS mIoU). Our CSWin Transformers significantly outperform the previous state of the art under different configurations. In detail, CSWin-T, CSWin-S and CSWin-B achieve +6.7, +4.0, +3.9 higher mIoU than the Swin counterparts with the Semantic FPN framework, and +4.8, +2.8, +3.0 higher mIoU with the UperNet framework. Compared to the CNN counterparts, the performance gain is very promising and again demonstrates the potential of vision Transformers. When using the ImageNet-21K pre-trained model, our CSWin-L further achieves 55.7 mIoU and surpasses the previous best model by +2.2 mIoU, while using less computation.
Inference Speed
Here we report the inference speed of our CSWin and of Swin. For downstream tasks, we report the FPS of Cascade Mask R-CNN for object detection on COCO and UperNet for semantic segmentation on ADE20K. In most cases, our model is only slightly slower than Swin (by less than 10%), but it outperforms Swin by large margins. For example, on COCO, CSWin-S is +1.9% box AP and +1.7% mask AP higher than Swin-S with similar inference speed (11.7 FPS vs. 12 FPS). Note that our CSWin-T performs better than Swin-B on box AP (+0.6%) and mask AP (+0.3%) with much faster inference speed (14.2 FPS vs. 11.2 FPS), indicating that our CSWin achieves better accuracy/FPS trade-offs.
Ablation Study
To better understand CSWin Transformers, we compare each key component with previous works under a completely fair setting: we use the same architecture and hyper-parameters for the following experiments and vary only one component in each ablation. To save time, we use Mask R-CNN with the 1x schedule as the default setting for detection and instance segmentation evaluation, and Semantic FPN with 80k iterations and single-scale testing for segmentation evaluation.
Parallel Multi-Head Grouping. We first study the effectiveness of our novel “Parallel Multi-Head Grouping” strategy. Here we compare Axial-Attention and Criss-Cross-Attention under the CSWin-T backbone. “Attention region” is used as the computation cost metric for detailed comparison. To simplify, we assume the attention is calculated on a square input, i.e., $H = W$.
In Table 8, we find that the “parallel multi-head grouping” is efficient and effective, especially for downstream tasks. When we replace the parallel manner with the sequential one, the performance of CSWin degrades on all tasks. Compared with previous methods under a similar attention region constraint, our CSWin performs slightly better than Axial on ImageNet, while outperforming it by a large margin on downstream tasks. Our CSWin also performs slightly better than Criss-Cross Attention while being faster on different tasks, which further shows that our “parallel” design is much more efficient.
Dynamic Stripe Width. In Fig. 4 we study the trade-off between stripe width and accuracy. We find that as the stripe width increases, the computation cost (FLOPs) increases, and the Top-1 classification accuracy improves greatly at the beginning, with the improvement slowing down once the width is large enough. Our default setting achieves a good trade-off between accuracy and FLOPs.
Attention Mechanism Comparison. Following the above analysis of each component of CSWin self-attention, we further compare with existing self-attention mechanisms. As some of the methods need an even number of layers in each stage, for a fair comparison, we use Swin-T as the backbone and only change the self-attention mechanism. In detail, we use $[2, 2, 6, 2]$ blocks for the four stages with a base channel of 96, non-overlapped token embedding, and RPE. The results are reported in Table 9. Our CSWin self-attention mechanism performs better than existing self-attention mechanisms across all the tasks.
Positional Encoding Comparison. The proposed LePE is specially designed to enhance the local positional information on downstream tasks for various input resolutions. Here we use CSWin-T as the backbone and only vary the positional encoding. In Table 10, we compare our LePE with other recent positional encoding mechanisms (APE, CPE, and RPE) for image classification, object detection and image segmentation. Besides, we also test the variants without positional encoding (No PE) and CPE*, which is obtained by applying CPE before every Transformer block. According to the comparison results, we see that: 1) positional encoding can bring performance gains by introducing the local inductive bias; 2) though RPE achieves similar performance on the classification task with fixed input resolution, our LePE performs better (+1.2 box AP and +0.9 mask AP on COCO, +0.9 mIoU on ADE20K) on downstream tasks where the input resolution varies; 3) compared to APE and CPE, our LePE also achieves better performance.
Conclusion
In this paper, we have presented a new Vision Transformer architecture named CSWin Transformer. The core design of CSWin Transformer is the CSWin Self-Attention, which performs self-attention in the horizontal and vertical stripes by splitting the multi-heads into parallel groups. This multi-head grouping design can efficiently enlarge the attention area of each token within one Transformer block. In addition, the mathematical analysis allows us to increase the stripe width along the network depth to further enlarge the attention area with only a small extra computation cost. We further introduce locally-enhanced positional encoding into CSWin Transformer for downstream tasks. We achieve state-of-the-art performance on various vision tasks under constrained computation complexity. We look forward to applying it to more vision tasks.
Experiment Details
In this section, we provide more detailed experimental settings about ImageNet and downstream tasks.
ImageNet-1K Classification. For a fair comparison, we follow the training strategy in DeiT. Specifically, all our models are trained for 300 epochs with the input size of $224 \times 224$. We use the AdamW optimizer with weight decay of 0.05 for CSWin-T/S and 0.1 for CSWin-B. The default batch size and initial learning rate are set to 2048 and respectively, and the cosine learning rate scheduler with 20 epochs of linear warm-up is used. We adopt most of the augmentation used in prior work, including RandAugment (rand-m9-mstd0.5-inc1), Mixup, CutMix, Random Erasing, Exponential Moving Average, and increasing stochastic depth (0.1, 0.3, 0.5 for CSWin-T, CSWin-S, and CSWin-B respectively).
When fine-tuning with $384 \times 384$ input, we follow prior work and fine-tune the models for 30 epochs with the weight decay of , learning rate of , batch size of . We notice that a large ratio of stochastic depth is beneficial for fine-tuning and keep it the same as in the training stage.
COCO Object Detection and Instance Segmentation. We use two classical object detection frameworks, Mask R-CNN and Cascade Mask R-CNN, based on the implementation from mmdetection. For Mask R-CNN, we train it from the ImageNet-1K pretrained model with two settings: the 1× schedule and the 3×+MS schedule. For the 1× schedule, we train the model with single-scale input (the image is resized so that the shorter side is 800 pixels, while the longer side does not exceed 1333 pixels) for 12 epochs. We use the AdamW optimizer with a learning rate of 0.0001, weight decay of 0.05 and batch size of 16. The learning rate is decayed at the 8th and 11th epochs with a decay rate of 0.1. The stochastic depth is the same as in the ImageNet-1K setting, i.e., 0.1, 0.3, 0.5 for CSWin-T, CSWin-S, and CSWin-B respectively. For the 3×+MS schedule, we train the model with multi-scale input (the image is resized so that the shorter side is between 480 and 800 pixels while the longer side is no longer than 1333 pixels) for 36 epochs. The other settings are the same as for the 1× schedule, except that we decay the learning rate at epochs 27 and 33. For Cascade Mask R-CNN, we use the same 3×+MS schedule as for Mask R-CNN.
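The resizing rule and the step decay described above can be sketched as follows. This is an illustration under assumptions, not the mmdetection code: the resize keeps the aspect ratio and picks the largest scale that satisfies both side limits, rounding to the nearest pixel, and the step schedule multiplies the learning rate by 0.1 from each milestone epoch onward, with epochs counted from 0. Function names are hypothetical.

```python
def resize_keep_ratio(height, width, short_side=800, max_long_side=1333):
    """Scale so the shorter side reaches short_side unless the longer side
    would exceed max_long_side, in which case the longer side is the limit."""
    scale = min(short_side / min(height, width), max_long_side / max(height, width))
    return round(height * scale), round(width * scale)


def step_lr(epoch, base_lr=1e-4, milestones=(8, 11), gamma=0.1):
    """Multiply the learning rate by gamma at every milestone already reached."""
    passed = sum(1 for milestone in milestones if epoch >= milestone)
    return base_lr * gamma ** passed


if __name__ == "__main__":
    print(resize_keep_ratio(480, 640))    # shorter side is the binding limit
    print(resize_keep_ratio(500, 2000))   # longer side is the binding limit
    for epoch in (0, 8, 11):
        print(f"epoch {epoch}: lr = {step_lr(epoch):.1e}")
```

For the multi-scale schedule, `short_side` would be sampled between 480 and 800 for each image and the milestones would be `(27, 33)`.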
ADE20K Semantic Segmentation. Here we consider two semantic segmentation frameworks, UperNet and Semantic FPN, based on the implementation from mmsegmentation. For UperNet, we follow prior work and use the AdamW optimizer with initial learning rate , weight decay of 0.01 and batch size of 16 (8 GPUs with 2 images per GPU) for 160K iterations. The learning rate is warmed up for 1500 iterations at the beginning and then decays with a linear decay strategy. We use the default augmentation setting in mmsegmentation, including random horizontal flipping, random re-scaling (ratio range [0.5, 2.0]) and random photo-metric distortion. All the models are trained with input size . The stochastic depth is set to 0.2, 0.4, 0.6 for CSWin-T, CSWin-S, and CSWin-B respectively. For testing, we report both single-scale and multi-scale test results (with scales of [0.5, 0.75, 1.0, 1.25, 1.5, 1.75] times the training resolution).
For Semantic FPN, we follow the setting in . We use AdamW optimizer with initial learning rate , weight decay of and batch size of 16 (4 GPUs with 4 images per GPU) for 80K iterations.
More Experiments
Due to the page limit, we only compare with a few classical methods in the main paper. Here we provide a comprehensive comparison with more recent methods on ImageNet-1K. We find that our CSWin performs best among concurrent works.