Swin Transformer V2: Scaling Up Capacity and Resolution

Ze Liu, Han Hu, Yutong Lin, Zhuliang Yao, Zhenda Xie, Yixuan Wei, Jia Ning, Yue Cao, Zheng Zhang, Li Dong, Furu Wei, Baining Guo

cs.CV

Introduction

Scaling up language models has been incredibly successful. It significantly improves a model’s performance on language tasks, and the model demonstrates amazing few-shot capabilities similar to those of human beings. Since the BERT large model with 340 million parameters, language models have been scaled up by more than 1,000 times in a few years, reaching 530 billion dense parameters and 1.6 trillion sparse parameters. These large language models are also found to possess increasingly strong few-shot capabilities akin to human intelligence for a broad range of language tasks.

On the other hand, the scaling up of vision models has been lagging behind. While it has long been recognized that larger vision models usually perform better on vision tasks , the absolute model size was just able to reach about 1-2 billion parameters very recently . More importantly, unlike large language models, the existing large vision models are applied to the image classification task only .

To successfully train large and general vision models, we need to address a few key issues. Firstly, our experiments with large vision models reveal an instability issue in training. We find that the discrepancy of activation amplitudes across layers becomes significantly greater in large models. A closer look at the original architecture reveals that this is caused by the output of the residual unit being added directly back to the main branch. The result is that the activation values are accumulated layer by layer, and the amplitudes at deeper layers are thus significantly larger than those at early layers. To address this issue, we propose a new normalization configuration, called res-post-norm, which moves the LN layer from the beginning of each residual unit to the end, as shown in Figure 1. We find this new configuration produces much milder activation values across the network layers. We also propose a scaled cosine attention to replace the previous dot product attention. The scaled cosine attention makes the computation independent of the amplitudes of block inputs, and the attention values are less likely to fall into extremes. In our experiments, the two proposed techniques not only make the training process more stable but also improve the accuracy, especially for larger models.

Secondly, many downstream vision tasks such as object detection and semantic segmentation require high resolution input images or large attention windows. The window size variations between low-resolution pre-training and high-resolution fine-tuning can be quite large. The current common practice is to perform a bi-cubic interpolation of the position bias maps. This simple fix is somewhat ad hoc and the result is usually sub-optimal. We introduce a log-spaced continuous position bias (Log-CPB), which generates bias values for arbitrary coordinate ranges by applying a small meta network on the log-spaced coordinate inputs. Since the meta network takes any coordinates, a pre-trained model will be able to freely transfer across window sizes by sharing weights of the meta network. A critical design of our approach is to transform the coordinates into the log-space so that the extrapolation ratio can be low even when the target window size is significantly larger than that of pre-training.

The scaling up of model capacity and resolution also leads to prohibitively high GPU memory consumption with existing vision models. To resolve the memory issue, we incorporate several important techniques including the Zero-Redundancy Optimizer (ZeRO), activation check-pointing and a novel implementation of sequential self-attention computation. With these techniques, the GPU memory consumption of large models and resolutions is significantly reduced with only marginal effect on the training speed.

With the above techniques, we successfully trained a 3 billion-parameter Swin Transformer model and effectively transferred it to various vision tasks with image resolution as large as 1,536 $\times$ 1,536, using Nvidia A100-40G GPUs. In our model pre-training, we also employ self-supervised pre-training to reduce the dependency on super-huge labeled data. With 40 $\times$ less labelled data than that in previous practice (JFT-3B), the 3 billion model achieves the state-of-the-art accuracy on a broad range of vision benchmarks. Specifically, it obtains 84.0% top-1 accuracy on the ImageNet-V2 image classification validation set, 63.1 / 54.4 box / mask AP on the COCO test-dev set of object detection, 59.9 mIoU on ADE20K semantic segmentation, and 86.8% top-1 accuracy on Kinetics-400 video action classification, which are +NA%, +4.4/+3.3, +6.3 and +1.9 higher than the best numbers in the original Swin Transformers, and surpass previous best records by +0.8%, +1.8/+1.4, +1.5 and +1.4%, respectively.

By scaling up both capacity and resolution of vision models with strong performance on general vision tasks, just like a good language model’s performance on general NLP tasks, we aim to stimulate more research in this direction so that we can eventually close the capacity gap between vision and language models and facilitate the joint modeling of the two domains.

Related Works

Transformer has served as the standard network since the pioneering work of . The exploration of scaling this architecture has since begun, and the progress has been accelerated by the invention of effective self-supervised learning approaches, such as masked or auto-regressive language modeling, and has been further encouraged by the discovery of a scaling law. Since then, the capacity of language models has increased dramatically by more than 1,000 times in a few years, from BERT-340M to the Megatron-Turing-530B and sparse Switch-Transformer-1.6T. With increased capacity, the accuracy of various language benchmarks has been significantly improved. The zero-shot or few-shot performance is also significantly improved, which is a foundation of human generic intelligence.

Vision networks and scaling up

CNNs have long been the standard computer vision networks. Since AlexNet, architectures have become deeper and larger, which has greatly advanced various visual tasks and largely fueled the wave of deep learning in computer vision, such as VGG, GoogleNet and ResNet. In the past two years, the CNN architectures have been further scaled up to about 1 billion parameters; however, absolute performance may not be so encouraging, perhaps due to inductive biases in the CNN architecture limiting modeling power.

Last year, Transformers started taking over one representative visual benchmark after another, including ImageNet-1K image-level classification benchmarks , COCO region-level object detection benchmark , ADE20K pixel-level semantic segmentation benchmark , Kinetics-400 video action classification benchmark , etc. Since these works, numerous vision Transformer variants have been proposed to improve the accuracy at relatively small scale . Only a few works have attempted to scale up the vision Transformers . However, they rely on a huge image dataset with classification labels, i.e., JFT-3B, and are only applied to image classification problems.

Transferring across window / kernel resolution

For CNNs, previous works typically fixed kernel size during pre-training and fine-tuning. Global vision Transformers, such as ViT, compute attention globally, with the equivalent attention window size linearly proportional to the increased input image resolution. For local vision Transformer architectures, such as Swin Transformer, the window size can be either fixed or changed during fine-tuning. Allowing variable window sizes is more convenient in practice: the window size can be chosen to divide the (possibly variable) feature map size, and receptive fields can be tuned for better accuracy. To handle the variable window sizes between pre-training and fine-tuning, bi-cubic interpolation was the previous common practice. In this paper, we propose a log-spaced continuous position bias approach (Log-CPB) that more smoothly transfers pre-trained model weights at low resolution to deal with higher resolution windows.

Study on bias terms

In NLP, the relative position bias method proved beneficial, compared to the absolute position embedding used in the original Transformer. In computer vision, the relative positional bias method is more commonly used, probably because the spatial relationships of visual signals play a more important role in visual modeling. A common practice is to directly learn the bias values as model weights. There are also a few works that particularly study how to set and learn the bias terms.

Continuous convolution and variants

Our Log-CPB approach is also related to earlier works on continuous convolution and variants , which utilize a meta network to handle irregular data points. Our Log-CPB approach is inspired by these efforts while solving a different problem of transferring relative position biases in vision Transformers across arbitrary window sizes. We also propose log-spaced coordinates to alleviate the difficulty of extrapolation when transferring between large size changes.

Swin Transformer V2

Swin Transformer is a general-purpose computer vision backbone that has achieved strong performance in various granular recognition tasks such as region-level object detection, pixel-level semantic segmentation, and image-level image classification. The main idea of Swin Transformer is to introduce several important visual priors into the vanilla Transformer encoder, including hierarchy, locality, and translation invariance, which combines the strength of both: the basic Transformer unit has strong modeling capabilities, and the visual priors make it friendly to a variety of visual tasks.

It is widely known that normalization technologies are crucial in stably training deeper architectures. The original Swin Transformer inherits the common practice in the language Transformers and vanilla ViT to utilize a pre-normalization configuration without extensive study, as shown in Figure 1. In the following subsections, we will examine this default normalization configuration. (There have been a few alternative normalization configurations, such as post-normalization and sandwich normalization. Post-normalization harms training stability, and sandwich normalization sacrifices representation power due to too many normalization layers.)

Relative position bias

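Relative position bias is a key component in the original Swin Transformer, which introduces an additional parametric bias term to encode the geometric relationship in self-attention calculation:

$$\text{Attention}(Q,K,V)=\text{SoftMax}\left(QK^{T}/\sqrt{d}+B\right)V,$$

where $B\in\mathbb{R}^{M^{2}\times M^{2}}$ is the relative position bias term for each head; $Q,K,V\in\mathbb{R}^{M^{2}\times d}$ are the query, key and value matrices; $d$ is the query/key dimension, and $M^{2}$ is the number of patches in a window.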
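As an illustration of how a learned bias can be indexed by geometric offset, the following minimal sketch builds, for an $M\times M$ window, the lookup from each (query, key) token pair to an entry of a $(2M-1)\times(2M-1)$ bias table. The function name, the row-major token order and the placeholder table values are assumptions made for this example, not details taken from the text.

```python
def relative_position_index(window_size):
    """Map each (query, key) token pair of an M x M window to a bias-table entry."""
    m = window_size
    coords = [(y, x) for y in range(m) for x in range(m)]  # tokens in row-major order
    index = []
    for qy, qx in coords:
        row = []
        for ky, kx in coords:
            dy = qy - ky + (m - 1)  # shift offsets from [-(M-1), M-1] to [0, 2M-2]
            dx = qx - kx + (m - 1)
            row.append(dy * (2 * m - 1) + dx)  # flatten the 2-D offset
        index.append(row)
    return index


M = 2
index = relative_position_index(M)
# Placeholder table values (hypothetical, not trained weights).
bias_table = [0.1 * k for k in range((2 * M - 1) ** 2)]
# Bias matrix added to the attention logits: one entry per token pair.
B = [[bias_table[k] for k in row] for row in index]
```

Token pairs with the same relative offset share one table entry, so the number of learned values per head grows with the window size; this is what makes transfer to a different window size non-trivial.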

Issues in scaling up model capacity and window resolution

We observe two issues when we scale up the capacity and window resolution of the Swin Transformer.

An instability issue when scaling up model capacity. As shown in Figure 2, when we scale up the original Swin Transformer model from small size to large size, the activation values at deeper layers increase dramatically. The discrepancy between layers with the highest and the lowest amplitudes has reached an extreme value of $10^{4}$ . When we scale it up further to a huge size (658 million parameters), it cannot complete the training, as shown in Figure 3.

Degraded performance when transferring models across window resolutions. As shown in the first row of Table 1, the accuracy decreases significantly when we directly test the accuracy of a pre-trained ImageNet-1K model ( $256\times 256$ images with $8\times 8$ window size) at larger image resolutions and window sizes through the bi-cubic interpolation approach. It may be worth re-examining the relative position bias approach in the original Swin Transformer.

In the following subsections, we present techniques to address these issues, including residual post normalization and scaled cosine attention to address the instability issue, and a log-spaced continuous position bias approach to address the issue in transferring across window resolutions.

2 Scaling Up Model Capacity

As mentioned in Section 3.1, the original Swin Transformer (and most vision Transformers) adopts a layer norm layer at the beginning of each block, inherited from vanilla ViT. When we scale up the model capacity, a significant increase in activation values is observed at deeper layers. In fact, in a pre-normalization configuration, the output activation values of each residual block are merged directly back to the main branch, and the amplitude of the main branch grows larger and larger at deeper layers. Large amplitude discrepancy in different layers causes training instability.

To ease this problem, we propose to use a residual post normalization approach instead, as shown in Figure 1. In this approach, the output of each residual block is normalized before merging back into the main branch, and the amplitude of the main branch does not accumulate when the layer goes deeper. As shown in Figure 2, the activation amplitudes by this approach are much milder than in the original pre-normalization configuration.

In our largest model training, we introduce an additional layer normalization layer on the main branch every 6 Transformer blocks, to further stabilize training.
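The effect described above can be reproduced in a deliberately simplified toy setting. In the sketch below, each attention/MLP sub-layer is replaced by a fixed scalar gain (a hypothetical stand-in, as are all function names and the gain value), and the learnable scale and shift of layer normalization are omitted. Under these assumptions the main-branch amplitude grows by roughly the gain per block with pre-normalization, by roughly 1 per block with res-post-norm, and is reset whenever an extra normalization is applied on the main branch.

```python
import math


def layer_norm(x, eps=1e-5):
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def sublayer(x, gain):
    # Hypothetical stand-in for attention/MLP: a fixed linear map with a large gain.
    return [gain * v for v in x]


def rms(x):
    return math.sqrt(sum(v * v for v in x) / len(x))


def run(x, depth, mode, gain=4.0, extra_ln_every=None):
    """Return the main-branch RMS amplitude after each residual block."""
    amplitudes = []
    for i in range(depth):
        if mode == "pre":        # x + f(LN(x)): the residual output is unnormalized
            update = sublayer(layer_norm(x), gain)
        else:                    # "res-post": x + LN(f(x)): the residual output is normalized
            update = layer_norm(sublayer(x, gain))
        x = [a + b for a, b in zip(x, update)]
        if extra_ln_every and (i + 1) % extra_ln_every == 0:
            x = layer_norm(x)    # additional LN on the main branch
        amplitudes.append(rms(x))
    return amplitudes


x0 = [1.0, -1.0, 2.0, -2.0]
pre = run(x0, 12, "pre")
post = run(x0, 12, "res-post")
capped = run(x0, 12, "res-post", extra_ln_every=6)
```

In a real network the sub-layers are learned and not simple gains, so this only illustrates the mechanism, not the magnitudes reported in Figure 2.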

Scaled cosine attention

In the original self-attention computation, the similarity terms of the pixel pairs are computed as a dot product of the query and key vectors. We find that when this approach is used in large visual models, the learnt attention maps of some blocks and heads are frequently dominated by a few pixel pairs, especially in the res-post-norm configuration. To ease this issue, we propose a scaled cosine attention approach that computes the attention logit of a pixel pair $i$ and $j$ by a scaled cosine function:

$$\text{Sim}(\mathbf{q}_{i},\mathbf{k}_{j})=\cos(\mathbf{q}_{i},\mathbf{k}_{j})/\tau+B_{ij},$$

where $B_{ij}$ is the relative position bias between pixel $i$ and $j$ ; $\tau$ is a learnable scalar, non-shared across heads and layers. $\tau$ is set larger than 0.01. The cosine function is naturally normalized, and thus can have milder attention values.
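A minimal sketch of this logit computation is given below, assuming the logit is the cosine similarity of query and key divided by $\tau$ plus the bias $B_{ij}$. The function names are hypothetical, and implementing "set larger than 0.01" as a lower clamp on $\tau$ is an assumption of this example.

```python
import math


def cosine(u, v, eps=1e-12):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / max(norm_u * norm_v, eps)


def scaled_cosine_logits(queries, keys, bias, tau, tau_min=0.01):
    """Attention logits cos(q_i, k_j) / tau + B_ij for one head."""
    tau = max(tau, tau_min)  # assumption: keep tau at or above 0.01 by clamping
    return [[cosine(q, k) / tau + bias[i][j] for j, k in enumerate(keys)]
            for i, q in enumerate(queries)]


q = [[1.0, 2.0], [3.0, -1.0]]
k = [[2.0, 1.0], [-1.0, 0.5]]
zero_bias = [[0.0, 0.0], [0.0, 0.0]]

logits = scaled_cosine_logits(q, k, zero_bias, tau=0.1)
# Multiplying the queries by 1000 leaves the logits unchanged.
scaled_q = [[1000.0 * v for v in row] for row in q]
logits_scaled = scaled_cosine_logits(scaled_q, k, zero_bias, tau=0.1)
```

Because the cosine lies in $[-1,1]$, the logits (before adding the bias) are bounded by $1/\tau$ regardless of the input amplitude, which is the property the text relies on.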

3 Scaling Up Window Resolution

In this subsection, we introduce a log-spaced continuous position bias approach, so that the relative position bias can be smoothly transferred across window resolutions.

Instead of directly optimizing the parameterized biases, the continuous position bias approach adopts a small meta network on the relative coordinates:

$$B(\Delta x,\Delta y)=\mathcal{G}(\Delta x,\Delta y),$$

where $\mathcal{G}$ is a small network, e.g., a 2-layer MLP with a ReLU activation in between by default.

The meta network $\mathcal{G}$ generates bias values for arbitrary relative coordinates, and thus can be naturally transferred to fine-tuning tasks with arbitrarily varying window sizes. In inference, the bias values at each relative position can be pre-computed and stored as model parameters, such that the inference is the same as the original parameterized bias approach.
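The idea can be sketched as follows: a 2-layer MLP with a ReLU in between maps a relative coordinate pair to one bias value per head, and the bias table for any window size is pre-computed by evaluating it on every offset. The hidden width, the number of heads, the omission of an output bias and the random weights (standing in for trained ones) are assumptions of this example.

```python
import random


def make_mlp(hidden, num_heads, seed=0):
    rng = random.Random(seed)  # random weights stand in for trained ones
    w1 = [[rng.uniform(-1, 1) for _ in range(2)] for _ in range(hidden)]
    b1 = [rng.uniform(-1, 1) for _ in range(hidden)]
    w2 = [[rng.uniform(-1, 1) for _ in range(hidden)] for _ in range(num_heads)]
    return w1, b1, w2


def meta_network(params, dx, dy):
    """2-layer MLP with ReLU: (dx, dy) -> one bias value per head."""
    w1, b1, w2 = params
    h = [max(0.0, w[0] * dx + w[1] * dy + b) for w, b in zip(w1, b1)]
    return [sum(wi * hi for wi, hi in zip(row, h)) for row in w2]


def bias_table(params, window_size):
    """Pre-compute the bias for every relative offset of an M x M window."""
    m = window_size
    return {(dx, dy): meta_network(params, dx, dy)
            for dy in range(-(m - 1), m)
            for dx in range(-(m - 1), m)}


params = make_mlp(hidden=16, num_heads=3)
table_8 = bias_table(params, 8)    # 15 x 15 offsets
table_16 = bias_table(params, 16)  # 31 x 31 offsets, same network weights
```

The same weights serve both window sizes; offsets that occur in both tables receive identical bias values, and only the larger offsets require the network to extrapolate.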

Log-spaced coordinates

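When transferring across largely varying window sizes, a large portion of the relative coordinate range needs to be extrapolated. To ease this issue, we propose using log-spaced coordinates instead of the original linear-spaced ones:

$$\widehat{\Delta x}=\operatorname{sign}(\Delta x)\cdot\log(1+|\Delta x|),\qquad\widehat{\Delta y}=\operatorname{sign}(\Delta y)\cdot\log(1+|\Delta y|),$$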

where $\Delta x$ , $\Delta y$ and $\widehat{\Delta x}$ , $\widehat{\Delta y}$ are the linear-scaled and log-spaced coordinates, respectively.

By using the log-spaced coordinates, when we transfer the relative position biases across window resolutions, the required extrapolation ratio will be much smaller than that of using the original linear-spaced coordinates. For example, when transferring from a pre-trained $8\times 8$ window size to a fine-tuned $16\times 16$ window size using the original raw coordinates, the input coordinate range changes from $[-7,7]\times[-7,7]$ to $[-15,15]\times[-15,15]$ . The extrapolation ratio is $\frac{8}{7}=1.14\times$ of the original range. Using log-spaced coordinates, the input range changes from $[-2.079,2.079]\times[-2.079,2.079]$ to $[-2.773,2.773]\times[-2.773,2.773]$ . The extrapolation ratio is $0.33\times$ of the original range, which is about 4 times smaller than that using the original linear-spaced coordinates.
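The arithmetic in this example can be checked with a few lines of code. The sketch assumes the log-spaced coordinate is $\operatorname{sign}(\Delta)\cdot\log(1+|\Delta|)$, which reproduces the quoted bounds ($\log 8\approx 2.079$, $\log 16\approx 2.773$); the function names are hypothetical.

```python
import math


def log_spaced(delta):
    """sign(delta) * log(1 + |delta|)."""
    return math.copysign(math.log1p(abs(delta)), delta)


def extrapolation_ratio(pretrain_window, finetune_window, transform=float):
    """Extra coordinate range needed after transfer, relative to the pre-trained range."""
    old = transform(pretrain_window - 1)  # largest offset seen in pre-training
    new = transform(finetune_window - 1)  # largest offset needed in fine-tuning
    return (new - old) / old


linear_ratio = extrapolation_ratio(8, 16)           # (15 - 7) / 7
log_ratio = extrapolation_ratio(8, 16, log_spaced)  # (log 16 - log 8) / log 8
```

Here `linear_ratio` evaluates to $8/7\approx 1.14$ and `log_ratio` to $1/3\approx 0.33$, matching the figures in the text.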

Table 1 compares the transferring performance of different position bias computation approaches. It can be seen that the log-spaced CPB (continuous position bias) approach performs best, particularly when transferred to larger window sizes.

4 Self-Supervised Pre-training

Larger models are more data hungry. To address the data hungry problem, previous large vision models typically utilize huge labelled data such as JFT-3B . In this work, we exploit a self-supervised pre-training method, SimMIM , to alleviate the demands on labelled data. By this approach, we successfully trained a powerful Swin Transformer model of 3 billion parameters which achieves state-of-the-art (SOTA) on 4 representative visual benchmarks, by using only 70 million labelled images (1/40 of that in JFT-3B).

5 Implementation to Save GPU Memory

Another issue lies in the unaffordable GPU memory consumption with a regular implementation when both the capacity and resolution are large. To address the memory issue, we adopt the following implementations:

Zero-Redundancy Optimizer (ZeRO). In a general data-parallel implementation of optimizers, the model parameters and optimization states are broadcast to every GPU. This implementation is very unfriendly to GPU memory consumption. For example, a model of 3 billion parameters will consume 48G GPU memory when an AdamW optimizer and fp32 weights/states are used. With a ZeRO optimizer, the model parameters and the corresponding optimization states will be split and distributed to multiple GPUs, which significantly reduces memory consumption. We adopt the DeepSpeed framework and use the ZeRO stage-1 option in our experiments. This optimization has little effect on training speed.
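The 48G figure is consistent with a simple count, sketched below under stated assumptions: four fp32 values per parameter (weight, gradient, and the two AdamW moment estimates), and stage-1 sharding that partitions only the optimizer moments across GPUs. This breakdown, the function name and the GPU count of 8 are assumptions of the example, not details given in the text.

```python
def training_state_bytes(num_params, num_gpus=1, bytes_per_value=4):
    """Per-GPU bytes for weights, gradients and AdamW moments (fp32)."""
    weights = num_params * bytes_per_value
    grads = num_params * bytes_per_value
    adam_states = 2 * num_params * bytes_per_value  # first and second moments
    # Assumption: only the optimizer moments are sharded across GPUs (stage 1).
    return weights + grads + adam_states / num_gpus


GB = 1e9
replicated = training_state_bytes(3e9) / GB           # every GPU holds everything
sharded = training_state_bytes(3e9, num_gpus=8) / GB  # moments split over 8 GPUs
```

Under these assumptions the fully replicated state is 48 GB per GPU, and sharding the moments over 8 GPUs reduces it to 27 GB, before counting activations.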

Activation check-pointing . Feature maps in the Transformer layers also consume a lot of GPU memory, which can create bottlenecks when image and window resolutions are high. The activation check-pointing technology can significantly reduce the memory consumption, while the training speed is up to 30% slower.

Sequential self-attention computation. To train large models on very large resolutions, for example, an image of 1,536 $\times$ 1,536 resolution with a window size of 32 $\times$ 32, the memory of regular A100 GPUs (40GB) is still insufficient, even with the above two optimization technologies. We found that in this case, the self-attention module constitutes a bottleneck. To alleviate this problem, we implement self-attention computation sequentially, instead of using the previous batch computation approach. This optimization is applied to the layers in the first two stages and has little impact on the overall training speed.
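To give a sense of why the attention module dominates at this resolution, the following sketch counts attention-logit elements in the first stage. It assumes a patch size of 4 and 16 heads (both assumptions of this example), and it interprets "sequential" as processing a limited number of windows at a time; the text does not spell out the exact scheme, so this is only one plausible reading.

```python
def attention_logit_elements(image_size, window_size, num_heads,
                             patch_size=4, windows_at_once=None):
    """Return (number of windows, logit elements held in memory at once)."""
    tokens_per_side = image_size // patch_size
    num_windows = (tokens_per_side // window_size) ** 2
    tokens_per_window = window_size ** 2
    per_window = num_heads * tokens_per_window ** 2  # one N x N map per head
    if windows_at_once is None:  # batched: all windows materialized together
        live_windows = num_windows
    else:                        # sequential: only a chunk of windows at a time
        live_windows = min(windows_at_once, num_windows)
    return num_windows, per_window * live_windows


num_windows, batched = attention_logit_elements(1536, 32, num_heads=16)
_, sequential = attention_logit_elements(1536, 32, num_heads=16, windows_at_once=1)
```

With these assumptions there are 144 windows of 1,024 tokens each, so materializing the logits one window at a time holds 144 times fewer elements than the batched computation.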

With these implementations, we managed to train a 3B model using the Nvidia A100-40G GPUs for COCO object detection with an input image resolution of 1,536 $\times$ 1,536, and Kinetics-400 action classification with an input resolution of $320\times 320\times 8$ .

6 Model configurations

We maintain the stage, block, and channel settings of the original Swin Transformer for 4 configurations of Swin Transformer V2:

SwinV2-T: $C=96$ , #. block = $\{2,2,6,2\}$

SwinV2-S/B/L: $C=96/128/192$ , #. block = $\{2,2,18,2\}$

with $C$ the number of channels in the first stage.

We further scale up Swin Transformer V2 to its huge size and giant size, with 658 million parameters and 3 billion parameters, respectively:

SwinV2-H: $C=352$ , #. block = $\{2,2,18,2\}$

SwinV2-G: $C=512$ , #. block = $\{2,2,42,4\}$

For SwinV2-H and SwinV2-G, we add an additional layer normalization layer on the main branch every 6 layers. To save experimental time, we only employ SwinV2-G for large-scale experiments. SwinV2-H is employed for another parallel study about self-supervised learning .
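The quoted parameter counts can be roughly sanity-checked from the configurations alone. The sketch below assumes, as in the original Swin Transformer, that the channel width doubles at each stage and that the MLP expansion ratio is 4; it counts only the attention and MLP projection matrices and ignores patch embedding, patch merging, normalization layers, biases, position-bias networks and task heads. The function name is hypothetical.

```python
CONFIGS = {
    "SwinV2-T": (96, (2, 2, 6, 2)),
    "SwinV2-S": (96, (2, 2, 18, 2)),
    "SwinV2-B": (128, (2, 2, 18, 2)),
    "SwinV2-L": (192, (2, 2, 18, 2)),
    "SwinV2-H": (352, (2, 2, 18, 2)),
    "SwinV2-G": (512, (2, 2, 42, 4)),
}


def estimate_block_params(channels, depths, mlp_ratio=4):
    """Rough count of projection weights in all Transformer blocks."""
    total = 0
    for stage, depth in enumerate(depths):
        c = channels * 2 ** stage      # assumption: width doubles per stage
        attention = 4 * c * c          # query, key, value and output projections
        mlp = 2 * mlp_ratio * c * c    # two linear layers with expansion mlp_ratio
        total += depth * (attention + mlp)
    return total


estimates = {name: estimate_block_params(c, d) for name, (c, d) in CONFIGS.items()}
```

This estimate gives about 2.95 billion for SwinV2-G and about 633 million for SwinV2-H, in line with the 3 billion and 658 million figures once the omitted components are taken into account.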

Experiments

We conduct experiments on ImageNet-1K image classification (V1 and V2) , COCO object detection , and ADE20K semantic segmentation . For the 3B model experiments, we also report the accuracy on Kinetics-400 video action recognition .

Image classification. ImageNet-1K V1 and V2 val are used for evaluation. ImageNet-22K, which has 14M images and 22K categories, is optionally employed for pre-training. For pre-training our largest model, SwinV2-G, a privately collected ImageNet-22K-ext dataset with 70 million images is used. For this dataset, a duplicate removal process is conducted to exclude overlapping images with ImageNet-1K V1 and V2 validation sets.

Object detection. COCO is used for evaluation. For our largest model experiments, we employ an additional detection pre-training phase using Object 365 v2 dataset , in-between the image classification pre-training phase and the COCO fine-tuning phase.

Video action classification. Kinetics-400 (K400) is used in evaluation.

The pre-training and fine-tuning settings will be detailed in Appendix.

2 Scaling Up Experiments

We first present the results on various representative visual benchmarks by scaling up models to 3 billion parameters and to high image/window resolutions.

We adopt a smaller $192\times 192$ image resolution in pre-training to save on training costs. We take a 2-step pre-training approach. First, the model is pre-trained using a self-supervised method on the ImageNet-22K-ext dataset for 20 epochs. Second, the model is further pre-trained for 30 epochs using the image classification task on this dataset. Detailed pre-training and fine-tuning setups are described in the appendix.

In the following paragraphs, we report the accuracy of SwinV2-G on representative vision benchmarks. Note that since our main goal is to explore how to feasibly scale up model capacity and window resolution, and whether the vision tasks can benefit from significantly larger capacity, we did not particularly align complexities or pre-training data in comparisons.

ImageNet-1K image classification results

Table 2 compares the SwinV2-G model with previously largest/best vision models on ImageNet-1K V1 and V2 classification. SwinV2-G is the largest dense vision model to date. It achieves a top-1 accuracy of 84.0% on the ImageNet V2 benchmark, which is +0.7% higher than the previous best one (83.3%). Our accuracy on ImageNet-1K V1 is marginally lower (90.17% vs 90.88%). The performance difference might come from different degrees of dataset over-tuning. Also note that we employ far fewer training iterations and lower image resolutions than those in previous efforts, while still performing very well.

We also compare SwinV2-B and SwinV2-L to the original SwinV1-B and SwinV1-L, respectively, where gains of +0.8% and +0.4% are observed. The smaller gain of SwinV2-L compared with that of SwinV2-B may imply that beyond this size, more labeled data, stronger regularization, or advanced self-supervised learning methods are required.

COCO object detection results

Table 3 compares the SwinV2-G model with previous best results on COCO object detection and instance segmentation. It achieves 63.1/54.4 box/mask AP on COCO test-dev, which is +1.8/1.4 higher than the previous best numbers (61.3/53.0 by ). This suggests that scaling up vision models is beneficial for the dense vision recognition task of object detection. Our approach can use a different window size at test time to gain additional benefit, probably attributed to the effective Log-spaced CPB approach.

ADE20K semantic segmentation results

Table 4 compares the SwinV2-G model with previous best results on the ADE20K semantic segmentation benchmark. It achieves 59.9 mIoU on ADE20K val set, +1.5 higher than the previous best number (58.4 by ). This suggests scaling up vision model is beneficial for pixel-level vision recognition tasks. Using a larger window size at test time can additionally bring +0.2 gains, probably attributed to the effective Log-spaced CPB approach.

Kinetics-400 video action classification results

Table 5 compares the SwinV2-G model with previous best results on the Kinetics-400 action classification benchmark. It achieves 86.8% top-1 accuracy, +1.4% higher than previous best number . This suggests that scaling up vision models also benefits video recognition tasks. In this scenario, using a larger window size at test time can also bring additional benefits of +0.2%, probably attributed to the effective Log-spaced CPB approach.

3 Ablation Study

Table 6 ablates the performance of applying the proposed res-post-norm and scaled cosine attention approaches to Swin Transformer. Both techniques improve the accuracy at the tiny, small and base sizes, with overall improvements of +0.2%, +0.4% and +0.5%, respectively, indicating that the techniques are more beneficial for larger models. They also turn out to benefit the ViT architecture (+0.4%). The proposed normalization approach also performs better than some other normalization methods, as shown in Table 7.

More importantly, the combination of post-norm and scaled cosine attention stabilizes the training. As shown in Figure 2, while the activation values at deeper layers of the original Swin Transformer nearly explode at large (L) size, those of the new version behave much more mildly. On a huge size model, the self-supervised pre-training diverges using the original Swin Transformer, while it trains well with a Swin Transformer V2 model.

Scaling up window resolution by different approaches

Tables 1 and 8 ablate the performance of 3 approaches by scaling window resolutions from $256\times 256$ in pre-training to larger sizes in 3 down-stream vision tasks of ImageNet-1K image classification, COCO object detection, and ADE20K semantic segmentation, respectively. It can be seen that: 1) Different approaches have similar accuracy in pre-training (81.7%-81.8%); 2) When transferred to down-stream tasks, the two continuous position bias (CPB) approaches perform consistently better than the parameterized position bias approach used in Swin Transformer V1. Compared to the linear-spaced approach, the log-spaced version is marginally better; 3) The larger the change in resolutions between pre-training and fine-tuning, the larger the benefit of the proposed log-spaced CPB approach.

In Tables 1 and 8, we also report the accuracy using targeted window resolutions without fine-tuning (see the first number in each column in the ImageNet-1K experiments). The recognition accuracy remains reasonable even when the window size is enlarged from $8$ to $24$ (78.9% versus 81.8%), while the top-1 accuracy of the original approach significantly degrades from 81.7% to 68.7%. Also note that without fine-tuning, using a window size of $12$ that the pre-trained model has never seen before can even be +0.4% higher than the original accuracy. This suggests that we can improve accuracy through test-time window adjustment, as also observed in Tables 3, 4 and 5.

Conclusion

We have presented techniques for scaling Swin Transformer up to 3 billion parameters and making it capable of training with images of up to 1,536 $\times$ 1,536 resolution, including the res-post-norm and scaled cosine attention to make the model easier to scale up in capacity, as well as a log-spaced continuous relative position bias approach which lets the model be transferred more effectively across window resolutions. The adapted architecture is named Swin Transformer V2, and by scaling up capacity and resolution, it sets new records on 4 representative vision benchmarks. With these strong results, we hope to stimulate more research in this direction so that we can eventually close the capacity gap between vision and language models and facilitate the joint modeling of the two domains.

Acknowledgement

We thank many colleagues at Microsoft for their help, in particular, Eric Chang, Lidong Zhou, Jing Tao, Aaron Zhang, Edward Cui, Bin Xiao, Lu Yuan, Peng Cheng, Fan Yang for useful discussion and the help on GPU resources and datasets.

A1 Experimental Settings for Ablation

This section describes the experimental settings for ablation, including models of SwinV2-T, SwinV2-S, and SwinV2-B, and tasks of ImageNet-1K image classification, COCO object detection and ADE semantic segmentation.

All ablation studies use the ImageNet-1K image classification task for pre-training. We adopt an input image size (window size) of 256 $\times$ 256 (8 $\times$ 8). (Most of our experiments have the window size as an even number to make the window shifting offset divisible by the window size. Nevertheless, an odd window size also works well, as is the case in the original Swin Transformer ( $7\times 7$ ).) Following , we employ an AdamW optimizer for 300 epochs using a cosine decay learning rate scheduler with 20 epochs of linear warm-up. A batch size of 1024, an initial learning rate of $1\times 10^{-3}$ , a weight decay of 0.05, and gradient clipping with a max norm of 5.0 are used. Augmentation and regularization strategies include RandAugment, Mixup, Cutmix, random erasing and stochastic depth. An increasing degree of stochastic depth augmentation is employed for larger models, i.e. $0.2,0.3,0.5$ for tiny, small, and base models, respectively.
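For concreteness, the schedule described here (linear warm-up for 20 epochs followed by cosine decay over 300 epochs, with a peak learning rate of $1\times 10^{-3}$) can be sketched as below. The warm-up starting from zero, the final learning rate of zero and the per-epoch granularity are assumptions of this example; the actual implementation may update per iteration and use a non-zero floor.

```python
import math


def learning_rate(epoch, base_lr=1e-3, warmup_epochs=20, total_epochs=300, min_lr=0.0):
    """Linear warm-up followed by cosine decay (min_lr is an assumed floor)."""
    if epoch < warmup_epochs:
        return base_lr * epoch / warmup_epochs
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


schedule = [learning_rate(e) for e in range(301)]
```

The rate rises linearly to its peak at epoch 20, passes half of the peak midway through the decay phase (epoch 160), and reaches the floor at epoch 300.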

A1.2 Fine-tuning on various tasks

For ImageNet-1K image classification experiments, we conduct a fine-tuning step if the input image resolution is larger than that in the pre-training step. The fine-tuning lasts for 30 epochs, with an AdamW optimizer, a cosine decay learning rate scheduler with an initial learning rate of $4\times 10^{-5}$ , a weight decay of $1\times 10^{-8}$ , and the same data augmentation and regularizations as those in the first stage.

COCO object detection

We use cascade mask R-CNN implemented in mmdetection as the object detection framework. In training, a multi-scale augmentation with the shorter side between 480 and 800 and the longer side of 1333 is used. The window size is set to 16 $\times$ 16. An AdamW optimizer with an initial learning rate of $1\times 10^{-4}$ , a weight decay of 0.05, a batch size of 16, and a 3 $\times$ scheduler are used.

ADE20K semantic segmentation

We adopt an image size (window size) of 512 $\times$ 512 (16 $\times$ 16). In training, we employ an AdamW optimizer with an initial learning rate of $4\times 10^{-5}$ , a weight decay of 0.05, a learning rate scheduler that uses linear learning rate decay and a linear warm-up of 1,500 iterations. Models are trained with batch size of 16 for 160K iterations. We follow the mmsegmentation codebase to adopt augmentations of random horizontal flipping, random re-scaling within ratio range [0.5, 2.0] and a random photometric distortion. Stochastic depth with ratio of $0.3$ is applied for all models. A layer-wise learning rate decay of 0.95 is adopted for all experiments.
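Layer-wise learning rate decay is commonly implemented by giving the last layer the base learning rate and multiplying by the decay factor for each step toward the input. The sketch below follows that common convention, which is an assumption here since the text does not state the exact grouping; the function name and the choice of 24 layers (the total block count of a $\{2,2,18,2\}$ configuration) are illustrative.

```python
def layerwise_learning_rates(base_lr, num_layers, decay):
    """Learning rate per layer, from the first (index 0) to the last layer."""
    # The last layer uses base_lr; each earlier layer is scaled by one more factor of decay.
    return [base_lr * decay ** (num_layers - 1 - i) for i in range(num_layers)]


rates = layerwise_learning_rates(base_lr=4e-5, num_layers=24, decay=0.95)
```

With a decay of 0.95 over 24 layers, the first layer trains at roughly 0.31 times the base learning rate, so pre-trained low-level features are changed more conservatively than the later layers.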

A2 Experimental Settings for System-Level Comparison

Tables 2, 3 and 4 include results of SwinV2-B and SwinV2-L. For these experiments, we first conduct ImageNet-22K pre-training, and then fine-tune the pre-trained models on individual down-stream recognition tasks.

Both models use an input image size (window size) of 192 $\times$ 192 (12 $\times$ 12). We employ an AdamW optimizer for 90 epochs using a cosine learning rate scheduler with 5-epoch linear warm-up. A batch size of 4096, an initial learning rate of 0.001, a weight decay of 0.1, and gradient clipping with a max norm of 5.0 are used. Augmentation and regularization strategies include RandAugment , Mixup , Cutmix , random erasing and stochastic depth with ratio of 0.2.

ImageNet-1K image classification

We consider input image sizes of 256 $\times$ 256 and 384 $\times$ 384. The training length is set to 30 epochs, with a batch size of 1024, a cosine decay learning rate scheduler with an initial learning rate of $4\times 10^{-5}$ , and a weight decay of $1\times 10^{-8}$ . The ImageNet-1K classification weights are also initialized from the corresponding ones in the ImageNet-22K model.

COCO object detection

We adopt HTC++ for experiments. In data pre-processing, we use InstaBoost and multi-scale training with an input image size of 1536 $\times$ 1536, a window size of 32 $\times$ 32, and a random scale in $[0.1,2.0]$. An AdamW optimizer with an initial learning rate of $4\times 10^{-4}$, a batch size of 64, a weight decay of 0.05, and a $3\times$ schedule is used. The backbone learning rate is set to $0.1\times$ the head learning rate. In inference, soft-NMS is used. Both single-scale and multi-scale test results are reported.
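The backbone/head learning rate split above could be expressed as optimizer parameter groups along the following lines. This is a sketch under stated assumptions: it operates on parameter names only, and it assumes that backbone parameters share a `backbone.` name prefix, which is a naming convention we adopt for illustration rather than something specified in the text. `build_param_groups` is a hypothetical helper.

```python
def build_param_groups(param_names, head_lr=4e-4, backbone_mult=0.1,
                       backbone_prefix="backbone."):
    """Split parameter names into backbone and head groups with separate
    learning rates (backbone lr = backbone_mult * head lr)."""
    backbone = [n for n in param_names if n.startswith(backbone_prefix)]
    head = [n for n in param_names if not n.startswith(backbone_prefix)]
    return [
        {"params": backbone, "lr": head_lr * backbone_mult},
        {"params": head, "lr": head_lr},
    ]


if __name__ == "__main__":
    # Hypothetical parameter names, for illustration only.
    names = ["backbone.stage1.weight", "backbone.stage2.weight", "roi_head.fc.weight"]
    for group in build_param_groups(names):
        print(group)
```

With a real model, the name lists would be replaced by the corresponding parameter tensors before the groups are passed to the optimizer.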

ADE20K semantic segmentation

The input image size (window size) is set to 640 $\times$ 640 (40 $\times$ 40). We employ an AdamW optimizer with an initial learning rate of $6\times 10^{-5}$, a weight decay of 0.05, and a linearly decayed learning rate scheduler with 375-iteration linear warm-up. The model is trained with a batch size of 64 for 40K iterations. We follow the default settings in mmsegmentation for data augmentation, including random horizontal flipping, random re-scaling within the ratio range $[0.5,2.0]$, and random photometric distortion. Stochastic depth with a ratio of $0.3$ is applied.

A2.2 SwinV2-G Settings

The model is first pre-trained using a self-supervised learning approach on the ImageNet-22K-ext dataset (70 million images) for 20 epochs. To reduce experimental overheads, we adopt a smaller image size of 192 $\times$ 192. The model is trained using the AdamW optimizer with a cosine decay learning rate scheduler with 30,000 steps of linear warm-up. A batch size of 9216, an initial learning rate of $1.4\times 10^{-3}$, a weight decay of 0.1, and gradient clipping with a max norm of 100.0 are used. A light data augmentation strategy is employed: random resized cropping with a scale range of [0.67, 1] and an aspect ratio range of [3/4, 4/3], followed by a random flipping step and a color normalization step.
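The random resized cropping step above can be illustrated by the following sketch, which samples a crop box for a given image size. It assumes the widely used convention of drawing the area fraction uniformly from the scale range and the aspect ratio log-uniformly from the ratio range; the fallback to the whole image after repeated failures is a simplification of ours. `sample_crop` is a hypothetical name, and the sampled box would subsequently be resized to 192 $\times$ 192.

```python
import math
import random


def sample_crop(width, height, scale=(0.67, 1.0), ratio=(3 / 4, 4 / 3),
                rng=random, tries=10):
    """Return a crop box (x, y, w, h) inside a width x height image."""
    area = width * height
    log_lo, log_hi = math.log(ratio[0]), math.log(ratio[1])
    for _ in range(tries):
        target_area = area * rng.uniform(*scale)           # fraction of image area
        aspect = math.exp(rng.uniform(log_lo, log_hi))     # width / height
        w = int(round(math.sqrt(target_area * aspect)))
        h = int(round(math.sqrt(target_area / aspect)))
        if 0 < w <= width and 0 < h <= height:
            x = rng.randint(0, width - w)
            y = rng.randint(0, height - h)
            return x, y, w, h
    # Simplified fallback (an assumption): use the whole image.
    return 0, 0, width, height


if __name__ == "__main__":
    rng = random.Random(0)
    for _ in range(3):
        print(sample_crop(500, 375, rng=rng))
```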

Stage-2 supervised pre-training

The model is further pre-trained using the class labels on the ImageNet-22K-ext dataset. We employ an AdamW optimizer for 30 epochs, using a cosine decayed learning rate scheduler with 20,000 steps of linear warm-up. A batch size of 9216, an initial learning rate of $1.4\times 10^{-3}$, a layer-wise learning rate decay of 0.87, a weight decay of 0.1, and gradient clipping with a max norm of 100.0 are used. Augmentation and regularization strategies include RandAugment, random erasing, and a stochastic depth ratio of 0.3.
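Layer-wise learning rate decay, as used above, assigns geometrically smaller learning rates to earlier layers. A minimal sketch follows; it assumes that the last layer keeps the full learning rate and that each earlier layer is scaled by a further factor of the decay. How parameters are grouped into layers is not specified here, so `num_layers` below is a hypothetical value, and `layerwise_lr_scales` is a hypothetical helper.

```python
def layerwise_lr_scales(num_layers, decay=0.87):
    """Per-layer learning rate multipliers, ordered from the first (shallowest)
    layer to the last; the last layer has multiplier 1.0."""
    return [decay ** (num_layers - 1 - i) for i in range(num_layers)]


if __name__ == "__main__":
    base_lr = 1.4e-3
    num_layers = 6  # hypothetical layer count, for illustration only
    for i, s in enumerate(layerwise_lr_scales(num_layers)):
        print(f"layer {i}: scale={s:.4f}, lr={base_lr * s:.3e}")
```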

Fine-tuning on ImageNet-1K image classification

We adopt an input image size of 640 $\times$ 640 for experiments. An AdamW optimizer is employed for 10 epochs, using a cosine decayed learning rate scheduler and a 2-epoch linear warm-up. A batch size of 576, an initial learning rate of $2.1\times 10^{-5}$, a weight decay of 0.1, and gradient clipping with a max norm of 100.0 are used. Augmentation and regularization strategies include RandAugment, random erasing, and a stochastic depth ratio of 0.5.

In evaluation, we test top-1 accuracy on both ImageNet-1K V1 and V2.

Fine-tuning on COCO object detection

We first conduct intermediate fine-tuning using the Objects-365 V2 dataset. In this stage, we remove the mask branch of the HTC++ framework because there are no mask annotations. The input image resolution and window size are set as $ $ and $32\times 32$, respectively. In training, an AdamW optimizer with an initial learning rate of $1.2\times 10^{-3}$, a weight decay of 0.05, and a batch size of 96 is used, and the training length is set to 67,500 steps.

Then we fine-tune the HTC++ model on the COCO dataset, with the mask branch randomly initialized and other model weights loaded from the Objects-365-V2 pre-trained model. In this training stage, the input image resolution is set to 1536 $\times$ 1536 with a multi-scale ratio of $[0.1,2.0]$. The window size is set to 32 $\times$ 32. The AdamW optimizer is employed, with an initial learning rate of $6\times 10^{-4}$, a weight decay of 0.05, and a batch size of 96, and the model is trained for 45,000 steps.

At test time, Soft-NMS is used. Window sizes of both $32\times 32$ and $48\times 48$ are considered.

Fine-tuning on ADE20K semantic segmentation

The input image size (window size) is set to 640 $\times$ 640 (40 $\times$ 40). An AdamW optimizer is employed, with an initial learning rate of $4\times 10^{-5}$, a weight decay of 0.05, a linearly decayed learning rate scheduler over 80K iterations, a batch size of 32, and a linear warm-up of 750 iterations. For augmentations, we follow the default settings in mmsegmentation to include random horizontal flipping, random re-scaling within the ratio range $[0.5,2.0]$, and random photometric distortion. The stochastic depth ratio is set to $0.4$.

Fine-tuning on Kinetics-400 video action recognition

A 2-stage fine-tuning process is employed. In the first stage, an input resolution of 256 $\times$ 256 $\times$ 8 with a window size of 16 $\times$ 16 $\times$ 8 is adopted. We employ the AdamW optimizer for 20 epochs using a cosine decayed learning rate scheduler with 2.5-epoch linear warm-up. Other training hyper-parameters are a batch size of 80, an initial learning rate of $3.6\times 10^{-4}$, and a weight decay of 0.1.

In the second stage, we further fine-tune the model using a larger input video resolution of 320 $\times$ 320 $\times$ 8 with a window size of 20 $\times$ 20 $\times$ 8. We employ the AdamW optimizer for 5 epochs using a cosine decayed learning rate scheduler with 1-epoch linear warm-up. A batch size of 64, an initial learning rate of $5\times 10^{-5}$, and a weight decay of 0.1 are used.
