SimMIM: A Simple Framework for Masked Image Modeling

Zhenda Xie, Zheng Zhang, Yue Cao, Yutong Lin, Jianmin Bao, Zhuliang Yao, Qi Dai, Han Hu

Introduction

“What I cannot create, I do not understand.”

“Masked signal modeling” is one such task that learns to create: masking a portion of input signals and trying to predict these masked signals. In NLP, following this philosophy, self-supervised learning approaches built on masked language modeling tasks have largely reshaped the field, i.e., learning very large-scale language models from huge amounts of unlabeled data has been shown to generalize well to a broad range of NLP applications.

In computer vision, although there were pioneers leveraging this philosophy for self-supervised representation learning, in previous years this line of work was almost buried by contrastive learning approaches. The difficulty of applying this task to the visual domain, as opposed to language, can be explained by differences between the two modalities. One difference is that images exhibit stronger locality: pixels that are close to each other tend to be highly correlated, so the task can be solved by duplicating nearby pixels rather than by semantic reasoning. Another difference is that visual signals are raw and low-level, while text tokens are human-generated high-level concepts. This raises the question of whether predicting low-level signals is useful for high-level visual recognition tasks. A third difference is that the visual signal is continuous, while the text token is discrete. It is unknown how classification-based masked language modeling approaches can be adapted to handle continuous visual signals well.

Only recently have there been attempts to bridge the modality gaps and resolve these obstacles by introducing several special designs, for example, by converting continuous signals into color clusters, by patch tokenization using an additional network, or by a block-wise masking strategy to break short-range connections. Through these special designs, the learned representations proved to transfer well to several visual recognition tasks.

Random masking is applied to image patches, which is simple and convenient for vision Transformers. For masked pixels, either a larger patch size or a higher masking ratio results in a smaller chance of finding visible pixels nearby. For a large masked patch size of 32, the approach achieves competitive performance over a wide range of masking ratios (10%-70%). For a small masked patch size of 8, the masking ratio needs to be as high as 80% to perform well. Note that the preferred masking ratios are very different from those in the language domain, where a small masking ratio of 0.15 is adopted by default. We hypothesize that the different degrees of information redundancy in the two modalities may lead to the different behaviors.

A raw pixel regression task is used. The regression task aligns well with the continuous nature of visual signals, which possess an ordering property. This simple task performs no worse than the classification approaches with classes specially defined by tokenization, clustering, or discretization.

An extremely lightweight prediction head (e.g., a linear layer) is adopted, which achieves similar or slightly better transfer performance than heavier prediction heads (e.g., an inverse Swin-B). The use of an extremely lightweight prediction head brings a remarkable speedup in pre-training. In addition, we note that a broad range of target resolutions (e.g., $12^{2}$ - $96^{2}$) perform competitively with the highest resolution of $192^{2}$. While heavier heads or higher resolutions generally result in greater generation capability, this greater capability does not necessarily benefit down-stream fine-tuning tasks.

Though simple, the proposed SimMIM approach is very effective for representation learning. Using ViT-B, it achieves 83.8% top-1 fine-tuning accuracy on ImageNet-1K, surpassing the previous best approach by +0.6%. SimMIM has also been shown to scale to larger models: with a SwinV2-H model (658M parameters), it achieves 87.1% top-1 accuracy on ImageNet-1K classification, which is the highest number among methods that use ImageNet-1K data only. This result encourages the use of self-supervised learning to address the growing data-hunger problem caused by quickly rising model capacity. In fact, with the help of SimMIM, we successfully trained a SwinV2-G model with 3 billion parameters using $\sim$40$\times$ less data than Google’s JFT-3B dataset, and set new records on several representative benchmarks: 84.0% top-1 accuracy on ImageNet-V2 classification, 63.1/54.4 box/mask mAP on COCO object detection, 59.9 mIoU on ADE20K semantic segmentation, and 86.8% top-1 accuracy on Kinetics-400 action recognition.

In recent years we have witnessed an increasing overlap between NLP and computer vision in basic modeling, learning algorithms, and multi-modal applications, which aligns well with how human brains achieve general intelligence capabilities. We hope that our demonstration of “masked signal modeling” in computer vision can drive this trend a bit further and encourage deeper interaction between different AI fields.

Related Work

Masked language modeling and its auto-regressive variants are the dominant self-supervised learning approaches in the field of natural language processing (NLP). Given visible tokens in a sentence or a sentence pair / triplet, the approaches learn representations by predicting invisible tokens of the input. This line of approaches has reshaped the field over roughly the last 3 years: it enables the learning of very large language models and, by leveraging huge data, generalizes well to a broad range of language understanding and generation tasks.

Masked image modeling progressed in parallel with the MLM task in NLP but remained outside the mainstream for a long time. The context encoder approach is a pioneering work in this direction, which masks a rectangular area of the original image and predicts the missing pixels. CPC predicts patches via a verification task in each batch with a contrastive predictive coding loss. Recently, iGPT, ViT and BEiT revisited this learning approach on modern vision Transformers and showed strong potential in representation learning by introducing special designs on some components, such as clustering on pixels, prediction of mean color, and tokenization via an additional dVAE network with a block-wise masking strategy. In contrast to these complex designs, we present an extremely simple framework, SimMIM, which shows similar or even slightly better effectiveness.

Reconstruction-based methods are also related to our approach, particularly the auto-encoder approaches. As in our approach, they adopt a reconstruction task to recover the original signals. However, they are based on a different philosophy, that of visible signal reconstruction rather than the creation or prediction of invisible signals as in our approach. They thus progress along a very different path, studying how to effectively regularize the task learning through proper regularization or architecture bottlenecks.

Beyond representation learning, masked image modeling is a classical computer vision problem, named image inpainting. This problem has been extensively studied in computer vision for a long time, with the aim of improving inpainting quality and without connection to self-supervised representation learning. While we advocate image inpainting as a strong self-supervised pre-text task, we also find that stronger inpainting capability does not necessarily lead to stronger fine-tuning performance on down-stream tasks.

The approach in this paper is also related to compressed sensing, which affirms that most of the data we acquire, including image signals, can be thrown away with almost no perceptual loss. This claim is also partly supported by recent works on sparse inference, which show that recognition accuracy drops very little after discarding a large portion of image features. The observation in this paper goes further, to the input signals: with an extremely small portion of randomly selected image patches as input, i.e., 10%, the inpainting task can still be learnt to produce good visual representations.

During the last two decades, there have been numerous pretext tasks for learning visual representations in a self-supervised way: gray-scale image colorization, jigsaw puzzle solving, split-brain auto-encoding, rotation prediction, and learning to cluster. Though very different from masked image modeling, some of them interestingly also follow a philosophy of predicting the invisible parts of signals, e.g., using one or two color channels as input to predict the values of other channels. Another large portion of works lies in the contrastive learning approaches, which were the previous mainstream. We hope our work can encourage the study of masked image modeling as a pre-text task for self-supervised visual representation learning.

Approach

Our approach, SimMIM, learns representations through masked image modeling, which masks a portion of the input image signals and predicts the original signals in the masked area. The framework consists of 4 major components:

Masking strategy. Given an input image, this component designs how to select the area to mask, and how to implement masking of selected area. The transformed image after masking will be used as the input.

Encoder architecture. It extracts a latent feature representation for the masked image, which is then used to predict the original signals at the masked area. The learnt encoder is expected to be transferable to various vision tasks. In this paper, we mainly consider two typical vision Transformer architectures: a vanilla ViT and Swin Transformer.

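Prediction head. The prediction head is applied to the latent feature representation to produce one form of the original signals at the masked area.

Prediction target. This component defines the form of the original signals to predict, which can be either the raw pixel values or classes derived from them by clustering, tokenization, or discretization.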

In the following subsections, we will present typical options of each component. These options are then systematically studied. By combining simple designs of each component, we have been able to achieve strong representation learning performance.

Masking Strategy

For input transformation of masked area, we follow the NLP community and BEiT to use a learnable mask token vector to replace each masked patch. The token vector dimension is set the same as that of the other visible patch representation after patch embedding. For masking area selection, we study the following masking strategies (illustrated in Figure 2):

We first present a patch-aligned random masking strategy. Image patches are the basic processing units of vision Transformers, and it is convenient to operate the masking at the patch level, so that a patch is either fully visible or fully masked. For Swin Transformer, we consider equivalent patch sizes of different resolution stages, $4\times 4$ to $32\times 32$, and adopt $32\times 32$ by default, which is the patch size of the last stage. For ViT, we adopt $32\times 32$ as the default masked patch size.

We also try other masking strategies from previous works: 1) the context encoder introduces a central region masking strategy, which we relax to be randomly movable on the image; 2) BEiT introduces a complex block-wise masking strategy, which we try with two masked patch sizes of $16\times 16$ and $32\times 32$.
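The patch-aligned random masking described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function name `random_patch_mask`, the use of NumPy, and the rounding of the number of masked patches with a ceiling are all assumptions; the defaults (image size 192, masked patch size 32, masking ratio 0.6) are the defaults reported in this paper.

```python
import numpy as np

def random_patch_mask(image_size=192, mask_patch_size=32, mask_ratio=0.6, rng=None):
    """Return a boolean (image_size, image_size) mask; True marks a masked pixel.

    Hypothetical helper: every patch is either fully visible or fully masked.
    """
    if image_size % mask_patch_size != 0:
        raise ValueError("image_size must be divisible by mask_patch_size")
    rng = np.random.default_rng() if rng is None else rng
    grid = image_size // mask_patch_size            # patches per side
    num_patches = grid * grid
    # Assumption: round the number of masked patches up.
    num_masked = int(np.ceil(num_patches * mask_ratio))
    # Pick the masked patches uniformly at random, without replacement.
    flat = np.zeros(num_patches, dtype=bool)
    flat[rng.permutation(num_patches)[:num_masked]] = True
    patch_mask = flat.reshape(grid, grid)
    # Expand each patch-level decision to its block of pixels.
    return patch_mask.repeat(mask_patch_size, axis=0).repeat(mask_patch_size, axis=1)

mask = random_patch_mask(rng=np.random.default_rng(0))
print(mask.shape, mask.mean())  # mask shape and the fraction of masked pixels
```

In a full pipeline, the patch embeddings at masked positions would then be replaced by the learnable mask token; that step depends on the encoder and is not shown here.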

Prediction Head

The prediction head can be of arbitrary form and capacity, as long as its input conforms with the encoder output and its output accomplishes the prediction target. Some early works follow auto-encoders to employ a heavy prediction head (decoder). In this paper, we show that the prediction head can be made extremely lightweight, as light as a linear layer. We also try heavier heads such as a 2-layer MLP, an inverse Swin-T, and an inverse Swin-B.

Prediction Targets

The pixel values are continuous in the color space. A straightforward option is to predict the raw pixels of the masked area by regression. In general, vision architectures produce feature maps of down-sampled resolution, e.g., $16\times$ in ViT and $32\times$ for most other architectures. To predict all pixel values at the full resolution of the input images, we map each feature vector in the feature map back to the original resolution, and let this vector take charge of predicting the corresponding raw pixels.

For example, on the $32\times$ down-sampled feature maps produced by a Swin Transformer encoder, we apply a $1\times 1$ convolution (linear) layer with output dimension of $3072=32\times 32\times 3$ to stand for the RGB values of $32\times 32$ pixels. We also consider lower resolution targets by downsampling the original images by $\{32\times,16\times,8\times,4\times,2\times\}$ , respectively.
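To make the $3072=32\times 32\times 3$ bookkeeping concrete, the sketch below rearranges a grid of per-location prediction vectors into a full-resolution image. It is a hedged illustration: the function names are hypothetical, and the ordering of values inside each vector (row, then column, then channel) is an assumption that this section does not specify.

```python
import numpy as np

def head_output_to_image(pred, patch=32, channels=3):
    """Rearrange (H/patch, W/patch, patch*patch*channels) into (H, W, channels).

    Assumption: each vector stores its patch in (row, column, channel) order.
    """
    gh, gw, dim = pred.shape
    if dim != patch * patch * channels:
        raise ValueError("last dimension must equal patch * patch * channels")
    x = pred.reshape(gh, gw, patch, patch, channels)
    x = x.transpose(0, 2, 1, 3, 4)                  # (gh, patch, gw, patch, C)
    return x.reshape(gh * patch, gw * patch, channels)

def image_to_patch_vectors(image, patch=32):
    """Inverse rearrangement, useful for building regression targets."""
    h, w, c = image.shape
    x = image.reshape(h // patch, patch, w // patch, patch, c)
    x = x.transpose(0, 2, 1, 3, 4)                  # (gh, gw, patch, patch, C)
    return x.reshape(h // patch, w // patch, patch * patch * c)

# A 192x192 RGB input yields a 6x6 grid of 3072-dimensional vectors.
image = np.arange(192 * 192 * 3, dtype=np.float32).reshape(192, 192, 3)
vectors = image_to_patch_vectors(image)
print(vectors.shape)                                # (6, 6, 3072)
```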

Previous approaches mostly convert the masked signals to clusters or classes, and then perform a classification task for masked image prediction.

Color clustering. In iGPT, the RGB values are grouped into 512 clusters by $k$-means using a large number of natural images. Each pixel is then assigned to the closest cluster center. This approach requires an additional clustering step to generate the 9-bit color palette. In our experiments, we use the 512 cluster centers learnt in iGPT.

Vision tokenization. In BEiT, a discrete VAE (dVAE) network is employed to transform image patches to dVAE tokens. The token identity is used as the classification target. In this approach, an additional dVAE network needs to be pre-trained.

Channel-wise bin color discretization. The R, G, B channels are separately classified, with each channel discretized into equal bins, e.g., 8 and 256 bins used in the experiments.
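Two of the classification targets above are simple enough to sketch directly: assignment of pixels to the nearest color cluster center, and channel-wise discretization into equal bins. The code below is illustrative only. The function names are hypothetical, 8-bit input channels are assumed, and the cluster centers used in the example are made up (the experiments in this paper use the 512 centers learnt in iGPT, which are not reproduced here).

```python
import numpy as np

def assign_to_clusters(pixels, centers):
    """Return, for each RGB pixel (N, 3), the index of the closest center (K, 3)."""
    diff = pixels[:, None, :].astype(np.float64) - centers[None, :, :]
    return (diff ** 2).sum(axis=-1).argmin(axis=1)

def discretize_channels(values_uint8, num_bins=8):
    """Map 8-bit channel values to equal-width bin indices in [0, num_bins)."""
    if 256 % num_bins != 0:
        raise ValueError("num_bins must divide 256 for equal-width bins")
    return values_uint8.astype(np.int64) // (256 // num_bins)

# Toy palette with two hypothetical centers (black and white).
centers = np.array([[0, 0, 0], [255, 255, 255]], dtype=np.float64)
pixels = np.array([[10, 10, 10], [250, 240, 255]], dtype=np.uint8)
print(assign_to_clusters(pixels, centers))          # [0 1]
print(discretize_channels(np.array([0, 31, 32, 255], dtype=np.uint8)))  # [0 0 1 7]
```

With 256 bins the discretization is the identity on 8-bit values, so each channel becomes a 256-way classification problem.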

Evaluation protocols

Following prior work, we mainly evaluate the quality of learnt representations by fine-tuning on ImageNet-1K image classification, which is a more practical scenario. We mainly rely on this metric in our ablations. In the system-level comparison, we also follow previous works and report performance on the previously dominant metric of linear probing. Nevertheless, we do not focus on the linear probing metric, as our main goal is to learn representations that complement the subsequent down-stream tasks well.

Experiments

We adopt Swin-B as the default backbone in our ablation study, which also allows us to evaluate the learnt representations on downstream tasks such as object detection and semantic segmentation (see Appendix). To reduce experimental overhead, we use a default input image size of $192^{2}$ and set the window size to 6 to accommodate the changed input image size. The ImageNet-1K image classification dataset is used for both pre-training and fine-tuning.

In self-supervised pre-training, we employ an AdamW optimizer with a cosine learning rate scheduler, and train for 100 epochs. The training hyper-parameters are: a batch size of 2048, a base learning rate of 8e-4, a weight decay of 0.05, $\beta_{1}=0.9$, $\beta_{2}=0.999$, and a warm-up of 10 epochs. A light data augmentation strategy is used: random resized cropping with a scale range of $[0.67,1]$ and an aspect ratio range of $[3/4,4/3]$, followed by a random flipping step and a color normalization step.
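As a rough illustration of the pre-training schedule, the sketch below combines a 10-epoch warm-up with cosine decay over 100 epochs at a base learning rate of 8e-4. Several details are assumptions rather than statements from this section: the warm-up is taken to be linear, the rate is updated once per epoch rather than per iteration, the cosine decays to a hypothetical `min_lr` of 0, and the function name `lr_at_epoch` is invented.

```python
import math

def lr_at_epoch(epoch, base_lr=8e-4, warmup_epochs=10, total_epochs=100, min_lr=0.0):
    """Learning rate for a 0-indexed epoch under linear warm-up plus cosine decay."""
    if epoch < warmup_epochs:
        # Linear warm-up that reaches base_lr at the last warm-up epoch.
        return base_lr * (epoch + 1) / warmup_epochs
    # Cosine decay from base_lr towards min_lr over the remaining epochs.
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

for epoch in (0, 9, 10, 55, 99):
    print(epoch, lr_at_epoch(epoch))
```

The same shape applies to fine-tuning with a different base learning rate; only the constants change.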

In fine-tuning, we also employ an AdamW optimizer, 100-epoch training, and a cosine learning rate scheduler with 10-epoch warm-up. The fine-tuning hyper-parameters are: a batch size of 2048, a base learning rate of 5e-3, a weight decay of 0.05, $\beta_{1}=0.9$, $\beta_{2}=0.999$, a stochastic depth ratio of 0.1, and a layer-wise learning rate decay of 0.9. We follow the same data augmentation used in prior work, including RandAug, Mixup, Cutmix, label smoothing, and random erasing.
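Layer-wise learning rate decay assigns smaller learning rates to layers closer to the input. The sketch below shows one common convention, which is an assumption here since this section does not spell out the indexing: the task head keeps the base rate, and encoder layer `k` (0-indexed from the input) is scaled by `decay ** (num_layers - k)`. The function name and the layer count of 12 in the example are illustrative.

```python
def layerwise_lrs(base_lr=5e-3, num_layers=12, decay=0.9):
    """Per-layer learning rates, ordered from the input layer to the last encoder layer.

    Assumed convention: the head (not listed) uses base_lr, the last encoder
    layer uses base_lr * decay, and each earlier layer is decayed once more.
    """
    return [base_lr * decay ** (num_layers - k) for k in range(num_layers)]

rates = layerwise_lrs()
print(rates[0], rates[-1])  # smallest rate (input layer), largest rate (last layer)
```

In practice these values would be attached to per-layer parameter groups of the optimizer.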

Masking Strategy

We first study how different masking strategies affect the effectiveness of representation learning. The fine-tuning accuracy of different approaches under multiple masking ratios is summarized in Table 1.

We first notice that the best accuracy of our simple random masking strategy reaches 83.0%, which is +0.3% higher than the best of the other, more specially designed strategies, such as the block-wise masking of BEiT.

In addition, when a large masked patch size of 32 is adopted, this simple strategy performs stably well over a broad range of masking ratios (10%-70%). We hypothesize that the central pixel of a large masked patch may be distant enough from visible pixels. It thus forces the network to learn relatively long-range connections, even when a low masking ratio is used (e.g., 10%) or when none of the surrounding patches is masked. Another way to increase the prediction distance is to use a larger masking ratio, which is also shown to benefit the fine-tuning performance of relatively small patch sizes. By increasing the masking ratio from 0.4 to 0.8 at patch sizes of 4, 8 and 16, the accuracy improves smoothly by +0.2% (from 81.9% to 82.1%), +0.4% (from 82.0% to 82.4%), and +0.4% (from 82.4% to 82.8%), respectively. Nevertheless, the overall accuracy at these smaller patch sizes is not as high as that at the larger patch size of 32. Further increasing the patch size to 64 degrades accuracy, probably because the prediction distance becomes too large.

The above observations and analyses are also well reflected by a newly proposed AvgDist metric, which measures the average Euclidean distance from masked pixels to their nearest visible pixels. The AvgDist of different masking strategies w.r.t. varying masking ratios is shown in Figure 3(a). From this figure, we observe that the AvgDist of all masking strategies increases smoothly with growing masking ratios. For the random masking strategy, when the masked patch size is small, e.g., 4 or 8, the AvgDist is relatively low and grows slowly with increasing masking ratios. On the other hand, when the patch size is large, e.g., 64, even a very small masking ratio (e.g., 10%) yields a relatively large AvgDist. The square and block-wise methods produce AvgDist values as high as those of patch size 64.
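The AvgDist metric follows directly from its definition. The sketch below is a brute-force version under stated assumptions: distances are measured between integer pixel coordinates, the function name `avg_dist` is hypothetical, and the chunked loop only bounds memory. It is not the authors' implementation, and a distance transform would be faster for large masks.

```python
import numpy as np

def avg_dist(mask, chunk=256):
    """Mean Euclidean distance from each masked pixel to its nearest visible pixel.

    mask: 2-D boolean array, True marks a masked pixel.
    """
    masked = np.argwhere(mask)        # (M, 2) coordinates of masked pixels
    visible = np.argwhere(~mask)      # (V, 2) coordinates of visible pixels
    if len(masked) == 0 or len(visible) == 0:
        raise ValueError("mask must contain both masked and visible pixels")
    total = 0.0
    for start in range(0, len(masked), chunk):
        part = masked[start:start + chunk]
        diff = part[:, None, :] - visible[None, :, :]
        # Distance from each masked pixel in this chunk to its nearest visible pixel.
        total += np.sqrt((diff ** 2).sum(axis=-1)).min(axis=1).sum()
    return total / len(masked)

# Toy example: the left two columns of a 4x4 image are masked.
toy = np.zeros((4, 4), dtype=bool)
toy[:, :2] = True
print(avg_dist(toy))                  # 1.5
```

In the toy example, pixels in the first column are 2 pixels away from the nearest visible column and pixels in the second column are 1 pixel away, giving an average of 1.5.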

Figure 3(b) plots the relationship between fine-tuning accuracy and the AvgDist measure, which follows a ridge shape. The entries with high fine-tuning accuracy are concentrated in a moderate range of AvgDist, while entries with smaller or larger AvgDist perform worse. This indicates that the prediction distance in masked image modeling should be moderate, neither too large nor too small. Probably, a small distance in masked prediction lets the network learn too many short-range connections, while a large distance may be too difficult to learn. These results also indicate that AvgDist may be a good indicator of the effectiveness of masked image modeling.

In our experiments, we adopt a masking ratio of 0.6 with a patch size of 32 by default, due to its stable performance. Also note that the masking strategies and ratios explored in our work are very different from those in the language domain, which usually adopts a small masking ratio of 15%. We hypothesize that the different degrees of information redundancy in the two modalities may lead to the different behaviors.

Prediction Head

Table 2 ablates the effect of different prediction heads, including a linear layer, a 2-layer MLP, an inverse Swin-T and an inverse Swin-B. While heavier heads generally produce slightly lower losses, for example, 0.3722 (inverse Swin-B) versus 0.3743 (a linear layer), their transfer performance on the down-stream ImageNet-1K task is lower. This indicates that stronger inpainting capability does not necessarily result in better down-stream performance, probably because the capacity is largely wasted in the prediction head, which is not used in down-stream tasks. There is also a practical drawback: a heavier prediction head brings higher training costs, e.g., the training cost of using an inverse Swin-B is 2.3$\times$ that of a linear layer.

Also note that in previous contrastive learning approaches, it is common practice to use a multi-layer MLP head in the pre-text tasks instead of a linear layer, which makes the latent feature produced by the encoder moderately distant from the pre-text target and proves beneficial for the linear probing evaluation metric. In our work, we show that a single linear layer head, under a fine-tuning metric, achieves competitive or even the best transfer performance. This indicates that if our aim is to learn good features for fine-tuning, the extensive exploration of head design in contrastive learning approaches may not be necessary for masked image modeling.

Prediction Resolution

Table 3 ablates the effect of varying the target resolution. It shows that a large range of resolutions (e.g., $12^{2}$ - $192^{2}$) perform equally well. The transfer performance drops only at a low resolution of $6^{2}$, probably because this option throws too much information away. These results imply the information granularity required by the down-stream image classification task. The effects on other, more fine-grained down-stream tasks such as object detection or semantic segmentation will be explored in our future study.

Note that we adopt a default target resolution of $192^{2}$ in our experiments, due to the equally best transferring accuracy and the negligible computation overhead.

Prediction Target

Table 5 compares the effects of different prediction targets. Several observations can be drawn as follows:

Carefully defined classes by color clustering or tokenization perform slightly worse than ours;

This reveals that it is not necessary to align the target of masked image modeling with the classification-based target of masked language modeling. It is good to align the approach with the nature of visual signals.

While both auto-encoders and masked image modeling approaches learn a network by recovering the original signals, they are built on different philosophies: visible signal reconstruction versus prediction of invisible signals. In our framework, we can instantiate a reconstruction task by also regressing the raw pixel values of visible patches in the input.

Table 4 compares the approach that predicts only the masked area, as in our default setting, with an alternative that recovers both masked and unmasked areas. The approach predicting only the masked area performs significantly better than that recovering all image pixels (82.8% vs. 81.7%). This implies that the two tasks are fundamentally different in their internal mechanisms, and that the prediction task might be a more promising representation learning approach.
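The difference between the two settings compared in Table 4 comes down to which pixels contribute to the regression loss. The sketch below is illustrative: the function name is hypothetical, and the mean absolute error is an assumed choice of regression loss, since this section does not name the exact loss.

```python
import numpy as np

def regression_loss(pred, target, mask, masked_only=True):
    """Mean absolute error over masked pixels only, or over all pixels.

    pred, target: (H, W, C) float arrays; mask: (H, W) bool, True marks masked pixels.
    Assumption: an L1-style loss stands in for the unspecified regression loss.
    """
    err = np.abs(pred - target)
    if masked_only:
        return err[mask].mean()       # masked prediction (the default setting)
    return err.mean()                 # joint prediction and visible reconstruction

target = np.zeros((2, 2, 3))
pred = np.zeros((2, 2, 3))
pred[0, 0] = 1.0                      # error of 1 on the masked pixel
pred[1, 1] = 4.0                      # error of 4 on a visible pixel
mask = np.zeros((2, 2), dtype=bool)
mask[0, 0] = True
print(regression_loss(pred, target, mask, masked_only=True))   # 1.0
print(regression_loss(pred, target, mask, masked_only=False))  # 1.25
```

With `masked_only=True`, errors on visible pixels have no influence on training, which is consistent with the checkerboard artifacts on unmasked areas noted in the visualization section.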

Comparison to Previous Approaches on ViT-B

As previous works performed experiments on the ViT architectures, for fair comparison, we also conduct experiments using the ViT-B architecture.

In pre-training, 800 epochs with a cosine learning rate scheduler and a 20-epoch linear warm-up procedure are employed. All other hyper-parameters strictly follow the same settings as in the ablation study, except that we use a $224^{2}$ input resolution to be the same as in previous approaches. In fine-tuning, we adopt a layer-wise learning rate decay of 0.65, following prior work, and keep all other settings strictly the same as in our ablation study. In linear probing, we follow prior work and choose an intermediate layer of ViT-B that produces the best linear probing accuracy. 100-epoch training with a 5-epoch linear warm-up step is employed.

Table 6 compares our approach to previous ones on both metrics of fine-tuning and linear probing using ViT-B. Our approach achieves a top-1 accuracy of 83.8% by fine-tuning, which is +0.6% higher than the previous best approach. Also note that, thanks to its simplicity, our approach has the highest training efficiency among the compared approaches: it is 2.0$\times$, 1.8$\times$, $\sim$4.0$\times$, and 1.5$\times$ more efficient than DINO, MoCo v3, ViT, and BEiT (not counting the time for dVAE pre-training), respectively.

Though our main focus is to learn representations that are better for fine-tuning, we also report the linear probing accuracy of different approaches for reference.

Scaling Experiments with Swin Transformer

We adopt Swin Transformers of different model sizes for experiments, including Swin-B, Swin-L, SwinV2-H, and SwinV2-G. To reduce experimental overhead, we adopt a smaller image size of $192^{2}$ in pre-training, and a step learning rate scheduler so that experiments of different training lengths can reuse the model training of the first step. The base learning rate of the first learning rate step is set to 4e-4 and lasts for 7/8 of the total training epochs. The learning rate is divided by 10 for the remaining epochs. For model sizes of H and G, we use the SwinV2 variants, which are more stable than the original version. All models use the ImageNet-1K dataset for training, except that SwinV2-G uses a larger, privately collected ImageNet-22K-ext dataset.
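The step schedule described above can be written in a few lines. This is a simplified sketch: warm-up is omitted, the rate changes at epoch granularity, and the function name `step_lr` and its parameter names are hypothetical.

```python
def step_lr(epoch, total_epochs, base_lr=4e-4, first_step_fraction=7 / 8, divisor=10.0):
    """Learning rate for a 0-indexed epoch: base_lr for the first 7/8 of training,
    then base_lr / divisor for the remaining epochs. Warm-up is not modeled."""
    boundary = int(total_epochs * first_step_fraction)
    return base_lr if epoch < boundary else base_lr / divisor

# For an 800-epoch run the switch happens after epoch index 699.
print(step_lr(699, 800), step_lr(700, 800))
```

Because the rate is constant throughout the first step, a checkpoint from a shorter run's first step can serve as the starting point of a longer run, which is the reuse that the text refers to.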

When using ImageNet-1K for pre-training, all models are trained for 800 epochs, with most other hyper-parameters following those in the ablations. In fine-tuning, a larger image size of $224^{2}$ is employed. For SwinV2-H, we also consider a larger resolution of $512^{2}$. The fine-tuning length is set to 100 epochs, except for SwinV2-H, where 50 epochs are used. The layer-wise learning rate decay is set to 0.8, 0.75, and 0.7 for Swin-B, Swin-L, and SwinV2-H, respectively. Other fine-tuning hyper-parameters follow those in the ablations.

Table 7 lists the results of our approach with different model sizes, compared to the supervised counterparts. With SimMIM pre-training, all of Swin-B, Swin-L, and SwinV2-H achieve significantly higher accuracy than their supervised counterparts. In addition, the SwinV2-H model with a larger resolution of $512^{2}$ achieves 87.1% top-1 accuracy on ImageNet-1K, which is the highest number among methods that use ImageNet-1K data only.

While all previous billion-level vision models rely on Google’s JFT-3B dataset for model training, the proposed SimMIM approach is used to aid the training of a 3B SwinV2-G model using $\sim$40$\times$ less data than JFT-3B. It achieves strong performance on four representative vision benchmarks: 84.0% top-1 accuracy on ImageNet-V2 classification, 63.1/54.4 box/mask mAP on COCO object detection, 59.9 mIoU on ADE20K semantic segmentation, and 86.8% top-1 accuracy on Kinetics-400 action recognition. More details are described in the SwinV2 paper.

Visualization

In this section, we attempt to understand the proposed approach as well as some critical designs through visualizations. All example images are from the ImageNet-1K validation set.

Figure 4 shows the recovered images with several human-designed masks, to understand what capability is learnt through masked image modeling. The human-designed masks (from left to right) consist of a random mask, a mask that removes most parts of a major object, and a mask that removes all of the major object, respectively. We can draw the following observations: 1) when moderate parts of the major object are randomly masked, both the shape and texture of the masked parts can be well recovered, as shown by the penguin, the mountain, the sailboat, and the persons. On the unmasked area, there is a severe checkerboard artifact because the recovery of the unmasked area is not learnt during training; 2) when most parts of a major object are masked (more than 90%), the model can still predict the existence of the object from negligible clues; 3) when the objects are fully masked out, the masked area is inpainted with background textures.

These observations indicate that the approach has learnt strong reasoning ability of objects, and the ability is not due to memorization of image identities or the simple copying of nearby pixels.

We have shown the comparison of the representations learnt by a masked prediction task (our approach) and by a joint masked prediction and visible signal reconstruction task in Table 4, which reveals that the pure masked prediction task performs significantly better. Figure 5 compares the recovery effects of the two approaches. It shows that the latter approach produces better-looking results; however, model capacity is probably wasted on recovering the unmasked area, which may not be that useful for fine-tuning.

Figure 6 shows the recovery of an image with different masked patch sizes under a fixed masking ratio of 0.6. The details are recovered much better when the masked patch size is smaller; however, the learnt representations transfer worse. Probably, with a smaller patch size, the prediction task can be easily accomplished from nearby pixels or textures.

Conclusion

This paper presents a simple yet effective self-supervised learning framework, SimMIM, to leverage masked image modeling for representation learning. This framework is made as simple as possible: 1) a random masking strategy with a moderately large masked patch size; 2) prediction of raw RGB pixel values by a direct regression task; 3) a prediction head that can be as light as a linear layer. We hope our strong results as well as the simple framework can facilitate future study along this line, and encourage in-depth interaction between AI fields.

Acknowledgement

We thank many colleagues at Microsoft for their help, in particular, Li Dong, Furu Wei, Eric Chang, Lidong Zhou, Jing Tao, Aaron Zhang, Edward Cui, Peng Cheng and Fan Yang for useful discussion and the help on GPU resources and datasets.

Appendix

Appendix A Detailed Architectures

The detailed architecture specifications are shown in Table 8, where an input image size of $192\times 192$ is used for pre-training and $224\times 224$ is used in fine-tuning.

Appendix B The Effect of Learning Rate Schedulers

In our ablation study, we follow common practice to use a cosine learning rate scheduler. In our scaling up experiments, we adopt a step learning rate scheduler to reduce experimental overheads of potentially studying the effects of different training lengths.

In this section, we investigate the effects of different schedulers on fine-tuning accuracy. Both schedulers adopt 10-epoch linear warm-up. For the step learning rate scheduler, the base learning rate is set as 8e-4, and is decayed by a factor of 10 at 90% and 95% of the total training length. For this comparison, we follow the default settings used in ablation, except that the scheduler is changed. As shown in Table 9, the step scheduler performs marginally better than the cosine scheduler, by +0.1% using a 100-epoch pre-training, and by +0.3% using a longer 300-epoch training procedure.

Appendix C Results on Downstream Tasks

In this section, we add more results on several down-stream tasks, including iNaturalist (iNat) 2018 classification, COCO object detection and ADE20K semantic segmentation.

iNaturalist 2018 is a long-tail image classification dataset with more than 8,000 categories. It includes 437,513 training images and 24,426 validation images. We fine-tune the pre-trained models using an AdamW optimizer for 100 epochs. The fine-tuning hyper-parameters are: a batch size of 2048, a base learning rate of 1.6e-2, a weight decay of 0.05, $\beta_{1}=0.9$, $\beta_{2}=0.999$, a stochastic depth ratio of 0.1, and a layer-wise learning rate decay of 0.9. We follow the same data augmentation strategies used in prior work, including RandAug, Mixup, Cutmix, label smoothing, and random erasing.

For COCO object detection, a Mask R-CNN framework is adopted and all models are trained with a 3$\times$ schedule (36 epochs). We utilize an AdamW optimizer with a learning rate of 6e-5, a weight decay of 0.05, and a batch size of 32. Following prior work, we employ a large-scale jittering augmentation ($1024\times 1024$ resolution, scale range [0.1, 2.0]). The window size for Swin-B is set to 7 and that for the Swin-L and SwinV2-H models is 14.

For ADE20K semantic segmentation, a UPerNet framework is used, following prior work. We use an AdamW optimizer with the following hyper-parameters: a weight decay of 0.05, a batch size of 32, a layer-wise decay rate of 0.9, and a learning rate searched over 1e-4 and 3e-4. All models are trained for 80K iterations with an input resolution of $512\times 512$ and a window size of 20. In inference, a multi-scale test using resolutions that are [0.75, 0.875, 1.0, 1.125, 1.25]$\times$ of $512\times 2048$ is employed.
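For reference, the multi-scale test resolutions implied by the listed ratios can be enumerated with simple arithmetic. The helper name is hypothetical, and the sketch only scales the two numbers of the $512\times 2048$ base size; which of them is the height and which the width is not stated in this section.

```python
def multiscale_sizes(base=(512, 2048), ratios=(0.75, 0.875, 1.0, 1.125, 1.25)):
    """Scale both dimensions of the base test size by each ratio."""
    return [(int(round(base[0] * r)), int(round(base[1] * r))) for r in ratios]

print(multiscale_sizes())
# [(384, 1536), (448, 1792), (512, 2048), (576, 2304), (640, 2560)]
```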

For ADE20K experiments, we initialized the segmentation models using model weights after supervised fine-tuning on ImageNet-1K, because its performance is superior to using the self-supervised pre-trained weights directly.

C.2 Ablation Studies

Tables 10 and 11 ablate the designs in SimMIM on the above additional down-stream tasks. We also copy the results of ImageNet-1K from the main body to these tables for reference.

We also use these additional down-stream tasks to verify different masking strategies, as shown in Figure 7. It turns out that the observations in Figure 3 of the main paper also hold: 1) the AvgDist measure is a good indicator for the learning effectiveness of masked image modeling; 2) an AvgDist of $15$ is empirically good for masked image modeling.

C.3 Scaling Experiments

Table 12 shows the scaling performance using COCO object detection and ADE20K semantic segmentation. On Swin-B, Swin-L, and SwinV2-H, SimMIM achieves +2.1 / +2.9 / +4.2 mAP ${}^{\text{box}}$ and +2.4 / +3.5 / +4.4 mIoU higher accuracy than its supervised counterparts, respectively. It indicates the broad effectiveness of the SimMIM approach. It also suggests that larger models benefit more from this approach.

Appendix D More Results on Channel-wise Bin Color Discretization

Table 13 shows more results of using channel-wise bin color discretization as the prediction target, by varying bin numbers and prediction resolutions. We notice that the best accuracy for different bin numbers is achieved at different prediction resolutions: the 2-bin and 4-bin targets reach the best accuracy at a resolution of $192^{2}$, and all other bin numbers reach the best accuracy at a low prediction resolution of $6^{2}$. These results imply that a moderately fine-grained target is encouraged for this classification-based approach.

Appendix E SimMIM with ConvNets

Given the remarkable performance of SimMIM on vision Transformers, we want to verify its effectiveness on other architectures. Here we adopt ResNet-50$\times$4 as the base architecture. The overall training setup remains the same as that of Swin-B. We use mask tokens to replace the original features after the stem, which consists of a $3\times 3$ convolution with stride 2 followed by a $2\times 2$ max-pooling operator.

On ResNet-50 $\times$ 4, SimMIM achieves 81.6% top-1 accuracy on ImageNet-1K validation set using 300-epoch pre-training and 100-epoch fine-tuning, outperforming the supervised counterpart by +0.9% (vs. 80.7%). This indicates the generality of SimMIM.