On Data Scaling in Masked Image Modeling

Zhenda Xie, Zheng Zhang, Yue Cao, Yutong Lin, Yixuan Wei, Qi Dai, Han Hu

Introduction

In natural language processing, scaling model capacity and data size has been an important driving force behind the remarkable improvements of language models over the past few years. Behind this success is a self-supervised pre-training approach, masked language modeling (MLM), that can take advantage of and benefit from almost unlimited data. At the same time, related research in computer vision has also been intensifying. However, due to the lack of effective self-supervision methods, most previous works are based on image classification tasks, where the huge labeling cost and the low information content of the labels limit both broader exploration of scaling visual models and further scaling of the models themselves, leaving progress in computer vision largely behind the NLP field.

Recently, a self-supervised visual pre-training method named masked image modeling (MIM) has become popular due to its impressive fine-tuning performance on a variety of downstream computer vision tasks. Given its close analogy with MLM, the dominant pre-training approach in NLP, we expect masked image modeling to advance the scaling performance of visual models. Specifically, we are concerned with two aspects of scaling ability: model scaling and data scaling. While the masked image modeling approach has been shown to be good at scaling up model capacity, like NLP models, its ability to benefit from larger data is unclear or even appears slightly negative. For example, prior work shows that using a small amount of training data in masked image modeling can achieve performance comparable to that of using larger data. The data scaling capability is critical, as an important hallmark of self-supervised learning is the ability to leverage almost unlimited data, and failure to benefit from larger data may hinder the future potential of masked image modeling.

In this paper, we systematically investigate the data scaling capability of masked image modeling at different model sizes and training lengths. We use Swin Transformer V2 as the visual encoder because of its proven trainability for large models and its applicability to a wide range of vision tasks, and adopt SimMIM for masked image modeling pre-training because it has no restrictions on the encoder architectures. With extensive experiments, we find that:

(i) Masked image modeling is demanding of large data. We observe that large models overfit on relatively small data, as reflected by validation losses that increase with longer training when a large model is trained on relatively small data (see Figure 1 center). This overfitting results in degraded fine-tuning performance, as shown in Figure 1 right.

(ii) Training length matters. Large models trained with masked image modeling can benefit from more data at a longer training length. When the training length is short, the difference in performance between using large and small datasets is not significant. However, with sufficient training, more data shows better performance. In addition, as the data size increases, the fine-tuning performance of large models saturates more slowly than that of small models.

(iii) The validation loss is a good proxy indicator for fine-tuning performance. We observe a strong correlation between validation loss and fine-tuning performance on multiple tasks. This finding suggests that the validation loss can be used as a good indicator of how well the model is trained, which can reduce the overhead of evaluation by direct fine-tuning on downstream tasks.

These findings suggest that masked image modeling (MIM) is not only a model-scalable learner but also a data-scalable learner. In particular, the data scaling capability revealed here counters the view from previous studies that masked image modeling may not benefit from more data. We hope these findings will deepen the understanding of masked image modeling.

Background and Experimental Setup

Architecture Specifications

We use Swin Transformer V2 as the vision encoder in this study. Thanks to its generality and scalability, we evaluate a series of SwinV2 models with a wide range of model sizes (the number of parameters ranges from $\sim$50M to $\sim$1B, and FLOPs range from $\sim$9G to $\sim$190G) on multiple downstream tasks. The detailed model specifications are shown in Table 1. We use a new variant, SwinV2-g (giant), whose number of parameters lies between SwinV2-L and the 3-billion-parameter SwinV2-G (Giant) used in prior work.

Pre-training Datasets

To study the effect of data size on masked image modeling, we build datasets of different sizes. We use the training sets of ImageNet-1K and ImageNet-22K as two large-scale datasets, and randomly sample $10\%$, $20\%$, and $50\%$ of the images in the ImageNet-1K training set as smaller datasets. By default, the images are uniformly sampled from each category. We also consider whether a different sampling strategy could behave differently. To this end, we randomly sample 100 classes from ImageNet-1K as ImageNet-100 and compare it with ImageNet-1K ($10\%$), but find that their training loss and fine-tuning performance are almost the same. The details and statistics of all pre-training datasets used in our study are shown in Table 2.
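The two sampling strategies described above can be sketched as follows. This is a minimal illustration, not the authors' data pipeline: the function names, the seeding, and the toy label list are all assumptions introduced here.

```python
import random
from collections import defaultdict


def sample_fraction_per_class(labels, fraction, seed=0):
    """Keep roughly `fraction` of the images of every class (IN1K (10%)-style)."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    kept = []
    for label in sorted(by_class):
        indices = by_class[label]
        count = max(1, round(len(indices) * fraction))
        kept.extend(rng.sample(indices, count))
    return sorted(kept)


def sample_class_subset(labels, num_classes, seed=0):
    """Keep every image of `num_classes` randomly chosen classes (IN100-style)."""
    rng = random.Random(seed)
    chosen = set(rng.sample(sorted(set(labels)), num_classes))
    return [index for index, label in enumerate(labels) if label in chosen]


# Toy stand-in for a label list (hypothetical): 1000 classes, 20 images each.
toy_labels = [class_id for class_id in range(1000) for _ in range(20)]
uniform_subset = sample_fraction_per_class(toy_labels, fraction=0.10)
class_subset = sample_class_subset(toy_labels, num_classes=100)
```

On this balanced toy list both subsets contain the same number of images, but the first covers all classes with few images each while the second covers few classes with all of their images.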

Pre-training Details

To better compare the performance of models trained with different amounts of data under the same pre-training length, we measure training length in iterations rather than epochs and adopt the same hyper-parameters for all model sizes during pre-training. The total number of training iterations is in {125K, 250K, 500K} and the batch size is set to 2048 for all experiments. The training details and hyper-parameters of pre-training are summarized in Table 10. Because of the large number of experiments, we follow SimMIM and use the following two techniques to reduce the experimental overhead. First, we use a step learning rate scheduler in pre-training so that the first training step can be shared among experiments with different training lengths. The first 7/8 of the training iterations form the first step and the last 1/8 form the second step, with a learning rate ratio of 0.1 (i.e., the learning rate is divided by 10 in the second step). Second, we adopt an input image size of $192^{2}$ and set the window size to $12$. We improve SimMIM by normalizing the prediction target, following prior work, with a sliding window of $47^{2}$ and observe an improvement of 0.3 in top-1 accuracy on ImageNet-1K for the SwinV2-Large model. The same light data augmentation strategy as in SimMIM is used: random resized cropping with a scale range of [0.67, 1] and an aspect ratio range of [3/4, 4/3], and random flipping with probability 0.5.
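The two-step schedule described above can be sketched as below. This is a simplified reading rather than the training code used in the study: `step_lr` and `shared_prefix` are hypothetical helper names, `base_lr` is a placeholder (the actual value is in Table 10), and any warm-up phase is ignored.

```python
def step_lr(iteration, total_iterations, base_lr, ratio=0.1):
    """Learning rate at `iteration` for a two-step schedule.

    The first 7/8 of training uses `base_lr`; the last 1/8 uses `base_lr * ratio`.
    """
    first_step_end = total_iterations * 7 // 8
    return base_lr if iteration < first_step_end else base_lr * ratio


def shared_prefix(total_a, total_b):
    """Number of initial iterations in which two runs use the same learning rate.

    Under this schedule that is the whole first step of the shorter run.
    """
    return min(total_a, total_b) * 7 // 8


# With base_lr normalized to 1.0, a 125K run switches steps at iteration 109,375.
lr_before = step_lr(109_374, 125_000, base_lr=1.0)
lr_after = step_lr(109_375, 125_000, base_lr=1.0)
prefix = shared_prefix(125_000, 500_000)
```

Because the learning rate is constant throughout the first step, a longer run passes through the same states as the first step of a shorter run, which is what allows that part of the computation to be reused.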

Fine-tuning Tasks

To extensively and accurately evaluate the performance of pre-trained models under different pre-training schedulers and datasets, a series of diverse and representative tasks including fine-tuning on ImageNet-1K, fine-grained image classification, object detection, instance segmentation, and semantic segmentation are selected for evaluation.

Following prior work, we evaluate the quality of the learnt representations by fine-tuning the pre-trained models on the ImageNet-1K image classification task, which is the most commonly used scenario and evaluation criterion for pre-trained models. The settings and fine-tuning hyper-parameters for ImageNet-1K image classification are summarized in Table 11. Unlike in pre-training, we adopt an image size of $224^{2}$ with a window size of 14 in fine-tuning. We use the AdamW optimizer with a batch size of 2048, a base learning rate of 5e-3, a weight decay of 0.05, $\beta_{1}$ of 0.9 and $\beta_{2}$ of 0.999, together with a cosine learning rate scheduler. As larger models are more prone to overfitting, we fine-tune SwinV2-S/B/L for 100 epochs with 20 warm-up epochs and SwinV2-H/g for 50 epochs with 10 warm-up epochs, and decrease the layer decay as the model size increases. In addition, gradient clipping, stochastic depth, label smoothing and data augmentations (e.g., random crop, random erasing, RandAugment, mixup, CutMix, etc.) are also used, following prior work.
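As an illustration of the warm-up plus cosine schedule mentioned above, the sketch below computes a per-epoch learning rate. The linear warm-up from zero, the decay to `min_lr = 0`, and the per-epoch granularity are assumptions made here; the exact settings are those of Table 11.

```python
import math


def warmup_cosine_lr(epoch, total_epochs, warmup_epochs, base_lr, min_lr=0.0):
    """Linear warm-up to `base_lr`, then cosine decay to `min_lr` (assumed form)."""
    if epoch < warmup_epochs:
        return base_lr * epoch / warmup_epochs
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# SwinV2-S/B/L setting from the text: 100 epochs, 20 warm-up epochs, base LR 5e-3.
schedule = [warmup_cosine_lr(e, 100, 20, 5e-3) for e in range(101)]
```

The peak is reached at the end of warm-up (epoch 20 here), and the rate falls to half of the base value midway through the cosine phase.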

iNaturalist-18

iNaturalist 2018 is a long-tailed fine-grained image classification dataset. The details and fine-tuning hyper-parameters for iNaturalist 2018 are summarized in Table 12. As in fine-tuning on ImageNet-1K, we use an input image size of $224^{2}$, a window size of 14 and a patch size of 4 for iNaturalist 2018. We fine-tune all models for 100 epochs with 20 warm-up epochs, and set the layer decay to 0.8, 0.75 and 0.7 for SwinV2-S/B/L, respectively. We use the AdamW optimizer with a cosine learning rate scheduler, a batch size of 2048, a base learning rate of 1.6e-2, a weight decay of 0.1, $\beta_{1}$ of 0.9 and $\beta_{2}$ of 0.999. In addition, we also adopt stochastic depth, label smoothing, gradient clipping and data augmentations in fine-tuning.
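Layer decay is commonly implemented as a per-layer multiplier on the base learning rate, so that layers closer to the input are updated more gently. The sketch below uses one common geometric form; the exact indexing used in the study is not stated in the text, and the depth of 12 is a hypothetical placeholder rather than a SwinV2 specification.

```python
def layerwise_lr_scales(num_layers, layer_decay):
    """Learning-rate multiplier per layer (index 0 is closest to the input).

    Assumed form: the last layer keeps the full rate, and each earlier layer
    is scaled by one more factor of `layer_decay`.
    """
    return [layer_decay ** (num_layers - 1 - i) for i in range(num_layers)]


# Layer decay values from the text for SwinV2-S/B/L; depth 12 is illustrative only.
scales_by_decay = {decay: layerwise_lr_scales(12, decay) for decay in (0.8, 0.75, 0.7)}
```

A smaller layer decay, as used for the larger models, shrinks the learning rate of the early layers more strongly, which limits how far they drift from their pre-trained values.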

COCO Object Detection and Instance Segmentation [19]

The details and fine-tuning hyper-parameters for the COCO dataset are summarized in Table 13. We use Mask R-CNN for evaluation; our implementation is based on MMDetection. We set the window size to 14 and the patch size to 4. We use the AdamW optimizer with a batch size of 32, a base learning rate of 8e-5, a weight decay of 0.05, $\beta_{1}$ of 0.9, $\beta_{2}$ of 0.999 and a step learning rate scheduler (step learning rate ratio of 0.1, with steps at epochs 27 and 33). In training, random cropping with crop size of , large-scale jittering with a range of [0.1, 2.0], random horizontal flip with probability 0.5, and stochastic depth regularization are used. In testing, all images are resized to (800, 1333) while keeping the aspect ratio unchanged.

ADE20K Semantic Segmentation [41]

The details and fine-tuning hyper-parameters for the ADE20K dataset are summarized in Table 14. Following prior work, we use UPerNet for evaluation. We set the window size to 20 and the patch size to 4. We use the AdamW optimizer with a batch size of 32, a base learning rate searched in the range [1e-4, 3e-4], a weight decay of 0.05, $\beta_{1}$ of 0.9, $\beta_{2}$ of 0.999 and a linear learning rate scheduler with a total of 80K iterations. We also use a layer decay of 0.95, 0.95 and 0.9 for SwinV2-S/B/L, respectively. In training, random cropping with crop size of , scale jittering with a range of [0.5, 2.0], random horizontal flip with probability 0.5, random photometric distortion and stochastic depth regularization of 0.1 are used. In testing, all images are evaluated in a sliding-window manner, with a test image size of (2560, 640) and a sliding window stride of 426, following prior work.

Results and Findings

We train numerous models with different training lengths, data sizes, and model sizes, and study how these factors affect the performance of masked image modeling. Figure 1 and Figure 2 illustrate the relationship between the training loss, the validation loss of pre-training, and the fine-tuning top-1 accuracy on ImageNet-1K. The validation loss of pre-training is measured on the validation set of ImageNet-1K for all experiments. Based on these extensive experiments, we make the following observations:

With a high masking rate (e.g., $60\%$ in our work), masked image modeling is considered a very challenging training objective and has been found to be data efficient in previous literature, i.e., performance comparable to that with large datasets can be achieved with small datasets. However, Figure 1 shows that as the training cost increases, the training loss of some models drops significantly while their validation loss rises significantly, even when using 50% of the images of ImageNet-1K (i.e., IN1K (50%)), indicating that overfitting exists. A significant decrease in fine-tuning performance caused by overfitting can be observed in Figure 2. Moreover, we report the best fine-tuning performance of each model across different training schedules in Table 3. We find that large models perform even worse than smaller models when a small dataset is used for training. For example, the best top-1 accuracy of SwinV2-H with IN1K ($20\%$) is 84.4, which is 0.3 lower than the best performance of SwinV2-L. In addition, by comparing the best performance that can be obtained with datasets of different sizes, we find that using more data results in better performance. These observations suggest that masked image modeling does not alleviate the demand for large datasets.
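The overfitting signature described above, a falling training loss together with a rising validation loss, can be checked mechanically from per-checkpoint losses. The following is a minimal sketch; the function name, the `min_rise` threshold, and the loss values are made up for illustration and are not results from this study.

```python
def shows_overfitting(train_losses, val_losses, min_rise=0.0):
    """Return True if validation loss rose after its minimum while training loss kept falling."""
    best = min(range(len(val_losses)), key=val_losses.__getitem__)
    if best == len(val_losses) - 1:
        return False  # validation loss is still at its lowest at the last checkpoint
    val_rise = val_losses[-1] - val_losses[best]
    train_drop = train_losses[best] - train_losses[-1]
    return val_rise > min_rise and train_drop > 0


# Made-up loss values at successive checkpoints, for illustration only.
small_data_train = [0.60, 0.50, 0.40, 0.30]
small_data_val = [0.62, 0.58, 0.61, 0.66]
large_data_train = [0.60, 0.55, 0.52, 0.50]
large_data_val = [0.61, 0.56, 0.53, 0.51]

small_data_overfits = shows_overfitting(small_data_train, small_data_val)
large_data_overfits = shows_overfitting(large_data_train, large_data_val)
```

In practice a small positive `min_rise` would help avoid flagging noise in the validation curve.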

The training length matters. Larger models can benefit from more data at a longer training length.

By comparing the performance of models pre-trained with different data sizes (3rd row of Figure 2), we find that the fine-tuning performance of large models saturates more slowly with increasing data size than that of smaller models. For example, the SwinV2-S model pre-trained on IN1K ($50\%$) has a very similar fine-tuning performance to the model pre-trained on IN1K ($100\%$). In comparison, the performance difference between the SwinV2-H models pre-trained on IN1K ($50\%$) and IN1K ($100\%$) is nearly 0.5, which is a significant gap for ImageNet-1K classification.

Furthermore, a comprehensive observation reveals that the improvements from using more data are not significant under short training lengths. For example, while there is a noticeable performance gap between SwinV2-H trained on IN1K ( $50\%$ ) and IN1K ( $100\%$ ) at a training length of 500K iterations, the gap is less than 0.1 at a training length of 125K iterations. This observation suggests that while larger models can benefit from more data, the training length must also increase at the same time.
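Since training length is measured in iterations at a fixed batch size of 2048, the same schedule corresponds to very different numbers of passes over datasets of different sizes. The sketch below makes this concrete; it assumes the commonly cited ImageNet-1K training set size of 1,281,167 images and approximates each subset as a plain fraction of it, so the resulting epoch counts are rough estimates rather than figures from the paper.

```python
IMAGENET_1K_TRAIN_IMAGES = 1_281_167  # commonly cited size of the ImageNet-1K training set
BATCH_SIZE = 2048


def effective_epochs(iterations, dataset_size, batch_size=BATCH_SIZE):
    """Approximate number of passes over the dataset for a given iteration budget."""
    return iterations * batch_size / dataset_size


# Rough epoch counts for each (fraction of IN1K, iteration budget) pair.
epochs = {
    (fraction, iterations): effective_epochs(
        iterations, int(IMAGENET_1K_TRAIN_IMAGES * fraction)
    )
    for fraction in (0.1, 0.2, 0.5, 1.0)
    for iterations in (125_000, 250_000, 500_000)
}
```

Under these assumptions, 500K iterations amount to roughly 800 passes over the full training set but about ten times as many over the 10% subset, which is consistent with small subsets being the first to show overfitting at long training lengths.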

Evaluation on more tasks.

In addition to ImageNet-1K image classification, we also evaluate the MIM pre-trained SwinV2-S, SwinV2-B and SwinV2-L on iNaturalist-18 fine-grained image classification, ADE20K semantic segmentation, and COCO object detection/segmentation. Figure 3 shows a pattern similar to that on ImageNet-1K (Figure 1 (right)): as the training cost increases, some models show a significant performance drop. In addition, as shown in Tables 4, 5, and 6, the smaller models rapidly reach saturation as the amount of data increases, while larger models can continuously benefit from more data after sufficient training. These results suggest that the conclusions drawn on ImageNet-1K are broadly applicable to other vision tasks.

Reconstruction Results of Overfitting and Non-overfitting Models

To better understand the difference between overfitting and non-overfitting models, we visualize the reconstruction results of SwinV2-L pre-trained on ImageNet-1K (10%) and ImageNet-1K (100%). Figure 5(a) shows the reconstruction results on training images from the ImageNet-1K (10%) dataset, and Figure 5(b) shows the reconstruction results on images from the ImageNet-1K validation set. Based on the reconstruction results on the training images, we observe that the overfitting model (i.e., SwinV2-L pre-trained on ImageNet-1K (10%)) behaves more like "remembering" the masked regions, while the non-overfitting model (i.e., SwinV2-L pre-trained on ImageNet-1K (100%)) behaves more like "reasoning" about the masked regions. For example, the results on the left of the first row in Figure 5(a) show that the overfitting model even predicts the black hair of the dog, although the visible regions only indicate that the dog is white, whereas the non-overfitting model only predicts a dog with white hair. Furthermore, we observe that the overfitting model seems to lack the "reasoning" ability and has poorer prediction quality on the validation images than the non-overfitting model. For example, the results on the left of the first row in Figure 5(b) show that the overfitting model even fails to predict the eyes of the dog.

Correlation between Pre-training Losses and the Fine-tuning Performance

Evaluating a pre-trained model by its fine-tuned performance on downstream tasks is costly. In supervised pre-training, the validation accuracy is used as a proxy metric to evaluate the quality of the pre-trained models, whereas in previous studies on other self-supervised learning approaches (e.g., contrastive learning), such a proxy metric is lacking. In this study, we explore whether the pre-training loss of masked image modeling is a good indicator of its fine-tuning performance. We collect all pre-trained models and plot their training and validation losses against fine-tuning performance in Figure 4. Interestingly, correlations between pre-training losses and the fine-tuning performance on multiple tasks can be observed, with a phase transition around overfitting.

Specifically, the correlation between training loss and fine-tuning performance is positive for the overfitting models (green circles), for which a lower training loss accompanies worse fine-tuning performance, and negative for the non-overfitting models (red circles). The correlation between validation loss and fine-tuning performance is always negative, but the slopes of the linear fit lines are significantly different. The least squares method is used for the linear fit.

In addition, we further analyze the Pearson correlation coefficients between the pre-training losses and fine-tuning performance (Table 7), and find that the validation loss has a stronger linear correlation with fine-tuning performance than the training loss in all cases, especially for non-overfitting models.
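For reference, the two statistics used in this analysis, the Pearson correlation coefficient and a least-squares line, can be computed as in the sketch below. The helper names and the toy values are introduced here purely for illustration; they are not measurements from this study.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def least_squares_line(xs, ys):
    """Slope and intercept of the ordinary least-squares fit y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / var_x
    return slope, mean_y - slope * mean_x


# Toy values only: a "loss" axis and a "performance" axis that move in opposite directions.
toy_loss = [1.0, 2.0, 3.0, 4.0]
toy_performance = [4.0, 3.1, 1.9, 1.0]

r = pearson(toy_loss, toy_performance)
slope, intercept = least_squares_line(toy_loss, toy_performance)
```

A coefficient close to -1, as in this toy case, is the pattern one would expect if a lower loss reliably accompanies better fine-tuning performance.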

Effects of Different Sizes of Decoders

We have studied the effects of encoder size from the data scaling perspective. Here, we further study the effects of decoder size. We pre-train SwinV2-B models with decoder heads of different sizes on IN1K ($20\%$), and Table 8 shows the results. Interestingly, although the heavier decoder has a lower training loss and a higher validation loss than the linear decoder, indicating more severe overfitting, its fine-tuning performance on ImageNet-1K does not decrease relative to the linear decoder. This experiment shows that the decoder behaves very differently from the encoder, and we speculate that this is because the decoder "blocks" the damage that overfitting would otherwise do to the encoder.

Impact of Different Dataset Sampling Strategies

We study different dataset sampling strategies by comparing the training behavior and fine-tuned performance of models pre-trained on IN1K ( $10\%$ ) and IN100. In IN1K ( $10\%$ ), the images are uniformly sampled from each category, and we randomly sample 100 categories from ImageNet-1K as IN100. Experiments are conducted on SwinV2-L with 500K training iterations. Table 9 shows the training loss, validation loss and fine-tuning top-1 accuracy of ImageNet-1K. For the two models pre-trained on IN1K ( $10\%$ ) and IN100, all three metrics are very similar. Figure 6 further illustrates the training dynamics of the two models, and we find both their training loss curves and validation loss curves are almost overlapping. These results show the disparity caused by different dataset sampling strategies is minor.

Related Work

Masked Image Modeling learns representations by reconstructing the masked content of images, and its early exploration can be traced back to the context encoder and the denoising autoencoder. Recently, iGPT, BEiT, MAE and SimMIM revisited this approach for training vision Transformers. iGPT sequentially predicts pixels in an auto-regressive manner. BEiT proposes to predict discrete visual tokens. MAE and SimMIM concurrently find that predicting raw pixels with a high masking ratio works well. In this work, we use SimMIM as the default masked image modeling approach because of its simplicity and because, unlike MAE, it places no restrictions on the architecture of the vision encoder.

Vision Transformer

Transformer was first applied to natural language processing, where it became the dominant architecture, and has recently attracted a lot of attention in computer vision. The pioneering work ViT first showed that the Transformer architecture works well in image classification when trained on large amounts of data. DeiT proposed a better training recipe based on ViT and demonstrated that vision Transformers achieve promising performance when using only the ImageNet-1K dataset. Swin Transformer improves plain ViT by introducing a hierarchical architecture and non-overlapping local attention, and successfully demonstrates the effectiveness of vision Transformers on a wide range of vision tasks. Swin Transformer V2 further addresses the training stability issue in model scaling and shows better performance than the original Swin Transformer; we therefore use it as the default vision encoder in this work.

Scaling Vision Models

Many works examine how to scale vision models, but most focus on model architecture design. For example, EfficientNet extensively studied how model width, model depth and input resolution affect convolutional neural networks; other works proposed to scale vision models with sparse mixture-of-experts, or studied how to scale ViT and Swin Transformer, respectively.

Only a few works have explored data scaling under the pre-training and fine-tuning paradigm. BiT revisited supervised pre-training on a wide range of data scales up to 300M images. SEER studied the effectiveness of data scaling in the contrastive learning framework with up to one billion images. Recently, SplitMask finds that masked image modeling is robust to the size of the pre-training data and challenges the data scaling capability of masked image modeling; it is the work most relevant to ours.

Conclusion

In our work, we systematically study the data scaling capability of masked image modeling at different model sizes and training lengths. Based on the extensive experiments, we demonstrate that masked image modeling is not only a model scalable learner but also a data scalable learner, which challenges the conclusion of previous literature that a large dataset may not be necessary in masked image modeling. The reason behind this is that they overlooked a key factor, namely training length. In addition, a strong correlation between the validation loss of masked image modeling and the fine-tuning performance is observed. This observation suggests that validation loss can be considered as a good proxy metric for evaluating pre-trained models, and makes it possible to reduce the experimental overhead of measuring models by fine-tuning.

While these findings deepen our understanding of masked image modeling from the data scaling perspective and can facilitate future research, our study still has limitations. First, the maximum model size used in our study reaches only one billion parameters, which we speculate leaves the overfitting phenomenon on the ImageNet-1K dataset unobserved. Second, there is a lack of research on the effect of encoder specifications (e.g., depth and width) on data scaling. Third, our study does not cover data augmentation, which is a common technique to alleviate data scarcity and overfitting.

References

Appendix A Hyper-parameters and training details

We illustrate the training details of pre-training and fine-tuning for different tasks and different models. Table 10 presents pre-training details. Table 11 presents the fine-tuning details on ImageNet-1K image classification. Table 12 presents the fine-tuning details on iNaturalist 2018. Table 13 presents the fine-tuning details on COCO dataset. Table 14 presents the fine-tuning details on ADE20K dataset.

Appendix B Training dynamics of masked image modeling

We show the training and validation curves of different models trained with masked image modeling to better illustrate the training dynamics. In Figure 7, each row presents the training and validation loss curves for training the same model on different datasets. The training loss is computed on the corresponding training dataset and the validation loss is computed on the ImageNet-1K validation set. We make the following observations: First, all models overfit when using small datasets. Second, in the non-overfitting cases, the training and validation losses are similar across training datasets of different sizes. In Figure 8, each row presents the training and validation loss curves of different models trained on the same dataset. We make the following observations: First, larger models have lower training losses than smaller models on all datasets. Second, the validation loss of a larger model is lower than that of a smaller model in the non-overfitting cases but higher in the overfitting cases.