Are Large-scale Datasets Necessary for Self-Supervised Pre-training?

Alaaeldin El-Nouby, Gautier Izacard, Hugo Touvron, Ivan Laptev, Hervé Jegou, Edouard Grave

Introduction

Modern computer vision neural networks are heavily parametrized: they routinely have tens or hundreds of millions of parameters. This has been the key to their success in leveraging large-scale image collections such as ImageNet. However, these high-capacity models tend to overfit on small, or even medium-sized, datasets consisting of hundreds of thousands of images. This problem was pointed out by Oquab et al. in 2014:

“Learning CNNs […] amounts to estimating millions of parameters and requires a very large number of annotated image samples. This property currently prevents application of CNNs to problems with limited training data.”

The authors describe a learning setting that is nowadays the dominant learning paradigm for data-starving problems:

(1) pre-train a model on a large dataset like ImageNet, and in turn (2) finetune the weights of the model on the target task, for which we have a limited amount of data. The second training stage typically adopts a shorter optimization procedure than the one employed when training from scratch (i.e., from randomly initialized weights).

This simple approach has led to impressive results, which are state-of-the-art in many tasks such as detection, segmentation and action recognition. Despite this success, we point out that it is difficult to disentangle the benefits offered by such a large-scale curated labeled dataset from the limitations of this pre-training paradigm. Putting aside the discussion on the collection effort (cost, need for in-domain expertise, etc.), we point out that pre-training a model on one dataset and fine-tuning it on another can introduce two sorts of discrepancies.

First, this setting introduces a domain shift between the images used to pre-train the model and those targeted by the fine-tuning stage. ImageNet images may be sufficiently representative of natural images (despite the collection bias). To date, most researchers consider that the benefit of having a large number of images vastly compensates for the domain discrepancy on benchmarks involving natural images, such as the fine-grained iNaturalist datasets, or even out-of-domain distributions such as sketches, paintings or clipart.

The second discrepancy, discussed by Doersch et al., is the so-called supervision collapse. This phenomenon is inherent to pre-training with a fixed set of labels: the network learns to focus on the mapping between images and the labels of the pre-training stage, but can discard information that is relevant to other downstream tasks. In other words, pre-training on large-scale classification datasets does not necessarily align with the goal of learning general-purpose features, as it uses only a subset of the available information, controlled by the categorization bias of the given dataset.

These limitations have motivated the development of self-supervised pre-training methods which learn directly from data, without relying on annotations. Most notably, the contrastive and joint embedding approaches can serve as effective pre-training strategies. While obtaining strong performance on numerous tasks, such methods have a strong bias towards ImageNet data, since their transformations have been hand-designed to perform well on the ImageNet benchmark. Some of the most effective transformations, like cropping, rely on the images being object-centric. When applied to uncurated data, these methods degrade significantly and require larger datasets to obtain similar performance.

This is in contrast with natural language processing, where nowadays most applications use large models which were pre-trained on uncurated data. In particular, the (masked) language modeling loss has been applied to transformer networks, leading to the BERT model, which is now the foundation of most NLP models. Inspired by this success, Bao et al. have shown the potential of the Masked Image Modeling (MIM) task to pre-train vision transformers. Such a model can be thought of as a denoising autoencoder where the noise corresponds to the patch masking operation. This technique has been successfully applied to ImageNet, but research questions remain:

(1) How much does this pre-training technique rely on the number of pre-training samples, and in particular, does it require millions of images to be useful?

(2) Is this technique robust to different distributions of training images? In particular, is it an effective paradigm to learn with non object-centric or uncurated images?

If this technique turns out to be both sample efficient and robust to the distribution of training images, it will enable pre-training using a larger variety of datasets, including the training sets of many tasks that are smaller than ImageNet or belong to a different domain.

In this work, we make the following contributions:

First, we demonstrate that denoising autoencoders are more sample efficient than joint embedding techniques, enabling pre-training without relying on large-scale datasets (e.g. ImageNet);

Second, as a consequence of the better sample efficiency, we show on multiple datasets that it is possible to pre-train directly on the target task data and obtain a competitive performance, even with datasets that are orders of magnitude smaller than ImageNet;

Third, we demonstrate that denoising autoencoders can be successfully applied to non object-centric images such as COCO, achieving performance similar to the one obtained when pre-training with ImageNet, unlike joint embedding techniques which seem to suffer a drop in performance.

Related Work

In this section, we briefly review some previous work on self-supervised learning, including autoencoders and instance discrimination methods.

Pre-training with autoencoders has a long history in deep learning, where it was initially used as a greedy layer-wise method to improve optimization. In the context of unsupervised feature learning for image classification, different tasks related to denoising autoencoders have been considered, such as in-painting, colorization or de-shuffling of image patches. In natural language processing, denoising autoencoders have been applied by masking or randomly replacing some tokens of the input and reconstructing the original sequence, leading to the BERT model. Similar methods have been proposed to pre-train sequence-to-sequence models, by considering additional kinds of noise such as word shuffling or deletion.

There have been efforts to bring these ideas, which proved successful in NLP, to computer vision, but with limited success. Chen et al. proposed iGPT, a transformer-based autoregressive model that operates over image pixels, while Atito et al. trained a ViT model on denoising of images where the noise is applied at the pixel level. More recently, Bao et al. introduced the Masked Image Modeling loss in computer vision, where image patches are masked and the goal is to predict the discretized labels of the missing patches, corresponding to their visual words as defined by a pre-trained discrete VAE.

Instance discrimination is a set of self-supervised techniques which consider that each image corresponds to its own class. A set of data augmentations (or transformations) is then applied to each image to generate multiple examples for each class. The global image representations are trained in a contrastive framework, typically using the InfoNCE loss, to have high similarity for instances transformed from the same source image and low similarity with all other images. As the performance of these methods depends on the number of negatives, they require either large batches or memory banks to work well. It was later shown that when using a momentum encoder, simpler loss functions that do not directly discriminate against other images can be used. Finally, a related line of work uses clustering techniques to pre-train deep neural networks.

Transformers were originally introduced in the context of machine translation, replacing recurrent neural networks with an attention-based mechanism. Transformers were later applied to image recognition, by splitting images into patches, embedding these independently, and then processing the obtained representations as a sequence. Initially, only vision transformers pre-trained on very large collections obtained good performance, but smaller models trained on ImageNet with heavy augmentation can also yield competitive tradeoffs.

Pre-training data is an important ingredient of self-supervised learning, and multiple works have studied its impact on the transfer performance of models. While it is possible to learn high-quality features from non-curated data (e.g. YFCC or IG) using instance discrimination, this usually requires orders of magnitude more data than ImageNet. Similarly, one can perform supervised pre-training using weakly supervised data, such as using hashtags as labels, but this strategy also requires large amounts of data to work well. On the other hand, it was shown that for many natural language processing tasks, increasing the size of the pre-training dataset did not lead to strong improvements when using denoising autoencoders. Finally, some works studied how much can be learned from a single pre-training image or from synthetic data.

Analysis

In this section, we study the impact of the pre-training data on the performance of denoising autoencoders, and how they compare to joint embedding methods. More precisely, we investigate how the number of images, and their nature, influence the quality of self-supervised models. In this preliminary analysis, we consider the recent method BEiT and SplitMask, our variant detailed in Section 4, as representatives of denoising autoencoders, and DINO as a representative of joint embedding methods.

First, we study the impact of the pre-training dataset size by varying the number of ImageNet examples we use to train models. We consider subsets of ImageNet containing 10% and 1% of the total number of examples, and use the class-balanced subsets from prior work. To decouple the effect of using smaller datasets from the effect of doing fewer training updates, we adapt the number of epochs to keep the number of iterations constant. This means that we perform 3k and 30k epochs on ImageNet 10% and 1% respectively. We report results in Table 1. Observe how pre-training with an autoencoder loss such as masked image modeling is robust to the reduction in dataset size. In contrast, as with supervised pre-training, the performance of models pre-trained with DINO self-supervision degrades when training with smaller datasets.
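As a rough illustration of the epoch adjustment described above, the sketch below computes how many epochs on a smaller dataset match the number of updates of a reference run. It assumes a fixed batch size, a 300-epoch reference schedule on full ImageNet, and an ImageNet training set of 1,281,167 images; the function name is hypothetical.

```python
def epochs_for_constant_updates(reference_epochs, reference_size, dataset_size):
    """Epochs on `dataset_size` images that see as many samples as
    `reference_epochs` epochs on `reference_size` images (fixed batch size)."""
    return round(reference_epochs * reference_size / dataset_size)


IMAGENET_TRAIN_SIZE = 1_281_167  # assumption: size of the ImageNet-1k training set

# 100%, 10% and 1% of ImageNet -> 300, 3000 and 30000 epochs
for fraction in (1.0, 0.1, 0.01):
    subset_size = fraction * IMAGENET_TRAIN_SIZE
    print(fraction, epochs_for_constant_updates(300, IMAGENET_TRAIN_SIZE, subset_size))
```

The same rule applies to any target dataset: the smaller the dataset, the more epochs are needed to reach the same number of updates.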

We plot the iNaturalist-2019 transfer performance as a function of the ImageNet subset size used during pre-training with SplitMask in Figure 3. We observe that the peak performance is achieved using only 5% of the ImageNet samples, and that adding more samples does not provide an additional boost when the number of updates is kept constant. We also observe that using only a single image per class, which corresponds to the 0.1% subset containing 1000 samples, leads to a non-trivial boost (+4 points) over training from scratch. This is a strong indication that denoising autoencoders are highly sample-efficient unsupervised learning methods.

Furthermore, we plot the transfer performance as a function of the number of pre-training epochs in Figure 3, using the 10% ImageNet subset. It can be observed that training for long schedules of nearly 3k epochs, matching the total number of updates of 300 epochs on full ImageNet, is crucial to achieve such strong performance on smaller subsets. However, we observe slight overfitting for very long schedules. This problem is more pronounced when pre-training on very small datasets like Stanford-Cars, as illustrated in Figure 6.

3.2 Learning using non object-centric images

We now study the impact of changing the nature of the pre-training data. In particular, we use images that are not object-centric, unlike those in ImageNet. To this end, instead of pre-training using ImageNet, we pre-train with images from the COCO dataset only. As COCO contains roughly 118k images, this dataset is approximately equivalent in size to the ImageNet 10% subset. Again, to disentangle the effect of training with a different number of iterations, we adapt the number of epochs: we use 3k epochs on COCO.

We report the results of these experiments in Table 1. When pre-trained on COCO, DINO drops significantly compared to full ImageNet pre-training (-8.3). Interestingly, the drop is larger than when using 10% of ImageNet, even though the number of samples is roughly the same. We hypothesize that this is because COCO images are not biased to be object-centric, while this joint embedding method was designed and developed using ImageNet as a benchmark. In contrast, BEiT's performance only decreases slightly, while SplitMask attains a +0.7 improvement over full ImageNet pre-training. This is an interesting property which makes such models prime candidates for learning effectively from uncurated images in the wild.

3.3 Tokenizers

The BEiT method, as proposed by Bao et al., relies on the discrete VAE tokenizer from DALL-E, which has been pre-trained on a large weakly supervised dataset. Since we want to study whether it is possible to pre-train models solely on small datasets, or on non object-centric ones, we replace the DALL-E tokenizer with simple alternatives that discretize images at the patch level without any pre-training. Each of these techniques is applied to each patch independently, making them relatively lightweight and more efficient than the original tokenizer considered in BEiT.

We now discuss three simple ways to obtain the elements of the vocabulary $\mathbf{e}_{i}$. First, we can sample random vectors with uniform element-wise distribution, and call the corresponding tokenizer random projection. Second, we can sample $V$ random patches uniformly from the set of all patches of images from the training set, and refer to the tokenizer as random patches. Finally, we can perform k-means clustering on the patches of images from the training set, and use the centroids as elements of the vocabulary. We refer to this last tokenizer, which was once widely employed in computer vision for bag-of-words representations, as k-means.
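The three vocabulary constructions above can be sketched on toy data as follows. This is only an illustration under stated assumptions: patches are flattened lists of floats, the uniform range $[-1, 1]$ is arbitrary, and a patch is assigned to its nearest vocabulary element in Euclidean distance, which is an assumed assignment rule rather than one stated in the text. All function names are hypothetical.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two flattened patches."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(patch, vocab):
    """Index of the closest vocabulary element (assumed assignment rule)."""
    return min(range(len(vocab)), key=lambda i: sq_dist(patch, vocab[i]))


def random_projection_vocab(vocab_size, dim, rng):
    """Random vectors with uniform element-wise distribution."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(vocab_size)]


def random_patches_vocab(patches, vocab_size, rng):
    """Patches sampled uniformly from the training patches."""
    return [list(p) for p in rng.sample(patches, vocab_size)]


def kmeans_vocab(patches, vocab_size, rng, iterations=10):
    """Centroids obtained with plain Lloyd iterations."""
    centroids = random_patches_vocab(patches, vocab_size, rng)
    for _ in range(iterations):
        clusters = [[] for _ in range(vocab_size)]
        for p in patches:
            clusters[nearest(p, centroids)].append(p)
        for k, members in enumerate(clusters):
            if members:  # keep the previous centroid if a cluster is empty
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]
    return centroids


def tokenize(patches, vocab):
    """Map each patch independently to the index of a vocabulary element."""
    return [nearest(p, vocab) for p in patches]


rng = random.Random(0)
toy_patches = [[rng.random() for _ in range(12)] for _ in range(200)]
vocabularies = {
    "random projection": random_projection_vocab(8, 12, rng),
    "random patches": random_patches_vocab(toy_patches, 8, rng),
    "k-means": kmeans_vocab(toy_patches, 8, rng),
}
tokens = {name: tokenize(toy_patches, vocab) for name, vocab in vocabularies.items()}
```

None of the three constructions requires a pre-trained network, which is what makes them usable when only a small target dataset is available.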

We train a ViT-base model on the ImageNet dataset, using these three tokenizers, as well as the DALL-E tokenizer originally considered by BEiT. We report results in Table 2. We observe that replacing the DALL-E tokenizer by simpler choices does not lead to any significant degradation in accuracy. This also provides a 26% relative runtime improvement for base models over its counterpart using the DALL-E tokenizer on 16 GPUs with a batch size of 1024.

Methodology

In this section, we introduce SplitMask, a variant of denoising autoencoders based on vision transformers. An overview of our method is illustrated in Figure 4.

Our approach is based on three steps, which we refer to as split, inpaint and match. As in standard vision transformers, an image is first broken down into patches of $16\times16$ pixels. Then, we split the patches into two disjoint subsets $\mathcal{A}$ and $\mathcal{B}$, which are processed independently by our deep ViT encoder. Next, using the patch representations of the subset $\mathcal{A}$ and a shallow decoder (e.g. 2 layers), we inpaint the patches of the subset $\mathcal{B}$ by solving a MIM task, and vice versa. (Inpainting in this context is implemented by solving a Masked Image Modeling task rather than by the typical reconstruction of pixels.) Finally, we obtain a global image descriptor by average pooling of the patch representations from the decoder output corresponding to each branch.

The feature aggregation is over both observed and hallucinated patches. We try to match the global descriptor of the image obtained from subset $\mathcal{A}$ to that obtained from subset $\mathcal{B}$. In other words, we use the masking operation of the masked image modeling loss as a data augmentation for a contrastive learning loss similar to NPID or SimCLR. Note that SplitMask does not add any significant computational cost over MIM methods like BEiT to produce this global contrastive training signal.

4.2 Encoder-Decoder Architecture

We now discuss in more detail the architecture of the model that we use to implement the SplitMask pipeline described in the previous subsection. Our method relies on an encoder-decoder architecture. The encoder of our model is a standard vision transformer with absolute positional embeddings. In contrast to the BEiT method, our encoder does not process representations of the masked tokens, but only of the observed ones. (Concurrently with our work, He et al. proposed MAE, an encoder-decoder architecture in which the encoder processes the observed patches only, similar to what we do in SplitMask.) Hence, an image is divided into patches, which are linearly embedded, and positional embeddings are added to these representations. These representations are split into two subsets $\mathcal{A}$ and $\mathcal{B}$, which are processed independently by standard transformer layers. Before feeding the output representations to the decoder, we insert mask embeddings that include the position information of the missing patches in the sequences $\mathcal{A}$ and $\mathcal{B}$. Finally, using the decoded representations of the masked patches, we predict their corresponding visual words using a cross-entropy loss function.

Thus, if an image contains $n$ patches, the encoder processes two sequences of size $n/2$, while the decoder processes two sequences of size $n$. Since in practice we use a decoder which is much more lightweight than standard vision transformers, the computational complexity of our models is similar to that of a standard ViT. One advantage of our approach compared to BEiT is that at each iteration, the encoder processes all the patches of the image. The loss function is also computed over all the patches of the image, instead of only on a subset. Additional comparisons to BEiT are detailed in Sections A and B of the appendix.
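To make the sequence sizes concrete, the following sketch splits the patch tokens of one image into two subsets and rebuilds the full-length decoder inputs. It is a toy illustration under several assumptions: a 224x224 input (so 196 patches), a uniform random split rather than block masking, an identity function standing in for the ViT encoder, and position information injected by adding the positional embedding to a shared mask embedding. All names are hypothetical.

```python
import random


def split_indices(num_patches, rng):
    """Split patch indices into two disjoint halves (uniform random split)."""
    indices = list(range(num_patches))
    rng.shuffle(indices)
    half = num_patches // 2
    return sorted(indices[:half]), sorted(indices[half:])


def decoder_input(encoded, observed_idx, mask_embedding, pos_embeddings):
    """Full-length decoder sequence: encoder outputs at observed positions and
    (mask embedding + positional embedding) at the missing positions."""
    sequence = [[m + p for m, p in zip(mask_embedding, pos)] for pos in pos_embeddings]
    for vector, idx in zip(encoded, observed_idx):
        sequence[idx] = vector
    return sequence


rng = random.Random(0)
num_patches, dim = (224 // 16) ** 2, 4  # 196 patches for an assumed 224x224 input
pos_embeddings = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_patches)]
patch_embeddings = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_patches)]
mask_embedding = [0.0] * dim

# Linear patch embeddings plus positional embeddings.
tokens = [[e + p for e, p in zip(emb, pos)]
          for emb, pos in zip(patch_embeddings, pos_embeddings)]
subset_a, subset_b = split_indices(num_patches, rng)

# Identity stand-in for the encoder: each branch only sees its own subset.
encoded_a = [tokens[i] for i in subset_a]
encoded_b = [tokens[i] for i in subset_b]

decoder_in_a = decoder_input(encoded_a, subset_a, mask_embedding, pos_embeddings)
decoder_in_b = decoder_input(encoded_b, subset_b, mask_embedding, pos_embeddings)
print(len(encoded_a), len(encoded_b), len(decoder_in_a), len(decoder_in_b))
# prints: 98 98 196 196
```

Across the two branches, every patch of the image is seen once by the encoder and predicted once by the decoder.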

4.3 Global Contrastive Loss

In addition to the MIM loss, which is computed at the patch level, our approach also uses a contrastive loss at the image level. To this end, we apply an average pooling operation over all the output representations of the decoder (including representations of the masked patches). For each image, we obtain two representations $\mathbf{x}_{a}$ and $\mathbf{x}_{b}$, corresponding to the subsets $\mathcal{A}$ and $\mathcal{B}$ of observed patches. We then apply the InfoNCE loss over these representations:

$$\ell(\mathbf{x}_{a}, \mathbf{x}_{b}) = -\log \frac{\exp(\mathbf{x}_{a}^{\top}\mathbf{x}_{b}/\tau)}{\sum_{\mathbf{y} \in \{\mathbf{x}_{b}\} \cup \mathcal{N}} \exp(\mathbf{x}_{a}^{\top}\mathbf{y}/\tau)},$$

where $\tau$ is a temperature hyper-parameter and $\mathcal{N}$ is a set of negatives, corresponding to the representations of the other images in the batch. Following previous work, we symmetrize the contrastive loss, and apply it similarly to the representation $\mathbf{x}_{b}$ from the subset $\mathcal{B}$. The motivation for adding this contrastive loss is to encourage the model to produce globally coherent features that are consistent across different choices of observed subsets, without relying on any hand-designed transformations. Using our design of SplitMask, we obtain such a signal with almost no overhead.
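The image-level loss can be sketched in a few lines. This is a minimal illustration, not the reference implementation: it assumes L2-normalized representations, takes as negatives the representations of the other images in the batch from the opposite branch, and uses $\tau = 0.2$, the temperature reported in the implementation details. The function names are hypothetical.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def l2_normalize(v):
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]


def info_nce(anchor, positive, negatives, tau=0.2):
    """-log( exp(a.p / tau) / sum of exp(a.y / tau) over y in {p} and negatives )."""
    logits = [dot(anchor, y) / tau for y in [positive] + negatives]
    top = max(logits)  # subtract the maximum for numerical stability
    log_denominator = top + math.log(sum(math.exp(z - top) for z in logits))
    return log_denominator - logits[0]


def symmetric_loss(batch_a, batch_b, tau=0.2):
    """Symmetrized InfoNCE averaged over the batch.

    Assumption: for an anchor from one branch, the negatives are the
    representations of the other images taken from the opposite branch.
    """
    batch_a = [l2_normalize(x) for x in batch_a]
    batch_b = [l2_normalize(x) for x in batch_b]
    total = 0.0
    for i, (xa, xb) in enumerate(zip(batch_a, batch_b)):
        negatives_b = [y for j, y in enumerate(batch_b) if j != i]
        negatives_a = [y for j, y in enumerate(batch_a) if j != i]
        total += info_nce(xa, xb, negatives_b, tau) + info_nce(xb, xa, negatives_a, tau)
    return total / (2 * len(batch_a))


aligned = [[1.0, 0.0], [0.0, 1.0]]
swapped = [[0.0, 1.0], [1.0, 0.0]]
loss_matched = symmetric_loss(aligned, aligned)     # descriptors of both branches agree
loss_mismatched = symmetric_loss(aligned, swapped)  # descriptors match the wrong image
```

The loss is small when the two descriptors of the same image agree and large when a descriptor is closer to another image in the batch.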

Experiments

In this section, we perform empirical evaluations of denoising autoencoders, and of the impact of the pre-training data on downstream task performance. In particular, we study how well pre-training performs when only the target task data is used instead of relying on a large-scale dataset such as ImageNet. We perform experiments on different tasks, such as classification, detection and instance segmentation. We consider datasets of varying size, including some significantly smaller than ImageNet. We also compare our SplitMask variant to BEiT, pre-trained either on target task data or on ImageNet, in addition to the supervised pre-training baselines. Finally, we perform an ablation study on our method to investigate the impact of its different components on finetuning and linear evaluation.

We study the pre-training and finetuning of computer vision models on a variety of datasets; see Table 3 for details. For image classification, we consider the iNaturalist 2018 and 2019, Stanford Cars and Food101 datasets, which all contain fine-grained categories. We also consider three subsets from the DomainNet dataset (clipart, painting and sketch), which are not natural images and hence come from different domains than ImageNet. For object detection and instance segmentation, we use the COCO dataset. Finally, we also use the ADE20k dataset for semantic segmentation. The training set sizes of these different datasets vary from 8k to 437k images, thus all being significantly smaller than ImageNet, some by more than two orders of magnitude. This allows us to investigate, under different data regimes, how feasible it is to pre-train directly on the target task data, alleviating the need for a large-scale curated dataset such as ImageNet.

As previously mentioned, we want to perform a constant number of updates during pre-training, and we thus adapt the number of epochs when training on target task data to match the number of updates corresponding to 300 epochs on ImageNet. For smaller classification datasets, we limit the number of pre-training epochs to 5000 since we observed pre-training for longer generally does not result in further improvement in terms of downstream performance. For very small datasets, like Stanford Cars, we observed an overfitting behaviour with training for very long schedules (e.g. more than 5k epochs, see Figure 6). Note that the adjusted number of pre-training epochs is provided in Table 3.

5.2 Dense Prediction

First, we evaluate our approach on the COCO object detection and instance segmentation dataset using the Mask R-CNN pipeline and report our results in Table 4. We compare models pre-trained on the COCO dataset alone with their equivalent counterparts that were pre-trained on ImageNet, either in a supervised or self-supervised fashion. We observe that BEiT models which were pre-trained on the COCO dataset alone obtain better downstream task performance than the same models pre-trained on ImageNet. For example, when using a ViT-base backbone, pre-training on COCO instead of ImageNet leads to a boost of +0.4 in box AP.

Additionally, we observe that a similar pre-training of DINO using COCO images provides a relatively weak performance, only outperforming random initialization. This indicates that strong pre-training on COCO is a unique property of denoising autoencoders and it does not extend to other self-supervised learning methods.

Finally, we observe that SplitMask leads to a consistent improvement compared to the BEiT baseline, such as +0.6 box AP when using a ViT-small and +0.3 mask AP for ViT-base backbones. All put together, in a comparable setting, we obtain a +1.1 box AP increase while not using ImageNet. Since COCO contains one order of magnitude fewer images than ImageNet, this suggests that large-scale datasets are not necessary for pre-training.

5.2.2 Semantic Segmentation

For semantic segmentation, we compare our denoising autoencoder models, pre-trained solely using ADE20k images, to their counterparts pre-trained on ImageNet. The results are reported in Table 5. All models use an UperNet pipeline. We observe that denoising autoencoders can provide very competitive performance on such a challenging task even when pre-trained using a relatively small sample size of 20k images. The performance matches that of BEiT self-supervised pre-training using ImageNet and is only marginally lower than that of supervised ImageNet pre-training.

We have found that adapting the random cropping strategy is a crucial implementation detail that helps improve the pre-training performance of denoising autoencoders on such a dataset. In particular, we reduce the maximal size of the crop from 100% to 25% of the raw image size.

5.3 Image Classification

We perform empirical evaluation on a number of classification datasets and report our results in Table 6. Overall, we find that BEiT or SplitMask pre-training, using solely the images of the target dataset, consistently obtains either the strongest or, at worst, the second strongest performance when compared to different options of self-supervised and supervised pre-training using ImageNet, as well as training from scratch.

First, we compare ImageNet pre-training to target data pre-training with BEiT and observe that in many cases, pre-training on the target data alone leads to better results. This is true for the ViT-small backbone across all the datasets, including Stanford Cars (+1.1% acc), which consists of only 8k images. When using a ViT-base backbone, pre-training on the target task data outperforms BEiT self-supervised ImageNet pre-training for datasets as small as Food101 (+0.7 acc), which is more than 10x smaller than ImageNet. Second, we observe that SplitMask leads to further improvements in performance on multiple datasets: for example, on the iNaturalist 2018 dataset, we see +3.0 in accuracy with a ViT-base model.

As was already observed in previous work, we also see in many cases that self-supervised training outperforms supervised pre-training on ImageNet. For example, on the iNaturalist datasets, training with the target task data alone (including a pre-training step) gives better results than pre-training on ImageNet with labels: with a ViT-base model and the SplitMask method, we see an improvement of +2.7% in top-1 accuracy. As for the clipart, painting and sketch datasets, we see that SplitMask provides competitive performance, outperforming an ImageNet pre-trained BEiT across all datasets for ViT-S. However, for these three datasets, supervised pre-training achieves the best performance for both ViT-S and ViT-B.

We note that when pre-training using the clipart and sketch datasets with the BEiT method, we experienced numerical instability that prevented the model from converging with long schedules (e.g. 5000 epochs). However, the instability problem was not observed for SplitMask models. Nevertheless, more investigation might be needed to fully understand how to optimize pre-training of such models.

5.4 Pre-training using ImageNet

In addition to our main study concerning the robustness of denoising autoencoders with respect to the size and type of pre-training data, we study SplitMask in the more commonly used setting of pre-training and finetuning using ImageNet.

In Table 7 we show the performance of our SplitMask method using the ViT-S and ViT-B backbones and 300 epochs of pre-training, compared to other recent transformer-based self-supervised learning methods. It can be observed that SplitMask provides strong performance, outperforming both BEiT and MoCo v3 for all backbones. Additionally, SplitMask achieves performance on par with DINO while being significantly cheaper and simpler to train. Note that while SplitMask and BEiT attain strong finetuning performance, denoising autoencoding methods typically fall behind instance discrimination methods like DINO in terms of linear probing.

5.5 Implementation Details

Similarly to the DALL-E tokenizer used in BEiT, all tokenizers presented in Table 2 have a vocabulary of size 8192. For the random projection tokenizer, we sample 8192 vectors with uniform component-wise distribution. For the random patches tokenizer, we sample 8192 patches from different images. For the k-means tokenizer, the 8192 elements of the vocabulary are obtained by applying the k-means algorithm to 3 million patches sampled from the dataset.

We use the original ViT formulation as proposed by Dosovitskiy et al. and we follow the pre-training hyperparameters of Bao et al. All reported baselines use the same backbone implementation and are trained in similar settings. For SplitMask, by default, we use random block masking with a 50% masking ratio to obtain a mask and its complement, from which we extract the two subsets. The maximum and minimum number of patches per block are 75 and 16 respectively. We use the standard random cropping and horizontal flipping as data augmentations. We use 2 transformer layers for the decoder, with an embedding dimension matching that of the encoder.
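The block masking described above can be approximated as follows. This is a simplified sketch and not the exact procedure of Bao et al.: it assumes a 14x14 patch grid (224x224 input), a log-uniform block aspect ratio between 0.3 and 1/0.3, and a final top-up with single random patches so that exactly 50% of the patches are masked. The function name and its parameters are hypothetical.

```python
import math
import random


def block_mask(grid=14, ratio=0.5, min_block=16, max_block=75, rng=None, max_attempts=1000):
    """Return a set of (row, col) positions covering `ratio` of the grid."""
    rng = rng or random.Random()
    target = int(grid * grid * ratio)
    masked = set()
    for _ in range(max_attempts):
        remaining = target - len(masked)
        if remaining < min_block:
            break
        # Sample a block area and an aspect ratio (assumed range 0.3 to 1/0.3).
        size = rng.randint(min_block, min(max_block, remaining))
        aspect = math.exp(rng.uniform(math.log(0.3), math.log(1 / 0.3)))
        height = int(round(math.sqrt(size * aspect)))
        width = int(round(math.sqrt(size / aspect)))
        if not (1 <= height <= grid and 1 <= width <= grid):
            continue
        top = rng.randint(0, grid - height)
        left = rng.randint(0, grid - width)
        block = {(r, c) for r in range(top, top + height) for c in range(left, left + width)}
        new_positions = block - masked
        if 0 < len(new_positions) <= remaining:  # never exceed the target count
            masked |= new_positions
    # Simplification: top up with single random patches to hit the exact ratio.
    unmasked = [(r, c) for r in range(grid) for c in range(grid) if (r, c) not in masked]
    rng.shuffle(unmasked)
    masked.update(unmasked[: target - len(masked)])
    return masked


all_positions = {(r, c) for r in range(14) for c in range(14)}
subset_a = block_mask(rng=random.Random(0))  # the mask
subset_b = all_positions - subset_a          # its complement
```

With a 50% ratio, the mask and its complement have the same size, so both branches process the same number of patches.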

However, for the smallest datasets (i.e. Stanford-Cars, ClipArt, Sketch and Paintings), we found that stronger data augmentation and more aggressive masking prevent early overfitting. In particular, we use uniform masking of 75% of the patches (as in the work by He et al.), as well as random greyscale, solarization, Gaussian blur and color jittering as additional forms of data augmentation.

The BEiT baselines pre-trained on ImageNet and reported in Tables 4 and 6 use the DALL-E tokenizer. Other BEiT and SplitMask models have been pre-trained using our random projection tokenizer. For the InfoNCE loss we use $\tau=0.2$ following Chen et al.

We use Mask R-CNN with a ViT backbone as our detection method. In order to obtain features compatible with the Feature Pyramid Network (FPN) design, we use max pooling and transposed convolution operations similar to El-Nouby et al. To accommodate the variable resolution, we replace the absolute positional encoding with a sinusoidal positional encoding, both for our models and for the baselines. All models are trained using the 3x schedule (36 epochs) unless mentioned otherwise. We use the training hyper-parameters used by Liu et al.

The hyperparameters used for finetuning on each of the image classification datasets reported in Table 6 are provided in Appendix D.

Conclusion

In this paper, we have raised the question of how to pre-train models with self-supervised learning, asking in particular whether large-scale datasets such as ImageNet are necessary for pre-training. Our study on ImageNet shows that using a smaller pre-training dataset does not lead to a large performance drop for denoising autoencoders, as opposed to instance discrimination self-supervised techniques or supervised pre-training. Similarly, training on non object-centric images does not impact the downstream task performance significantly.

Building upon these observations, we have pre-trained models directly on the target task data, instead of ImageNet, and performed evaluations on datasets of various sizes. We have shown that it is possible to pre-train on datasets 10x smaller than ImageNet, for example obtaining +0.5 box AP gains by solely using COCO images. We believe that this is strong evidence that large scale datasets, such as ImageNet, are not necessary for self-supervised pre-training when using denoising autoencoders.

We thank Armand Joulin, Jakob Verbeek, Natalia Neverova and Gabriel Synnaeve for fruitful discussions around this project.

Appendix A SplitMask vs BEiT

We ablate the proposed components of SplitMask against a BEiT baseline in Table 8. All models use a ViT-B backbone and are pre-trained for 300 epochs. First, we observe that the ImageNet finetuning performance improves by a margin of +0.5 simply by adopting the encoder-decoder architecture and processing two disjoint subsets per iteration. Second, the global contrastive loss on its own, without the MIM objective, provides very weak performance. This is expected, since there is no training signal for the local patch representations, and a global matching objective with 50% masking of patches may be too hard, providing a noisy training signal and hindering the model's ability to learn informative features.

Our full SplitMask model, which uses both the MIM and contrastive objectives, obtains the best performance and outperforms BEiT by a large margin of +0.8. The linear probing performance of SplitMask is stronger than that of BEiT. However, both models provide relatively weak performance on this benchmark compared to instance discrimination methods, whose final layers are more aligned with the classification task. Note that SplitMask adds a negligible computing overhead compared to the BEiT baseline: its wall-clock training time is marginally higher, as detailed in Table 8. All models are trained using 16 GPUs and a batch size of 2048.

Appendix B Encoder-Decoder vs BEiT

An advantage of the encoder-decoder design we propose in Section 4.2 is that it encourages decoupling the general-purpose encoding of image features, which is required for the downstream tasks, from the features specific to solving the MIM pretext task. In particular, in contrast to BEiT, the encoder is not capable of solving the pretext task on its own, since it does not have access to the mask token. Therefore, it can only help solve the task by providing informative representations to the decoder, which is the component responsible for solving the pretext task. We can see in Figure 5 that this property improves the transferability of the later layers' representations to downstream tasks compared to BEiT, which has a stronger drop in linear probing performance in later layers.

Appendix C Overfitting during pre-training

We observed that when pre-training on very small datasets (e.g. Stanford-Cars), longer pre-training schedules can be counterproductive. For example, if we follow the assumption that we need to pre-train for the same number of updates as ImageNet pre-training for 300 epochs, the equivalent Stanford-Cars schedule would be 45k epochs. However, as we see in Figure 6, pre-training for longer than 5k epochs leads to a severe drop in finetuning performance.

Appendix D Image Classification Finetuning

We detail the hyperparameters used to finetune each of the classification datasets in Table 9.