PeCo: Perceptual Codebook for BERT Pre-training of Vision Transformers

Xiaoyi Dong, Jianmin Bao, Ting Zhang, Dongdong Chen, Weiming Zhang, Lu Yuan, Dong Chen, Fang Wen, Nenghai Yu, Baining Guo

cs.CV cs.LG

Introduction

Current state-of-the-art self-supervised pre-training methods (Dosovitskiy et al. 2020; Bao, Dong, and Wei 2021; He et al. 2021; Xie et al. 2021; Chen et al. 2022; Wei et al. 2021) for vision transformers focus on masked image modeling (MIM), the task of predicting masked patches from the visible patches. The input is usually an image consisting of visible patches and randomly masked patches, and each patch is associated with its corresponding positional embedding. The prediction target for masked patches varies across methods, ranging from pixel-level prediction (Dosovitskiy et al. 2020; He et al. 2021; Xie et al. 2021) to feature-level prediction (Bao, Dong, and Wei 2021; Chen et al. 2022; Wei et al. 2021). In this paper, we study the prediction targets and introduce a better prediction target for MIM.

We point out that current prediction targets disagree with human judgment when evaluating the similarity between two different images. There are two representative prediction targets in current MIM methods: per-pixel regression and discrete token prediction. Figure 1 illustrates how different prediction targets answer the question of which image (View1 or View2) is “closer” to the “Reference” for these examples. The disagreement of current prediction targets with human judgment may stem from the per-pixel loss. Note that the discrete tokens are obtained by a VQ-VAE trained under a reconstruction objective, i.e., a per-pixel loss. A per-pixel measure, which assumes pixel-wise independence, is insufficient for assessing structured outputs. For example, blurring causes large perceptual change but small pixel error, while shifting incurs small perceptual change but large pixel error (Zhang et al. 2018). Such disagreement with human visual perception indicates that perceptually similar patches may have divergent prediction targets. This undermines the capability of MIM because it is, in principle, based on context prediction.

We propose that a good prediction target for MIM should coincide with human judgment. In other words, perceptually similar images should be close to each other in the prediction target space. Inspired by the observation in (Zhang et al. 2018) that deep features model low-level perceptual similarity surprisingly well, we introduce this so-called perceptual loss into VQ-VAE for discrete token learning. This loss can be viewed as a per-feature loss, as it aims to minimize the feature-wise distance between the original image and the reconstructed image. Specifically, we adopt multi-scale deep features from multiple layers at different depths of a self-supervised Transformer. As shown in Figure 1, our proposed prediction target indeed aligns with human perceptual judgment. We also show that the proposed visual tokens reach much higher linear probing accuracy than tokens learned without the perceptual loss. This indicates that our new visual tokens carry more semantic meaning, analogous to text, whose discrete tokens often contain highly semantic information.

We denote MIM using the introduced perceptual visual tokens as targets as “PeCo”, i.e. Perceptual Codebook for BERT pre-training of vision transformers. In the experiments, we demonstrate that, equipped with such perceptual visual tokens, PeCo achieves better performance than the strong competitor BEiT (Bao, Dong, and Wei 2021), which uses the DALL-E (Ramesh et al. 2021) codebook trained over 250M images without the perceptual loss. We fine-tune the pre-trained model on various downstream tasks: image classification, object detection, and semantic segmentation. Experimental results show that our pre-trained model transfers better than BEiT with only the prediction target changed. Concretely, we achieve 84.5% Top-1 accuracy on ImageNet-1K with the ViT-B model, outperforming BEiT by +1.3% with the same 800 pre-training epochs. Our approach also brings significant improvement on COCO object detection and instance segmentation as well as on ADE20K semantic segmentation. PeCo also scales well: when equipped with the larger ViT-H backbone, we achieve the state-of-the-art ImageNet accuracy (88.3%) among methods using only ImageNet-1K data.

Related Works

Self-supervised Learning. Self-supervised learning has attracted increasing attention over the past few years, as deep learning networks become more and more data-hungry and it is impossible to label everything in the world. There are two main categories along this path, contrastive and generative (Liu et al. 2021a). One emerging field is self-supervised contrastive learning, which trains an encoder to produce representations under a contrastive loss (Hadsell, Chopra, and LeCun 2006; Dosovitskiy et al. 2014) by comparing similar and dissimilar samples. Representative methods include MoCo (He et al. 2020; Chen et al. 2020d), SimCLR (Chen et al. 2020b, c), BYOL (Grill et al. 2020), SwAV (Caron et al. 2020) and more (Oord, Li, and Vinyals 2018; Li et al. 2021; Bachman, Hjelm, and Buchwalter 2019). However, contrastive methods depend heavily on strong data augmentation and effective negative sampling.

The other recently resurgent field is generative self-supervised learning, which trains an encoder and a decoder under a reconstruction objective. The typical objectives, autoregressive modeling and denoising autoencoding, aim at recovering the corrupted or masked input and have yielded the most successful frameworks (Devlin et al. 2018; Radford et al. 2018, 2019; Brown et al. 2020; Liu et al. 2019; Joshi et al. 2020) in NLP. Thanks to the pre-existing vocabulary in language, recovering a missing word can be cast as estimating a probability over all possible words, which converts the prediction problem into an easier classification problem. In CV, on the other hand, most attempts (Van Oord, Kalchbrenner, and Kavukcuoglu 2016; Oord et al. 2016; Chen et al. 2020a; He et al. 2021) still resort to regression for generative methods due to the lack of a visual vocabulary, e.g. iGPT (Chen et al. 2020a). Recently, BEiT (Bao, Dong, and Wei 2021) successfully adopted a classifier for prediction by directly using a VQ-VAE as the visual tokenizer. Yet there is a major difference between the language vocabulary and the visual vocabulary: the words of language are highly semantic, while the visual words of images are mostly not. Most recently, numerous works (Bao, Dong, and Wei 2021; He et al. 2021; Xie et al. 2021; Chen et al. 2022; Wang et al. 2022b; Dong et al. 2022; Baevski et al. 2022; Zheng et al. 2022) based on MIM have been developed concurrently, yet few have studied the perceptual level of the prediction targets. In this work, we attempt to learn a perceptual visual vocabulary for BERT pre-training, showing transfer performance superior to BEiT (Bao, Dong, and Wei 2021) and MAE (He et al. 2021).

Discrete Visual Supervision. Exploration of masked image modeling and image inpainting as self-supervised pre-training tasks has never stopped in the vision community, especially since BERT (Devlin et al. 2018) achieved great success in various NLP tasks. To apply the cross-entropy loss to vision tasks, iGPT (Chen et al. 2020a) clusters pixel values to mimic the BPE (Sennrich, Haddow, and Birch 2015) process used for words in language. ViT (Dosovitskiy et al. 2020) attempts to directly divide the raw pixel values into multiple groups and assign a discrete label to each group of RGB values. VQ-VAE (Oord, Vinyals, and Kavukcuoglu 2017) proposes to use an encoder and a decoder to quantize the visual content into a learnable codebook of fixed size.

Perceptual Similarity. Perceptual similarity, as its name suggests, aims to mimic the human perceptual judgment of image similarity. Numerous metrics have been proposed to achieve that, such as SSIM (Wang et al. 2004), MSSIM (Wang, Simoncelli, and Bovik 2003), FSIM (Zhang et al. 2011), and HDR-VDP (Mantiuk et al. 2011). It has been shown in (Zhang et al. 2018) that the internal activations of a network trained for classification coincide surprisingly well with human judgment. Such deep features have been widely used in image generation (Gatys, Ecker, and Bethge 2016; Johnson, Alahi, and Fei-Fei 2016; Chen et al. 2017; Bruna, Sprechmann, and LeCun 2015; Ledig et al. 2017; Esser, Rombach, and Ommer 2021) with the goal of synthesizing realistic images. The loss is called perceptual loss, or VGG loss, as the network used is often a VGG architecture. In this paper, we find that this simple loss is highly effective in building a better prediction target and significantly improves vision BERT pre-training. Moreover, to enable self-supervised learning, we adopt a self-supervised trained network rather than ImageNet-trained networks and show that it works comparably well. Both discoveries are conceptually simple yet effective and valuable.

Method

In the natural language processing field, the words are naturally discrete tokens which contain high semantic information. By contrast, vision signals are continuous with redundant low-level information. While there are various ways to discretize the image in prior works, the semantic level of the resulting visual tokens has been largely ignored. In this section, we start by briefly describing the discrete representation learning from VQ-VAE, and then introduce the process of how to learn a perceptual codebook, followed by BERT pre-training over the learned perceptual visual tokens.

In VQ-VAE, $q$ is the quantization encoder that maps a latent vector to an index of the codebook, and $r$ is the quantization decoder that reconstructs the vector from the index. Based on the quantized codewords $z_{\mathbf{q}}$, the decoder aims to reconstruct the input image $x$. Let the reconstruction be $\hat{x}=\text{Dec}(z_{\mathbf{q}})$. Since the quantizer is non-differentiable, the gradient is approximated with a straight-through estimator (Bengio, Léonard, and Courville 2013) and simply copied from the decoder to the encoder (Oord, Vinyals, and Kavukcuoglu 2017), so that it can be back-propagated into the encoder. The training objective of VQ-VAE is defined as,

$$\mathcal{L}_{\text{VQ-VAE}}=\mathcal{L}_{pixel}+\|\text{sg}[\text{Enc}(x)]-z_{\mathbf{q}}\|_{2}^{2}+\beta\|\text{sg}[z_{\mathbf{q}}]-\text{Enc}(x)\|_{2}^{2}$$

Here, $\mathcal{L}_{pixel}=\frac{1}{H\times W\times 3}\|x-\hat{x}\|$ is the per-pixel loss, $\text{sg}[\cdot]$ is the stop-gradient operator, and $\beta$ is a loss weight set to 0.25 in all our experiments.
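The quantization step and the value of the VQ-VAE objective can be sketched in NumPy as follows. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, nearest-neighbour assignment under Euclidean distance is assumed for $q$, and a mean squared error is assumed for the per-pixel term. The stop-gradient operator only affects gradients, so in a forward pass the codebook and commitment terms have the same numerical value.

```python
import numpy as np

def quantize(z, codebook):
    """Nearest-neighbour vector quantization (assumed form of q and r).

    z:        (N, D) encoder outputs, one row per spatial position
    codebook: (K, D) codewords
    Returns the indices q(z) and the quantized vectors r(q(z)).
    """
    # Squared Euclidean distance between every latent and every codeword.
    d2 = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    indices = d2.argmin(axis=1)   # q: vector -> codebook index
    z_q = codebook[indices]       # r: index -> codeword
    return indices, z_q

def vq_vae_loss(x, x_hat, z, z_q, beta=0.25):
    """Forward value of the VQ-VAE objective (no gradients are computed here)."""
    pixel = ((x - x_hat) ** 2).mean()                  # per-pixel term (MSE assumed)
    codebook_term = ((z - z_q) ** 2).sum(axis=-1).mean()   # ||sg[Enc(x)] - z_q||^2
    commitment = ((z_q - z) ** 2).sum(axis=-1).mean()      # ||sg[z_q] - Enc(x)||^2
    return pixel + codebook_term + beta * commitment
```

In an autodiff framework, the straight-through estimator is commonly written as `z + stop_gradient(z_q - z)`, which evaluates to `z_q` in the forward pass while passing the decoder gradient to `z` unchanged.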

Learning Perceptual Codebook for Visual Content

Therefore, we propose a simple yet effective strategy: enforcing perceptual similarity between the original image and the reconstructed one, in addition to the pixel loss. The perceptual similarity is based not on pixel differences but on differences between high-level image features extracted from a pre-trained deep neural network. We expect this feature-wise loss to better capture perceptual differences and to offer invariance towards low-level variations. Figure 3 compares different losses from the perspective of image reconstruction, suggesting that images with lower pixel-wise loss may not appear perceptually similar.

Previous works usually adopt a supervised pre-trained VGG (Simonyan and Zisserman 2014) network to calculate the perceptual loss. Since using supervision is not consistent with our purpose of self-supervised pre-training, we turn to self-supervised models and replace the ConvNet-based model with a Vision Transformer, which has better modeling capability and efficiency. In addition, pre-trained models usually encode different levels of semantic information in different layers. To give our codebook rich perceptual information, we adopt multi-scale features from multiple layers of the model to calculate the perceptual loss. Our experiments show that a vision Transformer (ViT-B model) from self-supervised learning works well for calculating the perceptual loss.

Formally, let $f_{l}(x)$ be the normalized activations of the $l$-th layer of a network $F$ when processing the image $x$. The feature map has size $H_{l}\times W_{l}\times C_{l}$, where $H_{l}$ is the height, $W_{l}$ the width, and $C_{l}$ the channel dimension. Multi-scale features from multiple layers at different depths are more comprehensive and discriminative, so they are usually extracted together to calculate the perceptual similarity. The perceptual metric for the input image $x$ and the reconstructed image $\hat{x}$ can be formulated as,

$$\mathcal{L}_{percep}=\sum_{l\in\mathcal{S}}\frac{1}{C_{l}H_{l}W_{l}}\|f_{l}(x)-f_{l}(\hat{x})\|_{2}^{2}$$

where $\mathcal{S}$ denotes the set of layers from which the features are extracted.
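As an illustration of the perceptual metric, the sketch below computes a multi-layer feature distance in NumPy. It assumes that "normalized activations" means unit L2 normalization along the channel dimension (as in Zhang et al. 2018) and that the per-layer distance is a squared L2 norm averaged over $H_{l}\times W_{l}\times C_{l}$; the feature extractor itself is left out, and the function names are hypothetical.

```python
import numpy as np

def normalize_channels(f, eps=1e-10):
    """Scale the feature vector at each spatial position to unit L2 norm."""
    return f / (np.linalg.norm(f, axis=-1, keepdims=True) + eps)

def perceptual_loss(feats_x, feats_x_hat):
    """Sum over layers of the mean squared distance between normalized features.

    feats_x, feats_x_hat: lists of arrays shaped (H_l, W_l, C_l), one entry per
    selected layer, for the original and the reconstructed image respectively.
    """
    total = 0.0
    for fx, fxh in zip(feats_x, feats_x_hat):
        diff = normalize_channels(fx) - normalize_channels(fxh)
        total += (diff ** 2).sum() / diff.size   # 1 / (H_l * W_l * C_l)
    return total
```

Because of the channel normalization, rescaling the features of one image leaves the distance essentially unchanged, so the metric responds to the direction of the feature vectors rather than their magnitude.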

Therefore, the overall objective function is,

$$\mathcal{L}=\mathcal{L}_{pixel}+\lambda\mathcal{L}_{percep}+\|\text{sg}[\text{Enc}(x)]-z_{\mathbf{q}}\|_{2}^{2}+\beta\|\text{sg}[z_{\mathbf{q}}]-\text{Enc}(x)\|_{2}^{2} \tag{5}$$

where $\lambda$ is the hyper-parameter for the loss weight of $\mathcal{L}_{percep}$. We study different values of $\lambda$ in the experiments. The training pipeline of the perceptual codebook is illustrated in Figure 2 (a). After training, the encoder and the quantizer are used as the tokenizer in the subsequent pre-training process.

BERT Objective over Perceptual Codebook

We adopt the BERT objective to perform the masked image modeling task over the discrete visual tokens as in BEiT (Bao, Dong, and Wei 2021), as illustrated in Figure 2. For a given image $x$, the input tokens are non-overlapping patches split from the whole image, and the output tokens are the discrete perceptual visual words obtained through learning Eqn 5. Let the input be $\{x^{1},x^{2},\cdots,x^{N}\}$, and the groundtruth output be $\{k^{1},k^{2},\cdots,k^{N}\}=q(\text{Enc}(x))$. The goal of masked image modeling is to recover the corresponding visual tokens from the masked input, in which a portion of the input tokens have been masked.
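Splitting an image into non-overlapping patches can be sketched with a reshape and a transpose. The helper name is hypothetical, and the $224\times 224$ input with $16\times 16$ patches used in the note below is only an example matching the ViT-B/16 backbone.

```python
import numpy as np

def patchify(image, patch=16):
    """Split an (H, W, C) image into non-overlapping flattened patches.

    Returns an array of shape (N, patch * patch * C) in row-major patch order,
    where N = (H // patch) * (W // patch).
    """
    h, w, c = image.shape
    if h % patch or w % patch:
        raise ValueError("image size must be divisible by the patch size")
    gh, gw = h // patch, w // patch
    x = image.reshape(gh, patch, gw, patch, c)
    x = x.transpose(0, 2, 1, 3, 4)               # (gh, gw, patch, patch, c)
    return x.reshape(gh * gw, patch * patch * c)
```

For a $224\times 224\times 3$ image this yields $N=196$ tokens of dimension 768.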

Precisely, let $\mathcal{M}$ be the set of masked indices. Then the masked input $\bar{x}$ is represented as,

$$\bar{x}^{i}=\begin{cases}x^{i}, & i\notin\mathcal{M}\\ m, & i\in\mathcal{M}\end{cases}$$

where $m$ is a learnable mask token with the same dimension as the non-masked tokens. The masked input tokens are fed into an $L$-layer vision Transformer, whose last-layer hidden outputs are denoted as $\{h^{1},h^{2},\cdots,h^{N}\}$. We aim to recover the corresponding visual token from the hidden vector at each masked position. To achieve that with a classification loss, a $K$-way classifier is appended after the hidden vector $h^{i}$ to estimate the probability of every possible discrete token in the corresponding codebook $\mathcal{V}^{i}$. Suppose the groundtruth discrete visual tokens corresponding to the masked patches are $k^{t}$ with $t\in\mathcal{M}$; the pre-training objective can then be formulated as,

$$\mathcal{L}_{\text{pre-train}}=-\sum_{t\in\mathcal{M}}\log P(k^{t}|\bar{x})$$

where $P(k^{t}|\bar{x})$ is the estimated target token probability for masked patches of the corrupted image $\bar{x}$.
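The masking step and the classification objective over masked positions can be sketched as below. This assumes the standard negative log-likelihood (cross-entropy) form summed over the masked set $\mathcal{M}$; the Transformer and the $K$-way classifier are abstracted into a matrix of logits, and all names are hypothetical.

```python
import numpy as np

def apply_mask(tokens, mask, mask_token):
    """Replace the tokens at masked positions with the mask token m.

    tokens: (N, D) input patch embeddings; mask: (N,) boolean; mask_token: (D,)
    """
    return np.where(mask[:, None], mask_token[None, :], tokens)

def mim_loss(logits, targets, mask):
    """Negative log-likelihood of the target visual tokens at masked positions.

    logits:  (N, K) classifier outputs, one row per patch position
    targets: (N,)   groundtruth codebook indices from the tokenizer
    mask:    (N,)   boolean, True where the input patch was masked
    """
    # Numerically stable log-softmax over the K codebook entries.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    picked = log_probs[np.arange(len(targets)), targets]
    return -picked[mask].sum()
```

Visible positions contribute nothing to the loss; only the rows selected by `mask` are summed.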

After pre-training the model, we apply it to various downstream tasks including ImageNet-1K (Deng et al. 2009) classification, COCO object detection (Lin et al. 2014), and ADE20K (Zhou et al. 2017) semantic segmentation.

Pre-training Details

Vector Quantizer. We use the standard k-means algorithm for vector quantization. We set the codebook size $K$ to 8192 for fair comparison. When the size $K$ of the discrete latent space is large, we observe that only a few codewords are selected to represent images and get trained, while many other codewords are wasted. To overcome this issue, we adopt exponential moving averages (Oord, Vinyals, and Kavukcuoglu 2017) to update the codebook, which has proved useful for increasing the utilization of codewords in a codebook.
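A minimal sketch of an exponential-moving-average codebook update, in the spirit of the EMA variant of VQ-VAE (Oord, Vinyals, and Kavukcuoglu 2017), is given below. The decay value, the Laplace smoothing constant, and all names are assumptions made for illustration and are not taken from this paper.

```python
import numpy as np

def ema_codebook_update(codebook, cluster_size, embed_sum, z, indices,
                        decay=0.99, eps=1e-5):
    """One EMA update of the codebook from a batch of assignments.

    codebook:     (K, D) current codewords
    cluster_size: (K,)   running count of vectors assigned to each codeword
    embed_sum:    (K, D) running sum of vectors assigned to each codeword
    z:            (N, D) encoder outputs in the batch
    indices:      (N,)   nearest-codeword index of each encoder output
    """
    K = codebook.shape[0]
    counts = np.bincount(indices, minlength=K).astype(float)
    sums = np.zeros_like(codebook, dtype=float)
    np.add.at(sums, indices, z)                  # per-codeword sum of assigned vectors

    cluster_size = decay * cluster_size + (1.0 - decay) * counts
    embed_sum = decay * embed_sum + (1.0 - decay) * sums

    # Laplace smoothing keeps rarely used codewords from dividing by zero.
    n = cluster_size.sum()
    smoothed = (cluster_size + eps) / (n + K * eps) * n
    new_codebook = embed_sum / smoothed[:, None]
    return new_codebook, cluster_size, embed_sum
```

Each codeword drifts towards the running mean of the encoder outputs assigned to it, which is the online k-means view of vector quantization.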

BERT Pre-training Setup. Given computational constraints, we use the original ViT-B/16 (Dosovitskiy et al. 2020) as the basic architecture of our backbone to validate the effectiveness of the learned visual codebook, as in BEiT (Bao, Dong, and Wei 2021). The model is pre-trained for 300/800 epochs with a batch size of 2048. We use a block-wise masking strategy to obtain the corrupted images, with the same setup as BEiT (Bao, Dong, and Wei 2021). We further demonstrate the effectiveness of our approach when scaling to ViT-Large and ViT-Huge backbones.

Experiments

Image Classification aims to classify a given image into its corresponding class category. We use the popular ImageNet-1K dataset. To enable classification, a global average pooling layer is appended after the pre-trained model. We fine-tune the model for 100 epochs with a cosine decay learning rate schedule that warms up to $4\times 10^{-3}$ over 20 epochs and then decays to 0. Following (Bao, Dong, and Wei 2021), layer-wise learning rate decay is also used and set to 0.65 by default. For more details, please refer to the supplementary materials.
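The fine-tuning schedule can be sketched as follows. The linear shape of the warm-up and the exponent convention for the layer-wise decay (head unscaled, earlier layers scaled by successive powers of the decay) are assumptions modelled on common BEiT-style fine-tuning recipes, and the function names are hypothetical.

```python
import math

def lr_at_epoch(epoch, peak_lr=4e-3, warmup_epochs=20, total_epochs=100):
    """Linear warm-up to peak_lr, then cosine decay to 0 at total_epochs."""
    if epoch < warmup_epochs:
        return peak_lr * epoch / warmup_epochs
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))

def layer_lr_scales(num_layers=12, decay=0.65):
    """Per-layer learning-rate multipliers for layer-wise decay.

    Index 0 is the patch embedding, 1..num_layers are the Transformer blocks,
    and num_layers + 1 is the classification head (scale 1.0).
    """
    return [decay ** (num_layers + 1 - i) for i in range(num_layers + 2)]
```

Under this convention the effective learning rate of a layer at a given epoch is `lr_at_epoch(epoch) * layer_lr_scales()[layer_id]`, so layers closer to the input are updated more conservatively.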

Semantic Segmentation is the task of assigning a label to each pixel of the input image. We compare on the ADE20K semantic segmentation benchmark (Zhou et al. 2017). Here we employ UperNet (Xiao et al. 2018) as the basic framework. For fair comparison, we follow previous works (Bao, Dong, and Wei 2021) and train UperNet for 160k iterations with a batch size of 16. More details are provided in the supplementary material.

Object Detection and Segmentation. Object detection aims to locate the objects in a given image and identify each of them. We perform fine-tuning on COCO object detection and segmentation with the Mask R-CNN (He et al. 2017) framework. Specifically, we add four FPN branches of different scales to resize the feature map, following (Bao, Dong, and Wei 2021). Fine-tuning is conducted with the “1x” schedule (12 training epochs) and single-scale input on the COCO training set, and performance is tested on the COCO validation set, following the strategy used in Swin Transformer (Liu et al. 2021b).

Comparison with previous works

We first compare our PeCo with previous state-of-the-art works. Here we report ImageNet-1K results with various model sizes. For object detection on COCO and semantic segmentation on ADE20K, we use ViT-B as the backbone.

Image Classification. The Top-1 accuracy on ImageNet-1K classification is reported in Table 1. We compare our method with 1) ViT (Dosovitskiy et al. 2020) and DeiT (Touvron et al. 2021), which are trained from scratch with supervision and random initialization; 2) MoCo v3 (Chen, Xie, and He 2021) and DINO (Caron et al. 2021), which represent contrastive learning for self-supervised pre-training; and 3) BEiT (Bao, Dong, and Wei 2021), MAE (He et al. 2021) and BootMAE (Dong et al. 2022), which are based on masked image modeling for self-supervised pre-training. It can be seen that our model (PeCo) significantly improves the performance compared with the models trained from scratch, suggesting the effectiveness of pre-training.

Compared with prior self-supervised pre-training models, our model achieves the best performance. For example, our model using the ViT-B backbone pre-trained for 800 epochs reaches 84.5% Top-1 accuracy, 1.3% higher than BEiT and 0.9% higher than MAE. Furthermore, we also compare the results on larger backbones, e.g. ViT-L and ViT-H. The results are reported in Table 1, showing significantly better performance than previous counterparts. This validates that our perceptual codebook is indeed beneficial for pre-training. Concretely, our model PeCo-H448 achieves the best Top-1 accuracy, 88.3%, on ImageNet-1K without external data, outperforming MAE by 0.5%. This is a new state-of-the-art result using only ImageNet-1K data.

We also report the results of pre-training for 300 epochs in Table 2. Compared with the baseline BEiT (Bao, Dong, and Wei 2021), our model obtains a +1.3% improvement for both 300 and 800 pre-training epochs. We further investigate a lite version of the tokenizer that halves the channel number of the original. This reduces the extra time cost introduced by the tokenizer by about $2\times$. We can see from Table 2 that with the lite tokenizer, our model still achieves competitive performance.

Semantic segmentation. We compare our method with 1) DeiT, a supervised pre-training method on ImageNet-1K; 2) MoCo, a contrastive learning based method; and 3) BEiT (Bao, Dong, and Wei 2021) and MAE (He et al. 2021), the state-of-the-art self-supervised learning models. Here we use the UperNet (Xiao et al. 2018) framework with $512\times 512$ input, trained for 160K iterations. The evaluation metric is mean Intersection over Union (mIoU) averaged over all semantic categories, and we report single-scale results. The results are given in Table 3. Our method achieves 48.5 mIoU, +1.1 mIoU over the supervised method. It is also +1.2 mIoU over MoCo, +1.4 mIoU over BEiT, and +0.9 mIoU over MAE. Our model even achieves better results (+0.4 mIoU) than MAE pre-trained for 1600 epochs. This verifies the effectiveness of the perceptual codebook.
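For reference, mean Intersection over Union on a pair of label maps can be computed as in the sketch below. This is a simplified per-image version with a hypothetical function name; benchmark toolkits typically accumulate intersections and unions over the whole validation set before averaging over classes.

```python
import numpy as np

def mean_iou(pred, target, num_classes):
    """Mean IoU over the classes that appear in the prediction or the target.

    pred, target: integer label maps of identical shape.
    """
    ious = []
    for c in range(num_classes):
        p, t = pred == c, target == c
        union = np.logical_or(p, t).sum()
        if union == 0:
            continue                      # class absent from both maps, skip it
        ious.append(np.logical_and(p, t).sum() / union)
    return float(np.mean(ious))
```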

Object detection and segmentation. We further investigate our transfer performance on object detection and segmentation. Here we use the Mask R-CNN (He et al. 2017) framework with single-scale input and the $1\times$ schedule (12 epochs). We compare with the strong competitor BEiT (Bao, Dong, and Wei 2021) on this dataset. The evaluation metric is box AP for detection and mask AP for segmentation. The comparison is presented in Table 3. Our model with ViT-B as backbone achieves 47.8 box AP and 42.6 mask AP, +3.7 box AP and +2.8 mask AP over supervised methods. Our model also outperforms the recent work MAE by +1.0 box AP and +0.7 mask AP under the same number of pre-training epochs. Our model also performs better than MAE pre-trained for 1600 epochs.

Analysis of Perceptual Codebook

In this section, we ablate our perceptual codebook under the setting of self-supervised pre-training on ImageNet-1K. The number of pre-training epochs is 800.

Semantics of the Codewords. The most important question is: do the learned perceptual codewords exhibit (more) semantic meaning? To answer this, we quantitatively evaluate the codewords’ semantics from two aspects. (1) We use the codewords of an image as features for classification. Average pooling is applied over the quantized codewords of the image, and we test the linear probing accuracy on the ImageNet dataset. (2) We use a DeiT-T (Touvron et al. 2021) pre-trained with supervision on ImageNet-1K (72.2% Top-1 accuracy on the clean ImageNet val set) to test the classification accuracy on the reconstructed images. We compare with the variant that does not use the perceptual similarity. The results are given in Table 4. We find that our perceptual codewords achieve much higher accuracy for both linear evaluation on the codewords and classification on the reconstructed images. This indicates that our perceptual codebook carries more semantic meaning and benefits the image reconstruction process. We also visualize the masked region predictions of BEiT (Bao, Dong, and Wei 2021) and our PeCo in Figure 4, showing that PeCo, with the aid of the perceptual codebook, is able to make more semantic predictions for the masked region.

Deep Architectures for Perceptual Similarity. Another key question is: does the architecture used to extract deep perceptual features affect the perceptual codebook learning, and thus the pre-training performance? We investigate two different deep architectures: the convolutional backbone ResNet50 (He et al. 2016) and the Transformer-based model ViT-B (Dosovitskiy et al. 2020). We study self-supervised models in order to enable unsupervised pre-training. The results are reported in Table 10. We can see that using a convolution-based or a Transformer-based network achieves similar performance. In addition, we also report the results using the classical VGG (Simonyan and Zisserman 2014) trained with supervision (i.e. using labels) in Table 10. It can be seen that using a supervised model for the perceptual metric achieves performance comparable to that of a self-supervised model.

Discussions

We present several in-depth discussions about the proposed model in this section.

Implicit vs. Explicit. The key contribution of our paper is improving the perceptual level of the discrete visual tokens for the subsequent pre-training. We have demonstrated that this can be achieved through a simple strategy, i.e. enforcing perceptual similarity over images. One may argue that learning a perceptual codebook by constraining images is rather implicit, compared with directly imposing a constraint on the codebook. Indeed, we also experiment with two explicit alternatives: 1) a supervised classification loss over the codewords; 2) a momentum contrastive loss over the quantized codewords, computed through data augmentation in a self-supervised way. We hoped that such high-level classification objectives would encode some semantics into the codewords. Empirically, however, we found that these explicit ways are not as effective as the proposed implicit strategy. The results are reported in Table 6. We conjecture that the codebook may learn global semantics from the classification/contrastive loss and thus fail to differentiate different codewords, which is not suitable for pre-training. In contrast, deep features from a pre-trained deep model contain rich and dense semantics.

Perceptual Loss vs. GAN Loss. The perceptual loss is widely used in generation tasks with the goal of improving image quality. We ask whether there is a positive relation between the image quality and the perceptual level of the codebook. To explore this, we adopt another technique, the adversarial loss in Generative Adversarial Nets (GANs) (Goodfellow et al. 2014), which has proved effective in enhancing the reconstructed image. Specifically, we add a patch-based discriminator $D$ (Li and Wand 2016), aiming to make the original image and the reconstructed one indistinguishable. The adversarial loss is,

$$\mathcal{L}_{adv}=\log D(x)+\log(1-D(\hat{x}))$$

We add this loss with a weight of 0.4 to Eqn 5 and use the learned codebook for pre-training. The resulting performance is shown in Table 7. We can see that the adversarial loss does not bring any gain to the transfer performance of pre-training.

Conclusion

In this paper, we argue that a good prediction target for masked image modeling should agree with human perception judgment. Motivated by this observation, we propose a simple yet effective strategy to obtain perceptually discrete tokens, beneficial for BERT pre-training of vision transformers. We present extensive comparisons on various downstream tasks. Our results indeed validate our hypothesis and show superior performance compared with previous state-of-the-art methods. We hope that the deep analysis about the prediction target in our work will lead to a broader exploration of this perspective and even help existing multi-modality foundation model pretraining (Yuan et al. 2021; Wang et al. 2022a).

Acknowledgements

This work was supported in part by the Natural Science Foundation of China under Grant U20B2047, 62072421, 62002334, 62102386 and 62121002.

Appendix A More Experiments

Accelerated BERT pre-training over PeCo. The recent work MAE (He et al. 2021) introduces an asymmetric encoder-decoder design in which the masked tokens are moved from the encoder to a small decoder. This results in a large reduction in computation. Inspired by this, we can also accelerate PeCo by adopting the network structure in (He et al. 2021) for BERT pre-training, with the proposed perceptual codebook as prediction targets. We denote the perceptual codebook used within the MAE framework as $\text{PeCo}_{\text{MAE}}$. We show that $\text{PeCo}_{\text{MAE}}$ enjoys the efficiency of the framework while improving the performance through the proposed prediction target.

Here we show the results of adopting the accelerated BERT pre-training paradigm but with the proposed perceptual codebook as prediction target. The comparison is shown in Table 8 for all the three downstream tasks. We can see that our new prediction target enjoys the efficiency of the framework and also gets a higher downstream performance.

Extend PeCo to Video-level Tasks. In the main paper, we explore PeCo on different image-level downstream tasks. Here we further apply PeCo to video-level tasks. We evaluate on video recognition with two widely used datasets, Kinetics-400 (K400) (Kay et al. 2017) and Something-Something-v2 (SSv2) (Goyal et al. 2017). We use TimeSformer and initialize it with a model pre-trained on ImageNet-1K. We use clips of size $8\times 224\times 224$ with a patch size of $16\times 16$. For PeCo and BEiT, we fine-tune for 15 epochs with a learning rate of $1.2\times 10^{-3}$, a layer decay of 0.65, and a weight decay of $1\times 10^{-4}$. For the supervised baseline DeiT, we set the learning rate to $1\times 10^{-4}$ for the backbone and $1\times 10^{-3}$ for the classification head. The batch size is set to 64 for all experiments.

As shown in Table 9, our PeCo outperforms the supervised baseline DeiT and the previous BEiT by a large margin, which supports the effectiveness and generalizability of PeCo.

The Loss Weight of Perceptual Similarity. In the experiments, the loss weight $\lambda$ in Eqn 5 is set to 1. Here we present the performance under various values of $\lambda$ among 0, 0.3, 1, 3, and 10. The results are shown in Table 10. We can see that using the perceptual loss yields 84.1% accuracy, outperforming the 82.9% of the model without the perceptual loss. However, further enlarging the loss weight leads to a performance drop. One possible explanation is that a large perceptual loss leads the model to pay more attention to semantics while losing some local details, whereas a good codebook for BERT pre-training needs both semantics and local details.

Adversarial Robustness Analysis. Here we analyze the adversarial robustness of the fine-tuned models obtained from different pre-training methods. We use two classical white-box attack methods, the Basic Iterative Attack Method (BIM) (Kurakin et al. 2016) and the Momentum Iterative Attack Method (MIM) (Dong et al. 2018). The attack threshold is $2/255$ and the number of iterations is 20.
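The two attacks can be sketched on a toy linear softmax classifier, which stands in for the fine-tuned network so that the input gradient has a closed form. The step size `eps / iters`, the momentum value used in the test, and all names are assumptions made for illustration; only the threshold $2/255$ and the 20 iterations come from the setup above.

```python
import numpy as np

def loss_and_grad(x, y, W, b):
    """Cross-entropy of a linear softmax classifier and its gradient w.r.t. x."""
    logits = W @ x + b
    logits = logits - logits.max()               # numerical stability
    p = np.exp(logits) / np.exp(logits).sum()
    loss = -np.log(p[y])
    grad = W.T @ (p - np.eye(len(b))[y])         # d loss / d x
    return loss, grad

def iterative_attack(x, y, W, b, eps=2 / 255, iters=20, momentum=None):
    """L-infinity iterative attack: BIM if momentum is None, else momentum iterative."""
    alpha = eps / iters                          # step size (an assumption)
    x_adv = x.copy()
    g = np.zeros_like(x)
    for _ in range(iters):
        _, grad = loss_and_grad(x_adv, y, W, b)
        if momentum is None:
            step = np.sign(grad)
        else:
            # Accumulate the L1-normalized gradient before taking the sign.
            g = momentum * g + grad / (np.abs(grad).sum() + 1e-12)
            step = np.sign(g)
        x_adv = x_adv + alpha * step
        x_adv = np.clip(x_adv, x - eps, x + eps)  # stay inside the eps-ball
        x_adv = np.clip(x_adv, 0.0, 1.0)          # stay a valid image
    return x_adv
```

For a real network, `loss_and_grad` would be replaced by a forward and backward pass through the fine-tuned model.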

As shown in Table 11, compared with the vanilla DeiT, the contrastive learning based method MoCo and the masked image modeling based methods BEiT and PeCo all improve adversarial robustness, and PeCo performs best. Interestingly, MAE is the only method whose robustness is worse than the baseline. We argue this may be because the prediction target of MAE is raw pixels (with simple pixel normalization), so the model pays more attention to the high-frequency content of the input, which makes it sensitive to high-frequency changes of the input. In contrast, BEiT and PeCo predict tokens, which can be viewed as clustered and distilled targets, so the model can focus on the structure or semantic information of the input rather than the high-frequency information.

Different Architectures for VQ-VAE. Here we investigate the performance when using different architectures for VQ-VAE. We consider several variants of the network architecture. For the encoder, we explore three models: 1) a $16\times$ down-sample encoder (our default setting); 2) an $8\times$ down-sample encoder; 3) ViT-B ($16\times$ down-sample). For the $8\times$ down-sample encoder, we remove one stage and train it with images of $112\times 112$ resolution. For the decoder, we use the inverse version of the corresponding encoder. The results on the ImageNet-1K dataset are shown in Table 12. We observe that CNN-based encoders and decoders achieve better results than the vision Transformer. We further reduce the parameters of the decoder by halving either the channel number or the depth of the network. The results in Table 12 suggest that reducing the parameters of the decoder may not hurt the fine-tuning performance of PeCo.

Appendix B Experiment Details

In this section, we provide more detailed experimental settings about downstream tasks.

VQ-VAE Architectures. For the convolutional encoder, the number of channels at the first stage is set to 64 and is doubled at every downsample operation. We apply Group Normalization (Wu and He 2018), as introduced in Taming Transformer (Esser, Rombach, and Ommer 2021). The convolutional decoder is an inverse version of the encoder. For the ViT-Base encoder, we use the original structure, and use the inverse version of ViT as the decoder.

Perceptual Codebook Learning Setup. We train the perceptual codebook on the training set of the ImageNet-1K dataset by default. For the encoder and decoder of VQ-VAE, we choose a traditional convolution-based backbone. The network contains two residual blocks at each resolution. A self-attention block is applied at the smallest resolution for both the encoder and the decoder. For the perceptual loss, we use by default the ViT-B model pre-trained for 100 epochs with the self-supervised method MoCo v3 (Chen, Xie, and He 2021), and the 3rd, 6th, 9th, and 12th layers are selected for deep features. We also apply the ResNet50 (He et al. 2016) and VGG (Simonyan and Zisserman 2014) models, with the perceptual similarity calculated at the end of each stage. We set the perceptual loss weight $\lambda$ to 1 unless otherwise noted. Different models for providing deep features for the perceptual loss are ablated in the experiments section. The input image size is $224\times 224$, consistent with the pre-training input size; each latent code corresponds to a $16\times 16$ image patch, so the latent codes form a $14\times 14$ grid. We use the EMA vector quantizer as the default quantizer algorithm. The learning rate is set to $5\times 10^{-5}$ with a batch size of 128. We train PeCo for 100 epochs and warm up the first 5000 iterations to stabilize the training process. The Adam (Kingma and Ba 2014) optimizer is used with $\beta_{1}$ and $\beta_{2}$ set to 0.5 and 0.95 respectively.
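To make the role of the selected layers and of $\lambda$ concrete, the following minimal sketch combines a pixel loss with a multi-layer feature distance. Several parts are assumptions for illustration only: the per-layer distance is taken to be a mean squared error, features are represented as flat lists of floats, the quantization-related terms of the VQ-VAE objective are omitted, and the function names are hypothetical.

```python
from typing import Dict, Sequence

# 1-indexed ViT-B layers named in the text.
SELECTED_LAYERS = (3, 6, 9, 12)


def mean_squared_error(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean squared difference between two equally sized feature vectors."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def perceptual_loss(
    feats_real: Dict[int, Sequence[float]],
    feats_recon: Dict[int, Sequence[float]],
    layers: Sequence[int] = SELECTED_LAYERS,
) -> float:
    """Sum the feature distance over the selected layers only (assumed form)."""
    return sum(mean_squared_error(feats_real[l], feats_recon[l]) for l in layers)


def total_loss(
    pixel_loss: float,
    feats_real: Dict[int, Sequence[float]],
    feats_recon: Dict[int, Sequence[float]],
    lam: float = 1.0,
) -> float:
    """Pixel loss plus lambda-weighted perceptual loss (quantizer terms omitted)."""
    return pixel_loss + lam * perceptual_loss(feats_real, feats_recon)


# Toy features for 12 layers; only layers 1 and 3 differ in the reconstruction.
real = {l: [1.0, 2.0] for l in range(1, 13)}
recon = {l: [1.0, 2.0] for l in range(1, 13)}
recon[1] = [5.0, 5.0]  # layer 1 is not selected, so it is ignored
recon[3] = [2.0, 2.0]  # layer 3 is selected: MSE = (1 + 0) / 2 = 0.5
print(total_loss(0.25, real, recon))  # 0.75
```

In this toy case the mismatch at layer 1 does not contribute, since only the 3rd, 6th, 9th, and 12th layers enter the sum.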

BERT Pre-training Setup. Considering computational resources, we use the original ViT-B/16 (Dosovitskiy et al. 2020) as the basic architecture of our backbone to validate the effectiveness of the learned visual codebook, as in BEiT (Bao, Dong, and Wei 2021). The model is pre-trained for 300/800 epochs with a batch size of 2048. The AdamW optimizer is adopted with learning rate, $\beta_{1}$, $\beta_{2}$, and weight decay set to $1.5\times 10^{-3}$, $0.9$, $0.999$, and $0.05$ respectively. We also apply stochastic depth (Huang et al. 2016) with a rate of 0.1. We use a block-wise masking strategy to obtain the corrupted images, with the same setup as BEiT (Bao, Dong, and Wei 2021). We further demonstrate the effectiveness of our approach when scaling to ViT-Large and ViT-Huge backbones.
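For readers unfamiliar with block-wise masking, the sketch below illustrates the general idea on a $14\times 14$ patch grid: rectangular blocks are sampled repeatedly until a target fraction of patches is masked. The specific values (a 40% mask ratio, a minimum block of 16 patches, an aspect ratio sampled uniformly from $[0.3, 1/0.3]$) follow our reading of the BEiT paper and are assumptions here, since this appendix does not restate them; the function name is hypothetical and the official implementation may differ in details.

```python
import math
import random
from typing import List, Optional


def blockwise_mask(
    grid: int = 14,
    mask_ratio: float = 0.4,   # assumed, not stated in this appendix
    min_block: int = 16,       # assumed minimum patches per block
    min_aspect: float = 0.3,   # assumed aspect-ratio lower bound
    seed: Optional[int] = None,
) -> List[List[bool]]:
    """Sample rectangular blocks until at least mask_ratio of patches are masked."""
    rng = random.Random(seed)
    target = int(mask_ratio * grid * grid)
    mask = [[False] * grid for _ in range(grid)]
    masked = 0
    while masked < target:
        # Block area: at least min_block patches, at most what is still needed.
        area = rng.uniform(min_block, max(min_block, target - masked))
        aspect = rng.uniform(min_aspect, 1.0 / min_aspect)
        height = min(grid, max(1, round(math.sqrt(area * aspect))))
        width = min(grid, max(1, round(math.sqrt(area / aspect))))
        top = rng.randint(0, grid - height)
        left = rng.randint(0, grid - width)
        for i in range(top, top + height):
            for j in range(left, left + width):
                if not mask[i][j]:
                    mask[i][j] = True
                    masked += 1
    return mask


mask = blockwise_mask(seed=0)
num_masked = sum(sum(row) for row in mask)
print(num_masked, "of", 14 * 14, "patches masked")
```

Because the last block may overshoot, the number of masked patches is at least, rather than exactly, the target.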

ADE20K Semantic Segmentation. We use UperNet (Xiao et al. 2018) based on the implementation from mmsegmentation (Contributors 2020). For UperNet, we follow the settings in (Bao, Dong, and Wei 2021) and use the AdamW (Loshchilov and Hutter 2017) optimizer with an initial learning rate of $3\times 10^{-4}$, a weight decay of 0.05, and a batch size of 16 (8 GPUs with 2 images per GPU) for 160K iterations. The learning rate is warmed up for 1500 iterations at the beginning and then decays linearly. We use layer decay (Bao, Dong, and Wei 2021) for the backbone and set it to 0.65. As the ViT architecture outputs features of the same size at every block, we add four FPNs of different scales to rescale the feature maps to different sizes. Specifically, we upsample the output feature of the 4th block by $4\times$, upsample the output feature of the 6th block by $2\times$, keep the output feature of the 8th block unchanged, and downsample the output feature of the 12th block by $2\times$. We use the default augmentation setting in mmsegmentation, including random horizontal flipping, random re-scaling (ratio range [0.5, 2.0]), and random photometric distortion. All models are trained with an input size of $512\times 512$. The stochastic depth rate is set to 0.2. At test time, we report single-scale results.
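The effect of the four rescaling branches can be worked out directly. The sketch below assumes a ViT-B/16 backbone (patch size 16, as in the pre-training setup) and a $512\times 512$ input; the names `FPN_TAPS` and `fpn_feature_sizes` are hypothetical.

```python
from typing import Dict

PATCH_SIZE = 16  # ViT-B/16
# (block index, spatial rescaling factor applied to that block's output)
FPN_TAPS = ((4, 4.0), (6, 2.0), (8, 1.0), (12, 0.5))


def fpn_feature_sizes(image_size: int = 512, patch_size: int = PATCH_SIZE) -> Dict[int, int]:
    """Side length of the feature map produced from each tapped block."""
    base = image_size // patch_size  # every ViT block outputs a base x base map
    return {block: int(base * scale) for block, scale in FPN_TAPS}


sizes = fpn_feature_sizes(512)
strides = {block: 512 // side for block, side in sizes.items()}
print(sizes)    # {4: 128, 6: 64, 8: 32, 12: 16}
print(strides)  # {4: 4, 6: 8, 8: 16, 12: 32}
```

The resulting strides of 4, 8, 16, and 32 mirror the feature pyramid that a hierarchical CNN backbone would normally supply to the segmentation head.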

COCO Object Detection and Instance Segmentation. We use the classical object detection framework Mask R-CNN (He et al. 2017) based on the implementation from mmdetection (Chen et al. 2019). We train it with the $1\times$ schedule and single-scale input (the image is resized so that the shorter side is 800 pixels, while the longer side does not exceed 1333 pixels) for 12 epochs. We use the AdamW (Loshchilov and Hutter 2017) optimizer with a learning rate of $4\times 10^{-4}$, a weight decay of 0.05, and a batch size of 16. We also use layer decay (Bao, Dong, and Wei 2021) for the backbone and set it to 0.75. The learning rate is decayed by a factor of 0.1 at the 8th and 11th epochs. The stochastic depth rate is set to 0.1. Similar to the implementation of semantic segmentation above, we also use four FPNs of different scales to rescale the feature maps to different sizes.
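The interaction between layer decay and the step schedule can be sketched as follows. Two assumptions are made here and are not confirmed by the text: the layer-wise scale follows the form used in the public BEiT fine-tuning code, $\text{decay}^{(L+1-i)}$ for layer index $i$ with $L=12$ blocks, and epochs are counted from 0 with the decay taking effect from epochs 8 and 11 onward. The function names are hypothetical.

```python
from typing import List, Sequence


def layer_lr_scales(num_layers: int = 12, decay: float = 0.75) -> List[float]:
    """Per-layer learning-rate multipliers (assumed BEiT-style layer decay).

    Index 0 is the patch embedding, 1..num_layers are the Transformer
    blocks, and num_layers + 1 is everything after the backbone.
    """
    return [decay ** (num_layers + 1 - i) for i in range(num_layers + 2)]


def step_lr(
    epoch: int,
    base_lr: float = 4e-4,
    milestones: Sequence[int] = (8, 11),
    gamma: float = 0.1,
) -> float:
    """Base learning rate after applying every milestone reached so far."""
    drops = sum(1 for m in milestones if epoch >= m)
    return base_lr * gamma ** drops


scales = layer_lr_scales()
for epoch in (0, 8, 11):
    # Effective rate of the first block versus the last block of the backbone.
    print(epoch, step_lr(epoch) * scales[1], step_lr(epoch) * scales[12])
```

With a decay of 0.75, the earliest layers are updated far more conservatively than the last blocks, which is the intended effect of layer decay during fine-tuning.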

Appendix C More Visual Results

In Figure 5, we show the reconstruction results with different numbers of masked patches. We find that PeCo learns strong semantics and can predict a reasonable object from a limited number of visible patches.