Three things everyone should know about Vision Transformers

Hugo Touvron, Matthieu Cord, Alaaeldin El-Nouby, Jakob Verbeek, Hervé Jégou

Introduction

Since its introduction the Transformer architecture has become the dominant architecture in natural language processing tasks, replacing previously popular recurrent architectures. The vision transformer (ViT) is a simple adaptation of transformers to computer vision tasks like image classification: the input image is divided into non-overlapping patches, which are fed to a vanilla transformer architecture, after a linear patch projection layer. In contrast to networks built from convolutional layers, transformers offer parallel processing and a complete field-of-view in a single layer. Along with other attention-based architectures, transformers have recently substantially influenced the design of computer vision architectures. Many modern architectures in computer vision directly inherit parts of their design from this work, or are at least inspired by the recent findings resulting from transformers. As a result, significant improvements have been observed on different computer vision tasks, ranging from object detection, segmentation, and video analysis to image generation.

While vision transformers have led to considerable progress, the optimization of their design and training procedures has only been explored to a limited extent. In this paper, we offer three insights on training vision transformers.

1. Parallel vision transformers. Several works advocate shallower networks for reasons ranging from lower latency to easier optimization. We propose a very simple way to achieve this with ViTs. Let us denote by MHSA the multi-headed self-attention residual block, and by FFN the residual feedforward network. Starting from a sequential architecture depicted as follows,

$\mathrm{MHSA}_1 \rightarrow \mathrm{FFN}_1 \rightarrow \mathrm{MHSA}_2 \rightarrow \mathrm{FFN}_2$,

we parallelize the architecture by reorganizing the same blocks by pairs,

$(\mathrm{MHSA}_1 \parallel \mathrm{MHSA}_2) \rightarrow (\mathrm{FFN}_1 \parallel \mathrm{FFN}_2)$,

which can be done for any number of parallel blocks. This produces an architecture with the same number of parameters and compute, while being wider and shallower. This design allows for more parallel processing, easing optimization and reducing latency depending on the implementation.

In Section 3, we experimentally analyse the performance of this parallel construction, and in particular how it affects the accuracy in comparison to the sequential baseline. The parallel version becomes a compelling option when the network is deep enough. In some cases, we observe improvements in accuracy resulting from an easier optimization. Regarding the latency on GPUs, we observe reductions in the case of small batch sizes. We have not found any papers in the literature analyzing the effect of width versus depth for ViT on common GPUs and CPUs.

2. Fine-tuning attention is all you need. It is common practice to pre-train networks before fine-tuning them on a target task. This is the standard approach underpinning transfer learning, where one leverages a large generic dataset like ImageNet when the number of images is limited for the target task. Another context is that of changing resolution. Typically one would train at a lower resolution than the one employed at inference time. This saves resources, but additionally it reduces the discrepancy of scale between train and test images that results from data augmentation. In Section 4 we show that, in the case of ViT, it is mostly sufficient to fine-tune only the multi-head attention layers and freeze the feedforward network (FFN) layers. This saves compute and reduces the memory peak during training. Importantly, this allows the same FFN weights, which dominate the number of parameters, to be used for multiple tasks. The impact on accuracy is not statistically significant when fine-tuning for different image resolutions. For large models, the impact on accuracy is limited when considering transfer to other classification tasks.

3. Patch preprocessing with masked self-supervised learning. The first layers of a transformer have a relatively local span, suggesting that they mostly behave like convolutions. Some recent hybrid architectures preprocess their input images with a convolutional stem, to improve accuracy and training stability. However, preprocessing images with convolutions is a priori not compatible with the recent and successful mask-based self-supervised learning approaches, like BeiT or MAE. The convolutions propagate information across patches, impeding the masked prediction task.

In Section 5, we propose a simple way to adapt mask-based self-supervised training methods with patch pre-processing, by applying the masking after the patch pre-processing. However, our analysis reveals that existing convolutional stems are not effective when combined with BeiT. To address this issue, we introduce a hierarchical MLP (hMLP) stem that interleaves MLP layers and patch aggregation operations, and prohibits any communication between patches. Our experiments show that this choice is effective and able to leverage the benefit of both BeiT self-supervised pre-training and patch pre-processing. Moreover, our hMLP-stem is also effective for ViT in the supervised case: it is on par with the best convolutional stem of our comparison.

Background

In this section, we discuss related work that is common to our different contributions. We also introduce the baseline ViT models considered in this study and how they are trained. In subsequent sections, we discuss related work that is more specific to each of our three contributions.

Attention-based models, and in particular transformers, have been rapidly adopted in neural networks handling text, speech, and even for more complex tasks such as function integration or solving differential equations. In computer vision, DeTR and Vision Transformers (ViT) have deeply influenced the design of architectures in a short period of time. Most of the architectures introduced since ViT can be regarded as some form of hybridisation of transformers with convolutional neural networks, as illustrated by the hierarchical transformers, or conversely by convolutional neural networks with design elements inspired from ViT, or even multi-layer perceptrons adopting designs inspired by transformers.

In our case we build upon the basic ViT design of Dosovitskiy et al. Its design is governed by a small hyper-parameter space, and as such is less engineered than some recent follow-up architectures. With a proper training procedure, it achieves interesting performance/complexity trade-offs. It is also versatile: it can be effectively combined with hierarchical detection or segmentation frameworks. Importantly, in spite of limited built-in priors, it has demonstrated great potential when combined with self-supervised learning, either with contrastive methods or with reconstruction-based techniques like BeiT or other forms of masked auto-encoders.

Experimental setting

ViT models. We consider the vanilla ViT models initially introduced by Dosovitskiy et al. as well as the smaller ones proposed by Touvron et al. Therefore we use the initial pooling method that is based on a so-called class token. We only consider transformers operating on 16 $\times$ 16 patches. Decreasing this patch size improves the results but significantly increases the model complexity.
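To make the effect of the patch size concrete, the following minimal sketch counts the tokens processed by the transformer. It assumes a square image whose side is divisible by the patch size and a single class token; the function name `num_tokens` is illustrative and does not come from the original work.

```python
def num_tokens(image_size: int, patch_size: int = 16, class_token: bool = True) -> int:
    """Number of tokens for a square image cut into non-overlapping square patches."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    patches_per_side = image_size // patch_size
    return patches_per_side * patches_per_side + (1 if class_token else 0)


# Compare the default 16x16 patches with a smaller 8x8 patch size.
for size, patch in [(224, 16), (384, 16), (224, 8)]:
    print(f"{size}x{size} image, {patch}x{patch} patches: {num_tokens(size, patch)} tokens")
```

Halving the patch size roughly quadruples the number of tokens, and the cost of self-attention grows quadratically with the number of tokens, which is why smaller patches are markedly more expensive.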

Training procedure. To prevent overfitting, we adopt an existing training setting, namely the A2 procedure of Wightman et al. It uses a binary cross entropy loss and fixes the setting of most of the hyper-parameters. Wightman et al.’s A2 procedure was originally designed for training ResNet-50 models, and requires a few modifications when adopting it for ViTs to get strong performance and ensure sufficient stability:

The learning rate should be reduced compared to ResNet-50. We set it to $lr=4\times 10^{-3}$ for ViT-Ti and ViT-S and to $lr=3\times 10^{-3}$ for ViT-B and ViT-L.

Stochastic depth drop-rate $sd$: we adjust it per model following Touvron et al. It is not used for ViT-Ti. We fix $sd=0.05$ for ViT-S, $sd=0.1$ for ViT-B and $sd=0.4$ for ViT-L.

We observe that LayerScale significantly improves the performance when training large models, and that in that case a longer training is also beneficial. Therefore, in addition to our main baseline where we train for 300 epochs without LayerScale, like in DeiT and in the A2 procedure of Wightman et al., we consider another one that is trained for 400 epochs with LayerScale (LS).
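The settings listed above can be gathered in a small configuration table. The sketch below only restates the values given in the text; the dictionary layout, key names and the helper `get_config` are illustrative assumptions, not part of any released code.

```python
# Per-model values as stated in the text; key names are illustrative.
MODEL_SETTINGS = {
    "ViT-Ti": {"lr": 4e-3, "stochastic_depth": 0.0},   # stochastic depth is not used
    "ViT-S":  {"lr": 4e-3, "stochastic_depth": 0.05},
    "ViT-B":  {"lr": 3e-3, "stochastic_depth": 0.1},
    "ViT-L":  {"lr": 3e-3, "stochastic_depth": 0.4},
}

# The two training schedules considered: the main baseline and the LayerScale variant.
SCHEDULES = {
    "baseline": {"epochs": 300, "layerscale": False},
    "with_ls":  {"epochs": 400, "layerscale": True},
}


def get_config(model: str, schedule: str = "baseline") -> dict:
    """Merge the per-model and per-schedule settings into one dictionary."""
    config = {"model": model, "loss": "binary_cross_entropy"}
    config.update(MODEL_SETTINGS[model])
    config.update(SCHEDULES[schedule])
    return config


print(get_config("ViT-L", "with_ls"))
```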

Evaluation. Unless specified otherwise, we train our models on the ImageNet-1k dataset, and evaluate the top-1 accuracy on its validation set. All experiments are carried out with seed 0. Since we have adjusted a low number of hyper-parameters, and since we share them across models except stochastic depth, we do not expect much overfitting. Nevertheless we also evaluate our models with the same metric on ImageNet-V2 (matched frequency), which provides a separate test set, to provide a complementary view on the results.

Baselines

We report the results of our baseline in Table 1. With the few adaptations that we have made, our training procedure outperforms existing ones for supervised training for the model sizes that we consider, see Appendix A (Table 8). Note that all our models use a patch size of 16 $\times$ 16 as in Dosovitskiy et al. Unless specified, our experiments are carried out with images of size 224 $\times$ 224.

Depth vs Width: Parallel ViT

A recurrent debate in neural architecture design is on how to balance width versus depth. The first successful neural networks on ImageNet were not very deep; for instance, the 22-layer GoogLeNet was regarded as deep by 2014 standards. This changed with ResNets, for which going deeper hindered the optimization significantly less thanks to the residual connections. After their introduction, some researchers investigated alternative choices for trading depth against width, like Wide Residual Networks.

Recently, there has been a renewed interest in wider architectures with attention. For instance, Non-deep Networks proposes an architecture with several parallel branches whose design is more complex. In our work, we aim to propose a much simpler and more flexible alternative that builds upon a regular ViT in a more straightforward manner.

The ViT architecture of Dosovitskiy et al. is parametrized by three quantities: the width (i.e., the working dimensionality $d$), the depth, and the number of heads. We do not discuss the latter. Increasing depth or width increases the capacity of the model and usually its accuracy. For the most common ViT models that we report in Table 1, width and height are scaled together. Below, we discuss the different pros and cons of favoring width versus depth.

Parametrization & Optimization. The compositionality of the layers is better with deeper networks. This was one of the decisive advantages of ResNet once optimization issues were solved by residual connections. Yet too much depth hinders optimization, even with residual connections. Some solutions have been proposed to address this issue for ViTs, showing that transformers benefit from depth when trained with an improved optimization procedure.

Separability. In image classification, the spatial features are ultimately projected or pooled into a high-dimensional latent vector that is subsequently fed to a linear classifier. The dimensionality of this vector should be high enough so that the classes are linearly separable. Hence it is typically larger for tasks involving many classes. For instance in ResNet-50 it has dimension 512 when applied to CIFAR, but 2048 for ImageNet. In ViT, the width is identical to the working dimensionality of each patch, and is typically smaller than with ResNet, possibly limiting the separation capabilities. Besides, a larger dimension of the latent vector tends to favor overfitting. In this regard the compromise between capacity and overfitting is subtle and depends on the size of the training set.

Complexity. In ViT, the different complexity measures are affected differently by width and depth. Ignoring the patch pre-processing and final classification layer, which contribute to complexity in a negligible manner, we have:

The number of parameters is proportional to depth and a quadratic function of the width.

The compute, as determined by FLOPS, is similarly proportional to the depth and quadratic in width.

The peak memory usage at inference time is constant when increasing the depth for a fixed width, but it is quadratic as a function of width.

The latency of wide architectures is in theory better as they are more parallel, but actual speedups depend on implementation and hardware.
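The scaling of the parameter count can be checked with a short calculation. The sketch below assumes a standard transformer block with a fused query/key/value projection, an output projection, an FFN whose hidden size is 4 times the width, and two LayerNorms; it ignores the patch projection, positional embedding, class token and classifier. The function names are illustrative, and the width 768 with depth 12 is the usual ViT-B setting.

```python
def block_params(width: int, mlp_ratio: int = 4) -> dict:
    """Parameter count of one (MHSA, FFN) pair, split by component."""
    d = width
    hidden = mlp_ratio * d
    mhsa = (3 * d * d + 3 * d) + (d * d + d)          # qkv projection + output projection
    ffn = (d * hidden + hidden) + (hidden * d + d)    # two linear layers with biases
    norms = 2 * (2 * d)                               # two LayerNorms (scale and shift)
    return {"mhsa": mhsa, "ffn": ffn, "norms": norms}


def encoder_params(width: int, depth: int, mlp_ratio: int = 4) -> int:
    """Parameter count of a stack of `depth` identical blocks."""
    return depth * sum(block_params(width, mlp_ratio).values())


base = encoder_params(width=768, depth=12)
print("ViT-B-like encoder:", base)
print("depth x2:", encoder_params(768, 24) / base)    # linear in depth
print("width x2:", encoder_params(1536, 12) / base)   # close to quadratic in width
```

Under these assumptions the FFN holds about twice as many weights as the MHSA, so attention accounts for roughly one third of the weights of each block.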

Parallelizing ViT

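Consider a sequence of transformer blocks defined by the residual functions $\mathrm{mhsa}_l(\cdot)$ and $\mathrm{ffn}_l(\cdot)$. Instead of sequentially processing the input $x_l$ in four steps,

$$x'_{l+1} = x_l + \mathrm{mhsa}_l(x_l), \qquad x_{l+1} = x'_{l+1} + \mathrm{ffn}_l(x'_{l+1}),$$

$$x'_{l+2} = x_{l+1} + \mathrm{mhsa}_{l+1}(x_{l+1}), \qquad x_{l+2} = x'_{l+2} + \mathrm{ffn}_{l+1}(x'_{l+2}),$$

we replace this composition by two parallel operations:

$$x_{l+1} = x_l + \mathrm{mhsa}_{l,1}(x_l) + \mathrm{mhsa}_{l,2}(x_l),$$

$$x_{l+2} = x_{l+1} + \mathrm{ffn}_{l,1}(x_{l+1}) + \mathrm{ffn}_{l,2}(x_{l+1}).$$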

Our strategy is different from taking transformers with a larger working dimensionality, which leads to different trade-offs between accuracy, parameters, memory and FLOPS, as discussed in our experiments. In contrast to increasing the working dimension, which increases the complexity quadratically as discussed above, our modification is neutral with respect to parameters and compute.

Depending on whether we effectively parallelize the processing, the peak memory usage at inference time and the latency are modified. Note that rather than just two, we can choose to process any number of blocks in parallel, falling back to the sequential design if we process a single block in each layer.
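The regrouping of residual blocks can be sketched independently of any deep learning framework. In the illustrative code below, each block is an arbitrary callable and the input is a plain number, so that only the wiring is shown; the same wiring applies to tensors. The helper names `run_layers`, `sequential_layout` and `parallel_layout` are assumptions of this sketch and not the authors' implementation.

```python
from typing import Callable, List, Sequence

Branch = Callable[[float], float]


def run_layers(x: float, layers: Sequence[Sequence[Branch]]) -> float:
    """Apply residual layers in order; each layer adds the sum of its parallel branches."""
    for branches in layers:
        x = x + sum(branch(x) for branch in branches)
    return x


def sequential_layout(mhsa: Sequence[Branch], ffn: Sequence[Branch]) -> List[List[Branch]]:
    """MHSA_1 -> FFN_1 -> MHSA_2 -> FFN_2 -> ... with one branch per layer."""
    layers = []
    for attention, feedforward in zip(mhsa, ffn):
        layers.append([attention])
        layers.append([feedforward])
    return layers


def parallel_layout(mhsa: Sequence[Branch], ffn: Sequence[Branch],
                    branches: int = 2) -> List[List[Branch]]:
    """Group `branches` consecutive MHSA blocks, then the matching FFN blocks."""
    if len(mhsa) != len(ffn) or len(mhsa) % branches != 0:
        raise ValueError("need equally many MHSA and FFN blocks, divisible by `branches`")
    layers = []
    for start in range(0, len(mhsa), branches):
        layers.append(list(mhsa[start:start + branches]))
        layers.append(list(ffn[start:start + branches]))
    return layers


# Toy stand-ins for the residual functions, chosen only to exercise the wiring.
toy_mhsa = [lambda x: 0.5 * x, lambda x: 0.25 * x]
toy_ffn = [lambda x: 1.0, lambda x: 2.0]
print(run_layers(1.0, sequential_layout(toy_mhsa, toy_ffn)))
print(run_layers(1.0, parallel_layout(toy_mhsa, toy_ffn, branches=2)))
```

Both layouts use exactly the same blocks, which is why the number of parameters and the compute are unchanged; only the number of sequential steps differs.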

Experiments

Notation. We adopt the standard naming convention of previous work to use the postfixes Ti/S/B/L to identify the working dimensionality of the models, i.e., the column “width” in Table 1. We append the depth $N$ to indicate variations on the number of pairs of layers (MHSA, FFN). For instance, ViT-B24 has the same width as a ViT-B12 but with twice the depth, i.e., 24 pairs of MHSA and FFN layers instead of 12. For our parallel models, we specify both the depth and the number of parallel branches: ViT-B12 $\times$ 2 has twice the number of residual modules as a ViT-B12. It includes a total of 12 $\times$ 2=24 pairs of MHSA and FFN layers. Therefore it has the same complexity as the ViT-B24 model (a.k.a. ViT-B24 $\times$ 1).

Comparison of sequential and parallel ViTs. In Figure 4, we compare the performance of sequential and parallel models of a fixed complexity. We fix the total number of blocks, i.e. pairs of MHSA and FFN layers, which determines the number of parameters and FLOPS, and we consider different numbers of branches that lead to the same total number of blocks. For instance 36 can be obtained as the sequential ViT 36 $\times$ 1, or the parallel ViTs 18 $\times$ 2, 12 $\times$ 3 or 9 $\times$ 4.
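The layouts compared at a fixed complexity are simply the factorizations of the total number of blocks. A minimal sketch, with the illustrative helper `branch_layouts` and an assumed cap of four branches:

```python
from typing import List, Tuple


def branch_layouts(total_blocks: int, max_branches: int = 4) -> List[Tuple[int, int]]:
    """All (depth, branches) pairs such that depth * branches == total_blocks."""
    return [(total_blocks // b, b)
            for b in range(1, max_branches + 1)
            if total_blocks % b == 0]


for total in (36, 60):
    names = [f"{depth}x{branches}" for depth, branches in branch_layouts(total)]
    print(total, "blocks:", ", ".join(names))
```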

We observe that, amongst the parallel and sequential models, the best performance is obtained with two parallel branches for all tested model capacities. The performance is comparable between the S20 $\times$ 3 and S30 $\times$ 2 for ViT-S60, but generally using more than two parallel branches is not favorable in terms of accuracy and we do not discuss them further. Note that Figure 4 compares ViT models with a relatively large number of blocks (36 and 60). This is the case where sequential models are relatively difficult to optimize due to their depth. The parallel models with two branches are easier to train, while being deep enough to benefit from layer compositionality.

In Figure 4, we consider models with only 24 pairs (MHSA, FFN) and a varying width. Here we observe that the smallest models ViT-Ti and ViT-S are better in their sequential version. This is because they are easy to optimize up to 24 layers. The B24 $\times$ 1 and B12 $\times$ 2 achieve comparable performance. In contrast, the ViT-L12 $\times$ 2 is stronger than its sequential counterpart, which is more difficult to optimize even though we used LS for this size; without LS its performance is 83% at 300 epochs.

In Figure 4, we compare the performance of sequential and parallel models as a function of the number of blocks for ViT-S and ViT-B. Our observations concur with our previous findings: the parallel version is more helpful for the deeper and higher capacity models that are more difficult to optimize; our parallelization scheme alleviates this issue.

Impact of optimization. In Table 4, we provide results with LayerScale, which helps the optimization of the biggest models. It improves the performance of both sequential and parallel models, which end up approximately on par. Hence, for models big enough and with proper optimization, sequential and parallel ViTs are roughly equivalent.
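As a reminder of what LayerScale does, the sketch below shows a residual update in which the output of a branch is multiplied by a learnable per-channel scale before being added to the input. It operates on plain Python lists for clarity; the helper name and the initial value of 1e-4 are illustrative assumptions, not settings taken from this work.

```python
from typing import Callable, List, Sequence


def layerscale_residual(x: Sequence[float],
                        branch: Callable[[Sequence[float]], Sequence[float]],
                        scale: Sequence[float]) -> List[float]:
    """Return x + scale * branch(x), with one scale value per channel."""
    update = branch(x)
    return [xi + si * ui for xi, si, ui in zip(x, scale, update)]


# With a small initial scale the block starts close to the identity mapping.
channels = 4
scale = [1e-4] * channels                      # illustrative initial value
token = [1.0, -2.0, 0.5, 3.0]
print(layerscale_residual(token, lambda v: [10.0 * a for a in v], scale))
```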

Increasing the number of modules or the working dimensionality? Table 4 provides a comparison between different ViT architectures: sequential, parallel, and with larger working dimensionality. We approximately adjust the complexity in terms of parameters and FLOPS, yet this means that ViT models with larger working dimensionality have a higher peak memory usage with a typical implementation. In both tested settings the sequential and parallel models yield substantially higher accuracy than the models with larger working dimensionality. The sequential and parallel models are comparable with 36 blocks. The parallel model is better in the case of 48 blocks due to the increased depth of the sequential model.

Latency. On commodity V100 GPUs, we observe a significant speed-up in the case of per-sample processing, with also some gains for small batch sizes with relatively small models, see Table 4. This comparison is based on a simple implementation of our parallel architecture, which is suboptimal due to the lack of a specific CUDA kernel. Overall our measurements suggest specific hardware or kernels are required to obtain compelling benefits in terms of throughput.

Fine-tuning attention is all you need

In this section we focus on fine-tuning ViT models, either to adapt the model to larger image resolutions or to address different downstream classification tasks. In particular, we consider an approach where we only fine-tune the weights corresponding to the MHSA layer, see Figure 4. We analyse the impact in terms of prediction accuracy and savings in processing complexity, peak memory usage and parameter count. As we will see, our choice is significantly better than alternative ones, such as fine-tuning the parameter-heavy FFN layers.

It is common to train networks at a lower resolution and fine-tune them at a higher target resolution. This saves a significant amount of compute at training time, and typically also improves the accuracy of the network at the target resolution. This is because it reduces the discrepancy between the scale of the images seen at train and at test time that is induced by common data augmentation. Fine-tuning is also the paradigm associated with foundation models in general and with the concept of transfer learning itself. A recent line of work explores adaptation of pre-trained models with various types of adapter modules with a small amount of task-specific parameters. In our work, instead, we focus on fine-tuning vanilla ViTs.

Fine-tuning at different resolutions. In Table 5, we report results with fine-tuning ViT-S, ViT-B and ViT-L at 384 $\times$ 384 resolution for models pre-trained at 224 $\times$ 224. Solely fine-tuning the MHSA weights provides results that are within one standard deviation ($\pm{0.1}$) of a full fine-tuning both on ImageNet-val and ImageNet-V2. This is not the case when fine-tuning the FFN layers, even though these contain twice the number of parameters of MHSA. Note that our pre-trained models have been trained long enough (400 epochs) to ensure convergence.
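In practice, attention-only fine-tuning amounts to choosing which parameters receive gradients. The sketch below selects them by name; the parameter names (`blocks.0.attn.qkv.weight` and so on) and the `.attn.` marker are hypothetical and depend on the implementation at hand. Whether other parameters, such as a new classification head, are also trained is left as an explicit option, since it is an assumption of this sketch rather than something specified above.

```python
from typing import Dict, Iterable, Sequence


def select_trainable(param_names: Iterable[str],
                     attention_marker: str = ".attn.",
                     extra_trainable: Sequence[str] = ()) -> Dict[str, bool]:
    """Map each parameter name to True (fine-tuned) or False (frozen)."""
    flags = {}
    for name in param_names:
        is_attention = attention_marker in name
        is_extra = any(name.startswith(prefix) for prefix in extra_trainable)
        flags[name] = is_attention or is_extra
    return flags


# Hypothetical parameter names for a single transformer block and a classifier.
names = [
    "blocks.0.norm1.weight",
    "blocks.0.attn.qkv.weight",
    "blocks.0.attn.proj.weight",
    "blocks.0.mlp.fc1.weight",
    "blocks.0.mlp.fc2.weight",
    "head.weight",
]
for name, trainable in select_trainable(names).items():
    print(f"{name:28s} {'train' if trainable else 'frozen'}")
```

With a framework such as PyTorch, the resulting flags would typically be applied by toggling the gradient requirement of each parameter before building the optimizer.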

There are only advantages to using this approach when fine-tuning at higher resolution as opposed to doing a full fine-tuning, as we get substantial savings in terms of parameters, latency, and peak memory usage for free, see Figure 4 (right panels). First, the fine-tuning stage requires 10% less memory on the GPU, which is especially interesting in the context of high-resolution fine-tuning where the higher-resolution images require more memory. The training is also 10% faster, as fewer gradients are computed. Finally, the attention weights correspond to approximately one third of the weights. Therefore, if one wants to use multiple models fine-tuned for different input resolutions, one saves 66% of the storage for each additional model.
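The storage argument can be verified with exact fractions. The sketch below assumes the standard block layout in which the MHSA holds four $d \times d$ matrices and the FFN two matrices of size $d \times 4d$, and it ignores biases, normalization layers, embeddings and the classifier; the function names are illustrative.

```python
from fractions import Fraction


def attention_fraction(mlp_ratio: int = 4) -> Fraction:
    """Share of block weights held by the MHSA, in units of d*d."""
    mhsa = 4                 # query, key, value and output projections
    ffn = 2 * mlp_ratio      # two linear layers of size d x (mlp_ratio * d)
    return Fraction(mhsa, mhsa + ffn)


def storage(num_models: int, share_ffn: bool, mlp_ratio: int = 4) -> Fraction:
    """Total storage for `num_models` fine-tuned variants, in units of one full model."""
    if not share_ffn:
        return Fraction(num_models)
    return 1 + (num_models - 1) * attention_fraction(mlp_ratio)


print("attention share:", attention_fraction())
print("3 full models:", storage(3, share_ffn=False))
print("3 models sharing FFN weights:", storage(3, share_ffn=True))
```

Each additional resolution-specific model then costs one third of a full model, which corresponds to the saving of roughly 66% stated above.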

Fine-tuning on different datasets. We now evaluate our approach when transferring ViTs pre-trained on ImageNet to different downstream classification tasks by fine-tuning. We consider public benchmarks whose characteristics and references are given in Appendix B.

In Table 6 we report the performance for different fine-tuning strategies. Here we make different observations. First, for the smallest datasets, namely CARS and Flower, fine-tuning only the MHSA layers is an excellent strategy. It is even better than full fine-tuning. Our interpretation is that restricting the number of weights has a regularizing effect. The conclusion is more mixed with the largest datasets, in particular iNaturalist, where we observe a significant gap between the full fine-tuning and our solution for the ViT-S. This could be expected: in this case there are more images to learn from and new classes that were not seen before the fine-tuning stage. Restricting the fine-tuning to the MHSA layers allows modifying only a relatively small number of parameters. FFN layers have twice as many weights and lead to better results in that case. This limitation tends to disappear with the larger ViT-L models, for which the capacity of the MHSA is much larger and therefore sufficient. Our strategy is therefore interesting in the typical use-cases of foundation models, which are very large models that are fine-tuned on a variety of downstream tasks.

Patch preprocessing for BERT-like self-supervised learning

The original ViT paper considered including convolutions instead of patch projection in the network design. Several recent papers advocate including a small pre-processing network in the architecture, instead of a simple patch projection. Most of the pre-processing subnetworks that have been considered are based on convolutions, and are often referred to as “convolutional stems”. Small transformers have also been considered.

While these patch pre-processing designs have been developed to improve accuracy and/or stability, there are some remaining questions regarding their design and flexibility. First, it is not clear which is the most effective when combined with a vanilla transformer. Second, to our knowledge there is no work addressing the problem of their compatibility with self-supervised methods based on patch masking, and in particular with BERT-like auto-encoders such as BeiT.

In this section we try to answer these questions. We compare several existing pre-processing designs in terms of accuracy and compute and evaluate them in combination with BeiT, using the codebase released by the authors of BeiT. The only change we make is to train the tokenizer on ImageNet-1k, rather than using the one from DALL-E used in BeiT, which is trained on a proprietary dataset composed of 250 million images. In this manner, pre-training is based on ImageNet-1k only. This permits reproducible experimentation and fair comparison, and gives equivalent results. Since existing convolutional designs are not satisfactory in combination with masking, we first introduce our own design.

Our hierarchical MLP (hMLP) stem is depicted in Figure 5. All patches are processed independently with linear layers interleaved with non-linearities and renormalization. Its design is guided by our motivation to remove any interaction between the different 16 $\times$ 16 patches during the pre-processing stage. Even if we mask a patch, it does not create any artifacts resulting from the convolution overlapping with other patches, as is the case with existing designs. Therefore, with our hMLP solution, we can equivalently mask the patches before or after the patch-processing stage. Note that, although patches are processed independently, our hMLP-stem is equivalent to a convolutional stem in which the size of the convolutional kernel and its stride are matched, and in practice we implement it with convolutional layers, see our code in Appendix C.

In short, we start from small 2 $\times$ 2 patches, and gradually increase their size until they reach 16 $\times$ 16. Each increase of the patch size is denoted by “patchify” in Figure 5, in the spirit of hierarchical transformer designs like Swin-Transformers. The patches are projected with a linear projection and normalized before we apply a GELU non-linearity. For the normalization, we consider and evaluate two choices: either we use batch-normalization (BN) or layer-normalization (LN). While BN offers better trade-offs, LN is of interest when used with small batch sizes: it works well even with a single image per batch, as often used in object detection.

In contrast with existing stems from the literature, our hMLP design does not significantly increase the compute requirement. For instance, ViT-B requires 17.73 GFLOPS with our design. This adds less than 1% of compute compared to using the usual linear projection stem.
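The independence property of such a stem can be illustrated with a small framework-free sketch. It is not the implementation of Appendix C: it works on a single-channel toy image stored as nested lists, uses random weights, a parameter-free per-token layer normalization, and applies projection, normalization and GELU at every stage. The stage factors (2, 2, 2, 2), the feature sizes and all function names are assumptions chosen only so that the patch size grows from 2 $\times$ 2 to 16 $\times$ 16.

```python
import math
import random


def gelu(v):
    return 0.5 * v * (1.0 + math.erf(v / math.sqrt(2.0)))


def layer_norm(vec, eps=1e-6):
    """Normalize one token vector to zero mean and unit variance (no learned parameters)."""
    mean = sum(vec) / len(vec)
    var = sum((v - mean) ** 2 for v in vec) / len(vec)
    return [(v - mean) / math.sqrt(var + eps) for v in vec]


def patchify(grid, factor):
    """Merge each factor x factor group of neighbouring tokens by concatenation."""
    rows, cols = len(grid) // factor, len(grid[0]) // factor
    return [[[v for di in range(factor) for dj in range(factor)
              for v in grid[i * factor + di][j * factor + dj]]
             for j in range(cols)] for i in range(rows)]


def linear(vec, weight):
    return [sum(w * v for w, v in zip(row, vec)) for row in weight]


def make_stem(factors, dims, seed=0):
    """Build a toy stem; one random linear layer per stage, shared by all tokens."""
    rng = random.Random(seed)
    weights, fan_in = [], 1          # a single input channel is assumed
    for factor, dim in zip(factors, dims):
        fan_in *= factor * factor
        weights.append([[rng.gauss(0.0, fan_in ** -0.5) for _ in range(fan_in)]
                        for _ in range(dim)])
        fan_in = dim

    def stem(image):
        grid = [[[float(pixel)] for pixel in row] for row in image]
        for factor, weight in zip(factors, weights):
            grid = patchify(grid, factor)
            # Every token is processed on its own: projection, normalization, GELU.
            grid = [[[gelu(v) for v in layer_norm(linear(token, weight))]
                     for token in row] for row in grid]
        return grid

    return stem


stem = make_stem(factors=(2, 2, 2, 2), dims=(4, 8, 8, 8))
rng = random.Random(1)
image = [[rng.random() for _ in range(32)] for _ in range(16)]   # two 16x16 patches
masked = [row[:16] + [0.0] * 16 for row in image]                # blank out the right patch
tokens, masked_tokens = stem(image), stem(masked)
print("left token unchanged:", tokens[0][0] == masked_tokens[0][0])
print("right token changed:", tokens[0][1] != masked_tokens[0][1])
```

Because every stage merges non-overlapping groups that stay inside one 16 $\times$ 16 patch, altering one patch leaves the tokens of all other patches untouched, which is the property that makes masking before or after the stem interchangeable.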

Stem comparison in supervised learning. In Table 7 we provide a comparison between different stem designs. We have selected several prototypical designs from the literature for which the code is available online. In addition to our hMLP stem, we have considered some variations over the standard linear projection to evaluate the influence of the non-linearities and normalization. For the standard linear stem, we also consider a ViT-B13 including an extra pair (MHSA, FFN) to allow more direct comparisons with other stems with more FLOPS. In this comparison the most effective existing design is the one of LeViT. The improvements with respect to the linear baseline are significant considering the standard deviation, even when taking into account the extra layer of ViT-B13 to compare with a similar number of FLOPS. Our hMLP stem obtains a comparable performance but with lower complexity, and without any interaction between the 16 $\times$ 16 patches.

Results with BeiT training. We report the results with BeiT, fine-tuned on ImageNet-val, in the right-most column of Table 7. We use the code of BeiT with their training procedure, which includes LayerScale and a relatively elaborate fine-tuning procedure. As one can see, existing stems do not provide any improvement compared to the linear baseline, while adding compute. In contrast, our design is effective and provides an improvement of +0.3/+0.4 top-1 accuracy compared to the baseline, which is significant considering the measurement uncertainty. The interest of hMLP in the context of masked self-supervised learning is clear in Figure 6, where we plot the performance, averaged over 5 seeds for our method, in the supervised case versus the one with BeiT.

Conclusion

In this paper, we looked at three different topics related to Vision Transformers. First, we investigated a simple but effective way to parallelize them, showing a viable alternative to increase capacity without significantly increasing the working dimensionality. Whether this simple parallel design principle can be applied to other architectures is an exploration left for future work. Second, we considered different fine-tuning strategies and showed that fine-tuning the self-attention layer is sufficient in the context of resolution fine-tuning. This can also be interesting when transferring to other downstream classification tasks, especially when fine-tuning large models and/or transferring to a dataset with few training images. Last, we introduced a simple patch pre-processing stem, which processes patches independently across multiple linear layers interleaved with non-linearities and patch aggregation. It is especially useful when combined with mask-based self-supervised learning such as BeiT.

We thank Francisco Massa for valuable discussions and insights about optimizing the implementation of block parallelization.