Vision Transformer Adapter for Dense Predictions

Zhe Chen, Yuchen Duan, Wenhai Wang, Junjun He, Tong Lu, Jifeng Dai, Yu Qiao

Introduction

Recently, transformers have witnessed remarkable success in a broad range of computer vision fields. Benefiting from the dynamic modeling capability and the long-range dependence of the attention mechanism, various vision transformers (Dosovitskiy et al., 2020; Chen et al., 2021; Han et al., 2021; Li et al., 2021c; Wu et al., 2022b) soon rose in many computer vision tasks such as object detection and semantic segmentation, surpassing CNN models and reaching state-of-the-art performance. These models are mainly divided into two families, i.e. the plain ViT (Dosovitskiy et al., 2020; Touvron et al., 2021), and its hierarchical variants (Dong et al., 2021; Liu et al., 2021b; Wang et al., 2021; 2022a). In general, the latter can produce better results and is believed to introduce vision-specific inductive biases into their architectures by using local spatial operations.

Nonetheless, the plain ViT (i.e., vanilla transformer) still has some non-negligible advantages. A typical example lies in multi-modal pre-training (Zhu et al., 2021; 2022; Wang et al., 2022b). Stemming from the natural language processing (NLP) field, the transformer makes no assumptions about its input data. Equipped with different tokenizers, e.g. patch embedding (Dosovitskiy et al., 2020), 3D patch embedding (Liu et al., 2021c), and token embedding (Vaswani et al., 2017), vanilla transformers such as the plain ViT can use massive multi-modal data for pre-training, including image, video, and text, which encourages the model to learn semantically rich representations. However, the plain ViT has clear shortcomings in dense prediction compared to vision-specific transformers. Lacking image-related prior knowledge results in slower convergence and lower performance, and thus plain ViTs struggle to compete with vision-specific transformers (Huang et al., 2021b; Xie et al., 2021; Wang et al., 2022a) on dense prediction tasks. Inspired by the adapters (Houlsby et al., 2019; Stickland & Murray, 2019) in the NLP field, this work aims to develop an adapter to close the performance gap between the plain ViT and vision-specific backbones for dense prediction tasks.

To this end, we propose the Vision Transformer Adapter (ViT-Adapter), which is a pre-training-free additional network that can efficiently adapt the plain ViT to downstream dense prediction tasks without modifying its original architecture. Specifically, to introduce the vision-specific inductive biases into the plain ViT, we design three tailored modules for ViT-Adapter, including (1) a spatial prior module for capturing the local semantics (spatial prior) from input images, (2) a spatial feature injector for incorporating spatial prior into the ViT, and (3) a multi-scale feature extractor to reconstruct the multi-scale features required by dense prediction tasks.

As shown in Figure 1, compared to the previous paradigm that pre-trains on large-scale image datasets (e.g., ImageNet (Deng et al., 2009)) and then fine-tunes on other tasks, our paradigm is more flexible. In our framework, the backbone network is a general-purpose model (e.g., plain ViT) that can be pre-trained with not only images but also multi-modal data. For the transfer learning of dense prediction tasks, we use a randomly initialized adapter to introduce the image-related prior knowledge (inductive biases) into the pre-trained backbone, making the model suitable for these tasks. In this way, using ViT as the backbone, our framework achieves comparable or even better performance than vision-specific transformers such as Swin (Liu et al., 2021b).

$\bullet$ We explore a new paradigm to introduce vision-specific inductive biases into the plain ViT. It helps ViT achieve comparable performance to recent transformer variants (Liu et al., 2021b; Wang et al., 2022a) with regular ImageNet pre-training and further benefits from multi-modal pre-training.

$\bullet$ We design a spatial prior module and two feature interaction operations, to inject the image prior without redesigning the architecture of ViT. They can supplement the missing local information and reorganize fine-grained multi-scale features for dense prediction tasks.

$\bullet$ We evaluate the ViT-Adapter on multiple challenging benchmarks, including COCO (Lin et al., 2014) and ADE20K (Zhou et al., 2017). As shown in Figure 2, our models consistently achieve improved performance compared to prior art under fair pre-training settings. For instance, when using only ImageNet-1K pre-training, ViT-Adapter-B reports 49.6 box AP on COCO val, outperforming Swin-B by 1.0 points. Benefiting from multi-modal pre-training (Peng et al., 2022), our ViT-Adapter-L yields 60.9 box AP, which is the best record on COCO test-dev without training on extra detection data such as Objects365 (Shao et al., 2019).

Related Work

Transformers. In recent years, transformers have dominated various tasks across multiple modalities, such as natural language processing, computer vision, and speech recognition. The vanilla transformer (Vaswani et al., 2017) was initially proposed for machine translation and remains the state-of-the-art architecture for NLP tasks today. ViT (Dosovitskiy et al., 2020) is the first work to generalize the vanilla transformer to the image classification task without much modification. PVT (Wang et al., 2021) and Swin (Liu et al., 2021b) introduce more vision-specific inductive biases by incorporating the pyramid structure from CNNs. Afterward, Conformer (Peng et al., 2021) proposed the first dual network to combine CNN with transformer. Recently, BEiT (Bao et al., 2022) and MAE (He et al., 2021) extended the scope of ViT to self-supervised learning with masked image modeling (MIM), demonstrating the powerful potential of the plain ViT architecture. Many works (Li et al., 2021b; Zhu et al., 2021; 2022; Wang et al., 2022b) have shown that designing vision-specific models is an important direction, but general-purpose architectures (e.g., plain ViT) are more flexible and essential for masked data modeling and multi-modal pre-training. Therefore, we develop a pre-training-free adapter to introduce the image prior without modifying the architecture of ViT, preserving its flexibility and enjoying advanced multi-modal pre-training.

Decoders for ViT. The architecture for dense prediction commonly follows an encoder-decoder pattern, in which the encoder generates rich features and the decoder aggregates and translates them to the final predictions. Recently, inspired by the global receptive fields of ViT, many works employ it as the encoder and design task-specific decoders. SETR (Zheng et al., 2021) is the first work to adopt ViT as the backbone and develop several CNN decoders for semantic segmentation. Segmenter (Strudel et al., 2021) also extends ViT to semantic segmentation, but differs in that it uses a transformer-based decoder. DPT (Ranftl et al., 2021) further applies ViT to the monocular depth estimation task via a CNN decoder and yields remarkable improvements. In summary, these works improve the dense prediction performance of ViT by designing modality- and task-specific decoders, but retain ViT’s weakness of single-scale and low-resolution representation.

Adapters. To date, adapters have been widely used in the NLP field. PALs (Stickland & Murray, 2019) and Adapters (Houlsby et al., 2019) introduce new modules in transformer encoders for task-specific fine-tuning, making the pre-trained model quickly adapt to downstream NLP tasks. In the field of computer vision, some adapters have been proposed for incremental learning (Rosenfeld & Tsotsos, 2018) and domain adaptation (Rebuffi et al., 2017; 2018). With the advent of CLIP (Radford et al., 2021), many CLIP-based adapters (Gao et al., 2021; Sung et al., 2021; Zhang et al., 2021) were presented to transfer pre-trained knowledge to zero-shot or few-shot downstream tasks. Recently, Li et al. (2021b) and ViTDet (Li et al., 2022b) employed some upsampling and downsampling modules to adapt the plain ViT for object detection, as shown in Figure 3(a). However, under regular training settings (i.e., applying ImageNet supervised pre-training and fine-tuning for 36 epochs), their detection performance is still inferior to recent models (Chu et al., 2021b; Dong et al., 2021; Wang et al., 2022a; Wu et al., 2022b) that incorporate image priors well. (In ViTDet, using regular ImageNet-22K pre-training instead of MAE (He et al., 2021) drops 4.0 box AP.) Therefore, it is still challenging to design a powerful dense prediction task adapter for ViT.

Vision Transformer Adapter

As illustrated in Figure 4, our model can be divided into two parts. The first part is the plain ViT (Dosovitskiy et al., 2020) that consists of a patch embedding followed by $L$ transformer encoder layers (see Figure 4(a)). The second part is the proposed ViT-Adapter as shown in Figure 4(b), which contains (1) a spatial prior module to capture spatial features from the input image, (2) a spatial feature injector to inject spatial priors into the ViT, and (3) a multi-scale feature extractor to extract hierarchical features from the single-scale features of ViT.

For the ViT, the input image is first fed into the patch embedding, where the image is divided into $16\times 16$ non-overlapping patches. After that, these patches are flattened and projected to $D$-dimensional tokens, and the feature resolution is reduced to 1/16 of the original image. Then, these tokens, with the position embedding added, are passed through $L$ encoder layers.
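As a small worked example of the tokenization step described above, the following sketch computes the token grid produced by $16\times 16$ non-overlapping patches. The function name `patch_grid` and the requirement that the image sides be divisible by the patch size are illustrative assumptions, not part of the method description.

```python
def patch_grid(height, width, patch_size=16):
    """Return (rows, cols, num_tokens) for non-overlapping square patches."""
    # Assumption for this sketch: sides are exact multiples of the patch size.
    if height % patch_size or width % patch_size:
        raise ValueError("image sides must be divisible by the patch size")
    rows, cols = height // patch_size, width // patch_size
    return rows, cols, rows * cols


# A 224x224 image yields a 14x14 grid, i.e. 196 tokens at 1/16 resolution.
print(patch_grid(224, 224))
```

Each of these tokens is then projected to $D$ dimensions before entering the encoder layers.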

For the ViT-Adapter, we first feed the input image into the spatial prior module. $D$ -dimensional spatial features of three target resolutions (i.e., 1/8, 1/16, and 1/32) will be collected. Then, these feature maps are flattened and concatenated as the input for feature interaction. Specifically, given the number of interactions $N$ (usually $N=4$ ), we evenly split the transformer encoders of ViT into $N$ blocks, each containing $L/N$ encoder layers. For the $i$ -th block, we first inject spatial priors $\mathcal{F}_{\text{sp}}^{i}$ into the block via a spatial feature injector, and then extract hierarchical features from the output of the block by a multi-scale feature extractor. After $N$ feature interactions, we obtain high-quality multi-scale features, and then we split and reshape the features into three target resolutions 1/8, 1/16, and 1/32. Finally, we build the 1/4-scale feature map by upsampling the 1/8-scale feature map using a $2\times 2$ transposed convolution. In this way, we obtain a feature pyramid of similar resolutions to ResNet (He et al., 2016), which can be used in various dense prediction tasks.
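The bookkeeping in this paragraph can be sketched as follows: splitting the $L$ encoder layers into $N$ blocks, counting the tokens at the three target resolutions, and recovering the offsets needed to split the concatenated sequence back into scales. All function names are hypothetical, and the sketch assumes that $L$ is divisible by $N$ and that the image sides are divisible by 32.

```python
def split_into_blocks(num_layers, num_interactions=4):
    """Evenly split encoder layer indices into N blocks of L/N layers each."""
    if num_layers % num_interactions:
        raise ValueError("L must be divisible by N in this sketch")
    size = num_layers // num_interactions
    return [list(range(i * size, (i + 1) * size)) for i in range(num_interactions)]


def pyramid_token_counts(height, width, strides=(8, 16, 32)):
    """Number of tokens contributed by each of the 1/8, 1/16, 1/32 maps."""
    return [(height // s) * (width // s) for s in strides]


def split_offsets(counts):
    """(start, end) offsets that split the concatenated sequence per scale."""
    offsets, start = [], 0
    for count in counts:
        offsets.append((start, start + count))
        start += count
    return offsets


def pyramid_shapes(height, width):
    """Spatial sizes at 1/4, 1/8, 1/16, 1/32; 1/4 is a 2x upsampling of 1/8."""
    h8, w8 = height // 8, width // 8
    return [(2 * h8, 2 * w8), (h8, w8),
            (height // 16, width // 16), (height // 32, width // 32)]


print(split_into_blocks(12))            # four blocks of three layers
counts = pyramid_token_counts(512, 512)
print(counts, split_offsets(counts))
print(pyramid_shapes(512, 512))
```

For a $512\times 512$ input, the three maps contribute 4096, 1024, and 256 tokens, so the concatenated sequence used for feature interaction has 5376 tokens.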

3.2 Spatial Prior Module

Recent studies (Wang et al., 2022a; Wu et al., 2021; Fang et al., 2022; Park & Kim, 2022) show that convolutions can help transformers better capture local spatial information. Inspired by this, we introduce the Spatial Prior Module (SPM). It is designed to model the local spatial contexts of images in parallel with the patch embedding layer, so as not to alter the original architecture of ViT.

3.3 Feature Interaction

Due to weak prior assumptions, the plain ViT suffers from sub-optimal performance on dense prediction tasks compared to vision-specific transformers (Chu et al., 2021a; Dong et al., 2021; Liu et al., 2021b; Wang et al., 2022a). To alleviate this issue, we propose two feature interaction modules to bridge the feature maps of our SPM and the ViT. Specifically, the two modules, namely the Spatial Feature Injector and the Multi-Scale Feature Extractor, are mainly based on cross-attention.

Multi-Scale Feature Extractor. After injecting the spatial priors into the ViT, we obtain the output feature $\mathcal{F}^{i+1}_{\rm vit}$ by passing $\mathcal{\hat{F}}^{i}_{\rm vit}$ through the encoder layers of the $i$ -th block. Then, we apply a module consisting of a cross-attention layer and a feed-forward network (FFN), to extract multi-scale features, as shown in Figure 4(e). This process can be formulated as:
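A plausible form of this formulation, reconstructed only from the description above (a cross-attention layer followed by an FFN), is sketched here. The residual connections, the normalization ${\rm norm}(\cdot)$, the intermediate symbol $\hat{\mathcal{F}}^{i}_{\rm sp}$, and the choice of the spatial features as the query with the ViT output as key and value are assumptions of this sketch:

$$\hat{\mathcal{F}}^{i}_{\rm sp} = \mathcal{F}^{i}_{\rm sp} + {\rm Attention}\big({\rm norm}(\mathcal{F}^{i}_{\rm sp}),\ {\rm norm}(\mathcal{F}^{i+1}_{\rm vit})\big), \qquad \mathcal{F}^{i+1}_{\rm sp} = \hat{\mathcal{F}}^{i}_{\rm sp} + {\rm FFN}\big({\rm norm}(\hat{\mathcal{F}}^{i}_{\rm sp})\big)$$

Under these assumptions, $\mathcal{F}^{i+1}_{\rm sp}$ would serve as the spatial feature input to the next interaction.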

3.4 Architecture Configurations

We build our ViT-Adapter for 4 different sizes of ViT, including ViT-T, ViT-S, ViT-B, and ViT-L. For these models, the parameter numbers of our adapters are 2.5M, 5.8M, 14.0M, and 23.7M, respectively. We employ deformable attention (Zhu et al., 2020) as the default sparse attention in our method, where the number of sampling points is fixed to 4, and the number of attention heads is set to 6, 6, 12, and 16. The number of interactions $N$ is 4, and in the last feature interaction, we stack three multi-scale feature extractors. Besides, we set the FFN ratio in our adapter to 0.25 to save computational overhead, i.e. the hidden sizes of FFN are 48, 96, 192, and 256 for the 4 different adapters. More details of each configuration are shown in Table 10 in Appendix B.
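The FFN hidden sizes quoted above follow directly from the 0.25 ratio if one assumes the standard embedding dimensions of 192, 384, 768, and 1024 for ViT-T/S/B/L. These dimensions are not stated in this section and are an assumption of the sketch below; the per-head dimension is derived from them for illustration only.

```python
# Assumed standard embedding dimensions (not given in the text above).
EMBED_DIMS = {"ViT-T": 192, "ViT-S": 384, "ViT-B": 768, "ViT-L": 1024}
# Attention head counts as listed in the text.
NUM_HEADS = {"ViT-T": 6, "ViT-S": 6, "ViT-B": 12, "ViT-L": 16}
FFN_RATIO = 0.25


def adapter_ffn_hidden(dim, ratio=FFN_RATIO):
    """Hidden size of the adapter FFN for a given embedding dimension."""
    return int(dim * ratio)


hidden_sizes = {name: adapter_ffn_hidden(dim) for name, dim in EMBED_DIMS.items()}
head_dims = {name: EMBED_DIMS[name] // NUM_HEADS[name] for name in EMBED_DIMS}

print(hidden_sizes)  # matches the 48, 96, 192, 256 quoted in the text
print(head_dims)
```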

Experiments

Previous work (Wang et al., 2021) has shown that the pyramid prior is beneficial to dense prediction, but brings little gains to image classification. Therefore, in this study, we focus on how to better adapt readily available pre-trained ViTs to dense prediction tasks. We hope this method will also help decouple the model design of upstream pre-training and downstream fine-tuning.

Settings. Our detection experiments are based on MMDetection (Chen et al., 2019b) and the COCO (Lin et al., 2014) dataset. We use 4 mainstream detectors to evaluate our ViT-Adapter, including Mask R-CNN (He et al., 2017), Cascade Mask R-CNN (Cai & Vasconcelos, 2019), ATSS (Zhang et al., 2020), and GFL (Li et al., 2020). To save time and memory, we refer to (Li et al., 2021b) and modify the $L$ -layer ViT to use 14 $\times$ 14 window attention except for layers spaced at an interval of $L/4$ . Following common practices (Wang et al., 2021), we adopt 1 $\times$ or 3 $\times$ training schedule (i.e., 12 or 36 epochs) with a batch size of 16, and AdamW (Loshchilov & Hutter, 2017) optimizer with an initial learning rate of $1\times 10^{-4}$ and a weight decay of 0.05.
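To make the window-attention layout concrete, the sketch below lists which layers keep global attention when they are spaced at an interval of $L/4$. Placing the global layer at the end of each group of $L/4$ layers (0-indexed) is an assumption of this sketch, as are the function names; the text only fixes the interval. The schedule helper encodes the stated convention that a 1$\times$ schedule is 12 epochs.

```python
def global_attention_layers(num_layers, num_global=4):
    """0-indexed layers that keep global attention; all others use windows.

    Assumption: the global layer is the last one in each group of L/4 layers.
    """
    interval = num_layers // num_global
    return [i for i in range(num_layers) if (i + 1) % interval == 0]


def schedule_epochs(multiplier):
    """Epochs for an 'n x' schedule, where 1x corresponds to 12 epochs."""
    return 12 * multiplier


print(global_attention_layers(12))  # 12-layer ViT, interval 3
print(global_attention_layers(24))  # 24-layer ViT, interval 6
print(schedule_epochs(1), schedule_epochs(3))
```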

Results with ImageNet-1K Pre-training. In Table 1 and Table 2, we apply the DeiT (Touvron et al., 2021) released ImageNet-1K weights (without distillation) as the initialization for all ViT-T/S/B models. We compare our ViT-Adapter with two related approaches (Li et al., 2021b; 2022b) and multiple representative vision-specific backbones (Wang et al., 2021; 2022a; Huang et al., 2021b; Liu et al., 2021b; Yang et al., 2021). As we can see, when using regular training settings for a fair comparison, the detection performance of ViT (Li et al., 2021b) and ViTDet (Li et al., 2022b) is inferior to recent vision-specific models. For example, with Mask R-CNN and 3 $\times$ +MS schedule, ViT-S and ViTDet-S are 3.8 AP ${}^{\text{b}}$ and 3.3 AP ${}^{\text{b}}$ lower than PVTv2-B2 (Wang et al., 2022a), respectively. In contrast, our ViT-Adapter-S outperforms these two approaches by clear margins and is even 0.4 AP ${}^{\text{b}}$ higher than PVTv2-B2. This observation also holds in the experiments with three other detectors, including Cascade Mask R-CNN, ATSS, and GFL. These results indicate that, with only the regular ImageNet-1K pre-training, ViT-Adapter can help the plain ViT attain performance similar or even superior to that of these vision-specific transformers.

Results with ImageNet-22K Pre-training. In Table 1, we employ the ImageNet-22K pre-trained weights from AugReg (Steiner et al., 2021) to initialize all ViT-L models, including ViT (Li et al., 2021b), ViTDet (Li et al., 2022b), and our ViT-Adapter. It can be seen that, when training Mask R-CNN with 3 $\times$ +MS schedule, our ViT-Adapter-L† brings 3.8 AP ${}^{\text{b}}$ and 3.0 AP ${}^{\text{b}}$ improvements over ViT-L† (Li et al., 2021b) and ViTDet-L† (Li et al., 2022b), respectively.

Results with Multi-Modal Pre-training. In this experiment, we study the effect of multi-modal pre-training. Specifically, we fine-tune the ViT-Adapter-B with Mask R-CNN for the 3 $\times$ +MS schedule using different pre-trained weights. As shown in Table 4, simply replacing the ImageNet-22K pre-training (Steiner et al., 2021) with the multi-modal pre-training (Zhu et al., 2021) gives us a significant gain of 0.7 AP ${}^{\text{b}}$ and AP ${}^{\text{m}}$ . These results indicate that our method can easily derive considerable benefits from advanced multi-modal pre-training, which is difficult for vision-specific models like Swin.

4.2 Semantic Segmentation

Settings. We evaluate our ViT-Adapter on semantic segmentation with the ADE20K (Zhou et al., 2017) dataset and MMSegmentation (Contributors, 2020) codebase. Both Semantic FPN (Kirillov et al., 2019) and UperNet (Xiao et al., 2018) are employed as the basic frameworks. For Semantic FPN, we apply the settings of PVT (Wang et al., 2021) and train the models for 80k iterations. For UperNet, we follow the settings of Swin (Liu et al., 2021b) to train it for 160k iterations.

Results with ImageNet-1K Pre-training. In Table 3, we report the semantic segmentation results in terms of single-scale and multi-scale (MS) mIoU. As in Section 4.1, we initialize all ViT-T/S/B models with the DeiT (Touvron et al., 2021) released ImageNet-1K weights. The results show that, under comparable model sizes, our method surpasses the ViT (Li et al., 2021b) and many representative vision-specific transformers (Wang et al., 2021; 2022a; Liu et al., 2021b; Chu et al., 2021a). For instance, our ViT-Adapter-S achieves 47.1 MS mIoU with UperNet, outperforming many strong counterparts such as Swin-T. Similarly, ViT-Adapter-B reports a competitive performance of 49.7 MS mIoU, which is 2.6 points higher than ViT-B and on par with Swin-B and Twins-SVT-L. These fair comparisons using only regular ImageNet-1K pre-training (Touvron et al., 2021) demonstrate the effectiveness and universality of our ViT-Adapter.

Results with ImageNet-22K Pre-training. When using the ImageNet-22K pre-trained weights (Steiner et al., 2021), our ViT-Adapter-B† attains 51.9 mIoU and 52.5 MS mIoU with UperNet, exceeding Swin-B† by at least 0.8 mIoU. Similarly, ViT-Adapter-L† yields 53.4 mIoU and 54.4 MS mIoU, which stands out among counterparts such as Swin-L†. These significant and consistent improvements over different model sizes suggest that our method can compensate for the shortcomings of the plain ViT, making it more suitable for semantic segmentation.

Results with Multi-Modal Pre-training. Here, we apply the multi-modal pre-trained weights from Uni-Perceiver (Zhu et al., 2021) for semantic segmentation. As shown in Table 3, for Semantic FPN and UperNet, replacing the ImageNet-22K pre-training with multi-modal pre-training benefits our ViT-Adapter-L★ with impressive gains of 1.3 mIoU and 1.6 mIoU, respectively.

4.3 Comparisons with State-of-the-Arts

Settings. We conduct experiments to combine our ViT-Adapter with state-of-the-art detection/segmentation frameworks, including HTC++ (Liu et al., 2021b) (without extra detection dataset) and Mask2Former (Cheng et al., 2021), and recent multi-modal pre-training BEiTv2 (Peng et al., 2022). The experimental settings are listed in Appendix A.1 and A.2.

Results. As shown in Table 5, our method reaches state-of-the-art performance. While these results may be partly due to the effectiveness of advanced pre-training, our study demonstrates that plain backbone detectors/segmenters can challenge the entrenched position of hierarchical backbones.

4.4 Ablation Study

ViT vs. ViT-Adapter Feature. Recent works (Park & Kim, 2022; Si et al., 2022) show that ViT presents the characteristics of learning low-frequency global signals, while CNN tends to extract high-frequency information (e.g., local edges and textures). To show the difference between the features of ViT and ViT-Adapter, we first use Fourier analysis as a toolkit for visualization. As shown in Figure 5(a), the Fourier spectrum and relative log amplitudes of the Fourier transformed feature maps (average over 100 images) indicate that ViT-Adapter captures more high-frequency signals than the ViT (Li et al., 2021b) baseline. In addition, we also visualize the stride-8 feature map in Figure 5 (b)(c), which shows that the features of ViT are blurry and coarse. In contrast, our features are more fine-grained and have more local edges and textures. This observation demonstrates that our method grafts the merit of CNN for capturing high-frequency information to ViT.
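As a toy illustration of the kind of frequency comparison described above, the sketch below measures how much spectral amplitude of a small 2D map lies at high frequencies. It is a simplified proxy, not the relative log amplitude analysis used for Figure 5: the naive DFT, the quarter-cycle cutoff, the function names, and the two synthetic maps (one smooth, one checkerboard) are all assumptions made for illustration.

```python
import cmath
import math


def dft2(grid):
    """Naive 2D discrete Fourier transform of a small list-of-lists map."""
    rows, cols = len(grid), len(grid[0])
    spectrum = [[0j] * cols for _ in range(rows)]
    for u in range(rows):
        for v in range(cols):
            total = 0j
            for x in range(rows):
                for y in range(cols):
                    angle = -2.0 * math.pi * (u * x / rows + v * y / cols)
                    total += grid[x][y] * cmath.exp(1j * angle)
            spectrum[u][v] = total
    return spectrum


def high_frequency_share(grid):
    """Share of spectral amplitude above 1/4 cycle per sample on either axis."""
    rows, cols = len(grid), len(grid[0])
    spectrum = dft2(grid)
    low = high = 0.0
    for u in range(rows):
        for v in range(cols):
            # Fold indices so frequency is measured as distance from DC.
            fu, fv = min(u, rows - u), min(v, cols - v)
            amplitude = abs(spectrum[u][v])
            if fu <= rows // 4 and fv <= cols // 4:
                low += amplitude
            else:
                high += amplitude
    return high / (low + high)


size = 8
# Smooth map: one slow cosine along the rows (low-frequency content only).
smooth = [[math.cos(2 * math.pi * x / size) for _ in range(size)] for x in range(size)]
# Checkerboard: alternates every pixel (content at the highest frequency).
edgy = [[float((x + y) % 2) for y in range(size)] for x in range(size)]

print(high_frequency_share(smooth), high_frequency_share(edgy))
```

In this toy setting the checkerboard map, which mimics sharp local edges, carries a much larger high-frequency share than the smooth map.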

Ablation for Components. To investigate the contribution of each key design, we gradually extend the ViT-S baseline (Li et al., 2021b) to our ViT-Adapter-S. All models are trained with Mask R-CNN for 1 $\times$ schedule. As shown in the left side of Table 6, by directly resizing and adding the spatial features from SPM, our variant 1 improves 1.4 AP ${}^{\text{b}}$ and 0.9 AP ${}^{\text{m}}$ over the baseline, showing that local spatial information is essential for dense prediction. From variant 2, we find that the spatial feature injector further boosts the performance by 1.0 AP ${}^{\text{b}}$ and 0.8 AP ${}^{\text{m}}$ . This observation illustrates that cross-attention is a more flexible way to inject spatial features. Moreover, we employ the multi-scale feature extractor to reconstruct hierarchical features, which brings 2.1 AP ${}^{\text{b}}$ and 1.1 AP ${}^{\text{m}}$ gains, alleviating ViT’s drawback of single-scale features. In summary, our proposed components are each necessary and collectively create 4.5 AP ${}^{\text{b}}$ and 2.8 AP ${}^{\text{m}}$ improvements.
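The per-component gains quoted above can be checked against the reported totals with a short calculation. The variable names are illustrative; the numbers are those given in the paragraph.

```python
# Gains from (1) adding SPM features, (2) the injector, (3) the extractor.
box_gains = [1.4, 1.0, 2.1]
mask_gains = [0.9, 0.8, 1.1]

total_box = round(sum(box_gains), 1)
total_mask = round(sum(mask_gains), 1)

print(total_box, total_mask)  # the reported overall improvements
```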

Number of Interactions. In the right side of Table 6, we study the effect of the number of interactions. Specifically, we build several ViT-Adapter-S variants with different numbers of interactions. We observe that the model accuracy saturates as $N$ grows, and adding more interactions does not monotonically improve performance. Therefore, we empirically set $N$ to 4 by default.

Attention Type. Our method is a general framework in which the attention mechanism is replaceable. To verify this, we adopt ViT-Adapter-S as the basic model and study 4 different attention mechanisms. As shown in Table 7, sparse attention with linear complexity is more suitable for our adapter than global attention with quadratic complexity. We ended up using deformable attention (Zhu et al., 2020) as the default configuration. Notably, it can be replaced by other more advanced attention mechanisms in the future to further boost performance.

Conclusion

This work explores a new paradigm, namely ViT-Adapter, to bridge the performance gap between the plain ViT and vision-specific transformers on dense prediction tasks. Without modifying the inherent architecture, we flexibly inject image-related inductive biases into the ViT and reconstruct fine-grained multi-scale features required by dense predictions. Extensive experiments on object detection, instance segmentation, and semantic segmentation show that our method can achieve comparable or even better performance than well-designed vision-specific transformers, and further derive considerable benefits from advanced multi-modal pre-training.

Acknowledgement

This work is partly supported by the National Natural Science Foundation of China (Grant No. 61672273, 61832008), and the Shanghai Committee of Science and Technology (Grant No. 21DZ1100100).

References

Appendix A Comparison with Previous State-of-the-Arts

In recent years, the state-of-the-art models on dense prediction benchmarks are primarily vision-specific transformers, such as Swin (Liu et al., 2021b), Focal (Yang et al., 2021), MViTv2 (Li et al., 2021a), and SwinV2 (Liu et al., 2021a), while the plain ViT is rarely found. Nevertheless, we argue that the plain ViT still has the potential to reach the leading performance by leveraging our ViT-Adapter. To verify this, we conduct extensive additional experiments as follows.

Settings. Following prior art (Li et al., 2021b), we modify the 24-layer ViT-L to use 14 $\times$ 14 window attention except for layers spaced at an interval of 6, to save training time and memory. The state-of-the-art detector HTC++ (Liu et al., 2021b) is employed for our experiments. Specifically, we rescale the shorter side of images between 400 and 1400, while the longer side is at most 1600. Instaboost (Fang et al., 2019), Soft-NMS (Bodla et al., 2017), AdamW (Loshchilov & Hutter, 2017) optimizer (batch size of 16, initial learning rate of $1\times 10^{-4}$, and weight decay of 0.05), and 3 $\times$ schedule are adopted during training. We use a layer-wise learning rate decay of 0.9, and a drop path rate of 0.4. For a fairer comparison, here we take two initialization strategies, i.e. regular ImageNet-22K pre-training, and more advanced self-supervised or multi-modal pre-training.
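A minimal sketch of layer-wise learning rate decay with the values above (base rate $1\times 10^{-4}$, decay 0.9, 24 layers) is given here. The exact indexing convention is an assumption: in this sketch the deepest encoder layer uses the base learning rate and each shallower layer is scaled by a further factor of 0.9. Actual codebases may offset the exponent differently, for example to account for the embedding layer or the task head.

```python
def layerwise_lrs(base_lr, num_layers, decay=0.9):
    """Per-layer learning rates; index 0 is the shallowest encoder layer.

    Assumed convention: the deepest layer keeps base_lr, and each earlier
    layer is multiplied by one more factor of `decay`.
    """
    return [base_lr * decay ** (num_layers - 1 - i) for i in range(num_layers)]


lrs = layerwise_lrs(1e-4, 24)
print(lrs[0], lrs[-1])  # shallowest vs. deepest layer
```

Under this convention the shallowest layers of the pre-trained ViT-L are updated with a learning rate roughly an order of magnitude smaller than the deepest ones.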

Results with ImageNet-22K Pre-training. As shown in Table 8, with the ImageNet-22K supervised pre-training from AugReg (Steiner et al., 2021), our ViT-Adapter-L reports 58.4 AP ${}^{\text{b}}$ and 50.7 AP ${}^{\text{m}}$ on the COCO test-dev, which is comparable to many vision-specific transformers such as Swin-L (58.4 AP ${}^{\text{b}}$ vs. 58.7 AP ${}^{\text{b}}$) and Focal-L (58.4 AP ${}^{\text{b}}$ vs. 58.4 AP ${}^{\text{b}}$). This fair comparison illustrates that our ViT-Adapter significantly narrows the performance gap between the plain ViT and well-designed vision-specific models.

Results with More Advanced Pre-training. Since our paradigm retains the flexibility of the plain ViT, it can easily derive significant benefits from advanced pre-training techniques, such as multi-modal pre-training (Zhu et al., 2021; 2022) or self-supervised pre-training (Bao et al., 2022; Peng et al., 2022; He et al., 2021), or a combination of both (Wang et al., 2022b). Here, we take the readily available weights from BEiT (Bao et al., 2022) and BEiTv2 (Peng et al., 2022) as examples. Because BEiT uses learnable relative position biases instead of absolute position embeddings, we replace the remaining global attention (see the settings part) with 56 $\times$ 56 window attention as an approximation. For these layers, the relative position biases need to be interpolated to adapt to the new window size.
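To give a sense of what this interpolation involves, the sketch below counts the entries of a relative position bias table for a square window, assuming the common layout with $(2W-1)\times(2W-1)$ distinct offsets for a $W\times W$ window, and resamples a 1D profile with simple linear interpolation. The 14 $\times$ 14 source window, the 1D linear scheme, and the function names are illustrative assumptions; this is not the interpolation routine used in the experiments.

```python
def bias_table_side(window):
    """Number of distinct relative offsets along one axis of a square window."""
    return 2 * window - 1


def bias_table_entries(window):
    """Entries in a (2W-1) x (2W-1) relative position bias table."""
    side = bias_table_side(window)
    return side * side


def interpolate_1d(values, new_length):
    """Linearly resample a 1D list to new_length points, endpoints aligned."""
    if new_length == 1:
        return [values[0]]
    scale = (len(values) - 1) / (new_length - 1)
    resampled = []
    for i in range(new_length):
        pos = i * scale
        left = int(pos)
        right = min(left + 1, len(values) - 1)
        frac = pos - left
        resampled.append(values[left] * (1 - frac) + values[right] * frac)
    return resampled


# Assumed source window of 14 and target window of 56 (per axis offsets).
print(bias_table_entries(14), bias_table_entries(56))
print(interpolate_1d([0.0, 1.0, 2.0], 5))
```

Applying such a resampling along both axes of the table, per attention head, is one simple way to adapt biases learned for a smaller window to the 56 $\times$ 56 window.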

As reported in Table 8, our ViT-Adapter-L (w/ BEiT) achieves 60.4 AP ${}^{\text{b}}$ and 52.5 AP ${}^{\text{m}}$ on the COCO test-dev, and ViT-Adapter-L (w/ BEiTv2) further raises this record to 60.9 AP ${}^{\text{b}}$ and 53.0 AP ${}^{\text{m}}$. Notably, although it is not a perfectly controlled comparison, our method attains similar performance with fewer training epochs (36 vs. 100) than ViTDet (Li et al., 2022b). We argue that a longer training schedule such as 100 epochs may bring an added bonus, but it is expensive to afford due to limited computing resources. In summary, from a system-level perspective, our ViT-Adapter can enjoy the dividends of various advanced pre-training techniques and help the plain ViT achieve leading performance on the object detection and instance segmentation tasks.

A.2 Semantic Segmentation

Settings. For semantic segmentation, we employ the AdamW optimizer with an initial learning rate of $2\times 10^{-5}$, a batch size of 16, and a weight decay of 0.05. Layer-wise learning rate decay of 0.9 and drop path rate of 0.4 are used to train the models. Other training settings, such as pre-training techniques, crop size, and the number of iterations, are listed in Table 9.

Results with ImageNet-22K Pre-training. As shown in Table 9, using Mask2Former (Cheng et al., 2021) as the segmenter, our ViT-Adapter-L achieves 56.8 mIoU and 57.7 MS mIoU on the ADE20K val, which is comparable to recent vision-specific models, such as Swin-L (Liu et al., 2021b), Swin-L-FaPN (Huang et al., 2021a), SeMask-Swin-L (Jain et al., 2021), and HorNet-L (Rao et al., 2022).

Results with More Advanced Pre-training. As can be seen from Table 9, when training with UperNet for 160k iterations, our ViT-Adapter-L (w/ BEiT) yields 58.4 MS mIoU, outperforming BEiT-L by 1.4 points with only 10M additional parameters. This shows that our adapter can deliver significant benefits even for a powerful self-supervised pre-trained ViT.

Furthermore, we compare the performance of our method with vision-specific models that also use additional datasets. For example, SwinV2-G (Liu et al., 2021a) uses a privately collected ImageNet-22K-ext-70M dataset that contains 70 million images. Mask DINO (Li et al., 2022a) takes the detection pre-training on large-scale Objects365 (Shao et al., 2019) dataset as the initialization for segmentation. Due to limited computing resources, we explore a simple and affordable transfer learning strategy for semantic segmentation. Specifically, we use the COCO-Stuff (Caesar et al., 2018) dataset for 80k iterations of pre-training, and then ADE20K for 80k iterations of fine-tuning. The total number of iterations is still 160k, and no additional training overhead is added.

Under this setting, our ViT-Adapter-L (w/ BEiT) produces a score of 60.5 MS mIoU. Further, ViT-Adapter-L (w/ BEiTv2) sets a new record of 61.5 MS mIoU, which is slightly better than FD-SwinV2-G (Wei et al., 2022), while the parameter number is much smaller (571M vs. 3.0B). It is worth noting that our ViT-Adapter is also adopted by the recently proposed BEiT-3 (Wang et al., 2022b), which is a ViT-style foundation model that can be pre-trained with multi-modal data. As described in their paper, using ViT-Adapter for the transfer learning of semantic segmentation, BEiT-3 establishes a new state-of-the-art of 62.8 MS mIoU on ADE20K val, which is a convincing verification of the paradigm we present in Figure 1.

Appendix B Additional Ablation and Discussion

Architecture Configurations. The more detailed configurations are listed in Table 10.

TIDE Error Type Analysis. TIDE (Bolya et al., 2020) is a toolbox for analyzing the sources of error in object detection algorithms. Following (Li et al., 2021b), we show the error type analysis in Figure 6. For fair comparison, the models listed in Table 1 are adopted for analysis. These results reveal where our ViT-Adapter improves overall AP ${}^{\text{b}}$ relative to the ViT baseline (Li et al., 2021b). For instance, we observe that our adapter helps reduce missed and localization errors, and has a substantial effect on fixing false negative and positive errors.

Feature Visualization. We plot more visualization of feature maps produced by ViT-B (Li et al., 2021b) and our ViT-Adapter-B in Figure 7 and Figure 8, which are trained based on Mask R-CNN for detection and UperNet for segmentation, respectively. As can be seen, the features of ViT-B are blurry and coarse, while our features are more refined and have more local edges and textures. This observation also accords with the Fourier analysis in Section 4.4, which demonstrates that ViT has the characteristics of capturing low-frequency information, and our ViT-Adapter can supplement the missing high-frequency signals.

Comparison with SETR. Like ViTDet (Li et al., 2022b), SETR (Zheng et al., 2021) also changes the shape of features of ViT according to the task prior (see Figure 3(a)), thus allowing ViT to achieve better segmentation performance. Although this paradigm shares some similarities with our approach, e.g. combining ViT and convolutions, they have three main differences: (1) In addition to the task prior, our method also takes the information of the input image (the input prior) into consideration when adapting ViT to dense prediction tasks; (2) The input prior will constantly interact with ViT’s features, making the output features more suitable for dense prediction tasks; (3) Our method is an adapter that is general in both detection and segmentation tasks, and moreover achieves better results than segmentation-specific head SETR (Zheng et al., 2021).

Comparison with other Adapters. We would like to clarify the differences between ViT-Adapter and other adapters (Jia et al., 2022; Bahng et al., 2022; Chen et al., 2022; Zhang et al., 2022; Jie & Deng, 2022) for ViTs, from two aspects as follows:

(1) Different tasks. Our method is designed for dense prediction tasks, while VPT (Jia et al., 2022), Visual Prompt (Bahng et al., 2022), AdaptFormer (Chen et al., 2022), NOAH (Zhang et al., 2022), and Convpass (Jie & Deng, 2022) are mainly proposed for classification tasks. By training the parameters only in input spaces, or some modules attached to the backbone, or their combination, these models perform well on classification and even obtain better results than full-tuning models.

However, when applying these methods (Jia et al., 2022; Bahng et al., 2022; Chen et al., 2022; Zhang et al., 2022; Jie & Deng, 2022) to dense prediction tasks, they perform below expectations. For example, we see from Table 11 that the performance of VPT (Jia et al., 2022) has a large gap with the baseline ViT-L (Zheng et al., 2021) on ADE20K.

(2) Different targets. These mentioned adapters (Jia et al., 2022; Bahng et al., 2022; Chen et al., 2022; Zhang et al., 2022; Jie & Deng, 2022) aim to explore parameter-efficient transfer learning, while the goal of our ViT-Adapter is to push the performance boundaries of plain ViT downstream applications, make ViT more general for downstream tasks, and efficiently utilize large-scale ViT weights pre-trained in different ways. We argue that these two technical lines are orthogonal, as shown in the last column in Table 11. Combining ViT-Adapter with these adapters to achieve efficient and accurate transfer learning of dense prediction is a research topic worth exploring.

ViTDet’s Performance. The higher performance of the original ViTDet (Li et al., 2022b) comes from stronger training settings. Specifically, ViTDet adopts a more expensive training strategy than ours, i.e., loading the MAE (He et al., 2021) pre-trained weights, and using the Large Scale Jitter (Ghiasi et al., 2021) augmentation to train the model for 100 epochs. This setting leads to almost 3 times the training cost compared to the commonly used 36 epochs (i.e., $3\times$ +MS schedule). To some extent, it also reveals that the lack of image-related inductive biases in ViT leads to slow convergence on dense prediction tasks.

For fair comparisons, we benchmark all plain ViT detectors, including ViTDet (Li et al., 2022b) and our ViT-Adapter under the commonly used $3\times$ +MS training schedule, and use the same ImageNet-1K pre-trained weights (i.e., DeiT) as initialization. It makes sense that our ViT-Adapter achieves better performance than ViTDet under this setting, because our adapter injects image-related prior into the plain ViT, which can speed up convergence and improve performance.