Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions

Wenhai Wang, Enze Xie, Xiang Li, Deng-Ping Fan, Kaitao Song, Ding Liang, Tong Lu, Ping Luo, Ling Shao

Introduction

Convolutional neural networks (CNNs) have achieved remarkable success in computer vision, making them a versatile and dominant approach for almost all tasks. Nevertheless, this work aims to explore an alternative backbone network beyond CNNs, which can be used for dense prediction tasks such as object detection, semantic and instance segmentation, in addition to image classification.

Inspired by the success of Transformer in natural language processing, many researchers have explored its application in computer vision. For example, some works model the vision task as a dictionary lookup problem with learnable queries, and use the Transformer decoder as a task-specific head on top of the CNN backbone. Although some prior arts have also incorporated attention modules into CNNs, as far as we know, exploring a clean and convolution-free Transformer backbone to address dense prediction tasks in computer vision is rarely studied.

Recently, Dosovitskiy et al. introduced the Vision Transformer (ViT) for image classification. This is an interesting and meaningful attempt to replace the CNN backbone with a convolution-free model. As shown in Figure 1 (b), ViT has a columnar structure with coarse image patches as input. Due to resource constraints, ViT cannot use fine-grained image patches (e.g., 4 $\times$ 4 pixels per patch) as input and instead only receives coarse patches (e.g., 32 $\times$ 32 pixels per patch), which leads to its low output resolution (e.g., 32-stride). Although ViT is applicable to image classification, it is challenging to directly adapt it to pixel-level dense predictions such as object detection and segmentation, because (1) its output feature map is single-scale and low-resolution, and (2) its computational and memory costs are relatively high even for common input image sizes (e.g., shorter edge of 800 pixels in the COCO benchmark).

To address the above limitations, this work proposes a pure Transformer backbone, termed Pyramid Vision Transformer (PVT), which can serve as an alternative to the CNN backbone in many downstream tasks, including image-level prediction as well as pixel-level dense predictions. Specifically, as illustrated in Figure 1 (c), our PVT overcomes the difficulties of the conventional Transformer by (1) taking fine-grained image patches (i.e., 4 $\times$ 4 pixels per patch) as input to learn high-resolution representation, which is essential for dense prediction tasks; (2) introducing a progressive shrinking pyramid to reduce the sequence length of Transformer as the network deepens, significantly reducing the computational cost, and (3) adopting a spatial-reduction attention (SRA) layer to further reduce the resource consumption when learning high-resolution features.

Overall, the proposed PVT possesses the following merits. Firstly, compared to the traditional CNN backbones (see Figure 1 (a)), which have local receptive fields that increase with the network depth, our PVT always produces a global receptive field, which is more suitable for detection and segmentation. Secondly, compared to ViT (see Figure 1 (b)), thanks to its advanced pyramid structure, our method can more easily be plugged into many representative dense prediction pipelines, e.g., RetinaNet and Mask R-CNN. Thirdly, we can build a convolution-free pipeline by combining our PVT with other task-specific Transformer decoders, such as PVT+DETR for object detection. To our knowledge, this is the first entirely convolution-free object detection pipeline.

Our main contributions are as follows:

(1) We propose Pyramid Vision Transformer (PVT), which is the first pure Transformer backbone designed for various pixel-level dense prediction tasks. Combining our PVT and DETR, we can construct an end-to-end object detection system without convolutions and handcrafted components such as dense anchors and non-maximum suppression (NMS).

(2) We overcome many difficulties when porting Transformer to dense predictions by designing a progressive shrinking pyramid and a spatial-reduction attention (SRA). These reduce the resource consumption of Transformer, making PVT flexible enough to learn multi-scale and high-resolution features.

(3) We evaluate the proposed PVT on several different tasks, including image classification, object detection, instance and semantic segmentation, and compare it with popular ResNets and ResNeXts. As presented in Figure 2, our PVT with different parameter scales consistently achieves improved performance compared to the prior arts. For example, under a comparable number of parameters, using RetinaNet for object detection, PVT-Small achieves 40.4 AP on COCO val2017, outperforming ResNet50 by 4.1 points (40.4 vs. 36.3). Moreover, PVT-Large achieves 42.6 AP, which is 1.6 points better than ResNeXt101-64x4d, with 30% fewer parameters.

Related Work

CNNs are the work-horses of deep neural networks in visual recognition. The standard CNN was first introduced to distinguish handwritten numbers. The model contains convolutional kernels with a certain receptive field that captures favorable visual context. To provide translation equivariance, the weights of convolutional kernels are shared over the entire image space. More recently, with the rapid development of computational resources (e.g., GPUs), the successful training of stacked convolutional blocks on large-scale image classification datasets (e.g., ImageNet) has become possible. For instance, GoogLeNet demonstrated that a convolutional operator containing multiple kernel paths can achieve very competitive performance. The effectiveness of a multi-path convolutional block was further validated in the Inception series, ResNeXt, DPN, MixNet, and SKNet. Further, ResNet introduced skip connections into the convolutional block, making it possible to create and train very deep networks and obtain impressive results in the field of computer vision. DenseNet introduced a densely connected topology, which connects each convolutional block to all previous blocks. More recent advances can be found in recent survey/review papers.

Unlike the full-blown CNNs, the vision Transformer backbone is still in its early stage of development. In this work, we try to extend the scope of Vision Transformer by designing a new versatile Transformer backbone suitable for most vision tasks.

2 Dense Prediction Tasks

Preliminary. The dense prediction task aims to perform pixel-level classification or regression on a feature map. Object detection and semantic segmentation are two representative dense prediction tasks.

Object Detection. In the era of deep learning, CNNs have become the dominant framework for object detection, which includes single-stage detectors (e.g., SSD, RetinaNet, FCOS, GFL, PolarMask, and OneNet) and multi-stage detectors (Faster R-CNN, Mask R-CNN, Cascade R-CNN, and Sparse R-CNN). Most of these popular object detectors are built on high-resolution or multi-scale feature maps to obtain good detection performance. Recently, DETR and deformable DETR combined the CNN backbone and the Transformer decoder to build an end-to-end object detector. Likewise, they also require high-resolution or multi-scale feature maps for accurate object detection.

Semantic Segmentation. CNNs also play an important role in semantic segmentation. In the early stages, FCN introduced a fully convolutional architecture to generate a spatial segmentation map for a given image of any size. After that, the deconvolution operation was introduced by Noh et al. and achieved impressive performance on the PASCAL VOC 2012 dataset. Inspired by FCN, U-Net was proposed specifically for the medical image segmentation domain, bridging the information flow between corresponding low-level and high-level feature maps of the same spatial sizes. To explore richer global context representation, Zhao et al. designed a pyramid pooling module over various pooling scales, and Kirillov et al. developed a lightweight segmentation head termed Semantic FPN, based on FPN. Finally, the DeepLab family applies dilated convolutions to enlarge the receptive field while maintaining the feature map resolution. Similar to object detection methods, semantic segmentation models also rely on high-resolution or multi-scale feature maps.

3 Self-Attention and Transformer in Vision

As convolutional filter weights are usually fixed after training, they cannot be dynamically adapted to different inputs. Many methods have been proposed to alleviate this problem using dynamic filters or self-attention operations. The non-local block attempts to model long-range dependencies in both space and time, which has been shown to be beneficial for accurate video classification. However, despite its success, the non-local operator suffers from high computational and memory costs. Criss-cross further reduces the complexity by generating sparse attention maps through a criss-cross path. Ramachandran et al. proposed stand-alone self-attention to replace convolutional layers with local self-attention units. AANet achieves competitive results when combining self-attention and convolutional operations. LambdaNetworks uses the lambda layer, an efficient form of self-attention, to replace the convolution in the CNN. DETR utilizes the Transformer decoder to model object detection as an end-to-end dictionary lookup problem with learnable queries, successfully removing the need for handcrafted processes such as NMS. Based on DETR, deformable DETR further adopts a deformable attention layer to focus on a sparse set of contextual elements, obtaining faster convergence and better performance. Recently, Vision Transformer (ViT) employed a pure Transformer model for image classification by treating an image as a sequence of patches. DeiT further extends ViT using a novel distillation approach. Different from previous models, this work introduces the pyramid structure into Transformer to present a pure Transformer backbone for dense prediction tasks, rather than a task-specific head or an image classification model.

Pyramid Vision Transformer (PVT)

Our goal is to introduce the pyramid structure into the Transformer framework, so that it can generate multi-scale feature maps for dense prediction tasks (e.g., object detection and semantic segmentation). An overview of PVT is depicted in Figure 3. Similar to CNN backbones, our method has four stages that generate feature maps of different scales. All stages share a similar architecture, which consists of a patch embedding layer and $L_{i}$ Transformer encoder layers.

In the first stage, given an input image of size $H\!\times\!W\!\times\!3$, we first divide it into $\frac{HW}{4^{2}}$ patches, each of size $4\!\times\!4\!\times\!3$. (As done for ResNet, we keep the highest resolution of our output feature map at 4-stride.) Then, we feed the flattened patches to a linear projection and obtain embedded patches of size $\frac{HW}{4^{2}}\!\times\!C_{1}$. After that, the embedded patches along with a position embedding are passed through a Transformer encoder with $L_{1}$ layers, and the output is reshaped to a feature map $F_{1}$ of size $\frac{H}{4}\!\times\!\frac{W}{4}\!\times\!C_{1}$. In the same way, using the feature map from the previous stage as input, we obtain the following feature maps: $F_{2}$, $F_{3}$, and $F_{4}$, whose strides are 8, 16, and 32 pixels with respect to the input image. With the feature pyramid $\{F_{1},F_{2},F_{3},F_{4}\}$, our method can be easily applied to most downstream tasks, including image classification, object detection, and semantic segmentation.
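The first-stage patch embedding described above can be sketched in NumPy as follows. This is a minimal illustration, not the authors' implementation: the function name `patch_embed`, the random weights, the input size 224, and the value `C1 = 64` are all assumptions chosen for the example.

```python
import numpy as np

def patch_embed(image, patch_size, weight, bias):
    """Split an H x W x C image into non-overlapping patches and project each one."""
    h, w, c = image.shape
    p = patch_size
    assert h % p == 0 and w % p == 0, "H and W must be divisible by the patch size"
    # (H, W, C) -> (H/p, p, W/p, p, C) -> (H/p, W/p, p, p, C)
    patches = image.reshape(h // p, p, w // p, p, c).transpose(0, 2, 1, 3, 4)
    # Flatten every p x p x C patch into one row: (HW/p^2, p*p*C)
    flat = patches.reshape((h // p) * (w // p), p * p * c)
    # Linear projection to the embedding dimension: (HW/p^2, C_out)
    return flat @ weight + bias

rng = np.random.default_rng(0)
H, W, C1 = 224, 224, 64          # illustrative sizes (assumptions)
image = rng.standard_normal((H, W, 3))
weight = rng.standard_normal((4 * 4 * 3, C1)) * 0.02
bias = np.zeros(C1)

tokens = patch_embed(image, 4, weight, bias)       # HW/4^2 tokens of width C1
feature_map = tokens.reshape(H // 4, W // 4, C1)   # layout of F_1 (encoder omitted)
```

The Transformer encoder and the position embedding are omitted here; the sketch only shows how the token count $\frac{HW}{4^{2}}$ and the $\frac{H}{4}\times\frac{W}{4}\times C_{1}$ layout of $F_{1}$ arise.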

2 Feature Pyramid for Transformer

Unlike CNN backbone networks, which use different convolutional strides to obtain multi-scale feature maps, our PVT uses a progressive shrinking strategy to control the scale of feature maps by patch embedding layers.

In this way, we can flexibly adjust the scale of the feature map in each stage, making it possible to construct a feature pyramid for Transformer.
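As a small worked example of the pyramid, the sketch below derives the shape of each feature map from the strides 4, 8, 16, and 32 stated earlier, together with the per-stage shrink factor those strides imply. The helper names and the channel values are hypothetical and chosen only for illustration.

```python
def pyramid_shapes(height, width, channels, strides=(4, 8, 16, 32)):
    """Return the (H_i, W_i, C_i) shape of every stage's output feature map."""
    shapes = []
    for stride, c in zip(strides, channels):
        assert height % stride == 0 and width % stride == 0
        shapes.append((height // stride, width // stride, c))
    return shapes

def per_stage_patch_sizes(strides=(4, 8, 16, 32)):
    """Shrink factor applied by each stage's patch embedding, implied by the strides."""
    sizes, previous = [], 1
    for stride in strides:
        sizes.append(stride // previous)
        previous = stride
    return sizes

# Channel numbers below are placeholder values (assumptions).
shapes = pyramid_shapes(224, 224, channels=(64, 128, 256, 512))
patch_sizes = per_stage_patch_sizes()
```

Each stage after the first shrinks the map by a factor of 2 per side, so the token count falls by 4 from one stage to the next.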

3 Transformer Encoder

The Transformer encoder in the stage $i$ has $L_{i}$ encoder layers, each of which is composed of an attention layer and a feed-forward layer. Since PVT needs to process high-resolution (e.g., 4-stride) feature maps, we propose a spatial-reduction attention (SRA) layer to replace the traditional multi-head attention (MHA) layer in the encoder.

Similar to MHA, our SRA receives a query $Q$, a key $K$, and a value $V$ as input, and outputs a refined feature. The difference is that our SRA reduces the spatial scale of $K$ and $V$ before the attention operation (see Figure 4), which largely reduces the computational/memory overhead. Details of the SRA in the stage $i$ can be formulated as follows:
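One way to write such a formulation in the usual multi-head form, consistent with the description above, is sketched here. The projection symbols $W^{O}$, $W_{j}^{Q}$, $W_{j}^{K}$, $W_{j}^{V}$, $W^{S}$, the operator names $\mathrm{SR}$ and $\mathrm{Reshape}$, the head dimension $d_{\mathrm{head}}=C_{i}/N_{i}$, and the stage-$i$ map size $H_{i}\times W_{i}$ are notation assumed for this sketch.

```latex
\begin{align}
\mathrm{SRA}(Q, K, V) &= \mathrm{Concat}(\mathrm{head}_{1}, \ldots, \mathrm{head}_{N_{i}})\, W^{O}, \\
\mathrm{head}_{j} &= \mathrm{Attention}\big(Q W_{j}^{Q},\ \mathrm{SR}(K)\, W_{j}^{K},\ \mathrm{SR}(V)\, W_{j}^{V}\big), \\
\mathrm{SR}(x) &= \mathrm{Norm}\big(\mathrm{Reshape}(x, R_{i})\, W^{S}\big), \\
\mathrm{Attention}(q, k, v) &= \mathrm{Softmax}\!\left(\frac{q k^{\top}}{\sqrt{d_{\mathrm{head}}}}\right) v.
\end{align}
```

Here $\mathrm{Reshape}(x, R_{i})$ turns a sequence of size $(H_{i}W_{i})\times C_{i}$ into one of size $\frac{H_{i}W_{i}}{R_{i}^{2}}\times(R_{i}^{2}C_{i})$, and $W^{S}\in\mathbb{R}^{(R_{i}^{2}C_{i})\times C_{i}}$ projects it back to $C_{i}$ channels. The attention map therefore has $H_{i}W_{i}\times\frac{H_{i}W_{i}}{R_{i}^{2}}$ entries instead of $(H_{i}W_{i})^{2}$.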

Through these formulas, we can find that the computational/memory costs of our attention operation are $R_{i}^{2}$ times lower than those of MHA, so our SRA can handle larger input feature maps/sequences with limited resources.
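To make the saving concrete, here is a minimal single-head NumPy sketch of attention with spatially reduced keys and values. It is an illustration under stated assumptions, not the authors' code: the reduction is implemented as grouping each $R\times R$ neighborhood of tokens followed by a linear projection and a parameter-free normalization, and all function names, weights, and sizes are hypothetical.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """Normalization over channels, without learnable scale/shift (a simplification)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)   # subtract the row max for stability
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def spatial_reduce(x, h, w, r, w_s):
    """Merge every r x r neighborhood of tokens and project back to c channels."""
    n, c = x.shape
    assert n == h * w and h % r == 0 and w % r == 0
    grid = x.reshape(h // r, r, w // r, r, c).transpose(0, 2, 1, 3, 4)
    merged = grid.reshape((h // r) * (w // r), r * r * c)
    return layer_norm(merged @ w_s)          # (h*w / r^2, c)

def sra_single_head(x, h, w, r, w_q, w_k, w_v, w_s):
    """Self-attention where keys/values come from the spatially reduced sequence."""
    q = x @ w_q                               # (h*w, c): queries keep full resolution
    reduced = spatial_reduce(x, h, w, r, w_s)
    k = reduced @ w_k                         # (h*w / r^2, c)
    v = reduced @ w_v                         # (h*w / r^2, c)
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    return attn @ v, attn

rng = np.random.default_rng(0)
h, w, c, r = 16, 16, 32, 4                    # illustrative sizes (assumptions)
x = rng.standard_normal((h * w, c))
w_q, w_k, w_v = (rng.standard_normal((c, c)) * 0.1 for _ in range(3))
w_s = rng.standard_normal((r * r * c, c)) * 0.1

out, attn = sra_single_head(x, h, w, r, w_q, w_k, w_v, w_s)
full_entries = (h * w) ** 2                   # entries in an unreduced attention map
saving = full_entries // attn.size            # equals r * r
```

With these sizes the attention map has 256 x 16 entries rather than 256 x 256, which is the $R_{i}^{2}$-fold reduction stated above. The output keeps one row per input token, so the feature map resolution is unchanged.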

4 Model Details

In summary, the hyperparameters of our method are listed as follows:

$C_{i}$ : the channel number of the output of Stage $i$ ;

$L_{i}$ : the number of encoder layers in Stage $i$ ;

$R_{i}$ : the reduction ratio of the SRA in Stage $i$ ;

$N_{i}$ : the head number of the SRA in Stage $i$ ;

$E_{i}$ : the expansion ratio of the feed-forward layer in Stage $i$ ;
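The five per-stage hyperparameters can be collected in a small configuration object, sketched below. The class name `PVTConfig`, the divisibility check (a common requirement of multi-head attention, assumed here), and the example values are illustrative assumptions and are not taken from Table 1.

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass(frozen=True)
class PVTConfig:
    channels: Tuple[int, ...]    # C_i: output channels of stage i
    depths: Tuple[int, ...]      # L_i: encoder layers in stage i
    sr_ratios: Tuple[int, ...]   # R_i: reduction ratio of the SRA in stage i
    num_heads: Tuple[int, ...]   # N_i: attention heads in stage i
    mlp_ratios: Tuple[int, ...]  # E_i: feed-forward expansion ratio in stage i

    def __post_init__(self):
        fields = (self.channels, self.depths, self.sr_ratios,
                  self.num_heads, self.mlp_ratios)
        if len({len(f) for f in fields}) != 1:
            raise ValueError("every hyperparameter needs one value per stage")
        for c, n in zip(self.channels, self.num_heads):
            if c % n != 0:
                raise ValueError(f"channels {c} not divisible by heads {n}")

    def ffn_hidden_dims(self):
        """Hidden width of each stage's feed-forward layer: C_i * E_i."""
        return tuple(c * e for c, e in zip(self.channels, self.mlp_ratios))

# Placeholder values for illustration only (assumptions).
example = PVTConfig(
    channels=(64, 128, 320, 512),
    depths=(3, 4, 6, 3),
    sr_ratios=(8, 4, 2, 1),
    num_heads=(1, 2, 5, 8),
    mlp_ratios=(8, 8, 4, 4),
)
hidden = example.ffn_hidden_dims()
```

Keeping one tuple per hyperparameter makes it easy to see that larger reduction ratios are paired with the high-resolution shallow stages, where the sequence is longest.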

Following the design rules of ResNet, we (1) use small output channel numbers in shallow stages; and (2) concentrate the major computation resource in intermediate stages.

To provide instances for discussion, we describe a series of PVT models with different scales, namely PVT-Tiny, -Small, -Medium, and -Large, in Table 1, whose parameter numbers are comparable to ResNet18, 50, 101, and 152 respectively. More details of employing these models in specific downstream tasks will be introduced in Section 4.

5 Discussion

The most related work to our model is ViT. Here, we discuss the relationship and differences between them. First, both PVT and ViT are pure Transformer models without convolutions. The primary difference between them is the pyramid structure. Similar to the traditional Transformer, the length of ViT’s output sequence is the same as the input, which means that the output of ViT is single-scale (see Figure 1 (b)). Moreover, due to the limited resource, the input of ViT is coarse-grained (e.g., the patch size is 16 or 32 pixels), and thus its output resolution is relatively low (e.g., 16-stride or 32-stride). As a result, it is difficult to directly apply ViT to dense prediction tasks that require high-resolution or multi-scale feature maps.

Our PVT breaks the routine of Transformer by introducing a progressive shrinking pyramid. It can generate multi-scale feature maps like a traditional CNN backbone. In addition, we also designed a simple but effective attention layer—SRA, to process high-resolution feature maps and reduce computational/memory costs. Benefiting from the above designs, our method has the following advantages over ViT: 1) more flexible—can generate feature maps of different scales/channels in different stages; 2) more versatile—can be easily plugged and played in most downstream task models; 3) more friendly to computation/memory—can handle higher resolution feature maps or longer sequences.

Application to Downstream Tasks

Image classification is the most classical task of image-level prediction. To provide instances for discussion, we design a series of PVT models with different scales, namely PVT-Tiny, -Small, -Medium, and -Large, whose parameter numbers are similar to ResNet18, 50, 101, and 152, respectively. Detailed hyper-parameter settings of the PVT series are provided in the supplementary material (SM).

For image classification, we follow ViT and DeiT to append a learnable classification token to the input of the last stage, and then employ a fully connected (FC) layer to conduct classification on top of the token.

2 Pixel-Level Dense Prediction

In addition to image-level prediction, dense prediction, which requires pixel-level classification or regression to be performed on the feature map, is also often seen in downstream tasks. Here, we discuss two typical tasks, namely object detection and semantic segmentation.

We apply our PVT models to three representative dense prediction methods, namely RetinaNet, Mask R-CNN, and Semantic FPN. RetinaNet is a widely used single-stage detector, Mask R-CNN is the most popular two-stage instance segmentation framework, and Semantic FPN is a vanilla semantic segmentation method without special operations (e.g., dilated convolution). Using these methods as baselines enables us to adequately examine the effectiveness of different backbones.

The implementation details are as follows: (1) Like ResNet, we initialize the PVT backbone with the weights pre-trained on ImageNet; (2) We use the output feature pyramid $\{F_{1},F_{2},F_{3},F_{4}\}$ as the input of FPN, and then the refined feature maps are fed to the follow-up detection/segmentation head; (3) When training the detection/segmentation model, none of the layers in PVT are frozen; (4) Since the input for detection/segmentation can be an arbitrary shape, the position embeddings pre-trained on ImageNet may no longer be meaningful. Therefore, we perform bilinear interpolation on the pre-trained position embeddings according to the input resolution.
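Step (4) can be illustrated with a hand-written bilinear resize of a position embedding stored as one row per token. This is a sketch under assumptions: the function name `resize_pos_embed`, the corner-aligned sampling convention, and the grid sizes are hypothetical and may differ from what the authors' code does.

```python
import numpy as np

def resize_pos_embed(pos_embed, old_hw, new_hw):
    """Bilinearly resize a (H*W, C) position embedding to (H'*W', C)."""
    old_h, old_w = old_hw
    new_h, new_w = new_hw
    grid = pos_embed.reshape(old_h, old_w, -1)
    # Target sample positions in source coordinates; corners are aligned (assumption).
    ys = np.linspace(0, old_h - 1, new_h)
    xs = np.linspace(0, old_w - 1, new_w)
    y0 = np.floor(ys).astype(int)
    x0 = np.floor(xs).astype(int)
    y1 = np.minimum(y0 + 1, old_h - 1)
    x1 = np.minimum(x0 + 1, old_w - 1)
    wy = (ys - y0)[:, None, None]   # vertical blend weights, shape (H', 1, 1)
    wx = (xs - x0)[None, :, None]   # horizontal blend weights, shape (1, W', 1)
    # Blend horizontally along the upper and lower source rows, then vertically.
    top = grid[y0][:, x0] * (1 - wx) + grid[y0][:, x1] * wx
    bottom = grid[y1][:, x0] * (1 - wx) + grid[y1][:, x1] * wx
    out = top * (1 - wy) + bottom * wy
    return out.reshape(new_h * new_w, -1)

# Hypothetical sizes: a 7 x 7 pre-trained grid resized for a larger input.
pretrained = np.random.default_rng(0).standard_normal((7 * 7, 8))
resized = resize_pos_embed(pretrained, (7, 7), (25, 34))
```

Because the embedding of each stage is tied to that stage's token grid, the same resize would be applied per stage with that stage's grid size.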

Experiments

We compare PVT with the two most representative CNN backbones, i.e., ResNet and ResNeXt, which are widely used in the benchmarks of many downstream tasks.

Settings. Image classification experiments are performed on the ImageNet 2012 dataset, which comprises 1.28 million training images and 50K validation images from 1,000 categories. For fair comparison, all models are trained on the training set, and we report the top-1 error on the validation set. We follow DeiT and apply random cropping, random horizontal flipping, label-smoothing regularization, mixup, CutMix, and random erasing as data augmentations. During training, we employ AdamW with a momentum of 0.9, a mini-batch size of 128, and a weight decay of $5\times 10^{-2}$ to optimize models. The initial learning rate is set to $1\times 10^{-3}$ and decreases following the cosine schedule. All models are trained for 300 epochs from scratch on 8 V100 GPUs. To benchmark, we apply a center crop on the validation set, where a 224 $\times$ 224 patch is cropped to evaluate the classification accuracy.
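For reference, a cosine learning-rate schedule with the stated initial rate and epoch count can be written as below. This is a generic sketch: the per-epoch update granularity, the absence of warmup, and a final rate of zero are assumptions, and `cosine_lr` is a hypothetical helper name.

```python
import math

def cosine_lr(epoch, total_epochs=300, base_lr=1e-3, min_lr=0.0):
    """Learning rate at a given epoch under cosine decay from base_lr to min_lr."""
    progress = min(max(epoch / total_epochs, 0.0), 1.0)  # clamp to [0, 1]
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

lr_start = cosine_lr(0)     # equals base_lr
lr_middle = cosine_lr(150)  # halfway between base_lr and min_lr
lr_end = cosine_lr(300)     # equals min_lr
```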

Results. In Table 2, we see that our PVT models are superior to conventional CNN backbones under similar parameter numbers and computational budgets. For example, when the GFLOPs are roughly similar, the top-1 error of PVT-Small reaches 20.2, which is 1.3 points lower than that of ResNet50 (20.2 vs. 21.5). Meanwhile, under similar or lower complexity, PVT models achieve performance comparable to the recently proposed Transformer-based models, such as ViT and DeiT (PVT-Large: 18.3 vs. ViT(DeiT)-Base/16: 18.3). Here, we clarify that these results are within our expectations, because the pyramid structure is beneficial to dense prediction tasks, but brings little improvement to image classification.

Note that ViT and DeiT have limitations as they are specifically designed for classification tasks, and thus are not suitable for dense prediction tasks, which usually require effective feature pyramids.

2 Object Detection

Settings. Object detection experiments are conducted on the challenging COCO benchmark. All models are trained on COCO train2017 (118k images) and evaluated on val2017 (5k images). We verify the effectiveness of PVT backbones on top of two standard detectors, namely RetinaNet and Mask R-CNN. Before training, we use the weights pre-trained on ImageNet to initialize the backbone and Xavier to initialize the newly added layers. Our models are trained with a batch size of 16 on 8 V100 GPUs and optimized by AdamW with an initial learning rate of $1\times 10^{-4}$. Following common practices, we adopt a 1 $\times$ or 3 $\times$ training schedule (i.e., 12 or 36 epochs) to train all detection models. The training image is resized to have a shorter side of 800 pixels, while the longer side does not exceed 1,333 pixels. When using the 3 $\times$ training schedule, we randomly resize the shorter side of the input image within a preset range. In the testing phase, the shorter side of the input image is fixed to 800 pixels.
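The resizing rule (shorter side 800 pixels, longer side at most 1,333 pixels) amounts to picking the smaller of two scale factors. The sketch below shows the arithmetic; the helper name `detection_resize` and the rounding to the nearest pixel are assumptions.

```python
def detection_resize(height, width, short_side=800, max_long_side=1333):
    """Return (new_height, new_width, scale) that keeps the aspect ratio."""
    # First try to bring the shorter side to the target length.
    scale = short_side / min(height, width)
    # If that pushes the longer side past the cap, scale to the cap instead.
    if max(height, width) * scale > max_long_side:
        scale = max_long_side / max(height, width)
    return round(height * scale), round(width * scale), scale

typical = detection_resize(480, 640)     # shorter side reaches 800
elongated = detection_resize(400, 1000)  # longer side is capped at 1333
```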

Results. As shown in Table 3, when using RetinaNet for object detection, we find that under a comparable number of parameters, the PVT-based models significantly surpass their counterparts. For example, with the 1 $\times$ training schedule, the AP of PVT-Tiny is 4.9 points better than that of ResNet18 (36.7 vs. 31.8). Moreover, with the 3 $\times$ training schedule and multi-scale training, PVT-Large achieves the best AP of 43.4, surpassing ResNeXt101-64x4d (43.4 vs. 41.8) with 30% fewer parameters. These results indicate that our PVT can be a good alternative to the CNN backbone for object detection.

Similar results are found in instance segmentation experiments based on Mask R-CNN, as shown in Table 4. With the 1 $\times$ training schedule, PVT-Tiny achieves 35.1 mask AP (APm), which is 3.9 points better than ResNet18 (35.1 vs. 31.2) and even 0.7 points higher than ResNet50 (35.1 vs. 34.4). The best APm obtained by PVT-Large is 40.7, which is 1.0 points higher than ResNeXt101-64x4d (40.7 vs. 39.7), with 20% fewer parameters.

3 Semantic Segmentation

Settings. We choose ADE20K, a challenging scene parsing dataset, to benchmark the performance of semantic segmentation. ADE20K contains 150 fine-grained semantic categories, with 20,210, 2,000, and 3,352 images for training, validation, and testing, respectively. We evaluate our PVT backbones on the basis of Semantic FPN, a simple segmentation method without dilated convolutions. In the training phase, the backbone is initialized with the weights pre-trained on ImageNet, and other newly added layers are initialized with Xavier. We optimize our models using AdamW with an initial learning rate of 1e-4. Following common practices, we train our models for 80k iterations with a batch size of 16 on 4 V100 GPUs. The learning rate is decayed following the polynomial decay schedule with a power of 0.9. We randomly resize and crop the image to $512\times 512$ for training, and rescale to have a shorter side of 512 pixels during testing.
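A polynomial decay schedule with power 0.9 over 80k iterations can be sketched as follows. The per-iteration update and the decay to exactly zero are assumptions, and `poly_lr` is a hypothetical helper name.

```python
def poly_lr(iteration, total_iters=80_000, base_lr=1e-4, power=0.9):
    """Learning rate under polynomial decay: base_lr * (1 - t/T) ** power."""
    progress = min(max(iteration / total_iters, 0.0), 1.0)  # clamp to [0, 1]
    return base_lr * (1.0 - progress) ** power

lr_first = poly_lr(0)        # equals base_lr
lr_half = poly_lr(40_000)    # base_lr * 0.5 ** 0.9
lr_last = poly_lr(80_000)    # decays to zero
```

With a power below 1, the rate stays above a linear ramp for most of training and drops steeply near the end.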

Results. As shown in Table 5, when using Semantic FPN for semantic segmentation, PVT-based models consistently outperform the models based on ResNet or ResNeXt. For example, with almost the same number of parameters and GFLOPs, our PVT-Tiny/Small/Medium are at least 2.8 points higher than ResNet-18/50/101. In addition, although the parameter number and GFLOPs of our PVT-Large are 20% lower than those of ResNeXt101-64x4d, the mIoU is still 1.9 points higher (42.1 vs. 40.2). With a longer training schedule and multi-scale testing, PVT-Large+Semantic FPN achieves the best mIoU of 44.8, which is very close to the state-of-the-art performance of the ADE20K benchmark. Note that Semantic FPN is just a simple segmentation head. These results demonstrate that our PVT backbones can extract better features for semantic segmentation than the CNN backbone, benefiting from the global attention mechanism.

4 Pure Transformer Detection & Segmentation

PVT+DETR. To reach the limit of no convolution, we build a pure Transformer pipeline for object detection by simply combining our PVT with a Transformer-based detection head, DETR. We train models on COCO train2017 for 50 epochs with an initial learning rate of $1\times 10^{-4}$. The learning rate is divided by 10 at the 33rd epoch. We use random flipping and multi-scale training as data augmentation. All other experimental settings are the same as those in Sec. 5.2. As reported in Table 6, PVT-based DETR achieves 34.7 AP on COCO val2017, outperforming the original ResNet50-based DETR by 2.4 points (34.7 vs. 32.3). These results show that a pure Transformer detector can also work well in the object detection task.

PVT+Trans2Seg. We build a pure Transformer model for semantic segmentation by combining our PVT with Trans2Seg, a Transformer-based segmentation head. According to the experimental settings in Sec. 5.3, we perform experiments on ADE20K with 40k training iterations and single-scale testing, and compare it with ResNet50+Trans2Seg and DeeplabV3+ with ResNet50-d8 (dilation 8) and -d16 (dilation 8) in Table 7. We find that our PVT-Small+Trans2Seg achieves 42.6 mIoU, outperforming ResNet50-d8+DeeplabV3+ (41.5). Note that ResNet50-d8+DeeplabV3+ has 120.5 GFLOPs due to the high computation cost of dilated convolution, whereas our method has only 31.6 GFLOPs, roughly 4 times fewer. In addition, our PVT-Small+Trans2Seg performs better than ResNet50-d16+Trans2Seg (mIoU: 42.6 vs. 39.7, GFLOPs: 31.6 vs. 79.3). These results show that a pure Transformer segmentation network is workable.

5 Ablation Study

Settings. We conduct ablation studies on ImageNet and COCO datasets. The experimental settings on ImageNet are the same as the settings in Sec. 5.1. For COCO, all models are trained with a 1 $\times$ training schedule (i.e., 12 epochs) and without multi-scale training, and other settings follow those in Sec. 5.2.

Pyramid Structure. A pyramid structure is crucial when applying Transformer to dense prediction tasks. ViT (see Figure 1 (b)) is a columnar framework, whose output is single-scale. This results in a low-resolution output feature map when using coarse image patches (e.g., 32 $\times$ 32 pixels per patch) as input, leading to poor detection performance (31.7 AP on COCO val2017), as shown in Table 8. (For adapting ViT to RetinaNet, we extract the features from layers 2, 4, 6, and 8 of ViT-Small/32, and interpolate them to different scales.) When using fine-grained image patches (e.g., 4 $\times$ 4 pixels per patch) as input like our PVT, ViT will exhaust the GPU memory (32G). Our method avoids this problem through a progressive shrinking pyramid. Specifically, our model can process high-resolution feature maps in shallow stages and low-resolution feature maps in deep stages. Thus, it obtains a promising AP of 40.4 on COCO val2017, 8.7 points higher than ViT-Small/32 (40.4 vs. 31.7).

Deeper vs. Wider. The problem of whether the CNN backbone should go deeper or wider has been extensively discussed in previous work. Here, we explore this problem in our PVT. For fair comparisons, we multiply the hidden dimensions $\{C_{1},C_{2},C_{3},C_{4}\}$ of PVT-Small by a scale factor of 1.4 to make it have an equivalent parameter number to the deep model (i.e., PVT-Medium). As shown in Table 9, the deep model (i.e., PVT-Medium) consistently works better than the wide model (i.e., PVT-Small-Wide) on both ImageNet and COCO. Therefore, going deeper is more effective than going wider in the design of PVT. Based on this observation, in Table 1, we develop PVT models with different scales by increasing the model depth.

Pre-trained Weights. Most dense prediction models (e.g., RetinaNet) rely on a backbone whose weights are pre-trained on ImageNet. We also discuss this problem in our PVT. In the top of Figure 5, we plot the validation AP curves of RetinaNet-PVT-Small w/ (red curves) and w/o (blue curves) pre-trained weights. We find that the model w/ pre-trained weights converges better than the one w/o pre-trained weights, and the gap between their final AP reaches 13.8 under the 1 $\times$ training schedule and 8.4 under the 3 $\times$ training schedule and multi-scale training. Therefore, like CNN-based models, pre-trained weights can also help PVT-based models converge faster and better. Moreover, in the bottom of Figure 5, we also see that the convergence speed of PVT-based models (red curves) is faster than that of ResNet-based models (green curves).

PVT vs. “CNN w/ Non-Local”. To obtain a global receptive field, some well-engineered CNN backbones, such as GCNet, integrate the non-local block in the CNN framework. Here, we compare the performance of our PVT (pure Transformer) and GCNet (CNN w/ non-local), using Mask R-CNN for instance segmentation. As reported in Table 10, we find that our PVT-Small outperforms ResNet50+GC r4 by 1.6 points in APm (37.8 vs. 36.2), and 2.0 points in AP$_{75}^{\rm m}$ (40.3 vs. 38.3), under comparable parameter number and GFLOPs. There are two possible reasons for this result:

(1) Although a single global attention layer (e.g., non-local or multi-head attention (MHA)) can acquire global-receptive-field features, the model performance keeps improving as the model deepens. This indicates that stacking multiple MHAs can further enhance the representation capabilities of features. Therefore, as a pure Transformer backbone with more global attention layers, our PVT tends to perform better than the CNN backbone equipped with non-local blocks (e.g., GCNet).

(2) Regular convolutions can be deemed as special instantiations of spatial attention mechanisms. In other words, the format of MHA is more flexible than the regular convolution. For example, for different inputs, the weights of the convolution are fixed, but the attention weights of MHA change dynamically with the input. Thus, the features learned by a pure Transformer backbone full of MHA layers could be more flexible and expressive.

With increasing input scale, the growth rate of the GFLOPs of our PVT is greater than ResNet, but lower than ViT, as shown in Figure 6. However, when the input scale does not exceed 640 $\times$ 640 pixels, the GFLOPs of PVT-Small and ResNet50 are similar. This means that our PVT is more suitable for tasks with medium-resolution input.
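A rough way to see why attention-based backbones grow faster with input scale is to count the multiply-adds of the two attention matrix products. The sketch below is an assumed back-of-the-envelope model (it ignores projections, feed-forward layers, and multiple heads), and `attention_madds` is a hypothetical helper, not a reproduction of Figure 6.

```python
def attention_madds(height, width, stride, channels, sr_ratio=1):
    """Multiply-adds of Q K^T plus attn V for one attention layer."""
    n_query = (height // stride) * (width // stride)  # tokens at this stride
    n_key = n_query // (sr_ratio ** 2)                # keys/values after reduction
    return 2 * n_query * n_key * channels

# Doubling each side of the input quadruples the tokens, so this term grows 16x,
# whereas a cost that is linear in the token count would grow only 4x.
small_input = attention_madds(320, 320, stride=4, channels=64, sr_ratio=8)
large_input = attention_madds(640, 640, stride=4, channels=64, sr_ratio=8)
growth = large_input // small_input

# At a fixed input size, spatial reduction cuts this term by sr_ratio ** 2.
unreduced = attention_madds(640, 640, stride=4, channels=64, sr_ratio=1)
reduction = unreduced // large_input
```

Spatial reduction lowers the constant but not the quadratic dependence on the token count, which is consistent with growth that sits between a convolutional backbone and an unreduced attention model.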

On COCO, the shorter side of the input image is 800 pixels. Under this condition, the inference speed of RetinaNet based on PVT-Small is slower than the ResNet50-based model, as reported in Table 11. (1) A direct solution for this problem is to reduce the input scale. When reducing the shorter side of the input image to 640 pixels, the model based on PVT-Small runs faster than the ResNet50-based model (51.7 ms vs. 55.9 ms), with 2.4 higher AP (38.7 vs. 36.3). (2) Another solution is to develop a self-attention layer with lower computational complexity. This direction is worth exploring, and we recently proposed one such solution, PVTv2.

Detection & Segmentation Results. In Figure 7, we also present some qualitative object detection and instance segmentation results on COCO val2017, and semantic segmentation results on ADE20K. These results indicate that a pure Transformer backbone (i.e., PVT) without convolutions can also be easily plugged into dense prediction models (e.g., RetinaNet, Mask R-CNN, and Semantic FPN), and obtain high-quality results.

Conclusions and Future Work

We introduce PVT, a pure Transformer backbone for dense prediction tasks, such as object detection and semantic segmentation. We develop a progressive shrinking pyramid and a spatial-reduction attention layer to obtain high-resolution and multi-scale feature maps under limited computation/memory resources. Extensive experiments on object detection and semantic segmentation benchmarks verify that our PVT is stronger than well-designed CNN backbones under comparable numbers of parameters.

Although PVT can serve as an alternative to CNN backbones (e.g., ResNet, ResNeXt), there are still some specific modules and operations designed for CNNs and not considered in this work, such as SE, SK, dilated convolution, model pruning, and NAS. Moreover, with years of rapid developments, there have been many well-engineered CNN backbones such as Res2Net, EfficientNet, and ResNeSt. In contrast, the Transformer-based model in computer vision is still in its early stage of development. Therefore, we believe there are many potential technologies and applications (e.g., OCR, 3D and medical image analysis) to be explored in the future, and hope that PVT could serve as a good starting point.

Acknowledgments

This work was supported by the Natural Science Foundation of China under Grant 61672273 and Grant 61832008, the Science Foundation for Distinguished Young Scholars of Jiangsu under Grant BK20160021, Postdoctoral Innovative Talent Support Program of China under Grant BX20200168, 2020M681608, the General Research Fund of Hong Kong No. 27208720.