HRFormer: High-Resolution Transformer for Dense Prediction

Yuhui Yuan, Rao Fu, Lang Huang, Weihong Lin, Chao Zhang, Xilin Chen, Jingdong Wang

Introduction

Vision Transformer (ViT) shows promising performance on ImageNet classification tasks. Many follow-up works boost the classification accuracy through knowledge distillation, adopting deeper architectures, directly introducing convolution operations, redesigning input image tokens, and so on. In addition, some studies extend the transformer to broader vision tasks such as object detection, semantic segmentation, pose estimation, and video understanding. This work focuses on the transformer for dense prediction tasks, including pose estimation and semantic segmentation.

Vision Transformer splits an image into a sequence of image patches of size 16×16, and extracts the feature representation of each image patch. Thus, the output representations of Vision Transformer lose the fine-grained spatial details that are essential for accurate dense predictions. The Vision Transformer only outputs a single-scale feature representation, and thus lacks the capability to handle multi-scale variation. To mitigate the loss of feature granularity and model the multi-scale variation, we present High-Resolution Transformer (HRFormer) that contains richer spatial information and constructs multi-resolution representations for dense predictions.

The High-Resolution Transformer follows the multi-resolution parallel design adopted in HRNet. First, HRFormer adopts convolution in both the stem and the first stage, as several concurrent studies also suggest that convolution performs better in the early stages. Second, HRFormer maintains a high-resolution stream through the entire process, with parallel medium- and low-resolution streams helping to boost the high-resolution representations. With feature maps of different resolutions, HRFormer is able to model multi-scale variation. Third, HRFormer mixes short-range and long-range attention by exchanging multi-resolution feature information through the multi-scale fusion module.

At each resolution, the local-window self-attention mechanism is adopted to reduce the memory and computation complexity. We partition the representation maps into a set of non-overlapping small image windows and perform self-attention in each image window separately. This reduces the memory and computation complexity from quadratic to linear with respect to spatial size. We further introduce a 3×3 depth-wise convolution into the feed-forward network (FFN) that follows the local-window self-attention, to exchange information between the image windows, which are disconnected in the local-window self-attention process. This helps to expand the receptive field and is essential for dense prediction tasks. Figure 1 shows the details of an HRFormer block.
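The quadratic-to-linear claim can be checked with a rough operation count. The sketch below is illustrative only: it counts the multiply-adds of the attention map and the weighted sum (ignoring the linear projections), and the function names and the channel width of 32 are assumptions, not values from the paper.

```python
def global_attention_cost(height: int, width: int, dim: int) -> int:
    """Rough multiply-adds for attention over all N = H*W tokens.

    Counts the N x N score matrix and the weighted sum of values,
    each costing about N * N * dim; projections are ignored.
    """
    n = height * width
    return 2 * n * n * dim


def window_attention_cost(height: int, width: int, dim: int, window: int) -> int:
    """Same count when attention is restricted to non-overlapping window x window tiles."""
    assert height % window == 0 and width % window == 0, "sketch assumes exact tiling"
    num_windows = (height // window) * (width // window)
    tokens_per_window = window * window
    # Each window costs 2 * K^2 * K^2 * dim, so the total is 2 * N * K^2 * dim.
    return num_windows * 2 * tokens_per_window * tokens_per_window * dim


if __name__ == "__main__":
    for side in (56, 112, 224):
        g = global_attention_cost(side, side, dim=32)
        w = window_attention_cost(side, side, dim=32, window=7)
        print(f"{side}x{side}: global={g} windowed={w} ratio={g // w}")
```

Doubling the side length multiplies the token count by 4, so the global count grows by 16 while the windowed count grows by 4. For a fixed window size the windowed cost is proportional to the number of tokens.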

We conduct experiments on image classification, pose estimation, and semantic segmentation tasks, and achieve competitive performance on various benchmarks. For example, HRFormer-B gains +1.0% top-1 accuracy on ImageNet classification over DeiT-B with 40% fewer parameters and 20% fewer FLOPs. HRFormer-B gains 0.9% AP over HRNet-W48 on the COCO val set with 32% fewer parameters and 19% fewer FLOPs. HRFormer-B + OCR gains +1.2% and +2.0% mIoU over HRNet-W48 + OCR with 25% fewer parameters and slightly more FLOPs on PASCAL-Context test and COCO-Stuff test, respectively.

Related work

With the success of Vision Transformer (ViT) and the data-efficient image transformer (DeiT), various techniques have been proposed to improve the ImageNet classification accuracy of Vision Transformer. Among the very recent advancements, the community has verified several effective improvements such as multi-scale feature hierarchies and incorporating convolutions.

For example, the concurrent works MViT, PVT, and Swin introduce multi-scale feature hierarchies into the transformer, following the spatial configuration of a typical convolutional architecture such as ResNet-50. Different from them, our HRFormer incorporates the multi-scale feature hierarchies by exploiting the multi-resolution parallel design inspired by HRNet. CvT, CeiT, and LocalViT propose to enhance the locality of the transformer by inserting depth-wise convolutions into either the self-attention or the FFN. The purpose of the inserted convolution within our HRFormer is different: apart from enhancing the locality, it also ensures information exchange across the non-overlapping windows.

Several previous studies have proposed similar local self-attention schemes for image classification. They construct overlapping local windows in the manner of a strided convolution, resulting in heavy computation cost. Similar to prior work, we apply the local-window self-attention scheme to divide the input feature map into non-overlapping windows. We then apply self-attention within each window independently, which improves the efficiency significantly.

Several concurrently developed works use the Vision Transformer to address dense prediction tasks such as semantic segmentation. They have shown that increasing the spatial resolution of the representations output by the Vision Transformer is important for semantic segmentation. Our HRFormer provides a different path to address the low-resolution problem of the Vision Transformer by exploiting the multi-resolution parallel transformer scheme.

High-Resolution CNN for Dense Prediction.

The high-resolution convolutional schemes have achieved great success on both pose estimation and semantic segmentation tasks. In the development of high-resolution convolutional neural networks, the community has developed three main paths: (i) applying dilated convolutions to remove some down-sampling layers, (ii) recovering high-resolution representations from low-resolution representations with decoders, and (iii) maintaining high-resolution representations throughout the network. Our HRFormer belongs to the third path, and retains the advantages of both Vision Transformer and HRNet.

High-Resolution Transformer

Multi-resolution parallel transformer. We follow the HRNet design and start from a high-resolution convolution stem as the first stage, gradually adding high-to-low resolution streams one by one as new stages. The multi-resolution streams are connected in parallel. The main body consists of a sequence of stages. In each stage, the feature representation of each resolution stream is updated with multiple transformer blocks independently and the information across resolutions is exchanged repeatedly with the convolutional multi-scale fusion modules.

Figure 2 illustrates the overall HRFormer architecture. The design of convolutional multi-scale fusion modules exactly follows HRNet. We illustrate the details of the transformer block in the following discussion and more details are presented in Figure 1.

After MHSA aggregates information within each window, we merge the per-window outputs to compute the output $\mathbf{X}^{\mathrm{MHSA}}$.

The left part of Figure 1 illustrates how local-window self-attention updates the 2D input representations, where the multi-head self-attention operates within each window independently.
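The partition, per-window attention, and merge steps can be sketched in plain Python. This is a simplified illustration and not the authors' implementation: it uses a single head with identity query/key/value projections, omits the residual connection, and the function names are hypothetical.

```python
import math


def window_partition(x, k):
    """Split an H x W grid of feature vectors into non-overlapping k x k windows.

    Returns the windows in row-major order; each window is a list of k*k tokens.
    """
    h, w = len(x), len(x[0])
    assert h % k == 0 and w % k == 0, "sketch assumes H and W are multiples of k"
    windows = []
    for top in range(0, h, k):
        for left in range(0, w, k):
            windows.append([x[top + i][left + j] for i in range(k) for j in range(k)])
    return windows


def window_merge(windows, h, w, k):
    """Inverse of window_partition: place the tokens back on the H x W grid."""
    out = [[None] * w for _ in range(h)]
    cols = w // k
    for p, win in enumerate(windows):
        top, left = (p // cols) * k, (p % cols) * k
        for t, token in enumerate(win):
            out[top + t // k][left + t % k] = token
    return out


def self_attention(tokens):
    """Single-head attention with identity projections (a simplification)."""
    d = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, key)) / math.sqrt(d) for key in tokens]
        m = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        out.append([sum(wt * v[c] for wt, v in zip(weights, tokens)) / total
                    for c in range(d)])
    return out


def local_window_attention(x, k):
    """Attend within each window independently, then merge the windows."""
    h, w = len(x), len(x[0])
    windows = [self_attention(win) for win in window_partition(x, k)]
    return window_merge(windows, h, w, k)
```

Because each window is processed on its own, changing a token in one window leaves the outputs of every other window untouched. This is the isolation that the depth-wise convolution in the FFN is meant to break.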

FFN with depth-wise convolution. Local-window self-attention performs self-attention over the non-overlapping windows separately. There is no information exchange across the windows. To handle this issue, we add a 3×3 depth-wise convolution in between the two point-wise MLPs that form the FFN in Vision Transformer: MLP(DW-Conv.(MLP(·))). The right part of Figure 1 shows an example of how FFN with 3×3 depth-wise convolution updates the 2D input representations.
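A minimal sketch of the MLP, depth-wise convolution, MLP composition follows. It is illustrative only: normalization and activation layers are omitted, zero padding is an assumption, and the helper names and weights are hypothetical.

```python
def pointwise(x, weight):
    """Point-wise (1x1) linear map. x: [C_in][H][W], weight: [C_out][C_in]."""
    h, w = len(x[0]), len(x[0][0])
    return [
        [[sum(weight[o][c] * x[c][i][j] for c in range(len(x))) for j in range(w)]
         for i in range(h)]
        for o in range(len(weight))
    ]


def depthwise_conv3x3(x, kernels):
    """3x3 depth-wise convolution with zero padding.

    x: [C][H][W]; kernels: [C][3][3], one filter per channel.
    """
    h, w = len(x[0]), len(x[0][0])
    out = []
    for c, plane in enumerate(x):
        res = [[0.0] * w for _ in range(h)]
        for i in range(h):
            for j in range(w):
                acc = 0.0
                for di in (-1, 0, 1):
                    for dj in (-1, 0, 1):
                        ii, jj = i + di, j + dj
                        if 0 <= ii < h and 0 <= jj < w:
                            acc += kernels[c][di + 1][dj + 1] * plane[ii][jj]
                res[i][j] = acc
        out.append(res)
    return out


def ffn_dwconv(x, w_expand, kernels, w_project):
    """MLP(DW-Conv(MLP(x))) without normalization or activation."""
    return pointwise(depthwise_conv3x3(pointwise(x, w_expand), kernels), w_project)


if __name__ == "__main__":
    # One-channel 8x8 map with a single non-zero value at (3, 3). With 4x4
    # windows, (3, 3) lies in the top-left window and (4, 4) in the bottom-right.
    x = [[[0.0] * 8 for _ in range(8)]]
    x[0][3][3] = 1.0
    w_expand = [[1.0] for _ in range(4)]                          # 1 -> 4 channels
    kernels = [[[1.0] * 3 for _ in range(3)] for _ in range(4)]   # all-ones filters
    w_project = [[0.25] * 4]                                      # 4 -> 1 channel
    y = ffn_dwconv(x, w_expand, kernels, w_project)
    print(y[0][4][4], y[0][5][5])
```

In the demo, the value placed at (3, 3) reaches (4, 4), a position in a different 4×4 window, after one FFN. Positions two steps away, such as (5, 5), remain zero.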

Representation head designs. As shown in Figure 2, the output of HRFormer consists of four feature maps of different resolutions. The representation head designs for the different tasks are as follows: (i) for ImageNet classification, we send the four-resolution feature maps into a bottleneck, and the output channels are changed to 128, 256, 512, and 1024, respectively. Then, we apply strided convolutions to fuse them and output a feature map of the lowest resolution with 2048 channels. Last, we apply a global average pooling operation followed by the final classifier. (ii) For pose estimation, we only apply the regression head over the highest-resolution feature map. (iii) For semantic segmentation, we apply the semantic segmentation head over the concatenated representations, which are computed by first upsampling all the low-resolution representations to the highest resolution and then concatenating them together.
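The upsample-then-concatenate step used for semantic segmentation can be sketched as follows. This is a shape-level illustration under stated assumptions: nearest-neighbour repetition stands in for whatever interpolation the actual head uses, and the channel counts in the demo are hypothetical.

```python
def upsample_nearest(x, factor):
    """x: [C][H][W] -> [C][H*factor][W*factor] by repeating each value."""
    out = []
    for plane in x:
        new_plane = []
        for row in plane:
            wide = [row[j // factor] for j in range(len(row) * factor)]
            for _ in range(factor):
                new_plane.append(list(wide))
        out.append(new_plane)
    return out


def concat_multi_resolution(features):
    """Upsample every stream to the first (highest) resolution and stack channels.

    features: list of [C_i][H_i][W_i], ordered from high to low resolution.
    """
    target_h = len(features[0][0])
    fused = []
    for feat in features:
        assert target_h % len(feat[0]) == 0, "sketch assumes integer scale factors"
        factor = target_h // len(feat[0])
        fused.extend(upsample_nearest(feat, factor))
    return fused


if __name__ == "__main__":
    # Hypothetical channel counts (2, 4, 8, 16) at spatial sizes 8, 4, 2, 1.
    feats = [
        [[[0.0] * s for _ in range(s)] for _ in range(c)]
        for c, s in ((2, 8), (4, 4), (8, 2), (16, 1))
    ]
    fused = concat_multi_resolution(feats)
    print(len(fused), len(fused[0]), len(fused[0][0]))
```

The fused map has as many channels as the four streams combined, at the spatial size of the highest-resolution stream.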

Instantiation. We illustrate the overall architecture configuration of HRFormer in Table 1. We use $(M_1, M_2, M_3, M_4)$ and $(B_1, B_2, B_3, B_4)$ to represent the number of modules and the number of blocks of {stage 1, stage 2, stage 3, stage 4}, respectively. We use $(C_1, C_2, C_3, C_4)$, $(H_1, H_2, H_3, H_4)$, and $(R_1, R_2, R_3, R_4)$ to represent the number of channels, the number of heads, and the MLP expansion ratios in the transformer blocks associated with different resolutions. We keep the first stage unchanged following the original HRNet and use the bottleneck as the basic building block. We apply the transformer blocks in the other stages, and each transformer block consists of a local-window self-attention followed by an FFN with 3×3 depth-wise convolution. We have not included the convolutional multi-scale fusion modules in Table 1 for simplicity. In our implementation, we set the size of the windows on the four resolution streams as (7, 7, 7, 7) by default. Table 2 illustrates the configuration details of three different HRFormer instances with increasing complexities, where the MLP expansion ratios $(R_1, R_2, R_3, R_4)$ are set as (4, 4, 4, 4) for all models and are not shown.

Analysis. The benefits of the 3×3 depth-wise convolution are twofold: one is enhancing the locality and the other is enabling interactions across windows. Figure 3 illustrates how the FFN with depth-wise convolution can expand the interactions beyond the non-overlapping local windows and model the relations between them. Therefore, based on the combination of the local-window self-attention and the FFN with 3×3 depth-wise convolution, we can build the HRFormer block that improves the memory and computation efficiency significantly.

Experiments

1 Human Pose Estimation

Training setting. We study the performance of HRFormer on the COCO human pose estimation benchmark, which contains more than 200K images and 250K person instances labeled with 17 keypoints. We train our model on the COCO train 2017 dataset, including 57K images and 150K person instances. We evaluate our approach on the val 2017 set and test-dev 2017, containing 5K images and 20K images, respectively.

We follow most of the default training and evaluation settings of mmpose (https://github.com/open-mmlab/mmpose, Apache License 2.0), and change the optimizer from Adam to AdamW. For the training batch size, we choose 256 for HRFormer-T and HRFormer-S and 128 for HRFormer-B due to limited GPU memory. Each HRFormer experiment on the COCO pose estimation task takes 8× 32G-V100 GPUs.

Results. Table 3 reports the comparisons on the COCO val set. We compare HRFormer to representative convolutional methods such as HRNet and several recent transformer methods, including PRTR, TransPose-H-A6, and TokenPose-L/D24. HRFormer-B gains 0.9% with 32% fewer parameters and 19% fewer FLOPs when compared to HRNet-W48 with an input size of 384×288. Notably, our HRFormer-B already achieves 77.2% without using any advanced techniques such as UDP and DARK. We believe that our HRFormer-B could achieve better results by exploiting either the UDP or the DARK scheme. We also report the comparisons on the COCO test-dev set in Table 4. Our HRFormer-B outperforms HRNet-W48 by around 0.7% with fewer parameters and FLOPs. Figure 4 shows some example results of human pose estimation on the COCO val set.

2 Semantic Segmentation

Cityscapes. The Cityscapes dataset is for urban scene understanding. There are a total of 30 classes and only 19 classes are used for parsing evaluation. The dataset contains 5K high-quality pixel-level finely annotated images and 20K coarsely annotated images. The finely annotated 5K images are divided into 2,975 train images, 500 val images, and 1,525 test images. We set the initial learning rate as 0.0001, weight decay as 0.01, crop size as 1024×512, batch size as 8, and training iterations as 80K by default. Each HRFormer + OCR experiment on Cityscapes takes 8× 32G-V100 GPUs.

Table 5 reports the results on Cityscapes val. We choose to use HRFormer + OCR as our semantic segmentation architecture. We compare our method with several well-known Vision Transformer based methods and CNN based methods. Specifically, SETR-PUP and SETR-MLA use ViT-Large as the backbone. DPT-Hybrid uses ViT-Hybrid, which consists of a ResNet-50 followed by 12 transformer layers. Both ViT-Large and ViT-Hybrid are initialized with weights pre-trained on ImageNet-21K, where both of them achieve around 85.1% top-1 accuracy on ImageNet. DeepLabv3 and PSPNet are based on dilated ResNet-101 with output stride 8. According to the fourth column of Table 5, HRFormer + OCR achieves competitive performance overall. For example, HRFormer-B + OCR achieves comparable performance with SETR-PUP while saving 70% parameters and 50% FLOPs.

PASCAL-Context. The PASCAL-Context dataset is a challenging scene parsing dataset that contains 59 semantic classes and 1 background class. The train set and test set consist of 4,998 and 5,105 images, respectively. We set the initial learning rate as 0.0001, weight decay as 0.01, crop size as 520×520, batch size as 16, and training iterations as 60K by default. We report the comparisons in the fifth column of Table 5. Accordingly, HRFormer-B + OCR gains 1.1% and 1.5% over HRNet-W48 + OCR and SETR-MLA, respectively, with fewer parameters and FLOPs. Notably, DPT-Hybrid achieves the best performance through extra pre-training of the models on ADE20K in advance. Each HRFormer + OCR experiment on PASCAL-Context takes 8× 32G-V100 GPUs.

COCO-Stuff. The COCO-Stuff dataset is a challenging scene parsing dataset that contains 171 semantic classes. The train set and test set consist of 9K and 1K images, respectively. We set the initial learning rate as 0.0001, weight decay as 0.01, crop size as 520×520, batch size as 16, and training iterations as 60K by default. We report the comparisons in the last column of Table 5, and HRFormer-B + OCR outperforms the previous best-performing HRNet-W48 + OCR by nearly 2%. Each HRFormer + OCR experiment on COCO-Stuff takes 8× 32G-V100 GPUs. Figure 5 shows some example results on Cityscapes, PASCAL-Context, and COCO-Stuff.
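For quick reference, the default segmentation training settings stated above can be collected in one place. The sketch below only restates the numbers given in the text; the class and field names are hypothetical and do not correspond to any particular training framework.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SegTrainConfig:
    dataset: str
    initial_lr: float
    weight_decay: float
    crop_size: tuple  # kept in the order written in the text, e.g. 1024x512
    batch_size: int
    iterations: int


SEG_CONFIGS = {
    "cityscapes": SegTrainConfig("Cityscapes", 1e-4, 0.01, (1024, 512), 8, 80_000),
    "pascal_context": SegTrainConfig("PASCAL-Context", 1e-4, 0.01, (520, 520), 16, 60_000),
    "coco_stuff": SegTrainConfig("COCO-Stuff", 1e-4, 0.01, (520, 520), 16, 60_000),
}


def crops_seen(cfg: SegTrainConfig) -> int:
    """Total number of training crops processed over the full schedule."""
    return cfg.batch_size * cfg.iterations


if __name__ == "__main__":
    for name, cfg in SEG_CONFIGS.items():
        print(name, cfg.crop_size, crops_seen(cfg))
```

The three settings differ only in crop size, batch size, and iteration count; the learning rate and weight decay are shared.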

3 ImageNet Classification

Training setting. We conduct the comparisons on ImageNet-1K, which consists of 1.28M train images and 50K val images with 1000 classes. We train all models with batch size 1024 for 300 epochs with the AdamW optimizer, a cosine decay learning rate schedule, weight decay of 0.05, and a bag of augmentation policies, including rand augmentation, mixup, cutmix, and so on. HRFormer-T and HRFormer-S require 8× 32G-V100 GPUs and HRFormer-B requires 32× 32G-V100 GPUs.
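A cosine decay schedule of the kind mentioned above can be sketched as follows. This is a generic formulation and not the authors' code: the base learning rate of 1e-3 is a placeholder (the text does not state it), warmup is omitted, and the step count assumes roughly 1.28M images at batch size 1024 for 300 epochs.

```python
import math

# About 1.28M images / 1024 per batch = 1250 steps per epoch, for 300 epochs.
STEPS_PER_EPOCH = 1_280_000 // 1024
TOTAL_STEPS = 300 * STEPS_PER_EPOCH


def cosine_decay_lr(step: int, total_steps: int, base_lr: float, min_lr: float = 0.0) -> float:
    """Learning rate that follows half a cosine period from base_lr down to min_lr."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    base_lr = 1e-3  # hypothetical value, for illustration only
    for epoch in (0, 75, 150, 225, 300):
        step = epoch * STEPS_PER_EPOCH
        print(epoch, cosine_decay_lr(step, TOTAL_STEPS, base_lr))
```

The rate starts at the base value, passes through half of it at the midpoint of training, and reaches the minimum at the final step.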

Results. We compare HRFormer to some representative CNN methods and vision transformer methods in Table 6, where all methods are trained on ImageNet-1K only. The results of ViT-Large with a larger dataset such as ImageNet-21K are not included for fairness. According to Table 6, HRFormer achieves competitive performance. For example, HRFormer-B gains 1.0% over DeiT-B while saving nearly 40% parameters and 20% FLOPs.

4 Ablation Experiments

We study the influence of the 3×3 depth-wise convolution within FFN based on HRFormer-T in Table 7. We observe that applying the 3×3 depth-wise convolution in FFN significantly improves the performance on multiple tasks, including ImageNet classification, PASCAL-Context segmentation, and COCO pose estimation. For example, HRFormer-T + FFN w/ 3×3 depth-wise convolution outperforms HRFormer-T + FFN w/o 3×3 depth-wise convolution by 0.65%, 2.9%, and 4.04% on ImageNet, PASCAL-Context, and COCO, respectively.

Influence of shifted window scheme & 3×3 depth-wise convolution within FFN based on Swin-T.

We compare our method with the shifted window scheme of Swin transformer in Table 8. For fair comparisons, we construct an Intra-Window transformer architecture (Intrawin-T) following the same architecture configurations as Swin-T, except that we do not apply the shifted window scheme. We see that applying the 3×3 depth-wise convolution within FFN improves both Swin-T and Intrawin-T. Surprisingly, when equipped with the 3×3 depth-wise convolution within FFN, Intrawin-T even outperforms Swin-T.

Shifted window scheme vs. 3×3 depth-wise convolution within FFN based on HRFormer-T.

In Table 9, we compare the 3×3 depth-wise convolution within FFN scheme to the shifted window scheme based on HRFormer-T. According to the results, applying the 3×3 depth-wise convolution within FFN significantly outperforms applying the shifted window scheme across all tasks.

Comparison to ViT, DeiT & Swin on pose estimation.

We report the COCO pose estimation results based on three well-known transformer models, ViT-Large, DeiT-B, and Swin-B, in Table 10. Notably, both ViT-Large and Swin-B‡ are pre-trained on ImageNet-21K in advance and then finetuned on ImageNet-1K, and achieve 85.1% and 86.4% top-1 accuracy, respectively. DeiT-B is trained on ImageNet-1K for 1000 epochs and achieves 85.2% top-1 accuracy. We apply deconvolution modules to upsample the output representations of the encoder, following SimpleBaseline, for the three methods. The numbers of parameters and FLOPs are listed in the fourth and fifth columns of Table 10. According to the results in Table 10, our HRFormer-B achieves better performance than all three methods with fewer parameters and FLOPs.

Comparison to HRNet.

We compare our HRFormer to the convolutional HRNet with almost the same architecture configurations by replacing all the transformer blocks with the conventional basic block consisting of two 3×3 convolutions. Table 11 shows the comparison results on ImageNet, PASCAL-Context, and COCO. We observe that HRFormer significantly outperforms HRNet under various configurations with much less model and computation complexity. For example, HRFormer-T outperforms HRNet-T by 2.0%, 1.5%, and 1.6% on the three tasks, respectively, while requiring only around 50% of the parameters and FLOPs. In summary, HRFormer achieves better performance by exploiting the benefits of transformers such as content-dependent dynamic interactions.

Conclusion

In this work, we present the High-Resolution Transformer (HRFormer), a simple yet effective transformer architecture for dense prediction tasks, including pose estimation and semantic segmentation. The key insight is to integrate the HRFormer block, which combines local-window self-attention and FFN with depth-wise convolution to improve the memory and computation efficiency, with the multi-resolution parallel design of the convolutional HRNet. In addition, HRFormer benefits from adopting convolution in the early stages and from mixing short-range and long-range attention through the multi-scale fusion scheme. We empirically verify the effectiveness of our HRFormer on both pose estimation and semantic segmentation tasks.

Appendix

We present additional visualizations of the example results of our method on both pose estimation and semantic segmentation tasks.

Figure 6 shows more pose estimation results of HRFormer-B on COCO val. Figure 7 shows more semantic segmentation results on Cityscapes val, PASCAL-Context test and COCO-Stuff test.

Ablation of window sizes.

We report the results with different window sizes at different resolutions on semantic segmentation tasks and we will add more results if necessary. We use $(W_1, W_2, W_3, W_4)$ to represent the window sizes associated with the feature maps of different resolutions, with strides 4, 8, 16, and 32. We choose larger window sizes for higher-resolution branches; thus, we have $W_1 > W_2 > W_3 > W_4$. According to these results, applying larger windows improves the performance, and applying different window sizes at different resolutions makes no big difference.
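To see what a window-size setting means for each stream, the number of windows per resolution can be computed from the strides given above. The sketch is illustrative: rounding partial windows up (that is, padding the feature map to a multiple of the window size) is an assumption, and the function name is hypothetical.

```python
import math

# Strides of the four resolution streams, as stated in the text.
STRIDES = (4, 8, 16, 32)


def windows_per_stream(height: int, width: int, window_sizes=(7, 7, 7, 7)):
    """Number of non-overlapping windows on each stream for a given input size.

    Partial windows at the border are counted as full windows, which
    corresponds to padding the feature map (an assumption of this sketch).
    """
    counts = []
    for stride, k in zip(STRIDES, window_sizes):
        h, w = height // stride, width // stride
        counts.append(math.ceil(h / k) * math.ceil(w / k))
    return counts


if __name__ == "__main__":
    # Default window sizes on a 224x224 input.
    print(windows_per_stream(224, 224))
    # A larger window on the highest-resolution stream (hypothetical setting).
    print(windows_per_stream(224, 224, window_sizes=(14, 7, 7, 7)))
```

With the default window size of 7 and a 224×224 input, the lowest-resolution stream holds a single window, so local-window attention there already covers the whole feature map.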
