HRFormer: High-Resolution Transformer for Dense Prediction

Yuhui Yuan, Rao Fu, Lang Huang, Weihong Lin, Chao Zhang, Xilin Chen, Jingdong Wang

Introduction

Vision Transformer (ViT) shows promising performance on ImageNet classification tasks. Many follow-up works boost the classification accuracy through knowledge distillation, adopting deeper architectures, directly introducing convolution operations, redesigning input image tokens, and so on. Besides, some studies attempt to extend the transformer to address broader vision tasks such as object detection, semantic segmentation, pose estimation, and video understanding. This work focuses on the transformer for dense prediction tasks, including pose estimation and semantic segmentation.

Vision Transformer splits an image into a sequence of image patches of size $16\times 16$ and extracts the feature representation of each image patch. Thus, the output representations of Vision Transformer lose the fine-grained spatial details that are essential for accurate dense predictions. The Vision Transformer also outputs only a single-scale feature representation, and thus lacks the capability to handle multi-scale variation. To mitigate the loss of feature granularity and model the multi-scale variation, we present High-Resolution Transformer (HRFormer), which contains richer spatial information and constructs multi-resolution representations for dense predictions.

The High-Resolution Transformer is built by following the multi-resolution parallel design that is adopted in HRNet. First, HRFormer adopts convolution in both the stem and the first stage, as several concurrent studies also suggest that convolution performs better in the early stages. Second, HRFormer maintains a high-resolution stream through the entire process, with parallel medium- and low-resolution streams helping boost the high-resolution representations. With feature maps of different resolutions, HRFormer is able to model the multi-scale variation. Third, HRFormer mixes short-range and long-range attention by exchanging multi-resolution feature information with the multi-scale fusion module.

At each resolution, the local-window self-attention mechanism is adopted to reduce the memory and computation complexity. We partition the representation maps into a set of non-overlapping small image windows and perform self-attention in each image window separately. This reduces the memory and computation complexity from quadratic to linear with respect to spatial size. We further introduce a $3\times 3$ depth-wise convolution into the feed-forward network (FFN) that follows the local-window self-attention, to exchange information between the image windows which are disconnected in the local-window self-attention process. This helps to expand the receptive field and is essential for dense prediction tasks. Figure 1 shows the details of an HRFormer block.
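The quadratic-versus-linear claim can be checked with a small counting sketch. It counts only query-key pairs and ignores heads, channel width, and projection costs; the function names are illustrative and do not come from the HRFormer code base.

```python
def global_attention_pairs(height, width):
    """Query-key pairs when every position attends to every position."""
    n = height * width
    return n * n


def window_attention_pairs(height, width, window):
    """Query-key pairs when attention is restricted to non-overlapping windows.

    Assumes height and width are multiples of `window`.
    """
    if height % window or width % window:
        raise ValueError("feature map must be divisible by the window size")
    num_windows = (height // window) * (width // window)
    tokens_per_window = window * window
    return num_windows * tokens_per_window * tokens_per_window


# Doubling the side length quadruples the number of positions.
for side in (56, 112, 224):
    g = global_attention_pairs(side, side)
    w = window_attention_pairs(side, side, window=7)
    print(f"{side}x{side}: global={g}, windowed={w}")
```

With a fixed window size $K$, the windowed count equals $HW \cdot K^{2}$, so it grows in proportion to the number of positions, whereas the global count grows with its square.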

We conduct experiments on image classification, pose estimation, and semantic segmentation tasks, and achieve competitive performance on various benchmarks. For example, HRFormer-B gains $+1.0\%$ top-$1$ accuracy on ImageNet classification over DeiT-B with $40\%$ fewer parameters and $20\%$ fewer FLOPs. HRFormer-B gains $0.9\%$ AP over HRNet-W$48$ on the COCO val set with $32\%$ fewer parameters and $19\%$ fewer FLOPs. HRFormer-B + OCR gains $+1.2\%$ and $+2.0\%$ mIoU over HRNet-W$48$ + OCR with $25\%$ fewer parameters and slightly more FLOPs on PASCAL-Context test and COCO-Stuff test, respectively.

Related work

With the success of Vision Transformer (ViT) and the data-efficient image transformer (DeiT), various techniques have been proposed to improve the ImageNet classification accuracy of Vision Transformer. Among the very recent advancements, the community has verified several effective improvements such as multi-scale feature hierarchies and incorporating convolutions.

For example, the concurrent works MViT, PVT, and Swin introduce multi-scale feature hierarchies into the transformer, following the spatial configuration of a typical convolutional architecture such as ResNet-$50$. Different from them, our HRFormer incorporates the multi-scale feature hierarchies by exploiting the multi-resolution parallel design inspired by HRNet. CvT, CeiT, and LocalViT propose to enhance the locality of the transformer by inserting depth-wise convolutions into either the self-attention or the FFN. The purpose of the inserted convolution within our HRFormer is different: apart from enhancing the locality, it also ensures information exchange across the non-overlapping windows.

Several previous studies have proposed similar local self-attention schemes for image classification. They construct overlapped local windows following the strided convolution, resulting in heavy computation cost. Following prior work that uses non-overlapping windows, we apply the local-window self-attention scheme to divide the input feature map into non-overlapping windows. We then apply self-attention within each window independently, which improves the efficiency significantly.

Several concurrently developed works use the Vision Transformer to address dense prediction tasks such as semantic segmentation. They have shown that increasing the spatial resolution of the representations output by the Vision Transformer is important for semantic segmentation. Our HRFormer provides a different path to address the low-resolution problem of the Vision Transformer by exploiting the multi-resolution parallel transformer scheme.

High-Resolution CNN for Dense Prediction.

High-resolution convolutional schemes have achieved great success on both pose estimation and semantic segmentation tasks. In the development of high-resolution convolutional neural networks, the community has developed three main paths: (i) applying dilated convolutions to remove some down-sampling layers, (ii) recovering high-resolution representations from low-resolution representations with decoders, and (iii) maintaining high-resolution representations throughout the network. Our HRFormer belongs to the third path and retains the advantages of both the vision transformer and HRNet.

High-Resolution Transformer

Multi-resolution parallel transformer. We follow the HRNet design and start from a high-resolution convolution stem as the first stage, gradually adding high-to-low resolution streams one by one as new stages. The multi-resolution streams are connected in parallel. The main body consists of a sequence of stages. In each stage, the feature representation of each resolution stream is updated with multiple transformer blocks independently and the information across resolutions is exchanged repeatedly with the convolutional multi-scale fusion modules.

Figure 2 illustrates the overall HRFormer architecture. The design of convolutional multi-scale fusion modules exactly follows HRNet. We illustrate the details of the transformer block in the following discussion and more details are presented in Figure 1.

After MHSA aggregates information within each window, we merge the per-window outputs to compute the output $\mathbf{X}^{\rm{MHSA}}$.

The left part of Figure 1 illustrates how local-window self-attention updates the $2$D input representations, where the multi-head self-attention operates within each window independently.
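The partition, per-window update, and merge steps can be sketched on a toy single-channel grid. This is a minimal illustration rather than the authors' implementation: `window_mean` is a hypothetical stand-in for MHSA that corresponds to uniform attention weights inside each window, and all function names are assumptions.

```python
def partition_windows(feature_map, window):
    """Split a 2D grid (list of rows) into non-overlapping window x window tiles.

    Returns tiles in row-major order; each tile is itself a list of rows.
    """
    height, width = len(feature_map), len(feature_map[0])
    if height % window or width % window:
        raise ValueError("grid must be divisible by the window size")
    tiles = []
    for top in range(0, height, window):
        for left in range(0, width, window):
            tiles.append([row[left:left + window]
                          for row in feature_map[top:top + window]])
    return tiles


def merge_windows(tiles, height, width, window):
    """Inverse of partition_windows: place the tiles back on the grid."""
    merged = [[None] * width for _ in range(height)]
    tiles_per_row = width // window
    for index, tile in enumerate(tiles):
        top = (index // tiles_per_row) * window
        left = (index % tiles_per_row) * window
        for dy, row in enumerate(tile):
            merged[top + dy][left:left + window] = row
    return merged


def window_mean(tile):
    """Stand-in for per-window MHSA: every position receives the window mean."""
    values = [v for row in tile for v in row]
    mean = sum(values) / len(values)
    return [[mean] * len(row) for row in tile]


grid = [[float(r * 4 + c) for c in range(4)] for r in range(4)]
tiles = partition_windows(grid, window=2)
updated = merge_windows([window_mean(t) for t in tiles], 4, 4, window=2)
```

Each output value depends only on the positions of its own window, which is exactly the lack of cross-window interaction that the depth-wise convolution in the FFN is meant to address.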

FFN with depth-wise convolution. Local-window self-attention performs self-attention over the non-overlapping windows separately, so there is no information exchange across the windows. To handle this issue, we add a $3\times 3$ depth-wise convolution in between the two point-wise MLPs that form the FFN in the Vision Transformer: $\operatorname{MLP}(\operatorname{DW-Conv.}(\operatorname{MLP}(\cdot)))$. The right part of Figure 1 shows an example of how the FFN with $3\times 3$ depth-wise convolution updates the $2$D input representations.
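A toy example can show how a $3\times 3$ depth-wise convolution carries information across window boundaries. The sketch below handles one channel (a depth-wise convolution repeats this per channel) and uses an all-ones kernel with zero padding purely for illustration; in the real model the kernel weights are learned, and the helper names are assumptions.

```python
def depthwise_conv3x3(channel, kernel):
    """Apply one 3x3 kernel to one channel with zero padding (stride 1).

    A depth-wise convolution repeats this independently for every channel.
    """
    height, width = len(channel), len(channel[0])
    out = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            total = 0.0
            for ky in range(3):
                for kx in range(3):
                    sy, sx = y + ky - 1, x + kx - 1
                    if 0 <= sy < height and 0 <= sx < width:
                        total += kernel[ky][kx] * channel[sy][sx]
            out[y][x] = total
    return out


def window_index(y, x, window, width):
    """Row-major index of the non-overlapping window containing (y, x)."""
    return (y // window) * (width // window) + (x // window)


def windows_reached(channel, window):
    """Sorted indices of the windows that contain any non-zero value."""
    width = len(channel[0])
    return sorted({window_index(y, x, window, width)
                   for y, row in enumerate(channel)
                   for x, value in enumerate(row) if value != 0.0})


# A single activation in the bottom-right corner of the top-left 2x2 window.
grid = [[0.0] * 4 for _ in range(4)]
grid[1][1] = 1.0
ones = [[1.0] * 3 for _ in range(3)]
print(windows_reached(grid, window=2))                           # [0]
print(windows_reached(depthwise_conv3x3(grid, ones), window=2))  # [0, 1, 2, 3]
```

Before the convolution the activation is confined to one window; afterwards it has reached all four windows that meet at that corner, so the next local-window self-attention can propagate it further.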

Representation head designs. As shown in Figure 2, the output of HRFormer consists of four feature maps of different resolutions. We illustrate the details of the representation head designs for different tasks as follows: (i) for ImageNet classification, we send the four-resolution feature maps into a bottleneck, and the output channels are changed to $128$, $256$, $512$, and $1024$, respectively. Then, we apply strided convolutions to fuse them and output a feature map of the lowest resolution with $2048$ channels. Last, we apply a global average pooling operation followed by the final classifier. (ii) For pose estimation, we only apply the regression head over the highest-resolution feature map. (iii) For semantic segmentation, we apply the semantic segmentation head over the concatenated representations, which are computed by first upsampling all the low-resolution representations to the highest resolution and then concatenating them together.
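The upsample-then-concatenate step used for the segmentation head can be sketched on nested lists. The sketch assumes nearest-neighbour upsampling for simplicity, since the text does not state the interpolation mode, and the function names and toy shapes are hypothetical.

```python
def upsample_nearest(channel, factor):
    """Nearest-neighbour upsampling of one 2D channel by an integer factor."""
    return [[value for value in row for _ in range(factor)]
            for row in channel for _ in range(factor)]


def concat_streams(streams, strides):
    """Upsample every stream to the highest resolution and stack the channels.

    streams: list of feature maps, each a list of 2D channels.
    strides: stride of each stream relative to the input image.
    """
    top = min(strides)
    fused = []
    for channels, stride in zip(streams, strides):
        factor = stride // top
        fused.extend(upsample_nearest(c, factor) for c in channels)
    return fused


# Toy input: one 2x2 channel at stride 4 and two 1x1 channels at stride 8.
high = [[[1, 2], [3, 4]]]
low = [[[5]], [[6]]]
fused = concat_streams([high, low], strides=[4, 8])
```

The fused representation has as many channels as all streams combined, at the spatial size of the highest-resolution stream.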

Instantiation. We illustrate the overall architecture configuration of HRFormer in Table 1. We use $\left(M_{1},M_{2},M_{3},M_{4}\right)$ and $\left(B_{1},B_{2},B_{3},B_{4}\right)$ to represent the number of modules and the number of blocks of {stage $1$, stage $2$, stage $3$, stage $4$}, respectively. We use $\left(C_{1},C_{2},C_{3},C_{4}\right)$, $\left(H_{1},H_{2},H_{3},H_{4}\right)$, and $\left(R_{1},R_{2},R_{3},R_{4}\right)$ to represent the number of channels, the number of heads, and the MLP expansion ratios in the transformer blocks associated with different resolutions. We keep the first stage unchanged following the original HRNet and use the bottleneck as the basic building block. We apply the transformer blocks in the other stages, and each transformer block consists of a local-window self-attention followed by an FFN with $3\times 3$ depth-wise convolution. We have not included the convolutional multi-scale fusion modules in Table 1 for simplicity. In our implementation, we set the size of the windows on the four resolution streams as $\left(7,7,7,7\right)$ by default. Table 2 illustrates the configuration details of three different HRFormer instances with increasing complexities, where the MLP expansion ratios $\left(R_{1},R_{2},R_{3},R_{4}\right)$ are set as $\left(4,4,4,4\right)$ for all models and are not shown.
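As a rough illustration of what the default window setting implies, the sketch below lists the feature-map size and the number of attention windows on each stream. It uses the strides $4$, $8$, $16$, $32$ mentioned in the appendix and the default window size $7$; the `StreamLayout` name and the padding behaviour (ceiling division when the map is not a multiple of the window) are assumptions, not details taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StreamLayout:
    stride: int       # downsampling factor relative to the input image
    height: int       # feature-map height on this stream
    width: int        # feature-map width on this stream
    window: int       # local attention window size
    num_windows: int  # number of non-overlapping windows


def stream_layouts(image_height, image_width,
                   strides=(4, 8, 16, 32), windows=(7, 7, 7, 7)):
    """Feature-map size and window count for each resolution stream.

    Uses ceiling division for the window count, i.e. it assumes the feature
    map is padded up to a multiple of the window size when needed.
    """
    layouts = []
    for stride, window in zip(strides, windows):
        h, w = image_height // stride, image_width // stride
        count = -(-h // window) * -(-w // window)  # ceil(h/win) * ceil(w/win)
        layouts.append(StreamLayout(stride, h, w, window, count))
    return layouts


for layout in stream_layouts(224, 224):
    print(layout)
```

For a $224\times 224$ input this gives $56\times 56$, $28\times 28$, $14\times 14$, and $7\times 7$ feature maps with $64$, $16$, $4$, and $1$ windows, so on the lowest-resolution stream a single window already covers the whole map.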

Analysis. The benefits of the $3\times 3$ depth-wise convolution are twofold: one is enhancing the locality and the other is enabling interactions across windows. Figure 3 illustrates how the FFN with depth-wise convolution is able to expand the interactions beyond the non-overlapping local windows and model the relations between them. Therefore, by combining the local-window self-attention and the FFN with $3\times 3$ depth-wise convolution, we can build the HRFormer block that improves the memory and computation efficiency significantly.

Experiments

1 Human Pose Estimation

Training setting. We study the performance of HRFormer on the COCO human pose estimation benchmark, which contains more than $200$K images and $250$K person instances labeled with $17$ keypoints. We train our model on the COCO train$2017$ dataset, including $57$K images and $150$K person instances. We evaluate our approach on the val$2017$ set and test-dev$2017$, containing $5$K images and $20$K images, respectively.

We follow most of the default training and evaluation settings of mmpose (https://github.com/open-mmlab/mmpose, Apache License 2.0) and change the optimizer from Adam to AdamW. For the training batch size, we choose $256$ for HRFormer-T and HRFormer-S and $128$ for HRFormer-B due to limited GPU memory. Each HRFormer experiment on the COCO pose estimation task uses $8\times 32$G-V$100$ GPUs.

Results. Table 3 reports the comparisons on the COCO val set. We compare HRFormer to representative convolutional methods such as HRNet and several recent transformer methods, including PRTR, TransPose-H-A$6$, and TokenPose-L/D$24$. HRFormer-B gains $0.9\%$ AP with $32\%$ fewer parameters and $19\%$ fewer FLOPs when compared to HRNet-W$48$ with an input size of $384\times 288$. Notably, our HRFormer-B already achieves $77.2\%$ without using any advanced techniques such as UDP and DARK. We believe that our HRFormer-B could achieve better results by exploiting either the UDP or the DARK scheme. We also report the comparisons on the COCO test-dev set in Table 4. Our HRFormer-B outperforms HRNet-W$48$ by around $0.7\%$ with fewer parameters and FLOPs. Figure 4 shows some example results of human pose estimation on the COCO val set.

2 Semantic Segmentation

Cityscapes. The Cityscapes dataset is for urban scene understanding. There are a total of $30$ classes, and only $19$ classes are used for parsing evaluation. The dataset contains $5$K high-quality pixel-level finely annotated images and $20$K coarsely annotated images. The finely annotated $5$K images are divided into $2,975$ train images, $500$ val images, and $1,525$ test images. We set the initial learning rate as $0.0001$, weight decay as $0.01$, crop size as $1024\times 512$, batch size as $8$, and training iterations as $80$K by default. Each HRFormer + OCR experiment on Cityscapes uses $8\times 32$G-V$100$ GPUs.

Table 5 reports the results on Cityscapes val. We choose to use HRFormer + OCR as our semantic segmentation architecture. We compare our method with several well-known Vision Transformer based methods and CNN based methods. Specifically, SETR-PUP and SETR-MLA use ViT-Large as the backbone. DPT-Hybrid uses ViT-Hybrid, which consists of a ResNet-$50$ followed by $12$ transformer layers. Both ViT-Large and ViT-Hybrid are initialized with weights pre-trained on ImageNet-$21$K, and both achieve around $85.1\%$ top-$1$ accuracy on ImageNet. DeepLabv3 and PSPNet are based on dilated ResNet-$101$ with output stride $8$. According to the fourth column of Table 5, HRFormer + OCR achieves competitive performance overall. For example, HRFormer-B + OCR achieves comparable performance with SETR-PUP while using $70\%$ fewer parameters and $50\%$ fewer FLOPs.

PASCAL-Context. The PASCAL-Context dataset is a challenging scene parsing dataset that contains $59$ semantic classes and $1$ background class. The train set and test set consist of $4,998$ and $5,105$ images, respectively. We set the initial learning rate as $0.0001$, weight decay as $0.01$, crop size as $520\times 520$, batch size as $16$, and training iterations as $60$K by default. We report the comparisons in the fifth column of Table 5. Accordingly, HRFormer-B + OCR gains $1.1\%$ and $1.5\%$ over HRNet-W$48$ + OCR and SETR-MLA, respectively, with fewer parameters and FLOPs. Notably, DPT-Hybrid achieves the best performance through additional pre-training of the models on ADE$20$K in advance. Each HRFormer + OCR experiment on PASCAL-Context uses $8\times 32$G-V$100$ GPUs.

COCO-Stuff. The COCO-Stuff dataset is a challenging scene parsing dataset that contains $171$ semantic classes. The train set and test set consist of $9$K and $1$K images, respectively. We set the initial learning rate as $0.0001$, weight decay as $0.01$, crop size as $520\times 520$, batch size as $16$, and training iterations as $60$K by default. We report the comparisons in the last column of Table 5, where HRFormer-B + OCR outperforms the previous best-performing HRNet-W$48$ + OCR by nearly $2\%$. Each HRFormer + OCR experiment on COCO-Stuff uses $8\times 32$G-V$100$ GPUs. Figure 5 shows some example results on Cityscapes, PASCAL-Context, and COCO-Stuff.

3 ImageNet Classification

Training setting. We conduct the comparisons on ImageNet-$1$K, which consists of $1.28$M train images and $50$K val images with $1000$ classes. We train all models with batch size $1024$ for $300$ epochs with the AdamW optimizer, a cosine decay learning rate schedule, weight decay of $0.05$, and a bag of augmentation policies, including rand augmentation, mixup, cutmix, and so on. HRFormer-T and HRFormer-S require $8\times 32$G-V$100$ GPUs and HRFormer-B requires $32\times 32$G-V$100$ GPUs.

Results. We compare HRFormer to some representative CNN methods and vision transformer methods in Table 6, where all methods are trained on ImageNet-$1$K only. The results of ViT-Large trained with a larger dataset such as ImageNet-$21$K are not included for fairness. According to Table 6, HRFormer achieves competitive performance. For example, HRFormer-B gains $1.0\%$ over DeiT-B while using nearly $40\%$ fewer parameters and $20\%$ fewer FLOPs.

4 Ablation Experiments

We study the influence of the $3\times 3$ depth-wise convolution within FFN based on HRFormer-T in Table 7. We observe that applying $3\times 3$ depth-wise convolution in FFN significantly improves the performance on multiple tasks, including ImageNet classification, PASCAL-Context segmentation, and COCO pose estimation. For example, HRFormer-T + FFN w/ $3\times 3$ depth-wise convolution outperforms HRFormer-T + FFN w/o $3\times 3$ depth-wise convolution by $0.65\%$, $2.9\%$, and $4.04\%$ on ImageNet, PASCAL-Context, and COCO, respectively.

Influence of shifted window scheme & $3\times 3$ depth-wise convolution within FFN based on Swin-T.

We compare our method with the shifted window scheme of Swin transformer in Table 8. For fair comparisons, we construct an Intra-Window transformer architecture (Intrawin-T) following the same architecture configurations as Swin-T, except that we do not apply the shifted window scheme. We see that applying $3\times 3$ depth-wise convolution within FFN improves both Swin-T and Intrawin-T. Surprisingly, when equipped with $3\times 3$ depth-wise convolution within FFN, Intrawin-T even outperforms Swin-T.

Shifted window scheme vs. $3\times 3$ depth-wise convolution within FFN based on HRFormer-T.

In Table 9, we compare the $3\times 3$ depth-wise convolution within FFN scheme to the shifted window scheme based on HRFormer-T. According to the results, applying $3\times 3$ depth-wise convolution within FFN significantly outperforms applying the shifted window scheme across all tasks.

Comparison to ViT, DeiT & Swin on pose estimation.

We report the COCO pose estimation results based on three well-known transformer models, ViT-Large, DeiT-B, and Swin-B, in Table 10. Notably, both ViT-Large and Swin-B‡ are pre-trained on ImageNet-$21$K in advance and then finetuned on ImageNet-$1$K, achieving $85.1\%$ and $86.4\%$ top-1 accuracy, respectively. DeiT-B is trained on ImageNet-$1$K for $1000$ epochs and achieves $85.2\%$ top-1 accuracy. We apply deconvolution modules to upsample the output representations of the encoder, following SimpleBaseline, for all three methods. The numbers of parameters and FLOPs are listed in the fourth and fifth columns of Table 10. According to the results in Table 10, our HRFormer-B achieves better performance than all three methods with fewer parameters and FLOPs.

Comparison to HRNet.

We compare our HRFormer to the convolutional HRNet with almost the same architecture configurations by replacing all the transformer blocks with the conventional basic block consisting of two $3\times 3$ convolutions. Table 11 shows the comparison results on ImageNet, PASCAL-Context, and COCO. We observe that HRFormer significantly outperforms HRNet under various configurations with much lower model and computation complexity. For example, HRFormer-T outperforms HRNet-T by $2.0\%$, $1.5\%$, and $1.6\%$ on the three tasks, respectively, while requiring only around $50\%$ of the parameters and FLOPs. In summary, HRFormer achieves better performance by exploiting the benefits of transformers such as content-dependent dynamic interactions.

Conclusion

In this work, we present the High-Resolution Transformer (HRFormer), a simple yet effective transformer architecture for dense prediction tasks, including pose estimation and semantic segmentation. The key insight is to integrate the HRFormer block, which combines local-window self-attention and FFN with depth-wise convolution to improve the memory and computation efficiency, with the multi-resolution parallel design of the convolutional HRNet. Besides, HRFormer also benefits from adopting convolution in the early stages and mixing short-range and long-range attention with the multi-scale fusion scheme. We empirically verify the effectiveness of our HRFormer on both pose estimation and semantic segmentation tasks.

Appendix

We present additional visualizations of the example results of our method on both pose estimation and semantic segmentation tasks.

Figure 6 shows more pose estimation results of HRFormer-B on COCO val. Figure 7 shows more semantic segmentation results on Cityscapes val, PASCAL-Context test and COCO-Stuff test.

Ablation of window sizes.

We report the results with different window sizes at different resolutions on semantic segmentation tasks and we will add more results if necessary. We use $\left(W_{1},W_{2},W_{3},W_{4}\right)$ to represent the window sizes associated with the feature maps of different resolutions, with strides $4$, $8$, $16$, and $32$. We choose larger window sizes for higher-resolution branches; thus, we have $W_{1}>W_{2}>W_{3}>W_{4}$. According to these results, applying larger windows improves the performance, and applying different window sizes at different resolutions makes no big difference.