Deep High-Resolution Representation Learning for Visual Recognition

Jingdong Wang, Ke Sun, Tianheng Cheng, Borui Jiang, Chaorui Deng, Yang Zhao, Dong Liu, Yadong Mu, Mingkui Tan, Xinggang Wang, Wenyu Liu, Bin Xiao

Introduction

Deep convolutional neural networks (DCNNs) have achieved state-of-the-art results in many computer vision tasks, such as image classification, object detection, semantic segmentation, human pose estimation, and so on. The strength is that DCNNs are able to learn richer representations than conventional hand-crafted representations.

Most recently developed classification networks, including AlexNet, VGGNet, GoogleNet, ResNet, etc., follow the design rule of LeNet-5. The rule is depicted in Figure 1 (a): gradually reduce the spatial size of the feature maps, connect the convolutions from high resolution to low resolution in series, and produce a low-resolution representation, which is further processed for classification.

High-resolution representations are needed for position-sensitive tasks, e.g., semantic segmentation, human pose estimation, and object detection. The previous state-of-the-art methods adopt a high-resolution recovery process to raise the representation resolution from the low-resolution representation output by a classification or classification-like network, as depicted in Figure 1 (b), e.g., Hourglass, SegNet, DeconvNet, U-Net, SimpleBaseline, and encoder-decoder. In addition, dilated convolutions are used to remove some downsample layers and thus yield medium-resolution representations.

We present a novel architecture, namely High-Resolution Net (HRNet), which is able to maintain high-resolution representations through the whole process. We start from a high-resolution convolution stream, gradually add high-to-low resolution convolution streams one by one, and connect the multi-resolution streams in parallel. The resulting network consists of several (4 in this paper) stages as depicted in Figure 2, and the $n$th stage contains $n$ streams corresponding to $n$ resolutions. We conduct repeated multi-resolution fusions by exchanging the information across the parallel streams over and over.

The high-resolution representations learned from HRNet are not only semantically strong but also spatially precise. This comes from two aspects. (i) Our approach connects high-to-low resolution convolution streams in parallel rather than in series. Thus, our approach is able to maintain the high resolution instead of recovering high resolution from low resolution, and accordingly the learned representation is potentially spatially more precise. (ii) Most existing fusion schemes aggregate high-resolution low-level and high-level representations obtained by upsampling low-resolution representations. Instead, we repeat multi-resolution fusions to boost the high-resolution representations with the help of the low-resolution representations, and vice versa. As a result, all the high-to-low resolution representations are semantically strong.

We present two versions of HRNet. The first one, named HRNetV1, only outputs the high-resolution representation computed from the high-resolution convolution stream. We apply it to human pose estimation by following the heatmap estimation framework. We empirically demonstrate the superior pose estimation performance on the COCO keypoint detection dataset.

The other one, named HRNetV2, combines the representations from all the high-to-low resolution parallel streams. We apply it to semantic segmentation through estimating segmentation maps from the combined high-resolution representation. The proposed approach achieves state-of-the-art results on PASCAL-Context, Cityscapes, and LIP with similar model sizes and lower computation complexity. We observe similar performance for HRNetV1 and HRNetV2 on COCO pose estimation, and the superiority of HRNetV2 over HRNetV1 in semantic segmentation.

In addition, we construct a multi-level representation, named HRNetV2p, from the high-resolution representation output by HRNetV2, and apply it to state-of-the-art detection frameworks, including Faster R-CNN, Cascade R-CNN, FCOS, and CenterNet, and state-of-the-art joint detection and instance segmentation frameworks, including Mask R-CNN, Cascade Mask R-CNN, and Hybrid Task Cascade. The results show that our method improves detection performance, with a particularly large improvement for small objects.

Related Work

We review closely related representation learning techniques developed mainly for human pose estimation, semantic segmentation and object detection, from three aspects: low-resolution representation learning, high-resolution representation recovering, and high-resolution representation maintaining. Besides, we mention some works related to multi-scale fusion.

Learning low-resolution representations. The fully-convolutional network approaches compute low-resolution representations by removing the fully-connected layers in a classification network, and estimate their coarse segmentation maps. The estimated segmentation maps are improved by combining the fine segmentation score maps estimated from intermediate low-level medium-resolution representations, or by iterating the processes. Similar techniques have also been applied to edge detection, e.g., holistic edge detection.

The fully convolutional network is extended, by replacing a few (typically two) strided convolutions and the associated convolutions with dilated convolutions, to the dilation version, leading to medium-resolution representations. The representations are further augmented to multi-scale contextual representations through feature pyramids for segmenting objects at multiple scales.

Recovering high-resolution representations. An upsample process can be used to gradually recover the high-resolution representations from the low-resolution representations. The upsample subnetwork could be a symmetric version of the downsample process (e.g., VGGNet), with skip connections over some mirrored layers to transfer the pooling indices, e.g., SegNet and DeconvNet, or to copy the feature maps, e.g., U-Net, Hourglass, encoder-decoder, and so on. An extension of U-Net, the full-resolution residual network, introduces an extra full-resolution stream that carries information at the full image resolution to replace the skip connections, and each unit in the downsample and upsample subnetworks receives information from and sends information to the full-resolution stream.

The asymmetric upsample process is also widely studied. RefineNet improves the combination of upsampled representations and the representations of the same resolution copied from the downsample process. Other works include: a light upsample process, possibly with dilated convolutions used in the backbone; light downsample and heavy upsample processes; recombinator networks; improving skip connections with more or more complicated convolutional units, as well as sending information from low-resolution skip connections to high-resolution skip connections or exchanging information between them; studying the details of the upsample process; combining multi-scale pyramid representations; and stacking multiple DeconvNets/U-Nets/Hourglasses with dense connections.

Maintaining high-resolution representations. Our work is closely related to several works that can also generate high-resolution representations, e.g., convolutional neural fabrics, interlinked CNNs, GridNet, and multi-scale DenseNet.

The two early works, convolutional neural fabrics and interlinked CNNs, lack careful design on when to start low-resolution parallel streams and on how and where to exchange information across parallel streams, and do not use batch normalization and residual connections, and thus do not show satisfactory performance. GridNet is like a combination of multiple U-Nets and includes two symmetric information exchange stages: the first stage passes information only from high resolution to low resolution, and the second stage passes information only from low resolution to high resolution. This limits its segmentation quality. Multi-scale DenseNet is not able to learn strong high-resolution representations as there is no information received from low-resolution representations.

Multi-scale fusion. Multi-scale fusion is widely studied. (In this paper, multi-scale fusion and multi-resolution fusion are used interchangeably, although in other contexts they may not be interchangeable.) The straightforward way is to feed multi-resolution images separately into multiple networks and aggregate the output response maps. Hourglass, U-Net, and SegNet combine low-level features in the high-to-low downsample process into the same-resolution high-level features in the low-to-high upsample process progressively through skip connections. PSPNet and DeepLabV2/3 fuse the pyramid features obtained by the pyramid pooling module and atrous spatial pyramid pooling. Our multi-scale (resolution) fusion module resembles the two pooling modules. The differences include: (1) our fusion outputs four-resolution representations rather than only one, and (2) our fusion modules are repeated several times, which is inspired by deep fusion.

Our approach. Our network connects high-to-low convolution streams in parallel. It maintains high-resolution representations through the whole process, and generates reliable high-resolution representations with strong position sensitivity through repeatedly fusing the representations from multi-resolution streams.

This paper is a substantial extension of our previous conference paper, with additional material from our unpublished technical report as well as more object detection results under recently developed state-of-the-art object detection and instance segmentation frameworks. The main technical novelties compared with the conference version are threefold. (1) We extend the network proposed there (named HRNetV1) to two versions, HRNetV2 and HRNetV2p, which explore all the four-resolution representations. (2) We build the connection between multi-resolution fusion and regular convolution, which provides evidence for the necessity of exploring all the four-resolution representations in HRNetV2 and HRNetV2p. (3) We show the superiority of HRNetV2 and HRNetV2p over HRNetV1 and present the applications of HRNetV2 and HRNetV2p in a broad range of vision problems, including semantic segmentation and object detection.

High-Resolution Networks

We input the image into a stem, which consists of two stride-2 $3\times 3$ convolutions decreasing the resolution to $\frac{1}{4}$, and subsequently the main body that outputs the representation with the same resolution ($\frac{1}{4}$). The main body, illustrated in Figure 2 and detailed below, consists of several components: parallel multi-resolution convolutions, repeated multi-resolution fusions, and a representation head that is shown in Figure 4.

1 Parallel Multi-Resolution Convolutions

We start from a high-resolution convolution stream as the first stage, gradually add high-to-low resolution streams one by one, forming new stages, and connect the multi-resolution streams in parallel. As a result, the resolutions for the parallel streams of a later stage consist of the resolutions from the previous stage and an extra lower one.

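An example network structure illustrated in Figure 2, containing 4 parallel streams, is logically as follows,

$$
\begin{array}{llll}
\mathcal{N}_{11} & \rightarrow\ \mathcal{N}_{21} & \rightarrow\ \mathcal{N}_{31} & \rightarrow\ \mathcal{N}_{41} \\
 & \searrow\ \mathcal{N}_{22} & \rightarrow\ \mathcal{N}_{32} & \rightarrow\ \mathcal{N}_{42} \\
 & & \searrow\ \mathcal{N}_{33} & \rightarrow\ \mathcal{N}_{43} \\
 & & & \searrow\ \mathcal{N}_{44}
\end{array}
$$

where $\mathcal{N}_{sr}$ is a sub-stream in the $s$th stage and $r$ is the resolution index. The resolution index of the first stream is $r=1$. The resolution of index $r$ is $\frac{1}{2^{r-1}}$ of the resolution of the first stream.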
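The stage and resolution indexing can be sketched in a few lines of Python. This is only an illustration of the indexing rule, assuming that stage $s$ holds streams with resolution indices $1,\dots,s$; the function name `stream_resolutions` is hypothetical and not part of any released code.

```python
from fractions import Fraction


def stream_resolutions(num_stages=4):
    """Map each stage s to the relative resolutions of its s parallel streams.

    Assumption: stage s keeps the streams of stage s-1 and adds one stream
    at half the lowest resolution, so resolution index r has relative
    resolution 1 / 2^(r-1) with respect to the first stream.
    """
    stages = {}
    for s in range(1, num_stages + 1):
        stages[s] = [Fraction(1, 2 ** (r - 1)) for r in range(1, s + 1)]
    return stages


layout = stream_resolutions(4)
for s, res in layout.items():
    print(f"stage {s}: " + ", ".join(str(x) for x in res))
```

Since the first stream itself runs at $\frac{1}{4}$ of the input image, the absolute resolutions are these values multiplied by $\frac{1}{4}$.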

2 Repeated Multi-Resolution Fusions

The goal of the fusion module is to exchange the information across multi-resolution representations. It is repeated several times (e.g., every 4 residual units).

Let us look at an example of fusing 3-resolution representations, which is illustrated in Figure 3. Fusing 2 representations and 4 representations can be easily derived. The input consists of three representations: $\{\mathbf{R}_{r}^{i},r=1,2,3\}$, where $r$ is the resolution index, and the associated output representations are $\{\mathbf{R}_{r}^{o},r=1,2,3\}$. Each output representation is the sum of the transformed representations of the three inputs: $\mathbf{R}_{r}^{o}=f_{1r}(\mathbf{R}_{1}^{i})+f_{2r}(\mathbf{R}_{2}^{i})+f_{3r}(\mathbf{R}_{3}^{i})$. The fusion across stages (from stage 3 to stage 4) has an extra output: $\mathbf{R}_{4}^{o}=f_{14}(\mathbf{R}_{1}^{i})+f_{24}(\mathbf{R}_{2}^{i})+f_{34}(\mathbf{R}_{3}^{i})$.

The choice of the transform function $f_{xr}(\cdot)$ depends on the input resolution index $x$ and the output resolution index $r$. If $x=r$, $f_{xr}(\mathbf{R})=\mathbf{R}$. If $x<r$, $f_{xr}(\mathbf{R})$ downsamples the input representation $\mathbf{R}$ through $(r-x)$ stride-2 $3\times 3$ convolutions: for instance, one stride-2 $3\times 3$ convolution for $2\times$ downsampling, and two consecutive stride-2 $3\times 3$ convolutions for $4\times$ downsampling. If $x>r$, $f_{xr}(\mathbf{R})$ upsamples the input representation $\mathbf{R}$ through bilinear upsampling followed by a $1\times 1$ convolution for aligning the number of channels. The functions are depicted in Figure 3.
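The case analysis for the transform function can be summarized as a small lookup. The sketch below only names the operation for each pair of input and output resolution indices; it assumes that the number of stride-2 convolutions equals the gap between the output and input indices and that the upsampling factor is 2 raised to that gap. The function names and the string labels are hypothetical.

```python
def fusion_transform(x, r):
    """Describe f_xr for input resolution index x and output index r."""
    if x == r:
        # Same resolution: the representation is passed through unchanged.
        return ("identity", 0)
    if x < r:
        # Higher to lower resolution: (r - x) consecutive stride-2 3x3 convs.
        return ("stride2_conv3x3", r - x)
    # Lower to higher resolution: bilinear upsampling by 2^(x - r),
    # followed by a 1x1 convolution that aligns the channel count.
    return ("bilinear_up_then_conv1x1", 2 ** (x - r))


def fusion_plan(num_inputs, num_outputs):
    """For every output index r, list the transform applied to each input."""
    return {
        r: [fusion_transform(x, r) for x in range(1, num_inputs + 1)]
        for r in range(1, num_outputs + 1)
    }


# Fusion from stage 3 to stage 4: three inputs, four outputs.
plan = fusion_plan(3, 4)
for r, ops in plan.items():
    print(r, ops)
```

Each output is then the sum of the three transformed inputs listed for its index.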

3 Representation Head

We have three kinds of representation heads that are illustrated in Figure 4, and call them HRNetV1, HRNetV2, and HRNetV2p, respectively.

HRNetV1. The output is the representation only from the high-resolution stream. The other three representations are ignored. This is illustrated in Figure 4 (a).

HRNetV2. We rescale the low-resolution representations to the high resolution through bilinear upsampling without changing the number of channels, and concatenate the four representations, followed by a $1\times 1$ convolution to mix the four representations. This is illustrated in Figure 4 (b).

HRNetV2p. We construct multi-level representations by downsampling the high-resolution representation output from HRNetV2 to multiple levels. This is depicted in Figure 4 (c).

In this paper, we will show the results of applying HRNetV1 to human pose estimation, HRNetV2 to semantic segmentation, and HRNetV2p to object detection.
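To make the multi-level construction concrete, the following sketch builds a pyramid from a single-channel map by repeated $2\times 2$ average pooling. Average pooling is an assumption made here for illustration (the text above only says "downsampling"), the default of five levels is likewise an illustrative choice, and the helper names are hypothetical.

```python
def avg_pool2x2(fmap):
    """Downsample a 2D map (list of rows) by 2 with 2x2 average pooling."""
    h, w = len(fmap), len(fmap[0])
    return [
        [
            (fmap[i][j] + fmap[i][j + 1] + fmap[i + 1][j] + fmap[i + 1][j + 1]) / 4.0
            for j in range(0, w - 1, 2)
        ]
        for i in range(0, h - 1, 2)
    ]


def build_pyramid(fmap, num_levels=5):
    """Return [level0, level1, ...]; each level halves the previous one."""
    levels = [fmap]
    for _ in range(num_levels - 1):
        levels.append(avg_pool2x2(levels[-1]))
    return levels


# Toy 16x16 single-channel "high-resolution representation".
base = [[float(i * 16 + j) for j in range(16)] for i in range(16)]
pyramid = build_pyramid(base, num_levels=5)
sizes = [(len(level), len(level[0])) for level in pyramid]
print(sizes)
```

In a real detector each level would hold a multi-channel tensor and would typically be followed by a convolution; the sketch only shows how the spatial sizes shrink from level to level.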

4 Instantiation

The main body contains four stages with four parallel convolution streams. The resolutions are $1/4$, $1/8$, $1/16$, and $1/32$. The first stage contains 4 residual units, where each unit is formed by a bottleneck with the width 64, and is followed by one $3\times 3$ convolution changing the width of feature maps to $C$. The 2nd, 3rd, and 4th stages contain 1, 4, and 3 modularized blocks, respectively. Each branch in the multi-resolution parallel convolution of the modularized block contains 4 residual units. Each unit contains two $3\times 3$ convolutions for each resolution, where each convolution is followed by batch normalization and the nonlinear activation ReLU. The widths (numbers of channels) of the convolutions of the four resolutions are $C$, $2C$, $4C$, and $8C$, respectively. An example is depicted in Figure 2.
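As a quick sanity check of this instantiation, the sketch below lists the channel count and spatial size of each of the four streams for a given input size and width $C$, and sums the channels that a concatenating head would see. It assumes input sizes divisible by 32; `stream_shapes` is a hypothetical helper, not a function from any released implementation.

```python
def stream_shapes(height, width, C):
    """Return (channels, h, w) for the four parallel streams.

    Assumes the input height and width are divisible by 32, so that the
    integer divisions below are exact.
    """
    shapes = []
    for r in range(4):
        stride = 4 * 2 ** r    # 4, 8, 16, 32 relative to the input image
        channels = C * 2 ** r  # C, 2C, 4C, 8C
        shapes.append((channels, height // stride, width // stride))
    return shapes


shapes = stream_shapes(256, 192, C=32)
# Concatenating all four streams at the highest resolution gives
# C + 2C + 4C + 8C = 15C channels.
concat_channels = sum(c for c, _, _ in shapes)
print(shapes, concat_channels)
```

For $C=32$ the concatenated width is $15C=480$.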

5 Analysis

We analyze the modularized block that is divided into two components: multi-resolution parallel convolutions (Figure 5 (a)), and multi-resolution fusion (Figure 5 (b)). The multi-resolution parallel convolution resembles the group convolution. It divides the input channels into several subsets of channels and performs a regular convolution over each subset over different spatial resolutions separately, while in the group convolution, the resolutions are the same. This connection implies that the multi-resolution parallel convolution enjoys some benefit of the group convolution.

The multi-resolution fusion unit resembles the multi-branch full-connection form of the regular convolution, illustrated in Figure 5 (c). A regular convolution can be divided into multiple small convolutions. The input channels are divided into several subsets, and the output channels are also divided into several subsets. The input and output subsets are connected in a fully-connected fashion, and each connection is a regular convolution. Each subset of output channels is a summation of the outputs of the convolutions over each subset of input channels. The difference is that our multi-resolution fusion needs to handle the resolution change. The connection between multi-resolution fusion and regular convolution provides evidence for exploring all the four-resolution representations, as done in HRNetV2 and HRNetV2p.
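The decomposition of a regular convolution into fully connected branches can be checked numerically in the simplest setting, a $1\times 1$ convolution at a single spatial position, which is just a matrix-vector product. The sketch below is an illustration under that simplification; the weights, the channel splits, and the function names are arbitrary choices made here.

```python
def matvec(W, x):
    """1x1 convolution at one position: output[o] = sum_i W[o][i] * x[i]."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def multi_branch(W, x, in_splits, out_splits):
    """Same product, computed as small convolutions between channel subsets.

    in_splits and out_splits are lists of (start, stop) channel ranges.
    Each output subset is the sum of the small products over all input subsets.
    """
    y = []
    for o0, o1 in out_splits:
        acc = [0.0] * (o1 - o0)
        for i0, i1 in in_splits:
            block = [row[i0:i1] for row in W[o0:o1]]
            part = matvec(block, x[i0:i1])
            acc = [a + p for a, p in zip(acc, part)]
        y.extend(acc)
    return y


# 4 output channels, 6 input channels, arbitrary small weights.
W = [[float((3 * o + i) % 5 - 2) for i in range(6)] for o in range(4)]
x = [1.0, -2.0, 0.5, 3.0, -1.0, 2.0]

full = matvec(W, x)
branched = multi_branch(W, x, in_splits=[(0, 2), (2, 6)], out_splits=[(0, 1), (1, 4)])
print(full, branched)
```

Both computations give the same output, which is the sense in which fusion with per-branch transforms mirrors a regular convolution, apart from the resolution change.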

Human Pose Estimation

Human pose estimation, a.k.a. keypoint detection, aims to detect the locations of $K$ keypoints or parts (e.g., elbow, wrist, etc.) from an image $\mathbf{I}$ of size $W\times H\times 3$. We follow the state-of-the-art framework and transform this problem to estimating $K$ heatmaps of size $\frac{W}{4}\times\frac{H}{4}$, $\{\mathbf{H}_{1},\mathbf{H}_{2},\dots,\mathbf{H}_{K}\}$, where each heatmap $\mathbf{H}_{k}$ indicates the location confidence of the $k$th keypoint.

We regress the heatmaps over the high-resolution representations output by HRNetV1. We empirically observe that the performance is almost the same for HRNetV1 and HRNetV2, and thus we choose HRNetV1 as its computation complexity is a little lower. The loss function, defined as the mean squared error, is applied for comparing the predicted heatmaps and the groundtruth heatmaps. The groundtruth heatmaps are generated by applying a 2D Gaussian with a standard deviation of 2 pixels centered on the groundtruth location of each keypoint. Some example results are given in Figure 6.
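A minimal sketch of the heatmap target and the loss is given below. It assumes an unnormalized Gaussian whose peak value is 1 and keypoint coordinates already expressed in heatmap pixels; both are assumptions for illustration, and the function names are hypothetical.

```python
import math


def gaussian_heatmap(width, height, cx, cy, sigma=2.0):
    """Ground-truth heatmap: a 2D Gaussian centered on the keypoint (cx, cy).

    Assumption: the Gaussian is unnormalized, so its peak value is 1.
    """
    return [
        [
            math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2.0 * sigma ** 2))
            for x in range(width)
        ]
        for y in range(height)
    ]


def mse(pred, target):
    """Mean squared error between two heatmaps of the same size."""
    n = len(pred) * len(pred[0])
    total = sum(
        (p - t) ** 2
        for pred_row, target_row in zip(pred, target)
        for p, t in zip(pred_row, target_row)
    )
    return total / n


# A 256x192 (height x width) input gives a 64x48 heatmap at 1/4 resolution.
target = gaussian_heatmap(width=48, height=64, cx=20, cy=30)
blank = [[0.0] * 48 for _ in range(64)]
print(mse(blank, target))
```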

Dataset. The COCO dataset contains over 200,000 images and 250,000 person instances labeled with 17 keypoints. We train our model on the COCO train2017 set, including 57K images and 150K person instances. We evaluate our approach on the val2017 and test-dev2017 sets, containing 5000 images and 20K images, respectively.

Evaluation metric. The standard evaluation metric is based on Object Keypoint Similarity (OKS): $\operatorname{OKS}=\frac{\sum_{i}\exp(-d_{i}^{2}/2s^{2}k_{i}^{2})\delta(v_{i}>0)}{\sum_{i}\delta(v_{i}>0)}$. Here $d_{i}$ is the Euclidean distance between the detected keypoint and the corresponding ground truth, $v_{i}$ is the visibility flag of the ground truth, $s$ is the object scale, and $k_{i}$ is a per-keypoint constant that controls falloff. We report standard average precision and recall scores (http://cocodataset.org/#keypoints-eval): $\operatorname{AP}^{50}$ ($\operatorname{AP}$ at $\operatorname{OKS}=0.50$), $\operatorname{AP}^{75}$, $\operatorname{AP}$ (the mean of $\operatorname{AP}$ scores at 10 $\operatorname{OKS}$ positions, $0.50,0.55,\dots,0.90,0.95$); $\operatorname{AP}^{M}$ for medium objects, $\operatorname{AP}^{L}$ for large objects, and $\operatorname{AR}$ (the mean of $\operatorname{AR}$ scores at 10 $\operatorname{OKS}$ positions, $0.50,0.55,\dots,0.90,0.95$).
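The OKS definition translates directly into code. The sketch below follows the formula term by term for one person instance; the function name, the argument layout, and the choice to return 0 when no keypoint is labeled are assumptions made here, and the official COCO evaluation code should be consulted for the exact constants and conventions.

```python
import math


def oks(pred, gt, visibility, scale, k):
    """Object Keypoint Similarity for one instance.

    pred, gt: lists of (x, y) keypoints; visibility: list of v_i flags;
    scale: object scale s; k: per-keypoint falloff constants k_i.
    Only keypoints with v_i > 0 contribute, as in the definition.
    """
    num, den = 0.0, 0
    for (px, py), (gx, gy), v, ki in zip(pred, gt, visibility, k):
        if v > 0:
            d2 = (px - gx) ** 2 + (py - gy) ** 2
            num += math.exp(-d2 / (2.0 * scale ** 2 * ki ** 2))
            den += 1
    # Assumption: an instance without labeled keypoints scores 0.
    return num / den if den > 0 else 0.0


score = oks(
    pred=[(0.0, 0.0), (3.0, 4.0)],
    gt=[(0.0, 0.0), (0.0, 0.0)],
    visibility=[1, 1],
    scale=5.0,
    k=[1.0, 1.0],
)
print(score)
```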

Training. We extend the human detection box in height or width to a fixed aspect ratio, $\operatorname{height}:\operatorname{width}=4:3$, and then crop the box from the image, which is resized to a fixed size, $256\times 192$ or $384\times 288$. The data augmentation scheme includes random rotation, random scale ($[0.65,1.35]$), and flipping. Half-body data augmentation is also used, following prior work.
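The box extension step can be sketched as follows. The sketch assumes boxes in (x, y, width, height) form and grows the shorter side around the box center until the ratio is 4:3; the centering convention and the function name are assumptions, not details stated above.

```python
def fit_box_to_aspect(x, y, w, h, aspect_h=4.0, aspect_w=3.0):
    """Grow (never shrink) a box around its center to height:width = 4:3.

    The box is (x, y, w, h) with (x, y) the top-left corner.
    """
    cx, cy = x + w / 2.0, y + h / 2.0
    target = aspect_w / aspect_h  # desired width / height
    if w > target * h:
        # Box is too wide for its height: extend the height.
        h = w / target
    else:
        # Box is too tall for its width: extend the width.
        w = h * target
    return cx - w / 2.0, cy - h / 2.0, w, h


tall = fit_box_to_aspect(0.0, 0.0, 30.0, 80.0)
wide = fit_box_to_aspect(0.0, 0.0, 90.0, 40.0)
print(tall, wide)
```

The extended box is then cropped and resized to the fixed network input size.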

Testing. A two-stage top-down paradigm is used: detect the person instances using a person detector, and then predict the keypoints for each detection.

We use the same person detectors provided by SimpleBaseline (https://github.com/Microsoft/human-pose-estimation.pytorch) for both the val and test-dev sets. Following common practice, we compute the heatmap by averaging the heatmaps of the original and flipped images. Each keypoint location is predicted by adjusting the location of the highest heat value with a quarter offset in the direction from the highest response to the second highest response.
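The two test-time steps can be sketched for a single heatmap as below. The quarter offset is implemented here by comparing the two neighbors of the peak along each axis, which is one common way to approximate "toward the second highest response"; this choice, the handling of border peaks, and the function names are assumptions. The sketch also omits the swapping of left and right joints that flipping requires in a full pipeline.

```python
def average_with_flip(heatmap, flipped_heatmap):
    """Average a heatmap with the one predicted from the mirrored image.

    The second heatmap is mirrored back horizontally before averaging.
    """
    return [
        [(a + b) / 2.0 for a, b in zip(row, reversed(flipped_row))]
        for row, flipped_row in zip(heatmap, flipped_heatmap)
    ]


def sign(v):
    return (v > 0) - (v < 0)


def decode_keypoint(heatmap):
    """Peak location refined by a quarter-pixel shift toward the larger neighbor."""
    h, w = len(heatmap), len(heatmap[0])
    _, px, py = max((heatmap[y][x], x, y) for y in range(h) for x in range(w))
    x, y = float(px), float(py)
    if 0 < px < w - 1:
        x += 0.25 * sign(heatmap[py][px + 1] - heatmap[py][px - 1])
    if 0 < py < h - 1:
        y += 0.25 * sign(heatmap[py + 1][px] - heatmap[py - 1][px])
    return x, y


hm = [[0.0] * 5 for _ in range(5)]
hm[2][2] = 1.0   # highest response at (x=2, y=2)
hm[2][3] = 0.5   # second highest response to its right
print(decode_keypoint(hm))
```

The decoded coordinates are in heatmap pixels and still need to be mapped back to the original image through the crop and resize transform.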

Results on the val set. We report the results of our method and other state-of-the-art methods in Table I. The network HRNetV1-W32, trained from scratch with the input size $256\times 192$, achieves an AP score of 73.4, outperforming other methods with the same input size. (i) Compared to Hourglass, our network improves AP by 6.5 points, and the GFLOPs of our network are less than half, while the numbers of parameters are similar and ours is slightly larger. (ii) Compared to CPN w/o and w/ OHKM, our network, with slightly larger model size and slightly higher complexity, achieves 4.8 and 4.0 points gain, respectively. (iii) Compared to the previous best-performing method, SimpleBaseline, our HRNetV1-W32 obtains significant improvements: 3.0 points gain over the backbone ResNet-50 with a similar model size and GFLOPs, and 1.4 points gain over the backbone ResNet-152, whose model size (#Params) and GFLOPs are twice as large as ours.

Our network can benefit from (i) training from the model pretrained on ImageNet: the gain is 1.0 points for HRNetV1-W32; (ii) increasing the capacity by increasing the width: HRNetV1-W48 gets 0.7 and 0.5 points gain for the input sizes $256\times 192$ and $384\times 288$, respectively.

Considering the input size $384\times 288$, our HRNetV1-W32 and HRNetV1-W48 get 75.8 and 76.3 AP, which are 1.4 and 1.2 points higher than with the input size $256\times 192$. In comparison to SimpleBaseline that uses ResNet-152 as the backbone, our HRNetV1-W32 and HRNetV1-W48 attain 1.5 and 2.0 points gain in terms of AP at 45% and 92.4% computational cost, respectively.

Results on the test-dev set. Table II reports the pose estimation performances of our approach and the existing state-of-the-art approaches. Our approach is significantly better than bottom-up approaches. On the other hand, our small network, HRNetV1-W32, achieves an AP of 74.9. It outperforms all the other top-down approaches, and is more efficient in terms of model size (#Params) and computation complexity (GFLOPs). Our big model, HRNetV1-W48, achieves the highest AP score, 75.5. Compared to SimpleBaseline with the same input size, our small and big networks receive 1.2 and 1.8 points improvement, respectively. With the additional data from AI Challenger for training, our single big network can obtain an AP of 77.0.

Semantic Segmentation

Semantic segmentation is the problem of assigning a class label to each pixel. Some example results by our approach are given in Figure 7. We feed the input image to HRNetV2 (Figure 4 (b)) and then pass the resulting $15C$-dimensional representation at each position to a linear classifier with the softmax loss to predict the segmentation maps. The segmentation maps are upsampled (4 times) to the input size by bilinear upsampling for both training and testing. We report the results over two scene parsing datasets, PASCAL-Context and Cityscapes, and a human parsing dataset, LIP. The mean of class-wise intersection over union (mIoU) is adopted as the evaluation metric.
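For reference, the mIoU metric can be sketched on flattened label maps as follows. The sketch assumes integer class labels, no "ignore" label, and that classes absent from both prediction and ground truth are skipped when averaging; benchmark toolkits differ in these conventions, and the function name is hypothetical.

```python
def mean_iou(pred, gt, num_classes):
    """Mean of class-wise intersection over union for flat label lists."""
    inter = [0] * num_classes
    pred_count = [0] * num_classes
    gt_count = [0] * num_classes
    for p, g in zip(pred, gt):
        pred_count[p] += 1
        gt_count[g] += 1
        if p == g:
            inter[p] += 1
    ious = []
    for c in range(num_classes):
        union = pred_count[c] + gt_count[c] - inter[c]
        if union > 0:  # assumption: skip classes that never occur
            ious.append(inter[c] / union)
    return sum(ious) / len(ious) if ious else 0.0


miou = mean_iou(pred=[0, 0, 1, 1], gt=[0, 1, 1, 1], num_classes=2)
print(miou)
```

In the toy example, class 0 has IoU $1/2$ and class 1 has IoU $2/3$, so the mean is $7/12$.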

Cityscapes. The Cityscapes dataset contains 5,000 high quality pixel-level finely annotated scene images. The finely annotated images are divided into 2,975/500/1,525 images for training, validation and testing. There are 30 classes, and 19 classes among them are used for evaluation. In addition to the mean of class-wise intersection over union (mIoU), we report three other scores on the test set: IoU category (cat.), iIoU class (cla.) and iIoU category (cat.).

We follow the same training protocol as prior work. The data are augmented by random cropping (from $1024\times 2048$ to $512\times 1024$), random scaling in the range of $[0.5,2]$, and random horizontal flipping. We use the SGD optimizer with the base learning rate of 0.01, the momentum of 0.9 and the weight decay of 0.0005. The poly learning rate policy with the power of 0.9 is used for dropping the learning rate. All the models are trained for 120K iterations with the batch size of 12 on 4 GPUs with syncBN.
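The poly learning rate policy is commonly written as $\mathrm{lr}=\mathrm{lr}_{\text{base}}\,(1-\frac{t}{T})^{p}$, with $t$ the current iteration, $T$ the total number of iterations, and $p$ the power. The sketch below assumes this common form with the Cityscapes values quoted above; the function name is hypothetical.

```python
def poly_lr(base_lr, iteration, max_iters, power=0.9):
    """Poly learning rate schedule: base_lr * (1 - t / T) ** power."""
    return base_lr * (1.0 - iteration / max_iters) ** power


# Values from the Cityscapes setting: base lr 0.01, power 0.9, 120K iterations.
schedule = [poly_lr(0.01, t, 120_000) for t in (0, 60_000, 120_000)]
print(schedule)
```

The rate starts at the base value, decays almost linearly, and reaches zero at the final iteration.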

Table III provides the comparison with several representative methods on the Cityscapes val set in terms of parameter and computation complexity and mIoU class. (i) HRNetV2-W40 (40 indicates the width of the high-resolution convolution), with similar model size to DeepLabv3+ and much lower computation complexity, gets better performance: 4.7 points gain over UNet++, 1.7 points gain over DeepLabv3 and about 0.5 points gain over PSPNet and DeepLabv3+. (ii) HRNetV2-W48, with similar model size to PSPNet and much lower computation complexity, achieves a more significant improvement: 5.6 points gain over UNet++, 2.6 points gain over DeepLabv3 and about 1.4 points gain over PSPNet and DeepLabv3+. In the following comparisons, we adopt HRNetV2-W48, which is pretrained on ImageNet and has a similar model size to most Dilated-ResNet-101 based methods.

Table IV provides the comparison of our method with state-of-the-art methods on the Cityscapes test set. All the results are with six scales and flipping. Two cases w/o using coarse data are evaluated: one is the model learned on the train set, and the other is the model learned on the train+val set. In both cases, HRNetV2-W48 achieves superior performance.

PASCAL-Context. The PASCAL-Context dataset includes 4,998 scene images for training and 5,105 images for testing with 59 semantic labels and 1 background label.

The data augmentation and learning rate policy are the same as for Cityscapes. Following the widely used training strategy, we resize the images to $480\times 480$ and set the initial learning rate to 0.004 and weight decay to 0.0001. The batch size is 16 and the number of iterations is 60K.

We follow the standard testing procedure. The image is resized to $480\times 480$ and then fed into our network. The resulting $480\times 480$ label maps are then resized to the original image size. We evaluate the performance of our approach and other approaches using six scales and flipping.

Table V provides the comparison of our method with state-of-the-art methods. There are two kinds of evaluation schemes: mIoU over 59 classes and 60 classes (59 classes + background). In both cases, HRNetV2-W48 achieves state-of-the-art results, except that one previously reported result is higher than ours when we do not use the OCR scheme.

LIP. The LIP dataset contains 50,462 elaborately annotated human images, which are divided into 30,462 training images and 10,000 validation images. The methods are evaluated on 20 categories (19 human part labels and 1 background label). Following the standard training and testing settings, the images are resized to $473\times 473$ and the performance is evaluated on the average of the segmentation maps of the original and flipped images.

The data augmentation and learning rate policy are the same as for Cityscapes. The training strategy follows the recent setting. We set the initial learning rate to 0.007, the momentum to 0.9, and the weight decay to 0.0005. The batch size is 40 and the number of iterations is 110K.

Table VI provides the comparison of our method with state-of-the-art methods. HRNetV2-W48 achieves the best overall performance with fewer parameters and lighter computation cost. We also note that our networks do not use extra information such as pose or edge.

COCO Object Detection

We perform the evaluation on the MS COCO 2017 detection dataset, which contains about 118k images for training, 5k for validation (val) and $\sim 20$k for testing without provided annotations (test-dev). The standard COCO-style evaluation is adopted. Some example results by our approach are given in Figure 8.

We apply our multi-level representations (HRNetV2p), shown in Figure 4 (c), for object detection. As in FPN, we use 5 levels. The data is augmented by standard horizontal flipping. The input images are resized such that the shorter edge is 800 pixels. Inference is performed on a single image scale.

We compare our HRNet with the standard models: ResNet and ResNeXt. We evaluate the detection performance on COCO val under two anchor-based frameworks, Faster R-CNN and Cascade R-CNN, and two recently developed anchor-free frameworks, FCOS and CenterNet. We train the Faster R-CNN and Cascade R-CNN models for both our HRNetV2p and the ResNet on the public MMDetection platform with the provided training setup, except that we use the learning rate schedule suggested in prior work for $2\times$, and we train FCOS and CenterNet with the implementations provided by the authors. Table VII summarizes #parameters and GFLOPs. Table VIII and Table IX report detection scores.

We also evaluate the performance of joint detection and instance segmentation, under three frameworks: Mask R-CNN, Cascade Mask R-CNN, and Hybrid Task Cascade. The results are obtained on the public MMDetection platform and are in Table X.

There are several observations. On the one hand, as shown in Tables VIII and IX, the overall object detection performance of HRNetV2 is better than ResNet under similar model size and computation complexity. In some cases, for $1\times$, HRNetV2p-W18 performs worse than ResNet-50-FPN, which might come from insufficient optimization iterations. On the other hand, as shown in Table X, the overall object detection and instance segmentation performance is better than ResNet and ResNeXt. In particular, under the Hybrid Task Cascade framework, the HRNet performs slightly worse than ResNeXt-101-$64\times 4$d-FPN for $20e$, but better for $28e$. This implies that our HRNet benefits more from longer training.

Table XI reports the comparison of our network to state-of-the-art single-model object detectors on COCO test-dev without using multi-scale training and multi-scale testing that are done in . In the Faster R-CNN framework, our networks perform better than ResNets with similar parameter and computation complexity: HRNetV2p-W32 vs. ResNet-101-FPN, HRNetV2p-W40 vs. ResNet-152-FPN, HRNetV2p-W48 vs. X-101-64×4d-FPN. In the Cascade R-CNN and CenterNet frameworks, our HRNetV2 also performs better. In the Cascade Mask R-CNN and Hybrid Task Cascade frameworks, the HRNet achieves better overall performance.

Ablation Study

We perform the ablation study for the components in HRNet over two tasks: human pose estimation on COCO validation and semantic segmentation on Cityscapes validation. We mainly use HRNetV1-W32 for human pose estimation, and HRNetV2-W48 for semantic segmentation. All results of pose estimation are obtained over the input size 256×192. We also present the results for comparing HRNetV1 and HRNetV2.

Representations of different resolutions. We study how the representation resolution affects the pose estimation performance by checking the quality of the heatmap estimated from the feature maps of each resolution from high to low.

We train two HRNetV1 networks initialized by the model pretrained for the ImageNet classification. Our network outputs four response maps from high-to-low resolutions. The quality of heatmap prediction over the lowest-resolution response map is too low and the AP score is below 10 points. The AP scores over the other three maps are reported in Figure 9. The comparison implies that the resolution does impact the keypoint prediction quality.

Repeated multi-resolution fusion. We empirically analyze the effect of the repeated multi-resolution fusion. We study three variants of our network. (a) W/o intermediate fusion units (1 fusion): There is no fusion between multi-resolution streams except the final fusion unit. (b) W/ across-stage fusion units (3 fusions): There is no fusion between parallel streams within each stage. (c) W/ both across-stage and within-stage fusion units (8 fusions in total): This is our proposed method. All the networks are trained from scratch. The results on COCO human pose estimation and Cityscapes semantic segmentation (validation) given in Table XII show that the multi-resolution fusion unit is helpful and more fusions lead to better performance.

We also study other possible choices for the fusion design: (i) use bilinear downsampling to replace strided convolutions, and (ii) use the multiplication operation to replace the sum operation. In the former case, the COCO pose estimation AP score and the Cityscapes segmentation mIoU score are reduced to 72.6 and 74.2, respectively. The reason is that downsampling reduces the volume size (width × height × #channels) of the representation maps, and strided convolutions learn better volume size reduction than bilinear downsampling. In the latter case, the results are much worse: 54.7 and 66.0, respectively. The possible reason might be that multiplication increases the training difficulty as pointed out in .

Resolution maintenance. We study the performance of a variant of the HRNet: all four high-to-low resolution streams are added at the beginning and the depths of the four streams are the same; the fusion schemes are the same as ours. Both the HRNets and the variants (with similar #Params and GFLOPs) are trained from scratch.

The human pose estimation performance (AP) on COCO val for the variant is 72.5, which is lower than 73.4 for HRNetV1-W32. The segmentation performance (mIoU) on Cityscapes val for the variant is 75.7, which is lower than 76.4 for HRNetV2-W48. We believe that the reason is that the low-level features extracted from the early stages over the low-resolution streams are less helpful. In addition, another simple variant, which keeps only the high-resolution stream with similar #parameters and GFLOPs and has no low-resolution parallel streams, shows much lower performance on COCO and Cityscapes.

V1 vs. V2. We compare HRNetV2 and HRNetV2p to HRNetV1 on pose estimation, semantic segmentation and COCO object detection. For human pose estimation, the performance is similar. For example, HRNetV2-W32 (w/o ImageNet pretraining) achieves the AP score 73.6, which is slightly higher than 73.4 for HRNetV1-W32.

The segmentation and object detection results, given in Figure 10 (a) and Figure 10 (b), imply that HRNetV2 outperforms HRNetV1 significantly, except that the gain is minor in the large model case (1×) in segmentation for Cityscapes. We also test a variant (denoted by HRNetV1h), which is built by appending a 1×1 convolution to align the dimension of the output high-resolution representation with the dimension of HRNetV2. The results in Figure 10 (a) and Figure 10 (b) show that the variant achieves a slight improvement over HRNetV1, implying that aggregating the representations from low-resolution parallel convolutions in our HRNetV2 is essential for improving the capability.

Conclusions

In this paper, we present a high-resolution network for visual recognition problems. There are three fundamental differences from existing low-resolution classification networks and high-resolution representation learning networks: (i) Connect high and low resolution convolutions in parallel rather than in series; (ii) Maintain high resolution through the whole process instead of recovering high resolution from low resolution; and (iii) Fuse multi-resolution representations repeatedly, rendering rich high-resolution representations with strong position sensitivity.

The superior results on a wide range of visual recognition problems suggest that our proposed HRNet is a stronger backbone for computer vision problems. Our research also encourages more research efforts for designing network architectures directly for specific vision problems rather than extending, remediating or repairing representations learned from low-resolution networks (e.g., ResNet or VGGNet).

Discussions. There is a possible misunderstanding: that the memory cost of the HRNet is larger because the resolution is higher. In fact, the memory cost of the HRNet for all three applications (human pose estimation, semantic segmentation and object detection) is comparable to that of state-of-the-art methods, except that the training memory cost in object detection is a little larger.

In addition, we summarize the runtime cost comparison on the PyTorch 1.0 platform. The training and inference time cost of the HRNet is comparable to that of previous state-of-the-art methods, except that (1) the inference time of the HRNet for segmentation is much smaller and (2) the training time of the HRNet for pose estimation is a little larger, although the cost on the MXNet 1.5.1 platform, which supports static graph inference, is similar to that of SimpleBaseline. We would like to highlight that for semantic segmentation the inference cost is significantly smaller than that of PSPNet and DeepLabv3. Table XIII summarizes memory and time cost comparisons. (The detailed comparisons are given in the supplementary file.)

Future and followup works. We will study the combination of the HRNet with other techniques for semantic segmentation and instance segmentation. Currently, we have results (mIoU), which are reported in Tables III, IV, V, and VI, obtained by combining the HRNet with the object-contextual representation (OCR) scheme , a variant of object context . (We empirically observed that the HRNet combined with ASPP or PPM did not get a performance improvement on Cityscapes, but got a slight improvement on PASCAL-Context and LIP.) We will conduct the study by further increasing the resolution of the representation, e.g., to 1/2 or even the full resolution.

The applications of the HRNet are not limited to those covered above; the HRNet is also suitable for other position-sensitive vision applications, such as facial landmark detection (we provide the facial landmark detection results in the supplementary file), super-resolution, optical flow estimation, depth estimation, and so on. There are already followup works, e.g., image stylization , inpainting , image enhancement , image dehazing , temporal pose estimation , and drone object detection .

It is reported in that a slightly-modified HRNet combined with ASPP achieved the best performance for Mapillary panoptic segmentation in the single model case. In the COCO + Mapillary Joint Recognition Challenge Workshop at ICCV 2019, the COCO DensePose challenge winner and almost all the COCO keypoint detection challenge participants adopted the HRNet. The OpenImage instance segmentation challenge winner (ICCV 2019) also used the HRNet.

References

Appendix A Network Instantiation

Our current design (excluding the standard stem and the head) contains four stages, as shown in Table XIV. Each stage consists of modularized blocks, repeated 1, 1, 4, and 3 times, respectively, for the four stages. The modularized block consists of 1 (2, 3 and 4) branches for the 1st (2nd, 3rd and 4th) stages. Each branch corresponds to a different resolution, and is composed of four residual units and one multi-resolution fusion unit (see Figure 3 in the main paper).
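The stage layout described above can be tabulated with a short sketch. It assumes, following the main paper, that the k-th stream has 2^(k-1)·C channels and an output stride of 4·2^(k-1); the function name `hrnet_stage_layout` and the dictionary keys are illustrative and are not taken from the released code.

```python
def hrnet_stage_layout(width_c, modules_per_stage=(1, 1, 4, 3)):
    """Tabulate branches, widths, and strides for each stage.

    Assumes stream k (1-based) has 2**(k-1) * C channels and an output
    stride of 4 * 2**(k-1) relative to the input image.
    """
    layout = []
    for stage, num_modules in enumerate(modules_per_stage, start=1):
        num_branches = stage  # the n-th stage has n parallel streams
        layout.append({
            "stage": stage,
            "modules": num_modules,
            "branches": num_branches,
            "channels": [width_c * 2 ** k for k in range(num_branches)],
            "strides": [4 * 2 ** k for k in range(num_branches)],
        })
    return layout


layout = hrnet_stage_layout(width_c=32)
for row in layout:
    print(row)

# Modules with at least two branches are the ones that can exchange
# information across resolutions.
multi_branch_modules = sum(r["modules"] for r in layout if r["branches"] > 1)
print(multi_branch_modules)
```

Under the reading that a single-branch module has nothing to fuse, the count of multi-branch modules is 1 + 4 + 3, which agrees with the number of fusions of the full model in the ablation study.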

Appendix B Network Pretraining

We pretrain our network, which is augmented by a classification head shown in Figure 11, on ImageNet . The classification head is described below. First, the four-resolution feature maps are fed into a bottleneck and the output channels are increased from C, 2C, 4C, and 8C to 128, 256, 512, and 1024, respectively. Then, we downsample the high-resolution representation by a 2-strided 3×3 convolution outputting 256 channels and add it to the representation of the second-highest resolution. This process is repeated two times to get 1024 feature channels over the smallest resolution. Last, we transform the 1024 channels to 2048 channels through a 1×1 convolution, followed by a global average pooling operation. The output 2048-dimensional representation is fed into the classifier.
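The head can be followed step by step with a small shape trace. This is a sketch rather than the released implementation: it assumes the four streams run at 1/4, 1/8, 1/16, and 1/32 of the input resolution, and the function name `classification_head_shapes` is hypothetical.

```python
def classification_head_shapes(input_size=224):
    """Trace (step, channels, spatial size) through the head described above.

    Assumes the four streams run at 1/4, 1/8, 1/16, and 1/32 of the input.
    """
    head_channels = [128, 256, 512, 1024]  # channels after the bottlenecks
    sizes = [input_size // s for s in (4, 8, 16, 32)]
    trace = [("bottleneck", c, s) for c, s in zip(head_channels, sizes)]

    # Repeatedly downsample with a 2-strided 3x3 convolution and add the
    # result to the next lower-resolution stream.
    size = sizes[0]
    for next_channels, next_size in zip(head_channels[1:], sizes[1:]):
        size = size // 2
        assert size == next_size  # shapes must match for the addition
        trace.append(("downsample+add", next_channels, size))

    trace.append(("1x1 conv", 2048, size))
    trace.append(("global avg pool", 2048, 1))
    return trace


for step in classification_head_shapes(224):
    print(step)
```

For a 224×224 crop, the trace ends with a 7×7 map of 2048 channels that is pooled to a single 2048-dimensional vector.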

We adopt the same data augmentation scheme for training images as in , and train our models for 100 epochs with a batch size of 256. The initial learning rate is set to 0.1 and is reduced by 10 times at epochs 30, 60 and 90. We use SGD with a weight decay of 0.0001 and a Nesterov momentum of 0.9. We adopt standard single-crop testing, so that 224×224 pixels are cropped from each image. The top-1 and top-5 errors are reported on the validation set.
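The learning rate schedule above is a plain step schedule, which can be sketched as follows. The function name `step_lr` is illustrative, and the choice of a 0-based epoch index with the drop applied at the start of epochs 30, 60 and 90 is an assumption.

```python
def step_lr(epoch, base_lr=0.1, milestones=(30, 60, 90), factor=0.1):
    """Learning rate for a 0-based epoch index under a step schedule."""
    # Count how many milestones have already been reached.
    drops = sum(1 for m in milestones if epoch >= m)
    return base_lr * factor ** drops


for epoch in (0, 29, 30, 60, 90, 99):
    print(epoch, step_lr(epoch))
```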

Table XV shows our ImageNet classification results. As a comparison, we also report the results of ResNets. We consider two types of residual units: one is formed by a bottleneck, and the other is formed by two 3×3 convolutions. We follow the PyTorch implementation of ResNets and replace the 7×7 convolution in the input stem with two 2-strided 3×3 convolutions decreasing the resolution to 1/4 as in our networks. When the residual units are formed by two 3×3 convolutions, an extra bottleneck is used to increase the dimension of output feature maps from 512 to 2048. One can see that under similar #parameters and GFLOPs, our results are comparable to and slightly better than ResNets.
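That two 2-strided 3×3 convolutions reduce the resolution to 1/4 can be checked with the usual convolution output-size formula. The sketch below assumes a padding of 1, which the text does not state; the function names are illustrative.

```python
def conv_output_size(size, kernel=3, stride=2, padding=1):
    """Spatial output size of a convolution (floor convention)."""
    return (size + 2 * padding - kernel) // stride + 1


def stem_output_size(size):
    """Two 2-strided 3x3 convolutions applied in sequence."""
    return conv_output_size(conv_output_size(size))


print(stem_output_size(224))
print(stem_output_size(256))
```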

In addition, we look at the results of two alternative schemes: (i) the feature maps on each resolution go through a global pooling separately and then are concatenated together to output a 15C-dimensional representation vector, named HRNet-Wx-Ci; (ii) the feature maps on each resolution are fed into several 2-strided residual units (bottleneck, each doubling the dimension) to increase the dimension to 512, and are then concatenated and average-pooled together to reach a 2048-dimensional representation vector, named HRNet-Wx-Cii, which is used in . Table XVI shows such an ablation study. One can see that the proposed manner is superior to the two alternatives.
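The two output dimensions quoted above follow from simple sums over the four streams, as the following sketch shows. The function name `concat_pooled_dim` is hypothetical.

```python
def concat_pooled_dim(width_c, num_streams=4):
    """Length of the vector from concatenating globally pooled streams."""
    # Stream k (0-based) carries width_c * 2**k channels: C, 2C, 4C, 8C.
    return sum(width_c * 2 ** k for k in range(num_streams))


# Scheme (i): C + 2C + 4C + 8C = 15C.
print(concat_pooled_dim(32), 15 * 32)

# Scheme (ii): four streams, each raised to 512 channels, concatenated.
print(4 * 512)
```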

Appendix C Training/Inference Cost

Tables XVII, XVIII and XIX provide GPU memory comparisons between HRNets and other standard networks for both training and inference on the PyTorch platform. Compared to state-of-the-art methods for human pose estimation, the training and inference memory costs of the HRNet are similar or lower for similar parameter complexity (Table XVII). Compared to state-of-the-art methods for semantic segmentation, the training and inference memory costs are similar for similar parameter complexity (Table XVIII). Compared to state-of-the-art methods for object detection, the training and inference memory costs are similar or slightly higher for similar parameter complexity (Table XIX).

In addition, we provide the runtime cost comparison. (1) For semantic segmentation, the time cost of the HRNet for training is slightly smaller and for inference significantly smaller than that of PSPNet and DeepLabv3 (Table XVIII). (2) For object detection, the time cost of the HRNet for training is larger than that of ResNet-based networks and smaller than that of ResNeXt-based networks, and for inference the time cost of the HRNet is smaller for similar GFLOPs (Table XIX). (3) For human pose estimation, the time cost of the HRNet for training is similar and for inference larger; the time cost of the HRNet for training and inference on the MXNet platform is similar to that of SimpleBaseline (Table XVII).

Appendix D Facial Landmark Detection

Facial landmark detection, a.k.a. face alignment, is the problem of detecting the keypoints from a face image. We perform the evaluation over four standard datasets: WFLW , AFLW , COFW , and 300W . We mainly use the normalized mean error (NME) for evaluation. We use the inter-ocular distance as normalization for WFLW, COFW, and 300W, and the face bounding box as normalization for AFLW. We also report area-under-the-curve scores (AUC) and failure rates.
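For concreteness, the NME and the failure rate can be sketched as below. This follows the common definition (mean Euclidean landmark error divided by a normalizing distance) and is not taken from the evaluation code used in the paper; the function names and the `threshold` parameter are illustrative, and the threshold value is left to the caller.

```python
import math


def nme(pred, gt, norm_dist):
    """Mean Euclidean landmark error divided by a normalizing distance.

    pred, gt: sequences of (x, y) landmarks for one face.
    norm_dist: e.g. the inter-ocular distance or a bounding-box size.
    """
    errors = [math.dist(p, g) for p, g in zip(pred, gt)]
    return sum(errors) / (len(errors) * norm_dist)


def failure_rate(nme_values, threshold):
    """Fraction of images whose NME exceeds the given threshold."""
    return sum(v > threshold for v in nme_values) / len(nme_values)


print(nme([(0, 0), (3, 4)], [(0, 0), (0, 0)], norm_dist=10))
print(failure_rate([0.05, 0.12, 0.08, 0.2], threshold=0.1))
```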

We follow the standard scheme for training. All the faces are cropped by the provided boxes according to the center location and resized to 256×256. We augment the data by ±30 degrees in-plane rotation, 0.75-1.25 scaling, and random flipping. The base learning rate is 0.0001 and is dropped to 0.00001 and 0.000001 at the 30th and 50th epochs. The models are trained for 60 epochs with a batch size of 16 on one GPU. Different from semantic segmentation, the heatmaps are not upsampled from 1/4 to the input size, and the loss function is optimized over the 1/4 maps.

At testing, each keypoint location is predicted by transforming the highest heat-value location from 1/4 to the original image space and adjusting it with a quarter offset in the direction from the highest response to the second highest response .
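The decoding rule above can be sketched as follows. Two simplifications are assumptions rather than the paper's procedure: the second highest response is approximated by comparing the two neighbors of the peak along each axis, as is common in open-source pose code, and the mapping back to image space is a plain multiplication by the stride instead of the crop's actual center and scale transform. The function name `decode_keypoint` is hypothetical.

```python
def decode_keypoint(heatmap, stride=4):
    """Decode one keypoint from a 2-D heatmap given as a list of rows.

    Returns (x, y) in input-image pixels, assuming the image is exactly
    `stride` times larger than the heatmap.
    """
    h, w = len(heatmap), len(heatmap[0])
    # Location of the highest response.
    y, x = max(((r, c) for r in range(h) for c in range(w)),
               key=lambda rc: heatmap[rc[0]][rc[1]])
    fx, fy = float(x), float(y)
    # Quarter-pixel shift toward the stronger neighbor along each axis.
    if 0 < x < w - 1:
        diff = heatmap[y][x + 1] - heatmap[y][x - 1]
        fx += 0.25 * ((diff > 0) - (diff < 0))
    if 0 < y < h - 1:
        diff = heatmap[y + 1][x] - heatmap[y - 1][x]
        fy += 0.25 * ((diff > 0) - (diff < 0))
    return fx * stride, fy * stride


toy_heatmap = [
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.2, 1.0, 0.5],
    [0.0, 0.0, 0.3, 0.0],
    [0.0, 0.0, 0.0, 0.0],
]
print(decode_keypoint(toy_heatmap))
```

In the toy heatmap the peak sits at column 2, row 1, and both the right and the lower neighbors are stronger than their opposites, so the estimate moves a quarter pixel right and down before being scaled by 4.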

We adopt HRNetV2-W18 for facial landmark detection, whose parameter and computation costs are similar to or smaller than those of models with widely used backbones: ResNet-50 and Hourglass . HRNetV2-W18: #parameters = 9.3M, GFLOPs = 4.3G; ResNet-50: #parameters = 25.0M, GFLOPs = 3.8G; Hourglass: #parameters = 25.1M, GFLOPs = 19.1G. The numbers are obtained on the input size 256×256. It should be noted that the facial landmark detection methods adopting ResNet-50 and Hourglass as backbones introduce extra parameter and computation overhead.

WFLW. The WFLW dataset is a recently built dataset based on WIDER Face . There are 7,500 training and 2,500 testing images with 98 manually annotated landmarks. We report the results on the test set and several subsets: large pose (326 images), expression (314 images), illumination (698 images), make-up (206 images), occlusion (736 images) and blur (773 images).

Table XX provides the comparison of our method with state-of-the-art methods. Our approach is significantly better than other methods on the test set and all the subsets, including LAB that exploits extra boundary information and PDB that uses stronger data augmentation .

AFLW. The AFLW dataset is a widely used benchmark dataset, where each image has 19 facial landmarks. Following , we train our models on 20,000 training images, and report the results on the AFLW-Full set (4,386 testing images) and the AFLW-Frontal set (1,314 testing images selected from the 4,386 testing images).

Table XXI provides the comparison of our method with state-of-the-art methods. Our approach achieves the best performance among methods without extra information and stronger data augmentation and even outperforms DCFE with extra 3D information. Our approach performs slightly worse than LAB that uses extra boundary information and PDB that uses stronger data augmentation.

COFW. The COFW dataset consists of 1,345 training and 507 testing faces with occlusions, where each image has 29 facial landmarks.

Table XXII provides the comparison of our method with state-of-the-art methods. HRNetV2 outperforms other methods by a large margin. In particular, it achieves better performance than LAB with extra boundary information and PDB with stronger data augmentation.

300W. The dataset is a combination of the HELEN , LFPW , AFW , XM2VTS and IBUG datasets, where each face has 68 landmarks. Following , we use the 3,148 training images, which contain the training subsets of HELEN and LFPW and the full set of AFW. We evaluate the performance using two protocols, full set and test set. The full set contains 689 images and is further divided into a common subset (554 images) from HELEN and LFPW, and a challenging subset (135 images) from IBUG. The official test set, used for competition, contains 600 images (300 indoor and 300 outdoor images).

Table XXIII provides the results on the full set, and its two subsets: common and challenging. Table XXIV provides the results on the test set. In comparison to Chen et al. that uses Hourglass with large parameter and computation complexity as the backbone, our scores are better except the AUC0.08 scores. Our HRNetV2 gets the overall best performance among methods without extra information and stronger data augmentation, and is even better than LAB with extra boundary information and DCFE that explores extra 3D information.

Appendix E More object detection and instance results on COCO val2017