Dilated Residual Networks

Fisher Yu, Vladlen Koltun, Thomas Funkhouser

Introduction

Convolutional networks were originally developed for classifying hand-written digits. More recently, convolutional network architectures have evolved to classify much more complex images. Yet a central aspect of network architecture has remained largely in place. Convolutional networks for image classification progressively reduce resolution until the image is represented by tiny feature maps that retain little spatial information ($7\mathbin{\!\times\!}7$ is typical).

While convolutional networks have done well, the almost complete elimination of spatial acuity may be preventing these models from achieving even higher accuracy, for example by preserving the contribution of small and thin objects that may be important for correctly understanding the image. Such preservation may not have been important in the context of hand-written digit classification, in which a single object dominated the image, but may help in the analysis of complex natural scenes where multiple objects and their relative configurations must be taken into account.

Furthermore, image classification is rarely a convolutional network’s raison d’être. Image classification is most often a proxy task that is used to pretrain a model before it is transferred to other applications that involve more detailed scene understanding. In such tasks, severe loss of spatial acuity is a significant handicap. Existing techniques compensate for the lost resolution by introducing up-convolutions, skip connections, and other post-hoc measures.

Must convolutional networks crush the image in order to classify it? In this paper, we show that this is not necessary, or even desirable. Starting with the residual network architecture, the current state of the art for image classification, we increase the resolution of the network’s output by replacing a subset of interior subsampling layers by dilation. We show that dilated residual networks (DRNs) yield improved image classification performance. Specifically, DRNs yield higher accuracy in ImageNet classification than their non-dilated counterparts, with no increase in depth or model complexity.

The output resolution of a DRN on typical ImageNet input is $28\mathbin{\!\times\!}28$, comparable to small thumbnails that convey the structure of the image when examined by a human. While it may not be clear a priori that average pooling can properly handle such high-resolution output, we show that it can, yielding a notable accuracy gain. We then study gridding artifacts introduced by dilation, propose a scheme for removing these artifacts, and show that such ‘degridding’ further improves the accuracy of DRNs.

We also show that DRNs yield improved accuracy on downstream applications such as weakly-supervised object localization and semantic segmentation. With a remarkably simple approach, involving no fine-tuning at all, we obtain state-of-the-art top-1 accuracy in weakly-supervised localization on ImageNet. We also study the performance of DRNs on semantic segmentation and show, for example, that a 42-layer DRN outperforms a ResNet-101 baseline on the Cityscapes dataset by more than 4 percentage points, despite lower depth by a factor of 2.4.

Dilated Residual Networks

Our key idea is to preserve spatial resolution in convolutional networks for image classification. Although progressive downsampling has been very successful in classifying digits or iconic views of objects, the loss of spatial information may be harmful for classifying natural images and can significantly hamper transfer to other tasks that involve spatially detailed image understanding. Natural images often feature many objects whose identities and relative configurations are important for understanding the scene. The classification task becomes difficult when a key object is not spatially dominant – for example, when the labeled object is thin (e.g., a tripod) or when there is a big background object such as a mountain. In these cases, the background response may suppress the signal from the object of interest. What’s worse, if the object’s signal is lost due to downsampling, there is little hope to recover it during training. However, if we retain high spatial resolution throughout the model and provide output signals that densely cover the input field, backpropagation can learn to preserve important information about smaller and less salient objects.

A naive approach to increasing resolution in higher layers of the network would be to simply remove subsampling (striding) from some of the interior layers. This does increase downstream resolution, but has a detrimental side effect that negates the benefits: removing subsampling correspondingly reduces the receptive field in subsequent layers. Thus removing striding such that the resolution of the output layer is increased by a factor of 4 also reduces the receptive field of each output unit by a factor of 4. This severely reduces the amount of context that can inform the prediction produced by each unit. Since contextual information is important in disambiguating local cues, such reduction in receptive field is an unacceptable price to pay for higher resolution. For this reason, we use dilated convolutions to increase the receptive field of the higher layers, compensating for the reduction in receptive field induced by removing subsampling. The effect is that units in the dilated layers have the same receptive field as corresponding units in the original model.
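The trade-off described above can be checked with a small receptive-field calculation. The following is a minimal sketch in one dimension, assuming a hypothetical toy stack of three 3-wide convolutions (not the actual ResNet layers); the function name and layer tuples are illustrative only, and the exact numbers depend on the kernel sizes chosen.

```python
def receptive_field(layers):
    """Receptive field (in input pixels, one dimension) of a unit after a
    stack of convolutions, each given as (kernel, stride, dilation)."""
    rf = 1      # an input pixel sees only itself
    jump = 1    # spacing, in input pixels, between adjacent units
    for kernel, stride, dilation in layers:
        rf += (kernel - 1) * dilation * jump
        jump *= stride
    return rf

# Hypothetical toy stack: one strided layer followed by two more layers.
strided = [(3, 2, 1), (3, 1, 1), (3, 1, 1)]
# Stride removed and nothing else changed: later layers see less context.
unstrided = [(3, 1, 1), (3, 1, 1), (3, 1, 1)]
# Stride removed and the subsequent layers 2-dilated to compensate.
dilated = [(3, 1, 1), (3, 1, 2), (3, 1, 2)]

for name, stack in [("strided", strided), ("unstrided", unstrided), ("dilated", dilated)]:
    print(name, receptive_field(stack))
```

In this toy setting the dilated stack recovers the receptive field of the strided one, while the layer whose stride was removed is itself unaffected.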

We focus on the two final groups of convolutional layers: $\mathcal{G}^{4}$ and $\mathcal{G}^{5}$. In the original ResNet, the first layer in each group ($\mathcal{G}^{4}_{1}$ and $\mathcal{G}^{5}_{1}$) is strided: the convolution is evaluated at even rows and columns, which reduces the output resolution of these layers by a factor of 2 in each dimension. The first step in the conversion to DRN is to remove the striding in both $\mathcal{G}^{4}_{1}$ and $\mathcal{G}^{5}_{1}$. Note that the receptive field of each unit in $\mathcal{G}^{4}_{1}$ remains unaffected: we just doubled the output resolution of $\mathcal{G}^{4}_{1}$ without affecting the receptive field of its units. However, subsequent layers are all affected: their receptive fields have been reduced by a factor of 2 in each dimension. We therefore replace the convolution operators in those layers by 2-dilated convolutions:

$$(\mathcal{G}^{4}_{i} *_{2} f^{4}_{i})(\mathbf{p}) = \sum_{\mathbf{a}+2\mathbf{b}=\mathbf{p}} \mathcal{G}^{4}_{i}(\mathbf{a})\, f^{4}_{i}(\mathbf{b})$$

for all $i\geq 2$, where $f^{\ell}_{i}$ denotes the filter associated with layer $\mathcal{G}^{\ell}_{i}$ and $*_{d}$ denotes convolution with dilation $d$. The same transformation is applied to $\mathcal{G}^{5}_{1}$:

$$(\mathcal{G}^{5}_{1} *_{2} f^{5}_{1})(\mathbf{p}) = \sum_{\mathbf{a}+2\mathbf{b}=\mathbf{p}} \mathcal{G}^{5}_{1}(\mathbf{a})\, f^{5}_{1}(\mathbf{b}).$$

Subsequent layers in $\mathcal{G}^{5}$ follow two striding layers that have been eliminated. The elimination of striding has reduced their receptive fields by a factor of 4 in each dimension. Their convolutions need to be dilated by a factor of 4 to compensate for the loss:

$$(\mathcal{G}^{5}_{i} *_{4} f^{5}_{i})(\mathbf{p}) = \sum_{\mathbf{a}+4\mathbf{b}=\mathbf{p}} \mathcal{G}^{5}_{i}(\mathbf{a})\, f^{5}_{i}(\mathbf{b})$$

for all $i\geq 2$. Finally, as in the original architecture, $\mathcal{G}^{5}$ is followed by global average pooling, which reduces the output feature maps to a vector, and a $1\mathbin{\!\times\!}1$ convolution that maps this vector to a vector that comprises the prediction scores for all classes. The transformation of a ResNet into a DRN is illustrated in Figure 1.
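The dilated convolution used in this conversion can be illustrated in one dimension. The sketch below assumes an odd-length kernel whose taps are centred at offset 0 and zero padding outside the signal; the function name and the example signal are hypothetical and chosen only to show how the dilation factor widens the reach of a fixed 3-tap filter.

```python
def dilated_conv1d(signal, kernel, dilation):
    """out[p] = sum over a + dilation * b = p of signal[a] * kernel[b],
    with kernel offsets b centred at 0 and zeros outside the signal."""
    half = len(kernel) // 2
    out = []
    for p in range(len(signal)):
        total = 0
        for j, weight in enumerate(kernel):
            b = j - half              # kernel offset: -1, 0, +1 for 3 taps
            a = p - dilation * b      # input position paired with offset b
            if 0 <= a < len(signal):
                total += signal[a] * weight
        out.append(total)
    return out

signal = [0, 1, 2, 3, 4, 5, 6, 7, 8]
kernel = [1, 0, -1]   # weights for offsets b = -1, 0, +1

# The same three weights compare samples 2, 4 and 8 positions apart.
for d in (1, 2, 4):
    print(d, dilated_conv1d(signal, kernel, d)[4])
```

The number of filter weights is unchanged in all three cases, which is why the conversion adds no parameters.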

The converted DRN has the same number of layers and parameters as the original ResNet. The key difference is that the original ResNet downsamples the input image by a factor of 32 in each dimension (a thousand-fold reduction in area), while the DRN downsamples the input by a factor of 8. For example, when the input resolution is $224\mathbin{\!\times\!}224$, the output resolution of $\mathcal{G}^{5}$ in the original ResNet is $7\mathbin{\!\times\!}7$, which is not sufficient for the spatial structure of the input to be discernible. The output of $\mathcal{G}^{5}$ in a DRN is $28\mathbin{\!\times\!}28$. Global average pooling therefore takes in $2^{4}$ times more values, which can help the classifier recognize objects that cover a smaller number of pixels in the input image and take such objects into account in its prediction.
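The resolution figures quoted above follow from simple arithmetic, which the short sketch below reproduces. The helper name is hypothetical; the input size and downsampling factors are the ones stated in the text.

```python
def output_side(input_side, downsampling):
    """Side length of the final feature map for a square input."""
    return input_side // downsampling

resnet_side = output_side(224, 32)   # original ResNet: factor 32
drn_side = output_side(224, 8)       # DRN: factor 8

# Number of spatial positions seen by global average pooling in each case.
resnet_values = resnet_side ** 2
drn_values = drn_side ** 2
ratio = drn_values // resnet_values

print(resnet_side, drn_side, ratio)
```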

The presented construction could also be applied to earlier groups of layers ($\mathcal{G}^{1}$, $\mathcal{G}^{2}$, or $\mathcal{G}^{3}$), in the limit retaining the full resolution of the input. We chose not to do this because a downsampling factor of 8 is known to preserve most of the information necessary to correctly parse the original image at pixel level. Furthermore, a $28\mathbin{\!\times\!}28$ thumbnail, while small, is sufficiently resolved for humans to discern the structure of the scene. Additional increase in resolution has costs and should not be pursued without commensurate gains: when feature map resolution is increased by a factor of 2 in each dimension, the memory consumption of that feature map increases by a factor of 4. Operating at full resolution throughout, with no downsampling at all, is beyond the capabilities of current hardware.

Localization

Given a DRN trained for image classification, we can directly produce dense pixel-level class activation maps without any additional training or parameter tuning. This allows a DRN trained for image classification to be immediately used for object localization and segmentation.

To obtain high-resolution class activation maps, we remove the global average pooling operator. We then connect the final $1\mathbin{\!\times\!}1$ convolution directly to $\mathcal{G}^{5}$. A softmax is applied to each column in the resulting volume to convert the pixelwise prediction scores to proper probability distributions. This procedure is illustrated in Figure 2. The output of the resulting network is a set of activation maps that have the same spatial resolution as $\mathcal{G}^{5}$ ($28\mathbin{\!\times\!}28$). Each classification category $y$ has a corresponding activation map. For each pixel in this map, the map contains the probability that the object observed at this pixel is of category $y$.
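The procedure above amounts to applying the classifier weights at every spatial location and normalizing each column with a softmax. The following is a minimal sketch using plain Python lists; the function name, the tiny $2\mathbin{\!\times\!}2$ feature grid, and the identity classifier weights are hypothetical stand-ins for the output of $\mathcal{G}^{5}$ and the trained $1\mathbin{\!\times\!}1$ convolution.

```python
import math

def class_activation_maps(features, weights, biases):
    """features: H x W grid of D-dimensional vectors.
    weights (C x D) and biases (length C): the 1x1 classifier.
    Returns an H x W grid of length-C probability vectors."""
    maps = []
    for row in features:
        out_row = []
        for vec in row:
            # 1x1 convolution: the same linear map at every location.
            scores = [sum(w * x for w, x in zip(w_c, vec)) + b
                      for w_c, b in zip(weights, biases)]
            # Softmax over classes; subtracting the max avoids overflow.
            peak = max(scores)
            exps = [math.exp(s - peak) for s in scores]
            total = sum(exps)
            out_row.append([e / total for e in exps])
        maps.append(out_row)
    return maps

features = [[[2.0, 0.0], [0.0, 2.0]],
            [[0.0, 0.0], [1.0, 1.0]]]
weights = [[1.0, 0.0], [0.0, 1.0]]
biases = [0.0, 0.0]
maps = class_activation_maps(features, weights, biases)
print(maps[0][0], maps[1][0])
```

Reading out channel $y$ of every column gives the activation map for category $y$.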

The activation maps produced by our construction serve the same purpose as the results of the procedure of Zhou et al. However, the procedures are fundamentally different. Zhou et al. worked with convolutional networks that produce drastically downsampled output that is not sufficiently resolved for object localization. For this reason, Zhou et al. had to remove layers from the classification network, introduce parameters that compensate for the ablated layers, and then fine-tune the modified models to train the new parameters. Even then, the output resolution obtained by Zhou et al. was quite small ($14\mathbin{\!\times\!}14$) and the classification performance of the modified networks was impaired.

In contrast, the DRN was designed to produce high-resolution output maps and is trained in this configuration from the start. Thus the model trained for image classification already produces high-resolution activation maps. As our experiments will show, DRNs are more accurate than the original ResNets in image classification. Since DRNs produce high-resolution output maps from the start, there is no need to remove layers, add parameters, and retrain the model for localization. The original accurate classification model can be used for localization directly.

Degridding

The use of dilated convolutions can cause gridding artifacts. Such artifacts are shown in Figure 3(c) and have also been observed in concurrent work on semantic segmentation. Gridding artifacts occur when a feature map has higher-frequency content than the sampling rate of the dilated convolution. Figure 4 shows a didactic example. In Figure 4(a), the input feature map has a single active pixel. A 2-dilated convolution (Figure 4(b)) induces a corresponding grid pattern in the output (Figure 4(c)).
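The didactic example can be reproduced numerically. The sketch below assumes a $7\mathbin{\!\times\!}7$ map with one active pixel and a hypothetical all-ones $3\mathbin{\!\times\!}3$ kernel with zero padding; these specifics are illustrative choices, not taken from the figure.

```python
def dilated_conv2d(image, kernel, dilation):
    """Zero-padded 2-D dilated convolution with an odd-sized square kernel."""
    height, width = len(image), len(image[0])
    half = len(kernel) // 2
    out = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            total = 0
            for i, kernel_row in enumerate(kernel):
                for j, weight in enumerate(kernel_row):
                    yy = y - dilation * (i - half)
                    xx = x - dilation * (j - half)
                    if 0 <= yy < height and 0 <= xx < width:
                        total += image[yy][xx] * weight
            out[y][x] = total
    return out

# A feature map with a single active pixel in the centre.
image = [[0] * 7 for _ in range(7)]
image[3][3] = 1
kernel = [[1] * 3 for _ in range(3)]

response = dilated_conv2d(image, kernel, 2)
active = sorted((y, x) for y in range(7) for x in range(7) if response[y][x])

for row in response:
    print(" ".join(str(v) for v in row))
```

The active outputs sit two pixels apart with inactive pixels between them, which is the grid pattern described above.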

In this section, we develop a scheme for removing gridding artifacts from output activation maps produced by DRNs. The scheme is illustrated in Figure 5. A DRN constructed as described in Section 2 is referred to as DRN-A and is illustrated in Figure 5(a). An intermediate stage of the construction described in the present section is referred to as DRN-B and is illustrated in Figure 5(b). The final construction is referred to as DRN-C, illustrated in Figure 5(c).

Removing max pooling. As shown in Figure 5(a), DRN-A inherits from the ResNet architecture a max pooling operation after the initial $7\mathbin{\!\times\!}7$ convolution. We found that this max pooling operation leads to high-amplitude high-frequency activations, as shown in Figure 6(b). Such high-frequency activations can be propagated to later layers and ultimately exacerbate gridding artifacts. We thus replace max pooling by convolutional filters, as shown in Figure 5(b). The effect of this transformation is shown in Figure 6(c).

Adding layers. To remove gridding artifacts, we add convolutional layers at the end of the network, with progressively lower dilation. Specifically, after the last 4-dilated layer in DRN-A (Figure 5(a)), we add a 2-dilated residual block followed by a 1-dilated block. These become levels 7 and 8 in DRN-B, shown in Figure 5(b). This is akin to removing aliasing artifacts using filters with appropriate frequency.
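The effect of following a 4-dilated layer with layers of progressively lower dilation can be seen in a one-dimensional toy example. This is a sketch under simplifying assumptions: single-channel signals, hypothetical all-ones 3-tap filters rather than learned ones, zero padding, and plain sequential convolutions without residual connections.

```python
def conv1d(signal, kernel, dilation):
    """Zero-padded 1-D dilated convolution with an odd-length kernel."""
    half = len(kernel) // 2
    n = len(signal)
    return [
        sum(weight * signal[p - dilation * (j - half)]
            for j, weight in enumerate(kernel)
            if 0 <= p - dilation * (j - half) < n)
        for p in range(n)
    ]

impulse = [0] * 13
impulse[6] = 1          # a single active unit
box = [1, 1, 1]         # hypothetical 3-tap filter

after_4 = conv1d(impulse, box, 4)   # 4-dilated layer: sparse response
after_2 = conv1d(after_4, box, 2)   # added 2-dilated layer: gaps remain
after_1 = conv1d(after_2, box, 1)   # added 1-dilated layer: gaps filled

print(after_4)
print(after_2)
print(after_1)
```

In this toy setting the holes left by the dilated layers disappear only after the final 1-dilated layer.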

Removing residual connections. Adding layers with decreasing dilation, as described in the preceding paragraph, does not remove gridding artifacts entirely because of residual connections. The residual connections in levels 7 and 8 of DRN-B can propagate gridding artifacts from level 6. To remove gridding artifacts more effectively, we remove the residual connections in levels 7 and 8. This yields the DRN-C, our proposed construction, illustrated in Figure 5(c). Note that the DRN-C has higher depth and capacity than the corresponding DRN-A or the ResNet that had been used as the starting point. However, we will show that the presented degridding scheme has a dramatic effect on accuracy, such that the accuracy gain compensates for the added depth and capacity. For example, experiments will demonstrate that DRN-C-26 has similar image classification accuracy to DRN-A-34 and higher object localization and semantic segmentation accuracy than DRN-A-50.

The activations inside a DRN-C are illustrated in Figure 7. This figure shows a feature map from the output of each level in the network. The feature map with the largest average activation magnitude is shown.

Experiments

Training is performed on the ImageNet 2012 training set. The training procedure is similar to He et al. We use scale and aspect ratio augmentation as in Szegedy et al. and color perturbation as in Krizhevsky et al. and Howard. Training is performed by SGD with momentum 0.9 and weight decay $10^{-4}$. The learning rate is initially set to $10^{-1}$ and is reduced by a factor of 10 every 30 epochs. Training proceeds for 120 epochs total.
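The step schedule described above can be written as a short function. This is a minimal sketch assuming zero-indexed epochs; the function and parameter names are hypothetical, while the values are the ones stated in the text.

```python
def learning_rate(epoch, base_lr=0.1, drop_every=30, factor=10):
    """Step schedule: divide the rate by `factor` every `drop_every` epochs.
    Epochs are assumed to be counted from 0."""
    return base_lr / (factor ** (epoch // drop_every))

# Rate at the first epoch of each 30-epoch stage of a 120-epoch run.
schedule = [learning_rate(epoch) for epoch in (0, 30, 60, 90)]
print(schedule)
```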

The performance of trained models is evaluated on the ImageNet 2012 validation set. The images are resized so that the shorter side has 256 pixels. We use two evaluation protocols: 1-crop and 10-crop. In the 1-crop protocol, prediction accuracy is measured on the central $224\mathbin{\!\times\!}224$ crop. In the 10-crop protocol, prediction accuracy is measured on 10 crops from each image. Specifically, for each image we take the center crop, four corner crops, and flipped versions of these crops. The reported 10-crop accuracy is averaged over these 10 crops.
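The crop layout of the 10-crop protocol can be sketched as follows. The function name and the return format are hypothetical; the sketch only enumerates crop origins and flip flags for an already resized image and assumes horizontal flipping, which the text does not state explicitly.

```python
def ten_crop_boxes(width, height, crop=224):
    """Return ten (left, top, flipped) triples: the centre crop, the four
    corner crops, and the flipped version of each (flip assumed horizontal)."""
    centre = ((width - crop) // 2, (height - crop) // 2)
    corners = [
        (0, 0),                             # top-left
        (width - crop, 0),                  # top-right
        (0, height - crop),                 # bottom-left
        (width - crop, height - crop),      # bottom-right
    ]
    origins = [centre] + corners
    return [(left, top, flipped)
            for flipped in (False, True)
            for left, top in origins]

# Example: an image resized so that its shorter side has 256 pixels.
boxes = ten_crop_boxes(341, 256)
for box in boxes:
    print(box)
```

The first entry is the central crop used by the 1-crop protocol.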

ResNet vs. DRN-A. Table 1 reports the accuracy of different models according to both evaluation protocols. Each DRN-A outperforms the corresponding ResNet model, despite having the same depth and capacity. For example, DRN-A-18 and DRN-A-34 outperform ResNet-18 and ResNet-34 in 1-crop top-1 accuracy by 2.43 and 2.92 percentage points, respectively. (A 10.5% error reduction in the case of ResNet-34 $\rightarrow$ DRN-A-34.)

DRN-A-50 outperforms ResNet-50 in 1-crop top-1 accuracy by more than a percentage point. For comparison, the corresponding error reduction achieved by ResNet-152 over ResNet-101 is 0.3 percentage points. (From 22.44 to 22.16 on the center crop.) These results indicate that even the direct transformation of a ResNet into a DRN-A, which does not change the depth or capacity of the model at all, significantly improves classification accuracy.

DRN-A vs. DRN-C. Table 1 also shows that the degridding construction described in Section 4 is beneficial. Specifically, each DRN-C significantly outperforms the corresponding DRN-A. Although the degridding procedure increases depth and capacity, the resultant increase in accuracy is so substantial that the transformed DRN matches the accuracy of deeper models. Specifically, DRN-C-26, which is derived from DRN-A-18, matches the accuracy of the deeper DRN-A-34. In turn, DRN-C-42, which is derived from DRN-A-34, matches the accuracy of the deeper DRN-A-50. Comparing the degridded DRN to the original ResNet models, we see that DRN-C-42 approaches the accuracy of ResNet-101, although the latter is deeper by a factor of 2.4.

Object Localization

We now evaluate the use of DRNs for weakly-supervised object localization, as described in Section 3. As shown in Figure 3, class activation maps provided by DRNs are much better spatially resolved than activation maps extracted from the corresponding ResNet.

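For each class $c_{i}$, define the set of valid bounding boxes as

$$\mathcal{B}_{i} = \left\{ ((w_{1},h_{1}),(w_{2},h_{2})) \;\middle|\; \forall (w,h):\ f(c_{i},w,h) > t \Rightarrow w\in[w_{1},w_{2}] \text{ and } h\in[h_{1},h_{2}] \right\},$$

where $f(c,w,h)$ is the activation at location $(w,h)$ in the activation map for class $c$ and $t$ is an activation threshold. The minimal bounding box for class $c_{i}$ is defined as

$$\mathbf{b}_{i} = \operatorname*{arg\,min}_{((w_{1},h_{1}),(w_{2},h_{2}))\in\mathcal{B}_{i}} (w_{2}-w_{1})(h_{2}-h_{1}).$$

To evaluate the accuracy of DRNs on weakly-supervised object localization, we simply compute the minimal bounding box $\mathbf{b}_{i}$ for the predicted class $i$ on each image. In the localization challenge, a predicted bounding box is considered accurate when its IoU with the ground-truth box is greater than 0.5. Table 2 reports the results. Note that the classification networks are used for localization directly, with no fine-tuning.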
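The localization step and its evaluation criterion can be sketched in a few lines. The code below assumes the minimal valid box is the tightest box enclosing every location whose activation exceeds $t$, and it treats boxes as inclusive pixel ranges on the activation-map grid; the function names, the toy activation map, and the ground-truth box are hypothetical.

```python
def minimal_bounding_box(activation, t):
    """Tightest box (w1, h1, w2, h2), inclusive, containing every location
    whose activation exceeds t. Returns None if no location does."""
    hits = [(w, h)
            for h, row in enumerate(activation)
            for w, value in enumerate(row)
            if value > t]
    if not hits:
        return None
    ws = [w for w, _ in hits]
    hs = [h for _, h in hits]
    return (min(ws), min(hs), max(ws), max(hs))

def iou(box_a, box_b):
    """Intersection over union of two inclusive pixel boxes."""
    aw1, ah1, aw2, ah2 = box_a
    bw1, bh1, bw2, bh2 = box_b
    inter_w = max(min(aw2, bw2) - max(aw1, bw1) + 1, 0)
    inter_h = max(min(ah2, bh2) - max(ah1, bh1) + 1, 0)
    inter = inter_w * inter_h
    area_a = (aw2 - aw1 + 1) * (ah2 - ah1 + 1)
    area_b = (bw2 - bw1 + 1) * (bh2 - bh1 + 1)
    return inter / (area_a + area_b - inter)

activation = [[0.0, 0.1, 0.0, 0.0],
              [0.0, 0.9, 0.8, 0.0],
              [0.0, 0.7, 0.2, 0.0],
              [0.0, 0.0, 0.0, 0.0]]
predicted = minimal_bounding_box(activation, t=0.5)
ground_truth = (1, 1, 2, 3)             # hypothetical annotation
accurate = iou(predicted, ground_truth) > 0.5
print(predicted, accurate)
```

In practice the box would be rescaled from the activation-map grid to image coordinates before comparison with the annotation.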

As shown in Table 2, DRNs outperform the corresponding ResNet models. (Compare ResNet-18 to DRN-A-18, ResNet-34 to DRN-A-34, and ResNet-50 to DRN-A-50.) This again illustrates the benefits of the basic DRN construction presented in Section 2. Furthermore, DRN-C-26 significantly outperforms DRN-A-50, despite having much lower depth. This indicates that the degridding scheme described in Section 4 has particularly significant benefits for applications that require more detailed spatial image analysis. DRN-C-26 also outperforms ResNet-101.

Semantic Segmentation

We now transfer DRNs to semantic segmentation. High-resolution internal representations are known to be important for this task. Due to the severe downsampling in prior image classification architectures, their transfer to semantic segmentation necessitated post-hoc adaptations such as up-convolutions, skip connections, and post-hoc dilation. In contrast, the high resolution of the output layer in a DRN means that we can transfer a classification-trained DRN to semantic segmentation by simply removing the global pooling layer and operating the network fully-convolutionally, without any additional structural changes. The predictions synthesized by the output layer are upsampled to full resolution using bilinear interpolation, which does not involve any parameters.
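Bilinear interpolation has no learned weights, as the following minimal sketch shows for a single score map. The function name is hypothetical, and the sketch assumes one particular sampling convention (the corner samples of input and output are aligned); other conventions differ slightly near the borders.

```python
def bilinear_upsample(grid, out_h, out_w):
    """Upsample a 2-D grid to out_h x out_w by bilinear interpolation,
    aligning the corner samples of input and output (assumed convention)."""
    in_h, in_w = len(grid), len(grid[0])

    def source(i, n_out, n_in):
        # Position in the input grid that output index i samples from.
        return i * (n_in - 1) / (n_out - 1) if n_out > 1 else 0.0

    out = []
    for y in range(out_h):
        sy = source(y, out_h, in_h)
        y0 = int(sy)
        y1 = min(y0 + 1, in_h - 1)
        fy = sy - y0
        row = []
        for x in range(out_w):
            sx = source(x, out_w, in_w)
            x0 = int(sx)
            x1 = min(x0 + 1, in_w - 1)
            fx = sx - x0
            top = grid[y0][x0] * (1 - fx) + grid[y0][x1] * fx
            bottom = grid[y1][x0] * (1 - fx) + grid[y1][x1] * fx
            row.append(top * (1 - fy) + bottom * fy)
        out.append(row)
    return out

coarse = [[0.0, 2.0],
          [4.0, 6.0]]
fine = bilinear_upsample(coarse, 3, 3)
for row in fine:
    print(row)
```

For segmentation, the same interpolation would be applied to the score map of each class before taking the per-pixel argmax.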

We evaluate this capability using the Cityscapes dataset. We use the standard Cityscapes training and validation sets. To understand the properties of the models themselves, we only use image cropping and mirroring for training. We do not use any other data augmentation and do not append additional modules to the network. The results are reported in Table 3.

All presented models outperform a comparable baseline setup of ResNet-101, which was reported to achieve a mean IoU of 66.6. For example, DRN-C-26 outperforms the ResNet-101 baseline by more than a percentage point, despite having 4 times lower depth. The DRN-C-42 model outperforms the ResNet-101 baseline by more than 4 percentage points, despite 2.4 times lower depth.

Comparing different DRN models, we see that both DRN-C-26 and DRN-C-42 outperform DRN-A-50, suggesting that the degridding construction presented in Section 4 is particularly beneficial for dense prediction tasks. A qualitative comparison between DRN-A-50 and DRN-C-26 is shown in Figure 8. As the images show, the predictions of DRN-A-50 are marred by gridding artifacts even though the model was trained with dense pixel-level supervision. In contrast, the predictions of DRN-C-26 are not only more accurate, but also visibly cleaner.

Conclusion

We have presented an approach to designing convolutional networks for image analysis. Rather than progressively reducing the resolution of internal representations until the spatial structure of the scene is no longer discernible, we keep high spatial resolution all the way through the final output layers. We have shown that this simple transformation improves image classification accuracy, outperforming state-of-the-art models. We have then shown that accuracy can be increased further by modifying the construction to alleviate gridding artifacts introduced by dilation.

The presented image classification networks produce informative output activations, which can be used directly for weakly-supervised object localization, without any fine-tuning. The presented models can also be used for dense prediction tasks such as semantic segmentation, where they outperform deeper and higher-capacity baselines.

The results indicate that dilated residual networks can be used as a starting point for image analysis tasks that involve complex natural images, particularly when detailed understanding of the scene is important. We will release code and pretrained models to support future research and applications.

Acknowledgments

This work was supported by Intel and the National Science Foundation (IIS-1251217 and VEC 1539014/1539099).
