Deformable Convolutional Networks

Jifeng Dai, Haozhi Qi, Yuwen Xiong, Yi Li, Guodong Zhang, Han Hu, Yichen Wei

Introduction

A key challenge in visual recognition is how to accommodate geometric variations or model geometric transformations in object scale, pose, viewpoint, and part deformation. In general, there are two approaches. The first is to build training datasets with sufficient desired variations. This is usually realized by augmenting the existing data samples, e.g., by affine transformation. Robust representations can be learned from the data, but usually at the cost of expensive training and complex model parameters. The second is to use transformation-invariant features and algorithms. This category subsumes many well-known techniques, such as SIFT (scale invariant feature transform) and the sliding-window object detection paradigm.

Both approaches have two drawbacks. First, the geometric transformations are assumed to be fixed and known. Such prior knowledge is used to augment the data and to design the features and algorithms. This assumption prevents generalization to new tasks with unknown geometric transformations, which are not properly modeled. Second, hand-crafted design of invariant features and algorithms can be difficult or infeasible for overly complex transformations, even when they are known.

Recently, convolutional neural networks (CNNs) have achieved significant success for visual recognition tasks, such as image classification, semantic segmentation, and object detection. Nevertheless, they still share the above two drawbacks. Their capability of modeling geometric transformations mostly comes from extensive data augmentation, large model capacity, and some simple hand-crafted modules (e.g., max-pooling for small translation-invariance).

In short, CNNs are inherently limited in modeling large, unknown transformations. The limitation originates from the fixed geometric structures of CNN modules: a convolution unit samples the input feature map at fixed locations; a pooling layer reduces the spatial resolution at a fixed ratio; a RoI (region-of-interest) pooling layer separates a RoI into fixed spatial bins, etc. There are no internal mechanisms to handle the geometric transformations. This causes noticeable problems. For example, the receptive field sizes of all activation units in the same CNN layer are the same. This is undesirable for high level CNN layers that encode the semantics over spatial locations. Because different locations may correspond to objects with different scales or deformation, adaptive determination of scales or receptive field sizes is desirable for visual recognition with fine localization, e.g., semantic segmentation using fully convolutional networks. As another example, while object detection has seen significant and rapid progress recently, all approaches still rely on primitive bounding box based feature extraction. This is clearly sub-optimal, especially for non-rigid objects.

In this work, we introduce two new modules that greatly enhance CNNs’ capability of modeling geometric transformations. The first is deformable convolution. It adds 2D offsets to the regular grid sampling locations in the standard convolution. It enables free form deformation of the sampling grid. It is illustrated in Figure 1. The offsets are learned from the preceding feature maps, via additional convolutional layers. Thus, the deformation is conditioned on the input features in a local, dense, and adaptive manner.

The second is deformable RoI pooling. It adds an offset to each bin position in the regular bin partition of the previous RoI pooling. Similarly, the offsets are learned from the preceding feature maps and the RoIs, enabling adaptive part localization for objects with different shapes.

Both modules are lightweight. They add a small amount of parameters and computation for the offset learning. They can readily replace their plain counterparts in deep CNNs and can be easily trained end-to-end with standard backpropagation. The resulting CNNs are called deformable convolutional networks, or deformable ConvNets.

Our approach shares a similar high level spirit with spatial transform networks and deformable part models. They all have internal transformation parameters and learn such parameters purely from data. A key difference in deformable ConvNets is that they deal with dense spatial transformations in a simple, efficient, deep and end-to-end manner. In Section 3.1, we discuss in detail the relation of our work to previous works and analyze the superiority of deformable ConvNets.

Deformable Convolutional Networks

The feature maps and convolution in CNNs are 3D. Both deformable convolution and RoI pooling modules operate on the 2D spatial domain. The operation remains the same across the channel dimension. Without loss of generality, the modules are described in 2D here for notation clarity. Extension to 3D is straightforward.

The 2D convolution consists of two steps: 1) sampling using a regular grid $\mathcal{R}$ over the input feature map $\mathbf{x}$; 2) summation of sampled values weighted by $\mathbf{w}$. The grid $\mathcal{R}$ defines the receptive field size and dilation. For example,

$$\mathcal{R}=\{(-1,-1),(-1,0),\ldots,(0,1),(1,1)\}$$

defines a $3\times 3$ kernel with dilation $1$.

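For each location $\mathbf{p}_{0}$ on the output feature map $\mathbf{y}$, we have

$$\mathbf{y}(\mathbf{p}_{0})=\sum_{\mathbf{p}_{n}\in\mathcal{R}}\mathbf{w}(\mathbf{p}_{n})\cdot\mathbf{x}(\mathbf{p}_{0}+\mathbf{p}_{n}),\tag{1}$$

where $\mathbf{p}_{n}$ enumerates the locations in $\mathcal{R}$.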
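As a minimal sketch of Eq. (1), the grid $\mathcal{R}$ and the weighted sum at a single output location can be written in plain Python. The function names `regular_grid` and `conv_at`, the `(dy, dx)` ordering of grid entries, and the zero padding at the border are illustrative assumptions, not details taken from the paper.

```python
def regular_grid(kernel_size=3, dilation=1):
    """Return the sampling grid R as a list of (dy, dx) integer offsets."""
    r = kernel_size // 2
    return [(dy * dilation, dx * dilation)
            for dy in range(-r, r + 1)
            for dx in range(-r, r + 1)]


def conv_at(x, w, p0, grid):
    """Eq. (1): y(p0) = sum_n w(p_n) * x(p0 + p_n).

    x is a 2D list indexed x[row][col], w holds one weight per grid entry,
    and locations outside x contribute zero (assumed zero padding).
    """
    height, width = len(x), len(x[0])
    total = 0.0
    for weight, (dy, dx) in zip(w, grid):
        py, px = p0[0] + dy, p0[1] + dx
        if 0 <= py < height and 0 <= px < width:
            total += weight * x[py][px]
    return total
```

With `dilation=1` the grid has nine entries running from `(-1, -1)` to `(1, 1)`, matching the example grid in the text.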

In deformable convolution, the regular grid $\mathcal{R}$ is augmented with offsets $\{\Delta\mathbf{p}_{n}\mid n=1,\ldots,N\}$, where $N=|\mathcal{R}|$. Eq. (1) becomes

$$\mathbf{y}(\mathbf{p}_{0})=\sum_{\mathbf{p}_{n}\in\mathcal{R}}\mathbf{w}(\mathbf{p}_{n})\cdot\mathbf{x}(\mathbf{p}_{0}+\mathbf{p}_{n}+\Delta\mathbf{p}_{n}).\tag{2}$$

Now, the sampling is on the irregular and offset locations $\mathbf{p}_{n}+\Delta\mathbf{p}_{n}$. As the offset $\Delta\mathbf{p}_{n}$ is typically fractional, Eq. (2) is implemented via bilinear interpolation as

$$\mathbf{x}(\mathbf{p})=\sum_{\mathbf{q}}G(\mathbf{q},\mathbf{p})\cdot\mathbf{x}(\mathbf{q}),\tag{3}$$

where $\mathbf{p}$ denotes an arbitrary (fractional) location ($\mathbf{p}=\mathbf{p}_{0}+\mathbf{p}_{n}+\Delta\mathbf{p}_{n}$ for Eq. (2)), $\mathbf{q}$ enumerates all integral spatial locations in the feature map $\mathbf{x}$, and $G(\cdot,\cdot)$ is the bilinear interpolation kernel. Note that $G$ is two dimensional. It is separated into two one dimensional kernels as

$$G(\mathbf{q},\mathbf{p})=g(q_{x},p_{x})\cdot g(q_{y},p_{y}),\tag{4}$$

where $g(a,b)=\max(0,1-|a-b|)$. Eq. (3) is fast to compute as $G(\mathbf{q},\mathbf{p})$ is non-zero only for a few $\mathbf{q}$s.
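The following is a minimal sketch of how Eqs. (2) to (4) fit together at one output location, assuming a 2D list indexed `x[row][col]`, offsets given as `(dy, dx)` pairs, and zero values outside the feature map. The names `bilinear_sample` and `deform_conv_at` are hypothetical and this is not the authors' implementation.

```python
import math


def g(a, b):
    """One dimensional bilinear kernel: g(a, b) = max(0, 1 - |a - b|)."""
    return max(0.0, 1.0 - abs(a - b))


def bilinear_sample(x, p):
    """Eq. (3): x(p) = sum_q G(q, p) * x(q) at a fractional p = (py, px).

    Only the (up to) four integer neighbours of p have a non-zero kernel
    value, so the sum over q is restricted to them.
    """
    height, width = len(x), len(x[0])
    py, px = p
    y0, x0 = math.floor(py), math.floor(px)
    value = 0.0
    for qy in (y0, y0 + 1):
        for qx in (x0, x0 + 1):
            if 0 <= qy < height and 0 <= qx < width:
                value += g(qy, py) * g(qx, px) * x[qy][qx]  # Eq. (4)
    return value


def deform_conv_at(x, w, p0, grid, offsets):
    """Eq. (2): y(p0) = sum_n w(p_n) * x(p0 + p_n + delta_p_n)."""
    total = 0.0
    for weight, (dy, dx), (oy, ox) in zip(w, grid, offsets):
        total += weight * bilinear_sample(x, (p0[0] + dy + oy, p0[1] + dx + ox))
    return total
```

When every offset is zero, `deform_conv_at` samples exactly the regular grid and reduces to the standard convolution of Eq. (1).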

As illustrated in Figure 2, the offsets are obtained by applying a convolutional layer over the same input feature map. The convolution kernel is of the same spatial resolution and dilation as those of the current convolutional layer (e.g., also $3\times 3$ with dilation 1 in Figure 2). The output offset fields have the same spatial resolution as the input feature map. The channel dimension $2N$ corresponds to $N$ 2D offsets. During training, both the convolutional kernels for generating the output features and the offsets are learned simultaneously. To learn the offsets, the gradients are back-propagated through the bilinear operations in Eq. (3) and Eq. (4). This is detailed in Appendix A.
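To make the channel count concrete, the sketch below computes $2N$ for a square kernel and splits the $2N$ offset values predicted at one spatial location into $N$ 2D offsets. The interleaved `(dy, dx)` channel order and the function names are assumptions made for illustration; the text does not specify how the channels are ordered.

```python
def offset_channels(kernel_size):
    """Number of offset channels 2N for a kernel_size x kernel_size kernel."""
    n = kernel_size * kernel_size  # N = |R|
    return 2 * n


def split_offsets(channels):
    """Group 2N values at one location into N (dy, dx) pairs.

    Assumes the channels are interleaved as dy_1, dx_1, dy_2, dx_2, ...
    """
    if len(channels) % 2 != 0:
        raise ValueError("expected an even number of offset channels")
    return [(channels[2 * n], channels[2 * n + 1])
            for n in range(len(channels) // 2)]
```

For a $3\times 3$ kernel this gives 18 offset channels, that is, nine 2D offsets per output location.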

2.2 Deformable RoI Pooling

RoI pooling is used in all region proposal based object detection methods. It converts an input rectangular region of arbitrary size into fixed size features.

RoI Pooling. Given the input feature map $\mathbf{x}$ and a RoI of size $w\times h$ and top-left corner $\mathbf{p}_{0}$, RoI pooling divides the RoI into $k\times k$ ($k$ is a free parameter) bins and outputs a $k\times k$ feature map $\mathbf{y}$. For the $(i,j)$-th bin ($0\leq i,j<k$), we have

$$\mathbf{y}(i,j)=\sum_{\mathbf{p}\in bin(i,j)}\mathbf{x}(\mathbf{p}_{0}+\mathbf{p})/n_{ij},\tag{5}$$

where $n_{ij}$ is the number of pixels in the bin. The $(i,j)$-th bin spans $\lfloor i\frac{w}{k}\rfloor\leq p_{x}<\lceil(i+1)\frac{w}{k}\rceil$ and $\lfloor j\frac{h}{k}\rfloor\leq p_{y}<\lceil(j+1)\frac{h}{k}\rceil$.
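The bin spans and the averaging in Eq. (5) can be sketched as follows, assuming an integer-sized RoI that lies fully inside the feature map, `p0 = (x0, y0)` as the top-left corner, and a feature map indexed `x[row][col]`. The names `bin_span` and `roi_pool` are illustrative.

```python
def bin_span(index, size, k):
    """Half-open range [floor(index*size/k), ceil((index+1)*size/k))."""
    start = (index * size) // k
    stop = -(-((index + 1) * size) // k)  # integer ceiling division
    return start, stop


def roi_pool(x, p0, w, h, k):
    """Eq. (5): average pooling over each of the k x k bins.

    Returns out[i][j] for bin (i, j), where i runs along the width (x)
    and j along the height (y), following the notation in the text.
    """
    x0, y0 = p0
    out = [[0.0] * k for _ in range(k)]
    for i in range(k):
        xs, xe = bin_span(i, w, k)
        for j in range(k):
            ys, ye = bin_span(j, h, k)
            vals = [x[y0 + py][x0 + px]
                    for py in range(ys, ye) for px in range(xs, xe)]
            out[i][j] = sum(vals) / len(vals)  # len(vals) is n_ij
    return out
```

Because of the floor and ceiling, neighbouring bins can overlap by one pixel when the RoI size is not divisible by $k$; for example, a width of 5 with $k=2$ gives spans `[0, 3)` and `[2, 5)`.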

Similarly to Eq. (2), in deformable RoI pooling, offsets $\{\Delta\mathbf{p}_{ij}\mid 0\leq i,j<k\}$ are added to the spatial binning positions. Eq. (5) becomes

$$\mathbf{y}(i,j)=\sum_{\mathbf{p}\in bin(i,j)}\mathbf{x}(\mathbf{p}_{0}+\mathbf{p}+\Delta\mathbf{p}_{ij})/n_{ij}.\tag{6}$$

Typically, $\Delta\mathbf{p}_{ij}$ is fractional. Eq. (6) is implemented by bilinear interpolation via Eq. (3) and (4).

Figure 3 illustrates how to obtain the offsets. First, RoI pooling (Eq. (5)) generates the pooled feature maps. From the maps, an fc layer generates the normalized offsets $\Delta\widehat{\mathbf{p}}_{ij}$, which are then transformed to the offsets $\Delta\mathbf{p}_{ij}$ in Eq. (6) by element-wise product with the RoI’s width and height, as $\Delta\mathbf{p}_{ij}=\gamma\cdot\Delta\widehat{\mathbf{p}}_{ij}\circ(w,h)$. Here $\gamma$ is a pre-defined scalar to modulate the magnitude of the offsets. It is empirically set to $\gamma=0.1$. The offset normalization is necessary to make the offset learning invariant to RoI size. The fc layer is learned by back-propagation, as detailed in Appendix A.
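A minimal sketch of the offset scaling $\Delta\mathbf{p}_{ij}=\gamma\cdot\Delta\widehat{\mathbf{p}}_{ij}\circ(w,h)$ and of one deformable bin from Eq. (6) is given below. It assumes normalized offsets ordered as `(x, y)`, an integer-sized RoI, a feature map indexed `x[row][col]`, and zero values outside the map; all function names are hypothetical.

```python
import math


def bilinear(x, px, py):
    """Bilinear sample of x at column px and row py; zero outside the map."""
    height, width = len(x), len(x[0])
    x0, y0 = math.floor(px), math.floor(py)
    value = 0.0
    for qy in (y0, y0 + 1):
        for qx in (x0, x0 + 1):
            if 0 <= qy < height and 0 <= qx < width:
                wx = max(0.0, 1.0 - abs(qx - px))
                wy = max(0.0, 1.0 - abs(qy - py))
                value += wx * wy * x[qy][qx]
    return value


def scale_offset(norm_offset, w, h, gamma=0.1):
    """delta_p = gamma * norm_offset (element-wise) * (w, h)."""
    return gamma * norm_offset[0] * w, gamma * norm_offset[1] * h


def deform_roi_bin(x, p0, w, h, k, i, j, norm_offset, gamma=0.1):
    """Eq. (6) for bin (i, j): average of bilinear samples at shifted positions."""
    dx, dy = scale_offset(norm_offset, w, h, gamma)
    xs, xe = (i * w) // k, -(-((i + 1) * w) // k)
    ys, ye = (j * h) // k, -(-((j + 1) * h) // k)
    vals = [bilinear(x, p0[0] + px + dx, p0[1] + py + dy)
            for py in range(ys, ye) for px in range(xs, xe)]
    return sum(vals) / len(vals)
```

Scaling by the RoI width and height means the same normalized offset moves a bin by the same fraction of the RoI regardless of its size, which is the invariance the text refers to.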

Position-Sensitive (PS) RoI Pooling. It is fully convolutional and different from RoI pooling. Through a conv layer, all the input feature maps are first converted to $k^{2}$ score maps for each object class ($C+1$ in total for $C$ object classes), as illustrated in the bottom branch in Figure 4. Without the need to distinguish between classes, such score maps are denoted as $\{\mathbf{x}_{i,j}\}$ where $(i,j)$ enumerates all bins. Pooling is performed on these score maps. The output value for the $(i,j)$-th bin is obtained by summation from one score map $\mathbf{x}_{i,j}$ corresponding to that bin. In short, the difference from RoI pooling in Eq. (5) is that a general feature map $\mathbf{x}$ is replaced by a specific position-sensitive score map $\mathbf{x}_{i,j}$.
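The only change relative to plain RoI pooling is which map each bin reads from. The sketch below illustrates this for a single class, assuming `score_maps[i][j]` holds the 2D score map $\mathbf{x}_{i,j}$ (indexed `[row][col]`), an integer-sized RoI inside the map, and the normalization by $n_{ij}$ from Eq. (5). The name `ps_roi_pool` is illustrative.

```python
def ps_roi_pool(score_maps, p0, w, h, k):
    """Position-sensitive RoI pooling for one class.

    Bin (i, j) is pooled only from its own score map score_maps[i][j],
    instead of from a shared feature map as in Eq. (5).
    """
    x0, y0 = p0
    out = [[0.0] * k for _ in range(k)]
    for i in range(k):
        xs, xe = (i * w) // k, -(-((i + 1) * w) // k)
        for j in range(k):
            ys, ye = (j * h) // k, -(-((j + 1) * h) // k)
            fmap = score_maps[i][j]
            vals = [fmap[y0 + py][x0 + px]
                    for py in range(ys, ye) for px in range(xs, xe)]
            out[i][j] = sum(vals) / len(vals)
    return out
```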

In deformable PS RoI pooling, the only change in Eq. (6) is that $\mathbf{x}$ is also modified to $\mathbf{x}_{i,j}$. However, the offset learning is different. It follows the “fully convolutional” spirit, as illustrated in Figure 4. In the top branch, a conv layer generates the full spatial resolution offset fields. For each RoI (also for each class), PS RoI pooling is applied on such fields to obtain normalized offsets $\Delta\widehat{\mathbf{p}}_{ij}$, which are then transformed to the real offsets $\Delta\mathbf{p}_{ij}$ in the same way as in deformable RoI pooling described above.

2.3 Deformable ConvNets

Both deformable convolution and RoI pooling modules have the same input and output as their plain versions. Hence, they can readily replace their plain counterparts in existing CNNs. During training, the added conv and fc layers for offset learning are initialized with zero weights. Their learning rates are set to $\beta$ times ($\beta=1$ by default, and $\beta=0.01$ for the fc layer in Faster R-CNN) the learning rate of the existing layers. They are trained via back-propagation through the bilinear interpolation operations in Eq. (3) and Eq. (4). The resulting CNNs are called deformable ConvNets.

To integrate deformable ConvNets with the state-of-the-art CNN architectures, we note that these architectures consist of two stages. First, a deep fully convolutional network generates feature maps over the whole input image. Second, a shallow task specific network generates results from the feature maps. We elaborate on the two stages below.

Deformable Convolution for Feature Extraction. We adopt two state-of-the-art architectures for feature extraction: ResNet-101 and a modified version of Inception-ResNet. Both are pre-trained on the ImageNet classification dataset.

The original Inception-ResNet is designed for image recognition. It has a feature misalignment issue and is problematic for dense prediction tasks. It is modified to fix the alignment problem. The modified version is dubbed “Aligned-Inception-ResNet” and is detailed in Appendix B.

Both models consist of several convolutional blocks, an average pooling and a 1000-way fc layer for ImageNet classification. The average pooling and the fc layers are removed. A randomly initialized $1\times 1$ convolution is added at the end to reduce the channel dimension to 1024. As in common practice, the effective stride in the last convolutional block is reduced from 32 pixels to 16 pixels to increase the feature map resolution. Specifically, at the beginning of the last block, the stride is changed from 2 to 1 (“conv5” for both ResNet-101 and Aligned-Inception-ResNet). To compensate, the dilation of all the convolution filters in this block (with kernel size $>1$) is changed from 1 to 2.
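The stride and dilation arithmetic can be checked with a short sketch. It assumes five downsampling stages of stride 2 each, which yields the effective stride of 32 mentioned above; the per-stage list and the function names are illustrative, not a description of the exact layer layout.

```python
def effective_stride(stage_strides):
    """Product of per-stage strides: input pixels per feature-map cell."""
    stride = 1
    for s in stage_strides:
        stride *= s
    return stride


def tap_spacing(stage_strides, dilation):
    """Distance in input pixels between adjacent taps of a dilated filter
    that operates on the output resolution of the given stages."""
    return effective_stride(stage_strides) * dilation
```

With strides `[2, 2, 2, 2, 2]` and dilation 1 the filter taps are 32 input pixels apart; with the last stride set to 1 and dilation 2 they are still 32 pixels apart, while the feature map is twice as dense. This is the sense in which the larger dilation compensates for the removed stride.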

Optionally, deformable convolution is applied to the last few convolutional layers (with kernel size $>1$). We experimented with different numbers of such layers and found 3 to be a good trade-off for different tasks, as reported in Table 1.

Segmentation and Detection Networks. A task specific network is built upon the output feature maps from the feature extraction network mentioned above.

Below, $C$ denotes the number of object classes.

DeepLab is a state-of-the-art method for semantic segmentation. It adds a $1\times 1$ convolutional layer over the feature maps to generate $(C+1)$ maps that represent the per-pixel classification scores. A following softmax layer then outputs the per-pixel probabilities.

Category-Aware RPN is almost the same as the standard region proposal network, except that the 2-class (object or not) convolutional classifier is replaced by a $(C+1)$-class convolutional classifier. It can be considered as a simplified version of SSD.

Faster R-CNN is the state-of-the-art detector. In our implementation, the RPN branch is added on top of the conv4 block, following prior work. In the previous practice, the RoI pooling layer is inserted between the conv4 and the conv5 blocks in ResNet-101, leaving 10 layers for each RoI. This design achieves good accuracy but has high per-RoI computation. Instead, we adopt a simplified design. The RoI pooling layer is added at the end (the last $1\times 1$ dimension reduction layer is changed to output 256-D features). On top of the pooled RoI features, two fc layers of dimension 1024 are added, followed by the bounding box regression and the classification branches. Although such simplification (from the 10 layer conv5 block to 2 fc layers) would slightly decrease the accuracy, it still makes a strong enough baseline and is not a concern in this work.

Optionally, the RoI pooling layer can be changed to deformable RoI pooling.

R-FCN is another state-of-the-art detector. It has negligible per-RoI computation cost. We follow the original implementation. Optionally, its RoI pooling layer can be changed to deformable position-sensitive RoI pooling.

Understanding Deformable ConvNets

This work is built on the idea of augmenting the spatial sampling locations in convolution and RoI pooling with additional offsets and learning the offsets from target tasks.

When deformable convolutions are stacked, the effect of the composited deformation is profound. This is exemplified in Figure 5. The receptive field and the sampling locations in the standard convolution are fixed all over the top feature map (left). They are adaptively adjusted according to the objects’ scale and shape in deformable convolution (right). More examples are shown in Figure 6. Table 2 provides quantitative evidence of such adaptive deformation.

The effect of deformable RoI pooling is similar, as illustrated in Figure 7. The regularity of the grid structure in standard RoI pooling no longer holds. Instead, parts deviate from the RoI bins and move onto the nearby object foreground regions. The localization capability is enhanced, especially for non-rigid objects.

Our work is related to previous works in different aspects. We discuss the relations and differences in detail.

Spatial Transform Networks (STN). It is the first work to learn spatial transformation from data in a deep learning framework. It warps the feature map via a global parametric transformation such as affine transformation. Such warping is expensive and learning the transformation parameters is known to be difficult. STN has shown successes in small scale image classification problems. The inverse STN method replaces the expensive feature warping by efficient transformation parameter propagation.

The offset learning in deformable convolution can be considered as an extremely lightweight spatial transformer in STN. However, deformable convolution does not adopt a global parametric transformation and feature warping. Instead, it samples the feature map in a local and dense manner. To generate new feature maps, it has a weighted summation step, which is absent in STN.

Deformable convolution is easy to integrate into any CNN architecture. Its training is easy. It is shown to be effective for complex vision tasks that require dense (e.g., semantic segmentation) or semi-dense (e.g., object detection) predictions. These tasks are difficult (if not infeasible) for STN.

Active Convolution. This work is contemporary. It also augments the sampling locations in the convolution with offsets and learns the offsets via back-propagation end-to-end. It is shown to be effective on image classification tasks.

Two crucial differences from deformable convolution make this work less general and adaptive. First, it shares the offsets all over the different spatial locations. Second, the offsets are static model parameters that are learnt per task or per training. In contrast, the offsets in deformable convolution are dynamic model outputs that vary per image location. They model the dense spatial transformations in the images and are effective for (semi-)dense prediction tasks such as object detection and semantic segmentation.

Effective Receptive Field. It finds that not all pixels in a receptive field contribute equally to an output response. The pixels near the center have much larger impact. The effective receptive field only occupies a small fraction of the theoretical receptive field and has a Gaussian distribution. Although the theoretical receptive field size increases linearly with the number of convolutional layers, a surprising result is that the effective receptive field size increases linearly with the square root of that number, and therefore at a much slower rate than one would expect.
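The two growth rates can be compared with a small sketch. It assumes a stack of stride-1, undilated convolutions with a fixed kernel size, for which the theoretical receptive field is $1+n(k-1)$ after $n$ layers; the effective receptive field is represented only up to an unknown constant as $\sqrt{n}$, since the text gives the scaling but not the constant. Both function names are illustrative.

```python
import math


def theoretical_rf(num_layers, kernel_size=3):
    """Theoretical receptive field of stacked stride-1, undilated convs."""
    return 1 + num_layers * (kernel_size - 1)


def effective_rf_scale(num_layers):
    """Relative scale of the effective receptive field (no calibrated constant)."""
    return math.sqrt(num_layers)
```

Going from 4 to 16 layers multiplies the theoretical size by roughly 3.7 (from 9 to 33) but the effective scale only by 2.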

This finding indicates that even the top layer’s unit in deep CNNs may not have a large enough receptive field. This partially explains why atrous convolution is widely used in vision tasks (see below). It indicates the need for adaptive receptive field learning.

Deformable convolution is capable of learning receptive fields adaptively, as shown in Figures 5 and 6 and Table 2.

Atrous convolution. It increases a normal filter’s stride to be larger than 1 and keeps the original weights at sparsified sampling locations. This increases the receptive field size and retains the same complexity in parameters and computation. It has been widely used for semantic segmentation (where it is also called dilated convolution), object detection, and image classification.

Deformable convolution is a generalization of atrous convolution, as easily seen in Figure 1 (c). Extensive comparison to atrous convolution is presented in Table 3.
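One way to see the generalization is that a dilated grid equals the dilation-1 grid plus fixed, input-independent offsets $\Delta\mathbf{p}_{n}=(d-1)\,\mathbf{p}_{n}$. The sketch below checks this identity; the function names and the `(dy, dx)` ordering are assumptions for illustration.

```python
def regular_grid(kernel_size=3, dilation=1):
    """Sampling grid R as a list of (dy, dx) integer offsets."""
    r = kernel_size // 2
    return [(dy * dilation, dx * dilation)
            for dy in range(-r, r + 1)
            for dx in range(-r, r + 1)]


def atrous_offsets(kernel_size, dilation):
    """Offsets that turn the dilation-1 grid into a grid with the given dilation."""
    return [((dilation - 1) * dy, (dilation - 1) * dx)
            for dy, dx in regular_grid(kernel_size, 1)]


def apply_offsets(grid, offsets):
    """Sampling locations p_n + delta_p_n."""
    return [(dy + oy, dx + ox) for (dy, dx), (oy, ox) in zip(grid, offsets)]
```

Atrous convolution fixes these offsets by hand for every location, whereas deformable convolution predicts them from the input features.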

Deformable Part Models (DPM). Deformable RoI pooling is similar to DPM because both methods learn the spatial deformation of object parts to maximize the classification score. Deformable RoI pooling is simpler since no spatial relations between the parts are considered.

DPM is a shallow model and has limited capability of modeling deformation. While its inference algorithm can be converted to CNNs by treating the distance transform as a special pooling operation, its training is not end-to-end and involves heuristic choices such as selection of components and part sizes. In contrast, deformable ConvNets are deep and perform end-to-end training. When multiple deformable modules are stacked, the capability of modeling deformation becomes stronger.

DeepID-Net. It introduces a deformation constrained pooling layer which also considers part deformation for object detection. It therefore shares a similar spirit with deformable RoI pooling, but is much more complex. This work is highly engineered and based on RCNN. It is unclear how to adapt it to the recent state-of-the-art object detection methods in an end-to-end manner.

Spatial manipulation in RoI pooling. Spatial pyramid pooling uses hand crafted pooling regions over scales. It is the predominant approach in computer vision and is also used in deep learning based object detection.

Learning the spatial layout of pooling regions has received little study. One prior work learns a sparse subset of pooling regions from a large over-complete set. The large set is hand engineered and the learning is not end-to-end.

Deformable RoI pooling is the first to learn pooling regions end-to-end in CNNs. While the regions are of the same size currently, extension to multiple sizes as in spatial pyramid pooling is straightforward.

Transformation invariant features and their learning. There have been tremendous efforts on designing transformation invariant features. Notable examples include scale invariant feature transform (SIFT) and ORB (O for orientation). There is a large body of such works in the context of CNNs. The invariance and equivalence of CNN representations to image transformations have also been studied. Some works learn invariant CNN representations with respect to different types of transformations, such as scattering networks, convolutional jungles, and TI-pooling. Some works are devoted to specific transformations such as symmetry, scale, and rotation.

As analyzed in Section 1, in these works the transformations are known a priori. The knowledge (such as parameterization) is used to hand craft the structure of the feature extraction algorithm, either fixed, as in SIFT, or with learnable parameters, as in those based on CNNs. They cannot handle unknown transformations in new tasks.

In contrast, our deformable modules generalize various transformations (see Figure 1). The transformation invariance is learned from the target task.

Dynamic Filter. Similar to deformable convolution, the dynamic filters are also conditioned on the input features and change over samples. In contrast to our approach, only the filter weights are learned, not the sampling locations. This work is applied to video and stereo prediction.

Combination of low level filters. Gaussian filters and their smooth derivatives are widely used to extract low level image structures such as corners, edges, T-junctions, etc. Under certain conditions, such filters form a basis and their linear combination forms new filters within the same group of geometric transformations, such as multiple orientations in Steerable Filters, or multiple scales. We note that although the term deformable kernels has been used in this literature, its meaning is different from ours in this work.

Most CNNs learn all their convolution filters from scratch. Recent work shows that this could be unnecessary. It replaces the free form filters by a weighted combination of low level filters (Gaussian derivatives up to 4-th order) and learns the weight coefficients. The regularization over the filter function space is shown to improve the generalization ability when training data are small.

The above works are related to ours in that, when multiple filters, especially with different scales, are combined, the resulting filter could have complex weights and resemble our deformable convolution filter. However, deformable convolution learns sampling locations instead of filter weights.

Experiments

Semantic Segmentation. We use PASCAL VOC and Cityscapes. For PASCAL VOC, there are 20 semantic categories. Following common protocols, we use the VOC 2012 dataset and additional mask annotations. The training set includes 10,582 images. Evaluation is performed on 1,449 images in the validation set. For Cityscapes, following prior protocols, training and evaluation are performed on 2,975 images in the train set and 500 images in the validation set, respectively. There are 19 semantic categories plus a background category.

For evaluation, we use the mean intersection-over-union (mIoU) metric defined over image pixels, following the standard protocols. We use mIoU@V and mIoU@C for PASCAL VOC and Cityscapes, respectively.

In training and inference, the images are resized to have a shorter side of 360 pixels for PASCAL VOC and 1,024 pixels for Cityscapes. In SGD training, one image is randomly sampled in each mini-batch. A total of 30k and 45k iterations are performed for PASCAL VOC and Cityscapes, respectively, with 8 GPUs and one mini-batch on each. The learning rates are $10^{-3}$ and $10^{-4}$ in the first $\frac{2}{3}$ and the last $\frac{1}{3}$ of the iterations, respectively.
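The step schedule described above can be sketched as a small function. The name `learning_rate`, the zero-based iteration index, and the choice to switch exactly at the two-thirds boundary are illustrative assumptions.

```python
def learning_rate(iteration, total_iterations, lr_first=1e-3, lr_last=1e-4):
    """Return lr_first for the first 2/3 of training and lr_last afterwards.

    iteration is zero-based; the switch happens at floor(2 * total / 3).
    """
    boundary = (2 * total_iterations) // 3
    return lr_first if iteration < boundary else lr_last
```

For the 30k-iteration PASCAL VOC setting, the rate drops at iteration 20,000; for the 45k-iteration Cityscapes setting, at iteration 30,000.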

Object Detection. We use PASCAL VOC and COCO datasets. For PASCAL VOC, following the usual protocol, training is performed on the union of VOC 2007 trainval and VOC 2012 trainval. Evaluation is on VOC 2007 test. For COCO, following the standard protocol, training and evaluation are performed on the 120k images in the trainval and the 20k images in the test-dev, respectively.

For evaluation, we use the standard mean average precision (mAP) scores. For PASCAL VOC, we report mAP scores using IoU thresholds at 0.5 and 0.7. For COCO, we use the standard COCO metric of mAP@[0.5:0.95], as well as mAP@0.5.
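The IoU thresholds refer to the overlap between a predicted box and a ground-truth box. A minimal sketch of that overlap is given below, assuming boxes as `(x1, y1, x2, y2)` in continuous coordinates; benchmark toolkits may use slightly different pixel conventions, and the function name is illustrative.

```python
def box_iou(a, b):
    """Intersection-over-union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```

A detection whose IoU with the ground truth is 0.6 would count as correct at the 0.5 threshold but not at the stricter 0.7 threshold, which is why mAP@0.7 rewards more precise localization.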

In training and inference, the images are resized to have a shorter side of 600 pixels. In SGD training, one image is randomly sampled in each mini-batch. For class-aware RPN, 256 RoIs are sampled from the image. For Faster R-CNN and R-FCN, 256 and 128 RoIs are sampled for the region proposal and the object detection networks, respectively. $7\times 7$ bins are adopted in RoI pooling. To facilitate the ablation experiments on VOC, we utilize pre-trained and fixed RPN proposals for the training of Faster R-CNN and R-FCN, without feature sharing between the region proposal and the object detection networks. The RPN network is trained separately, as a first stage. For COCO, joint training is performed and feature sharing is enabled for training. A total of 30k and 240k iterations are performed for PASCAL VOC and COCO, respectively, on 8 GPUs. The learning rates are set as $10^{-3}$ and $10^{-4}$ in the first $\frac{2}{3}$ and the last $\frac{1}{3}$ of the iterations, respectively.

4.2 Ablation Study

Extensive ablation studies are performed to validate the efficacy and efficiency of our approach.

Deformable Convolution. Table 1 evaluates the effect of deformable convolution using the ResNet-101 feature extraction network. Accuracy steadily improves when more deformable convolution layers are used, especially for DeepLab and class-aware RPN. The improvement saturates when using 3 deformable layers for DeepLab, and 6 for others. In the remaining experiments, we use 3 in the feature extraction networks.

We empirically observed that the learned offsets in the deformable convolution layers are highly adaptive to the image content, as illustrated in Figure 5 and Figure 6. To better understand the mechanism of deformable convolution, we define a metric called effective dilation for a deformable convolution filter. It is the mean of the distances between all adjacent pairs of sampling locations in the filter. It is a rough measure of the receptive field size of the filter.
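The effective dilation metric can be sketched as follows. The sketch assumes that the sampling locations of a $k\times k$ filter are stored in row-major order, that "adjacent pairs" means horizontal and vertical neighbours in the kernel layout, and that distance is Euclidean; the text does not spell out these details, so they are illustrative choices, as is the function name.

```python
import math


def effective_dilation(points, kernel_size=3):
    """Mean distance between adjacent sampling locations of one filter.

    points is a row-major list of (y, x) sampling locations, i.e. the
    regular grid positions plus the learned offsets.
    """
    k = kernel_size
    dists = []
    for r in range(k):
        for c in range(k):
            y, x = points[r * k + c]
            if c + 1 < k:  # right neighbour in the kernel layout
                y2, x2 = points[r * k + c + 1]
                dists.append(math.hypot(y2 - y, x2 - x))
            if r + 1 < k:  # lower neighbour in the kernel layout
                y2, x2 = points[(r + 1) * k + c]
                dists.append(math.hypot(y2 - y, x2 - x))
    return sum(dists) / len(dists)
```

Under these assumptions an undeformed grid with dilation $d$ has effective dilation exactly $d$, so the metric can be read on the same scale as an ordinary dilation value.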

We apply the R-FCN network with 3 deformable layers (as in Table 1) on VOC 2007 test images. We categorize the deformable convolution filters into four classes: small, medium, large, and background, according to the ground truth bounding box annotation and where the filter center is. Table 2 reports the statistics (mean and std) of the effective dilation values. It clearly shows that: 1) the receptive field sizes of deformable filters are correlated with object sizes, indicating that the deformation is effectively learned from image content; 2) the filter sizes on the background region are between those on medium and large objects, indicating that a relatively large receptive field is necessary for recognizing the background regions. These observations are consistent in different layers.

The default ResNet-101 model uses atrous convolution with dilation 2 for the last three $3\times 3$ convolutional layers (see Section 2.3). We further tried dilation values 4, 6, and 8 and reported the results in Table 3. It shows that: 1) accuracy increases for all tasks when using larger dilation values, indicating that the default networks have too small receptive fields; 2) the optimal dilation values vary for different tasks, e.g., 6 for DeepLab but 4 for Faster R-CNN; 3) deformable convolution has the best accuracy. These observations verify that adaptive learning of filter deformation is effective and necessary.

Deformable RoI Pooling. It is applicable to Faster R-CNN and R-FCN. As shown in Table 3, using it alone already produces noticeable performance gains, especially at the strict mAP@0.7 metric. When both deformable convolution and RoI Pooling are used, significant accuracy improvements are obtained.

Model Complexity and Runtime. Table 4 reports the model complexity and runtime of the proposed deformable ConvNets and their plain versions. Deformable ConvNets add only a small overhead in model parameters and computation. This indicates that the significant performance improvement comes from the capability of modeling geometric transformations, rather than from increased model parameters.

4.3 Object Detection on COCO

In Table 5, we perform extensive comparison between the deformable ConvNets and the plain ConvNets for object detection on the COCO test-dev set. We first experiment using the ResNet-101 model. The deformable versions of class-aware RPN, Faster R-CNN and R-FCN achieve mAP@[0.5:0.95] scores of 25.8%, 33.1%, and 34.5% respectively, which are 11%, 13%, and 12% relatively higher than their plain-ConvNets counterparts respectively. When ResNet-101 is replaced with Aligned-Inception-ResNet in Faster R-CNN and R-FCN, their plain-ConvNet baselines both improve thanks to the more powerful feature representations, and the performance gains brought by deformable ConvNets also hold. By further testing on multiple image scales (the image shorter side is in ) and performing iterative bounding box averaging, the mAP@[0.5:0.95] score is increased to 37.5% for the deformable version of R-FCN. Note that the performance gain of deformable ConvNets is complementary to these bells and whistles.

Conclusion

This paper presents deformable ConvNets, a simple, efficient, deep, and end-to-end solution to model dense spatial transformations. For the first time, we show that it is feasible and effective to learn dense spatial transformation in CNNs for sophisticated vision tasks, such as object detection and semantic segmentation.

The Aligned-Inception-ResNet model was trained and investigated by Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun in unpublished work.

Appendix A Deformable Convolution/RoI Pooling Back-propagation

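In the deformable convolution Eq. (2), the gradient w.r.t. the offset $\Delta\mathbf{p}_{n}$ is computed as

$$\frac{\partial\mathbf{y}(\mathbf{p}_{0})}{\partial\Delta\mathbf{p}_{n}}=\sum_{\mathbf{p}_{n}\in\mathcal{R}}\mathbf{w}(\mathbf{p}_{n})\cdot\frac{\partial\mathbf{x}(\mathbf{p}_{0}+\mathbf{p}_{n}+\Delta\mathbf{p}_{n})}{\partial\Delta\mathbf{p}_{n}}=\sum_{\mathbf{p}_{n}\in\mathcal{R}}\left[\mathbf{w}(\mathbf{p}_{n})\cdot\sum_{\mathbf{q}}\frac{\partial G(\mathbf{q},\mathbf{p}_{0}+\mathbf{p}_{n}+\Delta\mathbf{p}_{n})}{\partial\Delta\mathbf{p}_{n}}\mathbf{x}(\mathbf{q})\right],\tag{7}$$

where the term $\frac{\partial G(\mathbf{q},\mathbf{p}_{0}+\mathbf{p}_{n}+\Delta\mathbf{p}_{n})}{\partial\Delta\mathbf{p}_{n}}$ can be derived from Eq. (4). Note that the offset $\Delta\mathbf{p}_{n}$ is 2D and we use $\partial\Delta\mathbf{p}_{n}$ to denote $\partial\Delta p_{n}^{x}$ and $\partial\Delta p_{n}^{y}$ for simplicity.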
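The derivative of the bilinear kernel follows from $g(a,b)=\max(0,1-|a-b|)$: for $0<|a-b|<1$ it is $\partial g/\partial b=\operatorname{sign}(a-b)$, and it is zero for $|a-b|\geq 1$. The sketch below uses this to compute the gradient of a bilinearly sampled value with respect to the sampling location. Setting the derivative to zero at the kink $a=b$, the zero padding, and the function names are illustrative assumptions.

```python
import math


def g(a, b):
    """One dimensional bilinear kernel."""
    return max(0.0, 1.0 - abs(a - b))


def dg_db(a, b):
    """Derivative of g(a, b) w.r.t. b (taken as 0 at the kink a == b)."""
    d = a - b
    if d == 0 or abs(d) >= 1:
        return 0.0
    return 1.0 if d > 0 else -1.0


def bilinear_grad(x, p):
    """Return (d x(p) / d p_y, d x(p) / d p_x) for p = (p_y, p_x).

    Differentiates x(p) = sum_q g(q_y, p_y) * g(q_x, p_x) * x(q) term by term.
    """
    height, width = len(x), len(x[0])
    py, px = p
    y0, x0 = math.floor(py), math.floor(px)
    grad_y = grad_x = 0.0
    for qy in (y0, y0 + 1):
        for qx in (x0, x0 + 1):
            if 0 <= qy < height and 0 <= qx < width:
                grad_y += dg_db(qy, py) * g(qx, px) * x[qy][qx]
                grad_x += g(qy, py) * dg_db(qx, px) * x[qy][qx]
    return grad_y, grad_x
```

Since $\mathbf{p}=\mathbf{p}_{0}+\mathbf{p}_{n}+\Delta\mathbf{p}_{n}$, the derivative with respect to the offset equals the derivative with respect to $\mathbf{p}$, so this is the inner term that Eq. (7) weights by $\mathbf{w}(\mathbf{p}_{n})$.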

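Similarly, in the deformable RoI Pooling module, the gradient w.r.t. the offset $\Delta\mathbf{p}_{ij}$ can be computed by

$$\frac{\partial\mathbf{y}(i,j)}{\partial\Delta\mathbf{p}_{ij}}=\frac{1}{n_{ij}}\sum_{\mathbf{p}\in bin(i,j)}\frac{\partial\mathbf{x}(\mathbf{p}_{0}+\mathbf{p}+\Delta\mathbf{p}_{ij})}{\partial\Delta\mathbf{p}_{ij}}=\frac{1}{n_{ij}}\sum_{\mathbf{p}\in bin(i,j)}\left[\sum_{\mathbf{q}}\frac{\partial G(\mathbf{q},\mathbf{p}_{0}+\mathbf{p}+\Delta\mathbf{p}_{ij})}{\partial\Delta\mathbf{p}_{ij}}\mathbf{x}(\mathbf{q})\right].\tag{8}$$

The gradient w.r.t. the normalized offsets $\Delta\widehat{\mathbf{p}}_{ij}$ can be easily obtained by computing derivatives in $\Delta\mathbf{p}_{ij}=\gamma\cdot\Delta\widehat{\mathbf{p}}_{ij}\circ(w,h)$.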
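Written out per component, this last step is a direct application of the chain rule to the element-wise scaling. The superscripts $x$ and $y$ follow the component notation used above for $\Delta\mathbf{p}_{n}$; applying it to $\Delta\mathbf{p}_{ij}$ is an assumption of this sketch.

```latex
\Delta p_{ij}^{x} = \gamma\, w\, \Delta\widehat{p}_{ij}^{\,x}, \qquad
\Delta p_{ij}^{y} = \gamma\, h\, \Delta\widehat{p}_{ij}^{\,y}
\quad\Longrightarrow\quad
\frac{\partial \mathbf{y}(i,j)}{\partial \Delta\widehat{p}_{ij}^{\,x}}
  = \gamma\, w\, \frac{\partial \mathbf{y}(i,j)}{\partial \Delta p_{ij}^{x}}, \qquad
\frac{\partial \mathbf{y}(i,j)}{\partial \Delta\widehat{p}_{ij}^{\,y}}
  = \gamma\, h\, \frac{\partial \mathbf{y}(i,j)}{\partial \Delta p_{ij}^{y}}
```

The gradient reaching the fc layer is therefore the gradient with respect to the real offsets scaled by $\gamma w$ and $\gamma h$.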

Appendix B Details of Aligned-Inception-ResNet

In the original Inception-ResNet architecture, multiple layers of valid convolution/pooling are utilized, which brings feature alignment issues for dense prediction tasks. For a cell on the feature maps close to the output, its projected spatial location on the image is not aligned with the location of its receptive field center. Meanwhile, the task specific networks are usually designed under the alignment assumption. For example, in the prevalent FCNs for semantic segmentation, the features from a cell are leveraged to predict the pixel’s label at the corresponding projected image location.

To remedy this issue, the network architecture is modified; the resulting model, called “Aligned-Inception-ResNet”, is shown in Table 6. When the feature dimension changes, a $1\times 1$ convolution layer with stride 2 is utilized. There are two main differences between Aligned-Inception-ResNet and the original Inception-ResNet. First, Aligned-Inception-ResNet does not have the feature alignment problem, thanks to proper padding in convolutional and pooling layers. Second, Aligned-Inception-ResNet consists of repetitive modules, whose design is simpler than the original Inception-ResNet architectures.

The Aligned-Inception-ResNet model is pre-trained on ImageNet-1K classification. The training procedure follows prior work. Table 7 reports the model complexity and the top-1 and top-5 classification errors.

References