Resolution learning in deep convolutional networks using scale-space theory

Silvia L. Pintea, Nergis Tomen, Stanley F. Goes, Marco Loog, Jan C. van Gemert

I Introduction

Resolution defines the inner scale at which objects should be observed in an image. To control the resolution in a network, one can change the filter sizes or the feature map sizes. Because there is a maximum frequency that can be encoded in a limited spatial extent, the filter sizes and feature map sizes define a lower bound on the resolution encoded in the network. CNNs typically use small filters of $3\times 3$ px or $5\times 5$ px, so the first layers are forced to look at detailed, local image neighborhoods such as edges, blobs, and corners. As the network deepens, each subsequent convolution increases the receptive field size linearly, allowing the network to combine the detailed responses of the previous layer into textures and object parts. Going even deeper, strategically placed memory-efficient subsampling operations reduce feature maps to half their size, which is equivalent to increasing the receptive field multiplicatively. At the deepest layers, the receptive field spans a large portion of the image and objects emerge as combinations of their parts. The resolution, as controlled by the sizes of the receptive field and feature maps, is one of the fundamental aspects of CNNs.

In modern CNN architectures, the resolution is a hyper-parameter that has to be manually tuned using expert knowledge, by changing the filter sizes or the subsampling layers. For example, the popular ResNeXt for the ImageNet dataset starts with a $7\times 7$ px filter, followed by $3\times 3$ px and $1\times 1$ px convolutions, where the feature maps are subsampled 6 times. The same network on the CIFAR-10 dataset exclusively uses $3\times 3$ px convolutions and the feature maps are subsampled 2 times. Hard-coding the resolution hyper-parameters in the network for different datasets affects the extent of the receptive field, and the specific choices made can be restrictive.

In this paper we propose the N-JetNet, which replaces the CNN design choice of filter sizes by learning them. We make use of scale-space theory, where the resolution is modeled by the $\sigma$ parameter of the Gaussian function family and its derivatives. Gaussian derivatives allow a truncated Taylor series, called the N-Jet, to model a convolutional filter as a linear combination of Gaussian derivative filters, each weighted by an $\alpha_{i}$. We optimize these $\alpha$ weights instead of individual weights for each pixel in the filter, as done in a standard CNN. The choice of a basis cannot be avoided. In standard CNNs the choice is implicit: an $N\times N$ pixel basis, whose size cannot be optimized because the error has no well-defined derivative with respect to it. In contrast, in the N-Jet model the basis consists of Gaussian derivatives, where the $\sigma$ parameter controls both the resolution and the filter size, and the effective filter, and therefore the error, has a well-defined derivative with respect to $\sigma$. This formulation allows the network to learn $\sigma$ and thus the network resolution. We exemplify our approach in Fig. 1.

To avoid confusion, we adopt the following naming conventions throughout the paper: ‘resolution’ is the inner scale; ‘size’ is the outer scale, denoting the number of pixels of a filter or a feature map; and ‘scale’ is the parameter controlling the resolution, which is the standard deviation $\sigma$ of the Gaussian basis. The scale is different from the size of a filter: one can blur a filter and change its scale without necessarily changing its size. However, the two are related, as increasing the scale of an object (i.e. blurring) increases its size in the image (i.e. the number of pixels it occupies). Here we tie the filter size to the scale parameter by making it a function of $\sigma$.

We make the following contributions. (i) We exploit the multi-scale local jet for automatically learning the scale parameter, $\sigma$. (ii) We show both for classification and segmentation that our proposed N-Jet model automatically learns the appropriate input resolution from the data. (iii) We demonstrate that our approach generalizes over network architectures and datasets without deteriorating accuracy for both classification and segmentation.

II Related work

Multiple scales and sizes in the network. Size plays an important role in CNNs. The highly successful inception architecture uses two filter sizes per layer. Multiple input sizes can be weighted per layer, integrated at the feature map level, processed at the same time, or even made to compete with each other. To process multiple feature map sizes, spatial pyramids are used; alternatively, the best input size and network resolution can be selected over a validation set. Scale-equivariant CNNs can be obtained by applying each filter at multiple sizes, or by approximating filters with Gaussian basis combinations where the set of scale parameters is not learned, but fixed. Unlike these works, we do not explicitly process our feature maps over a set of predefined fixed sizes. We learn a single scale parameter per layer from the data.

Downsampling and upsampling can be modeled as a bijective function, or made adaptive using reinforcement learning and contextual information at the object boundaries. The optimal size for processing an input, giving the maximum classification confidence, can be selected among multiple sizes, learned by mimicking the human visual focus, or found by minimizing the entropy over multiple input sizes at inference time. Network architecture search can also be used for learning the resolution, at the cost of increased computation. Alternatively, the scale distribution can be adapted per image using dynamic gates, or by using self-attentive memory gates. Atrous or dilated convolutions provide fixed, larger receptive fields without subsampling the image. These are extended to adaptive dilation factors learned through a sub-network. Rather than only learning the filter size, we learn both the filter shape and the size jointly, by relying on scale-space theory.

Architectures accommodating subsampling. A pooling operation groups features together before subsampling. Popular forms of grouping are average pooling and max pooling. Average pooling tends to perform worse than max pooling, which is outperformed by their combination. Other forms include pooling based on ranking, spatial pyramid pooling, spectral pooling, stochastic pooling, and stochastic subsampling. The recent BlurPool avoids aliasing effects when sampling, while fractional pooling subsamples with a factor of $\sqrt{2}$ instead of 2, which allows larger feature maps to be used in more network layers. All these pooling methods use hard-coded feature map subsampling. Our work differs, as we do not use fixed subsampling or strided convolution: we learn the resolution.

Fixed basis approximations. Resolution in images is aptly modeled by scale-space theory. This is achieved by convolving the image with filters of increasing scale, removing finer details at higher scales. Convolving with a Gaussian filter has the property of not introducing any artifacts, and the differential structure of images can be probed with Gaussian derivative filters, which form the N-Jet: a complete and stable basis to locally approximate any realistic image. Scale-spaces model images at different resolutions by a continuous one-parameter family of smoothed images, parametrized by the value of $\sigma$ of the Gaussian filter. In this paper we build on scale-space theory and exploit the differential structure of images to optimize $\sigma$ and thus learn the resolution.

Various mathematical multi-scale image modeling tools have been used in convolutional networks. The classical work of Simoncelli et al. proposes the steerable pyramid, defining a set of wavelets for orientation and scale invariance. Similarly, the seminal Scattering transform and its extensions are based on carefully designed complex wavelet basis filters with pre-defined rotations and scales, giving excellent results on uniform datasets such as MNIST and textures. Using the Scattering transform as initialization for the first few layers of a CNN has recently been shown to also lead to good results on more varied datasets. Filters can also be approximated as a linear combination over a set of learned low-rank basis filters. Recent work also starts with a filter basis and uses a CNN to learn the filter weights. Examples include a PCA basis, circular harmonics, Gabors, and Gaussian derivatives. In this paper we build on the Gaussian derivative basis because it directly offers the tools of Gaussian scale-space to learn CNN resolution.

Learning kernel shape. Current methods investigate inherent properties of CNN filters. Filters that go beyond convolution include non-linear Volterra kernels, a learned image-adaptive bilateral filter, and learned image processing operations. For convolutional CNN filters, Sun et al. propose an asymmetric kernel shape, which simulates hexagonal lattices and leads to improved results. The active convolution by Jeon and Kim and the deformable CNNs by Dai et al. offer an elegant approach to learn a spatial offset for each filter coefficient, leading to flexible filters and improved accuracy. Continuous filters can also be learned as functions over sub-pixel coordinates, allowing learnable resizing of the feature maps. The hierarchical auto-zoom net, the scale proposal network, and the recurrent scale approximation network explicitly predict the object sizes and adapt the input size accordingly. Our work differs from all these methods because we learn both the filter shape and the size.

Most similar to our work is the combination of free-form filters with learned Gaussian kernels that can adapt the receptive field size. The recent work of Lindeberg et al. uses Gaussian derivatives for scale-invariance; however, the scales are fixed according to a geometric distribution. Unlike these works, we approximate the complete filter using a combination of Gaussian derivatives, while adapting the receptive field size.

III Learning network resolution

Scale-spaces offer a general framework for modeling image structures at various scales. The resolution, or the inner scale of an image, is modeled by a convolution with a 2D Gaussian. The 1D Gaussian at scale $\sigma$ is given by $G(x;\sigma)=\frac{1}{\sigma\sqrt{2\pi}}e^{\frac{-x^{2}}{2\sigma^{2}}}$, which is readily extended to 2D as $G(x,y;\sigma)=G(x;\sigma)\,G(y;\sigma)$. The local structure learned in deep networks is linked to the image derivatives. Image pixels are discretely measured and do not directly offer derivatives. The linearity of the convolution operator allows taking an exact derivative of a function $f$ slightly smoothed with a Gaussian kernel $G(\cdot;\sigma)$ of scale $\sigma$:

$$\frac{\partial\left(f\ast G(\cdot;\sigma)\right)}{\partial x}=f\ast\frac{\partial G(\cdot;\sigma)}{\partial x},\tag{1}$$

where $\ast$ denotes a convolution. This allows taking image derivatives by convolving the image with Gaussian derivatives. Gaussian derivatives in 1D at order $m$ and scale $\sigma$ can be defined recursively using the Hermite polynomials:

$$\frac{\partial^{m}G(x;\sigma)}{\partial x^{m}}=\left(\frac{-1}{\sigma\sqrt{2}}\right)^{m}H_{m}\!\left(\frac{x}{\sigma\sqrt{2}}\right)G(x;\sigma),\tag{2}$$

where $G(x;\sigma)$ is the Gaussian function and $H_{m}(x)$ the $m$-th order Hermite polynomial, recursively defined as $H_{i}(x)=2xH_{i-1}(x)-2(i-1)H_{i-2}(x)$, with $H_{0}(x)=1$ and $H_{1}(x)=2x$. We define 2D Gaussian derivatives by the product of the partial derivatives on $x$ and on $y$:

$$G^{i,j}(x,y;\sigma)=\frac{\partial^{i+j}G(x,y;\sigma)}{\partial x^{i}\,\partial y^{j}}=\frac{\partial^{i}G(x;\sigma)}{\partial x^{i}}\,\frac{\partial^{j}G(y;\sigma)}{\partial y^{j}}.\tag{3}$$
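The Hermite-based definition of Gaussian derivatives can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, and it assumes the physicists' Hermite polynomials (consistent with $H_{1}(x)=2x$), which is what `numpy.polynomial.hermite` provides.

```python
import numpy as np
from numpy.polynomial.hermite import hermval  # physicists' Hermite: H_1(x) = 2x


def gaussian_1d(x, sigma):
    """1D Gaussian G(x; sigma) sampled at the positions x."""
    return np.exp(-x**2 / (2.0 * sigma**2)) / (sigma * np.sqrt(2.0 * np.pi))


def gaussian_derivative_1d(x, sigma, order):
    """order-th derivative of the 1D Gaussian, expressed through H_order."""
    coeffs = np.zeros(order + 1)
    coeffs[order] = 1.0                      # selects H_order in the Hermite series
    u = x / (sigma * np.sqrt(2.0))
    scale = (-1.0 / (sigma * np.sqrt(2.0))) ** order
    return scale * hermval(u, coeffs) * gaussian_1d(x, sigma)


def gaussian_derivative_2d(x, y, sigma, i, j):
    """Separable 2D derivative G^{i,j}: order i along x, order j along y.

    Rows of the result index y, columns index x.
    """
    return np.outer(gaussian_derivative_1d(y, sigma, j),
                    gaussian_derivative_1d(x, sigma, i))
```

For low orders the result can be checked against the closed forms, e.g. the first derivative equals $-\frac{x}{\sigma^{2}}G(x;\sigma)$.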

III-B Multi-scale local N-Jet for modeling local image structure

A discrete set of Gaussian derivatives up to $n^{\text{th}}$ order, $\{G^{i,j}(x,y;\sigma)\mid 0\leq i+j\leq n\}$, can be used in a truncated local Taylor expansion to represent the local scale-space near any given point with increasing accuracy. This allows us to approximate a filter $F(x)$ around the point $a$ up to order $N$ as:

$$F(x)=\sum_{i=0}^{N}\frac{G^{i}(a;\sigma)}{i!}(x-a)^{i}+R,\tag{4}$$

where $R$ is the residual term that corresponds to the approximation error. By absorbing the polynomial coefficients into a value $\alpha$, we arrive at a linear combination of Gaussian derivative basis filters which can be used to approximate image filters, as illustrated in Fig. 2. For filter $F(x,y,c)$ at position $(x,y)$ and color channel $c$ the approximation is:

$$F(x,y,c)=\sum_{0\leq i+j\leq N}\alpha_{i,j,c}\,G^{i,j}(x,y;\sigma)+R,\tag{5}$$

where $R$ is the residual error, ignored here. Optimizing the $\alpha$ parameters allows us to switch from learning pixel weights, as commonly done in CNNs, to learning the weights of the Gaussian basis filters. We show some examples in Fig. 3, where we optimize the $\alpha$ parameters of an order-3 RGB Gaussian derivative basis with $\sigma=5$ to least-squares fit an $11\times 11$ px patch. The results show that the fit approximates a slightly blurred version of the original patch well. Because $\sigma=5$, we cannot recover a perfectly sharp, faithful copy of the original patch.
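The least-squares fit of the $\alpha$ weights to a patch can be illustrated as follows. This is a sketch under stated assumptions rather than the procedure used for Fig. 3: the helper names are hypothetical, a single channel is fitted (an RGB patch would be fitted once per channel), and the basis is left unnormalized.

```python
import numpy as np
from numpy.polynomial.hermite import hermval


def gauss_deriv_1d(x, sigma, m):
    """m-th derivative of the 1D Gaussian via the physicists' Hermite polynomial H_m."""
    u = x / (sigma * np.sqrt(2.0))
    g = np.exp(-u**2) / (sigma * np.sqrt(2.0 * np.pi))
    return (-1.0 / (sigma * np.sqrt(2.0))) ** m * hermval(u, [0.0] * m + [1.0]) * g


def njet_basis(size, sigma, order):
    """Stack of 2D filters G^{i,j} with 0 <= i + j <= order, shape (n_basis, size, size)."""
    x = np.arange(size) - size // 2          # pixel grid centered on the filter
    return np.stack([
        np.outer(gauss_deriv_1d(x, sigma, j), gauss_deriv_1d(x, sigma, i))
        for i in range(order + 1)
        for j in range(order + 1 - i)
    ])


def fit_alphas(patch, basis):
    """Least-squares alphas such that sum_k alphas[k] * basis[k] approximates patch."""
    A = basis.reshape(len(basis), -1).T      # (n_pixels, n_basis) design matrix
    alphas, *_ = np.linalg.lstsq(A, patch.ravel(), rcond=None)
    return alphas, (A @ alphas).reshape(patch.shape)
```

Calling `fit_alphas(patch, njet_basis(11, 5.0, 3))` on a grayscale $11\times 11$ patch returns 10 coefficients and the reconstructed, smoothed patch.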

III-C Learning receptive field size

We now have all the ingredients to learn the resolution in a convolutional neural network (CNN). The resolution is bounded by the size of the CNN filters, and we can now dynamically adapt it during training.

Scale-invariant basis normalization. The filter responses of Gaussian derivatives decay with order, as depicted in Fig. 4(a). We make the Gaussian derivatives scale-independent by multiplying each $i$-th order partial derivative by $\sigma^{i}$. This brings the magnitudes of the basis filters into approximately the same range, as illustrated in Fig. 4(b).

Learning scale and filter size. The network resolution depends on the parameter $\sigma$, which determines the inner scale of the Gaussian derivative basis. The chain rule of differentiation allows us to express the derivative of the error $J$ with respect to $\sigma$ as the product of two terms: $\frac{\partial J}{\partial\sigma}=\frac{\partial J}{\partial F}\cdot\frac{\partial F}{\partial\sigma}$. The first term is the derivative of the error with respect to the filter and is found by error backpropagation, as is standard. The second term is the derivative of the filter with respect to $\sigma$ and can be found by differentiating Eq. (5) with respect to $\sigma$. Similarly, the Gaussian basis mixing coefficients $\alpha_{i,j,c}$ can be learned by differentiating the filter $F$ with respect to the coefficients $\alpha$.
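As a worked illustration of the second term, consider a filter $F(x,y,c)=\sum_{0\leq i+j\leq N}\alpha_{i,j,c}\,G^{i,j}(x,y;\sigma)$ built from the unnormalized basis, with the filter support held fixed. Both are assumptions of this sketch, and an implementation may equally rely on automatic differentiation. Because the Gaussian satisfies the diffusion equation, the derivative of each basis filter with respect to $\sigma$ can be written using basis filters two orders higher:

```latex
\begin{align}
\frac{\partial G(x,y;\sigma)}{\partial\sigma}
  &= \sigma\left(\frac{\partial^{2}G}{\partial x^{2}}+\frac{\partial^{2}G}{\partial y^{2}}\right),\\
\frac{\partial G^{i,j}(x,y;\sigma)}{\partial\sigma}
  &= \sigma\left(G^{i+2,j}(x,y;\sigma)+G^{i,j+2}(x,y;\sigma)\right),\\
\frac{\partial F(x,y,c)}{\partial\sigma}
  &= \sigma\sum_{0\leq i+j\leq N}\alpha_{i,j,c}\left(G^{i+2,j}(x,y;\sigma)+G^{i,j+2}(x,y;\sigma)\right),\\
\frac{\partial F(x,y,c)}{\partial\alpha_{i,j,c}}
  &= G^{i,j}(x,y;\sigma).
\end{align}
```

If each basis filter is additionally multiplied by $\sigma^{i+j}$ for scale normalization, the product rule adds a term $(i+j)\,\sigma^{i+j-1}G^{i,j}$ to the derivative of that filter with respect to $\sigma$.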

In practice we cannot work with continuous filters, so we clip the filters to a finite size to perform the convolution. The size $s$ of the filter follows the formula $s=2\lceil k\sigma\rceil+1$, where $k$ determines the extent of the local N-Jet approximation and is set experimentally. By tying the filter size to the scale parameter, we only need to change $\sigma$ to adapt both the scale controlling the network resolution and the size defining the spatial extent of the filters.
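The size formula is simple enough to check by hand; the short sketch below computes it together with the centered pixel grid on which a filter of that size would be sampled. The function names and the default $k=2$ (the value used in the experiments) are choices made for this illustration.

```python
import math

import numpy as np


def filter_size(sigma, k=2.0):
    """Filter size s = 2 * ceil(k * sigma) + 1, which is always odd."""
    return 2 * math.ceil(k * sigma) + 1


def filter_grid(sigma, k=2.0):
    """Integer pixel coordinates, centered on zero, covering the clipped filter."""
    half = math.ceil(k * sigma)
    return np.arange(-half, half + 1)
```

With $k=2$, a scale of $\sigma=1.0$ gives a $5\times 5$ px filter and $\sigma=2.0$ gives a $9\times 9$ px filter.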

IV Experiments

Safely subsampling for image classification. The receptive field size is also altered through subsampling, pooling, or strided convolution. For classification models we remove all subsampling operations in the network and add a safe-subsampling operation. If the resolution is low (i.e., the $\sigma$ value is high), there is no need to keep the feature map at full size, and it can safely be subsampled to improve memory use and speed. For a feature map of size $s$, we subsample the feature map to a new size $\bar{s}$, reducing its current size as a function of $\sigma$: $\bar{s}=s\left(\frac{1}{2}\right)^{\sigma/r}$, where $r$ is the safe-subsampling hyper-parameter. We apply safe-subsampling for all models except the very deep networks, Resnet-110 and EfficientNet, where it reduces the feature map sizes too much.
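To make the safe-subsampling rule concrete, the sketch below evaluates $\bar{s}=s\left(\frac{1}{2}\right)^{\sigma/r}$ for a given layer. The function name is hypothetical, and rounding to the nearest integer with a minimum size of 1 is an assumption of this example, since the text does not state how the new size is discretized.

```python
def safe_subsample_size(size, sigma, r=4.0):
    """New feature map size s_bar = size * (1/2) ** (sigma / r).

    Rounding and the lower bound of 1 px are assumptions of this sketch.
    """
    new_size = size * 0.5 ** (sigma / r)
    return max(1, round(new_size))
```

With $r=4$, a layer with $\sigma=4$ halves a 32 px feature map to 16 px, while a layer with $\sigma=2$ reduces it to about 23 px.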

Experiment 1.1(A): Do resolution hyper-parameters really matter? We test our assumption that filter sizes and feature map sizes affect accuracy. For this we use the NIN baseline trained on CIFAR-10. We vary the filter sizes in the layers of the NIN which are not $1\times 1$ convolutions, and we reset the strides to 1 in all layers to remove the feature map subsampling. Fig. 6 shows the impact of changing the filter sizes and removing the subsampling. The smaller filter sizes, as in the case when all filter sizes are set to 3, are affected to a greater degree by the removal of the subsampling because they have a smaller receptive field. Selecting the correct filter sizes impacts the overall classification accuracy, and an exhaustive search over all possible filter size combinations is not feasible. This validates the need for learning filter sizes.

Experiment 1.2(A): Can the image resolution be learned? To test resolution learning, we create a toy network architecture depicted in Fig. 5(a). We train the toy architecture on MNIST when resizing the images by $1\times$, $1.5\times$, and $2\times$. Fig. 5(b) shows the learned Gaussian basis scale, $\sigma$, per setup. The $\sigma$ values learned for the images resized by $1.5$ and $2$ do not directly correspond to these values because the operations of sampling and resizing are not commutative: we first discretized the continuous signal into an image and subsequently subsampled it. However, the relative ratio between the learned scales is close: $(2.0/1.5)\,\sigma_{1.5}=2.81\pm 0.04\approx\sigma_{2.0}=2.82\pm 0.04$. The learned $\sigma$ values follow the input resizing, thus the correct filter scales and sizes can be learned from the input.

Experiment 2.1(A): Learning sigma. We test the effect of $\sigma$ on the performance on the CIFAR-10 dataset, using the NIN backbone. We fix the spatial extent, $k$, to $2$ and vary $\sigma$ in the set $\{0.5,1.0,2.0\}$. Tab. I shows that a wrong setting of $\sigma$ can influence the classification accuracy by up to 3%. The safe-subsampling setting is affected more by the choice of $\sigma$ than the baseline subsampling, as it relies on the value of $\sigma$ when deciding how much to subsample the input feature maps. Overall, we note that $\sigma=1.0$ achieves the best performance in this setting; therefore we use this value when initializing $\sigma$ during learning in our N-Jet models.

Experiment 2.2(A): Safe-subsampling. We test the importance of the hyper-parameter $r$ in the safe-subsampling, with respect to the classification accuracy on CIFAR-10 using a NIN backbone. For this experiment we learn the filter scale $\sigma$ and set $k=2$. We fix the hyper-parameter $r$ to one of the values in the set $\{2.0,4.0,6.0\}$. Tab. II shows the effect on accuracy of different settings of $r$. We also show the runtime needed to train the network for different $r$ settings. As the value of $r$ increases, the accuracy also increases; however, the feature map sizes in the layers of the network increase as well, which affects the overall computation time. For our subsequent experiments we select $r=4.0$ as a trade-off between accuracy and training speed.

Experiment 3.1(A): Generalization to other datasets.

We compare our N-Jet-NIN method with the baseline NIN. We test the generalization properties of our method by also reporting scores on two other datasets: CIFAR-100 and SVHN. Tab. III shows the classification results of our N-Jet-NIN when compared with the baseline NIN. We report mean and standard deviations over $3$ runs for our method. We show in Fig. 7 the hard-coded sizes of the baseline feature maps, versus the sizes learned by our N-Jet-NIN on CIFAR-10. The performance of N-Jet-NIN is comparable with the baseline performance, while dynamically learning the appropriate feature map size.

Experiment 3.2(A): Generalization to other models.

To test the generalization of our N-Jet convolutional layer to different network architectures, we use the ALLCNN, Resnet, and the recent EfficientNet backbone network architectures. For N-Jet-ALLCNN we use safe-subsampling at every layer, while for N-Jet-Resnet-32 we use it only at the layers where the original network subsamples. Tab. IV shows the classification accuracy of the baseline models tested by us on CIFAR-10 and CIFAR-100, compared with our N-Jet models. We report mean and standard deviation over $3$ repetitions for our models, as well as the number of parameters. Using our proposed N-Jet layers gives similar accuracy to the standard convolutional layers, while avoiding the need to hard-code the filter sizes. For an approximation of order 3 in the N-Jet, there is a small increase in the number of parameters compared to the baseline, except for the EfficientNet, which also uses kernel sizes larger than $3\times 3$ px. When employing larger models (N-Jet-Resnet-110 and N-Jet-EfficientNet), a Gaussian basis combination of order 2 is sufficient to obtain an accuracy comparable to the baseline models, while reducing the number of parameters. In Fig. 8 we show the baseline ALLCNN feature map sizes compared to the feature map sizes learned by N-Jet-ALLCNN on CIFAR-10. The N-Jet model has similar classification accuracy to the ALLCNN baseline, while learning the befitting feature map size at every layer. Applied at every layer, the safe-subsampling makes the subsampling continuous and smooth, compared to the baseline.

Experiment 4(A): Comparison to scale-invariant methods. We evaluate on the normal sized $28\times 28$ px MNIST and on MNIST resized by a factor of 4, with a size of $112\times 112$ px. We compare against a standard CNN with varying filter sizes, and against the Deformable CNN, as well as Atrous (dilated) convolutions. We consider 2- and 4-layer toy architectures containing only convolutional layers followed by ReLU activations. Results in Tab. V show that the standard CNN performs well on MNIST, yet results are sensitive to the filter size for $4\times$ MNIST. The Deformable CNN is also affected by the change in image size. Our intuition is that the Deformable CNN still relies on the initial $3\times 3$ convolutions, and optimizing the offsets is difficult under large size changes in the input. For the Atrous CNN the dilation factor has to be hard-coded, and we use a dilation factor of 2, as using 4 would imply including prior knowledge. The Atrous performance is also affected by the change in input size. In contrast, our N-Jet model is able to learn the correct resolution and is more accurate.

IV-B Exp (B): N-Jet for Image Segmentation

Learning the receptive field size for segmentation. Multi-scale information processing is heavily used in modern segmentation architectures, and seems to be an important performance booster. Here, we focus on two popular mechanisms for multi-scale processing, namely the merging of information at different scales via skip connections in U-Net architectures, and the pooling of information at different scales via atrous spatial pyramid pooling (ASPP) layers in DeepLab architectures.

Similar to the classification experiments (Section IV-A), we replace the fixed-size convolutional filters of baseline networks with the N-Jet definition, where we learn the size and scale of the filters in the convolution operations during training.

Experiment 1.1(B): Segmentation of multi-scale inputs. We first evaluate the performance of N-Jet models on a small toy dataset, where each input image is formed by concatenating 4 images (objects) from the Fashion MNIST dataset (Fig. 9). Each object is assigned a random scale $s$, which determines the factor by which we upsample the original Fashion MNIST image via bilinear interpolation. The scale affects the object size, i.e. the number of pixels occupied by the object in the image. We construct four different training sets: three where the scale of each object is homogeneous, with the discrete variable $s$ having the probability mass function $P(s=1)=1$, $P(s=2)=1$, or $P(s=4)=1$; and one where the image contains objects at multiple scales, with $P(s)=0.25$ for $s\in\{1,2,3,4\}$. After rescaling, each object is placed in one quadrant of the input image, centered at a uniformly sampled random location. The corresponding ground truth segmentation masks are created by assigning the class label ($1\ldots 10$) of the corresponding object to pixels whose input grayscale values $h_{x,y}$ are above the threshold $h_{\theta}=0.2$, the background label (0) to pixel locations where $h_{x,y}=0$, and an ignore index to undetermined pixel locations where $0<h_{x,y}<h_{\theta}$. The ignored pixel locations do not contribute to the loss during training or to the accuracy at test time.
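The mask construction rule can be sketched as below for a single object. The function name and the ignore value of 255 are assumptions made for illustration; the text only states that an ignore index is used. Pixels exactly equal to the threshold are not covered by the description and are left as ignored here.

```python
import numpy as np

IGNORE_INDEX = 255  # assumed value; the text does not specify the ignore index


def make_mask(gray, class_label, threshold=0.2, ignore_index=IGNORE_INDEX):
    """Ground-truth mask for one object from its grayscale values in [0, 1].

    - gray == 0           -> background label 0
    - gray >  threshold   -> class_label (1..10)
    - everything else     -> ignore_index (undetermined pixels)
    """
    mask = np.full(gray.shape, ignore_index, dtype=np.int64)
    mask[gray == 0] = 0
    mask[gray > threshold] = class_label
    return mask
```

Ignored pixels can then be excluded from the loss and from the evaluation metric by masking on `ignore_index`.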

Due to the simple nature of the training set, we use a small U-Net architecture, where the encoding network has three levels, as opposed to five in the original U-Net. This corresponds to two downsampling layers. Each level is composed of two convolutional layers, followed by the ReLU activation layer. The channel dimension is 64 at the first level, and doubles with every downsampling, performed via $2\times 2$ max pooling. In the decoding network, we use bilinear upsampling to increase the feature map size, and in all convolutional layers we use ‘same’ padding. We train all networks (N-Jet and baseline) for 50 epochs, using the ADAM optimizer and a learning rate of $0.0001$. To accommodate input images of different sizes, and in keeping with the original U-Net implementation, we use a batch size of 1 and no batch normalization, but high momentum $\beta=(0.9,0.999)$. To combat class imbalance, given the especially high frequency of the background class, we weight the losses with the inverse of the class frequencies in the training set. For the N-Jet models, we use filters with basis order 4 and 2. The scale parameter $\sigma$ is shared between all filters in a convolutional layer.

After training, we evaluate segmentation performance on the validation set using the mean intersection over union (mIoU) over all object classes. Each validation set is constructed in the same way as the corresponding training set, using the Fashion MNIST validation images. We find that as we increase the average scale $s$ of the segmented objects (homogeneous scale case) or the variance of object scales $s$ (multi-scale case), N-Jet models successfully optimize the scale parameter $\sigma$ accordingly. This makes N-Jets capable of adapting to different object scales without changing the network architecture, depth, or hyper-parameters at all. In contrast, baseline U-Net models with fixed filter size cannot adapt their receptive field (RF) size based on the object scales in the training set, and their segmentation performance decays for larger objects (Fig. 10).
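For reference, the mIoU over object classes with ignored pixels can be computed along the following lines. This is a generic sketch, not the evaluation code used in the experiments: the function name and the ignore value of 255 are assumptions, and the score is computed for a single pair of label maps, whereas an evaluation would typically accumulate intersections and unions over the whole validation set.

```python
import numpy as np


def mean_iou(pred, target, class_ids, ignore_index=255):
    """Mean intersection over union across class_ids, skipping ignored pixels."""
    valid = target != ignore_index
    ious = []
    for c in class_ids:
        p = (pred == c) & valid
        t = (target == c) & valid
        union = (p | t).sum()
        if union == 0:
            continue                         # class absent in both maps: skip it
        ious.append((p & t).sum() / union)
    return float(np.mean(ious)) if ious else float("nan")
```

Passing `class_ids=range(1, 11)` restricts the average to the ten object classes and leaves out the background label 0.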

In addition to the robustness of N-Jet networks against changing object scales, we find that N-Jets of only order 2 (where each kernel is defined by only 6 free parameters) are enough to obtain good validation accuracy. In fact, N-Jets of order 4 perform slightly worse for larger $s$. This is partly because the reduction of the basis order acts as a regularization via parameter reduction on our simple toy dataset, and partly because Fashion MNIST (especially after upscaling) does not contain many high-frequency components, which the higher-order Gaussian derivatives can capture.
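The parameter counts quoted for the different basis orders follow from counting the index pairs $(i,j)$ with $0\leq i+j\leq N$. A small sketch, with a hypothetical function name, makes the arithmetic explicit:

```python
def njet_num_coefficients(order):
    """Number of basis filters G^{i,j} with 0 <= i + j <= order.

    This is the number of alpha weights that define one kernel.
    """
    return (order + 1) * (order + 2) // 2
```

Order 1 gives 3 coefficients, order 2 gives 6, order 3 gives 10, and order 4 gives 15, compared to 9 pixel weights for a standard $3\times 3$ kernel.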

Experiment 1.2(B): Learning the receptive field size. While $\sigma$ optimization is successful for different basis orders, we note that the N-Jet model with basis order 4 has a larger number of free parameters than the baseline U-Net. Nevertheless, we observe on our toy dataset that the validation mIoU depends only weakly on the number of parameters, beyond a certain network size. For the multi-scale segmentation task with $s=4$, where the scale of the objects is increased by a factor of 4, we find that the receptive field size at the end of the encoding network largely determines the validation mIoU (Fig. 11). To demonstrate this, we vary the number of parameters and the receptive field size at the end of encoding in the baseline U-Net models until we match the N-Jet performance: we increase the kernel size $k$ from 3 to 4 and 5, and expand the depth of the baseline network by increasing the number of encoding and decoding levels from 3 (10 convolutional layers) to 4 (14 convolutional layers) and 5 (18 convolutional layers). To keep the number of trainable parameters at a reasonable level, for networks with 4 and 5 levels we also decrease the channel width of the layers (by halving or quartering the number of channels in each layer, as given in the legend of Fig. 11).

We find that the N-Jet models can outperform baseline U-Net models while using a much smaller number of free parameters, due to $\sigma$ optimization. In addition, we show that while the receptive field size is a good predictor of performance, it cannot be learned during training for the baseline U-Net, and would need to be optimized via hyper-parameter scans. This can potentially mean increasing the depth of the network to match the input resolution, which cannot be parallelized. Finally, we observe that slightly better validation mIoU can be obtained by baseline models with almost 7 times the number of parameters and double the number of layers. We attribute this slight performance boost to the much larger depth, and thus the increased number of nonlinearities in the network.

Experiment 2(B): Image segmentation using DeepLabv2. Next, we consider a more realistic segmentation task on the Pascal VOC (SBD) dataset using the DeepLabv2 architecture. Modern DeepLab models take advantage of dilated convolutions to aggregate information from multiple scales in atrous spatial pyramid pooling (ASPP) layers. However, dilated kernels can only be upsampled discretely, based on the dilation rate in units of pixels. In addition, it is typically not possible to determine a priori which scales in a dataset contain task-relevant information, and the employed dilation rates need to be optimized using extensive hyper-parameter scans. We propose N-Jets as an alternative that optimizes the scales in a continuous way, eliminating the need to search extensively for dilation rates for each task.

To that end, we employ the DeepLabv2 model with a ResNet-101 backbone pretrained on the 20 class subset of the MS COCO dataset corresponding to the Pascal VOC classes. We retain all the network and training hyper-parameters of the original DeepLabv2 model and finetune the baseline network with an ASPP output layer on Pascal VOC with the batch normalization layers frozen. For the N-Jet network, we replace the 4 convolutional layers of the ASPP layer, which have different dilation rates, with 4 N-Jet layers with independent, learnable scales $\sigma$ (during finetuning), and we impose weight sharing between the different scales (i.e. the same $\alpha$ values). On top of eliminating the need to manually tune the dilation rates, N-Jet models with weight sharing also have the potential to dramatically reduce the number of parameters in ASPP layers.

We find that DeepLabv2 with N-Jet output layers indeed allows for parameter reductions (Tab. VI). Using an N-Jet output layer with basis order 3 and weight sharing, we achieve validation mIoU values within 1% of the baseline network, while reducing the number of parameters by nearly a factor of 4. As an additional control, we also train a baseline network with weight sharing within the ASPP layer. Interestingly, we observe that our N-Jet models attain on par or better performance than the weight-tied baseline network, even when we only use a basis order of 1 (each kernel is defined by only 3 free parameters).

It is worth noting that these validation mIoUs are achieved with no hyper-parameter tuning for the N-Jet models, and despite not using N-Jet layers in the pretraining of the DeepLab backbone. As it is, we believe N-Jet output layers may be used for multi-scale processing applications with further hyper-parameter tuning of learning rates and regularization parameters, or can be used out of the box to estimate the optimal scale or dilation rates for other architectures.

V Discussion and limitations

To illustrate the differences between standard convolutional layers and N-Jet convolutional layers, we visualize a set of trained baseline filters compared to the equivalent N-Jet filters (Fig. 12). We find that in many models earlier layers converge to smaller $\sigma$ values during training (Fig. 12, top), while deeper layers are prone to learning larger filter sizes (Fig. 12, bottom).

In addition, the strength of the N-Jet representation lies in that it can learn filter sizes, and thus the receptive field size, during training. However, recent work has demonstrated that the effective receptive field (eRF) size of networks can be considerably smaller than what would be expected from the kernel size. We investigate the change in eRF size in our N-Jet models by visualizing the gradients with respect to the input image in our models trained on the multi-scale Fashion MNIST dataset (Fig. 13). We find that, as expected, the eRF size of N-Jet models grows with the size of the training images, proportionally to the growth of filter sizes. The baseline U-Net model with $3\times 3$ kernels cannot learn to adapt its receptive field size during training, and its eRF size remains relatively constant as a function of the input image scale.

One of the limitations of our proposed kernels is that they are typically larger than the standard $3\times 3$ px, and therefore the convolutions take longer to compute. This comes at no cost in parameters, as the size of the N-Jet filters is only affected by the scale parameters, $\sigma$. Additionally, computing the Gaussian basis is more expensive because it involves more operations: computing the Hermite polynomials, obtaining the individual Gaussian basis filters from these, and then estimating their linear combinations with the weights $\alpha$. For the NIN architecture, our model is $\approx 2\times$ slower than the baseline model. As the network depth increases, so do the computations. However, manual architecture search takes much longer to find the appropriate resolution hyper-parameters, because it requires a grid search over all possible filter sizes given a specific network depth and subsampling strategy.

VI Conclusion

We learn the resolution in deep convolutional networks. Learning the resolution frees the network architect from setting resolution-related hyper-parameters such as the receptive field size and subsampling layers, which are dataset and network dependent. While we learn the receptive field size and the feature map subsampling for classification, the resolution is also determined by the depth of the network, as each layer increases the resolution linearly. Network depth is not something we learn, and thus we do not learn all resolution hyper-parameters. In addition to hard-coded filter sizes and subsampling layers, current CNN architectures are also designed to share the same filter size in a single layer. Due to computational constraints, our implementation learns one $\sigma$ per layer rather than one per filter. We leave per-filter scales as potential future work. To conclude, by replacing pixel-weight convolutional layers with our N-Jet convolutional layers we show that we can obtain similar performance to the baseline methods, without tuning the hyper-parameters controlling the resolution.

Acknowledgements. This publication is part of the project "Pixel-free deep learning" (with project number 612.001.805) of the research programme TOP, which is financed by the Dutch Research Council (NWO).

References