Harmonic Networks: Deep Translation and Rotation Equivariance
Daniel E. Worrall, Stephan J. Garbin, Daniyar Turmukhambetov, Gabriel J. Brostow
Introduction
We tackle the challenge of representing 360°-rotations in convolutional neural networks (CNNs). Currently, convolutional layers are constrained by design to map an image to a feature vector, and translated versions of the image map to proportionally translated versions of the same feature vector (ignoring edge effects); see Figure 1. However, until now, if one rotates the CNN input, the feature vectors do not necessarily rotate in a meaningful or easy-to-predict manner. The sought-after property, directly relating input transformations to feature vector transformations, is called equivariance.
A special case of equivariance is invariance, where feature vectors remain constant under all transformations of the input. This can be a desirable property globally for a model, such as a classifier, but we should be careful not to restrict all intermediate levels of processing to be transformation invariant. For example, consider detecting a deformable object, such as a butterfly. The pose of the wings is limited in range, and so there are only certain poses our detector should normally see. A transformation invariant detector, good at detecting wings, would detect them whether they were bigger, further apart, rotated, etc., and it would encode all these cases with the same representation. It would fail to notice nonsense situations, however, such as a butterfly with wings rotated past the usual range, because it has thrown that extra pose information away. An equivariant detector, on the other hand, does not dispose of local pose information, and so it hands on a richer and more useful representation to downstream processes.
Equivariance not only conveys more information about an input to downstream processes, it also constrains the space of possible learned models to those that are valid under the rules of natural image formation. This makes learning more reliable and helps with generalization. For instance, consider CNNs. The key insight is that the statistics of natural images, embodied in the correlations between pixels, are a) invariant to translation, and b) highly localized. Thus features at every layer in a CNN are computed on local receptive fields, where weights are shared across translated receptive fields. This weight-tying serves both as a constraint on the translational structure of image statistics, and as an effective technique to reduce the number of learnable parameters (see Figure 1). In essence, translational equivariance has been ‘baked’ into the architecture of existing CNN models. We do the same for rotation and refer to it as hard-baking.
The current widely accepted practice to cope with rotation is to train with aggressive data augmentation. This certainly improves generalization, but it is not exact, fails to capture local equivariances, and does not ensure equivariance at every layer within a network. How to maintain the richness of local rotation information is what we present in this paper. Another disadvantage of data augmentation is that it leads to the so-called black-box problem, where there is a lack of feature map interpretability. Indeed, close inspection of first-layer weights in a CNN reveals that many of them are rotated, scaled, and translated copies of one another. Why waste computation learning all of these redundant weights?
In this paper, we present Harmonic Networks, or H-Nets. They design patch-wise 360°-rotational equivariance into deep image representations by constraining the filters to the family of circular harmonics. The circular harmonics are steerable filters, which means that we can represent all rotated versions of a filter using just a finite, linear combination of steering bases. This overcomes the issue of learning multiple filter copies in CNNs, guarantees rotational equivariance, and produces feature maps that transform predictably under input rotation.
Related Work
Multiple existing approaches seek to encode rotational equivariance into CNNs. Many of these follow a broad approach of introducing filter or feature map copies at different rotations. None has dominated as standard practice.
Steerable filters At the root of H-Nets lies the property of filter steerability. Filters exhibiting steerability can be constructed at any rotation as a finite, linear combination of base filters. This removes the need to learn multiple filters at different rotations, and has the bonus of constant memory requirements. As such, H-Nets could be thought of as using an infinite bank of rotated filter copies. One prior work combines steerable filters with learning: it builds shallow features from steerable filters, which are fed into a kernel SVM for object detection and rigid pose regression. H-Nets use the same filters with an added rotation offset term, so that filters in different layers can have orientation-selectivity relative to one another.
Hard-baked transformations in CNNs While H-Nets hard-bake patch-wise 360°-rotation into the feature representation, numerous related works have encoded equivariance to discrete rotations. The following works can be grouped into those that encode global equivariance versus patch-wise equivariance, and those that rotate filters versus feature maps.
One line of work introduces equivariance to 90°-rotations and dihedral flips in CNNs by copying the transformed filters at different rotation–flip combinations. More recently, the same authors generalized this theory to all group-structured transformations, but they only demonstrated applications on finite groups; an extension to continuous transformations would require a treatment of anti-aliasing and bandlimiting. Other works use a larger number of rotations for texture classification, or use many rotated handcrafted filter copies, opting not to learn the filters. To achieve equivariance to all rotations, these methods would need an infinite amount of computation. H-Nets achieve equivariance to all rotations with finite computation.
Some methods feed in multiple rotated copies of the CNN input and fuse the output predictions. Others do the same for a broader class of global image transformations, and propose a novel per-pixel pooling technique for output fusion. As discussed, these techniques lead to global equivariances only and do not produce interpretable feature maps. A further approach goes one step further and copies each feature map at four 90°-rotations, proposing 4 different equivariance-preserving feature map transformations. The resulting CNN is similar to the filter-copying approach in terms of what is being computed, but rotates feature maps instead of filters. A downside of this is that all inputs and feature maps have to be square, whereas we can use any sized input.
Learning generalized transformations Others have tried to learn the transformations directly from the data. While this is an appealing idea, as we have said, for certain transformations it makes more sense to hard-bake these in for interpretability and reliability. One approach constructs a higher-order Boltzmann machine, which learns tuples of transformed linear filters in input–output pairs. Although powerful, this has only been shown to work on shallow architectures. Another introduced capsules, units of neurons designed to mimic the action of cortical columns. Capsules are designed to be invariant to complicated transformations of the input. Their outputs are merged at the deepest layer, and so are only invariant to global transformation. A third presents a method to regress equivariant feature detectors using an objective that penalizes representations that lie far from the equivariant manifold. Again, this only encourages global equivariance, although this work could be adapted to encourage equivariance at every layer of a deep pipeline.
Problem analysis
Many computer vision systems strive to be view independent, such as object recognition, which is invariant to affine transformations, or boundary detection, which is equivariant to non-rigid deformations. H-Nets hard-bake 360°-rotation equivariance into their feature representation by constraining the convolutional filters of a CNN to be from the family of circular harmonics. Below, we outline the formal definition of equivariance (Section 3.1), how the circular harmonics exhibit rotational equivariance (Section 3.2) and some properties of the circular harmonics, which we must heed for successful integration into the CNN framework (Section 3).
Continuous domain feature maps In deep learning we use feature maps, which live in a discrete domain. We shall instead use continuous spaces, because the analysis is easier. Later on in Section 4.2 we shall demonstrate how to convert back to the discrete domain for practical implementation, but for now we work entirely in continuous Euclidean space.
Equivariance is a useful property to have because transformations of the input produce predictable transformations of the features, which are interpretable and can make learning easier. Formally, we say that a feature mapping $f$ is equivariant to a group of transformations if we can associate every transformation $\pi_g$ of the input $x$ with a transformation $\psi_g$ of the features; that is,

$$\psi_g[f(x)] = f(\pi_g[x]).$$
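As a concrete instance of this definition, the sketch below checks numerically that cross-correlation is equivariant to translation: translating the input and then correlating gives the same result as correlating and then translating the features. Periodic boundaries are an assumption made here so that edge effects vanish, and `circular_xcorr2d` is a helper name chosen for illustration only.

```python
import numpy as np

def circular_xcorr2d(image, kernel):
    """Cross-correlation with periodic boundaries, so there are no edge effects."""
    out = np.zeros_like(image, dtype=float)
    for dy in range(kernel.shape[0]):
        for dx in range(kernel.shape[1]):
            # Rolling by (-dy, -dx) brings image[y + dy, x + dx] to position (y, x).
            out += kernel[dy, dx] * np.roll(image, shift=(-dy, -dx), axis=(0, 1))
    return out

rng = np.random.default_rng(0)
image = rng.standard_normal((16, 16))
kernel = rng.standard_normal((3, 3))
shift = (2, 5)

def translate(a):
    """The same cyclic shift acts on inputs and on feature maps."""
    return np.roll(a, shift, axis=(0, 1))

# Transform-then-map equals map-then-transform.
assert np.allclose(circular_xcorr2d(translate(image), kernel),
                   translate(circular_xcorr2d(image, kernel)))
```

Here the input transformation and the feature transformation happen to be the same shift; for rotation, the two generally differ.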
3.2 The Complex Circular Harmonics
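With data augmentation CNNs may learn some rotation equivariance, but this is difficult to quantify. H-Nets take the simpler approach of hard-baking this structure in. If $f$ is the feature mapping of a standard convolutional layer, then 360°-rotational equivariance can be hard-baked in by restricting the filters to be from the circular harmonic family (proof in Supplementary Material)

$$W_m(r, \phi; R, \beta) = R(r)\, e^{i(m\phi + \beta)},$$

where $(r, \phi)$ are polar coordinates, $m$ is the rotation order, $R$ is the radial profile, and $\beta$ is a phase offset. The radial profile and the phase offset are the learnable parameters.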
Rotational Equivariance of the Circular Harmonics Some deep learning libraries implement cross-correlation rather than convolution, and since the understanding is slightly easier to follow, we consider correlation. Strictly, cross-correlation with complex functions requires that one of the arguments is conjugated, but we do not do this in our model/implementation, so neither argument is conjugated in what follows.
Consider correlating a circular harmonic $W_m$ of order $m$ with a rotated image patch. We assume that the image patch is only able to rotate locally about the origin of the filter. This means that the cross-correlation response is a scalar function of input image patch rotation $\theta$. Using the notation from Equation 1, and recalling that we are working in polar coordinates $(r, \phi)$, counter-clockwise rotation of an image $F(r, \phi)$ about the origin by an angle $\theta$ is $\pi_\theta[F](r, \phi) = F(r, \phi - \theta)$. As a shorthand we denote $F^\theta := \pi_\theta[F]$. It is a well-known result (proof in Supplementary Material) that

$$[W_m \star F^\theta] = e^{im\theta}\,[W_m \star F^0].$$

The rotation order of a filter defines its response properties to input rotation. In particular, rotation order $m = 0$ defines invariance and $m = 1$ defines linear equivariance. For $m = 0$ this is because, denoting $f = [W_0 \star F^0]$, then $[W_0 \star F^\theta] = e^{i0\theta} f = f$, which is independent of $\theta$. For $m = 1$, with $f = [W_1 \star F^0]$, we have $[W_1 \star F^\theta] = e^{i\theta} f$: as the input rotates, the response is a complex-valued number of constant magnitude $|f|$, spinning round with a phase equal to $\theta$. Naturally, we are not constrained to using rotation orders 0 or 1 only, and we make use of higher and negative orders in our work.
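The phase behaviour described above can be checked numerically. The following is a minimal sketch, assuming the patch is sampled on a polar grid so that rotation is an exact cyclic shift along the angular axis; the Gaussian radial profile, the grid sizes, and the function name `harmonic_response` are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def harmonic_response(patch, radii, phis, m, beta=0.0, radial_profile=None):
    """Unconjugated cross-correlation of a polar-sampled patch with a
    circular harmonic of order m, evaluated at the patch origin."""
    if radial_profile is None:
        radial_profile = np.exp(-radii ** 2)  # hypothetical radial profile R(r)
    # Filter samples: R(r) * exp(i (m phi + beta))
    w = radial_profile[:, None] * np.exp(1j * (m * phis[None, :] + beta))
    # Riemann sum with polar area element r dr dphi (constant dr, dphi omitted)
    return np.sum(w * patch * radii[:, None])

rng = np.random.default_rng(0)
n_r, n_phi = 8, 64
radii = np.linspace(0.5, 4.0, n_r)
phis = 2 * np.pi * np.arange(n_phi) / n_phi
patch = rng.standard_normal((n_r, n_phi))    # unrotated patch on a polar grid

k = 5                                        # rotate by k angular samples
theta = 2 * np.pi * k / n_phi
rotated = np.roll(patch, k, axis=1)          # rotated(r, phi) = patch(r, phi - theta)

for m in (0, 1, 2, -1):
    z0 = harmonic_response(patch, radii, phis, m, beta=0.3)
    z_theta = harmonic_response(rotated, radii, phis, m, beta=0.3)
    # The response picks up a phase of m * theta; its magnitude is unchanged.
    assert np.allclose(z_theta, np.exp(1j * m * theta) * z0)
```

On a Cartesian pixel grid the rotation would need interpolation, so the identity would hold only approximately.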
The rotation orders of the filters along any path to a feature map must sum to the rotation order of that feature map. This is the fundamental condition underpinning the equivariance properties of H-Nets, so we call it the equivariance condition.
We note here that for our purposes, our filter of order $-m$ is $W_{-m} = \overline{W_m}$ (the complex conjugate), which saves on parameters, but this does not necessarily imply conjugacy of the responses unless $F$ is real, which is only true at the input.
Method
We have considered the 360°-rotational equivariance of feature maps arising from cross-correlation with the circular harmonics, and we determined that the rotation orders of chained cross-correlations sum. Next, we use these results to construct a deep architecture, which can leverage the equivariance properties of circular harmonics.
The rotation orders of feature maps and filters sum upon cross-correlation, so to achieve a given output rotation order, we must obey the equivariance condition. In fact, the equivariance condition must be met at every feature map; otherwise, it would be possible to arrive at the same feature map along two different paths with different summed rotation orders. The problem is that combining complex features whose phases rotate at different frequencies leads to entanglement of the responses. The resultant feature map is no longer equivariant to a single rotation order, making it difficult to work with. We resolve this by enforcing the equivariance condition at every feature map.
Our solution is to create separate streams of constant rotation order responses running through the network—see Figure 4. These streams contain multiple layers of feature maps, separated by rotation order zero cross-correlations and nonlinearities. Moving between streams, we use cross-correlations of rotation order equal to the difference between those two streams. It is very easy to check that the equivariance condition holds in these networks.
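The check mentioned above can be sketched in a few lines. This is an illustrative bookkeeping exercise, assuming the input image is treated as rotation order 0 and that every layer offers the same set of streams; the function names are hypothetical.

```python
from itertools import product

def filter_order(src_order, dst_order):
    """Rotation order of the filter that maps one stream to another."""
    return dst_order - src_order

def path_order_sums(stream_orders, n_layers, target_order):
    """Summed filter orders along every path from an order-0 input
    to a stream of the given order after n_layers cross-correlations."""
    sums = set()
    # A path visits one stream per intermediate layer, then the target stream.
    for hidden in product(stream_orders, repeat=n_layers - 1):
        visited = (0, *hidden, target_order)   # assumed: input has order 0
        total = sum(filter_order(a, b) for a, b in zip(visited, visited[1:]))
        sums.add(total)
    return sums

# With streams of orders 0 and 1, every path into a stream sums to that
# stream's order, so the equivariance condition holds.
for target in (0, 1):
    assert path_order_sums((0, 1), n_layers=4, target_order=target) == {target}
```

The sum telescopes, which is why choosing each filter's order as the difference between its two streams is enough to satisfy the condition.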
When multiple responses converge at a feature map, we have multiple choices of how to combine them. We could stack them, we could pool across them, or we could sum them. To save on memory, we chose to sum responses of the same rotation order. The summed response is then fed into the next layer. Usually in our experiments, we use streams of orders 0 and 1, which we found to work well and which is justified by the fact that CNN filters tend to contain very little high frequency information.
Above, we see that the structure of the Harmonic Network is very simple. We replaced regular CNN filters with radially reweighted and phase shifted circular harmonics. This causes each filter response to be equivariant to input rotations with order $m$. To prevent responses of different rotation order from entangling upon summation, we separated filter responses into streams of equal rotation order.
Complex nonlinearities Between cross-correlations, we use complex nonlinearities, which act on the magnitudes of the complex feature maps only, to preserve rotational equivariance. An example is a complex version of the ReLU

$$\mathbb{C}\text{-ReLU}_b(X e^{i\phi}) = \mathrm{ReLU}(X + b)\, e^{i\phi},$$

where $X \geq 0$ is the magnitude, $\phi$ is the phase, and $b$ is a learned bias.
We can provide similar analogues for other nonlinearities and for Batch Normalization, which we use in our experiments.
We have thus far presented the Harmonic Network. Each layer is a collection of feature maps of different rotation orders, which transform predictably under rotation of the input to the network, and the 360°-rotation equivariance is achieved with finite computation. Next we show how to implement this in practice.
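A magnitude-only nonlinearity of this kind can be sketched in NumPy as follows. The bias `b`, the small constant `eps` guarding against division by zero, and the function name are assumptions made for illustration.

```python
import numpy as np

def complex_relu(z, b=0.0, eps=1e-12):
    """ReLU applied to the magnitude (plus bias b); the phase passes through."""
    magnitude = np.abs(z)
    phase = z / np.maximum(magnitude, eps)    # unit-modulus factor, safe at z = 0
    return np.maximum(magnitude + b, 0.0) * phase

rng = np.random.default_rng(0)
z = rng.standard_normal((4, 4)) + 1j * rng.standard_normal((4, 4))

# Rotating the input multiplies an order-m feature map by exp(i m theta).
m, theta = 1, 0.7
rotation = np.exp(1j * m * theta)

# The nonlinearity commutes with that phase factor, so the rotation order
# of the feature map is preserved.
assert np.allclose(complex_relu(rotation * z, b=-0.5),
                   rotation * complex_relu(z, b=-0.5))
```

A nonlinearity applied separately to real and imaginary parts would not commute with the phase factor in general.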
4.2 Implementation: Discrete cross-correlations
Viewing cross-correlation on discrete domains sheds some insight into how the equivariance properties behave. In Figure 5, we see that the sampling strategy introduces multiple origins, one for each feature map patch. We call these centers of equivariance, because a feature map will exhibit local rotation equivariance about each of these points. If we move to using more exotic sampling strategies, such as strided cross-correlation or average pooling, then the centers of equivariance are ablated or shifted. If we were to use max-pooling, then the center of equivariance would be a complicated nonlinear function of the input image and harmonic weights. For this reason we have not used max-pooling in our experiments.
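On a practical note, it is worth mentioning that complex cross-correlation can be implemented efficiently using 4 real cross-correlations. Writing $W = W^{\mathrm{Re}} + iW^{\mathrm{Im}}$ and $F = F^{\mathrm{Re}} + iF^{\mathrm{Im}}$, and recalling that we do not conjugate either argument,

$$W \star F = \left(W^{\mathrm{Re}} \star F^{\mathrm{Re}} - W^{\mathrm{Im}} \star F^{\mathrm{Im}}\right) + i\left(W^{\mathrm{Re}} \star F^{\mathrm{Im}} + W^{\mathrm{Im}} \star F^{\mathrm{Re}}\right).$$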
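The decomposition into real cross-correlations can be sketched as below. This assumes the unconjugated form of cross-correlation described earlier and uses a plain NumPy 'valid' correlation as a stand-in for a framework convolution op; `xcorr2d` and `complex_xcorr2d` are illustrative names.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def xcorr2d(image, kernel):
    """'Valid' 2D cross-correlation (no kernel flip, no conjugation)."""
    windows = sliding_window_view(image, kernel.shape)   # (H-k+1, W-k+1, k, k)
    return np.einsum("ijkl,kl->ij", windows, kernel)

def complex_xcorr2d(image, kernel):
    """Complex cross-correlation assembled from four real cross-correlations."""
    real = xcorr2d(image.real, kernel.real) - xcorr2d(image.imag, kernel.imag)
    imag = xcorr2d(image.imag, kernel.real) + xcorr2d(image.real, kernel.imag)
    return real + 1j * imag

rng = np.random.default_rng(0)
image = rng.standard_normal((8, 8)) + 1j * rng.standard_normal((8, 8))
kernel = rng.standard_normal((3, 3)) + 1j * rng.standard_normal((3, 3))

# The four-real-correlation version matches the direct complex computation.
assert np.allclose(complex_xcorr2d(image, kernel), xcorr2d(image, kernel))
```

In a deep learning framework, the four real correlations would typically map onto the existing real-valued convolution primitive.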
So circular harmonics can be implemented in current deep learning frameworks with minor engineering. We implement a grid-resampled version of the filters (see Figure 6). The polar representation can be mapped from the Cartesian grid components. If we stack all the polar filter samples into a matrix, we can write each point as the outer product of a radial tensor and a trigonometric angular tensor. The phase offset can be separated out by noting that
where the complex exponential and trigonometric terms are element-wise, and I is the identity matrix. This is just a reweighting of the ring elements. In full generality, we could also use a per-radius phase, which would allow for spiral-like left- and right-handed features, but we did not investigate this.
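The separation of the phase offset can be illustrated with a simplified grid-sampled filter. The sketch below assigns one weight per ring of equal radius and skips the resampling step shown in Figure 6, which is a simplification; `polar_grid`, `harmonic_filter`, and the example weights are hypothetical.

```python
import numpy as np

def polar_grid(size):
    """Radius and angle of every pixel, measured from the filter centre."""
    coords = np.arange(size) - (size - 1) / 2
    x, y = np.meshgrid(coords, coords)
    return np.hypot(x, y), np.arctan2(y, x)

def harmonic_filter(size, m, ring_weights, beta=0.0):
    """Sample R(r) * exp(i (m phi + beta)) with one weight per ring."""
    r, phi = polar_grid(size)
    r_key = np.round(r, 6)
    rings = np.unique(r_key)                       # distinct ring radii
    if len(ring_weights) != len(rings):
        raise ValueError("need one weight per ring")
    radial = np.asarray(ring_weights)[np.searchsorted(rings, r_key)]
    if m != 0:
        radial = np.where(r == 0, 0.0, radial)     # centre pixel vanishes for m != 0
    return radial * np.exp(1j * (m * phi + beta))

size, m, beta = 5, 1, 0.4
weights = np.linspace(1.0, 0.1, 6)                 # hypothetical profile, 6 rings
base = harmonic_filter(size, m, weights)           # beta = 0
shifted = harmonic_filter(size, m, weights, beta)

# The phase offset only mixes the fixed cos/sin components of the filter.
re = np.cos(beta) * base.real - np.sin(beta) * base.imag
im = np.sin(beta) * base.real + np.cos(beta) * base.imag
assert np.allclose(shifted, re + 1j * im)
```

Because the offset enters only through this mixing, the cosine and sine components can be precomputed once and reused for every filter of the same order.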
4.3 Computational cost
We have increased the computational cost of cross-correlation, in return for continuous rotational equivariance. Here we analyze the computational cost in terms of number of multiplications.
Experiments
We validate our rotation equivariant formulation below, performing some introspective investigations, and measuring against relevant baselines for classification on the rotated-MNIST dataset and boundary detection on the Berkeley Segmentation Dataset. We selected our baselines as representative examples of the current state-of-the-art and to demonstrate that H-Nets can be used on different architectures for different tasks. The networks we used are in Figure 13.
Here we compare H-Nets for classification and boundary detection. Classification is a typical rotation invariant task, and should suit H-Nets very well. In contrast, boundary detection is a rotation equivariant task. The key to the success of the H-Net is that it can achieve global equivariance, without sacrificing local equivariance of features.
MNIST Of course, this is a small dataset, with simple visual structures, but it is a good indication of how introducing the right equivariances into CNNs can aid inference.
We investigate classification on the rotated MNIST dataset (new version) as a baseline. This has 10000 training images, 2000 validation images, and 50000 test images. The 360°-rotations and small training set size make this a difficult task for classical CNNs. We compare against a collection of previous state-of-the-art papers, including one that builds a deep CNN with filter copies at 90°-rotations. We try to mimic that network architecture for H-Nets as best as we can, using 2 rotation order streams with $m \in \{0, 1\}$ through to the deepest layer, and complex-valued versions of ReLU nonlinearities and Batch Normalization (see Method). We also replace max-pooling with mean-pooling layers, as shown in Figure 13. We perform stochastic gradient descent on a cross-entropy loss using Adam and an adaptive learning rate, which we divide by 10 if there has been no improvement in validation accuracy in the last 10 epochs. We train multiple models with randomly chosen hyperparameters, and report the test error of the model that performed best on the validation set, training on a combined training and validation set. Table 1 lists our results. This model actually has 33k parameters, which is about 50% larger than the standard CNN and the rotated-filter CNN, which have 22k. This is because it uses 5×5 convolutions instead of 3×3. Interestingly, it does not overfit on such a small dataset and it still outperforms the standard CNN trained with rotation augmentations, which we do not use. We set the new state-of-the-art, with a 26% improvement on the previous best model.
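The learning-rate rule described above can be sketched as a small framework-independent helper. The class name, the initial learning rate, and the choice to restart the patience counter after each decay are assumptions; the paper's implementation may differ.

```python
class PlateauDivider:
    """Divide the learning rate by `factor` when validation accuracy has not
    improved for `patience` consecutive epochs."""

    def __init__(self, lr, factor=10.0, patience=10):
        self.lr = lr
        self.factor = factor
        self.patience = patience
        self.best = float("-inf")
        self.stale_epochs = 0

    def step(self, val_accuracy):
        """Call once per epoch; returns the learning rate for the next epoch."""
        if val_accuracy > self.best:
            self.best = val_accuracy
            self.stale_epochs = 0
        else:
            self.stale_epochs += 1
            if self.stale_epochs >= self.patience:
                self.lr /= self.factor
                self.stale_epochs = 0   # assumed: restart the count after a decay
        return self.lr

scheduler = PlateauDivider(lr=1e-3)     # hypothetical initial learning rate
for accuracy in [0.90, 0.95] + [0.94] * 10:
    current_lr = scheduler.step(accuracy)
```

After the ten non-improving epochs in this usage example, `current_lr` has been divided by 10 once.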
Deep Boundary Detection Boundary detection is equivariant to non-rigid transformations, although edge presence is locally invariant to orientation. The current state-of-the-art depends on fine-tuning ImageNet-pretrained networks to regress boundary probabilities on a per-patch basis. To demonstrate that hard-baked rotation equivariance serves as a strong generalization tool, we compared against a previous state-of-the-art architecture, without pretraining. We tried to mimic it as closely as possible, as shown in Figure 13. The main difference is that we divide the number of all feature maps by 2, for faster, more stable training. The baseline uses a VGG network extended with deeply supervised network (DSN) side-connections. These are 1×1-convolutions, which perform weighted averages across all relevant feature maps, resized to match the input. A binary cross-entropy loss is applied to each side-connection, to stabilize learning. A final ‘fusion’ layer is created by taking a weighted linear combination of the side-connections; this is the final output. We adapt side-connections to H-Nets by using the complex magnitude of feature maps before taking a weighted average. This means that the resultant boundary predictions are locally invariant to rotation. We added a small sparsity regularizer to our cost function, because we found it improved results slightly. We call the Harmonic variant of the DSN an H-DSN. We also compare against the baseline with the number of parameters matched to H-DSN (the first layer has 7 features, instead of 16, and so on).
We also compared with a method that uses a mean-and-covariance-RBM. That technique has five main contributions: 1) zero-mean, unit variance normalization of inputs, 2) sparsity regularization of hidden units, 3) averaged ground truth edge annotations, 4) average outputs to 16 input rotations, 5) non-maximum suppression of results by the Canny method. We perform the first 2 methods, but leave the last 3 alone. In particular, that method does not pretrain on ImageNet, and attempts some kind of rotation averaging for global equivariance, so it is a good baseline to measure against. We tested on the Berkeley Segmentation Dataset (BSD500). As shown in Table 2 for non-pretrained models, H-Nets deliver superior performance over current state-of-the-art architectures, including the mean-and-covariance-RBM method, which also encodes rotation equivariance. Most noticeable of all is that we only use 5% of the parameters of the DSN baseline, showing how, by restricting the search space of learnable models through hard-baking local rotation equivariance, we need not learn so many parameters.
5.2 Model Insight
Here we investigate some of the properties of the H-Net implementation, making sure that the motivations behind H-Net design are achieved by the implementation.
Rotational stability As a sanity check, we measured the invariance of the magnitude response to rotation. We show the result of rotating a random input to an H-Net layer in Figure 8. The response is very flat, with periodic small fluctuations due to the inexactness of anti-aliasing.
The real parts of the filters, from the first layer of the boundary-detection-trained H-Net, are shown in Figure 9. They are aligned at zero phase ($\beta = 0$) for ease of viewing. Since the network is trained on zero-mean, unit variance, normalized color images, the weights do not have the natural colors we would see in real-world images. Nonetheless, there is useful information we can glean from inspecting these. Most first-layer filters detect color boundaries, there are no blank filters as one usually sees in CNNs, and there are few reoriented copies. We also see from the phase histograms that all phases are utilized by filters throughout the network, indicating full use of the phase information. This is interesting, because it means that the model’s parameters are being used fully, with low redundancy, which we surmise comes from easier optimization on the equivariant manifold.
Data ablation Here we investigate the data efficiency of H-Nets. CNNs are massively data-hungry. Krizhevsky’s landmark paper used 60 million parameters, trained on 1.2 million RGB images quantized to 256 bits and split between 1000 classes, for a total of 10 bits of information per weight. Even this vast amount of data was insufficient for training, and data augmentation was needed to improve results. We ran an experiment on the rotated MNIST dataset to show that with hard-baked rotation equivariance, we require less data than competing methods, which is indeed the case (see Figure 10). Interestingly, and predictably, regular CNNs trained with data augmentation still perform worse than H-Nets, because they only learn global invariance to rotation, rather than local equivariances at each layer.
We visualize feature maps in the lower layers of an MNIST-trained H-Net (see Figure 11). For a given input, we see that the feature maps encode very complicated structures. Left to right, we see the H-Net learns to detect edges, corners, object presence, negative space, and outlines of objects. We repeat this for the BSD500-trained H-DSN (see Figure 12). It shows that equivariance is preserved through to the deepest feature maps. It also highlights the compact representation of feature presence and pose, which regular CNNs cannot provide.
Conclusions
We presented a convolutional neural network that is locally equivariant to patch-wise translation and, for the first time, to continuous 360°-rotation. We achieved this by restricting the filters to circular harmonics, essentially hard-baking rotation into the architecture. This can be implanted onto other architectures too. The use of circular harmonics pays dividends in that we receive full rotational equivariance using few parameters. This leads to good generalization, even with less (or less augmented) training data. The only disadvantage we’ve seen so far is the higher per-filter computational cost, but our guidance for network design balances that cost against the more expressive representation. The better interpretability of the feature maps is a bonus, because we know how they transform under input image rotations. We applied our network to the problem of classifying rotated-MNIST, setting a new state-of-the-art. We also applied our network to boundary detection, again achieving state-of-the-art results, for non-pretrained networks. We have shown that 360°-rotational equivariance is both possible and useful. Our TensorFlow™ implementation is available on the project website.
Future work Extension of this work could involve hard-baking yet more transformations into the equivariance properties of the Harmonic Network, possibly extending to 3D. This will allow yet more expressibility in network representations, extending the benefits we have seen afforded by rotation equivariance to a larger class of models and applications.
Acknowledgements Support is from Fight for Sight UK, a Microsoft Research PhD Scholarship, EPSRC Nature Smart Cities EP/K503745/1 and ENGAGE EP/K015664/1.
Appendix A Equivariance properties
In Section 3.2 we mentioned that cross-correlation with the circular harmonics is a 360°-rotation equivariant feature transform. Here we provide the proof, and some of the properties mentioned in Arithmetic and Equivariance Condition.
where we have used the decomposition , with and . The rotational cross-correlation is performed about the origin of the image. If we rotate the image, then we have
If we define , where , then
And so rotational cross-correlation is rotationally equivariant about the origin of rotation. In the next part, we build up to a result needed for proving the chained cross-correlation result.
To perform the rotational cross-correlation about another point t, we first have to translate the image such that t is the new origin, so , then perform the rotational cross-correlation, so
In general, for every t this expression returns a different value. The response of a -rotated image about t is then
Say we wish to perform the rotational cross-correlation about a point t, when the image has been rotated about the origin. Denoting , then the response is
Thus we see that cross-correlation of the rotated signal with the circular harmonic filter is equal to the response at zero rotation, $[W_m \star F^0]$, multiplied by a complex phase shift $e^{im\theta}$. In the notation of the paper, we denote this multiplication by $e^{im\theta}$ as $\psi_\theta^m$. Thus cross-correlation with $W_m$ yields a rotationally equivariant feature mapping.
A.2 Properties
We have used the property that the cross-correlation is linear and that we may pull the scalar factor outside. If we write then , so
Thus we see that the chained cross-correlation results in a summation of the rotation orders of the individual filters and . Setting , such that we evaluate the cross-correlation at the center of rotation, we regain an equation similar to 18.
A.2.2 Magnitude nonlinearities
since only acts on magnitudes. Since for fixed the output is a function of and only, the point-wise magnitude-acting nonlinearity preserves rotational equivariance.
A.2.3 Summation of feature maps
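The summation of feature maps of the same rotation order is a new feature map of the same rotation order. Consider two feature maps $F_1$ and $F_2$ of rotation order $m$. Summation is a pointwise operation, so we only consider two corresponding points in the feature maps, which we denote $Ae^{i(m\theta + \alpha)}$ and $Be^{i(m\theta + \beta)}$, where $\alpha$ and $\beta$ are phase offsets. The sum is

$$Ae^{i(m\theta + \alpha)} + Be^{i(m\theta + \beta)} = e^{im\theta}\left(Ae^{i\alpha} + Be^{i\beta}\right),$$

which for fixed $A$, $B$, $\alpha$, and $\beta$ is a function of $m$ and $\theta$ only and so is also rotationally equivariant with order $m$.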
Appendix B Number of parameters
Here we list a breakdown of how we computed the number of parameters for the various network architectures in the experiments section. The network architectures used are in Figure 13. Red boxes are cross-correlations, blue boxes are pooling (average for H-Nets, max for regular CNNs), green boxes are 1×1-cross-correlations.
For a standard CNN layer with $i$ input channels and $o$ output channels, and $k \times k$ sized weights, the number of learnable parameters is $iok^2$. Since there is one bias per output channel, this increases to $iok^2 + o$. If using batch normalization, then there is an extra per-channel scaling factor, which increases the number of learnable parameters to $iok^2 + 2o$. The standard CNN for the rotated MNIST experiments has 6 layers of 3×3 cross-correlations and 1 layer of 4×4 cross-correlations, with 20 feature maps per layer and 3 batch normalization layers, so the number of learnable parameters is 21570. The calculations are shown in Table 3.
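The count for the standard CNN can be reproduced with a short script. The layer layout below is an assumption (1 input channel, six 3×3 layers with 20 feature maps, a final 4×4 layer with 10 outputs, batch normalization on three of the 20-map layers); under that assumption the total matches the reported 21570.

```python
def conv_params(in_ch, out_ch, k, bias=True, batch_norm=False):
    """Weights (in*out*k*k), plus one bias and one batch-norm scale per output."""
    n = in_ch * out_ch * k * k
    if bias:
        n += out_ch
    if batch_norm:
        n += out_ch
    return n

# Assumed layout: (input channels, output channels, kernel size) per layer.
layers = [(1, 20, 3)] + [(20, 20, 3)] * 5 + [(20, 10, 4)]
# Which three layers carry batch norm is an assumption; the total only
# depends on them having 20 output channels each.
bn_layers = {1, 3, 5}

total = sum(conv_params(i, o, k, batch_norm=(idx in bn_layers))
            for idx, (i, o, k) in enumerate(layers))
print(total)  # 21570
```

Per-layer values from this script can be compared against the rows of Table 3.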
B.2 Harmonic networks
The learnable parameters of a Harmonic Network are the radial profile and the per-filter phase offset. For a $k \times k$ filter, the number of radial profile elements is equal to the number of rings of equal distance from the center of the filter. For example, consider Figure 14, which is an excerpt from the main paper. This is a 5×5 filter, with 6 rings of equal distance from the center of the filter (the smallest ring is just a single point). So this filter has 6 radial profile terms and 1 phase offset to learn. This contrasts with a regular 5×5 filter, which would have 25 learnable parameters. Note that for filters with rotation order $m \neq 0$, the center pixel of the filter is in fact always zero, and so for a 5×5 filter of rotation order $m \neq 0$, the number of radial profile terms is 5. So for the H-Net in the main paper with 5×5 filters and batch normalization in layers 2, 4, & 6, the number of learnable parameters is 33347. The calculations are in Table 4. Note that the final layer contains just one set of biases and no phase offsets. A similar set of calculations can be performed for the deeply supervised networks.
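The ring count can be checked by enumerating the distinct distances from the center of the grid. This is a small sketch for odd filter sizes; the function names are illustrative.

```python
def ring_radii_squared(k):
    """Distinct squared distances from the centre of a k x k grid (k odd)."""
    half = k // 2
    offsets = range(-half, half + 1)
    return sorted({x * x + y * y for x in offsets for y in offsets})

def harmonic_filter_params(k, m):
    """Radial-profile terms plus one phase offset for a single filter."""
    n_rings = len(ring_radii_squared(k))
    if m != 0:
        n_rings -= 1        # centre pixel is always zero for m != 0
    return n_rings + 1      # +1 for the phase offset

print(ring_radii_squared(5))          # [0, 1, 2, 4, 5, 8]
print(harmonic_filter_params(5, 0))   # 7
print(harmonic_filter_params(5, 1))   # 6
```

For a 5×5 filter this gives 6 rings, so 7 learnable values per order-0 filter and 6 per nonzero-order filter, against 25 for a regular filter.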