Group Equivariant Convolutional Networks

Taco S. Cohen, Max Welling

Introduction

Deep convolutional neural networks (CNNs, convnets) have proven to be very powerful models of sensory data such as images, video, and audio. Although a strong theory of neural network design is currently lacking, a large amount of empirical evidence supports the notion that both convolutional weight sharing and depth (among other factors) are important for good predictive performance.

Convolutional weight sharing is effective because there is a translation symmetry in most perception tasks: the label function and data distribution are both approximately invariant to shifts. By using the same weights to analyze or model each part of the image, a convolution layer uses far fewer parameters than a fully connected one, while preserving the capacity to learn many useful transformations.

Convolution layers can be used effectively in a deep network because all the layers in such a network are translation equivariant: shifting the image and then feeding it through a number of layers is the same as feeding the original image through the same layers and then shifting the resulting feature maps (at least up to edge-effects). In other words, the symmetry (translation) is preserved by each layer, which makes it possible to exploit it not just in the first, but also in higher layers of the network.

In this paper we show how convolutional networks can be generalized to exploit larger groups of symmetries, including rotations and reflections. The notion of equivariance is key to this generalization, so in section 2 we will discuss this concept and its role in deep representation learning. After discussing related work in section 3, we recall a number of mathematical concepts in section 4 that allow us to define and analyze the G-convolution in a generic manner.

In section 5, we analyze the equivariance properties of standard CNNs, and show that they are equivariant to translations but may fail to equivary with more general transformations. Using the mathematical framework from section 4, we can define G-CNNs (section 6) by analogy to standard CNNs (the latter being the G-CNN for the translation group). We show that G-convolutions, as well as various kinds of layers used in modern CNNs, such as pooling, arbitrary pointwise nonlinearities, batch normalization and residual blocks are all equivariant, and thus compatible with G-CNNs. In section 7 we provide concrete implementation details for group convolutions.

In section 8 we report experimental results on MNIST-rot and CIFAR10, where G-CNNs achieve state of the art results (2.28%2.28\% error on MNIST-rot, and 4.19%4.19\% resp. 6.46%6.46\% on augmented and plain CIFAR10). We show that replacing planar convolutions with G-convolutions consistently improves results without additional tuning. In section 9 we provide a discussion of these results and consider several extensions of the method, before concluding in section 10.

Structured & Equivariant Representations

In this paper we construct representations that have the structure of a linear GG-space, for some chosen group GG. This means that each vector in the representation space has a pose associated with it, which can be transformed by the elements of some group of transformations GG. This additional structure allows us to model data more efficiently: A filter in a G-CNN detects co-occurrences of features that have the preferred relative pose, and can match such a feature constellation in every global pose through an operation called the G-convolution.

A representation space can obtain its structure from other representation spaces to which it is connected. For this to work, the network or layer Φ\Phi that maps one representation to another should be structure preserving. For GG-spaces this means that Φ\Phi has to be equivariant:

That is, transforming an input xx by a transformation gg (forming TgxT_{g}\,x) and then passing it through the learned map Φ\Phi should give the same result as first mapping xx through Φ\Phi and then transforming the representation.

Equivariance can be realized in many ways, and in particular the operators TT and TT^{\prime} need not be the same. The only requirement for TT and TT^{\prime} is that for any two transformations gg and hh, we have T(gh)=T(g)T(h)T(gh)=T(g)T(h) (i.e. TT is a linear representation of GG).

From equation 1 we see that the familiar concept of invariance is a special kind of equivariance where TgT^{\prime}_{g} is the identity transformation for all gg. In deep learning, general equivariance is more useful than invariance because it is impossible to determine if features are in the right spatial configuration if they are invariant.

Besides improving statistical efficiency and facilitating geometrical reasoning, equivariance to symmetry transformations constrains the network in a way that can aid generalization. A network Φ\Phi can be non-injective, meaning that non-identical vectors xx and yy in the input space become identical in the output space (for example, two instances of a face may be mapped onto a single vector indicating the presence of any face). If Φ\Phi is equivariant, then the GG-transformed inputs TgxT_{g}\,x and TgyT_{g}\,y must also be mapped to the same output. Their “sameness” (as judged by the network) is preserved under symmetry transformations.

Related Work

There is a large body of literature on invariant representations. Invariance can be achieved by pose normalization using an equivariant detector (Lowe, 2004; Jaderberg et al., 2015) or by averaging a possibly nonlinear function over a group (Reisert, 2008; Skibbe, 2013; Manay et al., 2006; Kondor, 2007).

Scattering convolution networks use wavelet convolutions, nonlinearities and group averaging to produce stable invariants (Bruna & Mallat, 2013). Scattering networks have been extended to use convolutions on the group of translations, rotations and scalings, and have been applied to object and texture recognition (Sifre & Mallat, 2013; Oyallon & Mallat, 2015).

A number of recent works have addressed the problem of learning or constructing equivariant representations. This includes work on transforming autoencoders (Hinton et al., 2011), equivariant Boltzmann machines (Kivinen & Williams, 2011; Sohn & Lee, 2012), equivariant descriptors (Schmidt & Roth, 2012), and equivariant filtering (Skibbe, 2013).

Lenc & Vedaldi (2015) show that the AlexNet CNN (Krizhevsky et al., 2012) trained on imagenet spontaneously learns representations that are equivariant to flips, scaling and rotation. This supports the idea that equivariance is a good inductive bias for deep convolutional networks. Agrawal et al. (2015) show that useful representations can be learned in an unsupervised manner by training a convolutional network to be equivariant to ego-motion.

Anselmi et al. (2014, 2015) use the theory of locally compact topological groups to develop a theory of statistically efficient learning in sensory cortex. This theory was implemented for the commutative group consisting of time- and vocal tract length shifts for an application to speech recognition by Zhang et al. (2015).

Gens & Domingos (2014) proposed an approximately equivariant convolutional architecture that uses sparse, high-dimensional feature maps to deal with high-dimensional groups of transformations. Dieleman et al. (2015) showed that rotation symmetry can be exploited in convolutional networks for the problem of galaxy morphology prediction by rotating feature maps, effectively learning an equivariant representation. This work was later extended (Dieleman et al., 2016) and evaluated on various computer vision problems that have cyclic symmetry.

Cohen & Welling (2014) showed that the concept of disentangling can be understood as a reduction of the operators TgT_{g} in an equivariant representation, and later related this notion of disentangling to the more familiar statistical notion of decorrelation (Cohen & Welling, 2015).

Mathematical Framework

In this section we present a mathematical framework that enables a simple and generic definition and analysis of G-CNNs for various groups GG. We begin by defining symmetry groups, and study in particular two groups that are used in the G-CNNs we have built so far. Then we take a look at functions on groups (used to model feature maps in G-CNNs) and their transformation properties.

If we have two symmetry transformations gg and hh and we compose them, the result ghgh is another symmetry transformation (i.e. it leaves the object invariant as well). Furthermore, the inverse g1g^{-1} of any symmetry is also a symmetry, and composing it with gg gives the identity transformation ee. A set of transformations with these properties is called a symmetry group.

2 The group p​4𝑝4p4

The group p4p4 consists of all compositions of translations and rotations by 9090 degrees about any center of rotation in a square grid. A convenient parameterization of this group in terms of three integers r,u,vr,u,v is

The composition and inversion operations could also be represented directly in terms of integers (r,u,v)(r,u,v), but the equations are cumbersome. Hence, our preferred method of composing two group elements represented by integer tuples is to convert them to matrices, multiply these matrices, and then convert the resulting matrix back to a tuple of integers (using the atan2 function to obtain rr).

3 The group p​4​m𝑝4𝑚p4m

The group p4mp4m consists of all compositions of translations, mirror reflections, and rotations by 9090 degrees about any center of rotation in the grid. Like p4p4, we can parameterize this group by integers:

Again, composition is most easily performed using the matrix representation. Computing r,u,vr,u,v from a given matrix gg can be done using the same method we use for p4p4, and for mm we have m=12(1det(g))m=\frac{1}{2}(1-\det(g)).

4 Functions on groups

Although the feature maps must always be stored in finite arrays, modeling them as functions that extend to infinity (while being non-zero on a finite region only) simplifies the mathematical analysis of CNNs.

We will be concerned with transformations of the feature maps, so we introduce the following notation for a transformation gg acting on a set of feature maps:

Computationally, this says that to get the value of the gg-transformed feature map LgfL_{g}f at the point xx, we need to do a lookup in the original feature map ff at the point g1xg^{-1}x, which is the unique point that gets mapped to xx by gg. This operator LgL_{g} is a concrete instantiation of the transformation operator TgT_{g} referenced in section 2, and one may verify that

When we apply the 9090 degree rotation rr to a function on p4p4, each planar patch follows its red rr-arrow (thus incrementing the rotation coordinate by 11 (mod 44)), and simultaneously undergoes a 9090-degree rotation. The result of this operation is shown on the right of figure 1. As we will see in section 6, a p4p4 feature map in a p4p4-CNN undergoes exactly this motion under rotation of the input image.

For p4mp4m, we can make a similar plot, shown in figure 2. A p4mp4m function has 88 planar patches, each one associated with a mirroring mm and rotation rr. Besides red rotation arrows, the figure now includes small blue reflection lines (which are undirected, since reflections are self-inverse).

Upon rotation of a p4mp4m function, each patch again follows its red rr-arrows and undergoes a 9090 degree rotation. Under a mirroring, the patches connected by a blue line will change places and undergo the mirroring transformation.

This rich transformation structure arises from the group operation of p4p4 or p4mp4m, combined with equation 4 which describes the transformation of a function on a group.

Finally, we define the involution of a feature map, which will appear in section 6.1 when we study the behavior of the G-convolution, and which also appears in the gradient of the G-convolution. We have:

Equivariance properties of CNNs

In this section we recall the definitions of the convolution and correlation operations used in conventional CNNs, and show that these operations are equivariant to translations but not to other transformations such as rotation. This is certainly well known and easy to see by mental visualization, but deriving it explicitly will make it easier to follow the derivation of group equivariance of the group convolution defined in the next section.

If one employs convolution (*) in the forward pass, the correlation (\star) will appear in the backward pass when computing gradients, and vice versa. We will use the correlation in the forward pass, and refer generically to both operations as “convolution”.

Using the substitution yy+ty\rightarrow y+t, and leaving out the summation over feature maps for clarity, we see that a translation followed by a correlation is the same as a correlation followed by a translation:

And so we say that “correlation is an equivariant map for the translation group”, or that “correlation and translation commute”. Using an analogous computation one can show that also for the convolution, [Ltf]ψ=Lt[fψ][L_{t}f]*\psi=L_{t}[f*\psi].

Although convolutions are equivariant to translation, they are not equivariant to other isometries of the sampling lattice. For instance, as shown in the supplementary material, rotating the image and then convolving with a fixed filter is not the same as first convolving and then rotating the result:

In words, this says that the correlation of a rotated image LrfL_{r}f with a filter ψ\psi is the same as the rotation by rr of the original image ff convolved with the inverse-rotated filter Lr1ψL_{r^{-1}}\psi. Hence, if an ordinary CNN learns rotated copies of the same filter, the stack of feature maps is equivariant, although individual feature maps are not.

Group Equivariant Networks

In this section we will define the three layers used in a G-CNN (GG-convolution, GG-pooling, nonlinearity) and show that each one commutes with GG-transformations of the domain of the image.

The correlation (eq. 7) is computed by shifting a filter and then computing a dot product with the feature maps. By replacing the shift by a more general transformation from some group GG, we get the GG-correlation used in the first layer of a GG-CNN:

The equivariance of this operation is derived in complete analogy to eq. 8, now using the substitution huhh\rightarrow uh:

Note that if GG is not commutative, neither the GG-convolution nor the GG-correlation is commutative. However, the feature maps ψf\psi\star f and fψf\star\psi are related by the involution (eq. 6):

Since the involution is invertible (it is its own inverse), the information content of fψf\star\psi and ψf\psi\star f is the same. However, fψf\star\psi is more efficient to compute when using the method described in section 7, because transforming a small filter is faster than transforming a large feature map.

It is customary to add a bias term to each feature map in a convolution layer. This can be done for GG-conv layers as well, as long as there is only one bias per GG-feature map (instead of one bias per spatial feature plane within a GG-feature map). Similarly, batch normalization (Ioffe & Szegedy, 2015) should be implemented with a single scale and bias parameter per GG-feature map in order to preserve equivariance. The sum of two GG-equivariant feature maps is also GG-equivariant, thus GG-conv layers can be used in highway networks and residual networks (Srivastava et al., 2015; He et al., 2015).

2 Pointwise nonlinearities

Equation 12 shows that GG-correlation preserves the transformation properties of the previous layer. What about nonlinearities and pooling?

which acts on functions by post-composing them with ν\nu.

Since the left transformation operator LL acts by pre-composition, CC and LL commute:

so the rectified feature map inherits the transformation properties of the previous layer.

3 Subgroup pooling and coset pooling

where gU={guuU}gU=\{gu\,|\,u\in U\} is the gg-transformation of some pooling domain UGU\subset G (typically a neighborhood of the identity transformation). In a regular convnet, UU is usually a 2×22\times 2 or 3×33\times 3 square including the origin (0,0)(0,0), and gg is a translation.

As shown in the supplementary material, pooling commutes with LhL_{h}:

Since pooling tends to reduce the variation in a feature map, it makes sense to sub-sample the pooled feature map, or equivalently, to do a “pooling with stride”. In a G-CNN, the notion of “stride” is generalized by subsampling on a subgroup HGH\subset G. That is, HH is a subset of GG that is itself a group (i.e. closed under multiplication and inverses). The subsampled feature map is then equivariant to HH but not GG.

We can obtain full GG-equivariance by choosing our pooling region UU to be a subgroup HGH\subset G. The pooling domains gHgH that result are called cosets in group theory. The cosets partition the group into non-overlapping regions. The feature map that results from pooling over cosets is invariant to the right-action of HH, because the cosets are similarly invariant (ghH=gHghH=gH). Hence, we can arbitrarily choose one coset representative per coset to subsample on. The feature map that results from coset pooling may be thought of as a function on the quotient space G/HG/H, in which two transformations are considered equivalent if they are related by a transformation in HH.

This concludes our analysis of GG-CNNs. Since all layer types are equivariant, we can freely stack them into deep networks and expect G-conv parameter sharing to be effective at arbitrary depth.

Efficient Implementation

Computing the G-convolution for involves nothing more than indexing arithmetic and inner products, so it can be implemented straightforwardly. Here we present the details for a G-convolution implementation that can leverage recent advances in fast computation of planar convolutions (Mathieu et al., 2014; Vasilache et al., 2015; Lavin & Gray, 2015).

Thus, to compute the p4p4 (or p4mp4m) correlation fψf\star\psi we can first compute LsψL_{s}\psi (“filter transformation”) for all four rotations (or all eight rotation-flips) and then call a fast planar correlation routine on ff and the augmented filter bank.

The computational cost of the algorithm presented here is roughly equal to that of a planar convolution with a filter bank that is the same size as the augmented filter bank used in the G-convolution, because the cost of the filter transformation is negligible.

The filter transformation LsL_{s} amounts to a permutation of the entries of each of the Kl×Kl1K^{l}\times K^{l-1} scalar-valued filter channels in FF. Since we are applying SlS^{l} transformations to each filter, the output of this operation is an array of shape Kl×Sl×Kl1×Sl1×n×nK^{l}\times S^{l}\times K^{l-1}\times S^{l-1}\times n\times n, which we call F+F^{+}.

The permutation can be implemented efficiently by a GPU kernel that does a lookup into FF for each output cell of F+F^{+}, using a precomputed index associated with the output cell. To precompute the indices, we define an invertible map g(s,u,v)g(s,u,v) that takes an input index (valid for an array of shape Sl1×n×nS^{l-1}\times n\times n) and produces the associated group element gg as a matrix (section 4.2 and 4.3). For each input index (s,u,v)(s,u,v) and each transformation ss^{\prime}, we compute sˉ,uˉ,vˉ=g1(g(s,0,0)1g(s,u,v))\bar{s},\bar{u},\bar{v}=g^{-1}(g(s^{\prime},0,0)^{-1}g(s,u,v)). This index is used to set F+[i,s,j,s,u,v]=F[i,j,sˉ,uˉ,vˉ]F^{+}[i,s^{\prime},j,s,u,v]=F[i,j,\bar{s},\bar{u},\bar{v}] for all i,ji,j.

The G-convolution for a new group can be added by simply implementing a map g()g(\cdot) from indices to matrices.

2 Planar convolution

The second part of the G-convolution algorithm is a planar convolution using the expanded filter bank F+F^{+}. If Sl1>1S^{l-1}>1, the sum over XX in eq. 18 involves a sum over the stabilizer. This sum can be folded into the sum over feature channels performed by the planar convolution routine by reshaping F+F^{+} from Kl×Sl×Kl1×Sl1×n×nK^{l}\times S^{l}\times K^{l-1}\times S^{l-1}\times n\times n to SlKl×Sl1Kl1×n×nS^{l}K^{l}\times S^{l-1}K^{l-1}\times n\times n. The resulting array can be interpreted as a conventional filter bank with Sl1Kl1S^{l-1}K^{l-1} planar input channels and SlKlS^{l}K^{l} planar output channels, which can be correlated with the feature maps ff (similarly reshaped).

Experiments

The rotated MNIST dataset (Larochelle et al., 2007) contains 6200062000 randomly rotated handwritten digits. The dataset is split into a training, validation and test sets of size 1000010000, 20002000 and 5000050000, respectively.

We performed model selection using the validation set, yielding a CNN architecture (Z2CNN) with 77 layers of 3×33\times 3 convolutions (4×44\times 4 in the final layer), 2020 channels in each layer, relu activation functions, batch normalization, dropout, and max-pooling after layer 22. For optimization, we used the Adam algorithm (Kingma & Ba, 2015). This baseline architecture outperforms the models tested by Larochelle et al. (2007) (when trained on 1212k and evaluated on 5050k), but does not match the previous state of the art, which uses prior knowledge about rotations (Schmidt & Roth, 2012) (see table 1).

Next, we replaced each convolution by a p4p4-convolution (eq. 10 and 11), divided the number of filters by 4=2\sqrt{4}=2 (so as to keep the number of parameters approximately fixed), and added max-pooling over rotations after the last convolution layer. This architecture (P4CNN) was found to perform better without dropout, so we removed it. The P4CNN almost halves the error rate of the previous state of the art (2.28%2.28\% vs 3.98%3.98\% error).

We then tested the hypothesis that premature invariance is undesirable in a deep architecture (section 2). We took the Z2CNN, replaced each convolution layer by a p4p4-convolution (eq. 10) followed by a coset max-pooling over rotations. The resulting feature maps consist of rotation-invariant features, and have the same transformation law as the input image. This network (P4CNNRotationPooling) outperforms the baseline and the previous state of the art, but performs significantly worse than the P4CNN which does not pool over rotations in intermediate layers.

2 CIFAR-10

The CIFAR-10 dataset consists of 60k60k images of size 32×3232\times 32, divided into 1010 classes. The dataset is split into 40k40k training, 10k10k validation and 10k10k testing splits.

To evaluate G-CNNs, we replaced all convolution layers of the baseline architectures by p4p4 or p4mp4m convolutions. For a constant number of filters, this increases the size of the feature maps 44 or 88-fold, which in turn increases the number of parameters required per filter in the next layer. Hence, we halve the number of filters in each p4p4-conv layer, and divide it by roughly 83\sqrt{8}\approx 3 in each p4mp4m-conv layer. This way, the number of parameters is left approximately invariant, while the size of the internal representation is increased. Specifically, we used ki=11,23,45k_{i}=11,23,45 for p4mp4m-ResNet44.

To evaluate the impact of data augmentation, we compare the networks on CIFAR10 and augmented CIFAR10+. The latter denotes moderate data augmentation with horizontal flips and small translations, following Goodfellow et al. (2013) and many others.

The training procedure for training the All-CNN was reproduced as closely as possible from Springenberg et al. (2015). For the ResNets, we used stochastic gradient descent with initial learning rate of 0.050.05 and momentum 0.90.9. The learning rate was divided by 1010 at epoch 50,10050,100 and 150150, and training was continued for 300300 epochs.

To the best of our knowledge, the p4mp4m-CNN outperforms all published results on plain CIFAR10 (Wan et al., 2013; Goodfellow et al., 2013; Lin et al., 2014; Lee et al., 2015b; Srivastava et al., 2015; Clevert et al., 2015; Lee et al., 2015a). However, due to radical differences in model sizes and architectures, it is difficult to infer much about the intrinsic merit of the various techniques. It is quite possible that the cited methods would yield better results when deployed in larger networks or in combination with other techniques. Extreme data augmentation and model ensembles can also further improve the numbers (Graham, 2014).

Inspired by the wide ResNets of Zagoruyko & Komodakis (2016), we trained another ResNet with 2626 layers and ki=(71,142,248)k_{i}=(71,142,248) (for planar convolutions) or ki=(50,100,150)k_{i}=(50,100,150) (for p4mp4m convolutions). When trained with moderate data augmentation, this network achieves an error rate of 5.27%5.27\% using planar convolutions, and 4.19%\mathbf{4.19\%} with p4mp4m convolutions. This result is comparable to the 4.17%4.17\% error reported by Zagoruyko & Komodakis (2016), but using fewer parameters (7.2M vs 36.5M).

Discussion & Future work

Our results show that p4p4 and p4mp4m convolution layers can be used as a drop-in replacement of standard convolutions that consistently improves the results.

G-CNNs benefit from data augmentation in the same way as convolutional networks, as long as the augmentation comes from a group larger than GG. Augmenting with flips and small translations consistently improves the results for the p4p4 and p4mp4m-CNN.

The CIFAR dataset is not actually symmetric, since objects typically appear upright. Nevertheless, we see substantial increases in accuracy on this dataset, indicating that there need not be a full symmetry for G-convolutions to be beneficial.

In future work, we want to implement G-CNNs that work on hexagonal lattices which have an increased number of symmetries relative to square grids, as well as G-CNNs for 3D space groups. All of the theory presented in this paper is directly applicable to these groups, and the G-convolution can be implemented in such a way that new groups can be added by simply specifying the group operation and a bijective map between the group and the set of indices.

One limitation of the method as presented here is that it only works for discrete groups. Convolution on continuous (locally compact) groups is mathematically well-defined, but may be hard to approximate in an equivariant manner. A further challenge, already identified by Gens & Domingos (2014), is that a full enumeration of transformations in a group may not be feasible if the group is large.

Finally, we hope that the current work can serve as a concrete example of the general philosophy of “structured representations”, outlined in section 2. We believe that adding mathematical structure to a representation (making sure that maps between representations preserve this structure), could enhance the ability of neural nets to see abstract similarities between superficially different concepts.

Conclusion

We have introduced G-CNNs, a generalization of convolutional networks that substantially increases the expressive capacity of a network without increasing the number of parameters. By exploiting symmetries, G-CNNs achieve state of the art results on rotated MNIST and CIFAR10. We have developed the general theory of G-CNNs for discrete groups, showing that all layer types are equivariant to the action of the chosen group GG. Our experimental results show that G-convolutions can be used as a drop-in replacement for spatial convolutions in modern network architectures, improving their performance without further tuning.

Acknowledgements

We would like to thank Joan Bruna, Sander Dieleman, Robert Gens, Chris Olah, and Stefano Soatto for helpful discussions. This research was supported by NWO (grant number NAI.14.108), Google and Facebook.

References

Appendix A: Equivariance Derivations

Line by line, we used the following definitions, facts and manipulations:

The definition of the correlation \star.

The definition of LrL_{r}, i.e. Lrf(x)=f(r1x)L_{r}f(x)=f(r^{-1}x).

The definition of the correlation \star.

A visual proof can be found in (Dieleman et al., 2016).

Using a similar line of reasoning, we can show that pooling commutes with the group action:

Appendix B: Gradients

To train a G-CNN, we need to compute gradients of a loss function with respect to the parameters of the filters. If we use the fast algorithm explained in section 7 of the main paper, we only have to implement the gradient of the indexing operation (section 7.1, “filter transformation”), because the 2D convolution routine and its gradient are given.

This gradient is computed as follows. The gradient of the loss with respect to cell ii in the input of the indexing operation is the sum of the gradients of the output cells jj that index cell ii. On current GPU hardware, this can be implemented efficiently using a kernel that is instantiated for each cell jj in the output array. The kernel adds the value of the gradient of the loss with respect to cell jj to cell ii of the array that holds the gradient of the loss with respect to the input of the indexing operation (this array is to be initialized at zero). Since multiple kernels write to the same cell ii, the additions must be done using atomic operations to avoid concurrency problems.

Alternatively, one could implement the filter transformation using a precomputed permutation matrix. This is not as efficient, but the gradient is trivial, and most computation graph / deep learning packages will have implemented the matrix multiplication and its gradient.

Appendix C: G-conv calculus

Although the gradient of the filter transformation operation is all that is needed to do backpropagation in a G-CNN for a split group GG, it is instructive to derive the analytical gradients of the G-correlation operation. This leads to an elegant “G-conv calculus”, included here for the interested reader.

Let feature map kk at layer ll be denoted fkl=fl1ψlkf^{l}_{k}=f^{l-1}\star\psi^{lk}, where fl1f^{l-1} is the stack of feature maps in the previous layer. At some point in the backprop algorithm, we will have computed the derivative L/fkl\partial L/\partial f^{l}_{k} for all kk, and we need to compute L/fjl1\partial L/\partial f^{l-1}_{j} (to backpropagate to lower layers) as well as L/ψjlk\partial L/\partial\psi_{j}^{lk} (to update the parameters). We find that,

and ψjl\psi_{j}^{l} is the set of filter components applied to input feature map jj at layer ll:

To compute the gradient with respect to component jj of filter kk, we have to G-convolve the jj-th input feature map with the kk-th output feature map:

So we see that both the forward and backward passes involve convolution or correlation operations, as is the case in standard convnets.