Steerable CNNs

Taco S. Cohen, Max Welling

Introduction

Much of the recent progress in computer vision can be attributed to the availability of large labelled datasets and deep neural networks capable of absorbing large amounts of information. While many practical problems can now be solved, the requirement for big (labelled) data is a fundamentally unsatisfactory state of affairs. Human beings are able to learn new concepts with very few labels, and mimicking this ability is an important challenge for artificial intelligence research. From an applied perspective, improving the statistical efficiency of deep learning is vital because in many domains (e.g. medical image analysis), acquiring large amounts of labelled data is costly.

To improve the statistical efficiency of machine learning methods, many have sought to learn invariant representations. In deep learning, however, intermediate layers should not be fully invariant, because the relative pose of local features must be preserved for further layers (Cohen & Welling, 2016; Hinton et al., 2011). Thus, one is led to the idea of equivariance: a network is equivariant if the representations it produces transform in a predictable linear manner under transformations of the input. In other words, equivariant networks produce representations that are steerable. Steerability makes it possible to apply filters not just in every position (as in a standard convolution layer), but in every pose, thus allowing for increased parameter sharing.

Previous work has shown that equivariant CNNs yield state of the art results on classification tasks (Cohen & Welling, 2016; Dieleman et al., 2016), even though they only enforce equivariance to small groups of transformations like rotations by multiples of 90 degrees. Learning representations that are equivariant to larger groups is likely to result in further gains, but the computational cost of current methods scales with the size of the group, making this impractical. In this paper we present a general theory of steerable representations that covers all forms of linear steerability in convolutional networks, thus increasing the flexibility of equivariant CNNs and allowing us to decouple the computational cost from the size of the group, paving the way for future scaling.

We show that any steerable representation is a composition of elementary feature types. Each elementary feature can be steered independently of the others, and captures a distinct characteristic of the input that has an invariant or “objective” meaning. This doctrine of “observer-independent quantities” was put forward by Weyl (1939, ch. 1.4) and is used throughout physics. It has been applied to vision and representation learning by Kanatani (1990) and Cohen (2013).

This type system puts constraints on the network weights and architecture. Specifically, since an equivariant filter bank is required to map given input feature types to given output feature types, the number of parameters required by such a filter bank is reduced. Furthermore, by the same logic that tells us not to add meters to seconds, steerability considerations prevent us from adding features of different types (e.g. for residual learning (He et al., 2015)).

The rest of this paper is organized as follows. The theory of steerable CNNs is introduced in Section 2. Related work is discussed in Section 3, which is followed by classification experiments in Section 4 and a discussion and conclusion in Section 5.

Steerable CNNs

This says that the pixel at $g^{-1}x$ gets moved to $x$ by the transformation $g\in G$. We note that $\pi_{0}(g)$ is a linear operator.

An important property of $\pi_{0}$ is that $\pi_{0}(gh)=\pi_{0}(g)\pi_{0}(h)$. Here, $gh$ means composition of transformations in $G$, while $\pi_{0}(g)\pi_{0}(h)$ denotes matrix multiplication. A vector space such as $\mathcal{F}_{0}$ equipped with a set of linear operators $\pi_{0}$ satisfying this condition is known as a group representation (or just representation, for short). A lot is known about group representations (Serre, 1977), and we will make extensive use of the theory, explaining the relevant concepts as needed.

2.2 Steerable representations

Let $(\mathcal{F},\pi)$ be a feature space with a group representation and $\Phi:\mathcal{F}\rightarrow\mathcal{F}'$ a convolutional network. The feature space $\mathcal{F}'$ is said to be (linearly) steerable with respect to $G$, if for all transformations $g\in G$, the features $\Phi f$ and $\Phi\pi(g)f$ are related by a linear transformation $\pi'(g)$ that does not depend on $f$. So $\pi'(g)$ allows us to “steer” the features in $\mathcal{F}'$ without referring to the input in $\mathcal{F}$ from which they were computed.

Combining the definition of steerability (i.e. $\Phi\pi(g)=\pi'(g)\Phi$) with the fact that $\pi$ is a group representation, we find that $\pi'$ must also be a group representation:

That is, $\pi'(gh)=\pi'(g)\pi'(h)$ (at least in the span of the image of $\Phi$). Figure 2 gives an illustration.
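The chain of equalities behind this claim can be sketched as follows. This is a reconstruction from the definition of steerability and the homomorphism property of $\pi$, not a copy of the original display:

```latex
\begin{aligned}
\pi'(gh)\,\Phi &= \Phi\,\pi(gh)          && \text{(steerability)} \\
               &= \Phi\,\pi(g)\,\pi(h)   && \text{($\pi$ is a representation)} \\
               &= \pi'(g)\,\Phi\,\pi(h)  && \text{(steerability)} \\
               &= \pi'(g)\,\pi'(h)\,\Phi && \text{(steerability)}
\end{aligned}
```

Since both sides agree after composition with $\Phi$, the identity holds on the span of the image of $\Phi$.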

Using this division, we can first construct a filter bank that generates $H$-steerable fibers, and then show that convolution with such a filter bank produces a feature space that is steerable with respect to the whole group $G$.

2.3 Equivariant filter banks

for some representation $\rho$ of $H$ that acts on the output fibers (see Figure 3). Note that we only require equivariance with respect to $H$ (which excludes translations) and not $G$, because translations can move patterns into and out of the receptive field of a fiber, making full translation equivariance impossible.

The space of maps satisfying the equivariance constraint is denoted $\operatorname{Hom}_{H}(\pi,\rho)$, because an equivariant map $\Psi$ is a “homomorphism of group representations”, meaning it respects the structure of the representations. Equivariant maps are also sometimes called intertwiners (Serre, 1977).

Since the equivariance constraint (eq. 3) is linear in $\Psi$, the space $\operatorname{Hom}_{H}(\pi,\rho)$ of admissible filter banks is a vector space: any linear combination of maps $\Psi,\Psi'\in\operatorname{Hom}_{H}(\pi,\rho)$ is again an intertwiner. Hence, given $\pi$ and $\rho$, we can compute a basis for $\operatorname{Hom}_{H}(\pi,\rho)$ by solving a linear system.

Computation of the intertwiner basis is done offline, before training. Once we have such a basis $\psi_{1},\ldots,\psi_{n}$ for $\operatorname{Hom}_{H}(\pi,\rho)$, we can express any equivariant filter bank $\Psi$ as a linear combination $\Psi=\sum_{i}\alpha_{i}\psi_{i}$ using parameters $\alpha_{i}$. As shown in Section 2.7, this can be done efficiently even in high dimensions.
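As an illustration of solving this linear system, the following sketch computes an intertwiner basis numerically. It assumes the constraint has the form $\rho(h)\Psi=\Psi\pi(h)$, imposes it only on group generators (which suffices for a representation), and uses an SVD null space. The function names, the choice of generators, and the single-channel $3\times 3$ example are assumptions made for illustration, not the authors' implementation.

```python
import numpy as np

def intertwiner_basis(pi_gens, rho_gens, tol=1e-10):
    """Basis of {Psi : rho(h) Psi = Psi pi(h)} for all listed generators h."""
    n_in = pi_gens[0].shape[0]
    n_out = rho_gens[0].shape[0]
    blocks = []
    for P, R in zip(pi_gens, rho_gens):
        # With row-major vec: vec(R Psi) = (R kron I) vec(Psi),
        # and vec(Psi P) = (I kron P^T) vec(Psi).
        blocks.append(np.kron(R, np.eye(n_in)) - np.kron(np.eye(n_out), P.T))
    A = np.vstack(blocks)
    _, s, vt = np.linalg.svd(A)
    rank = int(np.sum(s > tol))
    # Rows of vt beyond the rank span the null space of A.
    return [v.reshape(n_out, n_in) for v in vt[rank:]]

# Hypothetical example: D4 acting on the 9 pixels of a 3x3 patch.
POINTS = [(x, y) for x in (-1, 0, 1) for y in (-1, 0, 1)]

def perm_matrix(g):
    """Permutation matrix sending the pixel at p to the pixel at g p."""
    P = np.zeros((9, 9))
    for j, p in enumerate(POINTS):
        q = tuple(int(v) for v in g @ np.array(p))
        P[POINTS.index(q), j] = 1.0
    return P

r = np.array([[0, -1], [1, 0]])   # rotation by 90 degrees
m = np.array([[-1, 0], [0, 1]])   # a mirror reflection
pi_gens = [perm_matrix(r), perm_matrix(m)]
trivial_gens = [np.eye(1), np.eye(1)]  # trivial 1-dimensional output fiber

basis = intertwiner_basis(pi_gens, trivial_gens)

# Any linear combination of basis elements is again equivariant.
alpha = np.random.default_rng(0).normal(size=len(basis))
Psi = sum(a * psi for a, psi in zip(alpha, basis))
```

Here `len(basis)` plays the role of $n$ and `alpha` the role of the learnable parameters $\alpha_{i}$. For this toy case the basis has three elements, one per orbit of pixels (center, edges, corners).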

2.4 Induction

We have shown how to parameterize filter banks that intertwine $\pi$ and $\rho$, making the output fibers $H$-steerable by $\rho$ if the input space $\mathcal{F}$ is $H$-steerable by $\pi$. In this section we show how $H$-steerability of fibers $F'_{x}$ leads to $G$-steerability of the whole feature space $\mathcal{F}'$. This happens through a natural and important construction known as the induced representation (Mackey, 1952; 1953; 1968; Serre, 1977; Taylor, 1986; Folland, 1995; Kaniuth & Taylor, 2013).

As stated, the correlation $\Psi\star f$ could be computed by translating $f$ before applying $\Psi$:

We can now calculate the transformation law of the output space. To do so, we apply a translation $t$ and transformation $r\in H$ to $f\in\mathcal{F}$, yielding $\pi(tr)f$, and then perform the correlation with $\Psi$. With some algebra (Appendix A), we find:

then $\Psi\star\pi(g)f=\pi'(g)\Psi\star f$ (see Fig. 4). This representation $\pi'$ is known as the representation of $G$ induced by the representation $\rho$ of $H$, and is denoted $\pi'=\operatorname{Ind}_{H}^{G}\rho$.

When parsing eq. 6, it is important to keep in mind that (as indicated by the square brackets) $\pi'$ acts on the whole feature space $\mathcal{F}'$ while $\rho$ acts on individual fibers.

If we compare the induced representation (eq. 6) to the representation $\pi_{0}$ defined in eq. 1, we see that the difference lies only in the presence of a factor $\rho(r)$ applied to the fibers. This factor describes how the feature channels are mixed by the transformation. The color channels in the input space do not get mixed by geometrical transformations, so we say that $\pi_{0}$ is induced from the trivial representation $\rho_{0}(h)=I$.
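For reference, the two transformation laws being compared can be written side by side. The display below is a reconstruction from the surrounding description (a pixel at $g^{-1}x$ is moved to $x$, and the induced law adds a factor $\rho(r)$ acting on the fiber), so its exact form should be treated as an assumption rather than as the original equations:

```latex
[\pi_{0}(g)\,f](x) = f(g^{-1}x),
\qquad
[\pi'(tr)\,f](x) = \rho(r)\,\big[f\big((tr)^{-1}x\big)\big].
```

Setting $\rho(r)=I$ in the second expression recovers the first, which is the sense in which $\pi_{0}$ is induced from the trivial representation.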

Now that we have a $G$-steerable feature space $\mathcal{F}'$, we can iterate the procedure by computing a basis for the space of intertwiners between $\pi'$ (restricted to $H$) and some $\rho'$ of our choosing.

2.5 Feature types and character theory

By now, the reader may be wondering how to choose $\rho$, or indeed what the space of representations that we can choose from looks like in the first place. We will answer these questions in this section by showing that each representation has a type (encoded as a short list of integers) that corresponds to a certain symmetry or invariance of the feature. We further show how the number of parameters of an equivariant filter bank depends on the types of the representations $\pi$ and $\rho'$ that it intertwines. Our discussion will make use of a number of important elementary results from group representation theory which are stated but not proved. The reader wishing to go deeper may consult chapters 1 and 2 of the excellent book by Serre (1977).

for some basis matrix $A$, and some $i_{k}$ that index the irreps (each irrep may occur zero or more times).

Each irreducible representation corresponds to a type of symmetry, as shown in table 1. For example, as can be seen in this table, the representations $B1$ and $B2$ represent the 90-degree rotation $r$ as the matrix $\begin{bmatrix}-1\end{bmatrix}$, so the basis filters for these representations change sign when rotated by $r$. It should be noted that in the higher layers $l>0$, elementary basis filters can look different because they depend on the representation $\pi_{l}$ that is being decomposed.

The fact that all representations can be decomposed into a direct sum of irreducibles implies that each representation has a basis-independent type: which irreducible representations appear in it, and with what multiplicity. For example, the input representation $\pi_{0}$ (table 1) has type $(3,0,1,1,2)$. This means that, for instance, $\pi_{0}(r)$ is block-diagonalized as:

Where the block matrix contains $(3,0,1,1,2)$ copies of the irreps $(A1,A2,B1,B2,E)$, evaluated at $r$ (see column $r$ in table 1). The change of basis matrix $A$ is constructed from the basis filters shown in table 1 (and the same $A$ block-diagonalizes $\pi_{0}(g)$ for all $g$).

So the most general way in which we can choose a fiber representation $\rho$ is to choose multiplicities $m_{i}\geq 0$ and a basis matrix $A$. In Section 2.6 we will find that there is an important restriction on this freedom, which alleviates the need to choose a basis. The choice of multiplicities is then the only hyperparameter, analogous to the choice of the number of channels in an ordinary CNN. Indeed, the multiplicities determine the number of channels: $K=\sum_{i}m_{i}\dim\varphi_{i}$.

By choosing the type of $\rho$, we also determine the type of $\pi=\operatorname{Ind}_{H}^{G}\rho$ (restricted to $H$), but what is it? Explicit formulas exist (Reeder (2014); Serre (1977)) but are rather complicated, so we will present a simple computational procedure that can be used to determine the type of any representation. This procedure relies on the character $\chi_{\rho}(g)=\operatorname{Tr}(\rho(g))$ of the representation to be decomposed. The most important fact about characters is that the characters of irreps $\varphi_{i},\varphi_{j}$ are orthogonal:

Furthermore, since the trace of a direct sum equals the sum of the traces (i.e. $\chi_{\rho\oplus\rho'}=\chi_{\rho}+\chi_{\rho'}$), and every representation $\rho$ is a direct sum of irreps, it follows that we can obtain the multiplicity of irrep $\varphi_{i}$ in $\rho$ by computing the inner product with the $i$-th character:

So a simple dot product of characters is all we need to determine the type of a representation.
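The procedure can be sketched numerically for the running example. The code below assumes the usual normalized inner product $\langle\chi,\chi'\rangle=\frac{1}{|H|}\sum_{h\in H}\chi(h)\chi'(h)$ for real characters, realizes $D4$ as $2\times 2$ signed permutation matrices, and takes $\pi_{0}$ to be the permutation action on a single-channel $3\times 3$ patch. The irrep characters are computed from determinant, trace, and whether the matrix is diagonal; which of the two sign representations is labelled $B1$ and which $B2$ is a convention chosen here, not taken from table 1.

```python
import numpy as np

r = np.array([[0, -1], [1, 0]])   # rotation by 90 degrees
m = np.array([[-1, 0], [0, 1]])   # a mirror reflection
# All 8 elements of D4, written as r^k m^j.
GROUP = [np.linalg.matrix_power(r, k) @ np.linalg.matrix_power(m, j)
         for j in (0, 1) for k in range(4)]

def det(g):
    return int(g[0, 0] * g[1, 1] - g[0, 1] * g[1, 0])

def diag_sign(g):
    # +1 for diagonal matrices (e, r^2, axis flips), -1 otherwise.
    return 1 if g[0, 1] == 0 else -1

# Characters of the five irreps, in the order (A1, A2, B1, B2, E).
IRREP_CHARACTERS = [
    lambda g: 1,
    lambda g: det(g),
    lambda g: diag_sign(g),
    lambda g: diag_sign(g) * det(g),
    lambda g: int(np.trace(g)),
]

POINTS = [np.array([x, y]) for x in (-1, 0, 1) for y in (-1, 0, 1)]

def chi_pi0(g):
    # Trace of a permutation matrix = number of fixed points.
    return sum(int(np.array_equal(g @ p, p)) for p in POINTS)

def rep_type(character):
    """Multiplicity of each irrep via the character inner product."""
    return tuple(
        sum(character(g) * chi(g) for g in GROUP) // len(GROUP)
        for chi in IRREP_CHARACTERS
    )

type_pi0 = rep_type(chi_pi0)
print(type_pi0)
```

Under these assumptions the computed type agrees with the type $(3,0,1,1,2)$ quoted for $\pi_{0}$ in the text.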

Steerable CNNs use parameters much more efficiently than ordinary CNNs. In this section we show how the number of parameters required by an equivariant layer is determined by the feature types of the input and output space, and how the efficiency of a choice of feature types may be evaluated.

In section 2.3, we found that a filter bank $\Psi$ is equivariant if and only if it lies in the vector space called $\operatorname{Hom}_{H}(\pi,\rho)$. It follows that the number of parameters for such a filter bank is equal to the dimensionality of this space, $n=\dim\operatorname{Hom}_{H}(\pi,\rho)$. This number is known as the intertwining number of $\pi$ and $\rho$ and plays an important role in the theory of group representations.

As with multiplicities, the intertwining number is easily computed using characters. It can be shown (Reeder, 2014) that the intertwining number equals:

By linearity and the orthogonality of characters, we find that $\dim\operatorname{Hom}_{H}(\pi,\rho)=\sum_{i}m_{i}m_{i}'$, for representations $\pi,\rho$ of type $(m_{1},\ldots,m_{J})$ and $(m_{1}',\ldots,m_{J}')$, respectively. Thus, as far as the number of parameters of a steerable convolution layer is concerned, the only choice we have to make for $\rho$ is its type – a short list of integers $m_{i}$.

The efficiency of a choice of type can be assessed using a quantity we call the parameter utilization:

The numerator equals $s^{2}K\cdot K'$: the number of parameters for a non-equivariant filter bank. The denominator equals the parameter cost of an equivariant filter bank with the same filter size and number of input/output channels. Typical values of $\mu$ in effective architectures are around $|H|$, e.g. $\mu=8$ for $H=D4$. Such a layer utilizes its parameters 8 times more intensively than an ordinary convolution layer.
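A small worked example makes the bookkeeping concrete. The sketch below takes the parameter utilization to be $\mu=\dim\pi\cdot\dim\rho\,/\dim\operatorname{Hom}_{H}(\pi,\rho)$, as described in the text, and evaluates it for a hypothetical first layer: three color channels seen through a $3\times 3$ filter (three copies of type $(3,0,1,1,2)$) mapped to one regular capsule of $D4$, whose type $(1,1,1,1,2)$ lists the irrep dimensions. The function names and this choice of layer are assumptions for illustration.

```python
# Dimensions of the D4 irreps in the order (A1, A2, B1, B2, E).
IRREP_DIMS = (1, 1, 1, 1, 2)

def rep_dim(rep_type):
    """Dimension of a representation from its type (multiplicities)."""
    return sum(mult * d for mult, d in zip(rep_type, IRREP_DIMS))

def intertwining_number(type_pi, type_rho):
    """dim Hom_H(pi, rho) = sum_i m_i * m'_i."""
    return sum(a * b for a, b in zip(type_pi, type_rho))

def parameter_utilization(type_pi, type_rho):
    """Ratio of unconstrained to equivariant parameter counts."""
    unconstrained = rep_dim(type_pi) * rep_dim(type_rho)
    return unconstrained / intertwining_number(type_pi, type_rho)

type_pi = (9, 0, 3, 3, 6)    # 3 color channels x type (3, 0, 1, 1, 2)
type_rho = (1, 1, 1, 1, 2)   # regular representation of D4

n_params = intertwining_number(type_pi, type_rho)
mu = parameter_utilization(type_pi, type_rho)
print(n_params, mu)
```

In this example the unconstrained filter bank has $27\cdot 8=216$ entries while the equivariant one has 27 parameters, giving $\mu=8=|D4|$.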

2.6 Equivariant nonlinearities & capsules

In the previous section we showed that only the basis-independent types of $\pi$ and $\rho$ play a role in determining the parameter cost of an equivariant filter bank. An equivalent representation $\rho'(g)=A\rho(g)A^{-1}$ will have the same type, and hence the same parameter cost as $\rho$. However, when it comes to nonlinearities, different bases behave differently.

Since commutation with nonlinearities depends on the basis, we need a more granular notion than the feature type. We define a $\rho$-capsule as a (typically low-dimensional) feature vector that transforms according to a representation $\rho$ (we may also refer to $\rho$ as the capsule). Thus, while a capsule has a type, not all representations of that type are equivalent as capsules. Given a catalogue of capsules $\rho^{i}$ (for $i=1,\ldots,C$) with multiplicities $m_{i}$, we can construct a fiber as a stack of capsules that is steerable by a block-diagonal representation $\rho$ with $m_{i}$ copies of $\rho^{i}$ on the diagonal.

Like the capsules of Hinton et al. (2011), our capsules encode the pose of a pattern in the input, and consist of a number of units (dimensions) that do not get mixed with the units of other capsules under transformations. In this sense, a stack of capsules is disentangled (Cohen & Welling, 2014).

We have found a few simple types of capsules and corresponding admissible nonlinearities. It is easy to see that any nonlinearity is admissible for $\rho$ when the latter is realized by permutation matrices: permuting a list of coordinates and then applying a nonlinearity is the same as applying the nonlinearity and then permuting. If $\rho$ is realized by a signed permutation matrix, then $\texttt{CReLU}(\alpha)=(\texttt{ReLU}(\alpha),\texttt{ReLU}(-\alpha))$ introduced by Shang et al. (2016), or any concatenated nonlinearity $\nu'(\alpha)=(\nu(\alpha),\nu(-\alpha))$, will be admissible. Any scale-free concatenated nonlinearity such as CReLU is admissible for a representation realized by monomial matrices (having the same nonzero pattern as a permutation matrix). Finally, we can always make a representation of a finite group orthogonal by a suitable choice of basis, which means that we can use any nonlinearity that acts only on the length of the vector.
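The first two admissibility claims can be checked numerically. The sketch below uses an arbitrary permutation and an arbitrary sign pattern chosen for illustration; the matrix `Q` is our construction of the permutation that steers the CReLU output, and is not notation from the paper.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def crelu(x):
    # Concatenated ReLU: positive part followed by negative part.
    return np.concatenate([relu(x), relu(-x)])

rng = np.random.default_rng(0)
x = rng.normal(size=4)

# Permutation matrix: a pointwise nonlinearity commutes with it.
P = np.eye(4)[[2, 0, 3, 1]]
perm_commutes = np.allclose(relu(P @ x), P @ relu(x))

# Signed permutation matrix: plain ReLU no longer commutes ...
signs = np.array([1.0, -1.0, -1.0, 1.0])
S = signs[:, None] * P
relu_commutes_signed = np.allclose(relu(S @ x), S @ relu(x))

# ... but CReLU is steered by a permutation Q of its 2K outputs:
# +1 entries keep a unit in its half, -1 entries swap the halves.
S_plus = (S > 0).astype(float)
S_minus = (S < 0).astype(float)
Q = np.block([[S_plus, S_minus], [S_minus, S_plus]])
crelu_steerable = np.allclose(crelu(S @ x), Q @ crelu(x))

print(perm_commutes, relu_commutes_signed, crelu_steerable)
```

The post-activation features therefore transform by the permutation `Q`, which is why any nonlinearity may follow in the next step.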

For many groups, the irreps can be realized using signed permutation matrices, so we can use irreducible $\varphi_{i}$-capsules with concatenated nonlinearities such as CReLU. Another class of capsules, which we call quotient capsules, are naturally realized by permutation matrices, and are thus compatible with any nonlinearity. These are described in Appendix C.

2.7 Computational efficiency

Modern convolutional networks often use on the order of hundreds of channels $K$ per layer (Zagoruyko & Komodakis, 2016). When using $3\times 3$ filters, a filter bank can have on the order of $9K^{2}\approx 10^{6}$ dimensions. The number of parameters for an equivariant filter bank is about $\mu\approx 10$ times smaller, but a basis for the space of equivariant filter banks would still be about $10^{6}\times 10^{5}$, which is too large to be practical.

Fortunately, the block-diagonal structure of $\pi$ and $\rho$ induces a block structure in $\Psi$. Suppose $\pi=\texttt{block\_diag}(\pi^{1},\ldots,\pi^{P})$ and $\rho=\texttt{block\_diag}(\rho^{1},\ldots,\rho^{Q})$. Then an intertwiner is a matrix of shape $K'\times Ks^{2}$, where $K'=\sum_{i}\dim\rho^{i}$ and $Ks^{2}=\sum_{i}\dim\pi^{i}$. This matrix has the following block structure:

Each block $h_{ij}$ corresponds to an input-output pair of capsules, and can be parameterized by a linear combination of basis matrices $\psi^{ij}_{k}\in\operatorname{Hom}_{H}(\rho^{i},\pi^{j})$.

In practice, we typically use many copies of the same capsule (say $n_{i}$ copies of $\rho^{i}$ and $m_{j}$ copies of $\pi^{j}$). Therefore, many of the blocks $h_{ij}$ can be constructed using the same intertwiner basis. If we order equivalent capsules to be adjacent, the intertwiner consists of “blocks of blocks”. Each superblock $H_{ij}$ has shape $n_{i}\dim\rho^{i}\times m_{j}\dim\pi^{j}$, and consists of subblocks of shape $\dim\rho^{i}\times\dim\pi^{j}$.

The computation graph for an equivariant convolution layer is constructed as follows. Given a catalogue of capsules $\rho^{i}$ and corresponding post-activation capsules $\operatorname{Act}_{\nu}\rho^{i}$, we compute the induced representations $\pi^{i}=\operatorname{Ind}_{H}^{G}\operatorname{Act}_{\nu}\rho^{i}$ and the bases for $\operatorname{Hom}_{H}(\rho^{i},\pi^{j})$ in an offline step. The bases are stored as matrices $\psi^{ij}$ of shape $\dim\rho^{i}\cdot\dim\pi^{j}\times\dim\operatorname{Hom}_{H}(\rho^{i},\pi^{j})$. Then, given a list of input / output multiplicities $n_{i},m_{j}$ for the capsules, a parameter matrix $\Theta^{ij}$ of shape $\dim\operatorname{Hom}_{H}(\rho^{i},\pi^{j})\times n_{i}m_{j}$ is instantiated. The product $\psi^{ij}\Theta^{ij}$ has shape $\dim\rho^{i}\cdot\dim\pi^{j}\times n_{i}m_{j}$, and each superblock $H_{ij}$ is obtained by reshaping this product to shape $n_{i}\dim\rho^{i}\times m_{j}\dim\pi^{j}$. Once all superblocks are filled in, the matrix $\Psi$ is reshaped from $K'\times Ks^{2}$ to $K'\times K\times s\times s$ and convolved with the input.
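The superblock construction can be sketched in a few lines of array code. The axis ordering used in the reshape below is one consistent choice and is an assumption (the text does not specify a memory layout); the basis `psi` is filled with random numbers purely to check shapes and indexing, so it is not an actual intertwiner basis.

```python
import numpy as np

def build_superblock(psi, theta, dim_rho, dim_pi, n, m):
    """Assemble a superblock of shape (n * dim_rho, m * dim_pi).

    psi:   (dim_rho * dim_pi, d) basis matrices, one flattened per column
    theta: (d, n * m) parameters, one column per (output, input) capsule pair
    """
    flat = psi @ theta                          # (dim_rho * dim_pi, n * m)
    blocks = flat.reshape(dim_rho, dim_pi, n, m)
    # Reorder to (copy a, row u, copy b, column v), then merge axes.
    return blocks.transpose(2, 0, 3, 1).reshape(n * dim_rho, m * dim_pi)

# Hypothetical sizes for illustration only.
dim_rho, dim_pi, d, n, m = 2, 9, 3, 4, 5
rng = np.random.default_rng(0)
psi = rng.normal(size=(dim_rho * dim_pi, d))
theta = rng.normal(size=(d, n * m))

H_block = build_superblock(psi, theta, dim_rho, dim_pi, n, m)

# Subblock (a, b) should equal sum_k theta[k, a*m + b] * psi_k.
a, b = 1, 3
expected = (psi @ theta[:, a * m + b]).reshape(dim_rho, dim_pi)
actual = H_block[a * dim_rho:(a + 1) * dim_rho, b * dim_pi:(b + 1) * dim_pi]
print(H_block.shape, np.allclose(actual, expected))
```

Because the basis is shared by all $n_{i}m_{j}$ subblocks, a single matrix multiplication fills the whole superblock.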

2.8 Using steerable CNNs in practice

A full understanding of the theory of steerable CNNs requires some knowledge of group representation theory, but using steerable CNN technology is not much harder than using ordinary CNNs. Instead of choosing a number of channels for a given layer, one chooses a list of multiplicities $m_{i}$ for each capsule in a library of capsules provided by the developer. To preserve equivariance, the activation function applied to a capsule must be chosen from a list of admissible nonlinearities for that capsule (which sometimes includes all nonlinearities). Finally, one must respect the type system and only add identical capsules (e.g. in ResNets). These constraints can all be checked automatically.

Related Work

Steerable filters were first studied for applications in signal processing and low-level vision (Freeman & Adelson, 1991; Greenspan et al., 1994; Simoncelli & Freeman, 1995). More or less explicit connections between steerability and group representation theory have been observed by Lenz (1989); Koenderink & Van Doorn (1990); Teo (1998); Krajsek & Mester (2007). As we have tried to demonstrate in this paper, representation theory is indeed the natural mathematical framework in which to study steerability.

In machine learning, equivariant kernels were studied by Reisert (2008); Skibbe (2013). In the context of neural networks, various authors have studied equivariant representations. Capsules were introduced in Hinton et al. (2011), and significantly improved by Tieleman (2014). A theoretical account of equivariant representation learning in the brain is given by Anselmi et al. (2014). Group equivariant scattering networks were defined and studied by Mallat (2012) for compact groups, and by Sifre & Mallat (2013); Oyallon & Mallat (2015) for the roto-translation group. Jacobsen et al. (2016) describe a network that uses a fixed set of (possibly steerable) basis filters with learned weights. Lenc & Vedaldi (2015) showed empirically that convolutional networks tend to learn equivariant representations, which suggests that equivariance could be a good inductive bias.

Invariant and equivariant CNNs have been studied by Gens & Domingos (2014); Kanazawa et al. (2014); Dieleman et al. (2015; 2016); Cohen & Welling (2016); Marcos et al. (2016). All of these models, as well as scattering networks, implicitly use the regular representation: feature maps are (often implicitly) conceived of as functions on $G$, and the action of $G$ on the space of functions on $G$ is known as the regular representation (Serre (1977), Appendix B). This form of equivariance is a special case of that presented in this paper.

The idea of adding a type system to neural networks has been explored by Olah (2015); Balduzzi & Ghifary (2015). We have shown that a type system emerges naturally from the decomposition of a linear representation of a mathematical structure (a group, in our case) associated with the representation learned by a neural network.

Experiments

We performed experiments on the CIFAR10 dataset (Krizhevsky, 2009) to determine if steerability is a useful inductive bias, and to determine the relative merits of the various types of capsules. In order to run experiments faster, and to see how steerable CNNs perform in the small-data regime, we used only 2000 training samples for our initial experiments.

As a baseline, we used the competitive wide residual networks (ResNets) architecture (He et al., 2015; 2016; Zagoruyko & Komodakis, 2016). We tuned the capacity of this network for the reduced dataset size and settled on a 20-layer architecture (three residual blocks per stage, with two layers each, for three stages with feature maps of size $32\times 32$, $16\times 16$ and $8\times 8$, various widths). We compared the baseline architecture to various kinds of steerable CNN, obtained by replacing the convolution layers by steerable convolution layers. To make sure that differences in performance were not simply due to underfitting or overfitting, we tuned the width (number of channels, $K$) using a validation set. The rest of the training procedure is identical to Cohen & Welling (2016), and is fixed for all of our experiments.

We first tested steerable CNNs that consist entirely of a single kind of capsule. We found that architectures with only one type do not perform very well (roughly 30-40% error, vs. 30% for plain ResNets trained on 2k samples from CIFAR10), except for those that use the regular representation capsule (Appendix C), which outperforms standard CNNs (26.75% error). This is not too surprising, because many capsules are quite restrictive in the spatial patterns they can express. The strong performance of regular capsules is consistent with the results of Cohen & Welling (2016), and can be explained by the fact that the regular representation contains all other (irreducible and quotient) representations as subrepresentations, and can therefore learn arbitrary spatial patterns.

We then created networks that use a mix of the more successful kinds of capsules. After a few preliminary experiments, we settled on a residual network that uses one mix of capsules for the input and output layer of a residual block, and another for the intermediate layer. The first representation consists of quotient capsules: regular, qm, qmr2, qmr3 (see Appendix C) followed by ReLUs. The second consists of irreducible capsules: A1, A2, B1, B2, E(2x) followed by CReLUs. On CIFAR10 with 2k labels, this architecture works better than standard ResNets and regular capsules at 24.48% error.

When tested on CIFAR10 with 4k labels (table 2), the method comes close to state-of-the-art semi-supervised methods, which use additional unlabelled data (Rasmus et al., 2016), and performs better than transfer learning approaches such as DCGAN, which achieves 26.2% error (Radford et al., 2015). When tested on the full CIFAR10 and CIFAR100 datasets, the steerable CNN substantially outperforms the ResNet (He et al., 2016) baseline and achieves state of the art results (improving over wide and dense nets (Zagoruyko & Komodakis, 2016; Huang et al., 2016)).

Conclusion & Future Work

We have presented a theoretical framework for understanding steerable representations in convolutional networks, and have shown that steerability is a useful inductive bias that can improve model accuracy, particularly when little data is available. Our experiments show that a simple steerable architecture achieves state of the art results on CIFAR10 and CIFAR100, outperforming recent architectures such as wide and dense residual networks.

The mathematical connection between representation learning and representation theory that we have established improves our understanding of the inner workings of (equivariant) convolutional networks, revealing the humble CNN as an elegant geometrical computation engine. We expect that this new tool (representation theory), developed over more than a century by mathematicians and physicists, will greatly benefit future investigations in this area.

For concreteness, we have used the group of flips and rotations by multiples of 90 degrees as a running example throughout this paper. This group already has some nontrivial characteristics (such as non-commutativity), but it is still small and discrete. The theory of steerable CNNs, however, readily extends to the continuous setting. Evaluating steerable CNNs for large, continuous and high-dimensional groups is an important piece of future work.

Another direction for future work is learning the feature types, which may be easier in the continuous setting because (for non-compact groups) the irreps live in a continuous space where optimization may be possible. Beyond classification, steerable CNNs are likely to be useful in geometrical tasks such as action recognition, pose and motion estimation, and continuous control tasks.

We kindly thank Kenta Oono, Shuang Wu, Thomas Kipf and the anonymous reviewers for their feedback and suggestions. This research was supported by Facebook, Google and NWO (grant number NAI.14.108).

References

Appendix A: Induction

In this section we will show that a stack of feature maps produced by convolution with an $H$-equivariant filter bank transforms according to the induced representation. That is, we will derive eq. 5, repeated here for convenience:

With this notation, the convolution is defined as:

Although the induced representation can be described in a more general setting, we will use an explicit matrix representation of $G$ to make it easier to check our computations. A general element of $G$ is written as:

Where $R$ is the matrix representation of $r$ (e.g. a $2\times 2$ rotation / reflection matrix), and $T$ is a translation vector. The section we use is:

To keep notation uncluttered, we will write $\pi=\pi_{l}$ and $\rho=\rho_{l+1}$. In full detail, the derivation of the transformation law for the feature space induced by $\rho$ proceeds as follows:

The last line is the result shown in the paper. The justification of each step is:

1. $\pi$ is a homomorphism / group representation.

2. $rr^{-1}$ is the identity, so we can always multiply by it.

3. $\pi$ is a homomorphism / group representation.

4. $\Psi\in\operatorname{Hom}_{H}(\pi,\rho)$ is equivariant to $r\in H$.

5. $\overline{(tr)^{-1}\cdot x}=r^{-1}t^{-1}\overline{x}r$ can be checked by multiplying the matrices / vectors.

The derivation above is somewhat involved and messy, so the reader may prefer to think geometrically (using the figures in the paper) instead of algebraically. This complexity is an artifact of the lack of abstraction in our presentation. The induced representation is really a very natural object to consider (abstractly, it is the “adjoint functor” to the restriction functor). A more abstract treatment of the induced representation can be found in Serre (1977); Mackey (1952); Reeder (2014). A treatment that is close to our own, but more general, is the “alternate description” found on page 49 of Kaniuth & Taylor (2013).

Appendix B: Relation to Group Equivariant CNNs

In this section we show that the recently introduced Group Equivariant Convolutional Networks (G-CNNs, Cohen & Welling (2016)) are a special kind of steerable CNN. Specifically, a G-CNN is a steerable CNN with regular capsules.

This defines a linear representation of $G$ known as the regular representation. It is easy to see that the regular representation is naturally realized by permutation matrices. Furthermore, it is known that the regular representation of $G$ is induced by the regular representation of $H$. The latter is defined in Appendix C, and is what we refer to as “regular capsules” in the paper.

Appendix C: Regular and Quotient Features

Let $H$ be a finite group. A subgroup of $H$ is a subset that is also itself a group (i.e. closed under composition and inverses). The (left) cosets of a subgroup $K$ in $H$ are the sets $hK=\{hk \mid k\in K\}$. The cosets are disjoint and jointly cover the whole group $H$ (i.e. they partition $H$). The set of all cosets of $K$ in $H$ is denoted $H/K$, and is also called the quotient of $H$ by $K$.

The coset space carries a natural left action by $H$. Let $a,b\in H$, then $a\cdot bK=(ab)K$.

The function $f$ attaches a value to every coset. The $H$-action permutes these values, because it permutes the cosets. Hence, $\rho$ can be realized by permutation matrices. For small groups the explicit computations can easily be done by hand, while for large groups this task can be automated.

In this way, we get one permutation representation for each subgroup $K$ of $H$. In particular, for the subgroup $K=\{e\}$ (the trivial subgroup containing only the identity $e$), we have $H/K\cong H$. The representation in the space of functions on $H$ is known as the “regular representation”. Using such regular representations in a steerable CNN is equivalent to using the group convolutions introduced in Cohen & Welling (2016), so steerable CNNs are a strict generalization of G-CNNs. At the other extreme, we take $K=H$, which gives the quotient $H/K\cong\{e\}$, the trivial group, which gives the trivial representation $A1$.
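The automated construction mentioned above can be sketched as follows. The code realizes $D4$ as $2\times 2$ integer matrices, enumerates the left cosets of a subgroup, and builds the permutation matrix of each group element acting by $a\cdot bK=(ab)K$. The helper names and the example subgroup $\{e,m\}$ are assumptions for illustration.

```python
import numpy as np

r = np.array([[0, -1], [1, 0]])   # rotation by 90 degrees
m = np.array([[-1, 0], [0, 1]])   # a mirror reflection
H = [np.linalg.matrix_power(r, k) @ np.linalg.matrix_power(m, j)
     for j in (0, 1) for k in range(4)]

def key(g):
    """Hashable identifier for a group element."""
    return tuple(int(v) for v in g.flatten())

def cosets(group, subgroup):
    """List the distinct left cosets hK as frozensets of element keys."""
    found = []
    for h in group:
        c = frozenset(key(h @ k) for k in subgroup)
        if c not in found:
            found.append(c)
    return found

def quotient_rep(group, subgroup):
    """Permutation matrices of the action a . bK = (ab)K on H/K."""
    cs = cosets(group, subgroup)
    rep = {}
    for a in group:
        P = np.zeros((len(cs), len(cs)), dtype=int)
        for j, c in enumerate(cs):
            image = frozenset(key(a @ np.array(b).reshape(2, 2)) for b in c)
            P[cs.index(image), j] = 1
        rep[key(a)] = P
    return rep

e = np.eye(2, dtype=int)
regular = quotient_rep(H, [e])      # K = {e}: functions on H itself
q_m = quotient_rep(H, [e, m])       # K = {e, m}: a quotient capsule
trivial = quotient_rep(H, H)        # K = H: a single coset

print(regular[key(r)].shape, q_m[key(r)].shape, trivial[key(r)].shape)
```

The dimension of each quotient representation equals the number of cosets, $|H|/|K|$, so the three examples have dimensions 8, 4 and 1.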

For the roto-reflection group $H=D4$, we have the following subgroups and associated quotient features