Steerable CNNs

Taco S. Cohen, Max Welling

Introduction

Much of the recent progress in computer vision can be attributed to the availability of large labelled datasets and deep neural networks capable of absorbing large amounts of information. While many practical problems can now be solved, the requirement for big (labelled) data is a fundamentally unsatisfactory state of affairs. Human beings are able to learn new concepts with very few labels, and mimicking this ability is an important challenge for artificial intelligence research. From an applied perspective, improving the statistical efficiency of deep learning is vital because in many domains (e.g. medical image analysis), acquiring large amounts of labelled data is costly.

To improve the statistical efficiency of machine learning methods, many have sought to learn invariant representations. In deep learning, however, intermediate layers should not be fully invariant, because the relative pose of local features must be preserved for further layers (Cohen & Welling, 2016; Hinton et al., 2011). Thus, one is led to the idea of equivariance: a network is equivariant if the representations it produces transform in a predictable linear manner under transformations of the input. In other words, equivariant networks produce representations that are steerable. Steerability makes it possible to apply filters not just in every position (as in a standard convolution layer), but in every pose, thus allowing for increased parameter sharing.

Previous work has shown that equivariant CNNs yield state of the art results on classification tasks (Cohen & Welling, 2016; Dieleman et al., 2016), even though they only enforce equivariance to small groups of transformations like rotations by multiples of $90$ degrees. Learning representations that are equivariant to larger groups is likely to result in further gains, but the computational cost of current methods scales with the size of the group, making this impractical. In this paper we present a general theory of steerable representations that covers all forms of linear steerability in convolutional networks, thus increasing the flexibility of equivariant CNNs and allowing us to decouple the computational cost from the size of the group, paving the way for future scaling.

We show that any steerable representation is a composition of elementary feature types. Each elementary feature can be steered independently of the others, and captures a distinct characteristic of the input that has an invariant or “objective” meaning. This doctrine of “observer-independent quantities” was put forward by Weyl (1939, ch. 1.4) and is used throughout physics. It has been applied to vision and representation learning by Kanatani (1990) and Cohen (2013).

This type system puts constraints on the network weights and architecture. Specifically, since an equivariant filter bank is required to map given input feature types to given output feature types, the number of parameters required by such a filter bank is reduced. Furthermore, by the same logic that tells us not to add meters to seconds, steerability considerations prevent us from adding features of different types (e.g. for residual learning (He et al., 2015)).

The rest of this paper is organized as follows. The theory of steerable CNNs is introduced in Section 2. Related work is discussed in Section 3, which is followed by classification experiments in Section 4 and a discussion and conclusion in Section 5.

Steerable CNNs

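A transformation $g\in G$ acts on the input space $\mathcal{F}_{0}$ by $[\pi_{0}(g)f](x)=f(g^{-1}x)$ (eq. 1). This says that the pixel at $g^{-1}x$ gets moved to $x$ by the transformation $g\in G$. We note that $\pi_{0}(g)$ is a linear operator.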

An important property of $\pi_{0}$ is that $\pi_{0}(gh)=\pi_{0}(g)\pi_{0}(h)$ . Here, $gh$ means composition of transformations in $G$ , while $\pi_{0}(g)\pi_{0}(h)$ denotes matrix multiplication. A vector space such as $\mathcal{F}_{0}$ equipped with a set of linear operators $\pi_{0}$ satisfying this condition is known as a group representation (or just representation, for short). A lot is known about group representations (Serre, 1977), and we will make extensive use of the theory, explaining the relevant concepts as needed.
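The homomorphism property can be checked numerically for the transformations that fix the origin. The following is a minimal sketch, assuming the roto-reflection group D4 acting on a $3\times 3$ patch; translations are left out, and the helper names `d4_elements`, `pi0` and `is_representation` are illustrative rather than taken from the paper.

```python
import itertools
import numpy as np

def d4_elements():
    """The 8 elements of D4 as 2x2 integer matrices acting on pixel coordinates."""
    r = np.array([[0, -1], [1, 0]])  # rotation by 90 degrees
    m = np.array([[1, 0], [0, -1]])  # a reflection
    return [np.linalg.matrix_power(r, k) @ np.linalg.matrix_power(m, j)
            for j in range(2) for k in range(4)]

def pi0(g, s=3):
    """Permutation matrix of g on an s x s patch: the pixel at x moves to g x."""
    half = s // 2
    coords = list(itertools.product(range(-half, half + 1), repeat=2))
    index = {c: i for i, c in enumerate(coords)}
    P = np.zeros((s * s, s * s), dtype=int)
    for j, c in enumerate(coords):
        moved = tuple(int(v) for v in g @ np.array(c))
        P[index[moved], j] = 1
    return P

def is_representation(elems, rep):
    """Check rep(g h) == rep(g) rep(h) for all pairs of group elements."""
    return all(np.array_equal(rep(g @ h), rep(g) @ rep(h))
               for g in elems for h in elems)

elems = d4_elements()
homomorphism_holds = is_representation(elems, pi0)
```

Here `g @ h` is composition in the group, while `rep(g) @ rep(h)` is multiplication of the $9\times 9$ operators, mirroring the two sides of $\pi_{0}(gh)=\pi_{0}(g)\pi_{0}(h)$.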

2.2 Steerable representations

Let $(\mathcal{F},\pi)$ be a feature space with a group representation and $\Phi:\mathcal{F}\rightarrow\mathcal{F}^{\prime}$ a convolutional network. The feature space $\mathcal{F}^{\prime}$ is said to be (linearly) steerable with respect to $G$ , if for all transformations $g\in G$ , the features $\Phi f$ and $\Phi\pi(g)f$ are related by a linear transformation $\pi^{\prime}(g)$ that does not depend on $f$ . So $\pi^{\prime}(g)$ allows us to “steer” the features in $\mathcal{F}^{\prime}$ without referring to the input in $\mathcal{F}$ from which they were computed.

Combining the definition of steerability (i.e. $\Phi\pi(g)=\pi^{\prime}(g)\Phi$ ) with the fact that $\pi$ is a group representation, we find that $\pi^{\prime}$ must also be a group representation:

That is, $\pi^{\prime}(gh)=\pi^{\prime}(g)\pi^{\prime}(h)$ (at least in the span of the image of $\Phi$ ). Figure 2 gives an illustration.
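The argument can be spelled out in one line. This is a reconstruction from the two facts stated above (the steerability relation $\Phi\pi(g)=\pi^{\prime}(g)\Phi$ and the homomorphism property of $\pi$), not a quotation of the original display:

```latex
\pi'(gh)\,\Phi
  = \Phi\,\pi(gh)
  = \Phi\,\pi(g)\,\pi(h)
  = \pi'(g)\,\Phi\,\pi(h)
  = \pi'(g)\,\pi'(h)\,\Phi .
```

Since both sides are only compared after composing with $\Phi$, the equality $\pi^{\prime}(gh)=\pi^{\prime}(g)\pi^{\prime}(h)$ is established on the span of the image of $\Phi$, which is the caveat made in the text.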

Using this division, we can first construct a filter bank that generates $H$ -steerable fibers, and then show that convolution with such a filter bank produces a feature space that is steerable with respect to the whole group $G$ .

2.3 Equivariant filter banks

We require the filter bank $\Psi$ to be $H$-equivariant, i.e. $\rho(h)\Psi=\Psi\pi(h)$ for all $h\in H$ (eq. 3), for some representation $\rho$ of $H$ that acts on the output fibers (see Figure 3). Note that we only require equivariance with respect to $H$ (which excludes translations) and not $G$, because translations can move patterns into and out of the receptive field of a fiber, making full translation equivariance impossible.

The space of maps satisfying the equivariance constraint is denoted $\operatorname{Hom}_{H}(\pi,\rho)$ , because an equivariant map $\Psi$ is a “homomorphism of group representations”, meaning it respects the structure of the representations. Equivariant maps are also sometimes called intertwiners (Serre, 1977).

Since the equivariance constraint (eq. 3) is linear in $\Psi$ , the space $\operatorname{Hom}_{H}(\pi,\rho)$ of admissible filter banks is a vector space: any linear combination of maps $\Psi,\Psi^{\prime}\in\operatorname{Hom}_{H}(\pi,\rho)$ is again an intertwiner. Hence, given $\pi$ and $\rho$ , we can compute a basis for $\operatorname{Hom}_{H}(\pi,\rho)$ by solving a linear system.

Computation of the intertwiner basis is done offline, before training. Once we have such a basis $\psi_{1},\ldots,\psi_{n}$ for $\operatorname{Hom}_{H}(\pi,\rho)$ , we can express any equivariant filter bank $\Psi$ as a linear combination $\Psi=\sum_{i}\alpha_{i}\psi_{i}$ using parameters $\alpha_{i}$ . As shown in Section 2.7, this can be done efficiently even in high dimensions.
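The linear system can be set up by vectorizing the constraint $\rho(h)\Psi=\Psi\pi(h)$ for a set of generators of $H$ and taking a null space. The sketch below assumes $H$ is D4 acting on a single-channel $3\times 3$ patch by pixel permutations; the function names and the use of an SVD are choices made here for illustration, not the authors' implementation.

```python
import itertools
import numpy as np

def grid_permutation(g, s=3):
    """Permutation matrix of g on an s x s patch (pixel at x moves to g x)."""
    half = s // 2
    coords = list(itertools.product(range(-half, half + 1), repeat=2))
    index = {c: i for i, c in enumerate(coords)}
    P = np.zeros((s * s, s * s))
    for j, c in enumerate(coords):
        moved = tuple(int(v) for v in g @ np.array(c))
        P[index[moved], j] = 1.0
    return P

def intertwiner_basis(pi_gens, rho_gens, tol=1e-10):
    """Basis of {Psi : rho(h) Psi = Psi pi(h)} over the given generators h."""
    m, n = rho_gens[0].shape[0], pi_gens[0].shape[0]
    # Row-major vectorization: vec(A X B) = kron(A, B.T) vec(X).
    blocks = [np.kron(rho, np.eye(n)) - np.kron(np.eye(m), pi.T)
              for pi, rho in zip(pi_gens, rho_gens)]
    system = np.concatenate(blocks, axis=0)
    _, sing, vt = np.linalg.svd(system)
    rank = int(np.sum(sing > tol))
    # Right singular vectors beyond the rank span the null space.
    return [v.reshape(m, n) for v in vt[rank:]]

r = np.array([[0, -1], [1, 0]])   # generator: rotation by 90 degrees
m = np.array([[1, 0], [0, -1]])   # generator: a reflection
pi_gens = [grid_permutation(r), grid_permutation(m)]

trivial = [np.eye(1), np.eye(1)]                     # rho(h) = 1
sign_det = [np.array([[1.0]]), np.array([[-1.0]])]   # rho(h) = det(h)
standard = [r.astype(float), m.astype(float)]        # rho(h) = h itself (2-dimensional)

basis_trivial = intertwiner_basis(pi_gens, trivial)
basis_det = intertwiner_basis(pi_gens, sign_det)
basis_standard = intertwiner_basis(pi_gens, standard)
```

Checking the constraint on generators is enough because it then holds for every product of generators. Any weighted sum of the returned matrices is again an equivariant filter bank, which is the parameterization $\Psi=\sum_{i}\alpha_{i}\psi_{i}$ described above.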

2.4 Induction

We have shown how to parameterize filter banks that intertwine $\pi$ and $\rho$ , making the output fibers $H$ -steerable by $\rho$ if the input space $\mathcal{F}$ is $H$ -steerable by $\pi$ . In this section we show how $H$ -steerability of fibers $F^{\prime}_{x}$ leads to $G$ -steerability of the whole feature space $\mathcal{F}^{\prime}$ . This happens through a natural and important construction known as the induced representation (Mackey, 1952; 1953; 1968; Serre, 1977; Taylor, 1986; Folland, 1995; Kaniuth & Taylor, 2013).

As stated, the correlation $\Psi\star f$ could be computed by translating $f$ before applying $\Psi$ :

We can now calculate the transformation law of the output space. To do so, we apply a translation $t$ and transformation $r\in H$ to $f\in\mathcal{F}$, yielding $\pi(tr)f$, and then perform the correlation with $\Psi$. With some algebra (Appendix A), we find (eq. 5): $\Psi\star[\pi(tr)f](x)=\rho(r)\left[[\Psi\star f]((tr)^{-1}x)\right]$.

If we now define $\pi^{\prime}$ by $[\pi^{\prime}(tr)f](x)=\rho(r)\left[f((tr)^{-1}x)\right]$ (eq. 6), then $\Psi\star\pi(g)f=\pi^{\prime}(g)\Psi\star f$ (see Fig. 4). This representation $\pi^{\prime}$ is known as the representation of $G$ induced by the representation $\rho$ of $H$, and is denoted $\pi^{\prime}=\operatorname{Ind}_{H}^{G}\rho$.

When parsing eq. 6, it is important to keep in mind that (as indicated by the square brackets) $\pi^{\prime}$ acts on the whole feature space $\mathcal{F}^{\prime}$ while $\rho$ acts on individual fibers.

If we compare the induced representation (eq. 6) to the representation $\pi_{0}$ defined in eq. 1, we see that the difference lies only in the presence of a factor $\rho(r)$ applied to the fibers. This factor describes how the feature channels are mixed by the transformation. The color channels in the input space do not get mixed by geometrical transformations, so we say that $\pi_{0}$ is induced from the trivial representation $\rho_{0}(h)=I$ .
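The role of the factor $\rho(r)$ can be seen in a small numerical experiment. The sketch below is an illustration under simplifying assumptions: only the rotation subgroup of D4 is used, $\rho$ is the one-dimensional representation that sends the $90$-degree rotation to $-1$, there is a single input channel, and the correlation is a plain "valid" correlation written out by hand. None of the names come from the paper.

```python
import numpy as np

def correlate_valid(f, psi):
    """Plain 'valid' cross-correlation of a square image with a square filter."""
    s = psi.shape[0]
    n = f.shape[0] - s + 1
    out = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            out[i, j] = np.sum(f[i:i + s, j:j + s] * psi)
    return out

rng = np.random.default_rng(0)
w = rng.standard_normal((3, 3))

# Project a random filter onto the filters that change sign under a 90-degree rotation.
psi = w - np.rot90(w, 1) + np.rot90(w, 2) - np.rot90(w, 3)

f = rng.standard_normal((9, 9))

# Left side: rotate the input, then correlate.
lhs = correlate_valid(np.rot90(f), psi)
# Right side: correlate, then rotate the feature map and apply rho(r) = -1 to the fiber.
rhs = -np.rot90(correlate_valid(f, psi))

equivariant = np.allclose(lhs, rhs)
```

Dropping the minus sign in `rhs` corresponds to steering with $\pi_{0}$ instead of the induced representation, and the equality then fails for a generic input.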

Now that we have a $G$ -steerable feature space $\mathcal{F}^{\prime}$ , we can iterate the procedure by computing a basis for the space of intertwiners between $\pi^{\prime}$ (restricted to $H$ ) and some $\rho^{\prime}$ of our choosing.

2.5 Feature types and character theory

By now, the reader may be wondering how to choose $\rho$ , or indeed what the space of representations that we can choose from looks like in the first place. We will answer these questions in this section by showing that each representation has a type (encoded as a short list of integers) that corresponds to a certain symmetry or invariance of the feature. We further show how the number of parameters of an equivariant filter bank depends on the types of the representations $\pi$ and $\rho^{\prime}$ that it intertwines. Our discussion will make use of a number of important elementary results from group representation theory which are stated but not proved. The reader wishing to go deeper may consult chapters 1 and 2 of the excellent book by Serre (1977).

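Every representation $\rho$ of $H$ can be decomposed into irreducible representations (irreps) $\varphi_{i}$, that is, $\rho(g)=A\left[\bigoplus_{k}\varphi_{i_{k}}(g)\right]A^{-1}$ for some basis matrix $A$, and some $i_{k}$ that index the irreps (each irrep may occur zero or more times).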

Each irreducible representation corresponds to a type of symmetry, as shown in table 1. For example, as can be seen in this table, the representations $B1$ and $B2$ represent the $90$ -degree rotation $r$ as the matrix $\begin{bmatrix}-1\end{bmatrix}$ , so the basis filters for these representations change sign when rotated by $r$ . It should be noted that in the higher layers $l>0$ , elementary basis filters can look different because they depend on the representation $\pi_{l}$ that is being decomposed.

The fact that all representations can be decomposed into a direct sum of irreducibles implies that each representation has a basis-independent type: which irreducible representations appear in it, and with what multiplicity. For example, the input representation $\pi_{0}$ (table 1) has type $(3,0,1,1,2)$ . This means that, for instance, $\pi_{0}(r)$ is block-diagonalized as:

Where the block matrix contains $(3,0,1,1,2)$ copies of the irreps $(A1,A2,B1,B2,E)$ , evaluated at $r$ (see column $r$ in table 1). The change of basis matrix $A$ is constructed from the basis filters shown in table 1 (and the same $A$ block-diagonalizes $\pi_{0}(g)$ for all $g$ ).
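Written out under one possible convention, the block-diagonal form described here looks as follows. The entries for $A1$, $B1$ and $B2$ follow from the text; the $2\times 2$ matrix used for $E(r)$ is the standard rotation matrix, which is an assumption about the basis chosen for $E$, as is the side on which $A$ and $A^{-1}$ appear.

```latex
A^{-1}\,\pi_{0}(r)\,A =
\operatorname{block\_diag}\!\left(
  \underbrace{1,\;1,\;1}_{3\times A1},\;
  \underbrace{-1}_{B1},\;
  \underbrace{-1}_{B2},\;
  \underbrace{\begin{bmatrix} 0 & -1 \\ 1 & 0 \end{bmatrix},\;
              \begin{bmatrix} 0 & -1 \\ 1 & 0 \end{bmatrix}}_{2\times E}
\right)
```

The block sizes add up to $3\cdot 1+0\cdot 1+1\cdot 1+1\cdot 1+2\cdot 2=9$, the dimension of a single-channel $3\times 3$ filter.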

So the most general way in which we can choose a fiber representation $\rho$ is to choose multiplicities $m_{i}\geq 0$ and a basis matrix $A$ . In Section 2.6 we will find that there is an important restriction on this freedom, which alleviates the need to choose a basis. The choice of multiplicities is then the only hyperparameter, analogous to the choice of the number of channels in an ordinary CNN. Indeed, the multiplicities determine the number of channels: $K=\sum_{i}m_{i}\dim\varphi_{i}$ .

By choosing the type of $\rho$ , we also determine the type of $\pi=\operatorname{Ind}_{H}^{G}\rho$ (restricted to $H$ ), but what is it? Explicit formulas exist (Reeder (2014); Serre (1977)) but are rather complicated, so we will present a simple computational procedure that can be used to determine the type of any representation. This procedure relies on the character $\chi_{\rho}(g)=\operatorname{Tr}(\rho(g))$ of the representation to be decomposed. The most important fact about characters is that the characters of irreps $\varphi_{i},\varphi_{j}$ are orthogonal:

Furthermore, since the trace of a direct sum equals the sum of the traces (i.e. $\chi_{\rho\oplus\rho^{\prime}}=\chi_{\rho}+\chi_{\rho^{\prime}}$ ), and every representation $\rho$ is a direct sum of irreps, it follows that we can obtain the multiplicity of irrep $\varphi_{i}$ in $\rho$ by computing the inner product with the $i$ -th character:
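In standard notation, the two facts used here read as follows. This is the textbook form for real-valued characters, which suffices for $D4$; for complex characters one factor in the sum is complex-conjugated.

```latex
\langle \chi_{\varphi_i}, \chi_{\varphi_j} \rangle
  \;\equiv\; \frac{1}{|H|} \sum_{h \in H} \chi_{\varphi_i}(h)\, \chi_{\varphi_j}(h)
  \;=\; \delta_{ij},
\qquad
m_i \;=\; \langle \chi_{\rho}, \chi_{\varphi_i} \rangle
     \;=\; \frac{1}{|H|} \sum_{h \in H} \chi_{\rho}(h)\, \chi_{\varphi_i}(h).
```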

So a simple dot product of characters is all we need to determine the type of a representation.
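As a worked instance, the type $(3,0,1,1,2)$ quoted above for $\pi_{0}$ can be recovered from characters alone. The sketch below assumes $H=D4$ acting on a single-channel $3\times 3$ patch, so that the character of $\pi_{0}$ at $g$ is the number of pixels fixed by $g$. The helper names are illustrative, and which of the two sign representations is called $B1$ is a labelling convention.

```python
import itertools
import numpy as np

def d4_elements():
    """The 8 elements of D4 as 2x2 integer matrices."""
    r = np.array([[0, -1], [1, 0]])  # rotation by 90 degrees
    m = np.array([[1, 0], [0, -1]])  # a reflection
    return [np.linalg.matrix_power(r, k) @ np.linalg.matrix_power(m, j)
            for j in range(2) for k in range(4)]

def irrep_characters(g):
    """Characters of (A1, A2, B1, B2, E) evaluated at g."""
    det = int(round(np.linalg.det(g)))
    diag = 1 if g[0, 0] != 0 else -1   # +1 on {e, r^2, axis flips}, -1 otherwise
    return np.array([1, det, diag, det * diag, int(np.trace(g))])

def patch_character(g, s=3):
    """Trace of the pixel-permutation matrix of g: the number of fixed pixels."""
    half = s // 2
    coords = itertools.product(range(-half, half + 1), repeat=2)
    return sum(int(np.array_equal(g @ np.array(c), np.array(c))) for c in coords)

def feature_type(character, elems):
    """Multiplicities m_i = <chi_rho, chi_i>, one inner product per irrep."""
    total = sum(character(g) * irrep_characters(g) for g in elems)
    return (total // len(elems)).tolist()

elems = d4_elements()
type_pi0 = feature_type(patch_character, elems)
```

With these definitions `type_pi0` evaluates to `[3, 0, 1, 1, 2]`, in agreement with the type stated in the text.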

Steerable CNNs use parameters much more efficiently than ordinary CNNs. In this section we show how the number of parameters required by an equivariant layer is determined by the feature types of the input and output space, and how the efficiency of a choice of feature types may be evaluated.

In section 2.3, we found that a filter bank $\Psi$ is equivariant if and only if it lies in the vector space called $\operatorname{Hom}_{H}(\pi,\rho)$ . It follows that the number of parameters for such a filter bank is equal to the dimensionality of this space, $n=\dim\operatorname{Hom}_{H}(\pi,\rho)$ . This number is known as the intertwining number of $\pi$ and $\rho$ and plays an important role in the theory of group representations.

As with multiplicities, the intertwining number is easily computed using characters. It can be shown (Reeder, 2014) that the intertwining number equals:

By linearity and the orthogonality of characters, we find that $\dim\operatorname{Hom}_{H}(\pi,\rho)=\sum_{i}m_{i}m_{i}^{\prime}$ , for representations $\pi,\rho$ of type $(m_{1},\ldots,m_{J})$ and $(m_{1}^{\prime},\ldots,m_{J}^{\prime})$ , respectively. Thus, as far as the number of parameters of a steerable convolution layer is concerned, the only choice we have to make for $\rho$ is its type – a short list of integers $m_{i}$ .
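The step "by linearity and orthogonality" can be written out as follows. This is a sketch in the notation of the text, with $\langle\cdot,\cdot\rangle$ the normalized inner product of characters over $H$ and with the irreps assumed to be absolutely irreducible, as they are for $D4$:

```latex
\dim \operatorname{Hom}_{H}(\pi, \rho)
  = \langle \chi_{\pi}, \chi_{\rho} \rangle
  = \Big\langle \sum_{i} m_i\, \chi_{\varphi_i},\; \sum_{j} m'_j\, \chi_{\varphi_j} \Big\rangle
  = \sum_{i,j} m_i\, m'_j\, \delta_{ij}
  = \sum_{i} m_i\, m'_i .
```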

The efficiency of a choice of type can be assessed using a quantity we call the parameter utilization:

The numerator equals $s^{2}K\cdot K^{\prime}$ : the number of parameters for a non-equivariant filter bank. The denominator equals the parameter cost of an equivariant filter bank with the same filter size and number of input/output channels. Typical values of $\mu$ in effective architectures are around $|H|$ , e.g. $\mu=8$ for $H=D4$ . Such a layer utilizes its parameters $8$ times more intensively than an ordinary convolution layer.
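From the description above, the parameter utilization is $\mu=\dim\pi\cdot\dim\rho\,/\,\dim\operatorname{Hom}_{H}(\pi,\rho)$, and it can be computed from the two types alone. The sketch below assumes $H=D4$ with irreps ordered $(A1,A2,B1,B2,E)$. The input type is an assumption made for the example: one regular capsule seen through a $3\times 3$ window, whose character is $72$ at the identity and $0$ elsewhere, that is, nine copies of the regular representation.

```python
import numpy as np

# Dimensions of the D4 irreps in the order (A1, A2, B1, B2, E).
IRREP_DIMS = np.array([1, 1, 1, 1, 2])

def dim_of_type(feature_type):
    """Dimension of a representation: sum_i m_i * dim(phi_i)."""
    return int(np.dot(feature_type, IRREP_DIMS))

def intertwining_number(type_in, type_out):
    """dim Hom_H(pi, rho) = sum_i m_i * m'_i."""
    return int(np.dot(type_in, type_out))

def parameter_utilization(type_in, type_out):
    """mu = (dim pi * dim rho) / dim Hom_H(pi, rho)."""
    dense = dim_of_type(type_in) * dim_of_type(type_out)
    return dense / intertwining_number(type_in, type_out)

regular = np.array([1, 1, 1, 1, 2])   # each irrep appears dim(phi_i) times
type_in = 9 * regular                 # assumed: one regular capsule on a 3x3 window
type_out = regular                    # one regular output capsule

mu = parameter_utilization(type_in, type_out)
```

In this example the dense filter bank has $72\cdot 8=576$ parameters and the equivariant one has $72$, giving $\mu=8=|D4|$.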

2.6 Equivariant nonlinearities & capsules

In the previous section we showed that only the basis-independent types of $\pi$ and $\rho$ play a role in determining the parameter cost of an equivariant filter bank. An equivalent representation $\rho^{\prime}(g)=A\rho(g)A^{-1}$ will have the same type, and hence the same parameter cost as $\rho$ . However, when it comes to nonlinearities, different bases behave differently.

Since commutation with nonlinearities depends on the basis, we need a more granular notion than the feature type. We define a $\rho$ -capsule as a (typically low-dimensional) feature vector that transforms according to a representation $\rho$ (we may also refer to $\rho$ as the capsule). Thus, while a capsule has a type, not all representations of that type are equivalent as capsules. Given a catalogue of capsules $\rho^{i}$ (for $i=1,\ldots,C$ ) with multiplicities $m_{i}$ , we can construct a fiber as a stack of capsules that is steerable by a block-diagonal representation $\rho$ with $m_{i}$ copies of $\rho^{i}$ on the diagonal.

Like the capsules of Hinton et al. (2011), our capsules encode the pose of a pattern in the input, and consist of a number of units (dimensions) that do not get mixed with the units of other capsules under transformations. In this sense, a stack of capsules is disentangled (Cohen & Welling, 2014).

We have found a few simple types of capsules and corresponding admissible nonlinearities. It is easy to see that any nonlinearity is admissible for $\rho$ when the latter is realized by permutation matrices: permuting a list of coordinates and then applying a nonlinearity is the same as applying the nonlinearity and then permuting. If $\rho$ is realized by a signed permutation matrix, then $\verb+CReLU+(\alpha)=(\verb+ReLU+(\alpha),\verb+ReLU+(-\alpha))$ introduced by Shang et al. (2016), or any concatenated nonlinearity $\nu^{\prime}(\alpha)=(\nu(\alpha),\nu(-\alpha))$ , will be admissible. Any scale-free concatenated nonlinearity such as CReLU is admissible for a representation realized by monomial matrices (having the same nonzero pattern as a permutation matrix). Finally, we can always make a representation of a finite group orthogonal by a suitable choice of basis, which means that we can use any nonlinearity that acts only on the length of the vector.
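The first two admissibility claims are easy to verify numerically. The following sketch uses an arbitrary $3\times 3$ permutation matrix `P` and signed permutation matrix `S` chosen for illustration; the helper `crelu_output_representation`, which builds the permutation that acts on the CReLU output, is a construction made for this example and not a function from the paper.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def crelu(x):
    """Concatenated ReLU: (ReLU(x), ReLU(-x))."""
    return np.concatenate([relu(x), relu(-x)])

def crelu_output_representation(S):
    """Permutation acting on CReLU(x) when x transforms by the signed permutation S."""
    pos, neg = np.maximum(S, 0.0), np.maximum(-S, 0.0)
    return np.block([[pos, neg], [neg, pos]])

rng = np.random.default_rng(0)
x = rng.standard_normal(3)

P = np.array([[0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [1.0, 0.0, 0.0]])    # permutation
S = np.array([[0.0, -1.0, 0.0], [0.0, 0.0, 1.0], [-1.0, 0.0, 0.0]])  # signed permutation

# An elementwise nonlinearity commutes with a permutation of coordinates.
relu_commutes_with_P = np.allclose(relu(P @ x), P @ relu(x))
# It does not commute with sign flips in general.
relu_commutes_with_S = np.allclose(relu(S @ x), S @ relu(x))
# CReLU is equivariant: its output transforms by a plain permutation of 2n units.
crelu_is_equivariant = np.allclose(
    crelu(S @ x), crelu_output_representation(S) @ crelu(x))
```

A sign flip of a coordinate swaps its positive and negative CReLU halves, so the post-activation capsule has twice the dimension and is realized by permutation matrices.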

For many groups, the irreps can be realized using signed permutation matrices, so we can use irreducible $\varphi_{i}$ -capsules with concatenated nonlinearities such as CReLU. Another class of capsules, which we call quotient capsules, are naturally realized by permutation matrices, and are thus compatible with any nonlinearity. These are described in Appendix C.

2.7 Computational efficiency

Modern convolutional networks often use on the order of hundreds of channels $K$ per layer (Zagoruyko & Komodakis, 2016). When using $3\times 3$ filters, a filter bank can have on the order of $9K^{2}\approx 10^{6}$ dimensions. The number of parameters for an equivariant filter bank is about $\mu\approx 10$ times smaller, but a basis for the space of equivariant filter banks would still be about $10^{6}\times 10^{5}$, which is too large to be practical.

Fortunately, the block-diagonal structure of $\pi$ and $\rho$ induces a block structure in $\Psi$ . Suppose $\pi=\verb+block_diag+(\pi^{1},\ldots,\pi^{P})$ and $\rho=\verb+block_diag+(\rho^{1},\ldots,\rho^{Q})$ . Then an intertwiner is a matrix of shape $K^{\prime}\times Ks^{2}$ , where $K^{\prime}=\sum_{i}\dim\rho^{i}$ and $Ks^{2}=\sum_{i}\dim\pi^{i}$ . This matrix has the following block structure:

Each block $h_{ij}$ corresponds to an input-output pair of capsules, and can be parameterized by a linear combination of basis matrices $\psi^{ij}_{k}\in\operatorname{Hom}_{H}(\rho^{i},\pi^{j})$ .

In practice, we typically use many copies of the same capsule (say $n_{i}$ copies of $\rho^{i}$ and $m_{j}$ copies of $\pi^{j}$ ). Therefore, many of the blocks $h_{ij}$ can be constructed using the same intertwiner basis. If we order equivalent capsules to be adjacent, the intertwiner consists of “blocks of blocks”. Each superblock $H_{ij}$ has shape $n_{i}\dim\rho^{i}\times m_{j}\dim\pi^{j}$ , and consists of subblocks of shape $\dim\rho^{i}\times\dim\pi^{j}$ .

The computation graph for an equivariant convolution layer is constructed as follows. Given a catalogue of capsules $\rho^{i}$ and corresponding post-activation capsules $\operatorname{Act}_{\nu}\rho^{i}$, we compute the induced representations $\pi^{i}=\operatorname{Ind}_{H}^{G}\operatorname{Act}_{\nu}\rho^{i}$ and the bases for $\operatorname{Hom}_{H}(\rho^{i},\pi^{j})$ in an offline step. The bases are stored as matrices $\psi^{ij}$ of shape $\dim\rho^{i}\cdot\dim\pi^{j}\times\dim\operatorname{Hom}_{H}(\rho^{i},\pi^{j})$. Then, given a list of input / output multiplicities $n_{i},m_{j}$ for the capsules, a parameter matrix $\Theta^{ij}$ of shape $\dim\operatorname{Hom}_{H}(\rho^{i},\pi^{j})\times n_{i}m_{j}$ is instantiated. The superblocks $H_{ij}$ are obtained by a matrix multiplication $\psi^{ij}\Theta^{ij}$, which has shape $\dim\rho^{i}\cdot\dim\pi^{j}\times n_{i}m_{j}$, followed by a reshape to the superblock shape $n_{i}\dim\rho^{i}\times m_{j}\dim\pi^{j}$. Once all superblocks are filled in, the matrix $\Psi$ is reshaped from $K^{\prime}\times Ks^{2}$ to $K^{\prime}\times K\times s\times s$ and convolved with the input.
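The matrix multiplication and reshape for a single superblock can be sketched as follows. The basis `psi` here is a random placeholder standing in for a precomputed intertwiner basis, the sizes are arbitrary, and the function name `assemble_superblock` and the axis ordering inside the reshape are assumptions made for this example.

```python
import numpy as np

def assemble_superblock(psi, theta, d_out, d_in, n_out, n_in):
    """Build a superblock of shape (n_out*d_out, n_in*d_in) from basis and parameters.

    psi:   (d_out*d_in, n_basis)  flattened basis matrices, one per column
    theta: (n_basis, n_out*n_in)  one coefficient vector per pair of capsule copies
    """
    coeffs = psi @ theta                                   # (d_out*d_in, n_out*n_in)
    blocks = coeffs.reshape(d_out, d_in, n_out, n_in)
    # Move the copy indices outward so each (d_out, d_in) subblock stays contiguous.
    return blocks.transpose(2, 0, 3, 1).reshape(n_out * d_out, n_in * d_in)

rng = np.random.default_rng(0)
d_out, d_in, n_basis, n_out, n_in = 2, 9, 3, 4, 5
psi = rng.standard_normal((d_out * d_in, n_basis))      # placeholder basis
theta = rng.standard_normal((n_basis, n_out * n_in))    # learnable parameters

H_ij = assemble_superblock(psi, theta, d_out, d_in, n_out, n_in)
```

Each subblock of `H_ij` is a linear combination of the same basis matrices with its own column of `theta`, so the number of learnable parameters is `n_basis * n_out * n_in` rather than the number of entries in the superblock.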

2.8 Using steerable CNNs in practice

A full understanding of the theory of steerable CNNs requires some knowledge of group representation theory, but using steerable CNN technology is not much harder than using ordinary CNNs. Instead of choosing a number of channels for a given layer, one chooses a list of multiplicities $m_{i}$ for each capsule in a library of capsules provided by the developer. To preserve equivariance, the activation function applied to a capsule must be chosen from a list of admissible nonlinearities for that capsule (which sometimes includes all nonlinearities). Finally, one must respect the type system and only add identical capsules (e.g. in ResNets). These constraints can all be checked automatically.

Related Work

Steerable filters were first studied for applications in signal processing and low-level vision (Freeman & Adelson, 1991; Greenspan et al., 1994; Simoncelli & Freeman, 1995). More or less explicit connections between steerability and group representation theory have been observed by Lenz (1989); Koenderink & Van Doorn (1990); Teo (1998); Krajsek & Mester (2007). As we have tried to demonstrate in this paper, representation theory is indeed the natural mathematical framework in which to study steerability.

In machine learning, equivariant kernels were studied by Reisert (2008); Skibbe (2013). In the context of neural networks, various authors have studied equivariant representations. Capsules were introduced in Hinton et al. (2011), and significantly improved by Tieleman (2014). A theoretical account of equivariant representation learning in the brain is given by Anselmi et al. (2014). Group equivariant scattering networks were defined and studied by Mallat (2012) for compact groups, and by Sifre & Mallat (2013); Oyallon & Mallat (2015) for the roto-translation group. Jacobsen et al. (2016) describe a network that uses a fixed set of (possibly steerable) basis filters with learned weights. Lenc & Vedaldi (2015) showed empirically that convolutional networks tend to learn equivariant representations, which suggests that equivariance could be a good inductive bias.

Invariant and equivariant CNNs have been studied by Gens & Domingos (2014); Kanazawa et al. (2014); Dieleman et al. (2015; 2016); Cohen & Welling (2016); Marcos et al. (2016). All of these models, as well as scattering networks, implicitly use the regular representation: feature maps are (often implicitly) conceived of as functions on $G$ , and the action of $G$ on the space of functions on $G$ is known as the regular representation (Serre (1977), Appendix B). This form of equivariance is a special case of that presented in this paper.

The idea of adding a type system to neural networks has been explored by Olah (2015); Balduzzi & Ghifary (2015). We have shown that a type system emerges naturally from the decomposition of a linear representation of a mathematical structure (a group, in our case) associated with the representation learned by a neural network.

Experiments

We performed experiments on the CIFAR10 dataset (Krizhevsky, 2009) to determine if steerability is a useful inductive bias, and to determine the relative merits of the various types of capsules. In order to run experiments faster, and to see how steerable CNNs perform in the small-data regime, we used only 2000 training samples for our initial experiments.

As a baseline, we used the competitive wide residual networks (ResNets) architecture (He et al., 2015; 2016; Zagoruyko & Komodakis, 2016). We tuned the capacity of this network for the reduced dataset size and settled on a $20$ layer architecture (three residual blocks per stage, with two layers each, for three stages with feature maps of size $32\times 32$ , $16\times 16$ and $8\times 8$ , various widths). We compared the baseline architecture to various kinds of steerable CNN, obtained by replacing the convolution layers by steerable convolution layers. To make sure that differences in performance were not simply due to underfitting or overfitting, we tuned the width (number of channels, $K$ ) using a validation set. The rest of the training procedure is identical to Cohen & Welling (2016), and is fixed for all of our experiments.

We first tested steerable CNNs that consist entirely of a single kind of capsule. We found that architectures with only one type do not perform very well (roughly $30$ - $40\%$ error, vs. $30\%$ for plain ResNets trained on 2k samples from CIFAR10), except for those that use the regular representation capsule (Appendix C), which outperforms standard CNNs ( $26.75\%$ error). This is not too surprising, because many capsules are quite restrictive in the spatial patterns they can express. The strong performance of regular capsules is consistent with the results of Cohen & Welling (2016), and can be explained by the fact that the regular representation contains all other (irreducible and quotient) representations as subrepresentations, and can therefore learn arbitrary spatial patterns.

We then created networks that use a mix of the more successful kinds of capsules. After a few preliminary experiments, we settled on a residual network that uses one mix of capsules for the input and output layer of a residual block, and another for the intermediate layer. The first representation consists of quotient capsules: regular, qm, qmr2, qmr3 (see Appendix C) followed by ReLUs. The second consists of irreducible capsules: A1, A2, B1, B2, E(2x) followed by CReLUs. On CIFAR10 with 2k labels, this architecture works better than standard ResNets and regular capsules at $24.48\%$ error.

When tested on CIFAR10 with 4k labels (table 2), the method comes close to the state of the art in semi-supervised methods, which use additional unlabelled data (Rasmus et al., 2016), and performs better than transfer learning approaches such as DCGAN, which achieves $26.2\%$ error (Radford et al., 2015). When tested on the full CIFAR10 and CIFAR100 datasets, the steerable CNN substantially outperforms the ResNet (He et al., 2016) baseline and achieves state of the art results (improving over wide and dense nets (Zagoruyko & Komodakis, 2016; Huang et al., 2016)).

Conclusion & Future Work

We have presented a theoretical framework for understanding steerable representations in convolutional networks, and have shown that steerability is a useful inductive bias that can improve model accuracy, particularly when little data is available. Our experiments show that a simple steerable architecture achieves state of the art results on CIFAR10 and CIFAR100, outperforming recent architectures such as wide and dense residual networks.

The mathematical connection between representation learning and representation theory that we have established improves our understanding of the inner workings of (equivariant) convolutional networks, revealing the humble CNN as an elegant geometrical computation engine. We expect that this new tool (representation theory), developed over more than a century by mathematicians and physicists, will greatly benefit future investigations in this area.

For concreteness, we have used the group of flips and rotations by multiples of $90$ degrees as a running example throughout this paper. This group already has some nontrivial characteristics (such as non-commutativity), but it is still small and discrete. The theory of steerable CNNs, however, readily extends to the continuous setting. Evaluating steerable CNNs for large, continuous and high-dimensional groups is an important piece of future work.

Another direction for future work is learning the feature types, which may be easier in the continuous setting because (for non-compact groups) the irreps live in a continuous space where optimization may be possible. Beyond classification, steerable CNNs are likely to be useful in geometrical tasks such as action recognition, pose and motion estimation, and continuous control tasks.

We kindly thank Kenta Oono, Shuang Wu, Thomas Kipf and the anonymous reviewers for their feedback and suggestions. This research was supported by Facebook, Google and NWO (grant number NAI.14.108).

References

Appendix A: Induction

In this section we will show that a stack of feature maps produced by convolution with an $H$ -equivariant filter bank transforms according to the induced representation. That is, we will derive eq. 5, repeated here for convenience:

With this notation, the convolution is defined as:

Although the induced representation can be described in a more general setting, we will use an explicit matrix representation of $G$ to make it easier to check our computations. A general element of $G$ is written as:

Where $R$ is the matrix representation of $r$ (e.g. a $2\times 2$ rotation / reflection matrix), and $T$ is a translation vector. The section we use is:

To keep notation uncluttered, we will write $\pi=\pi_{l}$ and $\rho=\rho_{l+1}$ . In full detail, the derivation of the transformation law for the feature space induced by $\rho$ proceeds as follows:

The last line is the result shown in the paper. The justification of each step is:

1. $\pi$ is a homomorphism / group representation.

2. $rr^{-1}$ is the identity, so we can always multiply by it.

3. $\pi$ is a homomorphism / group representation.

4. $\Psi\in\operatorname{Hom}_{H}(\pi,\rho)$ is equivariant to $r\in H$.

5. $\overline{(tr)^{-1}\cdot x}=r^{-1}t^{-1}\overline{x}r$ can be checked by multiplying the matrices / vectors.

The derivation above is somewhat involved and messy, so the reader may prefer to think geometrically (using the figures in the paper) instead of algebraically. This complexity is an artifact of the lack of abstraction in our presentation. The induced representation is really a very natural object to consider (abstractly, it is the “adjoint functor” to the restriction functor). A more abstract treatment of the induced representation can be found in Serre (1977); Mackey (1952); Reeder (2014). A treatment that is close to our own, but more general, is the “alternate description” found on page 49 of Kaniuth & Taylor (2013).

Appendix B: Relation to Group Equivariant CNNs

In this section we show that the recently introduced Group Equivariant Convolutional Networks (G-CNNs, Cohen & Welling (2016)) are a special kind of steerable CNN. Specifically, a G-CNN is a steerable CNN with regular capsules.

This defines a linear representation of $G$ known as the regular representation. It is easy to see that the regular representation is naturally realized by permutation matrices. Furthermore, it is known that the regular representation of $G$ is induced by the regular representation of $H$ . The latter is defined in Appendix C, and is what we refer to as “regular capsules” in the paper.

Appendix C: Regular and Quotient Features

Let $H$ be a finite group. A subgroup of $H$ is a subset that is also itself a group (i.e. closed under composition and inverses). The (left) cosets of a subgroup $K$ in $H$ are the sets $hK=\{hk|k\in K\}$. The cosets are disjoint and jointly cover the whole group $H$ (i.e. they partition $H$). The set of all cosets of $K$ in $H$ is denoted $H/K$, and is also called the quotient of $H$ by $K$.

The coset space carries a natural left action by $H$. Let $a,b\in H$, then $a\cdot bK=(ab)K$.

The function $f$ attaches a value to every coset. The $H$ -action permutes these values, because it permutes the cosets. Hence, $\rho$ can be realized by permutation matrices. For small groups the explicit computations can easily be done by hand, while for large groups this task can be automated.

In this way, we get one permutation representation for each subgroup $K$ of $H$ . In particular, for the subgroup $K=\{e\}$ (the trivial subgroup containing only the identity $e$ ), we have $H/K\cong H$ . The representation in the space of functions on $H$ is known as the “regular representation”. Using such regular representations in a steerable CNN is equivalent to using the group convolutions introduced in Cohen & Welling (2016), so steerable CNNs are a strict generalization of G-CNNs. At the other extreme, we take $K=H$ , which gives the quotient $H/K\cong\{e\}$ , the trivial group, which gives the trivial representation $A1$ .
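The construction of the permutation matrices from cosets can be automated along the following lines. This is a minimal sketch for $H=D4$ represented by $2\times 2$ integer matrices; the subgroup $K=\{e,m\}$ generated by one reflection is picked as an example, and all function names are illustrative.

```python
import numpy as np

def d4_elements():
    """The 8 elements of D4 as 2x2 integer matrices."""
    r = np.array([[0, -1], [1, 0]])  # rotation by 90 degrees
    m = np.array([[1, 0], [0, -1]])  # a reflection
    return [np.linalg.matrix_power(r, k) @ np.linalg.matrix_power(m, j)
            for j in range(2) for k in range(4)]

def key(g):
    """Hashable label for a group element."""
    return tuple(int(v) for v in g.flatten())

def left_cosets(elems, subgroup):
    """Partition the group into left cosets hK, stored as frozensets of labels."""
    cosets = []
    for h in elems:
        coset = frozenset(key(h @ k) for k in subgroup)
        if coset not in cosets:
            cosets.append(coset)
    return cosets

def quotient_representation(g, elems, cosets):
    """Permutation matrix of g acting on cosets by a . bK = (ab)K."""
    lookup = {key(h): h for h in elems}
    P = np.zeros((len(cosets), len(cosets)), dtype=int)
    for j, coset in enumerate(cosets):
        b = lookup[next(iter(coset))]        # any representative of the coset
        target = key(g @ b)
        i = next(i for i, c in enumerate(cosets) if target in c)
        P[i, j] = 1
    return P

elems = d4_elements()
e = np.eye(2, dtype=int)
m = np.array([[1, 0], [0, -1]])

cosets_m = left_cosets(elems, [e, m])        # K = {e, m}: four cosets
cosets_regular = left_cosets(elems, [e])     # K = {e}: the regular representation
cosets_trivial = left_cosets(elems, elems)   # K = H: the trivial representation
```

The dimension of each quotient representation is the number of cosets, $|H|/|K|$, which gives $4$, $8$ and $1$ for the three subgroups used here.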

For the roto-reflection group $H=D4$ , we have the following subgroups and associated quotient features