CayleyNets: Graph Convolutional Neural Networks with Complex Rational Spectral Filters

Ron Levie, Federico Monti, Xavier Bresson, Michael M. Bronstein

Introduction

In many domains, one has to deal with large-scale data with underlying non-Euclidean structure. Prominent examples of such data are social networks, genetic regulatory networks, functional networks of the brain, and 3D shapes represented as discrete manifolds. The recent success of deep neural networks and, in particular, convolutional neural networks (CNNs) have raised the interest in geometric deep learning techniques trying to extend these models to data residing on graphs and manifolds. In this paper we focus on spectral graph CNNs. Geometric deep learning approaches, and specifically spectral graph CNNs, have been successfully applied to computer graphics and vision , brain imaging , and drug design problems, to mention a few. For a comprehensive presentation of methods and applications of deep learning on graphs and manifolds, we refer the reader to the review paper .

2 Main contribution

In this paper, we construct graph CNNs employing an efficient spectral filtering scheme based on the new class of Cayley polynomials that enjoys similar advantages of the Chebyshev filters such as localization and linear complexity in the number of edges. The main advantage of our filters over is their ability to detect narrow frequency bands of importance during training, and to specialize on them while being well-localized on the graph. We demonstrate experimentally that this affords our method greater flexibility, making it perform better than ChebNets on a broad range of graph learning problems.

3 Notation

Spectral techniques for deep learning on graphs

2 Spectral CNNs

used the spectral definition of convolution to generalize CNNs on graphs, with a spectral convolutional layer of the form

This framework has several major drawbacks. First, the computation of the forward and inverse graph Fourier transforms incur expensive $\mathcal{O}(n^{2})$ multiplication by the matrices $\boldsymbol{\Phi},\boldsymbol{\Phi}^{\top}$ , as there is no FFT-like algorithms on general graphs.

It is noteworthy to mention alternative functional calculus driven approaches to define convolution. In filters are defined as functions of the adjacency matrix $g(\mathbf{W})$ , and in the problem of ordering the eigenvalues of $\mathbf{W}$ according to a natural notion of frequency was addressed.

3 ChebNet

used polynomial filters represented in the Chebyshev basis

A key disadvantage of Chebyshev filters is the fact that using polynomials makes it hard to produce narrow-band filters, as such filters require very high order $r$ , and produce unwanted non-local filters. This deficiency is especially pronounced when the Laplacian has clusters of eigenvalues concentrated around a few frequencies with large spectral gap (Figure 3, second to the last). Indeed, in ChebNets the Laplacian eigenvalues are contracted to the band $, and the clusters of eigenvalues become very concentrated. Now, the frequency response of the Chebyshev filter is a polynomial in$ , which is unable to separate the individual eigenvalues in the clusters due to an uncertainty principle. Such a behavior is characteristic of graphs with community structures, which is very common in many real-world graphs, for instance, social networks.

Let us explain the above phenomenon more accurately. Recall that Chebyshev polynomials are given by

and form an orthonormal basis of the weighted Lebesgue space $L_{2}\big{(},\frac{1}{\sqrt{1-x^{2}}}dx\big{)}$ . When the variable is changed via $x=\cos(\theta)$ , the Chebyshev basis maps to the cosine basis in $[0,\pi]$ , and the space $L_{2}\big{(},\frac{1}{\sqrt{1-x^{2}}}dx\big{)}$ maps to the space $L_{2}(0,\pi)$ with the standard Lebesgue measure. Now, suppose that we want to represent a band-pass filter on the narrow band $[\cos(b),\cos(a)]\subset$ , with small $\cos(b-a)$ . Under the change of variable, this band maps to $[a,b]$ with small $b-a=\epsilon$ . In this case, since the characteristic function of $[a,b]$ is the shrinking dilation of the characteristic function of $[-\frac{1}{2},\frac{1}{2}]$ (up to translation), it’s Fourier coefficients are samples from the stretching dilation of the Fourier transform of the characteristic function of $[-\frac{1}{2},\frac{1}{2}]$ . As a result, the number of coefficients required to approximate a band pass filter up to some fixed tolerance is inverse proportional to the size of the band. More generally, the number of Chebyshev coefficients required for approximating a filter having features in a given scale, is inverse proportional to the scale. When the Laplacian has a cluster of eigenvalues concentrated around one frequency, a filter that separates these eigenvalues must have features in scale proportional to the radius of the eigenvalue cluster. Therefore, the number of Chebyshev coefficients must be inverse proportional to the cluster size.

To overcome this major drawback, we need a new class of filters, that both entail $\mathcal{O}(r)$ neighbor exchanges, and are able to specialize in narrow bands in frequency.

Cayley filters

A key construction of this paper is a family of complex filters that enjoy the advantages of Chebyshev filters while avoiding some of their drawbacks. We define a Cayley polynomial of order $r$ to be a real-valued function with complex coefficients,

where $\mathbf{c}=(c_{0},\ldots,c_{r})$ is a vector of one real coefficient and $r$ complex coefficients and $h>0$ is the spectral zoom parameter, that will be discussed later. A Cayley filter $\mathbf{G}$ is a spectral filter defined on real signals $\mathbf{f}$ by

where the parameters $\mathbf{c}$ and $h$ are optimized during training. Similarly to the Chebyshev filters, Cayley filters involve basic matrix operations such as powers, additions, multiplications by scalars, and also inversions. This implies that application of the filter $\mathbf{G}\mathbf{f}$ can be performed without explicit expensive eigendecomposition of the Laplacian operator. Cayley filters are special cases of filters based on general rational functions of the Laplacian, namely ARMA filters . For a general rational functions of the Laplacian, calculating the denominator requires a matrix inversion. When the filter is based on arbitrary coefficients, there is no guarantee that the matrix inversions are calculated stably. Guaranteeing stable inversions for arbitrary filter coefficients is important, since the coefficients follow an unknown path during training. For general ARMA filters, the filter can acquire poles arbitrarily close to the spectrum of $\boldsymbol{\Delta}$ during training, so there is no uniform analysis of convergence of the approximate inversions in the filter computation. The motivation to use the subclass of Cayley filters over general rational functions is to guarantee that inversion is uniformly stable. Namely, the number of iterations required for a given approximation error is independent of the filter coefficients, as long as the coefficients are bounded. Moreover, the lower number of parameters in Cayley polynomials in comparison to general rational functions may be beneficial for avoiding overfitting.

In the following, we show that Cayley filters are analytically well behaved; in particular, any smooth spectral filter can be represented as a Cayley polynomial, and low-order filters are localized in the spatial domain. We also discuss numerical implementation and compare Cayley and Chebyshev filters. We show that Cayley filters defined on sparse Laplacians with $\mathcal{O}(1)$ non-zero elements takes $\mathcal{O}(n)$ operations, similarly to Chebyshev filters.

Note that any spectral filter can be formulated as a Cayley filter. Indeed, spectral filters $g(\boldsymbol{\Delta})$ are specified by the finite sequence of values $g(\lambda_{1}),\ldots,g(\lambda_{n})$ , which can be interpolated by a trigonometric polynomial. Moreover, since trigonometric polynomials are smooth, we expect low order Cayley filters to be well localized in some sense on the graph, as discussed later.

2 Spectral zoom

3 Numerical properties

The numerical core of the Cayley filter is the computation of $\mathcal{C}^{j}(h\boldsymbol{\Delta})\mathbf{f}$ for $j=1,\ldots,r$ , performed in a sequential manner. Let $\mathbf{y}_{0},\ldots,\mathbf{y}_{r}$ denote the solutions of the following linear recursive system,

Note that sequentially approximating $\mathbf{y}_{j}$ in (6) using the approximation of $\mathbf{y}_{j-1}$ in the right hand side is stable, since $\mathcal{C}(h\boldsymbol{\Delta})$ is unitary and thus has condition number $1$ .

Next, we give an error bound for the approximate filter. For the unnormalized Laplacian, let $d=\max_{j}\{d_{j,j}\}$ and $\kappa=\left\|\mathbf{J}\right\|_{\infty}=\frac{hd}{\sqrt{h^{2}d^{2}+1}}<1$ . For the normalized Laplacian, we assume that $(h\boldsymbol{\Delta}_{n}+i\mathbf{I})$ is dominant diagonal, which gives $\kappa=\left\|\mathbf{J}\right\|_{\infty}<1$ .

Under the above assumptions, $\frac{\|\mathbf{G}\mathbf{f}-\widetilde{\mathbf{G}\mathbf{f}}\|_{2}}{\left\|\mathbf{f}\right\|_{2}}\leq 2M\kappa^{K}$ , where $M=\sqrt{n}\sum_{j=1}^{r}j\left|c_{j}\right|$ in the general case, and $M=\sum_{j=1}^{r}j\left|c_{j}\right|$ if the graph is regular.

Proposition 1 is pessimistic in the general case, while requires strong assumptions in the regular case. We find that in most real life situations the behavior is closer to the regular case. It also follows from Proposition 1 that smaller values of the spectral zoom $h$ result in faster convergence, giving this parameter an additional numerical role of accelerating convergence.

Last, note that a Cayley filter with Jacobi approximation, is based on powers of the Jacobi matrix ${\mathbf{J}}$ , in addition to $(h\boldsymbol{\Delta}-i\mathbf{I})$ . The Jacobi matrix can be viewed as a general representation matrix of the graph, replacing the standard Laplacian with a matrix that respects the connectivity of the graph (general representation matrices were considered in ). This is also true for $(h\boldsymbol{\Delta}-i\mathbf{I})$ . In this point of view, learning $h$ is interpreted as learning a general representation matrix. Learning $h$ can be also viewed as learning a normalization of the weights of the graph. The problem of learning the topology of the graph was studied e.g. in .

4 Complexity

In practice, an accurate inversion of $(h\boldsymbol{\Delta}+i\mathbf{I})$ is not required, since the approximate inverse is combined with learned coefficients, which “compensate”, as necessary, for the inversion inaccuracy. Such behavior is well-documented in the literature in other contexts of model compression and accelerated convergence of iterative algorithms. For example, in , sparse signal coding are learned by unrolling iterative shrinkage algorithms (FISTA) into a neural network, where each layer emulates an iteration of the original algorithm but has extra learnable parameters. It is shown that FISTA networks with just a few layers outperform hundreds or thousands of iterations of the original algorithm thanks to the learnable parameters. The above phenomenon is also a common observation for solving sparse linear equations for compressed sensing tasks (see e.g ).

In a CayleyNet for a fixed graph, we fix the number of Jacobi iterations. Since the convergence rate depends on $\kappa$ , that depends on the graph, different graphs may need different numbers of iterations. The convergence rate also depends on $h$ . Since there is a trade-off between the spectral zoom amount $h$ , and the accuracy of the approximate inversion, and since $h$ is a learnable parameter, the training finds the right balance between the spectral zoom amount and the inversion accuracy.

5 Localization

Unlike Chebyshev filters that have the small $r$ -hop support, Cayley filters are rational functions supported on the whole graph. However, it is still true that Cayley filters are well localized on the graph. Let $\mathbf{G}$ be a Cayley filter and $\boldsymbol{\delta}_{m}$ denote a delta-function on the graph, defined as one at vertex $m$ and zero elsewhere. We show that $\mathbf{G}\boldsymbol{\delta}_{m}$ decays fast, in the following sense:

Let $\mathbf{f}$ be a signal on the vertices of graph $\mathcal{G}$ , $1\leq p\leq\infty$ , and $0<\epsilon<1$ . Denote by $S\subseteq\{1,\ldots,n\}$ a subset of the vertices and by $S^{c}$ its complement. We say that the $L_{p}$ -mass of $\mathbf{f}$ is supported in $S$ up to $\epsilon$ if $\left\|\mathbf{f}|_{S^{c}}\right\|_{p}\leq\epsilon\left\|\mathbf{f}\right\|_{p}$ , where $\mathbf{f}|_{S^{c}}=(f_{l})_{l\in S^{c}}$ is the restriction of $\mathbf{f}$ to $S^{c}$ . We say that $\mathbf{f}$ has (graph) exponential decay about vertex $m$ , if there exists some $\gamma\in(0,1)$ and $c>0$ such that for any $k$ , the $L_{p}$ -mass of $\mathbf{f}$ is supported in ${\cal N}_{k,m}$ up to $c\gamma^{k}$ . Here, ${\cal N}_{k,m}$ is the $k$ -hop neighborhood of $m$ .

Note that Definition 2 is analogous to classical exponential decay on Euclidean space: $\left|f(x)\right|\leq R\gamma^{-x}$ iff for every ball $B_{\rho}$ of radius $\rho$ about , $\|f|_{B_{\rho}^{c}}\|_{\infty}\leq c\gamma^{-\rho}\left\|f\right\|_{\infty}$ with $c=\frac{R}{\left\|f\right\|_{\infty}}$ .

Let $\mathbf{G}$ be a Cayley filter of order $r$ . Then, $\mathbf{G}\boldsymbol{\delta}_{m}$ has exponential decay about $m$ in $L_{2}$ , with constants $c=4M\frac{1}{\|\mathbf{G}\boldsymbol{\delta}_{m}\|_{2}}$ and $\gamma=\kappa^{1/r}$ (where $M$ and $\kappa$ are from Proposition 1).

6 Cayley vs Chebyshev

Below, we compare the two classes of filters: Chebyshev as a special case of Cayley. For a regular graph with $\mathbf{D}=d\mathbf{I}$ , using Jacobi inversion based on zero iteration, we get that any Cayley filter of order $r$ is a polynomial of $\boldsymbol{\Delta}$ in the monomial base $\big{(}\frac{h\boldsymbol{\Delta}-i}{hd+i}\big{)}^{j}$ . In this situation, a Chebyshev filter, which is a real valued polynomial of $\boldsymbol{\Delta}$ , is a special case of a Cayley filter. Spectral zoom and stability. Generally, both Chebyshev polynomials and trigonometric polynomials give stable approximations, optimal for smooth functions. However, this crude statement is over-simplified. One of the drawbacks in Chebyshev filters is the fact that the spectrum of $\boldsymbol{\Delta}$ is always mapped to $ $in a linear manner, making it hard to specialize in small frequency bands. In Cayley filters, this problem is mitigated with the help of the spectral zoom parameter$ h $. As an example, consider the community detection problem discussed in the next section. A graph with strong communities has a cluster of small eigenvalues near zero. Ideal filters$ g(\boldsymbol{\Delta}) $for extracting the community information should be able to focus on this band of frequencies. Approximating such filters with Cayley polynomials, we zoom in to the band of interest by choosing the right$ h $, and then project$ g $onto the space of trigonometric polynomials of order$ r $, getting a good and stable approximation (Figure 3, bottom). However, if we project$ g $onto the space of Chebyshev polynomials of order$ r $, the interesting part of$ g $concentrated on a small band is smoothed out and lost (Figure 3, second to the last). Thus, projections are not the right way to approximate such filters, and the stability of orthogonal polynomials cannot be invoked. On the other hand, if we want to approximate$ g $on the small band using polynomials, ignoring the behavior away from this band, the approximation will be unstable away from this band; small perturbations in$ g $will result in big perturbations in the Chebyshev filter away from the band. This is due to the fact that any polynomial diverges at infinity, and for an asymptotically small band and polynomials of fixed order, “away from the band behaves like infinity.” For this reason, we say that Cayley filters are more stable than Chebyshev filters. Regularity. We found that in practice, low-order Cayley filters are able to model both very concentrated impulse-like filters, and wider Gabor-like filters. Cayley filters are able to achieve a wider range of filter supports with less coefficients than Chebyshev filters (Figure 2), making the Cayley class more regular than Chebyshev. Complexity. Under the assumption of sparse Laplacians, both Cayley and Chebyshev filters incur linear complexity$ \mathcal{O}(n)$. Besides, the new filters are equally simple to implement as Chebyshev filters; as seen in Eq.7, they boil down to simple sparse matrix-vector multiplications providing a GPU friendly implementation.

Results

We test the proposed CayleyNets reproducing the experiments of and using ChebNet as our main baseline method. Pooling and graph coarsening was performed identically to . The hyperparameters are identical to the original experiments, and not optimized. All the methods were implemented in TensorFlow . The experiments were executed on a machine with a 3.5GHz Intel Core i7 CPU, 64GB of RAM, and NVIDIA Titan X GPU with 12GB of RAM. SGD+Momentum and Adam optimization methods were used to train the models in MNIST and the rest of the experiments, respectively. Training and testing were always done on disjoint sets.

2 Community detection

We start with an experiment on a synthetic graph consisting of 15 communities with strong connectivity within each community and sparse connectivity across communities (Figure 3, top). Though rather simple, such a dataset allows to study the behavior of different algorithms in controlled settings. On this graph, we generate noisy step signals, defined as $f_{i}=1+\sigma_{i}$ if $i$ belongs to the community, and $f_{i}=\sigma_{i}$ otherwise, where $\sigma_{i}\sim\mathcal{N}(0,0.3)$ is Gaussian i.i.d. noise. The goal is to classify each such signal according to the community it belongs to. The neural network architecture used for this task consisted of a spectral convolutional layer (based on Chebyshev or Cayley filters) with 32 output features, a mean pooling layer, and a softmax classifier for producing the final classification into one of the 15 classes. No regularization has been exploited in this setting. The classification accuracy is shown in Figure 3 (second to the top) along with examples of learned filters (bottom two). We observe that CayleyNet significantly outperforms ChebNet for smaller filter orders, with an improvement as large as 80%. Studying the filter responses, we note that due to the capability to learn the spectral zoom parameter, CayleyNet allows to generate band-pass filters in the low-frequency band that discriminate well the communities (Figure 3 bottom).

3 Complexity

We experimentally validated the computational complexity of our model applying filters of different order $r$ to synthetic 15-community graphs of different size $n$ using exact matrix inversion and approximation with different number of Jacobi iterations (Figure 6 in Appendix C). All times have been computed running 30 times the considered models and averaging the final results. As expected, approximate inversion guarantees $\mathcal{O}(n)$ complexity. We further conclude that typically very few Jacobi iterations are required (Figure 4 shows that our model with just one Jacobi iteration outperforms ChebNet for low-order filters on the community detection problem).

4 MNIST

Following , for a toy example, we approached the classical MNIST digits classification as a learning problem on graphs. Each pixel of an image is a vertex of a graph (regular grid with 8-neighbor connectivity), and pixel color is a signal on the graph. We used a graph CNN architecture with two spectral convolutional layers based on Chebyshev and Cayley filters (producing 32 and 64 output features, respectively), interleaved with pooling layers performing 4-times graph coarsening using the Graclus algorithm , and finally a fully-connected layer (this architecture replicates the classical LeNet5, , architecture, which is shown for comparison). SGD+Momentum with learning rate equal to 0.02, momentum $m=0.9$ , dropout probability $p=0.5$ and weight decay coefficient $\gamma=5\cdot 10^{-4}$ have been applied as described in . MNIST classification results are reported in Table 1. CayleyNet (11 Jacobi iterations) achieves the same (near perfect) accuracy as ChebNet with filters of lower order ( $r=12$ vs $25$ ). Examples of filters learned by ChebNet and CayleyNet are shown in Figure 2. 0.1776 +/- 0.06079 sec and 0.0268 +/- 0.00841 sec are respectively required by CayleyNet and ChebNet for analyzing a batch of 100 images at test time.

5 Citation network

Next, we address the problem of vertex classification on graphs using the popular CORA citation graph . Each of the 2,708 vertices of the CORA graph represents a scientific paper, and an undirected unweighted edge represents a citation (5,429 edges in total). For each vertex, a 1,433-dimensional binary feature vector representing the content of the paper is given. The task is to classify each vertex into one of the 7 groundtruth classes, of labels. In the semi-supervised problem (transductive learning), the features of all vertices are known, but labels are given just for a subset of the nodes. The task is to learn a mapping, that takes the features at the nodes as inputs, and gives the labels as outputs. The mapping is trained by minimizing the label error at the nodes with known labels. After training, the the mapping is tested over the nodes in which the labels were unknown during training.

To present a deep comparison with recent state-of-the-art architectures, we analyze the performance of our model in two different settings: the classic semi-supervised problem presented in with 140 training samples, 500 validation samples and 1,000 test samples and a relaxed version of this that exploits 1,708 vertices for training, 500 for validation and 500 for testing. We opted for a larger amount of training samples in our second experiment, in order to provide an estimate of the quality of CayleyNet in a situation which is less prone to overfitting. This provides a better overview of the goodness of the considered construction, since richer filters are less likely to produce lower performance, as opposed to the typical behavior when the available data is scarce. Cayley operators with matrix inversion have been considered in both settings for our solution. DCNN, GCN, MoNet and GAT have been used as terms of comparison.

On the standard split, we train CayleyNet by realizing filters as linear combinations of neighborhood descriptors obtained by applying Cayley filters on the signal of features. Two versions of CayleyNet have been implemented for this setting: a lightweight one exploiting two convolutional layers with 16 and 7 output features and a heavier one requiring 64 and 7 output features. This provides a valuable term of comparison with both the solutions presented in and as same number of parameters is respectively required by our implementations. Normalized Laplacian has been used as reference. Adam with learning rate equal to $5\cdot 10^{-3}$ , dropout probability $p=0.6$ and weight decay coefficient $\gamma=5\cdot 10^{-4}$ have been used for training. Table 3 presents the results we obtained with our solution, average performance over 50 runs are reported to guarantee accurate estimates. Performance of GCN, MoNet and GAT have been obtained from the respective papers, DCNN has been trained with 1 diffusion layer and 1 hop to guarantee same number of parameters. Our lighter version of CayleyNet outperforms DCNN, GCN and MoNet, while being defeated only by the recent GAT (which however exploits an attention mechanism for better discriminating relevant neighbors). Our heavier CayleyNet shows a significant drop in performance, likely because of overfitting on the small training set.

6 Recommender system

In our final experiment, we applied CayleyNet to recommendation system, formulated as matrix completion problem on user and item graphs, . The task is, given a sparsely sampled matrix of scores assigned by users (columns) to items (rows), to fill in the missing scores. The similarities between users and items are given in the form of column and row graphs, respectively. approached this problem as learning with a Recurrent Graph CNN (RGCNN) architecture, using an extension of ChebNets to matrices defined on multiple graphs in order to extract spatial features from the score matrix; these features are then fed into an RNN producing a sequential estimation of the missing scores. Here, we repeated verbatim their experiment on the MovieLens dataset (), replacing Chebyshev filters with Cayley filters. Following , to train our model we uniformly split the available training scores in two sets of equal dimension, 50% of the provided scores (data scores) are used to initialize the input matrix while the remaining 50% are used as training labels. SRGCNN is trained to reconstruct the missing labels from the few given data scores. At test time we initialize the input matrix only with the considered data scores to provide the network the same conditions it observed at training time. We used separable RGCNN architecture with two CayleyNets of order $r=4$ employing 15 Jacobi iterations. Adam with learning rate equal to $10^{-3}$ and regularization coefficient $\gamma=10^{-10}$ have been used for trainingValues obtained from MGCNN implementation available at https://github.com/fmonti/mgcnn/. The results are reported in Table 4. To present a complete comparison we further extended the experiments reported in by training sRGCNN with ChebNets of order 8, this provides an architecture with same number of parameters as the exploited CayleyNet (23K coefficients). Our version of sRGCNN outperforms all the competing methods, including the previous result with Chebyshev filters reported in . sRGCNNs with Chebyshev polynomials of order 4 and 8 respectively require 0.0698 +/- 0.00275 sec and 0.0877 +/- 0.00362 sec at test time, sRGCNN with Cayley polynomials of order 4 and 15 jacobi iterations requires 0.165 +/- 0.00332 sec.

Conclusion

In this paper, we introduced a new efficient spectral graph CNN architecture that scales linearly with the dimension of the input data. Our architecture is based on a new class of complex rational Cayley filters that are localized in space, can represent any smooth spectral transfer function, and are highly regular. The key property of our model is its ability to specialize in narrow frequency bands with a small number of filter parameters, while still preserving locality in the spatial domain. Experimental results on the MNIST, CORA and MovieLens datasets show the good performance of our construction wrt a variety of other approaches, and the superior performance with respect to other spectral CNN methods.

Appendix A Proof of Proposition 1

Define the approximation error in $\mathcal{C}(h\boldsymbol{\Delta})^{j}\mathbf{f}$ by

By the triangle inequality, by the fact that $\mathcal{C}^{j}(h\boldsymbol{\Delta})$ is unitary, and by (8)

Now, using standard norm bounds, in the general case we have $\left\|\mathbf{J}^{K+1}\right\|_{2}\leq\sqrt{n}\left\|\mathbf{J}^{K+1}\right\|_{\infty}$ . Thus, by $\kappa=\left\|\mathbf{J}\right\|_{\infty}$ we have

The solution of this recurrent sequence is

In case the graph is regular, we have $\mathbf{D}=d\mathbf{I}$ . In the non-normalized Laplacian case,

The spectral radius of $\boldsymbol{\Delta}$ is bounded by $2d$ . This can be shown as follows. a value $\lambda$ is not an eigenvalue of $\boldsymbol{\Delta}$ (namely it is a regular value) if and only if $(\boldsymbol{\Delta}-\lambda\mathbf{I})$ is invertible. Moreover, the matrix $(\boldsymbol{\Delta}-\lambda\mathbf{I})$ is strictly dominant diagonal for any $\left|\lambda\right|>2d$ . By Levy–Desplanques theorem ( Theorem 6.1.10), any strictly dominant diagonal matrix is invertible, which means that all of the eigenvalues of $\boldsymbol{\Delta}$ are less than $2d$ in their absolute value. As a result, the spectral radius of $(d\mathbf{I}-\boldsymbol{\Delta})$ is realized on the smallest eigenvalue of $\boldsymbol{\Delta}$ , namely it is $\left|d-0\right|=d$ . This means that the specral radius of $\mathbf{J}$ is $\frac{hd}{\sqrt{h^{2}d^{2}+1}}$ . As a result $\left\|\mathbf{J}\right\|_{2}=\frac{hd}{\sqrt{h^{2}d^{2}+1}}=\kappa$ . We can now continue from (11) to get

In the case of the normalized Laplacian of a regular graph, the spectral radius of $\boldsymbol{\Delta}_{n}$ is bounded by $2$ , and the diagonal entries are all 1. Equation (10) in this case reads $\mathbf{J}=\frac{h}{h+i}(\mathbf{I}-\boldsymbol{\Delta}_{n})$ , and $\mathbf{J}$ has spectral radius $\frac{h}{\sqrt{h^{2}+1}}$ . Thus $\left\|\mathbf{J}\right\|_{2}=\frac{h}{\sqrt{h^{2}+1}}=\kappa$ and we continue as before to get $e_{j}\leq j\kappa^{K+1}$ and $M_{j}=j$ .

In all cases, we have by the triangle inequality

Appendix B Proof of Theorem 4

In this proof we approximate $\mathbf{G}\boldsymbol{\delta}_{m}$ by $\widetilde{\mathbf{G}\boldsymbol{\delta}}_{m}$ . Note that the signal $\boldsymbol{\delta}_{m}$ is supported on one vertex, and in the calculation of $\widetilde{\mathbf{G}\boldsymbol{\delta}}_{m}$ , each Jacobi iteration increases the support of the signal by 1-hop. Therefore, the support of $\widetilde{\mathbf{G}\boldsymbol{\delta}}_{m}$ is the $r(K+1)$ -hop neighborhood ${\cal N}_{r(K+1),m}$ of $m$ . Denoting $l=r(K+1)$ , and using Proposition 1, we get

Appendix C Computational Complexity

In Figure 6 we compare the computational complexity of CayleyNey and ChebNet on our community detection problem.

Appendix D Back propagation

In this section we show how to differentiate Cayley filters $\mathbf{G}_{\mathbf{c},h}=g_{c,h}(\boldsymbol{\Delta})$ with respect to the complex coefficient vector $\mathbf{c}=(c_{0},\ldots,c_{r})$ and $h$ . Since working with complex parameters is not standard, we explain in detail how this is done. One approach is to simply treat each complex parameter $c_{j}=c_{j}^{R}+ic_{j}^{I}$ as the pair of real parameters $(c_{j}^{R},c_{j}^{I})$ , and to explicitly formulate Cayley polynomial with real numbers. This brute force formulation is suitable for automatic back propagation in software like TensorFlow. To justify the calculation from a theoretical standpoint, it is more convenient to consider a general calculus of variation approach to gradient descent.

Our goal is to minimize the loss function with respect to all of the coefficients of all filters in the network. Note that minimization is a set operation in its nature. Namely, a minimal value of a set doesn’t depend on additional structures endowed on the set, such as inner product, topology, Riemannian structure, and so on. This means that we are free to use the vector space structure and the inner product of our choice to define the gradient.

Consider a generic Cayley filter, applied on the real valued signal $\mathbf{f}$

denote the dependency of the loss function on $(\mathbf{c},h)$ . Our goal is to define an inner product, and calculate a gradient of $S(\mathbf{c},h)$ with respect to the complex coefficient vector $\mathbf{a}=(\mathbf{c},h)$ . Given an inner product structure on the space of coefficients, a (variational) gradient at the point $\mathbf{a}$ of a scalar valued function $S$ is a vector $D[S;\mathbf{a}]$ such that for any other coefficient vector $\boldsymbol{\eta}$

where $\epsilon$ denotes a scalar. If there is no such vector $D[S;\mathbf{a}]$ , $S$ is not differentiable at $\mathbf{a}$ by definition. This definition can be extended to vector valued functions.

Thus, by the definition of the inner product in the (realificated) coefficient space, the differential of $\mathbf{G}_{\mathbf{c},h}\mathbf{f}$ with respect to $\mathbf{c}$ is given by the vector $D_{\mathbf{c}}[\mathbf{G}_{\mathbf{c},h}\mathbf{f};(h,\mathbf{c})]$ with vector valued entries

where $\nabla$ denotes the gradient with respect to the signal, and $\cdot$ is the usual real dot product between real valued signals. Let us write in short $\nabla F=\nabla F(\mathbf{G}_{\mathbf{c},h}\mathbf{f})$ , and obtain

This shows, by the definition of the inner product in the (realificated) coefficient space, that

where $\mathcal{C}^{\prime}(x)=(x+i)^{-1}(1-\mathcal{C}(x))$ . We can interpret this as a calculation on the spectrum of $\mathbf{G}_{\mathbf{c},h}=g_{\mathbf{c},h}(\boldsymbol{\Delta})$ . Then, by the fact that $\boldsymbol{\Delta}$ is a bounded normal operator, we can carry the calculation to $\mathbf{G}_{\mathbf{c},h}=g_{\mathbf{c},h}(\boldsymbol{\Delta})$ via functional calculus. We thus obtain

where $\mathcal{C}^{\prime}(h\boldsymbol{\Delta})=(h\boldsymbol{\Delta}+i\mathbf{I})^{-1}(\mathbf{I}-\mathcal{C}(h\boldsymbol{\Delta}))$ .

Acknowledgment

MB and FM are partially supported by ERC Consolidator Grant No. 724228 (LEMAN), Google Faculty Research Awards, Amazon AWS ML Research Award, Royal Society Wolfson Research Merit Award, and Rudolf Diesel fellowship at the Institute for Advanced Studies, TU Munich.