Bayesian Deep Convolutional Networks with Many Channels are Gaussian Processes

Roman Novak, Lechao Xiao, Jaehoon Lee, Yasaman Bahri, Greg Yang, Jiri Hron, Daniel A. Abolafia, Jeffrey Pennington, Jascha Sohl-Dickstein

Introduction

Neural networks (NNs) demonstrate remarkable performance (He et al., 2016; Oord et al., 2016; Silver et al., 2017; Vaswani et al., 2017), but are still only poorly understood from a theoretical perspective (Goodfellow et al., 2015; Choromanska et al., 2015; Pascanu et al., 2014; Zhang et al., 2017). NN performance is often motivated in terms of model architectures, initializations, and training procedures together specifying biases, constraints, or implicit priors over the class of functions learned by a network. This induced structure in learned functions is believed to be well matched to structure inherent in many practical machine learning tasks, and in many real-world datasets. For instance, properties of NNs which are believed to make them well suited to modeling the world include: hierarchy and compositionality (Lin et al., 2017; Poggio et al., 2017), Markovian dynamics (Tiňo et al., 2004; 2007), and equivariances in time and space for RNNs (Werbos, 1988) and CNNs (Fukushima & Miyake, 1982; Rumelhart et al., 1985) respectively.

The recent discovery of an equivalence between deep neural networks and GPs (Lee et al., 2018; Matthews et al., 2018b) allow us to express an analytic form for the prior over functions encoded by deep NN architectures and initializations. This transforms an implicit prior over functions into an explicit prior, which can be analytically interrogated and reasoned about.While there is broad literature on empirical interpretation of finite CNNs (Zeiler & Fergus, 2014; Simonyan et al., 2014; Long et al., 2014; Olah et al., 2017), it is commonly only applicable to fully trained networks.

Previous work studying these Neural Network-equivalent Gaussian Processes (NN-GPs) has established the correspondence only for fully connected networks (FCNs). Additionally, previous work has not used analysis of NN-GPs to gain specific insights into the equivalent NNs.

In the present work, we extend the equivalence between NNs and NN-GPs to deep Convolutional Neural Networks (CNNs), both with and without pooling. CNNs are a particularly interesting architecture for study, since they are frequently held forth as a success of motivating NN design based on invariances and equivariances of the physical world (Cohen & Welling, 2016) – specifically, designing a NN to respect translation equivariance (Fukushima & Miyake, 1982; Rumelhart et al., 1985). As we will see in this work, absent pooling, this quality of equivariance has no impact in the Bayesian treatment of the infinite channel (number of convolutional filters) limit (Figure 2).

The specific novel contributions of the present work are:

We show analytically that CNNs with many channels, trained in a fully Bayesian fashion, correspond to an NN-GP (§2, §3). We show this for CNNs both with and without pooling, with arbitrary convolutional striding, and with both same\operatorname{same} and valid\operatorname{valid} padding. We prove convergence as the number of channels in hidden layers approach infinity simultaneously (i.e. min{n1,,nL}\min\left\{n^{1},\dots,n^{L}\right\}\to\infty, see §F.4 for details), extending the result of Matthews et al. (2018a) under mild conditions on the nonlinearity derivative. Our results also provide a rigorous proof of the assumption made in Xiao et al. (2018) that pre-activations in hidden layers are i.i.d. Gaussian.

We show that in the absence of pooling, the NN-GP for a CNN and a Locally Connected Network (LCN) are identical (Figure 2, §5.1). An LCN has the same local connectivity pattern as a CNN, but without weight sharing or translation equivariance.

We experimentally compare trained CNNs and LCNs and find that under certain conditions both perform similarly to the respective NN-GP (Figure 6, b, c). Moreover, both architectures tend to perform better with increased channel count, suggesting that similarly to FCNs (Neyshabur et al., 2015; Novak et al., 2018) CNNs benefit from overparameterization (Figure 6, a, b), corroborating a similar trend observed in Canziani et al. (2016, Figure 2). However, we also show that careful tuning of hyperparameters allows finite CNNs trained with SGD to outperform their corresponding NN-GPs by a significant margin. We experimentally disentangle and quantify the contributions stemming from local connectivity, equivariance, and invariance in a convolutional model in one such setting (Figure 2).

We introduce a Monte Carlo method to compute NN-GP kernels for situations (such as CNNs with pooling) where evaluating the NN-GP is otherwise computationally infeasible (§4).

We stress that we do not evaluate finite width Bayesian networks nor do we make any claims about their performance relative to the infinite width GP studied here or finite width SGD-trained networks. While this is an interesting subject to pursue (see Matthews et al. (2018a); Neal (1994)), it is outside of the scope of this paper.

In early work on neural network priors, Neal (1994) demonstrated that, in a fully connected network with a single hidden layer, certain natural priors over network parameters give rise to a Gaussian process prior over functions when the number of hidden units is taken to be infinite. Follow-up work extended the conditions under which this correspondence applied (Williams, 1997; Le Roux & Bengio, 2007; Hazan & Jaakkola, 2015). An exactly analogous correspondence for infinite width, finite depth deep fully connected networks was developed recently in Lee et al. (2018); Matthews et al. (2018b), with Matthews et al. (2018a) extending the convergence guarantees from ReLU\operatorname{ReLU} to any linearly bounded nonlinearities and monotonic width growth rates. In this work we further relax the conditions to absolutely continuous nonlinearities with exponentially bounded derivative and any width growth rates.

The line of work examining signal propagation in random deep networks (Poole et al., 2016; Schoenholz et al., 2017; Yang & Schoenholz, 2017; 2018; Hanin & Rolnick, 2018; Chen et al., 2018; Yang et al., 2018) is related to the construction of the GPs we consider. They apply a mean field approximation in which the pre-activation signal is replaced with a Gaussian, and the derivation of the covariance function with depth is the same as for the kernel function of a corresponding GP. Recently, Xiao et al. (2017b; 2018) extended this to convolutional architectures without pooling. Xiao et al. (2018) also analyzed properties of the convolutional kernel at large depths to construct a phase diagram which will be relevant to NN-GP performance, as discussed in §C.

Compositional kernels coming from wide convolutional and fully connected layers also appeared outside of the GP context. Cho & Saul (2009) derived closed-form compositional kernels for rectified polynomial activations (including ReLU\operatorname{ReLU}). Daniely et al. (2016) proved approximation guarantees between a network and its corresponding kernel, and show that empirical kernels will converge as the number of channels increases.

There is a line of work considering stacking of GPs, such as deep GPs (Lawrence & Moore, 2007; Damianou & Lawrence, 2013). These no longer correspond to GPs, though they can describe a rich class of probabilistic models beyond GPs. Alternatively, deep kernel learning (Wilson et al., 2016b; a; Bradshaw et al., 2017) utilizes GPs with base kernels which take in features produced by a deep neural network (often a CNN), and train the resulting model end-to-end. Finally, van der Wilk et al. (2017) incorporates convolutional structure into GP kernels, with follow-up work stacking multiple such GPs (Kumar et al., 2018; Blomqvist et al., 2018) to produce a deep convolutional GP (which is no longer a GP). Our work differs from all of these in that our GP corresponds exactly to a fully Bayesian CNN in the infinite channel limit, when all layers are taken to be of infinite size. We remark that while alternative models, such as deep GPs, do include infinite-sized layers in their construction, they do not treat all layers in this way – for instance, through insertion of bottleneck layers which are kept finite. While it remains to be seen exactly which limit is applicable for understanding realistic CNN architectures in practice, the limit we consider is natural for a large class of CNNs, namely those for which all layers sizes are large and rather comparable in size. Deep GPs, on the other hand, correspond to a potentially richer class of models, but are difficult to analytically characterize and suffer from higher inference cost.

Borovykh (2018) analyzes the convergence of CNN outputs at different spatial locations (or different timepoints for a temporal CNN) to a GP for a single input example. Thus, while they also consider a GP limit (and perform an experiment comparing posterior GP predictions to an SGD-trained CNN), they do not address the dependence of network outputs on multiple input examples, and thus their model is unable to generate predictions on a test set consisting of new input examples.

In concurrent work, Garriga-Alonso et al. (2018) derive an NN-GP kernel equivalent to one of the kernels considered in our work. In addition to explicitly specifying kernels corresponding to pooling and vectorizing, we also compare the NN-GP performance to finite width SGD-trained CNNs and analyze the differences between the two models.

Many-channel Bayesian CNNs are Gaussian processes

General setup. For simplicity of presentation we consider 1D convolutional networks with circularly-padded activations (identically to Xiao et al. (2018)). Unless specified otherwise, no pooling anywhere in the network is used. If a model (NN or GP) is mentioned explicitly as “with pooling”, it always corresponds to a single global average pooling layer at the top. Our analysis is straightforward to extend to higher dimensions, using zero (same\operatorname{same}) or no (valid\operatorname{valid}) padding, strided convolutions, and pooling in intermediary layers (§D). We consider a series of L+1L+1 convolutional layers, l=0,,Ll=0,\dots,L.

Random weights and biases. The parameters of the network are the convolutional filters and biases, ωij,βl\omega^{l}_{ij,\beta} and bilb^{l}_{i}, respectively, with outgoing (incoming) channel index ii (jj) and filter relative spatial location β[±k]{k,,0,,k}\beta\in[\pm k]\equiv\{-k,\dots,0,\dots,k\}.We will use Roman letters to index channels and Greek letters for spatial location. We use letters i,j,i,ji,j,i^{\prime},j^{\prime}, etc to denote channel indices, α,α\alpha,\alpha^{\prime}, etc to denote spatial indices and β,β\beta,\beta^{\prime}, etc for filter indices. Assume a Gaussian prior on both the filter weights and biases,

The weight and bias variances are σω2,σb2\sigma^{2}_{\omega},\sigma^{2}_{b}, respectively. nln^{l} is the number of channels (filters) in layer ll, 2k+12k+1 is the filter size, and vβv_{\beta} is the fraction of the receptive field variance at location β\beta (with βvβ=1\sum_{\beta}v_{\beta}=1). In experiments we utilize uniform vβ=1/(2k+1)v_{\beta}=1/(2k+1), but nonuniform vβ1/(2k+1)v_{\beta}\neq 1/(2k+1) should enable kernel properties that are better suited for ultra-deep networks, as in Xiao et al. (2018).

Activation covariance. A recurring quantity in this work will be the empirical uncentered covariance matrix KlK^{l} of the activations yly^{l}, defined as

KlK^{l} is a random variable indexed by two inputs x,xx,x^{\prime} and two spatial locations α,α\alpha,\alpha^{\prime} (the dependence on layer widths n1,,nln^{1},\dots,n^{l}, as well as weights and biases, is implied and by default not stated explicitly). K0K^{0}, the empirical uncentered covariance of inputs, is deterministic.

Shapes and indexing. Whenever an index is omitted, the variable is assumed to contain all possible entries along the respective dimension. For example, y0y^{0} is a vector of size Xn0d\left|\mathcal{X}\right|n^{0}d, [Kl]α,α\smash{\left[K^{l}\right]_{\alpha,\alpha^{\prime}}} is a matrix of shape X×X\left|\mathcal{X}\right|\times\left|\mathcal{X}\right|, and zjlz^{l}_{j} is a vector of size Xd\left|\mathcal{X}\right|d.

Our work concerns proving that the top-layer pre-activations zLz^{L} converge in distribution to an XnL+1d\left|\mathcal{X}\right|n^{L+1}d-variate normal random vector with a particular covariance matrix of shape (XnL+1d)×(XnL+1d)\left(\left|\mathcal{X}\right|n^{L+1}d\right)\times\left(\left|\mathcal{X}\right|n^{L+1}d\right) as min{n1,,nL}\min\left\{n^{1},\dots,n^{L}\right\}\to\infty. We emphasize that only the channels in hidden layers are taken to infinity, and nL+1n^{L+1}, the number of channels in the top-layer pre-activations zLz^{L}, remains fixed. For convergence proofs, we always consider zlz^{l}, yly^{l}, as well as any of their indexed subsets like zjlz^{l}_{j}, yi,αly^{l}_{i,\alpha} to be 1D vector random variables, while KlK^{l}, as well as any of its indexed subsets (when applicable, e.g. [Kl]α,α\left[K^{l}\right]_{\alpha,\alpha^{\prime}}, [Kl](x,x)\left[K^{l}\right]\left(x,x^{\prime}\right)) to be 2D matrix random variables.

2 Correspondence between Gaussian processes and Bayesian deep CNNs with infinitely many channels

We next consider the prior over outputs zLz^{L} computed by a CNN in the limit of infinitely many channels in the hidden (excluding input and output) layers, min{n1,,nL}\min\left\{n^{1},\dots,n^{L}\right\}\to\infty, and derive its equivalence to a GP with a compositional kernel. This section outlines an argument showing that zLz^{L} is normally distributed conditioned on previous layer activation covariance KLK^{L}, which itself becomes deterministic in the infinite limit. This allows to conclude convergence in distribution of the outputs to a Gaussian with the respective deterministic covariance limit. This section omits many technical details elaborated in §F.4.

As can be seen in Equation 4, the pre-activations zlz^{l} are a linear transformation of the multivariate Gaussian {ωl,bl}\left\{\omega^{l},b^{l}\right\}, specified by the previous layer’s activations yly^{l}. A linear transformation of a multivariate Gaussian is itself a Gaussian with a covariance matrix that can be derived straightforwardly. Specifically,

where Inl+1I_{n^{l+1}} is an nl+1×nl+1n^{l+1}\times n^{l+1} identity matrix, and A(Kl)\mathcal{A}\left(K^{l}\right) is the covariance of the pre-activations zilz^{l}_{i} and is derived in Xiao et al. (2018). Precisely, A:PSDXdPSDXd\mathcal{A}:\textrm{PSD}_{\left|\mathcal{X}\right|d}\to\textrm{PSD}_{\left|\mathcal{X}\right|d} is an affine transformation (a cross-correlation operator followed by a shifting operator) on the space of positive semi-definite Xd×Xd\left|\mathcal{X}\right|d\times\left|\mathcal{X}\right|d matrices defined as follows:

A\mathcal{A} preserves positive semi-definiteness due to Equation 6. Notice that the covariance matrix in Equation 6 is block diagonal due to the fact that separate channels {zil}i=1nl+1\left\{z^{l}_{i}\right\}_{i=1}^{n^{l+1}} are i.i.d. conditioned on yly^{l}, due to i.i.d. weights and biases {ωil,bil}i=1nl+1\left\{\omega^{l}_{i},b^{l}_{i}\right\}_{i=1}^{n^{l+1}}.

We further remark that per Equation 6 the normal distribution of (zlyl)\left(z^{l}|y^{l}\right) only depends on KlK^{l}, hence the random variable (zlKl)\left(z^{l}|K^{l}\right) has the same distribution by the law of total expectation:

2.2 Activation covariance matrix becomes deterministic with increasing channel count

It follows from Equation 8 that the summands in Equation 5 are i.i.d. conditioned on fixed Kl1K^{l-1}. Subject to weak restrictions on the nonlinearity ϕ\phi, we can apply the weak law of large numbers and conclude that the covariance matrix KlK^{l} becomes deterministic in the infinite channel limit in layer ll (note that pre-activations zlz^{l} remain stochastic). Precisely,

where C\mathcal{C} is defined for any Xd×Xd\left|\mathcal{X}\right|d\times\left|\mathcal{X}\right|d PSD matrix KK as

The decoupling of the kernel “propagation” into C\mathcal{C} and A\mathcal{A} is highly convenient since A\mathcal{A} is a simple affine transformation of the kernel (see Equation 7), and C\mathcal{C} is a well-studied map in literature (see §G.4), and for nonlinearities such as ReLU\operatorname{ReLU} (Nair & Hinton, 2010) and the error function (erf\operatorname{erf}) C\mathcal{C} can be computed in closed form as derived in Cho & Saul (2009) and Williams (1997) respectively. We refer the reader to Xiao et al. (2018, Lemma A.1) for complete derivation of the limiting value in Equation 9.

A less obvious result is that, under slightly stronger assumptions on ϕ\phi, the top-layer activation covariance KLK^{L} becomes unconditionally (dependence on observed deterministic inputs y0y^{0} is implied) deterministic as channels in all hidden layers grow to infinity simultaneously:

i.e. KLK_{\infty}^{L} is (CA)\left(\mathcal{C}\circ\mathcal{A}\right) applied LL times to K0K^{0}, the deterministic input covariance. We prove this in §F.4 (Theorem F.5). See Figure 3 for a depiction of the correspondence between neural networks and their infinite width limit covariances KlK_{\infty}^{l}.

2.3 A conditionally normal random variable becomes normal if its covariance becomes deterministic

For more intuition behind Equation 12 and an informal proof please consult §F.3.

Transforming a GP over spatial locations into a GP over classes

In §2.2 we showed that in the infinite channel limit a deep CNN is a GP indexed by input samples, output spatial locations, and output channel indices. Further, its covariance matrix KLK^{L}_{\infty} can be computed in closed form. Here we show that transformations to obtain class predictions that are common in CNN classifiers can be represented as either vectorization or projection (as long as we treat classification as regression (see §G), identically to Lee et al. (2018)). Both of these operations preserve the GP equivalence and allow the computation of the covariance matrix of the respective GP (now indexed by input samples and target classes) as a simple transformation of KLK^{L}_{\infty}.

It is easy to verify that (zˉiL+1KL+1)\left(\bar{z}^{L+1}_{i}|K^{L+1}\right) is a multivariate Gaussian with covariance:

The argument of §2.2 can then be extended to conclude that in the limit of infinite width (zˉiL+1y0)\left(\bar{z}^{L+1}_{i}|y^{0}\right) converges in distribution to a multivariate Gaussian (i.i.d. for each class ii) with covariance

A sample 2D CNN using this readout strategy is depicted in Figure 2, and a sample correspondence between a FCN, FCN-GP, CNN, and CNN-GP is depicted in Figure 3.

Note that as observed in Xiao et al. (2018), to compute the summands [KL+1]α,α(x,x)\left[K_{\infty}^{L+1}\right]_{\alpha,\alpha}\left(x,x^{\prime}\right) in Equation 15, one needs only the corresponding terms [KL]α,α(x,x)\left[K_{\infty}^{L}\right]_{\alpha,\alpha}\left(x,x^{\prime}\right). Consequently, we only need to compute {[Kl]α,α(x,x):x,xX,α{1d}}l=0,,L\left\{\left[K^{l}_{\infty}\right]_{\alpha,\alpha}\left(x,x^{\prime}\right):x,x^{\prime}\in\mathcal{X},\alpha\in\left\{1\dots d\right\}\right\}_{l=0,\dots,L} and the memory cost is O(X2d)\mathcal{O}\left(\left|\mathcal{X}\right|^{2}d\right) (or O(d)\mathcal{O}\left(d\right) per covariance entry in an iterative or distributed setting). Note that this approach ignores pixel-pixel covariances and produces a GP corresponding to a locally connected network (see §5.1).

2 Projection

Global average pooling: take h=1d1dh=\frac{1}{d}\bf 1_{d}. Then

This approach corresponds to applying global average pooling right after the last convolutional layer. Spatially local average pooling in intermediary layers can be constructed in a similar fashion (§D). We focus on global average pooling in this work to more effectively isolate the effects of pooling from other aspects of the model like local connectivity or equivariance. This approach takes all pixel-pixel covariances into consideration and makes the kernel translation invariant. However, it requires O(X2d2)\mathcal{O}\left(\left|\mathcal{X}\right|^{2}d^{2}\right) memory to compute the sample-sample covariance of the GP (or O(d2)\mathcal{O}\left(d^{2}\right) per covariance entry in an iterative or distributed setting). It is impractical to use this method to analytically evaluate the GP, and we propose to use a Monte Carlo approach (see §4).

Subsampling one particular pixel: take h=eαh=e_{\alpha},

This approach makes use of only one pixel-pixel covariance, and requires the same amount of memory as vectorization (§3.1) to compute.

We compare the performance of presented strategies in Figure 4. Note that all described strategies admit stacking additional FC layers on top while retaining the GP equivalence, using a derivation analogous to §2.2 (Lee et al., 2018; Matthews et al., 2018b).

Monte Carlo evaluation of intractable GP kernels

We introduce a Monte Carlo estimation method for NN-GP kernels which are computationally impractical to compute analytically, or for which we do not know the analytic form. Similar in spirit to traditional random feature methods (Rahimi & Recht, 2007), the core idea is to instantiate many random finite width networks and use the empirical uncentered covariances of activations to estimate the Monte Carlo-GP (MC-GP) kernel,

where θ\theta consists of MM draws of the weights and biases from their prior distribution, θmp(θ)\theta_{m}\sim p\left(\theta\right), and nn is the width or number of channels in hidden layers. The MC-GP kernel converges to the analytic kernel with increasing width, limnKn,Ml=Kl\lim_{n\rightarrow\infty}K^{l}_{n,M}=K^{l}_{\infty} in probability.

In a non-distributed setting, the MC-GP reduces the memory requirements to compute GPpool\mathcal{GP}^{\textrm{pool}} from O(X2d2)\mathcal{O}\left(\left|\mathcal{X}\right|^{2}d^{2}\right) to O(X2+n2+nd)\mathcal{O}\left(\left|\mathcal{X}\right|^{2}+n^{2}+nd\right), making the evaluation of CNN-GPs with pooling practical.

Discussion

Locally Connected Networks (LCNs) (Fukushima, 1975; Lecun, 1989) are CNNs without weight sharing between spatial locations. LCNs preserve the connectivity pattern, and thus topology, of a CNN. However, they do not possess the equivariance property of a CNN – if an input is translated, the latent representation in an LCN will be completely different, rather than also being translated.

The CNN-GP predictions without spatial pooling in §3.1 and §3.2.2 depend only on sample-sample covariances, and do not depend on pixel-pixel covariances. LCNs destroy pixel-pixel covariances: [KL]α,αLCN(x,x)=0\left[K_{\infty}^{L}\right]_{\alpha,\alpha^{\prime}}^{\text{LCN}}\left(x,x^{\prime}\right)=0, for αα\alpha\neq\alpha^{\prime} and all x,xXx,x^{\prime}\in\mathcal{X} and L>0L>0. However, LCNs preserve the covariances between input examples at every pixel: [KL]α,αLCN(x,x)=[KL]α,αCNN(x,x)\left[K_{\infty}^{L}\right]_{\alpha,\alpha}^{\text{LCN}}\left(x,x^{\prime}\right)=\left[K_{\infty}^{L}\right]_{\alpha,\alpha}^{\text{CNN}}\left(x,x^{\prime}\right). As a result, in the absence of pooling, LCN-GPs and CNN-GPs are identical. Moreover, LCN-GPs with pooling are identical to CNN-GPs with vectorization of the top layer (under suitable scaling of yL+1y^{L+1}). We confirm these findings experimentally in trained networks in the limit of large width in Figures 2 and 6 (b), as well as by demonstrating convergence of MC-GPs of the respective architectures to the same CNN-GP (modulo scaling of yL+1y^{L+1}) in Figures 5 and 10.

2 Pooling leverages equivariance to provide invariance

The only kernel leveraging pixel-pixel covariances is that of the CNN-GP with pooling. This enables the predictions of this GP and the corresponding CNN to be invariant to translations (modulo edge effects) – a beneficial quality for an image classifier. We observe strong experimental evidence supporting the benefits of invariance throughout this work (Figures 2, 4, 5, 6 (b); Table 1), in both CNNs and CNN-GPs.

3 Finite channel SGD-trained CNNs can outperform infinite channel Bayesian CNNs, in the absence of pooling

In the absence of pooling, the benefits of equivariance and weight sharing are more challenging to explain in terms of Bayesian priors on class predictions (since without pooling equivariance is not a property of the outputs, but only of intermediary representations). Indeed, in this work we find that the performance of finite width SGD-trained CNNs often approaches that of their CNN-GP counterpart (Figure 6, b, c),This observation is conditioned on the respective NN fitting the training set to 100%100\%. Underfitting breaks the correspondence to an NN-GP, since train set predictions of such a network no longer correspond to the true training labels. Properly tuned underfitting often also leads to better generalization (Table 1). suggesting that in those cases equivariance does not play a beneficial role in SGD-trained networks.

However, as can be seen in Figures 2, 6 (c), 7, and Table 1, the best CNN overall outperforms the best CNN-GP by a significant margin – an observation specific to CNNs and not FCNs or LCNs.Performing an analogous large-scale comparison between CNNs and CNN-GPs with pooling was not computationally feasible, and only one CNN-GP model was evaluated.footnote 3 Their relative performance remains an interesting open question for future research. We observe this gap in performance especially in the case of ReLU\operatorname{ReLU} networks trained with a large learning rate. In Figure 2 we demonstrate this large gap in performance by evaluating different models with equivalent architecture and hyperparameter settings, chosen for good SGD-trained CNN performance.

We conjecture that equivariance, a property lacking in LCNs and the Bayesian treatment of the infinite channel CNN limit, contributes to the performance of SGD-trained finite channel CNNs with the correct settings of hyperparameters.While this is consistent with the results of Lecun (1989), in a different setting Bartunov et al. (2018) found equivariance to not provide a substantial performance gain. Nonetheless, more work is needed to disentangle and quantify the separate contributions of stochastic optimization and finite width effects,We remark that concerns about GPs not being able to learn hierarchical representations have been raised in the literature (Matthews et al., 2018a, Section 7), (Neal, 1995, Chapter 5), (MacKay, 2003, Section 45.7) However, practical impact of these assertions have not been extensively investigated empirically or theoretically, and we hope that our work stimulates research in this direction. and differences in performance between CNNs with weight sharing and their corresponding CNN-GPs.

Conclusion

In this work we have derived a Gaussian process that corresponds to fully Bayesian multi-layer CNNs with infinitely many channels. The covariance of this GP can be efficiently computed either in closed form or by using Monte Carlo sampling, depending on the architecture.

The CNN-GP achieves state of the art results for GPs without trainable kernels on CIFAR10. It can perform competitively with CNNs (that fit the training set) of equivalent architecture and weight priors, which makes it an appealing choice for small datasets, as it eliminates all training-related hyperparameters. However, we found that the best overall performance, at least in the absence of pooling, is achieved by finite SGD-trained CNNs and not by their infinite Bayesian counterparts. We hope our work stimulates future research towards understanding the distribution over functions induced by model architecture and training approach, and what aspects of this distribution are important for model performance. Another natural extension of our work is the study of other deep learning architectures in the infinitely wide limit. After the publication of this paper, Yang (2019) devised a unifying framework proving the GP convergence for even more models (such as ones using batch normalization, (self-)attention, LSTM) with slightly different assumptions on the nonlinearity.

Acknowledgements

We thank Sam Schoenholz, Vinay Rao, Daniel Freeman, Qiang Zeng, and Phil Long for frequent discussion and feedback on preliminary results.

References

Appendix A Glossary

We use the following shorthands in this work:

LCN - locally connected network, a.k.a. convolutional network without weight sharing;

FCN - fully connected network, a.k.a. multilayer perceptron (MLP);

X-GP - a GP equivalent to a Bayesian infinitely wide neural network of architecture X (§2).

MC-(X-)-GP - a Monte Carlo estimate (§4) of the X-GP.

Width, (number of) filters, (number of) channels represent the same property for CNNs and LCNs.

Pooling - referring to architectures as “with” or “without pooling” means having a single global average pooling layer (collapsing the spatial dimensions of the activations yL+1y^{L+1}) before the final linear FC layer giving the regression outputs zL+1z^{L+1}.

Invariance and equivariance are always discussed w.r.t. translations in the spatial dimensions of the inputs.

Appendix B Additional figures

Appendix C Relationship to Deep Signal Propagation

The recurrence relation linking the GP kernel at layer l+1l+1 to that of layer ll following from Equation 11 (i.e. Kl+1=(CA)(Kl)K_{\infty}^{l+1}=\left(\mathcal{C}\circ\mathcal{A}\right)\left(K_{\infty}^{l}\right)) is precisely the covariance map examined in a series of related papers on signal propagation (Xiao et al., 2018; Poole et al., 2016; Schoenholz et al., 2017; Lee et al., 2018) (modulo notational differences; denoted as FF, C\mathcal{C} or e.g. AC\mathcal{A}\star\mathcal{C} in Xiao et al. (2018)). In those works, the action of this map on hidden-state covariance matrices was interpreted as defining a dynamical system whose large-depth behavior informs aspects of trainability. In particular, as ll\to\infty, Kl+1=(CA)(Kl)KlKK_{\infty}^{l+1}=\left(\mathcal{C}\circ\mathcal{A}\right)\left(K^{l}_{\infty}\right)\approx K_{\infty}^{l}\equiv K^{*}_{\infty}, i.e. the covariance approaches a fixed point KK^{*}_{\infty}. The convergence to a fixed point is problematic for learning because the hidden states no longer contain information that can distinguish different pairs of inputs. It is similarly problematic for GPs, as the kernel becomes pathological as it approaches a fixed point. Precisely, in both chaotic and ordered regimes, outputs of the GP become asymptotically identically correlated. Either of these scenarios captures no information about the training data in the kernel and makes learning infeasible.

This problem can be ameliorated by judicious hyperparameter selection, which can reduce the rate of exponential convergence to the fixed point. For hyperpameters chosen on a critical line separating two untrainable phases, the convergence rates slow to polynomial, and very deep networks can be trained, and inference with deep NN-GP kernels can be performed – see Figure 11.

Appendix D Strided convolutions, average pooling in intermediate layers, higher dimensions

One can easily see that \left\{Bz_{j}^{l}\big{|}K^{l}\right\}_{j} are i.i.d. multivariate Gaussian as well.

Average pooling. Average pooling with stride ss and window size wsws is equivalent to choosing Bij=1/wsB_{ij}=1/ws for i=0,1,(d21)i=0,1,\dots\left({d_{2}-1}\right) and j=is,,(is+ws1)j=is,\dots,\left(is+ws-1\right).

ND convolutions. Note that our analysis in the main text (1D) easily extends to higher-dimensional convolutions by replacing integer pixel indices and sizes d,α,βd,\alpha,\beta with tuples (see also Figure 2). In Equation 4 β\beta values would have to span the hypercube [±k]N={k,,k}N\left[\pm k\right]^{N}=\left\{-k,\dots,k\right\}^{N} in the pre-activation definition. Similarly, in §3 the normalizing factor dd (d2d^{2}) should be the product (squared) of its entries, and summations over α,β\alpha,\beta should span the [d0]××[dN][d_{0}]\times\dots\times\left[d_{N}\right] hypercube as well. The definition of the kernel propagation operator A\mathcal{A} in Equation 7 will remain exactly the same, so long as β\beta is summed over the hypercube, and the variance weights remain respectively normalized βvβ=1\sum_{\beta}v_{\beta}=1.

Appendix E Review of exact Bayesian regression with GPs

Our discussion in the paper has focused on model priors. A crucial benefit we derive by mapping to a GP is that Bayesian inference is straightforward to implement and can be done exactly for regression (Rasmussen & Williams, 2006, chapter 2), requiring only simple linear algebra. Let X\mathcal{X} denote training inputs x1,...,xXx_{1},...,x_{\left|\mathcal{X}\right|}, tT=(t1,...,tX)\mathbf{t}^{T}=\left(t_{1},...,t_{\left|\mathcal{X}\right|}\right) training targets, and collectively D\mathcal{D} for the training set. The integral over the posterior can be evaluated analytically to give a posterior predictive distribution on a test point yy_{*} which is Gaussian, (zD,y)N(μ,σ2)\left(z^{*}|\mathcal{D},y^{*}\right)\sim\mathcal{N}\left(\mu_{*},\sigma^{2}_{*}\right), with

We use the shorthand K(X,X)\mathcal{K}\left(\mathcal{X},\mathcal{X}\right) to denote the X×X\left|\mathcal{X}\right|\times\left|\mathcal{X}\right| matrix formed by evaluating the GP covariance on the training inputs, and likewise K(x,X)\mathcal{K}\left(x^{*},\mathcal{X}\right) is a X\left|\mathcal{X}\right|-length vector formed from the covariance between the test input and training inputs. Computationally, the costly step in GP posterior predictions comes from the matrix inversion, which in all experiments were carried out exactly, and typically scales as O(X3)\mathcal{O}\left(\left|\mathcal{X}\right|^{3}\right) (though algorithms scaling as O(X2.4)\mathcal{O}\left(\left|\mathcal{X}\right|^{2.4}\right) exist for sufficiently large matrices). Nonetheless, there is a broad literature on approximate Bayesian inference with GPs which can be utilized for efficient implementation (Rasmussen & Williams, 2006, chapter 8); (Quiñonero-Candela & Rasmussen, 2005; Titsias, 2009).

Appendix F Equivalence between randomly initialized NNs and GPs

In this section, we present two different approaches, the sequential limit (§F.3) and simultaneous limit (§F.4), to illustrate the relationship between many-channels Bayesian CNNs and GPs.

Sequential limit (§F.3) involves taking the infinite channel limit in hidden layers in a sequence, starting from bottom (closest to inputs) layers and going upwards (to the outputs), i.e. n1,,nLn^{1}\to\infty,\dots,n^{L}\to\infty. Note that this approach in fact only gives intuition into constructing a GP using a NN architecture to define its covariance, and does not provide guarantees on actual convergence of large but finite Bayesian CNNs to GPs (which is of most practical interest), nor does it guarantee the existence of the specified GP on a given probability space. However, it has the following benefits:

Weak assumptions on the NN activation function ϕ\phi and on the distribution of the NN parameters.

The arguments can be easily extended to more complicated network architectures, e.g. architectures with max pooling, dropout, etc.

A straightforward and intuitive way to compute the covariance of the Gaussian process without diving into mathematical details.

Simultaneous limit (§F.4) considers growing the number of channels in hidden layers uniformly, i.e. min{n1,,nL}\min\left\{n^{1},\dots,n^{L}\right\}\to\infty. This approach establishes convergence of finite channel Bayesian CNNs to GPs and is thus a more practically relevant result. However, it makes stronger assumptions, and the proof is more involved.

We highlight that the GPs obtained by the two approaches are identical.

In both sections, we only provide the arguments for CNNs. It is straightforward (and in fact simpler) to extend them to LCNs and FCNs. Indeed, an FCN is a particular case of a CNN where the inputs and filters have singular spatial dimensions (d=1d=1, k=0k=0). For LCNs, the proof goes through in an identical fashion if we replace A\mathcal{A} with ALCN\mathcal{A}^{\text{LCN}} defined as [ALCN(K)]α,α(x,x)δα,α[A(K)]α,α(x,x)\left[\mathcal{A}^{\text{LCN}}\left(K\right)\right]_{\alpha,\alpha^{\prime}}\left(x,x^{\prime}\right)\equiv\delta_{\alpha,\alpha^{\prime}}\left[\mathcal{A}\left(K\right)\right]_{\alpha,\alpha^{\prime}}\left(x,x^{\prime}\right).

Probability space. Let P\mathscr{P} be a collection of countably many mutually independent random variables (R.V.s) defined on a probability space (Ω,F,P)(\Omega,\mathcal{F},P), where F\mathcal{F} is a product Borel σ\sigma-algebra and PP is the probability measure. Here PWBH\mathscr{P}\equiv\mathscr{W}\cup\mathscr{B}\cup\mathscr{H} is the collection of parameters used to define neural networks:

Place-holder. H\mathscr{H} is a place-holder to store extra (if needed) R.V.s , e.g. parameters coming from the final dense layer.

However, our results can be straightforwardly extended to a countably-infinite input indexing spaces X\mathcal{X} for certain topologies via an argument presented in Matthews et al. (2018a, section 2.2), allowing to infer weak convergence on X\mathcal{X} from convergence on any finite subset (which is the case we consider in this text; see also Billingsley (1999, page 19) for details). For this reason, as in Matthews et al. (2018a), weak convergence of countably-infinite stochastic processes will be considered with respect to the topology generated by the following metric:

Notation, shapes, and indexing. We adopt the notation, shape, and indexing convention similar to §2.1, which the reader is encouraged to review. We emphasize that whenever an index is omitted, the variable is assumed to contain all possible entries along the respective dimension (e.g. whole X\mathcal{X} if xx is omitted, or all nln^{l} channels if the channel ii is omitted).

F.2 Preliminary

We will use the following well-known theorem.

XnDXX_{n}\overset{\mathcal{D}}{\to}X (converges in distribution / converges weakly),

Using the equivalence between (i) and (iii), it is straightforward to show that

F.3 Sequential limit

In this section, we give intuition into constructing a Gaussian process using an infinite channel CNN with the limits taken sequentially. Informally, our argument amounts to showing that if the inputs zl1z^{l-1} to the layer ll are a multivariate Gaussian with covariance A(Kl)Inl\mathcal{A}\left(K^{l}\right)\otimes I_{n^{l}}, then it’s outputs zlz^{l} converge in distribution to a multivariate Gaussian with covariance A(Kl+1)Inl+1\mathcal{A}\left(K^{l+1}\right)\otimes I_{n^{l+1}} as nln^{l}\to\infty. This “allows” us to sequentially replace z1,z2,,zLz^{1},z^{2},\dots,z^{L} with their respective limiting Gaussian R.V.s as we take n1,n2,,nLn^{1}\to\infty,n^{2}\to\infty,\dots,n^{L}\to\infty. However, since each convergence only holds in distribution and does not guarantee the existence of the necessary GPs on a given probability space, what follows merely gives intuition into understanding the relationship between wide Bayesian neural networks and GPs (contrary to §F.4, which presents a rigorous convergence proof for min{n1,,nL}\min\left\{n^{1},\dots,n^{L}\right\}\to\infty).

Let Ψ1\Psi_{1} denote the space of functions with a uniformly bounded second moment, i.e. having

In what follows, we use the central limit theorem to prove (iii).

It is not difficult to see that uju_{j} are i.i.d. and qq is Gaussian. Then we can use the central limit theorem to conclude that the above converges in distribution to a Gaussian, once we verify the second moment of uju_{j} is finite. Using the fact that {ωij,βl}i,β\left\{\omega^{l}_{ij,\beta}\right\}_{i,\beta} is a collection of independent R.V.s, and integrating over these R.V.s first, we get

where by Aˉ\bar{\mathcal{A}} we denote the linear transformation part of A\mathcal{A} from Equation 7, i.e. A\mathcal{A} without the translation term σb2\sigma_{b}^{2}:

To prove finiteness of the second moment in Equation 44, it is sufficient to show that all the diagonal terms of KlK_{\infty}^{l} are finite. This easily follows from the assumption ϕΨ1\phi\in\Psi_{1} and the definition of Kl=(CA)l(K0)K_{\infty}^{l}=\left(\mathcal{C}\circ\mathcal{A}\right)^{l}\left(K^{0}\right). Together with the distribution of vv (whose covariance is straightforward to compute), the joint distribution of [zil,n]i[m]\left[z^{l,n}_{i}\right]_{i\in[m]} converges weakly to a mean zero Gaussian with covariance matrix A(Kl)Im\mathcal{A}\left(K_{\infty}^{l}\right)\otimes I_{m} by Theorem F.1 (Equation 28). ∎

The results of Theorem F.3 can be strengthened / extended in many directions.

The same result for a countably-infinite input index set X\mathcal{X} follows immediately according to the respective remark in §F.1.

The same analysis carries over (and the covariance matrix can be computed without much extra effort) if we stack a channel-wise deterministic affine transform after the convolution operator. Note that average pooling (not max pooling, which is not affine), global average pooling and the convolutional striding are particular examples of such affine transformations. Moreover, valid padding (i.e. no padding) convolution can be regarded as a subsampling operator (a linear projection) composed with the regular circular padding.

The same analysis applies to max pooling, but computing the covariance may require non-trivial effort. Let m\mathfrak{m} denote the max pooling operator and assume it is applied right after the activation function. The assumption {yil}i[n]={ϕ(zil1,)}i[n]\left\{y^{l}_{i}\right\}_{i\in[n]}=\left\{\phi(z^{l-1,\infty}_{i})\right\}_{i\in[n]} are i.i.d. implies {m(yil)}i[n]\left\{\mathfrak{m}\left(y^{l}_{i}\right)\right\}_{i\in[n]} are also i.i.d. Then we can proceed exactly as above except for verifying the finiteness of second moment of m(yil)\mathfrak{m}(y^{l}_{i}) with the following trivial estimate:

where ss is the window size of the max pooling.

In general, one can stack a channel-wise deterministic operator op\mathfrak{op} on {yil}\left\{y^{l}_{i}\right\} so long as the second moment of {op(yil)}\left\{\mathfrak{op}\left(y^{l}_{i}\right)\right\} is finite. One can also stack a stochastic operator (e.g. dropout), so long as the outputs are still channel-wisely i.i.d. and have finite second moments.

F.4 Simultaneous limit

Define a sequence of finite channel CNNs as follows:

This network induces a sequence of covariance matrices KtlK^{l}_{t} (which are R.V.s): for l0l\geq 0 and t0t\geq 0, for x,xXx,x^{\prime}\in\mathcal{X}

We make an extra assumption on the parameters.

Assumption: all R.V.s in W\mathscr{W} are Gaussian distributed.

Notation. Let PSDm\text{PSD}_{m} denote the set of m×mm\times m positive semi-definite matrices and for R1R\geq 1, define

and Ck(ϕ,R)C_{k}\left(\phi,R\right) (may equal \infty) denotes the uniform upper bound for the kk-th moment

Let Ψ\Psi denotes the space of measurable functions ϕ\phi with the following properties:

Uniformly bounded second moment: for every R1,R\geq 1, C2(ϕ,R)<C_{2}\left(\phi,R\right)<\infty.

Lipschitz continuity: for every R1R\geq 1, there exists β=β(ϕ,R)>0\beta=\beta\left(\phi,R\right)>0 such that for all Σ,Σ\Sigma,\Sigma^{\prime}\in PSD2(R){\rm PSD}_{2}(R),

Uniform convergence in probability: for every R1R\geq 1 and every ε>0\varepsilon>0 there exists a positive sequence ρn(ϕ,ε,R)\rho_{n}(\phi,\varepsilon,R) with ρn(ϕ,ε,R)0\rho_{n}(\phi,\varepsilon,R)\to 0 as nn\to\infty such that for every ΣPSD2(R)\Sigma\in{\rm PSD}_{2}(R) and any {(xi,yi)}i=1n i.i.d. N(0,Σ)\left\{(x_{i},y_{i})\right\}_{i=1}^{n}\textrm{ i.i.d. }\sim\mathcal{N}\left(0,\Sigma\right)

We will also use Ψ1,Ψ2\Psi_{1},\Psi_{2} and Ψ3\Psi_{3} to denote the spaces of measurable functions ϕ\phi satisfying properties 1, 2, and 3, respectively. It is not difficult to see that for every ii, Ψi\Psi_{i} is a vector space, and so is Ψ=iΨi\Psi=\cap_{i}\Psi_{i}.

We now prove our main result presented in §2.2.3 through the following three theorems.

If ϕ\phi is absolutely continuous and ϕ\phi^{\prime} is exponentially bounded then ϕΨ\phi\in\Psi.

If ϕΨ\phi\in\Psi, then for l0l\geq 0, KtlPKlK^{l}_{t}\overset{P}{\to}K^{l}_{\infty}.

If for l0l\geq 0, KtlPKlK^{l}_{t}\overset{P}{\to}K^{l}_{\infty} and m1m\geq 1, the joint distribution of [zjl,t]j[m]\left[z^{l,t}_{j}\right]_{j\in[m]} converges in distribution to a multivariate normal distribution with mean zero and covariance A(Kl)Im\mathcal{A}\left(K_{\infty}^{l}\right)\otimes I_{m}.

The proofs of Theorems F.4, F.5, and F.6 can be found in §F.7, §F.6, and §F.5 respectively. The proof of Theorem F.5 is slightly more technical and we will borrow some ideas from Daniely et al. (2016).

F.5 Proof of Theorem F.6

Note that conditioned on {yjl,t}j[nl(t)]\left\{y^{l,t}_{j}\right\}_{j\in\left[n^{l}(t)\right]}, the exponent in the above expression is just a linear combination of independent Gaussian R.V.s {ωij,βl,bil}\left\{\omega_{ij,\beta}^{l},\,b_{i}^{l}\right\}, which is also a Gaussian. Integrating out these R.V.s using the formula of the characteristic function of a Gaussian distribution yields

where we have used KtlPKlK^{l}_{t}\xrightarrow{P}K_{\infty}^{l} and Lipschitz continuity of the respective function in the vicinity of KlK_{\infty}^{l} in the last step. Therefore, Equation 64 is true by Theorem F.1 (iii). As in §F.3, the same result for a countably-infinite input index set X\mathcal{X} follows immediately according to the respective remark in §F.1.

We briefly comment how to handle the cases when stacking an average pooling, a subsampling or a dense layer after flattening the activations in the last layer.

implies that the empirical covariance of {Byjl,t}\left\{By^{l,t}_{j}\right\}

Vectorization and a dense layer. Let {ωij,αl}i[m],j[nl(t)],α[d]\left\{\omega^{l}_{ij,\alpha}\right\}_{i\in[m],j\in\left[n^{l}(t)\right],\alpha\in[d]} be the weights of the dense layer, ωij,αl\omega^{l}_{ij,\alpha} represents the weight connecting the α\alpha-th pixel of the jj-channel to the ii-th output. Note that the range of α\alpha is [d][d] not [±k][\pm k] because there is no weight sharing. Define the outputs to be

where trˉ\bar{\textrm{tr}} denotes the mean trace operator acting on the pixel by pixel matrix, i.e. the functional computing the mean of the diagonal terms of the pixel by pixel matrix. Therefore [fi]i[m]\left[f_{i}\right]_{i\in[m]} converges weakly to a mean zero Gaussian with covariance [σω2trˉ(Kl(x,x))+σb2]x,xXIm\left[\sigma_{\omega}^{2}\bar{\textrm{tr}}\left(K^{l}_{\infty}(x,x^{\prime})\right)+\sigma_{b}^{2}\right]_{x,x^{\prime}\in\mathcal{X}}\otimes I_{m} in the case of vectorization (§3.1).

F.6 Proof of Theorem F.5

We first note that the affine transform A\mathcal{A} is σω2\sigma_{\omega}^{2}-Lipschitz and property 2 of Ψ\Psi implies that the C\mathcal{C} operator is β\beta-Lipschitz (both w.r.t. w.r.t. \|\cdot\|_{\infty}). Indeed, if we consider

then [C(K)]α,α(x,x)=T(Σ)[\mathcal{C}\left(K\right)]_{\alpha,\alpha^{\prime}}\left(x,x^{\prime}\right)=\mathcal{T}_{\infty}(\Sigma). Thus CA\mathcal{C}\circ\mathcal{A} is σω2β\sigma_{\omega}^{2}\beta-Lipschitz.

We now prove the theorem by induction. Assume KtlPKlK^{l}_{t}\overset{P}{\to}K^{l}_{\infty} as tt\to\infty (obvious for l=0l=0).

Now let ε>0\varepsilon>0 be sufficiently small so that the ε2β\frac{\varepsilon}{2\beta}-neighborhood of A(Kl)\mathcal{A}(K^{l}_{\infty}) is contained in PSDXd(R){\rm PSD}_{\left|\mathcal{X}\right|d}(R), where we take RR to be large enough for KlK_{\infty}^{l} to be an interior point of PSDXd(R){\rm PSD}_{\left|\mathcal{X}\right|d}(R).Such RR always exists, because the diagonal terms of KlK_{\infty}^{l} are always non-zero. See (Long & Sedghi, 2019, Lemma 4) for proof.

to prove Ktl+1PKl+1K^{l+1}_{t}\overset{P}{\to}K_{\infty}^{l+1}, it suffices to show that for every δ>0\delta>0, there is a tt^{*} such that for all t>tt>t^{*},

By our induction assumption, there is a tlt^{l} such that for all t>tlt>t^{l}

Since CA\mathcal{C}\circ\mathcal{A} is σω2β\sigma_{\omega}^{2}\beta-Lipschitz, then

To bound the second term in Equation 80, let U(t)U\left(t\right) denote the event

and U(t)cU\left(t\right)^{c} its complement. For all t>tlt>t^{l} its probability is

i.e. the event that the inequality inside the braces holds. The fact

where the maximum is taken over all (x,x,α,α)X2×[d]\left(x,x^{\prime},\alpha,\alpha^{\prime}\right)\in\mathcal{X}^{2}\times\left[d\right].

Consider a fixed κPSDXd\kappa\in\text{PSD}_{\left|\mathcal{X}\right|d}, and define

a deterministic matrix in PSD2\text{PSD}_{2}. Then

where \left\{\left(z^{l,t}_{i,\alpha}(x),z^{l,t}_{i,\alpha^{\prime}}(x^{\prime})\right)\Big{|}K^{l}_{t}=\kappa\right\}_{i\in[n^{l+1}(t)]}\textrm{are i.i.d.}\sim\mathcal{N}\left(0,\Sigma\left(\kappa,\alpha,\alpha^{\prime},x,x^{\prime}\right)\right).

Then if Σ(κ,α,α,x,x)PSD2(R)\Sigma\left(\kappa,\alpha,\alpha^{\prime},x,x^{\prime}\right)\in\text{PSD}_{2}\left(R\right) we can apply property 3 of Ψ\Psi to conclude that:

However, if Σ(κ,α,α,x,x)∉PSD2(R)\Sigma\left(\kappa,\alpha,\alpha^{\prime},x,x^{\prime}\right)\not\in\text{PSD}_{2}\left(R\right), then necessarily A(κ)∉PSDXd(R)\mathcal{A}\left(\kappa\right)\not\in\text{PSD}_{\left|\mathcal{X}\right|d}\left(R\right) (since Σ(κ,α,α,x,x)PSD2\Sigma\left(\kappa,\alpha,\alpha^{\prime},x,x^{\prime}\right)\in\text{PSD}_{2}), ensuring that P(U(t)Ktl=κ)=0P\left(U\left(t\right)|K^{l}_{t}=\kappa\right)=0. Therefore Equation 92 holds for any κPSDXd\kappa\in\text{PSD}_{\left|\mathcal{X}\right|d}, and for any (x,x,α,α)X2×[d]2\left(x,x^{\prime},\alpha,\alpha^{\prime}\right)\in\mathcal{X}^{2}\times\left[d\right]^{2}.

We further remark that ρnl+1(t)(ϕ,R,ε2)\rho_{n^{l+1}(t)}\left(\phi,R,\frac{\varepsilon}{2}\right) is deterministic and does not depend on (κ,x,x,α,α)\left(\kappa,x,x^{\prime},\alpha,\alpha^{\prime}\right). Marginalizing out KtlK^{l}_{t} and maximizing over (x,x,α,α)\left(x,x^{\prime},\alpha,\alpha^{\prime}\right) in Equation 92 we conclude that

Since ρn(ϕ,R,ε2)0\rho_{n}\left(\phi,R,\frac{\varepsilon}{2}\right)\to 0 as nn\to\infty, there exists nn such that for any nl+1(t)nn^{l+1}(t)\geq n,

and, substituting this bound in Equation 88,

Therefore we just need to choose tl+1>tlt^{l+1}>t^{l} so that nl+1(t)nn^{l+1}(t)\geq n for all t>tl+1t>t^{l+1}. ∎

We list some directions to strengthen / extend the results of Theorem F.5 (and thus Theorem F.6) using the above framework.

where B\mathcal{B} is the linear operator on the covariance matrix induced by BB. Conditioned on KtlK^{l}_{t}, since the outputs after applying the linear operator BB are still i.i.d. Gaussian (and the property 3 is applicable), the analysis in the above proof can carry over with A\mathcal{A} replaced by BA\mathcal{B}\circ\mathcal{A}.

More generally, one may consider inserting an operator op\mathfrak{op} (e.g. max-pooling, dropout and more interestingly, normalization) in some hidden layer.

Gaussian prior on weights and biases might be relaxed to sub-Gaussian.

F.7 Proof of Theorem F.4

Note that absolutely continuous exponentially bounded functions contain all polynomials, and are closed under multiplication and integration in the sense that for any constant CC the function

is also exponentially bounded. Theorem F.4 is a consequence of the following lemma.

for k1k\geq 1, Ck(ϕ,R)<C_{k}\left(\phi,R\right)<\infty if ϕ\phi is exponentially bounded.

ϕΨ2\phi\in\Psi_{2} if ϕ\phi^{\prime} exists a.e. and is exponentially bounded.

ϕΨ3\phi\in\Psi_{3} if C4(ϕ,R)<C_{4}\left(\phi,R\right)<\infty.

Indeed, if ϕ\phi is absolutely continuous and ϕ\phi^{\prime} is exponentially bounded, then ϕ\phi is also exponentially bounded. By the above lemma, ϕΨ\phi\in\Psi.

1. We prove the first statement. Assume ϕ(x)aebx|\phi(x)|\leq ae^{b|x|}.

2. To prove the second statement, let Σ,ΣPSD2(R)\Sigma,\Sigma^{\prime}\in{\rm PSD}_{2}(R) and define AA (similarly for AA^{\prime}):

Then AAT=ΣAA^{T}=\Sigma (and AAT=ΣA^{\prime}A^{\prime T}=\Sigma^{\prime}). Let

Since ϕ\phi^{\prime} is exponentially bounded, ϕ\phi is also exponentially bounded due to being absolutely continuous. In addition, p(w2)f(w)2p\left(\left\|w\right\|_{2}\right)\left\|\nabla f(w)\right\|_{2} is exponentially bounded for any polynomial p(w2)p\left(\left\|w\right\|_{2}\right).

Applying the Mean Value Theorem (we use the notation \lesssim to hide the dependence on RR and other absolute constants)

Note that the operator norm is bounded by the infinity norm (up to a multiplicity constant) and w2f(A(t)w)2\left\|w\right\|_{2}\left\|\nabla f\left(A(t)w\right)\right\|_{2} is exponentially bounded. There is a constant aa (hidden in \lesssim) and bb such that the above is bounded by

In practice the 1/n1/n decay bound obtained by Chebyshev’s inequality in Equation 114 is often too weak to be useful. However, if ϕ\phi is linearly bounded, then one can obtain an exponential decay bound via the following concentration inequality:

If ϕ(x)a+bx|\phi(x)|\leq a+b|x| a.e., then there is an absolute constant c>0c>0 and a constant κ=κ(a,b,R)>0\kappa=\kappa(a,b,R)>0 such that property 3 (Equation 62) holds with

We postpone the proof of the following claim.

Assume ϕ(x)a+bx|\phi(x)|\leq a+b|x|. Then there is a κ=κ(a,b,R)\kappa=\kappa(a,b,R) such that for all ΣPSD2(R)\Sigma\in{\rm PSD}_{2}(R) and all p1p\geq 1,

Claim F.9 and the triangle inequality imply

We can apply Bernstein-type inequality (Vershynin, 2010, Lemma 5.16) to conclude that there is a c>0c>0 such that for every ΣPSD2(R)\Sigma\in{\rm PSD}_{2}(R) and any {(xi,yi)}i=1n i.i.d. N(0,Σ)\left\{(x_{i},y_{i})\right\}_{i=1}^{n}\textrm{ i.i.d. }\sim\mathcal{N}\left(0,\Sigma\right)

It remains to prove Claim F.9. For p1p\geq 1,

We applied Cauchy–Schwarz inequality in the first inequality, the triangle inequality in the second one, the fact Σ11,Σ22R\Sigma_{11},\Sigma_{22}\leq R in the third one, absolute moments estimate of standard Gaussian in the fourth one, where cc^{\prime} is a constant such that

Appendix G Experimental Setup

Throughout this work we only consider 3×33\times 3 (possibly unshared) convolutional filters with stride 11 and no dilation.

All inputs are normalized to have zero mean and unit variance, i.e. lie on the dd-dimensional sphere of radius d\sqrt{d}, where dd is the total dimensionality of the input.

All labels are treated as regression targets with zero mean. i.e. for a single-class classification problem with CC classes targets are CC-dimensional vectors with 1/C-1/C and (C1)/C(C-1)/C entries in incorrect and correct class indices respectively.

If a subset of a full dataset is considered for computational reasons, it is randomly selected in a balanced fashion, i.e. with each class having an equal number of samples. No data augmentation is used.

All experiments were implemented in Tensorflow (Abadi et al., 2016) and executed with the help of Vizier (Golovin et al., 2017).

All neural networks are trained using Adam (Kingma & Ba, 2015) minimizing the mean squared error loss.

We use a training and validation subsets of CIFAR10 of sizes 500500 and 40004000 respectively. All images are bilinearly downsampled to 8×88\times 8 pixels.

All models have 33 hidden layers with an erf\operatorname{erf} nonlinearity. No (valid\operatorname{valid}) padding is used.

Weight and bias variances are set to σω21.7562\sigma_{\omega}^{2}\approx 1.7562 and σb20.1841\sigma_{b}^{2}\approx 0.1841, corresponding to the pre-activation variance fixed point q=1q^{*}=1 (Poole et al., 2016) for the erf\operatorname{erf} nonlinearity.

NN training proceeds for 2192^{19} gradient updates, but aborts if no progress on training loss is observed for the last 100100 epochs. If the training loss does not reduce by at least 10410^{-4} for 2020 epochs, the learning rate is divided by 1010.

All computations are done with 3232-bit precision.

The following NN parameters are considered:Due to time and memory limitations certain large configurations could not be evaluated. We believe this did not impact the results of this work in a qualitative way.

Pooling: no pooling or a single global average pooling (averaging over spatial dimensions) before the final FC layer.

Number of channels: 2k2^{k} for kk from to 1212.

Initial learning rate: 10k10^{-k} for kk from to 1515.

Weight decay: and 10k10^{-k} for kk from to 88.

Batch size: 1010, 2525, 5050, 100100, 200200.

For NNs, all models are filtered to only 100%100\%-accurate ones on the training set and then for each configuration of {architecture, pooling, number of channels} the model with the lowest validation loss is selected among the configurations of {learning rate, weight decay, batch size}.

For GPs, the same CNN-GP is plotted against CNN and LCN networks without pooling. For LCN with pooling, inference was done with an appropriately rescaled CNN-GP kernel, i.e. (Kvecσb2)/d+σb2,\left(\mathcal{K}_{\infty}^{\textrm{vec}}-\sigma_{b}^{2}\right)/d+\sigma_{b}^{2}, where dd is the spatial size of the penultimate layer. For CNNs with pooling, a Monte Carlo estimate was computed (see §4) with n=212n=2^{12} filters and M=26M=2^{6} samples.

For GP inference, the initial diagonal regularization term applied to the training covariance matrix is 101010^{-10}; if the Cholesky decomposition fails, the regularization term is increased by a factor of 1010 until it either succeeds or reaches the value of 10510^{5}, at which point the trial is considered to have failed.

G.2 Monte Carlo Evaluation of Intractable GP Kernels

We use the same setup as in §G.1, but training and validation sets of sizes 20002000 and 40004000 respectively.

For MC-GPs we consider the number of channels nn (width in FCN setting) and number of NN instantiations MM to accept values of 2k2^{k} for kk from to 1010.

where K\mathcal{K}_{\infty} is substituted with K210,210K_{2^{10},2^{10}} for the CNN-GP pooling case (due to impracticality of computing the exact Kpool\mathcal{K}_{\infty}^{\textrm{pool}}). GPs are regularized in the same fashion as in §G.1, but the regularization factor starts at 10410^{-4} and ends at 101010^{10} and is multiplied by the mean of the training covariance diagonal.

G.3 Transforming a GP over spatial locations into a GP over classes

We use the same setup as in §G.2, but rescale the input images to size of 31×3131\times 31, so that at depth 1515 the spatial dimension collapses to a 1×11\times 1 patch if no padding is used (hence the curve of the CNN-GP without padding halting at that depth).

For MC-CNN-GP with pooling, we use samples of networks with n=16n=16 filters. Due to computational complexity we only consider depths up to 3131 for this architecture. The number of samples MM was selected independently for each depth among {2k}\left\{2^{k}\right\} for kk from to 1515 to maximize the validation accuracy on a separate 500500-points validation set. This allowed us to avoid the poor conditioning of the kernel. GPs are regularized in the same fashion as in §G.1, but for MLP-GP the multiplicative factor starts at 10410^{-4} and ends at 101010^{10}.

G.4 Relationship to Deep Signal Propagation

We use a training and validation subsets of CIFAR10 of sizes 500500 and 10001000 respectively.

We use the erf\operatorname{erf} nonlinearity. For CNN-GP, images are zero-padded (same\operatorname{same} padding) to maintain the spatial shape of the activations as they are propagated through the network.

Weight and bias variances (horizontal axis σω2\sigma_{\omega}^{2} and vertical axis σb2\sigma_{b}^{2} respectively) are sampled from a uniform grid of size 50×5050\times 50 on the range [0.1,5]×[0,2]\left[0.1,5\right]\times\left[0,2\right] including the endpoints.

All computations are done with 6464-bit precision. GPs are regularized in the same fashion as in §G.1, but the regularization factor is multiplied by the mean of the training covariance diagonal. If the experiment fails due to numerical reasons, 0.10.1 (random chance) validation accuracy is reported.

G.5 CNN-GP on full datasets

Relevant Figures 6 (a, c), 7 and Table 1.

We use full training, validation, and test sets of sizes 5000050000, 1000010000, and 1000010000 respectively for MNIST (LeCun et al., 1998) and Fashion-MNIST (Xiao et al., 2017a), 4500045000, 50005000, and 1000010000 for CIFAR10 (Krizhevsky, 2009). We use validation accuracy to select the best configuration for each model (we do not retrain on validation sets).

GPs are computed with 6464-bit precision, and NNs are trained with 3232-bit precision. GPs are regularized in the same fashion as in §G.4.

Zero-padding (same\operatorname{same}) is used.

Nonlinearity: erf\operatorname{erf} or ReLU\operatorname{ReLU}.

Depth: 2k2^{k} for kk from to 44 (and up to 252^{5} for MNIST and Fashion-MNIST datasets).

Weight and bias variances. For erf\operatorname{erf}: qq^{*} from {0.1,1,2,,8}\left\{0.1,1,2,\dots,8\right\}. For ReLU\operatorname{ReLU}: a fixed weight variance σω2=2+4e16\sigma_{\omega}^{2}=2+4e^{-16} and bias variance σb2\sigma_{b}^{2} from {0.1,1,2,,8}\left\{0.1,1,2,\dots,8\right\}.

Only one ReLU\operatorname{ReLU} (σω2=2\sigma_{\omega}^{2}=2, σb2=0.01\sigma_{b}^{2}=0.01), 8-layer CNN-GP with pooling model was evaluated using the Neural Tangents library (Novak et al., 2020) after the publication, and added to Figure 7 and Table 2 for completeness. The diagonal regularization factor 10k10^{-k} was selected on the validation set among {0,,19}\{0,\dots,19\} (resulting in k=3k=3).

On CIFAR10, we additionally train NNs for 2182^{18} gradient updates with a batch size of 128128 with corresponding parameters in addition toDue to time and compute limitations certain large configurations could not be evaluated. We believe this did not impact the results of this work in a qualitative way.

Pooling: no pooling or a single global average pooling (averaging over spatial dimensions) before the final FC layer (only for CNNs).

Number of channels or width: 2k2^{k} for kk from 11 to 99 (and up to 2102^{10} for CNNs with pooling in Figure 6, a).

Learning rate: 10k×216/(width×q)10^{-k}\times 2^{16}/\left(\textrm{width}\times q^{*}\right) for kk from 55 to 99, where width is substituted with the number of channels for CNNs and qq^{*} is substituted with σb2\sigma_{b}^{2} for ReLU\operatorname{ReLU} networks. “Small learning rate” in Table 1 refers to k{8,9}k\in\left\{8,9\right\}.

Weight decay: and 10k10^{-k} for kk from to 55.

For NNs, all models are filtered to only 100%100\%-accurate ones on the training set (expect for values in parentheses in Table 1). The reported values are then reported for models that achieve the best validation accuracy.

G.6 Model comparison on CIFAR10

We use the complete CIFAR10 dataset as described in §G.5 and consider 8-layer ReLU\operatorname{ReLU} models with weight and bias variances of σω2=2\sigma_{\omega}^{2}=2 and σb2=0.01\sigma_{b}^{2}=0.01. The number of channels / width is set to 252^{5}, 2102^{10} and 2122^{12} for LCN, CNN, and FCN respectively.

GPs are computed with 6464-bit precision, and NNs are trained with 3232-bit precision.

No padding (valid\operatorname{valid}) is used.

NN training proceeds for 2182^{18} gradient updates with batch size 6464, but aborts if no progress on training loss is observed for the last 1010 epochs. If the training loss does not reduce by at least 10410^{-4} for 22 epochs, the learning rate is divided by 1010.

Values for NNs are reported for the best validation accuracy over different learning rates (10k10^{-k} for kk from 22 to 1212) and weight decay values ( and 10k10^{-k} for kk from 22 to 77). For GPs, validation accuracy is maximized over initial diagonal regularization terms applied to the training convariance matrix: 10k×[mean of the diagonal]10^{-k}\times\textrm{[mean of the diagonal]} for kk among 22, 44 and 99 (if the cholesky decompisition fails, the regularization term is increased by a factor of 1010 until it succeeds or kk reaches the value of 1010).

CNN-GP with pooling performance was computed using the Neural Tangents library (Novak et al., 2020), after the publication, and added to the figure for completeness. The diagonal regularization factor 10k10^{-k} was selected on the validation set among {0,,19}\{0,\dots,19\} (resulting in k=5k=5).