Bayesian Deep Convolutional Networks with Many Channels are Gaussian Processes
Roman Novak, Lechao Xiao, Jaehoon Lee, Yasaman Bahri, Greg Yang, Jiri Hron, Daniel A. Abolafia, Jeffrey Pennington, Jascha Sohl-Dickstein
Introduction
Neural networks (NNs) demonstrate remarkable performance (He et al., 2016; Oord et al., 2016; Silver et al., 2017; Vaswani et al., 2017), but are still only poorly understood from a theoretical perspective (Goodfellow et al., 2015; Choromanska et al., 2015; Pascanu et al., 2014; Zhang et al., 2017). NN performance is often motivated in terms of model architectures, initializations, and training procedures together specifying biases, constraints, or implicit priors over the class of functions learned by a network. This induced structure in learned functions is believed to be well matched to structure inherent in many practical machine learning tasks, and in many real-world datasets. For instance, properties of NNs which are believed to make them well suited to modeling the world include: hierarchy and compositionality (Lin et al., 2017; Poggio et al., 2017), Markovian dynamics (Tiňo et al., 2004; 2007), and equivariances in time and space for RNNs (Werbos, 1988) and CNNs (Fukushima & Miyake, 1982; Rumelhart et al., 1985) respectively.
The recent discovery of an equivalence between deep neural networks and GPs (Lee et al., 2018; Matthews et al., 2018b) allow us to express an analytic form for the prior over functions encoded by deep NN architectures and initializations. This transforms an implicit prior over functions into an explicit prior, which can be analytically interrogated and reasoned about.While there is broad literature on empirical interpretation of finite CNNs (Zeiler & Fergus, 2014; Simonyan et al., 2014; Long et al., 2014; Olah et al., 2017), it is commonly only applicable to fully trained networks.
Previous work studying these Neural Network-equivalent Gaussian Processes (NN-GPs) has established the correspondence only for fully connected networks (FCNs). Additionally, previous work has not used analysis of NN-GPs to gain specific insights into the equivalent NNs.
In the present work, we extend the equivalence between NNs and NN-GPs to deep Convolutional Neural Networks (CNNs), both with and without pooling. CNNs are a particularly interesting architecture for study, since they are frequently held forth as a success of motivating NN design based on invariances and equivariances of the physical world (Cohen & Welling, 2016) – specifically, designing a NN to respect translation equivariance (Fukushima & Miyake, 1982; Rumelhart et al., 1985). As we will see in this work, absent pooling, this quality of equivariance has no impact in the Bayesian treatment of the infinite channel (number of convolutional filters) limit (Figure 2).
The specific novel contributions of the present work are:
We show analytically that CNNs with many channels, trained in a fully Bayesian fashion, correspond to an NN-GP (§2, §3). We show this for CNNs both with and without pooling, with arbitrary convolutional striding, and with both and padding. We prove convergence as the number of channels in hidden layers approach infinity simultaneously (i.e. , see §F.4 for details), extending the result of Matthews et al. (2018a) under mild conditions on the nonlinearity derivative. Our results also provide a rigorous proof of the assumption made in Xiao et al. (2018) that pre-activations in hidden layers are i.i.d. Gaussian.
We show that in the absence of pooling, the NN-GP for a CNN and a Locally Connected Network (LCN) are identical (Figure 2, §5.1). An LCN has the same local connectivity pattern as a CNN, but without weight sharing or translation equivariance.
We experimentally compare trained CNNs and LCNs and find that under certain conditions both perform similarly to the respective NN-GP (Figure 6, b, c). Moreover, both architectures tend to perform better with increased channel count, suggesting that similarly to FCNs (Neyshabur et al., 2015; Novak et al., 2018) CNNs benefit from overparameterization (Figure 6, a, b), corroborating a similar trend observed in Canziani et al. (2016, Figure 2). However, we also show that careful tuning of hyperparameters allows finite CNNs trained with SGD to outperform their corresponding NN-GPs by a significant margin. We experimentally disentangle and quantify the contributions stemming from local connectivity, equivariance, and invariance in a convolutional model in one such setting (Figure 2).
We introduce a Monte Carlo method to compute NN-GP kernels for situations (such as CNNs with pooling) where evaluating the NN-GP is otherwise computationally infeasible (§4).
We stress that we do not evaluate finite width Bayesian networks nor do we make any claims about their performance relative to the infinite width GP studied here or finite width SGD-trained networks. While this is an interesting subject to pursue (see Matthews et al. (2018a); Neal (1994)), it is outside of the scope of this paper.
In early work on neural network priors, Neal (1994) demonstrated that, in a fully connected network with a single hidden layer, certain natural priors over network parameters give rise to a Gaussian process prior over functions when the number of hidden units is taken to be infinite. Follow-up work extended the conditions under which this correspondence applied (Williams, 1997; Le Roux & Bengio, 2007; Hazan & Jaakkola, 2015). An exactly analogous correspondence for infinite width, finite depth deep fully connected networks was developed recently in Lee et al. (2018); Matthews et al. (2018b), with Matthews et al. (2018a) extending the convergence guarantees from to any linearly bounded nonlinearities and monotonic width growth rates. In this work we further relax the conditions to absolutely continuous nonlinearities with exponentially bounded derivative and any width growth rates.
The line of work examining signal propagation in random deep networks (Poole et al., 2016; Schoenholz et al., 2017; Yang & Schoenholz, 2017; 2018; Hanin & Rolnick, 2018; Chen et al., 2018; Yang et al., 2018) is related to the construction of the GPs we consider. They apply a mean field approximation in which the pre-activation signal is replaced with a Gaussian, and the derivation of the covariance function with depth is the same as for the kernel function of a corresponding GP. Recently, Xiao et al. (2017b; 2018) extended this to convolutional architectures without pooling. Xiao et al. (2018) also analyzed properties of the convolutional kernel at large depths to construct a phase diagram which will be relevant to NN-GP performance, as discussed in §C.
Compositional kernels coming from wide convolutional and fully connected layers also appeared outside of the GP context. Cho & Saul (2009) derived closed-form compositional kernels for rectified polynomial activations (including ). Daniely et al. (2016) proved approximation guarantees between a network and its corresponding kernel, and show that empirical kernels will converge as the number of channels increases.
There is a line of work considering stacking of GPs, such as deep GPs (Lawrence & Moore, 2007; Damianou & Lawrence, 2013). These no longer correspond to GPs, though they can describe a rich class of probabilistic models beyond GPs. Alternatively, deep kernel learning (Wilson et al., 2016b; a; Bradshaw et al., 2017) utilizes GPs with base kernels which take in features produced by a deep neural network (often a CNN), and train the resulting model end-to-end. Finally, van der Wilk et al. (2017) incorporates convolutional structure into GP kernels, with follow-up work stacking multiple such GPs (Kumar et al., 2018; Blomqvist et al., 2018) to produce a deep convolutional GP (which is no longer a GP). Our work differs from all of these in that our GP corresponds exactly to a fully Bayesian CNN in the infinite channel limit, when all layers are taken to be of infinite size. We remark that while alternative models, such as deep GPs, do include infinite-sized layers in their construction, they do not treat all layers in this way – for instance, through insertion of bottleneck layers which are kept finite. While it remains to be seen exactly which limit is applicable for understanding realistic CNN architectures in practice, the limit we consider is natural for a large class of CNNs, namely those for which all layers sizes are large and rather comparable in size. Deep GPs, on the other hand, correspond to a potentially richer class of models, but are difficult to analytically characterize and suffer from higher inference cost.
Borovykh (2018) analyzes the convergence of CNN outputs at different spatial locations (or different timepoints for a temporal CNN) to a GP for a single input example. Thus, while they also consider a GP limit (and perform an experiment comparing posterior GP predictions to an SGD-trained CNN), they do not address the dependence of network outputs on multiple input examples, and thus their model is unable to generate predictions on a test set consisting of new input examples.
In concurrent work, Garriga-Alonso et al. (2018) derive an NN-GP kernel equivalent to one of the kernels considered in our work. In addition to explicitly specifying kernels corresponding to pooling and vectorizing, we also compare the NN-GP performance to finite width SGD-trained CNNs and analyze the differences between the two models.
Many-channel Bayesian CNNs are Gaussian processes
General setup. For simplicity of presentation we consider 1D convolutional networks with circularly-padded activations (identically to Xiao et al. (2018)). Unless specified otherwise, no pooling anywhere in the network is used. If a model (NN or GP) is mentioned explicitly as “with pooling”, it always corresponds to a single global average pooling layer at the top. Our analysis is straightforward to extend to higher dimensions, using zero () or no () padding, strided convolutions, and pooling in intermediary layers (§D). We consider a series of convolutional layers, .
Random weights and biases. The parameters of the network are the convolutional filters and biases, and , respectively, with outgoing (incoming) channel index () and filter relative spatial location .We will use Roman letters to index channels and Greek letters for spatial location. We use letters , etc to denote channel indices, , etc to denote spatial indices and , etc for filter indices. Assume a Gaussian prior on both the filter weights and biases,
The weight and bias variances are , respectively. is the number of channels (filters) in layer , is the filter size, and is the fraction of the receptive field variance at location (with ). In experiments we utilize uniform , but nonuniform should enable kernel properties that are better suited for ultra-deep networks, as in Xiao et al. (2018).
Activation covariance. A recurring quantity in this work will be the empirical uncentered covariance matrix of the activations , defined as
is a random variable indexed by two inputs and two spatial locations (the dependence on layer widths , as well as weights and biases, is implied and by default not stated explicitly). , the empirical uncentered covariance of inputs, is deterministic.
Shapes and indexing. Whenever an index is omitted, the variable is assumed to contain all possible entries along the respective dimension. For example, is a vector of size , is a matrix of shape , and is a vector of size .
Our work concerns proving that the top-layer pre-activations converge in distribution to an -variate normal random vector with a particular covariance matrix of shape as . We emphasize that only the channels in hidden layers are taken to infinity, and , the number of channels in the top-layer pre-activations , remains fixed. For convergence proofs, we always consider , , as well as any of their indexed subsets like , to be 1D vector random variables, while , as well as any of its indexed subsets (when applicable, e.g. , ) to be 2D matrix random variables.
2 Correspondence between Gaussian processes and Bayesian deep CNNs with infinitely many channels
We next consider the prior over outputs computed by a CNN in the limit of infinitely many channels in the hidden (excluding input and output) layers, , and derive its equivalence to a GP with a compositional kernel. This section outlines an argument showing that is normally distributed conditioned on previous layer activation covariance , which itself becomes deterministic in the infinite limit. This allows to conclude convergence in distribution of the outputs to a Gaussian with the respective deterministic covariance limit. This section omits many technical details elaborated in §F.4.
As can be seen in Equation 4, the pre-activations are a linear transformation of the multivariate Gaussian , specified by the previous layer’s activations . A linear transformation of a multivariate Gaussian is itself a Gaussian with a covariance matrix that can be derived straightforwardly. Specifically,
where is an identity matrix, and is the covariance of the pre-activations and is derived in Xiao et al. (2018). Precisely, is an affine transformation (a cross-correlation operator followed by a shifting operator) on the space of positive semi-definite matrices defined as follows:
preserves positive semi-definiteness due to Equation 6. Notice that the covariance matrix in Equation 6 is block diagonal due to the fact that separate channels are i.i.d. conditioned on , due to i.i.d. weights and biases .
We further remark that per Equation 6 the normal distribution of only depends on , hence the random variable has the same distribution by the law of total expectation:
2.2 Activation covariance matrix becomes deterministic with increasing channel count
It follows from Equation 8 that the summands in Equation 5 are i.i.d. conditioned on fixed . Subject to weak restrictions on the nonlinearity , we can apply the weak law of large numbers and conclude that the covariance matrix becomes deterministic in the infinite channel limit in layer (note that pre-activations remain stochastic). Precisely,
where is defined for any PSD matrix as
The decoupling of the kernel “propagation” into and is highly convenient since is a simple affine transformation of the kernel (see Equation 7), and is a well-studied map in literature (see §G.4), and for nonlinearities such as (Nair & Hinton, 2010) and the error function () can be computed in closed form as derived in Cho & Saul (2009) and Williams (1997) respectively. We refer the reader to Xiao et al. (2018, Lemma A.1) for complete derivation of the limiting value in Equation 9.
A less obvious result is that, under slightly stronger assumptions on , the top-layer activation covariance becomes unconditionally (dependence on observed deterministic inputs is implied) deterministic as channels in all hidden layers grow to infinity simultaneously:
i.e. is applied times to , the deterministic input covariance. We prove this in §F.4 (Theorem F.5). See Figure 3 for a depiction of the correspondence between neural networks and their infinite width limit covariances .
2.3 A conditionally normal random variable becomes normal if its covariance becomes deterministic
For more intuition behind Equation 12 and an informal proof please consult §F.3.
Transforming a GP over spatial locations into a GP over classes
In §2.2 we showed that in the infinite channel limit a deep CNN is a GP indexed by input samples, output spatial locations, and output channel indices. Further, its covariance matrix can be computed in closed form. Here we show that transformations to obtain class predictions that are common in CNN classifiers can be represented as either vectorization or projection (as long as we treat classification as regression (see §G), identically to Lee et al. (2018)). Both of these operations preserve the GP equivalence and allow the computation of the covariance matrix of the respective GP (now indexed by input samples and target classes) as a simple transformation of .
It is easy to verify that is a multivariate Gaussian with covariance:
The argument of §2.2 can then be extended to conclude that in the limit of infinite width converges in distribution to a multivariate Gaussian (i.i.d. for each class ) with covariance
A sample 2D CNN using this readout strategy is depicted in Figure 2, and a sample correspondence between a FCN, FCN-GP, CNN, and CNN-GP is depicted in Figure 3.
Note that as observed in Xiao et al. (2018), to compute the summands in Equation 15, one needs only the corresponding terms . Consequently, we only need to compute and the memory cost is (or per covariance entry in an iterative or distributed setting). Note that this approach ignores pixel-pixel covariances and produces a GP corresponding to a locally connected network (see §5.1).
2 Projection
Global average pooling: take . Then
This approach corresponds to applying global average pooling right after the last convolutional layer. Spatially local average pooling in intermediary layers can be constructed in a similar fashion (§D). We focus on global average pooling in this work to more effectively isolate the effects of pooling from other aspects of the model like local connectivity or equivariance. This approach takes all pixel-pixel covariances into consideration and makes the kernel translation invariant. However, it requires memory to compute the sample-sample covariance of the GP (or per covariance entry in an iterative or distributed setting). It is impractical to use this method to analytically evaluate the GP, and we propose to use a Monte Carlo approach (see §4).
Subsampling one particular pixel: take ,
This approach makes use of only one pixel-pixel covariance, and requires the same amount of memory as vectorization (§3.1) to compute.
We compare the performance of presented strategies in Figure 4. Note that all described strategies admit stacking additional FC layers on top while retaining the GP equivalence, using a derivation analogous to §2.2 (Lee et al., 2018; Matthews et al., 2018b).
Monte Carlo evaluation of intractable GP kernels
We introduce a Monte Carlo estimation method for NN-GP kernels which are computationally impractical to compute analytically, or for which we do not know the analytic form. Similar in spirit to traditional random feature methods (Rahimi & Recht, 2007), the core idea is to instantiate many random finite width networks and use the empirical uncentered covariances of activations to estimate the Monte Carlo-GP (MC-GP) kernel,
where consists of draws of the weights and biases from their prior distribution, , and is the width or number of channels in hidden layers. The MC-GP kernel converges to the analytic kernel with increasing width, in probability.
In a non-distributed setting, the MC-GP reduces the memory requirements to compute from to , making the evaluation of CNN-GPs with pooling practical.
Discussion
Locally Connected Networks (LCNs) (Fukushima, 1975; Lecun, 1989) are CNNs without weight sharing between spatial locations. LCNs preserve the connectivity pattern, and thus topology, of a CNN. However, they do not possess the equivariance property of a CNN – if an input is translated, the latent representation in an LCN will be completely different, rather than also being translated.
The CNN-GP predictions without spatial pooling in §3.1 and §3.2.2 depend only on sample-sample covariances, and do not depend on pixel-pixel covariances. LCNs destroy pixel-pixel covariances: , for and all and . However, LCNs preserve the covariances between input examples at every pixel: . As a result, in the absence of pooling, LCN-GPs and CNN-GPs are identical. Moreover, LCN-GPs with pooling are identical to CNN-GPs with vectorization of the top layer (under suitable scaling of ). We confirm these findings experimentally in trained networks in the limit of large width in Figures 2 and 6 (b), as well as by demonstrating convergence of MC-GPs of the respective architectures to the same CNN-GP (modulo scaling of ) in Figures 5 and 10.
2 Pooling leverages equivariance to provide invariance
The only kernel leveraging pixel-pixel covariances is that of the CNN-GP with pooling. This enables the predictions of this GP and the corresponding CNN to be invariant to translations (modulo edge effects) – a beneficial quality for an image classifier. We observe strong experimental evidence supporting the benefits of invariance throughout this work (Figures 2, 4, 5, 6 (b); Table 1), in both CNNs and CNN-GPs.
3 Finite channel SGD-trained CNNs can outperform infinite channel Bayesian CNNs, in the absence of pooling
In the absence of pooling, the benefits of equivariance and weight sharing are more challenging to explain in terms of Bayesian priors on class predictions (since without pooling equivariance is not a property of the outputs, but only of intermediary representations). Indeed, in this work we find that the performance of finite width SGD-trained CNNs often approaches that of their CNN-GP counterpart (Figure 6, b, c),This observation is conditioned on the respective NN fitting the training set to . Underfitting breaks the correspondence to an NN-GP, since train set predictions of such a network no longer correspond to the true training labels. Properly tuned underfitting often also leads to better generalization (Table 1). suggesting that in those cases equivariance does not play a beneficial role in SGD-trained networks.
However, as can be seen in Figures 2, 6 (c), 7, and Table 1, the best CNN overall outperforms the best CNN-GP by a significant margin – an observation specific to CNNs and not FCNs or LCNs.Performing an analogous large-scale comparison between CNNs and CNN-GPs with pooling was not computationally feasible, and only one CNN-GP model was evaluated.footnote 3 Their relative performance remains an interesting open question for future research. We observe this gap in performance especially in the case of networks trained with a large learning rate. In Figure 2 we demonstrate this large gap in performance by evaluating different models with equivalent architecture and hyperparameter settings, chosen for good SGD-trained CNN performance.
We conjecture that equivariance, a property lacking in LCNs and the Bayesian treatment of the infinite channel CNN limit, contributes to the performance of SGD-trained finite channel CNNs with the correct settings of hyperparameters.While this is consistent with the results of Lecun (1989), in a different setting Bartunov et al. (2018) found equivariance to not provide a substantial performance gain. Nonetheless, more work is needed to disentangle and quantify the separate contributions of stochastic optimization and finite width effects,We remark that concerns about GPs not being able to learn hierarchical representations have been raised in the literature (Matthews et al., 2018a, Section 7), (Neal, 1995, Chapter 5), (MacKay, 2003, Section 45.7) However, practical impact of these assertions have not been extensively investigated empirically or theoretically, and we hope that our work stimulates research in this direction. and differences in performance between CNNs with weight sharing and their corresponding CNN-GPs.
Conclusion
In this work we have derived a Gaussian process that corresponds to fully Bayesian multi-layer CNNs with infinitely many channels. The covariance of this GP can be efficiently computed either in closed form or by using Monte Carlo sampling, depending on the architecture.
The CNN-GP achieves state of the art results for GPs without trainable kernels on CIFAR10. It can perform competitively with CNNs (that fit the training set) of equivalent architecture and weight priors, which makes it an appealing choice for small datasets, as it eliminates all training-related hyperparameters. However, we found that the best overall performance, at least in the absence of pooling, is achieved by finite SGD-trained CNNs and not by their infinite Bayesian counterparts. We hope our work stimulates future research towards understanding the distribution over functions induced by model architecture and training approach, and what aspects of this distribution are important for model performance. Another natural extension of our work is the study of other deep learning architectures in the infinitely wide limit. After the publication of this paper, Yang (2019) devised a unifying framework proving the GP convergence for even more models (such as ones using batch normalization, (self-)attention, LSTM) with slightly different assumptions on the nonlinearity.
Acknowledgements
We thank Sam Schoenholz, Vinay Rao, Daniel Freeman, Qiang Zeng, and Phil Long for frequent discussion and feedback on preliminary results.
References
Appendix A Glossary
We use the following shorthands in this work:
LCN - locally connected network, a.k.a. convolutional network without weight sharing;
FCN - fully connected network, a.k.a. multilayer perceptron (MLP);
X-GP - a GP equivalent to a Bayesian infinitely wide neural network of architecture X (§2).
MC-(X-)-GP - a Monte Carlo estimate (§4) of the X-GP.
Width, (number of) filters, (number of) channels represent the same property for CNNs and LCNs.
Pooling - referring to architectures as “with” or “without pooling” means having a single global average pooling layer (collapsing the spatial dimensions of the activations ) before the final linear FC layer giving the regression outputs .
Invariance and equivariance are always discussed w.r.t. translations in the spatial dimensions of the inputs.
Appendix B Additional figures
Appendix C Relationship to Deep Signal Propagation
The recurrence relation linking the GP kernel at layer to that of layer following from Equation 11 (i.e. ) is precisely the covariance map examined in a series of related papers on signal propagation (Xiao et al., 2018; Poole et al., 2016; Schoenholz et al., 2017; Lee et al., 2018) (modulo notational differences; denoted as , or e.g. in Xiao et al. (2018)). In those works, the action of this map on hidden-state covariance matrices was interpreted as defining a dynamical system whose large-depth behavior informs aspects of trainability. In particular, as , , i.e. the covariance approaches a fixed point . The convergence to a fixed point is problematic for learning because the hidden states no longer contain information that can distinguish different pairs of inputs. It is similarly problematic for GPs, as the kernel becomes pathological as it approaches a fixed point. Precisely, in both chaotic and ordered regimes, outputs of the GP become asymptotically identically correlated. Either of these scenarios captures no information about the training data in the kernel and makes learning infeasible.
This problem can be ameliorated by judicious hyperparameter selection, which can reduce the rate of exponential convergence to the fixed point. For hyperpameters chosen on a critical line separating two untrainable phases, the convergence rates slow to polynomial, and very deep networks can be trained, and inference with deep NN-GP kernels can be performed – see Figure 11.
Appendix D Strided convolutions, average pooling in intermediate layers, higher dimensions
One can easily see that \left\{Bz_{j}^{l}\big{|}K^{l}\right\}_{j} are i.i.d. multivariate Gaussian as well.
Average pooling. Average pooling with stride and window size is equivalent to choosing for and .
ND convolutions. Note that our analysis in the main text (1D) easily extends to higher-dimensional convolutions by replacing integer pixel indices and sizes with tuples (see also Figure 2). In Equation 4 values would have to span the hypercube in the pre-activation definition. Similarly, in §3 the normalizing factor () should be the product (squared) of its entries, and summations over should span the hypercube as well. The definition of the kernel propagation operator in Equation 7 will remain exactly the same, so long as is summed over the hypercube, and the variance weights remain respectively normalized .
Appendix E Review of exact Bayesian regression with GPs
Our discussion in the paper has focused on model priors. A crucial benefit we derive by mapping to a GP is that Bayesian inference is straightforward to implement and can be done exactly for regression (Rasmussen & Williams, 2006, chapter 2), requiring only simple linear algebra. Let denote training inputs , training targets, and collectively for the training set. The integral over the posterior can be evaluated analytically to give a posterior predictive distribution on a test point which is Gaussian, , with
We use the shorthand to denote the matrix formed by evaluating the GP covariance on the training inputs, and likewise is a -length vector formed from the covariance between the test input and training inputs. Computationally, the costly step in GP posterior predictions comes from the matrix inversion, which in all experiments were carried out exactly, and typically scales as (though algorithms scaling as exist for sufficiently large matrices). Nonetheless, there is a broad literature on approximate Bayesian inference with GPs which can be utilized for efficient implementation (Rasmussen & Williams, 2006, chapter 8); (Quiñonero-Candela & Rasmussen, 2005; Titsias, 2009).
Appendix F Equivalence between randomly initialized NNs and GPs
In this section, we present two different approaches, the sequential limit (§F.3) and simultaneous limit (§F.4), to illustrate the relationship between many-channels Bayesian CNNs and GPs.
Sequential limit (§F.3) involves taking the infinite channel limit in hidden layers in a sequence, starting from bottom (closest to inputs) layers and going upwards (to the outputs), i.e. . Note that this approach in fact only gives intuition into constructing a GP using a NN architecture to define its covariance, and does not provide guarantees on actual convergence of large but finite Bayesian CNNs to GPs (which is of most practical interest), nor does it guarantee the existence of the specified GP on a given probability space. However, it has the following benefits:
Weak assumptions on the NN activation function and on the distribution of the NN parameters.
The arguments can be easily extended to more complicated network architectures, e.g. architectures with max pooling, dropout, etc.
A straightforward and intuitive way to compute the covariance of the Gaussian process without diving into mathematical details.
Simultaneous limit (§F.4) considers growing the number of channels in hidden layers uniformly, i.e. . This approach establishes convergence of finite channel Bayesian CNNs to GPs and is thus a more practically relevant result. However, it makes stronger assumptions, and the proof is more involved.
We highlight that the GPs obtained by the two approaches are identical.
In both sections, we only provide the arguments for CNNs. It is straightforward (and in fact simpler) to extend them to LCNs and FCNs. Indeed, an FCN is a particular case of a CNN where the inputs and filters have singular spatial dimensions (, ). For LCNs, the proof goes through in an identical fashion if we replace with defined as .
Probability space. Let be a collection of countably many mutually independent random variables (R.V.s) defined on a probability space , where is a product Borel -algebra and is the probability measure. Here is the collection of parameters used to define neural networks:
Place-holder. is a place-holder to store extra (if needed) R.V.s , e.g. parameters coming from the final dense layer.
However, our results can be straightforwardly extended to a countably-infinite input indexing spaces for certain topologies via an argument presented in Matthews et al. (2018a, section 2.2), allowing to infer weak convergence on from convergence on any finite subset (which is the case we consider in this text; see also Billingsley (1999, page 19) for details). For this reason, as in Matthews et al. (2018a), weak convergence of countably-infinite stochastic processes will be considered with respect to the topology generated by the following metric:
Notation, shapes, and indexing. We adopt the notation, shape, and indexing convention similar to §2.1, which the reader is encouraged to review. We emphasize that whenever an index is omitted, the variable is assumed to contain all possible entries along the respective dimension (e.g. whole if is omitted, or all channels if the channel is omitted).
F.2 Preliminary
We will use the following well-known theorem.
(converges in distribution / converges weakly),
Using the equivalence between (i) and (iii), it is straightforward to show that
F.3 Sequential limit
In this section, we give intuition into constructing a Gaussian process using an infinite channel CNN with the limits taken sequentially. Informally, our argument amounts to showing that if the inputs to the layer are a multivariate Gaussian with covariance , then it’s outputs converge in distribution to a multivariate Gaussian with covariance as . This “allows” us to sequentially replace with their respective limiting Gaussian R.V.s as we take . However, since each convergence only holds in distribution and does not guarantee the existence of the necessary GPs on a given probability space, what follows merely gives intuition into understanding the relationship between wide Bayesian neural networks and GPs (contrary to §F.4, which presents a rigorous convergence proof for ).
Let denote the space of functions with a uniformly bounded second moment, i.e. having
In what follows, we use the central limit theorem to prove (iii).
It is not difficult to see that are i.i.d. and is Gaussian. Then we can use the central limit theorem to conclude that the above converges in distribution to a Gaussian, once we verify the second moment of is finite. Using the fact that is a collection of independent R.V.s, and integrating over these R.V.s first, we get
where by we denote the linear transformation part of from Equation 7, i.e. without the translation term :
To prove finiteness of the second moment in Equation 44, it is sufficient to show that all the diagonal terms of are finite. This easily follows from the assumption and the definition of . Together with the distribution of (whose covariance is straightforward to compute), the joint distribution of converges weakly to a mean zero Gaussian with covariance matrix by Theorem F.1 (Equation 28). ∎
The results of Theorem F.3 can be strengthened / extended in many directions.
The same result for a countably-infinite input index set follows immediately according to the respective remark in §F.1.
The same analysis carries over (and the covariance matrix can be computed without much extra effort) if we stack a channel-wise deterministic affine transform after the convolution operator. Note that average pooling (not max pooling, which is not affine), global average pooling and the convolutional striding are particular examples of such affine transformations. Moreover, valid padding (i.e. no padding) convolution can be regarded as a subsampling operator (a linear projection) composed with the regular circular padding.
The same analysis applies to max pooling, but computing the covariance may require non-trivial effort. Let denote the max pooling operator and assume it is applied right after the activation function. The assumption are i.i.d. implies are also i.i.d. Then we can proceed exactly as above except for verifying the finiteness of second moment of with the following trivial estimate:
where is the window size of the max pooling.
In general, one can stack a channel-wise deterministic operator on so long as the second moment of is finite. One can also stack a stochastic operator (e.g. dropout), so long as the outputs are still channel-wisely i.i.d. and have finite second moments.
F.4 Simultaneous limit
Define a sequence of finite channel CNNs as follows:
This network induces a sequence of covariance matrices (which are R.V.s): for and , for
We make an extra assumption on the parameters.
Assumption: all R.V.s in are Gaussian distributed.
Notation. Let denote the set of positive semi-definite matrices and for , define
and (may equal ) denotes the uniform upper bound for the -th moment
Let denotes the space of measurable functions with the following properties:
Uniformly bounded second moment: for every .
Lipschitz continuity: for every , there exists such that for all ,
Uniform convergence in probability: for every and every there exists a positive sequence with as such that for every and any
We will also use and to denote the spaces of measurable functions satisfying properties 1, 2, and 3, respectively. It is not difficult to see that for every , is a vector space, and so is .
We now prove our main result presented in §2.2.3 through the following three theorems.
If is absolutely continuous and is exponentially bounded then .
If , then for , .
If for , and , the joint distribution of converges in distribution to a multivariate normal distribution with mean zero and covariance .
The proofs of Theorems F.4, F.5, and F.6 can be found in §F.7, §F.6, and §F.5 respectively. The proof of Theorem F.5 is slightly more technical and we will borrow some ideas from Daniely et al. (2016).
F.5 Proof of Theorem F.6
Note that conditioned on , the exponent in the above expression is just a linear combination of independent Gaussian R.V.s , which is also a Gaussian. Integrating out these R.V.s using the formula of the characteristic function of a Gaussian distribution yields
where we have used and Lipschitz continuity of the respective function in the vicinity of in the last step. Therefore, Equation 64 is true by Theorem F.1 (iii). As in §F.3, the same result for a countably-infinite input index set follows immediately according to the respective remark in §F.1.
We briefly comment how to handle the cases when stacking an average pooling, a subsampling or a dense layer after flattening the activations in the last layer.
implies that the empirical covariance of
Vectorization and a dense layer. Let be the weights of the dense layer, represents the weight connecting the -th pixel of the -channel to the -th output. Note that the range of is not because there is no weight sharing. Define the outputs to be
where denotes the mean trace operator acting on the pixel by pixel matrix, i.e. the functional computing the mean of the diagonal terms of the pixel by pixel matrix. Therefore converges weakly to a mean zero Gaussian with covariance in the case of vectorization (§3.1).
F.6 Proof of Theorem F.5
We first note that the affine transform is -Lipschitz and property 2 of implies that the operator is -Lipschitz (both w.r.t. w.r.t. ). Indeed, if we consider
then . Thus is -Lipschitz.
We now prove the theorem by induction. Assume as (obvious for ).
Now let be sufficiently small so that the -neighborhood of is contained in , where we take to be large enough for to be an interior point of .Such always exists, because the diagonal terms of are always non-zero. See (Long & Sedghi, 2019, Lemma 4) for proof.
to prove , it suffices to show that for every , there is a such that for all ,
By our induction assumption, there is a such that for all
Since is -Lipschitz, then
To bound the second term in Equation 80, let denote the event
and its complement. For all its probability is
i.e. the event that the inequality inside the braces holds. The fact
where the maximum is taken over all .
Consider a fixed , and define
a deterministic matrix in . Then
where \left\{\left(z^{l,t}_{i,\alpha}(x),z^{l,t}_{i,\alpha^{\prime}}(x^{\prime})\right)\Big{|}K^{l}_{t}=\kappa\right\}_{i\in[n^{l+1}(t)]}\textrm{are i.i.d.}\sim\mathcal{N}\left(0,\Sigma\left(\kappa,\alpha,\alpha^{\prime},x,x^{\prime}\right)\right).
Then if we can apply property 3 of to conclude that:
However, if , then necessarily (since ), ensuring that . Therefore Equation 92 holds for any , and for any .
We further remark that is deterministic and does not depend on . Marginalizing out and maximizing over in Equation 92 we conclude that
Since as , there exists such that for any ,
and, substituting this bound in Equation 88,
Therefore we just need to choose so that for all . ∎
We list some directions to strengthen / extend the results of Theorem F.5 (and thus Theorem F.6) using the above framework.
where is the linear operator on the covariance matrix induced by . Conditioned on , since the outputs after applying the linear operator are still i.i.d. Gaussian (and the property 3 is applicable), the analysis in the above proof can carry over with replaced by .
More generally, one may consider inserting an operator (e.g. max-pooling, dropout and more interestingly, normalization) in some hidden layer.
Gaussian prior on weights and biases might be relaxed to sub-Gaussian.
F.7 Proof of Theorem F.4
Note that absolutely continuous exponentially bounded functions contain all polynomials, and are closed under multiplication and integration in the sense that for any constant the function
is also exponentially bounded. Theorem F.4 is a consequence of the following lemma.
for , if is exponentially bounded.
if exists a.e. and is exponentially bounded.
if .
Indeed, if is absolutely continuous and is exponentially bounded, then is also exponentially bounded. By the above lemma, .
1. We prove the first statement. Assume .
2. To prove the second statement, let and define (similarly for ):
Then (and ). Let
Since is exponentially bounded, is also exponentially bounded due to being absolutely continuous. In addition, is exponentially bounded for any polynomial .
Applying the Mean Value Theorem (we use the notation to hide the dependence on and other absolute constants)
Note that the operator norm is bounded by the infinity norm (up to a multiplicity constant) and is exponentially bounded. There is a constant (hidden in ) and such that the above is bounded by
In practice the decay bound obtained by Chebyshev’s inequality in Equation 114 is often too weak to be useful. However, if is linearly bounded, then one can obtain an exponential decay bound via the following concentration inequality:
If a.e., then there is an absolute constant and a constant such that property 3 (Equation 62) holds with
We postpone the proof of the following claim.
Assume . Then there is a such that for all and all ,
Claim F.9 and the triangle inequality imply
We can apply Bernstein-type inequality (Vershynin, 2010, Lemma 5.16) to conclude that there is a such that for every and any
It remains to prove Claim F.9. For ,
We applied Cauchy–Schwarz inequality in the first inequality, the triangle inequality in the second one, the fact in the third one, absolute moments estimate of standard Gaussian in the fourth one, where is a constant such that
Appendix G Experimental Setup
Throughout this work we only consider (possibly unshared) convolutional filters with stride and no dilation.
All inputs are normalized to have zero mean and unit variance, i.e. lie on the dimensional sphere of radius , where is the total dimensionality of the input.
All labels are treated as regression targets with zero mean. i.e. for a single-class classification problem with classes targets are dimensional vectors with and entries in incorrect and correct class indices respectively.
If a subset of a full dataset is considered for computational reasons, it is randomly selected in a balanced fashion, i.e. with each class having an equal number of samples. No data augmentation is used.
All experiments were implemented in Tensorflow (Abadi et al., 2016) and executed with the help of Vizier (Golovin et al., 2017).
All neural networks are trained using Adam (Kingma & Ba, 2015) minimizing the mean squared error loss.
We use a training and validation subsets of CIFAR10 of sizes and respectively. All images are bilinearly downsampled to pixels.
All models have hidden layers with an nonlinearity. No () padding is used.
Weight and bias variances are set to and , corresponding to the pre-activation variance fixed point (Poole et al., 2016) for the nonlinearity.
NN training proceeds for gradient updates, but aborts if no progress on training loss is observed for the last epochs. If the training loss does not reduce by at least for epochs, the learning rate is divided by .
All computations are done with -bit precision.
The following NN parameters are considered:Due to time and memory limitations certain large configurations could not be evaluated. We believe this did not impact the results of this work in a qualitative way.
Pooling: no pooling or a single global average pooling (averaging over spatial dimensions) before the final FC layer.
Number of channels: for from to .
Initial learning rate: for from to .
Weight decay: and for from to .
Batch size: , , , , .
For NNs, all models are filtered to only -accurate ones on the training set and then for each configuration of {architecture, pooling, number of channels} the model with the lowest validation loss is selected among the configurations of {learning rate, weight decay, batch size}.
For GPs, the same CNN-GP is plotted against CNN and LCN networks without pooling. For LCN with pooling, inference was done with an appropriately rescaled CNN-GP kernel, i.e. where is the spatial size of the penultimate layer. For CNNs with pooling, a Monte Carlo estimate was computed (see §4) with filters and samples.
For GP inference, the initial diagonal regularization term applied to the training covariance matrix is ; if the Cholesky decomposition fails, the regularization term is increased by a factor of until it either succeeds or reaches the value of , at which point the trial is considered to have failed.
G.2 Monte Carlo Evaluation of Intractable GP Kernels
We use the same setup as in §G.1, but training and validation sets of sizes and respectively.
For MC-GPs we consider the number of channels (width in FCN setting) and number of NN instantiations to accept values of for from to .
where is substituted with for the CNN-GP pooling case (due to impracticality of computing the exact ). GPs are regularized in the same fashion as in §G.1, but the regularization factor starts at and ends at and is multiplied by the mean of the training covariance diagonal.
G.3 Transforming a GP over spatial locations into a GP over classes
We use the same setup as in §G.2, but rescale the input images to size of , so that at depth the spatial dimension collapses to a patch if no padding is used (hence the curve of the CNN-GP without padding halting at that depth).
For MC-CNN-GP with pooling, we use samples of networks with filters. Due to computational complexity we only consider depths up to for this architecture. The number of samples was selected independently for each depth among for from to to maximize the validation accuracy on a separate -points validation set. This allowed us to avoid the poor conditioning of the kernel. GPs are regularized in the same fashion as in §G.1, but for MLP-GP the multiplicative factor starts at and ends at .
G.4 Relationship to Deep Signal Propagation
We use a training and validation subsets of CIFAR10 of sizes and respectively.
We use the nonlinearity. For CNN-GP, images are zero-padded ( padding) to maintain the spatial shape of the activations as they are propagated through the network.
Weight and bias variances (horizontal axis and vertical axis respectively) are sampled from a uniform grid of size on the range including the endpoints.
All computations are done with -bit precision. GPs are regularized in the same fashion as in §G.1, but the regularization factor is multiplied by the mean of the training covariance diagonal. If the experiment fails due to numerical reasons, (random chance) validation accuracy is reported.
G.5 CNN-GP on full datasets
Relevant Figures 6 (a, c), 7 and Table 1.
We use full training, validation, and test sets of sizes , , and respectively for MNIST (LeCun et al., 1998) and Fashion-MNIST (Xiao et al., 2017a), , , and for CIFAR10 (Krizhevsky, 2009). We use validation accuracy to select the best configuration for each model (we do not retrain on validation sets).
GPs are computed with -bit precision, and NNs are trained with -bit precision. GPs are regularized in the same fashion as in §G.4.
Zero-padding () is used.
Nonlinearity: or .
Depth: for from to (and up to for MNIST and Fashion-MNIST datasets).
Weight and bias variances. For : from . For : a fixed weight variance and bias variance from .
Only one (, ), 8-layer CNN-GP with pooling model was evaluated using the Neural Tangents library (Novak et al., 2020) after the publication, and added to Figure 7 and Table 2 for completeness. The diagonal regularization factor was selected on the validation set among (resulting in ).
On CIFAR10, we additionally train NNs for gradient updates with a batch size of with corresponding parameters in addition toDue to time and compute limitations certain large configurations could not be evaluated. We believe this did not impact the results of this work in a qualitative way.
Pooling: no pooling or a single global average pooling (averaging over spatial dimensions) before the final FC layer (only for CNNs).
Number of channels or width: for from to (and up to for CNNs with pooling in Figure 6, a).
Learning rate: for from to , where width is substituted with the number of channels for CNNs and is substituted with for networks. “Small learning rate” in Table 1 refers to .
Weight decay: and for from to .
For NNs, all models are filtered to only -accurate ones on the training set (expect for values in parentheses in Table 1). The reported values are then reported for models that achieve the best validation accuracy.
G.6 Model comparison on CIFAR10
We use the complete CIFAR10 dataset as described in §G.5 and consider 8-layer models with weight and bias variances of and . The number of channels / width is set to , and for LCN, CNN, and FCN respectively.
GPs are computed with -bit precision, and NNs are trained with -bit precision.
No padding () is used.
NN training proceeds for gradient updates with batch size , but aborts if no progress on training loss is observed for the last epochs. If the training loss does not reduce by at least for epochs, the learning rate is divided by .
Values for NNs are reported for the best validation accuracy over different learning rates ( for from to ) and weight decay values ( and for from to ). For GPs, validation accuracy is maximized over initial diagonal regularization terms applied to the training convariance matrix: for among , and (if the cholesky decompisition fails, the regularization term is increased by a factor of until it succeeds or reaches the value of ).
CNN-GP with pooling performance was computed using the Neural Tangents library (Novak et al., 2020), after the publication, and added to the figure for completeness. The diagonal regularization factor was selected on the validation set among (resulting in ).