Neural Empirical Bayes
Saeed Saremi, Aapo Hyvarinen
Introduction
This setup is denoted symbolically as .
with and . We revisit textbook calculations on concentration of measure and show that in high dimensions, the smoothed manifold is approximately a sphere of dimension :
This analysis paints a picture of “manifold disintegration-expansion” as a general phenomenon.
The first conceptual contribution of this paper is to expose a notion of manifold disintegration-expansion, with a general takeaway that in (very) high dimensions, gaussian smoothing pushes away (all) the probability masses near . This is due to the fact that in high dimensions, the Gaussian random variable
is not necessarily concentrated on , and there are many directions (asymptotically infinite) where samples from “escape” the manifold. The thesis is that, in convolving and , would be mapped to a (much) higher dimensional manifold.
A seemingly unrelated theory is empirical Bayes, as formulated by Robbins (1956), which we use for pulling the probability mass back towards . In 1956, Robbins considered a scenario of a random variable that depends “in a known way” on an “unknown” random variable . The starting point in (Robbins, 1956) was the existence of a noisy random variable that was denoted by . In that setup, one can only observe values of . Here, we started with the “clean” i.i.d. sequence and artifically created . This departure in starting points is related to our new take on empirical Bayes which will become clear in Section 3. He found that, given an observation , the least squares estimator of is the Bayes estimator, and quite remarkably, can be expressed purely in terms of the distribution of (he showed this for Poisson, geometric, and Laplacian kernels). Robbins’ results were later extended to gaussian kernels by (Miyasawa, 1961) who derived the estimator
The least-squares estimator of above is derived for any , not necessarily infinitesimal. To signify this important fact, we also refer to this estimation as “Robbins jump” or jump for short. The empirical Bayes machinery is denoted symbolically as
The smoothing of kernel density estimation, , and the denoising mechanism of empirical Bayes, , come together to define the learning objective
(Throughout the paper, is the gradient taken with respect to the inputs , not parameters.)
When analyzing a finite i.i.d. sequence, , i.i.d. samples from are generated as
and the learning objective is then approximated as
where is a shorthand for . In high dimensions, the samples are (approximately) uniformly distributed on a “thin spherical shell” around the sphere with its center at which leads to the following definition.
The sphere centered at is a useful abstraction and it is referred to as “-sphere”. The in -sphere refers to the index in , and the index is reserved for the gaussian noise . Therefore, is the th sample on the -sphere. Fixing the index , the samples are approximately uniformly distributed on a thin spherical shell around the -sphere (see Figure 1).
DEEN: minimize with stochastic gradient descent and return .
In most of the paper, we are interested in studying which is already at optimality with parameteres . In these contexts, is dropped, and its presence is understood, e.g. at optimality, (modulo a constant).
Next, we define a notion of interactions between -spheres that we find useful in thinking about the goal of approximating the score function through optimizing . However, this can be skipped until our discussions of the associative memory in Sections 6 and 7.
The interaction between -spheres is defined as the competition that emerge due to the presence of in the problem of minimizing . In other words, it refers to “the set of constraints” that , the gradient of the globally defined energy function, has to satisfy locally on -spheres to arrive at the optimality such that It is indeed an abstract notion in this work, but it is a useful abstraction to have which we come back to in thinking about the gradient flow It should be stated that there is also a notion of interaction that already exists for the kernel density in the sense of “mass aggregation” (see Section 7 for references), but in this framework, -spheres interact even when they do not overlap in that they “communicate” via .
The third conceptual contribution of this paper is introducing the physical picture of interactions between i.i.d. samples. This is a useful abstraction in this work in thinking about the problem of approximating the score function with the gradient of an energy function. The interaction is a code for “the set of constraints” that (evaluated at i.i.d. samples ) has to satisfy to arrive at optimality, . These interactions could also give rise to some collective phenomena (Anderson, 1972).
Next, we outline the technical contributions of this paper. Approximating in the empirical Bayes setup leads to a novel sampling algorithm, a new notion of associative memory and the emergence of “creative memories”. First, we define the matrix that was introduced to quantify -sphere overlaps and was used for designing experiments.
Concentration of measure leads to a geometric picture for the kernel density as a “mixture of -spheres”, and we define the matrix to quantify the extent to which the spheres in the mixture overlap; the entries in are essentially pairwise distances, scaled by (where the scaling is related to the concentration of isotropic Gaussians in high dimensions):
The -sphere and the -sphere do not overlap if , and they overlap if (see Figure 2). Of special interest is , defined as
We define “extreme noise” as the regime ; it is the regime where all -spheres have some degree of overlap.
“Walk-jump sampling” is an approximative sampling algorithm that first draws exact samples from the density ( is the unknown normalizing constant) by Langevin MCMC:
where is the step size and here is discrete time. At an arbitrary time , approximative samples “close” to are generated with the least-squares estimator of —the jumps:
A signature of walk-jump sampling is that the walks and the jumps are decoupled in that the jumps can be made at arbitrary times.
“Neural empirical Bayes associative memory”—named NEBULA, where the “LA” refers to à la Hopfield (1982)—is defined as the flow to strict local minima of the energy function . In continuous time, the memory dynamics is governed by the gradient flow:
In retrieving a “memory”, one flows deterministically to an attractor. NEBULA is an intriguing construct as the deterministic dynamics is governed by a probability density function, since —this is very far from true for Hopfield networks.
Memories in NEBULA are believed to be formed due to -sphere interactions (Definition 3). We provide evidence for the presence of this abstract notion of interaction in this problem by reporting the emergence of very structured memories in a regime of highly overlapping -spheres. They are named “creative memories” because they have intuitively appealing structures, while clearly being new instances, in fact quite different from the training data.
This paper is organized as follows. Manifold disintegration-expansion is discussed in Section 2. In Section 3, we review empirical Bayes and give a derivation of the least squares estimator of for gaussian kernels. We then talk about a Gedankenexperiment—in the school of Robbins (1956) and Robbins and Monro (1951)—to bring home the unification scheme and shed some light on the inner-working of the learning objective. In Section 4, we define the notion of extreme noise and demonstrate two extreme denoising experiments. In Section 5, we present the walk-jump sampling algorithm. NEBULA is defined in Section 6, where the abstract notion of -sphere interactions is grounded to some extent; we present two sets of experiments with qualitatively different behaviors. In Section 7, we report the emergence of creative memories for two values of close to . We finish with a summary.
Manifold disintegration-expansion in high dimensions
Our low-dimensional intuitions break down spectacularly in high dimensions. This is in large part due to the exponential growth of space in high dimensions (this abundance of space also underlies the curse of dimensionality, well known in statistics and machine learning). Our focus in this section is to develop some intuitions on the effect of gaussian smoothing, , on the data manifold . We start by a summary of results on concentration of measure. The textbooks (Ledoux, 2001; Tao, 2012; Vershynin, 2018) should be consulted to fill in details. We then extend the textbook calculations for a “gaussian manifold” and make a case for manifold disintegration-expansion.
Start with the isotropic gaussian. In 2D, samples from the gaussian form a “cloud of points”, centered around its mode as illustrated in Figure 1a. Next, take the isotropic Gaussian in dimensions, , and ask where the random variable is likely to be located. Before stating a well-known result on the concentration of measure, we can build intuition by observing the following identities for the expectation and the variance of the squared norm:
Since the components of are jointly independent, and since , the following holds with high probability,
The concentration of norm is visualized in Figure 1b; also see Figure 3.2 in (Vershynin, 2018). In contradiction to our low-dimensional intuition, in high dimensions, the Gaussian random variable is not concentrated close to the mode of its density (at the origin here), but in a “thin spherical shell” of width around the sphere of radius . The concentration of norm suggests an even deeper phenomenon, the concentration of measure, captured by a non-asymptotic result for the deviation of the norm from , where the deviation has a sub-gaussian tail (Tao, 2012; Vershynin, 2018):
In above expressions, and are absolute constants. In high dimensions, the Gaussian random variable is thus approximated with the uniform distribution on the sphere ,
The approximation becomes equality in distribution, as .
where . The manifold is viewed as a “gaussian manifold” where, in an abuse of notation, . Now consider which means
We repeat the concentration of norm calculations:
the “manifold terms”, and , become negligible, and the following holds with high probability:
The calculations in the previous page are reproduced by setting , , .
In summary, under the convolution , the “gaussian manifold” is transformed/mapped to
therefore . In our terminology, has been disintegrated-expanded to . We expect this phenomenon to hold for any asymptotically, i.e. as , , but (unfortunately) the limit is not practical. However, the picture should stay: in high dimensions, the convolution of and has severe side effects on the manifold itself. Smoothing is indeed desired, but also a mechanism to restore . In the next section, we discuss such a mechanism using empirical Bayes, where the “restoration” is in least squares.
Neural empirical Bayes
In a seminal work from 1956, titled an empirical Bayes approach to statistics, Robbins considered an intriguing scenario of a random variable having a probability distribution that depends “in a known way” on an “unknown” random variable (Robbins, 1956). Observing , the least-squares estimator of is the Bayes estimator (see Remark 1):
If is known to “the experimenter”, is a computable function, but what if the prior is unknown? It is quite remarkable that for a large class of kernels, the least-squares estimator can be derived in closed form purely in terms of the distribution of . Informally speaking, there is an “abstraction barrier” (Abelson and Sussman, 1985) between and where the knowledge of is not needed to estimate . This functional dependence of on was achieved for the Poisson, geometric, and Laplacian kernels in (Robbins, 1956), and it was later extended to the gaussian kernel by (Miyasawa, 1961).
Multiply the expression above by and integrate,
which then leads to the expression For the isotropic case , the estimator takes the form
To sum up, the estimator above is obtained in a setup where only the corrupted data (the random variable ) are observed, with the knowledge of the measurement (gaussian) kernel alone and without any knowledge of the prior . The remarkable result is the fact the least-squares estimator of is written in closed form purely as a functional of , also known as the score function (Hyvärinen, 2005). The expression above for the least squares estimator of is the basis for an algorithm to approximate which is discussed next.
In empirical Bayes, the random variable is estimated in least squares from the noisy observations , without any knowledge of the prior . But how did we get here? We started the paper with the i.i.d. sequence where did not even exist!
The idea is a Gedankenexperiment of sort, where we first construct by taking samples ; in turn, the experimenter (in the school of empirical Bayes) “observes” and estimates . But in this artificially supervised setup (the “target” is ), the error signal can also be measured—this is not the case in (Robbins, 1956)—and it drives the learning. The learning is achieved with stochastic gradient descent as the experimenter has parameterized the energy function with a neural network and can compute the stochastic gradients (Robbins and Monro, 1951),
Approximating in this manner is the essence of neural empirical Bayes.
For finite samples, the best we can do is to approximate . Given , the i.i.d. samples are given by These are indeed i.i.d. samples from density estimated by the kernel density estimator
At optimality (achieved with stochastic gradient descent), ,
Extreme noise/denoising
In empirical Bayes, the least-squares estimator of ,
is derived for any , which inspires us to call this a “jump” to capture the fact may in fact be “large”. Denoising is not our actual goal, but the expression above can be viewed as “denoising to ”. The denoising performance is important since in the machinery of neural empirical Bayes, the goal of approximating the score function is formulated by the “denoising objective”
As the least-squares estimator of is derived for any , the denoising performance of DEEN may also be tested to its limits. Here, we report such experiments for MNIST, where and is in the hypercube (see Remark 8). Next, we elaborate on the geometric meaning of the noise levels, where in addition we define a notion of “extreme noise”.
. The value is defined as the onset of extreme noise. Called “extreme”, because for , all -spheres in the dataset overlap (see Figure 2b). For MNIST, was estimated from handwritten digit pairs (see Table 1). The results for on the test set is presented in Figure 3. It should be stated that, this value of “extreme noise”, just above , is not visually perceived as extreme.
. In this regime, due to geometry, the whole dataset is in inside of each -sphere (see Figure 2c). This is very extreme! For MNIST, we experimented with which is in this regime. The results on the test set are presented in Figure 4.
The following comments are in order regarding the network architecture we used in our experiments.
(activation function) The denoising results are a significant improvement over (Saremi et al., 2018). This was due to the use of ConvNets, instead of a fully connected network (more on that below) and the use of a “smooth ReLU” activation function
where the default was . The activation function above converges uniformly to ReLU in the limit . It has been studied independently from different angles (Elfwing et al., 2018; Ramachandran et al., 2017). ReLU consistently gave a higher loss, which we believe is due to the term in the learning objective, which must be computed first before computing to update the parameters
(wide convnets) All experiments reported in the paper were performed in a fixed wide ConvNet architecture with the expanding , without pooling, and with a bottleneck layer of size . All hidden layers were activated with the activation function , and the readout layer was linear.
(readout overparametrization) We observed a slightly faster convergence to lower loss (especially, very early in the training) by overparameterizing the linear output layer. These experiments were inspired by the “acceleration effects” studied for linear neural networks (Arora et al., 2018), but we did not study it thoroughly.
We used the Adam optimizer (Kingma and Ba, 2014). The optimization was stable over a wide range of mini-batch sizes and learning rates, and as expected, we did not observe the validation/test loss going up in the experiments. This stability is due to the fact that the inputs to are noisy samples. The automatic differentiation (Baydin et al., 2018) was implemented in PyTorch (Paszke et al., 2017). DEEN is “memory hungry” but the computational costs are not addressed here in detail.
Walk-jump sampling: Langevin walk, Robbins jump
than (putting aside the harder problem of learning ). At any time, the estimator can be used to jump near . “Near” is intuitive, and it should be stated that neither nor is exactly sampled during jumps; the jump samples must be seen as heuristic.
In what follows, it is understood that is at optimality. We first give a description of the algorithm and then express the equations.
(walks) The Langevin MCMC is used to draw statistical samples Langevin MCMC is based on discretizing the Langevin diffusion (van Kampen, 1992) and the updates are based solely on (the equations are coming next), therefore not knowing the normalizing constant is of no concern.
(jumps) Given , ideally an exact sample (MacKay, 2003) from , the jumps are made with the Bayes estimator of . The spirit of empirical Bayes is fully present here: the jump samples are generated without knowing (or its approximation), and without running Langevin MCMC on its “rougher terrain”.
What emerges is a sampling algorithm in two parts. The Langevin MCMC is the engine that should run continously. By contrast, the jumps can be made at any time. That is:
Sample from with Langevin MCMC,
where is the step size and here is discrete time.
At any time , use the Bayes estimator of to jump from to ,
We tested the algorithm on the handwritten digit database for and . The results are shown in Figure 5 and 6 respectively. For (in the regime of highly overlapping spheres), there was mixing between styles/classes. For (in the regime of mostly non-overlapping spheres), samples stayed within a class while the styles changed.
Some side-by-side comparisons with denoising autoencoders are in order.
In walk-jump sampling, the key step is sampling from the density . The least-squares estimation of —the jumps—are decoupled from Langevin MCMC. This is in contrast to the chain constructed in generalized denoising autoencoders (Bengio et al., 2013b), which does not suffer from the problems we mentioned for the jumps.
Here, the learning objective is derived for any . By contrast, denoising autoencoders approximate the score function only in the limit (Alain and Bengio, 2014). Generalized denoising autoencoders (Bengio et al., 2013b) and generative stochastic networks (Alain et al., 2016) were devised as a remedy for those limitations. This is avoided altogether here.
In this work, there is a clear geometrical notion for large and small , which is based on the concentration of measure phenomenon. This picture is lacking in the references cited, and it is not clear which ranges of noise values should be considered there.
Most importantly, denoising autoencoders suffer from “limited parameterization” as expressed by Alain and Bengio (2014), and independently observed in (Saremi et al., 2018). To summarize, in denoising autoencoders, one must learn a curl-free encoder-decoder to be able to properly approximate the score function and this is problematic beyond one-hidden-layer architectures. In our work, this problem is avoided due to the energy parameterization (see Remark 7).
Neural empirical Bayes associative memory (NEBULA)
In this section, the neural empirical Bayes machinery is used to define a new notion of associative memory. Associative memory (also known as content-addressable memory) is a deep subject in neural networks with a rich history and with roots in psychology and neuroscience (long-term potentiation and Hebb’s rule). The depiction of this computation is even present in the arts and literature, championed by Marcel Proust, in the form of “stream of consciousness”, and the constant back and forth between perception and memory.
In 1982, Hopfield brought together earlier attempts to formulate associative memory, and showed that the collective computation of a system of neurons with symmetric weights, in the form of asynchronous updates of McCulloch-Pitts neurons (McCulloch and Pitts, 1943), minimize an energy function (Hopfield, 1982). The energy function was constructed in a Hebbian fashion. The associative memory was then formulated as the “flow” (the neurons were binary) to the local minima of the energy function.
However, Hopfield’s energy function is not the energy functionIn addition, Hopfield networks have severe limitations in memory capacity, addressed most recently in (Hillar and Tran, 2014; Krotov and Hopfield, 2016; Chaudhuri and Fiete, 2017).: it does not approximate (learn) the negative log probability density function of its stored memories. The Boltzmann machine (Hinton and Sejnowski, 1986) was developed in fact inspired by that observation, in which learning the energy function was achieved by introducing hidden units. Regarding the associative memory and the phase space flow, the problem is that in Boltzmann machines, hidden units need to be inferred first before having any notion of flow for the visible units. And inference is indeed computationally very expensive—“the curse of inference” (but not nearly as fundamental as the curse of dimensionality)—in probabilistic graphical models (Wainwright and Jordan, 2008; Koller and Friedman, 2009).
What we have achieved so far is to learn a function which approximates the energy function of , (modulo a constant). The key here is that is a function (a “computer”) that computes for any ; the “hidden units” in are not inferred. In other words, the hidden units (and the parameters) in are there solely for universal approximation, In Boltzmann machines, the hidden units are there to think! And this is the big difference. The definition below should be viewed against this backdrop.
We define the neural empirical Bayes associative memory (NEBULA), à la Hopfield (1982), as the flow to strict local minima of the energy function . In continuous time, the memory dynamics is governed by the gradient flow:
NEBULA is identified by its attractors, the set of all the strict local minima of :
What is so special about the local minima of ? Start with a single sample ,
For many samples, the learning objective is optimized such that evaluated on -spheres point to (see Figure 1c), but there are conflicts of interests, and these “conflicts” are the essence of -sphere interactions, stated in Definition 3; in its practical summary, larger means more -sphere overlaps, and therefore larger “interaction couplings”. For NEBULA, the presence of -sphere interactions implies that are not statistical samples from . In other words, given , . But what is the “right metric” to measure the distance between and ? The problem is that the flow is the gradient flow of but the attractors are roughly speaking “close to ”. The total length of the path taken by the gradient flow is a natural choice but we leave that for furture studies. Our expectation (based on -sphere interactions) is that will be larger for larger , visualized in Figures 7 and 8, with pronounced qualitative differences.
Emergence of “creative memories”
We finish the paper with some surprising results on the emergence of highly-structured non-digit memories, named “creative memories”. The experiments designed in this section are a natural continuation of the experiments from the previous section. In its summary, NEBULA is “well behaved” when it is initialized at . But what if it is not initialized as such? Here, we explore -sphere interactions (see Definition 5) from a different angle by initializing NEBULA at a random point,
The results in Figures 9 and 10 are the strongest evidence for -sphere interactions as being a good abstraction. In viewing Figures 9 and 10, consider Figure 2, but with 60000 highly overlapping -spheres, and on each -sphere that should “point towards” in expectation, where in addition, there are other complexities (touched upon in Remark 6) regarding the -spheres themselves.
The images shown in Figures 9 and 10 are not statistical samples from nor .
They are not the result of mass aggregation of the kernel density estimator . Mode seeking is a deep topic in the kernel density literature around the mean shift algorithm (fukunaga1975estimation; Cheng, 1995; Comaniciu and Meer, 2002), also around the intriguing topic of “ghost modes” (Carreira-Perpiñán and Williams, 2003). Of course, the attractors reported here has not been reported in the kernel density literature (see (Carreira-Perpiñán, 2015) and the examples therein for MNIST).
The results presented are from all the random initializations in the unit hypercube from which we ran the algorithm, without any hand-picking!
Summary
In neural empirical Bayes, the smoothing of a random variable to (), denoising, and the (unnormalized) density estimation of the smoothed density were unified in a single machinery. Its inner workings were captured symbolically by , as well as by a Gedankenexperiment with an “experimenter” in the school of Robbins (1956) and Robbins and Monro (1951).In our presentation, the smoothing part was viewed from the angle of kernel density estimation, with an important distinction that the kernel bandwidth was a hyperparameter. But on reflection, this is not necessary. Fundamentally, setting aside the “neural energy function”, one can go on with our program by just knowing the (deep) machineries of empirical Bayes (Robbins, 1956) and stochastic gradients (Robbins and Monro, 1951). In this machinery, the energy function is parametrized with a neural network and SGD becomes the engine for learning, whose end result is captured by . We proposed an approximative walk-jump sampling scheme which produces samples which are particularly appealing visually due to the denoising jump used. The energy function can further be used as an associative memory (NEBULA), which even has some “creative” capacities.
Acknowledgement
This work was supported by the Canadian Institute for Advanced Research, the Gatsby Charitable Foundation, and the National Science Foundation through grant IIS-1718991. We especially thank Bruno Olshausen, Eero Simoncelli, and Francis Bach for discussions.