A Spectral Approach to Gradient Estimation for Implicit Distributions
Jiaxin Shi, Shengyang Sun, Jun Zhu
Introduction
Recently there have been increasing interests in learning and inference with implicit distributions, i.e., distributions defined by a sampling process but without tractable densities. Popular examples include Generative adversarial networks (GAN) (Goodfellow et al., 2014; Mohamed & Lakshminarayanan, 2016). Compared to the explicit likelihoods (e.g., Gaussian) in other deep generative models such as variational autoencoders (VAE) (Kingma & Welling, 2013), implicit distributions are shown able to capture the complex data manifold that lies in a high dimensional space, leading to more realistic samples generated by GAN than other models. Besides, as the constraint of requiring an explicit density is removed, implicit distributions are treated as more flexible variants of variational families for approximate inference (Ranganath et al., 2016; Liu & Feng, 2016; Mescheder et al., 2017; Tran et al., 2017; Huszár, 2017; Li & Turner, 2018; Shi et al., 2018).
Despite that it is appealing to use flexible implicit distributions, which capture complex correlations and manifold structures, deploying them in practical scenes is still challenging. This is because most learning and inference algorithms require optimizing some divergences between two distributions, which often rely on evaluating the densities of them. However, the density of an implicit distribution is intractable and we only have access to its samples. Previous works have explored two directions to solve the problem. One is to first approximate the optimization objective with the samples and then use the approximation to guide the learning procedure. Many works in this direction are based on the fact that the density ratio between two distributions can be estimated from their samples, by a probabilistic classifier (also known as the discriminator) trained in an adversarial game (Donahue et al., 2016; Dumoulin et al., 2016; Mescheder et al., 2017; Tran et al., 2017; Huszár, 2017), or by kernel-based estimators (Shi et al., 2018). The other direction is to estimate the gradients instead of the objective. Li & Turner (2018) propose the Stein gradient estimator for the log density of an implicit distribution. It is based on a ridge regression that inverts a generalized version of Stein’s identity (Gorham & Mackey, 2015; Liu et al., 2016). As argued in their work, this approach is more direct and avoids probable arbitrarily diverse gradients provided by the approximate objective. In this paper, we focus on the latter direction.
Though the Stein gradient estimator has been shown to be a fast and easy way to obtain gradient estimates for implicit models. It is still limited by its simple formulation, i.e., the unjustified choice of both the test function (the kernel feature mapping) and the regularization scheme (the Frobenius-norm regularization used in ridge regression). The problem has deeper implications. For instance, no theoretical results have been established for the estimator. Moreover, there is no principled way to obtain gradient estimates at positions out of the sample points.
In this paper, we develop a novel gradient estimator for implicit distributions, which is called the Spectral Stein Gradient Estimator (SSGE). To approximate the gradient function of the log density (i.e., ), SSGE expands it in terms of the eigenfunctions of a kernel-based operator. These eigenfunctions are orthogonal with respect to the underlying distribution. By setting the test functions in the Stein’s identity to these eigenfunctions, we can take advantage of their orthogonality to obtain a simple solution. The eigenfunctions in the solution are then approximated by the Nyström method (Nyström, 1930; Baker, 1997; Williams & Seeger, 2001). Unlike the Stein gradient estimator (Li & Turner, 2018), our approach allows for a direct and principled out-of-sample extension. Moreover, we provide theoretical analysis on the error bound of SSGE and discuss the bias-variance tradeoff in practice. We also discuss its effectiveness in reducing the curse of dimensionality by drawing connections to kernel PCA.
Background
In this section we briefly introduce the Nyström method and the Stein gradient estimator.
The Nyström method originates as a method for approximating the solution of Fredholm integral equations of the second kind (Nyström, 1930; Baker, 1997). It was used by Williams & Seeger (2001) for estimating extensions of eigenvectors in Gaussian process regression. Specifically, the following equation for finding the eigenfunctions denotes the space of all square-integrable functions w.r.t. . of the covariance kernel w.r.t. the probability measure is considered:
And there is a constraint that the eigenfunctions are orthonormal under :
where . Approximating the left side of eq. 1 with its unbiased Monte Carlo estimate using i.i.d. samples from and applying the equation to these samples, we obtain
where is the Gram matrix: , and . This is an eigenvalue problem for . We compute the eigenvectors with the largest eigenvalues for . Now we have the solutions of eq. 3 by comparing against :
Note that the scaling factor in eq. 4 is due to the empirical constraint translated from eq. 2: . Baker (1997) shows that for a fixed kernel , converges to in the limit as .
Plugging these solutions back into eq. 1, we get the Nyström formula for approximating the value of the th eigenfunction at any point :
The Nyström method has been shown to be a thread linking many dimension reduction methods such as kernel PCA (Schölkopf et al., 1998), multidimensional scaling (MDS) (Borg & Groenen, 2005), local linear embedding (LLE) (Roweis & Saul, 2000), Laplacian eigenmaps (Belkin & Niyogi, 2003), and spectral clustering (Weiss, 1999), unifying their out-of-sample extensions (Bengio et al., 2004a, b; Burges et al., 2010). We will discuss these connections later.
2 Stein’s Identity and Stein Gradient Estimator
Recent developments in Stein discrepancy and its kernelized extensions (Gorham & Mackey, 2015; Chwialkowski et al., 2016; Liu et al., 2016; Liu & Wang, 2016) have renewed the interests in Stein’s method, which is a classic tool in statistics. Central to these works is an equation that generalizes the original Stein’s identity (Stein, 1981), shown in the following theorem.
where denotes the Frobenius norm of a matrix and is the regularization coefficient. It has an analytic solution that
Method
In this section we derive a gradient estimator for implicit distributions called the Spectral Stein Gradient Estimator (SSGE). Unlike the previous Stein gradient estimator that only provides estimates at the sample points, SSGE directly estimates the gradient function and thus allows simple and principled out-of-sample predictions. We also provide theoretical analysis on the error bound of SSGE.
Below we will show how to estimate the coefficients . According to 1, we have the following proposition.
If has continuous second order partial derivatives, and both and are in the Stein class of , the following set of equations hold true:
We only need to prove that is in the Stein class of . See Appendix A for details. ∎
Substituting eq. 11 into eq. 12 and using the orthonormality of , we can show that
To estimate , we need an approximation of . The key observation is that derivatives can be taken w.r.t. both sides of eq. 1:
Monte-Carlo sampling with eq. 13, we have an estimate of :
Substituting eqs. 4 and 5 into eq. 14 and comparing with eq. 6, we can show
Perhaps surprisingly, Equation 15 indicates that is a good approximation to This does not hold for general functions.. In fact, as we shall see in 2, the error introduced by Nyström approximation is negligible with high probability as .
Now truncating the series expansion to the first terms and plugging in the Nyström approximations of , we get our estimator:
where is the Nyström approximation of as in Section 2.1. We use RBF kernels in all experiments.
2 Theoretical Results
Following the derivation in Section 3.1, we analyze the theoretical properties of the resulting estimator in eqs. 16 and 17. To be clear, we formally restate the assumptions that have been made in the derivation.
and are in the Stein class of q.
, , i.e.,
Note that 1 holds for RBF kernels. 2 is necessary for ’s being possible to be expanded into the spectral series. We need 3 since our derivation is based on several well-studied bounds of Nyström approximation, i.e., Lemmas 4 and 5 in Appendix B (Sinha & Belkin, 2009; Izbicki et al., 2014). Note that when this assumption does not hold, we could proceed as in Rosasco et al. (2010) (Theorem 12) and derive the error bound in a similar way.
Given the above assumptions, the error is bounded by
where , is the Big O notation in probability.
The first three terms in eq. 18 are the sample errors caused by the Nyström approximation, which we call the estimation error. It is negligible with high probability as . The last term is caused by the bias introduced by the truncation, which we call the approximation error. From the bound we can observe a tradeoff between the estimation error and the approximation error. As an illustration, one may set to be as large as possible to reduce the magnitude of and thus reduce the approximation error, but it will increase the estimation error at a rate of .
For RBF kernels and their corresponding RKHS, smoother target functions tend to have smaller , and thus a tighter bound. In general, this indicates that choosing the appropriate kernel which is suitable to the target gradient function can improve the performance of the gradient estimator (by leading to a smaller ).
Hyperparameter Selection When RBF kernels are used, SSGE has two free parameters: The kernel bandwidth , and the number of eigenfunctions () used in the estimate. For , we use the median heuristic, i.e., we set it to be the median of pairwise distances between all samples, which turns out to work well in all experiments. Below we discuss the criterion for selecting , which is usually harder.
As the performance of the gradient estimator directly influences the task where it is used. The optimal choice for tuning is to apply cross-validation on the specific task. However, since is a discrete parameter, one has to manually set a continuous interval and bin it so that the commonly used black-box hyperparameter-search methods (e.g., Bayesian optimization) are applicable. However, this approach does not take the magnitude of eigenvalues into consideration. Observing this, we propose that, instead of directly tuning , we could tune a threshold for the percentage of remaining eigenvalues:
Note that searching may still not be easy due to the non-smooth validation surface. But in experiments we found that values in usually work well.
3 Gradient Estimation for Entropy
where can be easily estimated by SSGE. As we shall see in experiments, eq. 19 can be used for variational inference with implicit distributions.
Related Work
Our work is closely related to other works on implicit distributions. Apart from the density-ratio based approaches introduced in Section 1, we discuss two more directions.
Nonparametric Inference Nonparametric variational inference (VI) methods such as PMD (Dai et al., 2016) and SVGD (Liu & Wang, 2016) remove the need of parametric families, they keep a set of particles and gradually adjust them towards the true posterior. These particles can be viewed as samples from an implicit distribution. Instead of directly computing gradients in the sample space like us, SVGD performs functional gradient descent to transform the implicit distribution towards the true posterior. Though elegant, SVGD is limited to KL-divergence based VI problems, while our approach is generally applicable wherever gradient estimates are needed for the log density of implicit distributions.
Kernel Exponential Families and Score Matching Previous to our work, the problem of estimating gradient functions of intractable log densities has been worked on by Strathmann et al. (2015); Sutherland et al. (2018). They identified the problem when developing Hamiltonian Monte Carlo (HMC) under the settings where higher-order information of the target distribution is unavailable. To address it, a kernel exponential family (Sriperumbudur et al., 2017) is fit to samples along the Markov Chain trajectory by score matching (Hyvärinen, 2005), and then serves as a surrogate of the target distribution to provide gradient estimates for HMC. We will compare to them in Section 5.2.
Experiments
We evaluate the proposed approach on both toy problems and real-world examples. The latter includes applications of SSGE to two widely used inference methods: Hamiltonian Monte Carlo and variational inference. Code is available at https://github.com/thjashin/spectral-stein-grad. Implementations are based on ZhuSuan (Shi et al., 2017).
As a simple example, we experiment with estimating the gradient function of a 1-D standard Gaussian distribution. The target log density is , and the true gradient function is . We draw i.i.d. samples from for use in the estimation. In Figure 1 we plot the gradients estimates produced by the Stein gradient estimator, its out-of-sample extension (Stein+) (see Section 2.2), and our approach (SSGE). Since the original Stein estimator only gives estimates at the sample points, we plot them as individual points (in red). For the regularization coefficient in eq. 10, we searched it in and plot the best result at The criterion for selecting is unclear in Li & Turner (2018).. For SSGE, we set . We can see that despite all three estimators produce rather good approximation where the samples are taken densely (e.g., in $$), the gradient function estimated by SSGE is notably better at the places where samples are less dense.
2 Gradient-free Hamiltonian Monte Carlo
In this experiment we investigate the usefulness of SSGE in constructing a gradient-free HMC sampler. We follow the settings in Sejdinovic et al. (2014); Strathmann et al. (2015) and consider a Gaussian Process classification problem on the UCI Glass dataset. The goal is to infer the posterior over hyperparameters under a fully Bayesian treatment. Specifically, consider a Gaussian process whose joint distribution is over latent variables , labels , and hyperparameters :
In practice is chosen to be the Laplace approximation of . As for any pseudo-marginal MCMC scheme, the gradient information of the posterior is not available and HMC is not suitable. We have to resort to gradient-free MCMC methods. As mentioned in Section 4, kernel adaptive Metropolis samplers (Sejdinovic et al., 2014) were developed and then extended to kernel HMC (KMC) (Strathmann et al., 2015). In this experiment we compare the performance of KMC and HMC with gradients estimated by SSGE and Stein+.
To begin, we run 20 randomly initialized adaptive-Metropolis samplers for 30k iterations, with the first 10k samples discarded. We then keep every 400-th sample in each of the chains, and combine them to get 1k samples. These samples are treated as the ground truth. Similar to Sutherland et al. (2018), our experiment assumes the idealized scenario where a burn-in period for collecting a sufficient number of samples has completed. This is to remove all the other factors that could have an effect on the comparison of acceptance ratios, which then only depends on the accuracy of the gradient estimation of potentials, i.e., . So we fit all three estimators on a random subset of of these samples and repeated 10 times. For each fitted estimator, we start from a random initial point from the posterior sketch and run HMC samplers with the gradient estimates for 5k iterations. To be fair in comparing the acceptance ratios, no adaptation of HMC parameters can be used. So we randomly uses between 1 and 10 leapfrog steps of size chosen uniformly in , and a standard Gaussian momentum. The kernel bandwidths () in all three estimators are determined by median heuristics (i.e., set to the median of the pairwise distances between the data points). For Stein+, . For SSGE, .
The average acceptance ratios over 10 runs are plotted in Figure 2(b). We can see that SSGE clearly outperforms Stein+ and is even better than the KMC algorithm, which is specially designed as a gradient-free HMC algorithm. Though KMC is carefully designed, its gradients are estimated by first fitting a kernel exponential family as a surrogate and then taking derivatives through it, while SSGE is arguably more direct.
3 Variational Inference with Implicit Distributions
As introduced in Section 1, there have been increasing interests in constructing flexible variational posteriors with implicit distributions. Specifically, for a latent-variable model where and denote observed and latent variables, respectively, Variational Inference (VI) approximates the posterior by maximizing the following evidence lower bound (ELBO):
where is called the variational distribution. The second term of eq. 23 is the entropy of , which is intractable for implicit distributions. As shown in Section 3.3, SSGE can be used here for estimating gradients of the entropy term, thus allowing VI with implicit distributions. Below we conduct experiments on two examples: Bayesian Neural Networks (BNN) and Variational Autoencoders (VAE). Note that the original Stein gradient estimator can also be used here. In experiments, we find that despite lack of theoretical evidences, the performance of a well-tuned Stein gradient estimator is very close to SSGE when no out-of-sample predictions are required. As the emphasis of this paper is on SSGE, we only focus on verifying the accuracy of SSGE below.
We evaluate the predictive ability of BNNs with implicit variational posteriors trained by SSGE. To visually access the quality of uncertainty, we choose a 1-D regression problem (Hernández-Lobato & Adams, 2015; Louizos & Welling, 2016). Specifically, inputs are randomly sampled from $yy=x^{3}+\epsilon_{n}\epsilon_{n}\sim\mathcal{N}(0,9)$. We use a BNN with 1 hidden layer and 20 units to model the normalized inputs and targets. We also set the variance of the observation noise to the true value. We compare SSGE with implicit posteriors, Hamiltonian Monte Carlo (HMC) (Neal et al., 2011) and Bayes-by-backprop (BBB) (Blundell et al., 2015). To better demonstrate SSGE’s gradient estimation effect, we also test SSGE with a factorized posterior, in comparison to BBB.
We keep 20 chains and run 100k iterations for HMC. All other methods are trained with 100 samples for 20k iterations using Adam optimizer (Kingma & Ba, 2014). For SSGE, we set . The implicit posteriors we use for weights in both layers are standard normal distributions transformed by fully connected networks with one hidden layer of 100 units.
As shown in Fig. 3, HMC, as the golden standard, smoothly fits the training data and outputs sensible uncertainty estimation. HMC not only produces large uncertainty outside the data region, its predictive variance also varies even in regions with training points. This kind of interpolation behavior is hard to be captured by factorized Gaussian posteriors, as shown in Fig. 3(c) and 3(d). SSGE with implicit posteriors also has big predictive variances beyond training points, which implies that the BNN trained by SSGE is not overfitting, although it underestimates the uncertainty in the middle region. Also, we observe that SSGE can have similar interpolation behaviors as HMC (see the rightmost two points in Fig. 3(b)). Besides, SSGE with a factorized posterior has a similar prediction with BBB. Given the network and the variational posterior are both the same, we can attribute this similarity to accurate gradient estimation by SSGE.
3.2 Variational Autoencoders
From the above example we see that SSGE enables variational posteriors parameterized by implicit distributions. To demonstrate that it scales to larger models and datasets, we adopt the settings in Shi et al. (2018) and train a deep convolutional VAE with implicit variational posteriors (Implicit VAE for short) on the CelebA dataset. As in their work, the latent dimension is , and the network structure of the decoder is chosen to be the same as DCGAN (Radford et al., 2015). The observation likelihoods are Gaussian distributions with trainable data-independent variances. The implicit posterior is a deep convolutional net symmetric to the decoder, with Gaussian noises injected into hidden layers. Full details of the model structures can be found in Shi et al. (2018).
To examine how accurate the gradient estimates provided by SSGE are, we conduct experiments under three different settings: a plain VAE with normal variational posteriors, an Implicit VAE trained with the entropy term removed from the ELBO, and an Implicit VAE using SSGE’s gradient estimates for the entropy. For SSGE, we set , and . In Figures 4(a), 4(b) and 4(c) we show samples randomly generated from the trained models. We can see that without the entropy term, the Implicit VAE tends to overfit and produces visually bad generations, while if we retain the entropy term and use SSGE to estimate its gradients, the Implicit VAE can produce realistic samples. To quantitatively measure the sample quality, we compare the Fréchet Inception Distance (FID) (Heusel et al., 2017) between real data and random generations from the models. The results are shown in Figure 4(d). We can see that the Implicit VAE trained by SSGE converges faster and produces samples with slightly better quality than the plain VAE. This is probably due to that implicit posteriors are less likely to overfit (Shi et al., 2018), and SSGE gives accurate gradients for optimizing them. Besides the CelebA experiments, we also tested the models on MNIST dataset and evaluated the test log likelihoods. See Appendix D for details.
Discussion
Connection to Kernel PCA As mentioned in Section 2.1, the Nyström approximation is closely related to Kernel PCA (Schölkopf et al., 1998) (KPCA), which is a powerful method for nonlinear dimension reduction. In KPCA, the input data is first projected to a (usually high-dimensional) feature space, where PCA is then applied. The operations in the feature space are handled by the kernel trick. We briefly review the method below.
A key observation of KPCA is that the eigenvectors lie in the span of the feature vectors, since from eq. 24 we have
Here we use to represent the coefficients. This implies that instead of directly dealing with eq. 24 we can consider a set of projected equations: . Plugging eq. 25 here and replacing with (the kernel trick), we get which turns out an eigenvalue problem for : . Note that this is exactly the same eigenvalue problem solved in eq. 3. As above, we denote by the eigenvectors of that correspond to the largest eigenvalues , and we have . To determine the s, we set the eigenvectors to have unit lengths: So the s should be normalized to have length : For a new data point , KPCA computes the embedding (the projection onto the first eigenvectors) as where . It was pointed out by Williams & Seeger (2000) to be equivalent to using the well understood Nyström approximation. We can see this by noticing that each component of is identical to eq. 6 up to a scaling factor:
Looking back at eq. 16, we can see that SSGE estimates the gradients by a linear estimator with KPCA embeddings as input features. As KPCA embeddings are known to automatically adapt to the geometry of the samples, given a suitable kernel is chosen, it can reduce the curse of dimensionality when the estimator is applied to high dimensional spaces, which helps explain the effectiveness of SSGE.
Connection to Manifold-modeling Dimension Reduction Methods It has been pointed out in previous works (Williams, 2001; Ham et al., 2004; Bengio et al., 2004a, b) that many successful manifold-modeling dimension reduction methods (e.g., MDS, LLE, Laplacian eigenmaps, and Spectral clustering) can be viewed as KPCA with different ways of constructing data-dependent kernels. We believe it is a promising direction to learn a better kernel from a dataset of samples that could improve the manifold modeling behavior of KPCA embeddings, thus further improving the gradient estimator.
Conclusion
We propose the Spectral Stein Gradient Estimator (SSGE) for implicit distributions. Unlike previous methods, SSGE directly estimates the gradient function and thus has a principled out-of-sample extension. Future work may include learning kernels or eigenfunctions in the estimator, as indicated by the error bound as well as the connection to dimension reduction methods.
Acknowledgements
We thank anonymous reviewers for insightful feedbacks, and thank the meta-reviewer and Chang Liu for comments on improving 1. This work was supported by NSFC Projects (Nos. 61620106010, 61621136008, 61332007), Beijing NSF Project (No. L172037), Tiangong Institute for Intelligent Computing, NVIDIA NVAIL Program, Siemens and Intel.
References
Appendix A Proof of Proposition 1
Let denote the Reproducing Kernel Hilbert Space (RKHS) induced by kernel . If has continuous second order partial derivatives, and both and are in the Stein class of for any fixed , then , is in the Stein class of .
Let be a continuous kernel on compact metric space . is a finite Borel measure on . Then for that satisfy eq. 1, :
Then is the same space as the RKHS induced by .
In Lemma 3 we set , , then we have . According to Lemma 1, we can conclude that is in the Stein class of . ∎
Appendix B Error Bound of SSGE
which correspond to the major approximations in each step.
where .
By Cauchy-Schwartz inequality, 2 and Lemma 4:
Denote . According to 1, it is easy to see that satisfies the boundary condition:
And from the proof of 1, we know satisfies the boundary condition. Combining the two, we have:
By applying Minkowski inequality, Cauchy-Schwartz inequality, Lemma 8 and Lemma 5, we have
Appendix C Derivation of Gradient Estimates for Entropy
First, we decompose the gradients into two terms:
Then it is easy to see that the first term is zero:
A low-variance Monte-Carlo estimate of eq. 47 can be obtained when is continuous and reparameterizable (Kingma & Welling, 2013). For example, if there exists a random variable , so that follows the same distribution as , we have
where the intractable term can be easily estimated by SSGE.
Appendix D MNIST Results
We did an MNIST experiment on a VAE with an 8-dim latent space. In Figures 5(a), 5(b) and 5(c) we plot random generations by a plain VAE, an Implicit VAE with the entropy term removed, and an Implicit VAE trained by SSGE, all with the same decoder structures. In Table 1(a) we show the log likelihoods of the trained models (VAE and the Implicit VAE with SSGE) on 2048 test images using Annealed Importance Sampling (AIS). The results are averaged over 10 runs. For reference, we also include the results of 8-dim VAEs from the AVB paper (Mescheder et al., 2017). Though their decoder structure is not the same as ours (the structure is even not the same for their three models), we can see our method is slightly better than plain VAE without other tricks, while AVB has to rely on Adaptive Contrast (AC) and different decoder structures.