Gradient Estimators for Implicit Models

Yingzhen Li, Richard E. Turner

Introduction

Modelling is fundamental to the success of technological innovations for artificial intelligence. A powerful model learns a useful representation of the observations for a specified prediction task, and generalises to unknown instances that follow similar generative mechanics. A well established area of machine learning research focuses on developing prescribed probabilistic models (Diggle & Gratton, 1984), where learning is based on evaluating the probability of observations under the model. Implicit probabilistic models, on the other hand, are defined by a stochastic procedure that allows for direct generation of samples, but not for the evaluation of model probabilities. These are omnipresent in scientific and engineering research involving data analysis, for instance ecology, climate science and geography, where simulators are used to fit real-world observations to produce forecasting results. Within the machine learning community there is a recent interest in a specific type of implicit models, generative adversarial networks (GANs) (Goodfellow et al., 2014), which has been shown to be one of the most successful approaches to image and text generation (Radford et al., 2016; Yu et al., 2017; Arjovsky et al., 2017; Berthelot et al., 2017). Very recently, implicit distributions have also been considered as approximate posterior distributions for Bayesian inference, e.g. see Liu & Feng (2016); Wang & Liu (2016); Li & Liu (2016); Karaletsos (2016); Mescheder et al. (2017); Huszár (2017); Li et al. (2017); Tran et al. (2017). These examples demonstrate the superior flexibility of implicit models, which provide highly expressive means of modelling complex data structures.

Whilst prescribed probabilistic models can be learned by standard (approximate) maximum likelihood or Bayesian inference, implicit probabilistic models require substantially more severe approximations due to the intractability of the model distribution. Many existing approaches first approximate the model distribution or optimisation objective function and then use those approximations to learn the associated parameters. However, for any finite number of data points there exists an infinite number of functions, with arbitrarily diverse gradients, that can approximate perfectly the objective function at the training datapoints, and optimising such approximations can lead to unstable training and poor results. Recent research on GANs, where the issue is highly prevalent, suggest that restricting the representational power of the discriminator is effective in stabilising training (e.g. see Arjovsky et al., 2017; Kodali et al., 2017). However, such restrictions often introduce undesirable biases, responsible for problems such as mode collapse in the context of GANs, and the underestimation of uncertainty in variational inference methods (Turner & Sahani, 2011).

In this paper we explore approximating the derivative of the log density, known as the score function, as an alternative method for training implicit models. An accurate approximation of the score function then allows the application of many well-studied algorithms, such as maximum likelihood, maximum entropy estimation, variational inference and gradient-based MCMC, to implicit models. Concretely, our contributions include:

the Stein gradient estimator, a novel generalisation of the score matching gradient estimator (Hyvärinen, 2005), that includes both parametric and non-parametric forms;

a comparison of the proposed estimator with the score matching and the KDE plug-in estimators on performing gradient-free MCMC, meta-learning of approximate posterior samplers for Bayesian neural networks, and entropy based regularisation of GANs.

Learning implicit probabilistic models

Given a dataset $\mathcal{D}$ containing i.i.d. samples we would like to learn a probabilistic model $p(\bm{x})$ for the underlying data distribution $p_{\mathcal{D}}(\bm{x})$ . In the case of implicit models, $p(\bm{x})$ is defined by a generative process. For example, to generate images, one might define a generative model $p(\bm{x})$ that consists of sampling randomly a latent variable $\bm{z}\sim p_{0}(\bm{z})$ and then defining $\bm{x}=\bm{f}_{\bm{\theta}}(\bm{z})$ . Here $\bm{f}$ is a function parametrised by $\bm{\theta}$ , usually a deep neural network or a simulator. We assume $\bm{f}$ to be differentiable w.r.t. $\bm{\theta}$ . An extension to this scenario is presented by conditional implicit models, where the addition of a supervision signal $\bm{y}$ , such as an image label, allows us to define a conditional distribution $p(\bm{x}|\bm{y})$ implicitly by the transformation $\bm{x}=\bm{f}_{\bm{\theta}}(\bm{z},\bm{y})$ . A related methodology, wild variational inference (Liu & Feng, 2016; Li & Liu, 2016) assumes a tractable joint density $p(\bm{x},\bm{z})$ , but uses implicit proposal distributions to approximate an intractable exact posterior $p(\bm{z}|\bm{x})$ . Here the approximate posterior $q(\bm{z}|\bm{x})$ can likewise be represented by a deep neural network, but also by a truncated Markov chain, such as that given by Langevin dynamics with learnable step-size.

In addition, Li & Liu (2016) and Mescheder et al. (2017) exploit the additive structure of the KL-divergence and suggest discriminating between $q$ and an auxiliary distribution that is close to $q$ , making the density ratio estimation more accurate. Nevertheless all these algorithms involve a minimax optimisation, and the current practice of gradient-based optimisation is notoriously unstable.

The stabilisation of GAN training is itself a recent trend of related research (e.g. see Salimans et al., 2016; Arjovsky et al., 2017). However, as the gradient-based optimisation only interacts with gradients, there is no need to use a discriminator if an accurate approximation to the intractable gradients could be obtained. As an example, consider a variational inference task with the approximate posterior defined as $\bm{z}\sim q_{\bm{\phi}}(\bm{z}|\bm{x})\Leftrightarrow\bm{\epsilon}\sim\pi(\bm{\epsilon}),\bm{z}=\bm{f}_{\bm{\phi}}(\bm{\epsilon},\bm{x})$ . Notice that the variational lower-bound can be rewritten as

in which the first term in the last line is zero (Roeder et al., 2017). As we typically assume the tractability of $\nabla_{\bm{\phi}}\bm{f}$ , an accurate approximation to $\nabla_{\bm{z}}\log q(\bm{z}|\bm{x})$ would remove the requirement of discriminators, speed-up the learning and obtain potentially a better model. Many gradient approximation techniques exist (Stone, 1985; Fan & Gijbels, 1996; Zhou & Wolfe, 2000; De Brabanter et al., 2013), and in particular, in the next section we will review kernel-based methods such as kernel density estimation (Singh, 1977) and score matching (Hyvärinen, 2005) in more detail, and motivate the main contribution of the paper.

Gradient approximation with the Stein gradient estimator

This condition holds for almost any test function if $q$ has sufficiently fast-decaying tails (e.g. Gaussian tails). Now we introduce Stein’s identity (Stein, 1981; Gorham & Mackey, 2015; Liu et al., 2016)

Observing that the gradient term $\nabla_{\bm{x}}\log q(\bm{x})$ of interest appears in Stein’s identity (5), we propose the Stein gradient estimator by inverting Stein’s identity. As the expectation in (5) is intractable, we further approximate the above with Monte Carlo (MC):

with $||\cdot||_{F}$ the Frobenius norm of a matrix and $\eta\geq 0$ . Simple calculation shows that

where $\mathbf{K}:=\mathbf{H}^{\text{T}}\mathbf{H},\quad\mathbf{K}_{ij}=\mathcal{K}(\bm{x}^{i},\bm{x}^{j}):=\bm{h}(\bm{x}^{i})^{\text{T}}\bm{h}(\bm{x}^{j}),$ $\langle\nabla,\mathbf{K}\rangle:=K\mathbf{H}^{\text{T}}\overline{\nabla_{\bm{x}}\bm{h}},\quad\langle\nabla,\mathbf{K}\rangle_{ij}=\sum_{k=1}^{K}\nabla_{x^{k}_{j}}\mathcal{K}(\bm{x}^{i},\bm{x}^{k}).$ One can show that the RBF kernel satisfies Stein’s identity (Liu et al., 2016). In this case $\bm{h}(\bm{x})=\mathcal{K}(\bm{x},\cdot),d^{\prime}=+\infty$ and by the reproducing kernel property (Berlinet & Thomas-Agnan, 2011), $\bm{h}(\bm{x})^{\text{T}}\bm{h}(\bm{x}^{\prime})=\langle\mathcal{K}(\bm{x},\cdot),\mathcal{K}(\bm{x}^{\prime},\cdot)\rangle_{\mathcal{H}}=\mathcal{K}(\bm{x},\bm{x}^{\prime}).$

2 Stein gradient estimator minimises the kernelised Stein discrepancy

In this section we derive the Stein gradient estimator again, but from a divergence/discrepancy minimisation perspective. Stein’s method also provides a tool for checking if two distributions $q(\bm{x})$ and $\hat{q}(\bm{x})$ are identical. If the test function set $\mathcal{H}$ is sufficiently rich, then one can define a Stein discrepancy measure by

see Gorham & Mackey (2015) for an example derivation. When $\mathcal{H}$ is defined as a unit ball in an RKHS induced by a kernel $\mathcal{K}(\bm{x},\cdot)$ , Liu et al. (2016) and Chwialkowski et al. (2016) showed that the supremum in (10) can be analytically obtained as (with $\mathcal{K}_{\bm{x}\bm{x}^{\prime}}$ shorthand for $\mathcal{K}(\bm{x},\bm{x}^{\prime})$ ):

which is also named the kernelised Stein discrepancy (KSD). Chwialkowski et al. (2016) showed that for $C_{0}$ -universal kernels satisfying the boundary condition, KSD is indeed a discrepancy measure: $\mathcal{S}^{2}(q,\hat{q})=0\Leftrightarrow q=\hat{q}$ . Gorham & Mackey (2017) further characterised the power of KSD on detecting non-convergence cases. Furthermore, if the kernel is twice differentiable, then using the same technique as to derive (16) one can compute KSD by

In practice KSD is estimated with samples $\{\bm{x}^{k}\}_{k=1}^{K}\sim q$ , and simple derivations show that the V-statistic of KSD can be reformulated as $\mathcal{S}_{V}^{2}(q,\hat{q})=\frac{1}{K^{2}}\text{Tr}(\hat{\mathbf{G}}^{\text{T}}\mathbf{K}\hat{\mathbf{G}}+2\hat{\mathbf{G}}^{\text{T}}\langle\nabla,\mathbf{K}\rangle)+C$ . Thus the $l_{2}$ error in (8) is equivalent to the V-statistic of KSD if $\bm{h}(\bm{x})=\mathcal{K}(\bm{x},\cdot)$ , and we have the following:

$\hat{\mathbf{G}}_{V}^{\text{\emph{Stein}}}$ is the solution of the following KSD V-statistic minimisation problem

One can also minimise the U-statistic of KSD to obtain gradient approximations, and a full derivation of which, including the optimal solution, can be found in the appendix. In experiments we use V-statistic solutions and leave comparisons between these methods to future work.

3 Comparisons to existing kernel-based gradient estimators

There exist other gradient estimators that do not require explicit evaluations of $\nabla_{\bm{x}}\log q(\bm{x})$ , e.g. the denoising auto-encoder (DAE) (Vincent et al., 2008; Vincent, 2011; Alain & Bengio, 2014) which, with infinitesimal noise, also provides an estimate of $\nabla_{\bm{x}}\log q(\bm{x})$ at convergence. However, applying such gradient estimators result in a double-loop optimisation procedure since the gradient approximation is repeatedly required for fitting implicit distributions, which can be significantly slower than the proposed approach. Therefore we focus on “quick and dirty” approximations and only include comparisons to kernel-based gradient estimators in the following.

A naive approach for gradient approximation would first estimate the intractable density $\hat{q}(\bm{x})\approx q(\bm{x})$ (up to a constant), then approximate the exact gradient by $\nabla_{\bm{x}}\log\hat{q}(\bm{x})\approx\nabla_{\bm{x}}\log q(\bm{x})$ . Specifically, Singh (1977) considered kernel density estimation (KDE) $\hat{q}(\bm{x})=\frac{1}{K}\sum_{k=1}^{K}\mathcal{K}(\bm{x},\bm{x}^{k})\times C.$ , then differentiated through the KDE estimate to obtain the gradient estimator:

Interestingly for translation invariant kernels $\mathcal{K}(\bm{x},\bm{x}^{\prime})=\mathcal{K}(\bm{x}-\bm{x}^{\prime})$ the KDE gradient estimator (14) can be rewritten as $\hat{\mathbf{G}}^{\text{KDE}}=-\text{diag}\left(\mathbf{K}\bm{1}\right)^{-1}\langle\nabla,\mathbf{K}\rangle.$ Inspecting and comparing it with the Stein gradient estimator (9), one might notice that the Stein method uses the full kernel matrix as the pre-conditioner, while the KDE method computes an averaged “kernel similarity” for the denominator. We conjecture that this difference is key to the superior performance of the Stein gradient estimator when compared to the KDE gradient estimator (see later experiments). The KDE method only collects the similarity information between $\bm{x}^{k}$ and other samples $\bm{x}^{j}$ to form an estimate of $\nabla_{\bm{x}^{k}}\log q(\bm{x}^{k})$ , whereas for the Stein gradient estimator, the kernel similarity between $\bm{x}^{i}$ and $\bm{x}^{j}$ for all $i,j\neq k$ are also incorporated. Thus it is reasonable to conjecture that the Stein method can be more sample efficient, which also implies higher accuracy when the same number of samples are collected.

3.2 Score matching gradient estimator: minimising MSE

It has been shown in Hyvärinen (2005) that this objective can be reformulated as

The comparison between the two estimators is more complicated. Certainly by the Cauchy-Schwarz inequality the Fisher divergence is stronger than KSD in terms of detecting convergence (Liu et al., 2016). However it is difficult to perform direct gradient estimation by minimising the Fisher divergence, since (i) the Dirac kernel is non-differentiable so that it is impossible to rewrite the divergence in a similar form to (12), and (ii) the transformation to (16) involves computing $\nabla_{\bm{x}}\hat{\bm{g}}(\bm{x})$ . So one needs to propose a parametric approximation to $\mathbf{G}$ and then optimise the associated parameters accordingly, and indeed Sasaki et al. (2014) and Strathmann et al. (2015) derived a parametric solution by first approximating the log density up to a constant as $\log\hat{q}(\bm{x}):=\sum_{k=1}^{K}a_{k}\mathcal{K}(\bm{x},\bm{x}^{k})+C$ , then minimising (16) to obtain the coefficients $\hat{a}^{\text{score}}_{k}$ and constructing the gradient estimator as

Therefore the usage of parametric estimation can potentially remove the advantage of using a stronger divergence. Conversely, the proposed Stein gradient estimator (9) is non-parametric in that it directly optimises over functions evaluated at locations $\{\bm{x}_{k}\}_{k=1}^{K}$ . This brings in two key advantages over the score matching gradient estimator: (i) it removes the approximation error due to the use of restricted family of parametric approximations and thus can be potentially more accurate; (ii) it has a much simpler and ubiquitous form that applies to any kernel satisfying the boundary condition, whereas the score matching estimator requires tedious derivations for different kernels repeatedly (see appendix).

In terms of computation speed, since in most of the cases the computation of the score matching gradient estimator also involves kernel matrix inversions, both estimators are of the same order of complexity, which is $\mathcal{O}(K^{3}+K^{2}d)$ (kernel matrix computation plus inversion). Low-rank approximations such as the Nyström method (Smola & Schökopf, 2000; Williams & Seeger, 2001) can enable speed-up, but this is not investigated in the paper. Again we note here that kernel-based gradient estimators can still be faster than e.g. the DAE estimator since no double-loop optimisation is required. Certainly it is possible to apply early-stopping for the inner-loop DAE fitting. However the resulting gradient approximation might be very poor, which leads to unstable training and poorly fitted implicit distributions.

4 Adding predictive power

Though providing potentially more accurate approximations, the non-parametric estimator (9) has no predictive power as described so far. Crucially, many tasks in machine learning require predicting gradient functions at samples drawn from distributions other than $q$ , for example, in MLE $q(\bm{x})$ corresponds to the model distribution which is learned using samples from the data distribution instead. To address this issue, we derive two predictive estimators, one generalised from the non-parametric estimator and the other minimises KSD using parametric approximations.

In practice one would store the computed gradient $\hat{\mathbf{G}}_{V}^{\text{Stein}}$ , the kernel matrix inverse $(\mathbf{K}+\eta\mathbf{I})^{-1}$ and $\eta$ as the “parameters” of the predictive estimator. For a new observation $\bm{y}\sim p$ in general, one can “pretend” $\bm{y}$ is a sample from $q$ and apply the above estimator as well. The approximation quality depends on the similarity between $q$ and $p$ , and we conjecture here that this similarity measure, if can be described, is closely related to the KSD.

The non-parametric predictive estimator could be computationally demanding. Setting aside the cost of fitting the “parameters”, in prediction the time complexity for the non-parametric estimator is $\mathcal{O}(K^{2}+Kd)$ . Also storing the “parameters” needs $\mathcal{O}(Kd)$ memory for $\hat{\mathbf{G}}_{V}^{\text{Stein}}$ . These costs make the non-parametric estimator undesirable for high-dimensional data, since in order to obtain accurate predictions it often requires $K$ scaling with $d$ as well. To address this, one can also minimise the KSD using parametric approximations, in a similar way as to derive the score matching estimator in Section 3.3.2. More precisely, we define a parametric approximation in a similar fashion as (17), and in the appendix we show that if the RBF kernel is used for both the KSD and the parametric approximation, then the linear coefficients $\bm{a}=(a_{1},...,a_{K})^{\text{T}}$ can be calculated analytically: $\hat{\bm{a}}_{V}^{\text{Stein}}=(\bm{\Lambda}+\eta\mathbf{I})^{-1}\bm{b}$ , where

Applications

We present some case studies that apply the gradient estimators to implicit models. Detailed settings (architecture, learning rate, etc.) are presented in the appendix. Implementation is released at https://github.com/YingzhenLi/SteinGrad.

We first consider a simple synthetic example to demonstrate the accuracy of the proposed gradient estimator. More precisely we consider the kernel induced Hamiltonian flow (not an exact sampler) (Strathmann et al., 2015) on a 2-dimensional banana-shaped object: $\bm{x}\sim\mathcal{B}(\bm{x};b=0.03,v=100)\Leftrightarrow x_{1}\sim\mathcal{N}(x_{1};0,v),x_{2}=\epsilon+b(x_{1}^{2}-v),\epsilon\sim\mathcal{N}(\epsilon;0,1)$ . The approximate Hamiltonian flow is constructed using the same operator as in Hamiltonian Monte Carlo (HMC) (Neal et al., 2011), except that the exact score function $\nabla_{\bm{x}}\log\mathcal{B}(\bm{x})$ is replaced by the approximate gradients. We still use the exact target density to compute the rejection step as we mainly focus on testing the accuracy of the gradient estimators. We test both versions of the predictive Stein gradient estimator (see section 3.4) since we require the particles of parallel chains to be independent with each other. We fit the gradient estimators on $K=200$ training datapoints from the target density. The bandwidth of the RBF kernel is computed by the median heuristic and scaled up by a scalar between $ $. All three methods are simulated for$ T=2,000$ iterations, share the same initial locations that are constructed by target distribution samples plus Gaussian noises of standard deviation 2.0, and the results are averaged over 200 parallel chains.

2 Meta-learning of approximate posterior samplers for Bayesian NNs

One of the recent focuses on meta-learning has been on learning optimisers for training deep neural networks, e.g. see (Andrychowicz et al., 2016). Could analogous goals be achieved for approximate inference? In this section we attempt to learn an approximate posterior sampler for Bayesian neural networks (Bayesian NNs, BNNs) that generalises to unseen datasets and architectures. A more detailed introduction of Bayesian neural networks is included in the appendix, and in a nutshell, we consider a binary classification task: $p(y=1|\bm{x},\bm{\theta})=\text{sigmoid}(\text{NN}_{\bm{\theta}}(\bm{x}))$ , $p_{0}(\bm{\theta})=\mathcal{N}(\bm{\theta};\bm{0},\mathbf{I})$ . After observing the training data $\mathcal{D}=\{(\bm{x}_{n},y_{n})\}_{n=1}^{N}$ , we first obtain the approximate posterior $q_{\bm{\phi}}(\bm{\theta})\approx p(\bm{\theta}|\mathcal{D})\propto p_{0}(\bm{\theta})\prod_{n=1}^{N}p(y_{n}|\bm{x}_{n},\bm{\theta})$ , then approximate the predictive distribution for a new observation as $p(y^{*}=1|\bm{x}^{*},\mathcal{D})\approx\frac{1}{K}\sum_{k=1}^{K}p(y^{*}=1|\bm{x}^{*},\bm{\theta}^{k}),\bm{\theta}^{k}\sim q_{\bm{\phi}}(\bm{\theta}).$ In this task we define an implicit approximate posterior distribution $q_{\bm{\phi}}(\bm{\theta})$ as the following stochastic normalising flow (Rezende & Mohamed, 2015) $\bm{\theta}_{t+1}=\bm{f}(\bm{\theta}_{t},\nabla_{t},\bm{\epsilon}_{t})$ : given the current location $\bm{\theta}_{t}$ and the mini-batch data $\{(\bm{x}_{m},y_{m})\}_{m=1}^{M}$ , the update for the next step is

We briefly describe the test protocol. We take from the UCI repository (Lichman, 2013) six binary classification datasets (australian, breast, crabs, ionosphere, pima, sonar), train an approximate sampler on crabs with a small neural network that has one 20-unit hidden layer with ReLU activation, and generalise to the remaining datasets with a bigger network that has 50 hidden units and uses sigmoid activation. We use ionosphere as the validation set to tune $\zeta$ . The remaining 4 datasets are further split into 40% training subset for simulating samples from the approximate sampler, and 60% test subsets for evaluating the sampler’s performance.

Figure 3 presents the (negative) test log-likelihood (LL), classification error, and an estimate of the KSD U-statistic $\mathcal{S}^{2}_{U}(p(\bm{\theta}|\mathcal{D}),q(\bm{\theta}))$ (with data sub-sampling) over 5 splits of each test dataset. Besides the gradient estimators we also compare with two baselines: an approximate posterior sampler trained by maximum a posteriori (MAP), and stochastic gradient Langevin dynamics (SGLD) (Welling & Teh, 2011) evaluated on the test datasets directly. In summary, SGLD returns best results in KSD metric. The Stein approach performs equally well or a little better than SGLD in terms of test-LL and test error. The KDE method is slightly worse and is close to MAP, indicating that the KDE estimator does not provide a very informative gradient for the entropy term. Surprisingly the score matching estimator method produces considerably worse results (except for breast dataset), even after carefully tuning the bandwidth and the regularisation parameter $\eta$ . Future work should investigate the usage of advanced recurrent neural networks such as an LSTM (Hochreiter & Schmidhuber, 1997), which is expected to return better performance.

3 Towards addressing mode collapse in GANs using entropy regularisation

GANs are notoriously difficult to train in practice. Besides the instability of gradient-based minimax optimisation which has been partially addressed by many recent proposals (Salimans et al., 2016; Arjovsky et al., 2017; Berthelot et al., 2017), they also suffer from mode collapse. We propose adding an entropy regulariser to the GAN generator loss. Concretely, assume the generative model $p_{\bm{\theta}}(\bm{x})$ is implicitly defined by $\bm{x}=\bm{f}_{\bm{\theta}}(\bm{z}),\bm{z}\sim p_{0}(\bm{z})$ , then the generator’s loss is defined by

where $\mathcal{J}_{\text{gen}}(\bm{\theta})$ is the original loss function for the generator from any GAN algorithm and $\alpha$ is a hyper-parameter. In practice (the gradient of) (21) is estimated using Monte Carlo.

We empirically investigate the entropy regularisation idea on the very recently proposed boundary equilibrium GAN (BEGAN) (Berthelot et al., 2017) method using (continuous) MNIST, and we refer to the appendix for the detailed mathematical set-up. In this case the non-parametric V-statistic Stein gradient estimator is used. We use a convolutional generative network and a convolutional auto-encoder and select the hyper-parameters of BEGAN $\gamma\in\{0.3,0.5,0.7\}$ , $\alpha\in$ and $\lambda=0.001$ . The Epanechnikov kernel $\mathcal{K}(\bm{x},\bm{x}^{\prime}):=\frac{1}{d}\sum_{j=1}^{d}(1-(x_{j}-x^{\prime}_{j})^{2})$ is used as the pixel values lie in a unit interval (see appendix for the expression of the score matching estimator), and to ensure the boundary condition we clip the pixel values into range $[10^{-8},1-10^{-8}]$ . The generated images are visualised in Figure 5. BEGAN without the entropy regularisation fails to generate diverse samples even when trained with learning rate decay. The other three images clearly demonstrate the benefit of the entropy regularisation technique, with the Stein approach obtaining the highest diversity without compromising visual quality.

Concerning computation speed, all the three methods are of the same order: 10.20s/epoch for KDE, 10.85s/epoch for Score, and 10.30s/epoch for Stein.All the methods are timed on a machine with an NVIDIA GeForce GTX TITAN X GPU. This is because $K<d$ (in the experiments $K=100$ and $d=784$ ) so that the complexity terms are dominated by kernel computations ( $\mathcal{O}(K^{2}d)$ ) required by all the three methods. Also for a comparison, the original BEGAN method without entropy regularisation runs for 9.05s/epoch. Therefore the main computation cost is dominated by the optimisation of the discriminator/generator, and the proposed entropy regularisation can be applied to many GAN frameworks with little computational burden.

Conclusions and future work

We have presented the Stein gradient estimator as a novel generalisation to the score matching gradient estimator. With a focus on learning implicit models, we have empirically demonstrated the efficacy of the proposed estimator by showing how it opens the door to a range of novel learning tasks: approximating gradient-free MCMC, meta-learning for approximate inference, and unsupervised learning for image generation. Future work will expand the understanding of gradient estimators in both theoretical and practical aspects. Theoretical development will compare both the V-statistic and U-statistic Stein gradient estimators and formalise consistency proofs. Practical work will improve the sample efficiency of kernel estimators in high dimensions and develop fast yet accurate approximations to matrix inversion. It is also interesting to investigate applications of gradient approximation methods to training implicit generative models without the help of discriminators. Finally it remains an open question that how to generalise the Stein gradient estimator to non-kernel settings and discrete distributions.

Acknowledgement

We thank Marton Havasi, Jiri Hron, David Janz, Qiang Liu, Maria Lomeli, Cuong Viet Nguyen and Mark Rowland for their comments and helps on the manuscript. We also acknowledge the anonymous reviewers for their review. Yingzhen Li thanks Schlumberger Foundation FFTF fellowship. Richard E. Turner thanks Google and EPSRC grants EP/M0269571 and EP/L000776/1.

References

Appendix A Score matching estimator: remarks and derivations

In this section we provide more discussions and analytical solutions for the score matching estimator. More specifically, we will derive the linear coefficient $\bm{a}=(a_{1},...,a_{K})$ for the case of the Epanechnikov kernel.

Remark. As a side note, score matching can also be used to learn the parameters of an unnormalised density. In this case the target distribution $q$ would be the data distribution and $\hat{q}$ is often a Boltzmann distribution with intractable partition function. As a parameter estimation technique, score matching is also related to contrastive divergence (Hinton, 2002), pseudo likelihood estimation (Hyvärinen, 2006), and DAEs (Vincent, 2011; Alain & Bengio, 2014). Generalisations of score matching methods are also presented in e.g. Lyu (2009); Marlin et al. (2010).

A.2 The RBF kernel case

The derivations for the RBF kernel case is referred to (Strathmann et al., 2015), and for completeness we include the final solutions here. Assume the parametric approximation is defined as $\log\hat{q}(\bm{x})=\sum_{k=1}^{K}a_{k}\mathcal{K}(\bm{x},\bm{x}^{k})+C$ , where the RBF kernel uses bandwidth parameter $\sigma$ . then the optimal solution of the coefficients $\hat{\bm{a}}^{\text{score}}=(\mathbf{\Sigma}+\eta\mathbf{I})^{-1}\bm{v}$ , with

A.3 The Epanechnikov kernel case

The Epanechnikov kernel is defined as $\mathcal{K}(\bm{x},\bm{x}^{\prime})=\frac{1}{d}\sum_{i=1}^{d}(1-(x_{i}-x^{\prime}_{i})^{2})$ , where the first and second order gradients w.r.t. $x_{i}$ is

Thus the score matching objective with $\log\hat{q}(\bm{x})=\sum_{k=1}^{K}a_{k}\mathcal{K}(\bm{x},\bm{x}^{k})+C$ is reduced to

Thus with an $l_{2}$ regulariser, the fitted coefficients are

Appendix B Stein gradient estimator: derivations

The V-statistic of KSD is the following: given samples $\bm{x}^{k}\sim q,k=1,...,K$ and recall $\mathbf{K}_{jl}=\mathcal{K}(\bm{x}^{j},\bm{x}^{l})$

The last term $\nabla_{\bm{x}^{j},\bm{x}^{l}}\mathbf{K}_{jl}$ will be ignored as it does not depend on the approximation $\hat{\bm{g}}$ . Using matrix notations defined in the main text, readers can verify that the V-statistic can be computed as

Using the cyclic invariance of matrix trace leads to the desired result in the main text. The U-statistic of KSD removes terms indexed by $j=l$ in (23), in which the matrix form is

with the $j$ th row of $\nabla\text{diag}(\mathbf{K})$ defined as $\nabla_{\bm{x}^{j}}\mathcal{K}(\bm{x}^{j},\bm{x}^{j})$ . For most translation invariant kernels this extra term $\nabla\text{diag}(\mathbf{K})=\bm{0}$ , thus the optimal solution of $\hat{\mathbf{G}}$ by minimising KSD U-statistic is

B.2 Deriving the non-parametric predictive estimator

with $\nabla_{\bm{y}}\mathcal{K}(\cdot,\bm{y})$ denoting a $K\times d$ matrix with rows $\nabla_{\bm{y}}\mathcal{K}(\bm{x}^{k},\bm{y})$ , and $\nabla_{\bm{y}}\mathcal{K}(\bm{y},\bm{y})$ only differentiates through the second argument. Thus by simple matrix calculations, we have:

For translation invariant kernels, typically $\nabla_{\bm{y}}\mathcal{K}(\bm{y},\bm{y})=\bm{0}$ , and more conveniently,

The solution for the U-statistic case can be derived accordingly which we omit here.

B.3 Parametric Stein gradient estimator with the RBF kernel

We define a parametric approximation in a similar way as for the score matching estimator:

Now we show the optimal solution of $\bm{a}=(a_{1},...,a_{K})^{\text{T}}$ by minimising (23). To simplify derivations we assume the approximation and KSD use the same kernel. First note that the gradient of the RBF kernel is

And thus $\clubsuit=\frac{1}{\sigma^{4}}\bm{a}^{\text{T}}\mathbf{\Lambda}\bm{a}$ . Similarly the summation over $j$ , $l$ in $\spadesuit$ can be simplified into

which leads to $\spadesuit=-\frac{1}{\sigma^{4}}\bm{a}^{\text{T}}\bm{b}$ . Thus minimising $\mathcal{S}_{V}^{2}(q,\hat{q})$ plus an $l_{2}$ regulariser returns the Stein estimator $\hat{\bm{a}}_{V}^{\text{Stein}}$ in the main text.

Summing over the $j$ indices for the second term, we have

Appendix C More details on the experiments

We describe the detailed experimental set-up in this section. All experiments use Adam optimiser (Kingma & Ba, 2015) with standard parameter settings.

We start by reviewing Bayesian neural networks with binary classification as a running example. In this task, a normal deep neural network is constructed to predict $y=\bm{f}_{\bm{\theta}}(\bm{x})$ , and the neural network is parameterised by a set of weights (and bias vectors which we omit here for simplicity) $\bm{\theta}=\{\mathbf{W}^{l}\}_{l=1}^{L}$ . In the Bayesian framework these network weights are treated as random variables, and a prior distribution, e.g. Gaussian, is also attached to them: $p_{0}(\bm{\theta})=\mathcal{N}(\bm{\theta};\bm{0},\mathbf{I})$ . The likelihood function of $\bm{\theta}$ is then defined as

and $p(y=0|\bm{x},\bm{\theta})=1-p(y=1|\bm{x},\bm{\theta})$ accordingly. One can show that the usage of Bernoulli distribution here corresponds to applying cross entropy loss for training.

After framing the deep neural network as a probabilistic model, a Bayesian approach would find the posterior of the network weights $p(\bm{\theta}|\mathcal{D})$ and use the uncertainty information encoded in it for future predictions. By Bayes’ rule, the exact posterior is

and the predictive distribution for a new input $\bm{x}^{*}$ is

Again the exact posterior is intractable, and approximate inference would fit an approximate posterior distribution $q_{\bm{\phi}}(\bm{\theta})$ parameterised by the variational parameters $\bm{\phi}$ to the exact posterior, and then use it to compute the (approximate) predictive distribution.

Since in practice analytical integration for neural network weights is also intractable, the predictive distribution is further approximated by Monte Carlo:

Now it remains to fit the approximate posterior $q_{\bm{\phi}}(\bm{\theta})$ , and in the experiment the approximate posterior is implicitly constructed by a stochastic flow. For the training task, we use a one hidden layer neural network with 20 hidden units to compute the noise variance and the moving direction of the next update. In a nutshell it takes the $i$ th coordinate of the current position and the gradient $\bm{\theta}_{t}(i),\nabla_{t}(i)$ as the inputs, and output the corresponding coordinate of the moving direction $\Delta_{\bm{\phi}}(\bm{\theta}_{t},\nabla_{t})(i)$ and the noise variance $\bm{\sigma}_{\bm{\phi}}(\bm{\theta}_{t},\nabla_{t})(i)$ . Softplus non-linearity is used for the hidden layer and to compute the noise variance we apply ReLU activation to ensure non-negativity. The step-size $\zeta$ is selected as 1e-5 which is tuned on the KDE approach. For SGLD step-size 1e-5 also returns overall good results.

The training process is the following. We simulate the approximate sampler for 10 transitions and sum over the variational lower-bounds computed on the samples of every step. Concretely, the maximisation objective is

where $T=100$ and $q_{t}(\bm{\theta})$ is implicitly defined by the marginal distribution of $\bm{\theta}_{t}$ that is dependent on $\bm{\phi}$ . In practice the variational lower-bound $\mathcal{L}_{\text{VI}}(q_{t})$ is further approximated by Monte Carlo and data sub-sampling:

The MAP baseline considers an alternative objective function by removing the $\log q_{t}(\bm{\theta}_{t})$ term from the above MC-VI objective.

Truncated back-propagation is applied for every 10 steps in order to avoid vanishing/exploding gradients. The simulated samples at time $T$ are stored to initialise the Markov chain for the next iteration, and for every 50 iterations we restart the simulation by randomly sampling the locations from the prior. Early stopping is applied using the validation dataset, and the learning rate is set to 0.001, the number of epochs is set to 500.

We perform hyper-parameter search for the kernel, i.e. a grid search on the bandwidth $\sigma^{2}\in\{0.25,1.0,4.0,10.0,\text{median trick}\}$ and $\eta\in\{0.1,0.5,1.0,2.0\}$ . We found the median heuristic is sufficient for the KDE and Stein approaches. However, we failed to obtain desirable results using the score matching estimator with median heuristics, and for other settings the score matching approach underperforms when compared to KDE and Stein methods.

C.2 BEGAN experiments

In this section we describe the experimental details of the BEGAN experiment, but first we introduce the mathematical idea and discuss how the entropy regulariser is applied.

Assume the generator is implicitly defined: $\bm{x}\sim p_{\bm{\theta}}(\bm{x})\leftrightarrow\bm{x}=\bm{f}_{\bm{\theta}}(\bm{z}),\bm{z}\sim p_{0}(\bm{z})$ . In BEGAN the discriminator is defined as an auto-encoder $D_{\bm{\varphi}}(\bm{x})$ that reconstructs the input $\bm{x}$ . After selecting a ratio parameter $\gamma>0$ , a control rate $\beta_{0}$ initialised at 0, and a “learning rate” $\lambda>0$ for the control rate, the loss functions for the generator $\bm{x}=\bm{f}_{\bm{\theta}}(\bm{z}),\bm{z}\sim p_{0}(\bm{z})$ and the discriminator are: