Sliced Score Matching: A Scalable Approach to Density and Score Estimation

Yang Song, Sahaj Garg, Jiaxin Shi, Stefano Ermon

INTRODUCTION

Score matching (Hyvärinen, 2005) is particularly suitable for learning unnormalized statistical models, such as energy based ones. It is based on minimizing the distance between the derivatives of the log-density functions (a.k.a., scores) of the data and model distributions. Unlike maximum likelihood estimation (MLE), the objective of score matching only depends on the scores, which are oblivious to the (usually) intractable partition functions. However, score matching requires the computation of the diagonal elements of the Hessian of the model’s log-density function. This Hessian trace computation is generally expensive (Martens et al., 2012), requiring a number of forward and backward propagations proportional to the data dimension. This severely limits its applicability to complex models parameterized by deep neural networks, such as deep energy-based models (LeCun et al., 2006; Wenliang et al., 2019).

Several approaches have been proposed to alleviate this difficulty: Kingma & LeCun (2010) propose approximate backpropagation for computing the trace of the Hessian; Martens et al. (2012) develop curvature propagation, a fast stochastic estimator for the trace in score matching; and Vincent (2011) transforms score matching to a denoising problem which avoids second-order derivatives. These methods have achieved some success, but may suffer from one or more of the following problems: inconsistent parameter estimation, large estimation variance, and cumbersome implementation.

To alleviate these problems, we propose sliced score matching, a variant of score matching that can scale to deep unnormalized models and high dimensional data. The key intuition is that instead of directly matching the high-dimensional scores, we match their projections along random directions. Theoretically, we show that under some regularity conditions, sliced score matching is a well-defined statistical estimation criterion that yields consistent and asymptotically normal parameter estimates. Moreover, compared to the methods of Kingma & LeCun (2010) and Martens et al. (2012), whose implementations require customized backpropagation for deep networks, sliced score matching only involves Hessian-vector products, thus can be easily and efficiently implemented in frameworks such as TensorFlow (Abadi et al., 2016) and PyTorch (Adam et al., 2017).

Beyond training unnormalized models, sliced score matching can also be naturally adapted as an objective for estimating the score function of a data generating distribution (Sasaki et al., 2014; Strathmann et al., 2015) by training a score function model parameterized by deep neural networks. This observation enables many new applications of sliced score matching. For example, we show that it can be used to provide accurate score estimates needed for variational inference with implicit distributions (Huszár, 2017) and learning Wasserstein Auto-Encoders (WAE, Tolstikhin et al. (2018)).

Finally, we evaluate the performance of sliced score matching on learning unnormalized statistical models (density estimation) and estimating score functions of a data generating process (score estimation). For density estimation, experiments on deep kernel exponential families (Wenliang et al., 2019) and NICE flow models (Dinh et al., 2015) show that our method is either more scalable or more accurate than existing score matching variants.

For score estimation, our method improves the performance of variational auto-encoders (VAE) with implicit encoders, and can train WAEs without a discriminator or MMD loss by directly optimizing the KL divergence between aggregated posteriors and the prior. In both situations we outperformed kernel-based score estimators (Li & Turner, 2018; Shi et al., 2018) by achieving better test likelihoods and better sample quality in image generation.

BACKGROUND

Learning unnormalized models with maximum likelihood estimation (MLE) can be difficult due to the intractable partition function ZθZ_{\boldsymbol{\theta}}. To avoid this, score matching (Hyvärinen, 2005) minimizes the Fisher divergence between pdp_{d} and pm(,θ)p_{m}(\cdot,{\boldsymbol{\theta}}), which is defined as

Since sm(x;θ)\mathbf{s}_{m}(\mathbf{x};{\boldsymbol{\theta}}) does not involve ZθZ_{\boldsymbol{\theta}}, the Fisher divergence does not depend on the intractable partition function. However, Eq. (1) is still not readily usable for learning unnormalized models, as we only have samples and do not have access to the score function of the data sd(x)\mathbf{s}_{d}(\mathbf{x}).

2 SCORE ESTIMATION FOR IMPLICIT DISTRIBUTIONS

Besides parameter estimation in unnormalized models, score matching can also be used to estimate scores of implicit distributions, which are distributions that have a tractable sampling process but without a tractable density. For example, the distribution of random samples from the generator of a GAN (Goodfellow et al., 2014) is an implicit distribution. Implicit distributions can arise in many more situations such as the marginal distribution of a non-conjugate model (Sun et al., 2019), and models defined by complex simulation processes (Tran et al., 2017). In many cases learning and inference become intractable due to the need of optimizing an objective that involves the intractable density.

In these cases, directly estimating the score function sq(x)=xlogqθ(x)\mathbf{s}_{q}(\mathbf{x})=\nabla_{\mathbf{x}}\log q_{\boldsymbol{\theta}}(\mathbf{x}) based on i.i.d. samples from an implicit density qθ(x)q_{\boldsymbol{\theta}}(\mathbf{x}) can be useful. For example, suppose our learning problem involves optimizing the entropy H(qθ(x))H(q_{\boldsymbol{\theta}}(\mathbf{x})) of an implicit distribution. This situation is common when dealing with variational free energies (Kingma & Welling, 2014). Suppose xqθ\mathbf{x}\sim q_{{\boldsymbol{\theta}}} can be reparameterized as x=gθ(ϵ)\mathbf{x}=g_{{\boldsymbol{\theta}}}({\boldsymbol{\epsilon}}), where ϵ{\boldsymbol{\epsilon}} is a simple random variable independent of θ{\boldsymbol{\theta}}, such as a standard normal, and gθg_{\boldsymbol{\theta}} is a deterministic mapping. We can write the gradient of the entropy with respect to θ{\boldsymbol{\theta}} as

where θgθ(ϵ)\nabla_{\boldsymbol{\theta}}g_{\boldsymbol{\theta}}({\boldsymbol{\epsilon}}) is usually easy to compute. The score xlogqθ(x)\nabla_{\mathbf{x}}\log q_{\boldsymbol{\theta}}(\mathbf{x}) is intractable but can be approximated by score estimation.

Score matching is an attractive solution for score estimation since (1) naturally serves as an objective to measure the difference between the a trainable score function and the score of a data generating process. We will discuss this in more detail in Section 3.2 and mention some other approaches of score estimation in Section 5.2.

DENSITY AND SCORE ESTIMATION WITH SLICED SCORE MATCHING

We observe that one dimensional problems are usually much easier to solve than high dimensional ones. Inspired by the idea of Sliced Wasserstein distance (Rabin et al., 2012), we consider projecting sd(x)\mathbf{s}_{d}(\mathbf{x}) and sm(x;θ)\mathbf{s}_{m}(\mathbf{x};{\boldsymbol{\theta}}) onto some random direction v\mathbf{v} and propose to compare their average difference along that random direction. More specifically, we consider the following objective as a replacement of the Fisher divergence L(θ)L({\boldsymbol{\theta}}) in Eq. (1):

To eliminate the dependence of L(θ;pv)L({\boldsymbol{\theta}};p_{\mathbf{v}}) on sd(x)\mathbf{s}_{d}(\mathbf{x}), we use integration by parts as in score matching (cf., the equivalence between Eq. (1) and (2)). Defining

the equivalence is summarized in the following theorem.

Under some regularity conditions (Assumption 1-3 in Appendix B.2), we have

Other than our requirements on pvp_{\mathbf{v}}, the assumptions are exactly the same as in Theorem 1 of Hyvärinen (2005). We advise the interested readers to read Appendix B.2 for technical statements of the assumptions and a rigorous proof of the theorem.

In practice, given a dataset x1N\mathbf{x}_{1}^{N}, we draw MM projection vectors independently for each point xi\mathbf{x}_{i} from pvp_{\mathbf{v}}. The collection of all such vectors {vij}1iN,1jM\{\mathbf{v}_{ij}\}_{1\leq i\leq N,1\leq j\leq M} are abbreviated as v11NM\mathbf{v}_{11}^{NM}.We then use the following unbiased estimator of J(θ;pv)J({\boldsymbol{\theta}};p_{\mathbf{v}})

For sliced score matching, the second derivative term vxsm(x;θ)v\mathbf{v}^{\intercal}\nabla_{\mathbf{x}}\mathbf{s}_{m}(\mathbf{x};{\boldsymbol{\theta}})\mathbf{v} is much less computationally expensive than tr(xsm(x;θ))\operatorname{tr}(\nabla_{\mathbf{x}}\mathbf{s}_{m}(\mathbf{x};{\boldsymbol{\theta}})). Using auto-differentiation systems that support higher order gradients, we can compute it using two gradient operations for a single v\mathbf{v}, as shown in Alg. 1. Similarly, when there are MM v\mathbf{v}’s, the total number of gradient operations is M+1M+1. In contrast, assuming the dimension of x\mathbf{x} is DD and DMD\gg M, we typically need D+1D+1 gradient operations to compute tr(xsm(x;θ))\operatorname{tr}(\nabla_{\mathbf{x}}\mathbf{s}_{m}(\mathbf{x};{\boldsymbol{\theta}})) because each diagonal entry of xsm(x;θ)\nabla_{\mathbf{x}}\mathbf{s}_{m}(\mathbf{x};{\boldsymbol{\theta}}) needs to be computed separately (see Alg. 2 in the appendix).

In practice, we can tune MM to trade off variance and computational cost. In our experiments, we find that oftentimes M=1\boxed{M=1} is already a good choice.

2 SLICED SCORE ESTIMATION

As mentioned in section 2.2, the task of score estimation is to estimate xlogq(x)\nabla_{\mathbf{x}}\log q(\mathbf{x}) at some test point x\mathbf{x}, given a set of samples x1Ni.i.d. missingq(x)\mathbf{x}_{1}^{N}\overset{\text{i.i.d.\hbox{} missing}}{\sim}q(\mathbf{x}). In what follows, we show how our sliced score matching objective J^(θ;x1N,v11NM)\hat{J}({\boldsymbol{\theta}};\mathbf{x}_{1}^{N},\mathbf{v}_{11}^{NM}) can be straightforwardly adapted for this task.

Using a similar argument of integration by parts (cf., Eq. (4), (5) and (6)), we have

which implies that minimizing J^(θ;x1N,v11NM)\hat{J}({\boldsymbol{\theta}};\mathbf{x}_{1}^{N},\mathbf{v}_{11}^{NM}) with sm(x;θ)\mathbf{s}_{m}(\mathbf{x};{\boldsymbol{\theta}}) replaced by h(x;θ)\mathbf{h}(\mathbf{x};{\boldsymbol{\theta}}) is approximately minimizing the average projected difference between h(x;θ)\mathbf{h}(\mathbf{x};{\boldsymbol{\theta}}) and xlogq(x)\nabla_{\mathbf{x}}\log q(\mathbf{x}). Hence, h(x;θ^)\mathbf{h}(\mathbf{x};\hat{{\boldsymbol{\theta}}}) should be close to xlogq(x)\nabla_{\mathbf{x}}\log q(\mathbf{x}) and can serve as a score estimator.

THEORETICAL ANALYSIS

In this section, we present several theoretical results to justify sliced score matching as a principled objective. We also discuss the connection of sliced score matching to other methods. For readability, we will defer rigorous statements of assumptions and theorems to the Appendix.

One important question to ask for any statistical estimation objective is whether the estimated parameter is consistent under reasonable assumptions. Our results confirm that for any MM, the objective J^(θ;x1N,v11NM)\hat{J}({\boldsymbol{\theta}};\mathbf{x}_{1}^{N},\mathbf{v}_{11}^{NM}) is consistent under suitable assumptions as NN\to\infty.

We first prove that J(θ;pv)=0θ=θJ({\boldsymbol{\theta}};p_{\mathbf{v}})=0\Leftrightarrow{\boldsymbol{\theta}}={\boldsymbol{\theta}}^{*} (Lemma 1 and Theorem 1). Then we prove the uniform convergence of J^(θ;x1N,v11NM)\hat{J}({\boldsymbol{\theta}};\mathbf{x}_{1}^{N},\mathbf{v}_{11}^{NM}) (Lemma 3), which holds regardless of MM. These two facts lead to consistency. For a complete proof, see Appendix B.3. ∎

In Hyvärinen (2005), the authors only showed that J(θ)=0θ=θJ({\boldsymbol{\theta}})=0\Leftrightarrow{\boldsymbol{\theta}}={\boldsymbol{\theta}}^{*}, which leads to “local consistency” of score matching. This is a weaker notion of consistency compared to our settings.

2 ASYMPTOTIC NORMALITY

Asymptotic normality results can be very useful for approximate hypothesis testing and comparing different estimators. Below we show that θ^N,M\hat{{\boldsymbol{\theta}}}_{N,M} is asymptotically normal when NN\to\infty.

In addition to the assumptions in Section 4.1, we need an extra assumption to prove asymptotic normality. We require pm(x;θ)p_{m}(\mathbf{x};{\boldsymbol{\theta}}) to have a stronger Lipschitz property (Assumption 9).

For simplicity, we denote xh(x)x=x\nabla_{\mathbf{x}}h(\mathbf{x})|_{\mathbf{x}=\mathbf{x}^{\prime}} as xh(x)\nabla_{\mathbf{x}}h(\mathbf{x}^{\prime}), where h()h(\cdot) is an arbitrary function. In the following, we will only show the asymptotic normality result for a specific pvp_{\mathbf{v}}. More general results are in Appendix B.4.

Assume the assumptions discussed above hold. If pvp_{\mathbf{v}} is the multivariate Rademacher distribution, we have

Here DD is the dimension of data; VijV_{ij} and WijW_{ij} are p.s.d matrices depending on pm(x;θ)p_{m}(\mathbf{x};{\boldsymbol{\theta}}^{*}), and their definitions can be found in Appendix B.4.

Once we get the consistency (Theorem 2), the rest closely follows the proof of asymptotic normality of MLE (van der Vaart, 1998). A rigorous proof can be found in Appendix B.4. ∎

As expected, larger MM will lead to smaller asymptotic variance, as can be seen in Eq. (9).

As far as we know, there is no published proof of asymptotic normality for the standard (not sliced) score matching. However, by using the same techniques in our proofs, and under similar assumptions, we can conclude that the asymptotic variance of the score matching estimator is [θ2J(θ)]1(ijVij)[θ2J(θ)]1[\nabla_{\boldsymbol{\theta}}^{2}J({\boldsymbol{\theta}}^{*})]^{-1}\left(\sum_{ij}V_{ij}\right)[\nabla_{\boldsymbol{\theta}}^{2}J({\boldsymbol{\theta}}^{*})]^{-1} (Corollary 1), which will always be smaller than sliced score matching with multivariate Rademacher projections. However, the gap reduces when MM increases.

3 CONNECTION TO OTHER METHODS

Sliced score matching is widely connected to many other methods, and can be motivated from some different perspectives. Here we discuss a few of them.

Noise contrastive estimation (NCE), proposed by Gutmann & Hyvärinen (2010), is another principle for training unnormalized statistical models. The method works by comparing pm(x;θ)p_{m}(\mathbf{x};{\boldsymbol{\theta}}) with a noise distribution pn(x)p_{n}(\mathbf{x}). We consider a special form of NCE which minimizes the following objective

where h(x;θ)pm(x;θ)pm(x;θ)+pm(xv;θ)h(\mathbf{x};{\boldsymbol{\theta}})\triangleq\frac{p_{m}(\mathbf{x};{\boldsymbol{\theta}})}{p_{m}(\mathbf{x};{\boldsymbol{\theta}})+p_{m}(\mathbf{x}-\mathbf{v};{\boldsymbol{\theta}})}, and we choose pn(x)=pd(x+v)p_{n}(\mathbf{x})=p_{d}(\mathbf{x}+\mathbf{v}). Assuming v2\|\mathbf{v}\|_{2} to be small, Eq. (10) can be written as the following by Taylor expansion

For derivation, see Proposition 1 in the appendix. A similar derivation can also be found in Gutmann & Hirayama (2011). As a result of (11), if we choose some pvp_{\mathbf{v}} and take the expectation of (10) w.r.t. pvp_{\mathbf{v}}, the objective will be equivalent to sliced score matching whenever v20\left\lVert\mathbf{v}\right\rVert_{2}\approx 0.

which is exactly the sliced score matching objective with variance reduction J^vr(θ;x1N,v11NM)\hat{J}_{\text{vr}}({\boldsymbol{\theta}};\mathbf{x}_{1}^{N},\mathbf{v}_{11}^{NM}).

RELATED WORK

To the best of our knowledge, there are three existing methods that are able to scale up score matching to learning deep models on high dimensional data.

which can be estimated without computing any Hessian. Although denoising score matching is much faster than score matching, it has obvious drawbacks. First, it can only recover the noise corrupted data distribution. Furthermore, choosing the parameters of the noise distribution is highly non-trivial and in practice the performance can be very sensitive to σ\sigma, and heuristics have to be used in practice. For example, Saremi et al. (2018) propose a heuristic for choosing σ\sigma based on Parzen windows.

Kingma & LeCun (2010) propose a backpropagation method to approximately compute the trace of the Hessian. Because the backpropagation of the full Hessian scales quadratically w.r.t. the layer size, the authors limit backpropagation only to the diagonal so that it has the same cost as the usual backpropagation for computing loss gradients. However, there are no theoretical guarantees for the approximation errors. In fact, the authors only did experiments on networks with a single hidden layer, in which case the approximation is exact. Moreover, there is no direct support for the proposed approximate backpropagation method in modern automatic differentiation frameworks.

Martens et al. (2012) estimate the trace term in score matching by applying curvature propagation to compute an unbiased, complex-valued estimator of the diagonal of the Hessian. The work claims that curvature propagation will have the smallest variance among a class of estimators, which includes the Hutchinson’s estimator. However, their proof evaluates the pseudo-variance of the complex-valued estimator instead of the variance. In practice, curvature propagation can have large variance when the number of nodes in the network is large, because it introduces noise for each node in the network. Finally, implementing curvature propagation also requires manually modifying the backpropagation code, handling complex numbers in neural networks, and will be inefficient for networks of more general structures, such as recurrent neural networks.

2 KERNEL SCORE ESTIMATORS

Two prior methods for score estimation are based on a generalized form of Stein’s identity (Stein, 1981; Gorham & Mackey, 2017):

where q(x)q(\mathbf{x}) is a continuously differentiable density and h(x)\mathbf{h}(\mathbf{x}) is a function satisfying some regularity conditions. Li & Turner (2018) propose to set h(x)\mathbf{h}(\mathbf{x}) as the feature map of some kernel in the Stein class (Liu et al., 2016) of qq, and solve a finite-sample version of (12) to obtain estimates of xlogq(x)\nabla_{\mathbf{x}}\log q(\mathbf{x}) at the sample points. We refer to this method as Stein in the experiments. Shi et al. (2018) adopt a different approach but also exploit (12). They build their estimator by expanding xlogq(x)\nabla_{\mathbf{x}}\log q(\mathbf{x}) as a spectral series of eigenfunctions and solve for the coefficients using (12). Compared to Stein, their method is argued to have theoretical guarantees and principled out-of-sample estimation at an unseen test point. We refer to their method as Spectral in the experiments.

EXPERIMENTS

Our experiments include two parts: (1) to test the effectiveness of sliced score matching (SSM) in learning deep models for density estimation, and (2) to test the ability of SSM in providing score estimates for applications such as VAEs with implicit encoders and WAEs. Unless specified explicitly, we choose M=1M=1 by default.

We evaluate SSM and its variance-reduced version (SSM-VR) for density estimation and compare against score matching (SM) and its three scalable baselines: denoising score matching (DSM), approximate backpropagation (approx BP), and curvature propagation (CP). All SSM methods in this section use the multivariate Rademacher distribution as pvp_{\mathbf{v}}. Our results demonstrate that: (1) SSM is comparable in performance to SM, (2) SSM outperforms other computationally efficient approximations to SM, and (3) unlike SM, SSM scales effectively to high dimensional data.

Following the settings of Wenliang et al. (2019), we trained models on three UCI datasets: Parkinsons, RedWine, and WhiteWine (Dua & Graff, 2017), and used their original code for SM. To compute the trace term exactly, Wenliang et al. (2019)’s manual implementation of backpropagation takes over one thousand lines for a model that is four layers deep, while the implementation of SSM only takes several lines. For DSM, the value of σ\sigma is chosen by grid search using the SM loss on a validation dataset. All models are trained with 15 different random seeds and training is stopped when validation loss does not improve for 200 steps.

Results in Fig. 1 demonstrate that SSM-VR performs comparably to SM, when evaluated using the SM loss on a held-out test set. We observe that variance reduction substantially improves the performance of SSM. In addition, SSM outperforms other computationally efficient approaches. DSM can perform comparably to SSM on RedWine. However, it is challenging to select σ\sigma for DSM. Models trained using σ\sigma from the heuristic in Saremi et al. (2018) are far from optimal (on both SM losses and likelihoods), and extensive grid search is needed to find the best σ\sigma. CP performs substantially worse, which is likely because it injects noise for each node in the computation graph, and the amount of noise introduced is too large for a neural-network-based kernel evaluated at 200 inducing points, which supports our hypothesis that CP does not work effectively for deep models. Results for approx BP are omitted because the losses exceeded 10910^{9} on all datasets. This is because approx BP provides a biased estimate of the Hessian without any error guarantees. All the results are similar when evaluating according to log-likelihood metric (Appendix C.1).

We evaluate the computational efficiency of different losses on data of increasing dimensionality. We fit DKEF models to a multivariate standard normal in high dimensional spaces. The average running time per minibatch of 100 examples is reported in Fig. 2. SM performance degrades linearly with the dimension of the input data due to the computation of the Hessian, and creates out of memory errors for a 12GB GPU after the dimension increases beyond 400. SSM performs similarly to DSM, approx BP and CP, and they are all scalable to large dimensions.

1.2 Deep Flow Models

As a sanity check, we also evaluate SSM by training a NICE flow model (Dinh et al., 2015), whose likelihood is tractable and can be compared to results obtained by MLE. The model we use has 20 hidden layers, and 1000 units per layer. Models are trained to fit MNIST handwritten digits, which are 784 dimensional images. Data are dequantized by adding uniform noise in the range [1512,1512][-\frac{1}{512},\frac{1}{512}], and transformed using a logit transformation, log(x)log(1x)\log(x)-\log(1-x). We provide additional details in Appendix C.2.

Training with SM is extremely computationally expensive in this case. Our SM implementation based on auto-differentiation takes around 7 hours to finish one epoch of training, and 12 GB GPU memory cannot hold a batch size larger than 24, so we do not include these results. Since NICE has tractable likelihoods, we also evaluate MLE as a surrogate objective for minimizing the SM loss. Notably, likelihoods and SM losses might be uncorrelated when the model is mis-specified, which is likely to be the case for a complex dataset like MNIST.

SM losses and log-likelihoods on the test dataset are reported in Tab. 1, where models are evaluated using the best checkpoint in terms of the SM loss on a validation dataset over 100 epochs of training. SSM-VR greatly outperforms all the other methods on the SM loss. DSM performs worse than SSM-VR, and σ\sigma is still hard to tune. Specifically, following the heuristic in Saremi et al. (2018) leads to σ=1.74\sigma=1.74, which performed the worst (on both log-likelihood and SM loss) of the eight choices of σ\sigma in our grid search. Approx BP has more success on NICE than for training DKEF models. This might be because the Hessians of hidden layers of NICE are closer to a diagonal matrix, which results in a smaller approximation error for approx BP. As in the DKEF experiments, CP performs worse. This is likely due to injecting noise to all hidden units, which will lead to large variance for a network as big as NICE. Unlike the DKEF experiments, we find that good log-likelihoods are less correlated with good SM loss, probably due to model mis-specification.

2 SCORE ESTIMATION

We consider two typical tasks that require accurate score estimations: (1) training VAEs with an implicit encoder and (2) training Wasserstein Auto-Encoders. We show in both tasks SSM outperforms previous score estimators. Samples generated by various algorithms can be found in Appendix A.

Consider a latent variable model p(x,z)p(\mathbf{x},\mathbf{z}), where x\mathbf{x} and z\mathbf{z} denote observed and latent variables respectively. A variational auto-encoder (VAE) is composed of two parts: 1) an encoder pθ(xz)p_{\boldsymbol{\theta}}(\mathbf{x}\mid\mathbf{z}) modeling the conditional distribution of x\mathbf{x} given z\mathbf{z}; and a decoder qϕ(zx)q_{\boldsymbol{\phi}}(\mathbf{z}\mid\mathbf{x}) that approximates the posterior distribution of the latent variable. A VAE is typically trained by maximizing the following evidence lower bound (ELBO)

For a single data point x\mathbf{x}, kernel score estimators need multiple samples from qϕ(zx)q_{\boldsymbol{\phi}}(\mathbf{z}\mid\mathbf{x}) to estimate its score. On MNIST, we use 100 samples, as done in Shi et al. (2018). On CelebA, however, we can only take 20 samples due to GPU memory limitations. In contrast, SSM learns a score network h(zx)h(\mathbf{z}\mid\mathbf{x}) along with qϕ(zx)q_{\boldsymbol{\phi}}(\mathbf{z}\mid\mathbf{x}), which amortizes the cost of score estimation. Unless noted explicitly, we use one projection per data point (M=1M=1) for SSM, which is scalable to deep networks.

We consider VAE training on MNIST and CelebA (Liu et al., 2015). All images in CelebA are first cropped to a patch of 140×140140\times 140 and then resized to 64×6464\times 64. We report negative likelihoods on MNIST, as estimated by AIS (Neal, 2001) with 1000 intermediate distributions. We evaluate sample quality on CelebA with FID scores (Heusel et al., 2017). For fast AIS evaluation on MNIST, we use shallow fully-connected networks with 3 hidden layers. For CelebA experiments we use deep convolutional networks. The architectures of implicit encoders and score networks are straightforward modifications to the encoders of ELBO. More details are provided in Appendix C.3.

The negative likelihoods of different methods on MNIST are reported in the left part of Tab. 2. We note that SSM outperforms Stein and Spectral in all cases. When the latent dimension is 8, SSM outperforms ELBO, which indicates that the expressive power of implicit qϕ(zx)q_{\boldsymbol{\phi}}(\mathbf{z}\mid\mathbf{x}) pays off. When the latent dimension is 32, the gaps between SSM and kernel methods are even larger, and the performance of SSM is still comparable to ELBO. Moreover, when M=100M=100 (matching the computation of kernel methods), SSM outperforms ELBO.

For CelebA, we provide FID scores of samples in the top part of Tab. 3. We observe that after 40k training iterations, SSM outperforms all other baselines. Kernel methods perform poorly in this case because only 20 samples per data point can be used due to limited amount of GPU memory. Early during training, SSM does not perform as well. Since the score network is trained along with the encoder and decoder, a moderate number of training steps is needed to give an accurate score estimation (and better learning of the VAE).

2.2 WAEs

Compared to ϕH(qϕ(zx))\nabla_{\boldsymbol{\phi}}H(q_{\boldsymbol{\phi}}(\mathbf{z}\mid\mathbf{x})), it is easier to estimate ϕH(qϕ(z))\nabla_{\boldsymbol{\phi}}H(q_{\boldsymbol{\phi}}(\mathbf{z})) for kernel methods, because the samples of qϕ(z)q_{\boldsymbol{\phi}}(\mathbf{z}) can be collected by first sampling x1,x2,,xNi.i.d. missingpd(x)\mathbf{x}_{1},\mathbf{x}_{2},\cdots,\mathbf{x}_{N}\stackrel{{\scriptstyle\text{i.i.d.\hbox{} missing}}}{{\sim}}p_{d}(\mathbf{x}) and then sample one z\mathbf{z} for each xi\mathbf{x}_{i} from qϕ(zxi)q_{\phi}(\mathbf{z}\mid\mathbf{x}_{i}). In constrast, multiple z\mathbf{z}’s need to be sampled for each xi\mathbf{x}_{i} when estimating ϕH(qϕ(zx))\nabla_{\boldsymbol{\phi}}H(q_{\boldsymbol{\phi}}(\mathbf{z}\mid\mathbf{x})) with kernel approaches. For SSM, we use a score network h(z)h(\mathbf{z}) and train it alongside qϕ(zx)q_{\boldsymbol{\phi}}(\mathbf{z}\mid\mathbf{x}).

The setup for WAE experiments is largely the same as VAE. The architectures are very similar to those used in the VAE experiments, and the only difference is that we made decoders implicit, as suggested in Tolstikhin et al. (2018). More details can be found in Appendix C.3.

The negative likelihoods on MNIST are provided in the right part of Tab. 2. SSM outperforms both kernel methods, and achieves a larger performance gap when the latent dimension is higher. The likelihoods are lower than the VAE results as the WAE objective does not directly maximize likelihoods.

We show FID scores for CelebA experiments in the bottom part of Tab. 3. As expected, kernel methods perform much better than before, because it is faster to sample from qϕ(z)q_{\boldsymbol{\phi}}(\mathbf{z}). The FID scores are generally lower than those in VAE experiments, which supports previous results that WAEs generally obtain better samples. SSM outperforms both kernel methods when the number of iterations is more than 40k. This shows the advantages of training a deep, expressive score network compared to using a fixed kernel in score estimation tasks.

CONCLUSION

We propose sliced score matching, a scalable method for learning unnormalized statistical models and estimating scores for implicit distributions. Compared to the original score matching and its variants, our estimator can scale to deep models on high dimensional data, while remaining easy to implement in modern automatic differentiation frameworks. Theoretically, our estimator is consistent and asymptotically normal under some regularity conditions. Experimental results demonstrate that our method outperforms competitors in learning deep energy-based models and provides more accurate estimates than kernel score estimators in training implicit VAEs and WAEs.

This research was supported by Intel Corporation, TRI, NSF (#1651565, #1522054, #1733686 ), ONR (N00014-19-1-2145), AFOSR (FA9550- 19-1-0024).

References

Appendix A Samples

A.1.2 CelebA

A.2 WAE

A.2.2 CelebA

Appendix B PROOFS

Below we provide a summary of the most commonly used notations used in the proofs. First, we denote the data distribution as pd(x)p_{d}(\mathbf{x}) and assume that the training/test data {x1,x2,,xN}\{\mathbf{x}_{1},\mathbf{x}_{2},\cdots,\mathbf{x}_{N}\} are i.i.d. samples of pd(x)p_{d}(\mathbf{x}). The model is denoted as pm(x;θ)p_{m}(\mathbf{x};{\boldsymbol{\theta}}), where θ{\boldsymbol{\theta}} is restricted to a parameter space Θ\Theta. Note that pm(x;v)p_{m}(\mathbf{x};\mathbf{v}) can be an unnormalized energy-based model. We use v\mathbf{v} to represent a random vector with the same dimension of input x\mathbf{x}. This vector v\mathbf{v} is often called the projection vector, and we use pvp_{\mathbf{v}} to denote its distribution.

Next, we introduce several shorthand notations for quantities related to pm(x;θ)p_{m}(\mathbf{x};{\boldsymbol{\theta}}) and pd(x)p_{d}(\mathbf{x}). The log-likelihood logpm(x;θ)\log p_{m}(\mathbf{x};{\boldsymbol{\theta}}) and logpd(x)\log p_{d}(\mathbf{x}) are respectively denoted as lm(x;θ)l_{m}(\mathbf{x};{\boldsymbol{\theta}}) and ld(x)l_{d}(\mathbf{x}). The (Stein) score function xlogpm(x;θ)\nabla_{\mathbf{x}}\log p_{m}(\mathbf{x};{\boldsymbol{\theta}}) and xlogpd(x)\nabla_{\mathbf{x}}\log p_{d}(\mathbf{x}) are written as sm(x;θ)\mathbf{s}_{m}(\mathbf{x};{\boldsymbol{\theta}}) and sd(x)\mathbf{s}_{d}(\mathbf{x}), and finally the Hessian of logpm(x;θ)\log p_{m}(\mathbf{x};{\boldsymbol{\theta}}) w.r.t. x\mathbf{x} is denoted as xsm(x;θ)\nabla_{\mathbf{x}}\mathbf{s}_{m}(\mathbf{x};{\boldsymbol{\theta}}).

We also adopt some convenient notations for collections. In particular, we use x1N\mathbf{x}_{1}^{N} to denote a collection of NN vectors {x1,x2,,xN}\{\mathbf{x}_{1},\mathbf{x}_{2},\cdots,\mathbf{x}_{N}\} and use v11NM\mathbf{v}_{11}^{NM} to denote N×MN\times M vectors {v11,v12,,v1M,v21,v22,,v2M,,vN1,vN2,,vNM}\{\mathbf{v}_{11},\mathbf{v}_{12},\cdots,\mathbf{v}_{1M},\mathbf{v}_{21},\allowbreak\mathbf{v}_{22},\cdots,\mathbf{v}_{2M},\cdots,\mathbf{v}_{N1},\mathbf{v}_{N2},\cdots,\mathbf{v}_{NM}\}.

B.2 BASIC PROPERTIES

The following regularity conditions are needed for integration by parts and identifiability.

θΘ,limxsm(x;θ)pd(x)=0\forall{\boldsymbol{\theta}}\in\Theta,\lim_{\left\lVert\mathbf{x}\right\rVert\rightarrow\infty}\mathbf{s}_{m}(\mathbf{x};{\boldsymbol{\theta}})p_{d}(\mathbf{x})=0.

The model family {pm(x;θ)θΘ}\{p_{m}(\mathbf{x};{\boldsymbol{\theta}})\mid{\boldsymbol{\theta}}\in\Theta\} is well-specified, i.e., pd(x)=pm(x;θ)p_{d}(\mathbf{x})=p_{m}(\mathbf{x};{\boldsymbol{\theta}}^{*}). Furthermore, pm(x;θ)pm(x;θ)p_{m}(\mathbf{x};{\boldsymbol{\theta}})\neq p_{m}(\mathbf{x};{\boldsymbol{\theta}}^{*}) whenever θθ{\boldsymbol{\theta}}\neq{\boldsymbol{\theta}}^{*}.

pm(x;θ)>0,θΘp_{m}(\mathbf{x};{\boldsymbol{\theta}})>0,\forall{\boldsymbol{\theta}}\in\Theta and x\forall\mathbf{x}.

Assume sm(x;θ)\mathbf{s}_{m}(\mathbf{x};{\boldsymbol{\theta}}), sd(x)\mathbf{s}_{d}(\mathbf{x}) and pvp_{\mathbf{v}} satisfy some regularity conditions (Assumption 1, Assumption 2). Under proper boundary conditions (Assumption 3), we have

The basic idea of this proof is similar to that of Theorem 1 in Hyvärinen (2005). First, note that L(θ,pv)L({\boldsymbol{\theta}},p_{\mathbf{v}}) can be expanded to

This can be shown by first observing that

which proves (16) and the proof is completed. ∎

Assume our model family is well-specified and identifiable (Assumption 4). Assume further that the densities are all positive (Assumption 5). When pvp_{\mathbf{v}} satisfies some regularity conditions (Assumption 2), we have

First, since pd(x)=pm(x;θ)>0p_{d}(\mathbf{x})=p_{m}(\mathbf{x};{\boldsymbol{\theta}}^{*})>0, L(θ;pv)=0L({\boldsymbol{\theta}};p_{\mathbf{v}})=0 implies

B.3 CONSISTENCY

In addition to the assumptions in Theorem 1 and Lemma 1, we need the following regularity conditions to prove the consistency of θ^N,Marg minθΘJ^(θ;x1N,v11NM)\hat{{\boldsymbol{\theta}}}_{N,M}\triangleq\operatorname*{arg\,min}_{{\boldsymbol{\theta}}\in\Theta}\hat{J}({\boldsymbol{\theta}};\mathbf{x}_{1}^{N},\mathbf{v}_{11}^{NM}).

Let A(θ)xsm(x;θ)A({\boldsymbol{\theta}})\triangleq\nabla_{\mathbf{x}}\mathbf{s}_{m}(\mathbf{x};{\boldsymbol{\theta}}) and B(θ)sm(x;θ)sm(x;θ)B({\boldsymbol{\theta}})\triangleq\mathbf{s}_{m}(\mathbf{x};{\boldsymbol{\theta}})\mathbf{s}_{m}(\mathbf{x};{\boldsymbol{\theta}})^{\intercal}. Consider θ1Θ,θ2Θ{\boldsymbol{\theta}}_{1}\in\Theta,{\boldsymbol{\theta}}_{2}\in\Theta and let DD be the dimension of v\mathbf{v}, we have

where (i)(i) is Cauchy-Schwarz inequality, (ii)(ii) is Jensen’s inequality, and (iii)(iii) is due to Assumption 7. Now let

where (i)(i) results from the independence of v,x\mathbf{v},\mathbf{x}, and (ii)(ii) is due to Assumption 7 and Assumption 8. ∎

where diam()\operatorname{diam}(\cdot) denotes the diameter and DD is the dimension of Θ\Theta.

Step 1: From Jensen’s inequality, we obtain

where x1N,v11NM{\mathbf{x}^{\prime}}_{1}^{N},{\mathbf{v}^{\prime}}_{11}^{NM} are independent copies of x1N\mathbf{x}_{1}^{N} and v11NM\mathbf{v}_{11}^{NM}. Let {ϵi}i=1N\{\epsilon_{i}\}_{i=1}^{N} be a set of independent Rademacher random variables, we have

where (i)(i) is because the quantity is symmetric about in distribution, and (ii)(ii) is due to Jensen’s inequality.

Step 2: First note that given xi,vij\mathbf{x}_{i},\mathbf{v}_{ij}, ϵiMj=1Mf(θ;xi,vij)\frac{\epsilon_{i}}{M}\sum_{j=1}^{M}f({\boldsymbol{\theta}};\mathbf{x}_{i},\mathbf{v}_{ij}) is a zero-mean sub-Gaussian process w.r.t. θ{\boldsymbol{\theta}}. This can be observed from

where (i)(i) holds because ϵi\epsilon_{i} is a 1-sub-Gaussian random variable, (ii)(ii) is from Cauchy-Schwarz inequality and (iii)(iii) is due to Lemma 2. As a result, 1N1Mi=1Nj=1Mϵif(θ;xi,vij)\frac{1}{N}\frac{1}{M}\sum_{i=1}^{N}\sum_{j=1}^{M}\epsilon_{i}f({\boldsymbol{\theta}};\mathbf{x}_{i},\mathbf{v}_{ij}) is a zero-mean sub-Gaussian random process with metric

Since Θ\Theta is compact, the diameter of Θ\Theta with respect to the Euclidean norm 2\left\lVert\cdot\right\rVert_{2} is finite and we denote it as diam(Θ)<\operatorname{diam}(\Theta)<\infty. Then, Dudley’s entropy integral (Dudley, 1967) gives

Here logN(Θ,d,ϵ)\log N(\Theta,d,\epsilon) is the metric entropy of Θ\Theta with metric d(θ1,θ2)=1N1NMi=1Nj=1ML2(xi,vij)θ1θ22d({\boldsymbol{\theta}}_{1},{\boldsymbol{\theta}}_{2})=\frac{1}{\sqrt{N}}\sqrt{\frac{1}{NM}\sum_{i=1}^{N}\sum_{j=1}^{M}L^{2}(\mathbf{x}_{i},\mathbf{v}_{ij})}\left\lVert{\boldsymbol{\theta}}_{1}-{\boldsymbol{\theta}}_{2}\right\rVert_{2} and size ϵ\epsilon.

Step 3: When the dimension of θΘ{\boldsymbol{\theta}}\in\Theta is DD, it is known that the ϵ\epsilon-covering number of Θ\Theta with Euclidean distance is

Therefore, N(Θ,d,ϵ)N(\Theta,d,\epsilon) can be bounded by

Hence, the metric integral can be bounded

Finally, combining (19), (20) and (21) gives us

Suppose all the previous assumptions hold (Assumption 1-Assumption 8). Assume further the conditions of Theorem 1 and Lemma 1 are satisfied. Let θ{\boldsymbol{\theta}}^{*} be the true parameter of the data distribution, and θ^N,M\hat{{\boldsymbol{\theta}}}_{N,M} be the empirical estimator defined by

Then, θ^N,M\hat{{\boldsymbol{\theta}}}_{N,M} is consistent, meaning that

Note that Theorem 1 and Lemma 1 together imply that θ=arg minθΘJ(θ;pv){\boldsymbol{\theta}}^{*}=\operatorname*{arg\,min}_{{\boldsymbol{\theta}}\in\Theta}J({\boldsymbol{\theta}};p_{\mathbf{v}}). Then, we will show J(θ^N,M;pv)pJ(θ;pv)J(\hat{{\boldsymbol{\theta}}}_{N,M};p_{\mathbf{v}})\overset{p}{\to}J({\boldsymbol{\theta}}^{*};p_{\mathbf{v}}) when NN\to\infty. This can be done by noticing

We can easily conclude that (22) is op(1)o_{p}(1) with the help of Lemma 3, because

as NN\to\infty. From Lemma 1 we also have L(θ^N,M;pv)L(θ;pv)>0L(\hat{{\boldsymbol{\theta}}}_{N,M};p_{\mathbf{v}})-L({\boldsymbol{\theta}}^{*};p_{\mathbf{v}})>0 if θ^N,Mθ\hat{{\boldsymbol{\theta}}}_{N,M}\neq{\boldsymbol{\theta}}. As shown by Theorem 1, this is the same as J(θ^N,M;pv)J(θ;pv)>0J(\hat{{\boldsymbol{\theta}}}_{N,M};p_{\mathbf{v}})-J({\boldsymbol{\theta}}^{*};p_{\mathbf{v}})>0 if θ^N,Mθ\hat{{\boldsymbol{\theta}}}_{N,M}\neq{\boldsymbol{\theta}}. Therefore (22) = op(1)o_{p}(1) gives J(θ^N,M;pv)pJ(θ;pv)J(\hat{{\boldsymbol{\theta}}}_{N,M};p_{\mathbf{v}})\overset{p}{\to}J({\boldsymbol{\theta}}^{*};p_{\mathbf{v}}).

However, the fact that p(J(θ^N,M;pv)J(θ;pv)J(θSϵ;pv)J(θ;pv))δp(|J(\hat{{\boldsymbol{\theta}}}_{N,M};p_{\mathbf{v}})-J({\boldsymbol{\theta}}^{*};p_{\mathbf{v}})|\geq J({\boldsymbol{\theta}}_{\mathcal{S}_{\epsilon}};p_{\mathbf{v}})-J({\boldsymbol{\theta}}^{*};p_{\mathbf{v}}))\geq\delta holds for arbitrarily large NN contradicts J(θ^N,M;pv)pJ(θ;pv)J(\hat{{\boldsymbol{\theta}}}_{N,M};p_{\mathbf{v}})\overset{p}{\to}J({\boldsymbol{\theta}}^{*};p_{\mathbf{v}}). ∎

B.4 ASYMPTOTIC NORMALITY

To simplify notations we use ijh()(x2h())ij\partial_{i}\partial_{j}h(\cdot)\triangleq(\nabla_{\mathbf{x}}^{2}h(\cdot))_{ij}, ih()(xh())i\partial_{i}h(\cdot)\triangleq(\nabla_{\mathbf{x}}h(\cdot))_{i}, and denote xh(x)x=x\nabla_{\mathbf{x}}h(\mathbf{x})|_{\mathbf{x}=\mathbf{x}^{\prime}} as xh(x)\nabla_{\mathbf{x}}h(\mathbf{x}^{\prime}). Here h()h(\cdot) denotes some arbitrary function. Let lmlogpm(x;θ)l_{m}\triangleq\log p_{m}(\mathbf{x};{\boldsymbol{\theta}}), lm(x;θ)logpm(x;θ)l_{m}(\mathbf{x};{\boldsymbol{\theta}})\triangleq\log p_{m}(\mathbf{x};{\boldsymbol{\theta}}) and further adopt the following notations

For the proof of asymptotic normality we need the following extra assumptions.

For θ1{\boldsymbol{\theta}}_{1}, θ2{\boldsymbol{\theta}}_{2} near θ{\boldsymbol{\theta}}^{*}, and i,j\forall i,j,

Suppose lm(x;θ)l_{m}(\mathbf{x};{\boldsymbol{\theta}}) is sufficiently smooth (Assumption 9) and pvp_{\mathbf{v}} has bounded moments (Assumption 2 and Assumption 8). Let θ2f(θ;x,v1M)1Mi=1Mθ2vixsm(x;θ)vi+12θ2(vism(x;θ))2\nabla_{\boldsymbol{\theta}}^{2}f({\boldsymbol{\theta}};\mathbf{x},\mathbf{v}_{1}^{M})\triangleq\frac{1}{M}\sum_{i=1}^{M}\nabla_{\boldsymbol{\theta}}^{2}\mathbf{v}_{i}^{\intercal}\nabla_{\mathbf{x}}\mathbf{s}_{m}(\mathbf{x};{\boldsymbol{\theta}})\mathbf{v}_{i}+\frac{1}{2}\nabla_{\boldsymbol{\theta}}^{2}(\mathbf{v}_{i}^{\intercal}\mathbf{s}_{m}(\mathbf{x};{\boldsymbol{\theta}}))^{2}. Then θ2f(θ;x,v)\nabla_{\boldsymbol{\theta}}^{2}f({\boldsymbol{\theta}};\mathbf{x},\mathbf{v}) is Lipschitz continuous, i.e., for θ1{\boldsymbol{\theta}}_{1} and θ2{\boldsymbol{\theta}}_{2} close to θ{\boldsymbol{\theta}}^{*}, there exists a Lipschitz constant L(x,v1M)L(\mathbf{x},\mathbf{v}_{1}^{M}) such that

First, we write out θ2f(θ1;x,v1M)θ2f(θ2;x,v1M)\nabla_{\boldsymbol{\theta}}^{2}f({\boldsymbol{\theta}}_{1};\mathbf{x},\mathbf{v}_{1}^{M})-\nabla_{\boldsymbol{\theta}}^{2}f({\boldsymbol{\theta}}_{2};\mathbf{x},\mathbf{v}_{1}^{M}) according to the definitions. Let Aij(θ)θ2ijlm(x;θ)A_{ij}({\boldsymbol{\theta}})\triangleq\nabla_{\boldsymbol{\theta}}^{2}\partial_{i}\partial_{j}l_{m}(\mathbf{x};{\boldsymbol{\theta}}) and Bij(θ)θ2ilm(x;θ)jlm(x;θ)B_{ij}({\boldsymbol{\theta}})\triangleq\nabla_{\boldsymbol{\theta}}^{2}\partial_{i}l_{m}(\mathbf{x};{\boldsymbol{\theta}})\partial_{j}l_{m}(\mathbf{x};{\boldsymbol{\theta}}). Then,

Then, Cauchy-Schwarz and Jensen’s inequality give

where (i)(i) is due to Assumption 9, (ii)(ii) is Jensen’s, and (iii)(iii) is because of Assumption 2 and 8. ∎

Assume that conditions in Theorem 1 and Lemma 1 hold, and pvp_{\mathbf{v}} has bounded higher-order moments (Assumption 8). Then

where \nabla_{\boldsymbol{\theta}}f({\boldsymbol{\theta}}^{*};\mathbf{x},\mathbf{v}_{1}^{M})=\nabla_{\boldsymbol{\theta}}f({\boldsymbol{\theta}};\mathbf{x},\mathbf{v}_{1}^{M})\big{|}_{{\boldsymbol{\theta}}={\boldsymbol{\theta}}^{*}}

In particular, if pvN(0,I)p_{\mathbf{v}}\sim\mathcal{N}(0,I), we have

If pvp_{\mathbf{v}} is the distribution of multivariate Rademacher random variables, we have

Since θ{\boldsymbol{\theta}}^{*} is the true parameter of the data distribution, we have

which leads to (23). Note that Assumption 2 guarantees that Σij<|\Sigma_{ij}|<\infty and Assumption 8 ensures Sijpq<|\mathfrak{S}_{ijpq}|<\infty.

If pvN(0,I)p_{\mathbf{v}}\sim\mathcal{N}(0,I), S\mathfrak{S} and Σ\Sigma have the following simpler forms

Then, if we assume that the second derivatives of lml_{m} are continuous, we have ijlm=jilm\partial_{i}\partial_{j}l_{m}=\partial_{j}\partial_{i}l_{m}, and the variance (23) can also be simplified to

Similarly, if pvU({±1}D)p_{\mathbf{v}}\sim\mathcal{U}(\{\pm 1\}^{D}), (23) has the simplified form

With the notations and assumptions in Lemma 4, Lemma 5 and Theorem 2, we have

In particular, if pvN(0,I)p_{\mathbf{v}}\sim\mathcal{N}(0,I), then the asymptotic variance is

If pvp_{\mathbf{v}} is the distribution of multivariate Rademacher random variables, the asymptotic variance is

To simplify notations, we use PNh(x)1Ni=1Nh(xi,)P_{N}h(\mathbf{x})\triangleq\frac{1}{N}\sum_{i=1}^{N}h(\mathbf{x}_{i},\cdot), where h(x,)h(\mathbf{x},\cdot) is some arbitrary function. For example, J^(θ;x1N,v11NM)\hat{J}({\boldsymbol{\theta}};\mathbf{x}_{1}^{N},\mathbf{v}_{11}^{NM}) can be written as PNf(θ;x,v1M)P_{N}f({\boldsymbol{\theta}};\mathbf{x},\mathbf{v}_{1}^{M}). By Taylor expansion, we can approximate PNθf(θ^N,M;x,v1M)P_{N}\nabla_{\boldsymbol{\theta}}f(\hat{{\boldsymbol{\theta}}}_{N,M};\mathbf{x},\mathbf{v}_{1}^{M}) around θ{\boldsymbol{\theta}}^{*}:

where Eθ^N,M,x,v1MFL(x,v1M)θ^N,Mθ2\|E_{\hat{{\boldsymbol{\theta}}}_{N,M},\mathbf{x},\mathbf{v}_{1}^{M}}\|_{F}\leq L(\mathbf{x},\mathbf{v}_{1}^{M})\|\hat{{\boldsymbol{\theta}}}_{N,M}-{\boldsymbol{\theta}}^{*}\|_{2} from Lemma 4 and Taylor expansion of vector-valued functions. Combining with the law of large numbers, we have

But of course, the central limit theorem and Lemma 5 yield

Then, Slutsky’s theorem gives the desired result

In particular, if pvN(0,I)p_{\mathbf{v}}\sim\mathcal{N}(0,I) or pvU({±1}D)p_{\mathbf{v}}\sim\mathcal{U}(\{\pm 1\}^{D}), we have J(θ;pv)=J(θ)J({\boldsymbol{\theta}}^{*};p_{\mathbf{v}})=J({\boldsymbol{\theta}}^{*}), and therefore θ2J(θ;pv)=θ2J(θ)\nabla_{\boldsymbol{\theta}}^{2}J({\boldsymbol{\theta}}^{*};p_{\mathbf{v}})=\nabla_{\boldsymbol{\theta}}^{2}J({\boldsymbol{\theta}}^{*}). We can apply Lemma 5 to conclude the simplified expressions for the asymptotic variance.

Under similar assumptions used in Theorem 2 and Theorem 3, we can also conclude that the score matching estimator θ^Narg minθΘJ^(θ;x)\hat{{\boldsymbol{\theta}}}_{N}\triangleq\operatorname*{arg\,min}_{{\boldsymbol{\theta}}\in\Theta}\hat{J}({\boldsymbol{\theta}};\mathbf{x}) is consistent

The other part of the proof is similar to that of Theorem 2 and Theorem 3 and is thus obmitted. ∎

B.5 NOISE CONTRASTIVE ESTIMATION

Then when v20\left\lVert v\right\rVert_{2}\to 0, we have

Using Taylor expansion, we can immediately get

Appendix C ADDITIONAL DETAILS OF EXPERIMENTS

To improve computational cost of learning f,f, Sutherland et al. (2018) use a Nyström-type lite approximation, selecting LL inducing points zl{\mathbf{z}}_{l}, and ff of the form:

We compare training using our objective against deep kernel exponential family (DKEF) models trained using exact score matching in Wenliang et al. (2019).

The kernel k(x,y)k(\mathbf{x},{\mathbf{y}}) is a mixture of R=3R=3 Gaussian kernels, with features extracted by a neural network, ϕwr()\phi_{w_{r}}(\cdot), length scales σr\sigma_{r}, and nonnegative mixture coefficients ρr\rho_{r}. The neural network is a three layer fully connected network with a skip connection from the input to output layers and softplus nonlinearities. Each hidden layer has 30 neurons. We have:

A similar closed form for sliced score matching, denoising score matching, approximate backpropogation, and curvature propagation can be derived. The derivation for sliced score matching is presented below for completeness.

For fixed kk, z\mathbf{z}, and λα\lambda_{\boldsymbol{\alpha}}, as long as λα>0\lambda_{\boldsymbol{\alpha}}>0 then the optimal α{\boldsymbol{\alpha}} is

The derivation follows very similarly to Proposition 3 in Wenliang et al. (2019). We will show that the loss is quadratic in α{\boldsymbol{\alpha}}. Note that

Because λα>0\lambda_{\alpha}>0 and G{\bm{G}} is positive semidefinite, the matrix in parentheses is strictly positive definite, and the claimed result follows directly from standard vector calculus. ∎

RedWine and WhiteWine are dequantized by adding uniform noise to each dimension in the range [d,d][-d,d] where dd is the median distance between two values for that dimension. For each dataset, 10% of the entire data was used as testing, and 10% of the remaining was used for validation. PCA whitening is applied to the data. Noise of standard deviation 0.05 is added as part of preprocessing.

The DKEF models have R=3R=3 Gaussian kernels. Each feature extractor is a 3-layer neural network with a skip connection from the input to output, with 30 hidden neurons per layer. Weights were initialized from a Gaussian distribution with standard deviation equal to 130\frac{1}{\sqrt{30}}. Length scales σr\sigma_{r} were initialized to 1.0, 3.3 and 10.0. We use L=200L=200 trainable inducing points, which were initialized from training data.

Models are trained using an Adam optimizer, with learning rate 10210^{-2}. A batch size of 200 is used, with 100 points for computing α{\boldsymbol{\alpha}}, and 100 for computing the loss. Models are stopped after validation loss does not improve for 200 steps.

For denoising score matching, we perform a grid search with values [0.02, 0.04, 0.06, 0.08, 0.10, 0.12, 0.14, 0.16, 0.20, 0.24, 0.28, 0.32, 0.40, 0.48, 0.56, 0.64, 1.28]. We train models for each value of σ\sigma using two random seeds, and pick the σ\sigma with the best average validation score matching loss. For curvature propagation, one noise sample is used to match the performance of sliced score matching.

Log-likelihoods are presented below. They are estimated using AIS, using a proposal distribution N(0,2I)\mathcal{N}(0,2I), using 1,000,000 samples.

C.2 NICE

The model has four coupling layers, each with five hidden layers, for a total of 20 hidden layers, as well as a final scale layer (Dinh et al., 2015). Softplus nonlinearities are used between hidden layers.

Models are trained using the Adam optimizer with learning rate 10310^{-3} for 100 epochs. The best checkpoint on exact score matching loss, evaluation every 100 iterations, is used to report test set performance. We use a batch size of 128.

Data are dequantized by adding uniform noise in the range [1512,1512][-\frac{1}{512},\frac{1}{512}], clipped to be in the range [0.001,0.001][-0.001,0.001], and then transformed using a logit transformation log(x)log(1x)\log(x)-\log(1-x). 90% of the training set is used for training, and 10% for validation, and the standard test set is used.

For grid search for the optimal value of σ\sigma, eight values are used: [0.01, 0.05, 0.10, 0.20, 0.28, 0.50, 1.00, 1.50]. We also evaluate σ=1.74\sigma=1.74, chosen by the heuristic in Saremi et al. (2018). The model with the best performance on validation score matching loss is used. Only nine values of σ\sigma are evaluated because training each model takes approximately two hours.

C.3 SCORE ESTIMATION FOR IMPLICIT DISTRIBUTIONS

We put the architectures of all networks used in the MNIST and CelebA experiments in Tab. 8 and Tab. 9 respectively.

For MNIST experiments, we use RMSProp optimizer with a learning rate of 0.001 for all methods. On CelebA, the learning rate is changed to 0.0001. All algorithms are trained for 100000 iterations with a batch size of 128.

All samples are generated after 100000 training iterations.

Appendix D VARIANCE REDUCTION

Below we discuss approaches to reduce the variance of J^(θ;x1N,v11NM)\hat{J}({\boldsymbol{\theta}};\mathbf{x}_{1}^{N},\mathbf{v}_{11}^{NM}), which can lead to better performance in practice. The most naïve approach, of course, is using a larger MM to compute J^(θ;x1N,v11NM)\hat{J}({\boldsymbol{\theta}};\mathbf{x}_{1}^{N},\mathbf{v}_{11}^{NM}). However, this requires more computation and when MM is close to the data dimension, sliced score matching will lose its computational advantage over score matching.

An alternative approach is to leverage control variates (Owen, 2013). A control variate is a random variable whose expectation is tractable, and is highly correlated with another random variable without a tractable expectation. Define c(θ;x,v)12(vsm(x;θ))2c({\boldsymbol{\theta}};\mathbf{x},\mathbf{v})\triangleq\frac{1}{2}(\mathbf{v}^{\intercal}\mathbf{s}_{m}(\mathbf{x};{\boldsymbol{\theta}}))^{2}. Note that when pvp_{\mathbf{v}} is a multivariate standard normal or multivariate Rademacher distribution, c(θ;x,v)c({\boldsymbol{\theta}};\mathbf{x},\mathbf{v}) will have a tractable expectation, i.e.,

which is easily computable. Now let β(x)c(θ;x,v)\beta(\mathbf{x})c({\boldsymbol{\theta}};\mathbf{x},\mathbf{v}) be our control variate, where β(x)\beta(\mathbf{x}) is a function to be determined. Due to the structural similarity between β(x)c(θ;x,v)\beta(\mathbf{x})c({\boldsymbol{\theta}};\mathbf{x},\mathbf{v}) and J^(θ;x1N,v11NM)\hat{J}({\boldsymbol{\theta}};\mathbf{x}_{1}^{N},\mathbf{v}_{11}^{NM}), it is easy to believe that β(x)c(θ;x,v)\beta(\mathbf{x})c({\boldsymbol{\theta}};\mathbf{x},\mathbf{v}) can be a correlated control variate with an appropriate β(x)\beta(\mathbf{x}). We thus consider the following objective

Appendix E PSEUDOCODE