Sliced Score Matching: A Scalable Approach to Density and Score Estimation

Yang Song, Sahaj Garg, Jiaxin Shi, Stefano Ermon

INTRODUCTION

Score matching (Hyvärinen, 2005) is particularly suitable for learning unnormalized statistical models, such as energy based ones. It is based on minimizing the distance between the derivatives of the log-density functions (a.k.a., scores) of the data and model distributions. Unlike maximum likelihood estimation (MLE), the objective of score matching only depends on the scores, which are oblivious to the (usually) intractable partition functions. However, score matching requires the computation of the diagonal elements of the Hessian of the model’s log-density function. This Hessian trace computation is generally expensive (Martens et al., 2012), requiring a number of forward and backward propagations proportional to the data dimension. This severely limits its applicability to complex models parameterized by deep neural networks, such as deep energy-based models (LeCun et al., 2006; Wenliang et al., 2019).

Several approaches have been proposed to alleviate this difficulty: Kingma & LeCun (2010) propose approximate backpropagation for computing the trace of the Hessian; Martens et al. (2012) develop curvature propagation, a fast stochastic estimator for the trace in score matching; and Vincent (2011) transforms score matching to a denoising problem which avoids second-order derivatives. These methods have achieved some success, but may suffer from one or more of the following problems: inconsistent parameter estimation, large estimation variance, and cumbersome implementation.

To alleviate these problems, we propose sliced score matching, a variant of score matching that can scale to deep unnormalized models and high dimensional data. The key intuition is that instead of directly matching the high-dimensional scores, we match their projections along random directions. Theoretically, we show that under some regularity conditions, sliced score matching is a well-defined statistical estimation criterion that yields consistent and asymptotically normal parameter estimates. Moreover, compared to the methods of Kingma & LeCun (2010) and Martens et al. (2012), whose implementations require customized backpropagation for deep networks, sliced score matching only involves Hessian-vector products, thus can be easily and efficiently implemented in frameworks such as TensorFlow (Abadi et al., 2016) and PyTorch (Adam et al., 2017).

Beyond training unnormalized models, sliced score matching can also be naturally adapted as an objective for estimating the score function of a data generating distribution (Sasaki et al., 2014; Strathmann et al., 2015) by training a score function model parameterized by deep neural networks. This observation enables many new applications of sliced score matching. For example, we show that it can be used to provide accurate score estimates needed for variational inference with implicit distributions (Huszár, 2017) and learning Wasserstein Auto-Encoders (WAE, Tolstikhin et al. (2018)).

Finally, we evaluate the performance of sliced score matching on learning unnormalized statistical models (density estimation) and estimating score functions of a data generating process (score estimation). For density estimation, experiments on deep kernel exponential families (Wenliang et al., 2019) and NICE flow models (Dinh et al., 2015) show that our method is either more scalable or more accurate than existing score matching variants.

For score estimation, our method improves the performance of variational auto-encoders (VAE) with implicit encoders, and can train WAEs without a discriminator or MMD loss by directly optimizing the KL divergence between aggregated posteriors and the prior. In both situations we outperformed kernel-based score estimators (Li & Turner, 2018; Shi et al., 2018) by achieving better test likelihoods and better sample quality in image generation.

BACKGROUND

Learning unnormalized models with maximum likelihood estimation (MLE) can be difficult due to the intractable partition function $Z_{\boldsymbol{\theta}}$ . To avoid this, score matching (Hyvärinen, 2005) minimizes the Fisher divergence between $p_{d}$ and $p_{m}(\cdot,{\boldsymbol{\theta}})$ , which is defined as

Since $\mathbf{s}_{m}(\mathbf{x};{\boldsymbol{\theta}})$ does not involve $Z_{\boldsymbol{\theta}}$ , the Fisher divergence does not depend on the intractable partition function. However, Eq. (1) is still not readily usable for learning unnormalized models, as we only have samples and do not have access to the score function of the data $\mathbf{s}_{d}(\mathbf{x})$ .

2 SCORE ESTIMATION FOR IMPLICIT DISTRIBUTIONS

Besides parameter estimation in unnormalized models, score matching can also be used to estimate scores of implicit distributions, which are distributions that have a tractable sampling process but without a tractable density. For example, the distribution of random samples from the generator of a GAN (Goodfellow et al., 2014) is an implicit distribution. Implicit distributions can arise in many more situations such as the marginal distribution of a non-conjugate model (Sun et al., 2019), and models defined by complex simulation processes (Tran et al., 2017). In many cases learning and inference become intractable due to the need of optimizing an objective that involves the intractable density.

In these cases, directly estimating the score function $\mathbf{s}_{q}(\mathbf{x})=\nabla_{\mathbf{x}}\log q_{\boldsymbol{\theta}}(\mathbf{x})$ based on i.i.d. samples from an implicit density $q_{\boldsymbol{\theta}}(\mathbf{x})$ can be useful. For example, suppose our learning problem involves optimizing the entropy $H(q_{\boldsymbol{\theta}}(\mathbf{x}))$ of an implicit distribution. This situation is common when dealing with variational free energies (Kingma & Welling, 2014). Suppose $\mathbf{x}\sim q_{{\boldsymbol{\theta}}}$ can be reparameterized as $\mathbf{x}=g_{{\boldsymbol{\theta}}}({\boldsymbol{\epsilon}})$ , where ${\boldsymbol{\epsilon}}$ is a simple random variable independent of ${\boldsymbol{\theta}}$ , such as a standard normal, and $g_{\boldsymbol{\theta}}$ is a deterministic mapping. We can write the gradient of the entropy with respect to ${\boldsymbol{\theta}}$ as

where $\nabla_{\boldsymbol{\theta}}g_{\boldsymbol{\theta}}({\boldsymbol{\epsilon}})$ is usually easy to compute. The score $\nabla_{\mathbf{x}}\log q_{\boldsymbol{\theta}}(\mathbf{x})$ is intractable but can be approximated by score estimation.

Score matching is an attractive solution for score estimation since (1) naturally serves as an objective to measure the difference between the a trainable score function and the score of a data generating process. We will discuss this in more detail in Section 3.2 and mention some other approaches of score estimation in Section 5.2.

DENSITY AND SCORE ESTIMATION WITH SLICED SCORE MATCHING

We observe that one dimensional problems are usually much easier to solve than high dimensional ones. Inspired by the idea of Sliced Wasserstein distance (Rabin et al., 2012), we consider projecting $\mathbf{s}_{d}(\mathbf{x})$ and $\mathbf{s}_{m}(\mathbf{x};{\boldsymbol{\theta}})$ onto some random direction $\mathbf{v}$ and propose to compare their average difference along that random direction. More specifically, we consider the following objective as a replacement of the Fisher divergence $L({\boldsymbol{\theta}})$ in Eq. (1):

To eliminate the dependence of $L({\boldsymbol{\theta}};p_{\mathbf{v}})$ on $\mathbf{s}_{d}(\mathbf{x})$ , we use integration by parts as in score matching (cf., the equivalence between Eq. (1) and (2)). Defining

the equivalence is summarized in the following theorem.

Under some regularity conditions (Assumption 1-3 in Appendix B.2), we have

Other than our requirements on $p_{\mathbf{v}}$ , the assumptions are exactly the same as in Theorem 1 of Hyvärinen (2005). We advise the interested readers to read Appendix B.2 for technical statements of the assumptions and a rigorous proof of the theorem.

In practice, given a dataset $\mathbf{x}_{1}^{N}$ , we draw $M$ projection vectors independently for each point $\mathbf{x}_{i}$ from $p_{\mathbf{v}}$ . The collection of all such vectors $\{\mathbf{v}_{ij}\}_{1\leq i\leq N,1\leq j\leq M}$ are abbreviated as $\mathbf{v}_{11}^{NM}$ .We then use the following unbiased estimator of $J({\boldsymbol{\theta}};p_{\mathbf{v}})$

For sliced score matching, the second derivative term $\mathbf{v}^{\intercal}\nabla_{\mathbf{x}}\mathbf{s}_{m}(\mathbf{x};{\boldsymbol{\theta}})\mathbf{v}$ is much less computationally expensive than $\operatorname{tr}(\nabla_{\mathbf{x}}\mathbf{s}_{m}(\mathbf{x};{\boldsymbol{\theta}}))$ . Using auto-differentiation systems that support higher order gradients, we can compute it using two gradient operations for a single $\mathbf{v}$ , as shown in Alg. 1. Similarly, when there are $M$ $\mathbf{v}$ ’s, the total number of gradient operations is $M+1$ . In contrast, assuming the dimension of $\mathbf{x}$ is $D$ and $D\gg M$ , we typically need $D+1$ gradient operations to compute $\operatorname{tr}(\nabla_{\mathbf{x}}\mathbf{s}_{m}(\mathbf{x};{\boldsymbol{\theta}}))$ because each diagonal entry of $\nabla_{\mathbf{x}}\mathbf{s}_{m}(\mathbf{x};{\boldsymbol{\theta}})$ needs to be computed separately (see Alg. 2 in the appendix).

In practice, we can tune $M$ to trade off variance and computational cost. In our experiments, we find that oftentimes $\boxed{M=1}$ is already a good choice.

2 SLICED SCORE ESTIMATION

As mentioned in section 2.2, the task of score estimation is to estimate $\nabla_{\mathbf{x}}\log q(\mathbf{x})$ at some test point $\mathbf{x}$ , given a set of samples $\mathbf{x}_{1}^{N}\overset{\text{i.i.d.\hbox{} missing}}{\sim}q(\mathbf{x})$ . In what follows, we show how our sliced score matching objective $\hat{J}({\boldsymbol{\theta}};\mathbf{x}_{1}^{N},\mathbf{v}_{11}^{NM})$ can be straightforwardly adapted for this task.

Using a similar argument of integration by parts (cf., Eq. (4), (5) and (6)), we have

which implies that minimizing $\hat{J}({\boldsymbol{\theta}};\mathbf{x}_{1}^{N},\mathbf{v}_{11}^{NM})$ with $\mathbf{s}_{m}(\mathbf{x};{\boldsymbol{\theta}})$ replaced by $\mathbf{h}(\mathbf{x};{\boldsymbol{\theta}})$ is approximately minimizing the average projected difference between $\mathbf{h}(\mathbf{x};{\boldsymbol{\theta}})$ and $\nabla_{\mathbf{x}}\log q(\mathbf{x})$ . Hence, $\mathbf{h}(\mathbf{x};\hat{{\boldsymbol{\theta}}})$ should be close to $\nabla_{\mathbf{x}}\log q(\mathbf{x})$ and can serve as a score estimator.

THEORETICAL ANALYSIS

In this section, we present several theoretical results to justify sliced score matching as a principled objective. We also discuss the connection of sliced score matching to other methods. For readability, we will defer rigorous statements of assumptions and theorems to the Appendix.

One important question to ask for any statistical estimation objective is whether the estimated parameter is consistent under reasonable assumptions. Our results confirm that for any $M$ , the objective $\hat{J}({\boldsymbol{\theta}};\mathbf{x}_{1}^{N},\mathbf{v}_{11}^{NM})$ is consistent under suitable assumptions as $N\to\infty$ .

We first prove that $J({\boldsymbol{\theta}};p_{\mathbf{v}})=0\Leftrightarrow{\boldsymbol{\theta}}={\boldsymbol{\theta}}^{*}$ (Lemma 1 and Theorem 1). Then we prove the uniform convergence of $\hat{J}({\boldsymbol{\theta}};\mathbf{x}_{1}^{N},\mathbf{v}_{11}^{NM})$ (Lemma 3), which holds regardless of $M$ . These two facts lead to consistency. For a complete proof, see Appendix B.3. ∎

In Hyvärinen (2005), the authors only showed that $J({\boldsymbol{\theta}})=0\Leftrightarrow{\boldsymbol{\theta}}={\boldsymbol{\theta}}^{*}$ , which leads to “local consistency” of score matching. This is a weaker notion of consistency compared to our settings.

2 ASYMPTOTIC NORMALITY

Asymptotic normality results can be very useful for approximate hypothesis testing and comparing different estimators. Below we show that $\hat{{\boldsymbol{\theta}}}_{N,M}$ is asymptotically normal when $N\to\infty$ .

In addition to the assumptions in Section 4.1, we need an extra assumption to prove asymptotic normality. We require $p_{m}(\mathbf{x};{\boldsymbol{\theta}})$ to have a stronger Lipschitz property (Assumption 9).

For simplicity, we denote $\nabla_{\mathbf{x}}h(\mathbf{x})|_{\mathbf{x}=\mathbf{x}^{\prime}}$ as $\nabla_{\mathbf{x}}h(\mathbf{x}^{\prime})$ , where $h(\cdot)$ is an arbitrary function. In the following, we will only show the asymptotic normality result for a specific $p_{\mathbf{v}}$ . More general results are in Appendix B.4.

Assume the assumptions discussed above hold. If $p_{\mathbf{v}}$ is the multivariate Rademacher distribution, we have

Here $D$ is the dimension of data; $V_{ij}$ and $W_{ij}$ are p.s.d matrices depending on $p_{m}(\mathbf{x};{\boldsymbol{\theta}}^{*})$ , and their definitions can be found in Appendix B.4.

Once we get the consistency (Theorem 2), the rest closely follows the proof of asymptotic normality of MLE (van der Vaart, 1998). A rigorous proof can be found in Appendix B.4. ∎

As expected, larger $M$ will lead to smaller asymptotic variance, as can be seen in Eq. (9).

As far as we know, there is no published proof of asymptotic normality for the standard (not sliced) score matching. However, by using the same techniques in our proofs, and under similar assumptions, we can conclude that the asymptotic variance of the score matching estimator is $[\nabla_{\boldsymbol{\theta}}^{2}J({\boldsymbol{\theta}}^{*})]^{-1}\left(\sum_{ij}V_{ij}\right)[\nabla_{\boldsymbol{\theta}}^{2}J({\boldsymbol{\theta}}^{*})]^{-1}$ (Corollary 1), which will always be smaller than sliced score matching with multivariate Rademacher projections. However, the gap reduces when $M$ increases.

3 CONNECTION TO OTHER METHODS

Sliced score matching is widely connected to many other methods, and can be motivated from some different perspectives. Here we discuss a few of them.

Noise contrastive estimation (NCE), proposed by Gutmann & Hyvärinen (2010), is another principle for training unnormalized statistical models. The method works by comparing $p_{m}(\mathbf{x};{\boldsymbol{\theta}})$ with a noise distribution $p_{n}(\mathbf{x})$ . We consider a special form of NCE which minimizes the following objective

where $h(\mathbf{x};{\boldsymbol{\theta}})\triangleq\frac{p_{m}(\mathbf{x};{\boldsymbol{\theta}})}{p_{m}(\mathbf{x};{\boldsymbol{\theta}})+p_{m}(\mathbf{x}-\mathbf{v};{\boldsymbol{\theta}})}$ , and we choose $p_{n}(\mathbf{x})=p_{d}(\mathbf{x}+\mathbf{v})$ . Assuming $\|\mathbf{v}\|_{2}$ to be small, Eq. (10) can be written as the following by Taylor expansion

For derivation, see Proposition 1 in the appendix. A similar derivation can also be found in Gutmann & Hirayama (2011). As a result of (11), if we choose some $p_{\mathbf{v}}$ and take the expectation of (10) w.r.t. $p_{\mathbf{v}}$ , the objective will be equivalent to sliced score matching whenever $\left\lVert\mathbf{v}\right\rVert_{2}\approx 0$ .

which is exactly the sliced score matching objective with variance reduction $\hat{J}_{\text{vr}}({\boldsymbol{\theta}};\mathbf{x}_{1}^{N},\mathbf{v}_{11}^{NM})$ .

RELATED WORK

To the best of our knowledge, there are three existing methods that are able to scale up score matching to learning deep models on high dimensional data.

which can be estimated without computing any Hessian. Although denoising score matching is much faster than score matching, it has obvious drawbacks. First, it can only recover the noise corrupted data distribution. Furthermore, choosing the parameters of the noise distribution is highly non-trivial and in practice the performance can be very sensitive to $\sigma$ , and heuristics have to be used in practice. For example, Saremi et al. (2018) propose a heuristic for choosing $\sigma$ based on Parzen windows.

Kingma & LeCun (2010) propose a backpropagation method to approximately compute the trace of the Hessian. Because the backpropagation of the full Hessian scales quadratically w.r.t. the layer size, the authors limit backpropagation only to the diagonal so that it has the same cost as the usual backpropagation for computing loss gradients. However, there are no theoretical guarantees for the approximation errors. In fact, the authors only did experiments on networks with a single hidden layer, in which case the approximation is exact. Moreover, there is no direct support for the proposed approximate backpropagation method in modern automatic differentiation frameworks.

Martens et al. (2012) estimate the trace term in score matching by applying curvature propagation to compute an unbiased, complex-valued estimator of the diagonal of the Hessian. The work claims that curvature propagation will have the smallest variance among a class of estimators, which includes the Hutchinson’s estimator. However, their proof evaluates the pseudo-variance of the complex-valued estimator instead of the variance. In practice, curvature propagation can have large variance when the number of nodes in the network is large, because it introduces noise for each node in the network. Finally, implementing curvature propagation also requires manually modifying the backpropagation code, handling complex numbers in neural networks, and will be inefficient for networks of more general structures, such as recurrent neural networks.

2 KERNEL SCORE ESTIMATORS

Two prior methods for score estimation are based on a generalized form of Stein’s identity (Stein, 1981; Gorham & Mackey, 2017):

where $q(\mathbf{x})$ is a continuously differentiable density and $\mathbf{h}(\mathbf{x})$ is a function satisfying some regularity conditions. Li & Turner (2018) propose to set $\mathbf{h}(\mathbf{x})$ as the feature map of some kernel in the Stein class (Liu et al., 2016) of $q$ , and solve a finite-sample version of (12) to obtain estimates of $\nabla_{\mathbf{x}}\log q(\mathbf{x})$ at the sample points. We refer to this method as Stein in the experiments. Shi et al. (2018) adopt a different approach but also exploit (12). They build their estimator by expanding $\nabla_{\mathbf{x}}\log q(\mathbf{x})$ as a spectral series of eigenfunctions and solve for the coefficients using (12). Compared to Stein, their method is argued to have theoretical guarantees and principled out-of-sample estimation at an unseen test point. We refer to their method as Spectral in the experiments.

EXPERIMENTS

Our experiments include two parts: (1) to test the effectiveness of sliced score matching (SSM) in learning deep models for density estimation, and (2) to test the ability of SSM in providing score estimates for applications such as VAEs with implicit encoders and WAEs. Unless specified explicitly, we choose $M=1$ by default.

We evaluate SSM and its variance-reduced version (SSM-VR) for density estimation and compare against score matching (SM) and its three scalable baselines: denoising score matching (DSM), approximate backpropagation (approx BP), and curvature propagation (CP). All SSM methods in this section use the multivariate Rademacher distribution as $p_{\mathbf{v}}$ . Our results demonstrate that: (1) SSM is comparable in performance to SM, (2) SSM outperforms other computationally efficient approximations to SM, and (3) unlike SM, SSM scales effectively to high dimensional data.

Following the settings of Wenliang et al. (2019), we trained models on three UCI datasets: Parkinsons, RedWine, and WhiteWine (Dua & Graff, 2017), and used their original code for SM. To compute the trace term exactly, Wenliang et al. (2019)’s manual implementation of backpropagation takes over one thousand lines for a model that is four layers deep, while the implementation of SSM only takes several lines. For DSM, the value of $\sigma$ is chosen by grid search using the SM loss on a validation dataset. All models are trained with 15 different random seeds and training is stopped when validation loss does not improve for 200 steps.

Results in Fig. 1 demonstrate that SSM-VR performs comparably to SM, when evaluated using the SM loss on a held-out test set. We observe that variance reduction substantially improves the performance of SSM. In addition, SSM outperforms other computationally efficient approaches. DSM can perform comparably to SSM on RedWine. However, it is challenging to select $\sigma$ for DSM. Models trained using $\sigma$ from the heuristic in Saremi et al. (2018) are far from optimal (on both SM losses and likelihoods), and extensive grid search is needed to find the best $\sigma$ . CP performs substantially worse, which is likely because it injects noise for each node in the computation graph, and the amount of noise introduced is too large for a neural-network-based kernel evaluated at 200 inducing points, which supports our hypothesis that CP does not work effectively for deep models. Results for approx BP are omitted because the losses exceeded $10^{9}$ on all datasets. This is because approx BP provides a biased estimate of the Hessian without any error guarantees. All the results are similar when evaluating according to log-likelihood metric (Appendix C.1).

We evaluate the computational efficiency of different losses on data of increasing dimensionality. We fit DKEF models to a multivariate standard normal in high dimensional spaces. The average running time per minibatch of 100 examples is reported in Fig. 2. SM performance degrades linearly with the dimension of the input data due to the computation of the Hessian, and creates out of memory errors for a 12GB GPU after the dimension increases beyond 400. SSM performs similarly to DSM, approx BP and CP, and they are all scalable to large dimensions.

1.2 Deep Flow Models

As a sanity check, we also evaluate SSM by training a NICE flow model (Dinh et al., 2015), whose likelihood is tractable and can be compared to results obtained by MLE. The model we use has 20 hidden layers, and 1000 units per layer. Models are trained to fit MNIST handwritten digits, which are 784 dimensional images. Data are dequantized by adding uniform noise in the range $[-\frac{1}{512},\frac{1}{512}]$ , and transformed using a logit transformation, $\log(x)-\log(1-x)$ . We provide additional details in Appendix C.2.

Training with SM is extremely computationally expensive in this case. Our SM implementation based on auto-differentiation takes around 7 hours to finish one epoch of training, and 12 GB GPU memory cannot hold a batch size larger than 24, so we do not include these results. Since NICE has tractable likelihoods, we also evaluate MLE as a surrogate objective for minimizing the SM loss. Notably, likelihoods and SM losses might be uncorrelated when the model is mis-specified, which is likely to be the case for a complex dataset like MNIST.

SM losses and log-likelihoods on the test dataset are reported in Tab. 1, where models are evaluated using the best checkpoint in terms of the SM loss on a validation dataset over 100 epochs of training. SSM-VR greatly outperforms all the other methods on the SM loss. DSM performs worse than SSM-VR, and $\sigma$ is still hard to tune. Specifically, following the heuristic in Saremi et al. (2018) leads to $\sigma=1.74$ , which performed the worst (on both log-likelihood and SM loss) of the eight choices of $\sigma$ in our grid search. Approx BP has more success on NICE than for training DKEF models. This might be because the Hessians of hidden layers of NICE are closer to a diagonal matrix, which results in a smaller approximation error for approx BP. As in the DKEF experiments, CP performs worse. This is likely due to injecting noise to all hidden units, which will lead to large variance for a network as big as NICE. Unlike the DKEF experiments, we find that good log-likelihoods are less correlated with good SM loss, probably due to model mis-specification.

2 SCORE ESTIMATION

We consider two typical tasks that require accurate score estimations: (1) training VAEs with an implicit encoder and (2) training Wasserstein Auto-Encoders. We show in both tasks SSM outperforms previous score estimators. Samples generated by various algorithms can be found in Appendix A.

Consider a latent variable model $p(\mathbf{x},\mathbf{z})$ , where $\mathbf{x}$ and $\mathbf{z}$ denote observed and latent variables respectively. A variational auto-encoder (VAE) is composed of two parts: 1) an encoder $p_{\boldsymbol{\theta}}(\mathbf{x}\mid\mathbf{z})$ modeling the conditional distribution of $\mathbf{x}$ given $\mathbf{z}$ ; and a decoder $q_{\boldsymbol{\phi}}(\mathbf{z}\mid\mathbf{x})$ that approximates the posterior distribution of the latent variable. A VAE is typically trained by maximizing the following evidence lower bound (ELBO)

For a single data point $\mathbf{x}$ , kernel score estimators need multiple samples from $q_{\boldsymbol{\phi}}(\mathbf{z}\mid\mathbf{x})$ to estimate its score. On MNIST, we use 100 samples, as done in Shi et al. (2018). On CelebA, however, we can only take 20 samples due to GPU memory limitations. In contrast, SSM learns a score network $h(\mathbf{z}\mid\mathbf{x})$ along with $q_{\boldsymbol{\phi}}(\mathbf{z}\mid\mathbf{x})$ , which amortizes the cost of score estimation. Unless noted explicitly, we use one projection per data point ( $M=1$ ) for SSM, which is scalable to deep networks.

We consider VAE training on MNIST and CelebA (Liu et al., 2015). All images in CelebA are first cropped to a patch of $140\times 140$ and then resized to $64\times 64$ . We report negative likelihoods on MNIST, as estimated by AIS (Neal, 2001) with 1000 intermediate distributions. We evaluate sample quality on CelebA with FID scores (Heusel et al., 2017). For fast AIS evaluation on MNIST, we use shallow fully-connected networks with 3 hidden layers. For CelebA experiments we use deep convolutional networks. The architectures of implicit encoders and score networks are straightforward modifications to the encoders of ELBO. More details are provided in Appendix C.3.

The negative likelihoods of different methods on MNIST are reported in the left part of Tab. 2. We note that SSM outperforms Stein and Spectral in all cases. When the latent dimension is 8, SSM outperforms ELBO, which indicates that the expressive power of implicit $q_{\boldsymbol{\phi}}(\mathbf{z}\mid\mathbf{x})$ pays off. When the latent dimension is 32, the gaps between SSM and kernel methods are even larger, and the performance of SSM is still comparable to ELBO. Moreover, when $M=100$ (matching the computation of kernel methods), SSM outperforms ELBO.

For CelebA, we provide FID scores of samples in the top part of Tab. 3. We observe that after 40k training iterations, SSM outperforms all other baselines. Kernel methods perform poorly in this case because only 20 samples per data point can be used due to limited amount of GPU memory. Early during training, SSM does not perform as well. Since the score network is trained along with the encoder and decoder, a moderate number of training steps is needed to give an accurate score estimation (and better learning of the VAE).

2.2 WAEs

Compared to $\nabla_{\boldsymbol{\phi}}H(q_{\boldsymbol{\phi}}(\mathbf{z}\mid\mathbf{x}))$ , it is easier to estimate $\nabla_{\boldsymbol{\phi}}H(q_{\boldsymbol{\phi}}(\mathbf{z}))$ for kernel methods, because the samples of $q_{\boldsymbol{\phi}}(\mathbf{z})$ can be collected by first sampling $\mathbf{x}_{1},\mathbf{x}_{2},\cdots,\mathbf{x}_{N}\stackrel{{\scriptstyle\text{i.i.d.\hbox{} missing}}}{{\sim}}p_{d}(\mathbf{x})$ and then sample one $\mathbf{z}$ for each $\mathbf{x}_{i}$ from $q_{\phi}(\mathbf{z}\mid\mathbf{x}_{i})$ . In constrast, multiple $\mathbf{z}$ ’s need to be sampled for each $\mathbf{x}_{i}$ when estimating $\nabla_{\boldsymbol{\phi}}H(q_{\boldsymbol{\phi}}(\mathbf{z}\mid\mathbf{x}))$ with kernel approaches. For SSM, we use a score network $h(\mathbf{z})$ and train it alongside $q_{\boldsymbol{\phi}}(\mathbf{z}\mid\mathbf{x})$ .

The setup for WAE experiments is largely the same as VAE. The architectures are very similar to those used in the VAE experiments, and the only difference is that we made decoders implicit, as suggested in Tolstikhin et al. (2018). More details can be found in Appendix C.3.

The negative likelihoods on MNIST are provided in the right part of Tab. 2. SSM outperforms both kernel methods, and achieves a larger performance gap when the latent dimension is higher. The likelihoods are lower than the VAE results as the WAE objective does not directly maximize likelihoods.

We show FID scores for CelebA experiments in the bottom part of Tab. 3. As expected, kernel methods perform much better than before, because it is faster to sample from $q_{\boldsymbol{\phi}}(\mathbf{z})$ . The FID scores are generally lower than those in VAE experiments, which supports previous results that WAEs generally obtain better samples. SSM outperforms both kernel methods when the number of iterations is more than 40k. This shows the advantages of training a deep, expressive score network compared to using a fixed kernel in score estimation tasks.

CONCLUSION

We propose sliced score matching, a scalable method for learning unnormalized statistical models and estimating scores for implicit distributions. Compared to the original score matching and its variants, our estimator can scale to deep models on high dimensional data, while remaining easy to implement in modern automatic differentiation frameworks. Theoretically, our estimator is consistent and asymptotically normal under some regularity conditions. Experimental results demonstrate that our method outperforms competitors in learning deep energy-based models and provides more accurate estimates than kernel score estimators in training implicit VAEs and WAEs.

This research was supported by Intel Corporation, TRI, NSF (#1651565, #1522054, #1733686 ), ONR (N00014-19-1-2145), AFOSR (FA9550- 19-1-0024).

References

Appendix A Samples

A.1.2 CelebA

A.2 WAE

A.2.2 CelebA

Appendix B PROOFS

Below we provide a summary of the most commonly used notations used in the proofs. First, we denote the data distribution as $p_{d}(\mathbf{x})$ and assume that the training/test data $\{\mathbf{x}_{1},\mathbf{x}_{2},\cdots,\mathbf{x}_{N}\}$ are i.i.d. samples of $p_{d}(\mathbf{x})$ . The model is denoted as $p_{m}(\mathbf{x};{\boldsymbol{\theta}})$ , where ${\boldsymbol{\theta}}$ is restricted to a parameter space $\Theta$ . Note that $p_{m}(\mathbf{x};\mathbf{v})$ can be an unnormalized energy-based model. We use $\mathbf{v}$ to represent a random vector with the same dimension of input $\mathbf{x}$ . This vector $\mathbf{v}$ is often called the projection vector, and we use $p_{\mathbf{v}}$ to denote its distribution.

Next, we introduce several shorthand notations for quantities related to $p_{m}(\mathbf{x};{\boldsymbol{\theta}})$ and $p_{d}(\mathbf{x})$ . The log-likelihood $\log p_{m}(\mathbf{x};{\boldsymbol{\theta}})$ and $\log p_{d}(\mathbf{x})$ are respectively denoted as $l_{m}(\mathbf{x};{\boldsymbol{\theta}})$ and $l_{d}(\mathbf{x})$ . The (Stein) score function $\nabla_{\mathbf{x}}\log p_{m}(\mathbf{x};{\boldsymbol{\theta}})$ and $\nabla_{\mathbf{x}}\log p_{d}(\mathbf{x})$ are written as $\mathbf{s}_{m}(\mathbf{x};{\boldsymbol{\theta}})$ and $\mathbf{s}_{d}(\mathbf{x})$ , and finally the Hessian of $\log p_{m}(\mathbf{x};{\boldsymbol{\theta}})$ w.r.t. $\mathbf{x}$ is denoted as $\nabla_{\mathbf{x}}\mathbf{s}_{m}(\mathbf{x};{\boldsymbol{\theta}})$ .

We also adopt some convenient notations for collections. In particular, we use $\mathbf{x}_{1}^{N}$ to denote a collection of $N$ vectors $\{\mathbf{x}_{1},\mathbf{x}_{2},\cdots,\mathbf{x}_{N}\}$ and use $\mathbf{v}_{11}^{NM}$ to denote $N\times M$ vectors $\{\mathbf{v}_{11},\mathbf{v}_{12},\cdots,\mathbf{v}_{1M},\mathbf{v}_{21},\allowbreak\mathbf{v}_{22},\cdots,\mathbf{v}_{2M},\cdots,\mathbf{v}_{N1},\mathbf{v}_{N2},\cdots,\mathbf{v}_{NM}\}$ .

B.2 BASIC PROPERTIES

The following regularity conditions are needed for integration by parts and identifiability.

$\forall{\boldsymbol{\theta}}\in\Theta,\lim_{\left\lVert\mathbf{x}\right\rVert\rightarrow\infty}\mathbf{s}_{m}(\mathbf{x};{\boldsymbol{\theta}})p_{d}(\mathbf{x})=0$ .

The model family $\{p_{m}(\mathbf{x};{\boldsymbol{\theta}})\mid{\boldsymbol{\theta}}\in\Theta\}$ is well-specified, i.e., $p_{d}(\mathbf{x})=p_{m}(\mathbf{x};{\boldsymbol{\theta}}^{*})$ . Furthermore, $p_{m}(\mathbf{x};{\boldsymbol{\theta}})\neq p_{m}(\mathbf{x};{\boldsymbol{\theta}}^{*})$ whenever ${\boldsymbol{\theta}}\neq{\boldsymbol{\theta}}^{*}$ .

$p_{m}(\mathbf{x};{\boldsymbol{\theta}})>0,\forall{\boldsymbol{\theta}}\in\Theta$ and $\forall\mathbf{x}$ .

Assume $\mathbf{s}_{m}(\mathbf{x};{\boldsymbol{\theta}})$ , $\mathbf{s}_{d}(\mathbf{x})$ and $p_{\mathbf{v}}$ satisfy some regularity conditions (Assumption 1, Assumption 2). Under proper boundary conditions (Assumption 3), we have

The basic idea of this proof is similar to that of Theorem 1 in Hyvärinen (2005). First, note that $L({\boldsymbol{\theta}},p_{\mathbf{v}})$ can be expanded to

This can be shown by first observing that

which proves (16) and the proof is completed. ∎

Assume our model family is well-specified and identifiable (Assumption 4). Assume further that the densities are all positive (Assumption 5). When $p_{\mathbf{v}}$ satisfies some regularity conditions (Assumption 2), we have

First, since $p_{d}(\mathbf{x})=p_{m}(\mathbf{x};{\boldsymbol{\theta}}^{*})>0$ , $L({\boldsymbol{\theta}};p_{\mathbf{v}})=0$ implies

B.3 CONSISTENCY

In addition to the assumptions in Theorem 1 and Lemma 1, we need the following regularity conditions to prove the consistency of $\hat{{\boldsymbol{\theta}}}_{N,M}\triangleq\operatorname*{arg\,min}_{{\boldsymbol{\theta}}\in\Theta}\hat{J}({\boldsymbol{\theta}};\mathbf{x}_{1}^{N},\mathbf{v}_{11}^{NM})$ .

Let $A({\boldsymbol{\theta}})\triangleq\nabla_{\mathbf{x}}\mathbf{s}_{m}(\mathbf{x};{\boldsymbol{\theta}})$ and $B({\boldsymbol{\theta}})\triangleq\mathbf{s}_{m}(\mathbf{x};{\boldsymbol{\theta}})\mathbf{s}_{m}(\mathbf{x};{\boldsymbol{\theta}})^{\intercal}$ . Consider ${\boldsymbol{\theta}}_{1}\in\Theta,{\boldsymbol{\theta}}_{2}\in\Theta$ and let $D$ be the dimension of $\mathbf{v}$ , we have

where $(i)$ is Cauchy-Schwarz inequality, $(ii)$ is Jensen’s inequality, and $(iii)$ is due to Assumption 7. Now let

where $(i)$ results from the independence of $\mathbf{v},\mathbf{x}$ , and $(ii)$ is due to Assumption 7 and Assumption 8. ∎

where $\operatorname{diam}(\cdot)$ denotes the diameter and $D$ is the dimension of $\Theta$ .

Step 1: From Jensen’s inequality, we obtain

where ${\mathbf{x}^{\prime}}_{1}^{N},{\mathbf{v}^{\prime}}_{11}^{NM}$ are independent copies of $\mathbf{x}_{1}^{N}$ and $\mathbf{v}_{11}^{NM}$ . Let $\{\epsilon_{i}\}_{i=1}^{N}$ be a set of independent Rademacher random variables, we have

where $(i)$ is because the quantity is symmetric about in distribution, and $(ii)$ is due to Jensen’s inequality.

Step 2: First note that given $\mathbf{x}_{i},\mathbf{v}_{ij}$ , $\frac{\epsilon_{i}}{M}\sum_{j=1}^{M}f({\boldsymbol{\theta}};\mathbf{x}_{i},\mathbf{v}_{ij})$ is a zero-mean sub-Gaussian process w.r.t. ${\boldsymbol{\theta}}$ . This can be observed from

where $(i)$ holds because $\epsilon_{i}$ is a 1-sub-Gaussian random variable, $(ii)$ is from Cauchy-Schwarz inequality and $(iii)$ is due to Lemma 2. As a result, $\frac{1}{N}\frac{1}{M}\sum_{i=1}^{N}\sum_{j=1}^{M}\epsilon_{i}f({\boldsymbol{\theta}};\mathbf{x}_{i},\mathbf{v}_{ij})$ is a zero-mean sub-Gaussian random process with metric

Since $\Theta$ is compact, the diameter of $\Theta$ with respect to the Euclidean norm $\left\lVert\cdot\right\rVert_{2}$ is finite and we denote it as $\operatorname{diam}(\Theta)<\infty$ . Then, Dudley’s entropy integral (Dudley, 1967) gives

Here $\log N(\Theta,d,\epsilon)$ is the metric entropy of $\Theta$ with metric $d({\boldsymbol{\theta}}_{1},{\boldsymbol{\theta}}_{2})=\frac{1}{\sqrt{N}}\sqrt{\frac{1}{NM}\sum_{i=1}^{N}\sum_{j=1}^{M}L^{2}(\mathbf{x}_{i},\mathbf{v}_{ij})}\left\lVert{\boldsymbol{\theta}}_{1}-{\boldsymbol{\theta}}_{2}\right\rVert_{2}$ and size $\epsilon$ .

Step 3: When the dimension of ${\boldsymbol{\theta}}\in\Theta$ is $D$ , it is known that the $\epsilon$ -covering number of $\Theta$ with Euclidean distance is

Therefore, $N(\Theta,d,\epsilon)$ can be bounded by

Hence, the metric integral can be bounded

Finally, combining (19), (20) and (21) gives us

Suppose all the previous assumptions hold (Assumption 1-Assumption 8). Assume further the conditions of Theorem 1 and Lemma 1 are satisfied. Let ${\boldsymbol{\theta}}^{*}$ be the true parameter of the data distribution, and $\hat{{\boldsymbol{\theta}}}_{N,M}$ be the empirical estimator defined by

Then, $\hat{{\boldsymbol{\theta}}}_{N,M}$ is consistent, meaning that

Note that Theorem 1 and Lemma 1 together imply that ${\boldsymbol{\theta}}^{*}=\operatorname*{arg\,min}_{{\boldsymbol{\theta}}\in\Theta}J({\boldsymbol{\theta}};p_{\mathbf{v}})$ . Then, we will show $J(\hat{{\boldsymbol{\theta}}}_{N,M};p_{\mathbf{v}})\overset{p}{\to}J({\boldsymbol{\theta}}^{*};p_{\mathbf{v}})$ when $N\to\infty$ . This can be done by noticing

We can easily conclude that (22) is $o_{p}(1)$ with the help of Lemma 3, because

as $N\to\infty$ . From Lemma 1 we also have $L(\hat{{\boldsymbol{\theta}}}_{N,M};p_{\mathbf{v}})-L({\boldsymbol{\theta}}^{*};p_{\mathbf{v}})>0$ if $\hat{{\boldsymbol{\theta}}}_{N,M}\neq{\boldsymbol{\theta}}$ . As shown by Theorem 1, this is the same as $J(\hat{{\boldsymbol{\theta}}}_{N,M};p_{\mathbf{v}})-J({\boldsymbol{\theta}}^{*};p_{\mathbf{v}})>0$ if $\hat{{\boldsymbol{\theta}}}_{N,M}\neq{\boldsymbol{\theta}}$ . Therefore (22) = $o_{p}(1)$ gives $J(\hat{{\boldsymbol{\theta}}}_{N,M};p_{\mathbf{v}})\overset{p}{\to}J({\boldsymbol{\theta}}^{*};p_{\mathbf{v}})$ .

However, the fact that $p(|J(\hat{{\boldsymbol{\theta}}}_{N,M};p_{\mathbf{v}})-J({\boldsymbol{\theta}}^{*};p_{\mathbf{v}})|\geq J({\boldsymbol{\theta}}_{\mathcal{S}_{\epsilon}};p_{\mathbf{v}})-J({\boldsymbol{\theta}}^{*};p_{\mathbf{v}}))\geq\delta$ holds for arbitrarily large $N$ contradicts $J(\hat{{\boldsymbol{\theta}}}_{N,M};p_{\mathbf{v}})\overset{p}{\to}J({\boldsymbol{\theta}}^{*};p_{\mathbf{v}})$ . ∎

B.4 ASYMPTOTIC NORMALITY

To simplify notations we use $\partial_{i}\partial_{j}h(\cdot)\triangleq(\nabla_{\mathbf{x}}^{2}h(\cdot))_{ij}$ , $\partial_{i}h(\cdot)\triangleq(\nabla_{\mathbf{x}}h(\cdot))_{i}$ , and denote $\nabla_{\mathbf{x}}h(\mathbf{x})|_{\mathbf{x}=\mathbf{x}^{\prime}}$ as $\nabla_{\mathbf{x}}h(\mathbf{x}^{\prime})$ . Here $h(\cdot)$ denotes some arbitrary function. Let $l_{m}\triangleq\log p_{m}(\mathbf{x};{\boldsymbol{\theta}})$ , $l_{m}(\mathbf{x};{\boldsymbol{\theta}})\triangleq\log p_{m}(\mathbf{x};{\boldsymbol{\theta}})$ and further adopt the following notations

For the proof of asymptotic normality we need the following extra assumptions.

For ${\boldsymbol{\theta}}_{1}$ , ${\boldsymbol{\theta}}_{2}$ near ${\boldsymbol{\theta}}^{*}$ , and $\forall i,j$ ,

Suppose $l_{m}(\mathbf{x};{\boldsymbol{\theta}})$ is sufficiently smooth (Assumption 9) and $p_{\mathbf{v}}$ has bounded moments (Assumption 2 and Assumption 8). Let $\nabla_{\boldsymbol{\theta}}^{2}f({\boldsymbol{\theta}};\mathbf{x},\mathbf{v}_{1}^{M})\triangleq\frac{1}{M}\sum_{i=1}^{M}\nabla_{\boldsymbol{\theta}}^{2}\mathbf{v}_{i}^{\intercal}\nabla_{\mathbf{x}}\mathbf{s}_{m}(\mathbf{x};{\boldsymbol{\theta}})\mathbf{v}_{i}+\frac{1}{2}\nabla_{\boldsymbol{\theta}}^{2}(\mathbf{v}_{i}^{\intercal}\mathbf{s}_{m}(\mathbf{x};{\boldsymbol{\theta}}))^{2}$ . Then $\nabla_{\boldsymbol{\theta}}^{2}f({\boldsymbol{\theta}};\mathbf{x},\mathbf{v})$ is Lipschitz continuous, i.e., for ${\boldsymbol{\theta}}_{1}$ and ${\boldsymbol{\theta}}_{2}$ close to ${\boldsymbol{\theta}}^{*}$ , there exists a Lipschitz constant $L(\mathbf{x},\mathbf{v}_{1}^{M})$ such that

First, we write out $\nabla_{\boldsymbol{\theta}}^{2}f({\boldsymbol{\theta}}_{1};\mathbf{x},\mathbf{v}_{1}^{M})-\nabla_{\boldsymbol{\theta}}^{2}f({\boldsymbol{\theta}}_{2};\mathbf{x},\mathbf{v}_{1}^{M})$ according to the definitions. Let $A_{ij}({\boldsymbol{\theta}})\triangleq\nabla_{\boldsymbol{\theta}}^{2}\partial_{i}\partial_{j}l_{m}(\mathbf{x};{\boldsymbol{\theta}})$ and $B_{ij}({\boldsymbol{\theta}})\triangleq\nabla_{\boldsymbol{\theta}}^{2}\partial_{i}l_{m}(\mathbf{x};{\boldsymbol{\theta}})\partial_{j}l_{m}(\mathbf{x};{\boldsymbol{\theta}})$ . Then,

Then, Cauchy-Schwarz and Jensen’s inequality give

where $(i)$ is due to Assumption 9, $(ii)$ is Jensen’s, and $(iii)$ is because of Assumption 2 and 8. ∎

Assume that conditions in Theorem 1 and Lemma 1 hold, and $p_{\mathbf{v}}$ has bounded higher-order moments (Assumption 8). Then

where $\nabla_{\boldsymbol{\theta}}f({\boldsymbol{\theta}}^{*};\mathbf{x},\mathbf{v}_{1}^{M})=\nabla_{\boldsymbol{\theta}}f({\boldsymbol{\theta}};\mathbf{x},\mathbf{v}_{1}^{M})\big{|}_{{\boldsymbol{\theta}}={\boldsymbol{\theta}}^{*}}$

In particular, if $p_{\mathbf{v}}\sim\mathcal{N}(0,I)$ , we have

If $p_{\mathbf{v}}$ is the distribution of multivariate Rademacher random variables, we have

Since ${\boldsymbol{\theta}}^{*}$ is the true parameter of the data distribution, we have

which leads to (23). Note that Assumption 2 guarantees that $|\Sigma_{ij}|<\infty$ and Assumption 8 ensures $|\mathfrak{S}_{ijpq}|<\infty$ .

If $p_{\mathbf{v}}\sim\mathcal{N}(0,I)$ , $\mathfrak{S}$ and $\Sigma$ have the following simpler forms

Then, if we assume that the second derivatives of $l_{m}$ are continuous, we have $\partial_{i}\partial_{j}l_{m}=\partial_{j}\partial_{i}l_{m}$ , and the variance (23) can also be simplified to

Similarly, if $p_{\mathbf{v}}\sim\mathcal{U}(\{\pm 1\}^{D})$ , (23) has the simplified form

With the notations and assumptions in Lemma 4, Lemma 5 and Theorem 2, we have

In particular, if $p_{\mathbf{v}}\sim\mathcal{N}(0,I)$ , then the asymptotic variance is

If $p_{\mathbf{v}}$ is the distribution of multivariate Rademacher random variables, the asymptotic variance is

To simplify notations, we use $P_{N}h(\mathbf{x})\triangleq\frac{1}{N}\sum_{i=1}^{N}h(\mathbf{x}_{i},\cdot)$ , where $h(\mathbf{x},\cdot)$ is some arbitrary function. For example, $\hat{J}({\boldsymbol{\theta}};\mathbf{x}_{1}^{N},\mathbf{v}_{11}^{NM})$ can be written as $P_{N}f({\boldsymbol{\theta}};\mathbf{x},\mathbf{v}_{1}^{M})$ . By Taylor expansion, we can approximate $P_{N}\nabla_{\boldsymbol{\theta}}f(\hat{{\boldsymbol{\theta}}}_{N,M};\mathbf{x},\mathbf{v}_{1}^{M})$ around ${\boldsymbol{\theta}}^{*}$ :

where $\|E_{\hat{{\boldsymbol{\theta}}}_{N,M},\mathbf{x},\mathbf{v}_{1}^{M}}\|_{F}\leq L(\mathbf{x},\mathbf{v}_{1}^{M})\|\hat{{\boldsymbol{\theta}}}_{N,M}-{\boldsymbol{\theta}}^{*}\|_{2}$ from Lemma 4 and Taylor expansion of vector-valued functions. Combining with the law of large numbers, we have

But of course, the central limit theorem and Lemma 5 yield

Then, Slutsky’s theorem gives the desired result

In particular, if $p_{\mathbf{v}}\sim\mathcal{N}(0,I)$ or $p_{\mathbf{v}}\sim\mathcal{U}(\{\pm 1\}^{D})$ , we have $J({\boldsymbol{\theta}}^{*};p_{\mathbf{v}})=J({\boldsymbol{\theta}}^{*})$ , and therefore $\nabla_{\boldsymbol{\theta}}^{2}J({\boldsymbol{\theta}}^{*};p_{\mathbf{v}})=\nabla_{\boldsymbol{\theta}}^{2}J({\boldsymbol{\theta}}^{*})$ . We can apply Lemma 5 to conclude the simplified expressions for the asymptotic variance.

Under similar assumptions used in Theorem 2 and Theorem 3, we can also conclude that the score matching estimator $\hat{{\boldsymbol{\theta}}}_{N}\triangleq\operatorname*{arg\,min}_{{\boldsymbol{\theta}}\in\Theta}\hat{J}({\boldsymbol{\theta}};\mathbf{x})$ is consistent

The other part of the proof is similar to that of Theorem 2 and Theorem 3 and is thus obmitted. ∎

B.5 NOISE CONTRASTIVE ESTIMATION

Then when $\left\lVert v\right\rVert_{2}\to 0$ , we have

Using Taylor expansion, we can immediately get

Appendix C ADDITIONAL DETAILS OF EXPERIMENTS

To improve computational cost of learning $f,$ Sutherland et al. (2018) use a Nyström-type lite approximation, selecting $L$ inducing points ${\mathbf{z}}_{l}$ , and $f$ of the form:

We compare training using our objective against deep kernel exponential family (DKEF) models trained using exact score matching in Wenliang et al. (2019).

The kernel $k(\mathbf{x},{\mathbf{y}})$ is a mixture of $R=3$ Gaussian kernels, with features extracted by a neural network, $\phi_{w_{r}}(\cdot)$ , length scales $\sigma_{r}$ , and nonnegative mixture coefficients $\rho_{r}$ . The neural network is a three layer fully connected network with a skip connection from the input to output layers and softplus nonlinearities. Each hidden layer has 30 neurons. We have:

A similar closed form for sliced score matching, denoising score matching, approximate backpropogation, and curvature propagation can be derived. The derivation for sliced score matching is presented below for completeness.

For fixed $k$ , $\mathbf{z}$ , and $\lambda_{\boldsymbol{\alpha}}$ , as long as $\lambda_{\boldsymbol{\alpha}}>0$ then the optimal ${\boldsymbol{\alpha}}$ is

The derivation follows very similarly to Proposition 3 in Wenliang et al. (2019). We will show that the loss is quadratic in ${\boldsymbol{\alpha}}$ . Note that

Because $\lambda_{\alpha}>0$ and ${\bm{G}}$ is positive semidefinite, the matrix in parentheses is strictly positive definite, and the claimed result follows directly from standard vector calculus. ∎

RedWine and WhiteWine are dequantized by adding uniform noise to each dimension in the range $[-d,d]$ where $d$ is the median distance between two values for that dimension. For each dataset, 10% of the entire data was used as testing, and 10% of the remaining was used for validation. PCA whitening is applied to the data. Noise of standard deviation 0.05 is added as part of preprocessing.

The DKEF models have $R=3$ Gaussian kernels. Each feature extractor is a 3-layer neural network with a skip connection from the input to output, with 30 hidden neurons per layer. Weights were initialized from a Gaussian distribution with standard deviation equal to $\frac{1}{\sqrt{30}}$ . Length scales $\sigma_{r}$ were initialized to 1.0, 3.3 and 10.0. We use $L=200$ trainable inducing points, which were initialized from training data.

Models are trained using an Adam optimizer, with learning rate $10^{-2}$ . A batch size of 200 is used, with 100 points for computing ${\boldsymbol{\alpha}}$ , and 100 for computing the loss. Models are stopped after validation loss does not improve for 200 steps.

For denoising score matching, we perform a grid search with values [0.02, 0.04, 0.06, 0.08, 0.10, 0.12, 0.14, 0.16, 0.20, 0.24, 0.28, 0.32, 0.40, 0.48, 0.56, 0.64, 1.28]. We train models for each value of $\sigma$ using two random seeds, and pick the $\sigma$ with the best average validation score matching loss. For curvature propagation, one noise sample is used to match the performance of sliced score matching.

Log-likelihoods are presented below. They are estimated using AIS, using a proposal distribution $\mathcal{N}(0,2I)$ , using 1,000,000 samples.

C.2 NICE

The model has four coupling layers, each with five hidden layers, for a total of 20 hidden layers, as well as a final scale layer (Dinh et al., 2015). Softplus nonlinearities are used between hidden layers.

Models are trained using the Adam optimizer with learning rate $10^{-3}$ for 100 epochs. The best checkpoint on exact score matching loss, evaluation every 100 iterations, is used to report test set performance. We use a batch size of 128.

Data are dequantized by adding uniform noise in the range $[-\frac{1}{512},\frac{1}{512}]$ , clipped to be in the range $[-0.001,0.001]$ , and then transformed using a logit transformation $\log(x)-\log(1-x)$ . 90% of the training set is used for training, and 10% for validation, and the standard test set is used.

For grid search for the optimal value of $\sigma$ , eight values are used: [0.01, 0.05, 0.10, 0.20, 0.28, 0.50, 1.00, 1.50]. We also evaluate $\sigma=1.74$ , chosen by the heuristic in Saremi et al. (2018). The model with the best performance on validation score matching loss is used. Only nine values of $\sigma$ are evaluated because training each model takes approximately two hours.

C.3 SCORE ESTIMATION FOR IMPLICIT DISTRIBUTIONS

We put the architectures of all networks used in the MNIST and CelebA experiments in Tab. 8 and Tab. 9 respectively.

For MNIST experiments, we use RMSProp optimizer with a learning rate of 0.001 for all methods. On CelebA, the learning rate is changed to 0.0001. All algorithms are trained for 100000 iterations with a batch size of 128.

All samples are generated after 100000 training iterations.

Appendix D VARIANCE REDUCTION

Below we discuss approaches to reduce the variance of $\hat{J}({\boldsymbol{\theta}};\mathbf{x}_{1}^{N},\mathbf{v}_{11}^{NM})$ , which can lead to better performance in practice. The most naïve approach, of course, is using a larger $M$ to compute $\hat{J}({\boldsymbol{\theta}};\mathbf{x}_{1}^{N},\mathbf{v}_{11}^{NM})$ . However, this requires more computation and when $M$ is close to the data dimension, sliced score matching will lose its computational advantage over score matching.

An alternative approach is to leverage control variates (Owen, 2013). A control variate is a random variable whose expectation is tractable, and is highly correlated with another random variable without a tractable expectation. Define $c({\boldsymbol{\theta}};\mathbf{x},\mathbf{v})\triangleq\frac{1}{2}(\mathbf{v}^{\intercal}\mathbf{s}_{m}(\mathbf{x};{\boldsymbol{\theta}}))^{2}$ . Note that when $p_{\mathbf{v}}$ is a multivariate standard normal or multivariate Rademacher distribution, $c({\boldsymbol{\theta}};\mathbf{x},\mathbf{v})$ will have a tractable expectation, i.e.,

which is easily computable. Now let $\beta(\mathbf{x})c({\boldsymbol{\theta}};\mathbf{x},\mathbf{v})$ be our control variate, where $\beta(\mathbf{x})$ is a function to be determined. Due to the structural similarity between $\beta(\mathbf{x})c({\boldsymbol{\theta}};\mathbf{x},\mathbf{v})$ and $\hat{J}({\boldsymbol{\theta}};\mathbf{x}_{1}^{N},\mathbf{v}_{11}^{NM})$ , it is easy to believe that $\beta(\mathbf{x})c({\boldsymbol{\theta}};\mathbf{x},\mathbf{v})$ can be a correlated control variate with an appropriate $\beta(\mathbf{x})$ . We thus consider the following objective