Black-box Importance Sampling

Qiang Liu, Jason D. Lee

Introduction

Efficient Monte Carlo methods are workhorses for modern Bayesian statistics and machine learning. Importance sampling (IS) and Markov chain Monte Carlo (MCMC) are two fundamental tools widely used when it is intractable to draw exact samples from the underlying distribution p(x)p(x). IS uses an simple proposal distribution q(x)q(x) to draw a sample {xi}\{x_{i}\}, and attaches it with a set of importance weights that are proportional to the probability ratio p(xi)/q(xi)p(x_{i})/q(x_{i}). MCMC methods, on the other hand, rely on simulating Markov chains whose equilibrium distribution matches the target distribution.

Unfortunately, both importance sampling (IS) and MCMC have their own critical weaknesses. IS heavily relies on a good proposal q(x)q(x) that closely matches the target distribution p(x)p(x) to obtain accurate estimates. However, it is critically challenging, or even impossible, to design good proposals for high dimensional complex target distributions, given the restriction of using simple proposals. Therefore, alternative methods that do not require to calculating proposal probabilities would greatly enhance the powerful of IS, yielding efficient solutions for difficulty problems.

On the other hand, MCMC approximates the target distribution with an (often complex) distribution simulated from a large number of steps of Markov transitions, and has been widely used to solve complex problems. However, MCMC has a long-standing difficulty accessing its convergence, and one may get absurdly wrong results when using non-convergent results (e.g., Morris et al., 1996). In addition, the computational cost of MCMC becomes critically expensive when the number of data instances is very large (a.k.a. the big data setting). A number of approximate versions of MCMC have been developed recently to deal with the big data issue (e.g., Welling and Teh, 2011; Alquier et al., 2016), but these methods usually no longer converge to the correct stationary distribution.

Motivated by combining the advantages of IS and MCMC, we study black-box importance sampling methods that can calculate importance weights for any given sample {xi}i=1n\{x_{i}\}_{i=1}^{n} generated from arbitrary, unknown black-box mechanisms. Such methods allow us to use highly complex proposals that closely match the target distribution, without worrying about the computational tractability of the typical importance weights.

Interestingly, the black-box methods, despite using no information of the proposal distribution, can actually give better estimation accuracy than the typical importance sampling that leverages the proposal information. This appears to be a paradox (using less information yet getting better results), but is consistent with the arguments of O’Hagan (1987) that “Monte Carlo (that uses the proposal information) is fundamentally unsound” for violating the Likelihood Principle, and the interesting results of Henmi et al. (2007); Delyon and Portier (2014) that certain types of approximate versions of IS weights reduce the variance over exact IS weights.

As an example of application, we apply black-box importance weights to samples simulated by a number of short Markov chains, in which MCMC helps provide a complex proposal that are “crudely” closely to the target distribution, and the black-box weights further refine the result. In this way, we obtain consistent estimators even from un-convergent MCMC results, or approximate MCMC transitions that appear commonly in big data settings.

Beyond MCMC, black-box IS can be used to refine many other approximation methods related to complex generation mechanisms, including variational inference with complex proposals (e.g., Rezende and Mohamed, 2015), bootstrapping (Efron, 2012) and perturb-and-MAP methods (Hazan et al., 2013; Papandreou and Yuille, 2011). Further, we envision our method can find more applications in many areas where importance sampling or variance reduction plays an importance role, such as probabilistic inference in graphical model (e.g., Liu et al., 2015), variance reduction for variational inference (e.g., Wang et al., 2013; Ranganath et al., 2014) and policy gradient (e.g., Greensmith et al., 2004), covariance shift in transfer learning (e.g., Sugiyama et al., 2008), off-policy reinforcement learning (e.g., Li et al., 2015), and stochastic optimization (e.g., Zhao and Zhang, 2015).

Our black-box importance weights are calculated by a convex quadratic optimization, based on minimizing a recently proposed kernelized Stein discrepancy that measures the goodness of a sample fit with an unnormalized distribution (Oates et al., 2017; Chwialkowski et al., 2016; Liu et al., 2016); this makes our method widely applicable for unnormalized distributions that widely appear in Bayesian statistics and machine learning.

Our method is closely related to Briol et al. (2015a, b); Oates et al. (2017), which combine Stein’s identity with Bayesian Monte Carlo (O’Hagan, 1991; Ghahramani and Rasmussen, 2002) and control variates, respectively, and can also be interpreted as a form of importance weights similar to our method. The key difference is that the weights in their method can be negative and are not normalized to sum to one, while our approach explicitly optimizes the weights in the probability simplex, which helps provide more stable practical results as we illustrate both theoretically and empirically in our work. We provide a more throughout discussion in Section 3.3.

An alternative approach for black-box weights is to directly approximate the underlying proposal distribution qq with an estimator q^\hat{q} and use the corresponding ratio p(x)/q^(x)p(x)/\hat{q}(x) as the importance weight. Henmi et al. (2007); Delyon and Portier (2014) showed that certain types of approximation q^\hat{q} can improve, rather than deteriorate, the performance compared with the weight with the exact qq. However, the method by Henmi et al. (2007) is not widely applicable since it requires to solve a maximum likelihood estimator in a parametric family that include the proposal distribution; Delyon and Portier (2014) uses a kernel density estimator for qq and tends to give unstable empirical results as we show in our experiments. Related to this, there is a literature in semi-supervised learning for covariance shifts (e.g., Sugiyama and Kawanabe, 2012) that estimates the density ratio p(x)/q(x)p(x)/q(x) given two samples {xi}p\{x_{i}\}\sim p and {yi}q\{y_{i}\}\sim q, when both pp and qq are unknown.

There are also other directions where the advantages of IS and MCMC can be combined, including adaptive importance sampling (e.g., Martino et al., ; Botev et al., 2013; Beaujean and Caldwell, 2013; Yuan et al., 2013, to only name a few), and sequential Monte Carlo (e.g., Smith et al., 2013; Robert and Casella, 2013; Neal, 2001). The black-box techniques can be combined with these methods to obtain more powerful, adaptive methods.

Background: Kernelized Stein Discrepancy

We give a brief introduction to Stein’s identity and kernelized Stein discrepancy (KSD) (Liu et al., 2016; Oates et al., 2017; Chwialkowski et al., 2016) which forms the foundation of our method.

which is in fact a direct rewrite of (1) using the product rule of derivatives. We call sp(x)xlogp(x){\boldsymbol{s}}_{p}(x)\coloneqq\nabla_{x}\log p(x) the score function of p(x)p(x). Note that calculating sp(x){\boldsymbol{s}}_{p}(x) does not depend on the normalization constant in p(x)p(x), that is, when p(x)=f(x)/Zp(x)=f(x)/Z where ZZ is the normalization constant and is often critical difficult to calculate, we have sp(x)=xlogf(x){\boldsymbol{s}}_{p}(x)=\nabla_{x}\log f(x), independent of ZZ. This property makes Stein’s identity a powerful practical tool for handling unnormalized distributions that widely appear in machine learning and statistics.

We can “kernelize” Stein’s identity with a smooth positive definite kernel k(x,x)k(x,x^{\prime}) for which k(x,x)k(x,x^{\prime}) is in the Stein class of p(x)p(x) for each fixed xXx^{\prime}\in\mathcal{X} (we say such k(x,x)k(x,x^{\prime}) is in the Stein class of pp in this case). By first applying (2) on k(x,x)k(x,x^{\prime}) with fixed xx^{\prime} and subsequently with fixed xx, we can get the following kernelized versions of Stein’s identity:

where x,xx,x^{\prime} are i.i.d. drawn from pp and kp(x,x){k_{p}}(x,x^{\prime}) is a new kernel function defined via

Stein Importance Weights

for general test function h(x)h(x). For this purpose, we define an empirical version of the KSD in (5) to measure the discrepancy between {xi,wi}i=1n\{x_{i},w_{i}\}_{i=1}^{n} and p(x)p(x),

where in addition to the normalization constraint iwi=1\sum_{i}w_{i}=1, we also restrict the weights to be non-negative; these two simple constraints have important practical implications as we discuss in the sequel. Note that the optimization in (6) is a convex quadratic programming that can be efficiently solved by off-the-self optimization tools. For example, both mirror descent and Frank Wolfe take O(n2/ϵ)O(n^{2}/\epsilon) to find the optimum with ϵ\epsilon-accuracy. Solving (6) does not require to know how the points {xi}i=1n\{x_{i}\}_{i=1}^{n} are generated, and hence gives a black-box importance sampling.

Theoretically, minimizing the empirical KSD can be justified by the following bound.

ii) Oates et al. (2017, Theorem 3) has a similar result which does not require iwi=1\sum_{i}w_{i}=1, but has a constant term larger than ChC_{h} when iwi=1\sum_{i}w_{i}=1 does hold. We propose to enforce iwi=1\sum_{i}w_{i}=1 because it gives exact estimation for constant functions h(x)=ch(x)=c, and is common practice for importance sampling (which is referred to as self-normalized importance sampling). In our empirical results, we find that the normalized weights can significantly stabilize the algorithm, especially for high dimensional models.

1 Practical Applications

Our method as summarized in Algorithm 1 can be used to refine any sample {xi}i=1n\{x_{i}\}_{i=1}^{n} generated with arbitrary black-box mechanisms, and allows us to apply importance sampling in cases that are otherwise difficult. As an example, we can generate {xi}i=1n\{x_{i}\}_{i=1}^{n} by simulating nn parallel MCMC chains for mm steps, where the length mm of the chains can be smaller than what is typically used in MCMC, because it just needs to be large enough to bring the distribution of {xi}i=1n\{x_{i}\}_{i=1}^{n} “roughly” close to the target distribution. This also makes it easy to parallelize the algorithm compared with running a single long chain. In practice, one may heuristically decide if mm is large enough by checking the variance of the estimated weights {wi}i=1n\{w_{i}\}_{i=1}^{n} (or the effective sample size). One can also simulate {xi}i=1n\{x_{i}\}_{i=1}^{n} using MCMC with approximate translation kernels as these required for massive datasets (e.g., Welling and Teh, 2011; Alquier et al., 2016), so our method provides a new solution for big data problems.

We should remark that when {xi}i=1n\{x_{i}\}_{i=1}^{n} is simulated from nn independent MCMC initialized from a distribution q0(x)q_{0}(x), the weight w0(x)=n1p(x)/q0(x)w_{0}(x)=n^{-1}p(x)/q_{0}(x) does provide a valid importance sampling weights in that iw0(xi)h(xi)\sum_{i}w_{0}(x_{i})h(x_{i}) gives an unbiased estimator (MacEachern et al., 1999, Theorem 6.1). However, this weight does not update as we run more MCMC steps, and performs poorly in practice.

2 Convergence Rate

As an obvious example, when {xi}i=1n\{x_{i}\}_{i=1}^{n} is i.i.d. drawn from an (unknown) proposal distribution q(x)q(x), the typical importance sampling weight wip(xi)/q(xi)w^{*}_{i}\propto p(x_{i})/q(x_{i}) (up to the normalization) can be used as a reference weight to establish an O(n1/2){\mathcal{O}}(n{}^{-1/2}) error rate as the typical Monte Carlo methods have.

Interestingly, it turns out the typical importance weight w(x)p(x)/q(x)w_{*}(x)\propto p(x)/q(x) is not the best possible reference weight; better options can be constructed using various variance reduction techniques to give convergence rates better than the typical O(n1/2){\mathcal{O}}(n^{-1/2}) rate.

Similar theoretical analysis can be found in Briol et al. (2015a); Bach (2015). In particular, Briol et al. (2015a) used a similar “reference weight” idea to establish a convergence rate for Bayesian Monte Carlo. The main technical challenge in our proof is to make sure that the reference weight satisfies the non-negative and self-normalization constraints. Section 2.2 in Appendix provides more detailed discussions.

3 Other “Super-Efficient” Weights

We review several other types of “supper-efficient” weights that also give better convergence rates than the typical O(n1/2){\mathcal{O}}(n^{-1/2}) rate; this includes Bayesian Monte Carlo and the related (linear) control variates method, as well as methods based on density approximation of the proposal distributions, which can be interpreted as multiplicative control variates (Nelson, 1987) that reduce the variance.

Bayesian Monte Carlo (O’Hagan, 1991; Ghahramani and Rasmussen, 2002) was originally developed to evaluate integrals using Bayesian inference procedure with Gaussian prior, which turns out to be equivalent to a weighted form iwih(xi)\sum_{i}w_{i}h(x_{i}) with wiw_{i} being a set of weights independent of the test function hh; unlike our method, these weights are not normalized to sum to one and can take negative values.

Theoretically, the convergence rate of control variates and Bayesian Monte Carlo can both be established to be O(n(1+α)/2){\mathcal{O}}(n^{-(1+\alpha)/2}), where α\alpha depends on how well h^(x)\hat{h}(x) can approximate h(x)h(x); see Oates et al. (2017, 2016); Briol et al. (2015a, b); Bach (2015) for detailed analysis.

This form is similar to our (6), but does not enforce the non-negative garrote constraint (Breiman, 1995) and replacing the normalization constraint iwi=1\sum_{i}w_{i}=1 with a quadratic regularization with regularization coefficient of one. Here the L2 penalty λw22\lambda||w||_{2}^{2} is necessary for ensuring numerical stability in practice. In our case, it is the non-negative constraint that helps stabilize the optimization problem, without needing to specify a regularization parameter.

Approximating the Proposal Distribution

Experiments

We empirically evaluate our method and compare it with the methods mentioned above, first on an illustrative toy example based on Gaussian mixture, and then on Bayesian probit regression. The methods we tested all have a form of iwih(xi)\sum_{i}w_{i}h(x_{i}), where the weights wiw_{i} are decided by one of the following algorithms:

1. Uniform weights wi=1/nw_{i}=1/n (Uniform).

2. Our method that solves (6) (referred as Stein), for which we use RBF kernel k(x,x)=exp(1hxx22)k(x,x^{\prime})=\exp(-\frac{1}{h}||x-x^{\prime}||^{2}_{2}); the bandwidth hh is heuristically chosen to be the median of the pairwise square distance of data {xi}i=1n\{x_{i}\}_{i=1}^{n} as suggested by Gretton et al. (2012).

3. The control functional method Control Func following the empirical guidance in Oates et al. (2017), which is also equivalent to Bayesian MC with kernel kp+(x,x)=kp(x,x)+1{k_{p}^{+}}(x,x^{\prime})={k_{p}}(x,x^{\prime})+1. Note that the weights {wi}\{w_{i}\} in this method may be negative and do not necessarily sum to one. We also test a modified version of it iwih(xi)/iwi\sum_{i}w_{i}h(x_{i})/\sum_{i}w_{i} that normalizes the weights and refer it as Control Func (Normalized). The kernel k(x,x)k(x,x^{\prime}) and the bandwidth are taken to be the same as our method. We follow Oates et al. (2017)’s guidance to select that an L2 regularization coefficient to stabilize the algorithm.

4. The kernel density estimator (KDE) based method by Delyon and Portier (2014) (KDE), which uses weights wi=n1p(xi)/q^i(xi)w_{i}=n^{-1}p(x_{i})/\hat{q}_{i}(x_{i}), where q^i(x)\hat{q}_{i}(x) is a leave-one-out KDE of form q^i(x)=jik(x,xj)/n\hat{q}_{i}(x)=\sum_{j\neq i}k(x,x_{j})/n. We report the result when using RBF kernel with bandwidth decided by the rule of thumb h=\hat{\sigma}\big{(}\frac{d2^{d+5}\Gamma(d/2+3)}{(2d+1)n}\big{)}^{1/(4+d)}, where σ^\hat{\sigma} is the standard deviation of {xi}i=1n\{x_{i}\}_{i=1}^{n} and dd is the dimension of xx. We also tested the choice of kernel and bandwidth suggested in Delyon and Portier (2014) but did not find consistent improvement. Similar to the case of the control functional method, we also test a self-normalized version of KDE and denote it by KDE (Normalized).

In Figure 2(a), we study the performance of the algorithms on distributions with different Gaussianity, where we replace p(x)p(x) with a series of distributions p(xλ)p(x\mid\lambda) whose random variable is (1λ)x+λξ(1-\lambda)x+\lambda\xi where xpx\sim p, ξN(0,1)\xi\sim\mathcal{N}(0,1) and λ\lambda\in controls the Gaussianity of p(xλ)p(x\mid\lambda): it reduces to p(x)p(x) when λ=0\lambda=0 and equals N(0,1)\mathcal{N}(0,1) when λ=1\lambda=1. We observe that Stein tends to perform the best when the distribution has high non-Gaussianity, but is suboptimal compared with Control Func when the distribution is close to Gaussian.

In Figure 2(b), we consider how the different algorithms scale to high dimensions by setting p(x)p(x) to be the standard Gaussian distribution with increasing dimensions. We generally find that our Stein tends to perform among the best under the different settings, expect for low dimensional standard Gaussian under which Control Func performs the best. The self-normalized versions of KDE and Control Func can help to stabilize the algorithm in various cases, for example, KDE (Normalized) significantly improves over KDE in all the cases, and Control Func (Normalized) is significantly better than Control Func in high dimensional cases as shown in Figure 2(b).

Figure 3 shows the performance of our method with the non-negativity constraint (wi0w_{i}\geq 0) replaced by (wib)(w_{i}\geq-b) where bb is a positive number that takes different values. We find that the result of wi0w_{i}\geq 0 generally performs the best when nn is small (e.g., n<1000n<1000), but is slightly suboptimal when nn is large. Because the stability in the small nn case is more practically important than the large nn case, given that the absolute difference on MSE would be negligible in the large nn region, we think enforcing wi0w_{i}\geq 0 is a simple and good practical procedure.

Bayesian Probit Model

where Φ()\Phi(\cdot) is the cumulative distribution function of the standard normal distribution, and p0(x)=N(x;0,0.1)p_{0}(x)=\mathcal{N}(x;0,0.1) is the prior.

To test our method, we simulate {xi}i=1n\{x_{i}\}_{i=1}^{n} by running nn parallel chains of stochastic Langevin dynamics (Welling and Teh, 2011). Since this method is an inexact MCMC, its stationary distribution should be different from the target distribution p(x)p(x). As a result, directly averaging {xi}i=1n\{x_{i}\}_{i=1}^{n} with uniform weights (Unif) can give relatively poor results with convergence rate slower than the typical \O(n1/2)\O(n^{-1/2}) rate (see e.g., Teh et al. (2016)). The black-box weights can be used refine the result.

Figure 4 shows the result on a small simulated dataset with 100100 data instances and 1010 features. We can find that Stein and Control Func (Normalized) significantly improve the performance over Unif. Interestingly, we find that the unnormalized Control Func, as well as KDE and KDE (normalized) (not show in the figure) perform significantly worse in this case.

Figure 5 shows the result on the Forest Covtype dataset from the UCI machine learning repository (Bache and Lichman, 2013); it has 54 features, and is reprocessed to get binary labels following Collobert et al. (2002). For our experiment, we take the first 10,000 data points, so that it is feasible to evaluate the ground truth with No-U-Turn Sampler (NUTS) (Hoffman and Gelman, 2014). We again find that Stein and Control Func (Normalized) improves over the uniform weights, and the unnormalized Control Func and KDE and KDE (normalized) again perform significantly worse and are not shown in the figure.

Conclusion

We propose a black-box importance sampling method that calculates importance weights without knowing the proposal distribution, which also has the additional benefit of providing variance reduction. We expect our method provides a powerful tool for solving many difficult problems were previously intractable via importance sampling.

References

Appendix A Kernelized Stein Discrepancy

Appendix B Convergence Rate

Section B.1 proves the O(n1/2){\mathcal{O}}(n^{-1/2}) rate using the typical importance sampling weights as the reference weight. Section B.2 proves a better \mathchoice{{\scriptstyle\mathcal{O}}}{{\scriptstyle\mathcal{O}}}{{\scriptscriptstyle\mathcal{O}}}{\scalebox{0.7}{\scriptscriptstyle\mathcal{O}}}{\left(n^{-1/2}\right)} rate by using a reference weight based on a control variates method constructed with an orthogonal basis estimator.

We use the typical importance sampling weight as a reference weight and establish O(n1/2){\mathcal{O}}(n^{-1/2}) rate on the error of our estimator.

Assume {xi}i=1n\{x_{i}\}_{i=1}^{n} is i.i.d. drawn from q(x)q(x)

Define vi(xi)=1np(xi)/q(xi)v_{i}^{*}(x_{i})=\frac{1}{n}p(x_{i})/q(x_{i}), and

In addition, note that i=1nvi=1+O(n1/2)\sum_{i=1}^{n}v_{i}^{*}=1+{\mathcal{O}}(n^{-1/2}), we have

Assume {xi}\{x_{i}\} is i.i.d. drawn from q(x)q(x), and {w^i(x)}\{\hat{w}_{i}(\boldsymbol{x})\} is given by (6), then under Assumption B.1, we have

and combining with Proposition 3.1 gives the result. ∎

We first re-state the assumptions made in Theorem 3.3.

1. Assume kp(x,x){k_{p}}(x,x^{\prime}) has the following eigen-decomposition

The following is an expended version of Theorem 3.3.

Assume {xi}i=1n\{x_{i}\}_{i=1}^{n} is i.i.d. drawn from q(x)q(x), and w^i\hat{w}_{i} is calculated by

To see how Theorem B.5 implies Theorem 3.3, we just need to observe that we obviously have γ(n)2M3bn,\gamma(n)\geq 2M_{3}\frac{b}{n}, and \gamma(n)=\mathchoice{{\scriptstyle\mathcal{O}}}{{\scriptstyle\mathcal{O}}}{{\scriptscriptstyle\mathcal{O}}}{\scalebox{0.7}{\scriptscriptstyle\mathcal{O}}}{\left(1\right)} by taking L=n1/4L=n^{1/4}.

Based on Proposition 3.1, to prove Theorem B.5 we just need to show that for any x={xi}i=1n\boldsymbol{x}=\{x_{i}\}_{i=1}^{n}, there exists a set of positive and normalized weights {wi+(x)}\{w_{i}^{+}(\boldsymbol{x})\}, as a function of x\boldsymbol{x}, such that

Step 1: Construct a control variate estimator based on the orthogonal eigenfunction basis, and obtain the corresponding weights {wi(x)}\{w_{i}(\boldsymbol{x})\}.

Step 3. Construct a set of positive and normalized weights by wi+(x)=max(0,wi(x))imax(0,wi(x))w_{i}^{+}(\boldsymbol{x})=\frac{\max(0,w_{i}(\boldsymbol{x}))}{\sum_{i}\max(0,w_{i}(\boldsymbol{x}))}, and establish the corresponding bound.

Combine the bound in Lemma B.7 and Lemma B.9 below. ∎

We note that the idea of using reference weights was used in Briol et al. (2015a) to establish the convergence rate of Bayesian Monte Carlo. Related results is also presented in Bach (2015). The main additional challenge in our case is to meet the non-negative and normalization constraint (Step 3); this is achieved by showing that the {wi(x)}\{w_{i}(\boldsymbol{x})\} constructed in Step 2 is non-negative with high probability, and their sum approaches to one when nn is large, and hence {wi+(x)}\{w_{i}^{+}(\boldsymbol{x})\} is not significantly different from {wi(x)}\{w_{i}(\boldsymbol{x})\}.

Note that if we discard the non-negative and normalization constraint (Step 3), the error bound would be O(γ0(n)n1){\mathcal{O}}(\gamma_{0}(n)n^{-1}), where

Step 1: Constructing the weights

We now construct an orthogonal series estimator h^(x)\hat{h}(x) for h(x)h(x) based on xD0\boldsymbol{x}_{\mathcal{D}_{0}},

We also truncate at the LLth basis functions to keep h^D0(x)\hat{h}_{\mathcal{D}_{0}}(x) a smooth function, as what is typically done in orthogonal basis estimators. We will discuss the choice of LL later. Based on this we define a control variates estimator:

Given Z^[h]\hat{Z}[h] defined as above, for any hHph\in{\mathcal{H}_{p}}, we have

We can derive the same result for Z^1[h]\hat{Z}_{1}[h] and averaging them would gives the result. ∎

Under Assumption B.4, for the weights {wi(x)}\{w_{i}(\boldsymbol{x})\} defined in Lemma B.6, we have

We can derive the same result for Z^1[h]\hat{Z}_{1}[h] and hence

Step 3: Meeting the Non-negative and Normalization Constraint

The weights defined in (B.6) is not normalized to sum to one, and may also have negative values. To complete the proof, we define a set of new weights,

For the weights {wi(x)}\{w_{i}(\boldsymbol{x})\} defined in Lemma B.6, under Assumption B.4, we have

i). When x={xi}i=1nq\boldsymbol{x}=\{x_{i}\}_{i=1}^{n}\sim q, we have

We just need to prove (10) for iD0i\in\mathcal{D}_{0}. Note that

ii). Note that S=defiwi(x)=S1+S2S\overset{def}{=}\sum_{i}w_{i}(\boldsymbol{x})=S_{1}+S_{2},

where the bound for S2S_{2} uses the Hoffeding’s inequality for two-sample U statistics (Hoeffding, 1963, Section 5b). We take t1=2t/(LM2M3+2)t_{1}=\sqrt{2t}/(LM_{2}M_{3}+\sqrt{2}), we have

where Ms=M32(M2M3+2)2/4M_{s}=M_{3}^{2}(M_{2}M_{3}+\sqrt{2})^{2}/4 (we assume L1L\geq 1).

Define En\mathcal{E}_{n} to be the event that all wi>0w_{i}>0 and iwi1/2\sum_{i}w_{i}\geq 1/2, that is, En={iwi1/2,  wi0, i[n]}\mathcal{E}_{n}=\{\sum_{i}w_{i}\geq 1/2,~{}~{}w_{i}\geq 0,~{}\forall i\in[n]\}. We have from LemmaB.8 that

Note that under event En\mathcal{E}_{n}, we have w=w+\boldsymbol{w}=\boldsymbol{w}^{+}. Therefore,

Appendix C Additional Empirical Results

Here we show in Figure 6 an additional empirical result when p(x)p(x) is a Gaussian mixture model shown in Figure 6(a) and {xi}i=1n\{x_{i}\}_{i=1}^{n} is generated by running nn independent chains of MALA for 1010 steps.