Black-box Importance Sampling
Qiang Liu, Jason D. Lee
Introduction
Efficient Monte Carlo methods are workhorses for modern Bayesian statistics and machine learning. Importance sampling (IS) and Markov chain Monte Carlo (MCMC) are two fundamental tools widely used when it is intractable to draw exact samples from the underlying distribution . IS uses an simple proposal distribution to draw a sample , and attaches it with a set of importance weights that are proportional to the probability ratio . MCMC methods, on the other hand, rely on simulating Markov chains whose equilibrium distribution matches the target distribution.
Unfortunately, both importance sampling (IS) and MCMC have their own critical weaknesses. IS heavily relies on a good proposal that closely matches the target distribution to obtain accurate estimates. However, it is critically challenging, or even impossible, to design good proposals for high dimensional complex target distributions, given the restriction of using simple proposals. Therefore, alternative methods that do not require to calculating proposal probabilities would greatly enhance the powerful of IS, yielding efficient solutions for difficulty problems.
On the other hand, MCMC approximates the target distribution with an (often complex) distribution simulated from a large number of steps of Markov transitions, and has been widely used to solve complex problems. However, MCMC has a long-standing difficulty accessing its convergence, and one may get absurdly wrong results when using non-convergent results (e.g., Morris et al., 1996). In addition, the computational cost of MCMC becomes critically expensive when the number of data instances is very large (a.k.a. the big data setting). A number of approximate versions of MCMC have been developed recently to deal with the big data issue (e.g., Welling and Teh, 2011; Alquier et al., 2016), but these methods usually no longer converge to the correct stationary distribution.
Motivated by combining the advantages of IS and MCMC, we study black-box importance sampling methods that can calculate importance weights for any given sample generated from arbitrary, unknown black-box mechanisms. Such methods allow us to use highly complex proposals that closely match the target distribution, without worrying about the computational tractability of the typical importance weights.
Interestingly, the black-box methods, despite using no information of the proposal distribution, can actually give better estimation accuracy than the typical importance sampling that leverages the proposal information. This appears to be a paradox (using less information yet getting better results), but is consistent with the arguments of O’Hagan (1987) that “Monte Carlo (that uses the proposal information) is fundamentally unsound” for violating the Likelihood Principle, and the interesting results of Henmi et al. (2007); Delyon and Portier (2014) that certain types of approximate versions of IS weights reduce the variance over exact IS weights.
As an example of application, we apply black-box importance weights to samples simulated by a number of short Markov chains, in which MCMC helps provide a complex proposal that are “crudely” closely to the target distribution, and the black-box weights further refine the result. In this way, we obtain consistent estimators even from un-convergent MCMC results, or approximate MCMC transitions that appear commonly in big data settings.
Beyond MCMC, black-box IS can be used to refine many other approximation methods related to complex generation mechanisms, including variational inference with complex proposals (e.g., Rezende and Mohamed, 2015), bootstrapping (Efron, 2012) and perturb-and-MAP methods (Hazan et al., 2013; Papandreou and Yuille, 2011). Further, we envision our method can find more applications in many areas where importance sampling or variance reduction plays an importance role, such as probabilistic inference in graphical model (e.g., Liu et al., 2015), variance reduction for variational inference (e.g., Wang et al., 2013; Ranganath et al., 2014) and policy gradient (e.g., Greensmith et al., 2004), covariance shift in transfer learning (e.g., Sugiyama et al., 2008), off-policy reinforcement learning (e.g., Li et al., 2015), and stochastic optimization (e.g., Zhao and Zhang, 2015).
Our black-box importance weights are calculated by a convex quadratic optimization, based on minimizing a recently proposed kernelized Stein discrepancy that measures the goodness of a sample fit with an unnormalized distribution (Oates et al., 2017; Chwialkowski et al., 2016; Liu et al., 2016); this makes our method widely applicable for unnormalized distributions that widely appear in Bayesian statistics and machine learning.
Our method is closely related to Briol et al. (2015a, b); Oates et al. (2017), which combine Stein’s identity with Bayesian Monte Carlo (O’Hagan, 1991; Ghahramani and Rasmussen, 2002) and control variates, respectively, and can also be interpreted as a form of importance weights similar to our method. The key difference is that the weights in their method can be negative and are not normalized to sum to one, while our approach explicitly optimizes the weights in the probability simplex, which helps provide more stable practical results as we illustrate both theoretically and empirically in our work. We provide a more throughout discussion in Section 3.3.
An alternative approach for black-box weights is to directly approximate the underlying proposal distribution with an estimator and use the corresponding ratio as the importance weight. Henmi et al. (2007); Delyon and Portier (2014) showed that certain types of approximation can improve, rather than deteriorate, the performance compared with the weight with the exact . However, the method by Henmi et al. (2007) is not widely applicable since it requires to solve a maximum likelihood estimator in a parametric family that include the proposal distribution; Delyon and Portier (2014) uses a kernel density estimator for and tends to give unstable empirical results as we show in our experiments. Related to this, there is a literature in semi-supervised learning for covariance shifts (e.g., Sugiyama and Kawanabe, 2012) that estimates the density ratio given two samples and , when both and are unknown.
There are also other directions where the advantages of IS and MCMC can be combined, including adaptive importance sampling (e.g., Martino et al., ; Botev et al., 2013; Beaujean and Caldwell, 2013; Yuan et al., 2013, to only name a few), and sequential Monte Carlo (e.g., Smith et al., 2013; Robert and Casella, 2013; Neal, 2001). The black-box techniques can be combined with these methods to obtain more powerful, adaptive methods.
Background: Kernelized Stein Discrepancy
We give a brief introduction to Stein’s identity and kernelized Stein discrepancy (KSD) (Liu et al., 2016; Oates et al., 2017; Chwialkowski et al., 2016) which forms the foundation of our method.
which is in fact a direct rewrite of (1) using the product rule of derivatives. We call the score function of . Note that calculating does not depend on the normalization constant in , that is, when where is the normalization constant and is often critical difficult to calculate, we have , independent of . This property makes Stein’s identity a powerful practical tool for handling unnormalized distributions that widely appear in machine learning and statistics.
We can “kernelize” Stein’s identity with a smooth positive definite kernel for which is in the Stein class of for each fixed (we say such is in the Stein class of in this case). By first applying (2) on with fixed and subsequently with fixed , we can get the following kernelized versions of Stein’s identity:
where are i.i.d. drawn from and is a new kernel function defined via
Stein Importance Weights
for general test function . For this purpose, we define an empirical version of the KSD in (5) to measure the discrepancy between and ,
where in addition to the normalization constraint , we also restrict the weights to be non-negative; these two simple constraints have important practical implications as we discuss in the sequel. Note that the optimization in (6) is a convex quadratic programming that can be efficiently solved by off-the-self optimization tools. For example, both mirror descent and Frank Wolfe take to find the optimum with -accuracy. Solving (6) does not require to know how the points are generated, and hence gives a black-box importance sampling.
Theoretically, minimizing the empirical KSD can be justified by the following bound.
ii) Oates et al. (2017, Theorem 3) has a similar result which does not require , but has a constant term larger than when does hold. We propose to enforce because it gives exact estimation for constant functions , and is common practice for importance sampling (which is referred to as self-normalized importance sampling). In our empirical results, we find that the normalized weights can significantly stabilize the algorithm, especially for high dimensional models.
1 Practical Applications
Our method as summarized in Algorithm 1 can be used to refine any sample generated with arbitrary black-box mechanisms, and allows us to apply importance sampling in cases that are otherwise difficult. As an example, we can generate by simulating parallel MCMC chains for steps, where the length of the chains can be smaller than what is typically used in MCMC, because it just needs to be large enough to bring the distribution of “roughly” close to the target distribution. This also makes it easy to parallelize the algorithm compared with running a single long chain. In practice, one may heuristically decide if is large enough by checking the variance of the estimated weights (or the effective sample size). One can also simulate using MCMC with approximate translation kernels as these required for massive datasets (e.g., Welling and Teh, 2011; Alquier et al., 2016), so our method provides a new solution for big data problems.
We should remark that when is simulated from independent MCMC initialized from a distribution , the weight does provide a valid importance sampling weights in that gives an unbiased estimator (MacEachern et al., 1999, Theorem 6.1). However, this weight does not update as we run more MCMC steps, and performs poorly in practice.
2 Convergence Rate
As an obvious example, when is i.i.d. drawn from an (unknown) proposal distribution , the typical importance sampling weight (up to the normalization) can be used as a reference weight to establish an error rate as the typical Monte Carlo methods have.
Interestingly, it turns out the typical importance weight is not the best possible reference weight; better options can be constructed using various variance reduction techniques to give convergence rates better than the typical rate.
Similar theoretical analysis can be found in Briol et al. (2015a); Bach (2015). In particular, Briol et al. (2015a) used a similar “reference weight” idea to establish a convergence rate for Bayesian Monte Carlo. The main technical challenge in our proof is to make sure that the reference weight satisfies the non-negative and self-normalization constraints. Section 2.2 in Appendix provides more detailed discussions.
3 Other “Super-Efficient” Weights
We review several other types of “supper-efficient” weights that also give better convergence rates than the typical rate; this includes Bayesian Monte Carlo and the related (linear) control variates method, as well as methods based on density approximation of the proposal distributions, which can be interpreted as multiplicative control variates (Nelson, 1987) that reduce the variance.
Bayesian Monte Carlo (O’Hagan, 1991; Ghahramani and Rasmussen, 2002) was originally developed to evaluate integrals using Bayesian inference procedure with Gaussian prior, which turns out to be equivalent to a weighted form with being a set of weights independent of the test function ; unlike our method, these weights are not normalized to sum to one and can take negative values.
Theoretically, the convergence rate of control variates and Bayesian Monte Carlo can both be established to be , where depends on how well can approximate ; see Oates et al. (2017, 2016); Briol et al. (2015a, b); Bach (2015) for detailed analysis.
This form is similar to our (6), but does not enforce the non-negative garrote constraint (Breiman, 1995) and replacing the normalization constraint with a quadratic regularization with regularization coefficient of one. Here the L2 penalty is necessary for ensuring numerical stability in practice. In our case, it is the non-negative constraint that helps stabilize the optimization problem, without needing to specify a regularization parameter.
Approximating the Proposal Distribution
Experiments
We empirically evaluate our method and compare it with the methods mentioned above, first on an illustrative toy example based on Gaussian mixture, and then on Bayesian probit regression. The methods we tested all have a form of , where the weights are decided by one of the following algorithms:
1. Uniform weights (Uniform).
2. Our method that solves (6) (referred as Stein), for which we use RBF kernel ; the bandwidth is heuristically chosen to be the median of the pairwise square distance of data as suggested by Gretton et al. (2012).
3. The control functional method Control Func following the empirical guidance in Oates et al. (2017), which is also equivalent to Bayesian MC with kernel . Note that the weights in this method may be negative and do not necessarily sum to one. We also test a modified version of it that normalizes the weights and refer it as Control Func (Normalized). The kernel and the bandwidth are taken to be the same as our method. We follow Oates et al. (2017)’s guidance to select that an L2 regularization coefficient to stabilize the algorithm.
4. The kernel density estimator (KDE) based method by Delyon and Portier (2014) (KDE), which uses weights , where is a leave-one-out KDE of form . We report the result when using RBF kernel with bandwidth decided by the rule of thumb h=\hat{\sigma}\big{(}\frac{d2^{d+5}\Gamma(d/2+3)}{(2d+1)n}\big{)}^{1/(4+d)}, where is the standard deviation of and is the dimension of . We also tested the choice of kernel and bandwidth suggested in Delyon and Portier (2014) but did not find consistent improvement. Similar to the case of the control functional method, we also test a self-normalized version of KDE and denote it by KDE (Normalized).
In Figure 2(a), we study the performance of the algorithms on distributions with different Gaussianity, where we replace with a series of distributions whose random variable is where , and controls the Gaussianity of : it reduces to when and equals when . We observe that Stein tends to perform the best when the distribution has high non-Gaussianity, but is suboptimal compared with Control Func when the distribution is close to Gaussian.
In Figure 2(b), we consider how the different algorithms scale to high dimensions by setting to be the standard Gaussian distribution with increasing dimensions. We generally find that our Stein tends to perform among the best under the different settings, expect for low dimensional standard Gaussian under which Control Func performs the best. The self-normalized versions of KDE and Control Func can help to stabilize the algorithm in various cases, for example, KDE (Normalized) significantly improves over KDE in all the cases, and Control Func (Normalized) is significantly better than Control Func in high dimensional cases as shown in Figure 2(b).
Figure 3 shows the performance of our method with the non-negativity constraint () replaced by where is a positive number that takes different values. We find that the result of generally performs the best when is small (e.g., ), but is slightly suboptimal when is large. Because the stability in the small case is more practically important than the large case, given that the absolute difference on MSE would be negligible in the large region, we think enforcing is a simple and good practical procedure.
Bayesian Probit Model
where is the cumulative distribution function of the standard normal distribution, and is the prior.
To test our method, we simulate by running parallel chains of stochastic Langevin dynamics (Welling and Teh, 2011). Since this method is an inexact MCMC, its stationary distribution should be different from the target distribution . As a result, directly averaging with uniform weights (Unif) can give relatively poor results with convergence rate slower than the typical rate (see e.g., Teh et al. (2016)). The black-box weights can be used refine the result.
Figure 4 shows the result on a small simulated dataset with data instances and features. We can find that Stein and Control Func (Normalized) significantly improve the performance over Unif. Interestingly, we find that the unnormalized Control Func, as well as KDE and KDE (normalized) (not show in the figure) perform significantly worse in this case.
Figure 5 shows the result on the Forest Covtype dataset from the UCI machine learning repository (Bache and Lichman, 2013); it has 54 features, and is reprocessed to get binary labels following Collobert et al. (2002). For our experiment, we take the first 10,000 data points, so that it is feasible to evaluate the ground truth with No-U-Turn Sampler (NUTS) (Hoffman and Gelman, 2014). We again find that Stein and Control Func (Normalized) improves over the uniform weights, and the unnormalized Control Func and KDE and KDE (normalized) again perform significantly worse and are not shown in the figure.
Conclusion
We propose a black-box importance sampling method that calculates importance weights without knowing the proposal distribution, which also has the additional benefit of providing variance reduction. We expect our method provides a powerful tool for solving many difficult problems were previously intractable via importance sampling.
References
Appendix A Kernelized Stein Discrepancy
Appendix B Convergence Rate
Section B.1 proves the rate using the typical importance sampling weights as the reference weight. Section B.2 proves a better \mathchoice{{\scriptstyle\mathcal{O}}}{{\scriptstyle\mathcal{O}}}{{\scriptscriptstyle\mathcal{O}}}{\scalebox{0.7}{\scriptscriptstyle\mathcal{O}}}{\left(n^{-1/2}\right)} rate by using a reference weight based on a control variates method constructed with an orthogonal basis estimator.
We use the typical importance sampling weight as a reference weight and establish rate on the error of our estimator.
Assume is i.i.d. drawn from
Define , and
In addition, note that , we have
Assume is i.i.d. drawn from , and is given by (6), then under Assumption B.1, we have
and combining with Proposition 3.1 gives the result. ∎
We first re-state the assumptions made in Theorem 3.3.
1. Assume has the following eigen-decomposition
The following is an expended version of Theorem 3.3.
Assume is i.i.d. drawn from , and is calculated by
To see how Theorem B.5 implies Theorem 3.3, we just need to observe that we obviously have and \gamma(n)=\mathchoice{{\scriptstyle\mathcal{O}}}{{\scriptstyle\mathcal{O}}}{{\scriptscriptstyle\mathcal{O}}}{\scalebox{0.7}{\scriptscriptstyle\mathcal{O}}}{\left(1\right)} by taking .
Based on Proposition 3.1, to prove Theorem B.5 we just need to show that for any , there exists a set of positive and normalized weights , as a function of , such that
Step 1: Construct a control variate estimator based on the orthogonal eigenfunction basis, and obtain the corresponding weights .
Step 3. Construct a set of positive and normalized weights by , and establish the corresponding bound.
Combine the bound in Lemma B.7 and Lemma B.9 below. ∎
We note that the idea of using reference weights was used in Briol et al. (2015a) to establish the convergence rate of Bayesian Monte Carlo. Related results is also presented in Bach (2015). The main additional challenge in our case is to meet the non-negative and normalization constraint (Step 3); this is achieved by showing that the constructed in Step 2 is non-negative with high probability, and their sum approaches to one when is large, and hence is not significantly different from .
Note that if we discard the non-negative and normalization constraint (Step 3), the error bound would be , where
Step 1: Constructing the weights
We now construct an orthogonal series estimator for based on ,
We also truncate at the th basis functions to keep a smooth function, as what is typically done in orthogonal basis estimators. We will discuss the choice of later. Based on this we define a control variates estimator:
Given defined as above, for any , we have
We can derive the same result for and averaging them would gives the result. ∎
Under Assumption B.4, for the weights defined in Lemma B.6, we have
We can derive the same result for and hence
Step 3: Meeting the Non-negative and Normalization Constraint
The weights defined in (B.6) is not normalized to sum to one, and may also have negative values. To complete the proof, we define a set of new weights,
For the weights defined in Lemma B.6, under Assumption B.4, we have
i). When , we have
We just need to prove (10) for . Note that
ii). Note that ,
where the bound for uses the Hoffeding’s inequality for two-sample U statistics (Hoeffding, 1963, Section 5b). We take , we have
where (we assume ).
Define to be the event that all and , that is, . We have from LemmaB.8 that
Note that under event , we have . Therefore,
Appendix C Additional Empirical Results
Here we show in Figure 6 an additional empirical result when is a Gaussian mixture model shown in Figure 6(a) and is generated by running independent chains of MALA for steps.