Provable Bayesian Inference via Particle Mirror Descent

Bo Dai, Niao He, Hanjun Dai, Le Song

Introduction

Two longstanding challenges in approximate Bayesian inference are i) provable convergence and ii) data-intensive computation at each iteration. MCMC is a general algorithm known to generate samples from distribution that converges to the true posterior. However, in order to generate a single sample at every iteration, it requires a complete scan of the dataset and evaluation of the likelihood at each data point, which is computationally expensive. To address this issue, approximate sampling algorithms have been proposed which use only a small batch of data points at each iteration [e.g. 8, 3, 42, 26]. Chopin , Balakrishnan and Madigan extend the sequential Monte Carlo (SMC) to Bayesian inference on static models. However, these algorithms rely on Gaussian distribution or kernel density estimator as transition kernel for efficiency, which breaks down the convergence guarantee of SMC. On the other hand, the stochastic Langevin dynamics algorithm (SGLD) and its derivatives combine ideas from stochastic optimization and Hamiltonian Monte Carlo, and are proven to converge in terms of integral approximation, as recently shown in . Still, it is unclear whether the dependent samples generated reflects convergence to the true posterior. FireflyMC , introduces auxiliary variables to switch on and off data points to save computation for likelihood evaluations, but this algorithm requires the knowledge of lower bounds of likelihood that is model-specific and may be hard to calculate.

In another line of research, the variational inference algorithms attempt to approximate the entire posterior density by optimizing information divergence . The recent derivatives avoid examination of all the data in each update. However, the major issue for these algorithms is the absence of theoretical guarantees. This is due largely to the fact that variational inference algorithms typically choose a parametric family to approximate the posterior density, which can be far from the true posterior, and require to solve a highly non-convex optimization problem. In most cases, these algorithms optimize over simple exponential family for tractability. More flexible variational families have been explored but largely restricted to mixture models . In these cases, it is often difficult to quantify the approximation and optimization error at each iteration, and analyze how the error accumulates across the iterations. Therefore, a provably convergent variational inference algorithm is still needed.

In this paper, we present such a simple and provable nonparametric inference algorithm, Particle Mirror Descent (PMD), to iteratively approximate the posterior density. PMD relies on the connection that Bayes’ rule can be expressed as the solution to a convex optimization problem over the density space . However, directly solving the optimization will lead to both computational and representational issues: one scan over the entire dataset at each iteration is needed, and the exact function update has no closed-form. To address these issues, we draw inspiration from two sources: (i) stochastic mirror descent, where one can instead descend in the density space using a small batch of data points at each iteration; and (ii) particle filtering and kernel density estimation, where one can maintain a tractable approximate representation of the density using samples. In summary, PMD possesses a number of desiderata:

Simplicity. PMD applies to many probabilistic models, even with non-conjugate priors. The algorithm is summarized in just a few lines of codes, and only requires the value of likelihood and prior, unlike other approximate inference techniques [42, 15, 33, 19, e.g.], which typically require their first and/or second-order derivatives.

Flexibility. Different from other variational inference algorithms, which sacrifice the model flexibility for tractability, our method approximates the posterior by particles or kernel density estimator. The flexibility of nonparametric model enables PMD to capture multi-modal in posterior.

Stochasticity. At iteration $t$ , PMD only visits a mini-batch of data to compute the stochastic functional gradient, and samples $O(t)$ points from the solution. Hence, it avoids scanning over the whole dataset in each update.

Theoretical guarantees. We show the density estimator provided by PMD converges in terms of both integral approximation and $KL$ -divergence to the true posterior density in rate $O(1/\sqrt{m})$ with $m$ particles. To our best knowledge, these results are the first of the kind in Bayesian inference for estimating posterior.

In the remainder, we will introduce the optimization view of Bayes’ rule before presenting our algorithm, and then we provide both theoretical and empirical supports of PMD.

Throughout this paper, we denote $KL$ as the Kullback-Leibler divergence, function $q(\theta)$ as $q$ , a random sequence as $\theta_{[t]}:=[\theta_{1},\ldots,\theta_{t}]$ , integral $f(\cdot)$ w.r.t. some measure $\mu(\theta)$ over support $\Omega$ as $\int f(\theta)\mu(d\theta)$ , or $\int f(\theta)d\theta$ without ambiguity, $\langle\cdot,\cdot\rangle_{L_{2}}$ as the $L_{2}$ inner product, and $\|\cdot\|_{p}$ as the $L_{p}$ norm for $1\leqslant p\leqslant\infty$ .

Optimization View of Bayesian Inference

Our algorithm stems from the connection between Bayes’ rule and optimization. Williams , Zellner , Zhu et al. showed that Bayes’ rule

where $p({X})=\int p(\theta)\prod_{n=1}^{N}p({x}_{n}|\theta)d\theta$ , can be obtained by solving the optimization problem

where $\mathcal{P}$ is the valid density space. The objective, $L(q)$ , is continuously differentiable with respect to $q\in\mathcal{P}$ and one can further show that

Objective function $L(q)$ defined on $q(\theta)\in\mathcal{P}$ is $1$ -strongly convex w.r.t. $KL$ -divergence.

Despite of the closed-form representation of the optimal solution, it can be challenging to compactly represent, tractably compute, or efficiently sample from the solution. The normalization, $p({X})=\int p(\theta)\prod_{n=1}^{N}p({x}_{n}|\theta)d\theta$ , involves high dimensional integral and typically does not admit tractable closed-form computation. Meanwhile, the product in the numerator could be arbitrarily complicated, making it difficult to represent and sample from. However, this optimization perspective provides us a way to tackle these challenges by leveraging recent advances from optimization algorithms.

We will resort to stochastic optimization to avoid scanning the entire dataset for each gradient evaluation. The stochastic mirror descent expands the usual stochastic gradient descent scheme to problems with non-Euclidean geometries, by applying unbiased stochastic subgradients and Bregman distances as prox-map functions. We now explain in details, the stochastic mirror descent algorithm in the context of Bayesian inference.

At $t$ -th iteration, given a data point $x_{t}$ drawn randomly from the dataset, the stochastic functional gradient of $L(q)$ with respect to $q(\theta)\in L_{2}$ is $g_{t}(\theta)=\log(q(\theta))-\log(p(\theta))-N\log p(x_{t}|\theta)$ . The stochastic mirror descent iterates over the prox-mapping step $q_{t+1}=\mathbf{P}_{q_{t}}(\gamma_{t}g_{t})$ , where $\gamma_{t}>0$ is the stepsize and

Since the domain is density space, $KL$ -divergence is a natural choice for the prox-function. The prox-mapping therefore admits the closed-form

where $Z:=\int q_{t}(\theta)\exp(-\gamma_{t}g_{t}(\theta))\,d\theta$ is the normalization. This update is similar the Bayes’ rule. However, an important difference here is that the posterior is updated using the fractional power of the previous solution, the prior and the likelihood. Still computing $q_{t+1}(\theta)$ can be intractable due to the normalization $Z$ .

2 Error Tolerant Stochastic Mirror Descent

To handle the intractable integral normalization at each prox-mapping step, we will consider a modified version of the stochastic mirror descent algorithm which can tolerate additional error in the prox-mapping step. Given $\epsilon\geqslant 0$ and $g\in L_{2}$ , we define the $\epsilon$ -prox-mapping of $q$ as the set

Remark 2: For simplicity, we present the algorithm with stochastic gradient estimated by a single data point. The mini-batch trick is also applicable to reduce the variance of stochastic gradient, and convergence remains the same order but with an improved constant.

Allowing error in each step gives us room to design more flexible algorithms. Essentially, this implies that we can approximate the intermediate density by some tractable representation. As long as the approximation error is not too large, the algorithm will still converge; and if the approximation does not involve costly computation, the overall algorithm will still be efficient.

Particle Mirror Descent Algorithm

We introduce two efficient strategies to approximate prox-mappings, one based on weighted particles and the other based on weighted kernel density estimator. The first strategy is designed for the situation when the prior is a “good” guess of the true posterior, while the second strategy works for general situations. Interestingly, these two methods resemble particle reweighting and rejuvenation respectively in sequential Monte Carlo yet with notable differences.

We first consider the situation when we are given a “good” prior, such that $p(\theta)$ has the same support as the true posterior $q^{*}(\theta)$ , i.e., $0\leqslant q^{*}(\theta)/p(\theta)\leqslant C$ . We will simply maintain a set of samples (or particles) from $p(\theta)$ , and utlize them to estimate the intermediate prox-mappings. Let $\left\{\theta_{i}\right\}_{i=1}^{m}\sim p(\theta)$ be a set of fixed i.i.d. samples. We approximate $q_{t+1}(\theta)$ as a set of weighted particles

The update is derived from the closed-form solution to the exact prox-mapping step (2.1). Since the normalization is a constant common to all components, one can simply update the set of working variable $\alpha_{i}$ as

We show that the one step approximation (4) incurs a dimension-independent error when estimating the integration of a function.

Please refer to the Appendix C for details. When the model has several latent variables $\theta=(\xi,\zeta)$ and some parts of the variables have closed-form update in (2.1). e.g., sparse GPs and LDA (refer to Appendix F), we could incorporate such structure information into algorithm by decomposing the posterior $q(\theta)=q(\xi)q(\zeta|\xi)$ . When $p(\xi)$ satisfies the condition, we could sample $\{\xi_{i}\}_{i=1}^{m}\sim p(\xi)$ and approximate the posterior with summation of several functions, i.e., in the form of $q(\theta)\approx\sum{\alpha_{i}}q(\zeta|\xi_{i})$ .

2 Posterior Approximation Using Weighted Kernel Density Estimator

In general, sampling from prior $p(\theta)$ that are not so “good” will lead to particle depletion and inaccurate estimation of the posterior. To alleviate particle degeneracy, we propose to estimate the prox-mappings via weighted kernel density estimator (KDE). The weighted KDE prevents particles from dying out, in a similar fashion as kernel smoothing variant SMC and one-pass SMC , but with guarantees.

More specifically, we approximate $q_{t+1}(\theta)$ via a weighted kernel density estimator

Intuitively, the sampling procedure gradually adjusts the support of the intermediate distribution towards that of the true posterior, which is similar to “rejuvenation” step. The reweighting procedure gradually adjusts the shape of the intermediate distribution on the support. Same as the mechanism in Doucet et al. , Balakrishnan and Madigan , the weighted KDE could avoid particle depletion.

We demonstrate that the estimator in (6) in one step possesses similar estimation properties as standard KDE for densities (for details, refer to the Appendix D).

3 Overall Algorithm

In practice, we could combine the proposed two algorithms to reduce the computation cost. In the beginning stage, we adopt the second strategy. The computation cost is affordable for small number of particles. After we achieve a reasonably good estimator of the posterior, we could switch to the first strategy using large size particles to get better rate.

Theoretical Guarantees

In this section, we show that PMD algorithm (i) given good prior $p(\theta)$ , achieves a dimension independent, sublinear rate of convergence in terms of integral approximation; and (ii) in general cases, achieves a dimension dependent, sublinear rate of convergence in terms of $KL$ -divergence with proper choices of stepsizes.

Assume $p(\theta)$ has the same support as the true posterior $q^{*}(\theta)$ , i.e., $0\leqslant q^{*}(\theta)/p(\theta)\leqslant C$ . Assume further model $\|p(x|\theta)^{N}\|_{\infty}\leqslant\rho,\forall x$ . Then $\forall f(\theta)$ bounded and integrable, the $T$ -step PMD algorithm with stepsize $\gamma_{t}=\frac{\eta}{t}$ returns $m$ weighted particles such that

where $M:=\max_{t=1,\ldots,T}\|g_{t}\|_{\infty}$ .

The condition for the models, $\|p(x|\theta)^{N}\|_{\infty}\leqslant\rho,\forall x$ , is mild, and there are plenty of models satisfying such requirement. For examples, in binary/multi-class logistic regression, probit regression, as well as latent Dirichlet analysis, $\rho\leqslant 1$ . Please refer to details in Appendix C. The proof combines the results of the weighted particles for integration, and convergence analysis of mirror descent. One can see that the error consists of two terms, one from integration approximation and the other from optimization error. To achieve the best rate of convergence, we need to balance the two terms. That is, when the number particles, $m$ , scales linearly with the number of iterations, we obtain an overall convergence rate of $O(\frac{1}{\sqrt{T}})$ . In other words, if the number of particles is fixed to $m$ , we could achieve the convergence rate $O(\frac{1}{\sqrt{m}})$ with $T=O(m)$ iterations.

2 Strong Convergence of PMD

In general, when the weighted kernel density approximation scheme is used, we show that PMD enjoys a much stronger convergence, i.e., the $KL$ -divergence between the generated density and the true posterior converges sublinearly. Throughout this section, we merely assume that

The prior and likelihood belong to $(\beta;\mathcal{L})$ -Hölder class.

Kernel $K(\cdot)$ is a $\beta$ -valid density kernel with a compact support and there exists $\mu,\nu,\delta>0$ such that $\int K(z)^{2}\,dz\leqslant\mu^{2}$ , $\int\|z\|^{\beta}|K(z)|dz\leqslant\nu$ .

Note that the above assumptions are more of a brief characteristics of the commonly used kernels and inferences problems in practice rather than an exception. The second condition clearly holds true when the logarithmic of the prior and likelihood belongs to $C_{\infty}$ with bounded derivatives of all orders, as assumed in several literature . The third condition is for characterizing the estimator over its support. These assumptions automatically validate all the conditions required to apply Theorem 4 and the corresponding high probability bounds (stated in Corollary 17 in appendix). Let the kernel bandwidth $h_{t}=m_{t}^{-1/(d+2\beta)}$ , we immediately have that with high probability,

Directly applying Theorem 2, and solving the recursion following , we establish the convergence results in terms of KL-divergence.

Based on the above assumptions, when setting $\gamma_{t}=\min\{\frac{2}{t+1},\frac{\Delta}{Mm_{t}^{\beta/(d+2\beta)}}\}$ ,

Unlike Theorem 5, the convergence results are established in terms of the $KL$ -divergence, which is a stronger criterion and can be used to derive the convergence under other divergences . To our best knowledge, these results are the first of its kind for estimating posterior densities in literature. One can immediately see that the final accuracy is essentially determined by two sources of errors, one from noise in applying stochastic gradient, the other from applying weighted kernel density estimator. For the last iterate, an overall $O(\frac{1}{T})$ convergence rate can be achieved when $m_{t}=O(t^{2+d/\beta})$ . There is an explicit trade-off between the overall rate and the total number of particles: the more particles we use at each iteration, the faster algorithm converges. One should also note that in our analysis, we explicitly characterize the effect of the smoothness of model controlled by $\beta$ , which is assumed to be infinite in existing analysis of SGLD. When the smoothness parameter $\beta>>d$ , the number of particles is no longer depend on the dimension. That means, with memory budget $O(dm)$ , i.e., the number of particles is set to be $O(m)$ , we could achieve a $O(1/\sqrt{m})$ rate.

It is worth mentioning that in the above result, the $O(1/T)$ bound corresponding to the stochasticity is tight (see Nemirovski et al. ), and the $O(m^{-\frac{\beta}{d+2\beta}})$ bound for KDE estimation is also tight by itself (see ). An interesting question here is whether the overall complexity provided here is indeed optimal? This is out of the scope of this paper, and we will leave it as an open question.

Related Work

PMD connects stochastic optimization, Monte Carlo approximation and functional analysis to Bayesian inference. Therefore, it is closely related to two different paradigms of inference algorithms derived based on either optimization or Monte Carlo approximation.

From the optimization point of view, the proposed algorithm shares some similarities to stochastic variational inference (SVI) –both algorithms utilize stochastic gradients to update the solution. However, SVI optimizes a surrogate of the objective, the evidence lower bound (ELBO), with respect to a restricted parametric distributionEven in , “nonparametric variational inference” (NPV) uses the mixture of Gaussians as variational family which is still parametric.; while the PMD directly optimizes the objective over all valid densities in a nonparametric form. Our flexibility in density space eliminates the bias and leads to favorable convergence results.

From the sampling point of view, PMD and the particle filtering/sequential Monte Carlo (SMC) both rely on importance sampling. In the framework of SMC sampler , the static SMC variants proposed in bares some resemblances to the proposed PMD. However, their updates come from completely different origins: the static SMC update is based on Monte Carlo approximation of Bayes’ rule, while the PMD update based on inexact prox-mappings. On the algorithmic side, (i) the static SMC re-weights the particles with likelihood while the PMD re-weights based on functional gradient, which can be fractional power of the likelihood; and (ii) the static SMC only utilizes each datum once while the PMD allows multiple pass of the datasets. Most importantly, on the theoretical side, PMD is guaranteed with convergence in terms of both $KL$ -divergence and integral approximation for static model, while SMC is only rigoriously justified for dynamic models. It is unclear whether the convergence still holds for these extensions in .

The static SMC uses simple normal distribution or kernel density estimation for rejuvenation. However, such moving kernel is purely heuristic and it is unclear whether the convergence rate of SMC for dynamic system still holds for static models. To ensure the convergence of static SMC, MCMC is needed in the rejuvenation step. The MCMC step requires to browse all the previously visited data, leading to extra computation cost $\Omega(dmt)$ and memory cost $O(dt)$ , and hence violating the memory budget requirement. We emphasize that even using MCMC in static SMC for rejuvenation, the conditions required for static SMC is more restricted. We discuss the conditions for convergence of SMC and PMD using particles approximation in Appendix C.

Comparing with SGLD, the cost of PMD at each iteration is higher. However, PMD converges in rate of $O(m^{-\frac{1}{2}})$ , faster than SGLD, $O(m^{-\frac{1}{3}})$ , in terms of integral approximation and $KL$ -divergence which is more stringent if all the orders of derivatives of stochastic gradient is bounded. Moreover, even for the integral approximation, SGLD converges only when $f$ having weak Taylor series expansion, while for PMD, $f$ is only required to be bounded. The SGLD also requires the stochastic gradient satisfying several extra conditions to form a Lyapunov system, while such conditions are not needed in PMD.

Experiments

We conduct experiments on mixture models, logistic regression, sparse Gaussian processes and latent Dirichlet allocation to demonstrate the advantages of PMD in capturing multiple modes, dealing with non-conjugate models and incorporating special structures, respectively.

For the mixture model and logistic regression, we compare our algorithm with five general approximate Bayesian inference methods, including three sampling algorithms, i.e., one-pass sequential Monte Carlo (one-pass SMC) which is an improved version of the SMC for Bayesian inference , stochastic gradient Langevin dynamics (SGD Langevin) and Gibbs sampling, and two variational inference methods, i.e., stochastic variational inference (SVI) and stochastic variant of nonparametric variational inference (SGD NPV) . For sparse GP and LDA, we compare with the existing large-scale inference algorithms designed specifically for the models.

For the synthetic data generated by mixture model, we could calculate the true posterior, Therefore, we evaluate the performance directly through total variation and $KL$ -divergence (cross entropy). For the experiments on logistic regression, sparse GP and LDA on real-world datasets, we use indirect criteria which are widely used because of the intractability of the posterior. We keep the same memory budget for Monte Carlo based algorithms if their computational cost is acceptable. To demonstrate the efficiency of each algorithm in utilizing data, we use the number of data visited cumulatively as x-axis.

For the details of the model specification, experimental setups, additional results and algorithm derivations for sparse GP and LDA, please refer to the Appendix H.

We conduct comparison on a simple yet interesting mixture model , the observations $x_{i}\sim p\mathcal{N}(\theta_{1},\sigma_{x}^{2})+(1-p)\mathcal{N}(\theta_{1}+\theta_{2},\sigma_{x}^{2})$ and $\theta_{1}\sim\mathcal{N}(0,\sigma_{1}^{2})$ , $\theta_{2}\sim\mathcal{N}(0,\sigma_{2}^{2})$ , where $(\sigma_{1},\sigma_{2})=(1,1)$ , $\sigma_{x}=2.5$ and $p=0.5$ . The means of two Gaussians are tied together making $\theta_{1}$ and $\theta_{2}$ correlated in the posterior. We generate $1000$ data from the model with $(\theta_{1},\theta_{2})=(1,-2)$ . This is one mode of the posterior, there is another equivalent mode at $(\theta_{1},\theta_{2})=(-1,2)$ .

We initialize all algorithms with prior on $(\theta_{1},\theta_{2})$ . We repeat the experiments $10$ times and report the average results. We keep the same memory for all except SVI. The true posterior and the one generated by our method is illustrated in Figure 1 (1)(2). PMD fits both modes well and recovers nicely the posterior while other algorithms either miss a mode or fail to fit the multimodal density. For the competitors’ results, please refer to Appendix H. PMD achieves the best performance in terms of total variation and cross entropy as shown in Figure 1 (3)(4). This experiment clearly indicates our algorithm is able to take advantages of nonparametric model to capture multiple modes.

We test our algorithm on logistic regression with non-conjugate prior for handwritten digits classification on the MNIST8M 8 vs. 6 dataset. The dataset contains about $1.6$ M training samples and $1932$ testing samples. We initialize all algorithms with same prior and terminate the stochastic algorithms after 5 passes through the dataset. We keep $1000$ samples for Monte Carlo based algorithms, except Gibbs sampling whose computation cost is unaffordable. We repeat the experiments $10$ times and the results are reported in Figure 2(1). Obviously, Gibbs sampling , which needs to scan the whole dataset, is not suitable for large-scale problem. In this experiment, SVI performs best at first, which is expectable because learning in the Gaussian family is simpler comparing to nonparametric density family. Our algorithm achieves comparable performance in nonparametric form after fed with enough data, $98.8\%$ , to SVI which relies on carefully designed lower bound of the log-likelihood . SGD NPV is flexible with mixture models family, however, its speed becomes the bottleneck. For SGD NPV, the speed is dragged down for the use of L-BFGS to optimize the second-order approximation of ELBO.

We use sparse GPs models to predict the year of songs . In this task, we compare to the SVI for sparse GPs and one-pass SMC. We also included subset of data approximation (SoD) as baseline. The data contains about $0.5$ M songs, each represented by $90$ -dimension features. We terminate the stochastic algorithms after 2 passes of dataset. We use $16$ particles in both SMC and PMD. The number of inducing inputs in sparse GP is set to be $2^{10}$ , and other hyperparameters of sparse GP are fixed for all methods. We run experiments $10$ times and results are reported in Figure. 2(2). Our algorithm achieves the best RMSE $0.027$ , significantly better than one-pass SMC and SVI.

We compare to SVI , stochastic gradient Riemannian Langevin dynamic (SGRLD) , and SMC specially designed for LDA on Wikipedia dataset . The dataset contains $0.15$ M documents, about $2$ M words and $8000$ vocabulary. Since we evaluate their performances in terms of perplexity, which is integral over posterior, we do not need to recover the posterior, and therefore, we follow the same setting in , where one particle is used in SMC and PMD to save the cost. We set topic number to $100$ and fix other hyperparameters to be fair to all algorithms. We stop the stochastic algorithms after 5 passes of dataset. The results are reported in Figure 2(3). The top words from several topics found by our algorithm are illustrated in Appendix H. Our algorithm achieves the best perplexity, significantly better than SGRLD and SVI. In this experiment, SMC performs well at the beginning since it treats each documents equally and updates with full likelihood. However, SMC only uses each datum once, while the stochastic algorithms, e.g., SGRLD, SVI and our PMD, could further refine the solution by running the dataset multiple times.

Conclusion

Our work contributes towards achieving better trade-off between efficiency, flexibility and provability in approximate Bayesian inference from optimization perspective. The proposed algorithm, Particle Mirror Descent, successfully combines stochastic mirror descent and nonparametric density approximation. Theoretically, the algorithm enjoys a rate $O(1/\sqrt{m})$ in terms of both integral approximation and $KL$ -divergence, with $O(m)$ particles. Practically, the algorithm achieves competitive performance to existing state-of-the-art inference algorithms in mixture models, logistic regression, sparse Gaussian processes and latent Dirichlet analysis on several large-scale datasets.

Acknowledgements

We thank the anonymous referees for their valuable suggestions. This project was supported in part by NSF/NIH BIGDATA 1R01GM108341, ONR N00014-15-1-2340, NSF IIS-1218749, and NSF CAREER IIS-1350983.

References

Appendix A Strong convexity

As we discussed, the posterior from Bayes’s rule could be viewed as the optimal of an optimization problem in Eq (1). We will show that the objective function is strongly convex w.r.t $KL$ -divergence.

Proof for Lemma 1. The lemma directly results from the generalized Pythagaras theorem for Bregman divergence. Particularly, for $KL$ -divergence, we have

Notice that $L(q)=KL(q||q^{*})-\log Z$ , where $q^{*}=\frac{p(\theta)\Pi_{i}^{N}p(x_{i}|\theta)}{Z},\quad Z=\int p(\theta)\Pi_{i}^{N}p(x_{i}|\theta)$ , we have

Appendix B Finite Convergence of Stochastic Mirror Descent with Inexact Prox-Mapping in Density Space

Since the prox-mapping of stochastic mirror descent is intractable when directly being applied to the optimization problem (1), we propose the $\epsilon$ -inexact prox-mapping within the stochastic mirror descent framework in Section 3. Instead of solving the prox-mapping exactly, we approximate the solution with $\epsilon$ error. In this section, we will show as long as the approximation error is tolerate, the stochastic mirror descent algorithm still converges.

Remark. Based on , one can immediately see that, to guarantee the usual rate of convergence, the error $\epsilon_{t}$ can be of order $O(\gamma_{t}^{2})$ . The first recurrence implies an overall $O(1/T)$ rate of convergence for the $KL$ -divergence when the stepsize $\gamma_{t}$ is as small as $O(1/t)$ and error $\epsilon_{t}$ is as small as $O(1/t^{2})$ . The second result implies an overall $O(1/\sqrt{T})$ rate of convergence for objective function when larger stepsize $\gamma_{t}=O(1/\sqrt{T})$ and larger error $\epsilon_{t}=O(1/t)$ are adopted.

Therefore, combining (8), (9), and (10), we have $\forall q\in\mathcal{P}$

Plugging $q^{*}$ and taking expectation on both sides, the LHS becomes

Because the objective function is $1$ -strongly convex w.r.t. $KL$ -divergence,

we obtain the recursion with inexact prox-mapping,

(b) Summing over $t=1,\ldots,T$ of equation (11), we get

By convexity and optimality condition, this leads to

Furthermore, combined with the $1$ -strongly-convexity, it immediately follows that

Appendix C Convergence Analysis for Integral Approximation

In this section, we provide the details of the convergence analysis of the proposed algorithm in terms of integral approximation w.r.t. the true posterior using a good initialization.

Assume that the prior $p(\theta)$ has support $\Omega$ cover true posterior distribution $q^{*}(\theta)$ , then, we could represent

Given $q(\theta)$ , we sample i.i.d. $\{\theta_{i}\}_{i=1}^{m}$ from $p(\theta)$ , and construct a function

Apply the above conclusion to $f(\theta)=1$ , we have

Then, we have achieve our conclusion that

With the knowledge of $p(\theta)$ and $q(\theta)$ , we set $q_{t}(\theta)=\alpha_{t}(\theta)p(\theta)$ , the PMD algorithm will reduce to adjust $\alpha(\theta_{i})$ for samples $\{\theta_{i}\}_{i=1}^{m}\sim\pi(\theta)$ according to the stochastic gradient. Plug the gradient formula into the exact update rule, we have

where $\alpha_{t+1}(\theta)=\frac{\alpha_{t}(\theta)\exp(-\gamma_{t}g_{t}(\theta))}{Z}$ . Since $Z$ is constant, ignoring it will not effect the multiplicative update.

Given the fact that the objective function, $L(q)$ , is 1-strongly convex w.r.t. the $KL$ -divergence, we can immediately arrive at the following convergence results as appeared in Nemirovski et al. , if we are able to compute the prox-mapping in Eq.(2.1) exactly.

One prox-mapping step Eq.(2.1) reduces the error by

With stepsize $\gamma_{t}=\frac{\eta}{t}$ , it implies

Proof We could obtain the recursion directly from Theorem 2 by setting $\epsilon=0$ , which means solving the prox-mapping exactly, and the rate of convergence rate could be obtained by solving the recursion as stated in .

The second last inequality comes from Pinsker’s inequality.

We first decompose the error into optimization error and finite approximation error.

For the optimization error, by lemma 9, we have

which results the update $\alpha_{t}(\theta)=\frac{\alpha_{t-1}^{1-\gamma_{t-1}}(\theta)p(x|\theta)^{N{\gamma_{t-1}}}}{Z}$ . Notice $Z=\int q_{t}(\theta)\exp(-\gamma_{t}g_{t}(\theta))d\theta$ , we have $\exp(-\gamma_{t}\|g_{t}(\theta)\|_{\infty})\leqslant Z\leqslant\exp(\gamma_{t}\|g_{t}(\theta)\|_{\infty})$ . By induction, it can be show that $\|\alpha_{t}\|_{\infty}\leqslant\max\{C,\rho\exp(\|g_{t}(\theta)\|_{\infty})\}\leqslant\max\{C,\rho\exp(\|g(\theta)\|_{\infty})\}$ . Therefore, by lemma 7, we have

Combine $\epsilon_{1}$ and $\epsilon_{2}$ , we achieve the conclusion. $\blacksquare$

Remark: Simply induction without the assumption from the update of $\alpha_{t}(\theta)$ will result the upper bound of sequence $\|\alpha_{t}\|_{\infty}$ growing. The growth of sequence $\|\alpha_{t}\|_{\infty}$ is also observed in the proof for sequential Monte Carlo on dynamic models. To achieve the uniform convergence rate for SMC of inference on dynamic system, Crisan and Doucet , Gland and Oudjane require the models should satisfy i), $\epsilon\nu(\theta_{i})\leqslant p(x_{i}|\theta_{i})p(\theta_{i}|\theta_{i-1})\leqslant\epsilon^{-1}\nu(\theta_{i})$ , $\forall x$ where $\nu(\theta)$ is a positive measure, and ii), $\frac{\sup_{\theta}p(x|\theta)}{\inf_{\mu\in\mathcal{P}}\langle\mu(\theta)p(\cdot|\theta)p(x|\cdot)\rangle}\leqslant\rho$ . Such rate is only for SMC on dynamic system. For static model, the trandistiion distribution is unknown, and therefore, no guarantee is provided yet. With much simpler and more generalized condition on the model, i.e., $\|p(x|\theta)^{N}\|_{\infty}\leqslant\rho$ , we also achieve the uniform convergence rate for static model. There are plenties of models satisfying such condition. We list several such models below.

logistic regression, $p(y|x,w)=\frac{1}{1+\exp(-yw^{\top}x)}$ , and $\|p(y|x,w)\|_{\infty}\leqslant 1$ .

probit regression, $p(y=1|x,w)=\Phi(w^{\top}x)$ where $\Phi(\cdot)$ is the cumulative distribution function of normal distribution. $\|p(y|x,w)\|_{\infty}\leqslant 1$ .

multi-category logistic regression, $p(y=k|x,W)=\frac{\exp(w_{k}^{\top}x)}{\sum_{i=1}^{K}\exp(w_{k}^{\top}x)}$ , and $\|p(y|x,W)\|_{\infty}\leqslant 1$ .

and $\|p(x_{d}|\theta_{d},\Phi)\|_{\infty}\leqslant\max_{z_{d}}\|p(x_{d}|z_{d},\Phi)\|_{\infty}\leqslant 1$ .

linear regression, $p(y|w,x)=\frac{1}{\sigma\sqrt{2\pi}}\exp(-(y-w^{\top}x)^{2}/2\sigma^{2})$ , and $\|p(y|w,x)\|_{\infty}\leqslant\frac{1}{\sigma\sqrt{2\pi}}$ .

Gaussian model and PCA, $p(x|\mu,\Sigma)=(2\pi\det(\Sigma))^{-\frac{1}{d}}\exp\bigg{(}-\frac{1}{2}(x-\mu)^{\top}\Sigma(x-\mu)\bigg{)}$ , and $\|p(x|\mu,\Sigma)\|_{\infty}\leqslant(2\pi\det(\Sigma))^{-\frac{1}{d}}$ .

Appendix D Error Bound of Weighted Kernel Density Estimator

Before we start to prove the finite convergence in general case, we need to characterize the error induced by weighted kernel density estimator. In this section, we analyze the error in terms of both $L_{1}$ and $L_{2}$ norm, which are used for convergence analysis measured by $KL$ -divergence in Appendix E .

We now present the proof for each of these error bounds.

To formally show that, we begin by giving the definition of a special class of kernels and Hölder classes of densities that we consider.

We say a kernel function $K(\cdot)$ is a $(\beta;\mu,\nu)$ -valid density kernel, if $K(\theta,\theta)=K(\theta-\theta)$ is a bounded, compactly supported kernel such that

$\int|K(z)|^{r}dz\leqslant\infty$ for any $r\geqslant 1$ , particularly, $\int K(z)^{2}\,dz\leqslant\mu^{2}$ for some $\mu>0$ .

For simplicity, we sometimes call $K(\cdot)$ as a $\beta$ -valid density kernel if the constants $\mu$ and $\nu$ are not specifically given. Notice that all spherically symmetric compactly supported probability density and product kernels based on compactly supported symmetric univariate densities satisfy the conditions. For instance, the kernel $K(\theta)=(2\pi)^{-d/2}\exp(-\left\|\theta\right\|^{2}/2)$ satisfies the conditions with $\beta=\infty$ , and it is used through out our experiments. Furthermore, we will focus on a class of smooth densities

We say a density function $q(\cdot)$ is a $(\beta;\mathcal{L})$ -Hölder density function if function $q(\cdot)$ is $\lfloor\beta\rfloor$ -times continuously differentiable on its support $\Omega$ and satisfies

for any $z_{0}$ , there exists $L(z_{0})>0$ such that

where $q_{z_{0}}^{(\beta)}$ is the $\lfloor\beta\rfloor$ -order Taylor approximation, i.e.

in addition, the integral $\int L(z)dz\leqslant\mathcal{L}$ .

$f\in C^{\beta}_{\mathcal{L}}(\Omega)$ means $f$ is $(\beta;\mathcal{L})$ -Hölder density function.

Then given the above setting for the kernel function and the smooth densities, we can characterize the error of the weighted kernel density estimator as follows.

If $q(\cdot)\in C^{\beta}_{\mathcal{L}}(\Omega)$ and $K$ is a $(\beta;\mu,\nu)$ -valid density kernel, then

Proof The proof of this lemma follows directly from Chapter 4.3 in .

D.1.2 KDE error due to variance

The variance term can be bounded using similar techniques as in .

Assume $\omega\sqrt{p}\in L_{1}$ with bounded support, then

Denote $\mu(K):=\sqrt{\int K(\theta)^{2}\,d\theta}$ and kernel $K^{+}(\theta)=\frac{K^{2}(\theta)}{\mu(K)^{2}}$ , then $\mu(K)\leqslant\mu$ , $\int K^{+}d\theta=1$ and

From Theorem 2.1 in , we have $\int\left|(\omega^{2}p)\star K_{h}^{+}-\omega^{2}p\right|\,d\theta=o(1)$ . Therefore, we conclude that

D.1.3 KDE error due to normalization

The normalization error term can be easily derived based on the variance.

Since $\|K_{h}\|_{1}=\int\frac{1}{h^{d}}K(\theta/h)\,d\theta=\int K(\theta)\,d\theta=1$ , we have

D.1.4 KDE error in expectation and with high probability

Based on the above there lemmas, namely, Lemma 12 - 14, we can immediately arrive at the bound of the $L_{1}$ error in expectation as stated in Theorem 4. We now provide the proof for the high probability bound as stated below.

Besides the above assumption, let us also assume that $\omega(\theta)$ is bounded, i.e. there exists $0<B_{1}\leqslant B_{2}<\infty$ such that $B_{1}\leqslant\omega(\theta)\leqslant B_{2},\forall\theta$ . Then, with probability at least $1-\delta$ ,

Invoking the McDiamid’s inequality, we have

Let $q=\omega p\in C^{\beta}_{\mathcal{L}}(\Omega)$ and $K$ be a $(\beta;\mu,\nu)$ -valid density kernel. Assume that $\omega^{2}p\in L_{2}$ and has bounded support. Then

Proof for Theorem 16. The square $L_{2}$ -error can also be decomposed into three terms.

In addition, we have for the normalization error term,

Combining equation (12) , (13) and (15), it follows that

Proof for Theorem 17. Use McDiarmid’s inequality similar as proof for Corollary 15. $\blacksquare$

Appendix E Convergence Analysis for Density Approximation

In this section, we consider the rate of convergence for the entire density measured by $KL$ -divergence. We start with the following lemma that show the renormalization does not effect the optimization in the sense of optimal, and we show the importance weight $\omega_{t}(\theta)=\frac{\exp(-\gamma_{t}g_{t}(\theta))}{Z}$ at each step are bounded under proper assumptions. Moreover, the error of the prox-mapping at each step incurred by the weighted density kernel density estimation is bounded.

Assume for all mini-batch of examples $\|g_{t}(\theta)\|^{2}_{\infty}\leqslant M^{2}$ , then we have

Proof for Lemma 19. Let $Z:=\int q_{t}(\theta)\exp(-\gamma_{t}g_{t}(\theta))d\theta$ . We have $\exp(-\gamma_{t}M)\leqslant Z\leqslant\exp(\gamma_{t}M)$ . (a) Since $\|g_{t}(\theta)\|^{2}_{\infty}\leqslant M^{2}$ , we have

In addition, with probability at least $1-2\delta$ in $\theta_{t}|x_{[t-1]},\theta_{[t-1]}$ , we have

Based on the definition of $\widehat{q}_{t+1}$ , we have

Similarly, combining the results of Corollary 15 and 17, we have with probability at least $1-2\delta$ ,

Our main Theorem 6 follows immediately by applying the results in the above lemma to Theorem 2.

By Theorem 4 and setting $h_{t}=O(1)m_{t}^{-1/(d+2\beta)}$ , we achieve the error bound

where $\mathcal{C}_{2}:=O(1)M(\mu+\nu\mathcal{L})$ .

When setting $\gamma_{t}=\min\{\frac{2}{t+1},\frac{\Delta}{Mm_{t}^{\beta/(d+2\beta)}}\}$ invoking the above lemma, we have

where $\mathcal{C}_{1}:=O(1)(\mu+\nu\mathcal{L})^{2}\mu^{2}\Delta$ . Expanding the result from Theorem 2, it follows that

The above recursion leads to the convergence result for the second term,

Combine these two results, we achieve the desired result

Remark. The convergence in terms of $KL$ -divergence is measuring the entire density and much more stringent compared to integral approximation. For the last iterate, an overall $O(\frac{1}{T})$ convergence rate can be achieved when $m_{t}=O(t^{2+d/\beta})$ . Similar to Lemma 9, with Pinsker’s inequality, we could easily obtain the the rate of convergence in terms of integral approximation from Theorem 6. After $T$ steps, in general cases, the PMD algorithm converges in terms of integral approximation in rate $O(1/\sqrt{T})$ by choosing $O(1/t)$ -decaying stepsizes and $O(t^{2+\frac{d}{2\beta}})$ -growing samples.

Appendix F Derivation Details for Sparse Gaussian Processes and Latent Dirichlet Allocation

We apply the Particle Mirror Descent algorithm to sparse Gaussian processes and latent Dirichlet allocation. For these two models, we decompose the latent variables and incorporate the structure of posterior into the algorithm. The derivation details are presented below.

We treat the inducing variables as the latent variables with uniform prior in sparse Gaussian processes. Then, the posterior of $Z,\mathbf{u}$ could be thought as the solution to the optimization problem

The stochastic gradient of Eq.(16) w.r.t. $q(Z,\mathbf{u})$ will be

and therefore, the prox-mapping in $t$ -step is

We update $q_{t+1}(\mathbf{u}|Z)$ to be the optimal of $L(q(\mathbf{u}|Z))$ as

Plug this into the $L(q(\mathbf{u}|Z))$ , we have

We approximate the $q(Z)$ with particles, i.e., $q(Z)=\sum_{j=1}^{l}w^{j}\delta(Z^{j})$ . The update rule for $w^{j}$ is

F.2 Latent Dirichlet Allocations

The $p(\Phi)$ and $p(\theta)$ are the priors for parameters, $p(\theta_{d}|\alpha)=\frac{\Gamma(K\alpha)}{\Gamma(\alpha)^{K}}\prod_{k}^{K}\theta_{dk}^{\alpha-1}$ and $p(\Phi|\beta_{0})=\prod_{k}^{K}\frac{\Gamma(W\beta_{0})}{\Gamma(\beta_{0})^{W}}\prod_{w}^{W}\Phi_{wk}^{\beta_{0}-1}$ , both are Dirichlet distributions.

We incorporate the special structure into the proposed algorithm. Instead of modeling the $p(\Phi)$ solely, we model the $Z=\{Z\}_{d=1}^{D}$ and $\Phi$ together as $q(Z,\Phi)$ . Based on the model, given $Z$ , the $q(\Phi|Z)$ will be Dirichlet distribution and could be obtained in closed-form.

The posterior of $Z,\Phi$ is the solution to

We approximate the finite summation by expectation, then the objective function becomes

We approximate the $q(Z)\approx\sum_{i=1}^{m}w^{i}\delta(Z^{i})$ by particles, and therefore, $q(Z,\Phi)\approx\sum_{i=1}^{m}w^{i}P(\Phi|Z^{i})$ where $P(\Phi|Z^{i})$ is the Dirichlet distribution as we discussed. It should be noticed that from the objective function, we do not need to instantiate the $z_{d}$ until we visit the $x_{d}$ . By this property, we could first construct the particles $\{Z^{i}\}_{i=1}^{m}$ ‘conceptually’ and assign the value to $\{z_{d}^{i}\}_{i=1}^{m}$ when we need it. The gradient of Eq.(18) w.r.t. $q(\Phi,Z)$ is

The stochastic functional gradient update for $q(\Phi|Z^{i})$ is

Let $q_{t}(\Phi|Z^{i})=\mathcal{D}ir(\beta_{t}^{i})$ , then, the $q_{t+1}(\Phi|Z^{i})$ is also Dirichlet distribution

In mini-batch setting, the updating will be

Plug the $q_{t+1}(\Phi|Z^{i})$ into prox-mapping, we have

Then, we could update $q_{t}(Z)=\sum_{i}^{m}w^{i}\delta(Z^{i})$ by

If we set $\alpha=1$ , $p(Z^{i})$ will be uniformly distributed which has no effect to the update. For general setting, to compute $\log p(Z^{i}|\alpha)$ , we need prefix all the $\{z_{d}^{i}\}_{d=1}^{D}$ . However, when $D$ is huge, the second term will be small and we could ignore it approximately.

Till now, we almost complete the algorithm except the how to assign $z_{d}$ when we visit $x_{d}$ . We could assign the $z_{d}$ randomly. However, considering the requirement for the $z_{d}^{i}$ assignment that the $q(z^{i}_{d}|Z^{i}_{\setminus d})>0$ , which means the assignment should be consistent, an better way is using the average or sampling proportional to $\int p(x_{d}|\Phi,z_{d})q_{t}(\Phi|Z^{i})p(z_{d}|Z^{i}_{1\ldots,d-1})d\Phi$ where $p(z_{d}|Z^{i}_{1\ldots,d-1})=\int p(z_{d}|\alpha)p(\alpha|Z^{i}_{1\ldots,d-1})d\alpha$ , or $\int p(x_{d}|\Phi,z_{d})q_{t}(\Phi|Z^{i})p(z_{d}|\alpha)d\Phi$ .

Appendix G More Related Work

Besides the most related two inference algorithms we discussed in Section (5), i.e., stochastic variational inference and static sequential Monte Carlo , there are several other inference algorithms connect to the PMD from algorithm, stochastic approximation, or representation aspects, respectively.

From algorithmic aspect, our algorithm scheme shares some similarities to annealed importance sampling (AIS) in the sense that both algorithms are sampling from a series of densities and reweighting the samples to approximate the target distribution. The most important difference is the way to construct the intermediate densities. In AIS, the density at each iteration is a weighted product of the joint distribution of all the data and a fixed proposal distribution, while the densities in PMD are a weighted product of previous step solution and the stochastic functional gradient on partial data. Moreover, the choice of the temperature parameter (fractional power) in AIS is heuristic, while in our algorithm, we have a principle way to select the stepsize with quantitative analysis. The difference in intermediate densities results the sampling step in these two algorithms is also different: the AIS might need MCMC to generate samples from the intermediate densities, while we only samples from a KDE which is more efficient. These differences make our method could handle large-scale dataset while AIS cannot.

Sequential Monte-Carlo sampler provides a unified view of SMC in Bayesian inference by adopting different forward/backward kernels, including the variants proposed in as special cases. There are subtle and important differences between the PMD and the SMC samplers. In the SMC samplers, the introduced finite forward/backward Markov kernels are used to construct a distribution over the auxiliary variables. To make the SMC samplers valid, it is required that the marginal distribution of the constructed density by integrating out the auxiliary variables must be the exact posterior. However, there is no such requirement in PMD. In fact, the PMD algorithm only approaches the posterior with controllable error by iterating the dataset many times. Therefore, although the proposed PMD and the SMC sampler bare some similarities operationally, they are essentially different algorithms.

Stochastic approximation becomes a popular trick in extending the classic Bayesian inference methods to large-scale datasets recently. Besides stochastic variational inference, which incorporates stochastic gradient descent into variational inference, the stochastic gradient Langevin dynamics (SGLD) Welling and Teh , and its derivatives combine ideas from stochastic optimization and Hamiltonian Monte Carlo sampling. Although both PMD and the SGLD use the stochastic gradient information to guide next step sampling, the optimization variable in these two algorithms are different which results the completely different updates and properties. In PMD, we directly update the density utilizing functional gradient in density space, while the SGLD perturbs the stochastic gradient in parameter space. Because of the difference in optimization variables, the mechanism of these algorithms are totally different. The SGLD generates a trajectory of dependent samples whose stationary distribution approximates the posterior, the PMD keeps an approximation of the posterior represented by independent particles or their weighted kernel density estimator. In fact, their different properties we discussed in Table 1 solely due to this essential difference.

A number of generalized variational inference approaches are proposed trying to relax the constraints on the density space with flexible densities. Nonparametric density family is a natural choiceAlthough named their methods as “nonparametric” belief propagation and “nonparametric” variational inference, they indeed use mixture of Gaussians, which is still a parametric model.. and extend the belief propagation algorithm with nonparametric models by kernel embedding and particle approximation, respectively. The most important difference between these algorithms and PMD is that they originate from different sources and are designed for different settings. Both the kernel BP Song et al. and particle BP Ihler and McAllester , Lienart et al. are based on belief propagation optimizing local objective and designed for the problem with one sample $X$ in which observations are highly dependent, while the PMD is optimizing the global objective, therefore, more similar to mean-field inference, for the inference problems with many i.i.d. samples.

After the comprehensive review about the similarities and differences between PMD and the existing related approximate Bayesian inference methods from algorithm, stochastic approximation and representation perspectives, we can see the position of the proposed PMD clearly. The PMD connects variation inference and Monte Carlo approximation, which seem two orthogonal paradigms in approximate Bayesian inference, and achieves a balance in trade-off between efficiency, flexibility and provability.

Appendix H Experiments Details

We use the normalized Gaussian kernel in this experiment. For one-pass SMC, we use the suggested kernel bandwidth in . For our method, since we increase the samples, the kernel bandwidth is shrunk in rate of $O(m^{-\frac{1}{2}})$ as the theorem suggested. The batch size for stochastic algorithms and one-pass SMC is set to be $10$ . The total number of particles for the Monte Carlo based competitors, i.e., SMC, SGD Langevin, Gibbs sampling, and our method is $1500$ in total. We also keep $1500$ Gaussian components in SGD NPV. The burn-in period for Gibbs sampling and stochastic Langevin dynamics are $50$ and $1000$ respectively.

The visualization of $10$ runs average posteriors obtained by the alternative methods are plotted in Figure 3. From these figures, we could have a direct understand about the behaviors for each competitors. The Gibbs sampling and stochastic gradient Langevin dynamics sampling stuck in one local mode in each run. Gibbs sampler could fit one of the contour quite well, better than the stochastic Langevin dynamics. It should be noticed that this is the average solution, the two contours in the result of stochastic gradient Langevin dynamics did not mean it finds both modes simultaneously. The one-pass sequential Monte Carlo and stochastic nonparametric variational inference are able to location multiple modes. However, their shapes are not as good as ours. Because of the multiple modes and the highly dependent variables in posterior, the stochastic variational inference fails to converge to the correct modes.

To compare these different kinds of algorithms in a fair way, we evaluate their performances using total variation and cross entropy of the solution against the true potential functions versus the number of observations visited. In order to evaluate the total variation and the cross entropy between the true posterior and the estimated one, we use both kernel density estimation and Gaussian estimation to approximate the posterior density and report the better one for Gibbs sampling and stochastic Langevin dynamics. The kernel bandwidth is set to be $0.1$ times the median of pairwise distances between data points (median trick).

In Figure 1(3)(4), the one-pass SMC performs similar to our algorithm at beginning. However, it cannot utilize the dataset effectively, therefore, it stopped with high error. It should be noticed that the one-pass SMC starts with more particles while our algorithm only requires the same number of particles at final stage. The reason that Gibbs sampling and the stochastic gradient Langevin dynamics perform worse is that they stuck in one mode. It is reasonable that Gibbs sampling fits the single mode better than stochastic gradient Langevin dynamics since it generates one new sample by scanning the whole dataset. For the stochastic nonparametric variational inference, it could locate both modes, however, it optimizes a non-convex objective which makes its variance much larger than our algorithm. The stochastic variational inference fails because of the highly dependent variables and multimodality in posterior.

H.2 Bayesian Logistic Regression

with $w$ as the latent variables. We use Gaussian prior for $w$ with identity covariance matrix.

We first reduce the dimension to $50$ by PCA. The batch size is set to be $100$ and the step size is set to be $\frac{1}{100+\sqrt{t}}$ . We stop the stochastic algorithms after they pass through the whole dataset $5$ times. The burn-in period for stochastic Langevin dynamic is set to be $1000$ . We rerun the experiments $10$ times.

Although the stochastic variant of nonparametric variational inference performs comparable to our algorithm with fewer components, its speed is bottleneck when applied to large-scale problems. The gain from using stochastic gradient is dragged down by using L-BFGS to optimize the second-order approximation of the evidence lower bound.

H.3 Sparse Gaussian Processes

We test the proposed algorithm on 1D synthetic data. The data are generated by

where $x\in[-0.5,0.5]$ and $e\sim\mathcal{N}(0,1)$ . The dataset contains $2048$ observations which is small enough to run the exact GP regression. We use Gaussian RBF kernel in Gaussian processes and sparse Gaussian processes. Since we are comparing different inference algorithms on the same model, we use the same hyperparameters for all the inference algorithms. We set the kernel bandwidth $\sigma$ to be $0.1$ times the median of pairwise distances between data points (median trick), and $\beta^{-1}=0.001$ . We set the stepsize in the form of $\frac{\eta}{n_{0}+\sqrt{t}}$ for both PMD and SVI and the batch size to be $128$ . Figure. 4 illustrates the evolving of the posterior provided by PMD with $16$ particles and $128$ inducing variables when the algorithms visit more and more data. To illustrate the convergence of the posterior provided by PMD, we initialize the $\mathbf{u}=0$ in PMD. Later, we will see we could make the samples in PMD more efficient.

H.3.2 Music Year Prediction

We randomly selected $463,715$ songs to train the model and test on $5,163$ songs. As in , the year values are linearly mapped into $ $. The data is standardized before regression. Gaussian RBF kernel is used in the model. Since we are comparing the inference algorithms, for fairness, we fixed the model parameters for all the inference algorithms, i.e., the kernel bandwidth is set to be the median of pairwise distances between data points and the observations precision$ \beta^{-1}=0.01 $. We set the number of inducing inputs to be$ 2^{10} $and batch size to be$ 512 $. The stepsize for both PMD and SVI are in the form of$ \frac{\eta}{n_{0}+\sqrt{t}} $. To demonstrate the advantages of PMD comparing to SMC, we initialize PMD with prior while SMC with the SoD solution. We rerun the experiments$ 10 $times. We use both$ 16 $particles in SMC and PMD. We stop the stochastic algorithms after they pass through the whole dataset$ 2$ times.

H.4 Latent Dirichlet Allocation

We fix the hyper-parameter $\alpha=0.1$ , $\beta=0.01$ , and $K=100$ . The batchsize is set to be $100$ . We use stepsize $\frac{\eta}{n_{0}+{t}^{\kappa}}$ for PMD, stochastic variational inference and stochastic Riemannian Langevin dynamic. For each algorithm a grid-search was run on step-size parameters and the best performance is reported. We stop the stochastic algorithms after they pass through the whole dataset $5$ times.

The log-perplexity was estimated using the methods discussed in on a separate holdout set with $1000$ documents. For a document $x_{d}$ in holdout set, the perplexity is computed by

We separate the documents in testing set into two non-overlapped parts, $x_{d}^{\text{estimation}}$ and $x_{d}^{\text{evaluation}}$ . We first evaluate the $\theta_{d}$ based on the $x_{d}^{\text{estimation}}$ . For different inference methods, we use the corresponding strategies in learning algorithm to obtain the distribution of $\theta_{d}$ based on $x_{d}^{\text{estimation}}$ . We evaluate $p(x_{dn}|X,\alpha,\beta)$ on $x_{d}^{\text{evaluation}}$ with the obtained distribution of $\theta_{d}$ . Specifically,

For PMD, SMC and stochastic Langevin dynamics,

For stochastic variational inference, $q(\theta_{d})$ is updated as in the learning procedure.

We illustrate several topics learned by LDA with our algorithm in Figure.5.