On the Convergence of Stochastic Gradient MCMC Algorithms with High-Order Integrators
Changyou Chen, Nan Ding, Lawrence Carin
Introduction
In large-scale Bayesian learning, diffusion based sampling methods have become increasingly popular. Most of these methods are based on Itô diffusions, defined as:
Here represents model states, the time index, is Brownian motion, functions and ( not necessarily equal to ) are assumed to satisfy the usual Lipschitz continuity condition.
In a Bayesian setting, the goal is to design appropriate functions and , so that the stationary distribution, , of the Itô diffusion has a marginal distribution that is equal to the posterior distribution of interest. For example, 1st-order Langevin dynamics (LD) correspond to , and , with being the identity matrix; 2nd-order Langevin dynamics correspond to , F=\Big{(}\begin{array}[]{c}\operatorname{\mathbf{p}}\\ -D\operatorname{\mathbf{p}}-\nabla_{\bm{\mathchar 28946\relax}}U\end{array}\Big{)}, and \sigma=\sqrt{2D}\Big{(}\begin{array}[]{cc}{\bf 0}&{\bf 0}\\ {\bf 0}&\operatorname{\mathbf{I}}_{n}\end{array}\Big{)} for some . Here is the unnormalized negative log-posterior, and is known as the momentum . Based on the Fokker-Planck equation , the stationary distributions of these dynamics exist and their marginals over are equal to , the posterior distribution we are interested in.
Recently, showed that the SGLD converges weakly to the true posterior. In , the author studied the sample-path inconsistency of the Hamiltonian PDE with stochastic gradients (but not the SGHMC), and pointed out its incompatibility with data subsampling. However, real applications only require convergence in the weak sense, i.e., instead of requiring sample-wise convergence as in , only laws of sample paths are of concernFor completeness, we provide mean sample-path properties of the SGHMC (similar to ) in Appendix J.. Very recently, the invariance measure of an SG-MCMC with a specific stochastic gradient noise was studied in . However, the technique is not readily applicable to our general setting.
In this paper we focus on general SG-MCMCs, and study the role of their numerical integrators. Our main contributions include: i) From a theoretical viewpoint, we prove weak convergence results for general SG-MCMCs, which are of practical interest. Specifically, for a th-order numerical integrator, the bias of the expected sample average of an SG-MCMC at iteration is upper bounded by with optimal step size , and the MSE by with optimal . This generalizes the results of the SGLD with an Euler integrator () in , and is better when ; ii) From a practical perspective, we introduce a numerically efficient 2nd-order integrator, based on symmetric splitting schemes . When applied to the SGHMC, it outperforms existing algorithms, including the SGLD and SGHMC with Euler integrators, considering both synthetic and large real datasets.
Preliminaries & Two Approximation Errors in SG-MCMCs
In weak convergence analysis, instead of working directly with sample-paths in (1), we study how the expected value of any suitably smooth statistic of evolves in time. This motivates the introduction of an (infinitesimal) generator. Formally, the generator of the diffusion (1) is defined for any compactly supported twice differentiable function , such that,
An integrator is said to be a th-order local integrator if for any smooth and bounded function , the corresponding Kolmogorov operator satisfies the following relation:
Convergence Analysis
This section develops theory to analyze finite-time convergence properties of general SG-MCMCs with both fixed and decreasing step sizes, as well as their asymptotic invariant measures.
The solution functional characterizes the difference between and the posterior average for every , thus would typically possess a unique solution, which is at least as smooth as under the elliptic or hypoelliptic settings . In the unbounded domain of , to make the presentation simple, we follow and make certain assumptions on the solution functional, , of the Poisson equation (4), which are used in the detailed proofs. Extensive empirical results have indicated the assumptions to hold in many real applications, though extra work is needed for theoretical verifications for different models, which is beyond the scope of this paper.
All terms in the above equation can be bounded, with details provided in Appendix D. This gives us a bound for the bias of an SG-MCMC algorithm in Theorem 2.
Under Assumption 1, let be the operator norm. The bias of an SG-MCMC with a th-order integrator at time can be bounded as:
2 Stationary invariant measures
For a th-order integrator with full gradients, the corresponding invariant measure has been shown to be bounded by an order of . As a result, Theorem 4 suggests only orders of numerical approximations but not the stochastic gradient approximation affect the asymptotic invariant measure of an SG-MCMC algorithm. This is also reflected by experiments presented in Section 5.
3 SG-MCMCs with decreasing step sizes
The original SGLD was first proposed with a decreasing-step-size sequence , instead of fixing step sizes, as analyzed in . In , the authors provide theoretical foundations on its asymptotic convergence properties. We demonstrate in this section that for general SG-MCMC algorithms, decreasing step sizes for each minibatch are also feasible. Note our techniques here are different from those used for the decreasing-step-size SGLD , which interestingly result in similar convergence patterns. Specifically, by adapting the same techniques used in the previous sections, we establish conditions on the step size sequence to ensure asymptotic convergence, and develop theory on their finite-time ergodic error as well. To guarantee asymptotic consistency, the following conditions on decreasing step size sequences are required.
The step sizes are decreasingActually the sequence need not be decreasing; we assume it is decreasing for simplicity., i.e., , and satisfy that 1) ; and 2) .
Under Assumptions 1 and 2, for a smooth test function , the bias and MSE of a decreasing-step-size SG-MCMC with a th-order integrator at time are bounded as:
As a result, the asymptotic bias approaches 0 according to the assumptions. If further assumingThe assumption of satisfies this requirement, but is weaker than the original assumption. , the MSE also goes to 0. In words, the decreasing-step-size SG-MCMCs are consistent.
Among the kinds of decreasing step size sequences, a commonly recognized one is for . We show in the following corollary that such a sequence leads to a valid sequence.
According to Theorem 5, one theoretical advantage of decreasing-step-size SG-MCMCs over fixed-step-size variants is the asymptotically unbiased estimation of posterior averages, though the benefit might not be significant in large-scale real applications where the asymptotic regime is not reached.
Practical Numerical Integrators
Given the theory for SG-MCMCs with high-order integrators, we here propose a 2nd-order symmetric splitting integrator for practical use. The Euler integrator is known as a 1st-order integrator; the proof and its detailed applications on the SGLD and SGHMC are given in Appendix I.
These sub-generators correspond to the following SDEs, which are all analytically solvable:
so that the corresponding updates for consist of the following 5 steps:
where are intermediate variables. We denote such a splitting method as the ABOBA scheme. From the Markovian property of a Kolmogorov operator, it is readily seen that all such symmetric splitting schemes (with different orders of ‘A’, ‘B’ and ‘O’) are equivalent . Lemma 8 below shows the symmetric splitting scheme is a 2nd-order local integrator.
When this integrator is applied to the SGHMC, the following properties can be obtained.
For a decreasing-step-size SGHMC, based on Remark 7, the optimal step size decreasing rate for the bias is , and for the MSE. These agree with their fixed-step-size counterparts in Remark 9, thus are faster than the SGLD/SGHMC with 1st-order Euler integrators.
Experiments
We here verify our theory and compare with related algorithms on both synthetic data and large-scale machine learning applications.
We consider a standard Gaussian model where . 1000 data samples are generated, and every minibatch in the stochastic gradient is of size 10. The test function is defined as , with explicit expression for the posterior average. To evaluate the expectations in the bias and MSE, we average over 200 runs with random initializations.
First we compare the invariant measures (with ) of the proposed splitting integrator and Euler integrator for the SGHMC. Results of the SGLD are omitted since they are not as competitive. Figure 1 plots the biases with different step sizes. It is clear that the Euler integrator has larger biases in the invariant measure, and quickly explodes when the step size becomes large, which does not happen for the splitting integrator. In real applications we also find this happen frequently (shown in the next section), making the Euler scheme an unstable integrator.
Next we examine the asymptotically optimal step size rates for the SGHMC. From the theory we know these are for the bias and for the MSE, in both fixed-step-size SGHMC (SGHMC-F) and decreasing-step-size SGHMC (SGHMC-D). For the step sizes, we did a grid search to select the best prefactors, which resulted in for the SGHMC-F and for the SGHMC-D, with different values. We plot the traces of the bias for the SGHMC-D and the MSE for the SGHMC-F in Figure 2. Similar results for the bias of the SGHMC-F and the MSE of the SGHMC-D are plotted in Appendix K. We find that when rates are smaller than the theoretically optimal rates, i.e., (bias) and (MSE), the bias and MSE tend to decrease faster than the optimal rates at the beginning (especially for the SGHMC-F), but eventually they slow down and are surpassed by the optimal rates, consistent with the asymptotic theory. This also suggests that if only a small number of iterations were feasible, setting a larger step size than the theoretically optimal one might be beneficial in practice.
Finally, we study the relative convergence speed of the SGHMC and SGLD. We test both fixed-step-size and decreasing-step-size versions. For fixed-step-size experiments, the step sizes are set to , with chosen according to the theory for SGLD and SGHMC. To provide a fair comparison, the constants are selected via a grid search from to 0.5 with an interval of for , it is then fixed in the other runs with different values. The parameter in the SGHMC is selected within as well. For decreasing-step-size experiments, an initial step size is chosen within with an interval of for different algorithmsUsing the same initial step size is not fair because the SGLD requires much smaller step sizes., and then it decreases according to their theoretical optimal rates. Figure 3 shows a comparison of the biases for the SGHMC and SGLD. As indicated by both theory and experiments, the SGHMC with the splitting integrator yields a faster convergence speed than the SGLD with an Euler integrator.
Large-scale machine learning applications
For real applications, we test the SGLD with an Euler integrator, the SGHMC with the splitting integrator (SGHMC-S), and the SGHMC with an Euler integrator (SGHMC-E). First we test them on the latent Dirichlet allocation model (LDA) . The data used consists of 10M randomly downloaded documents from Wikipedia, using scripts provided in . We randomly select 1K documents for testing and validation, respectively. As in , the vocabulary size is 7,702. We use the Expanded-Natural reparametrization trick to sample from the probabilistic simplex . The step sizes are chosen from , and parameter from . The minibatch size is set to 100, with one pass of the whole data in the experiments (and therefore ). We collect 300 posterior samples to calculate test perplexities, with a standard holdout technique as described in .
Next a recently studied sigmoid belief network model (SBN) is tested, which is a directed counterpart of the popular RBM model. We use a one layer model where the bottom layer corresponds to binary observed data, which is generated from the hidden layer (also binary) via a sigmoid function. As shown in , the SBN is readily learned by SG-MCMCs. We test the model on the MNIST dataset, which consists of 60K hand written digits of size for training, and 10K for testing. Again the step sizes are chosen from , from . The minibatch is set to 200, with 5000 iterations for training. Like applied for the RBM , an advance technique called anneal importance sampler (AIS) is adopted for calculating test likelihoods.
We briefly describe the results here, more details are provided in Appendix K. For LDA with 200 topics, the best test perplexities for the SGHMC-S, SGHMC-E and SGLD are 1168, 1180 and 2496, respectively; while these are 1157, 1187 and 2511, respectively, for 500 topics. Similar to the synthetic experiments, we also observed SGHMC-E crashed when using large step sizes. This is illustrated more clearly in Figure 4. For the SBN with 100 hidden units, we obtain negative test log-likelihoods of 103, 105 and 126 for the SGHMC-S, SGHMC-E and SGLD, respectively; and these are 98, 100, and 110 for 200 hidden units. Note the SGHMC-S on SBN yields state-of-the-art results on test likelihoods compared to , which was 113 for 200 hidden units. A decrease of 2 units in the neg-log-likelihood with AIS is considered to be a reasonable gain , which is approximately equal to the gain from a shallow to a deep model . SGHMC-S is more accuracy and robust than SGHMC-E due to its 2nd-order splitting integrator.
Conclusion
For the first time, we develop theory to analyze finite-time ergodic errors, as well as asymptotic invariant measures, of general SG-MCMCs with high-order integrators. Our theory applies for both fixed and decreasing step size SG-MCMCs, which are shown to be equivalent in terms of convergence rates, and are faster with our proposed 2nd-order integrator than previous SG-MCMCs with 1st-order Euler integrators. Experiments on both synthetic and large real datasets validate our theory. The theory also indicates that with increasing order of numerical integrators, the convergence rate of an SG-MCMC is able to theoretically approach the standard MCMC convergence rate. Given the theoretical convergence results, SG-MCMCs can be used effectively in real applications.
Supported in part by ARO, DARPA, DOE, NGA and ONR. We acknowledge Jonathan C. Mattingly and Chunyuan Li for inspiring discussions; David Carlson for the AIS codes.
References
Appendix A Representative Stochastic Gradient MCMC Algorithms
This section briefly introduces three recently proposed stochastic gradient MCMC algorithms, including the stochastic gradient Langevin dynamic (SGLD) , the stochastic gradient Hamiltonian MCMC (SGHMC) , and the stochastic gradient Nosé-Hoover thermostat (SGNHT).
Given data , a generative model with model parameter , and prior , we want to compute the posterior:
The SGLD is based on the following 1st-order Langevin dynamic defined as:
where is the standard Brownian motion. We can show via the Fokker–Planck equation that the equilibrium distribution of (16) is:
A.2 Stochastic gradient Hamiltonian MCMCs
The SGHMC is based on the 2nd-order Langevin dynamic defined as:
where is a constant independent of and . Again we can show that the equilibrium distribution of (19) is:
Similar to the SGLD, we use the Euler scheme to simulate the dynamic (19), shown in Algorithm 2.
A.3 Stochastic gradient Nośe-Hoover thermostats
The SGNHT is based on the Nośe-Hoover thermostat defined as:
If is independent of and , it can also be shown that the equilibrium distribution of (23) is :
The SGNHT is much more interesting than the SGHMC when considering subsampling data in each iteration, as the covariance in SGHMC is hard to estimate, a thermostat is used to adaptively control the system temperature, thus automatically estimate the unknown . The whole algorithm is shown in Algorithm 3.
Appendix B More Details on Kolmogorov’s Backward Equation
The generator is used in the formulation of Kolmogorov’s backward equation, which intuitively tells us how the expected value of any suitably smooth statistic of evolves in time. More precisely:
where is the exponential map operator associated with the generator defined as:
The form (27) instead of the original form (26) of the Kolmogorov’s backward equation is used in our analysis. To be able to expand the form (27) to some particular order such that remainder terms are bounded, the following assumption is required .
Assume 1) is with bounded derivatives of any order, furthermore, and 2) for some positive integer . Under these assumptions, series of the generator expansion can be bounded, thus (B) can be written in the following form :
Appendix C More Comments on Assumption 1
Compared to the SGLD case , in our proofs, we only need be up to 3 in (32) instead of 4. More specifically, the proof for the bias only needs be up to 0 given other assumptions in this paper, and the proof for the MSE needs be up to 3.
As long as the corresponding SDE is hypoelliptic, meaning that the Brownian motion is able to propagate to the other variables of the dynamics , e.g., the model parameter in SGHMC, we can extend Assumption 4.1 of to our setting. Thus we have that (30) is equivalent to finding a function ( is the dimension of , e.g., including the momentum in SGHMC), which tends to infinity as , and is twice differentiable with bounded second derivatives and satisfies the following conditions:
is a Lyapunov function of the SDE, i.e., there exists constants , such that for , we have .
Similar to , (31) is an extra condition that needs to be satisfied, and (32) is more subtle and needs more assumptions to verify in this case. We will not address these issues because it is out of the scope of the paper.
Appendix D The Proof of Theorem 2
For an SG-MCMC with a th-order integrator, according to Definition 1 and (3), we have:
Divide both sides by , use the Poisson equation (4), and reorganize terms. We have:
where the last equation in (36) is obtained by substituting (35) into it and collecting low order terms. By induction on , it is easy to show that for , we have:
for some . According to the assumption, the term is bounded. As a result, collecting low order terms, the bias can be expressed as:
where the last equation follows from the finiteness assumption of , denotes the operator norm and is bounded in the space of due to the assumptions. This completes the proof. ∎
Appendix E The Proof of Theorem 3
Sum over from 1 to and simplify, we have:
Substitute the Poisson equation (4) into the above equation, divide both sides by and rearrange related terms, we have
where is a -dimensional independent Gaussian random variables.
Using the relation (E), for the solution of the Poisson equation (4) applied on , we can bow expand it up to 3 orders from the Taylor theory:
where .
Similarly, for and , using the assumptions in the theory, we have
The expectation of can be similarly bounded. Collecting low order terms, we have
Substitute (43) into (E) we can bound the MSE as:
Appendix F The Proof of Theorem 4
Because the splitting scheme is geometric ergodic, for a test function , from the ergodic theorem we have
for . Average over all the samples and let approach to , we have
where (47) follows by using the result from Theorem 2. This completes the proof. ∎
Appendix G The Proof of Theorem 5
We separate the proof into proofs for the bias and MSE respectively in the following.
Following Theorem 2, in the decreasing step size setting, (D) can be written as:
Similarly, (34) can be simplified using the step size sequence as:
Similar to the derivation of (37), we can derive the following bounds :
Substitute (G) into (G) and collect low order terms, we have:
As a result, the bias can be expressed as:
Taking , both terms go to zero by assumption. This completes the proof. ∎
The proof for the MSE:
Following similar derivations as in Theorem 3, we have that
Substitute the Poisson equation (4) into the above equation and divided both sides by , we have
As a result, there exists some positive constant , such that:
for some , this completes the first part of the theorem. We can see that according to the assumption, the last two terms in (G) approach to 0 when . If we further assume , then the first term in (G) approaches to 0 because:
Appendix H The Proof of Corollary 6
We use the following inequalities to bound the term :
It is then easy to see that the condition for is . Moreover, we notice that other step size assumptions reduce to compare and for , which using (53) has the following bound:
As long as and , the above lower and upper bound would approach to 0, thus all the assumptions for the step size sequences are satisfied. ∎
Appendix I On the Euler Integrator and Symmetric Splitting Integrator
We first review the Euler scheme used in SGLD and SGHMC. In SGLD the update for () follows:
where is the step size, is a vector of i.i.d. standard normal random variables. In SGHMC (), it becomes:
We show in the following Lemma that the Euler integrator is a 1st-order local integrator.
The Euler integrator is a 1st-order local integrator, i.e.,
For the SGLD, according to the Kolmogorov’s backward equation (27), for the SGLD, we have
Substituting the above into (57) and use the definition (54), we have
For the SGHMC, following similar derivations, we have:
Now using the Baker–Campbell–Hausdorff (BCH) formula, we have
I.2 Symmetric splitting integrator
These sub-generators correspond to the following analytically solvable SDEs:
We show that the corresponding integrator is a 2nd-order local integrator below.
The symmetric splitting integrator is a 2nd-order local integrator, i.e.,
This follows from direct calculation using the BCH formula. Specifically,
where is the commutator of and , (59) follows from the BCH formula, and (60) follows from Assumption 1 such that the remainder high order terms are bounded , so the error term can be taken out from the exponential map using Taylor expansion. Similarly, for the other composition, we have
Appendix J Mean Flow Error Analysis
First, applying Kolmogorov’s backward equation on the original SDE (1) with generator , the true mean flow can be expressed as:
Now we want to compute the mean flow of the splitting scheme: . We will split the SDE into several parts, with the Brownian motion term going with the stochastic gradient term. To shown the proof on a different SG-MCMC algorithm, we use the SGHMC with Riemannian information geometry (SGRHMC) defined below. Other stochastic gradient MCMC follows similarly. For the SGRHMC, we have
The splitting scheme we consider is the BAOAB scheme. Denote
Note that . In the stochastic gradient case, we are using the stochastic gradient from the -th minibatch in the splitting scheme, thus we need to modify the operator as:
Similar to the proof of the symmetric splitting error, we can calculate the composition of the mean flows for two mini-batches and using the BCH formula as:
where we have used the fact that commutes with each other to cancel out the term in the BCH formula. Similarly, for the first three mini-batches , we have
Similarly, we can do the composition for the entire trajectory, resulting in after simplification:
This completes the first part of the theorem. From Assumption 1, we can expand and bound (J) with the step size for finite time as:
Similarly, for the true mean flow , it is easy to get
Appendix K Additional Experiments
We plot the traces of bias and MSE with step size for different rates in Figure 5. We can see that when the rates are smaller than the theoretically optimal bias rates and MSE rate , the bias and MSE tend to decrease faster than the optimal rates at the beginning, but eventually they slow down and are surpassed by the optimal rates. This on the other hand suggests if only a small number of iterations were available in the SG-MCMCs, setting a larger step size than the theoretically optimal one might be beneficial in practice.
In addition, Figure 6 shows a comparison of the bias and MSE for SGHMC and SGLD. The step sizes are set to , with choosing according to the theory for SGLD and SGHMC respectively. To be fair, the constants are selected via a grid search from 1e-3 to 0.5 with an interval of 2e-3 for , it is then fixed during other values. The parameter in SGHMC is selected within as well. As indicated by both our theorems and experiments, SGHMC endows a much faster convergence speed than SGHMC on both the bias and MSE.
Figure 7 plots the traces of bias and MSE with decreasing step sizes for different rates in the same Gaussian model. Again we can see that the optimal decreasing rates agree with the theory. Figure 8 shows a comparison of bias and MSE for SGHMC and SGLD with decreasing step sizes on the same Gaussian model. We follow the same procedure as in Section 3.1 to select parameters for SGLD and SGHMC. Specifically, the decreasing rate parameter is set to and in SGLD and SGHMC for the bias, and for the MSE. We can see that SGHMC still obtain a faster convergence speed, though the benefit is not as large as using fix step size.
K.2 LDA & SBN
We first the list quantitative results of the LDA and SBN models in Table 1. It is clear that in both models the SGHMC is much better than the SGLD due to the introduction of momentum variables in the dynamics (similar to the SGD with momemtum in the optimization literature); and the splitting integrator also works better than the Euler integrator due to the higher order errors in splitting integrators. For a fair comparison, we did not consider a better version of the SGLD with Riemannian information geometry of posterior distributions on probabilistic simplexes .
Next a plot of the test perplexities decreasing with the number of documents processed for the whole dataset is given in Figure 9 (top), for a comparison of the Euler integrator and the proposed symmetric splitting integrator. We can see that the symmetric splitting integrator decreases faster than the Euler integrator. Furthermore, the dictionary learned by the SGHMC with the symmetric splitting integrator is also given in Figure 9 (bottom).