Covariance-Controlled Adaptive Langevin Thermostat for Large-Scale Bayesian Sampling

Xiaocheng Shang, Zhanxing Zhu, Benedict Leimkuhler, Amos J. Storkey

Introduction

In machine learning applications, direct sampling with the entire large-scale dataset is computationally infeasible. For instance, standard Markov chain Monte Carlo (MCMC) methods , as well as typical hybrid Monte Carlo (HMC) methods , require the calculation of the acceptance probability and the creation of informed proposals based on the whole dataset.

In order to improve the computational efficiency, a number of stochastic gradient methods have been proposed in the setting of Bayesian sampling based on random (and much smaller) subsets to approximate the likelihood of the whole dataset, thus substantially reducing the computational cost in practice. Welling and Teh proposed the so-called stochastic gradient Langevin dynamics (SGLD) , combining the ideas of stochastic optimization and traditional Brownian dynamics, with a sequence of stepsizes decreasing to zero. A fixed stepsize is often adopted in practice which is the choice in this article as in Vollmer et al. , where a modified SGLD (mSGLD) was also introduced that was designed to reduce the sampling bias.

SGLD generates samples from first order Brownian dynamics, and thus, with a fixed timestep, one can show that it is unable to dissipate excess noise in gradient approximations while maintaining the desired invariant distribution . A stochastic gradient Hamiltonian Monte Carlo (SGHMC) method was proposed by Chen et al. , which relies on second order Langevin dynamics and incorporates a parameter-dependent diffusion matrix that is intended to effectively offset the stochastic perturbation of the gradient. However, it is difficult to accommodate the additional diffusion term in practice. Moreover, as pointed out in , poor estimation of it may have a significant adverse influence on the sampling of the target distribution; for example, the effective system temperature may be altered.

The “thermostat” idea, which is widely used in molecular dynamics , was recently adopted in the stochastic gradient Nosé–Hoover thermostat (SGNHT) by Ding et al. in order to adjust the kinetic energy during simulation in such a way that the canonical ensemble is preserved (i.e. so that a prescribed constant temperature distribution is maintained). In fact, the SGNHT method is essentially equivalent to the adaptive Langevin (Ad-Langevin) thermostat proposed earlier by Jones and Leimkuhler in the molecular dynamics setting (see for discussions).

Despite the substantial interest generated by these methods, the mathematical foundation for stochastic gradient methods has been incomplete. The underlying dynamics of the SGNHT method was taken up by Leimkuhler and Shang , together with the design of discretization schemes with high effective order of accuracy. SGNHT methods are designed based on the assumption of constant noise variance. In this article, we propose a covariance-controlled adaptive Langevin (CCAdL) thermostat, that can handle parameter-dependent noise, improving both robustness and reliability in practice, and which can effectively speed up the convergence to the desired invariant distribution in large-scale machine learning applications.

The rest of the article is organized as follows. In Section 2, we describe the setting of Bayesian sampling with noisy gradients and briefly review existing techniques. Section 3 considers the construction of the novel CCAdL method that can effectively dissipate parameter-dependent noise while maintaining the correct distribution. Various numerical experiments are performed in Section 4 to verify the usefulness of CCAdL in a wide range of large-scale machine learning applications. Finally, we summarize our findings in Section 5.

Bayesian Sampling with Noisy Gradients

In the typical setting of Bayesian sampling , one is interested in drawing states from a posterior distribution defined as

Assuming the data are independent and identically distributed (i.i.d.), the logarithm of the likelihood can be calculated as

where NN is the size of the entire dataset.

However, as already mentioned, it is computationally infeasible to deal with the entire large-scale dataset at each timestep as would typically be required in MCMC and HMC methods. Instead, in order to improve the efficiency, a random (and much smaller, i.e. nNn\ll N) subset is preferred in stochastic gradient methods, in which the likelihood of the dataset for given parameters is approximated by

where {xri}i=1n\{\mathbf{x}_{r_{i}}\}^{n}_{i=1} represents a random subset of X\mathbf{X}. Thus, the “noisy” potential energy can be written as

Our goal is to correctly sample the Gibbs distribution ρ(θ)exp(βU(θ))\rho(\bm{\theta})\propto\exp(-\beta U(\bm{\theta})) (1). As in , the gradient noise is assumed to be Gaussian with mean zero and unknown variance, in which case one may rewrite the noisy force as

where M{{\mathbf{M}}} typically is a diagonal matrix, Σ(θ)\bm{\Sigma}(\bm{\theta}) represents the covariance matrix of the noise, and, R\mathbf{R} is a vector of i.i.d. standard normal random variables. Note that Σ(θ)M1/2R\sqrt{\mathbf{\Sigma}(\bm{\theta})}{{\mathbf{M}}}^{1/2}\mathbf{R} here is actually equivalent to N(0,Σ(θ)M)\mathcal{N}\left({\mathbf{0}},\mathbf{\Sigma}(\bm{\theta}){{\mathbf{M}}}\right).

In a typical setting of numerical integration with associated stepsize hh, one has

and therefore, assuming a constant covariance matrix (i.e. Σ=σ2I\bm{\Sigma}=\sigma^{2}{{\mathbf{I}}}, where I{{\mathbf{I}}} is the identity matrix), the SGNHT method by Ding et al. has the following underlying dynamics, written as a standard Itō stochastic differential equation (SDE) system :

is below the target temperature, the “dynamical friction” ξ\xi would decrease allowing an increase of temperature, while ξ\xi would increase when the temperature is above the target. μ\mu is a coupling parameter which is referred to as the “thermal mass” in the molecular dynamics setting.

Proposition 1 (See Jones and Leimkuhler ). The SGNHT method (8) preserves the modified Gibbs (stationary) distribution:

where ZZ is the normalizing constant, H(θ,p)=pTM1p/2+U(θ)H(\bm{\theta},{{\mathbf{p}}})={{\mathbf{p}}}^{T}{{\mathbf{M}}}^{-1}{{\mathbf{p}}}/2+U(\bm{\theta}) is the Hamiltonian, and

Proposition 1 tells us that the SGNHT method can adaptively dissipate excess noise pumped into the system while maintaining the correct distribution. The variance of the gradient noise, σ2\sigma^{2}, does not need to be known a priori. As long as σ2\sigma^{2} is constant, the auxiliary variable ξ\xi will be able to automatically find its mean value ξˉ\bar{\xi} on the fly. However, with a parameter-dependent covariance matrix Σ(θ)\mathbf{\Sigma}(\bm{\theta}), the SGNHT method (8) would not produce the required target distribution (10).

Ding et al. claimed that it is reasonable to assume the covariance matrix Σ(θ)\mathbf{\Sigma}(\bm{\theta}) is constant when the size of the dataset, NN, is large, in which case the variance of the posterior of θ\bm{\theta} is small. The magnitude of the posterior variance does not actually relate to the constancy of the Σ\mathbf{\Sigma}, however, in general, Σ\mathbf{\Sigma} is not constant. Simply assuming the nonconstancy of the Σ\mathbf{\Sigma} can have a significant impact on the performance of the method (most notably the stability measured by the largest usable stepsize). Therefore, it is essential to have an approach that can handle parameter-dependent noise. In the following section, we propose a covariance-controlled thermostat that can effectively dissipate parameter-dependent noise while maintaining the target stationary distribution.

Covariance-Controlled Adaptive Langevin Thermostat

As mentioned in the previous section, the SGNHT method (8) can only dissipate noise with a constant covariance matrix. When the covariance matrix becomes parameter-dependent, in general, a parameter-dependent covariance matrix does not imply the required “thermal equilibrium”, i.e. the system cannot be expected to converge to the desired invariant distribution (10), typically resulting in poor estimation of functions of parameters of interest. In fact, in that case it is not clear whether or not there exists an invariant distribution at all.

In order to construct a stochastic-dynamical system that preserves the canonical distribution, we suggest adding a suitable damping (viscous) term to effectively dissipate the parameter-dependent gradient noise. To this end, we propose the following covariance-controlled adaptive Langevin (CCAdL) thermostat:

Proposition 2. The CCAdL thermostat (12) preserves the modified Gibbs (stationary) distribution:

The Fokker–Planck equation corresponding to (12) is

Just insert ρ^β\hat{\rho}_{\beta} (13) into the Fokker–Planck operator L{\cal L}^{\dagger} to see that it vanishes. ∎

The incorporation of the parameter-dependent covariance matrix Σ(θ)\bm{\Sigma}(\bm{\theta}) in (12) is intended to offset the covariance matrix coming from the gradient approximation. However, in practice, one does not know Σ(θ)\bm{\Sigma}(\bm{\theta}) a priori. Thus instead one must estimate Σ(θ)\bm{\Sigma}(\bm{\theta}) during the simulation, a task which will be addressed in Section 3.1. This procedure is related to the method used in the SGHMC method proposed by Chen et al. , which uses dynamics of the following form:

It can be shown that the SGHMC method preserves the Gibbs canonical distribution:

Although both CCAdL (12) and SGHMC (14) preserve their respective invariant distributions, let us note several advantages of the former over the latter in practice:

CCAdL and SGHMC both require estimation of the covariance matrix Σ(θ)\bm{\Sigma}(\bm{\theta}) during simulation, which can be costly in high dimension. In numerical experiments, we have found that simply using the diagonal of the covariance matrix, at significantly reduced computational cost, works quite well in CCAdL. By contrast, it is difficult to find a suitable value of the parameter AA in SGHMC since one has to make sure the matrix AIβhΣ(θ)/2A{{\mathbf{I}}}-\beta h\bm{\Sigma}(\bm{\theta})/2 is positive semidefinite. One may attempt to use a large value of the “effective friction” AA and/or a small stepsize hh. However, too-large a friction would essentially reduce SGHMC to SGLD, which is not desirable, as pointed out in , while extremely small stepsize would significantly impact the computational efficiency.

Estimation of the covariance matrix Σ(θ)\bm{\Sigma}(\bm{\theta}) unavoidably introduces additional noise in both CCAdL and SGHMC. Nonetheless, CCAdL can still effectively control the system temperature (i.e. maintaining the correct distribution of the momenta) due to the use of the stabilizing Nosé–Hoover control, while in SGHMC, poor estimation of the covariance matrix may lead to significant deviations of the system temperature (as well as the distribution of the momenta), resulting in poor sampling of the parameters of interest.

Under the assumption that the noise of the stochastic gradient follows a normal distribution, we apply a similar method to that of to estimate the covariance matrix associated with the noisy gradient. If we let g(θ;x)=θlogπ(xθ)g(\bm{\theta};\mathbf{x})=\nabla_{\bm{\theta}}\log\pi(\mathbf{x}|\bm{\theta}) and assume that the size of subset nn is large enough for the central limit theorem to hold, we have

i.e. Σ(θt)=N2It/n\bm{\Sigma}(\bm{\theta}_{t})=N^{2}\mathbf{I}_{t}/n. Assuming θt\bm{\theta}_{t} does not change dramatically over time, we use the moving average update to estimate It\mathbf{I}_{t}:

is the empirical covariance of the gradient. gˉ(θt)\bar{g}(\bm{\theta}_{t}) represents the mean gradient of the log likelihood computed from a subset. As proved in , this estimator has a convergence order of O(1/N)\mathcal{O}(1/N).

As already mentioned, estimating the full covariance matrix is computationally infeasible in high dimension. However, we have found that employing a diagonal approximation of the covariance matrix (i.e. estimating the variance only along each dimension of the noisy gradient) works quite well in practice, as demonstrated in Section 4.

Note that this is a simple, first order (in terms of the stepsize) algorithm. A recent article has introduced higher order of accuracy schemes which can improve accuracy, but our interest here is in the direct comparison of the underlying machinery of SGHMC, SGNHT, and CCAdL, so we avoid further modifications and enhancements related to timestepping at this stage.

In the following section, we compare the newly established CCAdL method with SGHMC and SGNHT on various machine learning tasks to demonstrate the benefits of CCAdL in Bayesian sampling with a noisy gradient.

Numerical Experiments

where xˉ=i=1Nxi/N\bar{\mathbf{x}}=\sum^{N}_{i=1}\mathbf{x}_{i}/N. A random subset of size n=10n=10 was selected at each timestep to approximate the full gradient, resulting in the following stochastic gradients:

It can be seen that the variance of the stochastic gradient noise is no longer constant and actually depends on the size of the subset, nn, and the values of μ\mu and γ\gamma in each iteration. This directly violates the constant noise variance assumption of SGNHT , while CCAdL adjusts to the varying noise variance.

The marginal distributions of μ\mu and γ\gamma obtained from various methods with different combinations of hh and AA were compared and plotted in Figure 1, with Table 1 consisting of the corresponding root mean square error (RMSE) of the distribution and autocorrelation time from 10610^{6} samples. In most of the cases, both SGNHT and CCAdL easily outperform the SGHMC method possibly due to the presence of the Nosé–Hoover device, with SGHMC only showing superiority with small value of hh and large value of AA, neither of which is desirable in practice as discussed in Section 3. Between SGNHT and the newly proposed CCAdL method, the latter achieves better performance in each of the cases investigated, highlighting the importance of the covariance control with parameter-dependent noise.

2 Large-Scale Bayesian Logistic Regression

We then consider a Bayesian logistic regression model trained on the benchmark MNIST dataset for binary classification of digits 7 and 9 using 12,21412,214 training data points, with a test set of size 2037. A 100-dimensional random projection of the original features was used. We used the likelihood function of π({xi,yi}i=1Nw)i=1N1/(1+exp(yiwTxi))\pi\left(\{{{\mathbf{x}}}_{i},y_{i}\}_{i=1}^{N}|\mathbf{w}\right)\propto\prod_{i=1}^{N}1/\left(1+\exp(-y_{i}\mathbf{w}^{T}\mathbf{x}_{i})\right) and the prior distribution of π(w)exp(wTw/2)\pi(\mathbf{w})\propto\exp(-\mathbf{w}^{T}\mathbf{w}/2). A subset of size n=500n=500 was used at each timestep. Since the dimensionality of this problem is not that high, a full covariance estimation was used for CCAdL.

We investigate in Figure 2 (top row) the convergence speed of each method through measuring test log likelihood using the posterior mean against the number of passes over the entire dataset. CCAdL displays significant improvements over SGHMC and SGNHT with different values of hh and AA: (1) CCAdL converges much faster than the other two, which also indicates its faster mixing speed and shorter burn-in period; (2) CCAdL shows robustness in different values of the effective friction AA, with SGHMC and SGNHT relying on a relative large value of AA (especially for the SGHMC method), which is intended to dominate the gradient noise.

To compare the sample quality obtained from each method, Figure 2 (bottom row) plots the two-dimensional marginal posterior distribution in randomly selected dimensions of 2 and 5 based on 10610^{6} samples from each method after the burn-in period (i.e. we start to collect samples when the test log likelihood stabilizes). The true (reference) distribution was obtained by a sufficiently long run of standard HMC. We implemented 10 runs of standard HMC and found there was no variation between these runs, which guarantees its qualification as the true (reference) distribution. Again, CCAdL shows much better performance than SGHMC and SGNHT. Note that the contour of SGHMC does not even fit in the region of the plot, and in fact it shows significant deviation even in the estimation of the mean.

3 Discriminative Restricted Boltzmann Machine (DRBM)

DRBM is a self-contained non-linear classifier, and the gradient of its discriminative objective can be explicitly computed. Due to the limited space, we refer the readers to for more details. We trained a DRBM on different large-scale multi-class datasets from the LIBSVMhttp://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multiclass.html dataset collection, including connect-4, letter, and SensIT Vehicle acoustic. The detailed information of these datasets are presented in Table 2.

The error rates computed by various methods on the test set using the posterior mean against the number of passes over the entire dataset were plotted in Figure 3. It can be observed that SGHMC and SGNHT only work well with a large value of the effective friction AA, which corresponds to a strong random walk effect and thus slows down the convergence. On the contrary, CCAdL works reliably (much better than the other two) in a wide range of AA, and more importantly in the large stepsize regime, which speeds up the convergence rate in relation to the computational work performed. It can be easily seen that the performance of SGHMC heavily relies on using a small value of hh and a large value of AA, which significantly limits its usefulness in practice.

Conclusions and Future Work

In this article, we have proposed a novel CCAdL formulation that can effectively dissipate parameter-dependent noise while maintaining a desired invariant distribution. CCAdL combines ideas of SGHMC and SGNHT from the literature, but achieves significant improvements over each of these methods in practice. The additional error introduced by covariance estimation is expected to be small in a relative sense, i.e. substantially smaller than the error arising from the noisy gradient. Our findings have been verified in large-scale machine learning applications. In particular, we have consistently observed that SGHMC relies on a small stepsize hh and a large friction AA, which significantly reduces its usefulness in practice as discussed. The techniques presented in this article could be of use in more general settings of large-scale Bayesian sampling and optimization, which we leave for future work.

A naive nonsymmetric splitting method has been applied for CCAdL for fair comparison in this article. However, we point out that optimal design of splitting methods in ergodic SDE systems has been explored recently in the mathematics community . Moreover, it has been shown in that a certain type of symmetric splitting method for the Ad-Langevin/SGNHT method with a clean (full) gradient inherits the superconvergence property (i.e. fourth order convergence to the invariant distribution for configurational quantities) recently demonstrated in the setting of Langevin dynamics . We leave further exploration of this direction in the context of noisy gradients for future work.

Acknowledgements

XS and ZZ gratefully acknowledge the financial support from the University of Edinburgh and China Scholarship Council.

References