Non-asymptotic mixing of the MALA algorithm

Nawaf Bou-Rabee, Martin Hairer, Eric Vanden-Eijnden

Introduction

The Metropolis-Adjusted Langevin Algorithm (MALA), originally proposed by Roberts and Tweedie [RoTw1996A, RoTw1996B], is a technique to sample exactly complex, high-dimensional probability distributions. MALA fits the general framework of the Metropolis-Hastings method [MeRoRoTeTe1953, Ha1970] and can be viewed as a special case of smart and hybrid Monte-Carlo algorithms [RoDoFr1978, DuKePeRo1987]. The main idea of MALA is to obtain the proposal moves from the forward Euler discretization of an SDE whose invariant measure is the target distribution one seeks to sample. Besides being ergodic with respect to this invariant measure by construction, it was shown recently that MALA also captures the dynamical behavior of the solutions to the SDE [BoVa2010A]. Therefore MALA has the nice feature that it can be used to estimate finite time dynamical properties along infinitely long trajectories of ergodic SDEs.

Still, one issue with MALA is its theoretical rate of convergence, see for example [RoTw1996B, CaWaGuMySe2008]. When applied to measures with tails that are lighter than Gaussian, it is known that MALA does not exhibit a geometric rate of convergence to equilibrium even though the exact solution to the SDE does. The main reason is that the proposal moves generated by forward Euler are not globally stable. Indeed for any time-stepsize one can find an energy value above which the drift in forward Euler gives proposed moves that increase the energy, in contrast to the exact drift in the SDE which always centers the solution towards lower energy values. Since higher energy values have a lower equilibrium probability weight, these proposed moves are typically rejected. While these rejections ensure that MALA is ergodic, at high energy values they prevent MALA from having a spectral gap.

The question we investigate in this paper is how severe this problem is in practical applications. Above we have argued that the main cause of the lack of geometric convergence is the behavior of the chain at high energy values. Since the chain is unlikely to reach such high energy values over finite time horizons, one does not expect their influence to be significant. In practice, it is the behavior of MALA on finite but very long times that is of interest, since this behavior is what one would experience when running the algorithm on a computer. The goal of this article is to quantify the non-asymptotic behavior of MALA.

The main result of this paper states that the convergence of MALA to its equilibrium distribution happens at exponential rate up to terms exponentially small in time-stepsize. This can be formulated in the following way, and will later be reformulated rigorously as Theorem 3.1:

Observe from (1) that the distance of MALA to equilibrium is bounded by the sum of two terms. The first term converges to exponentially fast and essentially gives the speed of convergence to equilibrium for the exact solution to the underlying SDE. The second term on the other hand remains bounded away from as $k\to\infty$ . This term arises from the lack of a spectral gap in MALA, but its important feature is that it is exponentially small in $h$ . Therefore, its importance will be negligible in applications for most practical purposes.

The crux of the proof is the demonstration that MALA inherits some of the convergence properties of the solution to the underlying SDE up to exponentially small terms. This proof relies on finite time accuracy of MALA, ergodicity of MALA with respect to the exact equilibrium measure of the SDE, and an application of Harris’ theorem. In fact, if MALA did not exactly preserve the equilibrium measure of the SDE, the second term in (1) would not be exponentially small in the time-stepsize. For example, if MALA was replaced simply by the uncorrected Euler approximations to the SDE, then one would expect the size of the error term to be ${\mathcal{O}}(h)$ .

The estimate (1) does not imply that MALA does not converge to the equilibrium of the SDE. In fact, it is known [RoTw1996B] that the TV distance between MALA and the equilibrium measure vanishes in the limit as $k\to\infty$ . However, this asymptotic property provides no insight on the nonasymptotic behavior of MALA which is the main focus of this paper. In fact, even though the upper bound in (1) does not converge to zero in the limit $k\to\infty$ , it is the sharpest known bound on finite time intervals.

The power $1/4$ in the exponentially small term in (1) is due to the second-order weak accuracy of the proposal moves generated by the forward Euler scheme, and the conditions we impose on the potential energy. In particular, it can be traced back to the appearance of the factor $U^{4}({\boldsymbol{x}})$ appearing in the statement of Lemma LABEL:MALAphiaccuracy. Under the assumptions made in this paper, this power is sharp.

At the technical level, the main novelty of the proof of our result is twofold. First, we prove finite-time accuracy of MALA in the total variation norm in our setting. While accuracy in total variation of the forward Euler algorithm is known [BaTa1995], it is essential for our analysis to cover situations where the drift of the underlying SDE is not globally Lipschitz continuous. Furthermore, we need to keep track of the dependency of the error estimates with respect to the initial condition. The main idea for this result is to first obtain an error estimate in some weaker Wasserstein distance, and then to strengthen this into a total variation estimate by making use of the regularising properties of the one-step transition probabilities of the forward Euler algorithm. Second, we show that on a very large set, MALA admits a Lyapunov function of the type $\Phi({\boldsymbol{x}})=\mathop{\hbox{\rm{exp}}}\nolimits\bigl{(}\theta U({\boldsymbol{x}})\bigr{)}$ for suitable $\theta>0$ . Since $U$ is allowed to grow much faster than quadratically at infinity, this Lyapunov function fails to be integrable with respect to any Gaussian measure, including of course the transition probabilities of forward Euler. While this leads to technical complications, having such a fast-growing Lyapunov function is a crucial ingredient of our proof, as this is the key to obtaining bounds that are exponentially small in $h$ .

The remainder of this paper is organized as follows. In Section 2, we will state the main assumptions required for the proof of our main result. Along the way, we recall that MALA is ergodic. In Section 3, the proof of the main result is provided. This proof relies crucially on comparison with a ‘patched’ MALA algorithm, where the chain is reflected at the boundaries of a large level set. The accuracy of this patched algorithm is investigated in Section LABEL:app:MALAtvaccuracy. Finally, Section LABEL:app:localdriftcondition shows that $\Phi$ is a Lyapunov function for the MALA algorithm (at least on a large domain), which provides the strong a priori bounds required for our analysis.

We wish to acknowledge stimulating discussions with Jianfeng Lu, Gareth Roberts, Andrew Stuart, and Jonathan Weare. The research of NBR was supported in part by NSF Fellowship DMS-0803095. The research of MH was supported by an EPSRC Advanced Research Fellowship EP/D071593/1 and a Royal Society Wolfson Research Merit Award. The research of EVE was supported in part by NSF grants DMS-0718172 and DMS-0708140, and ONR grant N00014-04-1-6046.

A short overview of the MALA algorithm

Throughout this article, we will make the following assumptions on the potential energy. Not all of these assumptions will be required for every statement, but we find it notationally convenient to have a single set of assumptions to refer to.

One has $U({\boldsymbol{x}})\geq 1$ and, for any $C>0$ there exists an $E>0$ such that

It follows immediately from Assumption 2.1 (A) above that exists a constant $E_{c}>0$ such that {equ}[e:boundHigh] μ({U(x) ≥E}) ≤e^-βE 2 , for all $E>E_{c}$ . Indeed, it suffices to note that {equs} μ({U(x) ≥E}) &= 1Z∫_U(x) ≥E e^- βU(x) dx ≤e^-3 βE/4Z ∫_U(x) ≥E e^- βU(x)/4 dx ≤C e^-3 βE/4 ¡ e^- βE/2 , where the second to last inequality follows from point (A) above, and the last inequality holds for $E$ sufficiently large.

The only place where we actually use the fact that $U({\boldsymbol{x}})$ grows like $|{\boldsymbol{x}}|^{2}$ is in the proof of Lemma LABEL:MALAphiaccuracy below. On the other hand, the statement of that approximation result would certainly be true also for potentials that grow slower at $\infty$ . However, such potentials would not be of interest for the present work. Indeed, if the potential grows slower than $|{\boldsymbol{x}}|^{2}$ and no slower than $|{\boldsymbol{x}}|$ , then MALA can be shown to be exponentially ergodic, so that the results in this article would be superfluous. If the potential grows slower than $|{\boldsymbol{x}}|$ , then MALA will not be exponentially ergodic because the true solution of the SDE will not be either.

Assumption 2.1 (C) is equivalent to the existence of $C>0$ such that $\nabla U$ satisfies the one-sided Lipschitz property

All of these conditions are satisfied, for example, if $U$ is smooth and $U(\boldsymbol{x})\approx|\boldsymbol{x}|^{\alpha}$ with $\alpha>2$ for large values of $\boldsymbol{x}$ . However, they also allow for potentials that have very asymmetric growth at infinity, and they even allow for the potential to grow at exponential speed. As a consequence of Assumption 2.1 (B), one has the following drift condition on the transition probability of the solution.

Using the specific form of ${\mathcal{L}}$ , it follows that for $U({\boldsymbol{x}})>u_{0}$ , we have {equs} L(Θ∘U) &= (Θ’ ∘U) LU + 1β (Θ” ∘U) —∇U—^2 = ((Θ” ∘U ) β - (Θ’ ∘U )) —∇U—^2 + (Θ’ ∘U)β ΔU ≤1β ( (Θ” ∘U) - (β- c)(Θ’ ∘U)) —∇U—^2 -dβ (Θ’ ∘U) U ≤-dαβ (Θ∘U) . The result then follows at once from the fact that the condition $u\Theta^{\prime}(u)>\alpha\Theta(u)$ implies that $\Theta(u)\to\infty$ as $u\to\infty$ . ∎

As a consequence of the ellipticity of the SDE (2), one has the following minorization condition on the solution’s transition probability.

For every $t>0$ and $E>0$ , there exists $\epsilon>0$ such that

for all $\boldsymbol{x},\boldsymbol{y}$ satisfying $U(\boldsymbol{x})\vee U(\boldsymbol{y})<E$ .

Here and in the sequel, the total variation distance between two probability measures is defined as {equ} ∥μ- ν∥_TV= 2sup_A —μ(A) - ν(A)— , where the supremum runs over all measurable sets. In particular, the total variation distance between two probability measures is two if and only if they are mutually singular.

It follows from the ellipticity of the equations that there exists a function $q(t,\boldsymbol{x},\boldsymbol{y})$ smooth in all of its arguments (for $t>0$ ) such that the transition probabilities are given by $Q_{t}(\boldsymbol{x},d\boldsymbol{y})=q(t,\boldsymbol{x},\boldsymbol{y})\,d\boldsymbol{y}$ . Furthermore, $q$ is strictly positive (see, e.g., Lemma 2.2 of [Ta2002]). Hence, by the compactness of the set $\{\boldsymbol{x}\,:\,U(\boldsymbol{x})<E\}$ , one can find a probability measure $\eta$ and a constant $\epsilon>0$ such that,

for any $\boldsymbol{x}$ satisfying $\Phi(\boldsymbol{x})<E$ . Therefore,

for all $\boldsymbol{x},\boldsymbol{y}$ satisfying $U(\boldsymbol{x})\vee U(\boldsymbol{y})<E$ . Since the TV norm is bounded by $2$ , one obtains the desired result. ∎

Harris’ theorem can now be invoked to conclude the transition probability of the true solution converges at a geometric rate to its equilibrium measure. For the reader’s convenience, we state the precise version used in this article. For a proof, see the monograph [MeTw2009], or [HaMa2008] for a shorter and somewhat more constructive version. Harris’ theorem essentially states that if a Markov chain ${\mathcal{P}}$ on an arbitrary (Polish) state space ${\mathcal{X}}$ admits a Lyapunov function such that its sublevel sets are ‘small’, then it is exponentially ergodic. More precisely, Harris’ theorem applies to any Markov chain that satisfies the following assumptions:

for all $\boldsymbol{x}\in{\mathcal{X}}$ .

There exists a constant $\alpha\in(0,1)$ so that the Markov chain $\mathcal{P}$ satisfies

Note that in this statement, we have normalised the total variation distance between two probability measures in such a way that it is equal to $2$ if and only if the measures are mutually singular. One then has:

With this tool at hand, we obtain the following exponential ergodicity result for the solutions to (2):

According to Remark 2.6, for every $\theta\in(0,\beta-c)$ , $\mathop{\hbox{\rm{exp}}}\nolimits(\theta U)$ is a Lyapunov function for the Markov chain $Q_{t}$ . Moreover, by Lemma 2.7 it satisfies a minorization condition on every sublevel set of $U$ . Hence, Harris’ theorem implies that (LABEL:e:boundTV) holds. ∎

Next we recall some integration strategies for (2) and summarize their properties. In particular, we discuss to what extent these strategies preserve the geometric rate of convergence of the true solution.

2 Forward Euler

Hence, the chain is irreducible with respect to Lebesgue measure.

If $\nabla U$ is globally Lipschitz and $h$ is small enough, forward Euler (8) can be shown to be exponentially ergodic with respect to a probability distribution that is a first-order approximant to the equilibrium distribution of the SDE (2). This property is typically established using a Talay-Tubaro expansion of the global weak error of forward Euler [TaTu1990].

3 MALA Algorithm

A Metropolis-Hastings method is a Monte-Carlo method for producing samples from a known probability distribution [MeRoRoTeTe1953, Ha1970]. The method generates a Markov chain from a given proposal Markov chain as follows. A proposal move is computed according to the proposal chain and accepted with a probability that ensures the Metropolized chain is ergodic with respect to the given probability distribution. Here we shall focus on the Metropolized forward Euler integrator defined in terms of the equilibrium density $\pi$ (3) and the transition density $q_{h}$ (9).

Given a time-stepsize $h$ and input state $\boldsymbol{X}_{k}$ the algorithm calculates a proposal move using the forward Euler updating scheme in (8):

and accepts this proposal with a probability

Moreover, it is quite standard to show that MALA gives rise to an ergodic Markov chain. Indeed, denoting by $P_{h}$ the transition probabilities defined by (LABEL:MALA), one has

Let $U$ be a potential satisfying Assumption 2.1. For any $h>0$ the $k$ -step transition probability of MALA converges to $\mu$ in the total variation metric on probability measures, that is

If $\nabla U$ is globally Lipschitz and $h$ is small enough, MALA is geometrically ergodic (see Theorem 4.1 of[RoTw1996B]). However, if $\nabla U$ is nonglobally Lipschitz, MALA is not geometrically ergodic even though the solution to the SDE is (see Theorem 4.2 of[RoTw1996B]). Specifically, one can prove the following.

Let $U$ be a potential satisfying Assumption 2.1. If

then MALA operated at time-stepsize $h$ is not geometrically ergodic.

If (14) holds, the tail of the equilibrium density is no heavier than Gaussian. For example, if $U(x)=x^{4}/4$ then

In this case the theorem states MALA is not geometrically ergodic, in contrast to the true solution of the SDE. The main purpose of this article is to argue that, up to errors that are exponentially small in the time-step size $h$ , the convergence of the transition probabilities of MALA towards equilibrium still takes place at an exponential rate. The next section gives a precise statement of this result, as well as an overview of its proof.

Main Results

We now state and prove the main result of the paper. Throughout this section, $P_{h}$ will denote the one-step transition probabilities of the MALA algorithm as defined in Section 2.3 above. We will also use throughout this section the shorthand notation $P=P_{h}^{\lfloor 1/h\rfloor}$ for the evolution of MALA over one unit of ‘physical time’.

Let $U$ be a potential function satisfying Assumption 2.1, and let $P$ be as above. Then, there exists $\bar{\delta}\in(0,1)$ and, for every $E_{0}>0$ , there exist positive constants $C_{1}$ , $C_{2}$ , and $h_{c}(E_{0})$ such that MALA’s distance to stationarity satisfies

To introduce this patched version of MALA, set $R_{h}=\{\boldsymbol{x}:U(\boldsymbol{x})<E_{h}\}$ , where $E_{h}=E_{\star}h^{-1/4}$ for a constant $E_{\star}$ yet to be determined. The ‘patched MALA’ algorithm is then defined as a Metropolized version of forward Euler with a reflecting boundary condition at the boundary of $R_{h}$ . This boundary condition is enforced by setting the target distribution in MALA to be the equilibrium distribution $\mu$ conditional on being in $R_{h}$ . This distribution possesses the following density with respect to Lebesgue measure:

To be more precise, given a time-stepsize $h$ and input state $\bar{\boldsymbol{X}}_{k}\in R_{h}$ , the algorithm calculates a proposal move using the forward Euler updating scheme in (8):

and accepts this proposal with a probability

This proof relies on Lemmas 3.2 and 3.3 provided below. Using the triangle inequality, we bound the distance of $P^{k}$ to stationarity by {equs} ∥ P^k(x, ⋅) - μ∥_TV &≤∥ P^k(x, ⋅)- ¯P^k(x, ⋅) ∥_TV + ∥ ¯P^k(x, ⋅) - ¯μ ∥_TV + ∥ ¯μ - μ∥_TV =defI_1 + I_2 + I_3 . We now bound all three terms separately.

for all $h<h_{c}$ and every $\boldsymbol{x}$ satisfying $U(\boldsymbol{x})<E_{0}$ .

Lemma 3.3 bounds $I_{2}$ in (3) by using Harris’ theorem, Theorem 2.11. This lemma relies on a drift and minorization condition for patched MALA. The lemma states that patched MALA is exponentially ergodic, that is, for every $\bar{\delta}\in(\delta,1)$ and $E_{0}>0$ , there exist positive constants $C_{3}$ and $h_{c}$ such that

for all $h<h_{c}$ and for all $\boldsymbol{x}$ satisfying $U(\boldsymbol{x})<E_{0}$ .

To bound $I_{3}$ , we use the characterisation of $\bar{\mu}$ in (LABEL:e:defmubar) and the definition of the total variation distance, to get {equ}[3termbound] ∥ ¯μ - μ∥_TV = 2 μ(R_h^c) ≤2 e^-βE_h/2 , where we used Remark 2.2 to obtain the inequality.

Combining the bounds (18), (19) and (LABEL:3termbound) yields

Since the total variation distance between a Markov chain and its invariant measure is nonincreasing in the TV norm, the linear dependence on $k$ can be eliminated as follows. Set $k=\lceil h^{-1/4}\rceil$ in (20) to obtain:

Since $E_{h}\propto h^{-1/4}$ , there exist positive constants $C_{1}$ and $C_{2}$ such that

The next lemma bounds $I_{1}$ in (3) using the drift condition obtained in Lemma LABEL:patchedMALAdrift.

Provided that $E_{\star}$ is sufficiently small there exist positive constants $C_{1}$ , $C_{2}$ and $h_{c}$ such that

for all $\boldsymbol{x}\in R_{h}$ , every $h<h_{c}$ , and every $T>0$ .

The measures $P_{h}(\boldsymbol{x},\cdot)$ and $\bar{P}_{h}(\boldsymbol{x},\cdot)$ are not the same, even for a point $\boldsymbol{x}\in R_{h}$ , since their invariant distributions are different. In particular, patched MALA rejects all proposed moves to $R_{h}^{c}$ . However, if the input state and proposed move are in $R_{h}$ , the acceptance probabilities of the two chains are the same. Hence, if we initiate the two chains in $R_{h}$ , and drive them by the same realization of noise, we obtain a coupling between the two chains such that they are identical up until the first time MALA hits $R_{h}^{c}$ . Based on this observation, we obtain a bound on the total variation difference between the transition probabilities of the two chains in the following way.

Let $\{\boldsymbol{X}_{k}\}$ and $\{\bar{\boldsymbol{X}}_{k}\}$ be instances of the Markov chains with respective transition probabilities $P_{h}$ and $\bar{P}_{h}$ , driven by the same realization of the noise $\boldsymbol{W}$ , the same realisation of the acceptance variables $\zeta_{k}$ , and with identical initial conditions $\boldsymbol{X}_{0}=\bar{\boldsymbol{X}}_{0}=\boldsymbol{x}\in R_{h}$ . As argued above, we then have $\boldsymbol{X}_{k}=\bar{\boldsymbol{X}}_{k}$ for $k\leq n$ provided that the first time MALA hits $R_{h}^{c}$ is greater than $n$ . Let $\tau_{h}$ denote the first time that $\boldsymbol{X}_{k}$ hits $R_{h}^{c}$ . The coupling characterization of the total variation distance implies that, {equs} ∥ P_h^n(x, ⋅)- ¯P_h^n(x, ⋅) ∥_TV ≤2P^x(X_n ≠¯X_n) ≤2P^x( τ_h ≤n) .

At this stage, one of our main ingredients is the fact that the function $\Phi({\boldsymbol{x}})=\mathop{\hbox{\rm{exp}}}\nolimits\bigl{(}\theta U({\boldsymbol{x}})\bigr{)}$ is a Lyapunov function for the MALA algorithm, see Proposition LABEL:MALAlocaldrift below. The probability of MALA first hitting $R_{h}^{c}$ before time $n$ can therefore be expressed as {equ} P^x ( τ_h ≤n ) = ∑_k=1^n P^x( Φ(X_k) ¿ e^θE_h) ≤e^-θE_h ∑_k=1^n E^x Φ(X_k) where we made use of Chebychev’s inequality. We now note that we can apply Proposition LABEL:MALAlocaldrift since $E_{h}<h^{-1/2}$ for $h$ sufficiently small. Since $E_{h}=E_{\star}h^{-1/4}$ , we can make $E_{\star}$ sufficiently small so that there exists some $\bar{\gamma}>0$ such that {equ} E^x Φ(X_1)≤e^-¯γh Φ(x) + Kh . Combining this with the previous bound, we obtain {equ} P^x ( τ_h ≤n ) ≤e^-θE_h ∑_k=1^n ( e^-¯γkhΦ(x) + Kh 1-e-¯γh) .

Summing over $k$ and using the fact that $E_{h}\propto h^{-1/4}$ yields the existence of positive constants $C_{1}$ and $C_{2}$ such that

The following lemma proves a geometric rate of convergence for the Markov chain $\bar{P}$ . Recall $R_{h}=\{\boldsymbol{x}:U(\boldsymbol{x})<E_{h}\}$ . The key tool used is Harris’ theorem, Theorem 2.11.

For every $\bar{\delta}\in(\delta,1)$ , there exist positive constants $C$ and $h_{c}$ such that

for all $\boldsymbol{x}\in R_{h}$ and $h<h_{c}$ . In particular, $\bar{\delta}$ is independent of time-stepsize.

To prove this result, we use once again Harris’ theorem. The verification of its conditions for the Markov chain $\bar{P}$ is precisely the content of Lemmas 3.4 and LABEL:patchedMALAdrift below. ∎

In the next lemma, a minorization condition for patched MALA is derived using finite time accuracy of patched MALA in the TV norm (see Lemma LABEL:patchedMALAtvaccuracy).

Let $U$ be a potential function satisfying Assumption 2.1. Let $\epsilon$ be the constant appearing in the minorization condition of the true solution (see Lemma 2.7), and let $\bar{P}$ be as above. For every $E>0$ and $\bar{\epsilon}\in(0,\epsilon)$ , there exists a positive constant $h_{c}$ such that {equ}[e:bounddiffP] ∥ ¯P(x, ⋅) - ¯P(y, ⋅)∥_TV ≤2(1-¯ϵ) , for all $\boldsymbol{x},\boldsymbol{y}$ satisfying $U(\boldsymbol{x})\vee U(\boldsymbol{y})\leq E$ and $h\leq h_{c}$ .

According to Lemma 2.7, the bound (LABEL:e:bounddiffP) holds when $\bar{P}$ is replaced by $Q_{1}$ , the transition probability for the true solution $\boldsymbol{Y}$ at time one. Combining this with Lemma LABEL:patchedMALAtvaccuracy below, we thus obtain {equs} ∥ ¯P(x, ⋅) - ¯P(y, ⋅)∥_TV &≤2(1-ϵ) + 2sup_Φ(x) ≤E ∥ ¯P(x, ⋅) - Q_1(x, ⋅)∥