Non-convex learning via Stochastic Gradient Langevin Dynamics: a nonasymptotic analysis

Maxim Raginsky, Alexander Rakhlin, Matus Telgarsky

Introduction and informal summary of results

Consider a stochastic optimization problem

When the functions wf(w,z)w\mapsto f(w,z) are not convex, theoretical analysis of global convergence becomes largely intractable. On the other hand, non-convex optimization is currently witnessing an impressive string of empirical successes, most notably in the realm of deep neural networks. Towards the aim of bridging this gap between theory and practice, this paper provides a theoretical justification for Stochastic Gradient Langevin Dynamics (SGLD), a popular variant of stochastic gradient descent, in which properly scaled isotropic Gaussian noise is added to an unbiased estimate of the gradient at each iteration (Gelfand and Mitter, 1991; Borkar and Mitter, 1999; Welling and Teh, 2011).

Since the population distribution PP is unknown, we attempt to (approximately) minimize

It is well-recognized, however, that minimization of the empirical risk FzF_{\mathbf{z}} does not immediately translate into minimization of the population risk FF. A standard approach for addressing the issue is to decompose the excess risk into a sum of two terms, F(W^)Fz(W^)F(\widehat{W})-F_{\mathbf{z}}(\widehat{W}) (the generalization error of W^\widehat{W}) and Fz(W^)FF_{\mathbf{z}}(\widehat{W})-F^{*} (the gap between the empirical risk of W^\widehat{W} and the minimum of the population risk), and then show that both of these terms are small (either in expectation or with high probability). Taking W^=Wk\widehat{W}=W_{k} and letting W^\widehat{W}^{*} be the output of the Gibbs algorithm under which the conditional distribution of W^\widehat{W}^{*} given Z=z\mathbf{Z}=\mathbf{z} is equal to πz\pi_{\mathbf{z}}, we decompose the excess risk (1.1) as follows:

where the first term is the difference of expected population risks of SGLD and the Gibbs algorithm, the second term is the generalization error of the Gibbs algorithm, and the third term is easily upper-bounded in terms of expected suboptimality \mathbf{E}\big{(}F_{\mathbf{Z}}(\widehat{W}^{*})-\min_{w}F_{\mathbf{Z}}(w)\big{)} of the Gibbs algorithm for the empirical risk. Observe that only the first term pertains to SGLD, whereas the other two involve solely the Gibbs distribution. The main contribution of this work is in showing finite-time convergence of SGLD for a non-convex objective function. Informally, we can state our main result as follows:

For any ε>0\varepsilon>0, the first term in (1.5) scales as

where λ\lambda_{*} is a certain spectral gap parameter that governs the exponential rate of convergence of the Langevin diffusion to its stationary distribution. This spectral gap parameter itself might depend on β\beta and dd, but is independent of nn.

The second and third terms in (1.5) scale, respectively, as

Our analysis draws heavily on the theory of optimal transportation (Villani, 2003) and on the analysis of Markov diffusion operators (Bakry et al., 2014) (the necessary background on Markov semigroups and functional inequalities is given in Appendix A). In particular, we control the convergence of SGLD to the Gibbs distribution in terms of 22-Wasserstein distance

To control the first term on the right-hand side of (1.5), we first upper-bound the 22-Wasserstein distance between the distributions of WkW_{k} (the kkth iterate of SGLD) and W(kη)W(k\eta) (the point reached by the Langevin diffusion at time t=kηt=k\eta). This requires some heavy lifting: Existing bounds on the 22-Wasserstein distance between a diffusion process and its time-discretized version due to Alfonsi et al. (2015) scale like ηekη\eta e^{k\eta}, which is far too crude for our purposes. By contrast, we take an indirect route via a Girsanov-type change of measure and a weighted transportation-cost inequality of Bolley and Villani (2005) to obtain a bound that scales like kηη1/4k\eta\cdot\eta^{1/4}. This step relies crucially on a certain exponential integrability property of the Langevin diffusion. Next, we show that the Gibbs distribution satisfies a logarithmic Sobolev inequality, which allows us to conclude that the 22-Wasserstein distance between the distribution of W(kη)W(k\eta) and the Gibbs distribution decays exponentially as ekηe^{-k\eta}. Since W2\mathcal{W}_{2} satisfies the triangle inequality, we can produce an upper bound on the first term in (1.5) that scales as kηη1/4+ekηk\eta\cdot\eta^{1/4}+e^{-k\eta}. This immediately suggests that we can make this term as small as we wish by first choosing a large enough horizon t=kηt=k\eta and then a small enough step size η\eta. Overall, this leads to the bounds stated in (1.6).

To control the second term in (1.5), we show that the Gibbs algorithm is stable in 22-Wasserstein distance with respect to local perturbations of the training dataset. This step, again, relies on the logarithmic Sobolev inequality for the Gibbs distribution. To control the third term, we use a nonasymptotic Laplace integral approximation to show that a single draw from the Gibbs distribution is an approximate minimizer of the empirical risk. We use a Wasserstein continuity result due to Polyanskiy and Wu (2016) and a well-known equivalence between stability of empirical minimization and generalization (Mukherjee et al., 2006; Rakhlin et al., 2005) to show that, in fact, the Gibbs algorithm samples from near-minimizers of the population risk.

We remark that our result readily extends to the case when the stochastic gradients gkg_{k} in (1.3) are formed with respect to independent draws from the data-generating distribution PP – e.g., when taking a single pass through the dataset. In this case, the target of optimization is FF itself rather than FzF_{\mathbf{z}}, and we simply omit the second term in (1.5). If the main concern is not consistency (as in (1.1)) but rather the generalization performance of SGLD itself, then the same analysis applied to the decomposition

gives an upper bound of (1.6) plus the first term of (1.7). In other words, while the rate of (1.1) may be hampered by the slow convergence of dlogββ\frac{d\log\beta}{\beta}, the rate of generalization is not. Finally, if each data point is used only once, the generalization performance is controlled by (1.6) alone.

2 Related work

The asymptotic study of convergence of discretized Langevin diffusions for non-convex objectives has a long history, starting with the work of Gelfand and Mitter (1991). Most of the work has focused on annealing-type schemes, where both the step size η\eta and the temperature 1/β1/\beta are decreased with time. Márquez (1997) and Pelletier (1998) studied the rates of weak convergence for both the Langevin diffusion and the discrete-time updates. However, when η\eta and β\beta are kept fixed, the updates do not converge to a global minimizer, but one can still aim for convergence to a stationary distribution. An asymptotic study of this convergence, in the sense of relative entropy, was initiated by Borkar and Mitter (1999).

Dalalyan and Tsybakov (2012) and Dalalyan (2016) analyzed rates of convergence of discrete-time Langevin updates (with exact gradients) in the case of convex functions, and provided nonasymptotic rates of convergence in the total variation distance for sampling from log-concave densities. Durmus and Moulines (2015) refined these results by establishing geometric convergence in total variation distance for convex and strongly convex objective functions, and provided some results for non-convex objectives that can be represented as a bounded perturbation of a convex or a strongly convex function. Bubeck et al. (2015) studied projected Langevin updates in the convex case.

Our work is motivated in part by recent papers on non-convex optimization and, in particular, on optimization problems related to neural networks. A heuristic analysis of SGLD was given by Welling and Teh (2011), and a modification of SGLD to improve generalization performance was recently proposed by Chaudhari et al. (2016). Deliberate addition of noise was also proposed by Ge et al. (2015) as a strategy for escaping from saddle points, and Belloni et al. (2015) analyzed a simulated anealing method based on Hit-and-Run for sampling from nearly log-concave distributions. While these methods aim at avoiding local minima through randomn perturbations, the line of work on continuation methods and graduated optimization (Hazan et al., 2016) attempts to create sequences of smoothed approximations that can successively localize the optimum.

Hardt et al. (2015) studied uniform stability and generalization properties of stochastic gradient descent with both convex and non-convex objectives. For the non-convex case, their upper bound on stability degrades with the number of steps of the optimization procedure, which was taken by the authors as a prescription for early stopping. In contrast, we show that, under our assumptions, non-convexity does not imply loss of stability when the latter is measured in terms of 22-Wasserstein distance to the stationary distribution. In addition, we use the fact that Gibbs distribution concentrates on approximate empirical minimizers, implying convergence for the population risk via stability (Rakhlin et al., 2005; Mukherjee et al., 2006).

The main result

where UzU_{\mathbf{z}} is a random element of U\mathsf{U} with probability law QzQ_{\mathbf{z}}. Conditionally on Z=z\mathbf{Z}=\mathbf{z}, the SGLD update takes the form

The function ff takes nonnegative real values, and there exist constants A,B0A,B\geq 0, such that

For each zZz\in\mathsf{Z}, the function f(,z)f(\cdot,z) is MM-smooth: for some M>0M>0,

For each zZz\in\mathsf{Z}, the function f(,z)f(\cdot,z) is (m,b)(m,b)-dissipative (Hale, 1988): for some m>0m>0 and b0b\geq 0,

There exists a constant δ[0,1)\delta\in[0,1), such that, for each zZn\mathbf{z}\in\mathsf{Z}^{n},We are reusing the constants MM and BB from (A.1) and (A.2) in (2.4) mainly out of considerations of technical convenience; any other constants M,B>0M^{\prime},B^{\prime}>0 can be substituted in their place without affecting the results.

We are now ready to state our main result. A crucial role will be played by the uniform spectral gap

Proof of Theorem 2.1

Let μz,k\mathchar58=L(WkZ=z)\mu_{\mathbf{z},k}\mathrel{\mathop{\mathchar 58\relax}}=\mathcal{L}(W_{k}|\mathbf{Z}=\mathbf{z}), νz,t\mathchar58=L(W(t)Z=z)\nu_{\mathbf{z},t}\mathrel{\mathop{\mathchar 58\relax}}=\mathcal{L}(W(t)|\mathbf{Z}=\mathbf{z}), and Ez[]\mathchar58=E[Z=z]\mathbf{E}_{\mathbf{z}}[\cdot]\mathrel{\mathop{\mathchar 58\relax}}=\mathbf{E}[\cdot|\mathbf{Z}=\mathbf{z}]. In a nutshell, our proof of Theorem 2.1 consists of the following steps:

We first show that, for all sufficiently small η>0\eta>0, the SGLD recursion (2.2) tracks the continuous-time Langevin diffusion process (1.4) in 22-Wasserstein distance:

(the precise statement with explicit constants is given in Proposition 3.1).

Next, we show that the Langevin diffusion (1.4) converges exponentially fast to the Gibbs distribution πz\pi_{\mathbf{z}}:

This, together with (3.1) and the triangle inequality, yields the estimate

(see Proposition 3.3 for explicit constants). Observe that there are two terms on the right-hand side of (3.2), one of which grows linearly with t=kηt=k\eta, while the other one decays exponentially with tt. Thus, we can first choose tt large enough and then η\eta small enough, so that

The resulting choices of t=kηt=k\eta and η\eta translate into the expressions for kk and η\eta given in (A.14).

The upshot of Eq. (3.3) is that, for large enough kk, the conditional probability law of WkW_{k} given Z=z\mathbf{Z}=\mathbf{z} is close, in 22-Wasserstein, to the Gibbs distribution πz\pi_{\mathbf{z}}. Thus, we are led to consider the Gibbs algorithm that generates a random draw from πz\pi_{\mathbf{z}}. We show that the resulting hypothesis is an almost-minimizer of the empirical risk, i.e.,

(see Proposition 3.4 for the exact statement), and also that the Gibbs algorithm is stable in the 22-Wasserstein distance: for any two datasets z,zˉ\mathbf{z},\bar{\mathbf{z}} that differ in a single coordinate,

This estimate, together with Lemma 3.5 below, implies that the Gibbs algorithm is uniformly stable (Bousquet and Elisseeff, 2002):

(see Proposition 3.5). The almost-ERM property (3.4) and the uniform stability propery (3.5), together with (3.3), yield the statement of the theorem.

2 Technical lemmas

We first collect a few lemmas that will be used in the sequel; see Appendix C for the proofs.

For all 0<η<1m4M20<\eta<1\wedge\frac{m}{4M^{2}} and all zZn\mathbf{z}\in\mathsf{Z}^{n},

for some constants c1>0c_{1}>0 and c20c_{2}\geq 0. Then

3 The diffusion approximation

Recall that μz,k=L(WkZ=z)\mu_{\mathbf{z},k}=\mathcal{L}(W_{k}|\mathbf{Z}=\mathbf{z}) and νz,t=L(W(t)Z=z)\nu_{\mathbf{z},t}=\mathcal{L}(W(t)|\mathbf{Z}=\mathbf{z}), and we take μz,0=νz,0=μ0\mu_{\mathbf{z},0}=\nu_{\mathbf{z},0}=\mu_{0}. In this section, we derive an upper bound on the 2-Wasserstein distance W2(μz,k,νz,kη)\mathcal{W}_{2}(\mu_{\mathbf{z},k},\nu_{\mathbf{z},k\eta}). The analysis consists of two steps. We first upper-bound the relative entropy D(μz,kνz,kη)D(\mu_{\mathbf{z},k}\|\nu_{\mathbf{z},k\eta}) via a change-of-measure argument following Dalalyan and Tsybakov (2012) (see also Dalalyan (2016)), except that we also have to deal with the error introduced by the stochastic gradient oracle. We then use a weighted transportation-cost inequality of Bolley and Villani (2005) to control the Wasserstein distance W2(μz,k,νz,kη)\mathcal{W}_{2}(\mu_{\mathbf{z},k},\nu_{\mathbf{z},k\eta}) in terms of D(μz,kνz,kη)D(\mu_{\mathbf{z},k}\|\nu_{\mathbf{z},k\eta}).

The proof of the following lemma is somewhat lengthy, and is given in Appendix D:

Let μ=μz,k\mu=\mu_{\mathbf{z},k}, ν=νz,kη\nu=\nu_{\mathbf{z},k\eta}, and take λ=1\lambda=1. Suppose kη1k\eta\geq 1. Since β2m\beta\geq\frac{2}{m}, we can use Lemma 3.3 to write

Moreover, for all kk and η\eta satisfying the conditions of Lemma 3.6, plus the additional requirement kη1k\eta\geq 1, we can write

Putting everything together, we obtain the following result:

4 Wasserstein distance to the Gibbs distribution

We now fix a time t0t\geq 0 and examine the 2-Wasserstein distance W2(νz,t,πz)\mathcal{W}_{2}(\nu_{\mathbf{z},t},\pi_{\mathbf{z}}). At this point, we need to use a number of concepts from the analysis of Markov diffusion operators; see Appendix A for the requisite background. We start by showing the following:

For β2/m\beta\geq 2/m, all of the the Gibbs measures πz\pi_{\mathbf{z}} satisfy a logarithmic Sobolev inequality with constant

by the Otto–Villani theorem, and, since D(νz,0πz)=D(μ0πz)<D(\nu_{\mathbf{z},0}\|\pi_{\mathbf{z}})=D(\mu_{0}\|\pi_{\mathbf{z}})<\infty by Lemma 3.4, we also have

by the theorem on exponential decay of entropy. Combining Eqs. (3.16) (with μ=νz,t\mu=\nu_{\mathbf{z},t}) and (3.17) and using Lemma 3.4, we get

Letting t=kηt=k\eta and invoking Proposition 3.1, we obtain the following:

For all kk and η\eta satisfying the conditions of Proposition 3.1, we have

5 Almost-ERM property of the Gibbs algorithm

is the differential entropy of pzp_{\mathbf{z}} (Cover and Thomas, 2006). To upper-bound h(pz)h(p_{\mathbf{z}}), we estimate the second moment of πz\pi_{\mathbf{z}}. From (3.17), it follows that W2(νz,t,πz)t0\mathcal{W}_{2}(\nu_{\mathbf{z},t},\pi_{\mathbf{z}})\xrightarrow{t\to\infty}0. Since convergence of probability measures in 2-Wasserstein distance is equivalent to weak convergence plus convergence of second moments (Villani, 2003, Theorem 7.12), we have by Lemma 3.2

The differential entropy of a probability density with a finite second moment is upper-bounded by that of a Gaussian density with the same second moment, so we immediately get

Using Eqs. (3.20) and (3.21) in (3.18) and simplifying, we obtain the following result:

6 Stability of the Gibbs algorithm

where i0[n]i_{0}\in[n] is the index of the coordinate where z\mathbf{z} and zˉ\bar{\mathbf{z}} differ. In particular,

Therefore, since πzˉ\pi_{\bar{\mathbf{z}}} satisfies a logarithmic Sobolev inequality with constant cLSc_{\rm LS} given in Proposition 3.2, we can write

where the last line follows from the quadratic growth estimate (3.6). Taking μ=πz\mu=\pi_{\mathbf{z}} in (3.16) and using the above bound and the second-moment estimate (3.19), we obtain

Finally, observe that, for each zZz\in\mathsf{Z}, the function wf(w,z)w\mapsto f(w,z) satisfies the conditions of Lemma 3.5 with c1=Mc_{1}=M and c2=Bc_{2}=B, while πz\pi_{\mathbf{z}} and πzˉ\pi_{\bar{\mathbf{z}}} satisfy the conditions of Lemma 3.5 with σ2=b+d/βm\sigma^{2}=\frac{b+d/\beta}{m}. Thus, we obtain the following uniform stability estimate for the Gibbs algorithm:

For any two z,zˉZn\mathbf{z},\bar{\mathbf{z}}\in\mathsf{Z}^{n} that differ only in a single coordinate,

7 Completing the proof

Now consider the random hypotheses W^\widehat{W} and W^\widehat{W}^{*} with L(W^Z=z)=μz,k\mathcal{L}(\widehat{W}|\mathbf{Z}=\mathbf{z})=\mu_{\mathbf{z},k} and L(W^Z=z)=πz\mathcal{L}(\widehat{W}^{*}|\mathbf{Z}=\mathbf{z})=\pi_{\mathbf{z}}. Then

The function FF satisfies the conditions of Lemma 3.5 with c1=Mc_{1}=M and c2=Bc_{2}=B, while the probability measures μz,k,πz\mu_{\mathbf{z},k},\pi_{\mathbf{z}} satisfy the conditions of Lemma 3.5 with

by Lemma 3.2. Therefore, for all zZn\mathbf{z}\in\mathsf{Z}^{n},

It remains to analyze the expected excess risk EF(W^)F\mathbf{E}F(\widehat{W}^{*})-F^{*} of the Gibbs algorithm. To that end, we will use stability-based arguments along the lines of Bousquet and Elisseeff (2002) and Rakhlin et al. (2005). We begin by decomposing the excess risk as

The term T1T_{1} is the generalization error of the Gibbs algorithm. To upper-bound it, let Z=(Z1,,Zn)Pn\mathbf{Z}^{\prime}=(Z^{\prime}_{1},\ldots,Z^{\prime}_{n})\sim\mathbf{P}^{\otimes n} be independent of Z\mathbf{Z} and W^\widehat{W}^{*}. Then

Using the fact that Z1,,Zn,Z1,,ZnZ_{1},\ldots,Z_{n},Z^{\prime}_{1},\ldots,Z^{\prime}_{n} are i.i.d., as well as the fact that Z\mathbf{Z}^{\prime} is indepdenent of W^\widehat{W}^{*}, the iith term in the summation in (3.7) can be written out explicitly as follows:

where zˉ(i)\mathchar58=(z1,,zi1,zi,zi+1,,zn)\bar{\mathbf{z}}^{(i)}\mathrel{\mathop{\mathchar 58\relax}}=(z_{1},\ldots,z_{i-1},z^{\prime}_{i},z_{i+1},\ldots,z_{n}). Noting that zˉ(i)\bar{\mathbf{z}}^{(i)} and z\mathbf{z} differ only in the iith coordinate, we can use Proposition 3.5 to upper-bound the integral in (3.24). Since the resulting estimate is uniform in ii, from (3.7) we obtain

where the last step is by Proposition 3.4. From (3.25) and (3.26), we get

Combining Eqs. (3.22) and (3.27), we obtain the claimed excess risk bound (2.6).

Discussion and directions for future research

The first two assumptions are fairly standard in the literature on non-convex optimization. The dissipativity assumption (A.3) merits some discussion. The term “dissipative” comes from the theory of dynamical systems (Hale, 1988; Stuart and Humphries, 1996), where it has the following interpretation: Consider the gradient flow described by the ordinary differential equation

Then it is not hard to show that, if the function wf0(w,z)w\mapsto f_{0}(w,z) is LL-Lipschitz, then ff satisfies (A.2) with m=γ/2m=\gamma/2 and b=L2/2γb=L^{2}/2\gamma. Thus, a byproduct of our analysis is a fine-grained characterization of the impact of weight decay on learning.

Finally, the exponential integrability assumption (A.5) is satisfied, for example, by the Gaussian initialization W0N(0,σ2Id)W_{0}\sim N(0,\sigma^{2}I_{d}) with σ2<1/2\sigma^{2}<1/2.

Effect of gradient noise and minibatch size selection.

Observe that the excess risk bound (2.6) contains a term that goes to zero as ε0\varepsilon\to 0, as well as a term that grows as logε1\log\varepsilon^{-1}, but goes to zero as the gradient noise level δ0\delta\to 0. This suggests selecting the minibatch size

to offset the logε1\log\varepsilon^{-1} term.

Uniform spectral gap.

As shown in Appendix B, Assumptions (A.1)–(A.3) are enough to guarantee that the spectral gap λ\lambda_{*} is strictly positive. In particular, we give a very conservative estimate

Therefore, in order to apply Theorem 2.1, one needs to fully exploit the structural properties of the problem at hand and produce an upper bound on 1/λ1/\lambda_{*} which is polynomial in dd or even dimension-free. (By contrast, exponential dependence of 1/λ1/\lambda_{*} on β\beta is unavoidable in the presence of multiple local minima and saddle points; this is a consequence of sharp upper and lower bounds on the spectral gap due to Bovier et al. (2005).) For example, consider replacing the empirical risk (1.2) with a smoothed objective

Note that here, in contrast with (4.3), this bound is completely dimension-free. A tantalizing line of future work is, therefore, to find other settings where 1/λ1/\lambda_{*} is indeed small.

Acknowledgments

The authors would like to thank Arnak Dalalyan and Ramon van Handel for enlightening discussions.

References

Appendix A Background on Markov semigroups and functional inequalities

Our analysis relies on the theory of Markov diffusion operators and associated functional inequalities. In this Appendix, we only summarize the key ideas and results; the book by Bakry et al. (2014) provides an in-depth exposition.

It can be shown that E(g)0\mathcal{E}(g)\geq 0, i.e., L-\mathcal{L} is a positive operator (since L1=0\mathcal{L}1=0, zero is an eigenvalue).

Let PP be a Markov semigroup with the unique invariant distribution π\pi and the Dirichlet form E\mathcal{E}. We say that π\pi satisfies a Poincaré (or spectral gap) inequality with constant cc if, for all probability measures μπ\mu\ll\pi,

Hence, if λ>0\lambda>0, then the spectrum of L-\mathcal{L} is contained in the set {0}[λ,)\{0\}\cup[\lambda,\infty), so λ\lambda is the gap between the zero eigenvalue and the rest of the spectrum. We say that π\pi satisfies a logarithmic Sobolev inequality with constant cc if, for all μπ\mu\ll\pi,

Exponential decay of entropy (Bakry et al., 2014, Th. 5.2.1): Let μt\mathchar58=L(W(t))\mu_{t}\mathrel{\mathop{\mathchar 58\relax}}=\mathcal{L}(W(t)). Then

where HH is a C1C^{1} function and {B(t)}\{B(t)\} is the standard dd-dimensional Brownian motion. (Replacing the factor 2\sqrt{2} by 2β1\sqrt{2\beta^{-1}} is equivalent to the time rescaling tβtt\mapsto\beta t.) The generator of this semigroup is the second-order differential operator

Thus, the Gibbs measure π\pi satisfies a Poincaré inequality with constant cc if, for any μπ\mu\ll\pi,

and a logarithmic Sobolev inequality with constant cc if

If HH is C2C^{2} and strongly convex, i.e., 2HKId\nabla^{2}H\succeq KI_{d} for some K>0K>0, then π\pi satisfies a logarithmic Sobolev inequality with constant c=1/Kc=1/K. In the absence of convexity, it is in general difficult to obtain upper bounds on Poincaré or log-Sobolev constants. The following two propositions give sufficient conditions based on so-called Lyapunov function criteria:

Then π\pi satisfies a Poincaré inequality with constant

π\pi satisfies a Poincaré inequality with constant cPc_{\rm P}.

There exists some constant K0K\geq 0, such that 2HKId\nabla^{2}H\succeq-KI_{d}.

Let C1C_{1} and C2C_{2} be defined, for some ε>0\varepsilon>0, by

Then π\pi satisfies a logarithmic Sobolev inequality with constant cLS=C1+(C2+2)cPc_{\rm LS}=C_{1}+(C_{2}+2)c_{\rm P}.

In particular, if K0K\neq 0, we can take ε=2/K\varepsilon=2/K, in which case

Appendix B A lower bound on the uniform spectral gap

Our goal here is to prove the crude lower bound on λ\lambda_{*} given in Section 4. To that end, we will use the Lyapunov function criterion due to Bakry et al. (2008), which is reproduced as Proposition A.1 in Appendix A.

We will apply this criterion to the Gibbs distribution πz\pi_{\mathbf{z}} for some zZn\mathbf{z}\in\mathsf{Z}^{n}. Thus, we have H=βFzH=\beta F_{\mathbf{z}} and

Consider the candidate Lyapunov function V(w)=emβw2/4V(w)=e^{m\beta\|w\|^{2}/4}. From the fact that V1V\geq 1 and from the dissipativity assumption (A.3), it follows that

Thus, VV evidently satisfies (A.11) with R2=2κγR^{2}=\frac{2\kappa}{\gamma}, κ0=κ\kappa_{0}=\kappa and λ0=2κ\lambda_{0}=2\kappa, where

Moreover, from Lemma 3.1 and from the fact that Fz0F_{\mathbf{z}}\geq 0, it follows that

Thus, by Proposition A.1, πz\pi_{\mathbf{z}} satisfies a Poincaré inequality with constant

Observe that this bound holds for all zZn\mathbf{z}\in\mathsf{Z}^{n}. Using this fact and the relation 1/λcP1/\lambda\leq c_{\rm P} between the spectral gap and the Poincaré constant, we see that

Appendix C Proofs for Section 3.2

where (i) follows from (A.1) and from Cauchy–Schwarz, while (ii) follows from (3.6). This proves the upper bound on f(w,z)f(w,z). Now take v=cwv=cw for some c(0,1]c\in(0,1] to be chosen later. With this choice, we proceed from Eq. (C.1) as follows:

where (i) uses the fact that f0f\geq 0, while (ii) uses the dissipativity property (2.3). Taking c=13c=\frac{1}{\sqrt{3}}, we get the lower bound in (3.7). ∎

where the second step uses independence of Wkg(Wk,Uz,k)W_{k}-g(W_{k},U_{\mathbf{z},k}) and ξk\xi_{k} and the unbiasedness property (2.1) of the gradient oracle. We can further expand the first term in (C.2):

where we have used (2.1) once again. By (2.4), the second term in (C.3) can be upper-bounded by

whereas the first term can be estimated as

where the inequality follows from the dissipativity condition (2.3) and the bound (3.6) in Lemma 3.1. Combining all of the above, we arrive at the recursion

Fix some η(0,1m2M2)\eta\in(0,1\wedge\frac{m}{2M^{2}}). There are two cases to consider:

If 12ηm+4η2M201-2\eta m+4\eta^{2}M^{2}\leq 0, then from (C.4) it follows that

If 0<12ηm+4η2M2<10<1-2\eta m+4\eta^{2}M^{2}<1, then iterating (C.4) gives

The bound (3.8) follows from Eqs. (C.5) and (C.6) and from the estimate

which easily follows from the independence of Z\mathbf{Z} and W0W_{0} and from Jensen’s inequality.

We now analyze the diffusion (1.4). Let Y(t)\mathchar58=W(t)2Y(t)\mathrel{\mathop{\mathchar 58\relax}}=\|W(t)\|^{2}. Then Itô’s lemma gives

Recognizing the left-hand side of (C.8) as the total Itô derivative of e2mtY(t)e^{2mt}Y(t), we arrive at

which, upon integrating and rearranging, becomes

Now, using the dissipativity condition (2.3), we can write

Substituting this into (C.10), we end up with

Taking expectations and using the martingale property of the Itô integral together with (C.7), we get (3.9). Eq. (3.10) follows from maximizing the right-hand side of (3.9) over all t0t\geq 0. ∎

For L(t)=eW(t)2L(t)=e^{\|W(t)\|^{2}}, Itô’s lemma gives

From the dissipativity condition (2.3) and from the assumption that β2/m\beta\geq 2/m, it follows that

Eq. (3.11) then follows by an application of the Gronwall lemma. ∎

Since pz>0p_{\mathbf{z}}>0 everywhere, we can write

We first upper-bound the partition function:

where the inequality follows from Lemma 3.1. Thus,

Moreover, invoking Lemma 3.1 once again, we have

Substituting (C.12), (C.13), and (C.14) into (C.11), we get (3.12). ∎

where we have used Cauchy–Schwarz and the growth condition (3.13). Now let P\mathbf{P} be the coupling of μ\mu and ν\nu that achieves W2(μ,ν)\mathcal{W}_{2}(\mu,\nu). That is, P=L((W,V))\mathbf{P}=\mathcal{L}((W,V)) with μ=L(W)\mu=\mathcal{L}(W), ν=L(V)\nu=\mathcal{L}(V), and

Interchanging the roles of μ\mu and ν\nu, we complete the proof. ∎

Appendix D Proof of Lemma 3.6

Conditioned on Z=z\mathbf{Z}=\mathbf{z}, {Wk}k=0\{W_{k}\}^{\infty}_{k=0} is a time-homogeneous Markov process. Consider the following continuous-time interpolation of this process:

where Uz(t)Uz,k\overline{U}_{\mathbf{z}}(t)\equiv U_{\mathbf{z},k} for t[kη,(k+1)η)t\in[k\eta,(k+1)\eta). Note that, for each kk, W(kη)\overline{W}(k\eta) and WkW_{k} have the same probability law μz,k\mu_{\mathbf{z},k}. Moreover, by a result of Gyöngy (1986), the process W(t)\overline{W}(t) has the same one-time marginals as the Itô process

Crucially, V(t)V(t) is a Markov process, while W(t)\overline{W}(t) is not. Let \mathbf{P}^{t}_{V}\mathrel{\mathop{\mathchar 58\relax}}=\mathcal{L}\big{(}V(s)\mathrel{\mathop{\mathchar 58\relax}}0\leq s\leq t\big{|}\mathbf{Z}=\mathbf{z}\big{)} and \mathbf{P}^{t}_{W}\mathrel{\mathop{\mathchar 58\relax}}=\mathcal{L}\big{(}W(s)\mathrel{\mathop{\mathchar 58\relax}}0\leq s\leq t\big{|}\mathbf{Z}=\mathbf{z}\big{)}. The Radon–Nikodym derivative of PWt\mathbf{P}^{t}_{W} w.r.t. PVt\mathbf{P}^{t}_{V} is given by the Girsanov formula

(see, e.g., Sec. 7.6.4 in Liptser and Shiryaev (2001)). Using (D.3) and the martingale property of the Itô integral, we have

where the last line follows from the fact that L(W(s))=L(V(s))\mathcal{L}(\overline{W}(s))=\mathcal{L}(V(s)) for each ss.

We first estimate the first summation in (D.4). Consider some s[jη,(j+1)η)s\in[j\eta,(j+1)\eta). From (D.1), we have

Therefore, using Lemmas 3.1 and 3.2 and the gradient noise assumption (A.4), we arrive at

Similarly, the second summation on the right-hand side of (D.4) can be estimated as follows:

Substituting Eqs. (D.5) and (D.6) into (D.4), we obtain

Now, since μz,k=L(W(kη)Z=z)\mu_{\mathbf{z},k}=\mathcal{L}(\overline{W}(k\eta)|\mathbf{Z}=\mathbf{z}) and νz,kη=L(W(kη)Z=z)\nu_{\mathbf{z},k\eta}=\mathcal{L}(W(k\eta)|\mathbf{Z}=\mathbf{z}), the data-processing inequality for the KL divergence gives

Appendix E Proof of Proposition 3.2

To establish the log-Sobolev inequality, we will use the Lyapunov function criterion of Cattiaux et al. (2010), reproduced as Proposition A.2 in Appendix A.

We will apply this proposition to the Gibbs distribution πz\pi_{\mathbf{z}} for some zZn\mathbf{z}\in\mathsf{Z}^{n}, so that H=βFzH=\beta F_{\mathbf{z}} and

We consider the same Lyapunov function V(w)=emβw2/4V(w)=e^{m\beta\|w\|^{2}/4} as in Appendix B. From Eq. (B.1), VV evidently satisfies (A.13) with κ\kappa and γ\gamma given in (B.2), i.e., the first condition of Proposition A.2 is satisfied. Moreover, πz\pi_{\mathbf{z}} satisfies a Poincaré inequality with constant cP1/λc_{\rm P}\leq 1/\lambda_{*}. Thus, the second condition is also satisfied. Finally, by the MM-smoothness assumption (A.2), 2FzMId\nabla^{2}F_{\mathbf{z}}\succeq-MI_{d}, so the third condition of Proposition A.2 is satisfied with K=βMK=\beta M. Consequently, the constants C1C_{1} and C2C_{2} in (A.14) are given by

where we have also used the estimate (3.19) to upper-bound C2C_{2}. Therefore, from Proposition A.2 and from (E.1) it follows that πz\pi_{\mathbf{z}} satisfies a logarithmic Sobolev inequality with