Non-convex learning via Stochastic Gradient Langevin Dynamics: a nonasymptotic analysis

Maxim Raginsky, Alexander Rakhlin, Matus Telgarsky

Introduction and informal summary of results

Consider a stochastic optimization problem

When the functions $w\mapsto f(w,z)$ are not convex, theoretical analysis of global convergence becomes largely intractable. On the other hand, non-convex optimization is currently witnessing an impressive string of empirical successes, most notably in the realm of deep neural networks. Towards the aim of bridging this gap between theory and practice, this paper provides a theoretical justification for Stochastic Gradient Langevin Dynamics (SGLD), a popular variant of stochastic gradient descent, in which properly scaled isotropic Gaussian noise is added to an unbiased estimate of the gradient at each iteration (Gelfand and Mitter, 1991; Borkar and Mitter, 1999; Welling and Teh, 2011).

Since the population distribution $P$ is unknown, we attempt to (approximately) minimize

It is well-recognized, however, that minimization of the empirical risk $F_{\mathbf{z}}$ does not immediately translate into minimization of the population risk $F$ . A standard approach for addressing the issue is to decompose the excess risk into a sum of two terms, $F(\widehat{W})-F_{\mathbf{z}}(\widehat{W})$ (the generalization error of $\widehat{W}$ ) and $F_{\mathbf{z}}(\widehat{W})-F^{*}$ (the gap between the empirical risk of $\widehat{W}$ and the minimum of the population risk), and then show that both of these terms are small (either in expectation or with high probability). Taking $\widehat{W}=W_{k}$ and letting $\widehat{W}^{*}$ be the output of the Gibbs algorithm under which the conditional distribution of $\widehat{W}^{*}$ given $\mathbf{Z}=\mathbf{z}$ is equal to $\pi_{\mathbf{z}}$ , we decompose the excess risk (1.1) as follows:

where the first term is the difference of expected population risks of SGLD and the Gibbs algorithm, the second term is the generalization error of the Gibbs algorithm, and the third term is easily upper-bounded in terms of expected suboptimality $\mathbf{E}\big{(}F_{\mathbf{Z}}(\widehat{W}^{*})-\min_{w}F_{\mathbf{Z}}(w)\big{)}$ of the Gibbs algorithm for the empirical risk. Observe that only the first term pertains to SGLD, whereas the other two involve solely the Gibbs distribution. The main contribution of this work is in showing finite-time convergence of SGLD for a non-convex objective function. Informally, we can state our main result as follows:

For any $\varepsilon>0$ , the first term in (1.5) scales as

where $\lambda_{*}$ is a certain spectral gap parameter that governs the exponential rate of convergence of the Langevin diffusion to its stationary distribution. This spectral gap parameter itself might depend on $\beta$ and $d$ , but is independent of $n$ .

The second and third terms in (1.5) scale, respectively, as

Our analysis draws heavily on the theory of optimal transportation (Villani, 2003) and on the analysis of Markov diffusion operators (Bakry et al., 2014) (the necessary background on Markov semigroups and functional inequalities is given in Appendix A). In particular, we control the convergence of SGLD to the Gibbs distribution in terms of $2$ -Wasserstein distance

To control the first term on the right-hand side of (1.5), we first upper-bound the $2$ -Wasserstein distance between the distributions of $W_{k}$ (the $k$ th iterate of SGLD) and $W(k\eta)$ (the point reached by the Langevin diffusion at time $t=k\eta$ ). This requires some heavy lifting: Existing bounds on the $2$ -Wasserstein distance between a diffusion process and its time-discretized version due to Alfonsi et al. (2015) scale like $\eta e^{k\eta}$ , which is far too crude for our purposes. By contrast, we take an indirect route via a Girsanov-type change of measure and a weighted transportation-cost inequality of Bolley and Villani (2005) to obtain a bound that scales like $k\eta\cdot\eta^{1/4}$ . This step relies crucially on a certain exponential integrability property of the Langevin diffusion. Next, we show that the Gibbs distribution satisfies a logarithmic Sobolev inequality, which allows us to conclude that the $2$ -Wasserstein distance between the distribution of $W(k\eta)$ and the Gibbs distribution decays exponentially as $e^{-k\eta}$ . Since $\mathcal{W}_{2}$ satisfies the triangle inequality, we can produce an upper bound on the first term in (1.5) that scales as $k\eta\cdot\eta^{1/4}+e^{-k\eta}$ . This immediately suggests that we can make this term as small as we wish by first choosing a large enough horizon $t=k\eta$ and then a small enough step size $\eta$ . Overall, this leads to the bounds stated in (1.6).

To control the second term in (1.5), we show that the Gibbs algorithm is stable in $2$ -Wasserstein distance with respect to local perturbations of the training dataset. This step, again, relies on the logarithmic Sobolev inequality for the Gibbs distribution. To control the third term, we use a nonasymptotic Laplace integral approximation to show that a single draw from the Gibbs distribution is an approximate minimizer of the empirical risk. We use a Wasserstein continuity result due to Polyanskiy and Wu (2016) and a well-known equivalence between stability of empirical minimization and generalization (Mukherjee et al., 2006; Rakhlin et al., 2005) to show that, in fact, the Gibbs algorithm samples from near-minimizers of the population risk.

We remark that our result readily extends to the case when the stochastic gradients $g_{k}$ in (1.3) are formed with respect to independent draws from the data-generating distribution $P$ – e.g., when taking a single pass through the dataset. In this case, the target of optimization is $F$ itself rather than $F_{\mathbf{z}}$ , and we simply omit the second term in (1.5). If the main concern is not consistency (as in (1.1)) but rather the generalization performance of SGLD itself, then the same analysis applied to the decomposition

gives an upper bound of (1.6) plus the first term of (1.7). In other words, while the rate of (1.1) may be hampered by the slow convergence of $\frac{d\log\beta}{\beta}$ , the rate of generalization is not. Finally, if each data point is used only once, the generalization performance is controlled by (1.6) alone.

2 Related work

The asymptotic study of convergence of discretized Langevin diffusions for non-convex objectives has a long history, starting with the work of Gelfand and Mitter (1991). Most of the work has focused on annealing-type schemes, where both the step size $\eta$ and the temperature $1/\beta$ are decreased with time. Márquez (1997) and Pelletier (1998) studied the rates of weak convergence for both the Langevin diffusion and the discrete-time updates. However, when $\eta$ and $\beta$ are kept fixed, the updates do not converge to a global minimizer, but one can still aim for convergence to a stationary distribution. An asymptotic study of this convergence, in the sense of relative entropy, was initiated by Borkar and Mitter (1999).

Dalalyan and Tsybakov (2012) and Dalalyan (2016) analyzed rates of convergence of discrete-time Langevin updates (with exact gradients) in the case of convex functions, and provided nonasymptotic rates of convergence in the total variation distance for sampling from log-concave densities. Durmus and Moulines (2015) refined these results by establishing geometric convergence in total variation distance for convex and strongly convex objective functions, and provided some results for non-convex objectives that can be represented as a bounded perturbation of a convex or a strongly convex function. Bubeck et al. (2015) studied projected Langevin updates in the convex case.

Our work is motivated in part by recent papers on non-convex optimization and, in particular, on optimization problems related to neural networks. A heuristic analysis of SGLD was given by Welling and Teh (2011), and a modification of SGLD to improve generalization performance was recently proposed by Chaudhari et al. (2016). Deliberate addition of noise was also proposed by Ge et al. (2015) as a strategy for escaping from saddle points, and Belloni et al. (2015) analyzed a simulated anealing method based on Hit-and-Run for sampling from nearly log-concave distributions. While these methods aim at avoiding local minima through randomn perturbations, the line of work on continuation methods and graduated optimization (Hazan et al., 2016) attempts to create sequences of smoothed approximations that can successively localize the optimum.

Hardt et al. (2015) studied uniform stability and generalization properties of stochastic gradient descent with both convex and non-convex objectives. For the non-convex case, their upper bound on stability degrades with the number of steps of the optimization procedure, which was taken by the authors as a prescription for early stopping. In contrast, we show that, under our assumptions, non-convexity does not imply loss of stability when the latter is measured in terms of $2$ -Wasserstein distance to the stationary distribution. In addition, we use the fact that Gibbs distribution concentrates on approximate empirical minimizers, implying convergence for the population risk via stability (Rakhlin et al., 2005; Mukherjee et al., 2006).

The main result

where $U_{\mathbf{z}}$ is a random element of $\mathsf{U}$ with probability law $Q_{\mathbf{z}}$ . Conditionally on $\mathbf{Z}=\mathbf{z}$ , the SGLD update takes the form

The function $f$ takes nonnegative real values, and there exist constants $A,B\geq 0$ , such that

For each $z\in\mathsf{Z}$ , the function $f(\cdot,z)$ is $M$ -smooth: for some $M>0$ ,

For each $z\in\mathsf{Z}$ , the function $f(\cdot,z)$ is $(m,b)$ -dissipative (Hale, 1988): for some $m>0$ and $b\geq 0$ ,

There exists a constant $\delta\in[0,1)$ , such that, for each $\mathbf{z}\in\mathsf{Z}^{n}$ ,We are reusing the constants $M$ and $B$ from (A.1) and (A.2) in (2.4) mainly out of considerations of technical convenience; any other constants $M^{\prime},B^{\prime}>0$ can be substituted in their place without affecting the results.

We are now ready to state our main result. A crucial role will be played by the uniform spectral gap

Proof of Theorem 2.1

Let $\mu_{\mathbf{z},k}\mathrel{\mathop{\mathchar 58\relax}}=\mathcal{L}(W_{k}|\mathbf{Z}=\mathbf{z})$ , $\nu_{\mathbf{z},t}\mathrel{\mathop{\mathchar 58\relax}}=\mathcal{L}(W(t)|\mathbf{Z}=\mathbf{z})$ , and $\mathbf{E}_{\mathbf{z}}[\cdot]\mathrel{\mathop{\mathchar 58\relax}}=\mathbf{E}[\cdot|\mathbf{Z}=\mathbf{z}]$ . In a nutshell, our proof of Theorem 2.1 consists of the following steps:

We first show that, for all sufficiently small $\eta>0$ , the SGLD recursion (2.2) tracks the continuous-time Langevin diffusion process (1.4) in $2$ -Wasserstein distance:

(the precise statement with explicit constants is given in Proposition 3.1).

Next, we show that the Langevin diffusion (1.4) converges exponentially fast to the Gibbs distribution $\pi_{\mathbf{z}}$ :

This, together with (3.1) and the triangle inequality, yields the estimate

(see Proposition 3.3 for explicit constants). Observe that there are two terms on the right-hand side of (3.2), one of which grows linearly with $t=k\eta$ , while the other one decays exponentially with $t$ . Thus, we can first choose $t$ large enough and then $\eta$ small enough, so that

The resulting choices of $t=k\eta$ and $\eta$ translate into the expressions for $k$ and $\eta$ given in (A.14).

The upshot of Eq. (3.3) is that, for large enough $k$ , the conditional probability law of $W_{k}$ given $\mathbf{Z}=\mathbf{z}$ is close, in $2$ -Wasserstein, to the Gibbs distribution $\pi_{\mathbf{z}}$ . Thus, we are led to consider the Gibbs algorithm that generates a random draw from $\pi_{\mathbf{z}}$ . We show that the resulting hypothesis is an almost-minimizer of the empirical risk, i.e.,

(see Proposition 3.4 for the exact statement), and also that the Gibbs algorithm is stable in the $2$ -Wasserstein distance: for any two datasets $\mathbf{z},\bar{\mathbf{z}}$ that differ in a single coordinate,

This estimate, together with Lemma 3.5 below, implies that the Gibbs algorithm is uniformly stable (Bousquet and Elisseeff, 2002):

(see Proposition 3.5). The almost-ERM property (3.4) and the uniform stability propery (3.5), together with (3.3), yield the statement of the theorem.

2 Technical lemmas

We first collect a few lemmas that will be used in the sequel; see Appendix C for the proofs.

For all $0<\eta<1\wedge\frac{m}{4M^{2}}$ and all $\mathbf{z}\in\mathsf{Z}^{n}$ ,

for some constants $c_{1}>0$ and $c_{2}\geq 0$ . Then

3 The diffusion approximation

Recall that $\mu_{\mathbf{z},k}=\mathcal{L}(W_{k}|\mathbf{Z}=\mathbf{z})$ and $\nu_{\mathbf{z},t}=\mathcal{L}(W(t)|\mathbf{Z}=\mathbf{z})$ , and we take $\mu_{\mathbf{z},0}=\nu_{\mathbf{z},0}=\mu_{0}$ . In this section, we derive an upper bound on the 2-Wasserstein distance $\mathcal{W}_{2}(\mu_{\mathbf{z},k},\nu_{\mathbf{z},k\eta})$ . The analysis consists of two steps. We first upper-bound the relative entropy $D(\mu_{\mathbf{z},k}\|\nu_{\mathbf{z},k\eta})$ via a change-of-measure argument following Dalalyan and Tsybakov (2012) (see also Dalalyan (2016)), except that we also have to deal with the error introduced by the stochastic gradient oracle. We then use a weighted transportation-cost inequality of Bolley and Villani (2005) to control the Wasserstein distance $\mathcal{W}_{2}(\mu_{\mathbf{z},k},\nu_{\mathbf{z},k\eta})$ in terms of $D(\mu_{\mathbf{z},k}\|\nu_{\mathbf{z},k\eta})$ .

The proof of the following lemma is somewhat lengthy, and is given in Appendix D:

Let $\mu=\mu_{\mathbf{z},k}$ , $\nu=\nu_{\mathbf{z},k\eta}$ , and take $\lambda=1$ . Suppose $k\eta\geq 1$ . Since $\beta\geq\frac{2}{m}$ , we can use Lemma 3.3 to write

Moreover, for all $k$ and $\eta$ satisfying the conditions of Lemma 3.6, plus the additional requirement $k\eta\geq 1$ , we can write

Putting everything together, we obtain the following result:

4 Wasserstein distance to the Gibbs distribution

We now fix a time $t\geq 0$ and examine the 2-Wasserstein distance $\mathcal{W}_{2}(\nu_{\mathbf{z},t},\pi_{\mathbf{z}})$ . At this point, we need to use a number of concepts from the analysis of Markov diffusion operators; see Appendix A for the requisite background. We start by showing the following:

For $\beta\geq 2/m$ , all of the the Gibbs measures $\pi_{\mathbf{z}}$ satisfy a logarithmic Sobolev inequality with constant

by the Otto–Villani theorem, and, since $D(\nu_{\mathbf{z},0}\|\pi_{\mathbf{z}})=D(\mu_{0}\|\pi_{\mathbf{z}})<\infty$ by Lemma 3.4, we also have

by the theorem on exponential decay of entropy. Combining Eqs. (3.16) (with $\mu=\nu_{\mathbf{z},t}$ ) and (3.17) and using Lemma 3.4, we get

Letting $t=k\eta$ and invoking Proposition 3.1, we obtain the following:

For all $k$ and $\eta$ satisfying the conditions of Proposition 3.1, we have

5 Almost-ERM property of the Gibbs algorithm

is the differential entropy of $p_{\mathbf{z}}$ (Cover and Thomas, 2006). To upper-bound $h(p_{\mathbf{z}})$ , we estimate the second moment of $\pi_{\mathbf{z}}$ . From (3.17), it follows that $\mathcal{W}_{2}(\nu_{\mathbf{z},t},\pi_{\mathbf{z}})\xrightarrow{t\to\infty}0$ . Since convergence of probability measures in 2-Wasserstein distance is equivalent to weak convergence plus convergence of second moments (Villani, 2003, Theorem 7.12), we have by Lemma 3.2

The differential entropy of a probability density with a finite second moment is upper-bounded by that of a Gaussian density with the same second moment, so we immediately get

Using Eqs. (3.20) and (3.21) in (3.18) and simplifying, we obtain the following result:

6 Stability of the Gibbs algorithm

where $i_{0}\in[n]$ is the index of the coordinate where $\mathbf{z}$ and $\bar{\mathbf{z}}$ differ. In particular,

Therefore, since $\pi_{\bar{\mathbf{z}}}$ satisfies a logarithmic Sobolev inequality with constant $c_{\rm LS}$ given in Proposition 3.2, we can write

where the last line follows from the quadratic growth estimate (3.6). Taking $\mu=\pi_{\mathbf{z}}$ in (3.16) and using the above bound and the second-moment estimate (3.19), we obtain

Finally, observe that, for each $z\in\mathsf{Z}$ , the function $w\mapsto f(w,z)$ satisfies the conditions of Lemma 3.5 with $c_{1}=M$ and $c_{2}=B$ , while $\pi_{\mathbf{z}}$ and $\pi_{\bar{\mathbf{z}}}$ satisfy the conditions of Lemma 3.5 with $\sigma^{2}=\frac{b+d/\beta}{m}$ . Thus, we obtain the following uniform stability estimate for the Gibbs algorithm:

For any two $\mathbf{z},\bar{\mathbf{z}}\in\mathsf{Z}^{n}$ that differ only in a single coordinate,

7 Completing the proof

Now consider the random hypotheses $\widehat{W}$ and $\widehat{W}^{*}$ with $\mathcal{L}(\widehat{W}|\mathbf{Z}=\mathbf{z})=\mu_{\mathbf{z},k}$ and $\mathcal{L}(\widehat{W}^{*}|\mathbf{Z}=\mathbf{z})=\pi_{\mathbf{z}}$ . Then

The function $F$ satisfies the conditions of Lemma 3.5 with $c_{1}=M$ and $c_{2}=B$ , while the probability measures $\mu_{\mathbf{z},k},\pi_{\mathbf{z}}$ satisfy the conditions of Lemma 3.5 with

by Lemma 3.2. Therefore, for all $\mathbf{z}\in\mathsf{Z}^{n}$ ,

It remains to analyze the expected excess risk $\mathbf{E}F(\widehat{W}^{*})-F^{*}$ of the Gibbs algorithm. To that end, we will use stability-based arguments along the lines of Bousquet and Elisseeff (2002) and Rakhlin et al. (2005). We begin by decomposing the excess risk as

The term $T_{1}$ is the generalization error of the Gibbs algorithm. To upper-bound it, let $\mathbf{Z}^{\prime}=(Z^{\prime}_{1},\ldots,Z^{\prime}_{n})\sim\mathbf{P}^{\otimes n}$ be independent of $\mathbf{Z}$ and $\widehat{W}^{*}$ . Then

Using the fact that $Z_{1},\ldots,Z_{n},Z^{\prime}_{1},\ldots,Z^{\prime}_{n}$ are i.i.d., as well as the fact that $\mathbf{Z}^{\prime}$ is indepdenent of $\widehat{W}^{*}$ , the $i$ th term in the summation in (3.7) can be written out explicitly as follows:

where $\bar{\mathbf{z}}^{(i)}\mathrel{\mathop{\mathchar 58\relax}}=(z_{1},\ldots,z_{i-1},z^{\prime}_{i},z_{i+1},\ldots,z_{n})$ . Noting that $\bar{\mathbf{z}}^{(i)}$ and $\mathbf{z}$ differ only in the $i$ th coordinate, we can use Proposition 3.5 to upper-bound the integral in (3.24). Since the resulting estimate is uniform in $i$ , from (3.7) we obtain

where the last step is by Proposition 3.4. From (3.25) and (3.26), we get

Combining Eqs. (3.22) and (3.27), we obtain the claimed excess risk bound (2.6).

Discussion and directions for future research

The first two assumptions are fairly standard in the literature on non-convex optimization. The dissipativity assumption (A.3) merits some discussion. The term “dissipative” comes from the theory of dynamical systems (Hale, 1988; Stuart and Humphries, 1996), where it has the following interpretation: Consider the gradient flow described by the ordinary differential equation

Then it is not hard to show that, if the function $w\mapsto f_{0}(w,z)$ is $L$ -Lipschitz, then $f$ satisfies (A.2) with $m=\gamma/2$ and $b=L^{2}/2\gamma$ . Thus, a byproduct of our analysis is a fine-grained characterization of the impact of weight decay on learning.

Finally, the exponential integrability assumption (A.5) is satisfied, for example, by the Gaussian initialization $W_{0}\sim N(0,\sigma^{2}I_{d})$ with $\sigma^{2}<1/2$ .

Effect of gradient noise and minibatch size selection.

Observe that the excess risk bound (2.6) contains a term that goes to zero as $\varepsilon\to 0$ , as well as a term that grows as $\log\varepsilon^{-1}$ , but goes to zero as the gradient noise level $\delta\to 0$ . This suggests selecting the minibatch size

to offset the $\log\varepsilon^{-1}$ term.

Uniform spectral gap.

As shown in Appendix B, Assumptions (A.1)–(A.3) are enough to guarantee that the spectral gap $\lambda_{*}$ is strictly positive. In particular, we give a very conservative estimate

Therefore, in order to apply Theorem 2.1, one needs to fully exploit the structural properties of the problem at hand and produce an upper bound on $1/\lambda_{*}$ which is polynomial in $d$ or even dimension-free. (By contrast, exponential dependence of $1/\lambda_{*}$ on $\beta$ is unavoidable in the presence of multiple local minima and saddle points; this is a consequence of sharp upper and lower bounds on the spectral gap due to Bovier et al. (2005).) For example, consider replacing the empirical risk (1.2) with a smoothed objective

Note that here, in contrast with (4.3), this bound is completely dimension-free. A tantalizing line of future work is, therefore, to find other settings where $1/\lambda_{*}$ is indeed small.

Acknowledgments

The authors would like to thank Arnak Dalalyan and Ramon van Handel for enlightening discussions.

References

Appendix A Background on Markov semigroups and functional inequalities

Our analysis relies on the theory of Markov diffusion operators and associated functional inequalities. In this Appendix, we only summarize the key ideas and results; the book by Bakry et al. (2014) provides an in-depth exposition.

It can be shown that $\mathcal{E}(g)\geq 0$ , i.e., $-\mathcal{L}$ is a positive operator (since $\mathcal{L}1=0$ , zero is an eigenvalue).

Let $P$ be a Markov semigroup with the unique invariant distribution $\pi$ and the Dirichlet form $\mathcal{E}$ . We say that $\pi$ satisfies a Poincaré (or spectral gap) inequality with constant $c$ if, for all probability measures $\mu\ll\pi$ ,

Hence, if $\lambda>0$ , then the spectrum of $-\mathcal{L}$ is contained in the set $\{0\}\cup[\lambda,\infty)$ , so $\lambda$ is the gap between the zero eigenvalue and the rest of the spectrum. We say that $\pi$ satisfies a logarithmic Sobolev inequality with constant $c$ if, for all $\mu\ll\pi$ ,

Exponential decay of entropy (Bakry et al., 2014, Th. 5.2.1): Let $\mu_{t}\mathrel{\mathop{\mathchar 58\relax}}=\mathcal{L}(W(t))$ . Then

where $H$ is a $C^{1}$ function and $\{B(t)\}$ is the standard $d$ -dimensional Brownian motion. (Replacing the factor $\sqrt{2}$ by $\sqrt{2\beta^{-1}}$ is equivalent to the time rescaling $t\mapsto\beta t$ .) The generator of this semigroup is the second-order differential operator

Thus, the Gibbs measure $\pi$ satisfies a Poincaré inequality with constant $c$ if, for any $\mu\ll\pi$ ,

and a logarithmic Sobolev inequality with constant $c$ if

If $H$ is $C^{2}$ and strongly convex, i.e., $\nabla^{2}H\succeq KI_{d}$ for some $K>0$ , then $\pi$ satisfies a logarithmic Sobolev inequality with constant $c=1/K$ . In the absence of convexity, it is in general difficult to obtain upper bounds on Poincaré or log-Sobolev constants. The following two propositions give sufficient conditions based on so-called Lyapunov function criteria:

Then $\pi$ satisfies a Poincaré inequality with constant

$\pi$ satisfies a Poincaré inequality with constant $c_{\rm P}$ .

There exists some constant $K\geq 0$ , such that $\nabla^{2}H\succeq-KI_{d}$ .

Let $C_{1}$ and $C_{2}$ be defined, for some $\varepsilon>0$ , by

Then $\pi$ satisfies a logarithmic Sobolev inequality with constant $c_{\rm LS}=C_{1}+(C_{2}+2)c_{\rm P}$ .

In particular, if $K\neq 0$ , we can take $\varepsilon=2/K$ , in which case

Appendix B A lower bound on the uniform spectral gap

Our goal here is to prove the crude lower bound on $\lambda_{*}$ given in Section 4. To that end, we will use the Lyapunov function criterion due to Bakry et al. (2008), which is reproduced as Proposition A.1 in Appendix A.

We will apply this criterion to the Gibbs distribution $\pi_{\mathbf{z}}$ for some $\mathbf{z}\in\mathsf{Z}^{n}$ . Thus, we have $H=\beta F_{\mathbf{z}}$ and

Consider the candidate Lyapunov function $V(w)=e^{m\beta\|w\|^{2}/4}$ . From the fact that $V\geq 1$ and from the dissipativity assumption (A.3), it follows that

Thus, $V$ evidently satisfies (A.11) with $R^{2}=\frac{2\kappa}{\gamma}$ , $\kappa_{0}=\kappa$ and $\lambda_{0}=2\kappa$ , where

Moreover, from Lemma 3.1 and from the fact that $F_{\mathbf{z}}\geq 0$ , it follows that

Thus, by Proposition A.1, $\pi_{\mathbf{z}}$ satisfies a Poincaré inequality with constant

Observe that this bound holds for all $\mathbf{z}\in\mathsf{Z}^{n}$ . Using this fact and the relation $1/\lambda\leq c_{\rm P}$ between the spectral gap and the Poincaré constant, we see that

Appendix C Proofs for Section 3.2

where (i) follows from (A.1) and from Cauchy–Schwarz, while (ii) follows from (3.6). This proves the upper bound on $f(w,z)$ . Now take $v=cw$ for some $c\in(0,1]$ to be chosen later. With this choice, we proceed from Eq. (C.1) as follows:

where (i) uses the fact that $f\geq 0$ , while (ii) uses the dissipativity property (2.3). Taking $c=\frac{1}{\sqrt{3}}$ , we get the lower bound in (3.7). ∎

where the second step uses independence of $W_{k}-g(W_{k},U_{\mathbf{z},k})$ and $\xi_{k}$ and the unbiasedness property (2.1) of the gradient oracle. We can further expand the first term in (C.2):

where we have used (2.1) once again. By (2.4), the second term in (C.3) can be upper-bounded by

whereas the first term can be estimated as

where the inequality follows from the dissipativity condition (2.3) and the bound (3.6) in Lemma 3.1. Combining all of the above, we arrive at the recursion

Fix some $\eta\in(0,1\wedge\frac{m}{2M^{2}})$ . There are two cases to consider:

If $1-2\eta m+4\eta^{2}M^{2}\leq 0$ , then from (C.4) it follows that

If $0<1-2\eta m+4\eta^{2}M^{2}<1$ , then iterating (C.4) gives

The bound (3.8) follows from Eqs. (C.5) and (C.6) and from the estimate

which easily follows from the independence of $\mathbf{Z}$ and $W_{0}$ and from Jensen’s inequality.

We now analyze the diffusion (1.4). Let $Y(t)\mathrel{\mathop{\mathchar 58\relax}}=\|W(t)\|^{2}$ . Then Itô’s lemma gives

Recognizing the left-hand side of (C.8) as the total Itô derivative of $e^{2mt}Y(t)$ , we arrive at

which, upon integrating and rearranging, becomes

Now, using the dissipativity condition (2.3), we can write

Substituting this into (C.10), we end up with

Taking expectations and using the martingale property of the Itô integral together with (C.7), we get (3.9). Eq. (3.10) follows from maximizing the right-hand side of (3.9) over all $t\geq 0$ . ∎

For $L(t)=e^{\|W(t)\|^{2}}$ , Itô’s lemma gives

From the dissipativity condition (2.3) and from the assumption that $\beta\geq 2/m$ , it follows that

Eq. (3.11) then follows by an application of the Gronwall lemma. ∎

Since $p_{\mathbf{z}}>0$ everywhere, we can write

We first upper-bound the partition function:

where the inequality follows from Lemma 3.1. Thus,

Moreover, invoking Lemma 3.1 once again, we have

Substituting (C.12), (C.13), and (C.14) into (C.11), we get (3.12). ∎

where we have used Cauchy–Schwarz and the growth condition (3.13). Now let $\mathbf{P}$ be the coupling of $\mu$ and $\nu$ that achieves $\mathcal{W}_{2}(\mu,\nu)$ . That is, $\mathbf{P}=\mathcal{L}((W,V))$ with $\mu=\mathcal{L}(W)$ , $\nu=\mathcal{L}(V)$ , and

Interchanging the roles of $\mu$ and $\nu$ , we complete the proof. ∎

Appendix D Proof of Lemma 3.6

Conditioned on $\mathbf{Z}=\mathbf{z}$ , $\{W_{k}\}^{\infty}_{k=0}$ is a time-homogeneous Markov process. Consider the following continuous-time interpolation of this process:

where $\overline{U}_{\mathbf{z}}(t)\equiv U_{\mathbf{z},k}$ for $t\in[k\eta,(k+1)\eta)$ . Note that, for each $k$ , $\overline{W}(k\eta)$ and $W_{k}$ have the same probability law $\mu_{\mathbf{z},k}$ . Moreover, by a result of Gyöngy (1986), the process $\overline{W}(t)$ has the same one-time marginals as the Itô process

Crucially, $V(t)$ is a Markov process, while $\overline{W}(t)$ is not. Let $\mathbf{P}^{t}_{V}\mathrel{\mathop{\mathchar 58\relax}}=\mathcal{L}\big{(}V(s)\mathrel{\mathop{\mathchar 58\relax}}0\leq s\leq t\big{|}\mathbf{Z}=\mathbf{z}\big{)}$ and $\mathbf{P}^{t}_{W}\mathrel{\mathop{\mathchar 58\relax}}=\mathcal{L}\big{(}W(s)\mathrel{\mathop{\mathchar 58\relax}}0\leq s\leq t\big{|}\mathbf{Z}=\mathbf{z}\big{)}$ . The Radon–Nikodym derivative of $\mathbf{P}^{t}_{W}$ w.r.t. $\mathbf{P}^{t}_{V}$ is given by the Girsanov formula

(see, e.g., Sec. 7.6.4 in Liptser and Shiryaev (2001)). Using (D.3) and the martingale property of the Itô integral, we have

where the last line follows from the fact that $\mathcal{L}(\overline{W}(s))=\mathcal{L}(V(s))$ for each $s$ .

We first estimate the first summation in (D.4). Consider some $s\in[j\eta,(j+1)\eta)$ . From (D.1), we have

Therefore, using Lemmas 3.1 and 3.2 and the gradient noise assumption (A.4), we arrive at

Similarly, the second summation on the right-hand side of (D.4) can be estimated as follows:

Substituting Eqs. (D.5) and (D.6) into (D.4), we obtain

Now, since $\mu_{\mathbf{z},k}=\mathcal{L}(\overline{W}(k\eta)|\mathbf{Z}=\mathbf{z})$ and $\nu_{\mathbf{z},k\eta}=\mathcal{L}(W(k\eta)|\mathbf{Z}=\mathbf{z})$ , the data-processing inequality for the KL divergence gives

Appendix E Proof of Proposition 3.2

To establish the log-Sobolev inequality, we will use the Lyapunov function criterion of Cattiaux et al. (2010), reproduced as Proposition A.2 in Appendix A.

We will apply this proposition to the Gibbs distribution $\pi_{\mathbf{z}}$ for some $\mathbf{z}\in\mathsf{Z}^{n}$ , so that $H=\beta F_{\mathbf{z}}$ and

We consider the same Lyapunov function $V(w)=e^{m\beta\|w\|^{2}/4}$ as in Appendix B. From Eq. (B.1), $V$ evidently satisfies (A.13) with $\kappa$ and $\gamma$ given in (B.2), i.e., the first condition of Proposition A.2 is satisfied. Moreover, $\pi_{\mathbf{z}}$ satisfies a Poincaré inequality with constant $c_{\rm P}\leq 1/\lambda_{*}$ . Thus, the second condition is also satisfied. Finally, by the $M$ -smoothness assumption (A.2), $\nabla^{2}F_{\mathbf{z}}\succeq-MI_{d}$ , so the third condition of Proposition A.2 is satisfied with $K=\beta M$ . Consequently, the constants $C_{1}$ and $C_{2}$ in (A.14) are given by

where we have also used the estimate (3.19) to upper-bound $C_{2}$ . Therefore, from Proposition A.2 and from (E.1) it follows that $\pi_{\mathbf{z}}$ satisfies a logarithmic Sobolev inequality with