A Hitting Time Analysis of Stochastic Gradient Langevin Dynamics

Yuchen Zhang, Percy Liang, Moses Charikar

Introduction

A central challenge of non-convex optimization is avoiding sub-optimal local minima. Although escaping all local minima is NP-hard in general [e.g. 7], one might expect that it should be possible to escape “appropriately shallow” local minima, whose basins of attraction have relatively low barriers. As an illustrative example, consider minimizing an empirical risk function in Figure 1. As the figure shows, although the empirical risk is uniformly close to the population risk, it contains many poor local minima that don’t exist in the population risk. Gradient descent is unable to escape such local minima.

A natural workaround is to inject random noise to the gradient. Empirically, adding gradient noise has been found to improve learning for deep neural networks and other non-convex models . However, theoretical understanding of the value of gradient noise is still incomplete. For example, Ge et al. show that by adding isotropic noise ww and by choosing a sufficiently small stepsize η\eta, the iterative update:

is able to escape strict saddle points. Unfortunately, this approach, as well as the subsequent line of work on escaping saddle points , doesn’t guarantee escaping even shallow local minima.

Another line of work in Bayesian statistics studies the Langevin Monte Carlo (LMC) method , which employs an alternative noise term. Given a function ff, LMC performs the iterative update:

where ξ>0\xi>0 is a “temperature” hyperparameter. Unlike the bounded noise added in formula (1), LMC adds a large noise term that scales with 1/η\sqrt{1/\eta}. With a small enough η\eta, the noise dominates the gradient, enabling the algorithm to escape any local minimum. For empirical risk minimization, one might substitute the exact gradient f(x)\nabla f(x) with a stochastic gradient, which gives the Stochastic Gradient Langevin Dynamics (SGLD) algorithm . It can be shown that both LMC and SGLD asymptotically converge to a stationary distribution μ(x)eξf(x)\mu(x)\propto e^{-\xi f(x)} . As ξ\xi\to\infty, the probability mass of μ\mu concentrates on the global minimum of the function ff, and the algorithm asymptotically converges to a neighborhood of the global minimum.

Despite asymptotic consistency, there is no theoretical guarantee that LMC is able to find the global minimum of a general non-convex function, or even a local minimum of it, in polynomial time. Recent works focus on bounding the mixing time (i.e. the time for converging to μ\mu) of LMC and SGLD. Bubeck et al. , Dalalyan and Bonis prove that on convex functions, LMC converges to the stationary distribution in polynomial time. On non-convex functions, however, an exponentially long mixing time is unavoidable in general. According to Bovier et al. , it takes the Langevin diffusion at least eΩ(ξh)e^{\Omega(\xi h)} time to escape a depth-hh basin of attraction. Thus, if the function contains multiple “deep” basins with h=Ω(1)h=\Omega(1), then the mixing time is lower bounded by eΩ(ξ)e^{\Omega(\xi)}.

In parallel work to this paper, Raginsky et al. upper bound the time of SGLD converging to an approximate global minimum of non-convex functions. They show that the upper bound is polynomial in the inverse of a quantity they call the uniform spectral gap. Similar to the mixing time bound, in the presence of multiple local minima, the convergence time to an approximate global minimum can be exponential in dimension dd and the temperature parameter ξ\xi.

This stability property is useful in studying empirical risk minimization (ERM) in situations where the empirical risk ff is pointwise close to the population risk FF, but has poor local minima that don’t exist in the population risk. This phenomenon has been observed in statistical estimation with non-convex penalty functions , as well as in minimizing the zero-one loss (see Figure 1). Under this setting, our result implies that SGLD achieves an approximate local minimum of the (smooth) population risk in polynomial time, ruling out local minima that only exist in the empirical risk. It improves over recent results on non-convex optimization , which compute approximate local minima only for the empirical risk.

As a concrete application, we prove a stronger learnability result for the problem of learning linear classifiers under the zero-one loss , which involves non-convex and non-smooth empirical risk minimization. Our result improves over the recent result of Awasthi et al. : the method of Awasthi et al. handles noisy data corrupted by a very small Massart noise (at most 1.8×1061.8\times 10^{-6}), while our algorithm handles Massart noise up to any constant less than 0.50.5. As a Massart noise of 0.50.5 represents completely random observations, we see that SGLD is capable of learning from very noisy data.

Algorithm and main results

In this section, we define the algorithm and the basic concepts, then present the main theoretical results of this paper.

Because of the noisy update, the sequence (x0,x1,x2,)(x_{0},x_{1},x_{2},\dots) asymptotically converges to a stationary distribution rather than a stationary point . Although this fact introduces challenges to the analysis, we show that its non-asymptotic efficiency can be characterized by a positive quantity called the restricted Cheeger constant.

2 Restricted Cheeger constant

For any measurable function ff, we define a probability measure μf\mu_{f} whose density function is:

For any function ff and any subset VKV\subset K, we define the restricted Cheeger constant as:

The restricted Cheeger constant generalizes the notion of the Cheeger isoperimetric constant , quantifying how well a subset of VV can be made as least connected as possible to the rest of the parameter space. The connectivity is measured by the ratio of the surface measure lim infϵ0μf(Aϵ)μf(A)ϵ\liminf_{\epsilon\searrow 0}\frac{\mu_{f}(A_{\epsilon})-\mu_{f}(A)}{\epsilon} to the set measure μf(A)\mu_{f}(A). Intuitively, this quantifies the chance of escaping the set AA under the probability measure μf\mu_{f}.

A property that will be important in the sequal is that the restricted Cheeger constant is stable under perturbations: if we perturb ff by a small amount, then the values of μf\mu_{f} won’t change much, so that the variation on Cf(V)\mathcal{C}_{f}(V) will also be small. More precisely, for functions f1f_{1} and f2f_{2} satisfying supxKf1(x)f2(x)=ν\sup_{x\in K}|f_{1}(x)-f_{2}(x)|=\nu, we have

and similarly Cf2(V)e2νCf1(V)\mathcal{C}_{f_{2}}(V)\geq e^{-2\nu}\mathcal{C}_{f_{1}}(V). As a result, if two functions f1f_{1} and f2f_{2} are uniformly close, then we have Cf1(V)Cf2(V)\mathcal{C}_{f_{1}}(V)\approx\mathcal{C}_{f_{2}}(V) for a constant ν\nu. This property enables us to lower bound Cf1(V)\mathcal{C}_{f_{1}}(V) by lower bounding the restricted Cheeger constant of an alternative function f2f1f_{2}\approx f_{1}, which might be easier to analyze.

3 Generic non-asymptotic bounds

We make several assumptions on the parameter space and on the objective function.

The parameter space KK satisfies: there exists hmax>0h_{\textrm{max}}>0, such that for any xKx\in K and any hhmaxh\leq h_{\textrm{max}}, the random variable yN(x,2hI)y\sim N(x,2hI) satisfies P(yK)13P(y\in K)\geq\frac{1}{3}.

The function f:K[0,B]f:K\to[0,B] is bounded, differentiable and LL-smooth in KK, meaning that for any x,yKx,y\in K, we have f(y)f(x)yx,f(x)L2yx22|f(y)-f(x)-\langle y-x,\,\nabla f(x)\rangle|\leq\frac{L}{2}\|{y-x}\|_{2}^{2}.

The first assumption states that the parameter space doesn’t contain sharp corners, so that the update (3d) won’t be stuck at the same point for too many iterations. It can be satisfied, for example, by defining the parameter space to be an Euclidean ball and choosing hmax=o(d2)h_{\textrm{max}}=o(d^{-2}). The probability 1/31/3 is arbitrary and can be replaced by any constant in (0,1/2)(0,1/2). The second assumption requires the function ff to be smooth. We show how to handle non-smooth functions in Section 3 by appealing to the stability property of the restricted Cheeger constant discussed earlier. The third assumption requires the stochastic gradient to have sub-exponential tails, which is a standard assumption in stochstic optimization.

The iteration number kmaxk_{\textrm{max}} is bounded by

where the numerator MM is polynomial in (B,L,G,log(1/δ),d,ξ,η0/η,hmax1,bmax1,ρ1)(B,L,G,\log(1/\delta),d,\xi,\eta_{0}/\eta,h_{\textrm{max}}^{-1},b_{\textrm{max}}^{-1},\rho^{-1}). See Appendix B.2 for the explicit polynomial dependence.

Theorem 1 is a generic result that applies to all optimization problems satisfying Assumption A. The right-hand side of the bound (7) is determined by the choice of UU. If we choose UU to be the set of (approximate) local minima, and let ρ>0\rho>0 be sufficiently small, then f(x^)f(\widehat{x}) will roughly be bounded by the worst local minimum. The theorem permits ξ\xi to be arbitrary provided the stepsize η\eta is small enough. Choosing a larger ξ\xi means adding less noise to the SLGD update, which means that the algorithm will be more efficient at finding a stationary point, but less efficient at escaping local minima. Such a trade-off is captured by the restricted Cheeger constant in inequality (8) and will be rigorously studied in the next subsection.

The iteration complexity bound is governed by the restricted Cheeger constant. For any function ff and any target set UU with a positive Borel measure, the restricted Cheeger constant is strictly positive (see Appendix A), so that with a small enough η\eta, the algorithm always converges to the global minimum asymptotically. We remark that the SGD doesn’t enjoy the same asymptotic optimality guarantee, because it uses a O(η)O(\eta) gradient noise in contrast to SGLD’s O(η)O(\sqrt{\eta}) one. Since the convergence theory requires a small enough η\eta, we often have ηη\eta\ll\sqrt{\eta}. the SGD noise is too conservative to allow the algorithm to escape local minima.

The proof of Theorem 1 is fairly technical. We defer the full proof to Appendix B, only sketching the basic proof ideas here. At a high level, we establish the theorem by bounding the hitting time of the Markov chain (x0,x1,x2,)(x_{0},x_{1},x_{2},\dots) to the set Uρ:={x:d(x,U)ρ}U_{\rho}:=\{x:d(x,U)\leq\rho\}. Indeed, if some xkx_{k} hits the set, then:

In order to bound the hitting time, we construct a time-reversible Markov chain, and prove that its hitting time to UρU_{\rho} is on a par with the original hitting time. To analyze this second Markov chain, we define a notion called the restricted conductance, which measures how easily the Markov chain can transition between states within K\UρK\backslash U_{\rho}. This quantity is related to the notion of conductance in the analysis of time-reversible Markov processes , but the ratio between these two quantities can be exponentially large for non-convex ff. We prove that the hitting time of the second Markov chain depends inversely on the restricted conductance, so that the problem reduces to lower bounding the restricted conductance.

Finally, we lower bound the restricted conductance by the restricted Cheeger constant. The former quantity characterizes the Markov chain, while the later captures the geometric properties of the function ff. Thus, we must analyze the SGLD algorithm in depth to establish a connection between them. Once we prove this lower bound, putting all pieces together completes the proof. \blacksquare

4 Lower bounding the restricted Cheeger constant

In this subsection, we prove lower bounds on the restricted Cheeger constant C(ξf)(K\U)\mathcal{C}_{(\xi f)}(K\backslash U) in order to flesh out the iteration complexity bound of Theorem 1. We start with a lower bound for the class of convex functions:

Let KK be a dd-dimensional unit ball. For any convex G{G}-Lipschitz continuous function ff and any ϵ>0\epsilon>0, let the set of ϵ\epsilon-optimal solutions be defined by:

Then for any ξ2dlog(4G/ϵ)ϵ\xi\geq\frac{2d\log(4{G}/\epsilon)}{\epsilon}, we have C(ξf)(K\U)1\mathcal{C}_{(\xi f)}(K\backslash U)\geq 1.

The proposition shows that if we choose a big enough ξ\xi, then C(ξf)(K\U)\mathcal{C}_{(\xi f)}(K\backslash U) will be lower bounded by a universal constant. The lower bound is proved based on an isoperimetric inequality for log-concave distributions. See Appendix C for the proof.

For non-convex functions, directly proving the lower bound is difficult, because the definition of C(ξf)(K\U)\mathcal{C}_{(\xi f)}(K\backslash U) involves verifying the properties of all subsets AK\UA\subset K\backslash U. We start with a generic lemma that reduces the problem to checking properties of all points in K\UK\backslash U.

Lemma 1 reduces the problem of lower bounding Cf(V)\mathcal{C}_{f}(V) to the problem of finding a proper vector field ϕ\phi and verifying its properties for all points xVx\in V. Informally, the quantity Cf(V)\mathcal{C}_{f}(V) measures the chance of escaping the set VV. The lemma shows that if we can construct an “oracle” vector field ϕ\phi, such that at every point xVx\in V it gives the correct direction (i.e. ϕ(x)-\phi(x)) to escape VV, but always stay in KK, then we obtain a strong lower bound on Cf(V)\mathcal{C}_{f}(V). This construction is merely for the theoretical analysis and doesn’t affect the execution of the algorithm.

The proof idea is illustrated in Figure 2: by constructing a mapping π:xxϵϕ(x)\pi:x\mapsto x-\epsilon\phi(x) that satisfies the conditions of the lemma, we obtain π(A)Aϵ\pi(A)\subset A_{\epsilon} for all AVA\subset V, and consequently μf(π(A))μf(Aϵ)\mu_{f}(\pi(A))\leq\mu_{f}(A_{\epsilon}). Then we are able to lower bound the restricted Cheeger constant by:

where dAdA is an infinitesimal of the set VV. It can be shown that the right-hand side of inequality (10) is equal to infxV{ϕ(x),f(x)divϕ(x)}\inf_{x\in V}\{\langle\phi(x),\,\nabla f(x)\rangle-{\textrm{div}}\,\phi(x)\}, which establishes the lemma. See Appendix D for a rigorous proof. \blacksquare

Before demonstrating the applications of Lemma 1, we make several additional mild assumptions on the parameter space and on the function ff.

The parameter space KK is a dd-dimensional ball of radius r>0r>0 centered at the origin. There exists r0>0r_{0}>0 such that for every point xx satisfying x2[rr0,r]\|{x}\|_{2}\in[r-r_{0},r], we have x,f(x)x2\langle x,\,\nabla f(x)\rangle\geq\|{x}\|_{2}.

For some G,L,H>0{G},L,H>0, the function ff is third-order differentiable with f(x)2G\|{\nabla f(x)}\|_{2}\leq{G}, 2f(x)L\|{\nabla^{2}f(x)}\|_{*}\leq L and 2f(x)2f(y)Hxy2\|{\nabla^{2}f(x)-\nabla^{2}f(y)}\|_{*}\leq H\|{x-y}\|_{2} for any x,yKx,y\in K.

The first assumption requires the parameter space to be an Euclidean ball and imposes a gradient condition on its boundary. This is made mainly for the convenience of theoretical analysis. We remark that for any function ff, the condition on the boundary can be satisfied by adding a smooth barrier function ρ(x2)\rho(\|{x}\|_{2}) to it, where the function ρ(t)=0\rho(t)=0 for any t<r2r0t<r-2r_{0}, but sharply increases on the interval [rr0,r][r-r_{0},r] to produce large enough gradients. The second assumption requires the function ff to be third-order differentiable. We shall relax the second assumption in Section 3.

The following proposition describes a lower bound on C(ξf)(K\U)\mathcal{C}_{(\xi f)}(K\backslash U) when ff is a smooth function and the set UU consists of approximate stationary points. Although we shall prove a stronger result, the proof of this proposition is a good example for demonstrating the power of Lemma 1.

Assume that Assumption B holds. For any ϵ>0\epsilon>0, define the set of ϵ\epsilon-approximate stationary points U:={xK:f(x)2<ϵ}U:=\{x\in K:\|{\nabla f(x)}\|_{2}<\epsilon\}. For any ξ2L/ϵ2\xi\geq 2L/\epsilon^{2}, we have C(ξf)(K\U)ξϵ22G\mathcal{C}_{(\xi f)}(K\backslash U)\geq\frac{\xi\epsilon^{2}}{2{G}}.

Recall that G{G} is the Lipschitz constant of function ff. Let the vector field be defined by ϕ(x):=1Gf(x)\phi(x):=\frac{1}{{G}}\nabla f(x), then we have ϕ(x)21\|{\phi(x)}\|_{2}\leq 1. By Assumption B, it is easy to verify that the conditions of Lemma 1 hold. For any xK\Ux\in K\backslash U, the fact that f(x)2ϵ\|{\nabla f(x)}\|_{2}\geq\epsilon implies:

Recall that LL is the smoothness parameter. By Assumption B, the divergence of ϕ(x)\phi(x) is upper bounded by divϕ(x)=1Gtr(2f(x))1G2f(x)LG{\textrm{div}}\,\phi(x)=\frac{1}{{G}}{\textrm{tr}}(\nabla^{2}f(x))\leq\frac{1}{{G}}\|{\nabla^{2}f(x)}\|_{*}\leq\frac{L}{{G}}. Consequently, if we choose ξ2L/ϵ2\xi\geq 2L/\epsilon^{2} as assumed, then we have:

Lemma 1 then establishes the claimed lower bound. ∎

Next, we consider approximate local minima , which rules out local maxima and strict saddle points. For an arbitrary ϵ>0\epsilon>0, the set of ϵ\epsilon-approximate local minima is defined by:

We note that an approximate local minimum is not necessarily close to any local minimum of ff. However, if we assume in addition the the function satisfies the (robust) strict-saddle property , then any point xUx\in U is guaranteed to be close to a local minimum. Based on definition (11), we prove a lower bound for the set of approximate local minima.

Assume that Assumption B holds. For any ϵ>0\epsilon>0, let UU be the set of ϵ\epsilon-approximate local minima. For any ξ\xi satisfying

we have C(ξf)(K\U)ϵ8(2G+1)G\mathcal{C}_{(\xi f)}(K\backslash U)\geq\frac{\sqrt{\epsilon}}{8(2{G}+1){G}}. The notation O~(1)\widetilde{\mathcal{O}}(1) hides a poly-logarithmic function of (L,1/ϵ)(L,1/\epsilon).

Proving Proposition 3 is significantly more challenging than proving Proposition 2. From a high-level point of view, we still construct a vector field ϕ\phi, then lower bound the expression ϕ(x),ξf(x)divϕ(x)\langle\phi(x),\,\xi\nabla f(x)\rangle-{\textrm{div}}\,\phi(x) for every point xK\Ux\in K\backslash U in order to apply Lemma 1. However, there exist saddle points in the set K\UK\backslash U, such that the inner product ϕ(x),ξf(x)\langle\phi(x),\,\xi\nabla f(x)\rangle can be very close to zero. For these points, we need to carefully design the vector field so that the term divϕ(x){\textrm{div}}\,\phi(x) is strictly negative and bounded away from zero. To this end, we define ϕ(x)\phi(x) to be the sum of two components. The first component aligns with the gradient f(x)\nabla f(x). The second component aligns with a projected vector Πx(f(x))\Pi_{x}(\nabla f(x)), which projects f(x)\nabla f(x) to the linear subspace spanned by the eigenvectors of 2f(x)\nabla^{2}f(x) with negative eigenvalues. It can be shown that the second component produces a strictly negative divergence in the neighborhood of strict saddle points. See Appendix E for the complete proof. \blacksquare

5 Polynomial-time bound for finding an approximate local minimum

Combining Proposition 3 with Theorem 1, we conclude that SGLD finds an approximate local minimum of the function ff in polynomial time, assuming that ff is smooth enough to satisfy Assumption B.

Assume that Assumptions A,B hold. For an arbitrary ϵ>0\epsilon>0, let UU be the set of ϵ\epsilon-approximate local minima. For any ρ,δ>0\rho,\delta>0, there exists a large enough ξ\xi and hyperparameters (η,kmax,D)(\eta,k_{\textrm{max}},D) such that with probability at least 1δ1-\delta, SGLD returns a solution x^\widehat{x} satisfying

The iteration number kmaxk_{\textrm{max}} is bounded by a polynomial function of all hyperparameters in the assumptions as well as (ϵ1,ρ1,log(1/δ))(\epsilon^{-1},\rho^{-1},\log(1/\delta)).

Similarly, we can combine Proposition 1 or Proposition 2 with Theorem 1, to obtain complexity bounds for finding the global minimum of a convex function, or finding an approximate stationary point of a smooth function.

Corollary 1 doesn’t specify any upper limit on the temperature parameter ξ\xi. As a result, SGLD can be stuck at the worst approximate local minima. It is important to note that the algorithm’s capability of escaping certain local minima relies on a more delicate choice of ξ\xi. Given objective function ff, we consider an arbitrary smooth function FF such that fF1/ξ\|{f-F}\|_{\infty}\leq 1/\xi. By Theorem 1, for any target subset UU, the hitting time of SGLD can be controlled by lower bounding the restricted Cheeger constant Cξf(K\U)\mathcal{C}_{\xi f}(K\backslash U). By the stability property (6), it is equivalent to lower bounding CξF(K\U)\mathcal{C}_{\xi F}(K\backslash U) because ff and FF are uniformly close. If ξ>0\xi>0 is chosen large enough (w.r.t. smoothness parameters of FF), then the lower bound established by Proposition 3 guarantees a polynomial hitting time to the set UFU_{F} of approximate local minima of FF. Thus, SGLD can efficiently escape all local minimum of ff that lie outside of UFU_{F}. Since the function FF is arbitrary, it can be thought as a favorable perturbation of ff such that the set UFU_{F} eliminates as many local minima of ff as possible. The power of such perturbations are determined by their maximum scale, namely the quantity 1/ξ1/\xi. Therefore, it motivates choosing the smallest possible ξ\xi whenever it satisfies the lower bound in Proposition 3.

The above analysis doesn’t specify any concrete form of the function FF. In Section 3, we present a concrete analysis where the function FF is assumed to be the population risk of empirical risk minimization (ERM). We establish sufficient conditions under which SGLD efficiently finds an approximate local minima of the population risk.

Applications to empirical risk minimization

In this section, we apply SGLD to a specific family of functions, taking the form:

We shall prove that under certain conditions, SGLD finds an approximate local minimum of the (presumably smooth) population risk in polynomial time, even if it is executed on a non-smooth empirical risk. More concretely, we run SGLD on a smoothed approximation of the empirical risk that satisfies Assumption A. With large enough sample size, the empirical risk ff and its smoothed approximation will be close enough to the population risk FF, so that combining the stability property with Theorem 1 and Proposition 3 establishes the hitting time bound. First, let’s formalize the assumptions.

The parameter space KK satisfies: there exists hmax>0h_{\textrm{max}}>0, such that for any xKx\in K and any hhmaxh\leq h_{\textrm{max}}, the random variable yN(x,2hI)y\sim N(x,2hI) satisfies P(yK)13P(y\in K)\geq\frac{1}{3}.

There exist ρ\mboxK,ν>0\rho_{\mbox{\tiny K}},\nu>0 such that in the set K:={x:d(x,K)ρ\mboxK}\overline{K}:=\{x:d(x,K)\leq\rho_{\mbox{\tiny K}}\}, the population risk FF is G{G}-Lipschitz continuous, and supxKf(x)F(x)ν\sup_{x\in\overline{K}}|f(x)-F(x)|\leq\nu.

Since the function ff can be non-differentiable, the stochastic gradient may not be well defined. We consider a smooth approximation of it following the idea of Duchi et al. :

The iteration number kmaxk_{\textrm{max}} is polynomial in (B,log(1/δ),d,hmax1,ν1,ρ\mboxK1,C(ξF)1(K\U))(B,\log(1/\delta),d,h_{\textrm{max}}^{-1},\nu^{-1},\rho_{\mbox{\tiny K}}^{-1},\mathcal{C}^{-1}_{(\xi F)}(K\backslash U)).

In order to lower bound the restricted Cheeger constant C(ξF)(K\U)\mathcal{C}_{(\xi F)}(K\backslash U), we resort to the general lower bounds in Section 2.4. Consider population risks that satisfy the conditions of Assumption B. By combining Theorem 2 with Proposition 3, we conclude that SGLD finds an approximate local minimum of the population risk in polynomial time.

Assume that Assumption C holds. Also assume that Assumption B holds for the population risk FF with smoothness parameters (G,L,H)({G},L,H). For any ϵ>0\epsilon>0, let UU be the set of ϵ\epsilon-approximate local minima of FF. If

Learning linear classifiers with zero-one loss

where 1q(a)2[0,0.5]\frac{1-q(a)}{2}\in[0,0.5] is the Massart noise level. We assume that the noise level is strictly smaller than 0.50.5 when the feature vector aa is separated apart from the decision boundary. Formally, there is a constant 0<q010<q_{0}\leq 1 such that

Assume that d2d\geq 2. For any q0(0,1]q_{0}\in(0,1] and ϵ,δ>0\epsilon,\delta>0, if the sample size nn satisfies:

then there exist hyperparameters (ξ,η,σ,kmax,D)(\xi,\eta,\sigma,k_{\textrm{max}},D) such that SGLD on the smoothed function (13) returns a solution x^\widehat{x} satisfying F(x^)F(x)+ϵF(\widehat{x})\leq F(x^{*})+\epsilon with probability at least 1δ1-\delta. The notation O~(1)\widetilde{O}(1) hides a poly-logarithmic function of (d,1/q0,1/ϵ,1/δ)(d,1/q_{0},1/\epsilon,1/\delta). The time complexity of the algorithm is polynomial in (d,1/q0,1/ϵ,log(1/δ))(d,1/q_{0},1/\epsilon,\log(1/\delta)).

The proof consists of two parts. For the first part, we prove that the population risk is Lipschitz continuous and the empirical risk uniformly converges to the population risk, so that Assumption C hold. For the second part, we lower bound the restricted Cheeger constant by Lemma 1. The proof is spiritually similar to that of Proposition 2 or Proposition 3. We define UU to be the set of approximately optimal solutions, and construct a vector field ϕ\phi such that:

By lower bounding the expression ϕ(x),f(x)divϕ(x)\langle\phi(x),\,\nabla f(x)\rangle-{\textrm{div}}\,\phi(x) for all xK\Ux\in K\backslash U, Lemma 1 establishes a lower bound on the restricted Cheeger constant. The theorem is established by combining the two parts together and by Theorem 2. We defer the full proof to Appendix G. \blacksquare

Conclusion

In this paper, we analyzed the hitting time of the SGLD algorithm on non-convex functions. Our approach is different from existing analyses on Langevin dynamics , which connect LMC to a continuous-time Langevin diffusion process, then study the mixing time of the latter process. In contrast, we are able to establish polynomial-time guarantees for achieving certain optimality sets, regardless of the exponential mixing time.

For future work, we hope to establish stronger results on non-convex optimization using the techniques developed in this paper. Our current analysis doesn’t apply to training over-specified models. For these models, the empirical risk can be minimized far below the population risk , thus the assumption of Corollary 2 is violated. In practice, over-specification often makes the optimization easier, thus it could be interesting to show that this heuristic actually improves the restricted Cheeger constant. Another open problem is avoiding poor population local minima. Jin et al. proved that there are many poor population local minima in training Gaussian mixture models. It would be interesting to investigate whether a careful initialization could prevent SGLD from hitting such bad solutions.

References

Appendix A Restricted Cheeger constant is strictly positive

In this appendix, we prove that under mild conditions, the restricted Cheeger constant for a convex parameter space is always strictly positive. Let KK be an arbitrary convex parameter space with diameter D<+D<+\infty. Lovász and Simonovits [22, Theorem 2.6] proved the following isoperimetric inequality: for any subset AKA\subset K and any ϵ>0\epsilon>0, the following lower bound holds:

where vol(A){\textrm{vol}}(A) represents the Borel measure of set AA. Let f0(x):=0f_{0}(x):=0 be a constant zero function. By the definition of the function-induced probability measure, we have

Combining the inequality (22) with equation (23), we obtain:

If the set AA satisfies AVKA\subset V\subset K, then 1μf0(Aϵ)1μf0(Vϵ)1-\mu_{f_{0}}(A_{\epsilon})\geq 1-\mu_{f_{0}}(V_{\epsilon}). Combining it with the above inequality, we obtain:

According to the definition of the restricted Cheeger constant, the above lower bound implies:

Consider an arbitrary bounded function ff satisfying supxKf(x)B<+\sup_{x\in K}|f(x)|\leq B<+\infty, combining the stability property (6) and inequality (24), we obtain:

We summarize the result as the following proposition.

Appendix B Proof of Theorem 1

The proof consists of two parts. We first establish a general bound on the hitting time of Markov chains to a certain subset UKU\subset K, based on the notion of restricted conductance. Then we prove that the hitting time of SGLD can be bounded by the hitting time of a carefully constructed time-reversible Markov chain. This Markov chain runs a Metropolis-Hastings algorithm that converges to the stationary distribution μξf\mu_{\xi f}. We prove that this Markov chain has a bounded restricted conductance, whose value is characterized by the restricted Cheeger constant that we introduced in Section 2.2. Combining the two parts establishes the general theorem.

For an arbitrary Markov chain defined on the parameter space KK, we represent the Markov chain by its transition kernel π(x,A)\pi(x,A), which gives the conditional probability that the next state satisfies xk+1Ax_{k+1}\in A given the current state xk=xx_{k}=x. Similarly, we use π(x,x)\pi(x,x^{\prime}) to represent the conditional probability P(xk+1=xxk=x)P(x_{k+1}=x^{\prime}|x_{k}=x). If π\pi has a stationary distribution, then we denote it by QπQ_{\pi}.

A Markov chain is call lazy if π(x,x)1/2\pi(x,x)\geq 1/2 for every xKx\in K, and is called time-reversible if it satisfies

If (x0,x1,x2,)(x_{0},x_{1},x_{2},\dots) is a realization of the Markov chain π\pi, then the hitting time to some set UKU\subset K is denoted by:

For arbitrary subset VKV\subset K, we define the restricted conductance, denoted by Φπ(V)\Phi_{\pi}(V), to be the following infinimum ratio:

Based on the notion of restricted conductance, we present a general upper bound on the hitting time. For arbitrary subset UKU\subset K, suppose that π~{\widetilde{\pi}} is an arbitrary Markov chain whose transition kernel is stationary inside UU, namely it satisfies π~(x,x)=1{\widetilde{\pi}}(x,x)=1 for any xUx\in U. Let (x~0,x~1,x~2,)(\widetilde{x}_{0},\widetilde{x}_{1},\widetilde{x}_{2},\dots) be a realization of the Markov chain π~{\widetilde{\pi}}. We denote by QkQ_{k} the probability distribution of x~k\widetilde{x}_{k} at iteration kk. In addition, we define a measure of closeness between any two Markov chains.

For two Markov chains π\pi and π~{\widetilde{\pi}}, we say that π~{\widetilde{\pi}} is ϵ\epsilon-close to π\pi w.r.t. a set UU if the following condition holds for any xK\Ux\in K\backslash U and any AK\{x}A\subset K\backslash\{x\}:

Then we are able to prove the following lemma.

Let π\pi be a time-reversible lazy Markov chain with atom-free stationary distribution QπQ_{\pi}. Assume that π~{\widetilde{\pi}} is ϵ\epsilon-close to π\pi w.r.t. UU where ϵ14Φπ(K\U)\epsilon\leq\frac{1}{4}\Phi_{\pi}(K\backslash U). If there is a constant MM such that the distribution Q0Q_{0} satisfies Q0(A)MQπ(A)Q_{0}(A)\leq M\,Q_{\pi}(A) for any AK\UA\subset K\backslash U, then for any δ>0\delta>0, the hitting time of the Markov chain is bounded by:

See Appendix B.3.1 for the proof of Lemma 2. The lemma shows that if the two chains π\pi and π~{\widetilde{\pi}} are sufficiently close, then the hitting time of the Markov chain π~{\widetilde{\pi}} will be inversely proportional to the square of the restricted conductance of the Markov chain π\pi, namely Φπ(K\U)\Phi_{\pi}(K\backslash U). Note that if the density function of distribution QπQ_{\pi} is bounded, then by choosing Q0Q_{0} to be the uniform distribution over KK, there exists a finite constant MM such that Q0(A)MQπ(A)Q_{0}(A)\leq MQ_{\pi}(A), satisfying the last condition of Lemma 2.

B.2 Proof of the theorem

The SGLD algorithm initializes x0x_{0} by the uniform distribution μf0\mu_{f_{0}} (with f0(x)0f_{0}(x)\equiv 0). Then at iteration k1k\geq 1, it performs the following update:

We refer the particular setting ξ=1\xi=1 as the “standard setting”. For the “non-standard” setting of ξ1\xi\neq 1, we rewrite the first equation as:

This re-formulation reduces to the problem to the standard setting, with stepsize η/ξ\eta/\xi and objective function ξf\xi f. Thus it suffices to prove the theorem in the standard setting, then plug in the stepsize η/ξ\eta/\xi and the objective function ξf\xi f to obtain the general theorem. Therefore, we assume ξ=1\xi=1 and consider the sequence of points (x0,x1,)(x_{0},x_{1},\dots) generated by:

We introduce two additional notations: for arbitrary functions f1,f2f_{1},f_{2}, we denote the maximal gap supxKf1(x)f2(x)\sup_{x\in K}|f_{1}(x)-f_{2}(x)| by the shorthand f1f2\|{f_{1}-f_{2}}\|_{\infty}. For arbitrary set VKV\subset K and ρ>0\rho>0, we denote the super-set {xK:d(x,V)ρ}\{x\in K:d(x,V)\leq\rho\} by the shorthand VρV_{\rho}. Then we prove the following theorem for the standard setting.

Assume that Assumption A holds. Let x0x_{0} be sampled from μf0\mu_{f_{0}} and let the Markov chain (x0,x1,x2,)(x_{0},x_{1},x_{2},\cdots) be generated by update (33). Let UKU\subset K be an arbitrary subset and let ρ>0\rho>0 be an arbitrary positive number. Let C:=Cf(K\U)\mathcal{C}:=\mathcal{C}_{f}(K\backslash U) be a shorthand notation. Then for any δ>0\delta>0 and any stepsize η\eta satisfying

the hitting time to set UρU_{\rho} is bounded by

with probability at least 1δ1-\delta. Here, c,c>0c,c^{\prime}>0 are universal constants.

Theorem 4 shows that if we choose η(0,η0]\eta\in(0,\eta_{0}], where η0\eta_{0} is the right-hand side of inequality (34), then with probability at least 1δ1-\delta, the hitting time to the set UρU_{\rho} is bounded by

Combining it with the definition of η0\eta_{0}, and with simple algebra, we conclude that min{k:xkUρ}Mmin{1,C}4\min\{k:x_{k}\in U_{\rho}\}\leq\frac{M}{\min\{1,\mathcal{C}\}^{4}} where MM is polynomial in (B,L,G,log(1/δ),d,η0/η,hmax1,bmax1,ρ1)(B,L,G,\log(1/\delta),d,\eta_{0}/\eta,h_{\textrm{max}}^{-1},b_{\textrm{max}}^{-1},\rho^{-1}). This establishes the iteration complexity bound. Whenever xkx_{k} hits UρU_{\rho}, we have

which establishes the risk bound. Thus, Theorem 4 establishes Theorem 1 for the special case of ξ=1\xi=1.

In the non-standard setting (ξ1\xi\neq 1), we follow the reduction described above to substitute (η,f)(\eta,f) in Theorem 4 with the pair (η/ξ,ξf)(\eta/\xi,\xi f). As a consequence, the quantity C\mathcal{C} is substituted with C(ξf)(K\U)\mathcal{C}_{(\xi f)}(K\backslash U), and (B,L,G,η0,bmax)(B,L,G,\eta_{0},b_{\textrm{max}}) are substituted with (ξB,ξL,ξG,η0/ξ,bmax/ξ)(\xi B,\xi L,\xi G,\eta_{0}/\xi,b_{\textrm{max}}/\xi). Both the iteration complexity bound and the risk bound hold as in the standard setting, except that after the substitution, the numerator MM in the iteration complexity bound has an additional polynomial dependence on ξ\xi. Thus we have proved the general conclusion of Theorem 1.

where δx\delta_{x} is the Dirac delta function at point xx. The expectation is taken over the stochastic gradient gg defined in equation (33), conditioning on the current state xx. Then for any candidate state yKB(x;42ηd)y\in K\cap\mathcal{B}(x;4\sqrt{2\eta d}), we accept the candidate state (i.e., xk+1=yx_{k+1}=y) with probability:

or reject the candidate state (i.e., xk+1=xx_{k+1}=x) with probability 1αx(y)1-\alpha_{x}(y). All candidate states yKB(x;42ηd)y\notin K\cap\mathcal{B}(x;4\sqrt{2\eta d}) are rejected (i.e., xk+1=xx_{k+1}=x). It is easy to verify that πf\pi_{f} executes a Metropolis-Hastings algorithm. Therefore, it induces a time-reversible Markov chain, and its stationary distribution is equal to μf(x)ef(x)\mu_{f}(x)\propto e^{-f(x)}.

Despite the difference in their definitions, we are able to show that the two Markov chains are ϵ\epsilon-close, where ϵ\epsilon depends on the stepsize η\eta and the properties of the objective function.

Assume that 0<ηbmax232d0<\eta\leq\frac{b_{\textrm{max}}^{2}}{32d} and Assumption A hold. Then the Markov chain π~f{\widetilde{\pi}}_{f} is ϵ\epsilon-close to πf\pi_{f} w.r.t. UρU_{\rho} with ϵ=e33ηd(G2+L)1\epsilon=e^{33\eta d(G^{2}+L)}-1.

Lemma 3 shows that if we choose η\eta small enough, then ϵ\epsilon will be sufficiently small. Recall from Lemma 2 that we need ϵ14Φπf(K\Uρ)\epsilon\leq\frac{1}{4}\Phi_{\pi_{f}}(K\backslash U_{\rho}) to bound the Markov chain πf\pi_{f}’s hitting time to the set UρU_{\rho}. It means that η\eta has to be chosen based on the restricted conductance of the Markov chain πf\pi_{f}. Although calculating the restricted conductance of a Markov chain might be difficult, the following lemma shows that the restricted conductance can be lower bounded by the restricted Cheeger constant.

Assume that ηmin{hmax,16dρ2,bmax232d,1100d(G2+L)}\eta\leq\min\{h_{\textrm{max}},16d\rho^{2},\frac{b_{\textrm{max}}^{2}}{32d},\frac{1}{100d(G^{2}+L)}\} and Assumption A hold. Then for any VKV\subset K, we have:

By Lemma 3 and Lemma 4, we are able to choose a sufficiently small η\eta such that the Markov chains πf\pi_{f} and π~f{\widetilde{\pi}}_{f} are close enough to satisfy the conditions of Lemma 2. Formally, the following condition on η\eta is sufficient.

There exists a universal constant c>0c>0 such that for any stepsize η\eta satisfying:

the Markov chains πf\pi_{f} and π~f{\widetilde{\pi}}_{f} are ϵ\epsilon-close with ϵ14Φπf(K\Uρ)\epsilon\leq\frac{1}{4}\Phi_{\pi_{f}}(K\backslash U_{\rho}). In addition, the restricted conductance satisfies the lower bound Φπf(K\Uρ)min{12,η/dC1536}\Phi_{\pi_{f}}(K\backslash U_{\rho})\geq\min\{\frac{1}{2},\frac{\sqrt{\eta/d}\,\mathcal{C}}{1536}\}.

Under condition (38), the Markov chains πf\pi_{f} and π~f{\widetilde{\pi}}_{f} are ϵ\epsilon-close with ϵ14Φπf(K\Uρ)\epsilon\leq\frac{1}{4}\Phi_{\pi_{f}}(K\backslash U_{\rho}). Recall that the Markov chain πf\pi_{f} is time-reversible and lazy. Since ff is bounded, the stationary distribution Qπf=μfQ_{\pi_{f}}=\mu_{f} is atom-free, and sampling x0x_{0} from Q0:=μf0Q_{0}:=\mu_{f_{0}} implies:

Thus the last condition of Lemma 2 is satisfied. Combining Lemma 2 with the lower bound Φπf(K\Uρ)min{12,η/dC1536}\Phi_{\pi_{f}}(K\backslash U_{\rho})\geq\min\{\frac{1}{2},\frac{\sqrt{\eta/d}\,\mathcal{C}}{1536}\} in Lemma 5, it implies that with probability at least 1δ>01-\delta>0, we have

where c>0c^{\prime}>0 is a universal constant.

Finally, we upper bound the hitting time of SGLD (i.e., the Markov chain induced by formula (33)) using the hitting time upper bound (40). We denote by πsgld\pi_{\textrm{sgld}} the transition kernel of SGLD, and claim that the Markov chain induced by it can be generated as a sub-sequence of the Markov chain induced by π~f{\widetilde{\pi}}_{f}. To see why the claim holds, we consider a Markov chain (x~0,x~1,x~2,)(\widetilde{x}_{0},\widetilde{x}_{1},\widetilde{x}_{2},\dots) generated by π~f{\widetilde{\pi}}_{f}, and construct a sub-sequence (x0,x1,x2,)(x_{0}^{\prime},x_{1}^{\prime},x_{2}^{\prime},\dots) of this Markov chain as follows:

Examine the states x~k\widetilde{x}_{k} in the order k=1,2,,τk=1,2,\dots,\tau, where τ=min{k:x~kU}\tau=\min\{k:\widetilde{x}_{k}\in U\}:

For any state x~k\widetilde{x}_{k}, in order to sample its next state x~k+1\widetilde{x}_{k+1}, the candidate state yy is either drawn from a delta distribution δx~k\delta_{\widetilde{x}_{k}}, or drawn from a normal distribution with stochastic mean vector xηg(x)x-\eta g(x). The probability of these two cases are equal, according to equation (36).

By this construction, it is easy to verify that (x0,x1,x2,)(x^{\prime}_{0},x_{1}^{\prime},x^{\prime}_{2},\dots) is a Markov chain and its transition kernel exactly matches formula (33). Since the sub-sequence (x0,x1,x2,)(x^{\prime}_{0},x_{1}^{\prime},x^{\prime}_{2},\dots) hits UU in at most τ\tau steps, we have

Combining this upper bound with (40) completes the proof of Theorem 4.

B.3 Proof of technical lemmas

If there is a constant CC such that the inequality h0(p)Cph_{0}(p)\leq C\sqrt{p} holds for any p[0,q]p\in[0,q], then the inequality

According to the claim, it suffices to upper bound h0(p)h_{0}(p) for p[0,q]p\in[0,q]. Indeed, since Q0(A)MQπ(A)Q_{0}(A)\leq M\,Q_{\pi}(A) for any AK\UA\subset K\backslash U, we immediately have:

Choosing k:=4log(M/δ)Φπ2(K\U)k:=\frac{4\log(M/\delta)}{\Phi^{2}_{\pi}(K\backslash U)} implies Qk(K\U)δQ_{k}(K\backslash U)\leq\delta. As a consequence, the hitting time is bounded by kk with probability at least 1δ1-\delta.

Recall the properties of the function hkh_{k}. For any p[0,q]p\in[0,q], we can find a set AK\UA\subset K\backslash U such that Qπ(A)=pQ_{\pi}(A)=p and hk(p)=Qk(A)h_{k}(p)=Q_{k}(A). Define, for xKx\in K, two functions:

By the laziness of the Markov chain π~{\widetilde{\pi}}, we obtain 0gi10\leq g_{i}\leq 1, so that they are functions mapping from K\UK\backslash U to $.Usingtherelation. Using the relation2{\widetilde{\pi}}(x,A)-1=1-2{\widetilde{\pi}}(x,K\backslash A),thedefinitionof, the definition ofg_{1}$ implies that:

where the last inequality follows since the δ\delta-closeness ensures π(x,K\A)π~(x,K\A)\pi(x,K\backslash A)\leq{\widetilde{\pi}}(x,K\backslash A). Similarly, using the definition of g2g_{2} and the relation π~(x,A)(1+ϵ)π(x,A){\widetilde{\pi}}(x,A)\leq(1+\epsilon)\pi(x,A), we obtain:

Since QπQ_{\pi} is the stationary distribution of the time-reversible Markov chain π\pi, the right-hand side of (44) is equal to:

Let p1p_{1} and p2p_{2} be the left-hand side of inequality (43) and (44) respectively, and define a shorthand notation:

Then by definition of restricted conductance and the laziness of π\pi, we have Φπ(K\U)r1/2\Phi_{\pi}(K\backslash U)\leq r\leq 1/2. Combining inequalities (43), (44) and (45) and by simple algebra, we obtain:

By the condition ϵ14Φπ(K\U)r4\epsilon\leq\frac{1}{4}\Phi_{\pi}(K\backslash U)\leq\frac{r}{4}, the above inequality implies

It is straightforward to verify that for any 0r10\leq r\leq 1, the right-hand side is upper bounded by 2(1r2/4)p2(1-r^{2}/4)\sqrt{p}. Thus we obtain:

On the other hand, the definition of g1g_{1} and g2g_{2} implies that π~(x,A)=g1(x)+g2(x)2{\widetilde{\pi}}(x,A)=\frac{g_{1}(x)+g_{2}(x)}{2} for any xK\Ux\in K\backslash U. For all xUx\in U, the transition kernel π~{\widetilde{\pi}} is stationary, so that we have π~(x,A)=0{\widetilde{\pi}}(x,A)=0. Combining these two facts implies

The last inequality uses the definition of function hk1h_{k-1}.

Finally, we prove inequality (42) by induction. The inequality holds for k=0k=0 by the assumption. We assume by induction that it holds for an aribtrary integer k1k-1, and prove that it holds for kk. Combining the inductive hypothesis with inequalities (46) and (B.3.1), we have

Thus, inequality (42) holds for kk, which completes the proof.

B.3.2 Proof of Lemma 3

By the definition of the ϵ\epsilon-closeness, it suffices to consider an arbitrary xUρx\notin U_{\rho} and verify the inequality (26). We focus on cases when the acceptance ratio of πf\pi_{f} and π~f{\widetilde{\pi}}_{f} are different, that is, when the candidate state yy satisfies yxy\neq x and yKB(x;42ηd)y\in K\cap\mathcal{B}(x;4\sqrt{2\eta d}). We make the following claim on the acceptance ratio.

For any 0<ηbmax232d0<\eta\leq\frac{b_{\textrm{max}}^{2}}{32d}, if we assume xUρx\notin U_{\rho}, yxy\notin x, and yKB(x;42ηd)y\in K\cap\mathcal{B}(x;4\sqrt{2\eta d}), then the acceptance ratio is lower bounded by αx(y)e33ηd(G2+L)\alpha_{x}(y)\geq e^{-33\eta d(G^{2}+L)}.

Consider an arbitrary point xK\Uρx\in K\backslash U_{\rho} and an arbitrary subset AK\{x}A\subset K\backslash\{x\}. The definitions of πf\pi_{f} and π~f{\widetilde{\pi}}_{f} imply that πf(x,A)π~f(x,A)\pi_{f}(x,A)\leq{\widetilde{\pi}}_{f}(x,A) always hold. In order to prove the opposite, we notice that:

The definition of πf\pi_{f} and Claim 2 implies

By plugging in the definition of αx(y)\alpha_{x}(y) and αy(x)\alpha_{y}(x) and the fact that xyx\neq y, we obtain

In order to prove the claim, we need to lower bound the numerator and upper bound the denominator of equation (49). For the numerator, Jensen’s inequality implies:

where the last inequality uses the upper bound

For the above deduction, we have used the Jensen’s inequality as well as Assumption A.

For the denominator, we notice that the term inside the expectation satisfies:

For the second term on the righthand side, using the relation (ab)22a2+2b2(a-b)^{2}\leq 2a^{2}+2b^{2} and Jensen’s inequality, we obtain

Since xy242ηdbmax\|{x-y}\|_{2}\leq 4\sqrt{2\eta d}\leq b_{\textrm{max}} is assumed, Assumption A implies

Combining inequalities (52)-(55), we obtain

Combining equation (49) with inequalities (50), (56), we obtain

The LL-smoothness of function ff implies that

Combining this inequality with the lower bound (57) completes the proof.

B.3.3 Proof of Lemma 4

Recall that μf\mu_{f} is the stationary distribution of the Markov chain πf\pi_{f}. We consider an arbitrary subset AVA\subset V, and define B:=K\AB:=K\backslash A. Let A1A_{1} and B1B_{1} be defined as

In other words, the points in A1A_{1} and B1B_{1} have low probability to move across the broader between AA and BB. We claim that the distance between points in A1A_{1} and B1B_{1} must be bounded away from a positive number.

Assume that ηmin{hmax,bmax232d,1100d(G2+L)}\eta\leq\min\{h_{\textrm{max}},\frac{b_{\textrm{max}}^{2}}{32d},\frac{1}{100d(G^{2}+L)}\}. If xA1x\in A_{1} and yB1y\in B_{1}, then xy2>14η/d\|{x-y}\|_{2}>\frac{1}{4}\sqrt{\eta/d}.

For any point xK\(A1B1)x\in K\backslash(A_{1}\cup B_{1}), we either have xAx\in A and πf(x,B)1/96\pi_{f}(x,B)\geq 1/96, or we have xBx\in B and πf(x,A)1/96\pi_{f}(x,A)\geq 1/96. It implies:

Since μf\mu_{f} is the stationary distribution of the time-reversible Markov chain πf\pi_{f}, inequality (58) implies:

Notice that AK\B1A\subset K\backslash B_{1}, so that μf(K\B1)μf(A)\mu_{f}(K\backslash B_{1})\geq\mu_{f}(A). According to Claim 3, by defining an auxiliary quantity:

we find that the set (A1)ρη(A_{1})_{\rho_{\eta}} belongs to K\B1K\backslash B_{1}, so that μf(K\B1)μf((A1)ρη)\mu_{f}(K\backslash B_{1})\geq\mu_{f}((A_{1})_{\rho_{\eta}}). The following property is a direct consequence of the definition of restricted Cheeger constant.

For any AVA\subset V and any ν>0\nu>0, we have μf(Aν)eνCf(Vν)μf(A)\mu_{f}(A_{\nu})\geq e^{\nu\cdot\mathcal{C}_{f}(V_{\nu})}\mu_{f}(A).

Letting A:=A1A:=A_{1} and ν:=ρη\nu:={\rho_{\eta}} in Claim 4, we have μf((A1)ρη)eρηCf(Vρη)μf(A1)\mu_{f}((A_{1})_{\rho_{\eta}})\geq e^{{\rho_{\eta}}\cdot\mathcal{C}_{f}(V_{\rho_{\eta}})}\mu_{f}(A_{1}). Combining these inequalities, we obtain

where the last inequality uses the relation max{ab,(α1)b}α1α(ab)+1α(α1)b=α1αa\max\{a-b,(\alpha-1)b\}\geq\frac{\alpha-1}{\alpha}(a-b)+\frac{1}{\alpha}(\alpha-1)b=\frac{\alpha-1}{\alpha}a with α:=eρηCf(Vρη)\alpha:=e^{{\rho_{\eta}}\cdot\mathcal{C}_{f}(V_{\rho_{\eta}})}. Combining it with inequality (B.3.3), we obtain

The lemma’s assumption gives ρη=14η/dρ{\rho_{\eta}}=\frac{1}{4}\sqrt{\eta/d}\leq\rho. Plugging in this relation completes the proof.

Consider any two points xAx\in A and yBy\in B. Let ss be a number such that 2s2ηd=xy22s\sqrt{2\eta d}=\|{x-y}\|_{2}. If s>1s>1, then the claim already holds for the pair (x,y)(x,y). Otherwise, we assume that s1s\leq 1, and as a consequence assume xy222ηd\|{x-y}\|_{2}\leq 2\sqrt{2\eta d}.

Denote by q(z)q(z) the density function of distribution N(x+y2;2ηI)N(\frac{x+y}{2};2\eta I). The integral Zq(z)dz\int_{Z}q(z)dz is equal to P(X9d)P(X\leq 9d), where XX is a random variable satisfying the chi-square distribution with dd degrees of freedom. The following tail bound for the chi-square distribution was proved by Laurent and Massart .

If XX is a random variable satisfying the Chi-square distribution with dd degrees of freedom, then for any x>0x>0,

By choosing x=9/5x=9/5 in Lemma 6, the probability P(X9d)P(X\leq 9d) is lower bounded by 1e(9/5)d>5/61-e^{-(9/5)d}>5/6. Since ηhmax\eta\leq h_{\textrm{max}}, the first assumption of Assumption A implies Kq(z)dz1/3\int_{K}q(z)dz\geq 1/3. Combining these two bounds, we obtain

For any point zZz\in Z, the distances zx2\|{z-x}\|_{2} and zy2\|{z-y}\|_{2} are bounded by 42ηd4\sqrt{2\eta d}. It implies

Claim 2 in the proof of Lemma 3 demonstrates that the acceptance ratio αx(z)\alpha_{x}(z) and αy(z)\alpha_{y}(z) for any zKZz\in K\cap Z are both lower bounded by e33ηd(G2+L)e^{-33\eta d(G^{2}+L)} given the assumption 0<ηbmax232d0<\eta\leq\frac{b_{\textrm{max}}^{2}}{32d}. This lower bound is at least equal to 1/21/2 because of the assumption η1100d(G2+L)\eta\leq\frac{1}{100d(G^{2}+L)}, so that we have

Next, we lower bound the ratio qx(z)/q(z)q_{x}(z)/q(z) and qy(z)/q(z)q_{y}(z)/q(z). For zZz\in Z but zxz\neq x, the function qx(z)q_{x}(z) is defined by

where the last inequality uses Jensen’s inequality; It also uses the fact yx22=s2ηd\|{\frac{y-x}{2}}\|_{2}=s\sqrt{2\eta d} and zx+y2232ηd\|{z-\frac{x+y}{2}}\|_{2}\leq 3\sqrt{2\eta d}.

As a consequence of this upper bound and using Jensen’s inequality, we have:

Combining inequalities (B.3.2), (63) and (64), we obtain:

The assumption η1100d(G2+L)\eta\leq\frac{1}{100d(G^{2}+L)} implies G110ηdG\leq\frac{1}{10\sqrt{\eta d}}. Plugging in this inequality to (65), a sufficient condition for qx(z)/q(z)>1/4q_{x}(z)/q(z)>1/4 is

Following identical steps, we can prove that inequality (66) is a sufficient condition for qy(z)/q(z)>1/4q_{y}(z)/q(z)>1/4 as well.

Assume that condition (66) holds. Combining inequalities (60), (62) with the fact qx(z)>q(z)/4q_{x}(z)>q(z)/4 and qy(z)>q(z)/4q_{y}(z)>q(z)/4, we obtain:

Notice that the set ZZ satisfies ZB(x;42ηd)B(y;42ηd)Z\subset\mathcal{B}(x;4\sqrt{2\eta d})\cap\mathcal{B}(y;4\sqrt{2\eta d}), thus the following lower bound holds:

It implies that either πf(x,B)196\pi_{f}(x,B)\geq\frac{1}{96} or πf(y,A)196\pi_{f}(y,A)\geq\frac{1}{96}. In other words, if xA1x\in A_{1} and yB1y\in B_{1}, then inequality (66) must not hold. As a consequence, we obtain the lower bound:

Let nn be an arbitrary integer and let i{1,,n}i\in\{1,\dots,n\}. By the definition of the restricted Cheeger constant (see equation (5)), we have

where ϵn\epsilon_{n} is an indexed variable satisfying limnϵn=0\lim_{n\to\infty}\epsilon_{n}=0. Suming over i=1,,ni=1,\dots,n, we obtain

Taking the limit nn\to\infty on both sides of the inequality completes the proof.

B.3.4 Proof of Lemma 5

First, we impose the following constraints on the choice of η\eta:

so that the preconditions of both Lemma 3 and Lemma 4 are satisfied. By plugging V:=K\UρV:=K\backslash U_{\rho} to Lemma 4, the restricted conductance is lower bounded by:

The last inequality holds because 1etmin{12,t2}1-e^{-t}\geq\min\{\frac{1}{2},\frac{t}{2}\} holds for any t>0t>0. It is easy to verify that (K\Uρ)ρK\U(K\backslash U_{\rho})_{\rho}\subset K\backslash U, so that we have the lower bound Cf((K\Uρ)ρ)Cf(K\U)=C\mathcal{C}_{f}((K\backslash U_{\rho})_{\rho})\geq\mathcal{C}_{f}(K\backslash U)=\mathcal{C}. Plugging this lower bound to inequality (69), we obtain

Inequality (70) establishes the restricted conductance lower bound for the lemma.

Combining inequality (70) with Lemma 3, it remains to choose a small enough η\eta such that π~f{\widetilde{\pi}}_{f} is ϵ\epsilon-close to πf\pi_{f} with ϵ14Φπ(K\Uρ)\epsilon\leq\frac{1}{4}\Phi_{\pi}(K\backslash U_{\rho}). More precisely, it suffices to make the following inequality hold:

In order to satisfy this inequality, it suffices to choose ηmin{1d(G2+L),C2d3(G2+L)2}\eta\lesssim\min\{\frac{1}{d({G}^{2}+L)},\frac{\mathcal{C}^{2}}{d^{3}({G}^{2}+L)^{2}}\}. Combining this result with (68) completes the proof.

Appendix C Proof of Proposition 1

Lovász and Simonovits [22, Theorem 2.6] proved the following isoperimetric inequality: Let KK be an arbitrary convex set with diameter 22. For any convex function ff and any subset VKV\subset K satisfying μf(V)1/2\mu_{f}(V)\leq 1/2, the following lower bound holds:

The lower bound (71) implies Cf(V)1\mathcal{C}_{f}(V)\geq 1. In order to establish the proposition, it suffices to choose V:=K\UV:=K\backslash U and f:=ξff:=\xi f, then prove the pre-condition μξf(K\U)1/2\mu_{\xi f}(K\backslash U)\leq 1/2.

Let xx^{*} be one of the global minimum of function ff and let B(x;r)\mathcal{B}(x^{*};r) be the ball of radius rr centering at point xx^{*}. If we choose r=ϵ2Gr=\frac{\epsilon}{2{G}}, then for any point xB(x;r)Kx\in\mathcal{B}(x^{*};r)\cap K, we have

Moreover, for any yK\Uy\in K\backslash U we have:

It means for the probability measure μξf\mu_{\xi f}, the density function inside B(x;r)K\mathcal{B}(x^{*};r)\cap K is at least eξϵ/2e^{\xi\epsilon/2} times greater than the density inside K\UK\backslash U. It implies

Without loss of generality, we assume that KK is the unit ball centered at the origin. Consider the Euclidean ball B(x;r/2)\mathcal{B}(x^{\prime};r/2) where x=max{0,1r/(2x2)}xx^{\prime}=\max\{0,1-r/(2\|{x^{*}}\|_{2})\}x^{*}. It is easy to verify that x21r/2\|{x^{\prime}}\|_{2}\leq 1-r/2 and xx2r/2\|{x^{\prime}-x^{*}}\|_{2}\leq r/2, which implies B(x;r/2)B(x;r)K\mathcal{B}(x^{\prime};r/2)\subset\mathcal{B}(x^{*};r)\cap K. Combining this relation with inequality (72), we have

The right-hand side is greater than or equal to 11, because we have assumed ξ2dlog(4G/ϵ)ϵ\xi\geq\frac{2d\log(4{G}/\epsilon)}{\epsilon}. As a consequence, we have μξf(K\U)1/2\mu_{\xi f}(K\backslash U)\leq 1/2.

Appendix D Proof of Lemma 1

Consider a sufficiently small ϵ\epsilon and a continuous mapping π(x):=xϵϕ(x)\pi(x):=x-\epsilon\phi(x). Since ϕ\phi is continuously differentiable in the compact set KK, there exists a constant GG such that ϕ(x)ϕ(y)2Gxy2\|{\phi(x)-\phi(y)}\|_{2}\leq G\|{x-y}\|_{2} for any x,yKx,y\in K. Assuming ϵ<1/G\epsilon<1/G, it implies

Thus, the mapping π\pi is a continuous one-to-one mapping. For any set AKA\subset K, we define π(A):={π(s):xA}\pi(A):=\{\pi(s):x\in A\}.

Since the parameter set KK is compact, we can partition KK into a finite number of small compact subsets, such that each subset has diameter at most δ:=ϵ2\delta:=\epsilon^{2}. Let SS be the collection of these subsets that intersect with AA. The definition implies ABSBAδA\subset\cup_{B\in S}B\subset A_{\delta}. The fact that ϕ(x)21\|{\phi(x)}\|_{2}\leq 1 implies

For arbitrary BSB\in S, we consider a point xBAx\in B\cap A, and remark that every point in BB is δ\delta-close to the point xx. Since ϕ\phi is continuously differentiable, the Jacobian matrix of the transformation π\pi has the following expansion:

where HH is the Jacobian matrix of ϕ\phi satisfying Hij(x)=ϕi(x)xjH_{ij}(x)=\frac{\partial\phi_{i}(x)}{\partial x_{j}}. The remainder term r1(x,y)r_{1}(x,y), as a consequence of the continuous differentiability of ϕ\phi and the fact yx2δ=ϵ2\|{y-x}\|_{2}\leq\delta=\epsilon^{2}, satisfies r1(x,y)2C1ϵ2\|{r_{1}(x,y)}\|_{2}\leq C_{1}\epsilon^{2} for some constant C1C_{1}.

On the other hand, using the relation μf(y)=μf(y)f(y)\nabla\mu_{f}(y)=-\mu_{f}(y)\nabla f(y) and the continuous differentiability of μf\mu_{f}, the density function at π(y)\pi(y) can be approximated by

where the remainder term r2(y)r_{2}(y) satisfies r2(y)C2ϵ2|r_{2}(y)|\leq C_{2}\epsilon^{2} for some constant C2C_{2}. Further using the continuity of ϕ\phi, f\nabla f and the fact yx2ϵ2\|{y-x}\|_{2}\leq\epsilon^{2}, we obtain:

where the remainder term r3(x,y)r_{3}(x,y) satisfies r3(x,y)C3ϵ2|r_{3}(x,y)|\leq C_{3}\epsilon^{2} for some constant C3C_{3}.

Combining equation (74) and equation (75), we can quantify the measure of the set π(B)\pi(B) using that of the set BB. In particular, we have

Plugging equation (76) to the lower bound (73) and using the relation tr(H(x))=divϕ(x){\textrm{tr}}(H(x))={\textrm{div}}\,\phi(x), implies

Finally, plugging in the definition of the restricted Cheeger constant and taking the limit ϵ0\epsilon\to 0 completes the proof.

Appendix E Proof of Proposition 3

Let Φ\Phi denote the CDF of the standard normal distribution. The function Φ\Phi satisfies the following tail bounds:

We define an auxiliary variable σ\sigma based on the value of ϵ\epsilon:

Since e1/(2σ2)=ϵ4Le^{-1/(2\sigma^{2})}=\frac{\sqrt{\epsilon}}{4L}, the tail bound (77) implies Φ(t)ϵ4L\Phi(t)\leq\frac{\sqrt{\epsilon}}{4L} for all t1σt\leq-\frac{1}{\sigma}.

Let g(x):=f(x)2g(x):=\|{\nabla f(x)}\|_{2} be a shorthand notation. We define a vector field:

Note that the function Φ\Phi admits a polynomial expansion:

We remark that the matrix definition (80) implies Φ(A+dA)=Φ(A)+Φ(A)dA\Phi(A+dA)=\Phi(A)+\Phi^{\prime}(A)dA where Φ\Phi^{\prime} is the derivative of function Φ\Phi.

The matrix A(x)A(x) satisfies 0A(x)(2G+1)I0\preceq A(x)\preceq(2{G}+1)I, so that ϕ(x)21\|{\phi(x)}\|_{2}\leq 1 holds. For points that are r0r_{0}-close to the boundary, we have x,f(x)x2\langle x,\,\nabla f(x)\rangle\geq\|{x}\|_{2}. By these lower bounds and definition (79), we obtain:

where the last inequality holds because g(x)x,f(x)/x21g(x)\geq\langle x,\,\nabla f(x)\rangle/\|{x}\|_{2}\geq 1. For any ϵ<rr0(2G+1)G\epsilon<\frac{r-r_{0}}{(2{G}+1)\sqrt{{G}}}, the right-hand side is smaller than x22\|{x}\|_{2}^{2}, so that xϵϕ(x)Kx-\epsilon\,\phi(x)\in K. For points that are not r0r_{0}-close to the boundary, we have xϵϕ(x)Kx-\epsilon\,\phi(x)\in K given ϵ<r0\epsilon<r_{0}. Combining results for the two cases, we conclude that ϕ\phi satisfies the conditions of Lemma 1

By applying Lemma 1, we obtain the following lower bound:

Since A(x)2Gg(x)IA(x)\succeq 2\sqrt{{G}\,g(x)}I, the term (f(x))A(x)f(x)(\nabla f(x))^{\top}A(x)\nabla f(x) is lower bounded by 2G(g(x))5/22\sqrt{{G}}(g(x))^{5/2}. For the second term, we claim the following bound:

We defer the proof to Appendix E.1 and focus on its consequence. Combining inequalities (82) and (83), we obtain

The right-hand side of inequality (84) can be made strictly positive if we choose a large enough ξ\xi. In particular, we choose:

To proceed, we do a case study based on the value of g(x)g(x). For all xx satisfying g(x)<hϵg(x)<h\epsilon, we plug in the upper bound g(x)<hϵg(x)<h\epsilon for g(x)g(x), then plug in the definition of hh. It implies:

Using lower bound (85) for ξ\xi and the lower bound g(x)hϵg(x)\geq h\epsilon for g(x)g(x), it is easy to verify that the last three terms on the right-hand side are non-negative. Furthermore, plugging in the lower bound ξ1ϵ2G1/212h5/2\xi\geq\frac{1}{\epsilon^{2}{G}^{1/2}}\cdot\frac{1}{2h^{5/2}} from (85), it implies:

Combining inequalities (84), (86), (87) proves that the restricted Cheeger constant is lower bounded by ϵ8(2G+1)G\frac{\sqrt{\epsilon}}{8(2{G}+1){G}}. Since 1/σ=O~(1)1/\sigma=\widetilde{\mathcal{O}}(1), it is easy to verify that the constraint (85) can be satisfied if we choose:

E.1 Proof of inequality (83)

Note that div(A(x)f(x)){\textrm{div}}\,(A(x)\nabla f(x)) is equal to the trace of matrix DD. In order to proceed, we perform a case study on the value of g(x)g(x).

We first upper bound the trace of A(x)f2(x)A(x)\nabla f^{2}(x), which can be written as:

The trace of the first term on the right-hand side is bounded by 2Gg(x)L2\sqrt{{G}\,g(x)}L. For the second term, we assume that the matrix 2f(x)\nabla^{2}f(x) has eigenvalues λ1λ2λd\lambda_{1}\leq\lambda_{2}\leq\dots\lambda_{d} with associated eigenvectors u1,,udu_{1},\dots,u_{d}. As a consequence, the matrix Φ(ϵI2f(x)σϵ)\Phi(\frac{-\sqrt{\epsilon}I-\nabla^{2}f(x)}{\sigma\sqrt{\epsilon}}) has the same set of eigenvectors, but with eigenvalues Φ(λ1/ϵ1σ),,Φ(λd/ϵ1σ)\Phi(\frac{-\lambda_{1}/\sqrt{\epsilon}-1}{\sigma}),\dots,\Phi(\frac{-\lambda_{d}/\sqrt{\epsilon}-1}{\sigma}). Thus, the trace of this term is equal to

By the assumptions xK\Ux\in K\backslash U and g(x)<ϵg(x)<\epsilon, and using the definition of ϵ\epsilon-approximate local minima, we obtain λ1ϵ\lambda_{1}\leq-\sqrt{\epsilon}. As a consequence

For other eigenvalues, if λi\lambda_{i} is negative, then we use the upper bound λiΦ(λi/ϵiσ)<0\lambda_{i}\Phi(\frac{-\lambda_{i}/\sqrt{\epsilon}-i}{\sigma})<0; If λi\lambda_{i} is positive, then we have λiΦ(λi/ϵ1σ)λiΦ(1σ)λiϵ4L\lambda_{i}\Phi(\frac{-\lambda_{i}/\sqrt{\epsilon}-1}{\sigma})\leq\lambda_{i}\Phi(-\frac{1}{\sigma})\leq\lambda_{i}\frac{\sqrt{\epsilon}}{4L}. Combining these relations, we have

Combining this inquality with the upper bound on the first term of (89), we obtain

Thus, we have upper bounded the trace of first term on the right-hand side of (88).

For the second term on the right-hand side of (88) , we have

where the last inequality uses the relation g(x)=(2f(x))f(x)g(x)\nabla g(x)=\frac{(\nabla^{2}f(x))\nabla f(x)}{g(x)}, so that g(x)22f(x)22f(x)L\|{\nabla g(x)}\|_{2}\leq\|{\nabla^{2}f(x)}\|_{2}\leq\|{\nabla^{2}f(x)}\|_{*}\leq L.

For the third term on the right-hand side of (88), since 0Φ(ϵI2f(x)σϵ)I0\preceq\Phi^{\prime}(\frac{-\sqrt{\epsilon}I-\nabla^{2}f(x)}{\sigma\sqrt{\epsilon}})\preceq I, we have

Combining upper bounds (90), (91), (92) implies

The proof is similar to the previous case. For the first term on the right-hand side of equation (88), we follow the same arguments for establishing the upper bound (90), but without using the relation λ1ϵ\lambda_{1}\leq-\sqrt{\epsilon} (because conditioning on g(x)ϵg(x)\geq\epsilon, the definition of approximate local minima won’t give λ1ϵ\lambda_{1}\leq-\sqrt{\epsilon}). Then the trace of A(x)f2(x)A(x)\nabla f^{2}(x) is bounded by:

For the second and the third term, the upper bounds (91) and (92) still hold, so that

Combining the two cases completes the proof.

Appendix F Proof of Theorem 2

and its stochastic gradient is computed by:

We defer the proof of claim (96) to the end of this section, focusing on its consequence. Let σ\sigma take the value in claim (96). The conseuqence of (96) and the G{G}-Lipschitz continuity of the function FF imply:

By choosing ρ:=ν/G\rho:=\nu/{G}, we establish the risk bound F(x^)supxUF(x)+5νF(\widehat{x})\leq\sup_{x\in U}F(x)+5\nu. It remains to establish the iteration complexity bound.

Taking expectation over zz on both sides and using Jensen’s inequality, we obtain

Thus, by choosing σ:=νmax{G,B/ρ\mboxK}\sigma:=\frac{\nu}{\max\{{G},B/\rho_{\mbox{\tiny K}}\}}, it ensures that for any xKx\in K:

F.1 Proof of Lemma 7

Let z:=x+zz^{\prime}:=x+z. By change of variables, the above equation implies:

Notice that u/u2,z/σ\langle u/\|{u}\|_{2},\,z/\sigma\rangle satisfies the standard normal distribution. Thus the right-hand side of the above inequality is bounded by

Combining the two inequalities above completes the proof.

Using the fact f(x+z)[0,B]f(x+z)\in[0,B], equation (99) implies:

Appendix G Proof of Theorem 3

There exists hmax=Ω(d2)h_{\textrm{max}}=\Omega(d^{-2}) such that for any xKx\in K, hhmaxh\leq h_{\textrm{max}} and yN(x,2hI)y\sim N(x,2hI), we have P(yK)1/3P(y\in K)\geq 1/3.

The function FF is 3-Lipschitz continuous in K\overline{K}.

For any ν,δ>0\nu,\delta>0, if the sample size nn satisfies ndν2n\gtrsim\frac{d}{\nu^{2}}, then with probability at least 1δ1-\delta we have supxKf(x)F(x)ν\sup_{x\in\overline{K}}|f(x)-F(x)|\leq\nu. The notation “\lesssim” hides a poly-logarithmic function of (d,1/ν,1/δ)(d,1/\nu,1/\delta).

Let α0(0,π/4]\alpha_{0}\in(0,\pi/4] be an arbitrary angle. We define UKU\subset K to be the set of points such that the angle between the point and xx^{*} is bounded by α0\alpha_{0}, or equivalently:

For any xKx\in K, the 3-Lipschitz continuity of function FF implies:

By simple geometry, it is easy to see that

Inequality (100) implies that for small enough α0\alpha_{0}, any point in UU is a nearly optimal solutions. Thus we can use UU as a target optimality set. The following lemma lower bounds the restricted Cheeger constant for the set UU.

Assume that d2d\geq 2. For any α0(0,π/4]\alpha_{0}\in(0,\pi/4], there are universal constant c1,c2>0c_{1},c_{2}>0 such that if we choose ξc1d3/2q0sin2(α0)\xi\geq\frac{c_{1}d^{3/2}}{q_{0}\sin^{2}(\alpha_{0})}, then the restricted Cheeger constant is lower bounded by C(ξF)(K\U)c2d\mathcal{C}_{(\xi F)}(K\backslash U)\geq c_{2}d.

Given a target optimality ϵ>0\epsilon>0, we choose α0:=arcsin(ϵ/12)\alpha_{0}:=\arcsin(\epsilon/12). The risk bound (100) implies

Lemma 8 ensures that the pre-conditions of Theorem 2 hold with a small enough quantity ν\nu. Combining Theorem 2 with inequality (101), with probability at least 1δ1-\delta, SGLD achieves the risk bound:

In order to have a small enough ν\nu, we want the functions ff and FF to be uniformly close. More precisely, we want the gap between them to satisfy:

By Lemma 8, this can be achieved by assuming a large enough sample size nn. In particular, if the sample size satisfies nd4q02ϵ4n\gtrsim\frac{d^{4}}{q_{0}^{2}\epsilon^{4}}, then inequality (103) is guaranteed to be true. The notation “\gtrsim” hides a poly-logarithmic function.

If inequity (103) holds, then νϵ/10\nu\leq\epsilon/10 holds, so that we can rewrite the risk bound (102) as F(x^)F(x)+ϵF(\widehat{x})\leq F(x^{*})+\epsilon. By combining the choice of ν\nu in (103) with the choice of ξ:=c1d3/2q0sin2(α0)\xi:=\frac{c_{1}d^{3/2}}{q_{0}\sin^{2}(\alpha_{0})} in Lemma 9, we find that the relation ξ(0,1/ν]\xi\in(0,1/\nu] hold, satisfying Theorem 2’s condition on (ν,ξ)(\nu,\xi). As a result, Theorem 2 implies that the iteration complexity of SGLD is bounded by the restricted Cheeger constant C(ξF)(K\U)\mathcal{C}_{(\xi F)}(K\backslash U). By Lemma 9, the restricted Cheeger constant is lower bounded by Ω(d)\Omega(d), so that the iteration complexity is polynomial in (d,1/q0,1/ϵ,log(1/δ))(d,1/q_{0},1/\epsilon,\log(1/\delta)).

(1) Let xKx\in K be an arbitrary point and let zN(0,2hI)z\sim N(0,2hI). An equivalent way to express the relation x+zKx+z\in K is the following sandwich inequality:

For any t>0t>0, we consider a sufficient condition for inequality (104):

The random variable z222h\frac{\|{z}\|_{2}^{2}}{2h} satisfies a chi-square distribution with dd degrees of freedom. By Lemma 6, for any t5t\geq 5, the condition z222thd\|{z}\|_{2}^{2}\leq 2thd holds with probability at least 1eΩ(td)1-e^{-\Omega(td)}.

Suppose that tt is a fixed constant, and hh is chosen to be h:=c22td2h:=\frac{c^{2}}{2td^{2}} for a constant c>0c>0. Then the random variable wx:=2x,zw_{x}:=2\langle x,\,z\rangle satisfies a normal distribution N(0;4x22(c/d)2t)N(0;\frac{4\|{x}\|_{2}^{2}(c/d)^{2}}{t}). The interval IxI_{x}, no matter how xKx\in K is chosen, covers either [1/4,(c/d)2][-1/4,-(c/d)^{2}] or [(c/d)2,1/4][(c/d)^{2},1/4]. For c0c\to 0, we have (c/d)22x2t(c/d)1/4(c/d)^{2}\ll\frac{2\|{x}\|_{2}}{t}(c/d)\ll 1/4, so that the probability of wxIxw_{x}\in I_{x} is asymptotically lower bounded by 0.50.5. It implies that there is a strictly positive constant cc (depending on the value of tt) such that P(wxIx)0.4P(w_{x}\in I_{x})\geq 0.4 for all xKx\in K. With this choice of cc, we apply the union bound:

By choosing tt to be a large enough constant, the above probability is lower bounded by 1/31/3.

If we change the distribution of aa from uniform distribution to a normal distribution N(0,Id×d)N(0,I_{d\times d}), the right-hand side of inequality (105) won’t change. Both x,a\langle x,\,a\rangle and x,b\langle x,\,b\rangle become normal random variables with correlation coefficient x,yx2y2\frac{\langle x,\,y\rangle}{\|{x}\|_{2}\|{y}\|_{2}}. Under this setting, Tong proved that the right-side is equal to

By simple algebra and using the fact that x2,y21/4\|{x}\|_{2},\|{y}\|_{2}\geq 1/4, we have

Combining the above relations, and using the fact that arccos(t)31t\arccos(t)\leq 3\sqrt{1-t} for any tt\in, we obtain

which shows that the function FF is 3-Lipschitz continuous in K\overline{K}.

(3) Since function ff is the empirical risk of a linear classifier, its uniform convergence rate can be characterized by the VC-dimension. The VC-dimension of linear classifiers in a dd-dimensional space is equal to d+1d+1. Thus, the concentration inequality of Vapnik implies that with probability at least 1δ1-\delta, we have

where c>0c>0 is a universal constant. When ndn\geq d, the upper bound U(n)U(n) is a monotonically decreasing function of nn. In order to guarantee U(n)νU(n)\leq\nu, it suffices to choose nn0n\geq n_{0} where the number n0n_{0} satisfies:

It is easy to see that n0n_{0} is polynomial in (d,1/ν,1/δ)(d,1/\nu,1/\delta). Thus, by the definition of U(n)U(n), we have

It implies n0dν2n_{0}\lesssim\frac{d}{\nu^{2}}, thus completes the proof.

G.2 Proof of Lemma 9

The first step is to define a vector field that satisfies the conditions of Lemma 1. For arbitrary δ\delta\in, we define a vector field ϕδ\phi_{\delta} such that:

For any δ(0,1]\delta\in(0,1], we can find a constant ϵ0>0\epsilon_{0}>0 such that ϕδ(x)21\|{\phi_{\delta}(x)}\|_{2}\leq 1 and xϵϕδ(x)Kx-\epsilon\phi_{\delta}(x)\in K holds for arbitrary xKx\in K and ϵ(0,ϵ0]\epsilon\in(0,\epsilon_{0}].

The claim shows that ϕδ\phi_{\delta} satisfies the conditions of Lemma 1 for any δ(0,1]\delta\in(0,1], so that given a scalar ξ>0\xi>0, the lemma implies

The right-hand side is uniformly continuous in δ\delta, so that if we take the limit δ0\delta\to 0, we obtain the lower bound:

For the right-hand side, we prove lower bound for it using the Massart noise model. We start by simplifying the fraction term inside the expectation. Without loss of generality, assume that ϵ(0,0.2]\epsilon\in(0,0.2], then the definition of ϕ0\phi_{0} implies:

We further simplify the lower bound (109) by the following claim, which is proved using properties of the Massart noise.

When the event Et\mathcal{E}_{t} holds, we have z2t1/6x2/3\|{z}\|_{2}\leq t\leq 1/6\leq\|{x}\|_{2}/3, so that 3(x+z)x222\|{\frac{3(x+z)}{\|{x}\|_{2}^{2}}}\|_{2}\in. Combining with inequality (109) and Claim 6, we have

where αx+z\alpha_{x+z} represents the angle between x+zx+z and xx^{*}.

It means that divϕ0(x)=tr(H(x))=d13x,x=(d1)x23cos(αx){\textrm{div}}\,\phi_{0}(x)={\textrm{tr}}(H(x))=\frac{d-1}{3}\langle x,\,x^{*}\rangle=\frac{(d-1)\|{x}\|_{2}}{3}\cos(\alpha_{x}). Combining this equation with inequalities (108), (110), and taking the limits t0t\to 0, σ0\sigma\to 0, we obtain:

where αx\alpha_{x} represents the angle between xx and xx^{*}.

According to inequality (111), for any αx(πα0,π]\alpha_{x}\in(\pi-\alpha_{0},\pi], we have

where the last inequality follows since x21/2\|{x}\|_{2}\geq 1/2 and α0(0,π/4]\alpha_{0}\in(0,\pi/4]. Otherwise, if αx[α0,πα0]\alpha_{x}\in[\alpha_{0},\pi-\alpha_{0}], then we have

Once we choose ξ160π3d3/2q0sin2(α0)\xi\geq\frac{160\pi}{3}\frac{d^{3/2}}{q_{0}\sin^{2}(\alpha_{0})}, the above expression will be lower bounded by d/3d/3, which completes the proof.

Since x21\|{x}\|_{2}\leq 1 for any xKx\in K, it is easy to verify that ϕδ(x)21\|{\phi_{\delta}(x)}\|_{2}\leq 1. In order to verify xϵϕδ(x)Kx-\epsilon\phi_{\delta}(x)\in K, we notice that

The right-hand side will be maximized if x22=1/4\|{x}\|_{2}^{2}=1/4. Thus, if we assume δ>0\delta>0, then for any ϵ<δ/16\epsilon<\delta/16, it is easy to verify that the right-hand side is bounded by 3/83/8. As a consequence, we have xϵϕδ(x)22[1/4,1]\|{x-\epsilon\phi_{\delta}(x)}\|_{2}^{2}\in[1/4,1], which verifies that xϵϕδ(x)Kx-\epsilon\phi_{\delta}(x)\in K.

When x=0x=0 or xϵx=0x-\epsilon x^{*}=0, it is easy to verify that F(xϵx)F(x)0F(x-\epsilon x^{*})-F(x)\geq 0 by the definition of the loss function. Otherwise, we assume that x0x\neq 0 and xϵx0x-\epsilon x^{*}\neq 0. In these cases, the events x,a0\langle x,\,a\rangle\neq 0 and xϵx,a0\langle x-\epsilon x^{*},\,a\rangle\neq 0 hold almost surely, so that we can assume that the loss function always takes zero-one values.

Let E\mathcal{E} be the event that condition (112) holds. Under this event, when the parameter changes from xx to xϵxx-\epsilon x^{*}, the loss changes from 1q(a)2\frac{1-q(a)}{2} to 1+q(a)2\frac{1+q(a)}{2}. It means that the loss is non-decreasing with respect to the change xxϵxx\to x-\epsilon x^{*}, and as a consequence, we have F(xϵx)F(x)0F(x-\epsilon x^{*})-F(x)\geq 0.

In order to establish the lower bound in Claim 6, we first lower bound the probability of event E\mathcal{E}. In the proof of Lemma 8. we have shown that this probability is equal to:

Let β\beta be the angle between xx and xϵxx-\epsilon x^{*}, then the right-hand side is equal to β/π\beta/\pi. Using the geometric property that x2sin(α)=ϵx2sin(β)\frac{\|{x}\|_{2}}{|\sin(\alpha)|}=\frac{\epsilon\|{x^{*}}\|_{2}}{|\sin(\beta)|}, we have

For the first term on the right-hand side, we have x1,a1a12=x,ax2ϵx2ϵ|\langle x^{*}_{1},\,a_{1}\rangle|\leq\|{a_{1}}\|_{2}=\frac{|\langle x,\,a\rangle|}{\|{x}\|_{2}}\leq\frac{\epsilon}{\|{x}\|_{2}}\leq\epsilon by condition (112) and the assumption that x21\|{x}\|_{2}\geq 1. For the second term, if we condition on a1a_{1}, then the vector a2a_{2} is uniformly sampled from a (d1)(d-1)-dimensional sphere of radius 1a122\sqrt{1-\|{a_{1}}\|_{2}^{2}} that centers at the origin. The vector x2x^{*}_{2}, constructed to be orthogonal to xx, also belongs to the same (d1)(d-1)-dimensional subspace. Under this setting, Awasthi et al. [4, Lemma 4] proved that

Using the bound a12=x,ax2ϵx2ϵ\|{a_{1}}\|_{2}=\frac{|\langle x,\,a\rangle|}{\|{x}\|_{2}}\leq\frac{\epsilon}{\|{x}\|_{2}}\leq\epsilon, we marginalize a1a_{1} to obtain

Recall that x,a=x1,a1+x2,a2\langle x^{*},\,a\rangle=\langle x^{*}_{1},\,a_{1}\rangle+\langle x^{*}_{2},\,a_{2}\rangle and x1,a1ϵ|\langle x^{*}_{1},\,a_{1}\rangle|\leq\epsilon. These two relations imply x,ax2,a2x1,a1x2,a2ϵ|\langle x^{*},\,a\rangle|\geq|\langle x^{*}_{2},\,a_{2}\rangle|-|\langle x^{*}_{1},\,a_{1}\rangle|\geq|\langle x^{*}_{2},\,a_{2}\rangle|-\epsilon. Combining it with the relation x22=sin(α)\|{x^{*}_{2}}\|_{2}=|\sin(\alpha)|, we obtain

Combining inequalities (113), (114) with the relation that q(a)q0x,aq(a)\geq q_{0}|\langle x^{*},\,a\rangle|, we obtain