A Hitting Time Analysis of Stochastic Gradient Langevin Dynamics

Yuchen Zhang, Percy Liang, Moses Charikar

Introduction

A central challenge of non-convex optimization is avoiding sub-optimal local minima. Although escaping all local minima is NP-hard in general [e.g. 7], one might expect that it should be possible to escape “appropriately shallow” local minima, whose basins of attraction have relatively low barriers. As an illustrative example, consider minimizing an empirical risk function in Figure 1. As the figure shows, although the empirical risk is uniformly close to the population risk, it contains many poor local minima that don’t exist in the population risk. Gradient descent is unable to escape such local minima.

A natural workaround is to inject random noise into the gradient. Empirically, adding gradient noise has been found to improve learning for deep neural networks and other non-convex models. However, theoretical understanding of the value of gradient noise is still incomplete. For example, Ge et al. show that by adding isotropic noise $w$ and by choosing a sufficiently small stepsize $\eta$, the iterative update:

$$x\leftarrow x-\eta\,(\nabla f(x)+w)\tag{1}$$

is able to escape strict saddle points. Unfortunately, this approach, as well as the subsequent line of work on escaping saddle points, doesn’t guarantee escaping even shallow local minima.

Another line of work in Bayesian statistics studies the Langevin Monte Carlo (LMC) method, which employs an alternative noise term. Given a function $f$, LMC performs the iterative update:

$$x\leftarrow x-\eta\,\big(\nabla f(x)+\sqrt{2/(\eta\xi)}\,w\big)\tag{2}$$

where $\xi>0$ is a “temperature” hyperparameter. Unlike the bounded noise added in formula (1), LMC adds a large noise term that scales with $\sqrt{1/\eta}$. With a small enough $\eta$, the noise dominates the gradient, enabling the algorithm to escape any local minimum. For empirical risk minimization, one might substitute the exact gradient $\nabla f(x)$ with a stochastic gradient, which gives the Stochastic Gradient Langevin Dynamics (SGLD) algorithm. It can be shown that both LMC and SGLD asymptotically converge to a stationary distribution $\mu(x)\propto e^{-\xi f(x)}$. As $\xi\to\infty$, the probability mass of $\mu$ concentrates on the global minimum of the function $f$, and the algorithm asymptotically converges to a neighborhood of the global minimum.
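To make the noise scaling concrete, the following is a minimal Python sketch of an LMC-style update in one dimension. It assumes the update has the form $x\leftarrow x-\eta(\nabla f(x)+\sqrt{2/(\eta\xi)}\,w)$ with $w\sim N(0,1)$, so that the noise injected per step has standard deviation $\sqrt{2\eta/\xi}$. The function names and the quadratic test objective are illustrative choices, not part of the paper.

```python
import math
import random

def noise_std(eta, xi):
    """Per-step noise scale: eta * sqrt(2 / (eta * xi)) = sqrt(2 * eta / xi)."""
    return math.sqrt(2.0 * eta / xi)

def lmc_step(x, grad_f, eta, xi, rng):
    """One assumed LMC update: a gradient step plus Gaussian noise of scale sqrt(2*eta/xi)."""
    return x - eta * grad_f(x) + noise_std(eta, xi) * rng.gauss(0.0, 1.0)

def run_lmc(grad_f, x0, eta, xi, steps, seed=0):
    """Run the chain from x0 and return all iterates."""
    rng = random.Random(seed)
    xs = [x0]
    for _ in range(steps):
        xs.append(lmc_step(xs[-1], grad_f, eta, xi, rng))
    return xs

# Illustrative objective f(x) = x^2 / 2, whose target density exp(-xi * f) is N(0, 1/xi).
samples = run_lmc(lambda x: x, x0=0.0, eta=0.05, xi=4.0, steps=50000)
mean = sum(samples) / len(samples)
variance = sum((s - mean) ** 2 for s in samples) / len(samples)
print(f"empirical variance {variance:.3f} vs 1/xi = {1 / 4.0:.3f}")
```

For this quadratic, the empirical variance of the iterates should be close to $1/\xi$, up to a discretization bias of order $\eta$ and Monte Carlo error.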

Despite asymptotic consistency, there is no theoretical guarantee that LMC is able to find the global minimum of a general non-convex function, or even a local minimum of it, in polynomial time. Recent works focus on bounding the mixing time (i.e. the time for converging to $\mu$) of LMC and SGLD. Bubeck et al., Dalalyan, and Bonis prove that on convex functions, LMC converges to the stationary distribution in polynomial time. On non-convex functions, however, an exponentially long mixing time is unavoidable in general. According to Bovier et al., it takes the Langevin diffusion at least $e^{\Omega(\xi h)}$ time to escape a depth-$h$ basin of attraction. Thus, if the function contains multiple “deep” basins with $h=\Omega(1)$, then the mixing time is lower bounded by $e^{\Omega(\xi)}$.

In parallel work to this paper, Raginsky et al. upper bound the time of SGLD converging to an approximate global minimum of non-convex functions. They show that the upper bound is polynomial in the inverse of a quantity they call the uniform spectral gap. Similar to the mixing time bound, in the presence of multiple local minima, the convergence time to an approximate global minimum can be exponential in dimension $d$ and the temperature parameter $\xi$ .

The stability of the restricted Cheeger constant under small perturbations of the objective is useful in studying empirical risk minimization (ERM) in situations where the empirical risk $f$ is pointwise close to the population risk $F$, but has poor local minima that don’t exist in the population risk. This phenomenon has been observed in statistical estimation with non-convex penalty functions, as well as in minimizing the zero-one loss (see Figure 1). Under this setting, our result implies that SGLD achieves an approximate local minimum of the (smooth) population risk in polynomial time, ruling out local minima that only exist in the empirical risk. It improves over recent results on non-convex optimization, which compute approximate local minima only for the empirical risk.

As a concrete application, we prove a stronger learnability result for the problem of learning linear classifiers under the zero-one loss, which involves non-convex and non-smooth empirical risk minimization. Our result improves over the recent result of Awasthi et al.: the method of Awasthi et al. handles noisy data corrupted by a very small Massart noise (at most $1.8\times 10^{-6}$), while our algorithm handles Massart noise up to any constant less than $0.5$. As a Massart noise of $0.5$ represents completely random observations, we see that SGLD is capable of learning from very noisy data.

Algorithm and main results

In this section, we define the algorithm and the basic concepts, then present the main theoretical results of this paper.

Because of the noisy update, the sequence $(x_{0},x_{1},x_{2},\dots)$ asymptotically converges to a stationary distribution rather than a stationary point. Although this fact introduces challenges to the analysis, we show that the algorithm’s non-asymptotic efficiency can be characterized by a positive quantity called the restricted Cheeger constant.

2.2 Restricted Cheeger constant

For any measurable function $f$, we define a probability measure $\mu_{f}$ whose density function is:

$$\mu_{f}(x):=\frac{e^{-f(x)}}{\int_{K}e^{-f(x')}\,dx'}\quad\text{for all }x\in K.\tag{4}$$

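For any function $f$ and any subset $V\subset K$, we define the restricted Cheeger constant as:

$$\mathcal{C}_{f}(V):=\liminf_{\epsilon\searrow 0}\;\inf_{A\subset V}\frac{\mu_{f}(A_{\epsilon})-\mu_{f}(A)}{\epsilon\,\mu_{f}(A)},\quad\text{where }A_{\epsilon}:=\{x\in K:d(x,A)\leq\epsilon\}.\tag{5}$$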

The restricted Cheeger constant generalizes the notion of the Cheeger isoperimetric constant, quantifying how weakly a subset of $V$ can be connected to the rest of the parameter space. The connectivity is measured by the ratio of the surface measure $\liminf_{\epsilon\searrow 0}\frac{\mu_{f}(A_{\epsilon})-\mu_{f}(A)}{\epsilon}$ to the set measure $\mu_{f}(A)$. Intuitively, this quantifies the chance of escaping the set $A$ under the probability measure $\mu_{f}$.

A property that will be important in the sequel is that the restricted Cheeger constant is stable under perturbations: if we perturb $f$ by a small amount, then the values of $\mu_{f}$ won’t change much, so that the variation in $\mathcal{C}_{f}(V)$ will also be small. More precisely, for functions $f_{1}$ and $f_{2}$ satisfying $\sup_{x\in K}|f_{1}(x)-f_{2}(x)|=\nu$, we have

$$\mathcal{C}_{f_{1}}(V)\geq e^{-2\nu}\,\mathcal{C}_{f_{2}}(V),\tag{6}$$

and similarly $\mathcal{C}_{f_{2}}(V)\geq e^{-2\nu}\mathcal{C}_{f_{1}}(V)$ . As a result, if two functions $f_{1}$ and $f_{2}$ are uniformly close, then we have $\mathcal{C}_{f_{1}}(V)\approx\mathcal{C}_{f_{2}}(V)$ for a constant $\nu$ . This property enables us to lower bound $\mathcal{C}_{f_{1}}(V)$ by lower bounding the restricted Cheeger constant of an alternative function $f_{2}\approx f_{1}$ , which might be easier to analyze.
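The factor $e^{-2\nu}$ can be checked numerically on a toy case. The sketch below takes $K=[0,1]$, restricts attention to interval sets $A=[a,b]$ and a fixed small $\epsilon$ (rather than the infimum over all sets and the limit $\epsilon\searrow 0$), and compares the ratio $(\mu_{f}(A_{\epsilon})-\mu_{f}(A))/(\epsilon\,\mu_{f}(A))$ for two functions that differ by at most $\nu$. The functions `f1`, `f2` and the helper names are hypothetical choices made for illustration.

```python
import math

def mass(f, lo, hi, n=2000):
    """Midpoint-rule estimate of the unnormalized mass of [lo, hi] under exp(-f)."""
    if hi <= lo:
        return 0.0
    h = (hi - lo) / n
    return h * sum(math.exp(-f(lo + (i + 0.5) * h)) for i in range(n))

def boundary_ratio(f, a, b, eps):
    """(mu_f(A_eps) - mu_f(A)) / (eps * mu_f(A)) for A = [a, b] inside K = [0, 1].

    The normalizing constant of mu_f cancels, so unnormalized masses suffice.
    """
    shell = mass(f, max(0.0, a - eps), a) + mass(f, b, min(1.0, b + eps))
    return shell / (eps * mass(f, a, b))

nu = 0.3
f1 = lambda x: 4.0 * (x - 0.5) ** 2
f2 = lambda x: f1(x) + nu * math.cos(20.0 * x)   # sup |f1 - f2| <= nu

intervals = [(0.1, 0.3), (0.2, 0.8), (0.45, 0.55), (0.0, 0.5)]
checks = []
for a, b in intervals:
    r1 = boundary_ratio(f1, a, b, eps=1e-3)
    r2 = boundary_ratio(f2, a, b, eps=1e-3)
    # Both directions of the stability inequality.
    checks.append(r1 >= math.exp(-2 * nu) * r2 and r2 >= math.exp(-2 * nu) * r1)
print(checks)
```

The check works because $e^{-f_{1}}\geq e^{-\nu}e^{-f_{2}}$ pointwise on the shell $A_{\epsilon}\backslash A$ while $e^{-f_{1}}\leq e^{\nu}e^{-f_{2}}$ pointwise on $A$, which is the same argument that gives the stability property.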

2.3 Generic non-asymptotic bounds

We make several assumptions on the parameter space and on the objective function, which we refer to collectively as Assumption A.

The parameter space $K$ satisfies: there exists $h_{\textrm{max}}>0$ , such that for any $x\in K$ and any $h\leq h_{\textrm{max}}$ , the random variable $y\sim N(x,2hI)$ satisfies $P(y\in K)\geq\frac{1}{3}$ .

The function $f:K\to[0,B]$ is bounded, differentiable and $L$ -smooth in $K$ , meaning that for any $x,y\in K$ , we have $|f(y)-f(x)-\langle y-x,\,\nabla f(x)\rangle|\leq\frac{L}{2}\|{y-x}\|_{2}^{2}$ .

The first assumption states that the parameter space doesn’t contain sharp corners, so that the update (3d) won’t be stuck at the same point for too many iterations. It can be satisfied, for example, by defining the parameter space to be a Euclidean ball and choosing $h_{\textrm{max}}=o(d^{-2})$. The probability $1/3$ is arbitrary and can be replaced by any constant in $(0,1/2)$. The second assumption requires the function $f$ to be smooth. We show how to handle non-smooth functions in Section 3 by appealing to the stability property of the restricted Cheeger constant discussed earlier. The third assumption requires the stochastic gradient to have sub-exponential tails, which is a standard assumption in stochastic optimization.
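The first assumption can be probed with a small Monte Carlo experiment. The sketch below takes $K$ to be the unit Euclidean ball in $d=10$ dimensions, places $x$ on the boundary (where a Gaussian proposal is most likely to leave the ball), and estimates $P(y\in K)$ for $y\sim N(x,2hI)$. The values $h=0.1/d^{2}$ and $h=1$ are illustrative choices, as are the function and variable names.

```python
import math
import random

def stay_probability(x, h, trials=20000, seed=0):
    """Monte Carlo estimate of P(y in K) for y ~ N(x, 2hI), with K the unit Euclidean ball."""
    rng = random.Random(seed)
    scale = math.sqrt(2.0 * h)
    inside = 0
    for _ in range(trials):
        y = [coord + scale * rng.gauss(0.0, 1.0) for coord in x]
        if sum(c * c for c in y) <= 1.0:
            inside += 1
    return inside / trials

d = 10
boundary_point = [1.0] + [0.0] * (d - 1)   # a point on the unit sphere
p_small_h = stay_probability(boundary_point, h=0.1 / d**2)
p_large_h = stay_probability(boundary_point, h=1.0)
print(p_small_h, p_large_h)
```

With $h$ of order $d^{-2}$ the estimated probability stays above $1/3$, whereas for $h=1$ almost every proposal leaves the ball, which is why the assumption only asks for the bound when $h\leq h_{\textrm{max}}$.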

The iteration number $k_{\textrm{max}}$ is bounded by

$$k_{\textrm{max}}\leq\frac{M}{\min\{1,\mathcal{C}_{(\xi f)}(K\backslash U)\}^{4}},\tag{8}$$

where the numerator $M$ is polynomial in $(B,L,G,\log(1/\delta),d,\xi,\eta_{0}/\eta,h_{\textrm{max}}^{-1},b_{\textrm{max}}^{-1},\rho^{-1})$ . See Appendix B.2 for the explicit polynomial dependence.

Theorem 1 is a generic result that applies to all optimization problems satisfying Assumption A. The right-hand side of the bound (7) is determined by the choice of $U$. If we choose $U$ to be the set of (approximate) local minima, and let $\rho>0$ be sufficiently small, then $f(\widehat{x})$ will roughly be bounded by the worst local minimum. The theorem permits $\xi$ to be arbitrary provided the stepsize $\eta$ is small enough. Choosing a larger $\xi$ means adding less noise to the SGLD update, which means that the algorithm will be more efficient at finding a stationary point, but less efficient at escaping local minima. Such a trade-off is captured by the restricted Cheeger constant in inequality (8) and will be rigorously studied in the next subsection.

The iteration complexity bound is governed by the restricted Cheeger constant. For any function $f$ and any target set $U$ with a positive Borel measure, the restricted Cheeger constant is strictly positive (see Appendix A), so that with a small enough $\eta$, the algorithm always converges to the global minimum asymptotically. We remark that SGD doesn’t enjoy the same asymptotic optimality guarantee, because it uses an $O(\eta)$ gradient noise in contrast to SGLD’s $O(\sqrt{\eta})$ one. Since the convergence theory requires a small enough $\eta$, we often have $\eta\ll\sqrt{\eta}$, so the SGD noise is too conservative to allow the algorithm to escape local minima.

The proof of Theorem 1 is fairly technical. We defer the full proof to Appendix B, only sketching the basic proof ideas here. At a high level, we establish the theorem by bounding the hitting time of the Markov chain $(x_{0},x_{1},x_{2},\dots)$ to the set $U_{\rho}:=\{x:d(x,U)\leq\rho\}$. Indeed, if some $x_{k}$ hits the set, then:

$$f(\widehat{x})\leq f(x_{k})\leq\sup_{x\in U_{\rho}}f(x).$$

In order to bound the hitting time, we construct a time-reversible Markov chain, and prove that its hitting time to $U_{\rho}$ is on a par with the original hitting time. To analyze this second Markov chain, we define a notion called the restricted conductance, which measures how easily the Markov chain can transition between states within $K\backslash U_{\rho}$. This quantity is related to the notion of conductance in the analysis of time-reversible Markov processes, but the ratio between these two quantities can be exponentially large for non-convex $f$. We prove that the hitting time of the second Markov chain depends inversely on the restricted conductance, so that the problem reduces to lower bounding the restricted conductance.

Finally, we lower bound the restricted conductance by the restricted Cheeger constant. The former quantity characterizes the Markov chain, while the latter captures the geometric properties of the function $f$. Thus, we must analyze the SGLD algorithm in depth to establish a connection between them. Once we prove this lower bound, putting all pieces together completes the proof. $\blacksquare$

2.4 Lower bounding the restricted Cheeger constant

In this subsection, we prove lower bounds on the restricted Cheeger constant $\mathcal{C}_{(\xi f)}(K\backslash U)$ in order to flesh out the iteration complexity bound of Theorem 1. We start with a lower bound for the class of convex functions:

Proposition 1. Let $K$ be a $d$-dimensional unit ball. For any convex ${G}$-Lipschitz continuous function $f$ and any $\epsilon>0$, let the set of $\epsilon$-optimal solutions be defined by:

$$U:=\{x\in K:f(x)\leq\inf_{y\in K}f(y)+\epsilon\}.$$

Then for any $\xi\geq\frac{2d\log(4{G}/\epsilon)}{\epsilon}$ , we have $\mathcal{C}_{(\xi f)}(K\backslash U)\geq 1$ .

The proposition shows that if we choose a big enough $\xi$ , then $\mathcal{C}_{(\xi f)}(K\backslash U)$ will be lower bounded by a universal constant. The lower bound is proved based on an isoperimetric inequality for log-concave distributions. See Appendix C for the proof.
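To get a feel for how large the temperature must be, the threshold $\xi\geq 2d\log(4G/\epsilon)/\epsilon$ can be evaluated directly. The helper below is an illustrative sketch; its name and the sample values of $(d,G,\epsilon)$ are assumptions made for the example.

```python
import math

def min_temperature_convex(d, G, eps):
    """Threshold on xi from the convex case: 2 * d * log(4G / eps) / eps."""
    return 2.0 * d * math.log(4.0 * G / eps) / eps

# The threshold grows roughly like (d / eps) * log(1 / eps).
thresholds = {eps: min_temperature_convex(d=10, G=1.0, eps=eps) for eps in (0.1, 0.01)}
for eps, xi in thresholds.items():
    print(f"eps = {eps}: xi >= {xi:.1f}")
```

Shrinking $\epsilon$ by a factor of ten raises the required $\xi$ by somewhat more than a factor of ten, because of the logarithmic term.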

For non-convex functions, directly proving the lower bound is difficult, because the definition of $\mathcal{C}_{(\xi f)}(K\backslash U)$ involves verifying the properties of all subsets $A\subset K\backslash U$ . We start with a generic lemma that reduces the problem to checking properties of all points in $K\backslash U$ .

Lemma 1 reduces the problem of lower bounding $\mathcal{C}_{f}(V)$ to the problem of finding a proper vector field $\phi$ and verifying its properties for all points $x\in V$. Informally, the quantity $\mathcal{C}_{f}(V)$ measures the chance of escaping the set $V$. The lemma shows that if we can construct an “oracle” vector field $\phi$, such that at every point $x\in V$ it gives the correct direction (i.e. $-\phi(x)$) to escape $V$, but always stays in $K$, then we obtain a strong lower bound on $\mathcal{C}_{f}(V)$. This construction is merely for the theoretical analysis and doesn’t affect the execution of the algorithm.

The proof idea is illustrated in Figure 2: by constructing a mapping $\pi:x\mapsto x-\epsilon\phi(x)$ that satisfies the conditions of the lemma, we obtain $\pi(A)\subset A_{\epsilon}$ for all $A\subset V$ , and consequently $\mu_{f}(\pi(A))\leq\mu_{f}(A_{\epsilon})$ . Then we are able to lower bound the restricted Cheeger constant by:

where $dA$ is an infinitesimal of the set $V$ . It can be shown that the right-hand side of inequality (10) is equal to $\inf_{x\in V}\{\langle\phi(x),\,\nabla f(x)\rangle-{\textrm{div}}\,\phi(x)\}$ , which establishes the lemma. See Appendix D for a rigorous proof. $\blacksquare$

Before demonstrating the applications of Lemma 1, we make several additional mild assumptions on the parameter space and on the function $f$, which we refer to collectively as Assumption B.

The parameter space $K$ is a $d$ -dimensional ball of radius $r>0$ centered at the origin. There exists $r_{0}>0$ such that for every point $x$ satisfying $\|{x}\|_{2}\in[r-r_{0},r]$ , we have $\langle x,\,\nabla f(x)\rangle\geq\|{x}\|_{2}$ .

For some ${G},L,H>0$ , the function $f$ is third-order differentiable with $\|{\nabla f(x)}\|_{2}\leq{G}$ , $\|{\nabla^{2}f(x)}\|_{*}\leq L$ and $\|{\nabla^{2}f(x)-\nabla^{2}f(y)}\|_{*}\leq H\|{x-y}\|_{2}$ for any $x,y\in K$ .

The first assumption requires the parameter space to be a Euclidean ball and imposes a gradient condition on its boundary. This is made mainly for the convenience of theoretical analysis. We remark that for any function $f$, the condition on the boundary can be satisfied by adding a smooth barrier function $\rho(\|{x}\|_{2})$ to it, where the function $\rho(t)=0$ for any $t<r-2r_{0}$, but sharply increases on the interval $[r-r_{0},r]$ to produce large enough gradients. The second assumption requires the function $f$ to be third-order differentiable. We shall relax the second assumption in Section 3.

The following proposition describes a lower bound on $\mathcal{C}_{(\xi f)}(K\backslash U)$ when $f$ is a smooth function and the set $U$ consists of approximate stationary points. Although we shall prove a stronger result, the proof of this proposition is a good example for demonstrating the power of Lemma 1.

Proposition 2. Assume that Assumption B holds. For any $\epsilon>0$, define the set of $\epsilon$-approximate stationary points $U:=\{x\in K:\|{\nabla f(x)}\|_{2}<\epsilon\}$. For any $\xi\geq 2L/\epsilon^{2}$, we have $\mathcal{C}_{(\xi f)}(K\backslash U)\geq\frac{\xi\epsilon^{2}}{2{G}}$.

Recall that ${G}$ is the Lipschitz constant of function $f$. Let the vector field be defined by $\phi(x):=\frac{1}{{G}}\nabla f(x)$, then we have $\|{\phi(x)}\|_{2}\leq 1$. By Assumption B, it is easy to verify that the conditions of Lemma 1 hold. For any $x\in K\backslash U$, the fact that $\|{\nabla f(x)}\|_{2}\geq\epsilon$ implies:

$$\langle\phi(x),\,\xi\nabla f(x)\rangle=\frac{\xi}{{G}}\|{\nabla f(x)}\|_{2}^{2}\geq\frac{\xi\epsilon^{2}}{{G}}.$$

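Recall that $L$ is the smoothness parameter. By Assumption B, the divergence of $\phi(x)$ is upper bounded by ${\textrm{div}}\,\phi(x)=\frac{1}{{G}}{\textrm{tr}}(\nabla^{2}f(x))\leq\frac{1}{{G}}\|{\nabla^{2}f(x)}\|_{*}\leq\frac{L}{{G}}$. Consequently, if we choose $\xi\geq 2L/\epsilon^{2}$ as assumed, then $\frac{L}{{G}}\leq\frac{\xi\epsilon^{2}}{2{G}}$ and we have:

$$\langle\phi(x),\,\xi\nabla f(x)\rangle-{\textrm{div}}\,\phi(x)\geq\frac{\xi\epsilon^{2}}{{G}}-\frac{L}{{G}}\geq\frac{\xi\epsilon^{2}}{2{G}}.$$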

Lemma 1 then establishes the claimed lower bound. ∎
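The argument can be sanity-checked on a toy example. The sketch below uses the one-dimensional function $f(x)=\sin(3x)$ on $K=[-1,1]$, for which $G=3$ and $L=9$, and evaluates $\langle\phi(x),\xi\nabla f(x)\rangle-{\textrm{div}}\,\phi(x)$ with $\phi=\nabla f/G$ on a grid of points outside $U$. The objective is a hypothetical choice, and the boundary condition of Assumption B is ignored here, so this only illustrates the algebra of the proof.

```python
import math

# Illustrative one-dimensional objective (not from the paper): f(x) = sin(3x).
grad = lambda x: 3.0 * math.cos(3.0 * x)      # f'(x), so G = 3
hess = lambda x: -9.0 * math.sin(3.0 * x)     # f''(x), so L = 9
G, L, eps = 3.0, 9.0, 1.0
xi = 2.0 * L / eps**2                          # smallest xi allowed by the proposition

def lemma_expression(x):
    """<phi(x), xi * grad f(x)> - div phi(x) for the field phi = grad f / G."""
    return xi * grad(x) ** 2 / G - hess(x) / G

grid = [-1.0 + 2.0 * i / 20000 for i in range(20001)]
outside_U = [x for x in grid if abs(grad(x)) >= eps]   # grid points of K \ U
worst = min(lemma_expression(x) for x in outside_U)
bound = xi * eps**2 / (2.0 * G)
print(worst, bound)
```

On this example the smallest value of the expression over the grid stays above $\xi\epsilon^{2}/(2G)$, as the proof predicts.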

Next, we consider approximate local minima, a notion that rules out local maxima and strict saddle points. For an arbitrary $\epsilon>0$, the set of $\epsilon$-approximate local minima is defined by:

We note that an approximate local minimum is not necessarily close to any local minimum of $f$. However, if we assume in addition that the function satisfies the (robust) strict-saddle property, then any point $x\in U$ is guaranteed to be close to a local minimum. Based on definition (11), we prove a lower bound for the set of approximate local minima.

Proposition 3. Assume that Assumption B holds. For any $\epsilon>0$, let $U$ be the set of $\epsilon$-approximate local minima. For any $\xi$ satisfying

we have $\mathcal{C}_{(\xi f)}(K\backslash U)\geq\frac{\sqrt{\epsilon}}{8(2{G}+1){G}}$ . The notation $\widetilde{\mathcal{O}}(1)$ hides a poly-logarithmic function of $(L,1/\epsilon)$ .

Proving Proposition 3 is significantly more challenging than proving Proposition 2. From a high-level point of view, we still construct a vector field $\phi$ , then lower bound the expression $\langle\phi(x),\,\xi\nabla f(x)\rangle-{\textrm{div}}\,\phi(x)$ for every point $x\in K\backslash U$ in order to apply Lemma 1. However, there exist saddle points in the set $K\backslash U$ , such that the inner product $\langle\phi(x),\,\xi\nabla f(x)\rangle$ can be very close to zero. For these points, we need to carefully design the vector field so that the term ${\textrm{div}}\,\phi(x)$ is strictly negative and bounded away from zero. To this end, we define $\phi(x)$ to be the sum of two components. The first component aligns with the gradient $\nabla f(x)$ . The second component aligns with a projected vector $\Pi_{x}(\nabla f(x))$ , which projects $\nabla f(x)$ to the linear subspace spanned by the eigenvectors of $\nabla^{2}f(x)$ with negative eigenvalues. It can be shown that the second component produces a strictly negative divergence in the neighborhood of strict saddle points. See Appendix E for the complete proof. $\blacksquare$
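The projection $\Pi_{x}(\nabla f(x))$ can be illustrated in two dimensions, where the eigendecomposition of a symmetric Hessian has a closed form. The sketch below is a hypothetical helper, not the construction used in Appendix E: it only shows what projecting a gradient onto the span of the negative-curvature eigenvectors means.

```python
import math

def project_onto_negative_eigenspace(hess, grad):
    """Project grad onto the span of eigenvectors of a symmetric 2x2 hess with negative eigenvalues."""
    (a, b), (_, c) = hess
    if abs(b) <= 1e-12:
        # Diagonal Hessian: the coordinate axes are eigenvectors.
        pairs = [(a, (1.0, 0.0)), (c, (0.0, 1.0))]
    else:
        mean, radius = 0.5 * (a + c), math.hypot(0.5 * (a - c), b)
        pairs = []
        for lam in (mean - radius, mean + radius):
            # (b, lam - a) is an eigenvector for eigenvalue lam when b != 0.
            norm = math.hypot(b, lam - a)
            pairs.append((lam, (b / norm, (lam - a) / norm)))
    proj = [0.0, 0.0]
    for lam, v in pairs:
        if lam < 0.0:
            coeff = v[0] * grad[0] + v[1] * grad[1]
            proj[0] += coeff * v[0]
            proj[1] += coeff * v[1]
    return (proj[0], proj[1])

# Saddle of f(x, y) = x^2 - y^2 at the point (1, 0.5): gradient (2, -1), Hessian diag(2, -2).
saddle_projection = project_onto_negative_eigenspace(((2.0, 0.0), (0.0, -2.0)), (2.0, -1.0))
# Hessian [[0, 1], [1, 0]] of f(x, y) = x * y, with eigenvalues -1 and +1.
rotated_projection = project_onto_negative_eigenspace(((0.0, 1.0), (1.0, 0.0)), (1.0, 0.0))
# A positive definite Hessian has no negative eigenspace, so the projection vanishes.
convex_projection = project_onto_negative_eigenspace(((1.0, 0.0), (0.0, 3.0)), (2.0, -1.0))
print(saddle_projection, rotated_projection, convex_projection)
```

Near a strict saddle the projected component is what remains of the gradient along the escaping directions, while at points with a positive semidefinite Hessian it is zero and only the gradient-aligned component of $\phi$ is active.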

2.5 Polynomial-time bound for finding an approximate local minimum

Combining Proposition 3 with Theorem 1, we conclude that SGLD finds an approximate local minimum of the function $f$ in polynomial time, assuming that $f$ is smooth enough to satisfy Assumption B.

Corollary 1. Assume that Assumptions A and B hold. For an arbitrary $\epsilon>0$, let $U$ be the set of $\epsilon$-approximate local minima. For any $\rho,\delta>0$, there exist a large enough $\xi$ and hyperparameters $(\eta,k_{\textrm{max}},D)$ such that with probability at least $1-\delta$, SGLD returns a solution $\widehat{x}$ satisfying

$$f(\widehat{x})\leq\sup_{x:\,d(x,U)\leq\rho}f(x).$$

The iteration number $k_{\textrm{max}}$ is bounded by a polynomial function of all hyperparameters in the assumptions as well as $(\epsilon^{-1},\rho^{-1},\log(1/\delta))$ .

Similarly, we can combine Proposition 1 or Proposition 2 with Theorem 1, to obtain complexity bounds for finding the global minimum of a convex function, or finding an approximate stationary point of a smooth function.

Corollary 1 doesn’t specify any upper limit on the temperature parameter $\xi$. As a result, SGLD can be stuck at the worst approximate local minima. It is important to note that the algorithm’s capability of escaping certain local minima relies on a more delicate choice of $\xi$. Given objective function $f$, we consider an arbitrary smooth function $F$ such that $\|{f-F}\|_{\infty}\leq 1/\xi$. By Theorem 1, for any target subset $U$, the hitting time of SGLD can be controlled by lower bounding the restricted Cheeger constant $\mathcal{C}_{\xi f}(K\backslash U)$. By the stability property (6), it is equivalent to lower bounding $\mathcal{C}_{\xi F}(K\backslash U)$ because $f$ and $F$ are uniformly close. If $\xi>0$ is chosen large enough (w.r.t. smoothness parameters of $F$), then the lower bound established by Proposition 3 guarantees a polynomial hitting time to the set $U_{F}$ of approximate local minima of $F$. Thus, SGLD can efficiently escape all local minima of $f$ that lie outside of $U_{F}$. Since the function $F$ is arbitrary, it can be thought of as a favorable perturbation of $f$ such that the set $U_{F}$ eliminates as many local minima of $f$ as possible. The power of such perturbations is determined by their maximum scale, namely the quantity $1/\xi$. This motivates choosing the smallest possible $\xi$ that still satisfies the lower bound in Proposition 3.

The above analysis doesn’t specify any concrete form of the function $F$. In Section 3, we present a concrete analysis where the function $F$ is assumed to be the population risk of empirical risk minimization (ERM). We establish sufficient conditions under which SGLD efficiently finds an approximate local minimum of the population risk.

Applications to empirical risk minimization

In this section, we apply SGLD to a specific family of functions, taking the form:

We shall prove that under certain conditions, SGLD finds an approximate local minimum of the (presumably smooth) population risk in polynomial time, even if it is executed on a non-smooth empirical risk. More concretely, we run SGLD on a smoothed approximation of the empirical risk that satisfies Assumption A. With large enough sample size, the empirical risk $f$ and its smoothed approximation will be close enough to the population risk $F$ , so that combining the stability property with Theorem 1 and Proposition 3 establishes the hitting time bound. First, let’s formalize the assumptions.

There exist $\rho_{\mbox{\tiny K}},\nu>0$ such that in the set $\overline{K}:=\{x:d(x,K)\leq\rho_{\mbox{\tiny K}}\}$ , the population risk $F$ is ${G}$ -Lipschitz continuous, and $\sup_{x\in\overline{K}}|f(x)-F(x)|\leq\nu$ .

Since the function $f$ can be non-differentiable, the stochastic gradient may not be well defined. We consider a smooth approximation of it following the idea of Duchi et al.:
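One way to picture such a smooth approximation is randomized smoothing, in which a non-smooth function is replaced by its average over small random perturbations of the input. The sketch below assumes Gaussian perturbations $z\sim N(0,\sigma^{2})$ in one dimension and estimates $\mathbb{E}[f(x+z)]$ by Monte Carlo; the exact form used in the paper is not reproduced here, and the function name and test function $|x|$ are illustrative.

```python
import math
import random

def smoothed_value(f, x, sigma, samples=20000, seed=0):
    """Monte Carlo estimate of E[f(x + z)] with z ~ N(0, sigma^2) (assumed Gaussian smoothing)."""
    rng = random.Random(seed)
    return sum(f(x + rng.gauss(0.0, sigma)) for _ in range(samples)) / samples

sigma = 0.5
value_at_kink = smoothed_value(abs, 0.0, sigma)
closed_form = sigma * math.sqrt(2.0 / math.pi)   # E|z| for z ~ N(0, sigma^2)
print(value_at_kink, closed_form)
```

At the kink of $|x|$ the smoothed function takes the value $\sigma\sqrt{2/\pi}$ rather than $0$, so the approximation error is controlled by $\sigma$, which is why the smoothed function stays uniformly close to the original for small $\sigma$.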

The iteration number $k_{\textrm{max}}$ is polynomial in $(B,\log(1/\delta),d,h_{\textrm{max}}^{-1},\nu^{-1},\rho_{\mbox{\tiny K}}^{-1},\mathcal{C}^{-1}_{(\xi F)}(K\backslash U))$ .

In order to lower bound the restricted Cheeger constant $\mathcal{C}_{(\xi F)}(K\backslash U)$ , we resort to the general lower bounds in Section 2.4. Consider population risks that satisfy the conditions of Assumption B. By combining Theorem 2 with Proposition 3, we conclude that SGLD finds an approximate local minimum of the population risk in polynomial time.

Assume that Assumption C holds. Also assume that Assumption B holds for the population risk $F$ with smoothness parameters $({G},L,H)$ . For any $\epsilon>0$ , let $U$ be the set of $\epsilon$ -approximate local minima of $F$ . If

Learning linear classifiers with zero-one loss

where $\frac{1-q(a)}{2}\in[0,0.5]$ is the Massart noise level. We assume that the noise level is strictly smaller than $0.5$ when the feature vector $a$ is separated apart from the decision boundary. Formally, there is a constant $0<q_{0}\leq 1$ such that

Assume that $d\geq 2$ . For any $q_{0}\in(0,1]$ and $\epsilon,\delta>0$ , if the sample size $n$ satisfies:

then there exist hyperparameters $(\xi,\eta,\sigma,k_{\textrm{max}},D)$ such that SGLD on the smoothed function (13) returns a solution $\widehat{x}$ satisfying $F(\widehat{x})\leq F(x^{*})+\epsilon$ with probability at least $1-\delta$ . The notation $\widetilde{O}(1)$ hides a poly-logarithmic function of $(d,1/q_{0},1/\epsilon,1/\delta)$ . The time complexity of the algorithm is polynomial in $(d,1/q_{0},1/\epsilon,\log(1/\delta))$ .

The proof consists of two parts. For the first part, we prove that the population risk is Lipschitz continuous and the empirical risk uniformly converges to the population risk, so that Assumption C holds. For the second part, we lower bound the restricted Cheeger constant by Lemma 1. The proof is spiritually similar to that of Proposition 2 or Proposition 3. We define $U$ to be the set of approximately optimal solutions, and construct a vector field $\phi$ such that:

By lower bounding the expression $\langle\phi(x),\,\nabla f(x)\rangle-{\textrm{div}}\,\phi(x)$ for all $x\in K\backslash U$ , Lemma 1 establishes a lower bound on the restricted Cheeger constant. The theorem is established by combining the two parts together and by Theorem 2. We defer the full proof to Appendix G. $\blacksquare$

Conclusion

In this paper, we analyzed the hitting time of the SGLD algorithm on non-convex functions. Our approach is different from existing analyses on Langevin dynamics, which connect LMC to a continuous-time Langevin diffusion process, then study the mixing time of the latter process. In contrast, we are able to establish polynomial-time guarantees for achieving certain optimality sets, regardless of the exponential mixing time.

For future work, we hope to establish stronger results on non-convex optimization using the techniques developed in this paper. Our current analysis doesn’t apply to training over-specified models. For these models, the empirical risk can be minimized far below the population risk, so the assumption of Corollary 2 is violated. In practice, over-specification often makes the optimization easier, so it could be interesting to show that this heuristic actually improves the restricted Cheeger constant. Another open problem is avoiding poor population local minima. Jin et al. proved that there are many poor population local minima in training Gaussian mixture models. It would be interesting to investigate whether a careful initialization could prevent SGLD from hitting such bad solutions.

References

Appendix A Restricted Cheeger constant is strictly positive

In this appendix, we prove that under mild conditions, the restricted Cheeger constant for a convex parameter space is always strictly positive. Let $K$ be an arbitrary convex parameter space with diameter $D<+\infty$ . Lovász and Simonovits [22, Theorem 2.6] proved the following isoperimetric inequality: for any subset $A\subset K$ and any $\epsilon>0$ , the following lower bound holds:

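where ${\textrm{vol}}(A)$ represents the Borel measure of set $A$. Let $f_{0}(x):=0$ be a constant zero function. By the definition of the function-induced probability measure, we have

$$\mu_{f_{0}}(A)=\frac{{\textrm{vol}}(A)}{{\textrm{vol}}(K)}\quad\text{for any }A\subset K.\tag{23}$$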

Combining the inequality (22) with equation (23), we obtain:

If the set $A$ satisfies $A\subset V\subset K$ , then $1-\mu_{f_{0}}(A_{\epsilon})\geq 1-\mu_{f_{0}}(V_{\epsilon})$ . Combining it with the above inequality, we obtain:

According to the definition of the restricted Cheeger constant, the above lower bound implies:

For an arbitrary bounded function $f$ satisfying $\sup_{x\in K}|f(x)|\leq B<+\infty$, combining the stability property (6) with inequality (24), we obtain:

We summarize the result as the following proposition.

Appendix B Proof of Theorem 1

The proof consists of two parts. We first establish a general bound on the hitting time of Markov chains to a certain subset $U\subset K$ , based on the notion of restricted conductance. Then we prove that the hitting time of SGLD can be bounded by the hitting time of a carefully constructed time-reversible Markov chain. This Markov chain runs a Metropolis-Hastings algorithm that converges to the stationary distribution $\mu_{\xi f}$ . We prove that this Markov chain has a bounded restricted conductance, whose value is characterized by the restricted Cheeger constant that we introduced in Section 2.2. Combining the two parts establishes the general theorem.

For an arbitrary Markov chain defined on the parameter space $K$ , we represent the Markov chain by its transition kernel $\pi(x,A)$ , which gives the conditional probability that the next state satisfies $x_{k+1}\in A$ given the current state $x_{k}=x$ . Similarly, we use $\pi(x,x^{\prime})$ to represent the conditional probability $P(x_{k+1}=x^{\prime}|x_{k}=x)$ . If $\pi$ has a stationary distribution, then we denote it by $Q_{\pi}$ .

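A Markov chain is called lazy if $\pi(x,x)\geq 1/2$ for every $x\in K$, and is called time-reversible if it satisfies

$$\int_{A}\pi(x,B)\,Q_{\pi}(x)\,dx=\int_{B}\pi(x,A)\,Q_{\pi}(x)\,dx\quad\text{for any }A,B\subset K.$$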

If $(x_{0},x_{1},x_{2},\dots)$ is a realization of the Markov chain $\pi$, then the hitting time to some set $U\subset K$ is denoted by:

$$\tau_{\pi}(U):=\min\{k:x_{k}\in U\}.$$

For arbitrary subset $V\subset K$, we define the restricted conductance, denoted by $\Phi_{\pi}(V)$, to be the following infimum ratio:

$$\Phi_{\pi}(V):=\inf_{A\subset V}\frac{\int_{A}\pi(x,K\backslash A)\,Q_{\pi}(x)\,dx}{Q_{\pi}(A)}.$$

Based on the notion of restricted conductance, we present a general upper bound on the hitting time. For arbitrary subset $U\subset K$ , suppose that ${\widetilde{\pi}}$ is an arbitrary Markov chain whose transition kernel is stationary inside $U$ , namely it satisfies ${\widetilde{\pi}}(x,x)=1$ for any $x\in U$ . Let $(\widetilde{x}_{0},\widetilde{x}_{1},\widetilde{x}_{2},\dots)$ be a realization of the Markov chain ${\widetilde{\pi}}$ . We denote by $Q_{k}$ the probability distribution of $\widetilde{x}_{k}$ at iteration $k$ . In addition, we define a measure of closeness between any two Markov chains.

For two Markov chains $\pi$ and ${\widetilde{\pi}}$ , we say that ${\widetilde{\pi}}$ is $\epsilon$ -close to $\pi$ w.r.t. a set $U$ if the following condition holds for any $x\in K\backslash U$ and any $A\subset K\backslash\{x\}$ :

Then we are able to prove the following lemma.

Let $\pi$ be a time-reversible lazy Markov chain with atom-free stationary distribution $Q_{\pi}$ . Assume that ${\widetilde{\pi}}$ is $\epsilon$ -close to $\pi$ w.r.t. $U$ where $\epsilon\leq\frac{1}{4}\Phi_{\pi}(K\backslash U)$ . If there is a constant $M$ such that the distribution $Q_{0}$ satisfies $Q_{0}(A)\leq M\,Q_{\pi}(A)$ for any $A\subset K\backslash U$ , then for any $\delta>0$ , the hitting time of the Markov chain is bounded by:

See Appendix B.3.1 for the proof of Lemma 2. The lemma shows that if the two chains $\pi$ and ${\widetilde{\pi}}$ are sufficiently close, then the hitting time of the Markov chain ${\widetilde{\pi}}$ will be inversely proportional to the square of the restricted conductance of the Markov chain $\pi$ , namely $\Phi_{\pi}(K\backslash U)$ . Note that if the density function of distribution $Q_{\pi}$ is bounded, then by choosing $Q_{0}$ to be the uniform distribution over $K$ , there exists a finite constant $M$ such that $Q_{0}(A)\leq MQ_{\pi}(A)$ , satisfying the last condition of Lemma 2.
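The notions of laziness, time-reversibility and restricted conductance are easiest to see on a finite state space. The sketch below uses a finite-state analogue in which the restricted conductance of $V$ is the minimum, over nonempty $A\subset V$, of the stationary flow from $A$ to its complement divided by $Q_{\pi}(A)$. This discrete definition, the four-state chain, and the function names are assumptions made for illustration.

```python
from itertools import combinations

def is_reversible(P, Q, tol=1e-12):
    """Detailed balance: Q[x] * P[x][y] == Q[y] * P[y][x] for all states x, y."""
    n = len(P)
    return all(abs(Q[x] * P[x][y] - Q[y] * P[y][x]) <= tol
               for x in range(n) for y in range(n))

def is_lazy(P):
    """A chain is lazy if it stays put with probability at least 1/2 in every state."""
    return all(P[x][x] >= 0.5 for x in range(len(P)))

def restricted_conductance(P, Q, V):
    """Minimum over nonempty A subset of V of (flow from A to its complement) / Q(A)."""
    n = len(P)
    best = float("inf")
    for size in range(1, len(V) + 1):
        for A in combinations(V, size):
            outside = [y for y in range(n) if y not in A]
            flow = sum(Q[x] * P[x][y] for x in A for y in outside)
            best = min(best, flow / sum(Q[x] for x in A))
    return best

# Lazy random walk on the path 0 - 1 - 2 - 3 with uniform stationary distribution.
P = [[0.75, 0.25, 0.0, 0.0],
     [0.25, 0.5, 0.25, 0.0],
     [0.0, 0.25, 0.5, 0.25],
     [0.0, 0.0, 0.25, 0.75]]
Q = [0.25] * 4
phi = restricted_conductance(P, Q, V=[0, 1])   # target set U = {2, 3}
print(is_lazy(P), is_reversible(P, Q), phi)
```

Here the minimizing set is $A=\{0,1\}$, whose only exit is the single edge into $U$. A small restricted conductance therefore signals a bottleneck between $K\backslash U$ and the target set, which is what slows down the hitting time in Lemma 2.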

B.2 Proof of the theorem

The SGLD algorithm initializes $x_{0}$ by the uniform distribution $\mu_{f_{0}}$ (with $f_{0}(x)\equiv 0$ ). Then at iteration $k\geq 1$ , it performs the following update:

We refer to the particular setting $\xi=1$ as the “standard setting”. For the “non-standard” setting of $\xi\neq 1$, we rewrite the first equation as:

This re-formulation reduces the problem to the standard setting, with stepsize $\eta/\xi$ and objective function $\xi f$. Thus it suffices to prove the theorem in the standard setting, then plug in the stepsize $\eta/\xi$ and the objective function $\xi f$ to obtain the general theorem. Therefore, we assume $\xi=1$ and consider the sequence of points $(x_{0},x_{1},\dots)$ generated by:

We introduce two additional notations: for arbitrary functions $f_{1},f_{2}$ , we denote the maximal gap $\sup_{x\in K}|f_{1}(x)-f_{2}(x)|$ by the shorthand $\|{f_{1}-f_{2}}\|_{\infty}$ . For arbitrary set $V\subset K$ and $\rho>0$ , we denote the super-set $\{x\in K:d(x,V)\leq\rho\}$ by the shorthand $V_{\rho}$ . Then we prove the following theorem for the standard setting.

Assume that Assumption A holds. Let $x_{0}$ be sampled from $\mu_{f_{0}}$ and let the Markov chain $(x_{0},x_{1},x_{2},\cdots)$ be generated by update (33). Let $U\subset K$ be an arbitrary subset and let $\rho>0$ be an arbitrary positive number. Let $\mathcal{C}:=\mathcal{C}_{f}(K\backslash U)$ be a shorthand notation. Then for any $\delta>0$ and any stepsize $\eta$ satisfying

the hitting time to set $U_{\rho}$ is bounded by

with probability at least $1-\delta$ . Here, $c,c^{\prime}>0$ are universal constants.

Theorem 4 shows that if we choose $\eta\in(0,\eta_{0}]$ , where $\eta_{0}$ is the right-hand side of inequality (34), then with probability at least $1-\delta$ , the hitting time to the set $U_{\rho}$ is bounded by

Combining it with the definition of $\eta_{0}$ , and with simple algebra, we conclude that $\min\{k:x_{k}\in U_{\rho}\}\leq\frac{M}{\min\{1,\mathcal{C}\}^{4}}$ where $M$ is polynomial in $(B,L,G,\log(1/\delta),d,\eta_{0}/\eta,h_{\textrm{max}}^{-1},b_{\textrm{max}}^{-1},\rho^{-1})$ . This establishes the iteration complexity bound. Whenever $x_{k}$ hits $U_{\rho}$ , we have

which establishes the risk bound. Thus, Theorem 4 establishes Theorem 1 for the special case of $\xi=1$ .

In the non-standard setting ( $\xi\neq 1$ ), we follow the reduction described above to substitute $(\eta,f)$ in Theorem 4 with the pair $(\eta/\xi,\xi f)$ . As a consequence, the quantity $\mathcal{C}$ is substituted with $\mathcal{C}_{(\xi f)}(K\backslash U)$ , and $(B,L,G,\eta_{0},b_{\textrm{max}})$ are substituted with $(\xi B,\xi L,\xi G,\eta_{0}/\xi,b_{\textrm{max}}/\xi)$ . Both the iteration complexity bound and the risk bound hold as in the standard setting, except that after the substitution, the numerator $M$ in the iteration complexity bound has an additional polynomial dependence on $\xi$ . Thus we have proved the general conclusion of Theorem 1.
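The reduction from the non-standard to the standard setting can be checked numerically. The sketch below assumes the unconstrained form of the update, $x-\eta g(x)+\sqrt{2\eta/\xi}\,w$ , and omits any step that keeps the iterate inside $K$ . The toy objective `grad_f` and all numeric values are hypothetical and serve only as an illustration.

```python
import math
import random


def sgld_step(x, grad, eta, xi, noise):
    """One unconstrained SGLD step with temperature xi (assumed form of the update)."""
    return [xj - eta * gj + math.sqrt(2.0 * eta / xi) * wj
            for xj, gj, wj in zip(x, grad, noise)]


def standard_step(x, grad_scaled, eta_scaled, noise):
    """The same step written in the standard setting (xi = 1)."""
    return [xj - eta_scaled * gj + math.sqrt(2.0 * eta_scaled) * wj
            for xj, gj, wj in zip(x, grad_scaled, noise)]


def grad_f(x):
    # Hypothetical toy objective f(x) = sum_j (x_j^4 - x_j^2); not taken from the paper.
    return [4.0 * xj ** 3 - 2.0 * xj for xj in x]


rng = random.Random(0)
x = [0.3, -0.7, 1.1]
eta, xi = 0.01, 5.0
w = [rng.gauss(0.0, 1.0) for _ in x]

# General setting: stepsize eta, objective f, temperature xi.
general = sgld_step(x, grad_f(x), eta, xi, w)
# Standard setting: stepsize eta / xi, objective xi * f, temperature 1.
reduced = standard_step(x, [xi * g for g in grad_f(x)], eta / xi, w)

max_gap = max(abs(a - b) for a, b in zip(general, reduced))
print(max_gap)
```

With the same noise vector, the two updates agree up to floating-point rounding, which is the substitution $(\eta,f)\mapsto(\eta/\xi,\xi f)$ used in the proof.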

where $\delta_{x}$ is the Dirac delta function at point $x$ . The expectation is taken over the stochastic gradient $g$ defined in equation (33), conditioning on the current state $x$ . Then for any candidate state $y\in K\cap\mathcal{B}(x;4\sqrt{2\eta d})$ , we accept the candidate state (i.e., $x_{k+1}=y$ ) with probability:

or reject the candidate state (i.e., $x_{k+1}=x$ ) with probability $1-\alpha_{x}(y)$ . All candidate states $y\notin K\cap\mathcal{B}(x;4\sqrt{2\eta d})$ are rejected (i.e., $x_{k+1}=x$ ). It is easy to verify that $\pi_{f}$ executes a Metropolis-Hastings algorithm. Therefore, it induces a time-reversible Markov chain, and its stationary distribution is equal to $\mu_{f}(x)\propto e^{-f(x)}$ .

Despite the difference in their definitions, we are able to show that the two Markov chains are $\epsilon$ -close, where $\epsilon$ depends on the stepsize $\eta$ and the properties of the objective function.

Assume that $0<\eta\leq\frac{b_{\textrm{max}}^{2}}{32d}$ and that Assumption A holds. Then the Markov chain ${\widetilde{\pi}}_{f}$ is $\epsilon$ -close to $\pi_{f}$ w.r.t. $U_{\rho}$ with $\epsilon=e^{33\eta d(G^{2}+L)}-1$ .

Lemma 3 shows that if we choose $\eta$ small enough, then $\epsilon$ will be sufficiently small. Recall from Lemma 2 that we need $\epsilon\leq\frac{1}{4}\Phi_{\pi_{f}}(K\backslash U_{\rho})$ to bound the hitting time of the Markov chain ${\widetilde{\pi}}_{f}$ to the set $U_{\rho}$ . This means that $\eta$ has to be chosen based on the restricted conductance of the Markov chain $\pi_{f}$ . Although calculating the restricted conductance of a Markov chain might be difficult, the following lemma shows that the restricted conductance can be lower bounded by the restricted Cheeger constant.
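To make the dependence of $\epsilon$ on $\eta$ concrete, the following sketch evaluates $\epsilon=e^{33\eta d(G^{2}+L)}-1$ from Lemma 3 and inverts it to find the largest stepsize that meets a target closeness. The values of $d$ , $G$ , $L$ and the target are hypothetical, and the additional requirement $\eta\leq\frac{b_{\textrm{max}}^{2}}{32d}$ of Lemma 3 is not checked here.

```python
import math


def closeness_epsilon(eta, d, G, L):
    """epsilon = exp(33 * eta * d * (G^2 + L)) - 1, as stated in Lemma 3."""
    return math.expm1(33.0 * eta * d * (G ** 2 + L))


def max_stepsize_for(eps_target, d, G, L):
    """Largest eta whose Lemma 3 epsilon does not exceed eps_target."""
    return math.log1p(eps_target) / (33.0 * d * (G ** 2 + L))


# Hypothetical problem constants, chosen only for illustration.
d, G, L = 10, 2.0, 3.0
eps_target = 0.01

eta = max_stepsize_for(eps_target, d, G, L)
eps = closeness_epsilon(eta, d, G, L)
print(eta, eps)
```

Because $\epsilon$ grows monotonically with $\eta$ , halving the stepsize can only tighten the closeness guarantee.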

Assume that $\eta\leq\min\{h_{\textrm{max}},16d\rho^{2},\frac{b_{\textrm{max}}^{2}}{32d},\frac{1}{100d(G^{2}+L)}\}$ and that Assumption A holds. Then for any $V\subset K$ , we have:

By Lemma 3 and Lemma 4, we are able to choose a sufficiently small $\eta$ such that the Markov chains $\pi_{f}$ and ${\widetilde{\pi}}_{f}$ are close enough to satisfy the conditions of Lemma 2. Formally, the following condition on $\eta$ is sufficient.

There exists a universal constant $c>0$ such that for any stepsize $\eta$ satisfying:

the Markov chains $\pi_{f}$ and ${\widetilde{\pi}}_{f}$ are $\epsilon$ -close with $\epsilon\leq\frac{1}{4}\Phi_{\pi_{f}}(K\backslash U_{\rho})$ . In addition, the restricted conductance satisfies the lower bound $\Phi_{\pi_{f}}(K\backslash U_{\rho})\geq\min\{\frac{1}{2},\frac{\sqrt{\eta/d}\,\mathcal{C}}{1536}\}$ .

Under condition (38), the Markov chains $\pi_{f}$ and ${\widetilde{\pi}}_{f}$ are $\epsilon$ -close with $\epsilon\leq\frac{1}{4}\Phi_{\pi_{f}}(K\backslash U_{\rho})$ . Recall that the Markov chain $\pi_{f}$ is time-reversible and lazy. Since $f$ is bounded, the stationary distribution $Q_{\pi_{f}}=\mu_{f}$ is atom-free, and sampling $x_{0}$ from $Q_{0}:=\mu_{f_{0}}$ implies:

Thus the last condition of Lemma 2 is satisfied. Combining Lemma 2 with the lower bound $\Phi_{\pi_{f}}(K\backslash U_{\rho})\geq\min\{\frac{1}{2},\frac{\sqrt{\eta/d}\,\mathcal{C}}{1536}\}$ in Lemma 5, it implies that with probability at least $1-\delta>0$ , we have

where $c^{\prime}>0$ is a universal constant.

Finally, we upper bound the hitting time of SGLD (i.e., the Markov chain induced by formula (33)) using the hitting time upper bound (40). We denote by $\pi_{\textrm{sgld}}$ the transition kernel of SGLD, and claim that the Markov chain induced by it can be generated as a sub-sequence of the Markov chain induced by ${\widetilde{\pi}}_{f}$ . To see why the claim holds, we consider a Markov chain $(\widetilde{x}_{0},\widetilde{x}_{1},\widetilde{x}_{2},\dots)$ generated by ${\widetilde{\pi}}_{f}$ , and construct a sub-sequence $(x_{0}^{\prime},x_{1}^{\prime},x_{2}^{\prime},\dots)$ of this Markov chain as follows:

Examine the states $\widetilde{x}_{k}$ in the order $k=1,2,\dots,\tau$ , where $\tau=\min\{k:\widetilde{x}_{k}\in U\}$ :

For any state $\widetilde{x}_{k}$ , in order to sample its next state $\widetilde{x}_{k+1}$ , the candidate state $y$ is either drawn from a delta distribution $\delta_{\widetilde{x}_{k}}$ , or drawn from a normal distribution with stochastic mean vector $x-\eta g(x)$ . The probabilities of these two cases are equal, according to equation (36).

By this construction, it is easy to verify that $(x^{\prime}_{0},x_{1}^{\prime},x^{\prime}_{2},\dots)$ is a Markov chain and its transition kernel exactly matches formula (33). Since the sub-sequence $(x^{\prime}_{0},x_{1}^{\prime},x^{\prime}_{2},\dots)$ hits $U$ in at most $\tau$ steps, we have

Combining this upper bound with (40) completes the proof of Theorem 4.

B.3 Proof of technical lemmas

If there is a constant $C$ such that the inequality $h_{0}(p)\leq C\sqrt{p}$ holds for any $p\in[0,q]$ , then the inequality

According to the claim, it suffices to upper bound $h_{0}(p)$ for $p\in[0,q]$ . Indeed, since $Q_{0}(A)\leq M\,Q_{\pi}(A)$ for any $A\subset K\backslash U$ , we immediately have:

Choosing $k:=\frac{4\log(M/\delta)}{\Phi^{2}_{\pi}(K\backslash U)}$ implies $Q_{k}(K\backslash U)\leq\delta$ . As a consequence, the hitting time is bounded by $k$ with probability at least $1-\delta$ .
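The choice of $k$ can be sanity-checked numerically. The sketch below assumes that the bound on the residual mass takes the geometric form $Q_{k}(K\backslash U)\leq M(1-\Phi^{2}_{\pi}(K\backslash U)/4)^{k}$ , which is how we read the contraction factor used in the proof. The values of $M$ , $\delta$ and the conductance are hypothetical.

```python
import math


def hitting_time_bound(M, delta, conductance):
    """k = 4 * log(M / delta) / Phi^2, rounded up to an integer number of steps."""
    return math.ceil(4.0 * math.log(M / delta) / conductance ** 2)


def residual_mass_bound(M, k, conductance):
    """Assumed geometric decay: Q_k(K \\ U) <= M * (1 - Phi^2 / 4)^k."""
    return M * (1.0 - conductance ** 2 / 4.0) ** k


# Hypothetical values, chosen only for illustration.
M, delta, phi = 50.0, 0.05, 0.1

k = hitting_time_bound(M, delta, phi)
mass = residual_mass_bound(M, k, phi)
print(k, mass)
```

The check relies on $1-t\leq e^{-t}$ , so that $M(1-\Phi^{2}/4)^{k}\leq Me^{-k\Phi^{2}/4}\leq\delta$ once $k\geq 4\log(M/\delta)/\Phi^{2}$ .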

Recall the properties of the function $h_{k}$ . For any $p\in[0,q]$ , we can find a set $A\subset K\backslash U$ such that $Q_{\pi}(A)=p$ and $h_{k}(p)=Q_{k}(A)$ . Define, for $x\in K$ , two functions:

By the laziness of the Markov chain ${\widetilde{\pi}}$ , we obtain $0\leq g_{i}\leq 1$ , so that they are functions mapping from $K\backslash U$ to $[0,1]$ . Using the relation $2{\widetilde{\pi}}(x,A)-1=1-2{\widetilde{\pi}}(x,K\backslash A)$ , the definition of $g_{1}$ implies that:

where the last inequality follows since the $\epsilon$ -closeness ensures $\pi(x,K\backslash A)\leq{\widetilde{\pi}}(x,K\backslash A)$ . Similarly, using the definition of $g_{2}$ and the relation ${\widetilde{\pi}}(x,A)\leq(1+\epsilon)\pi(x,A)$ , we obtain:

Since $Q_{\pi}$ is the stationary distribution of the time-reversible Markov chain $\pi$ , the right-hand side of (44) is equal to:

Let $p_{1}$ and $p_{2}$ be the left-hand side of inequality (43) and (44) respectively, and define a shorthand notation:

Then by definition of restricted conductance and the laziness of $\pi$ , we have $\Phi_{\pi}(K\backslash U)\leq r\leq 1/2$ . Combining inequalities (43), (44) and (45) and by simple algebra, we obtain:

By the condition $\epsilon\leq\frac{1}{4}\Phi_{\pi}(K\backslash U)\leq\frac{r}{4}$ , the above inequality implies

It is straightforward to verify that for any $0\leq r\leq 1$ , the right-hand side is upper bounded by $2(1-r^{2}/4)\sqrt{p}$ . Thus we obtain:

On the other hand, the definition of $g_{1}$ and $g_{2}$ implies that ${\widetilde{\pi}}(x,A)=\frac{g_{1}(x)+g_{2}(x)}{2}$ for any $x\in K\backslash U$ . For all $x\in U$ , the transition kernel ${\widetilde{\pi}}$ is stationary, so that we have ${\widetilde{\pi}}(x,A)=0$ . Combining these two facts implies

The last inequality uses the definition of function $h_{k-1}$ .

Finally, we prove inequality (42) by induction. The inequality holds for $k=0$ by the assumption. We assume by induction that it holds for an arbitrary integer $k-1$ , and prove that it holds for $k$ . Combining the inductive hypothesis with inequalities (46) and (B.3.1), we have

Thus, inequality (42) holds for $k$ , which completes the proof.

B.3.2 Proof of Lemma 3

By the definition of $\epsilon$ -closeness, it suffices to consider an arbitrary $x\notin U_{\rho}$ and verify the inequality (26). We focus on cases when the acceptance ratios of $\pi_{f}$ and ${\widetilde{\pi}}_{f}$ are different, that is, when the candidate state $y$ satisfies $y\neq x$ and $y\in K\cap\mathcal{B}(x;4\sqrt{2\eta d})$ . We make the following claim on the acceptance ratio.

For any $0<\eta\leq\frac{b_{\textrm{max}}^{2}}{32d}$ , if we assume $x\notin U_{\rho}$ , $y\neq x$ , and $y\in K\cap\mathcal{B}(x;4\sqrt{2\eta d})$ , then the acceptance ratio is lower bounded by $\alpha_{x}(y)\geq e^{-33\eta d(G^{2}+L)}$ .

Consider an arbitrary point $x\in K\backslash U_{\rho}$ and an arbitrary subset $A\subset K\backslash\{x\}$ . The definitions of $\pi_{f}$ and ${\widetilde{\pi}}_{f}$ imply that $\pi_{f}(x,A)\leq{\widetilde{\pi}}_{f}(x,A)$ always holds. In order to prove the reverse direction, we notice that:

The definition of $\pi_{f}$ and Claim 2 imply

By plugging in the definition of $\alpha_{x}(y)$ and $\alpha_{y}(x)$ and the fact that $x\neq y$ , we obtain

In order to prove the claim, we need to lower bound the numerator and upper bound the denominator of equation (49). For the numerator, Jensen’s inequality implies:

where the last inequality uses the upper bound

For the above deduction, we have used Jensen’s inequality as well as Assumption A.

For the denominator, we notice that the term inside the expectation satisfies:

For the second term on the right-hand side, using the relation $(a-b)^{2}\leq 2a^{2}+2b^{2}$ and Jensen’s inequality, we obtain

Since $\|{x-y}\|_{2}\leq 4\sqrt{2\eta d}\leq b_{\textrm{max}}$ is assumed, Assumption A implies

Combining inequalities (52)-(55), we obtain

Combining equation (49) with inequalities (50), (56), we obtain

The $L$ -smoothness of function $f$ implies that

Combining this inequality with the lower bound (57) completes the proof.

B.3.3 Proof of Lemma 4

Recall that $\mu_{f}$ is the stationary distribution of the Markov chain $\pi_{f}$ . We consider an arbitrary subset $A\subset V$ , and define $B:=K\backslash A$ . Let $A_{1}$ and $B_{1}$ be defined as

In other words, the points in $A_{1}$ and $B_{1}$ have low probability of moving across the border between $A$ and $B$ . We claim that the distance between points in $A_{1}$ and $B_{1}$ must be bounded below by a positive number.

Assume that $\eta\leq\min\{h_{\textrm{max}},\frac{b_{\textrm{max}}^{2}}{32d},\frac{1}{100d(G^{2}+L)}\}$ . If $x\in A_{1}$ and $y\in B_{1}$ , then $\|{x-y}\|_{2}>\frac{1}{4}\sqrt{\eta/d}$ .

For any point $x\in K\backslash(A_{1}\cup B_{1})$ , we either have $x\in A$ and $\pi_{f}(x,B)\geq 1/96$ , or we have $x\in B$ and $\pi_{f}(x,A)\geq 1/96$ . It implies:

Since $\mu_{f}$ is the stationary distribution of the time-reversible Markov chain $\pi_{f}$ , inequality (58) implies:

Notice that $A\subset K\backslash B_{1}$ , so that $\mu_{f}(K\backslash B_{1})\geq\mu_{f}(A)$ . According to Claim 3, by defining an auxiliary quantity:

we find that the set $(A_{1})_{\rho_{\eta}}$ belongs to $K\backslash B_{1}$ , so that $\mu_{f}(K\backslash B_{1})\geq\mu_{f}((A_{1})_{\rho_{\eta}})$ . The following property is a direct consequence of the definition of restricted Cheeger constant.

For any $A\subset V$ and any $\nu>0$ , we have $\mu_{f}(A_{\nu})\geq e^{\nu\cdot\mathcal{C}_{f}(V_{\nu})}\mu_{f}(A)$ .

Letting $A:=A_{1}$ and $\nu:={\rho_{\eta}}$ in Claim 4, we have $\mu_{f}((A_{1})_{\rho_{\eta}})\geq e^{{\rho_{\eta}}\cdot\mathcal{C}_{f}(V_{\rho_{\eta}})}\mu_{f}(A_{1})$ . Combining these inequalities, we obtain

where the last inequality uses the relation $\max\{a-b,(\alpha-1)b\}\geq\frac{\alpha-1}{\alpha}(a-b)+\frac{1}{\alpha}(\alpha-1)b=\frac{\alpha-1}{\alpha}a$ with $\alpha:=e^{{\rho_{\eta}}\cdot\mathcal{C}_{f}(V_{\rho_{\eta}})}$ . Combining it with inequality (B.3.3), we obtain

The lemma’s assumption gives ${\rho_{\eta}}=\frac{1}{4}\sqrt{\eta/d}\leq\rho$ . Plugging in this relation completes the proof.

Consider any two points $x\in A$ and $y\in B$ . Let $s$ be a number such that $2s\sqrt{2\eta d}=\|{x-y}\|_{2}$ . If $s>1$ , then the claim already holds for the pair $(x,y)$ . Otherwise, we assume that $s\leq 1$ , and as a consequence assume $\|{x-y}\|_{2}\leq 2\sqrt{2\eta d}$ .

Denote by $q(z)$ the density function of distribution $N(\frac{x+y}{2};2\eta I)$ . The integral $\int_{Z}q(z)dz$ is equal to $P(X\leq 9d)$ , where $X$ is a random variable satisfying the chi-square distribution with $d$ degrees of freedom. The following tail bound for the chi-square distribution was proved by Laurent and Massart .

If $X$ is a random variable satisfying the Chi-square distribution with $d$ degrees of freedom, then for any $x>0$ ,

By choosing $x=9/5$ in Lemma 6, the probability $P(X\leq 9d)$ is lower bounded by $1-e^{-(9/5)d}>5/6$ . Since $\eta\leq h_{\textrm{max}}$ , the first assumption of Assumption A implies $\int_{K}q(z)dz\geq 1/3$ . Combining these two bounds, we obtain

For any point $z\in Z$ , the distances $\|{z-x}\|_{2}$ and $\|{z-y}\|_{2}$ are bounded by $4\sqrt{2\eta d}$ . It implies

Claim 2 in the proof of Lemma 3 demonstrates that the acceptance ratio $\alpha_{x}(z)$ and $\alpha_{y}(z)$ for any $z\in K\cap Z$ are both lower bounded by $e^{-33\eta d(G^{2}+L)}$ given the assumption $0<\eta\leq\frac{b_{\textrm{max}}^{2}}{32d}$ . This lower bound is at least equal to $1/2$ because of the assumption $\eta\leq\frac{1}{100d(G^{2}+L)}$ , so that we have

Next, we lower bound the ratio $q_{x}(z)/q(z)$ and $q_{y}(z)/q(z)$ . For $z\in Z$ but $z\neq x$ , the function $q_{x}(z)$ is defined by

where the last inequality uses Jensen’s inequality; it also uses the facts $\|{\frac{y-x}{2}}\|_{2}=s\sqrt{2\eta d}$ and $\|{z-\frac{x+y}{2}}\|_{2}\leq 3\sqrt{2\eta d}$ .

As a consequence of this upper bound and using Jensen’s inequality, we have:

Combining inequalities (B.3.2), (63) and (64), we obtain:

The assumption $\eta\leq\frac{1}{100d(G^{2}+L)}$ implies $G\leq\frac{1}{10\sqrt{\eta d}}$ . Plugging this inequality into (65), a sufficient condition for $q_{x}(z)/q(z)>1/4$ is

Following identical steps, we can prove that inequality (66) is a sufficient condition for $q_{y}(z)/q(z)>1/4$ as well.

Assume that condition (66) holds. Combining inequalities (60), (62) with the fact $q_{x}(z)>q(z)/4$ and $q_{y}(z)>q(z)/4$ , we obtain:

Notice that the set $Z$ satisfies $Z\subset\mathcal{B}(x;4\sqrt{2\eta d})\cap\mathcal{B}(y;4\sqrt{2\eta d})$ , thus the following lower bound holds:

It implies that either $\pi_{f}(x,B)\geq\frac{1}{96}$ or $\pi_{f}(y,A)\geq\frac{1}{96}$ . In other words, if $x\in A_{1}$ and $y\in B_{1}$ , then inequality (66) must not hold. As a consequence, we obtain the lower bound:

Let $n$ be an arbitrary integer and let $i\in\{1,\dots,n\}$ . By the definition of the restricted Cheeger constant (see equation (5)), we have

where $\epsilon_{n}$ is an indexed variable satisfying $\lim_{n\to\infty}\epsilon_{n}=0$ . Summing over $i=1,\dots,n$ , we obtain

Taking the limit $n\to\infty$ on both sides of the inequality completes the proof.

B.3.4 Proof of Lemma 5

First, we impose the following constraints on the choice of $\eta$ :

so that the preconditions of both Lemma 3 and Lemma 4 are satisfied. By plugging $V:=K\backslash U_{\rho}$ into Lemma 4, the restricted conductance is lower bounded by:

The last inequality holds because $1-e^{-t}\geq\min\{\frac{1}{2},\frac{t}{2}\}$ holds for any $t>0$ . It is easy to verify that $(K\backslash U_{\rho})_{\rho}\subset K\backslash U$ , so that we have the lower bound $\mathcal{C}_{f}((K\backslash U_{\rho})_{\rho})\geq\mathcal{C}_{f}(K\backslash U)=\mathcal{C}$ . Plugging this lower bound into inequality (69), we obtain

Inequality (70) establishes the restricted conductance lower bound for the lemma.
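The elementary inequality $1-e^{-t}\geq\min\{\frac{1}{2},\frac{t}{2}\}$ used above, and the resulting conductance lower bound $\min\{\frac{1}{2},\frac{\sqrt{\eta/d}\,\mathcal{C}}{1536}\}$ stated in Lemma 5, can be checked with a short script. This is only a numerical sanity check on a grid, not a proof, and the values of $\eta$ , $d$ and $\mathcal{C}$ are hypothetical.

```python
import math


def elementary_gap(t):
    """(1 - exp(-t)) - min(1/2, t/2); should be non-negative for every t > 0."""
    return -math.expm1(-t) - min(0.5, t / 2.0)


def conductance_lower_bound(eta, d, cheeger):
    """min{1/2, sqrt(eta / d) * C / 1536}, the lower bound stated in Lemma 5."""
    return min(0.5, math.sqrt(eta / d) * cheeger / 1536.0)


# Logarithmic grid covering t from 1e-6 to 1e3.
grid = [10.0 ** (k / 10.0) for k in range(-60, 31)]
worst_gap = min(elementary_gap(t) for t in grid)

# Hypothetical stepsize, dimension and restricted Cheeger constant.
bound = conductance_lower_bound(eta=1e-4, d=10, cheeger=2.0)
print(worst_gap, bound)
```

For small stepsizes the second term of the minimum is the active one, so the conductance bound scales like $\sqrt{\eta/d}\,\mathcal{C}$ .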

Combining inequality (70) with Lemma 3, it remains to choose a small enough $\eta$ such that ${\widetilde{\pi}}_{f}$ is $\epsilon$ -close to $\pi_{f}$ with $\epsilon\leq\frac{1}{4}\Phi_{\pi}(K\backslash U_{\rho})$ . More precisely, it suffices to make the following inequality hold:

In order to satisfy this inequality, it suffices to choose $\eta\lesssim\min\{\frac{1}{d({G}^{2}+L)},\frac{\mathcal{C}^{2}}{d^{3}({G}^{2}+L)^{2}}\}$ . Combining this result with (68) completes the proof.

Appendix C Proof of Proposition 1

Lovász and Simonovits [22, Theorem 2.6] proved the following isoperimetric inequality: Let $K$ be an arbitrary convex set with diameter $2$ . For any convex function $f$ and any subset $V\subset K$ satisfying $\mu_{f}(V)\leq 1/2$ , the following lower bound holds:

The lower bound (71) implies $\mathcal{C}_{f}(V)\geq 1$ . In order to establish the proposition, it suffices to choose $V:=K\backslash U$ and $f:=\xi f$ , then prove the pre-condition $\mu_{\xi f}(K\backslash U)\leq 1/2$ .

Let $x^{*}$ be one of the global minima of the function $f$ and let $\mathcal{B}(x^{*};r)$ be the ball of radius $r$ centered at the point $x^{*}$ . If we choose $r=\frac{\epsilon}{2{G}}$ , then for any point $x\in\mathcal{B}(x^{*};r)\cap K$ , we have

Moreover, for any $y\in K\backslash U$ we have:

It means for the probability measure $\mu_{\xi f}$ , the density function inside $\mathcal{B}(x^{*};r)\cap K$ is at least $e^{\xi\epsilon/2}$ times greater than the density inside $K\backslash U$ . It implies

Without loss of generality, we assume that $K$ is the unit ball centered at the origin. Consider the Euclidean ball $\mathcal{B}(x^{\prime};r/2)$ where $x^{\prime}=\max\{0,1-r/(2\|{x^{*}}\|_{2})\}x^{*}$ . It is easy to verify that $\|{x^{\prime}}\|_{2}\leq 1-r/2$ and $\|{x^{\prime}-x^{*}}\|_{2}\leq r/2$ , which implies $\mathcal{B}(x^{\prime};r/2)\subset\mathcal{B}(x^{*};r)\cap K$ . Combining this relation with inequality (72), we have

The right-hand side is greater than or equal to $1$ , because we have assumed $\xi\geq\frac{2d\log(4{G}/\epsilon)}{\epsilon}$ . As a consequence, we have $\mu_{\xi f}(K\backslash U)\leq 1/2$ .
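The threshold on $\xi$ can be verified in log-space. The sketch assumes that the right-hand side in question has the form $e^{\xi\epsilon/2}(r/2)^{d}$ , that is, the density ratio $e^{\xi\epsilon/2}$ times the volume ratio of the ball $\mathcal{B}(x^{\prime};r/2)$ to the unit ball $K$ , with $r=\frac{\epsilon}{2G}$ . This form is our reconstruction from the surrounding text, and the numeric values are hypothetical.

```python
import math


def log_mass_ratio_bound(xi, eps, G, d):
    """log of exp(xi * eps / 2) * (r / 2)^d with r = eps / (2 G) (assumed form)."""
    r = eps / (2.0 * G)
    return xi * eps / 2.0 + d * math.log(r / 2.0)


def xi_threshold(eps, G, d):
    """xi = 2 d log(4 G / eps) / eps, the assumption used in the proof."""
    return 2.0 * d * math.log(4.0 * G / eps) / eps


# Hypothetical values, chosen only for illustration.
eps, G, d = 0.1, 1.0, 20

xi0 = xi_threshold(eps, G, d)
at_threshold = log_mass_ratio_bound(xi0, eps, G, d)
above_threshold = log_mass_ratio_bound(2.0 * xi0, eps, G, d)
print(xi0, at_threshold, above_threshold)
```

At the threshold the logarithm is zero up to rounding, so the ratio equals $1$ , and any larger $\xi$ makes it strictly greater than $1$ .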

Appendix D Proof of Lemma 1

Consider a sufficiently small $\epsilon$ and a continuous mapping $\pi(x):=x-\epsilon\phi(x)$ . Since $\phi$ is continuously differentiable in the compact set $K$ , there exists a constant $G$ such that $\|{\phi(x)-\phi(y)}\|_{2}\leq G\|{x-y}\|_{2}$ for any $x,y\in K$ . Assuming $\epsilon<1/G$ , it implies

Thus, the mapping $\pi$ is a continuous one-to-one mapping. For any set $A\subset K$ , we define $\pi(A):=\{\pi(x):x\in A\}$ .

Since the parameter set $K$ is compact, we can partition $K$ into a finite number of small compact subsets, such that each subset has diameter at most $\delta:=\epsilon^{2}$ . Let $S$ be the collection of these subsets that intersect with $A$ . The definition implies $A\subset\cup_{B\in S}B\subset A_{\delta}$ . The fact that $\|{\phi(x)}\|_{2}\leq 1$ implies

For arbitrary $B\in S$ , we consider a point $x\in B\cap A$ , and remark that every point in $B$ is $\delta$ -close to the point $x$ . Since $\phi$ is continuously differentiable, the Jacobian matrix of the transformation $\pi$ has the following expansion:

where $H$ is the Jacobian matrix of $\phi$ satisfying $H_{ij}(x)=\frac{\partial\phi_{i}(x)}{\partial x_{j}}$ . The remainder term $r_{1}(x,y)$ , as a consequence of the continuous differentiability of $\phi$ and the fact $\|{y-x}\|_{2}\leq\delta=\epsilon^{2}$ , satisfies $\|{r_{1}(x,y)}\|_{2}\leq C_{1}\epsilon^{2}$ for some constant $C_{1}$ .

On the other hand, using the relation $\nabla\mu_{f}(y)=-\mu_{f}(y)\nabla f(y)$ and the continuous differentiability of $\mu_{f}$ , the density function at $\pi(y)$ can be approximated by

where the remainder term $r_{2}(y)$ satisfies $|r_{2}(y)|\leq C_{2}\epsilon^{2}$ for some constant $C_{2}$ . Further using the continuity of $\phi$ , $\nabla f$ and the fact $\|{y-x}\|_{2}\leq\epsilon^{2}$ , we obtain:

where the remainder term $r_{3}(x,y)$ satisfies $|r_{3}(x,y)|\leq C_{3}\epsilon^{2}$ for some constant $C_{3}$ .

Combining equation (74) and equation (75), we can quantify the measure of the set $\pi(B)$ using that of the set $B$ . In particular, we have

Plugging equation (76) into the lower bound (73) and using the relation ${\textrm{tr}}(H(x))={\textrm{div}}\,\phi(x)$ yields

Finally, plugging in the definition of the restricted Cheeger constant and taking the limit $\epsilon\to 0$ completes the proof.

Appendix E Proof of Proposition 3

Let $\Phi$ denote the CDF of the standard normal distribution. The function $\Phi$ satisfies the following tail bounds:

We define an auxiliary variable $\sigma$ based on the value of $\epsilon$ :

Since $e^{-1/(2\sigma^{2})}=\frac{\sqrt{\epsilon}}{4L}$ , the tail bound (77) implies $\Phi(t)\leq\frac{\sqrt{\epsilon}}{4L}$ for all $t\leq-\frac{1}{\sigma}$ .
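The role of $\sigma$ can be illustrated numerically. The sketch solves the identity $e^{-1/(2\sigma^{2})}=\frac{\sqrt{\epsilon}}{4L}$ for $\sigma$ , which gives $\sigma=1/\sqrt{2\log(4L/\sqrt{\epsilon})}$ , and then checks $\Phi(t)\leq\frac{\sqrt{\epsilon}}{4L}$ at and below $t=-1/\sigma$ . It assumes $4L>\sqrt{\epsilon}$ so that the logarithm is positive, and the values of $\epsilon$ and $L$ are hypothetical.

```python
import math


def sigma_from(eps, L):
    """Solve exp(-1 / (2 sigma^2)) = sqrt(eps) / (4 L) for sigma (needs 4 L > sqrt(eps))."""
    return 1.0 / math.sqrt(2.0 * math.log(4.0 * L / math.sqrt(eps)))


def std_normal_cdf(t):
    """CDF of the standard normal distribution, written with the complementary error function."""
    return 0.5 * math.erfc(-t / math.sqrt(2.0))


# Hypothetical values, chosen only for illustration.
eps, L = 0.01, 2.0

sigma = sigma_from(eps, L)
target = math.sqrt(eps) / (4.0 * L)
tail_at_boundary = std_normal_cdf(-1.0 / sigma)
tail_further_out = std_normal_cdf(-2.0 / sigma)
print(sigma, target, tail_at_boundary, tail_further_out)
```

Since $\Phi$ is increasing, verifying the bound at $t=-1/\sigma$ is enough to cover every $t\leq-1/\sigma$ .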

Let $g(x):=\|{\nabla f(x)}\|_{2}$ be a shorthand notation. We define a vector field:

Note that the function $\Phi$ admits a polynomial expansion:

We remark that the matrix definition (80) implies $\Phi(A+dA)=\Phi(A)+\Phi^{\prime}(A)dA$ where $\Phi^{\prime}$ is the derivative of function $\Phi$ .

The matrix $A(x)$ satisfies $0\preceq A(x)\preceq(2{G}+1)I$ , so that $\|{\phi(x)}\|_{2}\leq 1$ holds. For points that are $r_{0}$ -close to the boundary, we have $\langle x,\,\nabla f(x)\rangle\geq\|{x}\|_{2}$ . By these lower bounds and definition (79), we obtain:

where the last inequality holds because $g(x)\geq\langle x,\,\nabla f(x)\rangle/\|{x}\|_{2}\geq 1$ . For any $\epsilon<\frac{r-r_{0}}{(2{G}+1)\sqrt{{G}}}$ , the right-hand side is smaller than $\|{x}\|_{2}^{2}$ , so that $x-\epsilon\,\phi(x)\in K$ . For points that are not $r_{0}$ -close to the boundary, we have $x-\epsilon\,\phi(x)\in K$ given $\epsilon<r_{0}$ . Combining results for the two cases, we conclude that $\phi$ satisfies the conditions of Lemma 1.

By applying Lemma 1, we obtain the following lower bound:

Since $A(x)\succeq 2\sqrt{{G}\,g(x)}I$ , the term $(\nabla f(x))^{\top}A(x)\nabla f(x)$ is lower bounded by $2\sqrt{{G}}(g(x))^{5/2}$ . For the second term, we claim the following bound:

We defer the proof to Appendix E.1 and focus on its consequence. Combining inequalities (82) and (83), we obtain

The right-hand side of inequality (84) can be made strictly positive if we choose a large enough $\xi$ . In particular, we choose:

To proceed, we do a case study based on the value of $g(x)$ . For all $x$ satisfying $g(x)<h\epsilon$ , we plug in the upper bound $g(x)<h\epsilon$ for $g(x)$ , then plug in the definition of $h$ . It implies:

Using lower bound (85) for $\xi$ and the lower bound $g(x)\geq h\epsilon$ for $g(x)$ , it is easy to verify that the last three terms on the right-hand side are non-negative. Furthermore, plugging in the lower bound $\xi\geq\frac{1}{\epsilon^{2}{G}^{1/2}}\cdot\frac{1}{2h^{5/2}}$ from (85), it implies:

Combining inequalities (84), (86), (87) proves that the restricted Cheeger constant is lower bounded by $\frac{\sqrt{\epsilon}}{8(2{G}+1){G}}$ . Since $1/\sigma=\widetilde{\mathcal{O}}(1)$ , it is easy to verify that the constraint (85) can be satisfied if we choose:

E.1 Proof of inequality (83)

Note that ${\textrm{div}}\,(A(x)\nabla f(x))$ is equal to the trace of matrix $D$ . In order to proceed, we perform a case study on the value of $g(x)$ .

We first upper bound the trace of $A(x)\nabla^{2}f(x)$ , which can be written as:

The trace of the first term on the right-hand side is bounded by $2\sqrt{{G}\,g(x)}L$ . For the second term, we assume that the matrix $\nabla^{2}f(x)$ has eigenvalues $\lambda_{1}\leq\lambda_{2}\leq\dots\leq\lambda_{d}$ with associated eigenvectors $u_{1},\dots,u_{d}$ . As a consequence, the matrix $\Phi(\frac{-\sqrt{\epsilon}I-\nabla^{2}f(x)}{\sigma\sqrt{\epsilon}})$ has the same set of eigenvectors, but with eigenvalues $\Phi(\frac{-\lambda_{1}/\sqrt{\epsilon}-1}{\sigma}),\dots,\Phi(\frac{-\lambda_{d}/\sqrt{\epsilon}-1}{\sigma})$ . Thus, the trace of this term is equal to

By the assumptions $x\in K\backslash U$ and $g(x)<\epsilon$ , and using the definition of $\epsilon$ -approximate local minima, we obtain $\lambda_{1}\leq-\sqrt{\epsilon}$ . As a consequence

For other eigenvalues, if $\lambda_{i}$ is negative, then we use the upper bound $\lambda_{i}\Phi(\frac{-\lambda_{i}/\sqrt{\epsilon}-1}{\sigma})<0$ ; if $\lambda_{i}$ is positive, then we have $\lambda_{i}\Phi(\frac{-\lambda_{i}/\sqrt{\epsilon}-1}{\sigma})\leq\lambda_{i}\Phi(-\frac{1}{\sigma})\leq\lambda_{i}\frac{\sqrt{\epsilon}}{4L}$ . Combining these relations, we have

Combining this inequality with the upper bound on the first term of (89), we obtain

Thus, we have upper bounded the trace of the first term on the right-hand side of (88).

For the second term on the right-hand side of (88), we have

where the last inequality uses the relation $\nabla g(x)=\frac{(\nabla^{2}f(x))\nabla f(x)}{g(x)}$ , so that $\|{\nabla g(x)}\|_{2}\leq\|{\nabla^{2}f(x)}\|_{2}\leq\|{\nabla^{2}f(x)}\|_{*}\leq L$ .

For the third term on the right-hand side of (88), since $0\preceq\Phi^{\prime}(\frac{-\sqrt{\epsilon}I-\nabla^{2}f(x)}{\sigma\sqrt{\epsilon}})\preceq I$ , we have

Combining upper bounds (90), (91), (92) implies

The proof is similar to the previous case. For the first term on the right-hand side of equation (88), we follow the same arguments for establishing the upper bound (90), but without using the relation $\lambda_{1}\leq-\sqrt{\epsilon}$ (because conditioning on $g(x)\geq\epsilon$ , the definition of approximate local minima won’t give $\lambda_{1}\leq-\sqrt{\epsilon}$ ). Then the trace of $A(x)\nabla^{2}f(x)$ is bounded by:

For the second and the third term, the upper bounds (91) and (92) still hold, so that

Combining the two cases completes the proof.

Appendix F Proof of Theorem 2

and its stochastic gradient is computed by:

We defer the proof of claim (96) to the end of this section, focusing on its consequence. Let $\sigma$ take the value in claim (96). Claim (96) and the ${G}$ -Lipschitz continuity of the function $F$ imply:

By choosing $\rho:=\nu/{G}$ , we establish the risk bound $F(\widehat{x})\leq\sup_{x\in U}F(x)+5\nu$ . It remains to establish the iteration complexity bound.

Taking expectation over $z$ on both sides and using Jensen’s inequality, we obtain

Thus, by choosing $\sigma:=\frac{\nu}{\max\{{G},B/\rho_{\mbox{\tiny K}}\}}$ , it ensures that for any $x\in K$ :

F.1 Proof of Lemma 7

Let $z^{\prime}:=x+z$ . By change of variables, the above equation implies:

Notice that $\langle u/\|{u}\|_{2},\,z/\sigma\rangle$ satisfies the standard normal distribution. Thus the right-hand side of the above inequality is bounded by

Combining the two inequalities above completes the proof.

Using the fact $f(x+z)\in[0,B]$ , equation (99) implies:

Appendix G Proof of Theorem 3

There exists $h_{\textrm{max}}=\Omega(d^{-2})$ such that for any $x\in K$ , $h\leq h_{\textrm{max}}$ and $y\sim N(x,2hI)$ , we have $P(y\in K)\geq 1/3$ .

The function $F$ is 3-Lipschitz continuous in $\overline{K}$ .

For any $\nu,\delta>0$ , if the sample size $n$ satisfies $n\gtrsim\frac{d}{\nu^{2}}$ , then with probability at least $1-\delta$ we have $\sup_{x\in\overline{K}}|f(x)-F(x)|\leq\nu$ . The notation “ $\gtrsim$ ” hides a poly-logarithmic function of $(d,1/\nu,1/\delta)$ .

Let $\alpha_{0}\in(0,\pi/4]$ be an arbitrary angle. We define $U\subset K$ to be the set of points such that the angle between the point and $x^{*}$ is bounded by $\alpha_{0}$ , or equivalently:

For any $x\in K$ , the 3-Lipschitz continuity of function $F$ implies:

By simple geometry, it is easy to see that

Inequality (100) implies that for small enough $\alpha_{0}$ , any point in $U$ is a nearly optimal solution. Thus we can use $U$ as a target optimality set. The following lemma lower bounds the restricted Cheeger constant for the set $U$ .

Assume that $d\geq 2$ . For any $\alpha_{0}\in(0,\pi/4]$ , there are universal constants $c_{1},c_{2}>0$ such that if we choose $\xi\geq\frac{c_{1}d^{3/2}}{q_{0}\sin^{2}(\alpha_{0})}$ , then the restricted Cheeger constant is lower bounded by $\mathcal{C}_{(\xi F)}(K\backslash U)\geq c_{2}d$ .

Given a target optimality $\epsilon>0$ , we choose $\alpha_{0}:=\arcsin(\epsilon/12)$ . The risk bound (100) implies

Lemma 8 ensures that the pre-conditions of Theorem 2 hold with a small enough quantity $\nu$ . Combining Theorem 2 with inequality (101), with probability at least $1-\delta$ , SGLD achieves the risk bound:

In order to have a small enough $\nu$ , we want the functions $f$ and $F$ to be uniformly close. More precisely, we want the gap between them to satisfy:

By Lemma 8, this can be achieved by assuming a large enough sample size $n$ . In particular, if the sample size satisfies $n\gtrsim\frac{d^{4}}{q_{0}^{2}\epsilon^{4}}$ , then inequality (103) is guaranteed to be true. The notation “ $\gtrsim$ ” hides a poly-logarithmic function.

If inequality (103) holds, then $\nu\leq\epsilon/10$ holds, so that we can rewrite the risk bound (102) as $F(\widehat{x})\leq F(x^{*})+\epsilon$ . By combining the choice of $\nu$ in (103) with the choice of $\xi:=\frac{c_{1}d^{3/2}}{q_{0}\sin^{2}(\alpha_{0})}$ in Lemma 9, we find that the relation $\xi\in(0,1/\nu]$ holds, satisfying Theorem 2’s condition on $(\nu,\xi)$ . As a result, Theorem 2 implies that the iteration complexity of SGLD is bounded by the restricted Cheeger constant $\mathcal{C}_{(\xi F)}(K\backslash U)$ . By Lemma 9, the restricted Cheeger constant is lower bounded by $\Omega(d)$ , so that the iteration complexity is polynomial in $(d,1/q_{0},1/\epsilon,\log(1/\delta))$ .

(1) Let $x\in K$ be an arbitrary point and let $z\sim N(0,2hI)$ . An equivalent way to express the relation $x+z\in K$ is the following sandwich inequality:

For any $t>0$ , we consider a sufficient condition for inequality (104):

The random variable $\frac{\|{z}\|_{2}^{2}}{2h}$ satisfies a chi-square distribution with $d$ degrees of freedom. By Lemma 6, for any $t\geq 5$ , the condition $\|{z}\|_{2}^{2}\leq 2thd$ holds with probability at least $1-e^{-\Omega(td)}$ .
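The rate hidden in $e^{-\Omega(td)}$ can be made explicit under one assumption: that the chi-square tail bound is used in the standard Laurent and Massart form $P(X\geq d+2\sqrt{dx}+2x)\leq e^{-x}$ . Substituting $x=cd$ , the event $X>td$ has probability at most $e^{-cd}$ whenever $1+2\sqrt{c}+2c\leq t$ . The sketch below computes the largest such $c$ ; the function names are illustrative.

```python
import math


def tail_rate(t):
    """Largest c with 1 + 2 sqrt(c) + 2 c <= t, so that P(X > t d) <= exp(-c d)."""
    if t <= 1.0:
        raise ValueError("the bound is only informative for t > 1")
    # Solve 2 s^2 + 2 s + (1 - t) = 0 for s = sqrt(c), keeping the positive root.
    root = (math.sqrt(2.0 * t - 1.0) - 1.0) / 2.0
    return root ** 2


def tail_probability_bound(t, d):
    """Upper bound exp(-c d) on P(X > t d) for a chi-square variable with d degrees of freedom."""
    return math.exp(-tail_rate(t) * d)


rate_at_5 = tail_rate(5.0)
rate_at_9 = tail_rate(9.0)
print(rate_at_5, rate_at_9, tail_probability_bound(5.0, d=10))
```

Under this form, $t=5$ gives the rate $c=1$ , and the rate at $t=9$ is at least $9/5$ , which is consistent with the bound $P(X\leq 9d)\geq 1-e^{-(9/5)d}$ used in the proof of Claim 3.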

Suppose that $t$ is a fixed constant, and $h$ is chosen to be $h:=\frac{c^{2}}{2td^{2}}$ for a constant $c>0$ . Then the random variable $w_{x}:=2\langle x,\,z\rangle$ satisfies a normal distribution $N(0;\frac{4\|{x}\|_{2}^{2}(c/d)^{2}}{t})$ . The interval $I_{x}$ , no matter how $x\in K$ is chosen, covers either $[-1/4,-(c/d)^{2}]$ or $[(c/d)^{2},1/4]$ . For $c\to 0$ , we have $(c/d)^{2}\ll\frac{2\|{x}\|_{2}}{t}(c/d)\ll 1/4$ , so that the probability of $w_{x}\in I_{x}$ is asymptotically lower bounded by $0.5$ . It implies that there is a strictly positive constant $c$ (depending on the value of $t$ ) such that $P(w_{x}\in I_{x})\geq 0.4$ for all $x\in K$ . With this choice of $c$ , we apply the union bound:

By choosing $t$ to be a large enough constant, the above probability is lower bounded by $1/3$ .

If we change the distribution of $a$ from uniform distribution to a normal distribution $N(0,I_{d\times d})$ , the right-hand side of inequality (105) won’t change. Both $\langle x,\,a\rangle$ and $\langle x,\,b\rangle$ become normal random variables with correlation coefficient $\frac{\langle x,\,y\rangle}{\|{x}\|_{2}\|{y}\|_{2}}$ . Under this setting, Tong proved that the right-hand side is equal to

By simple algebra and using the fact that $\|{x}\|_{2},\|{y}\|_{2}\geq 1/4$ , we have

Combining the above relations, and using the fact that $\arccos(t)\leq 3\sqrt{1-t}$ for any $t\in[-1,1]$ , we obtain

which shows that the function $F$ is 3-Lipschitz continuous in $\overline{K}$ .
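Two ingredients of the argument in part (2) are easy to check numerically. The sketch below first verifies the elementary bound $\arccos(t)\leq 3\sqrt{1-t}$ on a grid over $[-1,1]$ , and then compares a Monte Carlo estimate of the probability that $\langle x,\,a\rangle$ and $\langle y,\,a\rangle$ have different signs, for $a\sim N(0,I_{d\times d})$ , with the standard value $\arccos(\rho)/\pi$ , where $\rho$ is the correlation coefficient. Treating this sign-disagreement probability as the quantity of interest is an assumption made here for illustration; the function names and the test vectors are likewise hypothetical.

```python
import math
import random

def arccos_bound_holds(n_grid=10001):
    """Check arccos(t) <= 3*sqrt(1 - t) on a uniform grid over [-1, 1]."""
    for i in range(n_grid):
        t = -1.0 + 2.0 * i / (n_grid - 1)
        t = min(1.0, max(-1.0, t))     # guard against rounding outside [-1, 1]
        if math.acos(t) > 3.0 * math.sqrt(1.0 - t) + 1e-12:
            return False
    return True

def sign_disagreement(x, y, n_samples=20000, seed=0):
    """Monte Carlo estimate of P(sign<x,a> != sign<y,a>) for a ~ N(0, I)."""
    rng = random.Random(seed)
    count = 0
    for _ in range(n_samples):
        a = [rng.gauss(0.0, 1.0) for _ in x]
        xa = sum(xi * ai for xi, ai in zip(x, a))
        ya = sum(yi * ai for yi, ai in zip(y, a))
        count += (xa > 0) != (ya > 0)
    return count / n_samples

def correlation(x, y):
    """Correlation coefficient <x, y> / (||x||_2 ||y||_2)."""
    dot = sum(xi * yi for xi, yi in zip(x, y))
    nx = math.sqrt(sum(xi * xi for xi in x))
    ny = math.sqrt(sum(yi * yi for yi in y))
    return dot / (nx * ny)

x, y = [1.0, 0.0, 0.0], [0.6, 0.8, 0.0]   # illustrative vectors (assumptions)
print(arccos_bound_holds())
print(sign_disagreement(x, y), math.acos(correlation(x, y)) / math.pi)
```

The two printed probabilities should agree up to Monte Carlo error, which is the relation that turns the bound on $\arccos$ into a Lipschitz bound in terms of $\|{x-y}\|_{2}$ .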

(3) Since the function $f$ is the empirical risk of a linear classifier, its uniform convergence rate can be characterized by the VC-dimension. The VC-dimension of linear classifiers in a $d$ -dimensional space is equal to $d+1$ . Thus, the concentration inequality of Vapnik implies that with probability at least $1-\delta$ , we have

where $c>0$ is a universal constant. When $n\geq d$ , the upper bound $U(n)$ is a monotonically decreasing function of $n$ . In order to guarantee $U(n)\leq\nu$ , it suffices to choose $n\geq n_{0}$ where the number $n_{0}$ satisfies:

It is easy to see that $n_{0}$ is polynomial in $(d,1/\nu,1/\delta)$ . Thus, by the definition of $U(n)$ , we have

This implies $n_{0}\lesssim\frac{d}{\nu^{2}}$ , which completes the proof.

G.2 Proof of Lemma 9

The first step is to define a vector field that satisfies the conditions of Lemma 1. For an arbitrary $\delta\in[0,1]$ , we define a vector field $\phi_{\delta}$ such that:

For any $\delta\in(0,1]$ , we can find a constant $\epsilon_{0}>0$ such that $\|{\phi_{\delta}(x)}\|_{2}\leq 1$ and $x-\epsilon\phi_{\delta}(x)\in K$ hold for arbitrary $x\in K$ and $\epsilon\in(0,\epsilon_{0}]$ .

The claim shows that $\phi_{\delta}$ satisfies the conditions of Lemma 1 for any $\delta\in(0,1]$ , so that given a scalar $\xi>0$ , the lemma implies

The right-hand side is uniformly continuous in $\delta$ , so that if we take the limit $\delta\to 0$ , we obtain the lower bound:

We lower bound the right-hand side using the Massart noise model. We start by simplifying the fraction term inside the expectation. Without loss of generality, assume that $\epsilon\in(0,0.2]$ ; then the definition of $\phi_{0}$ implies:

We further simplify the lower bound (109) by the following claim, which is proved using properties of the Massart noise.

When the event $\mathcal{E}_{t}$ holds, we have $\|{z}\|_{2}\leq t\leq 1/6\leq\|{x}\|_{2}/3$ , so that $\|{\frac{3(x+z)}{\|{x}\|_{2}^{2}}}\|_{2}\in$ . Combining with inequality (109) and Claim 6, we have

where $\alpha_{x+z}$ represents the angle between $x+z$ and $x^{*}$ .

It means that ${\textrm{div}}\,\phi_{0}(x)={\textrm{tr}}(H(x))=\frac{d-1}{3}\langle x,\,x^{*}\rangle=\frac{(d-1)\|{x}\|_{2}}{3}\cos(\alpha_{x})$ . Combining this equation with inequalities (108), (110), and taking the limits $t\to 0$ , $\sigma\to 0$ , we obtain:

where $\alpha_{x}$ represents the angle between $x$ and $x^{*}$ .

According to inequality (111), for any $\alpha_{x}\in(\pi-\alpha_{0},\pi]$ , we have

where the last inequality follows since $\|{x}\|_{2}\geq 1/2$ and $\alpha_{0}\in(0,\pi/4]$ . Otherwise, if $\alpha_{x}\in[\alpha_{0},\pi-\alpha_{0}]$ , then we have

Once we choose $\xi\geq\frac{160\pi}{3}\frac{d^{3/2}}{q_{0}\sin^{2}(\alpha_{0})}$ , the above expression will be lower bounded by $d/3$ , which completes the proof.
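To make the size of this temperature condition concrete, the following sketch evaluates the threshold $\frac{160\pi}{3}\frac{d^{3/2}}{q_{0}\sin^{2}(\alpha_{0})}$ for given problem parameters. The function name `xi_threshold` and the example inputs are assumptions for illustration; the range check on `alpha0` mirrors the condition $\alpha_{0}\in(0,\pi/4]$ stated above, and the check $q_{0}>0$ is an assumption needed for the formula to be well defined.

```python
import math

def xi_threshold(d, q0, alpha0):
    """Evaluate (160*pi/3) * d^{3/2} / (q0 * sin^2(alpha0))."""
    if not (0.0 < alpha0 <= math.pi / 4):
        raise ValueError("alpha0 is assumed to lie in (0, pi/4]")
    if q0 <= 0.0:
        raise ValueError("q0 is assumed to be strictly positive")
    return (160.0 * math.pi / 3.0) * d ** 1.5 / (q0 * math.sin(alpha0) ** 2)

# Illustrative inputs (assumptions): d = 10, q0 = 0.5, alpha0 = pi/4.
print(xi_threshold(d=10, q0=0.5, alpha0=math.pi / 4))
```

The threshold grows as $d^{3/2}$ and as $1/(q_{0}\sin^{2}(\alpha_{0}))$ , so it stays polynomial in the problem parameters, which is what the iteration complexity bound requires.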

Since $\|{x}\|_{2}\leq 1$ for any $x\in K$ , it is easy to verify that $\|{\phi_{\delta}(x)}\|_{2}\leq 1$ . In order to verify $x-\epsilon\phi_{\delta}(x)\in K$ , we notice that

The right-hand side will be maximized if $\|{x}\|_{2}^{2}=1/4$ . Thus, if we assume $\delta>0$ , then for any $\epsilon<\delta/16$ , it is easy to verify that the right-hand side is bounded by $3/8$ . As a consequence, we have $\|{x-\epsilon\phi_{\delta}(x)}\|_{2}^{2}\in[1/4,1]$ , which verifies that $x-\epsilon\phi_{\delta}(x)\in K$ .

When $x=0$ or $x-\epsilon x^{*}=0$ , it is easy to verify that $F(x-\epsilon x^{*})-F(x)\geq 0$ by the definition of the loss function. Otherwise, we assume that $x\neq 0$ and $x-\epsilon x^{*}\neq 0$ . In these cases, the events $\langle x,\,a\rangle\neq 0$ and $\langle x-\epsilon x^{*},\,a\rangle\neq 0$ hold almost surely, so that we can assume that the loss function always takes zero-one values.

Let $\mathcal{E}$ be the event that condition (112) holds. Under this event, when the parameter changes from $x$ to $x-\epsilon x^{*}$ , the loss changes from $\frac{1-q(a)}{2}$ to $\frac{1+q(a)}{2}$ . This means that the loss is non-decreasing with respect to the change $x\to x-\epsilon x^{*}$ , and as a consequence, we have $F(x-\epsilon x^{*})-F(x)\geq 0$ .

In order to establish the lower bound in Claim 6, we first lower bound the probability of event $\mathcal{E}$ . In the proof of Lemma 8, we have shown that this probability is equal to:

Let $\beta$ be the angle between $x$ and $x-\epsilon x^{*}$ , then the right-hand side is equal to $\beta/\pi$ . Using the geometric property that $\frac{\|{x}\|_{2}}{|\sin(\alpha)|}=\frac{\epsilon\|{x^{*}}\|_{2}}{|\sin(\beta)|}$ , we have

For the first term on the right-hand side, we have $|\langle x^{*}_{1},\,a_{1}\rangle|\leq\|{a_{1}}\|_{2}=\frac{|\langle x,\,a\rangle|}{\|{x}\|_{2}}\leq\frac{\epsilon}{\|{x}\|_{2}}\leq\epsilon$ by condition (112) and the assumption that $\|{x}\|_{2}\geq 1$ . For the second term, if we condition on $a_{1}$ , then the vector $a_{2}$ is uniformly sampled from a $(d-1)$ -dimensional sphere of radius $\sqrt{1-\|{a_{1}}\|_{2}^{2}}$ centered at the origin. The vector $x^{*}_{2}$ , constructed to be orthogonal to $x$ , also belongs to the same $(d-1)$ -dimensional subspace. Under this setting, Awasthi et al. [4, Lemma 4] proved that

Using the bound $\|{a_{1}}\|_{2}=\frac{|\langle x,\,a\rangle|}{\|{x}\|_{2}}\leq\frac{\epsilon}{\|{x}\|_{2}}\leq\epsilon$ , we marginalize $a_{1}$ to obtain

Recall that $\langle x^{*},\,a\rangle=\langle x^{*}_{1},\,a_{1}\rangle+\langle x^{*}_{2},\,a_{2}\rangle$ and $|\langle x^{*}_{1},\,a_{1}\rangle|\leq\epsilon$ . These two relations imply $|\langle x^{*},\,a\rangle|\geq|\langle x^{*}_{2},\,a_{2}\rangle|-|\langle x^{*}_{1},\,a_{1}\rangle|\geq|\langle x^{*}_{2},\,a_{2}\rangle|-\epsilon$ . Combining this with the relation $\|{x^{*}_{2}}\|_{2}=|\sin(\alpha)|$ , we obtain

Combining inequalities (113), (114) with the relation that $q(a)\geq q_{0}|\langle x^{*},\,a\rangle|$ , we obtain