A Hitting Time Analysis of Stochastic Gradient Langevin Dynamics
Yuchen Zhang, Percy Liang, Moses Charikar
Introduction
A central challenge of non-convex optimization is avoiding sub-optimal local minima. Although escaping all local minima is NP-hard in general [e.g. 7], one might expect that it should be possible to escape “appropriately shallow” local minima, whose basins of attraction have relatively low barriers. As an illustrative example, consider minimizing an empirical risk function in Figure 1. As the figure shows, although the empirical risk is uniformly close to the population risk, it contains many poor local minima that don’t exist in the population risk. Gradient descent is unable to escape such local minima.
A natural workaround is to inject random noise to the gradient. Empirically, adding gradient noise has been found to improve learning for deep neural networks and other non-convex models . However, theoretical understanding of the value of gradient noise is still incomplete. For example, Ge et al. show that by adding isotropic noise and by choosing a sufficiently small stepsize , the iterative update:
is able to escape strict saddle points. Unfortunately, this approach, as well as the subsequent line of work on escaping saddle points , doesn’t guarantee escaping even shallow local minima.
Another line of work in Bayesian statistics studies the Langevin Monte Carlo (LMC) method , which employs an alternative noise term. Given a function , LMC performs the iterative update:
where is a “temperature” hyperparameter. Unlike the bounded noise added in formula (1), LMC adds a large noise term that scales with . With a small enough , the noise dominates the gradient, enabling the algorithm to escape any local minimum. For empirical risk minimization, one might substitute the exact gradient with a stochastic gradient, which gives the Stochastic Gradient Langevin Dynamics (SGLD) algorithm . It can be shown that both LMC and SGLD asymptotically converge to a stationary distribution . As , the probability mass of concentrates on the global minimum of the function , and the algorithm asymptotically converges to a neighborhood of the global minimum.
Despite asymptotic consistency, there is no theoretical guarantee that LMC is able to find the global minimum of a general non-convex function, or even a local minimum of it, in polynomial time. Recent works focus on bounding the mixing time (i.e. the time for converging to ) of LMC and SGLD. Bubeck et al. , Dalalyan and Bonis prove that on convex functions, LMC converges to the stationary distribution in polynomial time. On non-convex functions, however, an exponentially long mixing time is unavoidable in general. According to Bovier et al. , it takes the Langevin diffusion at least time to escape a depth- basin of attraction. Thus, if the function contains multiple “deep” basins with , then the mixing time is lower bounded by .
In parallel work to this paper, Raginsky et al. upper bound the time of SGLD converging to an approximate global minimum of non-convex functions. They show that the upper bound is polynomial in the inverse of a quantity they call the uniform spectral gap. Similar to the mixing time bound, in the presence of multiple local minima, the convergence time to an approximate global minimum can be exponential in dimension and the temperature parameter .
This stability property is useful in studying empirical risk minimization (ERM) in situations where the empirical risk is pointwise close to the population risk , but has poor local minima that don’t exist in the population risk. This phenomenon has been observed in statistical estimation with non-convex penalty functions , as well as in minimizing the zero-one loss (see Figure 1). Under this setting, our result implies that SGLD achieves an approximate local minimum of the (smooth) population risk in polynomial time, ruling out local minima that only exist in the empirical risk. It improves over recent results on non-convex optimization , which compute approximate local minima only for the empirical risk.
As a concrete application, we prove a stronger learnability result for the problem of learning linear classifiers under the zero-one loss , which involves non-convex and non-smooth empirical risk minimization. Our result improves over the recent result of Awasthi et al. : the method of Awasthi et al. handles noisy data corrupted by a very small Massart noise (at most ), while our algorithm handles Massart noise up to any constant less than . As a Massart noise of represents completely random observations, we see that SGLD is capable of learning from very noisy data.
Algorithm and main results
In this section, we define the algorithm and the basic concepts, then present the main theoretical results of this paper.
Because of the noisy update, the sequence asymptotically converges to a stationary distribution rather than a stationary point . Although this fact introduces challenges to the analysis, we show that its non-asymptotic efficiency can be characterized by a positive quantity called the restricted Cheeger constant.
2 Restricted Cheeger constant
For any measurable function , we define a probability measure whose density function is:
For any function and any subset , we define the restricted Cheeger constant as:
The restricted Cheeger constant generalizes the notion of the Cheeger isoperimetric constant , quantifying how well a subset of can be made as least connected as possible to the rest of the parameter space. The connectivity is measured by the ratio of the surface measure to the set measure . Intuitively, this quantifies the chance of escaping the set under the probability measure .
A property that will be important in the sequal is that the restricted Cheeger constant is stable under perturbations: if we perturb by a small amount, then the values of won’t change much, so that the variation on will also be small. More precisely, for functions and satisfying , we have
and similarly . As a result, if two functions and are uniformly close, then we have for a constant . This property enables us to lower bound by lower bounding the restricted Cheeger constant of an alternative function , which might be easier to analyze.
3 Generic non-asymptotic bounds
We make several assumptions on the parameter space and on the objective function.
The parameter space satisfies: there exists , such that for any and any , the random variable satisfies .
The function is bounded, differentiable and -smooth in , meaning that for any , we have .
The first assumption states that the parameter space doesn’t contain sharp corners, so that the update (3d) won’t be stuck at the same point for too many iterations. It can be satisfied, for example, by defining the parameter space to be an Euclidean ball and choosing . The probability is arbitrary and can be replaced by any constant in . The second assumption requires the function to be smooth. We show how to handle non-smooth functions in Section 3 by appealing to the stability property of the restricted Cheeger constant discussed earlier. The third assumption requires the stochastic gradient to have sub-exponential tails, which is a standard assumption in stochstic optimization.
The iteration number is bounded by
where the numerator is polynomial in . See Appendix B.2 for the explicit polynomial dependence.
Theorem 1 is a generic result that applies to all optimization problems satisfying Assumption A. The right-hand side of the bound (7) is determined by the choice of . If we choose to be the set of (approximate) local minima, and let be sufficiently small, then will roughly be bounded by the worst local minimum. The theorem permits to be arbitrary provided the stepsize is small enough. Choosing a larger means adding less noise to the SLGD update, which means that the algorithm will be more efficient at finding a stationary point, but less efficient at escaping local minima. Such a trade-off is captured by the restricted Cheeger constant in inequality (8) and will be rigorously studied in the next subsection.
The iteration complexity bound is governed by the restricted Cheeger constant. For any function and any target set with a positive Borel measure, the restricted Cheeger constant is strictly positive (see Appendix A), so that with a small enough , the algorithm always converges to the global minimum asymptotically. We remark that the SGD doesn’t enjoy the same asymptotic optimality guarantee, because it uses a gradient noise in contrast to SGLD’s one. Since the convergence theory requires a small enough , we often have . the SGD noise is too conservative to allow the algorithm to escape local minima.
The proof of Theorem 1 is fairly technical. We defer the full proof to Appendix B, only sketching the basic proof ideas here. At a high level, we establish the theorem by bounding the hitting time of the Markov chain to the set . Indeed, if some hits the set, then:
In order to bound the hitting time, we construct a time-reversible Markov chain, and prove that its hitting time to is on a par with the original hitting time. To analyze this second Markov chain, we define a notion called the restricted conductance, which measures how easily the Markov chain can transition between states within . This quantity is related to the notion of conductance in the analysis of time-reversible Markov processes , but the ratio between these two quantities can be exponentially large for non-convex . We prove that the hitting time of the second Markov chain depends inversely on the restricted conductance, so that the problem reduces to lower bounding the restricted conductance.
Finally, we lower bound the restricted conductance by the restricted Cheeger constant. The former quantity characterizes the Markov chain, while the later captures the geometric properties of the function . Thus, we must analyze the SGLD algorithm in depth to establish a connection between them. Once we prove this lower bound, putting all pieces together completes the proof.
4 Lower bounding the restricted Cheeger constant
In this subsection, we prove lower bounds on the restricted Cheeger constant in order to flesh out the iteration complexity bound of Theorem 1. We start with a lower bound for the class of convex functions:
Let be a -dimensional unit ball. For any convex -Lipschitz continuous function and any , let the set of -optimal solutions be defined by:
Then for any , we have .
The proposition shows that if we choose a big enough , then will be lower bounded by a universal constant. The lower bound is proved based on an isoperimetric inequality for log-concave distributions. See Appendix C for the proof.
For non-convex functions, directly proving the lower bound is difficult, because the definition of involves verifying the properties of all subsets . We start with a generic lemma that reduces the problem to checking properties of all points in .
Lemma 1 reduces the problem of lower bounding to the problem of finding a proper vector field and verifying its properties for all points . Informally, the quantity measures the chance of escaping the set . The lemma shows that if we can construct an “oracle” vector field , such that at every point it gives the correct direction (i.e. ) to escape , but always stay in , then we obtain a strong lower bound on . This construction is merely for the theoretical analysis and doesn’t affect the execution of the algorithm.
The proof idea is illustrated in Figure 2: by constructing a mapping that satisfies the conditions of the lemma, we obtain for all , and consequently . Then we are able to lower bound the restricted Cheeger constant by:
where is an infinitesimal of the set . It can be shown that the right-hand side of inequality (10) is equal to , which establishes the lemma. See Appendix D for a rigorous proof.
Before demonstrating the applications of Lemma 1, we make several additional mild assumptions on the parameter space and on the function .
The parameter space is a -dimensional ball of radius centered at the origin. There exists such that for every point satisfying , we have .
For some , the function is third-order differentiable with , and for any .
The first assumption requires the parameter space to be an Euclidean ball and imposes a gradient condition on its boundary. This is made mainly for the convenience of theoretical analysis. We remark that for any function , the condition on the boundary can be satisfied by adding a smooth barrier function to it, where the function for any , but sharply increases on the interval to produce large enough gradients. The second assumption requires the function to be third-order differentiable. We shall relax the second assumption in Section 3.
The following proposition describes a lower bound on when is a smooth function and the set consists of approximate stationary points. Although we shall prove a stronger result, the proof of this proposition is a good example for demonstrating the power of Lemma 1.
Assume that Assumption B holds. For any , define the set of -approximate stationary points . For any , we have .
Recall that is the Lipschitz constant of function . Let the vector field be defined by , then we have . By Assumption B, it is easy to verify that the conditions of Lemma 1 hold. For any , the fact that implies:
Recall that is the smoothness parameter. By Assumption B, the divergence of is upper bounded by . Consequently, if we choose as assumed, then we have:
Lemma 1 then establishes the claimed lower bound. ∎
Next, we consider approximate local minima , which rules out local maxima and strict saddle points. For an arbitrary , the set of -approximate local minima is defined by:
We note that an approximate local minimum is not necessarily close to any local minimum of . However, if we assume in addition the the function satisfies the (robust) strict-saddle property , then any point is guaranteed to be close to a local minimum. Based on definition (11), we prove a lower bound for the set of approximate local minima.
Assume that Assumption B holds. For any , let be the set of -approximate local minima. For any satisfying
we have . The notation hides a poly-logarithmic function of .
Proving Proposition 3 is significantly more challenging than proving Proposition 2. From a high-level point of view, we still construct a vector field , then lower bound the expression for every point in order to apply Lemma 1. However, there exist saddle points in the set , such that the inner product can be very close to zero. For these points, we need to carefully design the vector field so that the term is strictly negative and bounded away from zero. To this end, we define to be the sum of two components. The first component aligns with the gradient . The second component aligns with a projected vector , which projects to the linear subspace spanned by the eigenvectors of with negative eigenvalues. It can be shown that the second component produces a strictly negative divergence in the neighborhood of strict saddle points. See Appendix E for the complete proof.
5 Polynomial-time bound for finding an approximate local minimum
Combining Proposition 3 with Theorem 1, we conclude that SGLD finds an approximate local minimum of the function in polynomial time, assuming that is smooth enough to satisfy Assumption B.
Assume that Assumptions A,B hold. For an arbitrary , let be the set of -approximate local minima. For any , there exists a large enough and hyperparameters such that with probability at least , SGLD returns a solution satisfying
The iteration number is bounded by a polynomial function of all hyperparameters in the assumptions as well as .
Similarly, we can combine Proposition 1 or Proposition 2 with Theorem 1, to obtain complexity bounds for finding the global minimum of a convex function, or finding an approximate stationary point of a smooth function.
Corollary 1 doesn’t specify any upper limit on the temperature parameter . As a result, SGLD can be stuck at the worst approximate local minima. It is important to note that the algorithm’s capability of escaping certain local minima relies on a more delicate choice of . Given objective function , we consider an arbitrary smooth function such that . By Theorem 1, for any target subset , the hitting time of SGLD can be controlled by lower bounding the restricted Cheeger constant . By the stability property (6), it is equivalent to lower bounding because and are uniformly close. If is chosen large enough (w.r.t. smoothness parameters of ), then the lower bound established by Proposition 3 guarantees a polynomial hitting time to the set of approximate local minima of . Thus, SGLD can efficiently escape all local minimum of that lie outside of . Since the function is arbitrary, it can be thought as a favorable perturbation of such that the set eliminates as many local minima of as possible. The power of such perturbations are determined by their maximum scale, namely the quantity . Therefore, it motivates choosing the smallest possible whenever it satisfies the lower bound in Proposition 3.
The above analysis doesn’t specify any concrete form of the function . In Section 3, we present a concrete analysis where the function is assumed to be the population risk of empirical risk minimization (ERM). We establish sufficient conditions under which SGLD efficiently finds an approximate local minima of the population risk.
Applications to empirical risk minimization
In this section, we apply SGLD to a specific family of functions, taking the form:
We shall prove that under certain conditions, SGLD finds an approximate local minimum of the (presumably smooth) population risk in polynomial time, even if it is executed on a non-smooth empirical risk. More concretely, we run SGLD on a smoothed approximation of the empirical risk that satisfies Assumption A. With large enough sample size, the empirical risk and its smoothed approximation will be close enough to the population risk , so that combining the stability property with Theorem 1 and Proposition 3 establishes the hitting time bound. First, let’s formalize the assumptions.
The parameter space satisfies: there exists , such that for any and any , the random variable satisfies .
There exist such that in the set , the population risk is -Lipschitz continuous, and .
Since the function can be non-differentiable, the stochastic gradient may not be well defined. We consider a smooth approximation of it following the idea of Duchi et al. :
The iteration number is polynomial in .
In order to lower bound the restricted Cheeger constant , we resort to the general lower bounds in Section 2.4. Consider population risks that satisfy the conditions of Assumption B. By combining Theorem 2 with Proposition 3, we conclude that SGLD finds an approximate local minimum of the population risk in polynomial time.
Assume that Assumption C holds. Also assume that Assumption B holds for the population risk with smoothness parameters . For any , let be the set of -approximate local minima of . If
Learning linear classifiers with zero-one loss
where is the Massart noise level. We assume that the noise level is strictly smaller than when the feature vector is separated apart from the decision boundary. Formally, there is a constant such that
Assume that . For any and , if the sample size satisfies:
then there exist hyperparameters such that SGLD on the smoothed function (13) returns a solution satisfying with probability at least . The notation hides a poly-logarithmic function of . The time complexity of the algorithm is polynomial in .
The proof consists of two parts. For the first part, we prove that the population risk is Lipschitz continuous and the empirical risk uniformly converges to the population risk, so that Assumption C hold. For the second part, we lower bound the restricted Cheeger constant by Lemma 1. The proof is spiritually similar to that of Proposition 2 or Proposition 3. We define to be the set of approximately optimal solutions, and construct a vector field such that:
By lower bounding the expression for all , Lemma 1 establishes a lower bound on the restricted Cheeger constant. The theorem is established by combining the two parts together and by Theorem 2. We defer the full proof to Appendix G.
Conclusion
In this paper, we analyzed the hitting time of the SGLD algorithm on non-convex functions. Our approach is different from existing analyses on Langevin dynamics , which connect LMC to a continuous-time Langevin diffusion process, then study the mixing time of the latter process. In contrast, we are able to establish polynomial-time guarantees for achieving certain optimality sets, regardless of the exponential mixing time.
For future work, we hope to establish stronger results on non-convex optimization using the techniques developed in this paper. Our current analysis doesn’t apply to training over-specified models. For these models, the empirical risk can be minimized far below the population risk , thus the assumption of Corollary 2 is violated. In practice, over-specification often makes the optimization easier, thus it could be interesting to show that this heuristic actually improves the restricted Cheeger constant. Another open problem is avoiding poor population local minima. Jin et al. proved that there are many poor population local minima in training Gaussian mixture models. It would be interesting to investigate whether a careful initialization could prevent SGLD from hitting such bad solutions.
References
Appendix A Restricted Cheeger constant is strictly positive
In this appendix, we prove that under mild conditions, the restricted Cheeger constant for a convex parameter space is always strictly positive. Let be an arbitrary convex parameter space with diameter . Lovász and Simonovits [22, Theorem 2.6] proved the following isoperimetric inequality: for any subset and any , the following lower bound holds:
where represents the Borel measure of set . Let be a constant zero function. By the definition of the function-induced probability measure, we have
Combining the inequality (22) with equation (23), we obtain:
If the set satisfies , then . Combining it with the above inequality, we obtain:
According to the definition of the restricted Cheeger constant, the above lower bound implies:
Consider an arbitrary bounded function satisfying , combining the stability property (6) and inequality (24), we obtain:
We summarize the result as the following proposition.
Appendix B Proof of Theorem 1
The proof consists of two parts. We first establish a general bound on the hitting time of Markov chains to a certain subset , based on the notion of restricted conductance. Then we prove that the hitting time of SGLD can be bounded by the hitting time of a carefully constructed time-reversible Markov chain. This Markov chain runs a Metropolis-Hastings algorithm that converges to the stationary distribution . We prove that this Markov chain has a bounded restricted conductance, whose value is characterized by the restricted Cheeger constant that we introduced in Section 2.2. Combining the two parts establishes the general theorem.
For an arbitrary Markov chain defined on the parameter space , we represent the Markov chain by its transition kernel , which gives the conditional probability that the next state satisfies given the current state . Similarly, we use to represent the conditional probability . If has a stationary distribution, then we denote it by .
A Markov chain is call lazy if for every , and is called time-reversible if it satisfies
If is a realization of the Markov chain , then the hitting time to some set is denoted by:
For arbitrary subset , we define the restricted conductance, denoted by , to be the following infinimum ratio:
Based on the notion of restricted conductance, we present a general upper bound on the hitting time. For arbitrary subset , suppose that is an arbitrary Markov chain whose transition kernel is stationary inside , namely it satisfies for any . Let be a realization of the Markov chain . We denote by the probability distribution of at iteration . In addition, we define a measure of closeness between any two Markov chains.
For two Markov chains and , we say that is -close to w.r.t. a set if the following condition holds for any and any :
Then we are able to prove the following lemma.
Let be a time-reversible lazy Markov chain with atom-free stationary distribution . Assume that is -close to w.r.t. where . If there is a constant such that the distribution satisfies for any , then for any , the hitting time of the Markov chain is bounded by:
See Appendix B.3.1 for the proof of Lemma 2. The lemma shows that if the two chains and are sufficiently close, then the hitting time of the Markov chain will be inversely proportional to the square of the restricted conductance of the Markov chain , namely . Note that if the density function of distribution is bounded, then by choosing to be the uniform distribution over , there exists a finite constant such that , satisfying the last condition of Lemma 2.
B.2 Proof of the theorem
The SGLD algorithm initializes by the uniform distribution (with ). Then at iteration , it performs the following update:
We refer the particular setting as the “standard setting”. For the “non-standard” setting of , we rewrite the first equation as:
This re-formulation reduces to the problem to the standard setting, with stepsize and objective function . Thus it suffices to prove the theorem in the standard setting, then plug in the stepsize and the objective function to obtain the general theorem. Therefore, we assume and consider the sequence of points generated by:
We introduce two additional notations: for arbitrary functions , we denote the maximal gap by the shorthand . For arbitrary set and , we denote the super-set by the shorthand . Then we prove the following theorem for the standard setting.
Assume that Assumption A holds. Let be sampled from and let the Markov chain be generated by update (33). Let be an arbitrary subset and let be an arbitrary positive number. Let be a shorthand notation. Then for any and any stepsize satisfying
the hitting time to set is bounded by
with probability at least . Here, are universal constants.
Theorem 4 shows that if we choose , where is the right-hand side of inequality (34), then with probability at least , the hitting time to the set is bounded by
Combining it with the definition of , and with simple algebra, we conclude that where is polynomial in . This establishes the iteration complexity bound. Whenever hits , we have
which establishes the risk bound. Thus, Theorem 4 establishes Theorem 1 for the special case of .
In the non-standard setting (), we follow the reduction described above to substitute in Theorem 4 with the pair . As a consequence, the quantity is substituted with , and are substituted with . Both the iteration complexity bound and the risk bound hold as in the standard setting, except that after the substitution, the numerator in the iteration complexity bound has an additional polynomial dependence on . Thus we have proved the general conclusion of Theorem 1.
where is the Dirac delta function at point . The expectation is taken over the stochastic gradient defined in equation (33), conditioning on the current state . Then for any candidate state , we accept the candidate state (i.e., ) with probability:
or reject the candidate state (i.e., ) with probability . All candidate states are rejected (i.e., ). It is easy to verify that executes a Metropolis-Hastings algorithm. Therefore, it induces a time-reversible Markov chain, and its stationary distribution is equal to .
Despite the difference in their definitions, we are able to show that the two Markov chains are -close, where depends on the stepsize and the properties of the objective function.
Assume that and Assumption A hold. Then the Markov chain is -close to w.r.t. with .
Lemma 3 shows that if we choose small enough, then will be sufficiently small. Recall from Lemma 2 that we need to bound the Markov chain ’s hitting time to the set . It means that has to be chosen based on the restricted conductance of the Markov chain . Although calculating the restricted conductance of a Markov chain might be difficult, the following lemma shows that the restricted conductance can be lower bounded by the restricted Cheeger constant.
Assume that and Assumption A hold. Then for any , we have:
By Lemma 3 and Lemma 4, we are able to choose a sufficiently small such that the Markov chains and are close enough to satisfy the conditions of Lemma 2. Formally, the following condition on is sufficient.
There exists a universal constant such that for any stepsize satisfying:
the Markov chains and are -close with . In addition, the restricted conductance satisfies the lower bound .
Under condition (38), the Markov chains and are -close with . Recall that the Markov chain is time-reversible and lazy. Since is bounded, the stationary distribution is atom-free, and sampling from implies:
Thus the last condition of Lemma 2 is satisfied. Combining Lemma 2 with the lower bound in Lemma 5, it implies that with probability at least , we have
where is a universal constant.
Finally, we upper bound the hitting time of SGLD (i.e., the Markov chain induced by formula (33)) using the hitting time upper bound (40). We denote by the transition kernel of SGLD, and claim that the Markov chain induced by it can be generated as a sub-sequence of the Markov chain induced by . To see why the claim holds, we consider a Markov chain generated by , and construct a sub-sequence of this Markov chain as follows:
Examine the states in the order , where :
For any state , in order to sample its next state , the candidate state is either drawn from a delta distribution , or drawn from a normal distribution with stochastic mean vector . The probability of these two cases are equal, according to equation (36).
By this construction, it is easy to verify that is a Markov chain and its transition kernel exactly matches formula (33). Since the sub-sequence hits in at most steps, we have
Combining this upper bound with (40) completes the proof of Theorem 4.
B.3 Proof of technical lemmas
If there is a constant such that the inequality holds for any , then the inequality
According to the claim, it suffices to upper bound for . Indeed, since for any , we immediately have:
Choosing implies . As a consequence, the hitting time is bounded by with probability at least .
Recall the properties of the function . For any , we can find a set such that and . Define, for , two functions:
By the laziness of the Markov chain , we obtain , so that they are functions mapping from to $2{\widetilde{\pi}}(x,A)-1=1-2{\widetilde{\pi}}(x,K\backslash A)g_{1}$ implies that:
where the last inequality follows since the -closeness ensures . Similarly, using the definition of and the relation , we obtain:
Since is the stationary distribution of the time-reversible Markov chain , the right-hand side of (44) is equal to:
Let and be the left-hand side of inequality (43) and (44) respectively, and define a shorthand notation:
Then by definition of restricted conductance and the laziness of , we have . Combining inequalities (43), (44) and (45) and by simple algebra, we obtain:
By the condition , the above inequality implies
It is straightforward to verify that for any , the right-hand side is upper bounded by . Thus we obtain:
On the other hand, the definition of and implies that for any . For all , the transition kernel is stationary, so that we have . Combining these two facts implies
The last inequality uses the definition of function .
Finally, we prove inequality (42) by induction. The inequality holds for by the assumption. We assume by induction that it holds for an aribtrary integer , and prove that it holds for . Combining the inductive hypothesis with inequalities (46) and (B.3.1), we have
Thus, inequality (42) holds for , which completes the proof.
B.3.2 Proof of Lemma 3
By the definition of the -closeness, it suffices to consider an arbitrary and verify the inequality (26). We focus on cases when the acceptance ratio of and are different, that is, when the candidate state satisfies and . We make the following claim on the acceptance ratio.
For any , if we assume , , and , then the acceptance ratio is lower bounded by .
Consider an arbitrary point and an arbitrary subset . The definitions of and imply that always hold. In order to prove the opposite, we notice that:
The definition of and Claim 2 implies
By plugging in the definition of and and the fact that , we obtain
In order to prove the claim, we need to lower bound the numerator and upper bound the denominator of equation (49). For the numerator, Jensen’s inequality implies:
where the last inequality uses the upper bound
For the above deduction, we have used the Jensen’s inequality as well as Assumption A.
For the denominator, we notice that the term inside the expectation satisfies:
For the second term on the righthand side, using the relation and Jensen’s inequality, we obtain
Since is assumed, Assumption A implies
Combining inequalities (52)-(55), we obtain
Combining equation (49) with inequalities (50), (56), we obtain
The -smoothness of function implies that
Combining this inequality with the lower bound (57) completes the proof.
B.3.3 Proof of Lemma 4
Recall that is the stationary distribution of the Markov chain . We consider an arbitrary subset , and define . Let and be defined as
In other words, the points in and have low probability to move across the broader between and . We claim that the distance between points in and must be bounded away from a positive number.
Assume that . If and , then .
For any point , we either have and , or we have and . It implies:
Since is the stationary distribution of the time-reversible Markov chain , inequality (58) implies:
Notice that , so that . According to Claim 3, by defining an auxiliary quantity:
we find that the set belongs to , so that . The following property is a direct consequence of the definition of restricted Cheeger constant.
For any and any , we have .
Letting and in Claim 4, we have . Combining these inequalities, we obtain
where the last inequality uses the relation with . Combining it with inequality (B.3.3), we obtain
The lemma’s assumption gives . Plugging in this relation completes the proof.
Consider any two points and . Let be a number such that . If , then the claim already holds for the pair . Otherwise, we assume that , and as a consequence assume .
Denote by the density function of distribution . The integral is equal to , where is a random variable satisfying the chi-square distribution with degrees of freedom. The following tail bound for the chi-square distribution was proved by Laurent and Massart .
If is a random variable satisfying the Chi-square distribution with degrees of freedom, then for any ,
By choosing in Lemma 6, the probability is lower bounded by . Since , the first assumption of Assumption A implies . Combining these two bounds, we obtain
For any point , the distances and are bounded by . It implies
Claim 2 in the proof of Lemma 3 demonstrates that the acceptance ratio and for any are both lower bounded by given the assumption . This lower bound is at least equal to because of the assumption , so that we have
Next, we lower bound the ratio and . For but , the function is defined by
where the last inequality uses Jensen’s inequality; It also uses the fact and .
As a consequence of this upper bound and using Jensen’s inequality, we have:
Combining inequalities (B.3.2), (63) and (64), we obtain:
The assumption implies . Plugging in this inequality to (65), a sufficient condition for is
Following identical steps, we can prove that inequality (66) is a sufficient condition for as well.
Assume that condition (66) holds. Combining inequalities (60), (62) with the fact and , we obtain:
Notice that the set satisfies , thus the following lower bound holds:
It implies that either or . In other words, if and , then inequality (66) must not hold. As a consequence, we obtain the lower bound:
Let be an arbitrary integer and let . By the definition of the restricted Cheeger constant (see equation (5)), we have
where is an indexed variable satisfying . Suming over , we obtain
Taking the limit on both sides of the inequality completes the proof.
B.3.4 Proof of Lemma 5
First, we impose the following constraints on the choice of :
so that the preconditions of both Lemma 3 and Lemma 4 are satisfied. By plugging to Lemma 4, the restricted conductance is lower bounded by:
The last inequality holds because holds for any . It is easy to verify that , so that we have the lower bound . Plugging this lower bound to inequality (69), we obtain
Inequality (70) establishes the restricted conductance lower bound for the lemma.
Combining inequality (70) with Lemma 3, it remains to choose a small enough such that is -close to with . More precisely, it suffices to make the following inequality hold:
In order to satisfy this inequality, it suffices to choose . Combining this result with (68) completes the proof.
Appendix C Proof of Proposition 1
Lovász and Simonovits [22, Theorem 2.6] proved the following isoperimetric inequality: Let be an arbitrary convex set with diameter . For any convex function and any subset satisfying , the following lower bound holds:
The lower bound (71) implies . In order to establish the proposition, it suffices to choose and , then prove the pre-condition .
Let be one of the global minimum of function and let be the ball of radius centering at point . If we choose , then for any point , we have
Moreover, for any we have:
It means for the probability measure , the density function inside is at least times greater than the density inside . It implies
Without loss of generality, we assume that is the unit ball centered at the origin. Consider the Euclidean ball where . It is easy to verify that and , which implies . Combining this relation with inequality (72), we have
The right-hand side is greater than or equal to , because we have assumed . As a consequence, we have .
Appendix D Proof of Lemma 1
Consider a sufficiently small and a continuous mapping . Since is continuously differentiable in the compact set , there exists a constant such that for any . Assuming , it implies
Thus, the mapping is a continuous one-to-one mapping. For any set , we define .
Since the parameter set is compact, we can partition into a finite number of small compact subsets, such that each subset has diameter at most . Let be the collection of these subsets that intersect with . The definition implies . The fact that implies
For arbitrary , we consider a point , and remark that every point in is -close to the point . Since is continuously differentiable, the Jacobian matrix of the transformation has the following expansion:
where is the Jacobian matrix of satisfying . The remainder term , as a consequence of the continuous differentiability of and the fact , satisfies for some constant .
On the other hand, using the relation and the continuous differentiability of , the density function at can be approximated by
where the remainder term satisfies for some constant . Further using the continuity of , and the fact , we obtain:
where the remainder term satisfies for some constant .
Combining equation (74) and equation (75), we can quantify the measure of the set using that of the set . In particular, we have
Plugging equation (76) to the lower bound (73) and using the relation , implies
Finally, plugging in the definition of the restricted Cheeger constant and taking the limit completes the proof.
Appendix E Proof of Proposition 3
Let denote the CDF of the standard normal distribution. The function satisfies the following tail bounds:
We define an auxiliary variable based on the value of :
Since , the tail bound (77) implies for all .
Let be a shorthand notation. We define a vector field:
Note that the function admits a polynomial expansion:
We remark that the matrix definition (80) implies where is the derivative of function .
The matrix satisfies , so that holds. For points that are -close to the boundary, we have . By these lower bounds and definition (79), we obtain:
where the last inequality holds because . For any , the right-hand side is smaller than , so that . For points that are not -close to the boundary, we have given . Combining results for the two cases, we conclude that satisfies the conditions of Lemma 1
By applying Lemma 1, we obtain the following lower bound:
Since , the term is lower bounded by . For the second term, we claim the following bound:
We defer the proof to Appendix E.1 and focus on its consequence. Combining inequalities (82) and (83), we obtain
The right-hand side of inequality (84) can be made strictly positive if we choose a large enough . In particular, we choose:
To proceed, we do a case study based on the value of . For all satisfying , we plug in the upper bound for , then plug in the definition of . It implies:
Using lower bound (85) for and the lower bound for , it is easy to verify that the last three terms on the right-hand side are non-negative. Furthermore, plugging in the lower bound from (85), it implies:
Combining inequalities (84), (86), (87) proves that the restricted Cheeger constant is lower bounded by . Since , it is easy to verify that the constraint (85) can be satisfied if we choose:
E.1 Proof of inequality (83)
Note that is equal to the trace of matrix . In order to proceed, we perform a case study on the value of .
We first upper bound the trace of , which can be written as:
The trace of the first term on the right-hand side is bounded by . For the second term, we assume that the matrix has eigenvalues with associated eigenvectors . As a consequence, the matrix has the same set of eigenvectors, but with eigenvalues . Thus, the trace of this term is equal to
By the assumptions and , and using the definition of -approximate local minima, we obtain . As a consequence
For other eigenvalues, if is negative, then we use the upper bound ; If is positive, then we have . Combining these relations, we have
Combining this inquality with the upper bound on the first term of (89), we obtain
Thus, we have upper bounded the trace of first term on the right-hand side of (88).
For the second term on the right-hand side of (88) , we have
where the last inequality uses the relation , so that .
For the third term on the right-hand side of (88), since , we have
Combining upper bounds (90), (91), (92) implies
The proof is similar to the previous case. For the first term on the right-hand side of equation (88), we follow the same arguments for establishing the upper bound (90), but without using the relation (because conditioning on , the definition of approximate local minima won’t give ). Then the trace of is bounded by:
For the second and the third term, the upper bounds (91) and (92) still hold, so that
Combining the two cases completes the proof.
Appendix F Proof of Theorem 2
and its stochastic gradient is computed by:
We defer the proof of claim (96) to the end of this section, focusing on its consequence. Let take the value in claim (96). The conseuqence of (96) and the -Lipschitz continuity of the function imply:
By choosing , we establish the risk bound . It remains to establish the iteration complexity bound.
Taking expectation over on both sides and using Jensen’s inequality, we obtain
Thus, by choosing , it ensures that for any :
F.1 Proof of Lemma 7
Let . By change of variables, the above equation implies:
Notice that satisfies the standard normal distribution. Thus the right-hand side of the above inequality is bounded by
Combining the two inequalities above completes the proof.
Using the fact , equation (99) implies:
Appendix G Proof of Theorem 3
There exists such that for any , and , we have .
The function is 3-Lipschitz continuous in .
For any , if the sample size satisfies , then with probability at least we have . The notation “” hides a poly-logarithmic function of .
Let be an arbitrary angle. We define to be the set of points such that the angle between the point and is bounded by , or equivalently:
For any , the 3-Lipschitz continuity of function implies:
By simple geometry, it is easy to see that
Inequality (100) implies that for small enough , any point in is a nearly optimal solutions. Thus we can use as a target optimality set. The following lemma lower bounds the restricted Cheeger constant for the set .
Assume that . For any , there are universal constant such that if we choose , then the restricted Cheeger constant is lower bounded by .
Given a target optimality , we choose . The risk bound (100) implies
Lemma 8 ensures that the pre-conditions of Theorem 2 hold with a small enough quantity . Combining Theorem 2 with inequality (101), with probability at least , SGLD achieves the risk bound:
In order to have a small enough , we want the functions and to be uniformly close. More precisely, we want the gap between them to satisfy:
By Lemma 8, this can be achieved by assuming a large enough sample size . In particular, if the sample size satisfies , then inequality (103) is guaranteed to be true. The notation “” hides a poly-logarithmic function.
If inequity (103) holds, then holds, so that we can rewrite the risk bound (102) as . By combining the choice of in (103) with the choice of in Lemma 9, we find that the relation hold, satisfying Theorem 2’s condition on . As a result, Theorem 2 implies that the iteration complexity of SGLD is bounded by the restricted Cheeger constant . By Lemma 9, the restricted Cheeger constant is lower bounded by , so that the iteration complexity is polynomial in .
(1) Let be an arbitrary point and let . An equivalent way to express the relation is the following sandwich inequality:
For any , we consider a sufficient condition for inequality (104):
The random variable satisfies a chi-square distribution with degrees of freedom. By Lemma 6, for any , the condition holds with probability at least .
Suppose that is a fixed constant, and is chosen to be for a constant . Then the random variable satisfies a normal distribution . The interval , no matter how is chosen, covers either or . For , we have , so that the probability of is asymptotically lower bounded by . It implies that there is a strictly positive constant (depending on the value of ) such that for all . With this choice of , we apply the union bound:
By choosing to be a large enough constant, the above probability is lower bounded by .
If we change the distribution of from uniform distribution to a normal distribution , the right-hand side of inequality (105) won’t change. Both and become normal random variables with correlation coefficient . Under this setting, Tong proved that the right-side is equal to
By simple algebra and using the fact that , we have
Combining the above relations, and using the fact that for any , we obtain
which shows that the function is 3-Lipschitz continuous in .
(3) Since function is the empirical risk of a linear classifier, its uniform convergence rate can be characterized by the VC-dimension. The VC-dimension of linear classifiers in a -dimensional space is equal to . Thus, the concentration inequality of Vapnik implies that with probability at least , we have
where is a universal constant. When , the upper bound is a monotonically decreasing function of . In order to guarantee , it suffices to choose where the number satisfies:
It is easy to see that is polynomial in . Thus, by the definition of , we have
It implies , thus completes the proof.
G.2 Proof of Lemma 9
The first step is to define a vector field that satisfies the conditions of Lemma 1. For arbitrary , we define a vector field such that:
For any , we can find a constant such that and holds for arbitrary and .
The claim shows that satisfies the conditions of Lemma 1 for any , so that given a scalar , the lemma implies
The right-hand side is uniformly continuous in , so that if we take the limit , we obtain the lower bound:
For the right-hand side, we prove lower bound for it using the Massart noise model. We start by simplifying the fraction term inside the expectation. Without loss of generality, assume that , then the definition of implies:
We further simplify the lower bound (109) by the following claim, which is proved using properties of the Massart noise.
When the event holds, we have , so that . Combining with inequality (109) and Claim 6, we have
where represents the angle between and .
It means that . Combining this equation with inequalities (108), (110), and taking the limits , , we obtain:
where represents the angle between and .
According to inequality (111), for any , we have
where the last inequality follows since and . Otherwise, if , then we have
Once we choose , the above expression will be lower bounded by , which completes the proof.
Since for any , it is easy to verify that . In order to verify , we notice that
The right-hand side will be maximized if . Thus, if we assume , then for any , it is easy to verify that the right-hand side is bounded by . As a consequence, we have , which verifies that .
When or , it is easy to verify that by the definition of the loss function. Otherwise, we assume that and . In these cases, the events and hold almost surely, so that we can assume that the loss function always takes zero-one values.
Let be the event that condition (112) holds. Under this event, when the parameter changes from to , the loss changes from to . It means that the loss is non-decreasing with respect to the change , and as a consequence, we have .
In order to establish the lower bound in Claim 6, we first lower bound the probability of event . In the proof of Lemma 8. we have shown that this probability is equal to:
Let be the angle between and , then the right-hand side is equal to . Using the geometric property that , we have
For the first term on the right-hand side, we have by condition (112) and the assumption that . For the second term, if we condition on , then the vector is uniformly sampled from a -dimensional sphere of radius that centers at the origin. The vector , constructed to be orthogonal to , also belongs to the same -dimensional subspace. Under this setting, Awasthi et al. [4, Lemma 4] proved that
Using the bound , we marginalize to obtain
Recall that and . These two relations imply . Combining it with the relation , we obtain
Combining inequalities (113), (114) with the relation that , we obtain