Multi-layer neural networks are one of the oldest approaches to statistical machine learning, dating back at least to the 1960’s [Ros62]. Over the last ten years, under the impulse of increasing computer power and larger data availability, they have emerged as a powerful tool for a wide variety of learning tasks [KSH12, GBCB16].
In practice, the parameters of neural networks are learned by stochastic gradient descent [RM51] (SGD) or its variants. In the present case, this amounts to the iteration
Here θk=(θik)i≤N denotes the parameters after k iterations, sk is a step size, and (xk,yk) is the k-th example. Throughout the paper, we make the following assumption:
In large scale applications, this is not far from truth: the data is so large that each example is visited at most a few times [Bot10]. Further, theoretical guarantees suggest that there is limited advantage to be gained from multiple passes [SSBD14]. For recent work deriving scaling limits under such assumption (in different problems) see [WML17].
Formal relationships can be established between RN(θ) and R(ρ). For instance, under mild assumptions, infθRN(θ)=infρR(ρ)+O(1/N). We refer to the next sections for mathematical statements of this type.
Roughly speaking, R(ρ) corresponds to the population risk when the number of hidden units goes to infinity, and the empirical distribution of parameters ρ^(N) converges to ρ. Since U(⋅,⋅) is positive semidefinite, we obtain that the risk becomes convex in this limit. The fact that learning can be viewed as convex optimization in an infinite-dimensional space was indeed pointed out in the past [LBW96, BRV+06]. Does this mean that the landscape of the population risk simplifies for large N and descent algorithms will converge to a unique (or nearly unique) global optimum?
when N→∞, ε→0 (here ⇒ denotes weak convergence). The asymptotic dynamics of ρt is defined by the following PDE, which we shall refer to as distributional dynamics (DD)
Using these results, analyzing learning in two-layer neural networks reduces to analyzing the PDE (7). While this is far from being an easy task, the PDE formulation leads to several simplifications and insights. First of all, it factors out the invariance of the risk (4) (and of the SGD dynamics (3)), with respect to permutations of the units {1,…,N}.
where the infimum is taken over all couplings of ρ1 and ρ2. Informally, the fact that ρt is a gradient flow means that (7) is equivalent, for small τ, to
Powerful tools from the mathematical literature on gradient flows in measure spaces [AGS08] can be exploited to study the behavior of (7).
Most importantly, the scaling limit elucidates the dependence of the landscape of two-layer neural networks on the number of hidden units N.
A remarkable feature of neural networks is the observation that, while they might be dramatically over parametrized, this does not lead to performance degradation. In the case of bounded activation functions, this phenomenon was clarified in the nineties for empirical risk minimization algorithms, see e.g. [Bar98]. The present work provides analogous insight for the SGD dynamics: roughly speaking, our results imply that the landscape remains essentially unchanged as N grows, provided N≫D. In particular, assume that the PDE (7) converges close to an optimum in time t∗(D). This might depend on D, but does not depend on the number of hidden units N (which does not appear in the DD PDE (7)). If t∗(D)=OD(1), we can then take N arbitrarily (as long as N≫D) and will achieve a population risk which is independent of N (and corresponds to the optimum), using k=t∗/ε=O(D) samples.
Our analysis can accommodate some important variants of SGD, a particularly interesting one being noisy SGD:
where Ψλ(θ;ρ)=Ψ(θ;ρ)+(λ/2)∥θ∥22, and Δθf(θ)=∑i=1d∂θi2f(θ) denotes the usual Laplacian. This can be viewed as a gradient flow for the free energy Fβ,λ(ρ)=(1/2)R(ρ)+(λ/2)∫∥θ∥22ρ(dθ)−β−1Ent(ρ), where Ent(ρ)=−∫ρ(θ)logρ(θ)dθ is the entropy of ρ (by definition Ent(ρ)=−∞ if ρ is singular). Fβ,λ(ρ) is an entropy-regularized risk, which penalizes strongly non-uniform ρ.
We will prove below that, for β<∞, the evolution (12) generically converges to the minimizer of Fβ,λ(ρ), hence implying global convergence of noisy SGD in a number of steps independent of N.
Examples
With probability 1/2: y=+1, x∼N(0,(1+Δ)2Id)
With probability 1/2: y=−1, x∼N(0,(1−Δ)2Id).
(This example will be generalized later.) Of course, optimal classification in this model becomes entirely trivial if we compute the feature h(x)=∥x∥2. However, it is non-trivial that a SGD-trained neural network will succeed.
We choose an activation function without offset or output weights, namely σ∗(x;θi)=σ(⟨wi,x⟩). While qualitatively similar results are obtained for other choices of σ, we will use a simple piecewise linear function as a running example: σ(t)=s1 if t≤t1, σ(t)=s2 if t≥t2, and σ(t) interpolated linearly for t∈(t1,t2). In simulations we use t1=0.5, t2=1.5, s1=−2.5, s2=7.5.
where the form of ψ(r;ρ) can be derived from Ψ(θ;ρ). This reduced PDE can be efficiently solved numerically, see Supplementary Information (SI) for technical details. As illustrated by Fig. 1, the empirical results match closely the predictions produced by this PDE.
In Fig. 2, we compare the asymptotic risk achieved by SGD with the prediction obtained by minimizing R(ρ), cf. (5) over spherically symmetric distributions. It turns out that, for certain values of Δ, the minimum is achieved by the uniform distribution over a sphere of radius ∥w∥2=r∗, to be denoted by ρr∗\mboxunif. The value of r∗ is computed by minimizing
where expressions for v(r), ud(r1,r2) can be readily derived from V(w), U(w1,w2) and are given in the SI.
Let r∗ be a global minimizer of r↦Rd(1)(r). Then ρr∗\mboxunif is a global minimizer of ρ↦R(ρ) if and only if v(r)+ud(r,r∗)≥v(r∗)+ud(r∗,r∗) for all r≥0.
Checking numerically this condition yields that ρr∗\mboxunif is a global minimizer for Δ in an interval [Δdl,Δdh], where limd→∞Δdl=0 and limd→∞Δdh=Δ∞≈0.47.
for any k∈[T/ε,10T/ε] with probability at least 1−δ.
In particular, if we set ε=1/(C0d), then the number of SGD steps is k∈[(C0T)d,(10C0T)d]: the number of samples used by SGD does not depend on the number of hidden units N, and is only linear in the dimension. Unfortunately the proof does not provide the dependence of T on η, but Theorem 6 below suggests exponential local convergence.
2 Centered anisotropic Gaussians
We can generalize the previous result to a problem in which the network needs to select a subset of relevant nonlinear features out of many a priori equivalent ones. We assume the joint law of (y,x) to be as follows:
With probability 1/2: y=+1, x∼N(0,Σ+), and
With probability 1/2: y=−1, x∼N(0,Σ−).
Even with a reduced degree of symmetry, SGD converges to a network with nearly-optimal risk, after using a number of samples k=O(d), which is independent of the number of hidden units N.
3 A better activation function
We consider the same data distribution introduced in the last section (anisotropic Gaussians). Figure 3 reports the evolution of the risk RN(θk) for three experiments with d=320, s0=60 and different values of Δ. SGD is initialized by setting ai=1, bi=1 and wi0∼iidN(0,0.82/d⋅Id) for i≤N. We observe that SGD converges to a network with very small risk, but this convergence has a nontrivial structure and presents long flat regions.
The empirical results are well captured by our predictions based on the continuum limit. In this case we obtain a reduced PDE for the joint distribution of the four quantities r=(a,b,r1=∥PVw∥2,r2=∥PV⊥w∥2), denoted by ρt. The reduced PDE is analogous to (13) albeit in 4 rather than 1 dimensions. In Figure 3 we consider the evolution of the risk, alongside three properties of the distribution ρt –the means of the output weight a, of the offset b, and of r1.
4 Predicting failure
SGD does not always converge to a near global optimum. Our analysis allows to construct examples in which SGD fails. For instance, Figure 4 reports results for isotropic Gaussians problem. We violate the assumptions of Theorem 1 by using non monotone activation function. Namely, we use σ∗(x;θ)=σ(⟨w,x⟩), where σ(t)=−2.5 for t≤0, σ(t)=7.5 for t≥1.5, and σ(t) linearly interpolates from (0,−2.5) to (0.5,−4), and from (0.5,−4) to (1.5,7.5).
Depending on the initialization, SGD converges to two different limits, one with a small risk, and the second with high risk. Again this behavior is well tracked by solving a one-dimensional PDE for the distribution ρt of r=∥w∥2.
General results
In this section we return to the general supervised learning problem described in the introduction and describe our general results. Proofs are deferred to the SI.
First, we note that the minimum of the asymptotic risk R(ρ) of (5) provides a good approximation of the minimum of the finite-N risk RN(θ).
We next consider the distributional dynamics (7) and (12). These should be interpreted to hold in weak sense, cf. SI. In order to establish that these PDEs indeed describe the limit of the SGD dynamics, we make the following assumptions.
t↦ξ(t) is bounded Lipschitz: ∥ξ∥∞,∥ξ∥\mboxLip≤K1, with ∫0∞ξ(t)dt=∞.
The activation function (x,θ)↦σ∗(x;θ) is bounded, with sub-Gaussian gradient: ∥σ∗∥∞≤K2, ∥∇θσ∗(X;θ)∥ψ2≤K2. Labels are bounded ∣yk∣≤K2.
The gradients θ↦∇V(θ), (θ1,θ2)↦∇θ1U(θ1,θ2) are bounded, Lipschitz continuous (namely ∥∇θV(θ)∥2, ∥∇θ1U(θ1,θ2)∥2≤K3, ∥∇θV(θ)−∇θV(θ′)∥2≤K3∥θ−θ′∥2, ∥∇θ1U(θ1,θ2)−∇θ1U(θ1′,θ2′)∥2≤K3∥(θ1,θ2)−(θ1′,θ2′)∥2).
We also introduce the following error term which quantifies in a non-asymptotic sense the accuracy of our PDE model
The convergence of the SGD process to the PDE model is an example of a phenomenon which is known in probability theory as propagation of chaos [Szn91].
with probability 1−e−z2. The same statements hold for noisy SGD (11), provided (7) is replaced by (12), and if β≥1, λ≤1, and ρ0 is K0 sub-Gaussian for some K0>0.
Notice that dependence of the error terms in N and D is rather benign. On the other hand, the error grows exponentially with the time horizon T, which limits its applicability to cases in which the DD converges rapidly to a good solution. We do not expect this behavior to be improvable within the general setting of 3, which a priori includes cases in which the dynamics is unstable.
We can regard J(θ;ρt)=ρt(θ)∇θΨ(θ;ρt) as a current. The fixed points of the continuum dynamics are densities that correspond to zero current, as stated below.
Assume V(⋅),U(⋅,⋅) to be differentiable with bounded gradient. If ρt is a solution of the PDE (7), then R(ρt) is non-increasing. Further, probability distribution ρ is a fixed point of the PDE (7) if and only if
Note that global optimizers of R(ρ), defined by condition (17), are fixed points, but the set of fixed points is, in general, larger than the set of optimizers. Our next proposition provides an analogous characterization of the fixed points of diffusion DD (12) (see [CMV+03] for related results).
Assume that conditions A1-A3 hold and that ρ0 is absolutely continuous with respect to Lebesgue measure, with Fβ,λ(ρ0)<∞. If (ρt)t≥0 is a solution of the diffusion PDE (12), then ρt is absolutely continuous. Further, there is at most one fixed point ρ∗=ρ∗β,λ of (12) satisfying Fβ,λ(ρ∗)<∞. This fixed point is absolutely continuous and its density satisfies
In the next sections we state our results about convergence of the distributional dynamics to its fixed point. In the case of noisy SGD (and for the diffusion PDE (12)), a general convergence result can be established (although at the cost of an additional regularization). For noiseless SGD (and the continuity equation (12)), we do not have such general result. However, we obtain a stability condition for fixed point containing one point mass, which is useful to characterize possible limiting points (and is used in treating the examples in the previous section).
Remarkably, the diffusion PDE (12) generically admits a unique fixed point, which is the global minimum of Fβ,λ(ρ) and the evolution (12) converges to it, if initialized so that Fβ,λ(ρ0)<∞. This statement requires some qualifications. First of all, we introduce sufficient regularity assumptions to guarantee the existence of sufficiently smooth solutions of (12).
Next notice that the right-hand side of the fixed point equation (21) is not necessarily normalizable (for instance, it is not when V(⋅), U(⋅,⋅) are bounded). In order to ensure the existence of a fixed point, we need λ>0.
Assume that conditions A1-A4 hold, and 1/K0≤λ≤K0 for some K0>0 Then Fβ,λ(ρ) has a unique minimizer, denoted by ρ∗β,λ, which satisfies
where C is a constant depending on K0,K1,K2,K3. Further, letting ρt be a solution of the diffusion PDE (12) with initialization satisfying Fβ,λ(ρ0)<∞, we have, as t→∞,
The proof of this theorem is based on the following formula that describes the free energy decrease along the trajectories of the distributional dynamics (12):
(A key technical hurdle is of course proving that this expression makes sense, which we do by showing the existence of strong solutions.) It follows that the right-hand side must vanish as t→∞, from which we prove that (eventually taking subsequences) ρt⇒ρ∗ where ρ∗ must satisfy βΨλ(θ;ρ∗)+logρ∗(θ)=const. This in turns mean ρ∗ is a solution of the fixed point condition 21 and is in fact a global minimum of Fβ,λ by convexity.
This result can be used in conjunction with Theorem 3, in order to analyze the regularized noisy SGD algorithm (11).
2 Convergence: noiseless SGD
The next theorems provide necessary and sufficient conditions for distributions containing a single point mass to be a stable fixed point of the evolution. This result is useful in order to characterize the large time asymptotics of the dynamics (7). Here, we write ∇1U(θ1,θ2) for the gradient of U with respect to its first argument, and ∇1,12U for the corresponding Hessian. Further, for a probability distribution ρ∗, we define
Note that H0(ρ∗) is nothing but the Hessian of θ↦Ψ(θ;ρ∗) at θ∗.
If ρ0 has a bounded density with respect to Lebesgue measure, then it cannot be that ρt converges weakly to ρ∗ as t→∞.
Discussion and future directions
In this paper we developed a new approach to the analysis of two-layers neural networks. Using a propagation-of-chaos argument, we proved that –if the number of hidden units satisfies N≫D– SGD dynamics is well approximated by the PDE in (7), while noisy SGD is well approximated by (12). Both of these asymptotic descriptions correspond to Wasserstein gradient flows for certain energy (or free energy) functionals. While empirical risk minimization is known to be insensitive to overparametrization [Bar98], the present work clarifies that the SGD behavior is also independent of the number of hidden units, as soon as this is large enough.
We illustrated our approach on several concrete examples, by proving convergence of SGD to a near-global optimum. This type of analysis provides a new mechanism for avoiding the perils of non-convexity. We do not prove that the finite-N risk RN(θ) has a unique local minimum, or that all local minima are close to each other. Such claims have often been the target of earlier work, but might be too strong for the case of neural networks. We prove instead that the PDE (7) converges to a near global optimum, when initialized with a bounded density. This effectively gets rid of some exceptional stationary points of RN(θ), and merges multiple finite N stationary points that result into similar distributions ρ.
In the case of noisy SGD (11), we prove that it converges generically to a near-global minimum of the regularized risk, in time independent of the number of hidden units.
We emphasize that while we focused here on the case of square loss, our approach should be generalizable to other loss functions as well, cf. SI.
The present work opens the way to several interesting research directions. We will mention two of them. (i) The PDE (7) corresponds to gradient flow in the Wasserstein metric for the risk R(ρ), see [AGS08]. Building on this remark, tools from optimal transportation theory can be used to prove convergence. (ii) Multiple finite-N local minima can correspond to the same minimizer ρ∗ of R(ρ) in the limit N→∞. Ideas from glass theory [MP99] might be useful to investigate this structure.
Let us finally mention that, after a first version of this paper appeared as a preprint, several other groups obtained results that are closely related to Theorem 3 [RVE18, SS18, CB18].
Acknowledgements
This work was partially supported by grants NSF DMS-1613091, NSF CCF-1714305 and NSF IIS-1741162. S. M. was partially supported by Office of Technology Licensing Stanford Graduate Fellowship. P.-M. N. was partially supported by William R. Hewlett Stanford Graduate Fellowship. The authors would like to thank Jiajun Tong for helpful discussions concerning strong solutions for parabolic PDEs.
References
Supplementary information
We present here proofs and additional technical details for our mathematical results, as well as additional information concerning the numerical experiments.
Notations
We use lowercase bold for vectors (e.g. u,v,…), uppercase bold for matrices (e.g. A,B,…), and lowercase plain for scalar (x,y,…).
Given a measurable space Ω, we denote by \mathscrsfsP(Ω) the set of probability measures on Ω.
Given a measurable function f, and a measure μ, we denote by ⟨f,μ⟩=⟨μ,f⟩=∫fdμ the corresponding integral.
∥f∥\mboxLip≡supx=y∣f(x)−f(y)∣/∥x−y∥2 denotes the Lipshitz constant of a function f.
d\mboxBL(⋅,⋅) is the bounded Lipschitz distance between probability measures
Here C(μ,ν) is the set of couplings of μ and ν.
Wp(⋅,⋅) is the Wasserstein distance between probability measures
For p=1, the Kantorovich-Rubinstein duality gives
K is a generic constant depending on K0,K1,K2,K3, where Ki’s are constants which will be specified from the context.
General results: Statics
In this section, we discuss some properties of the population risk, RN(θ), and its continuum counterpart R(ρ). For future reference, we copy the key definitions from the main text:
We show that minimizing the population risk RN(θ) yields similar results to minimizing its continuum counterpart R(ρ):
We establish the condition for ρ∗ to be a minimizer:
First notice that, for any θ=(θi)i≤N, we have
Indeed, RN(θ)=R(ρ) for ρ=(1/N)∑i=1Nδθi.
whence the claim (6.6) follows since ε is arbitrary.
We next establish the minimum condition (6.7). Notice that since V(⋅) is continuous, and U(⋅,⋅) is bounded below, it follows from Fatou’s lemma that, for any ρ, the function θ↦Ψ(θ;ρ) is lower semicontinous and takes values in (−∞,∞]. In particular the set S0(ρ)≡argminθΨ(θ;ρ) must be closed.
We first prove that any minimizer must satisfy (6.7). Let ρ∗ be a minimizer and define Ψ∗=infθΨ(θ;ρ∗). By rearranging terms, for any probability measure ρ, we have
First we will assume Ψ∗>−∞ (whence, by lower semicontinuity, S0(ρ∗) must be a non-empty closed set). Let θ1∈S0(ρ∗), and assume by contradiction that there exist θ0∈supp(ρ∗), θ0∈S0(ρ∗). Let B(θ0;ε) be a ball of radius ε around θ0. By lower semicontinuity, we can find ε0,Δ>0 such that infθ∈B(θ0;ε0)Ψ(θ;ρ∗)=Ψ∗+Δ>Ψ∗. Further t0≡ρ∗(B(θ0;ε0))>0 because θ0∈supp(ρ∗).
Let ν≡1B(θ0;ε0)ρ∗/t0 (i.e. ν is the conditional distribution given θ∈B(θ0;ε0)). Define, for t∈[0,t0], the probability measure
where the second inequality follows from the fact that U is continuous and δθ1, ν have bounded support. By taking t small enough, we get R(ρ)<R(ρ∗) hence reaching a contradiction.
By selecting t=tM=min(t0,(M+Ψ0)/C0(M)) (which is positive for all M large enough), we obtain R(ρM,t)−R(ρ∗)<0 for all M large and hence reach a contradiction.
Setting μ=2[Ψ(⋅;ρ∗)−Ψ∗], and noticing that condition (6.7) implies ⟨Ψ(⋅;ρ∗)−Ψ∗,ρ∗⟩=0, we get R(ρ)≥R(ρ∗)+⟨U,(ρ−ρ∗)⊗2⟩≥R(ρ∗).
2 Some additional results
We often find empirically that the optimal density ρ∗ is supported on a set of Lebesgue measure (sometimes on a finite set of points). The following consequence of the previous results partially explains these findings.
Assume θ↦V(θ) to be an analytic function and (θ1,θ2)↦U(θ1,θ2) to be analytic with respect to θ1, uniformly in θ2. Namely there exists a locally bounded function θ↦B(θ) such that ∥∇θ1kU(θ1,θ2)∥2≤k!B(θ1)k for all k, θ1, θ2. If ρ∗ is a minimizer of R(ρ), then one of the following holds
The support of ρ∗ has zero Lebesgue measure.
If D=1, then (b) can be replaced by: (b′)ρ∗ is a convex combination of countably many point masses with no accumulation point (finitely many if Ψ(θ;ρ∗)→∞ as ∣θ∣→∞).
Note that, under the stated conditions f(θ)≡∫U(θ,θ′)ρ∗(dθ′) is analytic. Indeed, by a standard dominated convergence argument, we have that ∇kf is given by the integral of ∫∇kU(θ1,θ2)ρ∗(dθ2) for any k≥0. Further, by an application of the intermediate value theorem there exists tθ1,θ2,δ∈ such that
which vanishes as k→∞ for uniformly over ∥δ∥2≤δ0 for δ0 small enough.
General results: Dynamics
In this section we consider the SGD dynamics with step size sk=εξ(kε), under the assumptions A1,A2,A3 stated in the main text. For the readers convenience, we reproduce here the form of the limiting PDE
For background on this and similar PDEs (and the analogous ones at finite temperature, cf. Section 10), we refer to [MV00, CMV+03, CMV06, AGS08, CDF+11]. Our treatment will be mostly self-contained because of some differences between our setting and the one in these papers.
Recall assumptions A1, A2, A3 in the main text. By [Szn91, Theorem 1.1], assumptions A1 and A3 are sufficient for the existence and uniqueness of solution of PDE (7.1).
Assume conditions A1 and A3 hold. Let (ρt)t≥0 be the solution of the PDE (7.1). Let (θit)t≥0 be the solution of nonlinear dynamics (7.4). Then t↦θit is K1K3-Lipschitz continuous, and t↦ρt is K1K3-Lipschitz continuous in W2 Wasserstein distance, with K1 and K3 as per conditions A1 and A3. In particular, t↦ρt is continuous in the topology of weak convergence.
Since ξ and ∇Ψ are K1 and K3 bounded respectively, t↦θit is K1K3-Lipschitz continuous. Further, Eq. (5.2) implies that t↦ρt is Lipschitz continuous in W2 Wasserstein distance, namely
The proof follows a ‘propagation of chaos’ argument [Szn91]. Throughout this proof, we will use K to denote generic constant depending on the constants K1,K2,K3 in conditions A1, A2, A3.
It is convenient to introduce the notations zk=(xk,yk) to denote the k-th example and define
Note that the assumption of bounded Lipschitz ∇V, ∇1U (here and below ∇1U(θ1,θ2) denotes the gradient of U with respect to its first argument) implies ∥G(θ;ρ)∥2≤K and ∥G(θ1;ρ)−G(θ2;ρ)∥2≤K∥θ1−θ2∥2. Further
With these notations, we can rewrite the SGD dynamics in the main text as
Recall (θi0)i≤N∼ρ0 independently.
We next state and prove the key estimate controlling the difference between the original dynamics and the nonlinear dynamics.
Under the assumptions of Theorem 3, there exists a constant K depending uniquely on K1,K2,K3 in conditions A1, A2, and A3, such that for any T≥0, we have
with probability at least 1−e−z2.
We next consider the three terms above. Using the Lipschitz continuity of G(θ;ρ) with respect to θ and ρ (see Eq. (7.10)), and due to condition A1 and Lemma 7.1 (implying that ξ, θit, and ρs are Lipschitz continuous), we get
Bounding the second term yields (by using the Lipschitz continuity of G with respect to its first argument):
where ρ^k(N)≡(1/N)∑i≤Nδθik. Hence
and taking union bound over i≤N, we get
For the term E3,0i(t), we use the Lipschitz continuity property (7.10), whence
Since for any fixed k, (θjkε)j≤N,j=i are i.i.d. and independent of θik, and ∇1U is bounded, we get by another application of Azuma-Hoeffding inequality, cf. Lemma A.1,
Conditional on the good events in Eq. (7.22) and (7.25), Eq. (7.20) thus yields
with probability at least 1−e−z2.
Using the bounds (7.16), (7.17), (7.26) in Eq. (7.15), we get
Using the bound (7.27), the claim follows. ∎
Under the assumptions of Theorem 3, we have
Let θ=(θ1,…,θi,…,θn) and θ′=(θ1,…,θi′,…,θn) be two configurations that differ only in position i. Then
Under the assumptions of Theorem 3, we have,
with probability at least 1−e−z2.
By Eq. (7.32) and by Azuma-Höeffding inequality and union bound, we get
with probability at least 1−e−z2. The claim follows since
The proof of the theorem follows from a straightforward application of Lemma 7.2, 7.3, 7.4. The proof for any bounded Lipschitz function f follows the same argument as Lemma 7.3, 7.4. As a result, for any sequence (N,ε=εN) such that N→∞ and εN→0 with N/log(N/εN)→∞ and εNlog(N/εN)→0, we have ρ^⌊k/ε⌋(N) converges weakly to ρt almost surely immediately.
2 Proof of Theorem 3: Generalization to β<∞𝛽\beta<\infty
Here we generalize the proof given in the previous section to noisy SGD at finite temperature β<∞. Since the proof follows the same scheme as in the noiseless case, we will limit ourselves to describing the differences.
Throughout this section we assume that conditions A1, A2, A3 hold. We also let
for some λ≤1. Further we assume ρ0 is K02-sub-Gaussian. Finally, we assume 1≤β<∞.
For the reader’s convenience, we reproduce here the form of the limiting PDE
which again should be interpreted in weak sense.
Recall conditionss A1, A2, A3 in the main text. By a modified argument of [Szn91, Theorem 1.1], conditions A1 and A3 are sufficient for the existence and uniqueness of solution of PDE (7.37) in weak sense. Section 10 provides further information of this PDE, including a proof of existence and uniqueness.
As in the noiseless case, there is an equivalent formulation of this PDE as a fixed point distribution for the following nonlinear dynamics, which is an integration form of a stochastic differential equation,
where {Wi(s)}s≥0 for i≤N are independent D-dimensional Brownian motions, and G(θ;ρ)≡−∇Ψλ(θ;ρ). The assumptions on U, V, λ, and ξ ensures that this nonlinear dynamics has a unique continuous solution.
It is convenient to collect some standard estimates about the solution of the stochastic differential equation (7.38).
Assume ρ0 is K02-sub-Gaussian, ξ(s) and G(0;ρs) are K0-bounded, G(θ;ρs) is K0-Lipschitz in θ, and β≥1. Let (θit)t≥0 for i≤N be the solution of (7.38) with independent initialization (θi0)i≤N∼ρ0. Let (ρt)t≥0 be the solution of PDE (7.37). Then there exists a constant K depending uniquely on K0, such that
Part (a). First, note that for any D-dimensional K02-sub-Gaussian random vector X, we have
Note that (θi0)i≤N∼ρ0 independently, and ρ0 is K02-sub-Gaussian. Therefore
Taking τ=1/(2K02) and u=2K0(D+logN+z), we get
Then we define Wξ,i(t)≡∫0t2ξ(s)dWi(s). We have Var(Wξ,ij(t))=∫0t2ξ(s)ds≤2K0t for j≤D. Note exp{τ∥Wξ,i(t)∥22} is a submartingale, due to Doob’s martingale inequality, we have
Taking τ=1/(4K0T) and u=4K0T(D+logN+z), we get
By noting that ξ(s), G(0;ρs) are K0-bounded, and G(θ;ρs) is K0-Lipschitz in θ, according to Eq. (7.38), there exists some constant K depending on K0, such that
where Δi(t)≡sups≤t∥θis∥2, W≡maxi≤Nsupt≤T∥Wξ,i(t)∥2, and Θ≡maxi≤N∥θi0∥2. Due to Gronwall’s inequality, we have
The high probability bound (7.42) holds by noting the high probability bound for Θ and W in Eq. (7.46) and (7.47).
Part (b). Define Δi(h;k,ε)=sup0≤u≤h∥θikε+u−θikε∥2. By noting that ξ(s), G(0;ρs) are K0-bounded, and G(θ;ρs) is K0-Lipschitz in θ, according to Eq. (7.38), we have
where Wξ,i,k(u)≡∫kεkε+u2ξ(s)dWi(s). Similar to the bound Eq. (7.47), we have
Plugging the bound Eq. (7.42) and Eq. (7.49) into Eq. (7.48), we have
with probability at least 1−e−z2.
Part (c). Equation (7.44) holds directly by noting that
As in the noiseless case, the key step consists in bounding the difference between the nonlinear dynamics and the SGD dynamics.
Under the assumptions of Theorem 3, there exists a constant K depending uniquely on K0,K1,K2,K3, such that for any T≥0, we have
with probability at least 1−e−z2.
Terms E2i(t), E3i(t) can be bounded the same as in Lemma 7.2, i.e., Eq. (7.17) and (7.26), by noting that the replacement of Ψ by Ψλ does not affect these estimates.
To bound E4i(t), notice that \bm{W}_{\xi,i}\equiv\int_{0}^{T}\big{(}\sqrt{2\xi(s)}-\sqrt{2\xi([s])}\big{)}\,{\rm d}\bm{W}_{i}(s) is a Gaussian random vector, Wξ,i∼N(0,τ2ID), where, using the Lipschitz continuity of ξ,
and therefore by applying Doob’s inequality to the submartingale t↦E4i(t), we get
We need to modify the proof of Lemma 7.2 to bound terms E1i(t).
To bound the first term E1,Ai(t), due to the Lipschitz property of G(θ;ρ) and the boundedness of G(0;ρ), with probability at least 1−e−z2, we have for all i≤N and t≤T,
Here the last inequality is due to Eq. (7.42) in Lemma 7.5.
To bound the second term E1,Bi(t), using the fact that ∇1U is bounded Lipschitz, we have for all i≤N and t≤T,
Here the last inequality is due to Eq. (7.44) in Lemma 7.5.
To bound the third term E1,Ci(t), with probability at least 1−e−z2, we have for all i≤N and t≤T,
Here the last inequality is due to Eq. (7.43) in Lemma 7.5.
As a result, combining Eq. (7.17), (7.26), (7.27), (7.51), (7.52), (7.54), (7.55), and (7.56), defining
Applying Gronwall’s inequality gives the desired result.
The generalization of Theorem 3 to β<∞ follows from this lemma exactly as in the previous section.
3 Proof of Proposition 2: Monotonicity of the risk
By Lemma 7.1, t↦ρt is Lipschitz continuous in Wasserstein distance W2(ρt1,ρt2)≤K∣t1−t2∣. Hence, we get
where in the second step we used Eq. (7.3). This immediately implies that R(ρt) is non-increasing in t.
Let ρ be a fixed point of Eq. (7.1). Since ∂tR(ρt)∣ρ0=ρ=0, the above formula implies
and therefore ρ is supported in the set of θ’s such that ∇Ψ(θ;ρ)=0.
Vice versa, if this is the case, setting ρ0=ρ, Eq. (7.3) implies ∂t⟨φ,ρt⟩=0, then ρt≡ρ0 is a fixed point.
4 A general continuity result
It is useful to notice that the solution (ρt)t≥0 of the PDE (7.1) is continuous with respect to changes in V(⋅), U(⋅,⋅). Namely, we consider the following two PDEs:
The proof adapts the argument used to establish uniqueness in [Szn91]. Without loss of generality, we fix ξ(t)≡1/2. We further denote by K generic constants depending on K1, K3.
The assumption of bounded Lipschitz ∇V and ∇U implies that ∇Ψ(θ;ρ) is K-bounded Lipschitz with respect to argument (θ,ρ), that is,
Using bound (7.70), the first term E1(t) is simply bounded by
To bound the second term E2(t), we have
with the definition of ε0 given by Equation (7.69).
Combining Equation (7.74), (7.75), and (7.76), we have
Applying Equation (7.73), the result follows. ∎
5 Some properties of the solution of the PDE (7.1)
In this section we prove four lemmas on the properties of the solution of the PDE (7.1), under conditions A1 and A3. All of these facts are quite standard, but we provide complete proofs for them for reader’s convenience.
We will use several times the following notations. Let ρt be a solution of the PDE (7.1) with initialization ρ0. Let (θt)t≥0 be the solution of the ordinary differential equation (ODE)
With these notations, ρt is the push forward of ρ0 under φt: ρt=φ∗tρ0. In other words, for any Borel set B, ρt(φt(B))=ρ0(B).
The lemma holds immediately by noting that ρt(Ω)≥ρt(φt(Ω))=ρ0(Ω). ∎
Assume conditions A1, A3 hold. Further assume there exists a constant K<∞ such that
for any θ∈(0,∞)D and ρ∈P([0,∞]D). Let (ρt)t≥0 be the solution of the PDE (7.1) with initial condition ρ0 with ρ0((0,∞)D)=1. Then for any t<∞, ρt((0,∞)D)=1.
According to Eqs. (7.80) and (7.79), we have for i∈[d],
Then according to (7.81), we have φt(Ωk(0))⊆Ωk(t). Note Ωk(t) is increasing in k for fixed t, and ∪kΩk(t)=∪kΩk(0)=(0,∞)D. Hence,
Let (ρt)t≥0 be a continuous curve in a compact metric space (Ω,d). Denoting
to be the set of all limiting points of (ρt)t≥0. Then S∗ is a connected compact set.
First, it is easy to see that S∗ should be closed. Note that Ω is a compact space, then S∗ should be a compact set. If S∗={ρ∗} is a singleton, this lemma holds automatically. Therefore, we would like to consider the case when S∗ is not a singleton.
For any ρ1,ρ2∈S∗, and d(ρ1,ρ2)>0. We would like to show ρ1 and ρ2 are connected in S∗.
We use proof by contradiction. Now suppose ρ1 and ρ2 are not connected. Define A⊆S∗ to be the maximal connected subset of S∗ containing ρ1. It is easy to see that A is compact. It is also easy to see that its complement B≡S∗∖A is also a compact set, and ρ2∈B. As a result, we have A∪B=S∗, A∩B=∅, and ρ1∈A, ρ2∈B.
Note that Ω is a metric space, so it satisfies T4 separation axiom. Since A and B are closed sets and A∩B=∅, there exists an open set O, such that A⊆O, O∩B=∅. Hence, ∂O⊆S∗c.
Note that ρ1 and ρ2 are limiting points of (ρt)t≥0 which is a continuous curve in Ω. Therefore, it must cross the boundary ∂O infinite times. That is, there is a sequence (tk)k≥1 of time with limk→∞tk=∞, such that ρtk∈∂O. But since ∂O is compact, there exists a limiting point ρ∗∈∂O, so that a subsequence of sequence ρtk converges to ρ∗. Therefore, ρ∗ should be a limiting point of (ρt)t≥0. But this contradict with ∂O⊆S∗c. ∎
Under the assumptions of A1 and A3, further assume that U,V are twice continuous differentiable, and that ρ0 has density with respect to the Lebesgue measure, bounded by M0. Then ρt also has a density, bounded by Mt=KM0exp{KDt} (where K depends on the constants in the assumptions).
Let J(θ;t) for the Jacobian of φt(⋅) at θ0=θ. Then Eq. (7.79) implies that J(θ;t) satisfies
with initial condition J(θ;0)=ID. This implies
Therefore, using the fact that ∥∇2Ψ(θ;ρt)∥\mboxop is K-bounded, we obtain \lambda_{\min}\big{(}\bm{J}({\bm{\theta}};t)\big{)}\geq\exp(-Kt). Finally, since φt is a diffeomorphism, we have
6 Proof of Theorems 6: Stability conditions
where H0≡H0(δθ∗)=∇2V(θ∗)+∇1,12U(θ∗,θ∗). Notice that
and therefore ∇1,22U(θ∗,θ∗)⪰0, whence H1⪰H0.
We first establish the condition for ρ∗=δθ∗ to be a fixed point. Note that Ψ(θ;ρ∗)=V(θ)+U(θ,θ∗) and supp(ρ∗)={θ∗}. Hence the condition in the main text is satisfied if and only if ∇θΨ(θ;ρ∗)∣θ=θ∗=0, i.e. ∇V(θ∗)+∇1U(θ∗,θ∗)=0.
To establish the stability result of Theorem 6, the following lemma provides a key estimate.
Under the assumptions of Theorem 6, let λ≡λmin(H0)>0. Then there exists r1,ε1,γ>0 such that the following hold
If supp(ρ)⊆B(θ∗;r1)≡{θ:∥θ−θ∗∥2≤r1}, then,
If ∫∥θ−θ∗∥22ρ(dθ)≤ε12 and supp(ρ)⊆B(θ∗;r1), then for any θ∈B(θ∗;r1)∖B(θ∗;r1/2),
Since ∇2V(θ) is continuous and ∇12U(θ,θ′) is bounded continuous, it follows that θ↦∇2Ψ(θ;ρ) is continuous, and ρ↦∇2Ψ(θ;ρ) is continuous in the weak topology, and in fact (θ,ρ)↦∇2Ψ(θ;ρ) is continuous in the product topology.
Since H0≻0 strictly, for any δ>0 we can choose r1=r1(δ)>0 such that
for all θ∈B(θ∗;r1), and ρ such that supp(ρ)⊆B(θ∗;r1). If these conditions hold
In order to bound the second term, note that, since ∇Ψ(θ∗;ρ∗)=0,
where Q=∫(θ−μ)(θ−μ)Tρ(dθ) is the covariance of (θ−θ∗).
Let now consider the claim at point (i). Integrating Eq. (7.103) with respect to ρ(dθ), we get
By choosing δ sufficiently small, we can ensure that (1−δ)H0−(δ/2)I⪰λmin(H0)I/2, H1−δH0−(3δ/2)I⪰λmin(H1)I/2, and therefore
Next consider point (ii). In this case, Eq. (7.107) implies
Substituting in Eq. (7.103), and using ∥μ∥2≤ε1, we get
This is strictly positive for all ε1 small enough, hence implying the claim (7.92). ∎
We are now in position of proving Theorem 6.
Let r0=min(r1/2,ε1/2) and assume, without loss of generality t0=0, so that supp(ρ0)⊆B(θ∗;r0). We also define
As usual, we adopt the convention that the infimum of an empty set is equal to +∞.
Define φ1(θ)=h(∥θ−θ∗∥2), with h to be an non-decreasing function with
where, in the last inequality, we used Lemma 7.12.(ii). Next, define
Applying again Eq. (7.1), we get, for t≤T∗,
Together the last two bounds imply T∗=∞. Indeed assume by contradiction T∗<∞. Then either T1≤T2, T1<∞, or T2<T1, T2<∞.
Consider the first case: T1≤T2, T1<∞. Since ⟨ρT1,φ2⟩≥ε12 but ⟨ρ0,φ2⟩≤r02≤ε12/4, there exists t<T∗ such that ∂t⟨ρ0,φ2⟩>0. However this contradicts Eq. (7.126). Consider then the second case: T2<T1, T2<∞. This implies ⟨ρT2,φ1⟩>0, but on the other hand ⟨ρ0,φ1⟩=0. Hence, there exists t<T∗ such that ∂t⟨ρ0,φ1⟩>0. However this contradicts Eq. (7.122).
We conclude that T∗=∞ and hence we can apply Eq. (7.126) for any t, thus obtaining ∂t⟨φ2,ρt⟩≤−λ⟨φ2,ρt⟩ and hence ⟨φ2,ρt⟩≤(r02/2)e−λt, which concludes the proof. ∎
7 Proof of Theorem 7: Instability conditions
In this section we will prove the instability result of Theorem 7. Throughout the section, we assume ξ(t)≡1/2. We will use several times the nonlinear dynamics, defined for ρt a solution of Eq. (7.1) with initial condition ρ0:
where Pu⊥=I−uuT is the projector orthogonal to vector u.
First consider the case d=1: in this case, the assumption ν(B(x0;r))≥1−ε is not required. Denote by F the distribution function associated to ν (i.e. F(x)≡ν((−∞,x])). By assumption F is differentiable with F′(x)≤M. In order to construct the desired coupling, let Z be a random variable uniformly distributed in $.Forasmallconstant\xi_{0}>0,definetherandomvariables(X_{1},X_{2})$ by letting
(Note that X2 is not defined for Z=ξ0 but this is a zero-probability event.) On the event {Z>ξ0} (which has probability 1−ξ0), we have, for some W∈[X1,X2],
By choosing ξ0 small enough, this proves the claim for d=1.
Consider next d>1 and assume without loss of generality u=e1.
Let ν(⋅)=ν(⋅∣X∈B(x0;r)), Xab≡(Xa,…,Xb), and denote by f1∣[2,d] the density of ν(X1∈⋅∣X2n), and by f[a,b] the density of ν(Xab∈⋅). We then have
In order to construct the coupling, we sample Z∼ν. If Z∈B(x0;r), then we take X1=X2=Z. If Z∈B(x0;r) and maxx1f1∣[2,d](x1∣Z2d)>M/Δ, we also take X1=X2=Z. Otherwise we have Z∈B(x0;r) and maxx1f1∣[2,d](x1∣Z2d)≤M/Δ, then we sample (X1,1,X2,1) from the coupling developed in the case d=1 applied to f1∣[2,d](⋅∣Z2d), and set X1=(X1,1,Z2d), X2=(X2,1,Z2d). Now define γ to be the joint distribution of X1,X2. Then γ is a coupling of ν with itself.
Hence, we can choose Δ,ξ0 small enough so that the claim (7.128) holds. ∎
By choosing ε0,# small enough, we can ensure ∥∇Ψ(θ;ρt)−∇Ψ(θ;ρ∗)∥2≤g0/3 for all θ and all t≥t0.
for all t≥t0. Let ρt0 be the conditional probability measure of ρt0 given θ∈B(θ∗;r0). By Lemma 7.11, ρt0 has a density upper bounded by a constant M=M(ε0,t0) (note that ρt0(S)≤ρt0(S)/(p∗−ε0)).
Set H0=H0(ρ∗)=∇2Ψ(θ∗;ρ∗). Since θ∗ is a critical point of θ↦Ψ(θ;ρ∗), for any δ>0, we can find r1(δ)>0 such that
As shown in the proof of Theorem 6, the function (θ,ρ)↦∇2Ψ(θ;ρ) is continuous when the space of probability distributions ρ is endowed with the weak topology. Analogously ρ↦∇Ψ(θ∗;ρ) is continuous in the weak topology. Hence for this δ>0 and r1(δ)>0, there exists ε0,∗(δ,r1)>0 small enough such that, the following inequalities hold
Let us emphasize that r1 depends on δ but can be taken to be independent of ε0. Further, by an application of the intermediate value theorem, for all θ∈B(θ∗;r1),
For r0<r1, θt0∈B(θ∗;r0), we let (θt)t≥t0 be the solution of Eq. (7.127) with this initial condition. We then define
Under the conditions of Theorem 7, there exists r1>0 and ε0,∗>0 such that, for all r0≤r1, ε0≤ε0,∗, there exists T\mboxUB(ε0,r0,r1,t0) such that the following happens. If d\mboxBL(ρt,ρ∗)≤ε0 and ∣ρt(B(θ∗;r0))−p∗∣≤ε0 for all t≥t0 for some t0, then
We fix a δ≤(λ1−λ2)/10. Then we choose r1>0 and ε0,∗>0 such that Eq. (7.144) holds, with an additional requirement that ε0,∗<p∗/10. We will prove this lemma with this choice of r1 and ε0,∗.
We always denote (θit)t≥t0 to be the solution of Eq. (7.127) with initial condition θit0, for i=1,2. First we claim that, for 0<δ≤(λ1−λ2)/10, assuming
then for any θ1t0,θ2t0∈B(θ∗;r1) with P⊥(θ1t0−θ2t0)=0, we have
for all t∈[t0,t\mboxexit(θ1t0,r1)∧t\mboxexit(θ2t0,r1)].
For now we assume this claim holds. Fix r0≤r1 and ε0≤ε0,∗. Define γ to be the coupling of Lemma 7.13 corresponding to u which is the eigenvector corresponding to the least eigenvalue of H0, and ν=ρt0 which is the conditional measure of ρt0 given θt0∈B(θ∗;r0). Note ρt0 has a density upper bounded by a constant M=M(ε0,t0). By Lemma 7.13, we have γ(E)≥9/10, where
for some Z=Z(ε0,r0,t0)>0. Now we take (θ1t0,θ2t0)∈E. Note the assumption of this lemma gives d\mboxBL(ρt,ρ∗)≤ε0≤ε0,∗ for all t≥t0. According to Eq. (7.144), we have Eq. (7.149) holds, and due to this claim, we have ∥θ1t−θ2t∥2≥(1/Z)eλ1(t−t0)/2 for all t∈[t0,t\mboxexit(θ1t0,r1)∧t\mboxexit(θ2t0,r1)].
Define T\mboxUB(ε0,r0,r1,t0)=(2/λ1)log(2Zr1). Then for t>T\mboxUB, we have ∥θ1t−θ2t∥2≥2r1. This is impossible if θ1t,θ2t∈B(θ∗;r1) and hence we deduce (t\mboxexit(θ1t0,r1)∧t\mboxexit(θ2t0,r1))≤T\mboxUB for all (θ1t0,θ2t0)∈E.
Denoting by S the event in the last expression, we obtain ρt0(S)≥(p∗−ε0)ρt0(S)≥(9/20)(p∗−ε0)≥p∗/3 by noting that ε0<p∗/10.
Proof of the claim. Define the quantities
We then have, for t∈[t0,t\mboxexit(θ1t0,r1)∧t\mboxexit(θ2t0,r1)],
Summarizing, we obtained the inequalities
The matrix of coefficients on the right-hand side is
This has a (un-normalized) left eigenvectors (1,−v), (−v,1) with eigenvalues ξ± given by:
Note we took δ<(λ1−λ2)/10, we have v>0, and ξ+≥λ1.
Multiplying the inequalities (7.154), (7.155) by (1,−v), we thus obtain
Since we assumed x⊥(t0)=0, whence, for all t∈[t0,t\mboxexit(θ1t0,r1)∧t\mboxexit(θ2t0,r1)], we have
We next strengthen the last lemma and prove that trajectories that exit B(θ∗;r1) do not re-enter B(θ∗;r0).
Under the conditions of Theorem 7, there exists r0,∗,r1>0 (with r0,∗<r1) and ε0,∗>0 such that, for all r0≤r0,∗, ε0≤ε0,∗, there exists T\mboxUB(ε0,r0,r1,t0) such that the following happens. If d\mboxBL(ρt,ρ∗)≤ε0 and ∣ρt(B(θ∗;r0))−p∗∣≤ε0 for all t≥t0 for some t0, then
Let P+ be the projector onto the eigenspace of −H0 corresponding to positive eigenvalues, and P− the projector onto the subspace corresponding to negative eigenvalues, and let λ0≡mini≤D∣λi(H0)∣ to be the least absolute value of eigenvalue of H0. By condition B1 of Theorem 7, we have λ0>0. Let λmax denote the largest absolute value of eigenvalue of H0.
Fix a δ such that 0<δ≤min{λ0/(1+λ0+λmax),λ0/λmax,λ1−λ2,1}/10, where λ1,λ2 are as defined in Lemma 7.15. Next we choose r1 as per Lemma 7.15, and we further require λ0r12≤η0, where η0 is as per condition B3 in the statement of Theorem 7. We take ε0,∗ to be the minimum of the parameter ε0,∗ as per Lemma 7.15 and the parameter ε0,# as per Lemma 7.14, where in Lemma 7.14, we choose u=Ψ(θ∗;ρ∗)−λ0r12/8, and Δ=λ0r12/8. Then we will choose smaller r1 and ε0,∗ so that Eq. (7.144) holds. Finally, we take r0,∗=δr1<r1. We will prove this lemma with this choice of r1, ε0,∗, and r0,∗, and with the same function T\mboxUB as per Lemma 7.15.
We bound the evolution of these quantities following the same argument used above for x∥(t), x⊥(t). Namely
For t∈[t∗(θt0;r1,δ),t\mboxexit(θt0;r1)], we have z+(t)+z−(t)≥δr1. Using the inequality a(a+b)≤a+b holding for non-negative a and b, we have
Proceeding analogously for z−, we arrive at the inequalities
Since δ≤λ0/(10(1+λ0+λmax)), we can ensure that Ψ(θt\mboxexit;ρ∗)≤Ψ(θ∗;ρ∗)−λ0r12/4. By Lemma 7.14, since d\mboxBL(ρt,ρ∗)≤ε0,∗≤ε0,# for all t≥t0, we have Ψ(θt;ρ∗)≤Ψ(θ∗;ρ∗)−λ0r12/8 for all t≥t\mboxexit(θt0;r1). Note for all θ∈B(θ∗;δr1), we have Ψ(θ;ρ∗)≥Ψ(θ∗;ρ∗)−λmaxδ2r12/2. Since δ≤λ0/λmax/10, we have θt∈B(θ∗;δr1) for all t≥t\mboxexit(θt0;r1).
This implies that, for any θt0∈B(θ∗;r0) for r0≤r0,∗ with t\mboxexit(θt0,r1)≤T\mboxUB(ε0,r0,r1,t0)<∞, it will never return to B(θ∗;r0). This gives the desired result. ∎
Finally we upper bound the probability that θt∈B(θ∗;r0) for some t>t0, given that θt0∈B(θ∗;r0). We define
Under the conditions of Theorem 7, for any η>0, there exists r0,∗>0 and ε0,∗>0 such that, for all r0≤r0,∗, ε0≤ε0,∗, the following happens. If d\mboxBL(ρt,ρ∗)≤ε0 and ∣ρt(B(θ∗;r0))−p∗∣≤ε0 for all t≥t0 for some t0, then
Since we also had ρt(B(θ∗;r0))≥p∗−ε0 for all t≥t0, note η,ε0≤p∗/10, we reached a contradiction.
Centered isotropic Gaussians
In this section we consider the centered isotropic Gaussians example discussed in the main text. That is, we assume the joint law of (y,x) to be as follows:
With probability 1/2: y=+1, x∼N(0,(1+Δ)2Id).
With probability 1/2: y=−1, x∼N(0,(1−Δ)2Id).
x↦σ(x) is bounded, non-decreasing, Lipschitz continuous. Its weak derivative x↦σ′(x) is Lipschitz in a neighborhood of .
q is analytic on (0,∞) with supr∈[0,∞]q′′(r)<∞.
q′(r)>0 for all r∈(0,∞), with supr∈[0,∞]q′(r)<∞, and limr→0q′(r)=limr→∞q′(r)=0.
−∞<q(0+)<−1, 1<q(+∞)<∞, and −1<(q(0+)+q(+∞))/2<1.
Letting Z(r)≡q′(τ−r)/q′(τ+r) for some τ+>τ−>0 we have Z′(r)>0 for all r∈(0,∞).
Note that condition S1 and part of S2 are implied by S0, but we list them here for conveniency. Some of these assumptions can be relaxed at the cost of extra technical work. In the interest of simplicity, we prefer to avoid being overly general.
In particular, we choose s1=−2.5, s2=7.5, t1=0.5, t2=1.5 in our simulations. In section 8.5, we check that this choice satisfies the above assumptions.
Throughout this section, we set τ±=(1±Δ) and q+(r)=q(τ+r), q−(r)=q(τ−r). Also, we will assume ξ(t)=1/2, since other choices of ξ(⋅) merely amounts to a time reparametrization.
Before analyzing our model, we introduce the function space and space of probability measures we will work on. We equip the set [0,∞] with a metric dˉ, where dˉ(x,y)=∣1/(1+x)−1/(1+y)∣ for any x,y∈[0,∞]. Then ([0,∞],dˉ) is a compact metric space, and we will still denote it by [0,∞] for simplicity in notations. We denote Cb([0,∞]) to be the set of bounded continuous functions on [0,∞], where continuity is defined using the topology generated by dˉ. More explicitly, we have isomorphism
Because of condition S2 and S3, we have q,q′∈Cb([0,∞]).
Let \mathscrsfsP([0,∞]) be the set of probability measures on [0,∞]. Due to Prokhorov’s theorem, there exists a complete metric dˉP on \mathscrsfsP([0,∞]) equivalent to the topology of weak convergence, so that (\mathscrsfsP([0,∞]),dˉP) is a compact metric space. In this section, we will denote by \mathscrsfsP=\mathscrsfsP([0,∞]).
Since the distribution of x is invariant under rotations for each of the two classes, so are the functions
where μ\mboxHaar is the Haar measure over the group of orthogonal rotations. Since ρ↦R(ρ) is convex, R(ρs)≤R(ρ).
We therefore restrict ourselves to ρ’s that are invariant under rotations. In other words, under ρ, the vector w is uniformly random conditional on ∥w∥2. We denote by ρ the probability distribution of ∥w∥2 when w∼ρ and we let Rd(ρ) denote the resulting risk. We then have
where Θ∼(1/Zd)sind−2θ⋅1{θ∈[0,π]}dθ.
As d→∞, we have limd→∞ud(r1,r2)=u∞(r1,r2) (uniformly over compact sets), with
For d=∞, we have the simpler expression
The following theorem provides a characterization of global minimizers of Rd(ρ).
ρ∗ is a global minimizer of Rd(ρ) if and only if supp(ρ∗)⊆argminrψd(r;ρ∗).
In particular, ρ∗=δr∗ is a global minimizer or Rd(ρ) if and only if v(r)+ud(r,r∗)≥v(r∗)+u(r∗,r∗) for all r.
Point 1 is essentially a special case of the second part of Proposition 1 in the main text (cf. Eq. (6.7)) and follows by the same argument. Point 2 is follows by taking ρ∗=δr∗. ∎
Given the last result, it is interesting to understand whether the optimal radial distribution ρ∗ is a single point mass or not. Under the ansatz ρ=δr (a single point mass at radius r) we obtain an effective risk Rd(1)(r)≡Rd(δr) defined by Rd(1)(r)=1+2v(r)+ud(r,r), which is plotted in Figure 11.6 for the case of our running example (8.1), and Δ=0.4.
Let r∗=r∗(Δ,d) be the minimizer of Rd(1)(r), and define, for d≤∞,
In the case d=∞, the minimization problem simplifies further. Either the minimum risk is , or it is achieved at a point mass ρ∗=δr∗.
Consider d=∞. Recall \mathscrsfsP=\mathscrsfsP([0,∞]). In this case Δ∞ defined as per Eq. (8.16) is such that Δ∞∈(0,1). Further
For Δ<Δ∞, infρ∈\mathscrsfsPR∞(ρ)>0 and the unique global minimizer of risk function R∞(ρ) is a point mass located at some r∗(Δ)∈(0,∞).
For Δ≥Δ∞, all global minimizers of risk function R∞(ρ) have risk zero, and there exists a global minimizer that has compact support bounded away from .
Recall the definitions q+(r)=q(τ+r) and q−(r)=q(τ−r). Further, we define the set Γ⊆ by
According to condition S3, for Δ=1, we have q−(r)=q(0)<−1 and q+(+∞)=q(+∞)>+1. Since q is continuous, it is easy to see that there exists an ε>0, such that [1−ε,1]⊆Γ. Further, for Δ=0 we have q+(r)=q−(r). By continuity, there exists an ε>0, such that [0,ε]∈∖Γ.
Since q is an increasing function, we have
By the remarks above, we have 0<Δ∞<1. Notice that this definition does not coincide with the one in Eq. (8.16). However, the proof below (together with Proposition 4) implies that the two definitions actually coincide.
Step 1. Prove that infρ∈\mathscrsfsPR∞(ρ)>0 as Δ<Δ∞.
First, we consider the optimization problem
We claim that, for Δ<Δ∞ we have f∗<0. Indeed, for any λ∈[0,+∞), we have the following upper bound
Since q+−λq−∈Cb([0,+∞]), then L(⋅,λ) is continuous in ρ in weak topology. By the compactness of \mathscrsfsP, the supremum of L(⋅,λ) is attained by some ρλ∈\mathscrsfsP. This ρλ should satisfy
Let h(r)≡q+(r)−λq−(r). Note the supremum of h should either satisfy
for r∈(0,∞), or the supremum should be attained at the boundary or +∞. According to condition S4, [q−′(r)/q+′(r)]′>0 for r∈(0,∞), the equation (8.21) has at most one solution r∗∈(0,∞).
Assume that there exists r∗∈(0,∞) such that h′(r∗)=0. Then we have h′(r)>0 for 0<r<r∗, and h′(r)<0 for r∗<r<+∞, whence supp(ρλ)={r∗}. If h′(r)=0 does not have a solution in (0,∞), the only supremum of h(r) could be achieved at or +∞. Therefore, supp(ρλ)={0} or supp(ρλ)={+∞}. This concludes that, for any λ∈[0,+∞), supρ∈\mathscrsfsPL(ρ,λ) is achieved by a point mass. Therefore, we have
For Δ<Δ∞, the right hand side of the above inequality is less than . Therefore, we cannot have a probability distribution ρ such that ⟨q+,ρ⟩=1 and ⟨q−,ρ⟩=−1. The infimum of the risk cannot be .
Step 2. Show that the global minimizer should be a delta function for Δ<Δ∞.
According to Proposition 1, the global minimizer ρ∗∈\mathscrsfsP should satisfy
with ψ∞ given in Eq. (8.12).
As proved in the last step, as Δ<Δ∞, we cannot have both λ+(ρ∗)=0 and λ−(ρ∗)=0. The argument given above also implies that ψ∞(r;ρ∗) is minimized at a unique point, and hence the support of ρ∗ should be a single point. This proves the first part of the theorem.
For Δ≥Δ∞, there exists r>0, such that q(τ+r)≥1, and q(τ−r)≤−1. Therefore, there exists r∗>0 such that q(τ+r∗)−1=−1−q(τ−r∗)=ε∗≥0. Consider the following probability measure on [0,+∞],
It can be checked that R∞(ρ∗)=0.
We would like to show further that there exists a global minimizer that is compactly supported. We construct this global minimizer as following. First, define
Then we know that q−(r0)=−1 and q+(r0)≥1. Now for any 0≤r≤r0, define u(r)=q−−1(−2−q−(r)). According to condition S3, we have −1<[q(0)+q(+∞)]/2<1, then u(r) is well defined on [0,r0]. It is easy to see that u(r0)=r0, and [q−(r)+q−(u(r))]/2=−1 for any 0≤r≤r0. Now we consider the function z(r)=[q+(r)+q+(u(r))]/2−1. Note that z(r0)>0, and z(0)≤[q(0)+q(∞)]/2−1<0. Therefore, there exists r∗ satisfying 0<r∗≤r0 such that z(r∗)=0. Consider the following probability measure on (0,+∞),
It is easy to see that R∞(ρ∗)=0. ∎
2 Dynamics: Fixed points
We specialize the general evolution (7.1) to the present case. Assuming ρ0 to be spherically symmetric, then ρt is spherically symmetric for any t≥0. We let ρt denote the distribution of ∥w∥2 when w∼ρt. This satisfies the following PDE:
We will view this as an evolution in the space of probability distribution on the completed half-line \mathscrsfsP([0,∞]).
In analogy with Proposition 2, we can prove the following characterization of fixed points.
A distribution ρ∈\mathscrsfsP([0,∞]) is a fixed point of the PDE (8.22) if and only if
Notice, in particular, global minimizers of Rd(ρ) are fixed points of this evolution, but not vice-versa. The next result classifies fixed points.
Consider d=∞ and recall the definition of λ+(ρ) and λ−(ρ) given by Eqs. (8.13) and (8.14). Then the fixed points of the PDE (8.22) (i.e. the probability measures ρ∈\mathscrsfsP([0,∞]) satisfying (8.23)) are of one of the following types
A point mass ρr∗=δr∗ at some location r∗∈{0,+∞}, but not of type (a).
A mixture of the type ρ=a0δ0+a∞δ+∞+aδr∗, but not of type (a) or (b).
For Δ<Δ∞, the PDE has a unique fixed point of type (b), with λ+(ρ∗)<0 and λ−(ρ∗)>0; it has no type-(a) fixed points; it has possibly fixed points of type (c).
For Δ>Δ∞, the PDE has some fixed points of type (b), with λ+(ρ∗)>0 and λ−(ρ∗)<0; it also has some type-(a) fixed points; it has possibly fixed points of type (c).
For Δ=Δ∞, the PDE has a unique fixed point of type (a) which is also a delta function at some location r∗, and no type (b) fixed points; it has possibly fixed points of type (c).
We use the characterization of fixed points in Proposition 5. Recall that ψ∞(r;ρ∗) is defined as in Equation (8.12). The derivative ∂rψ∞(r;ρ) gives
If a fixed point has λ+(ρ∗)=λ−(ρ∗)=0, then R∞(ρ∗)=0. This is type-(a) fixed point. Consider then the case (λ+(ρ∗),λ−(ρ∗))=(0,0). For the same reason as in the proof of Theorem 8, we conclude that ∂rψ∞(r;ρ∗) has at most three zeros, two of which are located at and +∞. This proves that all fixed points are of type (a), (b) or (c).
We already proved in Theorem 8 that, for Δ<Δ∞, infρR∞(ρ)>0. Therefore, for Δ<Δ∞, there is no type (a) fixed points.
We next prove that, as Δ<Δ∞, fixed point of type (b) is always unique. The location of the delta fixed point should satisfy
Note that ∂rψ∞(r∗;δr∗)<0 for r>0 small enough, and ∂rψ∞(r∗;δr∗)>0 for r large enough, whence this equation has at least one solution r∗∈(0,∞). In order to prove that it has a unique solution in (0,+∞), define r+≡inf{r:q+(r)≥1} and r−≡inf{r:q−(r)≥−1}. Note that q+′(r∗),q−′(r∗)>0 and that, in order to satisfy Eq. (8.25), the terms λ+(δr∗)=1/2⋅(q+(r∗)−1) and λ−(δr∗)=1/2⋅(q−(r∗)+1) must have opposite signs. For Δ<Δ∞, we must have λ+(δr∗)<0 and λ−(δr∗)>0, and all stationary points should be within [r−,r+]. Note that q−′(r)/q+′(r) is strictly increasing, and [1−q+(r)]/[1+q−(r)] is decreasing on [r−,r+]. Therefore, the fixed point of type δr∗ with r∗∈(0,∞) is unique.
For Δ>Δ∞, we must have λ+(ρ∗)>0 and λ−(ρ∗)<0, and all solutions should be within [r+,r−]. There could possibly be multiple fixed points of type δr∗ with r∗∈[r+,r−].
If Δ=Δ∞, it is easy to see that, ρ∗=δr∗ at some r∗∈(0,∞) is the unique fixed point with zero risk, and the unique fixed point as a point mass. ∎
3 Dynamics: Convergence to global minimum for d=∞𝑑d=\infty
In this section, denote \mathscrsfsP\mboxgood to be
We then prove that the d=∞ dynamics converges to a global minimizer from any initialization in \mathscrsfsP\mboxgood.
Consider the PDE (8.22) for d=∞, with initialization ρ0∈\mathscrsfsP\mboxgood. It has a unique solution (ρt)t≥0, such that
Without loss of generality, we assume ξ(t)=1/2. First we show the existence and uniqueness of solution of the PDE.
Step 1. Existence and uniqueness of solution. Mass ρt((0,∞))=1 for all t.
According to conditions S1 - S3, q(r), q′(r), and q′′(r) are uniformly bounded on [0,∞]. Recall that
Hence v′(r),∂1u∞(r1,r2),v′′(r),∂112u∞(r1,r2),∂122u∞(r1,r2) are uniformly bounded. Recall we further assumed ξ(t)≡1/2. Therefore, conditions A1 and A3 are satisfied with D=1, V=v, and U=u. By Remark 7.1, there is the existence and uniqueness of solution of PDE (8.22) for d=∞. Denote this solution to be (ρt)t≥0.
Recall the formula of ∂rψ∞(r;ρ) given in Equation (8.24), it is easy to see that the assumption of Lemma 7.9 is satisfied with d=1 and Ψ=ψ∞. Hence, we have ρt((0,∞))=1 for any t<∞.
Step 2. Classify the limiting set S∗.
Recall the definition of (\mathscrsfsP([0,+∞]),dˉP) at the beginning of Section 8. Since (\mathscrsfsP([0,+∞]),dˉP) is a compact metric space, and (ρt)t≥0 is a continuous curve in this space, then there exists a subsequence (tk)k≥1 of times, such that (ρtk)k≥1 converges in metric dˉP to a probability distribution ρ∗∈\mathscrsfsP([0,+∞]).
Analogously to Proposition 2 (using Eq. (8.22)), we have
Since R∞(ρt)≥0, we have
Recall the definition of λ+(ρ) and λ−(ρ) given by Eq. (8.13) and (8.14). Since q∈Cb([0,∞]), we have
Note ∂rψ∞(r;ρ) is given by Eq. (8.24), and q′∈Cb([0,+∞]), hence
In other words, any limiting point ρ∗ of the PDE is a fixed point of the PDE (8.22).
Note R∞(ρ)=1/2⋅[λ+(ρ)2+λ−(ρ)2], we have
Note R∞(ρt) is decreasing with t, hence
Let S∗=S∗(ρ0) be the set of all limiting points of the (ρt)t≥0,
Due to Lemma 7.10, S∗ is a connected compact set. Since R∞(ρt) is decreasing as t increases, we have R∞(ρ∗)≡R∗ is a constant for all ρ∗∈S∗. Since we assumed R∞(ρ0)<1, and R∞(ρt) is decreasing in t, we have R∗<1.
Let ρ∗ be a fixed point of PDE such that λ+(ρ∗)≥0,λ−(ρ∗)≥0 or λ+(ρ∗)≤0,λ−(ρ∗)≤0 but not both λ+(ρ∗) and λ−(ρ∗) equal . In this case, according to Eq. (8.24), ∂rψ∞(r;ρ∗) must be strictly increasing or strictly decreasing in r. Since supp(ρ∗)⊆{r∈[0,∞]:∂rψ∞(r;ρ∗)=0}, ρ∗ must be a combination of two delta functions located at and +∞, i.e., ρ∗=a0δ0+(1−a0)δ∞. But for a fixed point of this type, it is easy to see that R∞(ρ∗)≥1. Such fixed points ρ∗ cannot be one of the limiting points of the PDE since R∞(ρ0)<1.
Since S∗ is a connected set, L(S∗) should also be a connected set. Further notice that R∞(ρ∗)=1/2⋅[λ+(ρ∗)2+λ−(ρ∗)2], and R∞(ρ1)=R∞(ρ2) for any ρ1,ρ2∈S∗. Therefore, we can only have L(S∗)⊆P2≡{(λ+,λ−):λ+>0,λ−<0}, or L(S∗)⊆P1≡{(λ+,λ−):λ+<0,λ−>0}, or L(S∗)={(0,0)}.
Step 3. Finish the proof using two claims.
Claim (1). If L(S∗)⊆P1, then for any ρ∗∈S∗, we have ρ∗((0,∞))=1.
Claim (2). We cannot have L(S∗)⊆P2.
Here we assume these two claims hold, and use them to prove our results. For Δ<Δ∞, we proved in Theorem 9 that, there is not a fixed point such that L(ρ∗)=(0,0). Therefore, we cannot have L(S∗)={(0,0)}. Due to Claim (2), we cannot have L(S∗)⊆P2. Hence, we must have L(S∗)⊆P1. According to Theorem 9, for Δ<Δ∞, the only fixed point of PDE with ρ∗((0,∞))=1 is a point mass at some location r∗. Furthermore, this delta function fixed point is unique and is also the global minimizer of the risk. Therefore, we conclude that, as Δ<Δ∞, the PDE will converge to this global minimizer.
For Δ≥Δ∞, according to Claim (1), if ρ∗ is a limiting point such that L(ρ∗)∈P1, then ρ∗((0,∞))=1. According to Theorem 9, a fixed point ρ∗ with ρ∗((0,∞))=1 and L(ρ∗)=(0,0) must be a point mass at some location r∗, with L(ρ∗)∈P2. Therefore, we cannot have L(S∗)⊆P1. Claim (2) also tells us that we cannot have L(S∗)⊆P2. Hence, we must have L(S∗)={(0,0)}. In this case, all the points in the set S∗ have risk . Therefore, we conclude that, as Δ≥Δ∞, the PDE will converge to some limiting set with risk .
We are left with the task of proving the two claims above. Before that, we introduce some useful notations. Recall Z(r)=q−′(r)/q+′(r) for r∈(0,+∞). According to condition S4, Z′(r)>0 for r∈(0,+∞). This implies that Z(0+)≡Z0≥0 and Z(+∞)≡Z∞≤∞ exist. We rewrite ∂rψ∞(r;ρ) as
Proof of Claim (1). If L(S∗)⊆P1, then for any ρ∗∈S∗, we have ρ∗({0,∞})=0.
Assume L(S∗)⊆P1. Then, we must have L(S∗)⊆P1∩{(λ+,λ−):Z0<−λ+/λ−<Z∞}. Otherwise suppose there exists ρ∗∈S∗, such that −λ+(ρ∗)/λ−(ρ∗)≥Z∞ or −λ+(ρ∗)/λ−(ρ∗)≤Z0, according to Eq. (8.28), ψ∞(r;ρ∗) must be strictly increasing or strictly decreasing in r. Since supp(ρ∗)⊆{r∈[0,∞]:∂rψ∞(r;ρ∗)=0}, then ρ∗ must be a combination of two delta functions located at and +∞. But such ρ∗ must have R∞(ρ∗)≥1, and thus ρ∗ cannot be a limiting point of the PDE. Hence the claim that L(S∗)⊆P1∩{(λ+,λ−):Z0<−λ+/λ−<Z∞} holds.
Since S∗ is a compact set, and L is a continuous map, then L(S∗) is a compact set. Therefore, there must exist ε0>0, so that for any ρ∗∈S∗, we have Z0+3ε0<−λ+(ρ∗)/λ−(ρ∗)<Z∞−3ε0. For this ε0>0, since S∗ contains all the limiting points of PDE starting from ρ0, there exists t0 large enough, so that as t≥t0, we have Z0+2ε0<−λ+(ρt)/λ−(ρt)<Z∞−2ε0, and λ+(ρt)<0, λ−(ρt)>0. For the same ε0, since Z(r) is continuous at and +∞, there exists 0<r0<r∞<∞, so that Z(r)<Z0+ε0 for r∈(0,r0), and Z(r)>Z∞−ε0 for r∈(r∞,∞). Therefore, for any t≥t0, ∂rψ∞(r;ρt)<0 for any r∈(0,r0), and ∂rψ∞(r;ρt)>0 for any r∈(r∞,+∞).
As a result, according to the equation (8.28), we must have ∂rψ∞(r;ρt)<0 for any r∈(0,r0) and t≥t0, and ∂rψ∞(r;ρt)>0 for any r∈(r∞,∞) and t≥t0.
Due to Lemma 7.9, ρt0((0,∞))=1. Denoting Ωk=[1/k,k], then limk→∞ρt0(Ωk)=1. With this choice of Ωk, for any k≥{r∞,1/r0}, and for any t≥t0, we have ⟨∂rψ∞(r;ρt),n(r)⟩>0 for r∈∂Ωk where n(r) is the normal vector point outside Ωk. Therefore, if we consider the ODE
starting with r(t0)∈Ωk, r(t) cannot leak outside Ωk from either boundaries of Ωk, and we must have r(t)∈Ωk for any t≥t0. Due to Lemma 7.8, ρt(Ωk)≥ρt0(Ωk) for any t≥t0. As a result, we conclude that for any ρ∗∈S∗,
Note ∪kΩk=(0,∞). This gives ρ∗({0,∞})=0, which proves Claim (1).
Proof of Claim (2), step (1). If L(S∗)⊆P2, then S∗ must be a singleton.
In the case L(S∗)⊆P2, the argument is similar to the proof of Claim (1), and hence will be presented in a synthetic form. First, we must have L(S∗)⊆P2∩{(λ+,λ−):Z0<−λ+/λ−<Z∞}. Therefore, there must exist ε0>0, so that for any ρ∗∈S∗, we have Z0+3ε0<−λ+(ρ∗)/λ−(ρ∗)<Z∞−3ε0. For this ε0>0, there exists t0 large enough, so that as t≥t0, we have Z0+2ε0<−λ+(ρt)/λ−(ρt)<Z∞−2ε0, and λ+(ρt)>0, λ−(ρt)<0. Further, there exists 0<r0<r∞<∞, so that ∂rψ∞(r;ρt)>0 for any r∈(0,r0) and t≥t0, and ∂rψ∞(r;ρt)<0 for any r∈(r∞,∞) and t≥t0.
Therefore, if we consider the ODE (8.29) starting with r(t0)∈[0,r0), we must have r(t)∈[0,r0) for any t≥t0; if we start with r(t0)∈(r∞,∞], we must have r(t)∈(r∞,∞] for any t≥t0. Due to Lemma 7.8, {ρt([0,r))}t≥t0 for 0<r≤r0 and {ρt((r,+∞])}t≥t0 for r≥r∞ must be non-decreasing in t. According to Theorem 9, we can express ρ∗∈S∗ in the form ρ∗=a0(ρ∗)δ0+a∞(ρ∗)δ∞+a(ρ∗)δr∗. By the stated monotonicity property, for any ρ1,ρ2∈S∗, it holds that a0(ρ1)=a0(ρ2), a∞(ρ1)=a∞(ρ2), and hence a(ρ1)=a(ρ2). We denote them in short as a0, a∞, and a.
For any such fixed point ρ∗∈S∗, since we must have supp(ρ∗)⊆{r:∂rψ∞(r;ρ∗)=0}, r∗∈(0,+∞) should be a solution of ϕ(r)=0 where
Proof of Claim (2), step (2). If ρ∗ is a fixed point with L(ρ∗)∈P2, then ρ∗ is unstable.
We apply Theorem 7 to ρ∗=a0δ0+a∞δ∞+aδr∗. We will check the conditions of Theorem 7 to show that this type of fixed point is unstable.
First we check condition B1. Since [q−′(r)/q+′(r)]′>0 and q+′(r)>0 for r∈(0,+∞), we have
Note the stationary condition of the PDE implies
and λ+(ρ∗)>0, λ−(ρ∗)<0. Combined with the equation above, we have
This verifies condition B1 of Theorem 7.
Second, since λ+(ρ∗)>0 and λ−(ρ∗)<0, according to Equation (8.28), we must have ∂rψ∞(r;ρ∗)>0 for r∈(0,r∗), and ∂rψ∞(r;ρ∗)<0 for r∈(r∗,∞). Therefore, we have ψ∞(0;ρ∗)<ψ∞(r∗;ρ∗) and ψ∞(+∞;ρ∗)<ψ∞(r∗;ρ∗). Note L(η)≡{r:ψ∞(r;ρ∗)≤ψ∞(r∗;ρ∗)−η}. For any η>0 small enough, ρ∗(L(η))=1−a, which verifies condition B2. It is also easy to see that, for any η>0, ∂L(η) is a compact set, hence condition B3 holds. Note that we assumed further that ρ0 has a bounded density with respect to Lebesgue measure, all the assumptions of Theorem 7 are satisfied. Theorem 7 implies that the PDE cannot converge to ρ∗. As a result, we conclude that we cannot have L(S∗(ρ0))⊆P2 for ρ0∈\mathscrsfsP\mboxgood. This proves Claim (2).
4 Proof of Theorem 1
The key step consists in proving that the dynamics for large but finite d is well approximated by the dynamics at d=∞. The key estimate is provided by the next lemma.
Assume σ satisfies condition S0, recall the definition of ud and u∞ given by Equation (8.8) and (8.9). Then we have
where (G1,G2)∼N(0,I2), and Θ∼(1/Zd)sin(θ)d−2⋅1{θ∈[0,π]}dθ are mutually independent.
Define G3=G1cosΘ+G2sinΘ, then
According to condition S0, ∥σ′∥∞ and ∥σ∥∞ are bounded, it is sufficient to bound the following quantity uniformly for r∈[0,∞)
Assuming this claim holds, let us show that it implies the desired bound on T(r). We have
We are left with the task of proving Eq. (8.37).
Denote X=G2 and Y=G3 for simplicity in notations. Note that (X,Y)=d(Y,X)=d(−X,−Y). It follows that we can assume, without loss of generality, a>0. We have
Let y∼Unif({−1,+1}), [x∣y=+1]∼N(0,Σ+), [x∣y=−1]∼N(0,Σ−) with τ−2ID⪯Σ+,Σ−⪯τ+2ID for some 0<τ−<τ+<∞. Assume that the activation function σ satisfies condition S0. Define
Then assumptions A2 and A3 are satisfied.
Note that x is sub-Gaussian, and by condition S0 we have σ′ is bounded, then ∇θσ(⟨x,θ⟩)=σ′(⟨x,θ⟩)x is also sub-Gaussian (with sub-Gaussian parameter independent of D). Condition S0 also gives that σ is bounded, therefore assumption A2 is satisfied.
Since ∥σ∥∞,∥σ′∥∞<∞, applying Cauchy-Schwarz inequality, we have ∇V,∇1U,∇122U are uniformly bounded.
It is difficult to bound ∇2V and ∇12U directly because σ′ may not be differentiable. We will use a longer argument to bound them.
First, for a bounded-Lipschitz function f, and for g∈{1,σ}, define
where G∼N(0,Id). Since we have τ−2ID⪯Σ+,Σ−⪯τ+2ID for some 0<τ−<τ+<∞, in order to bound ∇2V and ∇12U, it is sufficient to bound ∇12Wσ,1 and ∇12Wσ,σ.
is uniformly bounded for g=1 or g=σ. Let h=σ−σ0, then h=0 for r∈[−δ0,δ0], and h is K-bounded-Lipschitz for some constant K. It is sufficient to bound ∇12Wh,g for g∈{1,σ}.
Since G is Gaussian, using Stein’s formula, for any unit vector n, we have
Taking directional derivatives of E1 and E2, we have
is uniformly bounded for r∈[0,∞]. Hence E11 is uniformly bounded. Using a similar argument, we can show that each terms E12, E13, E21, E22, and E23 are uniformly bounded.
Now we look at ∇1E3(θ1,θ2,n). We have
In order to bound E32, we apply Stein’s formula to get
We can apply Stein’s formula to the right hand side of the last equation. Using the same argument as above, we obtain that E31 is uniformly bounded.
As a result, ∇2V and ∇12U are uniformly bounded. Therefore, assumption A3 is satisfied.
We are now in position to prove Theorem 1.
First we consider PDE (8.22) for d=∞. We fix an initial radial density ρ0∈\mathscrsfsP\mboxgood. Due to Theorem 10, for any η>0, there exists T=T(η,ρ0,Δ)>0, so that the solution (ρt∞)t≥0 of PDE (8.22) for d=∞ with initialization ρ0 satisfies
Next we would like to bound the difference of R∞(ρ) and Rd(ρ) for any ρ. Note
By Lemma 8.1, there exists d0=d0(η,Δ) large enough, so that for d≥d0, we have
Finally, let (θk)k≥1 be the trajectory of SGD, with step size sk=εξ(kε), and initialization wi0∼iidρ0 for i≤N. We apply Theorem 3 to bound the difference of the law of trajectory of SGD and the solution of PDE (8.53). The assumptions of Theorem 3 are verified by Lemma 8.2. As a consequence, there exists constant K (which depend uniquely on the constants in assumptions A1 A2 A3), such that for any t≤10T, we have
As a consequence, for any δ>0, there exists C0=C0(δ,η,ρ0,Δ), so that as N,1/ε≥C0d and ε≥1/N10, for any t≤10T, we have
Therefore, the trajectory θ⌊t/ε⌋ of SGD as t∈[T,10T] satisfies
with probability at least 1−δ. This gives the desired result.
5 Checking conditions 𝖲𝟢𝖲𝟢{\sf S0}–𝖲𝟦𝖲𝟦{\sf S4} for the running example
The requirements of Lemma 8.3 are not restrictive. An example of parameters that satisfies all conditions gives s1=−2.5, s2=7.5, t1=0.5, t2=1.5.
It is straightforward to see that condition S0 holds. To show condition S1, denote by σ′(r) the weak derivative of σ(r), we calculate the function q′(r) for r>0 explicitly,
Since s1<s2 and 0<t1<t2, it is easy to see that q′(r) is analytic on (0,∞), and hence q(r) is analytic on (0,∞). Differentiating q′(r) in Eq. (8.56), it is easy to see that limr→∞q′′(r)=0, and q′′(0+)=0. Hence, we have supr∈[0,+∞]q′′(r)<∞. Then condition S1 holds.
Since s2>s1, 0<t1<t2, we have q′(r)>0 for r∈(0,+∞), limr→∞q′(r)=0, and q′(0+)=0. Hence, we have supr∈[0,+∞]q′(r)<∞. Then condition S2 holds. Note that q(0)=σ(0)=s1<−1, and q(+∞)=(s1+s2)/2>1. In addition, [q(0)+q(+∞)]/2=(3s1+s2)/4∈(−1,1). Therefore, condition S3 holds.
Finally, we show that condition S4 holds. Define p(r)=exp[−t12/(2r2)]−exp[−t22/(2r2)], which is a positively scaled version of q′(r). To show that for r∈(0,∞),
we only need to show that for r∈(0,∞)
Define x≡t22/(2τ+2r2)>0, s≡τ+2/τ−2>1, 0<c≡t12/t22<1, we have
It is sufficient to show that F2(x;s,c)>0 for x>0, s>1, and 0<c<1. Note that F2(0+;s,c)=0. Hence it is sufficient to show that ∂xF2(x;s,c)>0 for x>0.
Note that s>1 and 0≤c<1, F3(0+;s,c)=0. It is therefore sufficient to show that ∂xF3(x;s,c)>0 for x>0.
Since 0<c<1, s>1, and x>0, we have ∂xF3(x;s,c)>0, and hence condition S4 holds. ∎
Centered anisotropic Gaussians
In this section we consider the centered anisotropic Gaussian example discussed in the main text. That is, we assume the joint law of (y,x) to be as follows:
With probability 1/2: y=+1, x∼N(0,Σ+).
With probability 1/2: y=−1, x∼N(0,Σ−).
We will assume Σ+,Σ+ to be diagonalizable in the same orthonormal basis, and to differ only on a subspace of dimension s0. We want to study whether and how the neural network will identify this subspace of relevant features. Without loss of generality, we can assume that the eigenvalues correspond to the standard basis. In order to focus on the simplest possible model of this type, we will choose:
Because of condition S2 and S3, we have q∘r+,q∘r−,q′∘r+,q′∘r−∈Cb(E2).
Let \mathscrsfsP(E2) be the set of probability measures on E2. Due to Prokhorov’s theorem, there exists a complete metric dˉP on \mathscrsfsP(E2) equivalent to the topology of weak convergence, so that (\mathscrsfsP(E2),dˉP) is a compact metric space. In this section, we will denote by \mathscrsfsP=\mathscrsfsP(E2).
Since the distribution of x is invariant under rotations in first s0 coordinates, and invariant under rotations in last d−s0 coordinates, so are the functions
where μ\mboxHaar is the Haar measure over the group of orthogonal rotations. Since ρ↦R(ρ) is convex, R(ρs)≤R(ρ).
where Θ1∼(1/Zs0)sins0−2θ⋅1{θ∈[0,π]}dθ and Θ2∼(1/Zd−s0)sind−s0−2θ⋅1{θ∈[0,π]}dθ are independent.
As d→∞, we have limd→∞ud(a1,a2,b1,b2)=u∞(a1,a2,b1,b2), with
and the risk function converges to (for a=(a1,a2))
For s0=γ⋅d with 0<γ<1 and d→∞, we have the simpler expression
The following theorem provides a characterization of the global minimizers of R∞(ρ).
Consider d=∞. Recall \mathscrsfsP=\mathscrsfsP(E2) where E2≡[0,+∞)2∪{∞}. Then there exists Δ∞∈(0,1), such that
For Δ<Δ∞, infρ∈\mathscrsfsPR∞(ρ)>0 and the unique global minimizer of risk function R∞(ρ) is a point mass located at (r∗,0) for some r∗=r∗(Δ)∈(0,∞).
For Δ≥Δ∞, all global minimizers of risk function R∞(ρ) have risk zero, and there exists a global minimizer that has finite support.
Suppose ρ2∗∈argminρ2∈\mathscrsfsP(E2)R∞(2)(ρ2). Then we must have ⟨q∘r+,ρ2∗⟩≤1 and ⟨q∘r−,ρ2∗⟩≥−1. Indeed, if either ⟨q∘r+,ρ2∗⟩>1 or ⟨q∘r−,ρ2∗⟩<−1, since q(+∞)>1 and q(0)<−1, the distribution ρ2′=a0δ0+a∞δ∞+(1−a0−a∞)ρ2∗ with appropriate choice of a0 and a∞ will give a lower risk.
This ρ2∗∈\mathscrsfsP(E2) induces a ρ1∈\mathscrsfsP([0,∞]) as follows: for any Borel set B⊆[0,∞], ρ1(B)=ρ2∗({r∈E2:∥r∥2∈B}). For this ρ1, it is easy to see that ⟨q−,ρ1⟩≤⟨q∘r−,ρ2∗⟩ and ⟨q+,ρ1⟩≥⟨q∘r+,ρ2∗⟩, and the equalities hold if and only if ρ2∗(E1)=1, where E1≡([0,+∞)×{0})∪{∞}. Since q(+∞)>1 and q(0)<−1, we can take ρ1∗=a0δ0+a∞δ∞+(1−a0−a∞)ρ1 with appropriate choice of a0 and a∞, so that ⟨q∘r+,ρ2∗⟩≤⟨q+,ρ1∗⟩≤1 and ⟨q∘r−,ρ2∗⟩≥⟨q−,ρ1∗⟩≥−1. Therefore, we always have infρ1∈\mathscrsfsP([0,∞])R∞(1)(ρ1)≤infρ2∈\mathscrsfsP(E2)R∞(2)(ρ2), and ρ2∗(E1)=1 for any ρ2∗∈argminρ2∈\mathscrsfsP(E2)R∞(2)(ρ2). Note that R∞(2)(ρ1×δ0)=R∞(1)(ρ1) for any ρ1∈\mathscrsfsP([0,∞]). Hence, we must have infρ1∈\mathscrsfsP([0,∞])R∞(1)(ρ1)=infρ2∈\mathscrsfsP(E2)R∞(2)(ρ2).
Due to the above argument, we reduced our analysis to the centered isotropic Gaussians case. All the conclusions can be proved using the same argument as in the proof of Theorem 8.
2 Dynamics: Fixed points
We specialize the general evolution (7.1) to the present case. Assuming ρ0 to be invariant with respect to products of orthogonal transformations, the same happens for ρt. We let ρt∈\mathscrsfsP(E2) denote the distribution of (∥w1∥2,∥w2∥2) when w∼ρt. Then ρt satisfies the following PDE:
We will view this as an evolution in the space of probability distribution on \mathscrsfsP=\mathscrsfsP(E2).
In analogy with Proposition 2, we can prove the following characterization of fixed points.
A distribution ρ∈\mathscrsfsP is a fixed point of the PDE (9.16) if and only if
Notice, in particular, global minimizers of Rd(ρ) are fixed points of this evolution, but not vice-versa. The next result classifies fixed points.
Consider d=∞, and recall the definition of λ+(ρ) and λ−(ρ) given by Eq. (9.15) and (9.14). Then the fixed points of the PDE (9.16) (i.e. the probability measures ρ∈\mathscrsfsP satisfying (9.17)) must be of one of the following types
A point mass ρr∗=δ(r∗,0) at some location (r∗,0) with r∗∈{0,+∞}, but not of type (a).
A mixture of the type ρ=a0δ0+a∞δ∞+a1δ(r∗1,0)+a2ρ2 with supp(ρ2)⊆{0}×(0,∞), but not of type (b) and (a).
For Δ<Δ∞, the PDE has a unique fixed point of type (b), with λ+(ρ∗)<0 and λ−(ρ∗)>0; it has no type-(a) fixed points; it has possibly fixed points of type (c).
For Δ>Δ∞, the PDE has some fixed points of type (b), with λ+(ρ∗)>0 and λ−(ρ∗)<0; it also has some type-(a) fixed points; it has possibly fixed points of type (c).
For Δ=Δ∞, the PDE has a unique fixed point of type (a) which is also a delta function at some location (r∗1,0), and no type (b) fixed points; it has possibly fixed points of type (c).
We use the characterization of fixed points in Proposition 6. Recall that ψ∞(r;ρ∗) is defined as in Eq. (9.13). The gradient ∇ψ∞(r;ρ) is given by
If a fixed point ρ∗ gives λ+(ρ∗)=λ−(ρ∗)=0, then R∞(ρ∗)=0. This is type-(a) fixed point. Consider then the case (λ+(ρ∗),λ−(ρ∗))=(0,0).
Suppose ρ∗((0,+∞)2)>0. Since q′(r)>0 and τ+>1>τ−, in order for ∇ψ∞(r;ρ∗)=0 for some r∈(0,+∞)2, we must have (λ+(ρ∗),λ−(ρ∗))=(0,0). Therefore, as ρ∗ is a fixed point with (λ+(ρ∗),λ−(ρ∗))=(0,0), we must have ρ∗((0,+∞)2)=0. That is, we can write ρ∗=a0δ0+a∞δ∞+a1ρ1+a2ρ2, with supp(ρ1)∈(0,∞)×{0}, and supp(ρ2)∈{0}×(0,∞).
The solutions of ∇ψ∞((r1,r2);ρ∗)=0 with r2=0 are of the form 0, (r∗1,0), and ∞. Therefore, ρ1=δ(r∗1,0) for some r∗1∈(0,∞). Hence, as ρ∗ is not a type-(a) stationary point, it must be a type-(b) or type-(c) stationary point.
This proves that all fixed points are of type (a), (b), or (c). The remaining claims follows the same argument as the proof of Theorem 9.
3 Dynamics: Convergence to global minimum for d=∞𝑑d=\infty
In this section, denote \mathscrsfsP\mboxgood to be
We then prove that the d=∞ dynamics converges to a global minimizer from any initialization ρ0∈\mathscrsfsP\mboxgood.
Consider the PDE (9.16) for d=∞, with initialization ρ0∈\mathscrsfsP\mboxgood. It has a unique solution (ρt)t≥0, such that
Without loss of generality, we assume ξ(t)=1/2. First we show the existence and uniqueness of solution of the PDE.
Step 1. Existence and uniqueness of solution. Mass ρt((0,∞)2)=1 for all t.
According to conditions S1 - S3, q(r), q′(r), and q′′(r) are uniformly bounded on [0,∞]. Note
Then ∇v(r),∇1u∞(r1,r2),∇2v(r),∇112u∞(r1,r2),∇122u∞(r1,r2) are uniformly bounded. Therefore, conditions A1 and A3 are satisfied with D=2, V=v, and U=u. Then, there is the existence and uniqueness of solution of PDE (9.16) for d=∞. Denote this solution to be (ρt)t≥0.
Recall the expression for ∇ψ∞(r;ρ) in Eq. (9.18). It is easy to see that the assumption of Lemma 7.9 is satisfied with d=2 and Ψ=ψ∞. Hence, we have ρt((0,∞)2)=1 for any fixed t<∞.
Step 2. Classify the limiting set S∗.
Recall the definition of (\mathscrsfsP(E2),dˉP) at the beginning of Section 9. Since (\mathscrsfsP(E2),dˉP) is a compact metric space, and (ρt)t≥0 is a continuous curve in this space, then there exists a subsequence (tk)k≥1 of times, such that (ρtk)k≥1 converges in metric dˉP to a probability distribution ρ∗∈\mathscrsfsP(E2).
For any ρ0∈\mathscrsfsP\mboxgood, let S∗=S∗(ρ0) be the set of limiting points of the PDE,
Analogous to the proof of Theorem 10, we have the following properties for S∗:
S∗ is connected and compact.
For any ρ∗∈S∗, ρ∗ is a fixed point of PDE.
For any ρ∗∈S∗, R∞(ρ∗)=R∗<1.
Recall the definition of λ+(ρ∗) and λ−(ρ∗) given by Equation (9.14) and (9.15). Let ρ∗ be a fixed point of PDE such that λ+(ρ∗)≥0,λ−(ρ∗)≥0 or λ+(ρ∗)≤0,λ−(ρ∗)≤0 but not both λ+(ρ∗) and λ−(ρ∗) equal . In this case, according to Eq. (9.18), both ∂r1ψ∞(r;ρ∗) and ∂r2ψ∞(r;ρ∗) must be strictly positive or strictly negative. Since supp(ρ∗)⊆{r∈E2:∇rψ∞(r;ρ∗)=0}, ρ∗ must be a combination of two delta functions located at 0 and ∞, i.e., ρ∗=a0δ0+(1−a0)δ∞. But for a fixed point like this, it is easy to see that R∞(ρ∗)≥1. Such fixed points ρ∗ cannot be one of the limiting points of the PDE since R∞(ρ0)<1.
Since S∗ is a connected set, L(S∗) should also be a connected set. Further notice that R∞(ρ∗)=1/2⋅[λ+(ρ∗)2+λ−(ρ∗)2], and R∞(ρ1)=R∞(ρ2) for any ρ1,ρ2∈S∗. Therefore, we can only have L(S∗)⊆P2≡{(λ+,λ−):λ+>0,λ−<0}, or L(S∗)⊆P1≡{(λ+,λ−):λ+<0,λ−>0}, or L(S∗)={(0,0)}.
Step 3. Finish the proof using two claims.
Claim (1). If L(S∗)⊆P1, then for any ρ∗∈S∗, we have ρ∗((0,∞)×{0})=1.
Claim (2). We cannot have L(S∗)⊆P2.
Here we assume these two claims holds, and use it to prove our results. For Δ<Δ∞, we proved in Theorem 12 that, there is no fixed point such that L(ρ∗)=(0,0). Therefore, we cannot have L(S∗)={(0,0)}. Due to Claim (2), we cannot have L(S∗)⊆P2. Hence, we must have L(S∗)⊆P1. According to Theorem 12, for Δ<Δ∞, the only fixed point of PDE with ρ∗((0,∞)×{0})=1 is a point mass at some location r∗=(r∗1,0). Furthermore, this delta function fixed point is unique and is also the global minimizer of the risk. Therefore, we conclude that, for Δ<Δ∞, the PDE will converge to this global minimizer.
For Δ≥Δ∞, according to Claim (1), if ρ∗ is a limiting point such that L(ρ∗)∈P1, then ρ∗((0,∞)×{0})=1. According to Theorem 12, a fixed point ρ∗ with ρ∗((0,∞)×{0})=1 and L(ρ∗)=(0,0) must be a point mass at some location r∗=(r∗1,0), with L(ρ∗)∈P2. Therefore, we cannot have L(S∗)⊆P1. Claim (2) also tells us that we cannot have L(S∗)⊆P2. Hence, we must have L(S∗)={(0,0)}. In this case, all the points in the set S∗ have risk . Therefore, we conclude that, as Δ≥Δ∞, the PDE will converge to some limiting set with risk .
We are left with the task of proving the two claims above. Before that, we introduce some useful notions used in the proof. Define Z(r) for r∈E2,
Define Zl(r)≡Z((r,lr)) for r,l∈[0,∞]. Then we have
According to condition S4, for any fixed l∈[0,∞], Zl(r) is increasing in r.
Recall the formula of ∇rψ∞(r;ρ) given by Equation (9.18). Define
Proof of Claim (1). If L(S∗)⊆P1, then for any ρ∗∈S∗, we have ρ∗((0,∞)×{0})=1.
Assume L(S∗)⊆P1. There must exist t0 large enough, so that as t≥t0, we have λ+(ρt)<0, and λ−(ρt)>0. Therefore, we must have χtg(r;ρt)>0 for any r∈(0,∞)2. We denote
starting with r(t0)∈Γk for some k∈(0,∞), we claim r(t)∈Γk for any t≥t0. Indeed, for any r∈∂Γk∩{r:r2=kr1>0}, its normal vector pointing outside Γk gives n(r)=(−r2,r1)/∥r∥2, and hence ⟨∇rψ∞(r;ρ),n(r)⟩=χtg(r;ρt)>0. Therefore, r(t) cannot leak outside Γk from this boundary. Further note that r(t) cannot reach the boundary ([0,∞)×{0})∪{∞} for any finite time t. This proves the claim that r(t)∈Γk for any t≥t0.
According to Lemma 7.8, we have ρt(Γk)≥ρt0(Γk) for any k∈(0,∞). Furthermore, according to Lemma 7.9, ρt0((0,∞)2)=1, hence limk→∞ρt0(Γk)=1. Therefore, for any ρ∗∈S∗, we must have
Theorem 12 implies that for any such fixed point ρ∗, we have supp(ρ∗)⊆([0,∞)×{0})∪{∞}.
In this case, we claim L(S∗)⊆P1∩{(λ+,λ−):Z0(0)<−λ+/λ−<Z0(∞)}. Indeed, suppose there exists ρ∗∈S∗, such that −λ+(ρ∗)/λ−(ρ∗)≥Z0(∞) or −λ−(ρ∗)/λ−(ρ∗)≤Z0(0), according to Equation (9.24), χnm((r,0);ρ∗) must be strictly positive or strictly negative. However, we know supp(ρ∗)∈{r:∇ψ∞(r;ρ∗)=0}. Hence, ρ∗ should be a combination of two delta functions located at 0 and ∞. Such fixed point ρ∗ has risk R∞(ρ∗)≥1, hence ρ∗ cannot be a limiting point of the PDE. Hence the claim holds.
Since S∗ is a compact set, and L is a continuous map, then L(S∗) is a compact set. Therefore, there must exist ε0>0, so that for any ρ∗∈S∗, we have Z0(0)+3ε0<−λ+(ρ∗)/λ−(ρ∗)<Z0(∞)−3ε0. For this ε0>0, we take t0 large enough, so that for t≥t0, we have Z0(0)+2ε0<−λ+(ρt)/λ−(ρt)<Z0(∞)−2ε0, and λ+(ρt)<0, λ−(ρt)>0.
According to the conditions S0 - S4 on q(r), for any fixed l, Zl(r) is an increasing function of r, and for any fixed r, Zl(r) is continuous in l. Therefore, for the fixed ε0>0, there exists 0<r0<r∞<∞ and b>0, such that
As a result, for any t≥t0, we have
where Γ(⋅) is defined as in Equation (9.26).
According to Lemma 7.9, ρt0((0,∞)2)=1. Define
We have Ok is increasing in k, and ∪kOk⊃(0,∞)2. Hence limk→∞ρt0(Ok)=1. Now we fix a parameter k.
Recall the formula for χnm and χtg given by Equation (9.24) and (9.25). It is easy to see that, there exists 0<uk1,uk2<∞ depending on (b,k,τ+,τ−,Z0(0),Z0(∞),ε0), such that for any r∈(0,∞)2 with b⋅r1≤r2≤k⋅r1, and t≥t0, we have
Consider the following spiral curve rk∞(s)=(rk1∞(s),rk2∞(s)), with
and another spiral curve rk0(s)=(rk10(s),rk20(s)), with
for s∈[0,sk∗] with sk∗=arctan(k)−arctan(b).
Because of inequality (9.35), along the curve rk∞(s), denoting n(rk∞(s)) to be its normal vector with [n(rk∞(s))]2>0, we have for any t≥t0 and s∈[0,sk∗],
Along the curve rk0(s), denoting n(rk0(s)) to be its normal vector with [n(rk0(s))]2>0, we have for any t≥t0 and s∈[0,sk∗],
Consider the ODE (9.27) starting with r(t0)∈Ωk for any k≥{r∞,1/r0}, we claim r(t)∈Ωk for any t≥t0. Indeed, combining Eq. (9.31), (9.33), (9.39), and (9.38), for any r∈∂Ωk∖(([0,∞)×{0})∪{∞}) and t≥t0, the gradient ∇ψ∞(r;ρt) pointing outside Ωk. Therefore, r(t) cannot leak outside Γk from this boundary. Further note that r(t) cannot reach the boundary ([0,∞)×{0})∪{∞} for any finite time t. This proves the claim that r(t)∈Ωk for any t≥t0. According to Lemma 7.8, ρt(Ωk)≥ρt0(Ωk) for any k≥{r∞,1/r0} and t≥t0.
Recall the definition of Ok given by Equation (9.32). Note that Ok⊆Ωk, and limk→∞ρt0(Ok)=1, which implies limk→∞ρt0(Ωk)=1. Hence, for any ρ∗∈S∗,
It is easy to see that ∪kΩk=(0,∞)×[0,∞). Combining with the fact that ρ∗((0,∞)2)=0 for any ρ∗∈S∗, claim (1) holds.
Proof of Claim (2). We cannot have L(S∗)⊆P2.
In the case L(S∗)⊆P2, the argument is similar to the proof of Claim (1), and hence will be presented in a synthetic form. First, there exists t0 large enough, so that as t≥t0, we have λ+(ρt)>0, and λ−(ρt)<0. Then χtg(r;ρt)<0 for any r∈(0,∞)2. Letting
According to the same argument as in the proof of Claim (1), we have ρt(Γk)≥ρt0(Γk) for any k∈(0,∞) and t≥t0. As a result, we have supp(ρ∗)⊆({0}×[0,∞))∪{∞}.
However, the fixed point ρ∗ with support on ({0}×[0,∞))∪{∞} has risk R∞(ρ∗)≥1. Therefore, we cannot have L(S∗)⊆P2. This proves claim (2).
4 Dynamics: Proof of Theorem 2
We will prove that the dynamics for large but finite d is well approximated by the dynamics at d=∞. The key estimate is provided by the next lemma.
Assume σ satisfies condition S0, recall the definition of ud and u∞ given by Equation (9.9) and (9.10). Assuming k=γ⋅d for some γ∈(0,1), then we have
Define F3=F1cosΘ1+F2sinΘ1, G3=G1cosΘ2+G2sinΘ2, then
We have similar bounds for ∣∂a2ud,1(a,b)−∂a2u∞,1(a,b)∣.
Note the relationship of Θ3=Θ3(a) with (Θ1,Θ2) is given by Eq. (9.51), which yields
The lemma holds by noting that as d→∞, we have s0→∞ and d−s0→∞. ∎
Recall the definition of R∞ given by Eq. (9.11), and R given by Eq. (6.2). Recall the set of good initialization given by
Define \mathscrsfsP\mboxgood1 and \mathscrsfsP\mboxgood2 to be
With this definition, it is easy to see that \mathscrsfsP\mboxgood1=\mathscrsfsP\mboxgood.
Here we bound d\mboxBL(ρ02,d,ρ02,∞). Note the joint distribution of ud and u∞ is a coupling of ρ02,d and ρ02,∞, hence
It is easy to see that limd→∞Y1/(Y1+Y2)=γ almost surely. Bounded convergence theorem implies that limd→∞d\mboxBL(ρ02,d,ρ02,∞)=0.
Now we consider the PDE (9.16) for d=∞. We fix its initialization ρ02,∞∈\mathscrsfsP\mboxgood2 induced by ρ01∈\mathscrsfsP\mboxgood1. Denote the solution of PDE (9.16) to be (ρt∞)t≥0. Due to Theorem 13, for any η>0, there exists T=T(η,ρ01,γ,Δ)>0, so that its solution (ρt∞)t≥0 satisfies
Then we would like to bound the difference of R∞(ρ) and Rd(ρ) for any ρ. Note
By Lemma 9.1, there exists d0=d0(η,Δ) large enough, so that for d≥d0, we have
Finally, let (θk)k≥1 be the trajectory of SGD, with step size sk=εξ(kε), and initialization wi0∼iidρ0 for i≤N. We apply Theorem 3 to bound the difference of the law of trajectory of SGD and the solution of PDE (9.60). The assumptions of Theorem 3 are verified by Lemma 8.2. As a consequence, there exists constant K (which depend uniquely on the constants in assumptions A1 A2 A3), such that
with probability 1−e−z2 for any t≤10T, where
As a consequence, for any δ>0, there exists C0=C0(δ,η,ρ01,γ,Δ), so that as N,1/ε≥C0d and ε≥1/N10, for t≤10T, we have
Therefore, the trajectory θ⌊t/ε⌋ of SGD as t∈[T,10T] satisfies
with probability at least 1−δ. This gives the desired result. ∎
Finite temperature
We will states the lemma regarding statics properties of the finite temperature free energy in Section 10.1, and regarding dynamics properties in Section 10.2. We will prove Proposition 3, Theorem 4, and Theorem 5 in Section 10.3. Throughout Section 10.1 and 10.2, to distinguish the dimension of parameters with the generalized differential operator, we will denote the dimension of parameters by d instead of D. This should not be confused with the dimension of feature vectors, which never appears throughout this section.
We introduce the set K of admissible probability densities,
Define Ω0={θ:1/(2πσ)d⋅exp{−∥θ∥22/(2σ2)}≤ρ(θ)1/2≤1}. Then we have
Noting that ∣ρlogρ∣≤ρ for any ρ∈, the second term is bounded by
Assume U and V are bounded-Lipschitz. Then for any λ>0 and 0<β<∞, Fβ,λ(ρ) has a unique minimizer ρ∗∈K. Moreover, we have
Taking σ2=4/(βλ) gives Eq. (10.12) .
For any ρ∈K, we call the following equation the Boltzmann fixed point condition
Under the assumption of Lemma 10.2, the minimizer ρ∗∈K of Fβ,λ(ρ) satisfies the Boltzmann fixed point condition.
for any ε<ε0. As ε is sufficiently small, we have Fβ,λ(ρε)<Fβ,λ(ρ∗). This contradict with the fact that ρ∗∈K is the minimizer of Fβ,λ(ρ).
for some constant γ(β,λ;ρ∗).
Note we have ∫ρ∗(θ)dθ=1. Therefore, we must have γ(β,λ;ρ∗)=−1/β⋅logZ(β,λ;ρ∗). This proves that ρ∗ satisfies the Boltzmann fixed point condition.
Under the assumption of Lemma 10.2, the Boltzmann fixed point condition has a unique solution in K.
The last two lemmas already imply that the Boltzmann fixed point condition has at least one solution. Assume ρ1,ρ2∈K to be two such solutions. Then ρi is positive, and
Note the right hand side does not equal unless ρ1=ρ2.
Under the assumption of Lemma 10.2, and further assume condition A3 holds. Let ρ∗β,λ be the minimizer of Fβ,λ(ρ). Then there is a constant K depending on the parameter K3 in condition A3, such that for any β≥1, we have
Let G,G1,G2∼N(0,Id) be independent, we have
Using the intermediate value theorem and Cauchy-Schwarz inequality, and noting that ∇2V is K3-bounded by condition A3, we have
We have similar bound for the U term. Therefore,
Next we give a lower bound for Ent(ρ∗gτ):
As a result, taking τ=1/β, we have
2 Dynamics
Recall that the finite-temperature distributional dynamics reads:
Notice that this notion of weak solution is equivalent to the one introduced earlier in Eq. (7.3), see for instance [San15, Proposition 4.2].
Without loss of generality, we assume ξ(t)≡1/2.
We use the JKO scheme of [JKO98, Theorem 5.1] to show the existence, uniqueness, and absolute continuousness of solution of PDE (10.22). Since the proof is basically the same as the proof of [JKO98, Theorem 5.1], we will skip several details.
For any ρk−1h, the optimization problem (10.24) has a unique minimizer ρkh∈K, where the proof is basically the same as Lemma 10.2, by additionally noting that W22(ρ,ρk−1h) as a function of ρ is lower bounded, lower semi-continuous, and convex over ρ∈K.
In the following, we will show that this ρh approximately satisfies PDE (10.23) in the weak form.
Since ρkh is the minimizer of optimization problem (10.24), we have for each τ>0,
Using the result in the proof of [JKO98, Theorem 5.1], and noting ∇V is bounded Lipschitz, we have
We need to further calculate the derivative of ⟨U,ντ⊗2⟩ with respect to τ. Note U is symmetric, we have
The uniqueness of solution of Eq. (10.23) can be proved using standard method from theory of elliptic-parabolic equations (see, for instance, [JKO98, Theorem 5.1]). In the proof of uniqueness we need the smoothness property of the solution, which is proved by Lemma 10.7.
Before proving this lemma, we give some notations in the following.
For any nonnegative integer l and 1≤p≤∞, we denote Wpl(Ω) to be the Banach space (Sobolev space) consisting of the elements of Lp(S) having generalized derivatives of all forms up to order l included, that are p’th power integrable on Ω. The norm in Wpl(Ω) is defined by the equality
where α=(α1,…,αd) is a multi-index with ∣α∣=∑i=1dαi, and Dθαu=∂∣α∣u/∂θ1α1⋯∂θdαd.
We say u∈Llocr,p(S) if for any compact subset [t1′,t2′]⊂(t1,t2) and compact subset Ω′⊂Ω, we have u∈Lr,p([t1′,t2′]×Ω′). We will denote Lp,p(S) by Lp(S), and Llocp,p(S) by Llocp(S).
For nonnegative integer l and 1≤p≤∞, we denote Wp2l,l(S) to be the Banach space consisting of the elements of Lp(S) having generalized derivatives of the form DtrDθα with r and α satisfying the inequality 2r+∣α∣≤2l. The corresponding norm is defined by
We denote Cm,n(S) to be the function space of continuous function with m continuous derivative in time, and n continuous derivatives in space. For example, u∈C1,2(S) if and only if u,∂tu,∇θu,∇θ2u∈C0,0(S)≡C(S). We say u∈Ccm,n(S) if u∈Cm,n(S) and the support of u is compact. We will denote Cn,n(S) by Cn(S), and Ccn,n(S) by Ccn(S).
We denote G to be the heat kernel, where for t>0, we have
The proof is similar to the one of [JKO98, Theorem 5.1], so we will skip some details. Without loss of generality we can set β=1, and ξ(t)=1/2 (different choices can be obtained by rescaling Ψ(θ;ρ) and reparametrizing time).
Step 1. Show that ρ∈Lloc∞,p(E).
Taking G to be the heat kernel, it is easy to see that
provided that the po,pi satisfy the relations
Here, C is a constant depends only on T,δ and on pi,po.
Define φ1≡ρ(Δη−⟨∇Ψ,∇η⟩)1{t>ε}, φ2≡ρ(2∇η−η∇Ψ)1{t>ε}, and ψ≡ρ(ε)η. Then Eq. (10.43) reads
where Ω1⊆supp(η)⊆Ω2. Therefore, ρ∈Lloc∞,po(E), where pi,po satisfy Eq. (10.46).
Note there exists a sequence pi,l,po,l for 1≤l≤k and k<∞, so that pi,l+1=po,l, pi,1=p<d/(d−1), pi,k=∞, and pi,l,po,l for fixed l satisfies Eq. (10.46). Since we have ρ∈Lloc∞,p(E), using Eq. (10.50) iteratively, we have ρ∈Lloc∞,po,l(E) for any 1≤l≤k. As a result, we have ρ∈Lloc∞(E).
Step 3. Derivatives, Dρ, D2ρ, and D3ρ.
where 1<p≤∞ and m is a nonnegative integer.
Then we show the regularity of D2ρ. Note that ∇2Ψ∈Lloc∞(E), we have Dφ1,Dφ2∈L∞(E). Due to Eq. (10.51), we have D3{φ1∗2G},D3{φ2∗2G}∈L∞(E), which also implies D2{φ1∗2G}∈Lloc∞(E). Hence we have D2(ρη)=D2{φ1∗2G}+D3{φ2∗2G}+D2[ψ∗Gε]∈L∞(S), which gives D2ρ∈Lloc∞(E).
Next we show the regularity of D3ρ. Note that ∇3Ψ∈Lloc∞(E), we have D2φ1,D2φ2∈L∞(E). Due to Eq. (10.51), we have D4{φ1∗2G},D4{φ2∗2G}∈L∞(E), which also implies D3{φ1∗2G}∈Lloc∞(E). Hence we have D3(ρη)=D3{φ1∗2G}+D4{φ2∗2G}+D3[ψ∗G]∈L∞(S), which gives D3ρ∈Lloc∞(E).
Step 4. Derivatives, Dtρ, DtDρ, and DtD2ρ.
Now we study the regularity of Dtρ,DtDρ,DtD2ρ. Note we have Dt(ρη)=Dt{φ1∗2G}−Dt{Dφ1∗2G}+Dt{ψ∗Gε}. Due to Eq. (10.51), φ1,Dφ2∈L∞(E) implies that Dt{φ1∗2G},Dt{Dφ1∗2G}∈L∞(E) and hence Dt[ρη]∈L∞(S), Dtρ∈Lloc∞(E).
Note we have DtD(ρη)=Dt{Dφ1∗2G}+Dt{D2φ1∗2G}+Dt{Dψ∗Gε}. The fact that Dφ1,D2φ2∈L∞(E) implies that Dt{Dφ1∗2G},Dt{D2φ1∗2G}∈L∞(E) and hence DtDρ∈Lloc∞(E).
Note we have DtD2(ρη)=Dt{D2φ1∗2G}−Dt{D3φ1∗2G}+Dt{D2ψ∗Gε}. Note that ∇4Ψ∈Lloc∞(E), hence D3φ2∈L∞(E). Combining with the fact that D2φ1∈L∞(E), we have Dt{D2φ1∗2G},Dt{D3φ1∗2G}∈L∞(E) and hence DtD2ρ∈Lloc∞(E).
Finally we show the regularity of Dt2ρ. We have Dt2(ρη)=Dt{Dt[φ1∗2G]−Dt[Dφ1∗2G]+Dt[ψ∗Gε]}, and
We say ρ∗ is a fixed point of PDE (10.22), if its solution (ρt)t≥0 starting from ρ∗ satisfies ρt≡ρ∗ for any t≥0.
Assume conditions A1 - A3 hold. Then any fixed point ρ∗ of PDE (10.22) with ρ∗∈K must satisfy the Boltzmann fixed point condition (10.13).
Suppose ρ∗∈K is a fixed point of PDE (10.22), taking W(θ)≡Ψλ(θ;ρ∗), then ρ∗∈K is a fixed point of the Fokker-Planck equation (10.54).
Since λ/2⋅∥θ∥22−2K3≤Ψλ(θ;ρ∗)≤λ/2⋅∥θ∥22+2K3, the Fokker-Planck equation has a unique fixed point [MV00], which solves
This is exactly the Boltzmann fixed point condition.
Assume conditions A1 - A4 hold. Let (ρt)t≥0 be the solution of PDE (10.22) for an initialization ρ0∈K. Then the free energy Fβ,λ(ρt) is differentiable with respect to t, with
Therefore, Fβ,λ(ρt) is non-increasing in t.
Calculate the differential of the free energy along the curve ρt, we have
Assume K0∥θ∥22−K1≤Φ(θ)≤K0∥θ∥22+K1 for some positive constant K0,K1. Define
First we show that the measure μ∗ satisfies the Poincare inequality: for any f∈D,
Note we have the Poincare inequality for the Gaussian distribution μ,
for any differentiable f. Therefore, we have
This proves the Poincare inequality (10.58) for μ∗.
Assume conditions A1 - A4 hold. Then the solution (ρt)t≥0 of PDE (10.22) for any initialization ρ0∈K converges weakly to ρ∗∈K as t→∞, where ρ∗ is the unique solution of the Boltzmann fixed point condition, which is the global minimizer of Fβ,λ.
According to Lemma 10.10, Fβ,λ is non-increasing along the solution path. According to Lemma 10.2, Fβ,λ(ρt) is lower bounded. Therefore, we have
According to condition A3, ∇θU is K3-bounded-Lipschitz with respect to (θ,θ′). Therefore,
as d\mboxBL(ρt,ρ∗)→0. Accordingly, we have
Combining Eq. (10.63) with Eq. (10.61), we have
3 Proof of Proposition 3, Theorem 4, and Theorem 5
Proposition 3 is given by Lemma 10.6, 10.4, and Lemma 10.9. Theorem 4 is given by Lemma 10.2, 10.4, 10.5, and 10.12.
Now we prove Theorem 5. First, according to Lemma 10.5, for any η>0, there exists constant K depending on η,K0,K1,K2,K3, such that as we take β≥KD, we have
According to Lemma 10.12, we have ρt converges to ρ∗β,λ weakly. Therefore, there exists T=T(η,V,U,{Ki},D,λ,β)<∞, so that d\mboxBL(ρt,ρ∗β,λ)≤η/(3Z) for any t≥T, where Z=Z({Ki}) is the bounded-Lipschitz constant of R with respect to ρ. Hence, we have
Finally, according to Theorem 3, there exists K′ depending on Ki’s, so that for all k≤10T/ε, we have
with probability at least 1−e−z2. Hence there exists C0=C0(η,{Ki},δ), so that as N,1/ε≥C0exp{C0T}D and ε≥1/N10, we have
Combining Eq. (10.67), (10.68), and (10.69) we get the desired result.
4 Dependence of convergence time on D𝐷D and η𝜂\eta
Theorem 5 does not provide any estimate for the dependence of the convergence time on the problem dimensions D and on the accuracy η. However the proof suggests the following heuristic. When ρt is sufficiently close to the minimizer ρ∗, we heuristically can approximate the free energy dissipation formula (10.2) as
This is the same as the free energy dissipation for the Fokker-Planck equation with potential Ψλ(θ;ρ∗). This suggests that, close to ρ∗, convergence should be dominated by the speed of convergence in this Fokker-Plank equation, which is controlled by the log-Sobolev constant of the potential Ψλ(θ;ρ∗), to be denote by c∗ [MV00]:
Note that the log-Sobolev constant can be exponentially small in D. We expect this heuristic to capture the rough dependence of the convergence time T on η and D, hence suggesting T=eO(D)log(1/η).
Numerical Experiments
In this section, we discuss numerical experiments whose results were presented in the main text, as well as some additional ones. Some technical details of the figures in the main text are also presented here; in particular, Section 11.1.1 for Figure 1, Section 11.1.2 for Figure 2, Section 11.2 for Figure 3, and Section 11.3 for Figure 4.
In this section, we present details of the numerical experiments pertaining to the example of centered isotropic Gaussians:
With probability 1/2: y=+1, x∼N(0,(1+Δ)2Id).
With probability 1/2: y=−1, x∼N(0,(1+Δ)2Id).
In all numerical examples in this section, we use the activation σ∗(x;θi)=σ(⟨wi,x⟩), where σ(t)=s1 if t≤t1, σ(t)=s2 if t≥t2, and σ(t) interpolated linearly for t∈(t1,t2). In simulations we use t1=0.5, t2=1.5, s1=−2.5, s2=7.5. This is also used for examples with centered Gaussians in the main text, cf. Figures 1 and 2, and Section 8 in the supplemental information. This activation is plotted in Figure 11.1.
Here we discuss empirical validation for the dynamics in the isotropic Gaussian example.
PDE simulation. Simulating the PDE (Eq. of the main text) for general d is computationally intensive. In order to simplify the problem, we only consider d=∞. In that case, we recall that the risk is given by Eq. (8.10), which we copy here for ease of reference:
The PDE is then ∂tρt=2ξ(t)∂r[ρt∂rψ∞(r;ρt)].
The solution to the PDE is approximated, at all time t, by the following multiple-deltas ansatz:
Under this ansatz, let us write R∞(ρt)=R∞,J(r(t)), where r(t)=(r1(t),...,rJ(t))⊤, and
Notice that ∂rψ∞(ri(t);ρt)=(J/2)(∇R∞,J(r(t)))i. Therefore we obtain
Hence under the multiple-deltas ansatz, one can simulate numerically the PDE via the above evolution equation of r(t). In particular, given r(t), one approximates r(t+δt) for some small displacement δt by
In general, one would want to take a large J to obtain a more accurate approximation. There are certain cases where one can take small J (even J=1). An example of such case is given in the following.
Details of Figure 1 of the main text. For the data generation, we set Δ=0.8. For the SGD simulation, we take d=40, N=800, with ε=10−6 and ξ(t)=1. The weights are initialized as (wi)i≤N∼iidN(0,0.82/d⋅Id). We take a single SGD run. At iteration 103,4×106,107, we plot the histogram of (∥wi∥2)i≤N. This produces the results of the SGD in Figure 1 of the main text.
To obtain results from the PDE, we take J=400, and generate ri(0)=∥Zi∥2, where (Zi)i≤J∼iidN(0,0.82/d⋅Id). We obtain r(t) from t=0 until t=107ε, by discretizing this interval with 105 points equally spaced on the log10 scale and sequentially computing r(t) at each point using Eq. (11.8). Note that the SGD result at iteration k corresponds to r(εk). We re-simulate the PDE for 100 times, each with an independently generated initialization. The obtained histogram for the PDE, as shown in the figure, is the aggregation of these 100 runs.
Further numerical simulations. Figure 11.2 plots the evolution of ρt for Δ=0.2. The setting is identical to the one in Figure 1 of the main text, described in the previous paragraphs.
In Figure 11.3, we plot the evolution of the population risk for the SGD and its PDE prediction counterpart, for Δ=0.2 and Δ=0.8. The setting for the SGD plots is the same as described in the previous paragraphs. We compute the risk attained by the SGD by Monte Carlo averaging over 104 samples. The setting for the PDE plots tagged “J=400” is almost the same as in the previous paragraphs, except that we take only 1 run. For the PDE plot tagged “J=1”, we take J=1 and r(0)=0.8 instead. In the inset plot, we also show the evolution of (1/N)∑i=1N∥wi∥2 of the SGD, and (1/J)∑i=1Jri(t) of the PDE.
In Figure 11.4, we plot the function Rd(1)(r), for d=40 and Δ=0.2. (Recall Rd(1)(r) from Eq. of the main text, and see also Section 11.1.3.) On this landscape, we also plot the evolution of the corresponding SGD and PDE, as described in the last paragraph.
Comments. We observe in Figure 11.3 a good match between the SGD and the PDE, even when J=1, for Δ=0.2. This can be explained with our theory, which predicts that at Δ=0.2, the minimum risk is achieved by the uniform distribution over a sphere of radius ∥w∥2=r∗ (see also Section 11.1.3). This corresponds to ρt, as t→∞, being a delta function and placing probability 1 at r∗. Furthermore due to the way we initialize the SGD, ρ0 is well concentrated. One can then expect that ρt is also well concentrated at all time t, in which case J=1 is sufficient. This claim is reflected in our numerical experiments, shown in Figure 11.2.
We also observe in Figure 11.3 that the case Δ=0.2 has a rapid transition from a high risk to a lower risk, unlike the case Δ=0.8. This is also expected from our theory. As said above, ρt is approximately a delta function at all time t, and the position r(t) evolves by gradient flow in the landscape of Rd(1)(r). This latter claim is well supported by Figure 11.4. As observed in Figure 11.4, Rd(1)(r) is rather benign, and hence the transition of the population risk should be smooth. However the case for Δ=0.8 is different: ρt is not concentrating at large t, as evident in Figure 1 of the main text, even though Rd(1)(r) is generally benign for a vast variety of values of d and Δ (see Figure 11.6 and Section 11.1.3).
Note that the computation of the PDE assumes d=∞. Furthermore it also requires N=∞ (recalling Theorem 3 of the main text). The discrepancy to the SGD is due to the fact that d and N are finite in the SGD simulations. Nevertheless in our numerical examples, such discrepancy is insignificant.
1.2 Empirical validation of the statics
Here we discuss numerical verification for the statics in the isotropic Gaussian example.
Optimizing Rd(ρ). For the chosen activation, we have from Eq. (8.8) that
where σsl=(s2−s1)/(t2−t1), σitc=s1−σslt1, ϕ(x)=exp(−x2/2)/2π, Φ(x)=∫−∞xϕ(t)dt, and Γ is the Gamma function. To numerically optimize Rd(ρ), we perform the following approximation:
which is a quadratic programming problem and can be solved numerically. Here v can be computed easily with the explicit formula, and the computation of U amounts to numerically evaluating double integrals. In the case d=∞, the computation of U is much easier, since
Details of Figure 2 of the main text. For the SGD simulation, we take N=800, with ε=3×10−3 and ξ(t)=t−1/4. The weights are initialized as (wi)i≤N∼iidN(0,0.42/d⋅Id). We compute the risk attained by the SGD by Monte Carlo averaging over 104 samples. We take a single SGD run per Δ, per d, and report the risk at iteration 107.
For the approximate optimization of Rd(ρ), we choose K=100, and oi, i=1,...,K, being equally spaced on the interval [0.01, 10].
For the optimization of Rd(1)(r) (recalling Eq. in the main text), we approximate it with mini=1,...,KRd(1)(oi), for the above chosen oi and K.
We find that in general, one needs higher maxi=1,...,Koi to produce accurate results for higher Δ. For the chosen set of oi’s, we choose to plot up until Δ=0.8.
Further numerical simulations. In Figure 11.5, we extend Figure 2 of the main text to include results for additional values of d. The setting remains the same.
This figure provides further support to the respective discussion in the main text. For the threshold values of Δ for which the minimum risk is achieved by a uniform distribution ρr∗unif over a sphere of radius ∥w∥2=r∗ (see the main text around Eq. , and Section 11.1.3).
1.3 Checking the condition of Lemma 1 in the main text
We check of the condition of Lemma 1 in the main text. This has two steps: (1) we solve for the minimizer r∗ of Rd(1)(r)=1+2v(r)+ud(r,r), where v(r) and ud(r1,r2) are given by Eq. (11.10) and (11.11) respectively, and (2) we check whether v(r)+ud(r,r∗)≥v(r∗)+ud(r∗,r∗) for all r≥0. Figure 11.6 suggests that the behavior of Rd(1)(r) is rather benign and hence r∗ can be solved easily by searching for a local minimum. For the second step, we check the condition on a grid of values of r from 0.1 to 10 with a spacing of 0.1, for each value of Δ on a grid from 0.01 to 0.99 with a spacing of 0.01. In general, we find that the conditioned is satisfied for Δ∈[Δdl,Δdh]. Table 1 reports Δdl and Δdh for a number of values of d for the isotropic Gaussians example with the given activation function.
2 Centered anisotropic Gaussians with ReLU Activation
With probability 1/2: y=+1, x∼N(0,Σ+).
With probability 1/2: y=−1, x∼N(0,Σ−).
This setting is used in Figure 3 in the main text.
We consider s0=γd for some γ∈(0,1). For simplicity, we consider the limit d→∞. For θ∼ρ, let ρ be the joint distribution of four parameters r=(a,b,r1=∥w1:s0∥2,r2=∥w(s0+1):d∥2), where wi:j=(wi,...,wj)⊤. Using a similar argument to Section 8, we have, in the limit d→∞, the risk R(ρ)=R∞(ρ) for
where ϕ(x)=exp(−x2/2)/2π and Φ(x)=∫−∞xϕ(t)dt. Furthermore, letting ρt denote the corresponding distribution at time t, the PDE in the main text can be reduced to the following PDE of ρt:
PDE simulation. As in Section 11.1.1, we posit that the solution to the PDE can be approximated, at all time t, by the multiple-deltas ansatz:
for i=1,...,J, where R∞,J(r1(t),…,rJ(t))=R∞(ρt) under the ansatz, and ∇i denotes the gradient of R∞,J(r1,...,rJ) w.r.t. ri. More explicitly,
Again, given ri(t), one approximates ri(t+δt) for some small displacement δt by
Details of Figure 3 of the main text. For the SGD simulation, we take d=320, s0=60, N=800, with ε=2×10−4 and ξ(t)=t−1/4. The weights are initialized as (wi)i≤N∼iidN(0,0.82/d⋅Id), (ai)i≤N=1 and (bi)i≤N=1. We take a single SGD run. We compute the risk attained by the SGD by Monte Carlo averaging over 104 samples.
To produce the inset plot in Figure 3 of the main text, for the “a (mean)” axis, we compute N1∑i=1Nai for the SGD and J1∑i=1Jai(t) for the PDE. Similarly, for the “b (mean)” axis, we compute N1∑i=1Nbi for the SGD and J1∑i=1Jbi(t) for the PDE, and for the “r1 (mean)” axis, we compute N1∑i=1N∥wi,1:s0∥2 for the SGD and J1∑i=1Jr1,i(t) for the PDE.
Further numerical simulations. In Figure 11.7, we plot the evolution of the four parameters, for the same setting as Figure 3 of the main text. Here “a (mean)”, “b (mean)” and “r1 (mean)” hold the same meanings, and “r2 (mean)” refers to N1∑i=1N∥wi,(s0+1):d∥2 for the SGD and J1∑i=1Jr2,i(t) for the PDE.
In Figure 11.8, we plot the population risk’s evolution for the same setting as Figure 3 of the main text, apart from that Δ=0.6 and s0 varies.
Comments. We observe a good match between the SGD and the PDE in Figure 3 of the main text as well as Figure 11.7, up until iteration 106. In general there is less discrepancy with larger s0, d and N, recalling that the PDE is computed assuming infinite s0, d and N. This is evident from Figure 11.8.
As a note, in Figure 11.8, the PDE evolves differently for different s0. This is because the ratio s0/d is used to determine the initialization of the PDE.
3 Isotropic Gaussians: Predictable Failure of SGD
In this section, we consider the isotropic Gaussians example (see Section 11.1 for the setting and notations), with the following activation function: σ∗(x;θi)=σ(⟨wi,x⟩), where σ(t)=−2.5 for t≤0, σ(t)=7.5 for t≥1.5, and σ(t) linearly interpolates from the knot (0,−2.5) to (0.5,−4), and from (0.5,−4) to (1.5,7.5). This activation is plotted in Figure 11.1. This corresponds to Section “Predicting failure” and Figure 4 in the main text. The simulation of the PDE can be done in the same way as in Section 11.1.1.
Rationale of the activation choice. We give an explanation for the choice of the above activation based on our theory. We aim to find an activation σ∗(x;θi)=σ(⟨wi,x⟩) in which there exists a local minimum that does not generalize well. To simplify the task, we wish for such minimum to be attained at ρ∗=δ0. This minimum does not generalize well, since it implies all the weights are zero and the neuron outputs are constant, rendering the network unable to perform classification. Theorem 6 of the main text suggests taking σ(t) such that
In the isotropic Gaussians case, this becomes
(Note that the condition ∇V(0)+∇1U(0,0)=0 in Theorem 6 of the main text is trivially satisfied.) Another requirement is that there should still be a minimum whose risk is nearly zero. Hence we do not wish for a dramatic change in the choice of the activation function, as compared to the one used in Section 11.1. That is, we leave σ(0)<0 unchanged. Hence we would want σ′′(0)<0, which is accomplished by our aforementioned choice.
Note that Theorem 6 of the main text also suggests that if the SGD is initialized sufficiently close to this local minimum, the SGD trajectory should converge to it.
Details of Figure 4 of the main text. For the data generation, we set Δ=0.5. For the SGD simulation, we take d=320, N=800, with ε=10−5 and ξ(t)=t−1/4. We take a single SGD run each for two different initializations: the weights are initialized as (wi)i≤N∼iidN(0,κ2/d⋅Id) for either κ=0.1 or κ=0.4. We compute the risk attained by the SGD by Monte Carlo averaging over 104 samples.
To obtain results from the PDE, we take J=400, and generate ri(0)=∥Zi∥2, where (Zi)i≤N∼iidN(0,κ2/d⋅Id). We obtain r(t) from t=0 until t=107ε, by discretizing this interval with 105 points equally spaced on the log10 scale and sequentially computing r(t) at each point using Eq. (11.8). Note that the SGD result at iteration k corresponds to r(ε4/3k). We take a single run of the PDE.
To produce the inset plot, we compute N1∑i=1N∥wi∥2 for the SGD, and J1∑i=1Jri(t) for the PDE.
As observed from Figure 4 of the main text, the SGD trajectory with κ=0.1 converges to a point where ∥wi∥2 is nearly zero and the risk is very high, in stark contrast to the SGD trajectory with κ=0.4, as expected.
Appendix A Concentration inequalities
Let Zk=Xk−Xk−1 be the martingale differences. By the subgaussian condition (A.1), we get
Letting G∼N(0,Id) a standard Gaussian vector and ξ≥0,
By Markov inequality, setting ξ=1/(2nL2), we get, for all t≥0,
Finally, to upper bound maxk≤n∥Xk∥2, we define the stopping time τ≡min{k:∥Xk∥2≥2Ln(d+t)}, and the martingale Xk=Xk∧τ. Since {maxk≤n∥Xn∥2≥2Ln(d+t)}={∥Xn∥2≥2Ln(d+t)}, the claim follows by applying the previous inequality to Xn. ∎
Appendix B On the generalization to other loss functions
First of all, we note that the population risk reads
The corresponding distributional dynamics is formally identical to the one for quadratic loss, cf. Eq. (7.1). The only change is in the definition of Ψ(θ;ρ):