A Mean Field View of the Landscape of Two-Layers Neural Networks

Song Mei, Andrea Montanari, Phan-Minh Nguyen

Introduction

Multi-layer neural networks are one of the oldest approaches to statistical machine learning, dating back at least to the 1960’s [Ros62]. Over the last ten years, under the impulse of increasing computer power and larger data availability, they have emerged as a powerful tool for a wide variety of learning tasks [KSH12, GBCB16].

In practice, the parameters of neural networks are learned by stochastic gradient descent [RM51] (SGD) or its variants. In the present case, this amounts to the iteration

Here θk=(θik)iN{\bm{\theta}}^{k}=({\bm{\theta}}^{k}_{i})_{i\leq N} denotes the parameters after kk iterations, sks_{k} is a step size, and (xk,yk)({\bm{x}}_{k},y_{k}) is the kk-th example. Throughout the paper, we make the following assumption:

In large scale applications, this is not far from truth: the data is so large that each example is visited at most a few times [Bot10]. Further, theoretical guarantees suggest that there is limited advantage to be gained from multiple passes [SSBD14]. For recent work deriving scaling limits under such assumption (in different problems) see [WML17].

Formal relationships can be established between RN(θ)R_{N}({\bm{\theta}}) and R(ρ)R(\rho). For instance, under mild assumptions, infθRN(θ)=infρR(ρ)+O(1/N)\inf_{{\bm{\theta}}}R_{N}({\bm{\theta}})=\inf_{\rho}R(\rho)+O(1/N). We refer to the next sections for mathematical statements of this type.

Roughly speaking, R(ρ)R(\rho) corresponds to the population risk when the number of hidden units goes to infinity, and the empirical distribution of parameters ρ^(N)\hat{\rho}^{(N)} converges to ρ\rho. Since U(,)U(\,\cdot\,,\,\cdot\,) is positive semidefinite, we obtain that the risk becomes convex in this limit. The fact that learning can be viewed as convex optimization in an infinite-dimensional space was indeed pointed out in the past [LBW96, BRV+06]. Does this mean that the landscape of the population risk simplifies for large NN and descent algorithms will converge to a unique (or nearly unique) global optimum?

when NN\to\infty, ε0{\varepsilon}\to 0 (here \Rightarrow denotes weak convergence). The asymptotic dynamics of ρt\rho_{t} is defined by the following PDE, which we shall refer to as distributional dynamics (DD)

Using these results, analyzing learning in two-layer neural networks reduces to analyzing the PDE (7). While this is far from being an easy task, the PDE formulation leads to several simplifications and insights. First of all, it factors out the invariance of the risk (4) (and of the SGD dynamics (3)), with respect to permutations of the units {1,,N}\{1,\dots,N\}.

where the infimum is taken over all couplings of ρ1\rho_{1} and ρ2\rho_{2}. Informally, the fact that ρt\rho_{t} is a gradient flow means that (7) is equivalent, for small τ\tau, to

Powerful tools from the mathematical literature on gradient flows in measure spaces [AGS08] can be exploited to study the behavior of (7).

Most importantly, the scaling limit elucidates the dependence of the landscape of two-layer neural networks on the number of hidden units NN.

A remarkable feature of neural networks is the observation that, while they might be dramatically over parametrized, this does not lead to performance degradation. In the case of bounded activation functions, this phenomenon was clarified in the nineties for empirical risk minimization algorithms, see e.g. [Bar98]. The present work provides analogous insight for the SGD dynamics: roughly speaking, our results imply that the landscape remains essentially unchanged as NN grows, provided NDN\gg D. In particular, assume that the PDE (7) converges close to an optimum in time t(D)t_{*}(D). This might depend on DD, but does not depend on the number of hidden units NN (which does not appear in the DD PDE (7)). If t(D)=OD(1)t_{*}(D)=O_{D}(1), we can then take NN arbitrarily (as long as NDN\gg D) and will achieve a population risk which is independent of NN (and corresponds to the optimum), using k=t/ε=O(D)k=t_{*}/{\varepsilon}=O(D) samples.

Our analysis can accommodate some important variants of SGD, a particularly interesting one being noisy SGD:

where Ψλ(θ;ρ)=Ψ(θ;ρ)+(λ/2)θ22\Psi_{\lambda}({\bm{\theta}};\rho)=\Psi({\bm{\theta}};\rho)+(\lambda/2)\|{\bm{\theta}}\|_{2}^{2}, and Δθf(θ)=i=1dθi2f(θ)\Delta_{{\bm{\theta}}}f({\bm{\theta}})=\sum_{i=1}^{d}\partial_{\theta_{i}}^{2}f({\bm{\theta}}) denotes the usual Laplacian. This can be viewed as a gradient flow for the free energy Fβ,λ(ρ)=(1/2)R(ρ)+(λ/2)θ22ρ(dθ)β1Ent(ρ)F_{\beta,\lambda}(\rho)=(1/2)R(\rho)+(\lambda/2)\int\|{\bm{\theta}}\|_{2}^{2}\rho({\rm d}{\bm{\theta}})-\beta^{-1}{\rm Ent}(\rho), where Ent(ρ)=ρ(θ)logρ(θ)dθ{\rm Ent}(\rho)=-\int\rho({\bm{\theta}})\log\rho({\bm{\theta}})\,{\rm d}{\bm{\theta}} is the entropy of ρ\rho (by definition Ent(ρ)={\rm Ent}(\rho)=-\infty if ρ\rho is singular). Fβ,λ(ρ)F_{\beta,\lambda}(\rho) is an entropy-regularized risk, which penalizes strongly non-uniform ρ\rho.

We will prove below that, for β<\beta<\infty, the evolution (12) generically converges to the minimizer of Fβ,λ(ρ)F_{\beta,\lambda}(\rho), hence implying global convergence of noisy SGD in a number of steps independent of NN.

Examples

With probability 1/21/2: y=+1y=+1, xN(0,(1+Δ)2Id){\bm{x}}\sim{\sf N}(0,(1+\Delta)^{2}{\bm{I}}_{d})

With probability 1/21/2: y=1y=-1, xN(0,(1Δ)2Id){\bm{x}}\sim{\sf N}(0,(1-\Delta)^{2}{\bm{I}}_{d}).

(This example will be generalized later.) Of course, optimal classification in this model becomes entirely trivial if we compute the feature h(x)=x2h({\bm{x}})=\|{\bm{x}}\|_{2}. However, it is non-trivial that a SGD-trained neural network will succeed.

We choose an activation function without offset or output weights, namely σ(x;θi)=σ(wi,x)\sigma_{*}({\bm{x}};{\bm{\theta}}_{i})=\sigma(\langle{\bm{w}}_{i},{\bm{x}}\rangle). While qualitatively similar results are obtained for other choices of σ\sigma, we will use a simple piecewise linear function as a running example: σ(t)=s1\sigma(t)=s_{1} if tt1t\leq t_{1}, σ(t)=s2\sigma(t)=s_{2} if tt2t\geq t_{2}, and σ(t)\sigma(t) interpolated linearly for t(t1,t2)t\in(t_{1},t_{2}). In simulations we use t1=0.5t_{1}=0.5, t2=1.5t_{2}=1.5, s1=2.5s_{1}=-2.5, s2=7.5s_{2}=7.5.

where the form of ψ(r;ρ)\psi(r;\rho) can be derived from Ψ(θ;ρ)\Psi({\bm{\theta}};\rho). This reduced PDE can be efficiently solved numerically, see Supplementary Information (SI) for technical details. As illustrated by Fig. 1, the empirical results match closely the predictions produced by this PDE.

In Fig. 2, we compare the asymptotic risk achieved by SGD with the prediction obtained by minimizing R(ρ)R(\rho), cf. (5) over spherically symmetric distributions. It turns out that, for certain values of Δ\Delta, the minimum is achieved by the uniform distribution over a sphere of radius w2=r\|{\bm{w}}\|_{2}=r_{*}, to be denoted by ρr\mboxunif\rho^{\mbox{\tiny\rm unif}}_{r_{*}}. The value of rr_{*} is computed by minimizing

where expressions for v(r)v(r), ud(r1,r2)u_{d}(r_{1},r_{2}) can be readily derived from V(w)V({\bm{w}}), U(w1,w2)U({\bm{w}}_{1},{\bm{w}}_{2}) and are given in the SI.

Let rr_{*} be a global minimizer of rRd(1)(r)r\mapsto R_{d}^{(1)}(r). Then ρr\mboxunif\rho^{\mbox{\tiny\rm unif}}_{r_{*}} is a global minimizer of ρR(ρ)\rho\mapsto R(\rho) if and only if v(r)+ud(r,r)v(r)+ud(r,r)v(r)+u_{d}(r,r_{*})\geq v(r_{*})+u_{d}(r_{*},r_{*}) for all r0r\geq 0.

Checking numerically this condition yields that ρr\mboxunif\rho^{\mbox{\tiny\rm unif}}_{r_{*}} is a global minimizer for Δ\Delta in an interval [Δdl,Δdh][\Delta^{\rm l}_{d},\Delta^{\rm h}_{d}], where limdΔdl=0\lim_{d\to\infty}\Delta^{\rm l}_{d}=0 and limdΔdh=Δ0.47\lim_{d\to\infty}\Delta^{\rm h}_{d}=\Delta_{\infty}\approx 0.47.

for any k[T/ε,10T/ε]k\in[T/{\varepsilon},10T/{\varepsilon}] with probability at least 1δ1-\delta.

In particular, if we set ε=1/(C0d){\varepsilon}=1/(C_{0}d), then the number of SGD steps is k[(C0T)d,(10C0T)d]k\in[(C_{0}T)\,d,(10C_{0}T)\,d]: the number of samples used by SGD does not depend on the number of hidden units NN, and is only linear in the dimension. Unfortunately the proof does not provide the dependence of TT on η\eta, but Theorem 6 below suggests exponential local convergence.

2 Centered anisotropic Gaussians

We can generalize the previous result to a problem in which the network needs to select a subset of relevant nonlinear features out of many a priori equivalent ones. We assume the joint law of (y,x)(y,{\bm{x}}) to be as follows:

With probability 1/21/2: y=+1y=+1, xN(0,Σ+){\bm{x}}\sim{\sf N}(0,{\bm{\Sigma}}_{+}), and

With probability 1/21/2: y=1y=-1, xN(0,Σ){\bm{x}}\sim{\sf N}(0,{\bm{\Sigma}}_{-}).

Even with a reduced degree of symmetry, SGD converges to a network with nearly-optimal risk, after using a number of samples k=O(d)k=O(d), which is independent of the number of hidden units NN.

3 A better activation function

We consider the same data distribution introduced in the last section (anisotropic Gaussians). Figure 3 reports the evolution of the risk RN(θk)R_{N}({\bm{\theta}}^{k}) for three experiments with d=320d=320, s0=60s_{0}=60 and different values of Δ\Delta. SGD is initialized by setting ai=1a_{i}=1, bi=1b_{i}=1 and wi0iidN(0,0.82/dId){\bm{w}}_{i}^{0}\sim_{iid}{\sf N}({\mathbf{0}},0.8^{2}/d\cdot{\bm{I}}_{d}) for iNi\leq N. We observe that SGD converges to a network with very small risk, but this convergence has a nontrivial structure and presents long flat regions.

The empirical results are well captured by our predictions based on the continuum limit. In this case we obtain a reduced PDE for the joint distribution of the four quantities r=(a,b,r1=PVw2,r2=PVw2){\bm{r}}=(a,b,r_{1}=\|{\bm{P}}_{{\cal V}}{\bm{w}}\|_{2},r_{2}=\|{\bm{P}}^{\perp}_{{\cal V}}{\bm{w}}\|_{2}), denoted by ρt\overline{\rho}_{t}. The reduced PDE is analogous to (13) albeit in 44 rather than 11 dimensions. In Figure 3 we consider the evolution of the risk, alongside three properties of the distribution ρt\overline{\rho}_{t} –the means of the output weight aa, of the offset bb, and of r1r_{1}.

4 Predicting failure

SGD does not always converge to a near global optimum. Our analysis allows to construct examples in which SGD fails. For instance, Figure 4 reports results for isotropic Gaussians problem. We violate the assumptions of Theorem 1 by using non monotone activation function. Namely, we use σ(x;θ)=σ(w,x)\sigma_{*}({\bm{x}};{\bm{\theta}})=\sigma(\langle{\bm{w}},{\bm{x}}\rangle), where σ(t)=2.5\sigma(t)=-2.5 for t0t\leq 0, σ(t)=7.5\sigma(t)=7.5 for t1.5t\geq 1.5, and σ(t)\sigma(t) linearly interpolates from (0,2.5)(0,-2.5) to (0.5,4)(0.5,-4), and from (0.5,4)(0.5,-4) to (1.5,7.5)(1.5,7.5).

Depending on the initialization, SGD converges to two different limits, one with a small risk, and the second with high risk. Again this behavior is well tracked by solving a one-dimensional PDE for the distribution ρt\overline{\rho}_{t} of r=w2r=\|{\bm{w}}\|_{2}.

General results

In this section we return to the general supervised learning problem described in the introduction and describe our general results. Proofs are deferred to the SI.

First, we note that the minimum of the asymptotic risk R(ρ)R(\rho) of (5) provides a good approximation of the minimum of the finite-NN risk RN(θ)R_{N}({\bm{\theta}}).

We next consider the distributional dynamics (7) and (12). These should be interpreted to hold in weak sense, cf. SI. In order to establish that these PDEs indeed describe the limit of the SGD dynamics, we make the following assumptions.

tξ(t)t\mapsto\xi(t) is bounded Lipschitz: ξ,ξ\mboxLipK1\|\xi\|_{\infty},\|\xi\|_{\mbox{\tiny\rm Lip}}\leq K_{1}, with 0ξ(t)dt=\int_{0}^{\infty}\xi(t){\rm d}t=\infty.

The activation function (x,θ)σ(x;θ)({\bm{x}},{\bm{\theta}})\mapsto\sigma_{*}({\bm{x}};{\bm{\theta}}) is bounded, with sub-Gaussian gradient: σK2\|\sigma_{*}\|_{\infty}\leq K_{2}, θσ(X;θ)ψ2K2\|\nabla_{{\bm{\theta}}}\sigma_{*}(\bm{X};{\bm{\theta}})\|_{\psi_{2}}\leq K_{2}. Labels are bounded ykK2|y_{k}|\leq K_{2}.

The gradients θV(θ){\bm{\theta}}\mapsto\nabla V({\bm{\theta}}), (θ1,θ2)θ1U(θ1,θ2)({\bm{\theta}}_{1},{\bm{\theta}}_{2})\mapsto\nabla_{{\bm{\theta}}_{1}}U({\bm{\theta}}_{1},{\bm{\theta}}_{2}) are bounded, Lipschitz continuous (namely θV(θ)2\|\nabla_{{\bm{\theta}}}V({\bm{\theta}})\|_{2}, θ1U(θ1,θ2)2K3\|\nabla_{{\bm{\theta}}_{1}}U({\bm{\theta}}_{1},{\bm{\theta}}_{2})\|_{2}\leq K_{3}, θV(θ)θV(θ)2K3θθ2\|\nabla_{{\bm{\theta}}}V({\bm{\theta}})-\nabla_{{\bm{\theta}}}V({\bm{\theta}}^{\prime})\|_{2}\leq K_{3}\|{\bm{\theta}}-{\bm{\theta}}^{\prime}\|_{2}, θ1U(θ1,θ2)θ1U(θ1,θ2)2K3(θ1,θ2)(θ1,θ2)2\|\nabla_{{\bm{\theta}}_{1}}U({\bm{\theta}}_{1},{\bm{\theta}}_{2})-\nabla_{{\bm{\theta}}_{1}}U({\bm{\theta}}^{\prime}_{1},{\bm{\theta}}_{2}^{\prime})\|_{2}\leq K_{3}\|({\bm{\theta}}_{1},{\bm{\theta}}_{2})-({\bm{\theta}}_{1}^{\prime},{\bm{\theta}}_{2}^{\prime})\|_{2}).

We also introduce the following error term which quantifies in a non-asymptotic sense the accuracy of our PDE model

The convergence of the SGD process to the PDE model is an example of a phenomenon which is known in probability theory as propagation of chaos [Szn91].

with probability 1ez21-e^{-z^{2}}. The same statements hold for noisy SGD (11), provided (7) is replaced by (12), and if β1\beta\geq 1, λ1\lambda\leq 1, and ρ0\rho_{0} is K0K_{0} sub-Gaussian for some K0>0K_{0}>0.

Notice that dependence of the error terms in NN and DD is rather benign. On the other hand, the error grows exponentially with the time horizon TT, which limits its applicability to cases in which the DD converges rapidly to a good solution. We do not expect this behavior to be improvable within the general setting of 3, which a priori includes cases in which the dynamics is unstable.

We can regard J(θ;ρt)=ρt(θ)θΨ(θ;ρt)\bm{J}({\bm{\theta}};\rho_{t})=\rho_{t}({\bm{\theta}})\nabla_{{\bm{\theta}}}\Psi({\bm{\theta}};\rho_{t}) as a current. The fixed points of the continuum dynamics are densities that correspond to zero current, as stated below.

Assume V(),U(,)V(\,\cdot\,),U(\,\cdot\,,\,\cdot\,) to be differentiable with bounded gradient. If ρt\rho_{t} is a solution of the PDE (7), then R(ρt)R(\rho_{t}) is non-increasing. Further, probability distribution ρ\rho is a fixed point of the PDE (7) if and only if

Note that global optimizers of R(ρ)R(\rho), defined by condition (17), are fixed points, but the set of fixed points is, in general, larger than the set of optimizers. Our next proposition provides an analogous characterization of the fixed points of diffusion DD (12) (see [CMV+03] for related results).

Assume that conditions A1-A3 hold and that ρ0\rho_{0} is absolutely continuous with respect to Lebesgue measure, with Fβ,λ(ρ0)<F_{\beta,\lambda}(\rho_{0})<\infty. If (ρt)t0(\rho_{t})_{t\geq 0} is a solution of the diffusion PDE (12), then ρt\rho_{t} is absolutely continuous. Further, there is at most one fixed point ρ=ρβ,λ\rho_{*}=\rho_{*}^{\beta,\lambda} of (12) satisfying Fβ,λ(ρ)<F_{\beta,\lambda}(\rho_{*})<\infty. This fixed point is absolutely continuous and its density satisfies

In the next sections we state our results about convergence of the distributional dynamics to its fixed point. In the case of noisy SGD (and for the diffusion PDE (12)), a general convergence result can be established (although at the cost of an additional regularization). For noiseless SGD (and the continuity equation (12)), we do not have such general result. However, we obtain a stability condition for fixed point containing one point mass, which is useful to characterize possible limiting points (and is used in treating the examples in the previous section).

Remarkably, the diffusion PDE (12) generically admits a unique fixed point, which is the global minimum of Fβ,λ(ρ)F_{\beta,\lambda}(\rho) and the evolution (12) converges to it, if initialized so that Fβ,λ(ρ0)<F_{\beta,\lambda}(\rho_{0})<\infty. This statement requires some qualifications. First of all, we introduce sufficient regularity assumptions to guarantee the existence of sufficiently smooth solutions of (12).

Next notice that the right-hand side of the fixed point equation (21) is not necessarily normalizable (for instance, it is not when V()V(\,\cdot\,), U(,)U(\,\cdot\,,\,\cdot\,) are bounded). In order to ensure the existence of a fixed point, we need λ>0\lambda>0.

Assume that conditions A1-A4 hold, and 1/K0λK01/K_{0}\leq\lambda\leq K_{0} for some K0>0K_{0}>0 Then Fβ,λ(ρ)F_{\beta,\lambda}(\rho) has a unique minimizer, denoted by ρβ,λ\rho_{*}^{\beta,\lambda}, which satisfies

where CC is a constant depending on K0K_{0},K1K_{1},K2K_{2},K3K_{3}. Further, letting ρt\rho_{t} be a solution of the diffusion PDE (12) with initialization satisfying Fβ,λ(ρ0)<F_{\beta,\lambda}(\rho_{0})<\infty, we have, as tt\to\infty,

The proof of this theorem is based on the following formula that describes the free energy decrease along the trajectories of the distributional dynamics (12):

(A key technical hurdle is of course proving that this expression makes sense, which we do by showing the existence of strong solutions.) It follows that the right-hand side must vanish as tt\to\infty, from which we prove that (eventually taking subsequences) ρtρ\rho_{t}\Rightarrow\rho_{*} where ρ\rho_{*} must satisfy βΨλ(θ;ρ)+logρ(θ)=const\beta\Psi_{\lambda}({\bm{\theta}};\rho_{*})+\log\rho_{*}({\bm{\theta}})={\rm const}. This in turns mean ρ\rho_{*} is a solution of the fixed point condition 21 and is in fact a global minimum of Fβ,λF_{\beta,\lambda} by convexity.

This result can be used in conjunction with Theorem 3, in order to analyze the regularized noisy SGD algorithm (11).

2 Convergence: noiseless SGD

The next theorems provide necessary and sufficient conditions for distributions containing a single point mass to be a stable fixed point of the evolution. This result is useful in order to characterize the large time asymptotics of the dynamics (7). Here, we write 1U(θ1,θ2)\nabla_{1}U({\bm{\theta}}_{1},{\bm{\theta}}_{2}) for the gradient of UU with respect to its first argument, and 1,12U\nabla^{2}_{1,1}U for the corresponding Hessian. Further, for a probability distribution ρ\rho_{*}, we define

Note that H0(ρ)\bm{H}_{0}(\rho_{*}) is nothing but the Hessian of θΨ(θ;ρ){\bm{\theta}}\mapsto\Psi({\bm{\theta}};\rho_{*}) at θ{\bm{\theta}}_{*}.

If ρ0\rho_{0} has a bounded density with respect to Lebesgue measure, then it cannot be that ρt\rho_{t} converges weakly to ρ\rho_{*} as tt\to\infty.

Discussion and future directions

In this paper we developed a new approach to the analysis of two-layers neural networks. Using a propagation-of-chaos argument, we proved that –if the number of hidden units satisfies NDN\gg D– SGD dynamics is well approximated by the PDE in (7), while noisy SGD is well approximated by (12). Both of these asymptotic descriptions correspond to Wasserstein gradient flows for certain energy (or free energy) functionals. While empirical risk minimization is known to be insensitive to overparametrization [Bar98], the present work clarifies that the SGD behavior is also independent of the number of hidden units, as soon as this is large enough.

We illustrated our approach on several concrete examples, by proving convergence of SGD to a near-global optimum. This type of analysis provides a new mechanism for avoiding the perils of non-convexity. We do not prove that the finite-NN risk RN(θ)R_{N}({\bm{\theta}}) has a unique local minimum, or that all local minima are close to each other. Such claims have often been the target of earlier work, but might be too strong for the case of neural networks. We prove instead that the PDE (7) converges to a near global optimum, when initialized with a bounded density. This effectively gets rid of some exceptional stationary points of RN(θ)R_{N}({\bm{\theta}}), and merges multiple finite NN stationary points that result into similar distributions ρ\rho.

In the case of noisy SGD (11), we prove that it converges generically to a near-global minimum of the regularized risk, in time independent of the number of hidden units.

We emphasize that while we focused here on the case of square loss, our approach should be generalizable to other loss functions as well, cf. SI.

The present work opens the way to several interesting research directions. We will mention two of them. (i)(i) The PDE (7) corresponds to gradient flow in the Wasserstein metric for the risk R(ρ)R(\rho), see [AGS08]. Building on this remark, tools from optimal transportation theory can be used to prove convergence. (ii)(ii) Multiple finite-NN local minima can correspond to the same minimizer ρ\rho_{*} of R(ρ)R(\rho) in the limit NN\to\infty. Ideas from glass theory [MP99] might be useful to investigate this structure.

Let us finally mention that, after a first version of this paper appeared as a preprint, several other groups obtained results that are closely related to Theorem 3 [RVE18, SS18, CB18].

Acknowledgements

This work was partially supported by grants NSF DMS-1613091, NSF CCF-1714305 and NSF IIS-1741162. S. M. was partially supported by Office of Technology Licensing Stanford Graduate Fellowship. P.-M. N. was partially supported by William R. Hewlett Stanford Graduate Fellowship. The authors would like to thank Jiajun Tong for helpful discussions concerning strong solutions for parabolic PDEs.

References

Supplementary information

We present here proofs and additional technical details for our mathematical results, as well as additional information concerning the numerical experiments.

Notations

We use lowercase bold for vectors (e.g. u,v,{\bm{u}},{\bm{v}},\dots), uppercase bold for matrices (e.g. A,B,\bm{A},\bm{B},\dots), and lowercase plain for scalar (x,y,x,y,\dots).

Given a measurable space Ω\Omega, we denote by \mathscrsfsP(Ω)\mathscrsfs{P}(\Omega) the set of probability measures on Ω\Omega.

Given a measurable function ff, and a measure μ\mu, we denote by f,μ=μ,f=fdμ\langle f,\mu\rangle=\langle\mu,f\rangle=\int f\,{\rm d}\mu the corresponding integral.

f\mboxLipsupxyf(x)f(y)/xy2\|f\|_{\mbox{\tiny\rm Lip}}\equiv\sup_{x\neq y}|f({\bm{x}})-f({\bm{y}})|/\|{\bm{x}}-{\bm{y}}\|_{2} denotes the Lipshitz constant of a function ff.

d\mboxBL(,)d_{\mbox{\tiny\rm BL}}(\,\cdot\,,\,\cdot\,) is the bounded Lipschitz distance between probability measures

Here C(μ,ν){\mathcal{C}}(\mu,\nu) is the set of couplings of μ\mu and ν\nu.

Wp(,)W_{p}(\,\cdot\,,\,\cdot\,) is the Wasserstein distance between probability measures

For p=1p=1, the Kantorovich-Rubinstein duality gives

KK is a generic constant depending on K0,K1,K2,K3K_{0},K_{1},K_{2},K_{3}, where KiK_{i}’s are constants which will be specified from the context.

General results: Statics

In this section, we discuss some properties of the population risk, RN(θ)R_{N}({\bm{\theta}}), and its continuum counterpart R(ρ)R(\rho). For future reference, we copy the key definitions from the main text:

We show that minimizing the population risk RN(θ)R_{N}({\bm{\theta}}) yields similar results to minimizing its continuum counterpart R(ρ)R(\rho):

We establish the condition for ρ\rho_{*} to be a minimizer:

First notice that, for any θ=(θi)iN{\bm{\theta}}=({\bm{\theta}}_{i})_{i\leq N}, we have

Indeed, RN(θ)=R(ρ)R_{N}({\bm{\theta}})=R(\rho) for ρ=(1/N)i=1Nδθi\rho=(1/N)\sum_{i=1}^{N}\delta_{{\bm{\theta}}_{i}}.

whence the claim (6.6) follows since ε{\varepsilon} is arbitrary.

We next establish the minimum condition (6.7). Notice that since V()V(\,\cdot\,) is continuous, and U(,)U(\,\cdot\,,\,\cdot\,) is bounded below, it follows from Fatou’s lemma that, for any ρ\rho, the function θΨ(θ;ρ){\bm{\theta}}\mapsto\Psi({\bm{\theta}};\rho) is lower semicontinous and takes values in (,](-\infty,\infty]. In particular the set S0(ρ)argminθΨ(θ;ρ)S_{0}(\rho)\equiv\arg\min_{{\bm{\theta}}}\Psi({\bm{\theta}};\rho) must be closed.

We first prove that any minimizer must satisfy (6.7). Let ρ\rho_{*} be a minimizer and define Ψ=infθΨ(θ;ρ)\Psi_{*}=\inf_{{\bm{\theta}}}\Psi({\bm{\theta}};\rho_{*}). By rearranging terms, for any probability measure ρ\rho, we have

First we will assume Ψ>\Psi_{*}>-\infty (whence, by lower semicontinuity, S0(ρ)S_{0}(\rho_{*}) must be a non-empty closed set). Let θ1S0(ρ){\bm{\theta}}_{1}\in S_{0}(\rho_{*}), and assume by contradiction that there exist θ0supp(ρ){\bm{\theta}}_{0}\in{\rm supp}(\rho_{*}), θ0∉S0(ρ){\bm{\theta}}_{0}\not\in S_{0}(\rho_{*}). Let B(θ0;ε){\sf B}({\bm{\theta}}_{0};{\varepsilon}) be a ball of radius ε{\varepsilon} around θ0{\bm{\theta}}_{0}. By lower semicontinuity, we can find ε0,Δ>0{\varepsilon}_{0},\Delta>0 such that infθB(θ0;ε0)Ψ(θ;ρ)=Ψ+Δ>Ψ\inf_{{\bm{\theta}}\in{\sf B}({\bm{\theta}}_{0};{\varepsilon}_{0})}\Psi({\bm{\theta}};\rho_{*})=\Psi_{*}+\Delta>\Psi_{*}. Further t0ρ(B(θ0;ε0))>0t_{0}\equiv\rho_{*}({\sf B}({\bm{\theta}}_{0};{\varepsilon}_{0}))>0 because θ0supp(ρ){\bm{\theta}}_{0}\in{\rm supp}(\rho_{*}).

Let ν1B(θ0;ε0)ρ/t0\nu\equiv{\bm{1}}_{{\sf B}({\bm{\theta}}_{0};{\varepsilon}_{0})}\rho_{*}/t_{0} (i.e. ν\nu is the conditional distribution given θB(θ0;ε0){\bm{\theta}}\in{\sf B}({\bm{\theta}}_{0};{\varepsilon}_{0})). Define, for t[0,t0]t\in[0,t_{0}], the probability measure

where the second inequality follows from the fact that UU is continuous and δθ1\delta_{{\bm{\theta}}_{1}}, ν\nu have bounded support. By taking tt small enough, we get R(ρ)<R(ρ)R(\rho)<R(\rho_{*}) hence reaching a contradiction.

By selecting t=tM=min(t0,(M+Ψ0)/C0(M))t=t_{M}=\min(t_{0},(M+\Psi_{0})/C_{0}(M)) (which is positive for all MM large enough), we obtain R(ρM,t)R(ρ)<0R(\rho_{M,t})-R(\rho_{*})<0 for all MM large and hence reach a contradiction.

Setting μ=2[Ψ(;ρ)Ψ]\mu=2[\Psi(\,\cdot\,;\rho_{*})-\Psi_{*}], and noticing that condition (6.7) implies Ψ(;ρ)Ψ,ρ=0\langle\Psi(\,\cdot\,;\rho_{*})-\Psi_{*},\rho_{*}\rangle=0, we get R(ρ)R(ρ)+U,(ρρ)2R(ρ)R(\rho)\geq R(\rho_{*})+\langle U,(\rho-\rho_{*})^{\otimes 2}\rangle\geq R(\rho_{*}).

2 Some additional results

We often find empirically that the optimal density ρ\rho_{*} is supported on a set of Lebesgue measure (sometimes on a finite set of points). The following consequence of the previous results partially explains these findings.

Assume θV(θ){\bm{\theta}}\mapsto V({\bm{\theta}}) to be an analytic function and (θ1,θ2)U(θ1,θ2)({\bm{\theta}}_{1},{\bm{\theta}}_{2})\mapsto U({\bm{\theta}}_{1},{\bm{\theta}}_{2}) to be analytic with respect to θ1{\bm{\theta}}_{1}, uniformly in θ2{\bm{\theta}}_{2}. Namely there exists a locally bounded function θB(θ){\bm{\theta}}\mapsto B({\bm{\theta}}) such that θ1kU(θ1,θ2)2k!B(θ1)k\|\nabla^{k}_{{\bm{\theta}}_{1}}U({\bm{\theta}}_{1},{\bm{\theta}}_{2})\|_{2}\leq k!B({\bm{\theta}}_{1})^{k} for all kk, θ1{\bm{\theta}}_{1}, θ2{\bm{\theta}}_{2}. If ρ\rho_{*} is a minimizer of R(ρ)R(\rho), then one of the following holds

The support of ρ\rho_{*} has zero Lebesgue measure.

If D=1D=1, then (b)(b) can be replaced by: (b)(b^{\prime}) ρ\rho_{*} is a convex combination of countably many point masses with no accumulation point (finitely many if Ψ(θ;ρ)\Psi(\theta;\rho_{*})\to\infty as θ|\theta|\to\infty).

Note that, under the stated conditions f(θ)U(θ,θ)ρ(dθ)f({\bm{\theta}})\equiv\int U({\bm{\theta}},{\bm{\theta}}^{\prime})\,\rho_{*}({\rm d}{\bm{\theta}}^{\prime}) is analytic. Indeed, by a standard dominated convergence argument, we have that kf\nabla^{k}f is given by the integral of kU(θ1,θ2)ρ(dθ2)\int\nabla^{k}U({\bm{\theta}}_{1},{\bm{\theta}}_{2})\,\rho_{*}({\rm d}{\bm{\theta}}_{2}) for any k0k\geq 0. Further, by an application of the intermediate value theorem there exists tθ1,θ2,δt_{{\bm{\theta}}_{1},{\bm{\theta}}_{2},{\bm{\delta}}}\in such that

which vanishes as kk\to\infty for uniformly over δ2δ0\|{\bm{\delta}}\|_{2}\leq\delta_{0} for δ0\delta_{0} small enough.

General results: Dynamics

In this section we consider the SGD dynamics with step size sk=εξ(kε)s_{k}={\varepsilon}\xi(k{\varepsilon}), under the assumptions A1,A2,A3{\sf A1},{\sf A2},{\sf A3} stated in the main text. For the readers convenience, we reproduce here the form of the limiting PDE

For background on this and similar PDEs (and the analogous ones at finite temperature, cf. Section 10), we refer to [MV00, CMV+03, CMV06, AGS08, CDF+11]. Our treatment will be mostly self-contained because of some differences between our setting and the one in these papers.

Recall assumptions A1, A2, A3 in the main text. By [Szn91, Theorem 1.1], assumptions A1 and A3 are sufficient for the existence and uniqueness of solution of PDE (7.1).

Assume conditions A1 and A3 hold. Let (ρt)t0(\rho_{t})_{t\geq 0} be the solution of the PDE (7.1). Let (θit)t0(\overline{\bm{\theta}}_{i}^{t})_{t\geq 0} be the solution of nonlinear dynamics (7.4). Then tθitt\mapsto\overline{\bm{\theta}}^{t}_{i} is K1K3K_{1}K_{3}-Lipschitz continuous, and tρtt\mapsto\rho_{t} is K1K3K_{1}K_{3}-Lipschitz continuous in W2W_{2} Wasserstein distance, with K1K_{1} and K3K_{3} as per conditions A1 and A3. In particular, tρtt\mapsto\rho_{t} is continuous in the topology of weak convergence.

Since ξ\xi and Ψ\nabla\Psi are K1K_{1} and K3K_{3} bounded respectively, tθitt\mapsto\overline{\bm{\theta}}^{t}_{i} is K1K3K_{1}K_{3}-Lipschitz continuous. Further, Eq. (5.2) implies that tρtt\mapsto\rho_{t} is Lipschitz continuous in W2W_{2} Wasserstein distance, namely

The proof follows a ‘propagation of chaos’ argument [Szn91]. Throughout this proof, we will use KK to denote generic constant depending on the constants K1,K2,K3K_{1},K_{2},K_{3} in conditions A1, A2, A3.

It is convenient to introduce the notations zk=(xk,yk){\bm{z}}_{k}=({\bm{x}}_{k},y_{k}) to denote the kk-th example and define

Note that the assumption of bounded Lipschitz V\nabla V, 1U\nabla_{1}U (here and below 1U(θ1,θ2)\nabla_{1}U({\bm{\theta}}_{1},{\bm{\theta}}_{2}) denotes the gradient of UU with respect to its first argument) implies G(θ;ρ)2K\|{\bm{G}}({\bm{\theta}};\rho)\|_{2}\leq K and G(θ1;ρ)G(θ2;ρ)2Kθ1θ22\|{\bm{G}}({\bm{\theta}}_{1};\rho)-{\bm{G}}({\bm{\theta}}_{2};\rho)\|_{2}\leq K\|{\bm{\theta}}_{1}-{\bm{\theta}}_{2}\|_{2}. Further

With these notations, we can rewrite the SGD dynamics in the main text as

Recall (θi0)iNρ0({\bm{\theta}}^{0}_{i})_{i\leq N}\sim\rho_{0} independently.

We next state and prove the key estimate controlling the difference between the original dynamics and the nonlinear dynamics.

Under the assumptions of Theorem 3, there exists a constant KK depending uniquely on K1,K2,K3K_{1},K_{2},K_{3} in conditions A1, A2, and A3, such that for any T0T\geq 0, we have

with probability at least 1ez21-e^{-z^{2}}.

We next consider the three terms above. Using the Lipschitz continuity of G(θ;ρ){\bm{G}}({\bm{\theta}};\rho) with respect to θ{\bm{\theta}} and ρ\rho (see Eq. (7.10)), and due to condition A1 and Lemma 7.1 (implying that ξ\xi, θit\overline{\bm{\theta}}^{t}_{i}, and ρs\rho_{s} are Lipschitz continuous), we get

Bounding the second term yields (by using the Lipschitz continuity of G{\bm{G}} with respect to its first argument):

where ρ^k(N)(1/N)iNδθik\hat{\rho}^{(N)}_{k}\equiv(1/N)\sum_{i\leq N}\delta_{{\bm{\theta}}^{k}_{i}}. Hence

and taking union bound over iNi\leq N, we get

For the term E3,0i(t)E_{3,0}^{i}(t), we use the Lipschitz continuity property (7.10), whence

Since for any fixed kk, (θjkε)jN,ji(\overline{\bm{\theta}}_{j}^{k{\varepsilon}})_{j\leq N,j\neq i} are i.i.d. and independent of θik{\bm{\theta}}_{i}^{k}, and 1U\nabla_{1}U is bounded, we get by another application of Azuma-Hoeffding inequality, cf. Lemma A.1,

Conditional on the good events in Eq. (7.22) and (7.25), Eq. (7.20) thus yields

with probability at least 1ez21-e^{-z^{2}}.

Using the bounds (7.16), (7.17), (7.26) in Eq. (7.15), we get

Using the bound (7.27), the claim follows. ∎

Under the assumptions of Theorem 3, we have

Let θ=(θ1,,θi,,θn){\bm{\theta}}=({\bm{\theta}}_{1},\dots,{\bm{\theta}}_{i},\dots,{\bm{\theta}}_{n}) and θ=(θ1,,θi,,θn){\bm{\theta}}^{\prime}=({\bm{\theta}}_{1},\dots,{\bm{\theta}}_{i}^{\prime},\dots,{\bm{\theta}}_{n}) be two configurations that differ only in position ii. Then

Under the assumptions of Theorem 3, we have,

with probability at least 1ez21-e^{-z^{2}}.

By Eq. (7.32) and by Azuma-Höeffding inequality and union bound, we get

with probability at least 1ez21-e^{-z^{2}}. The claim follows since

The proof of the theorem follows from a straightforward application of Lemma 7.2, 7.3, 7.4. The proof for any bounded Lipschitz function ff follows the same argument as Lemma 7.3, 7.4. As a result, for any sequence (N,ε=εN)(N,{\varepsilon}={\varepsilon}_{N}) such that NN\to\infty and εN0{\varepsilon}_{N}\to 0 with N/log(N/εN)N/\log(N/{\varepsilon}_{N})\to\infty and εNlog(N/εN)0{\varepsilon}_{N}\log(N/{\varepsilon}_{N})\to 0, we have ρ^k/ε(N)\hat{\rho}_{\lfloor k/{\varepsilon}\rfloor}^{(N)} converges weakly to ρt\rho_{t} almost surely immediately.

2 Proof of Theorem 3: Generalization to β<∞𝛽\beta<\infty

Here we generalize the proof given in the previous section to noisy SGD at finite temperature β<\beta<\infty. Since the proof follows the same scheme as in the noiseless case, we will limit ourselves to describing the differences.

Throughout this section we assume that conditions A1{\sf A1}, A2{\sf A2}, A3{\sf A3} hold. We also let

for some λ1\lambda\leq 1. Further we assume ρ0\rho_{0} is K02K_{0}^{2}-sub-Gaussian. Finally, we assume 1β<1\leq\beta<\infty.

For the reader’s convenience, we reproduce here the form of the limiting PDE

which again should be interpreted in weak sense.

Recall conditionss A1, A2, A3 in the main text. By a modified argument of [Szn91, Theorem 1.1], conditions A1 and A3 are sufficient for the existence and uniqueness of solution of PDE (7.37) in weak sense. Section 10 provides further information of this PDE, including a proof of existence and uniqueness.

As in the noiseless case, there is an equivalent formulation of this PDE as a fixed point distribution for the following nonlinear dynamics, which is an integration form of a stochastic differential equation,

where {Wi(s)}s0\{\bm{W}_{i}(s)\}_{s\geq 0} for iNi\leq N are independent DD-dimensional Brownian motions, and G(θ;ρ)Ψλ(θ;ρ){\bm{G}}({\bm{\theta}};\rho)\equiv-\nabla\Psi_{\lambda}({\bm{\theta}};\rho). The assumptions on UU, VV, λ\lambda, and ξ\xi ensures that this nonlinear dynamics has a unique continuous solution.

It is convenient to collect some standard estimates about the solution of the stochastic differential equation (7.38).

Assume ρ0\rho_{0} is K02K_{0}^{2}-sub-Gaussian, ξ(s)\xi(s) and G(0;ρs){\bm{G}}({\mathbf{0}};\rho_{s}) are K0K_{0}-bounded, G(θ;ρs){\bm{G}}({\bm{\theta}};\rho_{s}) is K0K_{0}-Lipschitz in θ{\bm{\theta}}, and β1\beta\geq 1. Let (θit)t0(\overline{\bm{\theta}}_{i}^{t})_{t\geq 0} for iNi\leq N be the solution of (7.38) with independent initialization (θi0)iNρ0({\bm{\theta}}_{i}^{0})_{i\leq N}\sim\rho_{0}. Let (ρt)t0(\rho_{t})_{t\geq 0} be the solution of PDE (7.37). Then there exists a constant KK depending uniquely on K0K_{0}, such that

Part (a). First, note that for any DD-dimensional K02K_{0}^{2}-sub-Gaussian random vector X\bm{X}, we have

Note that (θi0)iNρ0({\bm{\theta}}_{i}^{0})_{i\leq N}\sim\rho_{0} independently, and ρ0\rho_{0} is K02K_{0}^{2}-sub-Gaussian. Therefore

Taking τ=1/(2K02)\tau=1/(2K_{0}^{2}) and u=2K0(D+logN+z)u=2K_{0}(\sqrt{D+\log N}+z), we get

Then we define Wξ,i(t)0t2ξ(s)dWi(s)\bm{W}_{\xi,i}(t)\equiv\int_{0}^{t}\sqrt{2\xi(s)}\,{\rm d}\bm{W}_{i}(s). We have Var(Wξ,ij(t))=0t2ξ(s)ds2K0t{\rm Var}(W_{\xi,i}^{j}(t))=\int_{0}^{t}2\xi(s){\rm d}s\leq 2K_{0}t for jDj\leq D. Note exp{τWξ,i(t)22}\exp\{\tau\|\bm{W}_{\xi,i}(t)\|_{2}^{2}\} is a submartingale, due to Doob’s martingale inequality, we have

Taking τ=1/(4K0T)\tau=1/(4K_{0}T) and u=4K0T(D+logN+z)u=4\sqrt{K_{0}T}(\sqrt{D+\log N}+z), we get

By noting that ξ(s)\xi(s), G(0;ρs){\bm{G}}({\mathbf{0}};\rho_{s}) are K0K_{0}-bounded, and G(θ;ρs){\bm{G}}({\bm{\theta}};\rho_{s}) is K0K_{0}-Lipschitz in θ{\bm{\theta}}, according to Eq. (7.38), there exists some constant KK depending on K0K_{0}, such that

where Δi(t)supstθis2\Delta_{i}(t)\equiv\sup_{s\leq t}\|\overline{\bm{\theta}}_{i}^{s}\|_{2}, WmaxiNsuptTWξ,i(t)2W\equiv\max_{i\leq N}\sup_{t\leq T}\|\bm{W}_{\xi,i}(t)\|_{2}, and ΘmaxiNθi02\Theta\equiv\max_{i\leq N}\|{\bm{\theta}}_{i}^{0}\|_{2}. Due to Gronwall’s inequality, we have

The high probability bound (7.42) holds by noting the high probability bound for Θ\Theta and WW in Eq. (7.46) and (7.47).

Part (b). Define Δi(h;k,ε)=sup0uhθikε+uθikε2\Delta_{i}(h;k,{\varepsilon})=\sup_{0\leq u\leq h}\|\overline{\bm{\theta}}_{i}^{k{\varepsilon}+u}-\overline{\bm{\theta}}_{i}^{k{\varepsilon}}\|_{2}. By noting that ξ(s)\xi(s), G(0;ρs){\bm{G}}({\mathbf{0}};\rho_{s}) are K0K_{0}-bounded, and G(θ;ρs){\bm{G}}({\bm{\theta}};\rho_{s}) is K0K_{0}-Lipschitz in θ{\bm{\theta}}, according to Eq. (7.38), we have

where Wξ,i,k(u)kεkε+u2ξ(s)dWi(s)\bm{W}_{\xi,i,k}(u)\equiv\int_{k{\varepsilon}}^{k{\varepsilon}+u}\sqrt{2\xi(s)}\,{\rm d}\bm{W}_{i}(s). Similar to the bound Eq. (7.47), we have

Plugging the bound Eq. (7.42) and Eq. (7.49) into Eq. (7.48), we have

with probability at least 1ez21-e^{-z^{2}}.

Part (c). Equation (7.44) holds directly by noting that

As in the noiseless case, the key step consists in bounding the difference between the nonlinear dynamics and the SGD dynamics.

Under the assumptions of Theorem 3, there exists a constant KK depending uniquely on K0,K1,K2,K3K_{0},K_{1},K_{2},K_{3}, such that for any T0T\geq 0, we have

with probability at least 1ez21-\,e^{-z^{2}}.

Terms E2i(t)E_{2}^{i}(t), E3i(t)E_{3}^{i}(t) can be bounded the same as in Lemma 7.2, i.e., Eq. (7.17) and (7.26), by noting that the replacement of Ψ\Psi by Ψλ\Psi_{\lambda} does not affect these estimates.

To bound E4i(t)E_{4}^{i}(t), notice that \bm{W}_{\xi,i}\equiv\int_{0}^{T}\big{(}\sqrt{2\xi(s)}-\sqrt{2\xi([s])}\big{)}\,{\rm d}\bm{W}_{i}(s) is a Gaussian random vector, Wξ,iN(0,τ2ID)\bm{W}_{\xi,i}\sim{\sf N}({\mathbf{0}},\tau^{2}{\bm{I}}_{D}), where, using the Lipschitz continuity of ξ\xi,

and therefore by applying Doob’s inequality to the submartingale tE4i(t)t\mapsto E_{4}^{i}(t), we get

We need to modify the proof of Lemma 7.2 to bound terms E1i(t)E_{1}^{i}(t).

To bound the first term E1,Ai(t)E_{1,A}^{i}(t), due to the Lipschitz property of G(θ;ρ){\bm{G}}({\bm{\theta}};\rho) and the boundedness of G(0;ρ){\bm{G}}({\mathbf{0}};\rho), with probability at least 1ez21-e^{-z^{2}}, we have for all iNi\leq N and tTt\leq T,

Here the last inequality is due to Eq. (7.42) in Lemma 7.5.

To bound the second term E1,Bi(t)E_{1,B}^{i}(t), using the fact that 1U\nabla_{1}U is bounded Lipschitz, we have for all iNi\leq N and tTt\leq T,

Here the last inequality is due to Eq. (7.44) in Lemma 7.5.

To bound the third term E1,Ci(t)E_{1,C}^{i}(t), with probability at least 1ez21-e^{-z^{2}}, we have for all iNi\leq N and tTt\leq T,

Here the last inequality is due to Eq. (7.43) in Lemma 7.5.

As a result, combining Eq. (7.17), (7.26), (7.27), (7.51), (7.52), (7.54), (7.55), and (7.56), defining

Applying Gronwall’s inequality gives the desired result.

The generalization of Theorem 3 to β<\beta<\infty follows from this lemma exactly as in the previous section.

3 Proof of Proposition 2: Monotonicity of the risk

By Lemma 7.1, tρtt\mapsto\rho_{t} is Lipschitz continuous in Wasserstein distance W2(ρt1,ρt2)Kt1t2W_{2}(\rho_{t_{1}},\rho_{t_{2}})\leq K|t_{1}-t_{2}|. Hence, we get

where in the second step we used Eq. (7.3). This immediately implies that R(ρt)R(\rho_{t}) is non-increasing in tt.

Let ρ\rho be a fixed point of Eq. (7.1). Since tR(ρt)ρ0=ρ=0\partial_{t}R(\rho_{t})|_{\rho_{0}=\rho}=0, the above formula implies

and therefore ρ\rho is supported in the set of θ{\bm{\theta}}’s such that Ψ(θ;ρ)=0\nabla\Psi({\bm{\theta}};\rho)={\mathbf{0}}.

Vice versa, if this is the case, setting ρ0=ρ\rho_{0}=\rho, Eq. (7.3) implies tφ,ρt=0\partial_{t}\langle\varphi,\rho_{t}\rangle=0, then ρtρ0\rho_{t}\equiv\rho_{0} is a fixed point.

4 A general continuity result

It is useful to notice that the solution (ρt)t0(\rho_{t})_{t\geq 0} of the PDE (7.1) is continuous with respect to changes in V()V(\,\cdot\,), U(,)U(\,\cdot\,,\,\cdot\,). Namely, we consider the following two PDEs:

The proof adapts the argument used to establish uniqueness in [Szn91]. Without loss of generality, we fix ξ(t)1/2\xi(t)\equiv 1/2. We further denote by KK generic constants depending on K1K_{1}, K3K_{3}.

The assumption of bounded Lipschitz V\nabla V and U\nabla U implies that Ψ(θ;ρ)\nabla\Psi({\bm{\theta}};\rho) is KK-bounded Lipschitz with respect to argument (θ,ρ)({\bm{\theta}},\rho), that is,

Using bound (7.70), the first term E1(t)E_{1}(t) is simply bounded by

To bound the second term E2(t)E_{2}(t), we have

with the definition of ε0{\varepsilon}_{0} given by Equation (7.69).

Combining Equation (7.74), (7.75), and (7.76), we have

Applying Equation (7.73), the result follows. ∎

5 Some properties of the solution of the PDE (7.1)

In this section we prove four lemmas on the properties of the solution of the PDE (7.1), under conditions A1 and A3. All of these facts are quite standard, but we provide complete proofs for them for reader’s convenience.

We will use several times the following notations. Let ρt\rho_{t} be a solution of the PDE (7.1) with initialization ρ0\rho_{0}. Let (θt)t0({\bm{\theta}}^{t})_{t\geq 0} be the solution of the ordinary differential equation (ODE)

With these notations, ρt\rho_{t} is the push forward of ρ0\rho_{0} under φt{\bm{\varphi}}^{t}: ρt=φtρ0\rho_{t}={\bm{\varphi}}_{*}^{t}\rho_{0}. In other words, for any Borel set BB, ρt(φt(B))=ρ0(B)\rho_{t}({\bm{\varphi}}^{t}(B))=\rho_{0}(B).

The lemma holds immediately by noting that ρt(Ω)ρt(φt(Ω))=ρ0(Ω)\rho_{t}(\Omega)\geq\rho_{t}({\bm{\varphi}}^{t}(\Omega))=\rho_{0}(\Omega). ∎

Assume conditions A1, A3 hold. Further assume there exists a constant K<K<\infty such that

for any θ(0,)D{\bm{\theta}}\in(0,\infty)^{D} and ρP([0,]D)\rho\in{\mathcal{P}}([0,\infty]^{D}). Let (ρt)t0(\rho_{t})_{t\geq 0} be the solution of the PDE (7.1) with initial condition ρ0\rho_{0} with ρ0((0,)D)=1\rho_{0}((0,\infty)^{D})=1. Then for any t<t<\infty, ρt((0,)D)=1\rho_{t}((0,\infty)^{D})=1.

According to Eqs. (7.80) and (7.79), we have for i[d]i\in[d],

Then according to (7.81), we have φt(Ωk(0))Ωk(t){\bm{\varphi}}^{t}(\Omega_{k}(0))\subseteq\Omega_{k}(t). Note Ωk(t)\Omega_{k}(t) is increasing in kk for fixed tt, and kΩk(t)=kΩk(0)=(0,)D\cup_{k}\Omega_{k}(t)=\cup_{k}\Omega_{k}(0)=(0,\infty)^{D}. Hence,

Let (ρt)t0(\rho_{t})_{t\geq 0} be a continuous curve in a compact metric space (Ω,d)(\Omega,d). Denoting

to be the set of all limiting points of (ρt)t0(\rho_{t})_{t\geq 0}. Then S{\mathcal{S}}_{*} is a connected compact set.

First, it is easy to see that S{\mathcal{S}}_{*} should be closed. Note that Ω\Omega is a compact space, then S{\mathcal{S}}_{*} should be a compact set. If S={ρ}{\mathcal{S}}_{*}=\{\rho_{*}\} is a singleton, this lemma holds automatically. Therefore, we would like to consider the case when S{\mathcal{S}}_{*} is not a singleton.

For any ρ1,ρ2S\rho_{1},\rho_{2}\in{\mathcal{S}}_{*}, and d(ρ1,ρ2)>0d(\rho_{1},\rho_{2})>0. We would like to show ρ1\rho_{1} and ρ2\rho_{2} are connected in S{\mathcal{S}}_{*}.

We use proof by contradiction. Now suppose ρ1\rho_{1} and ρ2\rho_{2} are not connected. Define AS{\mathcal{A}}\subseteq{\mathcal{S}}_{*} to be the maximal connected subset of S{\mathcal{S}}_{*} containing ρ1\rho_{1}. It is easy to see that A{\mathcal{A}} is compact. It is also easy to see that its complement BSA{\mathcal{B}}\equiv{\mathcal{S}}_{*}\setminus{\mathcal{A}} is also a compact set, and ρ2B\rho_{2}\in{\mathcal{B}}. As a result, we have AB=S{\mathcal{A}}\cup{\mathcal{B}}={\mathcal{S}}_{*}, AB={\mathcal{A}}\cap{\mathcal{B}}=\emptyset, and ρ1A\rho_{1}\in{\mathcal{A}}, ρ2B\rho_{2}\in{\mathcal{B}}.

Note that Ω\Omega is a metric space, so it satisfies T4 separation axiom. Since A{\mathcal{A}} and B{\mathcal{B}} are closed sets and AB={\mathcal{A}}\cap{\mathcal{B}}=\emptyset, there exists an open set O{\mathcal{O}}, such that AO{\mathcal{A}}\subseteq{\mathcal{O}}, OB={\mathcal{O}}\cap{\mathcal{B}}=\emptyset. Hence, OSc\partial{\mathcal{O}}\subseteq{\mathcal{S}}_{*}^{c}.

Note that ρ1\rho_{1} and ρ2\rho_{2} are limiting points of (ρt)t0(\rho_{t})_{t\geq 0} which is a continuous curve in Ω\Omega. Therefore, it must cross the boundary O\partial{\mathcal{O}} infinite times. That is, there is a sequence (tk)k1(t_{k})_{k\geq 1} of time with limktk=\lim_{k\to\infty}t_{k}=\infty, such that ρtkO\rho_{t_{k}}\in\partial{\mathcal{O}}. But since O\partial{\mathcal{O}} is compact, there exists a limiting point ρO\rho_{*}\in\partial{\mathcal{O}}, so that a subsequence of sequence ρtk\rho_{t_{k}} converges to ρ\rho_{*}. Therefore, ρ\rho_{*} should be a limiting point of (ρt)t0(\rho_{t})_{t\geq 0}. But this contradict with OSc\partial{\mathcal{O}}\subseteq{\mathcal{S}}_{*}^{c}. ∎

Under the assumptions of A1 and A3, further assume that U,VU,V are twice continuous differentiable, and that ρ0\rho_{0} has density with respect to the Lebesgue measure, bounded by M0M_{0}. Then ρt\rho_{t} also has a density, bounded by Mt=KM0exp{KDt}M_{t}=K\,M_{0}\exp\{KDt\} (where KK depends on the constants in the assumptions).

Let J(θ;t)\bm{J}({\bm{\theta}};t) for the Jacobian of φt(){\bm{\varphi}}^{t}(\,\cdot\,) at θ0=θ{\bm{\theta}}^{0}={\bm{\theta}}. Then Eq. (7.79) implies that J(θ;t)\bm{J}({\bm{\theta}};t) satisfies

with initial condition J(θ;0)=ID\bm{J}({\bm{\theta}};0)={\bm{I}}_{D}. This implies

Therefore, using the fact that 2Ψ(θ;ρt)\mboxop\|\nabla^{2}\Psi({\bm{\theta}};\rho_{t})\|_{\mbox{\tiny\rm op}} is KK-bounded, we obtain \lambda_{\min}\big{(}\bm{J}({\bm{\theta}};t)\big{)}\geq\exp(-Kt). Finally, since φt{\bm{\varphi}}^{t} is a diffeomorphism, we have

6 Proof of Theorems 6: Stability conditions

where H0H0(δθ)=2V(θ)+1,12U(θ,θ)\bm{H}_{0}\equiv\bm{H}_{0}(\delta_{{\bm{\theta}}_{*}})=\nabla^{2}V({\bm{\theta}}_{*})+\nabla^{2}_{1,1}U({\bm{\theta}}_{*},{\bm{\theta}}_{*}). Notice that

and therefore 1,22U(θ,θ)0\nabla^{2}_{1,2}U({\bm{\theta}}_{*},{\bm{\theta}}_{*})\succeq{\mathbf{0}}, whence H1H0\bm{H}_{1}\succeq\bm{H}_{0}.

We first establish the condition for ρ=δθ\rho_{*}=\delta_{{\bm{\theta}}_{*}} to be a fixed point. Note that Ψ(θ;ρ)=V(θ)+U(θ,θ)\Psi({\bm{\theta}};\rho_{*})=V({\bm{\theta}})+U({\bm{\theta}},{\bm{\theta}}_{*}) and supp(ρ)={θ}{\rm supp}(\rho_{*})=\{{\bm{\theta}}_{*}\}. Hence the condition in the main text is satisfied if and only if θΨ(θ;ρ)θ=θ=0\nabla_{\bm{\theta}}\Psi({\bm{\theta}};\rho_{*})|_{{\bm{\theta}}={\bm{\theta}}_{*}}={\mathbf{0}}, i.e. V(θ)+1U(θ,θ)=0\nabla V({\bm{\theta}}_{*})+\nabla_{1}U({\bm{\theta}}_{*},{\bm{\theta}}_{*})={\mathbf{0}}.

To establish the stability result of Theorem 6, the following lemma provides a key estimate.

Under the assumptions of Theorem 6, let λλmin(H0)>0\lambda\equiv\lambda_{\min}(\bm{H}_{0})>0. Then there exists r1,ε1,γ>0r_{1},{\varepsilon}_{1},\gamma>0 such that the following hold

If supp(ρ)B(θ;r1){θ:  θθ2r1}{\rm supp}(\rho)\subseteq{\sf B}({\bm{\theta}}_{*};r_{1})\equiv\{{\bm{\theta}}:\;\|{\bm{\theta}}-{\bm{\theta}}_{*}\|_{2}\leq r_{1}\}, then,

If θθ22ρ(dθ)ε12\int\|{\bm{\theta}}-{\bm{\theta}}_{*}\|_{2}^{2}\,\rho({\rm d}{\bm{\theta}})\leq{\varepsilon}^{2}_{1} and supp(ρ)B(θ;r1){\rm supp}(\rho)\subseteq{\sf B}({\bm{\theta}}_{*};r_{1}), then for any θB(θ;r1)B(θ;r1/2){\bm{\theta}}\in{\sf B}({\bm{\theta}}_{*};r_{1})\setminus{\sf B}({\bm{\theta}}_{*};r_{1}/2),

Since 2V(θ)\nabla^{2}V({\bm{\theta}}) is continuous and 12U(θ,θ)\nabla_{1}^{2}U({\bm{\theta}},{\bm{\theta}}^{\prime}) is bounded continuous, it follows that θ2Ψ(θ;ρ){\bm{\theta}}\mapsto\nabla^{2}\Psi({\bm{\theta}};\rho) is continuous, and ρ2Ψ(θ;ρ)\rho\mapsto\nabla^{2}\Psi({\bm{\theta}};\rho) is continuous in the weak topology, and in fact (θ,ρ)2Ψ(θ;ρ)({\bm{\theta}},\rho)\mapsto\nabla^{2}\Psi({\bm{\theta}};\rho) is continuous in the product topology.

Since H00\bm{H}_{0}\succ{\mathbf{0}} strictly, for any δ>0\delta>0 we can choose r1=r1(δ)>0r_{1}=r_{1}(\delta)>0 such that

for all θB(θ;r1){\bm{\theta}}\in{\sf B}({\bm{\theta}}_{*};r_{1}), and ρ\rho such that supp(ρ)B(θ;r1){\rm supp}(\rho)\subseteq{\sf B}({\bm{\theta}}_{*};r_{1}). If these conditions hold

In order to bound the second term, note that, since Ψ(θ;ρ)=0\nabla\Psi({\bm{\theta}}_{*};\rho_{*})={\mathbf{0}},

where Q=(θμ)(θμ)Tρ(dθ){\bm{Q}}=\int({\bm{\theta}}-\mu)({\bm{\theta}}-\mu)^{{\sf T}}\,\rho({\rm d}{\bm{\theta}}) is the covariance of (θθ)({\bm{\theta}}-{\bm{\theta}}_{*}).

Let now consider the claim at point (i)(i). Integrating Eq. (7.103) with respect to ρ(dθ)\rho({\rm d}{\bm{\theta}}), we get

By choosing δ\delta sufficiently small, we can ensure that (1δ)H0(δ/2)Iλmin(H0)I/2(1-\delta)\bm{H}_{0}-(\delta/2){\bm{I}}\succeq\lambda_{\min}(\bm{H}_{0}){\bm{I}}/2, H1δH0(3δ/2)Iλmin(H1)I/2\bm{H}_{1}-\delta\bm{H}_{0}-(3\delta/2){\bm{I}}\succeq\lambda_{\min}(\bm{H}_{1}){\bm{I}}/2, and therefore

Next consider point (ii)(ii). In this case, Eq. (7.107) implies

Substituting in Eq. (7.103), and using μ2ε1\|{\bm{\mu}}\|_{2}\leq{\varepsilon}_{1}, we get

This is strictly positive for all ε1{\varepsilon}_{1} small enough, hence implying the claim (7.92). ∎

We are now in position of proving Theorem 6.

Let r0=min(r1/2,ε1/2)r_{0}=\min(r_{1}/2,{\varepsilon}_{1}/2) and assume, without loss of generality t0=0t_{0}=0, so that supp(ρ0)B(θ;r0){\rm supp}(\rho_{0})\subseteq{\sf B}({\bm{\theta}}_{*};r_{0}). We also define

As usual, we adopt the convention that the infimum of an empty set is equal to ++\infty.

Define φ1(θ)=h(θθ2)\varphi_{1}({\bm{\theta}})=h(\|{\bm{\theta}}-{\bm{\theta}}_{*}\|_{2}), with hh to be an non-decreasing function with

where, in the last inequality, we used Lemma 7.12.(ii)(ii). Next, define

Applying again Eq. (7.1), we get, for tTt\leq T_{*},

Together the last two bounds imply T=T_{*}=\infty. Indeed assume by contradiction T<T_{*}<\infty. Then either T1T2T_{1}\leq T_{2}, T1<T_{1}<\infty, or T2<T1T_{2}<T_{1}, T2<T_{2}<\infty.

Consider the first case: T1T2T_{1}\leq T_{2}, T1<T_{1}<\infty. Since ρT1,φ2ε12\langle\rho_{T_{1}},\varphi_{2}\rangle\geq{\varepsilon}_{1}^{2} but ρ0,φ2r02ε12/4\langle\rho_{0},\varphi_{2}\rangle\leq r_{0}^{2}\leq{\varepsilon}_{1}^{2}/4, there exists t<Tt<T_{*} such that tρ0,φ2>0\partial_{t}\langle\rho_{0},\varphi_{2}\rangle>0. However this contradicts Eq. (7.126). Consider then the second case: T2<T1T_{2}<T_{1}, T2<T_{2}<\infty. This implies ρT2,φ1>0\langle\rho_{T_{2}},\varphi_{1}\rangle>0, but on the other hand ρ0,φ1=0\langle\rho_{0},\varphi_{1}\rangle=0. Hence, there exists t<Tt<T_{*} such that tρ0,φ1>0\partial_{t}\langle\rho_{0},\varphi_{1}\rangle>0. However this contradicts Eq. (7.122).

We conclude that T=T_{*}=\infty and hence we can apply Eq. (7.126) for any tt, thus obtaining tφ2,ρtλφ2,ρt\partial_{t}\langle\varphi_{2},\rho_{t}\rangle\leq-\lambda\,\langle\varphi_{2},\rho_{t}\rangle and hence φ2,ρt(r02/2)eλt\langle\varphi_{2},\rho_{t}\rangle\leq(r_{0}^{2}/2)e^{-\lambda t}, which concludes the proof. ∎

7 Proof of Theorem 7: Instability conditions

In this section we will prove the instability result of Theorem 7. Throughout the section, we assume ξ(t)1/2\xi(t)\equiv 1/2. We will use several times the nonlinear dynamics, defined for ρt\rho_{t} a solution of Eq. (7.1) with initial condition ρ0\rho_{0}:

where Pu=IuuT{\bm{P}}^{\perp}_{{\bm{u}}}={\bm{I}}-{\bm{u}}{\bm{u}}^{{\sf T}} is the projector orthogonal to vector u{\bm{u}}.

First consider the case d=1d=1: in this case, the assumption ν(B(x0;r))1ε\nu({\sf B}({\bm{x}}_{0};r))\geq 1-{\varepsilon} is not required. Denote by FF the distribution function associated to ν\nu (i.e. F(x)ν((,x])F(x)\equiv\nu((-\infty,x])). By assumption FF is differentiable with F(x)MF^{\prime}(x)\leq M. In order to construct the desired coupling, let ZZ be a random variable uniformly distributed in $.Forasmallconstant. For a small constant\xi_{0}>0,definetherandomvariables, define the random variables(X_{1},X_{2})$ by letting

(Note that X2X_{2} is not defined for Z=ξ0Z=\xi_{0} but this is a zero-probability event.) On the event {Z>ξ0}\{Z>\xi_{0}\} (which has probability 1ξ01-\xi_{0}), we have, for some W[X1,X2]W\in[X_{1},X_{2}],

By choosing ξ0\xi_{0} small enough, this proves the claim for d=1d=1.

Consider next d>1d>1 and assume without loss of generality u=e1{\bm{u}}={\bm{e}}_{1}.

Let ν()=ν(XB(x0;r))\overline{\nu}(\,\cdot\,)=\nu(\,\cdot\,|\bm{X}\in{\sf B}({\bm{x}}_{0};r)), Xab(Xa,,Xb)\bm{X}_{a}^{b}\equiv(X_{a},\dots,X_{b}), and denote by f1[2,d]f_{1|[2,d]} the density of ν(X1X2n)\overline{\nu}(X_{1}\in\,\cdot\,|\bm{X}_{2}^{n}), and by f[a,b]f_{[a,b]} the density of ν(Xab)\overline{\nu}(\bm{X}_{a}^{b}\in\,\cdot\,). We then have

In order to construct the coupling, we sample Zν{\bm{Z}}\sim\nu. If Z∉B(x0;r){\bm{Z}}\not\in{\sf B}({\bm{x}}_{0};r), then we take X1=X2=Z\bm{X}_{1}=\bm{X}_{2}={\bm{Z}}. If ZB(x0;r){\bm{Z}}\in{\sf B}({\bm{x}}_{0};r) and maxx1f1[2,d](x1Z2d)>M/Δ\max_{x_{1}}f_{1|[2,d]}(x_{1}|{\bm{Z}}_{2}^{d})>M/\Delta, we also take X1=X2=Z\bm{X}_{1}=\bm{X}_{2}={\bm{Z}}. Otherwise we have ZB(x0;r){\bm{Z}}\in{\sf B}({\bm{x}}_{0};r) and maxx1f1[2,d](x1Z2d)M/Δ\max_{x_{1}}f_{1|[2,d]}(x_{1}|{\bm{Z}}_{2}^{d})\leq M/\Delta, then we sample (X1,1,X2,1)(X_{1,1},X_{2,1}) from the coupling developed in the case d=1d=1 applied to f1[2,d](Z2d)f_{1|[2,d]}(\,\cdot\,|{\bm{Z}}_{2}^{d}), and set X1=(X1,1,Z2d)\bm{X}_{1}=(X_{1,1},{\bm{Z}}_{2}^{d}), X2=(X2,1,Z2d)\bm{X}_{2}=(X_{2,1},{\bm{Z}}_{2}^{d}). Now define γ\gamma to be the joint distribution of X1,X2\bm{X}_{1},\bm{X}_{2}. Then γ\gamma is a coupling of ν\nu with itself.

Hence, we can choose Δ,ξ0\Delta,\xi_{0} small enough so that the claim (7.128) holds. ∎

By choosing ε0,#{\varepsilon}_{0,\#} small enough, we can ensure Ψ(θ;ρt)Ψ(θ;ρ)2g0/3\|\nabla\Psi({\bm{\theta}};\rho_{t})-\nabla\Psi({\bm{\theta}};\rho_{*})\|_{2}\leq g_{0}/3 for all θ{\bm{\theta}} and all tt0t\geq t_{0}.

for all tt0t\geq t_{0}. Let ρt0\overline{\rho}_{t_{0}} be the conditional probability measure of ρt0\rho_{t_{0}} given θB(θ;r0){\bm{\theta}}\in{\sf B}({\bm{\theta}}_{*};r_{0}). By Lemma 7.11, ρt0\overline{\rho}_{t_{0}} has a density upper bounded by a constant M=M(ε0,t0)M=M({\varepsilon}_{0},t_{0}) (note that ρt0(S)ρt0(S)/(pε0)\overline{\rho}_{t_{0}}(S)\leq\rho_{t_{0}}(S)/(p_{*}-{\varepsilon}_{0})).

Set H0=H0(ρ)=2Ψ(θ;ρ)\bm{H}_{0}=\bm{H}_{0}(\rho_{*})=\nabla^{2}\Psi({\bm{\theta}}_{*};\rho_{*}). Since θ{\bm{\theta}}_{*} is a critical point of θΨ(θ;ρ){\bm{\theta}}\mapsto\Psi({\bm{\theta}};\rho_{*}), for any δ>0\delta>0, we can find r1(δ)>0r_{1}(\delta)>0 such that

As shown in the proof of Theorem 6, the function (θ,ρ)2Ψ(θ;ρ)({\bm{\theta}},\rho)\mapsto\nabla^{2}\Psi({\bm{\theta}};\rho) is continuous when the space of probability distributions ρ\rho is endowed with the weak topology. Analogously ρΨ(θ;ρ)\rho\mapsto\nabla\Psi({\bm{\theta}}_{*};\rho) is continuous in the weak topology. Hence for this δ>0\delta>0 and r1(δ)>0r_{1}(\delta)>0, there exists ε0,(δ,r1)>0{\varepsilon}_{0,*}(\delta,r_{1})>0 small enough such that, the following inequalities hold

Let us emphasize that r1r_{1} depends on δ\delta but can be taken to be independent of ε0{\varepsilon}_{0}. Further, by an application of the intermediate value theorem, for all θB(θ;r1){\bm{\theta}}\in{\sf B}({\bm{\theta}}_{*};r_{1}),

For r0<r1r_{0}<r_{1}, θt0B(θ;r0){\bm{\theta}}^{t_{0}}\in{\sf B}({\bm{\theta}}_{*};r_{0}), we let (θt)tt0({\bm{\theta}}^{t})_{t\geq t_{0}} be the solution of Eq. (7.127) with this initial condition. We then define

Under the conditions of Theorem 7, there exists r1>0r_{1}>0 and ε0,>0{\varepsilon}_{0,*}>0 such that, for all r0r1r_{0}\leq r_{1}, ε0ε0,{\varepsilon}_{0}\leq{\varepsilon}_{0,*}, there exists T\mboxUB(ε0,r0,r1,t0)T_{\mbox{\tiny\rm UB}}({\varepsilon}_{0},r_{0},r_{1},t_{0}) such that the following happens. If d\mboxBL(ρt,ρ)ε0d_{\mbox{\tiny\rm BL}}(\rho_{t},\rho_{*})\leq{\varepsilon}_{0} and ρt(B(θ;r0))pε0|\rho_{t}({\sf B}({\bm{\theta}}_{*};r_{0}))-p_{*}|\leq{\varepsilon}_{0} for all tt0t\geq t_{0} for some t0t_{0}, then

We fix a δ(λ1λ2)/10\delta\leq(\lambda_{1}-\lambda_{2})/10. Then we choose r1>0r_{1}>0 and ε0,>0{\varepsilon}_{0,*}>0 such that Eq. (7.144) holds, with an additional requirement that ε0,<p/10{\varepsilon}_{0,*}<p_{*}/10. We will prove this lemma with this choice of r1r_{1} and ε0,{\varepsilon}_{0,*}.

We always denote (θit)tt0({\bm{\theta}}_{i}^{t})_{t\geq t_{0}} to be the solution of Eq. (7.127) with initial condition θit0{\bm{\theta}}_{i}^{t_{0}}, for i=1,2i=1,2. First we claim that, for 0<δ(λ1λ2)/100<\delta\leq(\lambda_{1}-\lambda_{2})/10, assuming

then for any θ1t0,θ2t0B(θ;r1){\bm{\theta}}^{t_{0}}_{1},{\bm{\theta}}^{t_{0}}_{2}\in{\sf B}({\bm{\theta}}_{*};r_{1}) with P(θ1t0θ2t0)=0{\bm{P}}_{\perp}({\bm{\theta}}^{t_{0}}_{1}-{\bm{\theta}}^{t_{0}}_{2})={\mathbf{0}}, we have

for all t[t0,t\mboxexit(θ1t0,r1)t\mboxexit(θ2t0,r1)]t\in[t_{0},t_{\mbox{\tiny\rm exit}}({\bm{\theta}}^{t_{0}}_{1},r_{1})\wedge t_{\mbox{\tiny\rm exit}}({\bm{\theta}}^{t_{0}}_{2},r_{1})].

For now we assume this claim holds. Fix r0r1r_{0}\leq r_{1} and ε0ε0,{\varepsilon}_{0}\leq{\varepsilon}_{0,*}. Define γ\gamma to be the coupling of Lemma 7.13 corresponding to u{\bm{u}} which is the eigenvector corresponding to the least eigenvalue of H0\bm{H}_{0}, and ν=ρt0\nu=\overline{\rho}_{t_{0}} which is the conditional measure of ρt0\rho_{t_{0}} given θt0B(θ;r0){\bm{\theta}}^{t_{0}}\in{\sf B}({\bm{\theta}}_{*};r_{0}). Note ρt0\overline{\rho}_{t_{0}} has a density upper bounded by a constant M=M(ε0,t0)M=M({\varepsilon}_{0},t_{0}). By Lemma 7.13, we have γ(E)9/10\gamma({\mathcal{E}})\geq 9/10, where

for some Z=Z(ε0,r0,t0)>0Z=Z({\varepsilon}_{0},r_{0},t_{0})>0. Now we take (θ1t0,θ2t0)E({\bm{\theta}}_{1}^{t_{0}},{\bm{\theta}}_{2}^{t_{0}})\in{\mathcal{E}}. Note the assumption of this lemma gives d\mboxBL(ρt,ρ)ε0ε0,d_{\mbox{\tiny\rm BL}}(\rho_{t},\rho_{*})\leq{\varepsilon}_{0}\leq{\varepsilon}_{0,*} for all tt0t\geq t_{0}. According to Eq. (7.144), we have Eq. (7.149) holds, and due to this claim, we have θ1tθ2t2(1/Z)eλ1(tt0)/2\|{\bm{\theta}}^{t}_{1}-{\bm{\theta}}^{t}_{2}\|_{2}\geq(1/Z)e^{\lambda_{1}(t-t_{0})/2} for all t[t0,t\mboxexit(θ1t0,r1)t\mboxexit(θ2t0,r1)]t\in[t_{0},t_{\mbox{\tiny\rm exit}}({\bm{\theta}}^{t_{0}}_{1},r_{1})\wedge t_{\mbox{\tiny\rm exit}}({\bm{\theta}}^{t_{0}}_{2},r_{1})].

Define T\mboxUB(ε0,r0,r1,t0)=(2/λ1)log(2Zr1)T_{\mbox{\tiny\rm UB}}({\varepsilon}_{0},r_{0},r_{1},t_{0})=(2/\lambda_{1})\log(2Z\,r_{1}). Then for t>T\mboxUBt>T_{\mbox{\tiny\rm UB}}, we have θ1tθ2t22r1\|{\bm{\theta}}_{1}^{t}-{\bm{\theta}}^{t}_{2}\|_{2}\geq 2r_{1}. This is impossible if θ1t,θ2tB(θ;r1){\bm{\theta}}_{1}^{t},{\bm{\theta}}^{t}_{2}\in{\sf B}({\bm{\theta}}_{*};r_{1}) and hence we deduce (t\mboxexit(θ1t0,r1)t\mboxexit(θ2t0,r1))T\mboxUB(t_{\mbox{\tiny\rm exit}}({\bm{\theta}}^{t_{0}}_{1},r_{1})\wedge t_{\mbox{\tiny\rm exit}}({\bm{\theta}}^{t_{0}}_{2},r_{1}))\leq T_{\mbox{\tiny\rm UB}} for all (θ1t0,θ2t0)E({\bm{\theta}}^{t_{0}}_{1},{\bm{\theta}}^{t_{0}}_{2})\in{\mathcal{E}}.

Denoting by SS the event in the last expression, we obtain ρt0(S)(pε0)ρt0(S)(9/20)(pε0)p/3\rho_{t_{0}}(S)\geq(p_{*}-{\varepsilon}_{0})\overline{\rho}_{t_{0}}(S)\geq(9/20)(p_{*}-{\varepsilon}_{0})\geq p_{*}/3 by noting that ε0<p/10{\varepsilon}_{0}<p_{*}/10.

Proof of the claim. Define the quantities

We then have, for t[t0,t\mboxexit(θ1t0,r1)t\mboxexit(θ2t0,r1)]t\in[t_{0},t_{\mbox{\tiny\rm exit}}({\bm{\theta}}^{t_{0}}_{1},r_{1})\wedge t_{\mbox{\tiny\rm exit}}({\bm{\theta}}^{t_{0}}_{2},r_{1})],

Summarizing, we obtained the inequalities

The matrix of coefficients on the right-hand side is

This has a (un-normalized) left eigenvectors (1,v)(1,-v), (v,1)(-v,1) with eigenvalues ξ±\xi_{\pm} given by:

Note we took δ<(λ1λ2)/10\delta<(\lambda_{1}-\lambda_{2})/10, we have v>0v>0, and ξ+λ1\xi_{+}\geq\lambda_{1}.

Multiplying the inequalities (7.154), (7.155) by (1,v)(1,-v), we thus obtain

Since we assumed x(t0)=0x_{\perp}(t_{0})=0, whence, for all t[t0,t\mboxexit(θ1t0,r1)t\mboxexit(θ2t0,r1)]t\in[t_{0},t_{\mbox{\tiny\rm exit}}({\bm{\theta}}^{t_{0}}_{1},r_{1})\wedge t_{\mbox{\tiny\rm exit}}({\bm{\theta}}^{t_{0}}_{2},r_{1})], we have

We next strengthen the last lemma and prove that trajectories that exit B(θ;r1){\sf B}({\bm{\theta}}_{*};r_{1}) do not re-enter B(θ;r0){\sf B}({\bm{\theta}}_{*};r_{0}).

Under the conditions of Theorem 7, there exists r0,,r1>0r_{0,*},r_{1}>0 (with r0,<r1r_{0,*}<r_{1}) and ε0,>0{\varepsilon}_{0,*}>0 such that, for all r0r0,r_{0}\leq r_{0,*}, ε0ε0,{\varepsilon}_{0}\leq{\varepsilon}_{0,*}, there exists T\mboxUB(ε0,r0,r1,t0)T_{\mbox{\tiny\rm UB}}({\varepsilon}_{0},r_{0},r_{1},t_{0}) such that the following happens. If d\mboxBL(ρt,ρ)ε0d_{\mbox{\tiny\rm BL}}(\rho_{t},\rho_{*})\leq{\varepsilon}_{0} and ρt(B(θ;r0))pε0|\rho_{t}({\sf B}({\bm{\theta}}_{*};r_{0}))-p_{*}|\leq{\varepsilon}_{0} for all tt0t\geq t_{0} for some t0t_{0}, then

Let P+{\bm{P}}_{+} be the projector onto the eigenspace of H0-\bm{H}_{0} corresponding to positive eigenvalues, and P{\bm{P}}_{-} the projector onto the subspace corresponding to negative eigenvalues, and let λ0miniDλi(H0)\lambda_{0}\equiv\min_{i\leq D}|\lambda_{i}(\bm{H}_{0})| to be the least absolute value of eigenvalue of H0\bm{H}_{0}. By condition B1 of Theorem 7, we have λ0>0\lambda_{0}>0. Let λmax\lambda_{\max} denote the largest absolute value of eigenvalue of H0\bm{H}_{0}.

Fix a δ\delta such that 0<δmin{λ0/(1+λ0+λmax),λ0/λmax,λ1λ2,1}/100<\delta\leq\min\{\lambda_{0}/(1+\lambda_{0}+\lambda_{\max}),\sqrt{\lambda_{0}/\lambda_{\max}},\lambda_{1}-\lambda_{2},1\}/10, where λ1,λ2\lambda_{1},\lambda_{2} are as defined in Lemma 7.15. Next we choose r1r_{1} as per Lemma 7.15, and we further require λ0r12η0\lambda_{0}r_{1}^{2}\leq\eta_{0}, where η0\eta_{0} is as per condition B3 in the statement of Theorem 7. We take ε0,{\varepsilon}_{0,*} to be the minimum of the parameter ε0,{\varepsilon}_{0,*} as per Lemma 7.15 and the parameter ε0,#{\varepsilon}_{0,\#} as per Lemma 7.14, where in Lemma 7.14, we choose u=Ψ(θ;ρ)λ0r12/8u=\Psi({\bm{\theta}}_{*};\rho_{*})-\lambda_{0}r_{1}^{2}/8, and Δ=λ0r12/8\Delta=\lambda_{0}r_{1}^{2}/8. Then we will choose smaller r1r_{1} and ε0,{\varepsilon}_{0,*} so that Eq. (7.144) holds. Finally, we take r0,=δr1<r1r_{0,*}=\delta r_{1}<r_{1}. We will prove this lemma with this choice of r1r_{1}, ε0,{\varepsilon}_{0,*}, and r0,r_{0,*}, and with the same function T\mboxUBT_{\mbox{\tiny\rm UB}} as per Lemma 7.15.

We bound the evolution of these quantities following the same argument used above for x(t)x_{\parallel}(t), x(t)x_{\perp}(t). Namely

For t[t(θt0;r1,δ),t\mboxexit(θt0;r1)]t\in[t_{*}({\bm{\theta}}^{t_{0}};r_{1},\delta),t_{\mbox{\tiny\rm exit}}({\bm{\theta}}^{t_{0}};r_{1})], we have z+(t)+z(t)δr1\sqrt{z_{+}(t)+z_{-}(t)}\geq\delta r_{1}. Using the inequality a(a+b)a+b\sqrt{a(a+b)}\leq a+b holding for non-negative aa and bb, we have

Proceeding analogously for zz_{-}, we arrive at the inequalities

Since δλ0/(10(1+λ0+λmax))\delta\leq\lambda_{0}/(10(1+\lambda_{0}+\lambda_{\max})), we can ensure that Ψ(θt\mboxexit;ρ)Ψ(θ;ρ)λ0r12/4\Psi({\bm{\theta}}^{t_{\mbox{\tiny\rm exit}}};\rho_{*})\leq\Psi({\bm{\theta}}_{*};\rho_{*})-\lambda_{0}r_{1}^{2}/4. By Lemma 7.14, since d\mboxBL(ρt,ρ)ε0,ε0,#d_{\mbox{\tiny\rm BL}}(\rho_{t},\rho_{*})\leq{\varepsilon}_{0,*}\leq{\varepsilon}_{0,\#} for all tt0t\geq t_{0}, we have Ψ(θt;ρ)Ψ(θ;ρ)λ0r12/8\Psi({\bm{\theta}}^{t};\rho_{*})\leq\Psi({\bm{\theta}}_{*};\rho_{*})-\lambda_{0}r_{1}^{2}/8 for all tt\mboxexit(θt0;r1)t\geq t_{\mbox{\tiny\rm exit}}({\bm{\theta}}^{t_{0}};r_{1}). Note for all θB(θ;δr1){\bm{\theta}}\in{\sf B}({\bm{\theta}}_{*};\delta r_{1}), we have Ψ(θ;ρ)Ψ(θ;ρ)λmaxδ2r12/2\Psi({\bm{\theta}};\rho_{*})\geq\Psi({\bm{\theta}}_{*};\rho_{*})-\lambda_{\max}\delta^{2}r_{1}^{2}/2. Since δλ0/λmax/10\delta\leq\sqrt{\lambda_{0}/\lambda_{\max}}/10, we have θt∉B(θ;δr1){\bm{\theta}}^{t}\not\in{\sf B}({\bm{\theta}}_{*};\delta r_{1}) for all tt\mboxexit(θt0;r1)t\geq t_{\mbox{\tiny\rm exit}}({\bm{\theta}}^{t_{0}};r_{1}).

This implies that, for any θt0B(θ;r0){\bm{\theta}}^{t_{0}}\in{\sf B}({\bm{\theta}}_{*};r_{0}) for r0r0,r_{0}\leq r_{0,*} with t\mboxexit(θt0,r1)T\mboxUB(ε0,r0,r1,t0)<t_{\mbox{\tiny\rm exit}}({\bm{\theta}}^{t_{0}},r_{1})\leq T_{\mbox{\tiny\rm UB}}({\varepsilon}_{0},r_{0},r_{1},t_{0})<\infty, it will never return to B(θ;r0){\sf B}({\bm{\theta}}_{*};r_{0}). This gives the desired result. ∎

Finally we upper bound the probability that θtB(θ;r0){\bm{\theta}}^{t}\in{\sf B}({\bm{\theta}}_{*};r_{0}) for some t>t0t>t_{0}, given that θt0∉B(θ;r0){\bm{\theta}}^{t_{0}}\not\in{\sf B}({\bm{\theta}}_{*};r_{0}). We define

Under the conditions of Theorem 7, for any η>0\eta>0, there exists r0,>0r_{0,*}>0 and ε0,>0{\varepsilon}_{0,*}>0 such that, for all r0r0,r_{0}\leq r_{0,*}, ε0ε0,{\varepsilon}_{0}\leq{\varepsilon}_{0,*}, the following happens. If d\mboxBL(ρt,ρ)ε0d_{\mbox{\tiny\rm BL}}(\rho_{t},\rho_{*})\leq{\varepsilon}_{0} and ρt(B(θ;r0))pε0|\rho_{t}({\sf B}({\bm{\theta}}_{*};r_{0}))-p_{*}|\leq{\varepsilon}_{0} for all tt0t\geq t_{0} for some t0t_{0}, then

Since we also had ρt(B(θ;r0))pε0\rho_{t}({\sf B}({\bm{\theta}}_{*};r_{0}))\geq p_{*}-{\varepsilon}_{0} for all tt0t\geq t_{0}, note η,ε0p/10\eta,{\varepsilon}_{0}\leq p_{*}/10, we reached a contradiction.

Centered isotropic Gaussians

In this section we consider the centered isotropic Gaussians example discussed in the main text. That is, we assume the joint law of (y,x)(y,{\bm{x}}) to be as follows:

With probability 1/21/2: y=+1y=+1, xN(0,(1+Δ)2Id){\bm{x}}\sim{\sf N}({\mathbf{0}},(1+\Delta)^{2}{\bm{I}}_{d}).

With probability 1/21/2: y=1y=-1, xN(0,(1Δ)2Id){\bm{x}}\sim{\sf N}({\mathbf{0}},(1-\Delta)^{2}{\bm{I}}_{d}).

xσ(x)x\mapsto\sigma(x) is bounded, non-decreasing, Lipschitz continuous. Its weak derivative xσ(x)x\mapsto\sigma^{\prime}(x) is Lipschitz in a neighborhood of .

qq is analytic on (0,)(0,\infty) with supr[0,]q(r)<\sup_{r\in[0,\infty]}q^{\prime\prime}(r)<\infty.

q(r)>0q^{\prime}(r)>0 for all r(0,)r\in(0,\infty), with supr[0,]q(r)<\sup_{r\in[0,\infty]}q^{\prime}(r)<\infty, and limr0q(r)=limrq(r)=0\lim_{r\to 0}q^{\prime}(r)=\lim_{r\to\infty}q^{\prime}(r)=0.

<q(0+)<1-\infty<q(0+)<-1, 1<q(+)<1<q(+\infty)<\infty, and 1<(q(0+)+q(+))/2<1-1<(q(0+)+q(+\infty))/2<1.

Letting Z(r)q(τr)/q(τ+r)Z(r)\equiv q^{\prime}(\tau_{-}r)/q^{\prime}(\tau_{+}r) for some τ+>τ>0\tau_{+}>\tau_{-}>0 we have Z(r)>0Z^{\prime}(r)>0 for all r(0,)r\in(0,\infty).

Note that condition S1 and part of S2 are implied by S0, but we list them here for conveniency. Some of these assumptions can be relaxed at the cost of extra technical work. In the interest of simplicity, we prefer to avoid being overly general.

In particular, we choose s1=2.5s_{1}=-2.5, s2=7.5s_{2}=7.5, t1=0.5t_{1}=0.5, t2=1.5t_{2}=1.5 in our simulations. In section 8.5, we check that this choice satisfies the above assumptions.

Throughout this section, we set τ±=(1±Δ)\tau_{\pm}=(1\pm\Delta) and q+(r)=q(τ+r)q_{+}(r)=q(\tau_{+}r), q(r)=q(τr)q_{-}(r)=q(\tau_{-}r). Also, we will assume ξ(t)=1/2\xi(t)=1/2, since other choices of ξ()\xi(\,\cdot\,) merely amounts to a time reparametrization.

Before analyzing our model, we introduce the function space and space of probability measures we will work on. We equip the set [0,][0,\infty] with a metric dˉ{\bar{d}}, where dˉ(x,y)=1/(1+x)1/(1+y){\bar{d}}(x,y)=|1/(1+x)-1/(1+y)| for any x,y[0,]x,y\in[0,\infty]. Then ([0,],dˉ)([0,\infty],{\bar{d}}) is a compact metric space, and we will still denote it by [0,][0,\infty] for simplicity in notations. We denote Cb([0,])C_{b}([0,\infty]) to be the set of bounded continuous functions on [0,][0,\infty], where continuity is defined using the topology generated by dˉ{\bar{d}}. More explicitly, we have isomorphism

Because of condition S2{\sf S2} and S3{\sf S3}, we have q,qCb([0,])q,q^{\prime}\in C_{b}([0,\infty]).

Let \mathscrsfsP([0,])\mathscrsfs{P}([0,\infty]) be the set of probability measures on [0,][0,\infty]. Due to Prokhorov’s theorem, there exists a complete metric dˉP{\bar{d}}{P} on \mathscrsfsP([0,])\mathscrsfs{P}([0,\infty]) equivalent to the topology of weak convergence, so that (\mathscrsfsP([0,]),dˉP)(\mathscrsfs{P}([0,\infty]),{\bar{d}}{P}) is a compact metric space. In this section, we will denote by \mathscrsfsP=\mathscrsfsP([0,]){\overline{\mathscrsfs{P}}}=\mathscrsfs{P}([0,\infty]).

Since the distribution of x{\bm{x}} is invariant under rotations for each of the two classes, so are the functions

where μ\mboxHaar\mu_{\mbox{\tiny\rm Haar}} is the Haar measure over the group of orthogonal rotations. Since ρR(ρ)\rho\mapsto R(\rho) is convex, R(ρs)R(ρ)R(\rho_{s})\leq R(\rho).

We therefore restrict ourselves to ρ\rho’s that are invariant under rotations. In other words, under ρ\rho, the vector w{\bm{w}} is uniformly random conditional on w2\|{\bm{w}}\|_{2}. We denote by ρ\overline{\rho} the probability distribution of w2\|{\bm{w}}\|_{2} when wρ{\bm{w}}\sim\rho and we let Rd(ρ)\overline{R}_{d}(\overline{\rho}) denote the resulting risk. We then have

where Θ(1/Zd)sind2θ1{θ[0,π]}dθ\Theta\sim(1/Z_{d})\sin^{d-2}\theta\cdot{\bm{1}}\{\theta\in[0,\pi]\}{\rm d}\theta.

As dd\to\infty, we have limdud(r1,r2)=u(r1,r2)\lim_{d\to\infty}u_{d}(r_{1},r_{2})=u_{\infty}(r_{1},r_{2}) (uniformly over compact sets), with

For d=d=\infty, we have the simpler expression

The following theorem provides a characterization of global minimizers of Rd(ρ)\overline{R}_{d}(\overline{\rho}).

ρ\overline{\rho}_{*} is a global minimizer of Rd(ρ)\overline{R}_{d}(\overline{\rho}) if and only if supp(ρ)argminrψd(r;ρ){\rm supp}(\overline{\rho}_{*})\subseteq\arg\min_{r}\psi_{d}(r;\overline{\rho}_{*}).

In particular, ρ=δr\overline{\rho}_{*}=\delta_{r_{*}} is a global minimizer or Rd(ρ)\overline{R}_{d}(\overline{\rho}) if and only if v(r)+ud(r,r)v(r)+u(r,r)v(r)+u_{d}(r,r_{*})\geq v(r_{*})+u(r_{*},r_{*}) for all rr.

Point 1 is essentially a special case of the second part of Proposition 1 in the main text (cf. Eq. (6.7)) and follows by the same argument. Point 2 is follows by taking ρ=δr\overline{\rho}_{*}=\delta_{r_{*}}. ∎

Given the last result, it is interesting to understand whether the optimal radial distribution ρ\overline{\rho}_{*} is a single point mass or not. Under the ansatz ρ=δr\overline{\rho}=\delta_{r} (a single point mass at radius rr) we obtain an effective risk Rd(1)(r)Rd(δr)\overline{R}_{d}^{(1)}(r)\equiv\overline{R}_{d}(\delta_{r}) defined by Rd(1)(r)=1+2v(r)+ud(r,r)\overline{R}_{d}^{(1)}(r)=1+2v(r)+u_{d}(r,r), which is plotted in Figure 11.6 for the case of our running example (8.1), and Δ=0.4\Delta=0.4.

Let r=r(Δ,d)r_{*}=r_{*}(\Delta,d) be the minimizer of Rd(1)(r)\overline{R}_{d}^{(1)}(r), and define, for dd\leq\infty,

In the case d=d=\infty, the minimization problem simplifies further. Either the minimum risk is , or it is achieved at a point mass ρ=δr\overline{\rho}_{*}=\delta_{r_{*}}.

Consider d=d=\infty. Recall \mathscrsfsP=\mathscrsfsP([0,]){\overline{\mathscrsfs{P}}}=\mathscrsfs{P}([0,\infty]). In this case Δ\Delta_{\infty} defined as per Eq. (8.16) is such that Δ(0,1)\Delta_{\infty}\in(0,1). Further

For Δ<Δ\Delta<\Delta_{\infty}, infρ\mathscrsfsPR(ρ)>0\inf_{\overline{\rho}\in{\overline{\mathscrsfs{P}}}}\overline{R}_{\infty}(\overline{\rho})>0 and the unique global minimizer of risk function R(ρ)\overline{R}_{\infty}(\overline{\rho}) is a point mass located at some r(Δ)(0,)r_{*}(\Delta)\in(0,\infty).

For ΔΔ\Delta\geq\Delta_{\infty}, all global minimizers of risk function R(ρ)\overline{R}_{\infty}(\overline{\rho}) have risk zero, and there exists a global minimizer that has compact support bounded away from .

Recall the definitions q+(r)=q(τ+r)q_{+}(r)=q(\tau_{+}r) and q(r)=q(τr)q_{-}(r)=q(\tau_{-}r). Further, we define the set Γ\Gamma\subseteq by

According to condition S3, for Δ=1\Delta=1, we have q(r)=q(0)<1q_{-}(r)=q(0)<-1 and q+(+)=q(+)>+1q_{+}(+\infty)=q(+\infty)>+1. Since qq is continuous, it is easy to see that there exists an ε>0{\varepsilon}>0, such that [1ε,1]Γ[1-{\varepsilon},1]\subseteq\Gamma. Further, for Δ=0\Delta=0 we have q+(r)=q(r)q_{+}(r)=q_{-}(r). By continuity, there exists an ε>0{\varepsilon}>0, such that [0,ε]Γ[0,{\varepsilon}]\in\setminus\Gamma.

Since qq is an increasing function, we have

By the remarks above, we have 0<Δ<10<\Delta_{\infty}<1. Notice that this definition does not coincide with the one in Eq. (8.16). However, the proof below (together with Proposition 4) implies that the two definitions actually coincide.

Step 1. Prove that infρ\mathscrsfsPR(ρ)>0\inf_{\overline{\rho}\in{\overline{\mathscrsfs{P}}}}\overline{R}_{\infty}(\overline{\rho})>0 as Δ<Δ\Delta<\Delta_{\infty}.

First, we consider the optimization problem

We claim that, for Δ<Δ\Delta<\Delta_{\infty} we have f<0f_{*}<0. Indeed, for any λ[0,+)\lambda\in[0,+\infty), we have the following upper bound

Since q+λqCb([0,+])q_{+}-\lambda\,q_{-}\in C_{b}([0,+\infty]), then L(,λ)L(\,\cdot\,,\lambda) is continuous in ρ\overline{\rho} in weak topology. By the compactness of \mathscrsfsP{\overline{\mathscrsfs{P}}}, the supremum of L(,λ)L(\,\cdot\,,\lambda) is attained by some ρλ\mathscrsfsP\overline{\rho}_{\lambda}\in{\overline{\mathscrsfs{P}}}. This ρλ\overline{\rho}_{\lambda} should satisfy

Let h(r)q+(r)λq(r)h(r)\equiv q_{+}(r)-\lambda q_{-}(r). Note the supremum of hh should either satisfy

for r(0,)r\in(0,\infty), or the supremum should be attained at the boundary or ++\infty. According to condition S4, [q(r)/q+(r)]>0[q_{-}^{\prime}(r)/q_{+}^{\prime}(r)]^{\prime}>0 for r(0,)r\in(0,\infty), the equation (8.21) has at most one solution r(0,)r_{*}\in(0,\infty).

Assume that there exists r(0,)r_{*}\in(0,\infty) such that h(r)=0h^{\prime}(r_{*})=0. Then we have h(r)>0h^{\prime}(r)>0 for 0<r<r0<r<r_{*}, and h(r)<0h^{\prime}(r)<0 for r<r<+r_{*}<r<+\infty, whence supp(ρλ)={r}{\rm supp}(\overline{\rho}_{\lambda})=\{r_{*}\}. If h(r)=0h^{\prime}(r)=0 does not have a solution in (0,)(0,\infty), the only supremum of h(r)h(r) could be achieved at or ++\infty. Therefore, supp(ρλ)={0}{\rm supp}(\overline{\rho}_{\lambda})=\{0\} or supp(ρλ)={+}{\rm supp}(\overline{\rho}_{\lambda})=\{+\infty\}. This concludes that, for any λ[0,+)\lambda\in[0,+\infty), supρ\mathscrsfsPL(ρ,λ)\sup_{\overline{\rho}\in{\overline{\mathscrsfs{P}}}}L(\overline{\rho},\lambda) is achieved by a point mass. Therefore, we have

For Δ<Δ\Delta<\Delta_{\infty}, the right hand side of the above inequality is less than . Therefore, we cannot have a probability distribution ρ\overline{\rho} such that q+,ρ=1\langle q_{+},\overline{\rho}\rangle=1 and q,ρ=1\langle q_{-},\overline{\rho}\rangle=-1. The infimum of the risk cannot be .

Step 2. Show that the global minimizer should be a delta function for Δ<Δ\Delta<\Delta_{\infty}.

According to Proposition 1, the global minimizer ρ\mathscrsfsP\overline{\rho}_{*}\in{\overline{\mathscrsfs{P}}} should satisfy

with ψ\psi_{\infty} given in Eq. (8.12).

As proved in the last step, as Δ<Δ\Delta<\Delta_{\infty}, we cannot have both λ+(ρ)=0\lambda_{+}(\overline{\rho}_{*})=0 and λ(ρ)=0\lambda_{-}(\overline{\rho}_{*})=0. The argument given above also implies that ψ(r;ρ)\psi_{\infty}(r;\overline{\rho}_{*}) is minimized at a unique point, and hence the support of ρ\overline{\rho}_{*} should be a single point. This proves the first part of the theorem.

For ΔΔ\Delta\geq\Delta_{\infty}, there exists r>0r>0, such that q(τ+r)1q(\tau_{+}r)\geq 1, and q(τr)1q(\tau_{-}r)\leq-1. Therefore, there exists r>0r_{*}>0 such that q(τ+r)1=1q(τr)=ε0q(\tau_{+}r_{*})-1=-1-q(\tau_{-}r_{*})={\varepsilon}_{*}\geq 0. Consider the following probability measure on [0,+][0,+\infty],

It can be checked that R(ρ)=0\overline{R}_{\infty}(\overline{\rho}_{*})=0.

We would like to show further that there exists a global minimizer that is compactly supported. We construct this global minimizer as following. First, define

Then we know that q(r0)=1q_{-}(r_{0})=-1 and q+(r0)1q_{+}(r_{0})\geq 1. Now for any 0rr00\leq r\leq r_{0}, define u(r)=q1(2q(r))u(r)=q_{-}^{-1}(-2-q_{-}(r)). According to condition S3, we have 1<[q(0)+q(+)]/2<1-1<[q(0)+q(+\infty)]/2<1, then u(r)u(r) is well defined on [0,r0][0,r_{0}]. It is easy to see that u(r0)=r0u(r_{0})=r_{0}, and [q(r)+q(u(r))]/2=1[q_{-}(r)+q_{-}(u(r))]/2=-1 for any 0rr00\leq r\leq r_{0}. Now we consider the function z(r)=[q+(r)+q+(u(r))]/21z(r)=[q_{+}(r)+q_{+}(u(r))]/2-1. Note that z(r0)>0z(r_{0})>0, and z(0)[q(0)+q()]/21<0z(0)\leq[q(0)+q(\infty)]/2-1<0. Therefore, there exists rr_{*} satisfying 0<rr00<r_{*}\leq r_{0} such that z(r)=0z(r_{*})=0. Consider the following probability measure on (0,+)(0,+\infty),

It is easy to see that R(ρ)=0\overline{R}_{\infty}(\overline{\rho}_{*})=0. ∎

2 Dynamics: Fixed points

We specialize the general evolution (7.1) to the present case. Assuming ρ0\rho_{0} to be spherically symmetric, then ρt\rho_{t} is spherically symmetric for any t0t\geq 0. We let ρt\overline{\rho}_{t} denote the distribution of w2\|{\bm{w}}\|_{2} when wρt{\bm{w}}\sim\rho_{t}. This satisfies the following PDE:

We will view this as an evolution in the space of probability distribution on the completed half-line \mathscrsfsP([0,])\mathscrsfs{P}([0,\infty]).

In analogy with Proposition 2, we can prove the following characterization of fixed points.

A distribution ρ\mathscrsfsP([0,])\overline{\rho}\in\mathscrsfs{P}([0,\infty]) is a fixed point of the PDE (8.22) if and only if

Notice, in particular, global minimizers of Rd(ρ)\overline{R}_{d}(\overline{\rho}) are fixed points of this evolution, but not vice-versa. The next result classifies fixed points.

Consider d=d=\infty and recall the definition of λ+(ρ)\lambda_{+}(\overline{\rho}) and λ(ρ)\lambda_{-}(\overline{\rho}) given by Eqs. (8.13) and (8.14). Then the fixed points of the PDE (8.22) (i.e. the probability measures ρ\mathscrsfsP([0,])\overline{\rho}\in\mathscrsfs{P}([0,\infty]) satisfying (8.23)) are of one of the following types

A point mass ρr=δr\overline{\rho}_{r_{*}}=\delta_{r_{*}} at some location r∉{0,+}r_{*}\not\in\{0,+\infty\}, but not of type (a)(a).

A mixture of the type ρ=a0δ0+aδ++aδr\overline{\rho}=a_{0}\delta_{0}+a_{\infty}\delta_{+\infty}+a\delta_{r_{*}}, but not of type (a)(a) or (b)(b).

For Δ<Δ\Delta<\Delta_{\infty}, the PDE has a unique fixed point of type (b)(b), with λ+(ρ)<0\lambda_{+}(\overline{\rho}_{*})<0 and λ(ρ)>0\lambda_{-}(\overline{\rho}_{*})>0; it has no type-(a)(a) fixed points; it has possibly fixed points of type (c)(c).

For Δ>Δ\Delta>\Delta_{\infty}, the PDE has some fixed points of type (b)(b), with λ+(ρ)>0\lambda_{+}(\overline{\rho}_{*})>0 and λ(ρ)<0\lambda_{-}(\overline{\rho}_{*})<0; it also has some type-(a)(a) fixed points; it has possibly fixed points of type (c)(c).

For Δ=Δ\Delta=\Delta_{\infty}, the PDE has a unique fixed point of type (a)(a) which is also a delta function at some location rr_{*}, and no type (b)(b) fixed points; it has possibly fixed points of type (c)(c).

We use the characterization of fixed points in Proposition 5. Recall that ψ(r;ρ)\psi_{\infty}(r;\overline{\rho}_{*}) is defined as in Equation (8.12). The derivative rψ(r;ρ)\partial_{r}\psi_{\infty}(r;\overline{\rho}) gives

If a fixed point has λ+(ρ)=λ(ρ)=0\lambda_{+}(\overline{\rho}_{*})=\lambda_{-}(\overline{\rho}_{*})=0, then R(ρ)=0\overline{R}_{\infty}(\overline{\rho}_{*})=0. This is type-(a)(a) fixed point. Consider then the case (λ+(ρ),λ(ρ))(0,0)(\lambda_{+}(\overline{\rho}_{*}),\lambda_{-}(\overline{\rho}_{*}))\neq(0,0). For the same reason as in the proof of Theorem 8, we conclude that rψ(r;ρ)\partial_{r}\psi_{\infty}(r;\overline{\rho}_{*}) has at most three zeros, two of which are located at and ++\infty. This proves that all fixed points are of type (a)(a), (b)(b) or (c)(c).

We already proved in Theorem 8 that, for Δ<Δ\Delta<\Delta_{\infty}, infρR(ρ)>0\inf_{\overline{\rho}}\overline{R}_{\infty}(\overline{\rho})>0. Therefore, for Δ<Δ\Delta<\Delta_{\infty}, there is no type (a)(a) fixed points.

We next prove that, as Δ<Δ\Delta<\Delta_{\infty}, fixed point of type (b)(b) is always unique. The location of the delta fixed point should satisfy

Note that rψ(r;δr)<0\partial_{r}\psi_{\infty}(r_{*};\delta_{r_{*}})<0 for r>0r>0 small enough, and rψ(r;δr)>0\partial_{r}\psi_{\infty}(r_{*};\delta_{r_{*}})>0 for rr large enough, whence this equation has at least one solution r(0,)r_{*}\in(0,\infty). In order to prove that it has a unique solution in (0,+)(0,+\infty), define r+inf{r:q+(r)1}r_{+}\equiv\inf\{r:q_{+}(r)\geq 1\} and rinf{r:q(r)1}r_{-}\equiv\inf\{r:q_{-}(r)\geq-1\}. Note that q+(r),q(r)>0q^{\prime}_{+}(r_{*}),q^{\prime}_{-}(r_{*})>0 and that, in order to satisfy Eq. (8.25), the terms λ+(δr)=1/2(q+(r)1)\lambda_{+}(\delta_{r_{*}})=1/2\cdot(q_{+}(r_{*})-1) and λ(δr)=1/2(q(r)+1)\lambda_{-}(\delta_{r_{*}})=1/2\cdot(q_{-}(r_{*})+1) must have opposite signs. For Δ<Δ\Delta<\Delta_{\infty}, we must have λ+(δr)<0\lambda_{+}(\delta_{r_{*}})<0 and λ(δr)>0\lambda_{-}(\delta_{r_{*}})>0, and all stationary points should be within [r,r+][r_{-},r_{+}]. Note that q(r)/q+(r)q_{-}^{\prime}(r)/q_{+}^{\prime}(r) is strictly increasing, and [1q+(r)]/[1+q(r)][1-q_{+}(r)]/[1+q_{-}(r)] is decreasing on [r,r+][r_{-},r_{+}]. Therefore, the fixed point of type δr\delta_{r_{*}} with r(0,)r_{*}\in(0,\infty) is unique.

For Δ>Δ\Delta>\Delta_{\infty}, we must have λ+(ρ)>0\lambda_{+}(\overline{\rho}_{*})>0 and λ(ρ)<0\lambda_{-}(\overline{\rho}_{*})<0, and all solutions should be within [r+,r][r_{+},r_{-}]. There could possibly be multiple fixed points of type δr\delta_{r_{*}} with r[r+,r]r_{*}\in[r_{+},r_{-}].

If Δ=Δ\Delta=\Delta_{\infty}, it is easy to see that, ρ=δr\overline{\rho}_{*}=\delta_{r_{*}} at some r(0,)r_{*}\in(0,\infty) is the unique fixed point with zero risk, and the unique fixed point as a point mass. ∎

3 Dynamics: Convergence to global minimum for d=∞𝑑d=\infty

In this section, denote \mathscrsfsP\mboxgood\mathscrsfs{P}_{\mbox{\tiny\rm good}} to be

We then prove that the d=d=\infty dynamics converges to a global minimizer from any initialization in \mathscrsfsP\mboxgood\mathscrsfs{P}_{\mbox{\tiny\rm good}}.

Consider the PDE (8.22) for d=d=\infty, with initialization ρ0\mathscrsfsP\mboxgood\overline{\rho}_{0}\in\mathscrsfs{P}_{\mbox{\tiny\rm good}}. It has a unique solution (ρt)t0(\overline{\rho}_{t})_{t\geq 0}, such that

Without loss of generality, we assume ξ(t)=1/2\xi(t)=1/2. First we show the existence and uniqueness of solution of the PDE.

Step 1. Existence and uniqueness of solution. Mass ρt((0,))=1\overline{\rho}_{t}((0,\infty))=1 for all tt.

According to conditions S1 - S3, q(r)q(r), q(r)q^{\prime}(r), and q(r)q^{\prime\prime}(r) are uniformly bounded on [0,][0,\infty]. Recall that

Hence v(r),1u(r1,r2),v(r),112u(r1,r2),122u(r1,r2)v^{\prime}(r),\partial_{1}u_{\infty}(r_{1},r_{2}),v^{\prime\prime}(r),\partial_{11}^{2}u_{\infty}(r_{1},r_{2}),\partial_{12}^{2}u_{\infty}(r_{1},r_{2}) are uniformly bounded. Recall we further assumed ξ(t)1/2\xi(t)\equiv 1/2. Therefore, conditions A1 and A3 are satisfied with D=1D=1, V=vV=v, and U=uU=u. By Remark 7.1, there is the existence and uniqueness of solution of PDE (8.22) for d=d=\infty. Denote this solution to be (ρt)t0(\overline{\rho}_{t})_{t\geq 0}.

Recall the formula of rψ(r;ρ)\partial_{r}\psi_{\infty}(r;\overline{\rho}) given in Equation (8.24), it is easy to see that the assumption of Lemma 7.9 is satisfied with d=1d=1 and Ψ=ψ\Psi=\psi_{\infty}. Hence, we have ρt((0,))=1\overline{\rho}_{t}((0,\infty))=1 for any t<t<\infty.

Step 2. Classify the limiting set S{\mathcal{S}}_{*}.

Recall the definition of (\mathscrsfsP([0,+]),dˉP)(\mathscrsfs{P}([0,+\infty]),{\bar{d}}{P}) at the beginning of Section 8. Since (\mathscrsfsP([0,+]),dˉP)(\mathscrsfs{P}([0,+\infty]),{\bar{d}}{P}) is a compact metric space, and (ρt)t0(\overline{\rho}_{t})_{t\geq 0} is a continuous curve in this space, then there exists a subsequence (tk)k1(t_{k})_{k\geq 1} of times, such that (ρtk)k1(\overline{\rho}_{t_{k}})_{k\geq 1} converges in metric dˉP{\bar{d}}{P} to a probability distribution ρ\mathscrsfsP([0,+])\overline{\rho}_{*}\in\mathscrsfs{P}([0,+\infty]).

Analogously to Proposition 2 (using Eq. (8.22)), we have

Since R(ρt)0\overline{R}_{\infty}(\overline{\rho}_{t})\geq 0, we have

Recall the definition of λ+(ρ)\lambda_{+}(\overline{\rho}) and λ(ρ)\lambda_{-}(\overline{\rho}) given by Eq. (8.13) and (8.14). Since qCb([0,])q\in C_{b}([0,\infty]), we have

Note rψ(r;ρ)\partial_{r}\psi_{\infty}(r;\overline{\rho}) is given by Eq. (8.24), and qCb([0,+])q^{\prime}\in C_{b}([0,+\infty]), hence

In other words, any limiting point ρ\overline{\rho}_{*} of the PDE is a fixed point of the PDE (8.22).

Note R(ρ)=1/2[λ+(ρ)2+λ(ρ)2]\overline{R}_{\infty}(\overline{\rho})=1/2\cdot[\lambda_{+}(\overline{\rho})^{2}+\lambda_{-}(\overline{\rho})^{2}], we have

Note R(ρt)\overline{R}_{\infty}(\overline{\rho}_{t}) is decreasing with tt, hence

Let S=S(ρ0){\mathcal{S}}_{*}={\mathcal{S}}_{*}(\overline{\rho}_{0}) be the set of all limiting points of the (ρt)t0(\overline{\rho}_{t})_{t\geq 0},

Due to Lemma 7.10, S{\mathcal{S}}_{*} is a connected compact set. Since R(ρt)\overline{R}_{\infty}(\overline{\rho}_{t}) is decreasing as tt increases, we have R(ρ)R\overline{R}_{\infty}(\overline{\rho}_{*})\equiv\overline{R}_{*} is a constant for all ρS\overline{\rho}_{*}\in{\mathcal{S}}_{*}. Since we assumed R(ρ0)<1\overline{R}_{\infty}(\overline{\rho}_{0})<1, and R(ρt)\overline{R}_{\infty}(\overline{\rho}_{t}) is decreasing in tt, we have R<1\overline{R}_{*}<1.

Let ρ\overline{\rho}_{*} be a fixed point of PDE such that λ+(ρ)0,λ(ρ)0\lambda_{+}(\overline{\rho}_{*})\geq 0,\lambda_{-}(\overline{\rho}_{*})\geq 0 or λ+(ρ)0,λ(ρ)0\lambda_{+}(\overline{\rho}_{*})\leq 0,\lambda_{-}(\overline{\rho}_{*})\leq 0 but not both λ+(ρ)\lambda_{+}(\overline{\rho}_{*}) and λ(ρ)\lambda_{-}(\overline{\rho}_{*}) equal . In this case, according to Eq. (8.24), rψ(r;ρ)\partial_{r}\psi_{\infty}(r;\overline{\rho}_{*}) must be strictly increasing or strictly decreasing in rr. Since supp(ρ){r[0,]:rψ(r;ρ)=0}{\rm supp}(\overline{\rho}_{*})\subseteq\{r\in[0,\infty]:\partial_{r}\psi_{\infty}(r;\overline{\rho}_{*})=0\}, ρ\overline{\rho}_{*} must be a combination of two delta functions located at and ++\infty, i.e., ρ=a0δ0+(1a0)δ\overline{\rho}_{*}=a_{0}\delta_{0}+(1-a_{0})\delta_{\infty}. But for a fixed point of this type, it is easy to see that R(ρ)1\overline{R}_{\infty}(\overline{\rho}_{*})\geq 1. Such fixed points ρ\overline{\rho}_{*} cannot be one of the limiting points of the PDE since R(ρ0)<1\overline{R}_{\infty}(\overline{\rho}_{0})<1.

Since S{\mathcal{S}}_{*} is a connected set, L(S)L({\mathcal{S}}_{*}) should also be a connected set. Further notice that R(ρ)=1/2[λ+(ρ)2+λ(ρ)2]\overline{R}_{\infty}(\overline{\rho}_{*})=1/2\cdot[\lambda_{+}(\overline{\rho}_{*})^{2}+\lambda_{-}(\overline{\rho}_{*})^{2}], and R(ρ1)=R(ρ2)\overline{R}_{\infty}(\overline{\rho}_{1})=\overline{R}_{\infty}(\overline{\rho}_{2}) for any ρ1,ρ2S\overline{\rho}_{1},\overline{\rho}_{2}\in{\mathcal{S}}_{*}. Therefore, we can only have L(S)P2{(λ+,λ):λ+>0,λ<0}L({\mathcal{S}}_{*})\subseteq{\mathcal{P}}_{2}\equiv\{(\lambda_{+},\lambda_{-}):\lambda_{+}>0,\lambda_{-}<0\}, or L(S)P1{(λ+,λ):λ+<0,λ>0}L({\mathcal{S}}_{*})\subseteq{\mathcal{P}}_{1}\equiv\{(\lambda_{+},\lambda_{-}):\lambda_{+}<0,\lambda_{-}>0\}, or L(S)={(0,0)}L({\mathcal{S}}_{*})=\{(0,0)\}.

Step 3. Finish the proof using two claims.

Claim (1)(1). If L(S)P1L({\mathcal{S}}_{*})\subseteq{\mathcal{P}}_{1}, then for any ρS\overline{\rho}_{*}\in{\mathcal{S}}_{*}, we have ρ((0,))=1\overline{\rho}_{*}((0,\infty))=1.

Claim (2)(2). We cannot have L(S)P2L({\mathcal{S}}_{*})\subseteq{\mathcal{P}}_{2}.

Here we assume these two claims hold, and use them to prove our results. For Δ<Δ\Delta<\Delta_{\infty}, we proved in Theorem 9 that, there is not a fixed point such that L(ρ)=(0,0)L(\overline{\rho}_{*})=(0,0). Therefore, we cannot have L(S)={(0,0)}L({\mathcal{S}}_{*})=\{(0,0)\}. Due to Claim (2)(2), we cannot have L(S)P2L({\mathcal{S}}_{*})\subseteq{\mathcal{P}}_{2}. Hence, we must have L(S)P1L({\mathcal{S}}_{*})\subseteq{\mathcal{P}}_{1}. According to Theorem 9, for Δ<Δ\Delta<\Delta_{\infty}, the only fixed point of PDE with ρ((0,))=1\overline{\rho}_{*}((0,\infty))=1 is a point mass at some location rr_{*}. Furthermore, this delta function fixed point is unique and is also the global minimizer of the risk. Therefore, we conclude that, as Δ<Δ\Delta<\Delta_{\infty}, the PDE will converge to this global minimizer.

For ΔΔ\Delta\geq\Delta_{\infty}, according to Claim (1)(1), if ρ\overline{\rho}_{*} is a limiting point such that L(ρ)P1L(\overline{\rho}_{*})\in{\mathcal{P}}_{1}, then ρ((0,))=1\overline{\rho}_{*}((0,\infty))=1. According to Theorem 9, a fixed point ρ\overline{\rho}_{*} with ρ((0,))=1\overline{\rho}_{*}((0,\infty))=1 and L(ρ)(0,0)L(\overline{\rho}_{*})\neq(0,0) must be a point mass at some location rr_{*}, with L(ρ)P2L(\overline{\rho}_{*})\in{\mathcal{P}}_{2}. Therefore, we cannot have L(S)P1L({\mathcal{S}}_{*})\subseteq{\mathcal{P}}_{1}. Claim (2)(2) also tells us that we cannot have L(S)P2L({\mathcal{S}}_{*})\subseteq{\mathcal{P}}_{2}. Hence, we must have L(S)={(0,0)}L({\mathcal{S}}_{*})=\{(0,0)\}. In this case, all the points in the set S{\mathcal{S}}_{*} have risk . Therefore, we conclude that, as ΔΔ\Delta\geq\Delta_{\infty}, the PDE will converge to some limiting set with risk .

We are left with the task of proving the two claims above. Before that, we introduce some useful notations. Recall Z(r)=q(r)/q+(r)Z(r)=q_{-}^{\prime}(r)/q_{+}^{\prime}(r) for r(0,+)r\in(0,+\infty). According to condition S4, Z(r)>0Z^{\prime}(r)>0 for r(0,+)r\in(0,+\infty). This implies that Z(0+)Z00Z(0+)\equiv Z_{0}\geq 0 and Z(+)ZZ(+\infty)\equiv Z_{\infty}\leq\infty exist. We rewrite rψ(r;ρ)\partial_{r}\psi_{\infty}(r;\overline{\rho}) as

Proof of Claim (1). If L(S)P1L({\mathcal{S}}_{*})\subseteq{\mathcal{P}}_{1}, then for any ρS\overline{\rho}_{*}\in{\mathcal{S}}_{*}, we have ρ({0,})=0\overline{\rho}_{*}(\{0,\infty\})=0.

Assume L(S)P1L({\mathcal{S}}_{*})\subseteq{\mathcal{P}}_{1}. Then, we must have L(S)P1{(λ+,λ):Z0<λ+/λ<Z}L({\mathcal{S}}_{*})\subseteq{\mathcal{P}}_{1}\cap\{(\lambda_{+},\lambda_{-}):Z_{0}<-\lambda_{+}/\lambda_{-}<Z_{\infty}\}. Otherwise suppose there exists ρS\overline{\rho}_{*}\in{\mathcal{S}}_{*}, such that λ+(ρ)/λ(ρ)Z-\lambda_{+}(\overline{\rho}_{*})/\lambda_{-}(\overline{\rho}_{*})\geq Z_{\infty} or λ+(ρ)/λ(ρ)Z0-\lambda_{+}(\overline{\rho}_{*})/\lambda_{-}(\overline{\rho}_{*})\leq Z_{0}, according to Eq. (8.28), ψ(r;ρ)\psi_{\infty}(r;\overline{\rho}_{*}) must be strictly increasing or strictly decreasing in rr. Since supp(ρ){r[0,]:rψ(r;ρ)=0}{\rm supp}(\overline{\rho}_{*})\subseteq\{r\in[0,\infty]:\partial_{r}\psi_{\infty}(r;\overline{\rho}_{*})=0\}, then ρ\overline{\rho}_{*} must be a combination of two delta functions located at and ++\infty. But such ρ\overline{\rho}_{*} must have R(ρ)1\overline{R}_{\infty}(\overline{\rho}_{*})\geq 1, and thus ρ\overline{\rho}_{*} cannot be a limiting point of the PDE. Hence the claim that L(S)P1{(λ+,λ):Z0<λ+/λ<Z}L({\mathcal{S}}_{*})\subseteq{\mathcal{P}}_{1}\cap\{(\lambda_{+},\lambda_{-}):Z_{0}<-\lambda_{+}/\lambda_{-}<Z_{\infty}\} holds.

Since S{\mathcal{S}}_{*} is a compact set, and LL is a continuous map, then L(S)L({\mathcal{S}}_{*}) is a compact set. Therefore, there must exist ε0>0{\varepsilon}_{0}>0, so that for any ρS\overline{\rho}_{*}\in{\mathcal{S}}_{*}, we have Z0+3ε0<λ+(ρ)/λ(ρ)<Z3ε0Z_{0}+3{\varepsilon}_{0}<-\lambda_{+}(\overline{\rho}_{*})/\lambda_{-}(\overline{\rho}_{*})<Z_{\infty}-3{\varepsilon}_{0}. For this ε0>0{\varepsilon}_{0}>0, since S{\mathcal{S}}_{*} contains all the limiting points of PDE starting from ρ0\overline{\rho}_{0}, there exists t0t_{0} large enough, so that as tt0t\geq t_{0}, we have Z0+2ε0<λ+(ρt)/λ(ρt)<Z2ε0Z_{0}+2{\varepsilon}_{0}<-\lambda_{+}(\overline{\rho}_{t})/\lambda_{-}(\overline{\rho}_{t})<Z_{\infty}-2{\varepsilon}_{0}, and λ+(ρt)<0\lambda_{+}(\overline{\rho}_{t})<0, λ(ρt)>0\lambda_{-}(\overline{\rho}_{t})>0. For the same ε0{\varepsilon}_{0}, since Z(r)Z(r) is continuous at and ++\infty, there exists 0<r0<r<0<r_{0}<r_{\infty}<\infty, so that Z(r)<Z0+ε0Z(r)<Z_{0}+{\varepsilon}_{0} for r(0,r0)r\in(0,r_{0}), and Z(r)>Zε0Z(r)>Z_{\infty}-{\varepsilon}_{0} for r(r,)r\in(r_{\infty},\infty). Therefore, for any tt0t\geq t_{0}, rψ(r;ρt)<0\partial_{r}\psi_{\infty}(r;\overline{\rho}_{t})<0 for any r(0,r0)r\in(0,r_{0}), and rψ(r;ρt)>0\partial_{r}\psi_{\infty}(r;\overline{\rho}_{t})>0 for any r(r,+)r\in(r_{\infty},+\infty).

As a result, according to the equation (8.28), we must have rψ(r;ρt)<0\partial_{r}\psi_{\infty}(r;\overline{\rho}_{t})<0 for any r(0,r0)r\in(0,r_{0}) and tt0t\geq t_{0}, and rψ(r;ρt)>0\partial_{r}\psi_{\infty}(r;\overline{\rho}_{t})>0 for any r(r,)r\in(r_{\infty},\infty) and tt0t\geq t_{0}.

Due to Lemma 7.9, ρt0((0,))=1\overline{\rho}_{t_{0}}((0,\infty))=1. Denoting Ωk=[1/k,k]\Omega_{k}=[1/k,k], then limkρt0(Ωk)=1\lim_{k\to\infty}\overline{\rho}_{t_{0}}(\Omega_{k})=1. With this choice of Ωk\Omega_{k}, for any k{r,1/r0}k\geq\{r_{\infty},1/r_{0}\}, and for any tt0t\geq t_{0}, we have rψ(r;ρt),n(r)>0\langle\partial_{r}\psi_{\infty}(r;\overline{\rho}_{t}),{\bm{n}}(r)\rangle>0 for rΩkr\in\partial\Omega_{k} where n(r){\bm{n}}(r) is the normal vector point outside Ωk\Omega_{k}. Therefore, if we consider the ODE

starting with r(t0)Ωkr(t_{0})\in\Omega_{k}, r(t)r(t) cannot leak outside Ωk\Omega_{k} from either boundaries of Ωk\Omega_{k}, and we must have r(t)Ωkr(t)\in\Omega_{k} for any tt0t\geq t_{0}. Due to Lemma 7.8, ρt(Ωk)ρt0(Ωk)\overline{\rho}_{t}(\Omega_{k})\geq\overline{\rho}_{t_{0}}(\Omega_{k}) for any tt0t\geq t_{0}. As a result, we conclude that for any ρS\overline{\rho}_{*}\in{\mathcal{S}}_{*},

Note kΩk=(0,)\cup_{k}\Omega_{k}=(0,\infty). This gives ρ({0,})=0\overline{\rho}_{*}(\{0,\infty\})=0, which proves Claim (1)(1).

Proof of Claim (2)(2), step (1)(1). If L(S)P2L({\mathcal{S}}_{*})\subseteq{\mathcal{P}}_{2}, then S{\mathcal{S}}_{*} must be a singleton.

In the case L(S)P2L({\mathcal{S}}_{*})\subseteq{\mathcal{P}}_{2}, the argument is similar to the proof of Claim (1)(1), and hence will be presented in a synthetic form. First, we must have L(S)P2{(λ+,λ):Z0<λ+/λ<Z}L({\mathcal{S}}_{*})\subseteq{\mathcal{P}}_{2}\cap\{(\lambda_{+},\lambda_{-}):Z_{0}<-\lambda_{+}/\lambda_{-}<Z_{\infty}\}. Therefore, there must exist ε0>0{\varepsilon}_{0}>0, so that for any ρS\overline{\rho}_{*}\in{\mathcal{S}}_{*}, we have Z0+3ε0<λ+(ρ)/λ(ρ)<Z3ε0Z_{0}+3{\varepsilon}_{0}<-\lambda_{+}(\overline{\rho}_{*})/\lambda_{-}(\overline{\rho}_{*})<Z_{\infty}-3{\varepsilon}_{0}. For this ε0>0{\varepsilon}_{0}>0, there exists t0t_{0} large enough, so that as tt0t\geq t_{0}, we have Z0+2ε0<λ+(ρt)/λ(ρt)<Z2ε0Z_{0}+2{\varepsilon}_{0}<-\lambda_{+}(\overline{\rho}_{t})/\lambda_{-}(\overline{\rho}_{t})<Z_{\infty}-2{\varepsilon}_{0}, and λ+(ρt)>0\lambda_{+}(\overline{\rho}_{t})>0, λ(ρt)<0\lambda_{-}(\overline{\rho}_{t})<0. Further, there exists 0<r0<r<0<r_{0}<r_{\infty}<\infty, so that rψ(r;ρt)>0\partial_{r}\psi_{\infty}(r;\overline{\rho}_{t})>0 for any r(0,r0)r\in(0,r_{0}) and tt0t\geq t_{0}, and rψ(r;ρt)<0\partial_{r}\psi_{\infty}(r;\overline{\rho}_{t})<0 for any r(r,)r\in(r_{\infty},\infty) and tt0t\geq t_{0}.

Therefore, if we consider the ODE (8.29) starting with r(t0)[0,r0)r(t_{0})\in[0,r_{0}), we must have r(t)[0,r0)r(t)\in[0,r_{0}) for any tt0t\geq t_{0}; if we start with r(t0)(r,]r(t_{0})\in(r_{\infty},\infty], we must have r(t)(r,]r(t)\in(r_{\infty},\infty] for any tt0t\geq t_{0}. Due to Lemma 7.8, {ρt([0,r))}tt0\{\overline{\rho}_{t}([0,r))\}_{t\geq t_{0}} for 0<rr00<r\leq r_{0} and {ρt((r,+])}tt0\{\overline{\rho}_{t}((r,+\infty])\}_{t\geq t_{0}} for rrr\geq r_{\infty} must be non-decreasing in tt. According to Theorem 9, we can express ρS\overline{\rho}_{*}\in{\mathcal{S}}_{*} in the form ρ=a0(ρ)δ0+a(ρ)δ+a(ρ)δr\overline{\rho}_{*}=a_{0}(\overline{\rho}_{*})\delta_{0}+a_{\infty}(\overline{\rho}_{*})\delta_{\infty}+a(\overline{\rho}_{*})\delta_{r_{*}}. By the stated monotonicity property, for any ρ1,ρ2S\overline{\rho}_{1},\overline{\rho}_{2}\in{\mathcal{S}}_{*}, it holds that a0(ρ1)=a0(ρ2)a_{0}(\overline{\rho}_{1})=a_{0}(\overline{\rho}_{2}), a(ρ1)=a(ρ2)a_{\infty}(\overline{\rho}_{1})=a_{\infty}(\overline{\rho}_{2}), and hence a(ρ1)=a(ρ2)a(\overline{\rho}_{1})=a(\overline{\rho}_{2}). We denote them in short as a0a_{0}, aa_{\infty}, and aa.

For any such fixed point ρS\overline{\rho}_{*}\in{\mathcal{S}}_{*}, since we must have supp(ρ){r:rψ(r;ρ)=0}{\rm supp}(\overline{\rho}_{*})\subseteq\{r:\partial_{r}\psi_{\infty}(r;\overline{\rho}_{*})=0\}, r(0,+)r_{*}\in(0,+\infty) should be a solution of ϕ(r)=0\phi(r)=0 where

Proof of Claim (2)(2), step (2)(2). If ρ\overline{\rho}_{*} is a fixed point with L(ρ)P2L(\overline{\rho}_{*})\in{\mathcal{P}}_{2}, then ρ\overline{\rho}_{*} is unstable.

We apply Theorem 7 to ρ=a0δ0+aδ+aδr\overline{\rho}_{*}=a_{0}\delta_{0}+a_{\infty}\delta_{\infty}+a\delta_{r_{*}}. We will check the conditions of Theorem 7 to show that this type of fixed point is unstable.

First we check condition B1. Since [q(r)/q+(r)]>0[q_{-}^{\prime}(r)/q_{+}^{\prime}(r)]^{\prime}>0 and q+(r)>0q_{+}^{\prime}(r)>0 for r(0,+)r\in(0,+\infty), we have

Note the stationary condition of the PDE implies

and λ+(ρ)>0\lambda_{+}(\overline{\rho}_{*})>0, λ(ρ)<0\lambda_{-}(\overline{\rho}_{*})<0. Combined with the equation above, we have

This verifies condition B1{\sf B1} of Theorem 7.

Second, since λ+(ρ)>0\lambda_{+}(\overline{\rho}_{*})>0 and λ(ρ)<0\lambda_{-}(\overline{\rho}_{*})<0, according to Equation (8.28), we must have rψ(r;ρ)>0\partial_{r}\psi_{\infty}(r;\overline{\rho}_{*})>0 for r(0,r)r\in(0,r_{*}), and rψ(r;ρ)<0\partial_{r}\psi_{\infty}(r;\overline{\rho}_{*})<0 for r(r,)r\in(r_{*},\infty). Therefore, we have ψ(0;ρ)<ψ(r;ρ)\psi_{\infty}(0;\overline{\rho}_{*})<\psi_{\infty}(r_{*};\overline{\rho}_{*}) and ψ(+;ρ)<ψ(r;ρ)\psi_{\infty}(+\infty;\overline{\rho}_{*})<\psi_{\infty}(r_{*};\overline{\rho}_{*}). Note L(η){r:ψ(r;ρ)ψ(r;ρ)η}{\cal L}(\eta)\equiv\{r:\psi_{\infty}(r;\overline{\rho}_{*})\leq\psi_{\infty}(r_{*};\overline{\rho}_{*})-\eta\}. For any η>0\eta>0 small enough, ρ(L(η))=1a\overline{\rho}_{*}({\cal L}(\eta))=1-a, which verifies condition B2{\sf B2}. It is also easy to see that, for any η>0\eta>0, L(η)\partial{\cal L}(\eta) is a compact set, hence condition B3 holds. Note that we assumed further that ρ0\overline{\rho}_{0} has a bounded density with respect to Lebesgue measure, all the assumptions of Theorem 7 are satisfied. Theorem 7 implies that the PDE cannot converge to ρ\overline{\rho}_{*}. As a result, we conclude that we cannot have L(S(ρ0))P2L({\mathcal{S}}_{*}(\overline{\rho}_{0}))\subseteq{\mathcal{P}}_{2} for ρ0\mathscrsfsP\mboxgood\overline{\rho}_{0}\in\mathscrsfs{P}_{\mbox{\tiny\rm good}}. This proves Claim (2)(2).

4 Proof of Theorem 1

The key step consists in proving that the dynamics for large but finite dd is well approximated by the dynamics at d=d=\infty. The key estimate is provided by the next lemma.

Assume σ\sigma satisfies condition S0, recall the definition of udu_{d} and uu_{\infty} given by Equation (8.8) and (8.9). Then we have

where (G1,G2)N(0,I2)(G_{1},G_{2})\sim{\sf N}(0,{\bm{I}}_{2}), and Θ(1/Zd)sin(θ)d21{θ[0,π]}dθ\Theta\sim(1/Z_{d})\sin(\theta)^{d-2}\cdot{\bm{1}}\{\theta\in[0,\pi]\}{\rm d}\theta are mutually independent.

Define G3=G1cosΘ+G2sinΘG_{3}=G_{1}\cos\Theta+G_{2}\sin\Theta, then

According to condition S0, σ\|\sigma^{\prime}\|_{\infty} and σ\|\sigma\|_{\infty} are bounded, it is sufficient to bound the following quantity uniformly for r[0,)r\in[0,\infty)

Assuming this claim holds, let us show that it implies the desired bound on T(r)T(r). We have

We are left with the task of proving Eq. (8.37).

Denote X=G2X=G_{2} and Y=G3Y=G_{3} for simplicity in notations. Note that (X,Y)=d(Y,X)=d(X,Y)(X,Y)\stackrel{{\scriptstyle{\rm d}}}{{=}}(Y,X)\stackrel{{\scriptstyle{\rm d}}}{{=}}(-X,-Y). It follows that we can assume, without loss of generality, a>0a>0. We have

Let yUnif({1,+1})y\sim\textup{Unif}(\{-1,+1\}), [xy=+1]N(0,Σ+)[{\bm{x}}|y=+1]\sim{\sf N}({\mathbf{0}},{\bm{\Sigma}}_{+}), [xy=1]N(0,Σ)[{\bm{x}}|y=-1]\sim{\sf N}(0,{\bm{\Sigma}}_{-}) with τ2IDΣ+,Στ+2ID\tau_{-}^{2}{\bm{I}}_{D}\preceq{\bm{\Sigma}}_{+},{\bm{\Sigma}}_{-}\preceq\tau_{+}^{2}{\bm{I}}_{D} for some 0<τ<τ+<0<\tau_{-}<\tau_{+}<\infty. Assume that the activation function σ\sigma satisfies condition S0. Define

Then assumptions A2 and A3 are satisfied.

Note that x{\bm{x}} is sub-Gaussian, and by condition S0 we have σ\sigma^{\prime} is bounded, then θσ(x,θ)=σ(x,θ)x\nabla_{\bm{\theta}}\sigma(\langle{\bm{x}},{\bm{\theta}}\rangle)=\sigma^{\prime}(\langle{\bm{x}},{\bm{\theta}}\rangle){\bm{x}} is also sub-Gaussian (with sub-Gaussian parameter independent of DD). Condition S0 also gives that σ\sigma is bounded, therefore assumption A2 is satisfied.

Since σ,σ<\|\sigma\|_{\infty},\|\sigma^{\prime}\|_{\infty}<\infty, applying Cauchy-Schwarz inequality, we have V,1U,122U\nabla V,\nabla_{1}U,\nabla_{12}^{2}U are uniformly bounded.

It is difficult to bound 2V\nabla^{2}V and 12U\nabla_{1}^{2}U directly because σ\sigma^{\prime} may not be differentiable. We will use a longer argument to bound them.

First, for a bounded-Lipschitz function ff, and for g{1,σ}g\in\{1,\sigma\}, define

where GN(0,Id){\bm{G}}\sim{\sf N}(0,{\bm{I}}_{d}). Since we have τ2IDΣ+,Στ+2ID\tau_{-}^{2}{\bm{I}}_{D}\preceq{\bm{\Sigma}}_{+},{\bm{\Sigma}}_{-}\preceq\tau_{+}^{2}{\bm{I}}_{D} for some 0<τ<τ+<0<\tau_{-}<\tau_{+}<\infty, in order to bound 2V\nabla^{2}V and 12U\nabla_{1}^{2}U, it is sufficient to bound 12Wσ,1\nabla_{1}^{2}W_{\sigma,1} and 12Wσ,σ\nabla_{1}^{2}W_{\sigma,\sigma}.

is uniformly bounded for g=1g=1 or g=σg=\sigma. Let h=σσ0h=\sigma-\sigma_{0}, then h=0h=0 for r[δ0,δ0]r\in[-\delta_{0},\delta_{0}], and hh is KK-bounded-Lipschitz for some constant KK. It is sufficient to bound 12Wh,g\nabla_{1}^{2}W_{h,g} for g{1,σ}g\in\{1,\sigma\}.

Since G{\bm{G}} is Gaussian, using Stein’s formula, for any unit vector n{\bm{n}}, we have

Taking directional derivatives of E1E_{1} and E2E_{2}, we have

is uniformly bounded for r[0,]r\in[0,\infty]. Hence E11E_{11} is uniformly bounded. Using a similar argument, we can show that each terms E12E_{12}, E13E_{13}, E21E_{21}, E22E_{22}, and E23E_{23} are uniformly bounded.

Now we look at 1E3(θ1,θ2,n)\nabla_{1}E_{3}({\bm{\theta}}_{1},{\bm{\theta}}_{2},{\bm{n}}). We have

In order to bound E32E_{32}, we apply Stein’s formula to get

We can apply Stein’s formula to the right hand side of the last equation. Using the same argument as above, we obtain that E31E_{31} is uniformly bounded.

As a result, 2V\nabla^{2}V and 12U\nabla_{1}^{2}U are uniformly bounded. Therefore, assumption A3 is satisfied.

We are now in position to prove Theorem 1.

First we consider PDE (8.22) for d=d=\infty. We fix an initial radial density ρ0\mathscrsfsP\mboxgood\overline{\rho}_{0}\in\mathscrsfs{P}_{\mbox{\tiny\rm good}}. Due to Theorem 10, for any η>0\eta>0, there exists T=T(η,ρ0,Δ)>0T=T(\eta,\overline{\rho}_{0},\Delta)>0, so that the solution (ρt)t0(\overline{\rho}_{t}^{\infty})_{t\geq 0} of PDE (8.22) for d=d=\infty with initialization ρ0\overline{\rho}_{0} satisfies

Next we would like to bound the difference of R(ρ)\overline{R}_{\infty}(\overline{\rho}) and Rd(ρ)\overline{R}_{d}(\overline{\rho}) for any ρ\overline{\rho}. Note

By Lemma 8.1, there exists d0=d0(η,Δ)d_{0}=d_{0}(\eta,\Delta) large enough, so that for dd0d\geq d_{0}, we have

Finally, let (θk)k1({\bm{\theta}}^{k})_{k\geq 1} be the trajectory of SGD, with step size sk=εξ(kε)s_{k}={\varepsilon}\xi(k{\varepsilon}), and initialization wi0iidρ0{\bm{w}}_{i}^{0}\sim_{iid}\rho_{0} for iNi\leq N. We apply Theorem 3 to bound the difference of the law of trajectory of SGD and the solution of PDE (8.53). The assumptions of Theorem 3 are verified by Lemma 8.2. As a consequence, there exists constant KK (which depend uniquely on the constants in assumptions A1 A2 A3), such that for any t10Tt\leq 10T, we have

As a consequence, for any δ>0\delta>0, there exists C0=C0(δ,η,ρ0,Δ)C_{0}=C_{0}(\delta,\eta,\overline{\rho}_{0},\Delta), so that as N,1/εC0dN,1/{\varepsilon}\geq C_{0}d and ε1/N10{\varepsilon}\geq 1/N^{10}, for any t10Tt\leq 10T, we have

Therefore, the trajectory θt/ε{\bm{\theta}}^{\lfloor t/{\varepsilon}\rfloor} of SGD as t[T,10T]t\in[T,10T] satisfies

with probability at least 1δ1-\delta. This gives the desired result.

5 Checking conditions 𝖲𝟢𝖲𝟢{\sf S0}–𝖲𝟦𝖲𝟦{\sf S4} for the running example

The requirements of Lemma 8.3 are not restrictive. An example of parameters that satisfies all conditions gives s1=2.5s_{1}=-2.5, s2=7.5s_{2}=7.5, t1=0.5t_{1}=0.5, t2=1.5t_{2}=1.5.

It is straightforward to see that condition S0 holds. To show condition S1, denote by σ(r)\sigma^{\prime}(r) the weak derivative of σ(r)\sigma(r), we calculate the function q(r)q^{\prime}(r) for r>0r>0 explicitly,

Since s1<s2s_{1}<s_{2} and 0<t1<t20<t_{1}<t_{2}, it is easy to see that q(r)q^{\prime}(r) is analytic on (0,)(0,\infty), and hence q(r)q(r) is analytic on (0,)(0,\infty). Differentiating q(r)q^{\prime}(r) in Eq. (8.56), it is easy to see that limrq(r)=0\lim_{r\rightarrow\infty}q^{\prime\prime}(r)=0, and q(0+)=0q^{\prime\prime}(0+)=0. Hence, we have supr[0,+]q(r)<\sup_{r\in[0,+\infty]}q^{\prime\prime}(r)<\infty. Then condition S1 holds.

Since s2>s1s_{2}>s_{1}, 0<t1<t20<t_{1}<t_{2}, we have q(r)>0q^{\prime}(r)>0 for r(0,+)r\in(0,+\infty), limrq(r)=0\lim_{r\rightarrow\infty}q^{\prime}(r)=0, and q(0+)=0q^{\prime}(0+)=0. Hence, we have supr[0,+]q(r)<\sup_{r\in[0,+\infty]}q^{\prime}(r)<\infty. Then condition S2 holds. Note that q(0)=σ(0)=s1<1q(0)=\sigma(0)=s_{1}<-1, and q(+)=(s1+s2)/2>1q(+\infty)=(s_{1}+s_{2})/2>1. In addition, [q(0)+q(+)]/2=(3s1+s2)/4(1,1)[q(0)+q(+\infty)]/2=(3s_{1}+s_{2})/4\in(-1,1). Therefore, condition S3 holds.

Finally, we show that condition S4 holds. Define p(r)=exp[t12/(2r2)]exp[t22/(2r2)]p(r)=\exp[-t_{1}^{2}/(2r^{2})]-\exp[-t_{2}^{2}/(2r^{2})], which is a positively scaled version of q(r)q^{\prime}(r). To show that for r(0,)r\in(0,\infty),

we only need to show that for r(0,)r\in(0,\infty)

Define xt22/(2τ+2r2)>0x\equiv t_{2}^{2}/(2\tau_{+}^{2}r^{2})>0, sτ+2/τ2>1s\equiv\tau_{+}^{2}/\tau_{-}^{2}>1, 0<ct12/t22<10<c\equiv t_{1}^{2}/t_{2}^{2}<1, we have

It is sufficient to show that F2(x;s,c)>0F_{2}(x;s,c)>0 for x>0x>0, s>1s>1, and 0<c<10<c<1. Note that F2(0+;s,c)=0F_{2}(0+;s,c)=0. Hence it is sufficient to show that xF2(x;s,c)>0\partial_{x}F_{2}(x;s,c)>0 for x>0x>0.

Note that s>1s>1 and 0c<10\leq c<1, F3(0+;s,c)=0F_{3}(0+;s,c)=0. It is therefore sufficient to show that xF3(x;s,c)>0\partial_{x}F_{3}(x;s,c)>0 for x>0x>0.

Since 0<c<10<c<1, s>1s>1, and x>0x>0, we have xF3(x;s,c)>0\partial_{x}F_{3}(x;s,c)>0, and hence condition S4 holds. ∎

Centered anisotropic Gaussians

In this section we consider the centered anisotropic Gaussian example discussed in the main text. That is, we assume the joint law of (y,x)(y,{\bm{x}}) to be as follows:

With probability 1/21/2: y=+1y=+1, xN(0,Σ+){\bm{x}}\sim{\sf N}({\mathbf{0}},{\bm{\Sigma}}_{+}).

With probability 1/21/2: y=1y=-1, xN(0,Σ){\bm{x}}\sim{\sf N}({\mathbf{0}},{\bm{\Sigma}}_{-}).

We will assume Σ+,Σ+{\bm{\Sigma}}_{+},{\bm{\Sigma}}_{+} to be diagonalizable in the same orthonormal basis, and to differ only on a subspace of dimension s0s_{0}. We want to study whether and how the neural network will identify this subspace of relevant features. Without loss of generality, we can assume that the eigenvalues correspond to the standard basis. In order to focus on the simplest possible model of this type, we will choose:

Because of condition S2{\sf S2} and S3{\sf S3}, we have qr+,qr,qr+,qrCb(E2)q\circ r_{+},q\circ r_{-},q^{\prime}\circ r_{+},q^{\prime}\circ r_{-}\in C_{b}(E_{2}).

Let \mathscrsfsP(E2)\mathscrsfs{P}(E_{2}) be the set of probability measures on E2E_{2}. Due to Prokhorov’s theorem, there exists a complete metric dˉP{\bar{d}}{P} on \mathscrsfsP(E2)\mathscrsfs{P}(E_{2}) equivalent to the topology of weak convergence, so that (\mathscrsfsP(E2),dˉP)(\mathscrsfs{P}(E_{2}),{\bar{d}}{P}) is a compact metric space. In this section, we will denote by \mathscrsfsP=\mathscrsfsP(E2){\overline{\mathscrsfs{P}}}=\mathscrsfs{P}(E_{2}).

Since the distribution of x{\bm{x}} is invariant under rotations in first s0s_{0} coordinates, and invariant under rotations in last ds0d-s_{0} coordinates, so are the functions

where μ\mboxHaar\mu_{\mbox{\tiny\rm Haar}} is the Haar measure over the group of orthogonal rotations. Since ρR(ρ)\rho\mapsto R(\rho) is convex, R(ρs)R(ρ)R(\rho_{s})\leq R(\rho).

where Θ1(1/Zs0)sins02θ1{θ[0,π]}dθ\Theta_{1}\sim(1/Z_{s_{0}})\sin^{s_{0}-2}\theta\cdot{\bm{1}}\{\theta\in[0,\pi]\}{\rm d}\theta and Θ2(1/Zds0)sinds02θ1{θ[0,π]}dθ\Theta_{2}\sim(1/Z_{d-s_{0}})\sin^{d-s_{0}-2}\theta\cdot{\bm{1}}\{\theta\in[0,\pi]\}{\rm d}\theta are independent.

As dd\to\infty, we have limdud(a1,a2,b1,b2)=u(a1,a2,b1,b2)\lim_{d\to\infty}u_{d}(a_{1},a_{2},b_{1},b_{2})=u_{\infty}(a_{1},a_{2},b_{1},b_{2}), with

and the risk function converges to (for a=(a1,a2){\bm{a}}=(a_{1},a_{2}))

For s0=γds_{0}=\gamma\cdot d with 0<γ<10<\gamma<1 and dd\to\infty, we have the simpler expression

The following theorem provides a characterization of the global minimizers of R(ρ)\overline{R}_{\infty}(\overline{\rho}).

Consider d=d=\infty. Recall \mathscrsfsP=\mathscrsfsP(E2){\overline{\mathscrsfs{P}}}=\mathscrsfs{P}(E_{2}) where E2[0,+)2{}E_{2}\equiv[0,+\infty)^{2}\cup\{\infty\}. Then there exists Δ(0,1)\Delta_{\infty}\in(0,1), such that

For Δ<Δ\Delta<\Delta_{\infty}, infρ\mathscrsfsPR(ρ)>0\inf_{\overline{\rho}\in{\overline{\mathscrsfs{P}}}}\overline{R}_{\infty}(\overline{\rho})>0 and the unique global minimizer of risk function R(ρ)\overline{R}_{\infty}(\overline{\rho}) is a point mass located at (r,0)(r_{*},0) for some r=r(Δ)(0,)r_{*}=r_{*}(\Delta)\in(0,\infty).

For ΔΔ\Delta\geq\Delta_{\infty}, all global minimizers of risk function R(ρ)\overline{R}_{\infty}(\overline{\rho}) have risk zero, and there exists a global minimizer that has finite support.

Suppose ρ2argminρ2\mathscrsfsP(E2)R(2)(ρ2)\overline{\rho}_{2}^{*}\in{\rm arg\,min}_{\overline{\rho}_{2}\in\mathscrsfs{P}(E_{2})}\overline{R}_{\infty}^{(2)}(\overline{\rho}_{2}). Then we must have qr+,ρ21\langle q\circ r_{+},\overline{\rho}_{2}^{*}\rangle\leq 1 and qr,ρ21\langle q\circ r_{-},\overline{\rho}_{2}^{*}\rangle\geq-1. Indeed, if either qr+,ρ2>1\langle q\circ r_{+},\overline{\rho}_{2}^{*}\rangle>1 or qr,ρ2<1\langle q\circ r_{-},\overline{\rho}_{2}^{*}\rangle<-1, since q(+)>1q(+\infty)>1 and q(0)<1q(0)<-1, the distribution ρ2=a0δ0+aδ+(1a0a)ρ2\overline{\rho}_{2}^{\prime}=a_{0}\delta_{\mathbf{0}}+a_{\infty}\delta_{\infty}+(1-a_{0}-a_{\infty})\overline{\rho}_{2}^{*} with appropriate choice of a0a_{0} and aa_{\infty} will give a lower risk.

This ρ2\mathscrsfsP(E2)\overline{\rho}_{2}^{*}\in\mathscrsfs{P}(E_{2}) induces a ρ1\mathscrsfsP([0,])\overline{\rho}_{1}\in\mathscrsfs{P}([0,\infty]) as follows: for any Borel set B[0,]B\subseteq[0,\infty], ρ1(B)=ρ2({rE2:r2B})\overline{\rho}_{1}(B)=\overline{\rho}_{2}^{*}(\{{\bm{r}}\in E_{2}:\|{\bm{r}}\|_{2}\in B\}). For this ρ1\overline{\rho}_{1}, it is easy to see that q,ρ1qr,ρ2\langle q_{-},\overline{\rho}_{1}\rangle\leq\langle q\circ r_{-},\overline{\rho}_{2}^{*}\rangle and q+,ρ1qr+,ρ2\langle q_{+},\overline{\rho}_{1}\rangle\geq\langle q\circ r_{+},\overline{\rho}_{2}^{*}\rangle, and the equalities hold if and only if ρ2(E1)=1\overline{\rho}_{2}^{*}(E_{1})=1, where E1([0,+)×{0}){}E_{1}\equiv([0,+\infty)\times\{0\})\cup\{\infty\}. Since q(+)>1q(+\infty)>1 and q(0)<1q(0)<-1, we can take ρ1=a0δ0+aδ+(1a0a)ρ1\overline{\rho}_{1}^{*}=a_{0}\delta_{0}+a_{\infty}\delta_{\infty}+(1-a_{0}-a_{\infty})\overline{\rho}_{1} with appropriate choice of a0a_{0} and aa_{\infty}, so that qr+,ρ2q+,ρ11\langle q\circ r_{+},\overline{\rho}_{2}^{*}\rangle\leq\langle q_{+},\overline{\rho}_{1}^{*}\rangle\leq 1 and qr,ρ2q,ρ11\langle q\circ r_{-},\overline{\rho}_{2}^{*}\rangle\geq\langle q_{-},\overline{\rho}_{1}^{*}\rangle\geq-1. Therefore, we always have infρ1\mathscrsfsP([0,])R(1)(ρ1)infρ2\mathscrsfsP(E2)R(2)(ρ2)\inf_{\overline{\rho}_{1}\in\mathscrsfs{P}([0,\infty])}\overline{R}_{\infty}^{(1)}(\overline{\rho}_{1})\leq\inf_{\overline{\rho}_{2}\in\mathscrsfs{P}(E_{2})}\overline{R}_{\infty}^{(2)}(\overline{\rho}_{2}), and ρ2(E1)=1\overline{\rho}_{2}^{*}(E_{1})=1 for any ρ2argminρ2\mathscrsfsP(E2)R(2)(ρ2)\overline{\rho}_{2}^{*}\in{\rm arg\,min}_{\overline{\rho}_{2}\in\mathscrsfs{P}(E_{2})}\overline{R}_{\infty}^{(2)}(\overline{\rho}_{2}). Note that R(2)(ρ1×δ0)=R(1)(ρ1)\overline{R}_{\infty}^{(2)}(\overline{\rho}_{1}\times\delta_{0})=\overline{R}_{\infty}^{(1)}(\overline{\rho}_{1}) for any ρ1\mathscrsfsP([0,])\overline{\rho}_{1}\in\mathscrsfs{P}([0,\infty]). Hence, we must have infρ1\mathscrsfsP([0,])R(1)(ρ1)=infρ2\mathscrsfsP(E2)R(2)(ρ2)\inf_{\overline{\rho}_{1}\in\mathscrsfs{P}([0,\infty])}\overline{R}_{\infty}^{(1)}(\overline{\rho}_{1})=\inf_{\overline{\rho}_{2}\in\mathscrsfs{P}(E_{2})}\overline{R}_{\infty}^{(2)}(\overline{\rho}_{2}).

Due to the above argument, we reduced our analysis to the centered isotropic Gaussians case. All the conclusions can be proved using the same argument as in the proof of Theorem 8.

2 Dynamics: Fixed points

We specialize the general evolution (7.1) to the present case. Assuming ρ0\rho_{0} to be invariant with respect to products of orthogonal transformations, the same happens for ρt\rho_{t}. We let ρt\mathscrsfsP(E2)\overline{\rho}_{t}\in\mathscrsfs{P}(E_{2}) denote the distribution of (w12,w22)(\|{\bm{w}}_{1}\|_{2},\|{\bm{w}}_{2}\|_{2}) when wρt{\bm{w}}\sim\rho_{t}. Then ρt\overline{\rho}_{t} satisfies the following PDE:

We will view this as an evolution in the space of probability distribution on \mathscrsfsP=\mathscrsfsP(E2){\overline{\mathscrsfs{P}}}=\mathscrsfs{P}(E_{2}).

In analogy with Proposition 2, we can prove the following characterization of fixed points.

A distribution ρ\mathscrsfsP\overline{\rho}\in{\overline{\mathscrsfs{P}}} is a fixed point of the PDE (9.16) if and only if

Notice, in particular, global minimizers of Rd(ρ)\overline{R}_{d}(\overline{\rho}) are fixed points of this evolution, but not vice-versa. The next result classifies fixed points.

Consider d=d=\infty, and recall the definition of λ+(ρ)\lambda_{+}(\overline{\rho}) and λ(ρ)\lambda_{-}(\overline{\rho}) given by Eq. (9.15) and (9.14). Then the fixed points of the PDE (9.16) (i.e. the probability measures ρ\mathscrsfsP\overline{\rho}\in{\overline{\mathscrsfs{P}}} satisfying (9.17)) must be of one of the following types

A point mass ρr=δ(r,0)\overline{\rho}_{r_{*}}=\delta_{(r_{*},0)} at some location (r,0)(r_{*},0) with r∉{0,+}r_{*}\not\in\{0,+\infty\}, but not of type (a)(a).

A mixture of the type ρ=a0δ0+aδ+a1δ(r1,0)+a2ρ2\overline{\rho}=a_{0}\delta_{\mathbf{0}}+a_{\infty}\delta_{\infty}+a_{1}\delta_{(r_{*1},0)}+a_{2}\overline{\rho}_{2} with supp(ρ2){0}×(0,){\rm supp}(\overline{\rho}_{2})\subseteq\{0\}\times(0,\infty), but not of type (b)(b) and (a)(a).

For Δ<Δ\Delta<\Delta_{\infty}, the PDE has a unique fixed point of type (b)(b), with λ+(ρ)<0\lambda_{+}(\overline{\rho}_{*})<0 and λ(ρ)>0\lambda_{-}(\overline{\rho}_{*})>0; it has no type-(a)(a) fixed points; it has possibly fixed points of type (c)(c).

For Δ>Δ\Delta>\Delta_{\infty}, the PDE has some fixed points of type (b)(b), with λ+(ρ)>0\lambda_{+}(\overline{\rho}_{*})>0 and λ(ρ)<0\lambda_{-}(\overline{\rho}_{*})<0; it also has some type-(a)(a) fixed points; it has possibly fixed points of type (c)(c).

For Δ=Δ\Delta=\Delta_{\infty}, the PDE has a unique fixed point of type (a)(a) which is also a delta function at some location (r1,0)(r_{*1},0), and no type (b)(b) fixed points; it has possibly fixed points of type (c)(c).

We use the characterization of fixed points in Proposition 6. Recall that ψ(r;ρ)\psi_{\infty}({\bm{r}};\overline{\rho}_{*}) is defined as in Eq. (9.13). The gradient ψ(r;ρ)\nabla\psi_{\infty}({\bm{r}};\overline{\rho}) is given by

If a fixed point ρ\overline{\rho}_{*} gives λ+(ρ)=λ(ρ)=0\lambda_{+}(\overline{\rho}_{*})=\lambda_{-}(\overline{\rho}_{*})=0, then R(ρ)=0\overline{R}_{\infty}(\overline{\rho}_{*})=0. This is type-(a)(a) fixed point. Consider then the case (λ+(ρ),λ(ρ))(0,0)(\lambda_{+}(\overline{\rho}_{*}),\lambda_{-}(\overline{\rho}_{*}))\neq(0,0).

Suppose ρ((0,+)2)>0\overline{\rho}_{*}((0,+\infty)^{2})>0. Since q(r)>0q^{\prime}(r)>0 and τ+>1>τ\tau_{+}>1>\tau_{-}, in order for ψ(r;ρ)=0\nabla\psi_{\infty}({\bm{r}};\overline{\rho}_{*})={\mathbf{0}} for some r(0,+)2{\bm{r}}\in(0,+\infty)^{2}, we must have (λ+(ρ),λ(ρ))=(0,0)(\lambda_{+}(\overline{\rho}_{*}),\lambda_{-}(\overline{\rho}_{*}))=(0,0). Therefore, as ρ\overline{\rho}_{*} is a fixed point with (λ+(ρ),λ(ρ))(0,0)(\lambda_{+}(\overline{\rho}_{*}),\lambda_{-}(\overline{\rho}_{*}))\neq(0,0), we must have ρ((0,+)2)=0\overline{\rho}_{*}((0,+\infty)^{2})=0. That is, we can write ρ=a0δ0+aδ+a1ρ1+a2ρ2\overline{\rho}_{*}=a_{0}\delta_{\mathbf{0}}+a_{\infty}\delta_{\infty}+a_{1}\overline{\rho}_{1}+a_{2}\overline{\rho}_{2}, with supp(ρ1)(0,)×{0}{\rm supp}(\overline{\rho}_{1})\in(0,\infty)\times\{0\}, and supp(ρ2){0}×(0,){\rm supp}(\overline{\rho}_{2})\in\{0\}\times(0,\infty).

The solutions of ψ((r1,r2);ρ)=0\nabla\psi_{\infty}((r_{1},r_{2});\overline{\rho}_{*})=0 with r2=0r_{2}=0 are of the form 0{\mathbf{0}}, (r1,0)(r_{*1},0), and \infty. Therefore, ρ1=δ(r1,0)\overline{\rho}_{1}=\delta_{(r_{*1},0)} for some r1(0,)r_{*1}\in(0,\infty). Hence, as ρ\overline{\rho}_{*} is not a type-(a)(a) stationary point, it must be a type-(b)(b) or type-(c)(c) stationary point.

This proves that all fixed points are of type (a)(a), (b)(b), or (c)(c). The remaining claims follows the same argument as the proof of Theorem 9.

3 Dynamics: Convergence to global minimum for d=∞𝑑d=\infty

In this section, denote \mathscrsfsP\mboxgood\mathscrsfs{P}_{\mbox{\tiny\rm good}} to be

We then prove that the d=d=\infty dynamics converges to a global minimizer from any initialization ρ0\mathscrsfsP\mboxgood\overline{\rho}_{0}\in\mathscrsfs{P}_{\mbox{\tiny\rm good}}.

Consider the PDE (9.16) for d=d=\infty, with initialization ρ0\mathscrsfsP\mboxgood\overline{\rho}_{0}\in\mathscrsfs{P}_{\mbox{\tiny\rm good}}. It has a unique solution (ρt)t0(\overline{\rho}_{t})_{t\geq 0}, such that

Without loss of generality, we assume ξ(t)=1/2\xi(t)=1/2. First we show the existence and uniqueness of solution of the PDE.

Step 1. Existence and uniqueness of solution. Mass ρt((0,)2)=1\overline{\rho}_{t}((0,\infty)^{2})=1 for all tt.

According to conditions S1 - S3, q(r)q(r), q(r)q^{\prime}(r), and q(r)q^{\prime\prime}(r) are uniformly bounded on [0,][0,\infty]. Note

Then v(r),1u(r1,r2),2v(r),112u(r1,r2),122u(r1,r2)\nabla v({\bm{r}}),\nabla_{1}u_{\infty}({\bm{r}}_{1},{\bm{r}}_{2}),\nabla^{2}v({\bm{r}}),\nabla_{11}^{2}u_{\infty}({\bm{r}}_{1},{\bm{r}}_{2}),\nabla_{12}^{2}u_{\infty}({\bm{r}}_{1},{\bm{r}}_{2}) are uniformly bounded. Therefore, conditions A1 and A3 are satisfied with D=2D=2, V=vV=v, and U=uU=u. Then, there is the existence and uniqueness of solution of PDE (9.16) for d=d=\infty. Denote this solution to be (ρt)t0(\overline{\rho}_{t})_{t\geq 0}.

Recall the expression for ψ(r;ρ)\nabla\psi_{\infty}({\bm{r}};\overline{\rho}) in Eq. (9.18). It is easy to see that the assumption of Lemma 7.9 is satisfied with d=2d=2 and Ψ=ψ\Psi=\psi_{\infty}. Hence, we have ρt((0,)2)=1\overline{\rho}_{t}((0,\infty)^{2})=1 for any fixed t<t<\infty.

Step 2. Classify the limiting set S{\mathcal{S}}_{*}.

Recall the definition of (\mathscrsfsP(E2),dˉP)(\mathscrsfs{P}(E_{2}),{\bar{d}}{P}) at the beginning of Section 9. Since (\mathscrsfsP(E2),dˉP)(\mathscrsfs{P}(E_{2}),{\bar{d}}{P}) is a compact metric space, and (ρt)t0(\overline{\rho}_{t})_{t\geq 0} is a continuous curve in this space, then there exists a subsequence (tk)k1(t_{k})_{k\geq 1} of times, such that (ρtk)k1(\overline{\rho}_{t_{k}})_{k\geq 1} converges in metric dˉP{\bar{d}}{P} to a probability distribution ρ\mathscrsfsP(E2)\overline{\rho}_{*}\in\mathscrsfs{P}(E_{2}).

For any ρ0\mathscrsfsP\mboxgood\overline{\rho}_{0}\in\mathscrsfs{P}_{\mbox{\tiny\rm good}}, let S=S(ρ0){\mathcal{S}}_{*}={\mathcal{S}}_{*}(\overline{\rho}_{0}) be the set of limiting points of the PDE,

Analogous to the proof of Theorem 10, we have the following properties for S{\mathcal{S}}_{*}:

S{\mathcal{S}}_{*} is connected and compact.

For any ρS\overline{\rho}_{*}\in{\mathcal{S}}_{*}, ρ\overline{\rho}_{*} is a fixed point of PDE.

For any ρS\overline{\rho}_{*}\in{\mathcal{S}}_{*}, R(ρ)=R<1\overline{R}_{\infty}(\overline{\rho}_{*})=\overline{R}_{*}<1.

Recall the definition of λ+(ρ)\lambda_{+}(\overline{\rho}_{*}) and λ(ρ)\lambda_{-}(\overline{\rho}_{*}) given by Equation (9.14) and (9.15). Let ρ\overline{\rho}_{*} be a fixed point of PDE such that λ+(ρ)0,λ(ρ)0\lambda_{+}(\overline{\rho}_{*})\geq 0,\lambda_{-}(\overline{\rho}_{*})\geq 0 or λ+(ρ)0,λ(ρ)0\lambda_{+}(\overline{\rho}_{*})\leq 0,\lambda_{-}(\overline{\rho}_{*})\leq 0 but not both λ+(ρ)\lambda_{+}(\overline{\rho}_{*}) and λ(ρ)\lambda_{-}(\overline{\rho}_{*}) equal . In this case, according to Eq. (9.18), both r1ψ(r;ρ)\partial_{r_{1}}\psi_{\infty}({\bm{r}};\overline{\rho}_{*}) and r2ψ(r;ρ)\partial_{r_{2}}\psi_{\infty}({\bm{r}};\overline{\rho}_{*}) must be strictly positive or strictly negative. Since supp(ρ){rE2:rψ(r;ρ)=0}{\rm supp}(\overline{\rho}_{*})\subseteq\{{\bm{r}}\in E_{2}:\nabla_{\bm{r}}\psi_{\infty}({\bm{r}};\overline{\rho}_{*})={\mathbf{0}}\}, ρ\overline{\rho}_{*} must be a combination of two delta functions located at 0{\mathbf{0}} and \infty, i.e., ρ=a0δ0+(1a0)δ\overline{\rho}_{*}=a_{0}\delta_{\mathbf{0}}+(1-a_{0})\delta_{\infty}. But for a fixed point like this, it is easy to see that R(ρ)1\overline{R}_{\infty}(\overline{\rho}_{*})\geq 1. Such fixed points ρ\overline{\rho}_{*} cannot be one of the limiting points of the PDE since R(ρ0)<1\overline{R}_{\infty}(\overline{\rho}_{0})<1.

Since S{\mathcal{S}}_{*} is a connected set, L(S)L({\mathcal{S}}_{*}) should also be a connected set. Further notice that R(ρ)=1/2[λ+(ρ)2+λ(ρ)2]\overline{R}_{\infty}(\overline{\rho}_{*})=1/2\cdot[\lambda_{+}(\overline{\rho}_{*})^{2}+\lambda_{-}(\overline{\rho}_{*})^{2}], and R(ρ1)=R(ρ2)\overline{R}_{\infty}(\overline{\rho}_{1})=\overline{R}_{\infty}(\overline{\rho}_{2}) for any ρ1,ρ2S\overline{\rho}_{1},\overline{\rho}_{2}\in{\mathcal{S}}_{*}. Therefore, we can only have L(S)P2{(λ+,λ):λ+>0,λ<0}L({\mathcal{S}}_{*})\subseteq{\mathcal{P}}_{2}\equiv\{(\lambda_{+},\lambda_{-}):\lambda_{+}>0,\lambda_{-}<0\}, or L(S)P1{(λ+,λ):λ+<0,λ>0}L({\mathcal{S}}_{*})\subseteq{\mathcal{P}}_{1}\equiv\{(\lambda_{+},\lambda_{-}):\lambda_{+}<0,\lambda_{-}>0\}, or L(S)={(0,0)}L({\mathcal{S}}_{*})=\{(0,0)\}.

Step 3. Finish the proof using two claims.

Claim (1). If L(S)P1L({\mathcal{S}}_{*})\subseteq{\mathcal{P}}_{1}, then for any ρS\overline{\rho}_{*}\in{\mathcal{S}}_{*}, we have ρ((0,)×{0})=1\overline{\rho}_{*}((0,\infty)\times\{0\})=1.

Claim (2). We cannot have L(S)P2L({\mathcal{S}}_{*})\subseteq{\mathcal{P}}_{2}.

Here we assume these two claims holds, and use it to prove our results. For Δ<Δ\Delta<\Delta_{\infty}, we proved in Theorem 12 that, there is no fixed point such that L(ρ)=(0,0)L(\overline{\rho}_{*})=(0,0). Therefore, we cannot have L(S)={(0,0)}L({\mathcal{S}}_{*})=\{(0,0)\}. Due to Claim (2)(2), we cannot have L(S)P2L({\mathcal{S}}_{*})\subseteq{\mathcal{P}}_{2}. Hence, we must have L(S)P1L({\mathcal{S}}_{*})\subseteq{\mathcal{P}}_{1}. According to Theorem 12, for Δ<Δ\Delta<\Delta_{\infty}, the only fixed point of PDE with ρ((0,)×{0})=1\overline{\rho}_{*}((0,\infty)\times\{0\})=1 is a point mass at some location r=(r1,0){\bm{r}}_{*}=(r_{*1},0). Furthermore, this delta function fixed point is unique and is also the global minimizer of the risk. Therefore, we conclude that, for Δ<Δ\Delta<\Delta_{\infty}, the PDE will converge to this global minimizer.

For ΔΔ\Delta\geq\Delta_{\infty}, according to Claim (1), if ρ\overline{\rho}_{*} is a limiting point such that L(ρ)P1L(\overline{\rho}_{*})\in{\mathcal{P}}_{1}, then ρ((0,)×{0})=1\overline{\rho}_{*}((0,\infty)\times\{0\})=1. According to Theorem 12, a fixed point ρ\overline{\rho}_{*} with ρ((0,)×{0})=1\overline{\rho}_{*}((0,\infty)\times\{0\})=1 and L(ρ)(0,0)L(\overline{\rho}_{*})\neq(0,0) must be a point mass at some location r=(r1,0){\bm{r}}_{*}=(r_{*1},0), with L(ρ)P2L(\overline{\rho}_{*})\in{\mathcal{P}}_{2}. Therefore, we cannot have L(S)P1L({\mathcal{S}}_{*})\subseteq{\mathcal{P}}_{1}. Claim (2)(2) also tells us that we cannot have L(S)P2L({\mathcal{S}}_{*})\subseteq{\mathcal{P}}_{2}. Hence, we must have L(S)={(0,0)}L({\mathcal{S}}_{*})=\{(0,0)\}. In this case, all the points in the set S{\mathcal{S}}_{*} have risk . Therefore, we conclude that, as ΔΔ\Delta\geq\Delta_{\infty}, the PDE will converge to some limiting set with risk .

We are left with the task of proving the two claims above. Before that, we introduce some useful notions used in the proof. Define Z(r)Z({\bm{r}}) for rE2{\bm{r}}\in E_{2},

Define Zl(r)Z((r,lr))Z_{l}(r)\equiv Z((r,lr)) for r,l[0,]r,l\in[0,\infty]. Then we have

According to condition S4, for any fixed l[0,]l\in[0,\infty], Zl(r)Z_{l}(r) is increasing in rr.

Recall the formula of rψ(r;ρ)\nabla_{\bm{r}}\psi_{\infty}({\bm{r}};\overline{\rho}) given by Equation (9.18). Define

Proof of Claim (1)(1). If L(S)P1L({\mathcal{S}}_{*})\subseteq{\mathcal{P}}_{1}, then for any ρS\overline{\rho}_{*}\in{\mathcal{S}}_{*}, we have ρ((0,)×{0})=1\overline{\rho}_{*}((0,\infty)\times\{0\})=1.

Assume L(S)P1L({\mathcal{S}}_{*})\subseteq{\mathcal{P}}_{1}. There must exist t0t_{0} large enough, so that as tt0t\geq t_{0}, we have λ+(ρt)<0\lambda_{+}(\overline{\rho}_{t})<0, and λ(ρt)>0\lambda_{-}(\overline{\rho}_{t})>0. Therefore, we must have χtg(r;ρt)>0\chi_{\rm tg}({\bm{r}};\overline{\rho}_{t})>0 for any r(0,)2{\bm{r}}\in(0,\infty)^{2}. We denote

starting with r(t0)Γk{\bm{r}}(t_{0})\in\Gamma_{k} for some k(0,)k\in(0,\infty), we claim r(t)Γk{\bm{r}}(t)\in\Gamma_{k} for any tt0t\geq t_{0}. Indeed, for any rΓk{r:r2=kr1>0}{\bm{r}}\in\partial\Gamma_{k}\cap\{{\bm{r}}:r_{2}=kr_{1}>0\}, its normal vector pointing outside Γk\Gamma_{k} gives n(r)=(r2,r1)/r2{\bm{n}}({\bm{r}})=(-r_{2},r_{1})/\|{\bm{r}}\|_{2}, and hence rψ(r;ρ),n(r)=χtg(r;ρt)>0\langle\nabla_{\bm{r}}\psi_{\infty}({\bm{r}};\overline{\rho}),{\bm{n}}({\bm{r}})\rangle=\chi_{\rm tg}({\bm{r}};\overline{\rho}_{t})>0. Therefore, r(t){\bm{r}}(t) cannot leak outside Γk\Gamma_{k} from this boundary. Further note that r(t){\bm{r}}(t) cannot reach the boundary ([0,)×{0}){}([0,\infty)\times\{0\})\cup\{\infty\} for any finite time tt. This proves the claim that r(t)Γk{\bm{r}}(t)\in\Gamma_{k} for any tt0t\geq t_{0}.

According to Lemma 7.8, we have ρt(Γk)ρt0(Γk)\rho_{t}(\Gamma_{k})\geq\rho_{t_{0}}(\Gamma_{k}) for any k(0,)k\in(0,\infty). Furthermore, according to Lemma 7.9, ρt0((0,)2)=1\overline{\rho}_{t_{0}}((0,\infty)^{2})=1, hence limkρt0(Γk)=1\lim_{k\to\infty}\overline{\rho}_{t_{0}}(\Gamma_{k})=1. Therefore, for any ρS\overline{\rho}_{*}\in{\mathcal{S}}_{*}, we must have

Theorem 12 implies that for any such fixed point ρ\overline{\rho}_{*}, we have supp(ρ)([0,)×{0}){}{\rm supp}(\overline{\rho}_{*})\subseteq([0,\infty)\times\{0\})\cup\{\infty\}.

In this case, we claim L(S)P1{(λ+,λ):Z0(0)<λ+/λ<Z0()}L({\mathcal{S}}_{*})\subseteq{\mathcal{P}}_{1}\cap\{(\lambda_{+},\lambda_{-}):Z_{0}(0)<-\lambda_{+}/\lambda_{-}<Z_{0}(\infty)\}. Indeed, suppose there exists ρS\overline{\rho}_{*}\in{\mathcal{S}}_{*}, such that λ+(ρ)/λ(ρ)Z0()-\lambda_{+}(\overline{\rho}_{*})/\lambda_{-}(\overline{\rho}_{*})\geq Z_{0}(\infty) or λ(ρ)/λ(ρ)Z0(0)-\lambda_{-}(\overline{\rho}_{*})/\lambda_{-}(\overline{\rho}_{*})\leq Z_{0}(0), according to Equation (9.24), χnm((r,0);ρ)\chi_{\rm nm}((r,0);\overline{\rho}_{*}) must be strictly positive or strictly negative. However, we know supp(ρ){r:ψ(r;ρ)=0}{\rm supp}(\overline{\rho}_{*})\in\{{\bm{r}}:\nabla\psi_{\infty}({\bm{r}};\overline{\rho}_{*})={\mathbf{0}}\}. Hence, ρ\overline{\rho}_{*} should be a combination of two delta functions located at 0{\mathbf{0}} and \infty. Such fixed point ρ\overline{\rho}_{*} has risk R(ρ)1\overline{R}_{\infty}(\overline{\rho}_{*})\geq 1, hence ρ\overline{\rho}_{*} cannot be a limiting point of the PDE. Hence the claim holds.

Since S{\mathcal{S}}_{*} is a compact set, and LL is a continuous map, then L(S)L({\mathcal{S}}_{*}) is a compact set. Therefore, there must exist ε0>0{\varepsilon}_{0}>0, so that for any ρS\overline{\rho}_{*}\in{\mathcal{S}}_{*}, we have Z0(0)+3ε0<λ+(ρ)/λ(ρ)<Z0()3ε0Z_{0}(0)+3{\varepsilon}_{0}<-\lambda_{+}(\overline{\rho}_{*})/\lambda_{-}(\overline{\rho}_{*})<Z_{0}(\infty)-3{\varepsilon}_{0}. For this ε0>0{\varepsilon}_{0}>0, we take t0t_{0} large enough, so that for tt0t\geq t_{0}, we have Z0(0)+2ε0<λ+(ρt)/λ(ρt)<Z0()2ε0Z_{0}(0)+2{\varepsilon}_{0}<-\lambda_{+}(\overline{\rho}_{t})/\lambda_{-}(\overline{\rho}_{t})<Z_{0}(\infty)-2{\varepsilon}_{0}, and λ+(ρt)<0\lambda_{+}(\overline{\rho}_{t})<0, λ(ρt)>0\lambda_{-}(\overline{\rho}_{t})>0.

According to the conditions S0 - S4 on q(r)q(r), for any fixed ll, Zl(r)Z_{l}(r) is an increasing function of rr, and for any fixed rr, Zl(r)Z_{l}(r) is continuous in ll. Therefore, for the fixed ε0>0{\varepsilon}_{0}>0, there exists 0<r0<r<0<r_{0}<r_{\infty}<\infty and b>0b>0, such that

As a result, for any tt0t\geq t_{0}, we have

where Γ()\Gamma_{(\cdot)} is defined as in Equation (9.26).

According to Lemma 7.9, ρt0((0,)2)=1\overline{\rho}_{t_{0}}((0,\infty)^{2})=1. Define

We have OkO_{k} is increasing in kk, and kOk(0,)2\cup_{k}O_{k}\supset(0,\infty)^{2}. Hence limkρt0(Ok)=1\lim_{k\to\infty}\overline{\rho}_{t_{0}}(O_{k})=1. Now we fix a parameter kk.

Recall the formula for χnm\chi_{\rm nm} and χtg\chi_{\rm tg} given by Equation (9.24) and (9.25). It is easy to see that, there exists 0<uk1,uk2<0<u_{k1},u_{k2}<\infty depending on (b,k,τ+,τ,Z0(0),Z0(),ε0)(b,k,\tau_{+},\tau_{-},Z_{0}(0),Z_{0}(\infty),{\varepsilon}_{0}), such that for any r(0,)2{\bm{r}}\in(0,\infty)^{2} with br1r2kr1b\cdot r_{1}\leq r_{2}\leq k\cdot r_{1}, and tt0t\geq t_{0}, we have

Consider the following spiral curve rk(s)=(rk1(s),rk2(s)){\bm{r}}_{k}^{\infty}(s)=(r_{k1}^{\infty}(s),r_{k2}^{\infty}(s)), with

and another spiral curve rk0(s)=(rk10(s),rk20(s)){\bm{r}}_{k}^{0}(s)=(r_{k1}^{0}(s),r_{k2}^{0}(s)), with

for s[0,sk]s\in[0,s_{k*}] with sk=arctan(k)arctan(b)s_{k*}=\arctan(k)-\arctan(b).

Because of inequality (9.35), along the curve rk(s){\bm{r}}_{k}^{\infty}(s), denoting n(rk(s)){\bm{n}}({\bm{r}}_{k}^{\infty}(s)) to be its normal vector with [n(rk(s))]2>0[{\bm{n}}({\bm{r}}_{k}^{\infty}(s))]_{2}>0, we have for any tt0t\geq t_{0} and s[0,sk]s\in[0,s_{k*}],

Along the curve rk0(s){\bm{r}}_{k}^{0}(s), denoting n(rk0(s)){\bm{n}}({\bm{r}}_{k}^{0}(s)) to be its normal vector with [n(rk0(s))]2>0[{\bm{n}}({\bm{r}}_{k}^{0}(s))]_{2}>0, we have for any tt0t\geq t_{0} and s[0,sk]s\in[0,s_{k*}],

Consider the ODE (9.27) starting with r(t0)Ωk{\bm{r}}(t_{0})\in\Omega_{k} for any k{r,1/r0}k\geq\{r_{\infty},1/r_{0}\}, we claim r(t)Ωk{\bm{r}}(t)\in\Omega_{k} for any tt0t\geq t_{0}. Indeed, combining Eq. (9.31), (9.33), (9.39), and (9.38), for any rΩk(([0,)×{0}){}){\bm{r}}\in\partial\Omega_{k}\setminus(([0,\infty)\times\{0\})\cup\{\infty\}) and tt0t\geq t_{0}, the gradient ψ(r;ρt)\nabla\psi_{\infty}({\bm{r}};\overline{\rho}_{t}) pointing outside Ωk\Omega_{k}. Therefore, r(t){\bm{r}}(t) cannot leak outside Γk\Gamma_{k} from this boundary. Further note that r(t){\bm{r}}(t) cannot reach the boundary ([0,)×{0}){}([0,\infty)\times\{0\})\cup\{\infty\} for any finite time tt. This proves the claim that r(t)Ωk{\bm{r}}(t)\in\Omega_{k} for any tt0t\geq t_{0}. According to Lemma 7.8, ρt(Ωk)ρt0(Ωk)\overline{\rho}_{t}(\overline{\Omega}_{k})\geq\overline{\rho}_{t_{0}}(\overline{\Omega}_{k}) for any k{r,1/r0}k\geq\{r_{\infty},1/r_{0}\} and tt0t\geq t_{0}.

Recall the definition of OkO_{k} given by Equation (9.32). Note that OkΩkO_{k}\subseteq\Omega_{k}, and limkρt0(Ok)=1\lim_{k\to\infty}\overline{\rho}_{t_{0}}(\overline{O}_{k})=1, which implies limkρt0(Ωk)=1\lim_{k\to\infty}\overline{\rho}_{t_{0}}(\overline{\Omega}_{k})=1. Hence, for any ρS\overline{\rho}_{*}\in{\mathcal{S}}_{*},

It is easy to see that kΩk=(0,)×[0,)\cup_{k}\overline{\Omega}_{k}=(0,\infty)\times[0,\infty). Combining with the fact that ρ((0,)2)=0\overline{\rho}_{*}((0,\infty)^{2})=0 for any ρS\overline{\rho}_{*}\in{\mathcal{S}}_{*}, claim (1)(1) holds.

Proof of Claim (2). We cannot have L(S)P2L({\mathcal{S}}_{*})\subseteq{\mathcal{P}}_{2}.

In the case L(S)P2L({\mathcal{S}}_{*})\subseteq{\mathcal{P}}_{2}, the argument is similar to the proof of Claim (1), and hence will be presented in a synthetic form. First, there exists t0t_{0} large enough, so that as tt0t\geq t_{0}, we have λ+(ρt)>0\lambda_{+}(\overline{\rho}_{t})>0, and λ(ρt)<0\lambda_{-}(\overline{\rho}_{t})<0. Then χtg(r;ρt)<0\chi_{\rm tg}({\bm{r}};\overline{\rho}_{t})<0 for any r(0,)2{\bm{r}}\in(0,\infty)^{2}. Letting

According to the same argument as in the proof of Claim (1), we have ρt(Γk)ρt0(Γk)\rho_{t}(\Gamma_{k})\geq\rho_{t_{0}}(\Gamma_{k}) for any k(0,)k\in(0,\infty) and tt0t\geq t_{0}. As a result, we have supp(ρ)({0}×[0,)){}{\rm supp}(\overline{\rho}_{*})\subseteq(\{0\}\times[0,\infty))\cup\{\infty\}.

However, the fixed point ρ\overline{\rho}_{*} with support on ({0}×[0,)){}(\{0\}\times[0,\infty))\cup\{\infty\} has risk R(ρ)1\overline{R}_{\infty}(\overline{\rho}_{*})\geq 1. Therefore, we cannot have L(S)P2L({\mathcal{S}}_{*})\subseteq{\mathcal{P}}_{2}. This proves claim (2).

4 Dynamics: Proof of Theorem 2

We will prove that the dynamics for large but finite dd is well approximated by the dynamics at d=d=\infty. The key estimate is provided by the next lemma.

Assume σ\sigma satisfies condition S0, recall the definition of udu_{d} and uu_{\infty} given by Equation (9.9) and (9.10). Assuming k=γdk=\gamma\cdot d for some γ(0,1)\gamma\in(0,1), then we have

Define F3=F1cosΘ1+F2sinΘ1F_{3}=F_{1}\cos\Theta_{1}+F_{2}\sin\Theta_{1}, G3=G1cosΘ2+G2sinΘ2G_{3}=G_{1}\cos\Theta_{2}+G_{2}\sin\Theta_{2}, then

We have similar bounds for a2ud,1(a,b)a2u,1(a,b)|\partial_{a_{2}}u_{d,1}({\bm{a}},{\bm{b}})-\partial_{a_{2}}u_{\infty,1}({\bm{a}},{\bm{b}})|.

Note the relationship of Θ3=Θ3(a)\Theta_{3}=\Theta_{3}({\bm{a}}) with (Θ1,Θ2)(\Theta_{1},\Theta_{2}) is given by Eq. (9.51), which yields

The lemma holds by noting that as dd\to\infty, we have s0s_{0}\to\infty and ds0d-s_{0}\to\infty. ∎

Recall the definition of R\overline{R}_{\infty} given by Eq. (9.11), and RR given by Eq. (6.2). Recall the set of good initialization given by

Define \mathscrsfsP\mboxgood1\mathscrsfs{P}_{\mbox{\tiny\rm good}}^{1} and \mathscrsfsP\mboxgood2\mathscrsfs{P}_{\mbox{\tiny\rm good}}^{2} to be

With this definition, it is easy to see that \mathscrsfsP\mboxgood1=\mathscrsfsP\mboxgood\mathscrsfs{P}_{\mbox{\tiny\rm good}}^{1}=\mathscrsfs{P}_{\mbox{\tiny\rm good}}.

Here we bound d\mboxBL(ρ02,d,ρ02,)d_{\mbox{\tiny\rm BL}}(\overline{\rho}_{0}^{2,d},\overline{\rho}_{0}^{2,\infty}). Note the joint distribution of ud{\bm{u}}_{d} and u{\bm{u}}_{\infty} is a coupling of ρ02,d\overline{\rho}_{0}^{2,d} and ρ02,\overline{\rho}_{0}^{2,\infty}, hence

It is easy to see that limdY1/(Y1+Y2)=γ\lim_{d\to\infty}Y_{1}/(Y_{1}+Y_{2})=\gamma almost surely. Bounded convergence theorem implies that limdd\mboxBL(ρ02,d,ρ02,)=0\lim_{d\to\infty}d_{\mbox{\tiny\rm BL}}(\overline{\rho}_{0}^{2,d},\overline{\rho}_{0}^{2,\infty})=0.

Now we consider the PDE (9.16) for d=d=\infty. We fix its initialization ρ02,\mathscrsfsP\mboxgood2\overline{\rho}_{0}^{2,\infty}\in\mathscrsfs{P}_{\mbox{\tiny\rm good}}^{2} induced by ρ01\mathscrsfsP\mboxgood1\overline{\rho}_{0}^{1}\in\mathscrsfs{P}_{\mbox{\tiny\rm good}}^{1}. Denote the solution of PDE (9.16) to be (ρt)t0(\overline{\rho}_{t}^{\infty})_{t\geq 0}. Due to Theorem 13, for any η>0\eta>0, there exists T=T(η,ρ01,γ,Δ)>0T=T(\eta,\overline{\rho}_{0}^{1},\gamma,\Delta)>0, so that its solution (ρt)t0(\overline{\rho}_{t}^{\infty})_{t\geq 0} satisfies

Then we would like to bound the difference of R(ρ)\overline{R}_{\infty}(\overline{\rho}) and Rd(ρ)\overline{R}_{d}(\overline{\rho}) for any ρ\overline{\rho}. Note

By Lemma 9.1, there exists d0=d0(η,Δ)d_{0}=d_{0}(\eta,\Delta) large enough, so that for dd0d\geq d_{0}, we have

Finally, let (θk)k1({\bm{\theta}}^{k})_{k\geq 1} be the trajectory of SGD, with step size sk=εξ(kε)s_{k}={\varepsilon}\xi(k{\varepsilon}), and initialization wi0iidρ0{\bm{w}}_{i}^{0}\sim_{iid}\rho_{0} for iNi\leq N. We apply Theorem 3 to bound the difference of the law of trajectory of SGD and the solution of PDE (9.60). The assumptions of Theorem 3 are verified by Lemma 8.2. As a consequence, there exists constant KK (which depend uniquely on the constants in assumptions A1 A2 A3), such that

with probability 1ez21-e^{-z^{2}} for any t10Tt\leq 10T, where

As a consequence, for any δ>0\delta>0, there exists C0=C0(δ,η,ρ01,γ,Δ)C_{0}=C_{0}(\delta,\eta,\overline{\rho}_{0}^{1},\gamma,\Delta), so that as N,1/εC0dN,1/{\varepsilon}\geq C_{0}d and ε1/N10{\varepsilon}\geq 1/N^{10}, for t10Tt\leq 10T, we have

Therefore, the trajectory θt/ε{\bm{\theta}}^{\lfloor t/{\varepsilon}\rfloor} of SGD as t[T,10T]t\in[T,10T] satisfies

with probability at least 1δ1-\delta. This gives the desired result. ∎

Finite temperature

We will states the lemma regarding statics properties of the finite temperature free energy in Section 10.1, and regarding dynamics properties in Section 10.2. We will prove Proposition 3, Theorem 4, and Theorem 5 in Section 10.3. Throughout Section 10.1 and 10.2, to distinguish the dimension of parameters with the generalized differential operator, we will denote the dimension of parameters by dd instead of DD. This should not be confused with the dimension of feature vectors, which never appears throughout this section.

We introduce the set K{\mathcal{K}} of admissible probability densities,

Define Ω0={θ:1/(2πσ)dexp{θ22/(2σ2)}ρ(θ)1/21}\Omega_{0}=\{{\bm{\theta}}:1/(\sqrt{2\pi}\sigma)^{d}\cdot\exp\{-\|{\bm{\theta}}\|_{2}^{2}/(2\sigma^{2})\}\leq\rho({\bm{\theta}})^{1/2}\leq 1\}. Then we have

Noting that ρlogρρ|\rho\log\rho|\leq\sqrt{\rho} for any ρ\rho\in, the second term is bounded by

Assume UU and VV are bounded-Lipschitz. Then for any λ>0\lambda>0 and 0<β<0<\beta<\infty, Fβ,λ(ρ)F_{\beta,\lambda}(\rho) has a unique minimizer ρK\rho_{*}\in{\mathcal{K}}. Moreover, we have

Taking σ2=4/(βλ)\sigma^{2}=4/(\beta\lambda) gives Eq. (10.12) .

For any ρK\rho\in{\mathcal{K}}, we call the following equation the Boltzmann fixed point condition

Under the assumption of Lemma 10.2, the minimizer ρK\rho_{*}\in{\mathcal{K}} of Fβ,λ(ρ)F_{\beta,\lambda}(\rho) satisfies the Boltzmann fixed point condition.

for any ε<ε0{\varepsilon}<{\varepsilon}_{0}. As ε{\varepsilon} is sufficiently small, we have Fβ,λ(ρε)<Fβ,λ(ρ)F_{\beta,\lambda}(\rho_{\varepsilon})<F_{\beta,\lambda}(\rho_{*}). This contradict with the fact that ρK\rho_{*}\in{\mathcal{K}} is the minimizer of Fβ,λ(ρ)F_{\beta,\lambda}(\rho).

for some constant γ(β,λ;ρ)\gamma(\beta,\lambda;\rho_{*}).

Note we have ρ(θ)dθ=1\int\rho_{*}({\bm{\theta}}){\rm d}{\bm{\theta}}=1. Therefore, we must have γ(β,λ;ρ)=1/βlogZ(β,λ;ρ)\gamma(\beta,\lambda;\rho_{*})=-1/\beta\cdot\log Z(\beta,\lambda;\rho_{*}). This proves that ρ\rho_{*} satisfies the Boltzmann fixed point condition.

Under the assumption of Lemma 10.2, the Boltzmann fixed point condition has a unique solution in K{\mathcal{K}}.

The last two lemmas already imply that the Boltzmann fixed point condition has at least one solution. Assume ρ1,ρ2K\rho_{1},\rho_{2}\in K to be two such solutions. Then ρi\rho_{i} is positive, and

Note the right hand side does not equal unless ρ1=ρ2\rho_{1}=\rho_{2}.

Under the assumption of Lemma 10.2, and further assume condition A3 holds. Let ρβ,λ\rho_{*}^{\beta,\lambda} be the minimizer of Fβ,λ(ρ)F_{\beta,\lambda}(\rho). Then there is a constant KK depending on the parameter K3K_{3} in condition A3, such that for any β1\beta\geq 1, we have

Let G,G1,G2N(0,Id){\bm{G}},{\bm{G}}_{1},{\bm{G}}_{2}\sim{\sf N}({\mathbf{0}},{\bm{I}}_{d}) be independent, we have

Using the intermediate value theorem and Cauchy-Schwarz inequality, and noting that 2V\nabla^{2}V is K3K_{3}-bounded by condition A3, we have

We have similar bound for the UU term. Therefore,

Next we give a lower bound for Ent(ρgτ){\rm Ent}(\rho*g_{\tau}):

As a result, taking τ=1/β\tau=1/\beta, we have

2 Dynamics

Recall that the finite-temperature distributional dynamics reads:

Notice that this notion of weak solution is equivalent to the one introduced earlier in Eq. (7.3), see for instance [San15, Proposition 4.2].

Without loss of generality, we assume ξ(t)1/2\xi(t)\equiv 1/2.

We use the JKO scheme of [JKO98, Theorem 5.1] to show the existence, uniqueness, and absolute continuousness of solution of PDE (10.22). Since the proof is basically the same as the proof of [JKO98, Theorem 5.1], we will skip several details.

For any ρk1h\overline{\rho}_{k-1}^{h}, the optimization problem (10.24) has a unique minimizer ρkhK\overline{\rho}_{k}^{h}\in{\mathcal{K}}, where the proof is basically the same as Lemma 10.2, by additionally noting that W22(ρ,ρk1h)W_{2}^{2}(\rho,\overline{\rho}_{k-1}^{h}) as a function of ρ\rho is lower bounded, lower semi-continuous, and convex over ρK\rho\in{\mathcal{K}}.

In the following, we will show that this ρh\rho^{h} approximately satisfies PDE (10.23) in the weak form.

Since ρkh\overline{\rho}_{k}^{h} is the minimizer of optimization problem (10.24), we have for each τ>0\tau>0,

Using the result in the proof of [JKO98, Theorem 5.1], and noting V\nabla V is bounded Lipschitz, we have

We need to further calculate the derivative of U,ντ2\langle U,\nu_{\tau}^{\otimes 2}\rangle with respect to τ\tau. Note UU is symmetric, we have

The uniqueness of solution of Eq. (10.23) can be proved using standard method from theory of elliptic-parabolic equations (see, for instance, [JKO98, Theorem 5.1]). In the proof of uniqueness we need the smoothness property of the solution, which is proved by Lemma 10.7.

Before proving this lemma, we give some notations in the following.

For any nonnegative integer ll and 1p1\leq p\leq\infty, we denote Wpl(Ω)W_{p}^{l}(\Omega) to be the Banach space (Sobolev space) consisting of the elements of Lp(S)L^{p}(S) having generalized derivatives of all forms up to order ll included, that are pp’th power integrable on Ω\Omega. The norm in Wpl(Ω)W_{p}^{l}(\Omega) is defined by the equality

where α=(α1,,αd){\bm{\alpha}}=(\alpha_{1},\ldots,\alpha_{d}) is a multi-index with α=i=1dαi|{\bm{\alpha}}|=\sum_{i=1}^{d}\alpha_{i}, and Dθαu=αu/θ1α1θdαdD^{\bm{\alpha}}_{\bm{\theta}}u=\partial^{|{\bm{\alpha}}|}u/\partial\theta_{1}^{\alpha_{1}}\cdots\partial\theta_{d}^{\alpha_{d}}.

We say uLlocr,p(S)u\in L^{r,p}_{\rm loc}(S) if for any compact subset [t1,t2](t1,t2)[t_{1}^{\prime},t_{2}^{\prime}]\subset(t_{1},t_{2}) and compact subset ΩΩ\Omega^{\prime}\subset\Omega, we have uLr,p([t1,t2]×Ω)u\in L^{r,p}([t_{1}^{\prime},t_{2}^{\prime}]\times\Omega^{\prime}). We will denote Lp,p(S)L^{p,p}(S) by Lp(S)L^{p}(S), and Llocp,p(S)L^{p,p}_{\rm loc}(S) by Llocp(S)L^{p}_{\rm loc}(S).

For nonnegative integer ll and 1p1\leq p\leq\infty, we denote Wp2l,l(S)W_{p}^{2l,l}(S) to be the Banach space consisting of the elements of Lp(S)L^{p}(S) having generalized derivatives of the form DtrDθαD_{t}^{r}D_{\bm{\theta}}^{\bm{\alpha}} with rr and α{\bm{\alpha}} satisfying the inequality 2r+α2l2r+|{\bm{\alpha}}|\leq 2l. The corresponding norm is defined by

We denote Cm,n(S)C^{m,n}(S) to be the function space of continuous function with mm continuous derivative in time, and nn continuous derivatives in space. For example, uC1,2(S)u\in C^{1,2}(S) if and only if u,tu,θu,θ2uC0,0(S)C(S)u,\partial_{t}u,\nabla_{\bm{\theta}}u,\nabla_{\bm{\theta}}^{2}u\in C^{0,0}(S)\equiv C(S). We say uCcm,n(S)u\in C_{c}^{m,n}(S) if uCm,n(S)u\in C^{m,n}(S) and the support of uu is compact. We will denote Cn,n(S)C^{n,n}(S) by Cn(S)C^{n}(S), and Ccn,n(S)C^{n,n}_{c}(S) by Ccn(S)C^{n}_{c}(S).

We denote GG to be the heat kernel, where for t>0t>0, we have

The proof is similar to the one of [JKO98, Theorem 5.1], so we will skip some details. Without loss of generality we can set β=1\beta=1, and ξ(t)=1/2\xi(t)=1/2 (different choices can be obtained by rescaling Ψ(θ;ρ)\Psi({\bm{\theta}};\rho) and reparametrizing time).

Step 1. Show that ρLloc,p(E)\rho\in L^{\infty,p}_{\rm loc}({E}).

Taking GG to be the heat kernel, it is easy to see that

provided that the po,pip_{o},p_{i} satisfy the relations

Here, CC is a constant depends only on T,δT,\delta and on pi,pop_{i},p_{o}.

Define φ1ρ(ΔηΨ,η)1{t>ε}\varphi_{1}\equiv\rho(\Delta\eta-\langle\nabla\Psi,\nabla\eta\rangle){\bm{1}}\{t>{\varepsilon}\}, φ2ρ(2ηηΨ)1{t>ε}\varphi_{2}\equiv\rho(2\nabla\eta-\eta\nabla\Psi){\bm{1}}\{t>{\varepsilon}\}, and ψρ(ε)η\psi\equiv\rho({\varepsilon})\eta. Then Eq. (10.43) reads

where Ω1supp(η)Ω2\Omega_{1}\subseteq{\rm supp}(\eta)\subseteq\Omega_{2}. Therefore, ρLloc,po(E)\rho\in L_{\rm loc}^{\infty,p_{o}}({E}), where pi,pop_{i},p_{o} satisfy Eq. (10.46).

Note there exists a sequence pi,l,po,lp_{i,l},p_{o,l} for 1lk1\leq l\leq k and k<k<\infty, so that pi,l+1=po,lp_{i,l+1}=p_{o,l}, pi,1=p<d/(d1)p_{i,1}=p<d/(d-1), pi,k=p_{i,k}=\infty, and pi,l,po,lp_{i,l},p_{o,l} for fixed ll satisfies Eq. (10.46). Since we have ρLloc,p(E)\rho\in L^{\infty,p}_{\rm loc}({E}), using Eq. (10.50) iteratively, we have ρLloc,po,l(E)\rho\in L^{\infty,p_{o,l}}_{\rm loc}({E}) for any 1lk1\leq l\leq k. As a result, we have ρLloc(E)\rho\in L^{\infty}_{\rm loc}({E}).

Step 3. Derivatives, DρD\rho, D2ρD^{2}\rho, and D3ρD^{3}\rho.

where 1<p1<p\leq\infty and mm is a nonnegative integer.

Then we show the regularity of D2ρD^{2}\rho. Note that 2ΨLloc(E)\nabla^{2}\Psi\in L_{\rm loc}^{\infty}({E}), we have Dφ1,Dφ2L(E)D\varphi_{1},D\varphi_{2}\in L^{\infty}({E}). Due to Eq. (10.51), we have D3{φ12G},D3{φ22G}L(E)D^{3}\{\varphi_{1}*_{2}G\},D^{3}\{\varphi_{2}*_{2}G\}\in L^{\infty}({E}), which also implies D2{φ12G}Lloc(E)D^{2}\{\varphi_{1}*_{2}G\}\in L^{\infty}_{\rm loc}({E}). Hence we have D2(ρη)=D2{φ12G}+D3{φ22G}+D2[ψGε]L(S)D^{2}(\rho\eta)=D^{2}\{\varphi_{1}*_{2}G\}+D^{3}\{\varphi_{2}*_{2}G\}+D^{2}[\psi*G_{\varepsilon}]\in L^{\infty}({S}), which gives D2ρLloc(E)D^{2}\rho\in L^{\infty}_{\rm loc}({E}).

Next we show the regularity of D3ρD^{3}\rho. Note that 3ΨLloc(E)\nabla^{3}\Psi\in L_{\rm loc}^{\infty}({E}), we have D2φ1,D2φ2L(E)D^{2}\varphi_{1},D^{2}\varphi_{2}\in L^{\infty}({E}). Due to Eq. (10.51), we have D4{φ12G},D4{φ22G}L(E)D^{4}\{\varphi_{1}*_{2}G\},D^{4}\{\varphi_{2}*_{2}G\}\in L^{\infty}({E}), which also implies D3{φ12G}Lloc(E)D^{3}\{\varphi_{1}*_{2}G\}\in L^{\infty}_{\rm loc}({E}). Hence we have D3(ρη)=D3{φ12G}+D4{φ22G}+D3[ψG]L(S)D^{3}(\rho\eta)=D^{3}\{\varphi_{1}*_{2}G\}+D^{4}\{\varphi_{2}*_{2}G\}+D^{3}[\psi*G]\in L^{\infty}({S}), which gives D3ρLloc(E)D^{3}\rho\in L^{\infty}_{\rm loc}({E}).

Step 4. Derivatives, DtρD_{t}\rho, DtDρD_{t}D\rho, and DtD2ρD_{t}D^{2}\rho.

Now we study the regularity of Dtρ,DtDρ,DtD2ρD_{t}\rho,D_{t}D\rho,D_{t}D^{2}\rho. Note we have Dt(ρη)=Dt{φ12G}Dt{Dφ12G}+Dt{ψGε}D_{t}(\rho\eta)=D_{t}\{\varphi_{1}*_{2}G\}-D_{t}\{D\varphi_{1}*_{2}G\}+D_{t}\{\psi*G_{\varepsilon}\}. Due to Eq. (10.51), φ1,Dφ2L(E)\varphi_{1},D\varphi_{2}\in L^{\infty}({E}) implies that Dt{φ12G},Dt{Dφ12G}L(E)D_{t}\{\varphi_{1}*_{2}G\},D_{t}\{D\varphi_{1}*_{2}G\}\in L^{\infty}({E}) and hence Dt[ρη]L(S)D_{t}[\rho\eta]\in L^{\infty}({S}), DtρLloc(E)D_{t}\rho\in L^{\infty}_{\rm loc}({E}).

Note we have DtD(ρη)=Dt{Dφ12G}+Dt{D2φ12G}+Dt{DψGε}D_{t}D(\rho\eta)=D_{t}\{D\varphi_{1}*_{2}G\}+D_{t}\{D^{2}\varphi_{1}*_{2}G\}+D_{t}\{D\psi*G_{\varepsilon}\}. The fact that Dφ1,D2φ2L(E)D\varphi_{1},D^{2}\varphi_{2}\in L^{\infty}({E}) implies that Dt{Dφ12G},Dt{D2φ12G}L(E)D_{t}\{D\varphi_{1}*_{2}G\},D_{t}\{D^{2}\varphi_{1}*_{2}G\}\in L^{\infty}({E}) and hence DtDρLloc(E)D_{t}D\rho\in L^{\infty}_{\rm loc}({E}).

Note we have DtD2(ρη)=Dt{D2φ12G}Dt{D3φ12G}+Dt{D2ψGε}D_{t}D^{2}(\rho\eta)=D_{t}\{D^{2}\varphi_{1}*_{2}G\}-D_{t}\{D^{3}\varphi_{1}*_{2}G\}+D_{t}\{D^{2}\psi*G_{\varepsilon}\}. Note that 4ΨLloc(E)\nabla^{4}\Psi\in L_{\rm loc}^{\infty}({E}), hence D3φ2L(E)D^{3}\varphi_{2}\in L^{\infty}({E}). Combining with the fact that D2φ1L(E)D^{2}\varphi_{1}\in L^{\infty}({E}), we have Dt{D2φ12G},Dt{D3φ12G}L(E)D_{t}\{D^{2}\varphi_{1}*_{2}G\},D_{t}\{D^{3}\varphi_{1}*_{2}G\}\in L^{\infty}({E}) and hence DtD2ρLloc(E)D_{t}D^{2}\rho\in L^{\infty}_{\rm loc}({E}).

Finally we show the regularity of Dt2ρD_{t}^{2}\rho. We have Dt2(ρη)=Dt{Dt[φ12G]Dt[Dφ12G]+Dt[ψGε]}D_{t}^{2}(\rho\eta)=D_{t}\{D_{t}[\varphi_{1}*_{2}G]-D_{t}[D\varphi_{1}*_{2}G]+D_{t}[\psi*G_{\varepsilon}]\}, and

We say ρ\rho_{*} is a fixed point of PDE (10.22), if its solution (ρt)t0(\rho_{t})_{t\geq 0} starting from ρ\rho_{*} satisfies ρtρ\rho_{t}\equiv\rho_{*} for any t0t\geq 0.

Assume conditions A1 - A3 hold. Then any fixed point ρ\rho_{*} of PDE (10.22) with ρK\rho_{*}\in{\mathcal{K}} must satisfy the Boltzmann fixed point condition (10.13).

Suppose ρK\rho_{*}\in{\mathcal{K}} is a fixed point of PDE (10.22), taking W(θ)Ψλ(θ;ρ)W({\bm{\theta}})\equiv\Psi_{\lambda}({\bm{\theta}};\rho_{*}), then ρK\rho_{*}\in{\mathcal{K}} is a fixed point of the Fokker-Planck equation (10.54).

Since λ/2θ222K3Ψλ(θ;ρ)λ/2θ22+2K3\lambda/2\cdot\|{\bm{\theta}}\|_{2}^{2}-2K_{3}\leq\Psi_{\lambda}({\bm{\theta}};\rho_{*})\leq\lambda/2\cdot\|{\bm{\theta}}\|_{2}^{2}+2K_{3}, the Fokker-Planck equation has a unique fixed point [MV00], which solves

This is exactly the Boltzmann fixed point condition.

Assume conditions A1 - A4 hold. Let (ρt)t0(\rho_{t})_{t\geq 0} be the solution of PDE (10.22) for an initialization ρ0K\rho_{0}\in{\mathcal{K}}. Then the free energy Fβ,λ(ρt)F_{\beta,\lambda}(\rho_{t}) is differentiable with respect to tt, with

Therefore, Fβ,λ(ρt)F_{\beta,\lambda}(\rho_{t}) is non-increasing in tt.

Calculate the differential of the free energy along the curve ρt\rho_{t}, we have

Assume K0θ22K1Φ(θ)K0θ22+K1K_{0}\|{\bm{\theta}}\|_{2}^{2}-K_{1}\leq\Phi({\bm{\theta}})\leq K_{0}\|{\bm{\theta}}\|_{2}^{2}+K_{1} for some positive constant K0,K1K_{0},K_{1}. Define

First we show that the measure μ\mu_{*} satisfies the Poincare inequality: for any fDf\in{{\mathcal{D}}},

Note we have the Poincare inequality for the Gaussian distribution μ\mu,

for any differentiable ff. Therefore, we have

This proves the Poincare inequality (10.58) for μ\mu_{*}.

Assume conditions A1 - A4 hold. Then the solution (ρt)t0(\rho_{t})_{t\geq 0} of PDE (10.22) for any initialization ρ0K\rho_{0}\in{\mathcal{K}} converges weakly to ρK\rho_{*}\in{\mathcal{K}} as tt\to\infty, where ρ\rho_{*} is the unique solution of the Boltzmann fixed point condition, which is the global minimizer of Fβ,λF_{\beta,\lambda}.

According to Lemma 10.10, Fβ,λF_{\beta,\lambda} is non-increasing along the solution path. According to Lemma 10.2, Fβ,λ(ρt)F_{\beta,\lambda}(\rho_{t}) is lower bounded. Therefore, we have

According to condition A3, θU\nabla_{\bm{\theta}}U is K3K_{3}-bounded-Lipschitz with respect to (θ,θ)({\bm{\theta}},{\bm{\theta}}^{\prime}). Therefore,

as d\mboxBL(ρt,ρ)0d_{\mbox{\tiny\rm BL}}(\rho_{t},\rho_{*})\to 0. Accordingly, we have

Combining Eq. (10.63) with Eq. (10.61), we have

3 Proof of Proposition 3, Theorem 4, and Theorem 5

Proposition 3 is given by Lemma 10.6, 10.4, and Lemma 10.9. Theorem 4 is given by Lemma 10.2, 10.4, 10.5, and 10.12.

Now we prove Theorem 5. First, according to Lemma 10.5, for any η>0\eta>0, there exists constant KK depending on η,K0,K1,K2,K3\eta,K_{0},K_{1},K_{2},K_{3}, such that as we take βKD\beta\geq KD, we have

According to Lemma 10.12, we have ρt\rho_{t} converges to ρβ,λ\rho_{*}^{\beta,\lambda} weakly. Therefore, there exists T=T(η,V,U,{Ki},D,λ,β)<T=T(\eta,V,U,\{K_{i}\},D,\lambda,\beta)<\infty, so that d\mboxBL(ρt,ρβ,λ)η/(3Z)d_{\mbox{\tiny\rm BL}}(\rho_{t},\rho_{*}^{\beta,\lambda})\leq\eta/(3Z) for any tTt\geq T, where Z=Z({Ki})Z=Z(\{K_{i}\}) is the bounded-Lipschitz constant of RR with respect to ρ\rho. Hence, we have

Finally, according to Theorem 3, there exists KK^{\prime} depending on KiK_{i}’s, so that for all k10T/εk\leq 10T/{\varepsilon}, we have

with probability at least 1ez21-e^{-z^{2}}. Hence there exists C0=C0(η,{Ki},δ)C_{0}=C_{0}(\eta,\{K_{i}\},\delta), so that as N,1/εC0exp{C0T}DN,1/{\varepsilon}\geq C_{0}\exp\{C_{0}T\}D and ε1/N10{\varepsilon}\geq 1/N^{10}, we have

Combining Eq. (10.67), (10.68), and (10.69) we get the desired result.

4 Dependence of convergence time on D𝐷D and η𝜂\eta

Theorem 5 does not provide any estimate for the dependence of the convergence time on the problem dimensions DD and on the accuracy η\eta. However the proof suggests the following heuristic. When ρt\rho_{t} is sufficiently close to the minimizer ρ\rho_{*}, we heuristically can approximate the free energy dissipation formula (10.2) as

This is the same as the free energy dissipation for the Fokker-Planck equation with potential Ψλ(θ;ρ)\Psi_{\lambda}({\bm{\theta}};\rho_{*}). This suggests that, close to ρ\rho_{*}, convergence should be dominated by the speed of convergence in this Fokker-Plank equation, which is controlled by the log-Sobolev constant of the potential Ψλ(θ;ρ)\Psi_{\lambda}({\bm{\theta}};\rho_{*}), to be denote by cc_{*} [MV00]:

Note that the log-Sobolev constant can be exponentially small in DD. We expect this heuristic to capture the rough dependence of the convergence time TT on η\eta and DD, hence suggesting T=eO(D)log(1/η)T=e^{O(D)}\log(1/\eta).

Numerical Experiments

In this section, we discuss numerical experiments whose results were presented in the main text, as well as some additional ones. Some technical details of the figures in the main text are also presented here; in particular, Section 11.1.1 for Figure 1, Section 11.1.2 for Figure 2, Section 11.2 for Figure 3, and Section 11.3 for Figure 4.

In this section, we present details of the numerical experiments pertaining to the example of centered isotropic Gaussians:

With probability 1/21/2: y=+1y=+1, xN(0,(1+Δ)2Id){\bm{x}}\sim{\sf N}(0,(1+\Delta)^{2}{\bm{I}}_{d}).

With probability 1/21/2: y=1y=-1, xN(0,(1+Δ)2Id){\bm{x}}\sim{\sf N}(0,(1+\Delta)^{2}{\bm{I}}_{d}).

In all numerical examples in this section, we use the activation σ(x;θi)=σ(wi,x)\sigma_{*}({\bm{x}};{\bm{\theta}}_{i})=\sigma(\langle{\bm{w}}_{i},{\bm{x}}\rangle), where σ(t)=s1\sigma(t)=s_{1} if tt1t\leq t_{1}, σ(t)=s2\sigma(t)=s_{2} if tt2t\geq t_{2}, and σ(t)\sigma(t) interpolated linearly for t(t1,t2)t\in(t_{1},t_{2}). In simulations we use t1=0.5t_{1}=0.5, t2=1.5t_{2}=1.5, s1=2.5s_{1}=-2.5, s2=7.5s_{2}=7.5. This is also used for examples with centered Gaussians in the main text, cf. Figures 1 and 2, and Section 8 in the supplemental information. This activation is plotted in Figure 11.1.

Here we discuss empirical validation for the dynamics in the isotropic Gaussian example.

PDE simulation. Simulating the PDE (Eq. of the main text) for general dd is computationally intensive. In order to simplify the problem, we only consider d=d=\infty. In that case, we recall that the risk is given by Eq. (8.10), which we copy here for ease of reference:

The PDE is then tρt=2ξ(t)r[ρtrψ(r;ρt)]\partial_{t}\overline{\rho}_{t}=2\xi(t)\partial_{r}[\overline{\rho}_{t}\partial_{r}\psi_{\infty}(r;\overline{\rho}_{t})].

The solution to the PDE is approximated, at all time tt, by the following multiple-deltas ansatz:

Under this ansatz, let us write R(ρt)=R,J(r(t))\overline{R}_{\infty}(\overline{\rho}_{t})=\overline{R}_{\infty,J}(\mathbf{r}(t)), where r(t)=(r1(t),...,rJ(t))\mathbf{r}(t)=(r_{1}(t),...,r_{J}(t))^{\top}, and

Notice that rψ(ri(t);ρt)=(J/2)(R,J(r(t)))i\partial_{r}\psi_{\infty}(r_{i}(t);\overline{\rho}_{t})=(J/2)(\nabla\overline{R}_{\infty,J}(\mathbf{r}(t)))_{i}. Therefore we obtain

Hence under the multiple-deltas ansatz, one can simulate numerically the PDE via the above evolution equation of r(t)\mathbf{r}(t). In particular, given r(t)\mathbf{r}(t), one approximates r(t+δt)\mathbf{r}(t+\delta t) for some small displacement δt\delta t by

In general, one would want to take a large JJ to obtain a more accurate approximation. There are certain cases where one can take small JJ (even J=1J=1). An example of such case is given in the following.

Details of Figure 1 of the main text. For the data generation, we set Δ=0.8\Delta=0.8. For the SGD simulation, we take d=40d=40, N=800N=800, with ε=106{\varepsilon}=10^{-6} and ξ(t)=1\xi(t)=1. The weights are initialized as (wi)iNiidN(0,0.82/dId)({\bm{w}}_{i})_{i\leq N}\sim_{iid}\mathsf{N}(0,0.8^{2}/d\cdot\mathbf{I}_{d}). We take a single SGD run. At iteration 103,4×106,10710^{3},4\times 10^{6},10^{7}, we plot the histogram of (wi2)iN(\|{\bm{w}}_{i}\|_{2})_{i\leq N}. This produces the results of the SGD in Figure 1 of the main text.

To obtain results from the PDE, we take J=400J=400, and generate ri(0)=Zi2r_{i}(0)=\|Z_{i}\|_{2}, where (Zi)iJiidN(0,0.82/dId)(Z_{i})_{i\leq J}\sim_{iid}\mathsf{N}(0,0.8^{2}/d\cdot\mathbf{I}_{d}). We obtain r(t)\mathbf{r}(t) from t=0t=0 until t=107εt=10^{7}{\varepsilon}, by discretizing this interval with 10510^{5} points equally spaced on the log10\log_{10} scale and sequentially computing r(t)\mathbf{r}(t) at each point using Eq. (11.8). Note that the SGD result at iteration kk corresponds to r(εk)\mathbf{r}({\varepsilon}k). We re-simulate the PDE for 100 times, each with an independently generated initialization. The obtained histogram for the PDE, as shown in the figure, is the aggregation of these 100 runs.

Further numerical simulations. Figure 11.2 plots the evolution of ρt\overline{\rho}_{t} for Δ=0.2\Delta=0.2. The setting is identical to the one in Figure 1 of the main text, described in the previous paragraphs.

In Figure 11.3, we plot the evolution of the population risk for the SGD and its PDE prediction counterpart, for Δ=0.2\Delta=0.2 and Δ=0.8\Delta=0.8. The setting for the SGD plots is the same as described in the previous paragraphs. We compute the risk attained by the SGD by Monte Carlo averaging over 10410^{4} samples. The setting for the PDE plots tagged “J=400J=400” is almost the same as in the previous paragraphs, except that we take only 1 run. For the PDE plot tagged “J=1J=1”, we take J=1J=1 and r(0)=0.8r(0)=0.8 instead. In the inset plot, we also show the evolution of (1/N)i=1Nwi2(1/N)\sum_{i=1}^{N}\|{\bm{w}}_{i}\|_{2} of the SGD, and (1/J)i=1Jri(t)(1/J)\sum_{i=1}^{J}r_{i}(t) of the PDE.

In Figure 11.4, we plot the function Rd(1)(r)\overline{R}^{(1)}_{d}(r), for d=40d=40 and Δ=0.2\Delta=0.2. (Recall Rd(1)(r)\overline{R}^{(1)}_{d}(r) from Eq. of the main text, and see also Section 11.1.3.) On this landscape, we also plot the evolution of the corresponding SGD and PDE, as described in the last paragraph.

Comments. We observe in Figure 11.3 a good match between the SGD and the PDE, even when J=1J=1, for Δ=0.2\Delta=0.2. This can be explained with our theory, which predicts that at Δ=0.2\Delta=0.2, the minimum risk is achieved by the uniform distribution over a sphere of radius w2=r\|{\bm{w}}\|_{2}=r_{*} (see also Section 11.1.3). This corresponds to ρt\overline{\rho}_{t}, as tt\to\infty, being a delta function and placing probability 1 at rr_{*}. Furthermore due to the way we initialize the SGD, ρ0\overline{\rho}_{0} is well concentrated. One can then expect that ρt\overline{\rho}_{t} is also well concentrated at all time tt, in which case J=1J=1 is sufficient. This claim is reflected in our numerical experiments, shown in Figure 11.2.

We also observe in Figure 11.3 that the case Δ=0.2\Delta=0.2 has a rapid transition from a high risk to a lower risk, unlike the case Δ=0.8\Delta=0.8. This is also expected from our theory. As said above, ρt\overline{\rho}_{t} is approximately a delta function at all time tt, and the position r(t)r(t) evolves by gradient flow in the landscape of Rd(1)(r)\overline{R}^{(1)}_{d}(r). This latter claim is well supported by Figure 11.4. As observed in Figure 11.4, Rd(1)(r)\overline{R}^{(1)}_{d}(r) is rather benign, and hence the transition of the population risk should be smooth. However the case for Δ=0.8\Delta=0.8 is different: ρt\overline{\rho}_{t} is not concentrating at large tt, as evident in Figure 1 of the main text, even though Rd(1)(r)\overline{R}^{(1)}_{d}(r) is generally benign for a vast variety of values of dd and Δ\Delta (see Figure 11.6 and Section 11.1.3).

Note that the computation of the PDE assumes d=d=\infty. Furthermore it also requires N=N=\infty (recalling Theorem 3 of the main text). The discrepancy to the SGD is due to the fact that dd and NN are finite in the SGD simulations. Nevertheless in our numerical examples, such discrepancy is insignificant.

1.2 Empirical validation of the statics

Here we discuss numerical verification for the statics in the isotropic Gaussian example.

Optimizing Rd(ρ)\overline{R}_{d}(\overline{\rho}). For the chosen activation, we have from Eq. (8.8) that

where σsl=(s2s1)/(t2t1)\sigma_{\rm sl}=(s_{2}-s_{1})/(t_{2}-t_{1}), σitc=s1σslt1\sigma_{\rm itc}=s_{1}-\sigma_{\rm sl}t_{1}, ϕ(x)=exp(x2/2)/2π\phi(x)=\exp(-x^{2}/2)/\sqrt{2\pi}, Φ(x)=xϕ(t)dt\Phi(x)=\int_{-\infty}^{x}\phi(t){\rm d}t, and Γ\Gamma is the Gamma function. To numerically optimize Rd(ρ)\overline{R}_{d}(\overline{\rho}), we perform the following approximation:

which is a quadratic programming problem and can be solved numerically. Here v\mathbf{v} can be computed easily with the explicit formula, and the computation of U\mathbf{U} amounts to numerically evaluating double integrals. In the case d=d=\infty, the computation of U\mathbf{U} is much easier, since

Details of Figure 2 of the main text. For the SGD simulation, we take N=800N=800, with ε=3×103{\varepsilon}=3\times 10^{-3} and ξ(t)=t1/4\xi(t)=t^{-1/4}. The weights are initialized as (wi)iNiidN(0,0.42/dId)({\bm{w}}_{i})_{i\leq N}\sim_{iid}\mathsf{N}(0,0.4^{2}/d\cdot\mathbf{I}_{d}). We compute the risk attained by the SGD by Monte Carlo averaging over 10410^{4} samples. We take a single SGD run per Δ\Delta, per dd, and report the risk at iteration 10710^{7}.

For the approximate optimization of Rd(ρ)\overline{R}_{d}(\overline{\rho}), we choose K=100K=100, and oio_{i}, i=1,...,Ki=1,...,K, being equally spaced on the interval [0.01, 10].

For the optimization of Rd(1)(r)\overline{R}^{(1)}_{d}(r) (recalling Eq. in the main text), we approximate it with mini=1,...,KRd(1)(oi)\min_{i=1,...,K}\overline{R}^{(1)}_{d}(o_{i}), for the above chosen oio_{i} and KK.

We find that in general, one needs higher maxi=1,...,Koi\max_{i=1,...,K}o_{i} to produce accurate results for higher Δ\Delta. For the chosen set of oio_{i}’s, we choose to plot up until Δ=0.8\Delta=0.8.

Further numerical simulations. In Figure 11.5, we extend Figure 2 of the main text to include results for additional values of dd. The setting remains the same.

This figure provides further support to the respective discussion in the main text. For the threshold values of Δ\Delta for which the minimum risk is achieved by a uniform distribution ρrunif\rho^{\rm unif}_{r_{*}} over a sphere of radius w2=r\|{\bm{w}}\|_{2}=r_{*} (see the main text around Eq. , and Section 11.1.3).

1.3 Checking the condition of Lemma 1 in the main text

We check of the condition of Lemma 1 in the main text. This has two steps: (1) we solve for the minimizer rr_{*} of Rd(1)(r)=1+2v(r)+ud(r,r)\overline{R}_{d}^{(1)}(r)=1+2v(r)+u_{d}(r,r), where v(r)v(r) and ud(r1,r2)u_{d}(r_{1},r_{2}) are given by Eq. (11.10) and (11.11) respectively, and (2) we check whether v(r)+ud(r,r)v(r)+ud(r,r)v(r)+u_{d}(r,r_{*})\geq v(r_{*})+u_{d}(r_{*},r_{*}) for all r0r\geq 0. Figure 11.6 suggests that the behavior of Rd(1)(r)\overline{R}_{d}^{(1)}(r) is rather benign and hence rr_{*} can be solved easily by searching for a local minimum. For the second step, we check the condition on a grid of values of rr from 0.1 to 10 with a spacing of 0.1, for each value of Δ\Delta on a grid from 0.01 to 0.99 with a spacing of 0.01. In general, we find that the conditioned is satisfied for Δ[Δdl,Δdh]\Delta\in[\Delta_{d}^{\rm l},\Delta_{d}^{\rm h}]. Table 1 reports Δdl\Delta_{d}^{\rm l} and Δdh\Delta_{d}^{\rm h} for a number of values of dd for the isotropic Gaussians example with the given activation function.

2 Centered anisotropic Gaussians with ReLU Activation

With probability 1/21/2: y=+1y=+1, xN(0,Σ+){\bm{x}}\sim{\sf N}(0,\mathbf{\Sigma}_{+}).

With probability 1/21/2: y=1y=-1, xN(0,Σ){\bm{x}}\sim{\sf N}(0,\mathbf{\Sigma}_{-}).

This setting is used in Figure 3 in the main text.

We consider s0=γds_{0}=\gamma d for some γ(0,1)\gamma\in(0,1). For simplicity, we consider the limit dd\to\infty. For θρ{\bm{\theta}}\sim\rho, let ρ\overline{\rho} be the joint distribution of four parameters r=(a,b,r1=w1:s02,r2=w(s0+1):d2)\mathbf{r}=(a,b,r_{1}=\|{\bm{w}}_{1:s_{0}}\|_{2},r_{2}=\|{\bm{w}}_{(s_{0}+1):d}\|_{2}), where wi:j=(wi,...,wj){\bm{w}}_{i:j}=(w_{i},...,w_{j})^{\top}. Using a similar argument to Section 8, we have, in the limit dd\to\infty, the risk R(ρ)=R(ρ)R(\rho)=\overline{R}_{\infty}(\overline{\rho}) for

where ϕ(x)=exp(x2/2)/2π\phi(x)=\exp(-x^{2}/2)/\sqrt{2\pi} and Φ(x)=xϕ(t)dt\Phi(x)=\int_{-\infty}^{x}\phi(t){\rm d}t. Furthermore, letting ρt\overline{\rho}_{t} denote the corresponding distribution at time tt, the PDE in the main text can be reduced to the following PDE of ρt\overline{\rho}_{t}:

PDE simulation. As in Section 11.1.1, we posit that the solution to the PDE can be approximated, at all time tt, by the multiple-deltas ansatz:

for i=1,...,Ji=1,...,J, where R,J(r1(t),,rJ(t))=R(ρt)\overline{R}_{\infty,J}(\mathbf{r}_{1}(t),\dots,\mathbf{r}_{J}(t))=\overline{R}_{\infty}(\overline{\rho}_{t}) under the ansatz, and i\nabla_{i} denotes the gradient of R,J(r1,...,rJ)\overline{R}_{\infty,J}(\mathbf{r}_{1},...,\mathbf{r}_{J}) w.r.t. ri\mathbf{r}_{i}. More explicitly,

Again, given ri(t)\mathbf{r}_{i}(t), one approximates ri(t+δt)\mathbf{r}_{i}(t+\delta t) for some small displacement δt\delta t by

Details of Figure 3 of the main text. For the SGD simulation, we take d=320d=320, s0=60s_{0}=60, N=800N=800, with ε=2×104{\varepsilon}=2\times 10^{-4} and ξ(t)=t1/4\xi(t)=t^{-1/4}. The weights are initialized as (wi)iNiidN(0,0.82/dId)({\bm{w}}_{i})_{i\leq N}\sim_{iid}\mathsf{N}(0,0.8^{2}/d\cdot\mathbf{I}_{d}), (ai)iN=1(a_{i})_{i\leq N}=1 and (bi)iN=1(b_{i})_{i\leq N}=1. We take a single SGD run. We compute the risk attained by the SGD by Monte Carlo averaging over 10410^{4} samples.

To produce the inset plot in Figure 3 of the main text, for the “aa (mean)” axis, we compute 1Ni=1Nai\frac{1}{N}\sum_{i=1}^{N}a_{i} for the SGD and 1Ji=1Jai(t)\frac{1}{J}\sum_{i=1}^{J}a_{i}(t) for the PDE. Similarly, for the “bb (mean)” axis, we compute 1Ni=1Nbi\frac{1}{N}\sum_{i=1}^{N}b_{i} for the SGD and 1Ji=1Jbi(t)\frac{1}{J}\sum_{i=1}^{J}b_{i}(t) for the PDE, and for the “r1r_{1} (mean)” axis, we compute 1Ni=1Nwi,1:s02\frac{1}{N}\sum_{i=1}^{N}\|{\bm{w}}_{i,1:s_{0}}\|_{2} for the SGD and 1Ji=1Jr1,i(t)\frac{1}{J}\sum_{i=1}^{J}r_{1,i}(t) for the PDE.

Further numerical simulations. In Figure 11.7, we plot the evolution of the four parameters, for the same setting as Figure 3 of the main text. Here “aa (mean)”, “bb (mean)” and “r1r_{1} (mean)” hold the same meanings, and “r2r_{2} (mean)” refers to 1Ni=1Nwi,(s0+1):d2\frac{1}{N}\sum_{i=1}^{N}\|{\bm{w}}_{i,(s_{0}+1):d}\|_{2} for the SGD and 1Ji=1Jr2,i(t)\frac{1}{J}\sum_{i=1}^{J}r_{2,i}(t) for the PDE.

In Figure 11.8, we plot the population risk’s evolution for the same setting as Figure 3 of the main text, apart from that Δ=0.6\Delta=0.6 and s0s_{0} varies.

Comments. We observe a good match between the SGD and the PDE in Figure 3 of the main text as well as Figure 11.7, up until iteration 10610^{6}. In general there is less discrepancy with larger s0s_{0}, dd and NN, recalling that the PDE is computed assuming infinite s0s_{0}, dd and NN. This is evident from Figure 11.8.

As a note, in Figure 11.8, the PDE evolves differently for different s0s_{0}. This is because the ratio s0/ds_{0}/d is used to determine the initialization of the PDE.

3 Isotropic Gaussians: Predictable Failure of SGD

In this section, we consider the isotropic Gaussians example (see Section 11.1 for the setting and notations), with the following activation function: σ(x;θi)=σ(wi,x)\sigma_{*}({\bm{x}};{\bm{\theta}}_{i})=\sigma(\langle{\bm{w}}_{i},{\bm{x}}\rangle), where σ(t)=2.5\sigma(t)=-2.5 for t0t\leq 0, σ(t)=7.5\sigma(t)=7.5 for t1.5t\geq 1.5, and σ(t)\sigma(t) linearly interpolates from the knot (0,2.5)(0,-2.5) to (0.5,4)(0.5,-4), and from (0.5,4)(0.5,-4) to (1.5,7.5)(1.5,7.5). This activation is plotted in Figure 11.1. This corresponds to Section “Predicting failure” and Figure 4 in the main text. The simulation of the PDE can be done in the same way as in Section 11.1.1.

Rationale of the activation choice. We give an explanation for the choice of the above activation based on our theory. We aim to find an activation σ(x;θi)=σ(wi,x)\sigma_{*}({\bm{x}};{\bm{\theta}}_{i})=\sigma(\langle{\bm{w}}_{i},{\bm{x}}\rangle) in which there exists a local minimum that does not generalize well. To simplify the task, we wish for such minimum to be attained at ρ=δ0\rho_{*}=\delta_{\mathbf{0}}. This minimum does not generalize well, since it implies all the weights are zero and the neuron outputs are constant, rendering the network unable to perform classification. Theorem 6 of the main text suggests taking σ(t)\sigma(t) such that

In the isotropic Gaussians case, this becomes

(Note that the condition V(0)+1U(0,0)=0\nabla V({\mathbf{0}})+\nabla_{1}U({\mathbf{0}},{\mathbf{0}})={\mathbf{0}} in Theorem 6 of the main text is trivially satisfied.) Another requirement is that there should still be a minimum whose risk is nearly zero. Hence we do not wish for a dramatic change in the choice of the activation function, as compared to the one used in Section 11.1. That is, we leave σ(0)<0\sigma(0)<0 unchanged. Hence we would want σ(0)<0\sigma^{\prime\prime}(0)<0, which is accomplished by our aforementioned choice.

Note that Theorem 6 of the main text also suggests that if the SGD is initialized sufficiently close to this local minimum, the SGD trajectory should converge to it.

Details of Figure 4 of the main text. For the data generation, we set Δ=0.5\Delta=0.5. For the SGD simulation, we take d=320d=320, N=800N=800, with ε=105{\varepsilon}=10^{-5} and ξ(t)=t1/4\xi(t)=t^{-1/4}. We take a single SGD run each for two different initializations: the weights are initialized as (wi)iNiidN(0,κ2/dId)({\bm{w}}_{i})_{i\leq N}\sim_{iid}\mathsf{N}(0,\kappa^{2}/d\cdot\mathbf{I}_{d}) for either κ=0.1\kappa=0.1 or κ=0.4\kappa=0.4. We compute the risk attained by the SGD by Monte Carlo averaging over 10410^{4} samples.

To obtain results from the PDE, we take J=400J=400, and generate ri(0)=Zi2r_{i}(0)=\|Z_{i}\|_{2}, where (Zi)iNiidN(0,κ2/dId)(Z_{i})_{i\leq N}\sim_{iid}\mathsf{N}(0,\kappa^{2}/d\cdot\mathbf{I}_{d}). We obtain r(t){\bm{r}}(t) from t=0t=0 until t=107εt=10^{7}{\varepsilon}, by discretizing this interval with 10510^{5} points equally spaced on the log10\log_{10} scale and sequentially computing r(t)\mathbf{r}(t) at each point using Eq. (11.8). Note that the SGD result at iteration kk corresponds to r(ε4/3k)\mathbf{r}({\varepsilon}^{4/3}k). We take a single run of the PDE.

To produce the inset plot, we compute 1Ni=1Nwi2\frac{1}{N}\sum_{i=1}^{N}\|{\bm{w}}_{i}\|_{2} for the SGD, and 1Ji=1Jri(t)\frac{1}{J}\sum_{i=1}^{J}r_{i}(t) for the PDE.

As observed from Figure 4 of the main text, the SGD trajectory with κ=0.1\kappa=0.1 converges to a point where wi2\|{\bm{w}}_{i}\|_{2} is nearly zero and the risk is very high, in stark contrast to the SGD trajectory with κ=0.4\kappa=0.4, as expected.

Appendix A Concentration inequalities

Let Zk=XkXk1{\bm{Z}}_{k}=\bm{X}_{k}-\bm{X}_{k-1} be the martingale differences. By the subgaussian condition (A.1), we get

Letting GN(0,Id){\bm{G}}\sim{\sf N}(0,{\bm{I}}_{d}) a standard Gaussian vector and ξ0\xi\geq 0,

By Markov inequality, setting ξ=1/(2nL2)\xi=1/(2nL^{2}), we get, for all t0t\geq 0,

Finally, to upper bound maxknXk2\max_{k\leq n}\|\bm{X}_{k}\|_{2}, we define the stopping time τmin{k:  Xk22Ln(d+t)}\tau\equiv\min\{k:\;\|\bm{X}_{k}\|_{2}\geq 2L\sqrt{n}(\sqrt{d}+t)\}, and the martingale Xk=Xkτ\overline{\bm{X}}_{k}=\bm{X}_{k\wedge\tau}. Since {maxknXn22Ln(d+t)}={Xn22Ln(d+t)}\{\max_{k\leq n}\|\bm{X}_{n}\|_{2}\geq 2L\sqrt{n}(\sqrt{d}+t)\}=\{\|\overline{\bm{X}}_{n}\|_{2}\geq 2L\sqrt{n}(\sqrt{d}+t)\}, the claim follows by applying the previous inequality to Xn\overline{\bm{X}}_{n}. ∎

Appendix B On the generalization to other loss functions

First of all, we note that the population risk reads

The corresponding distributional dynamics is formally identical to the one for quadratic loss, cf. Eq. (7.1). The only change is in the definition of Ψ(θ;ρ)\Psi({\bm{\theta}};\rho):