A Mean Field View of the Landscape of Two-Layers Neural Networks

Song Mei, Andrea Montanari, Phan-Minh Nguyen

stat.ML cond-mat.stat-mech cs.LG math.ST

Introduction

Multi-layer neural networks are one of the oldest approaches to statistical machine learning, dating back at least to the 1960’s [Ros62]. Over the last ten years, under the impulse of increasing computer power and larger data availability, they have emerged as a powerful tool for a wide variety of learning tasks [KSH12, GBCB16].

In practice, the parameters of neural networks are learned by stochastic gradient descent [RM51] (SGD) or its variants. In the present case, this amounts to the iteration

Here ${\bm{\theta}}^{k}=({\bm{\theta}}^{k}_{i})_{i\leq N}$ denotes the parameters after $k$ iterations, $s_{k}$ is a step size, and $({\bm{x}}_{k},y_{k})$ is the $k$ -th example. Throughout the paper, we make the following assumption:

In large-scale applications, this is not far from the truth: the data set is so large that each example is visited at most a few times [Bot10]. Further, theoretical guarantees suggest that there is limited advantage to be gained from multiple passes [SSBD14]. For recent work deriving scaling limits under such an assumption (in different problems) see [WML17].
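As a rough illustration of this one-pass setting, the sketch below runs SGD on a two-layer network and draws a fresh example at every step. It assumes the mean-field form $\hat{y}({\bm{x}};{\bm{\theta}})=(1/N)\sum_{i\leq N}\sigma_{*}({\bm{x}};{\bm{\theta}}_{i})$ with the square loss, so that unit $i$ moves by $2s_{k}(y_{k}-\hat{y})\nabla_{{\bm{\theta}}_{i}}\sigma_{*}({\bm{x}}_{k};{\bm{\theta}}^{k}_{i})$. The toy data model, the $\tanh$ activation, the constant step size, and the names `sample_example`, `one_pass_sgd` and `estimate_risk` are illustrative assumptions, not the experimental setup of this paper.

```python
import numpy as np

def sample_example(rng, d):
    """Toy data model (an assumption): y = tanh(x_1), x standard Gaussian."""
    x = rng.standard_normal(d)
    return x, np.tanh(x[0])

def predict(W, x):
    """Mean-field two-layer network: average of N hidden units tanh(<w_i, x>)."""
    return np.mean(np.tanh(W @ x))

def one_pass_sgd(N=50, d=5, steps=3000, step_size=0.02, seed=0):
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((N, d)) / np.sqrt(d)   # theta_i^0 drawn i.i.d.
    for _ in range(steps):
        x, y = sample_example(rng, d)               # fresh example, never revisited
        pre = W @ x
        residual = y - np.mean(np.tanh(pre))
        # Gradient of tanh(<w_i, x>) with respect to w_i, one row per unit.
        grad_sigma = (1.0 - np.tanh(pre) ** 2)[:, None] * x[None, :]
        W += 2.0 * step_size * residual * grad_sigma
    return W

def estimate_risk(W, n_test=2000, seed=1):
    """Monte Carlo estimate of the population risk E[(y - yhat)^2]."""
    rng = np.random.default_rng(seed)
    errs = []
    for _ in range(n_test):
        x, y = sample_example(rng, W.shape[1])
        errs.append((y - predict(W, x)) ** 2)
    return float(np.mean(errs))
```

Calling `one_pass_sgd(steps=0)` returns the initialization, so comparing `estimate_risk` before and after training gives a quick check that the iteration reduces the risk. The number of examples consumed equals the number of steps, independently of `N`.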

Formal relationships can be established between $R_{N}({\bm{\theta}})$ and $R(\rho)$ . For instance, under mild assumptions, $\inf_{{\bm{\theta}}}R_{N}({\bm{\theta}})=\inf_{\rho}R(\rho)+O(1/N)$ . We refer to the next sections for mathematical statements of this type.

Roughly speaking, $R(\rho)$ corresponds to the population risk when the number of hidden units goes to infinity, and the empirical distribution of parameters $\hat{\rho}^{(N)}$ converges to $\rho$ . Since $U(\,\cdot\,,\,\cdot\,)$ is positive semidefinite, we obtain that the risk becomes convex in this limit. The fact that learning can be viewed as convex optimization in an infinite-dimensional space was indeed pointed out in the past [LBW96, BRV+06]. Does this mean that the landscape of the population risk simplifies for large $N$ and descent algorithms will converge to a unique (or nearly unique) global optimum?

when $N\to\infty$ , ${\varepsilon}\to 0$ (here $\Rightarrow$ denotes weak convergence). The asymptotic dynamics of $\rho_{t}$ is defined by the following PDE, which we shall refer to as distributional dynamics (DD)

Using these results, analyzing learning in two-layer neural networks reduces to analyzing the PDE (7). While this is far from being an easy task, the PDE formulation leads to several simplifications and insights. First of all, it factors out the invariance of the risk (4) (and of the SGD dynamics (3)), with respect to permutations of the units $\{1,\dots,N\}$ .

where the infimum is taken over all couplings of $\rho_{1}$ and $\rho_{2}$ . Informally, the fact that $\rho_{t}$ is a gradient flow means that (7) is equivalent, for small $\tau$ , to

Powerful tools from the mathematical literature on gradient flows in measure spaces [AGS08] can be exploited to study the behavior of (7).

Most importantly, the scaling limit elucidates the dependence of the landscape of two-layer neural networks on the number of hidden units $N$ .

A remarkable feature of neural networks is the observation that, while they might be dramatically overparametrized, this does not lead to performance degradation. In the case of bounded activation functions, this phenomenon was clarified in the nineties for empirical risk minimization algorithms, see e.g. [Bar98]. The present work provides analogous insight for the SGD dynamics: roughly speaking, our results imply that the landscape remains essentially unchanged as $N$ grows, provided $N\gg D$ . In particular, assume that the PDE (7) converges close to an optimum in time $t_{*}(D)$ . This time might depend on $D$ , but does not depend on the number of hidden units $N$ (which does not appear in the DD PDE (7)). If $t_{*}(D)=O_{D}(1)$ , we can then take $N$ arbitrary (as long as $N\gg D$ ) and achieve a population risk which is independent of $N$ (and corresponds to the optimum), using $k=t_{*}/{\varepsilon}=O(D)$ samples.

Our analysis can accommodate some important variants of SGD, a particularly interesting one being noisy SGD:

where $\Psi_{\lambda}({\bm{\theta}};\rho)=\Psi({\bm{\theta}};\rho)+(\lambda/2)\|{\bm{\theta}}\|_{2}^{2}$ , and $\Delta_{{\bm{\theta}}}f({\bm{\theta}})=\sum_{i=1}^{d}\partial_{\theta_{i}}^{2}f({\bm{\theta}})$ denotes the usual Laplacian. This can be viewed as a gradient flow for the free energy $F_{\beta,\lambda}(\rho)=(1/2)R(\rho)+(\lambda/2)\int\|{\bm{\theta}}\|_{2}^{2}\rho({\rm d}{\bm{\theta}})-\beta^{-1}{\rm Ent}(\rho)$ , where ${\rm Ent}(\rho)=-\int\rho({\bm{\theta}})\log\rho({\bm{\theta}})\,{\rm d}{\bm{\theta}}$ is the entropy of $\rho$ (by definition ${\rm Ent}(\rho)=-\infty$ if $\rho$ is singular). $F_{\beta,\lambda}(\rho)$ is an entropy-regularized risk, which penalizes strongly non-uniform $\rho$ .
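To make the three terms of $F_{\beta,\lambda}$ concrete, the following sketch evaluates them for a one-dimensional density tabulated on a uniform grid. It is only an illustration under stated assumptions: the parameter space is taken to be the real line ( $D=1$ ), the value of $R(\rho)$ is supplied by the caller rather than computed, and the function names are hypothetical.

```python
import numpy as np

def entropy(density, grid):
    """Ent(rho) = -int rho log rho, approximated by a Riemann sum on a uniform grid."""
    dx = grid[1] - grid[0]
    mask = density > 0                      # convention: 0 * log 0 = 0
    return float(-np.sum(density[mask] * np.log(density[mask])) * dx)

def free_energy(risk_value, density, grid, lam, beta):
    """F = R/2 + (lam/2) * int theta^2 rho - Ent(rho)/beta for a 1-D density."""
    dx = grid[1] - grid[0]
    second_moment = float(np.sum(grid ** 2 * density) * dx)
    return 0.5 * risk_value + 0.5 * lam * second_moment - entropy(density, grid) / beta

# Standard Gaussian density on a wide grid, used as a test case.
grid = np.linspace(-10.0, 10.0, 4001)
gauss = np.exp(-grid ** 2 / 2.0) / np.sqrt(2.0 * np.pi)
```

For the standard Gaussian the entropy is $\tfrac{1}{2}\log(2\pi e)$ , which the Riemann sum reproduces closely. Concentrating the same density on a narrower region lowers the entropy and therefore raises $F_{\beta,\lambda}$ , which is the sense in which the entropy term penalizes strongly non-uniform $\rho$ .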

We will prove below that, for $\beta<\infty$ , the evolution (12) generically converges to the minimizer of $F_{\beta,\lambda}(\rho)$ , hence implying global convergence of noisy SGD in a number of steps independent of $N$ .

Examples

With probability $1/2$ : $y=+1$ , ${\bm{x}}\sim{\sf N}(0,(1+\Delta)^{2}{\bm{I}}_{d})$

With probability $1/2$ : $y=-1$ , ${\bm{x}}\sim{\sf N}(0,(1-\Delta)^{2}{\bm{I}}_{d})$ .

(This example will be generalized later.) Of course, optimal classification in this model becomes entirely trivial if we compute the feature $h({\bm{x}})=\|{\bm{x}}\|_{2}$ . However, it is non-trivial that an SGD-trained neural network will succeed.
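The data model above is easy to simulate. The sketch below draws labelled examples from the two isotropic Gaussians and classifies them by thresholding $h({\bm{x}})=\|{\bm{x}}\|_{2}$ . The threshold $\sqrt{d}$ is a convenient illustrative choice (the norm scale of a unit-variance Gaussian), not the optimal one, and the function names are hypothetical.

```python
import numpy as np

def sample_isotropic(n, d, delta, seed=0):
    """Draw n examples: y = +/-1 with probability 1/2, x ~ N(0, (1 + y*delta)^2 I_d)."""
    rng = np.random.default_rng(seed)
    y = rng.choice([1.0, -1.0], size=n)
    scale = 1.0 + delta * y                       # std (1+delta) for y=+1, (1-delta) for y=-1
    x = scale[:, None] * rng.standard_normal((n, d))
    return x, y

def norm_classifier(x):
    """Predict +1 when ||x||_2 exceeds sqrt(d), and -1 otherwise."""
    d = x.shape[1]
    return np.where(np.linalg.norm(x, axis=1) > np.sqrt(d), 1.0, -1.0)
```

With, say, `x, y = sample_isotropic(2000, 40, 0.4)`, the fraction `np.mean(norm_classifier(x) == y)` is close to one, which illustrates why the norm feature makes the problem easy once it is available.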

We choose an activation function without offset or output weights, namely $\sigma_{*}({\bm{x}};{\bm{\theta}}_{i})=\sigma(\langle{\bm{w}}_{i},{\bm{x}}\rangle)$ . While qualitatively similar results are obtained for other choices of $\sigma$ , we will use a simple piecewise linear function as a running example: $\sigma(t)=s_{1}$ if $t\leq t_{1}$ , $\sigma(t)=s_{2}$ if $t\geq t_{2}$ , and $\sigma(t)$ interpolated linearly for $t\in(t_{1},t_{2})$ . In simulations we use $t_{1}=0.5$ , $t_{2}=1.5$ , $s_{1}=-2.5$ , $s_{2}=7.5$ .
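The running activation can be written in a few lines. This is a minimal sketch using the simulation values quoted above; the names `sigma` and `hidden_units` are illustrative.

```python
import numpy as np

def sigma(t, t1=0.5, t2=1.5, s1=-2.5, s2=7.5):
    """Piecewise linear activation: s1 for t <= t1, s2 for t >= t2, linear in between."""
    # np.interp clamps to the first/last value outside [t1, t2].
    return np.interp(t, [t1, t2], [s1, s2])

def hidden_units(W, x):
    """Outputs sigma(<w_i, x>) of all N hidden units; W has shape (N, d)."""
    return sigma(W @ x)
```

For instance, the midpoint $t=1$ of the linear segment is mapped to $(s_{1}+s_{2})/2=2.5$ .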

where the form of $\psi(r;\rho)$ can be derived from $\Psi({\bm{\theta}};\rho)$ . This reduced PDE can be efficiently solved numerically, see Supplementary Information (SI) for technical details. As illustrated by Fig. 1, the empirical results match closely the predictions produced by this PDE.

In Fig. 2, we compare the asymptotic risk achieved by SGD with the prediction obtained by minimizing $R(\rho)$ , cf. (5) over spherically symmetric distributions. It turns out that, for certain values of $\Delta$ , the minimum is achieved by the uniform distribution over a sphere of radius $\|{\bm{w}}\|_{2}=r_{*}$ , to be denoted by $\rho^{\mbox{\tiny\rm unif}}_{r_{*}}$ . The value of $r_{*}$ is computed by minimizing

where expressions for $v(r)$ , $u_{d}(r_{1},r_{2})$ can be readily derived from $V({\bm{w}})$ , $U({\bm{w}}_{1},{\bm{w}}_{2})$ and are given in the SI.

Let $r_{*}$ be a global minimizer of $r\mapsto R_{d}^{(1)}(r)$ . Then $\rho^{\mbox{\tiny\rm unif}}_{r_{*}}$ is a global minimizer of $\rho\mapsto R(\rho)$ if and only if $v(r)+u_{d}(r,r_{*})\geq v(r_{*})+u_{d}(r_{*},r_{*})$ for all $r\geq 0$ .
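The optimality condition in this statement is a one-dimensional inequality, so it lends itself to a direct check on a grid. The sketch below shows only the mechanics: the callables `v` and `u` must be supplied by the user (their actual expressions are given in the SI), a finite grid can only approximate "for all $r\geq 0$ ", and the toy functions used in the usage note are made up for illustration.

```python
import numpy as np

def uniform_sphere_is_optimal(v, u, r_star, r_grid, tol=1e-9):
    """Check v(r) + u(r, r_star) >= v(r_star) + u(r_star, r_star) on a grid of radii."""
    lhs = v(r_grid) + u(r_grid, r_star)
    rhs = v(r_star) + u(r_star, r_star)
    return bool(np.all(lhs >= rhs - tol))

# Toy stand-ins (assumptions, not the functions from the SI).
def toy_v(r):
    return -r

def toy_u(r1, r2):
    return r1 * r2

r_grid = np.linspace(0.0, 5.0, 501)
```

With these toy functions, `uniform_sphere_is_optimal(toy_v, toy_u, 1.0, r_grid)` returns `True`, since the left-hand side equals $-r+r=0$ and the right-hand side equals $0$ . Passing `r_star=2.0` instead, purely to exercise the failing branch, returns `False`, since the inequality $r\geq 2$ is violated for small $r$ .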

Checking this condition numerically yields that $\rho^{\mbox{\tiny\rm unif}}_{r_{*}}$ is a global minimizer for $\Delta$ in an interval $[\Delta^{\rm l}_{d},\Delta^{\rm h}_{d}]$ , where $\lim_{d\to\infty}\Delta^{\rm l}_{d}=0$ and $\lim_{d\to\infty}\Delta^{\rm h}_{d}=\Delta_{\infty}\approx 0.47$ .

for any $k\in[T/{\varepsilon},10T/{\varepsilon}]$ with probability at least $1-\delta$ .

In particular, if we set ${\varepsilon}=1/(C_{0}d)$ , then the number of SGD steps is $k\in[(C_{0}T)\,d,(10C_{0}T)\,d]$ : the number of samples used by SGD does not depend on the number of hidden units $N$ , and is only linear in the dimension. Unfortunately the proof does not provide the dependence of $T$ on $\eta$ , but Theorem 6 below suggests exponential local convergence.

2 Centered anisotropic Gaussians

We can generalize the previous result to a problem in which the network needs to select a subset of relevant nonlinear features out of many a priori equivalent ones. We assume the joint law of $(y,{\bm{x}})$ to be as follows:

With probability $1/2$ : $y=+1$ , ${\bm{x}}\sim{\sf N}(0,{\bm{\Sigma}}_{+})$ , and

With probability $1/2$ : $y=-1$ , ${\bm{x}}\sim{\sf N}(0,{\bm{\Sigma}}_{-})$ .

Even with a reduced degree of symmetry, SGD converges to a network with nearly-optimal risk, after using a number of samples $k=O(d)$ , which is independent of the number of hidden units $N$ .

3 A better activation function

We consider the same data distribution introduced in the last section (anisotropic Gaussians). Figure 3 reports the evolution of the risk $R_{N}({\bm{\theta}}^{k})$ for three experiments with $d=320$ , $s_{0}=60$ and different values of $\Delta$ . SGD is initialized by setting $a_{i}=1$ , $b_{i}=1$ and ${\bm{w}}_{i}^{0}\sim_{iid}{\sf N}({\mathbf{0}},0.8^{2}/d\cdot{\bm{I}}_{d})$ for $i\leq N$ . We observe that SGD converges to a network with very small risk, but this convergence has a nontrivial structure and presents long flat regions.

The empirical results are well captured by our predictions based on the continuum limit. In this case we obtain a reduced PDE for the joint distribution of the four quantities ${\bm{r}}=(a,b,r_{1}=\|{\bm{P}}_{{\cal V}}{\bm{w}}\|_{2},r_{2}=\|{\bm{P}}^{\perp}_{{\cal V}}{\bm{w}}\|_{2})$ , denoted by $\overline{\rho}_{t}$ . The reduced PDE is analogous to (13) albeit in $4$ rather than $1$ dimensions. In Figure 3 we consider the evolution of the risk, alongside three properties of the distribution $\overline{\rho}_{t}$ –the means of the output weight $a$ , of the offset $b$ , and of $r_{1}$ .

4 Predicting failure

SGD does not always converge to a near global optimum. Our analysis allows us to construct examples in which SGD fails. For instance, Figure 4 reports results for the isotropic Gaussians problem. We violate the assumptions of Theorem 1 by using a non-monotone activation function. Namely, we use $\sigma_{*}({\bm{x}};{\bm{\theta}})=\sigma(\langle{\bm{w}},{\bm{x}}\rangle)$ , where $\sigma(t)=-2.5$ for $t\leq 0$ , $\sigma(t)=7.5$ for $t\geq 1.5$ , and $\sigma(t)$ linearly interpolates from $(0,-2.5)$ to $(0.5,-4)$ , and from $(0.5,-4)$ to $(1.5,7.5)$ .
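For concreteness, this non-monotone activation can be coded directly from its three knots. The sketch below is illustrative; the name `sigma_nonmonotone` is not from the paper.

```python
import numpy as np

# Knots (t, sigma(t)) of the piecewise linear activation described above.
KNOTS_T = [0.0, 0.5, 1.5]
KNOTS_S = [-2.5, -4.0, 7.5]

def sigma_nonmonotone(t):
    """Equals -2.5 for t <= 0 and 7.5 for t >= 1.5, linear between consecutive knots."""
    # np.interp clamps to the end values outside [0, 1.5].
    return np.interp(t, KNOTS_T, KNOTS_S)
```

The function first decreases from $-2.5$ to $-4$ on $[0,0.5]$ and then increases to $7.5$ on $[0.5,1.5]$ , which is the dip that breaks monotonicity.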

Depending on the initialization, SGD converges to two different limits, one with a small risk, and the second with high risk. Again this behavior is well tracked by solving a one-dimensional PDE for the distribution $\overline{\rho}_{t}$ of $r=\|{\bm{w}}\|_{2}$ .

General results

In this section we return to the general supervised learning problem described in the introduction and describe our general results. Proofs are deferred to the SI.

First, we note that the minimum of the asymptotic risk $R(\rho)$ of (5) provides a good approximation of the minimum of the finite- $N$ risk $R_{N}({\bm{\theta}})$ .

We next consider the distributional dynamics (7) and (12). These should be interpreted to hold in the weak sense, cf. SI. In order to establish that these PDEs indeed describe the limit of the SGD dynamics, we make the following assumptions.

$t\mapsto\xi(t)$ is bounded Lipschitz: $\|\xi\|_{\infty},\|\xi\|_{\mbox{\tiny\rm Lip}}\leq K_{1}$ , with $\int_{0}^{\infty}\xi(t){\rm d}t=\infty$ .

The activation function $({\bm{x}},{\bm{\theta}})\mapsto\sigma_{*}({\bm{x}};{\bm{\theta}})$ is bounded, with sub-Gaussian gradient: $\|\sigma_{*}\|_{\infty}\leq K_{2}$ , $\|\nabla_{{\bm{\theta}}}\sigma_{*}(\bm{X};{\bm{\theta}})\|_{\psi_{2}}\leq K_{2}$ . Labels are bounded $|y_{k}|\leq K_{2}$ .

The gradients ${\bm{\theta}}\mapsto\nabla V({\bm{\theta}})$ , $({\bm{\theta}}_{1},{\bm{\theta}}_{2})\mapsto\nabla_{{\bm{\theta}}_{1}}U({\bm{\theta}}_{1},{\bm{\theta}}_{2})$ are bounded, Lipschitz continuous (namely $\|\nabla_{{\bm{\theta}}}V({\bm{\theta}})\|_{2}$ , $\|\nabla_{{\bm{\theta}}_{1}}U({\bm{\theta}}_{1},{\bm{\theta}}_{2})\|_{2}\leq K_{3}$ , $\|\nabla_{{\bm{\theta}}}V({\bm{\theta}})-\nabla_{{\bm{\theta}}}V({\bm{\theta}}^{\prime})\|_{2}\leq K_{3}\|{\bm{\theta}}-{\bm{\theta}}^{\prime}\|_{2}$ , $\|\nabla_{{\bm{\theta}}_{1}}U({\bm{\theta}}_{1},{\bm{\theta}}_{2})-\nabla_{{\bm{\theta}}_{1}}U({\bm{\theta}}^{\prime}_{1},{\bm{\theta}}_{2}^{\prime})\|_{2}\leq K_{3}\|({\bm{\theta}}_{1},{\bm{\theta}}_{2})-({\bm{\theta}}_{1}^{\prime},{\bm{\theta}}_{2}^{\prime})\|_{2}$ ).
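As an example of a schedule compatible with the first assumption, the sketch below uses $\xi(t)=1/(1+t)$ , which is bounded by $1$ , is $1$ -Lipschitz, and has a divergent integral. The step size is then $s_{k}={\varepsilon}\,\xi(k{\varepsilon})$ , the form used in the Supplementary Information. The particular choice of $\xi$ and the function names are assumptions made for illustration.

```python
import numpy as np

def xi(t):
    """Example schedule (an assumption): bounded, Lipschitz, with divergent integral."""
    return 1.0 / (1.0 + t)

def step_sizes(eps, n_steps):
    """Step sizes s_k = eps * xi(k * eps) for k = 0, ..., n_steps - 1."""
    k = np.arange(n_steps)
    return eps * xi(k * eps)
```

Summing the first $T/{\varepsilon}$ step sizes approximates $\int_{0}^{T}\xi(t)\,{\rm d}t=\log(1+T)$ , which grows without bound in $T$ as the assumption requires.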

We also introduce the following error term which quantifies in a non-asymptotic sense the accuracy of our PDE model

The convergence of the SGD process to the PDE model is an example of a phenomenon which is known in probability theory as propagation of chaos [Szn91].

with probability $1-e^{-z^{2}}$ . The same statements hold for noisy SGD (11), provided (7) is replaced by (12), and if $\beta\geq 1$ , $\lambda\leq 1$ , and $\rho_{0}$ is $K_{0}$ sub-Gaussian for some $K_{0}>0$ .

Notice that the dependence of the error terms on $N$ and $D$ is rather benign. On the other hand, the error grows exponentially with the time horizon $T$ , which limits the applicability of the bound to cases in which the DD converges rapidly to a good solution. We do not expect this behavior to be improvable within the general setting of Theorem 3, which a priori includes cases in which the dynamics is unstable.

We can regard $\bm{J}({\bm{\theta}};\rho_{t})=\rho_{t}({\bm{\theta}})\nabla_{{\bm{\theta}}}\Psi({\bm{\theta}};\rho_{t})$ as a current. The fixed points of the continuum dynamics are densities that correspond to zero current, as stated below.

Assume $V(\,\cdot\,),U(\,\cdot\,,\,\cdot\,)$ to be differentiable with bounded gradient. If $\rho_{t}$ is a solution of the PDE (7), then $R(\rho_{t})$ is non-increasing. Further, a probability distribution $\rho$ is a fixed point of the PDE (7) if and only if

Note that global optimizers of $R(\rho)$ , defined by condition (17), are fixed points, but the set of fixed points is, in general, larger than the set of optimizers. Our next proposition provides an analogous characterization of the fixed points of diffusion DD (12) (see [CMV+03] for related results).

Assume that conditions A1-A3 hold and that $\rho_{0}$ is absolutely continuous with respect to Lebesgue measure, with $F_{\beta,\lambda}(\rho_{0})<\infty$ . If $(\rho_{t})_{t\geq 0}$ is a solution of the diffusion PDE (12), then $\rho_{t}$ is absolutely continuous. Further, there is at most one fixed point $\rho_{*}=\rho_{*}^{\beta,\lambda}$ of (12) satisfying $F_{\beta,\lambda}(\rho_{*})<\infty$ . This fixed point is absolutely continuous and its density satisfies

In the next sections we state our results about convergence of the distributional dynamics to its fixed point. In the case of noisy SGD (and for the diffusion PDE (12)), a general convergence result can be established (although at the cost of an additional regularization). For noiseless SGD (and the continuity equation (7)), we do not have such a general result. However, we obtain a stability condition for a fixed point containing one point mass, which is useful to characterize possible limiting points (and is used in treating the examples in the previous section).

Remarkably, the diffusion PDE (12) generically admits a unique fixed point, which is the global minimum of $F_{\beta,\lambda}(\rho)$ and the evolution (12) converges to it, if initialized so that $F_{\beta,\lambda}(\rho_{0})<\infty$ . This statement requires some qualifications. First of all, we introduce sufficient regularity assumptions to guarantee the existence of sufficiently smooth solutions of (12).

Next notice that the right-hand side of the fixed point equation (21) is not necessarily normalizable (for instance, it is not when $V(\,\cdot\,)$ , $U(\,\cdot\,,\,\cdot\,)$ are bounded). In order to ensure the existence of a fixed point, we need $\lambda>0$ .

Assume that conditions A1-A4 hold, and $1/K_{0}\leq\lambda\leq K_{0}$ for some $K_{0}>0$ . Then $F_{\beta,\lambda}(\rho)$ has a unique minimizer, denoted by $\rho_{*}^{\beta,\lambda}$ , which satisfies

where $C$ is a constant depending on $K_{0}$ , $K_{1}$ , $K_{2}$ , $K_{3}$ . Further, letting $\rho_{t}$ be a solution of the diffusion PDE (12) with initialization satisfying $F_{\beta,\lambda}(\rho_{0})<\infty$ , we have, as $t\to\infty$ ,

The proof of this theorem is based on the following formula that describes the free energy decrease along the trajectories of the distributional dynamics (12):

(A key technical hurdle is of course proving that this expression makes sense, which we do by showing the existence of strong solutions.) It follows that the right-hand side must vanish as $t\to\infty$ , from which we prove that (eventually taking subsequences) $\rho_{t}\Rightarrow\rho_{*}$ where $\rho_{*}$ must satisfy $\beta\Psi_{\lambda}({\bm{\theta}};\rho_{*})+\log\rho_{*}({\bm{\theta}})={\rm const}$ . This in turn means that $\rho_{*}$ is a solution of the fixed point condition (21) and is in fact a global minimum of $F_{\beta,\lambda}$ by convexity.
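Spelled out, the passage from the stationarity condition to a Gibbs-type density is a one-line computation. Writing $c$ for the constant and $Z$ for the normalization (both symbols are introduced here only for this remark), exponentiating and normalizing gives

$$
\beta\,\Psi_{\lambda}({\bm{\theta}};\rho_{*})+\log\rho_{*}({\bm{\theta}})=c
\;\;\Longrightarrow\;\;
\rho_{*}({\bm{\theta}})=e^{c}\,\exp\{-\beta\,\Psi_{\lambda}({\bm{\theta}};\rho_{*})\}
=\frac{1}{Z}\,\exp\{-\beta\,\Psi_{\lambda}({\bm{\theta}};\rho_{*})\},
\qquad
Z=\int\exp\{-\beta\,\Psi_{\lambda}({\bm{\theta}}^{\prime};\rho_{*})\}\,{\rm d}{\bm{\theta}}^{\prime}.
$$

The second equality uses $\int\rho_{*}({\bm{\theta}})\,{\rm d}{\bm{\theta}}=1$ , which forces $e^{c}=1/Z$ . Note that this is an implicit equation, since $\rho_{*}$ appears on both sides, and that $Z<\infty$ is where the term $(\lambda/2)\|{\bm{\theta}}\|_{2}^{2}$ in $\Psi_{\lambda}$ with $\lambda>0$ is needed.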

This result can be used in conjunction with Theorem 3, in order to analyze the regularized noisy SGD algorithm (11).

2 Convergence: noiseless SGD

The next theorems provide necessary and sufficient conditions for distributions containing a single point mass to be a stable fixed point of the evolution. This result is useful in order to characterize the large time asymptotics of the dynamics (7). Here, we write $\nabla_{1}U({\bm{\theta}}_{1},{\bm{\theta}}_{2})$ for the gradient of $U$ with respect to its first argument, and $\nabla^{2}_{1,1}U$ for the corresponding Hessian. Further, for a probability distribution $\rho_{*}$ , we define

Note that $\bm{H}_{0}(\rho_{*})$ is nothing but the Hessian of ${\bm{\theta}}\mapsto\Psi({\bm{\theta}};\rho_{*})$ at ${\bm{\theta}}_{*}$ .

If $\rho_{0}$ has a bounded density with respect to Lebesgue measure, then it cannot be that $\rho_{t}$ converges weakly to $\rho_{*}$ as $t\to\infty$ .

Discussion and future directions

In this paper we developed a new approach to the analysis of two-layers neural networks. Using a propagation-of-chaos argument, we proved that –if the number of hidden units satisfies $N\gg D$ – SGD dynamics is well approximated by the PDE in (7), while noisy SGD is well approximated by (12). Both of these asymptotic descriptions correspond to Wasserstein gradient flows for certain energy (or free energy) functionals. While empirical risk minimization is known to be insensitive to overparametrization [Bar98], the present work clarifies that the SGD behavior is also independent of the number of hidden units, as soon as this is large enough.

We illustrated our approach on several concrete examples, by proving convergence of SGD to a near-global optimum. This type of analysis provides a new mechanism for avoiding the perils of non-convexity. We do not prove that the finite- $N$ risk $R_{N}({\bm{\theta}})$ has a unique local minimum, or that all local minima are close to each other. Such claims have often been the target of earlier work, but might be too strong for the case of neural networks. We prove instead that the PDE (7) converges to a near global optimum, when initialized with a bounded density. This effectively gets rid of some exceptional stationary points of $R_{N}({\bm{\theta}})$ , and merges multiple finite- $N$ stationary points that result in similar distributions $\rho$ .

In the case of noisy SGD (11), we prove that it converges generically to a near-global minimum of the regularized risk, in time independent of the number of hidden units.

We emphasize that while we focused here on the case of square loss, our approach should be generalizable to other loss functions as well, cf. SI.

The present work opens the way to several interesting research directions. We will mention two of them. $(i)$ The PDE (7) corresponds to gradient flow in the Wasserstein metric for the risk $R(\rho)$ , see [AGS08]. Building on this remark, tools from optimal transportation theory can be used to prove convergence. $(ii)$ Multiple finite- $N$ local minima can correspond to the same minimizer $\rho_{*}$ of $R(\rho)$ in the limit $N\to\infty$ . Ideas from glass theory [MP99] might be useful to investigate this structure.

Let us finally mention that, after a first version of this paper appeared as a preprint, several other groups obtained results that are closely related to Theorem 3 [RVE18, SS18, CB18].

Acknowledgements

This work was partially supported by grants NSF DMS-1613091, NSF CCF-1714305 and NSF IIS-1741162. S. M. was partially supported by Office of Technology Licensing Stanford Graduate Fellowship. P.-M. N. was partially supported by William R. Hewlett Stanford Graduate Fellowship. The authors would like to thank Jiajun Tong for helpful discussions concerning strong solutions for parabolic PDEs.

References

Supplementary information

We present here proofs and additional technical details for our mathematical results, as well as additional information concerning the numerical experiments.

Notations

We use lowercase bold for vectors (e.g. ${\bm{u}},{\bm{v}},\dots$ ), uppercase bold for matrices (e.g. $\bm{A},\bm{B},\dots$ ), and lowercase plain for scalars ( $x,y,\dots$ ).

Given a measurable space $\Omega$ , we denote by $\mathscrsfs{P}(\Omega)$ the set of probability measures on $\Omega$ .

Given a measurable function $f$ , and a measure $\mu$ , we denote by $\langle f,\mu\rangle=\langle\mu,f\rangle=\int f\,{\rm d}\mu$ the corresponding integral.

$\|f\|_{\mbox{\tiny\rm Lip}}\equiv\sup_{{\bm{x}}\neq{\bm{y}}}|f({\bm{x}})-f({\bm{y}})|/\|{\bm{x}}-{\bm{y}}\|_{2}$ denotes the Lipschitz constant of a function $f$ .

$d_{\mbox{\tiny\rm BL}}(\,\cdot\,,\,\cdot\,)$ is the bounded Lipschitz distance between probability measures

Here ${\mathcal{C}}(\mu,\nu)$ is the set of couplings of $\mu$ and $\nu$ .

$W_{p}(\,\cdot\,,\,\cdot\,)$ is the Wasserstein distance between probability measures

For $p=1$ , the Kantorovich-Rubinstein duality gives

$K$ is a generic constant depending on $K_{0},K_{1},K_{2},K_{3}$ , where $K_{i}$ ’s are constants which will be specified from the context.
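To give a feeling for the Wasserstein distance $W_{p}$ introduced above, the sketch below computes it in the simplest setting: two empirical measures on the real line with the same number of atoms. In that case (and for $p\geq 1$ ) the optimal coupling matches the sorted samples, so no optimization is needed. The one-dimensional, equal-size restriction and the function name are simplifying assumptions; the distributions in this paper live on $\mathbb{R}^{D}$ .

```python
import numpy as np

def wasserstein_1d(x, y, p=1):
    """W_p between two equal-size empirical measures on the real line."""
    x = np.sort(np.asarray(x, dtype=float))
    y = np.sort(np.asarray(y, dtype=float))
    if x.shape != y.shape:
        raise ValueError("samples must have the same size")
    # Sorted (monotone) matching is an optimal coupling in one dimension.
    return float(np.mean(np.abs(x - y) ** p) ** (1.0 / p))
```

For example, translating a sample by a constant $c$ moves it by exactly $|c|$ in every $W_{p}$ , which the function reproduces.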

General results: Statics

In this section, we discuss some properties of the population risk, $R_{N}({\bm{\theta}})$ , and its continuum counterpart $R(\rho)$ . For future reference, we copy the key definitions from the main text:

We show that minimizing the population risk $R_{N}({\bm{\theta}})$ yields similar results to minimizing its continuum counterpart $R(\rho)$ :

We establish the condition for $\rho_{*}$ to be a minimizer:

First notice that, for any ${\bm{\theta}}=({\bm{\theta}}_{i})_{i\leq N}$ , we have

Indeed, $R_{N}({\bm{\theta}})=R(\rho)$ for $\rho=(1/N)\sum_{i=1}^{N}\delta_{{\bm{\theta}}_{i}}$ .
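This identity can be verified numerically. The sketch below assumes the definitions from the main text in the form $V({\bm{\theta}})=-\mathbb{E}\{y\,\sigma_{*}({\bm{x}};{\bm{\theta}})\}$ , $U({\bm{\theta}}_{1},{\bm{\theta}}_{2})=\mathbb{E}\{\sigma_{*}({\bm{x}};{\bm{\theta}}_{1})\sigma_{*}({\bm{x}};{\bm{\theta}}_{2})\}$ and $R(\rho)=\mathbb{E}\{y^{2}\}+2\int V\,{\rm d}\rho+\int U\,{\rm d}\rho^{\otimes 2}$ , with the network output given by the average of the hidden units. Expectations are replaced by averages over a fixed sample, so the two sides agree up to floating-point error. The $\tanh$ activation and the synthetic data are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, N = 500, 4, 7
X = rng.standard_normal((n, d))          # sample standing in for the law of x
y = np.sign(X[:, 0])                     # toy labels
W = rng.standard_normal((N, d))          # parameters theta_1, ..., theta_N

S = np.tanh(X @ W.T)                     # S[k, i] = sigma_*(x_k; theta_i)

# Left-hand side: mean squared error of the averaged network.
risk_direct = np.mean((y - S.mean(axis=1)) ** 2)

# Right-hand side: R(rho_hat) with rho_hat = (1/N) sum_i delta_{theta_i}.
V = -(y[:, None] * S).mean(axis=0)       # V[i]    = -E[y sigma_*(x; theta_i)]
U = (S.T @ S) / n                        # U[i, j] =  E[sigma_*(x; theta_i) sigma_*(x; theta_j)]
risk_measure = np.mean(y ** 2) + 2.0 * V.mean() + U.mean()
```

The agreement of `risk_direct` and `risk_measure` is just the expansion of the square, with the single and double sums over units rewritten as integrals against the empirical measure.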

whence the claim (6.6) follows since ${\varepsilon}$ is arbitrary.

We next establish the minimum condition (6.7). Notice that since $V(\,\cdot\,)$ is continuous, and $U(\,\cdot\,,\,\cdot\,)$ is bounded below, it follows from Fatou’s lemma that, for any $\rho$ , the function ${\bm{\theta}}\mapsto\Psi({\bm{\theta}};\rho)$ is lower semicontinuous and takes values in $(-\infty,\infty]$ . In particular the set $S_{0}(\rho)\equiv\arg\min_{{\bm{\theta}}}\Psi({\bm{\theta}};\rho)$ must be closed.

We first prove that any minimizer must satisfy (6.7). Let $\rho_{*}$ be a minimizer and define $\Psi_{*}=\inf_{{\bm{\theta}}}\Psi({\bm{\theta}};\rho_{*})$ . By rearranging terms, for any probability measure $\rho$ , we have

First we will assume $\Psi_{*}>-\infty$ (whence, by lower semicontinuity, $S_{0}(\rho_{*})$ must be a non-empty closed set). Let ${\bm{\theta}}_{1}\in S_{0}(\rho_{*})$ , and assume by contradiction that there exist ${\bm{\theta}}_{0}\in{\rm supp}(\rho_{*})$ , ${\bm{\theta}}_{0}\not\in S_{0}(\rho_{*})$ . Let ${\sf B}({\bm{\theta}}_{0};{\varepsilon})$ be a ball of radius ${\varepsilon}$ around ${\bm{\theta}}_{0}$ . By lower semicontinuity, we can find ${\varepsilon}_{0},\Delta>0$ such that $\inf_{{\bm{\theta}}\in{\sf B}({\bm{\theta}}_{0};{\varepsilon}_{0})}\Psi({\bm{\theta}};\rho_{*})=\Psi_{*}+\Delta>\Psi_{*}$ . Further $t_{0}\equiv\rho_{*}({\sf B}({\bm{\theta}}_{0};{\varepsilon}_{0}))>0$ because ${\bm{\theta}}_{0}\in{\rm supp}(\rho_{*})$ .

Let $\nu\equiv{\bm{1}}_{{\sf B}({\bm{\theta}}_{0};{\varepsilon}_{0})}\rho_{*}/t_{0}$ (i.e. $\nu$ is the conditional distribution given ${\bm{\theta}}\in{\sf B}({\bm{\theta}}_{0};{\varepsilon}_{0})$ ). Define, for $t\in[0,t_{0}]$ , the probability measure

where the second inequality follows from the fact that $U$ is continuous and $\delta_{{\bm{\theta}}_{1}}$ , $\nu$ have bounded support. By taking $t$ small enough, we get $R(\rho)<R(\rho_{*})$ hence reaching a contradiction.

By selecting $t=t_{M}=\min(t_{0},(M+\Psi_{0})/C_{0}(M))$ (which is positive for all $M$ large enough), we obtain $R(\rho_{M,t})-R(\rho_{*})<0$ for all $M$ large and hence reach a contradiction.

Setting $\mu=2[\Psi(\,\cdot\,;\rho_{*})-\Psi_{*}]$ , and noticing that condition (6.7) implies $\langle\Psi(\,\cdot\,;\rho_{*})-\Psi_{*},\rho_{*}\rangle=0$ , we get $R(\rho)\geq R(\rho_{*})+\langle U,(\rho-\rho_{*})^{\otimes 2}\rangle\geq R(\rho_{*})$ .

2 Some additional results

We often find empirically that the optimal density $\rho_{*}$ is supported on a set of Lebesgue measure zero (sometimes on a finite set of points). The following consequence of the previous results partially explains these findings.

Assume ${\bm{\theta}}\mapsto V({\bm{\theta}})$ to be an analytic function and $({\bm{\theta}}_{1},{\bm{\theta}}_{2})\mapsto U({\bm{\theta}}_{1},{\bm{\theta}}_{2})$ to be analytic with respect to ${\bm{\theta}}_{1}$ , uniformly in ${\bm{\theta}}_{2}$ . Namely there exists a locally bounded function ${\bm{\theta}}\mapsto B({\bm{\theta}})$ such that $\|\nabla^{k}_{{\bm{\theta}}_{1}}U({\bm{\theta}}_{1},{\bm{\theta}}_{2})\|_{2}\leq k!B({\bm{\theta}}_{1})^{k}$ for all $k$ , ${\bm{\theta}}_{1}$ , ${\bm{\theta}}_{2}$ . If $\rho_{*}$ is a minimizer of $R(\rho)$ , then one of the following holds

The support of $\rho_{*}$ has zero Lebesgue measure.

If $D=1$ , then $(b)$ can be replaced by: $(b^{\prime})$ $\rho_{*}$ is a convex combination of countably many point masses with no accumulation point (finitely many if $\Psi(\theta;\rho_{*})\to\infty$ as $|\theta|\to\infty$ ).

Note that, under the stated conditions, $f({\bm{\theta}})\equiv\int U({\bm{\theta}},{\bm{\theta}}^{\prime})\,\rho_{*}({\rm d}{\bm{\theta}}^{\prime})$ is analytic. Indeed, by a standard dominated convergence argument, $\nabla^{k}f({\bm{\theta}}_{1})$ is given by $\int\nabla^{k}U({\bm{\theta}}_{1},{\bm{\theta}}_{2})\,\rho_{*}({\rm d}{\bm{\theta}}_{2})$ for any $k\geq 0$ . Further, by an application of the intermediate value theorem there exists $t_{{\bm{\theta}}_{1},{\bm{\theta}}_{2},{\bm{\delta}}}\in[0,1]$ such that

which vanishes as $k\to\infty$ , uniformly over $\|{\bm{\delta}}\|_{2}\leq\delta_{0}$ , for $\delta_{0}$ small enough.

General results: Dynamics

In this section we consider the SGD dynamics with step size $s_{k}={\varepsilon}\xi(k{\varepsilon})$ , under the assumptions ${\sf A1},{\sf A2},{\sf A3}$ stated in the main text. For the reader’s convenience, we reproduce here the form of the limiting PDE

For background on this and similar PDEs (and the analogous ones at finite temperature, cf. Section 10), we refer to [MV00, CMV+03, CMV06, AGS08, CDF+11]. Our treatment will be mostly self-contained because of some differences between our setting and the one in these papers.

Recall assumptions A1, A2, A3 in the main text. By [Szn91, Theorem 1.1], assumptions A1 and A3 are sufficient for the existence and uniqueness of the solution of PDE (7.1).

Assume conditions A1 and A3 hold. Let $(\rho_{t})_{t\geq 0}$ be the solution of the PDE (7.1). Let $(\overline{\bm{\theta}}_{i}^{t})_{t\geq 0}$ be the solution of nonlinear dynamics (7.4). Then $t\mapsto\overline{\bm{\theta}}^{t}_{i}$ is $K_{1}K_{3}$ -Lipschitz continuous, and $t\mapsto\rho_{t}$ is $K_{1}K_{3}$ -Lipschitz continuous in $W_{2}$ Wasserstein distance, with $K_{1}$ and $K_{3}$ as per conditions A1 and A3. In particular, $t\mapsto\rho_{t}$ is continuous in the topology of weak convergence.

Since $\xi$ and $\nabla\Psi$ are $K_{1}$ and $K_{3}$ bounded respectively, $t\mapsto\overline{\bm{\theta}}^{t}_{i}$ is $K_{1}K_{3}$ -Lipschitz continuous. Further, Eq. (5.2) implies that $t\mapsto\rho_{t}$ is Lipschitz continuous in $W_{2}$ Wasserstein distance, namely

The proof follows a ‘propagation of chaos’ argument [Szn91]. Throughout this proof, we will use $K$ to denote a generic constant depending on the constants $K_{1},K_{2},K_{3}$ in conditions A1, A2, A3.

It is convenient to introduce the notations ${\bm{z}}_{k}=({\bm{x}}_{k},y_{k})$ to denote the $k$ -th example and define

Note that the assumption of bounded Lipschitz $\nabla V$ , $\nabla_{1}U$ (here and below $\nabla_{1}U({\bm{\theta}}_{1},{\bm{\theta}}_{2})$ denotes the gradient of $U$ with respect to its first argument) implies $\|{\bm{G}}({\bm{\theta}};\rho)\|_{2}\leq K$ and $\|{\bm{G}}({\bm{\theta}}_{1};\rho)-{\bm{G}}({\bm{\theta}}_{2};\rho)\|_{2}\leq K\|{\bm{\theta}}_{1}-{\bm{\theta}}_{2}\|_{2}$ . Further

With these notations, we can rewrite the SGD dynamics in the main text as

Recall $({\bm{\theta}}^{0}_{i})_{i\leq N}\sim\rho_{0}$ independently.

We next state and prove the key estimate controlling the difference between the original dynamics and the nonlinear dynamics.

Under the assumptions of Theorem 3, there exists a constant $K$ depending only on $K_{1},K_{2},K_{3}$ in conditions A1, A2, and A3, such that for any $T\geq 0$ , we have

with probability at least $1-e^{-z^{2}}$ .

We next consider the three terms above. Using the Lipschitz continuity of ${\bm{G}}({\bm{\theta}};\rho)$ with respect to ${\bm{\theta}}$ and $\rho$ (see Eq. (7.10)), and due to condition A1 and Lemma 7.1 (implying that $\xi$ , $\overline{\bm{\theta}}^{t}_{i}$ , and $\rho_{s}$ are Lipschitz continuous), we get

Bounding the second term yields (by using the Lipschitz continuity of ${\bm{G}}$ with respect to its first argument):

where $\hat{\rho}^{(N)}_{k}\equiv(1/N)\sum_{i\leq N}\delta_{{\bm{\theta}}^{k}_{i}}$ . Hence

and taking union bound over $i\leq N$ , we get

For the term $E_{3,0}^{i}(t)$ , we use the Lipschitz continuity property (7.10), whence

Since for any fixed $k$ , $(\overline{\bm{\theta}}_{j}^{k{\varepsilon}})_{j\leq N,j\neq i}$ are i.i.d. and independent of ${\bm{\theta}}_{i}^{k}$ , and $\nabla_{1}U$ is bounded, we get by another application of Azuma-Hoeffding inequality, cf. Lemma A.1,

Conditional on the good events in Eq. (7.22) and (7.25), Eq. (7.20) thus yields

with probability at least $1-e^{-z^{2}}$ .

Using the bounds (7.16), (7.17), (7.26) in Eq. (7.15), we get

Using the bound (7.27), the claim follows. ∎

Under the assumptions of Theorem 3, we have

Let ${\bm{\theta}}=({\bm{\theta}}_{1},\dots,{\bm{\theta}}_{i},\dots,{\bm{\theta}}_{n})$ and ${\bm{\theta}}^{\prime}=({\bm{\theta}}_{1},\dots,{\bm{\theta}}_{i}^{\prime},\dots,{\bm{\theta}}_{n})$ be two configurations that differ only in position $i$ . Then

Under the assumptions of Theorem 3, we have,

with probability at least $1-e^{-z^{2}}$ .

By Eq. (7.32), the Azuma-Hoeffding inequality, and a union bound, we get

with probability at least $1-e^{-z^{2}}$ . The claim follows since

The proof of the theorem follows from a straightforward application of Lemmas 7.2, 7.3 and 7.4. The proof for any bounded Lipschitz function $f$ follows the same argument as Lemmas 7.3 and 7.4. As a result, for any sequence $(N,{\varepsilon}={\varepsilon}_{N})$ such that $N\to\infty$ and ${\varepsilon}_{N}\to 0$ with $N/\log(N/{\varepsilon}_{N})\to\infty$ and ${\varepsilon}_{N}\log(N/{\varepsilon}_{N})\to 0$ , it follows immediately that $\hat{\rho}_{\lfloor t/{\varepsilon}\rfloor}^{(N)}$ converges weakly to $\rho_{t}$ almost surely.

2 Proof of Theorem 3: Generalization to $\beta<\infty$

Here we generalize the proof given in the previous section to noisy SGD at finite temperature $\beta<\infty$ . Since the proof follows the same scheme as in the noiseless case, we will limit ourselves to describing the differences.

Throughout this section we assume that conditions ${\sf A1}$ , ${\sf A2}$ , ${\sf A3}$ hold. We also let

for some $\lambda\leq 1$ . Further we assume $\rho_{0}$ is $K_{0}^{2}$ -sub-Gaussian. Finally, we assume $1\leq\beta<\infty$ .

For the reader’s convenience, we reproduce here the form of the limiting PDE

which again should be interpreted in weak sense.

Recall conditions A1, A2, A3 in the main text. By a modified argument of [Szn91, Theorem 1.1], conditions A1 and A3 are sufficient for the existence and uniqueness of the solution of PDE (7.37) in the weak sense. Section 10 provides further information on this PDE, including a proof of existence and uniqueness.

As in the noiseless case, there is an equivalent formulation of this PDE as a fixed point distribution for the following nonlinear dynamics, which is an integration form of a stochastic differential equation,

where $\{\bm{W}_{i}(s)\}_{s\geq 0}$ for $i\leq N$ are independent $D$ -dimensional Brownian motions, and ${\bm{G}}({\bm{\theta}};\rho)\equiv-\nabla\Psi_{\lambda}({\bm{\theta}};\rho)$ . The assumptions on $U$ , $V$ , $\lambda$ , and $\xi$ ensure that this nonlinear dynamics has a unique continuous solution.

It is convenient to collect some standard estimates about the solution of the stochastic differential equation (7.38).

Assume $\rho_{0}$ is $K_{0}^{2}$ -sub-Gaussian, $\xi(s)$ and ${\bm{G}}({\mathbf{0}};\rho_{s})$ are $K_{0}$ -bounded, ${\bm{G}}({\bm{\theta}};\rho_{s})$ is $K_{0}$ -Lipschitz in ${\bm{\theta}}$ , and $\beta\geq 1$ . Let $(\overline{\bm{\theta}}_{i}^{t})_{t\geq 0}$ for $i\leq N$ be the solution of (7.38) with independent initialization $({\bm{\theta}}_{i}^{0})_{i\leq N}\sim\rho_{0}$ . Let $(\rho_{t})_{t\geq 0}$ be the solution of PDE (7.37). Then there exists a constant $K$ depending only on $K_{0}$ , such that

Part (a). First, note that for any $D$ -dimensional $K_{0}^{2}$ -sub-Gaussian random vector $\bm{X}$ , we have

Note that $({\bm{\theta}}_{i}^{0})_{i\leq N}\sim\rho_{0}$ independently, and $\rho_{0}$ is $K_{0}^{2}$ -sub-Gaussian. Therefore

Taking $\tau=1/(2K_{0}^{2})$ and $u=2K_{0}(\sqrt{D+\log N}+z)$ , we get

Then we define $\bm{W}_{\xi,i}(t)\equiv\int_{0}^{t}\sqrt{2\xi(s)}\,{\rm d}\bm{W}_{i}(s)$ . We have ${\rm Var}(W_{\xi,i}^{j}(t))=\int_{0}^{t}2\xi(s){\rm d}s\leq 2K_{0}t$ for $j\leq D$ . Since $\exp\{\tau\|\bm{W}_{\xi,i}(t)\|_{2}^{2}\}$ is a submartingale, Doob’s martingale inequality gives

Taking $\tau=1/(4K_{0}T)$ and $u=4\sqrt{K_{0}T}(\sqrt{D+\log N}+z)$ , we get

By noting that $\xi(s)$ , ${\bm{G}}({\mathbf{0}};\rho_{s})$ are $K_{0}$ -bounded, and ${\bm{G}}({\bm{\theta}};\rho_{s})$ is $K_{0}$ -Lipschitz in ${\bm{\theta}}$ , according to Eq. (7.38), there exists some constant $K$ depending on $K_{0}$ , such that

where $\Delta_{i}(t)\equiv\sup_{s\leq t}\|\overline{\bm{\theta}}_{i}^{s}\|_{2}$ , $W\equiv\max_{i\leq N}\sup_{t\leq T}\|\bm{W}_{\xi,i}(t)\|_{2}$ , and $\Theta\equiv\max_{i\leq N}\|{\bm{\theta}}_{i}^{0}\|_{2}$ . Due to Gronwall’s inequality, we have

The high probability bound (7.42) holds by noting the high probability bound for $\Theta$ and $W$ in Eq. (7.46) and (7.47).
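
For reference, the integral form of Gronwall's inequality invoked in this step can be stated as follows. The symbols $u$, $a$, $b$ are placeholder names introduced here; matching them to $\Delta_{i}(t)$ and to the constants in the omitted display is left to the reader.

```latex
% Gronwall's inequality (integral form), for a continuous u : [0,T] \to [0,\infty)
% and constants a, b \ge 0 (placeholder names):
u(t) \le a + b \int_0^t u(s)\,{\rm d}s \quad \text{for all } t \in [0,T]
\qquad\Longrightarrow\qquad
u(t) \le a\, e^{b t} \quad \text{for all } t \in [0,T].
```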

Part (b). Define $\Delta_{i}(h;k,{\varepsilon})=\sup_{0\leq u\leq h}\|\overline{\bm{\theta}}_{i}^{k{\varepsilon}+u}-\overline{\bm{\theta}}_{i}^{k{\varepsilon}}\|_{2}$ . By noting that $\xi(s)$ , ${\bm{G}}({\mathbf{0}};\rho_{s})$ are $K_{0}$ -bounded, and ${\bm{G}}({\bm{\theta}};\rho_{s})$ is $K_{0}$ -Lipschitz in ${\bm{\theta}}$ , according to Eq. (7.38), we have

where $\bm{W}_{\xi,i,k}(u)\equiv\int_{k{\varepsilon}}^{k{\varepsilon}+u}\sqrt{2\xi(s)}\,{\rm d}\bm{W}_{i}(s)$ . Similar to the bound Eq. (7.47), we have

Plugging the bound Eq. (7.42) and Eq. (7.49) into Eq. (7.48), we have

with probability at least $1-e^{-z^{2}}$ .

Part (c). Equation (7.44) holds directly by noting that

As in the noiseless case, the key step consists in bounding the difference between the nonlinear dynamics and the SGD dynamics.

Under the assumptions of Theorem 3, there exists a constant $K$ depending only on $K_{0},K_{1},K_{2},K_{3}$ , such that for any $T\geq 0$ , we have

with probability at least $1-\,e^{-z^{2}}$ .

Terms $E_{2}^{i}(t)$ , $E_{3}^{i}(t)$ can be bounded in the same way as in Lemma 7.2, i.e., Eq. (7.17) and (7.26), by noting that the replacement of $\Psi$ by $\Psi_{\lambda}$ does not affect these estimates.

To bound $E_{4}^{i}(t)$ , notice that $\bm{W}_{\xi,i}\equiv\int_{0}^{T}\big{(}\sqrt{2\xi(s)}-\sqrt{2\xi([s])}\big{)}\,{\rm d}\bm{W}_{i}(s)$ is a Gaussian random vector, $\bm{W}_{\xi,i}\sim{\sf N}({\mathbf{0}},\tau^{2}{\bm{I}}_{D})$ , where, using the Lipschitz continuity of $\xi$ ,

and therefore by applying Doob’s inequality to the submartingale $t\mapsto E_{4}^{i}(t)$ , we get

We need to modify the proof of Lemma 7.2 to bound terms $E_{1}^{i}(t)$ .

To bound the first term $E_{1,A}^{i}(t)$ , due to the Lipschitz property of ${\bm{G}}({\bm{\theta}};\rho)$ and the boundedness of ${\bm{G}}({\mathbf{0}};\rho)$ , with probability at least $1-e^{-z^{2}}$ , we have for all $i\leq N$ and $t\leq T$ ,

Here the last inequality is due to Eq. (7.42) in Lemma 7.5.

To bound the second term $E_{1,B}^{i}(t)$ , using the fact that $\nabla_{1}U$ is bounded Lipschitz, we have for all $i\leq N$ and $t\leq T$ ,

Here the last inequality is due to Eq. (7.44) in Lemma 7.5.

To bound the third term $E_{1,C}^{i}(t)$ , with probability at least $1-e^{-z^{2}}$ , we have for all $i\leq N$ and $t\leq T$ ,

Here the last inequality is due to Eq. (7.43) in Lemma 7.5.

As a result, combining Eq. (7.17), (7.26), (7.27), (7.51), (7.52), (7.54), (7.55), and (7.56), defining

Applying Gronwall’s inequality gives the desired result.

The generalization of Theorem 3 to $\beta<\infty$ follows from this lemma exactly as in the previous section.

3 Proof of Proposition 2: Monotonicity of the risk

By Lemma 7.1, $t\mapsto\rho_{t}$ is Lipschitz continuous in Wasserstein distance $W_{2}(\rho_{t_{1}},\rho_{t_{2}})\leq K|t_{1}-t_{2}|$ . Hence, we get

where in the second step we used Eq. (7.3). This immediately implies that $R(\rho_{t})$ is non-increasing in $t$ .
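
The monotonicity statement can be illustrated on a toy particle system. The sketch below is not the paper's model: it assumes the hypothetical one-dimensional choices $V(\theta)=-\theta$ and $U(\theta,\theta^{\prime})=\theta\theta^{\prime}$, for which the risk reduces to $(1-m)^{2}$ with $m$ the mean of the particles, and it assumes the time scaling $\xi\equiv 1/2$ so that each particle moves with velocity $-\nabla\Psi=1-m$.

```python
def risk(thetas):
    """Toy risk R = (1 - mean(theta))^2 (hypothetical V, U; constant target y = 1)."""
    m = sum(thetas) / len(thetas)
    return (1.0 - m) ** 2

def gradient_flow_step(thetas, h):
    """One explicit Euler step of d(theta_i)/dt = -grad Psi = 1 - mean(theta)."""
    m = sum(thetas) / len(thetas)
    return [th + h * (1.0 - m) for th in thetas]

thetas = [-1.0, 0.0, 0.5, 3.0]   # arbitrary initial particles
risks = [risk(thetas)]
for _ in range(50):
    thetas = gradient_flow_step(thetas, h=0.1)
    risks.append(risk(thetas))
```

In this toy case the recorded risks form a non-increasing sequence, in line with the continuous-time statement; the Euler discretization is a convenience of the sketch and plays no role in the proof.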

Let $\rho$ be a fixed point of Eq. (7.1). Since $\partial_{t}R(\rho_{t})|_{\rho_{0}=\rho}=0$ , the above formula implies

and therefore $\rho$ is supported in the set of ${\bm{\theta}}$ ’s such that $\nabla\Psi({\bm{\theta}};\rho)={\mathbf{0}}$ .

Conversely, if this is the case, setting $\rho_{0}=\rho$ , Eq. (7.3) implies $\partial_{t}\langle\varphi,\rho_{t}\rangle=0$ , so that $\rho_{t}\equiv\rho_{0}$ and $\rho$ is a fixed point.

4 A general continuity result

It is useful to notice that the solution $(\rho_{t})_{t\geq 0}$ of the PDE (7.1) is continuous with respect to changes in $V(\,\cdot\,)$ , $U(\,\cdot\,,\,\cdot\,)$ . Namely, we consider the following two PDEs:

The proof adapts the argument used to establish uniqueness in [Szn91]. Without loss of generality, we fix $\xi(t)\equiv 1/2$ . We further denote by $K$ generic constants depending on $K_{1}$ , $K_{3}$ .

The assumption of bounded Lipschitz $\nabla V$ and $\nabla U$ implies that $\nabla\Psi({\bm{\theta}};\rho)$ is $K$ -bounded Lipschitz with respect to argument $({\bm{\theta}},\rho)$ , that is,

Using bound (7.70), the first term $E_{1}(t)$ is simply bounded by

To bound the second term $E_{2}(t)$ , we have

with the definition of ${\varepsilon}_{0}$ given by Equation (7.69).

Combining Equation (7.74), (7.75), and (7.76), we have

Applying Equation (7.73), the result follows. ∎

5 Some properties of the solution of the PDE (7.1)

In this section we prove four lemmas on the properties of the solution of the PDE (7.1), under conditions A1 and A3. All of these facts are quite standard, but we provide complete proofs for the reader’s convenience.

We will use several times the following notations. Let $\rho_{t}$ be a solution of the PDE (7.1) with initialization $\rho_{0}$ . Let $({\bm{\theta}}^{t})_{t\geq 0}$ be the solution of the ordinary differential equation (ODE)

With these notations, $\rho_{t}$ is the push forward of $\rho_{0}$ under ${\bm{\varphi}}^{t}$ : $\rho_{t}={\bm{\varphi}}_{*}^{t}\rho_{0}$ . In other words, for any Borel set $B$ , $\rho_{t}({\bm{\varphi}}^{t}(B))=\rho_{0}(B)$ .

The lemma holds immediately by noting that $\rho_{t}(\Omega)\geq\rho_{t}({\bm{\varphi}}^{t}(\Omega))=\rho_{0}(\Omega)$ . ∎

Assume conditions A1, A3 hold. Further assume there exists a constant $K<\infty$ such that

for any ${\bm{\theta}}\in(0,\infty)^{D}$ and $\rho\in{\mathcal{P}}([0,\infty]^{D})$ . Let $(\rho_{t})_{t\geq 0}$ be the solution of the PDE (7.1) with initial condition $\rho_{0}$ with $\rho_{0}((0,\infty)^{D})=1$ . Then for any $t<\infty$ , $\rho_{t}((0,\infty)^{D})=1$ .

According to Eqs. (7.80) and (7.79), we have for $i\in[d]$ ,

Then according to (7.81), we have ${\bm{\varphi}}^{t}(\Omega_{k}(0))\subseteq\Omega_{k}(t)$ . Note $\Omega_{k}(t)$ is increasing in $k$ for fixed $t$ , and $\cup_{k}\Omega_{k}(t)=\cup_{k}\Omega_{k}(0)=(0,\infty)^{D}$ . Hence,

Let $(\rho_{t})_{t\geq 0}$ be a continuous curve in a compact metric space $(\Omega,d)$ . Define

to be the set of all limiting points of $(\rho_{t})_{t\geq 0}$ . Then ${\mathcal{S}}_{*}$ is a connected compact set.

First, it is easy to see that ${\mathcal{S}}_{*}$ is closed. Since $\Omega$ is a compact space, ${\mathcal{S}}_{*}$ is a compact set. If ${\mathcal{S}}_{*}=\{\rho_{*}\}$ is a singleton, the lemma holds trivially. We therefore consider the case in which ${\mathcal{S}}_{*}$ is not a singleton.

Take any $\rho_{1},\rho_{2}\in{\mathcal{S}}_{*}$ with $d(\rho_{1},\rho_{2})>0$ . We would like to show that $\rho_{1}$ and $\rho_{2}$ are connected in ${\mathcal{S}}_{*}$ .

We argue by contradiction. Suppose $\rho_{1}$ and $\rho_{2}$ are not connected. Define ${\mathcal{A}}\subseteq{\mathcal{S}}_{*}$ to be the maximal connected subset of ${\mathcal{S}}_{*}$ containing $\rho_{1}$ . It is easy to see that ${\mathcal{A}}$ is compact. It is also easy to see that its complement ${\mathcal{B}}\equiv{\mathcal{S}}_{*}\setminus{\mathcal{A}}$ is a compact set, and $\rho_{2}\in{\mathcal{B}}$ . As a result, we have ${\mathcal{A}}\cup{\mathcal{B}}={\mathcal{S}}_{*}$ , ${\mathcal{A}}\cap{\mathcal{B}}=\emptyset$ , and $\rho_{1}\in{\mathcal{A}}$ , $\rho_{2}\in{\mathcal{B}}$ .

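Note that $\Omega$ is a metric space, so it satisfies the T4 separation axiom. Since ${\mathcal{A}}$ and ${\mathcal{B}}$ are closed sets and ${\mathcal{A}}\cap{\mathcal{B}}=\emptyset$ , there exists an open set ${\mathcal{O}}$ such that ${\mathcal{A}}\subseteq{\mathcal{O}}$ and $\overline{{\mathcal{O}}}\cap{\mathcal{B}}=\emptyset$ , where $\overline{{\mathcal{O}}}$ denotes the closure of ${\mathcal{O}}$ . Since $\partial{\mathcal{O}}=\overline{{\mathcal{O}}}\setminus{\mathcal{O}}$ is disjoint from both ${\mathcal{A}}$ and ${\mathcal{B}}$ , we get $\partial{\mathcal{O}}\subseteq{\mathcal{S}}_{*}^{c}$ .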

Note that $\rho_{1}$ and $\rho_{2}$ are limiting points of $(\rho_{t})_{t\geq 0}$ , which is a continuous curve in $\Omega$ . Therefore, the curve must cross the boundary $\partial{\mathcal{O}}$ infinitely many times. That is, there is a sequence $(t_{k})_{k\geq 1}$ of times with $\lim_{k\to\infty}t_{k}=\infty$ , such that $\rho_{t_{k}}\in\partial{\mathcal{O}}$ . Since $\partial{\mathcal{O}}$ is compact, there exists a limiting point $\rho_{*}\in\partial{\mathcal{O}}$ such that a subsequence of $(\rho_{t_{k}})$ converges to $\rho_{*}$ . Therefore, $\rho_{*}$ is a limiting point of $(\rho_{t})_{t\geq 0}$ . But this contradicts $\partial{\mathcal{O}}\subseteq{\mathcal{S}}_{*}^{c}$ . ∎

Under assumptions A1 and A3, further assume that $U,V$ are twice continuously differentiable, and that $\rho_{0}$ has a density with respect to the Lebesgue measure, bounded by $M_{0}$ . Then $\rho_{t}$ also has a density, bounded by $M_{t}=K\,M_{0}\exp\{KDt\}$ (where $K$ depends on the constants in the assumptions).

Write $\bm{J}({\bm{\theta}};t)$ for the Jacobian of ${\bm{\varphi}}^{t}(\,\cdot\,)$ at ${\bm{\theta}}^{0}={\bm{\theta}}$ . Then Eq. (7.79) implies that $\bm{J}({\bm{\theta}};t)$ satisfies

with initial condition $\bm{J}({\bm{\theta}};0)={\bm{I}}_{D}$ . This implies

Therefore, using the fact that $\|\nabla^{2}\Psi({\bm{\theta}};\rho_{t})\|_{\mbox{\tiny\rm op}}$ is $K$ -bounded, we obtain $\lambda_{\min}\big{(}\bm{J}({\bm{\theta}};t)\big{)}\geq\exp(-Kt)$ . Finally, since ${\bm{\varphi}}^{t}$ is a diffeomorphism, we have

6 Proof of Theorem 6: Stability conditions

where $\bm{H}_{0}\equiv\bm{H}_{0}(\delta_{{\bm{\theta}}_{*}})=\nabla^{2}V({\bm{\theta}}_{*})+\nabla^{2}_{1,1}U({\bm{\theta}}_{*},{\bm{\theta}}_{*})$ . Notice that

and therefore $\nabla^{2}_{1,2}U({\bm{\theta}}_{*},{\bm{\theta}}_{*})\succeq{\mathbf{0}}$ , whence $\bm{H}_{1}\succeq\bm{H}_{0}$ .

We first establish the condition for $\rho_{*}=\delta_{{\bm{\theta}}_{*}}$ to be a fixed point. Note that $\Psi({\bm{\theta}};\rho_{*})=V({\bm{\theta}})+U({\bm{\theta}},{\bm{\theta}}_{*})$ and ${\rm supp}(\rho_{*})=\{{\bm{\theta}}_{*}\}$ . Hence the condition in the main text is satisfied if and only if $\nabla_{\bm{\theta}}\Psi({\bm{\theta}};\rho_{*})|_{{\bm{\theta}}={\bm{\theta}}_{*}}={\mathbf{0}}$ , i.e. $\nabla V({\bm{\theta}}_{*})+\nabla_{1}U({\bm{\theta}}_{*},{\bm{\theta}}_{*})={\mathbf{0}}$ .

To establish the stability result of Theorem 6, the following lemma provides a key estimate.

Under the assumptions of Theorem 6, let $\lambda\equiv\lambda_{\min}(\bm{H}_{0})>0$ . Then there exist $r_{1},{\varepsilon}_{1},\gamma>0$ such that the following hold.

If ${\rm supp}(\rho)\subseteq{\sf B}({\bm{\theta}}_{*};r_{1})\equiv\{{\bm{\theta}}:\;\|{\bm{\theta}}-{\bm{\theta}}_{*}\|_{2}\leq r_{1}\}$ , then,

If $\int\|{\bm{\theta}}-{\bm{\theta}}_{*}\|_{2}^{2}\,\rho({\rm d}{\bm{\theta}})\leq{\varepsilon}^{2}_{1}$ and ${\rm supp}(\rho)\subseteq{\sf B}({\bm{\theta}}_{*};r_{1})$ , then for any ${\bm{\theta}}\in{\sf B}({\bm{\theta}}_{*};r_{1})\setminus{\sf B}({\bm{\theta}}_{*};r_{1}/2)$ ,

Since $\nabla^{2}V({\bm{\theta}})$ is continuous and $\nabla_{1}^{2}U({\bm{\theta}},{\bm{\theta}}^{\prime})$ is bounded continuous, it follows that ${\bm{\theta}}\mapsto\nabla^{2}\Psi({\bm{\theta}};\rho)$ is continuous, and $\rho\mapsto\nabla^{2}\Psi({\bm{\theta}};\rho)$ is continuous in the weak topology, and in fact $({\bm{\theta}},\rho)\mapsto\nabla^{2}\Psi({\bm{\theta}};\rho)$ is continuous in the product topology.

Since $\bm{H}_{0}\succ{\mathbf{0}}$ strictly, for any $\delta>0$ we can choose $r_{1}=r_{1}(\delta)>0$ such that

for all ${\bm{\theta}}\in{\sf B}({\bm{\theta}}_{*};r_{1})$ , and $\rho$ such that ${\rm supp}(\rho)\subseteq{\sf B}({\bm{\theta}}_{*};r_{1})$ . If these conditions hold

In order to bound the second term, note that, since $\nabla\Psi({\bm{\theta}}_{*};\rho_{*})={\mathbf{0}}$ ,

where ${\bm{Q}}=\int({\bm{\theta}}-\mu)({\bm{\theta}}-\mu)^{{\sf T}}\,\rho({\rm d}{\bm{\theta}})$ is the covariance of $({\bm{\theta}}-{\bm{\theta}}_{*})$ .

Let us now consider the claim at point $(i)$ . Integrating Eq. (7.103) with respect to $\rho({\rm d}{\bm{\theta}})$ , we get

By choosing $\delta$ sufficiently small, we can ensure that $(1-\delta)\bm{H}_{0}-(\delta/2){\bm{I}}\succeq\lambda_{\min}(\bm{H}_{0}){\bm{I}}/2$ , $\bm{H}_{1}-\delta\bm{H}_{0}-(3\delta/2){\bm{I}}\succeq\lambda_{\min}(\bm{H}_{1}){\bm{I}}/2$ , and therefore

Next consider point $(ii)$ . In this case, Eq. (7.107) implies

Substituting in Eq. (7.103), and using $\|{\bm{\mu}}\|_{2}\leq{\varepsilon}_{1}$ , we get

This is strictly positive for all ${\varepsilon}_{1}$ small enough, hence implying the claim (7.92). ∎

We are now in a position to prove Theorem 6.

Let $r_{0}=\min(r_{1}/2,{\varepsilon}_{1}/2)$ and assume, without loss of generality $t_{0}=0$ , so that ${\rm supp}(\rho_{0})\subseteq{\sf B}({\bm{\theta}}_{*};r_{0})$ . We also define

As usual, we adopt the convention that the infimum of an empty set is equal to $+\infty$ .

Define $\varphi_{1}({\bm{\theta}})=h(\|{\bm{\theta}}-{\bm{\theta}}_{*}\|_{2})$ , with $h$ a non-decreasing function with

where, in the last inequality, we used Lemma 7.12 $(ii)$ . Next, define

Applying again Eq. (7.1), we get, for $t\leq T_{*}$ ,

Together the last two bounds imply $T_{*}=\infty$ . Indeed assume by contradiction $T_{*}<\infty$ . Then either $T_{1}\leq T_{2}$ , $T_{1}<\infty$ , or $T_{2}<T_{1}$ , $T_{2}<\infty$ .

Consider the first case: $T_{1}\leq T_{2}$ , $T_{1}<\infty$ . Since $\langle\rho_{T_{1}},\varphi_{2}\rangle\geq{\varepsilon}_{1}^{2}$ but $\langle\rho_{0},\varphi_{2}\rangle\leq r_{0}^{2}\leq{\varepsilon}_{1}^{2}/4$ , there exists $t<T_{*}$ such that $\partial_{t}\langle\rho_{t},\varphi_{2}\rangle>0$ . However, this contradicts Eq. (7.126). Consider then the second case: $T_{2}<T_{1}$ , $T_{2}<\infty$ . This implies $\langle\rho_{T_{2}},\varphi_{1}\rangle>0$ , but on the other hand $\langle\rho_{0},\varphi_{1}\rangle=0$ . Hence, there exists $t<T_{*}$ such that $\partial_{t}\langle\rho_{t},\varphi_{1}\rangle>0$ . However, this contradicts Eq. (7.122).

We conclude that $T_{*}=\infty$ and hence we can apply Eq. (7.126) for any $t$ , thus obtaining $\partial_{t}\langle\varphi_{2},\rho_{t}\rangle\leq-\lambda\,\langle\varphi_{2},\rho_{t}\rangle$ and hence $\langle\varphi_{2},\rho_{t}\rangle\leq(r_{0}^{2}/2)e^{-\lambda t}$ , which concludes the proof. ∎
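
The passage from the differential inequality to exponential decay is the standard integrating-factor argument, sketched below for a generic differentiable function $f$ (a placeholder name standing for $t\mapsto\langle\varphi_{2},\rho_{t}\rangle$).

```latex
% f(t) := \langle \varphi_2, \rho_t \rangle, assumed differentiable in t
\partial_t f(t) \le -\lambda f(t)
\;\Longrightarrow\;
\partial_t\big(e^{\lambda t} f(t)\big) = e^{\lambda t}\big(\partial_t f(t) + \lambda f(t)\big) \le 0
\;\Longrightarrow\;
f(t) \le f(0)\, e^{-\lambda t}.
```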

7 Proof of Theorem 7: Instability conditions

In this section we will prove the instability result of Theorem 7. Throughout the section, we assume $\xi(t)\equiv 1/2$ . We will use several times the nonlinear dynamics, defined for $\rho_{t}$ a solution of Eq. (7.1) with initial condition $\rho_{0}$ :

where ${\bm{P}}^{\perp}_{{\bm{u}}}={\bm{I}}-{\bm{u}}{\bm{u}}^{{\sf T}}$ is the projector orthogonal to vector ${\bm{u}}$ .

First consider the case $d=1$ : in this case, the assumption $\nu({\sf B}({\bm{x}}_{0};r))\geq 1-{\varepsilon}$ is not required. Denote by $F$ the distribution function associated to $\nu$ (i.e. $F(x)\equiv\nu((-\infty,x])$ ). By assumption $F$ is differentiable with $F^{\prime}(x)\leq M$ . In order to construct the desired coupling, let $Z$ be a random variable uniformly distributed in $[0,1]$ . For a small constant $\xi_{0}>0$ , define the random variables $(X_{1},X_{2})$ by letting

(Note that $X_{2}$ is not defined for $Z=\xi_{0}$ but this is a zero-probability event.) On the event $\{Z>\xi_{0}\}$ (which has probability $1-\xi_{0}$ ), we have, for some $W\in[X_{1},X_{2}]$ ,

By choosing $\xi_{0}$ small enough, this proves the claim for $d=1$ .
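
A one-dimensional coupling of this type can be sketched with a shifted-quantile construction. The code below is an assumption-laden illustration rather than the paper's exact definition: it takes $X_{1}=F^{-1}(Z)$ and $X_{2}=F^{-1}(Z-\xi_{0}\bmod 1)$, uses a standard Gaussian as the example law $\nu$, and the name `shifted_quantile_coupling` is hypothetical. On $\{Z>\xi_{0}\}$ the mean value theorem gives $\xi_{0}=F(X_{1})-F(X_{2})\leq M\,|X_{1}-X_{2}|$, which the grid check below reflects.

```python
from statistics import NormalDist

NU = NormalDist(0.0, 1.0)   # example law nu (assumption): standard Gaussian
M = NU.pdf(0.0)             # supremum of the density, so F'(x) <= M

def shifted_quantile_coupling(z: float, xi0: float) -> tuple[float, float]:
    """Map one uniform draw z in (0, 1), z != xi0, to a coupled pair (X1, X2)."""
    x1 = NU.inv_cdf(z)
    # Shift z by xi0 modulo 1; the shifted variable is again uniform on (0, 1).
    z2 = z - xi0 if z > xi0 else z + 1.0 - xi0
    x2 = NU.inv_cdf(z2)
    return x1, x2

xi0 = 0.05
# Deterministic grid of points with z > xi0 (the event of probability 1 - xi0).
grid = [xi0 + (1.0 - xi0) * (k + 0.5) / 200 for k in range(200)]
gaps = [abs(a - b) for a, b in (shifted_quantile_coupling(z, xi0) for z in grid)]
```

Both coordinates have law $\nu$ because $Z$ and its shift modulo 1 are each uniform, and every gap on the grid is at least $\xi_{0}/M$.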

Consider next $d>1$ and assume without loss of generality ${\bm{u}}={\bm{e}}_{1}$ .

Let $\overline{\nu}(\,\cdot\,)=\nu(\,\cdot\,|\bm{X}\in{\sf B}({\bm{x}}_{0};r))$ , $\bm{X}_{a}^{b}\equiv(X_{a},\dots,X_{b})$ , and denote by $f_{1|[2,d]}$ the density of $\overline{\nu}(X_{1}\in\,\cdot\,|\bm{X}_{2}^{d})$ , and by $f_{[a,b]}$ the density of $\overline{\nu}(\bm{X}_{a}^{b}\in\,\cdot\,)$ . We then have

In order to construct the coupling, we sample ${\bm{Z}}\sim\nu$ . If ${\bm{Z}}\not\in{\sf B}({\bm{x}}_{0};r)$ , then we take $\bm{X}_{1}=\bm{X}_{2}={\bm{Z}}$ . If ${\bm{Z}}\in{\sf B}({\bm{x}}_{0};r)$ and $\max_{x_{1}}f_{1|[2,d]}(x_{1}|{\bm{Z}}_{2}^{d})>M/\Delta$ , we also take $\bm{X}_{1}=\bm{X}_{2}={\bm{Z}}$ . Otherwise we have ${\bm{Z}}\in{\sf B}({\bm{x}}_{0};r)$ and $\max_{x_{1}}f_{1|[2,d]}(x_{1}|{\bm{Z}}_{2}^{d})\leq M/\Delta$ , then we sample $(X_{1,1},X_{2,1})$ from the coupling developed in the case $d=1$ applied to $f_{1|[2,d]}(\,\cdot\,|{\bm{Z}}_{2}^{d})$ , and set $\bm{X}_{1}=(X_{1,1},{\bm{Z}}_{2}^{d})$ , $\bm{X}_{2}=(X_{2,1},{\bm{Z}}_{2}^{d})$ . Now define $\gamma$ to be the joint distribution of $\bm{X}_{1},\bm{X}_{2}$ . Then $\gamma$ is a coupling of $\nu$ with itself.

Hence, we can choose $\Delta,\xi_{0}$ small enough so that the claim (7.128) holds. ∎

By choosing ${\varepsilon}_{0,\#}$ small enough, we can ensure $\|\nabla\Psi({\bm{\theta}};\rho_{t})-\nabla\Psi({\bm{\theta}};\rho_{*})\|_{2}\leq g_{0}/3$ for all ${\bm{\theta}}$ and all $t\geq t_{0}$ .

for all $t\geq t_{0}$ . Let $\overline{\rho}_{t_{0}}$ be the conditional probability measure of $\rho_{t_{0}}$ given ${\bm{\theta}}\in{\sf B}({\bm{\theta}}_{*};r_{0})$ . By Lemma 7.11, $\overline{\rho}_{t_{0}}$ has a density upper bounded by a constant $M=M({\varepsilon}_{0},t_{0})$ (note that $\overline{\rho}_{t_{0}}(S)\leq\rho_{t_{0}}(S)/(p_{*}-{\varepsilon}_{0})$ ).

Set $\bm{H}_{0}=\bm{H}_{0}(\rho_{*})=\nabla^{2}\Psi({\bm{\theta}}_{*};\rho_{*})$ . Since ${\bm{\theta}}_{*}$ is a critical point of ${\bm{\theta}}\mapsto\Psi({\bm{\theta}};\rho_{*})$ , for any $\delta>0$ , we can find $r_{1}(\delta)>0$ such that

As shown in the proof of Theorem 6, the function $({\bm{\theta}},\rho)\mapsto\nabla^{2}\Psi({\bm{\theta}};\rho)$ is continuous when the space of probability distributions $\rho$ is endowed with the weak topology. Analogously $\rho\mapsto\nabla\Psi({\bm{\theta}}_{*};\rho)$ is continuous in the weak topology. Hence for this $\delta>0$ and $r_{1}(\delta)>0$ , there exists ${\varepsilon}_{0,*}(\delta,r_{1})>0$ small enough such that, the following inequalities hold

Let us emphasize that $r_{1}$ depends on $\delta$ but can be taken to be independent of ${\varepsilon}_{0}$ . Further, by an application of the intermediate value theorem, for all ${\bm{\theta}}\in{\sf B}({\bm{\theta}}_{*};r_{1})$ ,

For $r_{0}<r_{1}$ , ${\bm{\theta}}^{t_{0}}\in{\sf B}({\bm{\theta}}_{*};r_{0})$ , we let $({\bm{\theta}}^{t})_{t\geq t_{0}}$ be the solution of Eq. (7.127) with this initial condition. We then define

Under the conditions of Theorem 7, there exist $r_{1}>0$ and ${\varepsilon}_{0,*}>0$ such that, for all $r_{0}\leq r_{1}$ , ${\varepsilon}_{0}\leq{\varepsilon}_{0,*}$ , there exists $T_{\mbox{\tiny\rm UB}}({\varepsilon}_{0},r_{0},r_{1},t_{0})$ such that the following happens. If $d_{\mbox{\tiny\rm BL}}(\rho_{t},\rho_{*})\leq{\varepsilon}_{0}$ and $|\rho_{t}({\sf B}({\bm{\theta}}_{*};r_{0}))-p_{*}|\leq{\varepsilon}_{0}$ for all $t\geq t_{0}$ for some $t_{0}$ , then

We fix a $\delta\leq(\lambda_{1}-\lambda_{2})/10$ . Then we choose $r_{1}>0$ and ${\varepsilon}_{0,*}>0$ such that Eq. (7.144) holds, with an additional requirement that ${\varepsilon}_{0,*}<p_{*}/10$ . We will prove this lemma with this choice of $r_{1}$ and ${\varepsilon}_{0,*}$ .

We always denote $({\bm{\theta}}_{i}^{t})_{t\geq t_{0}}$ to be the solution of Eq. (7.127) with initial condition ${\bm{\theta}}_{i}^{t_{0}}$ , for $i=1,2$ . First we claim that, for $0<\delta\leq(\lambda_{1}-\lambda_{2})/10$ , assuming

then for any ${\bm{\theta}}^{t_{0}}_{1},{\bm{\theta}}^{t_{0}}_{2}\in{\sf B}({\bm{\theta}}_{*};r_{1})$ with ${\bm{P}}_{\perp}({\bm{\theta}}^{t_{0}}_{1}-{\bm{\theta}}^{t_{0}}_{2})={\mathbf{0}}$ , we have

for all $t\in[t_{0},t_{\mbox{\tiny\rm exit}}({\bm{\theta}}^{t_{0}}_{1},r_{1})\wedge t_{\mbox{\tiny\rm exit}}({\bm{\theta}}^{t_{0}}_{2},r_{1})]$ .

For now we assume this claim holds. Fix $r_{0}\leq r_{1}$ and ${\varepsilon}_{0}\leq{\varepsilon}_{0,*}$ . Define $\gamma$ to be the coupling of Lemma 7.13 corresponding to ${\bm{u}}$ which is the eigenvector corresponding to the least eigenvalue of $\bm{H}_{0}$ , and $\nu=\overline{\rho}_{t_{0}}$ which is the conditional measure of $\rho_{t_{0}}$ given ${\bm{\theta}}^{t_{0}}\in{\sf B}({\bm{\theta}}_{*};r_{0})$ . Note $\overline{\rho}_{t_{0}}$ has a density upper bounded by a constant $M=M({\varepsilon}_{0},t_{0})$ . By Lemma 7.13, we have $\gamma({\mathcal{E}})\geq 9/10$ , where

for some $Z=Z({\varepsilon}_{0},r_{0},t_{0})>0$ . Now we take $({\bm{\theta}}_{1}^{t_{0}},{\bm{\theta}}_{2}^{t_{0}})\in{\mathcal{E}}$ . Note that the assumption of this lemma gives $d_{\mbox{\tiny\rm BL}}(\rho_{t},\rho_{*})\leq{\varepsilon}_{0}\leq{\varepsilon}_{0,*}$ for all $t\geq t_{0}$ . According to Eq. (7.144), Eq. (7.149) holds, and by the claim above we have $\|{\bm{\theta}}^{t}_{1}-{\bm{\theta}}^{t}_{2}\|_{2}\geq(1/Z)e^{\lambda_{1}(t-t_{0})/2}$ for all $t\in[t_{0},t_{\mbox{\tiny\rm exit}}({\bm{\theta}}^{t_{0}}_{1},r_{1})\wedge t_{\mbox{\tiny\rm exit}}({\bm{\theta}}^{t_{0}}_{2},r_{1})]$ .

Define $T_{\mbox{\tiny\rm UB}}({\varepsilon}_{0},r_{0},r_{1},t_{0})=(2/\lambda_{1})\log(2Z\,r_{1})$ . Then for $t>T_{\mbox{\tiny\rm UB}}$ , we have $\|{\bm{\theta}}_{1}^{t}-{\bm{\theta}}^{t}_{2}\|_{2}\geq 2r_{1}$ . This is impossible if ${\bm{\theta}}_{1}^{t},{\bm{\theta}}^{t}_{2}\in{\sf B}({\bm{\theta}}_{*};r_{1})$ and hence we deduce $(t_{\mbox{\tiny\rm exit}}({\bm{\theta}}^{t_{0}}_{1},r_{1})\wedge t_{\mbox{\tiny\rm exit}}({\bm{\theta}}^{t_{0}}_{2},r_{1}))\leq T_{\mbox{\tiny\rm UB}}$ for all $({\bm{\theta}}^{t_{0}}_{1},{\bm{\theta}}^{t_{0}}_{2})\in{\mathcal{E}}$ .
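
The choice of $T_{\mbox{\tiny\rm UB}}$ can be checked by direct arithmetic: it is the elapsed time at which the lower bound $(1/Z)e^{\lambda_{1}(t-t_{0})/2}$ reaches $2r_{1}$. The sketch below uses hypothetical numerical values for $\lambda_{1}$, $Z$, $r_{1}$, and reads the time argument as the elapsed time $t-t_{0}$, which is an interpretation made here.

```python
import math

def t_upper_bound(lambda_1: float, z_const: float, r_1: float) -> float:
    """T_UB = (2 / lambda_1) * log(2 * Z * r_1), as defined in the text."""
    return (2.0 / lambda_1) * math.log(2.0 * z_const * r_1)

def separation_lower_bound(elapsed: float, lambda_1: float, z_const: float) -> float:
    """Lower bound (1/Z) * exp(lambda_1 * elapsed / 2) on ||theta_1 - theta_2||."""
    return math.exp(lambda_1 * elapsed / 2.0) / z_const

lambda_1, z_const, r_1 = 0.8, 50.0, 0.3   # hypothetical values for illustration
t_ub = t_upper_bound(lambda_1, z_const, r_1)
```

At elapsed time `t_ub` the bound equals $2r_{1}$, the diameter of ${\sf B}({\bm{\theta}}_{*};r_{1})$, and it exceeds it afterwards, so the two trajectories cannot both remain in the ball.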

Denoting by $S$ the event in the last expression, we obtain $\rho_{t_{0}}(S)\geq(p_{*}-{\varepsilon}_{0})\overline{\rho}_{t_{0}}(S)\geq(9/20)(p_{*}-{\varepsilon}_{0})\geq p_{*}/3$ by noting that ${\varepsilon}_{0}<p_{*}/10$ .

Proof of the claim. Define the quantities

We then have, for $t\in[t_{0},t_{\mbox{\tiny\rm exit}}({\bm{\theta}}^{t_{0}}_{1},r_{1})\wedge t_{\mbox{\tiny\rm exit}}({\bm{\theta}}^{t_{0}}_{2},r_{1})]$ ,

Summarizing, we obtained the inequalities

The matrix of coefficients on the right-hand side is

This has (un-normalized) left eigenvectors $(1,-v)$ , $(-v,1)$ with eigenvalues $\xi_{\pm}$ given by:

Since we took $\delta<(\lambda_{1}-\lambda_{2})/10$ , we have $v>0$ and $\xi_{+}\geq\lambda_{1}$ .

Multiplying the inequalities (7.154), (7.155) by $(1,-v)$ , we thus obtain

Since we assumed $x_{\perp}(t_{0})=0$ , for all $t\in[t_{0},t_{\mbox{\tiny\rm exit}}({\bm{\theta}}^{t_{0}}_{1},r_{1})\wedge t_{\mbox{\tiny\rm exit}}({\bm{\theta}}^{t_{0}}_{2},r_{1})]$ , we have

We next strengthen the last lemma and prove that trajectories that exit ${\sf B}({\bm{\theta}}_{*};r_{1})$ do not re-enter ${\sf B}({\bm{\theta}}_{*};r_{0})$ .

Under the conditions of Theorem 7, there exist $r_{0,*},r_{1}>0$ (with $r_{0,*}<r_{1}$ ) and ${\varepsilon}_{0,*}>0$ such that, for all $r_{0}\leq r_{0,*}$ , ${\varepsilon}_{0}\leq{\varepsilon}_{0,*}$ , there exists $T_{\mbox{\tiny\rm UB}}({\varepsilon}_{0},r_{0},r_{1},t_{0})$ such that the following happens. If $d_{\mbox{\tiny\rm BL}}(\rho_{t},\rho_{*})\leq{\varepsilon}_{0}$ and $|\rho_{t}({\sf B}({\bm{\theta}}_{*};r_{0}))-p_{*}|\leq{\varepsilon}_{0}$ for all $t\geq t_{0}$ for some $t_{0}$ , then

Let ${\bm{P}}_{+}$ be the projector onto the eigenspace of $-\bm{H}_{0}$ corresponding to positive eigenvalues, and ${\bm{P}}_{-}$ the projector onto the subspace corresponding to negative eigenvalues, and let $\lambda_{0}\equiv\min_{i\leq D}|\lambda_{i}(\bm{H}_{0})|$ be the least absolute value of the eigenvalues of $\bm{H}_{0}$ . By condition B1 of Theorem 7, we have $\lambda_{0}>0$ . Let $\lambda_{\max}$ denote the largest absolute value of the eigenvalues of $\bm{H}_{0}$ .

Fix a $\delta$ such that $0<\delta\leq\min\{\lambda_{0}/(1+\lambda_{0}+\lambda_{\max}),\sqrt{\lambda_{0}/\lambda_{\max}},\lambda_{1}-\lambda_{2},1\}/10$ , where $\lambda_{1},\lambda_{2}$ are as defined in Lemma 7.15. Next we choose $r_{1}$ as per Lemma 7.15, and we further require $\lambda_{0}r_{1}^{2}\leq\eta_{0}$ , where $\eta_{0}$ is as per condition B3 in the statement of Theorem 7. We take ${\varepsilon}_{0,*}$ to be the minimum of the parameter ${\varepsilon}_{0,*}$ as per Lemma 7.15 and the parameter ${\varepsilon}_{0,\#}$ as per Lemma 7.14, where in Lemma 7.14, we choose $u=\Psi({\bm{\theta}}_{*};\rho_{*})-\lambda_{0}r_{1}^{2}/8$ , and $\Delta=\lambda_{0}r_{1}^{2}/8$ . Then we will choose smaller $r_{1}$ and ${\varepsilon}_{0,*}$ so that Eq. (7.144) holds. Finally, we take $r_{0,*}=\delta r_{1}<r_{1}$ . We will prove this lemma with this choice of $r_{1}$ , ${\varepsilon}_{0,*}$ , and $r_{0,*}$ , and with the same function $T_{\mbox{\tiny\rm UB}}$ as per Lemma 7.15.

We bound the evolution of these quantities following the same argument used above for $x_{\parallel}(t)$ , $x_{\perp}(t)$ . Namely

For $t\in[t_{*}({\bm{\theta}}^{t_{0}};r_{1},\delta),t_{\mbox{\tiny\rm exit}}({\bm{\theta}}^{t_{0}};r_{1})]$ , we have $\sqrt{z_{+}(t)+z_{-}(t)}\geq\delta r_{1}$ . Using the inequality $\sqrt{a(a+b)}\leq a+b$ holding for non-negative $a$ and $b$ , we have

Proceeding analogously for $z_{-}$ , we arrive at the inequalities

Since $\delta\leq\lambda_{0}/(10(1+\lambda_{0}+\lambda_{\max}))$ , we can ensure that $\Psi({\bm{\theta}}^{t_{\mbox{\tiny\rm exit}}};\rho_{*})\leq\Psi({\bm{\theta}}_{*};\rho_{*})-\lambda_{0}r_{1}^{2}/4$ . By Lemma 7.14, since $d_{\mbox{\tiny\rm BL}}(\rho_{t},\rho_{*})\leq{\varepsilon}_{0,*}\leq{\varepsilon}_{0,\#}$ for all $t\geq t_{0}$ , we have $\Psi({\bm{\theta}}^{t};\rho_{*})\leq\Psi({\bm{\theta}}_{*};\rho_{*})-\lambda_{0}r_{1}^{2}/8$ for all $t\geq t_{\mbox{\tiny\rm exit}}({\bm{\theta}}^{t_{0}};r_{1})$ . Note for all ${\bm{\theta}}\in{\sf B}({\bm{\theta}}_{*};\delta r_{1})$ , we have $\Psi({\bm{\theta}};\rho_{*})\geq\Psi({\bm{\theta}}_{*};\rho_{*})-\lambda_{\max}\delta^{2}r_{1}^{2}/2$ . Since $\delta\leq\sqrt{\lambda_{0}/\lambda_{\max}}/10$ , we have ${\bm{\theta}}^{t}\not\in{\sf B}({\bm{\theta}}_{*};\delta r_{1})$ for all $t\geq t_{\mbox{\tiny\rm exit}}({\bm{\theta}}^{t_{0}};r_{1})$ .

This implies that, for any ${\bm{\theta}}^{t_{0}}\in{\sf B}({\bm{\theta}}_{*};r_{0})$ for $r_{0}\leq r_{0,*}$ with $t_{\mbox{\tiny\rm exit}}({\bm{\theta}}^{t_{0}},r_{1})\leq T_{\mbox{\tiny\rm UB}}({\varepsilon}_{0},r_{0},r_{1},t_{0})<\infty$ , it will never return to ${\sf B}({\bm{\theta}}_{*};r_{0})$ . This gives the desired result. ∎

Finally we upper bound the probability that ${\bm{\theta}}^{t}\in{\sf B}({\bm{\theta}}_{*};r_{0})$ for some $t>t_{0}$ , given that ${\bm{\theta}}^{t_{0}}\not\in{\sf B}({\bm{\theta}}_{*};r_{0})$ . We define

Under the conditions of Theorem 7, for any $\eta>0$ , there exist $r_{0,*}>0$ and ${\varepsilon}_{0,*}>0$ such that, for all $r_{0}\leq r_{0,*}$ , ${\varepsilon}_{0}\leq{\varepsilon}_{0,*}$ , the following happens. If $d_{\mbox{\tiny\rm BL}}(\rho_{t},\rho_{*})\leq{\varepsilon}_{0}$ and $|\rho_{t}({\sf B}({\bm{\theta}}_{*};r_{0}))-p_{*}|\leq{\varepsilon}_{0}$ for all $t\geq t_{0}$ for some $t_{0}$ , then

Since we also had $\rho_{t}({\sf B}({\bm{\theta}}_{*};r_{0}))\geq p_{*}-{\varepsilon}_{0}$ for all $t\geq t_{0}$ , and $\eta,{\varepsilon}_{0}\leq p_{*}/10$ , we reach a contradiction.

Centered isotropic Gaussians

In this section we consider the centered isotropic Gaussians example discussed in the main text. That is, we assume the joint law of $(y,{\bm{x}})$ to be as follows:

With probability $1/2$ : $y=+1$ , ${\bm{x}}\sim{\sf N}({\mathbf{0}},(1+\Delta)^{2}{\bm{I}}_{d})$ .

With probability $1/2$ : $y=-1$ , ${\bm{x}}\sim{\sf N}({\mathbf{0}},(1-\Delta)^{2}{\bm{I}}_{d})$ .
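
The data model above is straightforward to simulate. The following is a minimal sketch using only the Python standard library; the function name `sample_example`, the seed, and the values $d=40$, $\Delta=0.2$ are illustrative assumptions.

```python
import random

def sample_example(d: int, delta: float, rng: random.Random) -> tuple[int, list[float]]:
    """Draw one (y, x) pair from the two-class centered isotropic Gaussian model."""
    y = 1 if rng.random() < 0.5 else -1
    # Covariance (1 +/- delta)^2 I_d means per-coordinate standard deviation 1 +/- delta.
    std = 1.0 + delta if y == 1 else 1.0 - delta
    x = [rng.gauss(0.0, std) for _ in range(d)]
    return y, x

rng = random.Random(0)   # fixed seed, for reproducibility of the sketch only
batch = [sample_example(d=40, delta=0.2, rng=rng) for _ in range(1000)]
```

The two classes share the same mean and differ only in scale, so the label can be inferred only from the norm of ${\bm{x}}$.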

$x\mapsto\sigma(x)$ is bounded, non-decreasing, Lipschitz continuous. Its weak derivative $x\mapsto\sigma^{\prime}(x)$ is Lipschitz in a neighborhood of $0$ .

$q$ is analytic on $(0,\infty)$ with $\sup_{r\in[0,\infty]}q^{\prime\prime}(r)<\infty$ .

$q^{\prime}(r)>0$ for all $r\in(0,\infty)$ , with $\sup_{r\in[0,\infty]}q^{\prime}(r)<\infty$ , and $\lim_{r\to 0}q^{\prime}(r)=\lim_{r\to\infty}q^{\prime}(r)=0$ .

$-\infty<q(0+)<-1$ , $1<q(+\infty)<\infty$ , and $-1<(q(0+)+q(+\infty))/2<1$ .

Letting $Z(r)\equiv q^{\prime}(\tau_{-}r)/q^{\prime}(\tau_{+}r)$ for some $\tau_{+}>\tau_{-}>0$ we have $Z^{\prime}(r)>0$ for all $r\in(0,\infty)$ .

Note that condition S1 and part of S2 are implied by S0, but we list them here for convenience. Some of these assumptions can be relaxed at the cost of extra technical work. In the interest of simplicity, we prefer to avoid being overly general.

In particular, we choose $s_{1}=-2.5$ , $s_{2}=7.5$ , $t_{1}=0.5$ , $t_{2}=1.5$ in our simulations. In section 8.5, we check that this choice satisfies the above assumptions.
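
For concreteness, the parameters $s_{1},s_{2},t_{1},t_{2}$ can be plugged into a piecewise linear activation. The functional form below (constant $s_{1}$ for $t\leq t_{1}$, constant $s_{2}$ for $t\geq t_{2}$, linear interpolation in between) is an assumption of this sketch, since the defining display is not reproduced in this section.

```python
def sigma(t: float, s1: float = -2.5, s2: float = 7.5,
          t1: float = 0.5, t2: float = 1.5) -> float:
    """Piecewise linear activation (assumed form): s1 below t1, s2 above t2."""
    if t <= t1:
        return s1
    if t >= t2:
        return s2
    # Linear interpolation between (t1, s1) and (t2, s2).
    return s1 + (s2 - s1) * (t - t1) / (t2 - t1)

values = [sigma(0.01 * k) for k in range(0, 301)]   # grid on [0, 3]
```

Under this assumed form the function is bounded, non-decreasing, and Lipschitz, and it is constant in a neighborhood of the origin.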

Throughout this section, we set $\tau_{\pm}=(1\pm\Delta)$ and $q_{+}(r)=q(\tau_{+}r)$ , $q_{-}(r)=q(\tau_{-}r)$ . Also, we will assume $\xi(t)=1/2$ , since other choices of $\xi(\,\cdot\,)$ merely amount to a time reparametrization.

Before analyzing our model, we introduce the function space and the space of probability measures we will work on. We equip the set $[0,\infty]$ with a metric ${\bar{d}}$ , where ${\bar{d}}(x,y)=|1/(1+x)-1/(1+y)|$ for any $x,y\in[0,\infty]$ . Then $([0,\infty],{\bar{d}})$ is a compact metric space, and we will still denote it by $[0,\infty]$ for simplicity of notation. We denote by $C_{b}([0,\infty])$ the set of bounded continuous functions on $[0,\infty]$ , where continuity is defined using the topology generated by ${\bar{d}}$ . More explicitly, we have the isomorphism

Because of conditions ${\sf S2}$ and ${\sf S3}$ , we have $q,q^{\prime}\in C_{b}([0,\infty])$ .
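
To make the compactification concrete, here is a small sketch of the metric ${\bar{d}}$ in which the point $+\infty$ is represented by the floating-point infinity; the function name `d_bar` is an illustrative choice.

```python
import math

def d_bar(x, y):
    """Metric on the completed half-line [0, +inf]:
    d_bar(x, y) = |1/(1+x) - 1/(1+y)|, with 1/(1+inf) read as 0."""
    fx = 0.0 if math.isinf(x) else 1.0 / (1.0 + x)
    fy = 0.0 if math.isinf(y) else 1.0 / (1.0 + y)
    return abs(fx - fy)

diameter = d_bar(0.0, math.inf)   # distance between the two endpoints
far_point = d_bar(1e12, math.inf) # large finite radii are close to +inf
```

In this metric a sequence $r_{k}\to\infty$ converges to the point $+\infty$, which is what allows mass to escape to infinity without leaving the space.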

Let $\mathscrsfs{P}([0,\infty])$ be the set of probability measures on $[0,\infty]$ . By Prokhorov’s theorem, there exists a complete metric ${\bar{d}}_{\mathscrsfs{P}}$ on $\mathscrsfs{P}([0,\infty])$ that metrizes the topology of weak convergence, so that $(\mathscrsfs{P}([0,\infty]),{\bar{d}}_{\mathscrsfs{P}})$ is a compact metric space. In this section, we write ${\overline{\mathscrsfs{P}}}=\mathscrsfs{P}([0,\infty])$ .

Since the distribution of ${\bm{x}}$ is invariant under rotations for each of the two classes, so are the functions

where $\mu_{\mbox{\tiny\rm Haar}}$ is the Haar measure over the group of orthogonal rotations. Since $\rho\mapsto R(\rho)$ is convex, $R(\rho_{s})\leq R(\rho)$ .

We therefore restrict ourselves to $\rho$ ’s that are invariant under rotations. In other words, under $\rho$ , the vector ${\bm{w}}$ is uniformly random conditional on $\|{\bm{w}}\|_{2}$ . We denote by $\overline{\rho}$ the probability distribution of $\|{\bm{w}}\|_{2}$ when ${\bm{w}}\sim\rho$ and we let $\overline{R}_{d}(\overline{\rho})$ denote the resulting risk. We then have

where $\Theta\sim(1/Z_{d})\sin^{d-2}\theta\cdot{\bm{1}}\{\theta\in[0,\pi]\}{\rm d}\theta$ .
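
The density of $\Theta$ is that of the angle between a fixed axis and a uniformly random direction in ${\mathbb{R}}^{d}$, so $\cos\Theta$ has second moment $1/d$ and $\Theta$ concentrates around $\pi/2$ as $d$ grows. The sketch below checks this by numerical integration; the function name and grid size are illustrative.

```python
import math

def cos_theta_second_moment(d, n_grid=20000):
    """E[cos(Theta)^2] when Theta has density proportional to sin(theta)^(d-2)
    on [0, pi], computed with the midpoint rule."""
    h = math.pi / n_grid
    num = 0.0
    den = 0.0
    for i in range(n_grid):
        t = (i + 0.5) * h
        w = math.sin(t) ** (d - 2)   # unnormalized density; Z_d cancels in the ratio
        num += w * math.cos(t) ** 2
        den += w
    return num / den

m3 = cos_theta_second_moment(3)
m100 = cos_theta_second_moment(100)
```

Since $\cos\Theta$ is of order $1/\sqrt{d}$, the two directions entering $u_{d}$ become nearly orthogonal for large $d$, which is consistent with the simpler $d=\infty$ expression.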

As $d\to\infty$ , we have $u_{d}(r_{1},r_{2})\to u_{\infty}(r_{1},r_{2})$ (uniformly over compact sets), with

For $d=\infty$ , we have the simpler expression

The following theorem provides a characterization of global minimizers of $\overline{R}_{d}(\overline{\rho})$ .

$\overline{\rho}_{*}$ is a global minimizer of $\overline{R}_{d}(\overline{\rho})$ if and only if ${\rm supp}(\overline{\rho}_{*})\subseteq\arg\min_{r}\psi_{d}(r;\overline{\rho}_{*})$ .

In particular, $\overline{\rho}_{*}=\delta_{r_{*}}$ is a global minimizer of $\overline{R}_{d}(\overline{\rho})$ if and only if $v(r)+u_{d}(r,r_{*})\geq v(r_{*})+u_{d}(r_{*},r_{*})$ for all $r$ .

Point 1 is essentially a special case of the second part of Proposition 1 in the main text (cf. Eq. (6.7)) and follows by the same argument. Point 2 follows by taking $\overline{\rho}_{*}=\delta_{r_{*}}$ . ∎

Given the last result, it is interesting to understand whether the optimal radial distribution $\overline{\rho}_{*}$ is a single point mass or not. Under the ansatz $\overline{\rho}=\delta_{r}$ (a single point mass at radius $r$ ) we obtain an effective risk $\overline{R}_{d}^{(1)}(r)\equiv\overline{R}_{d}(\delta_{r})$ defined by $\overline{R}_{d}^{(1)}(r)=1+2v(r)+u_{d}(r,r)$ , which is plotted in Figure 11.6 for the case of our running example (8.1), and $\Delta=0.4$ .

Let $r_{*}=r_{*}(\Delta,d)$ be the minimizer of $\overline{R}_{d}^{(1)}(r)$ , and define, for $d\leq\infty$ ,

In the case $d=\infty$ , the minimization problem simplifies further. Either the minimum risk is $0$ , or it is achieved at a point mass $\overline{\rho}_{*}=\delta_{r_{*}}$ .

Consider $d=\infty$ . Recall ${\overline{\mathscrsfs{P}}}=\mathscrsfs{P}([0,\infty])$ . In this case $\Delta_{\infty}$ defined as per Eq. (8.16) is such that $\Delta_{\infty}\in(0,1)$ . Further

For $\Delta<\Delta_{\infty}$ , $\inf_{\overline{\rho}\in{\overline{\mathscrsfs{P}}}}\overline{R}_{\infty}(\overline{\rho})>0$ and the unique global minimizer of risk function $\overline{R}_{\infty}(\overline{\rho})$ is a point mass located at some $r_{*}(\Delta)\in(0,\infty)$ .

For $\Delta\geq\Delta_{\infty}$ , all global minimizers of the risk function $\overline{R}_{\infty}(\overline{\rho})$ have risk zero, and there exists a global minimizer that has compact support bounded away from $0$ .

Recall the definitions $q_{+}(r)=q(\tau_{+}r)$ and $q_{-}(r)=q(\tau_{-}r)$ . Further, we define the set $\Gamma\subseteq[0,1]$ by

According to condition S3, for $\Delta=1$ , we have $q_{-}(r)=q(0)<-1$ and $q_{+}(+\infty)=q(+\infty)>+1$ . Since $q$ is continuous, it is easy to see that there exists an ${\varepsilon}>0$ , such that $[1-{\varepsilon},1]\subseteq\Gamma$ . Further, for $\Delta=0$ we have $q_{+}(r)=q_{-}(r)$ . By continuity, there exists an ${\varepsilon}>0$ , such that $[0,{\varepsilon}]\subseteq[0,1]\setminus\Gamma$ .

Since $q$ is an increasing function, we have

By the remarks above, we have $0<\Delta_{\infty}<1$ . Notice that this definition does not coincide with the one in Eq. (8.16). However, the proof below (together with Proposition 4) implies that the two definitions actually coincide.
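
For intuition, the threshold can be computed explicitly once $q$ is known. The sketch below reads $\Delta_{\infty}$ as the smallest $\Delta$ for which some $r>0$ satisfies $q(\tau_{+}r)\geq 1$ and $q(\tau_{-}r)\leq-1$, which is how the threshold is used in the proof. Since $q$ is increasing, such an $r$ exists if and only if $(1+\Delta)/(1-\Delta)\geq q^{-1}(1)/q^{-1}(-1)$. The function `q_toy` is a hypothetical stand-in with the same limits as in condition S3; it is not the $q$ of the paper.

```python
def q_toy(r):
    """Hypothetical stand-in for q: increasing, q(0) = -2.5, q(+inf) = 2.5."""
    return -2.5 + 5.0 * r / (1.0 + r)

def invert_increasing(f, y, lo=0.0, hi=1e9, iters=200):
    """Bisection for f(r) = y, assuming f is increasing and f(lo) <= y <= f(hi)."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if f(mid) < y:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def delta_threshold(q):
    """Smallest Delta in [0, 1) such that q((1+Delta) r) >= 1 and
    q((1-Delta) r) <= -1 can hold for a common r > 0."""
    r_plus = invert_increasing(q, 1.0)     # q(r_plus) = 1
    r_minus = invert_increasing(q, -1.0)   # q(r_minus) = -1
    kappa = r_plus / r_minus               # required ratio (1+Delta)/(1-Delta)
    return (kappa - 1.0) / (kappa + 1.0)

delta_inf_toy = delta_threshold(q_toy)
```

For the stand-in, $q^{-1}(1)=7/3$ and $q^{-1}(-1)=3/7$, so the ratio is $49/9$ and the threshold equals $20/29$, strictly between $0$ and $1$ as the remarks above require.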

Step 1. Prove that $\inf_{\overline{\rho}\in{\overline{\mathscrsfs{P}}}}\overline{R}_{\infty}(\overline{\rho})>0$ for $\Delta<\Delta_{\infty}$ .

First, we consider the optimization problem

We claim that, for $\Delta<\Delta_{\infty}$ we have $f_{*}<0$ . Indeed, for any $\lambda\in[0,+\infty)$ , we have the following upper bound

Since $q_{+}-\lambda\,q_{-}\in C_{b}([0,+\infty])$ , then $L(\,\cdot\,,\lambda)$ is continuous in $\overline{\rho}$ in weak topology. By the compactness of ${\overline{\mathscrsfs{P}}}$ , the supremum of $L(\,\cdot\,,\lambda)$ is attained by some $\overline{\rho}_{\lambda}\in{\overline{\mathscrsfs{P}}}$ . This $\overline{\rho}_{\lambda}$ should satisfy

Let $h(r)\equiv q_{+}(r)-\lambda q_{-}(r)$ . Note the supremum of $h$ should either satisfy

for $r\in(0,\infty)$ , or the supremum should be attained at the boundary, i.e. at $0$ or $+\infty$ . Since, by condition S4, $[q_{-}^{\prime}(r)/q_{+}^{\prime}(r)]^{\prime}>0$ for $r\in(0,\infty)$ , Eq. (8.21) has at most one solution $r_{*}\in(0,\infty)$ .

Assume that there exists $r_{*}\in(0,\infty)$ such that $h^{\prime}(r_{*})=0$ . Then we have $h^{\prime}(r)>0$ for $0<r<r_{*}$ , and $h^{\prime}(r)<0$ for $r_{*}<r<+\infty$ , whence ${\rm supp}(\overline{\rho}_{\lambda})=\{r_{*}\}$ . If $h^{\prime}(r)=0$ does not have a solution in $(0,\infty)$ , the supremum of $h(r)$ can only be achieved at $0$ or $+\infty$ . Therefore, ${\rm supp}(\overline{\rho}_{\lambda})=\{0\}$ or ${\rm supp}(\overline{\rho}_{\lambda})=\{+\infty\}$ . This shows that, for any $\lambda\in[0,+\infty)$ , $\sup_{\overline{\rho}\in{\overline{\mathscrsfs{P}}}}L(\overline{\rho},\lambda)$ is achieved by a point mass. Therefore, we have

For $\Delta<\Delta_{\infty}$ , the right hand side of the above inequality is less than $0$ . Therefore, we cannot have a probability distribution $\overline{\rho}$ such that $\langle q_{+},\overline{\rho}\rangle=1$ and $\langle q_{-},\overline{\rho}\rangle=-1$ . Hence the infimum of the risk cannot be $0$ .

Step 2. Show that the global minimizer should be a delta function for $\Delta<\Delta_{\infty}$ .

According to Proposition 1, the global minimizer $\overline{\rho}_{*}\in{\overline{\mathscrsfs{P}}}$ should satisfy

with $\psi_{\infty}$ given in Eq. (8.12).

As proved in the last step, for $\Delta<\Delta_{\infty}$ , we cannot have both $\lambda_{+}(\overline{\rho}_{*})=0$ and $\lambda_{-}(\overline{\rho}_{*})=0$ . The argument given above also implies that $\psi_{\infty}(r;\overline{\rho}_{*})$ is minimized at a unique point, and hence the support of $\overline{\rho}_{*}$ should be a single point. This proves the first part of the theorem.

For $\Delta\geq\Delta_{\infty}$ , there exists $r>0$ , such that $q(\tau_{+}r)\geq 1$ , and $q(\tau_{-}r)\leq-1$ . Therefore, there exists $r_{*}>0$ such that $q(\tau_{+}r_{*})-1=-1-q(\tau_{-}r_{*})={\varepsilon}_{*}\geq 0$ . Consider the following probability measure on $[0,+\infty]$ ,

It can be checked that $\overline{R}_{\infty}(\overline{\rho}_{*})=0$ .

We would like to show further that there exists a global minimizer that is compactly supported. We construct this global minimizer as follows. First, define

Then we know that $q_{-}(r_{0})=-1$ and $q_{+}(r_{0})\geq 1$ . Now for any $0\leq r\leq r_{0}$ , define $u(r)=q_{-}^{-1}(-2-q_{-}(r))$ . According to condition S3, we have $-1<[q(0)+q(+\infty)]/2<1$ , then $u(r)$ is well defined on $[0,r_{0}]$ . It is easy to see that $u(r_{0})=r_{0}$ , and $[q_{-}(r)+q_{-}(u(r))]/2=-1$ for any $0\leq r\leq r_{0}$ . Now we consider the function $z(r)=[q_{+}(r)+q_{+}(u(r))]/2-1$ . Note that $z(r_{0})>0$ , and $z(0)\leq[q(0)+q(\infty)]/2-1<0$ . Therefore, there exists $r_{*}$ satisfying $0<r_{*}\leq r_{0}$ such that $z(r_{*})=0$ . Consider the following probability measure on $(0,+\infty)$ ,

It is easy to see that $\overline{R}_{\infty}(\overline{\rho}_{*})=0$ . ∎
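
The construction of the compactly supported minimizer can be traced numerically. The sketch below takes $r_{0}$ to be the solution of $q_{-}(r_{0})=-1$, builds $u(r)=q_{-}^{-1}(-2-q_{-}(r))$, and solves $z(r_{*})=0$ by bisection. It then assumes that the minimizer is the equal-weight mixture of point masses at $r_{*}$ and $u(r_{*})$, since the displayed formula is not reproduced here. The function `q_toy` and the value $\Delta=0.8$ are hypothetical stand-ins, not the paper's choices.

```python
def q_toy(r):
    """Hypothetical stand-in for q: increasing, q(0) = -2.5, q(+inf) = 2.5."""
    return -2.5 + 5.0 * r / (1.0 + r)

def invert_increasing(f, y, lo=0.0, hi=1e9, iters=200):
    """Bisection for f(r) = y, assuming f is increasing and f(lo) <= y <= f(hi)."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if f(mid) < y:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def two_point_minimizer(q, delta):
    """Return (r_star, u(r_star), q_plus, q_minus) for the two-point construction."""
    tau_plus, tau_minus = 1.0 + delta, 1.0 - delta
    q_plus = lambda r: q(tau_plus * r)
    q_minus = lambda r: q(tau_minus * r)
    r0 = invert_increasing(q_minus, -1.0)          # q_minus(r0) = -1
    if q_plus(r0) < 1.0:
        raise ValueError("delta is below the threshold: construction unavailable")
    # u pairs each r <= r0 with the radius that balances q_minus around -1.
    u = lambda r: invert_increasing(q_minus, -2.0 - q_minus(r))
    z = lambda r: 0.5 * (q_plus(r) + q_plus(u(r))) - 1.0
    lo, hi = 0.0, r0                               # z(lo) < 0 <= z(hi)
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if z(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    return hi, u(hi), q_plus, q_minus

r_a, r_b, q_plus, q_minus = two_point_minimizer(q_toy, 0.8)
mean_q_plus = 0.5 * (q_plus(r_a) + q_plus(r_b))     # should equal +1
mean_q_minus = 0.5 * (q_minus(r_a) + q_minus(r_b))  # should equal -1
```

Both averages match their targets, so the mixture satisfies $\langle q_{+},\overline{\rho}_{*}\rangle=1$ and $\langle q_{-},\overline{\rho}_{*}\rangle=-1$, and its support $\{r_{*},u(r_{*})\}$ is bounded away from $0$ and $+\infty$.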

2 Dynamics: Fixed points

We specialize the general evolution (7.1) to the present case. If $\rho_{0}$ is spherically symmetric, then $\rho_{t}$ is spherically symmetric for any $t\geq 0$ . We let $\overline{\rho}_{t}$ denote the distribution of $\|{\bm{w}}\|_{2}$ when ${\bm{w}}\sim\rho_{t}$ . This satisfies the following PDE:

We will view this as an evolution in the space of probability distributions on the completed half-line, $\mathscrsfs{P}([0,\infty])$ .
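
A particle discretization helps to visualize this evolution for $d=\infty$. The sketch below is illustrative only and relies on assumptions not contained in this excerpt: each radius follows $\dot{r}=-[\lambda_{+}q_{+}^{\prime}(r)+\lambda_{-}q_{-}^{\prime}(r)]$ with $\lambda_{+}=\langle q_{+},\overline{\rho}\rangle-1$ and $\lambda_{-}=\langle q_{-},\overline{\rho}\rangle+1$, and the risk is $\frac{1}{2}(\lambda_{+}^{2}+\lambda_{-}^{2})$. A different normalization of $\lambda_{\pm}$ only rescales time. The function `q_toy`, the initialization and the step size are hypothetical.

```python
def q_toy(x):
    """Hypothetical stand-in for q: increasing, q(0) = -2.5, q(+inf) = 2.5."""
    return -2.5 + 5.0 * x / (1.0 + x)

def dq_toy(x):
    """Derivative of q_toy."""
    return 5.0 / (1.0 + x) ** 2

def simulate(delta=0.4, n=50, dt=0.002, steps=2000):
    """Explicit Euler scheme for n radial particles; returns final radii and risks."""
    tau_p, tau_m = 1.0 + delta, 1.0 - delta
    r = [0.5 + i / (n - 1) for i in range(n)]   # radii spread over [0.5, 1.5]
    risks = []
    for _ in range(steps):
        lam_p = sum(q_toy(tau_p * x) for x in r) / n - 1.0
        lam_m = sum(q_toy(tau_m * x) for x in r) / n + 1.0
        risks.append(0.5 * (lam_p ** 2 + lam_m ** 2))
        # Move each particle against the derivative of the potential;
        # radii are clamped at 0 so that they stay on the half-line.
        r = [max(x - dt * (lam_p * tau_p * dq_toy(tau_p * x)
                           + lam_m * tau_m * dq_toy(tau_m * x)), 0.0)
             for x in r]
    return r, risks

radii, risks = simulate()
```

With a small step size the recorded risk decreases along the trajectory, in line with the monotonicity of $t\mapsto\overline{R}_{\infty}(\overline{\rho}_{t})$ used later in the convergence proof.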

In analogy with Proposition 2, we can prove the following characterization of fixed points.

A distribution $\overline{\rho}\in\mathscrsfs{P}([0,\infty])$ is a fixed point of the PDE (8.22) if and only if

Notice that, in particular, global minimizers of $\overline{R}_{d}(\overline{\rho})$ are fixed points of this evolution, but not vice versa. The next result classifies fixed points.

Consider $d=\infty$ and recall the definition of $\lambda_{+}(\overline{\rho})$ and $\lambda_{-}(\overline{\rho})$ given by Eqs. (8.13) and (8.14). Then the fixed points of the PDE (8.22) (i.e. the probability measures $\overline{\rho}\in\mathscrsfs{P}([0,\infty])$ satisfying (8.23)) are of one of the following types

A fixed point with $\lambda_{+}(\overline{\rho})=\lambda_{-}(\overline{\rho})=0$ , and hence with zero risk (type $(a)$ ).

A point mass $\overline{\rho}_{r_{*}}=\delta_{r_{*}}$ at some location $r_{*}\not\in\{0,+\infty\}$ , but not of type $(a)$ .

A mixture of the type $\overline{\rho}=a_{0}\delta_{0}+a_{\infty}\delta_{+\infty}+a\delta_{r_{*}}$ , but not of type $(a)$ or $(b)$ .

For $\Delta<\Delta_{\infty}$ , the PDE has a unique fixed point of type $(b)$ , with $\lambda_{+}(\overline{\rho}_{*})<0$ and $\lambda_{-}(\overline{\rho}_{*})>0$ ; it has no type- $(a)$ fixed points; it may have fixed points of type $(c)$ .

For $\Delta>\Delta_{\infty}$ , the PDE has some fixed points of type $(b)$ , with $\lambda_{+}(\overline{\rho}_{*})>0$ and $\lambda_{-}(\overline{\rho}_{*})<0$ ; it also has some type- $(a)$ fixed points; it may have fixed points of type $(c)$ .

For $\Delta=\Delta_{\infty}$ , the PDE has a unique fixed point of type $(a)$ which is also a delta function at some location $r_{*}$ , and no type- $(b)$ fixed points; it may have fixed points of type $(c)$ .

We use the characterization of fixed points in Proposition 5. Recall that $\psi_{\infty}(r;\overline{\rho}_{*})$ is defined as in Equation (8.12). The derivative $\partial_{r}\psi_{\infty}(r;\overline{\rho})$ gives

If a fixed point has $\lambda_{+}(\overline{\rho}_{*})=\lambda_{-}(\overline{\rho}_{*})=0$ , then $\overline{R}_{\infty}(\overline{\rho}_{*})=0$ . This is a type- $(a)$ fixed point. Consider then the case $(\lambda_{+}(\overline{\rho}_{*}),\lambda_{-}(\overline{\rho}_{*}))\neq(0,0)$ . For the same reason as in the proof of Theorem 8, we conclude that $\partial_{r}\psi_{\infty}(r;\overline{\rho}_{*})$ has at most three zeros, two of which are located at $0$ and $+\infty$ . This proves that all fixed points are of type $(a)$ , $(b)$ or $(c)$ .

We already proved in Theorem 8 that, for $\Delta<\Delta_{\infty}$ , $\inf_{\overline{\rho}}\overline{R}_{\infty}(\overline{\rho})>0$ . Therefore, for $\Delta<\Delta_{\infty}$ , there are no type- $(a)$ fixed points.

We next prove that, for $\Delta<\Delta_{\infty}$ , the fixed point of type $(b)$ is unique. The location of the delta fixed point should satisfy

Note that $\partial_{r}\psi_{\infty}(r;\delta_{r})<0$ for $r>0$ small enough, and $\partial_{r}\psi_{\infty}(r;\delta_{r})>0$ for $r$ large enough, whence this equation has at least one solution $r_{*}\in(0,\infty)$ . In order to prove that it has a unique solution in $(0,+\infty)$ , define $r_{+}\equiv\inf\{r:q_{+}(r)\geq 1\}$ and $r_{-}\equiv\inf\{r:q_{-}(r)\geq-1\}$ . Note that $q^{\prime}_{+}(r_{*}),q^{\prime}_{-}(r_{*})>0$ and that, in order to satisfy Eq. (8.25), the terms $\lambda_{+}(\delta_{r_{*}})=1/2\cdot(q_{+}(r_{*})-1)$ and $\lambda_{-}(\delta_{r_{*}})=1/2\cdot(q_{-}(r_{*})+1)$ must have opposite signs. For $\Delta<\Delta_{\infty}$ , we must have $\lambda_{+}(\delta_{r_{*}})<0$ and $\lambda_{-}(\delta_{r_{*}})>0$ , and all stationary points should be within $[r_{-},r_{+}]$ . Note that $q_{-}^{\prime}(r)/q_{+}^{\prime}(r)$ is strictly increasing, and $[1-q_{+}(r)]/[1+q_{-}(r)]$ is decreasing on $[r_{-},r_{+}]$ . Therefore, the fixed point of type $\delta_{r_{*}}$ with $r_{*}\in(0,\infty)$ is unique.
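
The last step can be spelled out as follows. This is a sketch which assumes, consistently with the sign analysis based on $Z(r)$ later in this section, that $\partial_{r}\psi_{\infty}(r;\overline{\rho})$ is a positive multiple of $\lambda_{+}(\overline{\rho})q_{+}^{\prime}(r)+\lambda_{-}(\overline{\rho})q_{-}^{\prime}(r)$ ; the positive constant does not affect the zeros.

```latex
\begin{align*}
\partial_{r}\psi_{\infty}(r_{*};\delta_{r_{*}}) = 0
 &\iff \lambda_{+}(\delta_{r_{*}})\,q_{+}'(r_{*})
      + \lambda_{-}(\delta_{r_{*}})\,q_{-}'(r_{*}) = 0 \\
 &\iff \frac{q_{-}'(r_{*})}{q_{+}'(r_{*})}
      = -\frac{\lambda_{+}(\delta_{r_{*}})}{\lambda_{-}(\delta_{r_{*}})}
      = \frac{1-q_{+}(r_{*})}{1+q_{-}(r_{*})} .
\end{align*}
```

The left-hand side of the last identity is strictly increasing in $r_{*}$ and the right-hand side is decreasing on $[r_{-},r_{+}]$ , so the two curves can cross at most once.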

For $\Delta>\Delta_{\infty}$ , we must have $\lambda_{+}(\overline{\rho}_{*})>0$ and $\lambda_{-}(\overline{\rho}_{*})<0$ , and all solutions should be within $[r_{+},r_{-}]$ . There could possibly be multiple fixed points of type $\delta_{r_{*}}$ with $r_{*}\in[r_{+},r_{-}]$ .

If $\Delta=\Delta_{\infty}$ , it is easy to see that $\overline{\rho}_{*}=\delta_{r_{*}}$ , for some $r_{*}\in(0,\infty)$ , is the unique fixed point with zero risk and the unique fixed point that is a point mass. ∎

3 Dynamics: Convergence to global minimum for $d=\infty$

In this section, we define $\mathscrsfs{P}_{\mbox{\tiny\rm good}}$ as

We then prove that the $d=\infty$ dynamics converges to a global minimizer from any initialization in $\mathscrsfs{P}_{\mbox{\tiny\rm good}}$ .

Consider the PDE (8.22) for $d=\infty$ , with initialization $\overline{\rho}_{0}\in\mathscrsfs{P}_{\mbox{\tiny\rm good}}$ . It has a unique solution $(\overline{\rho}_{t})_{t\geq 0}$ , such that

Without loss of generality, we assume $\xi(t)=1/2$ . First we show the existence and uniqueness of solution of the PDE.

Step 1. Existence and uniqueness of solution. Mass $\overline{\rho}_{t}((0,\infty))=1$ for all $t$ .

According to conditions S1 - S3, $q(r)$ , $q^{\prime}(r)$ , and $q^{\prime\prime}(r)$ are uniformly bounded on $[0,\infty]$ . Recall that

Hence $v^{\prime}(r),\partial_{1}u_{\infty}(r_{1},r_{2}),v^{\prime\prime}(r),\partial_{11}^{2}u_{\infty}(r_{1},r_{2}),\partial_{12}^{2}u_{\infty}(r_{1},r_{2})$ are uniformly bounded. Recall we further assumed $\xi(t)\equiv 1/2$ . Therefore, conditions A1 and A3 are satisfied with $D=1$ , $V=v$ , and $U=u$ . By Remark 7.1, the PDE (8.22) for $d=\infty$ has a unique solution, which we denote by $(\overline{\rho}_{t})_{t\geq 0}$ .

Recalling the formula for $\partial_{r}\psi_{\infty}(r;\overline{\rho})$ given in Equation (8.24), it is easy to see that the assumption of Lemma 7.9 is satisfied with $d=1$ and $\Psi=\psi_{\infty}$ . Hence, we have $\overline{\rho}_{t}((0,\infty))=1$ for any $t<\infty$ .

Step 2. Classify the limiting set ${\mathcal{S}}_{*}$ .

Recall the definition of $(\mathscrsfs{P}([0,+\infty]),{\bar{d}}_{\mathscrsfs{P}})$ at the beginning of Section 8. Since $(\mathscrsfs{P}([0,+\infty]),{\bar{d}}_{\mathscrsfs{P}})$ is a compact metric space, and $(\overline{\rho}_{t})_{t\geq 0}$ is a continuous curve in this space, there exists a sequence of times $(t_{k})_{k\geq 1}$ such that $(\overline{\rho}_{t_{k}})_{k\geq 1}$ converges in the metric ${\bar{d}}_{\mathscrsfs{P}}$ to a probability distribution $\overline{\rho}_{*}\in\mathscrsfs{P}([0,+\infty])$ .

Analogously to Proposition 2 (using Eq. (8.22)), we have

Since $\overline{R}_{\infty}(\overline{\rho}_{t})\geq 0$ , we have

Recall the definition of $\lambda_{+}(\overline{\rho})$ and $\lambda_{-}(\overline{\rho})$ given by Eq. (8.13) and (8.14). Since $q\in C_{b}([0,\infty])$ , we have

Note $\partial_{r}\psi_{\infty}(r;\overline{\rho})$ is given by Eq. (8.24), and $q^{\prime}\in C_{b}([0,+\infty])$ , hence

In other words, any limiting point $\overline{\rho}_{*}$ of the PDE is a fixed point of the PDE (8.22).

Note $\overline{R}_{\infty}(\overline{\rho})=1/2\cdot[\lambda_{+}(\overline{\rho})^{2}+\lambda_{-}(\overline{\rho})^{2}]$ , we have

Note $\overline{R}_{\infty}(\overline{\rho}_{t})$ is decreasing with $t$ , hence

Let ${\mathcal{S}}_{*}={\mathcal{S}}_{*}(\overline{\rho}_{0})$ be the set of all limiting points of the $(\overline{\rho}_{t})_{t\geq 0}$ ,

Due to Lemma 7.10, ${\mathcal{S}}_{*}$ is a connected compact set. Since $\overline{R}_{\infty}(\overline{\rho}_{t})$ is decreasing as $t$ increases, we have $\overline{R}_{\infty}(\overline{\rho}_{*})\equiv\overline{R}_{*}$ is a constant for all $\overline{\rho}_{*}\in{\mathcal{S}}_{*}$ . Since we assumed $\overline{R}_{\infty}(\overline{\rho}_{0})<1$ , and $\overline{R}_{\infty}(\overline{\rho}_{t})$ is decreasing in $t$ , we have $\overline{R}_{*}<1$ .

Let $\overline{\rho}_{*}$ be a fixed point of the PDE such that $\lambda_{+}(\overline{\rho}_{*})\geq 0,\lambda_{-}(\overline{\rho}_{*})\geq 0$ or $\lambda_{+}(\overline{\rho}_{*})\leq 0,\lambda_{-}(\overline{\rho}_{*})\leq 0$ , but with $\lambda_{+}(\overline{\rho}_{*})$ and $\lambda_{-}(\overline{\rho}_{*})$ not both equal to $0$ . In this case, according to Eq. (8.24), $\partial_{r}\psi_{\infty}(r;\overline{\rho}_{*})$ must be strictly increasing or strictly decreasing in $r$ . Since ${\rm supp}(\overline{\rho}_{*})\subseteq\{r\in[0,\infty]:\partial_{r}\psi_{\infty}(r;\overline{\rho}_{*})=0\}$ , $\overline{\rho}_{*}$ must be a combination of two delta functions located at $0$ and $+\infty$ , i.e., $\overline{\rho}_{*}=a_{0}\delta_{0}+(1-a_{0})\delta_{\infty}$ . But for a fixed point of this type, it is easy to see that $\overline{R}_{\infty}(\overline{\rho}_{*})\geq 1$ . Such fixed points $\overline{\rho}_{*}$ cannot be limiting points of the PDE since $\overline{R}_{\infty}(\overline{\rho}_{0})<1$ .

Since ${\mathcal{S}}_{*}$ is a connected set, $L({\mathcal{S}}_{*})$ should also be a connected set. Further notice that $\overline{R}_{\infty}(\overline{\rho}_{*})=1/2\cdot[\lambda_{+}(\overline{\rho}_{*})^{2}+\lambda_{-}(\overline{\rho}_{*})^{2}]$ , and $\overline{R}_{\infty}(\overline{\rho}_{1})=\overline{R}_{\infty}(\overline{\rho}_{2})$ for any $\overline{\rho}_{1},\overline{\rho}_{2}\in{\mathcal{S}}_{*}$ . Therefore, we can only have $L({\mathcal{S}}_{*})\subseteq{\mathcal{P}}_{2}\equiv\{(\lambda_{+},\lambda_{-}):\lambda_{+}>0,\lambda_{-}<0\}$ , or $L({\mathcal{S}}_{*})\subseteq{\mathcal{P}}_{1}\equiv\{(\lambda_{+},\lambda_{-}):\lambda_{+}<0,\lambda_{-}>0\}$ , or $L({\mathcal{S}}_{*})=\{(0,0)\}$ .

Step 3. Finish the proof using two claims.

Claim $(1)$ . If $L({\mathcal{S}}_{*})\subseteq{\mathcal{P}}_{1}$ , then for any $\overline{\rho}_{*}\in{\mathcal{S}}_{*}$ , we have $\overline{\rho}_{*}((0,\infty))=1$ .

Claim $(2)$ . We cannot have $L({\mathcal{S}}_{*})\subseteq{\mathcal{P}}_{2}$ .

Here we assume these two claims hold, and use them to prove our results. For $\Delta<\Delta_{\infty}$ , we proved in Theorem 9 that there is no fixed point such that $L(\overline{\rho}_{*})=(0,0)$ . Therefore, we cannot have $L({\mathcal{S}}_{*})=\{(0,0)\}$ . Due to Claim $(2)$ , we cannot have $L({\mathcal{S}}_{*})\subseteq{\mathcal{P}}_{2}$ . Hence, we must have $L({\mathcal{S}}_{*})\subseteq{\mathcal{P}}_{1}$ . According to Theorem 9, for $\Delta<\Delta_{\infty}$ , the only fixed point of the PDE with $\overline{\rho}_{*}((0,\infty))=1$ is a point mass at some location $r_{*}$ . Furthermore, this delta function fixed point is unique and is also the global minimizer of the risk. Therefore, we conclude that, for $\Delta<\Delta_{\infty}$ , the PDE converges to this global minimizer.

For $\Delta\geq\Delta_{\infty}$ , according to Claim $(1)$ , if $\overline{\rho}_{*}$ is a limiting point such that $L(\overline{\rho}_{*})\in{\mathcal{P}}_{1}$ , then $\overline{\rho}_{*}((0,\infty))=1$ . According to Theorem 9, a fixed point $\overline{\rho}_{*}$ with $\overline{\rho}_{*}((0,\infty))=1$ and $L(\overline{\rho}_{*})\neq(0,0)$ must be a point mass at some location $r_{*}$ , with $L(\overline{\rho}_{*})\in{\mathcal{P}}_{2}$ . Therefore, we cannot have $L({\mathcal{S}}_{*})\subseteq{\mathcal{P}}_{1}$ . Claim $(2)$ also tells us that we cannot have $L({\mathcal{S}}_{*})\subseteq{\mathcal{P}}_{2}$ . Hence, we must have $L({\mathcal{S}}_{*})=\{(0,0)\}$ . In this case, all the points in the set ${\mathcal{S}}_{*}$ have risk $0$ . Therefore, we conclude that, for $\Delta\geq\Delta_{\infty}$ , the PDE converges to a limiting set with risk $0$ .

We are left with the task of proving the two claims above. Before that, we introduce some useful notation. Recall $Z(r)=q_{-}^{\prime}(r)/q_{+}^{\prime}(r)$ for $r\in(0,+\infty)$ . According to condition S4, $Z^{\prime}(r)>0$ for $r\in(0,+\infty)$ . This implies that the limits $Z(0+)\equiv Z_{0}\geq 0$ and $Z(+\infty)\equiv Z_{\infty}\leq\infty$ exist. We rewrite $\partial_{r}\psi_{\infty}(r;\overline{\rho})$ as

Proof of Claim (1). If $L({\mathcal{S}}_{*})\subseteq{\mathcal{P}}_{1}$ , then for any $\overline{\rho}_{*}\in{\mathcal{S}}_{*}$ , we have $\overline{\rho}_{*}(\{0,\infty\})=0$ .

Assume $L({\mathcal{S}}_{*})\subseteq{\mathcal{P}}_{1}$ . Then, we must have $L({\mathcal{S}}_{*})\subseteq{\mathcal{P}}_{1}\cap\{(\lambda_{+},\lambda_{-}):Z_{0}<-\lambda_{+}/\lambda_{-}<Z_{\infty}\}$ . Otherwise, suppose there exists $\overline{\rho}_{*}\in{\mathcal{S}}_{*}$ such that $-\lambda_{+}(\overline{\rho}_{*})/\lambda_{-}(\overline{\rho}_{*})\geq Z_{\infty}$ or $-\lambda_{+}(\overline{\rho}_{*})/\lambda_{-}(\overline{\rho}_{*})\leq Z_{0}$ . Then, according to Eq. (8.28), $\psi_{\infty}(r;\overline{\rho}_{*})$ must be strictly increasing or strictly decreasing in $r$ . Since ${\rm supp}(\overline{\rho}_{*})\subseteq\{r\in[0,\infty]:\partial_{r}\psi_{\infty}(r;\overline{\rho}_{*})=0\}$ , $\overline{\rho}_{*}$ must be a combination of two delta functions located at $0$ and $+\infty$ . But such $\overline{\rho}_{*}$ must have $\overline{R}_{\infty}(\overline{\rho}_{*})\geq 1$ , and thus $\overline{\rho}_{*}$ cannot be a limiting point of the PDE. Hence the claim that $L({\mathcal{S}}_{*})\subseteq{\mathcal{P}}_{1}\cap\{(\lambda_{+},\lambda_{-}):Z_{0}<-\lambda_{+}/\lambda_{-}<Z_{\infty}\}$ holds.

Since ${\mathcal{S}}_{*}$ is a compact set, and $L$ is a continuous map, $L({\mathcal{S}}_{*})$ is a compact set. Therefore, there must exist ${\varepsilon}_{0}>0$ , so that for any $\overline{\rho}_{*}\in{\mathcal{S}}_{*}$ , we have $Z_{0}+3{\varepsilon}_{0}<-\lambda_{+}(\overline{\rho}_{*})/\lambda_{-}(\overline{\rho}_{*})<Z_{\infty}-3{\varepsilon}_{0}$ . For this ${\varepsilon}_{0}>0$ , since ${\mathcal{S}}_{*}$ contains all the limiting points of the PDE starting from $\overline{\rho}_{0}$ , there exists $t_{0}$ large enough, so that for $t\geq t_{0}$ , we have $Z_{0}+2{\varepsilon}_{0}<-\lambda_{+}(\overline{\rho}_{t})/\lambda_{-}(\overline{\rho}_{t})<Z_{\infty}-2{\varepsilon}_{0}$ , and $\lambda_{+}(\overline{\rho}_{t})<0$ , $\lambda_{-}(\overline{\rho}_{t})>0$ . For the same ${\varepsilon}_{0}$ , since $Z(r)$ is continuous at $0$ and $+\infty$ , there exist $0<r_{0}<r_{\infty}<\infty$ , so that $Z(r)<Z_{0}+{\varepsilon}_{0}$ for $r\in(0,r_{0})$ , and $Z(r)>Z_{\infty}-{\varepsilon}_{0}$ for $r\in(r_{\infty},\infty)$ . Therefore, for any $t\geq t_{0}$ , $\partial_{r}\psi_{\infty}(r;\overline{\rho}_{t})<0$ for any $r\in(0,r_{0})$ , and $\partial_{r}\psi_{\infty}(r;\overline{\rho}_{t})>0$ for any $r\in(r_{\infty},+\infty)$ .

As a result, according to the equation (8.28), we must have $\partial_{r}\psi_{\infty}(r;\overline{\rho}_{t})<0$ for any $r\in(0,r_{0})$ and $t\geq t_{0}$ , and $\partial_{r}\psi_{\infty}(r;\overline{\rho}_{t})>0$ for any $r\in(r_{\infty},\infty)$ and $t\geq t_{0}$ .

Due to Lemma 7.9, $\overline{\rho}_{t_{0}}((0,\infty))=1$ . Denoting $\Omega_{k}=[1/k,k]$ , we have $\lim_{k\to\infty}\overline{\rho}_{t_{0}}(\Omega_{k})=1$ . With this choice of $\Omega_{k}$ , for any $k\geq\max\{r_{\infty},1/r_{0}\}$ , and for any $t\geq t_{0}$ , we have $\langle\partial_{r}\psi_{\infty}(r;\overline{\rho}_{t}),{\bm{n}}(r)\rangle>0$ for $r\in\partial\Omega_{k}$ , where ${\bm{n}}(r)$ is the normal vector pointing outside $\Omega_{k}$ . Therefore, if we consider the ODE

starting with $r(t_{0})\in\Omega_{k}$ , $r(t)$ cannot leak outside $\Omega_{k}$ through either boundary point of $\Omega_{k}$ , and we must have $r(t)\in\Omega_{k}$ for any $t\geq t_{0}$ . Due to Lemma 7.8, $\overline{\rho}_{t}(\Omega_{k})\geq\overline{\rho}_{t_{0}}(\Omega_{k})$ for any $t\geq t_{0}$ . As a result, we conclude that for any $\overline{\rho}_{*}\in{\mathcal{S}}_{*}$ ,

Note $\cup_{k}\Omega_{k}=(0,\infty)$ . This gives $\overline{\rho}_{*}(\{0,\infty\})=0$ , which proves Claim $(1)$ .

Proof of Claim $(2)$ , step $(1)$ . If $L({\mathcal{S}}_{*})\subseteq{\mathcal{P}}_{2}$ , then ${\mathcal{S}}_{*}$ must be a singleton.

In the case $L({\mathcal{S}}_{*})\subseteq{\mathcal{P}}_{2}$ , the argument is similar to the proof of Claim $(1)$ , and hence will be presented in condensed form. First, we must have $L({\mathcal{S}}_{*})\subseteq{\mathcal{P}}_{2}\cap\{(\lambda_{+},\lambda_{-}):Z_{0}<-\lambda_{+}/\lambda_{-}<Z_{\infty}\}$ . Therefore, there must exist ${\varepsilon}_{0}>0$ , so that for any $\overline{\rho}_{*}\in{\mathcal{S}}_{*}$ , we have $Z_{0}+3{\varepsilon}_{0}<-\lambda_{+}(\overline{\rho}_{*})/\lambda_{-}(\overline{\rho}_{*})<Z_{\infty}-3{\varepsilon}_{0}$ . For this ${\varepsilon}_{0}>0$ , there exists $t_{0}$ large enough, so that for $t\geq t_{0}$ , we have $Z_{0}+2{\varepsilon}_{0}<-\lambda_{+}(\overline{\rho}_{t})/\lambda_{-}(\overline{\rho}_{t})<Z_{\infty}-2{\varepsilon}_{0}$ , and $\lambda_{+}(\overline{\rho}_{t})>0$ , $\lambda_{-}(\overline{\rho}_{t})<0$ . Further, there exist $0<r_{0}<r_{\infty}<\infty$ , so that $\partial_{r}\psi_{\infty}(r;\overline{\rho}_{t})>0$ for any $r\in(0,r_{0})$ and $t\geq t_{0}$ , and $\partial_{r}\psi_{\infty}(r;\overline{\rho}_{t})<0$ for any $r\in(r_{\infty},\infty)$ and $t\geq t_{0}$ .

Therefore, if we consider the ODE (8.29) starting with $r(t_{0})\in[0,r_{0})$ , we must have $r(t)\in[0,r_{0})$ for any $t\geq t_{0}$ ; if we start with $r(t_{0})\in(r_{\infty},\infty]$ , we must have $r(t)\in(r_{\infty},\infty]$ for any $t\geq t_{0}$ . Due to Lemma 7.8, $\{\overline{\rho}_{t}([0,r))\}_{t\geq t_{0}}$ for $0<r\leq r_{0}$ and $\{\overline{\rho}_{t}((r,+\infty])\}_{t\geq t_{0}}$ for $r\geq r_{\infty}$ must be non-decreasing in $t$ . According to Theorem 9, we can express $\overline{\rho}_{*}\in{\mathcal{S}}_{*}$ in the form $\overline{\rho}_{*}=a_{0}(\overline{\rho}_{*})\delta_{0}+a_{\infty}(\overline{\rho}_{*})\delta_{\infty}+a(\overline{\rho}_{*})\delta_{r_{*}}$ . By the stated monotonicity property, for any $\overline{\rho}_{1},\overline{\rho}_{2}\in{\mathcal{S}}_{*}$ , it holds that $a_{0}(\overline{\rho}_{1})=a_{0}(\overline{\rho}_{2})$ , $a_{\infty}(\overline{\rho}_{1})=a_{\infty}(\overline{\rho}_{2})$ , and hence $a(\overline{\rho}_{1})=a(\overline{\rho}_{2})$ . We denote them in short as $a_{0}$ , $a_{\infty}$ , and $a$ .

For any such fixed point $\overline{\rho}_{*}\in{\mathcal{S}}_{*}$ , since we must have ${\rm supp}(\overline{\rho}_{*})\subseteq\{r:\partial_{r}\psi_{\infty}(r;\overline{\rho}_{*})=0\}$ , $r_{*}\in(0,+\infty)$ should be a solution of $\phi(r)=0$ where

Proof of Claim $(2)$ , step $(2)$ . If $\overline{\rho}_{*}$ is a fixed point with $L(\overline{\rho}_{*})\in{\mathcal{P}}_{2}$ , then $\overline{\rho}_{*}$ is unstable.

We apply Theorem 7 to $\overline{\rho}_{*}=a_{0}\delta_{0}+a_{\infty}\delta_{\infty}+a\delta_{r_{*}}$ . We will check the conditions of Theorem 7 to show that this type of fixed point is unstable.

First we check condition B1. Since $[q_{-}^{\prime}(r)/q_{+}^{\prime}(r)]^{\prime}>0$ and $q_{+}^{\prime}(r)>0$ for $r\in(0,+\infty)$ , we have

Note the stationary condition of the PDE implies

and $\lambda_{+}(\overline{\rho}_{*})>0$ , $\lambda_{-}(\overline{\rho}_{*})<0$ . Combined with the equation above, we have

This verifies condition ${\sf B1}$ of Theorem 7.

Second, since $\lambda_{+}(\overline{\rho}_{*})>0$ and $\lambda_{-}(\overline{\rho}_{*})<0$ , according to Equation (8.28), we must have $\partial_{r}\psi_{\infty}(r;\overline{\rho}_{*})>0$ for $r\in(0,r_{*})$ , and $\partial_{r}\psi_{\infty}(r;\overline{\rho}_{*})<0$ for $r\in(r_{*},\infty)$ . Therefore, we have $\psi_{\infty}(0;\overline{\rho}_{*})<\psi_{\infty}(r_{*};\overline{\rho}_{*})$ and $\psi_{\infty}(+\infty;\overline{\rho}_{*})<\psi_{\infty}(r_{*};\overline{\rho}_{*})$ . Recall that ${\cal L}(\eta)\equiv\{r:\psi_{\infty}(r;\overline{\rho}_{*})\leq\psi_{\infty}(r_{*};\overline{\rho}_{*})-\eta\}$ . For any $\eta>0$ small enough, $\overline{\rho}_{*}({\cal L}(\eta))=1-a$ , which verifies condition ${\sf B2}$ . It is also easy to see that, for any $\eta>0$ , $\partial{\cal L}(\eta)$ is a compact set, hence condition B3 holds. Since we assumed further that $\overline{\rho}_{0}$ has a bounded density with respect to Lebesgue measure, all the assumptions of Theorem 7 are satisfied. Theorem 7 implies that the PDE cannot converge to $\overline{\rho}_{*}$ . As a result, we conclude that we cannot have $L({\mathcal{S}}_{*}(\overline{\rho}_{0}))\subseteq{\mathcal{P}}_{2}$ for $\overline{\rho}_{0}\in\mathscrsfs{P}_{\mbox{\tiny\rm good}}$ . This proves Claim $(2)$ .

4 Proof of Theorem 1

The key step consists in proving that the dynamics for large but finite $d$ is well approximated by the dynamics at $d=\infty$ . The key estimate is provided by the next lemma.

Assume $\sigma$ satisfies condition S0, and recall the definitions of $u_{d}$ and $u_{\infty}$ given by Equations (8.8) and (8.9). Then we have

where $(G_{1},G_{2})\sim{\sf N}(0,{\bm{I}}_{2})$ , and $\Theta\sim(1/Z_{d})\sin(\theta)^{d-2}\cdot{\bm{1}}\{\theta\in[0,\pi]\}{\rm d}\theta$ are mutually independent.

Define $G_{3}=G_{1}\cos\Theta+G_{2}\sin\Theta$ , then

Since, by condition S0, $\|\sigma^{\prime}\|_{\infty}$ and $\|\sigma\|_{\infty}$ are bounded, it is sufficient to bound the following quantity uniformly for $r\in[0,\infty)$

Assuming this claim holds, let us show that it implies the desired bound on $T(r)$ . We have

We are left with the task of proving Eq. (8.37).

Denote $X=G_{2}$ and $Y=G_{3}$ for notational simplicity. Note that $(X,Y)\stackrel{{\scriptstyle{\rm d}}}{{=}}(Y,X)\stackrel{{\scriptstyle{\rm d}}}{{=}}(-X,-Y)$ . It follows that we can assume, without loss of generality, $a>0$ . We have

Let $y\sim\textup{Unif}(\{-1,+1\})$ , $[{\bm{x}}|y=+1]\sim{\sf N}({\mathbf{0}},{\bm{\Sigma}}_{+})$ , $[{\bm{x}}|y=-1]\sim{\sf N}(0,{\bm{\Sigma}}_{-})$ with $\tau_{-}^{2}{\bm{I}}_{D}\preceq{\bm{\Sigma}}_{+},{\bm{\Sigma}}_{-}\preceq\tau_{+}^{2}{\bm{I}}_{D}$ for some $0<\tau_{-}<\tau_{+}<\infty$ . Assume that the activation function $\sigma$ satisfies condition S0. Define

Then assumptions A2 and A3 are satisfied.

Note that ${\bm{x}}$ is sub-Gaussian and that, by condition S0, $\sigma^{\prime}$ is bounded. Hence $\nabla_{\bm{\theta}}\sigma(\langle{\bm{x}},{\bm{\theta}}\rangle)=\sigma^{\prime}(\langle{\bm{x}},{\bm{\theta}}\rangle){\bm{x}}$ is also sub-Gaussian (with sub-Gaussian parameter independent of $D$ ). Condition S0 also gives that $\sigma$ is bounded, therefore assumption A2 is satisfied.

Since $\|\sigma\|_{\infty},\|\sigma^{\prime}\|_{\infty}<\infty$ , the Cauchy-Schwarz inequality implies that $\nabla V,\nabla_{1}U,\nabla_{12}^{2}U$ are uniformly bounded.

It is difficult to bound $\nabla^{2}V$ and $\nabla_{1}^{2}U$ directly because $\sigma^{\prime}$ may not be differentiable. We will use a longer argument to bound them.

First, for a bounded-Lipschitz function $f$ , and for $g\in\{1,\sigma\}$ , define

where ${\bm{G}}\sim{\sf N}(0,{\bm{I}}_{d})$ . Since we have $\tau_{-}^{2}{\bm{I}}_{D}\preceq{\bm{\Sigma}}_{+},{\bm{\Sigma}}_{-}\preceq\tau_{+}^{2}{\bm{I}}_{D}$ for some $0<\tau_{-}<\tau_{+}<\infty$ , in order to bound $\nabla^{2}V$ and $\nabla_{1}^{2}U$ , it is sufficient to bound $\nabla_{1}^{2}W_{\sigma,1}$ and $\nabla_{1}^{2}W_{\sigma,\sigma}$ .

is uniformly bounded for $g=1$ or $g=\sigma$ . Let $h=\sigma-\sigma_{0}$ , then $h=0$ for $r\in[-\delta_{0},\delta_{0}]$ , and $h$ is $K$ -bounded-Lipschitz for some constant $K$ . It is sufficient to bound $\nabla_{1}^{2}W_{h,g}$ for $g\in\{1,\sigma\}$ .

Since ${\bm{G}}$ is Gaussian, using Stein’s formula, for any unit vector ${\bm{n}}$ , we have

Taking directional derivatives of $E_{1}$ and $E_{2}$ , we have

is uniformly bounded for $r\in[0,\infty]$ . Hence $E_{11}$ is uniformly bounded. Using a similar argument, we can show that each of the terms $E_{12}$ , $E_{13}$ , $E_{21}$ , $E_{22}$ , and $E_{23}$ is uniformly bounded.

Now we look at $\nabla_{1}E_{3}({\bm{\theta}}_{1},{\bm{\theta}}_{2},{\bm{n}})$ . We have

In order to bound $E_{32}$ , we apply Stein’s formula to get

We can apply Stein’s formula to the right hand side of the last equation. Using the same argument as above, we obtain that $E_{31}$ is uniformly bounded.

As a result, $\nabla^{2}V$ and $\nabla_{1}^{2}U$ are uniformly bounded. Therefore, assumption A3 is satisfied.
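The proof above relies repeatedly on Stein's formula for Gaussian vectors. As a minimal sanity check, the sketch below verifies the one-dimensional form $\mathbb{E}[G f(G)]=\mathbb{E}[f^{\prime}(G)]$ for $G\sim{\sf N}(0,1)$ by numerical quadrature. The test function `tanh`, the integration window and the helper name `gaussian_expectation` are illustrative assumptions and are not taken from the paper.

```python
import math

def gaussian_expectation(h, lo=-10.0, hi=10.0, n=20000):
    """Approximate E[h(G)] for G ~ N(0, 1) with the trapezoidal rule on [lo, hi]."""
    step = (hi - lo) / n
    total = 0.0
    for i in range(n + 1):
        g = lo + i * step
        weight = 0.5 if i in (0, n) else 1.0  # trapezoid end-point weights
        total += weight * h(g) * math.exp(-g * g / 2) / math.sqrt(2 * math.pi)
    return total * step

# Illustrative smooth, bounded test function and its derivative (an assumption).
f = math.tanh

def f_prime(g):
    return 1.0 - math.tanh(g) ** 2

lhs = gaussian_expectation(lambda g: g * f(g))  # E[G f(G)]
rhs = gaussian_expectation(f_prime)             # E[f'(G)]
print(lhs, rhs)
```

The two printed numbers should agree up to quadrature error. In the proof, the same identity is what moves a derivative off a function that may fail to be differentiable.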

We are now in position to prove Theorem 1.

First we consider PDE (8.22) for $d=\infty$ . We fix an initial radial density $\overline{\rho}_{0}\in\mathscrsfs{P}_{\mbox{\tiny\rm good}}$ . Due to Theorem 10, for any $\eta>0$ , there exists $T=T(\eta,\overline{\rho}_{0},\Delta)>0$ , so that the solution $(\overline{\rho}_{t}^{\infty})_{t\geq 0}$ of PDE (8.22) for $d=\infty$ with initialization $\overline{\rho}_{0}$ satisfies

Next we would like to bound the difference of $\overline{R}_{\infty}(\overline{\rho})$ and $\overline{R}_{d}(\overline{\rho})$ for any $\overline{\rho}$ . Note

By Lemma 8.1, there exists $d_{0}=d_{0}(\eta,\Delta)$ large enough, so that for $d\geq d_{0}$ , we have

Finally, let $({\bm{\theta}}^{k})_{k\geq 1}$ be the trajectory of SGD, with step size $s_{k}={\varepsilon}\xi(k{\varepsilon})$ , and initialization ${\bm{w}}_{i}^{0}\sim_{iid}\rho_{0}$ for $i\leq N$ . We apply Theorem 3 to bound the difference between the law of the SGD trajectory and the solution of PDE (8.53). The assumptions of Theorem 3 are verified by Lemma 8.2. As a consequence, there exists a constant $K$ (which depends only on the constants in assumptions A1, A2, and A3), such that for any $t\leq 10T$ , we have

As a consequence, for any $\delta>0$ , there exists $C_{0}=C_{0}(\delta,\eta,\overline{\rho}_{0},\Delta)$ , so that for $N,1/{\varepsilon}\geq C_{0}d$ and ${\varepsilon}\geq 1/N^{10}$ , and for any $t\leq 10T$ , we have

Therefore, the trajectory ${\bm{\theta}}^{\lfloor t/{\varepsilon}\rfloor}$ of SGD for $t\in[T,10T]$ satisfies

with probability at least $1-\delta$ . This gives the desired result.

5 Checking conditions ${\sf S0}$–${\sf S4}$ for the running example

The requirements of Lemma 8.3 are not restrictive. An example of parameters satisfying all conditions is $s_{1}=-2.5$ , $s_{2}=7.5$ , $t_{1}=0.5$ , $t_{2}=1.5$ .

It is straightforward to see that condition S0 holds. To show condition S1, denote by $\sigma^{\prime}(r)$ the weak derivative of $\sigma(r)$ . We calculate the function $q^{\prime}(r)$ for $r>0$ explicitly,

Since $s_{1}<s_{2}$ and $0<t_{1}<t_{2}$ , it is easy to see that $q^{\prime}(r)$ is analytic on $(0,\infty)$ , and hence $q(r)$ is analytic on $(0,\infty)$ . Differentiating $q^{\prime}(r)$ in Eq. (8.56), it is easy to see that $\lim_{r\rightarrow\infty}q^{\prime\prime}(r)=0$ , and $q^{\prime\prime}(0+)=0$ . Hence, we have $\sup_{r\in[0,+\infty]}q^{\prime\prime}(r)<\infty$ . Then condition S1 holds.

Since $s_{2}>s_{1}$ , $0<t_{1}<t_{2}$ , we have $q^{\prime}(r)>0$ for $r\in(0,+\infty)$ , $\lim_{r\rightarrow\infty}q^{\prime}(r)=0$ , and $q^{\prime}(0+)=0$ . Hence, we have $\sup_{r\in[0,+\infty]}q^{\prime}(r)<\infty$ . Then condition S2 holds. Note that $q(0)=\sigma(0)=s_{1}<-1$ , and $q(+\infty)=(s_{1}+s_{2})/2>1$ . In addition, $[q(0)+q(+\infty)]/2=(3s_{1}+s_{2})/4\in(-1,1)$ . Therefore, condition S3 holds.
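The inequalities invoked so far can be checked directly for the example parameters $s_{1}=-2.5$ , $s_{2}=7.5$ , $t_{1}=0.5$ , $t_{2}=1.5$ . The sketch below only evaluates the quantities quoted in the text, namely $q(0)=s_{1}$ , $q(+\infty)=(s_{1}+s_{2})/2$ and their midpoint. The function name `check_running_example` is a hypothetical helper, and the code does not claim to cover every requirement of Lemma 8.3.

```python
# Parameters of the running example quoted in the text.
s1, s2 = -2.5, 7.5
t1, t2 = 0.5, 1.5

def check_running_example(s1, s2, t1, t2):
    """Return q(0), q(+inf), their midpoint, and the status of each stated inequality."""
    q_zero = s1                      # q(0) = sigma(0) = s1
    q_inf = (s1 + s2) / 2            # q(+infinity) = (s1 + s2) / 2
    midpoint = (q_zero + q_inf) / 2  # equals (3 s1 + s2) / 4
    checks = {
        "s1 < s2": s1 < s2,
        "0 < t1 < t2": 0 < t1 < t2,
        "q(0) < -1": q_zero < -1,
        "q(+inf) > 1": q_inf > 1,
        "midpoint in (-1, 1)": -1 < midpoint < 1,
    }
    return q_zero, q_inf, midpoint, checks

q_zero, q_inf, midpoint, checks = check_running_example(s1, s2, t1, t2)
print(q_zero, q_inf, midpoint)
print(all(checks.values()))
```

For these parameters the midpoint is exactly $0$ , so the requirement $[q(0)+q(+\infty)]/2\in(-1,1)$ holds with room to spare.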

Finally, we show that condition S4 holds. Define $p(r)=\exp[-t_{1}^{2}/(2r^{2})]-\exp[-t_{2}^{2}/(2r^{2})]$ , which is a positively scaled version of $q^{\prime}(r)$ . To show that for $r\in(0,\infty)$ ,

we only need to show that for $r\in(0,\infty)$

Define $x\equiv t_{2}^{2}/(2\tau_{+}^{2}r^{2})>0$ , $s\equiv\tau_{+}^{2}/\tau_{-}^{2}>1$ , $0<c\equiv t_{1}^{2}/t_{2}^{2}<1$ , we have

It is sufficient to show that $F_{2}(x;s,c)>0$ for $x>0$ , $s>1$ , and $0<c<1$ . Note that $F_{2}(0+;s,c)=0$ . Hence it is sufficient to show that $\partial_{x}F_{2}(x;s,c)>0$ for $x>0$ .

Note that, for $s>1$ and $0<c<1$ , $F_{3}(0+;s,c)=0$ . It is therefore sufficient to show that $\partial_{x}F_{3}(x;s,c)>0$ for $x>0$ .

Since $0<c<1$ , $s>1$ , and $x>0$ , we have $\partial_{x}F_{3}(x;s,c)>0$ , and hence condition S4 holds. ∎
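For intuition about the function $p(r)=\exp[-t_{1}^{2}/(2r^{2})]-\exp[-t_{2}^{2}/(2r^{2})]$ used in the proof above, the following sketch evaluates it on a grid for the example values $t_{1}=0.5$ , $t_{2}=1.5$ . It only illustrates that $p$ is positive and vanishes at both ends, consistent with the stated behaviour of $q^{\prime}$ . It does not verify condition S4 itself, and the grid range is an arbitrary illustrative choice.

```python
import math

def p(r, t1=0.5, t2=1.5):
    """p(r) = exp(-t1^2 / (2 r^2)) - exp(-t2^2 / (2 r^2)), as defined in the text."""
    return math.exp(-t1**2 / (2 * r**2)) - math.exp(-t2**2 / (2 * r**2))

# Logarithmic grid of radii from 0.05 to 100 (illustrative range, an assumption).
grid = [0.05 * (2000 ** (i / 400)) for i in range(401)]
values = [p(r) for r in grid]

print(min(values) > 0)        # positivity on the grid, which follows from t1 < t2
print(values[0], values[-1])  # both ends are small: p vanishes as r -> 0+ and r -> infinity
```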

Centered anisotropic Gaussians

In this section we consider the centered anisotropic Gaussian example discussed in the main text. That is, we assume the joint law of $(y,{\bm{x}})$ to be as follows:

With probability $1/2$ : $y=+1$ , ${\bm{x}}\sim{\sf N}({\mathbf{0}},{\bm{\Sigma}}_{+})$ .

With probability $1/2$ : $y=-1$ , ${\bm{x}}\sim{\sf N}({\mathbf{0}},{\bm{\Sigma}}_{-})$ .

We will assume ${\bm{\Sigma}}_{+},{\bm{\Sigma}}_{-}$ to be diagonalizable in the same orthonormal basis, and to differ only on a subspace of dimension $s_{0}$ . We want to study whether and how the neural network will identify this subspace of relevant features. Without loss of generality, we can assume that the eigenvectors coincide with the standard basis. In order to focus on the simplest possible model of this type, we will choose:

Because of condition ${\sf S2}$ and ${\sf S3}$ , we have $q\circ r_{+},q\circ r_{-},q^{\prime}\circ r_{+},q^{\prime}\circ r_{-}\in C_{b}(E_{2})$ .

Let $\mathscrsfs{P}(E_{2})$ be the set of probability measures on $E_{2}$ . Due to Prokhorov’s theorem, there exists a complete metric ${\bar{d}}{P}$ on $\mathscrsfs{P}(E_{2})$ equivalent to the topology of weak convergence, so that $(\mathscrsfs{P}(E_{2}),{\bar{d}}{P})$ is a compact metric space. In this section, we will denote by ${\overline{\mathscrsfs{P}}}=\mathscrsfs{P}(E_{2})$ .

Since the distribution of ${\bm{x}}$ is invariant under rotations in the first $s_{0}$ coordinates, and invariant under rotations in the last $d-s_{0}$ coordinates, so are the functions

where $\mu_{\mbox{\tiny\rm Haar}}$ is the Haar measure over the group of orthogonal rotations. Since $\rho\mapsto R(\rho)$ is convex, $R(\rho_{s})\leq R(\rho)$ .

where $\Theta_{1}\sim(1/Z_{s_{0}})\sin^{s_{0}-2}\theta\cdot{\bm{1}}\{\theta\in[0,\pi]\}{\rm d}\theta$ and $\Theta_{2}\sim(1/Z_{d-s_{0}})\sin^{d-s_{0}-2}\theta\cdot{\bm{1}}\{\theta\in[0,\pi]\}{\rm d}\theta$ are independent.
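A density proportional to $\sin^{m-2}\theta$ on $[0,\pi]$ is the law of the angle between a uniformly random direction in $\mathbb{R}^{m}$ and a fixed axis, so $\Theta_{1}$ and $\Theta_{2}$ correspond to $m=s_{0}$ and $m=d-s_{0}$ . The sketch below samples such an angle from normalized Gaussian vectors and checks the standard identity $\mathbb{E}[\cos^{2}\Theta]=1/m$ by Monte Carlo, which shows how the angle concentrates around $\pi/2$ as $m$ grows. The function names, sample sizes and seed are illustrative assumptions.

```python
import math
import random

def sample_angle(m, rng):
    """Angle in [0, pi] between a uniform random direction in R^m and the first axis.

    Its density is proportional to sin(theta)^(m-2) on [0, pi].
    """
    g = [rng.gauss(0.0, 1.0) for _ in range(m)]
    norm = math.sqrt(sum(x * x for x in g))
    # Clamp to [-1, 1] to protect acos against rounding.
    return math.acos(max(-1.0, min(1.0, g[0] / norm)))

def mean_cos_squared(m, n_samples=5000, seed=0):
    """Monte Carlo estimate of E[cos^2(Theta)], whose exact value is 1/m."""
    rng = random.Random(seed)
    return sum(math.cos(sample_angle(m, rng)) ** 2 for _ in range(n_samples)) / n_samples

for m in (3, 10, 50):
    print(m, mean_cos_squared(m), 1.0 / m)
```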

As $d\to\infty$ , we have $\lim_{d\to\infty}u_{d}(a_{1},a_{2},b_{1},b_{2})=u_{\infty}(a_{1},a_{2},b_{1},b_{2})$ , with

and the risk function converges to (for ${\bm{a}}=(a_{1},a_{2})$ )

For $s_{0}=\gamma\cdot d$ with $0<\gamma<1$ and $d\to\infty$ , we have the simpler expression

The following theorem provides a characterization of the global minimizers of $\overline{R}_{\infty}(\overline{\rho})$ .

Consider $d=\infty$ . Recall ${\overline{\mathscrsfs{P}}}=\mathscrsfs{P}(E_{2})$ where $E_{2}\equiv[0,+\infty)^{2}\cup\{\infty\}$ . Then there exists $\Delta_{\infty}\in(0,1)$ , such that

For $\Delta\geq\Delta_{\infty}$ , all global minimizers of risk function $\overline{R}_{\infty}(\overline{\rho})$ have risk zero, and there exists a global minimizer that has finite support.

Suppose $\overline{\rho}_{2}^{*}\in{\rm arg\,min}_{\overline{\rho}_{2}\in\mathscrsfs{P}(E_{2})}\overline{R}_{\infty}^{(2)}(\overline{\rho}_{2})$ . Then we must have $\langle q\circ r_{+},\overline{\rho}_{2}^{*}\rangle\leq 1$ and $\langle q\circ r_{-},\overline{\rho}_{2}^{*}\rangle\geq-1$ . Indeed, if either $\langle q\circ r_{+},\overline{\rho}_{2}^{*}\rangle>1$ or $\langle q\circ r_{-},\overline{\rho}_{2}^{*}\rangle<-1$ , since $q(+\infty)>1$ and $q(0)<-1$ , the distribution $\overline{\rho}_{2}^{\prime}=a_{0}\delta_{\mathbf{0}}+a_{\infty}\delta_{\infty}+(1-a_{0}-a_{\infty})\overline{\rho}_{2}^{*}$ with appropriate choice of $a_{0}$ and $a_{\infty}$ will give a lower risk.

This $\overline{\rho}_{2}^{*}\in\mathscrsfs{P}(E_{2})$ induces a $\overline{\rho}_{1}\in\mathscrsfs{P}([0,\infty])$ as follows: for any Borel set $B\subseteq[0,\infty]$ , $\overline{\rho}_{1}(B)=\overline{\rho}_{2}^{*}(\{{\bm{r}}\in E_{2}:\|{\bm{r}}\|_{2}\in B\})$ . For this $\overline{\rho}_{1}$ , it is easy to see that $\langle q_{-},\overline{\rho}_{1}\rangle\leq\langle q\circ r_{-},\overline{\rho}_{2}^{*}\rangle$ and $\langle q_{+},\overline{\rho}_{1}\rangle\geq\langle q\circ r_{+},\overline{\rho}_{2}^{*}\rangle$ , and the equalities hold if and only if $\overline{\rho}_{2}^{*}(E_{1})=1$ , where $E_{1}\equiv([0,+\infty)\times\{0\})\cup\{\infty\}$ . Since $q(+\infty)>1$ and $q(0)<-1$ , we can take $\overline{\rho}_{1}^{*}=a_{0}\delta_{0}+a_{\infty}\delta_{\infty}+(1-a_{0}-a_{\infty})\overline{\rho}_{1}$ with appropriate choice of $a_{0}$ and $a_{\infty}$ , so that $\langle q\circ r_{+},\overline{\rho}_{2}^{*}\rangle\leq\langle q_{+},\overline{\rho}_{1}^{*}\rangle\leq 1$ and $\langle q\circ r_{-},\overline{\rho}_{2}^{*}\rangle\geq\langle q_{-},\overline{\rho}_{1}^{*}\rangle\geq-1$ . Therefore, we always have $\inf_{\overline{\rho}_{1}\in\mathscrsfs{P}([0,\infty])}\overline{R}_{\infty}^{(1)}(\overline{\rho}_{1})\leq\inf_{\overline{\rho}_{2}\in\mathscrsfs{P}(E_{2})}\overline{R}_{\infty}^{(2)}(\overline{\rho}_{2})$ , and $\overline{\rho}_{2}^{*}(E_{1})=1$ for any $\overline{\rho}_{2}^{*}\in{\rm arg\,min}_{\overline{\rho}_{2}\in\mathscrsfs{P}(E_{2})}\overline{R}_{\infty}^{(2)}(\overline{\rho}_{2})$ . Note that $\overline{R}_{\infty}^{(2)}(\overline{\rho}_{1}\times\delta_{0})=\overline{R}_{\infty}^{(1)}(\overline{\rho}_{1})$ for any $\overline{\rho}_{1}\in\mathscrsfs{P}([0,\infty])$ . Hence, we must have $\inf_{\overline{\rho}_{1}\in\mathscrsfs{P}([0,\infty])}\overline{R}_{\infty}^{(1)}(\overline{\rho}_{1})=\inf_{\overline{\rho}_{2}\in\mathscrsfs{P}(E_{2})}\overline{R}_{\infty}^{(2)}(\overline{\rho}_{2})$ .

By the above argument, we have reduced our analysis to the centered isotropic Gaussians case. All the conclusions can be proved using the same argument as in the proof of Theorem 8.

2 Dynamics: Fixed points

We specialize the general evolution (7.1) to the present case. Assuming $\rho_{0}$ to be invariant with respect to products of orthogonal transformations, the same happens for $\rho_{t}$ . We let $\overline{\rho}_{t}\in\mathscrsfs{P}(E_{2})$ denote the distribution of $(\|{\bm{w}}_{1}\|_{2},\|{\bm{w}}_{2}\|_{2})$ when ${\bm{w}}\sim\rho_{t}$ . Then $\overline{\rho}_{t}$ satisfies the following PDE:

We will view this as an evolution in the space of probability distributions ${\overline{\mathscrsfs{P}}}=\mathscrsfs{P}(E_{2})$ .

In analogy with Proposition 2, we can prove the following characterization of fixed points.

A distribution $\overline{\rho}\in{\overline{\mathscrsfs{P}}}$ is a fixed point of the PDE (9.16) if and only if

Notice that, in particular, global minimizers of $\overline{R}_{d}(\overline{\rho})$ are fixed points of this evolution, but not vice versa. The next result classifies fixed points.

Consider $d=\infty$ , and recall the definition of $\lambda_{+}(\overline{\rho})$ and $\lambda_{-}(\overline{\rho})$ given by Eq. (9.15) and (9.14). Then the fixed points of the PDE (9.16) (i.e. the probability measures $\overline{\rho}\in{\overline{\mathscrsfs{P}}}$ satisfying (9.17)) must be of one of the following types

A point mass $\overline{\rho}_{r_{*}}=\delta_{(r_{*},0)}$ at some location $(r_{*},0)$ with $r_{*}\not\in\{0,+\infty\}$ , but not of type $(a)$ .

A mixture of the type $\overline{\rho}=a_{0}\delta_{\mathbf{0}}+a_{\infty}\delta_{\infty}+a_{1}\delta_{(r_{*1},0)}+a_{2}\overline{\rho}_{2}$ with ${\rm supp}(\overline{\rho}_{2})\subseteq\{0\}\times(0,\infty)$ , but not of type $(a)$ or $(b)$ .

For $\Delta=\Delta_{\infty}$ , the PDE has a unique fixed point of type $(a)$ which is also a delta function at some location $(r_{*1},0)$ , and no type $(b)$ fixed points; it has possibly fixed points of type $(c)$ .

We use the characterization of fixed points in Proposition 6. Recall that $\psi_{\infty}({\bm{r}};\overline{\rho}_{*})$ is defined as in Eq. (9.13). The gradient $\nabla\psi_{\infty}({\bm{r}};\overline{\rho})$ is given by

If a fixed point $\overline{\rho}_{*}$ gives $\lambda_{+}(\overline{\rho}_{*})=\lambda_{-}(\overline{\rho}_{*})=0$ , then $\overline{R}_{\infty}(\overline{\rho}_{*})=0$ . This is a type- $(a)$ fixed point. Consider then the case $(\lambda_{+}(\overline{\rho}_{*}),\lambda_{-}(\overline{\rho}_{*}))\neq(0,0)$ .

Suppose $\overline{\rho}_{*}((0,+\infty)^{2})>0$ . Since $q^{\prime}(r)>0$ and $\tau_{+}>1>\tau_{-}$ , in order for $\nabla\psi_{\infty}({\bm{r}};\overline{\rho}_{*})={\mathbf{0}}$ to hold for some ${\bm{r}}\in(0,+\infty)^{2}$ , we must have $(\lambda_{+}(\overline{\rho}_{*}),\lambda_{-}(\overline{\rho}_{*}))=(0,0)$ . Therefore, since $\overline{\rho}_{*}$ is a fixed point with $(\lambda_{+}(\overline{\rho}_{*}),\lambda_{-}(\overline{\rho}_{*}))\neq(0,0)$ , we must have $\overline{\rho}_{*}((0,+\infty)^{2})=0$ . That is, we can write $\overline{\rho}_{*}=a_{0}\delta_{\mathbf{0}}+a_{\infty}\delta_{\infty}+a_{1}\overline{\rho}_{1}+a_{2}\overline{\rho}_{2}$ , with ${\rm supp}(\overline{\rho}_{1})\subseteq(0,\infty)\times\{0\}$ , and ${\rm supp}(\overline{\rho}_{2})\subseteq\{0\}\times(0,\infty)$ .

The solutions of $\nabla\psi_{\infty}((r_{1},r_{2});\overline{\rho}_{*})=0$ with $r_{2}=0$ are of the form ${\mathbf{0}}$ , $(r_{*1},0)$ , and $\infty$ . Therefore, $\overline{\rho}_{1}=\delta_{(r_{*1},0)}$ for some $r_{*1}\in(0,\infty)$ . Hence, as $\overline{\rho}_{*}$ is not a type- $(a)$ stationary point, it must be a type- $(b)$ or type- $(c)$ stationary point.

This proves that all fixed points are of type $(a)$ , $(b)$ , or $(c)$ . The remaining claims follow from the same argument as in the proof of Theorem 9.

3 Dynamics: Convergence to global minimum for $d=\infty$

In this section, we define $\mathscrsfs{P}_{\mbox{\tiny\rm good}}$ as

We then prove that the $d=\infty$ dynamics converges to a global minimizer from any initialization $\overline{\rho}_{0}\in\mathscrsfs{P}_{\mbox{\tiny\rm good}}$ .

Consider the PDE (9.16) for $d=\infty$ , with initialization $\overline{\rho}_{0}\in\mathscrsfs{P}_{\mbox{\tiny\rm good}}$ . It has a unique solution $(\overline{\rho}_{t})_{t\geq 0}$ , such that

Without loss of generality, we assume $\xi(t)=1/2$ . First we show the existence and uniqueness of solution of the PDE.

Step 1. Existence and uniqueness of solution. Mass $\overline{\rho}_{t}((0,\infty)^{2})=1$ for all $t$ .

According to conditions S1 - S3, $q(r)$ , $q^{\prime}(r)$ , and $q^{\prime\prime}(r)$ are uniformly bounded on $[0,\infty]$ . Note

Then $\nabla v({\bm{r}}),\nabla_{1}u_{\infty}({\bm{r}}_{1},{\bm{r}}_{2}),\nabla^{2}v({\bm{r}}),\nabla_{11}^{2}u_{\infty}({\bm{r}}_{1},{\bm{r}}_{2}),\nabla_{12}^{2}u_{\infty}({\bm{r}}_{1},{\bm{r}}_{2})$ are uniformly bounded. Therefore, conditions A1 and A3 are satisfied with $D=2$ , $V=v$ , and $U=u$ . It follows that PDE (9.16) for $d=\infty$ has a unique solution, which we denote by $(\overline{\rho}_{t})_{t\geq 0}$ .

Recall the expression for $\nabla\psi_{\infty}({\bm{r}};\overline{\rho})$ in Eq. (9.18). It is easy to see that the assumption of Lemma 7.9 is satisfied with $d=2$ and $\Psi=\psi_{\infty}$ . Hence, we have $\overline{\rho}_{t}((0,\infty)^{2})=1$ for any fixed $t<\infty$ .

Step 2. Classify the limiting set ${\mathcal{S}}_{*}$ .

Recall the definition of $(\mathscrsfs{P}(E_{2}),{\bar{d}}{P})$ at the beginning of Section 9. Since $(\mathscrsfs{P}(E_{2}),{\bar{d}}{P})$ is a compact metric space, and $(\overline{\rho}_{t})_{t\geq 0}$ is a continuous curve in this space, there exists a sequence of times $(t_{k})_{k\geq 1}$ such that $(\overline{\rho}_{t_{k}})_{k\geq 1}$ converges in metric ${\bar{d}}{P}$ to a probability distribution $\overline{\rho}_{*}\in\mathscrsfs{P}(E_{2})$ .

For any $\overline{\rho}_{0}\in\mathscrsfs{P}_{\mbox{\tiny\rm good}}$ , let ${\mathcal{S}}_{*}={\mathcal{S}}_{*}(\overline{\rho}_{0})$ be the set of limiting points of the PDE,

Analogous to the proof of Theorem 10, we have the following properties for ${\mathcal{S}}_{*}$ :

${\mathcal{S}}_{*}$ is connected and compact.

For any $\overline{\rho}_{*}\in{\mathcal{S}}_{*}$ , $\overline{\rho}_{*}$ is a fixed point of PDE.

For any $\overline{\rho}_{*}\in{\mathcal{S}}_{*}$ , $\overline{R}_{\infty}(\overline{\rho}_{*})=\overline{R}_{*}<1$ .

Recall the definition of $\lambda_{+}(\overline{\rho}_{*})$ and $\lambda_{-}(\overline{\rho}_{*})$ given by Equation (9.14) and (9.15). Let $\overline{\rho}_{*}$ be a fixed point of the PDE such that $\lambda_{+}(\overline{\rho}_{*})\geq 0,\lambda_{-}(\overline{\rho}_{*})\geq 0$ or $\lambda_{+}(\overline{\rho}_{*})\leq 0,\lambda_{-}(\overline{\rho}_{*})\leq 0$ , but $\lambda_{+}(\overline{\rho}_{*})$ and $\lambda_{-}(\overline{\rho}_{*})$ are not both equal to $0$ . In this case, according to Eq. (9.18), both $\partial_{r_{1}}\psi_{\infty}({\bm{r}};\overline{\rho}_{*})$ and $\partial_{r_{2}}\psi_{\infty}({\bm{r}};\overline{\rho}_{*})$ must be strictly positive or strictly negative. Since ${\rm supp}(\overline{\rho}_{*})\subseteq\{{\bm{r}}\in E_{2}:\nabla_{\bm{r}}\psi_{\infty}({\bm{r}};\overline{\rho}_{*})={\mathbf{0}}\}$ , $\overline{\rho}_{*}$ must be a combination of two delta functions located at ${\mathbf{0}}$ and $\infty$ , i.e., $\overline{\rho}_{*}=a_{0}\delta_{\mathbf{0}}+(1-a_{0})\delta_{\infty}$ . But for a fixed point like this, it is easy to see that $\overline{R}_{\infty}(\overline{\rho}_{*})\geq 1$ . Such a fixed point $\overline{\rho}_{*}$ cannot be a limiting point of the PDE since $\overline{R}_{\infty}(\overline{\rho}_{0})<1$ .

Step 3. Finish the proof using two claims.

Claim (1). If $L({\mathcal{S}}_{*})\subseteq{\mathcal{P}}_{1}$ , then for any $\overline{\rho}_{*}\in{\mathcal{S}}_{*}$ , we have $\overline{\rho}_{*}((0,\infty)\times\{0\})=1$ .

Claim (2). We cannot have $L({\mathcal{S}}_{*})\subseteq{\mathcal{P}}_{2}$ .

Here we assume that these two claims hold, and use them to prove our results. For $\Delta<\Delta_{\infty}$ , we proved in Theorem 12 that there is no fixed point such that $L(\overline{\rho}_{*})=(0,0)$ . Therefore, we cannot have $L({\mathcal{S}}_{*})=\{(0,0)\}$ . Due to Claim $(2)$ , we cannot have $L({\mathcal{S}}_{*})\subseteq{\mathcal{P}}_{2}$ . Hence, we must have $L({\mathcal{S}}_{*})\subseteq{\mathcal{P}}_{1}$ . According to Theorem 12, for $\Delta<\Delta_{\infty}$ , the only fixed point of the PDE with $\overline{\rho}_{*}((0,\infty)\times\{0\})=1$ is a point mass at some location ${\bm{r}}_{*}=(r_{*1},0)$ . Furthermore, this delta function fixed point is unique and is also the global minimizer of the risk. Therefore, we conclude that, for $\Delta<\Delta_{\infty}$ , the PDE will converge to this global minimizer.

For $\Delta\geq\Delta_{\infty}$ , according to Claim (1), if $\overline{\rho}_{*}$ is a limiting point such that $L(\overline{\rho}_{*})\in{\mathcal{P}}_{1}$ , then $\overline{\rho}_{*}((0,\infty)\times\{0\})=1$ . According to Theorem 12, a fixed point $\overline{\rho}_{*}$ with $\overline{\rho}_{*}((0,\infty)\times\{0\})=1$ and $L(\overline{\rho}_{*})\neq(0,0)$ must be a point mass at some location ${\bm{r}}_{*}=(r_{*1},0)$ , with $L(\overline{\rho}_{*})\in{\mathcal{P}}_{2}$ . Therefore, we cannot have $L({\mathcal{S}}_{*})\subseteq{\mathcal{P}}_{1}$ . Claim $(2)$ also tells us that we cannot have $L({\mathcal{S}}_{*})\subseteq{\mathcal{P}}_{2}$ . Hence, we must have $L({\mathcal{S}}_{*})=\{(0,0)\}$ . In this case, all the points in the set ${\mathcal{S}}_{*}$ have risk $0$ . Therefore, we conclude that, for $\Delta\geq\Delta_{\infty}$ , the PDE will converge to some limiting set with risk $0$ .

We are left with the task of proving the two claims above. Before that, we introduce some useful notions used in the proof. Define $Z({\bm{r}})$ for ${\bm{r}}\in E_{2}$ ,

Define $Z_{l}(r)\equiv Z((r,lr))$ for $r,l\in[0,\infty]$ . Then we have

According to condition S4, for any fixed $l\in[0,\infty]$ , $Z_{l}(r)$ is increasing in $r$ .

Recall the formula of $\nabla_{\bm{r}}\psi_{\infty}({\bm{r}};\overline{\rho})$ given by Equation (9.18). Define

Proof of Claim $(1)$ . If $L({\mathcal{S}}_{*})\subseteq{\mathcal{P}}_{1}$ , then for any $\overline{\rho}_{*}\in{\mathcal{S}}_{*}$ , we have $\overline{\rho}_{*}((0,\infty)\times\{0\})=1$ .

Assume $L({\mathcal{S}}_{*})\subseteq{\mathcal{P}}_{1}$ . There must exist $t_{0}$ large enough, so that for all $t\geq t_{0}$ , we have $\lambda_{+}(\overline{\rho}_{t})<0$ , and $\lambda_{-}(\overline{\rho}_{t})>0$ . Therefore, we must have $\chi_{\rm tg}({\bm{r}};\overline{\rho}_{t})>0$ for any ${\bm{r}}\in(0,\infty)^{2}$ . We denote

starting with ${\bm{r}}(t_{0})\in\Gamma_{k}$ for some $k\in(0,\infty)$ , we claim ${\bm{r}}(t)\in\Gamma_{k}$ for any $t\geq t_{0}$ . Indeed, for any ${\bm{r}}\in\partial\Gamma_{k}\cap\{{\bm{r}}:r_{2}=kr_{1}>0\}$ , the normal vector pointing outside $\Gamma_{k}$ is ${\bm{n}}({\bm{r}})=(-r_{2},r_{1})/\|{\bm{r}}\|_{2}$ , and hence $\langle\nabla_{\bm{r}}\psi_{\infty}({\bm{r}};\overline{\rho}_{t}),{\bm{n}}({\bm{r}})\rangle=\chi_{\rm tg}({\bm{r}};\overline{\rho}_{t})>0$ . Therefore, ${\bm{r}}(t)$ cannot leak outside $\Gamma_{k}$ from this boundary. Further note that ${\bm{r}}(t)$ cannot reach the boundary $([0,\infty)\times\{0\})\cup\{\infty\}$ for any finite time $t$ . This proves the claim that ${\bm{r}}(t)\in\Gamma_{k}$ for any $t\geq t_{0}$ .

According to Lemma 7.8, we have $\rho_{t}(\Gamma_{k})\geq\rho_{t_{0}}(\Gamma_{k})$ for any $k\in(0,\infty)$ . Furthermore, according to Lemma 7.9, $\overline{\rho}_{t_{0}}((0,\infty)^{2})=1$ , hence $\lim_{k\to\infty}\overline{\rho}_{t_{0}}(\Gamma_{k})=1$ . Therefore, for any $\overline{\rho}_{*}\in{\mathcal{S}}_{*}$ , we must have

Theorem 12 implies that for any such fixed point $\overline{\rho}_{*}$ , we have ${\rm supp}(\overline{\rho}_{*})\subseteq([0,\infty)\times\{0\})\cup\{\infty\}$ .

In this case, we claim $L({\mathcal{S}}_{*})\subseteq{\mathcal{P}}_{1}\cap\{(\lambda_{+},\lambda_{-}):Z_{0}(0)<-\lambda_{+}/\lambda_{-}<Z_{0}(\infty)\}$ . Indeed, suppose there exists $\overline{\rho}_{*}\in{\mathcal{S}}_{*}$ , such that $-\lambda_{+}(\overline{\rho}_{*})/\lambda_{-}(\overline{\rho}_{*})\geq Z_{0}(\infty)$ or $-\lambda_{+}(\overline{\rho}_{*})/\lambda_{-}(\overline{\rho}_{*})\leq Z_{0}(0)$ . Then, according to Equation (9.24), $\chi_{\rm nm}((r,0);\overline{\rho}_{*})$ must be strictly positive or strictly negative. However, we know ${\rm supp}(\overline{\rho}_{*})\subseteq\{{\bm{r}}:\nabla\psi_{\infty}({\bm{r}};\overline{\rho}_{*})={\mathbf{0}}\}$ . Hence, $\overline{\rho}_{*}$ should be a combination of two delta functions located at ${\mathbf{0}}$ and $\infty$ . Such a fixed point $\overline{\rho}_{*}$ has risk $\overline{R}_{\infty}(\overline{\rho}_{*})\geq 1$ , hence $\overline{\rho}_{*}$ cannot be a limiting point of the PDE. Hence the claim holds.

According to the conditions S0 - S4 on $q(r)$ , for any fixed $l$ , $Z_{l}(r)$ is an increasing function of $r$ , and for any fixed $r$ , $Z_{l}(r)$ is continuous in $l$ . Therefore, for the fixed ${\varepsilon}_{0}>0$ , there exist $0<r_{0}<r_{\infty}<\infty$ and $b>0$ , such that

As a result, for any $t\geq t_{0}$ , we have

where $\Gamma_{(\cdot)}$ is defined as in Equation (9.26).

According to Lemma 7.9, $\overline{\rho}_{t_{0}}((0,\infty)^{2})=1$ . Define

The set $O_{k}$ is increasing in $k$ , and $\cup_{k}O_{k}\supset(0,\infty)^{2}$ . Hence $\lim_{k\to\infty}\overline{\rho}_{t_{0}}(O_{k})=1$ . Now we fix a parameter $k$ .

Recall the formula for $\chi_{\rm nm}$ and $\chi_{\rm tg}$ given by Equation (9.24) and (9.25). It is easy to see that there exist $0<u_{k1},u_{k2}<\infty$ depending on $(b,k,\tau_{+},\tau_{-},Z_{0}(0),Z_{0}(\infty),{\varepsilon}_{0})$ , such that for any ${\bm{r}}\in(0,\infty)^{2}$ with $b\cdot r_{1}\leq r_{2}\leq k\cdot r_{1}$ , and $t\geq t_{0}$ , we have

Consider the following spiral curve ${\bm{r}}_{k}^{\infty}(s)=(r_{k1}^{\infty}(s),r_{k2}^{\infty}(s))$ , with

and another spiral curve ${\bm{r}}_{k}^{0}(s)=(r_{k1}^{0}(s),r_{k2}^{0}(s))$ , with

for $s\in[0,s_{k*}]$ with $s_{k*}=\arctan(k)-\arctan(b)$ .

Because of inequality (9.35), along the curve ${\bm{r}}_{k}^{\infty}(s)$ , denoting ${\bm{n}}({\bm{r}}_{k}^{\infty}(s))$ to be its normal vector with $[{\bm{n}}({\bm{r}}_{k}^{\infty}(s))]_{2}>0$ , we have for any $t\geq t_{0}$ and $s\in[0,s_{k*}]$ ,

Along the curve ${\bm{r}}_{k}^{0}(s)$ , denoting ${\bm{n}}({\bm{r}}_{k}^{0}(s))$ to be its normal vector with $[{\bm{n}}({\bm{r}}_{k}^{0}(s))]_{2}>0$ , we have for any $t\geq t_{0}$ and $s\in[0,s_{k*}]$ ,

Consider the ODE (9.27) starting with ${\bm{r}}(t_{0})\in\Omega_{k}$ for any $k\geq\max\{r_{\infty},1/r_{0}\}$ . We claim ${\bm{r}}(t)\in\Omega_{k}$ for any $t\geq t_{0}$ . Indeed, combining Eq. (9.31), (9.33), (9.38), and (9.39), for any ${\bm{r}}\in\partial\Omega_{k}\setminus(([0,\infty)\times\{0\})\cup\{\infty\})$ and $t\geq t_{0}$ , the gradient $\nabla\psi_{\infty}({\bm{r}};\overline{\rho}_{t})$ points outside $\Omega_{k}$ . Therefore, ${\bm{r}}(t)$ cannot leak outside $\Omega_{k}$ from this boundary. Further note that ${\bm{r}}(t)$ cannot reach the boundary $([0,\infty)\times\{0\})\cup\{\infty\}$ for any finite time $t$ . This proves the claim that ${\bm{r}}(t)\in\Omega_{k}$ for any $t\geq t_{0}$ . According to Lemma 7.8, $\overline{\rho}_{t}(\overline{\Omega}_{k})\geq\overline{\rho}_{t_{0}}(\overline{\Omega}_{k})$ for any $k\geq\max\{r_{\infty},1/r_{0}\}$ and $t\geq t_{0}$ .

Recall the definition of $O_{k}$ given by Equation (9.32). Note that $O_{k}\subseteq\Omega_{k}$ , and $\lim_{k\to\infty}\overline{\rho}_{t_{0}}(\overline{O}_{k})=1$ , which implies $\lim_{k\to\infty}\overline{\rho}_{t_{0}}(\overline{\Omega}_{k})=1$ . Hence, for any $\overline{\rho}_{*}\in{\mathcal{S}}_{*}$ ,

It is easy to see that $\cup_{k}\overline{\Omega}_{k}=(0,\infty)\times[0,\infty)$ . Combining with the fact that $\overline{\rho}_{*}((0,\infty)^{2})=0$ for any $\overline{\rho}_{*}\in{\mathcal{S}}_{*}$ , claim $(1)$ holds.

Proof of Claim (2). We cannot have $L({\mathcal{S}}_{*})\subseteq{\mathcal{P}}_{2}$ .

In the case $L({\mathcal{S}}_{*})\subseteq{\mathcal{P}}_{2}$ , the argument is similar to the proof of Claim (1), and hence will be presented in a condensed form. First, there exists $t_{0}$ large enough, so that for all $t\geq t_{0}$ , we have $\lambda_{+}(\overline{\rho}_{t})>0$ , and $\lambda_{-}(\overline{\rho}_{t})<0$ . Then $\chi_{\rm tg}({\bm{r}};\overline{\rho}_{t})<0$ for any ${\bm{r}}\in(0,\infty)^{2}$ . Letting

According to the same argument as in the proof of Claim (1), we have $\rho_{t}(\Gamma_{k})\geq\rho_{t_{0}}(\Gamma_{k})$ for any $k\in(0,\infty)$ and $t\geq t_{0}$ . As a result, we have ${\rm supp}(\overline{\rho}_{*})\subseteq(\{0\}\times[0,\infty))\cup\{\infty\}$ .

However, the fixed point $\overline{\rho}_{*}$ with support on $(\{0\}\times[0,\infty))\cup\{\infty\}$ has risk $\overline{R}_{\infty}(\overline{\rho}_{*})\geq 1$ . Therefore, we cannot have $L({\mathcal{S}}_{*})\subseteq{\mathcal{P}}_{2}$ . This proves claim (2).

4 Dynamics: Proof of Theorem 2

We will prove that the dynamics for large but finite $d$ is well approximated by the dynamics at $d=\infty$ . The key estimate is provided by the next lemma.

Assume $\sigma$ satisfies condition S0, and recall the definition of $u_{d}$ and $u_{\infty}$ given by Equation (9.9) and (9.10). Assuming $s_{0}=\gamma\cdot d$ for some $\gamma\in(0,1)$ , we have

Define $F_{3}=F_{1}\cos\Theta_{1}+F_{2}\sin\Theta_{1}$ , $G_{3}=G_{1}\cos\Theta_{2}+G_{2}\sin\Theta_{2}$ , then

We have similar bounds for $|\partial_{a_{2}}u_{d,1}({\bm{a}},{\bm{b}})-\partial_{a_{2}}u_{\infty,1}({\bm{a}},{\bm{b}})|$ .

Note the relationship of $\Theta_{3}=\Theta_{3}({\bm{a}})$ with $(\Theta_{1},\Theta_{2})$ is given by Eq. (9.51), which yields

The lemma holds by noting that as $d\to\infty$ , we have $s_{0}\to\infty$ and $d-s_{0}\to\infty$ . ∎

Recall the definition of $\overline{R}_{\infty}$ given by Eq. (9.11), and $R$ given by Eq. (6.2). Recall the set of good initialization given by

Define $\mathscrsfs{P}_{\mbox{\tiny\rm good}}^{1}$ and $\mathscrsfs{P}_{\mbox{\tiny\rm good}}^{2}$ to be

With this definition, it is easy to see that $\mathscrsfs{P}_{\mbox{\tiny\rm good}}^{1}=\mathscrsfs{P}_{\mbox{\tiny\rm good}}$ .

Here we bound $d_{\mbox{\tiny\rm BL}}(\overline{\rho}_{0}^{2,d},\overline{\rho}_{0}^{2,\infty})$ . Note the joint distribution of ${\bm{u}}_{d}$ and ${\bm{u}}_{\infty}$ is a coupling of $\overline{\rho}_{0}^{2,d}$ and $\overline{\rho}_{0}^{2,\infty}$ , hence

It is easy to see that $\lim_{d\to\infty}Y_{1}/(Y_{1}+Y_{2})=\gamma$ almost surely. Bounded convergence theorem implies that $\lim_{d\to\infty}d_{\mbox{\tiny\rm BL}}(\overline{\rho}_{0}^{2,d},\overline{\rho}_{0}^{2,\infty})=0$ .
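The almost sure limit above is a law-of-large-numbers statement. As a hedged illustration, the sketch below assumes that $Y_{1}$ and $Y_{2}$ are independent chi-square variables with $s_{0}=\gamma d$ and $d-s_{0}$ degrees of freedom. Their definitions are not reproduced in this section, so this form is an assumption made only for the example, as are the function name and the seed.

```python
import random

def ratio_sample(d, gamma, rng):
    """Draw Y1 / (Y1 + Y2) with Y1 ~ chi^2(s0), Y2 ~ chi^2(d - s0), s0 = round(gamma * d).

    The chi-square form of Y1 and Y2 is an assumption made for this illustration.
    """
    s0 = round(gamma * d)
    y1 = rng.gammavariate(s0 / 2.0, 2.0)        # chi^2(s0) = Gamma(shape s0/2, scale 2)
    y2 = rng.gammavariate((d - s0) / 2.0, 2.0)  # chi^2(d - s0)
    return y1 / (y1 + y2)

rng = random.Random(0)
gamma = 0.3
for d in (10, 100, 10000, 1000000):
    print(d, ratio_sample(d, gamma, rng))
```

Under this assumption the ratio follows a Beta distribution with mean $\gamma$ and variance of order $1/d$ , so the printed values should approach $\gamma$ as $d$ grows.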

Now we consider the PDE (9.16) for $d=\infty$ . We fix its initialization $\overline{\rho}_{0}^{2,\infty}\in\mathscrsfs{P}_{\mbox{\tiny\rm good}}^{2}$ induced by $\overline{\rho}_{0}^{1}\in\mathscrsfs{P}_{\mbox{\tiny\rm good}}^{1}$ . Denote the solution of PDE (9.16) to be $(\overline{\rho}_{t}^{\infty})_{t\geq 0}$ . Due to Theorem 13, for any $\eta>0$ , there exists $T=T(\eta,\overline{\rho}_{0}^{1},\gamma,\Delta)>0$ , so that its solution $(\overline{\rho}_{t}^{\infty})_{t\geq 0}$ satisfies

Then we would like to bound the difference of $\overline{R}_{\infty}(\overline{\rho})$ and $\overline{R}_{d}(\overline{\rho})$ for any $\overline{\rho}$ . Note

By Lemma 9.1, there exists $d_{0}=d_{0}(\eta,\Delta)$ large enough, so that for $d\geq d_{0}$ , we have

Finally, let $({\bm{\theta}}^{k})_{k\geq 1}$ be the trajectory of SGD, with step size $s_{k}={\varepsilon}\xi(k{\varepsilon})$ , and initialization ${\bm{w}}_{i}^{0}\sim_{iid}\rho_{0}$ for $i\leq N$ . We apply Theorem 3 to bound the difference between the law of the SGD trajectory and the solution of PDE (9.60). The assumptions of Theorem 3 are verified by Lemma 8.2. As a consequence, there exists a constant $K$ (which depends only on the constants in assumptions A1, A2, and A3), such that

with probability $1-e^{-z^{2}}$ for any $t\leq 10T$ , where

As a consequence, for any $\delta>0$ , there exists $C_{0}=C_{0}(\delta,\eta,\overline{\rho}_{0}^{1},\gamma,\Delta)$ , so that for $N,1/{\varepsilon}\geq C_{0}d$ and ${\varepsilon}\geq 1/N^{10}$ , and for $t\leq 10T$ , we have

Therefore, the trajectory ${\bm{\theta}}^{\lfloor t/{\varepsilon}\rfloor}$ of SGD for $t\in[T,10T]$ satisfies

with probability at least $1-\delta$ . This gives the desired result. ∎

Finite temperature

We state the lemmas regarding static properties of the finite temperature free energy in Section 10.1, and those regarding dynamical properties in Section 10.2. We prove Proposition 3, Theorem 4, and Theorem 5 in Section 10.3. Throughout Sections 10.1 and 10.2, to distinguish the dimension of parameters from the generalized differential operator, we denote the dimension of parameters by $d$ instead of $D$ . This should not be confused with the dimension of feature vectors, which never appears in this section.

We introduce the set ${\mathcal{K}}$ of admissible probability densities,

Define $\Omega_{0}=\{{\bm{\theta}}:1/(\sqrt{2\pi}\sigma)^{d}\cdot\exp\{-\|{\bm{\theta}}\|_{2}^{2}/(2\sigma^{2})\}\leq\rho({\bm{\theta}})^{1/2}\leq 1\}$ . Then we have

Noting that $|\rho\log\rho|\leq\sqrt{\rho}$ for any $\rho\in[0,1]$ , the second term is bounded by

Assume $U$ and $V$ are bounded-Lipschitz. Then for any $\lambda>0$ and $0<\beta<\infty$ , $F_{\beta,\lambda}(\rho)$ has a unique minimizer $\rho_{*}\in{\mathcal{K}}$ . Moreover, we have

Taking $\sigma^{2}=4/(\beta\lambda)$ gives Eq. (10.12) .
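The elementary inequality $|\rho\log\rho|\leq\sqrt{\rho}$ used earlier in this section can be checked numerically. The sketch below assumes the relevant range is $\rho\in(0,1]$ , consistent with the constraint $\rho({\bm{\theta}})^{1/2}\leq 1$ in the definition of $\Omega_{0}$ , and evaluates the ratio $\sqrt{\rho}\,|\log\rho|$ on a fine grid. The grid resolution and the helper name `ratio` are illustrative choices.

```python
import math

def ratio(rho):
    """|rho log rho| / sqrt(rho) = sqrt(rho) * |log rho| for rho in (0, 1]."""
    return math.sqrt(rho) * abs(math.log(rho))

grid = [i / 100000 for i in range(1, 100001)]  # rho in (0, 1]
worst = max(ratio(rho) for rho in grid)

# Calculus gives the maximizer rho = exp(-2), where the ratio equals 2/e < 1.
print(worst, 2 / math.e)
```

Since the maximum of the ratio is $2/e$ , the inequality holds on $(0,1]$ with a constant strictly below one.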

For any $\rho\in{\mathcal{K}}$ , we call the following equation the Boltzmann fixed point condition

Under the assumption of Lemma 10.2, the minimizer $\rho_{*}\in{\mathcal{K}}$ of $F_{\beta,\lambda}(\rho)$ satisfies the Boltzmann fixed point condition.

for any ${\varepsilon}<{\varepsilon}_{0}$ . For ${\varepsilon}$ sufficiently small, we have $F_{\beta,\lambda}(\rho_{\varepsilon})<F_{\beta,\lambda}(\rho_{*})$ . This contradicts the fact that $\rho_{*}\in{\mathcal{K}}$ is the minimizer of $F_{\beta,\lambda}(\rho)$ .

for some constant $\gamma(\beta,\lambda;\rho_{*})$ .

Note we have $\int\rho_{*}({\bm{\theta}}){\rm d}{\bm{\theta}}=1$ . Therefore, we must have $\gamma(\beta,\lambda;\rho_{*})=-1/\beta\cdot\log Z(\beta,\lambda;\rho_{*})$ . This proves that $\rho_{*}$ satisfies the Boltzmann fixed point condition.
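As a numerical illustration, the following sketch solves a Boltzmann-type fixed point equation on a one-dimensional grid by damped fixed-point iteration. It assumes the condition has the form $\rho_{*}({\bm{\theta}})=\exp\{-\beta\Psi_{\lambda}({\bm{\theta}};\rho_{*})\}/Z$ with $\Psi_{\lambda}({\bm{\theta}};\rho)=V({\bm{\theta}})+\int U({\bm{\theta}},{\bm{\theta}}^{\prime})\rho({\bm{\theta}}^{\prime}){\rm d}{\bm{\theta}}^{\prime}+(\lambda/2)\|{\bm{\theta}}\|_{2}^{2}$ . The specific $V$ , $U$ , grid, damping factor, and function names are hypothetical choices made only for this example.

```python
import numpy as np

def boltzmann_map(rho, grid, V_vals, U_vals, beta, lam):
    """Apply rho -> exp(-beta * Psi_lam(.; rho)) / Z on a uniform 1D grid."""
    dtheta = grid[1] - grid[0]
    # Psi_lam(theta; rho) = V(theta) + int U(theta, theta') rho(theta') dtheta' + lam/2 theta^2
    psi = V_vals + (U_vals @ rho) * dtheta + 0.5 * lam * grid**2
    unnorm = np.exp(-beta * (psi - psi.min()))  # shift by the minimum for numerical stability
    return unnorm / (unnorm.sum() * dtheta)     # normalise so the density integrates to 1

def solve_boltzmann(grid, V_vals, U_vals, beta, lam, n_iter=200, damping=0.5):
    """Damped iteration rho <- (1 - a) rho + a * boltzmann_map(rho)."""
    dtheta = grid[1] - grid[0]
    rho = np.exp(-0.5 * beta * lam * grid**2)   # Gaussian initial guess
    rho /= rho.sum() * dtheta
    for _ in range(n_iter):
        rho = (1.0 - damping) * rho + damping * boltzmann_map(
            rho, grid, V_vals, U_vals, beta, lam)
    return rho

# Toy bounded-Lipschitz choices (assumptions for illustration only).
grid = np.linspace(-8.0, 8.0, 401)
V_vals = 0.3 * np.sin(grid)
U_vals = 0.2 * np.cos(grid[:, None] - grid[None, :])
rho_star = solve_boltzmann(grid, V_vals, U_vals, beta=1.0, lam=1.0)
residual = np.max(np.abs(
    rho_star - boltzmann_map(rho_star, grid, V_vals, U_vals, 1.0, 1.0)))
print("fixed-point residual:", residual)
```

The damping is a convenience for this toy problem; it is not part of the argument above, which establishes existence through minimization of the free energy.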

Under the assumption of Lemma 10.2, the Boltzmann fixed point condition has a unique solution in ${\mathcal{K}}$ .

The last two lemmas already imply that the Boltzmann fixed point condition has at least one solution. Assume $\rho_{1},\rho_{2}\in{\mathcal{K}}$ are two such solutions. Then each $\rho_{i}$ is positive, and

Note the right hand side does not equal unless $\rho_{1}=\rho_{2}$ .

Assume the conditions of Lemma 10.2 hold, and further assume that condition A3 holds. Let $\rho_{*}^{\beta,\lambda}$ be the minimizer of $F_{\beta,\lambda}(\rho)$ . Then there is a constant $K$ depending on the parameter $K_{3}$ in condition A3, such that for any $\beta\geq 1$ , we have

Let ${\bm{G}},{\bm{G}}_{1},{\bm{G}}_{2}\sim{\sf N}({\mathbf{0}},{\bm{I}}_{d})$ be independent. We have

Using the intermediate value theorem and Cauchy-Schwarz inequality, and noting that $\nabla^{2}V$ is $K_{3}$ -bounded by condition A3, we have

We have a similar bound for the $U$ term. Therefore,

Next we give a lower bound for ${\rm Ent}(\rho*g_{\tau})$ :

As a result, taking $\tau=1/\beta$ , we have

2 Dynamics

Recall that the finite-temperature distributional dynamics reads:

Notice that this notion of weak solution is equivalent to the one introduced earlier in Eq. (7.3), see for instance [San15, Proposition 4.2].

Without loss of generality, we assume $\xi(t)\equiv 1/2$ .

We use the JKO scheme of [JKO98, Theorem 5.1] to show the existence, uniqueness, and absolute continuity of the solution of PDE (10.22). Since the proof is essentially the same as the proof of [JKO98, Theorem 5.1], we skip several details.

For any $\overline{\rho}_{k-1}^{h}$ , the optimization problem (10.24) has a unique minimizer $\overline{\rho}_{k}^{h}\in{\mathcal{K}}$ . The proof is essentially the same as that of Lemma 10.2, additionally noting that $W_{2}^{2}(\rho,\overline{\rho}_{k-1}^{h})$ as a function of $\rho$ is lower bounded, lower semi-continuous, and convex over $\rho\in{\mathcal{K}}$ .

In the following, we will show that this $\rho^{h}$ approximately satisfies PDE (10.23) in the weak form.

Since $\overline{\rho}_{k}^{h}$ is the minimizer of optimization problem (10.24), we have for each $\tau>0$ ,

Using the result in the proof of [JKO98, Theorem 5.1], and noting $\nabla V$ is bounded Lipschitz, we have

We further need to compute the derivative of $\langle U,\nu_{\tau}^{\otimes 2}\rangle$ with respect to $\tau$ . Since $U$ is symmetric, we have

The uniqueness of the solution of Eq. (10.23) can be proved using standard methods from the theory of elliptic-parabolic equations (see, for instance, [JKO98, Theorem 5.1]). The proof of uniqueness requires the smoothness of the solution, which is established in Lemma 10.7.

Before proving this lemma, we introduce some notation.

For any nonnegative integer $l$ and $1\leq p\leq\infty$ , we denote by $W_{p}^{l}(\Omega)$ the Banach space (Sobolev space) consisting of the elements of $L^{p}(\Omega)$ whose generalized derivatives of all forms up to and including order $l$ are $p$ -th power integrable on $\Omega$ . The norm in $W_{p}^{l}(\Omega)$ is defined by

where ${\bm{\alpha}}=(\alpha_{1},\ldots,\alpha_{d})$ is a multi-index with $|{\bm{\alpha}}|=\sum_{i=1}^{d}\alpha_{i}$ , and $D^{\bm{\alpha}}_{\bm{\theta}}u=\partial^{|{\bm{\alpha}}|}u/\partial\theta_{1}^{\alpha_{1}}\cdots\partial\theta_{d}^{\alpha_{d}}$ .

We say $u\in L^{r,p}_{\rm loc}(S)$ if for any compact subset $[t_{1}^{\prime},t_{2}^{\prime}]\subset(t_{1},t_{2})$ and compact subset $\Omega^{\prime}\subset\Omega$ , we have $u\in L^{r,p}([t_{1}^{\prime},t_{2}^{\prime}]\times\Omega^{\prime})$ . We will denote $L^{p,p}(S)$ by $L^{p}(S)$ , and $L^{p,p}_{\rm loc}(S)$ by $L^{p}_{\rm loc}(S)$ .

For nonnegative integer $l$ and $1\leq p\leq\infty$ , we denote $W_{p}^{2l,l}(S)$ to be the Banach space consisting of the elements of $L^{p}(S)$ having generalized derivatives of the form $D_{t}^{r}D_{\bm{\theta}}^{\bm{\alpha}}$ with $r$ and ${\bm{\alpha}}$ satisfying the inequality $2r+|{\bm{\alpha}}|\leq 2l$ . The corresponding norm is defined by

We denote by $C^{m,n}(S)$ the space of continuous functions with $m$ continuous derivatives in time and $n$ continuous derivatives in space. For example, $u\in C^{1,2}(S)$ if and only if $u,\partial_{t}u,\nabla_{\bm{\theta}}u,\nabla_{\bm{\theta}}^{2}u\in C^{0,0}(S)\equiv C(S)$ . We say $u\in C_{c}^{m,n}(S)$ if $u\in C^{m,n}(S)$ and the support of $u$ is compact. We will denote $C^{n,n}(S)$ by $C^{n}(S)$ , and $C^{n,n}_{c}(S)$ by $C^{n}_{c}(S)$ .

We denote $G$ to be the heat kernel, where for $t>0$ , we have

The proof is similar to the one of [JKO98, Theorem 5.1], so we will skip some details. Without loss of generality we can set $\beta=1$ , and $\xi(t)=1/2$ (different choices can be obtained by rescaling $\Psi({\bm{\theta}};\rho)$ and reparametrizing time).

Step 1. Show that $\rho\in L^{\infty,p}_{\rm loc}({E})$ .

Taking $G$ to be the heat kernel, it is easy to see that

provided that the $p_{o},p_{i}$ satisfy the relations

Here, $C$ is a constant that depends only on $T,\delta$ and on $p_{i},p_{o}$ .

Define $\varphi_{1}\equiv\rho(\Delta\eta-\langle\nabla\Psi,\nabla\eta\rangle){\bm{1}}\{t>{\varepsilon}\}$ , $\varphi_{2}\equiv\rho(2\nabla\eta-\eta\nabla\Psi){\bm{1}}\{t>{\varepsilon}\}$ , and $\psi\equiv\rho({\varepsilon})\eta$ . Then Eq. (10.43) reads

where $\Omega_{1}\subseteq{\rm supp}(\eta)\subseteq\Omega_{2}$ . Therefore, $\rho\in L_{\rm loc}^{\infty,p_{o}}({E})$ , where $p_{i},p_{o}$ satisfy Eq. (10.46).

Note that there exists a sequence $p_{i,l},p_{o,l}$ for $1\leq l\leq k$ and $k<\infty$ , so that $p_{i,l+1}=p_{o,l}$ , $p_{i,1}=p<d/(d-1)$ , $p_{i,k}=\infty$ , and $p_{i,l},p_{o,l}$ for fixed $l$ satisfy Eq. (10.46). Since we have $\rho\in L^{\infty,p}_{\rm loc}({E})$ , using Eq. (10.50) iteratively, we have $\rho\in L^{\infty,p_{o,l}}_{\rm loc}({E})$ for any $1\leq l\leq k$ . As a result, we have $\rho\in L^{\infty}_{\rm loc}({E})$ .

Step 3. Derivatives, $D\rho$ , $D^{2}\rho$ , and $D^{3}\rho$ .

where $1<p\leq\infty$ and $m$ is a nonnegative integer.

Then we show the regularity of $D^{2}\rho$ . Since $\nabla^{2}\Psi\in L_{\rm loc}^{\infty}({E})$ , we have $D\varphi_{1},D\varphi_{2}\in L^{\infty}({E})$ . Due to Eq. (10.51), we have $D^{3}\{\varphi_{1}*_{2}G\},D^{3}\{\varphi_{2}*_{2}G\}\in L^{\infty}({E})$ , which also implies $D^{2}\{\varphi_{1}*_{2}G\}\in L^{\infty}_{\rm loc}({E})$ . Hence we have $D^{2}(\rho\eta)=D^{2}\{\varphi_{1}*_{2}G\}+D^{3}\{\varphi_{2}*_{2}G\}+D^{2}[\psi*G_{\varepsilon}]\in L^{\infty}({S})$ , which gives $D^{2}\rho\in L^{\infty}_{\rm loc}({E})$ .

Next we show the regularity of $D^{3}\rho$ . Since $\nabla^{3}\Psi\in L_{\rm loc}^{\infty}({E})$ , we have $D^{2}\varphi_{1},D^{2}\varphi_{2}\in L^{\infty}({E})$ . Due to Eq. (10.51), we have $D^{4}\{\varphi_{1}*_{2}G\},D^{4}\{\varphi_{2}*_{2}G\}\in L^{\infty}({E})$ , which also implies $D^{3}\{\varphi_{1}*_{2}G\}\in L^{\infty}_{\rm loc}({E})$ . Hence we have $D^{3}(\rho\eta)=D^{3}\{\varphi_{1}*_{2}G\}+D^{4}\{\varphi_{2}*_{2}G\}+D^{3}[\psi*G_{\varepsilon}]\in L^{\infty}({S})$ , which gives $D^{3}\rho\in L^{\infty}_{\rm loc}({E})$ .

Step 4. Derivatives, $D_{t}\rho$ , $D_{t}D\rho$ , and $D_{t}D^{2}\rho$ .

Now we study the regularity of $D_{t}\rho,D_{t}D\rho,D_{t}D^{2}\rho$ . Note we have $D_{t}(\rho\eta)=D_{t}\{\varphi_{1}*_{2}G\}-D_{t}\{D\varphi_{1}*_{2}G\}+D_{t}\{\psi*G_{\varepsilon}\}$ . Due to Eq. (10.51), $\varphi_{1},D\varphi_{2}\in L^{\infty}({E})$ implies that $D_{t}\{\varphi_{1}*_{2}G\},D_{t}\{D\varphi_{1}*_{2}G\}\in L^{\infty}({E})$ and hence $D_{t}[\rho\eta]\in L^{\infty}({S})$ , $D_{t}\rho\in L^{\infty}_{\rm loc}({E})$ .

Note we have $D_{t}D(\rho\eta)=D_{t}\{D\varphi_{1}*_{2}G\}+D_{t}\{D^{2}\varphi_{1}*_{2}G\}+D_{t}\{D\psi*G_{\varepsilon}\}$ . The fact that $D\varphi_{1},D^{2}\varphi_{2}\in L^{\infty}({E})$ implies that $D_{t}\{D\varphi_{1}*_{2}G\},D_{t}\{D^{2}\varphi_{1}*_{2}G\}\in L^{\infty}({E})$ and hence $D_{t}D\rho\in L^{\infty}_{\rm loc}({E})$ .

Note we have $D_{t}D^{2}(\rho\eta)=D_{t}\{D^{2}\varphi_{1}*_{2}G\}-D_{t}\{D^{3}\varphi_{1}*_{2}G\}+D_{t}\{D^{2}\psi*G_{\varepsilon}\}$ . Note that $\nabla^{4}\Psi\in L_{\rm loc}^{\infty}({E})$ , hence $D^{3}\varphi_{2}\in L^{\infty}({E})$ . Combining with the fact that $D^{2}\varphi_{1}\in L^{\infty}({E})$ , we have $D_{t}\{D^{2}\varphi_{1}*_{2}G\},D_{t}\{D^{3}\varphi_{1}*_{2}G\}\in L^{\infty}({E})$ and hence $D_{t}D^{2}\rho\in L^{\infty}_{\rm loc}({E})$ .

Finally we show the regularity of $D_{t}^{2}\rho$ . We have $D_{t}^{2}(\rho\eta)=D_{t}\{D_{t}[\varphi_{1}*_{2}G]-D_{t}[D\varphi_{1}*_{2}G]+D_{t}[\psi*G_{\varepsilon}]\}$ , and

We say $\rho_{*}$ is a fixed point of PDE (10.22), if its solution $(\rho_{t})_{t\geq 0}$ starting from $\rho_{*}$ satisfies $\rho_{t}\equiv\rho_{*}$ for any $t\geq 0$ .

Assume conditions A1 - A3 hold. Then any fixed point $\rho_{*}$ of PDE (10.22) with $\rho_{*}\in{\mathcal{K}}$ must satisfy the Boltzmann fixed point condition (10.13).

Suppose $\rho_{*}\in{\mathcal{K}}$ is a fixed point of PDE (10.22). Taking $W({\bm{\theta}})\equiv\Psi_{\lambda}({\bm{\theta}};\rho_{*})$ , we see that $\rho_{*}$ is also a fixed point of the Fokker-Planck equation (10.54).

Since $\lambda/2\cdot\|{\bm{\theta}}\|_{2}^{2}-2K_{3}\leq\Psi_{\lambda}({\bm{\theta}};\rho_{*})\leq\lambda/2\cdot\|{\bm{\theta}}\|_{2}^{2}+2K_{3}$ , the Fokker-Planck equation has a unique fixed point [MV00], which solves

This is exactly the Boltzmann fixed point condition.

Assume conditions A1 - A4 hold. Let $(\rho_{t})_{t\geq 0}$ be the solution of PDE (10.22) for an initialization $\rho_{0}\in{\mathcal{K}}$ . Then the free energy $F_{\beta,\lambda}(\rho_{t})$ is differentiable with respect to $t$ , with

Therefore, $F_{\beta,\lambda}(\rho_{t})$ is non-increasing in $t$ .

Calculating the differential of the free energy along the curve $\rho_{t}$ , we have

Assume $K_{0}\|{\bm{\theta}}\|_{2}^{2}-K_{1}\leq\Phi({\bm{\theta}})\leq K_{0}\|{\bm{\theta}}\|_{2}^{2}+K_{1}$ for some positive constants $K_{0},K_{1}$ . Define

First we show that the measure $\mu_{*}$ satisfies the Poincaré inequality: for any $f\in{{\mathcal{D}}}$ ,

Note that we have the Poincaré inequality for the Gaussian distribution $\mu$ ,

for any differentiable $f$ . Therefore, we have

This proves the Poincaré inequality (10.58) for $\mu_{*}$ .
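The Gaussian Poincaré inequality invoked in this argument can be checked numerically. The sketch below assumes the standard one-dimensional form ${\rm Var}_{\mu}(f)\leq\mathbb{E}_{\mu}[f^{\prime}(x)^{2}]$ for $\mu={\sf N}(0,1)$ (for variance $\sigma^{2}$ the right-hand side is multiplied by $\sigma^{2}$ ), and evaluates both sides by Gauss-Hermite quadrature. The test functions and helper names are arbitrary illustrations.

```python
import numpy as np
from numpy.polynomial.hermite_e import hermegauss

# Quadrature nodes/weights for the weight function exp(-x^2 / 2).
nodes, weights = hermegauss(80)
weights = weights / np.sqrt(2.0 * np.pi)  # normalise: expectations under N(0, 1)

def gaussian_expectation(g):
    """Approximate E[g(X)] for X ~ N(0, 1)."""
    return float(np.sum(weights * g(nodes)))

def poincare_sides(f, df):
    """Return (Var f(X), E[f'(X)^2]) for X ~ N(0, 1)."""
    mean = gaussian_expectation(f)
    variance = gaussian_expectation(lambda x: (f(x) - mean) ** 2)
    energy = gaussian_expectation(lambda x: df(x) ** 2)
    return variance, energy

var_sin, energy_sin = poincare_sides(np.sin, np.cos)
print("sin:", var_sin, "<=", energy_sin)
```

Linear functions attain equality in the Gaussian case, which is a convenient sanity check for the quadrature.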

Assume conditions A1 - A4 hold. Then the solution $(\rho_{t})_{t\geq 0}$ of PDE (10.22) for any initialization $\rho_{0}\in{\mathcal{K}}$ converges weakly to $\rho_{*}\in{\mathcal{K}}$ as $t\to\infty$ , where $\rho_{*}$ is the unique solution of the Boltzmann fixed point condition, which is the global minimizer of $F_{\beta,\lambda}$ .

According to Lemma 10.10, $F_{\beta,\lambda}$ is non-increasing along the solution path. According to Lemma 10.2, $F_{\beta,\lambda}(\rho_{t})$ is lower bounded. Therefore, we have

According to condition A3, $\nabla_{\bm{\theta}}U$ is $K_{3}$ -bounded-Lipschitz with respect to $({\bm{\theta}},{\bm{\theta}}^{\prime})$ . Therefore,

as $d_{\mbox{\tiny\rm BL}}(\rho_{t},\rho_{*})\to 0$ . Accordingly, we have

Combining Eq. (10.63) with Eq. (10.61), we have

3 Proof of Proposition 3, Theorem 4, and Theorem 5

Proposition 3 follows from Lemmas 10.6, 10.4, and 10.9. Theorem 4 follows from Lemmas 10.2, 10.4, 10.5, and 10.12.

Now we prove Theorem 5. First, according to Lemma 10.5, for any $\eta>0$ , there exists a constant $K$ depending on $\eta,K_{0},K_{1},K_{2},K_{3}$ , such that, taking $\beta\geq KD$ , we have

According to Lemma 10.12, $\rho_{t}$ converges weakly to $\rho_{*}^{\beta,\lambda}$ . Therefore, there exists $T=T(\eta,V,U,\{K_{i}\},D,\lambda,\beta)<\infty$ , so that $d_{\mbox{\tiny\rm BL}}(\rho_{t},\rho_{*}^{\beta,\lambda})\leq\eta/(3Z)$ for any $t\geq T$ , where $Z=Z(\{K_{i}\})$ is the bounded-Lipschitz constant of $R$ with respect to $\rho$ . Hence, we have

Finally, according to Theorem 3, there exists $K^{\prime}$ depending on $K_{i}$ ’s, so that for all $k\leq 10T/{\varepsilon}$ , we have

with probability at least $1-e^{-z^{2}}$ . Hence there exists $C_{0}=C_{0}(\eta,\{K_{i}\},\delta)$ such that, for $N,1/{\varepsilon}\geq C_{0}\exp\{C_{0}T\}D$ and ${\varepsilon}\geq 1/N^{10}$ , we have

Combining Eq. (10.67), (10.68), and (10.69) we get the desired result.

4 Dependence of convergence time on $D$ and $\eta$

Theorem 5 does not provide any estimate for the dependence of the convergence time on the problem dimension $D$ and on the accuracy $\eta$ . However, the proof suggests the following heuristic. When $\rho_{t}$ is sufficiently close to the minimizer $\rho_{*}$ , we can heuristically approximate the free energy dissipation formula (10.2) as

This is the same as the free energy dissipation for the Fokker-Planck equation with potential $\Psi_{\lambda}({\bm{\theta}};\rho_{*})$ . This suggests that, close to $\rho_{*}$ , convergence should be dominated by the speed of convergence in this Fokker-Planck equation, which is controlled by the log-Sobolev constant of the potential $\Psi_{\lambda}({\bm{\theta}};\rho_{*})$ , denoted by $c_{*}$ [MV00]:

Note that the log-Sobolev constant can be exponentially small in $D$ . We expect this heuristic to capture the rough dependence of the convergence time $T$ on $\eta$ and $D$ , hence suggesting $T=e^{O(D)}\log(1/\eta)$ .

Numerical Experiments

In this section, we discuss numerical experiments whose results were presented in the main text, as well as some additional ones. Some technical details of the figures in the main text are also presented here; in particular, Section 11.1.1 for Figure 1, Section 11.1.2 for Figure 2, Section 11.2 for Figure 3, and Section 11.3 for Figure 4.

In this section, we present details of the numerical experiments pertaining to the example of centered isotropic Gaussians:

With probability $1/2$ : $y=+1$ , ${\bm{x}}\sim{\sf N}(0,(1+\Delta)^{2}{\bm{I}}_{d})$ .

With probability $1/2$ : $y=-1$ , ${\bm{x}}\sim{\sf N}(0,(1-\Delta)^{2}{\bm{I}}_{d})$ .
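A minimal sketch of a sampler for this data distribution is given below. It assumes, as in the main text, that the two classes differ only in scale, with standard deviation $1+\Delta$ for $y=+1$ and $1-\Delta$ for $y=-1$ . The function name and the use of NumPy's `default_rng` are illustrative choices.

```python
import numpy as np

def sample_isotropic_gaussians(n, d, delta, rng):
    """Draw n labelled examples (x, y) from the two-class isotropic Gaussian model."""
    y = rng.choice([-1.0, 1.0], size=n)   # each label with probability 1/2
    scale = 1.0 + delta * y               # std 1 + delta if y = +1, 1 - delta if y = -1
    x = scale[:, None] * rng.standard_normal((n, d))
    return x, y

rng = np.random.default_rng(0)
x, y = sample_isotropic_gaussians(n=1000, d=40, delta=0.8, rng=rng)
print(x.shape, y[:5])
```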

In all numerical examples in this section, we use the activation $\sigma_{*}({\bm{x}};{\bm{\theta}}_{i})=\sigma(\langle{\bm{w}}_{i},{\bm{x}}\rangle)$ , where $\sigma(t)=s_{1}$ if $t\leq t_{1}$ , $\sigma(t)=s_{2}$ if $t\geq t_{2}$ , and $\sigma(t)$ interpolated linearly for $t\in(t_{1},t_{2})$ . In simulations we use $t_{1}=0.5$ , $t_{2}=1.5$ , $s_{1}=-2.5$ , $s_{2}=7.5$ . This is also used for examples with centered Gaussians in the main text, cf. Figures 1 and 2, and Section 8 in the supplemental information. This activation is plotted in Figure 11.1.
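The truncated-linear activation described above can be written compactly as a clipped affine function. The following sketch uses the stated values $t_{1}=0.5$ , $t_{2}=1.5$ , $s_{1}=-2.5$ , $s_{2}=7.5$ ; the function name `sigma` and its keyword arguments are only illustrative.

```python
import numpy as np

def sigma(t, t1=0.5, t2=1.5, s1=-2.5, s2=7.5):
    """Piecewise-linear activation: s1 for t <= t1, s2 for t >= t2, linear in between."""
    slope = (s2 - s1) / (t2 - t1)          # slope of the linear piece
    intercept = s1 - slope * t1            # chosen so that sigma(t1) = s1
    return np.clip(slope * np.asarray(t, dtype=float) + intercept, s1, s2)

print(sigma([0.0, 0.5, 1.0, 1.5, 3.0]))
```

With the default values the linear piece has slope 10 and intercept $-7.5$ , so the clipping reproduces the two flat regions exactly.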

Here we discuss empirical validation for the dynamics in the isotropic Gaussian example.

PDE simulation. Simulating the PDE (Eq. of the main text) for general $d$ is computationally intensive. In order to simplify the problem, we only consider $d=\infty$ . In that case, we recall that the risk is given by Eq. (8.10), which we copy here for ease of reference:

The PDE is then $\partial_{t}\overline{\rho}_{t}=2\xi(t)\partial_{r}[\overline{\rho}_{t}\partial_{r}\psi_{\infty}(r;\overline{\rho}_{t})]$ .

The solution to the PDE is approximated, at all times $t$ , by the following multiple-deltas ansatz:

Under this ansatz, let us write $\overline{R}_{\infty}(\overline{\rho}_{t})=\overline{R}_{\infty,J}(\mathbf{r}(t))$ , where $\mathbf{r}(t)=(r_{1}(t),...,r_{J}(t))^{\top}$ , and

Notice that $\partial_{r}\psi_{\infty}(r_{i}(t);\overline{\rho}_{t})=(J/2)(\nabla\overline{R}_{\infty,J}(\mathbf{r}(t)))_{i}$ . Therefore we obtain

Hence under the multiple-deltas ansatz, one can simulate numerically the PDE via the above evolution equation of $\mathbf{r}(t)$ . In particular, given $\mathbf{r}(t)$ , one approximates $\mathbf{r}(t+\delta t)$ for some small displacement $\delta t$ by

In general, one would want to take a large $J$ to obtain a more accurate approximation. There are certain cases where one can take small $J$ (even $J=1$ ). An example of such a case is given below.

Details of Figure 1 of the main text. For the data generation, we set $\Delta=0.8$ . For the SGD simulation, we take $d=40$ , $N=800$ , with ${\varepsilon}=10^{-6}$ and $\xi(t)=1$ . The weights are initialized as $({\bm{w}}_{i})_{i\leq N}\sim_{iid}\mathsf{N}(0,0.8^{2}/d\cdot\mathbf{I}_{d})$ . We take a single SGD run. At iterations $10^{3},4\times 10^{6},10^{7}$ , we plot the histogram of $(\|{\bm{w}}_{i}\|_{2})_{i\leq N}$ . This produces the results of the SGD in Figure 1 of the main text.

To obtain results from the PDE, we take $J=400$ , and generate $r_{i}(0)=\|Z_{i}\|_{2}$ , where $(Z_{i})_{i\leq J}\sim_{iid}\mathsf{N}(0,0.8^{2}/d\cdot\mathbf{I}_{d})$ . We obtain $\mathbf{r}(t)$ from $t=0$ until $t=10^{7}{\varepsilon}$ , by discretizing this interval with $10^{5}$ points equally spaced on the $\log_{10}$ scale and sequentially computing $\mathbf{r}(t)$ at each point using Eq. (11.8). Note that the SGD result at iteration $k$ corresponds to $\mathbf{r}({\varepsilon}k)$ . We simulate the PDE 100 times, each with an independently generated initialization. The histogram for the PDE shown in the figure is the aggregation of these 100 runs.
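The particle simulation just described can be sketched as follows. The update is the explicit Euler step implied by $\dot{r}_{i}=-2\xi(t)\,\partial_{r}\psi_{\infty}(r_{i};\overline{\rho}_{t})=-J\xi(t)(\nabla\overline{R}_{\infty,J}(\mathbf{r}))_{i}$ . Because the closed form of $\overline{R}_{\infty}$ is not reproduced in this section, the gradient is passed in as a callable, and `toy_grad` is a hypothetical stand-in (the gradient of $((1/J)\sum_{i}r_{i}-1)^{2}$ ), not the risk of the paper. The function names and the reduced grid size are likewise assumptions made for the example.

```python
import numpy as np

def simulate_particles(r0, grad_risk, xi, t_min, t_max, n_points):
    """Euler integration of dr/dt = -J * xi(t) * grad_risk(r) on a log-spaced time grid."""
    r = np.array(r0, dtype=float)
    J = r.size
    # Time grid: t = 0 followed by n_points values equally spaced on the log10 scale.
    times = np.concatenate(
        ([0.0], np.logspace(np.log10(t_min), np.log10(t_max), n_points)))
    path = [r.copy()]
    for t, t_next in zip(times[:-1], times[1:]):
        # xi is evaluated at the left endpoint; a step-size schedule that is
        # singular at t = 0 would need a different treatment of the first step.
        r = r - J * xi(t) * grad_risk(r) * (t_next - t)
        path.append(r.copy())
    return times, np.array(path)

def toy_grad(r):
    """Gradient of the placeholder risk (mean(r) - 1)^2 (NOT the paper's risk)."""
    return 2.0 * (r.mean() - 1.0) / r.size * np.ones_like(r)

rng = np.random.default_rng(0)
d, J = 40, 400
r0 = np.linalg.norm(rng.standard_normal((J, d)) * 0.8 / np.sqrt(d), axis=1)
times, path = simulate_particles(r0, toy_grad, lambda t: 1.0, 1e-6, 10.0, 2000)
print("initial mean radius:", path[0].mean(), "final mean radius:", path[-1].mean())
```

Aggregating histograms over independent initializations, as done for Figure 1, amounts to repeating the call with fresh draws of `r0` and pooling the rows of `path` at the times of interest.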

Further numerical simulations. Figure 11.2 plots the evolution of $\overline{\rho}_{t}$ for $\Delta=0.2$ . The setting is identical to the one in Figure 1 of the main text, described in the previous paragraphs.

In Figure 11.3, we plot the evolution of the population risk for the SGD and its PDE prediction counterpart, for $\Delta=0.2$ and $\Delta=0.8$ . The setting for the SGD plots is the same as described in the previous paragraphs. We compute the risk attained by the SGD by Monte Carlo averaging over $10^{4}$ samples. The setting for the PDE plots tagged “ $J=400$ ” is almost the same as in the previous paragraphs, except that we take only 1 run. For the PDE plot tagged “ $J=1$ ”, we take $J=1$ and $r(0)=0.8$ instead. In the inset plot, we also show the evolution of $(1/N)\sum_{i=1}^{N}\|{\bm{w}}_{i}\|_{2}$ of the SGD, and $(1/J)\sum_{i=1}^{J}r_{i}(t)$ of the PDE.
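The Monte Carlo estimate of the population risk can be sketched as below. It assumes the quadratic loss with prediction $\hat{y}({\bm{x}})=(1/N)\sum_{i}\sigma(\langle{\bm{w}}_{i},{\bm{x}}\rangle)$ , the truncated-linear activation with $t_{1}=0.5$ , $t_{2}=1.5$ , $s_{1}=-2.5$ , $s_{2}=7.5$ , and class standard deviations $1\pm\Delta$ ; the function names are hypothetical.

```python
import numpy as np

def sigma(t):
    """Truncated-linear activation with t1=0.5, t2=1.5, s1=-2.5, s2=7.5."""
    return np.clip(10.0 * t - 7.5, -2.5, 7.5)

def monte_carlo_risk(W, delta, n_samples, rng):
    """Estimate E[(y - (1/N) sum_i sigma(<w_i, x>))^2] from fresh samples."""
    N, d = W.shape
    y = rng.choice([-1.0, 1.0], size=n_samples)
    x = (1.0 + delta * y)[:, None] * rng.standard_normal((n_samples, d))
    y_hat = sigma(x @ W.T).mean(axis=1)    # average over the N hidden units
    return float(np.mean((y - y_hat) ** 2))

rng = np.random.default_rng(0)
d, N = 40, 100
W = rng.standard_normal((N, d)) * 0.8 / np.sqrt(d)  # initialization scale used for Figure 1
print("estimated risk at initialization:", monte_carlo_risk(W, 0.2, 10_000, rng))
```

For all-zero weights the prediction is the constant $\sigma(0)=-2.5$ , so the risk equals $\tfrac{1}{2}(3.5^{2}+1.5^{2})=7.25$ , which gives a quick check of the estimator.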

In Figure 11.4, we plot the function $\overline{R}^{(1)}_{d}(r)$ , for $d=40$ and $\Delta=0.2$ . (Recall $\overline{R}^{(1)}_{d}(r)$ from Eq. of the main text, and see also Section 11.1.3.) On this landscape, we also plot the evolution of the corresponding SGD and PDE, as described in the last paragraph.

Comments. We observe in Figure 11.3 a good match between the SGD and the PDE, even when $J=1$ , for $\Delta=0.2$ . This can be explained with our theory, which predicts that at $\Delta=0.2$ , the minimum risk is achieved by the uniform distribution over a sphere of radius $\|{\bm{w}}\|_{2}=r_{*}$ (see also Section 11.1.3). This corresponds to $\overline{\rho}_{t}$ , as $t\to\infty$ , being a delta function and placing probability 1 at $r_{*}$ . Furthermore, due to the way we initialize the SGD, $\overline{\rho}_{0}$ is well concentrated. One can then expect that $\overline{\rho}_{t}$ is also well concentrated at all times $t$ , in which case $J=1$ is sufficient. This claim is reflected in our numerical experiments, shown in Figure 11.2.

We also observe in Figure 11.3 that the case $\Delta=0.2$ has a rapid transition from a high risk to a lower risk, unlike the case $\Delta=0.8$ . This is also expected from our theory. As said above, $\overline{\rho}_{t}$ is approximately a delta function at all times $t$ , and the position $r(t)$ evolves by gradient flow in the landscape of $\overline{R}^{(1)}_{d}(r)$ . This latter claim is well supported by Figure 11.4. As observed in Figure 11.4, $\overline{R}^{(1)}_{d}(r)$ is rather benign, and hence the transition of the population risk should be smooth. However, the case $\Delta=0.8$ is different: $\overline{\rho}_{t}$ is not concentrating at large $t$ , as evident in Figure 1 of the main text, even though $\overline{R}^{(1)}_{d}(r)$ is generally benign for a vast variety of values of $d$ and $\Delta$ (see Figure 11.6 and Section 11.1.3).

Note that the computation of the PDE assumes $d=\infty$ . Furthermore, it also requires $N=\infty$ (recalling Theorem 3 of the main text). The discrepancy with the SGD is due to the fact that $d$ and $N$ are finite in the SGD simulations. Nevertheless, in our numerical examples, this discrepancy is insignificant.

1.2 Empirical validation of the statics

Here we discuss numerical verification for the statics in the isotropic Gaussian example.

Optimizing $\overline{R}_{d}(\overline{\rho})$ . For the chosen activation, we have from Eq. (8.8) that

where $\sigma_{\rm sl}=(s_{2}-s_{1})/(t_{2}-t_{1})$ , $\sigma_{\rm itc}=s_{1}-\sigma_{\rm sl}t_{1}$ , $\phi(x)=\exp(-x^{2}/2)/\sqrt{2\pi}$ , $\Phi(x)=\int_{-\infty}^{x}\phi(t){\rm d}t$ , and $\Gamma$ is the Gamma function. To numerically optimize $\overline{R}_{d}(\overline{\rho})$ , we perform the following approximation:

which is a quadratic programming problem and can be solved numerically. Here $\mathbf{v}$ can be computed easily with the explicit formula, and the computation of $\mathbf{U}$ amounts to numerically evaluating double integrals. In the case $d=\infty$ , the computation of $\mathbf{U}$ is much easier, since

Details of Figure 2 of the main text. For the SGD simulation, we take $N=800$ , with ${\varepsilon}=3\times 10^{-3}$ and $\xi(t)=t^{-1/4}$ . The weights are initialized as $({\bm{w}}_{i})_{i\leq N}\sim_{iid}\mathsf{N}(0,0.4^{2}/d\cdot\mathbf{I}_{d})$ . We compute the risk attained by the SGD by Monte Carlo averaging over $10^{4}$ samples. We take a single SGD run per $\Delta$ , per $d$ , and report the risk at iteration $10^{7}$ .

For the approximate optimization of $\overline{R}_{d}(\overline{\rho})$ , we choose $K=100$ , and $o_{i}$ , $i=1,...,K$ , being equally spaced on the interval [0.01, 10].
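Restricting $\overline{\rho}$ to a mixture $\sum_{i}p_{i}\delta_{o_{i}}$ turns the risk into a quadratic function of the weights $\mathbf{p}$ . The sketch below assumes the discretized problem has the form $\min_{\mathbf{p}}\,1+2\mathbf{v}^{\top}\mathbf{p}+\mathbf{p}^{\top}\mathbf{U}\mathbf{p}$ over the probability simplex (consistent with $\overline{R}_{d}^{(1)}(r)=1+2v(r)+u_{d}(r,r)$ ), and solves it by projected gradient descent instead of a dedicated quadratic programming solver. The inputs in the example are toy placeholders, not the quantities $\mathbf{v}$ and $\mathbf{U}$ of the paper.

```python
import numpy as np

def project_to_simplex(z):
    """Euclidean projection of z onto {p : p >= 0, sum(p) = 1} (sort-based algorithm)."""
    u = np.sort(z)[::-1]
    css = np.cumsum(u) - 1.0
    k = np.arange(1, z.size + 1)
    last = np.nonzero(u - css / k > 0)[0][-1]   # largest index with a positive entry
    tau = css[last] / (last + 1)
    return np.maximum(z - tau, 0.0)

def minimize_risk_on_simplex(v, U, n_iter=5000):
    """Projected gradient descent for 1 + 2 v.p + p.U.p, with U symmetric PSD."""
    v = np.asarray(v, dtype=float)
    U = np.asarray(U, dtype=float)
    p = np.full(v.size, 1.0 / v.size)               # start from the uniform weights
    lipschitz = max(2.0 * np.linalg.norm(U, 2), 1e-12)
    for _ in range(n_iter):
        grad = 2.0 * v + 2.0 * U @ p
        p = project_to_simplex(p - grad / lipschitz)
    return p, float(1.0 + 2.0 * v @ p + p @ U @ p)

# Toy example with a known solution: all mass on the first atom, risk 0.
p_opt, risk_opt = minimize_risk_on_simplex([-1.0, 0.0, 0.0], np.eye(3))
print(p_opt, risk_opt)
```

Projected gradient is chosen here only to keep the example dependency-free; any convex quadratic programming routine would serve the same purpose.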

For the optimization of $\overline{R}^{(1)}_{d}(r)$ (recalling Eq. in the main text), we approximate it with $\min_{i=1,...,K}\overline{R}^{(1)}_{d}(o_{i})$ , for the above chosen $o_{i}$ and $K$ .

We find that in general, one needs higher $\max_{i=1,...,K}o_{i}$ to produce accurate results for higher $\Delta$ . For the chosen set of $o_{i}$ ’s, we choose to plot up until $\Delta=0.8$ .

Further numerical simulations. In Figure 11.5, we extend Figure 2 of the main text to include results for additional values of $d$ . The setting remains the same.

This figure provides further support for the corresponding discussion in the main text. For the threshold values of $\Delta$ for which the minimum risk is achieved by a uniform distribution $\rho^{\rm unif}_{r_{*}}$ over a sphere of radius $\|{\bm{w}}\|_{2}=r_{*}$ , see the main text and Section 11.1.3.

1.3 Checking the condition of Lemma 1 in the main text

We check the condition of Lemma 1 in the main text. This has two steps: (1) we solve for the minimizer $r_{*}$ of $\overline{R}_{d}^{(1)}(r)=1+2v(r)+u_{d}(r,r)$ , where $v(r)$ and $u_{d}(r_{1},r_{2})$ are given by Eq. (11.10) and (11.11) respectively, and (2) we check whether $v(r)+u_{d}(r,r_{*})\geq v(r_{*})+u_{d}(r_{*},r_{*})$ for all $r\geq 0$ . Figure 11.6 suggests that the behavior of $\overline{R}_{d}^{(1)}(r)$ is rather benign and hence $r_{*}$ can be found easily by searching for a local minimum. For the second step, we check the condition on a grid of values of $r$ from 0.1 to 10 with a spacing of 0.1, for each value of $\Delta$ on a grid from 0.01 to 0.99 with a spacing of 0.01. In general, we find that the condition is satisfied for $\Delta\in[\Delta_{d}^{\rm l},\Delta_{d}^{\rm h}]$ . Table 1 reports $\Delta_{d}^{\rm l}$ and $\Delta_{d}^{\rm h}$ for a number of values of $d$ for the isotropic Gaussians example with the given activation function.
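The two-step check can be sketched as follows. The functions $v$ and $u_{d}$ are passed in as callables because their explicit formulas are not reproduced here; `v_toy`, `u_ok`, and `u_bad` are hypothetical placeholders chosen so that the outcome is known in advance. As a simplification, $r_{*}$ is located by a grid search rather than a local-minimum search.

```python
import numpy as np

def check_lemma1_condition(v, u, r_grid, tol=1e-9):
    """Return (r_star, ok): grid minimizer of 1 + 2 v(r) + u(r, r), and whether
    v(r) + u(r, r_star) >= v(r_star) + u(r_star, r_star) holds on the whole grid."""
    r_grid = np.asarray(r_grid, dtype=float)
    risk1 = 1.0 + 2.0 * v(r_grid) + u(r_grid, r_grid)   # single-radius risk on the grid
    r_star = r_grid[np.argmin(risk1)]
    lhs = v(r_grid) + u(r_grid, r_star)
    rhs = v(r_star) + u(r_star, r_star)
    return float(r_star), bool(np.all(lhs >= rhs - tol))

# Placeholder functions (assumptions for illustration, not Eqs. (11.10)-(11.11)).
def v_toy(r):
    return -r

def u_ok(r1, r2):
    return r1 * r2 + 0.1 * (r1 - r2) ** 2

def u_bad(r1, r2):
    return r1 * r2 - 0.1 * (r1 - r2) ** 2

r_grid = np.linspace(0.1, 10.0, 100)   # spacing 0.1, as in the text
print(check_lemma1_condition(v_toy, u_ok, r_grid))
print(check_lemma1_condition(v_toy, u_bad, r_grid))
```

Sweeping $\Delta$ then amounts to calling the function once per value of $\Delta$ and recording the range over which the second return value is true.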

2 Centered anisotropic Gaussians with ReLU Activation

With probability $1/2$ : $y=+1$ , ${\bm{x}}\sim{\sf N}(0,\mathbf{\Sigma}_{+})$ .

With probability $1/2$ : $y=-1$ , ${\bm{x}}\sim{\sf N}(0,\mathbf{\Sigma}_{-})$ .

This setting is used in Figure 3 in the main text.

We consider $s_{0}=\gamma d$ for some $\gamma\in(0,1)$ . For simplicity, we consider the limit $d\to\infty$ . For ${\bm{\theta}}\sim\rho$ , let $\overline{\rho}$ be the joint distribution of four parameters $\mathbf{r}=(a,b,r_{1}=\|{\bm{w}}_{1:s_{0}}\|_{2},r_{2}=\|{\bm{w}}_{(s_{0}+1):d}\|_{2})$ , where ${\bm{w}}_{i:j}=(w_{i},...,w_{j})^{\top}$ . Using a similar argument to Section 8, we have, in the limit $d\to\infty$ , the risk $R(\rho)=\overline{R}_{\infty}(\overline{\rho})$ for

where $\phi(x)=\exp(-x^{2}/2)/\sqrt{2\pi}$ and $\Phi(x)=\int_{-\infty}^{x}\phi(t){\rm d}t$ . Furthermore, letting $\overline{\rho}_{t}$ denote the corresponding distribution at time $t$ , the PDE in the main text can be reduced to the following PDE of $\overline{\rho}_{t}$ :

PDE simulation. As in Section 11.1.1, we posit that the solution to the PDE can be approximated, at all times $t$ , by the multiple-deltas ansatz:

for $i=1,...,J$ , where $\overline{R}_{\infty,J}(\mathbf{r}_{1}(t),\dots,\mathbf{r}_{J}(t))=\overline{R}_{\infty}(\overline{\rho}_{t})$ under the ansatz, and $\nabla_{i}$ denotes the gradient of $\overline{R}_{\infty,J}(\mathbf{r}_{1},...,\mathbf{r}_{J})$ w.r.t. $\mathbf{r}_{i}$ . More explicitly,

Again, given $\mathbf{r}_{i}(t)$ , one approximates $\mathbf{r}_{i}(t+\delta t)$ for some small displacement $\delta t$ by

Details of Figure 3 of the main text. For the SGD simulation, we take $d=320$ , $s_{0}=60$ , $N=800$ , with ${\varepsilon}=2\times 10^{-4}$ and $\xi(t)=t^{-1/4}$ . The weights are initialized as $({\bm{w}}_{i})_{i\leq N}\sim_{iid}\mathsf{N}(0,0.8^{2}/d\cdot\mathbf{I}_{d})$ , $(a_{i})_{i\leq N}=1$ and $(b_{i})_{i\leq N}=1$ . We take a single SGD run. We compute the risk attained by the SGD by Monte Carlo averaging over $10^{4}$ samples.

To produce the inset plot in Figure 3 of the main text, for the “ $a$ (mean)” axis, we compute $\frac{1}{N}\sum_{i=1}^{N}a_{i}$ for the SGD and $\frac{1}{J}\sum_{i=1}^{J}a_{i}(t)$ for the PDE. Similarly, for the “ $b$ (mean)” axis, we compute $\frac{1}{N}\sum_{i=1}^{N}b_{i}$ for the SGD and $\frac{1}{J}\sum_{i=1}^{J}b_{i}(t)$ for the PDE, and for the “ $r_{1}$ (mean)” axis, we compute $\frac{1}{N}\sum_{i=1}^{N}\|{\bm{w}}_{i,1:s_{0}}\|_{2}$ for the SGD and $\frac{1}{J}\sum_{i=1}^{J}r_{1,i}(t)$ for the PDE.

Further numerical simulations. In Figure 11.7, we plot the evolution of the four parameters, for the same setting as Figure 3 of the main text. Here “ $a$ (mean)”, “ $b$ (mean)” and “ $r_{1}$ (mean)” hold the same meanings, and “ $r_{2}$ (mean)” refers to $\frac{1}{N}\sum_{i=1}^{N}\|{\bm{w}}_{i,(s_{0}+1):d}\|_{2}$ for the SGD and $\frac{1}{J}\sum_{i=1}^{J}r_{2,i}(t)$ for the PDE.
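The four summary statistics plotted for the SGD can be computed directly from the parameters. A minimal sketch, assuming the weights are stored as an $N\times d$ array `W` and the parameters $a_{i}$ , $b_{i}$ as length- $N$ arrays (the storage layout and function name are assumptions):

```python
import numpy as np

def mean_parameters(a, b, W, s0):
    """Means of a_i, b_i, and of the norms of the first s0 and last d - s0 weight coordinates."""
    W = np.asarray(W, dtype=float)
    r1 = np.linalg.norm(W[:, :s0], axis=1)   # ||w_{i,1:s0}||_2 for each unit i
    r2 = np.linalg.norm(W[:, s0:], axis=1)   # ||w_{i,(s0+1):d}||_2 for each unit i
    return {
        "a": float(np.mean(a)),
        "b": float(np.mean(b)),
        "r1": float(r1.mean()),
        "r2": float(r2.mean()),
    }

stats = mean_parameters(a=[1.0, 3.0], b=[1.0, 1.0],
                        W=[[3.0, 4.0, 0.0, 0.0], [0.0, 0.0, 6.0, 8.0]], s0=2)
print(stats)
```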

In Figure 11.8, we plot the evolution of the population risk for the same setting as Figure 3 of the main text, except that $\Delta=0.6$ and $s_{0}$ varies.

Comments. We observe a good match between the SGD and the PDE in Figure 3 of the main text as well as Figure 11.7, up until iteration $10^{6}$ . In general there is less discrepancy with larger $s_{0}$ , $d$ and $N$ , recalling that the PDE is computed assuming infinite $s_{0}$ , $d$ and $N$ . This is evident from Figure 11.8.

As a note, in Figure 11.8, the PDE evolves differently for different $s_{0}$ . This is because the ratio $s_{0}/d$ is used to determine the initialization of the PDE.

3 Isotropic Gaussians: Predictable Failure of SGD

In this section, we consider the isotropic Gaussians example (see Section 11.1 for the setting and notations), with the following activation function: $\sigma_{*}({\bm{x}};{\bm{\theta}}_{i})=\sigma(\langle{\bm{w}}_{i},{\bm{x}}\rangle)$ , where $\sigma(t)=-2.5$ for $t\leq 0$ , $\sigma(t)=7.5$ for $t\geq 1.5$ , and $\sigma(t)$ linearly interpolates from the knot $(0,-2.5)$ to $(0.5,-4)$ , and from $(0.5,-4)$ to $(1.5,7.5)$ . This activation is plotted in Figure 11.1. This corresponds to Section “Predicting failure” and Figure 4 in the main text. The simulation of the PDE can be done in the same way as in Section 11.1.1.
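This activation is determined by its three knots, so it can be written with a generic piecewise-linear interpolation. The sketch below relies on the fact that `numpy.interp` holds the boundary values constant outside the knot range, which matches $\sigma(t)=-2.5$ for $t\leq 0$ and $\sigma(t)=7.5$ for $t\geq 1.5$ ; the names are illustrative.

```python
import numpy as np

KNOT_T = np.array([0.0, 0.5, 1.5])     # knot locations
KNOT_S = np.array([-2.5, -4.0, 7.5])   # activation values at the knots

def sigma_fail(t):
    """Piecewise-linear activation through (0, -2.5), (0.5, -4), (1.5, 7.5), flat outside."""
    return np.interp(t, KNOT_T, KNOT_S)

print(sigma_fail([-1.0, 0.25, 0.5, 1.0, 3.0]))
```

Compared with the activation of Section 11.1, the only change is the dip between $t=0$ and $t=0.5$ , where the function decreases before rising again.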

Rationale of the activation choice. We give an explanation for the choice of the above activation based on our theory. We aim to find an activation $\sigma_{*}({\bm{x}};{\bm{\theta}}_{i})=\sigma(\langle{\bm{w}}_{i},{\bm{x}}\rangle)$ for which there exists a local minimum that does not generalize well. To simplify the task, we wish for such a minimum to be attained at $\rho_{*}=\delta_{\mathbf{0}}$ . This minimum does not generalize well, since it implies all the weights are zero and the neuron outputs are constant, rendering the network unable to perform classification. Theorem 6 of the main text suggests taking $\sigma(t)$ such that

In the isotropic Gaussians case, this becomes

(Note that the condition $\nabla V({\mathbf{0}})+\nabla_{1}U({\mathbf{0}},{\mathbf{0}})={\mathbf{0}}$ in Theorem 6 of the main text is trivially satisfied.) Another requirement is that there should still be a minimum whose risk is nearly zero. Hence we do not wish for a dramatic change in the choice of the activation function, as compared to the one used in Section 11.1. That is, we leave $\sigma(0)<0$ unchanged. Hence we would want $\sigma^{\prime\prime}(0)<0$ , which is accomplished by our aforementioned choice.

Note that Theorem 6 of the main text also suggests that if the SGD is initialized sufficiently close to this local minimum, the SGD trajectory should converge to it.

Details of Figure 4 of the main text. For the data generation, we set $\Delta=0.5$ . For the SGD simulation, we take $d=320$ , $N=800$ , with ${\varepsilon}=10^{-5}$ and $\xi(t)=t^{-1/4}$ . We take a single SGD run each for two different initializations: the weights are initialized as $({\bm{w}}_{i})_{i\leq N}\sim_{iid}\mathsf{N}(0,\kappa^{2}/d\cdot\mathbf{I}_{d})$ for either $\kappa=0.1$ or $\kappa=0.4$ . We compute the risk attained by the SGD by Monte Carlo averaging over $10^{4}$ samples.

To obtain results from the PDE, we take $J=400$ , and generate $r_{i}(0)=\|Z_{i}\|_{2}$ , where $(Z_{i})_{i\leq J}\sim_{iid}\mathsf{N}(0,\kappa^{2}/d\cdot\mathbf{I}_{d})$ . We obtain $\mathbf{r}(t)$ from $t=0$ until $t=10^{7}{\varepsilon}$ , by discretizing this interval with $10^{5}$ points equally spaced on the $\log_{10}$ scale and sequentially computing $\mathbf{r}(t)$ at each point using Eq. (11.8). Note that the SGD result at iteration $k$ corresponds to $\mathbf{r}({\varepsilon}^{4/3}k)$ . We take a single run of the PDE.

To produce the inset plot, we compute $\frac{1}{N}\sum_{i=1}^{N}\|{\bm{w}}_{i}\|_{2}$ for the SGD, and $\frac{1}{J}\sum_{i=1}^{J}r_{i}(t)$ for the PDE.

As observed from Figure 4 of the main text, the SGD trajectory with $\kappa=0.1$ converges to a point where $\|{\bm{w}}_{i}\|_{2}$ is nearly zero and the risk is very high, in stark contrast to the SGD trajectory with $\kappa=0.4$ , as expected.

Appendix A Concentration inequalities

Let ${\bm{Z}}_{k}=\bm{X}_{k}-\bm{X}_{k-1}$ be the martingale differences. By the subgaussian condition (A.1), we get

Letting ${\bm{G}}\sim{\sf N}(0,{\bm{I}}_{d})$ be a standard Gaussian vector and $\xi\geq 0$ ,

By Markov's inequality, setting $\xi=1/(2nL^{2})$ , we get, for all $t\geq 0$ ,

Finally, to upper bound $\max_{k\leq n}\|\bm{X}_{k}\|_{2}$ , we define the stopping time $\tau\equiv\min\{k:\;\|\bm{X}_{k}\|_{2}\geq 2L\sqrt{n}(\sqrt{d}+t)\}$ , and the martingale $\overline{\bm{X}}_{k}=\bm{X}_{k\wedge\tau}$ . Since $\{\max_{k\leq n}\|\bm{X}_{k}\|_{2}\geq 2L\sqrt{n}(\sqrt{d}+t)\}=\{\|\overline{\bm{X}}_{n}\|_{2}\geq 2L\sqrt{n}(\sqrt{d}+t)\}$ , the claim follows by applying the previous inequality to $\overline{\bm{X}}_{n}$ . ∎
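The threshold $2L\sqrt{n}(\sqrt{d}+t)$ appearing in this argument can be probed by simulation. The sketch below draws random walks with i.i.d. Gaussian increments ${\bm{Z}}_{k}\sim{\sf N}(0,L^{2}{\bm{I}}_{d})$ , a hypothetical example of a martingale with subgaussian differences (the exact form of condition (A.1) is not reproduced here), and reports how often $\max_{k\leq n}\|\bm{X}_{k}\|_{2}$ reaches the threshold. The function name and parameter values are illustrative only.

```python
import numpy as np

def exceedance_frequency(n, d, L, t, n_trials, rng):
    """Fraction of simulated walks with max_k ||X_k||_2 >= 2 L sqrt(n) (sqrt(d) + t)."""
    Z = L * rng.standard_normal((n_trials, n, d))   # martingale differences (assumed Gaussian)
    X = np.cumsum(Z, axis=1)                        # X_k = Z_1 + ... + Z_k
    max_norm = np.linalg.norm(X, axis=2).max(axis=1)
    threshold = 2.0 * L * np.sqrt(n) * (np.sqrt(d) + t)
    return float(np.mean(max_norm >= threshold)), float(threshold)

rng = np.random.default_rng(0)
freq, thr = exceedance_frequency(n=50, d=5, L=1.0, t=1.0, n_trials=2000, rng=rng)
print("threshold:", thr, "empirical exceedance frequency:", freq)
```

For Gaussian increments the typical size of $\|\bm{X}_{n}\|_{2}$ is about $L\sqrt{nd}$ , roughly half the threshold at $t=0$ , so exceedances are expected to be rare.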

Appendix B On the generalization to other loss functions

First of all, we note that the population risk reads

The corresponding distributional dynamics is formally identical to the one for quadratic loss, cf. Eq. (7.1). The only change is in the definition of $\Psi({\bm{\theta}};\rho)$ :