A mean-field limit for certain deep neural networks

Dyego Araújo, Roberto I. Oliveira, Daniel Yukimura

Introduction

Deep neural networks (also known as deep nets or DNNs) and the "deep learning" methods they implement have revolutionized Machine Learning . They are behind recent spectacular successes in machine translation, pattern recognition, and two-person game playing, to name but a few fields. Unfortunately, we are far from understanding theoretically what makes DNNs work. In fact, even describing the evolution of a DNN during training is quite difficult.

Large neural networks are high-dimensional disordered objects with many interacting components. It is natural to use methods from Statistical Mechanics to describe them . The main goal of this paper is to give a scaling limit for the evolution of certain deep neural networks during training via a mean field model inspired by Statistical Mechanics. That is, we show that, for a large network, the parameters of a large DNN behave like random particles interacting with each other and with their own probability densities in the way prescribed by a certain McKean-Vlasov process. This will require the following assumptions: complete connections between layers, plus "random features" that are not learned on the first and last layers of weights. The latter assumption does not seem to be crucial, as we explain below (see §2.3 and §5.3).

Our result extends work done for simpler "shallow" neural networks by Mei, Montanari and Nguyen (see also ); Rotskoff and Vanden-Eijnden ; and Sirignano and Spiliopoulos . A recent (and independent) preprint by Sirignano and Spiliopoulos also gives a limiting description of deep networks. As we will see, that paper studies a different sense of limit: a DNN has "layers of neurons," and considers what happens when layers grow one at a time. By contrast, our methods allow us to consider networks all of whose layers are large, which seems more natural. Our work is closely related to the nonrigorous (and independent) results of Nguyen .

The paper is organized as follows. Section 2 gives presents the context for our result, including related work. Our own work starts in Section 3, we present our problem setup. Section 4 motivates and describes the McKean-Vlasov process used to find the scaling limit of DNNs. The main theorems are stated in Section 5, where we offer additional comparisons with related work. Proofs are outlined in Section 6 and presented in Sections 7 to 12. Some extensions and open problems are described in the Conclusion (Section 13).

Given a Polish space (M,ρ)(M,\rho), we will use P(M)\mathcal{P}(M) to denote the set of probability measures over the Borel σ\sigma-field of (M,ρ)(M,\rho). Given μP(M)\mu\in\mathcal{P}(M), we write ZμZ\sim\mu to say that ZZ is a random element of MM with law μ\mu.

Background and related work

This section provides some context for our results and how they relate to the existing literature. In §2.1, we describe the Statistical Learning formulation of supervised learning with deep neural networks. In §2.2, we discuss the work on "shallow" networks and how it relates to McKean-Vlasov processes. In §2.3, we outline our main results. In §2.4, give some additional pointers to related literature.

At a high level, a DNN is just one special type of "parametric function"

as small as possible. In the jargon of the area, minimizing LN\mathcal{L}_{N} means that the DNN generalizes well to unseen data coming from the same distribution as the training data .

The trouble with DNNs comes from their sheer size and complexity. The seminal "AlexNet" DNN had p108p\approx 10^{8} parameters; recent ones are much larger. In addition, the functions computed in DNNs are complicated layers of compositions of nonlinear functions. When deep nets are trained by gradient-descent type procedures, as is usually the case, many questions can be raised. Does gradient descent typically converge to a global minimum , and, if so, does it converge in reasonable time ? How does the shape of the "energy landscape" relate to generalization versus "overfitting" (or finding spurious patterns in data) of the DNN ? In spite of many recent results, we still lack satisfying answers to these questions.

2. McKean-Vlasov scaling limits

Given the above difficulties, it is reasonable to take a step back and ask the simpler question: what is a deep neural network actually doing when it is trained?

Statistical Mechanics offers a natural way to approach large neural networks, be they deep or shallow. Physicists and mathematicians have been using this perspective since the original boom of neural nets; see the books of Engel and van den Broeck and Mézard and Montanari for more information.

A more recent development has been to describe neural networks being trained to mean-field models of McKean-Vlasov type . As noted above, the papers obtained results of this kind in the case of "shallow" networks, which are simpler than deep nets.

Assume the DNN is initialized with values θi\theta_{i} that are i.i.d. with law μ0\mu_{0}. Suppose also that we train the network by gradient descent in the following sense:

(In truth, the network is trained by stochastic gradient descent, but we ignore the distinction for the time being.)

The authors of noted that, for shallow networks, one can rewrite this evolution as

where bb is a drift term depending on the value of the weight and on the empirical measure of the weights at time tt:

A natural ansatz is that for large NN, μ^t\widehat{\mu}_{t} approaches a deterministic measure μt\mu^{\star}_{t} for each t0t\geq 0. If we had μt=μ^t\mu^{\star}_{t}=\widehat{\mu}_{t}, then the weight trajectories θi\theta_{i} would be decoupled and thus i.i.d. (as the θi(0)\theta_{i}(0) are i.i.d.). So we expect that for large NN the weights are nearly i.i.d. Under this ansatz, we also have that for a random index ii, θi(t)\theta_{i}(t) follows μ^tμt\widehat{\mu}_{t}\approx\mu^{\star}_{t}. Therefore, one may conjecture that, in the thermodynamic limit, the θi\theta_{i} approach i.i.d. particles satisfying

Problem ()(\star) describes a particle θ(t)\overline{{\boldsymbol{\theta}}}(t) starting from a random initial state which interacts with its own distribution μt\mu^{\star}_{t}. Under fairly reasonable conditions on the function bb, there exists a unique probability measure μ\mu^{\star} over continuous trajectories such that, if θ()μ\overline{{\boldsymbol{\theta}}}(\cdot)\sim\mu^{\star}, then ()(\star) is satisfied, and the μt\mu^{\star}_{t} are the time-tt marginals of μt\mu^{\star}_{t}. In this paper, measures μ\mu^{\star} obtained as solutions to problems of the form ()(\star) are called McKean-Vlasov measures, although this name is reserved for the case where the evolution of θ\overline{{\boldsymbol{\theta}}} includes a diffusion term. More information about McKean-Vlasov problems and their history can be found in the references given in §2.4.3.

If there is a unique McKean-Vlasov measure satisfying ()(\star), one may then give a coupling argument to show that the trajectories θi\theta_{i} are close to an i.i.d. sample of μ\mu^{\star}. This property is called "propagation of chaos" and is used directly in . The densities p(t,x)p(t,x) of the measures μt\mu^{\star}_{t} (if they exist) are weak solutions of the PDE:

Montanari, Mei and Nguyen have used properties of PDE in the case of shallow learning to study the generalization abilities of shallow networks.

3. Deep neural networks and our contribution

The main result of this paper, described in sections 3 to 5, extends the analysis of to deep neural networks. More precisely, we consider a specific class of DNNs with complete connections and with first and last layers as "frozen random features" (the role of these is elucidated in Remark 3.4 below). These networks consist of units organized in layers. We consider what happens when the DNN is trained by a version of stochastic gradient descent.

As in the papers discussed above, we find that, when NN is large, the weights of the DNN are close to certain "ideal particles" derived from a McKean-Vlasov process. However, the picture for DNNs is much more complicated than that of shallow networks. The asymptotic weight distributions are layer-dependent, and weights do not become independent in the large-NN limit. In fact, the fundamental units of our analysis are not individual weights, but rather paths of weights in the network that go from the input to the output. The specific dependency structure of weights along a path is essential in the analysis. We will also see that the drift term for our McKean-Vlasov process will depend on the solution μ\mu^{\star} in a discontinuous fashion, which means that proving existence and uniqueness is a nontrivial task.

Two recent independent papers of Sirignano and Spiliopoulos and Nguyen obtain results related to our work. Reference uses a nonrigorous ansatz to obtain essentially the same process that we do. One difference comes from the fact that uses different scalings for the updates in different layers, in order to avoid the issues discussed in Remark 3.4.

By contrast, considers a different scaling limit: the number of units in different layers do not all diverge simultaneously, but rather one at a time. The authors of note that letting all layers diverge at the same time brings a host of technicalities. Our main contribution is to tackle these technicalities head on, which will be important for future progress in the field. In fairness, we note that our assumptions ("random features", differentiable activation functions) leave something to be desired, though some of these restrictions may be lifted. A more thorough comparison between our results and those of is given in §5.3.

The existence of a limiting process may also contribute to the analysis of learning in the class of networks we study. This program has been partially carried out by Mei et al. for shallow networks, where the limiting PDEs describe so-called Wasserstein gradient flows in the space of probability measures . Our limiting process leads to a PDE family that we believe is new and deserving of further study.

4. Additional background

We present here a concise and partial overview of the work related to our paper.

The field of neural network applications is simply too large to survey adequately. Literally dozens of arXiv preprints on the subject appear every week. The landmark "AlexNet" paper on image recognition , the survey and the book serve as reasonable introductions to this field. Generalized Adversarial Networks or GANs are one example of DNNs that fall outside the scope of supervised learning.

Statistical Learning Theory was developed to explain generalization in much more general contexts than deep learning. Much of that theory is predicated on assuming that the algorithm under consideration cannot interpolate the training data . However, practical applications feature "over parameterized" deep nets, with pNp_{N} much larger than the sample size. In this case, interpolation may well happen . This has generated a flurry of related activity on statistical methods that do interpolate: see for a recent example.

4.2. Other scaling limits for neural networks

We described above McKean-Vlasov limits for deep and shallow networks. Natural Tangent Kernels have been put forward by Jacot, Gabriel and Hongler as scaling limits of deep networks when the number of neurons diverges, albeit under a different time scale. Other papers have followed up on this idea . However, Chizat and Bach argue that this stems from small step sizes and an effective linearization of the loss function around the starting point of the optimization. This idea is also explored by Mei, Misiakiewicz and Montanari .

4.3. Mean-field models, propagation of chaos and related topics

The kind of models discussed in §2.2 come from a body of work on nonlinear Markov processes, mean-field particle models, and kinetic theory. Here we give only a few references to the area.

McKean’s seminal paper introduces McKean-Vlasov processes. More general nonlinear Markov processes are surveyed in Kolokoltsov’s book . Spohn’s book covers kinetic theory, Vlasov’s equation, and related topics. The term "propagation of chaos" was a concept introduced by Kac in the context of kinetic theory and made central in Sznitman’s St Flour lectures . Rachev and Rüschendorf present methods for proving existence and uniqueness for McKean-Vlasov problems. We will adapt their method to our setting, but this will require circumventing certain discontinuities.

Setup for rigorous results

This section describes the sort of deep neural networks and regression task considered in this work.

Our problem centers around understanding the following regression task:

In our case, the possible functions ff are constrained by the architecture of a Deep Neural Network. This consists of a family of parametric functions indexed by NN,

2. The neural network.

Our parametric functions will be defined as follows.

First, for i1[1:N1]i_{1}\in[1:N_{1}], we define

Finally, the output of the network is given by

Our model defines a very general version of a neural network with fully connected layers, where the internal units may have dimension greater than 11.

3. The backpropagation equations.

As is standard, the gradients of y^N\widehat{y}_{N} with respect to the weights may be computed via backpropagation . The key idea is to write the derivatives of y^N\widehat{y}_{N} with respect to the activations recursively, starting from the output. Since

4. Training by stochastic gradient descent

Lastly, we describe the training procedure for our network by Stochastic Gradient Descent (or SGD). Specifically, weights are updated at discrete time steps according to a point estimate of the gradient of the loss function LN\mathcal{L}_{N} obtained from a fresh sample point. This matches the assumption of previous papers such as .

The full vector of scaled gradients is denoted by

In other words, the network weights of the hidden layers follow, on average, the negative gradient of the population loss.

Mean-field picture for the weights

We now explain the mean-field behavior we expect from our system in the large-NN limit.

The basic idea behind our model is that the summations appearing in the gradients of LNL_{N} are to be replaced by integrals over conditional distributions. This idea was applied nonrigorously (and independently) by Nguyen . The ansatz can be summarized as follows:

Weights on the same layer are statistically indistinguishable.

Averages over many paths are expected to obey a law of large numbers and become deterministic in the limit.

By contrast, a weight θi1,i2(1)\theta^{(1)}_{i_{1},i_{2}} has one backwards path through θ1,i1(0)\theta^{(0)}_{1,i_{1}}. So we expect that θi1,i2(1)\theta^{(1)}_{i_{1},i_{2}} will depend on θ1,i1(0)\theta^{(0)}_{1,i_{1}} (but will decouple from other weights) even in the thermodynamic limit. A similar reasoning shows that θiL1,iL(L1)\theta^{(L-1)}_{i_{L-1},i_{L}} should depend on θiL,1(L)\theta^{(L)}_{i_{L},1}, but on no other weights.

This leads us to consider input-to-output paths in the network as our basic units in the analysis:

If our ansatz is correct, understanding the distribution of weights on a path suffices to describe the law of all weights in the network. We clarify this in the next subsection.

2. Paths and measures

Note that the vector θ\theta in (4.1) lives in

Note that the heuristic analysis suggests a factorization for the distribution over a path: the measure over a path should factor as

Moreover, if we can compute the law of the weights on a path, the weights over the entire network should factor (approximately) as follows, at any fixed training step.

We will eventually make this precise via the introduction of "ideal particles". It will be crucial that the above independence structure also works at the level of trajectories. For this we will need the following notation.

3. Mean-field representation of network evolution

The case of z(L)\overline{z}^{(L)} is different: we define it in terms of xx and the conditional distribution μ(L1L)\mu^{(L-1\mid L)}. Letting a(L)RDLa^{(L)}\in R^{D_{L}}, we define:

This definition depends on the choice of regular conditional probability, but this issue will not concern us in what follows.

3.2. Mean-field versions of gradients.

We can define, analogously to (3.8), a function as follows:

Now that we have the mean-field version of the gradients, we can define the following McKean-Vlasov problem.

Then we may rewrite conditions in the above definition as saying that, if Θμ[0,T]\overline{\Theta}\sim\mu^{\star}_{[0,T]}, then

Statement of rigorous results

In order to properly analyse our model, we start by delineating a series of assumptions. These are meant to be as broad as possible, while still allowing the use of known analytical tools. We start by making a hypothesis on the initialization of the Network.

Basically, this assumption means that initialization is independent and weights in the same layer share a common law. The next assumption deals with the nature of the activation functions.

There exist a constant C>1C>1 with the following properties.

Assumption 5.2 is made for technical convenience, and excludes the popular ReLU nonlinearity. We expect that our results will extend with minor changes to more general activation functions.

2. Main theorems.

We now state the main results of the paper. The first one states that our McKean-Vlasov problem is well-posed.

Under the assumptions in Section 5.1, the McKean-Vlasov problem in Section 4.3 has a unique solution μ[0,T]\mu^{\star}_{[0,T]}. This solution has the following structure:

such that, if Θμ[0,T]\overline{\Theta}\sim\mu^{\star}_{[0,T]}, then almost surely, for all t[0,T]t\in[0,T]

As a consequence, μ[0,T]\mu^{\star}_{[0,T]} factorizes into independent components, ie,

Our second main results shows that we can describe the loss of our DNN with N1N\gg 1 particles in terms of the McKean-Vlasov limiting process.

Make the assumptions in the previous Section. Given T>0T>0, let μ[0,T]\mu^{\star}_{[0,T]} denote the unique solution of the McKean-Vlasov problem described in Subsection 4.3. Also, let {θN(k),k[0:Tε]}\left\{{\boldsymbol{\theta}}_{N}(k),\,k\in\left[0:\left\lceil\frac{T}{\varepsilon}\right\rceil\right]\right\} be steps in the SGD process for the DNN described in Subsection 3.4.

Then, for all k[0:Tε]k\in\left[0:\left\lceil\frac{T}{\varepsilon}\right\rceil\right], it holds that:

In fact, our proofs also give a "microscopic" approximation of SGD, whereby we approximate individual weights of the network.

Under the assumptions in Section 5.1, consider μ[0,T]\mu^{\star}_{[0,T]} as in Theorem 5.3, and let

be as in Theorem 5.4. Then one can couple the above evolution with certain "ideal particles"

and for all k[0:T/ε]k\in[0:\lceil T/\varepsilon\rceil]:

One key difference between our system and and other interacting particle systems of mean-field type (eg. as popularized by ) is that network weights do not become independent in the limit, as even ideal particles have nontrivial dependencies. However, disjoint paths of weights will become independent. This implies a propagation of chaos for weight vectors over randomly chosen paths:a constant number of them will become asymptotically independent. This is in keeping with the point made in Section 3 where we noted that paths are the fundamental units of our system.

3. More comments on related results

As noted, there are two recent preprints that tackle similar problems to our main results. Nguyen has essentially the same ansatz and the same end result as we do, but he uses nonrigorous methods.

Our own work seems to identify one important source of the above difficulties. The most natural measure to consider is over paths of weights. Then, however the closed evolution obtained turns out to involve conditional measures, and conditioning is a discontinuous operation. In our approach, we have not tried to write a single empirical measure for all weights. Rather, our approach follows the ideas outlined in the next section.

Proof overview

In this section we present an outline for the proofs of our main results, which are further described in the subsequent sections. We start by summarizing the proof for existence and uniqueness of our limiting McKean-Vlasov problem (Theorem 5.3). Next, we overview the results from sections 10 to 12, showing that our strategy for training DNNs approaches the trajectories obtained from our McKean-Vlasov problem (Theorems 5.4 and 5.5).

Section 7 is where is where we make our first restriction to the set of measures. Solutions for our McKean-Vlasov problem follow Lipschitz trajectories through time (see Lemma 7.2). This property is also enough to guarantee that ψ\psi is well-defined, i.e. that the system (6.1) has a unique solution (Corollary 7.4). Lemma 7.3, where the above statement is in fact proved, is a consequence of the Picard-Lindelöf Theorem.

Even though our map is well defined, standard ODE results are not enough to ensure that we can iterate our map and find a fixed point. This is mostly due to the particularity of the dependencies we have for the outermost layers. Therefore, we need to make sure that the output law from the map ψ\psi also behaves well, i.e. that it still maintains Lipschitz continuity in the initial conditions.

This set of measures is topologically closed (Theorem 8.7).

A series of results then show that this set of special measures is closed under ψ\psi (Lemma 8.5), and closed under the Wasserstein L1L^{1} metric (Theorem 8.7). One key dea is to show that, for special measures, z(L)\overline{z}^{(L)} is also Lipschitz continuous with respect to the Wasserstein metric (Lemma 8.4). So the discontinuity issues in our problem disappar. We will also need a series of technical estimates on the Cauchy problem for the ODEs considered above.

In Section 9 we finally show that ψ\psi is a contraction on the set of special measures defined in Section 8. Since this set is invariant for ψ\psi, contains all solutions and is topologically closed, existence and uniqueness follow. We emphasize that finding this "special set of measures" is the key idea that makes the fixed point method work in our setting.

2. Approximation of loss function and ideal particles

The McKean-Vlasov problem is our guess for the process that describes solutions. To show that it is the right guess, we must couple discrete-time SGD to the continuous-time "ideal particles" described in Theorem 5.5, and compare the corresponding values of the loss function.

The first step, done in Section 10, is to compare SGD to a continuous-time gradient flow for network weights. This same idea appears in , but the details here are more complicated.

The second step is to introduce the "ideal particles" from Theorem 5.5 to which we will compare the continuous-time network weights. Section 11 carries our the first steps of this task. A crucial point here is that this: the specific dependency structure of the McKean-Vlasov solution allows us to "glue" the ideal particles to the edges of the DNN so that one "sees" the McKean-Vlasov solution along each path. In this way, we can reduce the problem of finding the mean-field approximation of the DNN to a coupling of real and ideal weights.

Finally, Section 12 shows that ideal particles give good approximations to network weights and allow one to compare the loss of the true DNN to a loss associated to the McKean-Vlasov problem.

Simple properties of the McKean-Vlasov problem

In this section, we prove some preliminary facts and estimates that any solution to our McKean-Vlasov problem must satisfy. In Section 7.1, we start with certain trivial properties that restrict which measures could potentially be solutions. In Section 7.2, we show that, if we apply the "natural" iteration to a potential solution, then the corresponding map is well-defined and satisfies certain properties.

Our first lemma is fairly straightforward. We start with the following definition.

as the set of all measures ν[0,T]\nu_{[0,T]} with the following properties: if Θ[0,T]ν[0,T]\overline{\Theta}_{[0,T]}\sim\nu_{[0,T]}, then:

Θ[0,T](0)\overline{\Theta}^{(0)}_{[0,T]} and Θ[0,T](L)\overline{\Theta}^{(L)}_{[0,T]} are a.s. constant;

Under assumptions in section 5.1, there exists a constant R\reflem:trivialsoln=R\reflem:trivialsoln(C,L)>0R_{\ref{lem:trivialsoln}}=R_{\ref{lem:trivialsoln}}(C,L)>0 such that if μ[0,T]\mu^{\star}_{[0,T]} is a solution of the McKean-Vlasov problem in Definition 4.4, then it must belong to PR\mathcal{P}_{R} for any R>R\reflem:trivialsolnR>R_{\ref{lem:trivialsoln}}.

2. A mapping and its structure

The next step is to define a map ψ:PRPR\psi:\mathcal{P}_{R}\to\mathcal{P}_{R} with the following property:

The following lemma shows that this operation is well-defined and has some additional structure.

Make the assumptions in Section 5.1. Let ν[0,T]PR,\nu_{[0,T]}\in\mathcal{P}_{R}, where RR\reflem:trivialsolnR\geq R_{\ref{lem:trivialsoln}}. Then, one can find measurable mappings:

An immediate consequence of the Lemma 7.3 is that the map ψ\psi discussed above is well-defined.

In the setting of Lemma 7.3, there exists R\reflem:trivialsoln>0R_{\ref{lem:trivialsoln}}>0 such that, if RR\reflem:trivialsolnR\geq R_{\ref{lem:trivialsoln}}, then there exists a map ψ:PRPR\psi:\mathcal{P}_{R}\to\mathcal{P}_{R} defined as follows: Given ν[0,T]PR\nu_{[0,T]}\in\mathcal{P}_{R}, then ψ(ν[0,T])\psi(\nu_{[0,T]}) is the unique measure with μ0\mu_{0} as its time marginal, and therefore if Θψ(ν[0,T])\overline{\Theta}\sim\psi(\nu_{[0,T]}), then almost surely Θ\overline{\Theta} solves the initial value problem 7.1.

The result stated in the lemma follows from the fact that the ODE (7.1) has a unique solution for any starting state. To clarify this statement, a few steps are required.

and the integrand in the RHS is a bounded continuous function of tt. Therefore, the expectation changes continuously with tt by Bounded Convergence.

Since the function inside the expected value in the RHS is bounded and continuous in tt, the RHS is itself continuous in tt. Moreover, it is also a globally Lipschitz function of Θ(0)(0)\overline{\Theta}^{(0)}(0) and Θ(1)(t)\overline{\Theta}^{(1)}(t), with a constant that comes from the boundedness of M(2)\overline{M}^{(2)} and the assumption that σ(1)\sigma^{(1)} has Lipschitz derivative. The Picard-Lindelöf Theorem implies that the trajectory Θ(1)\overline{\Theta}^{(1)} is a deterministic function Gν(1)G_{\nu}^{(1)} of Θ(0)(0)\overline{\Theta}^{(0)}(0) (which appears as a parameter in the RHS) and of the initial condition Θ(1)(0)\overline{\Theta}^{(1)}(0). Under our conditions, the solution to the ODE is the uniform limit of Euler-Maruyama discretizations. The recursive construction of these discretizations means that each of them is measurable with respect to (Θ(0)(0),Θ(1)(0))(\overline{\Theta}^{(0)}(0),\overline{\Theta}^{(1)}(0)), and the same is true of the limit Gν(1)G_{\nu}^{(1)}. Finally, the Lipschitz property of trajectories comes from the fact that the RHS of the ODE is bounded.

Finally, the fact that 7.1 has a unique solution for almost every initial position clearly implies that ψ(ν[0,T])\psi(\nu_{[0,T]}) is uniquely specified.∎

Potential solutions of the McKean-Vlasov problem

In this section we continue the work of Section 7. Corollary 7.4 shows that there is a well-defined map ψ\psi whose fixed points are solutions of the McKean Vlasov problem. It is then natural to iterate that map to obtain existence and uniqueness of solutions.

To circumvent such problems, we will show that we can further constrain the set of potential solutions of our problem. We will define the set of RR-special measures (Definition 8.2) so that a mesure ν[0,T]\nu_{[0,T]} in this set will have special Lipschitz properties in its (L1)th(L-1)^{\text{th}} coordinate. The results in this section will imply that this set is well-behaved with respect to ψ\psi and that the search for solutions can be narrowed down to this set.

To make all of this precise, we need two new definitions. The first defines the property we want the (L1)th(L-1)^{\text{th}} coordinate process to obey.

is RR-Lipschitz in tt, and F(a)(0)=a(L1)F(a)(0)=a^{(L-1)}.

is eRte^{Rt}-Lipschitz as a function of aa.

The second definition defines the set of probability measures we will work with.

is said to be RR-special if μ[0,T]PR\mu_{[0,T]}\in\mathcal{P}_{R} and there exists an RR-special function FμF_{\mu} such that, if Θμ[0,T]\Theta\sim\mu_{[0,T]}, then almost surely, for all t[0,T]t\in[0,T]:

The set of all RR-special measures is denoted by PR\mathcal{P}^{\star}_{R}.

Our first result on RR-special measures is that the discontinuity issues raised in Remark 4.6 will not matter when we consider these measures. In fact, we canestimate the difference between the values of z(L)\overline{z}^{(L)} for different measures via their Wasserstein distance. That will be accomplished by the next Lemmas.

The next two Lemmas are two of our main results about RR-special measures. Lemma 8.5 shows that this set is closed under the mapping ψ\psi while Lemma 8.6 that all solutions must be RR-special.

There exists a value R\reflem:specialclosedpsi>0R_{\ref{lem:specialclosedpsi}}>0 depending only on CC and LL such that, for all RR\reflem:specialclosedpsiR\geq R_{\ref{lem:specialclosedpsi}}, the mapping ψ\psi introduced in Corollary 7.4 satisfies ψ(PR)PR\psi(\mathcal{P}^{\star}_{R})\subset\mathcal{P}^{\star}_{R}.

There exists a value R\reflem:specialcontainssolnR_{\ref{lem:specialcontainssoln}} depending only on CC and LL such that, for all RR\reflem:specialcontainssolnR\geq R_{\ref{lem:specialcontainssoln}}, any solution to the McKean-Vlasov problem belongs to PR\mathcal{P}^{\star}_{R}.

Finally, we mention an important technical result, to be proved in Appendix C.

The set PR\mathcal{P}^{\star}_{R} is closed under weak convergence (and thus also under Wasserstein L1L^{1} metric).

The remainder of the section contains the proofs of the above Lemmas.

Let Θν[0,T]PR\Theta\sim\nu_{[0,T]}\in\mathcal{P}^{\star}_{R}. Then Θ(L)\Theta^{(L)} is a.s. constant and

where δ{\boldsymbol{\delta}_{\cdot}} is a Dirac delta. Plugging this into Equation (4.6) which defines z(L)\overline{z}^{(L)} finishes the proof. ∎

We now show how to estimate differences in z(L)\overline{z}^{(L)} for RR-special measures.

It follows that, in order to prove the Lemma, it suffices to show that:

The quantity in the LHS of (8.1) may be rewritten as:

Moreover, because the two measures are special, we have the almost-sure identities:

3. Proof that special measures are closed under ψ𝜓\psi.

Now we move on to prove Lemma 8.5, which shows that the map ψ\psi maps PR\mathcal{P}^{\star}_{R} onto itself, assuming RR is large enough.

which clearly suffices to prove the Lemma.

To do this, we investigate the ODE (7.2) satisfied by Θ(L1)\overline{\Theta}^{(L-1)} under the assumption that ν[0,T]\nu_{[0,T]} is RR-special. We will need the following result, which is a direct consequence of Picard iteration.

Fix constants R2V\refprop:Picardsimple>0R\geq 2V_{\ref{prop:Picardsimple}}>0. Given (a(L1),a(L))(a^{(L-1)},a^{(L)}) consider the initial value problem

HH is uniformly bounded by V\refprop:PicardsimpleV_{\ref{prop:Picardsimple}};

is V\refprop:PicardsimpleeRtV_{\ref{prop:Picardsimple}}\cdot e^{Rt}-Lipschitz;

is V\refprop:PicardsimpleV_{\ref{prop:Picardsimple}}-Lipschitz.

that associates to each (a(L1),a(L))(a^{(L-1)},a^{(L)}), the solution of 8.3 is an RR-Special function.

Given this Proposition, Lemma 8.5 follows from the following result.

Let ν[0,T]PR\nu_{[0,T]}\in\mathcal{P}^{\star}_{R}. Then there exists a constant V\refclaim:Hexists=V\refclaim:Hexists(C,L)V_{\ref{claim:Hexists}}=V_{\ref{claim:Hexists}}(C,L) and a continuous function

satisfying properties (1) - (3) in the proposition above with V\refprop:Picardsimple=V\refclaim:HexistsV_{\ref{prop:Picardsimple}}=V_{\ref{claim:Hexists}}, and so that, if Θψ(ν[0,T])\overline{\Theta}\sim\psi(\nu_{[0,T]}), then almost surely, for all t[0,T]t\in[0,T]

Indeed, we may apply Proposition 8.8 to the HH coming from this Claim to deduce that, if RR\reflem:specialclosedpsi(C,L):=2V\refclaim:Hexists(C,L)R\geq R_{\ref{lem:specialclosedpsi}}(C,L):=2V_{\ref{claim:Hexists}}(C,L) and ν[0,T]PR\nu_{[0,T]}\in\mathcal{P}^{\star}_{R}, then for Θψ(ν[0,T])\overline{\Theta}\sim\psi(\nu_{[0,T]}) and for all t[0,T]t\in[0,T]

where GG is the RR-special function from Proposition 8.8. In particular, this implies Goal (8.2).

We now prove the Claim. In what follows, we will use symbols V0,V1,V_{0},V_{1},\dots to denote positive quantities that depend only on CC and LL.

Consider ν[0,T]PR\nu_{[0,T]}\in\mathcal{P}^{\star}_{R}. From Lemma 7.3 we know that there exists an RR-special function FνF_{\nu} such that the identity

Now consider the drift term for the (L1)th(L-1)^{\text{th}} particle. Recalling Assumption 5.2, Lemma 8.3, and the definitions in 4.3 we see that

is continuous in (t,a(L1),a(L))(t,a^{(L-1)},a^{(L)}), CC-Lipschitz in a(L1)a^{(L-1)} and V1eRtV_{1}\cdot e^{Rt}-Lipschitz in a(L)a^{(L)} for t0t\geq 0 fixed. Γ(L1)\overline{\Gamma}^{(L-1)} is also uniformly bounded by V1V_{1}. Once we notice additionally that Γ(L1)(x,νt,a)\overline{\Gamma}^{(L-1)}(x,\nu_{t},{\bf a}) depends on a{\bf a} only through a=(a(L1),a(L))a=(a^{(L-1)},a^{(L)}), we may define

4. Proof that all solutions are special.

We now argue that all fixed points μ\mu^{\star} of ψ\psi are RR-special.

The proof of this result is similar to that of Lemma 8.5 in Section 8.3. The key difference is that we will analyse a slightly more complicated Cauchy problem than the one in Proposition 8.8.

We also know from Lemma 7.3 that GG^{\star} belongs to the following set of functions if RR\reflem:trivialsolnR\geq R_{\ref{lem:trivialsoln}}

Indeed, this will imply that μ[0,T]\mu^{\star}_{[0,T]} is RR-special, as per Definition 8.2.

To do this, we will need to "do analysis" on the set U\mathcal{U}. We need to establish some additional notation. For g1,g2Ug_{1},g_{2}\in\mathcal{U} and t[0,T]t\in[0,T], we define.

The next Proposition gives properties of a nonstandard Cauchy-type problem for functions in U\mathcal{U}. It should be contrasted with Proposition 8.8 in the proof of Lemma 8.5.

Fix R2V\refprop:Picardcomplicated>0R\geq 2V_{\ref{prop:Picardcomplicated}}>0 constants and let

be a measurable function that satisfies the following properties:

Ξ\Xi is uniformly bounded by V\refprop:PicardcomplicatedV_{\ref{prop:Picardcomplicated}}.

If two functions g1,g2Ug_{1},g_{2}\in\mathcal{U} are such that g1(a(L1),a(L))=g2(a(L1),a(L))g_{1}(a^{(L-1)},a^{(L)})=g_{2}(a^{(L-1)},a^{(L)}) μ0(L1)×μ0(L)\mu_{0}^{(L-1)}\times\mu_{0}^{(L)}-almost surely, then:

μ0(L1)×μ0(L)\mu_{0}^{(L-1)}\times\mu_{0}^{(L)}-almost-surely;

is measurable. Also, for a(L)a^{(L)} fixed, then (t,a(L1))Ξ(t,a(L1),a(L),g)(t,a^{(L-1)})\mapsto\Xi\left(t,a^{(L-1)},a^{(L)},g\right) is continuous;

There exist a constant V\refclaim:Xiexists=V\refclaim:Xiexists(C,L)>0V_{\ref{claim:Xiexists}}=V_{\ref{claim:Xiexists}}(C,L)>0 and a function:

satisfying assumptions (1) - (5) in Proposition 8.9, and so that, for μ0\mu_{0}-a.e initial condition and for all t[0,T]t\in[0,T]:

Once we prove this claim, the theorem is essentially proved. We just need to apply Proposition 8.9 with this function Ξ\Xi together with Lemma 7.3. The upshot is that, if RR\reflem:specialcontainssoln:=2V\refclaim:XiexistsR\geq R_{\ref{lem:specialcontainssoln}}:=2V_{\ref{claim:Xiexists}}, then the following identity holds almost surely, simultaneously for all t[0,T]t\in[0,T]:

In other words, T(G)=G\mathcal{T}(G^{\star})=G^{\star} a.s.. The Proposition implies that there exists a RR-special function G=GoG^{\star}=G_{o} a.e. , and this finishes the proof because it achieves Goal 8.9.

We now prove the Claim. In what follows, we will use symbols V0,V1,V_{0},V_{1},\dots to denote positive quantities that only depend on CC and LL.

Then z(L)(x,μt,a(L))=Ξ0(t,x,a(L),G)\overline{z}^{(L)}\left(x,\mu^{\star}_{t},a^{(L)}\right)=\Xi_{0}\left(t,x,a^{(L)},G^{\star}\right) for every (t,x)(t,x) and μ0(L1)\mu_{0}^{(L-1)}-almost-every a(L)a^{(L)}. Moreover, the map Ξ0\Xi_{0} satisfies analogues of the properties (1) - (5) in Proposition 8.9.

Ξ0\Xi_{0} is bounded by V0V_{0} because σ(L1)\sigma^{(L-1)} is bounded by CC

Ξ0(t,x,a(L),g1)=Ξ0(t,x,a(L),g2)\Xi_{0}\left(t,x,a^{(L)},g_{1}\right)=\Xi_{0}\left(t,x,a^{(L)},g_{2}\right) for μ0(L)\mu_{0}^{(L)}-almost everywhere a(L)a^{(L)} if g1=g2g_{1}=g_{2} μ0(L1,L)\quad\mu_{0}^{(L-1,L)}-almost everywhere;

for all gUg\in\mathcal{U}, the map (t,x,a(L))Ξ0(t,x,a(L),g)(t,x,a^{(L)})\mapsto\Xi_{0}\left(t,x,a^{(L)},g\right) is measurable for being a composition of measurable functions, and it is continuous in tt for similar reasons;

Since Dσ(L)C\left\|D\sigma^{(L)}\right\|_{\infty}\leq C, then σ(L)\sigma^{(L)} is CC-Lipschitz. So

finally, for g1,g2Ug_{1},g_{2}\in\mathcal{U} and (t,x,a(L))(t,x,a^{(L)}) as above, using again the Lipschitz property:

Going back to the Definition 4.9 for M(L)(x,μt,a(L))\overline{M}^{(L)}(x,\mu^{\star}_{t},a^{(L)}), we see at once that we can define

Ξ1\Xi_{1} satisfies similar properties to (1’)-(5’) with a constant V1V_{1} replacing V0V_{0}, because M(L+1)\overline{M}^{(L+1)} is bounded by CC and Dzσ(L)D_{z}\sigma^{(L)} is CC-Lipschitz.

The same is true (with a different constant V2V_{2} replacing V1V_{1}) for the map

Existence and uniqueness for the McKean-Vlasov problem

In this section, we prove Theorem 5.3, which asserts existence and uniqueness of the solution of the McKean-Vlasov problem in Definition 4.4.

The groundwork for this was built in sections 7 and 8. A quick recap: we defined a map ψ\psi (cf. Corollary 7.4) whose fixed points are the solutions of the McKean-Vlasov problem. Then, we defined PR\mathcal{P}^{\star}_{R} the set of the so called RR-special measures with the following properties:

it is topologically closed (Theorem 8.7);

if RR is large enough, ψ(PR)PR\psi(\mathcal{P}^{\star}_{R})\subset\mathcal{P}^{\star}_{R} (Lemma 8.5);

if RR is large enough, PR\mathcal{P}^{\star}_{R} contains all solutions (Lemma 8.6).

The first step is to construct, for any fixed μ[0,T],ν[0,T]PR\mu_{[0,T]},\nu_{[0,T]}\in\mathcal{P}^{\star}_{R}, a reasonable coupling between ψ(μ[0,T])\psi(\mu_{[0,T]}) and ψ(ν[0,T])\psi(\nu_{[0,T]}). By looking at Definition 7.2 for ψ\psi, we see that this map comes with an associated process Θ\overline{\Theta}.

We will denote by Θμ\overline{\Theta}_{\mu} the process whose law is ψ(μ[0,T])\psi(\mu_{[0,T]}) and Θν\overline{\Theta}_{\nu} the one with law ψ(ν[0,T])\psi(\nu_{[0,T]}). The way to couple these processes is to notice that, since μ[0,T],ν[0,T]PRPR\mu_{[0,T]},\nu_{[0,T]}\in\mathcal{P}^{\star}_{R}\subset\mathcal{P}_{R}, then according to Definition 7.1 we have μ0=ν0\mu_{0}=\nu_{0}. This means we can set Θμ(0)=Θν(0)\overline{\Theta}_{\mu}(0)=\overline{\Theta}_{\nu}(0). Since the randomness inherent to these processes comes from their initialization, the coupling of the starting points couples the entire processes.

This coupling gives an upper bound for the Wasserstein distance between the measures, as we see below:

So all we need to do is to find an upper bound for the RHS of this inequality.

To do so, we must work coordinate by coordinate. Define

Using this notation, we establish the following lemma:

There is a constant C\reflem:deltaellupperboundC_{\ref{lem:deltaellupperbound}} such that the following inequalities hold:

The fact that Δ(0)(t)=Δ(L)(t)=0\Delta^{(0)}(t)=\Delta^{(L)}(t)=0 comes directly from the definition of the processes Θμ,Θν\overline{\Theta}_{\mu},\overline{\Theta}_{\nu}. By the coupling we made, we see that Δ(0)=0\Delta(0)=0, and both Δ(0),Δ(L)\Delta^{(0)},\Delta^{(L)} are constant over time.

justified again by Lemma B.4 and for the fact that a function is always smaller than its supremum. Repeating the same argument as in 9.3, we can finally conclude that:

Choose C\reflem:deltaellupperbound=max{C\reflem:uniflipL1,C\reflem:uniflipgenl}C_{\ref{lem:deltaellupperbound}}=\max\{C_{\ref{lem:unif-lip-L-1}},C_{\ref{lem:unif-lip-gen-l}}\} to finish Lemma. ∎

Now we can use these estimates to prove the following upper bound for ψ\psi:

There is a constant C\reflem:deltaupperboundC_{\ref{lem:deltaupperbound}} depending only on L,RL,R and CC such that, for all t[0,T]t\in[0,T],

We start by adding all the inequalities of Lemma 9.1 and using the equivalence of norms 9.2 to get

where V\refeq:prepareGronwalldelta=V\refeq:equivnormsC\reflem:deltaellupperboundmax{1,L2,v\refeq:equivnorms}V_{\ref{eq:prepareGronwalldelta}}=V_{\ref{eq:equiv-norms}}\cdot C_{\ref{lem:deltaellupperbound}}\cdot\max\{1,L-2,v_{\ref{eq:equiv-norms}}\}.

Now, we integrate these estimates over μ0\mu_{0} to find:

Now we use Lemma 8.4 to estimate the integral of the difference of z(L)\overline{z}^{(L)} with respect to μ0\mu_{0} to get:

where C\reflem:deltaupperbound=V\refeq:prepareGronwalldeltamax{1,C(eRT+1)}C_{\ref{lem:deltaupperbound}}=V_{\ref{eq:prepareGronwalldelta}}\cdot\max\{1,C(e^{RT}+1)\}.

Finally, we apply Gronwall’s Lemma and the Observation 9.1 to get:

The above inequality may be combined with a Gronwall-style iteration to obtain:

To complete the proof of Theorem 5.3, we need to argue that μ[0,T]\mu^{\star}_{[0,T]} has the independence structure described there. As it turns out, this is a mere consequence of Lemma 7.3. Since μ[0,T]=ψ(μ[0,T])\mu^{\star}_{[0,T]}=\psi(\mu^{\star}_{[0,T]}), we have that Θ[0,T]μ[0,T]\overline{\Theta}_{[0,T]}\sim\mu^{\star}_{[0,T]} must satisfy, for all t[0,T]t\in[0,T]

This implies that μ[0,T]\mu^{\star}_{[0,T]} admits the decomposition:

From SGD to continuous-time gradient flow

In this section we show that the Stochastic Gradient Descent (SGD) process used for training our neural network is close to a continuous time version. This is the first step in our connection between SGD and the McKean-Vlasov process.

This norm computes maximum layer average of the parameters.

When ε\varepsilon is small, we will show that this process is close to the usual Continuous Time Gradient Descent (CTGD) obtained from the problem of minimizing the parametric Mean-Squared Loss

By recalling Remark 3.5, we see that this process can be written in the integral form as

For the SGD, we can unroll the iteration and obtain

holds with probability higher than 1eu21-e^{-u^{2}}.

To make the estimation we first define an auxiliar process, consider for each choice of indices

Next, we apply the (L)\left\|\cdot\right\|_{(L)} to obtain the following:

where (I)(I) will be controlled using martingale techniques and (II)(II) will be controlled using Lipschitz arguments. We start with (II)(II).

By invoking Lemma B.17, we see that the function α\mboxGrad^N\alpha\cdot\widehat{\mbox{Grad}}_{N} is Lipschitz with respect to the (L)\left\|\cdot\right\|_{(L)}. This means that

where C\refeq:sgdctgdestimate3=C\reflem:gradlnorm(1+C\refeq:bthetalip)C_{\ref{eq:sgd-ctgd-estimate-3}}=C_{\ref{lem:grad-lnorm}}\,(1+C_{\ref{eq:btheta-lip}}).

Now we can plug this result back into 10 to obtain

Define a stopping time τ=inf{rk,MrtN(r)(L)>u}k\tau=\inf\left\{r\leq k,\,\left\|\text{Mrt}_{N}(r)\right\|_{(L)}>u\right\}\wedge k. Then notice that

By employing a Cramér-Chernoff argument, we get:

where the second inequality is the Optional Stopping theorem for submartingales.

Computing the Expectation in the RHS above, we find

This means that we can use Lemma A.3 to get

We complete the Cramér-Chernoff argument to find:

This statement can be rearranged to show that

holds with probability higher than 1eu21-e^{-u^{2}}.

By plugging this back into (10.12), we see that there is a constant C\refprop:SGDtoCTGD=C\refprop:SGDtoCTGD(C,T,L)C_{\ref{prop:SGDtoCTGD}}=C_{\ref{prop:SGDtoCTGD}}(C,T,L) such that

holds with probability higher than 1eu21-e^{-u^{2}}, which proves the lemma. ∎

Ideal particles and the gradient flow

In this section we introduce the ideal particles to which we will compare the weights θN(){\boldsymbol{\theta}}_{N}(\cdot) in our network. It will be crucial to use the structure of the McKean-Vlasov measure, as described in Theorem 5.3.

Given the initial positions θN(0){\boldsymbol{\theta}}_{N}(0) of the particles in the DNN (sampled as in Assumption 5.1), we define

Note the following immediate proposition related to the McKean-Vlasov process (see our discussion in Section 4). Recall that μ[0,T]\mu^{\star}_{[0,T]} is the law of the unique solution to our McKean-Vlasov problem.

This comes from two observations. The first one is the structure of μ[0,T]\mu^{\star}_{[0,T]} described in Theorem 5.3. The second is that the initial law of the McKean Vlasov process μ\mu^{\star} is equal to the starting measure for the θN{\boldsymbol{\theta}}_{N} along a path.∎

Our main goal in this section will be to show that the time evolution of the ideal weights θN(t)\overline{\boldsymbol{\theta}}_{N}(t) nearly follows a continuous-time gradient flow with respect to LN(θN(t))L_{N}(\overline{\boldsymbol{\theta}}_{N}(t)). More precisely, we will prove the following result.

satisfies, for all ξ>0\xi>0 and 0tT0\leq t\leq T,

where d=max{di:i[1:L]}d=\max\{d_{i}\,:\,i\in[1:L]\} is the largest dimension of an inner layer and C\reflem:nearlygrad=C\reflem:nearlygrad(C,L)C_{\ref{lem:nearlygrad}}=C_{\ref{lem:nearlygrad}}(C,L). This implies that the inequality

holds with probability greater than 1eu21-e^{-u^{2}}

The definition of the ideal particles implies that, for a fixed i2i_{2}, the pairs

are i.i.d. with common law (μ)t(0,1)(\mu^{\star})_{t}^{(0,1)}. If we recall the formula for z(2)\overline{z}^{(2)} in (4.4), we obtain:

In order to compare these quantities, we introduce an intermediate term.

and define Δ1\Delta_{1} and Δ2\Delta_{2} defined in analogy with (11.3) and (11.4) respectively.

and moreover θiL,1(L)(0)=θiL,1(L)(t)\overline{{\boldsymbol{\theta}}}^{(L)}_{i_{L},1}(0)=\overline{{\boldsymbol{\theta}}}^{(L)}_{i_{L},1}(t) a.s.. Therefore,

for a certain bounded function h(L)h^{(L)}, and we may bound Δ1\Delta_{1} with a MGF computation as above.

2. Gradients and ideal particles

From now on, we fix (X,Y)(X,Y) and the multi-index and analyze the LHS of (11.8). Unfortunately, this requires some tedious calculations. For the sake of brevity, we will be somewhat cavalier in our notation. We will introduce new random elements A1,A1,A_{1},\overline{A_{1}},\dots and a trilinear mapping Q\mathcal{Q} below, without specifying their domains and ranges explicitly. All of these ranges are vector spaces will also use \|\cdot\| to denote the norms over any of these spaces. We will also use V=V(C,L)V=V(C,L) to denote a constant that depends only on CC and LL, which may change from line to line.

where Q\mathcal{Q} is a trilinear mapping satisfying:

for all a,b,ca,b,c in the appropriate spaces (and with the appropriate norms). The function α\alpha and the random elements in equations (11.9) through (11.14) are all uniformly bounded by constants depending only on CC and LL, so we obtain:

with V\refeq:threetermsprooftrack=V\refeq:threetermsprooftrack(C,L)>0V_{\ref{eq:threetermsprooftrack}}=V_{\ref{eq:threetermsprooftrack}}(C,L)>0. Let us now consider each of the terms of the RHS of (11.15). Note that

Since σ(L+1)()\sigma^{(L+1)}\left(\cdot\right) is CC-Lipschitz, we obtain:

with Δz(L+1)\Delta z^{(L+1)} defined in Lemma 11.4. Similarly,

respectively. Each of these terms is CC-Lipschitz and CC-Uniformly Bounded. By comparing the two products as we did for Q\mathcal{Q} – that is, changing terms in the product one by one –, we obtain:

where C\reflem:nearlygrad=C\reflem:nearlygrad(C,L)>0C_{\ref{lem:nearlygrad}}=C_{\ref{lem:nearlygrad}}(C,L)>0, completing the proof. ∎

Loss approximation and ideal particles

This section combines the elements of the previous two sections to prove Theorems 5.4 (on approximation of the loss function) and 5.5 (on approximating weights by ideal particles). We will focus on the first proof most of the time, and then argue that Theorem 5.5 follows from the same calculations.

To prove Theorem 5.4, we will control the successive approximations of the process developed in the previous sections.

θN(){\boldsymbol{\theta}}_{N}(\cdot) are the weights in the original DNN, which evolve in discrete time;

θN()\overline{{\boldsymbol{\theta}}}_{N}(\cdot) is the ideal particle process introduced in Section 11; and

μ\mu^{\star} is the solution to the McKean-Vlasov problem in Definition 4.4.

We deal with term (I)(I) in the following lemma:

where d:=max{di,1iL1}d:=\max\{d_{i},1\leq i\leq L-1\} and C\refeq:errorboundsgdctgdC_{\ref{eq:error-bound-sgd-ctgd}} depends only on LL and CC.

Notice that according to Lemma B.18 in the Appendix, we have that the function LN\mathcal{L}_{N} is Lipschitz. Therefore, it holds that:

Now, we will use the result in proposition 10.1 together with the elementary lemma A.4 to get:

The next step is to find an upper bound for (II)(II). That task is accomplished in the next lemma:

for some constant C\refeq:errorboundctgdidealC_{\ref{eq:error-bound-ctgd-ideal}} which depends only on T,LT,L and CC. Here d:=max{di,1iL1}d:=\max\{d_{i},1\leq i\leq L-1\} as in the previous Lemma.

We use this to derive the following inequality:

Now, according to Lemma 11.3, with probability greater than 1eu21-e^{-u^{2}} it holds that

Follow that up with a Gronwall Inequality to get:

Recall that this inequality holds in a set of probability higher than 1eu21-e^{-u^{2}}. To close things off, we recall that according to B.18, LN\mathcal{L}_{N} is a Lipschitz function. Then, we invoke Lemma A.4 to find:

where C\refeq:errorboundctgdideal=C\reflem:LNlipschitzC\reflem:nearlygrad(π2+Llog(5))eCC\refrem:lipschitzyN TT2C_{\ref{eq:error-bound-ctgd-ideal}}=C_{\ref{lem:L_N-lipschitz}}\cdot\sqrt{C_{\ref{lem:nearlygrad}}}\left(\frac{\sqrt{\pi}}{2}+\sqrt{L\log(5)}\right)e^{C\cdot C_{\ref{rem:lipschitz-y_N}}\cdot\ T}\cdot T^{2}. ∎

Finally, we need to find a proper bound for (III)(III), which is achieved by the following lemma:

Let θN\overline{{\boldsymbol{\theta}}}_{N} be the ideal particle process defined in section 11 and μ[0,T]\mu^{\star}_{[0,T]} be the unique solution to the McKean-Vlasov problem under assumptions 5.1 and 5.2. Then, for t[0:T]t\in[0:T] the following inequality holds:

To start the proof of this lemma, we rewrite the Loss function difference as:

Then, using the fact that Y,y^N,yY,\widehat{y}_{N},\overline{y} are CC-bounded we get:

Now, by recalling the definition of y^N\widehat{y}_{N} in (3.4), the definition of y\overline{y} in (4.7), and using the fact that σ(L+1)\sigma^{(L+1)} is CC-Lipschitz according to Assumption 5.2, we get:

Finally, we invoke lemma 11.4 together with lemma A.4 to conclude:

where C\refeq:errorboundidealvlasov=4C22KL+1(π2+Llog(5))C_{\ref{eq:error-bound-ideal-vlasov}}=4\cdot C^{2}\sqrt{2K_{L+1}}\left(\frac{\sqrt{\pi}}{2}+\sqrt{L\cdot\log(5)}\right)

2. Approximation by ideal particles

To prove Theorem 5.5, we first note that, by symmetry,

does not depend on the specific choice of input-to-output path. Therefore, we only need to bound:

This was done implictly in the proofs of the three lemmas. We omit the details.

Conclusion

From our point of view, the key contribution of our paper is to deal with some of the intricacies of mean-field limits for deep neural networks. Chief among those is the fact that DNNs lead to McKean-Vlasov problems involving conditional densities. Our results are obtained at the cost of considering that weights close to the input and output are "random features" that are not learned. However, as noted in Remark 3.4, our results should extend to the setting where weights close to input and output have a higher learning rate. We also believe that one could add noise to our analysis as in . This is the approach of . Another challenge would be to remove the boundedness and smoothness conditions on the nolinearities (cf. Assumption 5.2), and also to ameliorate the dimensional dependence of our bounds (cf. ).

To our mind, however, there are other, more pressing problems. One is to find a good way to model other DNN architectures. An obvious first goal, which is also related to supervised learning, is to study convolutional layers in DNNs . These are fundamentally different even at the heuristic level. Another would be to develop methods to study the long-term behavior of DNNs, following the lead of Mei et al . The main challenge here is that, as far as we can see, the PDEs for our densities have not been studied before.

A different direction is to develop a theory of nonparametric learning and optimization inspired by our result. Theorem 5.4 implies that, after O(1/ϵ)O(1/\epsilon) steps of training, our network computes (in the limit of large NN) a function y\overline{y} which is a composition of functions of the form

Acknowledgements. During the late stages of the preparation of this paper, RIO was a participant of "The rough high dimensional landscape problem" program in the Kavli Institute for Theoretical Physics (KITP) in Santa Barbara, CA. We thank program organizers Giulio Biroli, Chiara Cammarota, Patrick Charbonneau and Andrea Montanari for the invitation. We also thank other participants for their questions regarding this work, and the KITP and its staff for their hospitality.

Appendix A Some technical lemmas

We prove here some technical results on Cauchy ODE-type problems needed in Section 8.

This result was stated and used in the proof of Lemma 8.5, which showed that the set of RR-special measures (Definition 8.2) are closed under the map ψ\psi from Corollary 7.4.

Recall that we assume V>0V>0, R2VR\geq 2V and

HH is uniformly bounded by V>0V>0 in norm;

We claim that GG is RR-special, as per Definition 8.1 for any R2VR\geq 2V. To see this, notice first that G(a)(0)=a(L1)G(a)(0)=a^{(L-1)} and G(a)G(a) is VV-Lipschitz because HH is bounded by VV in norm.

What is left is to show that aG(a)(t)a\mapsto G(a)(t) is eRte^{Rt}-Lipschitz for each t[0,T]t\in[0,T]. To this end, consider ϵ>0\epsilon>0, t[0,T]t\in[0,T], and define:

which implies the desired Lipschitz property.

Therefore, for all t[0,T]t\in[0,T] and a,ba,b as above:

Since tΔϵ(t)t\mapsto\Delta_{\epsilon}(t) is continuous, one can use Picard iteration with the above expression and deduce that

where ξ()\xi(\cdot) is the unique solution of the Cauchy Problem:

In particular, since R2VR\geq 2V, we have that Δϵ(t)ξ(t)eRt\Delta_{\epsilon}(t)\leq\xi(t)\leq e^{Rt}, as desired. ∎

A.1.2. Proof of Proposition 8.9

This result was stated and used in the proof of Lemma 8.6, where we showed that all solutions to the McKean-Vlasov problem in Definition 4.4 are RR-special measures (cf. Definition 8.2).

Step 1: preliminaries. Recall the definition of U\mathcal{U} in (8.8) and of dist(,;t){\rm dist}(\cdot,\cdot\cdot;t) in (8.10) (also recalled in (A.1) below). The following simple fact will be convenient.

For any g1,g2Ug_{1},g_{2}\in\mathcal{U} and t[0,T]t\in[0,T],

because g1(a)(0)=g2(a)(0)g_{1}(a)(0)=g_{2}(a)(0) and both g1,g2g_{1},g_{2} are RR-Lipschitz. For the same reason, tdist(g1,g2;t)t\mapsto{\rm dist}(g_{1},g_{2};t) is continuous for any fixed g1,g2Ug_{1},g_{2}\in\mathcal{U}. Finally,

is a metric over U\mathcal{U}, and (U,dist)(\mathcal{U},{\rm dist}) is a complete metric space.

Also recall our assumptions: 2RV>02R\geq V>0 are constants and

If two functions g1,g2Ug_{1},g_{2}\in\mathcal{U} are such that

for μ0(L1,L)\mu_{0}^{(L-1,L)}-a.e. (a(L1),a(L))(a^{(L-1)},a^{(L)}), then:

for μ0(L1,L)\mu_{0}^{(L-1,L)}-a.e. (a(L1),a(L))(a^{(L-1)},a^{(L)});

is measurable; if a(L)a^{(L)} is also fixed, then (t,a(L1))Ξ(t,a(L1),a(L),g)(t,a^{(L-1)})\mapsto\Xi(t,a^{(L-1)},a^{(L)},g) is continuous;

Notice that this integral is a Riemann integral, since Ξ\Xi is continuous in the first two variables. Also observe that that T(g)\mathcal{T}(g) is measurable, T(g)(a)(0)=a(L1)\mathcal{T}(g)(a)(0)=a^{(L-1)}, and tT(g)(a)(t)t\mapsto\mathcal{T}(g)(a)(t) is VV-Lipschitz (and also RR-Lipschitz) ecause Ξ\Xi is bounded by VV in norm. We conclude that T(g)U\mathcal{T}(g)\in\mathcal{U} for all gUg\in\mathcal{U}, so T:UU\mathcal{T}:\mathcal{U}\to\mathcal{U} is well defined.

Step 2: existence, uniqueness and convergence to fixed point. We will show that:

which implies in particular that GoG_{o} is the unique fixed point of T\mathcal{T}.

We achieve this goal this via a Picard-type iteration. Note that for all g1,g2Ug_{1},g_{2}\in\mathcal{U}, item (5) in the assumptions and equation (A.4) imply:

Since dist(,;t){\rm dist}(\cdot,\cdot\cdot;t) is a continuous function of tt that is upper bounded by dist{\rm dist}, we may iterate this inequality to deduce that

Recall from Fact 1 that (U,dist)(\mathcal{U},{\rm dist}) is complete. Therefore, the Banach fixed point theorem shows that T\mathcal{T} has a unique fixed point GG and that limjdist(Tj(g),Go)=0\lim_{j}{\rm dist}(\mathcal{T}^{j}(g),G_{o})=0 for all gUg\in\mathcal{U}. In particular, this GoG_{o} must satisfy the following identity for all (t,a)(t,a) in the appropriate set.

Step 4: GoG_{o} is RR-special. Since GoUG_{o}\in\mathcal{U}, it suffices to show that

satisfies Δϵ(t)e2Vt\Delta_{\epsilon}(t)\leq e^{2Vt} (recall the assumption that R2VR\geq 2V.

A.2. Lemmata on moment-generating function

Suppose we have a sum of mm dd-dimensional random vectors Y1,,YmY_{1},\dots,Y_{m}. Assume c>0c>0 is such that Yic|Y_{i}|\leq c a.s. for all 1im1\leq i\leq m. Then

Each MGF in the RHS corresponds to taking Xi=2w,YiX_{i}=2\langle w,Y_{i}\rangle in the previous Lemma, with 2c2c replacing cc. Applying the bound from that result and the fact that N5d|\mathcal{N}|\leq 5^{d} finishes the proof.∎

Let (Yk)k0(Y_{k})_{k\geq 0} be a dd-dimensional martingale with respect to the filtration (Fk)k0(\mathcal{F}_{k})_{k\geq 0} with Y0=0Y_{0}=0. Assume c>0c>0 is such that YkYk1c|Y_{k}-Y_{k-1}|\leq c a.s. for all k>0k>0. Then,

A simple modification of Hoeffding’s Lemma implies that for k>0k>0,

Then, applying this bound recursively gives

Using the fact that N5d|\mathcal{N}|\leq 5^{d} finishes the proof. ∎

To use the upper bounds in probability to get upper bound in expected value, we use the following elementary lemma:

Let ZZ be a positive random variable for which we know that, for some a,b>0a,b>0, Za+bt\quad Z\leq a+b\cdot t\quad holds with probability greater than 1et21-e^{-t^{2}}.

Appendix B Estimates on Lipschitz constants

In this section, we present a collection of Lipschitz estimates for several functions used throughout many of our results. These estimates are specially usefull to work through the many nonlinearities in our model. We start by setting a framework that make computations for our results cleaner.

Since many of our maps act on measure spaces, for instance the ones in Section 4.3 (the mean-field versions of the model components), is necessary to set a metric for these spaces. We always use the Wasserstein L1L^{1} metric. Let us briefly review the definition of this metric:

For more readable computations, we also denote the integral of a function ff over some measure μ\mu as

Here we present some auxiliary results using the concept of Lipschitz semi-norm. These summarize typical computations, making our subsequent results more concise.

Let (Min,din),(Mout,dout)(M_{in},d_{in}),(M_{out},d_{out}) be two metric spaces, and let f:MinMoutf:M_{in}\to M_{out} be a Lipschitz function. We define

Notice that this quantity is finite iff ff is Lipschitz. Also, if VfLipV\geq\left\|f\right\|_{\text{Lip}} then

The following two lemmas are well known results regarding the Lipschitz semi-norm; we omit their proofs.

fgLipfgLip+gfLip\left\|f\cdot g\right\|_{\text{Lip}}\leq\left\|f\right\|_{\infty}\cdot\left\|g\right\|_{\text{Lip}}+\left\|g\right\|_{\infty}\cdot\left\|f\right\|_{\text{Lip}},

f+gLipfLip+gLip\left\|f+g\right\|_{\text{Lip}}\leq\left\|f\right\|_{\text{Lip}}+\left\|g\right\|_{\text{Lip}}.

For Lipschitz functions f:M1M2f:M_{1}\rightarrow M_{2}, g:M2M3g:M_{2}\rightarrow M_{3}, we have

Let η\eta be the optimal coupling between (μ,ν)(\mu,\nu). Then:

where step B.6 is simply the triangular inequality for integrals, B.7 uses the fact that ff is Lipschitz, and B.8 is an obvious upper bound. ∎

B.2. Lipschitz bounds for the mean-field maps

We now proceed to the Lipschitz estimates for the maps described in section 4.3. We assume that conditions 5.1 and 5.2 hold.

To obtain the Lipschitz upper bound, we prove (B.9) by induction up until L1L-1, and then work the case z(L+1)\overline{z}^{(L+1)} separately.

This argument works up until L1L-1. It fails for z(L)\overline{z}^{(L)} for it is defined differently. Nevertheless, we can work through it and prove that the result holds for z(L+1)\overline{z}^{(L+1)} as well. Let η\eta be the optimal coupling between (μ,ν)(\mu,\nu) and notice that:

showing C\refeq:zbarorder(L+1)=O(CL+1)C_{\ref{eq:zbar-order}}(L+1)=\mathcal{O}\left(C^{L+1}\right). ∎

The boundness property of item (1) also comes directly from the assumption 5.2, since the maps M\overline{M} and Γ\overline{\Gamma} are products of derivative terms

For the Lipschitz estimate we do induction in the inverse order.

B.3. Lipschitz bounds for the existence of McKean-Vlasov process

In this subsection, we isolate the Lipschitz computations which will be needed when proving the existence of the McKean-Vlasov process. In order to better present these computations, we define the following function:

The only gradient missing in previous Lemma is Γ(L1)\overline{\Gamma}^{(L-1)}. It needs a different analysis for it has a different structure than the others. Although it may not necessarily be Lipschitz, we can prove the following upper bound, which turns out to be good enough for our purposes:

We start by making the following computations:

where V1max{Γ(L1)yLip,Y+y}V_{1}\geq\max\left\{\left\|\overline{\Gamma}^{(L-1)}\right\|_{\infty}\cdot\left\|\overline{y}\right\|_{\text{Lip}}\,,\,\left\|Y\right\|_{\infty}+\left\|\overline{y}\right\|_{\infty}\right\}.

Now, it suffices to show that, for some constant V2V_{2}, it holds that:

By recalling definition (4.14), using the upper bounds from previous section and some seriously tedious calculations, one may show that:

B.4. Lipschitz bounds for DNN maps

We also present Lipschitz estimates for the maps defined in section 3. The estimates and the proofs are similar to the ones obtained for their mean-field versions.

The previous lemma also implies that y^N\widehat{y}_{N} is bounded and Lipschitz,

And for the Lipschitz property we estimate using induction,

Using that the average of Lipschitz functions of same order is still Lipschitz, we can see that the gradients obtained through backpropagation are also Lipschitz, and bounded.

Our last lemma for this section deals with the special norm (L)\left\|\cdot\right\|_{(L)},

which is used in section 10. Unfortunately, the Lipschitz property for this case do not quite follows the results obtained for the L1L^{1} norm. We need to exploit the averaging structure of our functions, as the norm itself does it. In the next lemma, we omit the dependency on the variables XX and YY. We remark that the estimates are also uniform in these entries, which can be proved using the same strategies used during this section.

Next, we observer the behavior of the gradient terms:

We first take care of the first part of the RHS:

If we combine this estimate with the one in (B.18), then we can update the constants K1K_{1} to K4K_{4} obtaining

Is not hard to see that if we average the terms on the RHS, as in the (L)\left\|\cdot\right\|_{(L)}, then each term can be bounded by aNbN(L)\left\|a_{N}-b_{N}\right\|_{(L)}. This concludes the proof. ∎

There is a constant C\reflem:LNlipschitzC_{\ref{lem:L_N-lipschitz}} such that

Pick C\reflem:LNlipschitz=max{4L  CL+3,2C}C_{\ref{lem:L_N-lipschitz}}=\max\{4L\;C^{L+3},2C\} to finish lemma. ∎

Appendix C The set of special probability measures is closed

We prove here that the set PR\mathcal{P}^{\star}_{R} defined in §7.2 is closed under weak convergence.

In what follows, we first show that the Claim implies that μ[0,T]PR\mu_{[0,T]}\in\mathcal{P}^{\star}_{R}, and then prove the Claim.

Proof that the Claim implies μ[0,T]PR\mu_{[0,T]}\in\mathcal{P}^{\star}_{R}. By passing to a subsequence, we may assume that FjFF_{j}\to F uniformly over compact sets. We will use this to show that we may take F:=FF:=F in Definition 8.2 to certify that μ[0,T]\mu_{[0,T]} is RR-special.

To do this, we first argue that FF is RR-special. This is simple: inspection of Definition 8.1 reveals that the properties required of a RR-special function are preserved under pointwise convergence, and thus also under uniform convergence over compact sets.

We now argue that FF attests that μ\mu is RR-special: that is, we will show that, if Θμ[0,T]\Theta\sim\mu_{[0,T]}, then Θ[0,T](L1)=F(Θ(L1)(0),Θ(L)(0))\Theta_{[0,T]}^{(L-1)}=F(\Theta^{(L-1)}(0),\Theta^{(L)}(0)) a.s..

Now, when j+j\to+\infty, the pair (Θj(L1)(0),Θj(L)(0))(\Theta_{j}^{(L-1)(0)},\Theta_{j}^{(L)}(0)) is a.s. convergent, and there a.s. is some (random) R[0,+)R\in[0,+\infty) with:

Using continuity of FF and the convergence of Θj\Theta_{j}, we obtain:

Proof of the Claim. We will need the following form of the Ascoli-Arzèla Theorem.

This follows from the usual formulation of Ascoli-Arzèla for the case when XX is itself compact. We omit the details.

References