A mean-field limit for certain deep neural networks
Dyego Araújo, Roberto I. Oliveira, Daniel Yukimura
Introduction
Deep neural networks (also known as deep nets or DNNs) and the "deep learning" methods they implement have revolutionized Machine Learning . They are behind recent spectacular successes in machine translation, pattern recognition, and two-person game playing, to name but a few fields. Unfortunately, we are far from understanding theoretically what makes DNNs work. In fact, even describing the evolution of a DNN during training is quite difficult.
Large neural networks are high-dimensional disordered objects with many interacting components. It is natural to use methods from Statistical Mechanics to describe them . The main goal of this paper is to give a scaling limit for the evolution of certain deep neural networks during training via a mean field model inspired by Statistical Mechanics. That is, we show that, for a large network, the parameters of a large DNN behave like random particles interacting with each other and with their own probability densities in the way prescribed by a certain McKean-Vlasov process. This will require the following assumptions: complete connections between layers, plus "random features" that are not learned on the first and last layers of weights. The latter assumption does not seem to be crucial, as we explain below (see §2.3 and §5.3).
Our result extends work done for simpler "shallow" neural networks by Mei, Montanari and Nguyen (see also ); Rotskoff and Vanden-Eijnden ; and Sirignano and Spiliopoulos . A recent (and independent) preprint by Sirignano and Spiliopoulos also gives a limiting description of deep networks. As we will see, that paper studies a different sense of limit: a DNN has "layers of neurons," and considers what happens when layers grow one at a time. By contrast, our methods allow us to consider networks all of whose layers are large, which seems more natural. Our work is closely related to the nonrigorous (and independent) results of Nguyen .
The paper is organized as follows. Section 2 gives presents the context for our result, including related work. Our own work starts in Section 3, we present our problem setup. Section 4 motivates and describes the McKean-Vlasov process used to find the scaling limit of DNNs. The main theorems are stated in Section 5, where we offer additional comparisons with related work. Proofs are outlined in Section 6 and presented in Sections 7 to 12. Some extensions and open problems are described in the Conclusion (Section 13).
Given a Polish space , we will use to denote the set of probability measures over the Borel -field of . Given , we write to say that is a random element of with law .
Background and related work
This section provides some context for our results and how they relate to the existing literature. In §2.1, we describe the Statistical Learning formulation of supervised learning with deep neural networks. In §2.2, we discuss the work on "shallow" networks and how it relates to McKean-Vlasov processes. In §2.3, we outline our main results. In §2.4, give some additional pointers to related literature.
At a high level, a DNN is just one special type of "parametric function"
as small as possible. In the jargon of the area, minimizing means that the DNN generalizes well to unseen data coming from the same distribution as the training data .
The trouble with DNNs comes from their sheer size and complexity. The seminal "AlexNet" DNN had parameters; recent ones are much larger. In addition, the functions computed in DNNs are complicated layers of compositions of nonlinear functions. When deep nets are trained by gradient-descent type procedures, as is usually the case, many questions can be raised. Does gradient descent typically converge to a global minimum , and, if so, does it converge in reasonable time ? How does the shape of the "energy landscape" relate to generalization versus "overfitting" (or finding spurious patterns in data) of the DNN ? In spite of many recent results, we still lack satisfying answers to these questions.
2. McKean-Vlasov scaling limits
Given the above difficulties, it is reasonable to take a step back and ask the simpler question: what is a deep neural network actually doing when it is trained?
Statistical Mechanics offers a natural way to approach large neural networks, be they deep or shallow. Physicists and mathematicians have been using this perspective since the original boom of neural nets; see the books of Engel and van den Broeck and Mézard and Montanari for more information.
A more recent development has been to describe neural networks being trained to mean-field models of McKean-Vlasov type . As noted above, the papers obtained results of this kind in the case of "shallow" networks, which are simpler than deep nets.
Assume the DNN is initialized with values that are i.i.d. with law . Suppose also that we train the network by gradient descent in the following sense:
(In truth, the network is trained by stochastic gradient descent, but we ignore the distinction for the time being.)
The authors of noted that, for shallow networks, one can rewrite this evolution as
where is a drift term depending on the value of the weight and on the empirical measure of the weights at time :
A natural ansatz is that for large , approaches a deterministic measure for each . If we had , then the weight trajectories would be decoupled and thus i.i.d. (as the are i.i.d.). So we expect that for large the weights are nearly i.i.d. Under this ansatz, we also have that for a random index , follows . Therefore, one may conjecture that, in the thermodynamic limit, the approach i.i.d. particles satisfying
Problem describes a particle starting from a random initial state which interacts with its own distribution . Under fairly reasonable conditions on the function , there exists a unique probability measure over continuous trajectories such that, if , then is satisfied, and the are the time- marginals of . In this paper, measures obtained as solutions to problems of the form are called McKean-Vlasov measures, although this name is reserved for the case where the evolution of includes a diffusion term. More information about McKean-Vlasov problems and their history can be found in the references given in §2.4.3.
If there is a unique McKean-Vlasov measure satisfying , one may then give a coupling argument to show that the trajectories are close to an i.i.d. sample of . This property is called "propagation of chaos" and is used directly in . The densities of the measures (if they exist) are weak solutions of the PDE:
Montanari, Mei and Nguyen have used properties of PDE in the case of shallow learning to study the generalization abilities of shallow networks.
3. Deep neural networks and our contribution
The main result of this paper, described in sections 3 to 5, extends the analysis of to deep neural networks. More precisely, we consider a specific class of DNNs with complete connections and with first and last layers as "frozen random features" (the role of these is elucidated in Remark 3.4 below). These networks consist of units organized in layers. We consider what happens when the DNN is trained by a version of stochastic gradient descent.
As in the papers discussed above, we find that, when is large, the weights of the DNN are close to certain "ideal particles" derived from a McKean-Vlasov process. However, the picture for DNNs is much more complicated than that of shallow networks. The asymptotic weight distributions are layer-dependent, and weights do not become independent in the large- limit. In fact, the fundamental units of our analysis are not individual weights, but rather paths of weights in the network that go from the input to the output. The specific dependency structure of weights along a path is essential in the analysis. We will also see that the drift term for our McKean-Vlasov process will depend on the solution in a discontinuous fashion, which means that proving existence and uniqueness is a nontrivial task.
Two recent independent papers of Sirignano and Spiliopoulos and Nguyen obtain results related to our work. Reference uses a nonrigorous ansatz to obtain essentially the same process that we do. One difference comes from the fact that uses different scalings for the updates in different layers, in order to avoid the issues discussed in Remark 3.4.
By contrast, considers a different scaling limit: the number of units in different layers do not all diverge simultaneously, but rather one at a time. The authors of note that letting all layers diverge at the same time brings a host of technicalities. Our main contribution is to tackle these technicalities head on, which will be important for future progress in the field. In fairness, we note that our assumptions ("random features", differentiable activation functions) leave something to be desired, though some of these restrictions may be lifted. A more thorough comparison between our results and those of is given in §5.3.
The existence of a limiting process may also contribute to the analysis of learning in the class of networks we study. This program has been partially carried out by Mei et al. for shallow networks, where the limiting PDEs describe so-called Wasserstein gradient flows in the space of probability measures . Our limiting process leads to a PDE family that we believe is new and deserving of further study.
4. Additional background
We present here a concise and partial overview of the work related to our paper.
The field of neural network applications is simply too large to survey adequately. Literally dozens of arXiv preprints on the subject appear every week. The landmark "AlexNet" paper on image recognition , the survey and the book serve as reasonable introductions to this field. Generalized Adversarial Networks or GANs are one example of DNNs that fall outside the scope of supervised learning.
Statistical Learning Theory was developed to explain generalization in much more general contexts than deep learning. Much of that theory is predicated on assuming that the algorithm under consideration cannot interpolate the training data . However, practical applications feature "over parameterized" deep nets, with much larger than the sample size. In this case, interpolation may well happen . This has generated a flurry of related activity on statistical methods that do interpolate: see for a recent example.
4.2. Other scaling limits for neural networks
We described above McKean-Vlasov limits for deep and shallow networks. Natural Tangent Kernels have been put forward by Jacot, Gabriel and Hongler as scaling limits of deep networks when the number of neurons diverges, albeit under a different time scale. Other papers have followed up on this idea . However, Chizat and Bach argue that this stems from small step sizes and an effective linearization of the loss function around the starting point of the optimization. This idea is also explored by Mei, Misiakiewicz and Montanari .
4.3. Mean-field models, propagation of chaos and related topics
The kind of models discussed in §2.2 come from a body of work on nonlinear Markov processes, mean-field particle models, and kinetic theory. Here we give only a few references to the area.
McKean’s seminal paper introduces McKean-Vlasov processes. More general nonlinear Markov processes are surveyed in Kolokoltsov’s book . Spohn’s book covers kinetic theory, Vlasov’s equation, and related topics. The term "propagation of chaos" was a concept introduced by Kac in the context of kinetic theory and made central in Sznitman’s St Flour lectures . Rachev and Rüschendorf present methods for proving existence and uniqueness for McKean-Vlasov problems. We will adapt their method to our setting, but this will require circumventing certain discontinuities.
Setup for rigorous results
This section describes the sort of deep neural networks and regression task considered in this work.
Our problem centers around understanding the following regression task:
In our case, the possible functions are constrained by the architecture of a Deep Neural Network. This consists of a family of parametric functions indexed by ,
2. The neural network.
Our parametric functions will be defined as follows.
First, for , we define
Finally, the output of the network is given by
Our model defines a very general version of a neural network with fully connected layers, where the internal units may have dimension greater than .
3. The backpropagation equations.
As is standard, the gradients of with respect to the weights may be computed via backpropagation . The key idea is to write the derivatives of with respect to the activations recursively, starting from the output. Since
4. Training by stochastic gradient descent
Lastly, we describe the training procedure for our network by Stochastic Gradient Descent (or SGD). Specifically, weights are updated at discrete time steps according to a point estimate of the gradient of the loss function obtained from a fresh sample point. This matches the assumption of previous papers such as .
The full vector of scaled gradients is denoted by
In other words, the network weights of the hidden layers follow, on average, the negative gradient of the population loss.
Mean-field picture for the weights
We now explain the mean-field behavior we expect from our system in the large- limit.
The basic idea behind our model is that the summations appearing in the gradients of are to be replaced by integrals over conditional distributions. This idea was applied nonrigorously (and independently) by Nguyen . The ansatz can be summarized as follows:
Weights on the same layer are statistically indistinguishable.
Averages over many paths are expected to obey a law of large numbers and become deterministic in the limit.
By contrast, a weight has one backwards path through . So we expect that will depend on (but will decouple from other weights) even in the thermodynamic limit. A similar reasoning shows that should depend on , but on no other weights.
This leads us to consider input-to-output paths in the network as our basic units in the analysis:
If our ansatz is correct, understanding the distribution of weights on a path suffices to describe the law of all weights in the network. We clarify this in the next subsection.
2. Paths and measures
Note that the vector in (4.1) lives in
Note that the heuristic analysis suggests a factorization for the distribution over a path: the measure over a path should factor as
Moreover, if we can compute the law of the weights on a path, the weights over the entire network should factor (approximately) as follows, at any fixed training step.
We will eventually make this precise via the introduction of "ideal particles". It will be crucial that the above independence structure also works at the level of trajectories. For this we will need the following notation.
3. Mean-field representation of network evolution
The case of is different: we define it in terms of and the conditional distribution . Letting , we define:
This definition depends on the choice of regular conditional probability, but this issue will not concern us in what follows.
3.2. Mean-field versions of gradients.
We can define, analogously to (3.8), a function as follows:
Now that we have the mean-field version of the gradients, we can define the following McKean-Vlasov problem.
Then we may rewrite conditions in the above definition as saying that, if , then
Statement of rigorous results
In order to properly analyse our model, we start by delineating a series of assumptions. These are meant to be as broad as possible, while still allowing the use of known analytical tools. We start by making a hypothesis on the initialization of the Network.
Basically, this assumption means that initialization is independent and weights in the same layer share a common law. The next assumption deals with the nature of the activation functions.
There exist a constant with the following properties.
Assumption 5.2 is made for technical convenience, and excludes the popular ReLU nonlinearity. We expect that our results will extend with minor changes to more general activation functions.
2. Main theorems.
We now state the main results of the paper. The first one states that our McKean-Vlasov problem is well-posed.
Under the assumptions in Section 5.1, the McKean-Vlasov problem in Section 4.3 has a unique solution . This solution has the following structure:
such that, if , then almost surely, for all
As a consequence, factorizes into independent components, ie,
Our second main results shows that we can describe the loss of our DNN with particles in terms of the McKean-Vlasov limiting process.
Make the assumptions in the previous Section. Given , let denote the unique solution of the McKean-Vlasov problem described in Subsection 4.3. Also, let be steps in the SGD process for the DNN described in Subsection 3.4.
Then, for all , it holds that:
In fact, our proofs also give a "microscopic" approximation of SGD, whereby we approximate individual weights of the network.
Under the assumptions in Section 5.1, consider as in Theorem 5.3, and let
be as in Theorem 5.4. Then one can couple the above evolution with certain "ideal particles"
and for all :
One key difference between our system and and other interacting particle systems of mean-field type (eg. as popularized by ) is that network weights do not become independent in the limit, as even ideal particles have nontrivial dependencies. However, disjoint paths of weights will become independent. This implies a propagation of chaos for weight vectors over randomly chosen paths:a constant number of them will become asymptotically independent. This is in keeping with the point made in Section 3 where we noted that paths are the fundamental units of our system.
3. More comments on related results
As noted, there are two recent preprints that tackle similar problems to our main results. Nguyen has essentially the same ansatz and the same end result as we do, but he uses nonrigorous methods.
Our own work seems to identify one important source of the above difficulties. The most natural measure to consider is over paths of weights. Then, however the closed evolution obtained turns out to involve conditional measures, and conditioning is a discontinuous operation. In our approach, we have not tried to write a single empirical measure for all weights. Rather, our approach follows the ideas outlined in the next section.
Proof overview
In this section we present an outline for the proofs of our main results, which are further described in the subsequent sections. We start by summarizing the proof for existence and uniqueness of our limiting McKean-Vlasov problem (Theorem 5.3). Next, we overview the results from sections 10 to 12, showing that our strategy for training DNNs approaches the trajectories obtained from our McKean-Vlasov problem (Theorems 5.4 and 5.5).
Section 7 is where is where we make our first restriction to the set of measures. Solutions for our McKean-Vlasov problem follow Lipschitz trajectories through time (see Lemma 7.2). This property is also enough to guarantee that is well-defined, i.e. that the system (6.1) has a unique solution (Corollary 7.4). Lemma 7.3, where the above statement is in fact proved, is a consequence of the Picard-Lindelöf Theorem.
Even though our map is well defined, standard ODE results are not enough to ensure that we can iterate our map and find a fixed point. This is mostly due to the particularity of the dependencies we have for the outermost layers. Therefore, we need to make sure that the output law from the map also behaves well, i.e. that it still maintains Lipschitz continuity in the initial conditions.
This set of measures is topologically closed (Theorem 8.7).
A series of results then show that this set of special measures is closed under (Lemma 8.5), and closed under the Wasserstein metric (Theorem 8.7). One key dea is to show that, for special measures, is also Lipschitz continuous with respect to the Wasserstein metric (Lemma 8.4). So the discontinuity issues in our problem disappar. We will also need a series of technical estimates on the Cauchy problem for the ODEs considered above.
In Section 9 we finally show that is a contraction on the set of special measures defined in Section 8. Since this set is invariant for , contains all solutions and is topologically closed, existence and uniqueness follow. We emphasize that finding this "special set of measures" is the key idea that makes the fixed point method work in our setting.
2. Approximation of loss function and ideal particles
The McKean-Vlasov problem is our guess for the process that describes solutions. To show that it is the right guess, we must couple discrete-time SGD to the continuous-time "ideal particles" described in Theorem 5.5, and compare the corresponding values of the loss function.
The first step, done in Section 10, is to compare SGD to a continuous-time gradient flow for network weights. This same idea appears in , but the details here are more complicated.
The second step is to introduce the "ideal particles" from Theorem 5.5 to which we will compare the continuous-time network weights. Section 11 carries our the first steps of this task. A crucial point here is that this: the specific dependency structure of the McKean-Vlasov solution allows us to "glue" the ideal particles to the edges of the DNN so that one "sees" the McKean-Vlasov solution along each path. In this way, we can reduce the problem of finding the mean-field approximation of the DNN to a coupling of real and ideal weights.
Finally, Section 12 shows that ideal particles give good approximations to network weights and allow one to compare the loss of the true DNN to a loss associated to the McKean-Vlasov problem.
Simple properties of the McKean-Vlasov problem
In this section, we prove some preliminary facts and estimates that any solution to our McKean-Vlasov problem must satisfy. In Section 7.1, we start with certain trivial properties that restrict which measures could potentially be solutions. In Section 7.2, we show that, if we apply the "natural" iteration to a potential solution, then the corresponding map is well-defined and satisfies certain properties.
Our first lemma is fairly straightforward. We start with the following definition.
as the set of all measures with the following properties: if , then:
and are a.s. constant;
Under assumptions in section 5.1, there exists a constant such that if is a solution of the McKean-Vlasov problem in Definition 4.4, then it must belong to for any .
2. A mapping and its structure
The next step is to define a map with the following property:
The following lemma shows that this operation is well-defined and has some additional structure.
Make the assumptions in Section 5.1. Let where . Then, one can find measurable mappings:
An immediate consequence of the Lemma 7.3 is that the map discussed above is well-defined.
In the setting of Lemma 7.3, there exists such that, if , then there exists a map defined as follows: Given , then is the unique measure with as its time marginal, and therefore if , then almost surely solves the initial value problem 7.1.
The result stated in the lemma follows from the fact that the ODE (7.1) has a unique solution for any starting state. To clarify this statement, a few steps are required.
and the integrand in the RHS is a bounded continuous function of . Therefore, the expectation changes continuously with by Bounded Convergence.
Since the function inside the expected value in the RHS is bounded and continuous in , the RHS is itself continuous in . Moreover, it is also a globally Lipschitz function of and , with a constant that comes from the boundedness of and the assumption that has Lipschitz derivative. The Picard-Lindelöf Theorem implies that the trajectory is a deterministic function of (which appears as a parameter in the RHS) and of the initial condition . Under our conditions, the solution to the ODE is the uniform limit of Euler-Maruyama discretizations. The recursive construction of these discretizations means that each of them is measurable with respect to , and the same is true of the limit . Finally, the Lipschitz property of trajectories comes from the fact that the RHS of the ODE is bounded.
Finally, the fact that 7.1 has a unique solution for almost every initial position clearly implies that is uniquely specified.∎
Potential solutions of the McKean-Vlasov problem
In this section we continue the work of Section 7. Corollary 7.4 shows that there is a well-defined map whose fixed points are solutions of the McKean Vlasov problem. It is then natural to iterate that map to obtain existence and uniqueness of solutions.
To circumvent such problems, we will show that we can further constrain the set of potential solutions of our problem. We will define the set of -special measures (Definition 8.2) so that a mesure in this set will have special Lipschitz properties in its coordinate. The results in this section will imply that this set is well-behaved with respect to and that the search for solutions can be narrowed down to this set.
To make all of this precise, we need two new definitions. The first defines the property we want the coordinate process to obey.
is -Lipschitz in , and .
is -Lipschitz as a function of .
The second definition defines the set of probability measures we will work with.
is said to be -special if and there exists an -special function such that, if , then almost surely, for all :
The set of all -special measures is denoted by .
Our first result on -special measures is that the discontinuity issues raised in Remark 4.6 will not matter when we consider these measures. In fact, we canestimate the difference between the values of for different measures via their Wasserstein distance. That will be accomplished by the next Lemmas.
The next two Lemmas are two of our main results about -special measures. Lemma 8.5 shows that this set is closed under the mapping while Lemma 8.6 that all solutions must be -special.
There exists a value depending only on and such that, for all , the mapping introduced in Corollary 7.4 satisfies .
There exists a value depending only on and such that, for all , any solution to the McKean-Vlasov problem belongs to .
Finally, we mention an important technical result, to be proved in Appendix C.
The set is closed under weak convergence (and thus also under Wasserstein metric).
The remainder of the section contains the proofs of the above Lemmas.
Let . Then is a.s. constant and
where is a Dirac delta. Plugging this into Equation (4.6) which defines finishes the proof. ∎
We now show how to estimate differences in for -special measures.
It follows that, in order to prove the Lemma, it suffices to show that:
The quantity in the LHS of (8.1) may be rewritten as:
Moreover, because the two measures are special, we have the almost-sure identities:
3. Proof that special measures are closed under ψ𝜓\psi.
Now we move on to prove Lemma 8.5, which shows that the map maps onto itself, assuming is large enough.
which clearly suffices to prove the Lemma.
To do this, we investigate the ODE (7.2) satisfied by under the assumption that is -special. We will need the following result, which is a direct consequence of Picard iteration.
Fix constants . Given consider the initial value problem
is uniformly bounded by ;
is -Lipschitz;
is -Lipschitz.
that associates to each , the solution of 8.3 is an -Special function.
Given this Proposition, Lemma 8.5 follows from the following result.
Let . Then there exists a constant and a continuous function
satisfying properties (1) - (3) in the proposition above with , and so that, if , then almost surely, for all
Indeed, we may apply Proposition 8.8 to the coming from this Claim to deduce that, if and , then for and for all
where is the -special function from Proposition 8.8. In particular, this implies Goal (8.2).
We now prove the Claim. In what follows, we will use symbols to denote positive quantities that depend only on and .
Consider . From Lemma 7.3 we know that there exists an -special function such that the identity
Now consider the drift term for the particle. Recalling Assumption 5.2, Lemma 8.3, and the definitions in 4.3 we see that
is continuous in , -Lipschitz in and -Lipschitz in for fixed. is also uniformly bounded by . Once we notice additionally that depends on only through , we may define
4. Proof that all solutions are special.
We now argue that all fixed points of are -special.
The proof of this result is similar to that of Lemma 8.5 in Section 8.3. The key difference is that we will analyse a slightly more complicated Cauchy problem than the one in Proposition 8.8.
We also know from Lemma 7.3 that belongs to the following set of functions if
Indeed, this will imply that is -special, as per Definition 8.2.
To do this, we will need to "do analysis" on the set . We need to establish some additional notation. For and , we define.
The next Proposition gives properties of a nonstandard Cauchy-type problem for functions in . It should be contrasted with Proposition 8.8 in the proof of Lemma 8.5.
Fix constants and let
be a measurable function that satisfies the following properties:
is uniformly bounded by .
If two functions are such that -almost surely, then:
-almost-surely;
is measurable. Also, for fixed, then is continuous;
There exist a constant and a function:
satisfying assumptions (1) - (5) in Proposition 8.9, and so that, for -a.e initial condition and for all :
Once we prove this claim, the theorem is essentially proved. We just need to apply Proposition 8.9 with this function together with Lemma 7.3. The upshot is that, if , then the following identity holds almost surely, simultaneously for all :
In other words, a.s.. The Proposition implies that there exists a -special function a.e. , and this finishes the proof because it achieves Goal 8.9.
We now prove the Claim. In what follows, we will use symbols to denote positive quantities that only depend on and .
Then for every and -almost-every . Moreover, the map satisfies analogues of the properties (1) - (5) in Proposition 8.9.
is bounded by because is bounded by
for -almost everywhere if -almost everywhere;
for all , the map is measurable for being a composition of measurable functions, and it is continuous in for similar reasons;
Since , then is -Lipschitz. So
finally, for and as above, using again the Lipschitz property:
Going back to the Definition 4.9 for , we see at once that we can define
satisfies similar properties to (1’)-(5’) with a constant replacing , because is bounded by and is -Lipschitz.
The same is true (with a different constant replacing ) for the map
Existence and uniqueness for the McKean-Vlasov problem
In this section, we prove Theorem 5.3, which asserts existence and uniqueness of the solution of the McKean-Vlasov problem in Definition 4.4.
The groundwork for this was built in sections 7 and 8. A quick recap: we defined a map (cf. Corollary 7.4) whose fixed points are the solutions of the McKean-Vlasov problem. Then, we defined the set of the so called -special measures with the following properties:
it is topologically closed (Theorem 8.7);
if is large enough, (Lemma 8.5);
if is large enough, contains all solutions (Lemma 8.6).
The first step is to construct, for any fixed , a reasonable coupling between and . By looking at Definition 7.2 for , we see that this map comes with an associated process .
We will denote by the process whose law is and the one with law . The way to couple these processes is to notice that, since , then according to Definition 7.1 we have . This means we can set . Since the randomness inherent to these processes comes from their initialization, the coupling of the starting points couples the entire processes.
This coupling gives an upper bound for the Wasserstein distance between the measures, as we see below:
So all we need to do is to find an upper bound for the RHS of this inequality.
To do so, we must work coordinate by coordinate. Define
Using this notation, we establish the following lemma:
There is a constant such that the following inequalities hold:
The fact that comes directly from the definition of the processes . By the coupling we made, we see that , and both are constant over time.
justified again by Lemma B.4 and for the fact that a function is always smaller than its supremum. Repeating the same argument as in 9.3, we can finally conclude that:
Choose to finish Lemma. ∎
Now we can use these estimates to prove the following upper bound for :
There is a constant depending only on and such that, for all ,
We start by adding all the inequalities of Lemma 9.1 and using the equivalence of norms 9.2 to get
where .
Now, we integrate these estimates over to find:
Now we use Lemma 8.4 to estimate the integral of the difference of with respect to to get:
where .
Finally, we apply Gronwall’s Lemma and the Observation 9.1 to get:
The above inequality may be combined with a Gronwall-style iteration to obtain:
To complete the proof of Theorem 5.3, we need to argue that has the independence structure described there. As it turns out, this is a mere consequence of Lemma 7.3. Since , we have that must satisfy, for all
This implies that admits the decomposition:
From SGD to continuous-time gradient flow
In this section we show that the Stochastic Gradient Descent (SGD) process used for training our neural network is close to a continuous time version. This is the first step in our connection between SGD and the McKean-Vlasov process.
This norm computes maximum layer average of the parameters.
When is small, we will show that this process is close to the usual Continuous Time Gradient Descent (CTGD) obtained from the problem of minimizing the parametric Mean-Squared Loss
By recalling Remark 3.5, we see that this process can be written in the integral form as
For the SGD, we can unroll the iteration and obtain
holds with probability higher than .
To make the estimation we first define an auxiliar process, consider for each choice of indices
Next, we apply the to obtain the following:
where will be controlled using martingale techniques and will be controlled using Lipschitz arguments. We start with .
By invoking Lemma B.17, we see that the function is Lipschitz with respect to the . This means that
where .
Now we can plug this result back into 10 to obtain
Define a stopping time . Then notice that
By employing a Cramér-Chernoff argument, we get:
where the second inequality is the Optional Stopping theorem for submartingales.
Computing the Expectation in the RHS above, we find
This means that we can use Lemma A.3 to get
We complete the Cramér-Chernoff argument to find:
This statement can be rearranged to show that
holds with probability higher than .
By plugging this back into (10.12), we see that there is a constant such that
holds with probability higher than , which proves the lemma. ∎
Ideal particles and the gradient flow
In this section we introduce the ideal particles to which we will compare the weights in our network. It will be crucial to use the structure of the McKean-Vlasov measure, as described in Theorem 5.3.
Given the initial positions of the particles in the DNN (sampled as in Assumption 5.1), we define
Note the following immediate proposition related to the McKean-Vlasov process (see our discussion in Section 4). Recall that is the law of the unique solution to our McKean-Vlasov problem.
This comes from two observations. The first one is the structure of described in Theorem 5.3. The second is that the initial law of the McKean Vlasov process is equal to the starting measure for the along a path.∎
Our main goal in this section will be to show that the time evolution of the ideal weights nearly follows a continuous-time gradient flow with respect to . More precisely, we will prove the following result.
satisfies, for all and ,
where is the largest dimension of an inner layer and . This implies that the inequality
holds with probability greater than
The definition of the ideal particles implies that, for a fixed , the pairs
are i.i.d. with common law . If we recall the formula for in (4.4), we obtain:
In order to compare these quantities, we introduce an intermediate term.
and define and defined in analogy with (11.3) and (11.4) respectively.
and moreover a.s.. Therefore,
for a certain bounded function , and we may bound with a MGF computation as above.
2. Gradients and ideal particles
From now on, we fix and the multi-index and analyze the LHS of (11.8). Unfortunately, this requires some tedious calculations. For the sake of brevity, we will be somewhat cavalier in our notation. We will introduce new random elements and a trilinear mapping below, without specifying their domains and ranges explicitly. All of these ranges are vector spaces will also use to denote the norms over any of these spaces. We will also use to denote a constant that depends only on and , which may change from line to line.
where is a trilinear mapping satisfying:
for all in the appropriate spaces (and with the appropriate norms). The function and the random elements in equations (11.9) through (11.14) are all uniformly bounded by constants depending only on and , so we obtain:
with . Let us now consider each of the terms of the RHS of (11.15). Note that
Since is -Lipschitz, we obtain:
with defined in Lemma 11.4. Similarly,
respectively. Each of these terms is -Lipschitz and -Uniformly Bounded. By comparing the two products as we did for – that is, changing terms in the product one by one –, we obtain:
where , completing the proof. ∎
Loss approximation and ideal particles
This section combines the elements of the previous two sections to prove Theorems 5.4 (on approximation of the loss function) and 5.5 (on approximating weights by ideal particles). We will focus on the first proof most of the time, and then argue that Theorem 5.5 follows from the same calculations.
To prove Theorem 5.4, we will control the successive approximations of the process developed in the previous sections.
are the weights in the original DNN, which evolve in discrete time;
is the ideal particle process introduced in Section 11; and
is the solution to the McKean-Vlasov problem in Definition 4.4.
We deal with term in the following lemma:
where and depends only on and .
Notice that according to Lemma B.18 in the Appendix, we have that the function is Lipschitz. Therefore, it holds that:
Now, we will use the result in proposition 10.1 together with the elementary lemma A.4 to get:
The next step is to find an upper bound for . That task is accomplished in the next lemma:
for some constant which depends only on and . Here as in the previous Lemma.
We use this to derive the following inequality:
Now, according to Lemma 11.3, with probability greater than it holds that
Follow that up with a Gronwall Inequality to get:
Recall that this inequality holds in a set of probability higher than . To close things off, we recall that according to B.18, is a Lipschitz function. Then, we invoke Lemma A.4 to find:
where . ∎
Finally, we need to find a proper bound for , which is achieved by the following lemma:
Let be the ideal particle process defined in section 11 and be the unique solution to the McKean-Vlasov problem under assumptions 5.1 and 5.2. Then, for the following inequality holds:
To start the proof of this lemma, we rewrite the Loss function difference as:
Then, using the fact that are -bounded we get:
Now, by recalling the definition of in (3.4), the definition of in (4.7), and using the fact that is -Lipschitz according to Assumption 5.2, we get:
Finally, we invoke lemma 11.4 together with lemma A.4 to conclude:
where ∎
2. Approximation by ideal particles
To prove Theorem 5.5, we first note that, by symmetry,
does not depend on the specific choice of input-to-output path. Therefore, we only need to bound:
This was done implictly in the proofs of the three lemmas. We omit the details.
Conclusion
From our point of view, the key contribution of our paper is to deal with some of the intricacies of mean-field limits for deep neural networks. Chief among those is the fact that DNNs lead to McKean-Vlasov problems involving conditional densities. Our results are obtained at the cost of considering that weights close to the input and output are "random features" that are not learned. However, as noted in Remark 3.4, our results should extend to the setting where weights close to input and output have a higher learning rate. We also believe that one could add noise to our analysis as in . This is the approach of . Another challenge would be to remove the boundedness and smoothness conditions on the nolinearities (cf. Assumption 5.2), and also to ameliorate the dimensional dependence of our bounds (cf. ).
To our mind, however, there are other, more pressing problems. One is to find a good way to model other DNN architectures. An obvious first goal, which is also related to supervised learning, is to study convolutional layers in DNNs . These are fundamentally different even at the heuristic level. Another would be to develop methods to study the long-term behavior of DNNs, following the lead of Mei et al . The main challenge here is that, as far as we can see, the PDEs for our densities have not been studied before.
A different direction is to develop a theory of nonparametric learning and optimization inspired by our result. Theorem 5.4 implies that, after steps of training, our network computes (in the limit of large ) a function which is a composition of functions of the form
Acknowledgements. During the late stages of the preparation of this paper, RIO was a participant of "The rough high dimensional landscape problem" program in the Kavli Institute for Theoretical Physics (KITP) in Santa Barbara, CA. We thank program organizers Giulio Biroli, Chiara Cammarota, Patrick Charbonneau and Andrea Montanari for the invitation. We also thank other participants for their questions regarding this work, and the KITP and its staff for their hospitality.
Appendix A Some technical lemmas
We prove here some technical results on Cauchy ODE-type problems needed in Section 8.
This result was stated and used in the proof of Lemma 8.5, which showed that the set of -special measures (Definition 8.2) are closed under the map from Corollary 7.4.
Recall that we assume , and
is uniformly bounded by in norm;
We claim that is -special, as per Definition 8.1 for any . To see this, notice first that and is -Lipschitz because is bounded by in norm.
What is left is to show that is -Lipschitz for each . To this end, consider , , and define:
which implies the desired Lipschitz property.
Therefore, for all and as above:
Since is continuous, one can use Picard iteration with the above expression and deduce that
where is the unique solution of the Cauchy Problem:
In particular, since , we have that , as desired. ∎
A.1.2. Proof of Proposition 8.9
This result was stated and used in the proof of Lemma 8.6, where we showed that all solutions to the McKean-Vlasov problem in Definition 4.4 are -special measures (cf. Definition 8.2).
Step 1: preliminaries. Recall the definition of in (8.8) and of in (8.10) (also recalled in (A.1) below). The following simple fact will be convenient.
For any and ,
because and both are -Lipschitz. For the same reason, is continuous for any fixed . Finally,
is a metric over , and is a complete metric space.
Also recall our assumptions: are constants and
If two functions are such that
for -a.e. , then:
for -a.e. ;
is measurable; if is also fixed, then is continuous;
Notice that this integral is a Riemann integral, since is continuous in the first two variables. Also observe that that is measurable, , and is -Lipschitz (and also -Lipschitz) ecause is bounded by in norm. We conclude that for all , so is well defined.
Step 2: existence, uniqueness and convergence to fixed point. We will show that:
which implies in particular that is the unique fixed point of .
We achieve this goal this via a Picard-type iteration. Note that for all , item (5) in the assumptions and equation (A.4) imply:
Since is a continuous function of that is upper bounded by , we may iterate this inequality to deduce that
Recall from Fact 1 that is complete. Therefore, the Banach fixed point theorem shows that has a unique fixed point and that for all . In particular, this must satisfy the following identity for all in the appropriate set.
Step 4: is -special. Since , it suffices to show that
satisfies (recall the assumption that .
A.2. Lemmata on moment-generating function
Suppose we have a sum of -dimensional random vectors . Assume is such that a.s. for all . Then
Each MGF in the RHS corresponds to taking in the previous Lemma, with replacing . Applying the bound from that result and the fact that finishes the proof.∎
Let be a -dimensional martingale with respect to the filtration with . Assume is such that a.s. for all . Then,
A simple modification of Hoeffding’s Lemma implies that for ,
Then, applying this bound recursively gives
Using the fact that finishes the proof. ∎
To use the upper bounds in probability to get upper bound in expected value, we use the following elementary lemma:
Let be a positive random variable for which we know that, for some , holds with probability greater than .
Appendix B Estimates on Lipschitz constants
In this section, we present a collection of Lipschitz estimates for several functions used throughout many of our results. These estimates are specially usefull to work through the many nonlinearities in our model. We start by setting a framework that make computations for our results cleaner.
Since many of our maps act on measure spaces, for instance the ones in Section 4.3 (the mean-field versions of the model components), is necessary to set a metric for these spaces. We always use the Wasserstein metric. Let us briefly review the definition of this metric:
For more readable computations, we also denote the integral of a function over some measure as
Here we present some auxiliary results using the concept of Lipschitz semi-norm. These summarize typical computations, making our subsequent results more concise.
Let be two metric spaces, and let be a Lipschitz function. We define
Notice that this quantity is finite iff is Lipschitz. Also, if then
The following two lemmas are well known results regarding the Lipschitz semi-norm; we omit their proofs.
,
.
For Lipschitz functions , , we have
Let be the optimal coupling between . Then:
where step B.6 is simply the triangular inequality for integrals, B.7 uses the fact that is Lipschitz, and B.8 is an obvious upper bound. ∎
B.2. Lipschitz bounds for the mean-field maps
We now proceed to the Lipschitz estimates for the maps described in section 4.3. We assume that conditions 5.1 and 5.2 hold.
To obtain the Lipschitz upper bound, we prove (B.9) by induction up until , and then work the case separately.
This argument works up until . It fails for for it is defined differently. Nevertheless, we can work through it and prove that the result holds for as well. Let be the optimal coupling between and notice that:
showing . ∎
The boundness property of item (1) also comes directly from the assumption 5.2, since the maps and are products of derivative terms
For the Lipschitz estimate we do induction in the inverse order.
B.3. Lipschitz bounds for the existence of McKean-Vlasov process
In this subsection, we isolate the Lipschitz computations which will be needed when proving the existence of the McKean-Vlasov process. In order to better present these computations, we define the following function:
The only gradient missing in previous Lemma is . It needs a different analysis for it has a different structure than the others. Although it may not necessarily be Lipschitz, we can prove the following upper bound, which turns out to be good enough for our purposes:
We start by making the following computations:
where .
Now, it suffices to show that, for some constant , it holds that:
By recalling definition (4.14), using the upper bounds from previous section and some seriously tedious calculations, one may show that:
B.4. Lipschitz bounds for DNN maps
We also present Lipschitz estimates for the maps defined in section 3. The estimates and the proofs are similar to the ones obtained for their mean-field versions.
The previous lemma also implies that is bounded and Lipschitz,
And for the Lipschitz property we estimate using induction,
Using that the average of Lipschitz functions of same order is still Lipschitz, we can see that the gradients obtained through backpropagation are also Lipschitz, and bounded.
Our last lemma for this section deals with the special norm ,
which is used in section 10. Unfortunately, the Lipschitz property for this case do not quite follows the results obtained for the norm. We need to exploit the averaging structure of our functions, as the norm itself does it. In the next lemma, we omit the dependency on the variables and . We remark that the estimates are also uniform in these entries, which can be proved using the same strategies used during this section.
Next, we observer the behavior of the gradient terms:
We first take care of the first part of the RHS:
If we combine this estimate with the one in (B.18), then we can update the constants to obtaining
Is not hard to see that if we average the terms on the RHS, as in the , then each term can be bounded by . This concludes the proof. ∎
There is a constant such that
Pick to finish lemma. ∎
Appendix C The set of special probability measures is closed
We prove here that the set defined in §7.2 is closed under weak convergence.
In what follows, we first show that the Claim implies that , and then prove the Claim.
Proof that the Claim implies . By passing to a subsequence, we may assume that uniformly over compact sets. We will use this to show that we may take in Definition 8.2 to certify that is -special.
To do this, we first argue that is -special. This is simple: inspection of Definition 8.1 reveals that the properties required of a -special function are preserved under pointwise convergence, and thus also under uniform convergence over compact sets.
We now argue that attests that is -special: that is, we will show that, if , then a.s..
Now, when , the pair is a.s. convergent, and there a.s. is some (random) with:
Using continuity of and the convergence of , we obtain:
Proof of the Claim. We will need the following form of the Ascoli-Arzèla Theorem.
This follows from the usual formulation of Ascoli-Arzèla for the case when is itself compact. We omit the details.