On the Global Convergence of Gradient Descent for Over-parameterized Models using Optimal Transport
Lenaic Chizat, Francis Bach
Introduction
While (1) is a convex problem, finding approximate minimizers is hard as the variable is infinite-dimensional. Several lines of work provide optimization methods but with strong limitations.
This approach tackles a variant of (1) where the regularization term is replaced by an upper bound on the total variation norm; the associated constraint set is the convex hull of all Diracs and negatives of Diracs at elements of , and thus adapted to conditional gradient algorithms . At each iteration, one adds a new particle by solving a linear minimization problem over the constraint set (which correspond to finding a particle ), and then updates the weights. The resulting iterates are sparse and there is a guaranteed sublinear convergence rate of the objective function to its minimum. However, the linear minimization subroutine is hard to perform in general : it is for instance NP-hard for neural networks with homogeneous activations . One thus generally resorts to space gridding (in low dimension) or to approximate steps, akin to boosting . The practical behavior is improved with nonconvex updates reminiscent of the flow studied below.
Another approach is to parameterize the unknown measure by its sequence of moments. The space of such sequences is characterized by a hierarchy of SDP-representable necessary conditions. This approach concerns a large class of generalized moment problems and can be adapted to deal with special instances of (1) . It is however restricted to which are combinations of few polynomial moments, and its complexity explodes exponentially with the dimension . For , convergence to a global minimizer is only guaranteed asymptotically, similarly to the results of the present paper.
A third approach, which exploits the differentiability of , consists in discretizing the unknown measure as a mixture of particles parameterized by their positions and weights. This corresponds to the finite-dimensional problem
which can then be solved by classical gradient descent-based algorithms. This method is simple to implement and is widely used for the task of neural network training but, a priori, we may only hope to converge to local minima since is non-convex. Our goal is to show that this method also benefits from the convex structure of (1) and enjoys an asymptotical global optimality guarantee.
There is a recent literature on global optimality results for (2) in the specific task of training neural networks. It is known that in this context, has less, or no, local minima in an over-parameterization regime and stochastic gradient descent (SGD) finds a global minimizer under restrictive assumptions ; see for an account of recent results. Our approach is not directly comparable to these works: it is more abstract and nonquantitative—we study an ideal dynamics that one can only hope to approximate—but also much more generic. Our objective, in the space of measures, has many local minima, but we build gradient flows that avoids them, relying mainly on the homogeneity properties of (see for other uses of homogeneity in non-convex optimization). The novelty is to see (2) as a discretization of (1)—a point of view also present in but not yet exploited for global optimality guarantees.
2 Organization of the paper and summary of contributions
Our goal is to explain when and why the non-convex particle gradient descent finds global minima. We do so by studying the many-particle limit of the gradient flow of . More specifically:
In Section 2, we introduce a more general class of problems and study the many-particle limit of the associated particle gradient flow. This limit is characterized as a Wasserstein gradient flow (Theorem 2.6), an object which is a by-product of optimal transport theory.
In Section 3, under assumptions on and the initialization, we prove that if this Wasserstein gradient flow converges, then the limit is a global minimizer of . Under the same conditions, it follows that if are gradient flows for suitably initialized, then
Two different settings that leverage the structure of are treated: the -homogeneous and the partially -homogeneous case. In Section 4, we apply these results to sparse deconvolution and training neural networks with a single hidden layer, with sigmoid or ReLU activation function. In each case, our result prescribes conditions on the initialization pattern.
We perform simple numerical experiments that indicate that this asymptotic regime is already at play for small values of , even for high-dimensional problems. The method behaves incomparably better than simply optimizing on the weights with a very large set of fixed particles.
Our focus on qualitative results might be surprising for an optimization paper, but we believe that this is an insightful first step given the hardness and the generality of the problem. We suggest to understand our result as a first consistency principle for practical and a commonly used non-convex optimization methods. While we focus on the idealistic setting of a continuous-time gradient flow with exact gradients, this is expected to reflect the behavior of first order descent algorithms, as they are known to approximate the former: see for (accelerated) gradient descent and [21, Thm. 2.1] for SGD.
Several independent works have studied the many-particle limit of training a neural network with a single large hidden layer and a quadratic loss . Their main focus is on quantifying the convergence of SGD or noisy SGD to the limit trajectory, which is precisely a mean-field limit in this case. Since in our approach this limit is mostly an intermediate step necessary to state our global convergence theorems, it is not studied extensively for itself. These papers thus provide a solid complement to Section 2.4 (a difference is that we do not assume that is quadratic nor that is differentiable). Also, proves a quantitive global convergence result for noisy SGD to an approximate minimizer: we stress that our results are of a different nature, as they rely on homogeneity and not on the mixing effect of noise.
Particle gradient flows and many-particle limit
(locally Lipschitz derivatives with sublinear growth) there exists a family of nested nonempty closed convex subsets of such that:
and are bounded and is Lipschitz on each , and
there exists such that for all , where stands for the maximal norm of an element in .
The functions and obtained through the lifting share the property of being positively -homogeneous in the variable . A function between vector spaces is said positively -homogeneous when for all and argument , it holds . This property is central for our global convergence results (but is not needed throughout Section 2).
2 Particle gradient flow
3 Wasserstein gradient flow
Thus is simply a field of (minus) subgradients of —it is in fact the field of minimal norm subgradients. We write this relation . The set is called the Wasserstein subdifferential of , as it can be interpreted as the subdifferential of relatively to the Wasserstein metric on (see Appendix B.2.1). We thus expect that for initializations with arbitrary probability distributions, the generalization of the gradient flow coindices with the following object.
A Wasserstein gradient flow for the functional on a time interval is an absolutely continuous path in that satisfies, distributionally on ,
This is a proper generalization of Definition 2.2 since, whenever is a particle gradient flow for , then is a Wasserstein gradient flow for in the sense of Definition 2.4 (see Proposition B.1). By leveraging the abstract theory of gradient flows developed in , we show in Appendix B.2.1 that these Wasserstein gradient flows are well-defined.
Under Assumptions 2.1, if is concentrated on a set , then there exists a unique Wasserstein gradient flow for starting from . It satisfies the continuity equation with the velocity field defined in (5) (with in place of ).
Note that the condition on the initialization is automatically satisfied in Proposition 2.3 because there the initial measure has a finite discrete support: it is thus contained in any for large enough.
4 Many-particle limit
We now characterize the many-particle limit of classical gradient flows, under Assumptions 2.1.
Given a measure , an example for the sequence is where are independent samples distributed according to . By the law of large numbers for empirical distributions, the sequence of empirical distributions converges (almost surely, for ) to . In particular, our proof of Theorem 2.6 gives an alternative proof of the existence claim in Proposition 2.5 (the latter remains necessary for the uniqueness of the limit).
Convergence to global minimizers
As can be seen from Definition 2.4, a probability measure is a stationary point of a Wasserstein gradient flow if and only if . It is proved in that these stationary points are, in some cases, optimal over probabilities that have a smaller support. However, they are not in general global minimizers of over , even when is convex. Such global minimizers are indeed characterized as follows.
Assume that is convex. A measure such that minimizes on iff and for -a.e. .
Despite these strong differences between stationarity and global optimality, we show in this section that Wasserstein gradient flows converge to global minimizers, under two main conditions:
On the structure: and must share a homogeneity direction (see Section 2.1 for the definition of homogeneity), and
On the initialization: the support of the initialization of the Wasserstein gradient flow satisfies a “separation” property. This property is preserved throughout the dynamic and, combined with homogeneity, allows to escape from neighborhoods of non-optimal points.
We turn these general ideas into concrete statements for two cases of interest, that exhibit different structures and behaviors: (i) when and are positively -homogeneous and (ii) when and are positively -homogeneous with respect to one variable.
2 The 222-homogeneous case
(smooth convex loss) The loss is convex, differentiable with differential Lipschitz on bounded sets and bounded on sublevel sets,
Taking the balls of radius as the family , these assumptions imply Assumptions 2.1. We believe that Assumption 3.2-(4) is not of practical importance: it is only used to avoid some pathological cases in the proof of Theorem 3.3. By applying Morse-Sard’s lemma , it is anyways fulfilled if the function in question is times continuously differentiable. We now state our first global convergence result. It involves a condition on the initialization, a separation property, that can only be satisfied in the many-particle limit. In an ambient space , we say that a set separates the sets and if any continuous path in with endpoints in and intersects .
A proof and stronger statements are presented in Appendix C. There, we give a criterion for Wasserstein gradient flows to escape neighborhoods of non-optimal measures—also valid in the finite-particle setting—and then show that it is always satisfied by the flow defined above. We also weaken the assumption that converges: we only need a certain projection of to converge weakly. Finally, the fact that limits in and can be interchanged is not anecdotal: it shows that the convergence is not conditioned on a relative speed of growth of both parameters.
This result might be easier to understand by drawing an informal distinction between (i) the structural assumptions which are instrumental and (ii) the technical conditions which have a limited practical interest. The initialization and the homogeneity assumptions are of the first kind. The Sard-type regularity is in contrast a purely technical condition: it is generally hard to check and known counter-examples involve artificial constructions such as the Cantor function . Similarly, when there is compactness, a gradient flow that does not converge is an unexpected (in some sense adversarial) behavior, see a counter-example in . We were however not able to exclude this possibility under interesting assumptions (see a discussion in Appendix C.5).
3 The partially 111-homogeneous case
Similar results hold in the partially -homogeneous setting, which covers the lifted problems of Section 2.1 when is bounded (e.g., sparse deconvolution and neural networks with sigmoid activation).
(smooth convex loss) The loss is convex, differentiable with differential Lipschitz on bounded sets and bounded on sublevel sets,
(boundary conditions) The function behaves nicely at the boundary of the domain: either
With the family of nested sets , , these assumptions imply Assumptions 2.1. The following theorem mirrors the statement of Theorem 3.3, but with a different condition on the initialization. The remarks after Theorem 3.3 also apply here.
Case studies and numerical illustrations
In this section, we apply the previous abstract statements to specific examples and show on synthetic experiments that the particle-complexity to reach global optimality is very favorable.
Assume that the filter impulse response is times continuously differentiable, and that the support of contains . If the projection of the Wasserstein gradient flow of weakly converges to , then is a global minimizer of
We show an example of such a reconstruction on the -torus on Figure 1, where the ground truth consists of weighted spikes, is an ideal low pass filter (a Dirichlet kernel of order ) and is a noisy observation of the filtered spikes. The particle gradient flow is integrated with the forward-backward algorithm and the particles initialized on a uniform grid on .
2 Neural networks with a single hidden layer
Assume that has finite moments up to order , that the support of is and that boundary condition 3.4-(iii)-(a) holds. If the Wasserstein gradient flow of converges in to , then is a global minimizer of .
Note that we have to explicitly assume the boundary condition 3.4-(iii)-(a) because the Sard-type regularity at infinity cannot be checked a priori (this technical detail is discussed in Appendix D.3).
The activation function is positively -homogeneous: this makes -homogeneous and corresponds, at a formal level, to the setting of Theorem 3.3. An admissible choice of regularizer here would be the (semi-convex) function . However, as shown in Appendix D.4, the differential has discontinuities: this prevents altogether from defining gradient flows, even in the finite-particle regime.
We display on Figure 2 particle gradient flows for training a neural network with a single hidden layer and ReLU activation in the classical (non-differentiable) parameterization, with (no regularization). Features are normally distributed, and the ground truth labels are generated with a similar network with neurons. The particle gradient flow is “integrated” with mini-batch SGD and the particles are initialized on a small centered sphere.
3 Empirical particle-complexity
Since our convergence results are non-quantitative, one might argue that similar—and much simpler to prove—asymptotical results hold for the method of distributing particles on the whole of and simply optimizing on the weights, which is a convex problem. Yet, the comparison of the particle-complexity shown in Figure 3 stands strongly in favor of particle gradient flows. While exponential particle-complexity is unavoidable for the convex approach, we observed on several synthetic problems that particle gradient descent only needs a slight over-parameterization to find global minimizers within optimization error (see details in Appendix D.5).
Conclusion
We have established asymptotic global optimality properties for a family of non-convex gradient flows. These results were enabled by the study of a Wasserstein gradient flow: this object simplifies the handling of many-particle regimes, analogously to a mean-field limit. The particle-complexity to reach global optimality turns out very favorable on synthetic numerical problems. This confirms the relevance of our qualitative results and calls for quantitative ones that would further exploit the properties of such particle gradient flows. Multiple layer neural networks are also an interesting avenue for future research.
We acknowledge supports from grants from Région Ile-de-France and the European Research Council (grant SEQUOIA 724063).
References
Supplementary material
Supplementary material for the paper: “On the Global Convergence of Gradient Descent for Over-parameterized Models using Optimal Transport” authored by Lénaïc Chizat and Francis Bach (NIPS 2018).
Appendix B: Many-particle limit and Wasserstein gradient flow
Appendix C: Convergence to global minimizers
Appendix D: Case studies and numerical experiments
Appendix A Introductory facts
Note that the functional of interest in this article is continuous for the Wasserstein metric. This strong regularity is rather rare in the study of Wasserstein gradient flows.
Under Assumptions 2.1, the function is continuous for the Wasserstein metric .
A.2 Lifting to the space of probability measures
A.2.1 The partially 111-homogeneous case
For all , there is such that .
If then where is any point in . Otherwise, we define the map and the probability measure , which satisfies . ∎
This operator is well defined whenever is -integrable.
It holds . For a regularizer on of the form , it holds . If the infimum defining is attained and if minimizes , then there exists that minimizes over .
The class of regularizer considered in Proposition A.3 includes the total variation norm.
Let . For , it holds with equality if, for instance, is a lift of of the form (8).
A.2.2 The 222-homogeneous case
This operator is well-defined iff has finite second order moments.
Appendix B Many-particle limit and Wasserstein gradient flow
In the specific case of gradient flows of lower bounded functions, we can derive estimates that imply that (even if is not globally semiconvex). Indeed, for all , it holds
by Jensen’s inequality. Since is lower bounded, this proves that the gradient flow has bounded length on bounded time intervals. By compactness, if was finite then would exist, thus contradicting the maximality of , hence and the gradient flow is globally defined.
B.2 Link between classical and Wasserstein gradient flows
We first give a rigorous definition of the continuity equation which appear in the definition of Wasserstein gradient flows (Definition 2.4).
As we show now, there is a precise link between classical and Wasserstein gradient flow (Definitions 2.2 and 2.4). This is a simple result but might be instructive for readers who are not familiar with the concept of distributional solutions of partial differential equations.
which precisely means that is a distributional solution to (6). ∎
Note that has the same number of atoms throughout the dynamic. In particular, if no minimizer of is an atomic measure with at most atoms, then is guaranteed to not converge to a minimizer.
B.2.1 Properties of the Wasserstein gradient flow (proof of Proposition 2.5)
Under Assumptions 2.1, suppose that . For all , is proper and continuous for on its closed domain. Moreover,
there exists such that for all admissible transport plan , considering the transport interpolation , the function is differentiable with a -Lipschitz derivative;
if and only if for -almost every , where is the convex function on that is worth on and outside.
First, it is clear that is proper because is finite whenever . It is moreover continuous (see Lemma A.1) on its closed domain .
Let us denote . Since and are Lipschitz on and respectively, is differentiable with
where we used Hölder’s inequality to obtain is the last line. On the other hand,
As a consequence, is Lipschitz with . In particular, using the notions defined in , is -geodesically semiconvex. Remark that these bounds may explode when goes to infinity: this explains why we work with measures supported on .
where we used Hölder’s inequality for the bound . As a consequence,
The previous properties are sufficient to guarantee that Wasserstein gradient flows for the functionals are well defined.
We are in position to prove the well-posedness of Wasserstein gradient flows for the original functional . Notice that, by the characterization in Lemma B.3, the Wasserstein gradient flows for the functions all coincide for on if is concentrated in for all . Our strategy is thus simply to make sure that for all time , such a exists, i.e. to make sure that the support of gradient flows does not grow too fast.
Let be such that is concentrated on . Given Lemma B.3, for all , there exists a unique, globally defined, Wasserstein gradient flow for . For all , consider the first exit time from :
Note that the definition of involves the flow but in fact, for all and , it holds by the uniqueness in Lemma B.3. Thus, if , we have existence and uniqueness of a Wasserstein gradient flow in the sense of Definition 2.4 on . It only remains to show that so that the gradient flow can be defined at all times.
Let us now add a useful representation lemma for the Wasserstein gradient flow as the pushforward of by the flow of the velocity fields.
Then is uniquely well-defined, continuous, is Lipschitz on , uniformly on compact time intervals for all , and it holds .
The claims concerning are classical and follow from the fact that satisfies a one-sided Lispchitz property on , uniformly on compact time intervals [3, Lemma 8.1.4]. The expression as a pushforward is also a general property of the continuity equation, see [3, Prop. 8.1.8]. ∎
B.3 Proof of the many-particle limit (Theorem 2.6)
While we could rely on abstract stability results for Wasserstein gradient flows [3, Thm.11.2.1 (Stability)] our proof is direct and uses basic arguments. It also gives an independent argument for the existence of Wasserstein gradient flows, distinct from the standard one : it involves a discretization in space instead of the classical discretization in time.
We first show that, at least on a small time interval , the paths are contained in for some . Let us introduce the first exit time from
In order to show that is strictly positive, it is sufficient to bound the velocity of individual particles before . Consider the Lipschitz constant of on . Given the expression of the velocity of each particle (given in Eq. (5)) and the minimum travel distance required to exit , we obtain the lower bound on the exit time .
Let us now work on the time interval and prove the existence of a limit curve in the space using standard estimates for gradient flows and compactness. Our starting point is the bound, for ,
which follows by matching each particle at to its future position at , and by Jensen’s inequality. Recalling the identity from Proposition 2.3, it follows
and thus the family of curves is equicontinuous in on , uniformly in . Moreover, for all , the family lies in a ball, as such weakly precompact (but a priori not -precompact). Since the weak topology is weaker than the topology of , by Ascoli theorem, we can extract a subsequence converging weakly to a curve continuous in the weak topology, which is concentrated in at all time. We have also uniform convergence in the Bounded Lipschitz metric, which metrizes weak convergence of probability measures. In the following we only consider this subsequence, still denoted by .
We first prove that the first term in (9) tends to . Since all are concentrated on , it is sufficient to show that the sequence of velocity fields converges uniformly on to . We have, using the fact that a projection on a convex set is -Lipschitz,
Moreover, we have for all ,
So far, we have shown the convergence, up to a subsequence, to a Wasserstein gradient flow on : it remains to show that . Since and all paths decrease monotonically the value of , everything lies in a sublevel of , where is bounded. It follows that a uniform bound on the velocity of the particles with linear growth in is available and, by Grönwall’s inequality, we obtain that , just as in the end of the proof of Proposition 2.5. The theorem follows by combining this result with the uniqueness stated in Proposition 2.5.
Appendix C Convergence to global minimizers
We give in this section a proof of Theorems 3.3 and 3.5. All results have two versions: one in the -homogeneous setting (Assumptions 3.2) and its counterpart in the partially -homogeneous setting (Assumptions 3.4). We have displayed in Figure 4 the level sets of functions with these homogeneity properties, in order to highlight the differences between these two cases. The proofs tend to be more straightforward in the -homogeneous setting and they can be read independently of the other case. This section is organized as follows:
In Section C.1, we justify the global optimality conditions.
We give in Section C.2 a criteria for Wasserstein gradient flows to escape from neighborhoods of non-optimal stationary points, and we also characterize measures that can be limits of Wasserstein gradient flows. These results are valid for arbitrary initializations.
In Section C.3, we prove that the assumption on the support of the initialization made in Theorems 3.3 and 3.5 is preserved by Wasserstein gradient flows.
All these facts combined lead to a proof of Theorems 3.3 and 3.5 in Section C.4.
It will be often the case in the statements and in the proofs that they involve the projection of a probability measure (with ) (introduced in Section A.2) instead of itself. This is motivated by two facts: (i) this projected measure it generally the object of interest in the optimization problem as it clears the redundancy caused by homogeneity and (ii) the assumptions that the projection of a Wasserstein gradient flow converges is more reasonable than the convergence in of the original gradient flow, where generally no compactness is available.
Let us first remark that, by a first order Taylor expansion of , we have that for all with , it holds and
Let be such that , consider and its Lebesgue decomposition where , is singular to (see [10, Thm. 4.3.2]). Clearly, by the above first order formula, it is necessary to have everywhere with equality -a.e., for to be a minimizer. It is also sufficient since in this case we have, by convexity,
C.2 A criteria to escape from non-optimal stationary points
We now give a criteria for Wasserstein gradient flows to escape from non-optimal stationary points. It is valid both in the finite-particle regime and in the many-particle limit. Such a result supports the idea that, even in the finite-particle case (i.e. classical gradient flows), the point of view using measures is natural.
Such a set is given by where is the -sublevel set of the restriction of to the unit sphere, for some that can be chosen arbitrarily close to .
We now give a general property of the stationary points.
C.2.2 The partially 111-homogeneous case
For the partially -homogeneous case, we consider the operator defined in Appendix A.2.
where the last bound is due to the fact that is -Lipschitz and upper bounded in norm by whenever satisfies . ∎
As for the -homogeneous case, we give a general property of the stationary points.
Under Assumptions 3.4, let be a Wasserstein gradient flow of . If converges weakly to , then vanishes -a.e.
C.3 Stability of separation properties
Here we prove the fact that the separation properties of the support used in Theorems 3.5 and 3.3 are preserved by Wasserstein gradient flows. We give a proof based on topological degree theory: this tool allows to cover the case of discontinuous velocity fields, which appear when is non-differentiable. In a more regular setting, the facts that follow are easier to prove because then, is the pushforward of by a homeomorphism. Let us give a definition of the topological degree sufficient to our setting.
If are disjoint open subsets of and then .
These properties characterize a uniquely well-defined map from the set of triplets as above to the set of signed integers [8, Thm. 1-2]. Intuitively, it gives an algebraic count of the number of solutions to for , where algebraic means that a solution counts as if preserves orientation around and otherwise.
The following lemma shows that taking the support of a measure and its pushforward by a continuous map are operations that almost commute. They commute for instance if the map is closed (i.e. maps closed sets to closed set).
Let and a neighborhood of . By continuity, is the neighborhood of a point in so , hence so . Conversely, let and let a neighborhood of that does not intersect . This neighborhood satisfies , so it holds . Hence so which implies . ∎
We first state the property and the stability result that we wish to establish in the -homogeneous setting.
Under Assumptions 3.2, let be a Wasserstein gradient flow of . If the support of satisfies Property C.9, so does the support of , for all .
Note that this property is generally lost in the limit . This lemma is a consequence of the following, more abstract proposition, that deals with sets instead of measures. The reader can keep in mind that we will apply this result with being the flow of the velocity field introduced in Lemma B.4 and being the support of .
C.3.2 The partially 111-homogeneous case
Here are the analogous separation property and stability lemma for the partially -homogeneous case.
Under Assumptions 3.4, let be a Wasserstein gradient flow of . If the support of satisfies Property C.12, then so does the support of , for all .
Similarly as above, we first prove an abstract topological result.
C.4 Main theorems: proofs and generalization
First, let us state a lemma that relates the convergence of the Wasserstein gradient flows to an asymptotic property for the classical gradient flows, when . This result is used in the last claims of Theorems 3.3 and 3.5.
Under Assumptions 2.1, let be a Wasserstein gradient flow which initialization is concentrated on a set and such that . If is a sequence of measures concentrated on a set that converges to in , then
Under the assumptions of Theorem 3.3, if converges weakly, then its limit is a global minimizer of over and .
This statement is stronger than Theorem 3.3: indeed, if converges for the Wasserstein metric, then converges weakly (but the converse is generally not true).
C.4.2 The partially 111-homogeneous case
Again, we prove a statement in terms of the projected measures: Theorem 3.5 can be deduced as an immediate corollary. Some highlights of the proof are given in Figure 5.
Under the assumptions of Theorem 3.5, if converges weakly, then its limit is a global minimizer of over and .
In the unfavorable case encountered in the proof of Theorem C.17, we had to invoke the following lemma. It has a different nature than the other results of this paper because it relies on an explicit integration of the trajectories of the gradient flow, which means that it depends on the choice of the metric.
Consider, for a measure , a point such that and for some . For any and , there exists such that if is a Wasserstein gradient flow of that satisfies for all , and denoting the solution of the flow of Lemma B.4 starting from with , it holds and .
The Lipschitz regularity of and its derivative implies that there exists such that for all . Without loss of generality, we assume that . Consider and assume that there exists such that for . Writing , it holds for ,
In particular, if we can make sure that for and if then, as on this interval, there exists such that .
It remains to make sure that we indeed have for , by adjusting if necessary the value of . Parametrizing in instead of (it is an admissible reparametrization thanks to the positive lower bound on its derivative), we get
Thus, choosing , it is guaranteed that for . ∎
C.5 Remarks
We conclude this theoretical section with two opening remarks related to the global convergence theorems.
In the statements of Theorems 3.3 and 3.5, the convergence of the Wasserstein gradient flow comes as an assumption. In order to prove convergence of gradient flows, one generally needs two properties: (i) compactness of the trajectories and (ii) a so-called Łojasiewicz inequality which, intuitively, controls how much a function flattens around its critical points. As compactness in is a very strong requirement, we have relaxed the topology where convergence is required to obtain more reasonable assumptions. Yet, even when a gradient flow lies in a compact set, there are some cases where it does not converge. There has been recent progress on related issues with the study of Łojasiewicz inequalities in Wasserstein space , but to our knowledge, no general result is known in our non-geodesically convex case.
We stress that Propositions C.1 and C.4 provide with an intuitive criterion for a particle gradient flow to escape local minimum: roughly, it is sufficient that, when it passes close to a local minimum, at least one particle belongs to a -sublevel set of the current potential . In this paper we exploit this property by studying the many-particle limit, but other approaches are worth exploring. For instance, we could estimate the size of this sublevel set in specific cases, and use it as an indication for the particle-complexity to attain global minimizers. A discussion on a specific example is given in Section D.5.
Appendix D Case studies and numerical experiments
In this section, we verify the assumptions for the examples treated in Section 4.
If is convex in the second variable, then is convex. If is differentiable in the second variable with Lipschitz, uniformly in the first variable, then is differentiable with differential Lipschitz. If moreover for some constants , then is bounded on sublevel sets.
It is direct to see that is -Lipschitz in the operator norm. Finally, if , then
D.2 Sparse deconvolution
D.3 Neural network: sigmoid activation
D.4 Neural network: ReLU activation
Although we do not use this fact explicitly in the paper, it is interesting to note that the regularizing potential is admissible in the -homogeneous setting of Assumptions 3.2: although it is not differentiable nor convex, it is positively -homogeneous and semiconvex.
The homogeneity property is clear, and to see that is semi-convex, it is sufficient to remark that
is convex, since it is the square of a norm. ∎
D.4.2 A differentiable parameterization
We now consider the alternative parameterization considered in Proposition 4.3, defined as where and is the signed square function that acts entry-wise. As is clearly positively -homogeneous so we just have to prove the differentiability of , which is done with the same technique as in Lemma D.3.
Note that the condition on the moments of is less strong for ReLU activation than for sigmoids in Lemma D.2: this comes from the fact that ReLU is piece-wise linear. Similarly as what explained in the end of Section D.3, it is difficult to verify the Sard-type regularity assumption so it is left as an assumption in Proposition 4.3.
D.5 Numerical experiments : details and additional results
Animated plots of the particle gradient flows shown in this article may be found online at https://lchizat.github.io/PGF.htmlThese videos appear at this place in the official supplementary material of the NIPS 2018 publication, but had to be removed from the present version due to software incompatibility.
Here we give more details on the numerical experiments behind Figure 3.
For the leftmost panel, the setting is similar to that of Figure 1: for each realization, spikes are randomly distributed on the -torus (with a minimum separation of ) with random weights between and and a small noise is added to the filtered signal. Then for each choice of , we initialize particles on a regular grid on and integrate the particle gradient flow with the forward-backward algorithm until the improvement per iteration is below a small tolerance threshold.
For the center panel, the setting is similar to that of Figure 2, but here in dimension . The data is normally distributed and the ground truth labels are generated by a similar neural network with neurons (with random normally distributed parameters). The objective function is the square loss without regularization, so the global minimum corresponds to a loss. We optimize using SGD with fresh samples at each iteration.
The rightmost panel shows, similarly, the particle-complexity for training a neural network with a single hidden layer and sigmoid activation function, in dimension . The data is distributed on a sphere and the ground truth labels are generated by a similar neural network with neurons with random normal weights. Again, we minimize with SGD the square loss without regularization and the global minimum corresponds to a loss.
We compare the performance with the method of simply minimizing on the weights with the same initialization. This is a convex problem, and the minimum value attained does not depend on the minimization method. We plot for each case the final excess loss as a function of for several random realizations of the experiment and, for each value of , its geometric average over all realizations. We have indicated in transparent green the area of loss values which should be interpreted as “optimal” but are not exactly because the optimization has been stopped in finite time and the loss is not known exactly but estimated through sampling.
In all previous numerical experiments dealing with the partially -homogeneous case, we have initialized the particle gradient flow on a discretization of . But Theorem 3.5 allows for a large variety of initialization patterns. In this paragraph, we comment on the various possibilities and explain how the proof of Theorem 3.5 helps understanding why the corresponding particle-complexity is impacted.
We display on Figure 6 a sparse spikes deconvolution experiment, in a similar setting than in Figure 1, but with different initializations. For this problem, where spikes are to be recovered, we have observed numerically that the particle gradient flows initialized on a uniform grid on succeed in finding a global minimizer as soon as there are more than particles. In the first panel of Figure 6, the particle gradient flow with particles initialized on fails at finding a minimizer and a larger number of particles is needed for success (as shown in the center panel, with ).