On the Global Convergence of Gradient Descent for Over-parameterized Models using Optimal Transport

Lenaic Chizat, Francis Bach

Introduction

While (1) is a convex problem, finding approximate minimizers is hard as the variable is infinite-dimensional. Several lines of work provide optimization methods but with strong limitations.

This approach tackles a variant of (1) where the regularization term is replaced by an upper bound on the total variation norm; the associated constraint set is the convex hull of all Diracs and negatives of Diracs at elements $\theta\in\Theta$, and is thus adapted to conditional gradient algorithms. At each iteration, one adds a new particle by solving a linear minimization problem over the constraint set (which corresponds to finding a particle $\theta\in\Theta$), and then updates the weights. The resulting iterates are sparse and there is a guaranteed sublinear convergence rate of the objective function to its minimum. However, the linear minimization subroutine is hard to perform in general: it is for instance NP-hard for neural networks with homogeneous activations. One thus generally resorts to space gridding (in low dimension) or to approximate steps, akin to boosting. The practical behavior is improved with nonconvex updates reminiscent of the flow studied below.

Another approach is to parameterize the unknown measure by its sequence of moments. The space of such sequences is characterized by a hierarchy of SDP-representable necessary conditions. This approach concerns a large class of generalized moment problems and can be adapted to deal with special instances of (1) . It is however restricted to $\phi$ which are combinations of few polynomial moments, and its complexity explodes exponentially with the dimension $d$ . For $d\geq 2$ , convergence to a global minimizer is only guaranteed asymptotically, similarly to the results of the present paper.

A third approach, which exploits the differentiability of $\phi$ , consists in discretizing the unknown measure $\mu$ as a mixture of $m$ particles parameterized by their positions and weights. This corresponds to the finite-dimensional problem

which can then be solved by classical gradient descent-based algorithms. This method is simple to implement and is widely used for the task of neural network training but, a priori, we may only hope to converge to local minima since $J_{m}$ is non-convex. Our goal is to show that this method also benefits from the convex structure of (1) and enjoys an asymptotical global optimality guarantee.
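To make the particle discretization concrete, the following is a minimal sketch of gradient descent on a finite-particle objective. It is an illustration under stated assumptions, not the setup used in the experiments: the feature map is a ReLU unit in dimension $2$, the loss is quadratic, there is no regularizer, the data, the two-neuron teacher and the initialization are hypothetical, and full-batch gradient descent with the gradient rescaled by $m$ stands in for the gradient flow. All function names are illustrative.

```python
import math


def relu(s):
    return s if s > 0.0 else 0.0


def predict(particles, x):
    """Model output (1/m) * sum_i w_i * relu(<theta_i, x>) for particles (w, t0, t1)."""
    m = len(particles)
    return sum(w * relu(t0 * x[0] + t1 * x[1]) for (w, t0, t1) in particles) / m


def loss(particles, data):
    """Quadratic objective (1/2n) * sum_k (prediction_k - y_k)^2, without regularizer."""
    return sum((predict(particles, x) - y) ** 2 for x, y in data) / (2 * len(data))


def gradient_step(particles, data, lr):
    """One explicit Euler step of u_i' = -m * grad_{u_i} J_m, applied to every particle."""
    n = len(data)
    residuals = [predict(particles, x) - y for x, y in data]
    updated = []
    for (w, t0, t1) in particles:
        gw = g0 = g1 = 0.0
        for (x, _), r in zip(data, residuals):
            pre = t0 * x[0] + t1 * x[1]
            if pre > 0.0:  # the ReLU unit is active on this input
                gw += r * pre
                g0 += r * w * x[0]
                g1 += r * w * x[1]
        updated.append((w - lr * gw / n, t0 - lr * g0 / n, t1 - lr * g1 / n))
    return updated


def make_data(n=40):
    """Hypothetical data: inputs on the unit circle, labels from a 2-neuron teacher."""
    teacher = [(1.0, 1.0, 0.0), (-1.0, -0.5, 0.8)]
    xs = [(math.cos(2 * math.pi * k / n), math.sin(2 * math.pi * k / n)) for k in range(n)]
    return [(x, predict(teacher, x)) for x in xs]


def init_particles(m=10, radius=0.5):
    """Hypothetical initialization: directions spread on a circle, alternating signs."""
    return [((-1.0) ** i * radius,
             radius * math.cos(2 * math.pi * i / m),
             radius * math.sin(2 * math.pi * i / m)) for i in range(m)]


def train(m=10, steps=300, lr=0.1):
    """Run gradient descent and return the objective value after each step."""
    data = make_data()
    particles = init_particles(m)
    history = [loss(particles, data)]
    for _ in range(steps):
        particles = gradient_step(particles, data, lr)
        history.append(loss(particles, data))
    return history
```

Calling `train()` returns the sequence of objective values. The rescaling by $m$ in `gradient_step` is a design choice that keeps the speed of each particle independent of the number of particles.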

There is a recent literature on global optimality results for (2) in the specific task of training neural networks. It is known that in this context, $J_{m}$ has fewer, or no, local minima in an over-parameterization regime and stochastic gradient descent (SGD) finds a global minimizer under restrictive assumptions ; see for an account of recent results. Our approach is not directly comparable to these works: it is more abstract and nonquantitative (we study an ideal dynamics that one can only hope to approximate) but also much more generic. Our objective, in the space of measures, has many local minima, but we build gradient flows that avoid them, relying mainly on the homogeneity properties of $J_{m}$ (see for other uses of homogeneity in non-convex optimization). The novelty is to see (2) as a discretization of (1), a point of view also present in but not yet exploited for global optimality guarantees.

2 Organization of the paper and summary of contributions

Our goal is to explain when and why the non-convex particle gradient descent finds global minima. We do so by studying the many-particle limit $m\to\infty$ of the gradient flow of $J_{m}$ . More specifically:

In Section 2, we introduce a more general class of problems and study the many-particle limit of the associated particle gradient flow. This limit is characterized as a Wasserstein gradient flow (Theorem 2.6), an object which is a by-product of optimal transport theory.

In Section 3, under assumptions on $\phi$ and the initialization, we prove that if this Wasserstein gradient flow converges, then the limit is a global minimizer of $J$ . Under the same conditions, it follows that if $(\bm{w}^{(m)}(t),\bm{\theta}^{(m)}(t))_{t\geq 0}$ are gradient flows for $J_{m}$ suitably initialized, then

Two different settings that leverage the structure of $\phi$ are treated: the $2$ -homogeneous and the partially $1$ -homogeneous case. In Section 4, we apply these results to sparse deconvolution and training neural networks with a single hidden layer, with sigmoid or ReLU activation function. In each case, our result prescribes conditions on the initialization pattern.

We perform simple numerical experiments that indicate that this asymptotic regime is already at play for small values of $m$ , even for high-dimensional problems. The method behaves incomparably better than simply optimizing on the weights with a very large set of fixed particles.

Our focus on qualitative results might be surprising for an optimization paper, but we believe that this is an insightful first step given the hardness and the generality of the problem. We suggest understanding our result as a first consistency principle for practical and commonly used non-convex optimization methods. While we focus on the idealized setting of a continuous-time gradient flow with exact gradients, this is expected to reflect the behavior of first order descent algorithms, as they are known to approximate the former: see for (accelerated) gradient descent and [21, Thm. 2.1] for SGD.

Several independent works have studied the many-particle limit of training a neural network with a single large hidden layer and a quadratic loss $R$ . Their main focus is on quantifying the convergence of SGD or noisy SGD to the limit trajectory, which is precisely a mean-field limit in this case. Since in our approach this limit is mostly an intermediate step necessary to state our global convergence theorems, it is not studied extensively for itself. These papers thus provide a solid complement to Section 2.4 (a difference is that we do not assume that $R$ is quadratic nor that $V$ is differentiable). Also, proves a quantitative global convergence result for noisy SGD to an approximate minimizer: we stress that our results are of a different nature, as they rely on homogeneity and not on the mixing effect of noise.

Particle gradient flows and many-particle limit

(locally Lipschitz derivatives with sublinear growth) there exists a family $(Q_{r})_{r>0}$ of nested nonempty closed convex subsets of $\Omega$ such that:

$\Phi$ and $V$ are bounded and $d\Phi$ is Lipschitz on each $Q_{r}$ , and

there exists $C_{1},C_{2}>0$ such that $\sup_{u\in Q_{r}}(\|d\Phi_{u}\|+\|\partial V(u)\|)\leq C_{1}+C_{2}r$ for all $r>0$ , where $\|\partial V(u)\|$ stands for the maximal norm of an element in $\partial V(u)$ .

The functions $\Phi$ and $V$ obtained through the lifting share the property of being positively $1$ -homogeneous in the variable $w$ . A function $f$ between vector spaces is said positively $p$ -homogeneous when for all $\lambda>0$ and argument $x$ , it holds $f(\lambda x)=\lambda^{p}f(x)$ . This property is central for our global convergence results (but is not needed throughout Section 2).
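As a small numerical illustration of this definition, the sketch below checks positive homogeneity for a lifted ReLU feature $(w,\theta)\mapsto w\max\{0,\langle\theta,x\rangle\}$ evaluated at one fixed input $x$. The helper names, the input $x$ and the test point are hypothetical choices made only for this example.

```python
def relu(s):
    return max(0.0, s)


def lifted_relu_feature(u, x):
    """Evaluate (w, theta) -> w * relu(<theta, x>) at the flat point u = (w, theta_1, ...)."""
    w, theta = u[0], u[1:]
    return w * relu(sum(t * xi for t, xi in zip(theta, x)))


def homogeneity_gap(f, u, lam, p):
    """Return |f(lam * u) - lam**p * f(u)|, which vanishes if f is positively p-homogeneous."""
    scaled = tuple(lam * c for c in u)
    return abs(f(scaled) - lam ** p * f(u))


x = (2.0, 0.5)                      # hypothetical fixed input
u = (0.5, 1.0, -2.0)                # hypothetical particle (w, theta_1, theta_2)
feature = lambda point: lifted_relu_feature(point, x)

gap_p2 = homogeneity_gap(feature, u, lam=3.0, p=2)   # jointly 2-homogeneous
gap_p1 = homogeneity_gap(feature, u, lam=3.0, p=1)   # not 1-homogeneous in (w, theta)
```

Scaling both $w$ and $\theta$ by $\lambda$ multiplies the feature by $\lambda^{2}$, so the gap is zero for $p=2$ and nonzero for $p=1$ at this point.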

2 Particle gradient flow

3 Wasserstein gradient flow

Thus $v_{t}$ is simply a field of (minus) subgradients of $F^{\prime}(\mu_{m,t})$ ; it is in fact the field of minimal norm subgradients. We write this relation $v_{t}\in-\partial F^{\prime}(\mu_{m,t})$ . The set $\partial F^{\prime}$ is called the Wasserstein subdifferential of $F$ , as it can be interpreted as the subdifferential of $F$ relatively to the Wasserstein metric on $\mathcal{P}_{2}(\Omega)$ (see Appendix B.2.1). We thus expect that for initializations with arbitrary probability distributions, the generalization of the gradient flow coincides with the following object.

A Wasserstein gradient flow for the functional $F$ on a time interval ${[0,T[}$ is an absolutely continuous path $(\mu_{t})_{t\in{[0,T[}}$ in $\mathcal{P}_{2}(\Omega)$ that satisfies, distributionally on ${[0,T[}\times\Omega^{d}$ ,

This is a proper generalization of Definition 2.2 since, whenever $(\mathbf{u}(t))_{t\geq 0}$ is a particle gradient flow for $F_{m}$ , then $t\mapsto\mu_{m,t}:=\frac{1}{m}\sum_{i=1}^{m}\delta_{\mathbf{u}_{i}{(t)}}$ is a Wasserstein gradient flow for $F$ in the sense of Definition 2.4 (see Proposition B.1). By leveraging the abstract theory of gradient flows developed in , we show in Appendix B.2.1 that these Wasserstein gradient flows are well-defined.

Under Assumptions 2.1, if $\mu_{0}\in\mathcal{P}_{2}(\Omega)$ is concentrated on a set $Q_{r_{0}}\subset\Omega$ , then there exists a unique Wasserstein gradient flow $(\mu_{t})_{t\geq 0}$ for $F$ starting from $\mu_{0}$ . It satisfies the continuity equation with the velocity field defined in (5) (with $\mu_{t}$ in place of $\mu_{m,t}$ ).

Note that the condition on the initialization is automatically satisfied in Proposition 2.3 because there the initial measure has a finite discrete support: it is thus contained in any $Q_{r}$ for $r>0$ large enough.

4 Many-particle limit

We now characterize the many-particle limit of classical gradient flows, under Assumptions 2.1.

Given a measure $\mu_{0}\in\mathcal{P}_{2}(Q_{r_{0}})$ , an example for the sequence $\mathbf{u}_{m}(0)$ is $\mathbf{u}_{m}(0)=(u_{1},\dots,u_{m})$ where $u_{1},u_{2},\dots,u_{m}$ are independent samples distributed according to $\mu_{0}$ . By the law of large numbers for empirical distributions, the sequence of empirical distributions $\mu_{m,0}=\frac{1}{m}\sum_{i=1}^{m}\delta_{u_{i}}$ converges (almost surely, for $W_{2}$ ) to $\mu_{0}$ . In particular, our proof of Theorem 2.6 gives an alternative proof of the existence claim in Proposition 2.5 (the latter remains necessary for the uniqueness of the limit).
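The convergence of empirical distributions in $W_{2}$ can be observed numerically in a simple case. The sketch below assumes, purely for illustration, that $\mu_{0}$ is the uniform distribution on $[0,1]$; in one dimension the optimal coupling is the monotone one, so $W_{2}^{2}$ equals the integral of the squared difference of quantile functions, which is computed here in closed form. The function names are hypothetical.

```python
import math
import random


def w2_to_uniform(samples):
    """Exact W2 distance between the empirical measure of `samples` and Uniform[0, 1].

    The i-th smallest sample carries the quantile range [i/m, (i+1)/m], and
    the integral of (x - q)^2 over that range has a closed form.
    """
    xs = sorted(samples)
    m = len(xs)
    total = 0.0
    for i, x in enumerate(xs):
        a, b = i / m, (i + 1) / m
        total += ((b - x) ** 3 - (a - x) ** 3) / 3.0
    return math.sqrt(total)


def empirical_w2(m, seed):
    """W2 distance to Uniform[0, 1] of m independent uniform samples (seeded)."""
    rng = random.Random(seed)
    return w2_to_uniform([rng.random() for _ in range(m)])
```

For instance, comparing `empirical_w2(20, 0)` with `empirical_w2(20000, 0)` shows the distance shrinking as $m$ grows. For reference, a deterministic midpoint grid with $m$ points achieves the distance $1/(m\sqrt{12})$.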

Convergence to global minimizers

As can be seen from Definition 2.4, a probability measure $\mu\in\mathcal{P}_{2}(\Omega)$ is a stationary point of a Wasserstein gradient flow if and only if $0\in\partial F^{\prime}(\mu)(u)\mbox{ for }\mu\mbox{-a.e. }u\in\Omega$ . It is proved in that these stationary points are, in some cases, optimal over probabilities that have a smaller support. However, they are not in general global minimizers of $F$ over $\mathcal{M}_{+}(\Omega)$ , even when $R$ is convex. Such global minimizers are indeed characterized as follows.

Assume that $R$ is convex. A measure $\mu\in\mathcal{M}_{+}(\Omega)$ such that $F(\mu)<\infty$ minimizes $F$ on $\mathcal{M}_{+}(\Omega)$ iff $F^{\prime}(\mu)\geq 0$ and $F^{\prime}(\mu)(u)=0$ for $\mu$ -a.e. $u\in\Omega$ .

Despite these strong differences between stationarity and global optimality, we show in this section that Wasserstein gradient flows converge to global minimizers, under two main conditions:

On the structure: $\Phi$ and $V$ must share a homogeneity direction (see Section 2.1 for the definition of homogeneity), and

On the initialization: the support of the initialization of the Wasserstein gradient flow satisfies a “separation” property. This property is preserved throughout the dynamic and, combined with homogeneity, allows the flow to escape from neighborhoods of non-optimal points.

We turn these general ideas into concrete statements for two cases of interest, that exhibit different structures and behaviors: (i) when $\Phi$ and $V$ are positively $2$ -homogeneous and (ii) when $\Phi$ and $V$ are positively $1$ -homogeneous with respect to one variable.

2 The $2$-homogeneous case

(smooth convex loss) The loss $R$ is convex, differentiable with differential $dR$ Lipschitz on bounded sets and bounded on sublevel sets,

Taking the balls of radius $r>0$ as the family $(Q_{r})_{r>0}$ , these assumptions imply Assumptions 2.1. We believe that Assumption 3.2-(4) is not of practical importance: it is only used to avoid some pathological cases in the proof of Theorem 3.3. By applying Morse-Sard’s lemma, it is anyway fulfilled if the function in question is $d-1$ times continuously differentiable. We now state our first global convergence result. It involves a condition on the initialization, a separation property, that can only be satisfied in the many-particle limit. In an ambient space $\Omega$ , we say that a set $C$ separates the sets $A$ and $B$ if any continuous path in $\Omega$ with endpoints in $A$ and $B$ intersects $C$ .

A proof and stronger statements are presented in Appendix C. There, we give a criterion for Wasserstein gradient flows to escape neighborhoods of non-optimal measures—also valid in the finite-particle setting—and then show that it is always satisfied by the flow defined above. We also weaken the assumption that $\mu_{t}$ converges: we only need a certain projection of $\mu_{t}$ to converge weakly. Finally, the fact that limits in $m$ and $t$ can be interchanged is not anecdotal: it shows that the convergence is not conditioned on a relative speed of growth of both parameters.

This result might be easier to understand by drawing an informal distinction between (i) the structural assumptions which are instrumental and (ii) the technical conditions which have a limited practical interest. The initialization and the homogeneity assumptions are of the first kind. The Sard-type regularity is in contrast a purely technical condition: it is generally hard to check and known counter-examples involve artificial constructions such as the Cantor function . Similarly, when there is compactness, a gradient flow that does not converge is an unexpected (in some sense adversarial) behavior, see a counter-example in . We were however not able to exclude this possibility under interesting assumptions (see a discussion in Appendix C.5).

3 The partially $1$-homogeneous case

Similar results hold in the partially $1$ -homogeneous setting, which covers the lifted problems of Section 2.1 when $\phi$ is bounded (e.g., sparse deconvolution and neural networks with sigmoid activation).

(smooth convex loss) The loss $R$ is convex, differentiable with differential $dR$ Lipschitz on bounded sets and bounded on sublevel sets,

(boundary conditions) The function $\phi$ behaves nicely at the boundary of the domain: either

With the family of nested sets $Q_{r}:=[-r,r]\times\Theta$ , $r>0$ , these assumptions imply Assumptions 2.1. The following theorem mirrors the statement of Theorem 3.3, but with a different condition on the initialization. The remarks after Theorem 3.3 also apply here.

Case studies and numerical illustrations

In this section, we apply the previous abstract statements to specific examples and show on synthetic experiments that the particle-complexity to reach global optimality is very favorable.

Assume that the filter impulse response $\psi$ is $\min\{2,d\}$ times continuously differentiable, and that the support of $\mu_{0}$ contains $\{0\}\times\Theta$ . If the projection $(h^{1}(\mu_{t}))_{t}$ of the Wasserstein gradient flow of $F$ weakly converges to $\nu\in\mathcal{M}(\Theta)$ , then $\nu$ is a global minimizer of

We show an example of such a reconstruction on the $1$ -torus in Figure 1, where the ground truth consists of $m_{0}=5$ weighted spikes, $\psi$ is an ideal low pass filter (a Dirichlet kernel of order $7$ ) and $y$ is a noisy observation of the filtered spikes. The particle gradient flow is integrated with the forward-backward algorithm and the particles are initialized on a uniform grid on $\{0\}\times\Theta$ .
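For readers unfamiliar with the filter mentioned here, the sketch below evaluates a Dirichlet kernel of order $n$ in its standard unnormalized, $2\pi$-periodic form $D_{n}(x)=1+2\sum_{k=1}^{n}\cos(kx)$. The normalization and the period used in the experiment are not specified in this section, so both are assumptions of the example, as are the function names.

```python
import math


def dirichlet_kernel(x, order):
    """D_n(x) = 1 + 2 * sum_{k=1..n} cos(k x): keeps frequencies |k| <= n, drops the rest."""
    return 1.0 + 2.0 * sum(math.cos(k * x) for k in range(1, order + 1))


def dirichlet_closed_form(x, order):
    """Equivalent closed form sin((n + 1/2) x) / sin(x / 2), valid when sin(x / 2) != 0."""
    return math.sin((order + 0.5) * x) / math.sin(0.5 * x)


peak = dirichlet_kernel(0.0, 7)        # the kernel peaks at 2 * 7 + 1
value_sum = dirichlet_kernel(0.3, 7)
value_closed = dirichlet_closed_form(0.3, 7)
```

The two expressions agree away from multiples of $2\pi$, which gives a quick consistency check on either implementation.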

2 Neural networks with a single hidden layer

Assume that $\rho_{x}$ has finite moments up to order $\min\{4,2d-2\}$ , that the support of $\mu_{0}$ is $\{0\}\times\Theta$ and that boundary condition 3.4-(iii)-(a) holds. If the Wasserstein gradient flow of $F$ converges in $W_{2}$ to $\mu_{\infty}$ , then $\mu_{\infty}$ is a global minimizer of $F$ .

Note that we have to explicitly assume the boundary condition 3.4-(iii)-(a) because the Sard-type regularity at infinity cannot be checked a priori (this technical detail is discussed in Appendix D.3).

The activation function $\sigma(s)=\max\{0,s\}$ is positively $1$ -homogeneous: this makes $\Phi$ $2$ -homogeneous and corresponds, at a formal level, to the setting of Theorem 3.3. An admissible choice of regularizer here would be the (semi-convex) function $V(w,\theta)=|w|\cdot|\theta|$ . However, as shown in Appendix D.4, the differential $d\Phi$ has discontinuities: this altogether prevents us from defining gradient flows, even in the finite-particle regime.

We display on Figure 2 particle gradient flows for training a neural network with a single hidden layer and ReLU activation in the classical (non-differentiable) parameterization, with $d=2$ (no regularization). Features are normally distributed, and the ground truth labels are generated with a similar network with $m_{0}=4$ neurons. The particle gradient flow is “integrated” with mini-batch SGD and the particles are initialized on a small centered sphere.

3 Empirical particle-complexity

Since our convergence results are non-quantitative, one might argue that similar—and much simpler to prove—asymptotical results hold for the method of distributing particles on the whole of $\Theta$ and simply optimizing on the weights, which is a convex problem. Yet, the comparison of the particle-complexity shown in Figure 3 stands strongly in favor of particle gradient flows. While exponential particle-complexity is unavoidable for the convex approach, we observed on several synthetic problems that particle gradient descent only needs a slight over-parameterization $m>m_{0}$ to find global minimizers within optimization error (see details in Appendix D.5).

Conclusion

We have established asymptotic global optimality properties for a family of non-convex gradient flows. These results were enabled by the study of a Wasserstein gradient flow: this object simplifies the handling of many-particle regimes, analogously to a mean-field limit. The particle-complexity to reach global optimality turns out to be very favorable on synthetic numerical problems. This confirms the relevance of our qualitative results and calls for quantitative ones that would further exploit the properties of such particle gradient flows. Multiple layer neural networks are also an interesting avenue for future research.

We acknowledge support from grants from Région Ile-de-France and the European Research Council (grant SEQUOIA 724063).

References

Supplementary material

Supplementary material for the paper: “On the Global Convergence of Gradient Descent for Over-parameterized Models using Optimal Transport” authored by Lénaïc Chizat and Francis Bach (NIPS 2018).

Appendix B: Many-particle limit and Wasserstein gradient flow

Appendix C: Convergence to global minimizers

Appendix D: Case studies and numerical experiments

Appendix A Introductory facts

Note that the functional of interest in this article is continuous for the Wasserstein metric. This strong regularity is rather rare in the study of Wasserstein gradient flows.

Under Assumptions 2.1, the function $F$ is continuous for the Wasserstein metric $W_{2}$ .

A.2 Lifting to the space of probability measures

A.2.1 The partially $1$-homogeneous case

For all $\mu\in\mathcal{M}_{+}(\Omega)$ , there is $\nu\in\mathcal{P}(\Omega)$ such that $F(\mu)=F(\nu)$ .

If $|\mu|(\Omega)=0$ then $F(\mu)=0=F(\delta_{(0,\theta_{0})})$ where $\theta_{0}$ is any point in $\Theta$ . Otherwise, we define the map $T:(w,\theta)\mapsto(|\mu|(\Omega)\cdot w,\theta)$ and the probability measure $\nu:=T_{\#}(\mu/|\mu|(\Omega))\in\mathcal{P}(\Omega)$ , which satisfies $F(\nu)=F(\mu)$ . ∎

This operator is well defined whenever $(w,\theta)\mapsto w$ is $\mu$ -integrable.

It holds $\mathcal{M}(\Theta)\subset h^{1}(\mathcal{P}(\Omega))=h^{1}(\mathcal{M}_{+}(\Omega))$ . For a regularizer $G$ on $\mathcal{M}(\Theta)$ of the form $G(\mu)=\inf_{\nu\in h^{-1}(\mu)}\int_{\Omega}Vd\nu$ , it holds $\inf_{\nu\in\mathcal{M}(\Theta)}J(\nu)=\inf_{\mu\in\mathcal{M}_{+}(\Omega)}F(\mu)$ . If the infimum defining $G$ is attained and if $\nu\in\mathcal{M}(\Theta)$ minimizes $J$ , then there exists $\mu\in h^{-1}(\nu)$ that minimizes $F$ over $\mathcal{M}_{+}(\Omega)$ .

The class of regularizers considered in Proposition A.3 includes the total variation norm.

Let $V(w,\theta)=|w|$ . For $\mu\in\mathcal{M}(\Theta)$ , it holds $\int Vd\mu\geq|h^{1}(\mu)|(\Theta)$ with equality if, for instance, $\mu$ is a lift of $h^{1}(\mu)$ of the form (8).

A.2.2 The $2$-homogeneous case

This operator is well-defined iff $\mu$ has finite second order moments.

Appendix B Many-particle limit and Wasserstein gradient flow

In the specific case of gradient flows of lower bounded functions, we can derive estimates that imply that $T=\infty$ (even if $F_{m}$ is not globally semiconvex). Indeed, for all $t>0$ , it holds

by Jensen’s inequality. Since $F_{m}$ is lower bounded, this proves that the gradient flow has bounded length on bounded time intervals. By compactness, if $T$ was finite then $\mathbf{u}(T)$ would exist, thus contradicting the maximality of $T$ , hence $T=\infty$ and the gradient flow is globally defined.

B.2 Link between classical and Wasserstein gradient flows

We first give a rigorous definition of the continuity equation which appears in the definition of Wasserstein gradient flows (Definition 2.4).

As we show now, there is a precise link between classical and Wasserstein gradient flows (Definitions 2.2 and 2.4). This is a simple result but might be instructive for readers who are not familiar with the concept of distributional solutions of partial differential equations.

which precisely means that $(\mu_{m,t})_{t\geq 0}$ is a distributional solution to (6). ∎
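The computation behind this statement can be sketched as follows. The sketch assumes, consistently with the discussion around (5) but not restated in this section, that each particle moves with velocity $\mathbf{u}_{i}^{\prime}(t)=v_{t}(\mathbf{u}_{i}(t))$ for almost every $t$. The symbol $\varphi$ denotes a smooth test function with compact support in ${]0,\infty[}\times\Omega$ and is introduced only for this illustration.

```latex
\begin{align*}
\frac{d}{dt}\int_{\Omega}\varphi_{t}\,d\mu_{m,t}
  &= \frac{d}{dt}\,\frac{1}{m}\sum_{i=1}^{m}\varphi_{t}(\mathbf{u}_{i}(t)) \\
  &= \frac{1}{m}\sum_{i=1}^{m}\Big(\partial_{t}\varphi_{t}(\mathbf{u}_{i}(t))
     +\nabla\varphi_{t}(\mathbf{u}_{i}(t))\cdot\mathbf{u}_{i}^{\prime}(t)\Big) \\
  &= \int_{\Omega}\big(\partial_{t}\varphi_{t}+\nabla\varphi_{t}\cdot v_{t}\big)\,d\mu_{m,t},
\end{align*}
% Integrating in time, the left-hand side vanishes since \varphi has compact support:
\begin{equation*}
\int_{0}^{\infty}\!\int_{\Omega}\big(\partial_{t}\varphi_{t}+\nabla\varphi_{t}\cdot v_{t}\big)\,d\mu_{m,t}\,dt = 0 .
\end{equation*}
```

The last identity, required for all such $\varphi$, is the usual weak formulation of $\partial_{t}\mu_{t}=-\mathrm{div}(v_{t}\mu_{t})$.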

Note that $(\mu_{m,t})_{t}$ has the same number of atoms throughout the dynamic. In particular, if no minimizer of $F$ is an atomic measure with at most $m$ atoms, then $(\mu_{m,t})_{t}$ is guaranteed to not converge to a minimizer.

B.2.1 Properties of the Wasserstein gradient flow (proof of Proposition 2.5)

Under Assumptions 2.1, suppose that $V=0$ . For all $r>0$ , $F^{(r)}$ is proper and continuous for $W_{2}$ on its closed domain. Moreover,

there exists $\lambda_{r}>0$ such that for every admissible transport plan $\gamma$ , considering the transport interpolation $\mu^{\gamma}_{t}\coloneqq((1-t)\pi^{1}+t\pi^{2})_{\#}\gamma$ , the function $t\mapsto F(\mu^{\gamma}_{t})$ is differentiable with a $\lambda_{r}C_{2}^{2}(\gamma)$ -Lipschitz derivative;

if and only if $v(u)\in\partial(F^{\prime}(\mu)+\iota_{Q_{r}})(u)$ for $\mu$ -almost every $u\in\Omega$ , where $\iota_{Q_{r}}$ is the convex function on $\Omega$ that is worth $0$ on $Q_{r}$ and $\infty$ outside.

First, it is clear that $F^{(r)}$ is proper because $F^{(r)}(\delta_{u_{0}})=R(\Phi(u_{0}))$ is finite whenever $u_{0}\in Q_{r}$ . It is moreover continuous (see Lemma A.1) on its closed domain $\{\mu\in\mathcal{P}_{2}(\Omega)\;;\;\mu(Q_{r})=1\}$ .

Let us denote $h(t):=F^{(r)}(\mu^{\gamma}_{t})$ . Since $dR$ and $d\Phi$ are Lipschitz on $\mathcal{F}_{r}$ and $Q_{r}$ respectively, $h(t)$ is differentiable with

where we used Hölder’s inequality to obtain $C_{1}^{2}(\gamma)\leq C_{2}^{2}(\gamma)$ in the last line. On the other hand,

As a consequence, $h^{\prime}$ is $\lambda_{r}\cdot C_{2}^{2}(\gamma)$ Lipschitz with $\lambda_{r}=L_{dR}\|d\Phi\|_{\infty,r}^{2}+L_{d\Phi}\|dR\|_{\infty,r}$ . In particular, using the notions defined in , $F^{(r)}$ is $(-\lambda_{r})$ -geodesically semiconvex. Remark that these bounds may explode when $r$ goes to infinity: this explains why we work with measures supported on $Q_{r}$ .

where we used Hölder’s inequality for the bound $C_{1}^{2}(\gamma)\leq C_{2}^{2}(\gamma)$ . As a consequence,

The previous properties are sufficient to guarantee that Wasserstein gradient flows for the functionals $F^{(r)}$ are well defined.

We are in position to prove the well-posedness of Wasserstein gradient flows for the original functional $F$ . Notice that, by the characterization in Lemma B.3, the Wasserstein gradient flows for the functions $F^{(r)}$ all coincide for $r>r_{0}>0$ on $[0,T]$ if $\mu^{(2r_{0})}_{t}$ is concentrated in $Q_{r_{0}}$ for all $t\in[0,T]$ . Our strategy is thus simply to make sure that for all time $T$ , such a $r_{0}>0$ exists, i.e. to make sure that the support of gradient flows does not grow too fast.

Let $r_{0}$ be such that $\mu_{0}$ is concentrated on $Q_{r_{0}}$ . Given Lemma B.3, for all $r>r_{0}$ , there exists a unique, globally defined, Wasserstein gradient flow $(\mu^{(r)}_{t})_{t\geq 0}$ for $F^{(r)}$ . For all $r>r_{0}$ , consider the first exit time from $Q_{r}$ :

Note that the definition of $t_{r}$ involves the flow $(\mu_{t}^{(2r)})_{t}$ but in fact, for all $\bar{r}>r$ and $0\leq t\leq t_{r}$ , it holds $\mu_{t}^{(2r)}=\mu_{t}^{(\bar{r})}$ by the uniqueness in Lemma B.3. Thus, if $t_{r}>0$ , we have existence and uniqueness of a Wasserstein gradient flow in the sense of Definition 2.4 on $[0,t_{r}]$ . It only remains to show that $\lim_{r\to\infty}t_{r}=\infty$ so that the gradient flow can be defined at all times.

Let us now add a useful representation lemma for the Wasserstein gradient flow as the pushforward of $\mu_{0}$ by the flow of the velocity fields.

Then $X$ is uniquely well-defined, continuous, $X(t,\cdot)$ is Lipschitz on $Q_{r}$ , uniformly on compact time intervals for all $r>0$ , and it holds $\mu_{t}=(X_{t})_{\#}\mu_{0}$ .

The claims concerning $X$ are classical and follow from the fact that $v_{t}$ satisfies a one-sided Lipschitz property on $Q_{r}$ , uniformly on compact time intervals [3, Lemma 8.1.4]. The expression as a pushforward is also a general property of the continuity equation, see [3, Prop. 8.1.8]. ∎

B.3 Proof of the many-particle limit (Theorem 2.6)

While we could rely on abstract stability results for Wasserstein gradient flows [3, Thm. 11.2.1 (Stability)], our proof is direct and uses basic arguments. It also gives an independent argument for the existence of Wasserstein gradient flows, distinct from the standard one: it involves a discretization in space instead of the classical discretization in time.

We first show that, at least on a small time interval $[0,t_{r}]$ , the paths are contained in $Q_{r}$ for some $r>r_{0}$ . Let us introduce $t_{r}$ the first exit time from $Q_{r}$

In order to show that $t_{r}$ is strictly positive, it is sufficient to bound the velocity of individual particles before $t_{r}$ . Consider $L_{V,r}$ the Lipschitz constant of $V$ on $Q_{r}$ . Given the expression of the velocity of each particle (given in Eq. (5)) and the minimum travel distance $r-r_{0}$ required to exit $Q_{r}$ , we obtain the lower bound on the exit time $t_{r}\geq(r-r_{0})/(\|d\Phi\|_{\infty,r}\|dR\|_{\infty,r}+L_{V,r})>0$ .

Let us now work on the time interval $[0,t_{r}]$ and prove the existence of a limit curve $t\mapsto\mu_{t}$ in the space $\mathcal{P}_{2}(\Theta)$ using standard estimates for gradient flows and compactness. Our starting point is the bound, for $0\leq t_{1}<t_{2}\leq t_{r}$ ,

which follows by matching each particle at $t_{1}$ to its future position at $t_{2}$ , and by Jensen’s inequality. Recalling the identity $\frac{1}{m}\sum_{i=1}^{m}|\mathbf{u}_{m,i}^{\prime}(t)|^{2}=-\frac{d}{dt}F(\mu_{m,t})$ from Proposition 2.3, it follows

and thus the family of curves $(t\mapsto\mu_{m,t})_{m}$ is equicontinuous in $W_{2}$ on $[0,t_{r}]$ , uniformly in $m$ . Moreover, for all $t\in[0,t_{r}]$ , the family $(\mu_{m,t})_{m}$ lies in a $W_{2}$ ball, and is as such weakly precompact (but a priori not $W_{2}$ -precompact). Since the weak topology is weaker than the topology of $W_{2}$ , by the Ascoli theorem, we can extract a subsequence converging weakly to a curve $(\mu_{t})_{t\geq 0}$ continuous in the weak topology, which is concentrated in $Q_{r}$ at all times. We also have uniform convergence in the Bounded Lipschitz metric, which metrizes weak convergence of probability measures. In the following we only consider this subsequence, still denoted by $(\mu_{m})_{m}$ .

We first prove that the first term in (9) tends to $0$ . Since all $(\mu_{m,t})_{m,t}$ are concentrated on $Q_{r}$ , it is sufficient to show that the sequence of velocity fields $(t,u)\mapsto v_{m,t}(u)$ converges uniformly on $[0,t_{r}]\times Q_{r}$ to $(t,u)\mapsto v_{t}(u)$ . We have, using the fact that a projection on a convex set is $1$ -Lipschitz,

Moreover, we have for all $t\in{[0,t_{r}]}$ ,

So far, we have shown the convergence, up to a subsequence, to a Wasserstein gradient flow on $[0,t_{r}]$ : it remains to show that $\lim_{r\to\infty}t_{r}=\infty$ . Since $F(\mu_{m,0})\to F(\mu_{0})$ and all paths $(\mu_{m,t})_{t}$ decrease monotonically the value of $F$ , everything lies in a sublevel set of $R$ , where $dR$ is bounded. It follows that a uniform bound on the velocity of the particles with linear growth in $r$ is available and, by Grönwall’s inequality, we obtain that $\lim_{r\to\infty}t_{r}=\infty$ , just as in the end of the proof of Proposition 2.5. The theorem follows by combining this result with the uniqueness stated in Proposition 2.5.
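The Grönwall step can be sketched as follows, under assumptions that are made explicit because the corresponding displays are not available in this section. We write $M$ for a hypothetical bound on $\|dR\|$ over the relevant sublevel set and $\bar{M}:=\max\{1,M\}$ , and $\rho(t)$ for the smallest index such that all particles lie in $Q_{\rho(t)}$ at time $t$ . As in the lower bound on the exit time above, we assume that a displacement of length $\delta$ increases this index by at most $\delta$ .

```latex
\begin{align*}
\rho(t) &\leq r_{0}+\int_{0}^{t}\bar{M}\,\big(C_{1}+C_{2}\,\rho(s)\big)\,ds
  && \text{(velocity bound from Assumptions 2.1)} \\
\rho(t)+\frac{C_{1}}{C_{2}} &\leq \Big(r_{0}+\frac{C_{1}}{C_{2}}\Big)\,e^{\bar{M}C_{2}t}
  && \text{(Gr\"onwall's inequality)} \\
t_{r} &\geq \frac{1}{\bar{M}C_{2}}\,
  \log\!\Big(\frac{r+C_{1}/C_{2}}{r_{0}+C_{1}/C_{2}}\Big)
  \xrightarrow[r\to\infty]{}\infty .
\end{align*}
```

Under these assumptions, the index $\rho(t)$ grows at most exponentially in time, so no particle can leave every $Q_{r}$ in finite time.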

Appendix C Convergence to global minimizers

We give in this section a proof of Theorems 3.3 and 3.5. All results have two versions: one in the $2$ -homogeneous setting (Assumptions 3.2) and its counterpart in the partially $1$ -homogeneous setting (Assumptions 3.4). We have displayed in Figure 4 the level sets of functions with these homogeneity properties, in order to highlight the differences between these two cases. The proofs tend to be more straightforward in the $2$ -homogeneous setting and they can be read independently of the other case. This section is organized as follows:

In Section C.1, we justify the global optimality conditions.

We give in Section C.2 a criterion for Wasserstein gradient flows to escape from neighborhoods of non-optimal stationary points, and we also characterize measures that can be limits of Wasserstein gradient flows. These results are valid for arbitrary initializations.

In Section C.3, we prove that the assumption on the support of the initialization made in Theorems 3.3 and 3.5 is preserved by Wasserstein gradient flows.

All these facts combined lead to a proof of Theorems 3.3 and 3.5 in Section C.4.

The statements and the proofs often involve the projection $h^{i}(\mu)$ of a probability measure $\mu\in\mathcal{P}(\Omega)$ (with $i=1,2$ , introduced in Section A.2) instead of $\mu$ itself. This is motivated by two facts: (i) this projected measure is generally the object of interest in the optimization problem as it clears the redundancy caused by homogeneity and (ii) the assumption that the projection $h^{i}(\mu_{t})$ of a Wasserstein gradient flow converges is more reasonable than the convergence in $W_{2}$ of the original gradient flow, where generally no compactness is available.

Let us first remark that, by a first order Taylor expansion of $R$ , we have that for all $\mu,\sigma\in\mathcal{M}(\Omega)$ with $F(\mu),F(\sigma)<\infty$ , it holds $\int|F^{\prime}(\mu)|d\sigma<\infty$ and

Let $\mu,\nu\in\mathcal{M}_{+}(\Omega)$ be such that $F(\nu),F(\mu)<\infty$ , consider $\sigma:=\nu-\mu$ and its Lebesgue decomposition $\sigma=f\mu+\sigma^{\perp}$ where $f\in L^{1}(\mu)$ and $\sigma^{\perp}\in\mathcal{M}_{+}(\Omega)$ is singular to $\mu$ (see [10, Thm. 4.3.2]). Clearly, by the above first order formula, it is necessary to have $F^{\prime}(\mu)\geq 0$ everywhere with equality $\mu$ -a.e., for $\mu$ to be a minimizer. It is also sufficient since in this case we have, by convexity,
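The sufficiency step can be sketched as follows, assuming that convexity of $R$ combined with the first order formula above yields the inequality $F(\nu)\geq F(\mu)+\int F^{\prime}(\mu)\,d\sigma$ . The sketch also uses that the singular part of $\sigma$ is nonnegative, since it is the part of $\nu$ that is singular to $\mu$ .

```latex
\begin{align*}
F(\nu) &\geq F(\mu)+\int_{\Omega}F^{\prime}(\mu)\,d\sigma \\
  &= F(\mu)+\int_{\Omega}F^{\prime}(\mu)\,f\,d\mu
     +\int_{\Omega}F^{\prime}(\mu)\,d\sigma^{\perp} \\
  &\geq F(\mu).
\end{align*}
```

The middle integral vanishes because $F^{\prime}(\mu)=0$ holds $\mu$ -a.e., and the last integral is nonnegative because $F^{\prime}(\mu)\geq 0$ everywhere and the singular part is a nonnegative measure.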

C.2 A criterion to escape from non-optimal stationary points

We now give a criterion for Wasserstein gradient flows to escape from non-optimal stationary points. It is valid both in the finite-particle regime and in the many-particle limit. Such a result supports the idea that, even in the finite-particle case (i.e. classical gradient flows), the point of view using measures is natural.

Such a set is given by $A=\{r\theta\;;\;r\in{]0,\infty[}\text{ and }\theta\in K\}$ where $K$ is the $(-\eta)$ -sublevel set of the restriction of $F^{\prime}(\mu)$ to the unit sphere, for some $\eta>0$ that can be chosen arbitrarily close to $0$ .

We now give a general property of the stationary points.

C.2.2 The partially $1$ -homogeneous case

For the partially $1$ -homogeneous case, we consider the operator $h^{1}:\mathcal{M}_{+}(\Omega)\to\mathcal{M}(\Theta)$ defined in Appendix A.2.

where the last bound is due to the fact that $u\mapsto\langle f,\phi(u)\rangle$ is $\|\phi\|_{C^{1}}$ -Lipschitz and upper bounded in norm by $\|\phi\|_{C^{1}}$ whenever $f\in\mathcal{F}$ satisfies $\|f\|\leq 1$ . ∎

As for the $2$ -homogeneous case, we give a general property of the stationary points.

Under Assumptions 3.4, let $(\mu_{t})_{t}$ be a Wasserstein gradient flow of $F$ . If $h^{1}(\mu_{t})$ converges weakly to $\nu\in\mathcal{M}_{+}(\Theta)$ , then $F^{\prime}(\nu)$ vanishes $\nu$ -a.e.

C.3 Stability of separation properties

Here we prove that the separation properties of the support used in Theorems 3.3 and 3.5 are preserved by Wasserstein gradient flows. We give a proof based on topological degree theory: this tool allows us to cover the case of discontinuous velocity fields, which appear when $V$ is non-differentiable. In a more regular setting, the facts that follow are easier to prove because $\mu_{t}$ is then the pushforward of $\mu_{0}$ by a homeomorphism. Let us give a definition of the topological degree that is sufficient for our setting.

If $A_{1},A_{2}$ are disjoint open subsets of $A$ and $y\notin f(\overline{A}\setminus(A_{1}\cup A_{2}))$ then $\deg(f,A,y)=\deg(f,A_{1},y)+\deg(f,A_{2},y)$ .

These properties characterize a uniquely well-defined map $\deg$ from the set of triplets $(f,A,y)$ as above to the set of signed integers [8, Thm. 1-2]. Intuitively, it gives an algebraic count of the number of solutions to $f(x)=y$ for $x\in A$ , where algebraic means that a solution $x$ counts as $+1$ if $f$ preserves orientation around $x$ and $-1$ otherwise.
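To make the "algebraic count" concrete, here is a minimal numerical sketch in the planar case. It relies on the classical fact that, for a continuous map $f$ on the closed unit disc and a point $y$ that is not on the image of the unit circle, the degree equals the winding number of the image of the unit circle around $y$. The function name `planar_degree` and the test maps are illustrative choices, not objects from the paper.

```python
import numpy as np

def planar_degree(f, y, n=4096):
    """Degree of f on the open unit disc at y, computed as the winding
    number of the boundary curve t -> f(exp(i t)) around y.
    Assumes y does not lie on that curve."""
    t = np.linspace(0.0, 2.0 * np.pi, n + 1)
    curve = f(np.exp(1j * t)) - y
    # Sum the (small) angle increments between consecutive boundary samples.
    increments = np.angle(curve[1:] / curve[:-1])
    return int(round(increments.sum() / (2.0 * np.pi)))

# z -> z^2 has two preimages of 0.25 in the disc, both orientation-preserving.
deg_square = planar_degree(lambda z: z**2, 0.25)
# Complex conjugation has one preimage and reverses orientation.
deg_conj = planar_degree(np.conj, 0.25)
# 2.0 has no preimage under z -> z^2 inside the unit disc.
deg_none = planar_degree(lambda z: z**2, 2.0)
```

In this sketch the three degrees are $2$, $-1$ and $0$ respectively, in line with the intuitive description above.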

The following lemma shows that taking the support of a measure and its pushforward by a continuous map are operations that almost commute. They commute for instance if the map is closed (i.e. maps closed sets to closed sets).

Let $y\in f(\operatorname{spt}\mu)$ and let $\mathcal{V}$ be a neighborhood of $y$ . By continuity, $f^{-1}(\mathcal{V})$ is a neighborhood of a point in $\operatorname{spt}\mu$ so $0<\mu(f^{-1}(\mathcal{V}))=f_{\#}\mu(\mathcal{V})$ , hence $y\in\operatorname{spt}f_{\#}\mu$ so $f(\operatorname{spt}\mu)\subset\operatorname{spt}f_{\#}\mu$ . Conversely, let $y\in\overline{f(\operatorname{spt}\mu)}^{c}$ and let $\mathcal{V}$ be a neighborhood of $y$ that does not intersect $\overline{f(\operatorname{spt}\mu)}$ . This neighborhood satisfies $f^{-1}(\mathcal{V})\subset(\operatorname{spt}\mu)^{c}$ , so it holds $f_{\#}\mu(\mathcal{V})=\mu(f^{-1}(\mathcal{V}))\leq\mu((\operatorname{spt}\mu)^{c})=0$ . Hence $y\in(\operatorname{spt}f_{\#}\mu)^{c}$ so $\overline{f(\operatorname{spt}\mu)}^{c}\subset(\operatorname{spt}f_{\#}\mu)^{c}$ which implies $\operatorname{spt}f_{\#}\mu\subset\overline{f(\operatorname{spt}\mu)}$ . ∎

We first state the property and the stability result that we wish to establish in the $2$ -homogeneous setting.

Under Assumptions 3.2, let $(\mu_{t})_{t\geq 0}$ be a Wasserstein gradient flow of $F$ . If the support of $\mu_{0}$ satisfies Property C.9, so does the support of $\mu_{t}$ , for all $t>0$ .

Note that this property is generally lost in the limit $t\to\infty$ . This lemma is a consequence of the following, more abstract proposition, that deals with sets instead of measures. The reader can keep in mind that we will apply this result with $X$ being the flow of the velocity field introduced in Lemma B.4 and $K$ being the support of $\mu_{0}$ .

C.3.2 The partially $1$ -homogeneous case

Here are the analogous separation property and stability lemma for the partially $1$ -homogeneous case.

Under Assumptions 3.4, let $(\mu_{t})_{t\geq 0}$ be a Wasserstein gradient flow of $F$ . If the support of $\mu_{0}$ satisfies Property C.12, then so does the support of $\mu_{t}$ , for all $t>0$ .

As above, we first prove an abstract topological result.

C.4 Main theorems: proofs and generalization

First, let us state a lemma that relates the convergence of the Wasserstein gradient flows to an asymptotic property for the classical gradient flows, when $m,t\to\infty$ . This result is used in the last claims of Theorems 3.3 and 3.5.

Under Assumptions 2.1, let $(\mu_{t})$ be a Wasserstein gradient flow whose initialization is concentrated on a set $Q_{r_{0}}$ and such that $F(\mu_{t})\to F^{*}$ . If $(\mu_{0,m})_{m}$ is a sequence of measures concentrated on a set $Q_{r_{0}}$ that converges to $\mu_{0}$ in $W_{2}$ , then

Under the assumptions of Theorem 3.3, if $h^{2}(\mu_{t})$ converges weakly, then its limit is a global minimizer of $F$ over $\mathcal{M}_{+}(\Omega)$ and $\lim_{t\to\infty}F(\mu_{t})=F^{*}$ .

This statement is stronger than Theorem 3.3: indeed, if $\mu_{t}$ converges for the Wasserstein metric, then $h^{2}(\mu_{t})$ converges weakly (but the converse is generally not true).

C.4.2 The partially $1$ -homogeneous case

Again, we prove a statement in terms of the projected measures: Theorem 3.5 can be deduced as an immediate corollary. Some highlights of the proof are given in Figure 5.

Under the assumptions of Theorem 3.5, if $h^{1}(\mu_{t})$ converges weakly, then its limit is a global minimizer of $F$ over $\mathcal{M}_{+}(\Omega)$ and $\lim_{t\to\infty}F(\mu_{t})=F^{*}$ .

In the unfavorable case encountered in the proof of Theorem C.17, we had to invoke the following lemma. It has a different nature than the other results of this paper because it relies on an explicit integration of the trajectories of the gradient flow, which means that it depends on the choice of the metric.

Consider, for a measure $\nu\in\mathcal{M}(\Theta)$ , a point $\theta_{0}\in\Theta$ such that $|\nabla g_{\nu}(\theta_{0})|=0$ and $g_{\nu}(\theta_{0})\leq-\eta$ for some $\eta>0$ . For any $M>0$ and $r_{0}>0$ , there exist $T,\epsilon>0$ such that, if $(\mu_{t})_{t}$ is a Wasserstein gradient flow of $F$ that satisfies $\|g_{\mu_{t}}-g_{\nu}\|_{C^{1}}\leq\epsilon$ for all $t\in[0,T]$ , then, denoting by $(w(t),\theta(t))$ the solution of the flow of Lemma B.4 starting from $(w_{0},\theta_{0})$ with $w_{0}\in[-M,0]$ , it holds $w(T)=0$ and $|\theta(T)-\theta_{0}|<r_{0}$ .

The Lipschitz regularity of $g_{\nu}$ and its derivative implies that there exists $L>0$ such that $\max\{|g_{\nu}(\theta)-g_{\nu}(\theta_{0})|,|\nabla g_{\nu}(\theta)-\nabla g_{\nu}(\theta_{0})|\}\leq L|\theta-\theta_{0}|$ for all $\theta\in\Theta$ . Without loss of generality, we assume that $r_{0}<\eta/(4L)$ . Consider $\epsilon\in{]0,\eta/4[}$ and assume that there exists $\bar{T}>0$ such that $\|g_{\mu_{t}}-g_{\nu}\|_{C^{1}}\leq\epsilon$ for $t\in[0,\bar{T}]$ . Writing $q(t)=|\theta(t)-\theta_{0}|$ , it holds for $t\in[0,\bar{T}]$ ,

In particular, if we can make sure that $|q(t)|<\bar{r}$ for $t\in[0,\bar{T}]$ and if $\bar{T}>2/\eta$ then, as $(dw/dt)\geq\eta/2$ on this interval, there exists $T<2/\eta$ such that $w(T)=0$ .

It remains to make sure that we indeed have $|q(t)|<\bar{r}$ for $t\in[0,T]$ , by adjusting if necessary the value of $\epsilon$ . Parametrizing in $w$ instead of $t$ (it is an admissible reparametrization thanks to the positive lower bound on its derivative), we get

Thus, choosing $\epsilon<Lr_{0}/(\exp(Lw_{0}^{2}/\eta)-1)$ , it is guaranteed that $q(t)\leq r_{0}$ for $t\in[0,T]$ . ∎

C.5 Remarks

We conclude this theoretical section with two opening remarks related to the global convergence theorems.

In the statements of Theorems 3.3 and 3.5, the convergence of the Wasserstein gradient flow comes as an assumption. In order to prove convergence of gradient flows, one generally needs two properties: (i) compactness of the trajectories and (ii) a so-called Łojasiewicz inequality which, intuitively, controls how much a function flattens around its critical points. As compactness in $W_{2}$ is a very strong requirement, we have relaxed the topology where convergence is required to obtain more reasonable assumptions. Yet, even when a gradient flow lies in a compact set, there are some cases where it does not converge. There has been recent progress on related issues with the study of Łojasiewicz inequalities in Wasserstein space, but to our knowledge, no general result is known in our non-geodesically convex case.

We stress that Propositions C.1 and C.4 provide an intuitive criterion for a particle gradient flow to escape a local minimum: roughly, it is sufficient that, when it passes close to a local minimum, at least one particle belongs to a $0$ -sublevel set of the current potential $F^{\prime}(\mu)$ . In this paper we exploit this property by studying the many-particle limit, but other approaches are worth exploring. For instance, we could estimate the size of this sublevel set in specific cases, and use it as an indication of the particle-complexity needed to attain global minimizers. A discussion on a specific example is given in Section D.5.

Appendix D Case studies and numerical experiments

In this section, we verify the assumptions for the examples treated in Section 4.

If $r$ is convex in the second variable, then $R$ is convex. If $r$ is differentiable in the second variable with $\partial_{2}r$ Lipschitz, uniformly in the first variable, then $R$ is differentiable with differential $dR$ Lipschitz. If moreover $|\partial_{2}r|^{2}\leq C_{1}r+C_{2}$ for some constants $C_{1},C_{2}>0$ , then $dR$ is bounded on sublevel sets.

It is direct to see that $dR$ is $L$ -Lipschitz in the operator norm. Finally, if $|\partial_{2}r|^{2}\leq C_{1}r+C_{2}$ , then

D.2 Sparse deconvolution

D.3 Neural network: sigmoid activation

D.4 Neural network: ReLU activation

Although we do not use this fact explicitly in the paper, it is interesting to note that the regularizing potential $V:(w,\theta)\mapsto|w|\cdot|\theta|$ is admissible in the $2$ -homogeneous setting of Assumptions 3.2: although it is neither differentiable nor convex, it is positively $2$ -homogeneous and semiconvex.

The homogeneity property is clear, and to see that $V$ is semi-convex, it is sufficient to remark that

is convex, since it is the square of a norm. ∎
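The two properties can be checked numerically on random points. The sketch below is only an illustration under an assumption: since the displayed function is not reproduced in this version, we take the natural candidate $V(w,\theta)+\tfrac{1}{2}(w^{2}+|\theta|^{2})=\tfrac{1}{2}(|w|+|\theta|)^{2}$ , which is indeed half the square of the norm $(w,\theta)\mapsto|w|+|\theta|$ . The helper names are hypothetical.

```python
import numpy as np

def V(w, theta):
    """Regularizing potential V(w, theta) = |w| * |theta| (Euclidean norm on theta)."""
    return abs(w) * np.linalg.norm(theta)

def V_plus_quadratic(w, theta):
    """V plus half the squared Euclidean norm of (w, theta) (assumed candidate)."""
    return V(w, theta) + 0.5 * (w**2 + np.dot(theta, theta))

rng = np.random.default_rng(0)
max_identity_gap = 0.0
max_homogeneity_gap = 0.0
min_midpoint_slack = np.inf
for _ in range(1000):
    w_a, w_b = rng.normal(size=2)
    th_a, th_b = rng.normal(size=(2, 3))
    r = rng.uniform(0.1, 5.0)
    # Identity: V + |(w, theta)|^2 / 2 = (|w| + |theta|)^2 / 2.
    closed_form = 0.5 * (abs(w_a) + np.linalg.norm(th_a)) ** 2
    max_identity_gap = max(max_identity_gap,
                           abs(V_plus_quadratic(w_a, th_a) - closed_form))
    # Positive 2-homogeneity: V(r w, r theta) = r^2 V(w, theta) for r > 0.
    max_homogeneity_gap = max(max_homogeneity_gap,
                              abs(V(r * w_a, r * th_a) - r**2 * V(w_a, th_a)))
    # Midpoint convexity of the shifted function: average minus midpoint value >= 0.
    mid = V_plus_quadratic(0.5 * (w_a + w_b), 0.5 * (th_a + th_b))
    avg = 0.5 * (V_plus_quadratic(w_a, th_a) + V_plus_quadratic(w_b, th_b))
    min_midpoint_slack = min(min_midpoint_slack, avg - mid)
```

A random test of this kind cannot prove convexity; it only fails to find a counterexample, consistently with the argument above.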

D.4.2 A differentiable parameterization

We now consider the alternative parameterization of Proposition 4.3, defined as $\Phi(\theta):x\mapsto\sigma(s(\theta)\cdot(x,1))$ where $\sigma(t)=\max\{t,0\}$ and $s$ is the signed square function $s(t)=t|t|=\operatorname{sign}(t)\cdot t^{2}$ that acts entry-wise. As $\Phi$ is clearly positively $2$ -homogeneous, we just have to prove the differentiability of $\Phi$ , which is done with the same technique as in Lemma D.3.

Note that the condition on the moments of $\rho_{x}$ is less strong for ReLU activation than for sigmoids in Lemma D.2: this comes from the fact that ReLU is piece-wise linear. As explained at the end of Section D.3, it is difficult to verify the Sard-type regularity assumption, so it is left as an assumption in Proposition 4.3.
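As a quick sanity check of the homogeneity claim, the following sketch evaluates this parameterization on a batch of inputs and compares $\Phi(r\theta)$ with $r^{2}\Phi(\theta)$ for some $r>0$ . It also illustrates that $s$ is smooth at the origin, where $s^{\prime}(t)=2|t|$ vanishes. The function names, dimensions and random inputs are illustrative assumptions.

```python
import numpy as np

def signed_square(t):
    """Entry-wise signed square s(t) = t * |t|."""
    return t * np.abs(t)

def relu(t):
    return np.maximum(t, 0.0)

def Phi(theta, x):
    """Evaluate x -> relu(s(theta) . (x, 1)) on a batch x of shape (n, d);
    theta has shape (d + 1,)."""
    x1 = np.hstack([x, np.ones((x.shape[0], 1))])
    return relu(x1 @ signed_square(theta))

rng = np.random.default_rng(0)
theta = rng.normal(size=4)
x = rng.normal(size=(50, 3))
r = 2.5

# Positive 2-homogeneity: Phi(r * theta) = r^2 * Phi(theta) for r > 0.
homogeneity_gap = np.max(np.abs(Phi(r * theta, x) - r**2 * Phi(theta, x)))

# Central finite difference of s at 0; s'(t) = 2|t| vanishes there.
h = 1e-6
derivative_at_zero = (signed_square(h) - signed_square(-h)) / (2 * h)
```

The homogeneity follows from $s(r\theta)=r^{2}s(\theta)$ for $r>0$ combined with the positive $1$ -homogeneity of the ReLU.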

D.5 Numerical experiments: details and additional results

Animated plots of the particle gradient flows shown in this article may be found online at https://lchizat.github.io/PGF.html. These videos appear at this place in the official supplementary material of the NIPS 2018 publication, but had to be removed from the present version due to software incompatibility.

Here we give more details on the numerical experiments behind Figure 3.

For the leftmost panel, the setting is similar to that of Figure 1: for each realization, $5$ spikes are randomly distributed on the $1$ -torus (with a minimum separation of $0.1$ ) with random weights between $0.5$ and $1.5$ and a small noise is added to the filtered signal. Then for each choice of $m$ , we initialize $m$ particles on a regular grid on $\{0\}\times\Theta$ and integrate the particle gradient flow with the forward-backward algorithm until the improvement per iteration is below a small tolerance threshold.
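The data generation and initialization described here can be sketched as follows. This is an illustration under assumptions that the text does not specify: the $1$ -torus is identified with $[0,1)$ , positions and weights are drawn uniformly, and the separation constraint is enforced by rejection sampling. The function names are hypothetical, and the noise, the filter and the forward-backward iterations are not reproduced.

```python
import numpy as np

def torus_distance(a, b):
    """Distance between points of the 1-torus, identified with [0, 1)."""
    d = np.abs(a - b) % 1.0
    return np.minimum(d, 1.0 - d)

def sample_spikes(rng, n_spikes=5, min_sep=0.1, w_low=0.5, w_high=1.5):
    """Rejection sampling of spike positions with a minimum pairwise separation,
    followed by uniform weights in [w_low, w_high] (assumed distributions)."""
    while True:
        pos = rng.uniform(0.0, 1.0, size=n_spikes)
        gaps = torus_distance(pos[:, None], pos[None, :])
        gaps[np.diag_indices(n_spikes)] = np.inf  # ignore self-distances
        if gaps.min() >= min_sep:
            weights = rng.uniform(w_low, w_high, size=n_spikes)
            return pos, weights

def grid_initialization(m):
    """m particles on a regular grid of {0} x Theta:
    zero weights and equispaced positions on the torus."""
    return np.zeros(m), np.arange(m) / m

rng = np.random.default_rng(0)
positions, weights = sample_spikes(rng)
w0, theta0 = grid_initialization(10)
```

With $5$ spikes and a separation of $0.1$ , a uniform draw is accepted with probability $(1-5\times 0.1)^{4}=1/16$ , so the rejection loop terminates quickly in practice.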

For the center panel, the setting is similar to that of Figure 2, but here in dimension $d=100$ . The data is normally distributed and the ground truth labels are generated by a similar neural network with $20$ neurons (with random normally distributed parameters). The objective function is the square loss without regularization, so the global minimum corresponds to a loss of $0$ . We optimize using SGD with fresh samples at each iteration.

The rightmost panel shows, similarly, the particle-complexity for training a neural network with a single hidden layer and sigmoid activation function, in dimension $d=100$ . The data is distributed on a sphere and the ground truth labels are generated by a similar neural network with $20$ neurons with random normal weights. Again, we minimize with SGD the square loss without regularization and the global minimum corresponds to a loss of $0$ .
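A minimal sketch of such a teacher-student experiment with ReLU activation is given below. It is not the code used for the figure: the mean-field normalization by $1/m$ , the scaling of the step by $m$ , the batch size, the step size, the number of iterations and the standard normal initialization are all assumptions made for illustration.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def add_bias(x):
    """Append a constant coordinate: x -> (x, 1)."""
    return np.hstack([x, np.ones((x.shape[0], 1))])

def predict(w, theta, x):
    """Mean-field network (1/m) * sum_i w_i * relu(theta_i . (x, 1))."""
    return relu(add_bias(x) @ theta.T) @ w / w.shape[0]

def sgd_step(w, theta, x, y, lr):
    """One SGD step on the half mean squared error over the batch (x, y)."""
    n, m = x.shape[0], w.shape[0]
    x1 = add_bias(x)
    pre = x1 @ theta.T                      # pre-activations, shape (n, m)
    resid = relu(pre) @ w / m - y           # residuals, shape (n,)
    grad_w = relu(pre).T @ resid / (n * m)
    grad_theta = (resid[:, None] * (pre > 0) * w[None, :]).T @ x1 / (n * m)
    # Assumption: the step is scaled by m so that each particle moves
    # at a rate that does not vanish as m grows.
    return w - lr * m * grad_w, theta - lr * m * grad_theta

rng = np.random.default_rng(0)
d, m_teacher, m, batch = 100, 20, 50, 64
w_star = rng.normal(size=m_teacher)              # teacher with 20 neurons
theta_star = rng.normal(size=(m_teacher, d + 1))
w = rng.normal(size=m)                           # student with m particles
theta = rng.normal(size=(m, d + 1))
for _ in range(200):
    x = rng.normal(size=(batch, d))              # fresh samples at every iteration
    y = predict(w_star, theta_star, x)           # noiseless teacher labels
    w, theta = sgd_step(w, theta, x, y, lr=1e-3)
```

Because every batch is drawn anew, each step is an unbiased stochastic gradient step on the population loss rather than on a fixed training set.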

We compare the performance with the method of simply minimizing on the weights with the same initialization. This is a convex problem, and the minimum value attained does not depend on the minimization method. We plot for each case the final excess loss as a function of $m$ for several random realizations of the experiment and, for each value of $m$ , its geometric average over all realizations. We have indicated in transparent green the area of loss values which should be interpreted as “optimal” even though they are not exactly zero: the optimization has been stopped in finite time and the loss is not known exactly but estimated through sampling.
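For reference, the geometric average over realizations is the arithmetic mean taken in log scale, which is the natural summary when excess losses span several orders of magnitude. A minimal sketch, with made-up numbers that are not results from the paper:

```python
import numpy as np

def geometric_average(excess_loss):
    """Geometric mean over realizations (axis 0) of positive excess losses."""
    return np.exp(np.mean(np.log(excess_loss), axis=0))

# Hypothetical excess losses: rows are realizations, columns are values of m.
excess = np.array([[1e-1, 1e-4],
                   [1e-3, 1e-6]])
avg = geometric_average(excess)
```

Here each column is averaged separately, giving one value per choice of $m$ .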

In all previous numerical experiments dealing with the partially $1$ -homogeneous case, we have initialized the particle gradient flow on a discretization of $\{0\}\times\Theta$ . But Theorem 3.5 allows for a large variety of initialization patterns. In this paragraph, we comment on the various possibilities and explain how the proof of Theorem 3.5 helps to understand why the corresponding particle-complexity is impacted.

We display in Figure 6 a sparse spikes deconvolution experiment, in a setting similar to that of Figure 1, but with different initializations. For this problem, where $m_{0}=5$ spikes are to be recovered, we have observed numerically that the particle gradient flows initialized on a uniform grid on $\{0\}\times\Theta$ succeed in finding a global minimizer as soon as there are more than $m=7$ particles. In the first panel of Figure 6, the particle gradient flow with $m=15$ particles initialized on $\{1\}\times\Theta$ fails to find a minimizer and a larger number of particles is needed for success (as shown in the center panel, with $m=30$ ).