On the Global Convergence of Gradient Descent for Over-parameterized Models using Optimal Transport

Lenaic Chizat, Francis Bach

Introduction

While (1) is a convex problem, finding approximate minimizers is hard as the variable is infinite-dimensional. Several lines of work provide optimization methods but with strong limitations.

This approach tackles a variant of (1) where the regularization term is replaced by an upper bound on the total variation norm; the associated constraint set is the convex hull of all Diracs and negatives of Diracs at elements of θΘ\theta\in\Theta, and thus adapted to conditional gradient algorithms . At each iteration, one adds a new particle by solving a linear minimization problem over the constraint set (which correspond to finding a particle θΘ\theta\in\Theta), and then updates the weights. The resulting iterates are sparse and there is a guaranteed sublinear convergence rate of the objective function to its minimum. However, the linear minimization subroutine is hard to perform in general : it is for instance NP-hard for neural networks with homogeneous activations . One thus generally resorts to space gridding (in low dimension) or to approximate steps, akin to boosting . The practical behavior is improved with nonconvex updates reminiscent of the flow studied below.

Another approach is to parameterize the unknown measure by its sequence of moments. The space of such sequences is characterized by a hierarchy of SDP-representable necessary conditions. This approach concerns a large class of generalized moment problems and can be adapted to deal with special instances of (1) . It is however restricted to ϕ\phi which are combinations of few polynomial moments, and its complexity explodes exponentially with the dimension dd. For d2d\geq 2, convergence to a global minimizer is only guaranteed asymptotically, similarly to the results of the present paper.

A third approach, which exploits the differentiability of ϕ\phi, consists in discretizing the unknown measure μ\mu as a mixture of mm particles parameterized by their positions and weights. This corresponds to the finite-dimensional problem

which can then be solved by classical gradient descent-based algorithms. This method is simple to implement and is widely used for the task of neural network training but, a priori, we may only hope to converge to local minima since JmJ_{m} is non-convex. Our goal is to show that this method also benefits from the convex structure of (1) and enjoys an asymptotical global optimality guarantee.

There is a recent literature on global optimality results for (2) in the specific task of training neural networks. It is known that in this context, JmJ_{m} has less, or no, local minima in an over-parameterization regime and stochastic gradient descent (SGD) finds a global minimizer under restrictive assumptions ; see for an account of recent results. Our approach is not directly comparable to these works: it is more abstract and nonquantitative—we study an ideal dynamics that one can only hope to approximate—but also much more generic. Our objective, in the space of measures, has many local minima, but we build gradient flows that avoids them, relying mainly on the homogeneity properties of JmJ_{m} (see for other uses of homogeneity in non-convex optimization). The novelty is to see (2) as a discretization of (1)—a point of view also present in but not yet exploited for global optimality guarantees.

2 Organization of the paper and summary of contributions

Our goal is to explain when and why the non-convex particle gradient descent finds global minima. We do so by studying the many-particle limit mm\to\infty of the gradient flow of JmJ_{m}. More specifically:

In Section 2, we introduce a more general class of problems and study the many-particle limit of the associated particle gradient flow. This limit is characterized as a Wasserstein gradient flow (Theorem 2.6), an object which is a by-product of optimal transport theory.

In Section 3, under assumptions on ϕ\phi and the initialization, we prove that if this Wasserstein gradient flow converges, then the limit is a global minimizer of JJ. Under the same conditions, it follows that if (w(m)(t),θ(m)(t))t0(\bm{w}^{(m)}(t),\bm{\theta}^{(m)}(t))_{t\geq 0} are gradient flows for JmJ_{m} suitably initialized, then

Two different settings that leverage the structure of ϕ\phi are treated: the 22-homogeneous and the partially 11-homogeneous case. In Section 4, we apply these results to sparse deconvolution and training neural networks with a single hidden layer, with sigmoid or ReLU activation function. In each case, our result prescribes conditions on the initialization pattern.

We perform simple numerical experiments that indicate that this asymptotic regime is already at play for small values of mm, even for high-dimensional problems. The method behaves incomparably better than simply optimizing on the weights with a very large set of fixed particles.

Our focus on qualitative results might be surprising for an optimization paper, but we believe that this is an insightful first step given the hardness and the generality of the problem. We suggest to understand our result as a first consistency principle for practical and a commonly used non-convex optimization methods. While we focus on the idealistic setting of a continuous-time gradient flow with exact gradients, this is expected to reflect the behavior of first order descent algorithms, as they are known to approximate the former: see for (accelerated) gradient descent and [21, Thm. 2.1] for SGD.

Several independent works have studied the many-particle limit of training a neural network with a single large hidden layer and a quadratic loss RR. Their main focus is on quantifying the convergence of SGD or noisy SGD to the limit trajectory, which is precisely a mean-field limit in this case. Since in our approach this limit is mostly an intermediate step necessary to state our global convergence theorems, it is not studied extensively for itself. These papers thus provide a solid complement to Section 2.4 (a difference is that we do not assume that RR is quadratic nor that VV is differentiable). Also, proves a quantitive global convergence result for noisy SGD to an approximate minimizer: we stress that our results are of a different nature, as they rely on homogeneity and not on the mixing effect of noise.

Particle gradient flows and many-particle limit

(locally Lipschitz derivatives with sublinear growth) there exists a family (Qr)r>0(Q_{r})_{r>0} of nested nonempty closed convex subsets of Ω\Omega such that:

Φ\Phi and VV are bounded and dΦd\Phi is Lipschitz on each QrQ_{r}, and

there exists C1,C2>0C_{1},C_{2}>0 such that supuQr(dΦu+V(u))C1+C2r\sup_{u\in Q_{r}}(\|d\Phi_{u}\|+\|\partial V(u)\|)\leq C_{1}+C_{2}r for all r>0r>0, where V(u)\|\partial V(u)\| stands for the maximal norm of an element in V(u)\partial V(u).

The functions Φ\Phi and VV obtained through the lifting share the property of being positively 11-homogeneous in the variable ww. A function ff between vector spaces is said positively pp-homogeneous when for all λ>0\lambda>0 and argument xx, it holds f(λx)=λpf(x)f(\lambda x)=\lambda^{p}f(x). This property is central for our global convergence results (but is not needed throughout Section 2).

2 Particle gradient flow

3 Wasserstein gradient flow

Thus vtv_{t} is simply a field of (minus) subgradients of F(μm,t)F^{\prime}(\mu_{m,t})—it is in fact the field of minimal norm subgradients. We write this relation vtF(μm,t)v_{t}\in-\partial F^{\prime}(\mu_{m,t}). The set F\partial F^{\prime} is called the Wasserstein subdifferential of FF, as it can be interpreted as the subdifferential of FF relatively to the Wasserstein metric on P2(Ω)\mathcal{P}_{2}(\Omega) (see Appendix B.2.1). We thus expect that for initializations with arbitrary probability distributions, the generalization of the gradient flow coindices with the following object.

A Wasserstein gradient flow for the functional FF on a time interval [0,T[{[0,T[} is an absolutely continuous path (μt)t[0,T[(\mu_{t})_{t\in{[0,T[}} in P2(Ω)\mathcal{P}_{2}(\Omega) that satisfies, distributionally on [0,T[×Ωd{[0,T[}\times\Omega^{d},

This is a proper generalization of Definition 2.2 since, whenever (u(t))t0(\mathbf{u}(t))_{t\geq 0} is a particle gradient flow for FmF_{m}, then tμm,t:=1mi=1mδui(t)t\mapsto\mu_{m,t}:=\frac{1}{m}\sum_{i=1}^{m}\delta_{\mathbf{u}_{i}{(t)}} is a Wasserstein gradient flow for FF in the sense of Definition 2.4 (see Proposition B.1). By leveraging the abstract theory of gradient flows developed in , we show in Appendix B.2.1 that these Wasserstein gradient flows are well-defined.

Under Assumptions 2.1, if μ0P2(Ω)\mu_{0}\in\mathcal{P}_{2}(\Omega) is concentrated on a set Qr0ΩQ_{r_{0}}\subset\Omega, then there exists a unique Wasserstein gradient flow (μt)t0(\mu_{t})_{t\geq 0} for FF starting from μ0\mu_{0}. It satisfies the continuity equation with the velocity field defined in (5) (with μt\mu_{t} in place of μm,t\mu_{m,t}).

Note that the condition on the initialization is automatically satisfied in Proposition 2.3 because there the initial measure has a finite discrete support: it is thus contained in any QrQ_{r} for r>0r>0 large enough.

4 Many-particle limit

We now characterize the many-particle limit of classical gradient flows, under Assumptions 2.1.

Given a measure μ0P2(Qr0)\mu_{0}\in\mathcal{P}_{2}(Q_{r_{0}}), an example for the sequence um(0)\mathbf{u}_{m}(0) is um(0)=(u1,,um)\mathbf{u}_{m}(0)=(u_{1},\dots,u_{m}) where u1,u2,,umu_{1},u_{2},\dots,u_{m} are independent samples distributed according to μ0\mu_{0}. By the law of large numbers for empirical distributions, the sequence of empirical distributions μm,0=1mi=1mδui\mu_{m,0}=\frac{1}{m}\sum_{i=1}^{m}\delta_{u_{i}} converges (almost surely, for W2W_{2}) to μ0\mu_{0}. In particular, our proof of Theorem 2.6 gives an alternative proof of the existence claim in Proposition 2.5 (the latter remains necessary for the uniqueness of the limit).

Convergence to global minimizers

As can be seen from Definition 2.4, a probability measure μP2(Ω)\mu\in\mathcal{P}_{2}(\Omega) is a stationary point of a Wasserstein gradient flow if and only if 0F(μ)(u)\mboxforμ\mboxa.e.uΩ0\in\partial F^{\prime}(\mu)(u)\mbox{ for }\mu\mbox{-a.e. }u\in\Omega. It is proved in that these stationary points are, in some cases, optimal over probabilities that have a smaller support. However, they are not in general global minimizers of FF over M+(Ω)\mathcal{M}_{+}(\Omega), even when RR is convex. Such global minimizers are indeed characterized as follows.

Assume that RR is convex. A measure μM+(Ω)\mu\in\mathcal{M}_{+}(\Omega) such that F(μ)<F(\mu)<\infty minimizes FF on M+(Ω)\mathcal{M}_{+}(\Omega) iff F(μ)0F^{\prime}(\mu)\geq 0 and F(μ)(u)=0F^{\prime}(\mu)(u)=0 for μ\mu-a.e. uΩu\in\Omega.

Despite these strong differences between stationarity and global optimality, we show in this section that Wasserstein gradient flows converge to global minimizers, under two main conditions:

On the structure: Φ\Phi and VV must share a homogeneity direction (see Section 2.1 for the definition of homogeneity), and

On the initialization: the support of the initialization of the Wasserstein gradient flow satisfies a “separation” property. This property is preserved throughout the dynamic and, combined with homogeneity, allows to escape from neighborhoods of non-optimal points.

We turn these general ideas into concrete statements for two cases of interest, that exhibit different structures and behaviors: (i) when Φ\Phi and VV are positively 22-homogeneous and (ii) when Φ\Phi and VV are positively 11-homogeneous with respect to one variable.

2 The 222-homogeneous case

(smooth convex loss) The loss RR is convex, differentiable with differential dRdR Lipschitz on bounded sets and bounded on sublevel sets,

Taking the balls of radius r>0r>0 as the family (Qr)r>0(Q_{r})_{r>0}, these assumptions imply Assumptions 2.1. We believe that Assumption 3.2-(4) is not of practical importance: it is only used to avoid some pathological cases in the proof of Theorem 3.3. By applying Morse-Sard’s lemma , it is anyways fulfilled if the function in question is d1d-1 times continuously differentiable. We now state our first global convergence result. It involves a condition on the initialization, a separation property, that can only be satisfied in the many-particle limit. In an ambient space Ω\Omega, we say that a set CC separates the sets AA and BB if any continuous path in Ω\Omega with endpoints in AA and BB intersects CC.

A proof and stronger statements are presented in Appendix C. There, we give a criterion for Wasserstein gradient flows to escape neighborhoods of non-optimal measures—also valid in the finite-particle setting—and then show that it is always satisfied by the flow defined above. We also weaken the assumption that μt\mu_{t} converges: we only need a certain projection of μt\mu_{t} to converge weakly. Finally, the fact that limits in mm and tt can be interchanged is not anecdotal: it shows that the convergence is not conditioned on a relative speed of growth of both parameters.

This result might be easier to understand by drawing an informal distinction between (i) the structural assumptions which are instrumental and (ii) the technical conditions which have a limited practical interest. The initialization and the homogeneity assumptions are of the first kind. The Sard-type regularity is in contrast a purely technical condition: it is generally hard to check and known counter-examples involve artificial constructions such as the Cantor function . Similarly, when there is compactness, a gradient flow that does not converge is an unexpected (in some sense adversarial) behavior, see a counter-example in . We were however not able to exclude this possibility under interesting assumptions (see a discussion in Appendix C.5).

3 The partially 111-homogeneous case

Similar results hold in the partially 11-homogeneous setting, which covers the lifted problems of Section 2.1 when ϕ\phi is bounded (e.g., sparse deconvolution and neural networks with sigmoid activation).

(smooth convex loss) The loss RR is convex, differentiable with differential dRdR Lipschitz on bounded sets and bounded on sublevel sets,

(boundary conditions) The function ϕ\phi behaves nicely at the boundary of the domain: either

With the family of nested sets Qr:=[r,r]×ΘQ_{r}:=[-r,r]\times\Theta, r>0r>0, these assumptions imply Assumptions 2.1. The following theorem mirrors the statement of Theorem 3.3, but with a different condition on the initialization. The remarks after Theorem 3.3 also apply here.

Case studies and numerical illustrations

In this section, we apply the previous abstract statements to specific examples and show on synthetic experiments that the particle-complexity to reach global optimality is very favorable.

Assume that the filter impulse response ψ\psi is min{2,d}\min\{2,d\} times continuously differentiable, and that the support of μ0\mu_{0} contains {0}×Θ\{0\}\times\Theta. If the projection (h1(μt))t(h^{1}(\mu_{t}))_{t} of the Wasserstein gradient flow of FF weakly converges to νM(Θ)\nu\in\mathcal{M}(\Theta), then ν\nu is a global minimizer of

We show an example of such a reconstruction on the 11-torus on Figure 1, where the ground truth consists of m0=5m_{0}=5 weighted spikes, ψ\psi is an ideal low pass filter (a Dirichlet kernel of order 77) and yy is a noisy observation of the filtered spikes. The particle gradient flow is integrated with the forward-backward algorithm and the particles initialized on a uniform grid on {0}×Θ\{0\}\times\Theta.

2 Neural networks with a single hidden layer

Assume that ρx\rho_{x} has finite moments up to order min{4,2d2}\min\{4,2d-2\}, that the support of μ0\mu_{0} is {0}×Θ\{0\}\times\Theta and that boundary condition 3.4-(iii)-(a) holds. If the Wasserstein gradient flow of FF converges in W2W_{2} to μ\mu_{\infty}, then μ\mu_{\infty} is a global minimizer of FF.

Note that we have to explicitly assume the boundary condition 3.4-(iii)-(a) because the Sard-type regularity at infinity cannot be checked a priori (this technical detail is discussed in Appendix D.3).

The activation function σ(s)=max{0,s}\sigma(s)=\max\{0,s\} is positively 11-homogeneous: this makes Φ\Phi 22-homogeneous and corresponds, at a formal level, to the setting of Theorem 3.3. An admissible choice of regularizer here would be the (semi-convex) function V(w,θ)=wθV(w,\theta)=|w|\cdot|\theta| . However, as shown in Appendix D.4, the differential dΦd\Phi has discontinuities: this prevents altogether from defining gradient flows, even in the finite-particle regime.

We display on Figure 2 particle gradient flows for training a neural network with a single hidden layer and ReLU activation in the classical (non-differentiable) parameterization, with d=2d=2 (no regularization). Features are normally distributed, and the ground truth labels are generated with a similar network with m0=4m_{0}=4 neurons. The particle gradient flow is “integrated” with mini-batch SGD and the particles are initialized on a small centered sphere.

3 Empirical particle-complexity

Since our convergence results are non-quantitative, one might argue that similar—and much simpler to prove—asymptotical results hold for the method of distributing particles on the whole of Θ\Theta and simply optimizing on the weights, which is a convex problem. Yet, the comparison of the particle-complexity shown in Figure 3 stands strongly in favor of particle gradient flows. While exponential particle-complexity is unavoidable for the convex approach, we observed on several synthetic problems that particle gradient descent only needs a slight over-parameterization m>m0m>m_{0} to find global minimizers within optimization error (see details in Appendix D.5).

Conclusion

We have established asymptotic global optimality properties for a family of non-convex gradient flows. These results were enabled by the study of a Wasserstein gradient flow: this object simplifies the handling of many-particle regimes, analogously to a mean-field limit. The particle-complexity to reach global optimality turns out very favorable on synthetic numerical problems. This confirms the relevance of our qualitative results and calls for quantitative ones that would further exploit the properties of such particle gradient flows. Multiple layer neural networks are also an interesting avenue for future research.

We acknowledge supports from grants from Région Ile-de-France and the European Research Council (grant SEQUOIA 724063).

References

Supplementary material

Supplementary material for the paper: “On the Global Convergence of Gradient Descent for Over-parameterized Models using Optimal Transport” authored by Lénaïc Chizat and Francis Bach (NIPS 2018).

Appendix B: Many-particle limit and Wasserstein gradient flow

Appendix C: Convergence to global minimizers

Appendix D: Case studies and numerical experiments

Appendix A Introductory facts

Note that the functional of interest in this article is continuous for the Wasserstein metric. This strong regularity is rather rare in the study of Wasserstein gradient flows.

Under Assumptions 2.1, the function FF is continuous for the Wasserstein metric W2W_{2}.

A.2 Lifting to the space of probability measures

A.2.1 The partially 111-homogeneous case

For all μM+(Ω)\mu\in\mathcal{M}_{+}(\Omega), there is νP(Ω)\nu\in\mathcal{P}(\Omega) such that F(μ)=F(ν)F(\mu)=F(\nu).

If μ(Ω)=0|\mu|(\Omega)=0 then F(μ)=0=F(δ(0,θ0)F(\mu)=0=F(\delta_{(0,\theta_{0}}) where θ0\theta_{0} is any point in Θ\Theta. Otherwise, we define the map T:(w,θ)(μ(Ω)w,θ)T:(w,\theta)\mapsto(|\mu|(\Omega)\cdot w,\theta) and the probability measure ν:=T#(μ/μ(Ω))P(Ω)\nu:=T_{\#}(\mu/|\mu|(\Omega))\in\mathcal{P}(\Omega), which satisfies F(ν)=F(μ)F(\nu)=F(\mu). ∎

This operator is well defined whenever (w,θ)w(w,\theta)\mapsto w is μ\mu-integrable.

It holds M(Θ)h1(P(Ω))=h1(M+(Ω))\mathcal{M}(\Theta)\subset h^{1}(\mathcal{P}(\Omega))=h^{1}(\mathcal{M}_{+}(\Omega)). For a regularizer GG on M(Θ)\mathcal{M}(\Theta) of the form G(μ)=infνh1(μ)ΩVdνG(\mu)=\inf_{\nu\in h^{-1}(\mu)}\int_{\Omega}Vd\nu, it holds infνM(Θ)J(ν)=infμM+(Ω)F(μ)\inf_{\nu\in\mathcal{M}(\Theta)}J(\nu)=\inf_{\mu\in\mathcal{M}_{+}(\Omega)}F(\mu). If the infimum defining GG is attained and if νM(Θ)\nu\in\mathcal{M}(\Theta) minimizes JJ, then there exists μh1(ν)\mu\in h^{-1}(\nu) that minimizes FF over M+(Ω)\mathcal{M}_{+}(\Omega).

The class of regularizer considered in Proposition A.3 includes the total variation norm.

Let V(w,θ)=wV(w,\theta)=|w|. For μM(Θ)\mu\in\mathcal{M}(\Theta), it holds Vdμh1(μ)(Θ)\int Vd\mu\geq|h^{1}(\mu)|(\Theta) with equality if, for instance, μ\mu is a lift of h1(μ)h^{1}(\mu) of the form (8).

A.2.2 The 222-homogeneous case

This operator is well-defined iff μ\mu has finite second order moments.

Appendix B Many-particle limit and Wasserstein gradient flow

In the specific case of gradient flows of lower bounded functions, we can derive estimates that imply that T=T=\infty (even if FmF_{m} is not globally semiconvex). Indeed, for all t>0t>0, it holds

by Jensen’s inequality. Since FmF_{m} is lower bounded, this proves that the gradient flow has bounded length on bounded time intervals. By compactness, if TT was finite then u(T)\mathbf{u}(T) would exist, thus contradicting the maximality of TT, hence T=T=\infty and the gradient flow is globally defined.

B.2 Link between classical and Wasserstein gradient flows

We first give a rigorous definition of the continuity equation which appear in the definition of Wasserstein gradient flows (Definition 2.4).

As we show now, there is a precise link between classical and Wasserstein gradient flow (Definitions 2.2 and 2.4). This is a simple result but might be instructive for readers who are not familiar with the concept of distributional solutions of partial differential equations.

which precisely means that (μm,t)t0(\mu_{m,t})_{t\geq 0} is a distributional solution to (6). ∎

Note that (μm,t)t(\mu_{m,t})_{t} has the same number of atoms throughout the dynamic. In particular, if no minimizer of FF is an atomic measure with at most mm atoms, then (μm,t)t(\mu_{m,t})_{t} is guaranteed to not converge to a minimizer.

B.2.1 Properties of the Wasserstein gradient flow (proof of Proposition 2.5)

Under Assumptions 2.1, suppose that V=0V=0. For all r>0r>0, F(r)F^{(r)} is proper and continuous for W2W_{2} on its closed domain. Moreover,

there exists λr>0\lambda_{r}>0 such that for all admissible transport plan γ\gamma, considering the transport interpolation μtγ((1t)π1+tπ2)#γ\mu^{\gamma}_{t}\coloneqq((1-t)\pi^{1}+t\pi^{2})_{\#}\gamma, the function tF(μtγ)t\mapsto F(\mu^{\gamma}_{t}) is differentiable with a λrC22(γ)\lambda_{r}C_{2}^{2}(\gamma)-Lipschitz derivative;

if and only if v(u)(F(μ)+ιQr)(u)v(u)\in\partial(F^{\prime}(\mu)+\iota_{Q_{r}})(u) for μ\mu-almost every uΩu\in\Omega, where ιQr\iota_{Q_{r}} is the convex function on Ω\Omega that is worth on QrQ_{r} and \infty outside.

First, it is clear that FF is proper because F(r)(δu0)=R(Φ(u0))F^{(r)}(\delta_{u_{0}})=R(\Phi(u_{0})) is finite whenever u0Qru_{0}\in Q_{r}. It is moreover continuous (see Lemma A.1) on its closed domain {μP2(Ω)  ;  μ(Qr)=1}\{\mu\in\mathcal{P}_{2}(\Omega)\;;\;\mu(Q_{r})=1\}.

Let us denote h(t):=F(r)(μtγ)h(t):=F^{(r)}(\mu^{\gamma}_{t}). Since dRdR and dΦd\Phi are Lipschitz on Fr\mathcal{F}_{r} and QrQ_{r} respectively, h(t)h(t) is differentiable with

where we used Hölder’s inequality to obtain C12(γ)C22(γ)C_{1}^{2}(\gamma)\leq C_{2}^{2}(\gamma) is the last line. On the other hand,

As a consequence, hh^{\prime} is λrC22(γ)\lambda_{r}\cdot C_{2}^{2}(\gamma) Lipschitz with λr=LdRdΦ,r2+LdΦdR,r\lambda_{r}=L_{dR}\|d\Phi\|_{\infty,r}^{2}+L_{d\Phi}\|dR\|_{\infty,r}. In particular, using the notions defined in , F(r)F^{(r)} is (λr)(-\lambda_{r})-geodesically semiconvex. Remark that these bounds may explode when rr goes to infinity: this explains why we work with measures supported on QrQ_{r}.

where we used Hölder’s inequality for the bound C12(γ)C22(γ)C_{1}^{2}(\gamma)\leq C_{2}^{2}(\gamma). As a consequence,

The previous properties are sufficient to guarantee that Wasserstein gradient flows for the functionals F(r)F^{(r)} are well defined.

We are in position to prove the well-posedness of Wasserstein gradient flows for the original functional FF. Notice that, by the characterization in Lemma B.3, the Wasserstein gradient flows for the functions F(r)F^{(r)} all coincide for r>r0>0r>r_{0}>0 on [0,T][0,T] if μt(2r0)\mu^{(2r_{0})}_{t} is concentrated in Qr0Q_{r_{0}} for all t[0,T]t\in[0,T]. Our strategy is thus simply to make sure that for all time TT, such a r0>0r_{0}>0 exists, i.e. to make sure that the support of gradient flows does not grow too fast.

Let r0r_{0} be such that μ0\mu_{0} is concentrated on Qr0Q_{r_{0}}. Given Lemma B.3, for all r>r0r>r_{0}, there exists a unique, globally defined, Wasserstein gradient flow (μt(r))t0(\mu^{(r)}_{t})_{t\geq 0} for F(r)F^{(r)}. For all r>r0r>r_{0}, consider the first exit time from QrQ_{r}:

Note that the definition of trt_{r} involves the flow (μt2r)t(\mu_{t}^{2r})_{t} but in fact, for all rˉ>r\bar{r}>r and 0ttr0\leq t\leq t_{r}, it holds μt(2r)=μt(rˉ)\mu_{t}^{(2r)}=\mu_{t}^{(\bar{r})} by the uniqueness in Lemma B.3. Thus, if tr>0t_{r}>0, we have existence and uniqueness of a Wasserstein gradient flow in the sense of Definition 2.4 on [0,tr][0,t_{r}]. It only remains to show that limrtr=\lim_{r\to\infty}t_{r}=\infty so that the gradient flow can be defined at all times.

Let us now add a useful representation lemma for the Wasserstein gradient flow as the pushforward of μ0\mu_{0} by the flow of the velocity fields.

Then XX is uniquely well-defined, continuous, X(t,)X(t,\cdot) is Lipschitz on QrQ_{r}, uniformly on compact time intervals for all r>0r>0, and it holds μt=(Xt)#μ0\mu_{t}=(X_{t})_{\#}\mu_{0}.

The claims concerning XX are classical and follow from the fact that vtv_{t} satisfies a one-sided Lispchitz property on QrQ_{r}, uniformly on compact time intervals [3, Lemma 8.1.4]. The expression as a pushforward is also a general property of the continuity equation, see [3, Prop. 8.1.8]. ∎

B.3 Proof of the many-particle limit (Theorem 2.6)

While we could rely on abstract stability results for Wasserstein gradient flows [3, Thm.11.2.1 (Stability)] our proof is direct and uses basic arguments. It also gives an independent argument for the existence of Wasserstein gradient flows, distinct from the standard one : it involves a discretization in space instead of the classical discretization in time.

We first show that, at least on a small time interval [0,tr][0,t_{r}], the paths are contained in QrQ_{r} for some r>r0r>r_{0}. Let us introduce trt_{r} the first exit time from QrQ_{r}

In order to show that trt_{r} is strictly positive, it is sufficient to bound the velocity of individual particles before trt_{r}. Consider LV,rL_{V,r} the Lipschitz constant of VV on QrQ_{r}. Given the expression of the velocity of each particle (given in Eq. (5)) and the minimum travel distance rr0r-r_{0} required to exit QrQ_{r}, we obtain the lower bound on the exit time tr(rr0)/(dΦ,rdR,r+LV,r)>0t_{r}\geq(r-r_{0})/(\|d\Phi\|_{\infty,r}\|dR\|_{\infty,r}+L_{V,r})>0.

Let us now work on the time interval [0,tr][0,t_{r}] and prove the existence of a limit curve tμtt\mapsto\mu_{t} in the space P2(Θ)\mathcal{P}_{2}(\Theta) using standard estimates for gradient flows and compactness. Our starting point is the bound, for 0t1<t2tr0\leq t_{1}<t_{2}\leq t_{r},

which follows by matching each particle at t1t_{1} to its future position at t2t_{2}, and by Jensen’s inequality. Recalling the identity 1mi=1mum,i(t)2=ddtF(μm,t)\frac{1}{m}\sum_{i=1}^{m}|\mathbf{u}_{m,i}^{\prime}(t)|^{2}=-\frac{d}{dt}F(\mu_{m,t}) from Proposition 2.3, it follows

and thus the family of curves (tμm,t)m(t\mapsto\mu_{m,t})_{m} is equicontinuous in W2W_{2} on [0,tr][0,t_{r}], uniformly in mm. Moreover, for all t[0,tr]t\in[0,t_{r}], the family (μm,t)m(\mu_{m,t})_{m} lies in a W2W_{2} ball, as such weakly precompact (but a priori not W2W_{2}-precompact). Since the weak topology is weaker than the topology of W2W_{2}, by Ascoli theorem, we can extract a subsequence converging weakly to a curve (μt)t0(\mu_{t})_{t\geq 0} continuous in the weak topology, which is concentrated in QrQ_{r} at all time. We have also uniform convergence in the Bounded Lipschitz metric, which metrizes weak convergence of probability measures. In the following we only consider this subsequence, still denoted by (μm)m(\mu_{m})_{m}.

We first prove that the first term in (9) tends to . Since all (μm,t)m,t(\mu_{m,t})_{m,t} are concentrated on QrQ_{r}, it is sufficient to show that the sequence of velocity fields (t,u)vm,t(u)(t,u)\mapsto v_{m,t}(u) converges uniformly on [0,tr]×Qr[0,t_{r}]\times Q_{r} to (t,u)vt(u)(t,u)\mapsto v_{t}(u). We have, using the fact that a projection on a convex set is 11-Lipschitz,

Moreover, we have for all t[0,tr]t\in{[0,t_{r}]},

So far, we have shown the convergence, up to a subsequence, to a Wasserstein gradient flow on [0,tr][0,t_{r}]: it remains to show that limrtr=\lim_{r\to\infty}t_{r}=\infty. Since F(μm,0)F(μ0)F(\mu_{m,0})\to F(\mu_{0}) and all paths (μt,m)t(\mu_{t,m})_{t} decrease monotonically the value of FF, everything lies in a sublevel of RR, where dRdR is bounded. It follows that a uniform bound on the velocity of the particles with linear growth in rr is available and, by Grönwall’s inequality, we obtain that limrtr=\lim_{r\to\infty}t_{r}=\infty, just as in the end of the proof of Proposition 2.5. The theorem follows by combining this result with the uniqueness stated in Proposition 2.5.

Appendix C Convergence to global minimizers

We give in this section a proof of Theorems 3.3 and 3.5. All results have two versions: one in the 22-homogeneous setting (Assumptions 3.2) and its counterpart in the partially 11-homogeneous setting (Assumptions 3.4). We have displayed in Figure 4 the level sets of functions with these homogeneity properties, in order to highlight the differences between these two cases. The proofs tend to be more straightforward in the 22-homogeneous setting and they can be read independently of the other case. This section is organized as follows:

In Section C.1, we justify the global optimality conditions.

We give in Section C.2 a criteria for Wasserstein gradient flows to escape from neighborhoods of non-optimal stationary points, and we also characterize measures that can be limits of Wasserstein gradient flows. These results are valid for arbitrary initializations.

In Section C.3, we prove that the assumption on the support of the initialization made in Theorems 3.3 and 3.5 is preserved by Wasserstein gradient flows.

All these facts combined lead to a proof of Theorems 3.3 and 3.5 in Section C.4.

It will be often the case in the statements and in the proofs that they involve the projection hi(μ)h^{i}(\mu) of a probability measure μP(Ω)\mu\in\mathcal{P}(\Omega) (with i=1,2i=1,2) (introduced in Section A.2) instead of μ\mu itself. This is motivated by two facts: (i) this projected measure it generally the object of interest in the optimization problem as it clears the redundancy caused by homogeneity and (ii) the assumptions that the projection hi(μt)h^{i}(\mu_{t}) of a Wasserstein gradient flow converges is more reasonable than the convergence in W2W_{2} of the original gradient flow, where generally no compactness is available.

Let us first remark that, by a first order Taylor expansion of RR, we have that for all μ,σM(Ω)\mu,\sigma\in\mathcal{M}(\Omega) with F(μ),F(σ)<F(\mu),F(\sigma)<\infty, it holds F(μ)dσ<\int|F^{\prime}(\mu)|d\sigma<\infty and

Let μ,νM+(Ω)\mu,\nu\in\mathcal{M}_{+}(\Omega) be such that F(ν),F(μ)<F(\nu),F(\mu)<\infty, consider σ:=νμ\sigma:=\nu-\mu and its Lebesgue decomposition σ=fμ+σ\sigma=f\mu+\sigma^{\perp} where fL1(μ)f\in L^{1}(\mu), δM+(Ω)\delta^{\perp}\in\mathcal{M}_{+}(\Omega) is singular to μ\mu (see [10, Thm. 4.3.2]). Clearly, by the above first order formula, it is necessary to have F(μ)0F^{\prime}(\mu)\geq 0 everywhere with equality μ\mu-a.e., for μ\mu to be a minimizer. It is also sufficient since in this case we have, by convexity,

C.2 A criteria to escape from non-optimal stationary points

We now give a criteria for Wasserstein gradient flows to escape from non-optimal stationary points. It is valid both in the finite-particle regime and in the many-particle limit. Such a result supports the idea that, even in the finite-particle case (i.e. classical gradient flows), the point of view using measures is natural.

Such a set is given by A={rθ  ;  r]0,[ and θK}A=\{r\theta\;;\;r\in{]0,\infty[}\text{ and }\theta\in K\} where KK is the (η)(-\eta)-sublevel set of the restriction of F(μ)F^{\prime}(\mu) to the unit sphere, for some η>0\eta>0 that can be chosen arbitrarily close to .

We now give a general property of the stationary points.

C.2.2 The partially 111-homogeneous case

For the partially 11-homogeneous case, we consider the operator h1:M+(Ω)M(Θ)h^{1}:\mathcal{M}_{+}(\Omega)\to\mathcal{M}(\Theta) defined in Appendix A.2.

where the last bound is due to the fact that uf,ϕ(u)u\mapsto\langle f,\phi(u)\rangle is ϕC1\|\phi\|_{C^{1}}-Lipschitz and upper bounded in norm by ϕC1\|\phi\|_{C^{1}} whenever fFf\in\mathcal{F} satisfies f1\|f\|\leq 1. ∎

As for the 22-homogeneous case, we give a general property of the stationary points.

Under Assumptions 3.4, let (μt)t(\mu_{t})_{t} be a Wasserstein gradient flow of FF. If h1(μt)h^{1}(\mu_{t}) converges weakly to νM+(Θ)\nu\in\mathcal{M}_{+}(\Theta), then F(ν)F^{\prime}(\nu) vanishes ν\nu-a.e.

C.3 Stability of separation properties

Here we prove the fact that the separation properties of the support used in Theorems 3.5 and 3.3 are preserved by Wasserstein gradient flows. We give a proof based on topological degree theory: this tool allows to cover the case of discontinuous velocity fields, which appear when VV is non-differentiable. In a more regular setting, the facts that follow are easier to prove because then, μt\mu_{t} is the pushforward of μ0\mu_{0} by a homeomorphism. Let us give a definition of the topological degree sufficient to our setting.

If A1,A2A_{1},A_{2} are disjoint open subsets of AA and yf(A(A1A2))y\notin f(\overline{A}\setminus(A_{1}\cup A_{2})) then deg(f,A,y)=deg(f,A1,y)+deg(f,A2,y)\deg(f,A,y)=\deg(f,A_{1},y)+\deg(f,A_{2},y).

These properties characterize a uniquely well-defined map deg\deg from the set of triplets (f,A,y)(f,A,y) as above to the set of signed integers [8, Thm. 1-2]. Intuitively, it gives an algebraic count of the number of solutions to f(x)=yf(x)=y for xAx\in A, where algebraic means that a solution xx counts as +1+1 if ff preserves orientation around xx and 1-1 otherwise.

The following lemma shows that taking the support of a measure and its pushforward by a continuous map are operations that almost commute. They commute for instance if the map is closed (i.e. maps closed sets to closed set).

Let yf(sptμ)y\in f(\operatorname{spt}\mu) and V\mathcal{V} a neighborhood of yy. By continuity, f1(V)f^{-1}(\mathcal{V}) is the neighborhood of a point in sptμ\operatorname{spt}\mu so 0<μ(f1(V))=f#μ(V)0<\mu(f^{-1}(\mathcal{V}))=f_{\#}\mu(\mathcal{V}), hence ysptf#μy\in\operatorname{spt}f_{\#}\mu so f(sptμ)sptf#μf(\operatorname{spt}\mu)\subset\operatorname{spt}f_{\#}\mu. Conversely, let yf(sptμ)cy\in\overline{f(\operatorname{spt}\mu)}^{c} and let V\mathcal{V} a neighborhood of yy that does not intersect f(sptμ)\overline{f(\operatorname{spt}\mu)}. This neighborhood satisfies f1(V)(sptμ)cf^{-1}(\mathcal{V})\subset(\operatorname{spt}\mu)^{c}, so it holds f#μ(V)=μ(f1(V))μ((sptμ)c)=0f_{\#}\mu(\mathcal{V})=\mu(f^{-1}(\mathcal{V}))\leq\mu((\operatorname{spt}\mu)^{c})=0. Hence y(sptf#μ)cy\in(\operatorname{spt}f_{\#}\mu)^{c} so f(sptμ)c(sptf#μ)c\overline{f(\operatorname{spt}\mu)}^{c}\subset(\operatorname{spt}f_{\#}\mu)^{c} which implies sptf#μf(sptμ)\operatorname{spt}f_{\#}\mu\subset\overline{f(\operatorname{spt}\mu)}. ∎

We first state the property and the stability result that we wish to establish in the 22-homogeneous setting.

Under Assumptions 3.2, let (μt)t0(\mu_{t})_{t\geq 0} be a Wasserstein gradient flow of FF. If the support of μ0\mu_{0} satisfies Property C.9, so does the support of μt\mu_{t}, for all t>0t>0.

Note that this property is generally lost in the limit tt\to\infty. This lemma is a consequence of the following, more abstract proposition, that deals with sets instead of measures. The reader can keep in mind that we will apply this result with XX being the flow of the velocity field introduced in Lemma B.4 and KK being the support of μ0\mu_{0}.

C.3.2 The partially 111-homogeneous case

Here are the analogous separation property and stability lemma for the partially 11-homogeneous case.

Under Assumptions 3.4, let (μt)t0(\mu_{t})_{t\geq 0} be a Wasserstein gradient flow of FF. If the support of μ0\mu_{0} satisfies Property C.12, then so does the support of μt\mu_{t}, for all t>0t>0.

Similarly as above, we first prove an abstract topological result.

C.4 Main theorems: proofs and generalization

First, let us state a lemma that relates the convergence of the Wasserstein gradient flows to an asymptotic property for the classical gradient flows, when m,tm,t\to\infty. This result is used in the last claims of Theorems 3.3 and 3.5.

Under Assumptions 2.1, let (μt)(\mu_{t}) be a Wasserstein gradient flow which initialization is concentrated on a set Qr0Q_{r_{0}} and such that F(μt)FF(\mu_{t})\to F^{*}. If (μ0,m)m(\mu_{0,m})_{m} is a sequence of measures concentrated on a set Qr0Q_{r_{0}} that converges to μ0\mu_{0} in W2W_{2}, then

Under the assumptions of Theorem 3.3, if h2(μt)h^{2}(\mu_{t}) converges weakly, then its limit is a global minimizer of FF over M+(Ω)\mathcal{M}_{+}(\Omega) and limtF(μt)=F\lim_{t\to\infty}F(\mu_{t})=F^{*}.

This statement is stronger than Theorem 3.3: indeed, if μt\mu_{t} converges for the Wasserstein metric, then h2(μt)h^{2}(\mu_{t}) converges weakly (but the converse is generally not true).

C.4.2 The partially 111-homogeneous case

Again, we prove a statement in terms of the projected measures: Theorem 3.5 can be deduced as an immediate corollary. Some highlights of the proof are given in Figure 5.

Under the assumptions of Theorem 3.5, if h1(μt)h^{1}(\mu_{t}) converges weakly, then its limit is a global minimizer of FF over M+(Ω)\mathcal{M}_{+}(\Omega) and limtF(μt)=F\lim_{t\to\infty}F(\mu_{t})=F^{*}.

In the unfavorable case encountered in the proof of Theorem C.17, we had to invoke the following lemma. It has a different nature than the other results of this paper because it relies on an explicit integration of the trajectories of the gradient flow, which means that it depends on the choice of the metric.

Consider, for a measure νM(Θ)\nu\in\mathcal{M}(\Theta), a point θ0Θ\theta_{0}\in\Theta such that gν(θ)=0|\nabla g_{\nu}(\theta)|=0 and gν(θ)ηg_{\nu}(\theta)\leq-\eta for some η>0\eta>0. For any M>0M>0 and r0>0r_{0}>0, there exists T,ϵ>0T,\epsilon>0 such that if (μt)t(\mu_{t})_{t} is a Wasserstein gradient flow of FF that satisfies for all t[0,T]t\in[0,T], gμtgνC1ϵ\|g_{\mu_{t}}-g_{\nu}\|_{C^{1}}\leq\epsilon and denoting (w(t),θ(t))(w(t),\theta(t)) the solution of the flow of Lemma B.4 starting from (w0,θ0)(w_{0},\theta_{0}) with w0[M,0]w_{0}\in[-M,0], it holds w(T)=0w(T)=0 and θ(T)θ0<r0|\theta(T)-\theta_{0}|<r_{0}.

The Lipschitz regularity of gνg_{\nu} and its derivative implies that there exists L>0L>0 such that max{gν(θ)gν(θ0),gν(θ)gν(θ0)}Lθθ0\max\{|g_{\nu}(\theta)-g_{\nu}(\theta_{0})|,|\nabla g_{\nu}(\theta)-\nabla g_{\nu}(\theta_{0})|\}\leq L|\theta-\theta_{0}| for all θΘ\theta\in\Theta. Without loss of generality, we assume that r0<η/(4L)r_{0}<\eta/(4L). Consider ϵ]0,η/4[\epsilon\in{]0,\eta/4[} and assume that there exists Tˉ>0\bar{T}>0 such that gμtgνC1ϵ\|g_{\mu_{t}}-g_{\nu}\|_{C^{1}}\leq\epsilon for t[0,Tˉ]t\in[0,\bar{T}]. Writing q(t)=θ(t)θ0q(t)=|\theta(t)-\theta_{0}|, it holds for t[0,Tˉ]t\in[0,\bar{T}],

In particular, if we can make sure that q(t)<rˉ|q(t)|<\bar{r} for t[0,Tˉ]t\in[0,\bar{T}] and if Tˉ>2/η\bar{T}>2/\eta then, as (dw/dt)η/2(dw/dt)\geq\eta/2 on this interval, there exists T<2/ηT<2/\eta such that w(T)=0w(T)=0.

It remains to make sure that we indeed have q(t)<rˉ|q(t)|<\bar{r} for t[0,T]t\in[0,T], by adjusting if necessary the value of ϵ\epsilon. Parametrizing in ww instead of tt (it is an admissible reparametrization thanks to the positive lower bound on its derivative), we get

Thus, choosing ϵ<Lr0/(exp(Lw02/η)1)\epsilon<Lr_{0}/(\exp(Lw_{0}^{2}/\eta)-1), it is guaranteed that q(t)r0q(t)\leq r_{0} for t[0,T]t\in[0,T]. ∎

C.5 Remarks

We conclude this theoretical section with two opening remarks related to the global convergence theorems.

In the statements of Theorems 3.3 and 3.5, the convergence of the Wasserstein gradient flow comes as an assumption. In order to prove convergence of gradient flows, one generally needs two properties: (i) compactness of the trajectories and (ii) a so-called Łojasiewicz inequality which, intuitively, controls how much a function flattens around its critical points. As compactness in W2W_{2} is a very strong requirement, we have relaxed the topology where convergence is required to obtain more reasonable assumptions. Yet, even when a gradient flow lies in a compact set, there are some cases where it does not converge. There has been recent progress on related issues with the study of Łojasiewicz inequalities in Wasserstein space , but to our knowledge, no general result is known in our non-geodesically convex case.

We stress that Propositions C.1 and C.4 provide with an intuitive criterion for a particle gradient flow to escape local minimum: roughly, it is sufficient that, when it passes close to a local minimum, at least one particle belongs to a -sublevel set of the current potential F(μ)F^{\prime}(\mu). In this paper we exploit this property by studying the many-particle limit, but other approaches are worth exploring. For instance, we could estimate the size of this sublevel set in specific cases, and use it as an indication for the particle-complexity to attain global minimizers. A discussion on a specific example is given in Section D.5.

Appendix D Case studies and numerical experiments

In this section, we verify the assumptions for the examples treated in Section 4.

If rr is convex in the second variable, then RR is convex. If rr is differentiable in the second variable with 2r\partial_{2}r Lipschitz, uniformly in the first variable, then RR is differentiable with differential dRdR Lipschitz. If moreover 2r2C1r+C2|\partial_{2}r|^{2}\leq C_{1}r+C_{2} for some constants C1,C2>0C_{1},C_{2}>0, then dRdR is bounded on sublevel sets.

It is direct to see that dRdR is LL-Lipschitz in the operator norm. Finally, if 2r2C1r+C2|\partial_{2}r|^{2}\leq C_{1}r+C_{2}, then

D.2 Sparse deconvolution

D.3 Neural network: sigmoid activation

D.4 Neural network: ReLU activation

Although we do not use this fact explicitly in the paper, it is interesting to note that the regularizing potential V:(w,θ)wθV:(w,\theta)\mapsto|w|\cdot|\theta| is admissible in the 22-homogeneous setting of Assumptions 3.2: although it is not differentiable nor convex, it is positively 22-homogeneous and semiconvex.

The homogeneity property is clear, and to see that VV is semi-convex, it is sufficient to remark that

is convex, since it is the square of a norm. ∎

D.4.2 A differentiable parameterization

We now consider the alternative parameterization considered in Proposition 4.3, defined as Φ(θ):xσ(s(θ)(x,1))\Phi(\theta):x\mapsto\sigma(s(\theta)\cdot(x,1)) where σ(t)=max{t,0}\sigma(t)=\max\{t,0\} and ss is the signed square function s(t)=tt=sign(t)t2s(t)=t|t|=\operatorname{sign}(t)\cdot t^{2} that acts entry-wise. As Φ\Phi is clearly positively 22-homogeneous so we just have to prove the differentiability of Φ\Phi, which is done with the same technique as in Lemma D.3.

Note that the condition on the moments of ρx\rho_{x} is less strong for ReLU activation than for sigmoids in Lemma D.2: this comes from the fact that ReLU is piece-wise linear. Similarly as what explained in the end of Section D.3, it is difficult to verify the Sard-type regularity assumption so it is left as an assumption in Proposition 4.3.

D.5 Numerical experiments : details and additional results

Animated plots of the particle gradient flows shown in this article may be found online at https://lchizat.github.io/PGF.htmlThese videos appear at this place in the official supplementary material of the NIPS 2018 publication, but had to be removed from the present version due to software incompatibility.

Here we give more details on the numerical experiments behind Figure 3.

For the leftmost panel, the setting is similar to that of Figure 1: for each realization, 55 spikes are randomly distributed on the 11-torus (with a minimum separation of 0.10.1) with random weights between 0.50.5 and 1.51.5 and a small noise is added to the filtered signal. Then for each choice of mm, we initialize mm particles on a regular grid on {0}×Θ\{0\}\times\Theta and integrate the particle gradient flow with the forward-backward algorithm until the improvement per iteration is below a small tolerance threshold.

For the center panel, the setting is similar to that of Figure 2, but here in dimension d=100d=100. The data is normally distributed and the ground truth labels are generated by a similar neural network with 2020 neurons (with random normally distributed parameters). The objective function is the square loss without regularization, so the global minimum corresponds to a loss. We optimize using SGD with fresh samples at each iteration.

The rightmost panel shows, similarly, the particle-complexity for training a neural network with a single hidden layer and sigmoid activation function, in dimension d=100d=100. The data is distributed on a sphere and the ground truth labels are generated by a similar neural network with 2020 neurons with random normal weights. Again, we minimize with SGD the square loss without regularization and the global minimum corresponds to a loss.

We compare the performance with the method of simply minimizing on the weights with the same initialization. This is a convex problem, and the minimum value attained does not depend on the minimization method. We plot for each case the final excess loss as a function of mm for several random realizations of the experiment and, for each value of mm, its geometric average over all realizations. We have indicated in transparent green the area of loss values which should be interpreted as “optimal” but are not exactly because the optimization has been stopped in finite time and the loss is not known exactly but estimated through sampling.

In all previous numerical experiments dealing with the partially 11-homogeneous case, we have initialized the particle gradient flow on a discretization of {0}×Θ\{0\}\times\Theta. But Theorem 3.5 allows for a large variety of initialization patterns. In this paragraph, we comment on the various possibilities and explain how the proof of Theorem 3.5 helps understanding why the corresponding particle-complexity is impacted.

We display on Figure 6 a sparse spikes deconvolution experiment, in a similar setting than in Figure 1, but with different initializations. For this problem, where m0=5m_{0}=5 spikes are to be recovered, we have observed numerically that the particle gradient flows initialized on a uniform grid on {0}×Θ\{0\}\times\Theta succeed in finding a global minimizer as soon as there are more than m=7m=7 particles. In the first panel of Figure 6, the particle gradient flow with m=15m=15 particles initialized on {1}×Θ\{1\}\times\Theta fails at finding a minimizer and a larger number of particles is needed for success (as shown in the center panel, with m=30m=30).