Computational Optimal Transport

Gabriel Peyré, Marco Cuturi

Chapter 1 Introduction

The shortest path principle guides most decisions in life and sciences: When a commodity, a person or a single bit of information is available at a given point and needs to be sent at a target point, one should favor using the least possible effort. This is typically reached by moving an item along a straight line when in the plane or along geodesic curves in more involved metric spaces. The theory of optimal transport generalizes that intuition in the case where, instead of moving only one item at a time, one is concerned with the problem of moving simultaneously several items (or a continuous distribution thereof) from one configuration onto another. As schoolteachers might attest, planning the transportation of a group of individuals, with the constraint that they reach a given target configuration upon arrival, is substantially more involved than carrying it out for a single individual. Indeed, thinking in terms of groups or distributions requires a more advanced mathematical formalism which was first hinted at in the seminal work of Monge (1781). Yet, no matter how complicated that formalism might look at first sight, that problem has deep and concrete connections with our daily life. Transportation, be it of people, commodities or information, very rarely involves moving only one item. All major economic problems, in logistics, production planning or network routing, involve moving distributions, and that thread appears in all of the seminal references on optimal transport. Indeed Tolstoı (1930), Hitchcock (1941) and Kantorovich (1942) were all guided by practical concerns. It was only a few years later, mostly after the 1980s, that mathematicians discovered, thanks to the works of Brenier (1991) and others, that this theory provided a fertile ground for research, with deep connections to convexity, partial differential equations and statistics. At the turn of the millenium, researchers in computer, imaging and more generally data sciences understood that optimal transport theory provided very powerful tools to study distributions in a different and more abstract context, that of comparing distributions readily available to them under the form of bags-of-features or descriptors.

Several reference books have been written on optimal transport, including the two recent monographs by Villani (2003; 2009), those by Rachev and Rüschendorf (1998a; 1998b) and more recently that by Santambrogio (2015). As exemplified by these books, the more formal and abstract concepts in that theory deserve in and by themselves several hundred pages. Now that optimal transport has gradually established itself as an applied tool (for instance, in economics, as put forward recently by Galichon (2016)), we have tried to balance that rich literature with a computational viewpoint, centered on applications to data science, notably imaging sciences and machine learning. We follow in that sense the motivation of the recent review by Kolouri et al. (2017) but try to cover more ground. Ultimately, our goal is to present an overview of the main theoretical insights that support the practical effectiveness of OT and spend more time explaining how to turn these insights into fast computational schemes. The main body of Chapters 2, 3, 4, 9, and 10 is devoted solely to the study of the geometry induced by optimal transport in the space of probability vectors or discrete histograms. Targeting more advanced readers, we also give in the same chapters, in light gray boxes, a more general mathematical exposition of optimal transport tailored for discrete measures. Discrete measures are defined by their probability weights, but also by the location at which these weights are defined. These locations are usually taken in a continuous metric space, giving a second important degree of freedom to model random phenomena. Lastly, the third and most technical layer of exposition is indicated in dark gray boxes and deals with arbitrary measures that need not be discrete, and which can have in particular a density w.r.t. a base measure. This is traditionally the default setting for most classic textbooks on OT theory, but one that plays a less important role in general for practical applications. Chapters 5 to 8 deal with the interplay between continuous and discrete measures and are thus targeting a more mathematically inclined audience.

The field of computational optimal transport is at the time of this writing still an extremely active one. There are therefore a wide variety of topics that we have not touched upon in this survey. Let us cite in no particular order the subjects of distributionally robust optimization (Shafieezadeh Abadeh et al., 2015; Esfahani and Kuhn, 2018; Lee and Raginsky, 2018; GAO et al., 2018), in which parameter estimation is carried out by minimizing the worst posssible empirical risk of any data measure taken within a certain Wasserstein distance of the input data; convergence of the Langevin Monte Carlo sampling algorithm in the Wasserstein geometry (Dalalyan and Karagulyan, 2017; Dalalyan, 2017; Bernton, 2018); other numerical methods to solve OT with a squared Euclidian cost in low-dimensional settings using the Monge-Ampère equation (Froese and Oberman, 2011; Benamou et al., 2014; Sulman et al., 2011) which are only briefly mentioned in Remark 2.25.

n\llbracket n\rrbracket: set of integers {1,,n}\{1,\dots,n\}.

(α,β)(\alpha,\beta): measures, defined on spaces (X,Y)(\mathcal{X},\mathcal{Y}).

T:XYT:\mathcal{X}\rightarrow\mathcal{Y}: Monge map, typically such that Tα=βT_{\sharp}\alpha=\beta.

(αt)t=01(\alpha_{t})_{t=0}^{1}: dynamic measures, with αt=0=α0\alpha_{t=0}=\alpha_{0} and αt=1=α1\alpha_{t=1}=\alpha_{1}.

vv: speed for Benamou–Brenier formulations; J=αvJ=\alpha v: momentum.

ss: flow for W1\operatorname{\mathcal{W}}_{1}-like problem (optimization under divergence constraints).

λΣS\lambda\in\Sigma_{S}: weight vector used to compute the barycenters of SS measures.

,\langle\cdot,\,\cdot\rangle: for the usual Euclidean dot-product between vectors; for two matrices of the same size AA and BB, A,B=\mboxdef.tr(AB)\langle A,\,B\rangle\stackrel{{\scriptstyle\mbox{\tiny def.}}}{{=}}\operatorname{tr}(A^{\top}B) is the Frobenius dot-product.

Chapter 2 Theoretical Foundations

A large part of this review focuses exclusively on the study of the geometry induced by optimal transport on the simplex.

2 Assignment and Monge Problem

Note that the optimal assignment problem may have several optimal solutions. Suppose, for instance, that n=m=2n=m=2 and that the matrix C is the pairwise distance matrix between the four corners of a 2-D square of side length 11, as represented in the left plot of Figure 2.2. In that case only two assignments exist, and they are both optimal.

3 Kantorovich Relaxation

The assignment problem, and its generalization found in the Monge problem laid out in Remark 2.4, is not always relevant to studying discrete measures, such as those found in practical problems. Indeed, because the assignment problem is formulated as a permutation problem, it can only be used to compare uniform histograms of the same size. A direct generalization to discrete measures with nonuniform weights can be carried out using Monge’s formalism of push-forward maps, but that formulation may also be degenerate in the absence of feasible solutions satisfying the mass conservation constraint (2.4) (see the end of Remark 2.4). Additionally, the assignment problem (2.5) is combinatorial, and the feasible set for the Monge problem (2.9), despite being continuously parameterized as the set consisting in all push-forward measures that satisfy the mass conservation constraint, is nonconvex. Both are therefore difficult to solve when approached in their original formulation.

where we used the following matrix-vector notation:

This is a linear program (see Chapter 3), and as is usually the case with such programs, its optimal solutions are not necessarily unique.

which shows that the assignment problem (2.2) can be recast as a Kantorovich problem (2.11) where the couplings P are restricted to be exactly permutation matrices:

4 Metric Properties of Optimal Transport

An important feature of OT is that it defines a distance between histograms and probability measures as soon as the cost matrix satisfies certain suitable properties. Indeed, OT can be understood as a canonical way to lift a ground distance between points to a distance between histogram or measures.

We first consider the case where, using a term first introduced by Rubner et al. (2000), the “ground metric” matrix C is fixed, representing substitution costs between bins, and shared across several histograms we would like to compare. The following proposition states that OT provides a valid distance between histograms supported on these bins.

The first inequality is due to the suboptimality of \SS\SS, the second is the triangle inequality for elements in D, and the third comes from Minkowski’s inequality. One thus has

5 Dual Problem

The Kantorovich problem (2.11) is a constrained convex minimization problem, and as such, it can be naturally paired with a so-called dual problem, which is a constrained concave maximization problem. The following fundamental proposition explains the relationship between the primal and dual problems.

The Kantorovich problem (2.11) admits the dual

where the set of admissible dual variables is

Such dual variables are often referred to as “Kantorovich potentials.”

We exchange the min and the max above, which is always possible when considering linear programs (in finite dimension), to obtain

The primal-dual optimality relation for the Lagrangian (2.22) allows us to locate the support of the optimal transport plan (see also §3.3)

Setting prices. Note that the pricing system used by the vendor allows quite naturally for arbitrarily negative prices. Indeed, if the vendor applies a price vector f for warehouses and a price vector g for factories, then the total bill will not be changed by simultaneously decreasing all entries in f by an arbitrary number and increasing all entries of g by that same number, since the total amount of resources in all warehouses is equal to those that have to be delivered to the factories. In other words, the vendor can give the illusion of giving an extremely good deal to the operator by paying him to collect some of his goods, but compensate that loss by simply charging him more for delivering them. Knowing this, the vendor, wishing to charge as much as she can for that service, sets vectors f and g to be as high as possible.

and therefore observe that any attempt at doing the job by himself would necessarily be more expensive than the vendor’s price.

6 Special Cases

In general, computing OT distances is numerically involved. Before detailing in §§3,4, and 7 different numerical solvers, we first review special favorable cases where the resolution of the OT problem is relatively easy.

Chapter 3 Algorithmic Foundations

This chapter describes the most common algorithmic tools from combinatorial optimization and linear programming that can be used to solve the discrete formulation of optimal transport, as described in the primal problem (2.11) or alternatively its dual (2.20).

The origins of these algorithms can be traced back to World War II, either right before with Tolstoı’s seminal work (1930) or during the war itself, when Hitchcock (1941) and Kantorovich (1942) formalized the generic problem of dispatching available resources toward consumption sites in an optimal way. Both of these formulations, as well as the later contribution by Koopmans (1949), fell short of providing a provably correct algorithm to solve that problem (the cycle violation method was already proposed as a heuristic by Tolstoı (1939)). One had to wait until the field of linear programming fully blossomed, with the proposal of the simplex method, to be at last able to solve rigorously these problems.

The goal of linear programming is to solve optimization problems whose objective function is linear and whose constraints are linear (in)equalities in the variables of interest. The optimal transport problem fits that description and is therefore a particular case of that wider class of problems. One can argue, however, that optimal transport is truly special among all linear program. First, Dantzig’s early motivation to solve linear programs was greatly related to that of solving transportation problems (Dantzig, 1949, p. 210). Second, despite being only a particular case, the optimal transport problem remained in the spotlight of optimization, because it was understood shortly after that optimal transport problems were related, and in fact equivalent, to an important class of linear programs known as minimum cost network flows (Korte and Vygen, 2012, p. 213, Lem. 9.3) thanks to a result by Ford and Fulkerson (1962). As such, the OT problem has been the subject of particular attention, ever since the birth of mathematical programming (Dantzig, 1951), and is still widely used to introduce optimization to a new audience (Nocedal and Wright, 1999, §1, p. 4).

We have already introduced in Equation (2.11) the primal OT problem:

Therefore we can write the original optimal transport problem as

where the nmnm-dimensional vector c is equal to the stacked columns contained in the cost matrix C.

The dual problem corresponding to Equation (3.2) is, following duality in linear programming (Bertsimas and Tsitsiklis, 1997, p. 143) defined as

Note that this program is exactly equivalent to that presented in Equation (2.4).

We have therefore proved a weak duality result, namely that zzˉ\underline{z}\leq\bar{z}.

2 C-Transforms

We present in this section an important property of the dual optimal transport problem (3.3) which takes a more important meaning when used for the semidiscrete optimal transport problem in §5.1. This section builds upon the original formulation (2.20) that splits dual variables according to row and column sum constraints:

This result allows us first to reformulate the dual problem as a piecewise affine concave maximization problem expressed in a single variable f as

The following identities, in which the inequality sign between vectors should be understood elementwise, hold:

3 Complementary Slackness

The converse result is also true. We define first the idea that two variables for the primal and dual problems are complementary.

If a pair of feasible primal and dual variables is complementary, then we can conclude they are optimal.

4 Vertices of the Transportation Polytope

4.2 The North-West Corner Rule

5 A Heuristic Description of the Network Simplex

5.2 Network Simplex Update

5.3 Improvement of the Primal Solution

6 Dual Ascent Methods

7 Auction Algorithm

The auction algorithm was originally proposed by Bertsekas (1981) and later refined in (Bertsekas and Eckstein, 1988). Several economic interpretations of this algorithm have been proposed (see e.g. Bertsekas (1992)). The algorithm can be adapted for arbitrary marginals, but we present it here in its formulation to solve optimal assignment problems.

On the contrary, it is easy to show that if there exists a vector g and a permutation σ\sigma such that

The size of SS can only increase at each iteration.

Given a point jj the auction algorithm uses not only the optimum appearing in the usual C-transform but also a second best,

to define the following updates on g for an index iSi\notin S, as well as on SS and ξ\xi:

update SS and ξ\xi: If there exists an index iSi^{\prime}\in S such that ξi=ji1\xi_{i^{\prime}}=j^{1}_{i}, remove it by updating SS{i}S\leftarrow S\setminus\{i^{\prime}\}. Set ξi=ji1\xi_{i}=j^{1}_{i} and add ii to SS, SS{i}.S\leftarrow S\cup\{i\}.

The auction algorithm maintains ε\varepsilon-complementary slackness at each iteration.

The auction algorithm finds an assignment whose cost is nεn\varepsilon suboptimal.

Therefore by simple suboptimality of g we first have

The auction algorithm can therefore be regarded as an alternative way to use the machinery of C-transforms. Next we explore another approach grounded on regularization, the so-called Sinkhorn algorithm, which also bears similarities with the auction algorithm as discussed in (Schmitzer, 2016b).

Note finally that, on low-dimensional regular grids in Euclidean space, it is possible to couple these classical linear solvers with multiscale strategies, to obtain a significant speed-up (Schmitzer, 2016a; Oberman and Ruan, 2015).

Chapter 4 Entropic Regularization of Optimal Transport

This chapter introduces a family of numerical schemes to approximate solutions to Kantorovich formulation of optimal transport and its many generalizations. It operates by adding an entropic regularization penalty to the original problem. This regularization has several important advantages, which make it, when taken altogether, a very useful tool: the minimization of the regularized problem can be solved using a simple alternate minimization scheme; that scheme translates into iterations that are simple matrix-vector products, making them particularly suited to execution of GPU; for some applications, these matrix-vector products do not require storing an n×mn\times m cost matrix, but instead only require access to a kernel evaluation; in the case where a large group of measures share the same support, all of these matrix-vector products can be cast as matrix-matrix products with significant speedups; the resulting approximate distance is smooth with respect to input histogram weights and positions of the Diracs and can be differentiated using automatic differentiation.

The discrete entropy of a coupling matrix is defined as

Since the objective is an ε\varepsilon-strongly convex function, Problem (4.2) has a unique optimal solution. The idea to regularize the optimal transport problem by an entropic term can be traced back to modeling ideas in transportation theory (Wilson, 1969): Actual traffic patterns in a network do not agree with those predicted by the solution of the optimal transport problem. Indeed, the former are more diffuse than the latter, which tend to rely on a few routes as a result of the sparsity of optimal couplings for (2.11). To mitigate this sparsity, researchers in transportation proposed a model, called the “gravity” model (Erlander, 1980), that is able to form a more “blurred” prediction of traffic given marginals and transportation costs.

Defining the Kullback–Leibler divergence between couplings as

Indeed one has that using the definition above

2 Sinkhorn’s Algorithm and Its Convergence

The solution to (4.2) is unique and has the form

where \odot corresponds to entrywise multiplication of vectors. That problem is known in the numerical analysis community as the matrix scaling problem (see (Nemirovski and Rothblum, 1999) and references therein). An intuitive way to handle these equations is to solve them iteratively, by modifying first u so that it satisfies the left-hand side of Equation (4.14) and then v to satisfy its right-hand side. These two updates define Sinkhorn’s algorithm,

In practice, however, one should prefer using (4.15), which only requires manipulating scaling vectors and multiplication against a Gibbs kernel, which can often be accelerated (see Remarks 4.17 and 4.19 below).

3 Speeding Up Sinkhorn’s Iterations

Note that the basic Sinkhorn iterations described in Equation (4.15) are intrinsically GPU friendly, since they only consist in matrix-vector products, and this was exploited, for instance, to solve matching problems in Slomp et al. (2011)). However, the matrix-matrix operations presented in Equation (4.26) present even better opportunities for parallelism, which explains the success of Sinkhorn’s algorithm to compute OT distances between histograms at large scale.

We consider in this section an important particular case for which the complexity of each Sinkhorn iteration can be significantly reduced. That particular case happens when each index ii and jj considered in the cost-matrix can be described as a dd-uple taken in the cartesian product of dd finite sets n1,,nd\llbracket n_{1}\rrbracket,\dots,\llbracket n_{d}\rrbracket,

then one obtains as a direct consequence that the kernel appearing in the Sinkhorn iterations has a separable multiplicative structure,

An important example of this speed-up arises when X=Y=d\mathcal{X}=\mathcal{Y}=^{d}; the ground cost is the qq-th power of the qq-norm,

and the space is discretized using a regular grid in which only points xi=(i1/n1,,id/nd)x_{i}=(i_{1}/n_{1},\ldots,i_{d}/n_{d}) for i=(i1,,id)n1××ndi=(i_{1},\dots,i_{d})\in\llbracket n_{1}\rrbracket\times\dots\times\llbracket n_{d}\rrbracket are considered. In that case a multiplication by K can be carried out more efficiently by applying each 1-D nk×nkn_{k}\times n_{k} convolution matrix

to u reshaped as a tensor whose first dimension has been permuted to match the kk-th set of indices. For instance, if d=2d=2 (planar case) and q=2q=2 (2-Wasserstein, resulting in Gaussian convolutions), histograms a and as a consequence Sinkhorn multipliers u can be instantiated as n1×n2n_{1}\times n_{2} matrices. We write U\mathbf{U} to underline the fact that the multiplier u is reshaped as a n1×n2n_{1}\times n_{2} matrix, rather than a vector of length n1n2n_{1}n_{2}. Then, computing Ku, which would naively require (n1n2)2(n_{1}n_{2})^{2} operations with a naive implementation, can be obtained by applying two 1-D convolutions separately, as

to recover a n1×n2n_{1}\times n_{2} matrix in (n12)n2+n1(n22)(n_{1}^{2})n_{2}+n_{1}(n_{2}^{2}) operations instead of n12n22n_{1}^{2}n_{2}^{2} operations. Note that this example agrees with the exponent (1+1/d)(1+1/d) given above. With larger dd, one needs to apply these very same 1-D convolutions to each slice of u (reshaped as a tensor of suitable size) an operation which is extremely efficient on GPUs.

This important observations underlies many of the practical successes found when applying optimal transport to shape data in 2-D and 3-D, as highlighted in (Solomon et al., 2015; Bonneel et al., 2016), in which distributions supported on grids of sizes as large as 2003=8×106200^{3}=8\times 10^{6} are handled.

4 Stability and Log-Domain Computations

As briefly mentioned in Remark 4.7, the Sinkhorn algorithm suffers from numerical overflow when the regularization parameter ε\varepsilon is small compared to the entries of the cost matrix C. This concern can be alleviated to some extent by carrying out computations in the log domain. The relevance of this approach is made more clear by considering the dual problem associated to (4.2), in which these log-domain computations arise naturally.

We start from the end of the proof of Proposition 4.3, which links the optimal primal solution P and dual multipliers f and g for the marginal constraints as

therefore, the first term in (4.32) cancels out with the first term in the entropy above. The remaining terms are those appearing in (4.30). ∎

Such iterations are mathematically equivalent to the Sinkhorn iterations (4.15) when considering the primal-dual relations highlighted in (4.31). Indeed, we recover that at any iteration

Using this notation, Equations (4.35) and (4.36) can be rewritten

Note that these operations are equivalent to the entropic cc-transform introduced in §5.3 (see in particular (5.11)). Using this notation, Sinkhorn’s iterates read

Note that as ε0\varepsilon\rightarrow 0, minε\text{min}_{\varepsilon}\, converges to min\min, but the iterations do not converge anymore in the limit ε=0\varepsilon=0, because alternate minimization does not converge for constrained problems, which is the case for the unregularized dual (2.20).

Instead of substracting z\underline{z} to stabilize the log-domain iterations as in (4.42), one can actually substract the previously computed scalings. This leads to the stabilized iteration

5 Regularized Approximations of the Optimal Transport Cost

The entropic dual (4.30) is a smooth unconstrained concave maximization problem, which approximates the original Kantorovich dual (2.20), as detailed in the following proposition.

In (Cuturi, 2013), lower and upper bounds to approximate the Wasserstein distance between two histograms were proposed. These bounds consist in evaluating the primal and dual objectives at the solutions provided by the Sinkhorn algorithm.

where \odot stands for the elementwise product of matrices,

Equation (4.47) is obtained by writing that the primal and dual problems have the same values at the optima (see (4.30)), and hence

This “algorithmic” Sinkhorn functional lower bounds the regularized cost function as soon as L1L\geq 1.

6 Generalized Sinkhorn

The regularized OT problem (4.2) is a special case of a structured convex optimization problem of the form

As shown in (Peyré, 2015; Frogner et al., 2015; Chizat et al., 2018b; Karlsson and Ringh, 2016), Sinkhorn iterations (4.15) can hence be extended to this problem, and they read

where the proximal operator for the KL divergence is

For some functions F,GF,G it is possible to prove the linear rate of convergence for iterations (4.51), and these schemes can be generalized to arbitrary measures; see (Chizat et al., 2018b) for more details.

This algorithm can be used to approximate the solution to various generalizations of OT, and in particular unbalanced OT problems of the form (10.7) (see §10.2 and in particular iterations (10.9)) and gradient flow problems of the form (9.26) (see §9.3).

Chapter 5 Semidiscrete Optimal Transport

This chapter studies methods to tackle the optimal transport problem when one of the two input measures is discrete (a sum of Dirac masses) and the other one is arbitrary, including notably the case where it has a density with respect to the Lebesgue measure. When the ambient space has low dimension, this problem has a strong geometrical flavor because one can show that the optimal transport from a continuous density toward a discrete one is a piecewise constant map, where the preimage of each point in the support of the discrete measure is a union of disjoint cells. When the cost is the squared Euclidean distance, these cells correspond to an important concept from computational geometry, the so-called Laguerre cells, which are Voronoi cells offset by a constant. This connection allows us to borrow tools from computational geometry to obtain fast computational schemes. In high dimensions, the semidescrete formulation can also be interpreted as a stochastic programming problem, which can also benefit from a bit of regularization, extending therefore the scope of applications of the entropic regularization scheme presented in Chapter 4. All these constructions rely heavily on the notion of the cc-transform, this time for general cost functions and not only matrices as in §3.2. The cc-transform is a generalization of the Legendre transform from convex analysis and plays a pivotal role in the theory and algorithms for OT.

Recall that the dual OT problem (2.24) reads

where we used the useful indicator function notation (4.50). Keeping either dual potential ff or gg fixed and optimizing w.r.t. gg or ff, respectively, leads to closed form solutions that provide the definition of the cc-transform:

where we denoted cˉ(y,x)=\mboxdef.c(x,y)\bar{c}(y,x)\stackrel{{\scriptstyle\mbox{\tiny def.}}}{{=}}c(x,y). Indeed, one can check that

where we denoted fccˉ=(fc)cˉf^{c\bar{c}}=(f^{c})^{\bar{c}}. This invariance property shows that one can “improve” only once the dual potential this way. Alternatively, this means that alternate maximization does not converge (it immediately enters a cycle), which is classical for functionals involving a nonsmooth (a constraint) coupling of the optimized variables. This is in sharp contrast with entropic regularization of OT as shown in Chapter 4. In this case, because of the regularization, the dual objective (4.30) is smooth, and alternate maximization corresponds to Sinkhorn iterations (4.43) and (4.44). These iterates, written over the dual variables, define entropically smoothed versions of the cc-transform, where min\min operations are replaced by a “soft-min.”

Using (5.3), one can reformulate (2.24) as an unconstrained convex program over a single potential,

Since one can iterate the map (f,g)(gcˉ,fc)(f,g)\mapsto(g^{\bar{c}},f^{c}), it is possible to add the constraint that ff is cˉ\bar{c}-concave and gg is cc-concave, which is important to ensure enough regularity on these potentials and show, for instance, existence of solutions to (2.24).

2 Semidiscrete Formulation

Crucially, using the discrete cˉ\bar{c}-transform in the semidiscrete problem (5.4) yields a finite-dimensional optimization,

The Laguerre cells associated to the dual weights g

This allows one to conveniently rewrite the minimized energy as

The gradient of this function can be computed as follows:

The initial idea of a semidiscrete solver for Monge–Ampère equations was proposed by Oliker and Prussner (1989), and its relation to the dual variational problem was shown by Aurenhammer et al. (1998). A theoretical analysis and its application to the reflector problem in optics is detailed in (Caffarelli et al., 1999). The semidiscrete formulation was used in (Carlier et al., 2010) in conjunction with a continuation approach based on Knothe’s transport. The recent revival of this methods in various fields is due to Mérigot (2011), who proposed a quasi-Newton solver and clarified the link with concepts from computational geometry. We refer to (Lévy and Schwindt, 2018) for a recent overview. The use of a Newton solver which is applied to sampling in computer graphics is proposed in (De Goes et al., 2012); see also (Lévy, 2015) for applications to 3-D volume and surface processing. An important area of application of the semidiscrete method is for the resolution of the incompressible fluid dynamic (Euler’s equations) using Lagrangian methods (de Goes et al., 2015; Gallouët and Mérigot, 2017). The semidiscrete OT solver enforces incompressibility at each iteration by imposing that the (possibly weighted) points cloud approximates a uniform distribution inside the domain. The convergence (with linear rate) of damped Newton iterations is proved in (Mirebeau, 2015) for the Monge–Ampère equation and is refined in (Kitagawa et al., 2016) for optimal transport. Semidiscrete OT finds important applications to illumination design, notably reflectors; see (Meyron et al., 2018).

3 Entropic Semidiscrete Formulation

The dual of the entropic regularized problem between arbitrary measures (4.9) is a smooth unconstrained optimization problem:

where we denoted (fg)(x,y)=\mboxdef.f(x)+g(y)(f\oplus g)(x,y)\stackrel{{\scriptstyle\mbox{\tiny def.}}}{{=}}f(x)+g(y).

Similarly to the unregularized problem (5.1), one can minimize explicitly with respect to either ff or gg in (5.9), which yields a smoothed cc-transform

Instead of maximizing (5.9), one can thus solve the following finite-dimensional optimization problem:

Note once again that this formula (5.13) is still valid for ε=0\varepsilon=0. Note also that the family of functions (χjε)j(\chi_{j}^{\varepsilon})_{j} is a partition of unity, i.e. jχjε=1\sum_{j}\chi_{j}^{\varepsilon}=1 and χjε0\chi_{j}^{\varepsilon}\geq 0. Figure 5.3, bottom row, illustrates this.

which can be seen as a smoothed version of the Legendre transform of G(α,β)=\mboxdef.Lc(α,β)\mathcal{G}(\alpha,\beta)\stackrel{{\scriptstyle\mbox{\tiny def.}}}{{=}}\mathcal{L}_{c}(\alpha,\beta),

4 Stochastic Optimization Methods

The semidiscrete formulation (5.8) and its smoothed version (5.12) are appealing because the energies to be minimized are written as an expectation with respect to the probability distribution α\alpha,

and XX denotes a random vector distributed on X\mathcal{X} according to α\alpha. Note that the gradient of each of the involved functional reads

One can thus use stochastic optimization methods to perform the maximization, as proposed in Genevay et al. (2016). This allows us to obtain provably convergent algorithms without the need to resort to an arbitrary discretization of α\alpha (either approximating α\alpha using sums of Diracs or using quadrature formula for the integrals). The measure α\alpha is used as a black box from which one can draw independent samples, which is a natural computational setup for many high-dimensional applications in statistics and machine learning. This class of methods has been generalized to the computation of Wasserstein barycenters (as described in §9.2) in (Staib et al., 2017b).

This defines the stochastic gradient descent with averaging (SGA) algorithm. Note that it is possible to avoid explicitly storing all the iterates by simply updating a running average as follows:

In this case, a typical choice of decay is rather of the form

When neither α\alpha nor β\beta is a discrete measure, one cannot resort to semidiscrete strategies involving finite-dimensional dual variables, such as that given in Problem (5.7). The only option is to use stochastic optimization methods on the dual problem (4.45), as proposed in (Genevay et al., 2016). A suitable regularization of that problem is crucial, for instance by setting an entropic regularization strength ε>0\varepsilon>0, to obtain an unconstrained problem that can be solved by stochastic descent schemes. A possible approach to revisit Problem (4.45) is to restrict that infinite-dimensional optimization problem over a space of continuous functions to a much smaller subset, such as that spanned by multilayer neural networks (Seguy et al., 2018). This approach leads to nonconvex finite-dimensional optimization problems with no approximation guarantees, but this can provide an effective way to compute a proxy for the Wasserstein distance in high-dimensional scenarios. Another solution is to use nonparametric families, which is equivalent to considering some sort of progressive refinement, as that proposed by Genevay et al. (2016) using reproducing kernel Hilbert spaces, whose dimension is proportional to the number of iterations of the SGD algorithm.

This chapter focuses on optimal transport problems in which the ground cost is equal to a distance. Historically, this corresponds to the original problem posed by Monge in 1781; this setting was also that chosen in early applications of optimal transport in computer vision (Rubner et al., 2000) under the name of “earth mover’s distances”.

Unlike the case where the ground cost is a squared Hilbertian distance (studied in particular in Chapter 7), transport problems where the cost is a metric are more difficult to analyze theoretically. In contrast to Remark 2.24 that states the uniqueness of a transport map or coupling between two absolutely continuous measures when using a squared metric, the optimal Kantorovich coupling is in general not unique when the cost is the ground distance itself. Hence, in this regime it is often impossible to recover a uniquely defined Monge map, making this class of problems ill-suited for interpolation of measures. We refer to works by Trudinger and Wang (2001); Caffarelli et al. (2002); Sudakov (1979); Evans and Gangbo (1999) for proofs of existence of optimal W1\operatorname{\mathcal{W}}_{1} transportation plans and detailed analyses of their geometric structure.

Although more difficult to analyze in theory, optimal transport with a linear ground distance is usually more robust to outliers and noise than a quadratic cost. Furthermore, a cost that is a metric results in an elegant dual reformulation involving local flow, divergence constraints, or Lipschitzness of the dual potential, suggesting cheaper numerical algorithms that align with minimum-cost flow methods over networks in graph theory. This setting is also popular because the associated OT distances define a norm that can compare arbitrary distributions, even if they are not positive; this property is shared by a larger class of so-called dual norms (see §8.2 and Remark 6 for more details).

Here we assume that dd is a distance on X=Y\mathcal{X}=\mathcal{Y}, and we solve the OT problem with the ground cost c(x,y)=d(x,y)c(x,y)=d(x,y). The following proposition highlights key properties of the cc-transform (5.1) in this setup. In the following, we denote the Lipschitz constant of a function fC(X)f\in\mathcal{C}(\mathcal{X}) as

We define Lipschitz functions to be those functions ff satisfying Lip(f)<+\operatorname{Lip}(f)<+\infty; they form a convex subset of C(X)\mathcal{C}(\mathcal{X}).

Suppose X=Y\mathcal{X}=\mathcal{Y} and c(x,y)=d(x,y)c(x,y)=d(x,y). Then, there exists gg such that f=gcf=g^{c} if and only Lip(f)1\operatorname{Lip}(f)\leq 1. Furthermore, if Lip(f)1\operatorname{Lip}(f)\leq 1, then fc=ff^{c}=-f.

First, suppose f=gcf=g^{c}. Then, for x,yXx,y\in\mathcal{X},

The first equality follows from the definition of gcg^{c}, the next inequality from the identity inffinfgsupfg|\inf f-\inf g|\leq\sup|f-g|, and the last from the triangle inequality. This shows that Lip(f)1\operatorname{Lip}(f)\leq 1.

Now, suppose Lip(f)1\operatorname{Lip}(f)\leq 1, and define g=\mboxdef.fg\stackrel{{\scriptstyle\mbox{\tiny def.}}}{{=}}-f. By the Lipschitz property, for all x,yXx,y\in\mathcal{X}, f(y)d(x,y)f(x)f(y)+d(x,y)f(y)-d(x,y)\leq f(x)\leq f(y)+d(x,y). Applying these inequalities,

Hence, f=gcf=g^{c} with g=fg=-f. Using the same inequalities shows

Starting from the single potential formulation (5.4), one can iterate the construction and replace the couple (g,gc)(g,g^{c}) by (gc,(gc)c)(g^{c},(g^{c})^{c}). The last proposition shows that one can thus use (gc,gc)(g^{c},-g^{c}), which in turn is equivalent to any pair (f,f)(f,-f) such that Lip(f)1\operatorname{Lip}(f)\leq 1. This leads to the following alternative expression for the W1\operatorname{\mathcal{W}}_{1} distance:

This expression shows that W1\operatorname{\mathcal{W}}_{1} is actually a norm, i.e. W1(α,β)=αβW1\operatorname{\mathcal{W}}_{1}(\alpha,\beta)=\left\|\alpha-\beta\right\|_{\operatorname{\mathcal{W}}_{1}}, and that it is still valid for any measures (not necessary positive) as long as Xα=Xβ\int_{\mathcal{X}}\alpha=\int_{\mathcal{X}}\beta. This norm is often called the Kantorovich and Rubinstein norm (1958).

which is a finite-dimensional convex program with quadratic-cone constraints. It can be solved using interior point methods or, as we detail next for a similar problem, using proximal methods.

which is a linear program. Note that furthermore, in this 1-D case, a closed form expression for W1\operatorname{\mathcal{W}}_{1} using cumulative functions is given in (2.37).

Here the constraint f1\left\|\nabla f\right\|_{\infty}\leq 1 signifies that the norm of the gradient of ff at any point xx is upper bounded by 11, f(x)21\left\|\nabla f(x)\right\|_{2}\leq 1 for any xx.

Considering the dual problem to (6.3), one obtains an optimization problem under fixed divergence constraint

where iji\rightarrow j indicates that i1=ii_{1}=i and iK=ji_{K}=j, namely that the path starts at ii and ends at jj.

Problem (6.3) becomes, in the graph setting,

The associated dual problem, which is analogous to Formula (6.4), is then

This is a linear program and more precisely an instance of min-cost flow problems. Highly efficient dedicated simplex solvers have been devised to solve it; see, for instance, (Ling and Okada, 2007). Figure 6.1 shows an example of primal and dual solutions. Formulation (6.6) is the so-called Beckmann formulation (Beckmann, 1952) and has been used and extended to define and study traffic congestion models; see, for instance, (Carlier et al., 2008).

Chapter 7 Dynamic Formulations

This chapter presents the geodesic (also called dynamic) point of view of optimal transport when the cost is a squared geodesic distance. This describes the optimal transport between two measures as a curve in the space of measures minimizing a total length. The dynamic point of view offers an alternative and intuitive interpretation of optimal transport, which not only allows us to draw links with fluid dynamics but also results in an efficient numerical tool to compute OT in small dimensions when interpolating between two densities. The drawback of that approach is that it cannot scale to large-scale sparse measures and works only in low dimensions on regular domains (because one needs to grid the space) with a squared geodesic cost.

In this chapter, we use the notation (α0,α1)(\alpha_{0},\alpha_{1}) in place of (α,β)(\alpha,\beta) in agreement with the idea that we start at time t=0t=0 from one measure to reach another one at time t=1t=1.

This definition leads to the following minimal-path reformulation of W2\operatorname{\mathcal{W}}_{2}, originally introduced by Benamou and Brenier (2000):

where αt\alpha_{t} is a scalar-valued measure and vtv_{t} a vector-valued measure. Figure 7.1 shows two examples of such paths of measures.

The formulation (7.2) is a nonconvex formulation in the variables (αt,vt)t(\alpha_{t},v_{t})_{t} because of the constraint (7.1) involving the product αtvt\alpha_{t}v_{t}. Introducing a vector-valued measure (often called the “momentum”)

Benamou and Brenier showed in their landmark paper (2000) that it is instead convex in the variable (αt,Jt)t(\alpha_{t},J_{t})_{t} when writing

where we define the set of constraints as

This definition might seem complicated, but it is crucial to impose that the momentum Jt(x)J_{t}(x) should vanish when αt(x)=0\alpha_{t}(x)=0. Note also that (7.3) is written in an informal way as if the measures (αt,Jt)(\alpha_{t},J_{t}) were density functions, but this is acceptable because θ\theta is a 1-homogeneous function (and hence defined even if the measures do not have a density with respect to Lebesgue measure) and can thus be extended in an unambiguous way from density to functions.

2 Discretization on Uniform Staggered Grids

In order to evaluate the functional to be optimized, one needs interpolation operators from midgrid points to grid points, for all (k,i1,i2)1,T×1,n1×1,n2(k,i_{1},i_{2})\in\llbracket 1,T\rrbracket\times\llbracket 1,n_{1}\rrbracket\times\llbracket 1,n_{2}\rrbracket,

The simplest choice is to use a linear operator I(r,s)=r+s2\mathcal{I}(r,s)=\frac{r+s}{2}, which is the one we consider next. The discrete counterpart to (7.3) reads

In the case where X\mathcal{X} is a graph and c(x,y)=dX(x,y)2c(x,y)=d_{\mathcal{X}}(x,y)^{2} is the squared geodesic distance, it is possible to derive faithful discretization methods that use a discrete divergence associated to the graph structure in place of the uniform grid discretization (7.10). In order to ensure that the heat equation has a gradient flow structure (see §9.3 for more details about gradient flows) for the corresponding dynamic Wasserstein distance, Maas (2011) and later Mielke (2013) proposed to use a logarithmic mean I(r,s)\mathcal{I}(r,s) (see also (Solomon et al., 2016b; Chow et al., 2012, 2017b, 2017a)).

3 Proximal Solvers

An alternative to these high-precision solvers are low-precision first order methods, which are well suited for nonsmooth but highly structured problems such as (7.11). While this class of solvers is not new, it has recently been revitalized in the fields of imaging and machine learning because they are the perfect fit for these applications, where numerical precision is not the driving goal. We refer, for instance, to the monograph (Bauschke and Combettes, 2011) for a detailed account on these solvers and their use for large-scale applications. We here concentrate on a specific solver, but of course many more can be used, and we refer to (Papadakis et al., 2014) for a study of several such approaches for dynamical OT. Note that the idea of using a first order scheme for dynamical OT was initially proposed by Benamou and Brenier (2000).

The DR algorithm (Lions and Mercier, 1979) is specifically tailored to solve nonsmooth structured problems of the form

The proximal operator of these two functions can be computed efficiently. Indeed, one has

4 Dynamical Unbalanced OT

In order to be able to match input measures with different mass α0(X)α1(X)\alpha_{0}(\mathcal{X})\neq\alpha_{1}(\mathcal{X}) (the so-called “unbalanced” settings, the terminology introduced by Benamou (2003)), and also to cope with local mass variation, several normalizations or relaxations have been proposed, in particular by relaxing the fixed marginal constraint; see §10.2. A general methodology consists in introducing a source term st(x)s_{t}(x) in the continuity equation (7.4). We thus consider

The crucial question is how to measure the cost associated to this source term and introduce it in the original dynamic formulation (7.3). Several proposals appear in the literature, for instance, using an L2L^{2} cost Piccoli and Rossi (2014). In order to avoid having to “teleport” mass (mass which travels at infinite speed and suddenly grows in a region where there was no mass before), the associated cost should be infinite. It turns out that this can be achieved in a simple convex way, by also allowing sts_{t} to be an arbitrary measure (e.g. using a 1-homogeneous cost) by penalizing sts_{t} in the same way as the momentum JtJ_{t},

where θ\theta is the convex 1-homogeneous function introduced in (7.5), and τ\tau is a weight controlling the trade-off between mass transportation and mass creation/destruction. This formulation was proposed independently by several authors (Liero et al., 2016; Chizat et al., 2018c; Kondratyev et al., 2016). This “dynamic” formulation has a “static” counterpart; see Remark 5. The convex optimization problem (7.15) can be solved using methods similar to those detailed in §7.3. Figure 7.4 displays a comparison of several unbalanced OT dynamic interpolations. This dynamic formulation resembles “metamorphosis” models for shape registration (Trouvé and Younes, 2005), and a more precise connection is detailed in (Maas et al., 2015, 2016).

As τ0\tau\rightarrow 0, and if α0(X)=α1(X)\alpha_{0}(\mathcal{X})=\alpha_{1}(\mathcal{X}), then one retrieves the classical OT problem, WFR(α0,α1)W(α0,α1)\operatorname{WFR}(\alpha_{0},\alpha_{1})\rightarrow\operatorname{\mathcal{W}}(\alpha_{0},\alpha_{1}). In contrast, as τ+\tau\rightarrow+\infty, this distance approaches the Hellinger metric over densities

5 More General Mobility Functionals

It is possible to generalize the dynamic formulation (7.3) by considering other “mobility functions” θ\theta in place of the one defined in (7.5). A possible choice for this mobility functional is proposed in Dolbeault et al. (2009),

where the parameter should satisfy p1p\geq 1 and s[1,p]s\in[1,p] in order for θ\theta to be convex. Note that this definition should be handled with care in the case 1<sp1<s\leq p because θ\theta does not have a linear growth at infinity, so that solutions to (7.3) must be constrained to have a density with respect to the Lebesgue measure.

The case s=1s=1 corresponds to the classical OT problem and the optimal value of (7.3) defines Wp(α,β)\operatorname{\mathcal{W}}_{p}(\alpha,\beta). In this case, θ\theta is 1-homogeneous, so that solutions to (7.3) can be arbitrary measures. The case (s=1,p=2)(s=1,p=2) is the initial setup considered in (7.3) to define W2\operatorname{\mathcal{W}}_{2}.

The limiting case s=ps=p is also interesting, because it corresponds to a dual Sobolev norm W1,pW^{-1,p} and the value of (7.3) is then equal to

6 Dynamic Formulation over the Paths Space

where, for any path γXˉ\gamma\in\bar{\mathcal{X}}, P0(γ)=γ(0),P1(γ)=γ(1)P_{0}(\gamma)=\gamma(0),P_{1}(\gamma)=\gamma(1).

The dynamical version of classical OT (2.15), formulated over the space of paths, then reads

where γxi,yj\gamma_{x_{i},y_{j}} is the geodesic between xix_{i} and yjy_{j}. Furthermore, the measures defined by the distribution of the curve points γ(t)\gamma(t) at time tt, where γ\gamma is drawn following πˉ\bar{\pi}^{\star}, i.e.

is a solution to the dynamical formulation (7.3), i.e. it is the displacement interpolation. In the discrete case, one recovers (7.9).

We now turn to the re-interpretation of entropic OT, defined in Chapter 4, using the space of paths. Similarly to (4.11), this is defined using a Kullback–Leibler projection, but this time of a reference measure over the space of paths Kˉ\bar{\mathcal{K}} which is the distribution of a reversible Brownian motion (Wiener process), which has a uniform distribution at the initial and final times

We refer to the review paper by Léonard (2014) for an overview of this problem and an historical account of the work of Schrödinger (1931). One can show that the (unique) solution πˉε\bar{\pi}_{\varepsilon}^{\star} to (7.19) converges to a solution of (7.17) as ε0\varepsilon\rightarrow 0. Furthermore, this solution is linked to the solution of the static entropic OT problem (4.9) using Brownian bridge γˉx,yεXˉ\bar{\gamma}_{x,y}^{\varepsilon}\in\bar{\mathcal{X}} (which are similar to fuzzy geodesic and converge to δγx,y\delta_{\gamma_{x,y}} as ε0\varepsilon\rightarrow 0). In the discrete setting, this means that

The resulting mixture of Brownian bridges is displayed on Figure 7.5.

Another way to describe this entropic interpolation (αt)t(\alpha_{t})_{t} is using a regularization of the Benamou–Brenier dynamic formulation (7.2), namely

see (Gentil et al., 2015; Chen et al., 2016a).

Chapter 8 Statistical Divergences

We study in this chapter the statistical properties of the Wasserstein distance. More specifically, we compare it to other major distances and divergences routinely used in data sciences. We quantify how one can approximate the distance between two probability distributions when having only access to samples from said distributions. To introduce these subjects, §8.1 and §8.2 review respectively divergences and integral probability metrics between probability distributions. A divergence DD typically satisfies D(α,β)0D(\alpha,\beta)\geq 0 and D(α,β)=0D(\alpha,\beta)=0 if and only if α=β\alpha=\beta, but it does not need to be symmetric or satisfy the triangular inequality. An integral probability metric for measures is a dual norm defined using a prescribed family of test functions. These quantities are sound alternatives to Wasserstein distances and are routinely used as loss functions to tackle inference problems, as will be covered in §9. We show first in §8.3 that the optimal transport distance is not Hilbertian, i.e. one cannot approximate it efficiently using a Hilbertian metric on a suitable feature representation of probability measures. We show in §8.4 how to approximate D(α,β)D(\alpha,\beta) from discrete samples (xi)i(x_{i})_{i} and (yj)j(y_{j})_{j} drawn from α\alpha and β\beta. A good statistical understanding of that problem is crucial when using the Wasserstein distance in machine learning. Note that this section will be chiefly concerned with the statistical approximation of optimal transport between distributions supported on continuous sets. The very same problem when the ground space is finite has received some attention in the literature following the work of Sommerfeld and Munk (2018), extended to entropic regularized quantities by Bigot et al. (2017a).

Before detailing in the following section “weak” norms, whose construction shares similarities with W1\operatorname{\mathcal{W}}_{1}, let us detail a generic construction of so-called divergences between measures, which can then be used as loss functions when estimating probability distributions. Such divergences compare two input measures by comparing their mass pointwise, without introducing any notion of mass transportation. Divergences are functionals which, by looking at the pointwise ratio between two measures, give a sense of how close they are. They have nice analytical and computational properties and build upon entropy functions.

If φ=\varphi^{\prime}_{\infty}=\infty, then φ\varphi grows faster than any linear function and φ\varphi is said superlinear. Any entropy function φ\varphi induces a φ\varphi-divergence (also known as Ciszár divergence (Ciszár, 1967; Ali and Silvey, 1966) or ff-divergence) as follows.

if α,β\alpha,\beta are nonnegative and \infty otherwise.

The additional term φα(X)\varphi^{\prime}_{\infty}\alpha^{\perp}(\mathcal{X}) in (8.1) is important to ensure that Dφ\mathcal{D}_{\varphi} defines a continuous functional (for the weak topology of measures) even if φ\varphi has a linear growth at infinity, as this is, for instance, the case for the absolute value (8.8) defining the TV norm. If φ\varphi as a superlinear growth, e.g. the usual entropy (8.4), then φ=+\varphi^{\prime}_{\infty}=+\infty so that Dφ(αβ)=+\mathcal{D}_{\varphi}(\alpha|\beta)=+\infty if α\alpha does not have a density with respect to β\beta.

are supported on the same set of nn points (xi)i=1nX(x_{i})_{i=1}^{n}\subset\mathcal{X}, (8.1) defines a divergence on Σn\Sigma_{n}

The proof of the following proposition can be found in (Liero et al., 2018, Thm 2.7).

If φ\varphi is an entropy function, then Dφ\mathcal{D}_{\varphi} is jointly 11-homogeneous, convex and weakly* lower semicontinuous in (α,β)(\alpha,\beta).

A φ\varphi-divergence can be expressed using the Legendre transform

see Liero et al. (2018) for more details.

We now review a few popular instances of this framework. Figure 8.1 displays the associated entropy functionals, while Figure 8.2 reviews the relationship between them.

The Kullback–Leibler divergence KL=\mboxdef.DφKL\operatorname{KL}\stackrel{{\scriptstyle\mbox{\tiny def.}}}{{=}}\mathcal{D}_{\varphi_{\operatorname{KL}}}, also known as the relative entropy, was already introduced in (4.10) and (4.6). It is the divergence associated to the Shannon–Boltzman entropy function φKL\varphi_{\operatorname{KL}}, given by

It is interesting to contrast the geometry of the Kullback–Leibler divergence to that defined by quadratic optimal transport when comparing Gaussians. As detailed, for instance, by Costa et al. (2015), the Kullback–Leibler divergence has a closed form for Gaussian densities. In the univariate case, d=1d=1, if α=N(mα,σα2)\alpha=\mathcal{N}(m_{\alpha},\sigma_{\alpha}^{2}) and β=N(mβ,σβ2)\beta=\mathcal{N}(m_{\beta},\sigma_{\beta}^{2}), one has

This expression shows that the divergence between α\alpha and β\beta diverges to infinity as σβ\sigma_{\beta} diminishes to and β\beta becomes a Dirac mass. In that sense, one can say that singular Gaussians are infinitely far from all other Gaussians in the KL geometry. That geometry is thus useful when one wants to avoid dealing with singular covariances. To simplify the analysis, one can look at the infinitesimal geometry of KL, which is obtained by performing a Taylor expansion at order 2,

The KL hyperbolic geometry over the space of Gaussian parameters (m,σ)(m,\sigma) should be contrasted with the Euclidean geometry associated to OT as described in Remark 2.31, since in the univariate case

Figure 8.3 shows a visual comparison of these two geometries and their respective geodesics. This interesting comparison was suggested to us by Jean Feydy.

The total variation distance TV=\mboxdef.DφTV\operatorname{TV}\stackrel{{\scriptstyle\mbox{\tiny def.}}}{{=}}\mathcal{D}_{\varphi_{\operatorname{TV}}} is the divergence associated to

It actually defines a norm on the full space of measure M(X)\mathcal{M}(\mathcal{X}) where

The total variation norm (8.9) defines the so-called “strong” topology on the space of measure. On a compact domain X\mathcal{X} of radius RR, one has

so that this strong notion of convergence implies the weak convergence metrized by Wasserstein distances. The converse is, however, not true, since δx\delta_{x} does not converge strongly to δy\delta_{y} if xyx\rightarrow y (note that δxδyTV=2\left\|\delta_{x}-\delta_{y}\right\|_{\operatorname{TV}}=2 if xyx\neq y). A chief advantage is that M+1(X)\mathcal{M}_{+}^{1}(\mathcal{X}) (once again on a compact ground space X\mathcal{X}) is compact for the weak topology, so that from any sequence of probability measures (αk)k(\alpha_{k})_{k}, one can always extract a converging subsequence, which makes it a suitable space for several optimization problems, such as those considered in Chapter 9.

The Hellinger distance h=\mboxdef.DφH1/2\mathfrak{h}\stackrel{{\scriptstyle\mbox{\tiny def.}}}{{=}}\mathcal{D}_{\varphi_{H}}^{1/2} is the square root of the divergence associated to

The KL divergence is not symmetric and, while being a Bregman divergence (which are locally quadratic norms), it is not the square of a distance. On the other hand, the Jensen–Shannon distance JS(α,β)\text{JS}(\alpha,\beta), defined as

is a distance (Endres and Schindelin, 2003; Österreicher and Vajda, 2003). JS2\text{JS}^{2} can be shown to be a φ\varphi-divergence for φ(s)=tlog(t)(t+1)log(t+1)\varphi(s)=t\log(t)-(t+1)\log(t+1). In sharp contrast with KL\operatorname{KL}, JS(α,β)\text{JS}(\alpha,\beta) is always bounded; more precisely, it satisfies 0JS(α,β)2ln(2)0\leq\text{JS}(\alpha,\beta)^{2}\leq\ln(2). Similarly to the TV norm and the Hellinger distance, it metrizes the strong convergence.

The χ2\chi^{2}-divergence χ2=\mboxdef.Dφχ2\chi^{2}\stackrel{{\scriptstyle\mbox{\tiny def.}}}{{=}}\mathcal{D}_{\varphi_{\chi^{2}}} is the divergence associated to

If (α,β)(\alpha,\beta) are discrete as in (8.2) and have the same support, then

2 Integral Probability Metrics

Formulation (6.3) is a special case of a dual norm. A dual norm is a convenient way to design “weak” norms that can deal with arbitrary measures. For a symmetric convex set BB of measurable functions, one defines

These dual norms are often called “integral probability metrics’; see (Sriperumbudur et al., 2012).

The total variation norm (Example 8.1.5) is a dual norm associated to the whole space of continuous functions

The total variation distance is the only nontrivial divergence that is also a dual norm; see (Sriperumbudur et al., 2009).

By using smaller “balls” BB, which typically only contain continuous (and sometimes regular) functions, one defines weaker dual norms. In order for B\left\|\cdot\right\|_{B} to metrize the weak convergence (see Definition 2.2), it is sufficient for the space spanned by BB to be dense in the set of continuous functions for the sup-norm \left\|\cdot\right\|_{\infty} (i.e. for the topology of uniform convergence); see (Ambrosio et al., 2006, para. 5.1).

Figure 8.4 displays a comparison of several such dual norms, which we now detail.

W1\operatorname{\mathcal{W}}_{1} as defined in (6.3), is a special case of dual norm (8.10), using

2.2 Dual RKHS Norms and Maximum Mean Discrepancies

It is also possible to define “Euclidean” norms (built using quadratic functionals) on measures using the machinery of kernel methods and more specifically reproducing kernel Hilbert spaces (RKHS; see (Schölkopf and Smola, 2002) for a survey of their applications in data sciences), of which we recall first some basic definitions.

If kk is conditionally positive, one defines the following norm:

These norms are often referred to as “maximum mean discrepancy” (MMD) (see (Gretton et al., 2007)) and have also been called “kernel norms” in shape analysis (Glaunes et al., 2004). This expression (8.13) can be rephrased, introducing two independent random vectors (X,X)(X,X^{\prime}) on X\mathcal{X} distributed with law α\alpha, as

One can show that k2\left\|\cdot\right\|^{2}_{k} is the dual norm in the sense of (8.10) associated to the unit ball BB of the RKHS associated to kk. We refer to (Berlinet and Thomas-Agnan, 2003; Hofmann et al., 2008; Schölkopf and Smola, 2002) for more details on RKHS functional spaces.

In the special case where α\alpha is a discrete measure of the form (2.3), one thus has the simple expression

which should be contrasted with (6.4), where an L1L^{1} norm of the vector field ss was used in place of the L2L^{2} norm used here. The “weighted” version of this Sobolev dual norm,

can be interpreted as the natural “linearization” of the Wasserstein W2\operatorname{\mathcal{W}}_{2} norm, in the sense that the Benamou–Brenier dynamic formulation can be interpreted infinitesimally as

see (Santambrogio, 2015, Theo. 5.34), and see (Peyre, 2011) for sharp constants.

This corresponds to a dual RKHS norm with a convolutive kernel k(x,y)=k0(xy)k(x,y)=k_{0}(x-y) with k^0(ω)=±ω2r\hat{k}_{0}(\omega)=\pm\left\|\omega\right\|^{-2r}. Taking the inverse Fourier transform, one sees that (up to constant) one has

The energy distance (or Cramer distance when d=1d=1) (Székely and Rizzo, 2004) associated to a distance dd is defined as

while the Wasserstein distance exhibits a perfect linear scaling,

Note, however, that for the energy distance, the parameter pp must satisfy 0<p<20<p<2, and that for p=2p=2, it degenerates to the distance between the means

so it is not a norm anymore. This shows that it is not possible to get the same linear scaling under fsf_{s\sharp} with the energy distance as for the Wasserstein distance.

3 Wasserstein Spaces Are Not Hilbertian

Some of the special cases of the Wasserstein geometry outlined earlier in §2.6 have highlighted the fact that the optimal transport distance can sometimes be computed in closed form. They also illustrate that in such cases the optimal transport distance is a Hilbertian metric between probability measures, in the sense that there exists a map ϕ\phi from the space of input measures onto a Hilbert space, as defined below.

A distance dd defined on a set Z×Z\mathcal{Z}\times\mathcal{Z} is said to be Hilbertian if there exists a Hilbert space H\mathcal{H} and a mapping ϕ:ZH\phi:\mathcal{Z}\rightarrow\mathcal{H} such that for any pair z,zz,z^{\prime} in Z\mathcal{Z} we have that d(z,z)=ϕ(z)ϕ(z)Hd(z,z^{\prime})=\|\phi(z)-\phi(z^{\prime})\|_{\mathcal{H}}.

For instance, Remark 2.30 shows that the Wasserstein metric is a Hilbert norm between univariate distributions, simply by defining ϕ\phi to be the map that associates to a measure its generalized quantile function. Remark 2.31 shows that for univariate Gaussians, as written in (8.7) in this chapter, the Wasserstein distance between two univariate Gaussians is simply the Euclidean distance between their mean and standard deviation.

Hilbertian distances have many favorable properties when used in a data analysis context (Dattorro, 2017). First, they can be easily cast as radial basis function kernels: for any Hilbertian distance dd, it is indeed known that edp/te^{-d^{p}/t} is a positive definite kernel for any value 0p20\leq p\leq 2 and any positive scalar tt as shown in (Berg et al., 1984, Cor. 3.3.3, Prop. 3.2.7). The Gaussian (p=2)(p=2) and Laplace (p=1)(p=1) kernels are simple applications of that result using the usual Euclidean distance. The entire field of kernel methods (Hofmann et al., 2008) builds upon the positive definiteness of a kernel function to define convex learning algorithms operating on positive definite kernel matrices. Points living in a Hilbertian space can also be efficiently embedded in lower dimensions with low distortion factors (Johnson and Lindenstrauss, 1984), (Barvinok, 2002, §V.6.2) using simple methods such as multidimensional scaling (Borg and Groenen, 2005).

Because Hilbertian distances have such properties, one might hope that the Wasserstein distance remains Hilbertian in more general settings than those outlined above, notably when the dimension of X\mathcal{X} is 2 and more. This can be disproved using the following equivalence.

A distance dd is Hilbertian if and only if d2d^{2} is negative definite.

If a distance is Hilbertian, then d2d^{2} is trivially negative definite. Indeed, given nn points in Z\mathcal{Z}, the sum rirjd2(zi,zj)\sum r_{i}r_{j}d^{2}(z_{i},z_{j}) can be rewritten as rirjϕ(zi)ϕ(zj)H2\sum r_{i}r_{j}\|\phi(z_{i})-\phi(z_{j})\|_{\mathcal{H}}^{2} which can be expanded, taking advantage of the fact that ri=0\sum r_{i}=0 to 2rirjϕ(zi),ϕ(zj)H-2\sum r_{i}r_{j}\langle\phi(z_{i}),\,\phi(z_{j})\rangle_{\mathcal{H}} which is negative by definition of a Hilbert dot product. If, on the contrary, d2d^{2} is negative definite, then the fact that dd is Hilbertian proceeds from a key result by Schoenberg (1938) outlined in ((Berg et al., 1984, p. 82, Prop. 3.2)).

It is therefore sufficient to show that the squared Wasserstein distance is not negative definite to show that it is not Hilbertian, as stated in the following proposition.

3.2 Negative/Positive Definite Variants of Optimal Transport

We show later in §10.4 that the sliced approximation to Wasserstein distances, essentially a sum of 1-D directional transportation distance computed on random push-forwards of measures projected on lines, is negative definite as the sum of negative definite functions (Berg et al., 1984, §3.1.11). This result can be used to define a positive definite kernel (Kolouri et al., 2016). Another way to recover a positive definite kernel is to cast the optimal transport problem as a soft-min problem (over all possible transportation tables) rather than a minimum, as proposed by Kosowsky and Yuille (1994) to introduce entropic regularization. That soft-min defines a term whose neg-exponential (also known as a generating function) is positive definite (Cuturi, 2012).

4 Empirical Estimators for OT, MMD and φ𝜑\varphi-divergences

In an applied setting, given two input measures (α,β)M+1(X)2(\alpha,\beta)\in\mathcal{M}_{+}^{1}(\mathcal{X})^{2}, an important statistical problem is to approximate the (usually unknown) divergence D(α,β)D(\alpha,\beta) using only samples (xi)i=1n(x_{i})_{i=1}^{n} from α\alpha and (yj)j=1m(y_{j})_{j=1}^{m} from β\beta. These samples are assumed to be independently identically distributed from their respective distributions.

For both Wasserstein distances Wp\operatorname{\mathcal{W}}_{p} (see 2.18) and MMD norms (see §8.2), a straightforward estimator of the unknown distance between distriubtions is compute it directly between the empirical measures, hoping ideally that one can control the rate of convergence of the latter to the former,

Note that here both α^n\hat{\alpha}_{n} and β^m\hat{\beta}_{m} are random measures, so D(α^n,β^m)D(\hat{\alpha}_{n},\hat{\beta}_{m}) is a random number. For simplicity, we assume that X\mathcal{X} is compact (handling unbounded domain requires extra constraints on the moments of the input measures).

For such a dual distance that metrizes the weak convergence (see Definition 2.2), since there is the weak convergence α^nα\hat{\alpha}_{n}\rightarrow\alpha, one has D(α^n,β^n)D(α,β)D(\hat{\alpha}_{n},\hat{\beta}_{n})\rightarrow D(\alpha,\beta) as n+n\rightarrow+\infty. But an important question is the speed of convergence of D(α^n,β^n)D(\hat{\alpha}_{n},\hat{\beta}_{n}) toward D(α,β)D(\alpha,\beta), and this rate is often called the “sample complexity” of DD.

Note that for D(α,β)=TVD(\alpha,\beta)=\left\|\cdot\right\|_{\operatorname{TV}}, since the TV norm does not metrize the weak convergence, α^nβ^nTV\|\hat{\alpha}_{n}-\hat{\beta}_{n}\|_{\operatorname{TV}} is not a consistent estimator, namely it does not converge toward αβTV\left\|\alpha-\beta\right\|_{\operatorname{TV}}. Indeed, with probability 1, α^nβ^nTV=2\|\hat{\alpha}_{n}-\hat{\beta}_{n}\|_{\operatorname{TV}}=2 since the support of the two discrete measures does not overlap. Similar issues arise with other φ\varphi-divergences, which cannot be estimated using divergences between empirical distributions.

For weak norms k2\left\|\cdot\right\|^{2}_{k} which are dual of RKHS norms (also called MMD), as defined in (8.13), and contrary to Wasserstein distances, the sample complexity does not depend on the ambient dimension

see (Sriperumbudur et al., 2012). Figure 8.7 shows a numerical comparison of the sample complexity rates for Wasserstein and MMD distances. Note, however, that α^nβ^nk2\|\hat{\alpha}_{n}-\hat{\beta}_{n}\|_{k}^{2} is a slightly biased estimate of αβk2\left\|\alpha-\beta\right\|_{k}^{2}. In order to define an unbiased estimator, and thus to be able to use, for instance, SGD when minimizing such losses, one should rather use the unbiased estimator

4.2 Empirical Estimators for φ𝜑\varphi-divergences

One can then approximate the φ\varphi divergence using

5 Entropic Regularization: Between OT and MMD

Following Proposition 4.7, we recall that the Sinkhorn divergence is defined as

This definition is generalized to any input distribution (not necessarily discrete) as

where π\pi^{\star} is the solution of (4.9).

In order to cancel the bias introduced by the regularization (in particular, Wp,ε(α,α)0\operatorname{\mathcal{W}}_{p,\varepsilon}(\alpha,\alpha)\neq 0), we introduce a corrected regularized divergence

The following proposition, whose proof can be found in (Ramdas et al., 2017), shows that this regularized divergence interpolates between the Wasserstein distance and the energy distance defined in Example 8.2.7.

where ED(X,d)\left\|\cdot\right\|_{\text{ED}(\mathcal{X},d)} is defined in (8.19).

Chapter 9 Variational Wasserstein Problems

In data analysis, common divergences between probability measures (e.g. Euclidean, total variation, Hellinger, Kullback–Leibler) are often used to measure a fitting error or a loss in parameter estimation problems. Up to this chapter, we have made the case that the optimal transport geometry has a unique ability, not shared with other information divergences, to leverage physical ideas (mass displacement) and geometry (a ground cost between observations or bins) to compare measures. These two facts combined make it thus very tempting to use the Wasserstein distance as a loss function. This idea was recently explored for various applied problems. However, the main technical challenge associated with that idea lies in approximating and differentiating efficiently the Wasserstein distance. We start this chapter with a few motivating examples and show how the different numerical schemes presented in the first chapters of this book can be used to solve variational Wasserstein problems.

In image processing, the Wasserstein distance can be used as a loss to synthesize textures (Tartavel et al., 2016), to account for the discrepancy between statistics of synthesized and input examples. It is also used for image segmentation to account for statistical homogeneity of image regions (Swoboda and Schnörr, 2013; Rabin and Papadakis, 2015; Peyré et al., 2012; Ni et al., 2009; Schmitzer and Schnörr, 2013b; Li et al., 2018b). The Wasserstein distance is also a very natural fidelity term for inverse problems when the measurements are probability measures, for instance, image restoration (Lellmann et al., 2014), tomographic inversion (Abraham et al., 2017), density regularization (Burger et al., 2012), particle image velocimetry (Saumier et al., 2015), sparse recovery and compressed sensing (Indyk and Price, 2011), and seismic inversion (Métivier et al., 2016). Distances between measures (mostly kernel-based as shown in §8.2.2) are routinely used for shape matching (represented as measures over a lifted space, often called currents) in computational anatomy (Vaillant and Glaunès, 2005), but OT distances offer an interesting alternative (Feydy et al., 2017). To reduce the dimensionality of a dataset of histograms, Lee and Seung have shown that the nonnegative matrix factorization problem can be cast using the Kullback–Leibler divergence to quantify a reconstruction loss (Lee and Seung, 1999). When prior information is available on the geometry of the bins of those histograms, the Wasserstein distance can be used instead, with markedly different results (Sandler and Lindenbaum, 2011; Zen et al., 2014; Rolet et al., 2016).

All of these problems have in common that they require access to the gradients of Wasserstein distances, or approximations thereof. We start this section by presenting methods to approximate such gradients, then follow with three important applications that can be cast as variational Wasserstein problems.

In statistics, text processing or imaging, one must usually compare a probability distribution β\beta arising from measurements to a model, namely a parameterized family of distributions {αθ,θΘ}\{\alpha_{\theta},\theta\in\Theta\}, where Θ\Theta is a subset of a Euclidean space. Such a comparison is done through a “loss” or a “fidelity” term, which is the Wasserstein distance in this section. In the simplest scenario, the computation of a suitable parameter θ\theta is obtained by minimizing directly

Of course, one can consider more complicated problems: for instance, the barycenter problem described in §9.2 consists in a sum of such terms. However, most of these more advanced problems can be usually solved by adapting tools defined for the basic case above, either using the chain rule to compute explicitly derivatives or using automatic differentiation as advocated in §9.1.3.

The Wasserstein distance between two histograms or two densities is convex with respect to its two inputs, as shown by (2.20) and (2.24), respectively. Therefore, when the parameter θ\theta is itself a histogram, namely Θ=Σn\Theta=\Sigma_{n} and αθ=θ\alpha_{\theta}=\theta, or more generally when θ\theta describes KK weights in the simplex, Θ=ΣK\Theta=\Sigma_{K}, and αθ=i=1Kθiαi\alpha_{\theta}=\sum_{i=1}^{K}\theta_{i}\alpha_{i} is a convex combination of known atoms α1,,αK\alpha_{1},\dots,\alpha_{K} in ΣN\Sigma_{N}, Problem (9.1) remains convex (the first case corresponds to the barycenter problem, the second to one iteration of the dictionary learning problem with a Wasserstein loss (Rolet et al., 2016)). However, for more general parameterizations θαθ\theta\mapsto\alpha_{\theta}, Problem (9.1) is in general not convex.

For those simple cases where the Wasserstein distance has a closed form, such as univariate (see §2.30) or elliptically contoured (see §2.31) distributions, simple workarounds exist. They consist mostly in casting the Wasserstein distance as a simpler distance between suitable representations of these distributions (Euclidean on quantile functions for univariate measures, Bures metric for covariance matrices for elliptically contoured distributions of the same family) and solving Problem (9.1) directly on such representations.

In most cases, however, one has to resort to a careful discretization of αθ\alpha_{\theta} to compute a local minimizer for Problem (9.1). Two approaches can be envisioned: Eulerian or Lagrangian. Figure 9.1 illustrates the difference between these two fundamental discretization schemes. At the risk of oversimplifying this argument, one may say that a Eulerian discretization is the most suitable when measures are supported on a low-dimensional space (as when dealing with shapes or color spaces), or for intrinsically discrete problems (such as those arising from string or text analysis). When applied to fitting problems where observations can take continuous values in high-dimensional spaces, a Lagrangian perspective is usually the only suitable choice.

1.1 Eulerian Discretization

We recall that Proposition 4.6 shows that the entropic loss function is differentiable and convex with respect to the input histograms, with gradient.

1.2 Lagrangian Discretization

A different approach consists in using instead fixed (typically uniform) weights and approximating an input measure α\alpha as an empirical measure αθ=1niδx(θ)i\alpha_{\theta}=\frac{1}{n}\sum_{i}\delta_{x(\theta)_{i}} for a point-cloud parameterization map x:θx(θ)=(x(θ)i)i=1nXnx:\theta\mapsto x(\theta)=(x(\theta)_{i})_{i=1}^{n}\in\mathcal{X}^{n}, where we assume here that X\mathcal{X} is Euclidean. Problem (9.1) is thus approximated as

where P is the unique optimal solution of (4.2). For ε=0\varepsilon=0, this formula defines the set of upper gradients.

1.3 Automatic Differentiation

with respect to θ\theta, in, respectively, the Eulerian and the Lagrangian cases for LL large enough.

The cost for computing the gradient of functionals involving Sinkhorn divergences is the same as that of computation of the functional itself; see, for instance, (Bonneel et al., 2016; Genevay et al., 2018) for some applications of this approach. We also refer to (Adams and Zemel, 2011) for an early work on differentiating Sinkhorn iterations with respect to the cost matrix (as done in the Lagrangian framework), with applications to learning rankings. Further details on automatic differentiation can be found in (Griewank and Walther, 2008; Rall, 1981; Neidinger, 2010), in particular on the “reverse mode,” which is the fastest way to compute gradients. In terms of implementation, all recent deep-learning Python frameworks feature state-of-the-art reverse-mode differentiation and support for GPU/TPU computations (Al-Rfou et al., 2016; Abadi et al., 2016; Pytorch, 2017), they should be adopted for any large-scale application of Sinkhorn losses. We strongly encourage the use of such automatic differentiation techniques, since they have the same complexity as computing (9.3) and (9.8), these formulas being mostly useful to obtain a theoretical understanding of what automatic differentation is computing. The only downside is that reverse mode automatic differentation is memory intensive (the memory grows proportionally with the number of iterations). There exist, however, subsampling strategies that mitigate this problem (Griewank, 1992).

2 Wasserstein Barycenters, Clustering and Dictionary Learning

A basic problem in unsupervised learning is to compute the “mean” or “barycenter” of several data points. A classical way to define such a weighted mean of points (xs)s=1SXS(x_{s})_{s=1}^{S}\in\mathcal{X}^{S} living in a metric space (X,d)(\mathcal{X},d) (where dd is a distance or more generally a divergence) is by solving a variational problem

Although this problem is an LP, its scale forbids the use of generic solvers for medium-scale problems. One can resort to using first order methods such as subgradient descent on the dual (Carlier et al., 2015).

In general, solving (9.10) or (9.11) is not straightforward, but there exist some special cases for which solutions are explicit or simple.

One can use entropic smoothing and approximate the solution of (9.10) using

for some ε>0\varepsilon>0. This is a smooth convex minimization problem, which can be tackled using gradient descent (Cuturi and Doucet, 2014; Gramfort et al., 2015). An alternative is to use descent methods (typically quasi-Newton) on the semi-dual (Cuturi and Peyré, 2016), which is useful to integrate additional regularizations on the barycenter, to impose, for instance, some smoothness w.r.t a given norm. A simpler yet very effective approach, as remarked by Benamou et al. (2015) is to rewrite (9.15) as a (weighted) KL projection problem

and the scalings are sequentially updated as

An alternative way to derive these iterations is to perform alternate minimization on the variables of a dual problem, which is detailed in the following proposition.

Introducing Lagrange multipliers in (9.16) leads to

Strong duality holds, so that one can exchange the min and the max, to obtain

which shows the desired formula. To show (9.22), since this function is separable, one needs to compute

whose optimality condition reads u=log(r/k)u=\log(r/k), i.e. r=keur=ke^{u}, hence the result.

Figures 9.3 and 9.4 show applications to 2-D and 3-D shapes interpolation. Figure 9.5 shows a computation of barycenters on a surface, where the ground cost is the square of the geodesic distance. For this figure, the computations are performed using the geodesic in heat approximation detailed in Remark 4.19. We refer to (Solomon et al., 2015) for more details and other applications to computer graphics and imaging sciences.

The efficient computation of Wasserstein barycenters remains at this time an active research topic (Staib et al., 2017a; Dvurechenskii et al., 2018). Beyond their methodological interest, Wasserstein barycenters have found many applications outside the field of shape analysis. They have been used for image processing (Rabin et al., 2011), in particular color modification (Solomon et al., 2015) (see Figure 9.6); Bayesian computations (Srivastava et al., 2015a, b) to summarize measures; and nonlinear dimensionality reduction, to express an input measure as a Wasserstein barycenter of other known measures (Bonneel et al., 2016). All of these problems result in involved nonconvex objective functions which can be accurately optimized using automatic differentiation (see Remark 9.1.3). Problems closely related to the computation of barycenters include the computation of principal components analyses over the Wasserstein space (see, for instance, (Seguy and Cuturi, 2015; Bigot et al., 2017b)) and the statistical estimation of template models (Boissard et al., 2015). The ability to compute barycenters enables more advanced clustering methods such as the kk-means on the space of probability measures (del Barrio et al., 2016; Ho et al., 2017).

3 Gradient Flows

where τ\tau is a small enough step size. This corresponds to a so-called “explicit” minimization scheme and only applies for smooth functions FF. For nonsmooth functions, one can use instead an “implicit” scheme, which is also called the proximal-point algorithm (see, for instance, Bauschke and Combettes (2011))

Analyzing the convergence of gradient flows discretized in both time and space is difficult in general. Due to the polyhedral nature of the linear program defining the distance, using too-small step sizes leads to a “locking” phenomena (the distribution is stuck and does not evolve, so that the step size should be not too small, as discussed in (Maury and Preux, 2017)). We refer to (Matthes and Osberger, 2014, 2017) for a convergence analysis of a discretization method for gradient flows in one dimension.

It is also possible to compute gradient flows for unbalanced optimal transport distances as detailed in §10.2. This results in evolutions allowing mass creation or destruction, which is crucial to model many physical, biological or chemical phenomena. An example of unbalanced gradient flow is the celebrated Hele-Shaw model for cell growth (Perthame et al., 2014), which is studied theoretically in (Gallouët and Monsaingeon, 2017; Di Marino and Chizat, 2017). Such an unbalanced gradient flow also can be approximated using the generalized Sinkhorn algorithm (Chizat et al., 2018b).

4 Minimum Kantorovich Estimators

Given some discrete samples (xi)i=1nX(x_{i})_{i=1}^{n}\subset\mathcal{X} from some unknown distribution, the goal is to fit a parametric model θαθM(X)\theta\mapsto\alpha_{\theta}\in\mathcal{M}(\mathcal{X}) to the observed empirical input measure β\beta

where L\mathcal{L} is some “loss” function between a discrete and a “continuous” (arbitrary) distribution (see Figure 9.10).

In the case where αθ\alpha_{\theta} as a density ρθ=\mboxdef.ραθ\rho_{\theta}\stackrel{{\scriptstyle\mbox{\tiny def.}}}{{=}}\rho_{\alpha_{\theta}} with respect to the Lebesgue measure (or any other fixed reference measure), the maximum likelihood estimator (MLE) is obtained by solving

This corresponds to using an empirical counterpart of a Kullback–Leibler loss since, assuming the xix_{i} are i.i.d. samples of some βˉ\bar{\beta}, then

This MLE approach is known to lead to optimal estimation procedures in many cases (see, for instance, Owen (2001)). However, it fails to work when estimating singular distributions, typically when the αθ\alpha_{\theta} does not have a density (so that LMLE(αθ,β)=+\mathcal{L}_{\text{MLE}}(\alpha_{\theta},\beta)=+\infty) or when (xi)i(x_{i})_{i} are samples from some singular βˉ\bar{\beta} (so that the αθ\alpha_{\theta} should share the same support as β\beta for KL(αθβˉ)\operatorname{KL}(\alpha_{\theta}|\bar{\beta}) to be finite, but this support is usually unknown). Another issue is that in several cases of practical interest, the density ρθ\rho_{\theta} is inaccessible (or too hard to compute).

A typical setup where both problems (singular and unknown densities) occur is for so-called generative models, where the parametric measure is written as a push-forward of a fixed reference measure ζM(Z)\zeta\in\mathcal{M}(\mathcal{Z})

where the push-forward operator is introduced in Definition 2.1. The space Z\mathcal{Z} is usually low-dimensional, so that the support of αθ\alpha_{\theta} is localized along a low-dimensional “manifold” and the resulting density is highly singular (it does not have a density with respect to Lebesgue measure). Furthermore, computing this density is usually intractable, while generating i.i.d. samples from αθ\alpha_{\theta} is achieved by computing xi=hθ(zi)x_{i}=h_{\theta}(z_{i}), where (zi)i(z_{i})_{i} are i.i.d. samples from ζ\zeta.

In order to cope with such a difficult scenario, one has to use weak metrics in place of the MLE functional LMLE\mathcal{L}_{\text{MLE}}, which needs to be written in dual form as

Dual norms shown in §8.2 correspond to imposing

while optimal transport (2.24) sets R=R(c)\mathcal{R}=\mathcal{R}(c) as defined in (2.25).

For a fixed θ\theta, evaluating the energy to be minimized in (9.34) using such a loss function corresponds to solving a semidiscrete optimal transport, which is the focus of Chapter 5. Minimizing the energy with respect to θ\theta is much more involved and is typically highly nonconvex.

Denoting fθf_{\theta} a solution to (9.35) when evaluating E(θ)=L(αθ,β)\mathcal{E}(\theta)=\mathcal{L}(\alpha_{\theta},\beta), a subgradient is obtained using the formula

For the OT loss, an alternative gradient formula is obtained when one rather computes a primal optimal coupling for the following equivalent problem:

Note that in the semidiscrete case considered here, the objective to be minimized can be actually decomposed as

where each γiM+1(Z)\gamma_{i}\in\mathcal{M}_{+}^{1}(\mathcal{Z}). Once an optimal (γθ,i)i(\gamma_{\theta,i})_{i} solving (9.38) is obtained, the gradient of E(θ)\mathcal{E}(\theta) is computed as

The class of estimators obtained using L=Lc\mathcal{L}=\mathcal{L}_{c}, often called “minimum Kantorovich estimators,” was initially introduced in (Bassetti et al., 2006); see also (Canas and Rosasco, 2012). It has been used in the context of generative models by (Montavon et al., 2016) to train restricted Boltzmann machines and in (Bernton et al., 2017) in conjunction with approximate Bayesian computations. Approximations of these computations using Deep Network are used to train deep generative models for both GAN (Arjovsky et al., 2017) and VAE (Bousquet et al., 2017); see also (Genevay et al., 2018, 2017; Salimans et al., 2018). Note that the use of Sinkhorn divergences for parametric model fitting is used routinely for shape matching and registration, see (Gold et al., 1998; Chui and Rangarajan, 2000; Myronenko and Song, 2010; Feydy et al., 2017).

Chapter 10 Extensions of Optimal Transport

This chapter details several variational problems that are related to (and share the same structure of) the Kantorovich formulation of optimal transport. The goal is to extend optimal transport to more general settings: several input histograms and measures, unnormalized ones, more general classes of measures, and optimal transport between measures that focuses on local regularities (points nearby in the source measure should be mapped onto points nearby in the target measure) rather than a total transport cost, including cases where these two measures live in different metric spaces.

The entropic regularization scheme (4.2) naturally extends to this setting

and one can then apply Sinkhorn’s algorithm to compute the optimal P in scaling form, where each entry indexed by a multi-index vector i=(i1,,iS)i=(i_{1},\ldots,i_{S})

2 Unbalanced Optimal Transport

A major bottleneck of optimal transport in its usual form is that it requires the two input measures (α,β)(\alpha,\beta) to have the same total mass. While many workarounds have been proposed (including renormalizing the input measure, or using dual norms such as detailed in § 8.2), it is only recently that satisfying unifying theories have been developed. We only sketch here a simple but important particular case.

Sinkhorn’s iterations (4.15) can be adapted to this problem by making use of the generalized algorithm detailed in §4.6. The solution has the form (4.12) and the scalings are updated as

In practice, unbalanced OT techniques seem to outperform classical OT for applications (such as in imaging or machine learning) where the input data is noisy or not perfectly known. They are also crucial when the signal strength of a measure, as measured by its total mass, must be accounted for, or when normalization is not meaningful. This was the original motivation of Frogner et al. (2015), whose goal was to compare sets of word labels used to describe images. Unbalanced OT and the corresponding Sinkhorn iterations have also been used for applications to the dynamics of cells in (Schiebinger et al., 2017).

A particularly simple setup to account for mass variation is to use dual norms, as detailed in §8.2. By choosing a compact set BC(X)B\subset\mathcal{C}(\mathcal{X}) one obtains a norm defined on the whole space M(X)\mathcal{M}(\mathcal{X}) (in particular, the measures do not need to be positive). A particular instance of this setting is the flat norm (8.11), which is recovered as a special instance of unbalanced transport, when using Dφ(αα)=ααTV\mathcal{D}_{\varphi}(\alpha|\alpha^{\prime})=\left\|\alpha-\alpha^{\prime}\right\|_{\operatorname{TV}} to be the total variation norm (8.9); see, for instance, (Hanin, 1992; Lellmann et al., 2014). We also refer to (Schmitzer and Wirth, 2017) for a general framework to define Wasserstein-1 unbalanced transport.

3 Problems with Extra Constraints on the Couplings

Many other OT-like problems have been proposed in the literature. They typically correspond to adding extra constraints C\mathcal{C} on the set of feasible couplings appearing in the original OT problem (2.15)

Let us give two representative examples. The optimal transport with capacity constraint (Korman and McCann, 2015) corresponds to imposing that the density ρπ\rho_{\pi} (for instance, with respect to the Lebesgue measure) is upper bounded

This constraint imposes that the barycentric projection map (4.20) of any admissible coupling must be equal to the identity. For arbitrary (α,β)(\alpha,\beta), this set C\mathcal{C} is typically empty, but necessary and sufficient conditions exist (α\alpha and β\beta should be in “convex order”) to ensure C\mathcal{C}\neq\emptyset so that (α,β)(\alpha,\beta) satisfy a martingale constraint. This constraint can be difficult to enforce numerically when discretizing an existing problem. It also forbids the solution to concentrate on a single Monge map, and can lead to couplings concentrated on the union of several graphs (a “multivalued” Monge map), or even more complicated support sets. Using an entropic penalization as in (4.9), one can solve approximately (10.10) using the Dykstra algorithm as explained in Benamou et al. (2015), which is a generalization of Sinkhorn’s algorithm shown in §4.2. This requires computing the projection onto C\mathcal{C} for the KL\operatorname{KL} divergence, which is straightforward for (10.11) but cannot be done in closed form (10.12) and thus necessitates subiterations; see (Guo and Obloj, 2017) for more details.

4 Sliced Wasserstein Distance and Barycenters

The advantage of this functional is that 1-D Wasserstein distances are simple to compute, as detailed in §2.6. In the specific case where m=nm=n and

this is achieved by simply sorting points

Fixing the vector yy, the function Eβ(x)=\mboxdef.SW(α,β)2\mathcal{E}_{\beta}(x)\stackrel{{\scriptstyle\mbox{\tiny def.}}}{{=}}\operatorname{SW}(\alpha,\beta)^{2} is smooth, and one can use this function to define a mapping by gradient descent

using a small enough step size τ>0\tau>0. To make the method tractable, one can use a stochastic gradient descent (SGD), replacing this integral with a discrete sum against randomly drawn directions θ\SSd\theta\in\SS^{d} (see §5.4 for more details on SGD). The flow (10.15) can be understood as (Langrangian implementation of) a Wasserstein gradient flow (in the sense of §9.3) of the function αSW(α,β)2\alpha\mapsto\operatorname{SW}(\alpha,\beta)^{2}. Numerically, one finds that this flow has no local minimizer and that it thus converges to α=β\alpha=\beta. The usefulness of the Lagrangian solver is that, at convergence, it defines a matching (similar to a Monge map) between the two distributions. This method has been used successfully for color transfer and texture synthesis in (Rabin et al., 2011) and is related to the alternate minimization approach detailed in (Pitié et al., 2007).

It is simple to extend this Lagrangian scheme to compute approximate “sliced” barycenters of measures, by mimicking the Frechet definition of Wasserstein barycenters (9.11) and minimizing

given a set (βs)s=1S(\beta_{s})_{s=1}^{S} of fixed input measure. Using a Lagrangian discretization of the form (10.14) for both α\alpha and the (βs)s(\beta_{s})_{s}, one can perform the nonconvex minimization over the position x=(xi)ix=(x_{i})_{i}

by gradient descent using formula (10.15) to compute Eβs(x)\nabla\mathcal{E}_{\beta_{s}}(x) (coupled with a random sampling of the direction θ\theta).

A related way to compute an approximated sliced barycenter, without resorting to an iterative minimization scheme, is to use the fact that (10.13) computes a distance between the Radon transforms R(α)\mathcal{R}(\alpha) and R(β)\mathcal{R}(\beta) where

A crucial point is that the Radon transform is invertible and that its inverse can be computed using a filtered backprojection formula. Given a collection of measures ρ=(ρθ)θ\SSd\rho=(\rho_{\theta})_{\theta\in\SS^{d}}, one defines the filtered backprojection operator as

In order to compute barycenters of input densities, it makes sense to replace formula (9.11) by its equivalent using Radon transform, and thus consider independently for each θ\theta the 1-D barycenter problem

Each 1-D barycenter problem is easily computed using the monotone rearrangement as detailed in Remark 6. The Radon approximation αR=\mboxdef.R+(ρ)\alpha_{R}\stackrel{{\scriptstyle\mbox{\tiny def.}}}{{=}}\mathcal{R}^{+}(\rho^{\star}) of a sliced barycenter solving (9.11) is then obtained by the inverse Radon transform R+\mathcal{R}^{+}. Note that in general, αR\alpha_{R} is not a solution to (9.11) because the Radon transform is not surjective, so that ρ\rho^{\star}, which is obtained as a barycenter of the Radon transforms R(βs)\mathcal{R}(\beta_{s}) does not necessarily belong to the range of R\mathcal{R}. But numerically it seems in practice to be almost the case (Bonneel et al., 2015). Numerically, this Radon transform formulation is very effective for input measures and barycenters discretized on a fixed grid (e.g. a uniform grid for images), and R\mathcal{R} and well as R+\mathcal{R}^{+} are computed approximately on this grid using fast algorithms (see, for instance, (Averbuch et al., 2001)). Figure 10.2 illustrates this computation of barycenters (and highlights the way the Radon transforms are interpolated), while Figure 10.3 shows a comparison of the Radon barycenters (10.20) and the ones obtained by Lagrangian discretization (10.17).

for 1p21\leq p\leq 2 for the exponential kernels and 0<p<20<p<2 for the energy distance kernels. This means that for any collection (αi)i(\alpha_{i})_{i} of input measures, the matrix (k(αi,αj))i,j(k(\alpha_{i},\alpha_{j}))_{i,j} is symmetric positive semidefinite. It is possible to use these kernels to perform a variety of machine learning tasks using the “kernel trick,” for instance, in regression, classification (SVM and logistic), clustering (K-means) and dimensionality reduction (PCA) (Hofmann et al., 2008). We refer to Kolouri et al. (2016) for details and applications.

5 Transporting Vectors and Matrices

where g(x)=(gs(x))s=1dg(x)=(g_{s}(x))_{s=1}^{d} are the coordinates of gg in the canonical basis.

One can extend several divergences introduced in §8.1 to this setting. For instance, the Bures metric (2.42) is a generalization of the Hellinger distance (defined in Remark 8.1.6), since they are equal on positive diagonal matrices. One can also extend the Kullback–Leibler divergence (4.6) (see also Remark 8.1.2), which is generalized to positive matrices as

where log()\log(\cdot) is the matrix logarithm. This matrix KL is convex with both of its arguments.

It is possible to solve convex dynamic formulations to define OT distances between such matrices (Carlen and Maas, 2014; Chen et al., 2016b, 2017). There also exists an equivalent of Sinkhorn’s algorithm, which is due to Gurvits (2004) and has been extensively studied in (Georgiou and Pavon, 2015); see also the review paper (Idel, 2016). It is known to converge only in some cases but seems empirically to always work.

6 Gromov–Wasserstein Distances

For some applications such as shape matching, an important weakness of optimal transport distances lies in the fact that they are not invariant to important families of invariances, such as rescaling, translation or rotations. Although some nonconvex variants of OT to handle such global transformations have been proposed (Cohen and Guibas, 1999; Pele and Taskar, 2013) and recently applied to problems such as cross-lingual word embeddings alignments (Grave et al., 2019; Alvarez-Melis et al., 2019; Grave et al., 2019), these methods require specifying first a subset of invariances, possibly between different metric spaces, to be relevant. We describe in this section a more general and very natural extension of OT that can deal with measures defined on different spaces without requiring the definition of a family of invariances.

The Hausdorff distance between two sets A,BZA,B\subset\mathcal{Z} for some metric dZd_{\mathcal{Z}} is

see Figure 10.5. This defines a distance between compact sets K(Z)\mathcal{K}(\mathcal{Z}) of Z\mathcal{Z}, and if Z\mathcal{Z} is compact, then (K(Z),HZ)(\mathcal{K}(\mathcal{Z}),\mathcal{H}_{\mathcal{Z}}) is itself compact; see (Burago et al., 2001).

Following Mémoli (2011), one remarks that this distance between sets (A,B)(A,B) can be defined similarly to the Wasserstein distance between measures (which should be somehow understood as “weighted” sets). One replaces the measures couplings (2.14) by sets couplings

With respect to Kantorovich problem (2.15), one should replace integration (since one does not have access to measures) by maximization, and one has

Note that the support of a measure coupling πU(α,β)\pi\in\mathcal{U}(\alpha,\beta) is a set coupling between the supports, i.e. Supp(π)R(Supp(α),Supp(β))\operatorname{Supp}(\pi)\in\mathcal{R}(\operatorname{Supp}(\alpha),\operatorname{Supp}(\beta)). The Hausdorff distance is thus connected to the \infty-Wasserstein distance (see Remark 2.20) and one has H(A,B)W(α,β)\mathcal{H}(A,B)\leq\operatorname{\mathcal{W}}_{\infty}(\alpha,\beta) for any measure (α,β)(\alpha,\beta) whose supports are (A,B)(A,B).

6.2 Gromov–Hausdorff distance

The Gromov–Hausdorff (GH) distance (Gromov, 2001) (see also (Edwards, 1975)) is a way to measure the distance between two metric spaces (X,dX),(Y,dY)(\mathcal{X},d_{\mathcal{X}}),(\mathcal{Y},d_{\mathcal{Y}}) by quantifying how far they are from being isometric to each other, see Figure 10.6. It is defined as the minimum Hausdorff distance between every possible isometric embedding of the two spaces in a third one,

Here, the constraint is that ff must be an isometric embedding, meaning that dZ(f(x),f(x))=dX(x,x)d_{\mathcal{Z}}(f(x),f(x^{\prime}))=d_{\mathcal{X}}(x,x^{\prime}) for any (x,x)X2(x,x^{\prime})\in\mathcal{X}^{2} (similarly for gg). One can show that GH\mathcal{G}\mathcal{H} defines a distance between compact metric spaces up to isometries, so that in particular GH(dX,dY)=0\mathcal{G}\mathcal{H}(d_{\mathcal{X}},d_{\mathcal{Y}})=0 if and only if there exists an isometry h:XYh:\mathcal{X}\rightarrow\mathcal{Y}, i.e. hh is bijective and dY(h(x),h(x))=dX(x,x)d_{\mathcal{Y}}(h(x),h(x^{\prime}))=d_{\mathcal{X}}(x,x^{\prime}) for any (x,x)X2(x,x^{\prime})\in\mathcal{X}^{2}.

Similarly to (10.23) and as explained in (Mémoli, 2011), it is possible to rewrite equivalently the GH distance using couplings as follows:

The initial motivation of the GH distance is to define and study limits of metric spaces, as illustrated in Figure 10.7, and we refer to (Burago et al., 2001) for details. There is an explicit description of the geodesics for the GH distance (Chowdhury and Mémoli, 2016), which is very similar to the one in Gromov–Wasserstein spaces, detailed in Remark 8.

The underlying optimization problem (10.24) is highly nonconvex, and computing the global minimum is untractable. It has been approached numerically using approximation schemes and has found applications in vision and graphics for shape matching (Mémoli and Sapiro, 2005; Bronstein et al., 2006).

It is often desirable to “smooth” the definition of the Hausdorff distance by replacing the maximization by an integration. This in turn necessitates the introduction of measures, and it is one of the motivations for the definition of the GW distance in the next section.

6.3 Gromov–Wasserstein Distance

see Figure 10.8. This problem is similar to the GH problem (10.24) when replacing maximization by a sum and set couplings by measure couplings. This is a nonconvex problem, which can be recast as a quadratic assignment problem (Loiola et al., 2007) and is in full generality NP-hard to solve for arbitrary inputs. It is in fact equivalent to a graph matching problem (Lyzinski et al., 2016) for a particular cost.

6.4 Entropic Regularization

As proposed initially in (Gold and Rangarajan, 1996; Rangarajan et al., 1999), and later revisited in (Solomon et al., 2016a) for applications in graphics, one can use iteratively Sinkhorn’s algorithm to progressively compute a stationary point of (10.27). Indeed, successive linearizations of the objective function lead to consider the succession of updates

Acknowledgements

We would like to thank the many colleagues, collaborators and students who have helped us at various stages when preparing this survey. Some of their inputs have shaped this work, and we would like to thank in particular Jean-David Benamou, Yann Brenier, Guillaume Carlier, Vincent Duval and the entire MOKAPLAN team at Inria; Francis Bach, Espen Bernton, Mathieu Blondel, Nicolas Courty, Rémi Flamary, Alexandre Gramfort, Young-Heon Kim, Daniel Matthes, Philippe Rigollet, Filippo Santambrogio, Justin Solomon, Jonathan Weed; as well as the feedback by our current and former students on these subjects, in particular Gwendoline de Bie, Lénaïc Chizat, Aude Genevay, Hicham Janati, Théo Lacombe, Boris Muzellec, Francois-Pierre Paty, Vivien Seguy.

References