GAN and VAE from an Optimal Transport Point of View

Aude Genevay, Gabriel Peyré, Marco Cuturi

Minimum Kantorovitch Estimators

Given some empirical distribution ν=\mboxdef.1nj=1nδyj\nu\stackrel{{\scriptstyle\mbox{\tiny def.}}}{{=}}\frac{1}{n}\sum_{j=1}^{n}\delta_{y_{j}} where yjXRpy_{j}\in\mathcal{X}\subset\mathbb{R}^{p}, and a parametric family of probability distributions (μθ)θΘP(X)(\mu_{\theta})_{\theta\in\Theta}\subset\mathcal{P}(\mathcal{X}), ΘRq\Theta\subset\mathbb{R}^{q}, a Minimum Kantorovitch Estimator (MKE) for θ\theta is defined as any solution of the problem

where WcW_{c} is the Wasserstein cost on P(X)\mathcal{P}(\mathcal{X}) for some ground cost function c:X×XRc:\mathcal{X}\times\mathcal{X}\rightarrow\mathbb{R}, defined as

where P1(x,y)=xP_{1}(x,y)=x and P2(x,y)=yP_{2}(x,y)=y, and P1P_{1\sharp} and P2P_{2\sharp} are marginalization operators that return for a given coupling γ\gamma its first and second marginal, respectively.

The notations P1P_{1\sharp} and P2P_{2\sharp} above agree with the more general notion of pushforward measures: Given a measurable map g:ZXg:\mathcal{Z}\rightarrow\mathcal{X}, which can be interpreted as a function “moving” points from a measurable space to another, one can naturally extend gg to become a more general map gg_{\sharp} that can now “move” an entire probability measure on Z\mathcal{Z} towards a new probability measure on X\mathcal{X}. The operator gg_{\sharp} “pushes forward” each elementary mass of a measure ζ\zeta in P(Z)\mathcal{P}(\mathcal{Z}) by applying the map gg to obtain then a mass in X\mathcal{X}, to build on aggregate a new measure in P(X)\mathcal{P}(\mathcal{X}) written gζg_{\sharp}\zeta. More rigorously, the pushforward measure of a measure ζP(Z)\zeta\in\mathcal{P}(\mathcal{Z}) by a map g:ZXg:\mathcal{Z}\rightarrow\mathcal{X} is the measure denoted as gζg_{\sharp}\zeta in P(X)\mathcal{P}(\mathcal{X}) such that for any set BXB\subset\mathcal{X}, (gζ)(B)=\mboxdef.ζ(g1(B))=ζ({zZ  ;  g(z)B})(g_{\sharp}\zeta)(B)\stackrel{{\scriptstyle\mbox{\tiny def.}}}{{=}}\zeta(g^{-1}(B))=\zeta(\left\{z\in\mathcal{Z}\;;\;g(z)\in B\right\}).

MKE-GM.

The MKE approach can be used directly in the case where (μθ)θ(\mu_{\theta})_{\theta} is a statistical model, namely a parameterized family of probability distributions with a given density with respect to a dominant base measure, as considered for instance with exponential families on discrete spaces in . However, the MKE approach can also be used in a generative model setting, where μθ\mu_{\theta} is defined instead as the push forward of a fixed distribution ζ\zeta supported on a low dimensional space ZRd\mathcal{Z}\subset\mathbb{R}^{d}, dpd\ll p, where the parameterization lies now in choosing a map gθ:ZXg_{\theta}:\mathcal{Z}\mapsto\mathcal{X}, i.e. μθ=gθζ\mu_{\theta}=g_{\theta\sharp}\zeta, resulting in the following special case of the original (MKE) problem:

The map gθg_{\theta} should be therefore thought as a “decoding” map from a low dimensional space to a high dimensional space. In such a setting, the maximum likelihood estimator is in general undefined or difficult to compute (because the support of the measures μθ\mu_{\theta} are singular) while MKEs are attractive because they are always well defined.

Dual Formulation and GAN

Because (1) is a linear program, it has a dual formulation, known as the Kantorovich problem [13, Thm. 5.9]:

subscript𝒵ℎsubscript𝑔𝜃𝑧differential-d𝜁𝑧subscript𝒳~ℎ𝑦differential-d𝜈𝑦ℎ𝑥~ℎ𝑦𝑐𝑥𝑦E(\theta)=\underset{h,\tilde{h}}{\max}\;\left\{\int_{\mathcal{Z}}h(g_{\theta}(z))\mathrm{d}\zeta(z)+\int_{\mathcal{X}}\tilde{h}(y)\mathrm{d}\nu(y)\;;\;h(x)+\tilde{h}(y)\leqslant c(x,y)\right\}. (2) where (h,h~)(h,\tilde{h}) are continuous functions on X\mathcal{X} often called Kantorovich potentials in the literature.

In the dual formulation (2), θ\theta does not appear anymore in the constraints. Therefore, the gradient of EE can be computed as

𝜃subscript𝑔𝜃𝑧top∇superscriptℎ⋆subscript𝑔𝜃𝑧differential-d𝜁𝑧\nabla E(\theta)=\int_{\mathcal{Z}}[\partial_{\theta}g_{\theta}(z)]^{\top}\nabla h^{\star}(g_{\theta}(z))\mathrm{d}\zeta(z), (3) where hh^{\star} is an optimal dual function solving (2). Here [θgθ(z)]Rq×p[\partial_{\theta}g_{\theta}(z)]^{\top}\in\mathbb{R}^{q\times p} is the adjoint of the Jacobian of θgθ(z)\theta\mapsto g_{\theta}(z), where qq is the dimension of the parameter space Θ\Theta.

A key remark in Kantorovich’s formulation is to notice that the cost of any pair (h,h~)(h,\tilde{h}) can always be improved by replacing h~\tilde{h} in (2) by the cc-transform hch^{c} of hh defined as

which is, indeed, given a candidate potential hh for the first variable, the best possible potential that can be paired with hh that satisfies the constraints of (2) (see [13, Thm. 5.9]). For this reason, one can parameterize problem (2) as depending on one potential function only.

A first approach to solve (2) is to remark that since ν\nu is discrete, one can replace the continuous potential h~\tilde{h} by the discrete vector (h~(yj))jRn(\tilde{h}(y_{j}))_{j}\in\mathbb{R}^{n} and impose h=(h~)ch=(\tilde{h})^{c}. As shown in , the optimization over h~\tilde{h} can then be achieved using stochastic gradient descent.

Similarly to , another approach is to approximate (2) by restricting the dual potential hh to have a parametric form h=hξ:XRh=h_{\xi}:\mathcal{X}\rightarrow\mathbb{R} where ξ\xi is a discriminative deep network (see Figure 1, center). This map hξh_{\xi} is often referred to as being an “adversarial” map. Plugging this ansatz in (2) leads to the Wasserstein-GAN problem

𝜃𝜉subscript𝒵subscriptℎ𝜉subscript𝑔𝜉𝑧differential-d𝜁𝑧subscript𝑗superscriptsubscriptℎ𝜉𝑐subscript𝑦𝑗\underset{\theta}{\min}\;\underset{\xi}{\max}\;\int_{\mathcal{Z}}h_{\xi}\circ g_{\xi}(z)\mathrm{d}\zeta(z)+\sum_{j}h_{\xi}^{c}(y_{j}). (WGAN) In the special case where c(x,y)= ⁣xy ⁣c(x,y)=|\!|x-y|\!|, one can prove that the mechanics of cc-transforms result in the additional constraint that h~=h\tilde{h}=-h, subject to hh being a 11-Lipschitz function, see [13, Particular case 5.4]. This is used in to replace hξch_{\xi}^{c} by hξ-h_{\xi} in (WGAN) and use a deep network made of ReLu units whose Lipschitz constant is upper-bounded by 11.

As a side-note, and as previously commented in the literature, there is at this point no empirical evidence that supports the idea that using discriminative deep networks that way can result in accurate approximations of Wasserstein distances. These alternative formulations provide instead a very useful proxy for a quantity directly related to the Wasserstein distance.

Primal Formulation and VAE

Following , in the special case of a generative model μθ=gθζ\mu_{\theta}=g_{\theta\sharp}\zeta, formula (1) can be conveniently re-written as

This is advantageous because now π\pi is defined over Z×X\mathcal{Z}\times\mathcal{X}, which is lower-dimensional than X×X\mathcal{X}\times\mathcal{X}, and also because, as in Equation (2), θ\theta does not appear in the constraints either. This provides an alternative formula for the gradient of EE:

𝜃subscript𝑔𝜃𝑧topsubscript∇1𝑐subscript𝑔𝜃𝑧𝑦differential-dsuperscript𝜋⋆𝑧𝑦\nabla E(\theta)=\int_{\mathcal{Z}\times\mathcal{X}}[\partial_{\theta}g_{\theta}(z)]^{\top}\nabla_{1}c(g_{\theta}(z),y)\mathrm{d}\pi^{\star}(z,y), (5) where π\pi^{\star} is an optimal coupling solving (4). Here 1c(x,y)Rp\nabla_{1}c(x,y)\in\mathbb{R}^{p} denotes the gradient of cc with respect to the first variable.

suggests to look for couplings π\pi with a parametric form. A simple way to achieve this is to restrict couplings π\pi to those of the form

where fξ:XZf_{\xi}:\mathcal{X}\rightarrow\mathcal{Z} is a parametric “encoding” map (typically a deep network), see Figure 1, right. This πξ\pi_{\xi} satisfies by construction the marginal constraint P2π=νP_{2\sharp}\pi=\nu, but in general it cannot satisfy the other constraint P1π=ζP_{1\sharp}\pi=\zeta (because P1πξP_{1\sharp}\pi_{\xi} is discrete while ζ\zeta is not). So following , it makes sense to consider a relaxed “unbalanced” formulation (in the sense of ) of the form

subscript𝒵𝒳𝑐subscript𝑔𝜃𝑧𝑦differential-d𝜋𝑧𝑦𝜆𝐷conditionalsubscript𝑃1♯𝜋𝜁subscript𝑃2♯𝜋𝜈E_{\lambda}(\theta)=\underset{\pi}{\min}\;\left\{\int_{\mathcal{Z}\times\mathcal{X}}c(g_{\theta}(z),y)\mathrm{d}\pi(z,y)+\lambda D(P_{1\sharp}\pi|\zeta)\;;\;P_{2\sharp}\pi=\nu\right\}, (6) where D()D(\cdot|\cdot) is some distance or divergence between positive measures on Z\mathcal{Z} and λ>0\lambda>0 a relaxation parameter.

Plugging the ansatz π=πξ\pi=\pi_{\xi} in (6), one obtains the Wasserstein-VAE formulation

𝜃𝜉subscriptΔ𝜈subscript𝑔𝜃subscript𝑓𝜉subscriptId𝒳𝜆𝐷conditionalsubscript𝑓𝜉♯𝜈𝜁\underset{(\theta,\xi)}{\min}\;\Delta_{\nu}(g_{\theta}\circ f_{\xi},\mathrm{Id}_{\mathcal{X}})+\lambda D(f_{\xi\sharp}\nu|\zeta), (WVAE) where Δν(φ,IdX)\Delta_{\nu}(\varphi,\mathrm{Id}_{\mathcal{X}}) is the cost measuring the deviation of a map φ:XX\varphi:\mathcal{X}\rightarrow\mathcal{X} to identity

Such a cost is usually associated with the Monge formulation of optimal transport , whose original motivation was to find an optimal map under that cost that would be able to push forward a given measure onto another[12, §1.1].

Conclusions

The WGAN and WVAE formulations are very different, and are in some sense dual one of each other. For GAN, the couple (gθ,hξ)(g_{\theta},h_{\xi}) should be thought as a (primal, dual) pair (often referred to as adversarial pair, which is reminiscent of game theory saddle points). For VAE, the couple (fξ,gθ)(f_{\xi},g_{\theta}) is rather an (encoding, decoding) pair, and both have the flavour of transportation maps.

In sharp contrast to the primal gradient formula (5) which only requires integrating against an optimal coupling π\pi^{\star}, the dual gradient formula (3) involves the integration of the gradient of an optimal potential hh^{\star}. The latter tends to be more unstable and thus necessitates accurate optimization sub-iterations to obtain an optimal dual potential hh^{\star} or an approximation hξh_{\xi}^{\star} within a restricted parametric class . This is somehow inline with the empirical observation that training VAE is more stable than training GAN. One should however bear in mind that, although both formulations can be motivated by the same minimum Kantorovitch estimation problem (MKE-GM), they define quite different estimators. In particular, GAN is often credited for producing less blurry outputs when used for image generation.

Denoting θMKE,θWGAN\theta_{\text{\tiny MKE}},\theta_{\text{\tiny WGAN}} and θWVAE\theta_{\text{\tiny WVAE}} the solutions of (MKE-GM), (WGAN) and (WVAE), one has in the limit λ+\lambda\rightarrow+\infty (to cancel the bias due to the marginal constraint relaxation),

furthermore mentions that in the “non-parametric limit” (i.e. when the number of parameters appearing in ξ\xi tends to ++\infty, and also letting λ+\lambda\rightarrow+\infty), the gap between the estimators should vanish. Indeed, hξh_{\xi} and fξf_{\xi} should capture the desired optimal map in the limit and one thus recovers the true solution to (MKE-GM). While it would be interesting from a theoretical perspective to prove and quantify such a claim, it is unclear wether it would be useful for the practitioner. Indeed, the convergence rate might be slow, so that in practice one can be quite far from this non-parametric limit. One could even argue that this limit may give poor estimators for complicated datasets, so that parameterizing the maps and using non-convex optimization solvers lead instead to a beneficial and implicit regularization of these estimators.

References