Wasserstein Auto-Encoders

Ilya Tolstikhin, Olivier Bousquet, Sylvain Gelly, Bernhard Schoelkopf

Introduction

The field of representation learning was initially driven by supervised approaches, with impressive results using large labelled datasets. Unsupervised generative modeling, in contrast, used to be a domain governed by probabilistic approaches focusing on low-dimensional data. Recent years have seen a convergence of those two approaches. In the new field that formed at the intersection, variational auto-encoders (VAEs) constitute one well-established approach, theoretically elegant yet with the drawback that they tend to generate blurry samples when applied to natural images. In contrast, generative adversarial networks (GANs) turned out to be more impressive in terms of the visual quality of images sampled from the model, but come without an encoder, have been reported harder to train, and suffer from the “mode collapse” problem where the resulting model is unable to capture all the variability in the true data distribution. There has been a flurry of activity in assaying numerous configurations of GANs as well as combinations of VAEs and GANs. A unifying framework combining the best of GANs and VAEs in a principled way is yet to be discovered.

This work builds on the theoretical analysis presented in . Following , we approach generative modeling from the optimal transport (OT) point of view. The OT cost is a way to measure a distance between probability distributions and provides a much weaker topology than many others, including $f$-divergences associated with the original GAN algorithms . This is particularly important in applications, where data is usually supported on low-dimensional manifolds in the input space $\mathcal{X}$. As a result, stronger notions of distance (such as $f$-divergences, which capture the density ratio between distributions) often max out, providing no useful gradients for training. In contrast, OT was claimed to have a nicer behaviour, although it requires, in its GAN-like implementation, the addition of a constraint or a regularization term to the objective.

In this work we aim at minimizing the OT cost $W_c(P_X,P_G)$ between the true (but unknown) data distribution $P_X$ and a latent variable model $P_G$ specified by the prior distribution $P_Z$ of latent codes $Z\in\mathcal{Z}$ and the generative model $P_G(X|Z)$ of the data points $X\in\mathcal{X}$ given $Z$. Our main contributions are listed below (cf. also Figure 1):

Empirical evaluation of WAE on MNIST and CelebA datasets with squared cost $c(x,y)=\|x-y\|_2^2$. Our experiments show that WAE keeps the good properties of VAEs (stable training, encoder-decoder architecture, and a nice latent manifold structure) while generating samples of better quality, approaching those of GANs.

We propose and examine two different regularizers $\mathcal{D}_Z(P_Z,Q_Z)$. One is based on GANs and adversarial training in the latent space $\mathcal{Z}$. The other uses the maximum mean discrepancy, which is known to perform well when matching high-dimensional standard normal distributions $P_Z$ . Importantly, the second option leads to a fully adversary-free min-min optimization problem.

Finally, the theoretical considerations presented in and used here to derive the WAE objective might be interesting in their own right. In particular, Theorem 1 shows that in the case of generative models, the primal form of $W_c(P_X,P_G)$ is equivalent to a problem involving the optimization of a probabilistic encoder $Q(Z|X)$ .

The paper is structured as follows. In Section 2 we review a novel auto-encoder formulation for OT between $P_X$ and the latent variable model $P_G$ derived in . Relaxing the resulting constrained optimization problem we arrive at an objective of Wasserstein auto-encoders. We propose two different regularizers, leading to WAE-GAN and WAE-MMD algorithms. Section 3 discusses the related work. We present the experimental results in Section 4 and conclude by pointing out some promising directions for future work.

Proposed method

2 Optimal transport and its dual formulations

A rich class of divergences between probability distributions is induced by the optimal transport (OT) problem . Kantorovich’s formulation of the problem is given by

$$W_c(P_X,P_G) := \inf_{\Gamma\in\mathcal{P}(X\sim P_X,\,Y\sim P_G)} \mathbb{E}_{(X,Y)\sim\Gamma}\bigl[c(X,Y)\bigr], \qquad (1)$$

where $c(x,y)\colon\mathcal{X}\times\mathcal{X}\to\mathcal{R}_+$ is any measurable cost function and $\mathcal{P}(X\sim P_X,Y\sim P_G)$ is the set of all joint distributions of $(X,Y)$ with marginals $P_X$ and $P_G$ respectively. A particularly interesting case is when $(\mathcal{X},d)$ is a metric space and $c(x,y)=d^p(x,y)$ for $p\geq 1$. In this case $W_p$, the $p$-th root of $W_c$, is called the $p$-Wasserstein distance.
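To make the definition concrete, here is a minimal sketch for the special case of two one-dimensional empirical distributions with the same number of equally weighted samples. In that case (and for $c(x,y)=|x-y|^p$ with $p\geq 1$) the optimal coupling is known to match sorted samples, so no optimization is needed. The function name `wasserstein_1d` and the equal-sample-size restriction are assumptions of this illustration, not part of the paper.

```python
import numpy as np


def wasserstein_1d(x, y, p=1):
    """p-Wasserstein distance between two 1-D empirical distributions.

    Assumes both samples have the same size and uniform weights
    (an assumption of this sketch, not a requirement of OT in general).
    """
    x = np.sort(np.asarray(x, dtype=float))
    y = np.sort(np.asarray(y, dtype=float))
    if x.shape != y.shape:
        raise ValueError("this sketch assumes equal sample sizes")
    # In 1-D with cost |x - y|^p, p >= 1, pairing sorted samples is optimal.
    w_c = np.mean(np.abs(x - y) ** p)  # optimal transport cost W_c
    return w_c ** (1.0 / p)            # W_p is the p-th root of W_c


if __name__ == "__main__":
    a = np.array([0.0, 1.0, 3.0])
    b = np.array([5.0, 4.0, 2.0])
    print(wasserstein_1d(a, b, p=1))
    print(wasserstein_1d(a, b, p=2))
```

In higher dimensions no such closed form exists, which is one reason the paper works with the encoder-based reformulation instead.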

When $c(x,y)=d(x,y)$ the following Kantorovich-Rubinstein duality holds:

$$W_1(P_X,P_G) = \sup_{f\in\mathcal{F}_L} \mathbb{E}_{X\sim P_X}[f(X)] - \mathbb{E}_{Y\sim P_G}[f(Y)], \qquad (2)$$

where $\mathcal{F}_L$ is the class of all bounded 1-Lipschitz functions on $(\mathcal{X},d)$. (Note that the same symbol is used for $W_p$ and $W_c$, but only $p$ is a number and thus the above $W_1$ refers to the 1-Wasserstein distance.)

3 Application to generative models: Wasserstein auto-encoders

One way to look at modern generative models like VAEs and GANs is to postulate that they are trying to minimize certain discrepancy measures between the data distribution $P_X$ and the model $P_G$. Unfortunately, most of the standard divergences known in the literature, including those listed above, are hard or even impossible to compute, especially when $P_X$ is unknown and $P_G$ is parametrized by deep neural networks. Previous research provides several tricks to address this issue.

In this work we will focus on latent variable models $P_G$ defined by a two-step procedure, where first a code $Z$ is sampled from a fixed distribution $P_Z$ on a latent space $\mathcal{Z}$ and then $Z$ is mapped to the image $X\in\mathcal{X}=\mathcal{R}^d$ with a (possibly random) transformation. This results in a density of the form

$$p_G(x) := \int_{\mathcal{Z}} p_G(x|z)\,p_z(z)\,dz, \quad \forall x\in\mathcal{X}, \qquad (3)$$

assuming all involved densities are properly defined. For simplicity we will focus on non-random decoders, i.e. generative models $P_G(X|Z)$ deterministically mapping $Z$ to $X=G(Z)$ for a given map $G\colon\mathcal{Z}\to\mathcal{X}$. Similar results for random decoders can be found in Supplementary B.1.

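Theorem 1. For $P_G$ as defined above with deterministic $P_G(X|Z)$ and any function $G\colon\mathcal{Z}\to\mathcal{X}$,

$$\inf_{\Gamma\in\mathcal{P}(X\sim P_X,\,Y\sim P_G)} \mathbb{E}_{(X,Y)\sim\Gamma}\bigl[c(X,Y)\bigr] = \inf_{Q\colon Q_Z=P_Z} \mathbb{E}_{P_X}\mathbb{E}_{Q(Z|X)}\bigl[c\bigl(X,G(Z)\bigr)\bigr],$$

where $Q_Z$ is the marginal distribution of $Z$ when $X\sim P_X$ and $Z\sim Q(Z|X)$.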

This result allows us to optimize over random encoders $Q(Z|X)$ instead of optimizing over all couplings between $X$ and $Y$. Of course, both problems are still constrained. In order to implement a numerical solution we relax the constraints on $Q_Z$ by adding a penalty to the objective. This finally leads us to the WAE objective:

$$D_{\mathrm{WAE}}(P_X,P_G) := \inf_{Q(Z|X)\in\mathcal{Q}} \mathbb{E}_{P_X}\mathbb{E}_{Q(Z|X)}\bigl[c\bigl(X,G(Z)\bigr)\bigr] + \lambda\cdot\mathcal{D}_Z(Q_Z,P_Z), \qquad (4)$$

where $\mathcal{Q}$ is any nonparametric set of probabilistic encoders, $\mathcal{D}_Z$ is an arbitrary divergence between $Q_Z$ and $P_Z$, and $\lambda>0$ is a hyperparameter. Similarly to VAE, we propose to use deep neural networks to parametrize both encoders $Q$ and decoders $G$. Note that as opposed to VAEs, the WAE formulation allows for non-random encoders deterministically mapping inputs to their latent codes.

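We propose two different penalties $\mathcal{D}_Z(Q_Z,P_Z)$:

MMD-based $\mathcal{D}_Z$. For a positive-definite reproducing kernel $k\colon\mathcal{Z}\times\mathcal{Z}\to\mathcal{R}$ the following expression is called the maximum mean discrepancy (MMD):

$$\mathrm{MMD}_k(P_Z,Q_Z) = \Bigl\|\int_{\mathcal{Z}} k(z,\cdot)\,dP_Z(z) - \int_{\mathcal{Z}} k(z,\cdot)\,dQ_Z(z)\Bigr\|_{\mathcal{H}_k},$$

where $\mathcal{H}_k$ is the reproducing kernel Hilbert space associated with $k$.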
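In practice the MMD is estimated from samples. The following sketch computes the standard unbiased U-statistic estimate of the squared MMD between a batch drawn from the prior and a batch of encoded points. The kernel $k(x,y)=C/(C+\|x-y\|_2^2)$ (an inverse multiquadratic kernel, the family mentioned in Appendix D), the default scale `c=1.0`, and the function names are assumptions of this illustration rather than the exact configuration of the reported experiments.

```python
import numpy as np


def imq_kernel(a, b, c=1.0):
    """Pairwise inverse multiquadratic kernel k(x, y) = c / (c + ||x - y||^2)."""
    sq_dists = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)
    return c / (c + sq_dists)


def mmd_penalty(z_prior, z_encoded, c=1.0):
    """Unbiased U-statistic estimate of the squared MMD between two samples."""
    n, m = z_prior.shape[0], z_encoded.shape[0]
    k_pp = imq_kernel(z_prior, z_prior, c)
    k_qq = imq_kernel(z_encoded, z_encoded, c)
    k_pq = imq_kernel(z_prior, z_encoded, c)
    # Within-sample terms exclude the diagonal (i == j) entries.
    term_pp = (k_pp.sum() - np.trace(k_pp)) / (n * (n - 1))
    term_qq = (k_qq.sum() - np.trace(k_qq)) / (m * (m - 1))
    # The cross term averages over all pairs.
    return term_pp + term_qq - 2.0 * k_pq.mean()


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    z_p = rng.normal(size=(200, 2))        # samples from the prior
    z_q = rng.normal(size=(200, 2))        # "encoded" samples matching the prior
    z_shifted = z_q + 3.0                  # "encoded" samples far from the prior
    print(mmd_penalty(z_p, z_q), mmd_penalty(z_p, z_shifted))
```

Because the estimate is unbiased, it can be slightly negative when the two samples come from the same distribution; it grows as the encoded distribution moves away from the prior.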

Related work

Literature on auto-encoders. Classical unregularized auto-encoders minimize only the reconstruction cost. This results in different training points being encoded into non-overlapping zones chaotically scattered all across the $\mathcal{Z}$ space with “holes” in between where the decoder mapping $P_G(X|Z)$ has never been trained. Overall, the encoder $Q(Z|X)$ trained in this way does not provide a useful representation and sampling from the latent space $\mathcal{Z}$ becomes hard .

When used with $c(x,y)=\|x-y\|_2^2$, WAE-GAN is equivalent to adversarial auto-encoders (AAE) proposed by . Theory of (and in particular Theorem 1) thus suggests that AAEs minimize the 2-Wasserstein distance between $P_X$ and $P_G$. This provides the first theoretical justification for AAEs known to the authors. WAE generalizes AAE in two ways: first, it can use any cost function $c$ in the input space $\mathcal{X}$; second, it can use any discrepancy measure $\mathcal{D}_Z$ in the latent space $\mathcal{Z}$ (for instance MMD), not necessarily the adversarial one of WAE-GAN.

Finally, independently proposed a regularized auto-encoder objective similar to and our (4) based on very different motivations and arguments. Following VAEs their objective (called InfoVAE) defines the reconstruction cost in the image space implicitly through the negative log likelihood term $-\log p_G(x|z)$, which should be properly normalized for all $z\in\mathcal{Z}$. In theory VAE and InfoVAE can both induce arbitrary cost functions; however, in practice this may require an estimation of the normalizing constant (partition function), which can be different for different values of $z$. (Two popular choices are Gaussian and Bernoulli decoders $P_G(X|Z)$, leading to pixel-wise squared and cross-entropy losses respectively. In both cases the normalizing constants can be computed in closed form and don’t depend on $Z$.) WAEs specify the cost $c(x,y)$ explicitly and don’t constrain it in any way.

Literature on OT address computing the OT cost in large scale using SGD and sampling. They approach this task either through the dual formulation, or via a regularized version of the primal. They do not discuss any implications for generative modeling. Our approach is based on the primal form of OT, we arrive at regularizers which are very different, and our main focus is on generative modeling.

The WGAN minimizes the 1-Wasserstein distance $W_1(P_X,P_G)$ for generative modeling. The authors approach this task from the dual form. Their algorithm comes without an encoder and cannot be readily applied to any other cost $W_c$, because the neat form of the Kantorovich-Rubinstein duality (2) holds only for $W_1$. WAE approaches the same problem from the primal form, can be applied for any cost function $c$, and comes naturally with an encoder.

Literature on GANs. Many of the GAN variations (including $f$-GAN and WGAN) come without an encoder. Often it may be desirable to reconstruct the latent codes and use the learned manifold, in which cases these models are not applicable.

Experiments

In this section we empirically evaluate the proposed WAE model (the code is available at github.com/tolstikhin/wae). We would like to test if WAE can simultaneously achieve (i) accurate reconstructions of data points, (ii) reasonable geometry of the latent manifold, and (iii) random samples of good (visual) quality. Importantly, the model should generalize well: requirements (i) and (ii) should be met on both training and test data. We trained WAE-GAN and WAE-MMD (Algorithms 1 and 2) on two real-world datasets: MNIST consisting of 70k images and CelebA containing roughly 203k images.

Experimental setup. In all reported experiments we used Euclidean latent spaces $\mathcal{Z}=\mathcal{R}^{d_z}$ for various $d_z$ depending on the complexity of the dataset, isotropic Gaussian prior distributions $P_Z(Z)=\mathcal{N}(Z;\bm{0},\sigma_z^2\cdot\bm{I}_d)$ over $\mathcal{Z}$, and a squared cost function $c(x,y)=\|x-y\|_2^2$ for data points $x,y\in\mathcal{X}=\mathcal{R}^{d_x}$. We used deterministic encoder-decoder pairs, Adam with $\beta_1=0.5$, $\beta_2=0.999$, and convolutional deep neural network architectures for the encoder mapping $\mu_\phi\colon\mathcal{X}\to\mathcal{Z}$ and decoder mapping $G_\theta\colon\mathcal{Z}\to\mathcal{X}$ similar to the DCGAN ones reported by with batch normalization . We tried various values of $\lambda$ and noticed that $\lambda=10$ seems to work well across all datasets we considered.

Since we are using deterministic encoders, choosing $d_z$ larger than the intrinsic dimensionality of the dataset would force the encoded distribution $Q_Z$ to live on a manifold in $\mathcal{Z}$. This would make matching $Q_Z$ to $P_Z$ impossible if $P_Z$ is Gaussian and may lead to numerical instabilities. We use $d_z=8$ for MNIST and $d_z=64$ for CelebA, which seems to work reasonably well.

We also report results of VAEs. VAEs used the same latent spaces as discussed above and standard Gaussian priors $P_Z=\mathcal{N}(\bm{0},\bm{I}_d)$. We used Gaussian encoders $Q(Z|X)=\mathcal{N}\bigl(Z;\mu_\phi(X),\Sigma(X)\bigr)$ with mean $\mu_\phi$ and diagonal covariance $\Sigma$. For MNIST we used Bernoulli decoders parametrized by $G_\theta$ and for CelebA the Gaussian decoders $P_G(X|Z)=\mathcal{N}\bigl(X;G_\theta(Z),\sigma_G^2\cdot\bm{I}_d\bigr)$ with mean $G_\theta$. Functions $\mu_\phi$, $\Sigma$, and $G_\theta$ were parametrized by deep nets of the same architectures as in WAE.

Random samples are generated by sampling $P_Z$ and decoding the resulting noise vectors $z$ into $G_\theta(z)$. As expected, in our experiments we observed that for both WAE-GAN and WAE-MMD the quality of samples strongly depends on how accurately $Q_Z$ matches $P_Z$. To see this, notice that during training the decoder function $G_\theta$ is presented only with encoded versions $\mu_\phi(X)$ of the data points $X\sim P_X$. Indeed, the decoder is trained on samples from $Q_Z$ and thus there is no reason to expect good results when feeding it with samples from $P_Z$. In our experiments we noticed that even slight differences between $Q_Z$ and $P_Z$ may affect the quality of samples.

In some cases WAE-GAN seems to lead to a better matching and generates better samples than WAE-MMD. However, due to adversarial training WAE-GAN is less stable than WAE-MMD, which has a very stable training much like VAE.

In order to quantitatively assess the quality of the generated images, we use the Fréchet Inception Distance introduced by and report the results on CelebA based on $10^4$ samples. We also heuristically evaluate the sharpness of generated samples using the Laplace filter: every image is converted to greyscale and convolved with the Laplace filter $\left(\begin{smallmatrix}0&1&0\\ 1&-4&1\\ 0&1&0\end{smallmatrix}\right)$, which acts as an edge detector. We compute the variance of the resulting activations and average these values across 1000 images sampled from a given model. The blurrier the image, the fewer edges it has, and the more activations will be close to zero, leading to smaller variances. The numbers, summarized in Table 1, show that WAE-MMD has samples of slightly better quality than VAE, while WAE-GAN achieves the best results overall. The bigVAE, bigWAE-MMD, and bigWAE-GAN also included in the table are the best models based on the larger scale study reported in Supplementary D, which involved a much wider hyperparameter sweep, as well as using the ResNet50-v2 architecture for the encoder and decoder mappings.
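The sharpness heuristic can be sketched in a few lines of NumPy. The greyscale conversion (a plain channel average) and the treatment of image borders (only positions where the 3x3 filter fits entirely are kept) are assumptions of this illustration, since the text does not specify them; the function names are likewise hypothetical.

```python
import numpy as np

# 3x3 Laplace filter from the text; it is symmetric, so convolution
# and cross-correlation give the same result.
LAPLACE = np.array([[0.0, 1.0, 0.0],
                    [1.0, -4.0, 1.0],
                    [0.0, 1.0, 0.0]])


def to_greyscale(img):
    """Average the colour channels of an (H, W, C) image (assumed conversion)."""
    return img.mean(axis=-1) if img.ndim == 3 else img


def laplace_response(grey):
    """Filter response at every position where the 3x3 window fits ("valid" mode)."""
    h, w = grey.shape
    out = np.zeros((h - 2, w - 2))
    for di in range(3):
        for dj in range(3):
            out += LAPLACE[di, dj] * grey[di:di + h - 2, dj:dj + w - 2]
    return out


def sharpness(images):
    """Variance of the Laplace activations, averaged over a collection of images."""
    variances = [
        laplace_response(to_greyscale(np.asarray(im, dtype=float))).var()
        for im in images
    ]
    return float(np.mean(variances))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    noisy = rng.uniform(size=(16, 16, 3))
    flat = np.full((16, 16, 3), 0.5)
    print(sharpness([noisy]), sharpness([flat]))
```

A constant or linearly shaded image has zero Laplace response everywhere and therefore zero sharpness under this measure, while images with many edges score higher.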

Test reconstructions and interpolations. We take random points $x$ from the held out test set and report their auto-encoded versions $G_\theta(\mu_\phi(x))$. Next, pairs $(x,y)$ of different data points are sampled randomly from the held out test set and encoded: $z_x=\mu_\phi(x)$, $z_y=\mu_\phi(y)$. We linearly interpolate between $z_x$ and $z_y$ with equally-sized steps in the latent space and show decoded images.
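The interpolation step itself is simple; a minimal sketch follows. The helper name `interpolate_latents`, the default number of steps, and the choice to include both endpoints are assumptions of this illustration. Decoding would then amount to applying the trained decoder $G_\theta$ to each row of the returned array.

```python
import numpy as np


def interpolate_latents(z_x, z_y, num_steps=8):
    """Equally spaced points on the segment from z_x to z_y, endpoints included.

    Returns an array of shape (num_steps, d_z).
    """
    z_x = np.asarray(z_x, dtype=float)
    z_y = np.asarray(z_y, dtype=float)
    t = np.linspace(0.0, 1.0, num_steps)[:, None]  # interpolation weights
    return (1.0 - t) * z_x[None, :] + t * z_y[None, :]


if __name__ == "__main__":
    path = interpolate_latents([0.0, 0.0], [7.0, -14.0], num_steps=8)
    print(path)
```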

Conclusion

Using the optimal transport cost, we have derived Wasserstein auto-encoders, a new family of algorithms for building generative models. We discussed their relations to other probabilistic modeling techniques. We conducted experiments using two particular implementations of the proposed method, showing that in comparison to VAEs, the images sampled from the trained WAE models are of better quality, without compromising the stability of training and the quality of reconstruction. Future work will include further exploration of the criteria for matching the encoded distribution $Q_Z$ to the prior distribution $P_Z$, assaying the possibility of adversarially training the cost function $c$ in the input space $\mathcal{X}$, and a theoretical analysis of the dual formulations for WAE-GAN and WAE-MMD.

The authors are thankful to Carl Johann Simon-Gabriel, Mateo Rojas-Carulla, Arthur Gretton, Paul Rubenstein, and Fei Sha for stimulating discussions. The authors also thank Josip Djolonga, Carlos Riquelme, and Paul Rubenstein for carrying out extended experimental evaluations (bigWAE and bigVAE) reported in Table 1 and Section D.

References

Appendix A Implicit generative models: a short tour of GANs and VAEs

Even though GANs and VAEs are quite different, both in terms of the conceptual frameworks and empirical performance, they share important features: (a) both can be trained by sampling from the model $P_G$ without knowing an analytical form of its density and (b) both can be scaled up with SGD. As a result, it becomes possible to use highly flexible implicit models $P_G$ defined by a two-step procedure, where first a code $Z$ is sampled from a fixed distribution $P_Z$ on a latent space $\mathcal{Z}$ and then $Z$ is mapped to the image $G(Z)\in\mathcal{X}=\mathcal{R}^d$ with a (possibly random) transformation $G\colon\mathcal{Z}\to\mathcal{X}$. This results in latent variable models $P_G$ of the form (3).

These models are indeed easy to sample and, provided $G$ can be differentiated analytically with respect to its parameters, $P_G$ can be trained with SGD. The field is growing rapidly and numerous variations of VAEs and GANs are available in the literature. Next we introduce and compare several of them.

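The original generative adversarial network (GAN) approach minimizes

$$D_{\mathrm{GAN}}(P_X,P_G) = \sup_{T\in\mathcal{T}} \mathbb{E}_{X\sim P_X}\bigl[\log T(X)\bigr] + \mathbb{E}_{Z\sim P_Z}\bigl[\log\bigl(1-T(G(Z))\bigr)\bigr]$$

with respect to the decoder $G$, where the supremum is taken over a class $\mathcal{T}$ of discriminators $T$.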

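Variational auto-encoders (VAE) utilize models $P_G$ of the form (3) and minimize

$$D_{\mathrm{VAE}}(P_X,P_G) = \inf_{Q(Z|X)\in\mathcal{Q}} \mathbb{E}_{P_X}\Bigl[D_{\mathrm{KL}}\bigl(Q(Z|X),P_Z\bigr) - \mathbb{E}_{Q(Z|X)}\bigl[\log p_G(X|Z)\bigr]\Bigr]$$

with respect to the decoder $P_G(X|Z)$, where $\mathcal{Q}$ is a class of probabilistic encoders.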

Appendix B Proof of Theorem 1 and further details

We will consider certain sets of joint probability distributions of three random variables $(X,Y,Z)\in\mathcal{X}\times\mathcal{X}\times\mathcal{Z}$. The reader may wish to think of $X$ as true images, $Y$ as images sampled from the model, and $Z$ as latent codes. We denote by $P_{G,Z}(Y,Z)$ a joint distribution of a variable pair $(Y,Z)$, where $Z$ is first sampled from $P_Z$ and next $Y$ from $P_G(Y|Z)$. Note that $P_G$ defined in (3) and used throughout this work is the marginal distribution of $Y$ when $(Y,Z)\sim P_{G,Z}$.

In the optimal transport problem (1), we consider joint distributions $\Gamma(X,Y)$ which are called couplings between values of $X$ and $Y$. Because of the marginal constraint, we can write $\Gamma(X,Y)=\Gamma(Y|X)P_X(X)$ and we can consider $\Gamma(Y|X)$ as a non-deterministic mapping from $X$ to $Y$. Theorem 1 shows how to factor this mapping through $\mathcal{Z}$, i.e., decompose it into an encoding distribution $Q(Z|X)$ and the generating distribution $P_G(Y|Z)$.

As in Section 2.2, $\mathcal{P}(X\sim P_X,Y\sim P_G)$ denotes the set of all joint distributions of $(X,Y)$ with marginals $P_X,P_G$, and likewise for $\mathcal{P}(X\sim P_X,Z\sim P_Z)$. The set of all joint distributions of $(X,Y,Z)$ such that $X\sim P_X$, $(Y,Z)\sim P_{G,Z}$, and $(Y\perp\!\!\!\perp X)\,|\,Z$ will be denoted by $\mathcal{P}_{X,Y,Z}$. Finally, we denote by $\mathcal{P}_{X,Y}$ and $\mathcal{P}_{X,Z}$ the sets of marginals on $(X,Y)$ and $(X,Z)$ (respectively) induced by distributions in $\mathcal{P}_{X,Y,Z}$. Note that $\mathcal{P}(P_X,P_G)$, $\mathcal{P}_{X,Y,Z}$, and $\mathcal{P}_{X,Y}$ depend on the choice of conditional distributions $P_G(Y|Z)$, while $\mathcal{P}_{X,Z}$ does not. In fact, it is easy to check that $\mathcal{P}_{X,Z}=\mathcal{P}(X\sim P_X,Z\sim P_Z)$. From the definitions it is clear that $\mathcal{P}_{X,Y}\subseteq\mathcal{P}(P_X,P_G)$ and we immediately get the following upper bound:

$$W_c(P_X,P_G) \leq W_c^{\dagger}(P_X,P_G) := \inf_{P\in\mathcal{P}_{X,Y}} \mathbb{E}_{(X,Y)\sim P}\bigl[c(X,Y)\bigr]. \qquad (9)$$

If $P_G(Y|Z)$ are Dirac measures (i.e., $Y=G(Z)$), it turns out that $\mathcal{P}_{X,Y}=\mathcal{P}(P_X,P_G)$:

Lemma 2. $\mathcal{P}_{X,Y}\subseteq\mathcal{P}(P_X,P_G)$, with identity if $P_G(Y|Z=z)$ are Dirac for all $z\in\mathcal{Z}$. (We conjecture that this is also a necessary condition. The necessity is not used in the paper.)

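We are now in place to prove Theorem 1. Lemma 2 obviously leads to

$$W_c(P_X,P_G) = \inf_{P\in\mathcal{P}_{X,Y}} \mathbb{E}_{(X,Y)\sim P}\bigl[c(X,Y)\bigr].$$

The tower rule of expectation and the conditional independence property of $\mathcal{P}_{X,Y,Z}$ imply

$$\begin{aligned}
\inf_{P\in\mathcal{P}_{X,Y}} \mathbb{E}_{(X,Y)\sim P}\bigl[c(X,Y)\bigr]
&= \inf_{P\in\mathcal{P}_{X,Y,Z}} \mathbb{E}_{(X,Y,Z)\sim P}\bigl[c(X,Y)\bigr] \\
&= \inf_{P\in\mathcal{P}_{X,Y,Z}} \mathbb{E}_{P_Z}\mathbb{E}_{X\sim P(X|Z)}\mathbb{E}_{Y\sim P(Y|Z)}\bigl[c(X,Y)\bigr] \\
&= \inf_{P\in\mathcal{P}_{X,Y,Z}} \mathbb{E}_{P_Z}\mathbb{E}_{X\sim P(X|Z)}\bigl[c\bigl(X,G(Z)\bigr)\bigr] \\
&= \inf_{P\in\mathcal{P}_{X,Z}} \mathbb{E}_{(X,Z)\sim P}\bigl[c\bigl(X,G(Z)\bigr)\bigr].
\end{aligned}$$

It remains to notice that $\mathcal{P}_{X,Z}=\mathcal{P}(X\sim P_X,Z\sim P_Z)$ as stated earlier.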

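If the decoders are non-deterministic, Lemma 2 provides only the inclusion of sets $\mathcal{P}_{X,Y}\subseteq\mathcal{P}(P_X,P_G)$ and we get the following upper bound on the OT:

Let $\mathcal{X}=\mathcal{R}^d$ and assume the conditional distributions $P_G(Y|Z=z)$ have mean values $G(z)\in\mathcal{R}^d$ and marginal variances $\sigma_1^2,\dots,\sigma_d^2\geq 0$ for all $z\in\mathcal{Z}$, where $G\colon\mathcal{Z}\to\mathcal{X}$. Take $c(x,y)=\|x-y\|_2^2$. Then

$$W_c(P_X,P_G) \leq \inf_{P\in\mathcal{P}_{X,Y}} \mathbb{E}_{(X,Y)\sim P}\bigl[\|X-Y\|_2^2\bigr] = \sum_{i=1}^{d}\sigma_i^2 + \inf_{P\in\mathcal{P}(X\sim P_X,\,Z\sim P_Z)} \mathbb{E}_{(X,Z)\sim P}\bigl[\|X-G(Z)\|_2^2\bigr].$$

The first inequality follows from (9). For the identity we proceed similarly to the proof of Theorem 1 and write

$$\inf_{P\in\mathcal{P}_{X,Y}} \mathbb{E}_{(X,Y)\sim P}\bigl[\|X-Y\|_2^2\bigr] = \inf_{P\in\mathcal{P}_{X,Y,Z}} \mathbb{E}_{P_Z}\mathbb{E}_{X\sim P(X|Z)}\mathbb{E}_{Y\sim P(Y|Z)}\bigl[\|X-Y\|_2^2\bigr]. \qquad (11)$$

Since $Y$ has conditional mean $G(Z)$ and is conditionally independent of $X$ given $Z$, the cross term vanishes and the inner expectation expands as $\mathbb{E}_{Y\sim P(Y|Z)}\|X-Y\|_2^2=\|X-G(Z)\|_2^2+\sum_{i=1}^{d}\sigma_i^2$. Together with (11) and the fact that $\mathcal{P}_{X,Z}=\mathcal{P}(X\sim P_X,Z\sim P_Z)$ this concludes the proof. ∎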

Appendix C Further details on experiments

We used mini-batches of size 100 and trained the models for 100 epochs. We used $\lambda=10$ and $\sigma_z^2=1$. For the encoder-decoder pair we initially set the Adam learning rate to $\alpha=10^{-3}$, and for the adversary in WAE-GAN to $\alpha=5\times 10^{-4}$. After 30 epochs we decreased both by a factor of 2, and after the first 50 epochs further by a factor of 5.

Both encoder and decoder used fully convolutional architectures with 4x4 convolutional filters.

Finally, we used two heuristics. First, we always pretrained the encoder separately for several mini-batch steps before the main training stage so that the sample mean and covariance of $Q_Z$ would try to match those of $P_Z$. Second, while training we added pixel-wise Gaussian noise truncated at $0.01$ to all the images before feeding them to the encoder, which was meant to make the encoders random. We tried all possible ways of combining these two heuristics and noticed that together they result in slightly (almost negligibly) better results compared to using only one or none of them.

Our VAE model used cross-entropy loss (Bernoulli decoder) and otherwise same architectures and hyperparameters as listed above.

C.2 CelebA

We pre-processed CelebA images by first taking 140x140 center crops and then resizing to the 64x64 resolution. We used mini-batches of size 100 and trained the models for various numbers of epochs (up to 250). All reported WAE models were trained for 55 epochs and VAE for 68 epochs. For WAE-MMD we used $\lambda=100$ and for WAE-GAN $\lambda=1$. Both used $\sigma_z^2=2$.

For WAE-MMD the learning rate of Adam was initially set to $\alpha=10^{-3}$. For WAE-GAN the learning rate of Adam for the encoder-decoder pair was initially set to $\alpha=3\times 10^{-4}$ and for the adversary to $10^{-3}$. All learning rates were decreased by a factor of 2 after 30 epochs, further by a factor of 5 after the first 50 epochs, and finally by an additional factor of 10 after the first 100 epochs.
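The piecewise-constant schedule described above can be written as a small function. This is a sketch only: the function name is hypothetical, and the boundary convention (epochs counted from 0, so the first reduction applies from epoch index 30 onwards) is an assumption, since the text does not state it.

```python
def celeba_learning_rate(initial_lr, epoch):
    """Learning rate for a given 0-based epoch index under the stated schedule.

    Assumed convention: "after N epochs" means from epoch index N onwards.
    """
    lr = initial_lr
    if epoch >= 30:
        lr /= 2.0    # halved after 30 epochs
    if epoch >= 50:
        lr /= 5.0    # further divided by 5 after the first 50 epochs
    if epoch >= 100:
        lr /= 10.0   # further divided by 10 after the first 100 epochs
    return lr


if __name__ == "__main__":
    for e in (0, 29, 30, 50, 100):
        print(e, celeba_learning_rate(1e-3, e))
```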

Both encoder and decoder used fully convolutional architectures with 5x5 convolutional filters.

For WAE-GAN we used a heuristic proposed in Supplementary IV of . Notice that the theoretically optimal discriminator would result in $D^*(z)=\log p_Z(z)-\log q_Z(z)$, where $p_Z$ and $q_Z$ are densities of $P_Z$ and $Q_Z$ respectively. In our experiments we added the log prior $\log p_Z(z)$ explicitly to the adversary output as we know it analytically. This should hopefully make it easier for the adversary to learn the remaining $Q_Z$ density term.

The VAE model used squared loss, i.e. Gaussian decoders $P_G(X|Z)=\mathcal{N}\bigl(X;G_\theta(Z),\sigma_G^2\cdot\bm{I}_d\bigr)$ with $\sigma_G^2=0.3$. We also tried using Bernoulli decoders (the cross-entropy loss) and observed that they matched the performance of the Gaussian decoder with the best choice of $\sigma_G^2$. We decided to report the VAE model which used the same reconstruction loss as our WAE models, i.e. the squared loss. The VAE model used $\alpha=10^{-4}$ as the initial Adam learning rate and did not use batch normalization. Otherwise all the architectures and hyperparameters were as explained above.

Appendix D Extended experimental setup (bigVAE and bigWAE)

In this section we briefly report the experimental results we obtained while running a larger scale study, which involved a much wider hyperparameter sweep, as well as using the ResNet50-v2 architecture for the encoder and decoder mappings.

In total we trained over 3 thousand WAE-GAN, WAE-MMD, and VAE models with various hyperparameters while making sure each one of the three algorithms gets exactly the same computational budget. Each model was trained on 8 Google Cloud TPU-v2 hardware accelerators for 100 000 mini-batch steps with Adam using the same default parameters as reported in the previous section. For each new model that we trained we selected a hyperparameter configuration randomly (from the ranges listed below) and trained each configuration with three different random seeds.

The dimensionality $d_z$ of the latent space was chosen uniformly from $\{16,32,64,128,256,512\}$;

The mini-batch size was chosen uniformly from $\{512,1024\}$;

The learning rate of Adam for the encoder-decoder pair across all three models was sampled from the log-uniform distribution on $[10^{-5},10^{-2}]$;

The learning rate of Adam for the adversary of WAE-GAN was sampled from the log-uniform distribution on $[10^{-5},10^{-1}]$;

For WAE-MMD we used a sum of inverse multiquadratics kernels at various scales:

$$k(x,y) = \sum_{s\in\mathcal{S}} \frac{s\cdot C}{s\cdot C+\|x-y\|_2^2},$$

where $\mathcal{S}=\{0.1,0.2,0.5,1,2,5,10\}$ was kept fixed and the base scale $C$ was sampled uniformly from $[0.1,16]$;

The regularization coefficient $\lambda$ for WAE-GAN and WAE-MMD was sampled uniformly from $[10^{-3},10^{3}]$;

For VAE $2\sigma_G^2$ effectively plays the role of the regularization coefficient $\lambda$. We sample the value of $\sigma_G^2$ uniformly from $$;

For both encoder $\mu_\phi\colon\mathcal{X}\to\mathcal{Z}$ and decoder $G_\theta\colon\mathcal{Z}\to\mathcal{X}$ mappings we use either DCGAN-style convolutional DNN architectures reported in the previous sections or the ResNet50-v2 architecture;

For WAE-MMD and WAE-GAN we uniformly choose either deterministic encoder mapping (like the one used in the previous sections) or stochastic Gaussian one with a diagonal covariance (like in VAE).
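A random search over the ranges listed above could be sketched as follows. The dictionary keys, the function name, and the architecture labels are hypothetical; choosing the architecture with equal probability is an assumption (the text only says "either ... or"), and the range for $\sigma_G^2$ is left out because it is not given in the text.

```python
import numpy as np


def sample_config(rng):
    """Draw one hyperparameter configuration from the ranges listed in the text."""
    return {
        "d_z": int(rng.choice([16, 32, 64, 128, 256, 512])),
        "batch_size": int(rng.choice([512, 1024])),
        # Log-uniform: sample the exponent uniformly, then exponentiate.
        "lr_autoencoder": 10.0 ** rng.uniform(-5.0, -2.0),
        "lr_adversary": 10.0 ** rng.uniform(-5.0, -1.0),
        "kernel_base_scale": rng.uniform(0.1, 16.0),   # C for WAE-MMD
        "lambda": rng.uniform(1e-3, 1e3),              # WAE-GAN and WAE-MMD
        # Equal probabilities for the architecture are an assumption.
        "architecture": str(rng.choice(["dcgan", "resnet50_v2"])),
        "encoder": str(rng.choice(["deterministic", "gaussian"])),
    }


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    print(sample_config(rng))
```

Each sampled configuration would then be trained with three different random seeds, as described above.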

For each model (hyperparameter configuration) we (i) saved several intermediate checkpoints, (ii) evaluated all of them using the FID score as described in the previous sections (based on 10 000 samples), and (iii) selected the checkpoint with the lowest FID score. The figures below display some resulting statistics of these models.