Deep Generative Models for Distribution-Preserving Lossy Compression

Michael Tschannen, Eirikur Agustsson, Mario Lucic

Introduction

Data compression methods based on deep neural networks (DNNs) have recently received a great deal of attention. These methods were shown to outperform traditional compression codecs in image compression , speech compression , and video compression under several distortion measures. In addition, DNN-based compression methods are flexible and can be adapted to specific domains leading to further reductions in bitrate, and promise fast processing thanks to their internal representations that are amenable to modern data processing pipelines .

In the context of image compression, learning-based methods arguably excel at low bitrates by learning to realistically synthesize local image content, such as texture. While learning-based methods can lead to larger distortions w.r.t. measures optimized by traditional compression algorithms, such as peak signal-to-noise ratio (PSNR), they avoid artifacts such as blur and blocking, producing visually more pleasing results . In particular, visual quality can be improved by incorporating generative adversarial networks (GANs) into the learning process . Work leveraged GANs for artifact suppression, whereas used them to learn synthesizing image content beyond local texture, such as facades of buildings, obtaining visually pleasing results at very low bitrates.

In this paper, we propose a formalization of this line of work: A compression system that respects the distribution of the original data at all rates—a system whose decoder generates i.i.d. samples from the data distribution at zero bitrate, then gradually produces reconstructions containing more content of the original image as the bitrate increases, and eventually achieves perfect reconstruction at high enough bitrate (see Figure 1 for examples). Such a system can be learned from data in a fully unsupervised fashion by solving what we call the distribution-preserving lossy compression (DPLC) problem: Optimizing the rate-distortion tradeoff under the constraint that the reconstruction follows the distribution of the training data. Enforcing this constraint promotes artifact-free reconstructions, at all rates.

We then show that the algorithm proposed in is solving a special case of the DPLC problem, and demonstrate that it fails to produce stochastic decoders as the rate tends to zero in practice, i.e., it is not effective in enforcing the distribution constraint at very low bitrates. This is not surprising as it was designed with a different goal in mind. We then propose and study different alternative approaches based on deep generative models that overcome the issues inherent with . In a nutshell, one first learns a generative model and then applies it to learn a stochastic decoder, obeying the distribution constraint on the reconstruction, along with a corresponding encoder. To quantify the distribution mismatch of the reconstructed samples and the training data in the learning process we rely on the Wasserstein distance. One distinct advantage of our approach is that we can theoretically characterize the distribution of the reconstruction and bound the distortion as a function of the bitrate.

On the practical side, to learn the generative model, we rely on Wasserstein GAN (WGAN) and Wasserstein autoencoder (WAE) , as well as a novel combination thereof termed Wasserstein++. The latter attains high sample quality comparable to WGAN (when measured in terms of the Fréchet inception distance (FID) ) and yields a generator with good mode coverage as well as a structured latent space suited to be combined with an encoder, like WAE. We present an extensive empirical evaluation of the proposed approach on two standard GAN data sets, CelebA and LSUN bedrooms , realizing the first system that effectively solves the DPLC problem.

Outline. We formally define and motivate the DPLC problem in Section 2. We then present several approaches to solve the DPLC problem in Section 3. Practical aspects are discussed in Section 4 and an extensive evaluation is presented in Section 5. Finally, we discuss related work in Section 6.

Problem formulation

In the classic lossy compression setting, both $E$ and $D$ are typically deterministic. As a result, the number of distinct reconstructed inputs $\hat{X}:=D(E(X))$ is bounded by $2^{R}$ . The main drawback is, as $R$ decreases, the reconstruction $\hat{X}$ will incur increasing degradations (such as blur or blocking in the case of natural images), and will be constant for $R=0$ . Note that simply allowing $E,D$ in (1) to be stochastic does not resolve this problem as discussed in Section 3.

Distribution-preserving lossy compression. Motivated by recent advances in extreme image compression , we propose and study a novel compression problem: Solve (1) under the constraint that the distribution of reconstructed instances $\hat{X}$ follows the distribution of the training data $X$ . Formally, we want to solve the problem

where the decoder is allowed to be stochastic.Note that a stochastic decoder is necessary if $P_{X}$ is continuous. The goal of the distribution matching constraint is to enforce artifact-free reconstructions at all rates. Furthermore, as the rate $R\rightarrow 0$ , the solution converges to a generative model of $X$ , while for sufficiently large rates $R$ the solution guarantees perfect reconstruction and trivially satisfies the distribution constraint.

Deep generative models for distribution-preserving lossy compression

The distribution constraint makes solving the problem (2) extremely challenging, as it amounts to learning an exact generative model of the generally unknown distribution $P_{X}$ for $R=0$ . As a remedy, one can relax the problem and consider the regularized formulation,

where $\hat{X}=D(E(X))$ , and $d_{f}$ is a (statistical) divergence that can be estimated from samples using, e.g., the GAN framework .

Proposed method. We propose and study different generative model-based approaches to approximately solve the DPLC problem. These approaches overcome the aforementioned problems and can be applied for all bitrates $R$ , enabling a gentle tradeoff between matching the distribution of the training data and perfectly reconstructing the training samples. Figure 2 provides an overview of the proposed method.

In order to mitigate the bias-to-the-mean-issues with relaxations of the form (3), we decompose $D$ as $D=G\circ B$ , where $G$ is a generative model taking samples from a fixed prior distribution $P_{Z}$ as an input, trained to minimize a divergence between $P_{G(Z)}$ and $P_{X}$ , and $B$ is a stochastic function that is trained together with $E$ to minimize distortion for a fixed $G$ .

Out of the plethora of divergences commonly used for learning generative models $G$ , the Wasserstein distance between $P_{\hat{X}}$ and $P_{X}$ is particularly well suited for DPLC. In fact, it has a distinct advantage as it can be defined for an arbitrary transportation cost function, in particular for the distortion measure $d$ quantifying the quality of the reconstruction in (2). For this choice of transportation cost, we can analytically quantify the distortion as a function of the rate and the Wasserstein distance between $P_{G(Z)}$ and $P_{X}$ .

where $\mathcal{P}(P_{X},P_{Y})$ is a set of all joint distributions of $(X,Y)$ with marginals $P_{X}$ and $P_{Y}$ , respectively. When $(\mathcal{X},d^{\prime})$ is a metric space and we set $c(x,y)=d^{\prime}(x,y)$ we have by Kantorovich-Rubinstein duality that

In this model, the so-called Wasserstein Autoencoder (WAE), $Q(Z|X)$ is parametrized as the push-forward of $P_{X}$ , through some possibly stochastic function $F\colon\mathcal{X}\to\mathcal{Z}$ and (6) becomes

Note that, in order to solve (2), one cannot simply set $c(x,y)=d(x,y)$ and replace $F$ in (7) with a rate-constrained version $\hat{F}=B\circ E$ , where $E$ is a rate-constrained encoder as introduced in Section 2 and $B\colon\mathcal{W}\to\mathcal{Z}$ a stochastic function. Indeed, the tuple $(X,G(F(X)))$ in (7) parametrizes the couplings $\mathcal{P}(P_{X},P_{G(Z)})$ and $G\circ F$ should therefore be of high model capacity. Using $\hat{F}$ instead of $F$ severely constrains the model capacity of $G\circ\hat{F}$ (for small $R$ ) compared to $G\circ F$ , and minimizing (7) over $G\circ\hat{F}$ would hence not compute a $G(Z)$ which approximately minimizes $W_{c}(P_{X},P_{G(Z)})$ .

Learning the function $B\circ E$ . To circumvent this issue, instead of replacing $F$ in (7) by $\hat{F}$ , we propose to first learn $G^{\star}$ by either minimizing the primal form (6) via WAE or the dual form (5) via WGAN (if $d$ is a metric) for $c(x,y)=d(x,y)$ , and subsequently minimize the distortion as

w.r.t. the fixed generator $G^{\star}$ . We then recover the stochastic decoder $D$ in (2) as $D=G^{\star}\circ B$ . Clearly, the distribution constraint in (8) ensures that $G^{\star}(B(E(X)))\sim G^{\star}(Z)$ since $G$ was trained to map $P_{Z}$ to $P_{X}$ .

Reconstructing the Wasserstein distance. The proposed method has the following guarantees.

The proof is presented in Appendix A. Theorem 1 states that the distortion incurred by the proposed procedure is equal to $W_{d}(P_{X},P_{G^{\star}(Z)})$ up to an additive error term that decays exponentially in $R$ , hence converging to $W_{d}(P_{X},P_{G^{\star}(Z)})$ as $R\to\infty$ . Intuitively, as $E$ is no longer rate-constrained asymptotically, we can replace $F$ in (6) by $B\circ E$ and our two-step procedure is equivalent to minimizing (7) w.r.t. $G$ , which amounts to minimizing $W_{d}(P_{X},P_{G(Z)})$ w.r.t. $G$ by (6).

Furthermore, according to Theorem 1, the distribution mismatch between $G^{\star}(B(E(X)))$ and $P_{X}$ is determined by the quality of the generative model $G^{\star}$ , and is independent of $R$ . This is natural given that we learn $G^{\star}$ independently.

We note that the proof of (9) in Theorem 1 hinges upon the fact that $W_{d}$ is defined w.r.t. the distortion measure $d$ . The bound can also be applied to a generator $G^{\prime}$ obtained by minimizing, e.g., some $f$ -divergence between $P_{X}$ and $P_{G(Z)}$ . However, if $W_{d}(P_{X},P_{G^{\prime}(Z)})>W_{d}(P_{X},P_{G^{\star}(Z)})$ (which will generally be the case in practice) then the distortion obtained by using $G^{\prime}$ will asymptotically be larger than that obtained for $G^{\star}$ . This suggests using $W_{d}$ rather than $f$ -divergences to learn $G$ .

Unsupervised training via Wasserstein++

To learn $G$ , $B$ , and $E$ from data, we parametrize each component as a DNN and solve the corresponding optimization problems via stochastic gradient descent (SGD). We embed the code $\mathcal{W}$ as vectors (henceforth referred to as “centers”) in Euclidean space. Note that the centers can also be learned from the data . Here, we simply fix them to the set of vectors $\{-1,1\}^{R}$ and use the differentiable approximation from to backpropagate gradients through this non-differentiable embedding. To ensure that the mapping $B$ is stochastic, we feed noise together with the (embedded) code $E(X)$ .

The distribution constraint in (8), i.e., ensuring that $B(E(X))\sim P_{Z}$ , can be implemented using a maximum mean discrepancy (MMD) or GAN-based regularizer. Firstly, we note that both MMD and GAN-based regularizers can be learned from the samples—for MMD via the corresponding U-estimator, and for GAN via the adversarial framework. Secondly, matching the (simple) prior distribution $P_{Z}$ is much easier than matching the likely complex distribution $P_{X}$ as in (3). Intuitively, at high rates, $B$ should learn to ignore the noise at its input and map the code to $P_{Z}$ . On the other hand, as $R\to 0$ , the code becomes low-dimensional and $B$ is forced to combine it with the stochasticity of the noise at its input to match $P_{Z}$ . In practice, we observe that MMD is robust and allows to enforce $P_{Z}$ at all rates $R$ , while GAN-based regularizers are prone to mode collapse at low rates.

Wasserstein++. As previously discussed, $G^{\star}$ can be learned via WGAN or WAE . As the WAE framework naturally includes an encoder, it ensures that the structure of the latent space $\mathcal{Z}$ is amenable to encode into. On the other hand, there is no reason that such a structure should emerge in the latent space of $G$ trained via WGAN (in particular when $\mathcal{Z}$ is high-dimensional).In principle, this is not an issue if $B$ has enough model capacity, but it might lead to differences in practice as the distortion (8) should be easier to minimize if the $\mathcal{Z}$ -space is suitably structured, see Section 5. In our experiments we observed that WAE tends to produce somewhat less sharp samples than WGAN. On the other hand, WAE is arguably less prone to mode dropping than WGAN as the WAE objective severely penalizes mode dropping due to the reconstruction error term. To combine the best of both approaches, we propose the following novel combination of the primal and the dual form of $W_{d}$ , via their convex combination

with $\gamma\in$ . There are two practical questions remaining. Firstly, minimizing this expression w.r.t. $G$ can be done by alternating between performing gradient updates for the critic $f$ and gradient updates for $G,F$ . In other words, we combine the steps of the WGAN algorithm [16, Algorithm 1] and WAE-MMD algorithm [17, Algorithm 2], and call this combined algorithm Wasserstein++. Secondly, one can train the critic $f$ on fake samples from $G(Z)$ or from $G(F(X))$ , which will not follow the same distribution in general due to a mismatch between $F(X)$ and $P_{Z}$ , which is more pronounced in the beginning of the optimization process. Preliminary experiments suggest that the following setup yields samples of best quality (in terms of FID score):

Train $G$ only on samples from $F(X)$ , for both the WGAN and the WAE loss term.

Empirical evaluationCode is available at https://github.com/mitscha/dplc.

Setup. We empirically evaluate the proposed DPLC framework for $G^{\star}$ trained via WAE-MMD (with an inverse multiquadratics kernel, see ), WGAN with gradient penalty (WGAN-GP) , and Wasserstein++ (implementing the $1$ -Lipschitz constraint in (4) via the gradient penalty from ), on two standard generative modeling benchmark image datasets, CelebA and LSUN bedrooms , both downscaled to $64\times 64$ resolution. We focus on these data sets at relatively low resolution as current state-of-the-art generative models can handle them reasonably well, and we do not want to limit ourselves by the difficulties arising with generative models at higher resolutions. The Euclidean distance is used as distortion measure (training objective) $d$ in all experiments.

Architectures, hyperparameters, and optimizer. The prior $P_{Z}$ is an $m$ -dimensional multivariate standard normal, and the noise vector providing stochasticity to $B$ has $m$ i.i.d. entries distributed uniformly on $ $. We use the DCGAN generator and discriminator architecture for$ G $and$ f $, respectively. For$ F $and$ E $we follow and apply the architecture similar to the DCGAN discriminator.$ B $is realized as a stack of$ n $residual blocks . We set$ m=128 $,$ n=2 $for CelebA, and$ m=512 $,$ n=4 $for the LSUN bedrooms data set. We chose$ m $to be larger than the standard latent space dimension for GANs as we observed that lower$ m$ may lead to blurry reconstructions.

As baselines, we consider compressive autoencoders (CAEs) with the same architecture $G\circ B\circ E$ but without feeding noise to $B$ , training $G,B,E$ jointly to minimize distortion, and BPG , a state-of-the-art engineered codec.The implementation from used in this paper cannot compress to rates below $\approx 0.2$ bpp on average for the data sets considered here. In addition, to corroborate the claims made on the disadvantages of (3) in Section 3, we train $G\circ B\circ E$ to minimize (3) as done in the generative compression (GC) approach from , but replacing $d_{f}$ by $W_{d}$ .

Throughout, we rely on the Adam optimizer . To train $G$ by means of WAE-MMD and WGAN-GP we use the training parameters form and , respectively. For Wasserstein++, we set $\gamma$ in (4) to $2.5\cdot 10^{-5}$ for CelebA and to $10^{-4}$ for LSUN. Further, we use the same training parameters to solve (8) as for WAE-MMD. Thereby, to compensate for the increase in the reconstruction loss with decreasing rate, we adjust the coefficient of the MMD penalty, $\lambda_{\text{MMD}}$ (see Appendix C), proportionally as a function of the reconstruction loss of the CAE baseline, i.e., $\lambda_{\text{MMD}}(R)=\text{const.}\cdot\text{MSE}_{\text{CAE}}(R)$ . We adjust the coefficient $\lambda$ of the divergence term $d_{f}$ in (3) analogously. This ensures that the regularization strength is roughly the same at all rates. Appendix B provides a detailed description of all architectures and hyperparameters.

Results. Table 1 shows sample FID of $G^{\star}$ for WAE, WGAN-GP, and Wasserstein++, as well as the reconstruction FID and MSE for WAE and Wasserstein++.The reconstruction FID and MSE in Table 1 are obtained as $G^{\star}(F(X))$ , without rate constraint. We do not report reconstruction FID and MSE for WGAN-GP as its formulation (5) does not naturally include an unconstrained encoder. In Figure 3 we plot the MSE, the reconstruction FID, and PV obtained by our DPLC models as a function of the bitrate, for different $G^{\star}$ , along with the values obtained for the baselines. Figure 1 presents visual examples produced by our DPLC model with $G^{\star}$ trained using Wasserstein++, along with examples obtained for GC and CAE. More visual examples can be found in Appendix D.

Discussion. We first discuss the performance of the trained generators $G^{\star}$ , shown in Table 1. For both CelebA and LSUN bedrooms, the sample FID obtained by Wasserstein++ is considerably smaller than that of WAE, but slightly larger than that of WGAN-GP. Further, Wasserstein++ yields a significantly smaller reconstruction FID than WAE, but a larger reconstruction MSE. Note that the decrease in sample and reconstruction FID achieved by Wasserstein++ compared to WAE should be expected to come at the cost of an increased reconstruction MSE, as the Wasserstein++ objective is obtained by adding a WGAN term to the WAE objective (which minimizes distortion).

We now turn to the DPLC results obtained for CelebA shown in Figure 3, top row. It can be seen that among our DPLC models, the one combined with $G^{\star}$ from WAE yields the lowest MSE, followed by those based on Wasserstein++, and WGAN-GP. This is not surprising as the optimization of WGAN-GP does not include a distortion term. CAE obtains a lower MSE than all DPLC models which is again intuitive as $G,B,E$ are trained jointly and to minimize distortion exclusively (in particular there is no constraint on the distribution in $\mathcal{Z}$ -space). Finally, BPG obtains the overall lowest MSE. Note, however, that BPG relies on several advanced techniques such as entropy coding based on context models (see, e.g., ), which we did not implement here (but which could be incorporated into our DPLC framework).

Among our DPLC methods, DPLC based on Wasserstein++ attains the lowest reconstruction FID (i.e., its distribution most faithfully reproduces the data distribution) followed by WGAN-GP and WAE. For all three models, the FID decreases as the rate increases, meaning that the models manage not only to reduce distortion as the rate increases, but also to better reproduce the original distribution. The FID of CAE increases drastically as the rate falls below $0.03$ bpp. Arguably, this can be attributed to significant blur incurred at these low rates (see Figure 9 in Appendix D). BPG yields a very high FID as soon as the rate falls below 0.5 bpp due to compression artifacts.

The PV can be seen to increase steadily for all DPLC models as the rate decreases, as expected. This is also reflected by the visual examples in Figure 1, left: At $0.5$ bpp no variability is visible, at $0.125$ bpp the facial expression starts to vary, and decreasing the rate further leads to the encoder producing different persons, deviating more and more form the original image, until the system generates random faces.

In contrast, the PV obtained by solving (3) as in GC is essentially 0, except at 0 bpp, where it is comparable to that of our DPLC models. The noise injected into $D=G\circ B$ is hence ignored unless it is the only source of randomness at 0 bpp. We emphasize that this is the case even though we adjust the coefficient $\lambda$ of the $d_{f}$ term as $\lambda(R)=\text{const.}\cdot\text{MSE}_{\text{CAE}}(R)$ to compensate for the increase in distortion with decreasing rate. The performance of GC in terms of MSE and reconstruction FID is comparable to that of the DPLC model with Wasserstein++ $G^{\star}$ .

We now turn to the DPLC results obtained for LSUN bedrooms. The qualitative behavior of DPLC based on WAE and Wasserstein++ in terms of MSE, reconstruction FID, and PV is essentially the same as observed for CelebA. Wasserstein++ provides the lowest FID by a large margin, for all positive rates. The reconstruction FID for WAE is high at all rates, which is not surprising as the sample FID obtained by WAE is large (cf. Table 1), i.e., WAE struggles to model the distribution of the LSUN bedrooms data set.

For DPLC based on WGAN-GP, in contrast, while the MSE and PV follow the same trend as for CelebA, the reconstruction FID increases notably as the bitrate decreases. By inspecting the corresponding reconstructions (cf. Figure 12 in Appendix D) one can see that the model manages to approximate the data distribution well at zero bitrate, but yields increasingly blurry reconstructions as the bitrate increases. This indicates that either the (trained) function $B\circ E$ is not mapping the original images to $\mathcal{Z}$ space in a way suitable for $G^{\star}$ to produce crisp reconstructions, or the range of $G^{\star}$ does not cover the support of $P_{X}$ well. We tried to address the former issue by increasing the depth of $B$ (to increase model capacity) and by increasing $\lambda_{\text{MMD}}$ (to reduce the mismatch between the distribution of $B(E(X))$ and $P_{Z}$ ), but we did not observe improvements in reconstruction quality. We therefore suspect mode coverage issues to cause the blur in the reconstructions.

Finally, GC largely ignores the noise injected into $D$ at high bitrates, while using it to produce stochastic decoders at low bitrates. However, at low rates, the rFID of GC is considerably higher than that of DPLC based on Wasserstein++, meaning that it does not faithfully reproduce the data distribution despite using stochasticity. Indeed, GC suffers from mode collapse at low rates as can be seen in Figure 14 in Appendix D.

Related work

DNN-based methods for compression have become an active area of research over the past few years. Most authors focus on image compression , while others consider audio and video data. Compressive autoencoders and recurrent neural networks (RNNs) have emerged as the most popular DNN architectures for compression.

GANs have been used in the context of learned image compression before . Work applies a GAN loss to image patches for artifact suppression, whereas applies the GAN loss to the entire image to encourage the decoder to generate image content (but does not demonstrate a properly working stochastic decoder). GANs are leveraged by and to improve image quality of super resolution and engineered compression methods, respectively.

Santurkar et al. use a generator trained with a GAN as a decoder in a compression system. However, they rely on vanilla GAN only rather than considering different $W_{d}$ -based generative models and they do not provide an analytical characterization of their model. Most importantly, they optimize their model using conventional distortion minimization with deterministic decoder, rather than solving the DPLC problem.

Gregor et al. propose a variational autoencoder (VAE)-type generative model that learns a hierarchy of progressively more abstract representations. By storing the high-level part of the representation and generating the low-level one, they manage to partially preserve and partially generate image content. However, their framework is lacking a notion of rate and distortion and does not quantize the representations into a code (apart from using finite precision data types).

Probably most closely related to Wasserstein++ is VAE-GAN , combining VAE with vanilla GAN . However, whereas the VAE part and the GAN part minimize different divergences (Kullback-Leibler and Jensen-Shannon in the case of VAE and vanilla GAN, respectively), WAE and WGAN minimize the same cost function, so Wasserstein++ is somewhat more principled conceptually. More generally, learning generative models jointly with an inference mechanism for the latent variables has attracted significant attention, see, e.g., and for an overview.

Outside of the domain of machine learning, the problem of distribution-preserving (scalar) quantization was studied. Specifically, studies moment preserving quantization, that is quantization with the design criterion that certain moments of the data distribution shall be preserved. Further, proposes an engineered dither-based quantization method that preserves the distribution of the variable to be quantized.

Conclusion

In this paper, we studied the DPLC problem, which amounts to optimizing the rate-distortion tradeoff under the constraint that the reconstructed samples follow the distribution of the training data. We proposed different approaches to solve the DPLC problem, in particular Wasserstein++, a novel combination of WAE and WGAN, and analytically characterized the properties of the resulting compression systems. These systems allowed us to obtain essentially artifact-free reconstructions at all rates, covering the full spectrum from learning a generative model of the data at zero bitrate on one hand, to learning a compression system with almost perfect reconstruction at high bitrate on the other hand. Most importantly, our framework improves over previous methods by producing stochastic decoders at low bitrates, thereby effectively solving the DPLC problem for the first time. Future work includes scaling the proposed approach up to full-resolution images and applying it to data types other than images.

Acknowledgments. The authors would like to thank Fabian Mentzer for insightful discussions and for providing code to generate BPG images for the empirical evaluation in Section 5.

References

Appendix A Proof of Theorem 1

Eq. (10) directly follows from the distribution constraint in (8), $B(E(X))\sim P_{Z}$ : This constraint implies $G^{\star}(B(E(X)))\sim G^{\star}(Z)$ which in turn implies (10).

The construction of the discrete encoder $\hat{F}$ in the proof of Theorem 1 requires optimal vector quantization, which is generally NP hard. However, if we make stronger assumptions on $P_{Z}$ than done in the statement of Theorem 1 one can prove exponential convergence without optimal vector quantization. For example, if $Z$ is uniformly distributed on $^{m}$ , $R=km$ for a positive integer $k$ , and $\|\cdot\|$ is the Euclidean norm, then one can partition $^{m}$ into $2^{R}$ hypercubes of equal edge length $1/2^{k}=1/2^{\frac{R}{m}}$ and associate the centers $c_{i}$ with the centers of these hypercubes. In this case, the two last terms in (13) are upper-bounded by $\sqrt{m}\cdot 2^{-\frac{R}{m}}$ . The quantization error thus converges to as $2^{-\frac{R}{m}}$ for $R\to\infty$ .

Appendix B Hyperparameters and architectures

Learning the generative model $G^{\star}$ . The training parameters used train $G^{\star}$ using WAE, WGAN-GP, and Wasserstein++ are shown in Table 2. The parameters for WAE correspond to those used for the WAE-MMD experiments on CelebA in , see Appendix C.2, with the difference that we use a batch size of 256 and a slightly modified schedule (note that 41k iterations with a batch size of 256 correspond to roughly 55 epochs with batch size 100, which is suggested by ). This does not notably impact the performance WAE (we obtain a slightly lower sample FID than reported in [17, Table 1]). The parameters for WGAN-GP correspond to those recommended for LSUN bedrooms in [28, Appendix E].

Learning the function $B\circ E$ . The training parameters to solve (8) can be found in Table 3. To solve (1) (i.e., to train the CAE baseline) we use the same parameters as for WAE (except that $\lambda_{\text{MMD}}=0$ as there is no distribution constraint in (1)), see Table 2.

Training the baseline (3) as in . To solve (3) we use the parameters and schedule specified in Table 2 for Wasserstein++ (except that we do not need $\lambda_{\text{MMD}}$ ), and we determine $\lambda$ in (3) based on the bitrate as $\lambda(R)=2.5\cdot 10^{-5}\cdot\frac{\text{MSE}_{\text{CAE}}(R)}{\text{MSE}_{\text{CAE}}(0.5\text{bpp})}$ for CelebA and $\lambda(R)=7.5\cdot 10^{-5}\cdot\frac{\text{MSE}_{\text{CAE}}(R)}{\text{MSE}_{\text{CAE}}(1\text{bpp})}$ for LSUN bedrooms.

Architectures. We use the following notation. cxsy-z stands for a 2D convolution with an x $\times$ x kernel, stride y, and z filters, followed by the ReLU non-linearity (the ReLU non-linearity is omitted when the convolution is followed by quantization or the tanh non-linearity). The suffixes b and l, i.e., cxsyb-z and cxsyl-z, indicate that batch normalization is employed before the non-linearity and layer normalization as well as leaky ReLU with a negative slope of $0.2$ instead of ReLU, respectively. txsyb-z stands for a transposed 2D convolution with an x $\times$ x kernel, stride 1/y, and z filters, followed by batch normalization and ReLU non-linearity. fc-z denotes flattening followed by a fully-connected layer with z neurons. r-z designates a residual block with z filters in each layer. The abbreviations bn, tanh, and -q are used for batch normalization, the tanh non-linearity, and quantization with differentiable approximation for gradient backpropagation, respectively. $k$ is the number of channels of the quantized feature representation (i.e., $k$ determines the bitrate), and the suffix +n denotes concatenation of an $m$ -dimensional noise vector with i.i.d. entries uniformly distributed on $$, reshaped as to match the spatial dimension of the feature maps in the corresponding network layer.

$F$ : c4s2-64, c4s2b-128, c4s2b-256, c4s2b-512, fc- $m$ , bn

$G$ : t4s2b-512, t4s2b-256, t4s2b-128, t4s2b-64, t4s2b-64, c3s1-3, tanh

$f$ : c3s1-64, c4s2l-64, c4s2l-128, c4s2l-256, c4s2l-512, fc-1

$E$ : c4s2-64, c4s2b-128, c4s2b-256, c4s2b-512, c3s1- $k$ -q+n

$B$ : c3s1-512, r-512, …, r-512, fc- $m$ , bn

Appendix C The Wasserstein++ algorithm

Appendix D Visual examples

In the following, we show random samples and reconstructions produced by different DPLC models and CAE, at different bitrates, for the CelebA and LSUN bedrooms data set. None of the examples are cherry-picked.