Autoregressive Quantile Networks for Generative Modeling

Georg Ostrovski, Will Dabney, Rémi Munos

Introduction

There has been a staggering increase in progress on generative modeling in recent years, built largely upon fundamental advances such as generative adversarial networks (Goodfellow et al., 2014), variational inference (Kingma & Welling, 2013), and autoregressive density estimation (van den Oord et al., 2016c). These have led to breakthroughs in state-of-the-art generation of natural images (Karras et al., 2017) and audio (van den Oord et al., 2016a), and have even been used for unsupervised learning of disentangled representations (Higgins et al., 2017; Chen et al., 2016). These domains often have real-valued distributions with underlying metrics; that is, there is a domain-specific notion of similarity between data points. This similarity is ignored by the predominant workhorse of generative modeling, the Kullback-Leibler (KL) divergence. Progress is now being made towards algorithms that optimize with respect to these underlying metrics (Arjovsky et al., 2017; Bousquet et al., 2017).

In this paper, we present a novel approach to generative modeling that, while strikingly different from existing methods, is grounded in the well-understood statistical method of quantile regression. Unlike the majority of recent work, we approach generative modeling without the use of the KL divergence, and without explicitly approximating a likelihood model. Like GANs, we thereby produce an implicitly defined model, but unlike GANs our optimization procedure is inherently stable and lacks the degenerate solutions that cause loss of diversity and mode collapse.

Much of the recent research on GANs has been focused on improving stability (Radford et al., 2015; Arjovsky et al., 2017; Daskalakis et al., 2017) and sample diversity (Gulrajani et al., 2017; Salimans et al., 2016, 2018). By stark contrast, methods such as PixelCNN (van den Oord et al., 2016b) readily produce high diversity, but due to their use of KL divergence are unable to make reasonable trade-offs between likelihood and perceptual similarity (Theis et al., 2015; Bellemare et al., 2017; Bousquet et al., 2017).

Our proposed method, autoregressive implicit quantile networks (AIQN), combines the benefits of both: a loss function that respects the underlying metric of the data leading to improved perceptual quality, and a stable optimization process leading to highly diverse samples. While there has been an increasing tendency towards complex architectures (Chen et al., 2017; Salimans et al., 2017) and multiple objective loss functions to overcome these challenges, AIQN is conceptually simple and does not rely on any special architecture or optimization techniques. Empirically it proves to be robust to hyperparameter variations and easy to optimize.

Our work is motivated by the recent advances achieved by reframing GANs in terms of optimal transport, leading to the Wasserstein GAN algorithm (Arjovsky et al., 2017), as well as work towards understanding the relationship between optimal transport and both GANs and VAEs (Bousquet et al., 2017). In agreement with these results, we focus on loss functions grounded in perceptually meaningful metrics. We build upon recent work in distributional reinforcement learning (Dabney et al., 2018a), which has begun to bridge the gap between approaches in reinforcement learning and unsupervised learning. Towards a practical algorithm we base our experimental results on Gated PixelCNN (van den Oord et al., 2016b), and show that using AIQN significantly improves objective performance on CIFAR-10 and ImageNet 32x32 in terms of Fréchet Inception Distance (FID) and Inception score, as well as subjective perceptual quality in image samples and inpainting.

Background

We begin by establishing some notation, before turning to a review of three of the most prevalent methods for generative modeling. Calligraphic letters (e.g. $\mathcal{X}$ ) denote sets or spaces, capital letters (e.g. $X$ ) denote random variables, and lower case letters (e.g. $x$ ) indicate values. A probability distribution with random variable $X\in\mathcal{X}$ is denoted $p_{X}\in\mathscr{P}(\mathcal{X})$ , its cumulative distribution function (c.d.f.) $F_{X}$ , and inverse c.d.f. or quantile function $Q_{X}=F^{-1}_{X}$ . When probability distributions or quantile functions are parameterized by some $\theta$ we will write $p_{\theta}$ or $Q_{\theta}$ recognizing that here we do not view $\theta$ as a random variable.

Perhaps the simplest way to approach generative modeling of a random variable $X\in\mathcal{X}$ is by fixing some discretization of $\mathcal{X}$ into $n$ separate values, say $x_{1},\ldots,x_{n}\in\mathcal{X}$ , and parameterizing the approximate distribution with $p_{\theta}(x_{i})\propto\exp(\theta_{i})$ . This type of categorical parameterization is widely used, and only slightly less commonly when $\mathcal{X}$ does not lend itself naturally to such a partitioning. Typically, the parameters $\theta$ are optimized to minimize the Kullback-Leibler (KL) divergence between observed values of $X$ and the model $p_{\theta}$ , that is, $\theta^{*}=\operatorname*{arg\,min}_{\theta}D_{KL}(p_{X}\|p_{\theta})$ .

When the conditional density is modeled by a simple (e.g. Gaussian) base distribution, the ordering of the dimensions can be crucial (Papamakarios et al., 2017). However, it is common practice to choose an arbitrary ordering and rely upon a more powerful conditional model to avoid these problems. This class of models includes PixelRNN and PixelCNN (van den Oord et al., 2016c, b), MAF (Papamakarios et al., 2017), MADE (Germain et al., 2015), and many others. Fundamentally, all these approaches use the KL divergence as their loss function.

Another class of methods, generally known as latent variable methods, can bypass the need for autoregressive models using a different modeling assumption. Specifically, consider the Variational Autoencoder (VAE) (Kingma & Welling, 2013; Rezende et al., 2014), which represents $p_{\theta}$ as the marginalization over a latent random variable $Z\in\mathcal{Z}$ . The VAE is trained to maximize an approximate lower bound of the log-likelihood of the observations:

$$\log p_{\theta}(x)\geq-D_{KL}(q_{\theta}(z|x)\|p(z))+\mathbb{E}_{z\sim q_{\theta}(z|x)}\left[\log p_{\theta}(x|z)\right].$$

Although VAEs are straightforward to implement and optimize, and effective at capturing structure in high-dimensional spaces, they often miss fine-grained detail, resulting in blurry images.

Generative Adversarial Networks (GANs) (Goodfellow et al., 2014) pose the problem of learning a generative model as a two-player zero-sum game between a discriminator $D$ , attempting to distinguish between $x\sim p_{X}$ (real data) and $x\sim p_{\theta}$ (generated data), and a generator $G$ , attempting to generate data indistinguishable from real data. The generator is an implicit latent variable model that reparameterizes samples, typically from an isotropic Gaussian distribution, into values in $\mathcal{X}$ . The original formulation of GANs,

$$\operatorname*{arg\,min}_{G}\sup_{D}\left[\mathbb{E}_{X}\log D(X)+\mathbb{E}_{Z}\log\left(1-D(G(Z))\right)\right],$$

can be seen as minimizing a lower-bound on the Jensen-Shannon divergence (Goodfellow et al., 2014; Bousquet et al., 2017). That is, even in the case of GANs we are often minimizing functions of the KL divergence. (The Jensen-Shannon divergence is the average of the KL divergences from distributions $P$ and $Q$ to their uniform mixture $M=0.5(P+Q)$ : $\operatorname{JSD}(P||Q)=0.5(D_{KL}(P||M)+D_{KL}(Q||M))$ .)
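To make the remark about the Jensen-Shannon divergence concrete, the following minimal sketch evaluates it for discrete distributions on a line of ten bins. The bin layout and the three example distributions are illustrative assumptions, not taken from the paper. It shows that two point masses with disjoint support have the same divergence, $\log 2$, whether they sit in adjacent bins or far apart, which is the sense in which KL-based losses ignore the underlying metric.

```python
import math

def kl(p, q):
    """D_KL(p || q) in nats for discrete distributions given as lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jsd(p, q):
    """Jensen-Shannon divergence: average KL to the uniform mixture M."""
    m = [0.5 * (pi + qi) for pi, qi in zip(p, q)]
    return 0.5 * (kl(p, m) + kl(q, m))

def one_hot(index, size=10):
    """Point mass on one bin (hypothetical helper for this example)."""
    return [1.0 if i == index else 0.0 for i in range(size)]

p = one_hot(0)
q_near = one_hot(1)   # neighbouring bin
q_far = one_hot(9)    # far-away bin

print(jsd(p, q_near), jsd(p, q_far))  # both equal log(2)
```

A metric-aware loss such as the $1$ -Wasserstein distance would instead assign a larger value to the far-away point mass.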

Many recent advances have come from principled combinations of these three fundamental methods (Makhzani et al., 2015; Dumoulin et al., 2016; Rosca et al., 2017).

A common perspective in generative modeling is that the choice of model should encode existing metric assumptions about the domain, combined with a generic likelihood-focused loss such as the KL divergence. Under this view, the KL’s general applicability and robust optimization properties make it a natural choice, and most implementations of the methods we reviewed in the previous section attempt to, at least indirectly, minimize a version of the KL.

On the other hand, as every model inevitably makes trade-offs when constrained by capacity or limited training, it is desirable for its optimization goal to incentivize trade-offs prioritizing approximately correct solutions, when the data space is endowed with a metric supporting a meaningful (albeit potentially subjective) notion of approximation. It has been argued (Theis et al., 2015; Bousquet et al., 2017; Arjovsky et al., 2017; Bellemare et al., 2017) that the KL may not always be appropriate from this perspective, by making sub-optimal trade-offs between likelihood and similarity.

Indeed, many limitations of existing models can be traced back to the use of KL, and the resulting trade-offs in approximate solutions it implies. For instance, its use appears to play a central role in one of the primary failure modes of VAEs, that of blurry samples. Zhao et al. (2017) argue that the Gaussian posterior $p_{\theta}(x|z)$ implies an overly simple model, which, when unable to perfectly fit the data, is forced to average (thus creating blur), and is not incentivized by the KL towards an alternative notion of approximate solution. Theis et al. (2015) emphasized that an improvement of log-likelihood does not necessarily translate to higher perceptual quality, and that the KL loss is more likely to produce atypical samples than some other training criteria.

We offer an alternative perspective: a good model should encode assumptions about the data distribution, whereas a good loss should encode the notion of similarity, that is, the underlying metric on the data space. From this point of view, the KL corresponds to an actual absence of explicit underlying metric, with complete focus on probability.

The optimal transport metrics $W_{c}$ , for underlying metric $c(x,x^{\prime})$ , and in particular the $p$ -Wasserstein distance, when $c$ is an $L_{p}$ metric, have frequently been proposed as being well-suited replacements to KL (Bousquet et al., 2017; Genevay et al., 2017). Briefly, the advantages are (1) avoidance of mode collapse (no need to choose between spreading over modes or collapsing to a single mode as in KL), and (2) the ability to trade off errors and incentivize approximations that respect the underlying metric.

Recently, Arjovsky et al. (2017) introduced the Wasserstein GAN, reposing the two-player game as the estimation of the gradient of the $1$ -Wasserstein distance between the data and generator distributions. They reframe this in terms of the dual form of the $1$ -Wasserstein, with the critic estimating a function $f$ which maximally separates the two distributions. While this is an exciting line of work, it still faces limitations when the critic solution is approximate, i.e. when $f^{*}$ is not found before each update. In this case, due to insufficient training of the critic (Bellemare et al., 2017) or limitations of the function approximator, the gradient direction produced can be arbitrarily bad (Bousquet et al., 2017).

Thus, we are left with the question of how to minimize a distribution loss respecting an underlying metric. Recent work in distributional reinforcement learning has proposed the use of quantile regression as a method for minimizing the $1$ -Wasserstein in the univariate case when approximating using a mixture of Dirac functions (Dabney et al., 2018b).

Quantile Regression

In this section, we review quantile regression as a method for estimating the quantile function of a distribution at specific points, i.e. its inverse cumulative distribution function. This leads to recent work on approximating a distribution by a neural network approximation of its quantile function, acting as a reparameterization of a random sample from the uniform distribution.

Using the quantile regression loss $\rho_{\tau}(u)=(\tau-\mathbb{I}\{u\leq 0\})\,u$ allows one to train a neural network to approximate a scalar distribution represented by its inverse c.d.f. For this, the network can output a fixed grid of quantiles (Dabney et al., 2018b), with the respective quantile regression losses being applied to each output independently. A more effective approach is to provide the desired quantile $\tau$ as an additional input to the network, and train it to output the corresponding value of $F_{Z}^{-1}(\tau)$ . The implicit quantile network (IQN) model (Dabney et al., 2018a) reparameterizes a sample $\tau\sim\mathcal{U}([0,1])$ through a deterministic function to produce samples from the underlying data distribution. These two methods can be seen to belong to the top-right and bottom-right categories in Figure 1. An IQN $Q_{\theta}$ can be trained by stochastic gradient descent on the quantile regression loss, with $u=z-Q_{\theta}(\tau)$ and training samples $(z,\tau)$ drawn from $z\sim Z$ and $\tau\sim\mathcal{U}([0,1])$ .
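As a small illustration of why this loss estimates quantiles, the sketch below uses the standard form $\rho_{\tau}(u)=(\tau-\mathbb{I}\{u\leq 0\})u$ and searches over candidate predictions for the one with the lowest average loss. The sample values and the helper names are assumptions made for this example; a real IQN would instead fit a network by stochastic gradient descent.

```python
def quantile_loss(u, tau):
    """rho_tau(u) = (tau - 1{u <= 0}) * u, with u = z - prediction."""
    return (tau - (1.0 if u <= 0 else 0.0)) * u

def expected_loss(q, samples, tau):
    """Average quantile loss of the constant prediction q."""
    return sum(quantile_loss(z - q, tau) for z in samples) / len(samples)

def best_candidate(samples, tau):
    """Sample value minimizing the average quantile loss at level tau."""
    return min(samples, key=lambda q: expected_loss(q, samples, tau))

# Ten illustrative observations (sorted: 0.4, 1.0, 2.0, 3.5, 4.2, ...).
samples = [2.0, 3.5, 1.0, 7.0, 4.2, 9.1, 5.5, 6.3, 8.8, 0.4]

print(best_candidate(samples, 0.25))  # third-smallest value, 2.0
print(best_candidate(samples, 0.75))  # eighth-smallest value, 7.0
```

The minimizer moves through the sorted sample as $\tau$ varies, which is what allows a single network with $\tau$ as an input to represent the whole quantile function.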

One drawback to the quantile regression loss is that gradients do not scale with the magnitude of the error, but instead with the sign of the error and the quantile weight $\tau$ . This increases gradient variance and can negatively impact the final model’s sample quality. Increasing the batch size, and thus averaging over more values of $\tau$ , would have the effect of lowering this variance. Alternatively, we can smooth the gradients as the model converges by allowing errors, under some threshold $\kappa$ , to be scaled with their magnitude, reverting to an expectile loss. This results in the Huber quantile loss (Huber, 1964; Dabney et al., 2018b):

$$\rho^{\kappa}_{\tau}(u)=\begin{cases}\dfrac{|\tau-\mathbb{I}\{u\leq 0\}|}{2\kappa}\,u^{2}, & \text{if } |u|\leq\kappa,\\[4pt] |\tau-\mathbb{I}\{u\leq 0\}|\left(|u|-\tfrac{1}{2}\kappa\right), & \text{otherwise.}\end{cases}\tag{2}$$
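A minimal sketch of the Huber quantile loss follows, assuming the form used by Dabney et al. (2018b): quadratic for $|u|\leq\kappa$ and linear beyond, with the asymmetric weight $|\tau-\mathbb{I}\{u\leq 0\}|$ . The function name is hypothetical; the value $\kappa=0.002$ in the usage lines is the one reported in the Appendix.

```python
def huber_quantile_loss(u, tau, kappa):
    """Huber-smoothed quantile loss for error u = z - prediction."""
    weight = abs(tau - (1.0 if u <= 0 else 0.0))
    if abs(u) <= kappa:
        # Small errors: gradient scales with the error magnitude.
        return weight * u * u / (2.0 * kappa)
    # Large errors: linear, as in the plain quantile regression loss.
    return weight * (abs(u) - 0.5 * kappa)

kappa = 0.002
small = huber_quantile_loss(0.001, tau=0.3, kappa=kappa)   # quadratic branch
large = huber_quantile_loss(1.0, tau=0.3, kappa=kappa)     # linear branch
print(small, large)
```

Both branches agree at $|u|=\kappa$ , so the loss and its gradient are continuous there.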

Autoregressive Implicit Quantiles

Let $X=(X_{1},\ldots,X_{n})\in\mathcal{X}_{1}\times\dots\times\mathcal{X}_{n}=\mathcal{X}$ be an $n$ -dimensional random variable. We begin by analyzing the effect of two naive applications of IQN to modeling the distribution of $X$ .

First, suppose we use the same quantile target, $\tau\in[0,1]$ , for every output dimension. The only modification to IQN would be to output $n$ dimensions instead of $1$ , the loss being applied to each output dimension independently. This is equivalent to assuming that the dimensions of $X$ are comonotonic. Two random variables are comonotonic if and only if they can be expressed as non-decreasing (deterministic) functions of a single random variable (Dhaene et al., 2006). Thus a joint quantile function for a comonotonic $X$ can be written as $F_{X}^{-1}(\tau)=(F^{-1}_{X_{1}}(\tau),F^{-1}_{X_{2}}(\tau),\ldots,F^{-1}_{X_{n}}(\tau))$ . While there are many interesting uses for comonotonic random variables, we believe this assumption is too strong to be useful more broadly.

Second, one could use a separate value $\tau_{i}\in[0,1]$ for each $X_{i}$ , with the IQN being unchanged from the first case. This corresponds to making an independence assumption on the dimensions of $X$ . Again we would expect this to be an unreasonably restrictive modeling assumption for many domains, such as the case of natural images.

Now, we turn to our proposed approach of extending IQN to multivariate distributions. We fix an ordering of the $n$ dimensions. If the density function $p_{X}$ is expressed as a product of conditional likelihoods, as in Equation 1, then the joint c.d.f. can be written as

$$F_{X}(x)=P(X_{1}\leq x_{1},\ldots,X_{n}\leq x_{n})=\prod_{i=1}^{n}F_{X_{i}|X_{i-1},\ldots,X_{1}}(x_{i}).$$

Furthermore, for $\tau_{joint}=\prod_{i=1}^{n}\tau_{i}$ , we can write the joint-quantile function of $X$ as

$$F_{X}^{-1}(\tau_{joint})=\left(F^{-1}_{X_{1}}(\tau_{1}),\ldots,F^{-1}_{X_{n}|X_{n-1},\ldots,X_{1}}(\tau_{n})\right).$$

This approach has been used previously by Koenker & Xiao (2006), who introduced a quantile autoregression model for quantile regression on time-series.
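The autoregressive construction can be illustrated on a toy problem where the conditional quantile functions are known in closed form. The sketch below assumes a two-dimensional Gaussian with unit variances and correlation `rho` (an illustrative choice, not an experiment from the paper): $X_{1}$ is produced from $\tau_{1}$ , and $X_{2}$ from $\tau_{2}$ through the quantile function of $X_{2}$ given the sampled $x_{1}$ . In AIQN a network would play the role of these two functions.

```python
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()

def q_x1(tau):
    """Quantile function of X1 ~ N(0, 1)."""
    return STD_NORMAL.inv_cdf(tau)

def q_x2_given_x1(tau, x1, rho):
    """Quantile function of X2 | X1 = x1 ~ N(rho * x1, 1 - rho^2)."""
    return rho * x1 + (1.0 - rho ** 2) ** 0.5 * STD_NORMAL.inv_cdf(tau)

def uniform_open(rng):
    """tau ~ U([0, 1]), clamped away from 0 and 1 for inv_cdf."""
    return min(max(rng.random(), 1e-12), 1.0 - 1e-12)

def sample_autoregressive(rng, rho):
    """One joint sample: a separate tau_i per dimension, in a fixed order."""
    x1 = q_x1(uniform_open(rng))
    x2 = q_x2_given_x1(uniform_open(rng), x1, rho)
    return x1, x2

def correlation(pairs):
    """Empirical Pearson correlation of a list of (x1, x2) pairs."""
    n = len(pairs)
    m1 = sum(a for a, _ in pairs) / n
    m2 = sum(b for _, b in pairs) / n
    cov = sum((a - m1) * (b - m2) for a, b in pairs) / n
    v1 = sum((a - m1) ** 2 for a, _ in pairs) / n
    v2 = sum((b - m2) ** 2 for _, b in pairs) / n
    return cov / (v1 * v2) ** 0.5

rng = random.Random(0)
pairs = [sample_autoregressive(rng, rho=0.8) for _ in range(20000)]
print(correlation(pairs))  # close to 0.8
```

Using one shared $\tau$ for both dimensions would instead give perfectly dependent (comonotonic) outputs, and ignoring `x1` in the second step would give independent ones.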

As previously mentioned, for the restricted model class of a uniform mixture of Diracs, quantile regression can be shown to minimize the $1$ -Wasserstein metric (Dabney et al., 2018b). We extend this analysis for the case of arbitrary approximate quantile functions, and find that quantile regression minimizes a closely related divergence which we call quantile divergence, defined, for any distributions $P$ and $Q$ , as

$$q(P,Q):=\int_{0}^{1}\left[\int_{F_{P}^{-1}(\tau)}^{F_{Q}^{-1}(\tau)}\left(F_{P}(x)-\tau\right)dx\right]d\tau.$$

Indeed, the expected quantile loss of any parameterized quantile function $\bar{Q}_{\theta}$ equals, up to a constant, the quantile divergence between $P$ and the distribution $Q_{\theta}$ implicitly defined by $\bar{Q}_{\theta}$ :

$$\mathbb{E}_{\tau\sim\mathcal{U}([0,1])}\,\mathbb{E}_{z\sim P}\left[\rho_{\tau}(z-\bar{Q}_{\theta}(\tau))\right]=q(P,Q_{\theta})+h(P),$$

where $h(P)$ does not depend on $Q_{\theta}$ . Thus quantile regression minimizes the quantile divergence $q(P,Q_{\theta})$ and the sample gradient $\nabla_{\theta}\rho_{\tau}(z-\bar{Q}_{\theta}(\tau))$ (for $\tau\sim\mathcal{U}([0,1])$ and $z\sim P$ ) is an unbiased estimate of $\nabla_{\theta}q(P,Q_{\theta})$ . See the Appendix for proofs.

Quantile Density Function

Although IQN does not directly model the log-likelihood of the data distribution, observe that we can still query the implied density at a point (Jones, 1992):

$$\frac{\partial}{\partial\tau}F_{X}^{-1}(\tau)=\frac{1}{p_{X}(F_{X}^{-1}(\tau))}.$$

Indeed, this quantity, known as the sparsity function (Tukey, 1965) or the quantile-density function (Parzen, 1979) plays a central role in the analysis of quantile regression models (Koenker, 1994). A common approach involves choosing a bandwidth parameter $h$ and estimating this quantity through finite-differences around the value of interest as $(F_{X}^{-1}(\tau+h)-F_{X}^{-1}(\tau-h))/2h$ (Siddiqui, 1960). However, as we have the full quantile function, the quantile-density function can be computed exactly using a single step of back-propagation to compute $\frac{\partial F^{-1}(\tau)}{\partial\tau}$ . As this only allows querying the density given the value of $\tau$ , application to general likelihoods would require finding the value of $\tau$ that produces the closest approximation to the query point $x$ . Though arguably too inefficient for training, this could potentially be used to interrogate the model.
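The relation between the quantile-density function and the density can be checked on a distribution with a known quantile function. The sketch below assumes an Exponential(1) distribution, for which $F^{-1}(\tau)=-\log(1-\tau)$ and $p(x)=e^{-x}$ , and uses the finite-difference estimate with bandwidth $h$ as a stand-in for back-propagation through a learned network.

```python
import math

def quantile(tau):
    """Quantile function of the Exponential(1) distribution."""
    return -math.log(1.0 - tau)

def sparsity_fd(q, tau, h=1e-4):
    """Finite-difference estimate of dQ/dtau with bandwidth h."""
    return (q(tau + h) - q(tau - h)) / (2.0 * h)

def implied_density(q, tau, h=1e-4):
    """Density at the point x = Q(tau), as the reciprocal of dQ/dtau."""
    return 1.0 / sparsity_fd(q, tau, h)

tau = 0.3
x = quantile(tau)
print(implied_density(quantile, tau), math.exp(-x))  # both close to 0.7
```

Note that the query is indexed by $\tau$ rather than by $x$ ; evaluating the density at a given $x$ would first require searching for the matching $\tau$ , as discussed above.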

PixelIQN

To test our proposed method, which is architecturally compatible with many generative model approaches, we wanted to compare and contrast IQN, that is, quantile regression and quantile reparameterization, with a method trained with an explicit parameterization to minimize KL divergence. A natural choice for this was PixelCNN; specifically, we build upon the Gated PixelCNN of van den Oord et al. (2016b).

The Gated PixelCNN takes as input an image $x\sim X$ , sampled from the training distribution at training time, and potentially all zeros or partially generated at generation time, as well as a location-dependent context $s$ . The model consists of a number of residual layer blocks, whose structure is chosen to allow each output pixel to be a function of all preceding input pixels (in a raster-scan order). At its core, each layer block computes two gated activations of the form

$$y=\tanh(W_{k,f}*x+V_{k,f}*s)\odot\sigma(W_{k,g}*x+V_{k,g}*s),$$

with $k$ the layer index, $*$ denoting convolution, and $V_{k,f}$ and $V_{k,g}$ being $1\times 1$ convolution kernels. See Figure 2 for a full schematic depiction of a Gated PixelCNN layer block. After a number of such layer blocks, the PixelCNN produces a final output layer with shape $(n,n,3,256)$ , with a softmax across the final dimension, corresponding to the approximate conditional likelihood for the value of each pixel-channel. That is, the conditional likelihood is the product of these individual autoregressive models:

$$p(x|s)=\prod_{i=1}^{3n^{2}}p(x_{i}|x_{1},\ldots,x_{i-1},s).$$
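The gated activation with a location-dependent conditioning input can be sketched in NumPy as follows. This is a simplified illustration under stated assumptions: the masked convolutions $W_{k,f}*x$ and $W_{k,g}*x$ are taken as precomputed feature maps (`wx_f`, `wx_g`), and the $1\times 1$ convolutions on $s$ are written as matrix products over the channel axis. All array shapes and names are hypothetical.

```python
import numpy as np

def gated_activation(wx_f, wx_g, s, v_f, v_g):
    """tanh(W_f*x + V_f*s) * sigmoid(W_g*x + V_g*s).

    wx_f, wx_g: precomputed masked-convolution outputs, shape (H, W, F).
    s:          location-dependent conditioning, shape (H, W, C).
    v_f, v_g:   1x1 convolution kernels as matrices, shape (C, F).
    """
    f = np.tanh(wx_f + s @ v_f)
    g = 1.0 / (1.0 + np.exp(-(wx_g + s @ v_g)))
    return f * g

rng = np.random.default_rng(0)
height, width, channels, features = 4, 4, 3, 8
wx_f = rng.standard_normal((height, width, features))
wx_g = rng.standard_normal((height, width, features))
s = rng.uniform(0.0, 1.0, size=(height, width, channels))  # e.g. one tau per pixel-channel
v_f = rng.standard_normal((channels, features))
v_g = rng.standard_normal((channels, features))

y = gated_activation(wx_f, wx_g, s, v_f, v_g)
print(y.shape)  # (4, 4, 8)
```

Because the conditioning enters additively before the nonlinearity, any per-location signal of the right spatial shape can be supplied through $s$ .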

Typically the location-dependent conditioning term was used to condition on class labels, but here we will use it to condition on the sample point $\tau\in[0,1]^{3n^{2}}$ (conditioning on labels remains possible; see Section 4.2). Thus, in addition to the input image $x$ we input, in place of $s$ , the sample points $\tau=(\tau_{1},\ldots,\tau_{3n^{2}})$ to be reparameterized, with each $\tau_{i}\sim\mathcal{U}([0,1])$ . Finally, our network outputs only the full sample image of shape $(n,n,3)$ , without the need for an additional softmax layer. Note that the number of $\tau$ values generated exactly corresponds to the number of random draws from softmax distributions in the original PixelCNN. We are simply changing the role of the randomness, from a draw at the output to a part of the input.

Architecturally, our proposed model, PixelIQN, is exactly the network given by van den Oord et al. (2016b), with the one exception that we output only a single value per pixel-channel and do not require the softmax activations.
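The change in the role of the randomness can be sketched with a toy stand-in for the network. In the example below, `toy_network` is a hypothetical function, not the PixelIQN architecture: its output at position $i$ depends only on earlier pixels and on $\tau_{i}$ . The sampling loop follows the usual autoregressive pattern of filling in one value at a time in raster-scan order; once $\tau$ is drawn, generation is fully deterministic.

```python
import random

def toy_network(x, tau):
    """Stand-in model: output i depends only on x[:i] and tau[i]."""
    out = []
    for i in range(len(x)):
        context = sum(x[:i]) / i if i else 0.0
        out.append(0.5 * context + 0.5 * tau[i])
    return out

def sample(network, tau):
    """Fill in the image one value at a time, in raster-scan order."""
    x = [0.0] * len(tau)           # all-zeros input at the start of generation
    for i in range(len(tau)):
        x[i] = network(x, tau)[i]  # only position i is read at step i
    return x

rng = random.Random(0)
tau = [rng.random() for _ in range(6)]  # one tau per output value
print(sample(toy_network, tau))
```

Drawing a fresh $\tau$ yields a different sample, while reusing the same $\tau$ reproduces the same output exactly.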

In PixelCNN, training is done by passing the training image through the network and training each output softmax distribution using the KL divergence between the training image and the approximate distribution,

We begin by demonstrating PixelIQN on CIFAR-10 (Krizhevsky & Hinton, 2009). For comparison, we train both a baseline Gated PixelCNN and a PixelIQN. Both models correspond to the $15$ -layer network variant in (van den Oord et al., 2016b); see the Appendix for detailed hyperparameters and training procedure. The two methods have substantially different loss functions, so we performed a hyperparameter search using a short training run, with the same number ( $500$ ) of hyperparameter configurations evaluated for both models. For all results, we report full training runs using the best found hyperparameters in each case. The evaluation metric used for the hyperparameter search was the Fréchet Inception Distance (FID) (Heusel et al., 2017); see the Appendix for details. In addition to FID, we report Inception score (Salimans et al., 2016) for both models.

Figure 5 (left) shows Inception score and FID for both models evaluated at several points throughout training. The fully trained PixelCNN achieves an Inception score and FID of $4.6$ and $65.9$ respectively, while PixelIQN substantially outperforms it with an Inception score of $5.3$ and FID of $49.5$ . This also compares favorably with e.g. WGAN (Arjovsky et al., 2017), which reaches an Inception score of $3.8$ . For subjective evaluations, we give samples from both models in Figure 3. Samples coming from PixelIQN are much more visually coherent. Of note, the PixelIQN model achieves a performance level comparable to that of the fully trained PixelCNN with only about one third the number of training updates (and about one third of the wall-clock time).

ImageNet 32x32

Next, we turn to the small ImageNet dataset (Russakovsky et al., 2015), first used for generative modeling in the PixelRNN work (van den Oord et al., 2016c). Again, we evaluate using FID and Inception score. For this much harder dataset, we base our PixelCNN and PixelIQN models on the larger $20$ -layer variant used in (van den Oord et al., 2016b). Due to substantially longer training time for this model, we did not perform additional hyperparameter tuning, and mostly used the same hyperparameter values as in the previous sections for both models; details can be found in the Appendix.

Figure 5 shows Inception score and FID throughout training of PixelCNN and PixelIQN. Again, PixelIQN substantially outperforms the baseline in terms of final performance and sample complexity. For final scores and a comparison to state-of-the-art GAN models, see Table 5. Figure 6 shows random (non-cherry-picked) samples from both models. Compared to PixelCNN, PixelIQN samples appear to have superior quality with more global consistency and less ‘high-frequency noise’.

In Figure 7, we show the inpainting performance of PixelIQN, by fixing the top half of a validation set image as input and sampling repeatedly from the model to generate different completions. We note that the model consistently generates plausible completions with significant diversity between different completion samples for the same input image. Meanwhile, WGAN-GP has been seen to produce deterministic completions (Bellemare et al., 2017).

Following (van den Oord et al., 2016b), we also trained a class-conditional PixelIQN variant, providing to the model the one-hot class label corresponding to a training image (in addition to a $\tau$ sample). Samples from a class-conditional model can be expected to have higher visual quality, as the class label provides $\log_{2}(1000)\approx 10$ bits of information, see Figure 8. As seen in Figure 5 and Table 5, class conditioning also further improves Inception score and FID. To generate each sample for the computation of these scores, we sample one of 1000 class labels randomly, then generate an image conditioned on this label via the trained model.

Finally, motivated by the very long training time for the large PixelCNN model (approximately 1 day per 100K training steps, on 16 NVIDIA Tesla P100 GPUs), we also trained smaller $15$ -layer versions of the models (same as the ones used on CIFAR-10) on the small ImageNet dataset. For comparison, these take approximately 12 hours for 100K training steps on a single P100 GPU, or less than 3 hours on 8 P100 GPUs. As expected, little PixelCNN, while suitable for the CIFAR-10 dataset, fails to achieve competitive scores on the ImageNet dataset, achieving Inception score $5.1$ and FID $66.4$ . Astonishingly, little PixelIQN on this dataset reaches Inception score $7.3$ and FID $38.5$ , see Figure 5 (right). It thereby not only outperforms the little PixelCNN, but also the larger $20$ -layer version! This strongly supports the hypothesis that PixelCNN, and potentially many other models, are constrained not only by their model capacity, but crucially also by the sub-optimal trade-offs made by their log-likelihood training criterion, failing to align with perceptual or evaluation metrics.

Discussion and Conclusions

Most existing generative models for images belong to one of two classes. The first are likelihood-based models, trained with an elementwise KL reconstruction loss, which, while perceptually meaningless, provides robust optimization properties and high sample diversity. The second are GANs, trained based on a discriminator loss, typically better aligned with a perceptual metric and enabling the generator to produce realistic, globally consistent samples. Their advantages come at the cost of a harder optimization problem, high parameter sensitivity, and most importantly, a tendency to collapse modes of the data distribution.

AIQNs are a new, fundamentally different, technique for generative modeling. By using a quantile regression loss instead of KL divergence, they combine some of the best properties of the two model classes. By their nature, they preserve modes of the learned distribution, while producing perceptually appealing high-quality samples. The inevitable approximation trade-offs a generative model makes when constrained by capacity or insufficient training can vary significantly depending on the loss used. We argue that the proposed quantile regression loss aligns more effectively with a given metric and therefore makes subjectively more advantageous trade-offs.

Devising methods for quantile regression over multidimensional outputs is an active area of research. New methods are continuing to be investigated (Carlier et al., 2016; Hallin & Miroslav, 2016), and a promising direction for future work is to find ways to use these to replace autoregressive models. One approach to reducing the computational burden of such models is to apply AIQN to the latent dimensions of a VAE. Similar in spirit to Rosca et al. (2017), this would use the VAE to reduce the dimensionality of the problem and the AIQN to sample from the true latent distribution. In the Appendix we give preliminary results using such a technique, on CelebA $64\times 64$ (Liu et al., 2015).

We have shown that IQN, computationally cheap and technically simple, can be readily applied to existing architectures, PixelCNN and VAE (Appendix), improving robustness and sampling quality of the underlying model. We demonstrated that PixelIQN produces more realistic, globally coherent samples, and improves Inception score and FID.

We further point out that many recent advances in generative models could be easily combined with our proposed method. Recent algorithmic improvements to GANs such as mini-batch discrimination and progressive growing (Salimans et al., 2016; Karras et al., 2017), while not strictly necessary in our work, could be applied to further improve performance. PixelCNN++ (Salimans et al., 2017) is an architectural improvement of PixelCNN, with several beneficial modifications supported by experimental evidence. Although we have built upon the original Gated PixelCNN in this work, we believe all of these modifications to be compatible with our work, except for the use of a mixture of logistics in place of PixelCNN’s softmax. As we have entirely replaced this model component, this change does not map onto our model. Of note, the motivation behind this change closely mirrors our own, in looking for a loss that respects the underlying metric between examples. The recent PixelSNAIL model (Chen et al., 2017) achieves state-of-the-art modeling performance by enhancing PixelCNN with ELU nonlinearities, modified block structure, and an attention mechanism. Again, all of these are fully compatible with our work and should improve results further.

Finally, the implicit quantile formulation lifts a number of architectural restrictions of previous generative models. Most importantly, the reparameterization as an inverse c.d.f. allows learning distributions over continuous ranges without pre-specified boundaries or quantization. This enables modeling continuous-valued variables, for example for generation of sound (van den Oord et al., 2016a), opening multiple interesting avenues for further investigation.

Acknowledgements

We would like to acknowledge the important role many of our colleagues at DeepMind played for this work. We especially thank Aäron van den Oord and Sander Dieleman for invaluable advice on the PixelCNN model; Ivo Danihelka and Danilo J. Rezende for careful reading and insightful comments on an earlier version of the paper; Igor Babuschkin, Alexandre Galashov, Dominik Grewe, Jacob Menick, and Mihaela Rosca for technical help.

References

Appendix

Quantile regression minimizes the quantile divergence

For any distributions $P$ and $Q$ , define the quantile divergence

$$q(P,Q):=\int_{0}^{1}\left[\int_{F_{P}^{-1}(\tau)}^{F_{Q}^{-1}(\tau)}\left(F_{P}(x)-\tau\right)dx\right]d\tau.$$

Then the expected quantile loss of a quantile function $\bar{Q}$ implicitly defining the distribution $Q$ satisfies

$$\mathbb{E}_{\tau\sim\mathcal{U}([0,1])}\,\mathbb{E}_{X\sim P}\left[\rho_{\tau}(X-\bar{Q}(\tau))\right]=q(P,Q)+h(P),$$

where $h(P)$ does not depend on $Q$ .

Let $P$ be a distribution with p.d.f. $f_{P}$ and c.d.f. $F_{P}$ . Define

where the third equality follows from an integration by parts of $\int_{-\infty}^{q}xf_{P}(x)\,dx$ . Thus the function $q\mapsto g_{\tau}(q)$ is minimized for $q=F^{-1}_{P}(\tau)$ and its minimum is

Thus for a quantile function $\bar{Q}$ , we have the expected quantile loss:

This finishes the proof of the proposition. ∎
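For reference, the computation behind this proof can be sketched as follows. This is a reconstruction from the definitions used in this section and may not match the original chain of equalities step for step. It assumes that $g_{\tau}(q)$ denotes the expected quantile loss at a fixed $\tau$ for a candidate value $q$ , and that $P$ has a finite mean.

```latex
\begin{align*}
g_{\tau}(q) &:= \mathbb{E}_{X\sim P}\left[\rho_{\tau}(X-q)\right]
  = \tau\left(\mathbb{E}[X]-q\right) - \int_{-\infty}^{q}(x-q)\,f_{P}(x)\,dx \\
 &= \tau\left(\mathbb{E}[X]-q\right) + \int_{-\infty}^{q}F_{P}(x)\,dx, \\
g_{\tau}'(q) &= F_{P}(q)-\tau, \\
g_{\tau}(q)-g_{\tau}\!\left(F_{P}^{-1}(\tau)\right)
 &= \int_{F_{P}^{-1}(\tau)}^{q}\left(F_{P}(x)-\tau\right)dx .
\end{align*}
```

Setting $q=\bar{Q}(\tau)=F_{Q}^{-1}(\tau)$ and integrating over $\tau\in[0,1]$ then gives the expected quantile loss as the quantile divergence between $P$ and $Q$ plus the term $h(P)=\int_{0}^{1}g_{\tau}(F_{P}^{-1}(\tau))\,d\tau$ , which depends only on $P$ .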

We observe that quantile regression is nothing else than a projection under the quantile divergence. Thus for a parametrized quantile function $\bar{Q}_{\theta}$ with corresponding distribution $Q_{\theta}$ , the sample-based quantile regression gradient $\nabla_{\theta}\rho_{\tau}(X-\bar{Q}_{\theta}(\tau))$ for a sample $\tau\sim\mathcal{U}([0,1])$ and $X\sim P$ is an unbiased estimate of $\nabla_{\theta}q(P,Q_{\theta})$ :

$$\nabla_{\theta}q(P,Q_{\theta})=\mathbb{E}_{\tau\sim\mathcal{U}([0,1])}\,\mathbb{E}_{X\sim P}\left[\nabla_{\theta}\rho_{\tau}(X-\bar{Q}_{\theta}(\tau))\right].$$

We illustrate the relation between the $1$ -Wasserstein metric and the quantile divergence in Figure 9. Notice that, for each $\tau\in[0,1]$ , while the Wasserstein measures the error between the two quantile functions, the quantile divergence measures a subset of the area enclosed between their graphs.

Network and Training Details

All PixelCNN and PixelIQN models in Section 4 are directly based on the small and large conditional Gated PixelCNN models developed in (van den Oord et al., 2016b). For CIFAR-10 (Section 4.1), we are using the smaller variant with $15$ layer blocks, convolutional filters of size $5$ , $128$ feature planes in each layer block, and $1024$ feature planes for the residual connections feeding into the output layer of the network. For small ImageNet (Section 4.2), we use both this model, and a larger $20$ layer version with $256$ feature planes in each layer block.

For PixelIQN, we rescale the $\tau\in[0,1]^{3n^{2}}$ linearly to lie in $[-1,1]^{3n^{2}}$ , and input it to the network in exactly the same way as the location-dependent conditioning in (van den Oord et al., 2016b), that is, by applying a $1\times 1$ convolution producing the same number of feature planes as in the respective layer block, and adding it to the output of this block prior to the gating activation.

All models on CIFAR-10 were trained for a total of $300$ K steps, those on ImageNet for $400$ K steps. We trained the small models with a mini-batch size of $32$ , running approximately $200$ K updates per day on a single NVIDIA Tesla P100 GPU, while the larger models were trained with a mini-batch size of $128$ with synchronous updates from $16$ P100 GPUs, achieving approximately half of this step rate.

Hyperparameter Tuning and Evaluation

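All quantitative evaluations of our PixelCNN and PixelIQN models are based on the Fréchet Inception Distance (FID) (Heusel et al., 2017),

$$\mathrm{FID}=\|\mu_{1}-\mu_{2}\|_{2}^{2}+\mathrm{Tr}\left(\Sigma_{1}+\Sigma_{2}-2(\Sigma_{1}\Sigma_{2})^{1/2}\right),$$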

where $(\mu_{1},\Sigma_{1})$ are the mean and covariance of $10,000$ samples from the model (PixelCNN or PixelIQN), and $(\mu_{2},\Sigma_{2})$ are the mean and covariance matrix computed over a set of $10,000$ training data points. We slightly deviate from the usual practice of using the entire training set for FID computation, as this would require an equal number ($50,000$ in the case of CIFAR-10) of samples to be drawn from the model, which is computationally very expensive for autoregressive models like PixelCNN or PixelIQN.
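The Fréchet distance between two Gaussians with these statistics can be computed as in the sketch below. It assumes that feature vectors have already been extracted from the model samples and the training images (the Inception network itself is not shown, and random arrays stand in for its activations here); the function names are hypothetical.

```python
import numpy as np

def feature_statistics(features):
    """Mean and covariance of a (num_samples, dim) array of feature vectors."""
    return features.mean(axis=0), np.cov(features, rowvar=False)

def frechet_distance(mu1, sigma1, mu2, sigma2):
    """||mu1 - mu2||^2 + Tr(sigma1 + sigma2 - 2 (sigma1 sigma2)^(1/2))."""
    diff = mu1 - mu2
    # Tr((S1 S2)^(1/2)) is the sum of the square roots of the eigenvalues of
    # S1 S2, which are real and non-negative for covariance matrices.
    eigvals = np.linalg.eigvals(sigma1 @ sigma2)
    trace_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    return float(diff @ diff + np.trace(sigma1) + np.trace(sigma2) - 2.0 * trace_sqrt)

# Stand-in features; in practice these would be Inception activations.
rng = np.random.default_rng(0)
model_feats = rng.normal(size=(2000, 16))
data_feats = rng.normal(loc=0.1, size=(2000, 16))

mu1, sigma1 = feature_statistics(model_feats)
mu2, sigma2 = feature_statistics(data_feats)
fid = frechet_distance(mu1, sigma1, mu2, sigma2)
```

Computing the trace term from eigenvalues avoids an explicit matrix square root; small negative eigenvalues caused by floating-point error are clipped to zero.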

We use Polyak averaging (Polyak & Juditsky, 1992), keeping an exponentially weighted average over past parameters with a weight of $0.9999$. This average is loaded in place of the model parameters before samples are generated, but is never used for training.
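A minimal sketch of such an exponentially weighted parameter average is given below. Storing parameters as a dictionary of arrays and the constant increment standing in for an optimizer step are assumptions made for this example.

```python
import numpy as np

def polyak_update(avg_params, params, decay=0.9999):
    """One step of the exponentially weighted average over past parameters."""
    return {name: decay * avg_params[name] + (1.0 - decay) * params[name]
            for name in params}

params = {"w": np.zeros(3)}
avg_params = {name: value.copy() for name, value in params.items()}

for step in range(1000):
    params["w"] = params["w"] + 0.01      # stand-in for an optimizer update
    avg_params = polyak_update(avg_params, params)

# avg_params (not params) would be loaded before generating samples;
# training continues to update params only.
```

With a weight of $0.9999$, the average reflects roughly the last $10{,}000$ updates, so it lags well behind the raw parameters early in training.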

To tune our small PixelCNN and PixelIQN models, we performed a hyperparameter search over $500$ hyperparameter configurations for each model, each configuration evaluated after $100$ K training steps on CIFAR-10, based on its FID score computed on a small set of $2500$ generated samples.

For PixelCNN, the parameter search involved choosing among RMSProp, Adam, and SGD as the optimizer, and tuning the learning rate, with both constant and decaying learning rate schedules considered. As a result we settled on the RMSProp optimizer and a set of three possible learning rate regimes, namely a constant learning rate of $10^{-4}$ or $3\cdot 10^{-5}$, and a decaying learning rate regime: $10^{-4}$ for the first $120$ K steps, $3\cdot 10^{-5}$ for the next $60$ K, and $10^{-5}$ for the remaining training steps. We found the first of these to work best on ImageNet, and the decaying schedule to work best on CIFAR-10, and only report the best model for each dataset.
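The decaying regime is a piecewise-constant schedule, which can be written as a small function. The function name and the convention that steps are counted from zero are assumptions of this sketch.

```python
def decaying_learning_rate(step):
    """Piecewise-constant schedule: 1e-4, then 3e-5, then 1e-5.

    Steps are assumed to be counted from 0, so the first 120K steps
    are steps 0 .. 119,999.
    """
    if step < 120_000:
        return 1e-4
    if step < 180_000:      # the next 60K steps
        return 3e-5
    return 1e-5             # all remaining training steps

rates = [decaying_learning_rate(s) for s in (0, 119_999, 120_000, 179_999, 180_000, 299_999)]
```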

For PixelIQN, the parameter search included the above (but with constant learning rates only), and additionally a sweep over a range of values for the Huber loss parameter $\kappa$ (Equation 2). As a result, we used Adam with a constant learning rate of $10^{-4}$ for all PixelIQN model variants on both datasets, and set $\kappa=0.002$. We found that the model is not sensitive to this hyperparameter, but performs somewhat worse if the regular quantile regression loss is used instead of the Huber variant.
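To make the role of $\kappa$ concrete, the sketch below implements a Huber quantile loss in the form commonly used in the quantile regression literature, $|\tau-\mathbb{I}\{u\leq 0\}|\,L_{\kappa}(u)/\kappa$ with $L_{\kappa}$ the Huber loss. That this matches Equation 2 exactly is an assumption, and the function names are hypothetical.

```python
import numpy as np

def huber(u, kappa):
    """Huber loss: quadratic for |u| <= kappa, linear beyond."""
    abs_u = np.abs(u)
    return np.where(abs_u <= kappa, 0.5 * u ** 2, kappa * (abs_u - 0.5 * kappa))

def huber_quantile_loss(u, tau, kappa=0.002):
    """Asymmetrically weighted Huber loss, normalized by kappa (assumed form)."""
    weight = np.abs(tau - (u <= 0).astype(float))
    return weight * huber(u, kappa) / kappa

def quantile_loss(u, tau):
    """Regular quantile regression loss for comparison."""
    return u * (tau - (u <= 0).astype(float))

u = np.array([-1.0, -0.001, 0.0, 0.001, 1.0])
smooth = huber_quantile_loss(u, tau=0.3)
plain = quantile_loss(u, tau=0.3)
```

For errors much larger than $\kappa$ the two losses nearly coincide, while for $|u|\leq\kappa$ the Huber variant is quadratic and therefore has a gradient that vanishes smoothly at zero.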

AIQN-VAE

One potential drawback of PixelIQN as presented above, shared by PixelCNN and autoregressive models more generally, is that sampling can be extremely time-consuming due to their autoregressive nature. This is especially true as the resolution of images increases. Although it is possible to partially reduce this overhead with clever engineering, these models are inherently much slower to sample from than models such as GANs and VAEs. In this section, we demonstrate how PixelIQN, due to the continuous nature of the quantile function, can be used to learn distributions over lower-dimensional latent spaces, such as those produced by an autoencoder, variational or otherwise. Specifically, we use a standard VAE, but simultaneously train a small AIQN to model the training distribution over latent codes. For sampling, we then generate samples of the latent distribution using the AIQN instead of the VAE prior.

This approach works well for two reasons. First, even a thoroughly trained VAE does not produce an encoder that fully matches the Gaussian prior. Generally, the data distribution exists on a non-Gaussian manifold in the latent space, despite the use of variational training. Second, unlike existing methods, AIQN learns to approximate the full continuous-valued distribution without discretizing values or making prior assumptions about the value range or underlying distribution.

We can see similarities between this approach and two other recent publications. The first is the $\alpha$-GAN proposed by Rosca et al. (2017). In both, there is an attempt to sample from the true latent distribution of a VAE-like latent variable model. In the case of $\alpha$-GAN this sampling distribution is trained using a GAN, while we propose to learn the distribution using quantile regression. The similarity makes sense considering AIQN shares some of the benefits of GANs. Unlike in this related work, we have not replaced the KL penalty on the latent representation. It would be an interesting direction for future research to explore a similar formulation. Generally, the same trade-offs between GANs and AIQN should be expected to come into play here just as they do when learning image distributions. Second, the VQ-VAE model (van den Oord et al., 2017) learns a PixelCNN model of the (discretized) latent space. Here, especially in the latent space, distribution losses respecting distances between individual points are more applicable than likelihood-based losses.

During training, the quantile regression loss of the AIQN on the encoder outputs is minimized together with $\mathcal{L}_{VAE}$, the standard VAE loss function. Then, for generation, we sample $\tau\sim\mathcal{U}([0,1]^{m})$, and reparameterize this sample through the AIQN and the decoder $d$ to produce $y=d(Q_{\tau})$, a sample from the approximated distribution. We call this simple combination the AIQN-VAE.

We demonstrate the AIQN-VAE using the CelebA dataset (Liu et al., 2015), at resolution $64\times 64$. We modified an open source VAE implementation (https://github.com/LynnHo/VAE-Tensorflow) to simultaneously train the AIQN on the output of the VAE encoder, with Polyak averaging (Polyak & Juditsky, 1992) of the AIQN weights. We reduce the latent dimension to $32$, as our purpose is to investigate the use of VAEs to learn in lower-dimensional latent spaces. The AIQN used three fully connected layers of width $512$ with ReLU activations. For the AIQN-VAE, but not the VAE, we lowered latent dimension variance to $0.1$ and the KL-term weight to $0.5$. It has been observed that in this setting the VAE prior alone will produce poor samples, thus high-quality samples will only be possible by learning the latent distribution. Figure 10 shows samples from both a VAE and AIQN-VAE after $200$ K training iterations. Both models may be expected to improve with further training; however, we can see that the AIQN-VAE samples are frequently clearer and less blurry than those from the VAE.