Stacked Generative Adversarial Networks

Xun Huang, Yixuan Li, Omid Poursaeed, John Hopcroft, Serge Belongie

Introduction

Recent years have witnessed tremendous success of deep neural networks (DNNs), especially the kind of bottom-up neural networks trained for discriminative tasks. In particular, Convolutional Neural Networks (CNNs) have achieved impressive accuracy on the challenging ImageNet classification benchmark . Interestingly, it has been shown that CNNs trained on ImageNet for classification can learn representations that are transferable to other tasks , and even to other modalities . However, bottom-up discriminative models are focused on learning useful representations from data, being incapable of capturing the data distribution.

Learning top-down generative models that can explain complex data distribution is a long-standing problem in machine learning research. The expressive power of deep neural networks makes them natural candidates for generative models, and several recent works have shown promising results . While state-of-the-art DNNs can rival human performance in certain discriminative tasks, current best deep generative models still fail when there are large variations in the data distribution.

A natural question therefore arises: can we leverage the hierarchical representations in a discriminatively trained model to help the learning of top-down generative models? In this paper, we propose a generative model named Stacked Generative Adversarial Networks (SGAN). Our model consists of a top-down stack of GANs, each trained to generate “plausible” lower-level representations conditioned on higher-level representations. Similar to the image discriminator in the original GAN model which is trained to distinguish “fake” images from “real” ones, we introduce a set of representation discriminators that are trained to distinguish “fake” representations from “real” representations. The adversarial loss introduced by the representation discriminator forces the intermediate representations of the SGAN to lie on the manifold of the bottom-up DNN’s representation space. In addition to the adversarial loss, we also introduce a conditional loss that imposes each generator to use the higher-level conditional information, and a novel entropy loss that encourages each generator to generate diverse representations. By stacking several GANs in a top-down way and using the top-most GAN to receive labels and the bottom-most GAN to generate images, SGAN can be trained to model the data distribution conditioned on class labels. Through extensive experiments, we demonstrate that our SGAN is able to generate images of much higher quality than a vanilla GAN. In particular, our model obtains state-of-the-art Inception scores on CIFAR-10 dataset.

Related Work

Deep Generative Image Models. There has been a large body of work on generative image modeling with deep learning. Some early efforts include Restricted Boltzmann Machines and Deep Belief Networks . More recently, several successful paradigms of deep generative models have emerged, including the auto-regressive models , Variational Auto-encoders (VAEs) , and Generative Adversarial Networks (GANs) . Our work builds upon the GAN framework, which employs a generator that transforms a noise vector into an image and a discriminator that distinguishes between real and generated images.

There are several works that use a pre-trained discriminative model to aid the training of a generator. add a regularization term that encourages the reconstructed image to be similar to the original image in the feature space of a discriminative network. use an additional “style loss” based on Gram matrices of feature activations. Different from our method, all the works above only add loss terms to regularize the generator’s output, without regularizing its internal representations.

Matching Intermediate Representations Between Two DNNs. There have been some works that attempt to “match” the intermediate representations between two DNNs. use the intermediate representations of one pre-trained DNN to guide another DNN in the context of knowledge transfer. Our method can be considered as a special kind of knowledge transfer. However, we aim at transferring the knowledge in a bottom-up DNN to a top-down generative model, instead of another bottom-up DNN. Also, some auto-encoder architectures employ layer-wise reconstruction loss . The layer-wise loss is usually accompanied by lateral connections from the encoder to the decodery. On the other hand, SGAN is a generative model and does not require any information from the encoder once training completes. Another important difference is that we use adversarial loss instead of $L2$ reconstruction loss to match intermediate representations.

Visualizing Deep Representations. Our work is also related to the recent efforts in visualizing the internal representations of DNNs. One popular approach uses gradient-based optimization to find an image whose representation is close to the one we want to visualize . Other approaches, such as , train a top-down deconvolutional network to reconstruct the input image from a feature representation by minimizing the Euclidean reconstruction error in image space. However, there is inherent uncertainty in the reconstruction process, since the representations in higher layers of the DNN are trained to be invariant to irrelevant transformations and to ignore low-level details. With Euclidean training objective, the deconvolutional network tends to produce blurry images. To alleviate this problem, Dosovitskiy abd Brox further propose a feature loss and an adversarial loss that enables much sharper reconstructions. However, it still does not tackle the problem of uncertainty in reconstruction. Given a high-level feature representation, the deconvolutional network deterministically generates a single image, despite the fact that there exist many images having the same representation. Also, there is no obvious way to sample images because the feature prior distribution is unknown. Concurrent to our work, Nguyen et al. incorporate the feature prior with a variant of denoising auto-encoder (DAE). Their sampling relies on an iterative optimization procedure, while we are focused on efficient feed-forward sampling.

Methods

In this section we introduce our model architecture. In Sec. 3.1 we briefly overview the framework of Generative Adversarial Networks. We then describe our proposal for Stacked Generative Adversarial Networks in Sec. 3.2. In Sect. 3.3 and 3.4 we will focus on our two novel loss functions, conditional loss and entropy loss, respectively.

As shown in Fig. 1 (a), the original GAN is trained using a two-player min-max game: a discriminator $D$ trained to distinguish generated images from real images, and a generator $G$ trained to fool $D$ . The discriminator loss $\mathcal{L}_{D}$ and the generator loss $\mathcal{L}_{G}$ are defined as follows:

In practice, $D$ and $G$ are usually updated alternately. The training process matches the generated image distribution $P_{G}(x)$ with the real image distribution $P_{data}(x)$ in the training set. In other words, The adversarial training forces $G$ to generate images that reside on the natural images manifold.

2 Stacked Generative Adversarial Networks

Pre-trained Encoder. We first consider a bottom-up DNN pre-trained for classification, which is referred to as the encoder $E$ throughout. We define a stack of bottom-up deterministic nonlinear mappings: $h_{i+1}=E_{i}(h_{i})$ , where $i\in\{0,1,...,N-1\}$ , $E_{i}$ consists of a sequence of neural layers (e.g., convolution, pooling), $N$ is the number of hierarchies (stacks), $h_{i}(i\neq 0,N)$ are intermediate representations, $h_{N}=y$ is the classification result, and $h_{0}=x$ is the input image. Note that in our formulation, each $E_{i}$ can contain multiple layers and the way of grouping layers together into $E_{i}$ is determined by us. The number of stacks $N$ is therefore less than the number of layers in $E$ and is also determined by us.

Stacked Generators. Provided with a pre-trained encoder $E$ , our goal is to train a top-down generator $G$ that inverts $E$ . Specifically, $G$ consists of a top-down stack of generators $G_{i}$ , each trained to invert a bottom-up mapping $E_{i}$ . Each $G_{i}$ takes in a higher-level feature and a noise vector as inputs, and outputs the lower-level feature $\hat{h}_{i}$ . We first train each GAN independently and then train them jointly in an end-to-end manner, as shown in Fig. 1. Each generator receives conditional input from encoders in the independent training stage, and from the upper generators in the joint training stage. In other words, $\hat{h}_{i}=G_{i}(h_{i+1},z_{i})$ during independent training and $\hat{h}_{i}=G_{i}(\hat{h}_{i+1},z_{i})$ during joint training. The loss equations shown in this section are for independent training stage but can be easily modified to joint training by replacing $h_{i+1}$ with $\hat{h}_{i+1}$ .

Intuitively, the total variations of images could be decomposed into multiple levels, with higher-level semantic variations (e.g., attributes, object categories, rough shapes) and lower-level variations (e.g., detailed contours and textures, background clutters). Our model allows using different noise variables to represent different levels of variations.

The training procedure is shown in Fig. 1 (b). Each generator $G_{i}$ is trained with a linear combination of three loss terms: adversarial loss, conditional loss, and entropy loss.

where $\mathcal{L}_{G_{i}}^{adv}$ , $\mathcal{L}_{G_{i}}^{cond}$ , $\mathcal{L}_{G_{i}}^{ent}$ denote adversarial loss, conditional loss, and entropy loss respectively. $\lambda_{1}$ , $\lambda_{2}$ , $\lambda_{3}$ are the weights associated with different loss terms. In practice, we find it sufficient to set the weights such that the magnitude of different terms are of similar scales. In this subsection we first introduce the adversarial loss $\mathcal{L}_{G_{i}}^{adv}$ . We will then introduce $\mathcal{L}_{G_{i}}^{cond}$ and $\mathcal{L}_{G_{i}}^{ent}$ in Sec. 3.3 and 3.4 respectively.

For each generator $G_{i}$ , we introduce a representation discriminator $D_{i}$ that distinguishes generated representations $\hat{h}_{i}$ , from “real” representations ${h_{i}}$ . Specifically, the discriminator $D_{i}$ is trained with the loss function:

And $G_{i}$ is trained to “fool” the representation discriminator $D_{i}$ , with the adversarial loss defined by:

During joint training, the adversarial loss provided by representational discriminators can also be regarded as a type of deep supervision , providing intermediate supervision signals. In our current formulation, $E$ is a discriminative model, and $G$ is a generative model conditioned on labels. However, it is also possible to train SGAN without using label information: $E$ can be trained with an unsupervised objective and $G$ can be cast into an unconditional generative model by removing the label input from the top generator. We leave this for future exploration.

Sampling. To sample images, all $G_{i}$ s are stacked together in a top-down manner, as shown in Fig. 1 (c). Our SGAN is capable of modeling the data distribution conditioned on the class label: $p_{G}(\hat{x}|y)=p_{G}(\hat{h}_{0}|\hat{h}_{N})\propto p_{G}(\hat{h}_{0},\hat{h}_{1},...,\hat{h}_{N-1}|\hat{h}_{N})=\prod\limits_{0\leq i\leq N-1}p_{G_{i}}(\hat{h}_{i}|\hat{h}_{i+1})$ , where each $p_{G_{i}}(\hat{h}_{i}|\hat{h}_{i+1})$ is modeled by a generator $G_{i}$ . From an information-theoretic perspective, SGAN factorizes the total entropy of the image distribution $H(x)$ into multiple (smaller) conditional entropy terms: $H(x)=H(h_{0},h_{1},...,h_{N})=\sum_{i=0}^{N-1}H(h_{i}|h_{i+1})+H(y)$ , thereby decomposing one difficult task into multiple easier tasks.

3 Conditional Loss

At each stack, a generator $G_{i}$ is trained to capture the distribution of lower-level representations $\hat{h}_{i}$ , conditioned on higher-level representations $h_{i+1}$ . However, in the above formulation, the generator might choose to ignore $h_{i+1}$ , and generate plausible $\hat{h}_{i}$ from scratch. Some previous works tackle this problem by feeding the conditional information to both the generator and discriminator. This approach, however, might introduce unnecessary complexity to the discriminator and increase model instability .

Here we adopt a different approach: we regularize the generator by adding a loss term $\mathcal{L}_{G_{i}}^{cond}$ named conditional loss. We feed the generated lower-level representations $\hat{h}_{i}=G_{i}(h_{i+1},z_{i})$ back to the encoder $E$ , and compute the recovered higher-level representations. We then enforce the recovered representations to be similar to the conditional representations. Formally:

where $f$ is a distance measure. We define $f$ to be the Euclidean distance for intermediate representations and cross-entropy for labels. Our conditional loss $\mathcal{L}_{G_{i}}^{cond}$ is similar to the “feature loss” used by and the “FCN loss” in .

4 Entropy Loss

Simply adding the conditional loss $\mathcal{L}_{G_{i}}^{cond}$ leads to another issue: the generator $G_{i}$ learns to ignore the noise $z_{i}$ , and compute $\hat{h}_{i}$ deterministically from $h_{i+1}$ . This problem has been encountered in various applications of conditional GANs, e.g., synthesizing future frames conditioned on previous frames , generating images conditioned on label maps , and most related to our work, synthesizing images conditioned on feature representations . All the above works attempted to generate diverse images/videos by feeding noise to the generator, but failed because the conditional generator simply ignores the noise. To our knowledge, there is still no principled way to deal with this issue. It might be tempting to think that minibatch discrimination , which encourages sample diversity in each minibatch, could solve this problem. However, even if the generator generates $\hat{h}_{i}$ deterministically from $h_{i+1}$ , the generated samples in each minibatch are still diverse since generators are conditioned on different $h_{i+1}$ . Thus, there is no obvious way minibatch discrimination could penalize a collapsed conditional generator.

Variational Conditional Entropy Maximization. To tackle this problem, we would like to encourage the generated representation $\hat{h}_{i}$ to be sufficiently diverse when conditioned on $h_{i+1}$ , i.e., the conditional entropy $H(\hat{h}_{i}|h_{i+1})$ should be as high as possible. Since directly maximizing $H(\hat{h}_{i}|h_{i+1})$ is intractable, we propose to maximize instead a variational lower bound on the conditional entropy. Specifically, we use an auxiliary distribution $Q_{i}(z_{i}|\hat{h}_{i})$ to approximate the true posterior $P_{i}(z_{i}|\hat{h}_{i})$ , and augment the training objective with a loss term named entropy loss:

Below we give a proof that minimizing $\mathcal{L}_{G_{i}}^{ent}$ is equivalent to maximizing a variational lower bound for $H(\hat{h}_{i}|h_{i+1})$ .

In practice, we parameterize $Q_{i}$ with a deep network that predicts the posterior distribution of $z_{i}$ given $\hat{h}_{i}$ . $Q_{i}$ shares most of the parameters with $D_{i}$ . We treat the posterior as a diagonal Gaussian with fixed standard deviations, and use the network $Q_{i}$ to only predict the posterior mean, making $\mathcal{L}_{G_{i}}^{ent}$ equivalent to the Euclidean reconstruction error. In each iteration we update both $G_{i}$ and $Q_{i}$ to minimize $\mathcal{L}_{G_{i}}^{ent}$ .

Our method is similar to the variational mutual information maximization technique proposed by Chen et al. . A key difference is that uses the $Q$ -network to predict only a small set of deliberately constructed “latent code”, while our $Q_{i}$ tries to predict all the noise variables $z_{i}$ in each stack. The loss used in therefore maximizes the mutual information between the output and the latent code, while ours maximizes the entropy of the output $\hat{h}_{i}$ , conditioned on ${h}_{i+1}$ . also train a separate network to map images back to latent space to perform unsupervised feature learning. Independent of our work, proposes to regularize EBGAN with entropy maximization in order to prevent the discriminator from degenerating to uniform prediction. Our entropy loss is motivated from generating multiple possible outputs from the same conditional input.

Experiments

In this section, we perform experiments on a variety of datasets including MNIST , SVHN , and CIFAR-10 . Code and pre-trained models are available at: https://github.com/xunhuang1995/SGAN. Readers may refer to our code repository for more details about experimental setup, hyper-parameters, etc.

Encoder: For all datasets we use a small CNN with two convolutional layers as our encoder: conv1-pool1-conv2-pool2-fc3-fc4, where fc3 is a fully connected layer and fc4 outputs classification scores before softmax. On CIFAR-10 we apply horizontal flipping to train the encoder. No data augmentation is used on other datasets.

Generator: We use generators with two stacks throughout our experiments. Note that our framework is generally applicable to the setting with multiple stacks, and we hypothesize that using more stacks would be helpful for large-scale and high-resolution datasets. For all datasets, our top GAN $G_{1}$ generates fc3 features from some random noise $z_{1}$ , conditioned on label $y$ . The bottom GAN $G_{0}$ generates images from some noise $z_{0}$ , conditioned on fc3 features generated from GAN $G_{1}$ . We set the loss coefficient parameters $\lambda_{1}=\lambda_{2}=1$ and $\lambda_{3}=10$ . The choice of the parameters are made so that the magnitude of each loss term is of the same scale.

We thoroughly evaluate SGAN on three widely adopted datasets: MNIST , SVHN , and CIFAR-10 . The details of each dataset is described in the following.

MNIST: The MNIST dataset contains $70,000$ labeled images of hand-written digits with $60,000$ in the training set and $10,000$ in the test set. Each image is sized by $28\times 28$ .

SVHN: The SVHN dataset is composed of real-world color images of house numbers collected by Google Street View . Each image is of size $32\times 32$ and the task is to classify the digit at the center of the image. The dataset contains $73,257$ training images and $26,032$ test images.

CIFAR-10: The CIFAR-10 dataset consists of colored natural scene images sized at $32\times 32$ pixels. There are 50,000 training images and 10,000 test images in $10$ classes.

2 Samples

In Fig. 2 (a), we show MNIST samples generated by SGAN. Each row corresponds to samples conditioned on a given digit class label. SGAN is able to generate diverse images with different characteristics. The samples are visually indistinguishable from real MNIST images shown in Fig. 2 (b), but still have differences compared with corresponding nearest neighbor training images.

We further examine the effect of entropy loss. In Fig. 2 (c) we show the samples generated by bottom GAN when conditioned on a fixed fc3 feature generated by the top GAN. The samples (per row) have sufficient low-level variations, which reassures that bottom GAN learns to generate images without ignoring the noise $z_{0}$ . In contrast, in Fig. 2 (d), we show samples generated without using entropy loss for bottom generator, where we observe that the bottom GAN ignores the noise and instead deterministically generates images from fc3 features.

An advantage of SGAN compared with a vanilla GAN is its interpretability: it decomposes the total variations of an image into different levels. For example, in MNIST it decomposes the variations into $y$ that represents the high-level digit label, $z_{1}$ that captures the mid-level coarse pose of the digit and $z_{0}$ that represents the low-level spatial details.

The samples generated on SVHN and CIFAR-10 datasets can be seen in Fig. 3 and Fig. 4, respectively. Provided with the same fc3 feature, we see in each row of panel (c) that SGAN is able to generate samples with similar coarse outline but different lighting conditions and background clutters. Also, the nearest neighbor images in the training set indicate that SGAN is not simply memorizing training data, but can truly generate novel images.

3 Comparison with the state of the art

Here, we compare SGAN with other state-of-the-art generative models on CIFAR-10 dataset. The visual quality of generated images is measured by the widely used metric, Inception score . Following , we sample $50,000$ images from our model and use the code provided by to compute the score. As shown in Tab. 1, SGAN obtains a score of $8.59\pm 0.12$ , outperforming AC-GAN ( $8.25\pm 0.07$ ) and Improved GAN ( $8.09\pm 0.07$ ). Also, note that the $5$ techniques introduced in are not used in our implementations. Incorporating these techniques might further boost the performance of our model.

4 Visual Turing test

To further verify the effectiveness of SGAN, we conduct human visual Turing test in which we ask AMT workers to distinguish between real images and images generated by our networks. We exactly follow the interface used in Improved GAN , in which the workers are given $9$ images at each time and can receive feedback about whether their answers are correct. With $9,000$ votes for each evaluated model, our AMT workers got $24.4\%$ error rate for samples from SGAN and $15.6\%$ for samples from DCGAN $(\mathcal{L}^{adv}+L^{cond}+\mathcal{L}^{ent})$ . This further confirms that our stacked design can significantly improve the image quality over GAN without stacking.

5 More ablation studies

In Sec. 4.2 we have examined the effect of entropy loss. In order to further understand the effect of different model components, we conduct extensive ablation studies by evaluating several baseline methods on CIFAR-10 dataset. If not mentioned otherwise, all models below use the same training hyper-parameters as the full SGAN model.

SGAN: The full model, as described in Sec. 3.

SGAN-no-joint: Same architecture as (a), but each GAN is trained independently, and there is no final joint training stage.

DCGAN ( $\mathcal{L}^{adv}+\mathcal{L}^{cond}+\mathcal{L}^{ent}$ ): This is a single GAN model with the same architecture as the bottom GAN in SGAN, except that the generator is conditioned on labels instead of fc3 features. Note that other techniques proposed in this paper, including conditional loss $\mathcal{L}^{cond}$ and entropy loss $\mathcal{L}^{ent}$ , are still employed. We also tried to use the full generator $G$ in SGAN as the baseline, instead of only the bottom generator $G_{0}$ . However, we failed to make it converge, possibly because $G$ is too deep to be trained without intermediate supervision from representation discriminators.

DCGAN ( $\mathcal{L}^{adv}+\mathcal{L}^{cond}$ ): Same architecture as (c), but trained without entropy loss $\mathcal{L}^{ent}$ .

DCGAN ( $\mathcal{L}^{adv}+\mathcal{L}^{ent}$ ): Same architecture as (c), but trained without conditional loss $\mathcal{L}^{cond}$ . This model therefore does not use label information.

DCGAN ( $\mathcal{L}^{adv}$ ): Same architecture as (c), but trained with neither conditional loss $\mathcal{L}^{cond}$ nor entropy loss $\mathcal{L}^{ent}$ . This model also does not use label information. It can be viewed as a plain unconditional DCGAN model and serves as the ultimate baseline.

SGAN obtains slightly higher Inception score than SGAN-no-joint. Yet SGAN-no-joint also generates very high quality samples and outperforms all previous methods in terms of Inception scores.

SGAN, either with or without joint training, achieves significantly higher Inception score and better sample quality than the baseline DCGANs. This demonstrates the effectiveness of the proposed stacked approach.

As shown in Fig. 5 (d), DCGAN ( $\mathcal{L}^{adv}+\mathcal{L}^{cond}$ ) collapses to generating a single image per category, while adding the entropy loss enables it to generate diverse images (Fig. 5 (c)). This further demonstrates that entropy loss is effective at improving output diversity.

The single DCGAN ( $\mathcal{L}^{adv}+\mathcal{L}^{cond}+\mathcal{L}^{ent}$ ) model obtains higher Inception score than the conditional DCGAN reported in . This suggests that $\mathcal{L}^{cond}+\mathcal{L}^{ent}$ might offer some advantages compared to a plain conditional DCGAN, even without stacking.

In general, Inception score correlates well with visual quality of images. However, it seems to be insensitive to diversity issues . For example, it gives the same score to Fig. 5 (d) and (e) while (d) has clearly collapsed. This is consistent with results in .

Discussion and Future Work

This paper introduces a top-down generative framework named SGAN, which effectively leverages the representational information from a pre-trained discriminative network. Our approach decomposes the hard problem of estimating image distribution into multiple relatively easier tasks – each generating plausible representations conditioned on higher-level representations. The key idea is to use representation discriminators at different training hierarchies to provide intermediate supervision. We also propose a novel entropy loss to tackle the problem that conditional GANs tend to ignore the noise. Our entropy loss could be employed in other applications of conditional GANs, e.g., synthesizing different future frames given the same past frames , or generating a diverse set of images conditioned on the same label map . We believe this is an interesting research direction in the future.

We would like to thank Danlu Chen for the help with Fig. 1. Also, we want to thank Danlu Chen, Shuai Tang, Saining Xie, Zhuowen Tu, Felix Wu and Kilian Weinberger for helpful discussions. Yixuan Li is supported by US Army Research Office W911NF-14-1-0477. Serge Belongie is supported in part by a Google Focused Research Award.