On Self Modulation for Generative Adversarial Networks
Ting Chen, Mario Lucic, Neil Houlsby, Sylvain Gelly
Introduction
Generative Adversarial Networks (GANs) are a powerful class of generative models successfully applied to a variety of tasks such as image generation (Zhang et al., 2017; Miyato et al., 2018; Karras et al., 2017), learned compression (Tschannen et al., 2018), super-resolution (Ledig et al., 2017), inpainting (Pathak et al., 2016), and domain transfer (Isola et al., 2016; Zhu et al., 2017).
Training GANs is a notoriously challenging task (Goodfellow et al., 2014; Arjovsky et al., 2017; Lucic et al., 2018) as one is searching in a high-dimensional parameter space for a Nash equilibrium of a non-convex game. As a practical remedy one applies (usually a variant of) stochastic gradient descent, which can be unstable and lack guarantees (Salimans et al., 2016). As a result, one of the main research challenges is to stabilize GAN training. Several approaches have been proposed, including varying the underlying divergence between the model and data distributions (Arjovsky et al., 2017; Mao et al., 2016), regularization and normalization schemes (Gulrajani et al., 2017; Miyato et al., 2018), optimization schedules (Karras et al., 2017), and specific neural architectures (Radford et al., 2016; Zhang et al., 2018). A particularly successful approach is based on conditional generation, where the generator (and possibly the discriminator) is given side information, for example class labels (Mirza & Osindero, 2014; Odena et al., 2017; Miyato & Koyama, 2018). In fact, state-of-the-art conditional GANs inject side information via conditional batch normalization (CBN) layers (De Vries et al., 2017; Miyato & Koyama, 2018; Zhang et al., 2018). While this approach does help, a major drawback is that it requires external information, such as labels or embeddings, which is not always available.
In this work we show that GANs benefit from self-modulation layers in the generator. Our approach is motivated by Feature-wise Linear Modulation in supervised learning (Perez et al., 2018; De Vries et al., 2017), with one key difference: instead of conditioning on external information, we condition on the generator’s own input. As self-modulation requires a simple change which is easily applicable to all popular generator architectures, we believe that it is a useful addition to the GAN toolbox.
Summary of contributions. We provide a simple yet effective technique that can be added universally to yield better GANs. We demonstrate empirically that for a wide variety of settings (loss functions, regularizers and normalizers, neural architectures, and optimization settings) the proposed approach yields an improvement in sample quality. When using fixed hyperparameter settings our approach outperforms the baseline in 124 of 144 cases. Further, we show that self-modulation still helps even if label information is available. Finally, we discuss the effects of this method in light of recently proposed diagnostic tools, generator conditioning (Odena et al., 2018) and precision/recall for generative models (Sajjadi et al., 2018).
Self-Modulation for Generative Adversarial Networks
Several recent works observe that conditioning the generative process on side information (such as labels or class embeddings) leads to improved models (Mirza & Osindero, 2014; Odena et al., 2017; Miyato & Koyama, 2018). Two major approaches to conditioning on side information have emerged: (1) Directly concatenate the side information with the noise vector (Mirza & Osindero, 2014). (2) Condition the hidden layers directly on the side information, which is usually instantiated via conditional batch normalization (De Vries et al., 2017; Miyato & Koyama, 2018).
Despite the success of conditional approaches, two concerns arise. The first is practical; side information is often unavailable. The second is conceptual; unsupervised models, such as GANs, seek to model data without labels. Including them side-steps the challenge and value of unsupervised learning.
We propose self-modulating layers for the generator network. In these layers the hidden activations are modulated as a function of the latent vector $z$. In particular, we apply modulation in a feature-wise fashion, which allows the model to re-weight the feature maps as a function of the input. This is also motivated by the FiLM layer for supervised models (Perez et al., 2018; De Vries et al., 2017), in which a similar mechanism is used to condition a supervised network on side information.
Batch normalization (Ioffe & Szegedy, 2015) can improve the training of deep neural nets, and it is widely used in both discriminative and generative modeling (Szegedy et al., 2015; Radford et al., 2016; Miyato et al., 2018). It is thus present in most modern networks, and provides a convenient entry point for self-modulation. Therefore, we present our method in the context of its application via batch normalization. In batch normalization the activations of a layer, $h$, are transformed as
$$h' = \gamma \odot \frac{h - \mu}{\sigma} + \beta, \qquad (1)$$
where $\mu$ and $\sigma^2$ are the estimated mean and variances of the features across the data, and $\gamma$ and $\beta$ are learnable scale and shift parameters.
For self-modulation, we replace the scale $\gamma$ with a learned function $\gamma(z)$ of the generator's latent input $z$. We do the same for $\beta$ with independent parameters.
When label information is available, the modulation can additionally depend on the label. This conditionally composed $\gamma$ can be directly used in Equation 1. Despite its simplicity, we demonstrate that it outperforms the standard conditional models.
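The layer described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes NumPy, flat activations of shape (batch, features) rather than convolutional feature maps, and a one-hidden-layer ReLU network as the modulation function. The names `mlp`, `init_params`, and `self_modulated_bn`, and all sizes, are hypothetical.

```python
import numpy as np

def mlp(z, U, b, V):
    """Assumed modulation function: one hidden ReLU layer mapping z to one value per feature."""
    return np.maximum(0.0, z @ U + b) @ V

def init_params(rng, z_dim, hidden, features):
    """Small random weights for one modulation network (illustrative initialisation)."""
    U = rng.normal(scale=0.1, size=(z_dim, hidden))
    b = np.zeros(hidden)
    V = rng.normal(scale=0.1, size=(hidden, features))
    return U, b, V

def self_modulated_bn(h, z, gamma_params, beta_params, eps=1e-5):
    """Normalize h over the batch, then scale and shift with functions of z."""
    mu = h.mean(axis=0, keepdims=True)
    var = h.var(axis=0, keepdims=True)
    h_hat = (h - mu) / np.sqrt(var + eps)
    gamma = mlp(z, *gamma_params)  # per-sample, per-feature scale
    beta = mlp(z, *beta_params)    # per-sample, per-feature shift, independent parameters
    return gamma * h_hat + beta

rng = np.random.default_rng(0)
batch, z_dim, hidden, features = 8, 16, 32, 4
z = rng.normal(size=(batch, z_dim))
h = rng.normal(size=(batch, features))
gamma_params = init_params(rng, z_dim, hidden, features)
beta_params = init_params(rng, z_dim, hidden, features)
out = self_modulated_bn(h, z, gamma_params, beta_params)
```

Unlike standard batch normalization, the scale and shift here differ per sample, because each row of `z` produces its own modulation vectors.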
Discussion. Table 1 summarizes recent techniques for generator conditioning. While we choose to implement this approach via batch normalization, it can also operate independently by removing the normalization part in Equation 1. We made this pragmatic choice because such conditioning is common (Radford et al., 2016; Miyato et al., 2018; Miyato & Koyama, 2018).
A second question is whether one benefits from more complex modulation architectures, such as using an attention network (Vaswani et al., 2017) whereby the scale and shift could be made dependent on all upstream activations, or constraining the elements of the scale to $(0, 1)$, which would yield a gating mechanism similar to that of an LSTM cell (Hochreiter & Schmidhuber, 1997). Based on initial experiments we concluded that this additional complexity does not yield a substantial increase in performance.
Experiments
We perform a large-scale study of self-modulation to demonstrate that this method yields robust improvements in a variety of settings. We consider loss functions, architectures, discriminator regularization/normalization strategies, and a variety of hyperparameter settings collected from recent studies (Radford et al., 2016; Gulrajani et al., 2017; Miyato et al., 2018; Lucic et al., 2018; Kurach et al., 2018). We study both unconditional (without labels) and conditional (with labels) generation. Finally, we analyze the results through the lens of the condition number of the generator’s Jacobian as suggested by Odena et al. (2018), and precision and recall as defined in Sajjadi et al. (2018).
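Loss functions. We consider two loss functions. In both cases the discriminator $D$ maximizes $V_D$ and the generator $G$ minimizes $V_G$, where $P_d$ denotes the data distribution and $P_z$ the distribution of the latent input. The first one is the non-saturating loss proposed in Goodfellow et al. (2014), where $D$ outputs a probability:
$$V_D = \mathbb{E}_{x \sim P_d}[\log D(x)] + \mathbb{E}_{z \sim P_z}[\log(1 - D(G(z)))], \qquad V_G = -\mathbb{E}_{z \sim P_z}[\log D(G(z))].$$
The second one is the hinge loss used in Miyato et al. (2018), where $D$ outputs an unbounded score:
$$V_D = \mathbb{E}_{x \sim P_d}[\min(0, -1 + D(x))] + \mathbb{E}_{z \sim P_z}[\min(0, -1 - D(G(z)))], \qquad V_G = -\mathbb{E}_{z \sim P_z}[D(G(z))].$$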
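Both objectives can be written as quantities to minimize, which is how they are typically implemented. The following is a small sketch using only the Python standard library; the function names are illustrative, and the inputs are plain lists of discriminator outputs on real and generated samples (probabilities for the non-saturating loss, raw scores for the hinge loss).

```python
import math

def mean(values):
    return sum(values) / len(values)

def non_saturating_losses(p_real, p_fake):
    """p_real, p_fake: discriminator probabilities in (0, 1) on real and generated data."""
    # Discriminator minimizes the negative of its objective.
    d_loss = -(mean([math.log(p) for p in p_real])
               + mean([math.log(1.0 - p) for p in p_fake]))
    # Generator minimizes -log D(G(z)) instead of log(1 - D(G(z))).
    g_loss = -mean([math.log(p) for p in p_fake])
    return d_loss, g_loss

def hinge_losses(s_real, s_fake):
    """s_real, s_fake: unbounded discriminator scores on real and generated data."""
    # -min(0, -1 + s) equals max(0, 1 - s), and -min(0, -1 - s) equals max(0, 1 + s).
    d_loss = (mean([max(0.0, 1.0 - s) for s in s_real])
              + mean([max(0.0, 1.0 + s) for s in s_fake]))
    g_loss = -mean(s_fake)
    return d_loss, g_loss

ns_d, ns_g = non_saturating_losses([0.5], [0.5])
hinge_d, hinge_g = hinge_losses([2.0, 0.5], [-2.0, 0.0])
```

With an undecided discriminator (all probabilities 0.5) the non-saturating losses evaluate to 2 log 2 and log 2; in the hinge example only the samples inside the margin contribute to the discriminator loss.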
Controlling the Lipschitz constant of the discriminator. The discriminator’s Lipschitz constant is a central quantity analyzed in the GAN literature (Miyato et al., 2018; Zhou et al., 2018). We consider two state-of-the-art techniques: gradient penalty (Gulrajani et al., 2017), and spectral normalization (Miyato et al., 2018). Without normalization and regularization the models can perform poorly on some datasets. For the gradient penalty regularizer we consider two regularization strengths.
Network architecture. We use two popular architecture types: one based on DCGAN (Radford et al., 2016), and another from Miyato et al. (2018) which incorporates residual connections (He et al., 2016). The details can be found in the appendix.
Optimization hyper-parameters. We train all models for k generator steps with the Adam optimizer (Kingma & Ba, 2014) (we also perform a subset of the studies with K steps). We test two popular settings of the Adam hyperparameters. Previous studies find that multiple discriminator steps per generator step can help the training (Goodfellow et al., 2014; Salimans et al., 2016), thus we also consider two settings for the number of discriminator steps per generator step (we also experimented with 5 steps, which did not outperform using fewer steps). In total, this amounts to three different sets of hyper-parameters. We fix the learning rate as in Miyato et al. (2018). All models are trained with a batch size of 64 on a single NVIDIA P100 GPU. We report the best performing model attained during the training period, although the results follow the same pattern if the final model is reported.
Datasets. We consider four datasets: cifar10, celeba-hq, lsun-bedroom, and imagenet. The lsun-bedroom dataset (Yu et al., 2015) contains around 3M images. We partition the images randomly into a test set containing 30588 images and a train set containing the rest. celeba-hq contains 30k images (Karras et al., 2017). We use the version obtained by running the code provided by the authors (available at https://github.com/tkarras/progressive_growing_of_gans). We use 3000 examples as the test set and the remaining examples as the training set. cifar10 contains 60K images ($32 \times 32$), partitioned into 50000 training instances and 10000 testing instances. Finally, we evaluate our method on imagenet, re-sizing the images as done in Miyato & Koyama (2018) and Zhang et al. (2018).
Metrics. Quantitative evaluation of generative models remains one of the most challenging tasks. This is particularly true in the context of implicit generative models where likelihood cannot be effectively evaluated. Nevertheless, two quantitative measures have recently emerged: the Inception Score (IS) and the Fréchet Inception Distance (FID). While both of these scores have some drawbacks, they correlate well with scores assigned by human annotators and are somewhat robust.
3.2 Robustness experiments for unconditional generation
To test robustness, we run a Cartesian product of the parameters in Section 3.1, which results in 36 settings for each dataset (2 losses, 2 architectures, 3 hyperparameter settings for spectral normalization, and 6 for gradient penalty). For each setting we run five random seeds for self-modulation and the baseline (no self-modulation, just batch normalization). This results in 1440 trained models (36 settings, 4 datasets, 2 types of conditioning, 5 seeds), and we compute the median score across random seeds.
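The size of this grid can be checked by enumerating it. The sketch below uses placeholder names for the optimizer settings and the gradient penalty strengths; it assumes two penalty strengths, which is inferred from the stated count of 6 gradient penalty settings over 3 optimizer settings.

```python
from itertools import product

losses = ["non_saturating", "hinge"]
architectures = ["sndcgan", "resnet"]
optim_settings = ["opt_a", "opt_b", "opt_c"]  # placeholders for the three hyperparameter sets
gp_strengths = ["lambda_1", "lambda_2"]       # placeholders; two strengths is an assumption

# Spectral normalization has no strength parameter; gradient penalty has one.
regularizer_settings = (
    [("spectral_norm", None, opt) for opt in optim_settings]
    + [("gradient_penalty", lam, opt) for lam, opt in product(gp_strengths, optim_settings)]
)

settings = list(product(losses, architectures, regularizer_settings))
datasets = ["cifar10", "celeba-hq", "lsun-bedroom", "imagenet"]
conditionings = ["baseline", "self_modulation"]
num_seeds = 5

num_comparisons = len(settings) * len(datasets)
num_trained_models = num_comparisons * len(conditionings) * num_seeds
```

Here `len(settings)` is the number of settings per dataset, `num_comparisons` the number of baseline versus self-modulation comparisons across the four datasets, and `num_trained_models` the total number of training runs.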
We distinguish between two sets of experiments. In the unpaired setting we define the model as the tuple of loss, regularizer/normalization, neural architecture, and conditioning (self-modulated or classic batch normalization). For each model we compute the minimum FID across optimization hyperparameters. We therefore compare the performance of self-modulation and baseline for each model after hyperparameter optimization. The results of this study are reported in Table 2, and the relative improvements are in Table 3 and Figure 2.
We observe the following: (1) When using the resnet style architecture, the proposed method outperforms the baseline in all considered settings. (2) When using the sndcgan architecture, it outperforms the baseline in of the cases. The breakdown by datasets is shown in Figure 2. (3) The improvement can be as high as a reduction in fid. (4) We observe similar improvement to the inception score, reported in the appendix.
In the second setting, the paired setting, we assess how effective the technique is when simply added to an existing model with the same set of hyperparameters. In particular, we fix everything except the type of conditioning; the model tuple now includes the optimization hyperparameters. This results in 36 settings for each data set for a total of 144 comparisons. We observe that self-modulation outperforms the baseline in 124/144 settings. These results suggest that self-modulation can be applied to most GANs even without additional hyperparameter tuning.
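The difference between the unpaired and paired protocols can be illustrated on toy data. The FID values below are made up purely for illustration (three seeds instead of five, for brevity), and the names `fid`, `median_fid`, `unpaired_best`, and `paired_wins` are hypothetical.

```python
from statistics import median

# fid[conditioning][optimizer setting] -> FID per random seed (made-up numbers).
fid = {
    "baseline": {"opt_a": [30.1, 29.5, 31.0], "opt_b": [28.0, 28.4, 27.9], "opt_c": [33.2, 32.8, 35.0]},
    "self_mod": {"opt_a": [27.0, 27.5, 26.8], "opt_b": [28.6, 28.1, 28.3], "opt_c": [30.0, 29.1, 29.9]},
}

def median_fid(conditioning):
    """Median FID across seeds for each optimizer setting."""
    return {opt: median(scores) for opt, scores in fid[conditioning].items()}

def unpaired_best(conditioning):
    """Unpaired protocol: best (minimum) median FID after tuning the optimizer setting."""
    return min(median_fid(conditioning).values())

def paired_wins():
    """Paired protocol: count optimizer settings where self-modulation has lower FID."""
    base, mod = median_fid("baseline"), median_fid("self_mod")
    return sum(mod[opt] < base[opt] for opt in base)

best_baseline = unpaired_best("baseline")
best_self_mod = unpaired_best("self_mod")
relative_improvement = (best_baseline - best_self_mod) / best_baseline
wins = paired_wins()
```

In the unpaired protocol each method gets its own best optimizer setting, so the comparison is between tuned models; in the paired protocol every optimizer setting yields one head-to-head comparison.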
Conditional Generation. We demonstrate that self-modulation also works for label-conditional generation. Here, one is given access to the class label, which may be used by the generator and the discriminator. We compare two settings: (1) Generator conditioning is applied via label-conditional Batch Norm (De Vries et al., 2017; Miyato & Koyama, 2018) with no use of labels in the discriminator (G-Cond). (2) Generator conditioning is applied as above, but with projection based conditioning in the discriminator (intuitively, it encourages the discriminator to use label discriminative features to distinguish true/fake samples), as in Miyato & Koyama (2018) (P-cGAN). The former can be considered as a special case of the latter where discriminator conditioning is disabled. For P-cGAN, we use the architectures and hyper-parameter settings of Miyato & Koyama (2018). See the appendix, Section B.3 for details. In both cases, we compare standard label-conditional batch normalization to self-modulation with additional labels, as discussed in Section 2, Equation 3.
The results are shown in Table 4. Again, we observe that the simple incorporation of self-modulation leads to a significant improvement in performance in the considered settings.
Training for longer on imagenet. To demonstrate that self-modulation continues to yield improvement after training for longer, we train on imagenet for k generator steps. Due to the increased computational demand we use a single setting for each of the unconditional and conditional models, following Miyato et al. (2018) and Miyato & Koyama (2018), but using only two discriminator steps per generator step. We expect that the results would continue to improve with longer training. However, currently results from k steps require training for 10 days on a P100 GPU.
We compute the median FID across 3 random seeds. After k steps the baseline unconditional model attains FID , self-modulation attains ( improvement). In the conditional setting self-modulation improves the FID from to (13% improvement). The improvements in IS are from to , and to in unconditional and conditional setting, respectively.
Where to apply self-modulation? Given the robust improvements of the proposed method, an immediate question is where to apply the modulation. We tested two settings: (1) applying modulation to every batch normalization layer, and (2) applying it to a single layer. The results of this ablation are in Figure 2. These results suggest that the benefit of self-modulation is greatest in the last layer, as may be intuitive, but applying it to each layer is most effective.
Related Work
Conditional GANs. Conditioning on side information, such as class labels, has been shown to improve the performance of GANs. Initial proposals were based on concatenating this additional feature with the input vector (Mirza & Osindero, 2014; Radford et al., 2016; Odena et al., 2017). Recent approaches, such as the projection cGAN (Miyato & Koyama, 2018), inject label information into the generator architecture using conditional Batch Norm layers (De Vries et al., 2017). Self-modulation is a simple yet effective complementary addition to this line of work which makes a significant difference when no side information is available. In addition, when side information is available it can be readily applied as discussed in Section 2 and leads to further improvements.
Conditional Modulation. Conditional modulation, using side information to modulate the computation flow in neural networks, is a rich idea which has been applied in various contexts (beyond GANs). In particular, Dumoulin et al. (2017) apply Conditional Instance Normalization (Ulyanov et al., 2016) to image style-transfer. Kim et al. (2017) use Dynamic Layer Normalization (Ba et al., 2016) for adaptive acoustic modelling. Feature-wise Linear Modulation (Perez et al., 2018) generalizes this family of methods by conditioning the Batch Norm scaling and bias factors (which correspond to multiplicative and additive interactions) on general external embedding vectors in supervised learning. The proposed method applies to generators in GANs (unsupervised learning), and it works in both unconditional (without side information) and conditional (with side information) settings.
Multiplicative and Additive Modulation. Existing conditional modulations mentioned above are usually instantiated via Batch Normalization, which includes both multiplicative and additive modulation. These two types of modulation also link to other techniques widely used in the neural network literature. The multiplicative modulation is closely related to Gating, which is adopted in LSTM (Hochreiter & Schmidhuber, 1997), gated PixelCNN (van den Oord et al., 2016), Convolutional Sequence-to-sequence networks (Gehring et al., 2017) and Squeeze-and-excitation Networks (Hu et al., 2018). The additive modulation is closely related to Residual Networks (He et al., 2016). The proposed method adopts both types of modulation.
Discussion
We present a generator modification that improves the performance of most GANs. This technique is simple to implement and can be applied to all popular GANs; we therefore believe that self-modulation is a useful addition to the GAN toolbox.
Our results suggest that self-modulation clearly yields performance gains; however, they do not say how this technique results in better models. Interpretation of deep networks is a complex topic, especially for GANs, where the training process is less well understood. Rather than purely speculate, we compute two diagnostic statistics that were proposed recently to ignite the discussion of the method’s effects.
First, we compute the condition number of the generator’s Jacobian. Odena et al. (2018) provide evidence that better generators have a Jacobian with lower condition number and hence regularize using this quantity. We estimate the generator condition number in the same way as Odena et al. (2018). We compute the Jacobian at each latent input in a minibatch, then average the logarithm of the condition numbers computed from each Jacobian.
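The averaging step can be sketched as follows. This is an illustration under stated assumptions, not the procedure of Odena et al. (2018): it uses NumPy, a toy two-layer generator, a finite-difference Jacobian, and takes the condition number as the ratio of the largest to the smallest singular value. All names are hypothetical.

```python
import numpy as np

def toy_generator(z, W1, W2):
    """Stand-in generator: a small tanh network mapping a latent vector to an output vector."""
    return np.tanh(z @ W1) @ W2

def numerical_jacobian(f, z, eps=1e-5):
    """Central finite-difference Jacobian of f at z, with shape (output_dim, latent_dim)."""
    columns = []
    for i in range(z.size):
        dz = np.zeros_like(z)
        dz[i] = eps
        columns.append((f(z + dz) - f(z - dz)) / (2 * eps))
    return np.stack(columns, axis=1)

def mean_log_condition_number(f, batch):
    """Average over the minibatch of log(sigma_max / sigma_min) of the Jacobian."""
    logs = []
    for z in batch:
        # Singular values are returned in descending order.
        s = np.linalg.svd(numerical_jacobian(f, z), compute_uv=False)
        logs.append(np.log(s[0] / s[-1]))
    return float(np.mean(logs))

rng = np.random.default_rng(0)
W1 = rng.normal(size=(3, 5))
W2 = rng.normal(size=(5, 6))
batch = rng.normal(size=(4, 3))
score = mean_log_condition_number(lambda z: toy_generator(z, W1, W2), batch)
```

For a real generator one would obtain the Jacobian by automatic differentiation; finite differences are used here only to keep the sketch dependency-free beyond NumPy.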
Second, we compute a notion of precision and recall for generative models. Sajjadi et al. (2018) define the quantities, and , for generators. These quantities relate intuitively to the traditional precision and recall metrics for classification. Generating points which have low probability under the true data distribution is interpreted as a loss in precision, and is penalized by the score. Failing to generate points that have high probability under the true data distributions is interpreted as a loss in recall, and is penalized by the score.
Figure 3 shows both statistics. The left-hand plot shows the condition number plotted against FID score for each model. We observe that poor models tend to have large condition numbers; the correlation, although noisy, is always positive. This result corroborates the observations of Odena et al. (2018). However, we notice an inverse trend in the vicinity of the best models. The cluster of the best models with self-modulation has lower FID, but higher condition number, than the best models without self-modulation. Overall the correlation between FID and condition number is smaller for self-modulated models. This is surprising: it appears that rather than unilaterally reducing the condition number, self-modulation provides some training stability, yielding models with a small range of generator condition numbers.
The right-hand plot in Figure 3 shows the and scores. Models in the upper-left quadrant cover true data modes better (higher precision), and models in the lower-right quadrant produce more modes (higher recall). Self-modulated models tend to favor higher recall. This effect is most pronounced on imagenet.
Overall these diagnostics indicate that self-modulation stabilizes the generator towards favorable conditioning values. It also appears to improve mode coverage. However, these metrics are very new; further development of analysis tools and theoretical study is needed to better disentangle the symptoms and causes of the self-modulation technique, and indeed of others.
We would like to thank Ilya Tolstikhin for helpful discussions. We would also like to thank Xiaohua Zhai, Marcin Michalski, Karol Kurach and Anton Raichuk for their help with infrastructure. We also appreciate general discussions with Olivier Bachem, Alexander Kolesnikov, Thomas Unterthiner, and Josip Djolonga. Finally, we are grateful for the support of other members of the Google Brain team.
References
Appendix A Additional results
A.2 FIDs
A.3 Which layer to modulate?
Figure 4 presents the performance when modulating different layers of the generator for each dataset.
A.4 Conditioning and Precision/Recall
Figure 5 presents the generator Jacobian condition number and precision/recall plot for each dataset.
Appendix B Model Architectures
We describe the model structures that are used in our experiments in this section.
The SNDCGAN architecture follows the one used in Miyato et al. (2018). Since the resolution of images in cifar10 differs from that of the other datasets, there are slight differences in spatial dimensions between the two variants of the architecture. The proposed self-modulation replaces the existing BN layers; we term it sBN (self-modulated BN) for short in Tables 7, 8, 9, and 10.
B.2 ResNet Architectures
The ResNet architecture also follows the one used in Miyato et al. (2018). Again, due to the resolution differences, two ResNet architectures are used in this work. The proposed self-modulation replaces the existing BN layers; we term it sBN (self-modulated BN) for short in Tables 11, 12, 13, and 14.
B.3 Conditional GAN Architecture
For the conditional setting with label information available, we adopt the Projection Based Conditional GAN (P-cGAN) (Miyato & Koyama, 2018). Conditioning is applied in both the generator and the discriminator. For the generator, conditional batch norm is applied by conditioning on the label information. More specifically, this can be expressed as follows,
$$h' = \gamma_y \odot \frac{h - \mu}{\sigma} + \beta_y,$$
where $h$ denotes the activations of a layer, $\mu$ and $\sigma^2$ are the batch normalization statistics, and each label $y$ is associated with its own scaling and shifting parameters, $\gamma_y$ and $\beta_y$.
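Label-conditional batch normalization amounts to a table lookup of the scale and shift by class label. The following is a minimal NumPy sketch for flat activations of shape (batch, features); the function name `conditional_bn` and the table layout (one row per class) are illustrative assumptions.

```python
import numpy as np

def conditional_bn(h, labels, gamma_table, beta_table, eps=1e-5):
    """Batch-normalize h, then apply the scale and shift rows selected by each sample's label."""
    mu = h.mean(axis=0, keepdims=True)
    var = h.var(axis=0, keepdims=True)
    h_hat = (h - mu) / np.sqrt(var + eps)
    # Integer-array indexing picks one (features,) row per sample.
    return gamma_table[labels] * h_hat + beta_table[labels]

rng = np.random.default_rng(0)
num_classes, batch, features = 3, 8, 4
h = rng.normal(size=(batch, features))
labels = rng.integers(0, num_classes, size=batch)

# Identity tables reduce the layer to plain normalization.
gamma_table = np.ones((num_classes, features))
beta_table = np.zeros((num_classes, features))
out = conditional_bn(h, labels, gamma_table, beta_table)

# A label-dependent shift moves each sample by its class index.
shift_table = np.arange(num_classes, dtype=float)[:, None] * np.ones((1, features))
out_shifted = conditional_bn(h, labels, gamma_table, shift_table)
```

In a trained model the two tables are learned parameters, one pair per normalization layer.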
We use the same architectures and hyper-parameter settings as in Miyato & Koyama (2018), with one exception: to make it consistent with the previous unconditional settings (and also due to the computation time), instead of running five discriminator steps per generator step, we use only two discriminator steps per generator step. More specifically, the architecture is the same as the ResNet above, and we compare two settings: (1) only generator label conditioning is applied, and there is no projection based conditioning in the discriminator, and (2) both generator and discriminator conditioning are applied, which is the standard full P-cGAN.