Self-Supervised GANs via Auxiliary Rotation Loss

Ting Chen, Xiaohua Zhai, Marvin Ritter, Mario Lucic, Neil Houlsby

Introduction

Generative Adversarial Networks (GANs) are a class of unsupervised generative models. GANs involve training a generator and discriminator model in an adversarial game, such that the generator learns to produce samples from a desired data distribution. Training GANs is challenging because it involves searching for a Nash equilibrium of a non-convex game in a high-dimensional parameter space. In practice, GANs are typically trained using alternating stochastic gradient descent, which is often unstable and lacks theoretical guarantees. Consequently, training may exhibit instability, divergence, cyclic behavior, or mode collapse. As a result, many techniques to stabilize GAN training have been proposed.

A major contributor to training instability is the fact that the generator and discriminator learn in a non-stationary environment. In particular, the discriminator is a classifier for which the distribution of one class (the fake samples) shifts as the generator changes during training. In non-stationary online environments, neural networks forget previous tasks. If the discriminator forgets previous classification boundaries, training may become unstable or cyclic. This issue is usually addressed either by reusing old samples or by applying continual learning techniques.

These issues become more prominent in the context of complex data sets. A key technique in these settings is conditioning, whereby both the generator and discriminator have access to labeled data. Arguably, augmenting the discriminator with supervised information encourages it to learn more stable representations, which opposes catastrophic forgetting. Furthermore, learning the conditional model for each class is easier than learning the joint distribution. The main drawback in this setting is the necessity for labeled data. Even when labeled data is available, it is usually sparse and covers only a limited number of high-level abstractions.

Motivated by the aforementioned challenges, our goal is to show that one can recover the benefits of conditioning, without requiring labeled data. To ensure that the representations learned by the discriminator are more stable and useful, we add an auxiliary, self-supervised loss to the discriminator. This leads to more stable training because the dependence of the discriminator’s representations on the quality of the generator’s output is reduced. We introduce a novel model – the self-supervised GAN – in which the generator and discriminator collaborate on the task of representation learning, and compete on the generative task.

Our contributions We present an unsupervised generative model that combines adversarial training with self-supervised learning. Our model recovers the benefits of conditional GANs, but requires no labeled data. In particular, under the same training conditions, the self-supervised GAN closes the gap in natural image synthesis between unconditional and conditional models. Within this setting the quality of the discriminator’s representations is greatly increased, which might be of separate interest in the context of transfer learning. A large-scale implementation of the model leads to promising results on unconditional imagenet generation, a task considered daunting by the community. We believe that this work is an important step in the direction of high-quality, fully unsupervised, natural image synthesis.

A Key Issue: Discriminator Forgetting

The original value function for GAN training is:
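For reference, the standard minimax value function from the original GAN formulation can be written as below. The notation is an assumption made for this sketch: $D(\bm{x})$ is the discriminator's estimated probability that $\bm{x}$ is real, $P_{\text{data}}$ is the true data distribution, and $P_G$ is the distribution of generator outputs. In this form the discriminator maximizes $V$ and the generator minimizes it.

```latex
\begin{equation}
V(G, D) = \mathbb{E}_{\bm{x} \sim P_{\text{data}}}\left[\log D(\bm{x})\right]
        + \mathbb{E}_{\bm{x} \sim P_{G}}\left[\log\left(1 - D(\bm{x})\right)\right]
\tag{1}
\end{equation}
```

Because the generator's parameters change during training, $P_G$ changes as well, so the discriminator faces a non-stationary online learning problem.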

This challenge has received a great deal of attention and explicit temporal dependencies have been proposed to improve training in this setting. Furthermore, in online learning of non-convex functions, neural networks have been shown to forget previous tasks. In the context of GANs, learning varying levels of detail, structure, and texture can be considered different tasks. For example, if the generator first learns the global structure, the discriminator will naturally try to build a representation which allows it to efficiently penalize the generator based only on the differences in global structure, or the lack of local structure. As such, one source of instability in training is that the discriminator is not incentivised to maintain a useful data representation as long as the current representation is useful to discriminate between the classes.

We demonstrate the impact of discriminator forgetting in two settings: (1) a simple scenario shown in Figure 3(a), and (2) the training of a GAN shown in Figure 2. In the first case a classifier is trained sequentially on 1-vs.-all classification tasks on each of the ten classes in cifar10. It is trained for 1k iterations on each task before switching to the next. At 10k iterations the training cycle repeats from the first task. Figure 3(a) shows substantial forgetting, despite the tasks being similar. Each time the task switches, the classifier accuracy drops substantially. After 10k iterations, the cycle of tasks repeats, and the accuracy is the same as in the first cycle. No useful information is carried across tasks. This demonstrates that the model does not retain generalizable representations in this non-stationary environment. In the second setting, shown in Figure 2, we observe a similar effect during GAN training. Every 100k iterations, the discriminator representations are evaluated on imagenet classification; the full protocol is described in Section 4.4. During training, the classification accuracy of the unconditional GAN increases, then decreases, indicating that information about the classes is acquired and later forgotten. This forgetting correlates with training instability. Adding self-supervision, as detailed in the following section, prevents this forgetting of the classes in the discriminator representations.
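The sequential 1-vs.-all schedule of the first experiment can be sketched as follows. This is a minimal illustration only: the function names `task_at` and `binary_label` are hypothetical, while the task length of 1k iterations and the ten classes are taken from the description above.

```python
def task_at(iteration, iters_per_task=1_000, num_classes=10):
    """Return the index of the 1-vs.-all task active at a given iteration."""
    return (iteration // iters_per_task) % num_classes


def binary_label(class_id, task):
    """1-vs.-all target: 1 if the example belongs to the task's class, else 0."""
    return int(class_id == task)


# Classes 0..9 are visited for 1k iterations each; the cycle restarts at 10k.
schedule = [task_at(i) for i in (0, 999, 1_000, 9_999, 10_000)]
print(schedule)  # [0, 0, 1, 9, 0]
```

Since the binary target changes every time `task_at` changes, a classifier without transferable features has to relearn its decision boundary at each switch, which is the forgetting pattern described for Figure 3(a).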

The Self-Supervised GAN

Motivated by the main challenge of discriminator forgetting, we aim to imbue the discriminator with a mechanism which allows learning useful representations, independently of the quality of the current generator. To this end, we exploit recent advancements in self-supervised approaches for representation learning. The main idea behind self-supervision is to train a model on a pretext task, like predicting the rotation angle or the relative location of an image patch, and then extract representations from the resulting networks. We propose to add a self-supervised task to our discriminator.

In particular, we apply the state-of-the-art self-supervision method based on image rotation. In this method, the images are rotated, and the angle of rotation becomes the artificial label (cf. Figure 1). The self-supervised task is then to predict the angle of rotation of an image. The effect of this additional loss on the image classification task is evident in Figure 3(b): when coupled with the self-supervised loss, the network learns representations that transfer across tasks and the performance continually improves. On the second cycle through the tasks, from 10k iterations onward, performance is improved. Intuitively, this loss encourages the classifier to learn useful image representations to detect the rotation angles, which transfers to the image classification task.
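The construction of the rotation pretext task can be sketched in a few lines of NumPy. This is an illustrative sketch, not the authors' implementation: the function name `make_rotation_batch` is hypothetical, and square images in `(N, H, W, C)` layout are assumed.

```python
import numpy as np


def make_rotation_batch(images):
    """Rotate every image by 0, 90, 180 and 270 degrees.

    images: array of shape (N, H, W, C) with H == W (assumed square so that
            all rotated copies share one shape).
    Returns (rotated, labels): rotated has shape (4N, H, W, C) and labels[i]
    in {0, 1, 2, 3} is the number of quarter turns applied to rotated[i].
    """
    rotated = [np.rot90(images, k=k, axes=(1, 2)) for k in range(4)]
    labels = np.repeat(np.arange(4), len(images))
    return np.concatenate(rotated, axis=0), labels


images = np.arange(2 * 4 * 4 * 3, dtype=np.float32).reshape(2, 4, 4, 3)
rotated, labels = make_rotation_batch(images)
print(rotated.shape, labels.tolist())  # (8, 4, 4, 3) [0, 0, 1, 1, 2, 2, 3, 3]
```

The artificial label is simply the index of the applied rotation, so no human annotation is needed to train the four-way rotation classifier.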

We augment the discriminator with a rotation-based loss which results in the following loss functions:
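A form of the two losses that is consistent with the symbol definitions in the text is sketched below. It assumes the sign convention in which the discriminator maximizes $V(G,D)$ and the generator minimizes it, writes $P_G$ and $P_{\text{data}}$ for the generator and data distributions, and takes $\alpha$ and $\beta$ to be the weights of the rotation term for the generator and the discriminator respectively. If $V$ is defined with the opposite sign, the sign of $V$ flips in both lines.

```latex
\begin{align}
L_G &= V(G, D) - \alpha\, \mathbb{E}_{\bm{x} \sim P_G}\, \mathbb{E}_{r \sim \mathcal{R}}
       \left[\log Q(R = r \mid \bm{x}^{r})\right], \\
L_D &= -V(G, D) - \beta\, \mathbb{E}_{\bm{x} \sim P_{\text{data}}}\, \mathbb{E}_{r \sim \mathcal{R}}
       \left[\log Q(R = r \mid \bm{x}^{r})\right],
\end{align}
```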

where $V(G,D)$ is the value function from Equation 1, and $r\in\mathcal{R}$ is a rotation selected from a set of possible rotations. In this work we use $\mathcal{R}=\{0^{\circ},90^{\circ},180^{\circ},270^{\circ}\}$ as in Gidaris et al. Image $\bm{x}$ rotated by $r$ degrees is denoted as $\bm{x}^{r}$, and $Q(R\mid\bm{x}^{r})$ is the discriminator’s predictive distribution over the angles of rotation of the sample.

In our model, the generator and discriminator are adversarial with respect to the true vs. fake prediction loss, $V(G,D)$; however, they are collaborative with respect to the rotation task. First, consider the value function of the generator, which biases generation towards images whose rotation angle the discriminator can detect once they are rotated. Note that the generator is not conditional but only generates “upright” images, which are subsequently rotated and fed to the discriminator. On the other hand, the discriminator is trained to detect rotation angles based only on the true data. In other words, the parameters of the discriminator are updated only based on the rotation loss on the true data. This prevents the undesirable collaborative solution whereby the generator generates images whose subsequent rotation is easy to detect. As a result, the generator is encouraged to generate images that are rotation-detectable because they share features with real images that are used for rotation classification.
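The asymmetry described above (rotation loss on real images for the discriminator, on generated images for the generator) can be made concrete with a small NumPy sketch. The function names and argument layout are hypothetical, and the adversarial terms are passed in as precomputed scalars so that the sketch does not depend on a particular form of the true vs. fake loss.

```python
import numpy as np


def rotation_cross_entropy(logits, labels):
    """Mean negative log-likelihood of the true rotation label.

    logits: (N, 4) unnormalized scores from the rotation head.
    labels: (N,) integers in {0, 1, 2, 3}.
    """
    shifted = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()


def ss_gan_losses(d_adv_loss, g_adv_loss,
                  rot_logits_real, rot_labels_real,
                  rot_logits_fake, rot_labels_fake,
                  alpha=0.2, beta=1.0):
    """Add the rotation terms to the adversarial losses.

    The discriminator's rotation term uses rotated real images only; the
    generator's rotation term uses rotated generated images only.
    """
    d_loss = d_adv_loss + beta * rotation_cross_entropy(rot_logits_real, rot_labels_real)
    g_loss = g_adv_loss + alpha * rotation_cross_entropy(rot_logits_fake, rot_labels_fake)
    return d_loss, g_loss


# With uninformative (all-zero) logits, each rotation term equals log(4).
zeros = np.zeros((8, 4))
labels = np.tile(np.arange(4), 2)
d_loss, g_loss = ss_gan_losses(0.0, 0.0, zeros, labels, zeros, labels)
```

In a real training loop the gradient of `d_loss` would update only the discriminator and the gradient of `g_loss` only the generator; that bookkeeping is omitted here.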

Experiments

We demonstrate empirically that (1) self-supervision improves the representation quality with respect to baseline GAN models, and that (2) it leads to improved unconditional generation for complex datasets, matching the performance of conditional GANs, under equal training conditions.

Datasets We focus primarily on imagenet, the largest and most diverse image dataset commonly used to evaluate GANs. Until now, most GANs trained on imagenet are conditional. imagenet contains 1.3M training images and 50k test images. We resize the images to $128\times 128\times 3$ as done in Miyato and Koyama and Zhang et al. We provide additional comparison on three smaller datasets, namely cifar10, celeba-hq, and lsun-bedroom, for which unconditional GANs can be successfully trained. The lsun-bedroom dataset contains 3M images. We partition these randomly into a test set containing approximately 30k images and a train set containing the rest. celeba-hq contains 30k images. We use the $128\times 128\times 3$ version obtained by running the code provided by the authors (https://github.com/tkarras/progressive_growing_of_gans). We use 3k examples as the test set and the remaining examples as the training set. cifar10 contains 70k images ($32\times 32\times 3$), partitioned into 60k training instances and 10k test instances.

We compare the self-supervised GAN (SS-GAN) to two well-performing baseline models, namely (1) the unconditional GAN with spectral normalization proposed in Miyato et al., denoted Uncond-GAN, and (2) the conditional GAN using the label-conditioning strategy and the Projection Conditional GAN (Cond-GAN). We chose the latter as it was shown to outperform the AC-GAN, and is adopted by the best performing conditional GANs.

We use ResNet architectures for the generator and discriminator as in Miyato et al. For the conditional generator in Cond-GAN, we apply label-conditional batch normalization. In contrast, SS-GAN does not use conditional batch normalization. However, to have a similar effect on the generator, we consider a variant of SS-GAN where we apply the self-modulated batch normalization, which does not require labels, and denote it SS-GAN (sBN). We note that labels are available only for cifar10 and imagenet, so Cond-GAN is only applied on those data sets.

We use a batch size of 64, and to implement the rotation loss we rotate 16 images in the batch in all four considered directions. We do not add any new images into the batch to compute the rotation loss. For the true vs. fake task we use the hinge loss from Miyato et al. We set $\beta=1$ for the self-supervised loss. For $\alpha$ we performed a small sweep $\alpha\in\{0.2,0.5,1\}$, and select $\alpha=0.2$ for all datasets (see the appendix for details). For all other hyperparameters, we use the values in Miyato et al. and Miyato and Koyama. We train cifar10, lsun-bedroom and celeba-hq for 100k steps on a single P100 GPU. For imagenet we train for 1M steps. For all datasets we use the Adam optimizer with learning rate 0.0002.
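The hinge formulation of the true vs. fake loss is commonly written as in the sketch below. It assumes the discriminator outputs unbounded real-valued scores; the function names are illustrative and not taken from the authors' code.

```python
import numpy as np


def hinge_d_loss(real_scores, fake_scores):
    """Discriminator hinge loss: push real scores above +1 and fake scores below -1."""
    real_term = np.mean(np.maximum(0.0, 1.0 - real_scores))
    fake_term = np.mean(np.maximum(0.0, 1.0 + fake_scores))
    return real_term + fake_term


def hinge_g_loss(fake_scores):
    """Generator hinge loss: raise the discriminator's score on generated samples."""
    return -np.mean(fake_scores)


real = np.array([2.0, 0.5])
fake = np.array([-2.0, 0.0])
print(hinge_d_loss(real, fake), hinge_g_loss(fake))  # 0.75 1.0
```

Samples that are already classified with a margin of at least 1 contribute nothing to the discriminator loss, so only the harder examples drive its updates.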

4.2 Comparison of Sample Quality

Results Figure 4 shows FID training curves on cifar10 and imagenet. Table 1 shows the FID of the best run across three random seeds for each dataset and model combination. The unconditional GAN is unstable on imagenet and the training often diverges. The conditional counterpart outperforms it substantially. The proposed method, namely SS-GAN, is stable on imagenet, and performs substantially better than the unconditional GAN. When equipped with self-modulation it matches the performance of the conditional GAN. In terms of mean performance (Figure 4) the proposed approach matches the conditional GAN, and in terms of the best models selected across random seeds (Table 1), the performance gap is within 5%. On cifar10 and lsun-bedroom we observe a substantial improvement over the unconditional GAN, with performance matching that of the conditional GAN. Self-supervision appears not to significantly improve the results on celeba-hq. We posit that this is due to the low diversity of celeba-hq, for which the rotation task is also less informative.

Robustness across hyperparameters GANs are fragile; changes to the hyperparameter settings have a substantial impact on their performance. Therefore, we evaluate different hyperparameter settings to test the stability of SS-GAN. We consider two classes of hyperparameters. First, those controlling the Lipschitz constant of the discriminator, a central quantity analyzed in the GAN literature. We evaluate two state-of-the-art techniques: gradient penalty and spectral normalization. The gradient penalty introduces a regularization strength parameter, $\lambda$. We test two values, $\lambda\in\{1,10\}$. Second, we vary the hyperparameters of the Adam optimizer. We test two popular settings of $(\beta_{1},\beta_{2})$: $(0.5,0.999)$ and $(0,0.9)$. Previous studies find that multiple discriminator steps per generator step help training, so we try both 1 and 2 discriminator steps per generator step.
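The settings listed above can be enumerated as a small grid. The sketch below assumes a full cross product of the three factors and uses hypothetical dictionary keys; whether every combination was actually run is not stated in the text.

```python
from itertools import product

# Lipschitz control: gradient penalty with lambda in {1, 10}, or spectral normalization.
regularizers = [("gradient_penalty", 1), ("gradient_penalty", 10), ("spectral_norm", None)]
# Adam (beta1, beta2) settings.
adam_betas = [(0.5, 0.999), (0.0, 0.9)]
# Discriminator steps per generator step.
disc_steps = [1, 2]

grid = [
    {"regularizer": reg, "lambda": lam, "beta1": b1, "beta2": b2, "disc_steps": n}
    for (reg, lam), (b1, b2), n in product(regularizers, adam_betas, disc_steps)
]
print(len(grid))  # 12
```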

Table 2 compares the mean FID scores of the unconditional models across penalties and optimization hyperparameters. We observe that the proposed approach yields consistent performance improvements. We observe that in settings where the unconditional GAN collapses (yielding FIDs larger than 100) the self-supervised model does not exhibit such a collapse.

4.3 Large Scale Self-Supervised GAN

We scale up training the SS-GAN to attain the best possible FID for unconditional imagenet generation. To do this, we increase the model’s capacity to match a reference model from prior work; the details can be found at https://github.com/google/compare_gan. We train the model on 128 cores of a Google TPU v3 Pod for 500k steps using a batch size of 2048. For comparison, we also train the same model without the auxiliary self-supervised loss (Uncond-GAN). We report the FID at 50k to be comparable to other literature reporting results on imagenet. We repeat each run three times with different random seeds.

For SS-GAN we obtain an FID of $23.6\pm 0.1$, compared with $71.6\pm 66.3$ for Uncond-GAN. Self-supervision stabilizes training; the mean and variance across random seeds are greatly reduced because, unlike the regular unconditional GAN, SS-GAN never collapsed. We observe improvement in the best model across random seeds, and the best SS-GAN attains an FID of 23.4. To our knowledge, this is the best result attained when training unconditionally on imagenet.

4.4 Representation Quality

We test empirically whether self-supervision encourages the discriminator to learn meaningful representations. For this, we compare the quality of the representations extracted from the intermediate layers of the discriminator’s ResNet architecture. We apply a common evaluation method for representation learning, proposed in Zhang et al. In particular, we train a logistic regression classifier on the feature maps from each ResNet block to perform the 1000-way classification task on imagenet or the 10-way task on cifar10, and report top-1 classification accuracy.
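The linear evaluation protocol can be illustrated with a small NumPy sketch: a multinomial logistic regression is fit on frozen features and scored by top-1 accuracy. Everything here is an assumption made for illustration (plain gradient descent, the function names, and the two-class toy features standing in for flattened ResNet feature maps); the classifier settings actually used are the ones given in the appendix.

```python
import numpy as np


def fit_linear_probe(features, labels, num_classes, lr=0.5, steps=300):
    """Fit multinomial logistic regression on frozen features by gradient descent."""
    n, d = features.shape
    weights = np.zeros((d, num_classes))
    bias = np.zeros(num_classes)
    one_hot = np.eye(num_classes)[labels]
    for _ in range(steps):
        logits = features @ weights + bias
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        probs = np.exp(logits)
        probs /= probs.sum(axis=1, keepdims=True)
        grad = (probs - one_hot) / n  # gradient of the mean cross-entropy w.r.t. logits
        weights -= lr * features.T @ grad
        bias -= lr * grad.sum(axis=0)
    return weights, bias


def top1_accuracy(features, labels, weights, bias):
    """Fraction of examples whose highest-scoring class is the true class."""
    predictions = np.argmax(features @ weights + bias, axis=1)
    return float(np.mean(predictions == labels))


# Toy stand-in for extracted features: two well-separated clusters.
rng = np.random.default_rng(0)
labels = np.repeat([0, 1], 50)
features = rng.normal(scale=0.5, size=(100, 2)) + np.where(labels[:, None] == 0, -3.0, 3.0)
w, b = fit_linear_probe(features, labels, num_classes=2)
acc = top1_accuracy(features, labels, w, b)
```

Because only the linear classifier is trained, the resulting accuracy reflects how linearly separable the classes are in the frozen representation.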

We report results using the Cond-GAN, Uncond-GAN, and SS-GAN models. We also ablate the adversarial loss from our SS-GAN which results in a purely rotation-based self-supervised model (Rot-only) which uses the same architecture and hyperparameters as the SS-GAN discriminator. We report the mean accuracy and standard deviation across three independent models with different random seeds. Training details for the logistic classifier are in the appendix.

Table 4 shows the quality of the representations after 1M training steps on imagenet. Figure 9 shows the learning curves for representation quality of the final ResNet block on imagenet. The curves for the other blocks are provided in the appendix. Note that “training steps” refers to the training iterations of the original GAN, and not to the linear classifier, which is always trained to convergence. Overall, the SS-GAN yields the best representations across all blocks and training iterations. We observe similar results on cifar10, provided in Table 3.

In detail, the imagenet ResNet contains six blocks. For Uncond-GAN and Rot-only, Block 3 performs best; for Cond-GAN and SS-GAN, the final Block 5 performs best. The representation quality for Uncond-GAN drops at 500k steps, which is consistent with the FID drop in Figure 4. Overall, the SS-GAN and Cond-GAN representations are better than those of Uncond-GAN, which correlates with their improved sample quality. Surprisingly, the SS-GAN overtakes Cond-GAN after training for 300k steps. One possibility is that the Cond-GAN is overfitting the training data. We inspect the representation performance of Cond-GAN on the training set and indeed see a very large generalization gap, which indicates overfitting.

When we ablate the GAN loss, leaving just the rotation loss, the representation quality substantially decreases. It seems that the adversarial and rotation losses complement each other both in terms of FID and representation quality. We emphasize that our discriminator architecture is optimized for image generation, not representation quality. Rot-only, therefore, is an ablation method, and is not a state-of-the-art self-supervised learning algorithm. We discuss these next.

Table 5 compares the representation quality of SS-GAN to state-of-the-art published self-supervised learning algorithms. Despite the architecture and hyperparameters being optimized for image quality, the SS-GAN model achieves competitive results on imagenet. Among those methods, only BiGAN also uses a GAN to learn representations, but SS-GAN performs substantially (0.073 accuracy points) better. BiGAN learns the representation with an additional encoder network, while SS-GAN is arguably simpler because it extracts the representation directly from the discriminator. The best performing method is the recent DeepClustering algorithm. This method is just 0.027 accuracy points ahead of SS-GAN and requires expensive offline clustering after every training epoch.

In summary, the representation quality evaluation highlights the correlation between representation quality and image quality. It also confirms that the SS-GAN does learn relatively powerful image representations.

Related Work

Catastrophic forgetting was previously considered as a major cause for GAN training instability. The main remedy suggested in the literature is to introduce temporal memory into the training algorithm in various ways. For example, Grnarova et al. induce discriminator memory by replaying previously generated images. An alternative is to instead reuse previous models: Salimans et al. introduce checkpoint averaging, where a running average of the parameters of each player is kept, and Grnarova et al. maintain a queue of models that are used at each training iteration. Kim et al. add memory to retain information about previous samples. Other papers frame GAN training as a continual learning task. Thanh-Tung et al. study catastrophic forgetting in the discriminator and mode collapse, relating these to training instability. Liang et al. counter discriminator forgetting by leveraging techniques from continual learning directly (Elastic Weight Consolidation and Intelligent Synapses).

Conditional GANs Conditional GANs are currently the best approach for generative modeling of complex data sets, such as ImageNet. The AC-GAN was the first model to introduce an auxiliary classification loss for the discriminator. The main difference between AC-GAN and the proposed approach is that the self-supervised GAN requires no labels. Furthermore, the AC-GAN generator generates images conditioned on the class, whereas our generator is unconditional and the images are subsequently rotated to produce the artificial label. Finally, the self-supervision loss for the discriminator is applied only over real images, whereas the AC-GAN uses both real and fake.

More recently, the P-cGAN model proposed by Miyato and Koyama includes one real/fake head per class. This architecture improves performance over AC-GAN. The best performing GANs trained on GPUs and TPUs use P-cGAN style conditioning in the discriminator. We note that conditional GANs also use labels in the generator, either by concatenating with the latent vector, or via FiLM modulation.

Self-supervised learning Self-supervised learning is a family of methods that learn high-level semantic representations by solving a surrogate task. It has been widely used in the video domain, the robotics domain and the image domain. We focus on the image domain in this paper. Gidaris et al. proposed to rotate the image and predict the rotation angle. This conceptually simple task yields useful representations for downstream image classification tasks. Apart from trying to predict the rotation, one can also make edits to the given image and ask the network to predict the edited part. For example, the network can be trained to solve a context prediction problem, like the relative location of disjoint patches or the patch permutation of a jigsaw puzzle. Other surrogate tasks include image inpainting, predicting the color channels from a grayscale image, and predicting the unsupervised clustering classes. Recently, Kolesnikov et al. conducted a study on self-supervised learning with modern neural architectures.

Conclusions and Future Work

Motivated by the desire to counter discriminator forgetting, we propose a deep generative model that combines adversarial and self-supervised learning. The resulting novel model, namely the self-supervised GAN, when combined with the recently introduced self-modulation, can match equivalent conditional GANs on the task of image synthesis, without having access to labeled data. We then show that this model can be scaled to attain an FID of 23.4 on unconditional ImageNet generation, which is an extremely challenging task.

This line of work opens several avenues for future research. First, it would be interesting to use a state-of-the-art self-supervised architecture for the discriminator, and optimize for best possible representations. Second, the self-supervised GAN could be used in a semi-supervised setting where a small number of labels could be used to fine-tune the model. Finally, one may exploit several recently introduced techniques, such as self-attention, orthogonal normalization and regularization, and sampling truncation, to yield even better performance in unconditional image synthesis.

We hope that this approach, combining collaborative self-supervision with adversarial training, can pave the way towards high quality, fully unsupervised, generative modeling of complex data.

We would also like to thank Marcin Michalski, Karol Kurach and Anton Raichuk for their help with infrastructure, and major contributions to the Compare GAN library. We appreciate useful discussions with Ilya Tolstikhin, Olivier Bachem, Alexander Kolesnikov, Josip Djolonga, and Tiansheng Yao. Finally, we are grateful for the support of other members of the Google Brain team, Zürich.

Appendix A FID Metric Details

We compute the FID score following a previously published protocol. The image embeddings are extracted from an Inception V1 network provided by the TF library. We use the layer “pool_3”. We fit the multivariate Gaussians used to compute the metric to real samples from the test sets and to fake samples. We use 3000 samples for celeba-hq and 10000 for the other datasets.
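The final step of the metric, the Fréchet distance between the two fitted Gaussians, can be sketched as below. This is a minimal NumPy illustration with hypothetical function names; it takes precomputed embeddings as input and does not cover the embedding network or any library-specific details.

```python
import numpy as np


def frechet_distance(mu1, sigma1, mu2, sigma2):
    """Frechet distance between N(mu1, sigma1) and N(mu2, sigma2).

    d^2 = ||mu1 - mu2||^2 + Tr(sigma1) + Tr(sigma2) - 2 Tr((sigma1 sigma2)^(1/2))
    """
    diff = mu1 - mu2
    # Tr((sigma1 sigma2)^(1/2)) equals the sum of the square roots of the
    # eigenvalues of sigma1 @ sigma2, which are real and non-negative for
    # positive semi-definite inputs; tiny negative or imaginary parts are
    # numerical noise and are clipped or dropped.
    eigvals = np.linalg.eigvals(sigma1 @ sigma2)
    trace_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    return float(diff @ diff + np.trace(sigma1) + np.trace(sigma2) - 2.0 * trace_sqrt)


def fid_from_embeddings(real_emb, fake_emb):
    """Fit a Gaussian to each (num_samples, dim) embedding set and compare them."""
    mu_r, mu_f = real_emb.mean(axis=0), fake_emb.mean(axis=0)
    sigma_r = np.cov(real_emb, rowvar=False)
    sigma_f = np.cov(fake_emb, rowvar=False)
    return frechet_distance(mu_r, sigma_r, mu_f, sigma_f)


# Two unit-covariance Gaussians whose means differ by (3, 4): distance is 25.
print(frechet_distance(np.zeros(2), np.eye(2), np.array([3.0, 4.0]), np.eye(2)))  # 25.0
```

Lower values indicate that the generated embeddings are statistically closer to the real ones, and identical distributions give a distance of zero.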

Appendix B SS-GAN Hyper-parameters

We compare different choices of $\alpha$, while fixing $\beta=1$ for simplicity. A reasonable value of $\alpha$ helps the generator to train using the self-supervision task; however, an inappropriate value of $\alpha$ could bias the convergence point of the generator. Table 7 shows the effect of $\alpha$. Among the values compared, the optimal $\alpha$ is 1 for cifar10, and 0.2 for imagenet. In our main experiments, we used $\alpha=0.2$ for all datasets.

Appendix C Representation Quality

C.2 Additional Results

Table 6 shows the top-1 accuracy on cifar10 with standard deviations. The results are stable on cifar10, as all standard deviations are within 0.01. Table 7 shows the top-1 accuracy on imagenet with standard deviations. Uncond-GAN representation quality shows large variance, as we observe that the unconditional GAN collapses in some cases.

Figure 8 shows the representation quality on all 4 blocks on the cifar10 dataset. SS-GAN consistently outperforms other models on all 4 blocks. Figure 9 shows the representation quality on all 6 blocks on the imagenet dataset. We observe that all methods perform similarly before 500k steps on block0, which contains low-level features. Going from block0 to block5, the conditional GAN and SS-GAN achieve much better representation results. The conditional GAN benefits from the supervised labels in layers closer to the classification head. However, the unconditional GAN attains a worse result at the last layer, and the rotation-only model shows decreasing quality with more training steps. When combining the self-supervised loss and the adversarial loss, SS-GAN representation quality becomes stable and outperforms the other models.

Figure 10 and Figure 11 show the correlation between top-1 accuracy and FID score. We report the FID and top-1 accuracy from training steps 10k to 100k on cifar10, and 100k to 1M on imagenet. We evaluate $10\times 3$ models in total, where 10 is the number of training steps at which we evaluate and 3 is the number of random seeds for each run. The collapsed models with FID score larger than 100 are removed from the plot. Overall, the representation quality and the FID score are correlated for all methods on the cifar10 dataset. On imagenet, only SS-GAN gets better representation quality with better sample quality on block4 and block5.