Generative Adversarial Networks

Gilad Cohen, Raja Giryes

Introduction to GANs

Generative adversarial networks (GANs) are currently the leading method for learning the distribution of a given dataset and generating new examples from it. To begin to understand the concept of GANs, consider an interplay between the police and money counterfeiters. A state has just launched its new currency, and millions of bills and coins are in circulation all over the country. Recently, the police detected a flood of counterfeit money. Further inspection reveals that the forged bills are lighter than the genuine bills and can be easily filtered out. The criminals then find out about the police’s discovery and use new printers that control the weight of the bills better. In turn, the police investigate and discover that the new counterfeit bills have a different texture near the corners, and remove them from the system. After many iterations of generating and discriminating forged money, the counterfeit money becomes almost indistinguishable from the real money. Note that the system has two agents: 1) the counterfeiters, who create “close to real” money, and 2) the police, who detect counterfeit bills.

Back to deep learning: the above two agents are called the “generator” and the “discriminator”, respectively. These two entities are trained jointly: the generator learns to fool the discriminator with new fake examples that imitate the dataset distribution, and the discriminator learns to distinguish between real and fake data samples. The GAN architecture has been used in more and more applications since its introduction in 2014. It has proved successful in many domains such as computer vision Dziugaite et al. (2015); Karras et al. (2018); Hussein et al. (2020); Ledig et al. (2017), semantic segmentation Luc et al. (2016); Isola et al. (2016); Wang et al. (2018); Hoffman et al. (2018), time-series synthesis Brophy et al. (2019); Hartmann et al. (2018), image editing Rott Shaham et al. (2019); Lample et al. (2017); Gal et al. (2021b); Abdal et al. (2021); Xia et al. (2021), natural language processing Fedus et al. (2018); Jetchev et al. (2016); Guo et al. (2018), text-to-image generation Ramesh et al. (2021); Radford et al. (2021); Patashnik et al. (2021), and many more. In the next section, we depict the basic GAN architecture and loss. Later, we present more sophisticated architectures, losses, and common usages.

The basic GAN concept

The first GAN, introduced by Goodfellow et al. (2014), is depicted in Fig. 1.

The architecture of GANs comprises two individual components: a discriminator ($D$) and a generator ($G$). $D$ is trained to distinguish between real images from the natural distribution and generated images, while $G$ is trained to craft fake images that fool the discriminator. A random vector, $\textbf{z}\sim p_{\textbf{z}}$, is given as input to $G$. The purpose of GANs is to learn a distribution of generated samples, $G(\textbf{z})\sim p_{g}$, that estimates the real-world distribution $p_{r}$. GANs are optimized by solving the following min-max optimization problem:

$$\min_{G}\max_{D} V(G,D) = \mathbb{E}_{\textbf{x}\sim p_{r}}\big[\log D(\textbf{x})\big] + \mathbb{E}_{\textbf{z}\sim p_{\textbf{z}}}\big[\log\big(1-D(G(\textbf{z}))\big)\big] \tag{1}$$

On the one hand, $D$ aims at predicting $D(\textbf{x})=1$ for real data samples and $D(G(\textbf{z}))=0$ for fake samples. On the other hand, $G$ learns to fool $D$ by minimizing the second term in Eq. (1).

On the first iteration, only the discriminator weights $\theta_{D}$ are updated. We sample a minibatch of $m$ noise samples $\{\textbf{z}^{(1)},...,\textbf{z}^{(m)}\}$ from $p_{\textbf{z}}$ and a minibatch of $m$ real data examples $\{\textbf{x}^{(1)},...,\textbf{x}^{(m)}\}$ from $p_{r}$. We then calculate the discriminator’s gradients

$$\nabla_{\theta_{D}}\frac{1}{m}\sum_{i=1}^{m}\Big[\log D(\textbf{x}^{(i)}) + \log\big(1-D(G(\textbf{z}^{(i)}))\big)\Big]$$

and update the discriminator weights, $\theta_{D}$, by ascending this term. On the second iteration, only the generator’s weights $\theta_{G}$ are updated. We sample a minibatch of $m$ noise samples $\{\textbf{z}^{(1)},...,\textbf{z}^{(m)}\}$ from $p_{\textbf{z}}$, and calculate the generator’s gradients

$$\nabla_{\theta_{G}}\frac{1}{m}\sum_{i=1}^{m}\log\big(1-D(G(\textbf{z}^{(i)}))\big) \tag{2}$$

and update the generator weights, $\theta_{G}$, by descending this term. Goodfellow et al. (2014) showed that under certain conditions on $D$, $G$, and the training procedure, the generated distribution $p_{g}$ converges to $p_{r}$.
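The two minibatch quantities in the alternating updates can be sketched in plain Python. This is only an illustration under stated assumptions: the function names are hypothetical, the discriminator outputs are made-up probabilities rather than the outputs of a trained network, and no gradient step is performed.

```python
import math

def discriminator_objective(d_real, d_fake):
    """Minibatch estimate of the term that D ascends.

    d_real holds D(x) for m real samples, d_fake holds D(G(z)) for m
    generated samples; all values are probabilities in (0, 1).
    """
    m = len(d_real)
    return sum(math.log(r) + math.log(1.0 - f)
               for r, f in zip(d_real, d_fake)) / m

def generator_loss(d_fake):
    """Minibatch estimate of the term that G descends: mean log(1 - D(G(z)))."""
    return sum(math.log(1.0 - f) for f in d_fake) / len(d_fake)

# Hypothetical discriminator outputs for a minibatch of m = 3 samples.
confident_d = discriminator_objective([0.9, 0.8, 0.95], [0.1, 0.2, 0.05])
confused_d = discriminator_objective([0.5, 0.5, 0.5], [0.5, 0.5, 0.5])

# The generator loss drops when D assigns higher probability to fakes.
loss_when_caught = generator_loss([0.1, 0.1, 0.1])
loss_when_fooling = generator_loss([0.5, 0.5, 0.5])
```

A discriminator that separates the two minibatches well attains a higher objective than one that outputs 0.5 everywhere, and the generator loss decreases as the fakes become more convincing.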

GAN Advantages and Problems

Since their introduction in 2014, GANs have attracted growing interest across academia and industry, thanks to several advantages over other generative models (mainly Variational Auto-encoders (VAEs) Kingma and Welling (2013)):

Sharp images: GANs produce sharper images than other generative models. The images at the output of the Generator look more natural and with better quality than images generated using VAEs, which tend to be blurrier.

Configurable size: The latent random variable size is not restricted, enriching the generator search space.

Versatile generator: The GAN framework can support many different generator networks, unlike other generative models that may have architectural constraints. VAEs, for example, enforce using a Gaussian at the generator’s first layer.

The above advantages make GANs very attractive in the deep learning community, achieving state-of-the-art results in a variety of domains, and generating very natural images. Yet, the original GAN suffers from three major problems that are described in detail below. We first present a short summary of them:

Mode collapse: During the synchronized training of the generator and discriminator, the generator tends to learn to produce a specific pattern (mode) which fools the discriminator. Although this pattern minimizes Eq. (1), the generator does not cover the full distribution of the dataset.

Vanishing gradients: Very frequently the discriminator is trained “too well” to distinguish real images from generated images; in this scenario, the training step of the generator backpropagates very small gradients, which do not help the generator learn.

Instability: The model parameters ($\theta_{D}$ or $\theta_{G}$) fluctuate and are generally not stable during training. The generator seldom reaches a point where it outputs very high-quality images.

1 Mode collapse

Real data distributions are typically multi-modal: the samples cluster into distinct groups (modes), and each sample usually belongs to only one of them. For example, in MNIST there are ten classes of digits (or modes) labeled from ’0’ to ’9’. Mode collapse is the phenomenon where the generator yields only a small subset of the possible modes. Fig. 2 shows MNIST images generated by two different GANs. The top row shows the training of a “good” GAN that does not suffer from mode collapse. It generates every kind of mode (digit type) throughout the training. The bottom row exhibits GAN training with mode collapse, generating just the digit ’6’.

2 Vanishing gradients

GANs often suffer from training instability, where $D$ performs very well and $G$ does not get a chance to learn a good distribution. We now provide some mathematical observations that explain why training a generator $G$ is extremely hard when the discriminator $D$ is close to optimal.

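The global optimality, stated in Goodfellow et al. (2014), is defined when $D$ is optimized for any given $G$. The optimal $D$ is obtained by setting the derivative of the objective in Eq. (1) with respect to $D(\textbf{x})$ to zero:

$$D^{*}(\textbf{x}) = \frac{p_{r}(\textbf{x})}{p_{r}(\textbf{x})+p_{g}(\textbf{x})}$$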

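where $\textbf{x}$ is the input data, $D^{*}(\textbf{x})$ is the optimal discriminator, and $p_{r}(\textbf{x})$ and $p_{g}(\textbf{x})$ are the densities of the real and generated data, respectively, evaluated at $\textbf{x}$. If we substitute the optimal discriminator $D^{*}(\textbf{x})$ into Eq. (1), we obtain the loss seen by the generator $G$:

$$L(G,D^{*}) = \int_{\textbf{x}}\left(p_{r}(\textbf{x})\log\frac{p_{r}(\textbf{x})}{p_{r}(\textbf{x})+p_{g}(\textbf{x})} + p_{g}(\textbf{x})\log\frac{p_{g}(\textbf{x})}{p_{r}(\textbf{x})+p_{g}(\textbf{x})}\right)d\textbf{x} \tag{4}$$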

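Before we continue any further, we mention two important measures of the discrepancy between probability distributions. The first one is the Kullback-Leibler (KL) divergence:

$$KL(p_{1}||p_{2}) = \int_{\textbf{x}} p_{1}(\textbf{x})\log\frac{p_{1}(\textbf{x})}{p_{2}(\textbf{x})}\,d\textbf{x}$$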

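which measures how much the distribution $p_{2}$ differs from the distribution $p_{1}$. Note that this measure is not symmetric, i.e., $KL(p_{1}||p_{2})\neq KL(p_{2}||p_{1})$ in general. A symmetric measure is the Jensen-Shannon (JS) divergence, defined as:

$$JS(p_{1}||p_{2}) = \frac{1}{2}KL\Big(p_{1}\,\Big|\Big|\,\frac{p_{1}+p_{2}}{2}\Big) + \frac{1}{2}KL\Big(p_{2}\,\Big|\Big|\,\frac{p_{1}+p_{2}}{2}\Big)$$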

Back to our GAN with the optimal $D$, Eq. (4) shows that the GAN loss with the optimal discriminator, $L(G,D^{*})$, can be reformulated as:

$$L(G,D^{*}) = 2\,JS(p_{r}||p_{g}) - 2\log 2$$

which shows that for $D^{*}$, minimizing the generator loss amounts to minimizing the JS divergence between $p_{r}$ and $p_{g}$. The relation between GAN training and the JS divergence may explain its instability. To understand this, consider Fig. 3, which shows an example of JS divergences between different distributions. In Fig. 3(a) we see the real image distribution $p_{r}$, a Gaussian with zero mean, and we consider three examples of generated image distributions: $p_{g1}$, $p_{g2}$, and $p_{g3}$. Fig. 3(b) plots $JS(p_{r}(\textbf{x}),p_{q}(\textbf{x}))$ between $p_{r}$ and some distribution $p_{q}$, where the mean of $p_{q}$ ranges from $0$ to $80$. As shown in the red box, the gradient of the JS divergence vanishes once the mean exceeds 30. In other words, when the discriminator is close to optimal ($D^{*}(\textbf{x})$) and we try to train a “poor” generator with $p_{g}$ far from the real distribution $p_{r}$, training is not feasible due to extremely small gradients. This is especially prominent at the beginning of training, when the weights of $G$ are randomly initialized.
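The relation between the optimal-discriminator loss and the JS divergence can be checked numerically on small discrete distributions. The sketch below is illustrative only: the probability vectors are arbitrary choices, and the helper names are hypothetical.

```python
import math

def kl(p, q):
    """KL(p || q) for discrete distributions; q must be positive where p is."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js(p, q):
    """Jensen-Shannon divergence via the mixture (p + q) / 2."""
    mix = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, mix) + 0.5 * kl(q, mix)

def loss_with_optimal_d(p_r, p_g):
    """GAN objective evaluated at D*(x) = p_r(x) / (p_r(x) + p_g(x))."""
    total = 0.0
    for r, g in zip(p_r, p_g):
        if r > 0:
            total += r * math.log(r / (r + g))
        if g > 0:
            total += g * math.log(g / (r + g))
    return total

p_r = [0.1, 0.4, 0.4, 0.1]
p_g = [0.4, 0.3, 0.2, 0.1]

forward, backward = kl(p_r, p_g), kl(p_g, p_r)           # KL is asymmetric
identity_gap = loss_with_optimal_d(p_r, p_g) - (2 * js(p_r, p_g) - 2 * math.log(2))

# Two pairs with disjoint supports, one pair "closer" than the other:
js_near = js([0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0])
js_far = js([1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 1.0])
```

Both disjoint pairs give the same value, $\log 2$, no matter how far apart the supports are. This is the flat region of the JS curve: moving $p_{g}$ slightly closer to $p_{r}$ does not change the loss, so the generator receives no useful gradient.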

Due to the vanishing gradient problem, Goodfellow et al. (2014) proposed a change to the original adversarial loss. Instead of minimizing $\log(1-D(G(\textbf{z})))$ (as in Eq. (2)), they suggest maximizing $\log(D(G(\textbf{z})))$. The latter cost function yields the same fixed point of the $D$ and $G$ dynamics but maintains larger gradients early in training, when the distributions $p_{r}$ and $p_{g}$ are far from each other. On the other hand, this training strategy promotes mode collapse, as we show next.
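The difference in gradient magnitude between the two generator losses can be seen with a one-line calculation per loss. The sketch assumes that the discriminator ends in a sigmoid, $D(G(\textbf{z}))=\sigma(a)$, and looks at the derivative with respect to the logit $a$; the value $a=-6$ is an arbitrary stand-in for a discriminator that confidently rejects a generated sample.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def grad_saturating(a):
    """d/da log(1 - sigmoid(a)) = -sigmoid(a)."""
    return -sigmoid(a)

def grad_non_saturating(a):
    """d/da [-log(sigmoid(a))] = -(1 - sigmoid(a))."""
    return -(1.0 - sigmoid(a))

# Early in training D easily rejects fakes, so the logit is very negative.
a = -6.0
g_sat = grad_saturating(a)        # close to 0: almost no learning signal
g_non = grad_non_saturating(a)    # close to -1: a strong learning signal
```

When $D(G(\textbf{z}))$ is near zero, the original loss passes back a gradient of about the same size as $D(G(\textbf{z}))$ itself, while the alternative loss passes back a gradient with magnitude close to one.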

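With an optimal discriminator $D^{*}$, $KL(p_{g}||p_{r})$ can be reformulated as:

$$KL(p_{g}||p_{r}) = \mathbb{E}_{\textbf{x}\sim p_{g}}\Big[\log\frac{p_{g}(\textbf{x})}{p_{r}(\textbf{x})}\Big] = \mathbb{E}_{\textbf{x}\sim p_{g}}\Big[\log\frac{1-D^{*}(\textbf{x})}{D^{*}(\textbf{x})}\Big] = \mathbb{E}_{\textbf{x}\sim p_{g}}\big[\log\big(1-D^{*}(\textbf{x})\big)\big] - \mathbb{E}_{\textbf{x}\sim p_{g}}\big[\log D^{*}(\textbf{x})\big] \tag{8}$$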

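If we switch the order of the two sides in Eq. (8), we get:

$$-\mathbb{E}_{\textbf{x}\sim p_{g}}\big[\log D^{*}(\textbf{x})\big] = KL(p_{g}||p_{r}) - \mathbb{E}_{\textbf{x}\sim p_{g}}\big[\log\big(1-D^{*}(\textbf{x})\big)\big] = KL(p_{g}||p_{r}) - 2\,JS(p_{r}||p_{g}) + 2\log 2 + \mathbb{E}_{\textbf{x}\sim p_{r}}\big[\log D^{*}(\textbf{x})\big] \tag{9}$$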

The alternative loss for $G$ is thus only affected by the first two terms (the last two terms are constant). Since $JS(p_{r}||p_{g})$ is bounded in $[0,\log 2]$ (see Fig. 3(b)), the loss function is dominated by $KL(p_{g}||p_{r})$, which is also called the reverse KL divergence. Since $KL(p_{g}||p_{r})$ usually does not equal $KL(p_{r}||p_{g})$, the $p_{g}$ obtained by optimizing the reverse KL is very different from the $p_{g}$ obtained by optimizing the KL divergence. Fig. 4 shows the difference between the two optimizations, where the distribution $p$ is a mixture of two Gaussians and $q$ is a single Gaussian. When we optimize for $KL(p_{r}||p_{g})$, $q$ averages over all of the modes of $p$ to hit the center of mass (Fig. 4(a)). However, when optimizing the reverse KL divergence, $q$ chooses a single mode (Fig. 4(b)), which causes mode collapse during training.
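The mode-averaging versus mode-seeking behavior can be reproduced with a small grid search. The setup below is an illustrative assumption, not the configuration behind Fig. 4: $p$ is an equal mixture of $\mathcal{N}(-4,1)$ and $\mathcal{N}(4,1)$, $q$ is a single Gaussian whose mean and standard deviation are searched over a coarse grid, and the integrals are approximated by Riemann sums.

```python
import math

DX = 0.05
XS = [-15.0 + DX * i for i in range(601)]  # integration grid on [-15, 15]

def log_normal(x, mu, sigma):
    return -0.5 * ((x - mu) / sigma) ** 2 - math.log(sigma * math.sqrt(2 * math.pi))

def log_mixture(x):
    """Log-density of p, an equal mixture of N(-4, 1) and N(4, 1)."""
    a = log_normal(x, -4.0, 1.0) + math.log(0.5)
    b = log_normal(x, 4.0, 1.0) + math.log(0.5)
    top = max(a, b)
    return top + math.log(math.exp(a - top) + math.exp(b - top))

LOG_P = [log_mixture(x) for x in XS]

def kl_on_grid(log_a, log_b):
    """Riemann-sum estimate of KL(a || b) from log-densities on XS."""
    return sum(math.exp(la) * (la - lb) for la, lb in zip(log_a, log_b)) * DX

def best_gaussian(reverse):
    """Grid search for the single Gaussian q minimizing KL(p||q) or KL(q||p)."""
    best = None
    for mu in [-6.0 + 0.5 * i for i in range(25)]:
        for sigma in [0.5 * j for j in range(1, 13)]:
            log_q = [log_normal(x, mu, sigma) for x in XS]
            d = kl_on_grid(log_q, LOG_P) if reverse else kl_on_grid(LOG_P, log_q)
            if best is None or d < best[0]:
                best = (d, mu, sigma)
    return best

_, mu_fwd, sigma_fwd = best_gaussian(reverse=False)  # minimizes KL(p || q)
_, mu_rev, sigma_rev = best_gaussian(reverse=True)   # minimizes KL(q || p)
```

Under these assumptions the forward KL places $q$ at the center of mass with a wide spread that covers both modes, whereas the reverse KL places a narrow $q$ on one of the two modes and ignores the other.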

In summary, using the original $G$ loss from Eq. (1) results in vanishing gradients for $G$, and using the other loss in Eq. (9) results in mode collapse. These problems are inherent in the GAN loss and thus cannot be solved using sophisticated architectures. In Section 5 we discuss other loss functions for GANs, which address these problems.

3 Instability and Image quality

Early GAN models that use the $G$ losses described above ($\log\big(1-D(G(\textbf{z}))\big)$ and $-\log\big(D(G(\textbf{z}))\big)$) exhibit great instability in their cost values during training. Arjovsky et al. (2017) experimented with these two $G$ losses and claimed that in both cases the gradients cause instability in GAN training. The loss in Eq. (1) stays constant after the first training steps (Fig. 5(a)), and the loss in Eq. (8) fluctuates during the entire training (Fig. 5(b)). In both cases, they did not find a significant correlation between the calculated loss and the generated image quality. In other words, it is very difficult to predict when during training the generator actually produces good-quality images, and the only way to get a good generator is to stop the training and manually inspect many generated images.

Using the above two $G$ losses in Eq. (1) and Eq. (8) yields poor-quality images compared to modern GAN models. In the following sections we cover more sophisticated losses that improve the quality, resolution, and size of the generated images.

4 Problems: summary

The original GAN model and loss proposed by Goodfellow et al. (2014) suffer from three inherent challenges: (1) mode collapse; (2) vanishing gradients; and (3) image quality. Follow-up works improve the performance of GANs on one or more of these problems by using different architectures for $D$ or $G$, modifying the cost function, and more. A subset of architecture-variant and loss-variant GANs is portrayed in Fig. 6(a) and Fig. 6(b), respectively. A sample of recent prominent GAN models is presented in the following sections.

Improved GAN architectures

Many types of new GAN architectures have been proposed since 2014 (Berthelot et al. (2017); Brock et al. (2019); Denton et al. (2015); Karras et al. (2018); Radford et al. (2015); Zhang et al. (2019); Karras et al. (2019); Feigin et al. (2020)). Different GAN architectures were proposed for different tasks, such as image super-resolution Ledig et al. (2017) and image-to-image transfer Zhu et al. (2017); Park et al. (2019). In this section we present some of these models, which improved the performance on image quality, vanishing gradients, and mode collapse, compared to the original GAN.

1 Semi-supervised GAN (SGAN)

Semi-supervised learning is a promising research field between supervised learning and unsupervised learning. In supervised learning, every data sample is labeled, and in unsupervised learning, no labels are provided. Semi-supervised learning has labels only for a small subset of the training data and no annotations for the rest, as in many real-world problems.

SGAN Odena (2016) extended the original GAN learning to the semi-supervised context by adding to the discriminator network an additional task of classifying the image labels (Fig. 7). The generator’s architecture is the same as before. In SGAN the discriminator utilizes two heads, a softmax and a sigmoid. The sigmoid is used to distinguish between real and fake images, and the softmax predicts the images’ labels (only for images predicted as real). The results on MNIST showed that both $D$ and $G$ in SGAN are improved compared to the original GAN.

2 Conditional GAN (CGAN)

CGAN was originally proposed as an extension of the original GAN, where both the discriminator and the generator receive the class label of the image as an additional input Mirza and Osindero (2014); Odena et al. (2017). An illustration of the CGAN architecture is shown in Fig. 8. In this setup the loss function in Eq. (1) is slightly modified to condition both the real images $x$ and the latent variable $z$ on $y$ (the label):

$$\min_{G}\max_{D} V(G,D) = \mathbb{E}_{x\sim p_{r}}\big[\log D(x|y)\big] + \mathbb{E}_{z\sim p_{z}}\big[\log\big(1-D(G(z|y))\big)\big]$$

All values ($x$, $y$, $z$) are first encoded by some neural layers prior to their fusion in the discriminator and generator. This improves the ability of the discriminator to classify real/fake images and enhances the generator’s ability to control the modalities of the generated images. Mirza and Osindero (2014) showed that their CGAN architecture, used with a language model, can also handle multimodal datasets such as Flickr, which contains labeled image data with their associated user tags. They demonstrated that their generator is capable of producing automatic tagging.

3 Deep Convolutional GAN (DCGAN)

Before describing DCGAN, we provide a brief reminder of what a convolutional neural network is. Convolutional neural networks (CNNs) were proposed by LeCun et al. (1999). These networks consist of trained spatial filters applied to hidden activations throughout their architecture. They perform correlations using their trained kernels with a sliding window over images (or hidden activations). This was shown to improve the accuracy on many recognition tasks, especially in computer vision. A very basic and popular CNN called LeNet is shown in Fig. 9.

DCGAN proposed to create images using solely deconvolutional networks in their generator Radford et al. (2015). The generator architecture is depicted in Fig. 10. Deconvolutional networks can be conceived as CNNs that use the same components but in reverse, projecting features into the image pixel space. Zeiler and Fergus (2014) showed that deconvolutional layers achieve good visualization for CNNs; this allowed the DCGAN generator to create high-resolution images for the first time.

In addition to improved resolution, DCGAN showed better stability during the training thanks to multiple modifications in the original GAN network:

All pooling layers were replaced. In the discriminator, they used convolution kernels with stride > 1, and the generator utilized fractional-strided convolutions to increase the spatial size.

Both the discriminator and generator were trained with batch normalization which promotes similar statistics for real images and fake generated images.

The discriminator architecture replaces all normal ReLU activations with Leaky-ReLU Maas (2013); instead of zeroing negative pre-activations, this activation multiplies them by a small value ($0.2$), which prevents “dead” gradients from propagating to the generator. The generator architecture used ReLU after every deconvolution layer except the output, which uses a Tanh activation.
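Two of the modifications above are easy to make concrete: the Leaky-ReLU activation with slope $0.2$, and the way strided and fractional-strided convolutions change the spatial size. The kernel size, stride, and padding used below (4, 2, 1) are assumptions chosen because they exactly double or halve the resolution; they are not taken from the text.

```python
def leaky_relu(x, negative_slope=0.2):
    """Pass positive inputs through; scale negative inputs by a small slope."""
    return x if x > 0 else negative_slope * x

def transposed_conv_output_size(size, kernel, stride, padding):
    """Output size of a fractional-strided (transposed) convolution,
    assuming no output padding and no dilation."""
    return (size - 1) * stride - 2 * padding + kernel

def strided_conv_output_size(size, kernel, stride, padding):
    """Output size of an ordinary strided convolution (no dilation)."""
    return (size + 2 * padding - kernel) // stride + 1

# Generator side: each fractional-strided layer doubles the resolution.
generator_sizes = [4]
for _ in range(4):
    generator_sizes.append(
        transposed_conv_output_size(generator_sizes[-1], kernel=4, stride=2, padding=1))

# Discriminator side: each strided layer halves it, replacing pooling.
discriminator_sizes = [64]
for _ in range(4):
    discriminator_sizes.append(
        strided_conv_output_size(discriminator_sizes[-1], kernel=4, stride=2, padding=1))
```

With these assumed hyperparameters the generator grows a $4\times 4$ feature map to $64\times 64$ in four layers, and the discriminator mirrors the path back down without any pooling layer.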

The authors demonstrated empirically that the mere CNN architecture used in DCGAN is not the key contributing factor to the GAN’s performance, and that the above modifications are crucial. To show this, they measured the quality of GANs by treating them as feature extractors on supervised datasets and evaluating the performance of linear models trained on these features (for more information see Radford et al. (2015)). DCGAN yields a classification error of 22.48% on the StreetView House Numbers dataset (SVHN) Netzer et al. (2011), whereas a purely supervised CNN with the same architecture achieved a significantly higher 28.87% test error.

4 Progressive GAN (PROGAN)

PROGAN described a novel training methodology for GANs, involving progressive steps toward the development of the entire network architecture Karras et al. (2018). This progressive architecture uses the idea of progressive neural networks first proposed by Rusu et al. (2016). These architectures are immune to forgetting and can leverage prior knowledge via lateral connections to previously learned features. The progressive training scheme is shown in Fig. 11. First, they train on low-resolution $4\times 4$ pixel images. Next, both $D$ and $G$ grow to include a layer of $8\times 8$ spatial resolution. This progressive training gradually adds more and more intermediate layers to enhance the resolution, until reaching a high resolution of $1024\times 1024$ pixel images with the CelebA dataset. All previous layers remain trainable in later steps.

Many state-of-the-art GAN architectures utilize this type of progressive training scheme, and it has resulted in very credible images Karras et al. (2018, 2019); Brock et al. (2019); Karras et al. (2020b); Richardson et al. (2021), and more stable learning for both $D$ and $G$.

5 BigGAN

Attention in computer vision is a method that focuses the task on an “interesting” region in the image. Self-attention is used to train the network (usually a CNN) in an unsupervised manner, teaching it to segment or localize the most relevant pixels (or activations) for the vision task.

BigGAN Brock et al. (2019) has achieved state-of-the-art generation on the ImageNet datasets. Its architecture is based on the Self-attention GAN (SAGAN) Zhang et al. (2019), which employs a self-attention mechanism in both $D$ and $G$ to capture a large receptive field without sacrificing computational efficiency for CNNs Vaswani et al. (2017). The original SAGAN architecture can learn global semantics and long-range dependencies for images, thus generating excellent multi-label images based on the ImageNet datasets ($128\times 128$ pixels). BigGAN achieved improved performance by scaling up the GAN training: increasing the number of network parameters ($\times 4$) and increasing the batch size ($\times 8$). They achieved better performance on ImageNet at size $128\times 128$ and were also able to train BigGAN at resolutions of $256\times 256$ and $512\times 512$.

Unlike many previous GAN models, which sampled the latent variable from either $z\sim\mathcal{N}(0,\,1)$ or $z\sim\mathcal{U}(-1,\,1)$, BigGAN uses a simple truncation trick. During training $z$ is sampled from $\mathcal{N}(0,\,1)$, but for generating images at inference $z$ is drawn from a truncated normal, where values that lie outside a chosen range are re-sampled until they fall inside it. This truncation trick improves individual sample quality at the cost of a reduction in overall sample variety, as shown in Fig. 12(a). Note, though, that using this different inference-time sampling as is may not generate good images with some large models, producing saturation artifacts (Fig. 12(b)). To solve this issue, the authors proposed to use Orthogonal Regularization to make $G$ more amenable to truncation, making it smoother so that the full latent space maps to good generated images.
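The resampling step of the truncation trick can be sketched as follows. The function name, the latent dimension of 128, and the threshold of 0.5 are illustrative assumptions; the text does not fix the truncation range.

```python
import random

def sample_truncated_normal(dim, threshold, rng):
    """Draw dim values from N(0, 1), resampling any value whose magnitude
    exceeds the threshold until it falls inside [-threshold, threshold]."""
    z = []
    for _ in range(dim):
        value = rng.gauss(0.0, 1.0)
        while abs(value) > threshold:
            value = rng.gauss(0.0, 1.0)
        z.append(value)
    return z

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Training-time latent: plain standard normal samples.
z_train = [rng.gauss(0.0, 1.0) for _ in range(128)]

# Inference-time latent: same distribution, truncated to a narrower range.
z_infer = sample_truncated_normal(128, threshold=0.5, rng=rng)
```

A smaller threshold concentrates the latent vectors near the origin, which matches the trade-off described above: individual samples tend to look better while the variety of samples shrinks.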

6 StyleGAN

StyleGAN Karras et al. (2019) proposed an alternative generator architecture for GANs. Unlike the traditional $G$ architecture, which samples a random latent variable at its input, their architecture starts from a learned constant input and adjusts the characteristics of the images in every convolutional layer along $G$ using different, learned latent variables. Additionally, they inject learned noise directly throughout $G$ (see Fig. 13).

This architectural change leads to automatic, unsupervised separation of high-level attributes (e.g., pose and identity) from low-level variations (e.g., freckles, hair) in the generated images. StyleGAN did not modify the discriminator or the loss function; it used the same $D$ architecture as in Karras et al. (2018). In addition to state-of-the-art quality for face image generation, StyleGAN demonstrates a higher degree of latent space disentanglement, presenting more linear representations of different factors of variation and making GAN synthesis much more controllable. More advanced StyleGAN architectures have since been proposed Karras et al. (2020b); Gal et al. (2021a); Karras et al. (2021). Some generated examples are presented in Fig. 14. One may also use StyleGAN for image editing by calculating the latent vector of a given input image in StyleGAN (this operation is known as StyleGAN inversion) and then manipulating this vector to edit the image Xia et al. (2021); Richardson et al. (2021); Abdal et al. (2019); Tov et al. (2021); Abdal et al. (2020); Shen et al. (2020); Shen and Zhou (2021); Patashnik et al. (2021).

Improved GAN objectives

As described in Section 3.2, the original GAN’s min-max loss (Eq. (1)) promotes mode collapse and vanishing gradients. This section presents selected loss functions and regularizations that remedy these problems and also improve the image quality. This is only a partial list; other loss functions and regularization methods exist, such as least squares GANs Mao et al. (2017) or optimal transport models Peyre and Cuturi (2019).

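1 Wasserstein GAN (WGAN)

WGAN Arjovsky et al. (2017) addresses the vanishing gradient and mode collapse problems of the original GAN by replacing the cost in Eq. (1) with the Earth Mover (EM) distance, which is also known as the Wasserstein distance and is defined as:

$$W(p_{r},p_{g}) = \inf_{\gamma\in\prod(p_{r},p_{g})}\mathbb{E}_{(x,y)\sim\gamma}\big[\|x-y\|\big] \tag{11}$$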

where $\prod(p_{r},p_{g})$ denotes the set of all joint distributions $\gamma(x,y)$ whose marginals are $p_{r}$ and $p_{g}$, respectively. The EM distance is therefore the minimum cost of transporting “mass” in converting the distribution $p_{r}$ into the distribution $p_{g}$.

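Unlike the KL and JS divergences, EM provides a meaningful distance even when the distributions $p_{r}$ and $p_{g}$ are far from each other; EM is also continuous and thus provides useful gradients for training $G$. However, the infimum in Eq. (11) is highly intractable, so the authors estimate the EM cost (up to a multiplicative constant) with:

$$\max_{w\in\mathcal{W}}\;\mathbb{E}_{x\sim p_{r}}\big[f_{w}(x)\big] - \mathbb{E}_{z\sim p_{z}}\big[f_{w}(G(z))\big]$$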

where $\{f_{w}\}_{w\in\mathcal{W}}$ is a parameterized family of functions that are all $K$-Lipschitz for some $K$ ($||f||_{L}\leq K$). Readers are referred to Arjovsky et al. (2017) for more details.

Fig. 15 compares the gradient of WGAN to that of the original GAN for two non-overlapping Gaussian distributions. It can be observed that WGAN has a smooth and measurable gradient everywhere (blue line), and thus learns better even when $G$ is not yet producing good images.
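For one-dimensional samples of equal size, the EM distance has a simple closed form: sort both samples and average the absolute differences between matched points. The sketch below uses this fact to show that the distance keeps growing with the separation between two distributions. The sample size and the shifts are arbitrary illustrative choices, and this closed form is not how WGAN estimates the distance during training.

```python
import random

def wasserstein_1d(xs, ys):
    """EM distance between two equal-size 1-D samples. After sorting, the
    optimal transport plan pairs the i-th smallest points of each sample."""
    if len(xs) != len(ys):
        raise ValueError("samples must have the same size")
    return sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys))) / len(xs)

rng = random.Random(0)
real = [rng.gauss(0.0, 1.0) for _ in range(2000)]

# "Generated" samples: the real sample shifted further and further away.
distances = {}
for shift in (10.0, 20.0, 40.0):
    fake = [x + shift for x in real]
    distances[shift] = wasserstein_1d(real, fake)
```

The distance equals the shift in each case, so it still tells the generator in which direction and how far to move, whereas the JS divergence between such well-separated distributions is stuck at its maximum value.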

2 Self Supervised GAN (SSGAN)

We showed in Section 4.2 that CGANs can generate natural images. However, they require labeled images to do so, which is a major drawback. SSGANs Chen et al. (2019) exploit two popular unsupervised learning techniques, adversarial training, and self-supervision, bridging the gap between conditional and unconditional GANs.

Neural networks have been shown to forget previously learned tasks French (1999); Kirkpatrick et al. (2017), and catastrophic forgetting was previously considered a major cause of GAN training instability. Motivated by the desire to counter discriminator forgetting, SSGAN adds to the discriminator an additional loss, which enables it to learn useful representations independently of the quality of the generator. In a self-supervised manner, the authors train a model on the task of predicting a rotation angle ($0^{\circ}$, $90^{\circ}$, $180^{\circ}$, $270^{\circ}$), as shown in Fig. 16. The objectives of $D$ and $G$ are updated to:

where $V(G,D)$ is the original GAN objective in Eq. (1), $P_{data}$ and $P_{G}$ are the real data and generated data distributions, respectively, and $r\in R$ is a rotation selected from the set of all allowed angles ($R=\{0^{\circ},90^{\circ},180^{\circ},270^{\circ}\}$). An image $\textbf{x}$ rotated by $r$ degrees is denoted as $\textbf{x}^{r}$, and $Q(R|\textbf{x}^{r})$ is the discriminator’s predictive distribution over the angles of rotation of the sample. These new losses force $D$ to learn good representations by learning the rotation information in a self-supervised manner.
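The self-supervised rotation task needs no human labels, because the label is simply the index of the rotation that was applied. A minimal sketch, assuming the four rotations $0^{\circ}$, $90^{\circ}$, $180^{\circ}$, $270^{\circ}$, images stored as nested Python lists, and clockwise rotation (the direction is an arbitrary choice here); the function names are hypothetical.

```python
ANGLES = (0, 90, 180, 270)

def rotate90(image):
    """Rotate a 2-D list-of-lists image by 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]

def rotation_batch(image):
    """Return one (rotated_image, rotation_label) pair per allowed angle.
    The label is the index of the angle in ANGLES."""
    batch = []
    current = [list(row) for row in image]
    for label in range(len(ANGLES)):
        batch.append((current, label))
        current = rotate90(current)
    return batch

image = [[1, 2],
         [3, 4]]
batch = rotation_batch(image)
labels = [label for _, label in batch]
```

Each real or generated image is expanded into four training pairs, and the rotation head of the discriminator is trained to recover the label from the rotated image.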

Using the above scheme, SSGAN generates high-quality images, matching the performance of conditional GANs without having access to labeled data.

3 Spectral Normalization GAN (SNGAN)

SNGAN Miyato et al. (2018) proposes adding weight normalization to stabilize the training of the discriminator. Their technique is computationally inexpensive and can be applied easily to existing GAN architectures. Previous works that stabilized GAN training (Gulrajani et al. (2017); Arjovsky et al. (2017); Qi (2019)) emphasized that $D$ should be a $K$-Lipschitz continuous function, which prevents it from changing rapidly. This characteristic of $D$ stabilizes the training of GANs. SNGAN controls the Lipschitz constant of $D$ by constraining the spectral norm of each layer, normalizing each weight matrix $\textbf{W}$ so that it satisfies the spectral constraint $\sigma(\textbf{W})=1$ (i.e., the largest singular value of the weight matrix of each layer is 1). This is performed by simply normalizing each layer:

$$\bar{\textbf{W}}_{SN}(\textbf{W}) = \frac{\textbf{W}}{\sigma(\textbf{W})}$$

where $\textbf{W}$ are the weight parameters of each layer in $D$. Miyato et al. (2018) prove that this bounds the Lipschitz constant of the discriminator function by $1$, which is important for the WGAN optimization.
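The largest singular value $\sigma(\textbf{W})$ can be estimated with power iteration, after which the normalization is a single division. The sketch below runs the iteration to convergence on a tiny matrix for clarity; the helper names and the example matrix are illustrative assumptions, and practical implementations usually amortize the iteration across training steps instead.

```python
import math

def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def transpose(m):
    return [list(col) for col in zip(*m)]

def normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def spectral_norm(w, iterations=50):
    """Estimate the largest singular value of w by power iteration."""
    wt = transpose(w)
    v = normalize([1.0] * len(w[0]))   # arbitrary non-zero starting vector
    for _ in range(iterations):
        u = normalize(matvec(w, v))    # approaches the left singular vector
        v = normalize(matvec(wt, u))   # approaches the right singular vector
    return math.sqrt(sum(x * x for x in matvec(w, v)))

def spectrally_normalize(w):
    """Divide every entry by sigma(w) so the largest singular value becomes 1."""
    sigma = spectral_norm(w)
    return [[x / sigma for x in row] for row in w]

W = [[1.0, 2.0],
     [3.0, 4.0]]
sigma_before = spectral_norm(W)
sigma_after = spectral_norm(spectrally_normalize(W))
```

After normalization the layer can stretch any input vector by at most a factor of one, which is the per-layer condition behind the bound on the Lipschitz constant of the whole discriminator.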

SNGAN achieves an extraordinary advance on ImageNet, and better or equal quality on CIFAR-10 and STL-10, compared to the previous training stabilization techniques that include weight clipping Arjovsky et al. (2017), gradient penalty Wu et al. (2019); Mescheder et al. (2018), batch normalization Ioffe and Szegedy (2015), weight normalization Salimans and Kingma (2016), layer normalization Ba et al. (2016), and orthonormal regularization Brock et al. (2017).

4 SphereGAN

SphereGAN Park and Kwon (2019) is a novel integral probability metric (IPM)-based GAN, which uses the hypersphere to bound IPMs in the objective function, thus enhancing the stability of the training. By exploiting the information of higher-order statistics of data using geometric moment matching, they achieved more accurate results. The objective function of SphereGAN is defined as

Unlike conventional approaches that use the Wasserstein distance and add additional constraint terms (see Table 1 in Park and Kwon (2019)), SphereGAN does not need any additional constraints to force $D$ into a desired function space, due to the use of a geometric transformation in $D$.

Data augmentation with GAN

We turn now to describe how to use GAN for data augmentation. Data augmentation is a crucial process for tasks lacking large training sets. There are three circumstances which require data augmentation:

Limited annotations: Training a large DNN with very few labeled samples in the training set.

Limited diversity: When the training data lacks variations. For example, it may not cover diverse illuminations or a variety of appearances.

Restricted data: The database might contain sensitive information, and thus accessing it directly is strictly restricted.

The first two scenarios can be solved via supervised learning but they will cost a lot of human effort to enrich the labeled data or to use active learning approaches Cohn et al. (1996). An alternative approach is to utilize GANs to augment the data. Semi-supervised GAN (SGAN) (see Section 4.1) is a simple example of such a GAN architecture that is able to generate new annotated images and can automatically enrich training data with few labels.

The second case is more common in the real world. Data variability is important for many psychology or neuroscience experiments. For example, a human’s EEG is sensitive to different types of face images such as happy, angry, and sad faces Mavratzakis et al. (2016). Preparing those stimuli in traditional experiments is time-consuming and costly for researchers. Architectures like StyleGAN Karras et al. (2019) (see Section 4.6) are specialized to generate broad types of face images as stimuli. More importantly, these images can be controlled to exhibit specific attributes (e.g., level of happiness or facial textures), thus enhancing the stimulus variation in the experiment Wang et al. (2020). Data augmentation can also be helpful in the training of the GANs themselves Karras et al. (2020a); Zhao et al. (2020), which makes it possible to train them with only a few examples and still get high-quality generated data (e.g., images).

The last case is related to unsupervised learning approaches. When a database is restricted to preserve users’ privacy, researchers can use GANs to synthesize this sensitive data themselves. For example, Delaney et al. employed GANs to generate synthetic ECG signals that resemble real ECG data Delaney et al. (2019).

Conclusion

This chapter has provided just a glimpse of GANs and their various usages. This fast-growing technique has many variants and applications that have not been presented here for the sake of brevity. Yet, we believe that the description of GANs given here provides the reader with the tools to better understand this technique and to navigate the vast amount of recent literature on this topic. It is also worth mentioning normalizing flows Papamakarios et al. (2021) and score-based generative models Song et al. (2021b, a); Nichol and Dhariwal (2021), which have recently become very popular and show performance competitive with GANs.

References