Towards Principled Methods for Training Generative Adversarial Networks

Martin Arjovsky, Léon Bottou

Introduction

Generative adversarial networks (GANs)(Goodfellow et al., 2014a) have achieved great success at generating realistic and sharp looking images. However, they are widely general methods, now starting to be applied to several other important problems, such as semisupervised learning, stabilizing sequence learning methods for speech and language, and 3D modelling. (Denton et al., 2015; Radford et al., 2015; Salimans et al., 2016; Lamb et al., 2016; Wu et al., 2016)

However, they still remain remarkably difficult to train, with most current papers dedicated to heuristically finding stable architectures. (Radford et al., 2015; Salimans et al., 2016)

Despite their success, there is little to no theory explaining the unstable behaviour of GAN training. Furthermore, approaches to attacking this problem still rely on heuristics that are extremely sensitive to modifications. This makes it extremely hard to experiment with new variants, or to use them in new domains, which limits their applicability drastically. This paper aims to change that, by providing a solid understanding of these issues, and creating principled research directions towards adressing them.

It is interesting to note that the architecture of the generator used by GANs doesn’t differ significantly from other approaches like variational autoencoders (Kingma & Welling, 2013). After all, at the core of it we first sample from a simple prior zp(z)z\sim p(z), and then output our final sample gθ(z)g_{\theta}(z), sometimes adding noise in the end. Always, gθg_{\theta} is a neural network parameterized by θ\theta, and the main difference is how gθg_{\theta} is trained.

If Pr(x)>Pg(x)P_{r}(x)>P_{g}(x), then xx is a point with higher probability of coming from the data than being a generated sample. This is the core of the phenomenon commonly described as ‘mode dropping’: when there are large regions with high values of PrP_{r}, but small or zero values in PgP_{g}. It is important to note that when Pr(x)>0P_{r}(x)>0 but Pg(x)0P_{g}(x)\to 0, the integrand inside the KL grows quickly to infinity, meaning that this cost function assigns an extremely high cost to a generator’s distribution not covering parts of the data.

If Pr(x)<Pg(x)P_{r}(x)<P_{g}(x), then xx has low probability of being a data point, but high probability of being generated by our model. This is the case when we see our generator outputting an image that doesn’t look real. In this case, when Pr(x)0P_{r}(x)\to 0 and Pg(x)>0P_{g}(x)>0, we see that the value inside the KL goes to 0, meaning that this cost function will pay extremely low cost for generating fake looking samples.

Generative adversarial networks are formulated in two steps. We first train a discriminator DD to maximize

One can show easily that the optimal discriminator has the shape

Why do updates get worse as the discriminator gets better? Both in the original and the new cost function.

Is the new cost function following a similar divergence to the JSDJSD? If so, what are its properties?

Is there a way to avoid some of these issues?

The fundamental contributions of this paper are the answer to all these questions, and perhaps more importantly, to introduce the tools to analyze them properly. We provide a new direction designed to avoid the instability issues in GANs, and examine in depth the theory behind it. Finally, we state a series of open questions and problems, that determine several new directions of research that begin with our methods.

Sources of instability

Let g:ZXg:\mathcal{Z}\rightarrow\mathcal{X} be a function composed by affine transformations and pointwise nonlinearities, which can either be rectifiers, leaky rectifiers, or smooth strictly increasing functions (such as the sigmoid, tanh, softplus, etc). Then, g(Z)g(\mathcal{Z}) is contained in a countable union of manifolds of dimension at most dimZ\dim\mathcal{Z}. Therefore, if the dimension of Z\mathcal{Z} is less than the one of X\mathcal{X}, g(Z)g(\mathcal{Z}) will be a set of measure 0 in X\mathcal{X}.

Since M\mathcal{M} and P\mathcal{P} are compact and disjoint, 0<δ=d(P,M)0<\delta=d(\mathcal{P},\mathcal{M}) the distance between both sets. We now define

In the next theorem, we take away the disjoint assumption, to make it general to the case of two different manifolds. However, if the two manifolds match perfectly on a big part of the space, no discriminator could separate them. Intuitively, the chances of two low dimensional manifolds having this property is rather dim: for two curves to match in space in a specific segment, they couldn’t be perturbed in any arbitrarilly small way and still satisfy this property. To do this, we will define the notion of two manifolds perfectly aligning, and show that this property never holds with probability 1 under any arbitrarilly small perturbations.

We say that two manifolds without boundary M\mathcal{M} and P\mathcal{P} perfectly align if there is an xMPx\in\mathcal{M}\cap\mathcal{P} such that M\mathcal{M} and P\mathcal{P} don’t intersect transversally in xx.

We shall note the boundary and interior of a manifold M\mathcal{M} by M\partial M and Int M\text{Int }M respectively. We say that two manifolds M\mathcal{M} and P\mathcal{P} (with or without boundary) perfectly align if any of the boundary free manifold pairs (Int M,Int P),(Int M,P),(M,Int P)(\text{Int }\mathcal{M},\text{Int }\mathcal{P}),(\text{Int }\mathcal{M},\partial\mathcal{P}),(\partial\mathcal{M},\text{Int }\mathcal{P}) or (M,P)(\partial\mathcal{M},\partial\mathcal{P}) perfectly align.

The interesting thing is that we can safely assume in practice that any two manifolds never perfectly align. This can be done since an arbitrarilly small random perturbation on two manifolds will lead them to intersect transversally or don’t intersect at all. This is precisely stated and proven in Lemma 2.

As stated by Lemma 3, if two manifolds don’t perfectly align, their intersection L=MP\mathcal{L}=\mathcal{M}\cap\mathcal{P} will be a finite union of manifolds with dimensions strictly lower than both the dimension of M\mathcal{M} and the one of P\mathcal{P}.

We now state our perfect discrimination result for the case of two manifolds.

Let xMLx\in\mathcal{M}\setminus\mathcal{L}. Therefore, xPcx\in\mathcal{P}^{c} (the complement of P\mathcal{P}) which is an open set, so there exists a ball of radius ϵx\epsilon_{x} such that B(x,ϵx)P=B(x,\epsilon_{x})\cap\mathcal{P}=\emptyset. This way, we define

Note that these divergences will be maxed out even if the two manifolds lie arbitrarilly close to each other. The samples of our generator might look impressively good, yet both KL divergences will be infinity. Therefore, Theorem 2.3 points us to the fact that attempting to use divergences out of the box to test similarities between the distributions we typically consider might be a terrible idea. Needless to say, if these divergencies are always maxed out attempting to minimize them by gradient descent isn’t really possible. We would like to have a perhaps softer measure, that incorporates a notion of distance between the points in the manifolds. We will come back to this topic later in section 3, where we explain an alternative metric and provide bounds on it that we are able to analyze and optimize.

2 The consequences, and the problems of each cost function

Theorems 2.1 and 2.2 showed one very important fact. If the two distributions we care about have supports that are disjoint or lie on low dimensional manifolds, the optimal discriminator will be perfect and its gradient will be zero almost everywhere.

We will now explore what happens when we pass gradients to the generator through a discriminator. One crucial difference with the typical analysis done so far is that we will develop the theory for an approximation to the optimal discriminator, instead of working with the (unknown) true discriminator. We will prove that as the approximaton gets better, either we see vanishing gradients or the massively unstable behaviour we see in practice, depending on which cost function we use.

In what follows, we denote by D\|D\| the norm

The use of this norm is to make the proofs simpler, but could have been done in another Sobolev norm 1,p\|\cdot\|_{1,p} for p<p<\infty covered by the universal approximation theorem in the sense that we can guarantee a neural network approximation in this norm (Hornik, 1991).

Under the same assumptions of Theorem 2.4

This shows that as our discriminator gets better, the gradient of the generator vanishes. For completeness, this was experimentally verified in Figure 2. The fact that this happens is terrible, since the fact that the generator’s cost function being close to the Jensen Shannon divergence depends on the quality of this approximation. This points us to a fundamental: either our updates to the discriminator will be inacurate, or they will vanish. This makes it difficult to train using this cost function, or leave up to the user to decide the precise amount of training dedicated to the discriminator, which can make GAN training extremely hard.

2.2 The −log⁡D𝐷-\log D alternative

To avoid gradients vanishing when the discriminator is very confident, people have chosen to use a different gradient step for the generator.

We now state and prove for the first time which cost function is being optimized by this gradient step. Later, we prove that while this gradient doesn’t necessarily suffer from vanishing gradients, it does cause massively unstable updates (that have been widely experienced in practice) under the prescence of a noisy approximation to the optimal discriminator.

We already know by Goodfellow et al. (2014a) that

Furthermore, as remarked by Huszar (2016),

Taking derivatives in θ\theta at θ0\theta_{0} we get

Substracting this last equation with the result for the JSD, we obtain our desired result. ∎

We now turn to our result regarding the instability of a noisy version of the true distriminator.

is a centered Cauchy distribution with infinite expectation and variance.Note that the theorem holds regardless of the variance of rr and ϵ\epsilon. As the approximation gets better, this error looks more and more as centered random noise due to the finite precision.

Since r(z)r(z) is a centered Gaussian distribution, multiplying by a matrix doesn’t change this fact. Furthermore, when we divide by ϵ(z)\epsilon(z), a centered Gaussian independent from the numerator, we get a centered Cauchy random variable on every coordinate. Averaging over zz the different independent Cauchy random variables again yields a centered Cauchy distribution. A note on technicality: when ϵ\epsilon is defined as such, the remaining process is not measurable in xx, so we can’t take the expectation in zz trivially. This is commonly bypassed, and can be formally worked out by stating the expectation as the result of a stochastic differential equation. ∎

Note that even if we ignore the fact that the updates have infinite variance, we still arrive to the fact that the distribution of the updates is centered, meaning that if we bound the updates the expected update will be 0, providing no feedback to the gradient.

Since the assumption that the noises of DD and D\nabla D are decorrelated is albeit too strong, we show in Figure 3 how the norm of the gradient grows drastically as we train the discriminator closer to optimality, at any stage in training of a well stabilized DCGAN except when it has already converged. In all cases, using this updates lead to a notorious decrease in sample quality. The noise in the curves also shows that the variance of the gradients is increasing, which is known to delve into slower convergence and more unstable behaviour in the optimization (Bottou et al., 2016).

Towards softer metrics and distributions

An important question now is how to fix the instability and vanishing gradients issues. Something we can do to break the assumptions of these theorems is add continuous noise to the inputs of the discriminator, therefore smoothening the distribution of the probability mass.

If ϵN(0,σ2I)\epsilon\sim\mathcal{N}(0,\sigma^{2}I) then

If ϵN(0,Σ)\epsilon\sim\mathcal{N}(0,\Sigma) then

If Pϵ(x)1xd+1P_{\epsilon}(x)\propto\frac{1}{\|x\|^{d+1}} then

and we want to calculate what the gradient passed to the generator is.

where a(z)a(z) and b(z)b(z) are positive functions. Furthermore, b>ab>a if and only if Pr+ϵ>Pg+ϵP_{r+\epsilon}>P_{g+\epsilon}, and b<ab<a if and only if Pr+ϵ<Pg+ϵP_{r+\epsilon}<P_{g+\epsilon}.

In the same as with Theorem 3.2, aa and bb will have the same properties. The main difference is that we will be moving all our noisy samples towards the data manifold, which can be thought of as moving a small neighbourhood of samples towards it. This will protect the discriminator against measure 0 adversarial examples.

Since the discriminator is assumed fixed when backproping to the generator, the only thing that depends on θ\theta is gθ(z)g_{\theta}(z) for every zz. By taking derivatives on our cost function

Let the density of ϵ\epsilon be 1Zex22σ2\frac{1}{Z}e^{-\frac{\|x\|^{2}}{2\sigma^{2}}}. We now define

Trivially, aa and bb are positive functions. Since b=aPr+ϵPg+ϵb=a\frac{P_{r+\epsilon}}{P_{g+\epsilon}}, we know that b>ab>a if and only if Pr+ϵ>Pg+ϵP_{r+\epsilon}>P_{g+\epsilon}, and b<ab<a if and only if Pr+ϵ<Pg+ϵP_{r+\epsilon}<P_{g+\epsilon} as we wanted. Continuing the proof, we know

We recall the definition of the Wasserstein metric W(P,Q)W(P,Q) for PP and QQ two distributions over X\mathcal{X}. Namely,

where Γ\Gamma is the set of all possible joints on X×X\mathcal{X}\times\mathcal{X} that have marginals PP and QQ.

The Wasserstein distance also goes by other names, most commonly the transportation metric and the earth mover’s distance. This last name is most explicative: it’s the minimum cost of transporting the whole probability mass of PP from its support to match the probability mass of QQ on QQ’s support. This identification of transporting points from PP to QQ is done via the coupling γ\gamma. We refer the reader to Villani (2009) for an in-depth explanation of these ideas. It is easy to see now that the Wasserstein metric incorporates the notion of distance (as also seen inside the integral) between the elements in the support of PP and the ones in the support of QQ, and that as the supports of PP and QQ get closer and closer, the metric will go to 0, inducing as well a notion of distance between manifolds.

If ϵ\epsilon is a random vector with mean 0, then we have

where the last inequality was due to Jensen. ∎

The first author would like to especially thank Luis Scoccola for help with the proof of Lemma 1.

The authors would also like to thank Ishmael Belghazi, Yoshua Bengio, Gerry Che, Soumith Chintala, Caglar Gulcehre, Daniel Jiwoong Im, Alex Lamb, Luis Scoccola, Pablo Sprechmann, Arthur Szlam, Jake Zhao for insightful comments and advice.

References

Appendix A Proofs of things

The proof for the second case is technical and slightly more involved. When σ\sigma is a pointwise smooth strictly increasing nonlinearity, then applying it vectorwise it’s a diffeomorphism to its image. Therefore, it sends a countable union of manifolds of dimension dd to a countable union of manifolds of dimension dd. If we can prove the same thing for affine transformations we will be finished, since g(Z)g(\mathcal{Z}) is just a composition of these applied to a dimZ\dim\mathcal{Z} dimensional manifold. Of course, it suffices to prove that an affine transformation sends a manifold to a countable union of manifolds without increasing dimension, since a countable union of countable unions is still a countable union. Furthermore, we only need to show this for linear transformations, since applying a bias term is a diffeomorphism.

Now we consider the case where M\mathcal{M} and P\mathcal{P} are manifolds with boundary. By a simple union bound,

Let m=dimMm=\dim\mathcal{M} and p=dimPp=\dim\mathcal{P}. We again consider first the case where M\mathcal{M} and P\mathcal{P} are manifolds without boundary. If m+p<dm+p<d, then L=\mathcal{L}=\emptyset so the statement is obviously true. If m+pdm+p\geq d, then M\mathcal{M} and P\mathcal{P} intersect transversally. This implies that L\mathcal{L} is a manifold of dimension m+pd<m,pm+p-d<m,p. Since L\mathcal{L} is a submanifold of both M\mathcal{M} and P\mathcal{P} that has lower dimension, it has measure 0 on both of them.

We now tackle the case where M\mathcal{M} and P\mathcal{P} have boundaries. Let us remember that M=Int MM\mathcal{M}=\text{Int }\mathcal{M}\cup\partial\mathcal{M} and the union is disjoint (and analogously for P\mathcal{P}). By using elementary properties of sets, we can trivially see that

where the unions are disjoint. This is the disjoint union of 4 strictly lower dimensional manifolds, by using the first part of the proof. Since each one of these intersections has measure 0 on either the interior or boundary of M\mathcal{M} (again, by the first part of the proof), and interior and boundary are contained in M\mathcal{M}, each one of the four intersections has measure 0 in M\mathcal{M}. Analogously, they have measure 0 in P\mathcal{P}, and by a simple union bound we see that L\mathcal{L} has measure 0 in M\mathcal{M} and P\mathcal{P} finishing the remaining case of the proof. ∎

Appendix B Further Clarifications

In this appendix we further explain some of the terms and ideas mentioned in the paper, which due to space constrains, and to keep the flow of the paper, couldn’t be extremely developed in the main text. Some of these have to do with notation, others with technical elements of the proofs. On the latter case, we try to convey more intuition than we previously could. We present these clarifications in a very informal fashion in the following item list.

The annoying part is that in everyday paper writing when we talk about continuous random variables, we omit the ‘absolutely’ word to keep the text concise and actually talk about absolutely continuous random variables (ones that have a density), this is done through almost all sciences and throughout mathematics as well, annoying as it is. However we made the clarification in here since it’s relevant to our paper not to mistake the two terms.

In the proof of Theorem 2.1, the distance between sets d(A,B)d(A,B) is defined as the usual distance between sets in a metric space

where d(x,y)d(x,y) is the distance between points (in our case the Euclidean distance).

Whatever happens elsewhere is irrelevant (as it is also reflected by the cost of the discriminator)