On gradient regularizers for MMD GANs
Michael Arbel, Danica J. Sutherland, Mikołaj Bińkowski, Arthur Gretton
Introduction
Another class of IPMs used as IGM losses are the Maximum Mean Discrepancies (MMDs) , as in . Here the critic function is a member of a reproducing kernel Hilbert space (except in , who learn a deep approximation to an RKHS critic). Better performance can be obtained, however, when the MMD kernel is not based directly on image pixels, but on learned features of images. Wasserstein-inspired gradient regularization approaches can be used on the MMD critic when learning these features: uses weight clipping , and use a gradient penalty .
We first discuss in Section 2 how MMD-based losses can be used to learn implicit generative models, and how a naive approach could fail. This motivates our new discrepancies, introduced in Section 3. Section 4 demonstrates that these losses outperform state-of-the-art models for image generation.
Learning implicit generative models with MMD-based losses
In the present work, we will build losses based on the Maximum Mean Discrepancy,
Despite these appealing properties, using simple pixel-level kernels leads to poor generator samples . More recent MMD GANs achieve better results by using a parameterized family of kernels, , in the Optimized MMD loss previously studied by :
We can avoid these issues if we ensure a bounded Lipschitz critic:[27, Theorem 4] makes a similar claim to 2, but its proof was incorrect: it tries to uniformly bound , but the bound used is for a Wasserstein in terms of .
The main result is [12, Corollary 11.3.4]. To show the claim for , note that , which since is
Indeed, if we put a box constraint on or regularize the gradient of the critic function , the resulting MMD GAN generally matches or outperforms WGAN-based models. Unfortunately, though, an additive gradient penalty doesn’t substantially change the vector field of Figure 1 (a), as shown in Figure 5 (Appendix B). We will propose distances with much better convergence behavior.
New discrepancies for learning implicit generative models
Our aim here is to introduce a discrepancy that can provide useful gradient information when used as an IGM loss. Proofs of results in this section are deferred to Appendix A.
2 shows that an MMD-like discrepancy can be continuous under the weak topology even when optimizing over kernels, if we directly restrict the critic functions to be Lipschitz. We can easily define such a distance, which we call the Lipschitz MMD: for some ,
2 Gradient-Constrained Maximum Mean Discrepancy
We define the Gradient-Constrained MMD for and using some measure as
where is the kernel matrix , is the matrix of left derivatives We use to denote the partial derivative with respect to , and that for . , and that of derivatives of both arguments .
3 Scaled Maximum Mean Discrepancy
We will now derive a lower bound on the Gradient-Constrained MMD which retains many of its attractive qualities but can be estimated in time linear in the dimension .
Make Assumptions (A), (B), (C) and (D). For any , , where
We then define the Scaled Maximum Mean Discrepancy based on this bound of 4:
Experiments
We evaluated unsupervised image generation on three datasets: CIFAR-10 ( images, ), CelebA ( face images, resized and cropped to as in ), and the more challenging ILSVRC2012 (ImageNet) dataset ( images, resized to ). Code for all of these experiments is available at github.com/MichaelArbel/Scaled-MMD-GAN.
Evaluation To compare the sample quality of different models, we considered three different scores based on the Inception network trained for ImageNet classification, all using default parameters in the implementation of . The Inception Score (IS) is based on the entropy of predicted labels; higher values are better. Though standard, this metric has many issues, particularly on datasets other than ImageNet . The FID instead measures the similarity of samples from the generator and the target as the Wasserstein-2 distance between Gaussians fit to their intermediate representations. It is more sensible than the IS and becoming standard, but its estimator is strongly biased . The KID is similar to FID, but by using a polynomial-kernel MMD its estimates enjoy better statistical properties and are easier to compare. (A similar score was recommended by .)
Results Table 1(a) presents the scores for models trained on both CIFAR-10 and CelebA datasets. On CIFAR-10, SN-SWGAN and SN-SMMDGAN performed comparably to SN-GAN. But on CelebA, SN-SWGAN and SN-SMMDGAN dramatically outperformed the other methods with the same architecture in all three metrics. It also trained faster, and consistently outperformed other methods over multiple initializations (Figure 2 (a)). It is worth noting that SN-SWGAN far outperformed WGAN-GP on both datasets. Table 1(b) presents the scores for SMMDGAN and SN-SMMDGAN trained on ImageNet, and the scores of pre-trained models using BGAN and SN-GAN .These models are courtesy of the respective authors and also trained at resolution. SN-GAN used the same architecture as our model, but trained for generator iterations; BS-GAN used a similar 5-layer ResNet architecture and trained for 74 epochs, comparable to SN-GAN. The proposed methods substantially outperformed both methods in FID and KID scores. Figure 3 shows samples on ImageNet and CelebA; Section F.4 has more.
Spectrally normalized WGANs / MMDGANs To control for the contribution of the spectral parametrization to the performance, we evaluated variants of MMDGANs, WGANs and Sobolev-GAN using spectral normalization (in Table 2, Section F.3). WGAN and Sobolev-GAN led to unstable training and didn’t converge at all (Figure 11) despite many attempts. MMDGAN converged on CIFAR-10 (Figure 11) but was unstable on CelebA (Figure 10). The gradient control due to SN is thus probably too loose for these methods. This is reinforced by Figure 2 (c), which shows that the expected gradient of the critic network is much better-controlled by SMMD, even when SN is used. We also considered variants of these models with a learned while also adding a gradient penalty and an penalty on critic activations [7, footnote 19]. These generally behaved similarly to MMDGAN, and didn’t lead to substantial improvements. We ran the same experiments on CelebA, but aborted the runs early when it became clear that training was not successful.
Rank collapse We occasionally observed the failure mode for SMMD where the critic becomes low-rank, discussed in Section 3.3, especially on CelebA; this failure was obvious even in the training objective. Figure 2 (b) is one of these examples. Spectral parametrization seemed to prevent this behavior. We also found one could avoid collapse by reverting to an earlier checkpoint and increasing the RKHS regularization parameter , but did not do this for any of the experiments here.
Conclusion
Another area to explore is the geometry of these losses, as studied by , who showed potential advantages of the Wasserstein geometry over the MMD. Their results, though, do not address any distances based on optimized kernels; the new distances introduced here might have interesting geometry of their own.
pages40 rangepages20 rangepages19 rangepages15 rangepages5 rangepages22 rangepages55
References
Appendix A Proofs
We use a slightly nonstandard notation for derivatives: denotes the th partial derivative of evaluated at , and denotes .
Then the following reproducing properties hold for any given function in [47, Lemma 4.34]:
Given two vectors and in and a Hilbert-Schmidt operator we have the following properties:
Define the following covariance-type operators:
these are useful in that, using (8) and (9), .
grows at most linearly in : for all in , for some constant .
The kernel is twice continuously differentiable.
The functions and for are -integrable.
When , Assumption (B) is automatically satisfied by a such as the Gaussian; when is linear, it is true for a quite general class of networks [7, Lemma 1].
We will first give a form for the Gradient-Constrained MMD (5) in terms of the operator (10):
Under Assumptions (A), (B), (C) and (D), the Gradient-Constrained MMD is given by
Let be a function in . We will first express the squared -regularized Sobolev norm of (6) as a quadratic form in . Recalling the reproducing properties of (8) and (9), we have:
Using Property (ii) and the operator (10), one further gets
where is defined as this difference in mean embeddings.
Since is symmetric positive definite, its square-root is well-defined and is also invertible. For any , let , so that . Note that for any , there is a corresponding . Thus we can re-express the maximization problem in (5) in terms of :
5, though, involves inverting the infinite-dimensional operator and thus doesn’t directly give us a computable estimator. 3 solves this problem in the case where is a discrete measure:
where is the kernel matrix , is the matrix of left derivatives , and that of derivatives of both arguments .
Then , and . Thus we can write
Let be the solution to the regression problem :
Taking the inner product of both sides of (12) with for each yields the following equations:
Doing the same with gives equations:
From (12), it is clear that is a linear combination of the form:
where the coefficients and satisfy the system of equations (13) and (14). We can rewrite this system as
where , are the identity matrices of dimension , . Since and must be positive semidefinite, an inverse exists. We conclude by noticing that
The following result was key to our definition of the SMMD in Section 3.3.
Under Assumptions (A), (B), (C) and (D), we have for all that
The key idea here is to use the Cauchy-Schwarz inequality for the Hilbert-Schmidt inner product. Letting , is
follows from the reproducing properties (8) and (9) and Property (ii). is obtained using Property (iii), while follows from the Cauchy-Schwarz inequality and Property (i). ∎
The operator is positive self-adjoint. It is also trace-class, as by the triangle inequality
A.2 Continuity of the Optimized Scaled MMD in the Wasserstein topology
To prove Theorem 1, we we will first need some new notation.
The parameter is the concatenation of all the layer parameters:
is the set of those parameters such that have a small condition number, . is the set of per-layer normalized parameters with a condition number bounded by .
Recall the definition of Scaled MMD, Equation 7, where and is a probability measure:
The Optimized SMMD over the restricted set is given by:
The constraint to is critical to the proof. In practice, using a spectral parametrization helps enforce this assumption, as shown in Figures 2 and 9. Other regularization methods, like orthogonal normalization , are also possible.
is a probability distribution absolutely continuous with respect to the Lebesgue measure.
The dimensions of the weights are decreasing per layer: for all .
The non-linearity used is Leaky-ReLU, (16), with leak coefficient .
There is some for which satisfies
Assumption (II) helps ensure that the span of is never contained in the null space of . Using Leaky-ReLU as a non-linearity, Assumption (III), further ensures that the network is locally full-rank almost everywhere; this might not be true with ReLU activations, where it could be always . Assumptions (II) and (III) can be easily satisfied by design of the network.
Assumptions (IV) and (V) only depend on the top-level kernel and are easy to satisfy in practice. In particular, they always hold for a smooth translation-invariant kernel, such as the Gaussian, as well as the linear kernel.
Under Assumptions (I), (II), (III), (IV) and (V),
Define the pseudo-distance corresponding to the kernel
where is the standard Wasserstein distance (2), and so
We have that \partial_{i}\partial_{i+d}k(x,y)=\left[\partial_{i}\phi_{\psi}(x)\right]^{\mathsf{T}}\left[\nabla_{a}\nabla_{b}K(a,b)\bigr{|}_{(a,b)=(\phi_{\psi}(x),\phi_{\psi}(y))}\right]\left[\partial_{i}\phi_{\psi}(y)\right], where the middle term is a matrix and the outer terms are vectors of length . Thus Assumption (V) implies that , and hence
Using 8, we can write with . Then we have
Take a sample and a function with . By the Cauchy-Schwarz inequality,
Taking the expectation with respect to , we obtain
the result follows by taking the supremum over . ∎
Let . There exists a corresponding scalar and , defined by (18), such that for all ,
Set , , and . Note that the condition number is unchanged, , and , so . It is also easy to see from (16) that
Make Assumptions (II) and (III), and let . Then the set of inputs for which any intermediate activation is exactly zero,
has zero Lebesgue measure. Moreover, for any , exists and
it is undefined when any , i.e. when . Let . Then
where , , and , so long as .
Because , we have and ; also, , . Thus , and using Assumption (II) with 10 gives . In particular, each is full-rank.
Next, note that and each only depend on through the activation patterns . Letting denote the full activation patterns up to level , we can thus write
There are only finitely many possible values for ; we denote the set of such values as . Then we have that
Because each is of rank , each set in the union is either empty or an affine subspace of dimension . As each , each set in the finite union has zero Lebesgue measure, and also has zero Lebesgue measure.
Thus, take some , and find the smallest absolute value of its activations, ; clearly . For any with , we know that for all and ,
implying that as well as . Thus for any point , . Finally, we obtain
A more general version of this result can be found in [19, Theorem 2]; we provide a proof here for completeness.
Here we analyze through simple examples what happens when the condition number can be unbounded, and when Assumption (II), about decreasing widths of the network, is violated.
where . As approaches the matrix becomes singular which means that its condition number blows up. We are interested in analyzing the behavior of the Lipschitz constant of and the expected squared norm of its gradient under as approaches .
One can easily compute the squared norm of the gradient of which is given by
Here , , and are defined by Equation 23 and are represented in Figure 4:
We would like to consider a second example where Assumption (II) doesn’t hold. Consider the following two layer network defined by:
for . Note that is a full rank matrix, but Assumption (II) doesn’t hold. Depending on the sign of the components of one has the following expression for :
where are defined by Equation 26
The squared Lipschitz constant is given by while the expected squared norm of the gradient of is given by:
Appendix B DiracGAN vector fields for more losses
Figure 5 shows parameter vector fields, like those in Figure 6, for Example 1 for a variety of different losses:
Appendix C Vector fields of Gradient-Constrained MMD and Sobolev GAN critics
This unintuitive behavior is most likely related to the vanishing boundary condition, assummed by Sobolev GAN. Solving the actual Sobolev PDE, we found that the Sobolev critic has very high gradients close to the boundary in order to match the condition; moreover, these gradients point in opposite directions to the target distribution.
Appendix D An estimator for Lipschitz MMD
We now describe briefly how to estimate the Lipschitz MMD in low dimensions. Recall that
For , it is the case that
By the generalized representer theorem, the optimal for (29) will be of the form
Writing , the objective function is linear in ,
The constraints are quadratic, built from the following matrices, where the and samples are concatenated together, as are the derivatives with each dimension of the samples:
Thus the optimization problem (29) is a linear problem with convex quadratic constraints, which can be solved by standard convex optimization software. The approximation is reasonable only if we can effectively cover the region of interest with densely spaced ; it requires a nontrivial amount of computation even for the very simple 1-dimensional toy problem of Example 1.
Appendix E Near-equivalence of WGAN and linear-kernel MMD GANs
For an MMD GAN-GP with kernel , we have that
(MMD GANs, however, would typically train on the unbiased estimator of , giving a very slightly different loss function. also applied the gradient penalty to rather than the true critic .)
The SMMD with a linear kernel is thus analogous to applying the scaling operator to a WGAN; hence the name SWGAN.
Appendix F Additional experiments
Figure 7 shows the behavior of the MMD, the Gradient-Constrained SMMD, and the Scaled MMD when comparing Gaussian distributions. We can see that and the Gradient-Constrained MMD behave similarly in this case, and that optimizing the and the Gradient-Constrained MMD is also similar. Optimizing the MMD would yield an essentially constant distance.
F.2 IGMs with Optimized Gradient-Constrained MMD loss
The learned models, however, were reasonable. Using a DCGAN architecture, batches of size 64, and a procedure that otherwise agreed with the setup of Section 4, samples with and without spectral normalization are shown in Figures 8(a) and 8(b). After the points in training shown, however, the same rank collapse as discussed in Section 4 occurred. Here it seems that spectral normalization may have delayed the collapse, but not prevented it. Figure 8(c) shows generator loss estimates through training, including the obvious peak at collapse; Figure 8(d) shows KID scores based on the MNIST-trained convnet representation , including comparable SMMD models for context. The fact that SMMD models converged somewhat faster than Gradient-Constrained MMD models here may be more related to properties of the estimator of 3 rather than the distances; more work would be needed to fully compare the behavior of the two distances.
F.3 Spectral normalization and Scaled MMD
Figure 9 shows the distribution of critic weight singular values, like Figure 2, at more layers. Figures 11 and 2 show results for the spectral normalization variants considered in the experiments. MMDGAN, with neither spectral normalization nor a gradient penalty, did surprisingly well in this case, though it fails badly in other situations.
Figure 9 compares the decay of singular values for layer of the critic’s network at both early and later stages of training in two cases: with or without the spectral parametrization. The model was trained on CelebA using SMMD. Figure 11 shows the evolution per iteration of Inception score, FID and KID for Sobolev-GAN, MMDGAN and variants of MMDGAN and WGAN using spectral normalization. It is often the case that this parametrization alone is not enough to achieve good results.
F.4 Additional samples
Figures 12 and 13 give extra samples from the models.