On gradient regularizers for MMD GANs

Michael Arbel, Danica J. Sutherland, Mikołaj Bińkowski, Arthur Gretton

Introduction

Another class of IPMs used as IGM losses are the Maximum Mean Discrepancies (MMDs) , as in . Here the critic function is a member of a reproducing kernel Hilbert space (except in , who learn a deep approximation to an RKHS critic). Better performance can be obtained, however, when the MMD kernel is not based directly on image pixels, but on learned features of images. Wasserstein-inspired gradient regularization approaches can be used on the MMD critic when learning these features: uses weight clipping , and use a gradient penalty .

We first discuss in Section 2 how MMD-based losses can be used to learn implicit generative models, and how a naive approach could fail. This motivates our new discrepancies, introduced in Section 3. Section 4 demonstrates that these losses outperform state-of-the-art models for image generation.

Learning implicit generative models with MMD-based losses

In the present work, we will build losses $\operatorname{\mathcal{D}}$ based on the Maximum Mean Discrepancy,

Despite these appealing properties, using simple pixel-level kernels leads to poor generator samples . More recent MMD GANs achieve better results by using a parameterized family of kernels, $\{k_{\psi}\}_{\psi\in\Psi}$ , in the Optimized MMD loss previously studied by :

We can avoid these issues if we ensure a bounded Lipschitz critic. ([27, Theorem 4] makes a similar claim to Proposition 2, but its proof was incorrect: it tries to uniformly bound $\operatorname{MMD}_{k_{\psi}}\leq\operatorname{\mathcal{W}}^{2}$, but the bound used is for a Wasserstein distance defined in terms of $\lVert k_{\psi}(x,\cdot)-k_{\psi}(y,\cdot)\rVert_{\mathcal{H}_{k_{\psi}}}$.)

The main result is [12, Corollary 11.3.4]. To show the claim for $k_{\psi}=K\circ\phi_{\psi}$ , note that $\lvert f_{\psi}(x)-f_{\psi}(y)\rvert\leq\lVert f_{\psi}\rVert_{\mathcal{H}_{k_{\psi}}}\lVert k_{\psi}(x,\cdot)-k_{\psi}(y,\cdot)\rVert_{\mathcal{H}_{k_{\psi}}}$ , which since $\lVert f_{\psi}\rVert_{\mathcal{H}_{k_{\psi}}}=1$ is

Indeed, if we put a box constraint on $\psi$ or regularize the gradient of the critic function , the resulting MMD GAN generally matches or outperforms WGAN-based models. Unfortunately, though, an additive gradient penalty doesn’t substantially change the vector field of Figure 1 (a), as shown in Figure 5 (Appendix B). We will propose distances with much better convergence behavior.

New discrepancies for learning implicit generative models

Our aim here is to introduce a discrepancy that can provide useful gradient information when used as an IGM loss. Proofs of results in this section are deferred to Appendix A.

Proposition 2 shows that an MMD-like discrepancy can be continuous under the weak topology even when optimizing over kernels, if we directly restrict the critic functions to be Lipschitz. We can easily define such a distance, which we call the Lipschitz MMD: for some $\lambda>0$ ,

3.2 Gradient-Constrained Maximum Mean Discrepancy

We define the Gradient-Constrained MMD for $\lambda>0$ and using some measure $\mu$ as

where $K$ is the kernel matrix $K_{m,m^{\prime}}=k(X_{m},X_{m^{\prime}})$, $G$ is the matrix of left derivatives $G_{(m,i),m^{\prime}}=\partial_{i}k(X_{m},X_{m^{\prime}})$, and $H$ that of derivatives of both arguments $H_{(m,i),(m^{\prime},j)}=\partial_{i}\partial_{j+d}k(X_{m},X_{m^{\prime}})$. (We use $\partial_{i}k(x,y)$ to denote the partial derivative with respect to $x_{i}$, and $\partial_{i+d}k(x,y)$ that for $y_{i}$.)
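As an illustration of the three matrices just described, the following is a minimal NumPy sketch that builds $K$, $G$ and $H$ in closed form. It assumes a Gaussian kernel $k(x,y)=\exp(-\lVert x-y\rVert^{2}/(2s^{2}))$ applied directly to the inputs; the function name `gaussian_kgh` and the `bandwidth` parameter are hypothetical and not taken from the released code.

```python
import numpy as np

def gaussian_kgh(X, bandwidth=1.0):
    """Return K (M x M), G (Md x M) and H (Md x Md) for a Gaussian kernel.

    Assumption: k(x, y) = exp(-||x - y||^2 / (2 * bandwidth^2)) on raw inputs.
    Rows and columns indexed by (m, i) are flattened as m * d + i.
    """
    M, d = X.shape
    s2 = bandwidth ** 2
    diff = X[:, None, :] - X[None, :, :]               # diff[m, m', i] = X[m, i] - X[m', i]
    K = np.exp(-np.sum(diff ** 2, axis=2) / (2 * s2))  # K[m, m'] = k(X_m, X_m')

    # G[(m, i), m'] = d/dx_i k(x, y) at (X_m, X_m') = -(x_i - y_i) / s2 * k
    G = (-(diff / s2) * K[:, :, None]).transpose(0, 2, 1).reshape(M * d, M)

    # H[(m, i), (m', j)] = d^2 k / (dx_i dy_j)
    #                    = (delta_ij / s2 - (x_i - y_i)(x_j - y_j) / s2^2) * k
    outer = diff[:, :, :, None] * diff[:, :, None, :]  # indexed [m, m', i, j]
    core = (np.eye(d) / s2 - outer / s2 ** 2) * K[:, :, None, None]
    H = core.transpose(0, 2, 1, 3).reshape(M * d, M * d)
    return K, G, H
```

The block matrix `np.block([[K, G.T], [G, H]])` is a Gram matrix of the functions $k(X_{m},\cdot)$ and $\partial_{i}k(X_{m},\cdot)$, so it is symmetric positive semidefinite. Its side length is $M(d+1)$, which is why working with it becomes expensive when $d$ is large.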

3.3 Scaled Maximum Mean Discrepancy

We will now derive a lower bound on the Gradient-Constrained MMD which retains many of its attractive qualities but can be estimated in time linear in the dimension $d$ .

Make Assumptions (A), (B), (C) and (D). For any $f\in\mathcal{H}_{k}$ , $\lVert f\rVert_{S(\mu),k,\lambda}\leq\sigma_{\mu,k,\lambda}^{-1}\lVert f\rVert_{\mathcal{H}_{k}}$ , where

We then define the Scaled Maximum Mean Discrepancy based on this bound of Proposition 4:
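To make the scaling concrete, here is a small NumPy sketch of the scale factor for an empirical measure $\mu$. It assumes the factor has the form $\sigma_{\mu,k,\lambda}=\bigl(\lambda+\int k(x,x)\,\mu(\mathrm{d}x)+\sum_{i=1}^{d}\int\partial_{i}\partial_{i+d}k(x,x)\,\mu(\mathrm{d}x)\bigr)^{-1/2}$, which is what a Cauchy-Schwarz bound on $f(x)^{2}$ and $(\partial_{i}f(x))^{2}$ suggests, and that the SMMD is the MMD multiplied by this factor. The function names are hypothetical, and the Gaussian example uses a kernel on raw inputs rather than on learned features.

```python
import numpy as np

def smmd_scale(k_diag, grad_diag, lam):
    """Assumed scale factor sigma_{mu,k,lambda} for an empirical measure mu.

    k_diag[m]    = k(X_m, X_m)
    grad_diag[m] = sum_i d_i d_{i+d} k(X_m, X_m)
    """
    return 1.0 / np.sqrt(lam + np.mean(k_diag) + np.mean(grad_diag))

def gaussian_smmd_scale(X, bandwidth, lam):
    """Scale factor for k(x, y) = exp(-||x - y||^2 / (2 * bandwidth^2)).

    For this kernel k(x, x) = 1 and d_i d_{i+d} k(x, x) = 1 / bandwidth^2,
    so the factor does not depend on the sample locations.
    """
    M, d = X.shape
    k_diag = np.ones(M)
    grad_diag = np.full(M, d / bandwidth ** 2)
    return smmd_scale(k_diag, grad_diag, lam)
```

Under these assumptions the cost is one kernel evaluation and $d$ second-derivative evaluations per sample, so it grows linearly in $d$. With a learned feature map the derivative term is no longer constant and would be computed from the critic's gradients at the samples.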

Experiments

We evaluated unsupervised image generation on three datasets: CIFAR-10 ( $60\,000$ images, $32\times 32$ ), CelebA ( $202\,599$ face images, resized and cropped to $160\times 160$ as in ), and the more challenging ILSVRC2012 (ImageNet) dataset ( $1\,281\,167$ images, resized to $64\times 64$ ). Code for all of these experiments is available at github.com/MichaelArbel/Scaled-MMD-GAN.

Evaluation To compare the sample quality of different models, we considered three different scores based on the Inception network trained for ImageNet classification, all using default parameters in the implementation of . The Inception Score (IS) is based on the entropy of predicted labels; higher values are better. Though standard, this metric has many issues, particularly on datasets other than ImageNet . The FID instead measures the similarity of samples from the generator and the target as the Wasserstein-2 distance between Gaussians fit to their intermediate representations. It is more sensible than the IS and becoming standard, but its estimator is strongly biased . The KID is similar to FID, but by using a polynomial-kernel MMD its estimates enjoy better statistical properties and are easier to compare. (A similar score was recommended by .)
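For readers unfamiliar with the KID, the following NumPy sketch shows the core computation on two arrays of precomputed feature vectors: an unbiased estimate of the squared MMD with the cubic polynomial kernel $k(x,y)=(x^{\mathsf{T}}y/d+1)^{3}$. The function names are hypothetical. Extracting the Inception features and averaging over several subsets of samples, as evaluation code typically does, are both omitted.

```python
import numpy as np

def poly_kernel(A, B, degree=3, coef0=1.0):
    """Polynomial kernel (a . b / d + coef0)^degree between rows of A and B."""
    d = A.shape[1]
    return (A @ B.T / d + coef0) ** degree

def kid_estimate(feat_gen, feat_real):
    """Unbiased estimate of MMD^2 between two sets of feature vectors."""
    m, n = len(feat_gen), len(feat_real)
    Kxx = poly_kernel(feat_gen, feat_gen)
    Kyy = poly_kernel(feat_real, feat_real)
    Kxy = poly_kernel(feat_gen, feat_real)
    # Drop the diagonal terms so that each within-sample average is unbiased.
    term_xx = (Kxx.sum() - np.trace(Kxx)) / (m * (m - 1))
    term_yy = (Kyy.sum() - np.trace(Kyy)) / (n * (n - 1))
    return term_xx + term_yy - 2.0 * Kxy.mean()
```

Because the estimator is unbiased, it can take small negative values when the two feature distributions are close.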

Results Table 1(a) presents the scores for models trained on both CIFAR-10 and CelebA datasets. On CIFAR-10, SN-SWGAN and SN-SMMDGAN performed comparably to SN-GAN. But on CelebA, SN-SWGAN and SN-SMMDGAN dramatically outperformed the other methods with the same architecture in all three metrics. It also trained faster, and consistently outperformed other methods over multiple initializations (Figure 2 (a)). It is worth noting that SN-SWGAN far outperformed WGAN-GP on both datasets. Table 1(b) presents the scores for SMMDGAN and SN-SMMDGAN trained on ImageNet, and the scores of pre-trained models using BGAN and SN-GAN. (These models are courtesy of the respective authors and also trained at $64\times 64$ resolution. SN-GAN used the same architecture as our model, but trained for $250\,000$ generator iterations; BGAN used a similar 5-layer ResNet architecture and trained for 74 epochs, comparable to SN-GAN.) The proposed methods substantially outperformed both methods in FID and KID scores. Figure 3 shows samples on ImageNet and CelebA; Section F.4 has more.

Spectrally normalized WGANs / MMDGANs To control for the contribution of the spectral parametrization to the performance, we evaluated variants of MMDGANs, WGANs and Sobolev-GAN using spectral normalization (in Table 2, Section F.3). WGAN and Sobolev-GAN led to unstable training and didn’t converge at all (Figure 11) despite many attempts. MMDGAN converged on CIFAR-10 (Figure 11) but was unstable on CelebA (Figure 10). The gradient control due to SN is thus probably too loose for these methods. This is reinforced by Figure 2 (c), which shows that the expected gradient of the critic network is much better-controlled by SMMD, even when SN is used. We also considered variants of these models with a learned $\gamma$ while also adding a gradient penalty and an $L_{2}$ penalty on critic activations [7, footnote 19]. These generally behaved similarly to MMDGAN, and didn’t lead to substantial improvements. We ran the same experiments on CelebA, but aborted the runs early when it became clear that training was not successful.
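As background for these comparisons, the sketch below shows the basic operation behind spectral normalization: estimate the largest singular value of a weight matrix by power iteration and divide the matrix by it. This is a generic NumPy illustration with hypothetical function names, not the parametrization used in the experiments. Implementations inside a training loop typically run only one iteration per step and keep the vector `u` between steps.

```python
import numpy as np

def spectral_norm_power_iter(W, n_iter=200, seed=0):
    """Estimate the largest singular value of W by power iteration."""
    rng = np.random.default_rng(seed)
    u = rng.normal(size=W.shape[0])
    for _ in range(n_iter):
        v = W.T @ u
        v /= np.linalg.norm(v)
        u = W @ v
        u /= np.linalg.norm(u)
    # With u proportional to W v, the product u^T W v equals ||W v||.
    return float(u @ W @ v)

def spectrally_normalize(W, n_iter=200):
    """Rescale W so that its spectral norm is approximately 1."""
    return W / spectral_norm_power_iter(W, n_iter)
```

Normalizing every layer this way bounds the Lipschitz constant of the network by 1 when the activations are 1-Lipschitz, but it does not constrain the smaller singular values of the weights.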

Rank collapse We occasionally observed the failure mode for SMMD where the critic becomes low-rank, discussed in Section 3.3, especially on CelebA; this failure was obvious even in the training objective. Figure 2 (b) is one of these examples. Spectral parametrization seemed to prevent this behavior. We also found one could avoid collapse by reverting to an earlier checkpoint and increasing the RKHS regularization parameter $\lambda$ , but did not do this for any of the experiments here.

Conclusion

Another area to explore is the geometry of these losses, as studied by , who showed potential advantages of the Wasserstein geometry over the MMD. Their results, though, do not address any distances based on optimized kernels; the new distances introduced here might have interesting geometry of their own.


References

Appendix A Proofs

We use a slightly nonstandard notation for derivatives: $\partial_{i}f(x)$ denotes the $i$ th partial derivative of $f$ evaluated at $x$ , and $\partial_{i}\partial_{j+d}k(x,y)$ denotes $\frac{\partial^{2}k(a,b)}{\partial a_{i}\partial b_{j}}|_{(a,b)=(x,y)}$ .

Then the following reproducing properties hold for any given function $f$ in $\mathcal{H}$ [47, Lemma 4.34]:

Given two vectors $f$ and $g$ in $\mathcal{H}$ and a Hilbert-Schmidt operator $A$ we have the following properties:

Define the following covariance-type operators:

these are useful in that, using (8) and (9), $\langle f,D_{x}g\rangle=f(x)g(x)+\sum_{i=1}^{d}\partial_{i}f(x)\,\partial_{i}g(x)$ .

$\sqrt{k(x,x)}$ grows at most linearly in $x$ : for all $x$ in $\mathcal{X}$ , $\sqrt{k(x,x)}\leq C(\lVert x\rVert+1)$ for some constant $C$ .

The kernel $k$ is twice continuously differentiable.

The functions $x\mapsto k(x,x)$ and $x\mapsto\partial_{i}\partial_{i+d}k(x,x)$ for $1\leq i\leq d$ are $\mu$ -integrable.

When $k=K\circ\phi_{\psi}$ , Assumption (B) is automatically satisfied by a $K$ such as the Gaussian; when $K$ is linear, it is true for a quite general class of networks $\phi_{\psi}$ [7, Lemma 1].

We will first give a form for the Gradient-Constrained MMD (5) in terms of the operator (10):

Under Assumptions (A), (B), (C) and (D), the Gradient-Constrained MMD is given by

Let $f$ be a function in $\mathcal{H}$ . We will first express the squared $\lambda$ -regularized Sobolev norm of $f$ (6) as a quadratic form in $\mathcal{H}$ . Recalling the reproducing properties of (8) and (9), we have:

Using Property (ii) and the operator (10), one further gets

where $\eta$ is defined as this difference in mean embeddings.

Since $D_{\mu,\lambda}$ is symmetric positive definite, its square-root $D_{\mu,\lambda}^{\frac{1}{2}}$ is well-defined and is also invertible. For any $f\in\mathcal{H}$ , let $g=D_{\mu,\lambda}^{\frac{1}{2}}f$ , so that $\langle f,D_{\mu,\lambda}f\rangle_{\mathcal{H}}=\|g\|_{\mathcal{H}}^{2}$ . Note that for any $g\in\mathcal{H}$ , there is a corresponding $f=D_{\mu,\lambda}^{-\frac{1}{2}}g$ . Thus we can re-express the maximization problem in (5) in terms of $g$ :

Proposition 5, though, involves inverting the infinite-dimensional operator $D_{\mu,\lambda}$ and thus doesn’t directly give us a computable estimator. Proposition 3 solves this problem in the case where $\mu$ is a discrete measure:

where $K$ is the kernel matrix $K_{m,m^{\prime}}=k(X_{m},X_{m^{\prime}})$ , $G$ is the matrix of left derivatives $G_{(m,i),m^{\prime}}=\partial_{i}k(X_{m},X_{m^{\prime}})$ , and $H$ that of derivatives of both arguments $H_{(m,i),(m^{\prime},j)}=\partial_{i}\partial_{j+d}k(X_{m},X_{m^{\prime}})$ .

Then $\begin{bmatrix}\eta(X)\\ \nabla\eta(X)\end{bmatrix}=T\eta$ , and $\begin{bmatrix}K&G^{\mathsf{T}}\\ G&H\end{bmatrix}=TT^{*}$ . Thus we can write

Let $g\in\mathcal{H}$ be the solution to the regression problem $D_{\mu,\lambda}g=\eta$ :

Taking the inner product of both sides of (12) with $k(X_{m^{\prime}},\cdot)$ for each $1\leq m^{\prime}\leq M$ yields the following $M$ equations:

Doing the same with $\partial_{j}k(X_{m^{\prime}},\cdot)$ gives $Md$ equations:

From (12), it is clear that $g$ is a linear combination of the form:

where the coefficients $\alpha:=\left(\alpha_{m}=g(X_{m})\right)_{1\leq m\leq M}$ and $\beta:=\left(\beta_{m,i}=\partial_{i}g(X_{m})\right)_{\begin{subarray}{c}1\leq m\leq M\\ 1\leq i\leq d\end{subarray}}$ satisfy the system of equations (13) and (14). We can rewrite this system as

where $I_{M}$ , $I_{Md}$ are the identity matrices of dimension $M$ , $Md$ . Since $K$ and $H$ must be positive semidefinite, an inverse exists. We conclude by noticing that

The following result was key to our definition of the SMMD in Section 3.3.

Under Assumptions (A), (B), (C) and (D), we have for all $f\in\mathcal{H}$ that

The key idea here is to use the Cauchy-Schwarz inequality for the Hilbert-Schmidt inner product. Letting $f\in\mathcal{H}$ , $\|f\|_{S(\mu),k,\lambda}^{2}$ is

$(a)$ follows from the reproducing properties (8) and (9) and Property (ii). $(b)$ is obtained using Property (iii), while $(c)$ follows from the Cauchy-Schwarz inequality and Property (i). ∎

The operator $D_{x}$ is positive self-adjoint. It is also trace-class, as by the triangle inequality

A.2 Continuity of the Optimized Scaled MMD in the Wasserstein topology

To prove Theorem 1, we will first need some new notation.

The parameter $\psi$ is the concatenation of all the layer parameters:

$\Psi^{\kappa}$ is the set of those parameters such that $W^{l}$ have a small condition number, $\operatorname{cond}(W)=\sigma_{\max}(W)/\sigma_{\min}(W)$ . $\Psi_{1}^{\kappa}$ is the set of per-layer normalized parameters with a condition number bounded by $\kappa$ .

Recall the definition of Scaled MMD, Equation 7, where $\lambda>0$ and $\mu$ is a probability measure:

The Optimized SMMD over the restricted set $\Psi^{\kappa}$ is given by:

The constraint to $\psi\in\Psi^{\kappa}$ is critical to the proof. In practice, using a spectral parametrization helps enforce this assumption, as shown in Figures 2 and 9. Other regularization methods, like orthogonal normalization , are also possible.

$\mu$ is a probability distribution absolutely continuous with respect to the Lebesgue measure.

The dimensions of the weights are decreasing per layer: $d_{l+1}\leq d_{l}$ for all $0\leq l\leq L-1$ .

The non-linearity used is Leaky-ReLU, (16), with leak coefficient $\alpha\in(0,1)$ .

There is some $\gamma_{K}>0$ for which $K$ satisfies

Assumption (II) helps ensure that the span of $W^{l}$ is never contained in the null space of $W^{l+1}$ . Using Leaky-ReLU as a non-linearity, Assumption (III), further ensures that the network $\phi_{\psi}$ is locally full-rank almost everywhere; this might not be true with ReLU activations, where it could be always $0$. Assumptions (II) and (III) can be easily satisfied by design of the network.

Assumptions (IV) and (V) only depend on the top-level kernel $K$ and are easy to satisfy in practice. In particular, they always hold for a smooth translation-invariant kernel, such as the Gaussian, as well as the linear kernel.

Under Assumptions (I), (II), (III), (IV) and (V),

Define the pseudo-distance corresponding to the kernel $k_{\psi}$

where $\operatorname{\mathcal{W}}$ is the standard Wasserstein distance (2), and so

We have that $\partial_{i}\partial_{i+d}k(x,y)=\left[\partial_{i}\phi_{\psi}(x)\right]^{\mathsf{T}}\left[\nabla_{a}\nabla_{b}K(a,b)\bigr{|}_{(a,b)=(\phi_{\psi}(x),\phi_{\psi}(y))}\right]\left[\partial_{i}\phi_{\psi}(y)\right],$ where the middle term is a $d_{L}\times d_{L}$ matrix and the outer terms are vectors of length $d_{L}$ . Thus Assumption (V) implies that $\partial_{i}\partial_{i+d}k(x,x)\geq\gamma_{K}^{2}\lVert\partial_{i}\phi_{\psi}(x)\rVert^{2}$ , and hence

Using Lemma 8, we can write $\phi_{\psi}(X)=\alpha(\psi)\phi_{\bar{\psi}}(X)$ with $\bar{\psi}\in\Psi^{\kappa}_{1}$ . Then we have

Take a sample $(X,Y)\sim\pi^{\star}$ and a function $f\in\mathcal{H}$ with $\lVert f\rVert_{\mathcal{H}}\leq 1$ . By the Cauchy-Schwarz inequality,

Taking the expectation with respect to $\pi^{\star}$ , we obtain

the result follows by taking the supremum over $f$ . ∎

Let $\psi=((W^{L},b^{L}),(W^{L-1},b^{L-1}),\dots,(W^{1},b^{1}))\in\Psi^{\kappa}$ . There exists a corresponding scalar $\alpha(\psi)$ and $\bar{\psi}=((\bar{W}^{L},\bar{b}^{L}),(\bar{W}^{L-1},\bar{b}^{L-1}),\dots,(\bar{W}^{1},\bar{b}^{1}))\in\Psi^{\kappa}_{1}$ , defined by (18), such that for all $X$ ,

Set $\bar{W}^{l}=\frac{1}{\lVert W^{l}\rVert}W^{l}$ , $\bar{b}^{l}=\frac{1}{\prod_{m=1}^{l}\lVert W^{m}\rVert}b^{l}$ , and $\alpha(\psi)=\prod_{l=1}^{L}\lVert W^{l}\rVert$ . Note that the condition number is unchanged, $\operatorname{cond}(\bar{W}^{l})=\operatorname{cond}(W^{l})\leq\kappa$ , and $\lVert\bar{W}^{l}\rVert=1$ , so $\bar{\psi}\in\Psi^{\kappa}_{1}$ . It is also easy to see from (16), because Leaky-ReLU is positively homogeneous, that $\phi_{\psi}(X)=\alpha(\psi)\,\phi_{\bar{\psi}}(X)$ for all $X$ .
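The rescaling can be checked numerically. The NumPy sketch below assumes the network has the form $h^{1}=W^{1}X+b^{1}$ and $h^{l}=W^{l}a(h^{l-1})+b^{l}$ for $l>1$, with $a$ the Leaky-ReLU; this form, the function names, and the choice to order layers from $1$ to $L$ are assumptions made for the illustration.

```python
import numpy as np

def leaky_relu(h, leak):
    return np.where(h > 0, h, leak * h)

def phi(params, X, leak=0.2):
    """Assumed network form: affine layers with Leaky-ReLU in between.

    params is a list of (W, b) pairs ordered from layer 1 to layer L.
    """
    h = X
    for l, (W, b) in enumerate(params):
        if l > 0:
            h = leaky_relu(h, leak)
        h = W @ h + b
    return h

def normalize_params(params):
    """Return alpha(psi) and the per-layer normalized parameters."""
    scale = 1.0
    normalized = []
    for W, b in params:
        norm = np.linalg.norm(W, 2)   # spectral norm of W^l
        scale *= norm                 # running product of ||W^m|| for m <= l
        normalized.append((W / norm, b / scale))
    return scale, normalized
```

For any input `X`, `phi(params, X)` should then agree with `alpha * phi(params_bar, X)` up to floating-point error, where `alpha, params_bar = normalize_params(params)`.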

Make Assumptions (II) and (III), and let $\psi\in\Psi^{\kappa}_{1}$ . Then the set of inputs for which any intermediate activation is exactly zero,

has zero Lebesgue measure. Moreover, for any $X\notin\operatorname{\mathcal{N}}_{\psi}$ , $\nabla_{X}\phi_{\psi}(X)$ exists and

it is undefined when any $h^{l}_{k}(X)=0$ , i.e. when $X\in\operatorname{\mathcal{N}}_{\psi}$ . Let $V^{l}_{X}:=W^{l}\operatorname{diag}\left(M_{X}^{l-1}\right)$ . Then

where $\underline{b}^{0}_{X}=0$ , $\underline{b}^{l}_{X}=V^{l}_{X}\underline{b}^{l-1}+b^{l}$ , and $\underline{W}^{l}_{X}=V^{l}_{X}V^{l-1}_{X}\cdots V^{1}_{X}$ , so long as $X\notin\operatorname{\mathcal{N}}_{\psi}$ .

Because $\psi\in\Psi_{1}^{\kappa}$ , we have $\lVert W^{l}\rVert=1$ and $\operatorname{\sigma_{min}}(W^{l})\geq 1/\kappa$ ; also, $\lVert M_{X}^{l}\rVert\leq 1$ , $\operatorname{\sigma_{min}}(M_{X}^{l})\geq\alpha$ . Thus $\lVert\underline{W}^{l}_{X}\rVert\leq 1$ , and using Assumption (II) with Lemma 10 gives $\operatorname{\sigma_{min}}(\underline{W}^{l}_{X})\geq(\alpha/\kappa)^{l}$ . In particular, each $\underline{W}^{l}_{X}$ is full-rank.
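The singular-value bound can be illustrated with a quick numerical experiment. The NumPy sketch below draws random weights with decreasing widths, normalizes each to unit spectral norm, takes $\kappa$ to be the largest condition number among them, and compares the smallest singular value of each product $V^{l}\cdots V^{1}$ with $(\alpha/\kappa)^{l}$. The function name is hypothetical, the activation patterns are drawn at random rather than computed from an input, and the pattern at the input layer is assumed to be the identity.

```python
import numpy as np

def check_min_singular_values(dims, leak, seed=0):
    """Return (sigma_min, lower_bound, spectral_norm) for each layer l.

    dims must be non-increasing, as required by Assumption (II).
    """
    rng = np.random.default_rng(seed)
    Ws = []
    for d_in, d_out in zip(dims[:-1], dims[1:]):
        W = rng.normal(size=(d_out, d_in))
        Ws.append(W / np.linalg.norm(W, 2))        # unit spectral norm
    kappa = max(np.linalg.cond(W) for W in Ws)     # sigma_max / sigma_min

    prod = np.eye(dims[0])
    results = []
    for l, W in enumerate(Ws, start=1):
        if l == 1:
            mask = np.ones(W.shape[1])             # assumed: no activation on the input
        else:
            mask = rng.choice([leak, 1.0], size=W.shape[1])
        prod = (W * mask) @ prod                   # W * mask equals W @ diag(mask)
        svals = np.linalg.svd(prod, compute_uv=False)
        results.append((svals.min(), (leak / kappa) ** l, svals.max()))
    return results
```

The bound is usually far from tight for random weights, but it guarantees that the product keeps full rank $d_{l}$ for every activation pattern.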

Next, note that $\underline{b}^{l}_{X}$ and $\underline{W}^{l}_{X}$ each only depend on $X$ through the activation patterns $M_{X}^{l}$ . Letting $H^{l}_{X}=(M_{X}^{l},M_{X}^{l-1},\dots,M_{X}^{1})$ denote the full activation patterns up to level $l$ , we can thus write

There are only finitely many possible values for $H^{l}_{X}$ ; we denote the set of such values as $\mathcal{H}^{l}$ . Then we have that

Because each $\underline{W}_{k}^{H^{l}}$ is of rank $d_{l}$ , each set in the union is either empty or an affine subspace of dimension $d-d_{l}$ . As each $d_{l}>0$ , each set in the finite union has zero Lebesgue measure, and $\mathcal{N}_{\psi}$ also has zero Lebesgue measure.

Thus, take some $X\notin\operatorname{\mathcal{N}}_{\psi}$ , and find the smallest absolute value of its activations, $\epsilon=\min_{l=1,\dots,L}\min_{k=1,\dots,d_{l}}\left\lvert\left(h^{l}_{\psi}(X)\right)_{k}\right\rvert$ ; clearly $\epsilon>0$ . For any $X^{\prime}$ with $\lVert X-X^{\prime}\rVert<\epsilon$ , we know that for all $l$ and $k$ ,

implying that $H_{X}^{l}=H_{X^{\prime}}^{l}$ as well as $X^{\prime}\notin\operatorname{\mathcal{N}}_{\psi}$ . Thus for any point $X\notin\operatorname{\mathcal{N}}_{\psi}$ , $\nabla\phi_{\psi}(X)=\underline{W}^{H_{X}^{L}}$ . Finally, we obtain

A more general version of this result can be found in [19, Theorem 2]; we provide a proof here for completeness.

Here we analyze through simple examples what happens when the condition number can be unbounded, and when Assumption (II), about decreasing widths of the network, is violated.

where $\alpha>0$ . As $\alpha$ approaches $0$, the matrix $W_{\alpha}$ becomes singular, which means that its condition number blows up. We are interested in analyzing the behavior of the Lipschitz constant of $\phi$ and the expected squared norm of its gradient under $\mu$ as $\alpha$ approaches $0$.

One can easily compute the squared norm of the gradient of $\phi$ which is given by

Here $A_{1}$ , $A_{2}$ , $A_{3}$ and $A_{4}$ are defined by Equation 23 and are represented in Figure 4:

We would like to consider a second example where Assumption (II) doesn’t hold. Consider the following two layer network defined by:

for $\beta>0$ . Note that $W_{\beta}$ is a full rank matrix, but Assumption (II) doesn’t hold. Depending on the sign of the components of $W_{\beta}X$ one has the following expression for $\|\nabla\phi_{\alpha}(X)\|^{2}$ :

where $(B_{i})_{1\leq i\leq 6}$ are defined by Equation 26

The squared Lipschitz constant is given by $\|\phi\|_{L}^{2}(1-\gamma)^{2}+\beta^{2}$ while the expected squared norm of the gradient of $\phi$ is given by:

Appendix B DiracGAN vector fields for more losses

Figure 5 shows parameter vector fields, like those in Figure 6, for Example 1 for a variety of different losses:

Appendix C Vector fields of Gradient-Constrained MMD and Sobolev GAN critics

This unintuitive behavior is most likely related to the vanishing boundary condition assumed by Sobolev GAN. Solving the actual Sobolev PDE, we found that the Sobolev critic has very high gradients close to the boundary in order to match the condition; moreover, these gradients point in opposite directions to the target distribution.

Appendix D An estimator for Lipschitz MMD

We now describe briefly how to estimate the Lipschitz MMD in low dimensions. Recall that

For $f\in\mathcal{H}_{k}$ , it is the case that

By the generalized representer theorem, the optimal $f$ for (29) will be of the form

Writing $\delta=\left(\alpha,\beta,\gamma\right)$ , the objective function is linear in $\delta$ ,

The constraints are quadratic, built from the following matrices, where the $X$ and $Y$ samples are concatenated together, as are the derivatives with each dimension of the $Z$ samples:

Thus the optimization problem (29) is a linear problem with convex quadratic constraints, which can be solved by standard convex optimization software. The approximation is reasonable only if we can effectively cover the region of interest with densely spaced $\{Z_{i}\}$ ; it requires a nontrivial amount of computation even for the very simple 1-dimensional toy problem of Example 1.

Appendix E Near-equivalence of WGAN and linear-kernel MMD GANs

For an MMD GAN-GP with kernel $k(x,y)=\phi(x)\phi(y)$ , we have that

(MMD GANs, however, would typically train on the unbiased estimator of $\operatorname{MMD}^{2}$ , giving a very slightly different loss function. also applied the gradient penalty to $\eta$ rather than the true critic $\eta/\lVert\eta\rVert$ .)

The SMMD with a linear kernel is thus analogous to applying the scaling operator to a WGAN; hence the name SWGAN.
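The connection is easy to see numerically. With a scalar feature map $\phi$ and kernel $k(x,y)=\phi(x)\phi(y)$, the biased (V-statistic) estimate of the MMD equals the absolute difference of the feature means, which is the quantity a WGAN critic compares. The NumPy sketch below takes precomputed feature values as input; the function name is hypothetical.

```python
import numpy as np

def linear_kernel_mmd(phi_x, phi_y):
    """Biased MMD estimate for k(x, y) = phi(x) * phi(y), scalar features."""
    Kxx = np.outer(phi_x, phi_x)
    Kyy = np.outer(phi_y, phi_y)
    Kxy = np.outer(phi_x, phi_y)
    mmd2 = Kxx.mean() + Kyy.mean() - 2.0 * Kxy.mean()
    return np.sqrt(max(mmd2, 0.0))   # guard against tiny negative round-off
```

Here `linear_kernel_mmd(phi_x, phi_y)` agrees with `abs(phi_x.mean() - phi_y.mean())` up to floating-point error. The unbiased estimator of $\operatorname{MMD}^{2}$ differs slightly because it drops the diagonal terms.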

Appendix F Additional experiments

Figure 7 shows the behavior of the MMD, the Gradient-Constrained MMD, and the Scaled MMD when comparing Gaussian distributions. We can see that $\operatorname{MMD}\propto\operatorname{SMMD}$ and the Gradient-Constrained MMD behave similarly in this case, and that optimizing the $\operatorname{SMMD}$ and the Gradient-Constrained MMD is also similar. Optimizing the MMD would yield an essentially constant distance.

F.2 IGMs with Optimized Gradient-Constrained MMD loss

The learned models, however, were reasonable. Using a DCGAN architecture, batches of size 64, and a procedure that otherwise agreed with the setup of Section 4, samples with and without spectral normalization are shown in Figures 8(a) and 8(b). After the points in training shown, however, the same rank collapse as discussed in Section 4 occurred. Here it seems that spectral normalization may have delayed the collapse, but not prevented it. Figure 8(c) shows generator loss estimates through training, including the obvious peak at collapse; Figure 8(d) shows KID scores based on the MNIST-trained convnet representation , including comparable SMMD models for context. The fact that SMMD models converged somewhat faster than Gradient-Constrained MMD models here may be more related to properties of the estimator of Proposition 3 than to the distances themselves; more work would be needed to fully compare the behavior of the two distances.

F.3 Spectral normalization and Scaled MMD

Figure 9 shows the distribution of critic weight singular values, like Figure 2, at more layers. Figures 11 and 2 show results for the spectral normalization variants considered in the experiments. MMDGAN, with neither spectral normalization nor a gradient penalty, did surprisingly well in this case, though it fails badly in other situations.

Figure 9 compares the decay of singular values for layer of the critic’s network at both early and later stages of training in two cases: with or without the spectral parametrization. The model was trained on CelebA using SMMD. Figure 11 shows the evolution per iteration of Inception score, FID and KID for Sobolev-GAN, MMDGAN and variants of MMDGAN and WGAN using spectral normalization. It is often the case that this parametrization alone is not enough to achieve good results.

F.4 Additional samples

Figures 12 and 13 give extra samples from the models.