On gradient regularizers for MMD GANs

Michael Arbel, Danica J. Sutherland, Mikołaj Bińkowski, Arthur Gretton

Introduction

Another class of IPMs used as IGM losses are the Maximum Mean Discrepancies (MMDs) , as in . Here the critic function is a member of a reproducing kernel Hilbert space (except in , who learn a deep approximation to an RKHS critic). Better performance can be obtained, however, when the MMD kernel is not based directly on image pixels, but on learned features of images. Wasserstein-inspired gradient regularization approaches can be used on the MMD critic when learning these features: uses weight clipping , and use a gradient penalty .

We first discuss in Section 2 how MMD-based losses can be used to learn implicit generative models, and how a naive approach could fail. This motivates our new discrepancies, introduced in Section 3. Section 4 demonstrates that these losses outperform state-of-the-art models for image generation.

Learning implicit generative models with MMD-based losses

In the present work, we will build losses $\mathcal{D}$ based on the Maximum Mean Discrepancy.
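As a concrete reference point, the sketch below estimates the squared MMD between two samples with the standard unbiased U-statistic. The Gaussian kernel, the `bandwidth` parameter, and the function names are illustrative assumptions, not the kernels or code used in the experiments.

```python
import numpy as np

def gaussian_kernel(A, B, bandwidth=1.0):
    """k(a, b) = exp(-||a - b||^2 / (2 * bandwidth^2)) for all pairs of rows."""
    sq_dists = (np.sum(A**2, axis=1)[:, None]
                + np.sum(B**2, axis=1)[None, :]
                - 2.0 * A @ B.T)
    return np.exp(-np.maximum(sq_dists, 0.0) / (2.0 * bandwidth**2))

def mmd2_unbiased(X, Y, bandwidth=1.0):
    """Unbiased U-statistic estimate of MMD^2 between samples X ~ P and Y ~ Q."""
    n, m = len(X), len(Y)
    K_xx = gaussian_kernel(X, X, bandwidth)
    K_yy = gaussian_kernel(Y, Y, bandwidth)
    K_xy = gaussian_kernel(X, Y, bandwidth)
    # Drop the diagonal terms k(x_i, x_i) so the within-sample averages are unbiased.
    term_xx = (K_xx.sum() - np.trace(K_xx)) / (n * (n - 1))
    term_yy = (K_yy.sum() - np.trace(K_yy)) / (m * (m - 1))
    return term_xx + term_yy - 2.0 * K_xy.mean()

rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, size=(500, 2))
Y_same = rng.normal(0.0, 1.0, size=(500, 2))      # same distribution as X
Y_shifted = rng.normal(2.0, 1.0, size=(500, 2))   # mean shifted by 2 in each coordinate
mmd_same = mmd2_unbiased(X, Y_same)
mmd_shifted = mmd2_unbiased(X, Y_shifted)
```

The estimate is close to zero when both samples come from the same distribution and clearly positive when the means differ.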

Despite these appealing properties, using simple pixel-level kernels leads to poor generator samples . More recent MMD GANs achieve better results by using a parameterized family of kernels, $\{k_{\psi}\}_{\psi\in\Psi}$, in the Optimized MMD loss previously studied by :

We can avoid these issues if we ensure a bounded Lipschitz critic. ([27, Theorem 4] makes a similar claim to Proposition 2, but its proof was incorrect: it tries to uniformly bound $\operatorname{MMD}_{k_{\psi}}\leq\mathcal{W}^{2}$, but the bound used is for a Wasserstein in terms of $\lVert k_{\psi}(x,\cdot)-k_{\psi}(y,\cdot)\rVert_{\mathcal{H}_{k_{\psi}}}$.)

The main result is [12, Corollary 11.3.4]. To show the claim for $k_{\psi}=K\circ\phi_{\psi}$, note that $\lvert f_{\psi}(x)-f_{\psi}(y)\rvert\leq\lVert f_{\psi}\rVert_{\mathcal{H}_{k_{\psi}}}\lVert k_{\psi}(x,\cdot)-k_{\psi}(y,\cdot)\rVert_{\mathcal{H}_{k_{\psi}}}$, which since $\lVert f_{\psi}\rVert_{\mathcal{H}_{k_{\psi}}}=1$ is

Indeed, if we put a box constraint on $\psi$ or regularize the gradient of the critic function , the resulting MMD GAN generally matches or outperforms WGAN-based models. Unfortunately, though, an additive gradient penalty doesn’t substantially change the vector field of Figure 1 (a), as shown in Figure 5 (Appendix B). We will propose distances with much better convergence behavior.

New discrepancies for learning implicit generative models

Our aim here is to introduce a discrepancy that can provide useful gradient information when used as an IGM loss. Proofs of results in this section are deferred to Appendix A.

Proposition 2 shows that an MMD-like discrepancy can be continuous under the weak topology even when optimizing over kernels, if we directly restrict the critic functions to be Lipschitz. We can easily define such a distance, which we call the Lipschitz MMD: for some $\lambda>0$,

3.2 Gradient-Constrained Maximum Mean Discrepancy

We define the Gradient-Constrained MMD for $\lambda>0$ and using some measure $\mu$ as

where $K$ is the kernel matrix $K_{m,m^{\prime}}=k(X_{m},X_{m^{\prime}})$, $G$ is the matrix of left derivatives $G_{(m,i),m^{\prime}}=\partial_{i}k(X_{m},X_{m^{\prime}})$, and $H$ that of derivatives of both arguments $H_{(m,i),(m^{\prime},j)}=\partial_{i}\partial_{j+d}k(X_{m},X_{m^{\prime}})$. (We use $\partial_{i}k(x,y)$ to denote the partial derivative with respect to $x_{i}$, and $\partial_{i+d}k(x,y)$ that for $y_{i}$.)
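The matrices $K$, $G$ and $H$ can be assembled explicitly for a Gaussian kernel. The sketch below does so and then evaluates one plausible closed form of the estimator, $\frac{1}{\lambda}\bigl(\operatorname{MMD}^{2}-v^{\mathsf{T}}(B+M\lambda I)^{-1}v\bigr)$ with $B$ the block matrix built from $K$, $G$, $H$ and $v$ the stacked values and gradients of the witness $\eta$ at the points $X_{m}$. This form is an assumption: it follows from the Woodbury identity if the regularized operator is $\lambda I+\frac{1}{M}\sum_{m}D_{X_{m}}$. The kernel choice and all function names are hypothetical.

```python
import numpy as np

def gauss(A, B, s):
    """Gaussian kernel matrix and pairwise differences A_a - B_b."""
    diff = A[:, None, :] - B[None, :, :]                    # (n_A, n_B, d)
    return np.exp(-np.sum(diff**2, axis=-1) / (2 * s**2)), diff

def gcmmd2_estimate(P, Q, X, lam=1.0, s=1.0):
    """Sketch of a Gradient-Constrained MMD^2 estimate with mu supported on rows of X."""
    M, d = X.shape
    K, diff = gauss(X, X, s)                                # diff[m, m'] = X_m - X_m'
    # G[(m,i), m'] = d/dx_i k(X_m, X_m')
    G = (-diff / s**2 * K[:, :, None]).transpose(0, 2, 1).reshape(M * d, M)
    # H[(m,i), (m',j)] = d/dx_i d/dy_j k(X_m, X_m')
    outer = diff[:, :, :, None] * diff[:, :, None, :]       # (M, M, d, d)
    H = (np.eye(d) / s**2 - outer / s**4) * K[:, :, None, None]
    H = H.transpose(0, 2, 1, 3).reshape(M * d, M * d)
    B = np.block([[K, G.T], [G, H]])

    # Witness eta(x) = mean_p k(p, x) - mean_q k(q, x), and its gradient, at each X_m.
    K_xp, d_xp = gauss(X, P, s)
    K_xq, d_xq = gauss(X, Q, s)
    eta = K_xp.mean(axis=1) - K_xq.mean(axis=1)             # (M,)
    grad = ((-d_xp / s**2 * K_xp[:, :, None]).mean(axis=1)
            - (-d_xq / s**2 * K_xq[:, :, None]).mean(axis=1))  # (M, d)
    v = np.concatenate([eta, grad.reshape(M * d)])          # same (m, i) ordering as G, H

    # Biased (V-statistic) MMD^2, i.e. the squared RKHS norm of eta.
    mmd2 = (gauss(P, P, s)[0].mean() + gauss(Q, Q, s)[0].mean()
            - 2 * gauss(P, Q, s)[0].mean())
    correction = v @ np.linalg.solve(B + M * lam * np.eye(M + M * d), v)
    return (mmd2 - correction) / lam, mmd2, B

rng = np.random.default_rng(0)
P = rng.normal(0.0, 1.0, size=(40, 2))
Q = rng.normal(1.0, 1.0, size=(40, 2))
gc2, mmd2, B = gcmmd2_estimate(P, Q, X=P[:15], lam=1.0)
```

The linear system has size $M(1+d)$, which is why this estimator becomes expensive in high dimensions.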

3.3 Scaled Maximum Mean Discrepancy

We will now derive a lower bound on the Gradient-Constrained MMD which retains many of its attractive qualities but can be estimated in time linear in the dimension $d$.

Make Assumptions (A), (B), (C) and (D). For any $f\in\mathcal{H}_{k}$, $\lVert f\rVert_{S(\mu),k,\lambda}\leq\sigma_{\mu,k,\lambda}^{-1}\lVert f\rVert_{\mathcal{H}_{k}}$, where

We then define the Scaled Maximum Mean Discrepancy based on this bound of Proposition 4:
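To make the scaling concrete, the sketch below computes a scale factor of the form $\sigma_{\mu,k,\lambda}=\bigl(\lambda+\int k(x,x)\,\mathrm{d}\mu+\sum_{i}\int\partial_{i}\partial_{i+d}k(x,x)\,\mathrm{d}\mu\bigr)^{-1/2}$, which is the form suggested by the trace bound used in Appendix A; treat it as an assumption rather than a restatement of the definition. The kernel here is a Gaussian on top of a hypothetical one-layer Leaky-ReLU feature map $\phi(x)=\mathrm{LReLU}(Wx)$, for which $k(x,x)=1$ and the derivative term is the squared Frobenius norm of the Jacobian of $\phi$ divided by the squared bandwidth.

```python
import numpy as np

def smmd_scale(W, X, lam=1.0, bandwidth=1.0, leak=0.2):
    """Scale factor sigma for k(x, y) = Gauss(phi(x), phi(y)), phi(x) = LReLU(W x).

    The integrals over mu are replaced by averages over the rows of X.
    """
    pre = X @ W.T                                   # pre-activations, shape (n, d_out)
    mask = np.where(pre > 0, 1.0, leak)             # Leaky-ReLU slopes per unit
    row_norms = np.sum(W**2, axis=1)                # ||W_k||^2 for each output unit
    # Jacobian of phi at x is diag(mask) W, so its squared Frobenius norm is
    # sum_k mask_k^2 ||W_k||^2.
    jac_fro2 = mask**2 @ row_norms                  # shape (n,)
    return (lam + 1.0 + jac_fro2.mean() / bandwidth**2) ** -0.5

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3))
X = rng.normal(size=(100, 3))
scale_small = smmd_scale(W, X)
scale_large = smmd_scale(3.0 * W, X)   # larger weights -> larger gradients -> smaller scale
```

Multiplying an MMD estimate by this factor penalizes critics with large gradients multiplicatively, and costs only one pass over the sample.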

Experiments

We evaluated unsupervised image generation on three datasets: CIFAR-10 ($60\,000$ images, $32\times 32$), CelebA ($202\,599$ face images, resized and cropped to $160\times 160$ as in ), and the more challenging ILSVRC2012 (ImageNet) dataset ($1\,281\,167$ images, resized to $64\times 64$). Code for all of these experiments is available at github.com/MichaelArbel/Scaled-MMD-GAN.

Evaluation To compare the sample quality of different models, we considered three different scores based on the Inception network trained for ImageNet classification, all using default parameters in the implementation of . The Inception Score (IS) is based on the entropy of predicted labels; higher values are better. Though standard, this metric has many issues, particularly on datasets other than ImageNet . The FID instead measures the similarity of samples from the generator and the target as the Wasserstein-2 distance between Gaussians fit to their intermediate representations. It is more sensible than the IS and becoming standard, but its estimator is strongly biased . The KID is similar to FID, but by using a polynomial-kernel MMD its estimates enjoy better statistical properties and are easier to compare. (A similar score was recommended by .)
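The Fréchet (Wasserstein-2) distance between Gaussians that underlies the FID has a closed form, $\lVert m_{a}-m_{b}\rVert^{2}+\operatorname{Tr}\bigl(\Sigma_{a}+\Sigma_{b}-2(\Sigma_{a}\Sigma_{b})^{1/2}\bigr)$. The sketch below evaluates it on two feature arrays; the random arrays stand in for Inception activations, and the function name is hypothetical. It is not the reference implementation used for the reported scores.

```python
import numpy as np

def frechet_distance(feats_a, feats_b):
    """Frechet distance between Gaussians fit to the rows of two feature arrays."""
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    # Tr((cov_a cov_b)^{1/2}) equals the sum of square roots of the eigenvalues of
    # cov_a @ cov_b, which are real and non-negative up to numerical error.
    eigvals = np.linalg.eigvals(cov_a @ cov_b)
    tr_sqrt = np.sum(np.sqrt(np.clip(eigvals.real, 0.0, None)))
    return (np.sum((mu_a - mu_b) ** 2)
            + np.trace(cov_a) + np.trace(cov_b) - 2.0 * tr_sqrt)

rng = np.random.default_rng(0)
a = rng.normal(size=(5000, 4))
b = rng.normal(size=(5000, 4))
fid_same = frechet_distance(a, b)          # same distribution
fid_shift = frechet_distance(a, b + 3.0)   # mean shifted by 3 in each of 4 dimensions
```

Even with identical distributions the estimate is not exactly zero at finite sample size, which is the bias mentioned above.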

Results Table 1(a) presents the scores for models trained on both CIFAR-10 and CelebA datasets. On CIFAR-10, SN-SWGAN and SN-SMMDGAN performed comparably to SN-GAN. But on CelebA, SN-SWGAN and SN-SMMDGAN dramatically outperformed the other methods with the same architecture in all three metrics. It also trained faster, and consistently outperformed other methods over multiple initializations (Figure 2 (a)). It is worth noting that SN-SWGAN far outperformed WGAN-GP on both datasets. Table 1(b) presents the scores for SMMDGAN and SN-SMMDGAN trained on ImageNet, and the scores of pre-trained models using BGAN and SN-GAN. These models are courtesy of the respective authors and were also trained at $64\times 64$ resolution. SN-GAN used the same architecture as our model, but trained for $250\,000$ generator iterations; BGAN used a similar 5-layer ResNet architecture and trained for 74 epochs, comparable to SN-GAN. The proposed methods substantially outperformed both methods in FID and KID scores. Figure 3 shows samples on ImageNet and CelebA; Section F.4 has more.

Spectrally normalized WGANs / MMDGANs To control for the contribution of the spectral parametrization to the performance, we evaluated variants of MMDGANs, WGANs and Sobolev-GAN using spectral normalization (in Table 2, Section F.3). WGAN and Sobolev-GAN led to unstable training and didn’t converge at all (Figure 11) despite many attempts. MMDGAN converged on CIFAR-10 (Figure 11) but was unstable on CelebA (Figure 10). The gradient control due to SN is thus probably too loose for these methods. This is reinforced by Figure 2 (c), which shows that the expected gradient of the critic network is much better-controlled by SMMD, even when SN is used. We also considered variants of these models with a learned $\gamma$ while also adding a gradient penalty and an $L_{2}$ penalty on critic activations [7, footnote 19]. These generally behaved similarly to MMDGAN, and didn’t lead to substantial improvements. We ran the same experiments on CelebA, but aborted the runs early when it became clear that training was not successful.

Rank collapse We occasionally observed the failure mode for SMMD where the critic becomes low-rank, discussed in Section 3.3, especially on CelebA; this failure was obvious even in the training objective. Figure 2 (b) is one of these examples. Spectral parametrization seemed to prevent this behavior. We also found one could avoid collapse by reverting to an earlier checkpoint and increasing the RKHS regularization parameter $\lambda$, but did not do this for any of the experiments here.

Conclusion

Another area to explore is the geometry of these losses, as studied by , who showed potential advantages of the Wasserstein geometry over the MMD. Their results, though, do not address any distances based on optimized kernels; the new distances introduced here might have interesting geometry of their own.


References

Appendix A Proofs

We use a slightly nonstandard notation for derivatives: $\partial_{i}f(x)$ denotes the $i$th partial derivative of $f$ evaluated at $x$, and $\partial_{i}\partial_{j+d}k(x,y)$ denotes $\frac{\partial^{2}k(a,b)}{\partial a_{i}\partial b_{j}}\big|_{(a,b)=(x,y)}$.

Then the following reproducing properties hold for any given function $f$ in $\mathcal{H}$ [47, Lemma 4.34]:

Given two vectors $f$ and $g$ in $\mathcal{H}$ and a Hilbert-Schmidt operator $A$ we have the following properties:

Define the following covariance-type operators:

these are useful in that, using (8) and (9), $\langle f,D_{x}g\rangle=f(x)g(x)+\sum_{i=1}^{d}\partial_{i}f(x)\,\partial_{i}g(x)$.

$\sqrt{k(x,x)}$ grows at most linearly in $x$: for all $x$ in $\mathcal{X}$, $\sqrt{k(x,x)}\leq C(\lVert x\rVert+1)$ for some constant $C$.

The kernel $k$ is twice continuously differentiable.

The functions $x\mapsto k(x,x)$ and $x\mapsto\partial_{i}\partial_{i+d}k(x,x)$ for $1\leq i\leq d$ are $\mu$-integrable.

When $k=K\circ\phi_{\psi}$, Assumption (B) is automatically satisfied by a $K$ such as the Gaussian; when $K$ is linear, it is true for a quite general class of networks $\phi_{\psi}$ [7, Lemma 1].

We will first give a form for the Gradient-Constrained MMD (5) in terms of the operator (10):

Under Assumptions (A), (B), (C) and (D), the Gradient-Constrained MMD is given by

Let $f$ be a function in $\mathcal{H}$. We will first express the squared $\lambda$-regularized Sobolev norm of $f$ (6) as a quadratic form in $\mathcal{H}$. Recalling the reproducing properties of (8) and (9), we have:

Using Property (ii) and the operator (10), one further gets

where $\eta$ is defined as this difference in mean embeddings.

Since $D_{\mu,\lambda}$ is symmetric positive definite, its square-root $D_{\mu,\lambda}^{\frac{1}{2}}$ is well-defined and is also invertible. For any $f\in\mathcal{H}$, let $g=D_{\mu,\lambda}^{\frac{1}{2}}f$, so that $\langle f,D_{\mu,\lambda}f\rangle_{\mathcal{H}}=\|g\|_{\mathcal{H}}^{2}$. Note that for any $g\in\mathcal{H}$, there is a corresponding $f=D_{\mu,\lambda}^{-\frac{1}{2}}g$. Thus we can re-express the maximization problem in (5) in terms of $g$:
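Spelled out, and assuming that the objective in (5) is $\langle\eta,f\rangle_{\mathcal{H}}$ under the constraint $\langle f,D_{\mu,\lambda}f\rangle_{\mathcal{H}}\leq 1$, the change of variables gives the following chain. This is a sketch of the step consistent with the surrounding definitions, using the self-adjointness of $D_{\mu,\lambda}^{-\frac{1}{2}}$ and the fact that a linear functional on the unit ball is maximized at norm of its representer:

\begin{align*}
\sup_{f:\,\langle f,D_{\mu,\lambda}f\rangle_{\mathcal{H}}\leq 1}\langle\eta,f\rangle_{\mathcal{H}}
&=\sup_{g:\,\|g\|_{\mathcal{H}}\leq 1}\bigl\langle\eta,D_{\mu,\lambda}^{-\frac{1}{2}}g\bigr\rangle_{\mathcal{H}}
=\sup_{g:\,\|g\|_{\mathcal{H}}\leq 1}\bigl\langle D_{\mu,\lambda}^{-\frac{1}{2}}\eta,g\bigr\rangle_{\mathcal{H}}\\
&=\bigl\|D_{\mu,\lambda}^{-\frac{1}{2}}\eta\bigr\|_{\mathcal{H}}
=\sqrt{\bigl\langle\eta,D_{\mu,\lambda}^{-1}\eta\bigr\rangle_{\mathcal{H}}}.
\end{align*}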

Proposition 5, though, involves inverting the infinite-dimensional operator $D_{\mu,\lambda}$ and thus doesn’t directly give us a computable estimator. Proposition 3 solves this problem in the case where $\mu$ is a discrete measure:

where $K$ is the kernel matrix $K_{m,m^{\prime}}=k(X_{m},X_{m^{\prime}})$, $G$ is the matrix of left derivatives $G_{(m,i),m^{\prime}}=\partial_{i}k(X_{m},X_{m^{\prime}})$, and $H$ that of derivatives of both arguments $H_{(m,i),(m^{\prime},j)}=\partial_{i}\partial_{j+d}k(X_{m},X_{m^{\prime}})$.

Then $\begin{bmatrix}\eta(X)\\ \nabla\eta(X)\end{bmatrix}=T\eta$, and $\begin{bmatrix}K&G^{\mathsf{T}}\\ G&H\end{bmatrix}=TT^{*}$. Thus we can write

Let $g\in\mathcal{H}$ be the solution to the regression problem $D_{\mu,\lambda}g=\eta$:

Taking the inner product of both sides of (12) with $k(X_{m^{\prime}},\cdot)$ for each $1\leq m^{\prime}\leq M$ yields the following $M$ equations:

Doing the same with $\partial_{j}k(X_{m^{\prime}},\cdot)$ gives $Md$ equations:

From (12), it is clear that $g$ is a linear combination of the form:

where the coefficients $\alpha:=\left(\alpha_{m}=g(X_{m})\right)_{1\leq m\leq M}$ and $\beta:=\left(\beta_{m,i}=\partial_{i}g(X_{m})\right)_{\substack{1\leq m\leq M\\ 1\leq i\leq d}}$ satisfy the system of equations (13) and (14). We can rewrite this system as

where $I_{M}$, $I_{Md}$ are the identity matrices of dimension $M$, $Md$. Since $K$ and $H$ must be positive semidefinite, an inverse exists. We conclude by noticing that

The following result was key to our definition of the SMMD in Section 3.3.

Under Assumptions (A), (B), (C) and (D), we have for all $f\in\mathcal{H}$ that

The key idea here is to use the Cauchy-Schwarz inequality for the Hilbert-Schmidt inner product. Letting $f\in\mathcal{H}$, $\|f\|_{S(\mu),k,\lambda}^{2}$ is

$(a)$ follows from the reproducing properties (8) and (9) and Property (ii). $(b)$ is obtained using Property (iii), while $(c)$ follows from the Cauchy-Schwarz inequality and Property (i). ∎

The operator $D_{x}$ is positive self-adjoint. It is also trace-class, as by the triangle inequality

A.2 Continuity of the Optimized Scaled MMD in the Wasserstein topology

To prove Theorem 1, we will first need some new notation.

The parameter $\psi$ is the concatenation of all the layer parameters:

$\Psi^{\kappa}$ is the set of those parameters for which each $W^{l}$ has a small condition number, $\operatorname{cond}(W)=\sigma_{\max}(W)/\sigma_{\min}(W)$. $\Psi_{1}^{\kappa}$ is the set of per-layer normalized parameters with a condition number bounded by $\kappa$.

Recall the definition of Scaled MMD, Equation 7, where $\lambda>0$ and $\mu$ is a probability measure:

The Optimized SMMD over the restricted set $\Psi^{\kappa}$ is given by:

The constraint to $\psi\in\Psi^{\kappa}$ is critical to the proof. In practice, using a spectral parametrization helps enforce this assumption, as shown in Figures 2 and 9. Other regularization methods, like orthogonal normalization , are also possible.

$\mu$ is a probability distribution absolutely continuous with respect to the Lebesgue measure.

The dimensions of the weights are decreasing per layer: $d_{l+1}\leq d_{l}$ for all $0\leq l\leq L-1$.

The non-linearity used is Leaky-ReLU, (16), with leak coefficient $\alpha\in(0,1)$.

There is some $\gamma_{K}>0$ for which $K$ satisfies

Assumption (II) helps ensure that the span of $W^{l}$ is never contained in the null space of $W^{l+1}$. Using Leaky-ReLU as a non-linearity, Assumption (III), further ensures that the network $\phi_{\psi}$ is locally full-rank almost everywhere; this might not be true with ReLU activations, where it could be always . Assumptions (II) and (III) can be easily satisfied by design of the network.

Assumptions (IV) and (V) only depend on the top-level kernel $K$ and are easy to satisfy in practice. In particular, they always hold for a smooth translation-invariant kernel, such as the Gaussian, as well as the linear kernel.

Under Assumptions (I), (II), (III), (IV) and (V),

Define the pseudo-distance corresponding to the kernel $k_{\psi}$

where $\mathcal{W}$ is the standard Wasserstein distance (2), and so

We have that $\partial_{i}\partial_{i+d}k(x,y)=\left[\partial_{i}\phi_{\psi}(x)\right]^{\mathsf{T}}\left[\nabla_{a}\nabla_{b}K(a,b)\big|_{(a,b)=(\phi_{\psi}(x),\phi_{\psi}(y))}\right]\left[\partial_{i}\phi_{\psi}(y)\right]$, where the middle term is a $d_{L}\times d_{L}$ matrix and the outer terms are vectors of length $d_{L}$. Thus Assumption (V) implies that $\partial_{i}\partial_{i+d}k(x,x)\geq\gamma_{K}^{2}\lVert\partial_{i}\phi_{\psi}(x)\rVert^{2}$, and hence

Using Lemma 8, we can write $\phi_{\psi}(X)=\alpha(\psi)\phi_{\bar{\psi}}(X)$ with $\bar{\psi}\in\Psi^{\kappa}_{1}$. Then we have

Take a sample $(X,Y)\sim\pi^{\star}$ and a function $f\in\mathcal{H}$ with $\lVert f\rVert_{\mathcal{H}}\leq 1$. By the Cauchy-Schwarz inequality,

Taking the expectation with respect to $\pi^{\star}$, we obtain

the result follows by taking the supremum over $f$. ∎

Let $\psi=((W^{L},b^{L}),(W^{L-1},b^{L-1}),\dots,(W^{1},b^{1}))\in\Psi^{\kappa}$. There exists a corresponding scalar $\alpha(\psi)$ and $\bar{\psi}=((\bar{W}^{L},\bar{b}^{L}),(\bar{W}^{L-1},\bar{b}^{L-1}),\dots,(\bar{W}^{1},\bar{b}^{1}))\in\Psi^{\kappa}_{1}$, defined by (18), such that for all $X$,

Set $\bar{W}^{l}=\frac{1}{\lVert W^{l}\rVert}W^{l}$, $\bar{b}^{l}=\frac{1}{\prod_{m=1}^{l}\lVert W^{m}\rVert}b^{l}$, and $\alpha(\psi)=\prod_{l=1}^{L}\lVert W^{l}\rVert$. Note that the condition number is unchanged, $\operatorname{cond}(\bar{W}^{l})=\operatorname{cond}(W^{l})\leq\kappa$, and $\lVert\bar{W}^{l}\rVert=1$, so $\bar{\psi}\in\Psi^{\kappa}_{1}$. It is also easy to see from (16) that
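The rescaling can be checked numerically. The sketch below assumes a network of the form $h^{1}=W^{1}X+b^{1}$, $h^{l}=W^{l}\,\mathrm{LReLU}(h^{l-1})+b^{l}$, which is one reading of (16); the argument only relies on Leaky-ReLU being positively homogeneous. The layer sizes, the parameter list ordered from layer 1 to layer $L$, and the function names are illustrative assumptions.

```python
import numpy as np

def leaky_relu(x, leak=0.2):
    return np.where(x > 0, x, leak * x)

def forward(params, x, leak=0.2):
    """h^1 = W^1 x + b^1 and h^l = W^l LReLU(h^{l-1}) + b^l; returns h^L."""
    h = x
    for layer, (W, b) in enumerate(params):
        if layer > 0:
            h = leaky_relu(h, leak)
        h = W @ h + b
    return h

def normalize(params):
    """Divide each W^l by its spectral norm and each b^l by the running product."""
    scale = 1.0
    normalized = []
    for W, b in params:
        spec = np.linalg.norm(W, 2)        # spectral norm ||W^l||
        scale *= spec                      # prod_{m <= l} ||W^m||
        normalized.append((W / spec, b / scale))
    return scale, normalized               # scale is alpha(psi) after the last layer

def cond(W):
    s = np.linalg.svd(W, compute_uv=False)
    return s[0] / s[-1]

rng = np.random.default_rng(0)
dims = [6, 5, 4, 3]                        # decreasing widths, as in Assumption (II)
params = [(rng.normal(size=(dims[l + 1], dims[l])), rng.normal(size=dims[l + 1]))
          for l in range(3)]
x = rng.normal(size=6)

alpha_psi, params_bar = normalize(params)
out = forward(params, x)
out_bar = forward(params_bar, x)
```

Here `out` agrees with `alpha_psi * out_bar` up to floating-point error, and each normalized weight matrix keeps its condition number while having unit spectral norm.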

Make Assumptions (II) and (III), and let $\psi\in\Psi^{\kappa}_{1}$. Then the set of inputs for which any intermediate activation is exactly zero,

has zero Lebesgue measure. Moreover, for any $X\notin\mathcal{N}_{\psi}$, $\nabla_{X}\phi_{\psi}(X)$ exists and

it is undefined when any $h^{l}_{k}(X)=0$, i.e. when $X\in\mathcal{N}_{\psi}$. Let $V^{l}_{X}:=W^{l}\operatorname{diag}\left(M_{X}^{l-1}\right)$. Then

where $\underline{b}^{0}_{X}=0$, $\underline{b}^{l}_{X}=V^{l}_{X}\underline{b}^{l-1}_{X}+b^{l}$, and $\underline{W}^{l}_{X}=V^{l}_{X}V^{l-1}_{X}\cdots V^{1}_{X}$, so long as $X\notin\mathcal{N}_{\psi}$.

Because $\psi\in\Psi_{1}^{\kappa}$, we have $\lVert W^{l}\rVert=1$ and $\sigma_{\min}(W^{l})\geq 1/\kappa$; also, $\lVert M_{X}^{l}\rVert\leq 1$, $\sigma_{\min}(M_{X}^{l})\geq\alpha$. Thus $\lVert\underline{W}^{l}_{X}\rVert\leq 1$, and using Assumption (II) with Lemma 10 gives $\sigma_{\min}(\underline{W}^{l}_{X})\geq(\alpha/\kappa)^{l}$. In particular, each $\underline{W}^{l}_{X}$ is full-rank.
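These two bounds can be illustrated numerically. The sketch below builds the product $V^{L}_{X}\cdots V^{1}_{X}$ for a small random network with decreasing widths and unit-spectral-norm weights, again assuming $h^{1}=W^{1}X+b^{1}$ and $h^{l}=W^{l}\,\mathrm{LReLU}(h^{l-1})+b^{l}$, and compares its singular values with $1$ and $(\alpha/\kappa)^{L}$. The sizes, the seed, and the function name are hypothetical.

```python
import numpy as np

def jacobian(params, x, leak):
    """Product W^L diag(M^{L-1}) ... W^2 diag(M^1) W^1 at input x (away from kinks)."""
    W1, b1 = params[0]
    h = W1 @ x + b1
    J = W1
    for W, b in params[1:]:
        mask = np.where(h > 0, 1.0, leak)       # Leaky-ReLU slopes M^{l-1}
        J = W @ (mask[:, None] * J)             # W^l diag(M^{l-1}) J
        h = W @ np.where(h > 0, h, leak * h) + b
    return J

rng = np.random.default_rng(1)
dims = [6, 5, 4, 3]                             # d_0 >= d_1 >= d_2 >= d_3
leak = 0.2
params = []
for l in range(3):
    W = rng.normal(size=(dims[l + 1], dims[l]))
    W = W / np.linalg.norm(W, 2)                # per-layer spectral normalization
    params.append((W, 0.1 * rng.normal(size=dims[l + 1])))

# Largest per-layer condition number; with ||W|| = 1 this is 1 / sigma_min(W).
kappa = max(1.0 / np.linalg.svd(W, compute_uv=False)[-1] for W, _ in params)

J = jacobian(params, rng.normal(size=6), leak)
sing = np.linalg.svd(J, compute_uv=False)       # d_L singular values, descending
upper_ok = sing[0] <= 1.0 + 1e-12
lower_ok = sing[-1] >= (leak / kappa) ** 3 - 1e-12
```

With non-increasing widths every factor is injective on the range it acts on, which is what lets the smallest singular values multiply.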

Next, note that $\underline{b}^{l}_{X}$ and $\underline{W}^{l}_{X}$ each only depend on $X$ through the activation patterns $M_{X}^{l}$. Letting $H^{l}_{X}=(M_{X}^{l},M_{X}^{l-1},\dots,M_{X}^{1})$ denote the full activation patterns up to level $l$, we can thus write

There are only finitely many possible values for $H^{l}_{X}$; we denote the set of such values as $\mathcal{H}^{l}$. Then we have that

Because each $\underline{W}_{k}^{H^{l}}$ is of rank $d_{l}$, each set in the union is either empty or an affine subspace of dimension $d-d_{l}$. As each $d_{l}>0$, each set in the finite union has zero Lebesgue measure, and $\mathcal{N}_{\psi}$ also has zero Lebesgue measure.

Thus, take some $X\notin\mathcal{N}_{\psi}$, and find the smallest absolute value of its activations, $\epsilon=\min_{l=1,\dots,L}\min_{k=1,\dots,d_{l}}\left\lvert\left(h^{l}_{\psi}(X)\right)_{k}\right\rvert$; clearly $\epsilon>0$. For any $X^{\prime}$ with $\lVert X-X^{\prime}\rVert<\epsilon$, we know that for all $l$ and $k$,

implying that $H_{X}^{l}=H_{X^{\prime}}^{l}$ as well as $X^{\prime}\notin\mathcal{N}_{\psi}$. Thus for any point $X\notin\mathcal{N}_{\psi}$, $\nabla\phi_{\psi}(X)=\underline{W}^{H_{X}^{L}}$. Finally, we obtain

A more general version of this result can be found in [19, Theorem 2]; we provide a proof here for completeness.

Here we analyze through simple examples what happens when the condition number can be unbounded, and when Assumption (II), about decreasing widths of the network, is violated.

where $\alpha>0$. As $\alpha$ approaches $0$, the matrix $W_{\alpha}$ becomes singular, which means that its condition number blows up. We are interested in analyzing the behavior of the Lipschitz constant of $\phi$ and the expected squared norm of its gradient under $\mu$ as $\alpha$ approaches $0$.

One can easily compute the squared norm of the gradient of $\phi$, which is given by

Here $A_{1}$, $A_{2}$, $A_{3}$ and $A_{4}$ are defined by Equation 23 and are represented in Figure 4:

We would like to consider a second example where Assumption (II) doesn’t hold. Consider the following two-layer network, defined by:

for $\beta>0$. Note that $W_{\beta}$ is a full rank matrix, but Assumption (II) doesn’t hold. Depending on the sign of the components of $W_{\beta}X$ one has the following expression for $\|\nabla\phi_{\alpha}(X)\|^{2}$:

where $(B_{i})_{1\leq i\leq 6}$ are defined by Equation 26

The squared Lipschitz constant is given by $\|\phi\|_{L}^{2}(1-\gamma)^{2}+\beta^{2}$ while the expected squared norm of the gradient of $\phi$ is given by:

Appendix B DiracGAN vector fields for more losses

Figure 5 shows parameter vector fields, like those in Figure 6, for Example 1 for a variety of different losses:

Appendix C Vector fields of Gradient-Constrained MMD and Sobolev GAN critics

This unintuitive behavior is most likely related to the vanishing boundary condition assumed by Sobolev GAN. Solving the actual Sobolev PDE, we found that the Sobolev critic has very high gradients close to the boundary in order to match the condition; moreover, these gradients point in opposite directions to the target distribution.

Appendix D An estimator for Lipschitz MMD

We now describe briefly how to estimate the Lipschitz MMD in low dimensions. Recall that

For $f\in\mathcal{H}_{k}$, it is the case that

By the generalized representer theorem, the optimal $f$ for (29) will be of the form

Writing $\delta=\left(\alpha,\beta,\gamma\right)$, the objective function is linear in $\delta$,

The constraints are quadratic, built from the following matrices, where the $X$ and $Y$ samples are concatenated together, as are the derivatives with each dimension of the $Z$ samples:

Thus the optimization problem (29) is a linear problem with convex quadratic constraints, which can be solved by standard convex optimization software. The approximation is reasonable only if we can effectively cover the region of interest with densely spaced $\{Z_{i}\}$; it requires a nontrivial amount of computation even for the very simple 1-dimensional toy problem of Example 1.

Appendix E Near-equivalence of WGAN and linear-kernel MMD GANs

For an MMD GAN-GP with kernel $k(x,y)=\phi(x)\phi(y)$, we have that

(MMD GANs, however, would typically train on the unbiased estimator of $\operatorname{MMD}^{2}$, giving a very slightly different loss function. also applied the gradient penalty to $\eta$ rather than the true critic $\eta/\lVert\eta\rVert$.)

The SMMD with a linear kernel is thus analogous to applying the scaling operator to a WGAN; hence the name SWGAN.
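The near-equivalence can be seen numerically: with the kernel $k(x,y)=\phi(x)\phi(y)$ for a scalar critic $\phi$, the biased MMD estimate reduces to the absolute difference of critic means, which is the quantity a WGAN critic maximizes. In the sketch below the critic `phi` and its weights `w` are arbitrary stand-ins, not a trained network.

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=3)                      # hypothetical critic weights

def phi(x):
    """Scalar critic phi: R^3 -> R (an arbitrary smooth function for illustration)."""
    return np.tanh(x @ w)

X = rng.normal(0.0, 1.0, size=(200, 3))     # samples from P
Y = rng.normal(0.5, 1.0, size=(300, 3))     # samples from Q
fX, fY = phi(X), phi(Y)

# Biased (V-statistic) MMD^2 with the linear-in-phi kernel k(x, y) = phi(x) * phi(y).
mmd2 = (np.outer(fX, fX).mean() + np.outer(fY, fY).mean()
        - 2.0 * np.outer(fX, fY).mean())
mmd = np.sqrt(max(mmd2, 0.0))

# WGAN-style critic gap: |E_P phi(X) - E_Q phi(Y)|.
mean_gap = abs(fX.mean() - fY.mean())
```

The two quantities agree up to floating-point error; the unbiased MMD estimator would differ slightly because it drops the diagonal kernel terms.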

Appendix F Additional experiments

Figure 7 shows the behavior of the MMD, the Gradient-Constrained MMD, and the Scaled MMD when comparing Gaussian distributions. We can see that $\operatorname{MMD}\propto\operatorname{SMMD}$ and the Gradient-Constrained MMD behave similarly in this case, and that optimizing the $\operatorname{SMMD}$ and the Gradient-Constrained MMD is also similar. Optimizing the MMD would yield an essentially constant distance.

F.2 IGMs with Optimized Gradient-Constrained MMD loss

The learned models, however, were reasonable. Using a DCGAN architecture, batches of size 64, and a procedure that otherwise agreed with the setup of Section 4, samples with and without spectral normalization are shown in Figures 8(a) and 8(b). After the points in training shown, however, the same rank collapse as discussed in Section 4 occurred. Here it seems that spectral normalization may have delayed the collapse, but not prevented it. Figure 8(c) shows generator loss estimates through training, including the obvious peak at collapse; Figure 8(d) shows KID scores based on the MNIST-trained convnet representation , including comparable SMMD models for context. The fact that SMMD models converged somewhat faster than Gradient-Constrained MMD models here may be more related to properties of the estimator of Proposition 3 than to the distances themselves; more work would be needed to fully compare the behavior of the two distances.

F.3 Spectral normalization and Scaled MMD

Figure 9 shows the distribution of critic weight singular values, like Figure 2, at more layers. Figures 11 and 2 show results for the spectral normalization variants considered in the experiments. MMDGAN, with neither spectral normalization nor a gradient penalty, did surprisingly well in this case, though it fails badly in other situations.

Figure 9 compares the decay of singular values for layer of the critic’s network at both early and later stages of training in two cases: with or without the spectral parametrization. The model was trained on CelebA using SMMD. Figure 11 shows the evolution per iteration of Inception score, FID and KID for Sobolev-GAN, MMDGAN and variants of MMDGAN and WGAN using spectral normalization. It is often the case that this parametrization alone is not enough to achieve good results.

F.4 Additional samples

Figures 12 and 13 give extra samples from the models.