Training generative neural networks via Maximum Mean Discrepancy optimization

Gintare Karolina Dziugaite, Daniel M. Roy, Zoubin Ghahramani

Introduction

In this paper, we consider the problem of learning generative models from i.i.d. data with unknown distribution P\mathcal{P}. We formulate the learning problem as one of finding a function GG, called the generator, such that, given an input ZZ drawn from some fixed noise distribution N\mathcal{N}, the distribution of the output G(Z)G(Z) is close to the data’s distribution P\mathcal{P}. Note that, given GG and N\mathcal{N}, we can easily generate new samples despite not having an explicit representation for the underlying density.

We are particularly interested in the case where the generator is a deep neural network whose parameters we must learn. Rather than being used to classify or predict, these networks transport input randomness to output randomness, thus inducing a distribution. The first direct instantiation of this idea is due to , although MacKay draws connections even further back to the work of and others on autoencoders, suggesting that generators can be understood as decoders. MacKay’s proposal, called density networks, uses multi-layer perceptrons (MLP) as generators and learns the parameters by approximating Bayesian inference.

Since MacKay’s proposal, there has been a great deal of progress on learning generative models, especially over high-dimensional spaces like images. Some of the most successful approaches have been based on restricted Boltzmann machines and deep Boltzmann networks . A recent example is the Neural Autoregressive Density Estimator due to . An indepth survey, however, is beyond the scope of this article.

This work builds on a proposal due to . Their adversarial nets framework takes an indirect approach to learning deep generative neural networks: a discriminator network is trained to recognize the difference between training data and generated samples, while the generator is trained to confuse the discriminator. The resulting two-player game is cast as a minimax optimization of a differentiable objective and solved greedily by iteratively performing gradient descent steps to improve the generator and then the discriminator.

Given the greedy nature of the algorithm, give a careful prescription for balancing the training of the generator and the discriminator. In particular, two gradient steps on the discriminator’s parameters are taken for every iteration of the generator’s parameters. It is not clear at this point how sensitive this balance is as the data set and network vary. In this paper, we describe an approximation to adversarial learning that replaces the adversary with a closed-form nonparametric two-sample test statistic based on the Maximum Mean Discrepancy (MMD), which we adopted from the kernel two sample test . We call our proposal MMD nets.In independent work reported in a recent preprint, Li, Swersky, and Zemel also propose to use MMD as a training objective for generative neural networks. We leave a comparison to future work. We give bounds on the estimation error incurred by optimizing an empirical estimator rather than the true population MMD and give some illustrations on synthetic and real data.

Learning to sample as optimization

where δ\delta is some measure of discrepancy and Gθ(N)G_{\theta}(\mathcal{N}) is the distribution of Gθ(W)G_{\theta}(W) when WNW\sim\mathcal{N}. In practice, we only have i.i.d. samples X1,X2,X_{1},X_{2},\dotsc from P\mathcal{P}, and so we optimize an empirical estimate of δ(P,Gθ(N))\delta(\mathcal{P},G_{\theta}(\mathcal{N})).

where XPX\sim\mathcal{P} and YGθ(N)Y\sim G_{\theta}(\mathcal{N}). In this case, Eq. 1 becomes

for XPX\sim\mathcal{P} and WNW\sim\mathcal{N}. The output of the discriminator DϕD_{\phi} can be interpreted as the probability it assigns to its input being drawn from P\mathcal{P}, and so V(Gθ,Dϕ)V(G_{\theta},D_{\phi}) is the expected log loss incurred when classifying the origin of a point equally likely to have been drawn from P\mathcal{P} or Gθ(N)G_{\theta}(\mathcal{N}). Therefore, optimizing ϕ\phi maximizes the probability of distinguishing samples from P\mathcal{P} and Gθ(N)G_{\theta}(\mathcal{N}). Assuming that the optimal discriminator exists for every θ\theta, the optimal generator GG is that whose output distribution is closest to P\mathcal{P}, as measured by the Jensen–Shannon divergence, which is minimized when Gθ(N)=PG_{\theta}(\mathcal{N})=\mathcal{P}.

In , the generators GθG_{\theta} and discriminators DϕD_{\phi} are chosen to be multilayer perceptrons (MLP). In order to find a minimax solution, they propose taking alternating gradient steps along DϕD_{\phi} and GθG_{\theta}. Note that the composition Dϕ(Gθ())D_{\phi}(G_{\theta}(\cdot)) that appears in the value function is yet another (larger) MLP. This fact permits the use of the back-propagation algorithm to take gradient steps.

2 MMD as an adversary

In their paper introducing adversarial nets, remark that a balance must be struck between optimizing the generator and optimizing the discriminator. In particular, the authors suggest kk maximization steps for every one minimization step to ensure that DϕD_{\phi} is well synchronized with GθG_{\theta} during training. A large value for kk, however, can lead to overfitting. In their experiments, for every step taken along the gradient with respect to GθG_{\theta}, they take two gradient steps with respect to DϕD_{\phi} to bring DϕD_{\phi} closer to the desired optimum (Goodfellow, pers. comm.).

It is unclear how sensitive this balance is. Regardless, while adversarial networks deliver impressive sampling performance, the optimization takes approximately 7.5 hours to train on the MNIST dataset running on a nVidia GeForce GTX TITAN GPU with 6GB RAM. Can we potentially speed up the process with a more tractable choice of adversary?

where XPX\sim\mathcal{P} and YGθ(N)Y\sim G_{\theta}(\mathcal{N}). See Fig. 1 for a comparison of the architectures of adversarial and MMD nets.

While Eq. 2 involves a maximization over a family of functions, show that it can be solved in closed form when H\mathcal{H} is a reproducing kernel Hilbert space (RKHS).

More carefully, let H\mathcal{H} be a reproducing kernel Hilbert space (RKHS) of real-valued functions on Ω\Omega and let ,H\langle\cdot,\cdot\rangle_{\mathcal{H}} denote its inner product. By the reproducing property it follows that there exists a reproducing kernel kHk\in\mathcal{H} such that every fHf\in\mathcal{H} can be expressed as

The functions induced by a kernel kk are those functions in the closure of the span of the set {k(,x):xΩ}\{k(\cdot,x):x\in\Omega\}, which is necessarily an RKHS. Note, that for every positive definite kernel there is a unique RKHS H\mathcal{H} such that every function in H\mathcal{H} satisfies Eq. 3.

If F\mathcal{F} is chosen to be an RKHS H\mathcal{H}, then

where μpH\mu_{p}\in\mathcal{H} is the mean embedding of pp, given by

and satisfying, for all fHf\in\mathcal{H},

In practice, we often do not have access to pp or qq. Instead, we are given independent i.i.d. data X,X,X1,,XNX,X^{\prime},X_{1},\dotsc,X_{N} and Y,Y,Y1,,YMY,Y^{\prime},Y_{1},\dotsc,Y_{M} fom pp and qq, respectively, and would like to estimate the MMD. showed that

MMD Nets

Note that C(Yθ,X)C(Y_{\theta},X) is comprised of only those parts of the unbiased estimator that depend on θ\theta.

In practice, the minimization is solved by gradient descent, possibly on subsets of the data. More carefully, the chain rule gives us

Each derivative Cn(Yθ,Xn)ym\frac{\partial C_{n}(Y_{\theta},X_{n})}{\partial y_{m}} is easily computed for standard kernels like the RBF kernel. Our gradient C(Yθ,Xn)\nabla C(Y_{\theta},X_{n}) depends on the partial derivatives of the generator with respect to its parameters, which we can compute using back propagation.

Generalization bounds for MMD

MMD nets operate by minimizing an empirical estimate of the MMD. This estimate is subject to Monte Carlo error and so the network weights (parameters) θ^\hat{\theta} that are found to minimize the empirical MMD may do a poor job at minimizing the exact population MMD. We show that, for sufficiently large data sets, this estimation error is bounded, despite the space of parameters θ\theta being continuous and high dimensional.

Let Θ\Theta denote the space of possible parameters for the generator GθG_{\theta}, let N\mathcal{N} be the distribution on W\mathcal{W} for the noisy inputs, and let pθ=Gθ(N)p_{\theta}=G_{\theta}(\mathcal{N}) be the distribution of Gθ(W)G_{\theta}(W) when WNW\sim\mathcal{N} for θΘ\theta\in\Theta. Let θ^\hat{\theta} be the value optimizing the unbiased empirical MMD estimate, i.e.,

and let θ\theta^{*} be the value optimizing the population MMD, i.e.,

We are interested in bounding the difference

To that end, for a measured space X\mathcal{X}, write L(X)L_{\infty}(\mathcal{X}) for the space of essentially bounded functions on X\mathcal{X} and write B(L(X))B(L_{\infty}(\mathcal{X})) for the unit ball under the sup norm, i.e.,

The bounds we obtain will depend on a notion of complexity captured by the fat-shattering dimension:

For every ε\varepsilon, the fat-shattering dimension of F\mathcal{F}, written fatε(F)\text{fat}_{\varepsilon}(\mathcal{F}), is defined as

We then have the following bound on the estimation error:

Assume the kernel is bounded by one. Define

for constants Cp1C_{p_{1}} and Cp2C_{p_{2}} depending on p1p_{1} and p2p_{2} alone.

Empirical evaluation

In this section, we demonstrate the approach on an illustrative synthetic example as well as the standard MNIST digits and Toronto Face Dataset (TFD) benchmarks. We show that MMD-based optimization of the generator rapidly delivers a generator that performs well in maximizing the density of a held-out test set under a kernel-density estimator.

Under an RBF kernel and Gaussian generator with parameters θ={μ,σ}\theta=\{\mu,\sigma\}, it is straightforward to find the gradient of C(Yθ,X)C(Y_{\theta},X) by applying the chain rule. Using fixed random standard normal numbers {w1,...,wM}\{w_{1},...,w_{M}\}, we have ym=μ+σwmy_{m}=\mu+\sigma w_{m} for m{1,..,M}m\in\{1,..,M\}. The result of these illustrative synthetic experiments can be found in Fig. 1. The dataset consisted of N=200N=200 samples from a standard normal and M=50M=50 noise input samples were generated from a standard normal with a fixed random seed. The algorithm was initialized at values {μ,σ}={2.5,0.1}\{\mu,\sigma\}=\{2.5,0.1\}. We fixed the learning rate to 0.50.5 and ran gradient descent steps for K=250K=250 iterations.

2 MNIST digits

We trained our generative network on the MNIST digits dataset . The generator was chosen to be a fully connected, 3 hidden layers neural network with sigmoidal activation functions. Following , we used a radial basis function (RBF) kernel, but also evaluated the rational quadratic (RQ) kernel and Laplacian kernel, but found that the RBF performed best in the parameter ranges we evaluated. We used Bayesian optimization (WHETLab) to set the bandwidth of the RBF and the number of neurons in each layer on initial test runs of 50,000 iterations. We used the median heuristic suggested by for the kernel two-sample test to choose the kernel bandwidth. The learning rate was adjusting during optimization by RMSPROP .

Fig. 2 presents the digits learned after 1,000,000 iterations. We performed minibatch stochastic gradient descent, resampling the generated digits every 300 iterations, using minibatches of size 500, with equal numbers of training and generated points. It is clear that the digits produced have many artifacts not appearing in the MNIST data set. Despite this, the mean log density of the held-out test data is 315±2{\bf 315\pm 2}, as compared with the reported 225±2{\bf 225\pm 2} mean log density achieved by adversarial nets.

There are several possible explanations for this. First, kernel density estimation is known to perform poorly in high dimensions. Second, the MMD objective can itself be understood as the squared difference of two kernel density estimates, and so, in a sense, the objective being optimized is directly related to the subsequent mean test log density evaluation. There is no clear connection for adversarial networks, which might explain why it suffers under this test. Our experience suggests that the RBF kernel delivers base line performance but that an image-specific kernel, capturing, e.g., shift invariance, might lead to better images.

3 Toronto face dataset

We have also trained the generative MMD network on Toronto face dataset (TFD) . The parameters were adapted from the MNIST experiment: we also used a 3-hidden-layer sigmoidal MLP with similar architecture (1000, 600, and 1000 units) and RBF kernel for the cost function with the same hyper parameter. The training dataset batch sizes were equal to the number of generated points (500). The generated points were resampled every 500 iterations. The network was optimized for 500,000 iterations.

The samples from the resulting network are plotted in Fig. 3. The mean log density of the held-out test set is 2283 ±\pm 39. Although this figure is higher than the mean log density of 2057 ±\pm 26 reported in , the samples from the MMD network are again clearly distinguishable from the training dataset. Thus the high test score suggests that kernel density estimation does not perform well at evaluating the performance for these high dimensional datasets.

Conclusion

MMD offers a closed form surrogate for the discriminator in adversarial nets framework. After using Bayesian optimization for the parameters, we found that the network outperformed the adversarial network in terms of the density of the held-out test set under kernel density estimation. On the other hand, there is a clear discrepancy between the digits produced by MMD Nets and the MNIST digits, which might suggest that KDE is not up to the task of evaluating these models. Given how quickly MMD Nets achieves this level of performance, it is worth considering its use as an initialization for more costly procedures.

Acknowledgments

The authors would like to thank Bharath Sriperumbudur for technical discussions.

Appendix A Proofs

We begin with some preliminaries and known results:

A random variable σ\sigma is said to be a Rademacher random variable if it takes values in {1,1}\{-1,1\}, each with probability 1/21/2.

Let μ\mu be a probability measure on X\mathcal{X}, and let F\mathcal{F} be a class of uniformly bounded functions on X\mathcal{X}. Then the Rademacher complexity of F\mathcal{F} (with respect to μ\mu) is

where σ=(σ1,σ2,)\sigma=(\sigma_{1},\sigma_{2},\dotsc) is a sequence of independent Rademacher random variables, and X1,X2,X_{1},X_{2},\dotsc are independent, μ\mu-distributed random variables, independent also from σ\sigma.

Then, for all ε>0\varepsilon>0 and independent random variables ξ1,,ξn\xi_{1},\dotsc,\xi_{n} in X\mathcal{X},

Assume 0k(xi,xj)K0\leq k(x_{i},x_{j})\leq K, M=NM=N. Then

The case where Θ\Theta is a finite set is elementary:

Let pθp_{\theta} be the distribution of Gθ(W)G_{\theta}(W), with θ\theta taking values in some finite set Θ={θ1,...,θT},T<\Theta=\{\theta_{1},...,\theta_{T}\},T<\infty. Then, with probability at least 1(T+1)δε1-(T+1)\delta_{\varepsilon}, where δε\delta_{\varepsilon} is defined as in Theorem 4, we have

Note, that the upper bound stated in Theorem 4 holds for the parameter value θ\theta^{*}, i.e.,

Because θ^\hat{\theta} depends on the training data XX and generator data YY, we use a uniform bound that holds over all θ\theta. Specifically,

This yields that with probability at least 1Tδε1-T\delta_{\varepsilon},

Since θ\theta^{*} was chosen to minimize T(θ)\mathcal{T}(\theta), we know that

In order to prove the general result, we begin with some technical lemmas. The development here owes much to .

Then there exists a constant CC that depends on pp, such that

Let us introduce {ζn}n=1N\{\zeta_{n}\}_{n=1}^{N}, where ζn\zeta_{n} and YnY_{n^{\prime}} have the same distribution and are independent for all n,n{1,,N}n,n^{\prime}\in\{1,\dots,N\}. Then the following is true:

Using Jensen’s inequality and the independence of Y,YY,Y^{\prime} and Yn,YnY_{n},Y_{n^{\prime}}, we have

Introducing conditional expectations allows us to rewrite the equation with the sum over nn outside the expectations. I.e., Eq. 36 equals to

The second equality follows by symmetry of random variables {ζn}n=1N1\{\zeta_{n}\}_{n=1}^{N-1}. Note that we also added Rademacher random variables {σn}n=1N1\{\sigma_{n}\}_{n=1}^{N-1} before each term in the sum since (f(ζn,ζn)f(Yn,Yn))(f(\zeta_{n},\zeta_{n^{\prime}})-f(Y_{n},Y_{n^{\prime}})) has the same distribution as (f(ζn,ζn)f(Yn,Yn))-(f(\zeta_{n},\zeta_{n^{\prime}})-f(Y_{n},Y_{n^{\prime}})) for all n,nn,n^{\prime} and therefore the σ\sigma’s do not affect the expectation of the sum.

Note that ζm\zeta_{m} and YmY_{m} are identically distributed. Thus the triangle inequality implies that Eq. 39 is less than or equal to

where RN1(F+)R_{N-1}(\mathcal{F}_{+}) is the Rademacher’s complexity of F+\mathcal{F}_{+}. Then by Theorem 3, we have

Then there exists CC that depends on pp, such that

The proof is very similar to that of Lemma 1. ∎

The proof follows the same steps as the proof of Theorem 5 apart from a stronger uniform bound stated in Appendix A. I.e., we need to show:

For all n{1,,N}n\in\{1,\dots,N\}, k(Xn,Xn)k(X_{n},X_{n^{\prime}}) does not depend on θ\theta and therefore the first two terms of the equation above can be taken out of the supremum. Also, note that since k(,)K|k(\cdot,\cdot)|\leq K, we have

and ζ\zeta is an unbiased estimate of E(k(X,X))E(k(X,X^{\prime})). Then from McDiarmid’s inequality on ζ\zeta, we have

Therefore LABEL:eq:supremuminequality is bounded by the sum of the bound on Eq. 45 and the following:

Thus the next step is to find the bound for the supremum above.

We will first find the upper bound on f(WM)f(\underline{W}_{M}), i.e., for every ε>0\varepsilon>0, we will show that there exists δf\delta_{f}, such that

since the kernel is bounded by KK, and therefore k(Gθ(Wm),Gθ(Wm))k(G_{\theta}(W_{m}),G_{\theta}(W_{m^{\prime}})) is bounded by KK for all mm. The conditions of Theorem 2 are satisfied and thus we can use McDiarmid’s Inequality on ff:

To show Eq. 48, we need to bound the expectation of ff. We can apply Lemma 1 on the function classes Gk\mathcal{G}_{k} and Gk+\mathcal{G}_{k+}. The resulting bound is

where p1p_{1} and γ1\gamma_{1} are parameters associated to fat shattering dimension of Gk+\mathcal{G}_{k+} as stated in the assumptions of the theorem, and CfC_{f} is a constant depending on p1p_{1}.

Similarly, h(XN,WM)h(\underline{X}_{N},\underline{W}_{M}) has bounded differences:

for some constant ChC_{h} that depends on p@p_{@}. The final bound on hh is then

Summing up the bounds from Eq. 54 and Eq. 57, it follows that

Using the bound in Eq. 45, we have obtain the uniform bound we were looking for:

Since it was assumed that K=1K=1 and N=MN=M, we get

To finish, we proceed as in the proof of Theorem 5. We can rearrange some of the terms to get a different form of Eq. 28:

All of the above implies that for any ε>0\varepsilon>0, there exists δ\delta, such that

The rate r(p,γ,N)r(p,\gamma,N) is given by Eq. 53 and Eq. 59:

where the constants Cp1C_{p1} and Cp2C_{p2} depend on p1p_{1} and p2p_{2} alone. ∎

We close by noting that the approximation error is zero in the nonparametric limit.

References