Maximum Entropy Generators for Energy-Based Models

Rithesh Kumar, Sherjil Ozair, Anirudh Goyal, Aaron Courville, Yoshua Bengio

Introduction

Unsupervised learning promises to take advantage from unlabelled data, and is regarded as crucial for artificial intelligence (Lake et al., 2017). Energy-based modeling (EBMs, LeCun et al. (2006)) is a family of unsupervised learning methods focused on learning an energy function, i.e., an unnormalized log density of the data. This removes the need to make parametric assumptions about the data distribution to make the normalizing constant (ZZ) tractable. However, in practice, due to the very same lack of restrictions, learning high-quality energy-based models is fraught with challenges. To avoid explicitly computing ZZ or its gradient, Contrastive Divergence (Hinton, 2000) and Stochastic Maximum Likelihood (Younes, 1998; Tieleman, 2008a) rely on Markov Chain Monte Carlo (MCMC) to approximately sample from the energy-based model. However, MCMC-based sampling approaches frequently suffer from long mixing times for high-dimensional data. Thus, training of energy-based models has not remained competitive with other unsupervised learning techniques such as variational auto-encoders (Kingma & Welling, 2014) and generative adversarial networks or GANs (Goodfellow et al., 2014).

In this work, we propose Maximum Entropy Generators (MEG), a framework in which we train both an energy function and an approximate sampler, which can either be fast (using a generator network GG) or uses GG to initialize a Markov chain in the latent space of the generator. Training such a generator properly requires entropy maximization of the generator’s output distribution, for which we take advantage of recent advances in nonparametric mutual information maximization (Belghazi et al., 2018; Hjelm et al., 2018; Oord et al., 2018; Poole et al., 2018).

To evaluate the efficacy of the proposed technique, we compare against other state-of-the-art techniques on image generation, accurate mode representation, and anomaly detection. We demonstrate that the proposed technique is able to generate CIFAR-10 samples which are competitive with WGAN-GP (Gulrajani et al., 2017) according to the Fréchet Inception Distance (Heusel et al., 2017) and Inception Score (Salimans et al., 2016), and is able to generate samples of all the 10410^{4} modes of 4-StackedMNIST at the correct data frequencies.

We demonstrate that our technique trains energy functions useful for anomaly detection on the KDD99 dataset, and that it performs as well as state-of-the-art anomaly detection techniques which were specially designed for the task, and vastly outperform other energy-based and generative models for anomaly detection.

To summarize our contributions, we propose maximum entropy generators (MEG), a novel framework for training energy-based models using amortized neural generators and mutual information maximization. We show that the resulting energy function can be successfully used for anomaly detection, and outperforms recently published results with energy-based models. We show that MEG generates sharp images – with competitive Inception and FID scores – and accurately captures more modes than standard GANs, while not suffering from the common mode-mixing issue of many maximum likelihood generative models which results in blurry samples.

Background

where Zθ:=eEθ(x)dxZ_{\theta}:=\int e^{-E_{\bm{\theta}}({\bm{x}})}d{\bm{x}} is the normalizing constant or partition function. Let pDp_{D} be the training distribution, from which the training set is drawn. Towards optimizing the parameters θ\theta of the energy function, the maximum likelihood parameter gradient is

where the second term is the gradient of logZθ\log Z_{\theta}, and the sum of the two expectations is zero when training has converged, with expected energy gradients in the positive phase (under the data pDp_{D}) matching those under the negative phase (under pθ(x)p_{\theta}({\bm{x}})). Training thus consists in trying to separate two distributions: the positive phase distribution (associated with the data) and the negative phase distribution (where the model is free-running and generating configurations by itself). This observation has motivated the pre-GAN idea presented by Bengio (2009) that “model samples are negative examples" and a classifier could be used to learn an energy function if it separated the data distribution from the model’s own samples. Shortly after introducing GANs, Goodfellow (2014) also made a similar connection, related to noise-contrastive estimation (Gutmann & Hyvarinen, 2010). One should also recognize the similarity between Eq. 2 and the objective function for Wasserstein GANs or WGAN (Arjovsky et al., 2017).

The main challenge in Eq. 2 is to obtain samples from the distribution pθp_{\theta} associated with the energy function EθE_{\theta}. Although having an energy function is convenient to obtain a score allowing comparison of the relative probability for different x{\bm{x}}’s, it is difficult to convert an energy function into a generative process. The commonly studied approaches for this are based on Markov Chain Monte Carlo, in which one iteratively updates a candidate configuration, until these configurations converge in distribution to the desired distribution pθp_{\theta}. For the RBM, the most commonly used algorithms have been Contrastive Divergence (Hinton, 2000) and Stochastic Maximum Likelihood (Younes, 1998; Tieleman, 2008a), relying on the particular structure of the RBM to perform Gibbs sampling. Although these MCMC-based methods are appealing, RBMs (and their deeper form, the deep Boltzmann machine) have not been competitive in recent years compared to autoregressive models (van den Oord et al., 2016), variational auto-encoders (Kingma & Welling, 2014) and generative adversarial networks or GANs (Goodfellow et al., 2014).

What has been hypothesized as a reason for poorer results obtained with energy-based models trained with an MCMC estimator for the negative phase gradient is that running a Markov chain in data space is fundamentally difficult when the distribution is concentrated (e.g, near manifolds) and has many modes separated by vast areas of low probability. This mixing challenge is discussed by Bengio et al. (2013) who argue that a Markov chain is very likely to produce only sequences of highly probable configurations: if two modes are far from each other and only local moves are possible (which is typically the case when performing MCMC), it becomes exponentially unlikely to traverse the “desert” of low probability that can separate two modes. This makes mixing between modes difficult in high-dimensional spaces with strong concentration of probability mass in some regions (e.g. corresponding to different categories) and very low probability elsewhere.

Maximum Entropy Generators for Energy-Based Models

We thus propose using an amortized neural sampler to perform fast approximate sampling to train the energy model. We begin by replacing the model distribution pθp_{\theta} in in Eq. 2 by a neural generator GG parametrized by w{\bm{w}}. We define PGP_{G} as the distribution of the outputs G(z)G(z) for zpz{\bm{z}}\sim p_{z} where pzp_{z} is a simple prior distribution such as a standard Normal distribution.

To minimize the approximation error, pGp_{G} must be close to pθp_{\theta}. To do so, we tune GG to minimize the KL divergence KL(pGpθ)KL(p_{G}||p_{\theta}), which can be rewritten in terms of minimizing the energy of the samples from the generator while maximizing the entropy at the output of the generator:

When taking the gradient of KL(pGpθ)KL(p_{G}||p_{\theta}) with respect to the parameters w{\bm{w}} of the generator, the log-partition function logZθ\log Z_{\theta} disappears and we can optimize w{\bm{w}} by minimizing

where pzp_{z} is the prior distribution of the latent variable of the generator.

In order to approximately maximize the entropy H[pG]H[p_{G}] at the output of the generator, we use one recently proposed nonparametric mutual information maximization techniques (Belghazi et al., 2018; Oord et al., 2018; Hjelm et al., 2018). Poole et al. (2018) show that these techniques can be unified into a single framework derived from the variational bound of Barber & Agakov (2003). Since the generator is deterministic, mutual information between inputs and outputs reduces to simply entropy of the outputs, since the conditional entropy of a deterministic function is zero:

In particular, we use the estimator from Hjelm et al. (2018), which estimates the Jensen-Shannon divergence between the joint distribution (p(x,z)p({\bm{x}},{\bm{z}})) and the product of marginals (p(x)p(z)p({\bm{x}})p({\bm{z}})). We refer to this information measure as IJSD(X,Z)I_{JSD}(X,Z). We found that the JSD-based estimator works better in practice than the KL-based estimator (which corresponds to the mutual information).

The estimator of Hjelm et al. (2018) is given by

With X=G(Z)X=G(Z) the output of the generator, IJSD(G(Z),Z){\cal I}_{JSD}(G(Z),Z) is one of the terms to be maximized in the objective function for training GG, which would maximize the generator’s output entropy H(G(Z))H(G(Z)). Thus the final training objective to be minimized for the generator GG and the energy function EE is

where ZpzZ\sim p_{z}, the latent prior (typically a N(0,I)N(0,I) Gaussian).

As can be seen from the above equations, the generator and the energy function are in an adversarial game, similar to generative adversarial networks (Goodfellow et al., 2014). This makes optimization via simultaneous gradient descent challenging since the gradient vector field of such an optimization problem is non-conservative as noted by Mescheder et al. (2017). This is particularly accentuated by the use of deep neural networks for the generator and the energy function. In particular, we noticed that during training the magnitude of the energy function values would diverge.

To help alleviate this issue we look towards another technique for learning energy-based models called score matching proposed by Hyvärinen (2005). Score matching estimates the energy function by matching the score functions of the data density and the model density, where the score function ψ\psi is the gradient of the log density with respect to the sample ψ(x)=logp(x)x\psi({\bm{x}})=\frac{\partial\log p({\bm{x}})}{\partial{\bm{x}}}. If ψD(x)\psi_{D}({\bm{x}}) and ψE(x)\psi_{E}({\bm{x}}) are the score functions under the data distribution and model distribution respectively, the score matching objective is given by

While the score function for the data distribution is typically unknown and would require estimation, Theorem 1 in Hyvärinen (2005) shows that with partial integrations, the score matching objective can be reduced to the following objective which does not depend on the score function under the data distribution:

The above objective is hard to optimize when using deep neural networks because of the difficulty in estimating the gradient of the Hessian diagonal, so we use the first term in our objective, i.e. the zero-centered gradient penalty, pushing the data points to sit near critical points (generally a local minimum) of the energy function.

This term is also similar to the gradient penalty regularization proposed by Gulrajani et al. (2017) which however is one-centered and applied on interpolations of the data and model samples, and is derived from the Lipschitz continuity requirements of Wasserstein GANs (Arjovsky et al., 2017).

2 Improving sample quality via latent space MCMC

Since MEG simultaneously trains a generator and a valid energy function, we can improve the quality of samples by biasing sampling towards high density regions. Furthermore, doing the MCMC walk in the latent space should be easier than in data space because the transformed data manifold (in latent space) is flatter than in the original observed data space, as initially discussed by Bengio et al. (2013). The motivation is also similar to that of the “truncation trick” used successfully by Brock et al. (2018). However, we use an MCMC-based approach for this which is applicable to arbitrary latent distributions.

We use the Metropolis-adjusted Langevin algorithm (MALA, Girolami & Calderhead (2011)), with Langevin dynamics producing a proposal distribution in the latent space as follows:

Experiments

To understand the benefits of MEG, we first visualize the energy densities learnt by our generative model on toy data. Next, we evaluate the efficacy of our entropy maximizer by running discrete mode collapse experiments to verify that we learn all modes and the corresponding mode count (frequency) distribution. Furthermore, we evaluate the performance of MEG on sharp image generation, since this is a common failure mode of models trained with maximum likelihood which tend to generate blurry samples (Theis et al., 2015). We also compare MCMC samples in visible space and our proposed sampling from the latent space of the composed energy function. Finally, we run anomaly detection experiments to test the application of the learnt energy function.

We’ve released open-source codehttps://github.com/ritheshkumar95/energy_based_generative_models for all the experiments.

Generative models trained with maximum likelihood often suffer from the problem of spurious modes and excessive entropy of the trained distribution, where the model incorrectly assigns high probability mass to regions not present in the data manifold. Typical energy-based models such as RBMs suffer from this problem partly because of the poor approximation of the negative phase gradient, as discussed above, and the large price paid in terms of log-likelihood for not putting enough probability mass near data points (i.e. for missing modes).

To check if MEG suffers from spurious modes, we train the energy-based model on synthetic 2D datasets (swissroll, 25gaussians and 8gaussians) and visualize the energy function. From the probability density plots on Figure 2, we can see that the energy model doesn’t suffer from spurious modes and learns a sharp distribution.

2 Investigating Mode Collapse

GANs are notorious for having mode collapse issues wherein certain modes of the data distribution are not represented by the generated distribution. Since the generator is trained to minimize its KL divergence with the energy model distribution (which is trained via maximum likelihood), we expect the generator to faithfully capture all the modes of the data distribution. Our theory requires we maximize entropy of the generated distribution, which we believe is instrumental in ensuring full mode capture.

To empirically verify MEG captures all the modes of the data distribution, we follow the same experimental setup as (Metz et al., 2016) and (Srivastava et al., 2017). We train our generative model on the StackedMNIST dataset, which is a synthetic dataset created by stacking MNIST on different channels. The number of modes are counted using a pretrained MNIST classifier, and the KL divergence is calculated empirically between the generated mode distribution and the data distribution.

From Table 1, we can see that MEG naturally covers all the modes in that data, without dropping a single mode. Apart from just representing all the modes of the data distribution, MEG also better matches the data distribution as evidenced by the significantly smaller KL divergence score compared to the baseline WGAN-GP.

Apart from the standard 3-StackMNIST, we also evaluate MEG on a new dataset with 10410^{4} modes (4 stacks)The 4-StackedMNIST was created in a way analogous to the original 3-StackedMNIST dataset. We randomly sample and fix 128×104128\times 10^{4} images to train the generative model and take 26×10426\times 10^{4} samples for evaluations. which is evidence that MEG does not suffer from mode collapse issues unlike state-of-the-art GANs like WGAN-GP.

3 Modeling Natural Images

While the energy landscapes in Figure 2 provide evidence that MEG trains energy models with sharp distributions, we next investigate if this also holds when learning a distribution over high-dimensional natural images. Energy-based models trained with existing techniques produce blurry samples due to the energy function not learning a sharp distribution.

We train MEG on the standard benchmark 32x32 CIFAR10 dataset for image modeling. We additionally train MEG on the 64x64 cropped CelebA - celebrity faces dataset to report qualitative samples from MEG. Similar to recent GAN works (Miyato et al., 2018), we report both Inception Score (IS) and Fréchet Inception Distance (FID) scores on the CIFAR10 dataset and compare it with a competitive WGAN-GP baseline.

From Table 2, we can see that in addition to learning an energy function, MEG-trained generative model produces samples comparable to recent GAN methods such as WGAN-GP (Gulrajani et al., 2017). Note that the perceptual quality of the samples improves by using the proposed MCMC sampler in the latent space. See also Figure 3 in Appendix B for an ablation study which shows that MCMC on the visible space does not perform as well as MCMC on the latent space.

4 Anomaly Detection

Apart from the usefulness of energy estimates for relative density estimation (up to the normalization constant), energy functions can also be useful to perform unsupervised anomaly detection. Unsupervised anomaly detection is a fundamental problem in machine learning, with critical applications in many areas, such as cyber-security, complex system management, medical care, etc. Density estimation is at the core of anomaly detection since anomalies are data points residing in low probability density areas. We test the efficacy of our energy-based density model for anomaly detection using two popular benchmark datasets: KDDCUP and MNIST.

We first test our generative model on the KDDCUP99 10 percent dataset from the UCI repository (Lichman et al., 2013). Our baseline for this task is Deep Structured Energy-based Model for Anomaly Detection (DSEBM) (Zhai et al., 2016), which trains deep energy models such as Convolutional and Recurrent EBMs using denoising score matching (Vincent, 2011) instead of maximum likelihood, for performing anomaly detection. We also report scores on the state of the art DAGMM (Zong et al., 2018), which learns a Gaussian Mixture density model (GMM) over a low dimensional latent space produced by a deep autoencoder. We train MEG on the KDD99 data and use the score norm xEθ(x)22||\nabla_{x}E_{\theta}(x)||_{2}^{2} as the decision function, similar to Zhai et al. (2016).

From Table 3, we can see that the MEG energy function outperforms the previous SOTA energy-based model (DSEBM) by a large margin (+0.1990 F1 score) and is comparable to the current SOTA model (DAGMM) specifically designed for anomaly detection. Note that DAGMM is specially designed for anomaly detection, while MEG is a general-purpose energy-based model.

MNIST

Next we evaluate our generative model on anomaly detection of high dimensional image data. We follow the same experiment setup as (Zenati et al., 2018) and make each digit class an anomaly and treat the remaining 9 digits as normal examples. We also use the area under the precision-recall curve (AUPRC) as the metric to compare models.

From Table 4, it can be seen that our energy model outperforms VAEs for outlier detection and is comparable to the SOTA BiGAN-based anomaly detection methods for this dataset (Zenati et al., 2018) which train bidirectional GANs to learn both an encoder and decoder (generator) simultaneously and use a combination of the reconstruction error in output space as well as the discriminator’s cross entropy loss as the decision function.

Our aim here is not to claim state-of-the-art on the task of anomaly detection but to demonstrate the quality of the energy functions learned by our technique, as judged by its competitive performance on anomaly detection.

Related Work

Early work on deep learning relied on unsupervised learning (Hinton et al., 2006; Bengio et al., 2007; Larochelle et al., 2009) to train energy-based models (LeCun et al., 2006), in particular Restricted Boltzmann Machines, or RBMs. Hinton (2000) proposed kk-step Contrastive Divergence (CD-kk), to efficiently approximate the negative phase log-likelihood gradient. Subsequent work have improved on CD-kk such as Persistent CD (Salakhutdinov & Hinton, 2009; Tieleman, 2008b). Hyvärinen (2005) proposed an alternative method to train non-normalized graphical models using Score Matching, which does not require computation of the partition function.

Kim & Bengio (2016) and Dai et al. (2017) also learn a generator that approximates samples from an energy-based model. However, their approach for entropy maximization is different from our own. Kim & Bengio (2016) argue that batch normalization (Ioffe & Szegedy, 2015) makes the hidden activations of the generator network approximately Gaussian distributed and thus maximize the log-variance for each hidden activation of the network. Dai et al. (2017) propose two approaches to entropy maximization. One which minimizes entropy of the inverse model (pgen(zx)p_{gen}(z|x)) which is approximated using an amortized inverse model similar to ALI (Dumoulin et al., 2016), and another which makes isotropic Gaussian assumptions for the data. In our work, we perform entropy maximization using a tight mutual information estimator which does not make any assumptions about the data distribution.

Zhao et al. (2016) use an autoencoder as the discriminator and use the reconstruction loss as a signal to classify between real and fake samples. The autoencoder is highly regularized to allow its interpretation as an energy function. However Dai et al. (2017) prove that the EBGAN objective does not guarantee the discriminator to recover the true energy function. The generator diverges from the true data distribution after matching it, since it would continue to receive training signal from the discriminator. The discriminator signal does not vanish even at optimality (when PG=PDP_{G}=P_{D}) if it retains density information, since some samples would be considered "more real" than others.

Conclusion

We proposed MEG, an energy-based generative model that produces energy estimates using an energy model and a generator that produces fast approximate samples. This takes advantage of novel methods to maximize the entropy at the output of the generator using a nonparametric mutual information lower bound estimator. We have shown that our energy model learns good energy estimates using visualizations in toy 2D datasets and through performance in unsupervised anomaly detection. We have also shown that our generator produces samples of high perceptual quality by measuring Inception Scores and Fréchet Inception Distance and shown that MEG is robust to the respective weaknesses of GAN models (mode dropping) and maximum-likelihood energy-based models (spurious modes).

Acknowledgements

The authors acknowledge the funding provided by NSERC, Canada AI Chairs, Samsung, Microsoft, Google and Facebook. We also thank NVIDIA for donating a DGX-1 computer used for certain experiments in this work.

References

Appendix A Training Algorithm

Appendix B MCMC sampling in visible vs latent space