On the Anatomy of MCMC-Based Maximum Likelihood Learning of Energy-Based Models

Erik Nijkamp, Mitch Hill, Tian Han, Song-Chun Zhu, Ying Nian Wu

1 Introduction

Statistical modeling of high-dimensional signals is a challenging task encountered in many academic disciplines and practical applications. We study image signals in this work. When images come without annotations or labels, the effective tools of deep supervised learning cannot be applied and unsupervised techniques must be used instead. This work focuses on the unsupervised paradigm of the energy-based model (1) with a ConvNet potential function (2).

Previous works studying Maximum Likelihood (ML) training of ConvNet potentials, such as (?; ?; ?), use Langevin MCMC samples to approximate the gradient of the unknown and intractable log partition function during learning. The authors universally find that after enough model updates, MCMC samples generated by short-run Langevin from informative initialization (see Section 2.3) are realistic images that resemble the data.

However, we find that energy functions learned by prior works have a major defect regardless of MCMC initialization, network structure, and auxiliary training parameters. The long-run and steady-state MCMC samples of energy functions from all previous implementations are oversaturated images with significantly lower energy than the observed data (see Figure 2 top, and Figure 3). In this case it is not appropriate to describe the learned model as an approximate density for the training set because the model assigns disproportionately high probability mass to images which differ dramatically from observed data. The systematic difference between high-quality short-run samples and low-quality long-run samples is a crucial phenomenon that appears to have gone unnoticed in previous studies.

1.2 Our Contributions

In this work, we present a fundamental understanding of learning ConvNet potentials by MCMC-based ML. We diagnose previously unrecognized complications that arise during learning and distill our insights to train models with new capabilities. Our main contributions are:

Identification of two distinct axes which characterize each parameter update in MCMC-based ML learning: 1) energy difference of positive and negative samples, and 2) MCMC convergence or non-convergence. Contrary to common expectations, convergence is not needed for high-quality synthesis. See Figure 1 and Section 3.

The first ConvNet potentials trained using ML with purely noise-initialized MCMC. Unlike prior models, our model can efficiently generate realistic and diverse samples after training from noise alone. See Figure 7. This method is further explored in our companion work (?).

The first ConvNet potentials with realistic steady-state samples. To our knowledge, ConvNet potentials with realistic MCMC sampling in the image space are unobtainable by all previous training implementations. We refer to (?) for a discussion. See Figure 2 (bottom) and Figure 8 (middle and right column).

Mapping the macroscopic structure of image space energy functions using diffusion in a magnetized energy landscape for unsupervised cluster discovery. See Figure 9.

1.3 Related Work

Energy-based models define an unnormalized probability density over a state space to represent the distribution of states in a given system. The Hopfield network (?) adapted the Ising energy model into a model capable of representing arbitrary observed data. The RBM (Restricted Boltzmann Machine) (?) and FRAME (Filters, Random field, And Maximum Entropy) (?; ?) models introduce energy functions with greater representational capacity. The RBM uses hidden units which have a joint density with the observable image pixels. The FRAME model uses convolutional filters and histogram matching to learn data features.

The pioneering work (?) studies the hierarchical energy-based model. An important early work (?) proposes feedforward neural networks to model energy functions. The energy-based model in the form of (2) is introduced in (?). Deep variants of the FRAME model (?; ?) are the first to achieve realistic synthesis with a ConvNet potential and Langevin sampling. Similar methods are applied in (?). The Multi-grid model (?) learns an ensemble of ConvNet potentials for images of different scales. Learning a ConvNet potential with a generator network as an approximate direct sampler is explored in (?; ?; ?; ?; ?; ?). The works (?; ?; ?) learn a ConvNet potential in a discriminative framework.

Although many of these works claim to train the energy (2) to be an approximate unnormalized density for the observed images, the resulting energy functions do not have a steady-state that reflects the data (see Figure 3). Short-run Langevin samples from informative initialization are presented as approximate steady-state samples, but further investigation shows long-run Langevin consistently disrupts the realism of short-run images. Our work is the first to address and remedy the systematic non-convergence of all prior implementations.

Energy Landscape Mapping

The full potential of the energy-based model lies in the structure of the energy landscape. Hopfield observed that the energy landscape is a model of associative memory (?). Diffusion along the potential energy manifold is analogous to memory recall because the diffusion process will gradually refine a high-energy image (an incomplete or corrupted memory) until it reaches a low-energy metastable state, which corresponds to the revised memory. Techniques for mapping and visualizing the energy landscape of non-convex functions in the physical chemistry literature (?; ?) have been applied to map the latent space of Cooperative Networks (?). Defects in the energy function (2) from previous ML implementations prevent these techniques from being applied in the image space. Our convergent ML models enable image space mapping.

2 Learning Energy-Based Models

In this section, we review the established principles of MCMC-based ML learning from prior works such as (?; ?; ?).

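An energy-based model is a Gibbs-Boltzmann density

$$p_{\theta}(x)=\frac{1}{Z(\theta)}\exp\{-U(x;\theta)\} \tag{1}$$

over signals $x$, where $U(x;\theta)$ is the energy potential with parameters $\theta\in\Theta$ and $Z(\theta)=\int\exp\{-U(x;\theta)\}\,dx$ is the intractable normalizing constant. In this paper we focus on potentials of the form

$$U(x;\theta)=F(x;\theta) \tag{2}$$

where $F(x;\theta)$ is a ConvNet with a scalar output and weights $\theta$.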

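In ML learning, we seek to find $\theta\in\Theta$ such that the parametric model $p_{\theta}(x)$ is a close approximation of the data distribution $q(x)$. One measure of closeness is the Kullback-Leibler (KL) divergence. Learning proceeds by solving

$$\arg\min_{\theta}\mathcal{L}(\theta)=\arg\min_{\theta}D_{KL}(q\,\|\,p_{\theta}), \tag{3}$$

where, up to an additive constant that does not depend on $\theta$,

$$\mathcal{L}(\theta)=\log Z(\theta)+E_{q}[U(X;\theta)]. \tag{4}$$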

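We can minimize $\mathcal{L}(\theta)$ by finding the roots of the derivative

$$\frac{d}{d\theta}\mathcal{L}(\theta)=\frac{d}{d\theta}\log Z(\theta)+\frac{d}{d\theta}E_{q}[U(X;\theta)]. \tag{5}$$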

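The term $\frac{d}{d\theta}\log Z(\theta)$ is intractable, but it can be expressed as an expectation under the model:

$$\frac{d}{d\theta}\log Z(\theta)=-E_{p_{\theta}}\left[\frac{\partial}{\partial\theta}U(X;\theta)\right]. \tag{6}$$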
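The expectation form of this term follows from differentiating $Z(\theta)=\int\exp\{-U(x;\theta)\}\,dx$ under the integral sign. A short derivation, assuming the interchange of derivative and integral is valid for the potential $U$:

```latex
\begin{aligned}
\frac{d}{d\theta}\log Z(\theta)
  &= \frac{1}{Z(\theta)}\,\frac{d}{d\theta}\int \exp\{-U(x;\theta)\}\,dx \\
  &= \int \left(-\frac{\partial}{\partial\theta}U(x;\theta)\right)
          \frac{\exp\{-U(x;\theta)\}}{Z(\theta)}\,dx \\
  &= -E_{p_{\theta}}\!\left[\frac{\partial}{\partial\theta}U(X;\theta)\right].
\end{aligned}
```

The intractable integral is thereby replaced by an average over samples from $p_{\theta}$, which is what MCMC provides.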

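The gradient used to learn $\theta$ then becomes

$$\frac{d}{d\theta}\mathcal{L}(\theta)=\frac{d}{d\theta}E_{q}[U(X;\theta)]-E_{p_{\theta}}\left[\frac{\partial}{\partial\theta}U(X;\theta)\right] \tag{7}$$

$$\frac{d}{d\theta}\mathcal{L}(\theta)\approx\frac{\partial}{\partial\theta}\left(\frac{1}{n}\sum_{i=1}^{n}U(X_{i}^{+};\theta)-\frac{1}{m}\sum_{i=1}^{m}U(X_{i}^{-};\theta)\right) \tag{8}$$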

where $\{X^{+}_{i}\}_{i=1}^{n}$ are i.i.d. samples from the data distribution $q$ (called positive samples since their probability is increased), and $\{X_{i}^{-}\}_{i=1}^{m}$ are i.i.d. samples from the current learned distribution $p_{\theta}$ (called negative samples since their probability is decreased). In practice, the positive samples $\{X^{+}_{i}\}_{i=1}^{n}$ are a batch of training images and the negative samples $\{X_{i}^{-}\}_{i=1}^{m}$ are obtained after $L$ iterations of MCMC sampling.
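As a minimal sketch of the sample-based gradient, consider a hypothetical one-parameter toy energy $U(x;\theta)=\frac{1}{2}\theta x^{2}$ in place of a ConvNet. The function names and sample values below are illustrative assumptions and are not taken from the paper.

```python
def dU_dtheta(x):
    """Partial derivative of the toy energy U(x; theta) = 0.5 * theta * x**2 w.r.t. theta."""
    return 0.5 * x * x


def ml_gradient(positive, negative):
    """Monte Carlo estimate of dL/dtheta: average of dU/dtheta over the
    positive (data) samples minus the average over the negative (model) samples."""
    pos_term = sum(dU_dtheta(x) for x in positive) / len(positive)
    neg_term = sum(dU_dtheta(x) for x in negative) / len(negative)
    return pos_term - neg_term


positive = [1.0, -1.0, 2.0, -2.0]   # stand-in for a batch of training data
negative = [0.5, -0.5, 1.0, -1.0]   # stand-in for MCMC samples from p_theta
grad = ml_gradient(positive, negative)
```

In this example the positive samples are more spread out than the negative samples, so the estimate is positive. A gradient descent step then lowers $\theta$, which flattens the toy energy and widens the model toward the data.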

2.2 MCMC Sampling with Langevin Dynamics

Obtaining the negative samples $\{X_{i}^{-}\}_{i=1}^{m}$ from the current distribution $p_{\theta}$ is a computationally intensive task which must be performed for each update of $\theta$ . ML learning does not impose a specific MCMC algorithm. Early energy-based models such as the RBM and FRAME model use Gibbs sampling as the MCMC method. Gibbs sampling updates each dimension (one pixel of the image) sequentially. This is computationally infeasible when training an energy with the form (2) for standard image sizes.

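Several works studying the energy (2) use Langevin Dynamics to obtain the negative samples (?; ?; ?; ?; ?). The Langevin Equation

$$X_{\ell+1}=X_{\ell}-\frac{\varepsilon^{2}}{2}\frac{\partial}{\partial x}U(X_{\ell};\theta)+\varepsilon Z_{\ell}, \tag{9}$$

where $Z_{\ell}$ is standard Gaussian noise and $\varepsilon>0$ is the step size, has stationary distribution $p_{\theta}$ in the limit of small $\varepsilon$.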

Like most MCMC methods, Langevin dynamics exhibits high auto-correlation and has difficulty mixing between separate modes. Even so, long-run Langevin samples with a suitable initialization can still be considered approximate steady-state samples, as discussed next.
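A minimal one-dimensional sketch of the Langevin update, assuming the common discretization in which each step moves by $-\frac{\varepsilon^{2}}{2}\nabla U$ plus Gaussian noise of scale $\varepsilon$. The function name `langevin`, the toy energy $U(x)=x^{2}/2$, and the switch `tau` (set to 0 to drop the noise term) are assumptions made for illustration.

```python
import random


def langevin(x0, grad_U, eps, n_steps, tau=1, rng=None):
    """Run n_steps of discretized Langevin dynamics from x0 and return the final state."""
    if rng is None:
        rng = random.Random(0)
    x = x0
    for _ in range(n_steps):
        noise = rng.gauss(0.0, 1.0)
        # Gradient step of size eps^2 / 2 plus (optional) noise of scale eps.
        x = x - 0.5 * eps**2 * grad_U(x) + tau * eps * noise
    return x


def grad_U(x):
    """Gradient of the toy energy U(x) = x^2 / 2, whose density is a standard normal."""
    return x


# Without noise the update is plain gradient descent on U.
x_det = langevin(1.0, grad_U, eps=0.1, n_steps=10, tau=0)

# With noise, independent chains started far from the mode spread out
# toward the standard normal target.
rng = random.Random(1)
samples = [langevin(3.0, grad_U, eps=0.5, n_steps=100, rng=rng) for _ in range(2000)]
mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / len(samples)
```

The sample mean and variance of the noisy chains should land near 0 and 1, up to Monte Carlo error and a small discretization bias that grows with the step size.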

2.3 MCMC Initialization

We distinguish two main branches of MCMC initialization: informative initialization, where the density of initial states is meant to approximate the model density, and non-informative initialization, where initial states are obtained from a distribution that is unrelated to the model density. Noise initialization is a specific type of non-informative initialization where initial states come from a noise distribution such as uniform or Gaussian.

In the most extreme case, a Markov chain initialized from its steady-state will follow the steady-state distribution after a single MCMC update. In more general cases, a Markov chain initialized from an image that is likely under the steady-state can converge much more quickly than a Markov chain initialized from noise. For this reason, all prior works studying ConvNet potentials use informative initialization.

Data-based initialization uses samples from the training data as the initial MCMC states. Contrastive Divergence (CD) (?) introduces this practice. The Multi-grid model (?) generalizes CD by using multi-scale energy functions to sequentially refine downsampled data.

Persistent initialization uses negative samples from a previous learning iteration as initial MCMC states in the current iteration. The persistent chains can be initialized from noise as in (?; ?; ?) or from data samples as in Persistent Contrastive Divergence (PCD) (?). The Cooperative Learning model (?) generalizes persistent chains by learning a generator for proposals in tandem with the energy.
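The three initialization schemes can be sketched as follows, with scalars standing in for images. The helper names, the uniform noise range, and the bank size are hypothetical choices for illustration and are not the paper's implementation.

```python
import random


def noise_init(batch_size, rng):
    """Non-informative initialization: draw initial states from uniform noise."""
    return [rng.uniform(-1.0, 1.0) for _ in range(batch_size)]


def data_init(data, batch_size, rng):
    """Data-based initialization (as in CD): start chains at training examples."""
    return rng.sample(data, batch_size)


class PersistentBank:
    """Persistent initialization: keep MCMC states between parameter updates."""

    def __init__(self, size, rng):
        self.rng = rng
        self.states = noise_init(size, rng)  # the bank itself starts from noise
        self.idx = []

    def draw(self, batch_size):
        """Pick a random subset of stored states to continue sampling from."""
        self.idx = self.rng.sample(range(len(self.states)), batch_size)
        return [self.states[i] for i in self.idx]

    def store(self, new_states):
        """Write the updated chain states back into the slots they came from."""
        for i, x in zip(self.idx, new_states):
            self.states[i] = x


rng = random.Random(0)
data = [0.1 * k for k in range(20)]
from_noise = noise_init(5, rng)
from_data = data_init(data, 5, rng)
bank = PersistentBank(10, rng)
batch = bank.draw(3)
bank.store([9.0, 9.0, 9.0])  # placeholder for states after Langevin updates
```

A persistent bank seeded from data rather than noise would correspond to the PCD variant mentioned above.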

In this paper we consider long-run Langevin chains from both data-based initialization such as CD and persistent initialization such as PCD to be approximate steady-state samples, even when Langevin chains cannot mix between modes. Prior art indicates that both initialization types span the modes of the learned density, and long-run Langevin samples will travel in a way that respects $p_{\theta}$ in the local landscape.

Informative MCMC initialization during ML training can limit the ability of the final model $p_{\theta}$ to generate new and diverse synthesized images after training. MCMC samples initialized from noise distributions after training tend to result in images with a similar type of appearance when informative initialization is used in training.

In contrast to common wisdom, we find that informative initialization is not necessary for efficient and realistic synthesis when training ConvNet potentials with ML. In accordance with common wisdom, we find that informative initialization is essential for learning a realistic steady-state.

3 Two Axes of ML Learning

Inspection of the gradient (8) reveals the central role of the difference of the average energy of negative and positive samples. Let

$$d_{s_{t}}(\theta)=E_{q}[U(X;\theta)]-E_{s_{t}}[U(X;\theta)] \tag{10}$$

where $s_{t}(x)$ is the distribution of negative samples given the finite-step MCMC sampler and initialization used at training step $t$ . The difference $d_{s_{t}}(\theta)$ measures whether the positive samples from the data distribution $q$ or the negative samples from $s_{t}$ are more likely under the model $p_{\theta}$ . The ideal case $p_{\theta}=q$ (perfect learning) and $s_{t}=p_{\theta}$ (exact MCMC convergence) satisfies $d_{s_{t}}(\theta)=0$ . A large value of $|d_{s_{t}}|$ indicates that either learning or sampling (or both) have not converged.

Although $d_{s_{t}}(\theta)$ is not equivalent to the ML objective (4), it bridges the gap between theoretical ML and the behavior encountered when MCMC approximation is used. Two outcomes occur for each update on the parameter path $\{\theta_{t}\}_{t=1}^{T+1}$ :

1) $d_{s_{t}}(\theta_{t})<0$ (expansion) or $d_{s_{t}}(\theta_{t})>0$ (contraction)

2) $s_{t}\approx p_{\theta_{t}}$ (MCMC convergence) or $s_{t}\not\approx p_{\theta_{t}}$ (MCMC non-convergence).

We find that only the first axis governs the stability and synthesis results of the learning process. Oscillation of expansion and contraction updates is an indicator of stable ML learning, but this can occur in cases where either $s_{t}$ is always approximately convergent or where $s_{t}$ never converges.
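The first axis can be monitored directly from a batch. The sketch below assumes $d_{s_{t}}$ is estimated as the mean energy of the positive samples minus the mean energy of the negative samples, so that a negative value corresponds to expansion and a positive value to contraction. The toy energy and sample values are hypothetical.

```python
def energy_gap(U, positive, negative):
    """Batch estimate of d: mean energy of data samples minus mean energy of MCMC samples."""
    pos = sum(U(x) for x in positive) / len(positive)
    neg = sum(U(x) for x in negative) / len(negative)
    return pos - neg


def update_type(d):
    """Label an update by the sign of the energy gap."""
    if d < 0:
        return "expansion"
    if d > 0:
        return "contraction"
    return "balanced"


def U(x):
    """Toy energy U(x) = 0.5 * x^2."""
    return 0.5 * x * x


positive = [1.0, -1.0]
d_far = energy_gap(U, positive, [2.0, -2.0])    # negatives sit at higher energy than data
d_near = energy_gap(U, positive, [0.1, -0.1])   # negatives sit at lower energy than data
labels = (update_type(d_far), update_type(d_near))
```

Logging this label at every parameter update gives a simple way to check whether training alternates between the two update types.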

Behavior along the second axis determines the realism of steady-state samples from the final learned energy. Samples from $p_{\theta_{t}}$ will be realistic if and only if $s_{t}$ has realistic samples and $s_{t}\approx p_{\theta_{t}}$ . We use convergent ML to refer to implementations where $s_{t}\approx p_{\theta_{t}}$ for all $t>t_{0}$ , where $t_{0}$ represents burn-in learning steps (e.g. early stages of persistent learning). We use non-convergent ML to refer to all other implementations. All prior ConvNet potentials are learned with non-convergent ML, although this is not recognized by previous authors.

Without proper tuning of the sampling phase, the learning heavily gravitates towards non-convergent ML. In this section we outline principles to explain this behavior and provide a remedy for the tendency of model non-convergence.

3.1 First Axis: Expansion or Contraction

The quantity $v_{t}$, which gives the average image gradient magnitude of $U$ along an MCMC path at training step $t$, plays a central role in sampling. Sampling at noise magnitude $\varepsilon$ will lead to very different behavior depending on the gradient magnitude. If $v_{t}$ is very large, gradients will overwhelm the noise and the resulting dynamics are similar to gradient descent. If $v_{t}$ is very small, sampling becomes an isotropic random walk. A valid image density should appropriately balance energy gradient magnitude and noise strength to enable realistic long-run sampling.
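One way to write $v_{t}$ that is consistent with this description is given below. The notation is an assumption: $(Y_{t}^{(0)},\dots,Y_{t}^{(L)})$ denotes a Langevin path of $L$ steps at training step $t$, the expectation is over such paths, and $N$ is the image dimension. The second line compares the typical size of the gradient term and the noise term in a single Langevin step.

```latex
\begin{aligned}
v_{t} &= E\!\left[\frac{1}{L+1}\sum_{\ell=0}^{L}
        \left\|\frac{\partial}{\partial y}U\big(Y_{t}^{(\ell)};\theta_{t}\big)\right\|_{2}\right], \\
\underbrace{\frac{\varepsilon^{2}}{2}\,v_{t}}_{\text{gradient displacement per step}}
  &\quad\text{versus}\quad
\underbrace{\varepsilon\,E\|Z\|_{2}\approx\varepsilon\sqrt{N}}_{\text{noise displacement per step}} .
\end{aligned}
```

When the first term is much larger than the second, the chain behaves like gradient descent. When it is much smaller, the chain behaves like a random walk.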

We empirically observe that expansion and contraction updates tend to have opposite effects on $v_{t}$ (see Figure 4). Gradient magnitude $v_{t}$ and computational loss $d_{s_{t}}$ are highly correlated at the current iteration and exhibit significant negative correlation at a short-range lag. Both have significant negative autocorrelation for short-range lag. This indicates that expansion updates tend to increase $v_{t}$ and contraction updates tend to decrease $v_{t}$ , and that expansion updates tend to lead to contraction updates and vice-versa. We believe that the natural oscillation between expansion and contraction updates underlies the stability of ML with (2).

Learning can become unstable when $U$ is updated in the expansion phase for many consecutive iterations if $v_{t}\rightarrow\infty$ as $U(X^{+})\rightarrow-\infty$ for positive samples and $U(X^{-})\rightarrow\infty$ for negative samples. This behavior is typical of W-GAN training (interpreting the generator as $w_{t}$ with $L=0$ ) and the W-GAN Lipschitz bound is needed to prevent such instability. In ML learning with ConvNet potentials, consecutive updates in the expansion phase will increase $v_{t}$ so that the gradient can better overcome noise and samples can more quickly reach low-energy regions. In contrast, many consecutive contraction updates can cause $v_{t}$ to shrink to 0, leading to the solution $U(x)=c$ for some constant $c$ (see Figure 5 right, blue lines). In proper ML learning, the expansion updates that follow contraction updates prevent the model from collapsing to a flat solution and force $U$ to learn meaningful features of the data.

Throughout our experiments, we find that the network can easily learn to balance the energy of the positive and negative samples so that $d_{s_{t}}(\theta_{t})\approx 0$ after only a few model updates. In fact, ML learning can easily adjust $v_{t}$ so that the gradient is strong enough to balance $d_{s_{t}}$ and obtain high-quality samples from virtually any initial distribution in a small number of MCMC steps. This insight leads to our ML method with noise-initialized MCMC. The natural oscillation of ML learning is the foundation of the robust synthesis capabilities of ConvNet potentials, but realistic short-run MCMC samples can mask the true steady-state behavior.

3.2 Second Axis: MCMC Convergence or Non-Convergence

In the literature, it is expected that the finite-step MCMC distribution $s_{t}$ must approximately converge to its steady-state $p_{\theta_{t}}$ for learning to be effective. On the contrary, we find that high-quality synthesis is possible, and actually easier to learn, when there is a drastic difference between the finite-step MCMC distribution $s_{t}$ and true steady-state samples of $p_{\theta_{t}}$ . An examination of ConvNet potentials learned by existing methods shows that in all cases, running the MCMC sampler for significantly longer than the number of MCMC steps used during training results in samples with significantly lower energy and unrealistic appearance. Although synthesis is possible without convergence, it is not appropriate to describe a non-convergent ML model $p_{\theta}$ as an approximate data density.

Oscillation of expansion and contraction updates occurs for both convergent and non-convergent ML learning, but for very different reasons. In convergent ML, we expect the average gradient magnitude $v_{t}$ to converge to a constant that is balanced with the noise magnitude $\varepsilon$ at a value that reflects the temperature of the data density $q$ . However, ConvNet potentials can circumvent this desired behavior by tuning $v_{t}$ with respect to the burn-in energy landscape rather than noise $\varepsilon$ . Figure 5 shows how average image space displacement $r_{t}=\frac{\varepsilon^{2}}{2}v_{t}$ is affected by noise magnitude $\varepsilon$ and number of Langevin steps $L$ for noise, data-based, and persistent MCMC initializations.

For noise initialization with low $\varepsilon$ , the model adjusts $v_{t}$ so that $r_{t}L\approx R$ where $R$ is the average distance between an image from the noise initialization distribution and an image from the data distribution. In other words, the MCMC paths obtained from non-convergent ML with noise initialization are nearly linear from the starting point to the ending point. Mixing does not improve when $L$ increases because $r_{t}$ shrinks in proportion to the increase. Oscillation of expansion and contraction updates occurs because the model tunes $v_{t}$ to control how far along the burn-in path the negative samples travel. Samples never reach the steady-state energy spectrum and MCMC mixing is not possible.
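The relation $r_{t}L\approx R$ together with $r_{t}=\frac{\varepsilon^{2}}{2}v_{t}$ can be turned into a small worked calculation. The distance `R = 30.0` below is a made-up value used only to show the scaling, and the helper names are hypothetical.

```python
def per_step_displacement(R, L):
    """Per-step displacement r needed to cover a total distance R in L nearly linear steps."""
    return R / L


def implied_gradient_magnitude(r, eps):
    """Gradient magnitude v implied by r = (eps^2 / 2) * v."""
    return 2.0 * r / eps**2


R = 30.0  # hypothetical average noise-to-data distance
r_100 = per_step_displacement(R, L=100)
r_200 = per_step_displacement(R, L=200)
v_100 = implied_gradient_magnitude(r_100, eps=1.0)
v_200 = implied_gradient_magnitude(r_200, eps=1.0)
```

Doubling $L$ halves both the per-step displacement and the implied gradient magnitude, so the chain covers the same distance and does not gain extra steps for mixing.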

For data initialization and persistent initialization with low $\varepsilon$ , we see that $v_{t},r_{t}\rightarrow 0$ and that learning tends to the trivial solution $U(x)=c$ . This occurs because contraction updates dominate the learning dynamics. At low $\varepsilon$ , samples initialized from the data will easily have lower energy than the data since sampling reduces to gradient descent. To our knowledge no authors have trained (2) using CD, possibly because the energy can easily collapse to a trivial flat solution. For persistent learning, the model learns to synthesize meaningful features early in learning and then contracts in gradient strength once it becomes easy to find negative samples with lower energy than the data. Previous authors who trained models with persistent chains use auxiliary techniques such as a Gaussian prior (?) or occasional rejuvenation of chains from noise (?) which prevent unbalanced network contraction, although the role of these techniques is not recognized by the authors.

For all three initialization types, we can see that convergent ML becomes possible when $\varepsilon$ is large enough. ML with noise initialization behaves similarly for high and low $\varepsilon$ when $L$ is small. For large $L$ with high $\varepsilon$ , the model tunes $v_{t}$ to balance with $\varepsilon$ rather than $R/L$ . The MCMC samples complete burn-in and begin to mix for large $L$ , and increasing $L$ will indeed lead to improved MCMC convergence as usual. For data-based and persistent initialization, we see that $v_{t}$ adjusts to balance with $\varepsilon$ instead of contracting to 0 because the noise added during Langevin sampling forces $U$ to learn meaningful features.

3.3 Learning Algorithm

We now present an algorithm for ML learning. The algorithm is essentially the same as earlier works such as (?) that investigate the potential (2). Our intention is not to introduce a novel algorithm but to demonstrate the range of phenomena that can occur with the ML objective based on changes to MCMC sampling. We present guidelines for the effect of tuning on the learning outcome.

Noise and Step Size for Non-Convergent ML: For non-convergent training we find that the tuning of noise and step size has little effect on training stability. We use $\varepsilon=1$ and noise indicator $\tau=0$ (no noise term in the Langevin update). Noise is not needed for oscillation because $d_{s_{t}}$ is controlled by the depth of samples along the burn-in path. Including low noise appears to improve synthesis quality.

Noise and Step Size for Convergent ML: For convergent training, we find that it is essential to include noise with $\tau=1$ and precisely tune $\varepsilon$ so that the network learns true mixing dynamics through the gradient strength. The step size $\varepsilon$ should approximately match the local standard deviation of the data along the most constrained direction (?). An effective $\varepsilon$ for $32\times 32$ images with pixel values in appears to lie around $0.015$ .

Number of Steps: When $\tau=0$, or when $\tau=1$ and $\varepsilon$ is very small, learning leads to similar non-convergent ML outcomes for any $L\geq 100$ . When $\tau=1$ and $\varepsilon$ is correctly tuned, sufficiently high values of $L$ lead to convergent ML and lower values of $L$ lead to non-convergent ML.

Informative Initialization: Informative MCMC initialization is not needed for non-convergent ML even with as few as $L=100$ Langevin updates. The model can naturally learn fast pathways to realistic negative samples from an arbitrary initial distribution. On the other hand, informative initialization can greatly reduce the magnitude of $L$ needed for convergent ML. We use persistent initialization starting from noise.

Network structure: For the first convolutional layer, we observe that a $3\times 3$ convolution with stride $1$ helps to avoid checkerboard patterns or other artifacts. For convergent ML, use of non-local layers (?) appears to improve synthesis realism.

Regularization and Normalization: Previous studies employ a variety of auxiliary training techniques such as prior distributions (e.g. Gaussian), weight regularization, batch normalization, layer normalization, and spectral normalization to stabilize sampling and weight updates. We find that these techniques are not needed.

Optimizer and Learning Rate: For non-convergent ML, Adam improves training speed and image quality. Our non-convergent models use Adam with learning rate $\gamma=0.0001$ . For convergent ML, Adam appears to interfere with learning a realistic steady-state and we use SGD instead. When using SGD with $\tau=1$ and properly tuned $\varepsilon$ and $L$ , higher values of $\gamma$ lead to non-convergent ML and sufficiently low values of $\gamma$ lead to convergent ML.
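The overall procedure can be sketched on a one-dimensional toy problem. This is an illustrative sketch under stated assumptions, not the paper's implementation: the ConvNet is replaced by the hypothetical energy $U(x;\theta)=\frac{1}{2}\theta x^{2}$, images by scalars, the optimizer is plain SGD, and all hyperparameter values (including the large toy learning rate `gamma`) are chosen only so that the example runs quickly.

```python
import random


def U(x, theta):
    """Toy energy: a zero-mean Gaussian with variance 1 / theta."""
    return 0.5 * theta * x * x


def dU_dx(x, theta):
    return theta * x


def dU_dtheta(x):
    return 0.5 * x * x


def langevin(x, theta, eps, tau, n_steps, rng):
    """L Langevin steps; tau = 0 removes the noise term."""
    for _ in range(n_steps):
        x = x - 0.5 * eps**2 * dU_dx(x, theta) + tau * eps * rng.gauss(0.0, 1.0)
    return x


def train(data, theta=1.0, n_updates=300, batch=50, L=10, eps=0.3, tau=1,
          gamma=2.0, persistent=True, seed=0):
    rng = random.Random(seed)
    bank = [rng.uniform(-1.0, 1.0) for _ in range(4 * batch)]  # persistent states from noise
    history = []
    for _ in range(n_updates):
        positive = rng.sample(data, batch)
        if persistent:
            idx = rng.sample(range(len(bank)), batch)
            start = [bank[i] for i in idx]
        else:
            start = [rng.uniform(-1.0, 1.0) for _ in range(batch)]  # fresh noise each update
        negative = [langevin(x, theta, eps, tau, L, rng) for x in start]
        if persistent:
            for i, x in zip(idx, negative):
                bank[i] = x
        # Energy gap d (first axis) and the sample-based gradient of the loss.
        d = (sum(U(x, theta) for x in positive)
             - sum(U(x, theta) for x in negative)) / batch
        grad = (sum(dU_dtheta(x) for x in positive)
                - sum(dU_dtheta(x) for x in negative)) / batch
        theta -= gamma * grad  # plain SGD step
        history.append((theta, d))
    return theta, history


data_rng = random.Random(1)
data = [data_rng.gauss(0.0, 0.5) for _ in range(1000)]  # toy data, true theta = 4
theta, history = train(data)
```

With these toy settings the learned `theta` should settle in the vicinity of the value implied by the data variance, with some bias from the finite Langevin step size, and the recorded energy gap `d` should take both signs over the course of training. Setting `persistent=False` or `tau=0` switches to the noise-initialized or noise-free variants discussed above.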

4 Experiments

We first demonstrate the outcomes of convergent and non-convergent ML for low-dimensional toy distributions (Figure 6). Both toy models have a standard deviation of $0.15$ along the most constrained direction, and the ideal step size for Langevin dynamics is close to this value (?). Non-convergent models are trained using noise MCMC initialization with $L=100$ and $\varepsilon=0.01$ (too low for the data temperature) and convergent models are trained using persistent MCMC initialization with $L=500$ and $\varepsilon=0.125$ (approximately the right magnitude relative to the data temperature). The distributions of the short-run samples from the non-convergent models reflect the ground-truth densities, but the learned densities are sharply concentrated and different from the ground-truths. In higher dimensions this sharp concentration of non-convergent densities manifests as oversaturated long-run images. With sufficient Langevin noise, one can learn an energy function that closely approximates the ground-truth.
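The effect of step size relative to the data scale can be illustrated with a simple calculation. The sketch assumes a one-dimensional Gaussian energy $U(x)=x^{2}/(2\sigma^{2})$ as a stand-in for the most constrained direction of the toy densities. In that case one Langevin step multiplies the chain mean by $1-\varepsilon^{2}/(2\sigma^{2})$, and the helper below (a hypothetical name) counts the steps needed for the mean to shrink by a factor of $e$.

```python
import math


def relaxation_steps(sigma, eps):
    """Langevin steps for the chain mean to shrink by a factor of e
    on the Gaussian energy U(x) = x^2 / (2 sigma^2)."""
    a = 1.0 - eps**2 / (2.0 * sigma**2)  # per-step contraction of the mean
    return -1.0 / math.log(a)


sigma = 0.15
fast = relaxation_steps(sigma, eps=0.125)  # step size near the data scale
slow = relaxation_steps(sigma, eps=0.01)   # step size far below the data scale
```

Under this assumption, a chain with $\varepsilon=0.125$ relaxes within a few steps, far fewer than $L=500$. A chain with $\varepsilon=0.01$ needs several hundred steps, more than the $L=100$ budget, which is consistent with describing that step size as too low for the data temperature.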

4.2 Synthesis from Noise with Non-Convergent ML Learning

In this experiment, we learn an energy function (2) using ML with uniform noise initialization and short-run MCMC. We apply our ML algorithm with $L=100$ Langevin steps starting from uniform noise images for each update of $\theta$ with $\tau=0$ and $\varepsilon=1$ . We use Adam with $\gamma=0.0001$ .

Previous authors argued that informative MCMC initialization is a key element for successful synthesis with ML learning, but our learning method can sample from scratch with the same Langevin budget. Unlike the models learned by previous authors, our models can generate high-fidelity and diverse images from a noise signal. Our results are shown in Figure 7, Figure 8 (left), and Figure 2 (top). Our recent companion work (?) thoroughly explores the capabilities of noise-initialized non-convergent ML.

4.3 Convergent ML Learning

With the correct Langevin noise, one can ensure that MCMC samples mix in the steady-state energy spectrum throughout training. The model will eventually learn a realistic steady-state as long as MCMC samples approximately converge for each parameter update $t$ beyond a burn-in period $t_{0}$ . One can implement convergent ML with noise initialization, but we find that this requires $L\approx 20{,}000$ steps.

Informative initialization can dramatically reduce the number of MCMC steps needed for convergent learning. By using SGD with learning rate $\gamma=0.0005$ , noise indicator $\tau=1$ and step size $\varepsilon=0.015$ , we were able to train convergent models using persistent initialization and $L=500$ sampling steps. We initialize 10,000 persistent images from noise and update 100 images for each batch. We implement the same training procedure for a vanilla ConvNet and a network with non-local layers (?). Our results are shown in Figure 8 (middle, right) and Figure 2 (bottom).

4.4 Mapping the Image Space

A well-formed energy function partitions the image space into meaningful Hopfield basins of attraction. Following Algorithm 3 of (?), we map the structure of a convergent energy. We first identify many metastable MCMC samples. We then sort the metastable samples from lowest energy to highest energy and sequentially group images if travel between samples is possible in a magnetized energy landscape. This process is continued until all minima have been clustered. Our mappings show that the convergent energy has meaningful metastable structures encoding recognizable concepts (Figure 9).
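The sort-and-group step can be sketched as a greedy procedure. This is a simplified stand-in and not Algorithm 3 of the cited work: the predicate `reachable` is a hypothetical placeholder for the test of whether diffusion in the magnetized landscape can travel between two samples, and the one-dimensional minima and energy are toy values.

```python
def group_minima(minima, energy, reachable):
    """Visit minima from lowest to highest energy. Each minimum joins the first
    existing group that contains a member it can reach; otherwise it opens a new group."""
    groups = []
    for m in sorted(minima, key=energy):
        for g in groups:
            if any(reachable(m, other) for other in g):
                g.append(m)
                break
        else:
            groups.append([m])
    return groups


def energy(x):
    """Toy double-well energy with wells near x = -2 and x = 2."""
    return (x * x - 4.0) ** 2


def reachable(a, b):
    """Placeholder connectivity test: two minima are linked if they are close."""
    return abs(a - b) < 1.0


minima = [-2.1, -1.9, 2.0, 2.2]
groups = group_minima(minima, energy, reachable)
```

In this toy case the four metastable points fall into two groups, one per well. In the image setting each group would correspond to a cluster of related metastable images.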

5 Conclusion and Future Work

Our experiments on energy-based models with the form (2) reveal two distinct axes of ML learning. We use our insights to train models with sampling capabilities that are unobtainable by previous implementations. The informative MCMC initializations used by previous authors are not necessary for high-quality synthesis. By removing this technique we train the first energy functions capable of high-diversity and realistic synthesis from noise initialization after training. We identify a severe defect in the steady-state distributions of prior implementations and introduce the first ConvNet potentials of the form (2) for which steady-state samples have realistic appearance. Our observations could be very useful for convergent ML learning with more complex MCMC initialization methods used in (?; ?). We hope that our work paves the way for future unsupervised and weakly supervised applications with energy-based models.

Acknowledgment

The work is supported by DARPA XAI project N66001-17-2-4029; ARO project W911NF1810296; ONR MURI project N00014-16-1-2007; and Extreme Science and Engineering Discovery Environment (XSEDE) grant ASC170063. We thank Prafulla Dhariwal and Anirudh Goyal for helpful discussions.