On the Anatomy of MCMC-Based Maximum Likelihood Learning of Energy-Based Models

Erik Nijkamp, Mitch Hill, Tian Han, Song-Chun Zhu, Ying Nian Wu

1 Introduction

Statistical modeling of high-dimensional signals is a challenging task encountered in many academic disciplines and practical applications. We study image signals in this work. When images come without annotations or labels, the effective tools of deep supervised learning cannot be applied and unsupervised techniques must be used instead. This work focuses on the unsupervised paradigm of the energy-based model (1) with a ConvNet potential function (2).

Previous works studying Maximum Likelihood (ML) training of ConvNet potentials, such as (?; ?; ?), use Langevin MCMC samples to approximate the gradient of the unknown and intractable log partition function during learning. The authors universally find that after enough model updates, MCMC samples generated by short-run Langevin from informative initialization (see Section 2.3) are realistic images that resemble the data.

However, we find that energy functions learned by prior works have a major defect regardless of MCMC initialization, network structure, and auxiliary training parameters. The long-run and steady-state MCMC samples of energy functions from all previous implementations are oversaturated images with significantly lower energy than the observed data (see Figure 2 top, and Figure 3). In this case it is not appropriate to describe the learned model as an approximate density for the training set because the model assigns disproportionately high probability mass to images which differ dramatically from observed data. The systematic difference between high-quality short-run samples and low-quality long-run samples is a crucial phenomenon that appears to have gone unnoticed in previous studies.

1.2 Our Contributions

In this work, we present a fundamental understanding of learning ConvNet potentials by MCMC-based ML. We diagnose previously unrecognized complications that arise during learning and distill our insights to train models with new capabilities. Our main contributions are:

Identification of two distinct axes which characterize each parameter update in MCMC-based ML learning: 1) energy difference of positive and negative samples, and 2) MCMC convergence or non-convergence. Contrary to common expectations, convergence is not needed for high-quality synthesis. See Figure 1 and Section 3.

The first ConvNet potentials trained using ML with purely noise-initialized MCMC. Unlike prior models, our model can efficiently generate realistic and diverse samples after training from noise alone. See Figure 7. This method is further explored in our companion work (?).

The first ConvNet potentials with realistic steady-state samples. To our knowledge, ConvNet potentials with realistic MCMC sampling in the image space are unobtainable by all previous training implementations. We refer to (?) for a discussion. See Figure 2 (bottom) and Figure 8 (middle and right column).

Mapping the macroscopic structure of image space energy functions using diffusion in a magnetized energy landscape for unsupervised cluster discovery. See Figure 9.

1.3 Related Work

Energy-based models define an unnormalized probability density over a state space to represent the distribution of states in a given system. The Hopfield network (?) adapted the Ising energy model into a model capable of representing arbitrary observed data. The RBM (Restricted Boltzmann Machine) (?) and FRAME (Filters, Random field, And Maximum Entropy) (?; ?) models introduce energy functions with greater representational capacity. The RBM uses hidden units which have a joint density with the observable image pixels. The FRAME model uses convolutional filters and histogram matching to learn data features.

The pioneering work (?) studies the hierarchical energy-based model. (?) is an important early work proposing feedforward neural networks to model energy functions. The energy-based model in the form of (2) is introduced in (?). Deep variants of the FRAME model (?; ?) are the first to achieve realistic synthesis with a ConvNet potential and Langevin sampling. Similar methods are applied in (?). The Multi-grid model (?) learns an ensemble of ConvNet potentials for images of different scales. Learning a ConvNet potential with a generator network as an approximate direct sampler is explored in (?; ?; ?; ?; ?; ?). The works (?; ?; ?) learn a ConvNet potential in a discriminative framework.

Although many of these works claim to train the energy (2) to be an approximate unnormalized density for the observed images, the resulting energy functions do not have a steady-state that reflects the data (see Figure 3). Short-run Langevin samples from informative initialization are presented as approximate steady-state samples, but further investigation shows long-run Langevin consistently disrupts the realism of short-run images. Our work is the first to address and remedy the systematic non-convergence of all prior implementations.

Energy Landscape Mapping

The full potential of the energy-based model lies in the structure of the energy landscape. Hopfield observed that the energy landscape is a model of associative memory (?). Diffusion along the potential energy manifold is analogous to memory recall because the diffusion process will gradually refine a high-energy image (an incomplete or corrupted memory) until it reaches a low-energy metastable state, which corresponds to the revised memory. Techniques for mapping and visualizing the energy landscape of non-convex functions in the physical chemistry literature (?; ?) have been applied to map the latent space of Cooperative Networks (?). Defects in the energy function (2) from previous ML implementations prevent these techniques from being applied in the image space. Our convergent ML models enable image space mapping.

2 Learning Energy-Based Models

In this section, we review the established principles of MCMC-based ML learning from prior works such as (?; ?; ?).

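An energy-based model is a Gibbs-Boltzmann density

$$p_{\theta}(x)=\frac{1}{Z(\theta)}\exp\{-U(x;\theta)\} \tag{1}$$

over signals $x$, where $Z(\theta)=\int\exp\{-U(x;\theta)\}\,dx$ is the intractable normalizing constant. In this work the energy potential has the form

$$U(x;\theta)=F(x;\theta) \tag{2}$$

where $F(x;\theta)$ is a ConvNet with weights $\theta$ and a scalar output.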

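In ML learning, we seek to find $\theta\in\Theta$ such that the parametric model $p_{\theta}(x)$ is a close approximation of the data distribution $q(x)$. One measure of closeness is the Kullback-Leibler (KL) divergence. Learning proceeds by solving

$$\arg\min_{\theta}\mathcal{L}(\theta)=\arg\min_{\theta}D_{KL}(q\,\|\,p_{\theta}) \tag{3}$$

$$=\arg\min_{\theta}\left\{\log Z(\theta)+E_{q}[U(X;\theta)]\right\}. \tag{4}$$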

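We can minimize $\mathcal{L}(\theta)$ by finding the roots of the derivative

$$\frac{d}{d\theta}\mathcal{L}(\theta)=\frac{d}{d\theta}\log Z(\theta)+\frac{d}{d\theta}E_{q}[U(X;\theta)]. \tag{5}$$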

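The term $\frac{d}{d\theta}\log Z(\theta)$ is intractable, but it can be expressed as

$$\frac{d}{d\theta}\log Z(\theta)=-E_{p_{\theta}}\left[\frac{\partial}{\partial\theta}U(X;\theta)\right]. \tag{6}$$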
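The expectation form of the log partition gradient can be checked by differentiating under the integral sign. The following is a short derivation sketch, assuming $Z(\theta)=\int\exp\{-U(x;\theta)\}\,dx$ is finite and that differentiation and integration can be exchanged:

```latex
\begin{aligned}
\frac{d}{d\theta}\log Z(\theta)
&= \frac{1}{Z(\theta)}\,\frac{d}{d\theta}\int \exp\{-U(x;\theta)\}\,dx \\
&= -\int \frac{\exp\{-U(x;\theta)\}}{Z(\theta)}\,
   \frac{\partial}{\partial\theta}U(x;\theta)\,dx \\
&= -E_{p_{\theta}}\!\left[\frac{\partial}{\partial\theta}U(X;\theta)\right].
\end{aligned}
```

The last line is an expectation under the model itself, which is why samples from $p_{\theta}$ are needed at every parameter update.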

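The gradient used to learn $\theta$ then becomes

$$\frac{d}{d\theta}\mathcal{L}(\theta)=E_{q}\left[\frac{\partial}{\partial\theta}U(X;\theta)\right]-E_{p_{\theta}}\left[\frac{\partial}{\partial\theta}U(X;\theta)\right] \tag{7}$$

$$\approx\frac{\partial}{\partial\theta}\left(\frac{1}{n}\sum_{i=1}^{n}U(X_{i}^{+};\theta)-\frac{1}{m}\sum_{i=1}^{m}U(X_{i}^{-};\theta)\right) \tag{8}$$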

where $\{X^{+}_{i}\}_{i=1}^{n}$ are i.i.d. samples from the data distribution $q$ (called positive samples since probability is increased), and $\{X_{i}^{-}\}_{i=1}^{m}$ are i.i.d. samples from the current learned distribution $p_{\theta}$ (called negative samples since probability is decreased). In practice, the positive samples $\{X^{+}_{i}\}_{i=1}^{n}$ are a batch of training images and the negative samples $\{X_{i}^{-}\}_{i=1}^{m}$ are obtained after $L$ iterations of MCMC sampling.
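The sample-based gradient can be illustrated on a one-parameter toy energy. The sketch below assumes the hypothetical energy $U(x;\theta)=\theta x^{2}/2$ (not the ConvNet potential of the paper), for which $\partial U/\partial\theta=x^{2}/2$, and uses hand-picked numbers in place of a data batch and MCMC samples:

```python
def energy_grad_theta(x):
    """dU/dtheta for the toy energy U(x; theta) = theta * x**2 / 2."""
    return 0.5 * x * x


def ml_gradient(positive, negative):
    """Monte Carlo gradient estimate: mean of dU/dtheta over positive
    samples minus the mean over negative samples."""
    pos = sum(energy_grad_theta(x) for x in positive) / len(positive)
    neg = sum(energy_grad_theta(x) for x in negative) / len(negative)
    return pos - neg


positive = [0.5, -0.5, 1.0, -1.0]   # stand-in for a batch of data
negative = [2.0, -2.0, 1.0, -1.0]   # stand-in for MCMC samples
theta = 1.0
grad = ml_gradient(positive, negative)
theta_new = theta - 0.1 * grad       # one gradient descent step on the loss
```

Here the negative samples are more spread out than the data, so the estimate is negative and the step increases $\theta$, which narrows the toy model toward the data.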

2.2 MCMC Sampling with Langevin Dynamics

Obtaining the negative samples $\{X_{i}^{-}\}_{i=1}^{m}$ from the current distribution $p_{\theta}$ is a computationally intensive task which must be performed for each update of $\theta$. ML learning does not impose a specific MCMC algorithm. Early energy-based models such as the RBM and FRAME model use Gibbs sampling as the MCMC method. Gibbs sampling updates each dimension (one pixel of the image) sequentially. This is computationally infeasible when training an energy with the form (2) for standard image sizes.

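Several works studying the energy (2) use Langevin dynamics to obtain the negative samples (?; ?; ?; ?; ?). The Langevin equation

$$X_{\ell+1}=X_{\ell}-\frac{\varepsilon^{2}}{2}\frac{\partial}{\partial x}U(X_{\ell};\theta)+\varepsilon Z_{\ell}, \tag{9}$$

where $Z_{\ell}\sim N(0,I)$ is Gaussian noise and $\varepsilon>0$ is the step size, defines a Markov chain whose stationary distribution approximates $p_{\theta}$ for small $\varepsilon$.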

Like most MCMC methods, Langevin dynamics exhibits high auto-correlation and has difficulty mixing between separate modes. Even so, long-run Langevin samples with a suitable initialization can still be considered approximate steady-state samples, as discussed next.
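A Langevin sampler is short to write down. The sketch below assumes the standard discretized update $x\leftarrow x-\frac{\varepsilon^{2}}{2}\nabla U(x)+\varepsilon z$ with $z\sim N(0,1)$ and a scalar state. The function name, the `noise` switch, and the quadratic toy energy are illustrative choices and do not come from the paper:

```python
import random


def langevin_chain(grad_u, x0, eps, n_steps, rng, noise=True):
    """Run n_steps of x <- x - (eps**2 / 2) * grad_u(x) + eps * z, z ~ N(0, 1).
    With noise=False the update reduces to plain gradient descent."""
    x = x0
    for _ in range(n_steps):
        z = rng.gauss(0.0, 1.0) if noise else 0.0
        x = x - 0.5 * eps ** 2 * grad_u(x) + eps * z
    return x


def grad_u(x):
    """Gradient of the toy energy U(x) = x**2 / 2 (standard normal target)."""
    return x


rng = random.Random(0)
samples = [langevin_chain(grad_u, 5.0, 0.1, 2000, rng) for _ in range(300)]
mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / len(samples)
descent = langevin_chain(grad_u, 5.0, 0.1, 2000, rng, noise=False)
```

For this unimodal toy target the long-run samples have mean near 0 and variance near 1, while the noise-free chain simply descends to the minimum. Mixing between separated modes is much harder, as the paragraph above notes.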

2.3 MCMC Initialization

We distinguish two main branches of MCMC initialization: informative initialization, where the density of initial states is meant to approximate the model density, and non-informative initialization, where initial states are obtained from a distribution that is unrelated to the model density. Noise initialization is a specific type of non-informative initialization where initial states come from a noise distribution such as uniform or Gaussian.

In the most extreme case, a Markov chain initialized from its steady-state will follow the steady-state distribution after a single MCMC update. In more general cases, a Markov chain initialized from an image that is likely under the steady-state can converge much more quickly than a Markov chain initialized from noise. For this reason, all prior works studying ConvNet potentials use informative initialization.

Data-based initialization uses samples from the training data as the initial MCMC states. Contrastive Divergence (CD) (?) introduces this practice. The Multigrid Model (?) generalizes CD by using multi-scale energy functions to sequentially refine downsampled data.

Persistent initialization uses negative samples from a previous learning iteration as initial MCMC states in the current iteration. The persistent chains can be initialized from noise as in (?; ?; ?) or from data samples as in Persistent Contrastive Divergence (PCD) (?). The Cooperative Learning model (?) generalizes persistent chains by learning a generator for proposals in tandem with the energy.
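The three initialization schemes differ only in where the initial MCMC states come from. The sketch below uses scalar states for brevity. The uniform range $[-1,1]$, the class name `PersistentBank`, and its methods are assumptions made for illustration:

```python
import random


def noise_init(batch_size, rng):
    """Noise initialization: initial states drawn from uniform noise on [-1, 1]."""
    return [rng.uniform(-1.0, 1.0) for _ in range(batch_size)]


def data_init(data, batch_size, rng):
    """Data-based initialization (as in CD): start chains at training examples."""
    return [rng.choice(data) for _ in range(batch_size)]


class PersistentBank:
    """Persistent initialization: keep negative samples from earlier updates
    and restart chains from a random subset of them."""

    def __init__(self, initial_states):
        self.states = list(initial_states)
        self.idx = []

    def draw(self, batch_size, rng):
        # Remember which slots were drawn so they can be overwritten later.
        self.idx = rng.sample(range(len(self.states)), batch_size)
        return [self.states[i] for i in self.idx]

    def store(self, new_states):
        # Write the updated MCMC states back into the drawn slots.
        for i, x in zip(self.idx, new_states):
            self.states[i] = x


rng = random.Random(0)
bank = PersistentBank(noise_init(10, rng))   # persistent chains started from noise
batch = bank.draw(4, rng)
bank.store([x + 1.0 for x in batch])         # stand-in for running Langevin updates
```

Initializing the bank from `data_init` instead of `noise_init` gives the data-started variant of persistent chains used by PCD.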

In this paper we consider long-run Langevin chains from both data-based initialization such as CD and persistent initialization such as PCD to be approximate steady-state samples, even when Langevin chains cannot mix between modes. Prior art indicates that both initialization types span the modes of the learned density, and long-run Langevin samples will travel in a way that respects $p_{\theta}$ in the local landscape.

Informative MCMC initialization during ML training can limit the ability of the final model $p_{\theta}$ to generate new and diverse synthesized images after training. MCMC samples initialized from noise distributions after training tend to result in images with a similar type of appearance when informative initialization is used in training.

In contrast to common wisdom, we find that informative initialization is not necessary for efficient and realistic synthesis when training ConvNet potentials with ML. In accordance with common wisdom, we find that informative initialization is essential for learning a realistic steady-state.

3 Two Axes of ML Learning

Inspection of the gradient (8) reveals the central role of the difference of the average energy of negative and positive samples. Let

$$d_{s_{t}}(\theta)=E_{q}[U(X;\theta)]-E_{s_{t}}[U(X;\theta)] \tag{10}$$

where $s_{t}(x)$ is the distribution of negative samples given the finite-step MCMC sampler and initialization used at training step $t$. The difference $d_{s_{t}}(\theta)$ measures whether the positive samples from the data distribution $q$ or the negative samples from $s_{t}$ are more likely under the model $p_{\theta}$. The ideal case $p_{\theta}=q$ (perfect learning) and $s_{t}=p_{\theta}$ (exact MCMC convergence) satisfies $d_{s_{t}}(\theta)=0$. A large value of $|d_{s_{t}}|$ indicates that either learning or sampling (or both) have not converged.

Although $d_{s_{t}}(\theta)$ is not equivalent to the ML objective (4), it bridges the gap between theoretical ML and the behavior encountered when MCMC approximation is used. Two outcomes occur for each update on the parameter path $\{\theta_{t}\}_{t=1}^{T+1}$:

$d_{s_{t}}(\theta_{t})<0$ (expansion) or $d_{s_{t}}(\theta_{t})>0$ (contraction)

$s_{t}\approx p_{\theta_{t}}$ (MCMC convergence) or $s_{t}\not\approx p_{\theta_{t}}$ (MCMC non-convergence).
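The first axis can be monitored during training with a few lines of code. The sketch below assumes the sign convention that the energy difference is the mean energy of positive samples minus the mean energy of negative samples, which matches the expansion and contraction labels above. The function names and the quadratic toy energy are illustrative:

```python
def energy_difference(energy, positive, negative):
    """Mean energy of positive (data) samples minus mean energy of
    negative (MCMC) samples."""
    e_pos = sum(energy(x) for x in positive) / len(positive)
    e_neg = sum(energy(x) for x in negative) / len(negative)
    return e_pos - e_neg


def update_type(d):
    """Label an update by the sign of the energy difference."""
    if d < 0:
        return "expansion"
    if d > 0:
        return "contraction"
    return "balanced"


def energy(x):
    """Toy energy used only for this example."""
    return x * x


d_low_negatives = energy_difference(energy, [1.0, 1.0], [0.0, 0.0])
d_high_negatives = energy_difference(energy, [1.0, 1.0], [2.0, 2.0])
labels = (update_type(d_low_negatives), update_type(d_high_negatives))
```

Logging this label at every update makes the oscillation between the two update types visible. It does not by itself reveal whether the sampler has converged, which is the second axis.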

We find that only the first axis governs the stability and synthesis results of the learning process. Oscillation of expansion and contraction updates is an indicator of stable ML learning, but this can occur in cases where either $s_{t}$ is always approximately convergent or where $s_{t}$ never converges.

Behavior along the second axis determines the realism of steady-state samples from the final learned energy. Samples from $p_{\theta_{t}}$ will be realistic if and only if $s_{t}$ has realistic samples and $s_{t}\approx p_{\theta_{t}}$. We use convergent ML to refer to implementations where $s_{t}\approx p_{\theta_{t}}$ for all $t>t_{0}$, where $t_{0}$ represents burn-in learning steps (e.g. early stages of persistent learning). We use non-convergent ML to refer to all other implementations. All prior ConvNet potentials are learned with non-convergent ML, although this is not recognized by previous authors.

Without proper tuning of the sampling phase, the learning heavily gravitates towards non-convergent ML. In this section we outline principles to explain this behavior and provide a remedy for the tendency of model non-convergence.

3.1 First Axis: Expansion or Contraction

The quantity $v_{t}$, which gives the average image gradient magnitude of $U$ along an MCMC path at training step $t$, plays a central role in sampling. Sampling at noise magnitude $\varepsilon$ will lead to very different behavior depending on the gradient magnitude. If $v_{t}$ is very large, gradients will overwhelm the noise and the resulting dynamics are similar to gradient descent. If $v_{t}$ is very small, sampling becomes an isotropic random walk. A valid image density should appropriately balance energy gradient magnitude and noise strength to enable realistic long-run sampling.
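The balance between gradient and noise can be measured directly on a sampled path. The sketch below assumes that the average gradient magnitude is the mean of $|\partial U/\partial x|$ over the states of one Langevin path (the text describes it only verbally) and that the per-step drift is $\frac{\varepsilon^{2}}{2}$ times that mean. The scalar state and the two quadratic toy energies are illustrative:

```python
import random


def path_gradient_magnitude(grad_u, x0, eps, n_steps, rng):
    """Mean |dU/dx| over the states X_0, ..., X_L of one scalar Langevin path."""
    x = x0
    total = abs(grad_u(x))
    for _ in range(n_steps):
        x = x - 0.5 * eps ** 2 * grad_u(x) + eps * rng.gauss(0.0, 1.0)
        total += abs(grad_u(x))
    return total / (n_steps + 1)


def displacement(eps, v):
    """Average per-step drift r = (eps**2 / 2) * v."""
    return 0.5 * eps ** 2 * v


eps = 0.01
v_flat = path_gradient_magnitude(lambda x: 1e-3 * x, 1.0, eps, 100, random.Random(0))
v_steep = path_gradient_magnitude(lambda x: 1e3 * x, 1.0, eps, 100, random.Random(0))
# Drift-to-noise ratio per step; values far below 1 behave like a random walk.
ratio_flat = displacement(eps, v_flat) / eps
ratio_steep = displacement(eps, v_steep) / eps
```

On the nearly flat energy the drift is negligible next to the noise, so the chain wanders. On the steep energy the drift is comparable to or larger than the noise, and the early part of the path looks like gradient descent.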

We empirically observe that expansion and contraction updates tend to have opposite effects on $v_{t}$ (see Figure 4). Gradient magnitude $v_{t}$ and computational loss $d_{s_{t}}$ are highly correlated at the current iteration and exhibit significant negative correlation at a short-range lag. Both have significant negative autocorrelation for short-range lag. This indicates that expansion updates tend to increase $v_{t}$ and contraction updates tend to decrease $v_{t}$, and that expansion updates tend to lead to contraction updates and vice-versa. We believe that the natural oscillation between expansion and contraction updates underlies the stability of ML with (2).
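Lagged correlations of this kind can be computed from logged training statistics. The sketch below is a plain Pearson correlation between one series and a shifted copy of another. The function name and the alternating toy series are illustrative and are not the paper's measurements:

```python
import math


def lag_correlation(a, b, lag):
    """Pearson correlation between a[t] and b[t + lag] for lag >= 0."""
    x = a[:len(a) - lag]
    y = b[lag:]
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    cov = sum((u - mx) * (v - my) for u, v in zip(x, y))
    sx = math.sqrt(sum((u - mx) ** 2 for u in x))
    sy = math.sqrt(sum((v - my) ** 2 for v in y))
    return cov / (sx * sy)


# Toy series that flips sign at every update, mimicking strict oscillation.
d_series = [1.0, -1.0] * 10
same_step = lag_correlation(d_series, d_series, 0)
one_step_later = lag_correlation(d_series, d_series, 1)
```

A strictly alternating series has correlation 1 with itself at lag 0 and correlation -1 at lag 1. Real training logs would show weaker values of the same sign pattern if the oscillation described above is present.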

Learning can become unstable when $U$ is updated in the expansion phase for many consecutive iterations if $v_{t}\rightarrow\infty$, with $U(X^{+})\rightarrow-\infty$ for positive samples and $U(X^{-})\rightarrow\infty$ for negative samples. This behavior is typical of W-GAN training (interpreting the generator as $w_{t}$ with $L=0$) and the W-GAN Lipschitz bound is needed to prevent such instability. In ML learning with ConvNet potentials, consecutive updates in the expansion phase will increase $v_{t}$ so that the gradient can better overcome noise and samples can more quickly reach low-energy regions. In contrast, many consecutive contraction updates can cause $v_{t}$ to shrink to 0, leading to the solution $U(x)=c$ for some constant $c$ (see Figure 5 right, blue lines). In proper ML learning, the expansion updates that follow contraction updates prevent the model from collapsing to a flat solution and force $U$ to learn meaningful features of the data.

Throughout our experiments, we find that the network can easily learn to balance the energy of the positive and negative samples so that $d_{s_{t}}(\theta_{t})\approx 0$ after only a few model updates. In fact, ML learning can easily adjust $v_{t}$ so that the gradient is strong enough to balance $d_{s_{t}}$ and obtain high-quality samples from virtually any initial distribution in a small number of MCMC steps. This insight leads to our ML method with noise-initialized MCMC. The natural oscillation of ML learning is the foundation of the robust synthesis capabilities of ConvNet potentials, but realistic short-run MCMC samples can mask the true steady-state behavior.

3.2 Second Axis: MCMC Convergence or Non-Convergence

In the literature, it is expected that the finite-step MCMC distribution $s_{t}$ must approximately converge to its steady-state $p_{\theta_{t}}$ for learning to be effective. On the contrary, we find that high-quality synthesis is possible, and actually easier to learn, when there is a drastic difference between the finite-step MCMC distribution $s_{t}$ and true steady-state samples of $p_{\theta_{t}}$. An examination of ConvNet potentials learned by existing methods shows that in all cases, running the MCMC sampler for significantly longer than the number of training steps results in samples with significantly lower energy and unrealistic appearance. Although synthesis is possible without convergence, it is not appropriate to describe a non-convergent ML model $p_{\theta}$ as an approximate data density.
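The check described here, comparing samples after the training number of steps with samples from a much longer chain, can be sketched as follows. The quadratic toy energy, the start state, and the step counts are illustrative assumptions. The toy does not reproduce oversaturated images and only shows the shape of the diagnostic:

```python
import random


def mean_energy_after(energy, grad_u, starts, eps, n_steps, rng):
    """Mean energy of Langevin samples after n_steps, one chain per start state."""
    total = 0.0
    for x in starts:
        for _ in range(n_steps):
            x = x - 0.5 * eps ** 2 * grad_u(x) + eps * rng.gauss(0.0, 1.0)
        total += energy(x)
    return total / len(starts)


def energy(x):
    """Toy energy U(x) = x**2 / 2, standing in for a learned potential."""
    return 0.5 * x * x


def grad_u(x):
    return x


starts = [3.0] * 20
short_run = mean_energy_after(energy, grad_u, starts, 0.05, 100, random.Random(0))
long_run = mean_energy_after(energy, grad_u, starts, 0.05, 5000, random.Random(1))
# A large positive gap means short-run samples have not reached the
# steady-state energy range of the model.
energy_gap = short_run - long_run
```

For a learned ConvNet potential the same comparison would use the training-time sampler for the short run and the same sampler with many more steps for the long run.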

Oscillation of expansion and contraction updates occurs for both convergent and non-convergent ML learning, but for very different reasons. In convergent ML, we expect the average gradient magnitude $v_{t}$ to converge to a constant that is balanced with the noise magnitude $\varepsilon$ at a value that reflects the temperature of the data density $q$. However, ConvNet potentials can circumvent this desired behavior by tuning $v_{t}$ with respect to the burn-in energy landscape rather than noise $\varepsilon$. Figure 5 shows how average image space displacement $r_{t}=\frac{\varepsilon^{2}}{2}v_{t}$ is affected by noise magnitude $\varepsilon$ and number of Langevin steps $L$ for noise, data-based, and persistent MCMC initializations.

For noise initialization with low $\varepsilon$, the model adjusts $v_{t}$ so that $r_{t}L\approx R$ where $R$ is the average distance between an image from the noise initialization distribution and an image from the data distribution. In other words, the MCMC paths obtained from non-convergent ML with noise initialization are nearly linear from the starting point to the ending point. Mixing does not improve when $L$ increases because $r_{t}$ shrinks in proportion to the increase. Oscillation of expansion and contraction updates occurs because the model tunes $v_{t}$ to control how far along the burn-in path the negative samples travel. Samples never reach the steady-state energy spectrum and MCMC mixing is not possible.
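The relation between path length and per-step displacement is simple arithmetic. The sketch below solves $rL=R$ for $r$ and then $r=\frac{\varepsilon^{2}}{2}v$ for $v$. The distance value 30 and the step size 0.01 are hypothetical numbers chosen for illustration:

```python
def implied_gradient_magnitude(distance, n_steps, eps):
    """Given r * L = R, return the per-step displacement r and the gradient
    magnitude v that satisfies r = (eps**2 / 2) * v."""
    r = distance / n_steps
    v = 2.0 * r / eps ** 2
    return r, v


r_100, v_100 = implied_gradient_magnitude(30.0, 100, 0.01)
r_200, v_200 = implied_gradient_magnitude(30.0, 200, 0.01)
```

Doubling the number of steps halves both the implied displacement and the implied gradient magnitude, so the chain covers the same total distance and mixing does not improve.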

For data initialization and persistent initialization with low $\varepsilon$, we see that $v_{t},r_{t}\rightarrow 0$ and that learning tends to the trivial solution $U(x)=c$. This occurs because contraction updates dominate the learning dynamics. At low $\varepsilon$, samples initialized from the data will easily have lower energy than the data since sampling reduces to gradient descent. To our knowledge no authors have trained (2) using CD, possibly because the energy can easily collapse to a trivial flat solution. For persistent learning, the model learns to synthesize meaningful features early in learning and then contracts in gradient strength once it becomes easy to find negative samples with lower energy than the data. Previous authors who trained models with persistent chains use auxiliary techniques such as a Gaussian prior (?) or occasional rejuvenation of chains from noise (?) which prevent unbalanced network contraction, although the role of these techniques is not recognized by the authors.

For all three initialization types, we can see that convergent ML becomes possible when $\varepsilon$ is large enough. ML with noise initialization behaves similarly for high and low $\varepsilon$ when $L$ is small. For large $L$ with high $\varepsilon$, the model tunes $v_{t}$ to balance with $\varepsilon$ rather than $R/L$. The MCMC samples complete burn-in and begin to mix for large $L$, and increasing $L$ will indeed lead to improved MCMC convergence as usual. For data-based and persistent initialization, we see that $v_{t}$ adjusts to balance with $\varepsilon$ instead of contracting to 0 because the noise added during Langevin sampling forces $U$ to learn meaningful features.

3.3 Learning Algorithm

We now present an algorithm for ML learning. The algorithm is essentially the same as earlier works such as (?) that investigate the potential (2). Our intention is not to introduce a novel algorithm but to demonstrate the range of phenomena that can occur with the ML objective based on changes to MCMC sampling. We present guidelines for the effect of tuning on the learning outcome.

Noise and Step Size for Non-Convergent ML: For non-convergent training we find that the tuning of noise and step size has little effect on training stability. We use $\varepsilon=1$ and $\tau=0$. Noise is not needed for oscillation because $d_{s_{t}}$ is controlled by the depth of samples along the burn-in path. Including low noise appears to improve synthesis quality.

Noise and Step Size for Convergent ML: For convergent training, we find that it is essential to include noise with $\tau=1$ and precisely tune $\varepsilon$ so that the network learns true mixing dynamics through the gradient strength. The step size $\varepsilon$ should approximately match the local standard deviation of the data along the most constrained direction (?). An effective $\varepsilon$ for $32\times 32$ images with pixel values in appears to lie around $0.015$.

Number of Steps: When $\tau=0$, or when $\tau=1$ and $\varepsilon$ is very small, learning leads to similar non-convergent ML outcomes for any $L\geq 100$. When $\tau=1$ and $\varepsilon$ is correctly tuned, sufficiently high values of $L$ lead to convergent ML and lower values of $L$ lead to non-convergent ML.

Informative Initialization: Informative MCMC initialization is not needed for non-convergent ML even with as few as $L=100$ Langevin updates. The model can naturally learn fast pathways to realistic negative samples from an arbitrary initial distribution. On the other hand, informative initialization can greatly reduce the magnitude of $L$ needed for convergent ML. We use persistent initialization starting from noise.

Network structure: For the first convolutional layer, we observe that a $3\times 3$ convolution with stride $1$ helps to avoid checkerboard patterns or other artifacts. For convergent ML, use of non-local layers (?) appears to improve synthesis realism.

Regularization and Normalization: Previous studies employ a variety of auxiliary training techniques such as prior distributions (e.g. Gaussian), weight regularization, batch normalization, layer normalization, and spectral normalization to stabilize sampling and weight updates. We find that these techniques are not needed.

Optimizer and Learning Rate: For non-convergent ML, Adam improves training speed and image quality. Our non-convergent models use Adam with $\gamma=0.0001$. For convergent ML, Adam appears to interfere with learning a realistic steady-state and we use SGD instead. When using SGD with $\tau=1$ and properly tuned $\varepsilon$ and $L$, higher values of $\gamma$ lead to non-convergent ML and sufficiently low values of $\gamma$ lead to convergent ML.
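The overall loop implied by these guidelines can be sketched on a one-parameter toy problem. This is not the authors' implementation. It assumes the hypothetical energy $U(x;\theta)=\theta x^{2}/2$, persistent chains started from uniform noise, plain SGD, and a Langevin update in which the noise indicator $\tau$ multiplies the noise term. All default values are illustrative:

```python
import random


def train_toy_ebm(data, n_updates=300, batch=100, n_steps=20,
                  eps=0.1, tau=1.0, lr=1.0, seed=0):
    """ML learning of theta in U(x; theta) = theta * x**2 / 2 with
    persistent Langevin chains and SGD. Returns theta after each update."""
    rng = random.Random(seed)
    theta = 1.0
    # Persistent negative samples, initialized from uniform noise.
    chains = [rng.uniform(-1.0, 1.0) for _ in range(batch)]
    history = []
    for _ in range(n_updates):
        positive = [rng.choice(data) for _ in range(batch)]
        # Negative phase: n_steps Langevin updates from the persistent states.
        for i in range(batch):
            x = chains[i]
            for _ in range(n_steps):
                x = x - 0.5 * eps ** 2 * theta * x + tau * eps * rng.gauss(0.0, 1.0)
            chains[i] = x
        # dU/dtheta = x**2 / 2, so the gradient is a difference of second moments.
        grad = (sum(x * x for x in positive) - sum(x * x for x in chains)) / (2.0 * batch)
        theta -= lr * grad
        history.append(theta)
    return history


data_rng = random.Random(123)
data = [data_rng.gauss(0.0, 0.5) for _ in range(2000)]   # toy data, std 0.5
history = train_toy_ebm(data)
theta_hat = sum(history[-100:]) / 100.0
```

For Gaussian data with standard deviation 0.5 the matching toy energy has $\theta=4$, and the averaged estimate settles near that value. In this toy setting the chains mix within a few updates, so the run is an example of the convergent regime rather than the non-convergent one.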

4 Experiments

We first demonstrate the outcomes of convergent and non-convergent ML for low-dimensional toy distributions (Figure 6). Both toy models have a standard deviation of $0.15$ along the most constrained direction, and the ideal step size for Langevin dynamics is close to this value (?). Non-convergent models are trained using noise MCMC initialization with $L=100$ and $\varepsilon=0.01$ (too low for the data temperature) and convergent models are trained using persistent MCMC initialization with $L=500$ and $\varepsilon=0.125$ (approximately the right magnitude relative to the data temperature). The distributions of the short-run samples from the non-convergent models reflect the ground-truth densities, but the learned densities are sharply concentrated and different from the ground-truths. In higher dimensions this sharp concentration of non-convergent densities manifests as oversaturated long-run images. With sufficient Langevin noise, one can learn an energy function that closely approximates the ground-truth.

4.2 Synthesis from Noise with Non-Convergent ML Learning

In this experiment, we learn an energy function (2) using ML with uniform noise initialization and short-run MCMC. We apply our ML algorithm with $L=100$ Langevin steps starting from uniform noise images for each update of $\theta$ with $\tau=0$ and $\varepsilon=1$. We use Adam with $\gamma=0.0001$.

Previous authors argued that informative MCMC initialization is a key element for successful synthesis with ML learning, but our learning method can sample from scratch with the same Langevin budget. Unlike the models learned by previous authors, our models can generate high-fidelity and diverse images from a noise signal. Our results are shown in Figure 7, Figure 8 (left), and Figure 2 (top). Our recent companion work (?) thoroughly explores the capabilities of noise-initialized non-convergent ML.

4.3 Convergent ML Learning

With the correct Langevin noise, one can ensure that MCMC samples mix in the steady-state energy spectrum throughout training. The model will eventually learn a realistic steady-state as long as MCMC samples approximately converge for each parameter update $t$ beyond a burn-in period $t_{0}$. One can implement convergent ML with noise initialization, but we find that this requires $L\approx 20{,}000$ steps.

Informative initialization can dramatically reduce the number of MCMC steps needed for convergent learning. By using SGD with learning rate $\gamma=0.0005$, noise indicator $\tau=1$ and step size $\varepsilon=0.015$, we were able to train convergent models using persistent initialization and $L=500$ sampling steps. We initialize 10,000 persistent images from noise and update 100 images for each batch. We implement the same training procedure for a vanilla ConvNet and a network with non-local layers (?). Our results are shown in Figure 8 (middle, right) and Figure 2 (bottom).

4.4 Mapping the Image Space

A well-formed energy function partitions the image space into meaningful Hopfield basins of attraction. Following Algorithm 3 of (?), we map the structure of a convergent energy. We first identify many metastable MCMC samples. We then sort the metastable samples from lowest energy to highest energy and sequentially group images if travel between samples is possible in a magnetized energy landscape. This process is continued until all minima have been clustered. Our mappings show that the convergent energy has meaningful metastable structures encoding recognizable concepts (Figure 9).
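The sort-and-group step can be sketched independently of how travel between samples is tested. The code below is not Algorithm 3 of the cited work. It is a greedy grouping under the assumption that each sample is compared with the lowest-energy member of each existing group. The `low_barrier` predicate is a hypothetical stand-in that checks the energy barrier along a straight line instead of running diffusion in a magnetized landscape:

```python
def group_minima(minima, energy, can_travel):
    """Visit minima from lowest to highest energy and attach each one to the
    first group whose lowest-energy member it can reach, else open a new group."""
    groups = []
    for x in sorted(minima, key=energy):
        for group in groups:
            if can_travel(x, group[0]):
                group.append(x)
                break
        else:
            groups.append([x])
    return groups


def energy(x):
    """Toy double-well energy with minima at -1 and 1."""
    return (x * x - 1.0) ** 2


def low_barrier(a, b, max_barrier=0.5, n_points=50):
    """Stand-in travel test: the energy barrier on the segment from a to b,
    measured above the higher endpoint, must stay below max_barrier."""
    path = [a + (b - a) * k / n_points for k in range(n_points + 1)]
    return max(energy(p) for p in path) - max(energy(a), energy(b)) < max_barrier


minima = [-1.02, 0.98, -0.97, 1.05, 1.0]
groups = group_minima(minima, energy, low_barrier)
```

On the toy double well the five samples fall into two groups, one per well. Replacing `low_barrier` with a diffusion-based test on a learned image-space energy would follow the procedure described above more closely.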

5 Conclusion and Future Work

Our experiments on energy-based models with the form (2) reveal two distinct axes of ML learning. We use our insights to train models with sampling capabilities that are unobtainable by previous implementations. The informative MCMC initializations used by previous authors are not necessary for high-quality synthesis. By removing this technique we train the first energy functions capable of high-diversity and realistic synthesis from noise initialization after training. We identify a severe defect in the steady-state distributions of prior implementations and introduce the first ConvNet potentials of the form (2) for which steady-state samples have realistic appearance. Our observations could be very useful for convergent ML learning with more complex MCMC initialization methods used in (?; ?). We hope that our work paves the way for future unsupervised and weakly supervised applications with energy-based models.

Acknowledgment

The work is supported by DARPA XAI project N66001-17-2-4029; ARO project W911NF1810296; and ONR MURI project N00014-16-1-2007; and Extreme Science and Engineering Discovery Environment (XSEDE) grant ASC170063. We thank Prafulla Dhariwal and Anirudh Goyal for helpful discussions.

References