Learning Non-Convergent Non-Persistent Short-Run MCMC Toward Energy-Based Model

Erik Nijkamp, Mitch Hill, Song-Chun Zhu, Ying Nian Wu

Introduction

The maximum likelihood learning of the energy-based model (EBM) follows what Grenander called “analysis by synthesis” scheme. Within each learning iteration, we generate synthesized examples by sampling from the current model, and then update the model parameters based on the difference between the synthesized examples and the observed examples, so that eventually the synthesized examples match the observed examples in terms of some statistical properties defined by the model. To sample from the current EBM, we need to use Markov chain Monte Carlo (MCMC), such as the Gibbs sampler , Langevin dynamics, or Hamiltonian Monte Carlo . Recent work that parametrizes the energy function by modern convolutional neural networks (ConvNets) suggests that the “analysis by synthesis” process can indeed generate highly realistic images .

Although the “analysis by synthesis” learning scheme is intuitively appealing, the convergence of MCMC can be impractical, especially if the energy function is multi-modal, which is typically the case if the EBM is to approximate the complex data distribution, such as that of natural images. For such EBM, the MCMC usually does not mix, i.e., MCMC chains from different starting points tend to get trapped in different local modes instead of traversing modes and mixing with each other.

2 Short-Run MCMC as Generator or Flow Model

In this paper, we investigate a learning scheme that is apparently wrong with no hope of learning a valid model. Within each learning iteration, we run a non-convergent, non-mixing and non-persistent short-run MCMC, such as 55 to 100100 steps of Langevin dynamics, toward the current EBM. Here, we always initialize the non-persistent short-run MCMC from the same distribution, such as the uniform noise distribution, and we always run the same number of MCMC steps. We then update the model parameters as usual, as if the synthesized examples generated by the non-convergent and non-persistent noise-initialized short-run MCMC are the fair samples generated from the current EBM. We show that, after the convergence of such a learning algorithm, the resulting noise-initialized short-run MCMC can generate realistic images, see Figures 1 and 2.

The short-run MCMC is not a valid sampler of the EBM because it is short-run. As a result, the learned EBM cannot be a valid model because it is learned based on a wrong sampler. Thus we learn a wrong sampler of a wrong model. However, the short-run MCMC can indeed generate realistic images. What is going on?

The goal of this paper is to understand the learned short-run MCMC. We provide arguments that it is a valid model for the data in terms of matching the statistical properties of the data distribution. We also show that the learned short-run MCMC can be used as a generative model, such as a generator model or the flow model , with the Langevin dynamics serving as a noise-injected residual network, with the initial image serving as the latent variables, and with the initial uniform noise distribution serving as the prior distribution of the latent variables. We show that unlike traditional EBM and MCMC, the learned short-run MCMC is capable of reconstructing the observed images and interpolating different images, just like a generator or a flow model can do. See Figures 3 and 4. This is very unconventional for EBM or MCMC, and this is due to the fact that the MCMC is non-convergent, non-mixing and non-persistent. In fact, our argument applies to the situation where the short-run MCMC does not need to have the EBM as the stationary distribution.

While the learned short-run MCMC can be used for synthesis, the above learning scheme can be generalized to tasks such as image inpainting, super-resolution, style transfer, or inverse optimal control etc., using informative initial distributions and conditional energy functions.

Contributions and Related Work

This paper constitutes a conceptual shift, where we shift attention from learning EBM with unrealistic convergent MCMC to the non-convergent short-run MCMC. This is a break away from the long tradition of both EBM and MCMC. We provide theoretical and empirical evidence that the learned short-run MCMC is a valid generator or flow model. This conceptual shift frees us from the convergence issue of MCMC, and makes the short-run MCMC a reliable and efficient technology.

More generally, we shift the focus from energy-based model to energy-based dynamics. This appears to be consistent with the common practice of computational neuroscience , where researchers often directly start from the dynamics, such as attractor dynamics whose express goal is to be trapped in a local mode. It is our hope that our work may help to understand the learning of such dynamics. We leave it to future work.

For short-run MCMC, contrastive divergence (CD) is the most prominent framework for theoretical underpinning. The difference between CD and our study is that in our study, the short-run MCMC is initialized from noise, while CD initializes from observed images. CD has been generalized to persistent CD . Compared to persistent MCMC, the non-persistent MCMC in our method is much more efficient and convenient. performs a thorough investigation of various persistent and non-persistent, as well as convergent and non-convergent learning schemes. In particular, the emphasis is on learning proper energy function with persistent and convergent Markov chains. In all of the CD-based frameworks, the goal is to learn the EBM, whereas in our framework, we discard the learned EBM, and only keep the learned short-run MCMC.

Our theoretical understanding of short-run MCMC is based on generalized moment matching estimator. It is related to moment matching GAN , however, we do not learn a generator adversarially.

Non-Convergent Short-Run MCMC as Generator Model

Let xx be the signal, such as an image. The energy-based model (EBM) is a Gibbs distribution

where we assume xx is within a bounded range. fθ(x)f_{\theta}(x) is the negative energy and is parametrized by a bottom-up convolutional neural network (ConvNet) with weights θ\theta. Z(θ)=exp(fθ(x))dxZ(\theta)=\int\exp(f_{\theta}(x))dx is the normalizing constant.

Suppose we observe training examples xi,i=1,...,npdatax_{i},i=1,...,n\sim{p_{\rm data}}, where pdata{p_{\rm data}} is the data distribution. For large nn, the sample average over {xi}\{x_{i}\} approximates the expectation with respect with pdata{p_{\rm data}}. For notational convenience, we treat the sample average and the expectation as the same.

where xipθ(x)x_{i}^{-}\sim p_{\theta}(x) for i=1,...,ni=1,...,n are the generated examples from the current model pθ(x)p_{\theta}(x).

The above equation leads to the “analysis by synthesis” learning algorithm. At iteration tt, let θt{\theta}_{t} be the current model parameters. We generate xipθt(x)x_{i}^{-}\sim p_{{\theta}_{t}}(x) for i=1,...,ni=1,...,n. Then we update θt+1=θt+ηtL(θt){\theta}_{t+1}={\theta}_{t}+\eta_{t}L^{\prime}({\theta}_{t}), where ηt\eta_{t} is the learning rate.

2 Short-Run MCMC

Generating synthesized examples xipθ(x)x_{i}^{-}\sim p_{\theta}(x) requires MCMC, such as Langevin dynamics (or Hamiltonian Monte Carlo) , which iterates

where τ\tau indexes the time, Δτ\Delta\tau is the discretization of time, and UτN(0,I)U_{\tau}\sim{\rm N}(0,I) is the Gaussian noise term. fθ(x)=fθ(x)/xf^{\prime}_{\theta}(x)=\partial f_{\theta}(x)/\partial x can be obtained by back-propagation. If pθp_{\theta} is of low entropy or low temperature, the gradient term dominates the diffusion noise term, and the Langevin dynamics behaves like gradient descent.

If fθ(x)f_{\theta}(x) is multi-modal, then different chains tend to get trapped in different local modes, and they do not mix. We propose to give up the sampling of pθp_{\theta}. Instead, we run a fixed number, e.g., KK, steps of MCMC, toward pθp_{\theta}, starting from a fixed initial distribution, p0p_{0}, such as the uniform noise distribution. Let MθM_{\theta} be the KK-step MCMC transition kernel. Define

which is the marginal distribution of the sample xx after running KK-step MCMC from p0p_{0}.

In this paper, instead of learning pθp_{\theta}, we treat qθq_{\theta} to be the target of learning. After learning, we keep qθq_{\theta}, but we discard pθp_{\theta}. That is, the sole purpose of pθp_{\theta} is to guide a KK-step MCMC from p0p_{0}.

3 Learning Short-Run MCMC

The learning algorithm is as follows. Initialize θ0{\theta}_{0}. At learning iteration tt, let θt{\theta}_{t} be the model parameters. We generate xiqθt(x)x_{i}^{-}\sim q_{{\theta}_{t}}(x) for i=1,...,mi=1,...,m. Then we update θt+1=θt+ηtΔ(θt){\theta}_{t+1}={\theta}_{t}+\eta_{t}\Delta({\theta}_{t}), where

We assume that the algorithm converges so that Δ(θt)0\Delta({\theta}_{t})\rightarrow 0. At convergence, the resulting θ\theta solves the estimating equation Δ(θ)=0\Delta(\theta)=0.

To further improve training, we smooth pdata{p_{\rm data}} by convolution with a Gaussian white noise distribution, i.e., injecting additive noises ϵiN(0,σ2I)\epsilon_{i}\sim{\rm N}(0,\sigma^{2}I) to observed examples xixi+ϵix_{i}\leftarrow x_{i}+\epsilon_{i} . This makes it easy for Δ(θt)\Delta(\theta_{t}) to converge to 0, especially if the number of MCMC steps, KK, is small, so that the estimating equation Δ(θ)=0\Delta(\theta)=0 may not have solution without smoothing pdata{p_{\rm data}}.

The learning procedure in Algorithm 1 is simple. The key to the above algorithm is that the generated {xi}\{x_{i}^{-}\} are independent and fair samples from the model qθq_{\theta}.

4 Generator or Flow Model for Interpolation and Reconstruction

We may consider qθ(x)q_{\theta}(x) to be a generative model,

where uu denotes all the randomness in the short-run MCMC. For the KK-step Langevin dynamics, MθM_{\theta} can be considered a KK-layer noise-injected residual network. zz can be considered latent variables, and p0p_{0} the prior distribution of zz. Due to the non-convergence and non-mixing, xx can be highly dependent on zz, and zz can be inferred from xx. This is different from the convergent MCMC, where xx is independent of zz. When the learning algorithm converges, the learned EBM tends to have low entropy and the Langevin dynamics behaves like gradient descent, where the noise terms are disabled, i.e., u=0u=0. In that case, we simply write x=Mθ(z)x=M_{\theta}(z).

We can perform interpolation as follows. Generate z1z_{1} and z2z_{2} from p0(z)p_{0}(z). Let zρ=ρz1+1ρ2z2z_{\rho}=\rho z_{1}+\sqrt{1-\rho^{2}}z_{2}. This interpolation keeps the marginal variance of zρz_{\rho} fixed. Let xρ=Mθ(zρ)x_{\rho}=M_{\theta}(z_{\rho}). Then xρx_{\rho} is the interpolation of x1=Mθ(z1)x_{1}=M_{\theta}(z_{1}) and x2=Mθ(z2)x_{2}=M_{\theta}(z_{2}). Figure 3 displays xρx_{\rho} for a sequence of ρ\rho\in.

For an observed image xx, we can reconstruct xx by running gradient descent on the least squares loss function L(z)=xMθ(z)2L(z)=\|x-M_{\theta}(z)\|^{2}, initializing from z0p0(z)z_{0}\sim p_{0}(z), and iterates zt+1=ztηtL(zt)z_{t+1}=z_{t}-\eta_{t}L^{\prime}(z_{t}). Figure 4 displays the sequence of xt=Mθ(zt)x_{t}=M_{\theta}(z_{t}).

In general, zp0(z);x=Mθ(z,u)z\sim p_{0}(z);x=M_{\theta}(z,u) defines an energy-based dynamics. KK does not need to be fixed. It can be a stopping time that depends on the past history of the dynamics. The dynamics can be made deterministic by setting u=0u=0. This includes the attractor dynamics popular in computational neuroscience .

Understanding the Learned Short-Run MCMC

Figure 5 illustrates the above idea. The red dotted line illustrates MCMC. Starting from p0p_{0}, KK-step MCMC leads to qθ^MME(x)q_{\hat{\theta}_{\rm MME}}(x). If we continue to run MCMC for infinite steps, we will get to pθ^MMEp_{\hat{\theta}_{\rm MME}}. Thus the role of pθ^MMEp_{\hat{\theta}_{\rm MME}} is to serve as an unreachable target to guide the KK-step MCMC which stops at the mid-way qθ^MMEq_{\hat{\theta}_{\rm MME}}. One can say that the short-run MCMC is a wrong sampler of a wrong model, but it itself is a valid model because it belongs to Ω\Omega.

The MLE pθ^MLEp_{\hat{\theta}_{\rm MLE}} is the projection of pdata{p_{\rm data}} onto Θ\Theta. Thus it belongs to Θ\Theta. It also belongs to Ω\Omega as can be seen from the maximum likelihood estimating equation. Thus it is the intersection of Ω\Omega and Θ\Theta. Among all the distributions in Ω\Omega, pθ^MLEp_{\hat{\theta}_{\rm MLE}} is the closest to p0p_{0}. Thus it has the maximum entropy among all the distributions in Ω\Omega.

The above duality between maximum likelihood and maximum entropy follows from the following fact. Let p^ΘΩ\hat{p}\in\Theta\cap\Omega be the intersection between Θ\Theta and Ω\Omega. Ω\Omega and Θ\Theta are orthogonal in terms of the Kullback-Leibler divergence. For any pθΘp_{\theta}\in\Theta and for any pΩp\in\Omega, we have the Pythagorean property : KL(ppθ)=KL(pp^)+KL(p^pθ){\rm KL}(p|p_{\theta})={\rm KL}(p|\hat{p})+{\rm KL}(\hat{p}|p_{\theta}). See Appendix 7.1 for a proof. Thus (1) KL(pdatapθ)KL(pdatap^){\rm KL}({p_{\rm data}}|p_{\theta})\geq{\rm KL}({p_{\rm data}}|\hat{p}), i.e., p^\hat{p} is MLE within Θ\Theta. (2) KL(pp0)KL(p^p0){\rm KL}(p|p_{0})\geq{\rm KL}(\hat{p}|p_{0}), i.e., p^\hat{p} has maximum entropy within Ω\Omega.

We can understand the learned qθ^MMEq_{\hat{\theta}_{\rm MME}} from two Pythagorean results.

(1) Pythagorean for the right triangle formed by q0q_{0}, qθ^MMEq_{\hat{\theta}_{\rm MME}}, and pθ^MLEp_{\hat{\theta}_{\rm MLE}},

(2) Pythagorean for the right triangle formed by pθ^MMEp_{\hat{\theta}_{\rm MME}}, qθ^MMEq_{\hat{\theta}_{\rm MME}}, and pθ^MLEp_{\hat{\theta}_{\rm MLE}},

For fixed θ\theta, as KK increases, KL(qθpθ){\rm KL}(q_{\theta}|p_{\theta}) decreases monotonically . The smaller KL(qθ^MMEpθ^MME){\rm KL}(q_{\hat{\theta}_{\rm MME}}|p_{\hat{\theta}_{\rm MME}}) is, the smaller KL(qθ^MMEpθ^MLE){\rm KL}(q_{\hat{\theta}_{\rm MME}}|p_{\hat{\theta}_{\rm MLE}}) and KL(pθ^MLEpθ^MME){\rm KL}(p_{\hat{\theta}_{\rm MLE}}|p_{\hat{\theta}_{\rm MME}}) are. Thus, it is desirable to use large KK as long as we can afford the computational cost, to make both qθ^MMEq_{\hat{\theta}_{\rm MME}} and pθ^MMEp_{\hat{\theta}_{\rm MME}} close to pθ^MLEp_{\hat{\theta}_{\rm MLE}}.

2 General ConvNet-EBM and Generalized Moment Matching Estimator

In classical statistics, we often assume that the model is correct, i.e., pdata{p_{\rm data}} corresponds to a qθtrueq_{{\theta}_{\rm true}} for some true value θtrue{\theta}_{\rm true}. In that case, the generalized moment matching estimator θ^MME\hat{\theta}_{\rm MME} follows an asymptotic normal distribution centered at the true value θtrue{\theta}_{\rm true}. The variance of θ^MME\hat{\theta}_{\rm MME} depends on the choice of h(x,θ)h(x,\theta). The variance is minimized by the choice h(x,θ)=θlogqθ(x),h(x,\theta)={\frac{\partial}{\partial\theta}}\log q_{\theta}(x), which corresponds to the maximum likelihood estimate of qθq_{\theta}, and which leads to the Cramer-Rao lower bound and Fisher information. See Appendix 7.2 for a brief explanation.

θlogpθ(x)=θfθ(x)θlogZ(θ){\frac{\partial}{\partial\theta}}\log p_{\theta}(x)={\frac{\partial}{\partial\theta}}f_{\theta}(x)-{\frac{\partial}{\partial\theta}}\log Z(\theta) is not equal to θlogqθ(x){\frac{\partial}{\partial\theta}}\log q_{\theta}(x). Thus the learning algorithm will not give us the maximum likelihood estimate of qθq_{\theta}. However, the validity of the learned qθq_{\theta} does not require h(x,θ)h(x,\theta) to be θlogqθ(x){\frac{\partial}{\partial\theta}}\log q_{\theta}(x). In practice, one can never assume that the model is true. As a result, the optimality of the maximum likelihood may not hold, and there is no compelling reason that we must use MLE.

The relationship between pdata{p_{\rm data}}, qθ^MMEq_{\hat{\theta}_{\rm MME}}, pθ^MMEp_{\hat{\theta}_{\rm MME}}, and pθ^MLEp_{\hat{\theta}_{\rm MLE}} may still be illustrated by Figure 5, although we need to modify the definition of Ω\Omega.

Experimental Results

In this section, we will demonstrate (1) realistic synthesis, (2) smooth interpolation, (3) faithful reconstruction of observed examples, and, (4) the influence of hyperparameters. KK denotes the number of MCMC steps in equation (4). nfn_{f} denotes the number of output features maps in the first layer of fθf_{\theta}. See Appendix for additional results.

We emphasize the simplicity of the algorithm and models, see Appendix 7.3 and 7.4, respectively.

We evaluate the fidelity of generated examples on various datasets, each reduced to 40,00040,000 observed examples. Figure 6 depicts generated samples for various datasets with K=100K=100 Langevin steps for both training and evaluation. For CIFAR-10 we set the number of features nf=128n_{f}=128, whereas for CelebA and LSUN we use nf=64n_{f}=64. We use 200,000200,000 iterations of model updates, then gradually decrease the learning rate η\eta and injected noise ϵiN(0,σ2I)\epsilon_{i}\sim{\rm N}(0,\sigma^{2}I) for observed examples. Table 1 (a) compares the Inception Score (IS) and Fréchet Inception Distance (FID) with Inception v3 classifier on 40,00040,000 generated examples. Despite its simplicity, short-run MCMC is competitive.

2 Interpolation

We demonstrate interpolation between generated examples. We follow the procedure outlined in Section 3.4. Let xρ=Mθ(zρ)x_{\rho}=M_{\theta}(z_{\rho}) where MθM_{\theta} to denotes the KK-step gradient descent with K=100K=100. Figure 3 illustrates xρx_{\rho} for a sequence of ρ\rho\in on CelebA. The interpolation appears smooth and the intermediate samples resemble realistic faces. The interpolation experiment highlights that the short-run MCMC does not mix, which is in fact an advantage instead of a disadvantage. The interpolation ability goes far beyond the capacity of EBM and convergent MCMC.

3 Reconstruction

We demonstrate reconstruction of observed examples. For short-run MCMC, we follow the procedure outlined in Section 3.4. For an observed image xx, we reconstruct xx by running gradient descent on the least squares loss function L(z)=xMθ(z)2L(z)=\|x-M_{\theta}(z)\|^{2}, initializing from z0p0(z)z_{0}\sim p_{0}(z), and iterates zt+1=ztηtL(zt)z_{t+1}=z_{t}-\eta_{t}L^{\prime}(z_{t}). For VAE, reconstruction is readily available. For GAN, we perform Langevin inference of latent variables . Figure 4 depicts faithful reconstruction. Table 1 (b) illustrates competitive reconstructions in terms of MSE (per pixel) for 1,0001,000 observed leave-out examples. Again, the reconstruction ability of the short-run MCMC is due to the fact that it is not mixing.

4 Influence of Hyperparameters

MCMC Steps. Table 2 depicts the influence of varying the number of MCMC steps KK while training on synthesis and average magnitude xfθ(x)2\|\frac{\partial}{\partial x}f_{\theta}(x)\|_{2} over KK-step Langevin (4). We observe: (1) the quality of synthesis decreases with decreasing KK, and, (2) the shorter the MCMC, the colder the learned EBM, and the more dominant the gradient descent part of the Langevin. With small KK, short-run MCMC fails “gracefully” in terms of synthesis. A choice of K=100K=100 appears reasonable.

Injected Noise. To stabilize training, we smooth pdata{p_{\rm data}} by injecting additive noises ϵiN(0,σ2I)\epsilon_{i}\sim{\rm N}(0,\sigma^{2}I) to observed examples xixi+ϵix_{i}\leftarrow x_{i}+\epsilon_{i}. Table 3 (a) depicts the influence of σ2\sigma^{2} on the fidelity of negative examples in terms of IS and FID. That is, when lowering σ2\sigma^{2}, the fidelity of the examples improves. Hence, it is desirable to pick smallest σ2\sigma^{2} while maintaining the stability of training. Further, to improve synthesis, we may gradually decrease the learning rate η\eta and anneal σ2\sigma^{2} while training.

Model Complexity. We investigate the influence of the number of output features maps nfn_{f} on generated samples with K=100K=100. Table 3 (b) summarizes the quality of synthesis in terms of IS and FID. As the number of features nfn_{f} increases, so does the quality of the synthesis. Hence, the quality of synthesis may scale with nfn_{f} until the computational means are exhausted.

Conclusion

Despite our focus on short-run MCMC, we do not advocate abandoning EBM all together. On the contrary, we ultimately aim to learn valid EBM . Hopefully, the non-convergent short-run MCMC studied in this paper may be useful in this endeavor. It is also our hope that our work may help to understand the learning of attractor dynamics popular in neuroscience.

The work is supported by DARPA XAI project N66001-17-2-4029; ARO project W911NF1810296; and ONR MURI project N00014-16-1-2007; and XSEDE grant ASC170063. We thank Prof. Stu Geman, Prof. Xianfeng (David) Gu, Diederik P. Kingma, Guodong Zhang, and Will Grathwohl for helpful discussions.

References

Appendix

Thus KL(ppθ)=KL(pp^)+KL(p^pθ){\rm KL}(p|p_{\theta})={\rm KL}(p|\hat{p})+{\rm KL}(\hat{p}|p_{\theta}).

2 Estimating Equation and Cramer-Rao Theory

3 Code

Note, in the code we parametrize the energy as fθ(x)/T-f_{\theta}(x)/T, for an a priori chosen small temperature TT, for convenience, so that the Langevin dynamics becomes xt+1=xt+fθ(xt)/2+N(0,T)x_{t+1}=x_{t}+f_{\theta}^{\prime}(x_{t})/2+N(0,T).

4 Model Architecture

We use the following notation. Convolutional operation conv(n)conv(n) with nn output feature maps and bias term. Leaky-ReLU nonlinearity LReLULReLU with default leaky factor 0.20.2. We set nf{32,64,128}n_{f}\in\{32,64,128\}.

5 Computational Cost

Each iteration of short-run MCMC requires computing KK derivatives of fθ(x)f_{\theta}(x) which is in the form of a convolutional neural network. As an example, 100,000100,000 model parameter updates for 64×6464\times 64 CelebA with K=100K=100 and nf=64n_{f}=64 on 44 Titan Xp GPUs take 1616 hours.

6 Varying K𝐾K for Training and Sampling

We may interpret short-run MCMC as a noise-injected residual network of variable depth. Hence, the number of MCMC steps while training K1K_{1} and sampling K2K_{2} may differ. We train the model on CelebA with K1K_{1} and test the trained model by running MCMC with K2K_{2} steps. Figure 7 below depicts training with K1=100K_{1}=100 and varied K2K_{2} for sampling. Note, over-saturation occurs for K2>K1K_{2}>K_{1}.

7 Toy Examples in 1D and 2D

In Figures 8 and 9, we plot the density and log-density of the true model, the learned EBM, and the kernel density estimate (KDE) of the MCMC samples. The density of the MCMC samples matches the true density closely. The learned energy captures the modes of the true density, but is of a much bigger scale, so that the learned EBM density is of much lower entropy or temperature (so that the density focuses on the global energy minimum). This is consistent with our theoretical understanding.