Divergence Triangle for Joint Training of Generator Model, Energy-based Model, and Inference Model

Tian Han, Erik Nijkamp, Xiaolin Fang, Mitch Hill, Song-Chun Zhu, Ying Nian Wu

Introduction

Deep probabilistic generative models are a powerful framework for representing complex data distributions. They have been widely used in unsupervised learning problems to learn from unlabeled data. The goal of generative learning is to build rich and flexible models that fit complex, multi-modal data distributions and generate samples with high realism. The family of generative models may be roughly divided into two classes: the first is the energy-based model (a.k.a. undirected graphical model), and the second is the latent variable model (a.k.a. directed graphical model), which usually includes a generator model for generation and an inference model for inference or reconstruction.

These models have their advantages and limitations. An energy-based model defines an explicit likelihood of the observed data up to a normalizing constant. However, sampling from such a model usually requires expensive Markov chain Monte Carlo (MCMC). A generator model defines direct sampling of the data. However, it does not have an explicit likelihood. The inference of the latent variables also requires MCMC sampling from the posterior distribution. The inference model defines an explicit approximation to the posterior distribution of the latent variables.

Combining the energy-based model, the generator model, and the inference model to get the best of each is an attractive goal. On the other hand, challenges may accumulate when the models are trained together, since different models need to compete or cooperate effectively to achieve their best performance. In this work, we propose the divergence triangle for joint training of an energy-based model, a generator model, and an inference model. The learning of the three models can then be seamlessly integrated in a principled probabilistic framework. The energy-based model is learned based on the samples supplied by the generator model. With the help of the inference model, the generator model is trained by both the observed data and the energy-based model. The inference model is learned from both the real data fitted by the generator model and the synthesized data generated by the generator model.

Our experiments demonstrate that the divergence triangle is capable of learning an energy-based model with a well-behaved energy landscape, a generator model with highly realistic samples, and an inference model with faithful reconstruction ability.

2 Prior Art

The divergence triangle jointly learns an energy-based model, a generator model, and an inference model. The following are previous methods for learning such models.

The maximum likelihood learning of the energy-based model requires expectation with respect to the current model, while the maximum likelihood learning of the generator model requires expectation with respect to the posterior distribution of the latent variables. Both expectations can be approximated by MCMC, such as Gibbs sampling, Langevin dynamics, or Hamiltonian Monte Carlo (HMC). Langevin dynamics has been used for learning both the energy-based model and the generator model. In both cases, MCMC sampling introduces an inner loop in the training procedure, which adds computational expense.

An early version of the energy-based model is the FRAME (Filters, Random field, And Maximum Entropy) model. Subsequent work used gradient-based methods such as Langevin dynamics to sample from the model, referred to energy-based models as descriptive models, and generalized the model to deep variants.

For learning the energy-based model, contrastive divergence (CD) reduces the computational cost of MCMC sampling by initializing a finite-step MCMC from the observed data. The resulting learning algorithm follows the gradient of the difference between two Kullback-Leibler divergences, hence the name contrastive divergence. In this paper, we shall use the term “contrastive divergence” in a more general sense than its original usage. Persistent contrastive divergence initializes MCMC sampling from the samples of the previous learning iteration.

Generalizing this line of work, an introspective learning method has been developed in which the energy function is discriminatively learned, so that the energy-based model is both a generative model and a discriminative model.

For learning the generator model, the variational auto-encoder (VAE) approximates the posterior distribution of the latent variables by an explicit inference model. In VAE, the inference model is learned jointly with the generator model from the observed data. A precursor of VAE is the wake-sleep algorithm, where the inference model is learned from the dream data generated by the generator model in the sleep phase.

The generator model can also be learned jointly with a discriminator model, as in generative adversarial networks (GAN), as well as deep convolutional GAN (DCGAN), energy-based GAN (EB-GAN), and Wasserstein GAN (WGAN). GAN does not involve an inference model.

The generator model can also be learned jointly with an energy-based model. We can interpret the learning scheme as an adversarial version of contrastive divergence. While in GAN the discriminator model eventually becomes a confused one, in the joint learning of the generator model and the energy-based model, the learned energy-based model becomes a well-defined probability distribution on the observed data. The joint learning bears some similarity to WGAN, but unlike WGAN, the joint learning involves two complementary probability distributions.

To bridge the gap between the generator model and the energy-based model, the cooperative learning method introduces finite-step MCMC sampling of the energy-based model, with the MCMC initialized from the samples generated by the generator model. Such finite-step MCMC produces synthesized examples closer to the energy-based model, and the generator model can learn from how the finite-step MCMC revises its initial samples.

Adversarially learned inference (ALI) combines the learning of the generator model and inference model in an adversarial framework. ALI can be improved by adding conditional entropy regularization, resulting in the ALICE model. A recently proposed method shares the same spirit. These methods lack an energy-based model on the observed data.

3 Our Contributions

Our proposed formulation, which we call the divergence triangle, re-interprets and integrates the following elements in unsupervised generative learning: (1) maximum likelihood learning, (2) variational learning, (3) adversarial learning, (4) contrastive divergence, (5) wake-sleep algorithm. The learning is seamlessly integrated into a probabilistic framework based on KL divergence.

We conduct extensive experiments to analyze the learned models. Energy landscape mapping is used to verify that our learned energy-based model is well-behaved. Further, we evaluate the generator model via synthesis, generating samples with competitive fidelity, and evaluate the accuracy of the inference model both qualitatively and quantitatively via reconstruction. Our proposed model can also learn directly from incomplete images with various blocking patterns.

Learning Deep Probabilistic Models

In this section, we shall review the two probabilistic models, namely the generator model and the energy-based model, both of which are parametrized by convolutional neural networks. Then, we shall present the maximum likelihood learning algorithms for training these two models, respectively. Our presentation of the two maximum likelihood learning algorithms is unconventional. We seek to derive both algorithms based on the Kullback-Leibler divergence using the same scheme. This will set the stage for the divergence triangle.

The generator model is a generalization of the factor analysis model,

$$z \sim {\rm N}(0, I_{d}), \quad x = g_{\theta}(z) + \epsilon,$$

where $g_{\theta}$ is a top-down mapping parametrized by a deep network with parameters $\theta$. It maps the $d$-dimensional latent vector $z$ to the $D$-dimensional signal $x$. The noise $\epsilon \sim {\rm N}(0, \sigma^{2} I_{D})$ is independent of $z$. In general, the model is defined by the prior distribution $p(z)$ and the conditional distribution $p_{\theta}(x|z)$. The complete-data model is $p_{\theta}(z,x) = p(z) p_{\theta}(x|z)$. The observed-data model is $p_{\theta}(x) = \int p_{\theta}(z,x) dz$. The posterior distribution is $p_{\theta}(z|x) = p_{\theta}(z,x)/p_{\theta}(x)$. See the diagram (a) below.

A complementary model is the energy-based model, where $-f_{\alpha}(x)$ defines the energy of $x$, and a low-energy $x$ is assigned a high probability. Specifically, we have the following probability model

$$\pi_{\alpha}(x) = \frac{1}{Z(\alpha)} \exp[f_{\alpha}(x)],$$

where $f_{\alpha}(x)$ is parametrized by a bottom-up deep network with parameters $\alpha$, and $Z(\alpha)$ is the normalizing constant. If $f_{\alpha}(x)$ is linear in $\alpha$, the model becomes the familiar exponential family model in statistics or the Gibbs distribution in statistical physics. We may consider $\pi_{\alpha}$ an evaluator, where $f_{\alpha}$ assigns the value to $x$, and $\pi_{\alpha}$ evaluates $x$ by a normalized probability distribution. See the diagram (b) above.

The energy-based model $\pi_{\alpha}$ defines an explicit log-likelihood via $f_{\alpha}(x)$, even though $Z(\alpha)$ is intractable. However, it is difficult to sample from $\pi_{\alpha}$. The generator model $p_{\theta}$ can generate $x$ directly by first generating $z \sim p(z)$, and then transforming $z$ to $x$ by $g_{\theta}(z)$. But it does not define an explicit log-likelihood of $x$.
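The contrast between the two models can be illustrated with a minimal Python sketch. It is not taken from the experiments: the linear mapping `theta * z` and the quadratic `f_alpha` are hypothetical stand-ins for the deep networks $g_{\theta}$ and $f_{\alpha}$, chosen only so that the example runs without dependencies.

```python
import random

def sample_generator(theta, sigma, rng):
    """Draw (z, x) by ancestral sampling: z ~ N(0, 1), then x = g_theta(z) + eps."""
    z = rng.gauss(0.0, 1.0)
    eps = rng.gauss(0.0, sigma)
    x = theta * z + eps  # hypothetical linear g_theta(z) = theta * z
    return z, x

def f_alpha(alpha, x):
    """Hypothetical quadratic value function: log pi_alpha(x) + log Z(alpha)."""
    return -0.5 * alpha * x * x

rng = random.Random(0)
# Sampling from the generator is direct and needs no MCMC.
draws = [sample_generator(theta=2.0, sigma=0.1, rng=rng) for _ in range(5)]

# Log-density ratios under the energy-based model need only f_alpha,
# because the intractable log Z(alpha) cancels in the difference.
log_ratio = f_alpha(1.0, 0.5) - f_alpha(1.0, 1.5)
```

The generator yields samples but no density value, while the energy-based model yields relative density values but no direct sampler, which is the complementarity that the joint training exploits.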

In the context of inverse reinforcement learning or inverse optimal control, $x$ is the action, and $-f_{\alpha}(x)$ defines the cost function, or $f_{\alpha}(x)$ defines the value function or the objective function.

2 Maximum Likelihood Learning

Let $q_{\rm data}(x)$ be the true distribution that generates the training data. Both the generator $p_{\theta}$ and the energy-based model $\pi_{\alpha}$ can be learned by maximum likelihood. For a large sample, maximum likelihood amounts to minimizing the Kullback-Leibler divergence ${\rm KL}(q_{\rm data}\|p_{\theta})$ over $\theta$, and minimizing ${\rm KL}(q_{\rm data}\|\pi_{\alpha})$ over $\alpha$, respectively. The expectation ${\rm E}_{q_{\rm data}}$ can be approximated by the sample average.
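The equivalence between maximum likelihood and KL minimization can be checked numerically on a small discrete example. The sketch below is illustrative only; the three-symbol distribution and the candidate models `A`, `B`, `C` are made up for the purpose.

```python
import math

def kl(p, q):
    """KL(p || q) for discrete distributions given as lists."""
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)

def expected_nll(q_data, p_model):
    """Large-sample limit of the average negative log-likelihood."""
    return -sum(a * math.log(b) for a, b in zip(q_data, p_model))

q_data = [0.5, 0.3, 0.2]
entropy = -sum(a * math.log(a) for a in q_data)
candidates = {"A": [0.4, 0.4, 0.2], "B": [0.6, 0.2, 0.2], "C": [0.5, 0.3, 0.2]}

# The two criteria differ by the entropy of q_data, which does not depend on the model.
gaps = {name: expected_nll(q_data, p) - kl(q_data, p) for name, p in candidates.items()}
best_by_nll = min(candidates, key=lambda name: expected_nll(q_data, candidates[name]))
best_by_kl = min(candidates, key=lambda name: kl(q_data, candidates[name]))
```

Because the gap is the same constant for every candidate, ranking models by expected negative log-likelihood and ranking them by KL divergence give the same answer.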

To learn the generator model $p_{\theta}$, we seek to minimize $K(\theta) = {\rm KL}(q_{\rm data}(x)\|p_{\theta}(x))$ over $\theta$. Suppose in an iterative algorithm, the current $\theta$ is $\theta_{t}$. We can fix $\theta_{t}$ at any place we want, and vary $\theta$ around $\theta_{t}$. We can write

$${\rm KL}(q_{\rm data}(x)p_{\theta_{t}}(z|x)\|p_{\theta}(z,x)) = {\rm KL}(q_{\rm data}(x)\|p_{\theta}(x)) + {\rm KL}(p_{\theta_{t}}(z|x)\|p_{\theta}(z|x)),$$

where the last KL-divergence is averaged over $x \sim q_{\rm data}(x)$.

In the EM algorithm, the left hand side is the surrogate objective function $S(\theta)$. This surrogate function is more tractable than the true objective function ${\rm KL}(q_{\rm data}(x)\|p_{\theta}(x))$ because $q_{\rm data}(x)p_{\theta_{t}}(z|x)$ is a distribution of the complete data, and $p_{\theta}(z,x)$ is the complete-data model. Because the last KL-divergence above is non-negative and vanishes at $\theta = \theta_{t}$, $S$ majorizes $K$ and the two touch at $\theta_{t}$, so that $S(\theta_{t}) = K(\theta_{t})$ and $S^{\prime}(\theta_{t}) = K^{\prime}(\theta_{t})$.

$q_{\rm data}(x)p_{\theta_{t}}(z|x)$ gives us the complete data. Each step of EM fits the complete-data model $p_{\theta}(z,x)$ by minimizing the surrogate $S(\theta)$,

$$\theta_{t+1} = \arg\min_{\theta} {\rm KL}(q_{\rm data}(x)p_{\theta_{t}}(z|x)\|p_{\theta}(z,x)),$$

which amounts to maximizing the complete-data log-likelihood. By minimizing $S$, we will reduce $S(\theta)$ relative to $\theta_{t}$, and we will reduce $K(\theta)$ even more, relative to $\theta_{t}$, because of the majorization picture.

We can also use gradient descent to update $\theta$. Because $S^{\prime}(\theta_{t}) = K^{\prime}(\theta_{t})$, and we can place $\theta_{t}$ anywhere, we have

$$-\frac{\partial}{\partial\theta}{\rm KL}(q_{\rm data}(x)\|p_{\theta}(x)) = {\rm E}_{q_{\rm data}(x)}{\rm E}_{p_{\theta}(z|x)}\left[\frac{\partial}{\partial\theta}\log p_{\theta}(z,x)\right].$$

To implement the above updates, we need to compute the expectation with respect to the posterior distribution $p_{\theta}(z|x)$. It can be approximated by MCMC such as Langevin dynamics or HMC. Both require gradient computations that can be efficiently accomplished by back-propagation. The generator model has been learned with this method in earlier work.

2.2 Self-critic Learning of Energy-based Model

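To learn the energy-based model $\pi_{\alpha}$, we seek to minimize $K(\alpha) = {\rm KL}(q_{\rm data}(x)\|\pi_{\alpha}(x))$ over $\alpha$. Suppose in an iterative algorithm, the current $\alpha$ is $\alpha_{t}$. We can fix $\alpha_{t}$ at any place we want, and vary $\alpha$ around $\alpha_{t}$.

Consider the following contrastive divergence

$$S(\alpha) = {\rm KL}(q_{\rm data}(x)\|\pi_{\alpha}(x)) - {\rm KL}(\pi_{\alpha_{t}}(x)\|\pi_{\alpha}(x)). \tag{8}$$

We can use the above as a surrogate function, which is more tractable than the true objective function, since the $\log Z(\alpha)$ term is canceled out. Specifically, we can write (8) as

$$S(\alpha) = -\left({\rm E}_{q_{\rm data}}[f_{\alpha}(x)] - {\rm E}_{\pi_{\alpha_{t}}}[f_{\alpha}(x)]\right) + {\rm const}.$$

Because $S$ minorizes $K$, we do not have an EM-like update. However, the subtracted KL-divergence in (8) is minimized at $\alpha = \alpha_{t}$, so $S^{\prime}(\alpha_{t}) = K^{\prime}(\alpha_{t})$, and we can still use gradient descent to update $\alpha$, where the derivative is

$$K^{\prime}(\alpha_{t}) = S^{\prime}(\alpha_{t}) = -\left({\rm E}_{q_{\rm data}}\left[\frac{\partial}{\partial\alpha}f_{\alpha}(x)\right] - {\rm E}_{\pi_{\alpha_{t}}}\left[\frac{\partial}{\partial\alpha}f_{\alpha}(x)\right]\right)\Big|_{\alpha=\alpha_{t}}.$$

Since we can place $\alpha_{t}$ anywhere, we have

$$-\frac{\partial}{\partial\alpha}{\rm KL}(q_{\rm data}(x)\|\pi_{\alpha}(x)) = {\rm E}_{q_{\rm data}}\left[\frac{\partial}{\partial\alpha}f_{\alpha}(x)\right] - {\rm E}_{\pi_{\alpha}}\left[\frac{\partial}{\partial\alpha}f_{\alpha}(x)\right].$$

To implement the above update, we need to compute the expectation with respect to the current model $\pi_{\alpha_{t}}$. It can be approximated by MCMC such as Langevin dynamics or HMC that samples from $\pi_{\alpha_{t}}$. It can be efficiently implemented by gradient computation via back-propagation. The energy-based model has been trained with this method in earlier work.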
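To make the MCMC inner loop concrete, the following is a minimal sketch of unadjusted Langevin dynamics for a one-dimensional energy-based model. The quadratic value function (so that the target is a standard normal), the step size, and the chain length are assumptions chosen for illustration, not settings from the paper.

```python
import random

def grad_f(x):
    """Derivative of the hypothetical f(x) = -x**2 / 2 (standard normal target)."""
    return -x

def langevin_chain(x0, step, n_steps, rng):
    """Unadjusted Langevin dynamics targeting pi(x) proportional to exp(f(x))."""
    x, xs = x0, []
    for _ in range(n_steps):
        # Gradient ascent on f (descent on the energy -f) plus Gaussian noise.
        x = x + 0.5 * step * step * grad_f(x) + step * rng.gauss(0.0, 1.0)
        xs.append(x)
    return xs

rng = random.Random(0)
chain = langevin_chain(x0=3.0, step=0.3, n_steps=20000, rng=rng)
kept = chain[1000:]  # discard burn-in
mean = sum(kept) / len(kept)
var = sum((x - mean) ** 2 for x in kept) / len(kept)
```

Each parameter update in maximum likelihood learning would need a chain like this to approximate the model expectation, which is the computational cost that motivates avoiding MCMC.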

The above learning algorithm has an adversarial interpretation. Updating $\alpha_{t}$ to $\alpha_{t+1}$ by following the gradient of $S(\alpha) = {\rm KL}(q_{\rm data}(x)\|\pi_{\alpha}(x)) - {\rm KL}(\pi_{\alpha_{t}}(x)\|\pi_{\alpha}(x)) = -({\rm E}_{q_{\rm data}}[f_{\alpha}(x)] - {\rm E}_{\pi_{\alpha_{t}}}[f_{\alpha}(x)]) + {\rm const}$, we seek to decrease the first KL-divergence while increasing the second KL-divergence. Equivalently, we seek to shift the value function $f_{\alpha}(x)$ toward the observed data and away from the synthesized data generated from the current model. That is, the model $\pi_{\alpha}$ criticizes its current version $\pi_{\alpha_{t}}$, i.e., the model is its own adversary or its own critic.

2.3 Similarity and Difference

In the generator model, if we replace the intractable $p_{\theta_{t}}(z|x)$ by the inference model $q_{\phi}(z|x)$, we get VAE.

Divergence Triangle: Integrating Adversarial and Variational Learning

In this section, we shall first present the divergence triangle, emphasizing its compact symmetric and anti-symmetric form. Then, we shall show that it is a re-interpretation and integration of existing methods, in particular, VAE and ACD.

Suppose we observe training examples $\{x_{(i)} \sim q_{\rm data}(x)\}_{i=1}^{n}$, where $q_{\rm data}(x)$ is the unknown data distribution. $\pi_{\alpha}(x) \propto \exp[f_{\alpha}(x)]$ with energy function $-f_{\alpha}$ denotes the energy-based model with parameters $\alpha$. The generator model $p(z)p_{\theta}(x|z)$ has parameters $\theta$ and latent vector $z$. It is trivial to sample the latent distribution $p(z)$, and the generative process is defined as $z \sim p(z)$, $x \sim p_{\theta}(x|z)$.

The maximum likelihood learning algorithms for both the generator and energy-based model require MCMC sampling. We modify the maximum likelihood KL-divergences by proposing a divergence triangle criterion, so that the two models can be learned jointly without MCMC. In addition to the generator $p_{\theta}$ and energy-based model $\pi_{\alpha}$, we also include an inference model $q_{\phi}(z|x)$ in the learning scheme. Such an inference model is a key component in the variational auto-encoder. The inference model $q_{\phi}(z|x)$ with parameters $\phi$ maps from the data space to the latent space. In the context of EM, $q_{\phi}(z|x)$ can be considered an imputor that imputes the missing data $z$ to get the complete data $(z,x)$.

The three models above define joint distributions over $z$ and $x$ from different perspectives. The two marginals, i.e., the empirical data distribution $q_{\rm data}(x)$ and the latent prior distribution $p(z)$, are known to us. The goal is to harmonize the three joint distributions so that the competition and cooperation between different loss terms improve learning.

The divergence triangle involves the following three joint distributions on $(z,x)$:

$Q$-distribution: $Q(z,x) = q_{\rm data}(x)q_{\phi}(z|x)$.

$P$-distribution: $P(z,x) = p(z)p_{\theta}(x|z)$.

$\Pi$-distribution: $\Pi(z,x) = \pi_{\alpha}(x)q_{\phi}(z|x)$.

We propose to learn the three models $p_{\theta}$, $\pi_{\alpha}$, $q_{\phi}$ by the following divergence triangle loss functional ${\cal D}$

$$\max_{\alpha}\min_{\theta}\min_{\phi}{\cal D}(\alpha,\theta,\phi), \quad {\cal D} = {\rm KL}(Q\|P) + {\rm KL}(P\|\Pi) - {\rm KL}(Q\|\Pi). \tag{14}$$

See Figure 3 for illustration. The divergence triangle is based on the three KL-divergences between the three joint distributions on $(z,x)$. It has a symmetric and anti-symmetric form, where the anti-symmetry is due to the negative sign in front of the last KL-divergence and the maximization over $\alpha$. The divergence triangle leads to the following dynamics between the three models: (1) $Q$ and $P$ seek to get close to each other. (2) $P$ seeks to get close to $\Pi$. (3) $\pi_{\alpha}$ seeks to get close to $q_{\rm data}$, but it seeks to get away from $P$, as indicated by the red arrow. Note that ${\rm KL}(Q\|\Pi) = {\rm KL}(q_{\rm data}\|\pi_{\alpha})$, because $q_{\phi}(z|x)$ is canceled out. The effect of (2) and (3) is that $\pi_{\alpha}$ gets close to $q_{\rm data}$, while inducing $P$ to get close to $q_{\rm data}$ as well, or in other words, $P$ chases $\pi_{\alpha}$ toward $q_{\rm data}$.
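The cancellation ${\rm KL}(Q\|\Pi) = {\rm KL}(q_{\rm data}\|\pi_{\alpha})$ can be verified on a small discrete example. In the sketch below, the three-valued $x$, the binary $z$, and all probability tables are made-up values used only to check the identity.

```python
import math

def kl(p, q):
    """KL(p || q) for discrete distributions given as lists."""
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)

q_data = [0.5, 0.3, 0.2]                          # data distribution over x
pi_alpha = [0.2, 0.5, 0.3]                        # energy-based model over x
q_phi = [[0.9, 0.1], [0.4, 0.6], [0.25, 0.75]]    # q_phi[x][z], inference model

# Q and Pi share the same conditional q_phi(z|x) and differ only in the marginal on x.
Q = [q_data[x] * q_phi[x][z] for x in range(3) for z in range(2)]
Pi = [pi_alpha[x] * q_phi[x][z] for x in range(3) for z in range(2)]

joint_kl = kl(Q, Pi)
marginal_kl = kl(q_data, pi_alpha)
```

Since the shared conditional cancels inside the logarithm, the joint divergence equals the marginal one up to floating-point error, whatever inference model is used.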

2 Unpacking the Loss Function

The divergence triangle integrates variational and adversarial learning methods, which are modifications of maximum likelihood.

First, $\min_{\theta}\min_{\phi}{\rm KL}(Q\|P)$ captures the variational auto-encoder (VAE), since

$${\rm KL}(Q\|P) = {\rm KL}(q_{\rm data}(x)\|p_{\theta}(x)) + {\rm KL}(q_{\phi}(z|x)\|p_{\theta}(z|x)),$$

where the second KL-divergence is averaged over $x \sim q_{\rm data}(x)$.

We may interpret VAE as alternating projection between $Q$ and $P$. See Figure 4 for illustration. If $q_{\phi}(z|x) = p_{\theta}(z|x)$, the algorithm reduces to the EM algorithm. The wake-sleep algorithm is similar to VAE, except that it updates $\phi$ by $\min_{\phi}{\rm KL}(P\|Q)$ instead of $\min_{\phi}{\rm KL}(Q\|P)$, so that the wake-sleep algorithm does not have a single objective function.

The VAE $\min_{\theta}\min_{\phi}{\rm KL}(Q\|P)$ defines a cooperative game, with the dynamics that $q_{\phi}$ and $p_{\theta}$ run toward each other.

2.2 Adversarial Learning

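Next, consider the learning of the energy-based model. The contrastive divergence (8) requires sampling from the current model $\pi_{\alpha_{t}}$. If we replace $\pi_{\alpha_{t}}$ by the generator $p_{\theta}$, we get

$$\min_{\alpha}\max_{\theta}\left[{\rm KL}(q_{\rm data}(x)\|\pi_{\alpha}(x)) - {\rm KL}(p_{\theta}(x)\|\pi_{\alpha}(x))\right], \tag{16}$$

or equivalently

$$\max_{\alpha}\min_{\theta}\left[{\rm KL}(p_{\theta}(x)\|\pi_{\alpha}(x)) - {\rm KL}(q_{\rm data}(x)\|\pi_{\alpha}(x))\right], \tag{17}$$

so that we avoid MCMC for sampling $\pi_{\alpha_{t}}(x)$, and the gradient for updating $\alpha$ becomes

$$\frac{\partial}{\partial\alpha}\left[{\rm E}_{q_{\rm data}}(f_{\alpha}(x)) - {\rm E}_{p_{\theta}}(f_{\alpha}(x))\right]. \tag{18}$$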

Because of the negative sign in front of the second KL-divergence in (16), we need $\max_{\theta}$ in (16) or $\min_{\theta}$ in (17), so that the learning becomes adversarial. See Figure 5 for illustration. Inspired by contrastive divergence, we call (16) the adversarial contrastive divergence (ACD). It underlies previous methods that jointly learn the generator model and the energy-based model.

The adversarial form (16) or (17) defines a chasing game with the following dynamics: the generator $p_{\theta}$ chases the energy-based model $\pi_{\alpha}$ in $\min_{\theta}{\rm KL}(p_{\theta}\|\pi_{\alpha})$, while the energy-based model $\pi_{\alpha}$ seeks to get closer to $q_{\rm data}$ and get away from $p_{\theta}$. The red arrow in Figure 5 illustrates this chasing game. The result is that $\pi_{\alpha}$ lures $p_{\theta}$ toward $q_{\rm data}$. In the idealized case where $p_{\theta}$ always catches up with $\pi_{\alpha}$, $\pi_{\alpha}$ will converge to the maximum likelihood estimate $\min_{\alpha}{\rm KL}(q_{\rm data}\|\pi_{\alpha})$, and $p_{\theta}$ converges to $\pi_{\alpha}$.
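As an illustration of the update of $\alpha$ in this chasing game, the sketch below estimates the gradient of ${\rm E}_{q_{\rm data}}[f_{\alpha}(x)] - {\rm E}_{p_{\theta}}[f_{\alpha}(x)]$ with sample averages. It assumes a hypothetical value function that is linear in $\alpha$, $f_{\alpha}(x) = \alpha_{1}x + \alpha_{2}x^{2}$, so that the derivative with respect to $\alpha$ is simply the feature vector $(x, x^{2})$; a deep network would use back-propagation instead.

```python
def features(x):
    """Derivative of the hypothetical f_alpha(x) = a1 * x + a2 * x**2 w.r.t. (a1, a2)."""
    return (x, x * x)

def mean_features(batch):
    n = len(batch)
    return [sum(features(x)[k] for x in batch) / n for k in range(2)]

def ebm_gradient(data_batch, generated_batch):
    """Sample-average estimate of d/d(alpha) [E_data f_alpha - E_generator f_alpha]."""
    d = mean_features(data_batch)
    g = mean_features(generated_batch)
    return [dk - gk for dk, gk in zip(d, g)]

# Observed examples versus examples synthesized by the generator (made-up numbers).
grad = ebm_gradient([1.0, 2.0, 3.0], [0.0, 1.0, 2.0])
grad_when_matched = ebm_gradient([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
```

Ascending this gradient raises $f_{\alpha}$ on observed data and lowers it on generated data; the gradient vanishes once the generated samples match the data in the statistics that $f_{\alpha}$ can measure.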

The above chasing game is different from VAE $\min_{\theta}\min_{\phi}{\rm KL}(Q\|P)$, which defines a cooperative game where $q_{\phi}$ and $p_{\theta}$ run toward each other.

Even though the above chasing game is adversarial, both models are running toward the data distribution. While the generator model runs after the energy-based model, the energy-based model runs toward the data distribution. As a consequence, the energy-based model guides or leads the generator model toward the data distribution. This is different from GAN. In GAN, the discriminator eventually becomes a confused one because the generated data become similar to the real data. In the above chasing game, the energy-based model becomes close to the data distribution.

The updating of $\alpha$ by (18) bears similarity to Wasserstein GAN (WGAN), but unlike WGAN, $f_{\alpha}$ defines a probability distribution $\pi_{\alpha}$, and the learning of $\theta$ is based on $\min_{\theta}{\rm KL}(p_{\theta}(x)\|\pi_{\alpha}(x))$, which is a variational approximation to $\pi_{\alpha}$. This variational approximation only requires knowing $f_{\alpha}(x)$, without knowing $Z(\alpha)$. However, unlike $q_{\phi}(z|x)$, $p_{\theta}(x)$ is still intractable; in particular, its entropy does not have a closed form. Thus, we can again use variational approximation, by changing the problem to $\min_{\theta}\min_{\phi}{\rm KL}(p(z)p_{\theta}(x|z)\|\pi_{\alpha}(x)q_{\phi}(z|x))$, i.e., $\min_{\theta}\min_{\phi}{\rm KL}(P\|\Pi)$, which is analytically tractable. In fact,

$${\rm KL}(P\|\Pi) = {\rm KL}(p_{\theta}(x)\|\pi_{\alpha}(x)) + {\rm KL}(p_{\theta}(z|x)\|q_{\phi}(z|x)),$$

where the second KL-divergence is averaged over $x \sim p_{\theta}(x)$.

Thus, we can modify (17) into $\max_{\alpha}\min_{\theta}\min_{\phi}[{\rm KL}(P\|\Pi) - {\rm KL}(Q\|\Pi)]$, because again ${\rm KL}(Q\|\Pi) = {\rm KL}(q_{\rm data}\|\pi_{\alpha})$.

Fitting the above together, we have the divergence triangle (14), which has a compact symmetric and anti-symmetric form.

3 Gap Between Two Models

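We can write the objective function ${\cal D}$ as

$$\begin{aligned}
{\cal D} &= {\rm KL}(Q\|P) + {\rm KL}(P\|\Pi) - {\rm KL}(Q\|\Pi) \\
&= \left[{\rm KL}(q_{\rm data}\|p_{\theta}) + {\rm KL}(p_{\theta}\|\pi_{\alpha}) - {\rm KL}(q_{\rm data}\|\pi_{\alpha})\right] \qquad (20) \\
&\quad + \left[{\rm KL}(q_{\phi}(z|x)\|p_{\theta}(z|x)) + {\rm KL}(p_{\theta}(z|x)\|q_{\phi}(z|x))\right], \qquad (21)
\end{aligned}$$

where the first conditional KL-divergence in (21) is averaged over $x \sim q_{\rm data}(x)$ and the second over $x \sim p_{\theta}(x)$.

Since ${\rm KL}(p_{\theta}\|\pi_{\alpha})$ and both terms in (21) are non-negative, ${\cal D} \geq {\rm KL}(q_{\rm data}\|p_{\theta}) - {\rm KL}(q_{\rm data}\|\pi_{\alpha}) = {\rm E}_{q_{\rm data}}[\log\pi_{\alpha}(x)] - {\rm E}_{q_{\rm data}}[\log p_{\theta}(x)]$. Thus ${\cal D}$ is an upper bound of the difference between the log-likelihood of the energy-based model and the log-likelihood of the generator model.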
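The decomposition of the divergence triangle into marginal and conditional KL-divergences, and the resulting bound, can be checked numerically. The sketch below uses a made-up discrete example (three values of $x$, two values of $z$); none of the numbers come from the paper.

```python
import math

def kl(p, q):
    """KL(p || q) for discrete distributions given as lists."""
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)

X, Z = range(3), range(2)
q_data = [0.5, 0.3, 0.2]
pi_alpha = [0.2, 0.5, 0.3]
q_phi = [[0.9, 0.1], [0.4, 0.6], [0.25, 0.75]]      # q_phi[x][z]
p_z = [0.6, 0.4]
p_x_given_z = [[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]]    # p_x_given_z[z][x]

# The three joint distributions on (z, x), flattened in a common order.
pairs = [(z, x) for z in Z for x in X]
Q = [q_data[x] * q_phi[x][z] for z, x in pairs]
P = [p_z[z] * p_x_given_z[z][x] for z, x in pairs]
Pi = [pi_alpha[x] * q_phi[x][z] for z, x in pairs]
triangle = kl(Q, P) + kl(P, Pi) - kl(Q, Pi)

# Generator marginal p_theta(x) and posterior p_theta(z|x).
p_x = [sum(p_z[z] * p_x_given_z[z][x] for z in Z) for x in X]
post = [[p_z[z] * p_x_given_z[z][x] / p_x[x] for z in Z] for x in X]

marginal_part = kl(q_data, p_x) + kl(p_x, pi_alpha) - kl(q_data, pi_alpha)
conditional_part = (sum(q_data[x] * kl(q_phi[x], post[x]) for x in X)
                    + sum(p_x[x] * kl(post[x], q_phi[x]) for x in X))
loglik_gap = kl(q_data, p_x) - kl(q_data, pi_alpha)
```

Here `triangle` should agree with `marginal_part + conditional_part`, and it should be no smaller than `loglik_gap`, the expected log-likelihood of the energy-based model minus that of the generator.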

4 Two Sides of KL-divergences

In the divergence triangle, the generator model appears on the right side of ${\rm KL}(Q\|P)$, and it also appears on the left side of ${\rm KL}(P\|\Pi)$. The former tends to interpolate or smooth the modes of $Q$, while the latter tends to seek the major modes of $\Pi$ while ignoring minor modes. As a result, the learned generator model tends to generate sharper images. As to the inference model $q_{\phi}(z|x)$, it appears on the left side of ${\rm KL}(Q\|P)$, and it also appears on the right side of ${\rm KL}(P\|\Pi)$. The former is variational learning of the real data, while the latter corresponds to the sleep phase of wake-sleep learning, which learns from the dream data generated by $P$. The inference model thus can infer $z$ from both observed $x$ and generated $x$.
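The different behavior of the two sides of the KL-divergence can be seen in a toy experiment: fit a single Gaussian to a bimodal target by grid search under each direction. The target (two modes at $\pm 3$), the grid, and the candidate family below are assumptions made purely for illustration.

```python
import math

def normalize(ws):
    s = sum(ws)
    return [w / s for w in ws]

def gaussian_on_grid(grid, mean, std):
    """Discretized Gaussian, normalized to sum to one on the grid."""
    return normalize([math.exp(-0.5 * ((x - mean) / std) ** 2) for x in grid])

def kl(p, q):
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)

grid = [-8.0 + 0.1 * i for i in range(161)]
left = gaussian_on_grid(grid, -3.0, 0.5)
right = gaussian_on_grid(grid, 3.0, 0.5)
target = [0.5 * a + 0.5 * b for a, b in zip(left, right)]

# Candidate (mean, std) pairs: means in [-4, 4], stds in [0.5, 3.5], step 0.5.
candidates = [(m / 2.0, s / 2.0) for m in range(-8, 9) for s in range(1, 8)]

# Model on the right side of KL: covers both modes with one broad bump.
forward_best = min(candidates, key=lambda c: kl(target, gaussian_on_grid(grid, *c)))
# Model on the left side of KL: locks onto a single mode.
reverse_best = min(candidates, key=lambda c: kl(gaussian_on_grid(grid, *c), target))
```

With the model on the right side, the best fit sits between the modes with a large spread; with the model on the left side, the best fit is narrow and centered on one mode, which is the mode-seeking behavior described above.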

(20) is the divergence triangle between the three marginal distributions on $x$, where $p_{\theta}$ appears on both sides of KL-divergences. (21) is the variational scheme to make the marginal distributions into the joint distributions, which are more tractable. In (21), the two KL-divergences have reverse orders.

5 Training Algorithm

The three models are each parameterized by convolutional neural networks. The joint learning under the divergence triangle can be implemented by stochastic gradient descent, where the expectations are replaced by the sample averages. Algorithm 1 describes the procedure which is illustrated in Figure 6.

Experiments

In this section, we demonstrate that the divergence triangle is capable of learning an energy-based model with a well-behaved energy landscape, a generator model with highly realistic samples, and an inference model with faithful reconstruction ability. We also show competitive performance on four tasks: image generation, test image reconstruction, energy landscape mapping, and learning from incomplete images. For image generation, we consider spatial stationary texture images, temporal stationary dynamic textures, and general object categories. We also test our model on large-scale datasets and high-resolution images.

The images are resized and scaled; no further pre-processing is needed. The network parameters are initialized with zero-mean Gaussian with standard deviation $0.02$ and optimized using Adam. Network weights are decayed with rate $0.0005$, and batch normalization is used. We refer to the Appendix for the model specifications.

In this experiment, we evaluate the visual quality of generator samples from our divergence triangle model. If the generator model is well-trained, then the obtained samples should be realistic and match the visual features and contents of training images.

For object categories, we test our model on two commonly-used datasets of natural images: CIFAR-10 and CelebA. For the CelebA face dataset, we randomly select 9,000 images for training and another 1,000 images for testing in the reconstruction task. The face images are resized to $64\times 64$ and CIFAR-10 images remain $32\times 32$. The qualitative results of generated samples for objects are shown in Figure 7. We further evaluate our model quantitatively using the Inception Score (IS) for CIFAR-10 and the Fréchet Inception Distance (FID) for CelebA faces. We generate 50,000 random samples for the computation of the Inception Score and 10,000 random samples for the computation of the FID score. Table I shows the IS and FID scores of our model compared with VAE, DCGAN, WGAN, CoopNet, CEGAN, ALI, and ALICE.

Note that for the Inception Score on CIFAR-10, we borrowed the scores from the relevant papers, and for the FID score on 9,000 CelebA faces, we re-implemented the baselines or used the available code with a network structure similar to that of our model. Our model achieves competitive performance compared to recent baseline models.

1.2 Large-scale Dataset

We also train our model on large-scale datasets, including a down-sampled $32\times 32$ version of ImageNet (roughly 1 million images) and the Large-scale Scene Understanding (LSUN) dataset. For the LSUN dataset, we consider the bedroom, tower, and church outdoor categories, which contain roughly 3 million, 0.7 million, and 0.1 million images, respectively, and are resized to $64\times 64$. The network structures are similar to the ones used in object generation, with twice the number of channels, and batch normalization is used in all three models. Generated samples are shown in Figure 8.

1.3 High-resolution Synthesis

In this section, we adopt a layer-wise training scheme to learn models on CelebA-HQ with resolutions of up to $1{,}024\times 1{,}024$ pixels. Layer-wise training dates back to initializing deep neural networks by Restricted Boltzmann Machines to overcome optimization hurdles and has been resurrected in progressive GANs, albeit with the order of layer transitions reversed such that top layers are trained first. This resembles a Laplacian pyramid in which images are generated in a coarse-to-fine fashion.

As in progressive GANs, the training starts with down-sampled images with a spatial resolution of $4\times 4$, while progressively increasing the size of the images and the number of layers. All three models are grown in synchrony, where $1\times 1$ convolutions project between RGB and features. In contrast to progressive GANs, we do not require mini-batch discrimination to increase the variation of $g_{\theta}(\cdot)$, nor a gradient penalty to preserve $1$-Lipschitz continuity of $f_{\alpha}(\cdot)$.

Figure 9 depicts high-fidelity synthesis at a resolution of $1{,}024\times 1{,}024$ pixels sampled from the generator model $g_{\theta}(z)$ on CelebA-HQ. Figure 10 illustrates linear interpolation in latent space (i.e., $(1-\alpha)\cdot z_{0}+\alpha\cdot z_{1}$), which indicates diversity in the samples.
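The latent interpolation can be sketched in a few lines of Python. The helper name `interpolate_latents` and the two-dimensional latent vectors are hypothetical; in the experiment each interpolated vector would be passed through the generator to produce an image. Note that the interpolation weight here is unrelated to the parameters $\alpha$ of the energy-based model.

```python
def interpolate_latents(z0, z1, n_steps):
    """Return n_steps latent vectors (1 - a) * z0 + a * z1 for a evenly spaced in [0, 1]."""
    path = []
    for k in range(n_steps):
        a = k / (n_steps - 1)
        path.append([(1.0 - a) * u + a * v for u, v in zip(z0, z1)])
    return path

# Toy two-dimensional latent vectors; real latent vectors would be sampled from p(z).
path = interpolate_latents([0.0, 2.0], [2.0, 4.0], n_steps=5)
```

The first and last entries of `path` reproduce the two endpoints, and the intermediate entries move between them in equal increments.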

Therefore, the joint learning in the triangle formulation is not only able to train the three models with stable optimization, but it also achieves synthesis with high fidelity.

1.4 Texture Synthesis

We consider texture images, which are spatially stationary and contain repetitive patterns. The texture images are resized to $224\times 224$. Separate models are trained on each image. We start from a latent factor of size $7\times 7\times 5$ and use five convolutional-transpose layers with kernel size $4$ and up-sampling factor $2$ for the generator network. The layers have $512$, $512$, $256$, $128$ and $3$ filters, respectively, and ReLU non-linearity is used between layers. The inference model has the inverse or “mirror” structure of the generator model, except that we use convolutional layers and ReLU with leak factor $0.2$. The energy-based model has three convolutional layers. The first two layers have kernel size $7$ with stride $2$ and $100$ and $70$ filters, respectively, and the last layer has $30$ filters with kernel size $5$ and stride $1$.
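As a sanity check on the generator architecture, the sketch below traces the spatial size through the five transposed-convolution layers. The kernel size, up-sampling factor, and filter counts are from the description above; the padding of $1$ is an assumption (it is the value under which kernel size $4$ and stride $2$ exactly double the resolution).

```python
def conv_transpose_out(size, kernel=4, stride=2, padding=1):
    """Output length of a transposed convolution along one spatial axis.

    padding=1 is an assumed value, not stated in the text.
    """
    return (size - 1) * stride - 2 * padding + kernel

filters = [512, 512, 256, 128, 3]
size, shapes = 7, []          # latent factor is 7 x 7 spatially
for n_filters in filters:
    size = conv_transpose_out(size)
    shapes.append((size, size, n_filters))
```

Under this assumption the spatial size doubles at each layer, going from $7$ to $224$, which matches the $224\times 224$ resolution of the texture images.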

Representative examples are shown in Figure 11. Three texture synthesis results are obtained by sampling different latent factors from the prior distribution $p(z)$. Notice that although we only have one texture image for training, the proposed divergence triangle model can effectively utilize the repetitive patterns, thus generating realistic texture images with different configurations.

1.5 Dynamic Texture Synthesis

Our model can also be used for dynamic patterns which exhibit stationary regularity in the temporal domain. The training video clips are selected from the Dyntex database and resized to $64$ pixels $\times$ $64$ pixels $\times$ $32$ frames. Inspired by recent work, we adopt spatial-temporal models for dynamic patterns that are stationary in the temporal domain but non-stationary in the spatial domain. Specifically, we start from $10$ latent factors of size $1\times 1\times 2$ for each video clip, and we adopt the same spatial-temporal convolutional transpose generator network as in that work, except that we use kernel size $5$ for the second layer. For the inference model, we use $5$ spatial-temporal convolutional layers. The first $4$ layers have kernel size $4$ with upsampling factor $2$, and the last layer is fully-connected in the spatial domain but convolutional in the temporal domain, yielding re-parametrized $\mu_{\phi}$ and $\sigma_{\phi}$ which have the same size as the latent factors. For the energy-based model, we use three spatial-temporal convolutional layers. The first two layers have kernel size $4$ with up-sample factor $2$ in all directions, while the last layer is fully-connected in the spatial domain but convolutional with kernel size $4$ and up-sample factor $2$ in the temporal domain. The layers have $64$, $128$ and $128$ filters, respectively. Some of the synthesis results are shown in Figure 12. Note that we sub-sampled $6$ frames of the training and generated video clips, and we only show them in the first batch for illustration.

2 Test Image Reconstruction

For CIFAR-10, we use its own 10,000 test images, while for CelebA, we use the hold-out 1,000 test images as stated above. The reconstruction quality is further measured by per-pixel mean square error (MSE). Table II shows the per-pixel MSE of our model compared to wake-sleep (WS), VAE, ALI, and ALICE.
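For reference, per-pixel MSE can be computed as in the following minimal sketch. Images are represented as flat lists of pixel values purely for illustration; the function name and the toy values are not from the paper.

```python
def per_pixel_mse(images, reconstructions):
    """Mean squared error averaged over every pixel of every image."""
    total, count = 0.0, 0
    for img, rec in zip(images, reconstructions):
        for a, b in zip(img, rec):
            total += (a - b) ** 2
            count += 1
    return total / count

# Two toy "images" with two pixels each, and their reconstructions.
mse = per_pixel_mse([[0.0, 1.0], [0.5, 0.5]], [[0.0, 0.5], [0.5, 1.0]])
```

In the reconstruction task, each test image $x$ would be encoded by the inference model and decoded by the generator, and the result compared against $x$ in this way.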

Note, we do not consider methods without inference models on training data, including variants of GANs and cooperative training, since it is infeasible to test such models using image reconstruction.

3 Energy Landscape Mapping

In the following, we evaluate the learned energy-based model by mapping the macroscopic structure of the energy landscape. When following a MLE regime by minimizing KL(qdataπα){\rm KL}(q_{\rm data}\|\pi_{\alpha}), we expect the energy-function fα(x)-f_{\alpha}(x) to encode xqdata(x){x\sim q_{\rm data}(x)} as local energy minima. Moreover, fα(x)-f_{\alpha}(x) should form minima for unseen images and macroscopic landscape structure in which basins of minima are distinctly separated by energy barriers. Hopfield observed that such landscape is a model of associative memory .

To verify that (i) local minima of $-f_{\alpha}(x)$ resemble $\{x_{i}\}$ and (ii) minima are separated by significant energy barriers, we shall follow the approach used in prior work. When clustering with respect to energetic barriers, the landscape is partitioned into Hopfield basins of attraction, whereby each point $\{x_{i}\}$ on the landscape $-f_{\alpha}(x)$ is mapped onto a local minimum $\{\hat{x}_{i}\}$ by a steepest-descent path $x_{i}^{t+1}=x_{i}^{t}+\eta\nabla f_{\alpha}(x_{i}^{t})$. The similarity measure used for hierarchical clustering is the barrier energy that separates any two regions. Given a pair of local minima $\{\hat{x}_{i},\hat{x}_{j}\}$, we estimate the barrier $b_{i,j}=\max\{-f_{\alpha}(x_{k}):x_{k}\in\hat{x}_{i}\overset{\gamma}{\rightharpoondown}\hat{x}_{j}\}$ as the highest energy along a linear interpolation $x\overset{\gamma}{\rightharpoondown}y=\{x+\gamma(y-x):\gamma\in[0,1]\}$. If $b_{i,j}<\epsilon$ for some energy threshold $\epsilon$, then $\{x_{i},x_{j}\}$ belong to the same basin. The clustering is repeated recursively until all minima are clustered together. Such graphs have come to be referred to as disconnectivity graphs (DG).
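The procedure can be sketched on a toy problem. The code below is only an illustration under stated assumptions: the double-well `energy` stands in for the learned $-f_{\alpha}(x)$, and the step size, number of interpolation points, and threshold are hypothetical choices. It shows steepest descent to a local minimum, the barrier estimate along a linear interpolation with $\gamma\in[0,1]$, and one level of the barrier-based grouping.

```python
def energy(x):
    """Toy energy U(x) = -f(x): a double well in the first coordinate."""
    return (x[0] ** 2 - 1.0) ** 2 + sum(v ** 2 for v in x[1:])


def energy_grad(x):
    """Gradient of the toy energy with respect to x."""
    return [4.0 * x[0] * (x[0] ** 2 - 1.0)] + [2.0 * v for v in x[1:]]


def descend(x, step=0.01, n_steps=5000):
    """Steepest descent on the energy: x <- x - step * grad U(x) = x + step * grad f(x)."""
    x = list(x)
    for _ in range(n_steps):
        g = energy_grad(x)
        x = [xi - step * gi for xi, gi in zip(x, g)]
    return x


def barrier(a, b, n_points=101):
    """Highest energy on the straight line a + gamma * (b - a), gamma in [0, 1]."""
    best = float("-inf")
    for k in range(n_points):
        gamma = k / (n_points - 1)
        point = [ai + gamma * (bi - ai) for ai, bi in zip(a, b)]
        best = max(best, energy(point))
    return best


def group_basins(minima, threshold):
    """Put two minima in the same group when their barrier is below threshold."""
    parent = list(range(len(minima)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i in range(len(minima)):
        for j in range(i + 1, len(minima)):
            if barrier(minima[i], minima[j]) < threshold:
                parent[find(i)] = find(j)
    return [find(i) for i in range(len(minima))]


# Toy usage: four starting points, two on each side of the barrier at x[0] = 0.
starts = [[-1.5, 0.3], [-0.4, -0.2], [0.6, 0.1], [1.4, -0.5]]
minima = [descend(s) for s in starts]
labels = group_basins(minima, threshold=0.5)
```

Repeating `group_basins` with an increasing threshold and recording when groups merge gives the hierarchy that a disconnectivity graph displays. Note that a straight-line path only gives an upper estimate of the true barrier.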

We conduct energy landscape mapping experiments on the MNIST and Fashion-MNIST datasets, each containing 70,000 grayscale images of size $28\times 28$ pixels depicting handwritten digits and fashion products from 10 categories, respectively. The energy landscape mapping is not without limitations, because it is practically impossible to locate all local modes. The local modes located by our algorithm (see Figure 14 for the MNIST dataset) suggest that the learned energy function is well-formed: it not only encodes meaningful images as minima, but also forms meaningful macroscopic structure. Moreover, within basins the local minima have a high degree of purity (i.e., digits within a basin belong to the same class), and the energy barriers between basins seem informative (i.e., basins of ones and sixes form pure super-basins). Figure 15 depicts the energy landscape mapping on Fashion-MNIST.

4 Learning from incomplete images

The divergence triangle can be used to learn from occluded images. This task is challenging because only parts of the images are observed, so the model needs to learn sufficient information to recover the occluded parts. Generative models with an inferential mechanism can be used for this task. Notably, alternating back-propagation (ABP) has been proposed to recover incomplete images; it has an MCMC-based inference step to refine the latent factors and perform reconstruction iteratively. VAEs build the inference model on occluded images and can also be adapted for this task: they proceed by filling the missing parts with the average pixel intensity in the beginning, then iteratively re-updating the missing parts using reconstructed values. Unlike VAEs, which only consider the un-occluded parts of the training data, our model utilizes the generated samples, which become gradually recovered during training, resulting in improved recovery accuracy and sharp generation. Note that learning from incomplete data can be difficult for variants of GANs and cooperative training, since inference cannot be performed directly on the occluded images.
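The fill-then-re-update loop described above can be sketched as follows. This is a simplified illustration, not the training procedure itself: `reconstruct` is a hypothetical stand-in for passing an image through the inference model and the generator, the model is held fixed, and images are flat lists of pixel values.

```python
def recover(image, observed, reconstruct, n_iters=10, fill_value=0.0):
    """Iteratively fill occluded pixels with reconstructed values.

    image:       flat list of pixel values (occluded entries are ignored)
    observed:    flat list of bools, True where the pixel is visible
    reconstruct: hypothetical callable mapping an image to its reconstruction
    """
    # Initial fill of the occluded pixels (e.g. a constant or an average).
    current = [p if keep else fill_value for p, keep in zip(image, observed)]
    for _ in range(n_iters):
        recon = reconstruct(current)
        # Visible pixels are always kept; only occluded ones are overwritten.
        current = [p if keep else r
                   for p, keep, r in zip(image, observed, recon)]
    return current


def mean_model(x):
    """Toy 'model' that reconstructs every pixel as the image mean."""
    m = sum(x) / len(x)
    return [m] * len(x)


# Toy usage: the third pixel is occluded (its stored value is never read).
image = [0.2, 0.4, 0.0, 0.6]
observed = [True, True, False, True]
result = recover(image, observed, mean_model)
```

With the toy model the occluded value converges to the mean of the visible pixels, which shows how repeated reconstruction can move the filled-in values away from their initial guess.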

We evaluate our model on 10,000 images randomly chosen from the CelebA dataset. The selected images are then center-cropped as in prior work. Similar to VAEs, we zero-fill the occluded parts in the beginning, then iteratively update the missing values using reconstructed images obtained from the generator model. Three types of occlusions are used: (1) salt and pepper noise which randomly covers 50% (P.5) and 70% (P.7) of the image; (2) multiple block occlusion which has 10 random blocks of size $10\times 10$ (MB10); (3) single block occlusion where we randomly place a large $20\times 20$ or $30\times 30$ block on each image, denoted by B20 and B30 respectively. Table III shows the recovery errors using VAE, ABP and our triangle model, where the error is defined as the per-pixel absolute difference (relative to the range of pixel values) between the recovered image on the occluded pixels and the ground truth image.
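The occlusion patterns and the error measure can be sketched as below. Several details are assumptions made for illustration only: a single-channel $64\times 64$ image, an exact count of occluded pixels for the salt and pepper case, blocks that are allowed to overlap, and a pixel range of 1. The function names are hypothetical.

```python
import random


def salt_and_pepper_mask(height, width, fraction, rng):
    """Mask with True = occluded; occludes round(fraction * H * W) random pixels."""
    n = height * width
    hidden = set(rng.sample(range(n), round(fraction * n)))
    return [[(r * width + c) in hidden for c in range(width)]
            for r in range(height)]


def block_mask(height, width, block, n_blocks, rng):
    """Mask with n_blocks randomly placed square blocks (blocks may overlap)."""
    mask = [[False] * width for _ in range(height)]
    for _ in range(n_blocks):
        top = rng.randrange(height - block + 1)
        left = rng.randrange(width - block + 1)
        for r in range(top, top + block):
            for c in range(left, left + block):
                mask[r][c] = True
    return mask


def recovery_error(recovered, truth, mask, value_range=1.0):
    """Mean absolute difference on occluded pixels, relative to the pixel range."""
    diffs = [abs(recovered[r][c] - truth[r][c])
             for r in range(len(mask))
             for c in range(len(mask[0])) if mask[r][c]]
    return sum(diffs) / (len(diffs) * value_range)


# Toy usage on a 64x64 grid with a fixed seed.
rng = random.Random(0)
p5 = salt_and_pepper_mask(64, 64, 0.5, rng)   # P.5
b20 = block_mask(64, 64, 20, 1, rng)          # B20
mb10 = block_mask(64, 64, 10, 10, rng)        # MB10
truth = [[0.0] * 64 for _ in range(64)]
recovered = [[0.5] * 64 for _ in range(64)]
err = recovery_error(recovered, truth, b20)
```

Only pixels under the mask enter the error, so a method is not rewarded for copying the visible pixels.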

It can be seen that our model consistently outperforms the VAE model for different occlusion patterns. For structured occlusions (i.e., multiple and single blocks), the un-occluded parts contain more meaningful configurations that will improve learning of the generator through the energy-based model, which will, in turn, generate more meaningful samples to refine our inference model. This is supported by the superior results compared to ABP. For unstructured occlusions (i.e., salt and pepper noise), however, ABP achieves improved recovery; a possible reason is that the un-occluded parts contain less meaningful patterns, which offer limited help for learning the generator and inference model. Our model synthesizes sharper and more realistic images from the generator on occluded images. See Figure 17, in which images are occluded with $30\times 30$ random blocks.

Conclusion

We proposed a probabilistic framework, namely the divergence triangle, for joint learning of the energy-based model, the generator model, and the inference model. The divergence triangle forms a compact learning functional for the three models and naturally unifies aspects of maximum likelihood estimation, variational auto-encoder, adversarial learning, contrastive divergence, and the wake-sleep algorithm.

An extensive set of experiments demonstrated learning of a well-behaved energy-based model, a realistic generator model, and an accurate inference model. Moreover, experiments showed that the proposed divergence triangle framework can be effective in learning directly from incomplete data.

In future work, we aim to extend the formulation to learn interpretable generator and energy-based models with multiple layers of sparse or semantically meaningful latent variables or features. Further, it would be desirable to unify the generator, energy-based and inference models into a single model by allowing them to share parameters and nodes instead of having separate sets of parameters and nodes.

Acknowledgments

The work is supported by DARPA XAI project N66001-17-2-4029; ARO project W911NF1810296; ONR MURI project N00014-16-1-2007; and Extreme Science and Engineering Discovery Environment (XSEDE) grant ASC170063. We thank Dr. Tianfu Wu, Shuai Zhu and Bo Pang for helpful discussions.

Model Architecture

We describe the basic network structures, in particular for object generation. We use the following notation:

conv(n): convolutional operation with $n$ output feature maps.

convT(n): convolutional transpose operation with $n$ output feature maps.

LReLU: Leaky-ReLU nonlinearity with default leaky factor 0.2.

The structures for CelebA (where 9,000 random images are chosen) are shown in Table IV. The structures for CIFAR-10 and MNIST/Fashion-MNIST are shown in Table V and Table VI, respectively.
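To make the notation concrete, the sketch below gives the Leaky-ReLU with leaky factor 0.2 and the usual output-size arithmetic for strided convolution and transposed convolution (no dilation, no output padding). The kernel size 4, stride 2 and padding 1 used in the example are illustrative assumptions and are not read from Tables IV to VI.

```python
def lrelu(x, leak=0.2):
    """Leaky-ReLU: identity for x >= 0, slope `leak` for x < 0."""
    return x if x >= 0 else leak * x


def conv_out(size, kernel, stride, padding):
    """Spatial output size of a strided convolution along one dimension."""
    return (size + 2 * padding - kernel) // stride + 1


def convT_out(size, kernel, stride, padding):
    """Spatial output size of a transposed convolution along one dimension."""
    return (size - 1) * stride - 2 * padding + kernel


# Toy usage: with kernel 4, stride 2, padding 1, conv halves the
# resolution and convT doubles it.
down = conv_out(64, kernel=4, stride=2, padding=1)
up = convT_out(32, kernel=4, stride=2, padding=1)
activated = [lrelu(v) for v in (-1.0, 0.0, 3.0)]
```

Under these assumed settings a stack of conv layers maps an image down toward a $1\times 1$ feature map, and a matching stack of convT layers maps a latent vector back up to image resolution.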

References