Symmetric Variational Autoencoder and Connections to Adversarial Learning

Liqun Chen, Shuyang Dai, Yunchen Pu, Chunyuan Li, Qinliang Su, Lawrence Carin

Introduction

Generative models that are descriptive of data have been widely employed in statistics and machine learning. Factor models (FMs) represent one commonly used generative model (Tipping and Bishop, 1999), and mixtures of FMs have been employed to account for more-general data distributions (Ghahramani and Hinton, 1997). These models typically have latent variables (e.g., factor scores) that are inferred given observed data; the latent variables are often used for a down-stream goal, such as classification (Carvalho et al., 2008). After training, such models are useful for inference tasks given subsequent observed data. However, when one draws from such models, by drawing latent variables from the prior and pushing them through the model to synthesize data, the synthetic data typically do not appear to be realistic. This suggests that while these models may be useful for analyzing observed data in terms of inferred latent variables, they are also capable of describing a large set of data that do not appear to be real.

The generative adversarial network (GAN) (Goodfellow et al., 2014) represents a significant recent advance toward development of generative models that are capable of synthesizing realistic data. Such models also employ latent variables, drawn from a simple distribution analogous to the aforementioned prior, and these random variables are fed through a (deep) neural network. The neural network acts as a functional transformation of the original random variables, yielding a model capable of representing sophisticated distributions. Adversarial learning discourages the network from yielding synthetic data that are unrealistic, from the perspective of a learned neural-network-based classifier. However, GANs are notoriously difficult to train, and multiple generalizations and techniques have been developed to improve learning performance (Salimans et al., 2016), for example Wasserstein GAN (WGAN) (Arjovsky and Bottou, 2017; Arjovsky et al., 2017) and energy-based GAN (EB-GAN) (Zhao et al., 2017).

While the original GAN and variants were capable of synthesizing highly realistic data (e.g., images), the models lacked the ability to infer the latent variables given observed data. This limitation has been mitigated recently by methods like adversarial learned inference (ALI) (Dumoulin et al., 2017), and related approaches. However, ALI appears to be inadequate from the standpoint of inference, in that, given observed data and associated inferred latent variables, the subsequently synthesized data often do not look particularly close to the original data.

The variational autoencoder (VAE) (Kingma and Welling, 2014) is a class of generative models that precedes GAN. VAE learning is based on optimizing a variational lower bound, connected to inferring an approximate posterior distribution on latent variables; such learning is typically not performed in an adversarial manner. VAEs have been demonstrated to be effective models for inferring latent variables, in that the reconstructed data do typically look like the original data, albeit in a blurry manner (Dumoulin et al., 2017). The form of the VAE has been generalized recently, in terms of the adversarial variational Bayesian (AVB) framework (Mescheder et al., 2016). This model yields general forms of encoders and decoders, but it is based on the original variational Bayesian (VB) formulation. The original VB framework yields a lower bound on the log likelihood of the observed data, and therefore model learning is connected to maximum-likelihood (ML) approaches. From the perspective of designing generative models, it has been recognized recently that ML-based learning has limitations (Arjovsky and Bottou, 2017): such learning tends to yield models that match observed data, but also have a high probability of generating unrealistic synthetic data.

The original VAE employs the Kullback-Leibler (KL) divergence to constitute the variational lower bound. As is well known, the KL divergence is asymmetric (and is therefore not a true distance metric). We demonstrate that this asymmetry encourages design of decoders (generators) that often yield unrealistic synthetic data when the latent variables are drawn from the prior. From a different but related perspective, the encoder infers latent variables (across all training data) that only encompass a subset of the prior. As demonstrated below, these limitations of the encoder and decoder within conventional VAE learning are intertwined.

We consequently propose a new symmetric VAE (sVAE), based on a symmetric form of the KL divergence and associated variational bound. The proposed sVAE is learned using an approach related to that employed in the AVB (Mescheder et al., 2016), but in a new manner connected to the symmetric variational bound. Analysis of the sVAE demonstrates that it has close connections to ALI (Dumoulin et al., 2017), WGAN (Arjovsky et al., 2017) and to the original GAN (Goodfellow et al., 2014) framework; in fact, ALI is recovered exactly, as a special case of the proposed sVAE. This provides a new and explicit linkage between the VAE (after it is made symmetric) and a wide class of adversarially trained generative models. Additionally, with this insight, we are able to ameliorate many of the aforementioned limitations of ALI, from the perspective of data reconstruction. In addition to analyzing properties of the sVAE, we demonstrate excellent performance on an extensive set of experiments.

Review of Variational Autoencoder

It is typically intractable to evaluate $p_{\boldsymbol{\theta}}(\boldsymbol{x})$ directly, as $\int p_{\boldsymbol{\theta}}(\boldsymbol{x}|\boldsymbol{z})p(\boldsymbol{z})\,d\boldsymbol{z}$ generally does not have a closed form. Consequently, a typical approach is to consider a model $q_{\boldsymbol{\phi}}(\boldsymbol{z}|\boldsymbol{x})$ for the posterior of the latent code $\boldsymbol{z}$ given observed $\boldsymbol{x}$, characterized by parameters $\boldsymbol{\phi}$. Distribution $q_{\boldsymbol{\phi}}(\boldsymbol{z}|\boldsymbol{x})$ is often termed an encoder, and $p_{\boldsymbol{\theta}}(\boldsymbol{x}|\boldsymbol{z})$ is a decoder (Kingma and Welling, 2014); both are here stochastic, in contrast to their deterministic counterparts associated with a traditional autoencoder (Vincent et al., 2010). Consider the variational expression

In practice the expectation wrt $\boldsymbol{x}\sim q(\boldsymbol{x})$ is evaluated via sampling, assuming $N$ observed samples $\{\boldsymbol{x}_{n}\}_{n=1}^{N}$. One typically must also utilize sampling from $q_{\boldsymbol{\phi}}(\boldsymbol{z}|\boldsymbol{x})$ to evaluate the corresponding expectation in (1). Learning is effected as $(\hat{\boldsymbol{\theta}},\hat{\boldsymbol{\phi}})=\mathrm{argmax}_{\boldsymbol{\theta},\boldsymbol{\phi}}\,\mathcal{L}_{x}(\boldsymbol{\theta},\boldsymbol{\phi})$, and a model so learned is termed a variational autoencoder (VAE) (Kingma and Welling, 2014).
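The sampling-based evaluation described above can be illustrated on a model small enough that $\log p_{\boldsymbol{\theta}}(\boldsymbol{x})$ is known in closed form. The following is a minimal sketch under stated assumptions, not the setup used in this paper: a scalar model with $p(z)=\mathcal{N}(0,1)$ and $p(x|z)=\mathcal{N}(z,1)$, so that $p(x)=\mathcal{N}(0,2)$ and the exact posterior is $\mathcal{N}(x/2,1/2)$. The function names `log_normal` and `elbo_estimate` are hypothetical.

```python
import numpy as np

def log_normal(v, mean, var):
    """Log density of a univariate Gaussian N(mean, var) evaluated at v."""
    return -0.5 * (np.log(2.0 * np.pi * var) + (v - mean) ** 2 / var)

def elbo_estimate(x, enc_mean, enc_var, num_samples, rng):
    """Monte Carlo estimate of E_q[log p(x|z) + log p(z) - log q(z|x)].

    Assumed toy model: p(z) = N(0, 1), p(x|z) = N(z, 1),
    and a Gaussian encoder q(z|x) = N(enc_mean, enc_var).
    """
    z = enc_mean + np.sqrt(enc_var) * rng.standard_normal(num_samples)
    log_joint = log_normal(x, z, 1.0) + log_normal(z, 0.0, 1.0)
    log_q = log_normal(z, enc_mean, enc_var)
    return float(np.mean(log_joint - log_q))

rng = np.random.default_rng(0)
x_obs = 1.0
log_px = float(log_normal(x_obs, 0.0, 2.0))  # exact marginal for this toy model

# Encoder equal to the exact posterior N(x/2, 1/2): the bound is tight.
elbo_exact = elbo_estimate(x_obs, x_obs / 2.0, 0.5, 20000, rng)
# Mismatched encoder N(0, 1): the estimate falls below log p(x).
elbo_loose = elbo_estimate(x_obs, 0.0, 1.0, 20000, rng)
```

When the encoder matches the true posterior, every sample contributes exactly $\log p(x)$, so the estimate has no Monte Carlo noise; any mismatch opens a gap equal to the KL divergence between encoder and posterior.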

Here, $q_{\boldsymbol{\phi}}(\boldsymbol{z})=\int q(\boldsymbol{x})q_{\boldsymbol{\phi}}(\boldsymbol{z}|\boldsymbol{x})\,d\boldsymbol{x}$. To maximize $\mathcal{L}_{x}(\boldsymbol{\theta},\boldsymbol{\phi})$, we seek minimization of $\mathrm{KL}(q_{\boldsymbol{\phi}}(\boldsymbol{x},\boldsymbol{z})\|p_{\boldsymbol{\theta}}(\boldsymbol{x},\boldsymbol{z}))$. Hence, from (3) the goal is to align $p_{\boldsymbol{\theta}}(\boldsymbol{x})$ with $q(\boldsymbol{x})$, while from (4) the goal is to align $q_{\boldsymbol{\phi}}(\boldsymbol{z})$ with $p(\boldsymbol{z})$. The other terms seek to match the respective conditional distributions. All of these conditions are implied by minimizing $\mathrm{KL}(q_{\boldsymbol{\phi}}(\boldsymbol{x},\boldsymbol{z})\|p_{\boldsymbol{\theta}}(\boldsymbol{x},\boldsymbol{z}))$. However, the KL divergence is asymmetric, which yields limitations wrt the learned model.
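The two ways of splitting the joint KL divergence that this paragraph relies on follow from the chain rule for KL divergences. The block below is a reconstruction from standard identities rather than a copy of the paper's numbered equations; it assumes the conditionals $q_{\boldsymbol{\phi}}(\boldsymbol{x}|\boldsymbol{z})=q(\boldsymbol{x})q_{\boldsymbol{\phi}}(\boldsymbol{z}|\boldsymbol{x})/q_{\boldsymbol{\phi}}(\boldsymbol{z})$ and $p_{\boldsymbol{\theta}}(\boldsymbol{z}|\boldsymbol{x})=p(\boldsymbol{z})p_{\boldsymbol{\theta}}(\boldsymbol{x}|\boldsymbol{z})/p_{\boldsymbol{\theta}}(\boldsymbol{x})$.

```latex
\begin{align*}
\mathrm{KL}\big(q_{\boldsymbol{\phi}}(\boldsymbol{x},\boldsymbol{z})\,\|\,p_{\boldsymbol{\theta}}(\boldsymbol{x},\boldsymbol{z})\big)
 &= \mathrm{KL}\big(q(\boldsymbol{x})\,\|\,p_{\boldsymbol{\theta}}(\boldsymbol{x})\big)
  + \mathbb{E}_{q(\boldsymbol{x})}\,\mathrm{KL}\big(q_{\boldsymbol{\phi}}(\boldsymbol{z}|\boldsymbol{x})\,\|\,p_{\boldsymbol{\theta}}(\boldsymbol{z}|\boldsymbol{x})\big) \\
 &= \mathrm{KL}\big(q_{\boldsymbol{\phi}}(\boldsymbol{z})\,\|\,p(\boldsymbol{z})\big)
  + \mathbb{E}_{q_{\boldsymbol{\phi}}(\boldsymbol{z})}\,\mathrm{KL}\big(q_{\boldsymbol{\phi}}(\boldsymbol{x}|\boldsymbol{z})\,\|\,p_{\boldsymbol{\theta}}(\boldsymbol{x}|\boldsymbol{z})\big)
\end{align*}
```

Each term is non-negative, so driving the joint divergence to zero forces both marginal matches and both conditional matches at once.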

2.2 Limitations of the VAE

Summarizing these conditions, the goal of maximizing $-\mathrm{KL}(q(\boldsymbol{x})\|p_{\boldsymbol{\theta}}(\boldsymbol{x}))$ encourages $\mathcal{S}_{q(\boldsymbol{x})}\subset\mathcal{S}_{p_{\boldsymbol{\theta}}(\boldsymbol{x})}$. This implies that $p_{\boldsymbol{\theta}}(\boldsymbol{x})$ can synthesize all $\boldsymbol{x}$ that may be drawn from $q(\boldsymbol{x})$, but additionally there is (often) high probability that $p_{\boldsymbol{\theta}}(\boldsymbol{x})$ will synthesize $\boldsymbol{x}$ that will not be drawn from $q(\boldsymbol{x})$.

Hence, the goals of large $-\mathrm{KL}(q(\boldsymbol{x})\|p_{\boldsymbol{\theta}}(\boldsymbol{x}))$ and large $-\mathrm{KL}(q_{\boldsymbol{\phi}}(\boldsymbol{z})\|p(\boldsymbol{z}))$ say the same thing from different perspectives: (i) seeking large $-\mathrm{KL}(q(\boldsymbol{x})\|p_{\boldsymbol{\theta}}(\boldsymbol{x}))$ implies that there is a high probability that $\boldsymbol{x}$ drawn from $p_{\boldsymbol{\theta}}(\boldsymbol{x})$ will be different from those drawn from $q(\boldsymbol{x})$, and (ii) large $-\mathrm{KL}(q_{\boldsymbol{\phi}}(\boldsymbol{z})\|p(\boldsymbol{z}))$ implies that $\boldsymbol{z}$ drawn from $p(\boldsymbol{z})$ are likely to be different from those drawn from $q_{\boldsymbol{\phi}}(\boldsymbol{z})$, with $\boldsymbol{z}\in\{\mathcal{S}_{p(\boldsymbol{z})}\cap\mathcal{S}_{q_{\boldsymbol{\phi}}(\boldsymbol{z})_{-}}\}$ responsible for the $\boldsymbol{x}$ that are inconsistent with $q(\boldsymbol{x})$. These properties are summarized in Fig. 1.
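The support argument can be checked on a small discrete example. This is an illustrative sketch with made-up distributions (`q_data` and `p_model` are hypothetical): the data distribution occupies two of four cells, while the model spreads mass over all four, so the model covers the data but also generates outcomes the data never produce.

```python
import numpy as np

def kl_divergence(a, b):
    """KL(a || b) for discrete distributions; infinite if b misses mass of a."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    support = a > 0  # cells where a has no mass contribute zero
    if np.any(b[support] == 0):
        return float("inf")
    return float(np.sum(a[support] * np.log(a[support] / b[support])))

q_data = np.array([0.5, 0.5, 0.0, 0.0])       # assumed "data" distribution
p_model = np.array([0.25, 0.25, 0.25, 0.25])  # assumed over-dispersed model

forward_kl = kl_divergence(q_data, p_model)  # the direction used by the VAE bound
reverse_kl = kl_divergence(p_model, q_data)  # the direction it ignores
```

The forward divergence stays finite (it equals $\log 2$ here) even though half of the model's mass lies outside the data support, whereas the reverse divergence is infinite. Minimizing only the forward direction therefore does little to discourage unrealistic samples.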

Refined VAE: Imposition of Symmetry

where $C_{z}=-h(p(\boldsymbol{z}))$. Using logic analogous to that applied to $\mathcal{L}_{x}$, maximization of $\mathcal{L}_{z}$ encourages distribution supports reflected in Fig. 2.

Defining $\mathcal{L}_{xz}(\boldsymbol{\theta},\boldsymbol{\phi})=\mathcal{L}_{x}(\boldsymbol{\theta},\boldsymbol{\phi})+\mathcal{L}_{z}(\boldsymbol{\theta},\boldsymbol{\phi})$, we have

where $K=C_{x}+C_{z}$, and the symmetric KL divergence is $\mathrm{KL}_{s}(q_{\boldsymbol{\phi}}(\boldsymbol{x},\boldsymbol{z})\|p_{\boldsymbol{\theta}}(\boldsymbol{x},\boldsymbol{z}))\triangleq\mathrm{KL}(q_{\boldsymbol{\phi}}(\boldsymbol{x},\boldsymbol{z})\|p_{\boldsymbol{\theta}}(\boldsymbol{x},\boldsymbol{z}))+\mathrm{KL}(p_{\boldsymbol{\theta}}(\boldsymbol{x},\boldsymbol{z})\|q_{\boldsymbol{\phi}}(\boldsymbol{x},\boldsymbol{z}))$. Maximization of $\mathcal{L}_{xz}(\boldsymbol{\theta},\boldsymbol{\phi})$ seeks minimizing $\mathrm{KL}_{s}(q_{\boldsymbol{\phi}}(\boldsymbol{x},\boldsymbol{z})\|p_{\boldsymbol{\theta}}(\boldsymbol{x},\boldsymbol{z}))$, which simultaneously imposes the conditions summarized in Figs. 1 and 2.
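As a small numerical illustration of the symmetric divergence defined above, the sketch below evaluates both directions and their sum for two discrete distributions. The distributions are invented for the example and are strictly positive so that both directions are finite; `kl_divergence` and `symmetric_kl` are hypothetical helper names.

```python
import numpy as np

def kl_divergence(a, b):
    """KL(a || b) for strictly positive discrete distributions."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.sum(a * np.log(a / b)))

def symmetric_kl(a, b):
    """KL_s(a || b) = KL(a || b) + KL(b || a)."""
    return kl_divergence(a, b) + kl_divergence(b, a)

q = np.array([0.49, 0.49, 0.01, 0.01])        # assumed peaked distribution
p_model = np.array([0.25, 0.25, 0.25, 0.25])  # assumed over-dispersed model

forward = kl_divergence(q, p_model)
reverse = kl_divergence(p_model, q)
symmetric = symmetric_kl(q, p_model)
```

In this example the reverse term is the larger of the two, so the symmetric objective penalizes the model's excess mass that the forward term alone largely tolerates.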

3.2 Adversarial solution

Assuming fixed $(\boldsymbol{\theta},\boldsymbol{\phi})$, and using logic analogous to Proposition 1 in (Mescheder et al., 2016), we consider

where $\sigma(\zeta)=1/(1+\exp(-\zeta))$. The scalar function $f_{\boldsymbol{\psi}}(\boldsymbol{x},\boldsymbol{z})$ is represented by a deep neural network with parameters $\boldsymbol{\psi}$, and network inputs $(\boldsymbol{x},\boldsymbol{z})$. For fixed $(\boldsymbol{\theta},\boldsymbol{\phi})$, the parameters $\boldsymbol{\psi}^{*}$ that maximize $g(\boldsymbol{\psi})$ yield

Hence, to optimize $\mathcal{L}_{xz}(\boldsymbol{\theta},\boldsymbol{\phi})$ we consider the cost function

The expectations in (10) and (14) are approximated by averaging over samples, and therefore to implement this solution we need only be able to sample from $p_{\boldsymbol{\theta}}(\boldsymbol{x}|\boldsymbol{z})$ and $q_{\boldsymbol{\phi}}(\boldsymbol{z}|\boldsymbol{x})$, and we do not require explicit forms for these distributions. For example, a draw from $q_{\boldsymbol{\phi}}(\boldsymbol{z}|\boldsymbol{x})$ may be constituted as $\boldsymbol{z}=h_{\boldsymbol{\phi}}(\boldsymbol{x},\boldsymbol{\epsilon})$, where $h_{\boldsymbol{\phi}}(\boldsymbol{x},\boldsymbol{\epsilon})$ is implemented as a neural network with parameters $\boldsymbol{\phi}$ and $\boldsymbol{\epsilon}\sim\mathcal{N}(\boldsymbol{0},\mathbf{I})$.
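A draw of the form $\boldsymbol{z}=h_{\boldsymbol{\phi}}(\boldsymbol{x},\boldsymbol{\epsilon})$ can be sketched as follows. This is only an illustration of sampling without an explicit density: the one-hidden-layer network, its sizes, and the names `init_encoder_params` and `sample_z` are assumptions made for the example, not the architecture used in the experiments.

```python
import numpy as np

def init_encoder_params(x_dim, noise_dim, hidden_dim, z_dim, rng):
    """Random weights for a hypothetical one-hidden-layer encoder network."""
    in_dim = x_dim + noise_dim
    return {
        "W1": rng.standard_normal((in_dim, hidden_dim)) / np.sqrt(in_dim),
        "b1": np.zeros(hidden_dim),
        "W2": rng.standard_normal((hidden_dim, z_dim)) / np.sqrt(hidden_dim),
        "b2": np.zeros(z_dim),
    }

def sample_z(params, x, rng, noise_dim):
    """Draw z = h_phi(x, eps) with eps ~ N(0, I); no density is evaluated."""
    eps = rng.standard_normal((x.shape[0], noise_dim))
    inputs = np.concatenate([x, eps], axis=1)  # noise enters as an extra input
    hidden = np.tanh(inputs @ params["W1"] + params["b1"])
    return hidden @ params["W2"] + params["b2"]

rng = np.random.default_rng(0)
params = init_encoder_params(x_dim=2, noise_dim=3, hidden_dim=16, z_dim=1, rng=rng)
x_batch = np.tile(np.array([[0.5, -1.0]]), (4, 1))  # the same x repeated four times
z_batch = sample_z(params, x_batch, rng, noise_dim=3)
```

Because the noise is an input to the network, repeated calls with the same $\boldsymbol{x}$ give different $\boldsymbol{z}$, which is all that the sample-based expectations require.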

3.3 Interpretation in terms of LRT statistic

In (10) a classifier is designed to distinguish between samples $(\boldsymbol{x},\boldsymbol{z})$ drawn from $p_{\boldsymbol{\theta}}(\boldsymbol{x},\boldsymbol{z})=p(\boldsymbol{z})p_{\boldsymbol{\theta}}(\boldsymbol{x}|\boldsymbol{z})$ and from $q_{\boldsymbol{\phi}}(\boldsymbol{x},\boldsymbol{z})=q(\boldsymbol{x})q_{\boldsymbol{\phi}}(\boldsymbol{z}|\boldsymbol{x})$. Implicit in that expression is that there is equal probability that either of these distributions is selected for drawing $(\boldsymbol{x},\boldsymbol{z})$, i.e., that $(\boldsymbol{x},\boldsymbol{z})\sim[p_{\boldsymbol{\theta}}(\boldsymbol{x},\boldsymbol{z})+q_{\boldsymbol{\phi}}(\boldsymbol{x},\boldsymbol{z})]/2$. Under this assumption, given observed $(\boldsymbol{x},\boldsymbol{z})$, the probability of it being drawn from $p_{\boldsymbol{\theta}}(\boldsymbol{x},\boldsymbol{z})$ is $p_{\boldsymbol{\theta}}(\boldsymbol{x},\boldsymbol{z})/(p_{\boldsymbol{\theta}}(\boldsymbol{x},\boldsymbol{z})+q_{\boldsymbol{\phi}}(\boldsymbol{x},\boldsymbol{z}))$, and the probability of it being drawn from $q_{\boldsymbol{\phi}}(\boldsymbol{x},\boldsymbol{z})$ is $q_{\boldsymbol{\phi}}(\boldsymbol{x},\boldsymbol{z})/(p_{\boldsymbol{\theta}}(\boldsymbol{x},\boldsymbol{z})+q_{\boldsymbol{\phi}}(\boldsymbol{x},\boldsymbol{z}))$ (Goodfellow et al., 2014). Since the denominator $p_{\boldsymbol{\theta}}(\boldsymbol{x},\boldsymbol{z})+q_{\boldsymbol{\phi}}(\boldsymbol{x},\boldsymbol{z})$ is shared by these probabilities, and assuming the function $p_{\boldsymbol{\theta}}(\boldsymbol{x},\boldsymbol{z})/q_{\boldsymbol{\phi}}(\boldsymbol{x},\boldsymbol{z})$ is known, an observed $(\boldsymbol{x},\boldsymbol{z})$ is inferred as being drawn from the underlying distributions as

This is the well-known likelihood ratio test (LRT) (Trees, 2001), and is reflected by (11). We have therefore derived a learning procedure based on the log-LRT, as reflected in (14). The solution is “adversarial,” in the sense that when optimizing $(\boldsymbol{\theta},\boldsymbol{\phi})$ the objective in (14) seeks to “fool” the LRT test statistic, while for fixed $(\boldsymbol{\theta},\boldsymbol{\phi})$ maximization of (10) wrt $\boldsymbol{\psi}$ corresponds to updating the LRT. This adversarial solution comes as a natural consequence of symmetrizing the traditional VAE learning procedure.
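The link between the learned classifier and the log-likelihood ratio can be verified numerically on a discrete space, where the classifier reduces to one logit per cell. The sketch below assumes the sign convention in which $\sigma(f)$ is the probability that a sample came from $p$, consistent with the probabilities stated above; the distributions and the names `fit_logits` and `lrt_decide` are hypothetical.

```python
import numpy as np

def sigmoid(f):
    return 1.0 / (1.0 + np.exp(-f))

def fit_logits(p, q, lr=1.0, num_steps=3000):
    """Gradient ascent on sum_i p_i log sigmoid(f_i) + q_i log(1 - sigmoid(f_i))."""
    f = np.zeros_like(p)
    for _ in range(num_steps):
        # Derivative of the objective with respect to each logit f_i.
        grad = p * (1.0 - sigmoid(f)) - q * sigmoid(f)
        f = f + lr * grad
    return f

def lrt_decide(f_value):
    """Equal priors: choose p when the log ratio is positive, otherwise q."""
    return "p" if f_value > 0.0 else "q"

p = np.array([0.1, 0.2, 0.3, 0.4])  # assumed stand-in for p_theta(x, z)
q = np.array([0.4, 0.3, 0.2, 0.1])  # assumed stand-in for q_phi(x, z)

f_star = fit_logits(p, q)                     # converges to log(p / q)
decisions = [lrt_decide(v) for v in f_star]
```

At the optimum each logit equals $\log(p/q)$ for its cell, so thresholding the logit at zero is the equal-prior likelihood ratio test.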

Connections to Prior Work

The adversarially learned inference (ALI) (Dumoulin et al., 2017) framework seeks to learn both an encoder and decoder, like the approach proposed above, and is based on optimizing

Note that $\log\sigma(\cdot)$ is a monotonically increasing function, and therefore we may replace (14) with

and note $\sigma(-f_{\boldsymbol{\psi}^{*}}(\boldsymbol{x},\boldsymbol{z};\boldsymbol{\theta},\boldsymbol{\phi}))=1-\sigma(f_{\boldsymbol{\psi}^{*}}(\boldsymbol{x},\boldsymbol{z};\boldsymbol{\theta},\boldsymbol{\phi}))$. Maximizing (19) wrt $(\boldsymbol{\theta},\boldsymbol{\phi})$ with fixed $\boldsymbol{\psi}^{*}$ corresponds to the minimization wrt $(\boldsymbol{\theta},\boldsymbol{\phi})$ reflected in (18). Hence, the proposed approach is exactly ALI, if in (14) we replace $\pm f_{\boldsymbol{\psi}^{*}}$ with $\log\sigma(\pm f_{\boldsymbol{\psi}^{*}})$.

4.2 Original GAN

The proposed approach assumed both a decoder $p_{\boldsymbol{\theta}}(\boldsymbol{x}|\boldsymbol{z})$ and an encoder $q_{\boldsymbol{\phi}}(\boldsymbol{z}|\boldsymbol{x})$, and we considered the symmetric $\mathrm{KL}_{s}(q_{\boldsymbol{\phi}}(\boldsymbol{x},\boldsymbol{z})\|p_{\boldsymbol{\theta}}(\boldsymbol{x},\boldsymbol{z}))$. We now simplify the model for the case in which we only have a decoder, and the synthesized data are drawn $\boldsymbol{x}\sim p_{\boldsymbol{\theta}}(\boldsymbol{x}|\boldsymbol{z})$ with $\boldsymbol{z}\sim p(\boldsymbol{z})$, and we wish to learn $\boldsymbol{\theta}$ such that data synthesized in this manner match observed data $\boldsymbol{x}\sim q(\boldsymbol{x})$. Consider the symmetric

We consider a simplified form of (10), specifically

Recall that $\log\sigma(\cdot)$ is a monotonically increasing function, and therefore we may replace (23) with

4.3 Wasserstein GAN

The Wasserstein GAN (WGAN) (Arjovsky et al., 2017) setup is represented as

To our knowledge, the current paper is the first to consider symmetric variational learning, introducing $\mathcal{L}_{z}$, from which we have made explicit connections to previously developed adversarial-learning methods. Previous efforts have been made to match $q_{\boldsymbol{\phi}}(\boldsymbol{z})$ to $p(\boldsymbol{z})$, which is a consequence of the proposed symmetric VAE (sVAE). For example, (Makhzani et al., 2016) introduced a modification to the original VAE formulation, but it loses connection to the variational lower bound (Mescheder et al., 2016).

4.4 Amelioration of vanishing gradients

As discussed in (Arjovsky et al., 2017), a key distinction between the WGAN framework in (25) and the original GAN (Goodfellow et al., 2014) is that the latter uses a binary discriminator to distinguish real and synthesized data; the $f_{\boldsymbol{\psi}}(\boldsymbol{x})$ in WGAN is a 1-Lipschitz function, rather than an explicit discriminator. A challenge with GAN is that as the discriminator gets better at distinguishing real and synthetic data, the gradients used to update the generator vanish, and learning is undermined. The WGAN was designed to ameliorate this problem (Arjovsky et al., 2017).

From the discussion in Section 4.1, we note that the key distinction between the proposed sVAE and ALI is that the latter uses a binary discriminator to distinguish $(\boldsymbol{x},\boldsymbol{z})$ manifested via the generator from $(\boldsymbol{x},\boldsymbol{z})$ manifested via the encoder. By contrast, the sVAE uses a log-LRT, rather than a binary classifier, with it inferred in an adversarial manner. ALI is therefore undermined by vanishing gradients as the binary discriminator gets better, with this avoided by sVAE. The sVAE brings the same intuition associated with WGAN (addressing vanishing gradients) to a generalized VAE framework, with a generator (decoder) and an encoder; WGAN only considers a generator. Further, as discussed in Section 4.3, unlike WGAN, which requires weight clipping or other forms of regularization to approximate 1-Lipschitz functions, in the proposed sVAE the $f_{\boldsymbol{\psi}}(\boldsymbol{x},\boldsymbol{z})$ arises naturally from the symmetrized VAE and we do not require imposition of Lipschitz conditions. As discussed in Section 6, this simplification has yielded robustness in implementation.
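The difference between passing $f$ through $\log\sigma(\cdot)$ and using $f$ directly can be seen from the derivatives alone. The sketch below is a simplified scalar view, not the full training objective: it compares $\frac{d}{df}\log\sigma(f)=\sigma(-f)$ with $\frac{d}{df}f=1$ for increasingly confident outputs. The helper names are hypothetical.

```python
import numpy as np

def sigmoid(f):
    return 1.0 / (1.0 + np.exp(-f))

def grad_log_sigmoid(f):
    """d/df log sigmoid(f) = 1 - sigmoid(f) = sigmoid(-f)."""
    return sigmoid(-f)

def grad_identity(f):
    """d/df f = 1: the raw output passes gradients through unchanged."""
    return np.ones_like(f)

# Larger values correspond to a more confident classifier output.
f_values = np.array([0.0, 5.0, 10.0, 20.0])
saturating = grad_log_sigmoid(f_values)  # shrinks toward zero as f grows
linear = grad_identity(f_values)         # stays at one
```

Any gradient that flows back through the $\log\sigma$ term is scaled by the saturating factor, which is the mechanism behind the vanishing-gradient behaviour discussed above; using the pre-sigmoid output avoids that scaling.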

Model Augmentation

To encourage $q_{\boldsymbol{\phi}}(\boldsymbol{z}|\boldsymbol{x})$ that are more peaked in the space of $\boldsymbol{z}$ for individual $\boldsymbol{x}$, and also to consider more peaked $p_{\boldsymbol{\theta}}(\boldsymbol{x}|\boldsymbol{z})$, we may augment the variational expressions as

where $\lambda\geq 0$. For $\lambda=0$ the original variational expressions are retained, and for $\lambda>0$, $q_{\boldsymbol{\phi}}(\boldsymbol{z}|\boldsymbol{x})$ and $p_{\boldsymbol{\theta}}(\boldsymbol{x}|\boldsymbol{z})$ are allowed to diverge more from $p(\boldsymbol{z})$ and $q(\boldsymbol{x})$, respectively, while placing more emphasis on the data-fit terms. Defining $\mathcal{L}_{xz}^{\prime}=\mathcal{L}_{x}^{\prime}+\mathcal{L}_{z}^{\prime}$, we have

Model learning is the same as discussed in Sec. 3.2, with the modification

A disadvantage of this approach is that it requires explicit forms for $p_{\boldsymbol{\theta}}(\boldsymbol{x}|\boldsymbol{z})$ and $q_{\boldsymbol{\phi}}(\boldsymbol{z}|\boldsymbol{x})$, while the setup in Sec. 3.2 only requires the ability to sample from these distributions.

We can now make a connection to additional related work, particularly (Pu et al., 2017), which considered a similar setup to (26) and (27), for the special case of $\lambda=1$. While (Pu et al., 2017) had a similar idea of using a symmetrized VAE, they did not provide the theoretical justification presented in Section 3. Further, and more importantly, the way in which learning was performed in (Pu et al., 2017) is distinct from that applied here, in that (Pu et al., 2017) required an additional adversarial learning step, increasing implementation complexity. Consequently, (Pu et al., 2017) did not use adversarial learning to approximate the log-LRT, and therefore it cannot make the explicit connections to ALI and WGAN that were made in Sections 4.1 and 4.3, respectively.

Experiments

All parameters are initialized with Xavier initialization (Glorot and Bengio, 2010) and optimized using Adam (Kingma and Ba, 2015) with a learning rate of 0.0001. No dataset-specific tuning or regularization, other than dropout (Srivastava et al., 2014), is performed. The architectures for the encoder, decoder and discriminator are detailed in the Appendix. All experiments were performed on a single NVIDIA TITAN X GPU.
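For readers unfamiliar with the two named ingredients, the following is a minimal NumPy sketch of Xavier initialization and a single Adam update with the stated learning rate of 0.0001. Several details are assumptions, since the text does not specify them: the uniform variant of Xavier initialization, the Adam defaults $\beta_1=0.9$, $\beta_2=0.999$, $\epsilon=10^{-8}$, the layer size, and the random stand-in gradient.

```python
import numpy as np

def xavier_uniform(fan_in, fan_out, rng):
    """Glorot/Xavier uniform initialization (uniform variant assumed here)."""
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))

def adam_step(param, grad, state, lr=1e-4, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; beta1, beta2 and eps are assumed default values."""
    state["t"] += 1
    state["m"] = beta1 * state["m"] + (1.0 - beta1) * grad
    state["v"] = beta2 * state["v"] + (1.0 - beta2) * grad ** 2
    m_hat = state["m"] / (1.0 - beta1 ** state["t"])  # bias-corrected mean
    v_hat = state["v"] / (1.0 - beta2 ** state["t"])  # bias-corrected variance
    return param - lr * m_hat / (np.sqrt(v_hat) + eps)

rng = np.random.default_rng(0)
weights = xavier_uniform(64, 32, rng)             # hypothetical 64 -> 32 layer
state = {"t": 0, "m": np.zeros_like(weights), "v": np.zeros_like(weights)}
grad = rng.standard_normal(weights.shape)         # stand-in gradient for illustration
updated = adam_step(weights, grad, state)
```

On the first step the bias-corrected ratio is close to the sign of the gradient, so each weight moves by roughly the learning rate regardless of the gradient's scale.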

In order to show the robustness and stability of our model, we test sVAE and sVAE-r on a toy dataset designed in the same manner as the one in ALICE (Li et al., 2017). In this dataset, the true distribution of data $\boldsymbol{x}$ is a two-dimensional Gaussian mixture model with five components. The latent code $\boldsymbol{z}$ follows a standard Gaussian distribution $\mathcal{N}(0,1)$. To perform the test, we consider using different values of $\lambda$ for both sVAE-r and ALICE. For each $\lambda$, 576 experiments with different choices of architecture and hyper-parameters are conducted. In all experiments, we use mean squared error (MSE) and inception score (IS) to evaluate the performance of the two models. Figure 3 shows the histogram results for each model. As we can see, both ALICE and sVAE-r are able to reconstruct the data when $\lambda=0.1$, while sVAE-r provides a better overall inception score.
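A dataset of the kind described, a two-dimensional Gaussian mixture with five components paired with a standard Gaussian latent code, can be generated as sketched below. The component means (placed on a circle), the component standard deviation, the equal mixing weights, and the latent dimension are all assumptions made for illustration; the text does not state them.

```python
import numpy as np

def sample_gmm(num_samples, rng, radius=2.0, std=0.1):
    """Draw from a 2-D mixture of five Gaussians (means and std are assumed)."""
    angles = 2.0 * np.pi * np.arange(5) / 5.0
    means = radius * np.stack([np.cos(angles), np.sin(angles)], axis=1)  # (5, 2)
    labels = rng.integers(0, 5, size=num_samples)  # equal mixing weights assumed
    x = means[labels] + std * rng.standard_normal((num_samples, 2))
    return x, labels

def sample_prior(num_samples, rng, z_dim=1):
    """Draw latent codes from a standard Gaussian (z_dim is an assumption)."""
    return rng.standard_normal((num_samples, z_dim))

rng = np.random.default_rng(0)
x_toy, component = sample_gmm(1000, rng)
z_toy = sample_prior(1000, rng)
```

Keeping the component label alongside each sample makes it straightforward to check afterwards whether a trained model covers all five modes.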

6.2 MNIST

The results of image generation and reconstruction for sVAE, as applied to the MNIST dataset, are shown in Figure 4. By adding the regularization term, sVAE overcomes the limitation of image reconstruction in ALI. The log-likelihood of sVAE shown in Table 1 is calculated using the annealed importance sampling method on the binarized MNIST dataset, as proposed in (Wu et al., 2016). Note that in order to compare the model performance on binarized data, the output of the decoder is modeled as a Bernoulli distribution instead of the Gaussian used in the original paper. Our model achieves -79.26 nats, outperforming normalizing flow (-85.1 nats) while also being competitive with the state-of-the-art result (-79.2 nats). In addition, sVAE is able to provide compelling generated images, outperforming GAN (Goodfellow et al., 2014) and WGAN-GP (Gulrajani et al., 2017) based on the inception scores.

6.3 CelebA

We evaluate sVAE on the CelebA dataset and compare the results with ALI. In experiments we note that for high-dimensional data like CelebA, ALICE (Li et al., 2017) shows a trade-off between reconstruction and generation, while sVAE-r does not have this issue. If the regularization term is not included in ALI, the reconstructed images do not match the original images. On the other hand, when the regularization term is added, ALI is capable of reconstructing images but the generated images are flawed. In comparison, sVAE-r does well in both generation and reconstruction with different values of $\lambda$. The results for both sVAE and ALI are shown in Figures 5 and 6.

Generally speaking, adding the augmentation term as shown in (28) should encourage more peaked $q_{\boldsymbol{\phi}}(\boldsymbol{z}|\boldsymbol{x})$ and $p_{\boldsymbol{\theta}}(\boldsymbol{x}|\boldsymbol{z})$. Nevertheless, ALICE fails in the inference process and performs more like an autoencoder. This is due to the fact that the discriminator becomes too sensitive to the regularization term. On the other hand, by using the symmetric KL (14) as the cost function, we are able to alleviate this issue, which makes sVAE-r a more stable model than ALICE. This is because sVAE updates the generator using the discriminator output before the sigmoid is applied; the sigmoid is a non-linear transformation of the discriminator output scale.

6.4 CIFAR-10

The trade-off of ALICE (Li et al., 2017) mentioned in Sec. 6.3 is also manifested in the results for the CIFAR-10 dataset. In Figure 7, we show quantitative results in terms of inception score and mean squared error of sVAE-r and ALICE with different values of $\lambda$. As can be seen, both models are able to reconstruct images when $\lambda$ increases. However, when $\lambda$ is larger than $10^{-3}$, we observe a decrease in the inception score of ALICE, in which the model fails to generate images.

The CIFAR-10 dataset is also used to evaluate the generation ability of our model. The quantitative results, i.e., the inception scores, are listed in Table 2. Our model shows improved performance on image generation compared to ALI and DCGAN. Note that sVAE also achieves results comparable to those of WGAN-GP (Gulrajani et al., 2017). This can be interpreted using the similarity between (23) and (25), as summarized in Sec. 4. The generated images are shown in Figure 8. More results are in the Appendix.

Conclusions

We present the symmetric variational autoencoder (sVAE), a novel framework which can match the joint distribution of data and latent code using the symmetric Kullback-Leibler divergence. The experimental results show the advantages of sVAE: it not only overcomes the missing-mode problem (Hu et al., 2017), but is also very stable to train. With excellent performance in image generation and reconstruction, we will apply sVAE to semi-supervised learning tasks and conditional generation tasks in future work. Moreover, because the latent code $\boldsymbol{z}$ can be treated as data from a different domain, e.g., images (Zhu et al., 2017; Kim et al., 2017) or text (Gan et al., 2017), we can also apply sVAE to domain transfer tasks.

References

Appendix A Model Architectures

Appendix B More Result

B.2 CelebA result