ALICE: Towards Understanding Adversarial Learning for Joint Distribution Matching
Chunyuan Li, Hao Liu, Changyou Chen, Yunchen Pu, Liqun Chen, Ricardo Henao, Lawrence Carin
Introduction
Deep directed generative models are a powerful framework for modeling complex data distributions. Generative Adversarial Networks (GANs) can implicitly learn the data generating distribution; more specifically, GAN can learn to sample from it. In order to do this, GAN trains a generator to mimic real samples, by learning a mapping from a latent space (where the samples are easily drawn) to the data space. Concurrently, a discriminator is trained to distinguish between generated and real samples. The key idea behind GAN is that if the discriminator finds it difficult to distinguish real from artificial samples, then the generator is likely to be a good approximation to the true data distribution.
In its standard form, GAN only yields a one-way mapping, i.e., it lacks an inverse mapping mechanism (from data to latent space), preventing GAN from being able to do inference. The ability to compute a posterior distribution of the latent variable conditioned on a given observation may be important for data interpretation and for downstream applications (e.g., classification from the latent variable). Efforts have been made to simultaneously learn an efficient bidirectional model that can produce high-quality samples for both the latent and data spaces. Among them, the recently proposed Adversarially Learned Inference (ALI) casts the learning of such a bidirectional model in a GAN-like adversarial framework. Specifically, a discriminator is trained to distinguish between two joint distributions: that of the real data sample and its inferred latent code, and that of the real latent code and its generated data sample.
While ALI is an inspiring and elegant approach, it tends to produce reconstructions that are not necessarily faithful reproductions of the inputs. This is because ALI only seeks to match two joint distributions, but the dependency structure (correlation) between the two random variables (conditionals) within each joint is not specified or constrained. In practice, this results in solutions that satisfy ALI’s objective and that are able to produce real-looking samples, but have difficulties reconstructing observed data. ALI also has difficulty discovering the correct pairing relationship in domain transformation tasks.
In this paper, we first describe the non-identifiability issue of ALI. To solve this problem, we propose to regularize ALI using the framework of Conditional Entropy (CE), hence we call the proposed approach ALICE. Adversarial learning schemes are proposed to estimate the conditional entropy, for both unsupervised and supervised learning paradigms. We provide a unified view for a family of recently proposed GAN models from the perspective of joint distribution matching, including ALI, CycleGAN and Conditional GAN. Extensive experiments on synthetic and real data demonstrate that ALICE is significantly more stable to train than ALI, in that it consistently yields more viable solutions (good generation and good reconstruction), without being too sensitive to perturbations of the model architecture, i.e., hyperparameters. We also show that ALICE results in more faithful image reconstructions. Further, our framework can leverage paired data (when available) for semi-supervised tasks. This is empirically demonstrated on the discovery of relationships for cross domain tasks based on image data.
Background
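Consider two general marginal distributions $q(x)$ and $p(z)$ over $x \in \mathcal{X}$ and $z \in \mathcal{Z}$. One domain can be inferred based on the other using conditional distributions, $p_\theta(x|z)$ and $q_\phi(z|x)$. Further, the combined structure of both domains is characterized by joint distributions $p_\theta(x,z) = p_\theta(x|z)\,p(z)$ and $q_\phi(x,z) = q_\phi(z|x)\,q(x)$.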
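The goal of GAN is to match the marginal $p_\theta(x) = \int p_\theta(x|z)\,p(z)\,dz$ to $q(x)$. Note that $q(x)$ denotes the true distribution of the data (from which we have samples) and $p(z)$ is specified as a simple parametric distribution, e.g., isotropic Gaussian. In order to do the matching, GAN trains an $\omega$-parameterized adversarial discriminator network, $f_\omega(x)$, to distinguish between samples from $p_\theta(x)$ and $q(x)$. Formally, the minimax objective of GAN is given by the following expression:
$$\min_{\theta} \max_{\omega} \mathcal{L}_{\mathrm{GAN}}(\theta, \omega) = \mathbb{E}_{x \sim q(x)}\big[\log \sigma(f_\omega(x))\big] + \mathbb{E}_{\tilde{x} \sim p_\theta(x|z),\, z \sim p(z)}\big[\log\big(1 - \sigma(f_\omega(\tilde{x}))\big)\big], \qquad (2)$$
where $\sigma(\cdot)$ is the sigmoid function. The following lemma characterizes the solutions of (2) in terms of marginals $p_\theta(x)$ and $q(x)$.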
Lemma 1. The optimal decoder and discriminator, parameterized by $\{\theta^*, \omega^*\}$, correspond to a saddle point of the objective in (2), if and only if $p_{\theta^*}(x) = q(x)$.
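Alternatively, ALI matches the joint distributions $p_\theta(x,z)$ and $q_\phi(x,z)$, using an adversarial discriminator network similar to (2), $f_\omega(x,z)$, parameterized by $\omega$. The minimax objective of ALI can be then written as
$$\min_{\theta,\phi} \max_{\omega} \mathcal{L}_{\mathrm{ALI}}(\theta, \phi, \omega) = \mathbb{E}_{x \sim q(x),\, \tilde{z} \sim q_\phi(z|x)}\big[\log \sigma(f_\omega(x, \tilde{z}))\big] + \mathbb{E}_{\tilde{x} \sim p_\theta(x|z),\, z \sim p(z)}\big[\log\big(1 - \sigma(f_\omega(\tilde{x}, z))\big)\big]. \qquad (3)$$
Lemma 2. The optimum of the two generators and the discriminator with parameters $\{\theta^*, \phi^*, \omega^*\}$ form a saddle point of the objective in (3), if and only if $p_{\theta^*}(x,z) = q_{\phi^*}(x,z)$.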
From Lemma 2, if a solution of (3) is achieved, it is guaranteed that all marginals and conditional distributions of the pair $\{x, z\}$ match. Note that this implies that $q_\phi(z|x)$ and $p_\theta(z|x)$ match; however, (3) imposes no restrictions on these two conditionals. This is key for the identifiability issues of ALI described below.
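The adversarial objectives in (2) and (3) share one form: an expectation of the log-sigmoid of the critic output on samples from one distribution, plus an expectation of the log of one minus the sigmoid on samples from the other. The NumPy sketch below estimates that quantity from critic logits. The function names and the toy logit values are assumptions made for illustration and are not taken from the paper's code.

```python
import numpy as np

def log_sigmoid(a):
    # Numerically stable log(sigmoid(a)) = -log(1 + exp(-a)).
    return -np.logaddexp(0.0, -a)

def adversarial_objective(logits_q, logits_p):
    """Monte Carlo estimate of E_q[log s(f)] + E_p[log(1 - s(f))].

    logits_q: critic outputs on pairs drawn from the encoder joint
    logits_p: critic outputs on pairs drawn from the decoder joint
    """
    logits_q = np.asarray(logits_q, dtype=float)
    logits_p = np.asarray(logits_p, dtype=float)
    # log(1 - sigmoid(a)) equals log(sigmoid(-a)).
    return log_sigmoid(logits_q).mean() + log_sigmoid(-logits_p).mean()

# A critic that cannot tell the two joints apart outputs logit 0 everywhere.
chance_value = adversarial_objective(np.zeros(8), np.zeros(8))
# A critic that separates them scores encoder pairs high and decoder pairs low.
confident_value = adversarial_objective(np.full(8, 5.0), np.full(8, -5.0))
```

When the critic is at chance the estimate equals $-\log 4$, the value the discriminator is driven to once the two distributions match. Any critic that separates the two sample sets scores higher, which is what the inner maximization exploits.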
Adversarial Learning with Information Measures
The relationship (mapping) between random variables $x$ and $z$ is not specified or constrained by ALI. As a result, it is possible that the matched distribution is undesirable for a given application.
Many applications require meaningful mappings. Consider two scenarios:
A1: In unsupervised learning, one desirable property is cycle-consistency, meaning that the inferred $z$ of a corresponding $x$ can reconstruct $x$ itself with high probability. In Figure 1 this corresponds to the two configurations shown in Figures 1(b) and 1(c).
A2: In supervised learning, the pre-specified correspondence between samples imposes restrictions on the mapping between $x$ and $z$, e.g., in image tagging, $x$ are images and $z$ are tags. In this case, paired samples from the desired joint distribution are usually available, thus we can leverage this supervised information to resolve the ambiguity between Figure 1(b) and (c).
From our simple example in Figure 1, we see that in order to alleviate the identifiability issues associated with the solutions to the ALI objective, we have to impose constraints on the conditionals $q_\phi(z|x)$ and $p_\theta(x|z)$. Furthermore, to fully mitigate the identifiability issues we require supervision, i.e., paired samples from domains $\mathcal{X}$ and $\mathcal{Z}$.
To deal with the problem of undesirable but matched joint distributions, below we propose to use an information-theoretic measure to regularize ALI. This is done by controlling the “uncertainty” between pairs of random variables, i.e., $x$ and $z$, using conditional entropies.
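Conditional Entropy (CE) is an information-theoretic measure that quantifies the uncertainty of random variable $x$ when conditioned on $z$ (or the other way around), under joint distribution $\pi(x,z)$:
$$H^{\pi}(x|z) \triangleq -\mathbb{E}_{\pi(x,z)}\big[\log \pi(x|z)\big], \quad \text{and} \quad H^{\pi}(z|x) \triangleq -\mathbb{E}_{\pi(x,z)}\big[\log \pi(z|x)\big]. \qquad (4)$$
The uncertainty of $x$ given $z$ is linked with $H^{\pi}(x|z)$; in fact, $H^{\pi}(x|z) = 0$ if and only if $x$ is a deterministic mapping of $z$. Intuitively, by controlling the uncertainty of $q_\phi(z|x)$ and $p_\theta(x|z)$, we can restrict the solutions of the ALI objective to joint distributions whose mappings result in better reconstruction ability. Therefore, we propose to use the CE in (4), denoted as $\mathcal{L}^{\pi}_{\mathrm{CE}}(\theta,\phi) = H^{\pi}(x|z)$ or $H^{\pi}(z|x)$ (depending on the task; see below), as a regularization term in our framework, termed ALI with Conditional Entropy (ALICE), and defined as the following minimax objective:
$$\min_{\theta,\phi} \max_{\omega} \mathcal{L}_{\mathrm{ALICE}}(\theta, \phi, \omega) = \mathcal{L}_{\mathrm{ALI}}(\theta, \phi, \omega) + \mathcal{L}^{\pi}_{\mathrm{CE}}(\theta, \phi). \qquad (5)$$
$\mathcal{L}^{\pi}_{\mathrm{CE}}(\theta,\phi)$ is dependent on the underlying distributions for the random variables, parametrized by $(\theta, \phi)$, as made clearer below. Ideally, we could select the desirable solutions of (5) by evaluating their CE, once all the saddle points of the ALI objective have been identified. However, in practice, $\mathcal{L}^{\pi}_{\mathrm{CE}}(\theta,\phi)$ is intractable because we do not have access to the saddle points beforehand. Below, we propose to approximate the CE in (5) during training for both unsupervised and supervised tasks. Since $x$ and $z$ are symmetric in terms of CE according to (4), we use $x$ to derive our theoretical results. Similar arguments hold for $z$, as discussed in the Supplementary Material (SM).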
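To make the role of conditional entropy concrete, the sketch below computes the CE of one variable given the other for two small discrete joint tables. Both tables have identical uniform marginals, so a method that only looked at marginals could not tell them apart, yet their conditional entropies differ. The tables and the function name are illustrative assumptions; the paper itself works with continuous variables.

```python
import numpy as np

def conditional_entropy(joint):
    """H(x|z) in nats for a discrete joint table with joint[i, j] = pi(x=i, z=j)."""
    joint = np.asarray(joint, dtype=float)
    marginal_z = joint.sum(axis=0, keepdims=True)          # pi(z), shape (1, num_z)
    # pi(x|z) = pi(x, z) / pi(z); columns with zero mass are left at zero.
    cond = np.divide(joint, marginal_z, out=np.zeros_like(joint), where=marginal_z > 0)
    mask = joint > 0                                       # treat 0 * log 0 as 0
    return float(-(joint[mask] * np.log(cond[mask])).sum())

# x is a deterministic function of z: each column has a single nonzero entry.
deterministic = np.array([[0.5, 0.0],
                          [0.0, 0.5]])
# x is independent of z: knowing z says nothing about x.
independent = np.full((2, 2), 0.25)

h_det = conditional_entropy(deterministic)   # zero uncertainty
h_ind = conditional_entropy(independent)     # log 2 nats
```

The deterministic table attains zero conditional entropy, while the independent table attains the maximum possible for a binary variable. Penalizing CE therefore favors the first kind of solution.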
2 Unsupervised Learning
In the absence of explicit probability distributions needed for computing the CE, we can bound the CE using the criterion of cycle-consistency. We denote the reconstruction of $x$ as $\hat{x}$, via generating procedure (cycle) $\hat{x} \sim p_\theta(\hat{x}|z)$, $z \sim q_\phi(z|x)$. We desire that $p_\theta(x|z)$ have high likelihood for $\hat{x} = x$, for the $x \sim q(x)$ that begins the cycle $x \to z \to \hat{x}$, and hence that $\hat{x}$ be similar to the original $x$. Lemma 3 below shows that cycle-consistency is an upper bound of the conditional entropy in (4).
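Lemma 3. For joint distributions $p_\theta(x,z)$ or $q_\phi(x,z)$, we have
$$H^{q_\phi}(x|z) \triangleq -\mathbb{E}_{q_\phi(x,z)}\big[\log q_\phi(x|z)\big] = -\mathbb{E}_{q_\phi(x,z)}\big[\log p_\theta(x|z)\big] - \mathbb{E}_{q_\phi(z)}\big[\mathrm{KL}\big(q_\phi(x|z)\,\|\,p_\theta(x|z)\big)\big] \le -\mathbb{E}_{q_\phi(x,z)}\big[\log p_\theta(x|z)\big] \triangleq \mathcal{L}_{\mathrm{Cycle}}(\theta, \phi). \qquad (6)$$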
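Note that as ALI reaches its optimum, $p_\theta(x,z)$ and $q_\phi(x,z)$ reach saddle point $\pi(x,z)$, then $\mathcal{L}_{\mathrm{Cycle}}(\theta,\phi) \to H^{\pi}(x|z)$ in (4) accordingly, thus (7) effectively approaches (5) (ALICE). Unlike $\mathcal{L}^{\pi}_{\mathrm{CE}}(\theta,\phi)$ in (4), its upper bound, $\mathcal{L}_{\mathrm{Cycle}}(\theta,\phi)$, can be easily approximated via Monte Carlo simulation. Importantly, (7) can be readily added to ALI’s objective without additional changes to the original training procedure.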
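The bound in Lemma 3 is an instance of the standard decomposition of cross-entropy into entropy plus a KL term. The sketch below checks it numerically on a random discrete example, as a sanity check rather than a reproduction of the paper's proof. The table sizes, the random seed, and the variable names are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Joint q(x, z) over 4 values of x (rows) and 3 values of z (columns).
q_joint = rng.random((4, 3))
q_joint /= q_joint.sum()
q_z = q_joint.sum(axis=0)                 # marginal q(z)
q_x_given_z = q_joint / q_z               # q(x|z); each column sums to 1

# An arbitrary decoder model p(x|z); each column is a distribution over x.
p_x_given_z = rng.random((4, 3))
p_x_given_z /= p_x_given_z.sum(axis=0)

# Conditional entropy of x given z under q.
cond_entropy = -(q_joint * np.log(q_x_given_z)).sum()
# Cycle-consistency term: negative expected log-likelihood of x under the decoder.
cycle_loss = -(q_joint * np.log(p_x_given_z)).sum()
# Expected KL(q(x|z) || p(x|z)) under q(z): the slack in the bound.
expected_kl = (q_joint * np.log(q_x_given_z / p_x_given_z)).sum()
```

The cycle term exceeds the conditional entropy by exactly the expected KL divergence, so the bound becomes tight when the decoder conditional matches the conditional implied by the encoder joint.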
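Finally, the fully adversarial training algorithm for unsupervised learning using the ALICE framework is the result of replacing the cycle-consistency term in (7) with its adversarially learned counterpart; thus, for fixed generator parameters, we maximize with respect to the discriminator parameters.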
The use of paired samples in (8) is critical. It encourages the generators to mimic the reconstruction relationship implied in the first joint; otherwise, the model may reduce to the basic GAN discussed in Section 3, and generate any realistic sample in $\mathcal{X}$. The objective in (8) enjoys many theoretical properties of GAN. In particular, Proposition 1 guarantees the existence of the optimal generator and discriminator.
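A minimal sketch of how the paired inputs to an adversarial cycle-consistency critic might be assembled is given below. It assumes the critic sees a sample concatenated with itself as the "real" pair and the sample concatenated with its reconstruction as the "fake" pair. The linear critic, its random weights, and the function names are hypothetical stand-ins for trained networks.

```python
import numpy as np

def log_sigmoid(a):
    # Numerically stable log(sigmoid(a)).
    return -np.logaddexp(0.0, -a)

def adversarial_cycle_objective(x, x_hat, critic):
    """Estimate E[log s(f(x, x))] + E[log(1 - s(f(x, x_hat)))]."""
    real_pairs = np.concatenate([x, x], axis=1)        # (x, x): a perfect reconstruction
    fake_pairs = np.concatenate([x, x_hat], axis=1)    # (x, x_hat): the model's reconstruction
    return log_sigmoid(critic(real_pairs)).mean() + log_sigmoid(-critic(fake_pairs)).mean()

rng = np.random.default_rng(0)
x = rng.normal(size=(16, 2))
weights = rng.normal(size=4)                # hypothetical linear critic on the 4-d pair

def critic(pairs):
    return pairs @ weights

noisy = adversarial_cycle_objective(x, x + rng.normal(size=x.shape), critic)
exact = adversarial_cycle_objective(x, x.copy(), critic)
```

Because the critic always sees the original sample alongside the candidate, it can only be fooled by a reconstruction of that particular sample. When the reconstruction is exact, real and fake pairs coincide and no critic can score above the chance value $-\log 4$.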
The proof is provided in the SM. Together with Lemma 2 and 3, we can also show that:
3 Semi-supervised Learning
When the objective in (7) is optimized in an unsupervised way, the identifiability issues associated with ALI are largely reduced due to the cycle-consistency-enforcing bound in Lemma 3. This means that samples in the training data have been probabilistically “paired” with high certainty, by conditionals $p_\theta(x|z)$ and $q_\phi(z|x)$, though perhaps not in the desired configuration. In real-world applications, obtaining correctly paired data samples for the entire dataset is expensive or even impossible. However, in some situations obtaining paired data for a very small subset of the observations may be feasible. In such a case, we can leverage the small set of empirically paired samples to further provide guidance on selecting the correct configuration. This suggests that ALICE is suitable for semi-supervised classification.
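The fully adversarial training algorithm for supervised learning using ALICE in (11) is the result of replacing the explicit mapping term in (10) with its adversarially learned counterpart; thus, for fixed generator parameters, we maximize with respect to the discriminator parameters.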
The proof is provided in the SM. Proposition 2 enforces that the generator will map to the correctly paired sample in the other space. Together with the theoretical result for ALI in Lemma 2, we have
The first two terms operate on the entire set, while the last term only applies to the paired subset. Note that we can train (12) fully adversarially by replacing the cycle-consistency and mapping terms with their adversarial counterparts in (8) and (11), respectively. In (12) each of the three terms is given equal weight in the experiments unless specifically mentioned, but of course one may introduce additional hyperparameters to adjust the relative emphasis of each term.
Related Work: A Unified Perspective for Joint Distribution Matching
Connecting ALI and CycleGAN. We provide an information-theoretic interpretation for cycle-consistency, and show that it is equivalent to controlling conditional entropies and matching conditional distributions. When cycle-consistency is satisfied, Corollary 1 shows that the conditionals are matched in CycleGAN. CycleGAN also trains additional discriminators to guarantee the matching of marginals for $x$ and $z$ using the original GAN objective in (2). This reveals the equivalence between ALI and CycleGAN, as the latter can also guarantee the matching of joint distributions $p_\theta(x,z)$ and $q_\phi(x,z)$. In practice, CycleGAN is easier to train, as it decomposes the joint distribution matching objective (as in ALI) into four subproblems. Our approach leverages a similar idea, and further improves it with adversarially learned cycle-consistency, when high quality samples are of interest.
Stochastic Mapping vs. Deterministic Mapping. We propose to enforce the cycle-consistency in ALI for the case when two stochastic mappings are specified as in (1). When cycle-consistency is achieved, Corollary 1 shows that the bounded conditional entropy vanishes, and thus the corresponding mapping reduces to a deterministic one. In the literature, one deterministic mapping has been empirically tested in ALI’s framework, without explicitly specifying cycle-consistency. BiGAN uses two deterministic mappings. In theory, deterministic mappings guarantee cycle-consistency in ALI’s framework. However, to achieve this, the model has to fit a delta distribution (deterministic mapping) to another distribution in the sense of KL divergence (see Lemma 3). Due to the asymmetry of KL, the cost function will pay extremely low cost for generating fake-looking samples. This underfitting explains the subpar reconstruction ability of ALI. Therefore, in ALICE, we explicitly add a cycle-consistency regularization to accelerate and stabilize training.
Conditional GANs as Joint Distribution Matching. Conditional GAN and its variants have been widely used in supervised tasks. Our scheme to learn conditional entropy borrows the formulation of conditional GAN. To the authors’ knowledge, this is the first attempt to study the conditional GAN formulation as a joint distribution matching problem. Moreover, we add the potential to leverage the well-defined distribution implied by paired data, to resolve the ambiguity issues of unsupervised ALI variants.
Experimental Results
The code to reproduce these experiments is at https://github.com/ChunyuanLI/ALICE
To highlight the role of the CE regularization for unsupervised learning, we perform an experiment on a toy dataset. $q(x)$ is a 2D Gaussian Mixture Model (GMM) with 5 mixture components, and $p(z)$ is chosen as a standard Gaussian. Following prior work, the covariance matrices and centroids are chosen such that the distribution exhibits severely separated modes, which makes it a relatively hard task despite its 2D nature. Also following prior work, to study stability, we run an exhaustive grid search over a set of architectural choices and hyper-parameters, 576 experiments for each method. We report Mean Squared Error (MSE) and inception score (denoted as ICP) to quantitatively evaluate the performance of generative models. MSE is a proxy for reconstruction quality, while ICP reflects the plausibility and variety of sample generation. Lower MSE and higher ICP indicate better results. See SM for the details of the grid search and the calculation of ICP.
We train on 2048 samples, and test on 1024 samples. The ground-truth test samples for $x$ and $z$ are shown in Figure 2(a) and (b), respectively. We compare ALICE, ALI and Denoising Auto-Encoders (DAEs), and report the distribution of ICP and MSE values, for all (576) experiments in Figure 2(c) and (d), respectively. For reference, samples drawn from the “oracle” (ground-truth) GMM yield ICP = 4.977 ± 0.016. ALICE yields an ICP larger than 4.5 in 77% of experiments, while ALI’s ICP varies wildly across different runs. These results demonstrate that ALICE is more consistent and quantitatively reliable than ALI. The DAE yields the lowest MSE, as expected, but it also results in the weakest generation ability. The comparatively low MSE of ALICE demonstrates acceptable reconstruction ability compared to DAE, and a very significant improvement over ALI.
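The toy setup can be sketched as follows: draw train and test sets from a 5-component 2D mixture, draw latent samples from a standard Gaussian, and score reconstructions with MSE. The centroids, the component spread, the equal mixture weights, and the 2D latent dimension below are placeholder assumptions for illustration; they are not the values used in the experiments.

```python
import numpy as np

def sample_gmm(n, means, std, rng):
    """Draw n points from an equally weighted isotropic Gaussian mixture."""
    means = np.asarray(means, dtype=float)
    component = rng.integers(0, len(means), size=n)     # mixture index of each sample
    points = means[component] + std * rng.normal(size=(n, means.shape[1]))
    return points, component

def reconstruction_mse(x, x_hat):
    """Mean squared error between inputs and their reconstructions."""
    return float(np.mean((np.asarray(x) - np.asarray(x_hat)) ** 2))

rng = np.random.default_rng(0)
# Placeholder centroids and spread (assumed, not the paper's configuration).
means = [(0.0, 0.0), (2.0, 2.0), (-2.0, 2.0), (2.0, -2.0), (-2.0, -2.0)]
x_train, k_train = sample_gmm(2048, means, std=0.1, rng=rng)
x_test, k_test = sample_gmm(1024, means, std=0.1, rng=rng)
z_test = rng.normal(size=(1024, 2))                     # latent samples from N(0, I)
```

Keeping the component index alongside each point is what allows reconstructions to be colored by their originating mixture component when inspecting whether a model preserves the pairing.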
Figure 3 shows the qualitative results on the test set. Since ALI’s results vary largely from trial to trial, we present the one with highest ICP. In the figure, we color samples from different mixture components to highlight the correspondence between the ground truth, in Figure 2(a), and their reconstructions, in Figure 3 (first row, columns 2, 4 and 6, for ALICE, ALI and DAE, respectively). Importantly, though the reconstruction of ALI can recover the shape of the manifold in $x$ (Gaussian mixture), each individual reconstructed sample can be substantially far away from its “original” mixture component (note the highly mixed coloring), hence the poor MSE. This occurs because the adversarial training in ALI only requires that the generated samples look realistic, i.e., to be located near true samples in $\mathcal{X}$, but the mapping between observed and latent spaces ($x \to z$ and $z \to x$) is not specified. In the SM we also consider ALI with various combinations of stochastic/deterministic mappings, and conclude that models with deterministic mappings tend to have lower reconstruction ability but higher generation ability. In terms of the estimated latent space, $z$, in Figure 3 (first row, columns 1, 3 and 5, for ALICE, ALI and DAE, respectively), we see that ALICE results in a better latent representation, in the sense of mapping consistency (samples from different mixture components remain clustered) and distribution consistency (samples approximate a Gaussian distribution). The results for reconstruction of $z$ and sampling of $x$ are shown in the SM.
In Figure 3 (second row), we also investigate latent space interpolation between a pair of test set examples. We take two test examples, map them into the latent space, linearly interpolate between the two latent codes to get intermediate points, and then map these back to the original space. We only show the index of the samples for better visualization. Figure 3 shows that ALICE’s interpolation is smooth and consistent with the ground-truth distributions. Interpolation using ALI results in realistic samples (within mixture components), but the transition is not order-wise consistent. DAEs provide smooth transitions, but the samples in the original space look unrealistic as some of them are located in low probability density regions of the true model.
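The interpolation procedure can be sketched in a few lines: encode the two endpoints, mix their latent codes linearly, and decode every point on the path. The invertible linear maps below are hypothetical stand-ins for the trained encoder and decoder, chosen only so the example runs without a trained model.

```python
import numpy as np

def interpolate_and_decode(x_a, x_b, encode, decode, num_points=10):
    """Encode two inputs, interpolate linearly in latent space, decode each point."""
    z_a, z_b = encode(x_a), encode(x_b)
    alphas = np.linspace(0.0, 1.0, num_points)[:, None]    # mixing weights, shape (num_points, 1)
    z_path = (1.0 - alphas) * z_a + alphas * z_b
    return decode(z_path)

# Hypothetical invertible linear maps standing in for the trained networks.
A = np.array([[2.0, 1.0],
              [0.0, 1.0]])
A_inv = np.linalg.inv(A)

def encode(x):
    return x @ A_inv

def decode(z):
    return z @ A

x_a = np.array([1.0, 0.0])
x_b = np.array([0.0, 3.0])
path = interpolate_and_decode(x_a, x_b, encode, decode, num_points=5)
```

With a cycle-consistent pair of mappings, the first and last decoded points coincide with the two inputs. With mappings that are not cycle-consistent the endpoints can land elsewhere, which is one way the order-wise inconsistency described above can arise.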
We investigate the impact of different amounts of regularization on three datasets, including the toy dataset, MNIST and CIFAR-10, in SM Section D. The results show that our regularizer can improve image generation and reconstruction of ALI for a large range of weighting hyperparameter values.
2 Reconstruction and Cross-Domain Transformation on Real Datasets
Two image-to-image translation tasks are considered. Car-to-Car: each domain includes car images in 11 different angles, on which we seek to demonstrate the power of adversarially learned reconstruction and weak supervision. Edge-to-Shoe: one domain consists of shoe photos and the other consists of edge images, on which we report extensive quantitative comparisons. Cycle-consistency is applied on both domains. The goal is to discover the cross-domain relationship (i.e., cross-domain prediction), while maintaining reconstruction ability on each domain.
Weak supervision. DiscoGAN and BiGAN are unsupervised methods, and exhibit very different cross-domain pairing configurations during different training epochs, which is indicative of non-identifiability issues. We leverage very weak supervision to help with convergence and guide the pairing. The results are shown in Figure 4(b). We run each method multiple times; the width of the colored lines reflects the standard deviation. We start with 1% true pairs for supervision, which yields significantly higher accuracy than DiscoGAN/BiGAN. We then provide 10% supervision in only 2 or 6 angles (of 11 total angles), which yields angle prediction accuracy comparable to full angle supervision in testing. This shows ALICE’s ability in terms of zero-shot learning, i.e., predicting unseen pairs. In the SM, we show that enforcing different weak supervision strategies affects the final pairing configurations, i.e., we can leverage supervision to obtain the desirable joint distribution.
One-side cycle-consistency. When uncertainty in one domain is desirable, we consider one-side cycle-consistency. This is demonstrated on the CelebA face dataset. Each face is associated with a 40-dimensional attribute vector. The results are in Figure 13 of the SM. In the first task, we consider the images $x$ to be generated from a 128-dimensional Gaussian latent space $z$, and apply cycle-consistency on $x$. We compare ALICE and ALI on reconstruction in (a)(b). ALICE shows more faithful reproduction of the input subjects. In the second task, we consider $z$ as the attribute space, from which the images $x$ are generated. The mapping from $x$ to $z$ is then attribute classification. We only apply cycle-consistency on the attribute domain, and the mapping loss on both domains. When only a fraction of paired samples is considered, the predicted attributes still reach 86% accuracy, which is comparable with the fully supervised case. To test the diversity on $x$, we first predict the attributes of a true face image, and then generate multiple images conditioned on the predicted attributes. Four examples are shown in (c).
Conclusion
We have studied the problem of non-identifiability in bidirectional adversarial networks. A unified perspective of understanding various GAN models as joint matching is provided to tackle this problem. This insight enables us to propose ALICE (with both adversarial and non-adversarial solutions) to reduce the ambiguity and control the conditionals in unsupervised and semi-supervised learning. For future work, the proposed view can provide opportunities to leverage the advantages of each model, to advance joint-distribution modeling.
We acknowledge Shuyang Dai, Chenyang Tao and Zihang Dai for helpful feedback/editing. This research was supported in part by ARO, DARPA, DOE, NGA, ONR and NSF.
Appendix A Information Measures
Since our paper constrains the correlation of two random variables using information-theoretic measures, we first review the related concepts. For any probability measure $\pi$ on the random variables $x$ and $z$, we have the following additive and subtractive relationships for various information measures, including Mutual Information (MI), Variation of Information (VI) and the Conditional Entropy (CE).
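Lemma 4. For random variables $x$ and $z$ with two different probability measures, $p(x,z)$ and $q(x,z)$, we have
$$H^{p}(x|z) = -\mathbb{E}_{p(x,z)}\big[\log q(x|z)\big] - \mathbb{E}_{p(z)}\big[\mathrm{KL}\big(p(x|z)\,\|\,q(x|z)\big)\big] \le -\mathbb{E}_{p(x,z)}\big[\log q(x|z)\big],$$
where $H^{p}(x|z)$ is the conditional entropy. From Lemma 4, we have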
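For random variables $x$ and $z$ with probability measure $\pi(x,z)$, the mutual information between $x$ and $z$ can be written as
$$I^{\pi}(x; z) = H^{\pi}(x) - H^{\pi}(x|z) = H^{\pi}(z) - H^{\pi}(z|x).$$
Given a simple prior $p(z)$ such as isotropic Gaussian, $H(z)$ is a constant.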
For random variables $x$ and $z$ with probability measure $\pi(x,z)$, the variation of information between $x$ and $z$ can be written as
$$\mathrm{VI}^{\pi}(x, z) = H^{\pi}(x|z) + H^{\pi}(z|x).$$
Appendix B Proof for Adversarial Learning Schemes
The proofs for cycle-consistency and conditional GAN using adversarial training are shown below. They follow the proof of the original GAN paper: we first show the implication of the optimal discriminator, and then show the corresponding optimal generator.
In the unsupervised case, given data sample , one desirable property is reconstruction. The following game learns to reconstruct:
The expression in (26) is maximal as a function of if and only if the integrand is maximal for every . However, the problem attains its maximum at , showing that
For the game in (24), for which are optimized as to most confuse the discriminator, the optimal solution for the distribution parameters yield , and therefore from (31)
B.2 Proof of Proposition 2: Adversarially Learned Conditional Generation for Paired Data
In the supervised case, given the paired data sample , the following game is used to conditionally generate :
To show the results, we need the following Lemma:
The optimal generator and discriminator, with parameters , form a saddle point of the game in (33), if and only if . Further,
For the observed paired data , we have , where is marginal empirical distribution of for the paired data.
Therefore, the objective in (33) can be expressed as
This integral is maximal as a function of if and only if the integrand is maximal for every . However, the problem attains its maximum at , showing that
or equivalently, the optimum generator is . Since , we further have . Similarly, for conditional GAN of , we can show that is and for the Combining them, we show that .
Appendix C More Results on the Toy Data
The 5-component Gaussian mixture model (GMM) in is set with the means , and standard deviation . The isotropic Gaussian in is set with mean and standard deviation .
We consider various network architectures to compare the stability of the methods. The hyperparameters include: the number of layers and the number of neurons of the discriminator and two generators, and the update frequency for discriminator and generator. The grid search specification is summarized in Table C.4. Hence, the total number of experiments is 576.
C.2 Reconstruction of $z$ and sampling for $x$
We show the additional results for the reconstruction of $z$ and sampling for $x$ in Figure 6. ALICE shows good sampling ability, as it reflects the Gaussian characteristics for each of the 5 components, while ALI’s samples tend to be concentrated, as reflected by the shrunken Gaussian components. DAE learns an identity mapping, and thus shows weak generation ability.
C.3 Summary of the four variants of ALICE
ALICE is a general CE-based framework to regularize the objectives of bidirectional adversarial training, in order to obtain desirable solutions. To clearly show the versatility of ALICE, we summarize its four variants, and test their effectiveness on toy datasets.
In unsupervised learning, two forms of cycle-consistency/reconstruction are considered to bound CE:
Explicit cycle-consistency: Explicitly specified (non-adversarial) reconstruction loss
Implicit cycle-consistency: Implicitly learned reconstruction via adversarial training
In semi-supervised learning, the pairwise information is leveraged in two forms to approximate CE:
Explicit mapping: Explicitly specified (non-adversarial) mapping loss
Implicit mapping: Implicitly learned mapping via adversarial training
Results
The effectiveness of these algorithms is demonstrated on low-dimensional toy data in Figure 7. The unsupervised variants are tested on the same toy dataset described above; the results are in Figure 7(a)(b). For the supervised variants, we create a toy dataset where one domain is a 2-component GMM and the other is a 5-component GMM. Since each domain is symmetric, ambiguity exists when CycleGAN variants attempt to discover the relationship of the two domains in a purely unsupervised setting. Indeed, we observed random switching of the discovered corresponding components in different runs of CycleGAN. By adding a tiny fraction of pairwise information (a cheap way to specify the desirable relationship), we can easily learn the correct correspondences for the entire dataset. In Figure 7(c)(d), a small number of pairs (out of 2048) are pre-specified: the points in one domain are paired with the points in the other domain with opposite signs. Both explicit and implicit ALICE find the correct pairing configurations for other unlabeled samples. This inspires us to manually label the relations for a few samples between domains, and use ALICE to automatically control the pairing of the full dataset for real datasets. One example is shown on the Car2Car dataset.
C.4 Comparisons of ALI with stochastic/deterministic mappings
We investigate the ALI model with different mappings:
ALI-: one stochastic mapping and one deterministic mapping;
We plot the histogram of ICP and MSE in Fig. 8, and report the mean and standard deviation in Table C.4. In Fig. 9, we compare their reconstruction and generation ability. Models with deterministic mapping have higher reconstruction ability, while showing lower sampling ability.
Comparison on Reconstruction. Please see rows 1 and 2 in Fig. 9. For reconstruction, we start from one sample (red dot), and pass it through the cycle formed by the two mappings 100 times. The resulting reconstructions are shown as blue dots. The reconstructed samples tend to be more concentrated with more deterministic mappings.
Comparison on Sampling. Please see rows 3 and 4 in Fig. 9. For sampling, we first draw samples in each domain, and pass them through the mappings. The generated samples are colored by the index of the Gaussian component they come from in the original domain.
Setup. The dataset consists of rendered images of 3D car models with varying azimuth angles at fixed intervals. 11 views of each car are used. The dataset is split into a train set and a test set, and the train set is further split into two groups, each of which is used as A domain and B domain samples. To evaluate, we trained a regressor and a classifier that predict the azimuth angle using the train set. We map the car image from one domain to the other, and then reconstruct to the original domain. The cycle-consistency is evaluated as the prediction accuracy of the reconstructed images.
Table E.1 shows the MSE and prediction accuracy obtained by leveraging supervision in different numbers of angles. To further demonstrate that we can easily control the correspondence configuration by designing the proper supervision, we use ALICE to enforce coherent supervision and opposite supervision, respectively. Only limited supervision information is used in each angle. We translated images in the test set using each of the three trained models, and azimuth angles were predicted using the regressor for both input and translated images. In Table E.1, we show the cross domain relationship discovered by each method. The X and Y axes indicate predicted angles of original and transformed cars, respectively. All three plots are results at the 10th epoch. Scatter points with supervision are more concentrated on the diagonals in the plots, which indicates higher prediction/correlation. The learning curves are shown in Table E.1(d). The Y axis indicates the RMSE in angle prediction. We see that very weak supervision can largely improve the convergence results and speed. Examples and comparisons are shown in Figure 11.
The MSE results on cross-domain prediction and one-domain reconstruction are shown in Figure 12.
E.3 CelebA Face Dataset
Reconstruction results on the validation set of the CelebA dataset are shown in Figure 13. ALI results are from the original ALI paper. ALICE provides more faithful reconstruction of the input subjects. As a trade-off between theoretical optimum and practical convergence, we employ feature matching, and thus our results exhibit slight blurriness.
We demonstrate the potential real applications of ALICE algorithms on the task of sketch to cartoon. We built a dataset by collecting frames from a Disney film. A large image size is considered. The training dataset consists of two domains: cartoon images and edge images, where the edges are created via holistically-nested edge detection on their true cartoon images. The image content is about either of two characters in the film: Alice or White Rabbit. Therefore, each domain exhibits two modes. 52 images are collected in each domain. The one-to-one image correspondence between the two domains is unknown; the goal is to efficiently generate realistic cartoon images for animation, based on the edges.
ALICE converges faster than CycleGAN: The generated images after 6K iterations are shown in Figure 14. CycleGAN generates images with mixed colors (e.g., the clothes of Rabbit), while ALICE clearly paints colors in different regions.
ALICE enables desirable image generation (with better generalization): The generated images after 10K iterations are shown in Figure 15. We generated images based on slightly different edges: more background and details on the character. CycleGAN gets confused when identifying the character, and thus inconsistently paints different colors on Rabbit, while explicit ALICE can generate coherent color.
ALICE improves image quality via weak supervision: Two sets of the generated images after 10K iterations are compared in Figure 16. ALICE can capture more visual details (e.g., the background flowers in (a)) and colors (e.g., the umbrella in (b)).