Coupled Generative Adversarial Networks
Ming-Yu Liu, Oncel Tuzel
Introduction
The paper concerns the problem of learning a joint distribution of multi-domain images from data. A joint distribution of multi-domain images is a probability density function that gives a density value to each joint occurrence of images in different domains such as images of the same scene in different modalities (color and depth images) or images of the same face with different attributes (smiling and non-smiling). Once a joint distribution of multi-domain images is learned, it can be used to generate novel tuples of images. In addition to movie and game production, joint image distribution learning finds applications in image transformation and domain adaptation. When training data are given as tuples of corresponding images in different domains, several existing approaches can be applied. However, building a dataset with tuples of corresponding images is often a challenging task. This correspondence dependency greatly limits the applicability of the existing approaches.
To overcome the limitation, we propose the coupled generative adversarial networks (CoGAN) framework. It can learn a joint distribution of multi-domain images without requiring corresponding images in different domains in the training set. Only a set of images drawn separately from the marginal distributions of the individual domains is required. CoGAN is based on the generative adversarial networks (GAN) framework, which has been established as a viable solution for image distribution learning tasks. CoGAN extends GAN to joint image distribution learning tasks.
CoGAN consists of a tuple of GANs, each for one image domain. When trained naively, the CoGAN learns a product of marginal distributions rather than a joint distribution. We show that by enforcing a weight-sharing constraint the CoGAN can learn a joint distribution without corresponding images in different domains. The CoGAN framework is inspired by the idea that deep neural networks learn a hierarchical feature representation. By enforcing the layers that decode high-level semantics in the GANs to share the weights, it forces the GANs to decode the high-level semantics in the same way. The layers that decode low-level details then map the shared representation to images in individual domains for confusing the respective discriminative models. CoGAN applies to multiple image domains but, for ease of presentation, we focus on the case of two image domains in the paper. The discussions and analyses can be easily generalized to more than two image domains.
We apply CoGAN to several joint image distribution learning tasks. Through convincing visualization results and quantitative evaluations, we verify its effectiveness. We also show its applications to unsupervised domain adaptation and image transformation.
Generative Adversarial Networks
A GAN consists of a generative model and a discriminative model. The objective of the generative model is to synthesize images resembling real images, while the objective of the discriminative model is to distinguish real images from synthesized ones. Both the generative and discriminative models are realized as multilayer perceptrons.
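Let $x$ be a natural image drawn from a distribution $p_X$, and let $z$ be a random vector drawn from a prior distribution $p_Z$. Let $g$ and $f$ be the generative and discriminative models, respectively. The generative model takes $z$ as input and outputs an image $g(z)$ that has the same support as $x$; denote the distribution of $g(z)$ by $p_G$. The discriminative model estimates the probability that an input image is drawn from $p_X$. The GAN framework corresponds to a minimax two-player game:
$$\max_{g}\min_{f} V(f,g) \equiv E_{x\sim p_X}[-\log f(x)] + E_{z\sim p_Z}[-\log(1-f(g(z)))]. \quad (1)$$
In practice (1) is solved by alternating the following two gradient update steps:
$$\text{Step 1: } \boldsymbol{\theta}_{f}^{t+1} = \boldsymbol{\theta}_{f}^{t} - \lambda^{t}\nabla_{\boldsymbol{\theta}_{f}} V(f^{t},g^{t}), \qquad \text{Step 2: } \boldsymbol{\theta}_{g}^{t+1} = \boldsymbol{\theta}_{g}^{t} + \lambda^{t}\nabla_{\boldsymbol{\theta}_{g}} V(f^{t+1},g^{t}),$$
where $\boldsymbol{\theta}_{f}$ and $\boldsymbol{\theta}_{g}$ are the parameters of $f$ and $g$, $\lambda$ is the learning rate, and $t$ is the iteration number.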
Goodfellow et al. show that, given enough capacity for the discriminative model $f$ and the generative model $g$ and sufficient training iterations, the distribution of synthesized images, $p_G$, converges to the true data distribution, $p_X$. In other words, from a random vector, $z$, the network $g$ can synthesize an image, $g(z)$, that resembles one that is drawn from the true distribution, $p_X$.
Coupled Generative Adversarial Networks
CoGAN, as illustrated in Figure 1, is designed for learning a joint distribution of images in two different domains. It consists of a pair of GANs, $\text{GAN}_1$ and $\text{GAN}_2$; each is responsible for synthesizing images in one domain. During training, we force them to share a subset of parameters. As a result, the GANs learn to synthesize pairs of corresponding images without correspondence supervision.
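Generative Models: Let $x_1$ and $x_2$ be images drawn from the marginal distribution of the 1st domain, $x_1 \sim p_{X_1}$, and the marginal distribution of the 2nd domain, $x_2 \sim p_{X_2}$, respectively. Let $g_1$ and $g_2$ be the generative models of $\text{GAN}_1$ and $\text{GAN}_2$, which map a random vector input $z$ to images that have the same support as $x_1$ and $x_2$, respectively. Denote the distributions of $g_1(z)$ and $g_2(z)$ by $p_{G_1}$ and $p_{G_2}$. Both $g_1$ and $g_2$ are realized as multilayer perceptrons:
$$g_1(z) = g_1^{(m_1)}\big(g_1^{(m_1-1)}\big(\ldots g_1^{(2)}\big(g_1^{(1)}(z)\big)\big)\big), \qquad g_2(z) = g_2^{(m_2)}\big(g_2^{(m_2-1)}\big(\ldots g_2^{(2)}\big(g_2^{(1)}(z)\big)\big)\big),$$
where $g_1^{(i)}$ and $g_2^{(i)}$ are the $i$th layers of $g_1$ and $g_2$, and $m_1$ and $m_2$ are the numbers of layers in $g_1$ and $g_2$. Note that $m_1$ need not equal $m_2$. Also note that the support of $x_1$ need not equal that of $x_2$.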
Through layers of perceptron operations, the generative models gradually decode information from more abstract concepts to more material details. The first layers decode high-level semantics and the last layers decode low-level details. Note that this information flow direction is opposite to that in a discriminative deep neural network where the first layers extract low-level features while the last layers extract high-level features.
Based on the idea that a pair of corresponding images in two domains share the same high-level concepts, we force the first layers of $g_1$ and $g_2$ to have identical structure and share the weights. That is, $\boldsymbol{\theta}_{g_{1}^{(i)}}=\boldsymbol{\theta}_{g_{2}^{(i)}}$ for $i=1,2,\ldots,k$, where $k$ is the number of shared layers, and $\boldsymbol{\theta}_{g_{1}^{(i)}}$ and $\boldsymbol{\theta}_{g_{2}^{(i)}}$ are the parameters of $g_1^{(i)}$ and $g_2^{(i)}$, respectively. This constraint forces the high-level semantics to be decoded in the same way in $g_1$ and $g_2$. No constraints are imposed on the last layers. They can materialize the shared high-level representation differently for fooling the respective discriminators.
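The generator-side constraint can be illustrated with a minimal sketch. It is not the released implementation: the dense `tanh` layers, the layer sizes, and the helper names `make_layer`, `apply_layer`, and `forward` are all illustrative assumptions. The point is only that the first $k$ layers of the two generators are the same parameter objects, while the last layers are separate.

```python
import math
import random

def make_layer(n_in, n_out, rng):
    """Create a toy dense layer (hypothetical helper): weights and biases."""
    return {
        "W": [[rng.gauss(0.0, 0.1) for _ in range(n_in)] for _ in range(n_out)],
        "b": [0.0] * n_out,
    }

def apply_layer(layer, x):
    """Affine map followed by tanh (an assumed nonlinearity)."""
    return [math.tanh(sum(w * v for w, v in zip(row, x)) + b)
            for row, b in zip(layer["W"], layer["b"])]

def forward(layers, z):
    """Run z through the layers in order, from first (high-level) to last."""
    h = z
    for layer in layers:
        h = apply_layer(layer, h)
    return h

rng = random.Random(0)
k = 2  # number of shared layers (illustrative choice)
shared = [make_layer(4, 8, rng), make_layer(8, 8, rng)]

# Both generators reference the SAME first k layer objects, so any update to
# a shared layer affects both; each keeps its own domain-specific last layer.
g1 = shared + [make_layer(8, 6, rng)]
g2 = shared + [make_layer(8, 6, rng)]

z = [rng.gauss(0.0, 1.0) for _ in range(4)]
x1, x2 = forward(g1, z), forward(g2, z)  # a pair rendered from the same z
```

Because the shared layers are shared by reference, a gradient step applied through either generator updates the parameters seen by both, which is one simple way to satisfy the equality constraint exactly.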
Discriminative Models: Let $f_1$ and $f_2$ be the discriminative models of $\text{GAN}_1$ and $\text{GAN}_2$ given by
$$f_1(x_1) = f_1^{(n_1)}\big(f_1^{(n_1-1)}\big(\ldots f_1^{(2)}\big(f_1^{(1)}(x_1)\big)\big)\big), \qquad f_2(x_2) = f_2^{(n_2)}\big(f_2^{(n_2-1)}\big(\ldots f_2^{(2)}\big(f_2^{(1)}(x_2)\big)\big)\big),$$
where $f_1^{(i)}$ and $f_2^{(i)}$ are the $i$th layers of $f_1$ and $f_2$, and $n_1$ and $n_2$ are the numbers of layers. The discriminative models map an input image to a probability score, estimating the likelihood that the input is drawn from a true data distribution. The first layers of the discriminative models extract low-level features, while the last layers extract high-level features. Because the input images are realizations of the same high-level semantics in two different domains, we force $f_1$ and $f_2$ to have the same last layers, which is achieved by sharing the weights of the last layers via $\boldsymbol{\theta}_{f_{1}^{(n_{1}-i)}}=\boldsymbol{\theta}_{f_{2}^{(n_{2}-i)}}$ for $i=0,1,\ldots,l-1$, where $l$ is the number of weight-sharing layers in the discriminative models, and $\boldsymbol{\theta}_{f_{1}^{(i)}}$ and $\boldsymbol{\theta}_{f_{2}^{(i)}}$ are the network parameters of $f_1^{(i)}$ and $f_2^{(i)}$, respectively. The weight-sharing constraint in the discriminators helps reduce the total number of parameters in the network, but it is not essential for learning a joint distribution.
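Since the two discriminators may have different depths, the shared layers are counted from the output end. The following small sketch, with a hypothetical helper name and illustrative depths, lists which 1-based layer indices are tied together under the constraint above.

```python
def shared_discriminator_pairs(n1, n2, l):
    """Return the 1-based index pairs (n1 - i, n2 - i) for i = 0, ..., l - 1.

    n1, n2: numbers of layers in the two discriminators.
    l: number of weight-sharing layers, counted from the last layer backwards.
    """
    if l < 0 or l > min(n1, n2):
        raise ValueError("l must lie between 0 and the depth of the shallower network")
    return [(n1 - i, n2 - i) for i in range(l)]

# Illustrative depths: a 5-layer and a 4-layer discriminator sharing 2 layers.
pairs = shared_discriminator_pairs(n1=5, n2=4, l=2)
print(pairs)  # [(5, 4), (4, 3)]
```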
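Learning: The CoGAN framework corresponds to a constrained minimax game given by
$$\max_{g_1,g_2}\min_{f_1,f_2} V(f_1,f_2,g_1,g_2), \quad \text{subject to } \boldsymbol{\theta}_{g_{1}^{(i)}}=\boldsymbol{\theta}_{g_{2}^{(i)}} \text{ for } i=1,2,\ldots,k \text{ and } \boldsymbol{\theta}_{f_{1}^{(n_{1}-j)}}=\boldsymbol{\theta}_{f_{2}^{(n_{2}-j)}} \text{ for } j=0,1,\ldots,l-1,$$
where the value function $V$ is given by
$$V(f_1,f_2,g_1,g_2) = E_{x_1\sim p_{X_1}}[-\log f_1(x_1)] + E_{z\sim p_Z}[-\log(1-f_1(g_1(z)))] + E_{x_2\sim p_{X_2}}[-\log f_2(x_2)] + E_{z\sim p_Z}[-\log(1-f_2(g_2(z)))].$$
In the game, there are two teams and each team has two players. The generative models form a team and work together to synthesize a pair of images in two different domains for confusing the discriminative models. The discriminative models try to differentiate images drawn from the training data distribution in the respective domains from those drawn from the respective generative models. The collaboration between the players in the same team is established by the weight-sharing constraint. Similar to GAN, CoGAN can be trained by backpropagation with the alternating gradient update steps. The details of the learning algorithm are given in the supplementary materials.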
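As a rough illustration of the game's value, the sketch below estimates it from mini-batches of discriminator scores. It assumes the negative-log cross-entropy form, in which the discriminators minimize the value and the generators maximize it; the function names are hypothetical, and all scores are assumed to lie strictly between 0 and 1.

```python
import math

def gan_value(real_scores, fake_scores):
    """Monte Carlo estimate of E[-log f(x)] + E[-log(1 - f(g(z)))] for one GAN."""
    real_term = sum(-math.log(s) for s in real_scores) / len(real_scores)
    fake_term = sum(-math.log(1.0 - s) for s in fake_scores) / len(fake_scores)
    return real_term + fake_term

def cogan_value(real1, fake1, real2, fake2):
    """Sum of the two per-domain values; the coupling enters through the shared
    weights that produce the scores, not through this sum."""
    return gan_value(real1, fake1) + gan_value(real2, fake2)

# When both discriminators are maximally confused (all scores 0.5),
# each of the four terms equals log 2.
v = cogan_value([0.5, 0.5], [0.5, 0.5], [0.5, 0.5], [0.5, 0.5])
```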
Remarks: CoGAN learning requires training samples drawn from the marginal distributions, $p_{X_1}$ and $p_{X_2}$. It does not rely on samples drawn from the joint distribution, $p_{X_1,X_2}$, where corresponding supervision would be available. Our main contribution is in showing that with just samples drawn separately from the marginal distributions, CoGAN can learn a joint distribution of images in the two domains. Both the weight-sharing constraint and adversarial training are essential for enabling this capability. Autoencoder learning encourages a generated pair of images to be identical to the target pair of corresponding images in the two domains for minimizing the reconstruction loss (this is why such approaches require samples from the joint distribution for learning the joint distribution). In contrast, adversarial training only encourages the generated pair of images to individually resemble the images in the respective domains. With this more relaxed adversarial training setting, the weight-sharing constraint can then kick in for capturing correspondences between domains. With the weight-sharing constraint, the generative models must utilize the capacity more efficiently for fooling the discriminative models, and the most efficient way of utilizing the capacity for generating a pair of realistic images in two domains is to generate a pair of corresponding images since the neurons responsible for decoding high-level semantics can be shared.
CoGAN learning relies on the existence of shared high-level representations in the domains. If such a representation does not exist for the set of domains of interest, it would fail.
Experiments
We emphasize that, in the experiments, there were no corresponding images in the different domains in the training sets. CoGAN learned the joint distributions without correspondence supervision. We were unaware of existing approaches with the same capability and hence did not compare CoGAN with prior works. Instead, we compared it to a conditional GAN to demonstrate its advantage. Recognizing that popular performance metrics for evaluating generative models are all subject to issues, we adopted a pair image generation performance metric for comparison. Many details including the network architectures and additional experiment results are given in the supplementary materials. An implementation of CoGAN is available at https://github.com/mingyuliutw/cogan.
We used deep convolutional networks to realize the CoGAN. The two generative models had an identical structure; both had 5 layers and were fully convolutional. The stride lengths of the convolutional layers were fractional. The models also employed batch normalization and parameterized rectified linear unit processing. We shared the parameters for all the layers except for the last convolutional layers. For the discriminative models, we used a variant of LeNet. The inputs to the discriminative models were batches containing output images from the generative models and images from the two training subsets (each pixel value was linearly rescaled to a fixed range).
We divided the training set into two equal-size non-overlapping subsets. One was used to train $\text{GAN}_1$ and the other was used to train $\text{GAN}_2$. We used the ADAM algorithm for training and set the learning rate to 0.0002, the 1st momentum parameter to 0.5, and the 2nd momentum parameter to 0.999, following previously suggested settings. The mini-batch size was 128. We trained the CoGAN for 25000 iterations. These hyperparameters were fixed for all the visualization experiments.
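To make the role of these hyperparameters concrete, here is a minimal sketch of a single standard Adam update using the stated learning rate and momentum parameters. The stabilizing constant `eps=1e-8` is a commonly used default and is an assumption here, as are the function and variable names.

```python
import math

def adam_step(theta, grad, m, v, t, lr=0.0002, beta1=0.5, beta2=0.999, eps=1e-8):
    """One Adam update on a flat parameter list; t is the 1-based step count."""
    m = [beta1 * mi + (1.0 - beta1) * g for mi, g in zip(m, grad)]
    v = [beta2 * vi + (1.0 - beta2) * g * g for vi, g in zip(v, grad)]
    # Bias-corrected first and second moment estimates.
    m_hat = [mi / (1.0 - beta1 ** t) for mi in m]
    v_hat = [vi / (1.0 - beta2 ** t) for vi in v]
    theta = [p - lr * mh / (math.sqrt(vh) + eps)
             for p, mh, vh in zip(theta, m_hat, v_hat)]
    return theta, m, v

# On the first step, each parameter moves by roughly lr against the gradient sign.
theta, m, v = adam_step([0.0, 0.0], [1.0, -2.0], [0.0, 0.0], [0.0, 0.0], t=1)
```

The lower first momentum parameter (0.5 rather than the more common 0.9) makes the running gradient average react faster to the changing adversarial objective.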
Weight Sharing: We varied the numbers of weight-sharing layers in the generative and discriminative models to create different CoGANs for analyzing the weight-sharing effect for both tasks. Due to the lack of proper validation methods, we did a grid search on the training iteration hyperparameter and reported the best performance achieved by each network. For quantifying the performance, we transformed the image generated by $g_1$ to the 2nd domain using the same method employed for generating the training images in the 2nd domain. We then compared the transformed image with the image generated by $g_2$. Perfect joint distribution learning should render two identical images. Hence, we used the ratios of agreed pixels between 10K pairs of images generated by each network (10K randomly sampled $z$) as the performance metric. We trained each network 5 times with different initialization weights and reported the average pixel agreement ratios over the 5 trials for each network. The results are shown in Figure 3. We observed that the performance was positively correlated with the number of weight-sharing layers in the generative models. With more shared layers in the generative models, the rendered pairs of images more closely resembled true pairs drawn from the joint distribution. We also noted that the performance was uncorrelated with the number of weight-sharing layers in the discriminative models. However, we still preferred discriminator weight-sharing because it reduces the total number of network parameters.
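The pair image generation metric can be sketched as follows. This is a minimal illustration under assumptions: images are 2D lists of pixel values, two pixels "agree" when they differ by at most `tol`, and the domain transform `negate` is a hypothetical stand-in for whichever method produced the 2nd-domain training images.

```python
def pixel_agreement_ratio(img_a, img_b, tol=0.0):
    """Fraction of pixel positions at which two equal-size images agree within tol."""
    flat_a = [p for row in img_a for p in row]
    flat_b = [p for row in img_b for p in row]
    if len(flat_a) != len(flat_b):
        raise ValueError("images must have the same number of pixels")
    agreed = sum(1 for a, b in zip(flat_a, flat_b) if abs(a - b) <= tol)
    return agreed / len(flat_a)

def average_agreement(pairs, transform, tol=0.0):
    """Average ratio over (x1, x2) pairs, after mapping x1 into the 2nd domain."""
    ratios = [pixel_agreement_ratio(transform(x1), x2, tol) for x1, x2 in pairs]
    return sum(ratios) / len(ratios)

def negate(img):
    """Hypothetical domain transform: invert binary pixel values."""
    return [[1 - p for p in row] for row in img]

x1 = [[0, 1], [1, 0]]          # image rendered by the 1st generator
x2 = [[1, 0], [0, 0]]          # image rendered by the 2nd generator
ratio = pixel_agreement_ratio(negate(x1), x2)   # 3 of 4 pixels agree
```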
Faces: We applied CoGAN to learn a joint distribution of face images with different attributes. We trained several CoGANs, each for generating a face with an attribute and a corresponding face without the attribute. We used the CelebFaces Attributes dataset for the experiments. The dataset covered large pose variations and background clutter. Each face image had several attributes, including blond hair, smiling, and eyeglasses. The face images with an attribute constituted the 1st domain, and those without the attribute constituted the 2nd domain. No corresponding face images between the two domains were given. We resized the images to a fixed resolution and randomly sampled regions for training. The generative and discriminative models were both seven-layer deep convolutional neural networks.
The experiment results are shown in Figure 4. We randomly sampled two points in the 100-dimensional input noise space and visualized the rendered face images while traveling from one point to the other. We found that CoGAN generated pairs of corresponding faces, resembling those from the same person with and without an attribute. As we travel through the space, the faces gradually change from one person to another. Such deformations were consistent for both domains. Note that it is difficult to create a dataset with corresponding images for some attributes, such as blond hair, since the subjects would have to color their hair. An approach such as CoGAN that does not require corresponding images is preferable. We also noted that the number of faces with an attribute was often several times smaller than that without the attribute in the dataset. However, CoGAN learning was not hindered by the mismatch.
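The traversal of the noise space can be sketched as below. Linear interpolation and a uniform prior on $[-1, 1]$ are assumptions made for illustration, as are the function and variable names; each interpolated vector would be fed to both generators to render one pair of faces.

```python
import random

def interpolate(z_start, z_end, steps):
    """Return `steps` evenly spaced points on the segment from z_start to z_end."""
    if steps < 2:
        raise ValueError("need at least two steps to include both end points")
    path = []
    for s in range(steps):
        a = s / (steps - 1)  # goes from 0.0 to 1.0 inclusive
        path.append([(1.0 - a) * u + a * w for u, w in zip(z_start, z_end)])
    return path

rng = random.Random(0)
dim = 100  # dimensionality of the input noise space, as stated in the text
z_a = [rng.uniform(-1.0, 1.0) for _ in range(dim)]  # assumed uniform prior
z_b = [rng.uniform(-1.0, 1.0) for _ in range(dim)]
path = interpolate(z_a, z_b, steps=8)
```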
Color and Depth Images: We used the RGBD dataset and the NYU dataset for learning joint distributions of color and depth images. The RGBD dataset contains registered color and depth images of 300 objects captured by the Kinect sensor from different viewpoints. We partitioned the dataset into two equal-size non-overlapping subsets. The color images in the 1st subset were used for training $\text{GAN}_1$, while the depth images in the 2nd subset were used for training $\text{GAN}_2$. There were no corresponding depth and color images in the two subsets. The images in the RGBD dataset have different resolutions, so we resized them to a fixed resolution. The NYU dataset contains color and depth images captured from indoor scenes using the Kinect sensor. We used the 1449 processed depth images for the depth domain. The training images for the color domain were all the color images in the raw dataset except for those registered with the processed depth images. We resized both the depth and color images to a fixed resolution and randomly cropped patches for training.
Figure 5 shows the generation results. We found that the rendered color and depth images resembled corresponding RGB and depth image pairs even though no registered images existed in the two domains in the training set. The CoGAN recovered the appearance-depth correspondence without supervision.
Applications
In addition to rendering novel pairs of corresponding images for movie and game production, the CoGAN finds applications in the unsupervised domain adaptation and image transformation tasks.
Unsupervised Domain Adaptation (UDA): UDA concerns adapting a classifier trained in one domain to classify samples in a new domain where no labeled examples are available for re-training the classifier. Early works have explored ideas ranging from subspace learning to deep discriminative network learning. We show that CoGAN can be applied to the UDA problem. We studied the problem of adapting a digit classifier from the MNIST dataset to the USPS dataset. Due to domain shift, a classifier trained using one dataset achieves poor performance in the other. We followed an established experiment protocol, which randomly samples 2000 images from the MNIST dataset, denoted as $D_1$, and 1800 images from the USPS dataset, denoted as $D_2$, to define a UDA problem. The USPS digits have a different resolution, so we resized them to have the same resolution as the MNIST digits. We employed the CoGAN used for the digit generation task. For classifying digits, we attached a softmax layer to the last hidden layer of the discriminative models. We trained the CoGAN by jointly solving the digit classification problem in the MNIST domain, which used the images and labels in $D_1$, and the CoGAN learning problem, which used the images in both $D_1$ and $D_2$. This produced two classifiers: $c_1$ for MNIST and $c_2$ for USPS, each applying the softmax layer $c$ to the last hidden layer of the corresponding discriminative model. No label information in $D_2$ was used. Note that the two classifiers share their high-level layers due to weight sharing. We then applied $c_2$ to classify digits in the USPS dataset. The classifier adaptation from USPS to MNIST can be achieved in the same way. The learning hyperparameters were determined via a validation set. We reported the average accuracy over 5 trials with different randomly selected $D_1$ and $D_2$.
Table 1 reports the performance of the proposed CoGAN approach in comparison to the state-of-the-art methods for the UDA task. The results for the other methods were taken from previously reported results. We observed that CoGAN significantly outperformed the state-of-the-art methods. It improved the accuracy from 0.64 to 0.90, which translates to a 72% error reduction rate.
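The error reduction rate follows from the two accuracies: the error drops from $1 - 0.64 = 0.36$ to $1 - 0.90 = 0.10$, a relative reduction of $0.26 / 0.36 \approx 0.72$. A short check, with a hypothetical helper name:

```python
def error_reduction(acc_before, acc_after):
    """Relative reduction in classification error when accuracy improves."""
    err_before = 1.0 - acc_before
    err_after = 1.0 - acc_after
    return (err_before - err_after) / err_before

rate = error_reduction(0.64, 0.90)
print(f"{100 * rate:.0f}%")  # 72%
```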
Cross-Domain Image Transformation: Let $x_1$ be an image in the 1st domain. Cross-domain image transformation is about finding the corresponding image in the 2nd domain, $x_2$, such that the joint probability density, $p(x_1, x_2)$, is maximized. Let $\mathcal{L}$ be a loss function measuring the difference between two images. Given $g_1$ and $g_2$, the transformation can be achieved by first finding the random vector that generates the query image in the 1st domain:
$$z^{*} = \arg\min_{z} \mathcal{L}(g_1(z), x_1).$$
After finding $z^{*}$, one can apply $g_2$ to obtain the transformed image, $x_2 = g_2(z^{*})$. In Figure 6, we show several CoGAN cross-domain transformation results, computed by using the Euclidean loss function and the L-BFGS optimization algorithm. We found that the transformation was successful when the input image was covered by $g_1$ (that is, the input image could be generated by $g_1$), but it generated blurry images when this was not the case. To improve the coverage, we hypothesize that more training images and a better objective function are required, which are left as future work.
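The two-step transformation procedure can be sketched on a toy problem. Everything here is an illustrative assumption: the generators are tiny linear maps given by the hypothetical matrices `A1` and `A2`, the loss is the squared Euclidean distance, and plain gradient descent is substituted for L-BFGS to keep the example self-contained.

```python
def matvec(A, z):
    """Multiply matrix A (list of rows) by vector z."""
    return [sum(a * v for a, v in zip(row, z)) for row in A]

def find_latent(A1, x1, dim, lr=0.1, iters=500):
    """Gradient descent on ||A1 z - x1||^2 (a stand-in for L-BFGS)."""
    z = [0.0] * dim
    for _ in range(iters):
        residual = [p - q for p, q in zip(matvec(A1, z), x1)]
        # Gradient of the squared Euclidean loss: 2 * A1^T (A1 z - x1).
        grad = [2.0 * sum(A1[r][c] * residual[r] for r in range(len(A1)))
                for c in range(dim)]
        z = [zi - lr * gi for zi, gi in zip(z, grad)]
    return z

A1 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # toy generator for the 1st domain
A2 = [[2.0, 0.0], [0.0, -1.0]]              # toy generator for the 2nd domain

z_true = [0.5, -0.25]
x1 = matvec(A1, z_true)            # query image that the 1st generator covers
z_star = find_latent(A1, x1, dim=2)
x2 = matvec(A2, z_star)            # transformed image in the 2nd domain
```

In this toy case the query lies exactly in the range of the first generator, so the recovered vector matches the one that produced it; when a query is not covered, the minimizer only approximates it, which is consistent with the degraded results noted above.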
Related Work
Neural generative models have recently received an increasing amount of attention. Several approaches, including generative adversarial networks, variational autoencoders (VAE), attention models, moment matching, stochastic back-propagation, and diffusion processes, have shown that a deep network can learn an image distribution from samples. The learned networks can be used to generate novel images. Our work builds on the GAN framework. However, we studied a different problem, the problem of learning a joint distribution of multi-domain images. We were interested in whether a joint distribution of images in different domains can be learned from samples drawn separately from the marginal distributions of the individual domains. We showed that it is achievable via the proposed CoGAN framework. Note that our work is different from the Attribute2Image work, which is based on a conditional VAE model. The conditional model can be used to generate images of different styles, but it is unsuitable for generating images in two different domains such as color and depth image domains.
Following the original GAN work, several works improved the image generation quality of GAN, including a Laplacian pyramid implementation, a deeper architecture, and conditional models. Our work extended GAN to deal with joint distributions of images.
Our work is related to the prior works in multi-modal learning, including joint embedding space learning and multi-modal Boltzmann machines. These approaches can be used for generating corresponding samples in different domains only when correspondence annotations are given during training. The same limitation also applies to dictionary learning-based approaches. Our work is also related to the prior works in cross-domain image generation, which studied transforming an image in one style to the corresponding images in another style. However, we focus on learning the joint distribution in an unsupervised fashion, while those works focus on learning a transformation function directly in a supervised fashion.
Conclusion
We presented the CoGAN framework for learning a joint distribution of multi-domain images. We showed that by enforcing a simple weight-sharing constraint on the layers that are responsible for decoding abstract semantics, the CoGAN learned the joint distribution of images by just using samples drawn separately from the marginal distributions. In addition to convincing image generation results on faces and RGBD images, we also showed promising results of the CoGAN framework for the image transformation and unsupervised domain adaptation tasks.
Appendix A Additional Experiment Results
We applied CoGAN to a task of learning a joint distribution of images with different in-plane rotation angles. We note that this task is very different from the other tasks discussed in the paper. In the other tasks, the image contents in the same spatial region in the corresponding images are in direct correspondence. In this task, the content in one spatial region in one image domain is related to the content in a different spatial region in the other image domain. Through this experiment, we planned to verify whether CoGAN can learn a joint distribution of images related by a global transformation.
For this task, we partitioned the MNIST training set into two disjoint subsets. The first set consisted of the original digit images, which constituted the first domain. We applied a 90 degree rotation to all the digits in the second set to construct the second domain. There were no corresponding images in the two domains. The CoGAN architecture used for this task is shown in Table 2. Unlike in the other tasks, the generative models in the CoGAN were based on fully connected layers, and the discriminative models only shared the last layer. This design was due to the lack of spatial correspondence between the two domains. We used the same hyperparameters to train the CoGAN. The results are shown in Figure 7. We found that the CoGAN was able to capture the in-plane rotation. For the same noise input, the digit generated by $g_2$ is a 90 degree rotated version of the digit generated by $g_1$.
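The construction of the two domains can be sketched as follows. The rotation direction (counterclockwise here) and the simple first-half/second-half split are assumptions for illustration, and the helper names are hypothetical; images are represented as 2D lists.

```python
def rotate90(img):
    """Rotate a 2D list image by 90 degrees counterclockwise (assumed direction)."""
    rows, cols = len(img), len(img[0])
    return [[img[r][cols - 1 - c] for r in range(rows)] for c in range(cols)]

def build_domains(images):
    """Split images into two disjoint halves and rotate only the second half,
    so that no image appears in both domains."""
    half = len(images) // 2
    domain1 = images[:half]
    domain2 = [rotate90(im) for im in images[half:]]
    return domain1, domain2

images = [[[1, 2], [3, 4]], [[5, 6], [7, 8]]]
domain1, domain2 = build_domains(images)
print(domain2[0])  # [[6, 8], [5, 7]]
```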
A.2 Weight Sharing
From the tables, we observed that the pair image generation performance was positively correlated with the number of weight-sharing layers in the generative models. With more shared layers in the generative models, the rendered pairs of images more closely resembled true pairs drawn from the joint distribution. We noted that the pair image generation performance was uncorrelated with the number of weight-sharing layers in the discriminative models. However, we still preferred applying discriminator weight sharing because it reduces the total number of parameters.
A.3 Comparison with the Conditional Generative Adversarial Nets
We compared the CoGAN framework with the conditional generative adversarial networks (GAN) framework for joint image distribution learning. We designed a conditional GAN where the generative and discriminative models were identical to those used in the CoGAN in the digit experiments. The only difference was that the conditional GAN took an additional binary variable as input, which controlled the domain of the output image. The binary variable acted as a switch. When the value of the binary variable was zero, the generative model rendered images resembling those in the first domain. Otherwise, it rendered images resembling those in the second domain. The output layer of the discriminative model was a softmax layer with three neurons. If the first neuron was on, it meant the input to the discriminative model was a synthesized image from the generative model. If the second neuron was on, it meant the input was a real image from the first domain. If the third neuron was on, it meant the input was a real image from the second domain. The details of the conditional GAN network architecture are shown in Table 5.
Similarly to CoGAN learning, no correspondence was given during the conditional GAN learning. We applied the conditional GAN to the two digit generation tasks and hoped to answer whether a conditional model can be used to render corresponding images in two different domains without pairs of corresponding images in the training set. We used the same training data and hyperparameters as those used in the CoGAN learning. We trained the conditional GAN for 25000 iterations (we note that the generation performance of the conditional GAN did not change much after 5000 iterations) and used the trained network to render 10000 pairs of images in the two domains. Specifically, each pair of images was rendered with the same $z$ but with different conditional variable values. These images were used to compute the pair image generation performance of the conditional GAN measured by the average of the pixel agreement ratios. For each task, we trained the conditional GAN 5 times, each with a different random initialization of the network weights. We reported the average scores and the standard deviations.
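The switch input and the three-way discriminator target can be sketched as below. Appending the binary variable to the noise vector is an assumed way of feeding it to the generator, and all names are hypothetical.

```python
FAKE, REAL_DOMAIN1, REAL_DOMAIN2 = 0, 1, 2  # indices of the three softmax neurons

def generator_input(z, domain_bit):
    """Append the binary switch to the noise vector (assumed conditioning scheme)."""
    if domain_bit not in (0, 1):
        raise ValueError("domain_bit must be 0 (first domain) or 1 (second domain)")
    return list(z) + [float(domain_bit)]

def discriminator_target(is_real, domain_bit):
    """Class label for the three-neuron softmax output of the discriminator."""
    if not is_real:
        return FAKE
    return REAL_DOMAIN1 if domain_bit == 0 else REAL_DOMAIN2

def pair_inputs(z):
    """Inputs used to render one image pair: same z, different switch values."""
    return generator_input(z, 0), generator_input(z, 1)

in1, in2 = pair_inputs([0.3, -0.7])
```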
Appendix B CoGAN Learning Algorithm
We present the learning algorithm for the coupled generative adversarial networks in Algorithm 1. The algorithm is an extension of the learning algorithm for the generative adversarial networks (GAN) to the case of training two GANs with weight-sharing constraints. The convergence property follows the results shown by Goodfellow et al.
Appendix C Training Datasets
In Figure 12, we show several example images of the training images used for the pair image generation tasks in the experiment section. Table 10 contains the statistics of the training datasets for the experiments.
Appendix D Networks
In CoGAN, the generative models are based on fractional-length convolutional (FCONV) layers, while the discriminative models are based on standard convolutional (CONV) layers, except that the last two layers are fully-connected (FC) layers. Batch normalization (BN) layers are applied after each convolutional layer and are followed by parameterized rectified linear unit (PReLU) processing. Sigmoid units and hyperbolic tangent units are applied to the output layers of the generative models for generating images with the desired pixel value ranges.
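The element-wise nonlinearities named above can be sketched as follows. The PReLU negative slope of 0.25 is only an illustrative starting value (in practice it is a learned parameter), and the function names are hypothetical; the sketch shows why sigmoid and hyperbolic tangent outputs suit pixel ranges of $(0, 1)$ and $(-1, 1)$, respectively.

```python
import math

def prelu(x, a=0.25):
    """Parameterized ReLU: identity for positive inputs, slope a otherwise."""
    return x if x > 0 else a * x

def sigmoid(x):
    """Numerically stable logistic function with outputs in (0, 1)."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

hidden = [prelu(v) for v in [-2.0, 0.5]]           # [-0.5, 0.5]
unit_pixels = [sigmoid(v) for v in [-5.0, 0.0]]    # values inside (0, 1)
signed_pixels = [math.tanh(v) for v in [-3.0, 3.0]]  # values inside (-1, 1)
```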