DualGAN: Unsupervised Dual Learning for Image-to-Image Translation
Zili Yi, Hao Zhang, Ping Tan, Minglun Gong
Introduction
Many image processing and computer vision tasks, e.g., image segmentation, stylization, and abstraction, can be posed as image-to-image translation problems, which convert one visual representation of an object or scene into another. Conventionally, these tasks have been tackled separately due to their intrinsic disparities. Only in the past two years have general-purpose, end-to-end deep learning frameworks, most notably those utilizing fully convolutional networks (FCNs) and conditional generative adversarial nets (cGANs), been developed to enable a unified treatment of these tasks.
To date, these general-purpose methods have all been supervised and trained with a large number of labeled and matching image pairs. In practice, however, acquiring such training data can be time-consuming (e.g., with pixelwise or patchwise labeling) and even unrealistic. For example, while there are plenty of photos or sketches available, photo-sketch image pairs depicting the same people under the same pose are scarce. In other image translation settings, e.g., converting daylight scenes to night scenes, even though labeled and matching image pairs can be obtained with stationary cameras, moving objects in the scene often cause varying degrees of content discrepancies.
In this paper, we aim to develop an unsupervised learning framework for general-purpose image-to-image translation, which relies only on unlabeled image data, such as two sets of photos and sketches for the photo-to-sketch conversion task. The obvious technical challenge is how to train a translator without any data characterizing correct translations. Our approach is inspired by dual learning from natural language processing. Dual learning trains two “opposite” language translators (e.g., English-to-French and French-to-English) simultaneously by minimizing the reconstruction loss resulting from a nested application of the two translators. The two translators represent a primal-dual pair and the nested application forms a closed loop, allowing the application of reinforcement learning. Specifically, the reconstruction loss measured over monolingual data (either English or French) would generate informative feedback to train a bilingual translation model.
Our work develops a dual learning framework for image-to-image translation for the first time and differs from the original NLP dual learning method of Xia et al. in two main aspects. First, the NLP method relied on pre-trained (English and French) language models to indicate how confident one can be that the translator outputs are natural sentences in their respective target languages. With general-purpose processing in mind and the realization that such pre-trained models are difficult to obtain for many image translation tasks, our work develops GAN discriminators that are trained adversarially with the translators to capture domain distributions. Hence, we call our learning architecture DualGAN. Second, we employ FCNs as translators, which naturally accommodate the 2D structure of images, rather than sequence-to-sequence translation models such as LSTM or Gated Recurrent Unit (GRU).
Taking two sets of unlabeled images as input, each characterizing an image domain, DualGAN simultaneously learns two reliable image translators from one domain to the other and hence can operate on a wide variety of image-to-image translation tasks. The effectiveness of DualGAN is validated through comparison with both GAN (with an image-conditional generator and the original discriminator) and conditional GAN. The comparison results demonstrate that, for some applications, DualGAN can outperform supervised methods trained on labeled data.
Related work
Since the seminal work by Goodfellow et al. in 2014, a series of GAN-family methods have been proposed for a wide variety of problems. The original GAN can learn a generator to capture the distribution of real data by introducing an adversarial discriminator that evolves to discriminate between the real data and the fake. Soon after, various conditional GANs (cGANs) were proposed to condition the image generation on class labels, attributes, texts, and images.
Most image-conditional models were developed for specific applications such as super-resolution, texture synthesis, style transfer from normal maps to images, and video prediction, whereas few others aimed for general-purpose processing. The general-purpose solution for image-to-image translation proposed by Isola et al. requires a significant number of labeled image pairs. The unsupervised mechanism for cross-domain image conversion presented by Taigman et al. can train an image-conditional generator without paired images, but relies on a sophisticated pre-trained function that maps images from either domain to an intermediate representation, which requires labeled data in other formats.
Dual learning was first proposed by Xia et al. to reduce the requirement on labeled data in training English-to-French and French-to-English translators. The French-to-English translation is the dual task to English-to-French translation, and they can be trained side-by-side. The key idea of dual learning is to set up a dual-learning game which involves two agents, each of whom understands only one language and can evaluate how likely the translated sentences are to be natural sentences in the target language and to what extent the reconstructed sentences are consistent with the originals. Such a mechanism is played alternately on both sides, allowing translators to be trained from monolingual data only.
Despite the lack of parallel bilingual data, two types of feedback signals can be generated: the membership score, which evaluates the likelihood of the translated texts belonging to the target language, and the reconstruction error, which measures the disparity between the reconstructed sentences and the originals. Both signals are assessed with the assistance of application-specific domain knowledge, i.e., the pre-trained English and French language models.
In our work, we aim for a general-purpose solution for image-to-image conversion and hence do not utilize any domain-specific knowledge or pre-trained domain representations. Instead, we use a domain-adaptive GAN discriminator to evaluate the membership score of translated samples, whereas the reconstruction error is measured as the mean of absolute difference between the reconstructed and original images within each image domain.
In CycleGAN, a concurrent work by Zhu et al., the same idea for unpaired image-to-image translation is proposed, where the primal-dual relation in DualGAN is referred to as a cyclic mapping and their cycle consistency loss is essentially the same as our reconstruction loss. The superiority of CycleGAN has been demonstrated on several tasks where paired training data hardly exist, e.g., in object transfiguration and painting style and season transfer.
Recent work by Liu and Tuzel, which we refer to as coupled GAN or CoGAN, also trains two GANs together to solve image translation problems without paired training data. Unlike DualGAN or CycleGAN, the two GANs in CoGAN are not linked to enforce cycle consistency. Instead, CoGAN learns a joint distribution over images from two domains. By sharing weight parameters corresponding to high-level semantics in both generative and discriminative networks, CoGAN can force the two GANs to interpret these image semantics in the same way. However, the weight-sharing assumption in CoGAN and similar approaches does not lead to effective general-purpose solutions because its applicability is task-dependent, leading to unnatural image translation results, as shown in comparative studies by CycleGAN.
DualGAN and CycleGAN both aim for general-purpose image-to-image translations without requiring a joint representation to bridge the two image domains. In addition, DualGAN trains both primal and dual GANs at the same time, allowing a reconstruction error term to be used to generate informative feedback signals.
Method
Given two sets of unlabeled and unpaired images sampled from domains $U$ and $V$, respectively, the primal task of DualGAN is to learn a generator $G_A: U \rightarrow V$ that maps an image $u \in U$ to an image $v \in V$, while the dual task is to train an inverse generator $G_B: V \rightarrow U$. To realize this, we employ two GANs, the primal GAN and the dual GAN. The primal GAN learns the generator $G_A$ and a discriminator $D_A$ that discriminates between $G_A$'s fake outputs and real members of domain $V$. Analogously, the dual GAN learns the generator $G_B$ and a discriminator $D_B$. The overall architecture and data flow are illustrated in Fig. 1.
As shown in Fig. 1, image $u \in U$ is translated to domain $V$ using $G_A$. How well the translation $G_A(u, z)$ fits in $V$ is evaluated by $D_A$, where $z$ is random noise, and so is $z'$ that appears below. $G_A(u, z)$ is then translated back to domain $U$ using $G_B$, which outputs $G_B(G_A(u, z), z')$ as the reconstructed version of $u$. Similarly, $v \in V$ is translated to $U$ as $G_B(v, z')$ and then reconstructed as $G_A(G_B(v, z'), z)$. The discriminator $D_A$ is trained with $v$ as positive samples and $G_A(u, z)$ as negative examples, whereas $D_B$ takes $u$ as positive and $G_B(v, z')$ as negative. Generators $G_A$ and $G_B$ are optimized to emulate “fake” outputs to blind the corresponding discriminators $D_A$ and $D_B$, as well as to minimize the two reconstruction losses $\|G_A(G_B(v, z'), z) - v\|$ and $\|G_B(G_A(u, z), z') - u\|$.
As in the traditional GAN, the objective of the discriminators is to discriminate the generated fake samples from the real ones. Here, however, we use the loss format advocated by Wasserstein GAN (WGAN) rather than the sigmoid cross-entropy loss used in the original GAN. The former has been shown to perform better in terms of generator convergence and sample quality, as well as in improving the stability of the optimization. The corresponding loss functions used in $D_A$ and $D_B$ are defined as:
$$l^d_A(u, v) = D_A(G_A(u, z)) - D_A(v), \tag{1}$$
$$l^d_B(u, v) = D_B(G_B(v, z')) - D_B(u). \tag{2}$$
The same loss function is used for both generators $G_A$ and $G_B$ as they share the same objective. Previous works on conditional image synthesis found it beneficial to replace $L_2$ distance with $L_1$, since the former often leads to blurriness. Hence, we adopt $L_1$ distance to measure the recovery error, which is added to the GAN objective to force the translated samples to obey the domain distribution:
$$l^g(u, v) = \lambda_U \|u - G_B(G_A(u, z), z')\| + \lambda_V \|v - G_A(G_B(v, z'), z)\| - D_A(G_A(u, z)) - D_B(G_B(v, z')), \tag{3}$$
where $u \in U$, $v \in V$, and $\lambda_U$, $\lambda_V$ are two constant parameters. Depending on the application, $\lambda_U$ and $\lambda_V$ are typically set to a value within $[100.0, 1000.0]$. If $U$ contains natural images and $V$ does not (e.g., aerial photo-maps), we find it more effective to use a smaller $\lambda_U$ than $\lambda_V$.
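As a minimal sketch of how the two discriminator losses and the shared generator loss could be evaluated for one pair of samples, the Python below treats images as flat lists of floats and the generators and discriminators as plain callables. The names `l1_distance` and `dualgan_losses` are illustrative assumptions and this is not the authors' implementation; the noise inputs are omitted because they would live inside the generator callables. The reconstruction error is taken as the mean absolute difference.

```python
from typing import Callable, Dict, List

Image = List[float]                  # toy stand-in: an image as a flat list of pixels
Generator = Callable[[Image], Image]
Critic = Callable[[Image], float]


def l1_distance(a: Image, b: Image) -> float:
    """Mean absolute difference between two images of equal size."""
    if len(a) != len(b):
        raise ValueError("images must have the same number of pixels")
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def dualgan_losses(u: Image, v: Image,
                   g_a: Generator, g_b: Generator,
                   d_a: Critic, d_b: Critic,
                   lambda_u: float, lambda_v: float) -> Dict[str, float]:
    """Evaluate both critic losses and the shared generator loss for one (u, v)."""
    fake_v = g_a(u)        # u translated into the second domain
    fake_u = g_b(v)        # v translated into the first domain
    rec_u = g_b(fake_v)    # u after a round trip through both generators
    rec_v = g_a(fake_u)    # v after a round trip through both generators

    # Wasserstein-style critic losses: score on the fake minus score on the real.
    loss_d_a = d_a(fake_v) - d_a(v)
    loss_d_b = d_b(fake_u) - d_b(u)

    # Generator loss: weighted reconstruction errors minus critic scores on fakes.
    loss_g = (lambda_u * l1_distance(u, rec_u)
              + lambda_v * l1_distance(v, rec_v)
              - d_a(fake_v) - d_b(fake_u))
    return {"d_a": loss_d_a, "d_b": loss_d_b, "g": loss_g}
```

Passing zero for both weights leaves only the adversarial terms in the generator loss, which is one way to obtain a plain image-conditional GAN baseline from the same code.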
Network configuration
DualGAN is constructed with identical network architecture for the two generators $G_A$ and $G_B$. The generator is configured with an equal number of downsampling (pooling) and upsampling layers. In addition, we configure the generator with skip connections between mirrored downsampling and upsampling layers, making it a U-shaped net. Such a design enables low-level information to be shared between input and output, which is beneficial since many image translation problems implicitly assume alignment between image structures in the input and output (e.g., object shapes, textures, clutter, etc.). Without the skip layers, information from all levels has to pass through the bottleneck, typically causing significant loss of high-frequency information. Furthermore, following prior work, we do not explicitly provide the noise vectors $z$, $z'$. Instead, they are provided only in the form of dropout, applied to several layers of our generators at both training and test phases.
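One way to see why an equal number of downsampling and upsampling layers makes mirrored skip connections well-defined is to track spatial sizes through the network. The sketch below assumes each downsampling layer halves and each upsampling layer doubles the feature-map size; the function name `u_shape_plan` and the factor of two are illustrative assumptions, not details given in the text.

```python
from typing import List, Tuple


def u_shape_plan(input_size: int, depth: int) -> Tuple[List[int], List[int], List[Tuple[int, int]]]:
    """Return encoder sizes, decoder sizes, and mirrored skip pairs.

    Assumes every downsampling layer halves the spatial size and every
    upsampling layer doubles it (a hypothetical but common configuration).
    """
    if depth < 1:
        raise ValueError("depth must be at least 1")
    if input_size % (2 ** depth) != 0:
        raise ValueError("input_size must be divisible by 2**depth")

    # Spatial size after each of the `depth` downsampling layers.
    encoder = [input_size // (2 ** (i + 1)) for i in range(depth)]
    # Spatial size after each of the `depth` upsampling layers.
    decoder = [encoder[-1] * (2 ** (i + 1)) for i in range(depth)]

    # Encoder layer i is paired with decoder layer depth - 1 - i, whose input
    # has the same spatial size. The deepest encoder layer is the bottleneck
    # and feeds decoder layer 0 directly, so it needs no skip connection.
    skips = [(i, depth - 1 - i) for i in range(depth - 1)]
    return encoder, decoder, skips
```

For example, `u_shape_plan(256, 3)` pairs the first encoder layer with the last decoder layer, so that fine spatial detail can bypass the bottleneck.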
For discriminators, we employ the Markovian PatchGAN architecture explored in prior work, which assumes independence between pixels separated by more than a specific patch size and models images only at the patch level rather than over the full image. Such a configuration is effective in capturing local high-frequency features such as texture and style, but less so in modeling global distributions. It fulfills our needs well, since the recovery loss encourages preservation of global and low-frequency information and the discriminators are designated to capture local high-frequency information. The effectiveness of this configuration has been verified on various translation tasks. Following prior work, we run this discriminator convolutionally across the image, averaging all responses to provide the ultimate output. An extra advantage of such a scheme is that it requires fewer parameters, runs faster, and places no constraints on the size of the input image. The patch size at which the discriminator operates is fixed at $70 \times 70$, and the image resolutions were mostly $256 \times 256$, the same as pix2pix.
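The patch size of a convolutional discriminator is the receptive field of one output unit, which can be computed from the kernel sizes and strides alone. The sketch below does this and also averages per-patch responses into a single output. The layer stack in `PATCHGAN_LAYERS` is an assumption (the commonly used pix2pix-style configuration of five 4x4 convolutions), not something stated in the text, and the function names are illustrative.

```python
from typing import List, Sequence, Tuple


def receptive_field(layers: Sequence[Tuple[int, int]]) -> int:
    """Receptive field, in input pixels, of a single output unit.

    `layers` lists (kernel_size, stride) from the first to the last convolution.
    """
    rf = 1
    # Walk backwards from the output: each layer widens the field seen so far.
    for kernel, stride in reversed(layers):
        rf = (rf - 1) * stride + kernel
    return rf


def patch_discriminator_output(patch_scores: List[List[float]]) -> float:
    """Average the per-patch responses into one scalar output."""
    flat = [score for row in patch_scores for score in row]
    if not flat:
        raise ValueError("patch_scores must not be empty")
    return sum(flat) / len(flat)


# Assumed stack: three stride-2 layers followed by two stride-1 layers,
# all with 4x4 kernels.
PATCHGAN_LAYERS = [(4, 2), (4, 2), (4, 2), (4, 1), (4, 1)]
```

Under this assumed stack, `receptive_field(PATCHGAN_LAYERS)` evaluates to 70. Because the network is fully convolutional, a larger input simply yields a larger grid of patch scores to average.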
Training procedure
To optimize the DualGAN networks, we follow the training procedure proposed in WGAN; see Alg. 1. We train the discriminators for $n_{critic}$ steps, then the generators for one step. We employ mini-batch stochastic gradient descent and apply the RMSProp solver, as momentum-based methods such as Adam would occasionally cause instability, and RMSProp is known to perform well even on highly non-stationary problems. We typically set the number of critic iterations per generator iteration, $n_{critic}$, to 2-4 and the batch size to 1-4, without noticeable differences in effectiveness in the experiments. The clipping parameter $c$ is normally set in $[0.01, 0.1]$, varying by application.
Training for traditional GANs needs to carefully balance the generator and the discriminator, since, as the discriminator improves, the sigmoid cross-entropy loss is locally saturated and may lead to vanishing gradients. Unlike in traditional GANs, the Wasserstein loss is differentiable almost everywhere, resulting in a better discriminator. At each iteration, the generators are not trained until the discriminators have been trained for $n_{critic}$ steps. Such a procedure enables the discriminators to provide more reliable gradient information.
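The alternating schedule described above can be sketched as a small driver loop. This is a simplified illustration under stated assumptions: the critic parameters are a flat list of floats standing in for both discriminators, `critic_step` and `generator_step` are hypothetical callbacks that would wrap one optimizer update each (e.g., RMSProp), and weight clipping follows the WGAN recipe.

```python
from typing import Callable, List


def clip_weights(weights: List[float], c: float) -> List[float]:
    """Clamp every critic weight to the interval [-c, c]."""
    return [max(-c, min(c, w)) for w in weights]


def train(num_iterations: int, n_critic: int, c: float,
          critic_weights: List[float],
          critic_step: Callable[[List[float]], List[float]],
          generator_step: Callable[[], None]) -> List[float]:
    """Alternate n_critic critic updates with one generator update."""
    for _ in range(num_iterations):
        for _ in range(n_critic):
            # One optimizer update on the critic losses ...
            critic_weights = critic_step(critic_weights)
            # ... followed by clipping the updated weights.
            critic_weights = clip_weights(critic_weights, c)
        # Generators are updated only after the critics have had n_critic steps.
        generator_step()
    return critic_weights
```

With `num_iterations=3` and `n_critic=2`, the critic callback runs six times and the generator callback three times, which matches the ratio described in the text.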
Experimental results and evaluation
To assess the capability of DualGAN in general-purpose image-to-image translation, we conduct experiments on a variety of tasks, including photo-sketch conversion, label-image translation, and artistic stylization.
To compare DualGAN with GAN and cGAN, four labeled datasets are used: PHOTO-SKETCH, DAY-NIGHT, LABEL-FACADES, and AERIAL-MAPS, which was directly captured from Google Maps. These datasets consist of corresponding images between two domains; they serve as ground truth (GT) and can also be used for supervised learning. However, none of these datasets could guarantee accurate feature alignment at the pixel level. For example, the sketches in the PHOTO-SKETCH dataset were drawn by artists and do not accurately align with the corresponding photos; moving objects and cloud pattern changes often show up in the DAY-NIGHT dataset; and the labels in the LABEL-FACADES dataset are not always precise. This highlights, in part, the difficulty in obtaining high-quality matching image pairs.
DualGAN enables us to utilize abundant unlabeled image sources from the Web. Two unlabeled and unpaired datasets are also tested in our experiments. The MATERIAL dataset includes images of objects made of different materials, e.g., stone, metal, plastic, fabric, and wood. These images were manually selected from Flickr and cover a variety of illumination conditions, compositions, color, texture, and material sub-types. This dataset was initially used for material recognition, but is applied here for material transfer. The OIL-CHINESE painting dataset includes artistic paintings of two disparate styles: oil and Chinese. All images were crawled from search engines and they contain images with varying quality, format, and size. We reformat, crop, and resize the images for training and evaluation. In both of these datasets, no correspondence is available between images from different domains.
Qualitative evaluation
Using the four labeled datasets, we first compare DualGAN with GAN and cGAN on the following translation tasks: day→night (Figure 2), labels↔facade (Figures 3 and 10), face photo↔sketch (Figures 4 and 5), and map↔aerial photo (Figures 8 and 9). In all these tasks, cGAN was trained with labeled (i.e., paired) data, where we ran the model and code provided by its authors and chose the optimal loss function for each task: $L_1$ loss for facade→label and $L_1$+cGAN loss for the other tasks. In contrast, DualGAN and GAN were trained in an unsupervised way, i.e., we decouple the image pairs and then reshuffle the data. The results of GAN were generated using our approach by setting the reconstruction weights $\lambda_U = \lambda_V = 0.0$ in eq. (3), noting that this GAN is different from the original GAN model as it employs a conditional generator.
All three models were trained on the same training datasets and tested on novel data that do not overlap with the training data. All training was carried out on a single GeForce GTX Titan X GPU. At test time, all models ran in well under a second on this GPU.
Compared to GAN, in almost all cases, DualGAN produces results that are less blurry, contain fewer artifacts, better preserve content structures in the inputs, and better capture features (e.g., texture, color, and/or style) of the target domain. We attribute the improvements to the reconstruction loss, which forces the inputs to be reconstructable from the outputs through the dual generator and strengthens the feedback signals that encode the targeted distribution.
In many cases, DualGAN also compares favorably to the supervised cGAN in terms of sharpness of the outputs and faithfulness to the input images; see Figures 2, 3, 4, 5, and 8. This is encouraging since the supervision in cGAN does utilize additional image and pixel correspondences. On the other hand, when translating between photos and semantic-based labels, such as map↔aerial and label↔facades, it is often impossible to infer the correspondences between pixel colors and labels based on the targeted distribution alone. As a result, DualGAN may map pixels to wrong labels (see Figures 9 and 10) or labels to wrong colors/textures (see Figures 3 and 8).
Figures 6 and 7 show image translation results obtained using the two unlabeled datasets, including oil↔Chinese, plastic→metal, metal→stone, leather→fabric, as well as wood↔plastic. The results demonstrate that visually convincing images can be generated by DualGAN when no corresponding images can be found in the target domains. In addition, the DualGAN results generally contain fewer artifacts than those from GAN.
Quantitative evaluation
To quantitatively evaluate DualGAN, we set up two user studies through Amazon Mechanical Turk (AMT). The “material perceptual” test evaluates the material transfer results, in which we mix the outputs from all material transfer tasks and let the Turkers choose the best match based on which material they believe the objects in the image are made of. For a total of 176 output images, each was evaluated by ten Turkers. An output image is rated as a success if at least three Turkers selected the target material type. Success rates of various material transfer results using different approaches are summarized in Table 1, showing that DualGAN outperforms GAN by a large margin.
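The success criterion of the “material perceptual” test is simple enough to state as code. The sketch below is illustrative only: the data layout (a dictionary from image identifier to the list of material names chosen by the raters) and the function names are assumptions, and no ratings from the actual study are reproduced.

```python
from typing import Dict, List


def is_success(votes: List[str], target: str, min_votes: int = 3) -> bool:
    """An output is a success if at least `min_votes` raters chose the target material."""
    return sum(1 for vote in votes if vote == target) >= min_votes


def success_rate(ratings: Dict[str, List[str]], targets: Dict[str, str]) -> float:
    """Fraction of output images that are rated as successes."""
    if not ratings:
        raise ValueError("ratings must not be empty")
    successes = sum(is_success(votes, targets[image_id])
                    for image_id, votes in ratings.items())
    return successes / len(ratings)
```

In the study described above, each list of votes would hold ten entries, one per Turker.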
In addition, we run the AMT “realness score” evaluation for sketch→photo, label map→facades, maps→aerial photo, and day→night translations. To eliminate potential bias, for each of the four evaluations, we randomly shuffle real photos and outputs from all three approaches before showing them to Turkers. Each image is shown to 20 Turkers, who were asked to score the image based on the extent to which the synthesized photo looks real. The “realness” score ranges from 0 (totally missing), 1 (bad), 2 (acceptable), 3 (good), to 4 (compelling). The average scores of the different approaches on the various tasks are then computed and shown in Table 2. The AMT study results show that DualGAN outperforms GAN on all tasks and outperforms cGAN on two tasks as well. This indicates that cGAN has little tolerance to misalignment and inconsistency between image pairs, but the additional pixel-level correspondence does help cGAN correctly map labels to colors and textures.
Finally, we compute the segmentation accuracies for the facades→label and aerial→map tasks, as reported in Tables 3 and 4. The comparison shows that DualGAN is outperformed by cGAN, which is expected as it is difficult to infer proper labeling without image correspondence information from the training data.
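For readers unfamiliar with segmentation metrics, the sketch below computes two commonly used ones, per-pixel accuracy and per-class intersection-over-union, on flattened label maps. Which exact metrics appear in Tables 3 and 4 is not stated in the text here, so the choice of these two is an assumption, as are the function names.

```python
from typing import Dict, List


def per_pixel_accuracy(pred: List[int], truth: List[int]) -> float:
    """Fraction of pixels whose predicted label matches the ground truth."""
    if len(pred) != len(truth) or not truth:
        raise ValueError("label maps must be non-empty and of equal length")
    return sum(1 for p, t in zip(pred, truth) if p == t) / len(truth)


def class_iou(pred: List[int], truth: List[int]) -> Dict[int, float]:
    """Intersection-over-union for every class present in either label map."""
    if len(pred) != len(truth) or not truth:
        raise ValueError("label maps must be non-empty and of equal length")
    result: Dict[int, float] = {}
    for label in set(pred) | set(truth):
        inter = sum(1 for p, t in zip(pred, truth) if p == label and t == label)
        union = sum(1 for p, t in zip(pred, truth) if p == label or t == label)
        result[label] = inter / union  # union >= 1 because the label occurs somewhere
    return result
```

A translator that assigns plausible-looking but wrong labels would score poorly on both metrics even if its outputs look realistic, which is consistent with the gap to cGAN reported above.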
Conclusion
We propose DualGAN, a novel unsupervised dual learning framework for general-purpose image-to-image translation. The unsupervised characteristic of DualGAN enables many real-world applications, as demonstrated in this work as well as in the concurrent work CycleGAN. Experimental results suggest that the DualGAN mechanism can significantly improve the outputs of GAN for various image-to-image translation tasks. With unlabeled data only, DualGAN can generate comparable or even better outputs than conditional GAN, which is trained with labeled data providing image and pixel-level correspondences.
On the other hand, our method is outperformed by conditional GAN (cGAN) for certain tasks that involve semantics-based labels. This is due to the lack of pixel and label correspondence information, which cannot be inferred from the target distribution alone. In the future, we intend to investigate whether this limitation can be lifted with the use of a small amount of labeled data as a warm start.
We thank all the anonymous reviewers for their valuable comments and suggestions. The first author is a PhD student from the Memorial University of Newfoundland and has been visiting SFU since 2016. This work was supported in part by grants from the Natural Sciences and Engineering Research Council (NSERC) of Canada (No. 611370, 2017-06086).
Appendix
More results can be found in Figures 11-17. The source code of DualGAN has been released at duxingren14/DualGAN on GitHub.