Augmented CycleGAN: Learning Many-to-Many Mappings from Unpaired Data

Amjad Almahairi, Sai Rajeswar, Alessandro Sordoni, Philip Bachman, Aaron Courville

Introduction

The problem of learning mappings between domains from unpaired data has recently received increasing attention, especially in the context of image-to-image translation (Zhu et al., 2017a; Kim et al., 2017; Liu et al., 2017). This problem is important because, in some cases, paired information may be scarce or otherwise difficult to obtain. For example, consider tasks like face transfiguration (male to female), where obtaining explicit pairs would be difficult as it would require artistic authoring. An effective unsupervised model may help when learning from relatively few paired examples, as compared to training strictly from the paired examples. Intuitively, forcing inter-domain mappings to be (approximately) invertible by a model of limited capacity acts as a strong regularizer.

Motivated by the success of Generative Adversarial Networks (GANs) in image generation (Goodfellow et al., 2014; Radford et al., 2015), existing unsupervised mapping methods such as CycleGAN (Zhu et al., 2017a) learn a generator which produces images in one domain given images from the other. Without the use of pairing information, there are many possible mappings that could be inferred. To reduce the space of the possible mappings, these models are typically trained with a cycle-consistency constraint which enforces a strong connection across domains, by requiring that mapping an image from the source domain to the target domain and then back to source will result in the same starting image. This framework has been shown to learn convincing mappings across image domains and proved successful in a variety of related applications (Tung et al., 2017; Wolf et al., 2017; Hoffman et al., 2017).

One major limitation of CycleGAN is that it only learns one-to-one mappings, i.e. the model associates each input image with a single output image. We believe that most relationships across domains are more complex, and better characterized as many-to-many. For example, consider mapping silhouettes of shoes to images of shoes. While the mapping that CycleGAN learns can be superficially convincing (e.g. it produces a single reasonable shoe with a particular style), we would like to learn a mapping that can capture diversity of the output (e.g. produces multiple shoes with different styles). The limits of one-to-one mappings are more dramatic when the source domain and target domain substantially differ. For instance, it would be difficult to learn a CycleGAN model when the two domains are descriptive facial attributes and images of faces.

We propose a model for learning many-to-many mappings between domains from unpaired data. Specifically, we “augment” each domain with auxiliary latent variables and extend CycleGAN’s training procedure to the augmented spaces. The mappings in our model take as input a sample from the source domain and a latent variable, and output both a sample in the target domain and a latent variable (Fig. 1b). The learned mappings are one-to-one in the augmented space, but many-to-many in the original domains after marginalizing over the latent variables.

Our contributions are as follows. (1) We introduce the Augmented CycleGAN model for learning many-to-many mappings across domains in an unsupervised way. (2) We show that our model can learn mappings which produce a diverse set of outputs for each input. (3) We show that our model can learn mappings across substantially different domains, and we apply it in a semi-supervised setting for mapping between faces and attributes with competitive results.

Unsupervised Learning of Mappings Between Domains

Given two domains $A$ and $B$, we assume there exists a mapping, potentially many-to-many, between their elements. The objective is to recover this mapping using unpaired samples from distributions $p_d(a)$ and $p_d(b)$ in each domain. This can be formulated as a conditional generative modeling task where we try to estimate the true conditionals $p(a|b)$ and $p(b|a)$ using samples from the true marginals. An important assumption here is that elements in domains $A$ and $B$ are highly dependent; otherwise, it is unlikely that the model would uncover a meaningful relationship without any pairing information.

2.2 CycleGAN Model

The CycleGAN model (Zhu et al., 2017a) estimates these conditionals using two mappings $G_{AB}: A \mapsto B$ and $G_{BA}: B \mapsto A$, parameterized by neural networks, which satisfy the following constraints:

Marginal matching: The output of each mapping should match the empirical distribution of the target domain, when marginalized over the source domain.

Cycle-consistency: Mapping an element from one domain to the other, and then back, should produce a sample close to the original element.

Marginal matching in CycleGAN is achieved using the generative adversarial networks (GAN) framework (Goodfellow et al., 2014). Mappings $G_{AB}$ and $G_{BA}$ are given by neural networks trained to fool adversarial discriminators $D_B$ and $D_A$, respectively. Enforcing marginal matching on target domain $B$, marginalized over source domain $A$, involves minimizing an adversarial objective with respect to $G_{AB}$:

$$\mathcal{L}_{\text{GAN}}^{B}(G_{AB}, D_B) = \mathbb{E}_{b \sim p_d(b)}\big[\log D_B(b)\big] + \mathbb{E}_{a \sim p_d(a)}\big[\log\big(1 - D_B(G_{AB}(a))\big)\big], \tag{1}$$

while the discriminator $D_B$ is trained to maximize it. A similar adversarial loss $\mathcal{L}_{\text{GAN}}^{A}(G_{BA}, D_A)$ is defined for marginal matching in the reverse direction.

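Cycle-consistency enforces that, when starting from a sample $a$ from $A$, the reconstruction $a' = G_{BA}(G_{AB}(a))$ remains close to the original $a$. For image domains, closeness between $a$ and $a'$ is typically measured with $L_1$ or $L_2$ norms. When using the $L_1$ norm, cycle-consistency starting from $A$ can be formulated as:

$$\mathcal{L}_{\text{CYC}}^{A}(G_{AB}, G_{BA}) = \mathbb{E}_{a \sim p_d(a)} \big\| G_{BA}(G_{AB}(a)) - a \big\|_1. \tag{2}$$

Cycle-consistency starting from $B$ is defined similarly. The full CycleGAN objective is given by:

$$\mathcal{L}_{\text{GAN}}^{A}(G_{BA}, D_A) + \mathcal{L}_{\text{GAN}}^{B}(G_{AB}, D_B) + \gamma \mathcal{L}_{\text{CYC}}^{A}(G_{AB}, G_{BA}) + \gamma \mathcal{L}_{\text{CYC}}^{B}(G_{AB}, G_{BA}), \tag{3}$$

where $\gamma$ is a hyper-parameter that balances between marginal matching and cycle-consistency.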
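As an illustration of how the terms of this objective fit together, the following NumPy sketch evaluates the adversarial and cycle-consistency terms on toy data. The linear mappings, the logistic discriminators, the value of `gamma`, and all function names are assumptions made for illustration; they are not the architectures or settings used in the paper.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gan_loss(d_real, d_fake, eps=1e-12):
    """Batch estimate of E[log D(real)] + E[log(1 - D(fake))]."""
    return np.mean(np.log(d_real + eps)) + np.mean(np.log(1.0 - d_fake + eps))

def cycle_loss(x, x_rec):
    """L1 cycle-consistency: batch mean of ||x_rec - x||_1."""
    return np.mean(np.sum(np.abs(x_rec - x), axis=1))

rng = np.random.default_rng(0)
dim = 4
a = rng.normal(size=(32, dim))          # unpaired batch from domain A
b = rng.normal(size=(32, dim)) + 1.0    # unpaired batch from domain B

# Toy mappings (assumption): G_AB is an invertible linear map, G_BA its inverse.
W = np.eye(dim) + 0.1 * rng.normal(size=(dim, dim))
W_inv = np.linalg.inv(W)
G_AB = lambda x: x @ W
G_BA = lambda y: y @ W_inv

# Toy logistic discriminators (assumption) with fixed random weights.
w_a, w_b = rng.normal(size=dim), rng.normal(size=dim)
D_A = lambda x: sigmoid(x @ w_a)
D_B = lambda y: sigmoid(y @ w_b)

gamma = 10.0  # hypothetical weight on the cycle terms
cyc_a = cycle_loss(a, G_BA(G_AB(a)))
cyc_b = cycle_loss(b, G_AB(G_BA(b)))
full_objective = (
    gan_loss(D_A(a), D_A(G_BA(b)))
    + gan_loss(D_B(b), D_B(G_AB(a)))
    + gamma * (cyc_a + cyc_b)
)
```

Because the toy mappings are exact inverses, both cycle terms vanish up to floating-point error, while the adversarial terms still depend on how well the mapped samples match each target marginal.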

The success of CycleGAN can be attributed to the complementary roles of marginal matching and cycle-consistency in its objective. Marginal matching encourages generating realistic samples in each domain. Cycle-consistency encourages a tight relationship between domains. It may also help prevent multiple items from one domain mapping to a single item from the other, analogous to the troublesome mode collapse in adversarial generators (Li et al., 2017).

2.3 Limitations of CycleGAN

A fundamental weakness of the CycleGAN model is that it learns deterministic mappings. In CycleGAN, and in other similar models (Kim et al., 2017; Yi et al., 2017), the conditionals between domains correspond to delta functions: $\hat{p}(a|b) = \delta(G_{BA}(b))$ and $\hat{p}(b|a) = \delta(G_{AB}(a))$, and cycle-consistency forces the learned mappings to be inverses of each other. When faced with complex cross-domain relationships, this results in CycleGAN learning an arbitrary one-to-one mapping instead of capturing the true, structured conditional distribution more faithfully. Deterministic mappings are also an obstacle to optimizing cycle-consistency when the domains differ substantially in complexity, in which case mapping from one domain (e.g. class labels) to the other (e.g. real images) is generally one-to-many. Next, we discuss how to extend CycleGAN to capture more expressive relationships across domains.

2.4 CycleGAN with Stochastic Mappings

A straightforward approach for extending CycleGAN to model many-to-many relationships is to equip it with stochastic mappings between $A$ and $B$. Let $Z$ be a latent space with a standard Gaussian prior $p(z)$ over its elements. We define mappings $G_{AB}: A \times Z \mapsto B$ and $G_{BA}: B \times Z \mapsto A$ (to avoid clutter in notation, we reuse the symbols of the deterministic mappings). Each mapping takes as input a vector of auxiliary noise and a sample from the source domain, and generates a sample in the target domain. Therefore, by sampling different $z \sim p(z)$, we could in principle generate multiple $b$'s conditioned on the same $a$ and vice-versa. We can write the marginal matching loss on domain $B$ as:

$$\mathcal{L}_{\text{GAN}}^{B}(G_{AB}, D_B) = \mathbb{E}_{b \sim p_d(b)}\big[\log D_B(b)\big] + \mathbb{E}_{a \sim p_d(a),\, z \sim p(z)}\big[\log\big(1 - D_B(G_{AB}(a, z))\big)\big]. \tag{4}$$

Cycle-consistency starting from $A$ is now given by:

$$\mathcal{L}_{\text{CYC}}^{A}(G_{AB}, G_{BA}) = \mathbb{E}_{a \sim p_d(a),\, z_1, z_2 \sim p(z)} \big\| G_{BA}(G_{AB}(a, z_1), z_2) - a \big\|_1. \tag{5}$$

The full training loss is analogous to the full CycleGAN objective in Eq. 3. We refer to this model as Stochastic CycleGAN.

In principle, stochastic mappings can model multi-modal conditionals, and hence generate a richer set of outputs than deterministic mappings. However, Stochastic CycleGAN suffers from a fundamental flaw: its cycle-consistency loss (Eq. 5) encourages the mappings to ignore the latent $z$. Specifically, the unimodality assumption implicit in the reconstruction error of Eq. 5 forces the mapping $G_{BA}$ to be many-to-one when cycling $A \rightarrow B \rightarrow A'$, since any $b$ generated for a given $a$ must map to $a' = G_{BA}(b, z) \approx a$, for all $z$. For the cycle $B \rightarrow A \rightarrow B'$, $G_{AB}$ is similarly forced to be many-to-one. The only way for $G_{BA}$ and $G_{AB}$ to be both many-to-one and mutual inverses is if they collapse to being (roughly) one-to-one. We could possibly mitigate this degeneracy by introducing a VAE-like encoder and exchanging the $L_1$ error in Eq. 5 for a more complex variational bound on conditional log-likelihood. In the next section, we discuss an alternative approach to learning complex, stochastic mappings between domains.
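The collapse argument can be checked numerically in a scalar toy model. In the sketch below, which is an illustration under simplifying assumptions and not the paper's model, the mappings are taken to be $G_{AB}(a, z) = a + w_{ab} z$ and $G_{BA}(b, z) = b + w_{ba} z$, so the hypothetical weights `w_ab` and `w_ba` control how strongly each mapping uses its noise input.

```python
import numpy as np

def stochastic_cycle_loss(w_ab, w_ba, n_samples=20000, seed=0):
    """Monte Carlo estimate of E|G_BA(G_AB(a, z1), z2) - a| for a scalar toy model.

    Toy assumption: G_AB(a, z) = a + w_ab * z and G_BA(b, z) = b + w_ba * z.
    """
    rng = np.random.default_rng(seed)
    a = rng.normal(size=n_samples)
    z1 = rng.normal(size=n_samples)
    z2 = rng.normal(size=n_samples)
    b = a + w_ab * z1          # A -> B using noise z1
    a_rec = b + w_ba * z2      # B -> A using independent noise z2
    return np.mean(np.abs(a_rec - a))

for w in (0.0, 0.5, 1.0):
    print(f"noise weight {w:.1f}: cycle loss {stochastic_cycle_loss(w, w):.3f}")
```

In this toy setting the estimated cycle loss grows with the noise weights and is zero only when both mappings ignore their noise, which mirrors the degeneracy described above.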

Approach

In order to learn many-to-many mappings across domains, we propose to learn to map between pairs of items $(a, z_b) \in A \times Z_b$ and $(b, z_a) \in B \times Z_a$, where $Z_a$ and $Z_b$ are latent spaces that capture any missing information when transforming an element from $A$ to $B$, and vice-versa. For example, when generating a female face ($b \in B$) which resembles a male face ($a \in A$), the latent code $z_b \in Z_b$ can capture female face variations (e.g. hair length or style) independent from $a$. Similarly, $z_a \in Z_a$ captures variations in a generated male face independent from the given female face. This approach can be described as learning mappings between augmented spaces $A \times Z_b$ and $B \times Z_a$ (Figure 1b); hence, we call it Augmented CycleGAN. By learning to map a pair $(a, z_b) \in A \times Z_b$ to $(b, z_a) \in B \times Z_a$, we can (i) learn a stochastic mapping from $a$ to multiple items in $B$ by sampling different $z_b \in Z_b$, and (ii) infer latent codes $z_a$ containing information about $a$ not captured in the generated $b$, which allows for proper reconstruction of $a$. As a result, we are able to optimize both marginal matching and cycle consistency while using stochastic mappings. We present details of our approach in the next sections. Our model captures many-to-many relationships because it captures both one-to-many and many-to-one: one item in $A$ maps to many items in $B$, and many items in $B$ map to one item in $A$ (cycle). The same is true in the other direction.
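To make the augmented-space picture concrete, here is a scalar toy example. It is purely illustrative: the linear maps and the helper names `g_ab`, `g_ba`, `encode_z_a` and `encode_z_b` are assumptions, not the paper's networks. The pair $(a, z_b)$ is mapped to $(b, z_a)$ by an invertible linear map, so different $z_b$ give different $b$ for the same $a$, while $a$ and $z_b$ remain exactly recoverable.

```python
import numpy as np

def g_ab(a, z_b):
    """Toy A x Z_b -> B mapping: the latent code shifts the output."""
    return a + z_b

def encode_z_a(a, b):
    """Toy inference of z_a from the pair (a, b)."""
    return 2.0 * a - b

def g_ba(b, z_a):
    """Toy B x Z_a -> A mapping; inverts (g_ab, encode_z_a) in the augmented space."""
    return 0.5 * (b + z_a)

def encode_z_b(a, b):
    """Toy inference of z_b from the pair (a, b)."""
    return b - a

rng = np.random.default_rng(0)
a = 1.5
z_b_samples = rng.normal(size=5)

outputs = g_ab(a, z_b_samples)          # five different b for the same a
z_a = encode_z_a(a, outputs)            # latent codes carrying what b lacks about a
a_rec = g_ba(outputs, z_a)              # cycle back to A
z_b_rec = encode_z_b(a_rec, outputs)    # recover the latent codes
```

The mapping is one-to-one on pairs, yet one-to-many from `a` to `outputs` once `z_b` is sampled, which is the property the augmented formulation relies on.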

Learning in Augmented CycleGAN follows a similar approach to CycleGAN – optimizing both marginal matching and cycle-consistency losses, albeit over augmented spaces.

We adopt an adversarial approach for marginal matching over $B \times Z_a$ where we use two independent discriminators $D_B$ and $D_{Z_a}$ to match generated pairs to real samples from the independent priors $p_d(b)$ and $p(z_a)$, respectively. The marginal matching loss over $B$ is defined as in Eq. 4. Marginal matching over $Z_a$ is given by:

Cycle Consistency Loss

The second is for reconstructing $z_b \sim p(z_b)$:

These reconstruction costs represent an autoregressive decomposition of the basic CycleGAN cycle-consistency cost (Eq. 2), after extending it to the augmented domains. Specifically, we decompose the required reconstruction distribution $p(b, z_a | a, z_b)$ into the conditionals $p(b | a, z_b)$ and $p(z_a | a, z_b, b)$.

Training Augmented CycleGAN in the direction $A \times Z_b$ to $B \times Z_a$ is done by optimizing:

where $\gamma_1$ and $\gamma_2$ are hyper-parameters used to balance objectives. We define a similar objective for the direction going from $B \times Z_a$ to $A \times Z_b$, and train the model on both objectives simultaneously.
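For reference, the losses described in this section can be sketched as follows. This is a reconstruction from the surrounding prose rather than a verbatim statement of the paper's equations. It assumes two inference networks, written here as $E_A$ and $E_B$ (hypothetical names that are not defined in the text above), which infer $\tilde{z}_a$ and $z_b'$ from the pair $(a, \tilde{b})$, in line with the decomposition of $p(b, z_a | a, z_b)$ into $p(b | a, z_b)$ and $p(z_a | a, z_b, b)$.

```latex
% Sketch under stated assumptions: E_A and E_B are hypothetical encoder names.
\begin{align*}
\tilde{b} &= G_{AB}(a, z_b), \qquad \tilde{z}_a = E_A(a, \tilde{b}), \\
\mathcal{L}_{\text{GAN}}^{Z_a} &= \mathbb{E}_{z_a \sim p(z_a)}\big[\log D_{Z_a}(z_a)\big]
  + \mathbb{E}_{a \sim p_d(a),\, z_b \sim p(z_b)}\big[\log\big(1 - D_{Z_a}(\tilde{z}_a)\big)\big], \\
a' &= G_{BA}(\tilde{b}, \tilde{z}_a), \qquad
\mathcal{L}_{\text{CYC}}^{A} = \mathbb{E}_{a \sim p_d(a),\, z_b \sim p(z_b)} \big\| a' - a \big\|_1, \\
z_b' &= E_B(a, \tilde{b}), \qquad
\mathcal{L}_{\text{CYC}}^{Z_b} = \mathbb{E}_{a \sim p_d(a),\, z_b \sim p(z_b)} \big\| z_b' - z_b \big\|_1, \\
\mathcal{L} &= \mathcal{L}_{\text{GAN}}^{B} + \mathcal{L}_{\text{GAN}}^{Z_a}
  + \gamma_1 \mathcal{L}_{\text{CYC}}^{A} + \gamma_2 \mathcal{L}_{\text{CYC}}^{Z_b}.
\end{align*}
```

Under this reading, $\gamma_1$ weights the reconstruction of $a$ and $\gamma_2$ weights the reconstruction of the latent code $z_b$.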

3.2 Semi-supervised Learning with Augmented CycleGAN

In cases where we have access to paired data, we can leverage it to train our model in a semi-supervised setting (Fig. 3). Given pairs sampled from the true joint, i.e. $(a, b) \sim p_d(a, b)$, we can define a supervision cost for the mapping $G_{AB}$ as follows:
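One plausible form of such a cost, given as a sketch rather than the paper's exact definition, penalizes the $L_1$ distance between the generated sample and its paired target. It assumes that the latent code for a pair is inferred by an encoder, denoted here $E_B$ (a hypothetical name); the label $\mathcal{L}_{\text{SUP}}^{B}$ is likewise chosen only for illustration.

```latex
% Assumption: E_B infers z_b from the pair (a, b); the L1 form mirrors the cycle losses.
\mathcal{L}_{\text{SUP}}^{B}(G_{AB}, E_B)
  = \mathbb{E}_{(a, b) \sim p_d(a, b)} \big\| G_{AB}\big(a, E_B(a, b)\big) - b \big\|_1
```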

3.3 Modeling Stochastic Mappings

We note here some design choices that we found important for training our stochastic mappings. We discuss architectural and training details further in Sec. 5. In order to allow the latent codes to capture diversity in generated samples, we found it important to inject latent codes into layers of the network which are closer to the inputs. This allows the injected codes to be processed by a larger number of remaining layers and therefore capture high-level variations of the output, as opposed to small pixel-level variations. We also found that Conditional Normalization (CN) (Dumoulin et al.; Perez et al., 2017) for conditioning layers can be more effective than concatenation, which is more commonly used (Radford et al., 2015; Zhu et al., 2017b). The basic idea of CN is to replace parameters of affine transformations in normalization layers (Ioffe & Szegedy, 2015) of a neural network with a learned function of the conditioning information. We apply CN by learning two linear functions $f$ and $g$ which take a latent code $z$ as input and output scale and shift parameters of normalization layers in intermediate layers, i.e. $\gamma = f(z)$ and $\beta = g(z)$. When activations are normalized over spatial dimensions only, we get Conditional Instance Normalization (CIN), and when they are also normalized over the batch dimension, we get Conditional Batch Normalization (CBN).
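A minimal NumPy sketch of Conditional Instance Normalization as described above may help fix ideas. The tensor layout `(N, C, H, W)`, the parameter names, and the choice of bias initialization are assumptions made for illustration and do not reflect the paper's implementation.

```python
import numpy as np

def conditional_instance_norm(x, z, W_gamma, b_gamma, W_beta, b_beta, eps=1e-5):
    """Conditional Instance Normalization on activations x of shape (N, C, H, W).

    Scale and shift are linear functions of the latent code z of shape (N, K):
    gamma = f(z) = z @ W_gamma + b_gamma and beta = g(z) = z @ W_beta + b_beta.
    """
    # Statistics over spatial dimensions only (per sample, per channel).
    mean = x.mean(axis=(2, 3), keepdims=True)
    var = x.var(axis=(2, 3), keepdims=True)
    x_hat = (x - mean) / np.sqrt(var + eps)
    gamma = (z @ W_gamma + b_gamma)[:, :, None, None]
    beta = (z @ W_beta + b_beta)[:, :, None, None]
    return gamma * x_hat + beta

rng = np.random.default_rng(0)
N, C, H, W, K = 2, 3, 8, 8, 4
x = rng.normal(size=(N, C, H, W))
z = rng.normal(size=(N, K))
W_gamma, W_beta = rng.normal(size=(K, C)), rng.normal(size=(K, C))
b_gamma, b_beta = np.ones(C), np.zeros(C)

y = conditional_instance_norm(x, z, W_gamma, b_gamma, W_beta, b_beta)
```

After normalization, the per-channel mean of `y` equals the shift `g(z)`, so the latent code directly controls the statistics of every conditioned layer. Extending the statistics to the batch axis as well would give the batch-normalized variant.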

Related Work

There has been a surge of interest recently in unsupervised learning of cross-domain mappings, especially for image translation tasks. Previous attempts at image-to-image translation have unanimously relied on GANs to learn mappings that produce compelling images. In order to constrain learned mappings, some methods have relied on cycle-consistency based constraints similar to CycleGAN (Kim et al., 2017; Yi et al., 2017; Royer et al., 2017), while others relied on weight sharing constraints (Liu & Tuzel, 2016; Liu et al., 2017). However, the focus in all of these methods was on learning conditional image generators that produce a single output image given the input image. Notably, Liu et al. (2017) propose to map inputs from both domains into a shared latent space. This approach may over-constrain the space of learnable mappings, for example in cases where the domains differ substantially (class labels and images).

Unsupervised learning of mappings has also been addressed recently in language translation, especially for machine translation (Lample et al., 2017) and text style transfer (Shen et al., 2017). These methods also rely on some notion of cycle-consistency over domains in order to constrain the learned mappings. They rely heavily on the power of RNN-based decoders to capture complex relationships across domains, while we propose to use auxiliary latent variables. The two approaches may be synergistic, as recently suggested by Gulrajani et al. (2016).

Recently, Zhu et al. (2017b) proposed the BiCycleGAN model for learning multi-modal mappings, but in a fully supervised setting. This model extends the pix2pix framework of Isola et al. (2017) by learning a stochastic mapping from the source to the target, and shows interesting diversity in the generated samples. Several modeling choices in BiCycleGAN resemble our proposed model, including the use of stochastic mappings and an encoder to handle multi-modal targets. However, our approach focuses on unsupervised many-to-many mappings, which allows it to handle domains with no or very little paired data.

Experiments

We first study a one-to-many image translation task between edges (domain $A$) and photos of shoes (domain $B$); public code is available at https://github.com/aalmah/augmented_cyclegan. Training data is composed of almost 50K shoe images with corresponding edges (Yu & Grauman, 2014; Zhu et al., 2016; Isola et al., 2017), but as in previous approaches (e.g. Kim et al., 2017), we assume no pairing information while training unsupervised models. Stochastic mappings in our Augmented CycleGAN (AugCGAN) model are based on the ResNet conditional image generators of Zhu et al. (2017a), where we inject noise with CIN into all intermediate layers. As baselines, we train CycleGAN, Stochastic CycleGAN (StochCGAN) and the Triangle-GAN ($\Delta$-GAN) of Gan et al. (2017), which share the same architectures and training procedure for fair comparison. The $\Delta$-GAN architecture differs only in the two discriminators, which match conditionals/joints instead of marginals.

First, we evaluate the conditionals learned by each model by measuring the ability of the model to generate a specific edge-shoe pair from a test set. We follow the evaluation methodology adopted in (Metz et al., 2016; Xiang & Li, 2017), which opts for an inference-via-optimization approach to estimate the reconstruction error of a specific shoe given an edge. Specifically, given a trained model with mapping $G_{AB}$ and an edge-shoe pair $(a, b)$ in the test set, we solve the optimization task $z_b^* = \arg\min_{z_b} \|G_{AB}(a, z_b) - b\|_1$ and compute the reconstruction error $\|G_{AB}(a, z_b^*) - b\|_1$. Optimization is done with RMSProp as in (Xiang & Li, 2017). We show the average errors over a predefined test set of 200 samples in Table 2 for: AugCGAN (unsupervised and semi-supervised with 10% paired data), unsupervised CycleGAN and StochCGAN, and a semi-supervised $\Delta$-GAN, all sharing the same architecture. Our unsupervised AugCGAN model outperforms all baselines including the semi-supervised $\Delta$-GAN, which indicates that reconstruction-based cycle-consistency is more effective in learning conditionals than the adversarial approach of $\Delta$-GAN. As expected, adding 10% supervision to AugCGAN improves shoe predictions further. In addition, we evaluate edge predictions given real shoes from the test set. We report mean squared error (MSE) similar to (Gan et al., 2017), where we normalize over all edge pixels. The $\Delta$-GAN model with our architecture outperforms the one reported in (Gan et al., 2017), but is outperformed by our unsupervised AugCGAN model. Again, adding 10% supervision to AugCGAN reduces MSE even further.
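The inference-via-optimization procedure can be sketched on a toy problem. In the example below, `toy_generator` is a hypothetical stand-in for $G_{AB}$ (the two latent coordinates shift the two halves of the input), the RMSProp update is written out by hand with assumed hyper-parameters, and a subgradient is used for the $L_1$ error; none of these choices come from the paper.

```python
import numpy as np

def toy_generator(a, z):
    """Hypothetical stand-in for G_AB: z[0] and z[1] shift the two halves of a."""
    b = a.copy()
    half = a.size // 2
    b[:half] += z[0]
    b[half:] += z[1]
    return b

def l1_error(a, z, b):
    return np.abs(toy_generator(a, z) - b).sum()

def l1_subgradient(a, z, b):
    """Subgradient of the L1 error with respect to z for the toy generator."""
    residual_sign = np.sign(toy_generator(a, z) - b)
    half = a.size // 2
    return np.array([residual_sign[:half].sum(), residual_sign[half:].sum()])

def infer_latent(a, b, steps=2000, lr=0.01, decay=0.9, eps=1e-8):
    """Estimate z* = argmin_z ||G(a, z) - b||_1 with hand-written RMSProp updates."""
    z = np.zeros(2)
    mean_sq = np.zeros(2)
    for _ in range(steps):
        grad = l1_subgradient(a, z, b)
        mean_sq = decay * mean_sq + (1.0 - decay) * grad ** 2
        z -= lr * grad / (np.sqrt(mean_sq) + eps)
    return z

rng = np.random.default_rng(0)
a = rng.uniform(size=8)                  # toy "edge"
z_true = np.array([1.0, -2.0])
b = toy_generator(a, z_true)             # toy "shoe" paired with a

z_star = infer_latent(a, b)
initial_error = l1_error(a, np.zeros(2), b)
final_error = l1_error(a, z_star, b)     # reconstruction error at the optimized code
```

With a trained model, the same loop would be run once per test pair and the resulting errors averaged over the test set.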

Qualitative Results

We qualitatively compare the mappings learned by our model AugCGAN and by StochCGAN. Fig. 5.1 shows generated images of shoes given an edge $a \sim p_d(a)$ (row) and $z_b \sim p(z_b)$ (column) from both models, and Fig. 5.1 shows cycles starting from edges and shoes. Note that here the edges are sampled from the data distribution and not produced by the learnt stochastic mapping $G_{BA}$. In this case, both models can (1) generate a diverse set of shoes with color variations mostly defined by $z_b$, and (2) perform reconstructions of both edges and shoes.

We investigate “steganography” behavior in both AugCGAN and StochCGAN using a similar approach to (Chu et al., 2017), where we corrupt generated edges with noise sampled from $\mathcal{N}(0, \epsilon^2)$, and compute the reconstruction error of shoes. Fig. 2 shows the $L_1$ reconstruction error as we increase $\epsilon$. AugCGAN is more robust to corruption of edges than StochCGAN, which suggests that information is being stored in the latent codes instead of being completely hidden in generated edges.

5.2 Male-to-Female

We study another image translation task of translating between male and female faces. Data is based on the CelebA dataset (Liu et al., 2015), which we split into two separate domains using the provided attributes. Several key features distinguish this task from other image-translation tasks: (1) there is no predefined correspondence in real data of each domain, (2) the relationship is many-to-many between domains, as we can map a male to a female face, and vice-versa, in many possible ways, and (3) capturing realistic variations in generated faces requires transformations that go beyond simple color and texture changes. The architecture of the stochastic mappings is based on the U-NET conditional image generators of (Isola et al., 2017), again with noise injected into all intermediate layers. Fig. 9 shows results of applying our model to this task on $128 \times 128$ resolution CelebA images. We can see that our model depicts meaningful variations in generated faces without compromising their realistic appearance. In Fig. 10 we show $64 \times 64$ generated samples in both domains from our model ((a) and (b)), and compare them to both: (c) our model but with noise injected only in the last 3 layers of the $G_{AB}$ network, and (d) StochCGAN with the same architecture. We can see that in Fig. 10-(c) variations are very limited, which highlights the importance of processing the latent code with multiple layers. StochCGAN in this task produces almost no variations at all, which highlights the importance of proper optimization of cycle-consistency for capturing meaningful variations. We verify these results quantitatively using the LPIPS distance (Zhang et al., 2018), where we average the distance between 1000 pairs of generated female faces (10 random pairs from 100 male faces). AugCGAN (Fig. 10-(b)) achieves the highest LPIPS diversity score with 0.108 ± 0.003, while AugCGAN with $z$ in low-level layers (Fig. 10-(c)) gets 0.059 ± 0.001, and finally StochCGAN (Fig. 10-(d)) gets 0.008 ± 0.000, i.e. severe mode collapse.
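The pair-sampling protocol behind this diversity score can be sketched as follows. The sketch assumes that each pair is formed from two outputs generated with independently sampled latent codes, which the text does not specify, and it uses a simple mean absolute difference as a placeholder for LPIPS, since the real metric requires a pretrained network. The generators and all names are hypothetical.

```python
import numpy as np

def diversity_score(generate, inputs, distance, latent_dim, pairs_per_input=10, seed=0):
    """Average distance over random pairs of outputs generated from the same input.

    Assumption: each pair uses two independently sampled latent codes.
    Returns the mean distance and the number of pairs evaluated.
    """
    rng = np.random.default_rng(seed)
    distances = []
    for a in inputs:
        for _ in range(pairs_per_input):
            z1 = rng.normal(size=latent_dim)
            z2 = rng.normal(size=latent_dim)
            distances.append(distance(generate(a, z1), generate(a, z2)))
    return float(np.mean(distances)), len(distances)

def mean_abs_distance(x, y):
    """Placeholder metric (assumption); the text reports LPIPS instead."""
    return float(np.mean(np.abs(x - y)))

inputs = np.random.default_rng(1).normal(size=(100, 16))   # 100 toy source items
uses_latent = lambda a, z: a + z.mean()     # toy generator whose output depends on z
ignores_latent = lambda a, z: a             # toy generator collapsed to one output per input

score_diverse, n_pairs = diversity_score(uses_latent, inputs, mean_abs_distance, latent_dim=8)
score_collapsed, _ = diversity_score(ignores_latent, inputs, mean_abs_distance, latent_dim=8)
```

A generator that ignores its latent code scores exactly zero under any distance, which is the limiting case of the near-zero score reported for StochCGAN.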

5.3 Attributes-to-Faces

In this task, we make use of the CelebA dataset in order to map from descriptive facial attributes $A$ to images of faces $B$ and vice-versa. We report both quantitative and qualitative results. For the quantitative results, we follow (Gan et al., 2017) and test our models in a semi-supervised attribute prediction setting. We train the model on all the available data without pairing information, and additionally on a small amount of paired data as described in Sec. 3.2. We report Precision (P) and normalized Discounted Cumulative Gain (nDCG) as the two metrics for multi-label classification problems. As an additional baseline, we also train a supervised classifier (which has the same architecture as $G_{BA}$) on the paired subset. The results are reported in Table 3. In Fig. 11, we show some generations obtained from the model in the direction attributes to faces. We can see that the model generates reasonably diverse faces for the same set of attributes.
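For readers unfamiliar with the two metrics, the following sketch computes precision and nDCG at a cutoff `k` for a single example with binary attribute labels. The cutoff, the function names, and the toy scores are assumptions for illustration; the text does not state which cutoff is used.

```python
import numpy as np

def precision_at_k(scores, labels, k):
    """Fraction of the k highest-scoring attributes that are truly present."""
    top_k = np.argsort(-scores)[:k]
    return float(labels[top_k].sum() / k)

def ndcg_at_k(scores, labels, k):
    """Normalized discounted cumulative gain with binary relevance."""
    k = min(k, scores.size)
    order = np.argsort(-scores)[:k]
    discounts = 1.0 / np.log2(np.arange(2, k + 2))   # 1 / log2(rank + 1)
    dcg = float((labels[order] * discounts).sum())
    ideal = np.sort(labels)[::-1][:k]                # best achievable ordering
    idcg = float((ideal * discounts).sum())
    return dcg / idcg if idcg > 0 else 0.0

scores = np.array([0.9, 0.1, 0.8, 0.3])   # predicted attribute scores for one face
labels = np.array([1, 0, 0, 1])           # ground-truth binary attributes

p_at_2 = precision_at_k(scores, labels, k=2)
ndcg_at_2 = ndcg_at_k(scores, labels, k=2)
```

Here only one of the two top-ranked attributes is correct, so precision is one half, and nDCG is below one because the second true attribute is ranked outside the top two. In practice both metrics would be averaged over all test faces.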

Conclusion

In this paper we have introduced the Augmented CycleGAN model for learning many-to-many cross-domain mappings in an unsupervised fashion. This model can learn stochastic mappings which leverage auxiliary noise to capture multi-modal conditionals. Our experimental results verify quantitatively and qualitatively the effectiveness of our approach in image translation tasks. Furthermore, we apply our model to the challenging task of learning to map across attributes and faces, and show that it can be used effectively in a semi-supervised learning setting.

Acknowledgements

The authors would like to thank Zihang Dai for valuable discussions and feedback. We are also grateful to the anonymous ICML reviewers for their comments.

References