Multimodal Unsupervised Image-to-Image Translation

Xun Huang, Ming-Yu Liu, Serge Belongie, Jan Kautz

Introduction

Many problems in computer vision aim at translating images from one domain to another, including super-resolution, colorization, inpainting, attribute transfer, and style transfer. This cross-domain image-to-image translation setting has therefore received significant attention. When the dataset contains paired examples, this problem can be approached by a conditional generative model or a simple regression model. In this work, we focus on the much more challenging setting in which such supervision is unavailable.

In many scenarios, the cross-domain mapping of interest is multimodal. For example, a winter scene could have many possible appearances during summer due to weather, timing, lighting, etc. Unfortunately, existing techniques usually assume a deterministic or unimodal mapping. As a result, they fail to capture the full distribution of possible outputs. Even if the model is made stochastic by injecting noise, the network usually learns to ignore it.

In this paper, we propose a principled framework for the Multimodal UNsupervised Image-to-image Translation (MUNIT) problem. As shown in Fig. 1 (a), our framework makes several assumptions. We first assume that the latent space of images can be decomposed into a content space and a style space. We further assume that images in different domains share a common content space but not the style space. To translate an image to the target domain, we recombine its content code with a random style code in the target style space (Fig. 1 (b)). The content code encodes the information that should be preserved during translation, while the style code represents remaining variations that are not contained in the input image. By sampling different style codes, our model is able to produce diverse and multimodal outputs. Extensive experiments demonstrate the effectiveness of our method in modeling multimodal output distributions and its superior image quality compared with state-of-the-art approaches. Moreover, the decomposition of content and style spaces allows our framework to perform example-guided image translation, in which the style of the translation outputs is controlled by a user-provided example image in the target domain.

Related Works

Generative adversarial networks (GANs). The GAN framework has achieved impressive results in image generation. In GAN training, a generator is trained to fool a discriminator, which in turn tries to distinguish between generated samples and real samples. Various improvements to GANs have been proposed, such as multi-stage generation, better training objectives, and combination with auto-encoders. In this work, we employ GANs to align the distribution of translated images with real images in the target domain.

Image-to-image translation. Isola et al. propose the first unified framework for image-to-image translation based on conditional GANs, which has been extended to generating high-resolution images by Wang et al. Recent studies have also attempted to learn image translation without supervision. This problem is inherently ill-posed and requires additional constraints. Some works enforce the translation to preserve certain properties of the source domain data, such as pixel values, pixel gradients, semantic features, class labels, or pairwise sample distances. Another popular constraint is the cycle consistency loss. It enforces that if we translate an image to the target domain and back, we should obtain the original image. In addition, Liu et al. propose the UNIT framework, which assumes a shared latent space such that corresponding images in two domains are mapped to the same latent code.

A significant limitation of most existing image-to-image translation methods is the lack of diversity in the translated outputs. To tackle this problem, some works propose to simultaneously generate multiple outputs given the same input and encourage them to be different. Still, these methods can only generate a discrete number of outputs. Zhu et al. propose BicycleGAN, which can model continuous and multimodal distributions. However, all the aforementioned methods require paired supervision, while our method does not. A couple of concurrent works also recognize this limitation and propose extensions of CycleGAN/UNIT for multimodal mapping.

Our problem has some connections with multi-domain image-to-image translation. Specifically, when we know how many modes each domain has and the mode each sample belongs to, it is possible to treat each mode as a separate domain and use multi-domain image-to-image translation techniques to learn a mapping between each pair of modes, thus achieving multimodal translation. However, in general we do not assume such information is available. Also, our stochastic model can represent continuous output distributions, whereas multi-domain methods still use a deterministic model for each pair of domains.

Style transfer. Style transfer aims at modifying the style of an image while preserving its content, which is closely related to image-to-image translation. Here, we make a distinction between example-guided style transfer, in which the target style comes from a single example, and collection style transfer, in which the target style is defined by a collection of images. Classical style transfer approaches typically tackle the former problem, whereas image-to-image translation methods have been demonstrated to perform well in the latter. We will show that our model is able to address both problems, thanks to its disentangled representation of content and style.

Learning disentangled representations. Our work draws inspiration from recent works on disentangled representation learning. For example, InfoGAN and $\beta$-VAE have been proposed to learn disentangled representations without supervision. Some other works focus on disentangling content from style. Although it is difficult to define content/style and different works use different definitions, we refer to “content” as the underlying spatial structure and “style” as the rendering of the structure. In our setting, we have two domains that share the same content distribution but have different style distributions.

Multimodal Unsupervised Image-to-image Translation

Let $x_{1}\in\mathcal{X}_{1}$ and $x_{2}\in\mathcal{X}_{2}$ be images from two different image domains. In the unsupervised image-to-image translation setting, we are given samples drawn from two marginal distributions $p(x_{1})$ and $p(x_{2})$ , without access to the joint distribution $p(x_{1},x_{2})$ . Our goal is to estimate the two conditionals $p(x_{2}|x_{1})$ and $p(x_{1}|x_{2})$ with learned image-to-image translation models $p(x_{1\rightarrow 2}|x_{1})$ and $p(x_{2\rightarrow 1}|x_{2})$ , where $x_{1\rightarrow 2}$ is a sample produced by translating $x_{1}$ to $\mathcal{X}_{2}$ (similar for $x_{2\rightarrow 1}$ ). In general, $p(x_{2}|x_{1})$ and $p(x_{1}|x_{2})$ are complex and multimodal distributions, in which case a deterministic translation model does not work well.

To tackle this problem, we make a partially shared latent space assumption. Specifically, we assume that each image $x_{i}\in\mathcal{X}_{i}$ is generated from a content latent code $c\in\mathcal{C}$ that is shared by both domains, and a style latent code $s_{i}\in\mathcal{S}_{i}$ that is specific to the individual domain. In other words, a pair of corresponding images $(x_{1},x_{2})$ from the joint distribution is generated by $x_{1}=G^{*}_{1}(c,s_{1})$ and $x_{2}=G^{*}_{2}(c,s_{2})$, where $c,s_{1},s_{2}$ are from some prior distributions and $G^{*}_{1}$, $G^{*}_{2}$ are the underlying generators. We further assume that $G^{*}_{1}$ and $G^{*}_{2}$ are deterministic functions and have their inverse encoders $E^{*}_{1}=(G^{*}_{1})^{-1}$ and $E^{*}_{2}=(G^{*}_{2})^{-1}$. Our goal is to learn the underlying generator and encoder functions with neural networks. Note that although the encoders and decoders are deterministic, $p(x_{2}|x_{1})$ is a continuous distribution due to its dependence on $s_{2}$.

Our assumption is closely related to the shared latent space assumption proposed in UNIT. While UNIT assumes a fully shared latent space, we postulate that only part of the latent space (the content) can be shared across domains whereas the other part (the style) is domain specific, which is a more reasonable assumption when the cross-domain mapping is many-to-many.

3.2 Model

Fig. 2 shows an overview of our model and its learning process. Similar to Liu et al., our translation model consists of an encoder $E_{i}$ and a decoder $G_{i}$ for each domain $\mathcal{X}_{i}$ ($i=1,2$). As shown in Fig. 2 (a), the latent code of each auto-encoder is factorized into a content code $c_{i}$ and a style code $s_{i}$, where $(c_{i},s_{i})=(E_{i}^{c}(x_{i}),E_{i}^{s}(x_{i}))=E_{i}(x_{i})$. Image-to-image translation is performed by swapping encoder-decoder pairs, as illustrated in Fig. 2 (b). For example, to translate an image $x_{1}\in\mathcal{X}_{1}$ to $\mathcal{X}_{2}$, we first extract its content latent code $c_{1}=E^{c}_{1}(x_{1})$ and randomly draw a style latent code $s_{2}$ from the prior distribution $q(s_{2})\sim\mathcal{N}(0,\mathbf{I})$. We then use $G_{2}$ to produce the final output image $x_{1\rightarrow 2}=G_{2}(c_{1},s_{2})$. We note that although the prior distribution is unimodal, the output image distribution can be multimodal thanks to the nonlinearity of the decoder.
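
The translation step can be illustrated with a toy sketch. The functions below are hypothetical stand-ins, not the networks used in this work: the "content encoder" standardizes a 1-D signal, the "decoder" re-renders it with a scale and shift taken from a 2-D style code, and for brevity both domains share the same toy functions, whereas the actual model learns separate $E_{i}$ and $G_{i}$ per domain.

```python
import math
import random


def content_encoder(x):
    """Toy stand-in for E^c: keep only the shape of a 1-D signal."""
    mean = sum(x) / len(x)
    std = math.sqrt(sum((v - mean) ** 2 for v in x) / len(x))
    return [(v - mean) / std for v in x]


def decoder(content, style):
    """Toy stand-in for G: render the content with a style-given scale and shift."""
    log_scale, shift = style
    return [math.exp(log_scale) * c + shift for c in content]


def translate(x1, rng):
    """Compute x_{1->2} = G_2(E_1^c(x_1), s_2) with s_2 drawn from N(0, I)."""
    c1 = content_encoder(x1)
    s2 = (rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0))
    return decoder(c1, s2)


rng = random.Random(0)
x1 = [0.0, 1.0, 4.0, 9.0]
# Three translations of the same input, each with a freshly sampled style code.
samples = [translate(x1, rng) for _ in range(3)]
```

In this toy setting, every sample differs in scale and offset, yet re-encoding any of them with `content_encoder` recovers the content code of `x1`, which mirrors the intended split between preserved content and sampled style.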

Our loss function comprises a bidirectional reconstruction loss that ensures the encoders and decoders are inverses, and an adversarial loss that matches the distribution of translated images to the image distribution in the target domain.

Bidirectional reconstruction loss. To learn pairs of encoder and decoder that are inverses of each other, we use objective functions that encourage reconstruction in both image $\rightarrow$ latent $\rightarrow$ image and latent $\rightarrow$ image $\rightarrow$ latent directions:

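Image reconstruction. Given an image sampled from the data distribution, we should be able to reconstruct it after encoding and decoding.

$$\mathcal{L}^{x_{1}}_{\text{recon}}=\mathbb{E}_{x_{1}\sim p(x_{1})}\left[\left\|G_{1}(E^{c}_{1}(x_{1}),E^{s}_{1}(x_{1}))-x_{1}\right\|_{1}\right]\tag{1}$$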

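Latent reconstruction. Given a latent code (style and content) sampled from the latent distribution at translation time, we should be able to reconstruct it after decoding and encoding.

$$\mathcal{L}^{c_{1}}_{\text{recon}}=\mathbb{E}_{c_{1}\sim p(c_{1}),\,s_{2}\sim q(s_{2})}\left[\left\|E^{c}_{2}(G_{2}(c_{1},s_{2}))-c_{1}\right\|_{1}\right]\tag{2}$$

$$\mathcal{L}^{s_{2}}_{\text{recon}}=\mathbb{E}_{c_{1}\sim p(c_{1}),\,s_{2}\sim q(s_{2})}\left[\left\|E^{s}_{2}(G_{2}(c_{1},s_{2}))-s_{2}\right\|_{1}\right]\tag{3}$$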

where $q(s_{2})$ is the prior $\mathcal{N}(0,\mathbf{I})$ , $p(c_{1})$ is given by $c_{1}=E^{c}_{1}(x_{1})$ and $x_{1}\sim p(x_{1})$ .

We note the other loss terms $\mathcal{L}^{x_{2}}_{\text{recon}}$ , $\mathcal{L}^{c_{2}}_{\text{recon}}$ , and $\mathcal{L}^{s_{1}}_{\text{recon}}$ are defined in a similar manner. We use $\mathcal{L}_{1}$ reconstruction loss as it encourages sharp output images.

The style reconstruction loss $\mathcal{L}^{s_{i}}_{\text{recon}}$ is reminiscent of the latent reconstruction loss used in prior works. It has the effect of encouraging diverse outputs given different style codes. The content reconstruction loss $\mathcal{L}^{c_{i}}_{\text{recon}}$ encourages the translated image to preserve the semantic content of the input image.

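Adversarial loss. We employ GANs to match the distribution of translated images to the target data distribution. In other words, images generated by our model should be indistinguishable from real images in the target domain.

$$\mathcal{L}^{x_{2}}_{\text{GAN}}=\mathbb{E}_{c_{1}\sim p(c_{1}),\,s_{2}\sim q(s_{2})}\left[\log\left(1-D_{2}(G_{2}(c_{1},s_{2}))\right)\right]+\mathbb{E}_{x_{2}\sim p(x_{2})}\left[\log D_{2}(x_{2})\right]\tag{4}$$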

where $D_{2}$ is a discriminator that tries to distinguish between translated images and real images in $\mathcal{X}_{2}$ . The discriminator $D_{1}$ and loss $\mathcal{L}^{x_{1}}_{\text{GAN}}$ are defined similarly.

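Total loss. We jointly train the encoders, decoders, and discriminators to optimize the final objective, which is a weighted sum of the adversarial loss and the bidirectional reconstruction loss terms.

$$\underset{E_{1},E_{2},G_{1},G_{2}}{\min}\ \underset{D_{1},D_{2}}{\max}\ \mathcal{L}(E_{1},E_{2},G_{1},G_{2},D_{1},D_{2})=\mathcal{L}^{x_{1}}_{\text{GAN}}+\mathcal{L}^{x_{2}}_{\text{GAN}}+\lambda_{x}(\mathcal{L}^{x_{1}}_{\text{recon}}+\mathcal{L}^{x_{2}}_{\text{recon}})+\lambda_{c}(\mathcal{L}^{c_{1}}_{\text{recon}}+\mathcal{L}^{c_{2}}_{\text{recon}})+\lambda_{s}(\mathcal{L}^{s_{1}}_{\text{recon}}+\mathcal{L}^{s_{2}}_{\text{recon}})\tag{5}$$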

where $\lambda_{x}$ , $\lambda_{c}$ , $\lambda_{s}$ are weights that control the importance of reconstruction terms.
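
As a minimal sketch of how the weighted sum is assembled, the hypothetical helper below combines already-computed scalar loss values for the two domains. The function name and the example loss values are illustrative assumptions; the default weights are the ones reported in the training details ($\lambda_{x}=10$, $\lambda_{c}=1$, $\lambda_{s}=1$).

```python
def total_loss(gan, recon_x, recon_c, recon_s,
               lambda_x=10.0, lambda_c=1.0, lambda_s=1.0):
    """Weighted sum of loss values; each argument is a pair (domain 1, domain 2)."""
    return (sum(gan)
            + lambda_x * sum(recon_x)
            + lambda_c * sum(recon_c)
            + lambda_s * sum(recon_s))


# Made-up scalar loss values, only to show how the terms are weighted.
total = total_loss(gan=(0.5, 0.7), recon_x=(0.1, 0.2),
                   recon_c=(0.3, 0.1), recon_s=(0.2, 0.2))
```

With these weights the image reconstruction terms dominate the sum unless they are roughly an order of magnitude smaller than the other terms.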

Theoretical Analysis

We now establish some theoretical properties of our framework. Specifically, we show that minimizing the proposed loss function leads to 1) matching of latent distributions during encoding and generation, 2) matching of two joint image distributions induced by our framework, and 3) enforcing a weak form of cycle consistency constraint. All the proofs are given in Appendix A.

First, we note that the total loss in Eq. (5) is minimized when the translated distribution matches the data distribution and the encoder-decoder are inverses.

Proposition 1. Suppose there exist $E^{*}_{1}$, $E^{*}_{2}$, $G^{*}_{1}$, $G^{*}_{2}$ such that: 1) $E^{*}_{1}=(G^{*}_{1})^{-1}$ and $E^{*}_{2}=(G^{*}_{2})^{-1}$, and 2) $p(x_{1\rightarrow 2})=p(x_{2})$ and $p(x_{2\rightarrow 1})=p(x_{1})$. Then $E^{*}_{1}$, $E^{*}_{2}$, $G^{*}_{1}$, $G^{*}_{2}$ minimize $\mathcal{L}(E_{1},E_{2},G_{1},G_{2})=\underset{D_{1},D_{2}}{\max}\ \mathcal{L}(E_{1},E_{2},G_{1},G_{2},D_{1},D_{2})$ (Eq. (5)).

For image generation, existing works on combining auto-encoders and GANs need to match the encoded latent distribution with the latent distribution the decoder receives at generation time, using either KLD loss or adversarial loss in the latent space. The auto-encoder training would not help GAN training if the decoder received a very different latent distribution during generation. Although our loss function does not contain terms that explicitly encourage the match of latent distributions, it has the effect of matching them implicitly.

Proposition 2. When optimality is reached, we have: $p(c_{1})=p(c_{2})$, $p(s_{1})=q(s_{1})$, $p(s_{2})=q(s_{2})$.

The above proposition shows that at optimality, the encoded style distributions match their Gaussian priors. Also, the encoded content distribution matches the distribution at generation time, which is just the encoded distribution from the other domain. This suggests that the content space becomes domain-invariant.

4.2 Joint Distribution Matching

Our model learns two conditional distributions $p(x_{1\rightarrow 2}|x_{1})$ and $p(x_{2\rightarrow 1}|x_{2})$ , which, together with the data distributions, define two joint distributions $p(x_{1},x_{1\rightarrow 2})$ and $p(x_{2\rightarrow 1},x_{2})$ . Since both of them are designed to approximate the same underlying joint distribution $p(x_{1},x_{2})$ , it is desirable that they are consistent with each other, i.e., $p(x_{1},x_{1\rightarrow 2})=p(x_{2\rightarrow 1},x_{2})$ .

Joint distribution matching provides an important constraint for unsupervised image-to-image translation and is behind the success of many recent methods. Here, we show our model matches the joint distributions at optimality.

Proposition 3. When optimality is reached, we have $p(x_{1},x_{1\rightarrow 2})=p(x_{2\rightarrow 1},x_{2})$.

4.3 Style-augmented Cycle Consistency

Joint distribution matching can be realized via a cycle consistency constraint, assuming deterministic translation models and matched marginals. However, we note that this constraint is too strong for multimodal image translation. In fact, we prove in Appendix A that the translation model will degenerate to a deterministic function if cycle consistency is enforced. In the following proposition, we show that our framework admits a weaker form of cycle consistency, termed style-augmented cycle consistency, between the image–style joint spaces, which is more suited for multimodal image translation.

Proposition 4. Denote $h_{1}=(x_{1},s_{2})\in\mathcal{H}_{1}$ and $h_{2}=(x_{2},s_{1})\in\mathcal{H}_{2}$. $h_{1},h_{2}$ are points in the joint spaces of image and style. Our model defines a deterministic mapping $F_{1\rightarrow 2}$ from $\mathcal{H}_{1}$ to $\mathcal{H}_{2}$ (and vice versa) by $F_{1\rightarrow 2}(h_{1})=F_{1\rightarrow 2}(x_{1},s_{2})\triangleq(G_{2}(E^{c}_{1}(x_{1}),s_{2}),E^{s}_{1}(x_{1}))$. When optimality is achieved, we have $F_{1\rightarrow 2}=F_{2\rightarrow 1}^{-1}$.

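Intuitively, style-augmented cycle consistency implies that if we translate an image to the target domain and translate it back using the original style, we should obtain the original image. Style-augmented cycle consistency is implied by the proposed bidirectional reconstruction loss, but explicitly enforcing it could be useful for some datasets:

$$\mathcal{L}^{x_{1}}_{\text{cc}}=\mathbb{E}_{x_{1}\sim p(x_{1}),\,s_{2}\sim q(s_{2})}\left[\left\|G_{1}(E^{c}_{2}(G_{2}(E^{c}_{1}(x_{1}),s_{2})),E^{s}_{1}(x_{1}))-x_{1}\right\|_{1}\right]\tag{6}$$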

Experiments

Fig. 3 shows the architecture of our auto-encoder. It consists of a content encoder, a style encoder, and a joint decoder. More detailed information and hyperparameters are given in Appendix B. We also provide an open-source implementation in PyTorch at https://github.com/nvlabs/MUNIT.

Content encoder. Our content encoder consists of several strided convolutional layers to downsample the input and several residual blocks to further process it. All the convolutional layers are followed by Instance Normalization (IN).

Style encoder. The style encoder includes several strided convolutional layers, followed by a global average pooling layer and a fully connected (FC) layer. We do not use IN layers in the style encoder, since IN removes the original feature mean and variance that represent important style information.

Decoder. Our decoder reconstructs the input image from its content and style code. It processes the content code by a set of residual blocks and finally produces the reconstructed image by several upsampling and convolutional layers. Inspired by recent works that use affine transformation parameters in normalization layers to represent styles, we equip the residual blocks with Adaptive Instance Normalization (AdaIN) layers whose parameters are dynamically generated by a multilayer perceptron (MLP) from the style code.

$$\text{AdaIN}(z,\gamma,\beta)=\gamma\left(\frac{z-\mu(z)}{\sigma(z)}\right)+\beta\tag{7}$$

where $z$ is the activation of the previous convolutional layer, $\mu$ and $\sigma$ are the channel-wise mean and standard deviation, and $\gamma$ and $\beta$ are parameters generated by the MLP. Note that the affine parameters are produced by a learned network, instead of computed from statistics of a pretrained network as in Huang et al.
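
The AdaIN operation can be sketched in a few lines of plain Python. This is an illustrative version, not the released PyTorch code: activations are represented as a list of channels (each a flat list of values), and the small `eps` added to the variance is an assumption made here for numerical stability.

```python
import math


def adain(z, gamma, beta, eps=1e-5):
    """Normalize each channel of z, then scale by gamma and shift by beta."""
    out = []
    for channel, g, b in zip(z, gamma, beta):
        mu = sum(channel) / len(channel)
        var = sum((v - mu) ** 2 for v in channel) / len(channel)
        sigma = math.sqrt(var + eps)  # eps is an assumption, not from the text
        out.append([g * (v - mu) / sigma + b for v in channel])
    return out


z = [[1.0, 2.0, 3.0, 4.0], [10.0, 10.0, 12.0, 12.0]]
# In the model, gamma and beta would come from the MLP applied to the style code.
styled = adain(z, gamma=[2.0, 0.5], beta=[1.0, -3.0])
```

After the operation, each output channel has mean $\beta$ and standard deviation approximately $\gamma$, regardless of the statistics of the incoming activations, which is how the style code takes control of the feature statistics.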

Discriminator. We use the LSGAN objective proposed by Mao et al. We employ multi-scale discriminators proposed by Wang et al. to guide the generators to produce both realistic details and correct global structure.
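
A least-squares adversarial objective summed over discriminator scales might look as follows. This sketch assumes target labels of 1 for real and 0 for fake images and omits constant factors; the score lists are hypothetical discriminator outputs, one list per scale.

```python
def mean_squared_error(values, target):
    """Mean squared deviation of discriminator scores from a target label."""
    return sum((v - target) ** 2 for v in values) / len(values)


def discriminator_loss(real_scores, fake_scores):
    """Sum over scales: push scores of real images to 1 and of fakes to 0."""
    return sum(mean_squared_error(r, 1.0) + mean_squared_error(f, 0.0)
               for r, f in zip(real_scores, fake_scores))


def generator_loss(fake_scores):
    """Sum over scales: push scores of translated images towards 1."""
    return sum(mean_squared_error(f, 1.0) for f in fake_scores)


# Three scales, two score values per scale, for a discriminator that is never fooled.
real = [[1.0, 1.0], [1.0, 1.0], [1.0, 1.0]]
fake = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
d_loss = discriminator_loss(real, fake)
g_loss = generator_loss(fake)
```

In this example the discriminator loss is zero while the generator pays the maximum penalty at every scale.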

Domain-invariant perceptual loss. The perceptual loss, often computed as a distance in the VGG feature space between the output and the reference image, has been shown to benefit image-to-image translation when paired supervision is available. In the unsupervised setting, however, we do not have a reference image in the target domain. We propose a modified version of perceptual loss that is more domain-invariant, so that we can use the input image as the reference. Specifically, before computing the distance, we perform Instance Normalization (without affine transformations) on the VGG features in order to remove the original feature mean and variance, which contain much domain-specific information. In Appendix C, we quantitatively show that Instance Normalization can indeed make the VGG features more domain-invariant. We find the domain-invariant perceptual loss accelerates training on high-resolution ($\geq 512\times 512$) datasets and thus employ it on those datasets.
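
The effect of normalizing features before measuring their distance can be seen in a small sketch. The feature maps below are made-up numbers standing in for VGG activations, and the mean squared difference is an assumed choice of distance, since the text does not fix one.

```python
import math


def instance_norm(features, eps=1e-5):
    """Remove the per-channel mean and variance (no affine parameters)."""
    normed = []
    for channel in features:
        mu = sum(channel) / len(channel)
        var = sum((v - mu) ** 2 for v in channel) / len(channel)
        normed.append([(v - mu) / math.sqrt(var + eps) for v in channel])
    return normed


def feature_distance(feat_a, feat_b, normalize=True):
    """Mean squared difference between two feature maps, optionally after IN."""
    if normalize:
        feat_a, feat_b = instance_norm(feat_a), instance_norm(feat_b)
    diffs = [(a - b) ** 2
             for ch_a, ch_b in zip(feat_a, feat_b)
             for a, b in zip(ch_a, ch_b)]
    return sum(diffs) / len(diffs)


feat_a = [[0.0, 1.0, 2.0, 3.0], [5.0, 7.0, 9.0, 11.0]]
# Same spatial pattern, but with a different per-channel scale and offset.
feat_b = [[3.0 * v + 10.0 for v in channel] for channel in feat_a]
with_in = feature_distance(feat_a, feat_b)
without_in = feature_distance(feat_a, feat_b, normalize=False)
```

Here `feat_b` differs from `feat_a` only in its channel statistics, so the distance is large without normalization and nearly zero with it.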

5.2 Evaluation Metrics

Human Preference. To compare the realism and faithfulness of translation outputs generated by different methods, we perform a human perceptual study on Amazon Mechanical Turk (AMT). Similar to Wang et al., the workers are given an input image and two translation outputs from different methods. They are then given unlimited time to select which translation output looks more accurate. For each comparison, we randomly generate $500$ questions and each question is answered by $5$ different workers.

LPIPS Distance. To measure translation diversity, we compute the average LPIPS distance between pairs of randomly-sampled translation outputs from the same input as in Zhu et al. LPIPS is given by a weighted $\mathcal{L}_{2}$ distance between deep features of images. It has been demonstrated to correlate well with human perceptual similarity. Following Zhu et al., we use $100$ input images and sample $19$ output pairs per input, which amounts to $1900$ pairs in total. We use the ImageNet-pretrained AlexNet as the deep feature extractor.
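
The pairing protocol can be sketched as follows. The distance function here is a hypothetical stand-in (mean absolute difference), not LPIPS, and the number of outputs generated per input (20) is an assumption made only so that pairs can be drawn.

```python
import random


def average_pairwise_distance(outputs_per_input, distance, pairs_per_input, rng):
    """Average a distance over random pairs of outputs that share the same input."""
    distances = []
    for outputs in outputs_per_input:
        for _ in range(pairs_per_input):
            a, b = rng.sample(outputs, 2)  # two distinct outputs of one input
            distances.append(distance(a, b))
    return sum(distances) / len(distances), len(distances)


def toy_distance(a, b):
    """Hypothetical stand-in for LPIPS: mean absolute difference."""
    return sum(abs(u - v) for u, v in zip(a, b)) / len(a)


rng = random.Random(0)
# 100 inputs, each with 20 toy "translations" (random 4-pixel images).
outputs_per_input = [[[rng.random() for _ in range(4)] for _ in range(20)]
                     for _ in range(100)]
score, num_pairs = average_pairwise_distance(outputs_per_input, toy_distance, 19, rng)
```

A model that returns the same output for every style code would score exactly zero under this protocol, whatever distance is plugged in.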

(Conditional) Inception Score. The Inception Score (IS) is a popular metric for image generation tasks. We propose a modified version called Conditional Inception Score (CIS), which is more suited for evaluating multimodal image translation. When we know the number of modes in $\mathcal{X}_{2}$ as well as the ground truth mode each sample belongs to, we can train a classifier $p(y_{2}|x_{2})$ to classify an image $x_{2}$ into its mode $y_{2}$. Conditioned on a single input image $x_{1}$, the translation samples $x_{1\rightarrow 2}$ should be mode-covering (thus $p(y_{2}|x_{1})=\int p(y_{2}|x_{1\rightarrow 2})p(x_{1\rightarrow 2}|x_{1})\,dx_{1\rightarrow 2}$ should have high entropy) and each individual sample should belong to a specific mode (thus $p(y_{2}|x_{1\rightarrow 2})$ should have low entropy). Combining these two requirements we get:

$$\text{CIS}=\mathbb{E}_{x_{1}\sim p(x_{1})}\left[\mathbb{E}_{x_{1\rightarrow 2}\sim p(x_{1\rightarrow 2}|x_{1})}\left[\text{KL}\left(p(y_{2}|x_{1\rightarrow 2})\,\|\,p(y_{2}|x_{1})\right)\right]\right]\tag{8}$$

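To compute the (unconditional) IS, $p(y_{2}|x_{1})$ is replaced with the unconditional class probability $p(y_{2})=\iint p(y_{2}|x_{1\rightarrow 2})p(x_{1\rightarrow 2}|x_{1})p(x_{1})\,dx_{1}\,dx_{1\rightarrow 2}$.

$$\text{IS}=\mathbb{E}_{x_{1}\sim p(x_{1})}\left[\mathbb{E}_{x_{1\rightarrow 2}\sim p(x_{1\rightarrow 2}|x_{1})}\left[\text{KL}\left(p(y_{2}|x_{1\rightarrow 2})\,\|\,p(y_{2})\right)\right]\right]\tag{9}$$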

To obtain a high CIS/IS score, a model needs to generate samples that are both high-quality and diverse. While IS measures diversity of all output images, CIS measures diversity of outputs conditioned on a single input image. A model that deterministically generates a single output given an input image will receive a zero CIS score, though it might still get a high score under IS. We use the Inception-v3 fine-tuned on our specific datasets as the classifier and estimate Eq. (8) and Eq. (9) using $100$ input images and $100$ samples per input.
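
A Monte Carlo estimate of both scores can be sketched from classifier outputs alone. In the illustrative code below, `probs[i][j]` is assumed to hold the classifier distribution $p(y_{2}|x_{1\rightarrow 2})$ for the $j$-th translation of the $i$-th input, with the same number of samples for every input; the two-class examples are made up.

```python
import math


def kl(p, q):
    """KL divergence between two discrete distributions (0 * log 0 treated as 0)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def average(dists):
    """Element-wise mean of a list of distributions."""
    return [sum(col) / len(dists) for col in zip(*dists)]


def cis_and_is(probs):
    """probs[i][j] is the classifier output for sample j of input i."""
    per_input = [average(samples) for samples in probs]  # estimates p(y_2 | x_1)
    overall = average(per_input)                         # estimates p(y_2)
    cis = sum(sum(kl(p, m) for p in samples) / len(samples)
              for samples, m in zip(probs, per_input)) / len(probs)
    inception = sum(sum(kl(p, overall) for p in samples) / len(samples)
                    for samples in probs) / len(probs)
    return cis, inception


# Deterministic model: each input always maps to one mode, different inputs differ.
deterministic = [[[1.0, 0.0], [1.0, 0.0]], [[0.0, 1.0], [0.0, 1.0]]]
# Multimodal model: each input covers both modes.
multimodal = [[[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]]]
det_cis, det_is = cis_and_is(deterministic)
multi_cis, multi_is = cis_and_is(multimodal)
```

The deterministic example obtains a CIS of zero while its IS equals $\log 2$, matching the observation that IS alone cannot reveal the lack of per-input diversity; the multimodal example obtains $\log 2$ under both scores.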

5.3 Baselines

UNIT. The UNIT model consists of two VAE-GANs with a fully shared latent space. The stochasticity of the translation comes from the Gaussian encoders as well as the dropout layers in the VAEs.

CycleGAN. CycleGAN consists of two residual translation networks trained with adversarial loss and cycle reconstruction loss. We use Dropout during both training and testing to encourage diversity, as suggested in Isola et al.

CycleGAN* with noise. To test whether we can generate multimodal outputs within the CycleGAN framework, we additionally inject noise vectors to both translation networks. We use the U-net architecture with noise added to input, since we find the noise vectors are ignored by the residual architecture in CycleGAN. Dropout is also utilized during both training and testing.

BicycleGAN. BicycleGAN is the only existing image-to-image translation model we are aware of that can generate continuous and multimodal output distributions. However, it requires paired training data. We compare our model with BicycleGAN when the dataset contains pair information.

5.4 Datasets

Edges $\leftrightarrow$ shoes/handbags. We use the datasets provided by Isola et al., Yu et al., and Zhu et al., which contain images of shoes and handbags with edge maps generated by HED. We train one model for edges $\leftrightarrow$ shoes and another for edges $\leftrightarrow$ handbags without using paired information.

Animal image translation. We collect images from $3$ categories/domains, including house cats, big cats, and dogs. Each domain contains $4$ modes which are fine-grained categories belonging to the same parent category. Note that the modes of the images are not known when learning the translation model. We learn a separate model for each pair of domains.

Street scene images. We experiment with two street scene translation tasks:

Synthetic $\leftrightarrow$ real. We perform translation between synthetic images in the SYNTHIA dataset and real-world images in the Cityscapes dataset. For the SYNTHIA dataset, we use the SYNTHIA-Seqs subset which contains images in different seasons, weather, and illumination conditions.

Summer $\leftrightarrow$ winter. We use the dataset from Liu et al., which contains summer and winter street images extracted from real-world driving videos.

Yosemite summer $\leftrightarrow$ winter (HD). We collect a new high-resolution dataset containing $3253$ summer photos and $2385$ winter photos of Yosemite. The images are downsampled such that the shortest side of each image is $1024$ pixels.

5.5 Results

First, we qualitatively compare MUNIT with the four baselines above, and three variants of MUNIT that ablate $\mathcal{L}^{x}_{\text{recon}}$, $\mathcal{L}^{c}_{\text{recon}}$, $\mathcal{L}^{s}_{\text{recon}}$ respectively. Fig. 4 shows example results on edges $\rightarrow$ shoes. Both UNIT and CycleGAN (with or without noise) fail to generate diverse outputs, despite the injected randomness. Without $\mathcal{L}^{x}_{\text{recon}}$ or $\mathcal{L}^{c}_{\text{recon}}$, the image quality of MUNIT is unsatisfactory. Without $\mathcal{L}^{s}_{\text{recon}}$, the model suffers from partial mode collapse, with many outputs being almost identical (e.g., the first two rows). Our full model produces images that are both diverse and realistic, similar to BicycleGAN but without requiring supervision.

The qualitative observations above are confirmed by quantitative evaluations. We use human preference to measure quality and LPIPS distance to evaluate diversity, as described in Sec. 5.2. We conduct this experiment on the task of edges $\rightarrow$ shoes/handbags. As shown in Table 1, UNIT and CycleGAN produce very little diversity according to LPIPS distance. Removing $\mathcal{L}^{x}_{\text{recon}}$ or $\mathcal{L}^{c}_{\text{recon}}$ from MUNIT leads to significantly worse quality. Without $\mathcal{L}^{s}_{\text{recon}}$ , both quality and diversity deteriorate. The full model obtains quality and diversity comparable to the fully supervised BicycleGAN, and significantly better than all unsupervised baselines. In Fig. 5, we show more example results on edges $\leftrightarrow$ shoes/handbags.

We proceed to perform experiments on the animal image translation dataset. As shown in Fig. 6, our model successfully translates one kind of animal to another. Given an input image, the translation outputs cover multiple modes, i.e., multiple fine-grained animal categories in the target domain. The shape of an animal has undergone significant transformations, but the pose is overall preserved. As shown in Table 2, our model obtains the highest scores according to both CIS and IS. In particular, the baselines all obtain a very low CIS, indicating their failure to generate multimodal outputs from a given input. As the IS has been shown to correlate well with image quality, the higher IS of our method suggests that it also generates images of higher quality than the baseline approaches.

Fig. 7 shows results on street scene datasets. Our model is able to generate SYNTHIA images with diverse renderings (e.g., rainy, snowy, sunset) from a given Cityscapes image, and generate Cityscapes images with different lighting, shadow, and road textures from a given SYNTHIA image. Similarly, it generates winter images with different amounts of snow from a given summer image, and summer images with different amounts of leaves from a given winter image. Fig. 8 shows example results of summer $\leftrightarrow$ winter transfer on the high-resolution Yosemite dataset. Our algorithm generates output images with different lighting.

Example-guided Image Translation. Instead of sampling the style code from the prior, it is also possible to extract the style code from a reference image. Specifically, given a content image $x_{1}\in\mathcal{X}_{1}$ and a style image $x_{2}\in\mathcal{X}_{2}$, our model produces an image $x_{1\rightarrow 2}$ that recombines the content of the former and the style of the latter by $x_{1\rightarrow 2}=G_{2}(E^{c}_{1}(x_{1}),E^{s}_{2}(x_{2}))$. Examples are shown in Fig. 9. Note that this is similar to classical style transfer algorithms that transfer the style of one image to another. In Fig. 10, we compare our method with classical style transfer algorithms including Gatys et al., Chen et al., AdaIN, and WCT. Our method produces results that are significantly more faithful and realistic, since our method learns the distribution of target domain images using GANs.

Conclusions

We presented a framework for multimodal unsupervised image-to-image translation. Our model achieves quality and diversity superior to existing unsupervised methods and comparable to the state-of-the-art supervised approach. Future work includes extending this framework to other domains, such as videos and text.

Appendix A Proofs

Proposition 1. Suppose there exist $E^{*}_{1}$, $E^{*}_{2}$, $G^{*}_{1}$, $G^{*}_{2}$ such that: 1) $E^{*}_{1}=(G^{*}_{1})^{-1}$ and $E^{*}_{2}=(G^{*}_{2})^{-1}$, and 2) $p(x_{1\rightarrow 2})=p(x_{2})$ and $p(x_{2\rightarrow 1})=p(x_{1})$. Then $E^{*}_{1}$, $E^{*}_{2}$, $G^{*}_{1}$, $G^{*}_{2}$ minimize $\mathcal{L}(E_{1},E_{2},G_{1},G_{2})=\underset{D_{1},D_{2}}{\max}\ \mathcal{L}(E_{1},E_{2},G_{1},G_{2},D_{1},D_{2})$ (Eq. (5)).

As shown in Goodfellow et al., $\underset{D_{2}}{\max}\ \mathcal{L}^{x_{2}}_{\text{GAN}}=2\cdot\text{JSD}(p(x_{2})\,\|\,p(x_{1\rightarrow 2}))-\log 4$, which has a global minimum when $p(x_{2})=p(x_{1\rightarrow 2})$. Also, the bidirectional reconstruction loss terms are minimized when $E_{i}$ inverts $G_{i}$. Thus the total loss is minimized under the two stated conditions. Below, we assume the networks have sufficient capacity and the optimality is reachable as in prior works. That is, $E_{1}\rightarrow E_{1}^{*}$, $E_{2}\rightarrow E_{2}^{*}$, $G_{1}\rightarrow G_{1}^{*}$, and $G_{2}\rightarrow G_{2}^{*}$.

Proposition 2. When optimality is reached, we have: $p(c_{1})=p(c_{2})$, $p(s_{1})=q(s_{1})$, $p(s_{2})=q(s_{2})$.

Let $z_{1}$ denote the latent code, which is the concatenation of $c_{1}$ and $s_{1}$. We denote the encoded latent distribution by $p_{E}(z_{1})$, which is defined by $z_{1}=E_{1}(x_{1})$ and $x_{1}$ sampled from the data distribution $p(x_{1})$. We denote the latent distribution at generation time by $p(z_{1})$, which is obtained by $s_{1}\sim q(s_{1})$ and $c_{1}\sim p(c_{2})$. The generated image distribution $p_{G}(x_{1})=p(x_{2\rightarrow 1})$ is defined by $x_{1}=G_{1}(z_{1})$ and $z_{1}$ sampled from $p(z_{1})$. According to the change of variable formula for probability density functions:

$$p_{G}(x_{1})=\left|\frac{\partial G^{-1}_{1}(x_{1})}{\partial x_{1}}\right|p(G^{-1}_{1}(x_{1}))$$

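According to Proposition 1, we have $p_{G}(x_{1})=p(x_{1})$ and $E_{1}=G^{-1}_{1}$ when optimality is reached. Thus:

$$\begin{aligned}
p_{E}(z_{1}) &= \left|\frac{\partial E^{-1}_{1}(z_{1})}{\partial z_{1}}\right|p(E^{-1}_{1}(z_{1}))\\
&= \left|\frac{\partial E^{-1}_{1}(z_{1})}{\partial z_{1}}\right|p_{G}(E^{-1}_{1}(z_{1}))\\
&= \left|\frac{\partial E^{-1}_{1}(z_{1})}{\partial z_{1}}\right|\left|\frac{\partial G^{-1}_{1}(E^{-1}_{1}(z_{1}))}{\partial E^{-1}_{1}(z_{1})}\right|p(G^{-1}_{1}(E^{-1}_{1}(z_{1})))\\
&= \left|\frac{\partial E^{-1}_{1}(z_{1})}{\partial z_{1}}\right|\left|\frac{\partial E_{1}(E^{-1}_{1}(z_{1}))}{\partial E^{-1}_{1}(z_{1})}\right|p(E_{1}(E^{-1}_{1}(z_{1})))\\
&= p(z_{1})
\end{aligned}$$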

Similarly we have $p_{E}(z_{2})=p(z_{2})$ , which together prove the original proposition. From another perspective, we note that $\mathcal{L}^{c_{2}}_{\text{recon}},\mathcal{L}^{s_{1}}_{\text{recon}},\mathcal{L}^{x_{1}}_{\text{GAN}}$ coincide with the objective of a WAE or AAE in the latent space, which pushes the encoded latent distribution towards the latent distribution at generation time.

Proposition 3. When optimality is reached, we have $p(x_{1},x_{1\rightarrow 2})=p(x_{2\rightarrow 1},x_{2})$ .

For the ease of notation we denote the joint distribution $p(x_{1},x_{1\rightarrow 2})$ by $p_{1\rightarrow 2}(x_{1},x_{2})$ and $p(x_{2\rightarrow 1},x_{2})$ by $p_{2\rightarrow 1}(x_{1},x_{2})$ . Both densities are zero when $E_{1}^{c}(x_{1})\neq E_{2}^{c}(x_{2})$ . When $E_{1}^{c}(x_{1})=E_{2}^{c}(x_{2})$ , we also have:

Proposition 4. Denote $h_{1}=(x_{1},s_{2})\in\mathcal{H}_{1}$ and $h_{2}=(x_{2},s_{1})\in\mathcal{H}_{2}$ . $h_{1},h_{2}$ are points in the joint spaces of image and style. Our model defines a deterministic mapping $F_{1\rightarrow 2}$ from $\mathcal{H}_{1}$ to $\mathcal{H}_{2}$ (and vice versa) by $F_{1\rightarrow 2}(h_{1})=F_{1\rightarrow 2}(x_{1},s_{2})\triangleq(G_{2}(E^{c}_{1}(x_{1}),s_{2}),E^{s}_{1}(x_{1}))$ . When optimality is achieved, we have $F_{1\rightarrow 2}=F_{2\rightarrow 1}^{-1}$ .
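
The composition $F_{2\rightarrow 1}\circ F_{1\rightarrow 2}$ can be expanded step by step from the definition of $F$. The following is a sketch reconstructed from that definition, assuming optimality so that the encoders invert the decoders; the step labels are chosen to match the references to steps (3), (4), and (5) made below.

$$\begin{aligned}
F_{2\rightarrow 1}(F_{1\rightarrow 2}(x_{1},s_{2})) &= F_{2\rightarrow 1}(G_{2}(E^{c}_{1}(x_{1}),s_{2}),E^{s}_{1}(x_{1})) && (1)\\
&= (G_{1}(E^{c}_{2}(G_{2}(E^{c}_{1}(x_{1}),s_{2})),E^{s}_{1}(x_{1})),\ E^{s}_{2}(G_{2}(E^{c}_{1}(x_{1}),s_{2}))) && (2)\\
&= (G_{1}(E^{c}_{2}(G_{2}(E^{c}_{1}(x_{1}),s_{2})),E^{s}_{1}(x_{1})),\ s_{2}) && (3)\\
&= (G_{1}(E^{c}_{1}(x_{1}),E^{s}_{1}(x_{1})),\ s_{2}) && (4)\\
&= (x_{1},s_{2}) && (5)
\end{aligned}$$

Step (2) applies the definition of $F_{2\rightarrow 1}$ to the translated image and the carried-over style; each of the remaining steps replaces one decode-then-encode (or encode-then-decode) round trip with the identity.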

And we can prove $F_{1\rightarrow 2}(F_{2\rightarrow 1}(x_{2},s_{1}))=(x_{2},s_{1})$ in a similar manner. To be more specific, $(3)$ is implied by the style reconstruction loss $\mathcal{L}^{s}_{\text{recon}}$, $(4)$ is implied by the content reconstruction loss $\mathcal{L}^{c}_{\text{recon}}$, and $(5)$ is implied by the image reconstruction loss $\mathcal{L}^{x}_{\text{recon}}$. As a result, style-augmented cycle consistency is implied by the proposed bidirectional reconstruction loss.

Let $x^{*}_{1}$ be a sample from $p(x_{1})$, and let $x^{\prime}_{2}$, $x^{\prime\prime}_{2}$ be two samples from $p_{G}(x_{2}|x^{*}_{1})$. Due to cycle consistency in $\mathcal{X}_{1}\rightarrow\mathcal{X}_{2}\rightarrow\mathcal{X}_{1}$, we have $p_{G}(x_{1}|x^{\prime}_{2})=p_{G}(x_{1}|x^{\prime\prime}_{2})=\delta(x_{1}-x^{*}_{1})$. Also, $x^{\prime}_{2}\in\mathcal{X}_{2}$ and $x^{\prime\prime}_{2}\in\mathcal{X}_{2}$ because of matched marginals. Due to cycle consistency in $\mathcal{X}_{2}\rightarrow\mathcal{X}_{1}\rightarrow\mathcal{X}_{2}$, we have $p_{G}(x_{2}|x^{*}_{1})=\delta(x_{2}-x^{\prime}_{2})=\delta(x_{2}-x^{\prime\prime}_{2})$. Thus $p_{G}(x_{2}|x_{1})$ collapses to a delta function, and similarly for $p_{G}(x_{1}|x_{2})$. This proposition shows that cycle consistency is too strong a constraint for multimodal image translation.

Appendix B Training Details

We use the Adam optimizer with $\beta_{1}=0.5$ , $\beta_{2}=0.999$ , and an initial learning rate of $0.0001$ . The learning rate is decreased by half every $100,000$ iterations. In all experiments, we use a batch size of 1 and set the loss weights to $\lambda_{x}=10$ , $\lambda_{c}=1$ , $\lambda_{s}=1$ . We use the domain-invariant perceptual loss with weight $1$ in the street scene and Yosemite datasets. We choose the dimension of the style code to be $8$ across all datasets. Random mirroring is applied during training.
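
The learning rate schedule described above amounts to a step decay. The helper below is a small sketch of it; the function name is hypothetical, and the assumption that the halving takes effect exactly at each multiple of $100,000$ iterations is ours.

```python
def learning_rate(iteration, base_lr=1e-4, step=100_000, factor=0.5):
    """Step decay: multiply the initial rate by `factor` once per `step` iterations."""
    return base_lr * factor ** (iteration // step)


# Learning rate at a few checkpoints during training.
schedule = {it: learning_rate(it) for it in (0, 99_999, 100_000, 250_000)}
```

Under this reading, the rate stays at $0.0001$ for the first $100,000$ iterations, drops to $0.00005$ for the next $100,000$, and so on.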

B.2 Network Architectures

Let c7s1-k denote a $7\times 7$ convolutional block with k filters and stride 1. dk denotes a $4\times 4$ convolutional block with k filters and stride 2. Rk denotes a residual block that contains two $3\times 3$ convolutional blocks. uk denotes a $2\times$ nearest-neighbor upsampling layer followed by a $5\times 5$ convolutional block with k filters and stride 1. GAP denotes a global average pooling layer. fck denotes a fully connected layer with k filters. We apply Instance Normalization (IN) to the content encoder and Adaptive Instance Normalization (AdaIN) to the decoder. We use ReLU activations in the generator and Leaky ReLU with slope $0.2$ in the discriminator. We use multi-scale discriminators with $3$ scales.

Content encoder: c7s1-64, d128, d256, R256, R256, R256, R256

Style encoder: c7s1-64, d128, d256, d256, d256, GAP, fc8

Decoder: R256, R256, R256, R256, u128, u64, c7s1-3

Discriminator architecture: d64, d128, d256, d512
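
The block notation can be read mechanically. The parser below is an illustrative sketch (its names and dictionary layout are assumptions, not part of the released code) that expands a specification string into layer descriptions using the kernel sizes and strides defined above.

```python
import re

# Block patterns and the fixed attributes the notation assigns to each of them.
PATTERNS = [
    (r"c7s1-(\d+)", {"op": "conv", "kernel": 7, "stride": 1}),
    (r"d(\d+)", {"op": "conv", "kernel": 4, "stride": 2}),
    (r"R(\d+)", {"op": "residual", "kernel": 3, "stride": 1}),
    (r"u(\d+)", {"op": "upsample2x+conv", "kernel": 5, "stride": 1}),
    (r"fc(\d+)", {"op": "fc"}),
]


def parse_architecture(spec):
    """Turn a string such as 'c7s1-64, d128, GAP, fc8' into a list of layer dicts."""
    layers = []
    for token in (t.strip() for t in spec.split(",")):
        if token == "GAP":
            layers.append({"op": "gap"})
            continue
        for pattern, base in PATTERNS:
            match = re.fullmatch(pattern, token)
            if match:
                layers.append({**base, "filters": int(match.group(1))})
                break
        else:
            raise ValueError(f"unknown block: {token}")
    return layers


def downsampling_factor(layers):
    """Product of the strides of all blocks that define one."""
    factor = 1
    for layer in layers:
        factor *= layer.get("stride", 1)
    return factor


content_encoder = parse_architecture("c7s1-64, d128, d256, R256, R256, R256, R256")
style_encoder = parse_architecture("c7s1-64, d128, d256, d256, d256, GAP, fc8")
```

Read this way, the content encoder reduces the spatial resolution by a factor of 4, which the two `u` blocks of the decoder undo, and the final `fc8` of the style encoder yields the 8-dimensional style code.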

Appendix C Domain-invariant Perceptual Loss

We conduct an experiment to verify whether applying IN before computing the feature distance can indeed make the distance more domain-invariant. We experiment on the day $\leftrightarrow$ night dataset used by Isola et al. and originally proposed by Laffont et al. We randomly sample two sets of image pairs: 1) images from the same domain (both day or both night) but different scenes, 2) images from the same scene but different domains. Fig. 11 shows examples from the two sets of image pairs. We then compute the VGG feature (relu4_3) distance between each image pair, with IN either applied or not before computing the distance. In Fig. 12, we show histograms of the distance computed either with or without IN, and from image pairs either of the same domain or the same scene. Without applying IN before computing the distance, the distribution of feature distance is similar for both sets of image pairs. With IN enabled, however, image pairs from the same scene have clearly smaller distance, even though they come from different domains. The results suggest that applying IN before computing the distance makes the feature distance much more domain-invariant.