Triangle Generative Adversarial Networks

Zhe Gan, Liqun Chen, Weiyao Wang, Yunchen Pu, Yizhe Zhang, Hao Liu, Chunyuan Li, Lawrence Carin

Introduction

Generative adversarial networks (GANs) have emerged as a powerful framework for learning generative models of arbitrarily complex data distributions. When trained on datasets of natural images, GANs have made significant progress in generating realistic, sharp-looking images. The original GAN formulation was designed to learn the data distribution in one domain. In practice, one may also be interested in matching two joint distributions. This is an important task, since mapping data samples from one domain to another has a wide range of applications. For instance, matching the joint distribution of image-text pairs allows simultaneous image captioning and text-conditional image generation, while image-to-image translation is another challenging problem that requires matching the joint distribution of image-image pairs.

In this work, we are interested in designing a GAN framework to match joint distributions. If paired data are available, a simple approach is to train a conditional GAN model, from which a joint distribution is readily manifested and can be matched to the empirical joint distribution provided by the paired data. However, fully supervised data are often difficult to acquire. Several methods have been proposed to achieve unsupervised joint distribution matching without any paired data, including DiscoGAN, CycleGAN and DualGAN. Adversarially Learned Inference (ALI) and Bidirectional GAN (BiGAN) can be readily adapted to this case as well. Though these methods have achieved great empirical success, in principle there exist infinitely many possible mapping functions that satisfy the requirement to map a sample from one domain to another. To alleviate this nonidentifiability issue, paired data are needed to provide proper supervision that informs the model of the kind of joint distribution that is desired.

This motivates the proposed Triangle Generative Adversarial Network ($\Delta$-GAN), a GAN framework that allows semi-supervised joint distribution matching, where the supervision of domain correspondence is provided by a few paired samples. $\Delta$-GAN consists of two generators and two discriminators. The generators are designed to learn the bidirectional mappings between domains, while the discriminators are trained to distinguish real data pairs and two kinds of fake data pairs. Both the generators and discriminators are trained together via adversarial learning.

$\Delta$-GAN bears close resemblance to Triple GAN, a recently proposed method that can also be utilized for semi-supervised joint distribution mapping. However, there exist several key differences that make our work unique. First, $\Delta$-GAN uses two discriminators in total, which implicitly defines a ternary discriminative function, instead of a binary discriminator as used in Triple GAN. Second, $\Delta$-GAN can be considered as a combination of conditional GAN and ALI, while Triple GAN consists of two conditional GANs. Third, the distributions characterized by the two generators in both $\Delta$-GAN and Triple GAN concentrate to the data distribution in theory. However, when the discriminator is optimal, the objective of $\Delta$-GAN becomes the Jensen-Shannon divergence (JSD) among three distributions, which is symmetric; the objective of Triple GAN consists of a JSD term plus a Kullback-Leibler (KL) divergence term. The asymmetry of the KL term makes Triple GAN more prone to generating fake-looking samples. Lastly, the calculation of the additional KL term in Triple GAN is equivalent to calculating a supervised loss, which requires the explicit density form of the conditional distributions; this requirement may not be desirable. On the other hand, $\Delta$-GAN is a fully adversarial approach that does not require that the conditional densities can be computed; $\Delta$-GAN only requires that the conditional densities can be sampled from in a way that allows gradient backpropagation.

$\Delta$-GAN is a general framework, and can be used to match any joint distributions. In experiments, in order to demonstrate the versatility of the proposed model, we consider three domain pairs: image-label, image-image and image-attribute pairs, and use them for semi-supervised classification, image-to-image translation and attribute-based image editing, respectively. In order to demonstrate the scalability of the model to large and complex datasets, we also present attribute-conditional image generation on the COCO dataset.

Model

Generative Adversarial Networks (GANs) consist of a generator $G$ and a discriminator $D$ that compete in a two-player minimax game, where the generator learns to map samples from an arbitrary latent distribution to data, while the discriminator tries to distinguish between real and generated samples. The goal of the generator is to “fool” the discriminator by producing samples that are as close to real data as possible. Specifically, $D$ and $G$ are learned as

$$\min_{G}\max_{D}\; V(D,G)=\mathbb{E}_{\boldsymbol{x}\sim p(\boldsymbol{x})}[\log D(\boldsymbol{x})]+\mathbb{E}_{\boldsymbol{z}\sim p_{z}(\boldsymbol{z})}[\log(1-D(G(\boldsymbol{z})))], \qquad (1)$$

where $p(\boldsymbol{x})$ is the true data distribution, and $p_{z}(\boldsymbol{z})$ is usually defined to be a simple distribution, such as the standard normal distribution. The generator $G$ implicitly defines a probability distribution $p_{g}(\boldsymbol{x})$ as the distribution of the samples $G(\boldsymbol{z})$ obtained when $\boldsymbol{z}\sim p_{z}(\boldsymbol{z})$. For any fixed generator $G$, the optimal discriminator is $D(\boldsymbol{x})=\frac{p(\boldsymbol{x})}{p_{g}(\boldsymbol{x})+p(\boldsymbol{x})}$. When the discriminator is optimal, solving this adversarial game is equivalent to minimizing the Jensen-Shannon divergence (JSD) between $p(\boldsymbol{x})$ and $p_{g}(\boldsymbol{x})$. The global equilibrium is achieved if and only if $p(\boldsymbol{x})=p_{g}(\boldsymbol{x})$.
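The link between the optimal discriminator and the JSD can be checked numerically on a finite sample space. The sketch below is only an illustration: the two probability vectors are arbitrary stand-ins for $p(\boldsymbol{x})$ and $p_{g}(\boldsymbol{x})$, and the helper names (`kl`, `jsd`, `gan_value`) are introduced here rather than taken from the paper. It uses the standard identity that the value at the optimal discriminator equals $-\log 4 + 2\,\mathrm{JSD}(p, p_g)$.

```python
import numpy as np

def kl(p, q):
    """KL divergence between two discrete distributions with full support."""
    return float(np.sum(p * np.log(p / q)))

def jsd(p, q):
    """Equal-weight Jensen-Shannon divergence between two distributions."""
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def gan_value(p, p_g, d):
    """E_p[log D] + E_{p_g}[log(1 - D)] evaluated on a finite sample space."""
    return float(np.sum(p * np.log(d) + p_g * np.log(1.0 - d)))

# Arbitrary illustrative distributions (assumptions, not data from the paper).
p = np.array([0.1, 0.2, 0.3, 0.4])
p_g = np.array([0.4, 0.3, 0.2, 0.1])

# Optimal discriminator for a fixed generator.
d_star = p / (p + p_g)
v_star = gan_value(p, p_g, d_star)

# Any other discriminator should give a value that is no larger.
rng = np.random.default_rng(0)
d_other = np.clip(d_star + rng.normal(scale=0.05, size=d_star.shape), 1e-6, 1 - 1e-6)
v_other = gan_value(p, p_g, d_other)

print("value at optimal D:", v_star)
print("-log 4 + 2 * JSD  :", -np.log(4.0) + 2.0 * jsd(p, p_g))
```

When `p_g` equals `p`, the JSD term vanishes and the value reduces to $-\log 4$, which matches the equilibrium condition stated above.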

2.2 Triangle Generative Adversarial Networks ($\Delta$-GANs)

We now extend GAN to $\Delta$-GAN for joint distribution matching. We first consider $\Delta$-GAN in the supervised setting, and then discuss semi-supervised learning in Section 2.4. Consider two related domains, with $\boldsymbol{x}$ and $\boldsymbol{y}$ being the data samples for each domain. We have fully-paired data samples that are characterized by the joint distribution $p(\boldsymbol{x},\boldsymbol{y})$, which also implies that samples from both marginals $p(\boldsymbol{x})$ and $p(\boldsymbol{y})$ can be easily obtained.

The value function describing the game is given by

The discriminator $D_{1}$ is used to distinguish whether a sample pair is from $p(\boldsymbol{x},\boldsymbol{y})$ or not. If the sample pair is not from $p(\boldsymbol{x},\boldsymbol{y})$, another discriminator $D_{2}$ is used to distinguish whether it is from $p_{x}(\boldsymbol{x},\boldsymbol{y})$ or $p_{y}(\boldsymbol{x},\boldsymbol{y})$. $D_{1}$ and $D_{2}$ work cooperatively, and the use of both implicitly defines a ternary discriminative function $D$ that distinguishes sample pairs in three ways. See Figure 1 for an illustration of the adversarial game and Appendix B for an algorithmic description of the training procedure.
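The roles of $D_{1}$ and $D_{2}$ described above are consistent with a value function of the following form. This is a sketch under stated assumptions rather than a verbatim restatement: $\tilde{\boldsymbol{x}}$ and $\tilde{\boldsymbol{y}}$ are notation introduced here for generated samples, $p_{x}(\boldsymbol{x},\boldsymbol{y})=p_{x}(\boldsymbol{x}|\boldsymbol{y})p(\boldsymbol{y})$ and $p_{y}(\boldsymbol{x},\boldsymbol{y})=p_{y}(\boldsymbol{y}|\boldsymbol{x})p(\boldsymbol{x})$ are assumed to be the joint distributions induced by the two generators, and the choice that $D_{2}$ outputs values near 1 for pairs from $p_{x}$ is a labeling convention.

$$
\begin{aligned}
\min_{G_{x},G_{y}}\max_{D_{1},D_{2}}\; V(G_{x},G_{y},D_{1},D_{2})
={}& \mathbb{E}_{(\boldsymbol{x},\boldsymbol{y})\sim p(\boldsymbol{x},\boldsymbol{y})}\big[\log D_{1}(\boldsymbol{x},\boldsymbol{y})\big] \\
&+ \mathbb{E}_{\boldsymbol{y}\sim p(\boldsymbol{y}),\,\tilde{\boldsymbol{x}}\sim p_{x}(\boldsymbol{x}|\boldsymbol{y})}\big[\log\big((1-D_{1}(\tilde{\boldsymbol{x}},\boldsymbol{y}))\cdot D_{2}(\tilde{\boldsymbol{x}},\boldsymbol{y})\big)\big] \\
&+ \mathbb{E}_{\boldsymbol{x}\sim p(\boldsymbol{x}),\,\tilde{\boldsymbol{y}}\sim p_{y}(\boldsymbol{y}|\boldsymbol{x})}\big[\log\big((1-D_{1}(\boldsymbol{x},\tilde{\boldsymbol{y}}))\cdot(1-D_{2}(\boldsymbol{x},\tilde{\boldsymbol{y}}))\big)\big].
\end{aligned}
$$

Under this form the three quantities $D_{1}$, $(1-D_{1})D_{2}$ and $(1-D_{1})(1-D_{2})$ are non-negative and sum to one, so they can be read as the three class probabilities of the ternary function $D$. With $D_{1}=\frac{1}{3}$ and $D_{2}=\frac{1}{2}$ each of the three terms equals $\log\frac{1}{3}$, giving a total of $-3\log 3$.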

2.3 Theoretical analysis

$\Delta$-GAN shares many of the theoretical properties of GANs. We first consider the optimal discriminators $D_{1}$ and $D_{2}$ for any given generators $G_{x}$ and $G_{y}$. These optimal discriminators then allow reformulation of objective (2), which reduces to the Jensen-Shannon divergence among the joint distributions $p(\boldsymbol{x},\boldsymbol{y})$, $p_{x}(\boldsymbol{x},\boldsymbol{y})$ and $p_{y}(\boldsymbol{x},\boldsymbol{y})$.

For any fixed generators $G_{x}$ and $G_{y}$, the optimal discriminators $D_{1}$ and $D_{2}$ of the game defined by $V(G_{x},G_{y},D_{1},D_{2})$ are

The proof is a straightforward extension of the proof for the original GAN. See Appendix A for details. ∎

The equilibrium of $V(G_{x},G_{y},D_{1},D_{2})$ is achieved if and only if $p(\boldsymbol{x},\boldsymbol{y})=p_{x}(\boldsymbol{x},\boldsymbol{y})=p_{y}(\boldsymbol{x},\boldsymbol{y})$ with $D_{1}^{*}(\boldsymbol{x},\boldsymbol{y})=\frac{1}{3}$ and $D_{2}^{*}(\boldsymbol{x},\boldsymbol{y})=\frac{1}{2}$, and the optimum value is $-3\log 3$.

Given the optimal $D_{1}^{*}(\boldsymbol{x},\boldsymbol{y})$ and $D_{2}^{*}(\boldsymbol{x},\boldsymbol{y})$, the minimax game can be reformulated as:

where $JSD$ denotes the Jensen-Shannon divergence (JSD) among three distributions. See Appendix A for details. ∎

Since $p(\boldsymbol{x},\boldsymbol{y})=p_{x}(\boldsymbol{x},\boldsymbol{y})=p_{y}(\boldsymbol{x},\boldsymbol{y})$ can be achieved in theory, it can be readily seen that the learned conditional generators can reveal the true conditional distributions underlying the data, i.e., $p_{x}(\boldsymbol{x}|\boldsymbol{y})=p(\boldsymbol{x}|\boldsymbol{y})$ and $p_{y}(\boldsymbol{y}|\boldsymbol{x})=p(\boldsymbol{y}|\boldsymbol{x})$.
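For reference, the quantities discussed in this subsection can be written out explicitly. The following is a sketch under one assumption: that the value function scores real pairs with $\log D_{1}$, pairs from $p_{x}$ with $\log((1-D_{1})D_{2})$, and pairs from $p_{y}$ with $\log((1-D_{1})(1-D_{2}))$. Under that assumption, pointwise maximization over $D_{1}$ and $D_{2}$ gives

$$
D_{1}^{*}(\boldsymbol{x},\boldsymbol{y})=\frac{p(\boldsymbol{x},\boldsymbol{y})}{p(\boldsymbol{x},\boldsymbol{y})+p_{x}(\boldsymbol{x},\boldsymbol{y})+p_{y}(\boldsymbol{x},\boldsymbol{y})},
\qquad
D_{2}^{*}(\boldsymbol{x},\boldsymbol{y})=\frac{p_{x}(\boldsymbol{x},\boldsymbol{y})}{p_{x}(\boldsymbol{x},\boldsymbol{y})+p_{y}(\boldsymbol{x},\boldsymbol{y})},
$$

and substituting these back into the value function yields

$$
C(G_{x},G_{y})=-3\log 3+3\cdot JSD\big(p(\boldsymbol{x},\boldsymbol{y}),\,p_{x}(\boldsymbol{x},\boldsymbol{y}),\,p_{y}(\boldsymbol{x},\boldsymbol{y})\big)\;\geq\;-3\log 3 .
$$

Both expressions agree with the values stated above: when the three joint distributions coincide, $D_{1}^{*}=\frac{1}{3}$, $D_{2}^{*}=\frac{1}{2}$, the JSD term is zero, and the objective equals $-3\log 3$.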

2.4 Semi-supervised learning

In order to further understand $\Delta$-GAN, we write (2) as

The objective of $\Delta$-GAN is a combination of the objectives of conditional GAN and BiGAN. The BiGAN part matches two joint distributions, $p_{x}(\boldsymbol{x},\boldsymbol{y})$ and $p_{y}(\boldsymbol{x},\boldsymbol{y})$, while the conditional GAN part provides the supervision signal that tells the BiGAN part which joint distribution to match. Therefore, $\Delta$-GAN provides a natural way to perform semi-supervised learning, since the conditional GAN part and the BiGAN part can be used to account for paired and unpaired data, respectively.
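The split into a conditional GAN part and a BiGAN part can be made concrete with a short derivation. It is a sketch that assumes the value function scores pairs from $p_{x}$ with $\log((1-D_{1})D_{2})$ and pairs from $p_{y}$ with $\log((1-D_{1})(1-D_{2}))$; the tildes mark generated samples and are notation introduced here. Applying $\log(ab)=\log a+\log b$ to those two terms and regrouping gives

$$
\begin{aligned}
V ={}& \underbrace{\mathbb{E}_{p(\boldsymbol{x},\boldsymbol{y})}[\log D_{1}(\boldsymbol{x},\boldsymbol{y})]
+\mathbb{E}_{p_{x}(\tilde{\boldsymbol{x}},\boldsymbol{y})}[\log(1-D_{1}(\tilde{\boldsymbol{x}},\boldsymbol{y}))]
+\mathbb{E}_{p_{y}(\boldsymbol{x},\tilde{\boldsymbol{y}})}[\log(1-D_{1}(\boldsymbol{x},\tilde{\boldsymbol{y}}))]}_{\text{conditional GAN}} \\
&+\underbrace{\mathbb{E}_{p_{x}(\tilde{\boldsymbol{x}},\boldsymbol{y})}[\log D_{2}(\tilde{\boldsymbol{x}},\boldsymbol{y})]
+\mathbb{E}_{p_{y}(\boldsymbol{x},\tilde{\boldsymbol{y}})}[\log(1-D_{2}(\boldsymbol{x},\tilde{\boldsymbol{y}}))]}_{\text{BiGAN}} .
\end{aligned}
$$

Only the first expectation involves paired data; every other term needs samples from a marginal and a generator, which is why unpaired data can be used for them.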

However, when doing semi-supervised learning, there is one potential problem that we need to be cautious about. The theoretical analysis in Section 2.3 is based on the assumption that the dataset is fully supervised, i.e., we have the ground-truth joint distribution $p(\boldsymbol{x},\boldsymbol{y})$ and marginal distributions $p(\boldsymbol{x})$ and $p(\boldsymbol{y})$. In the semi-supervised setting, $p(\boldsymbol{x})$ and $p(\boldsymbol{y})$ are still available but $p(\boldsymbol{x},\boldsymbol{y})$ is not. We can only obtain the joint distribution $p_{l}(\boldsymbol{x},\boldsymbol{y})$ characterized by the few paired data samples. Hence, in the semi-supervised setting, $p_{x}(\boldsymbol{x},\boldsymbol{y})$ and $p_{y}(\boldsymbol{x},\boldsymbol{y})$ will try to concentrate to the empirical distribution $p_{l}(\boldsymbol{x},\boldsymbol{y})$. We make the assumption that $p_{l}(\boldsymbol{x},\boldsymbol{y})\approx p(\boldsymbol{x},\boldsymbol{y})$, i.e., the paired data can roughly characterize the whole dataset. For example, in the semi-supervised classification problem, one usually strives to make sure that labels are equally distributed among the labeled dataset.

2.5 Relation to Triple GAN

$\Delta$-GAN is closely related to Triple GAN. Below we review Triple GAN and then discuss the main differences. The value function of Triple GAN is defined as follows:

2.6 Applications

$\Delta$-GAN is a general framework that can be used for any joint distribution matching. Besides semi-supervised image classification, we also conduct experiments on image-to-image translation and attribute-conditional image generation. When modeling image pairs, both $p_{x}(\boldsymbol{x}|\boldsymbol{y})$ and $p_{y}(\boldsymbol{y}|\boldsymbol{x})$ are implemented without introducing additional latent variables, i.e., $p_{x}(\boldsymbol{x}|\boldsymbol{y})=\delta(\boldsymbol{x}-G_{x}(\boldsymbol{y}))$ and $p_{y}(\boldsymbol{y}|\boldsymbol{x})=\delta(\boldsymbol{y}-G_{y}(\boldsymbol{x}))$.

A different strategy is adopted when modeling the image-label/attribute pairs. Specifically, let $\boldsymbol{x}$ denote samples in the image domain and $\boldsymbol{y}$ denote samples in the label/attribute domain. $\boldsymbol{y}$ is a one-hot vector when representing labels and a binary vector when representing attributes. When modeling $p_{x}(\boldsymbol{x}|\boldsymbol{y})$, we assume that $\boldsymbol{x}$ is transformed by the latent style variables $\boldsymbol{z}$ given the label or attribute vector $\boldsymbol{y}$, i.e., $p_{x}(\boldsymbol{x}|\boldsymbol{y})=\int\delta(\boldsymbol{x}-G_{x}(\boldsymbol{y},\boldsymbol{z}))p(\boldsymbol{z})d\boldsymbol{z}$, where $p(\boldsymbol{z})$ is chosen to be a simple distribution (e.g., uniform or standard normal). When learning $p_{y}(\boldsymbol{y}|\boldsymbol{x})$, it is assumed to be a standard multi-class or multi-label classifier without latent variables $\boldsymbol{z}$. In order to allow the training signal to be backpropagated from $D_{1}$ and $D_{2}$ to $G_{y}$, we adopt the REINFORCE algorithm, and use the label with the maximum probability to approximate the expectation over $\boldsymbol{y}$, or use the output of the sigmoid function as the predicted attribute vector.
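The two ways of turning classifier outputs into a vector $\boldsymbol{y}$ described above can be sketched in a few lines. This is only an illustration of the approximation step, not of REINFORCE itself, and not the authors' code: the function names and the example logits are hypothetical, and the classifier is represented only by its raw output scores.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def predicted_label_one_hot(logits):
    """Approximate the expectation over y by the single most probable label,
    returned as a one-hot vector (multi-class case)."""
    probs = softmax(logits)
    idx = probs.argmax(axis=-1)
    return np.eye(logits.shape[-1])[idx]

def predicted_attributes(logits):
    """Use the sigmoid outputs directly as a soft attribute vector
    (multi-label case)."""
    return 1.0 / (1.0 + np.exp(-logits))

# Hypothetical classifier scores for two images (3 classes) and one image (3 attributes).
label_logits = np.array([[2.0, 0.5, -1.0], [0.1, 0.2, 3.0]])
attr_logits = np.array([[4.0, -4.0, 0.0]])

y_label = predicted_label_one_hot(label_logits)
y_attr = predicted_attributes(attr_logits)
print(y_label)
print(y_attr)
```

The one-hot choice discards the gradient through the argmax, which is why a score-function estimator such as REINFORCE is needed in that case; the sigmoid output is differentiable with respect to the logits.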

Related work

The proposed framework focuses on designing GAN for joint-distribution matching. Conditional GAN can be used for this task if supervised data is available. Various conditional GANs have been proposed to condition the image generation on class labels, attributes, texts and images. Unsupervised learning methods have also been developed for this task. BiGAN and ALI proposed a method to jointly learn a generation network and an inference network via adversarial learning. Though originally designed for learning the two-way transition between the stochastic latent variables and real data samples, BiGAN and ALI can be directly adapted to learn the joint distribution of two real domains. Another method is DiscoGAN, in which two generators are used to model the bidirectional mapping between domains, and another two discriminators are used to decide whether a generated sample is fake or not in each individual domain. Further, additional reconstruction losses are introduced to make the two generators strongly coupled and also alleviate the problem of mode collapse. Similar work includes CycleGAN, DualGAN and DTN. Additional weight-sharing constraints are introduced in CoGAN and UNIT.

Our work differs from the above work in that we aim at semi-supervised joint distribution matching. The only work that we are aware of that also achieves this goal is Triple GAN. However, our model is distinct from Triple GAN in important ways (see Section 2.5). Further, Triple GAN only focuses on image classification, while $\Delta$-GAN has been shown to be applicable to a wide range of applications.

Various methods and model architectures have been proposed to improve and stabilize the training of GAN, such as feature matching, Wasserstein GAN, energy-based GAN, and unrolled GAN, among many other related works. Our work is orthogonal to these methods, which could also be used to improve the training of $\Delta$-GAN. Instead of using an adversarial loss, there also exists work that uses supervised learning for joint-distribution matching, and variational autoencoders for semi-supervised learning. Lastly, our work is also closely related to recent work that treats one of the domains as latent variables.

Experiments

We present results on three tasks: (i) semi-supervised classification on CIFAR10; (ii) image-to-image translation on MNIST and the edges2shoes dataset; and (iii) attribute-to-image generation on CelebA and COCO. We also conduct a toy data experiment to further demonstrate the differences between $\Delta$-GAN and Triple GAN. We implement $\Delta$-GAN without introducing additional regularization unless explicitly stated. All the network architectures are provided in the Appendix.

We first compare our method with Triple GAN on a toy dataset. We synthesize data by drawing $(x,y)\sim\tfrac{1}{4}\mathcal{N}(\boldsymbol{\mu}_{1},\boldsymbol{\Sigma}_{1})+\tfrac{1}{4}\mathcal{N}(\boldsymbol{\mu}_{2},\boldsymbol{\Sigma}_{2})+\tfrac{1}{4}\mathcal{N}(\boldsymbol{\mu}_{3},\boldsymbol{\Sigma}_{3})+\tfrac{1}{4}\mathcal{N}(\boldsymbol{\mu}_{4},\boldsymbol{\Sigma}_{4})$, where $\boldsymbol{\mu}_{1}=[0,1.5]^{\top}$, $\boldsymbol{\mu}_{2}=[-1.5,0]^{\top}$, $\boldsymbol{\mu}_{3}=[1.5,0]^{\top}$, $\boldsymbol{\mu}_{4}=[0,-1.5]^{\top}$, $\boldsymbol{\Sigma}_{1}=\boldsymbol{\Sigma}_{4}=\left(\begin{smallmatrix}3&0\\ 0&0.025\end{smallmatrix}\right)$ and $\boldsymbol{\Sigma}_{2}=\boldsymbol{\Sigma}_{3}=\left(\begin{smallmatrix}0.025&0\\ 0&3\end{smallmatrix}\right)$. We generate 5000 $(x,y)$ pairs for each mixture component. In order to implement $\Delta$-GAN and Triple GAN-s, we model $p_{x}(x|y)$ and $p_{y}(y|x)$ as $p_{x}(x|y)=\int\delta(x-G_{x}(y,\boldsymbol{z}))p(\boldsymbol{z})d\boldsymbol{z}$ and $p_{y}(y|x)=\int\delta(y-G_{y}(x,\boldsymbol{z}))p(\boldsymbol{z})d\boldsymbol{z}$, where both $G_{x}$ and $G_{y}$ are modeled as a 4-hidden-layer multilayer perceptron (MLP) with 500 hidden units in each layer. $p(\boldsymbol{z})$ is a bivariate standard Gaussian distribution. Triple GAN can be implemented by specifying both $p_{x}(x|y)$ and $p_{y}(y|x)$ to be distributions with explicit density form, e.g., Gaussian distributions. However, the performance can be poor, since such a specification fails to capture the multi-modality of $p_{x}(x|y)$ and $p_{y}(y|x)$. Hence, only Triple GAN-s is implemented.
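The toy data described above can be generated with a few lines of NumPy. The sketch below follows the stated means, covariances and 5000 samples per component; the function name, the seed and the final shuffling step are choices made here for illustration.

```python
import numpy as np

# Component means and covariances as stated in the text.
MEANS = np.array([[0.0, 1.5], [-1.5, 0.0], [1.5, 0.0], [0.0, -1.5]])
COV_WIDE_X = np.diag([3.0, 0.025])   # Sigma_1 = Sigma_4
COV_WIDE_Y = np.diag([0.025, 3.0])   # Sigma_2 = Sigma_3
COVS = [COV_WIDE_X, COV_WIDE_Y, COV_WIDE_Y, COV_WIDE_X]

def sample_toy_pairs(n_per_component=5000, seed=0):
    """Draw (x, y) pairs from the four-component Gaussian mixture,
    with the same number of samples from every component."""
    rng = np.random.default_rng(seed)
    parts = [
        rng.multivariate_normal(mean, cov, size=n_per_component)
        for mean, cov in zip(MEANS, COVS)
    ]
    data = rng.permutation(np.concatenate(parts, axis=0))  # shuffle rows
    return data[:, 0], data[:, 1]

x, y = sample_toy_pairs()
print(x.shape, y.shape)
```

Because each value of $x$ is compatible with several well-separated values of $y$ (and vice versa), the conditionals are multi-modal, which is the property that a single Gaussian conditional cannot capture.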

Results are shown in Figure 2. The joint distributions $p_{x}(x,y)$ and $p_{y}(x,y)$ learned by $\Delta$-GAN successfully match the true joint distribution $p(x,y)$. Triple GAN-s cannot achieve this, and can only guarantee that $\frac{1}{2}(p_{x}(x,y)+p_{y}(x,y))$ matches $p(x,y)$. Although this experiment is limited due to its simplicity, the results clearly support the advantage of our proposed model over Triple GAN.

4.2 Semi-supervised classification

We evaluate semi-supervised classification on the CIFAR10 dataset with 4000 labels. The labeled data is distributed equally across classes and the results are averaged over 10 runs with different random splits of the training data. For fair comparison, we follow the publicly available code of Triple GAN and use the same regularization terms and hyperparameter settings as theirs. Results are summarized in Table 2. Our $\Delta$-GAN achieves the best performance among all the competing methods. We also show the ability of $\Delta$-GAN to disentangle classes and styles in Figure 4. $\Delta$-GAN can generate realistic data in a specific class, and the injected noise vector encodes meaningful style patterns like background and color.

4.3 Image-to-image translation

We first evaluate image-to-image translation on the edges2shoes dataset. Results are shown in Figure 4 (bottom). Though DiscoGAN is an unsupervised learning method, it achieves impressive results. However, with supervision provided by 10% paired data, $\Delta$-GAN generally generates more accurate edge details of the shoes. In order to provide quantitative evaluation of translating shoes to edges, we use mean squared error (MSE) as our metric. The MSE of using DiscoGAN is 140.1; with 10%, 20% and 100% paired data, the MSE of using $\Delta$-GAN is 125.3, 113.0 and 66.4, respectively.

To further demonstrate the importance of providing supervision of domain correspondence, we created a new dataset based on MNIST, where the two image domains are the MNIST images and their corresponding transposed ones. As can be seen in Figure 4 (top), $\Delta$-GAN matches images between domains well, while DiscoGAN fails in this task. To support quantitative evaluation, we trained a classifier on the MNIST dataset; its classification accuracy on the test set approaches 99.4%, so it is trustworthy as an evaluation metric. Given an input MNIST image $\boldsymbol{x}$, we first generate a transposed image $\boldsymbol{y}$ using the learned generator, then manually transpose it back to a normal digit $\boldsymbol{y}^{T}$, and finally send this new image $\boldsymbol{y}^{T}$ to the classifier. Results are summarized in Table 2, which are averages over 5 runs with different random splits of the training data. $\Delta$-GAN achieves significantly better performance than Triple GAN and DiscoGAN.
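The transposed-digit evaluation protocol can be sketched as follows. The code is a minimal illustration, not the authors' pipeline: `generator` and `classifier` are hypothetical callables, and the demo replaces them with an oracle transpose and a toy labeling rule on random arrays so that the script runs without MNIST or a trained network.

```python
import numpy as np

def make_transposed_pairs(images):
    """Build the two domains: images of shape (N, H, W) and their transposes."""
    return images, np.transpose(images, (0, 2, 1))

def translation_accuracy(generator, classifier, images, labels):
    """Translate x -> y with the generator, transpose y back, classify the
    result, and report the fraction of predictions that match the labels."""
    y_generated = generator(images)
    y_back = np.transpose(y_generated, (0, 2, 1))
    predictions = classifier(y_back)
    return float(np.mean(predictions == labels))

# Stand-ins for illustration only (assumptions, not part of the paper).
rng = np.random.default_rng(0)
images = rng.random((8, 5, 5))

def toy_classifier(batch):
    """Toy rule: the 'class' is the index of the row with the largest sum."""
    return batch.sum(axis=2).argmax(axis=1)

def oracle_generator(batch):
    """A perfect translator for this task simply transposes each image."""
    return np.transpose(batch, (0, 2, 1))

labels = toy_classifier(images)
x_domain, y_domain = make_transposed_pairs(images)
acc = translation_accuracy(oracle_generator, toy_classifier, images, labels)
print("accuracy of the oracle translator:", acc)
```

In the actual experiment the generator would be the learned $G_{y}$ and the classifier the pretrained MNIST classifier; a generator that maps a digit to the transpose of a different digit would be penalized by this metric even if its outputs look realistic.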

4.4 Attribute-conditional image generation

We apply our method to face images from the CelebA dataset. This dataset consists of 202,599 images annotated with 40 binary attributes. We scale and crop the images to $64\times 64$ pixels. In order to qualitatively evaluate the learned attribute-conditional image generator and the multi-label classifier, given an input face image, we first use the classifier to predict attributes, and then use the image generator to produce images based on the predicted attributes. Figure 5 shows example results. Both the learned attribute predictor and the image generator provide good results. We further show another set of image editing experiments in Figure 6. For each subfigure, we use the same set of attributes with different noise vectors to generate images. For example, for the top-right subfigure, all the images in the 1st row were generated based on the following attributes: black hair, female, attractive, and we then added the attribute of “sunglasses” when generating the images in the 2nd row. It is interesting to see that $\Delta$-GAN has great flexibility to adjust the generated images by changing certain input attributes. For instance, by switching on the wearing hat attribute, one can edit the face image to have a hat on the head.

In order to demonstrate the scalability of our model to large and complex datasets, we also present results on the COCO dataset. We first select a set of 1000 attributes from the caption text in the training set, which includes the most frequent nouns, verbs, or adjectives. The images in COCO are scaled and cropped to $64\times 64$ pixels. Unlike the case of CelebA face images, the networks need to learn how to handle multiple objects and diverse backgrounds. Results are provided in Figure 7. We can generate reasonably good images based on the predicted attributes. The input and generated images also clearly share the same set of attributes. We also observe diversity in the samples by simply drawing multiple noise vectors and using the same predicted attributes.

Precision (P) and normalized Discounted Cumulative Gain (nDCG) are two popular evaluation metrics for multi-label classification problems. Table 3 provides the quantitative results of P@10 and nDCG@10 on CelebA and COCO, where @$k$ means at rank $k$ (see the Appendix for definitions). For fair comparison, we use the same network architectures for both Triple GAN and $\Delta$-GAN. $\Delta$-GAN consistently provides better results than Triple GAN. On the COCO dataset, our semi-supervised learning approach with 50% labeled data achieves better performance than the results of Triple GAN using the full dataset, demonstrating the effectiveness of our approach for semi-supervised joint distribution matching. More results for the above experiments are provided in the Appendix.

Conclusion

We have presented the Triangle Generative Adversarial Network ($\Delta$-GAN), a new GAN framework that can be used for semi-supervised joint distribution matching. Our approach learns the bidirectional mappings between two domains with a few paired samples. We have demonstrated that $\Delta$-GAN may be employed for a wide range of applications. One possible future direction is to combine $\Delta$-GAN with sequence GAN or textGAN to model the joint distribution of image-caption pairs.

This research was supported in part by ARO, DARPA, DOE, NGA and ONR.

Appendix A Detailed theoretical analysis

For any fixed generators $G_{x}$ and $G_{y}$, the optimal discriminators $D_{1}$ and $D_{2}$ of the game defined by the value function $V(G_{x},G_{y},D_{1},D_{2})$ are

The training criterion for the discriminators $D_{1}$ and $D_{2}$, given any generators $G_{x}$ and $G_{y}$, is to maximize the quantity $V(G_{x},G_{y},D_{1},D_{2})$:

The equilibrium of $V(G_{x},G_{y},D_{1},D_{2})$ is achieved if and only if $p(\boldsymbol{x},\boldsymbol{y})=p_{x}(\boldsymbol{x},\boldsymbol{y})=p_{y}(\boldsymbol{x},\boldsymbol{y})$ with $D_{1}^{*}(\boldsymbol{x},\boldsymbol{y})=\frac{1}{3}$ and $D_{2}^{*}(\boldsymbol{x},\boldsymbol{y})=\frac{1}{2}$, and the optimum value is $-3\log 3$.

Given the optimal $D_{1}^{*}(\boldsymbol{x},\boldsymbol{y})$ and $D_{2}^{*}(\boldsymbol{x},\boldsymbol{y})$, the minimax game can be reformulated as:

where $JSD_{\pi_{1},\ldots,\pi_{n}}(p_{1},p_{2},\ldots,p_{n})=H\big(\sum_{i=1}^{n}\pi_{i}p_{i}\big)-\sum_{i=1}^{n}\pi_{i}H(p_{i})$ is the Jensen-Shannon divergence, $\pi_{1},\ldots,\pi_{n}$ are weights selected for the probability distributions $p_{1},p_{2},\ldots,p_{n}$, and $H(p)$ is the entropy of distribution $p$. In the three-distribution case described above, we set $n=3$ and $\pi_{1}=\pi_{2}=\pi_{3}=\frac{1}{3}$.

For $p(\boldsymbol{x},\boldsymbol{y})=p_{x}(\boldsymbol{x},\boldsymbol{y})=p_{y}(\boldsymbol{x},\boldsymbol{y})$, we have $D_{1}^{*}(\boldsymbol{x},\boldsymbol{y})=\frac{1}{3}$, $D_{2}^{*}(\boldsymbol{x},\boldsymbol{y})=\frac{1}{2}$ and $C(G_{x},G_{y})=-3\log 3$. Since the Jensen-Shannon divergence is always non-negative, and zero if and only if the distributions are equal, we have shown that $C^{*}=-3\log 3$ is the global minimum of $C(G_{x},G_{y})$ and that the only solution is $p(\boldsymbol{x},\boldsymbol{y})=p_{x}(\boldsymbol{x},\boldsymbol{y})=p_{y}(\boldsymbol{x},\boldsymbol{y})$, i.e., the generative models perfectly replicate the data distribution. ∎
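The argument above can be checked numerically on a finite sample space. The sketch below implements the weighted JSD exactly as defined via entropies, and assumes a value function that scores real pairs with $\log D_{1}$, pairs from $p_{x}$ with $\log((1-D_{1})D_{2})$ and pairs from $p_{y}$ with $\log((1-D_{1})(1-D_{2}))$, together with the optimal discriminators $D_{1}^{*}=p/(p+p_{x}+p_{y})$ and $D_{2}^{*}=p_{x}/(p_{x}+p_{y})$ that follow from it. The three random probability vectors are arbitrary stand-ins for the joint distributions, and all function names are introduced here.

```python
import numpy as np

def entropy(p):
    """Shannon entropy of a discrete distribution with full support."""
    return float(-np.sum(p * np.log(p)))

def jsd(dists, weights):
    """Weighted Jensen-Shannon divergence: H(sum_i pi_i p_i) - sum_i pi_i H(p_i)."""
    mixture = sum(w * p for w, p in zip(weights, dists))
    return entropy(mixture) - sum(w * entropy(p) for w, p in zip(weights, dists))

def value(p, p_x, p_y, d1, d2):
    """Assumed value function on a finite space of (x, y) pairs."""
    return float(np.sum(
        p * np.log(d1)
        + p_x * np.log((1.0 - d1) * d2)
        + p_y * np.log((1.0 - d1) * (1.0 - d2))
    ))

# Arbitrary illustrative joint distributions over six (x, y) cells.
rng = np.random.default_rng(0)
p, p_x, p_y = (rng.dirichlet(np.ones(6)) for _ in range(3))

# Optimal discriminators for fixed generators.
d1_star = p / (p + p_x + p_y)
d2_star = p_x / (p_x + p_y)

c_star = value(p, p_x, p_y, d1_star, d2_star)
jsd3 = jsd([p, p_x, p_y], [1 / 3, 1 / 3, 1 / 3])
c_eq = value(p, p, p, np.full(6, 1 / 3), np.full(6, 1 / 2))

print("C(G_x, G_y)         :", c_star)
print("-3 log 3 + 3 * JSD  :", -3 * np.log(3) + 3 * jsd3)
print("value at equilibrium:", c_eq, "vs", -3 * np.log(3))
```

Since the JSD is strictly positive whenever the three distributions differ, `c_star` exceeds $-3\log 3$ for the random inputs and equals it only in the matched case.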

Appendix B $\Delta$-GAN training procedure

Appendix C Additional experimental results

Appendix D Evaluation metrics for multi-label classification

Precision at $k$ is a popular evaluation metric for multi-label classification problems. Given the ground truth label vector $\boldsymbol{y}\in\{0,1\}^{L}$ and the prediction $\hat{\boldsymbol{y}}\in\mathbb{R}^{L}$, $P@k$ is defined as

Precision at $k$ counts the fraction of correct predictions among the top $k$ scoring labels.
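The description above translates directly into a short function. This is a minimal sketch of the verbal definition (fraction of true labels among the $k$ highest-scoring ones); the function name, the tie-breaking by stable sort and the example vectors are choices made here.

```python
import numpy as np

def precision_at_k(y_true, y_score, k):
    """Fraction of ground-truth labels among the k highest-scoring labels."""
    top_k = np.argsort(-y_score, kind="stable")[:k]
    return float(y_true[top_k].sum()) / k

# Hypothetical example with L = 5 labels.
y_true = np.array([1, 0, 1, 0, 0])
y_score = np.array([0.9, 0.8, 0.1, 0.7, 0.2])

p_at_3 = precision_at_k(y_true, y_score, k=3)
print(p_at_3)
```

Here the three highest scores belong to labels 0, 1 and 3, of which only label 0 is a true label.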

nDCG@$k$

Normalized Discounted Cumulative Gain (nDCG) at rank $k$ is a family of ranking measures widely used in multi-label learning. DCG is the total gain accumulated at a particular rank $k$, which is defined as

Then normalizing DCG by the value at rank $k$ of the ideal ranking gives
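One common way to compute these quantities for binary labels is sketched below. It rests on assumptions that are not spelled out in the text: the gain of a label is its ground-truth value (0 or 1), the discount at rank $r$ is $1/\log_{2}(r+1)$, and the ideal ranking places all true labels first, so the normalizer sums the discounts over the first $\min(k, \text{number of true labels})$ ranks. The function names and example vectors are hypothetical.

```python
import numpy as np

def dcg_at_k(y_true, y_score, k):
    """Discounted cumulative gain of the k highest-scoring labels,
    with binary gains and a 1 / log2(rank + 1) discount (assumed form)."""
    top_k = np.argsort(-y_score, kind="stable")[:k]
    discounts = 1.0 / np.log2(np.arange(2, len(top_k) + 2))
    return float(np.sum(y_true[top_k] * discounts))

def ndcg_at_k(y_true, y_score, k):
    """DCG@k divided by the DCG@k of an ideal ranking of the same labels."""
    n_pos = int(min(k, y_true.sum()))
    if n_pos == 0:
        return 0.0  # no true labels: define the score as zero
    ideal = float(np.sum(1.0 / np.log2(np.arange(2, n_pos + 2))))
    return dcg_at_k(y_true, y_score, k) / ideal

# Hypothetical example with L = 5 labels.
y_true = np.array([1, 0, 1, 0, 0])
y_score = np.array([0.9, 0.8, 0.1, 0.7, 0.2])

ndcg_3 = ndcg_at_k(y_true, y_score, k=3)
print(ndcg_3)
```

Unlike precision at $k$, this measure rewards placing true labels near the top of the ranking, since later ranks are discounted more heavily.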

Appendix E Detailed network architectures

For the CIFAR10 dataset, we use the same network architecture as used in Triple GAN. For the edges2shoes dataset, we use the same network architecture as used in the pix2pix paper. For other datasets, we provide the detailed network architectures below.