Adversarial Feature Learning

Jeff Donahue, Philipp Krähenbühl, Trevor Darrell

Introduction

Deep convolutional networks (convnets) have become a staple of the modern computer vision pipeline. After training these models on a massive database of image-label pairs like ImageNet (Russakovsky et al., 2015), the network easily adapts to a variety of similar visual tasks, achieving impressive results on image classification (Donahue et al., 2014; Zeiler & Fergus, 2014; Razavian et al., 2014) or localization (Girshick et al., 2014; Long et al., 2015) tasks. In other perceptual domains such as natural language processing or speech recognition, deep networks have proven highly effective as well (Bahdanau et al., 2015; Sutskever et al., 2014; Vinyals et al., 2015; Graves et al., 2013). However, all of these recent results rely on a supervisory signal from large-scale databases of hand-labeled data, ignoring much of the useful information present in the structure of the data itself.

Meanwhile, Generative Adversarial Networks (GANs) (Goodfellow et al., 2014) have emerged as a powerful framework for learning generative models of arbitrarily complex data distributions. The GAN framework learns a generator mapping samples from an arbitrary latent distribution to data, as well as an adversarial discriminator which tries to distinguish between real and generated samples as accurately as possible. The generator’s goal is to “fool” the discriminator by producing samples which are as close to real data as possible. When trained on databases of natural images, GANs produce impressive results (Radford et al., 2016; Denton et al., 2015).

Interpolations in the latent space of the generator produce smooth and plausible semantic variations, and certain directions in this space correspond to particular semantic attributes along which the data distribution varies. For example, Radford et al. (2016) showed that a GAN trained on a database of human faces learns to associate particular latent directions with gender and the presence of eyeglasses.

A natural question arises from this ostensible “semantic juice” flowing through the weights of generators learned using the GAN framework: can GANs be used for unsupervised learning of rich feature representations for arbitrary data distributions? An obvious issue with doing so is that the generator maps latent samples to generated data, but the framework does not include an inverse mapping from data to latent representation.

Hence, we propose a novel unsupervised feature learning framework, Bidirectional Generative Adversarial Networks (BiGAN). The overall model is depicted in Figure 1. In short, in addition to the generator $G$ from the standard GAN framework (Goodfellow et al., 2014), BiGAN includes an encoder $E$ which maps data $\mathbf{x}$ to latent representations $\mathbf{z}$. The BiGAN discriminator $D$ discriminates not only in data space ($\mathbf{x}$ versus $G(\mathbf{z})$), but jointly in data and latent space (tuples $(\mathbf{x},E(\mathbf{x}))$ versus $(G(\mathbf{z}),\mathbf{z})$), where the latent component is either an encoder output $E(\mathbf{x})$ or a generator input $\mathbf{z}$.

It may not be obvious from this description that the BiGAN encoder $E$ should learn to invert the generator $G$. The two modules cannot directly “communicate” with one another: the encoder never “sees” generator outputs ($E(G(\mathbf{z}))$ is not computed), and vice versa. Yet, in Section 3, we will both argue intuitively and formally prove that the encoder and generator must learn to invert one another in order to fool the BiGAN discriminator.

Because the BiGAN encoder learns to predict features $\mathbf{z}$ given data $\mathbf{x}$, and prior work on GANs has demonstrated that these features capture semantic attributes of the data, we hypothesize that a trained BiGAN encoder may serve as a useful feature representation for related semantic tasks, in the same way that fully supervised visual models trained to predict semantic “labels” given images serve as powerful feature representations for related visual tasks. In this context, a latent representation $\mathbf{z}$ may be thought of as a “label” for $\mathbf{x}$, but one which came for “free,” without the need for supervision.

An alternative approach to learning the inverse mapping from data to latent representation is to directly model $p(\mathbf{z}|G(\mathbf{z}))$, predicting generator input $\mathbf{z}$ given generated data $G(\mathbf{z})$. We’ll refer to this alternative as a latent regressor, later arguing (Section 4.1) that the BiGAN encoder may be preferable in a feature learning context, as well as comparing the approaches empirically.

BiGANs are a robust and highly generic approach to unsupervised feature learning, making no assumptions about the structure or type of data to which they are applied, as our theoretical results will demonstrate. Our empirical studies will show that despite their generality, BiGANs are competitive with contemporary approaches to self-supervised and weakly supervised feature learning designed specifically for a notoriously complex data distribution – natural images.

Dumoulin et al. (2016) independently proposed an identical model in their concurrent work, exploring the case of a stochastic encoder $E$ and the ability of such models to learn in a semi-supervised setting.

Preliminaries

The GAN framework trains a generator such that no discriminative model $D:\Omega_{\mathbf{X}}\mapsto[0,1]$ can distinguish samples of the data distribution from samples of the generative distribution. Both generator and discriminator are learned using the adversarial (minimax) objective $\min_{G}\max_{D}V(D,G)$, where

$$V(D,G) := \mathbb{E}_{\mathbf{x}\sim p_{\mathbf{X}}}\left[\log D(\mathbf{x})\right] + \mathbb{E}_{\mathbf{z}\sim p_{\mathbf{Z}}}\left[\log\left(1-D(G(\mathbf{z}))\right)\right]. \tag{1}$$

Goodfellow et al. (2014) showed that for an ideal discriminator the objective $C(G) := \max_{D}V(D,G)$ is equivalent, up to a constant scale and offset, to the Jensen-Shannon divergence between the two distributions $p_{G}$ and $p_{\mathbf{X}}$.
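The relationship between the GAN objective at the ideal discriminator and the Jensen-Shannon divergence can be checked numerically on discrete distributions. The sketch below is only an illustration: the two three-point distributions are made-up toy values, and it uses the standard closed form $D^{*}(\mathbf{x}) = p_{\mathbf{X}}(\mathbf{x}) / (p_{\mathbf{X}}(\mathbf{x}) + p_{G}(\mathbf{x}))$ to confirm that $C(G) = 2\,D_{JS}(p_{\mathbf{X}}\,\|\,p_{G}) - \log 4$.

```python
import math

def kl(p, q):
    """KL divergence between two discrete distributions given as lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js(p, q):
    """Jensen-Shannon divergence: mean KL divergence to the mixture (p + q) / 2."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def value_at_optimal_d(p_x, p_g):
    """V(D*, G) for discrete distributions, with D*(x) = p_x / (p_x + p_g)."""
    total = 0.0
    for px, pg in zip(p_x, p_g):
        d = px / (px + pg)
        if px > 0:
            total += px * math.log(d)
        if pg > 0:
            total += pg * math.log(1 - d)
    return total

# Toy distributions (assumed values, for illustration only).
p_x = [0.5, 0.3, 0.2]
p_g = [0.2, 0.3, 0.5]

c = value_at_optimal_d(p_x, p_g)
print(c, 2 * js(p_x, p_g) - math.log(4))  # the two numbers agree
```

When the generator matches the data distribution exactly, the divergence vanishes and the objective reaches its minimum of $-\log 4$.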

The adversarial objective (1) does not directly lend itself to efficient optimization, as each step in the generator $G$ requires a full discriminator $D$ to be learned. Furthermore, a perfect discriminator no longer provides any gradient information to the generator, as the gradient of any global or local maximum of $V(D,G)$ is $0$. To provide a strong gradient signal nonetheless, Goodfellow et al. (2014) slightly alter the objective between generator and discriminator updates, while keeping the same fixed point characteristics. They also propose to optimize (1) using an alternating optimization switching between updates to the generator and discriminator. While this optimization is not guaranteed to converge, empirically it works well if the discriminator and generator are well balanced.

Despite the empirical strength of GANs as generative models of arbitrary data distributions, it is not clear how they can be applied as an unsupervised feature representation. One possibility for learning such representations is to learn an inverse mapping regressing from generated data $G(\mathbf{z})$ back to the latent input $\mathbf{z}$. However, unless the generator perfectly models the data distribution $p_{\mathbf{X}}$, a nearly impossible objective for a complex data distribution such as that of high-resolution natural images, this idea may prove insufficient.

Bidirectional Generative Adversarial Networks

In Bidirectional Generative Adversarial Networks (BiGANs) we not only train a generator, but additionally train an encoder $E:\Omega_{\mathbf{X}}\to\Omega_{\mathbf{Z}}$. The encoder induces a distribution $p_{E}(\mathbf{z}|\mathbf{x})=\delta(\mathbf{z}-E(\mathbf{x}))$ mapping data points $\mathbf{x}$ into the latent feature space of the generative model. The discriminator is also modified to take input from the latent space, predicting $P_{D}(Y|\mathbf{x},\mathbf{z})$, where $Y=1$ if $\mathbf{x}$ is real (sampled from the real data distribution $p_{\mathbf{X}}$), and $Y=0$ if $\mathbf{x}$ is generated (the output of $G(\mathbf{z})$, $\mathbf{z}\sim p_{\mathbf{Z}}$).

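The BiGAN training objective is defined as a minimax objective

$$\min_{G,E}\max_{D}V(D,E,G), \tag{2}$$

where

$$V(D,E,G) := \mathbb{E}_{\mathbf{x}\sim p_{\mathbf{X}}}\Big[\mathbb{E}_{\mathbf{z}\sim p_{E}(\cdot|\mathbf{x})}\left[\log D(\mathbf{x},\mathbf{z})\right]\Big] + \mathbb{E}_{\mathbf{z}\sim p_{\mathbf{Z}}}\Big[\mathbb{E}_{\mathbf{x}\sim p_{G}(\cdot|\mathbf{z})}\left[\log\left(1-D(\mathbf{x},\mathbf{z})\right)\right]\Big]. \tag{3}$$

For a deterministic encoder and generator, the inner expectations reduce to $\log D(\mathbf{x},E(\mathbf{x}))$ and $\log\left(1-D(G(\mathbf{z}),\mathbf{z})\right)$ respectively.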

We optimize this minimax objective using the same alternating gradient based optimization as Goodfellow et al. (2014). See Section 3.4 for details.

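Let $p_{G\mathbf{Z}}(\mathbf{x},\mathbf{z}) := p_{G}(\mathbf{x}|\mathbf{z})\,p_{\mathbf{Z}}(\mathbf{z})$ and $p_{E\mathbf{X}}(\mathbf{x},\mathbf{z}) := p_{E}(\mathbf{z}|\mathbf{x})\,p_{\mathbf{X}}(\mathbf{x})$ be the joint distributions modeled by the generator and encoder respectively. $\Omega := \Omega_{\mathbf{X}}\times\Omega_{\mathbf{Z}}$ is the joint latent and data space. For a region $R\subseteq\Omega$,

$$P_{E\mathbf{X}}(R) := \int_{\Omega}p_{E\mathbf{X}}(\mathbf{x},\mathbf{z})\,\mathbf{1}_{[(\mathbf{x},\mathbf{z})\in R]}\,\mathrm{d}(\mathbf{x},\mathbf{z}), \qquad P_{G\mathbf{Z}}(R) := \int_{\Omega}p_{G\mathbf{Z}}(\mathbf{x},\mathbf{z})\,\mathbf{1}_{[(\mathbf{x},\mathbf{z})\in R]}\,\mathrm{d}(\mathbf{x},\mathbf{z})$$

are probability measures over that region. We also define

$$P_{\mathbf{X}}(R_{\mathbf{X}}) := \int_{\Omega_{\mathbf{X}}}p_{\mathbf{X}}(\mathbf{x})\,\mathbf{1}_{[\mathbf{x}\in R_{\mathbf{X}}]}\,\mathrm{d}\mathbf{x}, \qquad P_{\mathbf{Z}}(R_{\mathbf{Z}}) := \int_{\Omega_{\mathbf{Z}}}p_{\mathbf{Z}}(\mathbf{z})\,\mathbf{1}_{[\mathbf{z}\in R_{\mathbf{Z}}]}\,\mathrm{d}\mathbf{z}$$

as measures over regions $R_{\mathbf{X}}\subseteq\Omega_{\mathbf{X}}$ and $R_{\mathbf{Z}}\subseteq\Omega_{\mathbf{Z}}$, and write $\hat{\Omega}_{\mathbf{X}}$ and $\hat{\Omega}_{\mathbf{Z}}$ for the supports of $P_{\mathbf{X}}$ and $P_{\mathbf{Z}}$.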

3.1 Optimal discriminator, generator, & encoder

We start by characterizing the optimal discriminator for any generator and encoder, following Goodfellow et al. (2014). This optimal discriminator then allows us to reformulate objective (3), and show that it reduces to the Jensen-Shannon divergence between the joint distributions $P_{E\mathbf{X}}$ and $P_{G\mathbf{Z}}$.

This optimal discriminator now allows us to characterize the optimal generator and encoder.

Theorem 1. The global minimum of $C(E,G)$ is achieved if and only if $P_{E\mathbf{X}}=P_{G\mathbf{Z}}$. At that point, $C(E,G)=-\log 4$ and $D^{*}_{EG}=\frac{1}{2}$.
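The optimum can be illustrated on a finite toy problem. The sketch below is an assumption-laden example rather than anything from the experiments: data and latent spaces each have three points, $E$ and $G$ are lookup tables, and the optimal discriminator is taken to be the ratio $P_{E\mathbf{X}} / (P_{E\mathbf{X}} + P_{G\mathbf{Z}})$ at each pair. It evaluates $C(E,G)$ for an encoder that matches the generator's joint distribution and for one that does not.

```python
import math
from collections import defaultdict

def joint_measures(p_x, p_z, E, G):
    """Joint masses over (x, z) pairs for deterministic lookup tables E and G."""
    p_ex, p_gz = defaultdict(float), defaultdict(float)
    for x, mass in enumerate(p_x):
        p_ex[(x, E[x])] += mass      # encoder pairs (x, E(x))
    for z, mass in enumerate(p_z):
        p_gz[(G[z], z)] += mass      # generator pairs (G(z), z)
    return p_ex, p_gz

def bigan_cost(p_x, p_z, E, G):
    """C(E, G) = V(D*, E, G) with D* = P_EX / (P_EX + P_GZ) at each pair."""
    p_ex, p_gz = joint_measures(p_x, p_z, E, G)
    cost = 0.0
    for pair in set(p_ex) | set(p_gz):
        a, b = p_ex.get(pair, 0.0), p_gz.get(pair, 0.0)
        if a + b == 0.0:
            continue
        f = a / (a + b)
        if a > 0.0:
            cost += a * math.log(f)
        if b > 0.0:
            cost += b * math.log(1.0 - f)
    return cost

# Toy setup (all values are illustrative assumptions).
p_x = [0.5, 0.3, 0.2]
p_z = [0.2, 0.5, 0.3]
G = [2, 0, 1]            # G pushes p_z forward onto p_x
E_inverse = [1, 2, 0]    # inverse of G
E_other = [0, 1, 2]      # not the inverse of G

c_opt = bigan_cost(p_x, p_z, E_inverse, G)
c_bad = bigan_cost(p_x, p_z, E_other, G)
print(c_opt, -math.log(4), c_bad)
```

With the inverse encoder the two joint measures coincide and the cost equals $-\log 4$; with the other encoder the joint supports are disjoint, the discriminator separates them perfectly, and the cost is $0$.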

The optimal discriminator, encoder, and generator of BiGAN are similar to the optimal discriminator and generator of the GAN framework (Goodfellow et al., 2014). However, an important difference is that BiGAN optimizes a Jensen-Shannon divergence between a joint distribution over both data $\mathbf{X}$ and latent features $\mathbf{Z}$. This joint divergence allows us to further characterize properties of $G$ and $E$, as shown below.

3.2 Optimal generator & encoder are inverses

We first present an intuitive argument that, in order to “fool” a perfect discriminator, a deterministic BiGAN encoder and generator must invert each other. (Later we will formally state and prove this property.) Consider a BiGAN discriminator input pair $(\mathbf{x},\mathbf{z})$. Due to the sampling procedure, $(\mathbf{x},\mathbf{z})$ must satisfy at least one of the following two properties:

(a) $\mathbf{x}\in\hat{\Omega}_{\mathbf{X}}\,\land\,E(\mathbf{x})=\mathbf{z}$

(b) $\mathbf{z}\in\hat{\Omega}_{\mathbf{Z}}\,\land\,G(\mathbf{z})=\mathbf{x}$

If only one of these properties is satisfied, a perfect discriminator can infer the source of $(\mathbf{x},\mathbf{z})$ with certainty: if only (a) is satisfied, $(\mathbf{x},\mathbf{z})$ must be an encoder pair $(\mathbf{x},E(\mathbf{x}))$ and $D^{*}_{EG}(\mathbf{x},\mathbf{z})=1$; if only (b) is satisfied, $(\mathbf{x},\mathbf{z})$ must be a generator pair $(G(\mathbf{z}),\mathbf{z})$ and $D^{*}_{EG}(\mathbf{x},\mathbf{z})=0$.

Therefore, in order to fool a perfect discriminator at $(\mathbf{x},\mathbf{z})$ (so that $0<D^{*}_{EG}(\mathbf{x},\mathbf{z})<1$), $E$ and $G$ must satisfy both (a) and (b). In this case, we can substitute the equality $E(\mathbf{x})=\mathbf{z}$ required by (a) into the equality $G(\mathbf{z})=\mathbf{x}$ required by (b), and vice versa, giving the inversion properties $\mathbf{x}=G(E(\mathbf{x}))$ and $\mathbf{z}=E(G(\mathbf{z}))$.
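This intuition can be made concrete on a two-point toy problem. In the hypothetical setup below, $E$ and $G$ are lookup tables and the perfect discriminator is assumed to output the fraction of the mass at each pair that comes from the encoder. The encoder is deliberately wrong at one data point, so the example produces one pair of each kind: an encoder-only pair, a generator-only pair, and a pair on which $E$ and $G$ agree.

```python
def optimal_discriminator(p_x, p_z, E, G):
    """Return {(x, z): D*(x, z)} for deterministic lookup tables E and G."""
    mass_e = {(x, E[x]): px for x, px in enumerate(p_x)}   # encoder pairs
    mass_g = {(G[z], z): pz for z, pz in enumerate(p_z)}   # generator pairs
    d_star = {}
    for pair in set(mass_e) | set(mass_g):
        a, b = mass_e.get(pair, 0.0), mass_g.get(pair, 0.0)
        d_star[pair] = a / (a + b)
    return d_star

# Toy values (assumptions for illustration).
p_x = [0.5, 0.5]
p_z = [0.5, 0.5]
G = [0, 1]   # G(0) = 0, G(1) = 1
E = [0, 0]   # E(0) = 0 inverts G, but E(1) = 0 does not

d_star = optimal_discriminator(p_x, p_z, E, G)
for pair in sorted(d_star):
    print(pair, d_star[pair])
```

Only the pair $(0,0)$, where both $E(\mathbf{x})=\mathbf{z}$ and $G(\mathbf{z})=\mathbf{x}$ hold, leaves the discriminator uncertain; the pair $(1,0)$ is identified as an encoder pair and $(1,1)$ as a generator pair.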

Formally, we show in Theorem 2 that the optimal generator and encoder invert one another almost everywhere on the support $\hat{\Omega}_{\mathbf{X}}$ and $\hat{\Omega}_{\mathbf{Z}}$ of $P_{\mathbf{X}}$ and $P_{\mathbf{Z}}$.

Theorem 2. If $E$ and $G$ are an optimal encoder and generator, then $E=G^{-1}$ almost everywhere; that is, $G(E(\mathbf{x}))=\mathbf{x}$ for $P_{\mathbf{X}}$-almost every $\mathbf{x}\in\Omega_{\mathbf{X}}$, and $E(G(\mathbf{z}))=\mathbf{z}$ for $P_{\mathbf{Z}}$-almost every $\mathbf{z}\in\Omega_{\mathbf{Z}}$.

3.3 Relationship to autoencoders

As argued in Section 1, a model trained to predict features $\mathbf{z}$ given data $\mathbf{x}$ should learn useful semantic representations. Here we show that the BiGAN objective forces the encoder $E$ to do exactly this: in order to fool the discriminator at a particular $\mathbf{z}$, the encoder must invert the generator at that $\mathbf{z}$, such that $E(G(\mathbf{z}))=\mathbf{z}$.

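Theorem 3. The encoder and generator objective given an optimal discriminator, $C(E,G) := \max_{D}V(D,E,G)$, can be rewritten as an $\ell_{0}$ autoencoder loss function

$$C(E,G) = \mathbb{E}_{\mathbf{x}\sim p_{\mathbf{X}}}\left[\mathbf{1}_{[E(\mathbf{x})\in\hat{\Omega}_{\mathbf{Z}}\,\land\,G(E(\mathbf{x}))=\mathbf{x}]}\log f_{EG}(\mathbf{x},E(\mathbf{x}))\right] + \mathbb{E}_{\mathbf{z}\sim p_{\mathbf{Z}}}\left[\mathbf{1}_{[G(\mathbf{z})\in\hat{\Omega}_{\mathbf{X}}\,\land\,E(G(\mathbf{z}))=\mathbf{z}]}\log\left(1-f_{EG}(G(\mathbf{z}),\mathbf{z})\right)\right]$$

with $\log f_{EG}\in(-\infty,0)$ and $\log\left(1-f_{EG}\right)\in(-\infty,0)$ $P_{E\mathbf{X}}$-almost and $P_{G\mathbf{Z}}$-almost everywhere.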

3.4 Learning

In practice, as in the GAN framework (Goodfellow et al., 2014), each BiGAN module $D$, $G$, and $E$ is a parametric function (with parameters $\theta_{D}$, $\theta_{G}$, and $\theta_{E}$, respectively). As a whole, BiGAN can be optimized using alternating stochastic gradient steps. In one iteration, the discriminator parameters $\theta_{D}$ are updated by taking one or more steps in the positive gradient direction $\nabla_{\theta_{D}}V(D,E,G)$, then the encoder parameters $\theta_{E}$ and generator parameters $\theta_{G}$ are together updated by taking a step in the negative gradient direction $-\nabla_{\theta_{E},\theta_{G}}V(D,E,G)$. In both cases, the expectation terms of $V(D,E,G)$ are estimated using mini-batches of $n$ samples $\{\mathbf{x}^{(i)}\sim p_{\mathbf{X}}\}_{i=1}^{n}$ and $\{\mathbf{z}^{(i)}\sim p_{\mathbf{Z}}\}_{i=1}^{n}$ drawn independently for each update step.
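The order of updates in one iteration can be sketched with a deliberately tiny model. Everything below is an assumption made for illustration: the scalar modules $E(x)=e\,x$, $G(z)=g\,z$, $D(x,z)=\sigma(d\,x\,z)$, the uniform sampling distributions, the step size, and the use of central finite differences in place of backpropagation. It is not the training code used in the experiments and is not meant to converge to anything useful.

```python
import math
import random

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

def estimate_v(theta, xs, zs):
    """Mini-batch estimate of V(D, E, G) for the toy scalar modules."""
    d, e, g = theta["d"], theta["e"], theta["g"]
    real = sum(math.log(sigmoid(d * x * (e * x))) for x in xs) / len(xs)
    fake = sum(math.log(1.0 - sigmoid(d * (g * z) * z)) for z in zs) / len(zs)
    return real + fake

def numeric_grad(theta, key, xs, zs, h=1e-5):
    """Central finite difference of the mini-batch estimate w.r.t. one parameter."""
    up, down = dict(theta), dict(theta)
    up[key] += h
    down[key] -= h
    return (estimate_v(up, xs, zs) - estimate_v(down, xs, zs)) / (2 * h)

def draw_batch(rng, n):
    return ([rng.uniform(-1, 1) for _ in range(n)],
            [rng.uniform(-1, 1) for _ in range(n)])

def bigan_iteration(theta, rng, n=64, lr=0.5):
    """One alternating iteration: ascend in theta_D, then descend in theta_E, theta_G."""
    new = dict(theta)
    xs, zs = draw_batch(rng, n)                      # fresh batch for the D step
    new["d"] += lr * numeric_grad(new, "d", xs, zs)  # positive gradient direction
    xs, zs = draw_batch(rng, n)                      # fresh batch for the E, G step
    grad_e = numeric_grad(new, "e", xs, zs)
    grad_g = numeric_grad(new, "g", xs, zs)
    new["e"] -= lr * grad_e                          # negative gradient direction
    new["g"] -= lr * grad_g
    return new

rng = random.Random(0)
theta = {"d": 0.0, "e": 1.0, "g": -1.0}

# Effect of a single discriminator step on a fixed batch.
xs, zs = draw_batch(rng, 64)
v_before = estimate_v(theta, xs, zs)
stepped = dict(theta)
stepped["d"] += 0.5 * numeric_grad(theta, "d", xs, zs)
v_after = estimate_v(stepped, xs, zs)

theta_next = bigan_iteration(theta, rng)
print(v_before, v_after, theta_next)
```

With $d=0$ the discriminator outputs $\frac{1}{2}$ everywhere, so the estimate starts at $-\log 4$; the discriminator step then raises it on the same batch, after which the encoder and generator step moves their parameters in the direction that lowers it.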

Goodfellow et al. (2014) found that an objective in which the real and generated labels $Y$ are swapped provides stronger gradient signal to $G$. We similarly observed in BiGAN training that an “inverse” objective provides stronger gradient signal to $G$ and $E$. For efficiency, we also update all modules $D$, $G$, and $E$ simultaneously at each iteration, rather than alternating between $D$ updates and $G$, $E$ updates. See Appendix B for details.

3.5 Generalized BiGAN

It is often useful to parametrize the output of the generator $G$ and encoder $E$ in a different, usually smaller, space $\Omega_{\mathbf{X}}^{\prime}$ and $\Omega_{\mathbf{Z}}^{\prime}$ rather than the original $\Omega_{\mathbf{X}}$ and $\Omega_{\mathbf{Z}}$. For example, for visual feature learning, the images input to the encoder should be of similar resolution to images used in the evaluation. On the other hand, generating high resolution images remains difficult for current generative models. In this situation, the encoder may take higher resolution input while the generator output and discriminator input remain low resolution.

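We generalize the BiGAN objective $V(D,G,E)$ (3) with functions $g_{\mathbf{X}}:\Omega_{\mathbf{X}}\mapsto\Omega_{\mathbf{X}}^{\prime}$ and $g_{\mathbf{Z}}:\Omega_{\mathbf{Z}}\mapsto\Omega_{\mathbf{Z}}^{\prime}$, and encoder $E:\Omega_{\mathbf{X}}\mapsto\Omega_{\mathbf{Z}}^{\prime}$, generator $G:\Omega_{\mathbf{Z}}\mapsto\Omega_{\mathbf{X}}^{\prime}$, and discriminator $D:\Omega_{\mathbf{X}}^{\prime}\times\Omega_{\mathbf{Z}}^{\prime}\mapsto[0,1]$:

$$\min_{G,E}\max_{D}\;\mathbb{E}_{\mathbf{x}\sim p_{\mathbf{X}}}\Big[\mathbb{E}_{\mathbf{z}^{\prime}\sim p_{E}(\cdot|\mathbf{x})}\left[\log D(g_{\mathbf{X}}(\mathbf{x}),\mathbf{z}^{\prime})\right]\Big] + \mathbb{E}_{\mathbf{z}\sim p_{\mathbf{Z}}}\Big[\mathbb{E}_{\mathbf{x}^{\prime}\sim p_{G}(\cdot|\mathbf{z})}\left[\log\left(1-D(\mathbf{x}^{\prime},g_{\mathbf{Z}}(\mathbf{z}))\right)\right]\Big]$$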

An identity $g_{\mathbf{X}}(\mathbf{x})=\mathbf{x}$ and $g_{\mathbf{Z}}(\mathbf{z})=\mathbf{z}$ (and $\Omega_{\mathbf{X}}^{\prime}=\Omega_{\mathbf{X}}$, $\Omega_{\mathbf{Z}}^{\prime}=\Omega_{\mathbf{Z}}$) yields the original objective. For visual feature learning with higher resolution encoder inputs, $g_{\mathbf{X}}$ is an image resizing function that downsamples a high resolution image $\mathbf{x}\in\Omega_{\mathbf{X}}$ to a lower resolution image $\mathbf{x}^{\prime}\in\Omega_{\mathbf{X}}^{\prime}$, as output by the generator. ($g_{\mathbf{Z}}$ is identity.)
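The role of $g_{\mathbf{X}}$ can be sketched as follows. The example is hypothetical: it uses a $2\times$ average-pooling resize on a small grayscale image stored as nested lists, a stand-in encoder that returns the mean pixel, and a stand-in generator that emits a constant low-resolution image. The text does not specify the resizing method, so average pooling is an assumption. The point is only which inputs each module receives.

```python
def downsample_2x(image):
    """g_X (assumed form): average-pool a 2D image (list of rows) by a factor of 2."""
    height, width = len(image), len(image[0])
    return [[(image[r][c] + image[r][c + 1]
              + image[r + 1][c] + image[r + 1][c + 1]) / 4.0
             for c in range(0, width - 1, 2)]
            for r in range(0, height - 1, 2)]

def discriminator_inputs(x_hi, z, E, G, g_x=downsample_2x):
    """Return the (data, latent) pairs seen by the generalized discriminator."""
    encoder_pair = (g_x(x_hi), E(x_hi))   # E sees full resolution, D sees g_X(x)
    generator_pair = (G(z), z)            # G already outputs the low resolution
    return encoder_pair, generator_pair

# Stand-in modules (assumptions for illustration only).
def toy_encoder(image):
    pixels = [p for row in image for p in row]
    return [sum(pixels) / len(pixels)]

def toy_generator(z):
    return [[z[0], z[0]], [z[0], z[0]]]

x_hi = [[float(r * 4 + c) for c in range(4)] for r in range(4)]   # 4x4 "image"
enc_pair, gen_pair = discriminator_inputs(x_hi, [0.5], toy_encoder, toy_generator)
print(enc_pair[0], gen_pair[0])
```

Both data components reaching the discriminator have the same low resolution, while the encoder alone operates on the high-resolution input.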

Evaluation

We evaluate the feature learning capabilities of BiGANs by first training them unsupervised as described in Section 3.4, then transferring the encoder’s learned feature representations for use in auxiliary supervised learning tasks. To demonstrate that BiGANs are able to learn meaningful feature representations both on arbitrary data vectors, where the model is agnostic to any underlying structure, as well as very high-dimensional and complex distributions, we evaluate on both permutation-invariant MNIST (LeCun et al., 1998) and on the high-resolution natural images of ImageNet (Russakovsky et al., 2015).

In all experiments, each module $D$, $G$, and $E$ is a parametric deep (multi-layer) network. The BiGAN discriminator $D(\mathbf{x},\mathbf{z})$ takes data $\mathbf{x}$ as its initial input, and at each linear layer thereafter, the latent representation $\mathbf{z}$ is transformed using a learned linear transformation to the hidden layer dimension and added to the non-linearity input.
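One possible reading of this discriminator design is sketched below for a small fully connected network. The layer sizes, the random initialization, the leaky ReLU, and the choice to inject $\mathbf{z}$ at every hidden layer including the first are assumptions for illustration; the names `make_discriminator` and `discriminator_forward` are hypothetical.

```python
import math
import random

def matvec(W, v):
    return [sum(w * a for w, a in zip(row, v)) for row in W]

def leaky_relu(v, leak=0.2):
    return [a if a > 0 else leak * a for a in v]

def random_matrix(rows, cols, rng, scale=0.1):
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]

def make_discriminator(x_dim, z_dim, hidden, rng):
    """Each hidden layer has weights W for the data path and U for the latent path."""
    layers, in_dim = [], x_dim
    for width in hidden:
        layers.append({"W": random_matrix(width, in_dim, rng),
                       "U": random_matrix(width, z_dim, rng),
                       "b": [0.0] * width})
        in_dim = width
    w_out = [rng.gauss(0.0, 0.1) for _ in range(in_dim)]
    return {"layers": layers, "w_out": w_out, "b_out": 0.0}

def discriminator_forward(params, x, z):
    """D(x, z): z is linearly mapped to each hidden width and added pre-activation."""
    h = x
    for layer in params["layers"]:
        from_h = matvec(layer["W"], h)
        from_z = matvec(layer["U"], z)
        pre = [a + c + b for a, c, b in zip(from_h, from_z, layer["b"])]
        h = leaky_relu(pre)
    logit = sum(w * a for w, a in zip(params["w_out"], h)) + params["b_out"]
    return 1.0 / (1.0 + math.exp(-logit))

rng = random.Random(0)
params = make_discriminator(x_dim=4, z_dim=2, hidden=[8, 8], rng=rng)
p = discriminator_forward(params, [0.1, -0.2, 0.3, 0.4], [1.0, -1.0])
print(p)
```

Adding the transformed latent code to the pre-activation is equivalent to concatenating $\mathbf{z}$ to the layer input, which lets every layer condition on the latent component of the pair.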

Besides the BiGAN framework presented above, we considered alternative approaches to learning feature representations using different GAN variants.

The discriminator $D$ in a standard GAN takes data samples $\mathbf{x}\sim p_{\mathbf{X}}$ as input, making its learned intermediate representations natural candidates as feature representations for related tasks. This alternative is appealing as it requires no additional machinery, and is the approach used for unsupervised feature learning in Radford et al. (2016). On the other hand, it is not clear that the task of distinguishing between real and generated data requires or benefits from intermediate representations that are useful as semantic feature representations. In fact, if $G$ successfully generates the true data distribution $p_{\mathbf{X}}(\mathbf{x})$, $D$ may ignore the input data entirely and predict $P(Y=1)=P(Y=1|\mathbf{x})=\frac{1}{2}$ unconditionally, not learning any meaningful intermediate representations.

We consider an alternative encoder training by minimizing a reconstruction loss $\mathcal{L}(\mathbf{z},E(G(\mathbf{z})))$, after or jointly during a regular GAN training, called latent regressor or joint latent regressor respectively. We use a sigmoid cross entropy loss $\mathcal{L}$ as it naturally maps to a uniformly distributed output space. Intuitively, a drawback of this approach is that, unlike the encoder in a BiGAN, the latent regressor encoder $E$ is trained only on generated samples $G(\mathbf{z})$, and never “sees” real data $\mathbf{x}\sim p_{\mathbf{X}}$. While this may not be an issue in the theoretical optimum where $p_{G}(\mathbf{x})=p_{\mathbf{X}}(\mathbf{x})$ exactly (i.e., $G$ perfectly generates the data distribution $p_{\mathbf{X}}$), in practice, for highly complex data distributions $p_{\mathbf{X}}$, such as the distribution of natural images, the generator will almost never achieve this perfect result. The fact that the real data $\mathbf{x}$ are never input to this type of encoder limits its utility as a feature representation for related tasks, as shown later in this section.
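The latent regressor loss can be sketched as follows. The example assumes that latent targets have been rescaled to $[0,1]$ and that the regressor outputs one logit per latent dimension; neither detail is stated above, and the stand-in generator and regressor are hypothetical. The cross entropy is computed in the usual numerically stable form.

```python
import math

def sigmoid_cross_entropy(targets, logits):
    """Mean over dimensions of -[t log s(l) + (1 - t) log(1 - s(l))], s = sigmoid."""
    total = 0.0
    for t, l in zip(targets, logits):
        # Stable rewrite: max(l, 0) - l * t + log(1 + exp(-|l|)).
        total += max(l, 0.0) - l * t + math.log1p(math.exp(-abs(l)))
    return total / len(targets)

def latent_regressor_loss(z, G, E_logits):
    """L(z, E(G(z))): the regressor only ever sees generated samples G(z)."""
    return sigmoid_cross_entropy(z, E_logits(G(z)))

# Stand-in modules (assumptions for illustration only).
def toy_generator(z):
    return [2.0 * v for v in z]

def toy_regressor(x):
    return [0.5 * v for v in x]

z = [0.2, 0.8]   # latent targets, assumed rescaled to [0, 1]
loss = latent_regressor_loss(z, toy_generator, toy_regressor)
print(loss)
```

Note that real data never enters `latent_regressor_loss`, which is the drawback discussed above: the regressor is fit to the generator's output distribution rather than to $p_{\mathbf{X}}$.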

4.2 Permutation-invariant MNIST

4.3 ImageNet

The convolutional filters learned by each of the three modules are shown in Figure 3. We see that the filters learned by the encoder $E$ have clear Gabor-like structure, similar to those originally reported for the fully supervised AlexNet model (Krizhevsky et al., 2012). The filters also have similar “grouping” structure where one half (the bottom half, in this case) is more color sensitive, and the other half is more edge sensitive. (This separation of the filters occurs due to the AlexNet architecture maintaining two separate filter paths for computational efficiency.)

In Figure 4 we present sample generations $G(\mathbf{z})$, as well as real data samples $\mathbf{x}$ and their BiGAN reconstructions $G(E(\mathbf{x}))$. The reconstructions, while certainly imperfect, demonstrate empirically that the BiGAN encoder $E$ and generator $G$ learn approximate inverse mappings, as shown theoretically in Theorem 2. In Appendix C.2, we present nearest neighbors in the BiGAN learned feature space.

Following Noroozi & Favaro (2016), we evaluate by freezing the first $N$ layers of our pretrained network and randomly reinitializing and training the remainder fully supervised for ImageNet classification. Results are reported in Table 2.

We evaluate the transferability of BiGAN representations to the PASCAL VOC (Everingham et al., 2014) computer vision benchmark tasks, including classification, object detection, and semantic segmentation. The classification task involves simple binary prediction of presence or absence in a given image for each of 20 object categories. The object detection and semantic segmentation tasks go a step further by requiring the objects to be localized, with semantic segmentation requiring this at the finest scale: pixelwise prediction of object identity. For detection, the pretrained model is used as the initialization for Fast R-CNN (Girshick, 2015) (FRCN) training; and for semantic segmentation, the model is used as the initialization for Fully Convolutional Network (Long et al., 2015) (FCN) training, in each case replacing the AlexNet (Krizhevsky et al., 2012) model trained fully supervised for ImageNet classification. We report results on each of these tasks in Table 3, comparing BiGANs with contemporary approaches to unsupervised (Krähenbühl et al., 2016) and self-supervised (Doersch et al., 2015; Agrawal et al., 2015; Wang & Gupta, 2015; Pathak et al., 2016) feature learning in the visual domain, as well as the baselines discussed in Section 4.1.

4.4 Discussion

Despite making no assumptions about the underlying structure of the data, the BiGAN unsupervised feature learning framework offers a representation competitive with existing self-supervised and even weakly supervised feature learning approaches for visual feature learning, while still being a purely generative model with the ability to sample data $\mathbf{x}$ and predict latent representation $\mathbf{z}$. Furthermore, BiGANs outperform the discriminator ($D$) and latent regressor (LR) baselines discussed in Section 4.1, confirming our intuition that these approaches may not perform well in the regime of highly complex data distributions such as that of natural images. The version in which the encoder takes a higher resolution image than output by the generator (BiGAN $112\times 112$ $E$) performs better still, and this strategy is not possible under the LR and $D$ baselines as each of those modules takes generator outputs as its input.

Although existing self-supervised approaches have shown impressive performance and thus far tended to outshine purely unsupervised approaches in the complex domain of high-resolution images, purely unsupervised approaches to feature learning or pre-training have several potential benefits.

BiGAN and other unsupervised learning approaches are agnostic to the domain of the data. The self-supervised approaches are specific to the visual domain, in some cases requiring weak supervision from video unavailable in images alone. For example, the methods are not applicable in the permutation-invariant MNIST setting explored in Section 4.2, as the data are treated as flat vectors rather than 2D images.

Furthermore, BiGAN and other unsupervised approaches needn’t suffer from domain shift between the pre-training task and the transfer task, unlike self-supervised methods in which some aspect of the data is normally removed or corrupted in order to create a non-trivial prediction task. In the context prediction task (Doersch et al., 2015), the network sees only small image patches – the global image structure is unobserved. In the context encoder or inpainting task (Pathak et al., 2016), each image is corrupted by removing large areas to be filled in by the prediction network, creating inputs with dramatically different appearance from the uncorrupted natural images seen in the transfer tasks.

Other approaches (Agrawal et al., 2015; Wang & Gupta, 2015) rely on auxiliary information unavailable in the static image domain, such as video, egomotion, or tracking. Unlike BiGAN, such approaches cannot learn feature representations from unlabeled static images.

We finally note that the results presented here constitute only a preliminary exploration of the space of model architectures possible under the BiGAN framework, and we expect results to improve significantly with advancements in generative image models and discriminative convolutional networks alike.

The authors thank Evan Shelhamer, Jonathan Long, and other Berkeley Vision labmates for helpful discussions throughout this work. This work was supported by DARPA, AFRL, DoD MURI award N000141110688, NSF awards IIS-1427425 and IIS-1212798, and the Berkeley Artificial Intelligence Research laboratory. The GPUs used for this work were donated by NVIDIA.

References

Appendix A Additional proofs

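A.1 Proof of Proposition 1 (optimal discriminator)

Proof. Let $P_{EG} := \frac{P_{E\mathbf{X}}+P_{G\mathbf{Z}}}{2}$ denote the average of the two joint measures, and let $f_{EG} := \frac{\mathrm{d}P_{E\mathbf{X}}}{\mathrm{d}(P_{E\mathbf{X}}+P_{G\mathbf{Z}})}$ and $f_{GE} := \frac{\mathrm{d}P_{G\mathbf{Z}}}{\mathrm{d}(P_{E\mathbf{X}}+P_{G\mathbf{Z}})}$ be the corresponding Radon-Nikodym derivatives, which satisfy $f_{EG}+f_{GE}=1$. We use (4) and (5) to rewrite the objective $V$ (3) as a single expectation under measure $P_{EG}$:

$$V(D,E,G) = \mathbb{E}_{(\mathbf{x},\mathbf{z})\sim P_{E\mathbf{X}}}\left[\log D(\mathbf{x},\mathbf{z})\right] + \mathbb{E}_{(\mathbf{x},\mathbf{z})\sim P_{G\mathbf{Z}}}\left[\log\left(1-D(\mathbf{x},\mathbf{z})\right)\right] = 2\,\mathbb{E}_{(\mathbf{x},\mathbf{z})\sim P_{EG}}\left[f_{EG}\log D + (1-f_{EG})\log(1-D)\right].$$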

Note that $\operatorname*{arg\,max}_{y}\left\{a\log y+(1-a)\log(1-y)\right\}=a$ for any $a\in[0,1]$. Thus, $D^{*}_{EG}=f_{EG}$. $\square$
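The maximizer identity used in this step is easy to confirm numerically. The sketch below simply grid-searches $y\in(0,1)$ for a few arbitrary values of $a$; the grid resolution and the tested values are illustrative choices.

```python
import math

def bernoulli_log_likelihood(a, y):
    """The pointwise objective a log y + (1 - a) log(1 - y)."""
    return a * math.log(y) + (1.0 - a) * math.log(1.0 - y)

def grid_argmax(a, steps=1000):
    """Best y on the open grid {1/steps, ..., (steps - 1)/steps}."""
    grid = [k / steps for k in range(1, steps)]
    return max(grid, key=lambda y: bernoulli_log_likelihood(a, y))

results = {a: grid_argmax(a) for a in (0.1, 0.25, 0.5, 0.9)}
for a, y_best in results.items():
    print(a, y_best)
```

Each maximizer lands on the grid point nearest to $a$, consistent with setting the derivative $a/y - (1-a)/(1-y)$ to zero.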

A.2 Proof of Proposition 2 (encoder and generator objective)

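Proof. Using Proposition 1 along with (5) ($1-D^{*}_{EG}=1-f_{EG}=f_{GE}$) we rewrite the objective

$$C(E,G) = \max_{D}V(D,E,G) = V(D^{*}_{EG},E,G) = \mathbb{E}_{(\mathbf{x},\mathbf{z})\sim P_{E\mathbf{X}}}\left[\log f_{EG}(\mathbf{x},\mathbf{z})\right] + \mathbb{E}_{(\mathbf{x},\mathbf{z})\sim P_{G\mathbf{Z}}}\left[\log f_{GE}(\mathbf{x},\mathbf{z})\right].$$

Writing $P_{EG} := \frac{P_{E\mathbf{X}}+P_{G\mathbf{Z}}}{2}$, we have $f_{EG}=\frac{1}{2}\frac{\mathrm{d}P_{E\mathbf{X}}}{\mathrm{d}P_{EG}}$ and $f_{GE}=\frac{1}{2}\frac{\mathrm{d}P_{G\mathbf{Z}}}{\mathrm{d}P_{EG}}$, so that

$$C(E,G) = \mathrm{D}_{\mathrm{KL}}\left(P_{E\mathbf{X}}\,\middle\|\,P_{EG}\right) - \log 2 + \mathrm{D}_{\mathrm{KL}}\left(P_{G\mathbf{Z}}\,\middle\|\,P_{EG}\right) - \log 2 = 2\,\mathrm{D}_{\mathrm{JS}}\left(P_{E\mathbf{X}}\,\middle\|\,P_{G\mathbf{Z}}\right) - \log 4. \quad\square$$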

A.3 Measure definitions for deterministic $E$ and $G$

While Theorem 1 and Propositions 1 and 2 hold for any encoder $p_{E}(\mathbf{z}|\mathbf{x})$ and generator $p_{G}(\mathbf{x}|\mathbf{z})$, stochastic or deterministic, Theorems 2 and 3 assume the encoder $E$ and generator $G$ are deterministic functions; i.e., with conditionals $p_{E}(\mathbf{z}|\mathbf{x})=\delta(\mathbf{z}-E(\mathbf{x}))$ and $p_{G}(\mathbf{x}|\mathbf{z})=\delta(\mathbf{x}-G(\mathbf{z}))$ defined as $\delta$ functions.

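For use in the proofs of those theorems, we simplify the definitions of measures $P_{E\mathbf{X}}$ and $P_{G\mathbf{Z}}$ given in Section 3 for the case of deterministic functions $E$ and $G$ below:

$$P_{E\mathbf{X}}(R) = \int_{\Omega_{\mathbf{X}}}p_{\mathbf{X}}(\mathbf{x})\,\mathbf{1}_{[(\mathbf{x},E(\mathbf{x}))\in R]}\,\mathrm{d}\mathbf{x}, \qquad P_{G\mathbf{Z}}(R) = \int_{\Omega_{\mathbf{Z}}}p_{\mathbf{Z}}(\mathbf{z})\,\mathbf{1}_{[(G(\mathbf{z}),\mathbf{z})\in R]}\,\mathrm{d}\mathbf{z}.$$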

A.4 Proof of Theorem 2 (optimal generator and encoder are inverses)

Proof. Let $R^{0}_{\mathbf{X}} := \{\mathbf{x}\in\Omega_{\mathbf{X}}:\mathbf{x}\neq G(E(\mathbf{x}))\}$ be the region of $\Omega_{\mathbf{X}}$ in which the inversion property $\mathbf{x}=G(E(\mathbf{x}))$ does not hold. We will show that, for optimal $E$ and $G$, $R^{0}_{\mathbf{X}}$ has measure zero under $P_{\mathbf{X}}$ (i.e., $P_{\mathbf{X}}(R^{0}_{\mathbf{X}})=0$) and therefore $\mathbf{x}=G(E(\mathbf{x}))$ holds $P_{\mathbf{X}}$-almost everywhere.

Let $R^{0} := \{(\mathbf{x},\mathbf{z})\in\Omega:\mathbf{z}=E(\mathbf{x})\,\land\,\mathbf{x}\in R^{0}_{\mathbf{X}}\}$ be the region of $\Omega$ such that $(\mathbf{x},E(\mathbf{x}))\in R^{0}$ if and only if $\mathbf{x}\in R^{0}_{\mathbf{X}}$. We’ll use the definitions of $P_{E\mathbf{X}}$ and $P_{G\mathbf{Z}}$ for deterministic $E$ and $G$ (Appendix A.3), and the fact that $P_{E\mathbf{X}}=P_{G\mathbf{Z}}$ for optimal $E$ and $G$ (Theorem 1).
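The chain of equalities behind this argument can be sketched as follows. It is a reconstruction under the assumption that the deterministic measures take the forms $P_{E\mathbf{X}}(R)=\int p_{\mathbf{X}}(\mathbf{x})\,\mathbf{1}_{[(\mathbf{x},E(\mathbf{x}))\in R]}\,\mathrm{d}\mathbf{x}$ and $P_{G\mathbf{Z}}(R)=\int p_{\mathbf{Z}}(\mathbf{z})\,\mathbf{1}_{[(G(\mathbf{z}),\mathbf{z})\in R]}\,\mathrm{d}\mathbf{z}$, which follow from the $\delta$-function conditionals.

```latex
\begin{align*}
P_{\mathbf{X}}(R^{0}_{\mathbf{X}})
  &= \int_{\Omega_{\mathbf{X}}} p_{\mathbf{X}}(\mathbf{x})\,
     \mathbf{1}_{[\mathbf{x}\in R^{0}_{\mathbf{X}}]}\,\mathrm{d}\mathbf{x} \\
  &= \int_{\Omega_{\mathbf{X}}} p_{\mathbf{X}}(\mathbf{x})\,
     \mathbf{1}_{[(\mathbf{x},E(\mathbf{x}))\in R^{0}]}\,\mathrm{d}\mathbf{x}
   = P_{E\mathbf{X}}(R^{0}) = P_{G\mathbf{Z}}(R^{0}) \\
  &= \int_{\Omega_{\mathbf{Z}}} p_{\mathbf{Z}}(\mathbf{z})\,
     \mathbf{1}_{[(G(\mathbf{z}),\mathbf{z})\in R^{0}]}\,\mathrm{d}\mathbf{z} \\
  &= \int_{\Omega_{\mathbf{Z}}} p_{\mathbf{Z}}(\mathbf{z})\,
     \mathbf{1}_{[\mathbf{z}=E(G(\mathbf{z}))\,\land\,G(\mathbf{z})\in R^{0}_{\mathbf{X}}]}\,\mathrm{d}\mathbf{z} \\
  &= \int_{\Omega_{\mathbf{Z}}} p_{\mathbf{Z}}(\mathbf{z})\,
     \mathbf{1}_{[\mathbf{z}=E(G(\mathbf{z}))\,\land\,G(\mathbf{z})\neq G(E(G(\mathbf{z})))]}\,\mathrm{d}\mathbf{z}
   = 0.
\end{align*}
```

The final indicator is identically zero because $\mathbf{z}=E(G(\mathbf{z}))$ implies $G(\mathbf{z})=G(E(G(\mathbf{z})))$, so the two conditions can never hold together.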

Hence region $R^{0}_{\mathbf{X}}$ has measure zero ($P_{\mathbf{X}}(R^{0}_{\mathbf{X}})=0$), and the inversion property $\mathbf{x}=G(E(\mathbf{x}))$ holds $P_{\mathbf{X}}$-almost everywhere.

An analogous argument shows that $R^{0}_{\mathbf{Z}} := \{\mathbf{z}\in\Omega_{\mathbf{Z}}:\mathbf{z}\neq E(G(\mathbf{z}))\}$ has measure zero under $P_{\mathbf{Z}}$ (i.e., $P_{\mathbf{Z}}(R^{0}_{\mathbf{Z}})=0$) and therefore $\mathbf{z}=E(G(\mathbf{z}))$ holds $P_{\mathbf{Z}}$-almost everywhere. $\square$

A.5 Proof of Theorem 3 (relationship to autoencoders)

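As shown in Proposition 2 (Section 3), the BiGAN objective is equivalent to the Jensen-Shannon divergence between $P_{E\mathbf{X}}$ and $P_{G\mathbf{Z}}$. We now go a step further and show that this Jensen-Shannon divergence is closely related to a standard autoencoder loss. Omitting the $\tfrac{1}{2}$ scale factor, a KL divergence term of the Jensen-Shannon divergence is given as

$$\mathrm{D}_{\mathrm{KL}}\left(P_{E\mathbf{X}}\,\middle\|\,\frac{P_{E\mathbf{X}}+P_{G\mathbf{Z}}}{2}\right) = \log 2 + \int_{\Omega}\log\frac{\mathrm{d}P_{E\mathbf{X}}}{\mathrm{d}(P_{E\mathbf{X}}+P_{G\mathbf{Z}})}\,\mathrm{d}P_{E\mathbf{X}} = \log 2 + \int_{\Omega}\log f\,\mathrm{d}P_{E\mathbf{X}}, \tag{6}$$

where we abbreviate $f := f_{EG}$.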

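We’ll make use of the definitions of $P_{E\mathbf{X}}$ and $P_{G\mathbf{Z}}$ for deterministic $E$ and $G$ found in Appendix A.3. The integral term of the KL divergence expression given in (6) over a particular region $R\subseteq\Omega$ will be denoted by

$$F(R) := \int_{R}\log\frac{\mathrm{d}P_{E\mathbf{X}}}{\mathrm{d}(P_{E\mathbf{X}}+P_{G\mathbf{Z}})}\,\mathrm{d}P_{E\mathbf{X}} = \int_{R}\log f\,\mathrm{d}P_{E\mathbf{X}}.$$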

Next we will show that $f>0$ holds $P_{E\mathbf{X}}$-almost everywhere, and hence $F$ is always well defined and finite. We then show that $F$ is equivalent to an autoencoder-like reconstruction loss function.

Proposition 3. $f>0$ $P_{E\mathbf{X}}$-almost everywhere.

where $\varepsilon$ is a constant smaller than $1$. But $P_{E\mathbf{X}}(R^{f<1})<P_{E\mathbf{X}}(R^{f<1})$ is a contradiction; hence $P_{E\mathbf{X}}(R^{f<1})=0$ and $f=1$ $P_{E\mathbf{X}}$-almost everywhere in $R_{S}$, implying $\log f=0$ $P_{E\mathbf{X}}$-almost everywhere in $R_{S}$. Hence $F(R_{S})=0$. $\square$

Proposition 5. $f<1$ $P_{E\mathbf{X}}$-almost everywhere in $R^{1}$.

Let $R^{f=1} := \left\{(\mathbf{x},\mathbf{z})\in R^{1}:f(\mathbf{x},\mathbf{z})=1\right\}$ be the region in which $f=1$. Assume that $R^{f=1}\neq\emptyset$. By definition of the support (we use the definition $U\cap C\neq\emptyset\implies\mu(U\cap C)>0$ here), $P_{E\mathbf{X}}(R^{f=1})>0$ and $P_{G\mathbf{Z}}(R^{f=1})>0$. The Radon-Nikodym derivative on $R^{f=1}$ is then given by

$$P_{E\mathbf{X}}(R^{f=1}) = \int_{R^{f=1}}f\,\mathrm{d}(P_{E\mathbf{X}}+P_{G\mathbf{Z}}) = \int_{R^{f=1}}1\,\mathrm{d}(P_{E\mathbf{X}}+P_{G\mathbf{Z}}) = P_{E\mathbf{X}}(R^{f=1})+P_{G\mathbf{Z}}(R^{f=1}),$$

which implies $P_{G\mathbf{Z}}(R^{f=1})=0$ and contradicts the definition of support. Hence $R^{f=1}=\emptyset$ and $f<1$ $P_{E\mathbf{X}}$-almost everywhere on $R^{1}$, implying $\log f<0$ $P_{E\mathbf{X}}$-almost everywhere. $\square$

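So a point $(\mathbf{x},E(\mathbf{x}))$ is in $R^{1}$ if $\mathbf{x}\in\hat{\Omega}_{\mathbf{X}}$, $E(\mathbf{x})\in\hat{\Omega}_{\mathbf{Z}}$, and $G(E(\mathbf{x}))=\mathbf{x}$. (We can omit the $\mathbf{x}\in\hat{\Omega}_{\mathbf{X}}$ condition from inside an expectation over $P_{\mathbf{X}}$, as the set of $\mathbf{x}\notin\hat{\Omega}_{\mathbf{X}}$ has probability $0$ under $P_{\mathbf{X}}$.) Therefore,

$$F(R^{1}) = \int_{R^{1}}\log f\,\mathrm{d}P_{E\mathbf{X}} = \mathbb{E}_{\mathbf{x}\sim p_{\mathbf{X}}}\left[\mathbf{1}_{[E(\mathbf{x})\in\hat{\Omega}_{\mathbf{Z}}\,\land\,G(E(\mathbf{x}))=\mathbf{x}]}\log f(\mathbf{x},E(\mathbf{x}))\right].$$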

Finally, with Propositions 3 and 5, we have $f\in(0,1)$ $P_{E\mathbf{X}}$-almost everywhere in $R^{1}$, and therefore $\log f\in(-\infty,0)$, taking a finite and strictly negative value $P_{E\mathbf{X}}$-almost everywhere.

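An analogous argument (along with the fact that $f_{EG}+f_{GE}=1$) lets us rewrite the other KL divergence term

$$\mathrm{D}_{\mathrm{KL}}\left(P_{G\mathbf{Z}}\,\middle\|\,\frac{P_{E\mathbf{X}}+P_{G\mathbf{Z}}}{2}\right) = \log 2 + \mathbb{E}_{\mathbf{z}\sim p_{\mathbf{Z}}}\left[\mathbf{1}_{[G(\mathbf{z})\in\hat{\Omega}_{\mathbf{X}}\,\land\,E(G(\mathbf{z}))=\mathbf{z}]}\log\left(1-f_{EG}(G(\mathbf{z}),\mathbf{z})\right)\right].$$

The Jensen-Shannon divergence is the mean of these two KL divergences, giving $C(E,G)$:

$$C(E,G) = 2\,\mathrm{D}_{\mathrm{JS}}\left(P_{E\mathbf{X}}\,\middle\|\,P_{G\mathbf{Z}}\right)-\log 4 = \mathbb{E}_{\mathbf{x}\sim p_{\mathbf{X}}}\left[\mathbf{1}_{[E(\mathbf{x})\in\hat{\Omega}_{\mathbf{Z}}\,\land\,G(E(\mathbf{x}))=\mathbf{x}]}\log f_{EG}(\mathbf{x},E(\mathbf{x}))\right] + \mathbb{E}_{\mathbf{z}\sim p_{\mathbf{Z}}}\left[\mathbf{1}_{[G(\mathbf{z})\in\hat{\Omega}_{\mathbf{X}}\,\land\,E(G(\mathbf{z}))=\mathbf{z}]}\log\left(1-f_{EG}(G(\mathbf{z}),\mathbf{z})\right)\right]. \quad\square$$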

Appendix B Learning details

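In this section we provide additional details on the BiGAN learning protocol summarized in Section 3.4. Goodfellow et al. (2014) found for GAN training that an objective in which the real and generated labels $Y$ are swapped provides stronger gradient signal to $G$. We similarly observed in BiGAN training that an “inverse” objective $\Lambda$ (with the same fixed point characteristics as $V$) provides stronger gradient signal to $G$ and $E$, where

$$\Lambda(D,G,E) = \mathbb{E}_{\mathbf{x}\sim p_{\mathbf{X}}}\Big[\mathbb{E}_{\mathbf{z}\sim p_{E}(\cdot|\mathbf{x})}\left[\log\left(1-D(\mathbf{x},\mathbf{z})\right)\right]\Big] + \mathbb{E}_{\mathbf{z}\sim p_{\mathbf{Z}}}\Big[\mathbb{E}_{\mathbf{x}\sim p_{G}(\cdot|\mathbf{z})}\left[\log D(\mathbf{x},\mathbf{z})\right]\Big].$$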

In practice, $\theta_{G}$ and $\theta_{E}$ are updated by moving in the positive gradient direction of this inverse objective $\nabla_{\theta_{E},\theta_{G}}\Lambda$, rather than the negative gradient direction of the original objective.
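Why the swapped-label objective gives a stronger signal can be seen from the gradients with respect to the discriminator's logit $s$, where $D=\sigma(s)$. The sketch below is an illustration with an arbitrarily chosen logit magnitude; it compares the generator-side term $\log(1-D)$ of the original objective with the $\log D$ term of the inverse objective when the discriminator confidently rejects a generated pair.

```python
import math

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

def grad_log_one_minus_d(s):
    """d/ds log(1 - sigmoid(s)) = -sigmoid(s)."""
    return -sigmoid(s)

def grad_log_d(s):
    """d/ds log(sigmoid(s)) = 1 - sigmoid(s)."""
    return 1.0 - sigmoid(s)

# Generator pair that D confidently labels as generated (assumed logit).
s_fake = -6.0
original_g = abs(grad_log_one_minus_d(s_fake))   # original term, minimized by G
inverse_g = abs(grad_log_d(s_fake))              # inverse term, maximized by G

# Encoder pair that D confidently labels as real (assumed logit).
s_real = 6.0
original_e = abs(grad_log_d(s_real))             # original term, minimized by E
inverse_e = abs(grad_log_one_minus_d(s_real))    # inverse term, maximized by E

print(original_g, inverse_g, original_e, inverse_e)
```

In both cases the original term saturates (gradient magnitude near $0$) exactly when the discriminator is winning, while the inverse term keeps a gradient magnitude near $1$.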

We also observed that learning behaved similarly when all parameters $\theta_{D}$, $\theta_{G}$, $\theta_{E}$ were updated simultaneously at each iteration rather than alternating between $\theta_{D}$ updates and $\theta_{G},\theta_{E}$ updates, so we took the simultaneous updating (non-alternating) approach for computational efficiency. (For standard GAN training, simultaneous updates of $\theta_{D}$, $\theta_{G}$ performed similarly well, so our standard GAN experiments also follow this protocol.)

Appendix C Model and training details

In the following sections we present additional details on the models and training protocols used in the permutation-invariant MNIST and ImageNet evaluations presented in Section 4.

We implement BiGANs and baseline feature learning methods using the Theano (Theano Development Team, 2016) framework, based on the convolutional GAN implementation provided by Radford et al. (2016). ImageNet transfer learning experiments (Section 4.3) use the Caffe (Jia et al., 2014) framework, per the Fast R-CNN (Girshick, 2015) and FCN (Long et al., 2015) reference implementations. Most computation is performed on an NVIDIA Titan X or Tesla K40 GPU.

C.1 Permutation-invariant MNIST

In all permutation-invariant MNIST experiments (Section 4.2), $D$, $G$, and $E$ each consist of two hidden layers with 1024 units. The first hidden layer is followed by a non-linearity; the second is followed by (parameter-free) batch normalization (Ioffe & Szegedy, 2015) and a non-linearity. The second hidden layer in each case is the input to a linear prediction layer of the appropriate size. In $D$ and $E$, a leaky ReLU (Maas et al., 2013) non-linearity with a “leak” of 0.2 is used; in $G$, a standard ReLU non-linearity is used. All models are trained for 400 epochs.

C.2 ImageNet

In all ImageNet experiments (Section 4.3), the encoder $E$ architecture follows AlexNet (Krizhevsky et al., 2012) through the fifth and last convolution layer (conv5), with local response normalization (LRN) layers removed and batch normalization (Ioffe & Szegedy, 2015) (including the learned scaling and bias) with leaky ReLU non-linearity applied to the output of each convolution at unsupervised training time. (For supervised evaluation, batch normalization is not used, and the pre-trained scale and bias are merged into the preceding convolution’s weights and bias.)
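Merging a per-channel scale and bias into the preceding layer is a simple algebraic identity, sketched below. A fully connected layer stands in for the convolution (both are linear per output channel), the numbers are made up, and the function name `fold_scale_and_bias` is hypothetical. Only the learned scale $\gamma$ and bias $\beta$ are folded here; how the normalization statistics are handled is not stated above, so they are left out.

```python
def linear(weights, bias, x):
    """One output per row of `weights`: w . x + b."""
    return [sum(w * a for w, a in zip(row, x)) + b
            for row, b in zip(weights, bias)]

def fold_scale_and_bias(weights, bias, gamma, beta):
    """Fold y = gamma * (w . x + b) + beta into new weights and bias."""
    folded_w = [[g * w for w in row] for row, g in zip(weights, gamma)]
    folded_b = [g * b + bt for b, g, bt in zip(bias, gamma, beta)]
    return folded_w, folded_b

# Made-up parameters for a layer with two output channels.
weights = [[1.0, -2.0], [0.5, 0.25]]
bias = [0.1, -0.3]
gamma = [2.0, 0.5]
beta = [1.0, -1.0]
x = [3.0, 4.0]

separate = [g * y + bt for y, g, bt in zip(linear(weights, bias, x), gamma, beta)]
folded_w, folded_b = fold_scale_and_bias(weights, bias, gamma, beta)
merged = linear(folded_w, folded_b, x)
print(separate, merged)
```

The folded layer reproduces the output of the layer followed by the separate scale and bias, so the affine step can be dropped at evaluation time without changing the function computed.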

In most experiments, both the discriminator $D$ and generator $G$ architecture are those used by Radford et al. (2016), consisting of a series of four $5\times 5$ convolutions (or “deconvolutions”, i.e., fractionally-strided convolutions, for the generator $G$) applied with 2 pixel stride, each followed by batch normalization and rectified non-linearity.

The sole exception is our discriminator baseline feature learning experiment, in which we let the discriminator $D$ be the AlexNet variant described above. Generally, using AlexNet (or similar convnet architecture) as the discriminator $D$ is detrimental to the visual fidelity of the resulting generated images, likely due to the relatively large convolutional filter kernel size applied to the input image, as well as the max-pooling layers, which explicitly discard information in the input. However, for fair comparison of the discriminator’s feature learning abilities with those of BiGANs, we use the same architecture as used in the BiGAN encoder.

To produce a data sample $\mathbf{x}$, we first sample an image from the database, and resize it proportionally such that its shorter edge has a length of 72 pixels. Then, a $64\times 64$ crop is randomly selected from the resized image. The crop is flipped horizontally with probability $\frac{1}{2}$. Finally, the crop is scaled to a fixed value range, giving the sample $\mathbf{x}$.
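The geometry of this sampling procedure can be sketched without an image library. The example below only computes the resized dimensions, the crop box, and the flip decision; the function names, the rounding rule, and the example image size are assumptions, and the final value scaling is omitted.

```python
import random

def resized_shape(width, height, short_edge=72):
    """Proportional resize so that the shorter edge equals `short_edge`."""
    scale = short_edge / min(width, height)
    return (max(short_edge, round(width * scale)),
            max(short_edge, round(height * scale)))

def sample_crop(width, height, rng, crop=64, short_edge=72):
    """Pick a random crop box in the resized image and a horizontal flip flag."""
    new_w, new_h = resized_shape(width, height, short_edge)
    left = rng.randint(0, new_w - crop)   # randint is inclusive on both ends
    top = rng.randint(0, new_h - crop)
    flip = rng.random() < 0.5
    return {"resized": (new_w, new_h),
            "box": (left, top, left + crop, top + crop),
            "flip": flip}

rng = random.Random(0)
plan = sample_crop(500, 375, rng)   # hypothetical 500x375 source image
print(plan)
```

Because the shorter edge is 72 pixels and the crop is 64 pixels, the crop can shift by at most 8 pixels along the shorter dimension and by more along the longer one.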

A single epoch (one training pass over the 1.2 million images) of BiGAN training takes roughly 40 minutes on a Titan X GPU. Models are trained for 100 epochs, for a total training time of under 3 days.

In Figure 5 we present nearest neighbors in the feature space of the BiGAN encoder $E$ learned in unsupervised ImageNet training.