Large Scale Adversarial Representation Learning

Jeff Donahue, Karen Simonyan

Introduction

In recent years we have seen rapid progress in generative models of visual data. While these models were previously confined to domains with single or few modes, simple structure, and low resolution, with advances in both modeling and hardware they have since gained the ability to convincingly generate complex, multimodal, high resolution image distributions.

Intuitively, the ability to generate data in a particular domain necessitates a high-level understanding of the semantics of said domain. This idea has long-standing appeal as raw data is both cheap – readily available in virtually infinite supply from sources like the Internet – and rich, with images comprising far more information than the class labels that typical discriminative machine learning models are trained to predict from them. Yet, while the progress in generative models has been undeniable, nagging questions persist: what semantics have these models learned, and how can they be leveraged for representation learning?

The dream of generation as a means of true understanding from raw data alone has hardly been realized. Instead, the most successful approaches for unsupervised learning leverage techniques adopted from the field of supervised learning, a class of methods known as self-supervised learning. These approaches typically involve changing or holding back certain aspects of the data in some way, and training a model to predict or generate aspects of the missing information. For example, colorization has been proposed as a means of unsupervised learning, where a model is given a subset of the color channels in an input image, and trained to predict the missing channels.

Generative models as a means of unsupervised learning offer an appealing alternative to self-supervised tasks in that they are trained to model the full data distribution without requiring any modification of the original data. One class of generative models that has been applied to representation learning is generative adversarial networks (GANs). The generator in the GAN framework is a feed-forward mapping from randomly sampled latent variables (also called “noise”) to generated data, with learning signal provided by a discriminator trained to distinguish between real and generated data samples, guiding the generator’s outputs to follow the data distribution. The adversarially learned inference (ALI) or bidirectional GAN (BiGAN) approaches were proposed as extensions to the GAN framework that augment the standard GAN with an encoder module mapping real data to latents, the inverse of the mapping learned by the generator.

Prior work demonstrated that the encoder learned via the BiGAN or ALI framework is an effective means of visual representation learning on ImageNet for downstream tasks. However, that work used a DCGAN-style generator, incapable of producing high-quality images on this dataset, so the semantics the encoder could model were in turn quite limited. In this work we revisit this approach using BigGAN as the generator, a modern model that appears capable of capturing many of the modes and much of the structure present in ImageNet images. Our contributions are as follows:

We show that BigBiGAN (BiGAN with BigGAN generator) matches the state of the art in unsupervised representation learning on ImageNet.

We propose a more stable version of the joint discriminator for BigBiGAN.

We perform a thorough empirical analysis and ablation study of model design choices.

We show that the representation learning objective also improves unconditional image generation, and demonstrate state-of-the-art results in unconditional ImageNet generation.

We open source pretrained BigBiGAN models on TensorFlow Hub (see footnote 1).

BigBiGAN

The BiGAN or ALI approaches were proposed as extensions of the GAN framework which enable the learning of an encoder that can be employed as an inference model or feature representation. Given a distribution $P_{\mathbf{x}}$ of data $\mathbf{x}$ (e.g., images), and a distribution $P_{\mathbf{z}}$ of latents $\mathbf{z}$ (usually a simple continuous distribution like an isotropic Gaussian $\mathcal{N}(0,I)$), the generator $\mathcal{G}$ models a conditional distribution $P(\mathbf{x}|\mathbf{z})$ of data $\mathbf{x}$ given latent inputs $\mathbf{z}$ sampled from the latent prior $P_{\mathbf{z}}$, as in the standard GAN generator. The encoder $\mathcal{E}$ models the inverse conditional distribution $P(\mathbf{z}|\mathbf{x})$, predicting latents $\mathbf{z}$ given data $\mathbf{x}$ sampled from the data distribution $P_{\mathbf{x}}$.

Besides the addition of $\mathcal{E}$, the other modification to the GAN in the BiGAN framework is a joint discriminator $\mathcal{D}$, which takes as input data-latent pairs $(\mathbf{x},\mathbf{z})$ (rather than just data $\mathbf{x}$ as in a standard GAN), and learns to discriminate between pairs from the data distribution and encoder, versus the generator and latent distribution. Concretely, its inputs are pairs $(\mathbf{x}\sim P_{\mathbf{x}},\hat{\mathbf{z}}\sim\mathcal{E}(\mathbf{x}))$ and $(\hat{\mathbf{x}}\sim\mathcal{G}(\mathbf{z}),\mathbf{z}\sim P_{\mathbf{z}})$, and the goal of $\mathcal{G}$ and $\mathcal{E}$ is to “fool” the discriminator by making the two joint distributions $P_{\mathbf{x}\mathcal{E}}$ and $P_{\mathcal{G}\mathbf{z}}$ from which these pairs are sampled indistinguishable. The adversarial minimax objective in BiGAN, analogous to that of the GAN framework, was defined as follows, where $\sigma$ here denotes the logistic sigmoid applied to the discriminator output:

$$\min_{\mathcal{G}\mathcal{E}}\max_{\mathcal{D}}\left\{\mathbb{E}_{\mathbf{x}\sim P_{\mathbf{x}},\,\mathbf{z}\sim\mathcal{E}(\mathbf{x})}\left[\log\left(\sigma\left(\mathcal{D}(\mathbf{x},\mathbf{z})\right)\right)\right]+\mathbb{E}_{\mathbf{z}\sim P_{\mathbf{z}},\,\mathbf{x}\sim\mathcal{G}(\mathbf{z})}\left[\log\left(1-\sigma\left(\mathcal{D}(\mathbf{x},\mathbf{z})\right)\right)\right]\right\}$$

While the crux of our approach, BigBiGAN, remains the same as that of BiGAN, we have adopted the generator and discriminator architectures from the state-of-the-art BigGAN generative image model. Beyond that, we have found that an improved discriminator structure leads to better representation learning results without compromising generation (Figure 1). Namely, in addition to the joint discriminator loss proposed in BiGAN, which ties the data and latent distributions together, we propose additional unary terms in the learning objective, which are functions only of either the data $\mathbf{x}$ or the latents $\mathbf{z}$. Although it has been proven that the original BiGAN objective already enforces that the learnt joint distributions match at the global optimum, implying that the marginal distributions of $\mathbf{x}$ and $\mathbf{z}$ match as well, these unary terms intuitively guide optimization in the “right direction” by explicitly enforcing this property. For example, in the context of image generation, the unary loss term on $\mathbf{x}$ matches the original GAN objective and provides a learning signal which steers only the generator to match the image distribution independently of its latent inputs. (In our evaluation we will demonstrate empirically that the addition of these terms results in both improved generation and representation learning.)
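One way to write the objective described above is given below, as a sketch under stated assumptions rather than a definitive statement of the method. The hinge function $h$, the label convention $y\in\{-1,+1\}$ (with $+1$ for encoder pairs and $-1$ for generator pairs), and the linear projections $\theta_{\mathbf{x}}$ and $\theta_{\mathbf{z}}$ are assumptions introduced here; $F$, $H$, and $J$ are the discriminator components named in the Evaluation section, and $s_{\mathbf{x}}$ and $s_{\mathbf{z}}$ are the unary terms.

```latex
\begin{align}
s_{\mathbf{x}\mathbf{z}}(\mathbf{x},\mathbf{z}) &= J(F(\mathbf{x}),H(\mathbf{z})), \qquad
s_{\mathbf{x}}(\mathbf{x}) = \theta_{\mathbf{x}}^{\top}F(\mathbf{x}), \qquad
s_{\mathbf{z}}(\mathbf{z}) = \theta_{\mathbf{z}}^{\top}H(\mathbf{z}) \\
\ell_{\mathcal{D}}(\mathbf{x},\mathbf{z},y) &= h(y\,s_{\mathbf{x}\mathbf{z}}) + h(y\,s_{\mathbf{x}}) + h(y\,s_{\mathbf{z}}),
\qquad h(t)=\max(0,\,1-t) \\
\ell_{\mathcal{E}\mathcal{G}}(\mathbf{x},\mathbf{z},y) &= y\,(s_{\mathbf{x}\mathbf{z}} + s_{\mathbf{x}} + s_{\mathbf{z}}) \\
\mathcal{L}_{\mathcal{D}} &= \mathbb{E}_{\mathbf{x}\sim P_{\mathbf{x}},\,\hat{\mathbf{z}}\sim\mathcal{E}(\mathbf{x})}\left[\ell_{\mathcal{D}}(\mathbf{x},\hat{\mathbf{z}},+1)\right]
 + \mathbb{E}_{\mathbf{z}\sim P_{\mathbf{z}},\,\hat{\mathbf{x}}\sim\mathcal{G}(\mathbf{z})}\left[\ell_{\mathcal{D}}(\hat{\mathbf{x}},\mathbf{z},-1)\right] \\
\mathcal{L}_{\mathcal{E}\mathcal{G}} &= \mathbb{E}_{\mathbf{x}\sim P_{\mathbf{x}},\,\hat{\mathbf{z}}\sim\mathcal{E}(\mathbf{x})}\left[\ell_{\mathcal{E}\mathcal{G}}(\mathbf{x},\hat{\mathbf{z}},+1)\right]
 + \mathbb{E}_{\mathbf{z}\sim P_{\mathbf{z}},\,\hat{\mathbf{x}}\sim\mathcal{G}(\mathbf{z})}\left[\ell_{\mathcal{E}\mathcal{G}}(\hat{\mathbf{x}},\mathbf{z},-1)\right]
\end{align}
```

Under this form, minimizing $\mathcal{L}_{\mathcal{D}}$ pushes all three scores up on encoder pairs and down on generator pairs, while minimizing $\mathcal{L}_{\mathcal{E}\mathcal{G}}$ pushes them in the opposite directions.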

Like in BiGAN and ALI, the discriminator loss $\mathcal{L}_{\mathcal{D}}$ intuitively trains the discriminator to distinguish between the two joint data-latent distributions from the encoder and the generator, pushing it to predict positive values for encoder input pairs $(\mathbf{x},\mathcal{E}(\mathbf{x}))$ and negative values for generator input pairs $(\mathcal{G}(\mathbf{z}),\mathbf{z})$. The generator and encoder loss $\mathcal{L}_{\mathcal{E}\mathcal{G}}$ trains these two modules to fool the discriminator into incorrectly predicting the opposite, in effect pushing them to create matching joint data-latent distributions. (In the case of deterministic $\mathcal{E}$ and $\mathcal{G}$, this requires the two modules to invert one another.)
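The push and pull described in this paragraph can be illustrated with a small per-pair sketch. It assumes a hinge loss for the discriminator and a label `y` of `+1` for encoder pairs and `-1` for generator pairs; both conventions, and all function names, are assumptions made for illustration rather than details stated in the text. The three scores stand for the joint score and the two unary scores.

```python
def hinge(t):
    """Hinge function h(t) = max(0, 1 - t) (assumed form of the D loss)."""
    return max(0.0, 1.0 - t)


def discriminator_loss(s_xz, s_x, s_z, y):
    """Per-pair discriminator loss.

    y = +1 for an encoder pair (x, E(x)), y = -1 for a generator pair (G(z), z).
    The loss is zero once every score has the right sign with margin >= 1.
    """
    return hinge(y * s_xz) + hinge(y * s_x) + hinge(y * s_z)


def encoder_generator_loss(s_xz, s_x, s_z, y):
    """Per-pair encoder/generator loss: rewards scores of the 'wrong' sign."""
    return y * (s_xz + s_x + s_z)


# A discriminator that is confidently correct on an encoder pair pays nothing,
# while the encoder and generator pay a large loss on the same pair.
print(discriminator_loss(2.0, 1.5, 3.0, y=+1))      # 0.0
print(encoder_generator_loss(2.0, 1.5, 3.0, y=+1))  # 6.5
# An undecided discriminator (all scores zero) pays the full margin three times.
print(discriminator_loss(0.0, 0.0, 0.0, y=-1))      # 3.0
```

In a full model these per-pair values would be averaged over a batch of encoder pairs and a batch of generator pairs.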

Evaluation

Most of our experiments follow the standard protocol used in prior work to evaluate unsupervised learning techniques. We train a BigBiGAN on unlabeled ImageNet, freeze its learned representation, and then train a linear classifier on its outputs, fully supervised using all of the training set labels. We also measure image generation performance, reporting Inception Score (IS) and Fréchet Inception Distance (FID) as the standard metrics there.

We begin with an extensive ablation study in which we directly evaluate a number of modeling choices, with results presented in Table 1. Where possible we performed three runs of each variant with different seeds and report the mean and standard deviation for each metric.

We start with a relatively fully-fledged version of the model at $128\times 128$ resolution (row Base), with the $\mathcal{G}$ architecture and the $F$ component of $\mathcal{D}$ taken from the corresponding $128\times 128$ architectures in BigGAN, including the skip connections and shared noise embedding proposed there. $\mathbf{z}$ is 120 dimensions, split into six groups of 20 dimensions fed into each of the six layers of $\mathcal{G}$ as in BigGAN. The remaining components of $\mathcal{D}$ – $H$ and $J$ – are 8-layer MLPs with ResNet-style skip connections (four residual blocks with two layers each) and size 2048 hidden layers. The $\mathcal{E}$ architecture is the ResNet-v2-50 ConvNet originally proposed for image classification, followed by a 4-layer MLP (size 4096) with skip connections (two residual blocks) after ResNet’s globally average pooled output. The unconditional BigGAN training setup corresponds to the “Single Label” setup proposed in prior work, where a single “dummy” label is used for all images (theoretically equivalent to learning a bias in place of the class-conditional batch norm inputs). We then ablate several aspects of the model, with results detailed in the following paragraphs. Additional architectural and optimization details are provided in Appendix A. Full learning curves for many results are included in Appendix D.
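The latent layout described above (120 dimensions split into six groups of 20, one group per generator layer) can be sketched as follows. The function name `split_latent` is hypothetical, and the assumption that the groups are contiguous slices taken in order is ours; the text only states the group count and size.

```python
import random


def split_latent(z, num_groups=6):
    """Split a flat latent vector into equal contiguous groups (assumed layout)."""
    if len(z) % num_groups != 0:
        raise ValueError("latent size must be divisible by the number of groups")
    size = len(z) // num_groups
    return [z[i * size:(i + 1) * size] for i in range(num_groups)]


rng = random.Random(0)
z = [rng.gauss(0.0, 1.0) for _ in range(120)]  # z ~ N(0, I), 120 dimensions
groups = split_latent(z)                        # one 20-dim group per G layer
print(len(groups), len(groups[0]))              # 6 20
```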

As in ALI, the encoder $\mathcal{E}$ of our Base model is non-deterministic, parametrizing a distribution $\mathcal{N}(\mu,\sigma)$. $\mu$ and $\hat{\sigma}$ are given by a linear layer at the output of the model, and the final standard deviation $\sigma$ is computed from $\hat{\sigma}$ using a non-negative “softplus” non-linearity $\sigma=\log(1+\exp(\hat{\sigma}))$. The final $\mathbf{z}$ uses reparametrized sampling, with $\mathbf{z}=\mu+\epsilon\sigma$, where $\epsilon\sim\mathcal{N}(0,I)$. Compared to a deterministic encoder (row Deterministic $\mathcal{E}$) which predicts $\mathbf{z}$ directly without sampling (effectively modeling $P(\mathbf{z}|\mathbf{x})$ as a Dirac $\delta$ distribution), the non-deterministic Base model achieves significantly better classification performance (at no cost to generation). We also compared to using a uniform $P_{\mathbf{z}}=\mathcal{U}(-1,1)$ (row Uniform $P_{\mathbf{z}}$) with $\mathcal{E}$ deterministically predicting $\mathbf{z}=\tanh(\hat{\mathbf{z}})$ given a linear output $\hat{\mathbf{z}}$, as done in BiGAN. This also achieves worse classification results than the non-deterministic Base model.
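The stochastic encoder head described above can be sketched in a few lines: a softplus maps the raw output $\hat{\sigma}$ to a non-negative standard deviation, and the latent is sampled as $\mu+\epsilon\sigma$. This is a minimal standard-library illustration; the function names are hypothetical, and the numerically stable rewriting of the softplus is an implementation choice made here, not something specified in the text.

```python
import math
import random


def softplus(x):
    """log(1 + exp(x)), computed as max(x, 0) + log1p(exp(-|x|)) to avoid overflow."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def sample_latent(mu, sigma_hat, rng):
    """Reparametrized sample z = mu + eps * softplus(sigma_hat), eps ~ N(0, I)."""
    sigma = [softplus(s) for s in sigma_hat]
    eps = [rng.gauss(0.0, 1.0) for _ in mu]
    return [m + e * s for m, e, s in zip(mu, eps, sigma)]


rng = random.Random(0)
mu = [0.5, -1.0, 2.0]
sigma_hat = [0.0, 1.0, -3.0]   # raw linear outputs; softplus makes them non-negative
z = sample_latent(mu, sigma_hat, rng)
print(z)
```

Because the noise enters only through `eps`, the sample remains a differentiable function of `mu` and `sigma_hat` when the same computation is written in an automatic differentiation framework.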

We evaluate the effect of removing one or both unary terms of the loss function proposed in Section 2, $s_{\mathbf{x}}$ and $s_{\mathbf{z}}$. Removing both unary terms (row No Unaries) corresponds to the original BiGAN objective. It is clear that the $\mathbf{x}$ unary term has a large positive effect on generation performance, with the Base and $\mathbf{x}$ Unary Only rows having significantly better IS and FID than the $\mathbf{z}$ Unary Only and No Unaries rows. This result makes intuitive sense as the $\mathbf{x}$ unary term matches the standard generator loss. It also marginally improves classification performance. The $\mathbf{z}$ unary term makes a more marginal difference, likely due to the relative ease of modeling simple distributions like isotropic Gaussians, though it does also result in slightly improved classification and generation in terms of FID – especially without the $\mathbf{x}$ term ($\mathbf{z}$ Unary Only vs. No Unaries). On the other hand, IS is worse with the $\mathbf{z}$ term. This may be due to IS roughly measuring the generator’s coverage of the major modes of the distribution (the classes) rather than the distribution in its entirety, the latter of which may be better captured by FID and more likely to be promoted by a good encoder $\mathcal{E}$. The requirement of invertibility in a (Big)BiGAN could be encouraging the generator to produce distinguishable outputs across the entire latent space, rather than “collapsing” large volumes of latent space to a single mode of the data distribution.

To address the question of the importance of the generator $\mathcal{G}$ in representation learning, we vary the capacity of $\mathcal{G}$ (with $\mathcal{E}$ and $\mathcal{D}$ fixed) in the Small $\mathcal{G}$ rows. With a third of the capacity of the Base $\mathcal{G}$ model (Small $\mathcal{G}$ (32)), the overall model is quite unstable and achieves significantly worse classification results than the higher capacity base model. (Though the generation performance by IS and FID in row Small $\mathcal{G}$ (32) is very poor at the point we measured – when its best validation classification performance (43.59%) is achieved – this model was performing more reasonably for generation earlier in training, reaching IS 14.69 and FID 60.67.) With two-thirds capacity (Small $\mathcal{G}$ (64)), generation performance is substantially worse (matching previously reported results) and classification performance is modestly worse. These results confirm that a powerful image generator is indeed important for learning good representations via the encoder. Assuming this relationship holds in the future, we expect that better generative models are likely to lead to further improvements in representation learning.

We also compare BigBiGAN’s image generation performance against a standard unconditional BigGAN with no encoder $\mathcal{E}$ and only the standard $F$ ConvNet in the discriminator, with only the $s_{\mathbf{x}}$ term in the loss (row No $\mathcal{E}$ (GAN)). While the standard GAN achieves a marginally better IS, the BigBiGAN FID is about the same, indicating that the addition of the BigBiGAN $\mathcal{E}$ and joint $\mathcal{D}$ does not compromise generation with the newly proposed unary loss terms described in Section 2. (In comparison, the versions of the model without unary loss term on $\mathbf{x}$ – rows $\mathbf{z}$ Unary Only and No Unaries – have substantially worse generation performance in terms of FID than the standard GAN.) We conjecture that the IS is worse for similar reasons that the $s_{\mathbf{z}}$ unary loss term leads to worse IS. Next we will show that with an enhanced $\mathcal{E}$ taking higher input resolutions, generation with BigBiGAN in terms of FID is substantially improved over the standard GAN.

BiGAN proposed an asymmetric setup in which $\mathcal{E}$ takes higher resolution images than $\mathcal{G}$ outputs and $\mathcal{D}$ takes as input, showing that an $\mathcal{E}$ taking $128\times 128$ inputs with a $64\times 64$ $\mathcal{G}$ outperforms a $64\times 64$ $\mathcal{E}$ for downstream tasks. We experiment with this setup in BigBiGAN, raising the $\mathcal{E}$ input resolution to $256\times 256$ – matching the resolution used in typical supervised ImageNet classification setups – and varying the $\mathcal{G}$ output and $\mathcal{D}$ input resolution in $\{64,128,256\}$. Our results in Table 1 (rows High Res $\mathcal{E}$ (256) and Low/High Res $\mathcal{G}$ (*)) show that BigBiGAN achieves better representation learning results as the $\mathcal{G}$ resolution increases, up to the full $\mathcal{E}$ resolution of $256\times 256$. However, because the overall model is much slower to train with $\mathcal{G}$ at $256\times 256$ resolution, the remainder of our results use the $128\times 128$ resolution for $\mathcal{G}$.

Interestingly, with the higher resolution $\mathcal{E}$, generation improves significantly (especially by FID), despite $\mathcal{G}$ operating at the same resolution (row High Res $\mathcal{E}$ (256) vs. Base). This is an encouraging result for the potential of BigBiGAN as a means of improving adversarial image synthesis itself, besides its use in representation learning and inference.

Keeping the $\mathcal{E}$ input resolution fixed at 256, we experiment with varied and often larger $\mathcal{E}$ architectures, including several ResNet-50 variants explored in prior work. In particular, we expand the capacity of the hidden layers by a factor of $2$ or $4$, as well as swap the residual block structure to a reversible variant called RevNet with the same number of layers and capacity as the corresponding ResNets. (We use a previously described version of RevNet.) We find that the base ResNet-50 model (row High Res $\mathcal{E}$ (256)) outperforms RevNet-50 (row RevNet), but as the network widths are expanded, we begin to see improvements from RevNet-50, with double-width RevNet outperforming a ResNet of the same capacity (rows RevNet $\times 2$ and ResNet $\times 2$). We see further gains with an even larger quadruple-width RevNet model (row RevNet $\times 4$), which we use for our final results in Section 3.2.

As a final improvement, we decoupled the $\mathcal{E}$ optimizer from that of $\mathcal{G}$, and found that simply using a $10\times$ higher learning rate for $\mathcal{E}$ dramatically accelerates training and improves final representation learning results. For ResNet-50 this improves linear classifier accuracy by nearly 3% (ResNet ($\uparrow\mathcal{E}$ LR) vs. High Res $\mathcal{E}$ (256)). We also applied this to our largest $\mathcal{E}$ architecture, RevNet-50 $\times 4$, and saw similar gains (RevNet $\times 4$ ($\uparrow\mathcal{E}$ LR) vs. RevNet $\times 4$).

3.2 Comparison with prior methods

We now take our best model by trainval classification accuracy from the above ablations and present results on the official ImageNet validation set, comparing against the state of the art in recent unsupervised learning literature. For comparison, we also present classification results for our best performing variant with the smaller ResNet-50-based $\mathcal{E}$. These models correspond to the last two rows of Table 1, ResNet ($\uparrow\mathcal{E}$ LR) and RevNet $\times 4$ ($\uparrow\mathcal{E}$ LR).

Results are presented in Table 2. (For reference, the fully supervised accuracy of these architectures is given in Appendix A, Table 4.) Compared with a number of modern self-supervised approaches and combinations thereof, our BigBiGAN approach based purely on generative models performs well for representation learning and is state-of-the-art among recent unsupervised learning results. It improves top-1 accuracy from 55.4%, a recently published result using rotation prediction pre-training, to 60.8% with the same representation learning architecture and feature (labeled as AvePool in Table 2), and matches the results of concurrent work based on contrastive predictive coding (CPC). (Our RevNet $\times 4$ architecture matches the widest architectures used in that rotation prediction work, labeled as $\times 16$ there.)

Finally, in Appendix C we consider evaluating representations by zero-shot $k$ nearest neighbors classification, achieving 43.3% top-1 accuracy in this setting. Qualitative examples of nearest neighbors are presented in Figure 13.

In Table 3 we show results for unsupervised generation with BigBiGAN, comparing to BigGAN-based unsupervised generation results from prior work. Note that these results differ from those in Table 1 due to the use of the data augmentation method of that prior work (rather than the ResNet-style preprocessing used for all results in our Table 1 ablation study); see the “distorted” preprocessing method from the Compare GAN framework: https://github.com/google/compare_gan/blob/master/compare_gan/datasets.py. This lighter augmentation results in better image generation performance under the IS and FID metrics. The improvements are likely due in part to the fact that this augmentation, on average, crops larger portions of the image, thus yielding generators that typically produce images encompassing most or all of a given object, which tends to result in more representative samples of any given class (giving better IS) and more closely matching the statistics of full center crops (as used in the real data statistics to compute FID). Besides this preprocessing difference, the approaches in Table 3 have the same configurations as used in the Base or High Res $\mathcal{E}$ (256) row of Table 1.

These results show that BigBiGAN significantly improves both IS and FID over the baseline unconditional BigGAN generation results with the same (unsupervised) “labels” (a single fixed label in the SL (Single Label) approach – row BigBiGAN + SL vs. BigGAN + SL). We see further improvements using a high resolution $\mathcal{E}$ (row BigBiGAN High Res $\mathcal{E}$ + SL), surpassing the previous unsupervised state of the art (row BigGAN + Clustering) under both IS and FID. (Note that the image generation results remain comparable: the generated image resolution is still $128\times 128$ here, despite the higher resolution $\mathcal{E}$ input.) The alternative “pseudo-labeling” approach, Clustering, which uses labels derived from unsupervised clustering, is complementary to BigBiGAN and combining both could yield further improvements. Finally, observing that results continue to improve significantly with training beyond 500K steps, we also report results at 1M steps in the final row of Table 3.

3.3 Reconstruction

As shown in prior work on BiGAN, the (Big)BiGAN $\mathcal{E}$ and $\mathcal{G}$ can reconstruct data instances $\mathbf{x}$ by computing the encoder’s predicted latent representation $\mathcal{E}(\mathbf{x})$ and then passing this predicted latent back through the generator to obtain the reconstruction $\mathcal{G}(\mathcal{E}(\mathbf{x}))$. We present BigBiGAN reconstructions in Figure 2. These reconstructions are far from pixel-perfect, likely due in part to the fact that no reconstruction cost is explicitly enforced by the objective – reconstructions are not even computed at training time. However, they may provide some intuition for what features the encoder $\mathcal{E}$ learns to model. For example, when the input image contains a dog, person, or a food item, the reconstruction is often a different instance of the same “category” with similar pose, position, and texture – for example, a similar species of dog facing the same direction. The extent to which these reconstructions tend to retain the high-level semantics of the inputs rather than the low-level details suggests that BigBiGAN training encourages the encoder to model the former more so than the latter. Additional reconstructions are presented in Appendix B.

Related work

A number of approaches to unsupervised representation learning from images based on self-supervision have proven very successful. Self-supervision generally involves learning from tasks designed to resemble supervised learning in some way, but in which the “labels” can be created automatically from the data itself with no manual effort. An early example is relative location prediction, where a model is trained on input pairs of image patches and predicts their relative locations. Contrastive predictive coding (CPC) is a recent related approach where, given an image patch, a model predicts which patches occur in other image locations. Other approaches include colorization, motion segmentation, rotation prediction, GAN-based discrimination, and exemplar matching. Rigorous empirical comparisons of many of these approaches have also been conducted. A key advantage offered by BigBiGAN and other approaches based on generative models, relative to most self-supervised approaches, is that their input may be the full-resolution image or other signal, with no cropping or modification of the data needed (though such modifications may be beneficial as data augmentation). This means the resulting representation can typically be applied directly to full data in the downstream task with no domain shift.

A number of relevant autoencoder and GAN variants have also been proposed. Associative compression networks (ACNs) learn to compress at the dataset level by conditioning data on other previously transmitted data which are similar in code space, resulting in models that can “daydream” semantically similar samples, similar to BigBiGAN reconstructions. VQ-VAEs pair a discrete (vector quantized) encoder with an autoregressive decoder to produce faithful reconstructions with a high compression factor and demonstrate representation learning results in reinforcement learning settings. In the adversarial space, adversarial autoencoders proposed an autoencoder-style encoder-decoder pair trained with pixel-level reconstruction cost, replacing the KL-divergence regularization of the prior used in VAEs with a discriminator. In another proposed VAE-GAN hybrid the pixel-space reconstruction error used in most VAEs is replaced with feature space distance from an intermediate layer of a GAN discriminator. Other hybrid approaches like AGE and $\alpha$-GAN add an encoder to stabilize GAN training. An interesting difference between many of these approaches and the BiGAN framework is that BiGAN does not train the encoder or generator with an explicit reconstruction cost. Though it can be shown that (Big)BiGAN implicitly minimizes a reconstruction cost, qualitative reconstruction results (Section 3.3) suggest that this reconstruction cost is of a different flavor, emphasizing high-level semantics over pixel-level details.

Discussion

We have shown that BigBiGAN, an unsupervised learning approach based purely on generative models, achieves state-of-the-art results in image representation learning on ImageNet. Our ablation study lends further credence to the hope that powerful generative models can be beneficial for representation learning, and in turn that learning an inference model can improve large-scale generative models. In the future we hope that representation learning can continue to benefit from further advances in generative models and inference models alike, as well as scaling to larger image databases.

The authors would like to thank Aidan Clark, Olivier Hénaff, Aäron van den Oord, Sander Dieleman, and many other colleagues at DeepMind for useful discussions and feedback on this work.

Appendix A Model and optimization details

Our optimizer matches that of BigGAN – we use Adam with batch size 2048 and the same learning rates and other hyperparameters, using the $\mathcal{G}$ optimizer to update $\mathcal{E}$ simultaneously, with the same alternating optimization: two $\mathcal{D}$ updates followed by a single joint update of $\mathcal{G}$ and $\mathcal{E}$. (We do not use the orthogonal regularization used in BigGAN, finding it gave worse results in the unconditional setting, matching prior findings.) Spectral normalization is used in $\mathcal{G}$ and $\mathcal{D}$, but not in $\mathcal{E}$. Full cross-replica batch normalization is used in both $\mathcal{G}$ and $\mathcal{E}$ (including for the linear classifier training on $\mathcal{E}$ features used for evaluations). We also apply exponential moving averaging (EMA) with a decay of 0.9999 to the $\mathcal{G}$ and $\mathcal{E}$ weights in all evaluations. (We find this results in only a small improvement for $\mathcal{E}$ evaluations, but a substantial one for $\mathcal{G}$ evaluations.)
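Two of the mechanics above, the alternating update order and the weight EMA, can be sketched independently of any framework. The following is an illustration only: the function names are hypothetical, weights are plain lists of floats, and the actual gradient updates are omitted.

```python
def update_schedule(num_joint_steps, d_steps_per_joint=2):
    """Yield the update order: two D updates, then one joint G/E update."""
    for _ in range(num_joint_steps):
        for _ in range(d_steps_per_joint):
            yield "D"
        yield "GE"


def ema_update(averaged, current, decay=0.9999):
    """One EMA step: averaged <- decay * averaged + (1 - decay) * current."""
    return [decay * a + (1.0 - decay) * w for a, w in zip(averaged, current)]


print(list(update_schedule(2)))  # ['D', 'D', 'GE', 'D', 'D', 'GE']

# With decay 0.9999 the average moves slowly: after one step toward a weight of
# 1.0 starting from 0.0, the averaged value is only about 1e-4.
print(ema_update([0.0], [1.0]))
```

The averaged weights, not the raw training weights, would be the ones used at evaluation time under this scheme.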

At BigBiGAN training time, as well as linear classification evaluation training time, we preprocess inputs with ResNet-style data augmentation, though with crops of size 128 or 256 rather than 224. (Preprocessing code is from the TensorFlow ResNet TPU model: https://github.com/tensorflow/tpu/tree/master/models/official/resnet.)

For linear classification evaluations in the ablations reported in Table 1, we hold out 10K randomly selected images from the official ImageNet training set as a validation set and report accuracy on that validation set, which we call trainval. All results in Table 1 are run for 500K steps, with early stopping based on linear classifier accuracy on our trainval split. In all of these models the linear classifier is initialized to 0 and trained for 5K Adam steps with a (high) learning rate of 0.01 and EMA smoothing with decay 0.9999. We have found it helpful to monitor representation learning progress during BigBiGAN training by periodically rerunning this linear classification evaluation from scratch given the current $\mathcal{E}$ weights, resetting the classifier weights to 0 before each evaluation.

In Table 2 we extend the BigBiGAN training time to 1M steps, and report results on the official validation set of 50K images for comparison with prior work. The classifier in these results is trained for 100K Adam steps, sweeping over learning rates $\{10^{-4},3\cdot 10^{-4},10^{-3},3\cdot 10^{-3},10^{-2}\}$, again applying EMA with decay 0.9999 to the classifier weights. Hyperparameter selection and early stopping are again based on classification accuracy on trainval. As in prior work, FID is reported against statistics over the full ImageNet training set, preprocessed by resizing the minor axis to the $\mathcal{G}$ output resolution and taking the center crop along the major axis, except as noted in Table 3, where we also report FID against the validation set for comparison with prior results.
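The FID preprocessing geometry described above (resize the minor axis to the generator resolution, then center crop along the major axis) can be made concrete with a small helper that computes the resized shape and the crop box. The function name, the rounding of the resized major axis, and the flooring of the crop offset are assumptions made for illustration; the interpolation method is not specified in the text and is not modeled here.

```python
def resize_and_center_crop_box(height, width, resolution):
    """Return ((new_h, new_w), (top, left, bottom, right)) for a square crop.

    The shorter side is resized to `resolution`, the longer side is scaled by
    the same factor (rounded), and a centered square of side `resolution` is
    cropped along the longer side.
    """
    if height <= width:
        new_h = resolution
        new_w = max(resolution, round(width * resolution / height))
    else:
        new_h = max(resolution, round(height * resolution / width))
        new_w = resolution
    top = (new_h - resolution) // 2
    left = (new_w - resolution) // 2
    return (new_h, new_w), (top, left, top + resolution, left + resolution)


# A 375x500 landscape image at G resolution 128: resized to 128x171, then the
# central 128 columns are kept.
print(resize_and_center_crop_box(375, 500, 128))
# ((128, 171), (0, 21, 128, 149))
```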

All models were trained via TensorFlow and Sonnet with data parallelism on TPU pod slices using 32 to 512 cores, coordinated by TF-Replicator.

In Table 4 we present the results of fully supervised training with the model architectures used in our experiments in Section 3 for comparison purposes.

In Figure 3 we visualize the learned convolutional filters for the first convolutional layer of our BigBiGAN encoders $\mathcal{E}$ using the largest RevNet $\times 4$ $\mathcal{E}$ architecture. Note the difference between the filters in (a) and (b) (corresponding to rows RevNet $\times 4$ and RevNet $\times 4$ ($\uparrow\mathcal{E}$ LR) in Table 1). In (b) we use the higher $\mathcal{E}$ learning rate and see a corresponding qualitative improvement in the appearance of the learned filters, with less noise and more Gabor-like and color filters, as observed in BiGAN. This suggests that examining the convolutional filters of the input layer can serve as a diagnostic for undertrained models.

Appendix B Samples and reconstructions

To further explore the behavior of a BigBiGAN (or any other model capable of approximately reconstructing its input), we can “iterate” the reconstruction operation. In particular, let $R_{i}(\mathbf{x})$ be defined for non-negative integers $i$ and input images $\mathbf{x}$ as:

$$R_{0}(\mathbf{x})=\mathbf{x},\qquad R_{i}(\mathbf{x})=\mathcal{G}(\mathcal{E}(R_{i-1}(\mathbf{x})))\quad\text{for } i\geq 1.$$

In Figure 12 we show the results of up to 500 steps of this process for a few sample images. Qualitatively, the first several steps of this process often appear to retain some semantics of the input image $\mathbf{x}$. After dozens or hundreds of iterations, however, little content from the original input apparently remains intact.
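The iterated reconstruction can be sketched as a simple loop, taking $R_0(\mathbf{x})=\mathbf{x}$ and each later step as one pass through the encoder followed by the generator. In the sketch below, `encoder` and `generator` are arbitrary callables; the scalar toy functions used in the usage lines are stand-ins chosen for illustration and have nothing to do with the trained networks.

```python
def iterate_reconstruction(x, encoder, generator, num_steps):
    """Return [R_0(x), ..., R_num_steps(x)] with R_i(x) = G(E(R_{i-1}(x)))."""
    results = [x]
    for _ in range(num_steps):
        results.append(generator(encoder(results[-1])))
    return results


# Toy stand-ins (assumptions for illustration): a scalar "image", an encoder
# that halves it and a generator that adds one. The iterates drift away from
# the input toward the fixed point of G(E(.)), here 2.0.
toy_encoder = lambda x: 0.5 * x
toy_generator = lambda z: z + 1.0
print(iterate_reconstruction(4.0, toy_encoder, toy_generator, 3))
# [4.0, 3.0, 2.5, 2.25]
```

The toy example mirrors the qualitative observation above only loosely: repeated application moves the iterate away from the original input.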

Appendix C Nearest neighbors

In this Appendix we consider an alternative way of evaluating representations – by means of $k$ nearest neighbors classification, which does not involve learning any parameters during evaluation and is even simpler than learning a linear classifier as done in Section 3. For all results in this section, we use the outputs of the global average pooling layer (a flat 8192D feature) of our best performing model, RevNet $\times 4$, $\uparrow\mathcal{E}$ LR. We do not do any data augmentation for either the training or validation sets: we simply crop each image at the center of its larger axis and resize to $256\times 256$.

Figure 13 shows sample nearest neighbors in the ImageNet training set for query images in the validation set. Despite being fully unsupervised, the neighbors in many cases match the query image in terms of high-level semantic content such as the category of the object of interest, demonstrating BigBiGAN’s ability to capture high-level attributes of the data in its unsupervised representations. Where applicable, the object’s pose and position in the image appear to be important as well – for example, the nearest neighbors of the RV (row 2, column 2) are all RVs facing roughly the same direction. In other cases, the nearest neighbors appear to be selected primarily based on the background or color scheme.

While our quantitative $k$ nearest neighbors classification results are far from the state of the art for ImageNet classification and significantly below the linear classifier-based results reported in Table 2, note that in this setup, no supervised learning of model parameters from labels occurs at any point: labels are predicted purely based on distance in a feature space learned from BigBiGAN training on image pixels alone. We believe this makes nearest neighbors classification an interesting additional benchmark for future approaches to unsupervised representation learning.
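A minimal sketch of this kind of parameter-free evaluation is given below. The distance metric, the value of $k$, and the voting rule are not specified in this Appendix, so squared Euclidean distance and majority voting are assumptions made here, and the function name and toy 2D features are hypothetical stand-ins for the 8192D encoder features.

```python
from collections import Counter


def knn_classify(query, train_features, train_labels, k=1):
    """Predict a label by majority vote among the k nearest training features.

    Uses squared Euclidean distance (an assumption); no parameters are learned.
    """
    distances = sorted(
        (sum((q - t) ** 2 for q, t in zip(query, feature)), index)
        for index, feature in enumerate(train_features)
    )
    votes = Counter(train_labels[index] for _, index in distances[:k])
    return votes.most_common(1)[0][0]


# Toy 2D "features" standing in for encoder outputs.
train_features = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [6.0, 5.0]]
train_labels = ["cat", "cat", "dog", "dog"]
print(knn_classify([0.2, 0.1], train_features, train_labels, k=3))  # cat
print(knn_classify([5.1, 5.0], train_features, train_labels, k=1))  # dog
```

For a dataset the size of ImageNet, the exhaustive distance computation in this sketch would normally be replaced by a batched or approximate search.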

Appendix D Learning curves

In this Appendix we present learning curves showing how the image generation and representation learning metrics that we measured evolve throughout training, as a more detailed view of the results in Section 3, Table 1. We include plots for the following results:

Latent distribution $P_{\mathbf{z}}$ and stochastic $\mathcal{E}$ (Figure 15)

High resolution $\mathcal{E}$ with varying resolution $\mathcal{G}$ (Figure 18)

Decoupled $\mathcal{E}$/$\mathcal{G}$ learning rates (Figure 20)