StableRep: Synthetic Images from Text-to-Image Models Make Strong Visual Representation Learners

Yonglong Tian, Lijie Fan, Phillip Isola, Huiwen Chang, Dilip Krishnan

Introduction

Data has assumed a paramount role as the key component for the success of modern machine learning systems. Such systems, especially foundation models in various domains, rely heavily on vast and diverse datasets to acquire knowledge, make accurate predictions, and generate content. The quality, quantity, and diversity of the data significantly impact the performance and effectiveness of these models, as they learn from the collective information encapsulated within the data. In this data-centric era, a central question is: how can we collect such large amounts of varied data to train AI models?

As an example, suppose we are trying to solve a new computer vision problem and need to collect data (images) for it. An ideal situation would be to place a camera anywhere in the world and capture whatever we need. In reality, however, collecting data has historically not been easy. In the 1990s, researchers needed to take photos themselves to create datasets for objects and faces. To collect data in the 2000s, people crawled the Internet. Noisy, uncurated data collected in such a manner can exhibit domain gaps with the real-world problem and reflect imbalances due to societal bias. Removing or reducing such imperfections in high-volume data by human labeling is costly and can be prohibitive.

However, what if data collection could be simplified to the utterance of a natural language command, specifying what you want? What if, for hardly any cost, you could take a photo every few milliseconds? This sounds fanciful, but modern text-to-image generative models are approaching this vision. It has long been a dream that someday we could use these as our data sources, rather than taking photos. In this paper, we study whether this is now a practical option in the context of large-scale visual representation learning.

To achieve this, we choose to work with Stable Diffusion, one of the leading open-source text-to-image models. We synthesize images by prompting Stable Diffusion with text from large-scale image-text datasets, such as CC12M and RedCaps. Surprisingly, our investigation reveals that when the classifier-free guidance scale is properly configured for Stable Diffusion, it is able to synthesize images on which training self-supervised methods can perform on par with or better than training on real images of the same sample size. Inspired by the idea of contrastive self-supervised learning, which promotes intra-image invariance, we develop a representation learning approach that promotes intra-caption invariance. We achieve this by treating the multiple images generated from the same text prompt as positives for each other and using them in a multi-positive contrastive loss (see Figure 1). Despite training with solely synthetic images, this approach, called StableRep, even outperforms state-of-the-art methods such as CLIP trained on the same text set but with the corresponding real images, on various representation evaluation benchmarks.

Intuitively, one reason that synthetic data can be better than real data is that we are able to achieve a greater degree of control in the sampling, such as via the guidance scale in Stable Diffusion, or via text prompts and latent noise variables. Furthermore, generative models have the potential to generalize beyond their training data and therefore provide a richer (synthetic) training set than the corresponding real data alone. Our key contributions are:

We discover that training modern self-supervised methods on synthetic images from Stable Diffusion can be surprisingly effective. The learned representations are often better than representations learned from real images of the same sample size.

We develop StableRep, a novel representation learning approach that captures invariance between images generated from the same text prompt, and propose a multi-positive contrastive loss.

With StableRep, we are able to achieve 76.7% linear accuracy on ImageNet with ViT-B/16, using solely synthetic images.

When coupled with language supervision, our StableRep trained with 20M synthetic images (10M captions) achieves better accuracy than CLIP trained with 50M real images (50M captions).

Standard Self-supervised Learning on Synthetic Images

A typical visual representation learning algorithm takes an image dataset $\{{\mathbf{x}}_{i}\}_{i=1}^{N}$ as input, and yields an image encoder $F:{\mathbf{x}}\rightarrow{\mathbf{e}}$, which embeds an image ${\mathbf{x}}$ into a vector ${\mathbf{e}}$. In this paper, we instead try to produce a good $F$ by using a generative model $G$ rather than a real image dataset. Specifically, we focus on text-to-image generative models $G:({\mathbf{t}},{\mathbf{z}})\rightarrow{\mathbf{x}}$, which map a pair of text ${\mathbf{t}}$ and latent noise ${\mathbf{z}}$ to an image ${\mathbf{x}}$. While there are several top-performing text-to-image models, we conduct our exploration with Stable Diffusion since it is publicly available and widely used. The version we used is v1-5.

Stable Diffusion is a denoising diffusion probabilistic model that runs the diffusion process in the latent space of an autoencoder. It improves the sample quality and text-image alignment via classifier-free guidance, which linearly combines the conditional score estimate ${\bm{\epsilon}}({\mathbf{t}},{\mathbf{z}}_{\lambda})$ and the unconditional estimate ${\bm{\epsilon}}({\mathbf{z}}_{\lambda})$ with the guidance scale $w$ at each step $\lambda$.
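In the guidance-scale convention commonly used with Stable Diffusion, where $w=1$ corresponds to purely conditional sampling, this linear combination can be written as follows. The symbol $\tilde{\bm{\epsilon}}$ for the guided estimate is notation assumed here for illustration:

$$\tilde{\bm{\epsilon}}({\mathbf{t}},{\mathbf{z}}_{\lambda}) = w\,{\bm{\epsilon}}({\mathbf{t}},{\mathbf{z}}_{\lambda}) + (1-w)\,{\bm{\epsilon}}({\mathbf{z}}_{\lambda})$$

Under this form, $w>1$ extrapolates away from the unconditional estimate, which typically trades sample diversity for closer adherence to the text.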

The Stable Diffusion model $G_{sd}$ relies on text sources to generate images. Instead of collecting a corpus of captions from scratch, we use the text part of existing uncurated image-text pair datasets, such as CC3M and CC12M. Formally, given an image caption dataset $\{{\mathbf{t}}_{i}\}_{i=1}^{N}$, we generate one image per caption, forming a synthetic image dataset of the same size.

Self-supervised learning on synthetic images

Recent representative self-supervised learning algorithms are mostly from two families: (1) contrastive learning, which encourages invariance between embeddings of different augmentations of the same image; and (2) masked image modeling, where the model uses unmasked patches to predict masked patches (although there are other methods that fall into neither category, such as BYOL and DINO). For our study, we choose SimCLR from the former family and MAE from the latter due to their simplicity and strong performance. We focus on the Vision Transformer architecture, and use captions from CC3M except when noted.

SimCLR. We directly train SimCLR with ViT-B/16 on the synthetic image dataset, and measure the representation quality by linear probing evaluation on ImageNet. (We verify our SimCLR implementation by pre-training on ImageNet: it achieves 74.3% linear probing accuracy, whereas SimCLR as reported in prior work with the same architecture and epochs achieved 73.9%.) One factor to consider is the classifier-free guidance scale $w$, as it trades off between diversity and quality of the synthesized images and thus can affect the learned representations. To study this, for each $w$ in the set $\{2,3,4,6,8,10,12\}$, we generate a copy of size $N$ (one image per caption) to train SimCLR. Figure 2 (left) visualizes the influence of $w$. The optimal $w$ is around 8 (both 8 and 10 give an accuracy of 62.0%). This differs from the FID metric, for which $w=2$ is optimal.

The captions $\{{\mathbf{t}}_{i}\}_{i=1}^{N}$ used to generate synthetic images are also paired with $N$ real images. We train a SimCLR model with these real images. This model achieves 60.4% accuracy, a 13% drop in linear accuracy compared to pre-training on ImageNet. Such a gap has been generally observed for uncurated pre-training data. However, both interestingly and surprisingly, synthetic images with $w=8$ yield 1.6% higher accuracy than real images (62.0% vs. 60.4%).

MAE. Following the default hyperparameters in MAE, we train a ViT-B/16 model for each guidance scale $w$. Figure 2 (right) reports the linear probing results. The accuracy of synthetic images increases quickly with $w$ after 2, and gradually drops when $w$ is large, e.g., $w\geq 10$. The optimal guidance scale for MAE is 6, which is different from SimCLR, where the accuracy peaks at 8 or 10. This suggests that different methods may require different $w$. With $w=6$, synthetic images yield 4.2% better accuracy than real images.

While the linear probing accuracy of MAE is lower than that of contrastive methods, its effectiveness often comes with fine-tuning. When fine-tuning pre-trained MAE models on ImageNet, we found synthetic images are still able to outperform real images. For instance, pre-training on synthetic images with $w=6$ gives 0.3% higher accuracy than pre-training on real images (82.9% vs. 82.6%).

Other SSL methods. To test if synthetic images can be generically applied to different self-supervised learning methods, we try three more representative approaches: BYOL, MoCo-v3, and DINO. We do not tune $w$ for each method, and instead apply the optimal $w$ ($=8$) discovered for SimCLR. The results on CC3M and CC12M are visualized in Figure 3. Synthetic images significantly improve over real images for MAE, DINO, and SimCLR, perform on par with real images for BYOL, and are slightly worse for MoCo-v3 (which could be attributed to not tuning the guidance scale $w$).

Multi-Positive Contrastive Learning with Synthetic Images

Text-to-image generative models offer a new way to compose positive samples for contrastive learning. Given an image caption, we can create multiple diverse samples by starting the reverse diffusion process with different latent noise ${\mathbf{z}}$. Since these images are produced using the same prompt, they possess similar visual semantics, making them suitable for use as multiple positive samples for each other in contrastive learning. This property is unique to generative models, since collecting multiple images for each caption at large scale is infeasible. Figure 4 compares our StableRep pipeline with that of SimCLR and CLIP.

Multi-positive contrastive loss. We describe multi-positive contrastive learning as a matching problem. Consider an encoded anchor sample $\bm{a}$, and a set of encoded candidates $\{\bm{b}_{1},\bm{b}_{2},\ldots,\bm{b}_{K}\}$. We compute a contrastive categorical distribution ${\mathbf{q}}$ that describes how likely $\bm{a}$ is to match each $\bm{b}$.
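One formulation consistent with these definitions is sketched below. The temperature $\tau$, the assumption that $\bm{a}$ and every $\bm{b}_{i}$ are $\ell_{2}$-normalized, and the indicator $\mathbb{1}_{\text{match}}$ are notation introduced here for illustration rather than taken from the surrounding text:

$$q_{i} = \frac{\exp(\bm{a}\cdot\bm{b}_{i}/\tau)}{\sum_{j=1}^{K}\exp(\bm{a}\cdot\bm{b}_{j}/\tau)}$$

Assuming at least one candidate matches the anchor, a ground-truth distribution ${\mathbf{p}}$ can place uniform mass on the matches, and the loss is then the cross-entropy between the two distributions:

$$p_{i} = \frac{\mathbb{1}_{\text{match}}(\bm{a},\bm{b}_{i})}{\sum_{j=1}^{K}\mathbb{1}_{\text{match}}(\bm{a},\bm{b}_{j})}, \qquad \mathcal{L} = H({\mathbf{p}},{\mathbf{q}}) = -\sum_{i=1}^{K} p_{i}\log q_{i}$$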

This is a generalized form of the widely used single-positive contrastive loss, where the ground-truth matching distribution ${\mathbf{p}}$ reduces to a one-hot vector. This loss is closely related to that of prior work on supervised contrastive learning, but a key distinction here is that we have no image class labels, and only assume images generated from the same caption are matched.

The PyTorch-like pseudocode of the batched multi-positive contrastive learning algorithm is described in Algo. 1. Each batch consists of $n*m$ images, meaning that we sample $m$ images for each of the $n$ captions. Here we still apply data augmentation, even though images from the same caption are different. This is to reduce overfitting since we perform many epochs of training over pre-generated synthetic images. However, if in the future the image generator is capable of producing images fast enough, then we can draw batches online and data augmentation may not be necessary. The multi-positive contrastive learning algorithm is also generic such that SimCLR can also be described by it – we begin by randomly selecting a set of $n$ images and subsequently apply $m$ (set as 2) crops to each of the chosen images. However, in our StableRep we only utilize a single crop from each image.
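The batched procedure described above can be illustrated with a minimal pure-Python sketch. It is not the authors' Algo. 1: the function names, the temperature value of 0.1, and the choice to exclude each anchor from its own candidate set are assumptions made for illustration, and plain lists stand in for encoder outputs.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit Euclidean norm."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def multi_positive_loss(embeddings, caption_ids, temperature=0.1):
    """Mean cross-entropy H(p, q) over all anchors in a batch.

    embeddings:  list of n*m encoded images (lists of floats).
    caption_ids: caption index of each image; equal ids are positives.
    temperature: assumed value, not taken from the text.
    """
    z = [l2_normalize(e) for e in embeddings]
    total = 0.0
    for i, anchor in enumerate(z):
        # Candidates are all other samples (the anchor itself is excluded).
        candidates = [j for j in range(len(z)) if j != i]
        logits = [
            sum(a * b for a, b in zip(anchor, z[j])) / temperature
            for j in candidates
        ]
        # Numerically stable log-softmax gives log q.
        top = max(logits)
        log_norm = top + math.log(sum(math.exp(l - top) for l in logits))
        log_q = [l - log_norm for l in logits]
        # Ground truth p: uniform over candidates sharing the anchor's caption.
        match = [1.0 if caption_ids[j] == caption_ids[i] else 0.0 for j in candidates]
        num_positives = sum(match)
        if num_positives == 0:
            raise ValueError("every anchor needs at least one positive")
        total -= sum(mt / num_positives * lq for mt, lq in zip(match, log_q))
    return total / len(z)
```

With two images per caption, each anchor has exactly one positive and the function reduces to a single-positive contrastive loss, which mirrors the remark above that SimCLR fits the same template.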

Experiments

We perform StableRep pre-training on images synthesized from texts in the CC3M (2.7 million samples), CC12M (10 million), or RedCaps (11.6 million) datasets. We then evaluate the frozen representations by (1) linear probing on ImageNet-1k and other smaller-scale image classification benchmarks, and (2) few-shot image recognition, which measures the generalization ability of the representations.

Backbone. We use ViT models as the backbone for our approach StableRep. On top of the CLS token, we apply a 3-layer MLP projection head with hidden layers of 4096 dimensions and an output of 256 dimensions. Batch Normalization is used in this projection head.

Training. In most of our experiments, we adopt a batch size of 8192 images (i.e., $m*n=8192$). This way the computation of each batch is equivalent to SimCLR with a batch size of 4096, because each image in SimCLR has two crops. We use the AdamW optimizer with a learning rate of 0.0032 and weight decay of 0.1, and set $\beta_{1},\beta_{2}$ to $0.9,0.98$ respectively. We pre-generate 10 images for each text prompt. In each iteration, we randomly sample 6 out of the 10 for each sampled caption to form the training batch, i.e., $m=6$ in Algo. 1. Recall that for SimCLR $m=2$. As a result, one epoch of StableRep training is computationally equivalent to 3 epochs of SimCLR. To provide easy comparison, we report SimCLR-equivalent epochs for StableRep in all of our analysis.
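The epoch conversion above is simple arithmetic, sketched below. The function names are hypothetical, and the sketch assumes that compute is proportional to the number of images processed per caption (or per image, for SimCLR).

```python
def simclr_equivalent_epochs(caption_epochs, m, simclr_crops=2):
    """Convert passes over the caption set into SimCLR-equivalent epochs.

    Each caption contributes m images per pass, whereas SimCLR processes
    simclr_crops crops per image per epoch.
    """
    return caption_epochs * m / simclr_crops


def caption_epochs_for(simclr_epochs, m, simclr_crops=2):
    """Inverse conversion: caption passes needed for a SimCLR-equivalent budget."""
    return simclr_epochs * simclr_crops / m
```

For example, `simclr_equivalent_epochs(1, 6)` corresponds to the factor of 3 stated above, and `caption_epochs_for(35, 6)` gives the number of passes over the captions implied by a 35-epoch SimCLR-equivalent schedule.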

In this section, we perform StableRep on images synthesized from the captions of either CC12M or RedCaps. For StableRep, we first removed duplicate captions from each dataset, resulting in a reduced number of captions: from 10.0M to 8.3M for CC12M and from 11.7M to 10.5M for RedCaps. We compared StableRep to SimCLR, which was trained on either synthetic or original real images. We also included CLIP with a synthetic and a real version. (We verified our CLIP implementation by comparing to prior work on CC12M: with ViT-B/16, our CLIP achieved 40.2% zero-shot and 70.3% linear accuracy on ImageNet, vs. 36.0% and 69.0% in that work.) For SimCLR and CLIP, we did not perform de-duplication in either the real or the synthetic setting. We train for 35 epochs for all methods using ViT-B/16 (for StableRep, this refers to 35 SimCLR-equivalent epochs). We observed that CLIP started to overfit around 30 epochs, but StableRep did not overfit with this schedule (see the longer-training ablation table for results with longer training). For StableRep, we additionally apply random downsample augmentation (see Appendix A.1 for details and how such downsampling affects different methods).

ImageNet. Table 1 presents the results of linear probing on ImageNet. For StableRep, we prepend a BatchNorm layer without affine transformation to the linear classifier (see Appendix A.5 for more details). We observed that training SimCLR on synthetic images yields an improvement of 2.2% top-1 accuracy on CC12M and 1.0% on RedCaps when compared to real images. However, the accuracy of CLIP drops by 2.6% on CC12M and 2.7% on RedCaps when trained on synthetic images (see Section 5 for more discussion). On the other hand, our method StableRep outperforms CLIP trained on real images, with improvements of 3.2% and 2.6% for CC12M and RedCaps, respectively.

Linear classification on more datasets. We followed the approach of SimCLR and BYOL to assess the generality of our learned representations across different image domains. Specifically, we performed linear classification on 11 image classification datasets from prior work. The results are reported in Table 2, and the relative performance is consistent with that on ImageNet. Notably, our proposed method, StableRep, achieves the highest accuracy on all of the 11 datasets.

Few-shot image classification. Prior work has shown that representation learning is the key for few-shot image classification. A simple classifier on top of frozen representations is sufficient to achieve strong results. We perform 5-way, 5-shot classification following the setup of prior work. As shown in Table 3, StableRep stands out on 9 out of the 10 datasets.

Semantic segmentation. We fine-tune pre-trained StableRep models on ADE20k using UperNet. For this evaluation, StableRep is pre-trained for 35 or 105 epochs. Table 4 shows that StableRep trained on synthetic data is able to outperform MAE trained on the real ImageNet images, even though StableRep has no masked image modeling objective, which benefits dense prediction tasks.

Ablation analysis

For simplicity, ablation studies in this section do not use the random downsample augmentation in pre-training or prepend an extra BatchNorm layer to the linear classifier.

The number of synthetic images per caption, $m$, is one of the key design choices for our approach. We therefore study the following two factors relevant to $m$ on CC3M captions (2.7 million after de-duplication).

Fixed generation budget. We first study the question: given a fixed total number of synthetic images generated ($T$), should we generate more images per caption ($l$) and therefore use fewer captions ($T/l$), or the reverse? We assume an image budget of $T=2.7$ million. During training, we use the same total batch size (8192) for all $l$, and set the sampling parameter $m$ to $l$. The fixed-budget ablation table presents the results. There is a clear benefit of generating more than 1 image per caption, e.g., $l=8$ improves over $l=1$ by 4.8%. But this benefit saturates around $l=10$. We thus generate 10 images per caption for our final experiments.

How to form the batch. Suppose we have generated 10 images for each of the 2.7 million captions. Now given a fixed batch size, i.e., $n*m=C$ (recall that $n$ is the number of captions and $m$ is the number of images per caption inside each batch), a larger $m$ encourages stronger invariance of images from the same caption, while a larger $n$ incorporates more negatives and thus encourages better separability of representations. To study this trade-off, we vary the sampling parameter $m$ from 2 to 10 while keeping $n=C/m$. As shown in the batch-composition ablation table, the linear probing accuracies are similar between $m=4$ and $m=10$ (peaking at $m=8$ with 69.8% accuracy), showing the robustness of StableRep w.r.t. $m$. We choose $m=6$ as our default setup. We "abuse" $m=1$ to represent SimCLR.
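The batch construction can be sketched as follows, assuming the pre-generated images are stored in a dictionary keyed by caption id. The function name, the dictionary layout, and the use of floor division for $n=C/m$ are illustrative assumptions.

```python
import random


def form_batch(images_per_caption, batch_size, m, rng):
    """Sample n = batch_size // m captions, then m distinct images for each.

    images_per_caption: dict mapping a caption id to its list of
        pre-generated images (e.g. 10 per caption).
    rng: a random.Random instance, passed in for reproducibility.
    Returns a list of (caption_id, image) pairs of length n * m.
    """
    n = batch_size // m
    caption_ids = rng.sample(sorted(images_per_caption), n)
    batch = []
    for caption_id in caption_ids:
        # Draw m of the pre-generated images without replacement.
        for image in rng.sample(images_per_caption[caption_id], m):
            batch.append((caption_id, image))
    return batch
```

Raising `m` at a fixed `batch_size` lowers `n`, which is the trade-off between more positives per caption and more negatives per batch discussed above.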

After the above study, we continue to ablate the following factors on CC12M and RedCaps.

Guidance score for training. We consider three configurations for the classifier-free guidance scale $w$: (1) large scale, $w\in\{8,10\}$; (2) small scale, $w\in\{2,3\}$; (3) mixed scale, $w\in\{2,3,4,5,6,8,10,12\}$. As shown in the guidance-scale ablation table, small scale gives the best linear transfer accuracy on ImageNet and fine-grained classification datasets. This is possibly because smaller $w$ leads to larger intra-caption variation between generated images, which forces StableRep to learn stronger invariance. This is different from SimCLR, which requires larger $w$ (recall Section 2.1), as SimCLR only models intra-image invariance and thus higher image quality (larger $w$) helps more.

Model scale. We switch the backbone architecture to ViT-L/16. The model-scale ablation table presents the results. The accuracy improves by 1.9% on ImageNet linear probing and 0.7% on the average over fine-grained classification datasets. We found that pre-training with ViT-L was unstable. The loss kept exploding to NaN, and we resumed from the checkpoint before NaN. But this led to a higher convergent loss than ViT-B (ViT-L loss is lower before exploding). This may partly be due to the usage of BatchNorm.

Longer training. To investigate the scaling behavior of StableRep w.r.t. training compute, we further increase the pre-training computation budget to 2x and 3x epochs, and report the linear probing accuracy on ImageNet in the longer-training ablation table. The results indicate that StableRep scales well with longer training, e.g., improving by 2.2 for 2x and 2.9 for 3x on CC12M pre-training, and by 2.6 for 2x and 3.0 for 3x on RedCaps pre-training.

Adding Language Supervision

How would training CLIP using synthetic images work? We study this question by generating a copy (one image per caption) for each guidance scale $w$ in $\{1,2,3,4,6,8,10\}$ and training CLIP using each copy. Figure 8 plots the zero-shot ImageNet accuracy. Contrary to SSL methods, CLIP favors lower $w$. With the optimal $w=2$, CLIP achieves 34.9% zero-shot accuracy. This is 5.4% lower than training on real images (40.2%). Such a gap may be explained by misalignment between the generated images and the input text, shown in Figure 9. This is especially true for fine-grained classes.

We can add language supervision to StableRep by adding $0.5*(\mathcal{L}_{i2t}+\mathcal{L}_{t2i})$ to the StableRep loss, where $\mathcal{L}_{i2t}$ and $\mathcal{L}_{t2i}$ are the image-to-text and text-to-image contrastive losses described by Eq. 4. Adding supervision improves StableRep from 72.8% to 74.4% on CC12M and from 73.7% to 75.4% on RedCaps for ImageNet linear probing. We term this variant StableRep+. We then further scale StableRep+ to a randomly selected 50M subset of LAION-400M. For this experiment, we only generate 2 images per caption with $w=2$, and train CLIP with real images and StableRep+ with synthetic images using different scales of random subsets of the 50M data. We plot the results in Figure 8. StableRep+ consistently achieves better accuracy than CLIP. Notably, StableRep+ with 10M captions outperforms CLIP with 50M captions, yielding a 5x caption efficiency (2.5x image efficiency).
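Written out, the combined objective and the efficiency figures above amount to the following. The labels $\mathcal{L}_{\text{StableRep}}$ and $\mathcal{L}_{\text{StableRep+}}$ are names introduced here for illustration:

$$\mathcal{L}_{\text{StableRep+}} = \mathcal{L}_{\text{StableRep}} + 0.5\,(\mathcal{L}_{i2t}+\mathcal{L}_{t2i})$$

$$\text{caption efficiency} = \frac{50\text{M}}{10\text{M}} = 5\times, \qquad \text{image efficiency} = \frac{50\text{M}}{2\times 10\text{M}} = 2.5\times$$

The image efficiency is lower than the caption efficiency because each of the 10M captions is paired with 2 synthetic images, i.e., 20M images in total.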

We further study the fairness and compositional understanding of the learned models on FairFace and ARO benchmarks, respectively. The results are presented in Table 7.

Fairness. We perform zero-shot classification on FairFace. We jointly classify both races and genders, e.g., treating Black male, Black female, Indian female, and so on as different classes at the same time. For CC12M models, CLIP with real data only achieved 0.3% accuracy on the Southeast Asian male class, CLIP with synthetic data improves this class to 3.1%, and our StableRep+ further raises it to 27.2%. For RedCaps models, real CLIP only has 0.4% accuracy for East Asian male, while StableRep+ improves this class to 22.8%. In summary, training with synthetic data is able to improve the worst class accuracy. However, an obvious geographic bias still exists in all models.

Compositionality. The results of the compositionality evaluation are less clear. While training with synthetic data on CC12M slightly improves relational understanding, an accuracy drop is observed in models trained with synthetic data on RedCaps. A more in-depth investigation may be needed.

Related Work

Text-to-Image generative models. Text-to-image models trained on large image and text pairs have recently enabled the creation of rich and diverse images encompassing many genres and themes. The resulting creations have become a sensation, with Stable Diffusion having millions of downloads and many tools for image manipulation built on top. Most of these models are built on denoising diffusion models, with some notable exceptions. In this paper, we leverage this latest generation of diffusion-based pre-trained generative models for the task of representation learning.

Visual representation learning. Early approaches for visual representation learning often relied on pretext tasks such as inpainting to train image encoders. More recent advancements have shown that masked image modeling, a form of self-supervised training, can be highly effective. In particular, the Masked Autoencoder (MAE) has demonstrated significant improvements in downstream fine-tuning performance. Another line of research focuses on contrastive learning, which aims to learn visual representations by maximizing agreement between two augmented views of the same image while distinguishing it from negative examples. Meanwhile, CLIP and its subsequent works leverage contrastive learning to train image representations using language supervision, leading to impressive transferability across various tasks.

Learning from synthetic data. It has been common to train machine learning models with synthetic data in different domains. In computer vision, synthetic images have been used as a source for training models for tasks such as optical flow, autonomous driving, semantic segmentation, object detection, human pose estimation, and classification. The closest works are those that conduct representation learning on synthetic images. In one line of work, a model is trained to perform multi-task learning on synthetic images. Other works manipulate the latent variables of deep generative models, or image generation procedures, to form meaningful synthetic images for their representation learning methods. Our method falls into this category, but we use text-to-image diffusion models, which have also been explored in prior work. The key difference is that those works conducted supervised learning, while we use synthetic data for pre-training representations.

Conclusion, Limitations and Broader Impact

We have shown that solely synthetic data generated from state of the art text-to-image models can be used to train powerful visual representations. By harnessing the stochastic nature of Stable Diffusion in combination with a multi-positive contrastive loss, our approach yields a representation that surpasses the performance achieved through training on real data alone. Through a series of experiments, we establish that pre-training with synthetic datasets of varying scales yields impressive results across different downstream tasks, including linear probing and few-shot classification. Interestingly, we discover that even vanilla self-supervised methods trained on synthetic data can either outperform or achieve comparable results to those trained on real data.

Despite demonstrating the potential of training with synthetic data, this paper acknowledges its limitations. Firstly, we have yet to comprehend the reasons behind the effectiveness of training self-supervised methods on synthetic images compared to an equal amount of real images. It is possible that this observation is confined to our particular evaluation methodology. Furthermore, the current image generation process remains slow, with approximately 0.8s per image on an A100 GPU or 2.2s per image on a V100 GPU with xFormers enabled. Consequently, we are not able to train StableRep models with non-repetitive images synthesized online. Additionally, we have not addressed the issue of semantic mismatch between the input prompts and the generated images, which may impact the quality and usefulness of the synthetic data. Moreover, synthetic data has the potential to exacerbate biases due to mode collapse and a predisposition to output "prototypical" images. Lastly, image attribution becomes a challenge when working with synthetic data.

Broader impacts. This paper focuses on the fundamentals of visual representation learning, and we believe it will be beneficial to the practice of this field. Our method presents an immediate application by reducing the reliance on collecting a vast amount of real images for learning representations. This approach brings potential benefits in terms of cost-effectiveness and minimizing biases introduced through human collection and curation processes. However, it is important to acknowledge that our method relies on text-to-image generative models trained on large-scale, uncurated web data. Such data may conceal social biases and errors that would have been exposed through human curation. Additionally, we must recognize that the text prompts we employed are not completely bias-free; the selection of prompts influences the synthesized images. Thus, the choice of prompts assumes a role similar to the selection of real images for self-supervised visual representation learning.

Acknowledgements

We would like to thank the anonymous reviewers and Shumeet Baluja for reviewing our manuscript and providing many helpful comments and suggestions. We also appreciate the helpful discussions and the general support from the VisCam teammates in Google Research.

Appendix A Implementation Details

Synthetic images have a constant high resolution (e.g., 512 $\times$ 512 for Stable Diffusion). We find this leads to a domain gap when transferring to situations involving low-resolution images, such as CIFAR-10 or CIFAR-100. To address this issue, we introduce Random Downsample augmentation, which randomly resizes images to a resolution of 64 or 128 (equally probable) and then resizes them back to 224. During pre-training, we apply this augmentation with a probability of 0.05 and prepend it to other augmentations.
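A minimal sketch of this augmentation is given below. To stay library-agnostic, the actual resizing is delegated to a caller-supplied `resize(image, (width, height))` function, which is a hypothetical interface; the function name, the square intermediate size, and the argument names are likewise assumptions.

```python
import random


def random_downsample(image, resize, rng, p=0.05, low_res=(64, 128), out_size=224):
    """With probability p, shrink the image and resize it back up.

    image:   any image object accepted by `resize`.
    resize:  hypothetical callable taking (image, (width, height)).
    rng:     a random.Random instance.
    low_res: candidate low resolutions, chosen with equal probability.
    """
    if rng.random() >= p:
        # Most of the time the image passes through unchanged.
        return image
    side = rng.choice(low_res)
    small = resize(image, (side, side))
    # Upsampling back discards the high-frequency detail lost above.
    return resize(small, (out_size, out_size))
```

With Pillow, for instance, `resize` could be `lambda img, size: img.resize(size)`. The function would be placed first in the augmentation pipeline, matching the statement above that it is prepended to the other augmentations.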

In Table 8, we ablate the effects of applying this random downsample augmentation to different pre-training methods. This augmentation brings significant improvements on CIFAR-10 and CIFAR-100 datasets, while maintaining the performance on other datasets. On average this augmentation is more beneficial for pre-training with synthetic images than real ones.

A.2 Standard self-supervised learning

We follow the default settings for standard self-supervised learning algorithms, and present the training details in Table 9 and Table 10. We use the linear $lr$ scaling rule: $lr=base\_lr\times bsz/256$. For BYOL, we did not follow the hyperparameters ($blr=1.0e{-4},wd=0.03$) of prior work, as we found our setting here yielded better accuracy. For DINO, we did not use the multi-crop strategy and only pre-trained the model with two 224 $\times$ 224 crops.
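The linear scaling rule can be expressed as a one-line helper. The function and argument names are hypothetical; the `base_batch` argument is exposed so that a reference batch size other than 256 can be used.

```python
def scaled_lr(base_lr, batch_size, base_batch=256):
    """Linear scaling rule: lr = base_lr * batch_size / base_batch."""
    return base_lr * batch_size / base_batch
```

For instance, `scaled_lr(1e-4, 4096)` scales an illustrative base rate of 1e-4 by a factor of 16.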

A.3 StableRep pre-training

The hyperparameters for StableRep are presented in Table 11. They are otherwise the same as those in SimCLR; the difference is that the $base\_lr$ in StableRep is for 512 images while in SimCLR it is for 256 images, because each image in StableRep only has one single crop. We ended up using a batch size of 8256 images, since we trained our model with 32 GPUs and 8192 is not divisible by 32 $\times$ 6. The computation for StableRep has been converted to SimCLR-equivalent epochs.
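One way to arrive at 8256 is to round the target batch size up to the nearest multiple of (number of GPUs) $\times$ $m$, so that every GPU holds a whole number of captions with $m$ images each. The sketch below illustrates this arithmetic; the rounding-up rule and the function name are assumptions, not a description of the authors' procedure.

```python
def smallest_valid_batch(target, num_gpus, m):
    """Smallest batch size >= target that is divisible by num_gpus * m."""
    unit = num_gpus * m
    # Integer ceiling division avoids floating-point rounding.
    return -(-target // unit) * unit
```

With 32 GPUs and `m = 6`, the unit is 192 images, and 8192 rounded up to a multiple of 192 is 8256 (43 captions per GPU).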

A.4 CLIP training

We follow the hyperparameter setting used in prior work, since it is better than that of the original CLIP paper. Table 12 summarizes the training details, and Table 13 presents the architecture of the CLIP encoders. With this training setup, we obtain $40.2\%$ ImageNet zero-shot accuracy when training CLIP on the CC12M dataset. As a comparison, that prior work reports $36.0\%$ using the same architecture.

A.5 ImageNet linear probing

We follow prior work to train the linear classifier. It has been generally observed that regularization such as weight decay hurts the performance. Following prior work, we set weight decay to 0, and only use RandomResizedCrop and RandomHorizontalFlip as data augmentation. We sweep the $base\_lr$ over $\{0.1,0.2,0.5,1,2,5,10,20,50\}\times 10^{-2}$.

For StableRep trained with 35 epochs, we find that adding an extra BatchNorm layer without affine transformation improves and stabilizes the linear probing results. However, this additional BatchNorm does not help when StableRep is trained with a longer schedule, e.g., 105 epochs. We conjecture that BatchNorm is helpful when StableRep has not converged, and present the comparison in Table 15.

A.6 Fine-grained linear classification

The details about the fine-grained classification datasets are presented in Table 16.

A.7 Few-shot image classification

Following the settings of prior work, we evaluate the 5-way 5-shot performance on 10 different datasets. We do not use data augmentation; images are resized to 224 pixels along the shorter side using bicubic resampling, followed by a center crop of 224 $\times$ 224. We report the mean accuracy of 600 randomly sampled tasks (also known as episodes). For each task, images are randomly sampled from the combination of training, validation and testing sets. We sample 15 query images for each class in every task for evaluation purposes.
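The episode protocol can be sketched as below, assuming the pooled images are grouped by class in a dictionary. The function names and data layout are hypothetical, and the confidence interval uses a normal approximation (factor 1.96), which is an assumption rather than a detail given above.

```python
import math
import random


def sample_episode(images_by_class, rng, n_way=5, n_shot=5, n_query=15):
    """Draw one N-way K-shot task with disjoint support and query sets."""
    classes = rng.sample(sorted(images_by_class), n_way)
    support, query = [], []
    for label, cls in enumerate(classes):
        # Sample without replacement so support and query never overlap.
        picked = rng.sample(images_by_class[cls], n_shot + n_query)
        support += [(img, label) for img in picked[:n_shot]]
        query += [(img, label) for img in picked[n_shot:]]
    return support, query


def mean_and_ci95(accuracies):
    """Mean accuracy over episodes and the half-width of a 95% interval."""
    n = len(accuracies)
    mean = sum(accuracies) / n
    variance = sum((a - mean) ** 2 for a in accuracies) / (n - 1)
    return mean, 1.96 * math.sqrt(variance / n)
```

In use, a classifier would be fit on the support set of each of the 600 episodes and scored on the corresponding query set, with `mean_and_ci95` summarizing the per-episode accuracies.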

Appendix B Additional Results

In Table 17, we further present the fine-grained linear classification results for models pre-trained on RedCaps or models that are trained longer (2x or 3x longer). When pre-training on RedCaps, StableRep achieves the best average accuracy. Longer training of StableRep further improves transferability. Notably, our StableRep trained with synthetic images only is approaching the performance of OpenAI's CLIP trained with 400 million real images.

B.2 Few-shot image classification

We further summarize the few-shot image classification results in Table 18. The 95% confidence interval is provided. StableRep stands out on the majority of the evaluated datasets.

Appendix C Image Generation

We use Stable Diffusion v1.5. During sampling, we generate images with 50 DDIM steps. To accelerate the generation process, we leverage the xFormers library for efficient attention computation, which brings down the sampling time to $\sim$0.8s per image on a single A100 GPU and $\sim$2.3s per image on a V100 GPU.

Image resolution. The image resolution may affect the quality of representations learned by self-supervised learning algorithms. We try to make a relatively fair comparison by storing all synthetic and real images at similar resolutions. The synthetic images generated by Stable Diffusion are 512 $\times$ 512; we resize them to 256 $\times$ 256 before storing them on disk. The real images have various sizes, ranging from fewer than a hundred pixels on the shorter side to thousands of pixels; we resize the shorter side of all real images to 256.
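The target size for the shorter-side resize can be computed as in the sketch below. It assumes that the aspect ratio is preserved and that dimensions are rounded to the nearest pixel; neither detail is stated above, and the function name is hypothetical.

```python
def resize_shorter_side(width, height, target=256):
    """Return (new_width, new_height) with the shorter side equal to target."""
    scale = target / min(width, height)
    return round(width * scale), round(height * scale)
```

A 512 $\times$ 512 synthetic image maps to 256 $\times$ 256 under this rule, so synthetic and real images end up with the same shorter side.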

C.2 Generation examples

Some examples of synthetic images are visualized in Figure 10.

Appendix D Computation

Synthesis. The slowest part of the StableRep pipeline is the image generation. We use 512 V100 GPUs to synthesize images, which takes $\sim$ 13 hours for every ten million images.
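As a rough consistency check, the sketch below estimates wall-clock synthesis time from the per-image V100 latency reported in Appendix C. It assumes perfectly parallel generation with no I/O or scheduling overhead, and the function name is hypothetical.

```python
def synthesis_hours(num_images, seconds_per_image, num_gpus):
    """Idealized wall-clock hours to generate num_images across num_gpus."""
    return num_images * seconds_per_image / num_gpus / 3600


# Ten million images at ~2.3 s per image on 512 GPUs (idealized estimate).
estimated_hours = synthesis_hours(10_000_000, 2.3, 512)
```

The idealized estimate comes out somewhat below the reported figure of about 13 hours, which is plausible once practical overheads are taken into account.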

Pre-training. Each of our StableRep models with ViT-B/16 is trained on 4 nodes, each of which has 8 A100 GPUs and 96 CPU cores. It takes $\sim$ 20 hours to complete 35 SimCLR-equivalent epochs of training on CC12M and $\sim$ 23 hours on RedCaps. For ViT-L/16, we use 64 A100 80GB GPUs spread over 8 nodes.