Unified Contrastive Learning in Image-Text-Label Space

Jianwei Yang, Chunyuan Li, Pengchuan Zhang, Bin Xiao, Ce Liu, Lu Yuan, Jianfeng Gao

Introduction

Learning to recognize visual concepts in an image has been a fundamental and long-standing research problem. Typically, this can be tackled via either supervised learning on human-annotated image-label pairs or contrastive learning on web-crawled image-text pairs. When fueled with clean and large-scale human-annotated image-label data, e.g., ImageNet, supervised learning can attain decent visual recognition capacities over the given categories and also powerful transfer learning abilities. Nevertheless, collecting precise image-label data can be a laborious and expensive process, not to mention the difficulty of scaling it up to numerous visual concepts (the largest-scale but private JFT-300M covers 18,291 concepts). On the other hand, language-image contrastive learning has recently emerged as a promising approach by leveraging huge amounts of web-crawled image-text pairs. These pairs are usually noisy and free-form, but cover lots of visual concepts. As demonstrated in CLIP and ALIGN, models learned from hundreds of millions of image-text pairs can attain impressive low-shot recognition performance for a wide range of visual understanding scenarios. Though these image-text models show a broad coverage of visual concepts, we find in our experiments that they usually lack the strong discriminative ability required by transfer learning. A natural question is: can we have one model for both discriminative representations and broad visual concept coverage?

In this work, we take the first step to answer this question. We start with a new perspective, illustrated in Fig. 1. Instead of isolating image-label and image-text data, we define an image-text-label space and show how we can eliminate the boundary between the two data types. As shown in the left part of Fig. 1, supervised learning on image-label data typically aims at mapping images to discrete labels, and completely ignores the textual concept associated with each label during training. In contrast, language-image contrastive learning aims at learning a pair of visual and textual encoders to align images and texts, as shown in the right part of Fig. 1. This learning method implicitly assumes that each image-text pair has a unique label. Comparing these two learning paradigms side by side, we can see that both of them actually reside in the common image-text-label space, which is constructed by mapping each label to a textual concept for supervised learning, and assigning each textual description a unique label for language-image pretraining, as shown at the bottom of Fig. 1. Based on this new perspective, we can simply use a visual encoder and a language encoder to encode the images and texts, and align the visual and textual features with the guidance of labels (unique labels for image-text pairs and manual labels for image-label data). However, learning from these combined labels is not supported by existing supervised learning and language-image contrastive learning paradigms. For this purpose, we propose a unified contrastive learning method, called UniCL, to seamlessly accommodate both data types for visual-semantic representation learning. It takes images and texts as input and computes the loss with softened targets derived from the labels. With UniCL, we combine image-label and image-text data to learn discriminative and semantic-rich representations, which are beneficial to a variety of downstream tasks. To summarize, our main contributions are:

We introduce a new perspective of image-text-label space, which can seamlessly unify the commonly used image-label and image-text data.

We propose a unified contrastive learning method called UniCL in the image-text-label space, that can learn from either of the image-label and image-text data, or both.

Extensive experiments show that our UniCL can leverage both types of data effectively and achieve superior performance across standard zero-shot, linear probe, full finetuning and transfer learning settings.

Finally, we scaled up UniCL to billions of image-text-label data in Florence and demonstrated its superiority over CLIP and ALIGN across dozens of benchmarks. Hereby, we highly recommend UniCL as a generic multi-modal learning paradigm for vision.

Related works

Supervised Learning. Supervised learning for image classification has a long history. As mentioned earlier, a canonical way of supervised learning is mapping images to manual labels. With this goal, numerous works have pushed the image recognition performance from different directions, such as data scale from MNIST to ImageNet-1K, model architectures from convolutional neural networks (CNNs) to Transformers, and learning objectives from the original Cross-Entropy to marginal losses and the recent supervised contrastive loss. In this paper, we develop a unified contrastive learning method that regards image-label data as image-text-label data to learn a generic visual-semantic space. It calls back the textual concepts behind the labels and uses them as a special format of language. In this sense, our work is also related to conventional zero-shot classification. Most of these works focus on recognizing fine-grained categories at a small scale. Our work goes beyond such a restricted setting and aims to learn a good and rich visual-semantic representation from the combined image-label and image-text pairs.

Language-Image Contrastive Learning. Vision-and-language is a rapidly growing field. Existing works can be broadly categorized into two classes. $(i)$ Inspired by the success of BERT, the first line of research focuses on learning generic multi-modal fusion layers based on masked token prediction and/or image-text matching, given the pre-extracted features from visual and textual encoders. They aim to improve downstream tasks such as visual question answering, image captioning, and visual commonsense reasoning. $(ii)$ Another line of work focuses on learning transferable visual representations from natural language supervision, including generative and contrastive methods. Recently, contrastive learning has been scaled up in representative works such as CLIP and ALIGN, by pretraining on hundreds of millions of web-crawled image-text pairs. Our work is close to these works in that we also use image-text data as one of the major data sources. However, image-label data is ignored in these works. Our work presents the first unified contrastive learning method that can seamlessly leverage both.

Self-supervised Learning. Self-supervised learning (SSL) for vision aims to learn general-purpose visual representations from raw pixels without supervision from labels or text. Contrastive learning has laid the foundation for the best performing SSL models. It maximizes agreement of learned representations between differently augmented views of the same image, and minimizes agreement of views from different images. This augmented-view-based paradigm has also been extended to non-contrastive methods, where only positive image view pairs are considered in learning. Though image SSL holds great promise in leveraging nearly infinite amounts of unlabelled image data in training, the lack of language association renders it hardly applicable to zero-shot recognition. Nevertheless, the success of contrastive learning in SSL has inspired the generalization of this methodology to a much broader range, such as CLIP in the image-text setting and our UniCL in the image-text-label setting, where images and language descriptions can be considered as multi-modal views of the same underlying concepts.

Method

We define a triplet-wise data format $\mathcal{S}=\{(\boldsymbol{x}_{n},{\boldsymbol{t}}_{n},y_{n})\}_{n=1}^{N}$, where $\boldsymbol{x}\in\mathcal{X}$ is the image, ${\boldsymbol{t}}\in\mathcal{T}$ is its corresponding language description (ranging from simple tokens such as category names to free-form text sequences), and $y\in\mathcal{Y}$ is a label indicating the index of the grouped or unique language description in the dataset. As discussed earlier, this triplet representation is a general format for widely available image data, including the commonly used image-text and image-label data. On one hand, image-text pairs $\{(\boldsymbol{x}_{n},{\boldsymbol{t}}_{n})\}_{n=1}^{N}$ from the web usually have a one-to-one mapping, so each image-text pair has a unique label and $\mathcal{S}$ reduces to $\{(\boldsymbol{x}_{n},{\boldsymbol{t}}_{n},y_{n}\equiv n)\}_{n=1}^{N}$. On the other hand, though an image classification problem often uses simple category labels or indices, each label is induced from the similarity of concepts in its task definition. Therefore, for image-label data, $\mathcal{S}$ reduces to $\{(\boldsymbol{x}_{n},{\boldsymbol{t}}_{n}\equiv C[y_{n}],y_{n})\}_{n=1}^{N}$, with $C$ as the set of concept names indexed by $y_{n}$. Based on this definition, we can represent an image-label pair as a labeled image-text pair, and an image-text pair as an image-text pair with a unique label. An example of how they are unified is illustrated in Fig. 2. The goal of this work is to learn from the joint data $\mathcal{S}$, believing that the rich semantics in the language description ${\boldsymbol{t}}$ and the structured organization of the labels $y$ together are beneficial for learning semantic-rich and discriminative visual representations of images $\boldsymbol{x}$.
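The two reductions above can be illustrated with a minimal sketch. The `Triplet` class and the helper names are hypothetical and introduced only for illustration; images are represented by file-name strings, and indices start at 0 rather than 1.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class Triplet:
    image: str  # stand-in for the image x (here just a file name)
    text: str   # language description t
    label: int  # label y


def from_image_label(pairs: Sequence[Tuple[str, int]],
                     concept_names: Sequence[str]) -> List[Triplet]:
    """Image-label data: t_n = C[y_n], and y_n is kept as given."""
    return [Triplet(x, concept_names[y], y) for x, y in pairs]


def from_image_text(pairs: Sequence[Tuple[str, str]]) -> List[Triplet]:
    """Image-text data: every pair receives its own label, y_n = n."""
    return [Triplet(x, t, n) for n, (x, t) in enumerate(pairs)]


concepts = ["cat", "dog"]
labeled = from_image_label([("a.jpg", 0), ("b.jpg", 1), ("c.jpg", 0)], concepts)
captioned = from_image_text([("d.jpg", "a dog on the beach"),
                             ("e.jpg", "two cats on a sofa")])
```

In this sketch the two sources use overlapping label ranges, so mixing them in one batch would additionally require keeping their labels disjoint.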

3.2 Unified Image-Text-Label Contrast

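The bidirectional objective $\mathcal{L}_{BiC}$ of Eq. (1) comprises two contrastive terms, $\mathcal{L}_{i2t}$ and $\mathcal{L}_{t2i}$ (a temperature hyper-parameter $\tau$ controls the strength of penalties on hard negative samples):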

The image-to-text contrastive loss to align matched images in a batch with a given text

where $k\in\mathcal{P}(i)=\{k|k\in\mathcal{B},y_{k}=y_{i}\}$ .

The text-to-image contrastive loss to align matched texts to a given image

where $k\in\mathcal{P}(j)=\{k|k\in\mathcal{B},y_{k}=y_{j}\}$ .
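The displayed forms of the two terms are not reproduced here. One form consistent with the definitions of $\mathcal{P}(i)$ and $\mathcal{P}(j)$ above is sketched below, under assumed notation that does not appear in the surrounding text: $\boldsymbol{u}_{i}$ denotes the normalized text feature and $\boldsymbol{v}_{j}$ the normalized image feature of a sample in the batch $\mathcal{B}$, and $\tau$ is written as a multiplicative scale on the similarity (dividing by $\tau$ is an equally common convention).

$$
\mathcal{L}_{i2t}=-\sum_{i\in\mathcal{B}}\frac{1}{|\mathcal{P}(i)|}\sum_{k\in\mathcal{P}(i)}\log\frac{\exp(\tau\,\boldsymbol{u}_{i}^{\top}\boldsymbol{v}_{k})}{\sum_{j\in\mathcal{B}}\exp(\tau\,\boldsymbol{u}_{i}^{\top}\boldsymbol{v}_{j})}
$$

$$
\mathcal{L}_{t2i}=-\sum_{j\in\mathcal{B}}\frac{1}{|\mathcal{P}(j)|}\sum_{k\in\mathcal{P}(j)}\log\frac{\exp(\tau\,\boldsymbol{u}_{k}^{\top}\boldsymbol{v}_{j})}{\sum_{i\in\mathcal{B}}\exp(\tau\,\boldsymbol{u}_{i}^{\top}\boldsymbol{v}_{j})}
$$

In this reading, the first term fixes a text and normalizes over all images in the batch, while the second fixes an image and normalizes over all texts; in both, every sample sharing the label contributes a positive to the numerator.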

Using the right side of Fig. 2 as an example, $\mathcal{L}_{i2t}$ is computed for each row and $\mathcal{L}_{t2i}$ for each column. The red tiles indicate the positive pairs and the blank tiles the negative ones, all allocated based on the labels.
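The row-wise and column-wise computation can be sketched in NumPy as follows. This is an illustrative sketch rather than the reference implementation: the function name `unicl_loss` is hypothetical, rows of the similarity matrix are taken to be texts and columns images, and $\tau$ is assumed to multiply the cosine similarities.

```python
import numpy as np


def log_softmax(z, axis):
    """Numerically stable log-softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=axis, keepdims=True))


def unicl_loss(text_feats, image_feats, labels, tau=1.0):
    """Label-guided bidirectional contrastive loss (illustrative sketch).

    text_feats, image_feats: (n, d) arrays, one row per triplet in the batch.
    labels: (n,) integer array; equal labels mark positive pairs.
    tau: scale applied to the cosine similarities (assumed multiplicative).
    Returns the pair (row-wise term, column-wise term).
    """
    u = text_feats / np.linalg.norm(text_feats, axis=1, keepdims=True)
    v = image_feats / np.linalg.norm(image_feats, axis=1, keepdims=True)
    logits = tau * (u @ v.T)  # rows: texts, columns: images
    pos = (labels[:, None] == labels[None, :]).astype(float)  # positive tiles

    # For each text (row): softmax over the images in the batch.
    row_term = -(pos * log_softmax(logits, axis=1)).sum(axis=1) / pos.sum(axis=1)
    # For each image (column): softmax over the texts in the batch.
    col_term = -(pos * log_softmax(logits, axis=0)).sum(axis=0) / pos.sum(axis=0)
    return row_term.sum(), col_term.sum()


rng = np.random.default_rng(0)
texts = rng.normal(size=(4, 8))
images = rng.normal(size=(4, 8))
y = np.array([0, 0, 1, 2])  # the first two samples share a label
l_i2t, l_t2i = unicl_loss(texts, images, y)
```

Each term averages the log-probabilities of all positives for a query before summing over queries, so a query with several positives does not dominate the loss.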

3.3 Discussions & Properties

We discuss the unique properties of our proposed UniCL and build the connections with previous commonly used learning paradigms. An illustrative comparison is shown in Fig. 3, with more detailed analysis below.

where $\hat{y}$ is the ground-truth label for the $j$-th image in the batch. Based on this, we argue that $\mathcal{L}_{BiC}$ is more general than $\mathcal{L}_{CE}$, in two respects: $(i)$ Augmentation with $\mathcal{L}_{i2t}$. The additional image-to-text term $\mathcal{L}_{i2t}$ in $\mathcal{L}_{BiC}$ plays the role of a regularizer. Given a language description ${\boldsymbol{t}}_{j}$, all image features with the same ${\boldsymbol{t}}_{j}$ in the batch are clustered towards the text feature; otherwise they are pushed away. This can help prevent over-fitting, as demonstrated in our experiments later; $(ii)$ Text encoder $f_{\boldsymbol{\phi}}$. The text encoder can take more powerful forms, such as a 12-layer Transformer or a pretrained BERT encoder, and can accept free-form text inputs beyond the set of category names.

Connections to SupCon. One shared property between our UniCL and SupCon is that both methods exploit label-guided contrastive learning: for any query, both methods leverage samples with the same label to contribute to the numerator as positives. Note that SupCon is proposed in the image-label setting, where each image is augmented with two different views. UniCL and SupCon differ in two aspects: $(i)$ Query-vs-Key modality. In SupCon, both query and key in contrastive learning are from the same modality: image-and-image pairs. In UniCL, the query and key are from different modalities: image-and-language pairs. $(ii)$ Encoders. Only one shared image encoder is used in SupCon for query and key. Two different encoders are used in UniCL for different modalities, as shown in Fig. 3.

Connections to CLIP. For image-text pairs, there are only one-to-one mappings between an image and its paired text in a batch. In other words, $\mathcal{P}(i)=\{i\}$ and $\mathcal{P}(j)=\{j\}$ for Eq. (2) and Eq. (3), respectively. Then $\mathcal{L}_{BiC}$ becomes:

This means that $\mathcal{L}_{BiC}$ reduces to the CLIP training objective when only image-text data is employed. The major structural change of (2) over (5) is that, for each language description, all image samples with the same label in a batch are considered as positives, contributing to the numerator. A similar conclusion is drawn by comparing (3) and (6).
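The difference between the two settings can be seen directly in the positive sets. The short sketch below uses a hypothetical helper, `positive_mask`, to build the membership matrix of $\mathcal{P}(\cdot)$ from a batch of labels.

```python
import numpy as np


def positive_mask(labels):
    """mask[i, k] is True when samples i and k share a label, i.e. k is in P(i)."""
    labels = np.asarray(labels)
    return labels[:, None] == labels[None, :]


# Image-text data: every pair has its own label, so P(i) = {i}
# and the mask is the identity matrix used by CLIP-style training.
clip_like = positive_mask([0, 1, 2, 3])

# Image-label data: samples 0 and 2 share a class, so each of them
# is an additional positive for the other.
supervised = positive_mask([7, 3, 7, 5])
```

With unique labels only the diagonal entries are positives; with repeated labels, off-diagonal entries are added to the numerator as well.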

3.4 Model Training and Adaptation

The training process of UniCL is summarized in Algorithm 1. Note that this pseudo code is tied to our data loader construction: all image-text pairs have an initial label index $y=0$, while all image-label pairs have an initial label index $y\in[1,\cdots,K]$. The $\mathbf{TargetM}$ function ensures that each unique language description in the batch has a unique label index. In training, $\tau$ is a trainable variable initialized as 1. After training, the learned visual and textual encoders $\{f_{\boldsymbol{\theta}},f_{\boldsymbol{\phi}}\}$ can be used jointly for open-vocabulary image recognition, i.e., recognizing the categories seen during training or novel ones beyond the annotated categories. Alternatively, the visual backbone $f_{\boldsymbol{\theta}}$ can be used independently, either for feature extraction in linear probe or for full model finetuning in object detection.
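A relabeling step in the spirit of $\mathbf{TargetM}$ might look like the sketch below. It is an interpretation of the description above, not a copy of Algorithm 1: the function name `target_m` is hypothetical, and the sketch assumes that the captions of image-text pairs within a batch are all distinct, so each pair simply receives a fresh index above every class label.

```python
import numpy as np


def target_m(y):
    """Sketch of a TargetM-style relabeling (an interpretation of the text).

    y: (n,) integers; 0 marks an image-text pair, 1..K an image-label class.
    Returns an (n, n) boolean matrix whose True entries mark positive pairs.
    """
    y = np.array(y, dtype=np.int64)  # copy, so the caller's labels stay intact
    is_pair = y == 0
    offset = y.max() + 1             # first index above every class label
    y[is_pair] = offset + np.arange(is_pair.sum())
    return y[:, None] == y[None, :]


# A mixed batch: two image-text pairs (y = 0), two images of class 3, one of class 1.
targets = target_m([0, 3, 0, 3, 1])
```

The resulting matrix can serve as the target for both directions of the contrastive loss, after normalizing each row (or column) by its number of positives.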

Experiments

In this section, we examine UniCL to answer two research questions. Q1 learning objective – how does our UniCL perform compared with CE and SupCon on image classification? Q2 pre-training data – what is the unique benefit of applying UniCL on the joint image-text-label data?

Datasets. We study our models based on publicly available datasets, and the statistics are shown in Table 1. For classification data (top four rows), the number of visual concepts is identical to the number of categories. For image-text data (bottom three rows), we use spaCy to extract the noun phrases and then count the number of unique noun entities that appear more than 5 times. Given the pool of concepts, we then calculate the number of unique words and report it as the vocabulary size. The ratio of #images/#concepts clearly illustrates the different trade-offs between image diversity and semantic richness across datasets. In our unified image-text-label space, all these datasets are homogeneous, and can be jointly used for learning. GCC-15M denotes the merged version of GCC-3M and GCC-12M.
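The counting procedure for image-text data can be sketched as follows, assuming the noun phrases have already been extracted from each caption (the extraction itself is not shown). The function name `concept_stats` is hypothetical, and splitting phrases on whitespace to obtain words is an assumption.

```python
from collections import Counter


def concept_stats(noun_phrases_per_caption, min_count=5):
    """Count concepts and vocabulary size from pre-extracted noun phrases.

    A concept is a noun phrase that appears more than `min_count` times;
    the vocabulary is the set of unique words across all kept concepts.
    """
    counts = Counter(p for phrases in noun_phrases_per_caption for p in phrases)
    concepts = {p for p, c in counts.items() if c > min_count}
    vocabulary = {w for p in concepts for w in p.split()}
    return len(concepts), len(vocabulary)


captions = ([["tree frog", "pond"]] * 6
            + [["tailed frog"]] * 6
            + [["rare bird"]] * 2)  # "rare bird" is too infrequent to be kept
num_concepts, vocab_size = concept_stats(captions)
```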

Training. We use the same prompt strategy and tokenizer for classification data as proposed in CLIP. We fill the class names into the prompt templates and tokenize the result before feeding it into the text encoder. During training, we randomly sample one prompt template, while for validation we average over all 80 templates. For fair comparison, we use the same text encoder architecture as in CLIP, and the whole model, including the vision and text encoders, is trained from scratch. More training details are discussed in the following individual sections.
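The prompt handling described above can be sketched as below. The three templates are hypothetical placeholders for the 80 templates, `toy_encode` is a stand-in for the tokenizer and text encoder, and normalizing each embedding before averaging is an assumption.

```python
import random

import numpy as np

# Hypothetical placeholders; the actual set contains 80 templates.
TEMPLATES = ["a photo of a {}.", "a blurry photo of a {}.", "a drawing of a {}."]


def training_prompt(class_name, rng=random):
    """Training: fill the class name into one randomly sampled template."""
    return rng.choice(TEMPLATES).format(class_name)


def validation_embedding(class_name, encode_text):
    """Validation: average the (normalized) embeddings over all templates."""
    feats = np.stack([encode_text(t.format(class_name)) for t in TEMPLATES])
    feats = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    mean = feats.mean(axis=0)
    return mean / np.linalg.norm(mean)


def toy_encode(text):
    """Stand-in for tokenizer + text encoder: a small deterministic vector."""
    return np.array([len(text), text.count("a") + 1.0, 1.0])


prompt = training_prompt("dog")
dog_embedding = validation_embedding("dog", toy_encode)
```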

Evaluation. We evaluate the quality of learned representations on a set of computer vision tasks, including:

Standard classification. We report the Top-1 classification accuracy on CIFAR-10, CIFAR-100 and ImageNet-1K.

Zero-shot classification. We evaluate on ImageNet-1K as well as 14 datasets used in , and employ the same text prompts. The averaged score is reported.

Linear probe. We study 18 datasets used in . Automatic hyper-parameter tuning is used to ensure fairness of comparison. The averaged score is reported.

Object detection. We use Mask R-CNN as the detector and follow the standard 1 $\times$ schedule. mAP for box and mask are reported on 80 object categories.

To gain an empirical understanding of our UniCL objective, we compare UniCL against two supervised learning methods, Cross-Entropy (CE) and Supervised Contrastive Learning (SupCon), on image classification datasets. We employ two representative architectures, ResNet and Swin Transformer, to build the visual encoder, whose last-layer outputs are pooled as the visual representation. We use standard random crop as the data augmentation. All models are trained for 500 epochs with a batch size of 4096. We report the comparison results in Table 2. Overall, the proposed UniCL achieves comparable if not better performance across all datasets and model architectures.

Comparison with SupCon. We find that our UniCL is superior on CIFAR-10 and CIFAR-100 and on par with SupCon on ImageNet-1K. Both UniCL and SupCon pursue bidirectional alignments, the former for image-text pairs and the latter for images from multiple views. Though the overall performance is comparable on these standard classification tasks, our UniCL has two unique advantages over SupCon: 1) it is trained end-to-end, while SupCon requires two training stages, i.e., visual encoder training and linear classifier tuning; 2) the learned representations in our model are language-aware, which means we can directly use them for zero-shot recognition, as demonstrated later.

Comparison with CE. UniCL in (1) promotes a bidirectional alignment between images and category names, which imposes an additional regularization term compared with CE in (4). As such, it can be particularly helpful when over-fitting tends to occur. For example, when training ResNet-50 on small datasets such as CIFAR-10 and CIFAR-100, UniCL improves over CE by around 1-3 points. When training Swin Transformer on ImageNet-1K, the network tends to over-fit due to the lack of spatial inductive bias; our UniCL outperforms CE by 3 points. When over-fitting is less severe, such as when training on larger datasets (from CIFAR to ImageNet) or with strong augmentation (MixUp and CutMix), our method is still on par with CE.

Ablation of language encoders $f_{\boldsymbol{\phi}}$. Our UniCL is flexible in how its language encoder is constructed. In Table 3, we ablate by comparing two options: Transformers vs. a simple linear embedding layer ${{\bf W}}$. The former is superior by an absolute 1.2%. We suspect this is due to its ability to capture the semantics behind the 1K category names. For example, the two categories “tree frog” and “tailed frog” share the common word “frog”, which conveys prior linguistic knowledge about their similarity. This semantic information, however, can hardly be captured by an embedding layer indexed with labels. One may notice that our model uses an extra language encoder and thus introduces more parameters, which could make the comparison unfair. However, during inference, the language encoder is used to extract the textual embeddings for all concepts and is then discarded. Therefore, the effective complexity and time cost during inference are nearly identical to those of the other methods.
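The inference procedure described at the end of this paragraph can be sketched as follows. The function names are hypothetical, the feature arrays stand in for encoder outputs, and classification by cosine similarity is assumed.

```python
import numpy as np


def build_classifier(concept_text_feats):
    """Normalize and cache one text embedding per concept.

    After this step the language encoder is no longer needed at inference.
    """
    w = np.asarray(concept_text_feats, dtype=float)
    return w / np.linalg.norm(w, axis=1, keepdims=True)


def predict(image_feats, classifier):
    """Assign each image to the concept with the highest cosine similarity."""
    v = np.asarray(image_feats, dtype=float)
    v = v / np.linalg.norm(v, axis=1, keepdims=True)
    return (v @ classifier.T).argmax(axis=1)


# Two concepts with toy 2-D text embeddings, computed once and cached.
classifier = build_classifier([[1.0, 0.0], [0.0, 2.0]])
predictions = predict([[0.9, 0.1], [0.2, 3.0]], classifier)
```

Once the cached matrix is built, prediction costs one matrix product per batch of images, the same as a linear classification head of matching size.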

Ablation of training objectives. The third row in Table 3 shows a significant 3% drop when only the term $\mathcal{L}_{t2i}$ is retained in our bidirectional loss. This indicates the importance of both loss terms in our UniCL. Though $\mathcal{L}_{t2i}$ resembles CE under certain conditions described in Section 3.3, we notice a small gap between them (75.7 vs. 76.8 in Table 2). This gap is probably attributable to stochastic training: at each iteration, CE always compares the visual feature to all 1K class embeddings, while UniCL updates only the subset of concept embeddings present in the current mini-batch.

Effect of training batch size. We vary the batch size from the default 4096 to 2048 and 1024. Results are shown in Table 4. UniCL is robust to the variation of batch size, regardless of which language encoder is employed. This observation differs from contrastive methods such as SimCLR in self-supervised learning. This is probably because: $(i)$ one of the two views in our UniCL is the embedding of category names, which are used consistently and with high overlap across different mini-batches, making the learning less sensitive to the batch size; $(ii)$ the label information provides consistent and strong guidance.

4.2 Results on data unification of image-text-label

In this part, we study the benefits of UniCL when learned with the unification of image-label and image-text data. We use Swin-Tiny as the visual encoder for consistency.

We use ImageNet-1K as the base dataset, and gradually add different sets of image-text pairs, including GCC-3M, GCC-15M and YFCC-14M. When combining with image-text pairs, we use a balanced data sampler to ensure that the model is trained with the same number of image-label and image-text pairs per epoch. All models are trained for 500 epochs. We report the results in Table 5.

Comparison of objectives. From the first three rows, we see that the models trained with different objectives on ImageNet-1K obtain similar performance across different metrics. However, our UniCL is the only one that is directly applicable to zero-shot image recognition, though CE can be partially used for zero-shot with extra label mapping efforts. Surprisingly, the average zero-shot performance over 14 datasets for UniCL trained only on ImageNet-1K reaches a level close to that of CLIP trained on YFCC-14M (30.2 vs. 36.3, as will be shown in Table 6).

Benefit of image-text pairs. Adding image-text pairs generally improves the performance across all metrics. In the table, we can see that all image-text datasets significantly improve the zero-shot performance. Besides, adding GCC-3M further improves linear probe and COCO detection by 0.9 and 0.5, respectively. YFCC-14M improves ImageNet-1K and linear probe by 1.2 and 2.1, respectively. As summarized in Table 1, image-text pairs are coarsely aligned, but cover rich visual concepts. Therefore, they are particularly beneficial for tasks requiring broad visual concept understanding, such as zero-shot and linear probe on dozens of datasets. When GCC-15M is used, we observe even larger improvements for ImageNet-1K (+1.9), linear probe (+3.5) and COCO detection (+1.2). Note that we use a balanced data sampler to ensure the model sees an equal number of image-text batches during training. This suggests that concept richness (GCC-15M is much richer than GCC-3M) and quality (GCC-15M is much cleaner than YFCC-14M) are both important for complementing classification data in learning discriminative representations.

For qualitative analysis, we visualize the 2D $t$-SNE of the textual feature space in Fig. 4. Given a query concept from ImageNet-1K, we search for the closest target concept among the remaining 21K concepts in ImageNet-22K in the feature space. For better understanding, we also show the exemplar image corresponding to each concept. Clearly, the model trained on ImageNet-1K can hardly generalize to the other 21K concepts. In contrast, adding GCC-15M image-text pairs significantly improves its understanding ability, as the retrieved targets become more semantically similar to the queries in ImageNet-1K.

4.2.2 Benefit of image-label to image-text

We now switch roles and study how image-label data can assist learning with image-text pairs. Following the protocols in CLIP, we use random crop as the data augmentation, a standard data sampler, and train all models for 32 epochs. We compare against two baselines: $(i)$ CLIP, a language-image contrastive learning method without label supervision; our UniCL recovers CLIP when trained on image-text pairs only. $(ii)$ Multi-task learning, which performs CE on image-label data and CLIP on image-text data.

We report the results in Table 6. We first reproduced CLIP on YFCC-14M with Swin-Tiny. The ImageNet-1K zero-shot accuracy is 30.1%, which closely matches the reported number of 31.2% with ResNet-50 in . To ensure fair comparisons, we build an ImageNet-21K dataset by excluding the categories in ImageNet-1K from the ImageNet-22K dataset, and train UniCL on it. Interestingly, it achieves ImageNet-1K zero-shot performance comparable to that of training on YFCC-14M. This indicates that image-label data is arguably another good source for learning visual-semantic representations, one that has nevertheless been less studied in previous works. We then combine half of the ImageNet-21K and half of the YFCC-14M datasets, so that the total number of training instances remains the same, and train a UniCL model. This data unification boosts performance on almost all metrics, especially on zero-shot classification for ImageNet-1K (an absolute gain of more than 6%) and the 14 datasets (an absolute gain of more than 7%). The detailed comparison on the 14 datasets in Fig. 5 shows that UniCL wins on 11 out of 14 datasets. Besides zero-shot, our UniCL also achieves a significant improvement (+7.3%) on linear probe compared with the CLIP baseline. With the full set of both datasets (row 4), the performance improves uniformly further.

We compare our method with the multi-task learner on different datasets. First, when using half of YFCC-14M and ImageNet-21K, our UniCL outperforms the multi-task learner by a large margin across all tasks. When trained with ImageNet-22K, the gaps shrink for ImageNet-1K finetuning and linear probe but remain for zero-shot recognition. This is mainly because ImageNet-22K covers all ImageNet-1K concepts and a large portion of the categories in the linear probe datasets. Admittedly, the multi-task learner is a good representation learning method. However, because it isolates image-label and image-text pairs, it cannot learn a feature space as discriminative and semantic-rich as that of our method.

Finally, to show qualitatively how UniCL trained with image-label data yields a more discriminative feature space, we visualize the 2D $t$-SNE of the visual features of the ImageNet-1K dataset in Fig. 6. Dogs of fine-grained breeds are heavily mixed together for the model trained on image-text pairs only. However, they are clearly grouped with the aid of image-label data from ImageNet-21K, even though it contains none of those dog breed concepts.

Conclusion

We have presented UniCL, a new contrastive learning paradigm for generic multi-modal representation learning. It is built in the image-text-label space, and empowered by our unified contrastive learning method. Such a unified paradigm promotes a seamless synergy between image-label and image-text pairs for discriminative and semantic-rich representation learning, which brings consistent improvements on zero-shot, linear probe and finetuning benchmarks. Moreover, we discuss its connections to existing learning methods, and empirically demonstrate that our learning method on its own is a good alternative learner on pure image-label data.

Discussions. In this submission, we mainly focused on vision tasks such as image recognition and object detection, and based our model on public datasets. However, we refer the readers to Florence for large-scale pretraining and evaluation on a broader set of tasks including VQA and video understanding. We note that Florence used a huge amount of private data and thus recommend the suite of experiments in this paper as a baseline for future academic research.

Appendix A Validation dataset details

In addition to the training datasets listed in our main submission, we list in Table 7 the statistics for all the validation datasets used in our experiments. As in Table 1 of our main submission, we calculate the vocabulary size for each dataset, which is typically larger than the number of concepts (classes).

Appendix B Experiment details

This part mainly explains the detailed experiment setups for Sec. 4.1 in our main submission.

Model architecture. We employ two representative architectures, ResNet and Swin Transformer, to build the visual encoder. The globally pooled feature from the last visual encoder layer is used as the visual feature. For the language encoder, we use a 12-layer Transformer with a hidden dimension of 512 following . Features from the visual and textual encoders are projected to the same dimension of 512, using two linear projection layers.

Training protocol. For optimization, we use SGD for all CNN models, and AdamW for all models with Transformers on either the vision or the language side. We set the learning rate to 0.4 and 0.002, and the weight decay to 1e-4 and 0.05, for the SGD and AdamW optimizers, respectively. All models are trained for 500 epochs with a batch size of 4096. We use the same set of data augmentation and regularization as in , but do not use MixUp and CutMix except for the last column in Table 2 of our main submission. For all training, we use a cosine learning rate schedule, with 5 epochs and 20 epochs of warmup for ResNet and Swin Transformer, respectively.
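A cosine schedule with warmup of this kind can be sketched as follows. The exact warmup shape and final learning rate are not specified above, so the sketch assumes a linear warmup and a decay to zero; the function name is hypothetical, and the values in the usage line are taken from the protocol above for illustration only.

```python
import math


def lr_at_epoch(epoch, base_lr, total_epochs, warmup_epochs):
    """Learning rate at a given (0-indexed) epoch.

    Assumes a linear warmup up to base_lr, then a cosine decay to zero.
    """
    if epoch < warmup_epochs:
        return base_lr * (epoch + 1) / warmup_epochs
    progress = (epoch - warmup_epochs) / max(1, total_epochs - warmup_epochs)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))


# Example: base learning rate 0.4, 500 epochs, 5 warmup epochs.
schedule = [lr_at_epoch(e, 0.4, 500, 5) for e in range(500)]
```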

B.2 Training on image-text-label space

Training protocol for Sec. 4.2.1. We use Swin-Tiny as the visual encoder and mostly follow the training settings in Section 4.1 to train the models on the union of image-label and image-text pairs. However, we notice that there is a severe imbalance between image-label and image-text data, as shown in Table 1 of our main submission (e.g., there are around 1.3M images in ImageNet-1K but more than 12M images in the GCC-12M dataset). To ensure that model training is not biased towards the dominant image-text pairs, we develop a balanced data sampler for the two data types. More specifically, at each epoch, we randomly sample a subset of image-text pairs of equal or similar size to the image-label data. In this case, the model sees half image-label data and half image-text data at each iteration for balanced learning. We keep the number of training epochs at 500, so the effective number of training epochs on the image-text dataset is roughly 500 $\times$ (size of image-label dataset)/(size of image-text pair dataset). For example, the model learns from GCC-12M for around 40 epochs. We find this balanced sampling strategy to be very important for achieving the performance reported in our main submission.
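The balanced sampling idea can be sketched as follows. The function names are hypothetical and samples are represented by integer ids; the sketch shuffles the two sources together, so individual batches are balanced only in expectation rather than exactly half and half.

```python
import random


def balanced_epoch(image_label_ids, image_text_ids, rng):
    """Build one epoch: all image-label samples plus an equally sized
    random subset of the image-text pairs, shuffled together."""
    k = min(len(image_label_ids), len(image_text_ids))
    subset = rng.sample(list(image_text_ids), k)
    epoch = [("label", i) for i in image_label_ids]
    epoch += [("text", i) for i in subset]
    rng.shuffle(epoch)
    return epoch


def effective_text_epochs(num_epochs, n_label, n_text):
    """Approximate number of passes over the image-text dataset."""
    return num_epochs * n_label / n_text


rng = random.Random(0)
epoch = balanced_epoch(range(10), range(100), rng)  # toy sizes
passes = effective_text_epochs(500, 1000, 10000)    # toy sizes
```

Because a fresh subset is drawn at every epoch, the model still sees the whole image-text dataset over the course of training, only at a lower rate than the image-label data.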

Training protocol for Sec. 4.2.2. We followed the training protocol in CLIP for fair comparison. Specifically, we used only random crop for data augmentation when training all models. All models, including the baseline models, are trained for 32 epochs, with batch size 4096, initial learning rate 1e-3 and weight decay 0.1. We also used a cosine learning rate scheduler with 5000 warmup iterations.

Appendix C More results

In Figure 7, we show the zero-shot classification results on 14 datasets when adding different image-caption pairs to ImageNet-1K, i.e., for the methods compared in Table 5 in the main text. UniCL takes advantage of the rich concept coverage of image-text pairs: on most of the datasets, it outperforms the baseline, especially on fine-grained classification tasks such as Food101 and OxfordPets.

C.2 Results with larger vision backbone

In our main submission, we used Swin-Tiny as the visual backbone to study how our UniCL performs when trained on the combination of image-label data (ImageNet-21K) and image-text pairs (YFCC-14M) in Table 6. Here, we investigate whether increasing the capacity of the vision backbone can further improve the representation learning.

As shown in Table 8, we observe a trend consistent with Table 6 of our main submission. Even with a similar amount of image-text-label data, combining the two types of data significantly improves the zero-shot recognition performance on both ImageNet-1K (8.6 points) and the other 14 datasets on average (11.0 points). When using the full set of ImageNet-21K and YFCC-14M, both results improve significantly further. These results suggest that our method is agnostic to model size and is thus a generic learning paradigm for visual-semantic representations. For comparison, we also list the numbers for Swin-Tiny models after each “/”. Clearly, increasing the visual encoder size brings substantial gains in all cases, and the gains are particularly significant for the combination of both data types.

C.3 Transfer to object detection

In Table 5 of our main submission, we mainly studied whether image-text pairs benefit object detection transfer learning compared with models trained solely on image-label data. As we demonstrated in Table 6 of our main submission, image-label data helps to learn more discriminative representations and thus benefits ImageNet-1K finetuning and linear probing. Here, we further study whether the learned representations generalize to the object detection task as well. Specifically, we use the Swin-Tiny models pretrained in Table 6 as the vision backbones and train a Mask R-CNN model with the 1$\times$ schedule, following the default settings in Swin Transformer based on Detectron2. In Table 9, we can see that combining the two data types at a similar total data amount clearly improves object detection performance by around 2 points for both box and mask mAP, compared with the CLIP-based model trained on YFCC-14M. This further validates our observation that representations learned from pure image-text pair data usually lack the discriminative ability required for transfer learning to image recognition and object detection. As expected, using the full set (last row) brings a further improvement of around 1 point on both metrics. Together with the numbers reported in Table 5 of our main submission, these results imply that adding image-text pairs to image-label data, and the other way around, consistently helps to learn better visual representations than either data type alone. Adding image-text pair data can enrich and smooth the semantic space, which may implicitly promote distinctive representations for the concepts in COCO object detection, while adding image-label data directly imposes pressure to learn more discriminative representations.

Appendix D More analysis

The concepts present in the training data are arguably crucial to model learning. Both CLIP and ALIGN exhaustively collect hundreds of millions of image-text pairs to cover as many visual concepts as possible. Though the datasets used in our experiments are at a much smaller scale, we are still interested in the concept distributions of the different datasets. In Fig. 8, we show the occurrences of the top 1000 concepts in GCC-3M, GCC-12M and YFCC-14M. Together with the remaining concepts not shown here, all three datasets have extremely long-tailed distributions. For example, the most frequent concept “view” in GCC-12M appears over 185,363 times, while the 10k-th concept “candle holder” appears only 501 times, and there are more than 584k concepts in the whole set.

Interestingly, we find that the overlap of the most common concepts across the three datasets is lower than we expected. Table 10 shows the overlap ratios of the top 10k concepts among the three datasets. This relatively low overlap indicates that the datasets are sufficiently diverse and complementary.
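
A top-k overlap ratio of this kind could be computed along the following lines. This is a sketch under stated assumptions: the exact definition used for Table 10 is not given here, so the ratio is assumed to be the fraction of one dataset's top-k concepts that also appear in another dataset's top-k set. The function names and the toy inputs are hypothetical, and the step that extracts concepts from captions is not shown.

```python
from collections import Counter


def top_k_concepts(concept_lists, k):
    """Return the set of the k most frequent concepts.

    concept_lists: iterable of per-caption concept lists (hypothetical
    input format; concept extraction itself is outside this sketch).
    """
    counts = Counter(c for concepts in concept_lists for c in concepts)
    return {concept for concept, _ in counts.most_common(k)}


def overlap_ratio(top_a, top_b):
    """Fraction of concepts in top_a that also occur in top_b (assumed definition)."""
    if not top_a:
        return 0.0
    return len(top_a & top_b) / len(top_a)


if __name__ == "__main__":
    # Toy data for illustration only, not drawn from any real dataset.
    dataset_a = [["view", "dog"], ["view", "cat"], ["dog"]]
    dataset_b = [["view", "beach"], ["beach", "view"], ["dog"]]
    top_a = top_k_concepts(dataset_a, k=2)
    top_b = top_k_concepts(dataset_b, k=2)
    print(overlap_ratio(top_a, top_b))
```

When both sets contain exactly k concepts, this ratio is symmetric; it differs between directions only if one dataset has fewer than k distinct concepts.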

D.2 Concept coverage

Given the concept distributions above, we further investigate the concept coverage between the training datasets and the validation datasets. In Table 11, we define the coverage ratio as the percentage of concepts in a validation dataset that are mentioned in the pretraining data, including ImageNet-1K, ImageNet-21K, GCC-3M, GCC-12M and YFCC-14M. Coverage ratios equal to or larger than 50% are highlighted.
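
The coverage ratio can be illustrated with a short sketch. The matching rule is an assumption: concepts are compared by exact string match after lowercasing and trimming whitespace, which may differ from the matching used to produce Table 11. The function name and the toy concept lists are hypothetical.

```python
def coverage_ratio(validation_concepts, pretraining_concepts):
    """Percentage of validation concepts that appear in the pretraining data.

    Assumption: concepts match by exact string equality after
    lowercasing and stripping whitespace.
    """
    val = {c.strip().lower() for c in validation_concepts}
    pre = {c.strip().lower() for c in pretraining_concepts}
    if not val:
        return 0.0
    return 100.0 * len(val & pre) / len(val)


if __name__ == "__main__":
    # Toy concept lists for illustration only.
    validation = ["airplane", "Dog", "cat", "ship"]
    pretraining = ["dog", "cat", "tree"]
    ratio = coverage_ratio(validation, pretraining)
    # Ratios of at least 50% are the ones highlighted in the table.
    print(ratio, ratio >= 50.0)
```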

For the image-label dataset ImageNet-1K, there is some overlap with CIFAR-100 (24.0%) and Caltech-101 (24.5%). This may explain why the zero-shot performance on these two datasets shown in Fig. 7 is relatively higher. At the same time, we notice that even with little or no coverage, the model pretrained on ImageNet-1K with our method still attains reasonably good zero-shot performance on datasets such as CIFAR-10, Flowers102 and OxfordPets.

Similarly, ImageNet-21K covers a certain proportion of the concepts in validation sets such as CIFAR-10, CIFAR-100 and Caltech-101, and we did observe high zero-shot recognition performance on them in Table 6 of our main submission. Nevertheless, for other datasets such as Hateful Memes and PatchCamelyon, there is zero concept overlap, yet our model still achieves reasonable performance. This indicates that our model is not just memorizing the concepts appearing in the training datasets but also learns to understand the underlying structure of different concepts, which is also demonstrated in Fig. 5 of our main submission.

Finally, we find that image-text pair data have higher concept coverage than image-label datasets on almost all validation sets. Among the three image-text pair datasets, GCC-12M has relatively higher coverage than the other two. This may also explain why we observe better performance in the comparisons shown in Table 5 of our main submission. However, we also notice that higher concept coverage does not necessarily mean better zero-shot performance. For example, even though all three datasets fully cover the concepts in CIFAR-10 and CIFAR-100, adding them to the pretraining hurts performance, as shown in Fig. 7. We suspect that there are significant gaps in the image domain between the pretraining and validation datasets even though they share common semantic concepts. Moreover, since images in image-text pairs usually contain multiple objects, coverage of a concept does not necessarily mean that the model can learn to ground the concept in the specific image content. How to better leverage image-text pair data and build a more grounded visual understanding is worth further study.

D.3 Concept visualizations

In Fig. 9, we further show the concept embeddings for two models, as in Fig. 4 in our main submission. Fig. 9 left shows the model trained only on ImageNet-1K, while Fig. 9 right shows the model trained jointly on ImageNet-1K and GCC-15M. The model trained with the two types of data understands the novel concepts from ImageNet-21K much better than the left one. For example, the left model puts “porthole” and “porcupine” close to each other, although the former is a circular window and the latter is an animal. In contrast, the model on the right easily finds “porcupinefish” as the close neighbor. Similarly, the left model mixes up “goblet” and “coverlet”, probably because they share the same suffix. Our model on the right finds one of the best-matching concepts, “liqueur glass”, which is semantically and visually similar to the query concept. A similar trend is also observed in Fig. 10. All these visualizations demonstrate that our model trained with both types of data has learned visually grounded semantic meanings for various concepts.
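
Finding the close neighbors of a query concept in an embedding space can be sketched as below. This is an illustration only: cosine similarity is assumed as the closeness measure, the function names are hypothetical, and the 2-D vectors are hand-written toy values rather than outputs of either model.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def nearest_concepts(query, embeddings, k=1):
    """Return the k concepts most similar to the query, excluding the query itself."""
    q = embeddings[query]
    scored = [(cosine(q, vec), name)
              for name, vec in embeddings.items() if name != query]
    scored.sort(key=lambda item: item[0], reverse=True)
    return [name for _, name in scored[:k]]


if __name__ == "__main__":
    # Toy embeddings chosen by hand for illustration; not model outputs.
    toy_embeddings = {
        "goblet": [1.0, 0.1],
        "liqueur glass": [0.9, 0.2],
        "coverlet": [0.1, 1.0],
    }
    print(nearest_concepts("goblet", toy_embeddings, k=1))
```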