Rethinking Pre-training and Self-training

Barret Zoph, Golnaz Ghiasi, Tsung-Yi Lin, Yin Cui, Hanxiao Liu, Ekin D. Cubuk, Quoc V. Le

Introduction

Pre-training is a dominant paradigm in computer vision. As many vision tasks are related, a model pre-trained on one dataset is expected to help another. It is now common practice to pre-train the backbones of object detection and segmentation models on ImageNet classification. This practice has recently been challenged by He et al., among others, who show the surprising result that such ImageNet pre-training does not improve accuracy on the COCO dataset.

A stark contrast to pre-training is self-training. Suppose we want to use ImageNet to help COCO object detection; under self-training, we first discard the labels on ImageNet. We then train an object detection model on COCO, and use it to generate pseudo labels on ImageNet. The pseudo-labeled ImageNet and labeled COCO data are then combined to train a new model. The recent successes of self-training raise the question of to what degree self-training works better than pre-training. Can self-training work well in the exact setup, using ImageNet to improve COCO, where pre-training fails?

Our work studies self-training with a focus on answering the above question. We define a set of control experiments where we use ImageNet as additional data with the goal of improving COCO. We vary the amount of labeled data in COCO and the strength of data augmentation as control factors. Our experiments show that as we increase the strength of data augmentation or the amount of labeled data, the value of pre-training diminishes. In fact, with our strongest data augmentation, pre-training significantly hurts accuracy (-1.0AP), a surprising result that was not seen by He et al. Our experiments then show that self-training interacts well with data augmentation: stronger data augmentation not only does not hurt self-training, but actually helps it. Under the same data augmentation, self-training yields a +1.3AP improvement using the same ImageNet dataset. This is another striking result because it shows that self-training works well in exactly the setup where pre-training fails. These two results provide a positive answer to the above question.

An increasingly popular pre-training method is self-supervised learning. Self-supervised learning methods pre-train on a dataset without using labels, in the hope of building more universal representations that work across a wider variety of tasks and datasets. We study ImageNet models pre-trained using a state-of-the-art self-supervised learning technique and compare them to standard supervised ImageNet pre-training on COCO. We find that self-supervised pre-trained models using SimCLR have similar performance to supervised ImageNet pre-training. Both methods hurt COCO performance in the high data/strong augmentation setting, where self-training helps. Our results suggest that both supervised and self-supervised pre-training methods fail to scale as the labeled dataset size grows, while self-training is still useful.

Our work however does not dismiss the use of pre-training in computer vision. Fine-tuning a pre-trained model is faster than training from scratch and self-training in our experiments. The speedup ranges from 1.3x to 8x depending on the pre-trained model quality, strength of data augmentation, and dataset size. Pre-training can also benefit applications where collecting sufficient labeled data is difficult. In such scenarios, pre-training works well; but self-training also benefits models with and without pre-training. For example, our experiment with PASCAL segmentation dataset shows that ImageNet pre-training improves accuracy, but self-training provides an additional +1.3% mIOU boost on top of pre-training. The fact that the benefit of pre-training does not cancel out the gain by self-training, even when utilizing the same dataset, suggests the generality of self-training.

Taking a step further, we explore the limits of self-training on the COCO and PASCAL datasets, thereby demonstrating the method’s flexibility. We perform self-training on the COCO dataset with the Open Images dataset as the source of unlabeled data, and RetinaNet with SpineNet as the object detector. This combination achieves 54.3AP on the COCO test set, which is +1.5AP better than the strongest SpineNet model. On segmentation, we use the PASCAL aug set as the source of unlabeled data, and NAS-FPN with EfficientNet-L2 as the segmentation model. This combination achieves 90.5% mIOU on the PASCAL VOC 2012 test set, which surpasses the previous state-of-the-art accuracy of 89.0% mIOU, obtained with an additional 300M labeled images. These results confirm another benefit of self-training: it is very flexible about unlabeled data sources, model architectures and computer vision tasks.

Related Work

Pre-training has received much attention throughout the history of deep learning (see and references therein). The resurgence of deep learning in the 2000s also began with unsupervised pre-training . The success of unsupervised pre-training in NLP has revived much interest in unsupervised pre-training in computer vision, especially contrastive training . In practice, supervised pre-training is highly successful in computer vision. For example, many studies, e.g., , have shown that ConvNets pre-trained on ImageNet, Instagram, and JFT can provide strong improvements for many downstream tasks.

Supervised ImageNet pre-training is the most widely-used initialization method for machine vision (e.g., ). Shen et al. and He et al., however, demonstrate that ImageNet pre-training does not work well if we consider a much different task such as COCO object detection. Ghiasi et al. find that a model trained from random initialization outperforms the ImageNet pre-trained model on COCO object detection when strong regularization is applied. Poudel et al. also show that ImageNet pre-training is not necessary for semantic segmentation with CityScapes if aggressive data augmentation is applied. Furthermore, Raghu et al. show that ImageNet pre-training does not improve medical image classification tasks. Compared to these previous works, our work takes a step further and studies the role of pre-training in computer vision in greater detail with stronger data augmentation, different pre-training methods (supervised and self-supervised), and different pre-trained checkpoint qualities.

Our paper does not study targeted pre-training in depth, e.g., using an object detection dataset to improve another object detection dataset, for two reasons. Firstly, targeted pre-training is expensive and not scalable. Secondly, there exists evidence that pre-training on a dataset that is the same as the target task still can fail to yield improvements. For example, Shao et al. found that pre-training on the Open Images object detection dataset actually hurts COCO performance. More analysis of targeted pre-training can be found in .

Our work argues for the scalability and generality of self-training (e.g., ). Recently, self-training has shown significant progress in deep learning (e.g., image classification , machine translation , and speech recognition ). Most closely related to our work is Xie et al. who also use strong data augmentation in self-training but for image classification. Closer in applications are semi-supervised learning for detection and segmentation (e.g., ), who only study self-training in isolation or without a comparison against ImageNet pre-training. They also do not consider the interactions between these training methods and data augmentations.

Methodology

We use four different augmentation policies of increasing strength that work for both detection and segmentation. This allows us to vary the strength of data augmentation in our analysis. We design our augmentation policies based on the standard flip and crop augmentation in the literature, AutoAugment, and RandAugment. The standard flip and crop policy consists of horizontal flips and scale jittering. The random jittering operation resizes an image to between 0.8 and 1.2 times the target image size and then crops it. AutoAugment and RandAugment were originally designed with the standard scale jittering. We increase the scale jittering range to (0.5, 2.0) in AutoAugment and RandAugment and find that performance improves significantly. For RandAugment we use a magnitude of 10 for all models. We arrive at four data augmentation policies for experimentation: FlipCrop, AutoAugment, AutoAugment with higher scale jittering, and RandAugment with higher scale jittering. Throughout the text we refer to them as Augment-S1, Augment-S2, Augment-S3 and Augment-S4, respectively. The last three augmentation policies are stronger than that of He et al., who use only a FlipCrop-based strategy.
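As a rough illustration of the scale jittering described above, the following sketch samples a scale factor, resizes the image relative to the target size, and picks a random crop window. The function name `scale_jitter`, the aspect-ratio-preserving resize, and the uniform choice of crop offset are assumptions made for this example; the text only states that the image is resized to a fraction of the target size and then cropped.

```python
import random

def scale_jitter(image_hw, target_hw, scale_range, rng):
    """Sample one scale-jitter transform; returns (resized_hw, crop_box)."""
    img_h, img_w = image_hw
    tgt_h, tgt_w = target_hw
    s = rng.uniform(scale_range[0], scale_range[1])
    # Assumption: resize (keeping aspect ratio) to fit inside s * target size.
    ratio = min(s * tgt_h / img_h, s * tgt_w / img_w)
    new_h, new_w = round(img_h * ratio), round(img_w * ratio)
    # Crop a window no larger than the target; any shortfall would be padded.
    crop_h, crop_w = min(new_h, tgt_h), min(new_w, tgt_w)
    top = rng.randint(0, new_h - crop_h)
    left = rng.randint(0, new_w - crop_w)
    return (new_h, new_w), (top, left, crop_h, crop_w)

rng = random.Random(0)
standard = scale_jitter((480, 640), (640, 640), (0.8, 1.2), rng)  # Augment-S1 style range
stronger = scale_jitter((480, 640), (640, 640), (0.5, 2.0), rng)  # wider range used in S3/S4
```

With the wider (0.5, 2.0) range, the same image can appear both much smaller and much larger than the target, which is what makes the policy a stronger augmentation.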

Pre-training:

To evaluate the effectiveness of pre-training, we study ImageNet pre-trained checkpoints of varying quality. To control for model capacity, all checkpoints use the same model architecture but have different accuracies on ImageNet (as they were trained differently). We use the EfficientNet-B7 architecture as a strong baseline for pre-training. For the EfficientNet-B7 architecture, there are two available checkpoints (see https://github.com/tensorflow/tpu/tree/master/models/official/efficientnet): 1) the EfficientNet-B7 checkpoint trained with AutoAugment that achieves 84.5% top-1 accuracy on ImageNet; 2) the EfficientNet-B7 checkpoint trained with the Noisy Student method, which utilizes an additional 300M unlabeled images, that achieves 86.9% top-1 accuracy. We denote these two checkpoints as ImageNet and ImageNet++, respectively. Training from a random initialization is denoted by Rand Init. All of our baselines are therefore stronger than those of He et al., who only use ResNets for their experimentation (the EfficientNet-B7 checkpoint has an approximately 8% higher accuracy than a ResNet-50 checkpoint). Table 1 summarizes our notations for data augmentations and pre-trained checkpoints.

Self-training:

We use a simple self-training method inspired by which consists of three steps. First, a teacher model is trained on the labeled data (e.g., COCO dataset). Then the teacher model generates pseudo labels on unlabeled data (e.g., ImageNet dataset). Finally, a student is trained to optimize the loss on human labels and pseudo labels jointly. Our experiments with various hyperparameters and data augmentations indicate that self-training with this standard loss function can be unstable. To address this problem, we implement a loss normalization technique, which is described in Appendix B.
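The three steps above can be sketched on a toy problem. The example below replaces the detector with a one-dimensional nearest-centroid classifier purely for illustration; `fit_centroids`, `predict`, and the data are hypothetical and are not the models or datasets used in the paper.

```python
def fit_centroids(points, labels):
    """Fit a nearest-centroid classifier: one mean per class."""
    sums, counts = {}, {}
    for x, y in zip(points, labels):
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}

def predict(centroids, x):
    """Return the class whose centroid is closest to x."""
    return min(centroids, key=lambda y: abs(x - centroids[y]))

labeled_x, labeled_y = [0.0, 1.0, 9.0, 10.0], [0, 0, 1, 1]  # stands in for COCO
unlabeled_x = [2.0, 3.0, 7.0, 8.0]                          # stands in for ImageNet

# Step 1: train the teacher on human-labeled data only.
teacher = fit_centroids(labeled_x, labeled_y)
# Step 2: the teacher generates pseudo labels on the unlabeled data.
pseudo_y = [predict(teacher, x) for x in unlabeled_x]
# Step 3: train the student jointly on human labels and pseudo labels.
student = fit_centroids(labeled_x + unlabeled_x, labeled_y + pseudo_y)
```

In this toy setting, fitting on the union of both sets plays the role of optimizing the joint loss; the actual method weights the two loss terms, as discussed in Appendix B.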

Additional Experimental Settings

Object Detection:

We use the COCO dataset (118k images) for supervised learning. In self-training, we experiment with ImageNet (1.2M images) and OpenImages (1.7M images) as unlabeled datasets. We adopt a RetinaNet detector with an EfficientNet-B7 backbone and feature pyramid networks in the experiments. We use image size $640\times 640$, pyramid levels from $P_{3}$ to $P_{7}$ and 9 anchors per pixel. The training batch size is 256 with weight decay 1e-4. The model is trained with learning rate 0.32 and a cosine learning rate decay schedule. At the beginning of training the learning rate is linearly increased over the first 1000 steps from 0.0032 to 0.32. All models are trained using synchronous Batch Normalization. For all experiments using different augmentation strengths and dataset sizes, we allow each model to train until it converges (when training longer stops helping or even hurts performance on a held-out validation set). For example, training takes 45k iterations with Augment-S1 and 120k iterations with Augment-S4, when both models are randomly initialized. For results using SpineNet, we use the model architecture and hyper-parameters reported in the SpineNet paper. When we use SpineNet, due to memory constraints we reduce the batch size from 256 to 128 and scale the learning rate by half. The hyper-parameters, except batch size and learning rate, follow the default implementation in the SpineNet open-source repository (https://github.com/tensorflow/tpu/tree/master/models/official/detection). All SpineNet models also use Soft-NMS with a sigma of 0.3. In self-training, we use a hard score threshold of 0.5 to generate pseudo box labels. We use a total batch size of 512, with 256 images from COCO and 256 from the pseudo-labeled dataset. The other training hyper-parameters remain the same as those in supervised training. For all experiments the parameters of the student model are initialized by the teacher model to save training time. Experimental analysis studying the impact of student model initialization during self-training can be found in Appendix C.
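The learning rate schedule described above (linear warmup from 0.0032 to 0.32 over 1000 steps, followed by cosine decay) might be written as follows. This is a sketch under two assumptions that the text does not spell out: the cosine phase starts right after warmup, and it decays to zero at the final step. The function name `learning_rate` is hypothetical.

```python
import math

def learning_rate(step, total_steps, peak_lr=0.32, warmup_start=0.0032, warmup_steps=1000):
    """Linear warmup followed by cosine decay (assumed to reach zero at total_steps)."""
    if step < warmup_steps:
        frac = step / warmup_steps
        return warmup_start + frac * (peak_lr - warmup_start)
    # Fraction of the post-warmup phase that has elapsed, clipped to [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * min(1.0, progress)))

# Example: the 45k-iteration Augment-S1 run with batch size 256.
lr_mid_warmup = learning_rate(500, 45_000)
lr_at_peak = learning_rate(1_000, 45_000)
```

For the SpineNet runs, where the batch size is halved to 128, the same function could be called with `peak_lr=0.16` to reflect the halved learning rate; how the warmup start value is scaled in that case is not stated in the text.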

Semantic Segmentation:

We use the train set (1.5k images) of PASCAL VOC 2012 segmentation dataset for supervised learning. In self-training, we experiment with augmented PASCAL dataset (9k images), COCO (240k images, combining labeled and unlabeled datasets), and ImageNet (1.2M images). We adopt a NAS-FPN model architecture with EfficientNet-B7 and EfficientNet-L2 backbone models. Our NAS-FPN model uses 7 repeats with depth-wise separable convolution. We use pyramid levels from $P_{3}$ to $P_{7}$ and upsample all feature levels to $P_{2}$ and then merge them by a sum operation. We apply 3 layers of $3\times 3$ convolutions after the merged features and then attach a $1\times 1$ convolution for 21 class prediction. The learning rate is set to 0.08 for EfficientNet-B7 and 0.2 for EfficientNet-L2 with batch size 256 and weight decay 1e-5. All models are trained with a cosine learning rate decay schedule and use synchronous Batch Normalization. EfficientNet-B7 is trained for 40k iterations and EfficientNet-L2 for 20k iterations. For self-training, we use a batch size of 512 for EfficientNet-B7 and 256 for EfficientNet-L2. Half of the batch consists of supervised data and the other half pseudo data. Other hyper-parameters follow those used in supervised training. Additionally, we use a hard score threshold of 0.5 to generate segmentation masks and pixels with a smaller score are set to the ignore label. Lastly, we apply multi-scale inference augmentation with scales of (0.5, 0.75, 1, 1.25, 1.5, 1.75) to compute the segmentation masks for pseudo labeling.
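The pseudo-labeling rule for segmentation (hard score threshold of 0.5, low-score pixels set to the ignore label, predictions taken over multiple scales) can be sketched as below. Two details are assumptions for this example: per-scale class probabilities are combined by simple averaging after being resized to a common resolution, and the ignore label is 255, the value conventionally used for void pixels in PASCAL VOC. The function name `pseudo_mask` is hypothetical.

```python
IGNORE_LABEL = 255  # assumption: conventional PASCAL VOC void index

def pseudo_mask(scores_per_scale, threshold=0.5):
    """Build a pseudo-label mask from per-scale [H][W][C] class probabilities."""
    n_scales = len(scores_per_scale)
    height = len(scores_per_scale[0])
    width = len(scores_per_scale[0][0])
    mask = []
    for i in range(height):
        row = []
        for j in range(width):
            n_classes = len(scores_per_scale[0][i][j])
            # Assumption: average the class probabilities over all scales.
            avg = [sum(s[i][j][k] for s in scores_per_scale) / n_scales
                   for k in range(n_classes)]
            best = max(range(n_classes), key=lambda k: avg[k])
            # Keep confident pixels; everything else is ignored in the loss.
            row.append(best if avg[best] >= threshold else IGNORE_LABEL)
        mask.append(row)
    return mask

scale_a = [[[0.8, 0.1, 0.1], [0.4, 0.3, 0.3]]]
scale_b = [[[0.6, 0.3, 0.1], [0.2, 0.45, 0.35]]]
mask = pseudo_mask([scale_a, scale_b])
```

In this small example the first pixel is confidently assigned to class 0, while the second pixel has no class above the threshold and is marked as ignored.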

In Appendix H, we show the optimal training iterations and loss hyperparameters used for all of our experiments.

Experiments

This section expands on the findings of He et al. who study the weaknesses of pre-training on the COCO dataset as they vary the size of the labeled dataset. Similar to their study, we use ImageNet for supervised pre-training and vary the COCO labeled dataset size. Different from their study, we also change other factors: data augmentation strengths and pre-trained model qualities (see Section 3.1 for more details). As mentioned above, we use RetinaNet object detectors with the EfficientNet-B7 architecture as the backbone. Below are our key findings:

We analyze the impact of pre-training when we vary the augmentation strength. In Figure 1-Left, when we use the standard data augmentation (Augment-S1), pre-training helps. But as we increase the data augmentation strength, the value of pre-training diminishes.

Furthermore, in the stronger augmentation regimes, we observe that pre-training actually hurts performance by a large amount (-1.0 AP). This result was not observed by He et al. , as pre-training only slightly hurts performance (-0.4AP) or is neutral in their experiments.

More labeled data diminishes the value of pre-training.

Next, we analyze the impact of pre-training when varying the labeled dataset size. Figure 1-Right shows that pre-training is helpful in the low-data regimes (20%) and neutral or harmful in the high-data regime. This result is mostly consistent with the observation in He et al. . One new finding here is that the checkpoint quality does correlate with the final performance in the low data regime (ImageNet++ performs best on 20% COCO).

The effects of augmentation and labeled dataset size on self-training

In this section, we analyze self-training and contrast it with the above results. For consistency, we will continue to use COCO object detection as the task of interest, and ImageNet as the self-training data source. Unlike pre-training, self-training only treats ImageNet as unlabeled data. Again, we use RetinaNet object detectors with the EfficientNet-B7 architecture as the backbone to be compatible with previous experiments. Below are our key findings:

Similar to the previous section, we first analyze the performance of object detectors as we vary the data augmentation strength. Table 2 shows the performance of self-training across the four data augmentation methods, and compares them against supervised learning (Rand Init) and pre-training (ImageNet Init). Here we also show the gain/loss of self-training and pre-training to the baseline. The results confirm that in the scenario where pre-training hurts (strong data augmentations: Augment-S2, Augment-S3, Augment-S4), self-training helps significantly. It provides a boost of more than +1.3AP on top of the baseline, when pre-training hurts by -1.0AP. Similar results are obtained on ResNet-101 (see Appendix E).

Self-training works across dataset sizes and is additive to pre-training.

Next we analyze the performance of self-training as we vary the COCO labeled dataset size. As can be seen from Table 3, self-training benefits object detectors across dataset sizes, from small to large, regardless of pre-training methods. Most importantly, at the high data regime of 100% labeled set size, self-training significantly improves all models while pre-training hurts.

In the low data regime of 20%, self-training enjoys the biggest gain of +3.4AP on top of Rand Init. This gain is bigger than the gain achieved by ImageNet Init (+2.6AP). Although the self-training gain is smaller than the gain by ImageNet++ Init, ImageNet++ Init uses 300M additional unlabeled images.

Self-training is quite additive with pre-training even when using the same data source. For example, in the 20% data regime, utilizing an ImageNet pre-trained checkpoint yields a +2.6AP boost. Using both pre-training and self-training with ImageNet yields an additional +2.7AP gain. The additive benefit of combining pre-training and self-training is observed across all of the dataset sizes.

Self-supervised pre-training also hurts when self-training helps in high data/strong augmentation regimes

The previous experiments show that ImageNet pre-training hurts accuracy, especially in the highest data and strongest augmentation regime. Under this regime, we investigate another popular pre-training method: self-supervised learning.

The primary goal of self-supervised learning, pre-training without labels, is to build universal representations that are transferable to a wider variety of tasks and datasets. Since supervised ImageNet pre-training hurts COCO performance, self-supervised learning techniques that do not use label information could potentially help. In this section, we focus on COCO in the highest data (100% COCO dataset) and strongest augmentation (Augment-S4) setting. Our goal is to compare random initialization against a model pre-trained with a state-of-the-art self-supervised algorithm. For this purpose, we choose a checkpoint that is pre-trained with the SimCLR framework on ImageNet. We use the checkpoint before it is fine-tuned on ImageNet labels. All backbone models use ResNet-50, as SimCLR only uses ResNets in their work.

The results in Table 4 reveal that the self-supervised pre-trained checkpoint hurts performance just as much as supervised pre-training on the COCO dataset. Both pre-trained models decrease performance by -0.7AP over using a randomly initialized model. Once again we see self-training improving performance by +0.8AP when both pre-trained models hurt performance. Even though both self-supervised learning and self-training ignore the labels, self-training seems to be more effective at using the unlabeled ImageNet data to help COCO.

Exploring the limits of self-training and pre-training

In this section we combine our knowledge about the interactions of data augmentation, self-training and pre-training to improve the state-of-the-art. Below are our key results:

In this experiment, we use self-training and Augment-S3 as the augmentation method. The previous experiments on full COCO suggest that ImageNet pre-training hurts performance, so we do not use it. Although the control experiments use EfficientNet and ResNet backbones, we use SpineNet in this experiment as it is closer to the state-of-the-art. For self-training, we use Open Images Dataset (OID) as the self-training unlabeled data, which we found to be better than ImageNet (for more information about the effects of data sources on self-training, see Appendix F). Note that OID is found to not be helpful on COCO by pre-training in .

Table 5 shows our results on the two largest SpineNet models, and compares them against previous best single-model, single-crop performance on this dataset. For the largest SpineNet model we improve upon the best 52.8AP SpineNet model by +1.5AP to achieve 54.3AP. Across all model variants, we obtain at least a +1.5AP gain.

PASCAL VOC Semantic Segmentation.

For this experiment, we use NAS-FPN architecture with EfficientNet-B7 and EfficientNet-L2 as the backbone architectures. Due to PASCAL’s small dataset size, pre-training still matters much here. Hence, we use a combination of pre-training, self-training and strong data augmentation for this experiment. For pre-training, we use the ImageNet++ initialization for the EfficientNet backbones. For augmentation, we use Augment-S4. We use the aug set of PASCAL as the additional data source for self-training, which we found to be more effective than ImageNet.

Table 6 shows that our method improves the state-of-the-art by a large margin. We achieve 90.5% mIOU on the PASCAL VOC 2012 test set using single-scale inference, outperforming the old state-of-the-art of 89% mIOU, which utilizes multi-scale inference. For PASCAL, we find pre-training with a good checkpoint to be crucial; without it we achieve 41.5% mIOU. Interestingly, our model improves the previous state-of-the-art by 1.5% mIOU even while using far fewer human labels in training. Our method uses labeled data from ImageNet (1.2M images) and PASCAL train segmentation (1.5k images). In contrast, previous state-of-the-art models used 250x additional labeled classification data for pre-training, JFT (300M images), and 86x additional labeled segmentation data, COCO (120k images) and PASCAL aug (9k images). For a visualization of pseudo labeled images, see Appendix G.
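The 250x and 86x figures follow from the dataset sizes quoted above. A short check, assuming the ratios are taken against ImageNet (for classification data) and PASCAL train (for segmentation data), which is the reading that reproduces both numbers:

```python
# Dataset sizes as quoted in the text (approximate image counts).
jft, imagenet = 300_000_000, 1_200_000
coco, pascal_aug, pascal_train = 120_000, 9_000, 1_500

# Labeled classification data: JFT relative to ImageNet.
classification_ratio = jft / imagenet
# Labeled segmentation data: COCO plus PASCAL aug relative to PASCAL train.
segmentation_ratio = (coco + pascal_aug) / pascal_train

print(classification_ratio, segmentation_ratio)  # 250.0 86.0
```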

Discussion

One of the grandest goals of computer vision is to develop universal feature representations that can solve many tasks. Our experiments show the limitation of learning universal representations from both classification and self-supervised tasks, demonstrated by the performance differences in self-training and pre-training. Our intuition for the weak performance of pre-training is that pre-training is not aware of the task of interest and can fail to adapt. Such adaptation is often needed when switching tasks because, for example, good features for ImageNet may discard positional information which is needed for COCO. We argue that jointly training the self-training objective with supervised learning is more adaptive to the task of interest. We suspect that this leads self-training to be more generally beneficial.

The benefit of joint-training.

A strength of the self-training paradigm is that it jointly trains the supervised and self-training objectives, thereby addressing the mismatch between them. But perhaps we can jointly train ImageNet and COCO to address this mismatch too? Table 7 shows results for joint-training, where ImageNet classification is trained jointly with COCO object detection (we use the exact setup as self-training in this experiment). Our results indicate that ImageNet pre-training yields a +2.6AP improvement, but using a random initialization and joint-training gives a comparable gain of +2.9AP. This improvement is achieved by training 19 epochs over the ImageNet dataset. Most ImageNet models that are used for fine-tuning require much longer training. For example, the ImageNet Init (supervised pre-trained model) needed to be trained for 350 epochs on the ImageNet dataset.

Moreover, pre-training, joint-training and self-training are all additive using the same ImageNet data source (last column of the table). ImageNet pre-training gets a +2.6AP improvement, pre-training + joint-training gets a +0.7AP improvement, and pre-training + joint-training + self-training achieves a +3.3AP improvement.

The importance of task alignment.

One interesting result in our experiments is that ImageNet pre-training, even with additional human labels, performs worse than self-training. We verify the same phenomenon on the PASCAL dataset. On PASCAL, the aug set is often used as an additional dataset, and it has much noisier labels than the train set. Our experiment shows that with strong data augmentation (Augment-S4), training with train+aug actually hurts accuracy. Meanwhile, pseudo labels generated by self-training on the same aug dataset significantly improve accuracy. Both results suggest that noisy (PASCAL) or un-targeted (ImageNet) labeling is worse than targeted pseudo labeling.

It is worth mentioning that Shao et al. report that pre-training on Open Images hurts performance on COCO, despite both datasets being annotated with bounding boxes. This means that for pre-training to be really beneficial, we want not only the task but also the annotations to be the same. Self-training, on the other hand, is very general and can use Open Images successfully to improve COCO performance (see Appendix F), a result that suggests self-training can align well to the task of interest.

Limitations.

There are still limitations to current self-training techniques. In particular, self-training requires more compute than fine-tuning on a pre-trained model. The speedup thanks to pre-training ranges from 1.3x to 8x depending on the pre-trained model quality, strength of data augmentation, and dataset size. Good pre-trained models are also needed for low-data applications like PASCAL segmentation.

The scalability, generality and flexibility of self-training.

Our experimental results highlight important advantages of self-training. First, in terms of flexibility, self-training works well in every setup that we tried: low data regime, high data regime, weak data augmentation and strong data augmentation. Self-training is also effective with different architectures (ResNet, EfficientNet, SpineNet, FPN, NAS-FPN), data sources (ImageNet, OID, PASCAL, COCO) and tasks (Object Detection, Segmentation). Secondly, in terms of generality, self-training works well both when pre-training fails and when pre-training succeeds. Thirdly, in terms of scalability, self-training continues to perform well as we have more labeled data and better models. One bitter lesson in machine learning is that most methods fail when we have more labeled data or more compute or better supervised training recipes, but that does not seem to apply to self-training.

Broader and Social Impact

Our paper studies self-training, a machine learning technique, with applications in object detection and segmentation. As a core machine learning method, self-training can enable machine learning methods to work better and with less data. It should therefore have broader applications in computer vision and in other fields such as speech recognition, NLP, and bioinformatics. The datasets in our study are generic and publicly available, and are not tied to any specific application. We foresee positive impacts if the method is applied to datasets in self-driving or healthcare. But the method can also be applied to other datasets and sensitive applications that have ethical implications, such as mass surveillance.

Acknowledgements

We thank Anelia Angelova, Aravind Srinivas, and Mingxing Tan for comments and suggestions.

Appendix A Other Related Work

Self-training is related to the method of pseudo labels and consistency training . There are many differences between these works and ours. First, self-training is different from consistency training in that self-training uses two models (a teacher and a student) whereas consistency training uses only one model. Secondly, previous works focus on image classification, whereas our work studies object detection and segmentation. Finally, previous works do not study the interactions between self-training and pre-training under modern data augmentation.

Appendix B Loss Normalization Analysis

In this work we noticed that the standard loss for self-training, $\hat{L}=L_{h}+\alpha L_{p}$, can be quite unstable. This is caused by the total loss magnitude changing drastically as $\alpha$ is varied. We thus implement a Loss Normalization method, which stabilizes self-training as we vary $\alpha$: $\hat{L}=\frac{1}{1+\alpha}\left(L_{h}+\alpha\frac{\bar{L}_{h}}{\bar{L}_{p}}L_{p}\right)$, where $L_{h}$, $L_{p}$, $\bar{L}_{h}$ and $\bar{L}_{p}$ are the human loss, the pseudo loss and their respective moving averages over training. All moving averages are exponential moving averages with a decay rate of 0.9997.
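A minimal sketch of the normalized loss above, operating on scalar loss values. The class name `LossNormalizer` and the choice to initialize each moving average with the first observed value are assumptions for this example; in a real training framework the ratio of moving averages would presumably be treated as a constant with respect to gradients, which the text does not state.

```python
class LossNormalizer:
    """Combine human and pseudo losses with moving-average rescaling."""

    def __init__(self, alpha, decay=0.9997):
        self.alpha = alpha
        self.decay = decay
        self.avg_h = None  # moving average of the human loss
        self.avg_p = None  # moving average of the pseudo loss

    def _ema(self, avg, value):
        # Assumption: the first observation initializes the average.
        if avg is None:
            return value
        return self.decay * avg + (1.0 - self.decay) * value

    def __call__(self, loss_h, loss_p):
        self.avg_h = self._ema(self.avg_h, loss_h)
        self.avg_p = self._ema(self.avg_p, loss_p)
        # Rescale the pseudo loss to the magnitude of the human loss.
        scale = self.avg_h / self.avg_p
        return (loss_h + self.alpha * scale * loss_p) / (1.0 + self.alpha)

# The standard loss L_h + alpha * L_p would be 1 + 4 * alpha here.
totals = [LossNormalizer(alpha)(1.0, 4.0) for alpha in (0.5, 1.0, 4.0)]
```

When both losses sit at their moving averages, the normalized total equals the human loss for any $\alpha$, which illustrates why the overall loss magnitude stays stable as $\alpha$ is varied.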

Figure 2 shows the Loss Normalization performance as we vary the data augmentation strength, training iterations, learning rate and $\alpha$ . These experiments are performed on RetinaNet with a ResNet-101 backbone architecture on COCO object detection. ImageNet is used as the dataset for self-training. As can be seen from the figure, Loss Normalization gets better results in almost all settings, and more importantly, helps avoid training instability when $\alpha$ is large. Across all settings of varying augmentations, training iterations and learning rates we find Loss Normalization achieves an average of +0.4 AP performance over the standard loss combination. Importantly, it also helps in our highest performing Augment-S4 setting by +1.3 AP.

Recent self-training works typically fix the $\alpha$ parameter to one across all of their experiments. We find in many of our experiments that setting $\alpha$ to one is sub-optimal and that the optimal $\alpha$ changes as the training iterations and augmentation strength vary. Table 9 shows the optimal values of $\alpha$ as augmentation and training iterations vary. As the augmentation strength increases, the optimal $\alpha$ decreases. Additionally, as the number of training iterations increases, the optimal $\alpha$ increases.

Appendix C Student Model Initialization Study for Self-training

In this section we study how the student model should be initialized in self-training. Table 10 shows the results of initializing the student from the teacher weights and using random initialization. Across all four augmentation regimes we observe similar performance between the two settings, with initializing from the teacher weights doing slightly better (0.3-0.4 AP). Besides the increased performance, initializing the student with the teacher weights has the added benefit of faster convergence. Across all augmentation regimes, the student model trained from scratch had to train on average 2.25 times as long as the student model initialized with the teacher weights. Therefore, for all experiments in the paper we initialize the student with the teacher weights.

Appendix D Further Study of Augmentation, Supervised Dataset Size, and Pre-trained Model Quality

We expand upon our previous analysis in Section 4.1 and show how all four augmentation strengths across different COCO sizes interact with pre-trained checkpoint quality on COCO. Figure 3 shows the interaction of all these factors. We again observe the same three points: 1) stronger data augmentation diminishes the value of pre-training, 2) pre-training hurts performance if stronger data augmentation is used, and 3) more supervised data diminishes the value of pre-training. Across all augmentations and data sizes we also observe the better ImageNet pre-trained checkpoint, ImageNet++ , outperforming the standard ImageNet pre-trained model. Interestingly, in the three out of four augmentation regimes where pre-training hurts, the better the pre-trained checkpoint quality, the less it hurts.

As a case study in the low data regime, we study the impact of pre-trained checkpoint quality and augmentation strength on PASCAL VOC 2012. The results in Table 11 indicate that for training on the PASCAL train dataset, which only has 1.5k images, checkpoint quality is very important and improves results significantly. We observe that the gain from checkpoint quality begins to diminish as the augmentation strength increases. Additionally, the performance of the ImageNet checkpoint is again correlated with the performance on PASCAL VOC.

Appendix E ResNet-101 Self-training Performance on COCO

In the paper we presented our experimental results on COCO with RetinaNet using EfficientNet-B7 and SpineNet backbones. Self-training also works well on other backbones, such as ResNet-101 . Our results are presented in Table 12. Again, self-training achieves strong improvements across all augmentation strengths.

Appendix F The Effects of Unlabeled Data Sources on Self-Training

An important question raised by recent experiments is how changing the additional dataset source affects self-training performance. In our analysis we use ImageNet, a dataset designed for image classification that mostly contains iconic images. The image contents are known to be quite different from COCO, PASCAL, and Open Images, which contain more non-iconic images. Iconic images typically show only one object in its canonical view, while non-iconic images capture multiple objects in a scene with their natural views . Table 13 studies how changing the additional data from ImageNet to the Open Images Dataset impacts the performance of self-training. Switching the additional dataset source improves the performance of self-training by up to +0.6 AP over using ImageNet across varying data augmentation strengths on COCO. Interestingly, the Open Images Dataset was found to not help COCO by pre-training in , but we do see improvements using it over ImageNet for self-training.

We also study the effects of changing the additional dataset source on PASCAL VOC 2012. In Table 14, we observe that changing the additional data source from ImageNet to COCO improves performance across all of our augmentation strengths. The best self-training dataset is the PASCAL aug set, which is in-domain data for the PASCAL task. The PASCAL aug set, which has only about 9k images, improves performance more than COCO with 240k images.

Appendix G Visualization of Pseudo Labels in Self-training

The original PASCAL VOC 2012 dataset contains 1464 labeled images in the train set. Extra annotations are provided by resulting in 10582 images in train+aug. Most previous works have used the train+aug set for training. However, we find that with strong data augmentation, training with the aug set actually hurts performance (see Table 8). Figure 4 includes selected examples from the aug set. We observe that the annotations in the aug set are less accurate than those in the train set. For example, some of the images do not include annotations for all the objects in the image, and the segmentation masks are not precise. The third column of the figure shows pseudo labels generated by our teacher model. From the visualization, we observe that the pseudo labels can have more precise segmentation masks. Empirically, we find that training with this pseudo label set improves performance more than training with the human-annotated aug set (see Table 8).

ImageNet dataset:

Figure 5 shows segmentation pseudo labels generated by the teacher model on 14 randomly selected images from ImageNet. Interestingly, some of the ImageNet classes that don’t exist in the PASCAL VOC 2012 dataset are predicted as one of its 20 classes. For instance, saw and lizard are predicted as bird. Although the pseudo labels are noisy, they still improve the accuracy of the student model (Table 14).
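
One way to see why out-of-vocabulary objects end up with a PASCAL label is that the teacher can only choose among the classes it was trained on. The sketch below is a simplified illustration rather than the paper's pipeline: the function `pseudo_label`, the three-class label set, the per-pixel probabilities, and the 0.5 confidence threshold with an "ignore" label are all assumptions made for the example.

```python
def pseudo_label(pixel_scores, class_names, threshold=0.5, ignore="ignore"):
    """Turn per-pixel class probabilities into hard pseudo labels.

    pixel_scores: list of per-pixel probability lists, one entry per class.
    Pixels whose top score is below `threshold` receive the `ignore` label.
    """
    labels = []
    for scores in pixel_scores:
        # Index of the highest-scoring class known to the teacher.
        best = max(range(len(scores)), key=scores.__getitem__)
        labels.append(class_names[best] if scores[best] >= threshold else ignore)
    return labels


# Toy label set; a real PASCAL teacher has 20 classes plus background.
classes = ["background", "bird", "cat"]
scores = [
    [0.10, 0.80, 0.10],  # confident "bird", even if the object is a lizard
    [0.70, 0.20, 0.10],  # confident background
    [0.40, 0.35, 0.25],  # no class is confident enough
]
print(pseudo_label(scores, classes))  # ['bird', 'background', 'ignore']
```

Because the label set is closed, a lizard pixel has no "lizard" option, so a confident prediction necessarily falls on one of the known classes.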

Appendix H Optimal Model Training Iterations and Alpha Weighting

In all experiments, we allow our models to train until convergence (validation set performance no longer improves). For the self-training experiments we search over a few different alpha values: [0.25, 0.5, 1.0, 2.0, 3.0] (see Appendix B for more details). Below we list all of the optimal training iterations and alphas to promote reproducibility for all of our experiments. For each table the optimal training iteration found is represented by (45k), which means the model was trained for 45000 steps. The optimal alpha is represented as (1.0). An alpha value of (—) represents that no alpha is used in the experiment as our supervised learning experiments do not make use of pseudo labeled data. One table (Table 7) jointly trains ImageNet classification and COCO at the same time. For this setup we simply use a scalar to combine the ImageNet classification loss and the COCO loss, which is represented as (0.2). The total training loss is computed by $\textrm{Loss}_{\textrm{COCO}}+0.2\cdot\textrm{Loss}_{\textrm{ImageNet}}$ .
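
The alpha search and the joint-training loss described above can be sketched as follows. This is a minimal illustration under stated assumptions: `best_alpha`, `joint_loss`, and `toy_evaluate` are hypothetical names, and the toy scoring function is made up and does not reproduce any result from the paper. Only the alpha grid and the 0.2 weight come from the text.

```python
ALPHAS = [0.25, 0.5, 1.0, 2.0, 3.0]  # search grid listed in the text


def joint_loss(loss_coco, loss_imagenet, weight=0.2):
    """Combine the two losses as Loss_COCO + weight * Loss_ImageNet."""
    return loss_coco + weight * loss_imagenet


def best_alpha(evaluate, alphas=ALPHAS):
    """Return (alpha, score) for the alpha with the highest validation score.

    evaluate: callable mapping an alpha value to a validation score
    (higher is better). In practice this would train a student to convergence.
    """
    scores = {alpha: evaluate(alpha) for alpha in alphas}
    alpha = max(scores, key=scores.get)
    return alpha, scores[alpha]


def toy_evaluate(alpha):
    # Hypothetical stand-in for a full training run; peaks at alpha = 2.0.
    return 40.0 - (alpha - 2.0) ** 2


alpha, score = best_alpha(toy_evaluate)
total = joint_loss(loss_coco=1.5, loss_imagenet=2.0)
```

Each call to `evaluate` stands for a complete training run, so the cost of the search grows linearly with the number of alpha values tried.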