A Simple Semi-Supervised Learning Framework for Object Detection

Kihyuk Sohn, Zizhao Zhang, Chun-Liang Li, Han Zhang, Chen-Yu Lee, Tomas Pfister

Introduction

Semi-supervised learning (SSL) has received growing attention in recent years as it provides a means of using unlabeled data to improve model performance when large-scale annotated data is not available. A popular class of SSL methods is based on “Consistency-based Self-Training”. The key idea is to first generate artificial labels for the unlabeled data and then train the model to predict these artificial labels when fed the unlabeled data with semantics-preserving stochastic augmentations. The artificial label can be either a one-hot prediction (hard) or the model’s predictive distribution (soft). The other pillar of the success of SSL is the advancement of data augmentation. Data augmentations improve the robustness of deep neural networks and have been shown to be particularly effective for consistency-based self-training. Augmentation strategies range from manual combinations of basic image transformations, such as rotation, translation, flipping, or color jittering, to neural image synthesis and policies learned by reinforcement learning. Lately, complex data augmentation strategies, such as RandAugment or CTAugment, have turned out to be powerful for SSL on image classification.

Despite this remarkable progress, SSL methods have mostly been applied to image classification, whose labeling cost is low compared to that of other important problems in computer vision, such as object detection. Because of its expensive labeling cost, object detection demands a higher level of label efficiency, which necessitates the development of strong SSL methods. On the other hand, the majority of existing work on object detection has focused on training a stronger and faster detector given a sufficient amount of annotated data. The few existing works on SSL for object detection rely on additional context, such as categorical similarities of objects.

In this work, we leverage lessons learned from deep SSL on image classification to tackle SSL for object detection. To this end, we propose an SSL framework for object detection, called STAC, that combines self-training (via pseudo labels) and consistency regularization based on strong data augmentations. Inspired by the framework in Noisy-Student, our system contains two stages of training. In the first stage, we train an object detector (e.g., Faster RCNN) using all labeled data until convergence. The trained detector is then used to predict bounding boxes and class labels of localized objects for unlabeled images, as shown in Figure 2. Then, we apply confidence-based filtering to each predicted box (after NMS) with a high threshold value to obtain pseudo labels with high precision, inspired by the design of FixMatch. In the second stage, strong data augmentations are applied to each unlabeled image, and the model is trained with labeled data and with unlabeled data paired with the pseudo labels generated in the first stage. Encouraged by RandAugment and its successful adaptation to SSL and object detection, we design our augmentation strategy for object detection, which consists of global color transformation, global or box-level geometric transformations, and Cutout.

We test the efficacy of STAC on public datasets: MS-COCO and PASCAL VOC. We design new experimental protocols using the MS-COCO dataset to evaluate semi-supervised object detection performance. We use 1, 2, 5 and 10% of labeled data as labeled sets and the remainder as unlabeled sets to evaluate the effectiveness of SSL methods in the low-label regime. In addition, following , we evaluate using all labeled data as the labeled set and additional unlabeled data provided by MS-COCO as the unlabeled set. Following , we use trainval of VOC07 as the labeled set and that of VOC12, with or without unlabeled data of MS-COCO, as unlabeled sets. While being simple, STAC brings significant gains in mAP: 18.47 to 24.38 on the 5% protocol, 23.86 to 28.64 on the 10% protocol as in Figure 1, and 42.60 to 46.01 on PASCAL VOC.

Overall, the contribution of this paper is as follows:

We develop STAC, an SSL framework for object detection that seamlessly extends the class of state-of-the-art SSL methods for classification based on self-training and augmentation-driven consistency regularization.

STAC is simple and introduces only two new hyperparameters: the confidence threshold $\tau$ and the unsupervised loss weight $\lambda_{u}$ , which do not require an extensive additional effort for tuning.

We propose new experimental protocols for SSL object detection using MS-COCO and demonstrate the efficacy of STAC on MS-COCO and PASCAL VOC within the Faster RCNN framework.

Related Work

Object detection is a fundamental computer vision task and has been extensively studied in the literature . Popular object detection frameworks include Region-based CNN (RCNN) , YOLO , SSD , etc . The progress made by existing works is mainly on training a stronger or faster object detector given sufficient amount of annotated data. There is growing interest in improving detectors using unlabeled training data through a semi-supervised object detection framework . Before deep learning, the idea has been explored by . Recently, proposes a consistency-based semi-supervised object detection method, which enforces the consistent prediction of an unlabeled image and its flipped counterpart. Their method requires a more sophisticated Jensen-Shannon Divergence for consistency regularization computation. Similar ideas to consistency regularization have also been studied in the active learning settings for object detection . introduces a self-supervised proposal learning module to learn context-aware and noise-robust proposal features from unlabeled data. proposes data distillation that generates labels by ensembling predictions of multiple transformations of unlabeled data. We argue that stronger semi-supervised detectors require further investigation of unsupervised objectives and data augmentations.

Semi-supervised learning (SSL) for image classification has improved dramatically in recent years. Consistency regularization has become one of the most popular approaches among recent methods and has inspired work on object detection. The idea is to enforce the model to generate consistent predictions across label-preserving data augmentations. Some exemplars include Mean-Teacher, UDA, and MixMatch. Another popular class of SSL is pseudo labeling, which can be viewed as a hard version of consistency regularization: the model performs self-training to generate pseudo labels for unlabeled data and is then trained on randomly augmented unlabeled data to match the respective pseudo labels. How pseudo labels are used is critical to the success of SSL. For instance, Noisy-Student demonstrates an iterative teacher-student framework that repeats the process of assigning labels using a teacher model and then training a larger student model. This method achieves state-of-the-art performance on ImageNet classification by leveraging extra unlabeled images in the wild. FixMatch demonstrates a simple algorithm that outperforms previous approaches and establishes state-of-the-art performance, especially in diverse regimes with little labeled data. The key idea behind FixMatch is matching the prediction on the strongly-augmented unlabeled data to the pseudo label of the weakly-augmented counterpart when the model confidence on the weakly-augmented one is high. In light of the success of these methods, this paper exploits the effective usage of pseudo labeling and pseudo boxes as well as data augmentations to improve object detectors.

Data augmentations are critical to improving model generalization and robustness, and have gradually become a major impetus for semi-supervised learning. Finding appropriate color transformations and geometric transformations of input spaces has been shown to be critical to improving generalization. However, most augmentations have been studied mainly in image classification. The complexity of data augmentations for object detection is much higher than for image classification, since global geometric transformations of data affect bounding box annotations. Some works have presented augmentation techniques for supervised object detection, such as MixUp, CutMix, or augmentation strategy learning. The recent consistency-based SSL object detection method utilizes global horizontal flipping (weak augmentation) to construct the consistency loss. To the best of our knowledge, the impact of intensive data augmentations on semi-supervised object detection has not been thoroughly studied.

Methodology

Formulating an unsupervised loss that leverages unlabeled data is the key in SSL. Many advancements in SSL for classification rely on some forms of consistency regularization . Inspired by a comparison in , we provide a unified view of consistency regularization for image classification. For $K$ -way classification, the consistency regularization is written as follows:

We refer to the Appendix for configurations of SSL methods. State-of-the-art methods, such as Unsupervised Data Augmentation and FixMatch, apply strong data augmentation $\mathcal{A}$ , such as RandAugment or CTAugment, to the model prediction $p(\mathcal{A}(x);\theta)$ for improved robustness. Noisy-Student applies diverse forms of stochastic noise to the model prediction, including input augmentations via RandAugment and network augmentations via dropout and stochastic depth. While sharing similarities in the model prediction, these methods differ in $q$ , which generates the prediction target, as detailed in the Appendix. Different from Equation (2) and many aforementioned algorithms, Noisy-Student employs a “teacher” network other than $p(\cdot;\theta)$ to generate pseudo labels $q(x)$ . Note that the fixed teacher network allows offline pseudo label generation, which provides scalability to large unlabeled data and flexibility in the choice of architecture or optimization.
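
As a sketch of the unified view referred to above, assume a per-example weight $w(x)$ , a target mapping $q$ , and a divergence $\ell(\cdot,\cdot)$ such as cross-entropy or squared $L_2$ distance. The exact notation is an assumption, chosen to match the symbols $q$ , $p(\cdot;\theta)$ , and $w(\cdot)$ used in this paper. A consistency loss over a set of unlabeled images $\mathcal{X}$ can then be written as

$$\ell_{u} = \sum_{x\in\mathcal{X}} w(x)\,\ell\big(q(x),\,p(x;\theta)\big)$$

Under this reading, a hard pseudo label corresponds to a one-hot $q(x)$ , a soft target corresponds to using a predictive distribution as $q(x)$ , and confidence-based thresholding can be expressed through an indicator-valued $w(x)$ .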

3.2 STAC: SSL for Object Detection

We propose a simple SSL framework for object detection, called STAC, based on the Self-Training (via pseudo label) and the Augmentation driven Consistency regularization. First, we adopt a stage-wise training of Noisy-Student for its scalability and flexibility. This involves at least two stages of training, where in the first stage, we train a teacher model using all available labeled data, and in the second stage, we train STAC using both labeled and unlabeled data. Second, we use a high threshold value for the confidence-based thresholding inspired by FixMatch to control the quality of pseudo labels comprised of bounding boxes and their class labels in object detection. The steps for training STAC are summarized as follows:

Train a teacher model on available labeled images.

Generate pseudo labels of unlabeled images (i.e., bounding boxes and their class labels) using the trained teacher model.

Apply strong data augmentations to unlabeled images, and augment pseudo labels (i.e., bounding boxes) correspondingly when global geometric transformations are applied.

Compute unsupervised loss and supervised loss to train a detector.

Training a Teacher Model. We develop our formulation based on Faster RCNN, as it has been one of the most representative detection frameworks. Faster RCNN has classifier (CLS) and region proposal network (RPN) heads on top of a shared backbone network. Each head has two modules, namely a region classifier (e.g., a ( $K{+}1$ )-way classifier for the CLS head or a binary classifier for the RPN head) and a bounding box regressor (REG). For simplicity, we present the supervised and unsupervised losses of Faster RCNN for the RPN head only. The supervised loss is written as follows:

where $i$ is the index of an anchor in a mini-batch, $p_{i}$ is the predictive probability of an anchor being positive, and $t_{i}$ is the 4-dimensional coordinates of an anchor. $p_{i}^{*}$ is the binary label of an anchor with respect to ground-truth boxes, and $t_{i}^{*}$ is the ground-truth box coordinates of the box $i$ for all $p_{i}^{*}=1$ .
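
For reference, the standard RPN objective from the Faster RCNN paper is consistent with the symbols defined above. The normalizers $N_{cls}$ and $N_{reg}$ , the balancing weight $\lambda$ , and the names $L_{cls}$ (a binary log loss) and $L_{reg}$ (a smooth $L_1$ loss) follow that paper and are assumptions about the exact notation intended here:

$$\ell_{s} = \frac{1}{N_{cls}}\sum_{i} L_{cls}\big(p_{i},p_{i}^{*}\big) + \frac{\lambda}{N_{reg}}\sum_{i} p_{i}^{*}\,L_{reg}\big(t_{i},t_{i}^{*}\big)$$

The factor $p_{i}^{*}$ restricts the regression term to positive anchors, which is why $t_{i}^{*}$ only needs to be defined when $p_{i}^{*}=1$ .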

Generating Pseudo Labels. We perform test-time inference of the object detector from the teacher model to generate pseudo labels. That is, pseudo label generation involves not only the forward pass of the backbone, RPN and CLS networks, but also post-processing such as non-maximum suppression (NMS). This is different from conventional approaches for classification, where the confidence score is computed from the raw predictive probability. We use the score of each returned bounding box after NMS, which aggregates the prediction probabilities of anchor boxes. Using box predictions after NMS has an advantage over using raw predictions (before NMS) since it removes duplicate detections. However, this does not filter out boxes at wrong locations, as visualized in Figure 2 and Figure 5(a). We therefore apply confidence-based thresholding to further reduce potentially wrong pseudo boxes.
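
The post-processing described above can be illustrated with a minimal sketch in plain Python. It assumes detections are given as `(x1, y1, x2, y2)` boxes with a score and a class label, uses a simple greedy class-wise NMS, and then keeps boxes whose score is at least $\tau$ . The function names, the IoU threshold of 0.5, and the greedy NMS variant are illustrative assumptions, not the detector's actual implementation.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(boxes: Sequence[Box], scores: Sequence[float], iou_thresh: float = 0.5) -> List[int]:
    """Greedy NMS: visit boxes by descending score, drop those overlapping a kept box."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep: List[int] = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_thresh for j in keep):
            keep.append(i)
    return keep


def make_pseudo_labels(boxes, scores, labels, tau=0.9, iou_thresh=0.5):
    """Class-wise NMS followed by confidence thresholding at tau (hypothetical helper)."""
    kept: List[int] = []
    for cls in sorted(set(labels)):
        idx = [i for i, lab in enumerate(labels) if lab == cls]
        cls_keep = nms([boxes[i] for i in idx], [scores[i] for i in idx], iou_thresh)
        kept.extend(idx[k] for k in cls_keep)
    # Only high-confidence boxes survive as pseudo labels.
    return [(boxes[i], labels[i], scores[i]) for i in kept if scores[i] >= tau]


# Toy detections (made-up numbers): two overlapping boxes of class 1,
# one low-confidence box of class 1, and one confident box of class 2.
toy_boxes = [(0, 0, 10, 10), (1, 1, 10, 10), (20, 20, 30, 30), (50, 50, 60, 60)]
toy_scores = [0.95, 0.92, 0.40, 0.97]
toy_labels = [1, 1, 1, 2]
pseudo = make_pseudo_labels(toy_boxes, toy_scores, toy_labels, tau=0.9)
```

In this toy case NMS removes the duplicate of the first box, and thresholding removes the low-confidence box that NMS alone would have kept.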

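Finally, the RPN is trained by jointly minimizing the two losses, $\ell = \ell_{s} + \lambda_{u}\,\ell_{u}$ , where $\ell_{s}$ and $\ell_{u}$ denote the supervised and unsupervised losses and $\lambda_{u}$ is the unsupervised loss weight.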

STAC introduces two hyperparameters $\tau$ and $\lambda_{u}$ . In experiments, we find $\tau\,{=}\,0.9$ and $\lambda_{u}\,{\in}\,$ work well. Note that the consistency-based SSL object detection method in requires a sophisticated weighting schedule for $\lambda_{u}$ , including temporal ramp-up and ramp-down. In contrast, our framework is effective with a simple constant schedule, thanks to consistency regularization with strong data augmentation and confidence-based thresholding.

Data Augmentation Strategy. The key factor for the success of consistency-based SSL methods, such as UDA and FixMatch, is strong data augmentation. While the augmentation strategy for supervised and semi-supervised image classification has been extensively studied, not much effort has yet been made for object detection. We extend the RandAugment for object detection used in using the augmentation search space recently proposed by (e.g., box-level transformation) along with Cutout. We explore different variants of transformation operations and determine a group of effective combinations. Each operation has a magnitude that decides the strength of the augmentation. The range of degrees is empirically chosen without tuning.

Global color transformation (C): Color transformation operations in and the suggested ranges of magnitude for each op are used.

Global geometric transformation (G): Geometric transformation operations in , namely, x-y translation, rotation, and x-y shear, are used. The translation range in percentage is [ $-10\%$ , $10\%$ ] of image widths or heights. The rotation and shear ranges are [ $-30$ , $30$ ] in degrees.

Box-level transformation (B): Three transformation operations from global geometric transformations are used, but with smaller magnitude ranges. The translation range in percentage is [ $-5\%$ , $5\%$ ] of image widths or heights. The rotation and shear ranges are [ $-10$ , $10$ ] in degrees.

For each image, we apply transformation operations in sequence as follows. First, we apply one of the operations sampled from C. Second, we apply one of the operations sampled from either G or B. Finally, we apply Cutout at multiple random locations of a whole image to prevent a trivial solution when applied exclusively inside the bounding box. The number of Cutout regions is sampled from , and the region size is sampled from [0%, 20%] of the short edge of the applied image. We visualize transformed images with the aforementioned augmentation strategies in Figure 3.
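
Two pieces of this strategy lend themselves to a small illustration: keeping pseudo boxes in sync with a global translation, and sampling Cutout regions. The sketch below is plain Python on a toy image stored as a list of rows. All function names are hypothetical, only translation is covered (not rotation, shear, or box-level operations), the image pixels themselves are not shifted, and the number of Cutout regions is passed in as a parameter because its sampling range is not restated here.

```python
import random


def sample_global_translation(width, height, rng, max_frac=0.10):
    """Sample a pixel offset within [-10%, 10%] of image width and height."""
    dx = rng.uniform(-max_frac, max_frac) * width
    dy = rng.uniform(-max_frac, max_frac) * height
    return dx, dy


def translate_boxes(boxes, dx, dy, width, height):
    """Shift (x1, y1, x2, y2) boxes by (dx, dy), clip to the image, drop empty boxes."""
    out = []
    for x1, y1, x2, y2 in boxes:
        nx1 = min(max(x1 + dx, 0), width)
        ny1 = min(max(y1 + dy, 0), height)
        nx2 = min(max(x2 + dx, 0), width)
        ny2 = min(max(y2 + dy, 0), height)
        if nx2 > nx1 and ny2 > ny1:
            out.append((nx1, ny1, nx2, ny2))
    return out


def cutout(image, rng, num_regions, max_frac=0.20, fill=0):
    """Blank out square regions anywhere in the image; returns a modified copy.

    Each region's side is sampled from [0%, 20%] of the short edge.
    """
    height, width = len(image), len(image[0])
    out = [list(row) for row in image]
    short_edge = min(height, width)
    for _ in range(num_regions):
        size = int(rng.uniform(0.0, max_frac) * short_edge)
        top, left = rng.randrange(height), rng.randrange(width)
        for y in range(top, min(height, top + size)):
            for x in range(left, min(width, left + size)):
                out[y][x] = fill
    return out


rng = random.Random(0)
toy_image = [[1] * 10 for _ in range(10)]
shifted = translate_boxes([(10, 10, 50, 50)], 5, -20, 100, 100)
masked = cutout(toy_image, rng, num_regions=3)
```

Sampling Cutout positions over the whole image, rather than only inside boxes, mirrors the motivation given above.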

Experiments

We test the efficacy of our proposed method on MS-COCO, which is one of the most popular public benchmarks for object detection. MS-COCO contains more than 118k labeled images and 850k labeled object instances from 80 object categories for training. In addition, there are 123k unlabeled images that can be used for semi-supervised learning. We experiment with two SSL settings. First, we randomly sample 1, 2, 5 and 10% of labeled training data as a labeled set and use the rest of the labeled training data as an unlabeled set. For these experiments, we create 5 data folds. The 1% protocol contains approximately 1.2k labeled images randomly selected from the labeled set of MS-COCO. The 2% protocol contains an additional $\sim$ 1.2k images, and the 5% and 10% protocol datasets are constructed in a similar way. Second, following , we use the entire labeled training data as a labeled set and additional unlabeled data as an unlabeled set. Note that the first protocol tests the efficacy of STAC when only a few labeled examples are available, while the second protocol evaluates the potential to improve a state-of-the-art object detector with unlabeled data on top of already large-scale labeled data. We report the mAP over 80 classes.
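
One way to build such folds is sketched below in plain Python, under the assumption that the labeled sets are nested (the 2% set contains the 1% set, and so on), which is how we read the description of the 2% protocol adding roughly 1.2k images. The function name, the seeding scheme, and the toy ID range are illustrative assumptions.

```python
import random


def make_nested_splits(image_ids, percents=(1, 2, 5, 10), seed=0):
    """Return {percent: (labeled_ids, unlabeled_ids)} for one data fold.

    A single shuffle is shared by all percentages, so smaller labeled sets
    are prefixes (and therefore subsets) of larger ones.
    """
    rng = random.Random(seed)
    shuffled = list(image_ids)
    rng.shuffle(shuffled)
    splits = {}
    for pct in percents:
        n_labeled = round(len(shuffled) * pct / 100)
        # The remainder of the labeled training data is treated as unlabeled.
        splits[pct] = (shuffled[:n_labeled], shuffled[n_labeled:])
    return splits


# Five folds over a toy pool of 1,000 image IDs (MS-COCO would use ~118k).
folds = [make_nested_splits(range(1000), seed=s) for s in range(5)]
```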

We also test on PASCAL VOC following . The trainval set of VOC07, containing 5,011 images from 20 object categories, is used as labeled training data, and 11,540 images from the trainval set of VOC12 are used as unlabeled training data. The detection performance is evaluated on the test set of VOC07, and mAP at an IoU of $0.5$ (AP0.5) is reported in addition to the MS-COCO metric.

Our implementation is based on the Faster RCNN and FPN library of Tensorpack . We use ResNet-50 backbone for our object detector models. Unless otherwise stated, the network weights are initialized by the ImageNet-pretrained model at all stages of training.

Since the training of the object detector is quite involved, we stay with the default learning settings for all our experiments other than the learning schedule. Most of our experiments are conducted using the quick learning schedule (Section 5.1 defines the different learning schedules), with an exception for the 100% MS-COCO protocol (see https://github.com/tensorpack/tensorpack/tree/master/examples/FasterRCNN#results). We find that the model’s performance benefits significantly from longer training when more labeled training data and more complex data augmentation strategies are used. STAC introduces two new hyperparameters: $\tau$ for the confidence threshold and $\lambda_{u}$ for the unsupervised loss. We use $\tau\,{=}\,0.9$ and $\lambda_{u}\,{=}\,2$ for all experiments except for the 100% protocol of MS-COCO, where we lower the threshold to $\tau\,{=}\,0.5$ to increase the recall of pseudo labels. We refer readers to the Appendix for complete learning settings.

4.2 Results

Since deep semi-supervised learning of visual object detectors has not been widely studied yet, we mainly compare STAC with supervised models (i.e., models trained with labeled data only) for various experimental protocols using different data augmentation strategies. Table 1 summarizes the results. For the 1, 2, 5 and 10% protocols, we train models with the quick learning schedule and report mAPs averaged over 5 data folds along with their standard deviation. For the 100% protocol, we employ the standard schedule with $3{\times}$ longer training and report a single mAP value for each model.

Firstly, we confirm the findings of with varying amount of labeled training data that the RandAugment improves the supervised learning performance of a detector by a significant margin, 2.71 mAP at 5% protocol, 2.32 mAP at 10% protocol, and 1.85 mAP for 100% protocol, upon the supervised baselines with default data augmentation of resizing and horizontal flipping.

STAC further improves the performance upon stronger supervised models. We find it to be particularly effective for protocols with small labeled training data, showing a 5.91 mAP improvement on the 5% protocol and 4.78 mAP on the 10% protocol. Interestingly, STAC proves to be at least 2 ${\times}$ more data efficient than the baseline models for both the 5% protocol (24.36 for STAC vs. 23.86 for the supervised model with 10% labeled training data) and the 10% protocol (28.56 for STAC vs. 28.63 for the supervised model with 20% labeled training data). For the 100% protocol, STAC achieves 39.21 mAP. This improves upon the baseline (37.63 mAP), but falls short of the supervised model with strong data augmentation (39.48 mAP). We hypothesize that pseudo label training benefits from a larger amount of unlabeled data relative to the size of labeled data, and we study its effectiveness with respect to the scale of unlabeled data in Section 5.

We have a similar finding for experiments on PASCAL VOC. In Table 2, the mAP of the supervised models increases from 42.6 to 43.4, and AP0.5 increases from 76.30 to 78.21. Large-scale unlabeled data from VOC12 and MS-COCO further improve the performance, achieving 46.01 mAP and 79.08 AP0.5.

Ablation Study

We perform an ablation study on the key components of STAC. The study analyzes the impact on detector performance of 1) different data augmentations and learning schedule strategies, 2) different sizes of unlabeled sets, 3) the hyperparameters $\lambda_{u}$ , the coefficient for the unsupervised loss, and $\tau$ , the confidence threshold, and 4) the quality of pseudo labels.

In this section, we evaluate the performance of supervised detector models with different data augmentation strategies and learning rate schedules while varying the amount of training data. We consider different combinations of augmentation modules, including the default augmentation of horizontal image flip, color only (C), color followed by geometric or box-level transforms (C+{G,B}), and the latter followed by Cutout (C+{G,B}+Cutout). For {G,B}, we sample uniformly at random between the geometric and box-level transform modules for each image. We consider different learning schedules, including quick, standard, and standard $[n]{\times}$ (the standard setting with $[n]$ times longer training). While the number of weight updates is the same, the quick schedule uses lower-resolution images as input and a smaller batch size for training.

The summary results are provided in Table 3. With a small amount of labeled training data, we observe an increasingly positive impact on detector performance with more complex (thus stronger) augmentation strategies. The trend holds true with the standard schedule, but we find that the quick schedule is beneficial in the low-label regime because it trains faster and overfits less. On the other hand, we observe that the network significantly underfits with our augmentation strategies when all labeled data is used for training. For example, with 100% labeled data, the C+{G,B}+Cutout strategy achieves an mAP of $36.12$ , which is lower than the $37.42$ obtained with default augmentations. We find that the issue can be alleviated by longer training. Moreover, while the performance with default augmentations saturates and starts to decrease with longer training, the models with strong data augmentation start to outperform it, demonstrating their effectiveness for training with large-scale labeled data.

STAC contains two key components: self-training and strong data augmentation. We also verify the importance of data augmentation in the Appendix, which is in line with recent findings in SSL for image classification. We evaluate the performance of STAC with the default augmentation (horizontal flip). On a single fold of the 10% protocol, we observe a good improvement in mAP upon the baseline model (from 24.05 to 26.27), but the gain is not as significant as that of STAC (29.00). On the 100% protocol, we observe a slight decrease in performance when training with self-training only (from 37.63 to 37.57), while STAC achieves 39.21 mAP.

5.2 Size of Unlabeled Data

While the importance of large-scale labeled data for supervised learning has been broadly studied and emphasized, the importance of the scale of unlabeled data for semi-supervised learning has often been overlooked. In this study, we highlight the importance of large-scale unlabeled data in the context of semi-supervised object detector learning. We experiment with 5% and 10% labeled data of MS-COCO while varying the amount of unlabeled data to be 1, 2, 4, and 8 times larger.

The summary results are given in Table 4. While STAC still improves mAP when trained with a small amount of unlabeled data, the gain is less significant than that of the supervised model with strong data augmentation. We observe clearly from Table 4 that STAC benefits from a larger amount of unlabeled training data. We make a similar observation from experiments on PASCAL VOC in Table 2, where the AP0.5 of STAC trained using trainval of VOC12 as unlabeled data is 77.45, which is lower than that of the supervised model with strong augmentations (78.21). On the other hand, STAC trained with a large amount of unlabeled data, obtained by combining VOC12 and MS-COCO, achieves 79.08 AP0.5. This analysis may explain why the mAP of STAC on the 100% protocol of MS-COCO is slightly lower than that of the supervised model with strong data augmentation: the size of the available unlabeled data is roughly the same as that of the labeled data.

We study the impact of $\lambda_{u}$ , a regularization coefficient for the unsupervised loss, and $\tau$ , the confidence threshold. Specifically, we test STAC with different values of $\lambda_{u}\,{\in}\,\{0.1,0.5,1,2,4\}$ and $\tau\,{\in}\,\{0,0.3,0.5,0.7,0.9\}$ on a single fold of the 10% protocol. The summary results are provided in Figure 4. Firstly, the best performance of STAC is obtained when $\lambda_{u}\,{=}\,2$ and $\tau\,{=}\,0.9$ . We observe that the performance of STAC deteriorates when $\lambda_{u}$ is too large ( ${>}\,2$ ) or too small ( ${<}\,0.5$ ), but it improves upon the strong baseline consistently for $\lambda_{u}\,{\in}\,$ . When there is no confidence-based box filtering, the gain of STAC over the strong baseline, if any, is marginal. This is because many predicted boxes are inaccurate, as shown in Figure 5(a). Using a larger value of $\tau$ yields pseudo box labels with higher precision (i.e., the remaining boxes whose confidence is higher than $\tau$ are accurate), as in Figure 5(e). However, if $\tau$ becomes too large, recall drops (e.g., the bounding box at the sofa in Figure 5(c) is filtered out in Figure 5(d)). Figure 4 shows that high precision (i.e., a larger value of $\tau$ ) is preferred to high recall (i.e., a smaller value of $\tau$ ) on the 10% protocol.
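
The precision and recall trade-off controlled by $\tau$ can be made concrete with a small Python sketch. It assumes each predicted box already carries a flag saying whether it matches a ground-truth object; the scores, flags, and ground-truth count below are made-up toy values, and the helper name is hypothetical.

```python
def precision_recall_at_threshold(scores, is_correct, num_gt, tau):
    """Precision and recall of the boxes kept by thresholding scores at tau."""
    kept = [correct for score, correct in zip(scores, is_correct) if score >= tau]
    true_positives = sum(kept)
    # Convention: precision is 1.0 when no box is kept.
    precision = true_positives / len(kept) if kept else 1.0
    recall = true_positives / num_gt if num_gt else 0.0
    return precision, recall


# Toy predictions for an image set with 5 ground-truth objects.
toy_scores = [0.95, 0.92, 0.80, 0.60, 0.40, 0.20]
toy_correct = [True, True, True, False, True, False]
for tau in (0.0, 0.5, 0.9):
    p, r = precision_recall_at_threshold(toy_scores, toy_correct, num_gt=5, tau=tau)
    print(f"tau={tau:.1f}  precision={p:.2f}  recall={r:.2f}")
```

On these toy values, raising $\tau$ trades recall for precision, which is the regime reported above as preferable on the 10% protocol.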

5.4 Quality of Pseudo Labels

One intriguing question is whether the semi-supervised performance of the model improves with pseudo labels of higher quality. To validate the hypothesis, we train two additional STAC models for 10% protocol, where models are provided pseudo labels predicted by two different supervised models trained with 5% and 100% labeled data, whose mAPs are 18.67 and 37.63, respectively. Note that the STAC on 10% protocol achieves 29.00 mAP. STAC trained with less accurate pseudo labels achieves only 24.25 mAP, while the one with more accurate pseudo labels achieves 30.30 mAP, confirming the importance of pseudo label quality.

Inspired by this observation, we increase the augmentation strength used to train the teacher model in order to get better pseudo labels, expecting a further improvement for STAC. To this end, we train STAC using different sets of pseudo labels provided by supervised models trained with different data augmentation schemes. As shown in Table 5, the performance of the supervised models varies from an mAP of 18.67 to 21.16 with 5% labeled data and from 24.05 to 26.34 with 10% labeled data. We observe an improvement in mAP from using more accurate pseudo labels on the 5% protocol, but the gain is not as substantial. We also do not observe a clear correlation between the accuracy of pseudo labels and the performance of STAC on the 10% protocol. While STAC brings a significant gain in mAP using pseudo labels, our results suggest that incremental improvements in the quality of pseudo labels may not bring a significant extra benefit.

Discussion and Conclusion

While SSL for classification has made significant strides, label-efficient training for tasks with high labeling cost remains in demand. We propose a simple (introducing only two hyperparameters that are easy to tune) and effective ( $2{\times}$ label efficiency in the low-label regime) SSL framework for object detection by leveraging lessons from SSL methods for classification. The simplicity of our method provides flexibility for further development towards solving SSL for object detection.

The proposed framework is amenable to many variations, including using soft labels for the classification loss, detector frameworks other than Faster RCNN, and other data augmentation strategies. While STAC already demonstrates an impressive performance gain without taking the confirmation bias issue into account, this issue could become problematic when using a detection framework with a stronger form of hard negative mining, because noisy pseudo labels can be overused. Further investigation of learning with noisy labels, confidence calibration, and uncertainty estimation in the context of object detection are a few important topics for further enhancing the performance of SSL object detection.

References

Appendix A Learning Schedules

In this section, we provide complete descriptions of the different learning schedules used in our experiments. Note that the VOC schedule is only used for experiments related to PASCAL VOC. Except where specified below, we adopt the learning settings from https://github.com/tensorpack/tensorpack/blob/master/examples/FasterRCNN/config.py.

A.1 Quick

LR Decay: $\big{[}0.01\,({\leq}120k),0.001\,({\leq}160k),0.0001\,({\leq}180k)\big{]}$

Data processing: Short edge size is sampled between 500 and 800 if the long edge is less than 1024 after resizing.

Batch per image for training Faster RCNN head: 64

A.2 Standard, $[n]\times$

LR Decay: $\big{[}0.01\,({\leq}120k),0.001\,({\leq}160k),0.0001\,({\leq}180k)\big{]}$

LR Decay ( $2\times$ ): $\big{[}0.01\,({\leq}240k),0.001\,({\leq}320k),0.0001\,({\leq}360k)\big{]}$

LR Decay ( $3\times$ ): $\big{[}0.01\,({\leq}420k),0.001\,({\leq}500k),0.0001\,({\leq}540k)\big{]}$

Data processing: Short edge size is fixed to 800 if the long edge is less than 1333 after resizing.

Batch per image for training Faster RCNN head: 512
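
The step-wise learning rate decay listed above can be expressed as a piecewise-constant function of the training step. The following Python sketch encodes the three standard schedules; the function name, the dictionary keys, and the choice to hold the last value beyond the final boundary are illustrative assumptions rather than the library's configuration format.

```python
def piecewise_lr(step, boundaries, values):
    """Return the learning rate at `step` for a piecewise-constant schedule.

    values[k] applies while step <= boundaries[k]; the last value is held
    for any step beyond the final boundary.
    """
    for bound, value in zip(boundaries, values):
        if step <= bound:
            return value
    return values[-1]


# Boundaries (in steps) and learning rates transcribed from the lists above.
STANDARD_SCHEDULES = {
    "1x": ([120_000, 160_000, 180_000], [0.01, 0.001, 0.0001]),
    "2x": ([240_000, 320_000, 360_000], [0.01, 0.001, 0.0001]),
    "3x": ([420_000, 500_000, 540_000], [0.01, 0.001, 0.0001]),
}

lr_at_130k = piecewise_lr(130_000, *STANDARD_SCHEDULES["1x"])
```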

A.3 VOC

LR Decay: $\big{[}0.001\,({\leq}120k),0.0005\,({\leq}160k)\big{]}$

Data processing: Short edge size is fixed to 600 if the long edge is less than 1000 after resizing. Otherwise, the image is resized so that its long edge is 1000.

Batch per image for training Faster RCNN head: 256

RPN Anchor Sizes: $\big{[}8,16,32\big{]}$
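
The resizing rule in the data processing entry above can be sketched as a scale computation: scale the short edge to the target, unless that would push the long edge past the cap, in which case scale the long edge to the cap instead. The helper names below are hypothetical and the rounding of output sizes is an assumption.

```python
def resize_scale(height, width, short_edge=600, max_long_edge=1000):
    """Scale factor that sets the short edge to `short_edge`, capped by the long edge."""
    short, long_ = min(height, width), max(height, width)
    scale = short_edge / short
    if long_ * scale > max_long_edge:
        # Long edge would exceed the cap, so fit the long edge instead.
        scale = max_long_edge / long_
    return scale


def resized_shape(height, width, short_edge=600, max_long_edge=1000):
    """Output (height, width) after applying the scale, rounded to integers."""
    scale = resize_scale(height, width, short_edge, max_long_edge)
    return round(height * scale), round(width * scale)


typical = resized_shape(375, 500)    # short edge can reach 600
elongated = resized_shape(300, 900)  # long edge hits the 1000 cap first
```

The same function covers the other schedules by changing the two size arguments, for example a short edge of 800 with a cap of 1333 for the standard schedule.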

Appendix B Data Augmentation in STAC

This section provides comprehensive results for Section 5.1 to validate the importance of data augmentation in STAC. In Table A1, we provide two rows of results: STAC (bottom) and STAC without strong data augmentation, i.e., “Self-Training”. We observe a significant gain in mAP in all cases, which validates the importance of data augmentation in STAC.

Appendix C Extended Background: Unsupervised Loss in SSL

In this section, we extend Section 3.1 and provide unsupervised loss formulations for comprehensive list of SSL algorithms whose loss can be represented in Equation (1). For presentation clarity, let us reiterate definitions as follows:

Here, we use $p(x)$ instead of $p(x;\theta)$ as in Equation (1) for generality. Instead, let us denote $p(x;\theta)$ as a prediction of the model with parameters $\theta$ at training.

Note that the unsupervised loss formulation of STAC is following the form of Noisy Student (Section C.9), which can be viewed as a combination of Self-Training (Section C.1) and strong data augmentation. While we have shown such a simple formulation of STAC brings in a significant performance gain at object detection, more complicated formulations (e.g., Mean Teacher (Section C.5) or MixMatch/ReMixMatch (Section C.10)) are amenable to be used in place of several design choices made for STAC. Further investigation of STAC variants is in the scope of the future work.

C.2 Entropy Minimization [16]

Note that the gradient flows to both $q$ and $p$ . To the best of our knowledge, Entropy Minimization is the only method that backpropagates the gradient through $q$ .

C.3 Pseudo Labeling [27]

C.4 Temporal Ensembling [25]

We omit the ramp up and ramp down for $w(\cdot)$ in our formulation since it is dependent on the optimization framework. See for more details.

C.5 Mean Teacher [55]

We omit the ramp up and ramp down for $w(\cdot)$ in our formulation since it is dependent on the optimization framework. See for more details.
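
In Mean Teacher, the target network is an exponential moving average (EMA) of the student weights. A minimal sketch of that update is shown below, with weights represented as a plain dictionary of floats; the function name and the decay value are illustrative assumptions.

```python
def ema_update(teacher, student, decay=0.999):
    """Return new teacher weights: decay * teacher + (1 - decay) * student."""
    return {name: decay * teacher[name] + (1.0 - decay) * student[name]
            for name in teacher}


# Toy weights: after one update the teacher moves slightly toward the student.
teacher_w = {"w": 1.0, "b": 0.0}
student_w = {"w": 0.0, "b": 2.0}
teacher_w = ema_update(teacher_w, student_w, decay=0.9)
```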

C.6 Virtual Adversarial Training [36]

C.7 Unsupervised Data Augmentation (UDA) [59]

UDA uses a weak augmentation ( $\alpha(\cdot)$ ), such as translation and horizontal flip, to generate a pseudo label, and a strong augmentation ( $\mathcal{A}(\cdot)$ ), such as RandAugment followed by Cutout, for model training.

C.8 FixMatch [50]

FixMatch also uses a weak augmentation ( $\alpha(\cdot)$ ), such as translation and horizontal flip, to generate a pseudo label, and a strong augmentation ( $\mathcal{A}(\cdot)$ ), such as RandAugment or CTAugment followed by Cutout, for model training.
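
The FixMatch-style unsupervised loss can be sketched in a few lines of Python. The sketch assumes class probabilities for the weakly and strongly augmented views are already available as lists; the function name and the threshold of 0.95 are illustrative assumptions, and no gradient computation is shown.

```python
import math


def fixmatch_unsup_loss(weak_probs, strong_probs, tau=0.95):
    """Cross-entropy between hard pseudo labels from the weak view and
    predictions on the strong view, masked by confidence >= tau and
    averaged over all unlabeled examples in the batch."""
    if not weak_probs:
        return 0.0
    total = 0.0
    for q, p in zip(weak_probs, strong_probs):
        confidence = max(q)
        if confidence >= tau:
            pseudo_label = q.index(confidence)  # one-hot (hard) target
            total += -math.log(max(p[pseudo_label], 1e-12))
    return total / len(weak_probs)


# Toy batch: only the first example is confident enough to contribute.
weak = [[0.97, 0.03], [0.60, 0.40]]
strong = [[0.50, 0.50], [0.10, 0.90]]
loss = fixmatch_unsup_loss(weak, strong, tau=0.95)
```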

C.9 Noisy Student [60]

C.10 MixMatch [4]

Note that MixMatch uses MixUp for unsupervised loss. It uses weak augmentation $\alpha(\cdot)$ , such as translation and horizontal flip.

C.11 ReMixMatch [3]

Note that ReMixMatch uses MixUp for unsupervised loss. It also uses weak augmentation $\alpha(\cdot)$ , such as translation and horizontal flip, and strong augmentation $\mathcal{A}(\cdot)$ , such as CTAugment .