CutMix: Regularization Strategy to Train Strong Classifiers with Localizable Features

Sangdoo Yun, Dongyoon Han, Seong Joon Oh, Sanghyuk Chun, Junsuk Choe, Youngjoon Yoo

Introduction

Deep convolutional neural networks (CNNs) have shown promising performance on various computer vision problems such as image classification, object detection, semantic segmentation, and video analysis. To further improve training efficiency and performance, a number of training strategies have been proposed, including data augmentation and regularization techniques.

In particular, to prevent a CNN from focusing too much on a small set of intermediate activations or on a small region of the input images, random feature removal regularizations have been proposed. Examples include dropout for randomly dropping hidden activations and regional dropout for erasing random regions of the input. Researchers have shown that feature removal strategies improve generalization and localization by letting a model attend not only to the most discriminative parts of objects, but also to the entire object region.

While regional dropout strategies have improved classification and localization performances to a certain degree, deleted regions are usually zeroed-out or filled with random noise, greatly reducing the proportion of informative pixels on training images. We recognize this as a severe conceptual limitation, as CNNs are generally data hungry. How can we maximally utilize the deleted regions, while taking advantage of the better generalization and localization that regional dropout provides?

We address the above question by introducing an augmentation strategy CutMix. Instead of simply removing pixels, we replace the removed regions with a patch from another image (See Table 1). The ground truth labels are also mixed proportionally to the number of pixels of combined images. CutMix now enjoys the property that there is no uninformative pixel during training, making training efficient, while retaining the advantages of regional dropout to attend to non-discriminative parts of objects. The added patches further enhance localization ability by requiring the model to identify the object from a partial view. The training and inference budgets remain the same.

CutMix shares similarity with Mixup which mixes two samples by interpolating both the image and labels. While certainly improving classification performance, Mixup samples tend to be unnatural (See the mixed image in Table 1). CutMix overcomes the problem by replacing the image region with a patch from another training image.

Table 1 gives an overview of Mixup, Cutout, and CutMix on image classification, weakly supervised localization, and transfer learning to object detection. Although Mixup and Cutout enhance ImageNet classification, they decrease the ImageNet localization or object detection performances. CutMix, on the other hand, consistently achieves significant enhancements across all three tasks.

We present extensive evaluations of CutMix on various CNN architectures, datasets, and tasks. Summarizing the key results, CutMix has significantly improved the accuracy of a baseline classifier on CIFAR-100 and has obtained the state-of-the-art top-1 error of $14.47\%$. On ImageNet, applying CutMix to ResNet-50 and ResNet-101 has improved the classification accuracy by $+2.28\%$ and $+1.70\%$, respectively. On the localization front, CutMix improves the performance of the weakly-supervised object localization (WSOL) task on CUB200-2011 and ImageNet by $+5.4\%$ and $+0.9\%$, respectively. The superior localization capability is further evidenced by fine-tuning a detector and an image caption generator on CutMix-ImageNet-pretrained models; the CutMix pretraining has improved the overall detection performances on Pascal VOC by $+1$ mAP and image captioning performance on MS-COCO by $+2$ BLEU scores. CutMix also enhances the model robustness and alleviates the over-confidence issue of deep networks.

Related Works

Regional dropout: Methods that remove random regions in images have been proposed to enhance the generalization performance of CNNs. Object localization methods also utilize regional dropout techniques to improve the localization ability of CNNs. CutMix is similar to those methods; the critical difference is that the removed regions are filled with patches from other training images. DropBlock has generalized regional dropout to the feature space and has shown enhanced generalizability as well. CutMix can also be performed on the feature space, as we will see in the experiments.

Synthesizing training data: Some works have explored synthesizing training data for further generalizability. Generating new training samples by stylizing ImageNet has guided the model to focus more on shape than texture, leading to better classification and object detection performances. CutMix also generates new samples by cutting and pasting patches within mini-batches, leading to performance boosts in many computer vision tasks; unlike stylization, CutMix incurs only negligible additional cost for training. For object detection, object insertion methods have been proposed as a way to synthesize objects in the background. These methods aim to learn a good representation of single-object samples, while CutMix generates combined samples which may contain multiple objects.

Mixup: CutMix shares similarity with Mixup in that both combine two samples, where the ground truth label of the new sample is given by the linear interpolation of one-hot labels. As we will see in the experiments, Mixup samples are locally ambiguous and unnatural, and therefore confuse the model, especially for localization. Recently, Mixup variants have been proposed; they perform feature-level interpolation and other types of transformations. These works, however, generally lack a deep analysis, in particular of the localization ability and transfer-learning performances. We have verified the benefits of CutMix not only for an image classification task, but over a wide set of localization tasks and transfer learning experiments.

Tricks for training deep networks: Efficient training of deep networks is one of the most important problems in the computer vision community, as they require a great amount of compute and data. Methods such as weight decay, dropout, and Batch Normalization are widely used to efficiently train deep networks. Recently, methods that add noise to the internal features of CNNs or add an extra path to the architecture have been proposed to enhance image classification performance. CutMix is complementary to the above methods because it operates on the data level, without changing internal representations or architecture.

CutMix

We describe the CutMix algorithm in detail. Given two training samples $(x_{A},y_{A})$ and $(x_{B},y_{B})$, CutMix generates a new training sample $(\tilde{x},\tilde{y})$ with the combining operation

$$\tilde{x}=\mathbf{M}\odot x_{A}+(\mathbf{1}-\mathbf{M})\odot x_{B},\qquad \tilde{y}=\lambda y_{A}+(1-\lambda)y_{B},$$

where $\mathbf{M}\in\{0,1\}^{W\times H}$ denotes a binary mask indicating where to drop out and fill in from two images, $\mathbf{1}$ is a binary mask filled with ones, and $\odot$ is element-wise multiplication. Like Mixup, the combination ratio $\lambda$ between two data points is sampled from the beta distribution $\text{Beta}(\alpha,\alpha)$. In all our experiments, we set $\alpha$ to $1$, that is, $\lambda$ is sampled from the uniform distribution on $(0,1)$. Note that the major difference is that CutMix replaces an image region with a patch from another training image and generates more locally natural images than Mixup does.

To sample the binary mask $\mathbf{M}$, we first sample the bounding box coordinates $\mathbf{B}=(r_{x},r_{y},r_{w},r_{h})$ indicating the cropping regions on $x_{A}$ and $x_{B}$. The region $\mathbf{B}$ in $x_{A}$ is removed and filled in with the patch cropped from $\mathbf{B}$ of $x_{B}$.

In our experiments, we sample rectangular masks $\mathbf{M}$ whose aspect ratio is proportional to the original image. The box coordinates are uniformly sampled according to:

$$r_{x}\sim\text{Unif}(0,W),\quad r_{w}=W\sqrt{1-\lambda},\qquad r_{y}\sim\text{Unif}(0,H),\quad r_{h}=H\sqrt{1-\lambda},$$

making the cropped area ratio $\frac{r_{w}r_{h}}{WH}=1-\lambda$. With the cropping region, the binary mask $\mathbf{M}\in\{0,1\}^{W\times H}$ is decided by filling with $0$ within the bounding box $\mathbf{B}$, otherwise $1$.
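The box sampling and mask construction described above can be sketched in a few lines of plain Python. This is only an illustrative sketch, not the authors' implementation: the function names `sample_box` and `box_mask` are hypothetical, and two details are assumptions that the text does not specify, namely that $(r_{x},r_{y})$ is treated as the box centre and that the box is clipped to the image boundary. Because clipping can shrink the box, the sketch recomputes $\lambda$ from the actual pixel ratio.

```python
import math
import random


def sample_box(width, height, lam, rng):
    """Sample a box whose area is at most (1 - lam) * width * height."""
    cut_ratio = math.sqrt(1.0 - lam)      # r_w / W = r_h / H = sqrt(1 - lambda)
    r_w = int(width * cut_ratio)
    r_h = int(height * cut_ratio)
    # Assumption: the sampled point is the box centre; the box is clipped.
    r_x = rng.randrange(width)
    r_y = rng.randrange(height)
    x1 = max(r_x - r_w // 2, 0)
    x2 = min(r_x + r_w // 2, width)
    y1 = max(r_y - r_h // 2, 0)
    y2 = min(r_y + r_h // 2, height)
    return x1, y1, x2, y2


def box_mask(width, height, box):
    """Binary mask M: 0 inside the box (taken from x_B), 1 elsewhere (x_A)."""
    x1, y1, x2, y2 = box
    return [[0 if (x1 <= x < x2 and y1 <= y < y2) else 1
             for x in range(width)]
            for y in range(height)]


rng = random.Random(0)
lam = rng.betavariate(1.0, 1.0)           # alpha = 1, i.e. Uniform(0, 1)
box = sample_box(32, 32, lam, rng)
mask = box_mask(32, 32, box)
# Recompute lambda so the label weight matches the actual pixel ratio.
lam_adjusted = sum(map(sum, mask)) / (32 * 32)
```

With the mask in hand, the mixed image is $\mathbf{M}\odot x_{A}+(\mathbf{1}-\mathbf{M})\odot x_{B}$ and the mixed label uses `lam_adjusted` as the weight on $y_{A}$.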

3.2 Discussion

What does the model learn with CutMix? We have motivated CutMix such that full object extents are considered as cues for classification, a motivation shared by Cutout, while ensuring that two objects are recognized from partial views in a single image to increase training efficiency. To verify that CutMix is indeed learning to recognize two objects from their respective partial views, we visually compare the activation maps for CutMix against Cutout and Mixup. Figure 1 shows example augmentation inputs as well as the corresponding class activation maps (CAM) for the two classes present, Saint Bernard and Miniature Poodle. We use a vanilla ResNet-50 model (the ImageNet-pretrained ResNet-50 provided by PyTorch) to obtain the CAMs, so that the effect of the augmentation method alone is clearly visible.

We observe that Cutout successfully lets a model focus on less discriminative parts of the object, such as the belly of Saint Bernard, while being inefficient due to unused pixels. Mixup, on the other hand, makes full use of pixels, but introduces unnatural artifacts. The CAM for Mixup, as a result, shows that the model is confused when choosing cues for recognition. We hypothesize that such confusion leads to its suboptimal performance in classification and localization, as we will see in Section 4.

CutMix efficiently improves upon Cutout by being able to localize the two object classes accurately. We summarize the key differences among Mixup, Cutout, and CutMix in Table 2.

Analysis on validation error: We analyze the effect of CutMix on stabilizing the training of deep networks. We compare the top-1 validation error during the training with CutMix against the baseline. We train ResNet-50 for ImageNet Classification, and PyramidNet-200 for CIFAR-100 Classification. Figure 2 shows the results.

We observe, first of all, that CutMix achieves lower validation errors than the baseline at the end of training. At epoch 150 when the learning rates are reduced, the baselines suffer from overfitting with increasing validation error. CutMix, on the other hand, shows a steady decrease in validation error; diverse training samples reduce overfitting.

Experiments

In this section, we evaluate CutMix for its capability to improve localizability as well as generalizability of a trained model on multiple tasks. We first study the effect of CutMix on image classification (Section 4.1) and weakly supervised object localization (Section 4.2). Next, we show the transferability of a CutMix pre-trained model when it is fine-tuned for object detection and image captioning tasks (Section 4.3). We also show that CutMix can improve the model robustness and alleviate the model over-confidence in Section 4.4.

All experiments were implemented and evaluated on the NAVER Smart Machine Learning (NSML) platform with PyTorch. Source code and pretrained models are available at https://github.com/clovaai/CutMix-PyTorch.

We evaluate on the ImageNet-1K benchmark, a dataset containing 1.2M training images and 50K validation images of 1K categories. For fair comparison, we use the standard augmentation setting for the ImageNet dataset, such as re-sizing, cropping, and flipping, following standard practice. We found that regularization methods including Stochastic Depth, Cutout, Mixup, and CutMix require a greater number of training epochs until convergence. Therefore, we have trained all the models for $300$ epochs with initial learning rate $0.1$ decayed by a factor of $0.1$ at epochs $75$, $150$, and $225$. The batch size is set to $256$. The hyper-parameter $\alpha$ is set to $1$. We report the best performances of CutMix and other baselines during training.
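As a small illustration of the schedule just described, the learning rate at any epoch can be computed as follows. This is a sketch rather than the training code: the helper name `step_lr` is hypothetical, and it assumes the decay takes effect from the start of each milestone epoch.

```python
def step_lr(epoch, base_lr=0.1, gamma=0.1, milestones=(75, 150, 225)):
    """Learning rate at `epoch` under step decay at the given milestones."""
    # Assumption: the decay applies from the milestone epoch onward.
    passed = sum(epoch >= m for m in milestones)
    return base_lr * gamma ** passed


# Learning rate at the first epoch, at each milestone, and at the last epoch.
schedule = {epoch: step_lr(epoch) for epoch in (0, 75, 150, 225, 299)}
```

Under these assumptions the rate drops from $0.1$ to $0.01$, $0.001$, and $0.0001$ over the $300$ epochs.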

We briefly describe the settings for baseline augmentation schemes. We set the dropping rate of residual blocks to $0.25$ for the best performance of Stochastic Depth. The mask size for Cutout is set to $112\times 112$ and the location for dropping out is uniformly sampled. The performance of DropBlock is taken from the original paper; the difference from our setting is the number of training epochs, which is set to $270$. Manifold Mixup applies the Mixup operation on a randomly chosen internal feature map. We have tried $\alpha=0.5$ and $1.0$ for Mixup and Manifold Mixup and have chosen $1.0$, which has shown better performances. It is also possible to extend CutMix to feature-level augmentation (Feature CutMix). Feature CutMix applies CutMix at a randomly chosen layer per minibatch, as Manifold Mixup does.

Comparison against baseline augmentations: Results are given in Table 3. We observe that CutMix achieves the best result, $21.40\%$ top-1 error, among the considered augmentation strategies. CutMix outperforms Cutout and Mixup, the two closest approaches to ours, by $+1.53\%$ and $+1.18\%$, respectively. On the feature level as well, we find CutMix preferable to Mixup, with top-1 errors $21.78\%$ and $22.50\%$, respectively.

Comparison against architectural improvements: We have also compared improvements due to CutMix versus architectural improvements (e.g. greater depth or additional modules). We observe that CutMix improves the performance by $+2.28\%$, while increased depth (ResNet-50 $\rightarrow$ ResNet-152) boosts it by $+1.99\%$ and SE and GE boost it by $+1.56\%$ and $+1.80\%$, respectively. Note that, unlike the above architectural boosts, improvements due to CutMix come at little or no additional memory or computational cost.

CutMix for Deeper Models: We have explored the performance of CutMix for the deeper networks, ResNet-101 and ResNeXt-101 (32 $\times$ 4d), on ImageNet. As seen in Table 4, we observe $+1.70\%$ and $+1.71\%$ respective improvements in top-1 errors due to CutMix.

4.1.2 CIFAR Classification

Table 5 shows the performance comparison against other state-of-the-art data augmentation and regularization methods. All experiments were conducted three times and the averaged best performances during training are reported.

Hyper-parameter settings: We set the hole size of Cutout to $16\times 16$. For DropBlock, keep_prob and block_size are set to $0.9$ and $4$, respectively. The drop rate for Stochastic Depth is set to $0.25$. For Mixup, we tested the hyper-parameter $\alpha$ with $0.5$ and $1.0$. For Manifold Mixup, we applied the Mixup operation at a randomly chosen layer per minibatch.

Combination of regularization methods: We have evaluated the combination of regularization methods. Neither Cutout nor label smoothing improves the accuracy when adopted independently, but they are effective when used together. DropBlock, the feature-level generalization of Cutout, is also more effective when label smoothing is used. Mixup and Manifold Mixup achieve higher accuracies when Cutout is applied on input images. The combination of Cutout and Mixup tends to generate locally separated and mixed samples, since the cropped regions have less ambiguity than those of vanilla Mixup. The superior performance of the Cutout and Mixup combination shows that mixing in a cut-and-paste manner is better than interpolation, as further evidenced by the CutMix performances.

CutMix achieves $14.47\%$ top-1 classification error on CIFAR-100, $1.98$ percentage points lower than the baseline error of $16.45\%$. We have achieved a new state-of-the-art performance of $13.81\%$ by combining CutMix and ShakeDrop, a regularization that adds noise on intermediate features.

CutMix for various models: Table 6 shows CutMix also significantly improves the performance of the weaker baseline architectures, such as PyramidNet-110 and ResNet-110.

CutMix for CIFAR-10: We have evaluated CutMix on the CIFAR-10 dataset using the same baseline and training setting as for CIFAR-100. The results are given in Table 7. On CIFAR-10, CutMix also enhances the classification performance by $+0.97\%$, outperforming Mixup and Cutout.

4.1.3 Ablation Studies

We conducted an ablation study on the CIFAR-100 dataset using the same experimental settings as in Section 4.1.2. We evaluated CutMix with $\alpha\in\{0.1,0.25,0.5,1.0,2.0,4.0\}$; the results are given in Figure 3, left plot. For all $\alpha$ values considered, CutMix improves upon the baseline ($16.45\%$). The best performance is achieved when $\alpha=1.0$.

The performance of feature-level CutMix is given in Figure 3, right plot. We changed the layer on which CutMix is applied, from the image layer itself to higher feature levels. We denote the index as (0=image level, 1=after first conv-bn, 2=after layer1, 3=after layer2, 4=after layer3). CutMix achieves the best performance when it is applied on the input images. Again, feature-level CutMix, except in the layer3 case, improves the accuracy over the baseline ($16.45\%$).

4.2 Weakly Supervised Object Localization

Weakly supervised object localization (WSOL) task aims to train the classifier to localize target objects by using only the class labels. To localize the target well, it is important to make CNNs extract cues from full object regions and not focus on small discriminant parts of the target. Learning spatially distributed representation is thus the key for improving performance on WSOL task. CutMix guides a classifier to attend to broader sets of cues to make decisions; we expect CutMix to improve WSOL performances of classifiers. To measure this, we apply CutMix over baseline WSOL models. We followed the training and evaluation strategy of existing WSOL methods with VGG-GAP and ResNet-50 as the base architectures. The quantitative and qualitative results are given in Table 9 and Figure 4, respectively. Full implementation details are in Appendix B.

Comparison against Mixup and Cutout: CutMix outperforms Mixup on localization accuracies by $+5.51\%$ and $+1.41\%$ on CUB200-2011 and ImageNet, respectively. Mixup degrades the localization accuracy of the baseline model; it tends to make a classifier focus on small regions, as shown in Figure 4. As we have hypothesized in Section 3.2, more ambiguity in Mixup samples makes a classifier focus on even more discriminative parts of objects, leading to decreased localization accuracies. Although Cutout improves the accuracy over the baseline, it is outperformed by CutMix: $+2.03\%$ and $+0.56\%$ on CUB200-2011 and ImageNet, respectively.

CutMix also achieves comparable localization accuracies on CUB200-2011 and ImageNet, even when compared against the dedicated state-of-the-art WSOL methods that focus on learning spatially dispersed representations.

4.3 Transfer Learning of Pretrained Model

ImageNet pre-training is the de-facto standard practice for many visual recognition tasks. We examine whether CutMix pre-trained models lead to better performances in certain downstream tasks based on ImageNet pre-trained models. As CutMix has shown superiority in localizing less discriminative object parts, we would expect it to lead to boosts in certain recognition tasks with localization elements, such as object detection and image captioning. We evaluate the boost from CutMix on those tasks by initializing the backbone network with ImageNet pre-trained models trained using Mixup, Cutout, and CutMix. ResNet-50 is used as the baseline architecture in this section.

Transferring to Pascal VOC object detection: Two popular detection models, SSD and Faster RCNN, are considered. Originally the two methods utilized VGG-16 as the backbone, but we have changed it to ResNet-50. The ResNet-50 backbone is initialized with various ImageNet-pretrained models and then fine-tuned on Pascal VOC 2007 and 2012 trainval data. Models are evaluated on VOC 2007 test data using the mAP metric. We follow the fine-tuning strategy of the original methods; implementation details are in Appendix C. Results are shown in Table 10. Pre-training with Cutout and Mixup has failed to improve the object detection performance over the vanilla pre-trained model. However, pre-training with CutMix improves the performance of both SSD and Faster-RCNN. Stronger localizability of the CutMix pre-trained models leads to better detection performances.

Transferring to MS-COCO image captioning: We used Neural Image Caption (NIC) as the base model for image captioning experiments. We have changed the backbone network of the encoder from GoogLeNet to ResNet-50. The backbone network is initialized with various ImageNet pre-trained models, and then trained and evaluated on the MS-COCO dataset. Implementation details and evaluation metrics (METEOR, CIDEr, etc.) are in Appendix D. Table 10 shows the results. CutMix outperforms Mixup and Cutout in both BLEU1 and BLEU4 metrics. Simply replacing the backbone network with our CutMix pre-trained model gives performance gains for object detection and image captioning tasks at no extra cost.

4.4 Robustness and Uncertainty

Many studies have shown that deep models are easily fooled by small and unrecognizable perturbations on the input images, a phenomenon referred to as adversarial attacks. One straightforward way to enhance robustness and uncertainty is input augmentation by generating unseen samples. We evaluate robustness and uncertainty improvements due to input augmentation methods including Mixup, Cutout, and CutMix.

Robustness: We evaluate the robustness of the trained models to adversarial samples, occluded samples, and in-between class samples. We use ImageNet pre-trained ResNet-50 models with the same setting as in Section 4.1.1.

Fast Gradient Sign Method (FGSM) is used to generate adversarial perturbations and we assume that the adversary has full information of the models (white-box attack). We report top-1 accuracies after attack on ImageNet validation set in Table 11. CutMix significantly improves the robustness to adversarial attacks compared to other augmentation methods.

For occlusion experiments, we generate occluded samples in two ways: center occlusion by filling zeros in a center hole and boundary occlusion by filling zeros outside of the hole. In Figure 6(a), we measure the top-1 error by varying the hole size from $0$ to $224$. For both occlusion scenarios, Cutout and CutMix achieve significant improvements in robustness while Mixup only marginally improves it. Interestingly, CutMix achieves performance almost comparable to Cutout, even though, unlike Cutout, CutMix has not observed any occluded sample during training.

Finally, we evaluate the top-1 error of Mixup and CutMix in-between samples. Figure 6(b) illustrates the probability of predicting neither of the two classes as the combination ratio $\lambda$ varies. We randomly select $50,000$ in-between samples from the ImageNet validation set. In both experiments, Mixup and CutMix improve the performance, while improvements due to Cutout are almost negligible. Similarly to the previous occlusion experiments, CutMix even improves the robustness to the unseen Mixup in-between class samples.

Uncertainty: We measure the performance of out-of-distribution (OOD) detectors, which determine whether a sample is in- or out-of-distribution by score thresholding. We use PyramidNet-200 trained on CIFAR-100 with the same setting as in Section 4.1.2. In Table 12, we report the averaged OOD detection performances against seven out-of-distribution datasets, including TinyImageNet, LSUN, uniform noise, Gaussian noise, etc. More results are given in Appendix E. Mixup and Cutout augmentations aggravate the over-confidence of the base networks. Meanwhile, CutMix significantly alleviates the over-confidence of the model.

Conclusion

We have introduced CutMix for training CNNs with strong classification and localization ability. CutMix is easy to implement and has no computational overhead, while being surprisingly effective on various tasks. On ImageNet classification, applying CutMix to ResNet-50 and ResNet-101 brings $+2.28\%$ and $+1.70\%$ top-1 accuracy improvements. On CIFAR classification, CutMix significantly improves the performance of the baseline by $+1.98\%$ and leads to the state-of-the-art top-1 error of $14.47\%$. On weakly supervised object localization (WSOL), CutMix substantially enhances the localization accuracy and achieves localization performances comparable to the state-of-the-art WSOL methods. Furthermore, simply using a CutMix-ImageNet-pretrained model as the initialized backbone for object detection and image captioning brings overall performance improvements. Finally, we have shown that CutMix improves the robustness and uncertainty of image classifiers over the vanilla model as well as other regularized models.

Acknowledgement

We would like to thank Clova AI Research team, especially Jung-Woo Ha and Ziad Al-Halah for their helpful feedback and discussion.

Appendix A CutMix Algorithm

We present the code-level description of the CutMix algorithm in Algorithm A1. N, C, and K denote the minibatch size, the number of input channels, and the number of classes, respectively. First, CutMix shuffles the order of the minibatch input and target along the first axis of the tensors. Then, lambda and the cropping region (x1,x2,y1,y2) are sampled. Next, we mix input and input_s by replacing the cropping region of input with the corresponding region of input_s. The target label is mixed by linear interpolation.

Note that CutMix is easy to implement in a few lines (from line $4$ to line $15$), making it a very practical algorithm with significant impact on a wide range of tasks.
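The steps described above (shuffle, sample lambda and the box, paste, interpolate targets) can be sketched in plain Python on nested lists. This is not Algorithm A1 itself but an illustrative reconstruction under stated assumptions: the function name `cutmix_batch` is hypothetical, the sampled point is treated as the box centre, the box is clipped to the image, and lambda is recomputed from the clipped box area. A tensor library would replace the explicit loops in practice.

```python
import copy
import math
import random


def cutmix_batch(inputs, targets, alpha=1.0, rng=random):
    """inputs: N x C x H x W nested lists; targets: N x K one-hot lists."""
    n = len(inputs)
    height, width = len(inputs[0][0]), len(inputs[0][0][0])
    perm = list(range(n))
    rng.shuffle(perm)                     # shuffled view of the minibatch
    lam = rng.betavariate(alpha, alpha)
    cut = math.sqrt(1.0 - lam)
    r_w, r_h = int(width * cut), int(height * cut)
    # Assumption: the sampled point is the box centre; the box is clipped.
    c_x, c_y = rng.randrange(width), rng.randrange(height)
    x1, x2 = max(c_x - r_w // 2, 0), min(c_x + r_w // 2, width)
    y1, y2 = max(c_y - r_h // 2, 0), min(c_y + r_h // 2, height)
    mixed = copy.deepcopy(inputs)         # leave the original batch untouched
    for i, j in enumerate(perm):
        for c in range(len(inputs[i])):
            for y in range(y1, y2):
                mixed[i][c][y][x1:x2] = inputs[j][c][y][x1:x2]
    # Recompute lambda so it equals the fraction of pixels kept from input i.
    lam = 1.0 - (x2 - x1) * (y2 - y1) / (width * height)
    mixed_targets = [
        [lam * a + (1.0 - lam) * b for a, b in zip(targets[i], targets[j])]
        for i, j in enumerate(perm)
    ]
    return mixed, mixed_targets, lam
```

The mixed targets are soft labels, so the training loss must accept a distribution over the K classes rather than a single class index.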

Appendix B Weakly-supervised Object Localization

We describe the training and evaluation procedure of weakly-supervised object localization in detail.

Network modification: Weakly-supervised object localization (WSOL) basically has the same training strategy as image classification. WSOL training starts from an ImageNet-pretrained model. Starting from the base network structures, VGG-16 and ResNet-50, WSOL uses a larger feature map of spatial size $14\times 14$, whereas the original models have $7\times 7$. For the VGG network, we utilize VGG-GAP, a modified VGG-16 introduced in prior work. For ResNet-50, we modified the final residual block (layer4), which originally has stride $2$, to have no stride ($=1$).
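The effect of removing the stride in layer4 on the feature map size can be checked with simple arithmetic. The sketch below assumes a $224\times 224$ input and the stage strides of the standard ResNet-50 (stem convolution 2, max-pool 2, layer1 1, layer2 2, layer3 2, layer4 2); the helper name `feature_map_size` is hypothetical.

```python
def feature_map_size(input_size, strides):
    """Spatial size after applying each stage stride in turn (ceil division)."""
    size = input_size
    for stride in strides:
        size = -(-size // stride)         # ceiling division
    return size


# Stage strides: stem conv, max-pool, layer1, layer2, layer3, layer4.
original = feature_map_size(224, (2, 2, 1, 2, 2, 2))
modified = feature_map_size(224, (2, 2, 1, 2, 2, 1))   # layer4 stride set to 1
```

Under these assumptions `original` is 7 and `modified` is 14, matching the sizes quoted above.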

Since the network is modified and the target dataset could be different from ImageNet, the last fully-connected layer is randomly initialized with a final output dimension of $200$ and $1000$ for CUB200-2011 and ImageNet, respectively.

Input image transformation: For fair comparison, we used the same data augmentation strategy as the state-of-the-art WSOL methods, apart from Mixup, Cutout, and CutMix. In training, the input image is resized to $256\times 256$ and randomly cropped $224\times 224$ images are used to train the network. In testing, the input image is resized to $256\times 256$, cropped at the center to $224\times 224$, and used to validate the network, which is called the single crop strategy.

Estimating bounding box: We utilize class activation mapping (CAM) to estimate the bounding box of an object. First we compute the CAM of an image, and next we decide the foreground region of the image by binarizing the CAM with a specific threshold. Regions with intensity over the threshold are set to 1, otherwise to 0. We set the threshold to a fixed fraction $\sigma$ of the maximum intensity of the CAM, with $\sigma=0.15$ in all our experiments. From the binarized foreground map, the tightest box that covers the largest connected region is selected as the bounding box for WSOL.
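The box estimation procedure can be sketched as follows, operating on a CAM given as a nested list. This is an illustrative sketch, not the evaluation code: the function name `cam_to_box` is hypothetical, and the use of 4-connectivity, a strict `>` comparison, and inclusive pixel coordinates are assumptions that the text does not fix.

```python
from collections import deque


def cam_to_box(cam, sigma=0.15):
    """Tightest (x1, y1, x2, y2), inclusive, around the largest foreground blob."""
    height, width = len(cam), len(cam[0])
    threshold = sigma * max(max(row) for row in cam)
    fg = [[v > threshold for v in row] for row in cam]
    seen = [[False] * width for _ in range(height)]
    best = []
    for sy in range(height):
        for sx in range(width):
            if not fg[sy][sx] or seen[sy][sx]:
                continue
            # Breadth-first search over one 4-connected component.
            comp, queue = [], deque([(sx, sy)])
            seen[sy][sx] = True
            while queue:
                x, y = queue.popleft()
                comp.append((x, y))
                for nx, ny in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
                    if (0 <= nx < width and 0 <= ny < height
                            and fg[ny][nx] and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((nx, ny))
            if len(comp) > len(best):
                best = comp
    if not best:
        return None                       # no pixel exceeds the threshold
    xs = [x for x, _ in best]
    ys = [y for _, y in best]
    return min(xs), min(ys), max(xs), max(ys)
```

In practice the CAM is usually upsampled to the input resolution before thresholding, so that the resulting box is expressed in image coordinates.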

Evaluation metric: To measure the localization accuracy of models, we report top-1 localization accuracy (Loc), which is used for the ImageNet localization challenge. For an estimation to count as correct under top-1 localization accuracy, the intersection-over-union (IoU) between the estimated bounding box and the ground truth box must be larger than $0.5$ and, at the same time, the estimated class label must be correct. Otherwise, the estimation is treated as wrong.
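A minimal sketch of this criterion is given below. The function names are hypothetical, and boxes are assumed to be `(x1, y1, x2, y2)` corner coordinates in a continuous coordinate system, so that width is `x2 - x1`.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def is_top1_loc_correct(pred_label, true_label, pred_box, true_box, thr=0.5):
    """Correct only if the class matches and the IoU exceeds the threshold."""
    return pred_label == true_label and iou(pred_box, true_box) > thr
```

Top-1 localization accuracy is then the fraction of validation images for which `is_top1_loc_correct` returns `True`.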

The CUB-200-2011 dataset contains over 11 K images of 200 categories of birds. We set the number of training epochs to $600$. For ResNet-50, the learning rates for the last fully-connected layer and the other layers were set to $0.01$ and $0.001$, respectively. For the VGG network, the learning rates for the last fully-connected layer and the other layers were set to $0.001$ and $0.0001$, respectively. The learning rate is decayed by a factor of $0.1$ every $150$ epochs. We used the SGD optimizer, and the minibatch size, momentum, and weight decay were set to $32$, $0.9$, and $0.0001$, respectively.

B.2 ImageNet dataset

ImageNet-1K is a large-scale dataset for general objects consisting of 1.3 M training samples and 50 K validation samples. We set the number of training epochs to $20$. The learning rates for the last fully-connected layer and the other layers were set to $0.1$ and $0.01$, respectively. The learning rate is decayed by a factor of $0.1$ every $6$ epochs. We used the SGD optimizer, and the minibatch size, momentum, and weight decay were set to $256$, $0.9$, and $0.0001$, respectively.

Appendix C Transfer Learning to Object Detection

We evaluate the models on the Pascal VOC 2007 detection benchmark with 5 K test images over 20 object categories. For training, we use both VOC2007 and VOC2012 trainval (VOC07+12).

Finetuning on SSD (https://github.com/amdegroot/ssd.pytorch): The input image is resized to $300\times 300$ (SSD300) and we used the basic training strategy of the original paper, such as data augmentation, prior boxes, and extra layers. Since the backbone network is changed from VGG16 to ResNet-50, the pooling location conv4_3 of VGG16 is modified to the output of layer2 of ResNet-50. For training, we set the batch size, learning rate, and training iterations to $32$, $0.001$, and $120$ K, respectively. The learning rate is decayed by a factor of $0.1$ at $80$ K and $100$ K iterations.

Finetuning on Faster-RCNN (https://github.com/jwyang/faster-rcnn.pytorch): Faster-RCNN has a fully-convolutional structure, so we only modify the backbone from VGG16 to ResNet-50. The batch size, learning rate, and training iterations are set to $8$, $0.01$, and $120$ K, respectively. The learning rate is decayed by a factor of $0.1$ at $100$ K iterations.

Appendix D Transfer Learning to Image Captioning

The MS-COCO dataset contains $120$ K trainval images and $40$ K test images. Starting from the base model NIC (https://github.com/stevehuanghe/image_captioning), the backbone model is changed from GoogLeNet to ResNet-50. For training, we set the batch size, learning rate, and training epochs to $20$, $0.001$, and $100$, respectively. For evaluation, the beam size is set to $20$ for all the experiments. Image captioning results with various metrics are shown in Table A1.

Appendix E Robustness and Uncertainty

In this section, we describe the details of the experimental setting and evaluation methods.

We evaluate the model robustness to adversarial perturbations, occlusion and in-between samples using ImageNet trained models. For the base models, we use ResNet-50 structure and follow the settings in Section 4.1.1. For comparison, we use ResNet-50 trained without any additional regularization or augmentation techniques, ResNet-50 trained by Mixup strategy, ResNet-50 trained by Cutout strategy and ResNet-50 trained by our proposed CutMix strategy.

Fast Gradient Sign Method (FGSM): We employ the Fast Gradient Sign Method (FGSM) to generate adversarial samples. For a given image $x$, ground truth label $y$, and noise size $\epsilon$, FGSM generates an adversarial sample $\hat{x}$ as follows:

$$\hat{x}=x+\epsilon\,\text{sign}\left(\nabla_{x}L(\theta,x,y)\right),$$

where $L(\theta,x,y)$ denotes a loss function, for example, the cross entropy function. In our experiments, we set the noise scale $\epsilon=8/255$.
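To make the FGSM update $x+\epsilon\,\text{sign}(\nabla_{x}L(\theta,x,y))$ concrete, the sketch below applies it to a toy linear logistic model whose input gradient has a closed form. The toy model, the function names, and the example weights are all assumptions introduced for illustration; the experiments in this paper attack ResNet-50, where the gradient is obtained by backpropagation instead.

```python
import math


def logistic_loss_and_grad(w, x, y):
    """Cross-entropy loss of a linear logistic model and its gradient w.r.t. x."""
    z = sum(wi * xi for wi, xi in zip(w, x))
    p = 1.0 / (1.0 + math.exp(-z))
    loss = -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    grad_x = [(p - y) * wi for wi in w]   # dL/dx for the logistic model
    return loss, grad_x


def sign(value):
    """Return -1, 0, or 1 according to the sign of `value`."""
    return (value > 0) - (value < 0)


def fgsm(w, x, y, eps=8 / 255):
    """One-step FGSM: move each input coordinate by eps along the gradient sign."""
    _, grad_x = logistic_loss_and_grad(w, x, y)
    return [xi + eps * sign(g) for xi, g in zip(x, grad_x)]


w = [2.0, -1.0, 0.5]                      # hypothetical model weights
x = [0.2, 0.4, 0.6]                       # hypothetical input
x_adv = fgsm(w, x, y=1)
clean_loss, _ = logistic_loss_and_grad(w, x, 1)
adv_loss, _ = logistic_loss_and_grad(w, x_adv, 1)
```

The sketch omits clipping of the perturbed input to the valid pixel range, which an image-domain attack would typically include.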

Occlusion: For a given hole size $s$, we make a hole with width and height equal to $s$ in the center of the image. For center occluded samples, we zero out the inside of the hole, and for boundary occluded samples, we zero out the outside of the hole. In our experiments, we test the top-1 ImageNet validation accuracy of the models with hole sizes varying from $0$ to $224$.
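The two occlusion modes can be sketched on a single-channel image stored as a nested list. The function name `occlude` and the `mode` strings are hypothetical, and placing the hole at integer-floored centre coordinates is an assumption.

```python
def occlude(image, hole, mode="center"):
    """Zero the inside ("center") or the outside ("boundary") of a centred hole."""
    height, width = len(image), len(image[0])
    top, left = (height - hole) // 2, (width - hole) // 2

    def inside(x, y):
        return left <= x < left + hole and top <= y < top + hole

    keep_inside = mode == "boundary"      # boundary occlusion keeps the hole
    return [
        [v if inside(x, y) == keep_inside else 0 for x, v in enumerate(row)]
        for y, row in enumerate(image)
    ]


image = [[1] * 4 for _ in range(4)]
center_occluded = occlude(image, 2, "center")
boundary_occluded = occlude(image, 2, "boundary")
```

For a multi-channel image the same mask would be applied to every channel.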

In-between class samples: To generate in-between class samples, we first sample $50,000$ pairs of images from the ImageNet validation set. For generating Mixup samples, we generate a sample $x$ from the selected pair $x_{A}$ and $x_{B}$ by $x=\lambda x_{A}+(1-\lambda)x_{B}$. We report the top-1 accuracy on the Mixup samples by varying $\lambda$ from $0$ to $1$. To generate CutMix in-between samples, we employ the center mask instead of the random mask. We follow the hole generation process used in the occlusion experiments. We evaluate the top-1 accuracy on the CutMix samples by varying the hole size $s$ from $0$ to $224$.
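Both kinds of in-between sample can be sketched for single-channel images stored as nested lists. The function names are hypothetical, and pasting the centre square of $x_{B}$ into $x_{A}$ (rather than the reverse) is an assumption about the direction of the paste.

```python
def mixup_sample(x_a, x_b, lam):
    """Pixel-wise interpolation x = lam * x_a + (1 - lam) * x_b."""
    return [
        [lam * a + (1.0 - lam) * b for a, b in zip(row_a, row_b)]
        for row_a, row_b in zip(x_a, x_b)
    ]


def center_cutmix_sample(x_a, x_b, hole):
    """Paste the centred hole x hole square of x_b into a copy of x_a."""
    height, width = len(x_a), len(x_a[0])
    top, left = (height - hole) // 2, (width - hole) // 2
    out = [row[:] for row in x_a]
    for y in range(top, top + hole):
        out[y][left:left + hole] = x_b[y][left:left + hole]
    return out


x_a = [[1.0] * 4 for _ in range(4)]
x_b = [[0.0] * 4 for _ in range(4)]
mixup = mixup_sample(x_a, x_b, 0.25)
cutmix = center_cutmix_sample(x_a, x_b, 2)
```

Sweeping `lam` or `hole` over its range yields the curves over combination ratio discussed in the text.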

E.2 Uncertainty

Deep neural networks are often overconfident in their predictions. For example, deep neural networks produce high-confidence predictions even for random noise. One standard benchmark for evaluating the overconfidence of a network is out-of-distribution (OOD) detection, whose authors proposed a threshold-based detector that solves the binary classification task of separating in-distribution from out-of-distribution samples using the prediction of the given network. Recently, a number of studies have proposed enhancements to the baseline detector, but in this paper we follow only the baseline detector algorithm, without any input enhancement or temperature scaling.

Setup: We compare the OOD detector performance using CIFAR-100 trained models described in Section 4.1.2. For comparison, we use PyramidNet-200 model without any regularization method, PyramidNet-200 model with Mixup, PyramidNet-200 model with Cutout and PyramidNet-200 model with our proposed CutMix.

Evaluation Metrics and Out-of-distributions: In this work, we follow the experimental setting used in prior work on OOD detection. To measure the performance of the OOD detector, we report the true negative rate (TNR) at 95% true positive rate (TPR), the area under the receiver operating characteristic curve (AUROC), and the detection accuracy of each OOD detector. We use seven datasets as out-of-distribution data: TinyImageNet (crop), TinyImageNet (resize), LSUN (crop), LSUN (resize), iSUN, Uniform noise, and Gaussian noise.
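The three metrics can be sketched from two lists of detector scores. The conventions here are assumptions rather than details stated in the text: in-distribution is the positive class, a higher score means "more in-distribution", and detection accuracy is taken as the best value of $0.5\,(\text{TPR}+\text{TNR})$ over all thresholds, that is, with equal class priors. The function names are hypothetical.

```python
import math


def tnr_at_tpr(in_scores, out_scores, tpr=0.95):
    """TNR at the threshold that accepts at least `tpr` of in-distribution data."""
    ranked = sorted(in_scores, reverse=True)
    k = math.ceil(tpr * len(ranked))      # positives that must be accepted
    threshold = ranked[k - 1]
    return sum(s < threshold for s in out_scores) / len(out_scores)


def auroc(in_scores, out_scores):
    """Probability that an in-distribution score beats an OOD score (ties 0.5)."""
    wins = sum((i > o) + 0.5 * (i == o) for i in in_scores for o in out_scores)
    return wins / (len(in_scores) * len(out_scores))


def detection_accuracy(in_scores, out_scores):
    """Best balanced accuracy over all candidate thresholds."""
    best = 0.0
    for t in sorted(set(in_scores) | set(out_scores)):
        tpr = sum(s >= t for s in in_scores) / len(in_scores)
        tnr = sum(s < t for s in out_scores) / len(out_scores)
        best = max(best, 0.5 * (tpr + tnr))
    return best
```

For the baseline detector the score would typically be the maximum softmax probability, so an overconfident model assigns high scores to OOD inputs and all three metrics drop.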

Results: We report OOD detector performance on the seven OOD datasets in Table A2. Overall, CutMix outperforms the baseline, Mixup, and Cutout. Moreover, we find that even though Mixup and Cutout outperform the baseline in classification performance, they largely degrade the baseline detector performance. In particular, for Uniform noise and Gaussian noise, Mixup and Cutout seriously impair the baseline performance while CutMix dramatically improves it. From the experiments, we observe that our proposed CutMix enhances the OOD detector performance, while Mixup and Cutout produce more overconfident predictions on OOD samples than the baseline.