The Many Faces of Robustness: A Critical Analysis of Out-of-Distribution Generalization

Dan Hendrycks, Steven Basart, Norman Mu, Saurav Kadavath, Frank Wang, Evan Dorundo, Rahul Desai, Tyler Zhu, Samyak Parajuli, Mike Guo, Dawn Song, Jacob Steinhardt, Justin Gilmer

Introduction

While the research community must create robust models that generalize to new scenarios, the robustness literature lacks consensus on evaluation benchmarks and contains many dissonant hypotheses. Hendrycks et al., 2020 find that many recent language models are already robust to many forms of distribution shift, while others find that vision models are largely fragile and argue that data augmentation offers one solution. In contrast, other researchers provide results suggesting that using pretraining and improving in-distribution test set accuracy improves natural robustness, whereas other methods do not.

Prior works have also offered various interpretations of empirical results, such as the Texture Bias hypothesis that convolutional networks are biased towards texture, harming robustness. Additionally, some authors posit a fundamental distinction between robustness on synthetic benchmarks vs. real-world distribution shifts, casting doubt on the generality of conclusions drawn from experiments conducted on synthetic benchmarks.

It has been difficult to arbitrate these hypotheses because existing robustness datasets vary multiple factors (e.g., time, camera, location, etc.) simultaneously in unspecified ways. Existing datasets also lack diversity such that it is hard to extrapolate which methods will improve robustness more broadly. To address these issues and test the methods outlined above, we introduce four new robustness datasets and a new data augmentation method.

First we introduce ImageNet-Renditions (ImageNet-R), a 30,000 image test set containing various renditions (e.g., paintings, embroidery, etc.) of ImageNet object classes. These renditions are naturally occurring, with textures and local image statistics unlike those of ImageNet images, allowing us to compare against gains on synthetic robustness benchmarks.

Next, we investigate the effect of changes in the image capture process with StreetView StoreFronts (SVSF) and DeepFashion Remixed (DFR). SVSF contains business storefront images collected from Google StreetView, along with metadata allowing us to vary location, year, and even the camera type. DFR leverages the metadata from DeepFashion2 to systematically shift object occlusion, orientation, zoom, and scale at test time. Both SVSF and DFR provide distribution shift controls and do not alter texture, which removes possible confounding variables affecting prior benchmarks.

Additionally, we collect Real Blurry Images, which consists of 1,000 blurry natural images from a 100-class subset of the ImageNet classes. This benchmark serves as a real-world analog for the synthetic blur corruptions of the ImageNet-C benchmark. With it we find that synthetic corruptions correlate with corruptions that appear in the wild, contradicting speculations from previous work.

Finally, we contribute DeepAugment to increase robustness to some new types of distribution shift. This augmentation technique uses image-to-image neural networks for data augmentation. DeepAugment improves robustness on our newly introduced ImageNet-R benchmark and can also be combined with other augmentation methods to outperform a model pretrained on $1000\times$ more labeled data.

We use these new datasets to test four overarching classes of methods for improving robustness:

Larger Models: increasing model size improves robustness to distribution shift.

Self-Attention: adding self-attention layers to models improves robustness.

Diverse Data Augmentation: robustness can increase through data augmentation.

Pretraining: pretraining on larger and more diverse datasets improves robustness.

After examining our results on these four new datasets as well as prior benchmarks, we can rule out several previous hypotheses while strengthening support for others. As one example, we find that synthetic data augmentation robustness interventions improve accuracy on ImageNet-R and real-world image blur distribution shifts, which lends credence to the use of synthetic robustness benchmarks and also reinforces the Texture Bias hypothesis. In the conclusion, we summarize the various strands of evidence for and against each hypothesis. Across our many experiments, we do not find a general method that consistently improves robustness, and some hypotheses require additional qualifications. While robustness is often spoken of and measured as a single scalar property like accuracy, our investigations show that robustness is not so simple. Our results show that future robustness research requires more thorough evaluation using more robustness datasets.

Related Work

Recent works have begun to characterize model performance on out-of-distribution (OOD) data with various new test sets, with dissonant findings. For instance, prior work demonstrates that modern language processing models are moderately robust to numerous naturally occurring distribution shifts, and that IID accuracy is not straightforwardly predictive of OOD accuracy for natural language tasks. For image recognition, other work analyzes image models and shows that they are sensitive to various simulated image corruptions (e.g., noise, blur, weather, JPEG compression, etc.) from their ImageNet-C benchmark.

Recht et al., 2019 reproduce the ImageNet validation set for use as a benchmark of naturally occurring distribution shift in computer vision. Their evaluations show an 11-14% drop in accuracy from ImageNet to the new validation set, named ImageNetV2, across a wide range of architectures. Other work uses ImageNetV2 to measure natural robustness and concludes that methods such as data augmentation do not significantly improve robustness. Recently, researchers identified statistical biases in ImageNetV2’s construction, and they estimate that re-weighting ImageNetV2 to correct for these biases results in a less substantial 3.6% drop.

Data Augmentation.

Recent works demonstrate that data augmentation can improve robustness on ImageNet-C. The space of augmentations that help robustness includes various types of noise, highly unnatural image transformations, or compositions of simple image transformations such as Python Imaging Library operations. Some of these augmentations can improve accuracy on in-distribution examples as well as on out-of-distribution (OOD) examples.

New Datasets

In order to evaluate the four robustness methods, we introduce four new benchmarks that capture new types of naturally occurring distribution shifts. ImageNet-Renditions (ImageNet-R) and Real Blurry Images are both newly collected test sets intended for ImageNet classifiers, whereas StreetView StoreFronts (SVSF) and DeepFashion Remixed (DFR) each contain their own training sets and multiple test sets. SVSF and DFR split data into training and test sets based on various image attributes stored in the metadata. For example, we can select a test set with images produced by a camera different from the training set camera. We now describe the structure and collection of each dataset.

While current classifiers can learn some aspects of an object’s shape, they nonetheless rely heavily on natural textural cues. In contrast, human vision can process abstract visual renditions. For example, humans can recognize visual scenes from line drawings as quickly and accurately as they can from photographs. Even some primate species have demonstrated the ability to recognize shape through line drawings.

To measure generalization to various abstract visual renditions, we create the ImageNet-Renditions (ImageNet-R) dataset. ImageNet-R contains various artistic renditions of object classes from the original ImageNet dataset. Note the original ImageNet dataset discouraged such images since annotators were instructed to collect “photos only, no painting, no drawings, etc.” We do the opposite.

ImageNet-R contains 30,000 image renditions for 200 ImageNet classes. We choose a subset of the ImageNet-1K classes, following prior work, for several reasons. A handful of ImageNet classes already have many renditions, such as “triceratops.” We also choose a subset so that model misclassifications are egregious and to reduce label noise. The 200 class subset was also chosen based on rendition prevalence, as “strawberry” renditions were easier to obtain than “radiator” renditions. Were we to use all 1,000 ImageNet classes, annotators would be pressed to distinguish Norwich terrier renditions from Norfolk terrier renditions, which is difficult. We collect images primarily from Flickr and use queries such as “art,” “cartoon,” “graffiti,” “embroidery,” “graphics,” “origami,” “painting,” “pattern,” “plastic object,” “plush object,” “sculpture,” “line drawing,” “tattoo,” “toy,” “video game,” and so on. Images are filtered by Amazon MTurk annotators using a modified collection interface from ImageNetV2. For instance, after scraping Flickr images with the query “lighthouse cartoon,” we have MTurk annotators select true positive lighthouse renditions. Finally, as a second round of quality control, graduate students manually filter the resulting images and ensure that individual images have correct labels and do not contain multiple labels. Examples are depicted in Figure 2. ImageNet-R also includes the line drawings from prior work, excluding horizontally mirrored duplicate images, pitch black images, and images from the incorrectly collected “pirate ship” class.

StreetView StoreFronts (SVSF)

Computer vision applications often rely on data from complex pipelines that span different hardware, times, and geographies. Ambient variations in this pipeline may result in unexpected performance degradation, such as degradations experienced by health care providers in Thailand deploying laboratory-tuned diabetic retinopathy classifiers in the field. In order to study the effects of shifts in the image capture process we collect the StreetView StoreFronts (SVSF) dataset, a new image classification dataset sampled from Google StreetView imagery focusing on three distribution shift sources: country, year, and camera.

SVSF consists of cropped images of business store fronts extracted from StreetView images by an object detection model. Each store front image is assigned the class label of the associated Google Maps business listing through a combination of machine learning models and human annotators. We combine several visually similar business types (e.g. drugstores and pharmacies) for a total of 20 classes, listed in the Supplementary Materials.

Splitting the data along the three metadata attributes of country, year, and camera, we create one training set and five test sets. We sample a training set and an in-distribution test set (200K and 10K images, respectively) from images taken in US/Mexico/Canada during 2019 using a “new” camera system. We then sample four OOD test sets (10K images each) which alter one attribute at a time while keeping the other two attributes consistent with the training distribution. Our test sets are year: 2017, 2018; country: France; and camera: “old.”
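The one-attribute-at-a-time construction described above can be sketched in a few lines of Python. This is only an illustration of the splitting logic under stated assumptions: the record format (a dict with `country`, `year`, and `camera` keys) and the helper names `ood_specs` and `select` are hypothetical and are not part of the SVSF release.

```python
# Training-distribution attribute values, as described in the text.
TRAIN_ATTRS = {"country": "US/Mexico/Canada", "year": 2019, "camera": "new"}

# Shifted values per attribute; each value yields one OOD test set.
SHIFTS = {"year": [2017, 2018], "country": ["France"], "camera": ["old"]}


def ood_specs(train_attrs, shifts):
    """Return one attribute spec per OOD test set, changing a single attribute."""
    specs = []
    for attr, values in shifts.items():
        for value in values:
            spec = dict(train_attrs)  # copy so the training spec is untouched
            spec[attr] = value        # alter exactly one attribute
            specs.append(spec)
    return specs


def select(records, spec):
    """Keep the records whose metadata matches every attribute in the spec."""
    return [r for r in records if all(r[k] == v for k, v in spec.items())]


# Hypothetical metadata records, for illustration only.
records = [
    {"id": 0, "country": "US/Mexico/Canada", "year": 2019, "camera": "new"},
    {"id": 1, "country": "France", "year": 2019, "camera": "new"},
    {"id": 2, "country": "US/Mexico/Canada", "year": 2017, "camera": "new"},
    {"id": 3, "country": "France", "year": 2017, "camera": "old"},
]
iid_pool = select(records, TRAIN_ATTRS)
ood_pools = [select(records, spec) for spec in ood_specs(TRAIN_ATTRS, SHIFTS)]
```

Record 3 differs from the training distribution in all three attributes, so it lands in none of the pools. That is the point of the design: each OOD test set isolates a single source of shift.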

DeepFashion Remixed

Changes in day-to-day camera operation can cause shifts in attributes such as object size, object occlusion, camera viewpoint, and camera zoom. To measure this, we repurpose DeepFashion2 to create the DeepFashion Remixed (DFR) dataset. We designate a training set with 48K images and create eight out-of-distribution test sets to measure performance under shifts in object size, object occlusion, camera viewpoint, and camera zoom-in. DeepFashion Remixed is a multi-label classification task since images may contain more than one clothing item per image.

Similar to SVSF, we fix one value for each of the four metadata attributes in the training distribution. Specifically, the DFR training set contains images with medium scale, medium occlusion, side/back viewpoint, and no zoom-in. After sampling an IID test set, we construct eight OOD test distributions by altering one attribute at a time, obtaining test sets with minimal and heavy occlusion; small and large scale; frontal and not-worn viewpoints; and medium and large zoom-in. See the Supplementary Materials for details on test set sizes.

Real Blurry Images

We collect a small dataset of 1,000 real-world blurry images to capture real-world corruptions and validate synthetic image corruption benchmarks such as ImageNet-C. We collect the “Real Blurry Images” dataset from Flickr and query ImageNet object class names concatenated with the word “blurry.” Examples are in Figure 3. Each image belongs to one of 100 ImageNet classes.

DeepAugment

In order to further explore the effects of data augmentation, we introduce a new data augmentation technique. Whereas most previous data augmentation techniques use simple augmentation primitives applied to the raw image itself, we introduce DeepAugment, which distorts images by perturbing internal representations of deep networks.

DeepAugment works by passing a clean image through an image-to-image network and introducing several perturbations during the forward pass. These perturbations are randomly sampled from a set of manually designed functions and applied to the network weights and to the feed-forward signal at random layers. For example, our set of perturbations includes zeroing, negating, convolving, transposing, applying activation functions, and more. This setup generates semantically consistent images with unique and diverse distortions as shown in Figure 4. Although our set of perturbations is designed with random operations, we show that DeepAugment still outperforms other methods on benchmarks such as ImageNet-C and ImageNet-R. We provide the pseudocode in the Supplementary Materials.

For our experiments, we specifically use the CAE and EDSR architectures as the basis for DeepAugment. CAE is an autoencoder architecture, and EDSR is a superresolution architecture. These two architectures show the DeepAugment approach works with different architectures. Each clean image in the original dataset is passed through the network and is thereby stochastically distorted, resulting in two distorted versions of the clean dataset (one for CAE and one for EDSR). We then train on the augmented and clean data simultaneously and call this approach DeepAugment. The EDSR and CAE architectures are arbitrary. We show that the DeepAugment approach also works for untrained, randomly sampled architectures in the Supplementary Materials.

Experiments

In this section we briefly describe the evaluated models, pretraining techniques, self-attention mechanisms, data augmentation methods, and note various implementation details.

Model Architectures and Sizes. Most experiments are evaluated on a standard ResNet-50 model. Model size evaluations use ResNets or ResNeXts of varying sizes.

Pretraining. For pretraining we use ImageNet-21K, which contains approximately 21,000 classes and approximately 14 million labeled training images, or around $10\times$ more labeled training data than ImageNet-1K. We also tune an ImageNet-21K model. We also use a large pre-trained ResNeXt-101 model. This was pre-trained on approximately 1 billion Instagram images with hashtag labels and fine-tuned on ImageNet-1K. This Weakly Supervised Learning (WSL) pretraining strategy uses approximately $1000\times$ more labeled data.

Self-Attention. When studying self-attention, we employ CBAM and SE modules, two forms of self-attention that help models learn spatially distant dependencies.

Results

We now perform experiments on ImageNet-R, StreetView StoreFronts, DeepFashion Remixed, and Real Blurry Images. We also evaluate on ImageNet-C and compare and contrast it with real distribution shifts.

Table 1 shows performance on ImageNet-R as well as on ImageNet-200 (the original ImageNet data restricted to ImageNet-R’s 200 classes). This has several implications regarding the four method-specific hypotheses. Pretraining with ImageNet-21K (approximately $10\times$ labeled data) hardly helps. The Supplementary Materials show WSL pretraining can help, but Instagram has renditions, while ImageNet excludes them; hence we conclude comparable pretraining was ineffective. Notice self-attention increases the IID/OOD gap. Compared to simpler data augmentation techniques such as Speckle Noise, the data augmentation techniques of Style Transfer, AugMix, and DeepAugment improve generalization. Note AugMix and DeepAugment improve in-distribution performance whereas Style Transfer hurts it. Also, our new DeepAugment technique is the best standalone method with an error rate of 57.8%. Last, larger models reduce the IID/OOD gap.

As for prior hypotheses in the literature regarding model robustness, we find that biasing networks away from natural textures through diverse data augmentation improved performance. The IID/OOD generalization gap varies greatly by method, demonstrating that it is possible to significantly outperform the trendline of models optimized solely for the IID setting. Finally, as ImageNet-R contains real-world examples, and since data augmentation helps on ImageNet-R, we now have clear evidence against the hypothesis that robustness interventions cannot help with natural distribution shifts.

StreetView StoreFronts. In Table 2, we evaluate data augmentation methods on SVSF and find that all of the tested methods have mostly similar performance and that no method helps much on country shift, where error rates roughly double across the board. Here evaluation is limited to augmentations due to a 30-day retention window for each instantiation of the dataset. Images captured in France contain noticeably different architectural styles and storefront designs than those captured in US/Mexico/Canada; meanwhile, we are unable to find conspicuous and consistent indicators of the camera and year. This may explain the relative insensitivity of evaluated methods to the camera and year shifts. Overall, data augmentation here shows limited benefit, suggesting either that data augmentation primarily helps combat texture bias as with ImageNet-R, or that existing augmentations are not diverse enough to capture high-level semantic shifts such as building architecture.

DeepFashion Remixed. Table 3 shows our experimental findings on DFR, in which all evaluated methods have an average OOD mAP that is close to the baseline. In fact, most OOD mAP increases track IID mAP increases. In general, DFR’s size and occlusion shifts hurt performance the most. We also evaluate with Random Erasure augmentation, which deletes rectangles within the image, to simulate occlusion. Random Erasure improved occlusion performance, but Style Transfer helped even more. Nothing substantially improved OOD performance beyond what is explained by IID performance, so in this setting it would appear that only IID performance matters. Our results suggest that while some methods may improve robustness to certain forms of distribution shift, no method substantially raises performance across all shifts.

Real Blurry Images and ImageNet-C. We now consider a previous robustness benchmark to evaluate the four major methods. We use the ImageNet-C dataset which applies 15 common image corruptions (e.g., Gaussian noise, defocus blur, simulated fog, JPEG compression, etc.) across 5 severities to ImageNet-1K validation images. We find that DeepAugment improves robustness on ImageNet-C. Figure 5 shows that when models are trained with both AugMix and DeepAugment they set a new state-of-the-art, breaking the trendline and exceeding the corruption robustness provided by training on $1000\times$ more labeled training data. Note the augmentations from AugMix and DeepAugment are disjoint from ImageNet-C’s corruptions. Full results are shown in the Supplementary Materials. IID accuracy alone is clearly unable to capture the full story of model robustness. Instead, larger models, self-attention, data augmentation, and pretraining all improve robustness far beyond the degree predicted by their influence on IID accuracy.

A recent work reminds us that ImageNet-C uses various synthetic corruptions and suggests that they are decoupled from real-world robustness. Real-world robustness requires generalizing to naturally occurring corruptions such as snow, fog, blur, low-lighting noise, and so on, but it is an open question whether ImageNet-C’s simulated corruptions meaningfully approximate real-world corruptions.

We evaluate various models on Real Blurry Images and find that all the robustness interventions that help with ImageNet-C also help with real-world blurry images. Hence ImageNet-C can track performance on real-world corruptions. Moreover, DeepAugment+AugMix has the lowest error rate on Real Blurry Images, which again contradicts the synthetic vs natural dichotomy. The upshot is that ImageNet-C is a controlled and systematic proxy for real-world robustness.

Our results, which are expanded on in the Supplementary Materials, show that larger models, self-attention, data augmentation, and pretraining all help, just like on ImageNet-C. Here DeepAugment+AugMix attains state-of-the-art. These results suggest ImageNet-C’s simulated corruptions track real-world corruptions. In hindsight, this is expected since various computer vision problems have used synthetic corruptions as proxies for real-world corruptions for decades. In short, ImageNet-C is a diverse and systematic benchmark that is correlated with improvements on real-world corruptions.

Conclusion

In this paper we introduced four real-world datasets for evaluating the robustness of computer vision models: ImageNet-Renditions, DeepFashion Remixed, StreetView StoreFronts, and Real Blurry Images. With our new datasets, we re-evaluate previous robustness interventions and determine whether various robustness hypotheses are correct or incorrect in view of our new findings.

Our main results for different robustness interventions are as follows. Larger models improved robustness on Real Blurry Images, ImageNet-C, and ImageNet-R, but not with DFR. While self-attention noticeably helped Real Blurry Images and ImageNet-C, it did not help with ImageNet-R and DFR. Diverse data augmentation was ineffective for SVSF and DFR, but it greatly improved accuracy on Real Blurry Images, ImageNet-C, and ImageNet-R. Pretraining greatly helped with Real Blurry Images and ImageNet-C but hardly helped with DFR and ImageNet-R. It was not obvious a priori that synthetic data augmentation could improve accuracy on a real-world distribution shift such as ImageNet-R, nor had pretraining ever failed to improve performance in earlier research. Table 4 shows that many methods improve robustness across multiple distribution shifts. While no single method consistently helped across all distribution shifts, some helped more than others.

Our analysis also has implications for the three robustness hypotheses. In support of the Texture Bias hypothesis, ImageNet-R shows that standard networks do not generalize well to renditions (which have different textures), but that diverse data augmentation (which often distorts textures) can recover accuracy. More generally, larger models and diverse data augmentation consistently helped on ImageNet-R, ImageNet-C, and Real Blurry Images, suggesting that these two interventions reduce texture bias. However, these methods helped little for geographic shifts, showing that there is more to robustness than texture bias alone. Regarding more general trends across the last several years of progress in deep learning, while IID accuracy is a strong predictor of OOD accuracy, it is not decisive, contrary to some prior works. Again contrary to a hypothesis from prior work, our findings show that the gains from data augmentation on ImageNet-C generalize to both ImageNet-R and Real Blurry Images, which serves as a resounding validation of using synthetic benchmarks to measure model robustness.

The existing literature presents several conflicting accounts of robustness. What led to this conflict? We suspect that this is due in large part to inconsistent notions of how to best evaluate robustness, and in particular a desire to simplify the problem by establishing the primacy of a single benchmark over others. In response, we collected several additional datasets which each capture new dimensions of distribution shift and degradations in model performance not well studied before. These new datasets demonstrate the importance of conducting multi-faceted evaluations of robustness as well as the general complexity of the landscape of robustness research, where it seems that so far nothing consistently helps in all settings. Hence the research community may consider prioritizing the study of new robustness methods, and we encourage the research community to evaluate future methods on multiple distribution shifts. For example, ImageNet models should at least be tested against ImageNet-C and ImageNet-R. By heightening experimental standards for robustness research, we facilitate future work towards developing systems that can robustly generalize in safety-critical settings.

Appendix A Additional Results

ImageNet-R. Expanded ImageNet-R results are in Table 8. WSL pretraining on Instagram images appears to yield dramatic improvements on ImageNet-R, but the authors note the prevalence of artistic renditions of object classes on the Instagram platform. While ImageNet’s data collection process actively excluded renditions, we do not have reason to believe the Instagram dataset excluded renditions. On a ResNeXt-101 32×8d model, WSL pretraining improves ImageNet-R performance by a massive 33.3 percentage points, from 57.5% top-1 error to 24.2%. Ultimately, without examining the training images we are unable to determine whether ImageNet-R represents an actual distribution shift to the Instagram WSL models. However, we also observe that with greater controls, that is, with ImageNet-21K pretraining, pretraining hardly helped ImageNet-R performance, so it is not clear that more pretraining data improves ImageNet-R performance.

Increasing model size appears to automatically improve ImageNet-R performance, as shown in Figure 6. A ResNet-50 (25.5M parameters) has 63.9% error, while a ResNet-152 (60M) has 58.7% error. ResNeXt-50 32×4d (25.0M) attains 62.3% error and ResNeXt-101 32×8d (88M) attains 57.5% error.

ImageNet-C. Expanded ImageNet-C results are in Table 7. We also tested whether model size improves performance on ImageNet-C for even larger models. With a different codebase, we trained ResNet-50, ResNet-152, and ResNet-500 models which achieved 80.6, 74.0, and 68.5 mCE respectively. Expanded comparisons between ImageNet-C and Real Blurry Images are in Table 5.
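For readers unfamiliar with the mCE figures quoted above, the following is a minimal sketch of how a mean corruption error can be computed. The normalization by a baseline model follows our recollection of the ImageNet-C benchmark definition (which uses AlexNet as the baseline) and is not spelled out in this paper. The function names and the error values are hypothetical and only exercise the arithmetic.

```python
from statistics import mean


def corruption_error(model_errs, baseline_errs):
    """CE for one corruption type: error summed over severities,
    divided by the baseline model's summed error on the same corruption."""
    return sum(model_errs) / sum(baseline_errs)


def mean_corruption_error(model, baseline):
    """mCE in percent: the average normalized CE over all corruption types.

    `model` and `baseline` map a corruption name to a list of top-1 error
    rates, one per severity level.
    """
    return 100.0 * mean(corruption_error(model[c], baseline[c]) for c in model)


# Hypothetical error rates (fractions) for two corruptions at five severities.
baseline = {
    "defocus_blur": [0.60, 0.70, 0.80, 0.90, 1.00],
    "gaussian_noise": [0.70, 0.80, 0.90, 0.95, 0.97],
}
model = {name: [e / 2 for e in errs] for name, errs in baseline.items()}
mce = mean_corruption_error(model, baseline)
```

Under this definition a model that exactly matches the baseline scores 100, and lower is better, so the normalization keeps harder corruptions from dominating the average.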

ImageNet-A. ImageNet-A is an adversarially filtered test set and is constructed based on existing model weaknesses (see prior work for another robustness dataset algorithmically determined by model weaknesses). This dataset contains examples that are difficult for a ResNet-50 to classify, so examples solvable by simple spurious cues are especially infrequent in this dataset. Results are in Table 9. Notice Res2Net architectures can greatly improve accuracy. Results also show that Larger Models, Self-Attention, and Pretraining help, while Diverse Data Augmentation usually does not help substantially.

Implications for the Four Methods. Larger Models help with ImageNet-C (+), ImageNet-A (+), ImageNet-R (+), yet do not markedly improve DFR (-) performance. Self-Attention helps with ImageNet-C (+), ImageNet-A (+), yet does not help ImageNet-R (-) and DFR (-) performance. Diverse Data Augmentation helps ImageNet-C (+), ImageNet-R (+), yet does not markedly improve ImageNet-A (-), DFR (-), nor SVSF (-) performance. Pretraining helps with ImageNet-C (+), ImageNet-A (+), yet does not markedly improve DFR (-) nor ImageNet-R (-) performance.

Appendix B DeepAugment Details

Below is Pythonic pseudocode for DeepAugment. The basic structure of DeepAugment is agnostic to the backbone network used, but specifics such as which layers are chosen for various transforms may vary as the backbone architecture varies. We do not need to train many different image-to-image models to get diverse distortions. We only use two existing models, the EDSR super-resolution model and the CAE image compression model. See full code for such details.

At a high level, DeepAugment processes each image with an image-to-image network. The image-to-image network’s weights and feedforward activations are distorted with each pass. The distortion is made possible by, for example, negating the network’s weights and applying dropout to the feedforward activations. These modifications were not carefully chosen and demonstrate the utility of mixing together diverse operations without tuning. The resulting image is distorted and saved. This process generates an augmented dataset.
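The high-level procedure can be illustrated with a small NumPy sketch. It is not the authors' implementation: a toy two-layer network of 1x1 convolutions with random weights stands in for a pretrained image-to-image model such as EDSR or CAE, only three of the perturbations named in the text are included (negating weights, zeroing weights, and dropout on activations), and the function names and the 10% perturbation rate are assumptions made for illustration.

```python
import numpy as np


def make_toy_network(rng, channels=3, hidden=8):
    """Two 1x1-convolution layers standing in for an image-to-image model."""
    w1 = rng.normal(0.0, 0.5, size=(hidden, channels))
    w2 = rng.normal(0.0, 0.5, size=(channels, hidden))
    return [w1, w2]


def perturb_weights(w, rng, rate=0.1):
    """Randomly negate or zero a subset of the weights, or leave them intact."""
    op = rng.choice(["negate", "zero", "none"])
    mask = rng.random(w.shape) < rate
    if op == "negate":
        return np.where(mask, -w, w)
    if op == "zero":
        return np.where(mask, 0.0, w)
    return w


def perturb_activations(h, rng, drop_prob=0.1):
    """Apply dropout to the feed-forward signal."""
    keep = rng.random(h.shape) >= drop_prob
    return h * keep / (1.0 - drop_prob)


def deepaugment(image, weights, rng):
    """Distort one HxWxC image in [0, 1] with a randomly perturbed forward pass."""
    h = image
    for i, w in enumerate(weights):
        w = perturb_weights(w, rng)        # distort the weights for this pass only
        h = h @ w.T                        # 1x1 convolution over the channel axis
        if i < len(weights) - 1:
            h = np.maximum(h, 0.0)         # ReLU
            h = perturb_activations(h, rng)
    return np.clip(h, 0.0, 1.0)


rng = np.random.default_rng(0)
network = make_toy_network(rng)
clean = rng.random((4, 4, 3))                  # hypothetical clean image
distorted = deepaugment(clean, network, rng)   # would be saved to the augmented set
```

Because the stored weights are never modified in place, every call draws a fresh set of perturbations. With a pretrained backbone in place of the toy network, the same loop would be run over the clean dataset once to produce the static augmented dataset.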

Ablations.

We run ablations on DeepAugment to understand the contributions from the EDSR and CAE models independently. Table 11 contains results of these experiments on ImageNet-R and Table 10 contains results of these experiments on ImageNet-C. In both tables, “DeepAugment (EDSR)” and “DeepAugment (CAE)” refer to experiments where we only use a single extra augmented training set (+ the standard training set), and train on those images.

Noise2Net.

We show that untrained, randomly sampled neural networks can provide useful deep augmentations, highlighting the efficacy of the DeepAugment approach. While in the main paper we use EDSR and CAE to create DeepAugment augmentations, in this section we explore the use of randomly initialized image-to-image networks to generate diverse image augmentations. We propose a DeepAugment method, Noise2Net.

In Noise2Net, the architecture and weights are randomly sampled. Noise2Net is the composition of several residual blocks: $\text{Block}(x) = x + \varepsilon \cdot f_{\Theta}(x)$, where $\Theta$ is randomly initialized and $\varepsilon$ is a parameter that controls the strength of the augmentation. For all our experiments, we use 4 Res2Net blocks and $\varepsilon \sim U(0.375, 0.75)$. The weights of Noise2Net are resampled at every minibatch, and the dilation and kernel sizes of all the convolutions used in Noise2Net are randomly sampled every epoch. Hence Noise2Net augments an image to an augmented image by processing the image through a randomly sampled network with random weights.
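The residual form of each block can be sketched as follows. This is a simplified illustration, not the authors' code: the inner function is assumed here to be a random 1x1 convolution followed by tanh rather than a Res2Net block, and the augmentation strength is assumed to be sampled once per call, since the text does not say whether it is drawn per block.

```python
import numpy as np


def residual_block(x, theta, eps):
    """Block(x) = x + eps * f_theta(x) for a (C, H, W) image.

    f_theta is a 1x1 convolution followed by tanh, an assumption made
    for brevity in place of a Res2Net block.
    """
    f = np.tanh(np.einsum("oc,chw->ohw", theta, x))
    return x + eps * f


def noise2net(x, rng, num_blocks=4):
    """Compose residual blocks with freshly sampled random weights."""
    eps = rng.uniform(0.375, 0.75)          # augmentation strength
    channels = x.shape[0]
    for _ in range(num_blocks):
        theta = rng.normal(size=(channels, channels))  # random, untrained weights
        x = residual_block(x, theta, eps)
    return x


rng = np.random.default_rng(0)
image = rng.random((3, 8, 8))               # hypothetical clean image
augmented = noise2net(image, rng)
```

Since tanh is bounded by 1 in magnitude, each block changes a pixel by at most the sampled strength, which shows how that parameter limits how far the output drifts from the clean image.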

Recall that in the case of EDSR and CAE, we used networks to generate a static dataset, and then we trained normally on that static dataset. This setup could not be done on-the-fly. That is because we fed in one example at a time with EDSR and CAE. If we pass the entire minibatch through EDSR or CAE, we will end up applying the same augmentation to all images in the minibatch, reducing stochasticity and augmentation diversity. In contrast, Noise2Net enables us to process batches of images on-the-fly and obviates the need for creating a static augmented dataset.

In Noise2Net, each example is processed differently in parallel, so we generate more diverse augmentations in real-time. To make this possible, we use grouped convolutions. A grouped convolution with number of groups = $N$ will take a set of $kN$ channels as input, and apply $N$ independent convolutions on channels $\{1,\ldots,k\},\{k+1,\ldots,2k\},\ldots,\{(N-1)k+1,\ldots,Nk\}$. Given a minibatch of size $B$, we can apply a randomly initialized grouped convolution with $N=B$ groups in order to apply a different random convolutional filter to each element in the batch in a single forward pass. By replacing all the convolutions in each Res2Net block with a grouped convolution and randomly initializing network weights, we arrive at Noise2Net, a variant of DeepAugment. See Figure 7 for a high-level overview of Noise2Net and Figure 8 for sample outputs.
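The grouped-convolution trick can be checked with a naive NumPy implementation. The sketch below is illustrative only: the helper names are hypothetical, the convolution uses no padding, and a practical implementation would rely on a deep learning framework's grouped convolution rather than Python loops.

```python
import numpy as np


def conv2d(x, w):
    """Unpadded cross-correlation. x: (C_in, H, W); w: (C_out, C_in, kh, kw)."""
    c_out, _, kh, kw = w.shape
    _, height, width = x.shape
    out = np.zeros((c_out, height - kh + 1, width - kw + 1))
    for i in range(out.shape[1]):
        for j in range(out.shape[2]):
            patch = x[:, i:i + kh, j:j + kw]
            out[:, i, j] = np.tensordot(w, patch, axes=([1, 2, 3], [0, 1, 2]))
    return out


def grouped_conv2d(x, w, groups):
    """x: (groups*k, H, W); w: (groups*m, k, kh, kw).

    Each group of filters only sees its own block of k input channels.
    """
    k = x.shape[0] // groups
    m = w.shape[0] // groups
    outs = [conv2d(x[g * k:(g + 1) * k], w[g * m:(g + 1) * m])
            for g in range(groups)]
    return np.concatenate(outs, axis=0)


def per_example_random_conv(batch, rng, kernel=3):
    """Apply a different random filter bank to each example in one pass,
    using a grouped convolution with as many groups as batch elements."""
    b, c, height, width = batch.shape
    w = rng.normal(size=(b * c, c, kernel, kernel))
    folded = batch.reshape(b * c, height, width)   # fold the batch into channels
    out = grouped_conv2d(folded, w, groups=b)
    return out.reshape(b, c, out.shape[1], out.shape[2]), w


rng = np.random.default_rng(0)
batch = rng.random((2, 3, 5, 5))                   # B=2 examples, C=3 channels
out, w = per_example_random_conv(batch, rng)
```

Folding the batch dimension into the channel dimension is what lets a single grouped convolution act as separate random networks, one per example. Here the second example's output equals an ordinary convolution of that example with its own slice of filters, `w[3:6]`.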

We evaluate the Noise2Net variant of DeepAugment on ImageNet-R. Table 11 shows that it outperforms the EDSR and CAE variants of DeepAugment, even though the network architecture is randomly sampled, its weights are random, and the network is not trained. This demonstrates the flexibility of the DeepAugment approach. Below is Pythonic pseudocode for training a classifier using the Noise2Net variant of DeepAugment.

Appendix C Further Dataset Descriptions

The 200 ImageNet classes and their WordNet IDs in ImageNet-R are as follows.

Goldfish, great white shark, hammerhead, stingray, hen, ostrich, goldfinch, junco, bald eagle, vulture, newt, axolotl, tree frog, iguana, African chameleon, cobra, scorpion, tarantula, centipede, peacock, lorikeet, hummingbird, toucan, duck, goose, black swan, koala, jellyfish, snail, lobster, hermit crab, flamingo, american egret, pelican, king penguin, grey whale, killer whale, sea lion, chihuahua, shih tzu, afghan hound, basset hound, beagle, bloodhound, italian greyhound, whippet, weimaraner, yorkshire terrier, boston terrier, scottish terrier, west highland white terrier, golden retriever, labrador retriever, cocker spaniels, collie, border collie, rottweiler, german shepherd dog, boxer, french bulldog, saint bernard, husky, dalmatian, pug, pomeranian, chow chow, pembroke welsh corgi, toy poodle, standard poodle, timber wolf, hyena, red fox, tabby cat, leopard, snow leopard, lion, tiger, cheetah, polar bear, meerkat, ladybug, fly, bee, ant, grasshopper, cockroach, mantis, dragonfly, monarch butterfly, starfish, wood rabbit, porcupine, fox squirrel, beaver, guinea pig, zebra, pig, hippopotamus, bison, gazelle, llama, skunk, badger, orangutan, gorilla, chimpanzee, gibbon, baboon, panda, eel, clown fish, puffer fish, accordion, ambulance, assault rifle, backpack, barn, wheelbarrow, basketball, bathtub, lighthouse, beer glass, binoculars, birdhouse, bow tie, broom, bucket, cauldron, candle, cannon, canoe, carousel, castle, mobile phone, cowboy hat, electric guitar, fire engine, flute, gasmask, grand piano, guillotine, hammer, harmonica, harp, hatchet, jeep, joystick, lab coat, lawn mower, lipstick, mailbox, missile, mitten, parachute, pickup truck, pirate ship, revolver, rugby ball, sandal, saxophone, school bus, schooner, shield, soccer ball, space shuttle, spider web, steam locomotive, scarf, submarine, tank, tennis ball, tractor, trombone, vase, violin, military aircraft, wine bottle, ice cream, bagel, pretzel, cheeseburger, hotdog, cabbage, broccoli, cucumber, bell pepper, mushroom, Granny Smith, strawberry, lemon, pineapple, banana, pomegranate, pizza, burrito, espresso, volcano, baseball player, scuba diver, acorn.

n01443537, n01484850, n01494475, n01498041, n01514859, n01518878, n01531178, n01534433, n01614925, n01616318, n01630670, n01632777, n01644373, n01677366, n01694178, n01748264, n01770393, n01774750, n01784675, n01806143, n01820546, n01833805, n01843383, n01847000, n01855672, n01860187, n01882714, n01910747, n01944390, n01983481, n01986214, n02007558, n02009912, n02051845, n02056570, n02066245, n02071294, n02077923, n02085620, n02086240, n02088094, n02088238, n02088364, n02088466, n02091032, n02091134, n02092339, n02094433, n02096585, n02097298, n02098286, n02099601, n02099712, n02102318, n02106030, n02106166, n02106550, n02106662, n02108089, n02108915, n02109525, n02110185, n02110341, n02110958, n02112018, n02112137, n02113023, n02113624, n02113799, n02114367, n02117135, n02119022, n02123045, n02128385, n02128757, n02129165, n02129604, n02130308, n02134084, n02138441, n02165456, n02190166, n02206856, n02219486, n02226429, n02233338, n02236044, n02268443, n02279972, n02317335, n02325366, n02346627, n02356798, n02363005, n02364673, n02391049, n02395406, n02398521, n02410509, n02423022, n02437616, n02445715, n02447366, n02480495, n02480855, n02481823, n02483362, n02486410, n02510455, n02526121, n02607072, n02655020, n02672831, n02701002, n02749479, n02769748, n02793495, n02797295, n02802426, n02808440, n02814860, n02823750, n02841315, n02843684, n02883205, n02906734, n02909870, n02939185, n02948072, n02950826, n02951358, n02966193, n02980441, n02992529, n03124170, n03272010, n03345487, n03372029, n03424325, n03452741, n03467068, n03481172, n03494278, n03495258, n03498962, n03594945, n03602883, n03630383, n03649909, n03676483, n03710193, n03773504, n03775071, n03888257, n03930630, n03947888, n04086273, n04118538, n04133789, n04141076, n04146614, n04147183, n04192698, n04254680, n04266014, n04275548, n04310018, n04325704, n04347754, n04389033, n04409515, n04465501, n04487394, n04522168, n04536866, n04552348, n04591713, n07614500, n07693725, n07695742, n07697313, n07697537, n07714571, n07714990, n07718472, n07720875, n07734744, n07742313, n07745940, n07749582, n07753275, n07753592, n07768694, n07873807, n07880968, n07920052, n09472597, n09835506, n10565667, n12267677.

Street View StoreFronts.

DeepFashion Remixed.

Size (small, moderate, or large) defines how much of the image the article of clothing takes up. Occlusion (slight, medium, or heavy) defines the degree to which the object is occluded from the camera. Viewpoint (front, side/back, or not worn) defines the camera position relative to the article of clothing. Zoom (no zoom, medium, or large) defines how much camera zoom was used to take the picture.