AugLy: Data Augmentations for Robustness

Zoe Papakipos, Joanna Bitton

Introduction

Data augmentations are a key component in the computer vision model development life cycle, and are also becoming increasingly prevalent in other domains. They are commonly used to increase the size of datasets and prevent overfitting by performing perturbations on the input data. In addition to the classical use cases, data augmentations can also be used to evaluate the robustness of trained models to perturbations not seen at train time.

For instance, robustness to data manipulations is critical for preserving a sense of data provenance. Content online is often manipulated and reshared, for example when users screenshot & share a post, or overlay text or images on top of an image to make a meme. It is therefore non-trivial to detect that two pieces of media are near-duplicates. Additionally, adversaries may intentionally pass obfuscated data to a model to evade detection.

The classical set of data augmentations used during model development does not completely mimic the way individuals online organically perturb data. Most classical augmentation libraries focus on simple transformations such as mirroring, rotating, cropping, brightness changes, etc. While these kinds of augmentations do naturally occur online, others such as overlaying text and emojis, social media screenshots, etc. are also prevalent. In addition, multimodal data processing and learning is becoming increasingly important as many real-world use cases involve multiple types of data, such as text & images or audio & video, and it can be useful to augment data of multiple modalities under one unified library & API.

AugLy is built with robustness and the vast landscape of organic data augmentations seen online in mind, and to our knowledge is the first multimodal data augmentation library. AugLy can be used to synthetically create realistic data augmentations seen online, to evaluate and increase robustness, and to augment multiple modalities at a time, and thus stands out in comparison to existing libraries. In this paper we introduce AugLy, describe how it works and how it is architected, and compare its functionality & efficiency to existing data augmentation libraries. We also conduct a robustness evaluation on state-of-the-art image classification models throughout the years to demonstrate how AugLy can be used to identify robustness gaps in pre-trained models.

Related Work

Most commonly-used augmentation libraries focus on one modality and provide a fairly limited set of augmentations. A majority of libraries focus on images and text; audio and video augmentation libraries exist as well, but with more limited augmentations (see Section 4 for in-depth comparisons between AugLy and existing libraries for each modality). Meanwhile, AugLy provides augmentations for audio, images, text and video under a unified API, and is one of few libraries that focus on evaluating robustness rather than augmenting a dataset at train time.

Other works, such as AugMix, have conducted experiments to find sets of augmentations that, when trained on, improve robustness at test time. Strategies like AutoAugment, on the other hand, find an “optimal” set of augmentations to train on in a more automated way.

In AI Fairness, studies assessing the robustness of models to various protected categories are common. In NLP, there are studies that augment text to assess a model’s biases towards gender and ethnicity. AugLy provides “fairness augmentations” since being robust to perturbations in protected classes is an important aspect of robustness that we must evaluate to ensure that models are not amplifying biases.

AugLy

AugLy is a novel open-source data augmentation library which provides over 100 data augmentations across four modalities: audio, image, text, and video. The augmentations provided in AugLy are informed by the perturbations that real people on the Internet perform on data daily. These include overlaying text or emojis and applying screenshot transforms for image & video, and inserting punctuation or similar characters for text.

AugLy has four sub-libraries (audio, image, text, & video), each corresponding to a different modality. All sub-libraries follow the same interface: we provide transforms in both function-based and class-based formats, and we provide intensity functions that compute a notion of how strong a transformation is based on the given parameters. AugLy can optionally generate metadata that provides additional context as to how the data was transformed, which is useful to perform comparisons of model performance based on the augmentation type & intensity.
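The shared interface described above can be illustrated with a toy text transform. This is only a minimal sketch of the pattern (function-based transform, class-based wrapper, intensity function, optional metadata) and not AugLy's actual code: the names `insert_punctuation`, `InsertPunctuation`, `insert_punctuation_intensity`, the metadata fields, and the intensity formula are all hypothetical.

```python
from typing import Any, Dict, List, Optional


def insert_punctuation_intensity(granularity: int, **_: Any) -> float:
    """Hypothetical intensity in (0, 100]: smaller gaps mean a stronger change."""
    return 100.0 / max(granularity, 1)


def insert_punctuation(
    text: str,
    char: str = ".",
    granularity: int = 1,
    metadata: Optional[List[Dict[str, Any]]] = None,
) -> str:
    """Function-based transform: put `char` between chunks of `granularity` characters."""
    chunks = [text[i : i + granularity] for i in range(0, len(text), granularity)]
    result = char.join(chunks)
    if metadata is not None:
        # Record how the input was transformed so results can be grouped later
        # by augmentation type and intensity.
        metadata.append(
            {
                "name": "insert_punctuation",
                "char": char,
                "granularity": granularity,
                "src_length": len(text),
                "dst_length": len(result),
                "intensity": insert_punctuation_intensity(granularity),
            }
        )
    return result


class InsertPunctuation:
    """Class-based transform: stores the parameters once, then acts as a callable."""

    def __init__(self, char: str = ".", granularity: int = 1) -> None:
        self.char = char
        self.granularity = granularity

    def __call__(
        self, text: str, metadata: Optional[List[Dict[str, Any]]] = None
    ) -> str:
        return insert_punctuation(text, self.char, self.granularity, metadata)


meta: List[Dict[str, Any]] = []
augmented = InsertPunctuation(char=".", granularity=2)("hello", metadata=meta)
print(augmented)             # he.ll.o
print(meta[0]["intensity"])  # 50.0
```

Keeping the function and class forms as thin wrappers around the same logic means both share one intensity definition and one metadata record, which is what makes per-augmentation comparisons straightforward.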

AugLy also provides operators for composing multiple augmentations together, applying augmentations with a given probability, and applying multimodal augmentations (for example augmenting both the audio & frames in a video).
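As a rough sketch of how composition and probabilistic application fit together, the following toy operators chain string transforms. The class names `Compose` and `ApplyWithProbability` and their signatures are chosen for illustration and are assumptions, not a description of AugLy's own operators; multimodal application is not covered here.

```python
import random
from typing import Callable, Sequence

Transform = Callable[[str], str]


class Compose:
    """Apply a sequence of transforms in order."""

    def __init__(self, transforms: Sequence[Transform]) -> None:
        self.transforms = list(transforms)

    def __call__(self, data: str) -> str:
        for transform in self.transforms:
            data = transform(data)
        return data


class ApplyWithProbability:
    """Apply the wrapped transform with probability p, else return the input unchanged."""

    def __init__(self, transform: Transform, p: float, rng: random.Random) -> None:
        self.transform = transform
        self.p = p
        self.rng = rng

    def __call__(self, data: str) -> str:
        return self.transform(data) if self.rng.random() < self.p else data


pipeline = Compose(
    [
        str.upper,  # always applied
        ApplyWithProbability(lambda s: s[::-1], p=0.5, rng=random.Random(0)),
        lambda s: s + "!",  # always applied
    ]
)
outputs = [pipeline("meme") for _ in range(4)]
print(outputs)  # each entry is either "MEME!" or "EMEM!"
```

Passing an explicit `random.Random` instance keeps the sketch reproducible, which matters when augmented evaluation sets need to be regenerated.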

We provide many basic augmentations that are already supported in existing libraries, as well as some new transformation types that are directly informed by data perturbations observed online. For example, one of our augmentations takes an image or video and overlays it onto a social media interface to make it seem as if the image or video was screenshotted by a user on a social network. This augmentation is beneficial because individuals on the internet commonly reshare content this way, and it is important for systems to be able to identify that the content is still the same despite the added interface elements.

2 Existing Use Cases

AugLy has already been used by several projects. SimSearchNet, an image copy detection model, was trained using AugLy augmentations. AugLy was used to evaluate the robustness of deepfake detection models in the 2019 Deepfake Detection Challenge, ultimately influencing which entrants placed in the top five. The dataset (DISC21) for the Image Similarity Challenge, a NeurIPS 2021 competition on image copy detection, was built using AugLy as well.

Benchmarking

In order to show how AugLy fits into the existing ecosystem of data augmentation libraries, we compare each modality’s sub-library within AugLy to a few of the most popular augmentation libraries in that respective modality. Specifically, we compare the overall focus and functionality of each library, and perform runtime benchmarking to evaluate how efficient AugLy’s augmentations are. Note: the augmentations were benchmarked using AugLy v0.2.1, available on PyPI and GitHub. To see the full list of augmentations benchmarked, please review the Appendix.

We chose to compare AugLy’s audio augmentations to three existing and popular libraries: pydub, torchaudio, and audiomentations. See Figure 5 to compare the number of distinct augmentations provided.

Each library has a slightly different focus: torchaudio and audiomentations integrate easily with PyTorch (torchaudio’s augmentations can also be GPU-accelerated) and are clearly intended to be used at train time to improve generalization of audio machine learning models. pydub provides more general-purpose audio processing functionality without much emphasis on integrating with either ML training or evaluation pipelines; the number of transformation functions in pydub is also much lower than in the other three libraries.

We benchmark each audio augmentation in AugLy, as well as some analogues that exist in the other libraries. See Figure 6 for the runtime in seconds of each augmentation in (1) AugLy, (2) pydub, (3) torchaudio, & (4) audiomentations.
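A runtime comparison of this kind can be sketched with a small timing harness. The paper does not describe the exact measurement procedure, so the choice of `time.perf_counter`, the median over repeats, and the two stand-in implementations (`library_a`, `library_b`) are assumptions made purely for illustration.

```python
import statistics
import time
from typing import Callable, Dict


def benchmark(fn: Callable[[], object], repeats: int = 5) -> float:
    """Return the median wall-clock runtime of fn() in seconds."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return statistics.median(timings)


# Two stand-in implementations of the same "augmentation" (reversing a signal).
signal = list(range(100_000))
candidates: Dict[str, Callable[[], object]] = {
    "library_a": lambda: signal[::-1],
    "library_b": lambda: [signal[i] for i in range(len(signal) - 1, -1, -1)],
}

results = {name: benchmark(fn) for name, fn in candidates.items()}
for name, seconds in sorted(results.items(), key=lambda item: item[1]):
    print(f"{name}: {seconds:.6f} s")
```

Using the median rather than the mean reduces the influence of one-off slow runs, for example a first call that pays a caching cost.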

2 Image

We compare AugLy’s image augmentations to three well-established libraries: imgaug, torchvision, and Albumentations. See Figure 7 for a comparison of the four libraries in terms of the number of distinct augmentations provided.

Whereas imgaug, torchvision, and Albumentations are all geared toward providing general image augmentations to be used in computer vision training pipelines for regularization purposes, AugLy is more focused on replicating image transformations that users perform online. For example, none of the other three libraries contain overlay augmentations (e.g. “OverlayText”, “OverlayEmoji”, or “OverlayOntoScreenshot”), although these are extremely common image manipulations.

This indicates a gap in existing image augmentation libraries: models are not being trained to be invariant to data manipulations that they will see in the real world. For instance, a model that detects violent or harmful content in images on any online platform needs to be invariant to the augmentations provided in AugLy; otherwise a user can bypass that model by overlaying an emoji onto the harmful image or overlaying the image onto a background.

We benchmark each AugLy image augmentation, as well as any analogues that exist in the other libraries. See Figure 8 for the runtime in seconds of each augmentation in (1) AugLy, (2) imgaug, (3) torchvision, & (4) Albumentations.

3 Text

We compare AugLy’s text augmentations to three existing text libraries: nlpaug, TextAttack, & textflint. See Figure 9 for a comparison of the four libraries in terms of the number of distinct augmentations provided.

One significant difference between AugLy and the other text augmentation libraries is the prevalence of syntactic versus semantic (i.e. character-level vs word-level) augmentations. Most augmentations in nlpaug and TextAttack are semantic (e.g. words being swapped for synonyms or antonyms), with only a few simple syntactic ones (e.g. deleting/adding characters, replacing characters with nearby ones on the keyboard). AugLy provides many syntactic augmentations that are often used online in an attempt to evade detection, such as inserting punctuation, zero-width characters, or bidirectional characters, and changing fonts.
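To make the evasion risk concrete, the sketch below shows why a character-level change such as inserting zero-width characters can defeat exact string matching while leaving the rendered text visually unchanged. The helper names `insert_zero_width` and `strip_zero_width` are hypothetical and are not taken from any of the libraries compared here.

```python
ZERO_WIDTH_SPACE = "\u200b"


def insert_zero_width(text: str) -> str:
    """Insert a zero-width space between every pair of adjacent characters."""
    return ZERO_WIDTH_SPACE.join(text)


def strip_zero_width(text: str) -> str:
    """Simple normalization a detector could apply before matching."""
    return text.replace(ZERO_WIDTH_SPACE, "")


blocked_term = "spam"
evasive = insert_zero_width(blocked_term)

print(blocked_term in evasive)                    # False: exact matching fails
print(blocked_term in strip_zero_width(evasive))  # True: matching works after normalization
print(len(blocked_term), len(evasive))            # 4 7
```

Evaluating a text model on inputs like `evasive` is one way to check whether its preprocessing already handles this class of manipulation.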

We benchmark each AugLy text augmentation, as well as any analogues that exist in the other libraries. See Figure 10 for the runtime in seconds of each augmentation in (1) AugLy, (2) nlpaug, (3) TextAttack, & (4) textflint.

4 Video

We compare AugLy’s video augmentations to three existing libraries: moviepy, pytorchvideo, and vidaug. See Figure 11 for a comparison of the four libraries in terms of the number of distinct augmentations provided.

Most existing video augmentations focus on manipulating either the spatial dimension or the temporal dimension, as opposed to both. For instance, many individuals apply spatial image augmentations frame by frame onto videos; pytorchvideo provides one such API to do this using the torchvision transforms. Although spatial augmentations are effective, applying temporal augmentations in tandem has been shown to improve performance.
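The distinction between spatial and temporal augmentations can be sketched on a toy video represented as a list of small integer frames. The representation and the function names `hflip_frames` and `speed_up` are assumptions for illustration only and do not correspond to the API of any library discussed here.

```python
from typing import List

Frame = List[List[int]]  # rows of pixel values
Video = List[Frame]      # ordered frames


def hflip_frames(video: Video) -> Video:
    """Spatial augmentation: mirror each frame left to right, frame by frame."""
    return [[row[::-1] for row in frame] for frame in video]


def speed_up(video: Video, factor: int) -> Video:
    """Temporal augmentation: keep every `factor`-th frame."""
    return video[::factor]


# Four 2x2 frames whose pixel values encode the frame index (0, 10, 20, 30).
video: Video = [[[t, t + 1], [t + 2, t + 3]] for t in range(0, 40, 10)]

# Applying both in tandem changes the content of each frame and the frame sequence.
augmented = speed_up(hflip_frames(video), factor=2)
print(len(video), len(augmented))  # 4 2
print(augmented[0])                # [[1, 0], [3, 2]]
```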

Moviepy is more of a general video processing and editing library, but it provides both spatial and temporal manipulations such as changing the speed of the video, trimming, and spatial cropping. vidaug provides similar spatial and temporal augmentations. However, none of these existing libraries provide the option to augment the audio or to perform overlay augmentations, both of which AugLy does provide. AugLy provides a wide array of spatiotemporal augmentations which are common online, such as temporally splicing one video into another, simulating a screenshot reshare, and overlaying one video onto another. AugLy is also unique in its multimodal integration, meaning a video’s audio can be transformed and then recombined with the video in conjunction with other augmentations.

We benchmark each AugLy video augmentation, as well as the analogues that exist in the other libraries. See Figure 12 for the runtime in seconds of each augmentation in (1) AugLy, (2) moviepy, (3) pytorchvideo, & (4) vidaug.

Robustness Evaluation

To demonstrate how AugLy can be used to evaluate robustness, we evaluated a few ImageNet models from different years on AugLy augmentations. We were interested in how robustness has evolved as models’ accuracy has improved, and in which augmentations the models were particularly vulnerable to. We chose three models to evaluate: VGG16, ResNet152, and EfficientNet-L2 (Noisy Student).

We evaluated the aforementioned models on the ImageNet validation set, which is commonly used since the test set is not available for download. However, to avoid any potential bias due to overfitting, we also evaluated on an additional dataset, ImageNet V2. ImageNet V2 was put together by researchers as a held-out test set for ImageNet that can be evaluated on with no risk of overfitting.

We evaluated the robustness of each model across many different AugLy image augmentations by sampling 250 images from each dataset, computing the top-5 accuracy on those images, and computing the top-5 accuracy when the images are augmented using each augmentation. The change in top-5 accuracy from the baseline (i.e. when the images are not augmented) to the augmented images gives us a measure of how vulnerable the model is to that augmentation. We chose a diverse set of augmentations and set the parameters such that the augmentations were very noticeable but the content of the image was still recognizable to the human eye. See examples of some of the augmentations in Figure 14. The notebook used to perform this robustness evaluation can be found in the AugLy repo at https://github.com/facebookresearch/AugLy/blob/main/examples/imagenet/pwc_imagenet_v1_vs_v2_metrics.ipynb.
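The vulnerability measure described above, the change in top-k accuracy between clean and augmented inputs, can be sketched as follows. This is an illustrative reimplementation on toy score vectors and not the code from the linked notebook; the helper names are hypothetical, and the toy uses four classes with k=2 instead of top-5 over 1000 classes to keep the example small.

```python
from typing import Sequence, Tuple

Scores = Sequence[float]      # one model score per class
Example = Tuple[Scores, int]  # (class scores, true label)


def top_k_accuracy(examples: Sequence[Example], k: int = 5) -> float:
    """Fraction of examples whose true label is among the k highest-scoring classes."""
    hits = 0
    for scores, label in examples:
        ranked = sorted(range(len(scores)), key=lambda c: scores[c], reverse=True)
        hits += label in ranked[:k]
    return hits / len(examples)


def accuracy_drop(
    clean: Sequence[Example], augmented: Sequence[Example], k: int = 5
) -> float:
    """Baseline top-k accuracy minus top-k accuracy on the augmented inputs."""
    return top_k_accuracy(clean, k) - top_k_accuracy(augmented, k)


# Toy model outputs for the same four images before and after an augmentation.
clean = [
    ([0.60, 0.25, 0.10, 0.05], 0),
    ([0.12, 0.50, 0.30, 0.08], 2),
    ([0.05, 0.15, 0.20, 0.60], 3),
    ([0.40, 0.30, 0.20, 0.10], 1),
]
augmented = [
    ([0.30, 0.40, 0.20, 0.10], 0),
    ([0.50, 0.30, 0.15, 0.05], 2),
    ([0.40, 0.35, 0.20, 0.05], 3),
    ([0.25, 0.45, 0.20, 0.10], 1),
]

print(top_k_accuracy(clean, k=2))            # 1.0
print(top_k_accuracy(augmented, k=2))        # 0.5
print(accuracy_drop(clean, augmented, k=2))  # 0.5
```

A larger drop indicates that the model is more vulnerable to that augmentation; repeating the computation per augmentation and per dataset gives the kind of comparison reported in the figures.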

As shown in Figure 13, VGG and ResNet are quite vulnerable to AugLy augmentations across the board. EfficientNet, on the other hand, is much more robust to most augmentations, except for blur and random_noise, which cause a larger drop in accuracy. This makes sense considering the augmentations each model was trained on: VGG was trained on augmentations equivalent to AugLy’s crop, hflip, & color_jitter; ResNet was trained on crop, hflip, scale, & color changes similar to color_jitter. EfficientNet was trained using AutoAugment, which includes a much wider range of augmentations such as shear_x/y, translate_x/y, rotate, contrast, invert, solarize, posterize, color, brightness, sharpness, and cutout.

Whereas VGG & ResNet were trained on a very limited set of spatial and color-based augmentations, EfficientNet was trained on a larger number of both spatial and color-based augmentations, as well as cutout which is similar to the overlay augmentations in AugLy (but instead of overlaying content over the image, black rectangles are overlaid). However, none of the three models were trained on pixel-level augmentations such as blur, random_noise, or pixelization, which likely explains why all three models are vulnerable to those augmentations. Figure 14 illustrates a few examples from AugLy of the four categories: spatial, color, overlay, and pixel-level augmentations.

We validated that these results are comparable on the ImageNet V2 dataset, shown in Figure 15. Similar to the evaluation on the ImageNet validation dataset, VGG and ResNet are quite vulnerable to all augmentations to varying degrees, and EfficientNet is significantly less so, with the exception of blur & random_noise.

Figure 16 shows the drop in accuracy on EfficientNet for each augmentation with respect to the original ImageNet validation set and ImageNet V2. The drop in accuracy is close on both datasets for all augmentations, so there is no indication of overfitting on the validation set.

Conclusion

We presented AugLy, a new multimodal augmentation library with a focus on robustness. We compared each sub-library (audio, image, text, and video) to other similar augmentation libraries, assessing the number of augmentations offered and the kinds of augmentations available, and benchmarking analogous functions to observe their performance. While other libraries may be more performant time-wise, AugLy provides a wide range of unique augmentations that replicate real modifications seen online. Additionally, we evaluated our augmentations on three state-of-the-art image classification models over time and found that the model trained on the widest range of augmentations was the most robust, which suggests that training on diverse augmentations is an effective method for building defenses against various attack types.

Acknowledgements

We would like to thank A.K.M Adib, Erik Chou, Aditya Prasad, and Guillermo Sanchez for their contributions to this paper by benchmarking and improving the efficiency of AugLy’s augmentations!