A Survey of Data Augmentation Approaches for NLP

Steven Y. Feng, Varun Gangal, Jason Wei, Sarath Chandar, Soroush Vosoughi, Teruko Mitamura, Eduard Hovy

Introduction

Data augmentation (DA) refers to strategies for increasing the diversity of training examples without explicitly collecting new data. It has received active attention in recent machine learning (ML) research in the form of well-received, general-purpose techniques such as UDA Xie et al. (2020) (3.1), which used backtranslation Sennrich et al. (2016), AutoAugment Cubuk et al. (2018), and RandAugment Cubuk et al. (2020), and MixUp Zhang et al. (2017) (3.2). These are often first explored in computer vision (CV), and DA’s adaptation for natural language processing (NLP) seems secondary and comparatively underexplored, perhaps due to challenges presented by the discrete nature of language, which rules out continuous noising and makes it more difficult to maintain invariance.

Despite these challenges, there has been increased interest and demand for DA for NLP. As NLP grows due to off-the-shelf availability of large pretrained models, there are increasingly more tasks and domains to explore. Many of these are low-resource, and have a paucity of training examples, creating many use-cases for which DA can play an important role. Particularly, for many non-classification NLP tasks such as span-based tasks and generation, DA research is relatively sparse despite their ubiquity in real-world settings.

Our paper aims to sensitize the NLP community towards this growing area of work, which has also seen increasing interest in ML overall (as seen in Figure 1). As interest and work on this topic continue to increase, this is an opportune time for a paper of our kind to (i) give a bird’s eye view of DA for NLP, and (ii) identify key challenges to effectively motivate and orient interest in this area. To the best of our knowledge, this is the first survey to take a detailed look at DA methods for NLP. Liu et al. (2020a) present a smaller-scale text data augmentation survey that is concise and focused; our work serves as a more comprehensive survey with larger coverage and is more up-to-date.

This paper is structured as follows. Section 2 discusses what DA is, its goals and trade-offs, and why it works. Section 3 describes popular methodologically representative DA techniques for NLP—which we categorize into rule-based (3.1), example interpolation-based (3.2), or model-based (3.3). Section 4 discusses useful NLP applications for DA, including low-resource languages (4.1), mitigating bias (4.2), fixing class imbalance (4.3), few-shot learning (4.4), and adversarial examples (4.5). Section 5 describes DA methods for common NLP tasks including summarization (5.1), question answering (5.2), sequence tagging tasks (5.3), parsing tasks (5.4), grammatical error correction (5.5), neural machine translation (5.6), data-to-text NLG (5.7), open-ended and conditional text generation (5.8), dialogue (5.9), and multimodal tasks (5.10). Finally, Section 6 discusses challenges and future directions in DA for NLP. Appendix A lists useful blog posts and code repositories.

Through this work, we hope to emulate past papers which have surveyed DA methods for other types of data, such as images Shorten and Khoshgoftaar (2019), faces Wang et al. (2019b), and time series Iwana and Uchida (2020). We hope to draw further attention, elicit broader interest, and motivate additional work in DA, particularly for NLP.

Background

What is data augmentation? Data augmentation (DA) encompasses methods of increasing training data diversity without directly collecting more data. Most strategies either add slightly modified copies of existing data or create synthetic data, aiming for the augmented data to act as a regularizer and reduce overfitting when training ML models Shorten and Khoshgoftaar (2019); Hernández-García and König (2020). DA has been commonly used in CV, where techniques like cropping, flipping, and color jittering are a standard component of model training. In NLP, where the input space is discrete, how to generate effective augmented examples that capture the desired invariances is less obvious.

Despite challenges associated with text, many DA techniques for NLP have been proposed, ranging from rule-based manipulations Zhang et al. (2015) to more complicated generative approaches Liu et al. (2020b). As DA aims to provide an alternative to collecting more data, an ideal DA technique should be both easy-to-implement and improve model performance. Most offer trade-offs between these two.

Rule-based techniques are easy-to-implement but usually offer incremental performance improvements Li et al. (2017); Wei and Zou (2019); Wei et al. (2021b). Techniques leveraging trained models may be more costly to implement but introduce more data variation, leading to better performance boosts. Model-based techniques customized for downstream tasks can have strong effects on performance but be difficult to develop and utilize.

Further, the distribution of augmented data should be neither too similar to nor too different from the original. Augmented data that is too similar may lead to greater overfitting, while augmented data that is too different may lead to poor performance through training on examples not representative of the given domain. Effective DA approaches should aim for a balance.

Kashefi and Hwa (2020) devise a KL-Divergence-based unsupervised procedure to preemptively choose among DA heuristics, rather than a typical "run-all-heuristics" comparison, which can be very time and cost intensive.

Interpretation of DA

Dao et al. (2019) note that "data augmentation is typically performed in an ad-hoc manner with little understanding of the underlying theoretical principles", and claim the typical explanation of DA as regularization to be insufficient. Overall, there indeed appears to be a lack of research on why exactly DA works. Existing work on this topic is mainly surface-level, and rarely investigates the theoretical underpinnings and principles. We discuss this challenge more in §6, and highlight some of the existing work below.

Bishop (1995) shows that training with noised examples is reducible to Tikhonov regularization (which subsumes L2 regularization). Rajput et al. (2019) show that DA can increase the positive margin for classifiers, but only when augmenting exponentially many examples for common DA methods.

Dao et al. (2019) think of DA transformations as kernels, and find two ways DA helps: averaging of features and variance regularization. Chen et al. (2020d) show that DA leads to variance reduction by averaging over orbits of the group that keep the data distribution approximately invariant.

Techniques & Methods

We now discuss some methodologically representative DA techniques which are relevant to all tasks via the extensibility of their formulation. Table 1 compares several DA methods by various aspects relating to their applicability, dependencies, and requirements.

3.1 Rule-Based Techniques

Here, we cover DA primitives which use easy-to-compute, predetermined transforms sans model components. Feature space DA approaches generate augmented examples in the model’s feature space rather than input data. Many few-shot learning approaches Hariharan and Girshick (2017); Schwartz et al. (2018) leverage estimated feature space "analogy" transformations between examples of known classes to augment for novel classes (see §4.4). Paschali et al. (2019) use iterative affine transformations and projections to maximally "stretch" an example along the class-manifold.

Wei and Zou (2019) propose Easy Data Augmentation (EDA), a set of token-level random perturbation operations including random insertion, deletion, and swap. They show improved performance on many text classification tasks. UDA Xie et al. (2020) show how supervised DA methods can be exploited for unsupervised data through consistency training on $(x,DA(x))$ pairs.
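
As a minimal illustration of the token-level perturbations described above, the following Python sketch implements random swap and random deletion only. The function names, the default probability `p`, and the assumption that text is already split into tokens are ours for illustration; this is not the authors' released implementation, and random insertion is omitted because it requires a synonym resource.

```python
import random

def random_swap(tokens, n_swaps=1, rng=random):
    """Swap two randomly chosen positions, n_swaps times."""
    tokens = list(tokens)
    if len(tokens) < 2:
        return tokens
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(tokens)), 2)
        tokens[i], tokens[j] = tokens[j], tokens[i]
    return tokens

def random_deletion(tokens, p=0.1, rng=random):
    """Drop each token independently with probability p; never return an empty list."""
    if not tokens:
        return []
    kept = [t for t in tokens if rng.random() >= p]
    return kept if kept else [rng.choice(tokens)]
```

Passing a seeded `random.Random` instance as `rng` makes the augmented data reproducible across runs.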

For paraphrase identification, Chen et al. (2020b) construct a signed graph over the data, with individual sentences as nodes and pair labels as signed edges. They use balance theory and transitivity to infer augmented sentence pairs from this graph. Motivated by image cropping and rotation, Şahin and Steedman (2018) propose dependency tree morphing. For dependency-annotated sentences, children of the same parent are swapped (à la rotation) or some are deleted (à la cropping), as seen in Figure 2. This is most beneficial for language families with rich case marking systems (e.g. Baltic and Slavic).

3.2 Example Interpolation Techniques

Another class of DA techniques, pioneered by MixUp Zhang et al. (2017), interpolates the inputs and labels of two or more real examples. This class of techniques is also sometimes referred to as Mixed Sample Data Augmentation (MSDA). Ensuing work has explored interpolating inner components Verma et al. (2019); Faramarzi et al. (2020), more general mixing schemes Guo (2020), and adding adversaries Beckham et al. (2019).
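
The interpolation at the core of MixUp can be sketched in a few lines of Python. The sketch assumes inputs are fixed-length feature vectors and labels are one-hot vectors, both given as plain lists; the function name and the default `alpha` are illustrative choices. The mixing coefficient is drawn from a Beta(alpha, alpha) distribution, as in the original MixUp formulation.

```python
import random

def mixup(x_i, y_i, x_j, y_j, alpha=0.2, rng=random):
    """Return a convex combination of two (input, one-hot label) pairs."""
    lam = rng.betavariate(alpha, alpha)  # mixing coefficient in [0, 1]
    x_mix = [lam * a + (1 - lam) * b for a, b in zip(x_i, x_j)]
    y_mix = [lam * a + (1 - lam) * b for a, b in zip(y_i, y_j)]
    return x_mix, y_mix, lam
```

Because the same coefficient is applied to inputs and labels, the mixed label remains a valid probability distribution over classes.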

Another class of extensions of MixUp which has been growing in the vision community attempts to fuse raw input image pairs together into a single input image, rather than improve the continuous interpolation mechanism. Examples of this paradigm include CutMix Yun et al. (2019), CutOut DeVries and Taylor (2017) and Copy-Paste Ghiasi et al. (2020). For instance, CutMix replaces a small sub-region of Image A with a patch sampled from Image B, with the labels mixed in proportion to sub-region sizes. There is potential to borrow ideas and inspiration from these works for NLP, e.g. for multimodal work involving both images and text (see "Multimodal challenges" in §6).

A bottleneck to using MixUp for NLP tasks was the requirement of continuous inputs. This has been overcome by mixing embeddings or higher hidden layers Chen et al. (2020c). Later variants propose speech-tailored mixing schemes Jindal et al. (2020b) and interpolation with adversarial examples Cheng et al. (2020), among others.

Seq2MixUp Guo et al. (2020) generalizes MixUp for sequence transduction tasks in two ways: the "hard" version samples a binary mask (from a Bernoulli with a $\mathrm{Beta}(\alpha,\alpha)$ prior) and picks from one of two sequences at each token position, while the "soft" version softly interpolates between sequences based on a coefficient sampled from $\mathrm{Beta}(\alpha,\alpha)$. The "soft" version is found to outperform the "hard" version and earlier interpolation-based techniques like SwitchOut Wang et al. (2018a).
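
The difference between the two variants can be made concrete with a small Python sketch. It is a simplification under stated assumptions rather than the authors' method: both sequences are assumed to have equal length (e.g. after padding), a single coefficient is drawn per sequence pair, and the "soft" variant operates on lists of embedding vectors. The names `hard_mix` and `soft_mix` are hypothetical.

```python
import random

def hard_mix(seq_a, seq_b, alpha=1.0, rng=random):
    """Pick each token from seq_a with probability lam, else from seq_b."""
    lam = rng.betavariate(alpha, alpha)
    return [a if rng.random() < lam else b for a, b in zip(seq_a, seq_b)]

def soft_mix(emb_a, emb_b, alpha=1.0, rng=random):
    """Interpolate two equal-length sequences of embedding vectors position-wise."""
    lam = rng.betavariate(alpha, alpha)
    return [[lam * u + (1 - lam) * v for u, v in zip(ea, eb)]
            for ea, eb in zip(emb_a, emb_b)]
```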

3.3 Model-Based Techniques

Seq2seq and language models have also been used for DA. The popular backtranslation method Sennrich et al. (2016) translates a sequence into another language and then back into the original language. Kumar et al. (2019a) train seq2seq models with their proposed method DiPS which learns to generate diverse paraphrases of input text using a modified decoder with a submodular objective, and show its effectiveness as DA for several classification tasks. Pretrained language models such as RNNs Kobayashi (2018) and transformers Yang et al. (2020) have also been used for augmentation.
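
The backtranslation round trip can be expressed independently of any particular translation system. In the Python sketch below, `to_pivot` and `from_pivot` are hypothetical callables standing in for trained translation models; the function names and the choice to skip paraphrases identical to the source are assumptions for illustration.

```python
def backtranslate(sentence, to_pivot, from_pivot):
    """Translate into a pivot language and back to obtain a paraphrase."""
    return from_pivot(to_pivot(sentence))

def augment_with_backtranslation(examples, to_pivot, from_pivot):
    """Append (paraphrase, label) pairs, skipping paraphrases identical to the source."""
    augmented = list(examples)
    for text, label in examples:
        paraphrase = backtranslate(text, to_pivot, from_pivot)
        if paraphrase != text:
            augmented.append((paraphrase, label))
    return augmented
```

The label is copied unchanged, which reflects the assumption that translation preserves the property being classified.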

Kobayashi (2018) generate augmented examples by replacing words with others randomly drawn according to the recurrent language model’s distribution based on the current context (illustration in Figure 3). Yang et al. (2020) propose G-DAug$^{c}$ which generates synthetic examples using pretrained transformer language models, and selects the most informative and diverse set for augmentation. Gao et al. (2019) advocate retaining the full distribution through "soft" augmented examples, showing gains on machine translation.

Nie et al. (2020) augment word representations with a context-sensitive attention-based mixture of their semantic neighbors from a pretrained embedding space, and show its effectiveness for NER on social media text. Inspired by denoising autoencoders, Ng et al. (2020) use a corrupt-and-reconstruct approach, with the corruption function $q(x^{\prime}|x)$ masking an arbitrary number of word positions and the reconstruction function $r(x|x^{\prime})$ unmasking them using BERT Devlin et al. (2019). Their approach works well on domain-shifted test sets across 9 datasets on sentiment, NLI, and NMT.
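
The corrupt-and-reconstruct idea can be sketched as two composable steps. In this Python sketch, `fill_mask` is a hypothetical callable that plays the role of the reconstruction function $r(x|x^{\prime})$ (in practice a masked language model such as BERT); the independent per-token masking with probability `mask_prob` is a simplifying assumption about the corruption function $q(x^{\prime}|x)$.

```python
import random

MASK = "[MASK]"

def corrupt(tokens, mask_prob=0.15, rng=random):
    """q(x'|x): replace each token with MASK independently with probability mask_prob."""
    return [MASK if rng.random() < mask_prob else t for t in tokens]

def corrupt_and_reconstruct(tokens, fill_mask, mask_prob=0.15, rng=random):
    """Sample x' ~ q(.|x), then let fill_mask play the role of r(x|x')."""
    corrupted = corrupt(tokens, mask_prob, rng)
    return [fill_mask(corrupted, i) if t == MASK else t
            for i, t in enumerate(corrupted)]
```

Here `fill_mask(corrupted, i)` is expected to return a single token for position `i` given the corrupted sequence.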

Feng et al. (2019) propose a task called Semantic Text Exchange (STE) which involves adjusting the overall semantics of a text to fit the context of a new word/phrase that is inserted called the replacement entity (RE). They do so by using a system called SMERTI and a masked LM approach. While not proposed directly for DA, it can be used as such, as investigated in Feng et al. (2020).

Rather than starting from an existing example and modifying it, some model-based DA approaches directly estimate a generative process from the training set and sample from it. Anaby-Tavor et al. (2020) learn a label-conditioned generator by finetuning GPT-2 Radford et al. (2019) on the training data, using this to generate candidate examples per class. A classifier trained on the original training set is then used to select top $k$ candidate examples which confidently belong to the respective class for augmentation. Quteineh et al. (2020) use a similar label-conditioned GPT-2 generation method, and demonstrate its effectiveness as a DA method in an active learning setup.
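
The filtering step of such generate-then-select pipelines can be sketched as follows. The generator itself is out of scope here; `classify` is a hypothetical callable, assumed to return a dictionary mapping each label to a probability, and the function name `select_top_k` is ours.

```python
def select_top_k(candidates, label, classify, k):
    """Keep the k candidates that the classifier most confidently assigns to `label`.

    `classify(text)` is assumed to return a dict mapping each label to a probability.
    """
    scored = [(classify(text).get(label, 0.0), text) for text in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:k]]
```

Filtering by classifier confidence trades diversity for label fidelity: a larger `k` admits more varied but potentially mislabeled examples.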

Other approaches include syntactic or controlled paraphrasing Iyyer et al. (2018); Kumar et al. (2020), document-level paraphrasing Gangal et al. (2021), augmenting misclassified examples Dreossi et al. (2018), BERT cross-encoder labeling of new inputs Thakur et al. (2021), guided generation using large-scale generative language models Liu et al. (2020b, c), and automated text augmentation Hu et al. (2019); Cai et al. (2020). Models can also learn to combine together simpler DA primitives Cubuk et al. (2018); Ratner et al. (2017) or add human-in-the-loop Kaushik et al. (2020, 2021).

Applications

In this section, we discuss several DA methods for some common NLP applications.

4.1 Low-Resource Languages

Low-resource languages are an important and challenging application for DA, typically for neural machine translation (NMT). Techniques using external knowledge such as WordNet Miller (1995) may be difficult to use effectively here (low-resource language challenges are discussed further in §6). There are ways to leverage high-resource languages for low-resource languages, particularly if they have similar linguistic properties. Xia et al. (2019) use this approach to improve low-resource NMT.

Li et al. (2020b) use backtranslation and self-learning to generate augmented training data. Inspired by work in CV, Fadaee et al. (2017) generate additional training examples that contain low-frequency (rare) words in synthetically created contexts. Qin et al. (2020) present a DA framework to generate multi-lingual code-switching data to finetune multilingual-BERT. It encourages the alignment of representations from source and multiple target languages once by mixing their context information. They see improved performance across 5 tasks with 19 languages.

4.2 Mitigating Bias

Zhao et al. (2018) attempt to mitigate gender bias in coreference resolution by creating an augmented dataset identical to the original but biased towards the underrepresented gender (using gender swapping of entities such as replacing "he" with "she") and train on the union of the two datasets. Lu et al. (2020) formally propose counterfactual DA (CDA) for gender bias mitigation, which involves causal interventions that break associations between gendered and gender-neutral words. Zmigrod et al. (2019) and Hall Maudslay et al. (2019) propose further improvements to CDA. Moosavi et al. (2020) augment training sentences with their corresponding predicate-argument structures, improving the robustness of transformer models against various types of biases.
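
Gender swapping of the kind described above can be illustrated with a dictionary-based Python sketch. The swap list is a deliberately tiny, hypothetical example, and the ambiguous source word "her" (which may map to "him" or "his") is left untouched because resolving it requires part-of-speech information; published systems use curated lists and additional handling of names.

```python
import re

# Illustrative, deliberately tiny swap list; real systems use much larger curated lists.
SWAPS = {"he": "she", "she": "he", "him": "her", "his": "her",
         "man": "woman", "woman": "man"}

def gender_swap(sentence, swaps=SWAPS):
    """Replace each listed gendered word with its counterpart, preserving capitalization."""
    def replace(match):
        word = match.group(0)
        target = swaps.get(word.lower())
        if target is None:
            return word  # leave words outside the swap list unchanged
        return target.capitalize() if word[0].isupper() else target
    return re.sub(r"[A-Za-z]+", replace, sentence)
```

Training on the union of the two datasets then amounts to concatenating the original sentences with their swapped counterparts.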

4.3 Fixing Class Imbalance

Fixing class imbalance typically involves a combination of undersampling and oversampling. Synthetic Minority Oversampling Technique (SMOTE) Chawla et al. (2002), which generates augmented minority class examples through interpolation, still remains popular Fernández et al. (2018). Multilabel SMOTE (MLSMOTE) Charte et al. (2015) modifies SMOTE to balance classes for multi-label classification, where classifiers predict more than one class at the same time. Other techniques such as EDA Wei and Zou (2019) can possibly be used for oversampling as well.
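
The interpolation step of SMOTE can be sketched in plain Python. The sketch assumes minority-class examples are already represented as numeric feature vectors (for text, e.g. sentence embeddings), that at least two minority examples exist, and that neighbours are found by brute-force Euclidean distance; the function name and defaults are illustrative.

```python
import random

def smote_sample(minority, k=2, rng=random):
    """Create one synthetic point between a minority example and one of its k nearest neighbours."""
    def sq_dist(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))
    x = rng.choice(minority)
    others = [m for m in minority if m is not x]
    neighbours = sorted(others, key=lambda m: sq_dist(x, m))[:k]
    nb = rng.choice(neighbours)
    gap = rng.random()  # position along the segment from x to nb
    return [a + gap * (b - a) for a, b in zip(x, nb)]
```

Calling `smote_sample` repeatedly until the minority class reaches the desired size gives a simple oversampling loop.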

4.4 Few-Shot Learning

DA methods can ease few-shot learning by adding more examples for novel classes introduced in the few-shot phase. Hariharan and Girshick (2017) use learned analogy transformations $\phi(z_{1},z_{2},x)$ between example pairs from a non-novel class $z_{1}\rightarrow z_{2}$ to generate augmented examples $x\rightarrow x^{\prime}$ for novel classes. Schwartz et al. (2018) generalize this beyond linear offsets, through their "$\Delta$-network" autoencoder which learns the distribution $P(z_{2}|z_{1},C)$ from all $y^{*}_{z_{1}}=y^{*}_{z_{2}}=C$ pairs, where $C$ is a class and $y^{*}$ is the ground-truth labelling function. Both these methods are applied only on image tasks, but their theoretical formulations are generally applicable, and hence we discuss them.

Kumar et al. (2019b) apply these and other DA methods for few-shot learning of novel intent classes in task-oriented dialog. Wei et al. (2021a) show that data augmentation facilitates curriculum learning for training triplet networks for few-shot text classification. Lee et al. (2021) use T5 to generate additional examples for data-scarce classes.

4.5 Adversarial Examples (AVEs)

Adversarial examples can be generated using innocuous label-preserving transformations (e.g. paraphrasing) that fool state-of-the-art NLP models, as shown in Jia et al. (2019). Specifically, they add sentences with distractor spans to passages to construct AVEs for span-based QA. Zhang et al. (2019d) construct AVEs for paraphrase detection using word swapping. Kang et al. (2018) and Glockner et al. (2018) create AVEs for textual entailment using WordNet relations.

Tasks

In this section, we discuss several DA works for common NLP tasks. We focus on non-classification tasks as classification is worked on by default, and well covered in earlier sections (e.g. §3 and §4). Numerous previously mentioned DA techniques, e.g. Wei and Zou (2019); Chen et al. (2020b); Anaby-Tavor et al. (2020), have been used or can be used for text classification tasks.

5.1 Summarization

Fabbri et al. (2020) investigate backtranslation as a DA method for few-shot abstractive summarization with the use of a consistency loss inspired by UDA. Parida and Motlicek (2019) propose an iterative DA approach for abstractive summarization that uses a mix of synthetic and real data, where the former is generated from Common Crawl. Zhu et al. (2019) introduce a query-focused summarization Dang (2005) dataset collected using Wikipedia called WikiRef which can be used for DA. Pasunuru et al. (2021) use DA methods to construct two training datasets for Query-focused Multi-Document Summarization (QMDS) called QmdsCnn and QmdsIr by modifying CNN/DM Hermann et al. (2015) and mining search-query logs, respectively.

5.2 Question Answering (QA)

Longpre et al. (2019) investigate various DA and sampling techniques for domain-agnostic QA including paraphrasing by backtranslation. Yang et al. (2019) propose a DA method using distant supervision to improve BERT finetuning for open-domain QA. Riabi et al. (2020) leverage Question Generation models to produce augmented examples for zero-shot cross-lingual QA. Singh et al. (2019) propose XLDA, or cross-lingual DA, which substitutes a portion of the input text with its translation in another language, improving performance across multiple languages on NLI and on the SQuAD QA task. Asai and Hajishirzi (2020) use logical and linguistic knowledge to generate additional training data to improve the accuracy and consistency of QA responses by models. Yu et al. (2018) introduce a new QA architecture called QANet that shows improved performance on SQuAD when combined with augmented data generated using backtranslation.

5.3 Sequence Tagging Tasks

Ding et al. (2020) propose DAGA, a two-step DA process. First, a language model over sequences of tags and words linearized as per a certain scheme is learned. Second, sequences are sampled from this language model and de-linearized to generate new examples. Şahin and Steedman (2018), discussed in §3.1, use dependency tree morphing (Figure 2) to generate additional training examples on the downstream task of part-of-speech (POS) tagging.
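
The linearize/de-linearize round trip in such a two-step process can be sketched in Python. The specific scheme used here (placing each non-`O` tag immediately before its word and omitting `O` tags) is an assumption for illustration and may differ in detail from the scheme used by the authors; the language model that would be trained on and sampled from the linearized sequences is not shown. The sketch also assumes that no word is identical to a tag string.

```python
def linearize(words, tags, outside="O"):
    """Interleave tags and words into one token sequence, omitting the outside tag."""
    out = []
    for word, tag in zip(words, tags):
        if tag != outside:
            out.append(tag)
        out.append(word)
    return out

def delinearize(tokens, tagset, outside="O"):
    """Invert linearize: a token in tagset labels the word that follows it."""
    words, tags, pending = [], [], outside
    for tok in tokens:
        if tok in tagset:
            pending = tok
        else:
            words.append(tok)
            tags.append(pending)
            pending = outside
    return words, tags
```

Sequences sampled from a language model trained on `linearize` output can be passed through `delinearize` to recover (word, tag) training pairs.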

Dai and Adel (2020) modify DA techniques proposed for sentence-level tasks for named entity recognition (NER), including label-wise token and synonym replacement, and show improved performance using both recurrent and transformer models. Zhang et al. (2020) propose a DA method based on MixUp called SeqMix for active sequence labeling by augmenting queried samples, showing improvements on NER and Event Detection.

5.4 Parsing Tasks

Jia and Liang (2016) propose data recombination for injecting task-specific priors to neural semantic parsers. A synchronous context-free grammar (SCFG) is induced from training data, and new "recombinant" examples are sampled. Yu et al. (2020) introduce Grappa, a pretraining approach for table semantic parsing, and generate synthetic question-SQL pairs via an SCFG. Andreas (2020) use compositionality to construct synthetic examples for downstream tasks like semantic parsing. Fragments of original examples are replaced with fragments from other examples in similar contexts.

Vania et al. (2019) investigate DA for low-resource dependency parsing including dependency tree morphing from Şahin and Steedman (2018) (Figure 2) and modified nonce sentence generation from Gulordava et al. (2018), which replaces content words with other words of the same POS, morphological features, and dependency labels.

5.5 Grammatical Error Correction (GEC)

Lack of parallel data is typically a barrier for GEC. Various works have thus looked at DA methods for GEC. We discuss some here, and more can be found in Table 3 in Appendix C.

There is work that makes use of additional resources. Boyd (2018) use German edits from Wikipedia revision history and use those relating to GEC as augmented training data. Zhang et al. (2019b) explore multi-task transfer, or the use of annotated data from other tasks.

There is also work that adds synthetic errors to noise the text. Wang et al. (2019a) investigate two approaches: token-level perturbations and training error generation models with a filtering strategy to keep generations with sufficient errors. Grundkiewicz et al. (2019) use confusion sets generated by a spellchecker for noising. Choe et al. (2019) learn error patterns from small annotated samples along with POS-specific noising.

There have also been approaches to improve the diversity of generated errors. Wan et al. (2020) investigate noising through editing the latent representations of grammatical sentences, and Xie et al. (2018) use a neural sequence transduction model and beam search noising procedures.

5.6 Neural Machine Translation (NMT)

There are many works which have investigated DA for NMT. We highlighted some in §3 and §4.1, e.g. Sennrich et al. (2016); Fadaee et al. (2017); Xia et al. (2019). We discuss some further ones here, and more can be found in Table 3 in Appendix C.

Wang et al. (2018a) propose SwitchOut, a DA method that randomly replaces words in both source and target sentences with other random words from their corresponding vocabularies. Gao et al. (2019) introduce Soft Contextual DA that softly augments randomly chosen words in a sentence using a contextual mixture of multiple related words over the vocabulary. Nguyen et al. (2020) propose Data Diversification which merges original training data with the predictions of several forward and backward models.

5.7 Data-to-Text NLG

Data-to-text NLG refers to tasks which require generating natural language descriptions of structured or semi-structured data inputs, e.g. game score tables Wiseman et al. (2017). Randomly perturbing game score values without invalidating overall game outcome is one DA strategy explored in game summary generation Hayashi et al. (2019).

Two popular recent benchmarks are E2E-NLG Dušek et al. (2018) and WebNLG Gardent et al. (2017). Both involve generation from structured inputs: meaning representation (MR) sequences and triple sequences, respectively. Montella et al. (2020) show performance gains on WebNLG by DA using Wikipedia sentences as targets and parsed OpenIE triples as inputs. Tandon et al. (2018) propose DA for E2E-NLG based on permuting the input MR sequence. Kedzie and McKeown (2019) inject Gaussian noise into a trained decoder’s hidden states and sample diverse augmented examples from it. This sample-augment-retrain loop helps performance on E2E-NLG.

5.8 Open-Ended & Conditional Generation

There has been limited work on DA for open-ended and conditional text generation. Feng et al. (2020) experiment with a suite of DA methods, which they call GenAug, for finetuning GPT-2 on a low-resource domain in an attempt to improve the quality of generated continuations. They find that WN-Hypers (WordNet hypernym replacement of keywords) and Synthetic Noise (randomly perturbing non-terminal characters in words) are useful, and the quality of generated text improves to a peak at $\approx 3\times$ the original amount of training data.
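
Character-level noising restricted to non-terminal characters can be sketched as follows. The particular operations (delete, insert, swap), the minimum word length of four, and the per-word probability `p` are assumptions for illustration rather than the exact configuration used in GenAug.

```python
import random
import string

def noise_word(word, rng=random):
    """Perturb one interior character; first and last characters are left untouched."""
    if len(word) < 4:
        return word
    i = rng.randrange(1, len(word) - 1)  # interior index
    op = rng.choice(["delete", "insert", "swap"])
    if op == "delete":
        return word[:i] + word[i + 1:]
    if op == "insert":
        return word[:i] + rng.choice(string.ascii_lowercase) + word[i:]
    # swap with an adjacent interior character
    j = i + 1 if i + 1 < len(word) - 1 else i - 1
    lo, hi = min(i, j), max(i, j)
    return word[:lo] + word[hi] + word[lo] + word[hi + 1:]

def synthetic_noise(text, p=0.1, rng=random):
    """Apply noise_word to each whitespace-separated word with probability p."""
    return " ".join(noise_word(w, rng) if rng.random() < p else w
                    for w in text.split())
```

Keeping the first and last characters fixed tends to leave the noised word recognizable, which is the motivation for restricting perturbations to interior positions.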

5.9 Dialogue

Most DA approaches for dialogue focus on task-oriented dialogue. We outline some below, and more can be found in Table 5 in Appendix C.

Quan and Xiong (2019) present sentence and word-level DA approaches for end-to-end task-oriented dialogue. Louvan and Magnini (2020) propose lightweight augmentation, a set of word-span and sentence-level DA methods for low-resource slot filling and intent classification.

Hou et al. (2018) present a seq2seq DA framework to augment dialogue utterances for dialogue language understanding Young et al. (2013), including a diversity rank to produce diverse utterances. Zhang et al. (2019c) propose MADA to generate diverse responses using the property that several valid responses exist for a dialogue context.

There is also DA work for spoken dialogue. Hou et al. (2018), Kim et al. (2019), Zhao et al. (2019), and Yoo et al. (2019) investigate DA methods for dialogue and spoken language understanding (SLU), including generative latent variable models.

5.10 Multimodal Tasks

DA techniques have also been proposed for multimodal tasks where aligned data for multiple modalities is required. We look at ones that involve language or text. Some are discussed below, and more can be found in Table 5 in Appendix C.

Beginning with speech, Wang et al. (2020) propose a DA method to improve the robustness of downstream dialogue models to speech recognition errors. Wiesner et al. (2018) and Renduchintala et al. (2018) propose DA methods for end-to-end automatic speech recognition (ASR).

Looking at images or video, Xu et al. (2020) learn a cross-modality matching network to produce synthetic image-text pairs for multimodal classifiers. Atliha and Šešok (2020) explore DA methods such as synonym replacement and contextualized word embeddings augmentation using BERT for image captioning. Kafle et al. (2017), Yokota and Nakayama (2018), and Tang et al. (2020) propose methods for visual QA including question generation and adversarial examples.

Challenges & Future Directions

Looking forward, data augmentation faces substantial challenges, specifically for NLP, and with these challenges, new opportunities for future work arise.

There appears to be a conspicuous lack of research on why DA works. Most studies might show empirically that a DA technique works and provide some intuition, but it is currently challenging to measure the goodness of a technique without resorting to a full-scale experiment. A recent work in vision Gontijo-Lopes et al. (2020) has proposed that affinity (the distributional shift caused by DA) and diversity (the complexity of the augmentation) can predict DA performance, but it is unclear how these results might translate to NLP.

Minimal benefit for pretrained models on in-domain data:

With the popularization of large pretrained language models, it has come to light that a couple of previously effective DA techniques for certain English text classification tasks Wei and Zou (2019); Sennrich et al. (2016) provide little benefit for models like BERT and RoBERTa, which already achieve high performance on in-domain text classification Longpre et al. (2020). One hypothesis is that using simple DA techniques provides little benefit when finetuning large pretrained transformers on tasks for which examples are well-represented in the pretraining data, but DA methods could still be effective when finetuning on tasks for which examples are scarce or out-of-domain compared with the training data. Further work could study under which scenarios data augmentation for large pretrained models is likely to be effective.

Multimodal challenges:

While there has been increased work in multimodal DA, as discussed in §5.10, developing effective DA methods for multiple modalities has been challenging. Many works focus on augmenting a single modality or multiple ones separately. For example, there is potential to further explore simultaneous image and text augmentation for image captioning, such as a combination of CutMix Yun et al. (2019) and caption editing.

Span-based tasks offer unique DA challenges as there are typically many correlated classification decisions. For example, random token replacement may be a locally acceptable DA method but possibly disrupt coreference chains for later sentences. DA techniques here must take into account dependencies between different locations in the text.

Working in specialized domains such as those with domain-specific vocabulary and jargon (e.g. medicine) can present challenges. Many pretrained models and external knowledge (e.g. WordNet) cannot be effectively used. Studies have shown that DA becomes less beneficial when applied to out-of-domain data, likely because the distribution of augmented data can substantially differ from the original data Zhang et al. (2019a); Herzig et al. (2020); Campagna et al. (2020); Zhong et al. (2020).

Working with low-resource languages may present similar difficulties as specialized domains. Further, DA techniques successful in the high-resource scenario may not be effective for low-resource languages that are of a different language family or very distinctive in linguistic and typological terms, for example language isolates or languages that lack high-resource cognates.

More vision-inspired techniques:

Although many NLP DA methods have been inspired by analogous approaches in CV, there is potential for drawing further connections. Many CV DA techniques motivated by real-world invariances (e.g. many angles of looking at the same object) may have similar NLP interpretations. For instance, grayscaling could translate to toning down aspects of the text (e.g. plural to singular, "awesome" → "good"). Morphing a dependency tree could be analogous to rotating an image, and paraphrasing techniques may be analogous to changing perspective. For example, negative data augmentation (NDA) Sinha et al. (2021) involves creating out-of-distribution samples. It has so far been exclusively explored for CV, but could be investigated for text.

Self-supervised learning:

More recently, DA has been increasingly used as a key component of self-supervised learning, particularly in vision Chen et al. (2020e). In NLP, BART Lewis et al. (2020) showed that predicting deleted tokens as a pretraining task can achieve performance similar to that of the masked LM, and ELECTRA Clark et al. (2020) found that pretraining by predicting corrupted tokens outperforms BERT given the same model size, data, and compute. We expect future work will continue exploring how to effectively manipulate text for both pretraining and downstream tasks.

Offline versus online data augmentation:

In CV, standard techniques such as cropping and rotations are typically done stochastically, allowing for DA to be incorporated elegantly into the training pipeline. In NLP, however, it is unclear how to include a lightweight code module to apply DA stochastically. This is because DA techniques for NLP often leverage external resources (e.g. a dictionary for token substitution or a translation model for backtranslation) that are not easily transferable across training pipelines. Thus, a common practice for DA in NLP is to generate augmented data offline and store it as additional data to be loaded during training (see Appendix D). Future work on a lightweight module for online DA in NLP could be fruitful. Another challenge will be determining when such a module is helpful: in CV the invariances being imposed are well-accepted, whereas in NLP they can vary substantially across tasks.
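The contrast between the two regimes can be sketched in a few lines. The example below is a hypothetical illustration rather than an existing library: it uses random token swapping only because that operation needs no external resource, and the function names, the augmentation probability `p`, and the `(tokens, label)` data layout are assumptions for this sketch.

```python
import random


def random_swap(tokens, rng):
    """Swap two randomly chosen positions; short sequences are returned unchanged."""
    tokens = list(tokens)
    if len(tokens) >= 2:
        i, j = rng.sample(range(len(tokens)), 2)
        tokens[i], tokens[j] = tokens[j], tokens[i]
    return tokens


def offline_augment(dataset, copies, seed=0):
    """Offline DA: build a fixed, enlarged dataset once, to be stored and reloaded."""
    rng = random.Random(seed)
    augmented = list(dataset)
    for tokens, label in dataset:
        for _ in range(copies):
            augmented.append((random_swap(tokens, rng), label))
    return augmented


def online_batches(dataset, batch_size, p, seed=0):
    """Online DA: each example is re-augmented with probability p every time
    it is drawn, so different epochs (seeds) see different variants."""
    rng = random.Random(seed)
    order = list(range(len(dataset)))
    rng.shuffle(order)
    for start in range(0, len(order), batch_size):
        batch = []
        for i in order[start:start + batch_size]:
            tokens, label = dataset[i]
            if rng.random() < p:
                tokens = random_swap(tokens, rng)
            batch.append((list(tokens), label))
        yield batch
```

The offline variant pays the augmentation cost once and yields a reproducible artifact, which suits expensive operations such as backtranslation. The online variant stores nothing extra and exposes the model to fresh variants on every pass, but it is only practical when the augmentation is cheap enough to run inside the data loader.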

Lack of unification

is a challenge for the current literature on data augmentation for NLP, and popular methods are often presented in an auxiliary fashion. Whereas there are well-accepted frameworks for DA for CV (e.g. default augmentation libraries in PyTorch, RandAugment Cubuk et al. (2020)), there are no such "generalized" DA techniques for NLP. Further, we believe that DA research would benefit from the establishment of standard and unified benchmark tasks and datasets to compare different augmentation methods.

Good data augmentation practices

would help make DA work more accessible to, and reproducible by, the NLP and ML communities. On top of the unified benchmark tasks, datasets, and frameworks/libraries mentioned above, other good practices include making code and augmented datasets publicly available, reporting variation among results (e.g. standard deviation across random seeds), and adopting more standardized evaluation procedures. In addition, transparent hyperparameter analysis, explicit statements of the failure cases of proposed techniques, and discussion of the intuition and theory behind them would improve the transparency and interpretability of DA techniques.
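As one concrete instance of reporting variation among results, the sketch below aggregates a metric over several random seeds and reports the mean with its sample standard deviation. The helper names `summarize` and `report_over_seeds` are hypothetical, and `dummy_run` is a placeholder that stands in for a real training and evaluation run; the numbers it produces are synthetic and carry no empirical meaning.

```python
import random
import statistics


def summarize(scores):
    """Return (mean, sample standard deviation) of a list of scores."""
    if len(scores) < 2:
        raise ValueError("need at least two runs to estimate variation")
    return statistics.mean(scores), statistics.stdev(scores)


def report_over_seeds(run_fn, seeds):
    """Call run_fn(seed) once per seed and format the aggregate result."""
    scores = [run_fn(seed) for seed in seeds]
    mean, std = summarize(scores)
    return f"{mean:.3f} ± {std:.3f} (n={len(seeds)} seeds)"


def dummy_run(seed):
    """Placeholder for 'train with this seed, then evaluate'.
    Returns a synthetic score purely so the example executes."""
    return random.Random(seed).uniform(0.0, 1.0)


if __name__ == "__main__":
    print("baseline:", report_over_seeds(dummy_run, seeds=[0, 1, 2, 3, 4]))
```

Reporting the same statistics for both the baseline and the augmented model, over the same set of seeds, makes it easier to judge whether an improvement attributed to DA exceeds run-to-run noise.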

Conclusion

In this paper, we presented a comprehensive and structured survey of data augmentation for natural language processing (NLP). We provided a background about data augmentation and how it works, discussed major methodologically representative data augmentation techniques for NLP, and touched upon data augmentation techniques for popular NLP applications and tasks. Finally, we outlined current challenges and directions for future research, and showed that there is much room for further exploration. Overall, we hope our paper can serve as a guide for NLP researchers to decide on which data augmentation techniques to use, and inspire additional interest and work in this area. Please see the corresponding GitHub repository at https://github.com/styfeng/DataAug4NLP.

References

Appendices

Appendix A Useful Blog Posts and Code Repositories

The following blog posts and code repositories could be helpful in addition to the information presented and papers/works mentioned in the body:

Introduction to popular text augmentation techniques: https://towardsdatascience.com/data-augmentation-in-nlp-2801a34dfc28

Detailed blog post on various text DA techniques: https://amitness.com/2020/05/data-augmentation-for-nlp/

Lightweight library for DA on text and audio: https://github.com/makcedward/nlpaug

Python framework for adversarial examples: https://github.com/QData/TextAttack

Appendix B DA Methods Table - Description of Columns and Attributes

Table 1 in the main body compares a non-exhaustive selection of DA methods along various aspects relating to their applicability, dependencies, and requirements. Below, we provide a more extensive description of each of this table’s columns and their attributes.

Ext.Know: Short for external knowledge, this column is ✓ when the data augmentation process requires knowledge resources which go beyond the immediate input examples and the task definition, such as WordNet Miller (1995) or PPDB Pavlick et al. (2015). Note that, for clarity, this column excludes the case where these resources are pretrained models; that case is covered by a separate column (next), since pretrained models are widespread enough to merit their own category.

Pretrained: Denotes that the data augmentation process requires a pretrained model, such as BERT Devlin et al. (2019) or GPT-2 Radford et al. (2019).

Preprocess: Denotes the preprocessing steps, e.g. tokenization (tok), dependency parsing (dep), etc. required for the DA process. A hyphen (-) means either no preprocessing is required or that it was not explicitly stated.

Level: Denotes the depth and extent to which elements of the instance/data are modified by the DA. Some primitives modify just the Input (e.g. word swapping), some modify both Input and Label (e.g. negation), while others make changes in the embedding or hidden space (Embed/Hidden) or higher representation layers en route to the task model.

Task-Agnostic: This is an approximate, partially subjective column denoting the extent to which a DA method can be applied to different tasks. A ✓ here does not denote a rigid sense of the term task-agnostic; it means that, as understood by the authors, the method would likely extend easily to most NLP tasks. Similarly, an × denotes being restricted to a specific task (or small group of related tasks) only. There can be other labels, denoting applicability to broad task families. For example, substructural denotes the family of tasks where sub-parts of the input are also valid input examples in their own right, e.g. constituency parsing. Sentence Pairs denotes tasks which involve pairwise sentence scoring such as paraphrase identification, duplicate question detection, and semantic textual similarity.

Appendix C Additional DA Works by Task

See Table 3 for additional DA works for GEC, Table 3 for additional DA works for neural machine translation, Table 5 for additional DA works for dialogue, and Table 5 for additional DA works for multimodal tasks. Each work is described briefly.