Substructure Substitution: Structured Data Augmentation for NLP

Haoyue Shi, Karen Livescu, Kevin Gimpel

Introduction

Data augmentation has been shown to be effective for various natural language processing (NLP) tasks, such as machine translation (Fadaee et al., 2017; Gao et al., 2019; Xia et al., 2019, inter alia), text classification (Wei and Zou, 2019; Quteineh et al., 2020), semantic role labeling (Fürstenau and Lapata, 2009) and dialogue understanding (Hou et al., 2018; Niu and Bansal, 2019). Such methods enhance the diversity of the training set by generating examples based on existing ones and simple heuristics, and make the training process more consistent (Xie et al., 2019). Most existing work focuses on word-level manipulation (Kobayashi, 2018; Wei and Zou, 2019; Dai and Adel, 2020, inter alia) or global sequence-to-sequence style generation (Sennrich et al., 2016).

In this work, we study a family of general data augmentation methods, substructure substitution (Sub2), which generates new examples by same-label substructure substitution (Figure 1). Sub2 naturally fits structured prediction tasks such as part-of-speech tagging and parsing, where substructures exist in the annotations of the tasks. For more general NLP tasks such as text classification, we present a variation of Sub2 which (1) performs constituency parsing on existing examples, and (2) generates new examples by subtree substitution based on the parses.

Unlike the other methods we investigate, which sometimes hurt model performance, Sub2 helps models achieve competitive or better performance than training on the original dataset, as we show through extensive experiments across tasks and original dataset sizes. When combined with pretrained language models (Conneau et al., 2019), Sub2 establishes new state-of-the-art results for low-resource part-of-speech tagging and sentiment analysis.

The question of whether explicit parse trees can help neural network–based approaches on downstream tasks has been raised by recent work (Shi et al., 2018b; Havrylov et al., 2019) in which non-linguistic balanced trees have been shown to rival the performance of those from syntactic parsers. Our work shows that constituency parse trees are more effective than balanced trees as backbones for Sub2 on text classification, especially when only a few examples are available, introducing more potential applications for constituency parse trees in the neural network era.

Related Work

Data augmentation aims to generate new examples based on available ones, without actually collecting new data. Such methods reduce the cost of dataset collection, and usually boost the model performance on desired tasks. Most existing data augmentation methods for NLP tasks can be classified into the following categories:

Token-level manipulation. Token-level manipulation has been widely studied in recent years. An intuitive way is to create new examples by substituting (word) tokens with ones with the same desired features, such as synonym substitution (Zhang et al., 2015; Wang and Yang, 2015; Fadaee et al., 2017; Kobayashi, 2018) or substitution with words having the same morphological features (Silfverberg et al., 2017). Such methods have been applied to generate adversarial or negative examples which help improve the robustness of neural network–based NLP models (Belinkov and Bisk, 2018; Shi et al., 2018a; Alzantot et al., 2018; Zhang et al., 2019; Min et al., 2020, inter alia), or to generate counterfactual examples which mitigate bias in natural language (Zmigrod et al., 2019; Lu et al., 2020).

Other token-level manipulation methods introduce extra noise such as random token shuffling and deletion (Wang et al., 2018; Wei and Zou, 2019). Models trained on the augmented dataset are expected to be more robust to the considered noise.

Label-conditioned text generation. Recent work has explored generating new examples by training a conditional text generation model (Bergmanis et al., 2017; Liu et al., 2020a; Ding et al., 2020; Liu et al., 2020b, inter alia), or applying post-processing on the examples generated by pretrained models (Yang et al., 2020; Wan et al., 2020; Yoo et al., 2020). In the data augmentation stage, given labels in the original dataset as conditions, such models generate associated text accordingly. The generated examples, together with the original datasets, are used to further train models for the primary tasks. A representative among them is back-translation (Sennrich et al., 2016), which has been demonstrated to be effective not only on machine translation, but also on style transfer (Prabhumoye et al., 2018; Zhang et al., 2020a), conditional text generation (Sobrevilla Cabezudo et al., 2019), and grammatical error correction (Xie et al., 2018).

Another line of work generates new examples from predefined templates (Kafle et al., 2017; Asai and Hajishirzi, 2020), where the templates are designed following heuristic, and usually task-specific, rules.

Soft data augmentation. In addition to explicit generation of concrete examples, soft augmentation, which directly represents generated examples in a continuous vector space, has been proposed: Gao et al. (2019) propose to perform soft word substitution for machine translation; recent work has adapted the mix-up method (Zhang et al., 2018), which augments the original dataset by linearly interpolating the vector representations of text and labels, to text classification (Guo et al., 2019; Sun et al., 2020), named entity recognition (Chen et al., 2020) and compositional generalization (Guo et al., 2020).

Structure-aware data augmentation. Existing work has also sought potential gain from structures associated with natural language: Xu et al. (2016) improve word relation classification by dependency path–based augmentation. Şahin and Steedman (2018) show that subtree cropping and rotation based on dependency parse trees can help part-of-speech tagging for low-resource languages, while Vania et al. (2019) demonstrate that such methods also help dependency parsing when very limited training data is available.

Sub2 also falls into this category. The idea of same-label substructure substitution has improved over baselines on structured prediction tasks such as semantic parsing (Jia and Liang, 2016), constituency parsing (Shi et al., 2020), dependency parsing (Dehouck and Gómez-Rodríguez, 2020), named entity recognition (Dai and Adel, 2020), meaning representation–based text generation (Kedzie and McKeown, 2020), and compositional generalization (Andreas, 2020). To the best of our knowledge, however, Sub2 has not been systematically studied as a general data augmentation method for NLP tasks. In this work, we not only extend Sub2 to part-of-speech tagging and structured sentiment classification, but also present a variation that allows a broader range of NLP tasks (e.g., text classification) to benefit from syntactic parse trees. We evaluate Sub2 and several representative general data augmentation methods, which can be widely applied to various NLP tasks.

When constituency parse trees are used, there is a connection between Sub2 and tree substitution grammars (TSGs; Schabes, 1990), where the approach can be viewed as (1) estimating a TSG using the given corpus and (2) drawing new sentences from the estimated TSG.

Method

We introduce the general framework we investigate in Section 3.1, and describe in Section 3.2 the variations of Sub2 which can be extended to text classification and other NLP applications.

As shown in Figure 1, given the original training set $\mathcal{D}$ , Sub2 generates new examples using same-label substructure substitution, and repeats the process until the training set reaches the desired size. The general Sub2 procedure is presented in Algorithm 1.
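Since Algorithm 1 is not reproduced in this text, the loop described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the callbacks `extract` and `substitute`, the `max_tries` guard, and the choice to sample source examples from the original set only are all hypothetical.

```python
import random
from collections import defaultdict


def sub2_augment(dataset, extract, substitute, target_size, seed=0, max_tries=10000):
    """Grow `dataset` by same-label substructure substitution (illustrative sketch).

    extract(example)                  -> list of (label, substructure, position)
    substitute(example, position, s)  -> new example with `s` placed at `position`
    Both callbacks are hypothetical and task-specific.
    """
    rng = random.Random(seed)

    # Index every substructure of the original data by its substructure label.
    pool = defaultdict(list)
    for example in dataset:
        for label, sub, _ in extract(example):
            pool[label].append(sub)

    augmented = list(dataset)
    tries = 0
    # Repeat until the training set reaches the desired size
    # (max_tries is an added safeguard against datasets with no valid swap).
    while len(augmented) < target_size and tries < max_tries:
        tries += 1
        example = rng.choice(dataset)  # assumption: sample from the original set
        candidates = extract(example)
        if not candidates:
            continue
        label, sub, position = rng.choice(candidates)
        alternatives = [s for s in pool[label] if s != sub]
        if not alternatives:
            continue
        augmented.append(substitute(example, position, rng.choice(alternatives)))
    return augmented
```

The task-specific part is confined to the two callbacks, so the same loop can serve tagging, parsing, and classification once a notion of substructure and substructure label is chosen.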

For part-of-speech (POS) tagging, we let text spans be substructures and use the corresponding POS tag sequences as substructure labels (Figure LABEL:fig:pos); for constituency parsing, we use subtrees as the substructures, with phrase labels as the substructure labels (Figure LABEL:fig:c-parse); for dependency parsing, we also use subtrees as substructures, and use the label of the dependency arc that links the head of the subtree to its parent as the substructure label.
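For the POS tagging case, a toy sketch of span substitution keyed by POS tag sequence might look like the following. The function names and the `max_len` cap on span length are assumptions made for illustration and do not come from the paper.

```python
from collections import defaultdict


def index_spans(corpus, max_len=3):
    """Map each POS-tag sequence to the set of word spans that carry it.

    `corpus` is a list of (words, tags) pairs; `max_len` is a hypothetical
    cap on span length that keeps the index small.
    """
    index = defaultdict(set)
    for words, tags in corpus:
        for i in range(len(words)):
            for j in range(i + 1, min(i + max_len, len(words)) + 1):
                index[tuple(tags[i:j])].add(tuple(words[i:j]))
    return index


def substitute_span(words, tags, i, j, index):
    """Return every sentence obtained by replacing words[i:j] with a same-label span.

    The tag sequence is unchanged, so the new sentence keeps the original tags.
    """
    label = tuple(tags[i:j])
    original = tuple(words[i:j])
    return [
        (words[:i] + list(span) + words[j:], tags)
        for span in sorted(index[label])
        if span != original
    ]
```

Because the substituted span carries the same tag sequence, the gold annotation of the new sentence is obtained for free, which is what makes the method a natural fit for structured prediction.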

3.2 Variations of Sub2 for Text Classification

Text classification examples do not typically contain explicit substructures. However, we can obtain them by viewing all text spans as substructures (Figure LABEL:fig:txt-class). This approach may be too unconstrained in practice and could introduce noise during augmentation, so we consider constraining substitution based on matching several features of the spans:

Number of words (Sub2+n): when considering this constraint, we can only substitute a span with another having the same number of words; otherwise we can substitute a span with any other span.

Phrase or not (Sub2+p): when considering this constraint, we can only substitute a phrase with another phrase (according to a constituency parse of the text); otherwise the considered spans do not necessarily need to be phrases.

Phrase label (Sub2+l): this constraint is only applicable when also using Sub2+p. When considering this constraint, we can only perform substitution between phrases with the same phrase label (from constituency parse trees).

Text classification label (Sub2+t): when considering this constraint, we can only substitute a span with another span that comes from text annotated with the same class label as the original one; otherwise we can choose the alternative from any example text in the training corpus.

We also investigate combinations of the above constraints, where we require all the involved substructure labels to match in order to perform Sub2. For example, Sub2+t+n (Figure LABEL:fig:txt-class) requires the original and the alternative span to have the same text label and the same number of words.
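One way to picture how the constraints combine is to build a composite key per span and allow substitution only between spans with equal keys. The sketch below is an illustration under assumptions: the `Span` record, its field names, and the single-letter constraint codes are hypothetical, and phrase information is assumed to come from an external constituency parser.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Span:
    words: Tuple[str, ...]
    class_label: str             # label of the text the span was taken from
    phrase_label: Optional[str]  # e.g. "NP"; None if the span is not a phrase


def match_key(span, constraints):
    """Build the substructure label of `span` under a set of constraints.

    `constraints` is a subset of {"n", "p", "l", "t"}, mirroring Sub2+n, +p,
    +l, and +t. Returns None when the span is not eligible for substitution.
    """
    if "l" in constraints and "p" not in constraints:
        raise ValueError("+l is only applicable together with +p")
    if "p" in constraints and span.phrase_label is None:
        return None  # under +p, only phrases may be substituted
    key = []
    if "n" in constraints:
        key.append(len(span.words))
    if "l" in constraints:
        key.append(span.phrase_label)
    if "t" in constraints:
        key.append(span.class_label)
    return tuple(key)


def can_substitute(a, b, constraints):
    """Two spans may be swapped only if both are eligible and their keys agree."""
    key_a, key_b = match_key(a, constraints), match_key(b, constraints)
    return key_a is not None and key_a == key_b
```

With an empty constraint set every pair of spans is compatible, which corresponds to the unconstrained variant described at the start of this section.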

Experiments

We evaluate Sub2 and other data augmentation baselines (Section 4.2) on four tasks: part-of-speech tagging, dependency parsing, constituency parsing, and text classification.

For part-of-speech tagging and text classification, we add a two-layer perceptron on top of XLM-R (Conneau et al., 2019) embeddings, where we calculate contextualized token embeddings by a learnable weighted average across layers. We use endpoint concatenation (i.e., the concatenation of the first and last token representations) to obtain fixed-dimensional span or sentence features, and keep the pretrained model frozen during training. (We did not observe any significant improvement from finetuning the large pretrained language model; in most cases, the performance is much worse than with the scheme we apply.) For dependency parsing, we use the SuPar implementation (https://github.com/yzhangcs/parser) of Dozat and Manning (2017). For constituency parsing, we use Benepar (Kitaev and Klein, 2018; https://github.com/nikitakit/self-attentive-parser).
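The feature extraction described above can be illustrated in plain Python. This is only a sketch: the text says the layer weights are learnable but does not say how they are normalized, so the softmax over `layer_logits` is an assumption, and all function names are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def mix_layers(layers, layer_logits):
    """Weighted average of per-layer token vectors.

    layers[l][t] is the vector of token t at layer l. `layer_logits` stands in
    for the learnable parameters (one scalar per layer); softmax normalization
    is an assumption of this sketch.
    """
    weights = softmax(layer_logits)
    n_tokens, dim = len(layers[0]), len(layers[0][0])
    return [
        [sum(w * layers[l][t][d] for l, w in enumerate(weights)) for d in range(dim)]
        for t in range(n_tokens)
    ]


def endpoint_concat(token_vectors, start, end):
    """Span feature for tokens[start:end]: first and last token vectors, concatenated."""
    return token_vectors[start] + token_vectors[end - 1]
```

Endpoint concatenation yields a feature of twice the token dimensionality regardless of span length, which is what lets a single perceptron score both sentences and spans.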

For all data augmentation methods, including the baselines (Section 4.2), we only augment the training set, and use the original development set. If not specified, we introduce 20 times more examples than the original training set when applying an augmentation method. When introducing $k\times$ new examples, we also replicate the original training set $k$ times to ensure that the model can access sufficient examples from the original distribution.
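The resulting training set composition can be sketched as below. This reads "introducing $k\times$ new examples" as generating $k\,|\mathcal{D}|$ examples, which is an interpretation rather than something the text states outright; `generate_one` is a hypothetical callback that returns one augmented example.

```python
def build_training_set(original, generate_one, k=20):
    """Return k copies of the original data plus k * len(original) generated examples.

    Under this reading, the final training set has 2 * k * len(original)
    examples, half of which come from the original distribution.
    """
    n_new = k * len(original)
    generated = [generate_one() for _ in range(n_new)]
    return original * k + generated
```

For example, with 50 original sentences and the default k = 20, this would give 1,000 replicated originals and 1,000 generated examples.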

All models are initialized with the XLM-R base model (Conneau et al., 2019) if not specified. We train models for 20 epochs when applying the high-resource setting (i.e., high-resource part-of-speech tagging, sentiment classification trained on the full training set) or when applying data augmentation methods, and for 400 epochs in the low-resource settings without augmentation; we select the one with the highest accuracy or $F_{1}$ score on the development set. All models are optimized using Adam (Kingma and Ba, 2015), where we try learning rates in $\{5\times 10^{-4},5\times 10^{-5}\}$ . For hidden size (i.e., the hidden size of the perceptron for part-of-speech tagging and text classification, the dimensionality of span representation and scoring multi-layer perceptron for constituency parsing, and the dimensionality of token representation and scoring multi-layer perceptron for dependency parsing), we vary between $128$ and $512$ . We apply a 0.2 dropout ratio to the contextualized embeddings in the training stage. All other hyperparameters are the same as the default settings in the released codebases.

4.2 Baselines

We compare Sub2 to the following baselines:

No augmentation (NoAug), where the original training and development set are used.

Contextualized substitution (CtxSub), where we apply contextualized augmentation (Kobayashi, 2018): we mask out a random word token from the existing dataset, and use multilingual BERT (mBERT; Devlin et al., 2019) to generate a different word.

Random shuffle (Rand), where we randomly shuffle all the words in the original sentence, while keeping the original structured or non-structured labels. It is worth noting that for dependency parsing, we shuffle the words while maintaining the dependency arcs between individual words; for constituency parsing, we shuffle the terminal nodes and insert them back into the tree structure. Our Rand method for constituency parsing is arguably noisier than that for dependency parsing.
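For the dependency parsing case, shuffling words while keeping arcs between individual words amounts to permuting the tokens and remapping head indices. The sketch below assumes a CoNLL-style encoding (1-based head indices, 0 for the root); the function name and signature are hypothetical.

```python
import random


def shuffle_dependency_tree(words, heads, labels, seed=None):
    """Shuffle word order while preserving every word-to-word dependency arc.

    heads[i] is the 1-based index of the head of word i, with 0 marking the
    root; labels[i] is the label of the arc entering word i.
    """
    rng = random.Random(seed)
    order = list(range(len(words)))
    rng.shuffle(order)  # order[new_index] = old_index
    new_pos = {old: new for new, old in enumerate(order)}

    new_words = [words[old] for old in order]
    new_labels = [labels[old] for old in order]
    # Each word keeps the same head word; only the head's index changes.
    new_heads = [
        0 if heads[old] == 0 else new_pos[heads[old] - 1] + 1
        for old in order
    ]
    return new_words, new_heads, new_labels
```

The set of (dependent, head, label) triples is identical before and after shuffling, so only the linear order carries noise.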

For non-structured text classification tasks, we also introduce the following baselines:

Random word substitution (RandWord), where we substitute a random word in an original example with another random word. This can be viewed as a less restricted version of CtxSub.

Binary balanced tree–based Sub2 (Sub2+p, balanced tree). Shi et al. (2018b) argue that binary balanced trees are better backbones for recursive neural networks (Zhu et al., 2015; Tai et al., 2015) on text classification. In this work, we present binary balanced tree as the backbone for Sub2: we (1) generate balanced trees by recursively splitting a span of $n$ words into two consecutive groups, which consist of $\left\lfloor\frac{n}{2}\right\rfloor$ and $\left\lceil\frac{n}{2}\right\rceil$ words respectively, and (2) treat each nonterminal in the balanced tree as a substructure to perform Sub2.
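The recursive splitting in step (1) can be sketched as follows. The text does not say whether single-word leaves count as substructures; as an assumption, this sketch returns only the multi-word (internal) nodes. The function name is hypothetical.

```python
def balanced_tree_spans(start, end):
    """Return the (start, end) spans of all internal nodes of the balanced
    binary tree built over words[start:end].

    A span of n words is split into a left child of floor(n/2) words and a
    right child of ceil(n/2) words; spans of fewer than two words are leaves.
    """
    n = end - start
    if n < 2:
        return []
    mid = start + n // 2  # left child: floor(n/2) words, right child: ceil(n/2)
    return (
        [(start, end)]
        + balanced_tree_spans(start, mid)
        + balanced_tree_spans(mid, end)
    )
```

A sentence of n words always yields n - 1 internal nodes, and the resulting spans depend only on sentence length, not on syntax, which is what makes this a non-linguistic control for the parser-based variants.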

All of the data augmentation baselines are explicit augmentations where concrete new examples are generated and used. The methods above are generally applicable to a wide range of NLP tasks.

4.3 Part-of-Speech Tagging

We conduct our experiments using the Universal Dependencies (UD; Nivre et al., 2016, 2020; http://universaldependencies.org/) dataset.

First, we compare both NoAug and Sub2 to the previous state-of-the-art performance (Heinzerling and Strube, 2019) to ensure that our baselines are strong enough (Table 1). Heinzerling and Strube (2019) take the token-wise concatenation of mBERT last-layer representations, byte-pair encoding (BPE; Gage, 1994)–based LSTM hidden states and character-LSTM hidden states as the input to the classifier, and fine-tune the pretrained mBERT during training. We find that with our framework with frozen mBERT and extra learnable layer weight parameters, we are able to obtain competitive or better results than those reported by Heinzerling and Strube (2019); the gains grow larger when using XLM-R, which is trained on larger corpora than mBERT. In addition, by augmenting the training set with Sub2, we obtain competitive performance on all languages, and achieve better average accuracy on low-resource languages.

We further test the part-of-speech tagging accuracy on 5 selected low-resource treebanks in the UD 2.6 dataset (Table 2), following the official splits of the dataset. For four among the five investigated treebanks, Sub2 achieves the best performance among all methods, while also maintaining a competitive performance on te (mtg). In contrast, other augmentation methods (CtxSub and Rand) are harmful compared to NoAug on all treebanks, indicating that the examples generated by Sub2 may be closer to the original data distribution.

4.4 Dependency Parsing

We evaluate the performance of models using the standard Penn Treebank dataset (PTB; Marcus et al., 1993), converted by the Stanford dependency converter v3.0 (https://nlp.stanford.edu/software/stanford-dependencies.shtml), following the standard splits.

We first compare the performance of Sub2 and baselines in the low-resource setting (Table 3). All augmentation methods can, though not always, improve over NoAug. CtxSub helps achieve the best LAS when there is only an extremely small training set (e.g., 10 examples) available; however, when the size of the original training set becomes larger, Sub2 begins to dominate, while CtxSub and Rand start to sometimes hurt the performance. In addition, a larger augmented dataset does not necessarily lead to better performance: throughout our experiments, augmenting the original dataset to $10\times$ to $50\times$ its size results in reasonably good accuracy.

However, when training on the full WSJ training set, Sub2 does not necessarily improve over the baselines, although its performance remains competitive (Table 4). An additional finding here is that a simple biaffine dependency parsing model (Dozat and Manning, 2017) with XLM-R initialization is able to set a new state of the art for dependency parsing with only in-domain annotation.

4.5 Constituency Parsing

We evaluate Sub2 and baseline methods on few-shot constituency parsing, using the Foreebank (Fbank; Kaljahi et al., 2015) and NXT-Switchboard (SWBD; Calhoun et al., 2010) datasets. Foreebank consists of 1,000 English and 1,000 French sentences; for either language, we randomly select 50 sentences for training, 50 for development, and 250 for testing, leaving the other 650 sentences for future use. We follow the standard splits of NXT-Switchboard, and randomly select 50 sentences from the training set and 50 from the development set for training and development respectively.

We compare different data augmentation methods using the setup of few-shot parsing from scratch (Table 5). Among all settings we tested, Sub2 achieves the best performance, while all augmentation methods we investigated improve over training only on the original dataset (NoAug). Surprisingly, we find that the seemingly meaningless Rand, which randomly shuffles the sentence and inserts the shuffled words back into the original parse tree structure as the terminals, also consistently helps few-shot parsing by a nontrivial margin. This trend may be explained by benefits in learning/optimization stability in this few-shot setting, but we leave a richer exploration of potential explanations for future work.

For domain adaptation (Table 6), we first train Benepar (Kitaev and Klein, 2018) on the Penn Treebank dataset, and use the pretrained model as the initialization. Although the gain from data augmentation is generally smaller than in few-shot parsing trained from scratch, Sub2 still works the best across datasets.

4.6 Text Classification

We evaluate the methods introduced in Section 3.2 and baselines on two text classification datasets in the low-resource setting (Table 7): the Stanford Sentiment Treebank (SST; Socher et al., 2013) and AG News sentence (Zhang et al., 2015), for which we keep only the single-sentence instances among all examples in each split of the original AG News dataset, following Shi et al. (2018b). We obtain the constituency parse trees using Benepar (Kitaev and Klein, 2018) trained on the standard PTB dataset. Since the SST dataset provides sentiment labels of phrases, it is also natural to apply such phrase sentiment labels as substructure labels, where the substructures are phrases (Sub2+p+senti).

Across the two investigated settings, data augmentation is usually helpful to improve over NoAug, and most variations of Sub2 with the phrase-or-not (+p) substructure label are among the best-performing methods on each task (except Sub2+p for SST-10%). Additionally, constituency tree–based Sub2 with phrase labels (+p+l) outperforms balanced tree–based Sub2 in both settings, indicating that phrase structures can be considered as useful information for data augmentation in general.

We further use Sub2+p+t+senti to augment the full SST training set, since it is the best augmentation method for few-shot sentiment classification. In addition to sentences, we also add phrases (i.e., subtrees) as training examples to boost performance, following most existing work (Socher et al., 2013; Kim, 2014; Brahma, 2018, inter alia); that is, unlike in Table 7, we apply the same settings as conventional work to produce the numbers in Table 8. In this setting, we find that Sub2 helps set a new state of the art on the SST dataset (Table 8).

Discussion

We investigate substructure substitution (Sub2), a family of data augmentation methods that generates new examples by same-label substructure substitution. Such methods help achieve competitive or better performance on the tasks of part-of-speech tagging, few-shot dependency parsing, few-shot constituency parsing, and text classification. While other data augmentation methods (e.g., CtxSub and Rand) sometimes improve the performance, Sub2 is the only one that consistently helps low-resource NLP across tasks.

While existing work has shown that explicit constituency parse trees may not necessarily help improve recursive neural networks for text classification and other NLP tasks (Shi et al., 2018b), our work shows that such parse trees can be robust backbones for Sub2-style data augmentation, introducing more potential ways to help neural networks take advantage of explicit syntactic annotations.

One open question remains to be addressed: it is still unclear why Rand helps improve few-shot constituency parsing, as the training process requires the model to output the correct parse tree of a sentence while only accessing shuffled words. We leave this question, as well as applications of Sub2 to more NLP tasks, for future work.