Sequence-Level Mixed Sample Data Augmentation

Demi Guo, Yoon Kim, Alexander M. Rush

Introduction

Natural language is thought to be characterized by systematic compositionality Fodor and Pylyshyn (1988). A computational model that is able to exploit such systematic compositionality should understand sentences by appropriately recombining subparts that have not been seen together during training. Consider the following example from Andreas (2020):

Given the above sentences, a model which has learned compositional structure should be able to generalize and understand sentences such as:

In practice, neural models often overfit to long segments of text and fail to generalize compositionally.

This work proposes a simple data augmentation strategy for sequence-to-sequence learning, SeqMix, which creates soft synthetic examples by randomly combining parts of two sentences. This prevents models from memorizing long segments and encourages models to rely on compositions of subparts to predict the output. To motivate our approach, consider some example sentences that can be created by combining (1a) and (1b):

Instead of enumerating over all possible combinations of two sentences, SeqMix crafts a new example by softly mixing the two sentences via a convex combination of the original examples. This approach can be seen as a sequence-level variant of a broader family of techniques called mixed sample data augmentation (MSDA), which was originally proposed by Zhang et al. (2018) and has been shown to be particularly effective for classification tasks DeVries and Taylor (2017); Yun et al. (2019); Verma et al. (2019). We also show that SeqMix shares similarities with word replacement/dropout strategies in machine translation Sennrich et al. (2016); Wang et al. (2018); Gao et al. (2019).

SeqMix targets a crude but simple approach to data augmentation for language applications. We apply SeqMix to a variety of sequence-to-sequence tasks including neural machine translation, semantic parsing, and SCAN (a dataset designed to test for compositionality of data-driven models), and find that SeqMix improves results on top of (and when combined with) existing data augmentation methods.

Motivation and Related Work

While neural networks trained on large datasets have led to significant improvements across a wide range of NLP tasks, training them to generalize by learning the compositional structure of language remains a challenging open problem. Notably, Lake and Baroni (2018) propose an influential dataset (SCAN) to evaluate the systematic compositionality of neural models and find that they often fail to generalize compositionally.

One approach to encouraging compositional behavior in neural models is by incorporating compositional structures such as parse trees or programs directly into a network’s computational graph Socher et al. (2013); Dyer et al. (2016); Bowman et al. (2016); Andreas et al. (2016); Johnson et al. (2017). While effective on certain domains such as visual question answering, these approaches usually rely on intermediate structures predicted from pipelined models, which limits their applicability in general. Further, it is an open question as to whether such putatively compositional models result in significant empirical improvements on many NLP tasks Shi et al. (2018).

Expressive parameterizations over high dimensional input afforded by neural networks contribute to their excellent performance in high resource settings; however, such flexible parameterizations can easily lead to a model’s memorizing (i.e., overfitting to) long segments of text, instead of relying on the appropriate subparts of segments. Another approach to encouraging compositionality in richly-parameterized neural models, then, is to augment the training data with more examples. Existing work in this vein includes SwitchOut Wang et al. (2018), which replaces a word in a sentence with a random word from the vocabulary, GECA Andreas (2020), which creates new examples by switching subparts that occur in similar contexts, and TMix Chen et al. (2020), which interpolates between hidden states of neural models for text classification. We compare these approaches to our proposed approach in this paper.

Method

Our proposed approach, SeqMix, is simple, and is essentially a sequence-level variant of MixUp Zhang et al. (2018), which has primarily been used for image classification tasks DeVries and Taylor (2017); Yun et al. (2019). We first describe the generative data augmentation process behind this model for text generation, and show how SeqMix approximates the resulting latent variable objective with a relaxed version.

The new example pair of sentences $(\hat{X},\hat{Y})$ will not correspond to natural sentences in general, but may contain valid subparts (phrases) that bias the model towards learning the compositional structure (as in the examples discussed in the introduction). Marginalizing over $m$ gives the following log marginal likelihood,

where $p_{\lambda}(m)=\prod_{i=1}^{s+t}p_{\lambda}(m_{i})$ and $D,D^{\prime}$ are the example distributions.
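As a sketch of the process and objective described above, under stated assumptions: $X$ and $Y$ are taken to be one-hot matrices of lengths $s$ and $t$, $(X^{\prime},Y^{\prime})$ denotes a second example drawn from $D^{\prime}$ and padded to the same lengths, $m=[m^{X},m^{Y}]$, and each $m_{i}\sim\text{Bernoulli}(\lambda)$. The symbols $X^{\prime}$, $Y^{\prime}$, $m^{X}$ and $m^{Y}$ are notation assumed for this illustration, and the display need not match the exact form used in the original derivation.

```latex
\begin{aligned}
\hat{X}_{i} &= m^{X}_{i} X_{i} + (1-m^{X}_{i})\, X^{\prime}_{i}, \qquad i=1,\dots,s,\\
\hat{Y}_{j} &= m^{Y}_{j} Y_{j} + (1-m^{Y}_{j})\, Y^{\prime}_{j}, \qquad j=1,\dots,t,\\
\mathcal{L}(\theta) &= \mathbb{E}_{(X,Y)\sim D,\;(X^{\prime},Y^{\prime})\sim D^{\prime}}
  \left[\log \mathbb{E}_{m\sim p_{\lambda}(m)}\, p_{\theta}(\hat{Y}\mid\hat{X})\right].
\end{aligned}
```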

As exact marginalization in the above is intractable, we could target a lower bound, with Monte Carlo samples from $p_{\lambda}(m)$, resulting from Jensen’s inequality,
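Spelled out for a single pair of examples, such a bound would read roughly as follows. The inequality is Jensen's inequality applied to the concave $\log$; $K$ (the number of Monte Carlo samples) and the superscript $(k)$ are symbols introduced here for illustration, with $\hat{X}^{(k)},\hat{Y}^{(k)}$ denoting the mixed example built from the sample $m^{(k)}$.

```latex
\log \mathbb{E}_{m\sim p_{\lambda}(m)}\, p_{\theta}(\hat{Y}\mid\hat{X})
\;\geq\; \mathbb{E}_{m\sim p_{\lambda}(m)} \log p_{\theta}(\hat{Y}\mid\hat{X})
\;\approx\; \frac{1}{K}\sum_{k=1}^{K} \log p_{\theta}\big(\hat{Y}^{(k)}\mid\hat{X}^{(k)}\big),
\qquad m^{(k)}\sim p_{\lambda}(m).
```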

An alternative, which we refer to as SeqMix, is to consider a soft variant of the original objective by training on expected samples,

Letting $f_{\theta}(X,Y_{<t})$ be the output of the $\operatorname*{log-softmax}$ layer, the local probability of $Y_{t}$ is given by $\log p_{\theta}(Y_{t}|X,Y_{<t})=Y_{t}^{\top}f_{\theta}(X,Y_{<t})$. SeqMix then trains on the objective,
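A plausible way to write this relaxed objective is given below, as a sketch rather than a definitive statement of the original display. It assumes each swap indicator is Bernoulli with mean $\lambda$, so that the expected mixed example is a convex combination of the two originals, and it writes $(X^{\prime},Y^{\prime})$ for the second example (notation assumed here). The outer expectation is over pairs of examples and over $\lambda$.

```latex
\begin{aligned}
\mathbb{E}_{m}[\hat{X}] &= \lambda X + (1-\lambda)\, X^{\prime}, \qquad
\mathbb{E}_{m}[\hat{Y}] = \lambda Y + (1-\lambda)\, Y^{\prime},\\
\mathcal{L}_{\text{SeqMix}}(\theta) &= \mathbb{E}\left[\sum_{t}
  \big(\lambda Y_{t} + (1-\lambda)\, Y^{\prime}_{t}\big)^{\top}
  f_{\theta}\big(\lambda X + (1-\lambda)\, X^{\prime},\;
                 \lambda Y_{<t} + (1-\lambda)\, Y^{\prime}_{<t}\big)\right].
\end{aligned}
```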

To summarize, this results in a simple algorithm where we sample $\lambda\sim\text{Beta}(\alpha,\alpha)$ and train on these expected samples. Our implementation can be found at https://github.com/dguo98/seqmix, and pseudocode can be found in supplementary materials.
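The following is a minimal, standard-library-only sketch of one such mixing step. It is not the released implementation: the helper names (`one_hot`, `mix`, `seqmix_pair`, `soft_nll`) are hypothetical, sequences are assumed to be already padded to equal lengths, and a real model would feed the mixed one-hot rows through its embedding layer instead of the uniform stand-in predictions used here.

```python
import math
import random


def one_hot(token_ids, vocab_size):
    """Return one one-hot row (list of floats) per token id."""
    return [[1.0 if v == tok else 0.0 for v in range(vocab_size)]
            for tok in token_ids]


def mix(a, b, lam):
    """Convex combination lam * a + (1 - lam) * b of two equal-shape matrices."""
    return [[lam * x + (1.0 - lam) * y for x, y in zip(row_a, row_b)]
            for row_a, row_b in zip(a, b)]


def seqmix_pair(x, y, x2, y2, alpha, rng):
    """Sample lam ~ Beta(alpha, alpha) and softly mix two (source, target) pairs."""
    lam = rng.betavariate(alpha, alpha)
    return mix(x, x2, lam), mix(y, y2, lam), lam


def soft_nll(soft_targets, log_probs):
    """Negative of sum_t (soft target_t)^T (log-softmax output_t)."""
    return -sum(sum(p * lp for p, lp in zip(t_row, l_row))
                for t_row, l_row in zip(soft_targets, log_probs))


# Toy usage with a vocabulary of 4 symbols (all values are illustrative).
rng = random.Random(0)
V = 4
x, x2 = one_hot([0, 1, 2], V), one_hot([3, 1, 0], V)
y, y2 = one_hot([1, 2], V), one_hot([2, 3], V)
x_mix, y_mix, lam = seqmix_pair(x, y, x2, y2, alpha=1.0, rng=rng)

# Stand-in for a model: uniform log-probabilities at every target position.
uniform_log_probs = [[math.log(1.0 / V)] * V for _ in y_mix]
loss = soft_nll(y_mix, uniform_log_probs)
```

Because every mixed row is still a probability vector, the loss against a uniform predictor is the same as for a hard one-hot target; the mixing only changes which outputs the model is rewarded for.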

Table 1 shows that we can recover existing data augmentation methods such as SwitchOut and word dropout under the above framework. In particular, these methods approximate a version of the “hard” latent variable objective in Eq. 2 by considering different swap distributions $p(m)$ and sampling distributions $D^{\prime}$. Wang et al. (2018) also offer an alternative formulation which unifies various data augmentation strategies as training on a distribution that better approximates the underlying data distribution. While the hard version of SeqMix can also be unified under SwitchOut’s resulting objective, we chose our alternative formulation given its natural extension to the relaxed version. Compared to other approaches, SeqMix is essentially a relaxed variant of the same objective, similar to the difference between soft vs. hard attention Xu et al. (2015); Deng et al. (2018); Wu et al. (2018); Shankar et al. (2018). SeqMix is also more efficient than more sophisticated augmentation strategies such as GECA, which requires a computationally expensive validation check for swaps.
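To make the "hard" view concrete, the sketch below treats augmentation as a per-position swap between a sentence and a second sequence. The specific choices are assumptions made for illustration and are not taken from Table 1: swaps are independent Bernoulli draws that keep the original token with probability `lam`, the SwitchOut-like variant draws the second sequence uniformly from a vocabulary, and the word-dropout-like variant uses a hypothetical `<blank>` placeholder. All function names are hypothetical.

```python
import random


def hard_mix(tokens, other, lam, rng):
    """Keep tokens[i] with probability lam, otherwise take other[i]."""
    return [a if rng.random() < lam else b for a, b in zip(tokens, other)]


def switchout_like(tokens, vocab, lam, rng):
    """Second sequence = random vocabulary words (assumed sampling distribution)."""
    other = [rng.choice(vocab) for _ in tokens]
    return hard_mix(tokens, other, lam, rng)


def word_dropout_like(tokens, lam, rng, blank="<blank>"):
    """Second sequence = a blank placeholder at every position (assumed)."""
    return hard_mix(tokens, [blank] * len(tokens), lam, rng)


rng = random.Random(0)
sentence = ["look", "after", "walk", "twice"]
print(switchout_like(sentence, ["jump", "run", "left"], lam=0.8, rng=rng))
print(word_dropout_like(sentence, lam=0.8, rng=rng))
```

Under this view, the relaxed variant replaces the sampled swap with its expectation, which is what turns the discrete replacement into a convex combination.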

Experimental Setup

We test our approach against existing baselines across a variety of sequence-to-sequence tasks: machine translation, SCAN, and semantic parsing. For all datasets, we tune the $\alpha$ hyperparameter in the range of $[0.1,1.5]$ on the validation set. However, we observed the final result to be relatively invariant to $\alpha$ and found that setting $\alpha=1$ usually achieves good results. Exact details regarding the training setup (including descriptions of the various datasets) can be found in the supplementary materials.
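As a rough illustration of what $\alpha$ controls, the sketch below samples $\lambda\sim\text{Beta}(\alpha,\alpha)$ at the two ends of the tuning range and at $\alpha=1$, and measures how often $\lambda$ lands near 0 or 1 (i.e., how often the mixed example is close to one of the two originals). Beta(1, 1) is the uniform distribution on $[0,1]$; smaller $\alpha$ pushes mass toward the endpoints. The helper names, the sample size, and the 0.1 margin are arbitrary choices for this example.

```python
import random


def sample_lambdas(alpha, n, seed=0):
    """Draw n mixing coefficients from Beta(alpha, alpha)."""
    rng = random.Random(seed)
    return [rng.betavariate(alpha, alpha) for _ in range(n)]


def extreme_fraction(lams, margin=0.1):
    """Fraction of samples within `margin` of 0 or 1."""
    return sum(1 for lam in lams if lam < margin or lam > 1.0 - margin) / len(lams)


for alpha in (0.1, 1.0, 1.5):
    lams = sample_lambdas(alpha, 10000)
    print(f"alpha={alpha}: near-endpoint fraction = {extreme_fraction(lams):.2f}")
```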

Our machine translation experiments consider five translation datasets: (1) IWSLT ’14 German-English (de-en); (2) IWSLT ’14 English-{German, Italian, Spanish} (en-{de, it, es}); and (3) WMT ’14 English-German (en-de). We use the Transformer implementation from fairseq Ott et al. (2019) with the default configuration.

SCAN

SCAN is a command execution dataset designed to test for systematic compositionality of data-driven models. SCAN consists of simple English commands and corresponding action sequences. We consider three different splits that have been widely utilized in the existing literature: jump, around-right, turn-left. For the splits (jump, turn-left), the primitive commands (i.e. “jump”, “turn left”) are only seen in isolation during training, and the test set consists of commands that compose the isolated primitive command with the other commands seen during training. For the template split (around-right), training examples contain the commands “around” and “right” but never in combination. Following previous work Andreas (2020), we use a one-layer LSTM encoder-decoder model with hidden size of 512 and embedding size of 64.
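The primitive splits can be pictured with a small sketch: a command goes to the test side exactly when it uses the held-out primitive together with other words, and stays on the training side otherwise. This is an illustration of the idea only, not the procedure used to build the released SCAN splits; the function names are hypothetical.

```python
def uses_in_context(command, primitive):
    """True if `primitive` occurs in `command` alongside other words."""
    words, prim = command.split(), primitive.split()
    if words == prim:
        return False  # the primitive in isolation is allowed in training
    n = len(prim)
    return any(words[i:i + n] == prim for i in range(len(words) - n + 1))


def primitive_split(commands, primitive):
    """Split commands into (train, test) around a held-out primitive."""
    train = [c for c in commands if not uses_in_context(c, primitive)]
    test = [c for c in commands if uses_in_context(c, primitive)]
    return train, test


commands = ["jump", "walk", "walk left", "look after walk twice",
            "jump left", "look after jump twice"]
train, test = primitive_split(commands, "jump")
print("train:", train)
print("test: ", test)
```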

Semantic Parsing

For semantic parsing, we consider the SQL queries subset of GeoQuery Finegan-Dollak et al. (2018), which consists of 880 English questions paired with SQL commands. The standard question split ensures no questions are repeated between the train and test sets, while the more challenging query split ensures that neither questions nor logical forms (anonymized) are repeated. Following Andreas (2020), we use the same model as for SCAN but additionally introduce a copy mechanism.
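The difference between the two splits can be sketched as follows. The question/SQL pairs are made-up toy examples (not drawn from GeoQuery), the table and column names are hypothetical, and the anonymization rule (masking quoted string literals) is an assumption standing in for whatever anonymization the dataset actually uses.

```python
import re


def anonymize(sql):
    """Mask quoted string literals so queries differing only in values match."""
    return re.sub(r'"[^"]*"', '"<VAL>"', sql)


def query_split(pairs, test_templates):
    """Send every pair whose anonymized query is in `test_templates` to test."""
    train, test = [], []
    for question, sql in pairs:
        bucket = test if anonymize(sql) in test_templates else train
        bucket.append((question, sql))
    return train, test


pairs = [
    ("what is the capital of texas",
     'SELECT capital FROM state WHERE state_name = "texas"'),
    ("what is the capital of ohio",
     'SELECT capital FROM state WHERE state_name = "ohio"'),
    ("how big is alaska",
     'SELECT area FROM state WHERE state_name = "alaska"'),
]
train, test = query_split(pairs, test_templates={anonymize(pairs[2][1])})
print(len(train), "train /", len(test), "test")
```

A question split could place the two "capital" questions on opposite sides, since the questions themselves differ; a query split cannot, because both share one anonymized template.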

Results and Analysis

Table 2 shows the results from SeqMix and the relevant baselines. On all datasets, SeqMix consistently improves over SwitchOut and word dropout (WordDrop). For machine translation, SeqMix achieves a gain of around 1 BLEU on IWSLT over strong baselines, and these gains persist on WMT, which is an order of magnitude larger. On SCAN and semantic parsing, SeqMix does not perform as well as GECA on its own but does well when combined with GECA.

We perform further analysis on the SCAN dataset, which is explicitly designed to test for compositional generalization. Table 2 shows that without GECA, the baseline seq2seq model and other regularization methods such as WordDrop and SwitchOut completely fail on the jump split, while SeqMix can achieve 49% accuracy. Similarly, SeqMix can boost the performance on the turn-left split from $49\%$ to $99\%$ in contrast to SwitchOut and WordDrop.

The fact that SeqMix can improve over simple regularization methods (such as WordDrop) even without GECA indicates that despite its crudity, SeqMix is somewhat effective at biasing models to learn the appropriate compositional structure. However, these results on SCAN also highlight its limitations: SeqMix fails on the difficult around-right split, where the model has to learn to combine “around” with “right” even though they are not encountered together in training, and does not outperform more sophisticated data augmentation strategies such as GECA Andreas (2020).

In Table 3, we show a qualitative example in the jump split of the SCAN dataset. Recall that the jump split of SCAN is constructed to test the generalization of the primitive “jump” in novel contexts. Given train examples such as jump; walk; walk left; look after walk twice, the model demonstrates compositionality if it is able to correctly process test examples such as jump left; look after jump twice, i.e. generalize the understanding of isolated jump to unseen combinations with jump. As shown in Table 3, only SeqMix successfully exhibits this compositional generalization.

Conclusion

This paper presents SeqMix, a simple data augmentation strategy for sequence-to-sequence applications. Despite being a crude approximation to compositional phenomena in language, SeqMix proved effective on three different sequence-to-sequence tasks, including the challenging SCAN dataset, which is designed to test for compositional generalization. SeqMix is efficient and easy to implement, and as a secondary contribution, we provide a framework that unifies several data augmentation strategies for compositionality, which naturally suggests avenues for future research (e.g., a relaxed variant of GECA).

Acknowledgements

The authors would like to thank the anonymous reviewers, Yuntian Deng, Justin Chiu, Jiawei Zhou, Ishita Dasgupta and Xinya Du for their valuable feedback on the initial draft. AMR’s work is supported by CAREER 2037519 and NSF III 1901030.