SwitchOut: an Efficient Data Augmentation Algorithm for Neural Machine Translation

Xinyi Wang, Hieu Pham, Zihang Dai, Graham Neubig

Introduction and Related Work

Data augmentation algorithms generate extra data points from the empirically observed training set to train subsequent machine learning algorithms. While these extra data points may be of lower quality than those in the training set, their quantity and diversity have proven to benefit various learning algorithms (DeVries and Taylor, 2017; Amodei et al., 2016). In image processing, simple augmentation techniques such as flipping, cropping, or increasing and decreasing the contrast of the image are both widely utilized and highly effective (Huang et al., 2016; Zagoruyko and Komodakis, 2016).

However, it is nontrivial to find simple equivalences for NLP tasks like machine translation, because even slight modifications of sentences can result in significant changes in their semantics, or require corresponding changes in the translations in order to keep the data consistent. In fact, indiscriminate modifications of data in NMT can introduce noise that makes NMT systems brittle (Belinkov and Bisk, 2018).

Due to such difficulties, the literature on data augmentation for NMT is relatively scarce. To our knowledge, data augmentation techniques for NMT fall into two categories. The first category is based on back-translation (Sennrich et al., 2016b; Poncelas et al., 2018), which utilizes monolingual data to augment a parallel training corpus. While effective, back-translation is often vulnerable to errors in initial models, a common problem of self-training algorithms (Chapelle et al., 2009). The second category is based on word replacements. For instance, Fadaee et al. (2017) propose to replace words in the target sentences with rare words in the target vocabulary according to a language model, and then modify the aligned source words accordingly. While this method generates augmented data with relatively high quality, it requires several complicated preprocessing steps, and is only shown to be effective for low-resource datasets. Other generic word replacement methods include word dropout (Sennrich et al., 2016a; Gal and Ghahramani, 2016), which sets some word embeddings to $0$ uniformly at random, and Reward Augmented Maximum Likelihood (RAML; Norouzi et al. (2016)), whose implementation essentially replaces some words in the target sentences with other words from the target vocabulary.

In this paper, we derive an extremely simple and efficient data augmentation technique for NMT. First, we formulate the design of a data augmentation algorithm as an optimization problem, where we seek the data augmentation policy that maximizes an objective that encourages two desired properties: smoothness and diversity. This optimization problem has a tractable analytic solution, which describes a generic framework of which both word dropout and RAML are instances. Second, we interpret the aforementioned solution and propose a novel method: independently replacing words in both the source sentence and the target sentence by other words uniformly sampled from the source and the target vocabularies, respectively. Experiments show that this method, which we name SwitchOut, consistently improves over strong baselines on datasets of different scales, including the large-scale WMT 15 English-German dataset, and two medium-scale datasets: IWSLT 2016 German-English and IWSLT 2015 English-Vietnamese.

Method

We use uppercase letters, such as $X$ , $Y$ , etc., to denote random variables and lowercase letters such as $x$ , $y$ , etc., to denote the corresponding actual values. Additionally, since we will discuss a data augmentation algorithm, we will use a hat to denote augmented variables and their values, e.g. $\widehat{X}$ , $\widehat{Y}$ , $\widehat{x}$ , $\widehat{y}$ , etc. We will also use boldfaced characters, such as $\mathbf{p}$ , $\mathbf{q}$ , etc., to denote probability distributions.

2.2 Data Augmentation

We facilitate our discussion with a probabilistic framework that motivates data augmentation algorithms. With $X$ , $Y$ being the sequences of words in the source and target languages (e.g. in machine translation), the canonical MLE framework maximizes the objective

$$J_{\text{MLE}}(\theta)=\mathbb{E}_{x,y\sim\widehat{\mathbf{p}}(X,Y)}\left[\log\mathbf{p}_{\theta}(y|x)\right].$$

Here $\widehat{\mathbf{p}}(X,Y)$ is the empirical distribution over all training data pairs $(x,y)$ and $\mathbf{p}_{\theta}(y|x)$ is a parameterized distribution that we aim to learn, e.g. a neural network. A potential weakness of MLE is the mismatch between $\widehat{\mathbf{p}}(X,Y)$ and the true data distribution $\mathbf{p}(X,Y)$ . Specifically, $\widehat{\mathbf{p}}(X,Y)$ is usually a bootstrap distribution defined only on the observed training pairs, while $\mathbf{p}(X,Y)$ has a much larger support, i.e. the entire space of valid pairs. This issue can be dramatic when the empirical observations are insufficient to cover the data space.

In practice, data augmentation is often used to remedy this support discrepancy by supplying additional training pairs. Formally, let $\mathbf{q}(\widehat{X},\widehat{Y})$ be the augmented distribution defined on a larger support than the empirical distribution $\widehat{\mathbf{p}}(X,Y)$ . Then, MLE training with data augmentation maximizes

$$J_{\text{AUG}}(\theta)=\mathbb{E}_{\widehat{x},\widehat{y}\sim\mathbf{q}(\widehat{X},\widehat{Y})}\left[\log\mathbf{p}_{\theta}(\widehat{y}|\widehat{x})\right].$$

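In this work, we focus on a specific family of $\mathbf{q}$ , which depends on the empirical observations by

$$\mathbf{q}(\widehat{x},\widehat{y})=\mathbb{E}_{x,y\sim\widehat{\mathbf{p}}(X,Y)}\left[\mathbf{q}(\widehat{x},\widehat{y}|x,y)\right].$$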

This particular choice follows the intuition that an augmented pair $(\widehat{x},\widehat{y})$ that diverges too far from any observed data is more likely to be invalid and thus harmful for training. The reason will be more evident later.

2.3 Diverse and Smooth Augmentation

Certainly, not all $\mathbf{q}$ are equally good, and the more similar $\mathbf{q}$ is to $\mathbf{p}$ , the more desirable $\mathbf{q}$ will be. Unfortunately, we only have access to limited observations captured by $\widehat{\mathbf{p}}$ . Hence, in order to use $\mathbf{q}$ to bridge the gap between $\widehat{\mathbf{p}}$ and $\mathbf{p}$ , it is necessary to utilize some assumptions about $\mathbf{p}$ . Here, we exploit two highly generic assumptions, namely:

Diversity: $\mathbf{p}(X,Y)$ has a wider support set, which includes samples that are more diverse than those in the empirical observation set.

Smoothness: $\mathbf{p}(X,Y)$ is smooth, and similar $(x,y)$ pairs will have similar probabilities.

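To formalize both assumptions, let $s(\widehat{x},\widehat{y};x,y)$ be a similarity function that measures how similar an augmented pair $(\widehat{x},\widehat{y})$ is to an observed pair $(x,y)$ . For each observed pair, we seek the augmentation policy $\mathbf{q}(\widehat{X},\widehat{Y}|x,y)$ that maximizes

$$J(\mathbf{q};x,y)=\mathbb{E}_{\widehat{x},\widehat{y}\sim\mathbf{q}(\widehat{X},\widehat{Y}|x,y)}\left[s(\widehat{x},\widehat{y};x,y)\right]+\tau\,\mathbb{H}\left(\mathbf{q}(\widehat{X},\widehat{Y}|x,y)\right)\tag{1}$$

where $\mathbb{H}$ denotes the entropy and $\tau$ controls the strength of the diversity objective. The first term in (1) instantiates the smoothness assumption, which encourages $\mathbf{q}$ to draw samples that are similar to $(x,y)$ . Meanwhile, the second term in (1) encourages more diverse samples from $\mathbf{q}$ . Together, the objective $J(\mathbf{q};x,y)$ extends the information in the “pivotal” empirical sample $(x,y)$ to a diverse set of similar cases. This echoes our particular parameterization of $\mathbf{q}$ in Section 2.2.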

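The objective $J(\mathbf{q};x,y)$ in (1) is the canonical maximum entropy problem that one often encounters in deriving a max-ent model (Berger et al., 1996), which has the analytic solution:

$$\mathbf{q}^{*}(\widehat{x},\widehat{y}|x,y)=\frac{\exp\left\{s(\widehat{x},\widehat{y};x,y)/\tau\right\}}{\sum_{\widehat{x}^{\prime},\widehat{y}^{\prime}}\exp\left\{s(\widehat{x}^{\prime},\widehat{y}^{\prime};x,y)/\tau\right\}}\tag{2}$$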

Note that (2) is a fairly generic solution which is agnostic to the choice of the similarity measure $s$ . Obviously, not all similarity measures are equally good. Next, we will show that some existing algorithms can be seen as specific instantiations under our framework. Moreover, this leads us to propose a novel and effective data augmentation algorithm.
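As a sanity check on the max-ent solution, the following minimal sketch evaluates the objective on a small finite candidate set and compares the softmax policy against randomly drawn distributions. The similarity scores, the temperature, and the function names are illustrative assumptions, not values from our experiments.

```python
import math
import random


def maxent_policy(scores, tau):
    """Return q*(i) proportional to exp(scores[i] / tau), i.e. a softmax."""
    m = max(scores)  # subtract the max for numerical stability
    weights = [math.exp((s - m) / tau) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]


def objective(q, scores, tau):
    """J(q) = E_q[s] + tau * H(q), using the convention 0 * log 0 = 0."""
    expected_similarity = sum(p * s for p, s in zip(q, scores))
    entropy = -sum(p * math.log(p) for p in q if p > 0.0)
    return expected_similarity + tau * entropy


# Hypothetical similarity scores for four candidate augmented pairs.
scores = [0.0, -1.0, -1.0, -2.0]
tau = 0.5
q_star = maxent_policy(scores, tau)
j_star = objective(q_star, scores, tau)

# No randomly drawn distribution should beat the closed-form solution.
rng = random.Random(0)
for _ in range(1000):
    raw = [rng.random() for _ in scores]
    q = [r / sum(raw) for r in raw]
    assert objective(q, scores, tau) <= j_star + 1e-12
```

At the optimum the objective equals $\tau\log Z$ , where $Z$ is the normalizer of the softmax, which gives a quick way to check an implementation.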

2.4 Existing and New Algorithms

Word Dropout.

In the context of machine translation, Sennrich et al. (2016a) propose to randomly choose some words in the source and/or target sentence, and set their embeddings to zero vectors. Intuitively, word dropout regards every new data pair generated by this procedure as similar enough and includes it in the augmented training set. Formally, word dropout can be seen as an instantiation of our framework with a particular similarity function $s(\hat{x},\hat{y};x,y)$ (see Appendix A.1).
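A minimal sketch of word dropout at the token level is given below. It assumes a placeholder token (here the hypothetical string `<null>`) whose embedding is fixed to the zero vector, so replacing a word by it is equivalent to zeroing that word's embedding; the function name and signature are illustrative only.

```python
import random

# Hypothetical placeholder token; its embedding is assumed to be the zero vector.
NULL = "<null>"


def word_dropout(sentence, drop_prob, rng):
    """Independently replace each token by NULL with probability drop_prob."""
    return [NULL if rng.random() < drop_prob else word for word in sentence]


rng = random.Random(0)
augmented = word_dropout(["we", "like", "green", "tea"], 0.1, rng)
```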

RAML.

From the perspective of reinforcement learning, Norouzi et al. (2016) propose to train the model distribution to match a target distribution proportional to an exponentiated reward. Despite the difference in motivation, it can be shown (c.f. Appendix A.2) that RAML can be viewed as an instantiation of our generic framework, where the similarity measure is $s(\widehat{x},\widehat{y};x,y)=r(\widehat{y};y)$ if $\widehat{x}=x$ and $-\infty$ otherwise. Here, $r$ is a task-specific reward function which measures the similarity between $\widehat{y}$ and $y$ . Intuitively, this means that RAML only exploits the smoothness property on the target side while keeping the source side intact.
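The RAML similarity measure described above can be written down directly. In this sketch the reward defaults to the negative Hamming distance purely as a stand-in for the task-specific reward $r$ ; the function names are assumptions made for illustration.

```python
def neg_hamming(candidate, reference):
    """Negative Hamming distance between two equal-length token sequences."""
    if len(candidate) != len(reference):
        raise ValueError("Hamming distance needs sequences of equal length")
    return -sum(c != r for c, r in zip(candidate, reference))


def raml_similarity(x_hat, y_hat, x, y, reward=neg_hamming):
    """s(x_hat, y_hat; x, y) = r(y_hat; y) if x_hat == x, and -inf otherwise."""
    if list(x_hat) != list(x):
        return float("-inf")  # any change to the source gets zero probability
    return reward(y_hat, y)
```

Because $\exp\{-\infty\}=0$ , every pair with a modified source sentence receives zero probability under the resulting policy, which is why only the target side is augmented.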

SwitchOut.

After reviewing the two existing augmentation schemes, there are two immediate insights. Firstly, augmentation should not be restricted to only the source side or the target side. Secondly, being able to incorporate prior knowledge, such as the task-specific reward function $r$ in RAML, can lead to a better similarity measure.

Motivated by these observations, we propose to perform augmentation in both source and target domains. For simplicity, we separately measure the similarity between the pair $(\widehat{x},x)$ and the pair $(\widehat{y},y)$ and then sum them together, i.e.

$$s(\widehat{x},\widehat{y};x,y)/\tau\approx r_{x}(\widehat{x},x)/\tau_{x}+r_{y}(\widehat{y},y)/\tau_{y}\tag{3}$$

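where $r_{x}$ and $r_{y}$ are domain-specific similarity functions and $\tau_{x}$ , $\tau_{y}$ are hyper-parameters that absorb the temperature parameter $\tau$ . This allows us to factor $\mathbf{q}^{*}(\widehat{x},\widehat{y}|x,y)$ into:

$$\mathbf{q}^{*}(\widehat{x},\widehat{y}|x,y)\approx\frac{\exp\left\{r_{x}(\widehat{x},x)/\tau_{x}\right\}}{\sum_{\widehat{x}^{\prime}}\exp\left\{r_{x}(\widehat{x}^{\prime},x)/\tau_{x}\right\}}\times\frac{\exp\left\{r_{y}(\widehat{y},y)/\tau_{y}\right\}}{\sum_{\widehat{y}^{\prime}}\exp\left\{r_{y}(\widehat{y}^{\prime},y)/\tau_{y}\right\}}\tag{4}$$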

In addition, notice that this factored formulation allows $\widehat{x}$ and $\widehat{y}$ to be sampled independently.
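The factorization can be checked numerically on a toy problem. The sketch below enumerates all candidate sentences over two tiny, made-up vocabularies, uses the negative Hamming distance for both similarity functions (an assumption made here for concreteness), and compares the joint softmax with the product of the two per-side softmaxes.

```python
import itertools
import math


def neg_hamming(a, b):
    """Negative Hamming distance between two equal-length sequences."""
    return -sum(u != v for u, v in zip(a, b))


def softmax_over(candidates, score, temperature):
    """Distribution over candidates proportional to exp(score(c) / temperature)."""
    weights = {c: math.exp(score(c) / temperature) for c in candidates}
    z = sum(weights.values())
    return {c: w / z for c, w in weights.items()}


# Toy vocabularies, sentences, and temperatures (all hypothetical).
vocab_x, vocab_y = ["a", "b", "c"], ["u", "v"]
x, y = ("a", "b"), ("u", "v", "u")
tau_x, tau_y = 0.8, 1.5

xs = list(itertools.product(vocab_x, repeat=len(x)))
ys = list(itertools.product(vocab_y, repeat=len(y)))
pairs = list(itertools.product(xs, ys))

q_x = softmax_over(xs, lambda c: neg_hamming(c, x), tau_x)
q_y = softmax_over(ys, lambda c: neg_hamming(c, y), tau_y)
# Joint policy with s / tau = r_x / tau_x + r_y / tau_y.
q_joint = softmax_over(
    pairs,
    lambda p: neg_hamming(p[0], x) / tau_x + neg_hamming(p[1], y) / tau_y,
    1.0,
)

# Largest absolute difference between the joint and the product of marginals.
max_gap = max(abs(q_joint[(a, b)] - q_x[a] * q_y[b]) for a, b in pairs)
```

When the similarity is exactly additive, as in this toy setup, the gap is zero up to floating-point error, so the source and target sides can be sampled separately.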

Sampling Procedure.

To complete our method, we still need to define $r_{x}$ and $r_{y}$ , and then design a practical sampling scheme from each factor in (4). Though non-trivial, both problems have been (partially) encountered in RAML (Norouzi et al., 2016; Ma et al., 2017). For simplicity, we follow previous work and use the negative Hamming distance for both $r_{x}$ and $r_{y}$ . For a more parallelized implementation, we sample an augmented sentence $\widehat{s}$ from a true sentence $s$ as follows:

1. Sample $\widehat{n}\in\{0,1,...,\left|s\right|\}$ by $\mathbf{p}(\widehat{n})\propto e^{-\widehat{n}/\tau}$ .

2. For each $i\in\{1,2,...,\left|s\right|\}$ , with probability $\widehat{n}/\left|s\right|$ , replace $s_{i}$ by a word $\widehat{s}_{i}\neq s_{i}$ sampled uniformly from the vocabulary.

This procedure guarantees that any two sentences $\widehat{s}_{1}$ and $\widehat{s}_{2}$ with the same Hamming distance to $s$ have the same probability. However, it slightly changes the relative odds of sentences with different Hamming distances to $s$ compared with the true distribution defined by the negative Hamming distance, and is thus an approximation of that distribution. In exchange, this efficient sampling procedure is much easier to implement while achieving good performance.

Algorithm 1 illustrates this sampling procedure, which can be applied independently and in parallel for each batch of source sentences and target sentences. Additionally, we open source our implementation in TensorFlow and in PyTorch (respectively in Appendix A.5 and A.6).
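For readers who want a framework-free illustration, the two sampling steps can be sketched in plain Python as follows. This is a per-sentence sketch under stated assumptions, not the batched implementation referred to above: the function names are made up, the vocabulary is a plain list with at least two distinct words, and a temperature of zero is treated as "no augmentation".

```python
import math
import random


def sample_num_replacements(length, tau, rng):
    """Step 1: draw n_hat in {0, ..., length} with p(n_hat) proportional to exp(-n_hat / tau)."""
    counts = list(range(length + 1))
    weights = [math.exp(-n / tau) for n in counts]
    return rng.choices(counts, weights=weights, k=1)[0]


def switchout(sentence, vocab, tau, rng):
    """Step 2: replace each word with probability n_hat / |s| by a different uniform word."""
    length = len(sentence)
    if length == 0 or tau <= 0:
        # Assumption: tau <= 0 means the sentence is left untouched.
        return list(sentence)
    n_hat = sample_num_replacements(length, tau, rng)
    replace_prob = n_hat / length
    augmented = []
    for word in sentence:
        if rng.random() < replace_prob:
            # Uniform over the vocabulary minus the current word. Building this
            # list is O(|V|) per replacement, which is fine for a sketch only.
            candidates = [v for v in vocab if v != word]
            augmented.append(rng.choice(candidates))
        else:
            augmented.append(word)
    return augmented


rng = random.Random(0)
vocab = ["the", "a", "cat", "dog", "sat", "ran"]
augmented = switchout(["the", "cat", "sat"], vocab, 1.0, rng)
```

The same function would be called once on the source sentence with $\tau_{x}$ and once on the target sentence with $\tau_{y}$ , since the two sides are sampled independently.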

Experiments

We benchmark SwitchOut on three translation tasks of different scales: 1) IWSLT 2015 English-Vietnamese (en-vi); 2) IWSLT 2016 German-English (de-en); and 3) WMT 2015 English-German (en-de). All translations are word-based. These tasks and pre-processing steps are standard, used in several previous works. Detailed statistics and pre-processing schemes are in Appendix A.3.

Models and Experimental Procedures.

Our translation model, i.e. $\mathbf{p}_{\theta}(y|x)$ , is a Transformer network (Vaswani et al., 2017). For each dataset, we first train a standard Transformer model without SwitchOut and tune the hyper-parameters on the dev set to achieve competitive results (w.r.t. Luong and Manning (2015); Gu et al. (2018); Vaswani et al. (2017)). Then, fixing all hyper-parameters, and fixing $\tau_{y}=0$ , we tune the $\tau_{x}$ rate, which controls how far we are willing to let $\widehat{x}$ deviate from $x$ . Our hyper-parameters are listed in Appendix A.4.

Baselines.

While the Transformer network without SwitchOut is already a strong baseline, we also compare SwitchOut against two other baselines that further use existing varieties of data augmentation: 1) word dropout on the source side with the dropping probability of $\lambda_{\text{word}}=0.1$ ; and 2) RAML on the target side, as in Section 2.4. Additionally, on the en-de task, we compare SwitchOut against back-translation (Sennrich et al., 2016b).

SwitchOut vs. Word Dropout and RAML.

We report the BLEU scores of SwitchOut, word dropout, and RAML on the test sets of the tasks in Table 1. To account for variance, we run each experiment multiple times and report the median BLEU. Specifically, each experiment without SwitchOut is run $4$ times, while each experiment with SwitchOut is run $9$ times due to its inherently higher variance. We also conduct pairwise statistical significance tests using paired bootstrap (Clark et al., 2011), and record the results in Table 1. For 4 of the 6 settings, SwitchOut delivers significant improvements over the best baseline without SwitchOut. For the remaining two settings, the differences are not statistically significant. The gains in BLEU with SwitchOut over the best baseline on WMT 15 en-de are all significant ( $p<0.0002$ ). Notably, SwitchOut on the source side demonstrates gains as large as those obtained by RAML on the target side, and SwitchOut delivers further improvements when combined with RAML.
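To illustrate the idea behind a paired bootstrap test, the sketch below resamples test-sentence indices with replacement and counts how often one system outscores the other on the same resampled set. It is a generic simplification and not necessarily the exact protocol of the cited work: it sums hypothetical per-sentence scores, whereas BLEU is a corpus-level metric that would have to be recomputed on each resample.

```python
import random


def paired_bootstrap(scores_a, scores_b, num_resamples, rng):
    """Fraction of resampled test sets on which system A outscores system B."""
    if len(scores_a) != len(scores_b):
        raise ValueError("both systems must be scored on the same sentences")
    n = len(scores_a)
    wins = 0
    for _ in range(num_resamples):
        # Draw n sentence indices with replacement; use them for BOTH systems.
        idx = [rng.randrange(n) for _ in range(n)]
        total_a = sum(scores_a[i] for i in idx)
        total_b = sum(scores_b[i] for i in idx)
        if total_a > total_b:
            wins += 1
    return wins / num_resamples
```

A win rate close to 1 suggests that the advantage of system A is unlikely to be an artifact of the particular test sentences.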

SwitchOut vs. Back Translation.

Traditionally, data augmentation is viewed as a method to enlarge the training datasets (Krizhevsky et al., 2012; Szegedy et al., 2014). In the context of neural MT, Sennrich et al. (2016b) propose to use artificial data generated from a weak back-translation model, effectively utilizing monolingual data to enlarge the bilingual training datasets. In connection, we compare SwitchOut against back translation. We only compare SwitchOut against back translation on the en-de task, where the amount of bilingual training data is already sufficiently large. (We add the extra monolingual data from http://data.statmt.org/rsennrich/wmt16_backtranslations/en-de/.) The BLEU scores with back-translation are reported in Table 2. These results provide two insights. First, the gain delivered by back translation is less significant than the gain delivered by SwitchOut. Second, SwitchOut and back translation are not mutually exclusive, as one can additionally apply SwitchOut on the additional data obtained from back translation to further improve BLEU scores.

Effects of $\tau_{x}$ and $\tau_{y}$ .

We empirically study the effect of the temperature parameters $\tau_{x}$ and $\tau_{y}$ . During the tuning process, we translate the dev set of the tasks and report the BLEU scores in Figure 1. We observe that when fixing $\tau_{y}$ , the best performance is always achieved with a non-zero $\tau_{x}$ .

Where does SwitchOut Help the Most?

Intuitively, because SwitchOut is expanding the support of the training distribution, we would expect that it would help the most on test sentences that are far from those in the training set and would thus benefit most from this expanded support. To test this hypothesis, for each test sentence we find its most similar training sample (i.e. nearest neighbor), then bucket the instances by the distance to their nearest neighbor and measure the gain in BLEU afforded by SwitchOut for each bucket. Specifically, we use (negative) word error rate (WER) as the similarity measure, and plot the bucket-by-bucket performance gain for each group in Figure 2. As we can see, SwitchOut improves increasingly more as the WER increases, indicating that SwitchOut is indeed helping on examples that are far from the sentences that the model sees during training. This is the desirable effect of data augmentation techniques.
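The bucketing analysis can be sketched as follows. The code computes a word-level edit distance, finds each test sentence's distance to its nearest training sentence, and groups test sentences into buckets. Normalizing the edit distance by the test sentence length and the particular bucket edges are assumptions made for this sketch; the brute-force nearest-neighbor search is only practical for small corpora.

```python
import bisect


def word_edit_distance(a, b):
    """Word-level Levenshtein distance between token lists a and b."""
    prev = list(range(len(b) + 1))
    for i, word_a in enumerate(a, start=1):
        curr = [i]
        for j, word_b in enumerate(b, start=1):
            substitution = prev[j - 1] + (0 if word_a == word_b else 1)
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, substitution))
        prev = curr
    return prev[-1]


def nearest_neighbor_wer(test_sentence, train_sentences):
    """Smallest WER between a test sentence and any training sentence.

    Assumption: the edit distance is normalized by the test sentence length.
    """
    norm = max(len(test_sentence), 1)
    best = min(word_edit_distance(test_sentence, t) for t in train_sentences)
    return best / norm


def bucket_by_distance(test_sentences, train_sentences, edges):
    """Map bucket index -> indices of test sentences whose nearest WER falls in it."""
    buckets = {}
    for idx, sentence in enumerate(test_sentences):
        distance = nearest_neighbor_wer(sentence, train_sentences)
        buckets.setdefault(bisect.bisect_right(edges, distance), []).append(idx)
    return buckets
```

Given such buckets, the BLEU gain would then be computed separately on the test sentences of each bucket.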

Conclusion

In this paper, we propose a method to design data augmentation algorithms by solving an optimization problem. These solutions subsume a few existing augmentation schemes and inspire a novel augmentation method, SwitchOut. SwitchOut delivers improvements over translation tasks at different scales. Additionally, SwitchOut is efficient and easy to implement, and thus has the potential for wide application.

Acknowledgements

We thank Quoc Le, Minh-Thang Luong, Qizhe Xie, and the anonymous EMNLP reviewers, for their suggestions to improve the paper.

This material is based upon work supported in part by the Defense Advanced Research Projects Agency Information Innovation Office (I2O) Low Resource Languages for Emergent Incidents (LORELEI) program under Contract No. HR0011-15-C0114. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation here on.

References

Appendix A

A.1 Word Dropout as a Special Case

Here, we derive word dropout as an instance of our framework. First, let us introduce a new token, $\langle\text{null}\rangle$ , into both the source vocabulary and the target vocabulary. $\langle\text{null}\rangle$ has the embedding of an all-zero vector and is never trained. For a sequence $x$ of words in a vocabulary with $\langle\text{null}\rangle$ , we define the neighborhood $N(x)$ to be:

$$N(x)=\left\{\widehat{x}:\left|\widehat{x}\right|=\left|x\right|\text{ and }\widehat{x}_{i}\in\{x_{i},\langle\text{null}\rangle\}\text{ for all }i\in\{1,2,...,\left|x\right|\}\right\}$$

In other words, $N(x)$ consists of $x$ and all the sentences obtained by replacing a few words in $x$ by $\langle\text{null}\rangle$ . Clearly, all augmented sentences $\widehat{x}$ that are sampled from $x$ using word dropout fall into $N(x)$ .

In (4), the augmentation policy $\mathbf{q}^{*}(\widehat{x},\widehat{y}|x,y)$ was decomposed into two independent terms, one of which samples the augmented source sentence $\widehat{x}$ and the other samples the augmented target sentence $\widehat{y}$ .

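Word dropout is an instance of this decomposition, where $r_{y}$ takes the same form as $r_{x}$ , given by:

$$r_{x}(\widehat{x},x)=\begin{cases}-\text{HammingDistance}(\widehat{x},x)&\text{if }\widehat{x}\in N(x)\\-\infty&\text{otherwise}\end{cases}$$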

where $\text{HammingDistance}(\widehat{x},x)=\sum_{i=1}^{\left|x\right|}\mathbf{1}[\widehat{x}_{i}\neq x_{i}]$ . To see that this is indeed the case, let $h$ be the Hamming distance for $\widehat{x}\in N(x)$ and set $\lambda_{\text{word}}=\exp{\left\{-1/\tau_{x}\right\}}$ , then we have:

$$\exp\left\{r_{x}(\widehat{x},x)/\tau_{x}\right\}=\exp\left\{-h/\tau_{x}\right\}=\lambda_{\text{word}}^{h}$$

which is precisely the probability of dropping out $h$ words in $x$ , where each word is dropped with the distribution $\text{Bernoulli}(\lambda_{\text{word}})$ .
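The correspondence $\lambda_{\text{word}}=\exp\{-1/\tau_{x}\}$ can be made concrete with a few lines of code. The helper names below are hypothetical and the sketch only restates the mapping above; for example, the dropping probability $\lambda_{\text{word}}=0.1$ used for the word dropout baseline corresponds to $\tau_{x}=1/\ln 10\approx 0.434$ .

```python
import math


def tau_from_drop_prob(drop_prob):
    """Temperature tau_x such that exp(-1 / tau_x) equals drop_prob."""
    if not 0.0 < drop_prob < 1.0:
        raise ValueError("drop_prob must lie strictly between 0 and 1")
    return -1.0 / math.log(drop_prob)


def drop_prob_from_tau(tau):
    """Inverse mapping: lambda_word = exp(-1 / tau_x), for tau > 0."""
    return math.exp(-1.0 / tau)


def unnormalized_weight(hamming_distance, tau):
    """exp(-h / tau_x), which equals lambda_word ** h."""
    return math.exp(-hamming_distance / tau)


tau_x = tau_from_drop_prob(0.1)
```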

The difference between word dropout and SwitchOut comes from the fact that $N(x)$ is much smaller than the support of $\widehat{x}$ that SwitchOut can sample from, which is $V^{\left|x\right|}$ where $V$ is the vocabulary. Word dropout concentrates all augmentation probability mass into $N(x)$ while SwitchOut spreads the mass into a larger support, leading to a larger entropy. Meanwhile, both word dropout and SwitchOut are exponentially less likely to diverge away from $x$ , ensuring the smoothness desideratum of a good data augmentation policy, as we discussed in Section 2.3.

A.2 RAML as a Special Case

Here, we present a detailed description of how RAML is a special case of our proposed framework. For each empirical observation $(x,y)\sim\widehat{\mathbf{p}}$ , RAML defines a reward-aware target distribution $\mathbf{p}_{\text{RAML}}(Y|x,y)$ for the model distribution $\mathbf{p}_{\theta}(Y\mid x)$ to match. Concretely, the target distribution in RAML has the form

$$\mathbf{p}_{\text{RAML}}(\widehat{y}|x,y)=\frac{\exp\left\{r(\widehat{y};y)/\tau\right\}}{\sum_{\widehat{y}^{\prime}}\exp\left\{r(\widehat{y}^{\prime};y)/\tau\right\}}$$

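where $r$ is the task reward function. With this definition, RAML amounts to minimizing the expected KL divergence between $\mathbf{p}_{\text{RAML}}$ and $\mathbf{p}_{\theta}$ , i.e.

$$\min_{\theta}\ \mathbb{E}_{x,y\sim\widehat{\mathbf{p}}(X,Y)}\left[\text{KL}\left(\mathbf{p}_{\text{RAML}}(Y|x,y)\,\|\,\mathbf{p}_{\theta}(Y|x)\right)\right]\iff\max_{\theta}\ \mathbb{E}_{x,y\sim\widehat{\mathbf{p}}(X,Y)}\left[\mathbb{E}_{\widehat{y}\sim\mathbf{p}_{\text{RAML}}(Y|x,y)}\left[\log\mathbf{p}_{\theta}(\widehat{y}|x)\right]\right]$$

The last expression reveals an immediate connection between RAML and our proposed framework: it is an augmented MLE objective in which only the target sentence is resampled. In summary, RAML can be seen as a special case of our data augmentation framework, where the similarity function is defined by (7). Practically, this means RAML only considers pairs with source sentences from the empirical set for data augmentation.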

A.3 Dataset Descriptions

Table 3 summarizes the statistics of the datasets in our experiments. The WMT 15 en-de dataset is one order of magnitude larger than the IWSLT 16 de-en dataset and the IWSLT 15 en-vi dataset. For the en-vi task, we use the data pre-processed by Luong and Manning (2015). For the en-de task, we use the data pre-processed by Luong et al. (2015), with newstest2014 for validation and newstest2015 for testing. For the de-en task, we use the data pre-processed by Ranzato et al. (2016).

A.4 Hyper-parameters

The hyper-parameters used in our experiments are listed in Table 4. All models are initialized uniformly at random in the range reported in Table 4. All models are trained with Adam (Kingma and Ba, 2015). Gradients are clipped at the threshold specified in Table 4. For the WMT en-de task, we use the legacy learning rate schedule as specified by Vaswani et al. (2017). For the de-en task and the en-vi task, the learning rate is initially $0.001$ and is decreased by a factor of $0.97$ every $1000$ steps, starting at step $8000$ . All models are trained for 100,000 steps, during which one checkpoint is saved every $2500$ steps, and the final evaluation is performed on the checkpoint with the lowest perplexity on the dev set.
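The step-wise decay used for the de-en and en-vi tasks can be sketched as a small function. One detail is an assumption of this sketch: the first decay is applied at step $8000$ itself, so the rate at step $8000$ is already $0.001\times 0.97$ ; the description above would also be compatible with the first decay happening one interval later.

```python
def learning_rate(step, base_lr=0.001, decay=0.97, decay_every=1000, decay_start=8000):
    """Step-wise exponential decay that begins at decay_start.

    Assumption: the first decay is applied at decay_start itself.
    """
    if step < decay_start:
        return base_lr
    num_decays = (step - decay_start) // decay_every + 1
    return base_lr * decay ** num_decays
```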

Multiple GPUs are used for each experiment. For the de-en and the en-vi experiments, if we use $n$ GPUs, where $n\in\{1,2,4\}$ , then we only perform $10^{5}/n$ updates to the models’ parameters. We find that this is sufficient to make the models converge.