SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition

Daniel S. Park, William Chan, Yu Zhang, Chung-Cheng Chiu, Barret Zoph, Ekin D. Cubuk, Quoc V. Le

Introduction

Deep Learning has been applied successfully to Automatic Speech Recognition (ASR) , where the main focus of research has been designing better network architectures, for example, DNNs , CNNs , RNNs and end-to-end models . However, these models tend to overfit easily and require large amounts of training data .

Data augmentation has been proposed as a method to generate additional training data for ASR. For example, in , artificial data was augmented for low resource speech recognition tasks. Vocal Tract Length Normalization has been adapted for data augmentation in . Noisy audio has been synthesised via superimposing clean audio with a noisy audio signal in . Speed perturbation has been applied on raw audio for LVSCR tasks in . The use of an acoustic room simulator has been explored in . Data augmentation for keyword spotting has been studied in . Feature drop-outs have been employed for training multi-stream ASR systems . More generally, learned augmentation techniques have explored different sequences of augmentation transformations that have achieved state-of-the-art performance in the image domain .

Inspired by the recent success of augmentation in the speech and vision domains, we propose SpecAugment, an augmentation method that operates on the log mel spectrogram of the input audio, rather than the raw audio itself. This method is simple and computationally cheap to apply, as it directly acts on the log mel spectrogram as if it were an image, and does not require any additional data. We are thus able to apply SpecAugment online during training. SpecAugment consists of three kinds of deformations of the log mel spectrogram. The first is time warping, a deformation of the time-series in the time direction. The other two augmentations, inspired by “Cutout”, proposed in computer vision , are time and frequency masking, where we mask a block of consecutive time steps or mel frequency channels.

This approach while rudimentary, is remarkably effective and allows us to train end-to-end ASR networks, called Listen Attend and Spell (LAS) , to surpass more complicated hybrid systems, and achieve state-of-the-art results even without the use of Language Models (LMs). On LibriSpeech , we achieve 2.8% Word Error Rate (WER) on the test-clean set and 6.8% WER on the test-other set, without the use of an LM. Upon shallow fusion with an LM trained on the LibriSpeech LM corpus, we are able to better our performance (2.5% WER on test-clean and 5.8% WER on test-other), improving the current state of the art on test-other by 22% relatively. On Switchboard 300h (LDC97S62) , we obtain 7.2% WER on the Switchboard portion of the Hub5’00 (LDC2002S09, LDC2003T02) test set, and 14.6% on the CallHome portion, without using an LM. Upon shallow fusion with an LM trained on the combined transcript of the Switchboard and Fisher (LDC200{4,5}T19) corpora, we obtain 6.8%/14.1% WER on the Switchboard/Callhome portion.

Augmentation Policy

We aim to construct an augmentation policy that acts on the log mel spectrogram directly, which helps the network learn useful features. Motivated by the goal that these features should be robust to deformations in the time direction, partial loss of frequency information and partial loss of small segments of speech, we have chosen the following deformations to make up a policy:

Time warping is applied via the function sparse_image_warp of tensorflow. Given a log mel spectrogram with $\tau$ time steps, we view it as an image where the time axis is horizontal and the frequency axis is vertical. A random point along the horizontal line passing through the center of the image within the time steps $(W,\tau-W)$ is to be warped either to the left or right by a distance $w$ chosen from a uniform distribution from 0 to the time warp parameter $W$ along that line. We fix six anchor points on the boundary—the four corners and the mid-points of the vertical edges.

Frequency masking is applied so that $f$ consecutive mel frequency channels $[f_{0},f_{0}+f)$ are masked, where $f$ is first chosen from a uniform distribution from 0 to the frequency mask parameter $F$ , and $f_{0}$ is chosen from $[0,\nu-f)$ . $\nu$ is the number of mel frequency channels.

Time masking is applied so that $t$ consecutive time steps $[t_{0},t_{0}+t)$ are masked, where $t$ is first chosen from a uniform distribution from 0 to the time mask parameter $T$ , and $t_{0}$ is chosen from $[0,\tau-t)$ .

We introduce an upper bound on the time mask so that a time mask cannot be wider than $p$ times the number of time steps.

Figure 1 shows examples of the individual augmentations applied to a single input. The log mel spectrograms are normalized to have zero mean value, and thus setting the masked value to zero is equivalent to setting it to the mean value.

We can consider policies where multiple frequency and time masks are applied. The multiple masks may overlap. In this work, we mainly consider a series of hand-crafted policies, LibriSpeech basic (LB), LibriSpeech double (LD), Switchboard mild (SM) and Switchboard strong (SS) whose parameters are summarized in Table 1. In Figure 2, we show an example of a log mel spectrogram augmented with policies LB and LD.

Model

We use Listen, Attend and Spell (LAS) networks for our ASR tasks. These models, being end-to-end, are simple to train and have the added benefit of having well-documented benchmarks that we are able to build upon to get our results. In this section, we review LAS networks and introduce some notation to parameterize them. We also introduce the learning rate schedules we use to train the networks, as they turn out to be an important factor in determining performance. We end with reviewing shallow fusion , which we have used to incorporate language models for further gains in performance.

We use Listen, Attend and Spell (LAS) networks for end-to-end ASR studied in , for which we use the notation LAS- $d$ - $w$ . The input log mel spectrogram is passed in to a 2-layer Convolutional Neural Network (CNN) with max-pooling and stride of $2$ . The output of the CNN is passes through an encoder consisting of $d$ stacked bi-directional LSTMs with cell size $w$ to yield a series of attention vectors. The attention vectors are fed into a 2-layer RNN decoder of cell dimension $w$ , which yields the tokens for the transcript. The text is tokenized using a Word Piece Model (WPM) of vocabulary size 16k for LibriSpeech and 1k for Switchboard. The WPM for LibriSpeech 960h is constructed using the training set transcripts. For the Switchboard 300h task, transcripts from the training set are combined with those of the Fisher corpus to construct the WPM. The final transcripts are obtained by a beam search with beam size 8. For comparison with , we note that their “large model” in our notation is LAS-4-1024.

2 Learning Rate Schedules

The learning rate schedule turns out to be an important factor in determining the performance of ASR networks, especially so when augmentation is present. Here, we introduce training schedules that serve two purposes. First, we use these schedules to verify that a longer schedule improves the final performance of the network, even more so with augmentation (Table 2). Second, based on this, we introduce very long schedules that are used to maximize the performance of the networks.

We use a learning rate schedule in which we ramp-up, hold, then exponentially decay the learning rate until it reaches $\nicefrac{{1}}{{100}}$ of its maximum value. The learning rate is kept constant beyond this point. This schedule is parameterized by three time stamps $(s_{r},s_{i},s_{f})$ – the step $s_{r}$ where the ramp-up (from zero learning rate) is complete, the step $s_{i}$ where exponential decay starts, and the step $s_{f}$ where the exponential decay stops.

There are two more factors that introduce time scales in our experiment. First, we turn on a variational weight noise of standard deviation 0.075 at step $s_{\text{noise}}$ and keep it constant throughout training. Weight noise is introduced in the step interval $(s_{r},s_{i})$ , i.e., during the high plateau of the learning rate.

Second, we introduce uniform label smoothing with uncertainty 0.1, i.e., the correct class label is assigned the confidence 0.9, while the confidence of the other labels are increased accordingly. As is commented on again later on, label smoothing can destabilize training for smaller learning rates, and we sometimes choose to turn it on only at the beginning of training and off when the learning rate starts to decay.

The two basic schedules we use, are given as the following:

B(asic): $(s_{r},s_{\text{noise}},s_{i},s_{f})$ = (0.5k, 10k, 20k, 80k)

D(ouble): $(s_{r},s_{\text{noise}},s_{i},s_{f})$ = (1k, 20k, 40k, 160k)

As discussed further in section 5, we can improve the performance of the trained network by using a longer schedule. We thus introduce the following schedule:

L(ong): $(s_{r},s_{\text{noise}},s_{i},s_{f})$ = (1k, 20k, 140k, 320k)

which we use to train the largest model to improve performance. When using schedule L, label smoothing with uncertainty $0.1$ is introduced for time steps $<s_{i}=\text{140k}$ for LibriSpeech 960h, and is subsequently turned off. For Switchboard 300h, label smoothing is turned on throughout training.

3 Shallow Fusion with Language Models

While we are able to get state-of-the-art results with augmentation, we can get further improvements by using a language model. We thus incorporate an RNN language model by shallow fusion for both tasks. In shallow fusion, the “next token” $\mathbf{y}^{*}$ in the decoding process is determined by

i.e., by jointly scoring the token using the base ASR model and the language model. We also use a coverage penalty $c$ .

For LibriSpeech, we use a two-layer RNN with embedding dimension 1024 used in for the LM, which is trained on the LibriSpeech LM corpus. We use identical fusion parameters ( $\lambda=0.35$ and $c=0.05$ ) used in throughout.

For Switchboard, we use a two-layer RNN with embedding dimension 256, which is trained on the combined transcripts of the Fisher and Switchboard datasets. We find the fusion parameters via grid search by measuring performance on RT-03 (LDC2007S10). We discuss the fusion parameters used in individual experiments in section 4.2.

Experiments

In this section, we describe our experiments on LibriSpeech and Switchboard with SpecAugment. We report state-of-the-art results that out-perform heavily engineered hybrid systems.

For LibriSpeech, we use the same setup as , where we use 80-dimensional filter banks with delta and delta-delta acceleration, and a 16k word piece model .

The three networks LAS-4-1024, LAS-6-1024 and LAS-6-1280 are trained on LibriSpeech 960h with a combination of augmentation policies (None, LB, LD) and training schedules (B/D). Label smoothing was not applied in these experiments. The experiments were run with peak learning rate of $0.001$ and batch size of 512, on 32 Google Cloud TPU chips for 7 days. Other than the augmentation policies and learning rate schedules, all other hyperparameters were fixed, and no additional tuning was applied. We report test set numbers validated by the dev-other set in Table 2. We see that augmentation consistently improves the performance of the trained network, and that the benefit of a larger network and a longer learning rate schedule is more apparent with harsher augmentation.

We take the largest network, LAS-6-1280, and use schedule L (with training time $\sim$ 24 days) and policy LD to train the network to maximize performance. We turn label smoothing on for time steps $<140$ k as noted before. The test set performance is reported by evaluating the checkpoint with best dev-other performance. State of the art performance is achieved by the LAS-6-1280 model, even without a language model. We can incorporate an LM using shallow fusion to further improve performance. The results are presented in Table 3.

2 Switchboard 300h

For Switchboard 300h, we use the Kaldi “s5c” recipe to process our data, but we adapt the recipe to use 80-dimensional filter banks with delta and delta-delta acceleration. We use a 1k WPM to tokenize the output, constructed using the combined vocabulary of the Switchboard and Fisher transcripts.

We train LAS-4-1024 with policies (None, SM, SS) and schedule B. As before, we set the peak learning rate to $0.001$ and total batch size to 512, and train using 32 Google Cloud TPU chips. Here the experiments are run with and without label smoothing. Not having a canonical development set, we choose to evaluate the checkpoint at the end point of the training schedule, which we choose to be 100k steps for schedule B. We note that the training curve relaxes after the decay schedule is completed (step $s_{f}$ ), and the performance of the network does not vary much. The performance of various augmentation policies with and without label smoothing for Switchboard 300h is shown in Table 4. We see that label smoothing and augmentation have an additive effect for this corpus.

As with LibriSpeech 960h, we train LAS-6-1280 on the Switchboard 300h training set with schedule L (training time $\sim$ 24 days) to get state of the art performance. In this case, we find that turning label smoothing on throughout training benefits the final performance. We report the performance at the end of training time at 340k steps. We present our results in the context of other work in Table 5. We also apply shallow fusion with an LM trained on Fisher-Switchboard, whose fusion parameters are obtained by evaluating performance on the RT-03 corpus. Unlike the case for LibriSpeech, the fusion parameters do not transfer well between networks trained differently—the three entries in Table 5 were obtained by using fusion parameters $(\lambda,c)=$ (0.3, 0.05), (0.2, 0.0125) and (0.1, 0.025) respectively.

Discussion

Time warping contributes, but is not a major factor in improving performance. In Table 6, we present three training results for which time warping, time masking and frequency masking have been turned off, respectively. We see that the effect time warping, while small, is still existent. Time warping, being the most expensive as well as the least influential of the augmentations discussed in this work, should be the first augmentation to be dropped given any budgetary limitations.

Label smoothing introduces instability to training. We have noticed that the proportion of unstable training runs increases for LibriSpeech when label smoothing is applied with augmentation. This becomes more conspicuous while learning rate is being decayed, thus our introduction of a label smoothing schedule for training LibriSpeech, where labels are only smoothed in the initial phases of the learning rate schedule.

Augmentation converts an over-fitting problem into an under-fitting problem. As can be observed from the training curves of the networks in Figure 3, the networks during training not only under-fit the loss and WER on the augmented training set, but also on the training set itself when trained on augmented data. This is in stark contrast to the usual situation where networks tend to over-fit to the training data. This is the major benefit of training with augmentation, as explained below.

Common methods of addressing under-fitting yield improvements. We were able to make significant gains in performance by standard approaches to alleviate under-fitting—making larger networks and training longer. The current reported performance was obtained by the recursive process of applying a harsh augmentation policy, and then making wider, deeper networks and training them with longer schedules to address the under-fitting.

Remark on related works. We note that an augmentation similar to frequency masking has been studied in the context of CNN acoustic models in . There, blocks of adjacent frequencies are pre-grouped into bins, which are randomly zeroed-out per minibatch. On the other hand, both the size and position of the frequency masks in SpecAugment are chosen stochastically, and differ for every input in the minibatch. More ideas for structurally omitting frequency data of spectrograms have been discussed in .

Conclusions

SpecAugment greatly improves the performance of ASR networks. We are able to obtain state-of-the-art results on the LibriSpeech 960h and Switchboard 300h tasks on end-to-end LAS networks by augmenting the training set using simple hand-crafted policies, surpassing the performance of hybrid systems even without the aid of a language model. SpecAugment converts ASR from an over-fitting to an under-fitting problem, and we were able to gain performance by using bigger networks and training longer.

Acknowledgements: We would like to thank Yuan Cao, Ciprian Chelba, Kazuki Irie, Ye Jia, Anjuli Kannan, Patrick Nguyen, Vijay Peddinti, Rohit Prabhavalkar, Yonghui Wu and Shuyuan Zhang for useful discussions. We also thank György Kovács for introducing us to the works .