Back-Translation-Style Data Augmentation for End-to-End ASR

Tomoki Hayashi, Shinji Watanabe, Yu Zhang, Tomoki Toda, Takaaki Hori, Ramon Astudillo, Kazuya Takeda

INTRODUCTION

Automatic speech recognition (ASR) is the task of converting a continuous speech signal into a sequence of discrete characters, and is a key technology for the realization of natural interaction between humans and machines. ASR technology has great potential in various applications such as voice search and voice input, making our lives more convenient. Typical ASR systems consist of multiple modules such as an acoustic model, a lexicon model, and a language model. Dividing ASR systems into modules makes it possible to optimize each of them separately, but this also results in more complex systems and imposes performance limitations. Over the past few decades, this approach has been the basis of ASR systems.

With the improvement of deep learning techniques, end-to-end (E2E) approaches have begun to attract attention. While typical ASR systems convert a sequence of acoustic features into text step-by-step using several separately trained modules, E2E-ASR systems directly convert speech into text using a single neural network. Therefore, the whole E2E-ASR system can be optimized jointly, making system construction much easier than with typical ASR systems. Furthermore, E2E-ASR does not require costly lexical information or morphological analysis.

Present E2E-ASR approaches can be divided into two types. The first type is based on connectionist temporal classification (CTC). The CTC approach makes it possible to map input sequences of acoustic features to shorter output sequences of symbols without using a hidden Markov model (HMM). However, it assumes conditional independence in the output sequence, i.e., each output symbol such as a character or phoneme is predicted independently in each frame. The second E2E-ASR approach utilizes an attention-based sequence-to-sequence (Seq2Seq) model. In this approach, a sequence of acoustic features is directly mapped into text using an encoder-decoder architecture. In contrast to the CTC-based approach, the attention-based Seq2Seq approach is not bound by any such assumptions, and therefore it can be trained to directly maximize the probability of a word sequence given a sequence of acoustic features. However, in exchange for its generality, the Seq2Seq approach requires large amounts of data for training. Furthermore, since the language model is not a separate module, the large amounts of text typically available cannot be used to improve its performance. This leads to significant degradation in the recognition of proper nouns that do not appear in the paired speech and text data, and has been reported to hurt performance on live production data.

One straightforward approach to address these issues is to integrate a language model with the Seq2Seq model, using shallow fusion, deep fusion, or their variants. Shallow fusion is the simplest approach: a Seq2Seq model and a language model are trained separately, and the scores of the two models are combined in the decoding phase. Deep fusion is an approach proposed in the field of neural machine translation. A Seq2Seq model and a language model are trained separately, and then the hidden states of the decoder of the Seq2Seq model and those of the language model are concatenated using a gating matrix which controls the importance of each model. The parameters used to calculate the gating matrix are then trained using a small amount of training data while fixing all of the other parameters. These fusion approaches enable us to utilize a large amount of unpaired text to improve ASR performance. The resulting model is not actually end-to-end, however, since it requires additional steps and fine-tuning to integrate the separate modules.
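As an illustration of the shallow fusion idea described above, the following minimal sketch combines per-token log-probabilities from two separately trained models at decoding time. The function names, the toy distributions, and the weight values are assumptions made for this example; they are not taken from the experiments in this paper.

```python
import math


def shallow_fusion_score(asr_log_prob, lm_log_prob, lm_weight):
    """Fused score for one candidate token: ASR score plus weighted LM score."""
    return asr_log_prob + lm_weight * lm_log_prob


def pick_next_token(asr_log_probs, lm_log_probs, lm_weight):
    """Return the candidate token with the highest fused score."""
    fused = {
        token: shallow_fusion_score(asr_log_probs[token], lm_log_probs[token], lm_weight)
        for token in asr_log_probs
    }
    return max(fused, key=fused.get)


# Toy next-token distributions (illustrative values only).
asr = {"a": math.log(0.5), "b": math.log(0.4), "c": math.log(0.1)}
lm = {"a": math.log(0.1), "b": math.log(0.8), "c": math.log(0.1)}

without_lm = pick_next_token(asr, lm, lm_weight=0.0)  # ASR model alone
with_lm = pick_next_token(asr, lm, lm_weight=0.5)     # LM shifts the decision
```

In a real decoder the same combination would be applied to every hypothesis in the beam at every step; the sketch shows a single step only.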

A simpler approach is back-translation, a method proposed in the field of machine translation. In this approach, a pre-trained target-to-source translation model is used to generate source text from unpaired target text. Augmenting training data with back-translated data has led to notable improvements in the performance of neural machine translation models. Similar techniques have also led to performance improvements in related tasks such as automatic post-editing.

Inspired by the back-translation approach, in this paper we propose a novel data augmentation method for attention-based E2E-ASR models that allows them to utilize large amounts of text not paired with speech signals. Instead of using a text-to-speech system on unpaired text to produce synthetic speech, or using grapheme-to-phoneme conversion to generate paired text and pseudo speech sequences based on phonemes, we build a text-to-encoder model which learns to predict the hidden states of the E2E-ASR encoder. Targeting the states of the speech encoder, rather than speech itself, makes it possible to achieve faster attention learning and reduce computational cost, thanks to the sub-sampling present in the E2E-ASR encoder. Furthermore, unlike acoustic features, the hidden states make it unnecessary to model speaker dependencies. After training, the text-to-encoder model generates hidden states from a large amount of unpaired text, and then the decoder of the E2E-ASR model is retrained using the generated hidden states as additional training data. To evaluate our proposed method, we conduct an experimental evaluation using the LibriSpeech dataset. The experimental results demonstrate that our proposed method improves ASR performance and makes it possible to improve the recognition results for unknown words.

BACK-TRANSLATION-STYLE DATA AUGMENTATION

An overview of our proposed back-translation-style data augmentation method is shown in Fig. 1. First, the attention-based E2E-ASR model is trained using paired training data which consists of text and speech. Next, the final layer hidden states of the ASR encoder are extracted, providing paired training data which consists of text and the corresponding hidden states. Using this paired training data, a neural text-to-encoder (TTE) model is trained to predict the hidden states of the ASR encoder from a sequence of characters. Finally, the text-to-encoder model generates hidden states from a large amount of unpaired text and the ASR decoder is retrained using the generated states as additional training data.

ASR model training

An overview of an attention-based ASR model is shown in Fig. 2(a). This model directly estimates the posterior $p(\mathbf{C}|\mathbf{X})$, where $\mathbf{X}=\{\mathbf{x}_{1},\mathbf{x}_{2},\dots,\mathbf{x}_{T}\}$ represents a sequence of input features, and $\mathbf{C}=\{c_{1},c_{2},\dots,c_{L}\}$ represents a sequence of output characters. The posterior $p(\mathbf{C}|\mathbf{X})$ is factorized with the probabilistic chain rule as follows:

$$p(\mathbf{C}|\mathbf{X})=\prod_{l=1}^{L}p(c_{l}|c_{1:l-1},\mathbf{X}),$$

where $c_{1:l-1}$ represents the subsequence $\{c_{1},c_{2},\dots,c_{l-1}\}$, and $p(c_{l}|c_{1:l-1},\mathbf{X})$ is calculated as follows:

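All of the above networks are optimized using back-propagation through time (BPTT) to minimize the following objective function:

$$\mathcal{L}=-\log p(\mathbf{C}|\mathbf{X})=-\sum_{l=1}^{L}\log p(c_{l}|c_{1:l-1}^{*},\mathbf{X}),$$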

where $c_{1:l-1}^{*}=\{c_{1}^{*},c_{2}^{*},\dots,c_{l-1}^{*}\}$ represents the ground truth of the previous characters.
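To make the chain-rule factorization and the teacher-forced objective concrete, the sketch below computes the negative log-likelihood of a character sequence from per-step output distributions. It assumes each distribution was already produced by a decoder conditioned on the ground-truth previous characters; the function name and the toy probabilities are illustrative assumptions.

```python
import math


def sequence_nll(step_distributions, target):
    """Negative log-likelihood of `target` under per-step distributions.

    step_distributions[l] is assumed to be p(. | c*_{1:l-1}, X), i.e. the
    decoder output at step l when fed the ground-truth history.
    """
    nll = 0.0
    for distribution, char in zip(step_distributions, target):
        nll -= math.log(distribution[char])
    return nll


# Toy two-step example (illustrative values only).
distributions = [
    {"h": 0.9, "i": 0.1},  # p(c_1 | X)
    {"h": 0.2, "i": 0.8},  # p(c_2 | c_1 = "h", X)
]
nll = sequence_nll(distributions, "hi")

# By the chain rule, exp(-nll) equals the product of the step probabilities.
sequence_probability = math.exp(-nll)
```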

TTE model training

All of the networks are jointly optimized to minimize the following objective functions:

ASR decoder retraining

After training of the TTE model, we retrain the ASR decoder using both the paired and unpaired training data. A flowchart of this retraining is shown in Fig. 3. We concatenate the paired and unpaired text datasets, and then for each text, if there is paired speech data, the acoustic features of that speech are used as inputs. If not, the hidden states generated by the TTE model are used as inputs. Using both the generated hidden states and the original acoustic features produces a regularization effect which prevents overfitting to the generated states.
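The input selection rule described above can be sketched as follows. This is a minimal illustration under stated assumptions: `build_retraining_examples` and `dummy_tte` are hypothetical names, and `dummy_tte` is only a stand-in for a trained TTE model.

```python
def build_retraining_examples(texts, acoustic_features, tte_generate):
    """Choose the decoder-side input for every transcript.

    texts: dict mapping utterance id -> transcript (paired and unpaired together).
    acoustic_features: dict mapping utterance id -> features, paired data only.
    tte_generate: callable mapping a transcript to generated hidden states.
    """
    examples = []
    for utt_id, text in texts.items():
        if utt_id in acoustic_features:
            # Paired speech exists: use its acoustic features as the input.
            examples.append((text, "acoustic", acoustic_features[utt_id]))
        else:
            # Text only: use hidden states generated by the TTE model.
            examples.append((text, "generated", tte_generate(text)))
    return examples


def dummy_tte(text):
    """Hypothetical stand-in: one 2-dimensional vector per character."""
    return [[0.0, 0.0] for _ in text]


texts = {"utt1": "hello", "utt2": "new words"}
features = {"utt1": [[0.1] * 83 for _ in range(10)]}  # 83 = 80 log Mel + 3 pitch
examples = build_retraining_examples(texts, features, dummy_tte)
input_types = [kind for _, kind, _ in examples]
```

Mixing both input types in one training set is what gives the regularization effect mentioned above, since the decoder never sees generated states alone.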

EXPERIMENTAL EVALUATION

We conducted an experimental evaluation using the LibriSpeech dataset, which consists of two sets of clean speech data (100 hours + 360 hours) and noisy speech data (500 hours) for training. We used 100 hours of clean speech data to train the initial ASR model and the text-to-encoder (TTE) model, and the text of 360 hours of clean speech data to retrain the ASR decoder. We used five hours of clean development data as a validation set, and five hours of clean test data as an evaluation set. To evaluate the effectiveness of our proposed method, we compared the recognition performance of the following seven methods:

model trained with 100 hours of acoustic features;

model retrained with 360 hours of generated hidden states and 100 hours of extracted hidden states;

model retrained with 360 hours of generated hidden states and 100 hours of extracted hidden states while the attention layers are frozen;

model retrained with 360 hours of generated hidden states and 100 hours of acoustic features;

model trained with 460 hours of extracted hidden states;

model trained with 460 hours of extracted hidden states while the attention layers are frozen;

model trained with 460 hours of acoustic features;

where “generated hidden states” represent the hidden states generated by the TTE model, and “extracted hidden states” represent the hidden states extracted from the ASR encoder using raw acoustic features.

We used an acoustic feature vector consisting of 80-dimensional log Mel-filter bank features and three-dimensional pitch features, which were extracted using the open-source speech recognition toolkit Kaldi. The ASR encoder consisted of an eight-layered bidirectional long short-term memory with a projection layer (BLSTMP), and the ASR decoder consisted of a one-layered LSTM. In the second and third layers from the bottom of the ASR encoder, sub-sampling was performed to reduce the utterance length $T$ to $T/4$. The ASR attention network used location-aware attention, which is more robust to long sequences than dot-product or additive attention. For decoding, we used a beam search algorithm with a beam size of 20. We manually set the minimum and maximum lengths of the output sequence to 0.3 and 0.8 times the length of the subsampled input sequence, respectively. Details of the experimental conditions for the ASR model are shown in Table 1.
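The length arithmetic implied by this setup can be sketched as follows, reading 0.3 and 0.8 as the lower and upper output-length ratios. The function names are assumptions, as are the use of ceiling division for odd lengths and the truncation of the bounds to integers.

```python
def subsampled_length(num_frames, num_subsampling_layers=2, factor=2):
    """Encoder output length after sub-sampling by `factor` in each layer."""
    length = num_frames
    for _ in range(num_subsampling_layers):
        length = -(-length // factor)  # ceiling division (rounding is an assumption)
    return length


def output_length_bounds(encoder_length, min_ratio=0.3, max_ratio=0.8):
    """Lower and upper limits on the decoded sequence length."""
    min_len = int(min_ratio * encoder_length)
    max_len = max(1, int(max_ratio * encoder_length))
    return min_len, max_len


# A 1000-frame utterance is reduced to a quarter of its length by two layers.
encoder_length = subsampled_length(1000)
bounds = output_length_bounds(encoder_length)
```

Because attention is computed over the subsampled sequence, a four-fold reduction in length also reduces the cost of each attention step by the same factor.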

The architecture of the TTE model followed the original Tacotron2 settings. The input characters were converted into 512-dimensional character embeddings. The TTE encoder consisted of a three-layered 1D convolutional neural network (CNN) containing 512 filters of size 5 with batch normalization and a rectified linear unit (ReLU) activation function, followed by a one-layered BLSTM with 512 units (256 units for forward processing and 256 for backward processing). Although the attention mechanism of the TTE model was based on location-aware attention, we additionally fed the cumulative attention weights back to the next step to accelerate attention learning. The TTE decoder consisted of a two-layered LSTM with 1024 units. Prenet was a two-layered feed-forward network with 256 units and ReLU. Postnet was a five-layered CNN containing 512 filters of size 5 with batch normalization and a tanh activation function in every layer except the final one. Dropout with a probability of 0.5 was applied to all of the convolution and Prenet layers. Zoneout with a probability of 0.1 was applied to the decoder LSTM. During generation, we applied dropout to Prenet, following prior work, and set the threshold for the end-of-sequence probability to 0.75 to prevent the end of the generated sequence from being cut off. Details of the experimental conditions for the TTE model are shown in Table 2.
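The role of the end-of-sequence threshold during generation can be illustrated with a short sketch. It assumes the TTE decoder emits one stop probability per generated step and that generation halts at the first step whose probability reaches the threshold; the function name and the probability values are made up for illustration.

```python
def generated_length(stop_probabilities, threshold=0.75):
    """Number of steps generated before the stop probability reaches `threshold`."""
    for step, probability in enumerate(stop_probabilities, start=1):
        if probability >= threshold:
            return step
    # No step crossed the threshold: generation runs to the step limit.
    return len(stop_probabilities)


# Illustrative stop probabilities rising towards the end of a sequence.
stop_probabilities = [0.05, 0.20, 0.55, 0.60, 0.80, 0.95]

length_low_threshold = generated_length(stop_probabilities, threshold=0.5)
length_high_threshold = generated_length(stop_probabilities, threshold=0.75)
```

With the lower threshold the sequence stops two steps earlier, which is the kind of premature cut-off that the higher threshold is meant to avoid.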

All of the networks were trained using the end-to-end speech processing toolkit ESPnet on a single GPU (Titan X Pascal). Character error rate (CER) and word error rate (WER) were used as metrics.
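For reference, both metrics are edit distances normalized by the reference length. The following is a minimal sketch rather than the scoring tool used in the experiments; it assumes whitespace tokenization for WER and counts spaces as characters for CER.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance between two token sequences."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_token in enumerate(reference, start=1):
        current = [i]
        for j, hyp_token in enumerate(hypothesis, start=1):
            current.append(min(
                previous[j] + 1,                            # deletion
                current[j - 1] + 1,                         # insertion
                previous[j - 1] + (ref_token != hyp_token), # substitution or match
            ))
        previous = current
    return previous[-1]


def cer(reference, hypothesis):
    """Character error rate: character-level edits per reference character."""
    return edit_distance(list(reference), list(hypothesis)) / len(reference)


def wer(reference, hypothesis):
    """Word error rate: word-level edits per reference word."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


example_cer = cer("the cat sat", "the cat sit")  # one substituted character
example_wer = wer("the cat sat", "the cat sit")  # one substituted word
```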

Experimental results

First, we focus on the effectiveness of adding the L1 norm to the objective function of the TTE model. The mean square error loss for the validation data with teacher forcing is shown in Table 3. We can confirm that the use of the L1 norm results in improved performance. Furthermore, we found that the use of the L1 norm also leads to much faster attention learning. The attention weights for the validation data are shown in Fig. 4. While the TTE model without the L1 norm is unable to learn the attention until after epoch 40, the use of the L1 norm enables the model to learn the attention in less than one third of that number of epochs. This is because the L1 norm makes the model focus on reducing small errors, which prevents the decoder of the model from becoming something like an auto-encoder.
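The effect of the L1 term on small errors can be seen in a short numerical sketch. The equal weighting of the two terms is an assumption, any other terms of the TTE objective are omitted, and the function names and values are illustrative only.

```python
def mse_loss(predicted, target):
    """Mean square error between two vectors."""
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(predicted)


def l1_loss(predicted, target):
    """Mean absolute error (L1 norm divided by the dimension)."""
    return sum(abs(p - t) for p, t in zip(predicted, target)) / len(predicted)


def regression_loss(predicted, target, use_l1=True):
    """MSE with an optional L1 term (equal weighting is an assumption)."""
    loss = mse_loss(predicted, target)
    if use_l1:
        loss += l1_loss(predicted, target)
    return loss


# For an error of 0.1 the L1 term is ten times larger than the squared term,
# so it dominates the loss once the prediction is already close.
small_error_l1 = l1_loss([0.1], [0.0])
small_error_mse = mse_loss([0.1], [0.0])

# For an error of 2.0 the squared term dominates instead.
large_error_l1 = l1_loss([2.0], [0.0])
large_error_mse = mse_loss([2.0], [0.0])
```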

Next, we focus on the effectiveness of our proposed data augmentation method. Our experimental results are shown in Table 4. Compared with the baseline, we can confirm that our proposed “Retrain-Joint” approach improved the recognition performance. However, when only hidden states were used during retraining, no improvement was observed. This is because using only the hidden states resulted in overfitting. Moreover, in comparison to the oracle results, we can see that there is still room for improvement. These results imply that the use of data from various speakers is more important than the use of various text. Since hidden states contain less information about speaker characteristics than acoustic features, using hidden states as the targets of the TTE model likely results in the generated hidden states representing the characteristics of an intermediate speaker. As a result, there is not enough speaker variation among the generated hidden states, degrading the effectiveness of data augmentation. To address this issue, we will extend our scheme using multi-speaker Tacotron2 in future work.

Some recognition examples including words that were unknown before retraining are shown in Table 5. We can see that our proposed data augmentation method improves the performance of the ASR decoder as a language model, making it possible to extend the vocabulary. A similar effect was observed in the original back-translation work.

Finally, the results of shallow fusion with a character-based language model (LM) are shown in Table 6, where the LM was trained using the text of 360 hours of clean speech, and the weight balancing the two models was chosen to achieve the best CER. We can see that the use of the LM improved the recognition performance in both cases, indicating that our proposed method can still be combined with LM integration methods.

CONCLUSION

In this paper we proposed a novel data augmentation method for attention-based E2E-ASR that utilizes a large amount of text not paired with speech signals, an approach inspired by the back-translation technique proposed in the field of machine translation. We built a neural text-to-encoder (TTE) model which predicts, from a sequence of characters, the sequence of hidden states extracted by a pre-trained E2E-ASR encoder. Using the hidden states as targets makes it possible to achieve faster attention learning and reduces computational cost thanks to sub-sampling in the ASR encoder. After training, the TTE model generated hidden states from a large amount of unpaired text, and then the decoder of the E2E-ASR model was retrained using the generated hidden states as additional training data. An experimental evaluation using the LibriSpeech dataset demonstrated that our proposed method improved ASR performance and improved the recognition results by reducing the number of unknown words. Furthermore, we confirmed that our proposed method can be combined with language model integration methods.

In future work, we will extend the text-to-encoder model to a multi-speaker model using speaker embedding vectors to generate more varied hidden states, and apply our proposed method using a much larger amount of unpaired text.