Cycle-consistency training for end-to-end speech recognition

Takaaki Hori, Ramon Astudillo, Tomoki Hayashi, Yu Zhang, Shinji Watanabe, Jonathan Le Roux

Introduction

In recent years, automatic speech recognition (ASR) technology has been widely used as an effective user interface for various devices such as car navigation systems, smart phones, and smart speakers. Recognition accuracy has dramatically improved with the help of deep learning techniques, and the reliability of speech interfaces has been greatly enhanced. However, building ASR systems is very costly and time consuming. Current systems typically have a module-based architecture including an acoustic model, a pronunciation dictionary, and a language model, which rely on phonetically-designed phone units and word-level pronunciations using linguistic assumptions. Building a language model also requires text preprocessing, such as tokenization for languages that do not explicitly mark word boundaries. Consequently, it is not easy for non-experts to develop ASR systems, especially for under-resourced languages.

End-to-end ASR aims to address these issues by simplifying the module-based architecture into a single-network architecture within a deep learning framework. End-to-end ASR methods typically rely only on paired acoustic and language data, without the need for extra linguistic knowledge, and train the model with a single algorithm. Therefore, this approach makes it feasible to build ASR systems without expert knowledge. However, in the end-to-end ASR framework a large amount of training data is crucial to ensure high recognition accuracy. Paired acoustic (speech) and language (transcription) realizations spoken by multiple speakers are needed. Nowadays, it is easy to collect audio and text data independently from the world wide web, but difficult to find paired data in different languages. Transcribing existing audio data, or recording texts read by a sufficient number of speakers, is also very expensive.

There are several approaches in the literature that tackle the problem of limited paired data. In particular, cycle consistency has recently been introduced in machine translation (MT) and image transformation, and enables one to optimize deep networks using unpaired data. The basic underlying assumption is that, given a model that converts input data to output data and another model that reconstructs the input data from the output data, the input data and its reconstruction should be close to each other. For example, suppose an English-to-French MT system translates an English sentence to a French sentence, and then a French-to-English MT system back-translates the French sentence to an English sentence. In this case, we can train the English-to-French system so that the difference between the English sentence and its back-translation becomes smaller, for which we only need English sentences. The French-to-English MT system can also be trained in the same manner using only French sentences.

Applying the concept of cycle consistency to ASR is quite challenging. As is the case in MT, the output of ASR is a discrete distribution over the set of all possible sentences. It is therefore not possible to build an end-to-end differentiable loss that back-propagates error through the most probable sentence in this step. Since the set of possible sentences is exponentially large in the size of the sentence, it is not possible to exactly average over all possible sentences either. Furthermore, unlike in MT and image transformation, in ASR, the input and output domains are very different and do not contain the same information. The output text does not include speaker and prosody information, which is eliminated through feature extraction and decoding. Hence, the speech reconstructed by the TTS system does not have the original speaker and prosody information and can result in a strong mismatch.

Previous approaches related to cycle consistency in end-to-end ASR circumvent these problems by avoiding back-propagating the error beyond the discrete steps and by adding a speaker network to transfer the information not present in the text. Therefore, these methods are not strictly cycle-consistency training as used in MT and image transformation. Gradients are not propagated through both ASR and TTS simultaneously, and only the second step of an ASR-TTS or TTS-ASR chain can be updated.

In this work, we propose an alternative approach that uses an end-to-end differentiable loss in the cycle-consistency manner. This idea rests on the two following principles.

Encoder-state-level cycle consistency: We use ASR encoder state sequences for computing the cycle consistency instead of waveform or spectral features. This uses a standard Tacotron2 end-to-end TTS model modified to reconstruct the encoder state sequence instead of speech. We call this a text-to-encoder (TTE) model, which we introduced in our prior work on data augmentation. This approach reduces the mismatch between the original and the reconstruction by avoiding the problem of missing para-linguistic information.

Expected end-to-end loss: We use an expected loss approximated with a sampling-based method. In other words, we sample multiple sentences from the ASR model, generate an encoder state sequence for each, and compute the consistency loss for each sentence by comparing its encoder state sequence with the original. Then, the mean loss can be used to backpropagate the error to the ASR model via the REINFORCE algorithm. This allows us to update the ASR system when the TTE is used to compute the loss, unlike the previous approaches.
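The sampling-based estimate can be illustrated with a minimal sketch. It is not the paper's implementation: it assumes a single categorical decision instead of a full character sequence, and the names `reinforce_gradient`, `exact_gradient`, and the toy loss values are hypothetical. It shows only that averaging the loss-weighted score function over samples approximates the gradient of the expected loss.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_gradient(logits, loss_fn, num_samples, rng):
    """Monte Carlo estimate of d E_{c~p}[loss_fn(c)] / d logits, p = softmax(logits)."""
    probs = softmax(logits)
    grad = [0.0] * len(logits)
    for _ in range(num_samples):
        c = rng.choices(range(len(probs)), weights=probs)[0]
        loss = loss_fn(c)
        # Score function: d log p(c) / d logit_k = 1[k == c] - p_k
        for k in range(len(logits)):
            indicator = 1.0 if k == c else 0.0
            grad[k] += loss * (indicator - probs[k]) / num_samples
    return grad


def exact_gradient(logits, loss_fn):
    """Closed-form gradient, feasible here only because the toy space is tiny."""
    probs = softmax(logits)
    expected = sum(p * loss_fn(c) for c, p in enumerate(probs))
    return [p * (loss_fn(k) - expected) for k, p in enumerate(probs)]


# Toy setup (assumed values): three possible outputs with fixed losses.
rng = random.Random(0)
toy_logits = [0.0, 1.0, -0.5]
toy_losses = [1.0, 0.2, 0.5]
estimate = reinforce_gradient(toy_logits, lambda c: toy_losses[c], 20000, rng)
exact = exact_gradient(toy_logits, lambda c: toy_losses[c])
```

In the sequence case the exact gradient is intractable because the set of sentences grows exponentially with length, which is why only the sampled estimate is available.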

The proposed approach therefore allows training with unpaired data, even if only speech is available. Furthermore, since the error is backpropagated into the ASR system from a TTS-based loss, additional unsupervised losses can be used, such as language models. We demonstrate the efficacy of the proposed method in a semi-supervised training condition on the LibriSpeech corpus.

Cycle-consistency training for ASR

The proposed method consists of an ASR encoder-decoder, a TTE encoder-decoder, and consistency loss computation as shown in Fig. 1. In this framework, we need only audio data for backpropagation. In the first step, the ASR system transcribes the input audio feature sequence into a sequence of characters. In addition, an encoder state sequence is obtained. In the second step, the TTE system reconstructs the ASR encoder state sequence from the character sequence. Finally, the cycle-consistency loss is computed by comparing the original state sequence and the reconstructed one. Backpropagation is performed with respect to this loss to update the ASR parameters.
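The final comparison step can be sketched as follows. This is an illustration only: the distance used here (mean squared error plus mean absolute error per element) is an assumed choice, and `consistency_loss` is a hypothetical name. The sketch also assumes the reconstructed sequence has the same number of frames as the original.

```python
def consistency_loss(original, reconstructed):
    """Mean per-element squared plus absolute error between two state sequences.

    Each sequence is a list of frames, and each frame is a list of floats.
    The distance is an assumed, illustrative choice.
    """
    if len(original) != len(reconstructed):
        raise ValueError("sequences must have the same number of frames")
    total = 0.0
    count = 0
    for frame, frame_hat in zip(original, reconstructed):
        if len(frame) != len(frame_hat):
            raise ValueError("frames must have the same dimension")
        for a, b in zip(frame, frame_hat):
            diff = a - b
            total += diff * diff + abs(diff)
            count += 1
    if count == 0:
        raise ValueError("sequences must not be empty")
    return total / count
```

A reconstruction identical to the original yields zero loss, and the loss grows as the reconstructed states drift away from the ASR encoder states.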

2 Attention-based ASR model

All of the above networks are optimized using back-propagation to minimize the following objective function:

where ${\cal U}^{+}$ is the set of all sentences formed from the original character vocabulary ${\cal U}$.

3 Tacotron2-based TTE model

All of the networks are jointly optimized to minimize the following objective function:

4 Cycle-consistency training

To compute the gradients with respect to the expectation in Eq. 19, we utilize the REINFORCE algorithm. This yields the following expression for the gradient

where the weight for each sample $\mathbf{C}^{n}$ is defined as

Related work

The algorithm introduced in this paper is related to existing works on data augmentation and chain-based training. Our prior work introduced the TTE model but used the synthesized encoder state sequences to train the ASR decoder from text data only. This is equivalent to back-translation in MT and builds a non-differentiable TTE-ASR chain as opposed to the end-to-end differentiable ASR-TTE chain proposed here.

A related work introduces a model consisting of a text-to-text auto-encoder and a speech-to-text encoder-decoder that share the speech and text encodings. This model can also be trained jointly using paired and unpaired data. However, speech-only data is used to enhance the speech encodings, not to reduce recognition errors as in our cycle-consistency approach. Furthermore, its text encoder is much simpler than our TTE model, which we hope can generate better speech encodings for computing the consistency loss.

The speech chain model is the most similar architecture to ours. As described in Section 1, the ASR model is trained with synthesized speech and the TTS model is trained with ASR hypotheses for unpaired data. Therefore, the models are not tightly connected with each other, i.e., one model cannot be updated directly with the help of the other model to reduce the recognition or synthesis errors. Our approach utilizes an end-to-end differentiable loss that allows a TTS-based or other loss to be applied after ASR for unsupervised training. We also introduce the TTE model, which reduces both the speaker variation in the loss function and the computational complexity. With regard to cycle-consistency approaches in other disciplines, our approach is most similar to the dual learning approach in MT. This paper combines alternating losses as in using REINFORCE to compute expected translation losses.

Experiments

We conducted several experiments using the LibriSpeech corpus, consisting of two sets of clean speech data (100 hours + 360 hours), and other (noisy) speech data (500 hours) for training. We used the 100 hours of clean speech data to train the initial ASR and TTE models, and the audio of the 360-hour set for unsupervised re-training of the ASR model with the cycle-consistency loss. We used five hours of clean development data as a validation set, and five hours of clean test data as an evaluation set.

The open source speech recognition toolkit Kaldi was used to extract 80-dimensional log mel-filter bank acoustic vectors with three-dimensional pitch features. The ASR encoder had an eight-layered bidirectional long short-term memory with 320 cells including projection layers (BLSTMP), and the ASR decoder had a one-layered LSTM with 300 cells. In the second and third layers from the bottom of the ASR encoder, sub-sampling was performed to reduce the utterance length from $T$ down to $T/4$. The ASR attention network used location-aware attention. For decoding, we used a beam search algorithm with a beam size of 20. We set the minimum and maximum lengths of the output sequence to 0.2 and 0.8 times the length of the subsampled input sequence, respectively.
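As a small worked example of the length constraints, the sketch below computes the subsampled length and the decoding length bounds for a given number of input frames, taking 0.2 as the lower ratio and 0.8 as the upper ratio. The function name and the rounding behavior (floor division and truncation) are assumptions, not details taken from the toolkit.

```python
def output_length_bounds(num_frames, subsampling=4, min_ratio=0.2, max_ratio=0.8):
    """Return (subsampled_length, min_output_length, max_output_length).

    Rounding is an assumption: floor division for subsampling,
    truncation for the ratio-based bounds.
    """
    sub_len = num_frames // subsampling
    min_len = int(min_ratio * sub_len)
    max_len = max(1, int(max_ratio * sub_len))
    return sub_len, min_len, max_len


# A 10-second utterance at a 10 ms frame shift has about 1000 frames.
bounds = output_length_bounds(1000)
```

With 1000 input frames, the encoder produces 250 states and the decoder is constrained to emit between 50 and 200 characters.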

The architecture of the TTE model followed the original Tacotron2. It used 512-dimensional character embeddings. The TTE encoder consisted of a three-layered 1D convolutional neural network (CNN) containing 512 filters of size 5, each followed by batch normalization and a rectified linear unit (ReLU) activation function, and a one-layered BLSTM with 512 units (256 units for forward processing, the rest for backward processing). Although the attention mechanism of the TTE model was based on location-aware attention, we additionally accumulated the attention weight feedback to the next step to accelerate attention learning. The TTE decoder consisted of a two-layered LSTM with 1024 units. Prenet was a two-layered feed-forward network with 256 units and ReLU activation. Postnet was a five-layered CNN containing 512 filters of size 5, with batch normalization and a tanh activation function except in the final layer. Dropout with a probability of 0.5 was applied to all of the convolution and Prenet layers. Zoneout with a probability of 0.1 was applied to the decoder LSTM. During generation, we applied dropout to Prenet in the same manner as in , and set the threshold value of the end-of-sequence probability at 0.75 to avoid cutting off the end of the input sequence.

In cycle-consistency training, five sequences of characters were drawn from the ASR model for each utterance, where each character was drawn repeatedly from the softmax distribution of the ASR decoder until the end-of-sequence label was encountered. During training, we also used the 100-hour paired data to regularize the model parameters in a teacher-forcing manner, i.e., the parameters were updated alternately by the cross-entropy loss with paired data and the cycle-consistency loss with unpaired data.
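The sampling procedure can be sketched as below. The decoder is replaced by a hypothetical stand-in, `toy_step_logits`, that returns logits given the characters sampled so far; the token ids, the maximum length, and the function names are assumptions made for illustration.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_sequence(step_logits_fn, eos_id, max_len, rng):
    """Draw one token at a time from the softmax until EOS or max_len is reached."""
    seq = []
    for _ in range(max_len):
        probs = softmax(step_logits_fn(seq))
        token = rng.choices(range(len(probs)), weights=probs)[0]
        if token == eos_id:
            break
        seq.append(token)
    return seq


EOS_ID = 2  # assumed id layout: 0 and 1 are characters, 2 is end-of-sequence


def toy_step_logits(prefix):
    """Hypothetical stand-in for the ASR decoder: EOS becomes likelier as the prefix grows."""
    return [1.0, 0.5, 0.2 * len(prefix)]


rng = random.Random(0)
# Five samples per utterance, matching the setup described above.
samples = [sample_sequence(toy_step_logits, EOS_ID, 50, rng) for _ in range(5)]
```

Each sampled sequence would then be passed to the TTE model to obtain a reconstructed encoder state sequence and a per-sample consistency loss.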

All models were trained using the end-to-end speech processing toolkit ESPnet on a single GPU (Titan Xp). Character error rate (CER) and word error rate (WER) were used as evaluation metrics.
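For reference, both metrics are edit-distance rates between a reference and a hypothesis. The sketch below is a generic implementation, not the scoring script used in the experiments; in particular, counting spaces as characters in CER is an assumption, and corpus-level rates are normally obtained by summing errors and reference lengths over all utterances rather than averaging per-utterance rates.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,             # deletion
                cur[j - 1] + 1,          # insertion
                prev[j - 1] + (r != h),  # substitution or match
            ))
        prev = cur
    return prev[-1]


def error_rate(ref_tokens, hyp_tokens):
    """Edit distance as a percentage of the reference length."""
    if not ref_tokens:
        raise ValueError("reference must not be empty")
    return 100.0 * edit_distance(ref_tokens, hyp_tokens) / len(ref_tokens)


def wer(ref, hyp):
    """Word error rate for whitespace-tokenized strings."""
    return error_rate(ref.split(), hyp.split())


def cer(ref, hyp):
    """Character error rate; spaces are counted as characters (an assumption)."""
    return error_rate(list(ref), list(hyp))
```

For example, `wer("the cat sat", "the bat sat")` counts one substitution against three reference words.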

2 Results

First, we show the changes of the consistency loss for training data and the validation accuracy for development data in Fig. 2, where the accuracy was computed based on the prediction with ground truth history. The consistency loss successfully decreased as the number of epochs increased. Although the validation accuracy did not improve smoothly, it reached a better value than that for the first epoch. We chose the 6th-epoch model for the following ASR experiments.

Table 1 shows the ASR performance using different training methods. Compared with the baseline result given by the initial ASR model, we can confirm that our proposed cycle-consistency training reduced the word error rate from 25.2% to 21.5%, a relative reduction of 14.7%. Thus, the results demonstrate that the proposed method works for ASR training with unpaired data. To verify the effectiveness of our approach, we further examined more straightforward methods, in which we simply used cross-entropy (CE) loss for unpaired data, where the target was chosen as the one best ASR hypothesis or sampled in the same manner as the cycle-consistency training. To alleviate the impact of the ASR errors, we weighted the CE loss by 0.1 for unpaired data while we did not down-weight the paired data. However, the error rates increased significantly in the 1-best condition. Even in the 5-sample condition, we could not obtain better performance than the baseline. We also conducted additional experiments under an oracle condition, where the 360-hour paired data were used together with the 100-hour data using the standard CE loss. The error rates can be considered the upper bound of this framework. We can see that there is still a big gap to the upper bound and further challenges need to be overcome to reach this goal.
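The relative reduction quoted above follows directly from the two word error rates. A minimal check, using a hypothetical helper name:

```python
def relative_reduction(baseline, improved):
    """Relative error reduction in percent."""
    return 100.0 * (baseline - improved) / baseline


# WER of the initial model and of the cycle-consistency model, as reported above.
reduction = relative_reduction(25.2, 21.5)
```

Rounded to one decimal place, this gives the 14.7% figure stated in the text.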

Finally, we combined the ASR model with a character-based language model (LM) using shallow fusion. An LSTM-based LM was trained using text-only data from the 500-hour noisy set, excluding audio data, and used for decoding. As shown in Table 2, the use of text-only data yielded further improvement, reaching 19.5% WER (a further 8% error reduction relative to the baseline), which is the best number we have achieved so far for this unpaired data setup.
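Shallow fusion is commonly implemented as a log-linear interpolation of the ASR and LM scores at each decoding step. The sketch below illustrates that general technique; the LM weight and the toy probabilities are assumed values, and the function names are hypothetical.

```python
import math


def shallow_fusion_scores(asr_log_probs, lm_log_probs, lm_weight):
    """Combine per-character log-probabilities: log p_asr + lm_weight * log p_lm."""
    if len(asr_log_probs) != len(lm_log_probs):
        raise ValueError("score vectors must cover the same vocabulary")
    return [a + lm_weight * l for a, l in zip(asr_log_probs, lm_log_probs)]


def best_token(scores):
    """Index of the highest-scoring candidate."""
    return max(range(len(scores)), key=lambda i: scores[i])


# Toy next-character distributions over a two-symbol vocabulary (assumed values).
asr = [math.log(0.6), math.log(0.4)]
lm = [math.log(0.1), math.log(0.9)]
without_lm = best_token(shallow_fusion_scores(asr, lm, lm_weight=0.0))
with_lm = best_token(shallow_fusion_scores(asr, lm, lm_weight=1.0))
```

In this toy case the ASR model alone prefers the first symbol, while the fused score prefers the second because the LM strongly favors it. In beam search, the fused score would be accumulated over each hypothesis.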

Conclusion

In this paper, we proposed a novel method to train end-to-end automatic speech recognition (ASR) models using unpaired data. The method employs an attention-based ASR model and a Tacotron2-based text-to-encoder (TTE) model to compute a cycle-consistency loss using audio data only. Experimental results on the LibriSpeech corpus demonstrated that the proposed cycle-consistency training reduced the word error rate by 14.7% relative to an initial model trained with 100-hour paired data, using an additional 360 hours of audio-only data without transcriptions. We also investigated the use of text-only data from 500-hour utterances for language modeling, and obtained a further error reduction of 8%. Accordingly, we achieved a 22.7% error reduction in total for this unpaired data setup. Future work includes joint training of the ASR and TTE models using both sides of the cycle-consistency loss, and the use of additional loss functions to improve training.