Fast, Compact, and High Quality LSTM-RNN Based Statistical Parametric Speech Synthesizers for Mobile Devices

Heiga Zen, Yannis Agiomyrgiannakis, Niels Egberts, Fergus Henderson, Przemysław Szczepaniak

Introduction

Statistical parametric speech synthesis (SPSS) based on artificial neural networks (ANN) has become popular in the text-to-speech (TTS) research area in the last few years. ANN-based acoustic models offer an efficient and distributed representation of complex dependencies between linguistic and acoustic features and have shown the potential to produce natural sounding synthesized speech. Recurrent neural networks (RNNs), especially long short-term memory (LSTM)-RNNs, provide an elegant way to model speech-like sequential data that embodies short- and long-term correlations. They were successfully applied to acoustic modeling for SPSS. Zen et al. proposed a streaming speech synthesis architecture using unidirectional LSTM-RNNs with a recurrent output layer. It enabled low-latency speech synthesis, which is essential in some applications. However, it was significantly slower than hidden Markov model (HMM)-based SPSS in terms of real-time ratio. This paper describes further optimizations of LSTM-RNN-based SPSS for deployment on mobile devices. The optimizations conducted here include reducing computation and disk footprint, as well as making the system robust to errors in training data.

The rest of this paper is organized as follows. Section 2 describes the proposed optimizations. Experiments and subjective evaluation-based findings are presented in Section 3. Concluding remarks are shown in the final section.

Optimizing LSTM-RNN-based SPSS

Figure 1 shows the overview of the streaming synthesis architecture using unidirectional LSTM-RNNs. Unlike HMM-based SPSS, which usually requires utterance-level batch processing or frame lookahead, this architecture allows frame-synchronous streaming synthesis with no frame lookahead. Therefore, this architecture provides much lower latency speech synthesis. However, it still has a few drawbacks:

Disk footprint: Although the total number of parameters in LSTM-RNN-based SPSS can be significantly lower than that of HMM-based SPSS, the overall disk footprint of the LSTM-RNN system can be similar or slightly larger because HMM parameters can be quantized using 8-bit integers. Therefore decreasing the LSTM-RNN system disk footprint is essential for deployment on mobile devices.

Computation: With HMM-based SPSS, inference of acoustic parameters involves traversing decision trees at each HMM state and running the speech parameter generation algorithm. On the other hand, inference of acoustic parameters with LSTM-RNN-based SPSS involves many matrix-vector multiplications at each frame, which are expensive. This is particularly critical for client-side TTS on mobile devices, which have less powerful CPUs and limited battery capacity.

Robustness: Typical ANN-based SPSS relies on fixed phoneme- or state-level alignments, whereas HMMs can be trained without fixed alignments using the Baum-Welch algorithm. Therefore, the ANN-based approach is less robust to alignment errors.

This section describes optimizations addressing these drawbacks. Each of them is evaluated in Section 3.

ANN weights are typically stored as 32-bit floating-point numbers. However, representing them in lower-precision integers offers significant advantages in memory, disk footprint, and processing performance. This is commonly achieved by quantizing the ANN weights. This paper utilizes 8-bit quantization of ANN weights to reduce the disk footprint of LSTM-RNN-based acoustic and duration models. Although it is possible to run inference in 8-bit integers with quantization-aware training, that possibility is not utilized here; instead, weights are stored as 8-bit integers on disk and then restored to 32-bit floating-point numbers after being loaded into memory.
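
The store-as-8-bit, restore-as-float round trip can be illustrated with a minimal sketch. The quantization scheme is not specified here, so the sketch assumes uniform quantization with one scale and one offset per weight matrix; the names `quantize` and `dequantize` are hypothetical, and Python floats stand in for the 32-bit values.

```python
def quantize(weights):
    """Map float weights to 8-bit codes (0..255) with one scale and offset."""
    lo, hi = min(weights), max(weights)
    # Guard against a constant matrix, where hi == lo would give a zero scale.
    scale = (hi - lo) / 255.0 or 1.0
    # Clamp after rounding so every code fits in one byte.
    codes = bytes(min(255, max(0, round((w - lo) / scale))) for w in weights)
    return codes, scale, lo


def dequantize(codes, scale, lo):
    """Recover approximate float weights after loading the codes from disk."""
    return [lo + scale * q for q in codes]


weights = [-0.82, -0.10, 0.0, 0.33, 0.91]
codes, scale, lo = quantize(weights)
restored = dequantize(codes, scale, lo)
# Rounding to the nearest code bounds the error by half a quantization step.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
```

Under this assumed scheme each weight costs one byte on disk plus a small per-matrix overhead for `scale` and `lo`, and the reconstruction error per weight is at most half a quantization step.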

2 Multi-frame bundled inference

Inference of acoustic frames takes 60–70% of total computations in our LSTM-RNN-based SPSS implementation. Therefore, it is desirable to reduce the amount of computation at the inference stage. In typical ANN-based SPSS, input linguistic features other than state- and frame-position features are constant within a phoneme. Furthermore, speech is a rather stationary process at 5-ms frame shift and target acoustic frames change slowly across frames. Based on these characteristics of inputs and targets, this paper explores the multi-frame inference approach. Figure 3 illustrates the concept of multi-frame inference. Instead of predicting one acoustic frame, multiple acoustic frames are jointly predicted at the same time instance. This architecture allows significant reduction in computation while maintaining the streaming capability.
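
A minimal sketch of the streaming loop under multi-frame inference follows. It assumes that each bundle is predicted from the input features of its last frame; `stream_frames` and `toy_predict` are hypothetical names, and `toy_predict` is only a stand-in for the acoustic LSTM-RNN.

```python
def stream_frames(inputs, predict, frames_per_step):
    """Yield acoustic frames, calling the model once per bundle of frames."""
    total = len(inputs)
    for start in range(0, total, frames_per_step):
        # Assumption: the bundle is conditioned on the input of its last frame.
        last = min(start + frames_per_step, total) - 1
        bundle = predict(inputs[last])
        # Trim the final bundle if it runs past the end of the utterance.
        yield from bundle[: total - start]


calls = []


def toy_predict(x, k=4):
    """Toy model: records the call and emits k copies of a dummy frame."""
    calls.append(x)
    return [x * 0.5] * k


frames = list(stream_frames(list(range(10)), toy_predict, frames_per_step=4))
```

For an utterance of 10 frames and 4-frame bundles, the toy model is called 3 times instead of 10, while frames are still produced in order and can be passed on as soon as each bundle is available.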

However, preliminary experiments showed degradation due to a mismatch between training and synthesis: alignments between input and target features can differ at the synthesis stage, e.g., training: $\bm{x}_{2}\to\{\bm{y}_{1},\bm{y}_{2}\}$, synthesis: $\bm{x}_{3}\to\{\bm{y}_{2},\bm{y}_{3}\}$. This issue can be addressed by data augmentation. Figure 3 shows the data augmentation with different frame offsets. From aligned input/target pairs, multiple data sequences can be generated with different starting frame offsets. By using these data sequences for training, acoustic LSTM-RNNs will generalize to different possible alignments between inputs and targets.
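
The offset-based augmentation can be sketched as follows. The sketch assumes, consistent with the example $\bm{x}_{2}\to\{\bm{y}_{1},\bm{y}_{2}\}$, that each bundle is paired with the input of its last frame; `bundle_pairs` and `augment` are hypothetical names, and strings stand in for feature vectors.

```python
def bundle_pairs(inputs, targets, k, offset):
    """Pair one input with k consecutive targets, starting `offset` frames in."""
    pairs = []
    for start in range(offset, len(targets) - k + 1, k):
        # Assumption: each bundle is conditioned on the input of its last frame.
        pairs.append((inputs[start + k - 1], targets[start : start + k]))
    return pairs


def augment(inputs, targets, k):
    """Build one training sequence per starting offset 0..k-1."""
    return [bundle_pairs(inputs, targets, k, offset) for offset in range(k)]


xs = ["x1", "x2", "x3", "x4", "x5"]
ys = ["y1", "y2", "y3", "y4", "y5"]
sequences = augment(xs, ys, k=2)
```

With `k=2`, offset 0 gives pairs such as `("x2", ["y1", "y2"])` and offset 1 gives pairs such as `("x3", ["y2", "y3"])`, so both alignments from the example above appear in the training data.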

3 Robust regression

It is known that learning a linear regression model with the squared loss function can suffer from the effect of outliers. Although ANNs trained with the squared loss function are not a simple linear regression model, their output layers perform linear regression given activations at the last hidden layer. Therefore, ANNs trained with the squared loss function can be affected by outliers. These outliers can come from recordings, transcriptions, forced alignments, and $F_{0}$ extraction errors.

Robust regression techniques, such as linear regression with a heavy-tailed distribution or the minimum density power divergence estimator, can reduce the effect of outliers. This work employs a simple robust regression technique that assumes the errors follow a mixture of two Gaussian distributions, specifically the $\epsilon$-contaminated Gaussian distribution, which is a special case of the Richter distribution. The majority of observations come from a specified Gaussian distribution, while a small proportion come from a Gaussian distribution with much higher variance; the two Gaussian distributions share the same mean. The loss function can be defined as

$$\mathcal{L}(\bm{z};\bm{x},\bm{\Lambda})=-\log\bigl\{(1-\epsilon)\,\mathcal{N}\left(\bm{z};f(\bm{x};\bm{\Lambda}),\bm{\Sigma}\right)+\epsilon\,\mathcal{N}\left(\bm{z};f(\bm{x};\bm{\Lambda}),c\bm{\Sigma}\right)\bigr\}, \qquad (1)$$

where $\bm{z}$ and $\bm{x}$ denote target and input vectors, $\bm{\Sigma}$ is a covariance matrix, $\epsilon$ and $c$ are weight and scale of outliers, $\bm{\Lambda}$ is a set of neural network weights, and $f(\cdot)$ is a non-linear function to predict an output vector given the input vector. Typically, $\epsilon<0.5$ and $c>1$. Note that if $\epsilon=0$ and $\bm{\Sigma}=\bm{I}$, the $\epsilon$-contaminated Gaussian loss function is equivalent to the squared loss function. Figure 4 illustrates the $\epsilon$-contaminated Gaussian distribution ($\bm{\mu}=$ , $\bm{\Sigma}=$ , $c=10$ and $\epsilon=0.1$). It can be seen from the figure that the $\epsilon$-contaminated Gaussian distribution has a heavier tail than the Gaussian distribution. As outliers will be captured by the Gaussian distribution with wider variances, the estimation of means is less affected by these outliers. Here using the $\epsilon$-contaminated Gaussian loss function as a criterion to train LSTM-RNNs is investigated for both acoustic and duration LSTM-RNNs. Note that the $\epsilon$-contaminated Gaussian distribution is similar to globally tied distribution (GTD) in .
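
To make the behaviour of Equation (1) concrete, the sketch below evaluates the loss for a single target vector, assuming $\bm{\Sigma}=\bm{I}$ so that both components are isotropic. The function names are hypothetical, and the log-sum-exp step is an implementation choice for numerical stability rather than part of the definition.

```python
import math


def log_gaussian(sq_dist, dim, var):
    """Log density of N(z; mu, var * I), given sq_dist = ||z - mu||^2."""
    return -0.5 * (dim * math.log(2.0 * math.pi * var) + sq_dist / var)


def contaminated_loss(target, predicted, eps=0.1, c=10.0):
    """Negative log-likelihood of Equation (1) with Sigma = I."""
    dim = len(target)
    sq_dist = sum((z - f) ** 2 for z, f in zip(target, predicted))
    inlier = log_gaussian(sq_dist, dim, 1.0)
    if eps == 0.0:
        # Only the narrow Gaussian remains: squared loss plus a constant.
        return -inlier
    a = math.log1p(-eps) + inlier
    b = math.log(eps) + log_gaussian(sq_dist, dim, c)
    # log-sum-exp avoids underflow when both densities are tiny.
    top = max(a, b)
    return -(top + math.log(math.exp(a - top) + math.exp(b - top)))


small = contaminated_loss([0.1], [0.0])
outlier_robust = contaminated_loss([10.0], [0.0])
outlier_squared = contaminated_loss([10.0], [0.0], eps=0.0)
```

For a residual of 10 in one dimension, the loss with $\epsilon=0.1$ and $c=10$ is far smaller than the loss with $\epsilon=0$, because the wide component absorbs the outlier. With $\epsilon=0$ the value equals half the squared error plus the constant $\frac{1}{2}\log 2\pi$ per dimension.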

Experiments

Speech data from a female professional speaker was used to train speaker-dependent unidirectional LSTM-RNNs for each language. The configuration for speech analysis stage and data preparation process were the same as those described in except the use of speech at 22.05 kHz sampling rather than 16 kHz and 7-band aperiodicities rather than 5-band ones.

Both the input and target features were normalized to be zero-mean unit-variance in advance. The architecture of the acoustic LSTM-RNNs was 1 $\times$ 128-unit ReLU layer followed by 3 $\times$ 128-cell LSTMP layers with 64 recurrent projection units and a linear recurrent output layer. The duration LSTM-RNN used a single LSTM layer with 64 cells and a feed-forward output layer with linear activation. To reduce the training time and the impact of having many silence frames, 80% of silence frames were removed from the training data. Durations of the beginning and ending silences were excluded from the training data for the duration LSTM-RNNs. The weights of the LSTM-RNNs were initialized randomly. Then they were updated to minimize the mean squared error between the target and predicted output features. A distributed CPU implementation of the mini-batch asynchronous stochastic gradient descent (ASGD)-based truncated back propagation through time (BPTT) algorithm was used. The learning rates for the acoustic and duration LSTM-RNNs were $10^{-5}$ and $10^{-6}$, respectively. The learning rates were exponentially decreased over time. Training was continued until the loss value over the development set converged. The same model architecture and hyper-parameters were used across all languages.

At the synthesis stage, durations and acoustic features were predicted from linguistic features using the trained networks. Spectral enhancement based on post-filtering in the cepstral domain was applied to improve the naturalness of the synthesized speech. From the acoustic features, speech waveforms were synthesized using the Vocaine vocoder.

To subjectively evaluate the performance of the systems, preference tests were also conducted. 100 utterances not included in the training data were used for evaluation. Each pair was evaluated by at least eight native speakers of each language. The subjects who did not use headphones were excluded from the experimental results. After listening to each pair of samples, the subjects were asked to choose their preferred one, or they could choose “no preference” if they did not have any preference. Note that stimuli that achieved a statistically significant preference ($p<0.01$) are presented in bold characters in tables displaying experimental results in this section.
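
The statistical test behind the $p<0.01$ threshold is not stated here. As one plausible illustration only, the sketch below assumes a two-sided exact binomial (sign) test that discards "no preference" votes; `sign_test_p_value` is a hypothetical name and the vote counts are made up for the example.

```python
from math import comb


def sign_test_p_value(prefer_a, prefer_b):
    """Two-sided exact binomial test of H0: both systems are equally preferred."""
    n = prefer_a + prefer_b
    if n == 0:
        return 1.0
    k = max(prefer_a, prefer_b)
    # Probability of a split at least this lopsided in one direction.
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    # Double the one-sided tail and cap at 1 for the two-sided p-value.
    return min(1.0, 2.0 * tail)


clear_preference = sign_test_p_value(70, 30)
weak_preference = sign_test_p_value(55, 45)
```

Under this assumed test, a 70 to 30 split among decided votes would be significant at the 0.01 level, whereas a 55 to 45 split would not.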

2 Experimental results for optimizations

Table 3 shows the preference test result comparing LSTM-RNNs with and without weight quantization. It can be seen from the table that the effect of quantization was negligible. The disk footprint of the acoustic LSTM-RNN for English (NA) was reduced from 1.05 MBytes to 272 KBytes.
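
The reported sizes can be checked against the ideal saving from moving 32-bit weights to 8-bit codes. The short calculation below assumes 1 MByte = 1024 KBytes; the variable names are illustrative only.

```python
BITS_FLOAT, BITS_QUANT = 32, 8
# Ideal saving if the file held nothing but weights.
ideal_reduction = 1.0 - BITS_QUANT / BITS_FLOAT

# Reported sizes of the English (NA) acoustic LSTM-RNN.
size_before_kb = 1.05 * 1024  # assumption: 1 MByte = 1024 KBytes
size_after_kb = 272.0
measured_reduction = 1.0 - size_after_kb / size_before_kb
```

The measured reduction comes out slightly below the ideal 75%, which is what would be expected if a small part of the file, such as quantization parameters or other metadata, is not stored in 8 bits.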

2.2 Multi-frame inference

While training multi-frame LSTM-RNNs, the learning rate needed to be reduced (from $10^{-5}$ to $2.5\times 10^{-6}$) as mentioned in . Table 3 shows the preference test result comparing single and multi-frame inference. Note that weights of the LSTM-RNNs were quantized to 8-bit integers. It can be seen from the table that the LSTM-RNN with multi-frame inference and data augmentation achieved the same naturalness as that with single-frame inference. Compared with 1-frame inference, 4-frame inference achieved about 40% reduction of walltime at runtime synthesis.

2.3 $\epsilon$-contaminated Gaussian loss function

Although $c$, $\epsilon$, and $\bm{\Sigma}$ could be trained with the network weights, they were fixed to $c=10$, $\epsilon=0.1$, and $\bm{\Sigma}=\bm{I}$ for both acoustic and duration LSTM-RNNs. Therefore, the numbers of parameters of the LSTM-RNNs trained with the squared and $\epsilon$-contaminated Gaussian loss functions were identical. For training LSTM-RNNs with the $\epsilon$-contaminated Gaussian loss function, the learning rate could be increased (from $2.5\times 10^{-6}$ to $5\times 10^{-6}$ for acoustic LSTM-RNNs, from $10^{-6}$ to $5\times 10^{-6}$ for duration LSTM-RNNs). From a few preliminary experiments, the $\epsilon$-contaminated Gaussian loss function with a 2-block structure was selected: 1) mel-cepstrum and aperiodicities, 2) $\log F_{0}$ and voiced/unvoiced binary flag. This is similar to the multi-stream HMM structure used in HMM-based speech synthesis. Table 3 shows the preference test result comparing the squared and $\epsilon$-contaminated Gaussian loss functions to train LSTM-RNNs. Note that all weights of the LSTM-RNNs were quantized to 8-bit integers and 4-frame bundled inference was used. It can be seen from the table that LSTM-RNNs trained with the $\epsilon$-contaminated Gaussian loss function achieved the same or better naturalness than those trained with the squared loss function.

3 Comparison with HMM-based SPSS

The next experiment compared HMM- and LSTM-RNN-based SPSS with the optimizations described in this paper. Both HMM- and LSTM-RNN-based acoustic and duration models were quantized into 8-bit integers. The same training data and text processing front-end modules were used.

The average disk footprints of HMMs and LSTM-RNNs including both acoustic and duration models over 6 languages were 1560 and 454.5 KBytes, respectively. Table 5 shows the average latency (time to get the first chunk of audio) and average total synthesis time (time to get the entire audio) of the HMM and LSTM-RNN-based SPSS systems (North American English) to synthesize a character, word, sentence, and paragraph on a Nexus 6 phone. Note that the execution binary was compiled for modern ARM CPUs having the NEON advanced single instruction, multiple data (SIMD) instruction set. To reduce the latency of the HMM-based SPSS system, the recursive version of the speech parameter generation algorithm with 10-frame lookahead was used. It can be seen from the table that the LSTM-RNN-based system could synthesize speech with lower latency and total synthesis time than the HMM-based system. However, it is worth noting that the LSTM-RNN-based system was 15–22% slower than the HMM-based system in terms of the total synthesis time on old devices having ARM CPUs without the NEON instruction set (latency was still lower). Table 5 shows the preference test result comparing the LSTM-RNN- and HMM-based SPSS systems. It shows that the LSTM-RNN-based system could synthesize more natural-sounding speech than the HMM-based one.

4 Comparison with concatenative TTS

The last experiment compared the HMM-driven unit selection TTS and LSTM-RNN-based SPSS with the optimizations described in this paper except quantization. Both TTS systems used the same training data and text processing front-end modules. Note that additional linguistic features which were only available with the server-side text processing front-end modules were used in both systems. The HMM-driven unit selection TTS systems were built from speech at 16 kHz sampling. Although LSTM-RNNs were trained from speech at 22.05 kHz sampling, speech at 16 kHz sampling was synthesized at runtime using a resampling functionality in Vocaine. These LSTM-RNNs had the same network architecture as the one described in the previous section. They were trained with the $\epsilon$-contaminated Gaussian loss function and utilized 4-frame bundled inference. Table 6 shows the preference test result. It can be seen from the table that the LSTM-RNN-based SPSS systems were preferred to the HMM-driven unit selection TTS systems in 10 of 26 languages, while there was no significant preference between them in 3 languages. Note that the LSTM-RNN-based SPSS systems were 3–10% slower but 1,500–3,500 times smaller in disk footprint than the HMM-driven unit selection ones.

Conclusions

This paper investigated three optimizations of LSTM-RNN-based SPSS for deployment on mobile devices: 1) Quantizing LSTM-RNN weights to 8-bit integers reduced disk footprint by 70%, with no significant difference in naturalness; 2) Using multi-frame inference reduced CPU use by 40%, again with no significant difference in naturalness; 3) For training, using an $\epsilon$-contaminated Gaussian loss function rather than a squared loss function to avoid excessive effects from outliers proved beneficial, allowing for an increased learning rate and improving naturalness. The LSTM-RNN-based SPSS systems with these optimizations surpassed the HMM-based SPSS systems in speed, latency, disk footprint, and naturalness on modern mobile devices. Experimental results also showed that the LSTM-RNN-based SPSS system with the optimizations could match or surpass the HMM-driven unit selection TTS systems in naturalness in 13 of 26 languages.

Acknowledgement

The authors would like to thank Mr. Raziel Alvarez for helpful comments and discussions.