Investigating gated recurrent neural networks for speech synthesis

Zhizheng Wu, Simon King

Introduction

Statistical parametric speech synthesis (SPSS) has advanced steadily in naturalness over the past decade, as witnessed by the series of Blizzard Challenges. However, the quality of synthetic speech produced by SPSS is still far below that of natural human speech, and cannot compete with the best unit selection systems, which concatenate waveforms. As suggested in , acoustic modelling, which captures the complex relationship between linguistic and acoustic representations, is a key limiting factor and is the focus of this work.

Neural networks have re-emerged as a potentially powerful acoustic model for SPSS. In , feed-forward neural networks are employed to map a linguistic representation derived from input text directly to acoustic features. In , a deep belief network (DBN) was used to model the relationship between linguistic and acoustic representations jointly. In and , mixture density networks (MDNs) and real-valued neural autoregressive density estimators (RNADEs) were proposed, respectively, to predict acoustic feature distributions given input linguistic features. These various implementations can be viewed as a replacement of the decision tree in HMM-based speech synthesis; they map linguistic features to acoustic features frame by frame through multiple hidden layers. However, the sequential nature of speech is not explicitly modelled in these network architectures.

To impose temporal constraints, we previously proposed providing contextual information by stacking low-dimensional bottleneck features from multiple consecutive frames. Still in the DNN framework, minimum trajectory error training and a sequence error training criterion have been proposed to minimise the utterance-level trajectory error rather than the frame-by-frame error. On the other hand, recurrent neural networks (RNNs) directly and elegantly include temporal information in the network architecture, making them attractive for modelling speech parameter trajectories. In , a standard RNN was employed to predict prosodic information for speech synthesis. In , two variants on standard RNNs, the Elman RNN and clockwork RNN, were investigated for speech synthesis.

The most widely used recurrent network in speech processing applications is the long short-term memory (LSTM) architecture. Because the LSTM addresses the vanishing gradient problem of the standard RNN, it is easier to train. In , an LSTM was employed to model the F0 contour. In , a bidirectional LSTM was employed to map a sequence of linguistic features to the corresponding sequence of acoustic features. In , an LSTM with a recurrent output layer was proposed to perform sequence mapping from linguistic to acoustic representations. These studies all formulate SPSS as sequence-to-sequence mapping and all demonstrate the effectiveness of LSTMs. However, the LSTM architecture seems rather ad hoc, and it is not obvious what its various components actually contribute to performance.

This raises at least two questions that have not been answered in previous studies: a) how exactly does the LSTM architecture model a speech parameter sequence; b) which components of the LSTM architecture are important, and which could be discarded. Answers to these questions may suggest better and perhaps simpler recurrent network architectures.

The novelty of this work

We attempt to reach a better understanding of the “black-box” LSTM architecture and our findings lead us to propose a simplified architecture for speech modelling.

First, we give an analysis of the forget gate and memory cell in the LSTM architecture. Specifically, we visualise the activation of the forget gate to understand when the forget gate resets the memory cell state, and how the forget gate relates to speech structure. We analyse how the cell state correlates with the trajectory to be predicted. These visualisations enable us to understand how LSTMs model the temporal structure in speech synthesis. To the best of our knowledge, this is the first attempt to visually analyse the LSTM architecture in predicting a speech parameter sequence.

Second, we analyse the importance of each LSTM component for speech synthesis and propose a simplified architecture. The analysis is done empirically with several variants of the LSTM. Each removes a different component of the vanilla LSTM. The analysis was inspired by the studies in , and we focus on the speech synthesis application. Based on this analysis, we present a simplified architecture, which only has the forget gate. The simplified architecture has significantly fewer parameters than the vanilla LSTM, and so reduces the computational cost of generation considerably without degrading the quality of the synthesised speech.

Long Short-Term Memory

Standard RNNs are hard to train due to the well-known vanishing or exploding gradient problems . To address the vanishing gradient problem, the LSTM architecture was proposed, the basic idea of which was presented in . The most commonly used architecture was described in , and is formulated as,
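In one common notation for the peephole LSTM (the symbol names below are assumptions; only the cell state $\mathbf{c}$ and the forget gate $\mathbf{f}_{t}$ are named in the surrounding text), the update at frame $t$ can be sketched as:

$$
\begin{aligned}
\mathbf{i}_{t} &= \sigma\left(\mathbf{W}_{i}\mathbf{x}_{t} + \mathbf{R}_{i}\mathbf{h}_{t-1} + \mathbf{p}_{i}\odot\mathbf{c}_{t-1} + \mathbf{b}_{i}\right) \\
\mathbf{f}_{t} &= \sigma\left(\mathbf{W}_{f}\mathbf{x}_{t} + \mathbf{R}_{f}\mathbf{h}_{t-1} + \mathbf{p}_{f}\odot\mathbf{c}_{t-1} + \mathbf{b}_{f}\right) \\
\mathbf{c}_{t} &= \mathbf{f}_{t}\odot\mathbf{c}_{t-1} + \mathbf{i}_{t}\odot g\left(\mathbf{W}_{c}\mathbf{x}_{t} + \mathbf{R}_{c}\mathbf{h}_{t-1} + \mathbf{b}_{c}\right) \\
\mathbf{o}_{t} &= \sigma\left(\mathbf{W}_{o}\mathbf{x}_{t} + \mathbf{R}_{o}\mathbf{h}_{t-1} + \mathbf{p}_{o}\odot\mathbf{c}_{t} + \mathbf{b}_{o}\right) \\
\mathbf{h}_{t} &= \mathbf{o}_{t}\odot g\left(\mathbf{c}_{t}\right)
\end{aligned}
$$

Here $\mathbf{x}_{t}$ is the input, $\mathbf{h}_{t}$ the hidden output, $\mathbf{i}_{t}$, $\mathbf{f}_{t}$ and $\mathbf{o}_{t}$ the input, forget and output gates, $\mathbf{W}_{*}$ and $\mathbf{R}_{*}$ the input and recurrent weight matrices, $\mathbf{p}_{*}$ the peep-hole weight vectors, $\mathbf{b}_{*}$ the biases, $\sigma$ the logistic sigmoid, $g$ the hyperbolic tangent, and $\odot$ element-wise multiplication.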

The central idea of the LSTM is the so-called memory cell $\mathbf{c}$, which maintains its state over time, together with the gating units that regulate the information flow into and out of the memory cell. More specifically, the input gate can allow the input signal to adjust the cell state or prevent it from doing so (e.g., when the input gate is set to zero); the output gate can allow the cell state to affect other neurons or block it; and the forget gate enables the cell to remember or forget its previous state. However, as discussed in , the architecture might not be optimal for all tasks, and the relative importance of each component is not at all clear.

Gated Recurrent Neural Networks

In this section, we present several variants of the LSTM and propose a simplified version that only has the forget gate; it therefore has significantly fewer parameters and lower computational cost. As these variants all share with the LSTM the concept of a memory cell with gates, we will call them gated recurrent neural networks.

To assess the importance of each component, we start with four variants of the LSTM architecture. Each removes one component from the LSTM architecture, so we can understand how much each component contributes to performance. The differences from the vanilla LSTM are:

NIG: no input gate.

NOG: no output gate.

NFG: no forget gate.

NPH: no peep-hole connections.

In the NFG variant, the past cell state will still contribute to the current cell state but without any controlling or scaling by the forget gate. Note that, when removing the input, forget or output gates, the number of parameters is reduced.
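The ablations can be made concrete with a minimal NumPy sketch of a single recurrent step. It is illustrative only: the parameter names (`W_*`, `R_*`, `p_*`, `b_*`), the random initialisation, the toy layer sizes, and the choice to replace a removed gate by a constant 1 are assumptions of this sketch rather than details confirmed by the text, although a constant forget gate of 1 matches the NFG behaviour described above.

```python
import numpy as np

VARIANTS = ("LSTM", "NIG", "NFG", "NOG", "NPH")


def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))


def init_params(n_in, n_hid, rng):
    """Small random parameters; all names are illustrative."""
    params = {}
    for name in ("i", "f", "o", "c"):
        params["W_" + name] = 0.1 * rng.standard_normal((n_hid, n_in))
        params["R_" + name] = 0.1 * rng.standard_normal((n_hid, n_hid))
        params["b_" + name] = np.zeros(n_hid)
    for name in ("i", "f", "o"):  # peep-hole weights (one vector per gate)
        params["p_" + name] = 0.1 * rng.standard_normal(n_hid)
    return params


def lstm_step(x, h_prev, c_prev, P, variant="LSTM"):
    """One step of a peephole LSTM, or of one ablated variant."""
    if variant not in VARIANTS:
        raise ValueError("unknown variant: " + variant)
    peep = 0.0 if variant == "NPH" else 1.0  # NPH drops peep-hole terms

    def gate(name, c):
        return sigmoid(P["W_" + name] @ x + P["R_" + name] @ h_prev
                       + peep * P["p_" + name] * c + P["b_" + name])

    ones = np.ones_like(c_prev)  # assumed stand-in for a removed gate
    i = ones if variant == "NIG" else gate("i", c_prev)
    f = ones if variant == "NFG" else gate("f", c_prev)
    z = np.tanh(P["W_c"] @ x + P["R_c"] @ h_prev + P["b_c"])  # block input
    c = f * c_prev + i * z
    o = ones if variant == "NOG" else gate("o", c)
    h = o * np.tanh(c)
    return h, c


rng = np.random.default_rng(0)
P = init_params(n_in=4, n_hid=3, rng=rng)
x, h0, c0 = rng.standard_normal(4), np.zeros(3), rng.standard_normal(3)
outputs = {v: lstm_step(x, h0, c0, P, v) for v in VARIANTS}
```

In this sketch the NFG cell state is simply the previous state plus the gated block input, so nothing ever scales down what the cell has accumulated.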

Gated Recurrent Unit (GRU)

As an alternative to the LSTM, the Gated Recurrent Unit (GRU) architecture was proposed in . In , the GRU was found to achieve better performance than the LSTM on some tasks. The GRU is formulated as:
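A commonly cited form of the GRU update is sketched below. The notation is an assumption chosen to parallel the LSTM description, and published GRU variants differ in where the reset gate is applied and in which of $\mathbf{z}_{t}$ and $1-\mathbf{z}_{t}$ weights the previous state:

$$
\begin{aligned}
\mathbf{r}_{t} &= \sigma\left(\mathbf{W}_{r}\mathbf{x}_{t} + \mathbf{R}_{r}\mathbf{h}_{t-1} + \mathbf{b}_{r}\right) \\
\mathbf{z}_{t} &= \sigma\left(\mathbf{W}_{z}\mathbf{x}_{t} + \mathbf{R}_{z}\mathbf{h}_{t-1} + \mathbf{b}_{z}\right) \\
\tilde{\mathbf{h}}_{t} &= g\left(\mathbf{W}_{h}\mathbf{x}_{t} + \mathbf{R}_{h}\left(\mathbf{r}_{t}\odot\mathbf{h}_{t-1}\right) + \mathbf{b}_{h}\right) \\
\mathbf{h}_{t} &= \mathbf{z}_{t}\odot\mathbf{h}_{t-1} + \left(1-\mathbf{z}_{t}\right)\odot\tilde{\mathbf{h}}_{t}
\end{aligned}
$$

Here $\mathbf{r}_{t}$ is the reset gate, $\mathbf{z}_{t}$ the update gate and $\tilde{\mathbf{h}}_{t}$ the candidate activation. Unlike the LSTM, there is no separate memory cell: the hidden output $\mathbf{h}_{t}$ itself carries the state.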

Simplified LSTM (S-LSTM)

As we will see in the experiments reported in the next section, the input gate, output gate and peep-hole connections can be removed without degrading speech synthesis performance significantly. Hence, we propose an even simpler variant that removes the output gate and peep-hole connections, and replaces the input gate with the complement of the forget gate, $1-\mathbf{f}_{t}$. In this way, only the forget gate is retained. This simplest variant can be written as:
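Following that description, and using assumed symbols ($\mathbf{W}_{*}$, $\mathbf{R}_{*}$ and $\mathbf{b}_{*}$ for input weights, recurrent weights and biases, $\sigma$ for the logistic sigmoid and $g$ for the hyperbolic tangent), one way to write the simplified update is:

$$
\begin{aligned}
\mathbf{f}_{t} &= \sigma\left(\mathbf{W}_{f}\mathbf{x}_{t} + \mathbf{R}_{f}\mathbf{h}_{t-1} + \mathbf{b}_{f}\right) \\
\mathbf{c}_{t} &= \mathbf{f}_{t}\odot\mathbf{c}_{t-1} + \left(1-\mathbf{f}_{t}\right)\odot g\left(\mathbf{W}_{c}\mathbf{x}_{t} + \mathbf{R}_{c}\mathbf{h}_{t-1} + \mathbf{b}_{c}\right) \\
\mathbf{h}_{t} &= g\left(\mathbf{c}_{t}\right)
\end{aligned}
$$

Keeping the output non-linearity $g$ once the output gate is removed is an assumption of this sketch.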

The simplified architecture is similar to the GRU, except that it uses a memory cell state. The cell state is controlled by the forget gate only, which trades off between the past cell state and the current block input. When the activation of the forget gate is small, the cell state will mainly depend on the block input; otherwise, it will mainly copy the past cell state.
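The trade-off described above can be checked numerically with a small NumPy sketch. All parameter names, the toy sizes and the use of a large forget-gate bias to force the gate towards 0 or 1 are assumptions made for illustration; this is not the authors' implementation.

```python
import numpy as np


def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))


def slstm_step(x, h_prev, c_prev, W_f, R_f, b_f, W_c, R_c, b_c):
    """One step of a forget-gate-only cell (illustrative sketch)."""
    f = sigmoid(W_f @ x + R_f @ h_prev + b_f)    # forget gate
    z = np.tanh(W_c @ x + R_c @ h_prev + b_c)    # block input
    c = f * c_prev + (1.0 - f) * z               # 1 - f acts as the input gate
    h = np.tanh(c)                               # no output gate
    return h, c, f, z


rng = np.random.default_rng(1)
n_in, n_hid = 4, 3
W_f, W_c = (0.1 * rng.standard_normal((n_hid, n_in)) for _ in range(2))
R_f, R_c = (0.1 * rng.standard_normal((n_hid, n_hid)) for _ in range(2))
b_c = np.zeros(n_hid)
x = rng.standard_normal(n_in)
h_prev = np.zeros(n_hid)
c_prev = rng.standard_normal(n_hid)

# Large positive bias: forget gate close to 1, so the cell copies its past state.
_, c_keep, _, _ = slstm_step(x, h_prev, c_prev, W_f, R_f,
                             np.full(n_hid, 20.0), W_c, R_c, b_c)

# Large negative bias: forget gate close to 0, so the cell follows the block input.
_, c_reset, _, z_reset = slstm_step(x, h_prev, c_prev, W_f, R_f,
                                    np.full(n_hid, -20.0), W_c, R_c, b_c)
```

Because the two weights on the past state and the block input always sum to one, the new cell state is a per-unit convex combination of the two.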

Experiments

A corpus from a British male speaker was employed in our experiments, divided into three subsets: training, development and testing (2400, 70 and 72 utterances, respectively). The sampling rate was 48 kHz, and we used the STRAIGHT vocoder to extract 60-dimensional Mel-Cepstral Coefficients (MCCs), 25 band aperiodicities (BAPs), and fundamental frequency ($F_{0}$) on a log scale, all at a 5 ms frame step. All systems used the same acoustic features. $F_{0}$ was linearly interpolated before modelling and a binary voiced/unvoiced feature was used to record voicing information. Dynamic features for MCCs, BAPs and $F_{0}$ were also computed. The acoustic features were mean-variance normalised before modelling, and the mean and variance were restored at generation time. The maximum likelihood parameter generation algorithm was then applied to smooth the parameter trajectories.
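Two of the preprocessing steps above, $F_{0}$ interpolation with a voicing flag and mean-variance normalisation, can be sketched as follows. The sketch assumes that unvoiced frames are marked by an $F_{0}$ value of 0 and that interpolation is done in the log domain; neither detail is stated in the text, and the function names are hypothetical.

```python
import numpy as np


def interpolate_f0(f0):
    """Return (log-F0 with unvoiced gaps linearly interpolated, V/UV flags).

    Assumes unvoiced frames are marked with f0 == 0 (an assumption).
    Leading and trailing unvoiced frames take the nearest voiced value.
    """
    f0 = np.asarray(f0, dtype=float)
    voiced = f0 > 0
    if not voiced.any():
        raise ValueError("no voiced frames to interpolate from")
    idx = np.arange(len(f0))
    lf0 = np.interp(idx, idx[voiced], np.log(f0[voiced]))
    return lf0, voiced.astype(float)


def fit_mvn(features):
    """Per-dimension mean and standard deviation of a (frames, dims) array."""
    mean = features.mean(axis=0)
    std = features.std(axis=0)
    std[std == 0] = 1.0  # avoid division by zero for constant dimensions
    return mean, std


def apply_mvn(features, mean, std):
    return (features - mean) / std


def undo_mvn(normalised, mean, std):
    """Restore mean and variance, as done at generation time."""
    return normalised * std + mean


f0 = [0.0, 100.0, 0.0, 0.0, 200.0, 0.0]   # toy contour in Hz, 0 = unvoiced
lf0, vuv = interpolate_f0(f0)
feats = lf0[:, None]
mean, std = fit_mvn(feats)
restored = undo_mvn(apply_mvn(feats, mean, std), mean, std)
```

In practice the statistics would be estimated on the training set only and reused for the development and test sets.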

All systems used the same 601 input linguistic features. Of these, 592 are binary features derived from linguistic context, such as quin-phone identities, part-of-speech, positional information of phoneme, syllable, word and phrase, and the number of syllables, words and phrases. The remaining 9 numerical features capture frame position information, e.g., frame position in HMM state and phoneme. Linguistic features were normalised to the range [0.01, 0.99] before modelling.
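The text gives the target range but not the method; a per-dimension min-max scaling is one natural reading, sketched here as an assumption with hypothetical function names.

```python
import numpy as np


def fit_minmax(features):
    """Per-dimension minimum and maximum of a (frames, dims) array."""
    return features.min(axis=0), features.max(axis=0)


def apply_minmax(features, lo, hi, target=(0.01, 0.99)):
    """Linearly map each dimension from [lo, hi] to the target range.

    Constant dimensions (hi == lo) are mapped to the lower bound.
    """
    span = np.where(hi > lo, hi - lo, 1.0)
    unit = (features - lo) / span
    a, b = target
    return a + unit * (b - a)


# Toy example: one binary feature and one numerical position feature.
X = np.array([[0.0, 2.0],
              [1.0, 4.0],
              [1.0, 6.0]])
lo, hi = fit_minmax(X)
X_norm = apply_minmax(X, lo, hi)
```

Under this mapping a binary feature takes the values 0.01 and 0.99 instead of 0 and 1.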

In all RNNs, we employed a three-layer feed-forward neural network at the bottom, with the gated recurrent layer on top of it. The bottom feed-forward layers were intended to act as feature extraction layers, each with 512 hidden units using the hyperbolic tangent activation function. All RNN implementations used 256 units (e.g., LSTM blocks) in the recurrent layer. Hyperparameters for each system were optimised on the development set. We fixed the momentum, and only tuned learning rates.

Analysis of LSTM

We first visualised the forget gate and cell state, which are thought to be the two most important components in modelling long-term temporal structure. The forget gate activations, averaged over the 256 units, are presented as a function of the frame index in Fig. 1. The red solid line is the averaged forget gate activation; blue dashed lines show phoneme boundaries. It is clear that the peaks of the forget gate activation trajectory have a strong correspondence with the phoneme boundaries; within a phoneme, the contribution of the past cell state decays linearly. The forget gate is capturing some important temporal structure of speech; this is not surprising, since the phoneme boundaries are explicitly represented in the input linguistic features.
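The averaging step behind such a plot is simple to sketch. The activation matrix below is toy data, the boundary indices are hypothetical, and the peak picker is a deliberately naive stand-in for visual inspection; none of it reproduces the figure.

```python
import numpy as np


def mean_gate_activation(gate_activations):
    """Average a (n_frames, n_units) activation matrix over units."""
    a = np.asarray(gate_activations, dtype=float)
    if a.ndim != 2:
        raise ValueError("expected a (n_frames, n_units) array")
    return a.mean(axis=1)


def local_peaks(curve):
    """Indices of strict local maxima, excluding the two end points."""
    c = np.asarray(curve, dtype=float)
    inner = (c[1:-1] > c[:-2]) & (c[1:-1] > c[2:])
    return np.flatnonzero(inner) + 1


# Toy data: 7 frames, 2 units (the paper's recurrent layer has 256 units).
acts = np.array([[0.2, 0.4],
                 [0.5, 0.7],
                 [0.3, 0.3],
                 [0.2, 0.2],
                 [0.8, 0.6],
                 [0.4, 0.2],
                 [0.1, 0.3]])
curve = mean_gate_activation(acts)
peaks = local_peaks(curve)
boundaries = np.array([1, 4])              # hypothetical phoneme boundaries
hits = np.intersect1d(peaks, boundaries)   # peaks that fall on a boundary
```

With real activations, the averaged curve would be plotted against the frame index with the phoneme boundaries from the forced alignment overlaid.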

Objective results

Even though objective measures might not always correlate with human perception, they offer a way to tune the systems and roughly predict model performance. The objective results are in Table 2. Compared to LSTM, NIG, NOG and NPH all achieve similar objective distortion, with considerably fewer parameters and lower generation time: the input gate, output gate and peep-hole connections are not necessary. The NFG system increases distortion considerably: the forget gate is important. This finding is consistent with .

The GRU system achieves similar performance to the LSTM system: even though it has even fewer parameters, it performs as well as NIG, NOG or NPH. This is also consistent with studies on other tasks. Although S-LSTM slightly increases MCD distortion from 4.14 dB to 4.19 dB compared to LSTM, it achieves similar performance on the other measures. The S-LSTM has about half the number of parameters in its recurrent layer compared to the LSTM, and reduces generation time from 214 seconds to 154 seconds. The generation time is the total time to generate all 142 utterances in the development and testing sets.
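The MCD figures quoted above are in dB. The text does not spell out the formula, so the sketch below uses a widely used definition as an assumption; whether the 0th (energy) coefficient is included, and how silence frames are handled, are left to the caller.

```python
import numpy as np


def mel_cepstral_distortion(ref, pred):
    """Mean mel-cepstral distortion in dB over time-aligned frames.

    ref, pred: arrays of shape (n_frames, n_coeffs) holding natural and
    predicted mel-cepstral coefficients for the same frames.
    """
    ref = np.asarray(ref, dtype=float)
    pred = np.asarray(pred, dtype=float)
    if ref.shape != pred.shape:
        raise ValueError("ref and pred must have the same shape")
    sq_err = np.sum((ref - pred) ** 2, axis=1)
    per_frame = (10.0 / np.log(10.0)) * np.sqrt(2.0 * sq_err)
    return float(per_frame.mean())


# Toy check: an error of 1.0 in a single coefficient of every frame.
ref = np.zeros((5, 3))
pred = ref.copy()
pred[:, 0] = 1.0
mcd = mel_cepstral_distortion(ref, pred)
```

For the toy input the result equals 10·sqrt(2)/ln(10), roughly 6.14 dB, which gives a sense of scale for the 0.05 dB difference between the LSTM and the S-LSTM.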

In summary, the S-LSTM has the smallest number of parameters and achieves the fastest generation, whilst achieving similar objective results to the LSTM and GRU architectures.
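The "about half" figure and the generation-time saving can be checked with simple arithmetic. The counts below assume a 512-dimensional input to the recurrent layer (the size of the feed-forward layers), 256 recurrent units, one bias vector per gate or block input, and one peep-hole vector per gate; they are estimates under those assumptions and may differ from the values in Table 2.

```python
def lstm_recurrent_params(n_in, n_hid, peepholes=True):
    """Parameters of a peephole LSTM layer: 3 gates plus the block input."""
    per_block = n_in * n_hid + n_hid * n_hid + n_hid   # W, R and bias
    total = 4 * per_block
    if peepholes:
        total += 3 * n_hid                             # one vector per gate
    return total


def slstm_recurrent_params(n_in, n_hid):
    """Parameters of the forget-gate-only layer: 1 gate plus the block input."""
    return 2 * (n_in * n_hid + n_hid * n_hid + n_hid)


n_in, n_hid = 512, 256
lstm_count = lstm_recurrent_params(n_in, n_hid)     # 788224
slstm_count = slstm_recurrent_params(n_in, n_hid)   # 393728
ratio = slstm_count / lstm_count                    # just under 0.5

# Reported total generation times for the 142 utterances, in seconds.
time_saving = 1.0 - 154.0 / 214.0                   # about 0.28
```

Under these assumptions the S-LSTM recurrent layer has just under half the parameters of the LSTM layer, while the reported generation time drops by about 28%, presumably because the shared feed-forward layers and the rest of the generation pipeline are unchanged.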

Subjective results

Subjective preference tests were conducted using 30 paid native English speakers. Each listener was asked to listen to 20 pairs of synthesised utterances. The sentence was the same in both items within a pair, and was randomly selected from the 72 test sentences (samples are available at http://homepages.inf.ed.ac.uk/zwu2/demo/icassp16/lstm.html). For each pair, the listener was asked to decide which one sounded more natural; a “neutral” option was allowed if the listener had no preference.

Preference results are in Table 1. Comparing against the LSTM system, all the systems except NFG show no significant difference in preference.

The NFG system achieves only a 20.3% preference score when paired against the LSTM, which is preferred 74.3% of the time. As with the objective results in Table 2, we conclude that the forget gate is the only critical component in the LSTM architecture; the input gate, output gate and peep-hole connections can be omitted.

We also compared the proposed S-LSTM system against all other systems (except NFG, since it is worse than LSTM). Consistent with the objective results, the subjective results also demonstrate that S-LSTM is as good as any other system.

Conclusions

We have analysed the forget gate and cell state of the LSTM architecture, and examined the performance of several variants of LSTM. We conclude that:

The forget gate can learn the temporal structure of speech; its activations have a high correspondence with phone boundaries.

The memory cell maintains a state over time, which matches the shape of the trajectory to be predicted.

For this task, the forget gate is the only critical component of the LSTM; other components can be omitted with no reduction in naturalness.

From these results, we propose a simplified LSTM architecture that only uses the critical forget gate. The simplified LSTM has significantly fewer parameters than the vanilla LSTM, but achieves similar performance in both objective and subjective evaluations.

Acknowledgements: This research was supported by EPSRC Programme Grant EP/I031022/1, Natural Speech Technology (NST). The NST research data collection may be accessed at http://datashare.is.ed.ac.uk/handle/10283/786.