Neural source-filter-based waveform model for statistical parametric speech synthesis
Xin Wang, Shinji Takaki, Junichi Yamagishi
Introduction
Text-to-speech (TTS) synthesis, a technology that converts texts into speech waveforms, has been advanced by using end-to-end architectures and neural-network-based waveform models . Among those waveform models, the WaveNet directly models the distributions of waveform sampling points and has demonstrated outstanding performance. The vocoder version of WaveNet , which converts the acoustic features into the waveform, also outperformed other vocoders for the pipeline TTS systems .
As an autoregressive (AR) model, the WaveNet is quite slow in waveform generation because it has to generate the waveform sampling points one by one. To improve the generation speed, the Parallel WaveNet and the ClariNet introduce a distilling method to transfer ‘knowledge’ from a teacher AR WaveNet to a student non-AR model that simultaneously generates all the waveform sampling points. However, the concatenation of two large models and the mix of distilling and other training criteria reduce the model interpretability and raise the implementation cost.
In this paper, we propose a neural source-filter waveform model that converts acoustic features into speech waveforms. Inspired by classical speech modeling methods , we used a source module to generate a sine-based excitation signal with a specified fundamental frequency (F0). We then used a dilated-convolution-based filter module to transform the sine-based excitation into the speech waveform. The proposed model was trained by minimizing spectral amplitude and phase distances, which can be efficiently implemented using discrete Fourier transforms (DFTs). Because the proposed model is a non-AR model, it generates waveforms much faster than the AR WaveNet. A large-scale listening test showed that the proposed model was close to the AR WaveNet in terms of the Mean opinion score (MOS) on the quality of synthetic speech. An ablation test showed that both the sine-wave excitation and the spectral amplitude distance were crucial to the proposed model.
The model structure and training criteria are explained in Section 2, after which the experiments are described in Section 3. Finally, this paper is summarized and concluded in Section 4.
Proposed model and training criteria
The condition module takes as input the acoustic feature sequence , where each contains the F0 and the spectral features of the -th speech frame. The condition module upsamples the F0 by duplicating to every time step within the -th frame and feeds the upsampled F0 sequence to the source module. Meanwhile, it processes using a bi-directional recurrent layer with long-short-term memory (LSTM) units and a convolutional (CONV) layer, after which the processed features are upsampled and sent to the filter module. The LSTM and CONV were used so that the condition module was similar to that of the WaveNet-vocoder in the experiment. They can be replaced with a feedforward layer in practice.
1.2 Source module
where is a Gaussian noise, is a random initial phase, and is equal to the waveform sampling rate.
Although we can directly set , we tried two additional tricks. First, a ‘best’ phase for can be determined in the training stage by maximizing the correlation between and the natural waveform . During generation, is randomly generated. The second method is to generate harmonics by increasing in Equation (1) and use a feedforward (FF) layer to merge the harmonics and into . In this paper we use 7 harmonics and set and .
1.3 Neural filter module
Given the excitation signal from the source module and the processed acoustic features from the condition module, the filter module modulates using multiple stages of dilated convolution and affine transformations similar to those in ClariNet . For example, the first stage takes and the processed acoustic features as input and produces two signals and using dilated convolution. The is then transformed using , where denotes element-wise multiplication. The transformed signal is further processed in the following stages, and the output of the final stage is used as generated waveform .
Unlike ClariNet or Parallel WaveNet, the proposed model does not use the distilling method. It is unnecessary to compute the mean and standard deviation of the transformed signal. Neither is it necessary to form the convolution and transformation blocks as an inverse autoregressive flow .
2 Training criteria in frequency domain
Because speech perception heavily relies on acoustic cues in the frequency domain, we define training criteria that minimize the spectral amplitude and phase distances, which can be implemented using DFTs. Given these criteria, the proposed model is trained using the stochastic gradient descent (SGD) method.
Suppose the waveform is sliced into frames. Then the log spectral amplitude distance is defined as follows:
where and denote the real and imaginary parts of a complex number, respectively.
The Hermitian symmetry of is satisfied if is carefully defined. For example, can be the square error or Kullback-Leibler divergence (KLD) of the spectral amplitudes . The phase distance defined below also satisfies the requirement.
2.2 Phase distance
Given the spectra, a phase distance is computed as
where and are the phases of and , respectively. The gradient can be computed by the same procedure as . Multiple s and s with different framing and DFT configurations can be added up as the ultimate training criterion . For different s, additional DFT/iDFT and framing/windowing blocks should be added to the model in Figure 1.
Experiments
This study used the same Japanese speech corpus and data division recipe as our previous study . This corpus contains neutral reading speech uttered by a female speaker. Both validation and test sets contain 480 randomly selected utterances. Among the 48-hour training data, 9,000 randomly selected utterances (15 hours) were used as the training set in this study. For the ablation test in Section 3.3, the training set was further reduced to 3,000 utterances (5 hours). Acoustic features, including 60 dimensions of Mel-generalized cepstral coefficients (MGCs) and 1 dimension of F0, were extracted from the 48 kHz waveforms at a frame shift of 5 ms using WORLD . The natural waveforms were then downsampled to 16 kHz for model training and the listening test.
2 Comparison of proposed model, WaveNet, and WORLD
The first experiment compared the four models listed in Table 2The models were implemented using a modified CURRENNT toolkit on a single P100 Nvidia GPU card. Codes, recipes, and generated speech can be found on https://nii-yamagishilab.github.io.. The WAD model, which was trained in our previous study , contained a condition module, a post-processing module, and 40 dilated CONV blocks, where the -th CONV block had a dilation size of . WAC was similar to WAD but used a Gaussian distribution to model the raw waveform at the output layer .
The proposed NSF contained 5 stages of dilated CONV and transformation, each stage including 10 convolutional layers with a dilation size of and a filter size of 3. Its condition module was the same as that of WAD and WAC. NSF was trained using , and the configuration of each is listed in Table 1. The phase distance was not used in this test.
Each model generated waveforms using natural and generated acoustic features, where the generated acoustic features were produced by the acoustic models in our previous study . The generated and natural waveforms were then evaluated by paid native Japanese speakers. In each evaluation round the evaluator listened to one speech waveform in each screen and rated the speech quality on a 1-to-5 MOS scale. The evaluator can take at most 10 evaluation rounds and can replay the sample during evaluation. The waveforms in an evaluation round were for the same text and were played in a random order. Note that the waveforms generated from NSF and WAC were converted to 16-bit PCM format before evaluation.
A total of 245 evaluators conducted 1444 valid evaluation rounds in all, and the results are plotted in Figure 2. Two-sided Mann-Whitney tests showed that the difference between any pair of models is statistically significant () except NSF and WAC when the two models used generated acoustic features. In general, NSF outperformed WOR and WAC but performed slightly worse than WAD. The gap of the mean MOS scores between NSF and WAD was about 0.12, given either natural or generated acoustic features. A possible reason for this result may be the difference between the non-AR and AR model structures, which is similar to the difference between the finite and infinite impulse response filters. WAC performed worse than WAD because some syllables were perceived to be trembling in pitch, which may be caused by the random sampling generation method. WAD alleviated this artifact by using a one-best generation method in voiced regions .
After the MOS test, we compared the waveform generation speed of NSF and WAD. The implementation of NSF has a normal generation mode and a memory-save one. The normal mode allocates all the required GPU memory once but cannot generate waveforms longer than 6 seconds because of the insufficient memory space in a single GPU card. The memory-save mode can generate long waveforms because it releases and allocates the memory layer by layer, but the repeated memory operations are time consuming.
We evaluated NSF using both modes on a smaller test set, in which each of the 80 generated test utterances was around 5 seconds long. As the results in Table 3 show, NSF is much faster than WAD. Note that WAD allocates and re-uses a small size of GPU memory, which needs no repeated memory operation. WAD is slow mainly because of the AR generation process. Of course, both WAD and NSF can be improved if our toolkit is further optimized. Particularly, if the memory operation can be sped up, the memory-save mode of NSF will be much faster.
3 Ablation test on proposed model
This experiment was an ablation test on NSF. Specifically, the 11 variants of NSF listed in Table 4 were trained using the 5-hour training set. For a fair comparison, NSF was re-trained using the 5-hour data, and this variant is referred to as NSF. The speech waveforms were generated given the natural acoustic features and rated in 1444 evaluation rounds by the same group of evaluators in Section 3.2. This test excluded natural waveform for evaluation.
The results are plotted in Figure 4. The difference between NSTs and any other model except S2 was statistically significant (). Comparison among NSTs, L1, L2, and L3 shows that using multiple s listed in Table 1 is beneficial. For L3 that used only , the generated waveform points clustered around one peak in each frame, and the waveform suffered from a pulse-train noise. This can be observed from L3 of Figure 3, whose spectrogram in the high frequency band shows more clearly vertical strips than other models. Accordingly, this artifact can be alleviated by adding with a frame length of 5 ms for model training, which explained the improvement in L1. Using phase distance (L4) didn’t improve the speech quality even though the value of the phase distance was consistently decreased on both training and validation data.
The good result of S2 indicates that a single sine-wave excitation with a random initial phase also works. Without the sine-wave excitation, S3 generated waveforms that were intelligible but lacked stable harmonic structure. N1 slightly outperformed NSFs while N2 produced unstable harmonic structures. Because the transformation in N1 is equivalent to skip-connection , the result indicates that the skip-connection may help the model training.
Conclusion
In this paper, we proposed a neural waveform model with separated source and filter modules. The source module produces a sine-wave excitation signal with a specified F0, and the filter module uses dilated convolution to transform the excitation into a waveform. Our experiment demonstrated that the sine-wave excitation was essential for generating waveforms with harmonic structures. We also found that multiple spectral-based training criteria and the transformation in the filter module contributed to the performance of the proposed model. Compared with the AR WaveNet, the proposed model generated speech with a similar quality at a much faster speed.
The proposed model can be improved in many aspects. For example, it is possible to simplify the dilated convolution blocks. It is also possible to try classical speech modeling methods, including glottal waveform excitations , two-bands or multi-bands approaches on waveforms. When applying the model to convert linguistic features into the waveform, we observed the over-smoothing affect in the high-frequency band and will investigate the issue in the future work.
References
Appendix A Forward computation
The framing and windowing operation is also parallelized over using for_each command in CUDA/Thrust. However, for explanation, let’s use the matrix operation in Figure 6. In other words, we compute
where is the element in the \big{[}(n-1)\times{M}+m\big{]}-th row and the -th column of the transformation matrix .
A.2 DFT
Our implementation uses cuFFT (cufftExecR2C and cufftPlan1d) https://docs.nvidia.com/cuda/cufft/index.html to compute in parallel.
Appendix B Backward computation
where are the framing/windowing coefficients. This equation explains what we mean by saying ‘ can be easily accumulated the relationship between and each has been determined by the framing and windowing operations’.
Our implementation uses CUDA/Thrust for_each command to launch threads and compute in parallel. Because is only used in a few frames and there is only one for each , Equation (5) can be optimized as
where is the frame range that appears, and is the position of in the -th frame.
where . Note that, although the sum should be , the summation over the zero-padded part can be safely ignored Although we can avoid zero-padding by setting , in practice is usually the power of 2 while the frame length is not..
Suppose we compute a log spectral amplitude distance over the frames as
Because , , , and are real-valued numbers, we can compute the gradient using the chain rule:
Once we compute for each and , we can use Equation (6) to compute the gradient .
B.3 Implementation of the 1st step using inverse DFT
Because and are real numbers, we can directly implement Equation (11) using matrix multiplication. However, a more efficient way is to use inverse DFT (iDFT).
For the first term in Line 15, we can write
Note that in Line (17), , and .
It is easy to those that Line (20) is equal to 0 if , for any . Similarly, it can be shown that if and . When these two terms are equal to 0, the imaginary part in Line (15) will be 0, and in Line (12) will be a real number.
To summarize, if satisfies the conditions below
inverse DFT of will be real-valued:
This is a basic concept in signal processing: the iDFT of a conjugate-symmetric (Hermitian)It should be called circular conjugate symmetry in strict sense signal will be a real-valued signal.
We can observer from Equation (23) and (11) that, if \Big{[}\frac{\partial\mathcal{L}}{\partial\texttt{Re}(\widehat{y}_{1}^{(n)})}+j\frac{\partial\mathcal{L}}{\partial\texttt{Im}(\widehat{y}_{1}^{(n)})},\cdots,\frac{\partial\mathcal{L}}{\partial\texttt{Re}(\widehat{y}_{K}^{(n)})}+j\frac{\partial\mathcal{L}}{\partial\texttt{Im}(\widehat{y}_{K}^{(n)})}\Big{]}^{\top} is conjugate-symmetric, the gradient vector be computed using iDFT:
Note that are the gradients w.r.t to the zero-padded part, which will not be used and can safely set to 0. The iDFT of a conjugate symmetric signal can be executed using cuFFT cufftExecC2R command. It is more efficient than other implementations of Equation (11) because
there is no need to compute the imaginary part;
there is no need to compute and allocate GPU memory for where because of the conjugate symmetry;
iDFT can be executed for all the frames in parallel.
B.4 Conjugate symmetry complex-valued gradient vector
The conjugate symmetry of \Big{[}\frac{\partial\mathcal{L}}{\partial\texttt{Re}(\widehat{y}_{1}^{(n)})}+j\frac{\partial\mathcal{L}}{\partial\texttt{Im}(\widehat{y}_{1}^{(n)})},\cdots,\frac{\partial\mathcal{L}}{\partial\texttt{Re}(\widehat{y}_{K}^{(n)})}+j\frac{\partial\mathcal{L}}{\partial\texttt{Im}(\widehat{y}_{K}^{(n)})}\Big{]}^{\top} is satisfied if is carefully chosen. Luckily, most of the common distance metrics can be used.
Given the log spectral amplitude distance in Equation (9), we can compute
Because is the DFT spectrum of the real-valued signal, is conjugate symmetric, and and satisfy the condition in Equations (21) and (22), respectively. Because the amplitude does not change the symmetry, and also satisfy the conditions in Equations (21) and (22), respectively, and \Big{[}\frac{\partial\mathcal{L}_{s}}{\partial\texttt{Re}(\widehat{y}_{1}^{(n)})}+j\frac{\partial\mathcal{L}_{s}}{\partial\texttt{Im}(\widehat{y}_{1}^{(n)})},\cdots,\frac{\partial\mathcal{L}_{s}}{\partial\texttt{Re}(\widehat{y}_{K}^{(n)})}+j\frac{\partial\mathcal{L}_{s}}{\partial\texttt{Im}(\widehat{y}_{K}^{(n)})}\Big{]}^{\top} is conjugate-symmetric.
B.4.2 Phase distance
Let and to be the phases of and , respectively. Then, the phase distance is defined as
Because both and are conjugate-symmetric, it can be easily observed that and satisfy the condition in Equations (21) and (22), respectively.
Appendix C Multiple distance metrics
Different distance metrics can be merged easily. For example, we can define