Neural source-filter-based waveform model for statistical parametric speech synthesis

Xin Wang, Shinji Takaki, Junichi Yamagishi

Introduction

Text-to-speech (TTS) synthesis, a technology that converts texts into speech waveforms, has been advanced by using end-to-end architectures and neural-network-based waveform models . Among those waveform models, the WaveNet directly models the distributions of waveform sampling points and has demonstrated outstanding performance. The vocoder version of WaveNet , which converts the acoustic features into the waveform, also outperformed other vocoders for the pipeline TTS systems .

As an autoregressive (AR) model, the WaveNet is quite slow in waveform generation because it has to generate the waveform sampling points one by one. To improve the generation speed, the Parallel WaveNet and the ClariNet introduce a distilling method to transfer ‘knowledge’ from a teacher AR WaveNet to a student non-AR model that simultaneously generates all the waveform sampling points. However, the concatenation of two large models and the mix of distilling and other training criteria reduce the model interpretability and raise the implementation cost.

In this paper, we propose a neural source-filter waveform model that converts acoustic features into speech waveforms. Inspired by classical speech modeling methods , we used a source module to generate a sine-based excitation signal with a specified fundamental frequency (F0). We then used a dilated-convolution-based filter module to transform the sine-based excitation into the speech waveform. The proposed model was trained by minimizing spectral amplitude and phase distances, which can be efficiently implemented using discrete Fourier transforms (DFTs). Because the proposed model is a non-AR model, it generates waveforms much faster than the AR WaveNet. A large-scale listening test showed that the proposed model was close to the AR WaveNet in terms of the Mean opinion score (MOS) on the quality of synthetic speech. An ablation test showed that both the sine-wave excitation and the spectral amplitude distance were crucial to the proposed model.

The model structure and training criteria are explained in Section 2, after which the experiments are described in Section 3. Finally, this paper is summarized and concluded in Section 4.

Proposed model and training criteria

The condition module takes as input the acoustic feature sequence c1:B={c1,,cB}\bm{c}_{1:B}=\{\bm{c}_{1},\cdots,\bm{c}_{B}\}, where each cb=[fb,sb]\bm{c}_{b}=[f_{b},\bm{s}_{b}^{\top}]^{\top} contains the F0 fbf_{b} and the spectral features sb\bm{s}_{b} of the bb-th speech frame. The condition module upsamples the F0 by duplicating fbf_{b} to every time step within the bb-th frame and feeds the upsampled F0 sequence f1:T\bm{f}_{1:T} to the source module. Meanwhile, it processes c1:B\bm{c}_{1:B} using a bi-directional recurrent layer with long-short-term memory (LSTM) units and a convolutional (CONV) layer, after which the processed features are upsampled and sent to the filter module. The LSTM and CONV were used so that the condition module was similar to that of the WaveNet-vocoder in the experiment. They can be replaced with a feedforward layer in practice.

1.2 Source module

where ntN(0,σ2){n}_{t}\sim\mathcal{N}(0,\sigma^{2}) is a Gaussian noise, ϕ[π,π]\phi\in[-\pi,\pi] is a random initial phase, and NsN_{s} is equal to the waveform sampling rate.

Although we can directly set e1:T=e1:T<0>\bm{e}_{1:T}=\bm{e}_{1:T}^{<0>}, we tried two additional tricks. First, a ‘best’ phase ϕ\phi^{*} for e1:T<0>\bm{e}_{1:T}^{<0>} can be determined in the training stage by maximizing the correlation between e1:T<0>\bm{e}_{1:T}^{<0>} and the natural waveform o1:T\bm{o}_{1:T}. During generation, ϕ\phi is randomly generated. The second method is to generate harmonics by increasing fkf_{k} in Equation (1) and use a feedforward (FF) layer to merge the harmonics and e1:T<0>\bm{e}_{1:T}^{<0>} into e1:T\bm{e}_{1:T}. In this paper we use 7 harmonics and set σ=0.003\sigma=0.003 and α=0.1\alpha=0.1.

1.3 Neural filter module

Given the excitation signal e1:T{\bm{e}_{1:T}} from the source module and the processed acoustic features from the condition module, the filter module modulates e1:T{\bm{e}_{1:T}} using multiple stages of dilated convolution and affine transformations similar to those in ClariNet . For example, the first stage takes e1:T{\bm{e}_{1:T}} and the processed acoustic features as input and produces two signals a1:T\bm{a}_{1:T} and b1:T\bm{b}_{1:T} using dilated convolution. The e1:T{\bm{e}_{1:T}} is then transformed using e1:Tb1:T+a1:T{\bm{e}_{1:T}}\odot{\bm{b}_{1:T}}+\bm{a}_{1:T}, where \odot denotes element-wise multiplication. The transformed signal is further processed in the following stages, and the output of the final stage is used as generated waveform o^1:T\widehat{\bm{o}}_{1:T}.

Unlike ClariNet or Parallel WaveNet, the proposed model does not use the distilling method. It is unnecessary to compute the mean and standard deviation of the transformed signal. Neither is it necessary to form the convolution and transformation blocks as an inverse autoregressive flow .

2 Training criteria in frequency domain

Because speech perception heavily relies on acoustic cues in the frequency domain, we define training criteria that minimize the spectral amplitude and phase distances, which can be implemented using DFTs. Given these criteria, the proposed model is trained using the stochastic gradient descent (SGD) method.

Suppose the waveform is sliced into NN frames. Then the log spectral amplitude distance is defined as follows:

where Re()\texttt{Re}(\cdot) and Im()\texttt{Im}(\cdot) denote the real and imaginary parts of a complex number, respectively.

The Hermitian symmetry of g(n)\bm{g}^{(n)} is satisfied if Ls\mathcal{L}_{s} is carefully defined. For example, Ls\mathcal{L}_{s} can be the square error or Kullback-Leibler divergence (KLD) of the spectral amplitudes . The phase distance defined below also satisfies the requirement.

2.2 Phase distance

Given the spectra, a phase distance is computed as

where θ^k(n)\widehat{\theta}^{(n)}_{k} and θk(n)\theta^{(n)}_{k} are the phases of y^k(n)\widehat{y}^{(n)}_{k} and yk(n){y}^{(n)}_{k}, respectively. The gradient Lpo^1:T\frac{\partial{\mathcal{L}_{p}}}{\partial{\widehat{\bm{o}}_{1:T}}} can be computed by the same procedure as Lso^1:T\frac{\partial{\mathcal{L}_{s}}}{\partial{\widehat{\bm{o}}_{1:T}}}. Multiple Lp\mathcal{L}_{p}s and Ls\mathcal{L}_{s}s with different framing and DFT configurations can be added up as the ultimate training criterion L\mathcal{L}. For different L\mathcal{L}_{*}s, additional DFT/iDFT and framing/windowing blocks should be added to the model in Figure 1.

Experiments

This study used the same Japanese speech corpus and data division recipe as our previous study . This corpus contains neutral reading speech uttered by a female speaker. Both validation and test sets contain 480 randomly selected utterances. Among the 48-hour training data, 9,000 randomly selected utterances (15 hours) were used as the training set in this study. For the ablation test in Section 3.3, the training set was further reduced to 3,000 utterances (5 hours). Acoustic features, including 60 dimensions of Mel-generalized cepstral coefficients (MGCs) and 1 dimension of F0, were extracted from the 48 kHz waveforms at a frame shift of 5 ms using WORLD . The natural waveforms were then downsampled to 16 kHz for model training and the listening test.

2 Comparison of proposed model, WaveNet, and WORLD

The first experiment compared the four models listed in Table 2The models were implemented using a modified CURRENNT toolkit on a single P100 Nvidia GPU card. Codes, recipes, and generated speech can be found on https://nii-yamagishilab.github.io.. The WAD model, which was trained in our previous study , contained a condition module, a post-processing module, and 40 dilated CONV blocks, where the kk-th CONV block had a dilation size of 2modulo(k,10)2^{\text{modulo}(k,10)}. WAC was similar to WAD but used a Gaussian distribution to model the raw waveform at the output layer .

The proposed NSF contained 5 stages of dilated CONV and transformation, each stage including 10 convolutional layers with a dilation size of 2modulo(k,10)2^{\text{modulo}(k,10)} and a filter size of 3. Its condition module was the same as that of WAD and WAC. NSF was trained using L=Ls1+Ls2+Ls3\mathcal{L}=\mathcal{L}_{s1}+\mathcal{L}_{s2}+\mathcal{L}_{s3}, and the configuration of each Ls\mathcal{L}_{s*} is listed in Table 1. The phase distance Lp\mathcal{L}_{p*} was not used in this test.

Each model generated waveforms using natural and generated acoustic features, where the generated acoustic features were produced by the acoustic models in our previous study . The generated and natural waveforms were then evaluated by paid native Japanese speakers. In each evaluation round the evaluator listened to one speech waveform in each screen and rated the speech quality on a 1-to-5 MOS scale. The evaluator can take at most 10 evaluation rounds and can replay the sample during evaluation. The waveforms in an evaluation round were for the same text and were played in a random order. Note that the waveforms generated from NSF and WAC were converted to 16-bit PCM format before evaluation.

A total of 245 evaluators conducted 1444 valid evaluation rounds in all, and the results are plotted in Figure 2. Two-sided Mann-Whitney tests showed that the difference between any pair of models is statistically significant (p<0.01p<0.01) except NSF and WAC when the two models used generated acoustic features. In general, NSF outperformed WOR and WAC but performed slightly worse than WAD. The gap of the mean MOS scores between NSF and WAD was about 0.12, given either natural or generated acoustic features. A possible reason for this result may be the difference between the non-AR and AR model structures, which is similar to the difference between the finite and infinite impulse response filters. WAC performed worse than WAD because some syllables were perceived to be trembling in pitch, which may be caused by the random sampling generation method. WAD alleviated this artifact by using a one-best generation method in voiced regions .

After the MOS test, we compared the waveform generation speed of NSF and WAD. The implementation of NSF has a normal generation mode and a memory-save one. The normal mode allocates all the required GPU memory once but cannot generate waveforms longer than 6 seconds because of the insufficient memory space in a single GPU card. The memory-save mode can generate long waveforms because it releases and allocates the memory layer by layer, but the repeated memory operations are time consuming.

We evaluated NSF using both modes on a smaller test set, in which each of the 80 generated test utterances was around 5 seconds long. As the results in Table 3 show, NSF is much faster than WAD. Note that WAD allocates and re-uses a small size of GPU memory, which needs no repeated memory operation. WAD is slow mainly because of the AR generation process. Of course, both WAD and NSF can be improved if our toolkit is further optimized. Particularly, if the memory operation can be sped up, the memory-save mode of NSF will be much faster.

3 Ablation test on proposed model

This experiment was an ablation test on NSF. Specifically, the 11 variants of NSF listed in Table 4 were trained using the 5-hour training set. For a fair comparison, NSF was re-trained using the 5-hour data, and this variant is referred to as NSFs{}_{\text{s}}. The speech waveforms were generated given the natural acoustic features and rated in 1444 evaluation rounds by the same group of evaluators in Section 3.2. This test excluded natural waveform for evaluation.

The results are plotted in Figure 4. The difference between NSTs and any other model except S2 was statistically significant (p<0.01p<0.01). Comparison among NSTs, L1, L2, and L3 shows that using multiple Ls\mathcal{L}_{s}s listed in Table 1 is beneficial. For L3 that used only Ls1\mathcal{L}_{s1}, the generated waveform points clustered around one peak in each frame, and the waveform suffered from a pulse-train noise. This can be observed from L3 of Figure 3, whose spectrogram in the high frequency band shows more clearly vertical strips than other models. Accordingly, this artifact can be alleviated by adding Ls2\mathcal{L}_{s2} with a frame length of 5 ms for model training, which explained the improvement in L1. Using phase distance (L4) didn’t improve the speech quality even though the value of the phase distance was consistently decreased on both training and validation data.

The good result of S2 indicates that a single sine-wave excitation with a random initial phase also works. Without the sine-wave excitation, S3 generated waveforms that were intelligible but lacked stable harmonic structure. N1 slightly outperformed NSFs while N2 produced unstable harmonic structures. Because the transformation in N1 is equivalent to skip-connection , the result indicates that the skip-connection may help the model training.

Conclusion

In this paper, we proposed a neural waveform model with separated source and filter modules. The source module produces a sine-wave excitation signal with a specified F0, and the filter module uses dilated convolution to transform the excitation into a waveform. Our experiment demonstrated that the sine-wave excitation was essential for generating waveforms with harmonic structures. We also found that multiple spectral-based training criteria and the transformation in the filter module contributed to the performance of the proposed model. Compared with the AR WaveNet, the proposed model generated speech with a similar quality at a much faster speed.

The proposed model can be improved in many aspects. For example, it is possible to simplify the dilated convolution blocks. It is also possible to try classical speech modeling methods, including glottal waveform excitations , two-bands or multi-bands approaches on waveforms. When applying the model to convert linguistic features into the waveform, we observed the over-smoothing affect in the high-frequency band and will investigate the issue in the future work.

References

Appendix A Forward computation

The framing and windowing operation is also parallelized over x^m(n)\widehat{x}^{(n)}_{m} using for_each command in CUDA/Thrust. However, for explanation, let’s use the matrix operation in Figure 6. In other words, we compute

where wt(n,m)w^{(n,m)}_{t} is the element in the \big{[}(n-1)\times{M}+m\big{]}-th row and the tt-th column of the transformation matrix W\bm{W}.

A.2 DFT

Our implementation uses cuFFT (cufftExecR2C and cufftPlan1d) https://docs.nvidia.com/cuda/cufft/index.html to compute {y(1),,y(N)}\{\bm{y}^{(1)},\cdots,\bm{y}^{(N)}\} in parallel.

Appendix B Backward computation

where wt(n,m)w_{t}^{(n,m)} are the framing/windowing coefficients. This equation explains what we mean by saying ‘Lo^t\frac{\partial{\mathcal{L}}}{\partial{\widehat{{o}}_{t}}} can be easily accumulated the relationship between o^t\widehat{{o}}_{t} and each x^m(n)\widehat{x}^{(n)}_{m} has been determined by the framing and windowing operations’.

Our implementation uses CUDA/Thrust for_each command to launch TT threads and compute Lo^t,t{1,,T}\frac{\partial{\mathcal{L}}}{\partial{\widehat{{o}}_{t}}},t\in\{1,\cdots,T\} in parallel. Because o^t\widehat{o}_{t} is only used in a few frames and there is only one wt(n,m)0w_{t}^{(n,m)}\neq 0 for each {n,t}\{n,t\}, Equation (5) can be optimized as

where [Nt,min,Nt,max][N_{t,min},N_{t,max}] is the frame range that o^t\widehat{o}_{t} appears, and mt,nm_{t,n} is the position of o^t\widehat{o}_{t} in the nn-th frame.

where k[1,K]k\in[1,K]. Note that, although the sum should be m=1K\sum_{m=1}^{K}, the summation over the zero-padded part m=M+1K0cos(2πK(k1)(m1))\sum_{m=M+1}^{K}0\cos(\frac{2\pi}{K}(k-1)(m-1)) can be safely ignored Although we can avoid zero-padding by setting K=MK=M, in practice KK is usually the power of 2 while the frame length MM is not..

Suppose we compute a log spectral amplitude distance L\mathcal{L} over the NN frames as

Because L\mathcal{L}, x^m(n)\widehat{{x}}^{(n)}_{m}, Re(y^k(n))\texttt{Re}(\widehat{{y}}^{(n)}_{k}), and Im(y^k(n))\texttt{Im}(\widehat{{y}}^{(n)}_{k}) are real-valued numbers, we can compute the gradient Lx^m(n)\frac{\partial\mathcal{L}}{\partial\widehat{{x}}^{(n)}_{m}} using the chain rule:

Once we compute Lx^m(n)\frac{\partial\mathcal{L}}{\partial\widehat{{x}}^{(n)}_{m}} for each mm and nn, we can use Equation (6) to compute the gradient Lo^t\frac{\partial{\mathcal{L}}}{\partial{\widehat{{o}}_{t}}}.

B.3 Implementation of the 1st step using inverse DFT

Because LRe(y^k(n))\frac{\partial\mathcal{L}}{\partial\texttt{Re}(\widehat{y}_{k}^{(n)})} and LIm(y^k(n))\frac{\partial\mathcal{L}}{\partial\texttt{Im}(\widehat{y}_{k}^{(n)})} are real numbers, we can directly implement Equation (11) using matrix multiplication. However, a more efficient way is to use inverse DFT (iDFT).

For the first term in Line 15, we can write

Note that in Line (17), Re(g1)sin(2πK(11)(m1))=Re(g1)sin(0)=0\texttt{Re}(g_{1})\sin(\frac{2\pi}{K}(1-1)(m-1))=\texttt{Re}(g_{1})\sin(0)=0, and Re(gK2+1)sin(2πK(K2+11)(m1))=Re(gK2+1)sin((m1)π)=0\texttt{Re}(g_{\frac{K}{2}+1})\sin(\frac{2\pi}{K}(\frac{K}{2}+1-1)(m-1))=\texttt{Re}(g_{\frac{K}{2}+1})\sin((m-1)\pi)=0.

It is easy to those that Line (20) is equal to 0 if Re(gk)=Re(g(K+2k))\texttt{Re}(g_{k})=\texttt{Re}(g_{(K+2-k)}), for any k[2,K2]k\in[2,\frac{K}{2}]. Similarly, it can be shown that k=1KIm(gk)cos(2πK(k1)(m1))=0\sum_{k=1}^{K}\texttt{Im}(g_{k})\cos(\frac{2\pi}{K}(k-1)(m-1))=0 if Im(gk)=Im(g(K+2k)),k[2,K2]\texttt{Im}(g_{k})=-\texttt{Im}(g_{(K+2-k)}),k\in[2,\frac{K}{2}] and Im(g1)=Im(g(K2+1))=0\texttt{Im}(g_{1})=\texttt{Im}(g_{(\frac{K}{2}+1)})=0. When these two terms are equal to 0, the imaginary part in Line (15) will be 0, and bm=k=1Kgkej2πK(k1)(m1)b_{m}=\sum_{k=1}^{K}g_{k}e^{j\frac{2\pi}{K}(k-1)(m-1)} in Line (12) will be a real number.

To summarize, if g\bm{g} satisfies the conditions below

inverse DFT of g\bm{g} will be real-valued:

This is a basic concept in signal processing: the iDFT of a conjugate-symmetric (Hermitian)It should be called circular conjugate symmetry in strict sense signal will be a real-valued signal.

We can observer from Equation (23) and (11) that, if \Big{[}\frac{\partial\mathcal{L}}{\partial\texttt{Re}(\widehat{y}_{1}^{(n)})}+j\frac{\partial\mathcal{L}}{\partial\texttt{Im}(\widehat{y}_{1}^{(n)})},\cdots,\frac{\partial\mathcal{L}}{\partial\texttt{Re}(\widehat{y}_{K}^{(n)})}+j\frac{\partial\mathcal{L}}{\partial\texttt{Im}(\widehat{y}_{K}^{(n)})}\Big{]}^{\top} is conjugate-symmetric, the gradient vector Lx^(n)=[Lx^1(n),,Lx^M(n)]\frac{\partial\mathcal{L}}{\partial\widehat{\bm{x}}^{(n)}}=[\frac{\partial\mathcal{L}}{\partial\widehat{{x}}^{(n)}_{1}},\cdots,\frac{\partial\mathcal{L}}{\partial\widehat{{x}}^{(n)}_{M}}]^{\top} be computed using iDFT:

Note that {Lx^M+1(n),,Lx^K(n)}\{\frac{\partial\mathcal{L}}{\partial\widehat{{x}}^{(n)}_{M+1}},\cdots,\frac{\partial\mathcal{L}}{\partial\widehat{{x}}^{(n)}_{K}}\} are the gradients w.r.t to the zero-padded part, which will not be used and can safely set to 0. The iDFT of a conjugate symmetric signal can be executed using cuFFT cufftExecC2R command. It is more efficient than other implementations of Equation (11) because

there is no need to compute the imaginary part;

there is no need to compute and allocate GPU memory for gkg_{k} where k[K2+2,K]k\in[\frac{K}{2}+2,K] because of the conjugate symmetry;

iDFT can be executed for all the NN frames in parallel.

B.4 Conjugate symmetry complex-valued gradient vector

The conjugate symmetry of \Big{[}\frac{\partial\mathcal{L}}{\partial\texttt{Re}(\widehat{y}_{1}^{(n)})}+j\frac{\partial\mathcal{L}}{\partial\texttt{Im}(\widehat{y}_{1}^{(n)})},\cdots,\frac{\partial\mathcal{L}}{\partial\texttt{Re}(\widehat{y}_{K}^{(n)})}+j\frac{\partial\mathcal{L}}{\partial\texttt{Im}(\widehat{y}_{K}^{(n)})}\Big{]}^{\top} is satisfied if L\mathcal{L} is carefully chosen. Luckily, most of the common distance metrics can be used.

Given the log spectral amplitude distance Ls\mathcal{L}_{s} in Equation (9), we can compute

Because y^(n)\widehat{\bm{y}}^{(n)} is the DFT spectrum of the real-valued signal, y^(n)\widehat{\bm{y}}^{(n)} is conjugate symmetric, and Re(y^k(n))\texttt{Re}(\widehat{y}_{k}^{(n)}) and Im(y^k(n))\texttt{Im}(\widehat{y}_{k}^{(n)}) satisfy the condition in Equations (21) and (22), respectively. Because the amplitude Re(y^k(n))2+Im(y^k(n))2\texttt{Re}(\widehat{y}_{k}^{(n)})^{2}+{\texttt{Im}(\widehat{y}_{k}^{(n)}})^{2} does not change the symmetry, LsRe(y^k(n))\frac{\partial\mathcal{L}_{s}}{\partial\texttt{Re}(\widehat{y}_{k}^{(n)})} and LsIm(y^k(n))\frac{\partial\mathcal{L}_{s}}{\partial\texttt{Im}(\widehat{y}_{k}^{(n)})} also satisfy the conditions in Equations (21) and (22), respectively, and \Big{[}\frac{\partial\mathcal{L}_{s}}{\partial\texttt{Re}(\widehat{y}_{1}^{(n)})}+j\frac{\partial\mathcal{L}_{s}}{\partial\texttt{Im}(\widehat{y}_{1}^{(n)})},\cdots,\frac{\partial\mathcal{L}_{s}}{\partial\texttt{Re}(\widehat{y}_{K}^{(n)})}+j\frac{\partial\mathcal{L}_{s}}{\partial\texttt{Im}(\widehat{y}_{K}^{(n)})}\Big{]}^{\top} is conjugate-symmetric.

B.4.2 Phase distance

Let θ^k(n)\widehat{\theta}^{(n)}_{k} and θk(n)\theta^{(n)}_{k} to be the phases of y^k(n)\widehat{y}^{(n)}_{k} and yk(n){y}^{(n)}_{k}, respectively. Then, the phase distance is defined as

Because both y(n)\bm{y}^{(n)} and y^(n)\bm{\widehat{y}}^{(n)} are conjugate-symmetric, it can be easily observed that LpRe(y^k(n))\frac{\partial{\mathcal{L}_{p}}}{\partial\texttt{Re}(\widehat{y}_{k}^{(n)})} and LpRe(y^k(n))\frac{\partial{\mathcal{L}_{p}}}{\partial\texttt{Re}(\widehat{y}_{k}^{(n)})} satisfy the condition in Equations (21) and (22), respectively.

Appendix C Multiple distance metrics

Different distance metrics can be merged easily. For example, we can define