Neural source-filter waveform models for statistical parametric speech synthesis

Xin Wang, Shinji Takaki, Junichi Yamagishi

I Introduction

Text-to-speech synthesis (TTS), a technology that converts a text string into a speech waveform, has rapidly advanced with the help of deep learning. In TTS systems based on a statistical parametric speech synthesis (SPSS) framework, deep neural networks have been used to derive linguistic features from text. More breakthroughs have been reported for the acoustic modeling part, for which various types of neural networks have been proposed that convert linguistic features into acoustic features such as the fundamental frequency (F0), cepstral or spectral features, and duration.

Deep learning has recently been used to design a vocoder, a component that converts acoustic features into a speech waveform. The pioneering model called WaveNet directly generates a waveform in a sample-by-sample manner given the input features. It has shown better performance than classical signal-processing-based vocoders for speaker-dependent SPSS systems. Similar autoregressive (AR) models such as SampleRNN have also performed reasonably well for SPSS. The key idea of these neural waveform models is to implement an AR probabilistic model that describes the distribution of a current waveform sample conditioned on previous samples. Although AR models can generate high-quality waveforms, their generation speed is slow because they have to generate waveform samples one by one.

Inverse-AR is another approach to neural waveform modeling. For example, inverse-AR-flow (IAF) can be used to transform a noise sequence into a waveform without the sequential generation process. However, each waveform must be sequentially transformed into a noise-like signal during model training, which significantly increases the amount of training time. Rather than the direct training method, Parallel WaveNet and ClariNet use a pre-trained AR model as a teacher to evaluate the waveforms generated by an IAF student model. Although this training method avoids sequential transformation, it requires a pre-trained WaveNet as the teacher model. Furthermore, additional training criteria have to be incorporated, without which the student model only produces mumbling sounds. The blend of disparate training criteria and the complicated knowledge-distilling approach make IAF-based frameworks even less accessible to the TTS community.

In this paper, we propose a neural source-filter (NSF) waveform modeling framework, which is straightforward to implement, fast in generation, and effective in producing high-quality speech waveforms for TTS. NSF models designed under this framework have three components: a source module that produces a sine-based excitation signal, a filter module that uses a dilated-convolution-based network to convert the excitation into a waveform, and a conditional module that processes the acoustic features for the source and filter modules. The NSF models do not rely on AR or IAF approaches and are directly trained by minimizing the spectral amplitude distances between the generated and natural waveforms.

We describe three specific NSF models designed under our proposed framework: a baseline NSF (b-NSF) model that adopts a WaveNet-like filter module, a simplified NSF (s-NSF) model that simplifies the filter module of the baseline NSF model, and a harmonic-plus-noise NSF (hn-NSF) model that uses separate source-filter pairs for the harmonic and noise components of waveforms. While we previously introduced the b-NSF model, the s-NSF and hn-NSF models are newly introduced in this paper. Among the three NSF models, the hn-NSF model outperformed the other two in a large-scale listening test. Compared with WaveNet, the hn-NSF model generated speech waveforms of comparably good quality but much faster.

We review recent neural waveform models in Section II and describe the proposed NSF framework and three NSF models in Section III. After explaining the experiments in Section IV, we draw a conclusion in Section V.

II Review of recent neural waveform models

However, such a naive model did not produce intelligible waveforms in our pilot test. We found that the generated waveform was over-smoothed and lacked the variations that evoke the perception of speech formants. This over-smoothing effect may be reasonable because the way we train the naive model is equivalent to maximizing the likelihood of a waveform $\bm{o}_{1:T}$ over the distribution

where $\mathcal{N}(\cdot;\cdot,\sigma^{2})$ is a Gaussian distribution with an unknown standard deviation $\sigma$. It is assumed that the waveform values are independently distributed with the naive model, which may be incompatible with the strong temporal correlation of natural waveforms. This mismatch between model assumption and natural data distribution may cause the over-smoothing effect.
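To make the link between the training criterion and the Gaussian assumption concrete, the following is a sketch only. It assumes that the distribution referred to above factorizes over time into Gaussians centered on the model output, written here as $\widehat{o}_{t}$; both the symbol and the exact factorized form are assumptions inferred from the surrounding description, not a reproduction of the original equation.

\[
p(\bm{o}_{1:T}|\bm{c}_{1:B};\bm{\Theta}) = \prod_{t=1}^{T}\mathcal{N}(o_{t};\widehat{o}_{t},\sigma^{2})
\]

\[
\log p(\bm{o}_{1:T}|\bm{c}_{1:B};\bm{\Theta}) = -\frac{1}{2\sigma^{2}}\sum_{t=1}^{T}(o_{t}-\widehat{o}_{t})^{2} - \frac{T}{2}\log(2\pi\sigma^{2})
\]

Under this assumption and for a fixed $\sigma$, maximizing the log-likelihood over $\bm{\Theta}$ is the same as minimizing the squared error between $o_{t}$ and $\widehat{o}_{t}$ at each time step independently, which is consistent with the over-smoothing argument above.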

II-B Neural AR waveform models

AR neural waveform models have recently been proposed for better waveform modeling. Contrary to the naive model assumption in Equation (2), it is assumed that

with an AR model, where $p({o}_{t}|\bm{o}_{<t},\bm{c}_{1:B};\bm{\Theta})$ depends on the previous waveform values $\bm{o}_{<t}$. Such a model can potentially describe the causal temporal correlation among waveform samples.

An AR model can be implemented by feeding the waveform sample of the previous time step as the input to a convolution network (CNN) or recurrent network (RNN) at the current step, which is illustrated in Figure 1. The model is trained by maximizing the likelihood over natural data while using the data for feedback, i.e., teacher forcing. In the generation stage, the model has to sequentially generate a waveform and use the previously generated sample as the feedback datum.
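The difference between teacher-forced training and sequential generation can be illustrated with a toy example. This is a minimal sketch, not WaveNet or SampleRNN: the function `predict_mean` is a hypothetical one-step linear predictor standing in for the CNN or RNN, its weights are arbitrary, and the conditioning input is assumed to be already upsampled to one value per waveform sample.

```python
import numpy as np


def predict_mean(prev_sample, cond, w_prev=0.9, w_cond=0.1):
    """Hypothetical one-step predictor standing in for the CNN/RNN."""
    return w_prev * prev_sample + w_cond * cond


def teacher_forced_means(natural, cond):
    """Training-time prediction: the natural waveform is fed back.

    All time steps can be computed at once because the feedback
    signal (the natural waveform shifted by one step) is known.
    """
    natural = np.asarray(natural, dtype=float)
    cond = np.asarray(cond, dtype=float)
    prev = np.concatenate(([0.0], natural[:-1]))
    return predict_mean(prev, cond)


def generate(cond, noise_std=0.0, seed=0):
    """Generation-time loop: each sample needs the previous output."""
    rng = np.random.default_rng(seed)
    cond = np.asarray(cond, dtype=float)
    out = np.zeros(len(cond))
    prev = 0.0
    for t in range(len(cond)):  # inherently sequential, O(T) steps
        out[t] = predict_mean(prev, cond[t]) + noise_std * rng.normal()
        prev = out[t]
    return out
```

The training-time function is a single vectorized call, whereas the generation loop cannot be parallelized over time because each step depends on the sample generated before it.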

Pioneering neural AR waveform models include WaveNet and SampleRNN. There are also models that combine WaveNet with classical speech-modeling methods such as glottal waveform modeling and linear-prediction coding. Although these models can generate high-quality waveforms, their sequential generation process is time-consuming. More specifically, these AR models’ time complexity to generate a single waveform of length $T$ is theoretically equal to $\mathcal{O}(T)$. Note that the time complexity only takes the waveform length into account, ignoring the number of hidden layers and implementation tricks such as subscale dependency and squeezing. Other models, such as WaveRNN, FFTNet, and subband WaveNet, use different network architectures to reduce or parallelize the computation load, but their generation time is still linearly proportional to the waveform length.

Note that in the teacher-forcing-based training stage, the CNN-based (e.g., WaveNet) and RNN-based AR models (e.g., WaveRNN) have time complexity of $\mathcal{O}(1)$ and $\mathcal{O}(T)$, respectively. The time complexity of the RNN-based models is limited by the computation in the recurrent layers. These theoretical interpretations are summarized in Table I.

II-C IAF-flow-based models

Rather than the AR-generation process, an IAF-flow-based model such as WaveGlow uses an invertible and non-AR function $H^{-1}_{\bm{\Theta}}(\cdot)$ to convert a random Gaussian noise signal $\widehat{\bm{z}}_{1:T}$ into a waveform $\widehat{\bm{o}}_{1:T}=H^{-1}_{\bm{\Theta}}(\widehat{\bm{z}}_{1:T},\bm{c}_{1:B})$. In the training stage, the model likelihood has to be evaluated as

where $H_{\bm{\Theta}}(\bm{o}_{1:T},\bm{c}_{1:B})$ sequentially inverts a training waveform $\bm{o}_{1:T}$ into a noise signal $\bm{z}_{1:T}$ for likelihood evaluation. When $H^{-1}_{\bm{\Theta}}(\cdot)$ is sufficiently complex, $p_{O}(\bm{o}_{1:T}|\bm{c}_{1:B};\bm{\Theta})$ can approximate the true distribution of $\bm{o}_{1:T}$.
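For orientation, the likelihood of a flow-based model is normally obtained from the change-of-variables rule. The expression below is the generic textbook form, written with the symbols used in this section; the prior $p_{Z}$ and the exact notation are assumptions, and the original equation may be arranged differently.

\[
p_{O}(\bm{o}_{1:T}|\bm{c}_{1:B};\bm{\Theta}) = p_{Z}(\bm{z}_{1:T})\left|\det\frac{\partial H_{\bm{\Theta}}(\bm{o}_{1:T},\bm{c}_{1:B})}{\partial\bm{o}_{1:T}}\right|, \qquad \bm{z}_{1:T}=H_{\bm{\Theta}}(\bm{o}_{1:T},\bm{c}_{1:B})
\]

Evaluating this quantity requires computing $\bm{z}_{1:T}$ from the training waveform, which is the step that becomes sequential when $H_{\bm{\Theta}}$ is the inverse of an IAF.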

In terms of the theoretical time complexity w.r.t. $T$, the flow-based models are dual to the CNN-based AR models, as Table I summarizes. Although the time complexity of the flow-based models in waveform generation is independent of $T$, their time complexity in model training is $\mathcal{O}(T)$. Some models such as WaveGlow reduce $T$ by squeezing multiple waveform samples into one vector but still require a huge amount of training time. For example, in one study using Japanese speech data, a WaveGlow model was trained on four V100 GPU cards for one month to produce high-quality waveforms. (The original WaveGlow paper used eight GV100 GPU cards for model training. However, it did not report the amount of training time.)

II-D AR plus IAF

Some models, such as Parallel WaveNet and ClariNet, combine the advantages of AR and IAF-flow-based models under a distilling framework. In this framework, a flow-based student model generates waveform samples $\widehat{\bm{o}}_{1:T}=H^{-1}_{\bm{\Theta}}(\widehat{\bm{z}}_{1:T},\bm{c}_{1:B})$ and queries an AR teacher model to evaluate $\widehat{\bm{o}}_{1:T}$. The student model learns by minimizing the distance between $p(\widehat{\bm{o}}_{1:T})$ and that given by the teacher model. Therefore, neither training nor generation requires sequential transformation.

However, it was reported that knowledge distilling is insufficient and additional spectral-domain criteria must be used. Among the blend of disparate loss functions, it remains unclear which one is essential. Furthermore, knowledge distilling with two large models is complicated in implementation.

III Neural source-filter models

We propose a framework of neural waveform modeling that is fast in generation and straightforward in implementation. As Figure 1 illustrates, our proposed framework contains three modules: a condition module that processes input acoustic features, source module that produces a sine-based excitation signal given the F0, and neural filter module that converts the excitation into a waveform using dilated convolution (CONV) and feedforward transformation. Rather than the MSE or the likelihood over waveform sampling points, the proposed framework uses spectral-domain distances for model training. Because our proposed framework explicitly uses a source-filter structure, we refer to all of the models based on the framework as NSF models.

Our proposed NSF framework does not rely on AR or IAF approaches. An NSF model converts an excitation signal into an output waveform without sequential transformation. It can be simply trained using the stochastic gradient descent (SGD) method under a spectral-domain criterion. Therefore, its time complexity is theoretically independent of the waveform length, i.e., $\mathcal{O}(1)$ in both the model training and waveform generation stages. Nor does an NSF model use knowledge distilling, which makes it straightforward in training and implementation.

An NSF model can be implemented in various architectures. We give details on three specific NSF models:

The b-NSF model has a network structure partially similar to ClariNet and Parallel WaveNet;

The s-NSF model inherits b-NSF’s structure but uses much simpler neural filter modules;

The hn-NSF model extends the s-NSF and explicitly generates the harmonic and noise components of a waveform.

Although the three NSF models use different neural filter modules, they use the same spectral-domain training criterion. We therefore first explain the training criterion in Section III-A then the three NSF models in Sections III-B to III-D.

III-A2 Backward propagation

where $k\in\{1,\cdots,K\}$. (Strictly speaking, the summation in Equations (6)-(7) should be $\sum_{m=1}^{K}$, but the zero-padded part $\sum_{m=M+1}^{K}0\cos(\frac{2\pi}{K}(k-1)(m-1))$ can be safely ignored. Although we can avoid zero-padding by setting $K=M$, in practice, $K$ is usually a power of 2 to take advantage of the fast Fourier transform (FFT). A waveform frame of length $M$ can be zero-padded to length $K=2^{\left\lceil\log_{2}{M}\right\rceil}$ or longer to increase the frequency resolution.) Because $\texttt{Re}(\widehat{{y}}^{(n)}_{k})$, $\texttt{Im}(\widehat{{y}}^{(n)}_{k})$, $\frac{\partial\mathcal{L}}{\partial\texttt{Re}(\widehat{y}_{k}^{(n)})}$, and $\frac{\partial\mathcal{L}}{\partial\texttt{Im}(\widehat{y}_{k}^{(n)})}$ are real-valued numbers, we can compute the gradient by using the chain rule:

As long as we can compute $\frac{\partial\mathcal{L}}{\partial\widehat{{x}}^{(n)}_{m}}$ for each $m$ and $n$, the gradient $\frac{\partial{\mathcal{L}}}{\partial{\widehat{{o}}_{t}}}$ for $t\in\{1,\cdots,T\}$ can be easily accumulated from $\frac{\partial\mathcal{L}}{\partial\widehat{{x}}^{(n)}_{m}}$ given the relationship between $\bm{o}_{t}$ and each $\widehat{x}^{(n)}_{m}$ that is determined by the framing and windowing operations. $\frac{\partial{\mathcal{L}}}{\partial{\widehat{{o}}_{t}}}$ is then sent to the output layer of the neural filter module for back-propagation and SGD training.

If $\bm{g}^{(n)}$ is conjugate symmetric, i.e.,

By combining Equations (9) and (12), we can see that $b_{m}^{(n)}$ is equal to $\frac{\partial\mathcal{L}}{\partial\widehat{{x}}^{(n)}_{m}}$ in Equation (8). In other words, if $\bm{g}^{(n)}$ is conjugate symmetric, we can calculate $\frac{\partial\mathcal{L}}{\partial\widehat{\bm{x}}^{(n)}}$ by constructing $\bm{g}^{(n)}$ and taking its inverse-DFT. It can be shown that $\bm{g}^{(n)}$ is conjugate symmetric when $\mathcal{L}$ is defined as the log spectral amplitude distance in Equation (5). Other common distances such as the Kullback-Leibler divergence (KLD) between spectra are also applicable.

Note that, if $\widehat{\bm{x}}^{(n)}$ is zero-padded from length $M$ to length $K$ before DFT, the inverse-DFT of $\bm{g}^{(n)}$ will contain gradients w.r.t. the zero-padded part. In such a case, we can simply assign $\frac{\partial\mathcal{L}}{\partial\widehat{{x}}^{(n)}_{m}}\leftarrow{b_{m}^{(n)}},\forall{m}\in\{1,\cdots,M\}$ and ignore $\{b_{M+1}^{(n)},\cdots,b_{K}^{(n)}\}$ that correspond to the zero-padded part.
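The inverse-DFT shortcut described above can be checked numerically. The sketch below is illustrative and rests on several assumptions that are not taken from the paper: the per-frame loss is written as half the sum of squared log-power differences with a small floor `eta`, the gradient vector is built as `g = dL/dRe(y) + j dL/dIm(y)`, and NumPy's `ifft` (which divides by `K`) is rescaled by `K`. The scaling convention of the paper's own inverse-DFT may differ.

```python
import numpy as np


def log_spec_loss(x_hat, x_ref, K, eta=1e-5):
    """Assumed single-frame log spectral amplitude distance."""
    y_hat = np.fft.fft(x_hat, n=K)  # zero-pads the frame to length K
    y_ref = np.fft.fft(x_ref, n=K)
    d = np.log(np.abs(y_hat) ** 2 + eta) - np.log(np.abs(y_ref) ** 2 + eta)
    return 0.5 * np.sum(d ** 2)


def grad_via_idft(x_hat, x_ref, K, eta=1e-5):
    """Gradient w.r.t. the frame samples obtained from an inverse DFT."""
    M = len(x_hat)
    y_hat = np.fft.fft(x_hat, n=K)
    y_ref = np.fft.fft(x_ref, n=K)
    p_hat = np.abs(y_hat) ** 2 + eta
    d = np.log(p_hat) - np.log(np.abs(y_ref) ** 2 + eta)
    # g_k = dL/dRe(y_k) + j * dL/dIm(y_k); conjugate symmetric for real input
    g = 2.0 * d * y_hat / p_hat
    b = K * np.fft.ifft(g)  # undo NumPy's 1/K factor
    # Keep the first M entries; the rest belong to the zero-padded part.
    return b.real[:M], np.max(np.abs(b.imag))
```

Comparing the returned gradient with a finite-difference estimate of `log_spec_loss` is a simple way to confirm that the imaginary part of the inverse DFT vanishes and that the first `M` entries are the wanted gradient.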

III-A3 Multi-resolution spectral amplitude distance

Using multiple spectral distances is expected to help the model learn the spectral details of natural waveforms in different spatial and temporal resolutions. We used three distances in this study, which are explained in Section IV-B.
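A multi-resolution distance of this kind can be sketched as a sum of the same log spectral distance computed under several framing configurations. The code below is a sketch under assumptions: the Hann window, the floor `eta`, the averaging over frames and bins, and the three `(frame_length, frame_shift, fft_size)` tuples in `CONFIGS` are illustrative placeholders, not values confirmed by this section.

```python
import numpy as np

# Hypothetical (frame_length, frame_shift, fft_size) settings in samples.
CONFIGS = [(320, 80, 512), (80, 40, 128), (1920, 640, 2048)]


def frame_signal(o, frame_len, frame_shift):
    """Slice a 1-D signal into overlapping frames (no padding)."""
    assert len(o) >= frame_len, "signal shorter than one frame"
    n_frames = 1 + (len(o) - frame_len) // frame_shift
    idx = np.arange(frame_len)[None, :] + frame_shift * np.arange(n_frames)[:, None]
    return o[idx]


def log_spectral_distance(o_hat, o_ref, frame_len, frame_shift, fft_size, eta=1e-5):
    """Log spectral amplitude distance for one framing configuration."""
    win = np.hanning(frame_len)
    y_hat = np.fft.rfft(frame_signal(o_hat, frame_len, frame_shift) * win, n=fft_size, axis=-1)
    y_ref = np.fft.rfft(frame_signal(o_ref, frame_len, frame_shift) * win, n=fft_size, axis=-1)
    d = np.log(np.abs(y_hat) ** 2 + eta) - np.log(np.abs(y_ref) ** 2 + eta)
    return 0.5 * np.mean(d ** 2)


def multi_resolution_distance(o_hat, o_ref, configs=CONFIGS):
    """Sum of the distances over all framing configurations."""
    return sum(log_spectral_distance(o_hat, o_ref, *cfg) for cfg in configs)
```

Short frames with a small shift give finer temporal resolution, while long frames with a large FFT size give finer frequency resolution, which is the trade-off the text refers to.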

III-A4 Remark on spectral amplitude distance

Using the spectral amplitude distance is also reasonable because the perception of speech sounds is affected by spectral acoustic cues such as formants and their transitions. Although the spectral amplitude distance ignores other acoustic cues, such as phase and timing, we only considered the spectral amplitude distance in this study because we have not found a phase or timing distance that is differentiable and effective.

III-B Baseline NSF model

We now give the details on the b-NSF model. As Figure 3 illustrates, the b-NSF model uses three modules to convert an input acoustic feature sequence $\bm{c}_{1:B}$ of length $B$ into a speech waveform $\widehat{\bm{o}}_{1:T}$ of length $T$: a source module that generates an excitation signal $\bm{e}_{1:T}$, a filter module that transforms $\bm{e}_{1:T}$ into an output waveform, and a condition module that processes $\bm{c}_{1:B}$ for the source and filter modules. The model is trained using the spectral distance explained in the previous section.

III-B2 Source module

Given the F0, the source module constructs an excitation signal on the basis of sine waveforms and random noise. In voiced segments, the excitation signal is a mixture of sine waveforms whose frequency values are determined by F0 and its harmonics. In unvoiced regions, the excitation signal is a sequence of Gaussian noise.

where ${n}_{t}\sim\mathcal{N}(0,\sigma^{2})$ is Gaussian noise, $\phi\in[-\pi,\pi]$ is a random initial phase, and $N_{s}$ is the waveform sampling rate. The hyper-parameter $\alpha$ adjusts the amplitude of source waveforms, while $\sigma$ is the standard deviation of the Gaussian noise. (In our previous NSF paper, we used $\frac{1}{3\sigma}n_{t}$ for the noise excitation. In this study, we used $\frac{\alpha}{3\sigma}n_{t}$ so that the amplitude of the noise in unvoiced segments is comparable to that of the sine waveforms in voiced segments.) We set $\sigma=0.003$ and $\alpha=0.1$ in this study. Equation (13) treats $f_{t}$ as an instantaneous frequency. Thus, the phase of $\bm{e}_{1:T}^{<0>}$ remains continuous even if $f_{t}$ changes. Figure 5 plots an example $\bm{e}_{1:T}^{<0>}$ and the corresponding $\bm{f}_{1:T}$.
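A sine-based excitation with these properties can be sketched as follows. The code assumes a particular form for the excitation, namely a sine whose phase is the running sum of $2\pi f_{t}/N_{s}$ plus additive noise in voiced samples, and scaled noise $\frac{\alpha}{3\sigma}n_{t}$ in unvoiced samples (marked by $f_{t}=0$). The function name `sine_excitation`, the convention that unvoiced samples carry an F0 of zero, and the assumption that F0 has already been upsampled to the waveform rate are illustrative choices.

```python
import numpy as np


def sine_excitation(f0, fs=16000, alpha=0.1, sigma=0.003, rng=None):
    """Fundamental-frequency excitation from a per-sample F0 track.

    f0 : array of F0 values in Hz, one per waveform sample,
         with 0 marking unvoiced samples (assumed convention).
    """
    rng = np.random.default_rng() if rng is None else rng
    f0 = np.asarray(f0, dtype=float)
    noise = rng.normal(0.0, sigma, size=f0.shape)
    phi = rng.uniform(-np.pi, np.pi)  # random initial phase
    # Accumulating the instantaneous frequency keeps the phase
    # continuous even when F0 changes from sample to sample.
    phase = 2.0 * np.pi * np.cumsum(f0 / fs) + phi
    voiced = f0 > 0
    return np.where(voiced,
                    alpha * np.sin(phase) + noise,
                    alpha / (3.0 * sigma) * noise)
```

With $\sigma=0.003$ and $\alpha=0.1$, the unvoiced branch has a standard deviation of $\alpha/3$, so nearly all of its samples fall within the amplitude range of the sine in voiced regions.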

where $\{w_{0},\cdots,w_{H},w_{b}\}$ are the FF layer’s weights, and $H$ is the total number of overtones.

The value of $H$ is not critical to the model’s performance because the model can re-create higher harmonic tones, as the experiments discussed in Section IV-D demonstrated. We set $H=7$ based on a tentative rule $(H+1)*f_{max}<{N_{s}/4}$, where $N_{s}=16$ kHz is the sampling rate, and $f_{max}\approx{500}$ Hz is the largest F0 value observed in our data corpus. We used ${N_{s}/4}$ as the upper bound so that there are at least four sampling points in each period of the sine waveform.
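As a quick check of the rule, take $f_{max}$ as exactly 500 Hz. This is an illustrative round number, since the text only gives $f_{max}\approx 500$ Hz.

\[
(H+1)\,f_{max} = (7+1)\times 500\ \text{Hz} = 4000\ \text{Hz} = \frac{N_{s}}{4}, \qquad \frac{N_{s}}{(H+1)\,f_{max}} = \frac{16000}{4000} = 4\ \text{samples per period}
\]

With this round number, $H=7$ sits right at the bound, and the highest harmonic is sampled four times per period. Any $f_{max}$ slightly below 500 Hz satisfies the strict inequality.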

III-B3 Neural filter module

The filter module of the b-NSF model transforms the excitation ${\bm{e}_{1:T}}$ into an output waveform $\widehat{\bm{o}}_{1:T}$ by using five baseline dilated-CONV filter blocks. The structure of a baseline filter block is illustrated in Figure 4.

The baseline dilated-CONV filter block is similar to the student models in ClariNet and Parallel WaveNet because all use the stack of so-called “dilated residual blocks” in AR WaveNet. However, because the b-NSF model does not use knowledge distilling, it is unnecessary to compute the distribution of the signal during forward propagation as ClariNet and Parallel WaveNet do. Neither is it necessary to make the filter blocks invertible as IAF does. Accordingly, the dilated convolution layers can be non-causal, even though we used causal ones to keep configurations of our NSF models consistent with our WaveNet-vocoder in the experiments.

III-C Simplified NSF model

The network structure of the b-NSF model, especially the filter module, was designed on the basis of our experience with implementing WaveNet. However, we found that the filter module can be simplified, as shown in Figure 4. Such a filter block keeps only the dilated-CONV layers, skip-connections, and FF layers for dimension change. The output of a filter block is the sum of a residual signal $\bm{a}_{1:T}$ and the input signal $\bm{v}_{1:T}^{\text{in}}$. Using the sum ${\bm{v}_{1:T}^{\text{out}}}={\bm{v}_{1:T}^{\text{in}}}+\bm{a}_{1:T}$ rather than the affine transformation ${\bm{v}_{1:T}^{\text{out}}}={\bm{v}_{1:T}^{\text{in}}}\odot{\bm{b}_{1:T}}+\bm{a}_{1:T}$ was motivated by the result of our ablation test on the b-NSF model (Section IV-C: see the results of N1). Note that each dilated-CONV layer uses the tanh activation function.
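The additive structure of such a block can be illustrated with a small NumPy sketch. This is not the block in Figure 4: the channel count, kernel size, weight initialization, the placement of the skip-connections (here, each layer adds its tanh output to its input), and the helper names `causal_dilated_conv`, `init_params`, and `simplified_filter_block` are all assumptions made for illustration. Only the overall pattern (FF layer for dimension expansion, dilated convolutions with tanh, FF layer for dimension reduction, and the final sum with the input) follows the description above.

```python
import numpy as np


def causal_dilated_conv(v, w, dilation):
    """Causal dilated 1-D convolution.

    v : (T, C_in) input, w : (kernel_size, C_in, C_out) weights.
    Output at time t only depends on v[t], v[t - d], v[t - 2d], ...
    """
    ks, T = w.shape[0], v.shape[0]
    pad = (ks - 1) * dilation
    vp = np.concatenate([np.zeros((pad, v.shape[1])), v], axis=0)
    out = np.zeros((T, w.shape[2]))
    for i in range(ks):
        out += vp[i * dilation: i * dilation + T] @ w[i]
    return out


def init_params(channels=8, n_layers=10, kernel_size=3, seed=0):
    """Random weights for one hypothetical simplified block."""
    rng = np.random.default_rng(seed)
    return {
        "w_in": rng.normal(scale=0.1, size=(1, channels)),
        "conv": [rng.normal(scale=0.1, size=(kernel_size, channels, channels))
                 for _ in range(n_layers)],
        "w_out": rng.normal(scale=0.1, size=(channels, 1)),
    }


def simplified_filter_block(v_in, params):
    """v_out = v_in + a, where a is the residual signal from the CONV stack."""
    h = v_in @ params["w_in"]                      # FF: 1 -> channels
    for k, w in enumerate(params["conv"], start=1):
        h = h + np.tanh(causal_dilated_conv(h, w, 2 ** (k - 1)))
    a = h @ params["w_out"]                        # FF: channels -> 1
    return v_in + a
```

Because the block output is the input plus a residual, stacking several such blocks lets the input signal reach later blocks unchanged unless the residual path modifies it.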

On the basis of the simplified filter block, we constructed the s-NSF model. Compared with the b-NSF model, the s-NSF model has the same network structure except for the simplified filter blocks. Accordingly, the s-NSF model has fewer parameters and a faster generation speed. The cascade of simplified filter blocks also turns the neural filter module into a deep residual network.

III-D Harmonic-plus-noise NSF model

Although the b-NSF and s-NSF models perform well in many cases, we found that both may produce low-quality unvoiced sounds or sometimes silence for fricative consonants. Analysis on the neural filter modules of a well-trained model suggests that the hidden features for the voiced sounds have a waveform-like temporal structure, while those for the unvoiced sounds are noisy. We thus hypothesize that voiced and unvoiced sounds may require different non-linear transformations in filter modules.

A better strategy may be to generate a periodic and noise component separately and merge them with different ratios into voiced and unvoiced sounds, an idea similar to the harmonic-plus-noise model. Following the literature, we also refer to the periodic component as the harmonic component.

As an implementation, we constructed the hn-NSF model, which is illustrated in Figure 6. While the hn-NSF model uses the same modules as the s-NSF model to generate a waveform for the harmonic component, it uses only a noise excitation and simplified filter block for the noise component. The noise and harmonic components are filtered by a high-pass and low-pass digital filter, respectively, and are summed as the output waveform. Digital filters have been used to merge the noise and periodic signals in the classical speech modeling methods. The difference is that the hn-NSF model uses filters to directly merge the waveforms rather than the source signals.

Because the voiced sounds are usually dominated by the harmonic part, while the unvoiced sounds are dominated by the noise part, we use two pairs of low- and high-pass filters in the hn-NSF model to merge the harmonic and noise components, one pair for voiced sounds and the other for unvoiced sounds. The configurations of the filters are listed in Table II, and their frequency responses are plotted on the right side of Figure 6. We implemented the filters as equiripple finite impulse response (FIR) filters and computed their coefficients by using the Parks-McClellan algorithm. Note that the order of the FIR filter is determined by the Parks-McClellan algorithm. For the filters specified in Figure 6, the filter order is around 10. After the filter coefficients are calculated using the algorithm, they are stored and fixed in the model. The voicing flag for selecting the filter can be easily extracted from the input F0 sequence.
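One way to obtain such fixed equiripple filters is SciPy's `remez` function, which implements the Parks-McClellan algorithm. The sketch below designs a single low-pass/high-pass pair and merges two signals with it. The band edges (5 kHz and 7 kHz), the number of taps (11, i.e., order 10), and the helper names are illustrative assumptions and are not the settings of Table II. Here the tap count is chosen by hand rather than estimated from a ripple specification.

```python
import numpy as np
from scipy.signal import remez

FS = 16000                       # sampling rate in Hz
NUM_TAPS = 11                    # order 10; assumed, not from Table II
BANDS = [0, 5000, 7000, FS / 2]  # hypothetical band edges in Hz


def design_pair():
    """Equiripple low-pass and high-pass FIR filters with shared band edges."""
    low = remez(NUM_TAPS, BANDS, [1, 0], fs=FS)
    high = remez(NUM_TAPS, BANDS, [0, 1], fs=FS)
    return low, high


def merge(harmonic, noise, low, high):
    """Low-pass the harmonic part, high-pass the noise part, and sum them."""
    n = len(harmonic)
    return np.convolve(harmonic, low)[:n] + np.convolve(noise, high)[:n]


def gain(h, freq_hz):
    """Magnitude response of FIR coefficients h at one frequency."""
    n = np.arange(len(h))
    return abs(np.sum(h * np.exp(-2j * np.pi * freq_hz / FS * n)))
```

A voiced/unvoiced switch would amount to designing two such pairs and selecting one per sample according to the voicing flag. Since the coefficients are fixed after design, they add no trainable parameters.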

We determined the passband and stopband of the FIR filters after analyzing the spectrogram of the speech data in our corpus. Although the passband and stopband are fixed, the neural filter blocks can learn to compensate and fine-tune the energy of the generated signals in certain frequency bands. Figure 7 plots the spectral amplitudes of the harmonic and noise components of one voiced frame generated by the hn-NSF model. For the harmonic component, although the passband of the low-pass filter is only up to 5 kHz, the neural filter module generates a harmonic component with high energy between 5 and 7 kHz, which compensates for the attenuation of the low-pass filter. As the last row of Figure 7 shows, the harmonic component dominates the generated waveform frame from 0 to 7 kHz, and its spectral amplitude is similar to that of natural speech.

IV Experiments

Following the explanation on our NSF models, we now discuss the experiments. After describing the corpus and data configuration in Section IV-A, we describe an ablation test conducted on the b-NSF model in Section IV-C. We then compare the three NSF models with WaveNet in Section IV-D. In Section IV-E, we investigate the controllability of the input F0 on the NSF models, and in Section IV-F, we examine the internal behaviors of the NSF models.

Our experiments used a data set of neutral-style reading speech by a Japanese female speaker (F009), which is part of the XIMERA speech corpus. This data set was recorded at a sampling rate of 48 kHz and segmented into 30,016 utterances. The total duration is around 50 hours.

We prepared three training subsets for the experiments on waveform modeling: the first subset contained 9,000 randomly selected utterances (15 hours), the second included 3,000 utterances (5 hours) randomly selected from the first subset, and the third contained 1,000 utterances (1.6 hours) randomly selected from the second subset. The first and third subsets were used to evaluate the performance of the NSF models and WaveNet in Section IV-D. The second subset was used in the ablation test on the NSF models in Section IV-C. We also prepared a validation set with 500 utterances and a test set with another 480 utterances. All utterances were downsampled to 16 kHz for waveform model training.

The acoustic features were extracted from the natural waveforms with a frame shift of 5 ms (200 Hz). The F0 values were extracted by an ensemble of pitch trackers. Two types of spectral features were prepared: Mel-generalized cepstral coefficients (MGCs) of order 60 extracted using the WORLD vocoder and Mel-spectrogram of order 80. We used the F0 and either the MGCs or the Mel-spectrogram as the input features to the waveform models.

To evaluate the waveform models in SPSS TTS systems, we also trained acoustic models to predict acoustic features from linguistic features. To train acoustic models, we extracted linguistic features from text, including quin-phone identity, phrase accent type, and other structural information. These features were force-aligned with the acoustic features by using hidden Markov models.

IV-B Model configurations

Four waveform models were evaluated in the experiments: our three NSF models (b-NSF, s-NSF, and hn-NSF) and AR WaveNet (WaveNet). We chose WaveNet as the benchmark because of its excellent performance reported in both the original paper and our previous study. We did not include IAF-flow-based models due to their high demand on training time and GPU resources. Neither did we consider Parallel WaveNet or ClariNet due to the lack of authentic implementation.

The network configuration of b-NSF was described in Section III-B. It used five baseline dilated-CONV filter blocks, and each block contained ten dilated CONV and other hidden layers, as illustrated in Figure 4. The dilation size of the $k$-th layer was $2^{k-1}$. s-NSF was the same as b-NSF except that each baseline dilated-CONV filter block was replaced with a simplified version. hn-NSF used the same network as s-NSF for the harmonic component and a single simplified block for the noise component. All three NSF models used the spectral amplitude distance $\mathcal{L}=\mathcal{L}_{1}+\mathcal{L}_{2}+\mathcal{L}_{3}$, and the configuration of each $\mathcal{L}_{*}$ is listed in Table III. Note that $\mathcal{L}_{1}$ used the same frame shift and frame length as those for extracting the acoustic features from the waveforms. $\mathcal{L}_{2}$ and $\mathcal{L}_{3}$ were decided so that one has a higher temporal resolution while the other has a higher frequency resolution than $\mathcal{L}_{1}$. The configurations in Table III may be inappropriate for a different corpus, and a good configuration may be found through trial and error.
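The dilation schedule can be written out directly. In the sketch below, the dilation sizes follow the $2^{k-1}$ rule for ten layers; the receptive-field estimate additionally assumes a kernel size of 3 and causal convolutions stacked without any other temporal operations, which is an assumption and not stated in this section.

```python
def dilation_sizes(n_layers=10):
    """Dilation size of the k-th layer is 2**(k-1), for k = 1..n_layers."""
    return [2 ** (k - 1) for k in range(1, n_layers + 1)]


def receptive_field(n_blocks=5, n_layers=10, kernel_size=3):
    """Receptive field in samples of n_blocks stacked blocks.

    kernel_size=3 is an assumed value used only for illustration.
    """
    per_block = (kernel_size - 1) * sum(dilation_sizes(n_layers))
    return 1 + n_blocks * per_block


print(dilation_sizes())    # [1, 2, 4, 8, 16, 32, 64, 128, 256, 512]
print(receptive_field())   # 10231
```

Under these assumptions, five blocks would see roughly 0.64 s of signal at a 16 kHz sampling rate.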

All the neural waveform models were trained on a single GPU card (Nvidia Tesla P100) using the Adam optimizer with a learning rate of $0.0003$, $\beta_{1}=0.9$, $\beta_{2}=0.999$, and $\epsilon=10^{-8}$. The training process was terminated when the error on the validation set continually increased for five epochs. The batch size was 1, and each utterance was truncated into segments of at most 3 seconds to fit the GPU memory.
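The stopping rule can be read in more than one way. The sketch below adopts one possible interpretation, namely that training stops once the validation error has risen in each of the last five consecutive epochs; the function name `should_stop` and this reading are assumptions, and a "no improvement over the best epoch" rule would be an equally plausible alternative.

```python
def should_stop(val_errors, patience=5):
    """Return True if the validation error increased in each of the
    last `patience` epochs (one reading of the stopping rule)."""
    if len(val_errors) < patience + 1:
        return False
    recent = val_errors[-(patience + 1):]
    return all(later > earlier for earlier, later in zip(recent, recent[1:]))
```

In a training loop, the validation error would be appended to a list after every epoch and the loop exited as soon as `should_stop` returns `True`.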

All the neural waveform models were implemented using a modified CURRENNT toolkit. Using the same toolkit allows us to fairly compare the models since they use the same set of low-level CUDA/Thrust functionalities. Note that our WaveNet implementation is sufficiently good as the benchmark. As another study on the same corpus demonstrated, our WaveNet implementation can generate speech waveforms that are similar to the original natural waveforms in terms of perceived quality, given natural acoustic features. (Unlike naive open-source implementations, our WaveNet-vocoder avoids redundant CONV operations as the so-called Fast-WaveNet did.) The code, scripts, and samples are publicly available at https://nii-yamagishilab.github.io/samples-nsf/nsf-v2.html.

For the acoustic models that predict acoustic features from the linguistic features, we used shallow and deep neural AR models to generate the MGCs and F0, respectively. The recipes for training these acoustic models were the same as those in one of our previous studies. We trained another deep AR model to generate the Mel-spectrogram using a similar recipe. The number of training utterances for acoustic models was around 28,000 (47 hours of data). The acoustic feature sequences were generated given the duration force-aligned on the test set.

IV-C Ablation test on b-NSF

Although we claimed that an NSF model can be implemented in varied network architectures, some of the components may be essential to model performance. This ablation test was conducted to identify those essential components.

We used b-NSF as the reference model and prepared a few variants, as listed in Table IV. All the models including b-NSF were trained using the MGCs, F0, and waveforms from the 5-hr. training set. The trained models then generated waveforms given the natural acoustic features in the test set, and these generated waveforms were evaluated in a subjective evaluation test. In one evaluation round, an evaluator listened to one speech waveform on one screen, rated the speech quality on a 1-to-5 mean-opinion-score (MOS) scale, and repeated the process for multiple screens. The waveforms in one evaluation round were for the same text and were played in a random order. All the waveforms were converted to 16-bit PCM format in advance.

A total of 245 paid Japanese native speakers participated in the test, and 1444 valid evaluation rounds were conducted. The results are plotted in Figure 8, and the difference between b-NSF and the other models was statistically significant ($p<0.01$), as two-sided Mann-Whitney tests demonstrated. First, a comparison made among b-NSF, L1, L2, and L3 showed that using spectral amplitude distances with different windowing and framing configurations is essential to the model’s performance. The waveforms generated from L1, L2, and L3 were perceptually worse because of a pulse-train-like sound in both unvoiced and voiced sounds. This type of artifact could be easily observed in the unvoiced segments plotted in Figure 9. We hypothesize that using spectral distances with different temporal-spatial resolutions could mitigate the pulse-train-like artifacts in the spectrogram.

By comparing b-NSF, S1, and S2, we found that the sine-based excitation is essential to the NSF models. In the case of S2, the generated waveforms were intelligible but unnatural because the perceived pitch was unstable. As Figure 9 shows, the waveform generated from S2 lacked the periodic structure that should be observed in voiced sounds. With a sine-based excitation, the b-NSF model may have a better starting point to generate waveforms with a periodic structure. This hypothesis is supported by the results of the investigation discussed in Section IV-F.

Interestingly, N1 outperformed b-NSF even though it used a simpler transformation in the filter blocks. In comparison, the waveforms generated from N2 were unnatural because they lacked the stable periodic structures in voiced segments. One possible reason is that the skip-connection enables the sine excitation to be propagated to the later filter blocks without being attenuated by the non-linear transformations.

In summary, the results of the ablation test suggest that both the sine excitation and multi-resolution spectral distances are crucial to NSF models. It is also important to keep the skip-connections inside the filter modules.

IV-D Comparison between WaveNet and NSF models

This experiment compared b-NSF, s-NSF, hn-NSF, and WaveNet under four training conditions. Each model was trained using either the 15-hr. or 1.6-hr. training set and conditioned on F0 and either Mel-spectrogram or MGCs. In the testing stage, each model generated speech waveforms given natural acoustic features or generated features produced by the acoustic models. Accordingly, each model was trained and tested under eight conditions. The generated speech waveforms were evaluated in a subjective evaluation test, which was organized in the same manner as that mentioned in Section IV-C.

The results are plotted in Figure 10. Among the three NSF models, hn-NSF performed better than or comparably well with the other two versions. Interestingly, the Mel-spectrogram-based s-NSF performed poorly when it was trained using the 15-hr. set. One reason was that s-NSF produced low-quality unvoiced sounds. For example, the unvoiced segments generated by s-NSF had a very small amplitude, as shown in Figure 11. b-NSF performed well when it was trained using MGCs and F0 from the 15-hr. training set, but its performance was worse than that of hn-NSF and s-NSF when the amount of training data was less than 2 hours. One hypothesis is that b-NSF requires more training data since it has more parameters, as Table V shows.

A comparison made between hn-NSF and WaveNet shows that hn-NSF was comparable to WaveNet in terms of the generated speech quality. Specifically, in the TTS application, hn-NSF trained on 15 hours of Mel-spectrogram data slightly outperformed WaveNet and b-NSF trained on 15 hours of MGC data, which were the best-performing models in our previous study.

IV-D2 Training and generation speed

After the MOS test, we compared the waveform training and generation speed of the experimental models. Although the theoretical time complexity is described in Table I, we measured the actual speed of each model in training and generation stages.

Table V lists the time cost to train the experimental models for one epoch. The training time cost on the 15-hr. training set was larger than that on the 1.6-hr. set because the number of training utterances increased. Nevertheless, the results indicate that the NSF models are comparable to WaveNet in terms of training speed. This is expected because all three NSF models and WaveNet require no sequential transformation of waveforms during model training.

For waveform generation, our implementation of the NSF models has normal and memory-saving generation modes. The normal mode allocates all of the required GPU memory once but cannot generate very long waveforms due to the limited memory space on a single GPU card. The memory-saving mode supports the generation of long waveforms because it releases and allocates the GPU memory layer by layer. However, these memory operations cost processing time.

We evaluated the NSF models in both modes by using a test subset with 80 test utterances, each of which was around 5 seconds long. As the results in Table V indicate, the NSF models were much faster than WaveNet even in the memory-saving mode. Note that WaveNet requires neither repeated memory operations nor a memory-saving mode; it is slow because of the AR generation process. Compared with b-NSF, s-NSF was faster in generation because of its simplified network structure. Although hn-NSF slightly lagged behind s-NSF in terms of generation speed, it outperformed b-NSF.

In summary, the results of the MOS and speed tests indicate that hn-NSF performed no worse than WaveNet in terms of the quality of the generated waveforms. Furthermore, hn-NSF outperformed WaveNet by a large margin in terms of waveform generation speed. (Footnote: The generation speed in the memory-saving mode may be further increased if the GPU memory allocation can be accelerated.)

IV-E Consistency between input F0 and F0 in generated waveforms

The ablation test discussed in Section IV-C demonstrated that sine excitation with the input F0 is essential to the NSF models. In this experiment, we investigated the consistency between the input F0 and the F0 of the waveforms generated from the NSF models, especially when the F0 input to the source module is not identical to the F0 in the input Mel-spectrogram.

This experiment was conducted on hn-NSF and WaveNet, which were trained using the natural Mel-spectrogram and F0 from the 15-hr. training set. The WaveNet was included as a reference model. Before generating the waveforms, we used the deep AR F0 model to randomly generate three F0 contours for each of the test set utterances. In this random generation mode, the three F0 contours for the same test utterance were slightly different from each other. Let F0r1, F0r2, and F0r3 denote the three sets of F0 contours. We then used the three F0 sets and the generated Mel-spectrogram as the input to hn-NSF and WaveNet, which resulted in six sets of generated waveforms.

For each of the six sets, we calculated the correlation between the input F0 and the F0 extracted from the generated waveforms. For reference, we also extracted the F0 from the input Mel-spectrograms using a neural-network-based method. The results listed in Table VI indicate that the F0 contours of the waveforms generated from the NSF models were highly consistent with the F0 input to the source module. In contrast, the F0 of the waveforms generated from WaveNet correlated with the F0 information buried in the input Mel-spectrogram.
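The correlation measurement described above can be sketched as a Pearson correlation between two frame-level F0 contours. Restricting the computation to frames that are voiced in both contours is an assumption of this sketch, since the handling of unvoiced frames is not stated here. The function name and the example values are hypothetical.

```python
import math


def voiced_f0_correlation(f0_input, f0_extracted):
    """Pearson correlation over frames that are voiced in both contours.

    A frame is treated as voiced when its F0 value is greater than zero.
    """
    pairs = [(a, b) for a, b in zip(f0_input, f0_extracted) if a > 0 and b > 0]
    if len(pairs) < 2:
        raise ValueError("need at least two frames voiced in both contours")
    mean_a = sum(a for a, _ in pairs) / len(pairs)
    mean_b = sum(b for _, b in pairs) / len(pairs)
    cov = sum((a - mean_a) * (b - mean_b) for a, b in pairs)
    var_a = sum((a - mean_a) ** 2 for a, _ in pairs)
    var_b = sum((b - mean_b) ** 2 for _, b in pairs)
    if var_a == 0 or var_b == 0:
        raise ValueError("a constant contour has no defined correlation")
    return cov / math.sqrt(var_a * var_b)


# Usage with toy contours in Hz (0 marks an unvoiced frame).
f0_in = [0.0, 120.0, 125.0, 130.0, 0.0, 140.0, 150.0]
f0_out = [0.0, 121.0, 126.0, 131.0, 110.0, 141.0, 151.0]
print(voiced_f0_correlation(f0_in, f0_out))
```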

The results indicate that we can easily control the F0 of the generated waveforms from the NSF models through directly manipulating the input F0. In contrast, it is less straightforward in the case of WaveNet because we have to manipulate the F0 contained in the input Mel-spectrogram.

IV-F Investigation of hidden features of hn-NSF

We argued in Section III-C that the simplified neural filter module in the s-NSF and hn-NSF models is similar to a deep residual network. It is thus interesting to look inside the neural filter module. From hn-NSF trained on 15 hours of Mel-spectrogram and F0, we generated one test utterance given natural condition data and extracted the one-dimensional output of each simplified filter block (i.e., the $\bm{v}_{1:T}^{\text{out}}$ in the bottom panel of Figure 4) in the sub-network to generate the harmonic waveform component. We also extracted the sine-based excitation $\bm{e}_{1:T}$, filtered harmonic and noise waveform components, and final output waveform $\widehat{\bm{o}}_{1:T}$. These signals and their spectrograms are plotted in Figure 12.

We can observe that the dilated-CONV filter blocks morphed the sine excitation into the waveform. The spectrogram of the sine excitation had no formant structure but only the fundamental frequency and harmonics. From blocks 1 to 5, the spectrogram of the signal was gradually enriched with the formant structure. The results also suggest that hn-NSF kept the F0 of the sine excitation in the output waveform. This explains why the F0 of the waveform generated from hn-NSF was highly consistent with the frequency of the sine excitation, or the input F0, in the experiments of Section IV-E.

Similar results were observed when we analyzed s-NSF and b-NSF. For b-NSF, the results are consistent with the ablation test where we found that b-NSF without the skip-connections in the filter module performed poorly (N2 in Section IV-C). The skip-connections make the filter module a deep residual network based on which the excitation signal can be gradually transformed into the output waveform.
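The role of the skip-connections can be illustrated with a toy residual filter module in which each block adds a transformed signal to its own input. The sketch below is only a schematic under strong assumptions: the single dilated causal convolution, the tanh non-linearity, the dilation schedule, and the weights are all hypothetical and much simpler than the actual filter blocks.

```python
import math


def dilated_causal_conv(x, weights, dilation):
    """1-D causal convolution: y[t] = sum_k weights[k] * x[t - k * dilation]."""
    y = []
    for t in range(len(x)):
        acc = 0.0
        for k, w in enumerate(weights):
            idx = t - k * dilation
            if idx >= 0:
                acc += w * x[idx]
        y.append(acc)
    return y


def filter_block(x, weights, dilation, use_skip=True):
    """Toy filter block: tanh(dilated conv), optionally added to the input."""
    h = [math.tanh(v) for v in dilated_causal_conv(x, weights, dilation)]
    if use_skip:
        return [a + b for a, b in zip(x, h)]
    return h


def filter_module(excitation, block_weights, use_skip=True):
    """Chain the blocks; block i uses dilation 2**i (an assumed schedule)."""
    signal = list(excitation)
    for i, weights in enumerate(block_weights):
        signal = filter_block(signal, weights, 2 ** i, use_skip=use_skip)
    return signal


# Usage: a 100 Hz sine at 16 kHz passed through five blocks with small weights.
sine = [0.1 * math.sin(2.0 * math.pi * 100.0 * t / 16000.0) for t in range(400)]
weights = [[0.05, -0.02, 0.01] for _ in range(5)]
with_skip = filter_module(sine, weights, use_skip=True)
without_skip = filter_module(sine, weights, use_skip=False)
print(max(abs(v) for v in with_skip), max(abs(v) for v in without_skip))
```

With the skip-connections, the excitation reaches the last block through an identity path even when the learned transformations are small, whereas without them the signal must pass through every non-linearity.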

These results also indicate how the sine excitation eases the task of waveform modeling because the neural filter modules do not need to reproduce the periodic structure that evokes the perception of F0 in voiced sounds. Without the sine excitation, it may be difficult for the neural filter modules to generate the periodic structure, which explains the poor performance of the b-NSF model without sine excitation (S2 in Section IV-C).

V Conclusion

We proposed a framework called “neural source-filter modeling” for the waveform models in TTS systems. A model implemented in this framework, which is called an “NSF model”, can convert input acoustic features into a high-quality speech waveform. Compared with other neural waveform models such as WaveNet, an NSF model does not use an AR network structure and avoids the slow sequential waveform generation process. Nor does an NSF model use flow-based approaches or knowledge distillation. Instead, an NSF model uses three modules that can be easily implemented: a source module that produces a sine-based excitation signal, a filter module that transforms the excitation into an output waveform, and a condition module that processes the input features for the source and filter modules. Such an NSF model can be efficiently trained using a merged spectral amplitude distance. Even though this distance is calculated using multiple short-time analysis configurations, it can be efficiently implemented on the basis of the STFT. Therefore, the proposed NSF framework allows a neural waveform model to be built and trained straightforwardly.

Experimental results indicated that the specific hn-NSF model, which uses separate modules to model the harmonic and noise components of waveforms, performed comparably to our WaveNet on a large single-speaker Japanese speech corpus. Furthermore, this NSF model can generate speech waveforms at a much faster speed. Another advantage of this NSF model is that the F0 input to the source module allows easy control of the pitch of the generated waveform.

In this initial study, we mainly described the NSF framework in detail and compared several NSF models with our verified WaveNet implementation. To further understand the NSF models’ performance, we need a thorough comparison between NSF models and other types of neural waveform models on multi-speaker corpora. We leave this task for future work because of the time required to implement and train other models.

References