EditSpeech: A Text Based Speech Editing System Using Partial Inference and Bidirectional Fusion

Daxin Tan, Liqun Deng, Yu Ting Yeung, Xin Jiang, Xiao Chen, Tan Lee

Introduction

Nowadays, audio sharing via social media has become a prevalent activity in daily life. With mobile apps like Himalaya and Instagram, users can conveniently record their own speech and share it with others. When making a long speech recording, e.g., telling a story or describing a procedure, unintentional speaking mistakes such as mispronunciations, missing words and stuttering are inevitable, especially for non-professional speakers. Even if a mistake affects only a small part of the audio, the user may need to redo the whole recording from the beginning in order to maintain coherent speech quality and speaking style. It would be highly desirable to allow the user to edit the recorded speech, e.g., insert missed words, replace mispronounced words, and/or remove unwanted speech or non-speech events, without degrading the quality and naturalness of the edited speech. This demand motivates our present study of designing a novel speech editing system, named EditSpeech.

There have been a few previous attempts at developing speech editing systems. The VOCO system was developed for English speech editing in a multi-speaker scenario. Speech signals containing the words to insert or replace are generated by unit-selection speech synthesis. As no contextual information is taken into consideration, speech prosody near the boundaries of the edited regions can be non-smooth and unnatural. In another work, context-aware prosody correction was applied to single-speaker English speech editing. The speech segments to be inserted were retrieved from other utterances of the same speaker. Target duration and pitch parameters were predicted from the context, and prosodic modification was then realized by applying the TD-PSOLA algorithm, followed by de-noising and de-reverberation. An obvious limitation of this system is that the words to insert or replace may not be found in the available speech data of the same speaker.

In the present study, we tackle the problem of speech editing on the basis of neural text-to-speech synthesis (NTTS). An NTTS based speech editing system, named EditSpeech, is developed and evaluated. The key idea of EditSpeech is that, given a speech utterance, we divide it into “to-modify” and “non-modify” regions according to the edited text and the speech-text alignment, and generate the new “modified” speech frames using NTTS conditioned on the “non-modify” frames. Three design elements are proposed. First, partial inference is adopted in the speech generation process, i.e., only the “to-modify” region is calculated during inference to produce “modified” frames, while the frames in the “non-modify” region are directly copied to produce “unmodified” frames, which minimizes unnecessary distortion. Second, a duration based auto-regressive (AR) NTTS model is employed for generating the “modified” frames, and the decoding logic is implemented in both forward and backward directions to maximize the use of the left and right “non-modify” frames as contextual conditions respectively. Third, a bidirectional fusion process follows to select the best generated frames from the NTTS model. In this way, contextual information related to the edited region is fully utilized and a smooth transition at the boundaries can be achieved.

The EditSpeech system is more efficient than unit-selection systems like VOCO, as search and selection of candidate units are not required in usage. EditSpeech can support arbitrary change of text content and does not require additional recording from the speakers concerned. The current version of EditSpeech is developed for both English and Chinese in the multi-speaker scenario.

Related work

Currently, the most widely used NTTS models can be categorized into two types: attention based AR models and duration based non-autoregressive (NAR) models. Tacotron2 and Transformer TTS are typical examples of the former, in which the text embedding and acoustic features are aligned with location-sensitive attention or multi-head self-attention, and the decoding of the current time step always conditions on the result of the previous step. In contrast, FastSpeech2 and Glow-TTS are representatives of the latter. They employ an extra duration predictor to address the alignment between text and acoustic frames, so the decoding of all time steps is conducted in parallel without internal dependency. However, these models cannot be directly employed to solve the speech editing problem: the attention based AR models are not able to control the duration of the generated “modified” frames, while the duration based NAR models fail to utilize the neighboring speech context due to parallel generation. EditSpeech adopts a duration based AR model as the backbone model, shown in Figure 1, to combine the advantages of both types of models. The backbone model is similar to that in DurIAN and Patnet, but our scheme differs in that we refine the predicted duration and use partial inference for speech generation in the editing scenario.

On the other hand, to make the edited speech as natural as possible, EditSpeech adopts two decoders to produce acoustic frames, one in the left-to-right direction and the other in the right-to-left direction. This is relatively novel in the speech generation domain. The most similar prior scheme adopts two unidirectional decoders and maximizes the agreement between the forward and backward decoding sequences as a regularization term for training the TTS system. Its target is to alleviate the “exposure bias” problem, and only one decoder is used in the synthesis stage. Our work is different in that we aim to better utilize the context, and we use both decoders in the editing scenario. Moreover, a similar scheme can be seen in the neural machine translation field, where it aims to utilize both historical and future information in text-to-text generation. Different from these works, we focus on context utilization on both the text side and the speech side. Our decoders not only take in the encoder output as text-side guidance, but also generate speech frames conditioned on the contextual acoustic frames from the “non-modify” regions.

The EditSpeech System

Figure 2 gives an overview of the EditSpeech system, where speech editing is exemplified through a replacement operation on a Chinese utterance. The edited text and the original speaker identity are used to derive the hidden representation. Duration information obtained from the original text and the original speech is used as a reference for the duration of the edited speech. Based on the original speech’s mel-spectrogram and the hidden representation, the two decoders predict mel-spectrograms in a partial inference manner. The two predicted mel-spectrograms are then fused into a single edited mel-spectrogram, which is converted by the vocoder into the edited speech waveform.

The EditSpeech system comprises an acoustic model, a duration predictor and a vocoder. The acoustic model is made up of a text encoder, a speaker encoder, a length regulator, a prenet, a forward decoder and a backward decoder.

The training process for the EditSpeech system is shown in Figure 3. Speech utterances with text transcription and speaker identity are used for training. The texts are converted into phone sequences using a grapheme-to-phoneme (G2P) module. Mel-spectrograms are computed from the speech utterances using the same signal processing configuration as in the Tacotron2 model. The ground-truth phone duration is obtained by an HMM based forced aligner.

The phone sequence is first processed by the text encoder to derive the phone-level text embedding. The speaker encoder generates an utterance-level speaker embedding from the speaker identity. Frame-level position embeddings are generated as interpolated values from 0 to 1, which represent the relative position of individual frames within a phone. The text embedding is expanded into frame-level embedding according to the ground-truth phone duration by the length regulator. It is then concatenated with the position embedding and the speaker embedding to construct frame-level hidden representation.
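The construction of the frame-level hidden representation can be illustrated with a minimal sketch. It assumes plain Python lists instead of tensors, a scalar position value per frame with both endpoints included, and a value of 0 for single-frame phones; the function names `position_embedding` and `build_hidden` are hypothetical and not taken from the system itself.

```python
from typing import List


def position_embedding(duration: int) -> List[float]:
    """Relative position of each frame within one phone, interpolated from 0 to 1.

    Assumption: a phone lasting a single frame gets position 0.
    """
    if duration == 1:
        return [0.0]
    return [i / (duration - 1) for i in range(duration)]


def build_hidden(
    phone_emb: List[List[float]],
    durations: List[int],
    speaker_emb: List[float],
) -> List[List[float]]:
    """Length regulator plus concatenation.

    Each phone-level embedding is repeated for as many frames as the phone
    lasts, then concatenated with the position value and the speaker embedding.
    """
    frames = []
    for emb, dur in zip(phone_emb, durations):
        for pos in position_embedding(dur):
            frames.append(emb + [pos] + speaker_emb)
    return frames


# Two phones lasting 3 and 1 frames, with a one-dimensional speaker embedding.
hidden = build_hidden([[1.0, 2.0], [3.0, 4.0]], [3, 1], [9.0])
```

In this toy example the first phone yields three frames whose position values are 0, 0.5 and 1, and the second phone yields a single frame.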

A forward decoder and a backward decoder are used to synthesize mel-spectrograms in the left-to-right and the right-to-left directions respectively. These two decoders share a common prenet and a common linear layer. Each decoder comprises two unidirectional LSTMs. For the left-to-right synthesis of the speech frame at time $t$, the ground-truth mel-spectrum of the preceding frame $m_{t-1}$ is first processed by the prenet. The prenet output is concatenated with the frame-level hidden representation of the preceding frame, denoted as $h_{t-1}$, and fed into a unidirectional LSTM to derive the context vector. Then the context vector and $h_{t}$ are taken by another unidirectional LSTM and a linear layer to predict a mel-spectrum representing the current frame, denoted as $\overrightarrow{m}_{t}$. The mean squared error (MSE) between $\overrightarrow{m}_{t}$ and the ground-truth mel-spectrum $m_{t}$ is used to compute the forward mel-spectrogram loss $L_{\overrightarrow{Mel}}$. The prediction of the mel-spectrogram in the right-to-left direction is conducted in a similar manner, except that $m_{t+1}$ and $h_{t+1}$ are involved to generate $\overleftarrow{m}_{t}$. Let the backward mel-spectrogram loss be denoted by $L_{\overleftarrow{Mel}}$. The text encoder, the speaker encoder, the prenet, the forward decoder and the backward decoder are jointly trained to minimize $L_{\overrightarrow{Mel}}+L_{\overleftarrow{Mel}}$.

The duration predictor aims to predict phone-level duration from the phone sequence. It is trained by minimizing the duration loss, which is the MSE between the logarithms of ground-truth and predicted duration.
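As an illustration, the duration loss can be written in a few lines. This is a sketch under the assumptions that durations are given in frames, that all durations are strictly positive, and that the natural logarithm is used; the name `duration_loss` is hypothetical.

```python
import math
from typing import Sequence


def duration_loss(ground_truth: Sequence[float], predicted: Sequence[float]) -> float:
    """Mean squared error between log ground-truth and log predicted durations."""
    if len(ground_truth) != len(predicted) or not ground_truth:
        raise ValueError("duration sequences must be non-empty and of equal length")
    squared_errors = [
        (math.log(g) - math.log(p)) ** 2 for g, p in zip(ground_truth, predicted)
    ]
    return sum(squared_errors) / len(squared_errors)
```

Working in the log domain makes the loss depend on the ratio between predicted and ground-truth duration, so an error of one frame weighs more on a short phone than on a long one.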

There are many different designs and implementations of vocoders for converting a mel-spectrogram into a speech waveform. In our work, the HiFi-GAN vocoder is adopted in consideration of its good quality and computational efficiency. The vocoder is trained from the open-source pre-trained version “UNIVERSAL_V1” (https://github.com/jik876/hifi-gan).

Speech Editing Operations

With a properly trained EditSpeech system, a user can perform speech editing by carrying out one of the three operations, namely deletion, insertion and replacement.

The deletion operation allows the user to remove a section of speech waveform that corresponds to certain specified words. By forced alignment, the system locates the start and end time of the phones to be deleted. The corresponding speech frames in the input speech’s mel-spectrogram are removed to obtain the edited mel-spectrogram, which is converted into speech waveform by the trained vocoder.
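The deletion operation amounts to dropping a contiguous block of frames. The sketch below assumes that the mel-spectrogram is a list of frames, that the aligner reports times in seconds, and that times are mapped to frame indices by rounding with a 12.5 ms hop; the helper names are hypothetical.

```python
from typing import List

HOP_SECONDS = 0.0125  # assumed frame hop of the mel-spectrogram


def seconds_to_frame(time_s: float, hop_s: float = HOP_SECONDS) -> int:
    """Map a time stamp from the forced aligner to the nearest frame index."""
    return int(round(time_s / hop_s))


def delete_frames(mel: List[List[float]], start: int, end: int) -> List[List[float]]:
    """Return a copy of the mel-spectrogram without frames start..end-1."""
    if not 0 <= start <= end <= len(mel):
        raise ValueError("invalid frame range")
    return mel[:start] + mel[end:]


# Toy mel-spectrogram with five one-dimensional frames; frames 1 and 2 are deleted.
edited = delete_frames([[0.0], [1.0], [2.0], [3.0], [4.0]], 1, 3)
```

No new frames are synthesized in this operation; the remaining frames are passed to the vocoder unchanged.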

Insertion and replacement

Speech editing involving insertion and/or replacement of words is more complex and more demanding than deletion, as the edited speech contains newly created content. To insert words, the user needs to specify: 1) the word position for the insertion; and 2) the text to insert. To replace words, the user needs to specify: 1) the first and the last words to be replaced; and 2) the new text.

The implementation of insertion and replacement is demonstrated through an example. Readers may refer to Figure 2. Without loss of generality, the mel-spectrogram of the original speech is divided into three parts, denoted as $[m_{A},m_{B},m_{C}]$ , which correspond to three parts of text $[T_{A},T_{B},T_{C}]$ . For $i\in\{A,B,C\}$ , $m_{i}$ contains a sequence of frame-level mel-spectra and $T_{i}$ is a sequence of words. Our goal is to obtain an edited speech waveform with the text content changed to $[T_{A},T_{B^{\prime}},T_{C}]$ , where $T_{B^{\prime}}$ is the text content to replace $T_{B}$ . Insertion is considered as the special case where $m_{B}=T_{B}=\emptyset$ .

With the G2P module, the original text and the replacement text are converted to phone sequence, denoted as $[P_{A},P_{B},P_{C}]$ and $P_{B^{\prime}}$ respectively. $P_{B}$ in the original phone sequence is changed to $P_{B^{\prime}}$ to give the edited phone sequence $[P_{A},P_{B^{\prime}},P_{C}]$ .

Duration sequence prediction and refinement

The original duration sequence is denoted by $[dur_{A},dur_{B},dur_{C}]$, where $dur_{i},i\in\{A,B,C\}$ is a sequence of phone durations in frames. They are obtained by the forced alignment process with the original text. The predicted duration sequence $[dur_{A}^{p},dur_{B^{\prime}}^{p},dur_{C}^{p}]$ is obtained from the duration predictor with the edited phone sequence $[P_{A},P_{B^{\prime}},P_{C}]$ as input.

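To ensure that the speaking rate is consistent in the modified region ($B^{\prime}$) and the unmodified regions ($A$, $C$) of speech, the predicted duration $dur_{B^{\prime}}^{p}$ is refined by referring to the original and predicted duration of the unmodified regions, i.e., it is scaled by the ratio between the original and the predicted total duration of the unmodified regions:

$$dur_{B^{\prime}}=dur_{B^{\prime}}^{p}\cdot\frac{\sum dur_{A}+\sum dur_{C}}{\sum dur_{A}^{p}+\sum dur_{C}^{p}}$$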

$dur_{B}$ of the original duration sequence is replaced by $dur_{B^{\prime}}$ to construct the edited duration sequence $[dur_{A},dur_{B^{\prime}},dur_{C}]$, and $t_{tot}=\sum dur_{A}+\sum dur_{B^{\prime}}+\sum dur_{C}$ is the total duration of the edited speech in frames.
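A minimal sketch of the refinement and of the edited duration sequence follows. It assumes that the refinement scales the predicted durations of the modified region by the ratio of original to predicted total duration in the unmodified regions; rounding to whole frames and a floor of one frame per phone are additional assumptions, and the name `refine_durations` is hypothetical.

```python
from typing import List, Sequence, Tuple


def refine_durations(
    dur_a: List[int],
    dur_c: List[int],
    pred_a: Sequence[float],
    pred_b: Sequence[float],
    pred_c: Sequence[float],
) -> Tuple[List[int], List[int], int]:
    """Return (refined durations of B', edited duration sequence, total frames)."""
    predicted_context = sum(pred_a) + sum(pred_c)
    if predicted_context <= 0:
        raise ValueError("predicted context duration must be positive")
    # Speaking-rate ratio observed on the unmodified regions A and C.
    ratio = (sum(dur_a) + sum(dur_c)) / predicted_context
    # Assumption: round to whole frames and keep at least one frame per phone.
    dur_b_new = [max(1, round(d * ratio)) for d in pred_b]
    edited = dur_a + dur_b_new + dur_c
    return dur_b_new, edited, sum(edited)


# The speaker is 25% slower than predicted on the unmodified regions (20 vs 16 frames).
dur_b_new, edited, t_tot = refine_durations([5, 7], [8], [4.0, 6.0], [3.0, 5.0], [6.0])
```

Here the predicted durations 3 and 5 of the modified region become 4 and 6 frames, and the edited utterance has 30 frames in total.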

Partial inference

The edited phone sequence, the edited duration sequence, the speaker embedding and the position embedding all serve as inputs. The frame-level hidden representation $h$ is derived from these inputs by the text encoder, the speaker encoder and the length regulator, and is then used for mel-spectrogram generation in both the forward and the backward direction.

Different from the fully auto-regressive generation in a normal NTTS system, our system generates the mel-spectrogram in a partial inference manner for both the forward and the backward direction, as detailed in Figure 4. Specifically, for each time step in the unmodified region, the predicted frame is discarded, and the original frame is fed to the prenet and the recurrent decoder for the prediction of the next frame. In contrast, for each time step in the modified region, the predicted frame is fed to the prenet and the recurrent decoder for the prediction of the next frame.

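(i) The partial inference process in forward direction:

Initialization: $m_{0}=\mathbf{0}$, $h_{0}=\mathbf{0}$

for $t=1$ to $t=\sum dur_{A}$: the forward decoder takes the original frame $m_{t-1}$ together with $h_{t-1}$ and $h_{t}$ as input, and the predicted frame is discarded.

for $t=\sum dur_{A}+1$ to $t=\sum dur_{A}+\sum dur_{B^{\prime}}$: $\overrightarrow{m}_{t}$ is predicted from the preceding frame together with $h_{t-1}$ and $h_{t}$, where the preceding frame is the original frame $m_{t-1}$ at the first step and the predicted frame $\overrightarrow{m}_{t-1}$ afterwards.

(ii) The partial inference process in backward direction:

Initialization: $m_{t_{tot}+1}=\mathbf{0}$, $h_{t_{tot}+1}=\mathbf{0}$

for $t=t_{tot}$ to $t=\sum dur_{A}+\sum dur_{B^{\prime}}+1$: the backward decoder takes the original frame $m_{t+1}$ together with $h_{t+1}$ and $h_{t}$ as input, and the predicted frame is discarded.

for $t=\sum dur_{A}+\sum dur_{B^{\prime}}$ to $t=\sum dur_{A}+1$: $\overleftarrow{m}_{t}$ is predicted from the following frame together with $h_{t+1}$ and $h_{t}$, where the following frame is the original frame $m_{t+1}$ at the first step and the predicted frame $\overleftarrow{m}_{t+1}$ afterwards.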
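The two loops can be sketched in plain Python as follows. The sketch rests on several assumptions: frames are 0-indexed lists of floats laid out on the edited timeline, the modified region is the half-open index range `[start, end)`, entries of `original` inside that range are `None`, and `step` is a hypothetical stand-in for the prenet, the two LSTMs and the linear layer. A real decoder also carries recurrent state from step to step, which is omitted here.

```python
from typing import Callable, List, Optional, Sequence

Frame = List[float]
# Hypothetical decoder step: (neighbouring mel frame, neighbouring h, current h) -> mel frame
Step = Callable[[Frame, Frame, Frame], Frame]


def partial_inference(
    original: Sequence[Optional[Frame]],
    hidden: Sequence[Frame],
    step: Step,
    start: int,
    end: int,
    n_mel: int,
    backward: bool = False,
) -> List[Frame]:
    """Return the predicted frames of the modified region [start, end)."""
    prev_m = [0.0] * n_mel           # zero initialization of the mel input
    prev_h = [0.0] * len(hidden[0])  # zero initialization of the hidden input
    # Forward: left context, then modified region. Backward: right context, then modified region.
    order = range(len(hidden) - 1, start - 1, -1) if backward else range(end)
    predicted = {}
    for t in order:
        pred = step(prev_m, prev_h, hidden[t])
        if start <= t < end:
            predicted[t] = pred   # modified region: keep the prediction and feed it back
            prev_m = pred
        else:
            prev_m = original[t]  # unmodified region: discard the prediction, feed the original frame
        prev_h = hidden[t]
    return [predicted[t] for t in range(start, end)]


# Toy decoder step: adds the current hidden value to the neighbouring mel value.
toy_step = lambda m, h_prev, h_cur: [m[0] + h_cur[0]]
toy_hidden = [[1.0]] * 5
toy_original = [[10.0], [20.0], None, None, [50.0]]
fwd = partial_inference(toy_original, toy_hidden, toy_step, 2, 4, n_mel=1)
bwd = partial_inference(toy_original, toy_hidden, toy_step, 2, 4, n_mel=1, backward=True)
```

With the toy step, the forward pass continues from the last left-context frame (20) and the backward pass from the first right-context frame (50), which shows how each direction is anchored to one boundary only.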

Bidirectional fusion

In the partial inference process, each decoder predicts the mel-spectrogram based only on the frames that it has already encountered. As a result, the forward decoder and the backward decoder guarantee fluency at the left and the right boundaries of the edited region respectively. To improve fluency and naturalness at both boundaries, the predicted mel-spectrograms from the forward decoder and the backward decoder ($\overrightarrow{m}_{t}$ and $\overleftarrow{m}_{t}$) are fused. This process is named bidirectional fusion and is also detailed in Figure 4. It should be noted that $\overrightarrow{m}_{t}$ and $\overleftarrow{m}_{t}$ are frame-synchronous. In the modified region, the frame-level $L2$-norm differences between the two predicted mel-spectrograms are first calculated, and the frame with the smallest $L2$-norm difference is selected as the fusion point.

For each frame in the unmodified region, the original mel-spectrum $m_{t}$ directly serves as the edited mel-spectrum $m_{t}^{e}$. In the modified region, for each frame before the fusion point, the forward predicted mel-spectrum $\overrightarrow{m}_{t}$ is selected as the edited mel-spectrum $m_{t}^{e}$. For each frame after the fusion point, the backward predicted mel-spectrum $\overleftarrow{m}_{t}$ is selected as the edited mel-spectrum $m_{t}^{e}$. The edited mel-spectra $m_{t}^{e}$ are merged to construct the edited mel-spectrogram, which is then fed to the trained vocoder to generate the waveform.
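The fusion rule can be sketched as below. The sketch assumes frame-synchronous lists of frames, takes the first minimum if several frames tie, and assigns the frame at the fusion point itself to the forward decoder, a choice the description above leaves open; the name `bidirectional_fusion` is hypothetical.

```python
import math
from typing import List, Tuple

Frame = List[float]


def bidirectional_fusion(
    left: List[Frame],
    forward: List[Frame],
    backward: List[Frame],
    right: List[Frame],
) -> Tuple[List[Frame], int]:
    """Fuse the two predictions of the modified region and re-attach the context.

    Returns the edited mel-spectrogram and the fusion index within the modified region.
    """
    if not forward or len(forward) != len(backward):
        raise ValueError("predictions must be non-empty and frame-synchronous")
    # Frame-level L2-norm difference between forward and backward predictions.
    diffs = [
        math.sqrt(sum((f - b) ** 2 for f, b in zip(fw, bw)))
        for fw, bw in zip(forward, backward)
    ]
    fusion = diffs.index(min(diffs))
    # Assumption: the fusion-point frame itself is taken from the forward decoder.
    fused = forward[: fusion + 1] + backward[fusion + 1 :]
    return left + fused + right, fusion


edited_mel, fusion_idx = bidirectional_fusion(
    left=[[9.0]],
    forward=[[0.0], [1.0], [5.0]],
    backward=[[3.0], [1.5], [2.0]],
    right=[[8.0]],
)
```

In the toy example the two predictions agree best at the middle frame, so the first two frames come from the forward decoder and the last one from the backward decoder.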

Experiment

Four speech editing systems are developed on four datasets respectively. The overview of datasets is shown in Table 1. For each speaker, we select 99% of speech utterances as the training set, and the remaining 1% as the test set for speech editing.

Configuration detail

The audio is resampled to 22050 Hz. The short-time Fourier transform (STFT) is first carried out using a 50 ms frame size and a 12.5 ms frame hop with the Hann window function. The STFT magnitude is then converted to the mel-spectrogram based on an 80-channel mel filterbank spanning 0 to 8000 Hz, followed by log dynamic range compression.

Forced alignment is carried out by the Montreal Forced Aligner (MFA). The G2P modules are different for English and Chinese. For English, g2pE with the CMU Pronouncing Dictionary (http://www.speech.cs.cmu.edu/cgi-bin/cmudict) is adopted to convert English words to phones. For Chinese, a toolkit named pypinyin (https://github.com/mozillazg/python-pinyin) is first utilized to convert the Chinese characters (Hanzi) to pinyin. Then a pre-trained G2P module in MFA converts the pinyin into global phones. Punctuation such as commas is converted to the “short pause” symbol in this process.

The configuration of modules is listed in Table 2. The Adam optimizer with learning rate of $10^{-3}$ and weight decay of $10^{-6}$ is used in the training process. The batch size is set to be 32. The English/Chinese systems are trained for 100k/200k iterations respectively.

Systems for comparison

We mainly compare EditSpeech and the baseline systems in terms of the insertion and replacement operations. As new content is added in these two operations, the differences between the outputs of the systems are evident in evaluation. The baseline systems are listed below.

Baseline system 1: This system is a complete TTS system with the whole edited text and original speaker as input, without considering the original speech.

Baseline system 2: This system is a TTS system with the new text part and original speaker as input, and the original speech is utilized by only simple concatenation. Specifically, the text to insert/replace is used to generate mel-spectrogram, which is then concatenated with the unmodified region of original mel-spectrogram to construct the edited mel-spectrogram.

Baseline system 3: This system is a TTS system with the whole edited text and original speaker as input, and the original speech is utilized by frame location and concatenation. Specifically, a candidate mel-spectrogram is synthesized as in baseline system 1. Then dynamic time warping (DTW) on mel-cepstral coefficients (MCEP) is used to align the unmodified region of the original and candidate mel-spectrograms. In this way, the modified region of the candidate mel-spectrogram can be located, which is then concatenated with the unmodified region of the original mel-spectrogram to construct the edited mel-spectrogram.

Baseline system 4: This system is similar to the proposed system except that only one left-to-right decoder is adopted and the bidirectional fusion step is removed.
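The DTW alignment step used in baseline system 3 can be illustrated with a textbook implementation. This is a generic sketch, not the code used in the experiments: it assumes Euclidean distance between feature frames, the standard three-way step pattern, and a hypothetical function name `dtw_path`. Locating the modified region from the resulting path is not shown.

```python
import math
from typing import List, Tuple

Frame = List[float]


def dtw_path(a: List[Frame], b: List[Frame]) -> Tuple[float, List[Tuple[int, int]]]:
    """Align two feature sequences; return the total cost and the warping path."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j]: minimal accumulated distance aligning a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = math.sqrt(sum((x - y) ** 2 for x, y in zip(a[i - 1], b[j - 1])))
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    # Backtrack from the end, always moving to the cheapest predecessor.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min(
            (cost[i - 1][j - 1], i - 1, j - 1),
            (cost[i - 1][j], i - 1, j),
            (cost[i][j - 1], i, j - 1),
        )
    path.reverse()
    return cost[n][m], path


total, path = dtw_path([[0.0], [1.0], [2.0]], [[0.0], [1.0], [1.0], [2.0]])
```

In the toy example the second sequence repeats its middle frame, so the path maps frame 1 of the first sequence to frames 1 and 2 of the second at zero total cost.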

Result and Discussion

In our experiment, 1/3 of the words in the middle of each sentence are masked. The audio of the masked words is first deleted, and then re-synthesized and inserted back into the original location. As the original audio uttered by a human is highly natural, a smaller difference between the edited audio and the original audio not only indicates higher naturalness of the edited audio but also demonstrates better utilization of the text and speech context. Mel-cepstral distortion (MCD) is adopted to measure the difference between the edited and the original audio, where a lower MCD means higher similarity. The MCD evaluation is carried out on three parts: the modified region, the unmodified region and the whole utterance. 30 utterances are randomly selected for MCD calculation. The MCDs of baseline systems 1 and 2 and of our proposed system are compared, and the results are shown in Table 3.
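For reference, a commonly used definition of MCD is sketched below. The exact variant used in the evaluation is not specified here, so the sketch assumes the widespread form $\frac{10}{\ln 10}\sqrt{2\sum_{d}(c_{d}-c_{d}^{\prime})^{2}}$ averaged over frames, with the 0th (energy) coefficient excluded and the two sequences already time-aligned; the name `mel_cepstral_distortion` is hypothetical.

```python
import math
from typing import List

Frame = List[float]


def mel_cepstral_distortion(reference: List[Frame], edited: List[Frame]) -> float:
    """Mean MCD in dB over time-aligned mel-cepstral frames.

    Assumption: the 0th coefficient of each frame is excluded.
    """
    if not reference or len(reference) != len(edited):
        raise ValueError("sequences must be non-empty and time-aligned")
    scale = 10.0 / math.log(10.0)
    total = 0.0
    for ref_frame, edit_frame in zip(reference, edited):
        squared = sum((r - e) ** 2 for r, e in zip(ref_frame[1:], edit_frame[1:]))
        total += scale * math.sqrt(2.0 * squared)
    return total / len(reference)
```

Restricting the input frames to the modified region, the unmodified region or the whole utterance gives the three regional scores described above.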

The results indicate that: 1) In the modified region, the proposed system has the lowest MCD among the three systems, while baseline system 1 has a lower MCD than baseline system 2. The likely reason is that baseline system 2 synthesizes the modified region directly without utilizing either the speech or the text context, while baseline system 1 utilizes the text context but neglects the speech context. In contrast, the proposed system utilizes both the speech and the text context. 2) In the unmodified region, baseline system 2 and the proposed system have a much lower MCD than baseline system 1. As baseline system 1 synthesizes the speech directly from the text without taking the original mel-spectrogram into consideration, the generated mel-spectrogram has a large MCD even for the parts that are not intended to be modified. For baseline system 2 and the proposed system, the mel-spectrogram of the unmodified region is exactly the same as the original one, and the distortion of the waveform mainly comes from the vocoder. 3) The MCD of the whole utterance implicitly reflects the naturalness of the whole utterance. The proposed system exhibits the lowest MCD compared with baseline systems 1 and 2.

Subjective evaluation

For subjective evaluation, we test the insertion and replacement operations with the VCTK (English) and MST (Chinese) datasets. For each dataset, 15 samples are provided for each operation. Each sample contains 6 audio clips: the original audio, the edited audio from baseline systems 1 to 4, and the edited audio from the proposed system. 15 and 20 listeners participate in the tests on the VCTK and MST datasets respectively. For each sample, listeners are required to: 1) rate the naturalness of the edited speech on a scale from 1 to 5, where 1 means “completely unnatural” and 5 means “completely natural”; 2) rate the similarity of the edited audio to the original audio in the unmodified region on a scale from 1 to 5, where 1 means “completely different in the unmodified region” and 5 means “exactly same in the unmodified region”; 3) indicate the preference between the baseline systems and the proposed system. The results are shown in Tables 4 and 5.

The results show that EditSpeech achieves the highest MOS in most cases for both insertion and replacement operations on both the Chinese and English datasets; in two cases the highest MOS is achieved by baseline system 4. Moreover, the ABX preference test indicates that listeners prefer EditSpeech in most cases; in one case baseline system 4 is preferred.

The EditSpeech system consistently outperforms baseline systems 1 to 3, which demonstrates that taking both text and speech context into consideration helps to improve speech editing performance. Specifically, system 1 generates speech based on the whole text but totally neglects the original speech context. Systems 2 and 3 both maintain the original speech context by concatenation, but the speech generation is not conditioned on the speech context. The text context is considered in the speech generation in system 3 but not in system 2. In contrast, our system generates speech based on both text context and speech context. Moreover, our system shows an advantage over baseline system 4 in most cases, indicating that the consideration of both the left and the right context further improves speech editing performance.

Conclusion

The EditSpeech system allows users to perform deletion, insertion and replacement of words in a given speech recording. The use of the NTTS approach leads to an effective system design and facilitates unrestricted change of speech content. Partial inference and bidirectional fusion introduce low distortion and maintain speech naturalness. Having demonstrated effectiveness on English and Chinese, the design of EditSpeech is expected to be applicable to many other languages.

Acknowledgements

This research is partially supported by a Tier 3 funding from ITSP (Ref: ITS/309/18) of the Hong Kong SAR Government, and a Knowledge Transfer Project Fund (Ref: KPF20QEP26) from the Chinese University of Hong Kong.