A$^3$T: Alignment-Aware Acoustic and Text Pretraining for Speech Synthesis and Editing

He Bai, Renjie Zheng, Junkun Chen, Xintong Li, Mingbo Ma, Liang Huang

Introduction

Recently, speech representation learning has attracted much attention in the speech community due to its strong performance on many speech-related downstream tasks, such as speech recognition, speech classification, and speech translation (Baevski et al., 2020; Chen et al., 2020; Liu et al., 2020; Zheng et al., 2021; Hsu et al., 2021).

However, all these efforts support only speech understanding tasks, which take speech as input. For the inverse direction, speech synthesis, which produces speech as output, the potential of representation learning is yet to be realized. For example, one line of work, such as wav2vec 2.0 (Baevski et al., 2020), Hubert (Hsu et al., 2021) and SLAM (Bapna et al., 2021), learns discrete quantized speech units as latent representations. These models are therefore good at recognizing and extracting discrete information from speech and successfully improve automatic speech recognition (ASR), but they are unable to generate continuous acoustic signals for speech synthesis. Another line of work, such as MAM (Chen et al., 2020) and FAT-MLM (Zheng et al., 2021), shows that reconstructing masked spectrograms with continuous units can improve speech-to-text translation. However, the quality of their reconstructed speech falls far short of what speech synthesis requires (see Fig. 5(f)).

To address this problem, we propose Alignment-Aware Acoustic-Text Pretraining (A3T), a framework in which we introduce cross-modal alignment embeddings. These embeddings make it easier for the model to learn the alignment between the acoustic and phoneme inputs during multi-modal pretraining, and they significantly improve the quality of the reconstructed acoustic signals. Unlike the segment embeddings used in Segatron and SegaBERT (Bai et al., 2021, 2022), which improve language modeling by grouping tokens according to their sentence and paragraph positions, our alignment embeddings align a phoneme with its frames so that the model learns cross-modal self-attention (Fig. 6). Moreover, we borrow several useful ideas from the recent text-to-speech (TTS) literature, including Conformer (Gulati et al., 2020; Guo et al., 2021) and Post-Net (Shen et al., 2018b), to further improve the quality of our reconstructed spectrograms.

Without any finetuning, the proposed model can be adopted as a speech-editing system, which modifies an existing speech recording by reconstructing the desired acoustic signals given the original contextual speech and the modified text. Furthermore, with our proposed prompt-based decoding method, the model can be adopted as a multi-speaker TTS system that synthesizes an unseen speaker’s speech without an external speaker verification model (speaker embeddings). Our experiments show that A3T with prompt-based decoding can outperform a TTS model equipped with both the speaker embedding (Jia et al., 2018) and the global style token (GST) (Wang et al., 2018).

In summary, our contributions are as follows.

We propose Alignment-Aware Acoustic-Text Pretraining (A3T), a BERT-style pretraining model that takes both phonemes and partially masked spectrograms as inputs. It reconstructs masked spectrograms with high quality without finetuning and uses the identical framework for decoding.

We show that the proposed A3T model can perform speech editing and outperforms the current SOTA.

We propose a prompt-based decoding method and show that our A3T model can synthesize speech for unseen speakers and outperforms the speaker-embedding-based multi-speaker TTS system.

Previous Work

Recently, neural TTS systems have become capable of generating audio with high naturalness (Oord et al., 2016; Shen et al., 2018a; Ren et al., 2019; Peng et al., 2019; Ren et al., 2020). SOTA neural TTS systems generally consist of two stages: the text-to-spectrogram stage, which generates an intermediate acoustic representation (linear- or mel-spectrogram) from the text, and the spectrogram-to-wave stage (vocoder), which converts this acoustic representation into actual wave signals (Oord et al., 2018; Prenger et al., 2019). We focus on the text-to-spectrogram stage and use an off-the-shelf vocoder, Parallel WaveGAN (Yamamoto et al., 2020).

In the multi-speaker and unseen-speaker settings, existing TTS models need to be trained with an additional input feature, the speaker embedding (Jia et al., 2018), which is extracted from an external speaker verification model trained on audio from tens of thousands of speakers. During inference for an unseen speaker, the embedding is extracted from one of this speaker’s other audio examples. However, the embedding from the speaker verification model is not optimized directly to capture the speaker characteristics relevant to synthesis, and it cannot provide enough information for the TTS model to generate audio similar to the example.

The input of speech editing includes the original speech, the original text, and the modified text. Jin et al. (2017) propose to insert a regenerated audio clip back into the original recording. However, because the speech context is not taken into account, the boundaries of the modified region may not be smooth. Morrison et al. (2021) propose to retrieve the modified speech segments from other utterances of the same speaker and correct the prosody with a context-aware TD-PSOLA corrector (Moulines & Charpentier, 1990). However, the edited content may not be found in the speech data of the same speaker. Most recently, Tan et al. (2021) use a neural TTS model to generate better modified speech. This method is compatible only with auto-regressive decoding models and relies heavily on speaker embeddings, which limits its efficiency and its transferability to new speakers.

Speech Pretraining

To improve Text-to-Speech models with larger-scale unlabeled speech data, one idea is to pretrain on speech data. All existing speech pretraining work learns either discrete units, which can only support speech understanding tasks, or spectrograms, but with very low quality.

Wav2vec 2.0, proposed by Baevski et al. (2020), is currently the most popular speech pretraining model. It masks the speech input in the latent space and pretrains the model by predicting discrete units via a contrastive task defined over a quantization of the latent representations, as shown in Fig. 1(a). Similar to wav2vec 2.0, Hubert (Hsu et al., 2021) and SLAM (Bapna et al., 2021) also learn discrete speech units from contextualized representations. These models therefore achieve good performance in speech recognition tasks, but they are unable to generate continuous acoustic signals for speech synthesis.

Reconstructing Low-Quality Spectrogram

Recently, Chen et al. (2020) propose to learn a speech encoder in a self-supervised fashion on the speech side, which can utilize speech data without transcription. Fig. 1(b) shows the architecture of this model, termed Masked Acoustic Modeling (MAM). MAM replaces a span of the speech spectrogram with mask tokens and learns to recover the masked spectrogram during training. Zheng et al. (2021), in turn, propose a Fused Acoustic and Text Masked Language Model (FAT-MLM), which jointly learns a unified representation for both acoustic and text input from various types of corpora, including parallel data for speech recognition and machine translation as well as pure speech and text data, as shown in Fig. 1(c).

Both MAM and FAT-MLM reconstruct spectrograms. However, the quality of their spectrogram output falls far short of what speech synthesis requires (see Fig. 5(f)), because these pretrained models are all used for a speech understanding task (speech-to-text translation), where the quality of the reconstructed spectrogram is not very important.

Alignment-Aware Acoustic-Text Pretraining

Although existing speech pretraining models show a strong representation learning ability and significantly improve many downstream tasks in speech understanding, none of these efforts supports speech synthesis tasks. To address this problem, we propose Alignment-Aware Acoustic-Text Pretraining (A3T), which learns to generate high-quality spectrograms given speech context and text.

Cross-modal Alignment Embedding

To strengthen the interaction between the speech and text inputs, we introduce a cross-modal alignment embedding as one input of the encoder. We sum the $i$-th acoustic embedding $\bm{e}_{s_{i}}$ or text embedding $\mathbf{x}_{i}$ with its positional embedding $\bm{e}_{\text{pos}_{i}}$ and alignment embedding $\bm{e}_{\text{aln}_{i}}$, for example $\bm{e}_{s_{i}}+\bm{e}_{\text{pos}_{i}}+\bm{e}_{\text{aln}_{i}}$ for an acoustic input; previous work has shown that this embedding sum is simple and effective (Devlin et al., 2018; Bai et al., 2021). As a result, a phoneme embedding and its corresponding acoustic embeddings share the same alignment embedding. We use a forced aligner (Yuan & Liberman, 2008) to pre-process the dataset and obtain the alignment information, as shown in Fig. 2(a).
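As a rough illustration of how shared alignment indices could be derived from forced-alignment output, the sketch below gives each phoneme an index and repeats that index for every frame aligned to it. The function name `alignment_ids`, the 1-based indexing, and the input format (a frame count per phone) are illustrative assumptions rather than details of the released implementation.

```python
from typing import List, Tuple


def alignment_ids(frames_per_phone: List[int]) -> Tuple[List[int], List[int]]:
    """Return (phone_ids, frame_ids) so that a phone and its frames share an index.

    frames_per_phone[i] is the number of spectrogram frames that a forced
    aligner assigned to the i-th phone (hypothetical input format).
    """
    phone_ids = list(range(1, len(frames_per_phone) + 1))  # 1-based, assumed
    frame_ids: List[int] = []
    for idx, n_frames in zip(phone_ids, frames_per_phone):
        if n_frames < 0:
            raise ValueError("frame counts must be non-negative")
        # Every frame of this phone looks up the same alignment embedding row.
        frame_ids.extend([idx] * n_frames)
    return phone_ids, frame_ids


phone_ids, frame_ids = alignment_ids([3, 2, 4])
print(phone_ids)  # [1, 2, 3]
print(frame_ids)  # [1, 1, 1, 2, 2, 3, 3, 3, 3]
```

Both index lists would then be used to look up rows of the same alignment embedding table before the sum with the content and positional embeddings.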

Conformer

Given the recent success of Convolution-augmented Transformer (Conformer) on various speech tasks (Gulati et al., 2020; Guo et al., 2021), we adopt Conformer as the backbone of our encoder and decoder. Compared with Transformer, Conformer introduces a convolution module and an additional feedforward module, which is shown in Fig. 2(c). In our experiments, we find Conformer is better than Transformer for acoustic-text pretraining.

Post-Net and Loss Function

We follow Tacotron 2 (Shen et al., 2018b) and use a Post-Net to refine the generated spectrogram. The predicted spectrogram is passed through a 5-layer convolutional Post-Net to be refined, as shown in Fig. 2(d).

Here, $g$ is a Post-Net that tries to recover a better original signal from the encoded representation $f([e_{\hat{\mathbf{s}}};\hat{\mathbf{x}}])$, so the reconstructed spectrogram is $g(f([e_{\hat{\mathbf{s}}};\hat{\mathbf{x}}]))$. We use the mean absolute error (MAE) to measure the difference between the original spectrogram $s$ and the reconstructed spectrogram.
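The MAE criterion can be illustrated with a small, dependency-free sketch. Restricting the average to masked frames is an assumption made here for illustration (it mirrors how masked reconstruction is usually scored); the helper name `mae_loss` and the list-of-lists spectrogram format are likewise hypothetical.

```python
from typing import Sequence


def mae_loss(target: Sequence[Sequence[float]],
             predicted: Sequence[Sequence[float]],
             masked: Sequence[bool]) -> float:
    """Mean absolute error over the frames flagged in `masked` (assumed scope)."""
    total, count = 0.0, 0
    for t_frame, p_frame, is_masked in zip(target, predicted, masked):
        if not is_masked:
            continue  # unmasked frames are given as context, not scored here
        for t, p in zip(t_frame, p_frame):
            total += abs(t - p)
            count += 1
    if count == 0:
        raise ValueError("no masked frames to score")
    return total / count


target = [[0.0, 1.0], [2.0, 3.0], [4.0, 5.0]]
predicted = [[0.0, 1.0], [2.5, 2.0], [9.0, 9.0]]
print(mae_loss(target, predicted, [False, True, False]))  # 0.75
```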

A3T for Speech Editing

Once A3T finishes the pretraining process, it can be used as a speech editing system directly with an external duration predictor, which is shown in Fig. 3.

A3T for Multi-speaker TTS

In addition to speech editing, we find that our model has the potential for unseen-speaker TTS.

Existing popular unseen-speaker TTS models (Jia et al., 2018) are trained with seen-speaker embeddings and generalize to unseen-speaker embeddings during inference. However, such speaker embeddings are extracted from an external speaker verification model that is trained with tens of thousands of speakers.

In this work, we find that our model can achieve naturalness comparable to that of models with speaker embeddings on the unseen-speaker TTS task. Moreover, our generations are more similar to the unseen speaker’s reference speech. Fig. 4 illustrates how to synthesize speech for unseen speakers with our A3T model, an approach we call prompt-based A3T.

The key idea is to concatenate the prompt and the target into a new input utterance, where the target speech consists of $n$ [MASK] frames and $n$ is predicted by a duration predictor. Given the concatenated speech and text as input, the A3T model predicts the spectrogram of these masked frames. The role of the reference text and speech in our model is similar to that of prompts in language models (Brown et al., 2020), and hence we call it prompt-based decoding/generation.
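The input construction described above can be sketched as follows. This is a minimal illustration under stated assumptions: frames are represented by placeholder strings, `MASK` and `build_prompt_input` are hypothetical names, and the duration predictor is replaced by a list of already-predicted frame counts.

```python
from typing import List, Tuple

MASK = "[MASK]"  # placeholder for one masked frame (illustrative)


def build_prompt_input(prompt_phones: List[str],
                       prompt_frames: List[str],
                       target_phones: List[str],
                       predicted_durations: List[int]) -> Tuple[List[str], List[str]]:
    """Concatenate prompt and target into one (text, speech) input pair.

    predicted_durations[i] is the frame count that a duration predictor
    assigns to target_phones[i]; the target speech is n = sum(durations) masks.
    """
    if len(target_phones) != len(predicted_durations):
        raise ValueError("one duration is needed per target phone")
    n = sum(predicted_durations)
    text = prompt_phones + target_phones
    speech = prompt_frames + [MASK] * n
    return text, speech


text, speech = build_prompt_input(["HH", "AY"], ["f0", "f1", "f2"], ["OW", "K"], [2, 1])
print(text)    # ['HH', 'AY', 'OW', 'K']
print(speech)  # ['f0', 'f1', 'f2', '[MASK]', '[MASK]', '[MASK]']
```

The model would then be asked to fill in only the trailing masked frames, with the prompt frames acting as the speaker and style context.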

Experiments

In this section, we describe our experiments on the spectrogram reconstruction pretraining task, the speech-editing task, and multi-speaker TTS. Spectrogram reconstruction is our pretraining task, on which we conduct an ablation study to show the contributions of different components and the effects of different masking rates. The speech-editing experiment settings follow Tan et al. (2021): we deploy two speech-editing systems with two datasets and evaluate the Mel-cepstral distortion (MCD) score and the mean opinion score (MOS) (Chu & Peng, 2006) annotated by humans on Amazon Mechanical Turk. The multi-speaker TTS experiments include seen-speaker TTS and unseen-speaker TTS, evaluated with MOS scores.

Following Tan et al. (2021), we conduct our speech-editing experiments with a single-speaker TTS dataset LJSpeech (Ito & Johnson, 2017) and a multi-speaker TTS dataset VCTK (Yamagishi et al., 2019). The LJSpeech dataset is a single-speaker dataset with 13K examples in 24 hours. The VCTK dataset is a multi-speaker dataset with 109 speakers and 44K examples in 44 hours. It should be noted that after finishing the pretraining process with LJSpeech or VCTK, our A3T will be used as a speech-editing system without any further finetuning.

We test the multi-speaker TTS task with the VCTK dataset. For seen multi-speaker TTS, each speaker’s examples are split into train and test sets. For unseen multi-speaker TTS, the test set contains 10 speakers’ examples, and the other 99 speakers’ examples are used for training.

Configuration Details

Raw audio files are processed with a 50 ms frame size and a 12.5 ms frame hop, using the Hann window function, to extract 80-dimensional log-Mel filterbanks. We use a 24K sampling rate for VCTK and 22K for LJSpeech. Forced alignment and G2P are both carried out with HTK (Young et al., 2002) to convert English words to phones and align phones with audio segments. For the speech-editing systems and prompt-based TTS, we use the publicly available duration predictor from FastSpeech 2 implemented in ESPnet (Inaguma et al., 2020). We use the Parallel-WaveGAN (Yamamoto et al., 2020) vocoder for all systems.
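To make the framing concrete, the following sketch converts the frame size and hop from milliseconds to samples for the 24K VCTK setting. The helper names are hypothetical, and the frame-count formula assumes a centred STFT with padding, which is a common convention but is not stated in the text.

```python
from typing import Tuple


def frame_params(sample_rate: int, frame_ms: float = 50.0,
                 hop_ms: float = 12.5) -> Tuple[int, int]:
    """Convert frame size and hop from milliseconds to samples."""
    win_length = round(sample_rate * frame_ms / 1000)
    hop_length = round(sample_rate * hop_ms / 1000)
    return win_length, hop_length


def num_frames(num_samples: int, hop_length: int) -> int:
    """Frame count for a centred, padded STFT (assumed convention)."""
    return 1 + num_samples // hop_length


win, hop = frame_params(24000)
print(win, hop)                # 1200 300
print(num_frames(24000, hop))  # 81 frames for one second of audio
```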

All A3T models pretrained in our experiments share the same architecture: a 4-layer Conformer encoder, a 4-layer Conformer decoder, and a 5-layer Conv1d Post-Net, with 2-head multi-head attention of dimension 384. The convolution kernel sizes of the encoder and decoder are 7 and 31, respectively. The shape of the alignment embeddings is (500, 384), where we assume that the number of phones will not exceed 500 for a single input. The shape of the input phone embeddings is (73, 384), and we use a ReLU (Agarap, 2018) nonlinear layer to transform the 80-dim log-Mel filterbank features to 384-dim. The total number of parameters is 67.7M.
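The hyperparameters above can be collected in a small configuration object, which also makes the sizes of the two embedding tables easy to check. The class and field names below are hypothetical and chosen only for this sketch; the values are those stated in the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class A3TConfig:
    """Illustrative container for the reported hyperparameters."""
    encoder_layers: int = 4
    decoder_layers: int = 4
    postnet_layers: int = 5
    attention_heads: int = 2
    hidden_dim: int = 384
    encoder_kernel: int = 7
    decoder_kernel: int = 31
    max_phones: int = 500   # rows of the alignment embedding table
    phone_vocab: int = 73   # rows of the phone embedding table
    mel_dim: int = 80

    def alignment_table_params(self) -> int:
        return self.max_phones * self.hidden_dim

    def phone_table_params(self) -> int:
        return self.phone_vocab * self.hidden_dim

    def head_dim(self) -> int:
        return self.hidden_dim // self.attention_heads


cfg = A3TConfig()
print(cfg.alignment_table_params())  # 192000
print(cfg.phone_table_params())      # 28032
print(cfg.head_dim())                # 192
```

The two embedding tables together account for only about 0.2M of the 67.7M parameters, so the alignment embedding adds very little model size.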

During training, we use the Adam optimizer with an initial learning rate of 1.0, 4000 warmup steps, and the Noam learning rate scheduler. Instead of setting a fixed batch size, we adjust the batch size according to the length of the input examples and set a maximum batch-bin (the total number of input elements) for each model. Following MAM (Chen et al., 2020), 15% of the frames are masked for speech-only input. For speech-text input, we randomly select several phoneme spans (80% of the phonemes) and mask their corresponding frames. For the speech-editing experiments, we use a 2.4M batch-bin and 1M steps for LJSpeech, and a 3M batch-bin and 1.2M steps for VCTK.
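For readers unfamiliar with the Noam schedule, a minimal sketch of its standard form is given below. It assumes that the stated learning rate of 1.0 acts as the multiplicative factor of the schedule and that the model dimension entering the formula is the 384-dim attention size; both are assumptions about the training setup rather than facts stated in the text.

```python
def noam_lr(step: int, d_model: int = 384, warmup: int = 4000,
            factor: float = 1.0) -> float:
    """Noam schedule: linear warmup, then inverse-square-root decay."""
    if step < 1:
        raise ValueError("step must be >= 1")
    return factor * d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)


print(f"{noam_lr(4000):.6f}")  # 0.000807, the peak at the end of warmup
print(noam_lr(16000) < noam_lr(4000))  # True: the rate decays after warmup
```

Under these assumptions the effective peak learning rate is on the order of 1e-3, even though the nominal value passed to the scheduler is 1.0.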

Ablation Study with Spectrogram Reconstruction

We first conduct an ablation study with the LJSpeech dataset on our pretraining task, spectrogram reconstruction. This task requires A3T to predict the masked frames. We randomly sample 30 utterances from the test set and mask 1/3 of the phones in the middle of each sentence. We adopt MCD to measure the difference between the ground-truth audio and the reconstructed audio; we measure the MCD of the masked region only, and a lower MCD means higher similarity. We incrementally discard the components of A3T: removing the cross-modal alignment embedding, replacing the Conformer with a Transformer, removing the Post-Net, and using the L2 (MSE) loss instead of the L1 (MAE) loss.
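For reference, a commonly used definition of MCD is sketched below. Whether it matches the exact variant and cepstral order used in these experiments is an assumption; the sketch also assumes that the two sequences are already time-aligned, restricted to the masked region, and stripped of the 0th (energy) coefficient by the caller.

```python
import math
from typing import Sequence


def mcd(reference: Sequence[Sequence[float]],
        synthesized: Sequence[Sequence[float]]) -> float:
    """Mean Mel-cepstral distortion in dB over time-aligned frames."""
    if len(reference) != len(synthesized) or not reference:
        raise ValueError("need the same non-zero number of frames")
    scale = 10.0 / math.log(10.0) * math.sqrt(2.0)
    total = 0.0
    for ref, syn in zip(reference, synthesized):
        # Euclidean distance between the two cepstral vectors of this frame.
        total += scale * math.sqrt(sum((r - s) ** 2 for r, s in zip(ref, syn)))
    return total / len(reference)


ref = [[1.0, 0.0]]
syn = [[0.0, 0.0]]
print(round(mcd(ref, syn), 3))  # 6.142
print(mcd(ref, ref))            # 0.0
```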

Results are shown in Tab. 2. An example of the different models’ reconstructions is shown in Fig. 5. By comparing Fig. 5(b) and Fig. 5(c), we can see that many details are lost when A3T is trained without the alignment embedding, and the MCD score rises from 8.09 to 10.73. A similar degradation can be observed after replacing Conformer with Transformer: the MCD score rises from 10.73 to 12.43 and the spectrogram becomes blurrier (Fig. 5(d)). Compared with the alignment embedding and Conformer, Post-Net contributes only 0.49 to the MCD score, and the L2 loss even achieves a better MCD score than the L1 loss. However, when looking at the spectrograms, we can see that Fig. 5(f) is blurrier than Fig. 5(e), which is consistent with the previous finding (Klimkov et al., 2018) that the L1 loss is better than the L2 loss for speech synthesis. Hence, we choose the L1 loss for A3T pretraining. Also, Fig. 5(f) indicates the quality that previous pretrained models (MAM/FAT-MLM) could achieve, and the other figures show how our A3T transforms Fig. 5(f) into Fig. 5(b).

We also conduct a study with VCTK to show the impact of different masking rates. Results are shown in Tab. 3. We can see that a 20% masking rate leads to large MCD scores, while 50% and 80% are better. Also, the 50% masking rate outperforms 80% on the seen test cases, but not on the unseen ones. Since the 80% masking rate generalizes better to unseen cases, we choose 80% for all the following experiments.

Finally, we plot the attention heat maps of the encoder with and without our proposed cross-modal alignment embedding in Fig. 6. The attention matrices are collected from the encoder’s last layer with mean-pooling across heads. It should be noted that the original attention matrix is $310\times310$ and contains both the speech frames and the phones; for clarity, we plot only 11 phones and their corresponding frames in Fig. 6. We can see that our A3T is aware of the speech segments and their corresponding phones, while the baseline model fails to capture such alignment information. This observation demonstrates the effectiveness of our A3T for cross-modal pretraining. It is also consistent with the previous finding that a Transformer-based language model cannot align the tokens within the same sentence/paragraph together, even when pre-trained with the BERT-large setting (Bai et al., 2021).
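The pooling and cropping steps used to prepare such a heat map can be sketched in a few lines. The helper names `mean_pool_heads` and `crop` and the nested-list matrix format are assumptions made for this illustration; the plotting itself is omitted.

```python
from typing import List

Matrix = List[List[float]]


def mean_pool_heads(attn: List[Matrix]) -> Matrix:
    """Average per-head attention matrices of identical shape."""
    if not attn:
        raise ValueError("need at least one head")
    n_heads = len(attn)
    rows, cols = len(attn[0]), len(attn[0][0])
    return [[sum(head[i][j] for head in attn) / n_heads for j in range(cols)]
            for i in range(rows)]


def crop(matrix: Matrix, row_idx: List[int], col_idx: List[int]) -> Matrix:
    """Keep only the selected rows and columns, e.g. a few phones and their frames."""
    return [[matrix[i][j] for j in col_idx] for i in row_idx]


heads = [[[1.0, 0.0], [0.0, 1.0]],
         [[0.0, 1.0], [1.0, 0.0]]]
pooled = mean_pool_heads(heads)
print(pooled)                  # [[0.5, 0.5], [0.5, 0.5]]
print(crop(pooled, [0], [1]))  # [[0.5]]
```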

Speech Editing

Following Tan et al. (2021), we list several baseline systems below:

Baseline 1: This is a TTS system regenerating a complete waveform from the whole sentence to be edited.

Baseline 2: This system generates the modified region with a TTS model and inserts the generated audio back into the original waveform with a forced aligner.

Baseline 3: This system is similar to Baseline 1, but we cut the modified region from the generation and insert it back to the original waveform with a forced aligner.

Tan et al. (2021): This is a speech-editing system (EditSpeech) that introduces partial inference and bidirectional fusion into a sequence-to-sequence neural TTS model. EditSpeech trains two conventional autoregressive TTS models, one left-to-right and the other right-to-left (Fig. 8(b)). For decoding, the left-to-right TTS model force-decodes the prefix speech context and synthesizes the modified region, and the right-to-left TTS model force-decodes the suffix context and generates the modified region in reverse. Finally, the two synthesized speech segments are fused to form the final output (Fig. 8(c)).

Following Tan et al. (2021), we also evaluate our speech-editing system with an identical reconstruction task, which is similar to the ablation experiments above but does not use the ground-truth duration and can be evaluated with the MCD metric. For each dataset, 30 utterances are randomly sampled, and the part of the speech that corresponds to 1/3 of the phonemes in the middle of each sentence is masked. The audio of the masked region is replaced with each system’s generation. A duration model is used to predict the length of the masked speech from the phonemes. Results are shown in Tab. 4. From this table, we can see that our system achieves the best MCD score. Moreover, the alignment embedding is the key to reducing MCD, which confirms our observation in Fig. 5(c). For TTS-based systems, we find that generating the whole audio and then extracting the modified region is better than generating the modified region only.
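One way to select the middle third of the phonemes and map it to a frame-level mask is sketched below. The rounding convention in `middle_third_span` and both helper names are assumptions for illustration; in the reconstruction task described above, the per-phone frame counts inside the masked region would come from the duration model rather than from the ground truth.

```python
from typing import List, Tuple


def middle_third_span(num_phones: int) -> Tuple[int, int]:
    """Half-open phone range [start, end) covering about 1/3 of the phones, centred."""
    if num_phones < 3:
        raise ValueError("need at least three phones")
    span = num_phones // 3
    start = (num_phones - span) // 2
    return start, start + span


def frame_mask(frames_per_phone: List[int], start: int, end: int) -> List[bool]:
    """Frame-level mask: True for every frame of a phone inside [start, end)."""
    mask: List[bool] = []
    for i, n_frames in enumerate(frames_per_phone):
        mask.extend([start <= i < end] * n_frames)
    return mask


start, end = middle_third_span(6)
print(start, end)  # 2 4
print(frame_mask([2, 1, 3, 2, 1, 2], start, end))
# [False, False, False, True, True, True, True, True, False, False, False]
```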

We then conduct a human evaluation with Amazon Mechanical Turk for the real speech insertion and replacement tasks using the VCTK dataset. To compare our results with Tan et al. (2021), we use the same 15 audio samples and modification operations from their work. For each audio sample, 10 native English speakers evaluate the naturalness of the synthesized audio. In Tab. 5, our A3T speech-editing system outperforms that of Tan et al. (2021) and obtains the highest MOS score among all these systems. Audio examples can be found at our demo link.

Prompt-based Multi-speaker TTS

We also conduct a human evaluation of the multi-speaker TTS systems with seen-speaker test cases (30 test cases, 15 human annotations for each test case) and unseen-speaker test cases (20 test cases, 15 human annotations for each test case). We evaluate the quality of the generations and the speaker similarity between the generation and the reference; the results are shown in Tab. 6 and Tab. 7. From these tables, we can see that the style embedding GST (Wang et al., 2018) improves the similarity scores but harms the quality scores, while our A3T model is the most favorable system in both speaker similarity and speech quality. Strikingly, we observe that the average score of the Unseen cases is higher than that of the Seen cases, which is counterintuitive. However, the same gap appears in the MOS of the Groundtruth, so we believe it is due to differences between the two test sets.

Discussion

A3T is a pretraining method on parallel data. A3T is a BERT-style pretraining method, which takes both phonemes and partially masked spectrograms as inputs (Fig. 8(a)). Although A3T can first be trained with speech-only data (Appendix B), a second stage of training with parallel data is necessary for speech synthesis. A3T trains a non-autoregressive encoder to reconstruct masked acoustic signals and uses the identical framework for decoding. It is therefore akin to cross-lingual BERT models such as XLM (Lample & Conneau, 2019), which also train on parallel data.

Finetuning. In this paper, our major finding is that A3T can be used directly without finetuning (Tab. 2-7), like GPT-3, for downstream tasks such as speech editing (Tab. 5) and unseen-speaker TTS (Tab. 7). We also find that A3T can be pretrained with more data and finetuned, like BERT, to improve downstream tasks. Finetuning results for multi-speaker TTS are reported in Appendix B.

A3T is not a TTS model. The input of A3T must include both the text and the speech context, while the input of traditional TTS models is only the text. We show a synthesized speech example in our demo whose input is the text and a piece of silent audio. We find that the generated speech sounds as if multiple speakers were speaking the text simultaneously. This observation shows that A3T generates speech based on the given context and follows its properties. On the other hand, A3T can become a TTS model after being finetuned with a TTS task and data, as described in Appendix B.

Conclusions

In this paper, we propose Alignment-Aware Acoustic-Text Pretraining (A3T) which can reconstruct masked acoustic signals with high quality. We show that our proposed A3T model has the ability to do speech editing and outperforms the current SOTA models, and also improves unseen-speaker speech synthesis with our proposed prompt-based decoding.

References

Appendix A Spectrogram Comparison of Speech Editing

Appendix B Pretraining for Multi-speaker TTS

We conduct finetuning experiments with a large multi-speaker TTS dataset, LibriTTS, and construct the validation and test sets from new speakers only. We test on 50 test cases with 15 human annotators for each case. In this setting, we find that FastSpeech 2 fails to generate high-quality audio for these new speakers, even when equipped with X-Vector (Snyder et al., 2018) to generate speaker embeddings for them. After initializing the FastSpeech 2 model with our LibriTTS-pretrained A3T, the generated audio improves significantly. Results are shown in Tab. 8. We also plot the validation loss and training loss during the training of TTS models with and without A3T in Fig. 11. We can see that both the training and validation losses are improved by the initialization from the A3T model, which demonstrates the effectiveness of our method. Finally, we also observe an improvement from pretraining the A3T model on external data (LibriSpeech (Panayotov et al., 2015) and LibriLight (Kahn et al., 2020)), which achieves a MOS score of 3.77 in Tab. 8. It should be noted that when training with the speech-only data LibriLight, our model is similar to MAM (Chen et al., 2020) and the alignment embeddings are discarded.