Speak, Read and Prompt: High-Fidelity Text-to-Speech with Minimal Supervision

Eugene Kharitonov, Damien Vincent, Zalán Borsos, Raphaël Marinier, Sertan Girgin, Olivier Pietquin, Matt Sharifi, Marco Tagliasacchi, Neil Zeghidour

cs.SD eess.AS

Introduction

Training a text-to-speech (TTS) system typically requires hundreds of hours of parallel data in the form of transcribed utterances. As a consequence, TTS is only available for “high-resource” languages. Moreover, the audio generated by such systems is only as diverse as the parallel data that they are trained on, which should contain many speakers, with various accents, of diverse demographics, and heterogeneous recording conditions. At the same time, for most languages, including low-resource ones, audio-only speech data can be relatively abundant online, present in the forms of audiobooks, podcasts, radio and TV shows.

In this paper, we investigate how audio-only data can be leveraged to reduce the need for supervision in training TTS systems. We introduce SPEAR-TTS (SPEAR stands for “speak, read and prompt”), a multi-speaker TTS system that can be trained with as little as 15 minutes of parallel data from a single speaker. Moreover, SPEAR-TTS can synthesize a new voice using only 3 seconds of speech, without any speaker labels or explicit speaker representation. At its core, SPEAR-TTS leverages recent advances in the “textless” modeling of spoken language (Lakhotia et al., 2021; Dunbar et al., 2021; Polyak et al., 2021; Kreuk et al., 2021; Kharitonov et al., 2022; Nguyen et al., 2022; Borsos et al., 2022). These methods represent continuous audio waveforms as sequences of tokens from a finite vocabulary, casting speech generation as a language modeling task. In particular, AudioLM (Borsos et al., 2022) combines two types of discrete tokens: high-level semantic tokens and low-level acoustic tokens, which can be mapped to audio. Using these representations, we cast the TTS problem as a “translation” from text transcripts to acoustic tokens, with semantic token representations serving as a pivot “language” (Utiyama and Isahara, 2007). This way, TTS is reduced to a composition of two sequence-to-sequence (seq2seq) tasks: translating text to semantic tokens, and translating semantic to acoustic tokens.

The key benefit of splitting the TTS task into these two sub-tasks is that the supervision needed to learn how to map text into the intermediate semantic token representation (“reading”) and the supervision needed to learn how to produce speech from it (“speaking”) become decoupled. While the “reading” stage relies on parallel text-audio data, the audio tokens used to train the “speaking” component are produced by self-supervised audio models and therefore can be extracted from a massive amount of unlabeled speech data. As a result, the quality and diversity of the generated speech become independent from the available parallel data.

Casting each stage of SPEAR-TTS as a seq2seq problem allows us to use standard Transformer models (Vaswani et al., 2017) and makes it easy to tap into the vast pool of ideas developed by the machine translation community to reduce the need for supervision. Specifically, we combine BART/T5-style pretraining (Lewis et al., 2020; Raffel et al., 2020) with backtranslation (Sennrich et al., 2016) to significantly reduce the amount of parallel supervision required to train SPEAR-TTS.

To control the voice used by SPEAR-TTS when producing an utterance, we leverage an example prompting mechanism that is closely related to prompting in textual language models (Brown et al., 2020). Here we condition the “speaking” model with an audio clip representing the target voice, steering it to use this voice when generating the utterance. This feature can simplify building controllable multi-speaker TTS systems for languages where only single-speaker parallel data is available.

Modeling speech synthesis with seq2seq models enables using stochastic sampling at inference, which allows generating outputs of diverse quality for the same input. We exploit that to improve the synthesized audio quality by proposing a sampling scheme based on an objective quality metric.

Our experimental study on English speech shows that, by combining pretraining and backtranslation over a large dataset (551 hours from LibriTTS; Zen et al., 2019) with just 15 minutes of parallel data from a single speaker (LJSpeech; Ito and Johnson, 2017), SPEAR-TTS (a) generates speech with high fidelity to the input transcript (CER 1.92% on LibriSpeech test-clean; Panayotov et al., 2015); (b) synthesizes speech with diverse voices; (c) reliably reproduces the voice of an unseen speaker when using a 3 second example from the target speaker; and (d) achieves high acoustic quality, comparable to that of the ground-truth utterances (MOS 4.96 vs. 4.92). Samples produced by SPEAR-TTS can be found on the demo site: https://google-research.github.io/seanet/speartts/examples/.

Overall, our approach to building TTS using massive self-supervised pretraining and backtranslation of discrete speech representations considerably differs from how existing TTS systems are implemented Shen et al. (2018); Kong et al. (2020); Ren et al. (2020); Kim et al. (2021); Ao et al. (2022); Wang et al. (2023), significantly reducing the costs related to data collection and potentially providing high-quality multi-speaker TTS for languages that are not well covered today.

Discrete Speech Representations

Below we provide a brief overview of the self-supervised audio representations that are essential for SPEAR-TTS. The combined use of these representations was proposed in AudioLM (Borsos et al., 2022), which we refer to for a detailed discussion.

Semantic tokens

The role of semantic tokens is to provide a coarse, high-level conditioning to subsequently produce acoustic tokens. Thus, they should provide a representation of speech where linguistic content, from phonetics to semantics, is salient, while paralinguistic information such as speaker identity and acoustic details is removed. To obtain such a representation, we train a self-supervised speech representation model based on w2v-BERT (Chung et al., 2021). This model combines masked language modeling (Devlin et al., 2019) and contrastive learning (van den Oord et al., 2018) to obtain speech representations. After its training, we run $k$-means clustering on the mean-variance normalized outputs of a specific layer. We use the centroid indices as discrete tokens.
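
The discretization step can be illustrated with a minimal sketch. It assumes the layer outputs are already available as lists of floats and that the $k$-means centroids were fit beforehand. The function names, the toy centroids, and the choice to compute normalization statistics over the frames passed in are illustrative assumptions, not details from the paper.

```python
import math


def mean_variance_normalize(frames):
    """Scale every feature dimension to zero mean and unit variance.

    Assumption: statistics are computed over the frames given here; a real
    pipeline would more likely use statistics from the training corpus.
    """
    n, dim = len(frames), len(frames[0])
    means = [sum(f[d] for f in frames) / n for d in range(dim)]
    stds = [
        math.sqrt(sum((f[d] - means[d]) ** 2 for f in frames) / n) or 1.0
        for d in range(dim)
    ]
    return [[(f[d] - means[d]) / stds[d] for d in range(dim)] for f in frames]


def assign_tokens(frames, centroids):
    """Return, for each frame, the index of the nearest centroid (squared L2)."""

    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return [
        min(range(len(centroids)), key=lambda k: sqdist(frame, centroids[k]))
        for frame in frames
    ]
```

Calling `assign_tokens(mean_variance_normalize(frames), centroids)` yields one integer per frame, which plays the role of a semantic token in this toy setting.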

Acoustic tokens

Acoustic tokens are discrete audio representations that provide high-fidelity reconstruction of the acoustic details. We train a SoundStream (Zeghidour et al., 2021) neural codec to reconstruct speech while compressing it into a few discrete units. SoundStream achieves this goal by adding a residual quantizer to the bottleneck of a convolutional autoencoder. To represent the hierarchy of residual quantizers in a sequence, we flatten the tokens corresponding to the different levels by interleaving them (Borsos et al., 2022).
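
As a rough illustration of residual quantization, the sketch below quantizes a vector with a cascade of codebooks, where each level encodes whatever error the previous levels left behind. This is a generic residual vector quantizer written for clarity, not SoundStream's implementation; the function names and codebook layout are assumptions.

```python
def nearest_index(vector, codebook):
    """Index of the codebook entry closest to `vector` (squared L2)."""

    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return min(range(len(codebook)), key=lambda k: sqdist(vector, codebook[k]))


def residual_quantize(vector, codebooks):
    """Quantize `vector` level by level.

    Returns one index per codebook plus the residual that remains after
    subtracting all selected entries.
    """
    residual = list(vector)
    indices = []
    for codebook in codebooks:
        k = nearest_index(residual, codebook)
        indices.append(k)
        residual = [r - c for r, c in zip(residual, codebook[k])]
    return indices, residual
```

The first index carries the coarsest approximation and later indices refine it, which is why the levels form a hierarchy.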

SPEAR-TTS Overview

SPEAR-TTS extends AudioLM (Borsos et al., 2022) by enabling text as a form of conditioning. SPEAR-TTS is organized in two main stages, as illustrated in Figure 1. In the first stage ($\mathcal{S}_{1}$), text inputs are translated into a sequence of discrete semantic tokens. The second stage ($\mathcal{S}_{2}$) maps semantic tokens into acoustic tokens, which are decoded to speech by the SoundStream decoder (Zeghidour et al., 2021). This way, $\mathcal{S}_{1}$ learns to map text to the internal representation provided by semantic tokens (“reading”), while $\mathcal{S}_{2}$ handles the production of speech from this intermediate internal representation (“speaking”).

By using semantic tokens as an intermediate representation, we achieve two goals. First, semantic tokens provide a speech representation that encodes mostly phonetic content, with limited prosody and speaker information, bridging the gap between text and acoustic tokens. As a result, our intermediate representation is closer to the text than acoustic tokens are. Thus, it is easier to learn a mapping from text transcripts to semantic tokens than directly between text and acoustic tokens. Second, as both semantic and acoustic tokens are derived from self-supervised models, the second stage $\mathcal{S}_{2}$ can be trained using audio-only data. This turns out to be extremely beneficial for training $\mathcal{S}_{2}$, as the typical scale of available audio-only data is considerably larger than that of parallel data. For instance, in the case of English, a large dataset such as LibriTTS has 580h of parallel data (Zen et al., 2019), while LibriLight contains 60,000h of untranscribed speech (Kahn et al., 2020). In turn, separating $\mathcal{S}_{1}$ from $\mathcal{S}_{2}$ allows us to pretrain the former with a denoising pretext task operating on semantic tokens, further harnessing audio-only data.

Similar to AudioLM Borsos et al. (2022), it is possible to add an optional third stage, with the goal of improving quality of the synthesized speech by predicting acoustic tokens corresponding to fine residual vector quantization levels (Appendix A).

The first stage $\mathcal{S}_{1}$ maps tokenized text into semantic tokens. We use parallel text-semantic tokens data to learn this mapping. We start with a text-audio TTS dataset and extract semantic tokens from audio. As a result, $\mathcal{S}_{1}$ is reduced to a seq2seq task that can be implemented by encoder-decoder or decoder-only Transformer architectures (Vaswani et al., 2017; Raffel et al., 2020).

Training Transformer seq2seq models can require substantial amounts of parallel data, which can be extremely scarce for low-resource languages. In the following, we discuss two approaches used to alleviate this limitation: target domain pretraining (Section 4.1) and backtranslation (Section 4.2).

We take inspiration from BART and T5 and pretrain an encoder-decoder Transformer on a denoising pretext task (Lewis et al., 2020; Raffel et al., 2020). In this task, the model is provided with a corrupted version of an original semantic token sequence and the goal is to produce the corresponding uncorrupted token sequence.

Typical corruption methods include random substitution, deletion and masking of individual tokens or entire spans of tokens (Raffel et al., 2020). In preliminary studies, we observed that deleting individual tokens independently with a constant probability works better than other alternatives.
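
The token-deletion corruption can be sketched in a few lines. The helper names and the use of Python's `random.Random` are assumptions made for illustration; the deletion probability is left as a parameter.

```python
import random


def delete_tokens(tokens, p_delete, rng):
    """Drop each token independently with probability `p_delete`."""
    return [t for t in tokens if rng.random() >= p_delete]


def make_denoising_pair(tokens, p_delete, rng):
    """Return (corrupted input, uncorrupted target) for the pretext task."""
    return delete_tokens(tokens, p_delete, rng), list(tokens)
```

The corrupted sequence is always a subsequence of the original, so the model has to infer which tokens are missing and where, rather than only repairing marked positions.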

After pretraining the model $\mathcal{P}$ on the denoising task, we finetune it for the $\mathcal{S}_{1}$ task. To achieve this, we freeze the upper layers of the encoder and all parameters of the decoder, excluding the parameters used in the decoder-encoder cross-attention layers, and update the lower layers of the encoder. The exact number of layers to tune is a hyperparameter.

Backtranslation

The same text sequence can be rendered as audio in multiple ways, with varying voice, accent, prosody, emotional content, and recording conditions. This one-to-many relationship makes the text-to-speech problem highly asymmetric — unlike text translation, where, for example, English-French translation is roughly equally hard in either direction. Thus, it is very attractive to use backtranslation Sennrich et al. (2016); Edunov et al. (2018), i.e., to use the available parallel data to train a speech-to-text model and use it to generate synthetic parallel data from an audio-only corpus.

The two-stage architecture of SPEAR-TTS is particularly suitable for backtranslation as it can be implemented as translation between semantic tokens and text. The benefits are two-fold: (a) a reduction in the computational complexity due to never dealing with raw audio or long acoustic token sequences (in our setup, an acoustic token representation of an utterance is at least 6$\times$ longer than its semantic token counterpart), and (b) the ability to leverage the same semantic token-level pretraining (Section 4.1) when training the “backward”-direction model, from semantic tokens to text transcripts.

In order to obtain a backtranslation model, we start from the same pretrained model $\mathcal{P}$ as above. However, this time we freeze the encoder and only finetune the decoder. Afterwards, we transcribe audio-only data using this model. Next, we use the synthetically generated parallel data to train the first stage of the TTS system, which, in turn, is also obtained via finetuning another copy of $\mathcal{P}$ (see Section 4.1). After finetuning on the synthetic data, we continue finetuning on the original parallel data. (Another option is to train on a mixture of synthetic and original data as in Edunov et al. (2018), which introduces mixture weights as another hyperparameter.) In Figure 2 we illustrate this combined pretraining and backtranslation process.

The second stage model $\mathcal{S}_{2}$ maps semantic tokens into acoustic tokens. To train this stage, we extract pairs of sequences of semantic and acoustic tokens from each utterance in an audio-only dataset. Next, we train a Transformer model to perform seq2seq translation between the two token sequences. The second stage generates utterances with randomly varying voice, tempo, and recording conditions, reproducing the distribution of the characteristics observed in the training data. Since the training of $\mathcal{S}_{1}$ and the training of $\mathcal{S}_{2}$ are decoupled, this diversity of speech generated by $\mathcal{S}_{2}$ is also preserved when $\mathcal{S}_{1}$ is trained on a single-speaker dataset.

To control the characteristics of the speaker’s voice, we combine two findings from AudioLM (Borsos et al., 2022): (a) whenever the speech prefix is represented solely by semantic tokens, AudioLM generates continuations by sampling a different random voice each time; (b) however, when conditioning also includes acoustic tokens, AudioLM maintains the voice characteristics captured by the acoustic tokens when generating the continuation. In contrast to AudioLM, we explicitly incorporate this ability during training, as illustrated in Figure 3. During training, we randomly select two non-overlapping windows of speech from each training example, from which we compute sequences of semantic and acoustic tokens. We consider one of the windows as the prompt and the other as the target output. Next, we concatenate the sequences in the following order: (a) semantic tokens from the prompt, (b) semantic tokens from the target, (c) acoustic tokens from the prompt, and (d) acoustic tokens from the target. During training of $\mathcal{S}_{2}$, (a)-(c) are used as prefix and the model learns to generate the target acoustic tokens (d), preserving the speaker identity captured by the acoustic tokens from the prompt. At inference time, (a)-(c) are provided as input, and (d) is generated autoregressively.

Importantly, a special separator token is added at each segment boundary to inform the model about the expected discontinuity. This prevents boundary artifacts, which are sometimes generated when no separator is used. Note that the text transcript of the prompt is not needed.
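
The layout of a prompted example can be sketched as follows. The separator id, the function name, and the exact placement of one separator after each of the first three segments are assumptions made for illustration; the text only states that a separator marks each segment boundary.

```python
SEP = -1  # hypothetical separator id; a real vocabulary would reserve its own


def build_s2_example(prompt_semantic, target_semantic,
                     prompt_acoustic, target_acoustic, sep=SEP):
    """Return (prefix, target) for prompted second-stage training.

    The prefix concatenates (a) prompt semantic, (b) target semantic and
    (c) prompt acoustic tokens; the target is (d) the target acoustic tokens.
    Assumption: semantic and acoustic ids already live in disjoint ranges.
    """
    prefix = (
        list(prompt_semantic) + [sep]
        + list(target_semantic) + [sep]
        + list(prompt_acoustic) + [sep]
    )
    return prefix, list(target_acoustic)
```

At inference time the same prefix would be built from the prompt audio and the semantic tokens of the text to be spoken, and the target would be generated token by token.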

The speech samples generated by $\mathcal{S}_{2}$ might contain some background noise, since this is typically present in the training data. We consider two methods to control the noise level in the synthesized speech at inference time. First, in the case of prompted generation, it is possible to select prompts containing cleaner speech. Second, we can use a stochastic sampling (e.g., temperature sampling), generate multiple sequences for the same input and then use a no-reference audio quality metric to select the samples containing the least amount of noise. To this end, we use a MOS estimator model similar to DNSMOS Reddy et al. (2021).
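
The second method amounts to a best-of-n selection, sketched below. Both `generate_fn` (one stochastic synthesis call) and `quality_fn` (a no-reference quality estimator) are hypothetical callables supplied by the caller; nothing here reproduces the actual MOS estimator.

```python
def sample_and_select(generate_fn, quality_fn, num_samples):
    """Draw `num_samples` candidates and keep the one with the highest score.

    Returns the best candidate together with its quality score.
    """
    if num_samples < 1:
        raise ValueError("num_samples must be at least 1")
    candidates = [generate_fn() for _ in range(num_samples)]
    scores = [quality_fn(candidate) for candidate in candidates]
    best = max(range(num_samples), key=scores.__getitem__)
    return candidates[best], scores[best]
```

The cost grows linearly with the number of samples, so the sample count trades audio quality against compute.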

Experimental Setup

In this section we introduce the datasets, metrics and baselines used in our experimental study.

We use LibriLight (Kahn et al., 2020) to train the self-supervised representation models (SoundStream and w2v-BERT) as well as the $k$-means used to discretize w2v-BERT embeddings into semantic tokens. We use the largest unlab-60k split of LibriLight that contains around 60,000 hours of English audiobooks read by more than 7,000 speakers.

To experiment in the low-resource regime, we train $\mathcal{S}_{1}$ on LJSpeech (Ito and Johnson, 2017), a single-speaker dataset containing 24 hours of parallel data. By using LJSpeech as the only source of parallel data, we also show that our method generalizes to multiple speakers, even if the parallel training data itself contains only a single speaker. Since LJSpeech does not specify a canonical train/dev/test split, we follow Liu et al. (2022, 2020) and randomly select 300 utterances as development and another 300 utterances as test set (30 minutes each), using the rest as training data. To simulate scenarios in which very limited data is available, we uniformly sample subsets of 12, 3, 2, and 1 hours, as well as 30 and 15 minutes, from the training set. As an indicative figure, the 15 minute subset contains around 21k semantic tokens and 2k words.

Pretraining:

To pretrain a model on the sequence corruption task (Section 4.1), we extract semantic tokens from LibriLight (Kahn et al., 2020), since pre-training only requires audio data.

Backtranslation:

In our experiments with backtranslation, we use LibriTTS Zen et al. (2019) as a source of unlabelled speech (ignoring transcripts). We pool all training subsets of LibriTTS to obtain an audio-only dataset containing 551 hours of speech. Using LibriTTS as a source for audio-only data for backtranslation allows us to compare SPEAR-TTS with $\mathcal{S}_{1}$ trained on original and backtranslated LibriTTS transcripts.

To train $\mathcal{S}_{2}$, we extract pairs of semantic and acoustic token sequences from LibriLight (Kahn et al., 2020).

Evaluation data

We use LibriSpeech test-clean (Panayotov et al., 2015) to measure the character error rate (CER) (see Section 6.4). As LJSpeech only contains sequences shorter than 10 seconds, we filter out sequences longer than that from LibriSpeech test-clean. As a result, we obtain 2,007 utterances, with a total duration of approximately 3 hours. Importantly, LibriSpeech test-clean has no intersection with any training or validation data we used.

Preprocessing

To prepare the data for training, we unroll standard abbreviations used in LJSpeech. Next, we apply the G2p_en phonemizer (Park and Kim, 2019). After removing the lexical stress information from its output, we obtain a string representation in a vocabulary of 47 tokens (39 phonemes from the CMU Dictionary, whitespace, and punctuation).

Since we cannot expect that a phonemizer is universally available in low-supervision scenarios, in Appendix G we experiment with grapheme inputs.

Metrics

We are interested in the following desired properties of SPEAR-TTS:

Generated speech should adhere to the input;

It should provide voice diversity even when $\mathcal{S}_{1}$ is trained on single-speaker data;

When prompted with an utterance from an unseen target speaker, SPEAR-TTS should synthesize speech that matches their voice;

Generated speech should be of high quality.

Below we discuss the metrics used to assess whether those properties are satisfied.

We transcribe the utterances synthesized by SPEAR-TTS using an in-house ASR system and we evaluate the faithfulness to the input transcript by measuring the character error rate (CER). We use the LibriSpeech test-clean dataset (Panayotov et al., 2015) to calculate CER, since it requires minimal postprocessing to be compared to the output of the adopted ASR system. As a reference, on the original ground-truth audio, CER is equal to 0.98%.
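
For reference, CER is commonly computed as the character-level edit distance between the reference transcript and the ASR output, divided by the reference length. The sketch below follows that standard definition; any text normalization applied before scoring is not specified here and is left to the caller.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance (substitutions, insertions, deletions)."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            cost = 0 if ref_char == hyp_char else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def character_error_rate(reference, hypothesis):
    """Edit distance normalized by the number of reference characters."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return edit_distance(reference, hypothesis) / len(reference)
```

Because insertions count as errors, the rate can exceed 100% when the hypothesis is much longer than the reference.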

Voice diversity

To measure the voice diversity within a set of synthesized speech utterances, we apply a speaker classifier that assigns one speaker per utterance and we measure the entropy of the empirical distribution of the detected speakers across all utterances. We use the same speaker classifier as Borsos et al. (2022), which is trained on a union of LibriSpeech train-clean-100 and test-clean containing 251 and 40 speakers, respectively, and computes predictions over a set of 291 speaker classes. We provide more details in Appendix D.
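
The diversity score is the Shannon entropy of the predicted speaker labels. A minimal sketch, assuming one label per utterance and entropy measured in bits (the function name is illustrative):

```python
import math
from collections import Counter


def voice_entropy_bits(speaker_labels):
    """Entropy, in bits, of the empirical distribution of speaker labels."""
    counts = Counter(speaker_labels)
    total = len(speaker_labels)
    return -sum(
        (count / total) * math.log2(count / total)
        for count in counts.values()
    )
```

A set of utterances that all map to one speaker scores 0 bits, while a uniform spread over all 291 classes would reach the upper bound of $\log_{2}291$, roughly 8.18 bits.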

Voice preservation

When prompting the model with a short utterance, we evaluate the consistency of the speaker voice between the prompt and the generated speech. To this end, we use the same speaker classifier as above and measure how often the speaker label predicted from the generated speech matches the one predicted from the prompt.

Quality

We rely on human judgments to evaluate the perceived quality of SPEAR-TTS by collecting Mean Opinion Scores (MOS). In this context, human raters listen to individual audio segments and rate their audio quality and speech naturalness on a scale from Poor (1) to Excellent (5).

Baselines

As our main baseline, we consider a system explicitly trained to target the low-supervision scenario. Namely, we use a modification of FastSpeech2 (Ren et al., 2020), which is a non-autoregressive model that uses auxiliary duration, pitch, and energy predictors. Specifically, in our experiments we consider the adaptation to the low-resource setting by Pine et al. (2022). The model takes as input the phoneme representation of the text and predicts a spectrogram, which is then vocoded with HiFi-GAN (Kong et al., 2020). We denote this modification as FastSpeech2-LR. In a subjective evaluation reported by Pine et al. (2022), FastSpeech2-LR trained on 1 (3) hour(s) of parallel data performed on par with an open-source implementation of Tacotron2 (Shen et al., 2018) trained with 10 (24) hours of parallel data. We use checkpoints trained on 15 minute, 30 minute, 1 hour, 3 hour, and 24 hour subsamples of LJSpeech that were shared by the authors (https://github.com/roedoejet/FastSpeech2_ACL2022_reproducibility).

We also compare SPEAR-TTS to VALL-E Wang et al. (2023), a recent TTS system that demonstrates state-of-the-art results in zero-shot voice adaptation. Similarly to SPEAR-TTS, it is capable of voice transfer using a 3 second voice prompt. VALL-E maps the input text to coarse acoustic tokens, and uses a non-autoregressive refinement stage to predict fine-grained acoustic tokens. VALL-E is trained on an ASR-transcribed version of LibriLight Kahn et al. (2020), containing roughly 60,000 hours of parallel data. Since the model is not publicly available, the comparison is based on the samples provided on its demo page.

Hyperparameters & Training details

We follow the setup of AudioLM Borsos et al. (2022) to compute both semantic and acoustic tokens, with a few differences. The semantic tokens are obtained by quantizing the embeddings returned by the 7th layer of w2v-BERT using a codebook of size 512. As a result, 1 second of audio is represented by 25 semantic tokens with a vocabulary size of 512, resulting in an equivalent bitrate of $25\times\log_{2}{512}=225$ bit/s. We remove sequentially repeated semantic tokens, as done in Lakhotia et al. (2021); Borsos et al. (2022).
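
The bitrate figure and the removal of repeated tokens can be checked with a short sketch. The helper names are illustrative; the arithmetic is simply 25 tokens per second times $\log_{2}512=9$ bits per token.

```python
import math
from itertools import groupby


def equivalent_bitrate(tokens_per_second, vocab_size):
    """Bits per second if every token carries log2(vocab_size) bits."""
    return tokens_per_second * math.log2(vocab_size)


def remove_repeats(tokens):
    """Collapse each run of identical consecutive tokens into one token."""
    return [token for token, _ in groupby(tokens)]
```

Note that the bitrate applies to the token stream before repeats are removed; after deduplication the number of tokens per second varies with the content.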

We extract acoustic tokens from a SoundStream neural codec (Zeghidour et al., 2021) with 3 quantization levels, each with a codebook of size 1024. We use a vocabulary with $3\times 1024$ unique tokens and represent each frame as a flat sequence of tokens, interleaving the first, second, and third quantization layers, respectively. As a result, 1 second of audio is represented by 50 Hz $\times$ 3 = 150 acoustic tokens, an equivalent bitrate of 1500 bit/s.
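
One way to realize the flattening is sketched below. Giving each quantization level its own id range through an offset of `level * codebook_size` is an assumption that is consistent with a vocabulary of $3\times 1024$ unique tokens; the function names are illustrative.

```python
def flatten_acoustic(frames, codebook_size=1024):
    """Interleave per-frame codes (level 1, level 2, ...) into one sequence.

    Each level is shifted into its own id range so that tokens from
    different levels never collide.
    """
    flat = []
    for frame in frames:
        for level, index in enumerate(frame):
            flat.append(level * codebook_size + index)
    return flat


def unflatten_acoustic(flat, num_levels=3, codebook_size=1024):
    """Invert `flatten_acoustic`, returning one tuple of codes per frame."""
    if len(flat) % num_levels:
        raise ValueError("sequence length must be a multiple of num_levels")
    frames = []
    for start in range(0, len(flat), num_levels):
        chunk = flat[start:start + num_levels]
        frames.append(tuple(
            token - level * codebook_size for level, token in enumerate(chunk)
        ))
    return frames
```

With 50 frames per second and 3 levels this gives 150 tokens per second, each worth $\log_{2}1024=10$ bits, which matches the stated 1500 bit/s.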

In all experiments, we use the Adafactor optimizer (Shazeer and Stern, 2018) with inverse square-root learning rate decay. As a regularization method, we use label smoothing with the smoothing parameter set to 0.1, except in the case of pretraining, when a large amount of data is available.

The pretraining task is configured so that the probability of deleting individual tokens is set to 0.6. This parameter was selected via grid search inspecting the validation accuracy of $\mathcal{S}_{1}$ after finetuning. We apply dropout with probability equal to 0.5 and set the batch size to 256. We ran the pretraining for 1M updates and used the resulting checkpoint $\mathcal{P}$ in all our experiments. As the architecture, we use T5-Large Raffel et al. (2020), which is a 24 layer encoder-decoder seq2seq model (see Appendix F).

Finetuning

The same pretrained checkpoint $\mathcal{P}$ is finetuned for different purposes (Figure 2). In all cases we perform a grid search on the dropout rate ({0.1, 0.3, 0.5}) and the number of layers to finetune, selecting the combination with the highest validation accuracy (with teacher-forcing). More specifically, when finetuning on ground-truth parallel data (as an ablation), we freeze both the upper layers of the encoder and the entire decoder, while updating the weights of the encoder embeddings and the lower layers. The number of the lower layers to tune is searched in {4, 6, 8}. When finetuning on synthetic parallel data, we search over the number of the encoder’s lower layers to be finetuned in {4, 6, 8, 10, 12, 24}. Next, we finetune the lower 4 layers of the encoder on the original parallel data (to avoid overfitting when very little data is available). Finally, when finetuning the decoder for backtranslation, we finetune $N$ top and $N$ bottom layers, with $N\in\{2,3,4,12\}$. During finetuning, we select the checkpoint with the best validation accuracy.

Training from scratch

As an ablation experiment, we train $\mathcal{S}_{1}$ from scratch, experimenting with different variants of T5 architectures Raffel et al. (2020), depending on the amount of data available. We adopt a decoder-only model without causal masking on the input sequence Raffel et al. (2020), which led to better results in our preliminary experiments. We perform a grid-search on the following hyperparameters: dropout probability {0.1, 0.3, 0.5}; architecture size (T5-small or T5-base); the number of layers (T5-small: 2, 4, 6, 8; T5-base: 4, 6, 8, 12). Further details are in Appendix F.

For $\mathcal{S}_{2}$, we use a 12-layer decoder-only Transformer model, with each layer having 12 heads with dimensionality 64, embedding dimensionality of 768, and FFN size of 2048. The optimizer and the learning rate schedule are the same as for $\mathcal{S}_{1}$.

Inference

We use beam search to sample from $\mathcal{S}_{1}$ and temperature sampling to sample from $\mathcal{S}_{2}$. This combination ensures faithfulness to the transcript while enabling more diverse and natural sounding speech. We use a beam size equal to 10, as larger values do not lead to improvements in CER but are more computationally expensive. When generating backtranslation data we re-use the settings of $\mathcal{S}_{1}$, without running any additional hyperparameter search. For $\mathcal{S}_{2}$, we experiment with sampling temperatures $T\in\{0.50,0.55,\ldots,0.95,1.0\}$ and select $T=0.75$, which minimizes the CER on the LJSpeech validation dataset. In this case, the $\mathcal{S}_{1}$ model is trained on synthetically generated parallel data obtained by backtranslation, with the backtranslation model trained on the 15 minute split of LJSpeech.
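
Temperature sampling divides the logits by $T$ before the softmax, so values below 1 sharpen the distribution and values above 1 flatten it. The sketch below is a generic implementation over a plain list of logits; the function names are illustrative and nothing here is specific to the models used in the paper.

```python
import math
import random


def softmax_with_temperature(logits, temperature):
    """Softmax of logits / temperature, computed in a numerically stable way."""
    scaled = [value / temperature for value in logits]
    peak = max(scaled)
    exps = [math.exp(value - peak) for value in scaled]
    total = sum(exps)
    return [value / total for value in exps]


def sample_token(logits, temperature, rng):
    """Draw one token index from the temperature-scaled distribution."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]
```

Passing a seeded `random.Random` instance as `rng` makes the draws reproducible, which is convenient when comparing temperatures on a validation set.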

To control the noise levels in the synthesized speech, we employ the sampling technique (Section 5) where we sample $n_{s}$ audio utterances corresponding to the input and select the one that has highest quality according to a no-reference audio quality model similar to DNSMOS Reddy et al. (2021). We set $n_{s}$ to 3, as a trade-off between audio quality and computational cost (Appendix B).

Experiments

We evaluate SPEAR-TTS along several dimensions. First, we measure the faithfulness of the generated speech to the input transcript, for different training scenarios and amounts of parallel data available (Section 8.1). Then, we observe that SPEAR-TTS is able to generate speech that is more diverse in voices than the parallel data used during training (Section 8.2). Finally, we show that SPEAR-TTS is able to successfully control the speaker voice, without any degradation in terms of fidelity to the transcript (Section 8.3).

When evaluating SPEAR-TTS, we consider the following training settings for $\mathcal{S}_{1}$ : (a) training from scratch using parallel data; (b) finetuning the pretrained checkpoint $\mathcal{P}$ using parallel data; (c) finetuning the pretrained checkpoint $\mathcal{P}$ to obtain the backtranslation model and then training the forward model from scratch on the synthetically generated data; (d) same as (c), but both the backward and the forward models are obtained by finetuning $\mathcal{P}$ with an additional finetuning of the forward model on the original parallel data.

Table 1 reports the main results in terms of CER, as a proxy for the intelligibility of the generated speech. We observe that when decreasing the amount of parallel data, training from scratch (a) results in very high error rates. Conversely, thanks to pretraining (b), SPEAR-TTS maintains a relatively low CER ($\leq 4\%$) when using as little as 2 hours of parallel data. This is similar to the CER achieved with 24 hours, but without pretraining. Backtranslation (c) has a generally positive impact, especially when the amount of parallel data is reduced, achieving a CER of 2.88% with only 15 minutes. By combining backtranslation with pretraining (d), the CER is further decreased to 2.21% with the same amount of parallel data. This indicates that having a fixed decoder is useful to cope with the noisy nature of the synthetically generated training data obtained via backtranslation. As a result, SPEAR-TTS trained on 3 hours (with pretraining and backtranslation) achieves the same CER that can be observed when training from scratch on the original transcripts of LibriTTS-train, that is, 551 hours of parallel data (see Appendix C).

We also compare SPEAR-TTS to FastSpeech2-LR, observing that when using 24 hours of parallel data, both systems perform approximately on par (FastSpeech2-LR: 1.99% vs. SPEAR-TTS: 2.06%). However, as the amount of parallel data is reduced, CER of FastSpeech2-LR increases very rapidly. As a result, there is a significant gap when only 15 minutes are available, that is, FastSpeech2-LR: 4.90% vs. SPEAR-TTS: 2.21%.

In conclusion, the combination of pretraining and backtranslation allows SPEAR-TTS to synthesize speech that adheres to the input transcript, even with as little as 15 minutes of parallel data.

Voice diversity

SPEAR-TTS is capable of generating utterances using diverse voices, including speakers not seen in the parallel data. For example, when using the LJSpeech dataset Ito and Johnson (2017) as the source of parallel data, the model generates multiple different voices, despite the fact that this dataset contains a single speaker. In the following experiments, we quantitatively measure the voice diversity of the generated speech.

To this end, we train $\mathcal{S}_{1}$ on parallel datasets characterized by a different number of speakers and verify that the diversity of the synthesized voices remains stable. We consider 1 speaker (LJSpeech), 61, 123 and 247 speakers (from LibriTTS). Namely, we use the full LibriTTS train-clean-100, which contains 247 speakers, and two of its subsets with approximately 1/2 and 1/4 of the speakers. We use transcripts from LibriSpeech test-clean.

Table 2 illustrates how the ground-truth speech naturally becomes less diverse in terms of voice variability (from 7.68 to 2.55), as the number of speakers is decreased (from 247 to 1). Note that the LJSpeech voice is out-of-domain for the speaker classifier used, so the measured voice variability is non-zero. Instead, for SPEAR-TTS, voice variability is not significantly affected by the number of speakers (min: 6.11, max: 6.28) and is significantly higher than that of FastSpeech2-LR (6.11 vs. 0.66).

This experiment demonstrates that the variety of voices synthesized by SPEAR-TTS is independent from the number of distinct speaker voices contained in the parallel data used for training $\mathcal{S}_{1}$.

3 Prompted generation

SPEAR-TTS is able to control the speaker voice via example prompting, as described in Section 5. We evaluate SPEAR-TTS in a zero-shot scenario, in which the voice used for prompting was never seen by $\mathcal{S}_{1}$ or $\mathcal{S}_{2}$ at training and $\mathcal{S}_{2}$ has to reproduce its characteristics from a single prompt example. Specifically, we fix $\mathcal{S}_{1}$ , using the model trained on 15 minutes of LJSpeech, and we consider all 40 speakers from LibriSpeech test-clean as target speakers. For each speaker, we randomly select 5 speech prompts with a duration of 3 seconds each and transcripts from the same dataset. For each speech prompt and text transcript, we repeat synthesis 5 times and average metrics across the generated samples.

Table 3 reports the speaker accuracy, that is, how often the same voice is detected in both the prompt and the generated speech. We observe a top-1 accuracy of $92.4\%$ , showing that prompting allows SPEAR-TTS to preserve the speaker voice. Also, the synthesized voice is stable when repeating inference, as captured by a low value of voice variability (0.41 bits). Moreover, we observe that with prompted generation SPEAR-TTS achieves a CER of 1.92%, which is lower than without prompting (2.21%). We believe that this improvement is due to using cleaner recordings for prompts, which steers the $\mathcal{S}_{2}$ model to produce cleaner speech and consequently reduces ASR errors.

We also compare the voice preservation abilities of SPEAR-TTS with those of VALL-E (Wang et al., 2023). Following the methodology of Wang et al. (2023), we compute the cosine similarity between embeddings computed from the prompt (encoded and decoded with SoundStream) and from the generated speech, using a publicly available speaker verification system based on WavLM (Chen et al., 2022) (the “WavLM large” model from https://github.com/microsoft/UniSpeech/tree/main/downstreams/speaker_verification#pre-trained-models). This is the same model used by Wang et al. (2023), which makes our measurements directly comparable with the scores reported in their paper. From the results reported in Table 4, we observe that SPEAR-TTS significantly outperforms YourTTS (Casanova et al., 2022) (0.56 vs. 0.34) and almost matches the speaker similarity of VALL-E (0.58), despite being trained with 240,000 $\times$ less parallel data.
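The similarity score itself is a plain cosine similarity between two speaker embeddings. A minimal sketch, assuming the embeddings are already available as equal-length lists of floats (extracting them with the speaker verification model is not shown, and the function name is illustrative):

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimensionality")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)
```

The score ranges from -1 to 1, with higher values indicating that the prompt and the generated speech are closer in the embedding space.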

Subjective Evaluation

Ultimately, we resort to subjective tests with human raters to compare the quality of SPEAR-TTS with the baselines and with ground-truth natural speech. We focus on the scenario with minimal supervision and use the $\mathcal{S}_{1}$ model that is trained with the 15-minute LJSpeech Ito and Johnson (2017) subset. As baselines, we use the FastSpeech2-LR models Ren et al. (2020); Pine et al. (2022) trained on 15-minute, 1-hour, and 24-hour subsets of LJSpeech.

To ensure that the evaluation sentences are not part of the training set of SPEAR-TTS or the FastSpeech2-LR models, we extract sentences from an audiobook chapter released in 2022, read by the same voice as in LJSpeech (https://librivox.org/predecessors-of-cleopatra-by-leigh-north/, §10). This chapter was published later than any of the datasets we use. We extract 20 sentences from it, each with a duration between 3 and 11 seconds, for a total of 133 seconds. We take transcripts for those sentences from the text of the corresponding book (https://www.gutenberg.org/cache/epub/58236/pg58236.txt). We provide the transcripts in Table 12 in the Appendix.

The baselines are TTS systems trained to generate a single voice. To ensure a fair comparison, we prompt $\mathcal{S}_{2}$ with utterances extracted from the LJSpeech dataset, so that SPEAR-TTS generates speech with the same voice. To this end, we randomly select 3s speech samples from LJSpeech and filter out samples that have more than 1s of silence, using the remaining as prompts.

Samples are presented to raters one-by-one, and raters are asked to judge the audio quality and speech naturalness on a scale from Poor (1) to Excellent (5). Before starting, the raters are provided with example utterances for each grade. Each audio sample is evaluated by 20 raters. For each treatment, we average all scores to compute the Mean Opinion Score (MOS).
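The aggregation step can be sketched as follows, assuming all ratings for a treatment are pooled into one list before averaging. The treatment names and ratings in the usage note are made up purely for illustration and are not results from the paper:

```python
from statistics import mean


def mean_opinion_scores(ratings_by_treatment):
    """Average all 1-5 ratings collected for each treatment (system)."""
    for treatment, scores in ratings_by_treatment.items():
        if not scores:
            raise ValueError(f"no ratings for treatment {treatment!r}")
        if any(not 1 <= s <= 5 for s in scores):
            raise ValueError(f"ratings for {treatment!r} must lie in [1, 5]")
    return {t: mean(s) for t, s in ratings_by_treatment.items()}
```

With hypothetical ratings `{"system_a": [5, 4, 5, 4], "system_b": [2, 3]}`, the function returns one averaged score per treatment.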

Table 5 reports the results of the subjective tests. We observe that SPEAR-TTS achieves considerably higher quality than the baselines, even when the latter use more parallel data during training. The MOS score achieved by SPEAR-TTS (4.96) is comparable to the one obtained for the ground-truth speech (4.92), confirming the high quality of the generated speech, despite the fact that the model was trained only on 15 minutes of parallel data.

We also compare SPEAR-TTS and VALL-E Wang et al. (2023) in a small-scale subjective test using the examples provided on its demo page (https://valle-demo.github.io/, “More Samples”). These examples are generated by combining 8 transcripts with 3 prompts each, resulting in 24 speech utterances. Using the same instance of SPEAR-TTS described above (with $\mathcal{S}_{1}$ trained with 15 minutes of single-speaker LJSpeech), we synthesize 24 utterances using the same transcripts and prompts and conduct a subjective test with the same protocol described above. Table 6 shows that, on these examples, SPEAR-TTS achieves considerably better naturalness and higher speech quality (MOS 4.75) than VALL-E (3.35), despite using considerably less supervision (15 min of parallel data & 1 speaker vs. approximately 60,000 hours of parallel data spoken by over 7,000 speakers).

Related Work

The work of Lakhotia et al. (2021) on generative spoken language modeling (GSLM) pioneered the use of language models on discretized speech representations. The main tasks Lakhotia et al. (2021) focus on are unconstrained speech generation and speech continuation. Their work became a foundation for a range of applications and extensions, including emotion transfer Kreuk et al. (2021), prosody Kharitonov et al. (2022) and dialog Nguyen et al. (2022) modeling. SPEAR-TTS is related to AudioLM (Borsos et al., 2022), a recent development in this line of work that achieves superior quality in spoken language modeling as well as high audio quality.

2 Low- and semi-supervised TTS

Being able to leverage audio-only data is one of the distinct features of SPEAR-TTS. Guided-TTS, proposed by Kim et al. (2021), is another TTS system that is capable of doing this. At its core, Guided-TTS combines (a) a denoising diffusion probabilistic model (DDPM) that learns to model audio-only data, and (b) a phoneme classifier that guides the generative process towards producing an utterance with a desired transcript. Guided-TTS 2 (Kim et al., 2022) extends Guided-TTS by allowing speaker adaptability either via finetuning or in a zero-shot manner, using a 10-second speech sample processed by a dedicated speaker embedding module. Another adaptable DDPM-based TTS system was proposed by Levkovitch et al. (2022), which uses the classifier guidance mechanism to steer generation towards a particular voice in a zero-shot manner.

In contrast to SPEAR-TTS, the above works rely on stronger supervision: (a) Guided-TTS uses a phoneme classifier that is trained on LibriSpeech 960, (b) Guided-TTS 2 relies on a pre-trained speaker verification system. Conversely, SPEAR-TTS uses an intuitive and parameter-less prompting mechanism which does not require any speaker labels.

Liu et al. (2020) combine a sequential autoencoder with vector quantization and temporal segmentation mechanisms to learn a phoneme-like discrete speech representation, along with a seq2seq model that maps these representations to phonemes. Similarly to SPEAR-TTS, this system can be trained with almost no supervision; however, the generated speech is single-speaker only and of much lower quality than ground-truth audio ( $2.33$ vs. $4.81$ in their experiments). This is unlike SPEAR-TTS, which, despite minimal, single-speaker supervision, can generate speech from arbitrary voices while matching the quality of ground-truth speech.

Next, there is a body of research that exploits the availability of unpaired texts. Backtranslating audio-only data, as done by SPEAR-TTS, can be thought of as using an ASR system to generate training data for TTS. Tjandra et al. (2017) proposed to train both ASR and TTS simultaneously, with TTS reconstructing the waveform based on the ASR output and ASR recognizing audio synthesized by TTS. Chung et al. (2019) discussed a set of approaches for pretraining the Tacotron TTS system, including per-frame autoregressive pretraining of the decoder and pretraining word embeddings for the encoder. Ao et al. (2022) proposed SpeechT5, a system that can combine text- and audio-only data for pretraining.

3 Prompted Audio Generation

When a sentence is prepended by an emotional prompt, expressed in plain English, e.g. [I am really sad, ], Tortoise TTS Betker (2022) synthesizes text in a sad-sounding voice.

AudioLM Borsos et al. (2022) demonstrates a voice-prompting ability where an acoustic token prefix forces the model to maintain the speaker characteristics and recording conditions in the prompt, while generating a speech continuation. We extend the prompting capabilities of AudioLM by proposing prompt-aware training of $\mathcal{S}_{2}$ .

Wang et al. (2023) propose VALL-E, a TTS system that allows prompt-based conditioning of the synthesized voice and emotion. In contrast to the two-stage architecture of SPEAR-TTS, VALL-E predicts an equivalent of acoustic tokens directly from a phoneme representation of a text. As a result, the transcript of the prompt is required, which can be challenging e.g. if the prompt is noisy. This is unlike SPEAR-TTS which only prompts the model with self-supervised audio tokens, and thus does not require the corresponding transcript. Another difference is the amount of the parallel training data used: VALL-E is trained on the 60,000 hours of ASR-transcribed LibriLight Kahn et al. (2020). Sections 8.3 and 9 show that SPEAR-TTS provides similar zero-shot prompting abilities with much higher audio quality, even when trained with only 15 minutes of parallel data.

Conclusions & Future work

In this work, we introduce SPEAR-TTS, a multi-speaker TTS system that has two features setting it apart. First, it only requires a minimal amount of parallel data to be trained, i.e. it can synthesize speech with high fidelity and voice diversity when trained on as little as 15 minutes of parallel data coming from a single speaker. Second, SPEAR-TTS is able to synthesize speech maintaining voice characteristics of a previously unseen speaker using a 3-second long voice example.

These capabilities originate from harnessing abundant audio-only data. The key component that unlocks the usage of such data is the hierarchical discrete representation of speech that combines high-level semantic tokens with low-level acoustic tokens. Using these representations, SPEAR-TTS casts the TTS problem as a composition of two sequence-to-sequence tasks, “reading” (from tokenized text to semantic tokens) and “speaking” (from semantic tokens to acoustic tokens).

SPEAR-TTS uses audio-only data in three ways: (a) to train the “speaking” model, such that the hard task of speech generation benefits from large-scale data, (b) as a domain for pretraining a model that is further used as a foundation for text-to-semantic tokens and semantic tokens-to-text models, and (c) to generate synthetic parallel data for backtranslation.

Our experimental study on English data (Section 8) shows that by combining audio-only data from LibriTTS Zen et al. (2019) with 15 minutes of parallel data sampled from LJSpeech Ito and Johnson (2017), SPEAR-TTS achieves intelligibility comparable to that of an adapted FastSpeech2-LR trained on the entire 24 hours of LJSpeech (CER 1.92% vs. 1.99% on LibriSpeech test-clean Panayotov et al. (2015)). Simultaneously, even when trained on parallel data from a single speaker, SPEAR-TTS synthesizes speech with diverse voices (Section 8.2).

Next, our experiments in Section 8.3 show that SPEAR-TTS can maintain voice characteristics of a previously unseen speaker, in a zero-shot manner, with high accuracy. Indeed, our measurements indicate that by taking a 3 second-long voice example for a speaker from LibriSpeech test-clean, SPEAR-TTS achieves 92.4% accuracy on maintaining the voice when synthesizing held-out text transcripts, according to our speaker classifier. Moreover, when measuring speaker similarity between prompts and generated speech, SPEAR-TTS obtains a cosine similarity of 0.56, which is close to the score reported for VALL-E Wang et al. (2023) and significantly higher than the score of YourTTS Casanova et al. (2022) (0.58 and 0.34, respectively).

Subjective evaluations of speech naturalness show that SPEAR-TTS has significantly higher quality than a strong single-voice baseline even when trained with 96 $\times$ less parallel data (MOS 4.96 vs. 2.11). Moreover, the MOS score of SPEAR-TTS is on par with the natural speech (4.92). When comparing quality of the speech synthesized in a zero-shot voice transfer task, SPEAR-TTS obtains a MOS that is considerably higher than VALL-E (4.75 vs. 3.35), with 240,000 $\times$ less data.
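The two data-reduction factors quoted above follow directly from the amounts of parallel data involved: 15 minutes for SPEAR-TTS, 24 hours for the single-voice baseline, and approximately 60,000 hours for VALL-E. As a quick check (variable names are illustrative):

```python
from fractions import Fraction

SPEAR_TTS_MINUTES = 15        # parallel data used by SPEAR-TTS
BASELINE_HOURS = 24           # parallel data used by the single-voice baseline
VALL_E_HOURS = 60_000         # approximate parallel data used by VALL-E

# Express everything in minutes and take exact ratios.
baseline_factor = Fraction(BASELINE_HOURS * 60, SPEAR_TTS_MINUTES)
vall_e_factor = Fraction(VALL_E_HOURS * 60, SPEAR_TTS_MINUTES)

print(baseline_factor)  # 96
print(vall_e_factor)    # 240000
```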

We believe our work on high-quality TTS with limited supervision (quantity- and quality-wise) paves the way for enabling TTS technology for communities that are currently excluded due to speaking “low-resource” languages and dialects. Another exciting potential application that can be unlocked by SPEAR-TTS is allowing people with speech impairments to use old recordings of their own voice to communicate orally. At the same time, we admit that our initial study has certain limitations and could be misused (Sections 12 & 13).

We believe that applying our findings to building a TTS system for truly low-resource languages and further reducing the need for supervision are the main directions for further work.

Limitations

While our motivation is to enable high-quality, diverse, and controllable TTS for low-resource languages, we started our investigations with English, which allowed us to address the problem using a collection of well-studied datasets.

Next, we rely on LibriLight Kahn et al. (2020) as our audio-only dataset which provides a sufficiently diverse set of audio. However, LibriLight only contains audio samples at 16 kHz, hence SPEAR-TTS requires an additional step to synthesize speech at a higher sampling rate (Appendix A). In addition, LibriLight contains audio of a lower quality than curated datasets on average. However, these are not limitations of SPEAR-TTS, but rather are limitations of the data we used. Moreover, the quality of SPEAR-TTS could be improved by changing the neural codec used to produce acoustic tokens, with no change to $\mathcal{S}_{1}$ and $\mathcal{S}_{2}$ .

Finally, the flexibility of SPEAR-TTS comes from relying on relatively large Transformer models that require substantial computing resources for training and inference. We believe this can be addressed separately by model distillation and quantization Polino et al. (2018); Fan et al. (2020).

Broader Impact

We believe our work on high-quality TTS that requires very limited supervision (quantity- and quality-wise) can be an important stepping stone for enabling this core speech technology for communities that are currently underserved by TTS solutions due to speaking “low-resource” languages, i.e., languages that do not have large parallel corpora required for training deep learning models. Even for high-resource languages, such as English, the ability to harness untranscribed data for speech generation can enable producing speech in accents and dialects that are currently uncovered by the existing TTS systems. Another exciting potential application provided by SPEAR-TTS is allowing people with speech impairments to use recordings of their own voice to prompt SPEAR-TTS.

At the same time, we acknowledge that the ability to mimic a voice can have numerous malicious applications, including bypassing biometric identification and for the purpose of impersonation Delgado et al. (2021); Casanova et al. (2022). Thus it is crucial to put in place safeguards against the misuse and, as an initial step, we verify that speech produced by SPEAR-TTS can be reliably detected by a classifier with an accuracy of 82.5% on a balanced dataset (see Appendix E). In the future, one can explore other approaches for detecting synthesized speech, e.g. audio watermarking.

Acknowledgements

The authors are grateful to Ron Weiss and Matthieu Geist for their feedback on a draft of this paper. We also thank Aidan Pine for helping us to obtain and run checkpoints from Pine et al. (2022).

References

Appendix A Bandwidth extension: from 16 to 24 kHz

While relying on LibriLight as our unpaired dataset allows for modeling a diverse set of speakers and conditions in $\mathcal{S}_{2}$ , this dataset contains only 16 kHz audio, whereas 24 kHz audio is preferable in many TTS applications. We provide a simple approach via bandwidth extension that enables SPEAR-TTS to generate speech at 24 kHz, while still being able to benefit from the diversity of LibriLight.

We cast bandwidth extension as a sequence-to-sequence task of mapping tokens produced by the SoundStream codec at 16 kHz (Section 7.1) to the tokens produced by a SoundStream codec at 24 kHz. We train the latter on LibriTTS (Zen et al., 2019) with 4 residual vector quantizer layers, a codebook size of 1024 per layer and a 50 Hz embedding rate, resulting in a 2000 bit/s codec. To create the training data for this task, we extract SoundStream token sequence pairs from LibriTTS: the target tokens are produced by the 24 kHz codec on the target audio sample, and the input tokens are produced by the 16 kHz codec on the audio sample, after applying a lowpass filter with random cutoff frequencies between 5 and 8 kHz.
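The 2000 bit/s figure follows from the codec configuration: each of the 4 quantizer layers emits one token per frame, a codebook of 1024 entries needs log2(1024) = 10 bits per token, and there are 50 frames per second. A small sketch of this arithmetic (the function name is illustrative):

```python
import math


def codec_bitrate(num_quantizers: int, codebook_size: int, frame_rate_hz: float) -> float:
    """Bitrate in bit/s of a residual vector quantizer codec."""
    bits_per_token = math.log2(codebook_size)
    return num_quantizers * bits_per_token * frame_rate_hz


bitrate_24khz_codec = codec_bitrate(num_quantizers=4, codebook_size=1024, frame_rate_hz=50)
print(bitrate_24khz_codec)  # 2000.0
```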

Since the sequence-to-sequence formulation of bandwidth extension fits easily into our framework, we train a T5-small encoder-decoder on the task. We note that the training data for this stage is two orders of magnitude smaller than for $\mathcal{S}_{2}$ (LibriTTS vs. LibriLight), so with this approach we can benefit at the same time from the acoustic diversity of a large but low-resolution dataset and the quality of a small but high-resolution dataset.

Appendix B Controlling audio quality by sampling

As discussed in Section 5, without prompting, the quality of audio produced by SPEAR-TTS matches that of the training data used to train $\mathcal{S}_{2}$ . The LibriLight dataset (Kahn et al., 2020) contains audiobooks read by volunteers using their personal equipment, so the quality of the recordings varies a lot.

Here we verify that the sampling technique proposed in Section 5 allows us to control the quality of the generated speech and study how it affects the recognition error of the ASR system used. In this experiment, for each phoneme input in LibriSpeech dev-clean, we sample $n_{s}$ times from SPEAR-TTS ( $n_{s}\in\{1,2,3,5,10\}$ ) and select the sample that has the highest MOS estimate, returned by a modification of the DNSMOS model Reddy et al. (2021). We use the selected example for calculating CER.

We report the results of this experiment in Table 7. We observe that increasing $n_{s}$ leads to a higher estimated quality. Moreover, higher audio quality allows SPEAR-TTS to achieve lower CER. Based on the results in Table 7, we use $n_{s}=3$ in all our experiments, as a trade-off between the computational complexity and the estimated quality.
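The selection procedure amounts to best-of-n sampling under a quality estimator. A minimal sketch, where `generate` and `score` are hypothetical callables standing in for one stochastic synthesis pass of SPEAR-TTS and for the MOS estimator, respectively (neither is a real API):

```python
from typing import Callable, TypeVar

T = TypeVar("T")


def best_of_n(generate: Callable[[], T], score: Callable[[T], float], n_s: int) -> T:
    """Draw n_s candidates and keep the one with the highest estimated quality.

    `generate` and `score` are placeholders (assumptions): the former would run
    one sampling pass of the synthesizer, the latter a no-reference MOS estimator.
    """
    if n_s < 1:
        raise ValueError("n_s must be at least 1")
    candidates = [generate() for _ in range(n_s)]
    return max(candidates, key=score)
```

The cost of synthesis grows linearly with `n_s`, which is the computational side of the trade-off mentioned above.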

In this Section, we study intelligibility of SPEAR-TTS with $\mathcal{S}_{1}$ trained on LibriTTS. We generally use the same hyperparameter grids as in experiments with LJSpeech Ito and Johnson (2017) that are reported in Section 7. However, as LibriTTS is larger than LJSpeech, we also experiment with encoder-decoder models. For the largest training subset of LibriTTS (551h), we also experimented with T5-Large-sized encoder-decoder and decoder-only architectures (24 layers). For encoder-decoder models, we always set the numbers of layers in the encoder and the decoder to be equal.

Table 8 reports CER for two variants of SPEAR-TTS: with $\mathcal{S}_{1}$ trained from scratch and starting from the pretrained checkpoint $\mathcal{P}$ , the same as used in the main experiments. We consider three subsets of LibriTTS Zen et al. (2019) of different sizes (54, 241, and 551 hours). First, we notice that with the largest subset (551h), SPEAR-TTS reaches a low error rate of 2.04% and, in this case, pretraining provides virtually no improvement. However, with less paired data, pretraining is increasingly important: it starts to play a role when 241 hours of paired data are available and becomes strongly beneficial when training on 54 hours of paired data (CER 2.61% vs. 2.13%).

Appendix D Speaker classifier

We use the same speaker classifier as Borsos et al. (2022), which is a convolutional network that takes log-mel spectrograms as its input. The spectrograms are calculated with a window size of 25 ms and a hop length of 10 ms, and have 64 mel bins. The network contains 6 blocks, each cascading convolutions with kernels of 3x1 and 1x3. Each block is followed by a ReLU non-linearity and batch normalization (Ioffe and Szegedy, 2015). The per-block numbers of channels are . The classifier has an input span of 1 second and, to classify a longer utterance, we run a sliding window with a hop length of 250 ms and average predictions across the windows.
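The sliding-window inference described above can be sketched as follows. The classifier itself is not shown: `per_window_probs` stands for its per-window output distributions, and the handling of utterances shorter than the input span (a single window) is an assumption made for illustration:

```python
def window_starts_ms(duration_ms: int, span_ms: int = 1000, hop_ms: int = 250):
    """Start offsets of the 1-second windows slid over an utterance."""
    if duration_ms < span_ms:
        return [0]  # assumption: a short utterance is treated as one window
    return list(range(0, duration_ms - span_ms + 1, hop_ms))


def average_predictions(per_window_probs):
    """Average the per-window class probabilities into one utterance-level prediction."""
    if not per_window_probs:
        raise ValueError("need at least one window")
    n = len(per_window_probs)
    num_classes = len(per_window_probs[0])
    return [sum(w[k] for w in per_window_probs) / n for k in range(num_classes)]
```

For a 2-second utterance, `window_starts_ms(2000)` yields five windows starting at 0, 250, 500, 750 and 1000 ms.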

Appendix E Detecting synthesized speech

In this section we demonstrate that speech generated by SPEAR-TTS can be successfully distinguished from real human speech. To this end, we use the classifier trained to detect speech generated by AudioLM Borsos et al. (2022). It uses the same architecture as the speaker classifier (Appendix D) and was trained to discriminate LibriSpeech train-clean-100 Panayotov et al. (2015) examples, compressed with SoundStream Zeghidour et al. (2021), against AudioLM generated speech.

To assess how effective this classifier is on the speech that SPEAR-TTS synthesizes, we iterate over examples in LibriSpeech dev-clean and, from each example, generate two utterances: (a) one by synthesising text using SPEAR-TTS, and (b) one by re-synthesising the ground-truth audio via acoustic tokens. (We do not compare against uncompressed ground-truth audio, as that would make the task trivial for the classifier by allowing it to focus on superficial coding artifacts, which in turn would make the classifier easier to bypass.) We observe that on this set of samples, our classifier attains an accuracy of 82.5% on discriminating generated vs. natural speech. We believe that this result can be further improved by training the classifier directly on the output of SPEAR-TTS.

Appendix F Architecture details

We report parameters for the Transformer layers we used in Table 9.

Appendix G Evaluating grapheme-based SPEAR-TTS

In the main text, we report evaluation results for a variant of SPEAR-TTS trained on phoneme representations of text. In some cases, in particular for low-resource languages, a phonemizer might not be available. Hence, we complement our experimental study by evaluating SPEAR-TTS trained on grapheme-based representation of transcripts. We report results in Table 10. On comparing these results with Table 1, we observe that having phoneme-based representation brings strong benefits when very little parallel data is available (e.g., 3.45% vs. 2.21% with 15 minutes). In contrast, with more than 2 hours of parallel data, the benefits of using a phonemizer shrink, and with 12 hours or more, grapheme-based training outperforms the phoneme-based model.

In this experiment, we measure how sensitive $\mathcal{S}_{2}$ is to the amount of data used to train it. To this end, we downsample LibriLight Kahn et al. (2020) by factors of 1, 2, 5, and 10 before training $\mathcal{S}_{2}$ models. All models share the same architecture and are trained for the same number of updates and we select the checkpoint with the highest validation accuracy. Next, we combine the selected checkpoints with $\mathcal{S}_{1}$ trained on LibriTTS Zen et al. (2019) (with pretraining) and measure intelligibility of SPEAR-TTS on LibriSpeech dev-clean. We report results in Table 11. We notice that reducing the data size 5x starts to affect the performance.