Injecting Text in Self-Supervised Speech Pretraining

Zhehuai Chen, Yu Zhang, Andrew Rosenberg, Bhuvana Ramabhadran, Gary Wang, Pedro Moreno

Introduction

Self-supervised pretraining has been successful in several speech and language tasks. In ASR, these techniques have demonstrated the ability to effectively leverage large amounts of untranscribed speech. However, self-supervised pretraining needs to discover effective representations for speech recognition using only internally consistent representations. While these representations can be learned with multiple views (objectives), there is no guarantee that the learned representation is optimal for any given task such as ASR, language identification or speaker verification. Consequently, fine-tuning of the pretrained encoder for the given task is always necessary for optimal performance.

Unspoken text is complementary to untranscribed speech in self-supervised learning. It is also much easier to collect than untranscribed speech. Pretraining techniques such as MoCo, Contrastive Predictive Coding (CPC), Autoregressive Predictive Coding (APC), and SimCLR generalize using untranscribed speech. However, they cannot leverage unspoken text, which limits the power of the learned representations.

In this paper, we propose to jointly learn representations during pretraining from two different modalities, namely speech and text. We show that Text-to-Speech (TTS) can inject this lexical and phonetic information to the speech encoder during pretraining. We propose tts4pretrain, a method to use synthesized speech during pretraining of the encoder. Central to this technique is the use of additional auxiliary decoder objectives such as phoneme, grapheme and word-piece sequence prediction. These losses coupled with contrastive learning on real and synthesized speech help to inject lexical information in the speech encoder during pretraining.

The main contributions of this paper are:

A novel algorithm, tts4pretrain, that learns encoder representations from both untranscribed speech and unspoken text, thus allowing for the explicit injection of lexical/phonetic/linguistic information in self-supervised pretraining.

A significant reduction in the amount of transcribed data needed for subsequent “fine tuning” to the domain or task at hand thereby directly resulting in cost savings.

A framework to adapt out-of-domain speech representations using in-domain text data through TTS.

A demonstration that language-model fusion is complementary to the textual information introduced in pretraining.

Generalization of the algorithm to different encoder architectures and sequence training objectives such as Connectionist Temporal Classification (CTC), Recurrent Neural Network Transducers (RNN-T), and Hybrid Autoregressive Transducers (HAT).

We present results on two publicly available, well-benchmarked ASR tasks, namely the LibriSpeech and AMI meeting transcription tasks, and on queries representative of Google Voice Search traffic. We demonstrate that this novel pretraining method yields Word Error Rate (WER) reductions of 10% relative on LibriSpeech over training with contrastive loss alone, establishing a new state-of-the-art result. We also show that tts4pretrain matches the performance of 5,000 hours of transcribed speech with just 100 hours of transcribed speech on the AMI meeting transcription task. Finally, we demonstrate WER reductions of up to 15% relative on an in-house Voice Search task over traditional pretraining.

The rest of this paper is organized as follows. We compare to related work in Section 2. The proposed model is described in Section 3. Experiments are given in Section 6, with dataset and model details listed in Sections 4 and 5. An ablation study is conducted in Section 7, followed by the conclusion in Section 8.

Related work

Self-supervised pretraining techniques leverage untranscribed speech in ASR. wav2vec2.0 has emerged as a successful training method that masks latent representations of input speech and solves a contrastive task over quantized speech representations. Recent advances in semi-supervised learning have revisited unsupervised learning in the form of Noisy Student Training (NST) and introduced augmentation strategies such as FixMatch and Sequential MixMatch to ASR. Training methodologies that jointly learn from unpaired speech and text, such as Deep Chain, cycle-consistency training, and augmentation approaches, are becoming increasingly popular.

Leveraging vast amounts of unpaired text through learned text representations has been explored using shared encoder representations. These approaches have been shown to be effective for ASR when combined with both transcribed and untranscribed speech. Adversarial and cycle-consistency training objectives have also been proposed to leverage unpaired data. The Connectionist Temporal Classification (CTC) objective was introduced to train end-to-end models. CTC has many advantages for ASR: it helps to improve robustness, achieves fast convergence, and allows for streaming applications. Recurrent Neural Network Transducers (RNN-T) are also popular in streaming ASR applications. Both of these objectives have been used in conjunction with unsupervised training.

Language model fusion in end-to-end ASR falls into two main approaches. The first, exemplified by “Shallow Fusion”, interpolates scores from the end-to-end (E2E) model and an external language model (LM). The second jointly trains E2E models and LMs, as in “Cold Fusion”, “Deep Fusion”, “Component Fusion” and Hybrid Autoregressive Transducers (HAT). The HAT model separately preserves the internal LM learned by the E2E model, thus allowing for a more accurate integration with an external LM. In this paper, we propose a new method for combining untranscribed speech and synthesis of unspoken text in self-supervision (wav2vec2.0) with CTC and RNN-T training objectives. We also show that the proposed approach is complementary to both shallow fusion and HAT-based unspoken text integration.

Proposed Method: tts4pretrain

tts4pretrain comprises two additional components that can be applied to any self-supervised pretraining technique: 1) the use of synthesized utterances along with untranscribed “real” utterances during pretraining, and 2) the inclusion of auxiliary ASR-based losses. Figure 1 shows the tts4pretrain framework. In this paper, we follow the wav2vec2.0 pretraining framework and apply a contrastive loss to Conformer encoder representations. Every audio $x^{*}$ drawn from the untranscribed speech corpus $\mathcal{L}_{speech}$ results in a loss function $\mathcal{J}_{\tt speech}=\mathcal{J}_{\tt w2v}(x^{*}\mid\theta_{e}),x^{*}\in\mathcal{L}_{speech}$ used to optimize the encoder parameters $\theta_{e}$.

To inject lexical information into the encoder, the pretraining data set includes synthetic utterances $x$ generated via speech synthesis (TTS) of text $y^{*}$ drawn from an unspoken text dataset $\mathcal{L}_{text}$. The TTS model includes both speaker conditioning and a VAE-based latent variable for prosodic control (cf. Section 3.2). During synthesis these are sampled from $Z$, the set of appropriate conditional parameters (speaker embedding or VAE prior). This results in a similar loss term for the synthesized utterances.
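By analogy with $\mathcal{J}_{\tt speech}$, the loss term for synthesized utterances can be sketched as below. The label $\mathcal{J}_{\tt text}$ is an assumption made here for readability; the remaining symbols are the ones already defined in this section.

$$
\mathcal{J}_{\tt text}=\mathcal{J}_{\tt w2v}(x\mid\theta_{e}),\quad x={\tt TTS}(\hat{x}\mid y^{*},z),\quad y^{*}\in\mathcal{L}_{text},\quad z\sim Z
$$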

While self-supervision has been used to improve a variety of speech tasks, our aim is to improve ASR performance. Thus, we encourage the encoder to learn representations that are useful for ASR by introducing supervision through auxiliary decoders $\theta_{d}$. The decoder objective is to recognize the text $y^{*}$ of synthesized utterances $x=\tt{TTS}(\hat{x}\mid y^{*},z)$. In this case, both $\theta_{e}$ and $\theta_{d}$ are optimized, though only the encoder parameters $\theta_{e}$ are retained after pretraining; $\theta_{d}$ is discarded. We find that a linear readout layer followed by CTC loss is an effective auxiliary decoder, but any ASR decoder can be used here. (We evaluate other options in Section 7.2.) The auxiliary loss $\mathcal{J}_{\tt aux}$ is the loss of this decoder on the synthesized utterances and their source text.
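A generic form consistent with this description is given below. Here $\mathcal{J}_{\tt asr}$ is an assumed placeholder for whichever ASR criterion the auxiliary decoder uses (CTC in the default setup); it is not a symbol taken from the source.

$$
\mathcal{J}_{\tt aux}=\mathcal{J}_{\tt asr}(y^{*}\mid x,\theta_{e},\theta_{d}),\quad x={\tt TTS}(\hat{x}\mid y^{*},z)
$$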

Note that the text labels $y^{*}$ that are necessary for synthesis are also available for the auxiliary ASR loss calculation. This is similar to the use of TTS in ASR training; auxiliary ASR losses are also used in encoder training for voice conversion.
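To make the CTC criterion of the auxiliary decoder concrete, the following is a minimal pure-Python sketch of the CTC negative log-likelihood computed with the standard forward algorithm. It is an illustration only: the function name, the choice of index 0 for the blank symbol, and the list-based inputs are assumptions, and a real system would rely on an optimized library implementation.

```python
import math


def ctc_neg_log_likelihood(log_probs, target, blank=0):
    """Return -log P(target | frames) under CTC.

    log_probs: list of T frames, each a list of per-symbol log-probabilities.
    target: list of label ids without blanks.
    """
    # Interleave blanks around the labels: [b, y1, b, y2, ..., b].
    ext = [blank]
    for label in target:
        ext += [label, blank]
    num_states, num_frames = len(ext), len(log_probs)
    neg_inf = float("-inf")

    def logsumexp(values):
        m = max(values)
        if m == neg_inf:
            return neg_inf
        return m + math.log(sum(math.exp(v - m) for v in values))

    # Forward variables for the first frame: start in blank or first label.
    alpha = [neg_inf] * num_states
    alpha[0] = log_probs[0][ext[0]]
    if num_states > 1:
        alpha[1] = log_probs[0][ext[1]]

    for t in range(1, num_frames):
        new_alpha = [neg_inf] * num_states
        for s in range(num_states):
            candidates = [alpha[s]]
            if s >= 1:
                candidates.append(alpha[s - 1])
            # Skipping a blank is allowed only between two different labels.
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                candidates.append(alpha[s - 2])
            total = logsumexp(candidates)
            if total != neg_inf:
                new_alpha[s] = total + log_probs[t][ext[s]]
        alpha = new_alpha

    # Valid paths end in the final blank or the final label.
    ends = [alpha[num_states - 1]]
    if num_states > 1:
        ends.append(alpha[num_states - 2])
    return -logsumexp(ends)


# Two frames, vocabulary {0: blank, 1: "a"}, uniform probabilities.
uniform = [[math.log(0.5), math.log(0.5)] for _ in range(2)]
loss = ctc_neg_log_likelihood(uniform, [1])
```

In this toy case three alignments ("aa", "a-", "-a") collapse to the target, so the likelihood is 0.75 and the loss is -log(0.75).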

3.2 On-the-fly Speech Synthesis and Utterance Selection

We use a TTS system trained to generate ASR features from the unspoken text as the input of pretrained encoders. The TTS model is based on Tacotron 2D, which takes text sequences as input, is conditioned on speaker and utterance embeddings, and outputs a sequence of mel-spectrogram frames.

Mel-filter bank features from the model can be consumed directly by the pretrained encoder, eliminating the need for a vocoder. To model prosody and increase its variability during inference, we use a hierarchical variational autoencoder (VAE). This architecture captures local and global speaking styles separately and makes the TTS more stable. The hierarchical VAE includes a local encoder, which encodes two-second chunks with a one-second overlap, and a global encoder, which encodes the whole utterance.
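The chunking used by the local encoder can be illustrated with a small sketch that computes the frame spans of two-second chunks with a one-second overlap. The frame rate of 100 frames per second and the treatment of the final, shorter chunk are assumptions made for this example; the source does not specify them.

```python
def chunk_spans(num_frames, frames_per_second=100, chunk_s=2.0, overlap_s=1.0):
    """Return (start, end) frame spans of overlapping chunks covering the utterance."""
    if num_frames <= 0:
        return []
    chunk = int(round(chunk_s * frames_per_second))
    hop = int(round((chunk_s - overlap_s) * frames_per_second))
    if hop <= 0:
        raise ValueError("overlap must be shorter than the chunk length")
    spans = []
    start = 0
    while True:
        # The last chunk is truncated at the end of the utterance (assumption).
        end = min(start + chunk, num_frames)
        spans.append((start, end))
        if end >= num_frames:
            break
        start += hop
    return spans


# A 3.5-second utterance at the assumed 100 frames per second.
spans = chunk_spans(350)
```

For the 350-frame example this yields three chunks, each starting one second after the previous one.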

We synthesize distinct utterances on-the-fly during batch construction. Sampling a new $z\sim Z$ (speaker embedding and VAE latent) each time $y^{*}$ is included in a batch results in novel realizations of TTS utterances, rather than training on the same TTS utterances in every training epoch.
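The effect of resampling $z$ at batch-construction time can be sketched as follows. Everything here is hypothetical scaffolding: `toy_synthesize` stands in for a real TTS model, the speaker list and latent dimension are arbitrary, and the standard-normal prior for the latent is an assumption.

```python
import random


def sample_conditioning(speaker_ids, latent_dim, rng):
    """Draw z ~ Z: a speaker id and a prosody latent (standard-normal prior assumed)."""
    speaker = rng.choice(speaker_ids)
    latent = [rng.gauss(0.0, 1.0) for _ in range(latent_dim)]
    return speaker, latent


def build_tts_batch(texts, synthesize, speaker_ids, latent_dim, rng):
    """Synthesize every text with a freshly sampled z, so repeated texts differ."""
    batch = []
    for text in texts:
        z = sample_conditioning(speaker_ids, latent_dim, rng)
        batch.append({"text": text, "z": z, "features": synthesize(text, z)})
    return batch


def toy_synthesize(text, z):
    """Hypothetical stand-in for a TTS model: one fake one-dimensional frame per character."""
    speaker, latent = z
    return [[float(ord(ch)) + speaker + latent[0]] for ch in text]


rng = random.Random(0)
first = build_tts_batch(["play some music"], toy_synthesize, [0, 1, 2], 4, rng)
second = build_tts_batch(["play some music"], toy_synthesize, [0, 1, 2], 4, rng)
```

The same text appears in both batches, but each occurrence is paired with a different conditioning sample and therefore a different set of features.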

3.3 Contrastive Loss

We pretrain a Conformer encoder. We first use log-mel spectrograms from real data as input features and pass them through 2 convolution subsampling blocks as a “feature encoder” to produce target frames (no quantization layer is used). The convolutional subsampling block has two 2D-convolution layers, both with strides $(2,2)$, resulting in a 4x reduction in the feature sequence length. A “context network” consisting of a stack of Conformer blocks makes predictions over the masked frames. A contrastive loss is optimized between the context vectors from the masked positions and the target context vectors. After the pretrained encoder converges on real untranscribed speech, we repeat the pretraining procedure on both TTS and real speech. The contrastive loss is optimized for both real and TTS features, with additional auxiliary losses on TTS material.
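A contrastive objective of this kind can be sketched in a few lines of pure Python. This is a simplified illustration, not the exact loss used in the experiments: it assumes cosine similarity, a temperature of 0.1, and that the distractors for a masked position are the targets at all other masked positions of the same utterance.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-8)


def contrastive_loss(context, targets, masked_positions, temperature=0.1):
    """Average InfoNCE-style loss over the masked positions."""
    losses = []
    for t in masked_positions:
        # Similarity of the context vector to every candidate target.
        logits = [cosine(context[t], targets[k]) / temperature for k in masked_positions]
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
        positive = cosine(context[t], targets[t]) / temperature
        losses.append(log_denominator - positive)
    return sum(losses) / len(losses)


targets = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
aligned = contrastive_loss(targets, targets, [0, 1, 2])
shuffled = contrastive_loss([targets[1], targets[2], targets[0]], targets, [0, 1, 2])
```

The loss is close to zero when each context vector matches its own target and large when it matches a distractor instead.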

3.4 Training on Unspoken Text and Untranscribed Speech

A major challenge in using TTS for ASR data augmentation is encouraging effective generalization from synthetic to real speech. Synthesized speech exhibits much less variation than real speech; it has a high SNR, and contains no disfluencies and few internal silences. The following two design choices encourage effective self-supervised pretraining from synthetic speech. First, we mix synthetic and real utterances within each batch. A loss mask $\sigma$ is then used to combine the speech- and text-based losses.
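One combination consistent with this description is sketched below. It assumes that $\sigma$ is a per-utterance indicator equal to 1 for real speech and 0 for synthesized speech, that $\mathcal{J}_{\tt text}$ denotes the contrastive loss on synthesized utterances (a label assumed here), and that the terms are summed without extra weights; the source does not state the relative weighting.

$$
\mathcal{J}=\sigma\,\mathcal{J}_{\tt speech}+(1-\sigma)\left(\mathcal{J}_{\tt text}+\mathcal{J}_{\tt aux}\right)
$$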

This forces the model to learn representations that are effective for both synthetic and real speech. Second, we apply data augmentation to the synthetic speech when optimizing the auxiliary losses. The $\mathcal{J}_{\tt w2v}(x\mid\theta_{e})$ necessarily includes time masking in its loss calculation. However, for $\mathcal{J}_{\tt aux}$ on TTS data, we use SpecAugment, applying both time and frequency masking. SpecAugment frequency masking promotes better generalization from synthetic to real speech (cf. Section 7.3).

Data

ASR: The training and test data sets used in this paper include public, well-benchmarked corpora and in-house voice search corpora. These are detailed in Table 1. The first two rows correspond to the three public corpora, LibriSpeech, LibriLight and AMI. The last two rows describe two in-house data sets representative of Google’s voice search (VS) traffic in two languages, U.S. English (en-us) and Marathi (mr-in). The in-house ASR training data from voice search utterances for both languages are anonymized and hand-transcribed. The development and test sets are a small fraction of the training set held out for validation and evaluation. The unspoken text used in pretraining, labeled as Unsup. Text in Table 1, comprises anonymized and aggregated typed search query data. These text queries were selected from a much larger pool of 2000M and 170M queries for English and Marathi respectively, using the data selection method described in Section 3.2. In addition, to measure the long-tail word performance in voice search queries, a 15k synthetic test set targeting rare proper nouns or words with surprising pronunciations is used. It is important to note that the TTS model used to generate this synthetic test set not only uses a different architecture from the one used in pretraining, but also has no speaker overlap.

TTS: Two different TTS corpora are used in this paper. We use the freely available LibriTTS corpus containing a total of 960 hours of segmented LibriSpeech data from 2,456 speakers. The second corpus is an in-house 30-hour data set comprising 7 professional Marathi speakers.

Language Model: The LibriSpeech text corpus comprises nearly 803 million tokens from 40M utterances of filtered text derived from 14.5K Project Gutenberg books. Training data for the Voice Search experiments are randomly drawn from a number of text sources including supervised transcripts used in E2E model training, YouTube search logs, Google search queries, Maps search queries and crawled web documents. Overall, this amounts to nearly 100 billion and 380 million text sentences for English and Marathi respectively.

Model Descriptions

The ASR network is an RNN transducer consisting of an LSTM decoder and a Conformer encoder. The encoder is a stack of “Conformer blocks”, each of which is a series of multi-headed self-attention, depth-wise convolution and feed-forward layers. The model configuration is summarized in Table 2. All models are trained on 80-dimensional log-mel filter bank coefficients. The experiments on the public corpora use 1024 word-piece targets and the Voice Search experiments use 4K word-piece targets.

Pretraining Parameters: All ASR models are trained on Google TPU V3 cores. For the XL model, we use Adam optimization and cap the norm of the gradient to 20. For the XXL model, we switch the optimizer to Adafactor with $\beta_{1}=0.9$ and $\beta_{2}=0.98$, and use 2nd-moment estimator factorization to reduce the accelerator’s memory footprint. Both models use a transformer learning rate schedule with a peak learning rate of 2e-3 and 25k warm-up steps. For tts4pretrain, we use the same settings, with a global batch size of 1024 and a learning rate 5 times lower. The global batch size is 512 for corpora $\geq 1000h$ and 256 for corpora $<1000h$. We introduce phoneme and word-piece auxiliary decoders described in Section 3 and explore CTC and RNN-T as two options for the training objective. The CTC objective uses a single-layer prediction network while the RNN-T objective uses a 2-layer LSTM network.
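The shape of such a schedule can be sketched as below, assuming that "transformer learning rate schedule" refers to the common linear warm-up followed by inverse-square-root decay, rescaled so that the maximum equals the stated peak. The function name and this exact parameterization are assumptions.

```python
def transformer_lr(step, peak_lr=2e-3, warmup_steps=25_000):
    """Linear warm-up to peak_lr, then inverse-square-root decay (assumed form)."""
    step = max(step, 1)
    return peak_lr * min(step / warmup_steps, (warmup_steps / step) ** 0.5)


lr_at_peak = transformer_lr(25_000)
lr_half_warmup = transformer_lr(12_500)
lr_late = transformer_lr(100_000)
# The 5-times-lower rate mentioned for tts4pretrain, under the same assumed schedule.
tts4pretrain_peak = transformer_lr(25_000, peak_lr=2e-3 / 5)
```

Under this form the rate rises linearly to 2e-3 at step 25k and falls back to half the peak after four times as many steps.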

Fine-tuning Parameters: We optimize the encoder and decoder with separate optimizers and learning rate schedules. We use Adam optimization with the transformer learning rate schedule. The encoder uses a peak learning rate of 3e-4 with 5k warm-up steps. The decoder uses a peak learning rate of 1e-3 and 1.5k warm-up steps. For evaluation, we keep a separate copy of exponential-moving-averaged model weights aggregated with decay rate 0.9999.
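The exponential moving average of the weights can be sketched as a single update rule. The dictionary-of-floats representation is a simplification for illustration; details such as bias correction or a decay warm-up are not described in the source and are omitted here.

```python
def ema_update(ema_weights, weights, decay=0.9999):
    """Return the moving-average weights after seeing the latest training weights."""
    return {
        name: decay * ema_weights[name] + (1.0 - decay) * value
        for name, value in weights.items()
    }


# A decay of 0.5 is used here only to keep the numbers easy to follow.
ema = {"w": 0.0}
ema = ema_update(ema, {"w": 1.0}, decay=0.5)
after_one = ema["w"]
ema = ema_update(ema, {"w": 1.0}, decay=0.5)
after_two = ema["w"]
```

With the decay of 0.9999 stated above, each step moves the averaged copy only 0.01% of the way toward the current weights, so the evaluation model changes smoothly.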

5.2 TTS

The multi-speaker TTS model uses a Tacotron2 TTS architecture with a hierarchical VAE. The input sequence embedding is encoded by three convolutional layers, which contain 512 filters with shape 5 x 1, followed by a bidirectional long short-term memory (LSTM) layer of 256 units for each direction. The resulting embeddings are accessed by the decoder through a location-sensitive attention mechanism. The decoder is followed by a PostNet with five convolutional layers of 512 filters with shape 5 x 1.

5.3 Language Model

The LibriSpeech LM is an eight-layer, 103M-parameter transformer language model trained on the LibriSpeech language model corpus. The in-house en-us Voice Search experiments use a Conformer LM trained on multiple domains. The in-house mr-in Voice Search experiments use an N-gram LM for first-pass decoding and a maximum-entropy LM for second-pass rescoring. During rescoring, the first-pass LM’s log-likelihood is log-linearly interpolated with the second-pass model score.
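Log-linear interpolation during rescoring can be illustrated with a short sketch. The interpolation weight, the convex-combination form, and the tuple layout of a hypothesis are assumptions made for this example; the source only states that the two scores are log-linearly interpolated.

```python
def rescore(hypotheses, weight=0.5):
    """Pick the hypothesis with the best interpolated log-score.

    hypotheses: list of (text, first_pass_lm_logp, second_pass_score).
    weight: hypothetical interpolation weight on the second-pass score.
    """
    def combined(hypothesis):
        _, first_pass, second_pass = hypothesis
        return (1.0 - weight) * first_pass + weight * second_pass

    return max(hypotheses, key=combined)[0]


nbest = [("a", -1.0, -5.0), ("b", -2.0, -1.0)]
with_rescoring = rescore(nbest, weight=0.5)
first_pass_only = rescore(nbest, weight=0.0)
```

With equal weights the second-pass score overturns the first-pass ranking, while a weight of zero reproduces the first-pass choice.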

Results

Table 3 presents our results on the LibriSpeech evaluation sets when using the 960-hour supervised training corpus. The TTS model used in this section is trained with the LibriTTS data described in Section 4. We compare against a number of state-of-the-art self-supervised representation learning methods from the literature, including the recently introduced HuBERT and w2v-Conformer.

As shown in Table 3: 1) Injecting text information helps train a better speech encoder for ASR: even without an external LM, a model pretrained with tts4pretrain matches other models. To the best of our knowledge, after LM fusion it sets a new state of the art on the LibriSpeech 960-hour task among models that use pretraining only (last row). 2) A larger model benefits more from text data: the relative gain for the 1B model is larger than for the 600M model. We believe a larger model has increased capacity to better utilize the text information. 3) Representations learned from speech and text during pretraining are better than those learned from speech alone: as shown in the table, tts4pretrain-based pretraining reaches the same performance as a speech-only pretrained model coupled with external LM fusion. We hypothesize that the encoder has now learned contextual, lexical information. To obtain additional wins from LM fusion, the LM would have to be trained on different text sources or on domains not included in pretraining.

6.2 AMI Meeting Transcription

Table 4 presents results on the AMI meeting transcription task. Speech-only pretraining on Libri-Light followed by fine-tuning on the AMI corpus provides a significant reduction in WER (Row 2). tts4pretrain provides an additional 5.6% relative win over the speech-only pretrained model (Row 3). We present an additional data point for tts4pretrain by incorporating text from the supervised transcripts in other freely available corpora as the unspoken text corpus. This yields an additional 3.5M utterances (over the 40M from LibriSpeech) from the SpeechStew training set, a combination of 7 publicly available supervised speech corpora. The last row in Table 4 serves as a reference baseline in which a model is trained with all the available corpora. From Row 3, we observe that textual information in pretraining can compensate for a lack of real acoustic training data, as tts4pretrain is able to close the gap with the reference baseline performance.

6.3 Voice Search

We begin with results on English Voice Search queries. The TTS model for tts4pretrain was trained using the LibriTTS corpus described in Section 4. Tables 5 and 6 present results comparing tts4pretrain with audio-only self-supervision on the two different test sets described in Section 4. The first row in Table 5 illustrates the performance of the baseline model with and without LM fusion. It can be seen that tts4pretrain improves over audio-only pretraining by 15% relative (Row 3). When trained with 15-fold more YouTube data (last row), tts4pretrain still improves over audio-only pretraining by 10% relative. It is interesting to note from Rows 3 and 5 in Table 5 that, with less untranscribed speech, LM fusion seems less effective after tts4pretrain. Table 6 shows a similar trend on two rare-word test sets. The integration of an external LM is more crucial for recognizing rare words than commonly used words, which is consistent with prior observations.

Next, we present results on Marathi voice search queries in Table 7. The TTS model for tts4pretrain was trained using the Marathi TTS corpus described in Section 4. We study LM integration with a Hybrid Autoregressive Transducer (HAT) model. The HAT model couples the powerful modeling ability of E2E models with an inference algorithm that separately preserves the internal LM learned by the E2E model, thus allowing for integration with an external LM. We observe that tts4pretrain outperforms audio-only self-supervision by 4% relative, with the best performing model being the HAT model at 18.6% WER. We observe that the gains from all the methods presented in Table 7 are much smaller than those seen on English queries. We attribute this to the larger amount of real speech used in Marathi for pretraining and fine-tuning, combined with 5-fold less unspoken text than for English queries. The performance of tts4pretrain in these two languages provides insight into the impact of pretraining with varied amounts of unspoken text and untranscribed speech.

Analysis

In this section, we explore several questions to better understand the impact and behavior of tts4pretrain.

To determine how much unsupervised data tts4pretrain needs, we look at both untranscribed speech and unspoken text. All experiments in this section are conducted on English Voice Search queries.

Table 8 includes the performance of an ASR model pretrained with different amounts of untranscribed speech and a fixed amount (100M) of unspoken text. The first row shows how the model improves with increasing amounts of speech under speech-only pretraining using wav2vec2.0. While there is a significant improvement in performance with every 10-fold increase in data, it can be seen that the gains begin to asymptote, with a 26% relative win from 600 hours to 6,000 hours of pretraining data, followed by only a 9.8% relative gain when increasing the training data to 60K hours. The second row presents the same analysis with tts4pretrain. Here, we see a more uniform trend, with an approximately 10% relative win for each 10-fold increase in data. It can also be seen that wav2vec2.0 yields no WER reduction (in fact a small regression) when pretraining with 600 hours of real speech. However, when the same amount of speech is supplemented with synthesized speech by tts4pretrain, a 26% relative gain can be seen with 10-fold less speech data. This suggests that the combination of speech and text modalities is effective and particularly useful for languages where less real speech is available. With subsequent additions of real speech, the model is able to learn more effectively and outperform speech-only pretraining.

Next, we explored the impact of the amount of unspoken text injected via TTS while keeping the amount of untranscribed speech at 60K hours. Table 9 shows that the initial addition of 1M utterances yields a win of 6.8% relative. However, the next similar win requires a 100-fold increase in the amount of unspoken text. While not a surprising result, it offers insight into balancing the needs and costs of acquiring untranscribed speech and unspoken text.

We observed from Table 5 that regardless of the style of speech, LibriSpeech or YouTube videos, the WER on this task reached the same 6.2% with tts4pretrain. Table 10 studies the effect of domain mismatch in unspoken text. The use of LibriSpeech LM text in pretraining does not provide as much gain (7.0%) as typed text queries (6.5%), which are better matched to the Voice Search task. An additional modest win can be obtained (6.2%) with the data selection described in Section 3.2 to better match the domain to the task at hand.
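The relative figures quoted throughout this paper follow the usual relative-reduction arithmetic. As a worked illustration, the sketch below applies it to the three WERs in this paragraph; the helper name is hypothetical, and the resulting percentages are derived here rather than quoted from the tables.

```python
def relative_wer_reduction(baseline_wer, new_wer):
    """Relative WER reduction in percent when moving from baseline_wer to new_wer."""
    return 100.0 * (baseline_wer - new_wer) / baseline_wer


# LibriSpeech LM text (7.0%) versus typed queries (6.5%) and selected queries (6.2%).
typed_vs_librispeech = relative_wer_reduction(7.0, 6.5)
selected_vs_librispeech = relative_wer_reduction(7.0, 6.2)
```

Moving from LibriSpeech LM text to typed queries corresponds to roughly a 7.1% relative reduction, and adding data selection brings it to roughly 11.4%.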

7.2 Impact of training objective and auxiliary decoders

In this section, we explore a few natural choices for the training objective and auxiliary decoders. These ablation studies were conducted on the smaller 100-hour supervised LibriSpeech corpus and 60K hours of unsupervised pretraining. As mentioned in Section 3, we explored two different training objectives for the decoder in tts4pretrain. Table 11 shows that a CTC-loss-based decoder works better than an RNN-T decoder. We attribute this to the better alignment properties of CTC compared to RNN-T. Table 12 shows that without any type of auxiliary decoder to enforce lexical information, the model is able to learn very little from the synthesized speech alone. All experiments in this table used a CTC training objective, based on the conclusion from Table 11. Introducing auxiliary decoders with word-piece and phonemic targets (which come for free from the TTS front-end) improves learning from unspoken text, with the best result (last row) obtained by using both objectives.

7.3 Impact of data augmentation on the synthesized speech

Synthesized speech that has been augmented with different noise styles is effective in robust model training. We present different masking schemes used to augment TTS data during pretraining in Table 13. We find that the 50% time masking used in wav2vec2.0 is not optimal for ASR-derived losses on TTS utterances in tts4pretrain. The best setup uses 20% time and frequency masking with frequency warping. This is consistent with the SpecAugment hyperparameters used in downstream ASR. Note that this augmentation is used only for the auxiliary decoder loss, not the contrastive loss.
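A simplified version of time and frequency masking can be sketched as follows. Several assumptions are made here: "20%" is read as the fraction of frames and of mel channels covered by a single contiguous mask each, masked values are set to zero, and frequency warping is omitted. SpecAugment as commonly used applies multiple masks of random width, so this is an illustration of the idea rather than the configuration in Table 13.

```python
import random


def mask_spectrogram(features, time_fraction=0.2, freq_fraction=0.2, rng=None):
    """Return a copy with one block of frames and one block of mel channels zeroed."""
    rng = rng or random.Random()
    num_frames = len(features)
    num_bins = len(features[0]) if features else 0
    masked = [list(frame) for frame in features]
    time_width = int(time_fraction * num_frames)
    freq_width = int(freq_fraction * num_bins)
    # Random start positions such that each mask fits inside the spectrogram.
    t0 = rng.randint(0, num_frames - time_width) if num_frames else 0
    f0 = rng.randint(0, num_bins - freq_width) if num_bins else 0
    for t in range(num_frames):
        for f in range(num_bins):
            in_time_mask = t0 <= t < t0 + time_width
            in_freq_mask = f0 <= f < f0 + freq_width
            if in_time_mask or in_freq_mask:
                masked[t][f] = 0.0
    return masked


# A toy 10-frame, 80-channel input of ones.
original = [[1.0] * 80 for _ in range(10)]
augmented = mask_spectrogram(original, rng=random.Random(0))
num_masked = sum(value == 0.0 for frame in augmented for value in frame)
```

On the toy input, two full frames and sixteen channels of the remaining eight frames are zeroed, and the original features are left untouched.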

Conclusion

We propose tts4pretrain, a method to learn self-supervised representations from both untranscribed speech and unspoken text using 1) speech synthesis to generate speech from unspoken text and 2) auxiliary decoders and losses based on ASR objectives for this synthesized speech. tts4pretrain yields WER reductions of 10% relative on the well-benchmarked LibriSpeech task over a state-of-the-art baseline pretrained with wav2vec2.0 only. The effectiveness of tts4pretrain is also demonstrated on AMI and in-house data. We show that tts4pretrain is effective with different encoder architectures and sequence training objectives such as CTC, RNN-T, and HAT. Moreover, language-model fusion is shown to be complementary to the introduction of textual information via tts4pretrain.