SLAM: A Unified Encoder for Speech and Language Modeling via Speech-Text Joint Pre-Training

Ankur Bapna, Yu-an Chung, Nan Wu, Anmol Gulati, Ye Jia, Jonathan H. Clark, Melvin Johnson, Jason Riesa, Alexis Conneau, Yu Zhang

cs.CL cs.LG

Introduction

Self-supervised learning of text and speech representations has been particularly impactful in natural language processing and speech processing. Since GPT (Radford et al., 2018), BERT (Devlin et al., 2019) and their variations (Yang et al., 2019; Conneau & Lample, 2019; Lewis et al., 2020; Raffel et al., 2020; Joshi et al., 2020), performance on natural language understanding downstream tasks (Socher et al., 2013; Rajpurkar et al., 2016; Agirre et al., 2007; Williams et al., 2018) and monolingual (e.g., GLUE (Wang et al., 2019b), SuperGLUE (Wang et al., 2019a)) and multilingual (e.g., XTREME (Hu et al., 2020), XTREME-R (Ruder et al., 2021)) benchmarks has largely improved thanks to evolving pre-trained models, which leverage increasing amounts of unannotated data (Radford et al., 2019; Liu et al., 2019; Conneau et al., 2019; Wenzek et al., 2020; Xue et al., 2021b) and increased model capacity (Brown et al., 2020; Xue et al., 2021b; Lepikhin et al., 2020; Fedus et al., 2021). Similarly for speech, unsupervised pre-training has emerged as a predominant approach. Wav2vec 2.0 (Baevski et al., 2020b) and newer variants (Zhang et al., 2020) initially showed the strength of pre-training on speech recognition (Panayotov et al., 2015; Kahn et al., 2020; Zhang et al., 2021) on multiple domains (Hsu et al., 2021) and languages (Conneau et al., 2020).

Self-supervised learning methods in language understanding are designed to be used universally, i.e. a single large pre-trained model for all domains and languages. One big advantage of these universal models is the ability to leverage data skew across domains, tasks and languages; the availability of task or domain-specific data in one language can boost model performance for several languages that the model was pre-trained on. Extending this generalization capability across modalities by having neural networks understand both text and speech at the same time is a natural next step.

Jointly pre-training models on speech and text is a natural choice for multimodal self-supervised learning, given the similarities between the two modalities and the abundance of unannotated text data compared to speech. Recent work has also shown that self-supervised speech representations can be aligned to text with little to no supervision (Baevski et al., 2021), suggesting the possibility of learning both modalities within a single neural network. However, past work in multilingual modeling in particular has demonstrated the difficulty of learning representations of different data structures, however similar, within a shared network, exposing the so-called transfer interference problem (Arivazhagan et al., 2019). We show in this work that this trade-off also applies to joint speech-text self-supervised learning.

We study a new multimodal speech-text pre-training approach that leverages data from one modality to improve representations of the other, but also suffers from transfer interference and capacity dilution. Our Speech and LAnguage Model (SLAM) consists of a single Conformer (Gulati et al., 2020) trained with the SpanBERT objective for text (Joshi et al., 2020) and the w2v-BERT (Chung et al., 2021) objective for speech. We show that a model using only self-supervised objectives leads to good performance on both modalities, but is outperformed by mono-modal pre-trained models, suffering from significant transfer interference. To reduce the gap, we leverage supervised alignment losses, specifically a translation language model (Conneau & Lample, 2019; Zheng et al., 2021) and speech-text matching (Li et al., 2021) loss. We train our model in a multi-task fashion with the self-supervised and alignment losses. This leads to performance competitive with the state-of-the-art on SpeechStew and LibriSpeech ASR and on CoVoST 2 speech translation tasks. On speech translation, we demonstrate further quality improvements by continuing pre-training on speech-only data, outperforming previous approaches by 1 BLEU on average. On text tasks, our joint model loses quality compared to equivalent mono-modal pre-trained models, but remains competitive with initial BERT results (Devlin et al., 2019), demonstrating the capacity limitations of modeling two high-resource modalities simultaneously. To the best of our knowledge, our work is the first to study and underline the benefits and limitations of speech-text unsupervised pre-training over mono-modal models, on various speech and text downstream tasks. Our initial results set a new challenge in multimodal self-supervised language understanding.

Related Work

Self-supervised learning of language representations using neural networks has a long history. In the deep learning era, word2vec (Mikolov et al., 2013) initially trained word representations from unannotated data using noise contrastive estimation (Gutmann & Hyvärinen, 2012; Mnih & Teh, 2012). Word2vec was followed by a series of papers that expanded the approach to contextual representations of sentences, including ELMo (Peters et al., 2018), GPT (Radford et al., 2018), BERT (Devlin et al., 2019) and T5 (Raffel et al., 2019). These models rely on either generative language modeling (Bengio et al., 2003) or masked language modeling (MLM) (Taylor, 1953). Such self-supervised pre-training approaches have led to significant improvements on a wide variety of downstream tasks (Wang et al., 2019b; Hu et al., 2020).

In parallel, similar approaches were explored in speech understanding. Chung et al. (2016) follows the word2vec approach to learn vector representations of variable-length audio segments. Oord et al. (2018) introduces contrastive predictive coding (CPC) which leverages language modeling and negative sampling to learn speech representations. The first wav2vec model (Schneider et al., 2019) closely follows this architecture using a noise contrastive binary classification task for unsupervised pre-training. vq-wav2vec (Baevski et al., 2020a) proposes to add a vector quantizer similar to VQ-VAE (van den Oord et al., 2018), using Gumbel softmax (Jang et al., 2016) or online k-means clustering to quantize the dense speech representations (Eloff et al., 2019). When quantized, speech utterances become sequences of discrete tokens belonging to a fixed vocabulary, similar to text, on which BERT is applied. wav2vec 2.0 merges those two separate steps (quantization and contrastive learning) into a unified end-to-end learning procedure that pre-trains a Transformer model. They show significant gains on LibriSpeech (Panayotov et al., 2015) as well as on few-shot learning for low-resource languages (Conneau et al., 2020). w2v-BERT (Chung et al., 2021) expands wav2vec 2.0 by combining contrastive learning and MLM. Zhang et al. (2020) and BigSSL (Zhang et al., 2021) explore the limits of large-scale semi-supervised learning with Conformers (Gulati et al., 2020).

One approach to utilizing data across modalities is to synthetically transform the modality of the data; one example is Chen et al. (2021), where the authors use text-to-speech (TTS) to transform text data into speech and use it for monomodal speech pre-training. Recent advances in self-supervised learning for text, speech and images have led to a new frontier: multimodal self-supervised learning, where a single model learns representations of all modalities using both unannotated and aligned data. VATT Transformer (Akbari et al., 2021) leverages datasets of more than 100M video-audio-text triplets to learn representations on all modalities at once with noise contrastive estimation. Li et al. (2021) jointly learn to do masked language modeling on text as well as matching image representations to text with parallel data through alignment losses. Jia et al. (2021) learn language representations for text-to-speech synthesis by jointly training on phoneme and grapheme representations with MLM. Perhaps most similar to our work, Zheng et al. (2021) learn joint speech-text representations by adapting a translation language modeling (TLM) loss (Conneau & Lample, 2019) to the speech-text setting and study the downstream effect on speech translation.

This work investigates the possibility of developing truly multimodal pre-trained models building on state-of-the-art speech and text pre-training approaches, and highlights the advantages and challenges associated with multimodal pre-trained models by evaluating on a variety of speech and text downstream tasks.

Method

In this section, we describe each component of our speech-text pre-training framework, SLAM, starting with the model architecture in Section 3.1. We then present the pre-training objectives and our multi-stage pre-training strategy in Sections 3.2 and 3.4; the pre-training data are introduced in Section 4.1. Figure 1 illustrates the overall pre-training framework.

Our model contains a speech encoder, a text encoder, and a multimodal encoder. At a high level, the speech and text encoders take speech and text signals as input respectively and extract latent features from them. The latent features from the two modalities are then fed to the same multimodal encoder for learning speech-text joint representations. Next, we describe each of these components.

The text encoder is a simple token embedding layer that transforms input text into a sequence of token vector embeddings $W=(w_{1},w_{2},...,w_{T^{\prime}})$ . We evaluated using a deep Transformer or Conformer stack for the text encoder but did not find it empirically useful for speech translation or ASR. The textual tokens are combined with sinusoidal positional encodings and layer normalized before being fed to the multimodal encoder. We utilize a SentencePiece model (Kudo & Richardson, 2018) with a $32k$ token vocabulary.
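The text encoder described above can be illustrated with a minimal pure-Python sketch. It assumes that "combined with" means element-wise addition of the positional encodings, that the encodings follow the standard sine/cosine formulation, and that layer normalization has no learned scale or bias; the function names and the toy embedding table are hypothetical and not taken from the paper.

```python
import math


def sinusoidal_positions(num_positions, dim):
    """Standard sinusoidal encodings: sine on even indices, cosine on odd ones."""
    table = []
    for pos in range(num_positions):
        row = []
        for i in range(dim):
            angle = pos / (10000 ** (2 * (i // 2) / dim))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table


def layer_norm(vec, eps=1e-6):
    """Normalize one vector to zero mean and unit variance (no learned parameters)."""
    mean = sum(vec) / len(vec)
    var = sum((v - mean) ** 2 for v in vec) / len(vec)
    return [(v - mean) / math.sqrt(var + eps) for v in vec]


def encode_text(token_ids, embedding_table):
    """Embed tokens, add positional encodings (assumed additive), then layer-normalize."""
    dim = len(embedding_table[0])
    positions = sinusoidal_positions(len(token_ids), dim)
    outputs = []
    for t, tok in enumerate(token_ids):
        summed = [e + p for e, p in zip(embedding_table[tok], positions[t])]
        outputs.append(layer_norm(summed))
    return outputs


# Toy 4-token vocabulary with 4-dimensional embeddings (illustrative values only).
toy_table = [
    [0.1, 0.2, 0.3, 0.4],
    [0.5, -0.1, 0.0, 0.2],
    [-0.3, 0.7, 0.1, -0.2],
    [0.0, 0.0, 0.0, 0.0],
]
W = encode_text([0, 2, 1], toy_table)  # sequence of T' = 3 token vectors
```

In this sketch the output `W` plays the role of the sequence $W=(w_{1},...,w_{T^{\prime}})$ that is passed on to the multimodal encoder.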

The multimodal encoder is a deep stack of Conformer layers that can take either just speech, or just text, or concatenated speech-text pairs as input. The Conformer layers used in the multimodal encoder are identical to the ones used in the speech encoder. When training with w2v-BERT we use $M=16$ Conformer layers in the multimodal stack, while we use $M=24$ layers when training with wav2vec 2.0. Depending on the type of input - i.e. just speech, text or a speech-text pair - the model is tasked to solve different self-supervised pre-training objectives.

3.2 Pre-Training Objectives

We pre-train the model with four objectives: SpanBERT (Joshi et al., 2020) on unlabeled text, w2v-BERT (Chung et al., 2021) on unlabeled speech, Translation Language Modeling (Conneau & Lample, 2019; Zheng et al., 2021) on paired speech and text data, and Speech-Text Matching (Li et al., 2021) on paired and non-paired speech and text data.

We use two self-supervised learning objectives that are trained on unannotated text or speech data.

BERT is the self-supervised learning objective applied to unannotated text input (Devlin et al., 2019). It aims to learn contextualized textual representations via solving a masked language modeling (MLM) task. We mask spans of text as in SpanBERT (Joshi et al., 2020).

w2v-BERT is the self-supervised learning objective used for pre-training on unannotated speech data (Chung et al., 2021). It combines contrastive learning and MLM, where the former trains the model to discretize continuous speech signals into a finite set of discriminative speech tokens, and the latter trains the model to learn contextualized speech representations via solving a masked prediction task consuming the discretized tokens.

3.2.2 Alignment Losses

Without the presence of paired data, the only incentive for the model to learn joint representation is the inductive bias of having a shared set of Conformer layers. Because this is such a strong assumption, we also leverage alignment losses, which use speech-text paired ASR data to explicitly incentivize the model to share representations within the model. We will see below that this leads to better alignment between the speech and text representations, as indicated by better performance on downstream tasks.

Translation Language Modeling (TLM) was first introduced to align representations between two languages within a shared Transformer (Conneau & Lample, 2019). With TLM, parallel sentences are concatenated and sent to a Transformer MLM which predicts missing words, encouraging the model to leverage context from both input languages. In this work, we concatenate speech utterances with their transcriptions using ASR supervised data, similar to Zheng et al. (2021). We then train the model to predict masked text or speech spans with BERT or w2v-BERT, encouraging the use of cross-modal context.

Speech-Text Matching (STM) predicts whether a pair of speech and text is positive (matched) or negative (not matched). We use the multimodal encoder’s output embedding of the [CLS] token as the joint representation of the speech-text pair, and append a fully-connected (FC) layer followed by softmax to predict a two-class probability $p^{\text{STM}}$. The STM loss is:

$$\mathcal{L}_{\text{STM}} = H\left(y^{\text{STM}}, p^{\text{STM}}\right)$$

where $y^{\text{STM}}$ is a 2-dimensional one-hot vector representing the ground-truth label, and $H$ is cross-entropy. The STM objective explicitly trains the model to align speech-text pairs, a signal which self-supervised learning cannot explicitly provide.
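As a rough illustration of the STM objective, the following pure-Python sketch computes the cross-entropy between a one-hot label and a two-class softmax for a single speech-text pair. The logits stand in for the output of the FC layer on the [CLS] embedding; the function names and the class ordering (index 1 = matched) are assumptions made for this example.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(y_one_hot, probs, eps=1e-12):
    """H(y, p) = -sum_k y_k * log(p_k); eps guards against log(0)."""
    return -sum(y * math.log(p + eps) for y, p in zip(y_one_hot, probs))


def stm_loss(cls_logits, is_matched):
    """Per-example STM loss; index 1 is assumed to be the 'matched' class."""
    p_stm = softmax(cls_logits)
    y_stm = [0.0, 1.0] if is_matched else [1.0, 0.0]
    return cross_entropy(y_stm, p_stm)


# An uninformative prediction costs log(2); a confident correct one costs much less.
uninformed = stm_loss([0.0, 0.0], is_matched=True)
confident = stm_loss([-5.0, 5.0], is_matched=True)
```

In practice the loss would be averaged over a batch containing both matched and non-matched pairs; how negatives are sampled is not specified here.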

3.3 Implementation Details

When the input only contains speech, the latent speech features $X=(x_{1},x_{2},...,x_{T})$ extracted by the speech encoder are directly fed to the multimodal encoder as input. The speech branch of the model (i.e., the speech encoder along with the multimodal encoder) is trained to optimize the w2v-BERT objective. Following Chung et al. (2021), we mask approximately $50\%$ of the speech frames with spans of length $10$ . Analogously, when the input only contains text, the latent textual features $W=(w_{1},w_{2},...,w_{T^{\prime}})$ extracted by the text encoder are fed to the multimodal encoder as input, and the text branch (i.e., the text encoder along with the multimodal encoder) is trained to optimize the SpanBERT objective. We mask $15\%$ of text tokens with spans of length $5$ .

When the input is a speech-text pair, the latent speech and text representations $X$ and $W$ extracted respectively by the speech and text encoders are concatenated, forming a sequence with a total length of $T+T^{\prime}$ that is fed to the multimodal encoder as input. The multimodal encoder is then trained to simultaneously predict the masked speech features (as in the w2v-BERT objective) and masked text features (SpanBERT). We use more aggressive masking when using paired data to increase the difficulty of the task, and to encourage the multimodal encoder to learn to extract useful features across modalities. We mask a single span consisting of $50\%$ of text tokens, and multiple spans masking out $75\%$ of the speech features when training with paired data.
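The masking ratios above can be made concrete with a small sketch. It assumes fixed-length spans whose start positions are drawn uniformly at random (overlaps allowed), and it reuses a span length of 10 for the paired speech case, which the text does not state; `mask_spans` and `build_paired_mask` are hypothetical helper names.

```python
import random


def mask_spans(length, mask_ratio, span_len, rng):
    """Boolean mask covering about mask_ratio of positions with fixed-length spans."""
    if span_len < 1:
        raise ValueError("span_len must be at least 1")
    span_len = min(span_len, length)
    target = round(length * mask_ratio)
    mask = [False] * length
    masked = 0
    # Keep sampling span starts until the target count is reached; the final
    # span may overshoot the target by at most span_len - 1 positions.
    while masked < target:
        start = rng.randrange(0, length - span_len + 1)
        for i in range(start, start + span_len):
            if not mask[i]:
                mask[i] = True
                masked += 1
    return mask


def build_paired_mask(num_speech, num_text, rng):
    """Mask for a concatenated speech-text input of total length T + T'."""
    # Multiple spans covering ~75% of speech features (span length 10 is assumed).
    speech_mask = mask_spans(num_speech, 0.75, 10, rng)
    # A single span covering 50% of the text tokens.
    text_span = max(1, round(num_text * 0.5))
    text_mask = mask_spans(num_text, 0.5, text_span, rng)
    return speech_mask + text_mask


rng = random.Random(0)
speech_only = mask_spans(200, 0.50, 10, rng)  # unpaired speech: ~50%, spans of 10
text_only = mask_spans(40, 0.15, 5, rng)      # unpaired text: ~15%, spans of 5
paired = build_paired_mask(100, 20, rng)      # paired input of length 120
```

Because spans may overlap and the last span can overshoot, the realized ratio is only approximately equal to the target, which matches the "approximately $50\%$" wording used for speech.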

We train the model simultaneously on all these objectives; at every training step the model is trained on a batch of (i) unlabeled speech, (ii) unlabeled text, and (iii) paired speech and text. The gradients of all objectives are aggregated and used to update the model parameters.

3.4 Multi-Stage Pre-Training

In practice, we find that pre-training the model simultaneously with unpaired and paired data results in the model overfitting to the relatively small paired dataset. To avoid this, we pre-train the model in a multi-stage fashion, where we first pre-train the model just on unpaired text and speech, and then optimize it with unpaired and paired data simultaneously. This multi-stage pre-training approach achieves better downstream performance than optimizing all four losses from scratch. Concretely, we train for 500k updates with the self-supervised losses, and for between 250k and 500k additional steps with the alignment losses. We observed improvements of 0.1 to 0.2 WER on LibriSpeech dev-other and 0.3 average BLEU on CoVoST 2 when using the multi-stage strategy compared with training with all losses from scratch. In all models that use TLM and/or STM, we utilize multi-stage pre-training.
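The two-stage schedule can be sketched as a simple step-dependent choice of active losses. This is only an illustration: the loss names are placeholder labels, and combining the objectives as an unweighted sum is an assumption (the text says only that gradients of all objectives are aggregated).

```python
SELF_SUPERVISED = ("spanbert", "w2v_bert")  # stage 1: unpaired text and speech
ALIGNMENT = ("tlm", "stm")                  # added in stage 2: paired data


def active_losses(step, stage_one_steps=500_000):
    """Names of the objectives optimized at a given training step."""
    if step < stage_one_steps:
        return SELF_SUPERVISED
    return SELF_SUPERVISED + ALIGNMENT


def total_loss(loss_values, step):
    """Unweighted sum of the active objectives (the weighting is an assumption)."""
    return sum(loss_values[name] for name in active_losses(step))


# Dummy per-objective loss values, chosen only to make the sums easy to read.
dummy = {"spanbert": 1.0, "w2v_bert": 2.0, "tlm": 4.0, "stm": 8.0}
stage_one_total = total_loss(dummy, step=10)       # self-supervised losses only
stage_two_total = total_loss(dummy, step=600_000)  # all four losses
```

Stage 2 would then run for the additional 250k to 500k steps mentioned above, with the self-supervised losses still active alongside TLM and STM.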

Experiments

We first describe our pre-training and fine-tuning setup, including the speech and text datasets used for pre-training as well as all our downstream tasks. We then present our results, including ablations of our approach and comparisons between multimodal and mono-modal models.

Libri-light (speech only): The Libri-light (LL-60k) dataset contains 60k hours of unlabeled speech and is used to pre-train all our Masked Speech Models (MSM). LL-60k is the most widely used large unsupervised speech corpus for various pre-training techniques. Each input speech sequence is constructed by first randomly selecting 32-64 second segments from the original utterance. From these segments, a contiguous 32 second region is extracted from a random starting point on-the-fly during MSM pre-training, as described in Zhang et al. (2020).
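The two-step cropping described for Libri-light can be sketched as follows. The uniform sampling of segment length and start, and the fallback for utterances shorter than 32 seconds, are assumptions for illustration; `sample_training_crop` is a hypothetical helper, not the pipeline of Zhang et al. (2020).

```python
import random


def sample_training_crop(utterance_len_s, rng, min_seg=32.0, max_seg=64.0, crop=32.0):
    """Return (start, end) in seconds of a crop taken from a random segment."""
    # Step 1: pick a segment of 32-64 s (clipped to the utterance length).
    seg_len = min(rng.uniform(min_seg, max_seg), utterance_len_s)
    seg_start = rng.uniform(0.0, utterance_len_s - seg_len)
    # Step 2: pick a contiguous 32 s region at a random offset inside the segment.
    crop_len = min(crop, seg_len)
    crop_start = seg_start + rng.uniform(0.0, seg_len - crop_len)
    return crop_start, crop_start + crop_len


rng = random.Random(0)
start, end = sample_training_crop(300.0, rng)       # a 32 s window inside a 5 min utterance
short = sample_training_crop(20.0, random.Random(1))  # short utterance: used whole
```

In a real data pipeline the second step would be re-sampled on the fly each time the segment is visited, so the model sees different 32 second windows of the same segment.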

LibriLM (text only): The LibriSpeech text corpus comprises nearly 803 million tokens from 40M utterances of filtered text derived from 14.5K Project Gutenberg books (Panayotov et al., 2015).

mC4-En (text only): The mC4-En dataset (Xue et al., 2021a) consists of multiple terabytes of English text data, mined from CommonCrawl. The dataset is publicly available (https://huggingface.co/datasets/mc4).

LibriSpeech (paired data): We use LibriSpeech (Panayotov et al., 2015) fullset (960h) as paired data for Translation Language Modeling (TLM) and Speech-Text Matching (STM).

4.2 Downstream Tasks

We present results on publicly available, well-benchmarked downstream tasks including speech recognition, speech translation, text normalization and language understanding.

CoVoST 2 (Wang et al., 2021a) is a large-scale multilingual speech translation corpus covering translations from 21 languages into English and from English into 15 languages. This represents the largest open dataset available to date in terms of total volume and language coverage. Following Wang et al. (2021b), we use four English to X directions, specifically German, Catalan, Arabic and Turkish. To evaluate our pre-trained encoders on speech translation, we fine-tune them as part of a sequence-to-sequence model with a $4$-layer Transformer decoder. The decoder uses a $384$ embedding dimension, $1536$ feed-forward hidden dimension, $4$ attention heads and an $8192$ token multilingual sub-word vocabulary.

We consider four main tasks from the GLUE natural language understanding benchmark: the MNLI natural language inference benchmark (Williams et al., 2018), the Quora Question Pair (QQP) classification dataset (https://www.kaggle.com/c/quora-question-pairs), the QNLI question answering task (Wang et al., 2019b) and the SST-2 sentiment analysis dataset (Socher et al., 2013). We report accuracy on the dev sets of each dataset (except SST-2 where we report test accuracy) and compare our results to BERT, SpanBERT and RoBERTa.

Text normalization, also referred to as text verbalization, is a core task in the text-to-speech (TTS) community. Text normalization takes as input raw unverbalized text, as typically found in written form, and produces a verbalized form of that text, expanding numbers, dates, abbreviations, etc. The output of this task is a word-for-word spoken transcript, which is the input format expected by TTS systems. For example, “A 1951 Lancia V6” would become “a nineteen fifty one lancia v six” while “Dial 1951 for 6V batteries” might become “dial one nine five one for six volt batteries”. We consider the English task data from Sproat & Jaitly (2016) and compare our results to those of Stahlberg & Kumar (2020), previously the state of the art. We report sentence error rate on the test set for all our experiments. When evaluating on text normalization, we fine-tune our encoder with a $3$-layer Transformer decoder with a model dimension of $512$, hidden dimension of $1024$ and $4$ attention heads.
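For readers unfamiliar with the metric, sentence error rate can be sketched as the fraction of sentences whose predicted verbalization differs from the reference in any word. The whitespace tokenization used below is an assumption; the exact matching rules of the benchmark are not given here, and the second hypothesis is an invented error for illustration.

```python
def sentence_error_rate(hypotheses, references):
    """Fraction of sentences whose word sequence differs from the reference."""
    if len(hypotheses) != len(references):
        raise ValueError("hypotheses and references must have the same length")
    if not references:
        return 0.0
    errors = sum(
        1 for hyp, ref in zip(hypotheses, references) if hyp.split() != ref.split()
    )
    return errors / len(references)


references = [
    "a nineteen fifty one lancia v six",
    "dial one nine five one for six volt batteries",
]
hypotheses = [
    "a nineteen fifty one lancia v six",              # exact match
    "dial nineteen fifty one for six volt batteries",  # number read as a year: error
]
ser = sentence_error_rate(hypotheses, references)  # one of two sentences is wrong
```

A single wrong token makes the whole sentence count as an error, which is why this metric is stricter than a word-level error rate.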

4.3 Main Results

In this section, we analyze the results of fine-tuning our models on speech and text downstream tasks.

We present our results on CoVoST 2 En-X translation in Table 1. We compare our models against results from Wang et al. (2021b), specifically against fine-tuning wav2vec 2.0 on speech translation, with and without LM fusion. Our speech-only baselines trained using wav2vec 2.0 improve over Wang et al. (2021b) by over 1 BLEU, possibly due to increased encoder capacity. Our w2v-BERT speech-only encoder further improves performance by around 0.4 BLEU.

The addition of mC4-En data to pre-training results in a drop of around 1.3 BLEU for the w2v-Conformer, a concrete example of the interference issue. In comparison, the w2v-BERT speech-text model is worse than its speech-only counterpart by only approximately 0.6 BLEU. The addition of alignment losses alleviates interference, and the joint pre-trained model then matches the speech-only baseline.

We continue training our TLM + STM joint model on unlabeled speech-only data to alleviate the capacity limitation in the multimodal pre-trained model. Fine-tuning this speech-adapted model on CoVoST results in a model that outperforms our best speech-only model by almost 1 BLEU point, illustrating positive cross-modal transfer and the advantages of multimodal pre-training.

4.3.2 Speech Recognition

In Table 2, we present our results on the LibriSpeech 960h ASR benchmark. We compare our unified speech-text encoders to a number of state-of-the-art self-supervised representation learning methods from the literature, including wav2vec 2.0 (Baevski et al., 2020b) and w2v-BERT (Chung et al., 2021).

As shown in Table 2, w2v-BERT is consistently better than w2v-Conformer with the text modality by 17% relative (lines 6 and 7). However, simply adding the text modality with LibriLM data hurts ASR performance by 14% relative compared to the speech-only model, from 2.9 to 3.3 average WER on test-other (lines 5 and 7). By adding the TLM loss (line 8), we reduce the interference and bridge most of this gap, matching performance on dev/dev-other/test and being only 0.2% worse on test-other than the mono-modal model. We conclude that the alignment losses help the model align the two modalities, resulting in better use of shared parameters and a reduction in the interference between the speech and text modalities. Further introducing the STM loss does not improve ASR performance (line 9), but the resulting model still performs better than the model without alignment losses. As we increase the amount of text data from LibriLM to mC4-En (line 10), we observe a regression on dev-other and test-other. We conclude that the model needs more capacity to learn from the larger, out-of-domain text dataset. Similar to speech translation, if we further pre-train the model with speech-only data, there is a consistent 0.1% improvement over all the test sets (lines 10 and 11).
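The relative figures quoted above follow from a simple ratio. As a worked check, using only the two WER values stated in the text (2.9 for the speech-only model and 3.3 after adding LibriLM text), a minimal sketch is:

```python
def relative_change(baseline, new):
    """Relative change of `new` with respect to `baseline` (positive = increase)."""
    return (new - baseline) / baseline


# WER values quoted in the text: speech-only vs. speech + LibriLM text.
degradation = relative_change(2.9, 3.3)
print(f"{degradation:.1%}")  # 13.8%, which rounds to the quoted 14% relative
```

Since WER is an error rate, a positive relative change here corresponds to a degradation in recognition quality.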

In Table 3, we present our results on 5 ASR benchmarks using SpeechStew supervised data. Note that the unified encoder model has not seen any paired data from these out-of-domain benchmarks during pre-training. We notice that the alignment losses still improve over the baseline multimodal model (lines 5 to 7). Interestingly, mC4-En data improves performance on TED-LIUM but is worse on AMI compared to pre-training on LibriLM. TED-LIUM is clean speech from the TED-talks domain and thus likely to benefit from more text data, whereas AMI has natural speech from meetings and might benefit from additional capacity devoted to acoustic modeling.

4.3.3 Natural Language Understanding

We report results on four natural language understanding tasks from GLUE in Table 4. We compare our methods to the original BERT model of Devlin et al. (2019) and its extended versions, SpanBERT (Joshi et al., 2020) and RoBERTa (Liu et al., 2019), which are trained on comparable objectives and comparable text data respectively. We report dev results for MNLI, QNLI and QQP, as test sets are not available for these tasks. We see that our SpanBERT-conformer text-only baseline obtains competitive results with SpanBERT but is outperformed by RoBERTa, possibly because of the Conformer architecture and the optimized pre-training and fine-tuning of the RoBERTa architecture. Doing an apples-to-apples comparison of our text-only model and our speech-text architectures, we observe a significant decrease in performance when adding the speech modality. On MNLI, for instance, we go from 87.9% accuracy (line 5) down to 83.3% accuracy with our full model (line 8), and on SST-2 from 95.4% to 93.9%. We observe some gains in performance when using alignment losses over the fully self-supervised learning approach (line 6), but these only slightly alleviate the interference problem. Given the large amount of data in both speech and text for English, it is likely that the capacity of the model is a limiting factor for understanding both modalities simultaneously. We believe that alleviating capacity limitations by inducing better cross-modal alignments is an important challenge. We leave the investigation of larger-capacity models and lower-resource languages for future work.

4.3.4 Text Normalization

In addition to GLUE, we also evaluate and report sentence error rate for text normalization and compare our approach to Stahlberg & Kumar (2020) in Table 4. Our baseline text-only model improves over the previous state-of-the-art by $0.25\%$ absolute (lines 5 and 6). Adding speech during pre-training results in worse performance compared to our text-only pre-training, but the addition of TLM and STM alignment losses is able to recover some of the lost quality (lines 6 to 8). Based on this, we suspect that future work in cross-modality alignment may yield improvements on this task.

Discussion

In this work, we demonstrate that a single encoder model can be pre-trained to learn strong contextualized representations of speech and text simultaneously. We combine self-supervised learning objectives for text (BERT) and self-supervised approaches for speech (w2v-BERT) to learn a joint Speech and LAnguage Model (SLAM). Downstream evaluations on speech and language understanding tasks, including LibriSpeech and SpeechStew ASR, CoVoST 2 speech translation, four GLUE tasks, and text normalization, uncover significant interference challenges when pre-training simultaneously on high-resource modalities. Using alignment losses such as translation language modeling and speech-text matching, which leverage supervised aligned speech-text data, we show that we can improve the cross-modal representation alignment and improve over mono-modal models on the speech translation tasks, while maintaining state-of-the-art performance on speech recognition. We hope that this work will motivate further research on extending the universality of self-supervised learning of language representations to the multimodal speech-text setting.