Google USM: Scaling Automatic Speech Recognition Beyond 100 Languages

Yu Zhang, Wei Han, James Qin, Yongqiang Wang, Ankur Bapna, Zhehuai Chen, Nanxin Chen, Bo Li, Vera Axelrod, Gary Wang, Zhong Meng, Ke Hu, Andrew Rosenberg, Rohit Prabhavalkar, Daniel S. Park, Parisa Haghani, Jason Riesa, Ginger Perng, Hagen Soltau, Trevor Strohman, Bhuvana Ramabhadran, Tara Sainath, Pedro Moreno, Chung-Cheng Chiu, Johan Schalkwyk, Françoise Beaufays, Yonghui Wu

Introduction

Recent advances in self-supervised learning have ushered in a new era for speech recognition. Whereas previous works focused mostly on improving the quality of monolingual models for mainstream languages, recent studies have increasingly turned to “universal” models. These may take the form of a single model that performs well on multiple tasks, or one that covers multiple domains, or one that supports multiple languages. In this work, we explore the frontiers of language expansion. Our long-term goal is to train a universal ASR model that covers all the spoken languages in the world.

A fundamental challenge in scaling speech technologies to many languages is obtaining enough data to train high-quality models. With conventional supervised training approaches, audio data needs to be manually transcribed, which is lengthy and expensive, or collected from existing transcribed sources, which are hard to find for tail languages. While transcribed speech may be scarce in many languages, untranscribed speech and text data are practically unlimited. Recent developments in semi-supervised algorithms for speech recognition make it possible to leverage such data for pre-training and produce high-quality speech models with a limited amount of transcribed data.

Moreover, recent studies have shown that a single large model can utilize large data sets more effectively than smaller models. This all points to a promising direction where large amounts of unpaired multilingual speech and text data and smaller amounts of transcribed data can contribute to training a single large universal ASR model.

We produce large “Universal Speech Models” (USMs) through a training pipeline that utilizes three types of datasets:

YT-NTL-U: A large unlabeled multilingual dataset consisting of 12M hours of YouTube-based audio covering over 300 languages.

Pub-U: 429k hours of unlabeled speech in 51 languages based on public datasets.

Web-NTL: A large multilingual text-only corpus with 28B sentences spanning over 1140 languages.

Paired ASR Data: We utilize two corpora of paired audio-text data with O(10k) hours of audio for supervised training.

YT-SUP+: 90k hours of labeled multilingual data covering 73 languages and 100k hours of en-US pseudo-labeled data generated by noisy student training (NST) from YT-NTL-U.

Pub-S: 10k hours of labeled multi-domain en-US public data and 10k hours of labeled multilingual public data covering 102 languages.

2B-parameter Conformer models are built using these datasets through the following steps:

Unsupervised Pre-training: BEST-RQ (BERT-based Speech pre-Training with Random-projection Quantizer) is used to pre-train the encoder of the model with YT-NTL-U.

MOST (Multi-Objective Supervised pre-Training): The model can optionally be further prepared by a multi-objective supervised pre-training pipeline that utilizes all three kinds of datasets: YT-NTL-U, Pub-U, Web-NTL and Pub-S. Here, a weighted sum of the BEST-RQ masked language model loss, along with the text-injection losses (including the supervised ASR loss and modality matching losses), is optimized during training.

Supervised ASR Training: We produce generic ASR models trained with connectionist temporal classification (CTC) and Listen, Attend, and Spell (LAS) transducers for downstream tasks.

Two types of models are produced through this pipeline—pre-trained models that can be fine-tuned on downstream tasks, and generic ASR models for which we assume no downstream fine-tuning occurs. The generic ASR models are trained with chunk-wise attention, which we introduce later in this report.

We denote the pre-trained models USM and USM-M, where the suffix -M indicates that MOST has been utilized for the preparation of the model. The USM and USM-M models can be further fine-tuned on the downstream task of choice with an appropriate transducer unit, which can be a CTC, LAS or RNN transducer (RNN-T) unit. We evaluate our USM models on two types of benchmarks:

Automatic Speech Recognition (ASR): We use YouTube data to train USMs for YouTube (e.g., closed captions). We evaluate the USMs on two public benchmarks, SpeechStew and FLEURS. We also report results on the long-form test set CORAAL, for which only the evaluation set is available.

Automatic Speech Translation (AST): We test AST performance on CoVoST 2.

As indicated in Table 1, the generic ASR models are trained with YT-SUP+ and not fine-tuned on domain-specific datasets for downstream ASR tasks. We, however, explore the possibility of attaching additional “adapter” units to both generic and pre-trained ASR models and training adapter weights while keeping the rest of the model frozen.

The overall training pipeline of our models is summarized in Fig. 1. In our design, once a large amount of compute is expended in the pre-training stages, the downstream application can be conveniently fine-tuned from a model trained from stage-1 or stage-2 with a task-specific transducer. Our experimental results demonstrate that this pipelined training framework enables us to build both generic multilingual ASR systems and domain-specific models with state-of-the-art performance.

We next present the key findings of our research, provide an overall view of the report, and review related work.

2 Key Findings

SoTA results for downstream multilingual speech tasks: Our USM models achieve state-of-the-art performance for multilingual ASR and AST for multiple datasets in multiple domains. This includes SpeechStew (mono-lingual ASR), CORAAL (African American Vernacular English (AAVE) ASR), FLEURS (multi-lingual ASR), YT (multilingual long-form ASR), and CoVoST (AST from multiple languages into English). We depict our model’s performance in the first panel of Fig. 2. We also build an ASR model for YouTube captioning (i.e., the transcription of speech in YouTube videos) that achieves < 30% WER on 73 languages. With only 90k hours of supervised data, this model performs better than Whisper, a strong general ASR system trained on more than 400k hours of transcribed data (for this comparison, we select 18 languages that Whisper can successfully decode with lower than 40% WER). The second panel of Fig. 2 demonstrates that our YouTube captions model generalizes well to unseen domains.

BEST-RQ is a scalable speech representation learner: We find that BEST-RQ pre-training can effectively scale to the very large data regime with a 2B parameter Conformer-based backbone, comparing favorably against Wav2Vec 2.0 and W2v-BERT in this setting.

MOST (BEST-RQ + text-injection) is a scalable speech and text representation learner: We demonstrate that MOST is an effective method for utilizing large scale text data for improving quality on downstream speech tasks, as demonstrated by quality gains exhibited for the FLEURS and CoVoST 2 tasks. Fig. 2 depicts USM’s performance, establishing a new state-of-the-art on the FLEURS benchmark across 102 languages for ASR and on CoVoST 2 across 21 languages on AST.

Representations from MOST (BEST-RQ + text-injection) can quickly adapt to new domains: We find that it is possible to obtain powerful downstream ASR/AST models by attaching and training light-weight residual adapter modules, which only add 2% of additional parameters, while keeping the rest of the model frozen.

Chunk-wise attention for robust long-form speech recognition: We introduce chunk-wise attention, an effective, scalable method for extending the performance of ASR models trained on shorter utterances to very long speech inputs. We find that the USM-CTC/LAS models trained with chunk-wise attention are able to produce high-quality transcripts for very long utterances in the YouTube evaluation sets.

3 Outline

The outline of this report is as follows:

Methods: We review the architecture and the methods used in the paper. We provide brief summaries of the Conformer, BEST-RQ, text-injection used for MOST, and Noisy Student Training (NST). We also introduce chunk-wise attention for scalable training on long utterances.

Data: We describe the four types of datasets used to train our models: the unlabeled multilingual speech dataset YT-NTL-U, the multilingual text corpus Web-NTL, labeled datasets, and pseudo-labeled datasets.

Key Results: We present the performance of our USM models on downstream ASR and AST tasks. We demonstrate that USM establishes new states-of-the-art on several speech understanding benchmarks.

Analysis and Ablations: We present analysis of the effects of the key components of our work and compare their performance against existing methods.

4 Related Work

There is extensive literature on pre-training and self-training for ASR. Large speech models trained on large datasets have been studied previously in both monolingual and multilingual contexts. Large multi-modal speech models have also been explored, and various unsupervised pre-training methods for speech models have been proposed and applied.

Our work is an extension of a host of recent research efforts that have studied semi-supervised learning for ASR in the context of deep learning. Large speech models (> 1B parameters) have been studied in prior work; we expand upon this approach to train multilingual speech models in this work. We improve upon the earlier methods by employing a more scalable self-supervised learning algorithm (BEST-RQ) and additionally applying multi-modal pre-training (text-injection) to prepare the models. We introduce an improvement to BEST-RQ by utilizing a multi-softmax loss. We also incorporate Multi-Objective Supervised Training (BEST-RQ with text-injection) to improve the quality of speech representations learnt during pre-training, by utilizing transcribed data and unlabeled text. Long-form ASR has also been studied previously; we propose chunk-wise attention as an alternative solution to chunk-based decoding.

In this paper, we propose a scalable self-supervised training framework for multilingual ASR which extends to hundreds of languages. In particular:

We demonstrate that USMs pre-trained on 300 languages can successfully adapt to both ASR and AST tasks in new languages with a small amount of supervised data.

We build a generic ASR model on 73 languages by fine-tuning pre-trained models on 90k hours of supervised data. We show that the generic ASR models can carry out inference efficiently on TPUs and can reliably transcribe hours-long audio on YouTube Caption ASR benchmarks.

We conduct a systematic study on the effects of pre-training, noisy student training, text injection, and model size for multilingual ASR.

Methods

We use the convolution-augmented transformer, or Conformer, with relative attention as an encoder model. For downstream speech tasks such as ASR or AST, the features produced by the Conformer are used, after additional projection, as input to a connectionist temporal classification (CTC), RNN transducer (RNN-T) or Listen, Attend, and Spell (LAS) unit. As will be discussed further, BEST-RQ pre-training is exclusively applied to the encoder, while other forms of training (e.g., T5) train the entire task network as a whole.

For our experiments, we consider two models with 600M and 2B parameters respectively. While the main results presented have been obtained using the 2B model, the 600M model is utilized for ablation studies and observing model scaling behavior. Some features of the models are listed in Table 2.

2 Pre-training: BEST-RQ

We select BEST-RQ as the method to pre-train our networks with speech audio. BEST-RQ provides a simple framework with a small number of hyperparameters for unsupervised training on large-scale unlabeled audio data. We discuss the comparative advantage of BEST-RQ against other pre-training methods in section 5.3.

BEST-RQ employs a BERT-style training task for the audio input that attempts to predict masked speech features. To make the task compatible with BERT-style training, the original speech features corresponding to the masked frames are quantized, and the task requires predicting the quantized label of these features. For a given number of quantization targets $c$, random “codebook” vectors $v_{0},\cdots,v_{c-1}$ are chosen in an embedding space. The discrete label of the speech feature is obtained by first projecting the feature into the embedding space by a randomly initialized, frozen projection matrix and then finding the closest codebook vector. The index of this codebook vector is identified as the label of the speech feature. Cosine similarity is used as the distance measure for determining the code.

We note that while w2v-BERT pre-training has proven to be an effective method for unsupervised pre-training, it requires an additional quantization module which introduces more complexity. As we increase the model size and language coverage, the learnt codebook module proves costly to tune and can impede progress of model development. Meanwhile, the BEST-RQ algorithm does not require such a module, making it a more scalable method for pre-training.

Instead of utilizing a single codebook, we use multiple codebooks to improve BEST-RQ training in this study. More precisely, we use $N$ softmax layers to produce $N$ probability predictions from the output of the encoder to compare against $N$ independent quantization targets obtained from the masked speech features. We train the network with equal weights for each softmax layer. The use of multiple codebooks improves the stability and convergence of the model.
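The quantization and multi-softmax loss described above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the implementation used in this report: the class and function names, the Gaussian initialization, and all dimensions are assumptions made for the example.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors to unit length so that a dot product is a cosine similarity."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


class RandomProjectionQuantizer:
    """Frozen random projections and codebooks; nothing here is ever trained."""

    def __init__(self, feat_dim, code_dim, num_codes, num_codebooks, seed=0):
        rng = np.random.default_rng(seed)
        # One independent projection matrix and codebook per quantization target.
        # Gaussian initialization is an assumption of this sketch.
        self.proj = rng.normal(size=(num_codebooks, feat_dim, code_dim))
        self.codebooks = l2_normalize(
            rng.normal(size=(num_codebooks, num_codes, code_dim)))

    def __call__(self, feats):
        """feats: (T, feat_dim) -> integer labels of shape (num_codebooks, T)."""
        projected = l2_normalize(np.einsum("td,ndk->ntk", feats, self.proj))
        # Cosine similarity between every projected frame and every code vector.
        sims = np.einsum("ntk,nck->ntc", projected, self.codebooks)
        return sims.argmax(axis=-1)


def multi_softmax_loss(logits, labels, mask):
    """Equal-weight average of N cross-entropy losses over masked frames.

    logits: (N, T, C) outputs of the N softmax layers
    labels: (N, T) quantization targets
    mask:   (T,) boolean, True where the input frame was masked
    """
    logits = logits - logits.max(axis=-1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    nll = -np.take_along_axis(log_probs, labels[..., None], axis=-1)[..., 0]
    per_codebook = (nll * mask).sum(axis=-1) / max(mask.sum(), 1)
    return per_codebook.mean()
```

Because the projections and codebooks are fixed at initialization, the targets are a deterministic function of the input features, which is what removes the need for a learnt quantization module.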

3 Self-training: Noisy Student Training

We utilize noisy student training (NST) to generate pseudo-labeled data to augment supervised training. This is done by first training a teacher model with augmentation on a supervised set, then using that teacher to generate transcripts for unlabeled audio data. A heuristic filtering method based on the ratio between the number of words and audio length is used to filter the pseudo-labeled data. The pseudo-labeled data is mixed with supervised data to train the student model.
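The filtering heuristic can be illustrated as follows. The report does not state the thresholds, so the `min_wps` and `max_wps` bounds below are hypothetical placeholders, as are the class and function names.

```python
from dataclasses import dataclass


@dataclass
class PseudoLabeled:
    transcript: str      # teacher-generated transcript
    duration_sec: float  # length of the audio segment in seconds


def words_per_second(utt):
    """Ratio between the number of words and the audio length."""
    return len(utt.transcript.split()) / utt.duration_sec


def filter_pseudo_labels(utts, min_wps=0.5, max_wps=5.0):
    """Keep utterances whose word rate is plausible for speech.

    The bounds are hypothetical: too few words per second suggests the teacher
    deleted speech, too many suggests spurious insertions.
    """
    return [
        u for u in utts
        if u.duration_sec > 0 and min_wps <= words_per_second(u) <= max_wps
    ]
```

The surviving utterances would then be mixed with the supervised data to train the student model.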

4 Chunk-wise Attention for Long-form ASR

In many real-world applications, ASR systems are required to transcribe minutes- or hours-long audio. This poses significant challenges to many end-to-end ASR systems, as these systems are usually trained on much shorter segments, typically less than 30 seconds. For systems that use attention-based encoders, it is impractical to use global attention to attend to the entire audio. Local self attention, which only attends to a fixed length of left and right context, is thus widely used. For example, in BEST-RQ pre-training, only 128 left and 128 right context frames are used for local self attention. However, stacking many local self attention layers creates a significant receptive field mismatch between training and inference. The left figure in Fig. 4 illustrates this issue with a network consisting of 4 local self attention layers, each using only 1 left and 1 right context frame. Since the context is leaked in every layer, the receptive field width grows linearly with respect to the number of layers; for a big encoder like that of the Conformer-2B, this means that the receptive field width for the encoder output is longer than 327 seconds. During training, the model sees speech segments of at most 30 seconds, while at inference time, when minutes- or hours-long audio is fed to the model, the encoder needs to process over 300 seconds of audio to produce one encoder output, a pattern it has never been trained on. Our empirical observations demonstrate that, under this train-test mismatch, these models with deep architectures and high capacity suffer from high deletion errors. We henceforth refer to this problem as the “long-form (performance) degradation” problem.
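The 327-second figure can be reproduced with a short calculation under two assumptions that are not stated in this section: that the Conformer-2B encoder has 32 layers, and that each encoder frame covers 40 ms of audio. Only the attention layers are counted here; convolution layers would widen the receptive field slightly further.

```latex
% L layers of local self attention, each with l left and r right context frames:
\[ R = L\,(l + r) + 1 \quad \text{frames} \]
% Toy network of Fig. 4 (L = 4, l = r = 1):
\[ R = 4 \cdot 2 + 1 = 9 \ \text{frames} \]
% Assumed Conformer-2B setting (L = 32, l = r = 128, 40 ms per frame):
\[ R = 32 \cdot 256 + 1 = 8193 \ \text{frames} \approx 8193 \times 0.04\,\text{s} \approx 327.7\,\text{s} \]
```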

To solve this problem, we propose a simple modification to the attention mechanism; the attention is restricted to audio chunks. This is illustrated on the right side of Fig. 4, in which 8 frames are divided into 2 chunks, and the attention is performed within each chunk. In this case, there is no context leaking in the attention layer, and thus the receptive field width is independent of the number of layers. In our experiments an 8-second chunk resulted in the best recognition quality vs. computational cost trade-off.
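The difference between the two attention patterns can be checked numerically on the 8-frame toy example. The sketch below builds boolean attention masks and propagates reachability through stacked layers; the function names are illustrative, and only attention is modelled (no convolution layers).

```python
import numpy as np


def local_attention_mask(num_frames, left, right):
    """mask[i, j] is True if query frame i may attend to key frame j."""
    idx = np.arange(num_frames)
    offset = idx[None, :] - idx[:, None]  # key index minus query index
    return (offset >= -left) & (offset <= right)


def chunk_attention_mask(num_frames, chunk_size):
    """Attention is allowed only between frames of the same chunk."""
    chunk_id = np.arange(num_frames) // chunk_size
    return chunk_id[:, None] == chunk_id[None, :]


def receptive_field(mask, num_layers):
    """Number of input frames each output frame depends on after stacking layers."""
    reach = np.eye(mask.shape[0], dtype=bool)
    for _ in range(num_layers):
        # Frame i sees whatever the frames it attends to could already see.
        reach = (mask.astype(int) @ reach.astype(int)) > 0
    return reach.sum(axis=1)


local = receptive_field(local_attention_mask(8, left=1, right=1), num_layers=4)
chunked = receptive_field(chunk_attention_mask(8, chunk_size=4), num_layers=4)
print(local.tolist())    # [5, 6, 7, 8, 8, 7, 6, 5]
print(chunked.tolist())  # [4, 4, 4, 4, 4, 4, 4, 4]
```

With local attention the receptive field is limited only by the 8-frame sequence length after 4 layers, whereas with chunk-wise attention it stays at the chunk size no matter how many layers are stacked.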

It is worthwhile to note that there are a few other works in the literature which also modify the attention pattern to deal with long-form audio in ASR. Though conceptually similar to block processing, chunk-wise attention is more flexible. Block processing is performed at the input feature level, which limits the encoder layers to the context frames of the current chunk. On the other hand, chunk-wise attention allows other layers in the encoder (e.g., convolution layers) to process contextual frames beyond the current chunk. Compared with Whisper, which segments the audio into 30-second chunks and uses a heuristic process to carry the decoder states over, we only chunk the attention state, and allow the decoder to access the entire encoder output. We also use either a CTC or RNN-T decoder to decode long-form audio; unlike attention-based sequence-to-sequence decoders, neither has been observed to hallucinate. We observe that our systems are robust on long-form ASR tasks while using a simpler decoding process on long-form speech signals.

5 Multi-Objective Supervised Pre-training: BEST-RQ + text-injection

In addition to pre-training with unlabeled speech, we add an additional stage of Multi-Objective Supervised pre-Training (MOST) as shown in Fig. 5, where we train the model jointly on unlabeled speech, unlabeled text and paired speech and text data. The training loss for this procedure is based on the text-injection loss, including duration modeling and consistency regularization as in prior text-injection work, to which we add a weighted BEST-RQ loss for the encoder of the model. MOST yields two benefits: (i) Training with paired speech and text data with alignment losses results in learning speech representations that are better aligned with text, improving quality on tasks like ASR and AST that require mapping the acoustics of the speech signal to text. (ii) Training simultaneously on unlabeled text in a model that learns speech and text representations jointly improves the robustness of learned representations, especially on low resource languages and domains, also generalizing to new languages with no paired data seen during training.

The key architectural components for constructing the text-injection loss as utilized in our approach include: (i) A speech-only encoder that utilizes a convolutional sub-sampling feature encoder and a single conformer layer. For continued pre-training the feature encoder is initialized from the BEST-RQ pre-trained checkpoint while the conformer layer is initialized randomly. (ii) A text-only encoder that consists of an embedding layer, an upsampler, and a conformer layer block. The upsampler used in this work is a learned duration-based upsampling model, though a fixed or random repetition upsampler can also be used for text-injection. All components are initialized randomly. (iii) A shared conformer encoder initialized from the pre-trained BEST-RQ speech encoder. (iv) The BEST-RQ speech softmax layers initialized from the BEST-RQ checkpoint. (v) The decoder unit which is initialized randomly.

The main idea of text-injection is to produce joint, co-aligned embeddings of speech and text as sequences in the same embedding space. Given this embedding space, text data with no associated audio can contribute to improving the speech task. The speech and text encoders presented above are intended to produce these embeddings, which need to be matched in the embedding space and are also required to be co-aligned in the time dimension. The embeddings enable the text data to contribute to preparing the model for downstream tasks.

To achieve these objectives, the architecture as presented above is trained using three types of data, each contributing to different types of losses:

The unlabeled speech passes through the shared encoder and the BEST-RQ softmax layers to contribute to the BEST-RQ loss.

The paired speech-text data serves multiple functions.

The labeled speech flows through the speech encoder, the shared encoder and the decoder unit and contributes to the standard ASR loss computed against the paired text. Here, the speech-text alignments of the paired data are extracted from the decoder unit and used to train the duration upsampler within the text encoder.

The text of the paired data also passes through the text encoder. The encoded text sequence is used to compute a consistency loss against the encoded speech sequence. This loss is used to train solely the text encoder—the speech encoder weights are frozen for this particular forward-propagation.

The unlabeled text data contributes to a reconstruction loss. This loss is constructed by passing the text through the text encoder, then masking chunks of the feature sequence produced. These masked text features live in the same embedding space as masked speech features, and thus can be passed through the shared encoder and the decoder unit to compute the ASR loss against the original text. This is the reconstruction loss used to train the model.

For training stability, MOST proceeds in two stages: we first train solely on paired data for 20k steps to learn stable decoder alignments. We then train the duration upsampler and activate the losses for unlabeled text. Further details can be found in prior work on text-injection.
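As a rough sketch, the staged objective can be written as a weighted sum in which the text-dependent terms are switched on after a warm-up period. The loss weights are not given in this report, so the `weights` dictionary is hypothetical. Which terms are active during the first stage is also an assumption: here the BEST-RQ and ASR losses are always on, and the consistency and reconstruction losses, which depend on the text encoder, start at step 20k.

```python
def most_loss(step, losses, weights, text_start_step=20_000):
    """Combine the MOST loss terms for one training step.

    losses / weights: dicts keyed by 'best_rq', 'asr', 'consistency',
    'reconstruction'. The weight values are hypothetical placeholders.
    """
    # Terms assumed active from the start: unlabeled speech and paired data.
    total = weights["best_rq"] * losses["best_rq"] + weights["asr"] * losses["asr"]
    if step >= text_start_step:
        # Terms that rely on the text encoder and the duration upsampler.
        total += weights["consistency"] * losses["consistency"]
        total += weights["reconstruction"] * losses["reconstruction"]
    return total
```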

When fine-tuning for ASR, we initialize the feature encoder of the ASR model with the speech feature encoder, initialize the conformer block with the shared conformer encoder, and add a randomly initialized task-specific transducer.

In the MOST set-up, the speech and text representations live in a shared representation space, thereby allowing us to utilize text machine translation (MT) data during the fine-tuning stage of AST tasks. We follow the joint fine-tuning approach of prior work and report the AST results with joint fine-tuning for models prepared with MOST.

6 Residual Adaptation with a Frozen Encoder

Ideally, the fine-tuning process of the model should scale with the number of downstream tasks; in reality, however, fine-tuning the pre-trained USM individually for various domains and tasks becomes prohibitively expensive. To mitigate this issue, we explore a lightweight alternative to training the full network, in which residual adapters with a small number of parameters are added for each individual language while the pre-trained USM is entirely frozen during fine-tuning. We experiment with adding two parallel adapters to each Conformer block, whose parameter count amounts to 2% of the original pre-trained USM, and fine-tune the adapters on downstream language tasks. When serving the model, the adapter is dynamically loaded according to the language of the input batch. This enables one to conduct inference on 100+ languages while keeping the total number of parameters manageable by re-using the same parameters and computation process for the majority of the time. We also find that training the adapter instead of fine-tuning the entire model can reduce over-fitting, especially when the training data is limited.
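A residual adapter with per-language loading might look like the following NumPy sketch. It is a simplification under stated assumptions: a single bottleneck adapter is shown instead of the two parallel adapters per Conformer block, the bottleneck width and the zero initialization of the up-projection are choices made for the example, and the class names are hypothetical.

```python
import numpy as np


class ResidualAdapter:
    """Bottleneck MLP added residually to the output of a frozen block."""

    def __init__(self, model_dim, bottleneck_dim, rng):
        self.w_down = rng.normal(scale=0.02, size=(model_dim, bottleneck_dim))
        self.b_down = np.zeros(bottleneck_dim)
        # Zero-initialized up-projection (an assumption of this sketch), so an
        # untrained adapter leaves the frozen model's output unchanged.
        self.w_up = np.zeros((bottleneck_dim, model_dim))
        self.b_up = np.zeros(model_dim)

    def __call__(self, x):
        hidden = np.maximum(x @ self.w_down + self.b_down, 0.0)  # ReLU
        return x + hidden @ self.w_up + self.b_up

    def num_params(self):
        return sum(p.size for p in (self.w_down, self.b_down, self.w_up, self.b_up))


class AdapterBank:
    """Holds one adapter per language and selects it for each input batch."""

    def __init__(self, languages, model_dim, bottleneck_dim, seed=0):
        rng = np.random.default_rng(seed)
        self.adapters = {
            lang: ResidualAdapter(model_dim, bottleneck_dim, rng)
            for lang in languages
        }

    def apply(self, frozen_block_output, language):
        return self.adapters[language](frozen_block_output)
```

Only the small adapter weights differ between languages, so the frozen encoder parameters are shared across all of them at serving time.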

7 Training Details

Data Processing: The audio is uniformly sampled to 16 kHz quality—any audio with a different native sampling rate is either up-sampled or down-sampled. The audio is then featurized into 128-dimensional log-mel filterbank coefficients. Graphemes are used to tokenize the text for FLEURS in-domain fine-tuning, while word-piece models (WPMs) are used for tokenization for all other tasks.

BEST-RQ: We follow the default masking and quantization parameters of BEST-RQ. We use a 16-codebook multi-softmax loss to stabilize training and improve performance as described in 5.1. We do not use EMA for pre-training.

MOST: We follow the text encoder and decoder architecture of prior text-injection work but use 4k sentence-piece models (SPMs). We use a single 1536-dimensional Conformer layer as the speech encoder and the Conformer-2B encoder as the shared encoder. We mix un-transcribed speech, unspoken text, and transcribed speech in each batch with fixed batch sizes of, respectively, 4096, 8192, and 1024. The model is initialized with the BEST-RQ pre-trained encoder. MOST employs a curriculum learning schedule where training is initially conducted with un-transcribed speech and paired speech-text data, and unspoken text is utilized only after 20k steps. The joint training employing all three types of data lasts for another 100k steps.

Supervised Training: We use two separate optimizers for the encoder parameters and the decoder parameters of the network. For USM-CTC and USM-LAS, we train the model for 100k steps with a batch size of 2048. For in-domain experiments, the checkpoint is selected based on development set performance.

Training Large Models: We use the GShard framework with the GSPMD backend to train our large models on TPUs.

Datasets

The following audio datasets are used in this report to train our models:

YT-SUP: 90k hours of segmented, labeled audio across 75 languages.

YT-Pseudo-Labeled: 100k hours of segmented, pseudo-labeled en-US audio from YT-NTL-U. The pseudo-labels are generated by a 600M CTC model trained on YT-SUP en-US data.

YouTube Next Thousand Languages Unsupervised (YT-NTL-U): 12.1M hours of segmented, unlabeled audio, including:

YT-55-U: 12M hours of segmented, unlabeled audio on 55 rich resource languages identified by YouTube production language id models.

YT-513-U: 100k hours of segmented, unlabeled audio across 513 tail languages not covered by YouTube production language id models. These languages are identified by vendors.

Let us expand upon how each dataset has been constructed.

YT-SUP+: YT-SUP is a dataset with audio from videos that have user-uploaded transcripts from 75 languages. We group consecutive segments into a longer unit, similar to prior work. The maximal sequence length for training is 30 seconds. The total amount of training data is 90k hours, with per-language amounts ranging from 3.5k hours for English (en-US) to 150 hours for Amharic (Am-Et). We also introduce an additional 100k hours of en-US audio from YT-NTL-U to YT-SUP. We choose to generate pseudo-labels on this dataset using a 600M-parameter CTC YT teacher model trained on YT-SUP. Each audio is randomly segmented into pieces of between 5 and 15 seconds.

YT-55-U: YT-55-U is built by first randomly collecting 3 million hours of audio from "speech-heavy" YouTube videos, filtered by language. The 3 million hours of audio is then further segmented by the YT teacher model. Instead of using a teacher model as in prior work, the non-speech segments identified by a Voice Activity Detection (VAD) model are removed to yield approximately 1 million hours of unlabeled audio data. Later, we use a YouTube production language identification model to select 55 languages from that audio.

YT-513-U: We create an additional dataset called YT-513-U to ensure coverage of lower resource languages in our pre-training dataset. We reached out to vendors and native speakers to identify YT videos containing speech in specific long tail languages, collecting a dataset of unlabeled speech in 513 languages. Vendors were tasked with ensuring a variety of domains, voices, and content in the videos that are collected in each language. These videos are segmented into speech segments using a VAD model, resulting in a total of 102k hours of speech. Our final YT-513-U dataset contains 88 languages with over 500 hours of speech each, 237 languages with between 100 and 500 hours, and 188 languages with less than 100 hours of data. The languages chosen for this collection are wide-ranging, with a majority of our data corresponding to languages from South Asia, Southeast Asia, West Africa, and East Africa. The distributions of video categories and lengths in our dataset are depicted in Figure 6.

In addition to YouTube data, we also include public data for MOST training:

Public Unsupervised (Pub-U): Following prior work, we use approximately 429k hours of unlabeled speech data in 51 languages. It includes: 372k hours of speech data spanning 23 languages from VoxPopuli, read speech data in 25 languages drawn from the v6.1 release of Common Voice, 50k hours of read books data in eight European languages from Multilingual LibriSpeech, and 1k hours of telephonic conversation data spanning 17 African and Asian languages from BABEL.

Public Supervised (Pub-S): Similar to prior work, our public supervised set includes approximately 1.3k hours of speech and transcript data spanning 14 languages from VoxPopuli, 10 hour training splits for each of the 8 MLS languages, and 1k hours of data spanning 17 languages from the Babel ASR task.

Note that the public data is only used for in-domain pre-training and is excluded for training the generic USM-LAS/CTC models. This allows us to treat the public task performance as out-of-domain benchmarks for the USM-LAS/CTC models.

2 Text Data

Web-NTL: For pre-training with unlabeled text, we use a web-crawled corpus of monolingual text containing over 28B sentences. The dataset spans 1140 languages, 205 of which have over 1M sentences and 199 of which have between 100k and 1M sentences. We up-sample lower resource languages using temperature-based sampling with $T=3.0$. More details about the dataset and the mining approach have been described in Section 2 of the corresponding reference.
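Temperature-based sampling is commonly formulated as drawing language $i$ with probability proportional to $(n_i / \sum_j n_j)^{1/T}$, where $n_i$ is its sentence count. The sketch below assumes this common form; the exact formula used for Web-NTL is not spelled out in this section, and the function name and counts are illustrative.

```python
import numpy as np


def temperature_sampling_probs(sentence_counts, temperature=3.0):
    """Sampling probability per language, flattened by the temperature.

    temperature=1 reproduces the empirical distribution; larger values
    up-sample lower resource languages.
    """
    counts = np.asarray(sentence_counts, dtype=float)
    scaled = (counts / counts.sum()) ** (1.0 / temperature)
    return scaled / scaled.sum()


# Two hypothetical languages with 1B and 1M sentences.
print(temperature_sampling_probs([1e9, 1e6], temperature=1.0))
print(temperature_sampling_probs([1e9, 1e6], temperature=3.0))
```

In this example a 1000:1 imbalance in data size becomes a 10:1 imbalance in sampling probability at $T=3.0$, since $1000^{1/3} = 10$.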

3 Downstream Benchmarks

We present our results on two public tasks, SpeechStew and FLEURS, and an internal benchmark on YouTube.

The SpeechStew dataset is assembled by putting together seven public speech corpora: AMI, Common Voice, English Broadcast News (Linguistic Data Consortium (LDC) datasets LDC97S44, LDC97T22, LDC98S71 and LDC98T28), LibriSpeech, Switchboard/Fisher (LDC datasets LDC2004T19, LDC2005T19, LDC2004S13, LDC2005S13 and LDC97S62), TED-LIUM v3 and Wall Street Journal (LDC datasets LDC93S6B and LDC94S13B), which are all standard benchmarks covering different domains in en-US.

The FLEURS dataset is a publicly available, multi-way parallel dataset of 10 hours of read speech in 102 languages spanning 7 geo-groups. We restrict our use of the dataset to its ASR benchmark. Among the 102 languages present in the FLEURS benchmark, we select 62 to serve as a sub-group to compare our generic ASR system with Whisper, as those languages are covered by the training sets of both models. We also report full results for in-domain fine-tuning and adaptation. Unlike prior work, we report both WER and CER metrics, as CER is inappropriate as an indicator of performance for some languages. When presenting the error rate metrics, we use CER for Chinese, Japanese, Thai, Lao, and Burmese to be consistent with Whisper.
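For reference, both metrics are edit-distance ratios, computed over words for WER and over characters for CER. The sketch below is a plain illustration, not the scoring tool used for the reported numbers: it applies no text normalization, assumes a non-empty reference, strips spaces before computing CER (an assumption), and the `error_rate` helper and language names are hypothetical.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution or match
            ))
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    ref_chars = list(reference.replace(" ", ""))
    hyp_chars = list(hypothesis.replace(" ", ""))
    return edit_distance(ref_chars, hyp_chars) / len(ref_chars)


# Languages scored with CER, following the convention stated above.
CER_LANGUAGES = {"Chinese", "Japanese", "Thai", "Lao", "Burmese"}


def error_rate(language, reference, hypothesis):
    metric = cer if language in CER_LANGUAGES else wer
    return metric(reference, hypothesis)
```

Because insertions count as errors, a hypothesis much longer than its reference can yield an error rate above 100%; for example, `wer("hello", "hello hello hello")` is 2.0.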

The test set for the YouTube domain consists of utterances from 73 languages with an average of 15 hours of audio per language, the audio length for each individual language ranging from 1 to 24 hours. The audio is transcribed manually from popular YouTube videos, each with a duration of up to 30 minutes.

3.2 Speech Translation (AST)

Following prior work, we use CoVoST 2 to benchmark multilingual speech translation. We evaluate the multilingual XX-to-English task that covers translation from 21 source languages into English. Depending on the language, the training data ranges in size from 1 to 264 hours.

Besides speech translation data, we also add text-to-text translation data for training the model, as in prior work. This dataset includes the text translation data from CoVoST 2 combined with all data from either WMT or TED Talks, as available.

Key Results

In this section, we compare the performance of our models against public baselines, including Whisper large-v2 (evaluated using the Whisper large-v2 release on GitHub, https://github.com/openai/whisper.git, revision b4308c4), which has been trained on 680k hours of weakly supervised data across 100 languages.

For the massively multilingual speech recognition test dataset from YouTube, we observe that Whisper hallucinates in many languages, resulting in a WER exceeding 100%. For a reasonable comparison, we restrict the language set on which we compare the performance of USM against Whisper by first selecting the top-25 languages from the training data for Whisper and further excluding languages for which Whisper produces > 40% WER. We also use segmented decoding for Whisper with 30-second segments to further reduce the effect of hallucinations. As shown in Table 3, our USM-LAS and USM-CTC models outperform Whisper by a wide margin on YouTube en-US, despite training on significantly less supervised data (3.5k hours versus Whisper’s 400k hours). While the USM-LAS model also requires segmented decoding to reduce long-form degradation as discussed in section 2.4, it is far more robust, outperforming Whisper by a relative 30% WER on those 18 languages. USM-CTC does not exhibit long-form performance degradation and achieves the best performance on YouTube.
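The language selection and the relative comparison described above can be sketched as follows. This is an illustrative sketch only: the function names, the dictionary layout, and the toy numbers in the usage note are hypothetical and are not taken from our evaluation pipeline.

```python
def select_comparison_languages(train_hours, baseline_wer, top_k=25, max_wer=40.0):
    """Keep the top_k languages by baseline training hours, then drop
    those on which the baseline WER (in percent) exceeds max_wer."""
    top = sorted(train_hours, key=train_hours.get, reverse=True)[:top_k]
    return [lang for lang in top if baseline_wer[lang] <= max_wer]


def relative_wer_reduction(baseline_wer, system_wer):
    """Relative WER reduction of a system over a baseline, in percent."""
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return 100.0 * (baseline_wer - system_wer) / baseline_wer
```

For example, with hypothetical inputs `train_hours = {"a": 100, "b": 50, "c": 10}` and `baseline_wer = {"a": 20.0, "b": 120.0, "c": 5.0}`, calling `select_comparison_languages(train_hours, baseline_wer, top_k=2)` keeps only `"a"`: language `"c"` falls outside the top 2, and `"b"` is excluded by the WER threshold.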

On the out-of-domain long-form CORAAL set, both USM-CTC and USM-LAS outperform Whisper by more than 10% relative WER. USM-CTC and USM-LAS similarly outperform Whisper on SpeechStew, whose training data the models have not had access to.

We further compare the multilingual performance of the models on the held-out set from FLEURS. As shown in Table 3, USM-LAS and USM-CTC both outperform Whisper by 66% relative WER, despite using a smaller amount of multilingual supervised data (90k versus Whisper’s 117k, when en-US is excluded). USM-LAS consistently outperforms USM-CTC for short-form ASR tasks.

4.2 Massively Multilingual Results Beyond 100 Languages

The lower part of Table 3 shows our results for in-domain fine-tuning. Our pre-trained model improves the FLEURS benchmark significantly, even when using only 10 hours per language. Compared to the previous SoTA in , our model achieves a 30% relative improvement in terms of WER across 102 languages. Our results show that while generic speech models can be powerful, performance is still maximized by in-domain fine-tuning.

4.3 MOST Produces Robust Representations that Generalize to New Domains

MOST training aligns the representations of speech and text by training simultaneously on the two modalities. We investigate whether MOST representations are useful for adapting the model to new domains by freezing the entire learned encoder produced by MOST and adjusting a small number of parameters added to the network by residual adapters. As shown in Table 3, by adding only 2% to the total number of parameters, the MOST representation model (USM-M-adapter) performs only slightly worse than the fine-tuning baselines, still showing competitive performance on downstream ASR and AST tasks. The small number of parameters being trained in this approach makes it feasible to extend our system to a large number of new domains and new tasks, even with a limited amount of training data, such as in FLEURS.
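To make the adapter idea concrete, the following is a minimal sketch of a bottleneck residual adapter applied to a single feature vector. It is not the adapter implementation used in our models: the ReLU non-linearity, the zero initialization of the up-projection, the omission of normalization layers, and all names and sizes are assumptions made for illustration.

```python
import random


def matvec(weights, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


class ResidualAdapter:
    """Bottleneck adapter: x + up(relu(down(x))). Only these weights
    would be trained; the surrounding encoder stays frozen."""

    def __init__(self, dim, bottleneck, seed=0):
        rng = random.Random(seed)
        self.down = [[rng.gauss(0.0, 0.02) for _ in range(dim)]
                     for _ in range(bottleneck)]
        # Assumption: zero-initialized up-projection, so the adapter is
        # the identity function before any training.
        self.up = [[0.0] * bottleneck for _ in range(dim)]

    def __call__(self, x):
        hidden = [max(0.0, h) for h in matvec(self.down, x)]
        delta = matvec(self.up, hidden)
        return [xi + di for xi, di in zip(x, delta)]

    def num_params(self):
        return 2 * len(self.down) * len(self.down[0])


def adapter_overhead(frozen_params, adapter_params):
    """Adapter parameters as a fraction of the frozen model size."""
    return adapter_params / frozen_params
```

With the hypothetical sizes `ResidualAdapter(dim=8, bottleneck=2)`, the adapter holds 32 parameters, and `adapter_overhead(1600, 32)` gives 0.02, the same order of overhead as the 2% figure quoted above.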

4.4 Pushing the Quality of ASR on Unseen Languages

Tail languages often do not have paired transcriptions for supervised learning; we refer to these languages as unseen languages, as the model has not seen paired data for these languages during training. To create pseudo-labels for these languages, we first build a USM-LAS-Adapter by attaching residual adapters to USM-LAS and training them using FLEURS data. By using the USM-LAS-Adapter as a teacher, we can now transcribe the unlabeled data in the unseen languages as part of the YT-NTL dataset. As shown in Table 4, we observe consistent wins for all languages on the FLEURS benchmark. For some languages, the improvement is larger than 30%. This further demonstrates the robustness of the USM-LAS model: despite using only 10 hours of out-of-domain data from FLEURS, the USM-LAS-Adapter is able to transcribe YouTube data to produce meaningful recognition results that lead to these improvements. We find the approach of training adapter models on small datasets and utilizing them for pseudo-labeling to be a promising route for scaling up the number of languages that can be transcribed by USMs.
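The pseudo-labeling step can be summarized by the sketch below. It is a schematic illustration, not our pipeline: `teacher` stands for any callable that returns a transcript and a confidence score, the confidence threshold is an optional filter that is not part of the procedure described above, and `toy_teacher` is a hypothetical stand-in that operates on strings rather than audio.

```python
def pseudo_label(unlabeled_audio, teacher, min_confidence=0.0):
    """Transcribe unlabeled utterances with a teacher model and return
    (audio, transcript) pairs usable as supervised training data."""
    pairs = []
    for audio in unlabeled_audio:
        transcript, confidence = teacher(audio)
        # Assumption: empty or low-confidence transcripts are discarded.
        if transcript and confidence >= min_confidence:
            pairs.append((audio, transcript))
    return pairs


def toy_teacher(audio):
    """Hypothetical stand-in for the adapter-equipped teacher model."""
    return audio.upper(), (0.9 if audio else 0.0)
```

Calling `pseudo_label(["hello", "", "hi"], toy_teacher)` returns `[("hello", "HELLO"), ("hi", "HI")]`; the resulting pairs would then be added to the supervised training set of the student model.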

4.5 USMs are Strong AST Models

The multilingual speech translation performance of fine-tuned USMs is shown in Table 3. We find that we are already comparable to the CoVoST 2 SoTA BLEU score by fine-tuning the speech-only USM. We note that the previous SoTA uses 125k hours of supervised speech translation data compared to the 859 hours of data used by the USM. After MOST training, USM-M can use both speech and text as training input. By introducing text-to-text machine translation (MT) data during fine-tuning, USM-M is able to achieve an unprecedented > 30 BLEU on CoVoST (a 1 BLEU increase from SoTA).

Analysis and Ablations

We observe a consistent > 5% relative improvement in ASR and AST benchmarks by increasing the number of softmax groups in the multi-softmax loss for BEST-RQ training from 1 to 16, as shown in Table 5. We also find that using multiple softmax groups significantly reduces performance variation across different pre-training runs and improves convergence speed.
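As an illustration of the multi-softmax loss, the sketch below computes the loss for a single masked frame given one set of logits and one quantization target per softmax group. It is a simplified sketch rather than our training code; in particular, the equal weighting of the groups and the function names are assumptions of this example.

```python
import math


def cross_entropy(logits, target):
    """Negative log-probability of `target` under softmax(logits)."""
    peak = max(logits)  # subtract the max for numerical stability
    log_z = peak + math.log(sum(math.exp(v - peak) for v in logits))
    return log_z - logits[target]


def multi_softmax_loss(group_logits, group_targets):
    """Average the cross-entropy over all softmax groups.
    Assumption: every group contributes with equal weight."""
    if len(group_logits) != len(group_targets):
        raise ValueError("need exactly one target per softmax group")
    losses = [cross_entropy(logits, target)
              for logits, target in zip(group_logits, group_targets)]
    return sum(losses) / len(losses)
```

With a single group this reduces to the ordinary masked-prediction cross-entropy; with 16 groups, each frame is scored against 16 independent targets and the per-group losses are averaged.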

5.2 Model and Language Scaling

We find that scaling up the model size and increasing the language coverage of the pre-training dataset greatly benefit the performance of the USMs, as demonstrated in Table 5. In particular, we find a 10% relative improvement of ASR and AST performance by using YT-NTL vs. YT-55 for pre-training, despite the fact that each newly added language in YT-NTL contains only approximately 500 hours of speech, a relatively small amount. As could be expected, the relative gains on the newly covered languages are more substantial than those on other languages.

5.3 BEST-RQ is a Scalable Self-supervised Learner

BEST-RQ has been shown to outperform or be comparable to other prominent pre-training methods for speech recognition, including wav2vec 2.0 and W2v-BERT in the original work in which it was introduced . Here we investigate its comparative performance and scaling properties, similar to what has been done for wav2vec 2.0 in and W2v-BERT in . We utilize the set-up of pre-training the model using YT-55 and fine-tuning it on CoVoST 2. As shown in Table 6, our results indicate that for the Conformer-0.6B, W2v-BERT and BEST-RQ perform similarly, but BEST-RQ obtains greater gains when scaled up. A contributing factor to this can be that W2v-BERT is more prone to codebook collapse and training instabilities at the 2B scale, while BEST-RQ by construction doesn’t suffer from codebook collapse.
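The reason BEST-RQ avoids codebook collapse can be seen from how its targets are produced: the projection matrix and the codebook are randomly initialized and never updated, so there is no learned quantizer that could collapse. The following is a minimal sketch of such a random-projection quantizer for one feature vector. The Gaussian initialization, the dimensions, and the class name are assumptions of this example and may differ from the original BEST-RQ recipe.

```python
import math
import random


def l2_normalize(vec):
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else list(vec)


class RandomProjectionQuantizer:
    """Maps a feature vector to the index of its nearest codebook entry.
    Both the projection and the codebook are frozen after initialization."""

    def __init__(self, input_dim, code_dim, codebook_size, seed=0):
        rng = random.Random(seed)
        # Assumption: standard normal initialization for both matrices.
        self.projection = [[rng.gauss(0.0, 1.0) for _ in range(input_dim)]
                           for _ in range(code_dim)]
        self.codebook = [l2_normalize([rng.gauss(0.0, 1.0) for _ in range(code_dim)])
                         for _ in range(codebook_size)]

    def __call__(self, features):
        projected = l2_normalize([sum(w * x for w, x in zip(row, features))
                                  for row in self.projection])
        distances = [sum((p - c) ** 2 for p, c in zip(projected, code))
                     for code in self.codebook]
        return distances.index(min(distances))
```

A multi-softmax setup would instantiate several such quantizers with different seeds, giving one independent target per softmax group.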

5.4 Chunk-wise attention for robust long-form speech recognition

Fig. 7 depicts the long-form performance degradation issue as described in section 2.4. In the figure, we see that for the shallow Conformer model with 17 layers, using a small local self-attention context length (65), the word error rate measured on the long-form test set gradually improves as the training progresses. With a deeper model that has 48 layers but roughly the same number of parameters, however, the larger receptive field mismatch results in higher test WERs as the training step increases.

Table 7 demonstrates that chunk-wise attention is able to address the long-form degradation issue and shows robust performance across four different languages: en-US (English), ru-RU (Russian), ko-KR (Korean), and uk-UA (Ukrainian). We compare chunk-wise attention models with an 8-second chunk size (CW-8s in Table 7) against local self-attention models which use 128 context frames in each conformer layer (LSA-128). We note that further increasing the context window size of the local self-attention model results in high deletion error rates on all languages of the YouTube long-form test sets. These results show that the chunk-wise attention models do not exhibit long-form performance degradation and are able to improve upon the performance of the local self-attention models operating at the maximum allowed receptive field length.
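The difference between the two attention patterns can be illustrated with boolean attention masks. In the sketch below, a frame may attend only to frames in its own chunk (chunk-wise attention) or to frames within a sliding window (local self-attention), and `receptive_field` counts how many input frames can influence a given frame after stacking several layers. This is an illustrative sketch: the symmetric split of the local context and the toy sizes are assumptions, and chunk sizes are expressed in frames rather than seconds.

```python
def chunk_mask(num_frames, chunk_size):
    """mask[i][j] is True if frame i may attend to frame j (same chunk)."""
    return [[i // chunk_size == j // chunk_size for j in range(num_frames)]
            for i in range(num_frames)]


def local_mask(num_frames, context):
    """Sliding-window mask. Assumption: the context is split evenly
    between left and right neighbours."""
    half = context // 2
    return [[abs(i - j) <= half for j in range(num_frames)]
            for i in range(num_frames)]


def receptive_field(mask, num_layers, frame):
    """Number of input frames that can influence `frame` after
    num_layers attention layers that all use the same mask."""
    reach = {frame}
    for _ in range(num_layers):
        reach = {j for i in reach for j, ok in enumerate(mask[i]) if ok}
    return len(reach)
```

With 8 frames and 3 layers, `receptive_field(local_mask(8, 2), 3, 0)` is 4, and it keeps growing with depth, whereas `receptive_field(chunk_mask(8, 2), 3, 0)` stays at 2 regardless of the number of layers. The receptive field of a chunk-wise model is therefore identical at training and inference time, whatever the length of the audio.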

5.5 TPU Serving Capacity of USM-CTC Models

In section 4, we have demonstrated that USM-CTC models are powerful generic ASR models with reliable long-form transcription performance and excellent generalization properties. Here we measure the serving capacity of the USM-CTC model as represented by the real time factor (RTF) in an ideal setup where we assume that each batch sent to TPU is fully packed along the time axis. The results of these measurements are presented in Table 8. Surprisingly, we find that the 2B-parameter USM-CTC model is only 3.9× slower than the 100M-parameter streaming model, primarily because our models operate in batch processing mode. This result demonstrates that the USM-CTC can be used as an offline transcriber efficiently on TPUs (or GPUs).
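For clarity, the real time factor is the ratio of processing time to audio duration, so lower is faster and values below 1 mean faster than real time. The small sketch below shows the computation and how a relative slowdown between two models follows from it; the function names and the numbers in the usage note are hypothetical and are not the measurements reported in Table 8.

```python
def real_time_factor(processing_seconds, audio_seconds):
    """Processing time divided by the duration of the audio processed."""
    if audio_seconds <= 0:
        raise ValueError("audio duration must be positive")
    return processing_seconds / audio_seconds


def relative_slowdown(rtf_large, rtf_small):
    """How many times slower the larger model is than the smaller one."""
    return rtf_large / rtf_small
```

For instance, a model that needs 3 seconds to transcribe 600 seconds of audio has `real_time_factor(3.0, 600.0) == 0.005`; if a larger model has an RTF of 0.02 on the same audio, it is about 4 times slower.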

Discussion

In this report, we put forward a practical and flexible approach for training speech understanding models capable of scaling speech recognition to hundreds of languages. We conclude the report by summarizing the insights gained in the process:

Unlabeled versus weakly labeled data: We believe diverse unlabeled data is more practical to acquire for building usable ASR for tail languages than weakly labeled data. We have demonstrated that collaborating with native speakers to identify unsupervised data in hundreds of tail languages can be an effective route to improving recognition performance on low resource languages.

In-domain data is best: We have demonstrated that we can build a robust ASR system across many domains by utilizing a large amount of unsupervised data and a small amount of labeled data. Our results, however, also confirm that the most effective way to optimize the performance for a given domain is to use in-domain data to fine-tune the model.

CTC vs RNN-T vs LAS: The best transducer depends on the downstream task. A large pre-trained model with a frozen encoder can allow experimenters to test different transducers quickly and select the optimal transducer for their purpose.

Acknowledgments

We would like to thank Alexis Conneau, Min Ma, Shikhar Bharadwaj, Sid Dalmia, Jiahui Yu, Jian Cheng, Paul Rubenstein, Ye Jia, Justin Snyder, Vincent Tsang, Yuanzhong Xu, Tao Wang, Anusha Ramesh, Calum Barnes, Salem Haykal for useful discussions.

We appreciate valuable feedback and support from Eli Collins, Jeff Dean, Sissie Hsiao, Zoubin Ghahramani. Special thanks to Austin Tarango, Lara Tumeh, and Jason Porta for their guidance around responsible AI practices.

References