A Comparison of Techniques for Language Model Integration in Encoder-Decoder Speech Recognition

Shubham Toshniwal, Anjuli Kannan, Chung-Cheng Chiu, Yonghui Wu, Tara N Sainath, Karen Livescu

Introduction

Attention-based recurrent neural encoder-decoder models provide an elegant end-to-end framework for speech recognition, machine translation, and other sequence transduction tasks. In automatic speech recognition (ASR), the model folds the traditionally separately learned acoustic model, pronunciation model, and language model (LM) into a single network. The encoder maps the input speech to a sequence of higher-level learned features, while the decoder maps these features to output labels with assistance from an attention mechanism that provides an alignment between speech and text. The model can be trained end-to-end and requires only paired speech and text data. Encoder-decoder models for speech recognition have become quite popular recently and perform competitively on a number of ASR tasks.

While end-to-end training offers several advantages, it also restricts the training data to examples that have both input and output sequences, such as the paired speech and text data used in speech recognition. Conventional ASR models leverage a separate LM trained on all available text, which can be orders of magnitude larger than the transcripts of the transcribed audio alone. In contrast, the decoder of an encoder-decoder model is exposed only to the audio transcripts.

Previous work addressing the issue of utilizing unpaired text has proposed ways of integrating an external pretrained LM, trained on all of the text data, with the ASR model. The main LM integration approaches from past work have been referred to as shallow, deep, and cold fusion. The three approaches differ along two important criteria:

Early/late model integration: At what point in the ASR model’s computation should the LM be integrated? In deep and cold fusion, the external LM is fused directly into the ASR model by combining their hidden states, resulting in a single model with tight integration. In contrast, in shallow fusion the LM and ASR model remain separate and only their scores are combined, similarly to an ensemble. The shallow fusion score combination is also similar to the interpolation of acoustic and language models done in traditional ASR.

Early/late training integration: At what point in the ASR model’s training should the LM be integrated? Deep and shallow fusion use a late integration where both the ASR and LM models are trained separately and then combined, while cold fusion uses the external pretrained LM model from the very start of the ASR model training. An important point is that early training integration approaches are computationally costlier if either of the two models is frequently changing.

A thorough comparison between these LM integration techniques is, to the best of our knowledge, currently lacking. In this paper, we compare the three fusion approaches mentioned above on (a) the medium-sized Switchboard data set and (b) the large-scale Google voice search and dictation data sets used in prior work. Our aim is to shed light on how the LM integration approaches compare, as well as how they scale up with data size. We also propose some novel LM integration approaches and compare them against the three prior fusion approaches on Switchboard.

Our results show that almost all of the LM integration approaches improve over a baseline encoder-decoder model for all data sets, confirming the benefit of utilizing unpaired text. We also make several other findings: (a) the rather simple approach of shallow fusion works best for first-pass decoding on all of our data sets, (b) our best proposed approach performs similarly to deep and cold fusion on the Switchboard data set, (c) deep fusion does not scale well, obtaining no or negligible gains over the baseline for the large-scale Google data sets, and (d) cold fusion produces high-quality and diverse beam outputs, resulting in the lowest oracle word error rate on the Google data sets, and edges ahead when coupled with second-pass LM rescoring on Google voice search.

Related Work

Previous work on using unpaired text for encoder-decoder models can be categorized along two major themes:

Using an external language model: This approach consists of training an external LM on the unpaired text and integrating it into the encoder-decoder model, which is the focus of this paper. An early study along these lines was by Gulcehre et al., who proposed the shallow and deep fusion methods in the context of neural machine translation (NMT) models. In that work both shallow and deep fusion improved performance, with deep fusion somewhat outperforming shallow fusion, especially for low-resource language pairs. Another previous work in the context of NMT models, by Ramachandran et al., proposed initializing the lower layer of both the encoder and the decoder with separate pretrained LMs, followed by joint training using both language modeling and machine translation losses. Shallow fusion has largely been the method of choice for ASR, yielding significant performance gains, although in some cases with slight modifications to the decoding objective function. Cold fusion, a modification of deep fusion, was proposed for ASR by Sriram et al. That work found that, on medium-scale data sets of $\sim$300-400K training utterances, cold fusion outperforms deep fusion, especially in a cross-domain setting, but it did not compare with shallow fusion. None of these studies compared all three fusion approaches.

Generating paired data from unpaired text: A second line of research is to use unpaired text to synthetically generate matching input sequences, thus expanding the paired data set. In machine translation this process of generating paired data from monolingual data is referred to as backtranslation (that is, generating source-language text from unpaired target-language text) and has been used in the context of neural machine translation by Sennrich et al. The directly analogous approach for ASR would be to use text-to-speech synthesis to generate speech from unpaired text. The complexity of the text-to-speech (TTS) task means that there has been limited work exploring the use of speech generated from unpaired text, often in limited settings. A workaround of “translating” the text to phoneme sequences and using the resulting paired data in a multitask learning setup has been explored by Renduchintala et al.

Model

Our model is based on the Listen, Attend and Spell (LAS) attention-based encoder-decoder ASR model. We begin by reviewing this model, and then describe the techniques we consider for LM integration with the LAS model.

The LAS model consists of three components: an encoder, a decoder, and an attention network, which are trained jointly to predict the output sequence. The transcription can be decoded as a sequence of graphemes/characters, wordpieces, or words from a sequence of acoustic feature frames. Based on the recent success of wordpiece-based models in a variety of ASR and machine translation tasks, we choose wordpieces as the output unit in all of our models.

The encoder consists of a stacked (bidirectional) recurrent neural network (RNN) which reads in acoustic features $\bm{{x}}=(\bm{{x}}_{1},\cdots,\bm{{x}}_{T})$ and outputs a sequence of high-level features $\bm{{h}}$. The sequence of high-level features $\bm{{h}}$ can either be the same length as the acoustic feature sequence or be downsampled if a pyramidal structure is used.

The decoder is a stacked unidirectional RNN that computes the probability of a sequence of output units $\bm{{y}}$ as follows:
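In the notation above, the usual autoregressive factorization for such a decoder reads as below. This is a sketch of the standard form; the output length $U$ is a symbol introduced here for illustration and is not taken from the text.

```latex
\begin{equation*}
P(\bm{y} \mid \bm{x}) \;=\; P(\bm{y} \mid \bm{h}) \;=\; \prod_{t=1}^{U} P\!\left(y_{t} \mid \bm{h}, \bm{y}_{<t}\right)
\end{equation*}
```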

At every time step $t$ , the conditional dependence of the output on the encoder features $\bm{{h}}$ is calculated via the attention mechanism. The attention mechanism, which is a function of the current decoder hidden state and the encoder features, condenses the encoder features into a context vector $\bm{{c}}_{t}$ via the following mechanism:

where the vectors $\bm{{v}},\bm{{b_{a}}}$ and the matrices $\bm{{W_{h}}},\bm{{W_{d}}}$ are learnable parameters; $\bm{{d}}_{t}$ is the hidden state of the decoder at time step $t$ .
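The parameters named above match the usual additive (tanh) attention. The following NumPy sketch assumes that form: energies $u_{it}=\bm{{v}}^{\top}\tanh(\bm{{W_{h}}}\bm{{h}}_{i}+\bm{{W_{d}}}\bm{{d}}_{t}+\bm{{b_{a}}})$, a softmax over encoder positions, and a weighted sum of encoder features. The function name, toy dimensions, and random inputs are illustrative assumptions.

```python
import numpy as np

def additive_attention(h, d_t, W_h, W_d, v, b_a):
    """Condense encoder features h (T x H) into a context vector c_t.

    Assumed form:
        u_i   = v^T tanh(W_h h_i + W_d d_t + b_a)
        alpha = softmax(u)
        c_t   = sum_i alpha_i h_i
    """
    # Energies: one scalar per encoder position.
    u = np.tanh(h @ W_h.T + d_t @ W_d.T + b_a) @ v   # shape (T,)
    # Numerically stable softmax over encoder positions.
    alpha = np.exp(u - u.max())
    alpha /= alpha.sum()
    # Context vector: attention-weighted average of encoder features.
    c_t = alpha @ h                                   # shape (H,)
    return c_t, alpha

# Toy sizes: frames, encoder dim, decoder dim, attention dim.
rng = np.random.default_rng(0)
T, H, D, A = 6, 4, 3, 5
h = rng.normal(size=(T, H))
d_t = rng.normal(size=D)
W_h, W_d = rng.normal(size=(A, H)), rng.normal(size=(A, D))
v, b_a = rng.normal(size=A), np.zeros(A)
c_t, alpha = additive_attention(h, d_t, W_h, W_d, v, b_a)
```

The attention weights `alpha` form a distribution over the encoder positions, so `c_t` always lies in the convex hull of the encoder features.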

The hidden state of the decoder, $\bm{{d}}_{t}$ , which captures the previous output context $\bm{{y_{<t}}}$ , is given by:

where $\bm{{W_{\text{s}}}}$ and $\bm{{b_{\text{s}}}}$ are again learnable parameters. The model is trained to minimize the discriminative loss:
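The symbols named above fit the usual LAS decoder recurrence, output layer, and negative log-likelihood objective. A reconstruction under that assumption is sketched below; the embedding $\tilde{\bm{y}}_{t-1}$ of the previous label, the exact inputs to the RNN, and the name $\mathcal{L}_{\text{LAS}}$ are assumptions rather than details taken from the text.

```latex
\begin{align*}
\bm{d}_{t} &= \mathrm{RNN}\!\left(\tilde{\bm{y}}_{t-1},\, \bm{d}_{t-1},\, \bm{c}_{t-1}\right) \\
P\!\left(y_{t} \mid \bm{h}, \bm{y}_{<t}\right) &= \mathrm{softmax}\!\left(\bm{W_{\text{s}}}\,[\bm{c}_{t};\, \bm{d}_{t}] + \bm{b_{\text{s}}}\right) \\
\mathcal{L}_{\text{LAS}} &= -\log P(\bm{y} \mid \bm{x})
\end{align*}
```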

3.2 LM Integration Approaches

Below we discuss the various LM integration approaches for encoder-decoder models that we study.

In shallow fusion, the external LM is incorporated via log-linear interpolation at inference time only. So while for the baseline model, beam search is used to approximate the solution for:

in the most basic version of shallow fusion, we instead use the following criterion:

Recently, some additional penalty terms have been introduced into the criterion. For example, Chorowski and Jaitly use a coverage penalty term $c(\bm{{x}},\bm{{y}})$ to ensure that all of the input frames have been “well attended” during decoding.
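As a rough illustration of the basic criterion, log-linear interpolation adds a weighted LM log-probability to the ASR log-probability, i.e. $\log P(\bm{y}\mid\bm{x})+\lambda\log P_{LM}(\bm{y})$. The sketch below applies this to two complete hypotheses; the hypothesis strings and scores are made-up toy values, and `shallow_fusion_score` is a hypothetical helper rather than code from this work.

```python
def shallow_fusion_score(log_p_asr, log_p_lm, lm_weight):
    """Log-linear interpolation: log P(y|x) + lambda * log P_LM(y)."""
    return log_p_asr + lm_weight * log_p_lm

# Toy hypotheses: (text, ASR log-probability, LM log-probability).
hyps = [
    ("recognize speech", -4.0, -6.0),
    ("wreck a nice beach", -3.8, -12.0),
]

def best(hyps, lm_weight):
    """Return the text of the highest-scoring hypothesis."""
    return max(hyps, key=lambda h: shallow_fusion_score(h[1], h[2], lm_weight))[0]

baseline_choice = best(hyps, lm_weight=0.0)  # ASR score only
fused_choice = best(hyps, lm_weight=0.2)     # ASR score plus 0.2 * LM score
```

With these toy numbers the ASR score alone prefers the second hypothesis, while an LM weight of 0.2 flips the decision to the first. In actual decoding the combined score would be applied inside beam search rather than to finished hypotheses.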

3.2.2 Deep Fusion

Like shallow fusion, deep fusion is a late training integration procedure, i.e. it assumes the encoder-decoder and language models to be pretrained. The key difference is that it integrates the external LM into the encoder-decoder model by fusing together the hidden states of the external LM (assuming a neural LM) and the decoder in the following way:

where the scalar $b_{g}$ , vectors $\bm{{v}}_{g}$ and $\bm{{b}}_{DF}$ , and matrix $\bm{{W\!}}_{DF}$ are all learned while keeping all other model parameters fixed. Fixing most of the model parameters reduces the backpropagation computation cost, and the fine-tuning procedure converges quickly in comparison to the cost of training the baseline model.
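The parameters named above suggest the usual deep fusion form: a scalar gate computed from the LM state, a fused state that concatenates the context vector, the decoder state, and the gated LM state, and a softmax output layer. The NumPy sketch below assumes exactly that form; the function name, toy sizes, and random values are illustrative only.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def deep_fusion_step(c_t, d_t, d_lm_t, v_g, b_g, W_DF, b_DF):
    """One output step of deep fusion (assumed form).

        g_t    = sigmoid(v_g^T d_lm_t + b_g)      # coarse, scalar gate
        d_df_t = [c_t; d_t; g_t * d_lm_t]         # fused state
        P(y_t) = softmax(W_DF d_df_t + b_DF)
    """
    g_t = 1.0 / (1.0 + np.exp(-(v_g @ d_lm_t + b_g)))
    d_df_t = np.concatenate([c_t, d_t, g_t * d_lm_t])
    return softmax(W_DF @ d_df_t + b_DF), g_t

# Toy sizes: context dim, decoder dim, LM state dim, vocabulary size.
rng = np.random.default_rng(0)
C, D, L, V = 4, 3, 5, 7
probs, gate = deep_fusion_step(
    c_t=rng.normal(size=C), d_t=rng.normal(size=D), d_lm_t=rng.normal(size=L),
    v_g=rng.normal(size=L), b_g=0.0,
    W_DF=rng.normal(size=(V, C + D + L)), b_DF=np.zeros(V),
)
```

Only `v_g`, `b_g`, `W_DF`, and `b_DF` would be trained in this setting; the states `c_t`, `d_t`, and `d_lm_t` come from the frozen pretrained models.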

3.2.3 Cold Fusion

Cold fusion builds on the idea of deep fusion and proposes a modified LM integration procedure shown below:

where all of the parameters introduced in the above equations are learned. Some of the key differences between cold fusion and deep fusion are:

Cold fusion is an early training integration approach: The encoder-decoder model is trained from scratch with a pretrained external LM (the LM parameters are kept fixed).

Both the LM state $\bm{{s}}^{LM}$ and encoder-decoder model’s state $\bm{{s}}^{ED}_{t}$ are used in gate computation as shown in equation 8.

Cold fusion uses a fine gating mechanism, equation 9, in comparison to a coarse gating mechanism used by deep fusion, equation 4.

As originally proposed, cold fusion uses the LM logits rather than the LM hidden state, in order to allow for flexible LM swapping. That is, $\bm{{d}}^{LM}_{t}$ used in equation 6 refers to the logit scores of the LM rather than the hidden state of the LM in the proposed version of cold fusion. However, in practice, with wordpieces used as output units, the relatively large vocabulary results in a long vector of logits $\bm{{d}}^{LM}_{t}$, which causes an unnecessary increase in the number of parameters (the cold fusion paper experiments with character-level models). In our experiments we are not concerned with the flexibility of swapping LMs. Hence, in our experiments we still set $\bm{{d}}^{LM}_{t}$ to the LM hidden state. Our preliminary experiments with Switchboard suggest a performance gain with this modification.
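The differences listed above can be made concrete with a small NumPy sketch. It assumes the commonly described cold fusion form: project the LM state, project the encoder-decoder state, compute a per-dimension (fine) gate from both, and pass the gated concatenation through a further layer before the softmax. The single ReLU layers standing in for the DNNs, all parameter names in `params`, and the toy sizes are assumptions made for illustration.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    return np.maximum(z, 0.0)

def cold_fusion_step(c_t, d_t, d_lm_t, p):
    """One output step of cold fusion (assumed form, hypothetical parameter names)."""
    s_lm = relu(p["W_lm"] @ d_lm_t + p["b_lm"])                       # projected LM state
    s_ed = p["W_ed"] @ np.concatenate([d_t, c_t]) + p["b_ed"]         # encoder-decoder state
    g = sigmoid(p["W_g"] @ np.concatenate([s_ed, s_lm]) + p["b_g"])   # fine gate: one value per dim
    s_cf = np.concatenate([s_ed, g * s_lm])                           # gated fusion
    r_cf = relu(p["W_r"] @ s_cf + p["b_r"])
    return softmax(p["W_out"] @ r_cf + p["b_out"]), g

# Toy sizes: context, decoder, LM state, projected state, vocabulary.
rng = np.random.default_rng(0)
C, D, L, S, V = 4, 3, 5, 6, 7
shapes = {"W_lm": (S, L), "b_lm": (S,), "W_ed": (S, D + C), "b_ed": (S,),
          "W_g": (S, 2 * S), "b_g": (S,), "W_r": (S, 2 * S), "b_r": (S,),
          "W_out": (V, S), "b_out": (V,)}
params = {name: rng.normal(size=shape) for name, shape in shapes.items()}
probs, gate = cold_fusion_step(rng.normal(size=C), rng.normal(size=D),
                               rng.normal(size=L), params)
```

Unlike the scalar gate of deep fusion, `gate` here is a vector with one entry per dimension of the projected LM state, and it depends on both the LM and the encoder-decoder states.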

Note that, since cold fusion is an early training integration approach, in a dynamic setting with frequent changes of LM and ASR models the approach would be computationally costlier than the previous two fusion approaches, especially shallow fusion.

3.2.4 LM as lower decoder layer

Previous work in machine translation has suggested the utility of using a pretrained LM as a lower layer of the decoder. Similarly, prior work has used a pretrained LM to initialize the decoder in an RNN transducer model for speech recognition. The motivation for this approach is that it can provide better contextualized word embeddings, as is the case with the recently proposed Embeddings from Language Models (ELMo). We propose introducing the external LM as a lower layer in the decoder of a pretrained LAS model. All of the model parameters, including the LM parameters, are fine-tuned for a few epochs with the LAS objective.

3.2.5 LM integration via multitask learning

The decoder can be seen as a conditional LM, conditioned on the encoder features that represent the speech input. The exact dependence of the decoder on the speech features is captured by the context vector $\bm{{c}}_{t}$ which, from equation 2, affects the output posterior distribution as follows:

Now, unpaired text has no corresponding speech signal. In the LAS model, this can be represented by a zero context vector. (Note that an “equivocal” $\bm{{c}}_{t}$ that equally affects all of the logit scores could also work. However, such a vector would depend on $\bm{{W_{\text{s}}}}$, whereas $\bm{{c}}_{t}=\bm{{0}}$ is independent of $\bm{{W_{\text{s}}}}$.) A zero context vector reduces the decoder from a conditional LM to a plain LM, as shown below:

In this way, the decoder can also be used for the task of language modeling. Based on this observation, we propose a multitask learning approach for using the unpaired text, where the decoder of the LAS model is shared for the primary ASR task and the auxiliary LM task. In each iteration of multitask learning, we sample one of the tasks among the ASR and LM task based on the prior probability for picking each task. Note that when the decoder is trained for LM, the encoder and attention components of LAS model are unaffected. One important aspect to note is that, unlike all of the previous approaches discussed, this approach has no external LM; rather, the decoder itself is trained for both tasks.
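The two ingredients of this approach, the zero context vector and the per-iteration task sampling, can be sketched as follows. The output layer is assumed to have the form $\mathrm{softmax}(\bm{{W_{\text{s}}}}[\bm{{c}}_{t};\bm{{d}}_{t}]+\bm{{b_{\text{s}}}})$; the helper names, the toy sizes, and the LM task prior of 0.3 are illustrative assumptions, not values from this work.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def output_posterior(c_t, d_t, W_s, b_s):
    """Assumed output layer: softmax(W_s [c_t; d_t] + b_s)."""
    return softmax(W_s @ np.concatenate([c_t, d_t]) + b_s)

def sample_task(rng, p_lm):
    """Pick the task for one training iteration from a fixed prior."""
    return "lm" if rng.random() < p_lm else "asr"

rng = np.random.default_rng(0)
C, D, V = 4, 3, 5   # context dim, decoder dim, vocabulary size (toy values)
W_s, b_s = rng.normal(size=(V, C + D)), rng.normal(size=V)
d_t = rng.normal(size=D)

# With c_t = 0 the context columns of W_s drop out entirely,
# leaving a plain LM driven only by the decoder state.
p_zero_context = output_posterior(np.zeros(C), d_t, W_s, b_s)
p_plain_lm = softmax(W_s[:, C:] @ d_t + b_s)

# Task choice per iteration; LM iterations touch only the decoder.
tasks = [sample_task(rng, p_lm=0.3) for _ in range(1000)]
```

The equality of `p_zero_context` and `p_plain_lm` is what makes the zero vector a natural stand-in for the missing speech input: no part of the encoder or attention network influences the LM loss.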

Experimental Setup

We use the Switchboard corpus (LDC97S62), which contains roughly 300 hours of conversational telephone speech, as our medium-scale training set. The first 4K utterances from the training set are reserved as a validation set for hyperparameter tuning and early stopping. Since the training set has a large number of repetitions of short utterances (“yeah”, “uh-huh”, etc.), we remove duplicates beyond a count threshold of 300. After these preprocessing steps, the final training set has about 192K utterances. For evaluation, we use the HUB5 Eval2000 data set (LDC2002S09), which consists of two subsets: Switchboard (SWB), which is similar in style to the training set, and CallHome (CH), which contains unscripted conversations between close friends and family. For acoustic features, we use 40-dimensional log-mel filterbank features along with their deltas, with per-speaker mean and variance normalization. For all of the above data processing, we use the EESEN toolkit’s recipe, which is based on the Kaldi toolkit’s recipe.
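The duplicate removal step can be illustrated with a few lines of Python. This is a minimal sketch that caps the number of occurrences of each distinct transcript; `cap_duplicates` is a hypothetical helper, the toy data uses a cap of 2 instead of 300, and a real pipeline would carry the audio along with each transcript.

```python
from collections import Counter

def cap_duplicates(transcripts, max_count):
    """Keep at most max_count occurrences of each distinct transcript, preserving order."""
    seen = Counter()
    kept = []
    for text in transcripts:
        if seen[text] < max_count:
            kept.append(text)
            seen[text] += 1
    return kept

toy = ["yeah", "uh-huh", "yeah", "yeah", "i think so", "yeah"]
capped = cap_duplicates(toy, max_count=2)
```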

For external LM training, we combine the Switchboard training set with the Fisher corpus (LDC200{4,5}T19). To avoid domain mismatch, we process Fisher utterances to (a) remove noise/hesitation markers not used in Switchboard, and (b) filter out utterances not covered by the wordpiece model trained on Switchboard. (Some utterances in Fisher have symbols, such as the period sign, which are not present in Switchboard and hence are not covered by the wordpiece model trained on Switchboard transcripts.) The filtering process removes $\sim$400K utterances out of 2.2 million Fisher utterances. Thus, combined with the Switchboard training utterances, the LM is trained on $\sim$2 million utterances.

1.2 Model Details

The encoder is a 4-layer pyramidal bidirectional long short-term memory (LSTM) network, resulting in an 8-fold reduction in time resolution. For the 2-fold reduction done at each layer below the topmost, we max-pool over 2 consecutive hidden states and feed the result into the layer above. We use 256 hidden units in each direction of each layer.
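The time reduction itself is easy to check numerically: three 2-fold max-pooling steps (one after each of the three layers below the topmost) give the 8-fold reduction. The NumPy sketch below shows only the pooling, with the LSTM layers between the steps omitted; dropping a trailing odd frame is an assumption, since the handling of odd lengths is not specified here.

```python
import numpy as np

def max_pool_pairs(states):
    """Max-pool over each pair of consecutive hidden states (T x H -> T//2 x H)."""
    T = states.shape[0] - states.shape[0] % 2   # drop a trailing odd frame (assumption)
    return states[:T].reshape(T // 2, 2, -1).max(axis=1)

# 16 toy frames with hidden size 2; values increase with time.
pooled = np.arange(16 * 2, dtype=float).reshape(16, 2)
for _ in range(3):   # one reduction after each of the three lower layers
    pooled = max_pool_pairs(pooled)
```

Starting from 16 frames, the three reductions leave 2 frames, i.e. one output per 8 input frames.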

The decoder in the baseline LAS model is a single-layer unidirectional LSTM network with 256 hidden units. We use a 1K wordpiece output vocabulary, which includes all the characters to ensure open-vocabulary coverage. The vocabulary is generated using a variant of the byte pair encoding (BPE) algorithm implemented in the SentencePiece library by Google (https://github.com/google/sentencepiece). We represent the wordpieces with 256-dimensional embeddings learned jointly with the rest of the model. For regularization we use: (a) label smoothing, where we uniformly distribute 0.1 probability mass among labels other than the ground-truth label, and (b) dropout with probability 0.1 applied to the outputs of all of the RNN layers. We also use scheduled sampling with a fixed schedule, where each timestep’s decoder input is either the ground-truth previous label with probability 0.9 or sampled from the model’s posterior distribution for the previous label with probability 0.1.
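The label smoothing variant described above, with 0.1 probability mass spread uniformly over the non-ground-truth labels, can be written in a few lines. This is a minimal NumPy sketch; `smoothed_targets` and `cross_entropy` are hypothetical helper names and the 10-label vocabulary is a toy value.

```python
import numpy as np

def smoothed_targets(label, vocab_size, smoothing=0.1):
    """Target distribution: 1 - smoothing on the true label, the rest spread uniformly."""
    target = np.full(vocab_size, smoothing / (vocab_size - 1))
    target[label] = 1.0 - smoothing
    return target

def cross_entropy(log_probs, target):
    """Cross-entropy between a (smoothed) target and the model's log-probabilities."""
    return -np.sum(target * log_probs)

target = smoothed_targets(label=3, vocab_size=10)
uniform_log_probs = np.log(np.full(10, 0.1))
loss = cross_entropy(uniform_log_probs, target)
```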

For inference, we use beam search with a beam size of 10. We observed that for some of the models, increasing the beam size to 10 resulted in an increase in insertion errors compared to a lower beam size. To counter this, we add a wordpiece insertion reward $\in\{0,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9\}$, tuned on the development set. With the addition of the wordpiece insertion reward, larger beam sizes outperform smaller beam sizes for all models, with insignificant gains beyond a sufficiently large beam size of 10. For shallow fusion, we pick the LM weight $\lambda$ from $\{0,0.05,0.1,0.15,0.2,0.25\}$ by tuning on the development set, resulting in a final tuned value of $\lambda=0.2$.

The external LM is a single-layer 512 hidden unit RNN with LSTM cells. The RNN hidden state is first passed through a projection layer with 256 hidden units and finally fed into the softmax layer. The LM is trained with the same output vocabulary as the LAS model. The LM is trained for 20 epochs with early stopping based on development set perplexity and attains a perplexity of $\sim$ 15 on the Switchboard development set.

1.3 Training Details

We bucket our training data by utterance length into 5 buckets, restricting utterances within a minibatch to come from a single bucket for training efficiency. Different minibatch sizes are used for different buckets, with a batch size of 128 used for the bucket with the shortest utterances and a batch size of 32 used for the bucket with the longest ones. Preliminary experiments suggested a performance benefit from proceeding through the training set from the bucket with the shortest utterances to the one with the longest utterances in each epoch. We use this order for training all of our models. (A similar training order has also been used in prior work.)
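A minimal sketch of this bucketing scheme is given below. The bucket boundaries, the per-bucket batch sizes in the toy call, and the helper name `bucket_batches` are assumptions made for illustration; the text specifies only the number of buckets and the batch sizes of the first and last bucket. Shuffling within a bucket is omitted.

```python
def bucket_batches(lengths, boundaries, batch_sizes):
    """Group utterance indices into minibatches, each drawn from a single length bucket.

    boundaries: increasing upper length limits for all buckets except the last.
    batch_sizes: one batch size per bucket (len(boundaries) + 1 entries).
    Batches are emitted from the shortest bucket to the longest.
    """
    buckets = [[] for _ in batch_sizes]
    for idx, n in enumerate(lengths):
        # First bucket whose upper limit covers this length, else the last bucket.
        b = next((i for i, bound in enumerate(boundaries) if n <= bound), len(boundaries))
        buckets[b].append(idx)
    batches = []
    for bucket, size in zip(buckets, batch_sizes):
        for start in range(0, len(bucket), size):
            batches.append(bucket[start:start + size])
    return batches

# Toy example with 3 buckets instead of 5.
lengths = [5, 50, 7, 120, 48, 9]
batches = bucket_batches(lengths, boundaries=[10, 60], batch_sizes=[2, 2, 1])
```

Using a smaller batch size for the longest bucket keeps the number of frames per minibatch, and hence memory use, roughly comparable across buckets.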

All models are trained using the Adam optimizer with an initial learning rate of 0.001. For the baseline LAS model and models with early LM training integration, we train for 12 epochs and start halving the learning rate every epoch after 7 epochs. For the models with late LM training integration, we start halving the learning rate every epoch after 4 epochs and train for a total of 8 epochs. For all models, we use early stopping based on development set WER when using greedy decoding.
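One reading of this schedule is that the learning rate stays at its initial value for the first 7 (or 4) epochs and is then halved at the start of every subsequent epoch. The sketch below encodes that interpretation, which is an assumption; `learning_rate` is a hypothetical helper with 1-indexed epochs.

```python
def learning_rate(epoch, base_lr=0.001, constant_epochs=7):
    """Learning rate for a 1-indexed epoch: constant at first, then halved every epoch."""
    return base_lr * 0.5 ** max(0, epoch - constant_epochs)

# Baseline and early-integration models: 12 epochs, halving after epoch 7.
early_schedule = [learning_rate(e) for e in range(1, 13)]
# Late-integration models: 8 epochs, halving after epoch 4.
late_schedule = [learning_rate(e, constant_epochs=4) for e in range(1, 9)]
```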

To speed up training, the encoder of all models is initialized with the encoder of a LAS model trained for predicting phone sequences, following prior work. All models are trained on a single NVIDIA TitanX GPU and finish training within 2 days, with each epoch taking 3-4 hours. Finally, all of our models are implemented in TensorFlow.

2 Google Voice Search and Dictation

The training data consists of approximately 22 million anonymized, human-transcribed utterances representative of live Google traffic, both Voice Search and dictation. Clean utterances are artificially corrupted using a room simulator, adding varying degrees of noise and reverberation such that the overall SNR is between 0dB and 30dB, with an average SNR of 12dB. The noise sources are from YouTube and daily life noisy environmental recordings. The models are evaluated on two data sets: VS14K, which consists of about 14K Voice Search utterances, and D15K, which contains about 15K dictation utterances.

The external LM is trained on a variety of text data sources, including untranscribed anonymized voice queries (both search and dictation), anonymized typed queries from Google Search, as well as the transcribed training utterances mentioned above. Since these component data sources have varying sizes, we up- and down-sample to mix them at a 1:1:1 ratio.

2.2 Model Details

Our LAS model is consistent with prior work: The encoder is composed of 5 unidirectional LSTM layers of 1400 hidden units each, the attention mechanism is a multi-headed additive attention with four heads, the decoder consists of 2 unidirectional LSTM layers of 1024 hidden units each, and the output vocabulary is 16384 wordpieces. We use 80-dimensional log-mel filterbank features, computed with a 25ms window and shifted every 10ms. As in prior work, at each frame $t$, these features are stacked with 3 frames to the left and downsampled to a 30ms frame rate.
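The frame stacking and downsampling step can be sketched in NumPy as follows. Stacking each 80-dimensional frame with its 3 left neighbours gives 320-dimensional vectors, and keeping every third frame moves from a 10ms to a 30ms frame rate. Padding the start by repeating the first frame and the helper name `stack_and_downsample` are assumptions; the text does not say how the first frames are handled.

```python
import numpy as np

def stack_and_downsample(feats, left=3, stride=3):
    """Stack each frame with `left` previous frames, then keep every `stride`-th frame."""
    T, F = feats.shape
    # Repeat the first frame so that early frames have a full left context (assumption).
    padded = np.concatenate([np.repeat(feats[:1], left, axis=0), feats], axis=0)
    # Row t of `stacked` is [x_{t-left}, ..., x_{t-1}, x_t].
    stacked = np.concatenate([padded[i:i + T] for i in range(left + 1)], axis=1)
    return stacked[::stride]

# 100 frames (1 second at a 10ms shift) of 80-dimensional toy features.
feats = np.random.default_rng(0).normal(size=(100, 80))
out = stack_and_downsample(feats)
```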

As in prior work, inference is done via beam search with a beam size of 8. Shallow fusion numbers are reported after tuning the LM weight $\lambda$ over the values $\{0,0.05,0.1,0.15,0.2,0.25,0.3,0.35\}$ and a coverage penalty over the values $\{0,0.01,0.02,0.03,0.04,0.05\}$. These parameters are tuned on a development set consisting of about 10K Voice Search utterances.

The external recurrent LM is composed of 2 LSTM layers of 2048 hidden units each. It has the same wordpiece output vocabulary as the LAS model.

2.3 Training Details

LAS models are trained in two stages. First, they are trained to convergence with a cross-entropy criterion using synchronous replica training. We use tensor processing units (TPUs) with a topology of 8x8, for a total of 128 synchronous replicas and an effective batch size of 4096. We found that having a very large batch size was critical to seeing any improvement from cold fusion. Our learning rate schedule includes an initial warm-up phase, a constant phase, and a decay phase, consistent with prior work.

Next, we conduct a second training phase with a minimum word error rate (MWER) criterion. This phase is performed on 16 synchronous GPU replicas to convergence, which is typically about one epoch. Note that for deep fusion we effectively have four training phases: cross-entropy training of LAS, MWER training of LAS, cross-entropy training of deep fusion, and MWER training of deep fusion.

The external LM is also trained on TPUs with a topology of 4x4. All models are trained using the Adam optimizer and are implemented in TensorFlow.

Results

Table 1 shows the results of a baseline LAS model and the three fusion approaches on Switchboard and CallHome. All of the fusion approaches improve over the baseline model with a relative WER reduction of 3-7% on Eval2000. Among the fusion approaches, shallow fusion is a clear winner with almost double the gains over baseline compared to deep and cold fusion. Finally, deep and cold fusion have comparable performance on Switchboard.

Table 2 shows the corresponding results for VS14K and D15K. All of the fusion approaches improve performance over the baseline model for VS14K, but for D15K deep fusion suffers a minor degradation compared to baseline. As with Switchboard, on both of these data sets shallow fusion is again the best performer, although it is tied with cold fusion on VS14K. Finally, deep fusion has no or negligible gain over baseline, suggesting that deep fusion does not scale well with data.

2 Proposed Approaches

Next we present results of our proposed approaches (Sections 3.2.4 and 3.2.5) on Switchboard in Table 3 and compare them against the earlier Switchboard results from Table 1. The multitask learning approach achieves minor gains over the LAS baseline. While these gains are more modest than those of the three fusion approaches, it is important to note that unlike the fusion approaches, the LM multitask approach introduces no new parameters.

Next we evaluate the approach of introducing the LM as a lower decoder layer (making the decoder two layers deep in the case of our Switchboard models). To account for the confounding variable of a deeper decoder, we also compare it to a version where a randomly initialized RNN is introduced instead of the pretrained RNN LM. As can be seen from the table, the performance of this approach is comparable to that of deep and cold fusion. In addition, the marginal gains from introducing a randomly initialized RNN instead demonstrate the benefit of LM pretraining. The promising performance we see here is consistent with prior findings on using a pretrained LM as the decoder, and this simple approach of using an LM to initialize parts of the decoder warrants further investigation in future work.

3 Second Pass Rescoring

While shallow and cold fusion have very similar top-1 WER on the Google data sets, we can investigate the quality of the top-8 hypotheses to better understand the strengths of each approach. For each of the fusion methods, Table 4 shows the WER after second pass rescoring with a large, production-scale LM, as well as the oracle WER in parentheses.
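For reference, the oracle WER of an n-best list is the WER obtained when, for every utterance, the hypothesis closest to the reference is selected. The sketch below computes top-1 and oracle WER with a word-level edit distance; the reference and hypothesis strings are made-up toy data and the helper names are hypothetical.

```python
def word_errors(ref, hyp):
    """Word-level Levenshtein distance between a reference and a hypothesis."""
    r, h = ref.split(), hyp.split()
    prev = list(range(len(h) + 1))
    for i, rw in enumerate(r, start=1):
        cur = [i]
        for j, hw in enumerate(h, start=1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (rw != hw)))   # substitution or match
        prev = cur
    return prev[-1]

def wer(refs, hyps):
    """Corpus-level WER: total word errors divided by total reference words."""
    errors = sum(word_errors(r, h) for r, h in zip(refs, hyps))
    return errors / sum(len(r.split()) for r in refs)

def oracle_wer(refs, nbest_lists):
    """WER when the best hypothesis in each n-best list is chosen with knowledge of the reference."""
    best = [min(nbest, key=lambda h: word_errors(r, h))
            for r, nbest in zip(refs, nbest_lists)]
    return wer(refs, best)

refs = ["call mom now", "play some jazz"]
nbest = [["call tom now", "call mom now"], ["play sum jazz", "play some jab"]]
top1 = wer(refs, [n[0] for n in nbest])
oracle = oracle_wer(refs, nbest)
```

The gap between `top1` and `oracle` is an upper bound on what any second-pass rescoring of the same n-best list can recover.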

As the table shows, cold fusion has significantly better oracle WER on VS14K than the baseline and other fusion methods and, as a result, benefits the most from a second pass LM. While shallow fusion is unaffected by the second pass, the cold fusion WER drops from 5.3 to 5.0. This suggests that the improvements provided by shallow fusion are redundant with the benefits of second pass rescoring, whereas cold fusion does something distinct, improving the overall quality and diversity of the top 8 decoded transcripts.

Cold fusion also has the lowest oracle WER on D15K, but none of the models benefit much from the second pass on this data set. The lack of improvement is likely because the second pass LM is primarily designed to improve performance on Voice Search. Shallow fusion therefore remains best on this data set.

Finally, we note that shallow fusion actually has higher oracle WER than the baseline LAS system on both data sets. This may be because shallow fusion can actually pull poor transcripts into the beam if they are heavily favored by the LM.

Conclusion

We perform a thorough investigation of the problem of LM integration in encoder-decoder based ASR models. We compare some of the most prominent past methods and a few of our own proposed methods on the medium-scale and publicly available Switchboard dataset and the large-scale Google voice search and dictation data sets. Our results show that for first-pass scoring, the simple approach of shallow fusion performs best on all of our data sets. However, cold fusion produces lower oracle error rates among the top-8 decoded transcripts, and outperforms shallow fusion after second pass rescoring on Google voice search. Deep fusion is comparable to cold fusion on Switchboard but gets no or negligible gains over the baseline on Google data sets, suggesting that it does not scale well with data. Among our proposed methods, the simple approach of using a pretrained language model as a lower layer of the decoder performs comparably to cold and deep fusion on Switchboard, suggesting that further investigation of the approach may be fruitful.