Multilingual Speech Recognition With A Single End-To-End Model

Shubham Toshniwal, Tara N. Sainath, Ron J. Weiss, Bo Li, Pedro Moreno, Eugene Weinstein, Kanishka Rao

Introduction

Speech recognition has made remarkable progress in the past few years, with services such as Google Voice Search supporting about 120 languages (https://www.blog.google/products/search/type-less-talk-more/). Further expanding coverage of the world’s approximately 7,000 languages is of great interest to both academia and industry. However, in many cases the resources available to train large vocabulary continuous speech recognizers are severely limited. These challenges have sustained a perennial interest in multilingual and cross-lingual models, which allow for knowledge transfer across languages and thus relieve burdensome data requirements.

Most previous work on multilingual speech recognition has been limited to making the acoustic model (AM) multilingual. Some multilingual AMs require a common phone set, while others share some of the acoustic model parameters. In the hat swap structure, the lower layers of a deep neural network (DNN) are shared across languages and the output layer is language-specific. Alternatively, multilingual bottleneck features from a DNN feature extractor can be used with either Gaussian mixture model or DNN-based systems. These multilingual AMs still require language-specific pronunciation models (PMs) and language models (LMs), which means that such systems must often know the language of the speech during inference. Moreover, the AMs, PMs and LMs are usually optimized independently, in which case errors from one component propagate to subsequent components in a way that was not seen during training.

Sequence-to-sequence models fold the AM, PM and LM into a single network, making them attractive to explore for multilingual speech recognition. Building a multilingual sequence-to-sequence model requires taking the union over all the language-specific grapheme sets and training the model jointly on data from all the languages. In addition to their simplicity, the end-to-end nature of such models means that all of the model parameters contribute to handling the variations between different languages. Our attention-based sequence-to-sequence model is based on the Listen, Attend and Spell (LAS) model, the details of which are explained in the next section. Our work is most similar to prior work that proposes an end-to-end trained multilingual recognizer to directly predict grapheme sequences in 10 distantly related languages. That work uses a hybrid attention/connectionist temporal classification model integrated with an independently trained grapheme LM. In this paper we use a simpler sequence-to-sequence model without an explicit LM, and study a corpus of 9 more closely related Indian languages.

We show that a LAS model jointly trained across data from 9 Indian languages without any explicit language specification consistently outperforms monolingual LAS models trained independently on each language. Even without explicit language specification, the model is rarely confused between languages. We also experiment with certain language-dependent variants of the model. In particular, we obtain the largest improvement by conditioning the encoder on the speech language identity. We also run several experiments on synthesized data to gain insights into the behavior of these models. We find that the multilingual model is unable to code-switch between languages, indicating that the language model is dominating the acoustic model. Finally, we find that the language-conditioned model is able to transliterate Urdu speech into Hindi text, suggesting that the model has learned an internal representation which disentangles the underlying acoustic-phonetic content from the language.

Model

In this section we describe the Listen, Attend and Spell (LAS) attention-based sequence-to-sequence ASR model proposed by Chan et al., as well as our proposed modifications to support recognition in multiple languages.

The sequence-to-sequence model consists of three modules: an encoder, a decoder and an attention network, which are trained jointly to predict a sequence of graphemes from a sequence of acoustic feature frames.

We use 80-dimensional log-mel acoustic features computed every 10ms over a 25ms window. Following prior work, we stack 8 consecutive frames and stride the stacked frames by a factor of 3. This downsampling enables us to use a simpler encoder architecture than in prior work.
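The stacking and striding step can be sketched in NumPy as follows. This is only an illustration under stated assumptions: the function name `stack_and_stride` is hypothetical, each window is assumed to cover the current frame and the 7 frames after it, and frames at the end that cannot fill a full window are dropped rather than padded.

```python
import numpy as np

def stack_and_stride(frames, stack=8, stride=3):
    """Stack `stack` consecutive frames, keeping every `stride`-th stacked frame.

    frames: array of shape (num_frames, feat_dim), e.g. 80-dim log-mel features.
    Returns an array of shape (num_out, stack * feat_dim).
    """
    num_frames, feat_dim = frames.shape
    if num_frames < stack:
        # Assumption: utterances shorter than one window yield no output.
        return np.zeros((0, stack * feat_dim), dtype=frames.dtype)
    starts = range(0, num_frames - stack + 1, stride)
    return np.stack([frames[s:s + stack].reshape(-1) for s in starts])

# One second of audio at a 10 ms hop gives 100 frames of 80 features each.
log_mel = np.random.default_rng(0).standard_normal((100, 80))
stacked = stack_and_stride(log_mel)
print(stacked.shape)  # (31, 640)
```

The encoder then sees roughly one third as many time steps, each carrying 640 features instead of 80.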

The encoder consists of a stacked bidirectional recurrent neural network (RNN) that reads acoustic features $\bm{x}=(\bm{x}_{1},\dots,\bm{x}_{K})$ and outputs a sequence of high-level features (hidden states) $\bm{h}=(\bm{h}_{1},\dots,\bm{h}_{K})$. The encoder plays a role similar to the acoustic model in an ASR system.

The decoder is a stacked unidirectional RNN that computes the probability of a sequence of characters $\bm{{y}}$ as follows:
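Assuming the standard autoregressive factorization used by LAS, with $T$ introduced here as the length of the output sequence, this probability can be written as:

$$P(\bm{y}\mid\bm{x}) = P(\bm{y}\mid\bm{h}) = \prod_{t=1}^{T} P(y_{t}\mid\bm{h},\bm{y_{<t}})$$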

The conditional dependence on the encoder state vectors $\bm{{h}}$ is represented by context vector $\bm{{c}}_{t}$ , which is a function of the current decoder hidden state and the encoder state sequence:
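In the additive attention form that matches the parameters named in the text, with $u_{it}$ and $\alpha_{it}$ introduced here as assumed names for the unnormalized and normalized attention weights, the context vector would be computed as:

$$u_{it} = \bm{v}^{\top}\tanh(\bm{W_{h}}\bm{h}_{i} + \bm{W_{d}}\bm{d}_{t} + \bm{b_{\text{a}}}), \qquad \bm{\alpha}_{t} = \mathrm{softmax}(\bm{u}_{t}), \qquad \bm{c}_{t} = \sum_{i=1}^{K}\alpha_{it}\bm{h}_{i}$$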

where the vectors $\bm{{v}},\bm{{b_{\text{a}}}}$ and the matrices $\bm{{W_{h}}},\bm{{W_{d}}}$ are learnable parameters; $\bm{{d}}_{t}$ is the hidden state of the decoder at time step $t$ .
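The attention computation described above can be sketched in NumPy as follows. This is a minimal illustration assuming an additive (tanh) scoring function, which is consistent with the listed parameters; the function name `additive_attention`, the toy dimensions and the random values are all placeholders rather than the paper's configuration.

```python
import numpy as np

def additive_attention(h, d_t, W_h, W_d, v, b_a):
    """Return (context vector c_t, attention weights) for one decoder step.

    h:   encoder states, shape (K, enc_dim)
    d_t: decoder hidden state, shape (dec_dim,)
    """
    # Score every encoder frame against the current decoder state.
    scores = np.tanh(h @ W_h.T + d_t @ W_d.T + b_a) @ v   # shape (K,)
    scores = scores - scores.max()                          # numerical stability
    weights = np.exp(scores) / np.exp(scores).sum()         # softmax over frames
    c_t = weights @ h                                       # shape (enc_dim,)
    return c_t, weights

rng = np.random.default_rng(0)
K, enc_dim, dec_dim, att_dim = 6, 4, 5, 3   # toy sizes, not the paper's
h = rng.standard_normal((K, enc_dim))
d_t = rng.standard_normal(dec_dim)
W_h = rng.standard_normal((att_dim, enc_dim))
W_d = rng.standard_normal((att_dim, dec_dim))
v = rng.standard_normal(att_dim)
b_a = np.zeros(att_dim)
c_t, weights = additive_attention(h, d_t, W_h, W_d, v, b_a)
```

The weights are non-negative and sum to one, so the context vector is a convex combination of the encoder states.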

The hidden state of the decoder, $\bm{{d}}_{t}$ , which captures the previous character context $\bm{{y_{<t}}}$ , is given by:
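A form consistent with these definitions, assuming $\tilde{\bm{y}}_{t-1}$ denotes an embedding of the previous character and that the output distribution is a softmax over the concatenated context vector and decoder state, is:

$$\bm{d}_{t} = \mathrm{RNN}(\tilde{\bm{y}}_{t-1}, \bm{d}_{t-1}, \bm{c}_{t-1}), \qquad P(y_{t}\mid\bm{h},\bm{y_{<t}}) = \mathrm{softmax}(\bm{W_{\text{s}}}[\bm{c}_{t};\bm{d}_{t}] + \bm{b_{\text{s}}})$$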

where $\bm{{W_{\text{s}}}}$ and $\bm{{b_{\text{s}}}}$ are again learnable parameters. The model is trained to optimize the discriminative loss:
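Assuming $\bm{y}^{*}$ denotes the ground-truth transcript and $\mathcal{L}_{\text{LAS}}$ the recognition loss (both names introduced here), this takes the usual negative log-likelihood form:

$$\mathcal{L}_{\text{LAS}} = -\log P(\bm{y}^{*}\mid\bm{x})$$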

Multilingual Models

In the multilingual scenario, we are given $n$ languages $\{\mathcal{L}_{1},\dots,\mathcal{L}_{n}\}$ , each with independent character sets $\{\mathcal{C}_{1},\mathcal{C}_{2},\cdots,\mathcal{C}_{n}\}$ and training sets $\{(\mathcal{X}_{1},\mathcal{Y}_{1}),\dots,(\mathcal{X}_{n},\mathcal{Y}_{n})\}$ . The combined training dataset is thus given by the union of the datasets for each language:
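Written out with the symbols above, and using $(\mathcal{X},\mathcal{Y})$ as an assumed name for the combined set:

$$(\mathcal{X},\mathcal{Y}) = \bigcup_{i=1}^{n}(\mathcal{X}_{i},\mathcal{Y}_{i})$$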

and the character set for the combined dataset is similarly given by:
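With $\mathcal{C}$ as an assumed name for the combined character set, the union reads:

$$\mathcal{C} = \bigcup_{i=1}^{n}\mathcal{C}_{i}$$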

We begin by training a joint model, consisting of the LAS model described in the previous section trained directly on the combined multilingual dataset. This model is not given any explicit indication that the training dataset is composed of different languages. However, as we will show later, this model is still able to recognize speech in multiple languages despite the lack of runtime language specification.

Multitask

We also experiment with a variant of the joint model which has the same architecture but is trained in a multitask learning (MTL) configuration to jointly recognize speech and simultaneously predict its language. The language ID annotation is thus utilized during training, but is not passed as an input during inference. In order to predict the language ID, we average the encoder output $\bm{h}$ across all time frames to compute an utterance-level feature. This averaged feature is then passed to a softmax layer to predict the likelihood of the speech belonging to each language:
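One way to write this, assuming $\bm{W}_{\text{lang}}$ and $\bm{b}_{\text{lang}}$ as hypothetical names for the softmax layer parameters and $P(\mathcal{L}\mid\bm{x})$ for the resulting distribution over languages, is:

$$P(\mathcal{L}\mid\bm{x}) = \mathrm{softmax}\left(\bm{W}_{\text{lang}}\,\frac{1}{K}\sum_{k=1}^{K}\bm{h}_{k} + \bm{b}_{\text{lang}}\right)$$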

The language identification loss is given by:
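As a cross-entropy over the predicted language distribution, with $P(\mathcal{L}\mid\bm{x})$ used here as an assumed name for the softmax output and $\mathcal{L}_{\text{LID}}$ for the loss, this would be:

$$\mathcal{L}_{\text{LID}} = -\log P(\mathcal{L}=\mathcal{L}_{j}\mid\bm{x})$$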

where the $j$ -th language, $\mathcal{L}_{j}$ , is the ground truth language. The two losses are combined using an empirically determined weight $\lambda$ to obtain the final training loss:
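The multitask objective can be sketched in NumPy as follows. The exact way the two losses are weighted is not reproduced in the text, so the normalized sum in `multitask_loss` is an assumption made for illustration; the function names, the parameters `W_lang` and `b_lang`, the toy dimensions and the placeholder recognition loss are likewise hypothetical.

```python
import numpy as np

def language_id_log_probs(h, W_lang, b_lang):
    """Average encoder outputs over time, then apply a softmax layer (log space)."""
    pooled = h.mean(axis=0)                 # utterance-level feature, (enc_dim,)
    logits = W_lang @ pooled + b_lang       # one logit per language
    logits = logits - logits.max()          # numerical stability
    return logits - np.log(np.exp(logits).sum())

def multitask_loss(las_loss, lid_loss, lam=0.01):
    # Assumed weighting: a normalized sum. The text only states that the two
    # losses are combined with an empirically chosen weight lambda.
    return (las_loss + lam * lid_loss) / (1.0 + lam)

rng = np.random.default_rng(0)
num_frames, enc_dim, num_languages = 50, 8, 9   # toy sizes
h = rng.standard_normal((num_frames, enc_dim))
W_lang = rng.standard_normal((num_languages, enc_dim))
b_lang = np.zeros(num_languages)

log_probs = language_id_log_probs(h, W_lang, b_lang)
true_language = 2                     # index j of the ground-truth language
lid_loss = -log_probs[true_language]  # language identification loss
las_loss = 12.3                       # placeholder for the recognition loss
total_loss = multitask_loss(las_loss, lid_loss, lam=0.01)
```

With a small weight such as 0.01, the total loss stays close to the recognition loss, which matches the intent that language identification is an auxiliary task.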

Conditioned

Finally, we consider a set of conditional models which utilize the language ID during inference. Intuitively, we expect that a model which is explicitly conditioned on the speech language will have an easier time allocating its capacity appropriately across languages, speeding up training and improving recognition performance.

Specifically, we learn a fixed-dimensional language embedding for each language to condition different components of the basic joint model on language ID. This conditioning is achieved by feeding the language embedding as an input to the first layer of the encoder, the decoder, or both, giving rise to (a) Encoder-conditioned, (b) Decoder-conditioned, and (c) Encoder+Decoder-conditioned variants. In contrast to the MTL model, the language ID is not used as part of the training cost.
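The conditioning idea can be sketched in NumPy as follows. How the embedding is combined with the layer input is not specified in the text, so concatenating it to every time step is an assumption; the function name `condition_on_language` and the randomly initialized table are hypothetical stand-ins for learned parameters.

```python
import numpy as np

NUM_LANGUAGES = 9
EMBED_DIM = 5   # the embedding size reported in the training details

rng = np.random.default_rng(0)
# In a real model this table would be learned jointly with the other weights.
language_embeddings = rng.standard_normal((NUM_LANGUAGES, EMBED_DIM))

def condition_on_language(inputs, lang_id):
    """Append the language embedding to every time step of `inputs`.

    inputs: shape (num_steps, input_dim)
    Returns an array of shape (num_steps, input_dim + EMBED_DIM).
    """
    emb = np.tile(language_embeddings[lang_id], (inputs.shape[0], 1))
    return np.concatenate([inputs, emb], axis=1)

frames = rng.standard_normal((31, 640))       # stacked acoustic features
encoder_inputs = condition_on_language(frames, lang_id=3)
print(encoder_inputs.shape)  # (31, 645)
```

The same helper could be applied to the decoder's first-layer inputs to obtain the decoder-conditioned variant.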

Experimental Setup

We conduct our experiments on data from nine Indian languages shown in Table 1, which corresponds to a total of about 1500 hours of training data and 90 hours of test data. The nine languages have little overlap in their character sets, with the exception of Hindi and Marathi, which both use the Devanagari script. The small overlap means that the output vocabulary for our multilingual models, which is the union over the character sets, is also quite large, containing 964 characters. Separate validation sets of around 10k utterances per language are used for hyperparameter tuning. All the utterances are dictated queries collected using desktop and mobile devices.

Model and Training Details

As a baseline, we train nine monolingual models independently on data for each language. We tune the hyperparameters on Marathi and reuse the optimal configuration to train models for the remaining languages. The best configuration for Marathi uses a 4-layer encoder consisting of 350 bidirectional long short-term memory (biLSTM) cells (i.e. 350 cells in the forward layer and 350 cells in the backward layer), and a 2-layer decoder containing 768 LSTM cells in each layer. For regularization, we apply a small L2 weight penalty of 1e-6 and add Gaussian weight noise with a standard deviation of 0.01 to all parameters after 20k training steps. All the monolingual models converge within 200-300k gradient steps.

Since the multilingual training corpus is much larger, we were able to train a larger joint multilingual model without overfitting. As with the training set, the validation set is also a union of the language-specific validation sets. The best configuration uses a 5-layer encoder consisting of 700 biLSTM cells, and a 2-layer decoder containing 1024 LSTM cells in each layer. For the multitask model, we find $\lambda=0.01$ among $\{0.1,0.01\}$ to work the best. We restricted ourselves to these values because for a very large $\lambda$, the language ID prediction task would dominate the primary task of ASR, while for a very small $\lambda$ the additional task would have no effect on the training loss. For all conditional models, we use a 5-dimensional language embedding. For regularization we add Gaussian weight noise with a standard deviation of 0.0075 after 25k training steps. All multilingual models are trained for approximately 2 million steps.

All models are implemented in TensorFlow and trained using asynchronous stochastic gradient descent with 16 workers. The initial learning rate is set to 1e-3 for the monolingual models and 1e-4 for the multilingual models, with learning rate decay in all the models.

Results

We first compare the language-specific LAS models with the joint LAS model trained on all languages. As shown in Table 2, the joint LAS model outperforms the language-specific models for all the languages. In fact, the joint model decreases the average WER across all 9 languages, weighted by number of words, by more than 21% relative to the monolingual models. This result is quite interesting not only because the joint model is a single model that is being compared to 9 different monolingual models, but also because, unlike the monolingual models, the joint model is not language-aware at runtime. Finally, the large performance gain of the joint model is also attributable to the fact that the Indian languages are very similar in the phonetic space, despite using different grapheme sets.
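The word-weighted average and the relative reduction can be made concrete with a short Python sketch. The function names are hypothetical and the numbers are made up purely to show the arithmetic; they are not results from Table 2.

```python
def weighted_average_wer(wers, word_counts):
    """Average per-language WERs, weighting each by its number of test words."""
    total_words = sum(word_counts)
    return sum(w * n for w, n in zip(wers, word_counts)) / total_words

def relative_reduction(baseline, improved):
    """Relative WER reduction of `improved` over `baseline`, in percent."""
    return 100.0 * (baseline - improved) / baseline

# Made-up numbers for three languages, purely to show the arithmetic.
word_counts = [1000, 3000, 6000]
mono_wers = [20.0, 30.0, 40.0]
joint_wers = [16.0, 24.0, 32.0]

mono_avg = weighted_average_wer(mono_wers, word_counts)    # 35.0
joint_avg = weighted_average_wer(joint_wers, word_counts)  # 28.0
reduction = relative_reduction(mono_avg, joint_avg)        # 20.0 (percent)
```

Weighting by word count means that languages with larger test sets contribute more to the reported average.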

Second, we compare the joint LAS model with the multitask trained variant. As shown in the right two columns of Table 2, the MTL model shows limited improvements over the joint model. This might be due to the following reasons: (a) Static choice of $\lambda$ . Since the language ID prediction task is easier than ASR, a dynamic $\lambda$ which is high initially and decays over time might be better suited, and (b) The language ID prediction mechanism of averaging over encoder outputs might not be ideal. A learned weighting of the encoder outputs, similar to the attention module, might be better suited for the task.

Third, Table 3 shows that all the joint models conditioned on the language ID outperform the joint model. The encoder-conditioned model (Enc) is better than the decoder-conditioned model (Dec), indicating that some form of acoustic model adaptation towards different languages and accents occurs when the encoder is conditioned. In addition, conditioning both the encoder and decoder (Enc + Dec) does not improve much over conditioning just the encoder, suggesting that feeding the encoder with language ID information is sufficient, as the encoder outputs are then fed to the decoder anyway via the attention mechanism.

Comparing model performance across languages, we see that all the models perform worst on Malayalam and Kannada. We hypothesize that this has to do with the agglutinative nature of these languages, which makes the average word longer than in languages like Hindi or Gujarati. For example, an average training set word in Malayalam has 9 characters compared to 5 in Hindi. In fact, we found that in contrast to the WERs, the character error rates (CERs) for Hindi and Malayalam were quite close.

Analysis

In this section we investigate the behavior and capacity of the proposed system in more detail, by asking the questions detailed below.

How often does the model confuse between languages? The ability of the proposed model to recognize multiple languages comes with the potential side effect of confusing the languages. The lack of script overlap between Indian languages, with the exception of Hindi and Marathi, means that a surface analysis of the script used in the model output is a good proxy for whether the model is confused between languages. We carry out this analysis at the word level and check whether the output words use graphemes from a single language or a mixture. We test the word first on the ground truth language, and in case of failure, test it on the other languages. If the word cannot be expressed using the character set of any single language, we classify it as mixed. The results for both the joint and the encoder-conditioned model are summarized in Figure 1. While both models are rarely confused between languages, the result for the joint model is interesting given its lack of explicit language awareness, showing that the LAS model is implicitly learning to predict language ID. It is also interesting to observe that by conditioning the joint model on the language ID, there is no confusion between languages.
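The word-level check described above can be sketched in Python as follows. The function name `classify_word` is hypothetical, and the toy character sets are built from Unicode block ranges as an assumption for illustration; the sets used in the analysis would come from the training transcripts of each language.

```python
def classify_word(word, true_lang, char_sets):
    """Return the language whose character set covers `word`, or "mixed".

    char_sets maps language name -> set of allowed characters. The ground-truth
    language is tried first, then the remaining languages in dict order.
    """
    chars = set(word)
    if chars <= char_sets[true_lang]:
        return true_lang
    for lang, charset in char_sets.items():
        if lang != true_lang and chars <= charset:
            return lang
    return "mixed"

# Toy character sets from Unicode blocks (an assumption, see the lead-in).
char_sets = {
    "hindi": {chr(c) for c in range(0x0900, 0x0980)},   # Devanagari block
    "tamil": {chr(c) for c in range(0x0B80, 0x0C00)},   # Tamil block
}

hindi_word = "नमस्ते"
tamil_word = "வணக்கம்"
mixed_word = hindi_word[:1] + tamil_word   # characters from both scripts

print(classify_word(hindi_word, "hindi", char_sets))  # hindi
print(classify_word(tamil_word, "hindi", char_sets))  # tamil
print(classify_word(mixed_word, "hindi", char_sets))  # mixed
```

Testing the ground-truth language first matters for languages that share a script, such as Hindi and Marathi, since a word in the shared script is then attributed to the expected language.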

Can the joint model perform code-switching? The joint model in theory has the capacity to switch between languages. In fact, it can code-switch between English and the 9 Indian languages due to the presence of English words in the training data (1-6% of the total words in the training set are English words in all 9 languages). We were interested in testing whether the model could also code-switch between a pair of Indian languages, which was not seen during training. For this purpose, we created an artificial dataset by selecting about 1,000 Tamil utterances and appending a Hindi utterance to each, with a 50ms break in between. To our disappointment, the model is not able to code-switch at all. It picks one of the two scripts and sticks with it. Manual inspection shows that: (a) when the model chooses Hindi, it only transcribes the Hindi part of the utterance; (b) similarly, when the model chooses Tamil it only transcribes the Tamil part, but on rare occasions it also transliterates the Hindi part. This suggests that the language model is dominating the acoustic model and points to overfitting, which is a known issue with attention-based sequence-to-sequence models.
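Constructing such a synthetic utterance can be sketched in NumPy as follows. The function name `concatenate_with_break`, the 16 kHz sample rate and the use of digital silence for the break are assumptions for illustration, and the random arrays stand in for real recordings.

```python
import numpy as np

def concatenate_with_break(first, second, sample_rate=16000, break_ms=50):
    """Join two mono waveforms with `break_ms` of silence in between."""
    silence = np.zeros(int(sample_rate * break_ms / 1000), dtype=first.dtype)
    return np.concatenate([first, silence, second])

rng = np.random.default_rng(0)
tamil_audio = rng.standard_normal(16000).astype(np.float32)   # stand-in, 1 s
hindi_audio = rng.standard_normal(24000).astype(np.float32)   # stand-in, 1.5 s
mixed = concatenate_with_break(tamil_audio, hindi_audio)
print(len(mixed))  # 40800
```

At 16 kHz, a 50 ms break corresponds to 800 samples, so the combined utterance is the sum of the two lengths plus 800.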

What does the conditioned model output for mismatched language ID? The interesting question here is whether the model obeys the acoustics or stays faithful to the language ID. To answer this, we created an artificial dataset of about 1,000 Urdu utterances labeled with the Hindi language ID and transcribed it with the encoder-conditioned model. As it turns out, the model is extremely faithful to the language ID and sticks to Hindi’s character set. Manual inspection of the outputs reveals that the model transliterates Urdu utterances into Hindi, suggesting that the model has learned an internal representation which disentangles the underlying acoustic-phonetic content from the language identity.

Conclusion

We present a sequence-to-sequence model for multilingual speech recognition which is able to recognize speech without any explicit language specification. We also propose simple variants of the model conditioned on language identity. The proposed model and its variants substantially outperform baseline monolingual sequence-to-sequence models for all languages, and rarely choose the incorrect grapheme set in their output. The model, however, cannot handle code-switching, suggesting that the language model is dominating the acoustic model. In future work, we would like to integrate the conditional variants of the model with separate language-specific language models to further improve recognition accuracy. We would also like to compare the proposed models against traditional models on live traffic data. Exploring the reasons for the lack of code-switching in the joint model may also lead to interesting insights regarding sequence-to-sequence models.

Acknowledgements

We would like to thank Rohit Prabhavalkar, Yonghui Wu, Vijay Peddinti, Zhifeng Chen and Patrick Nguyen for helpful comments. We are also thankful to the anonymous reviewers for their helpful comments.