Transfer learning of language-independent end-to-end ASR with language model fusion

Hirofumi Inaguma, Jaejin Cho, Murali Karthick Baskar, Tatsuya Kawahara, Shinji Watanabe

Introduction

Fast system development for new low-resource languages is one of the challenges in automatic speech recognition (ASR). Recently, end-to-end ASR systems based on the sequence-to-sequence (S2S) architecture have been closing the performance gap with conventional HMM-based hybrid systems and showing promising results in many tasks with their extremely simplified training and decoding schemes. This is very attractive when building systems for new languages quickly. However, models tend to suffer from data sparseness in the low-resource scenario, and S2S models especially so because of their data-driven optimization.

One possible approach to this problem is to utilize data from other languages. There are various ways to leverage other languages: (a) to train a model multilingually (multi-task learning with other languages) and then further fine-tune it to a particular language, and (b) to adapt a multilingual model to a new language using transfer learning and additional features obtained from the multilingual model, such as multilingual bottleneck features (BNF) and language feature vectors (LFV) (cross-lingual adaptation). To obtain a multilingual S2S model, a part of the parameters can be shared while preparing the output layers per language, and we can further use a unified architecture with a shared vocabulary among multiple languages. Since it would take much time to train such systems from scratch for many languages including new languages, we focus on the cross-lingual adaptation approach (b).

While the majority of conventional transfer learning is concerned with the acoustic model, using linguistic context during adaptation has not been investigated yet. The research question in this paper is: Is linguistic context also helpful for adaptation to new languages? The most common approach to integrating an external language model (LM) is referred to as shallow fusion, where LM scores are interpolated with scores from the S2S model. Recently, several methods to leverage an external LM during training of S2S models have been proposed: deep fusion and cold fusion. In deep fusion, the decoder network in the pre-trained S2S model and an external recurrent neural network language model (RNNLM) are integrated into a single architecture by a gating mechanism, and only the gating part is re-trained. In contrast, cold fusion integrates an external LM during the entire training stage.

In this paper, we investigate methods to fully utilize text data for adaptation to unseen low-resource languages. We propose LM fusion transfer, an extension of cold fusion in which an external LM is integrated into the decoder network of the S2S model only in the adaptation stage. (Although we can perform LM fusion during training of the seed multilingual model, we focus on applying it during adaptation because our goal is to adapt the model to a particular language rapidly.) Since the decoder network is already well-trained in a language-independent manner, the model can better incorporate linguistic context from the external LM. The extra cost of integrating the external LM during adaptation is trivial in the resource-constrained condition. We also investigate various seed multilingual models trained with 600 to 2200 hours of speech data and show the effect of the amount and variety of multilingual training data.

Experimental evaluations on the IARPA BABEL corpus show that LM fusion transfer improves performance compared to simple transfer learning with shallow fusion when additional text data is available. The performance of the transferred models is drastically improved by increasing the model capacity and incorporating the external LM, and the resulting models perform comparably with the latest BLSTM-HMM hybrid systems. To the best of our knowledge, these are the first results for an S2S model to show performance competitive with conventional hybrid systems in the low-resource scenario ( $\sim$ 50 hours).

Related work

The traditional usage of unpaired text data in the S2S framework is categorized into four approaches: LM integration, pre-training, multi-task learning (MTL), and data augmentation. In the LM integration approach, there are three methods: shallow fusion, deep fusion, and cold fusion, as described in Section 1. They differ in the timing of integrating the external LM and in whether the gating mechanism adds parameters. We depict these fusion methods in Fig. 1. These fusion methods have been compared on middle-size English conversational speech ( $\sim$ 300h) and large-scale Google voice search data. However, no previous work has investigated their effect in other languages, especially low-resource languages, which is the focus of this paper. Cold fusion has also been shown to be effective in a cross-domain scenario. Since the external LM is likely to be changed more frequently than the acoustic model, it is time-consuming to train a new S2S model with LM integration from scratch every time the external LM is updated. In this work, we conduct LM fusion during adaptation to target languages, which is regarded as a more realistic scenario.

Another usage of the external LM is to initialize the lower layer of the decoder network with the pre-trained LM. However, we transfer almost all parameters of a multilingual S2S model (both encoder and decoder networks), and thus we do not explore this direction. Apart from the external LM, the MTL approach with an LM objective has also been investigated. Although the MTL approach does not require any additional parameters, it yields only minor gains compared to LM fusion methods.

Recently, data augmentation of speech data based on text-to-speech (TTS) synthesis has been investigated in the S2S framework. Since we are interested in the usage of linguistic context during adaptation, we leave this direction to future work.

End-to-end ASR

We build all models with attention-based sequence-to-sequence (S2S) models, which can learn soft alignments between input and output sequences of variable lengths. They are composed of encoder and decoder networks. The encoder network transforms input features $\bm{x}=(\bm{x}_{1},\dots,\bm{x}_{T})$ to a high-level continuous representation $\bm{h}=(\bm{h}_{1},\dots,\bm{h}_{T^{\prime}})$ ( $T^{\prime}\leq T$ ), interleaved with subsampling layers to reduce the computational complexity. The decoder network generates a probability distribution $P_{\rm S2S}$ of the corresponding $U$ -length transcription $\bm{y}=(y_{1},\dots,y_{U})$ conditioned on all previously generated tokens:

where $\bm{W}^{\rm o}$ and $\bm{b}^{\rm o}$ are trainable parameters, $\bm{s}^{\rm S2S}_{u}$ is a decoder state at the $u$ -th timestep, and $\bm{c}_{u}$ is a context vector summarizing notable parts from the encoder states $\bm{h}$ . We adopt the location-based scoring function. To encourage monotonic alignments, the auxiliary Connectionist Temporal Classification (CTC) objective is linearly interpolated with the S2S objective during training.
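
For reference, the decoder output distribution and the interpolated training objective can be written in one common form that is consistent with the symbols defined above. This is a sketch under assumptions rather than the exact formulation: the function name ${\rm Decoder}$, the losses $\mathcal{L}_{\rm S2S}$ and $\mathcal{L}_{\rm CTC}$, and the training weight $\lambda_{\rm train}$ are notation introduced here.

$$
\begin{aligned}
\bm{s}^{\rm S2S}_{u} &= {\rm Decoder}(\bm{s}^{\rm S2S}_{u-1}, y_{u-1}, \bm{c}_{u}) \\
P_{\rm S2S}(y_{u} \mid y_{1},\dots,y_{u-1}, \bm{x}) &= {\rm softmax}(\bm{W}^{\rm o}\bm{s}^{\rm S2S}_{u} + \bm{b}^{\rm o}) \\
\mathcal{L} &= \lambda_{\rm train}\,\mathcal{L}_{\rm CTC} + (1-\lambda_{\rm train})\,\mathcal{L}_{\rm S2S}
\end{aligned}
$$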

During the inference stage, scores from the softmax layer used for the CTC objective are linearly interpolated in log-scale with a tunable parameter $\lambda\ (0\leq\lambda\leq 1)$ to avoid generating incomplete and repeated hypotheses as follows:
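
A form of this decoding criterion consistent with the description above, in which $P_{\rm CTC}$ denotes the CTC-based sequence probability (notation assumed here, following the usual joint CTC/attention decoding setup), is:

$$
\hat{\bm{y}} = \arg\max_{\bm{y}} \left\{ \lambda \log P_{\rm CTC}(\bm{y} \mid \bm{x}) + (1-\lambda) \log P_{\rm S2S}(\bm{y} \mid \bm{x}) \right\}
$$

With $\lambda=0$ the search relies on the attention decoder alone, and larger $\lambda$ gives more weight to the CTC scores.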

3.2 LM fusion

3.2.1 Shallow fusion

In the conventional decoding paradigm with an external LM, referred to as shallow fusion, scores from both the S2S model and LM are linearly interpolated to maximize the following criterion:

where $\beta$ is a tunable parameter to define the importance of the external LM. The separate LM, especially when trained with a larger external text, has complementary effects to the implicit LM modeled in the decoder network. Therefore, shallow fusion shows performance gains in many ASR tasks.
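
The interpolation described above can be illustrated with a small rescoring sketch. It assumes each hypothesis already carries sequence-level log-probabilities from the attention decoder, the CTC branch, and the external LM. The `Hypothesis` class, the function names, and the example scores are hypothetical and not part of any toolkit. Including the CTC term alongside the LM term is also an assumption, made because the experiments decode jointly with CTC.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Hypothesis:
    """A candidate transcription with sequence-level log-probabilities."""
    text: str
    logp_s2s: float  # log P_S2S(y | x) from the attention decoder
    logp_ctc: float  # log P_CTC(y | x) from the CTC branch
    logp_lm: float   # log P_LM(y) from the external LM


def fused_score(hyp: Hypothesis, ctc_weight: float = 0.3, lm_weight: float = 0.3) -> float:
    """Shallow-fusion score: interpolate CTC and S2S, then add the weighted LM score."""
    s2s_ctc = ctc_weight * hyp.logp_ctc + (1.0 - ctc_weight) * hyp.logp_s2s
    return s2s_ctc + lm_weight * hyp.logp_lm


def best_hypothesis(hyps: List[Hypothesis], ctc_weight: float = 0.3,
                    lm_weight: float = 0.3) -> Hypothesis:
    """Return the hypothesis that maximizes the fused score."""
    return max(hyps, key=lambda h: fused_score(h, ctc_weight, lm_weight))


# Hypothetical scores: the S2S model slightly prefers "a", the LM strongly prefers "b".
candidates = [
    Hypothesis("a", logp_s2s=-4.0, logp_ctc=-4.0, logp_lm=-9.0),
    Hypothesis("b", logp_s2s=-4.5, logp_ctc=-4.5, logp_lm=-5.0),
]
without_lm = best_hypothesis(candidates, lm_weight=0.0).text
with_lm = best_hypothesis(candidates, lm_weight=0.3).text
```

With the LM weight set to zero the first candidate wins, and with the LM weight at 0.3 the LM evidence flips the decision to the second candidate, which is the complementary effect that shallow fusion relies on.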

3.2.2 Cold fusion (flat-start fusion)

While shallow fusion uses the external LM only in the inference stage, cold fusion uses the pre-trained LM during training of the S2S model to provide effective linguistic context. The fine-grained element-wise gating function is equipped to flexibly rely on the LM depending on the uncertainty of prediction:

where $\bm{W}^{*}$ and $\bm{b}^{*}$ are trainable parameters, $\bm{d}_{u}^{\rm LM}$ is a hidden state of the RNNLM, $\bm{s}_{u}^{\rm LM}$ is a feature from the external LM, $\bm{s}_{u}^{\rm CF}$ is a bottleneck feature before the final softmax layer, $\bm{g}_{u}$ is a gating function, and $\odot$ represents element-wise multiplication. A ReLU non-linearity is inserted before the softmax layer, as suggested in previous work. We use the hidden state as the feature from the RNNLM instead of logits because we use the universal character vocabulary for multilingual experiments, which results in a large softmax layer and increases the computational time.
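
The gating described above can be written out as follows. This is a reconstruction from the symbol definitions in this section and the usual cold fusion layout, so the exact form is an assumption. Here $\sigma$ denotes the element-wise sigmoid, $[\cdot;\cdot]$ denotes concatenation, and the superscripts LM, g, CF, and o instantiate the $*$ in $\bm{W}^{*}$ and $\bm{b}^{*}$.

$$
\begin{aligned}
\bm{s}_{u}^{\rm LM} &= \bm{W}^{\rm LM}\bm{d}_{u}^{\rm LM} + \bm{b}^{\rm LM} \\
\bm{g}_{u} &= \sigma(\bm{W}^{\rm g}[\bm{s}_{u}^{\rm S2S};\bm{s}_{u}^{\rm LM}] + \bm{b}^{\rm g}) \\
\bm{s}_{u}^{\rm CF} &= \bm{W}^{\rm CF}[\bm{s}_{u}^{\rm S2S};\bm{g}_{u}\odot\bm{s}_{u}^{\rm LM}] + \bm{b}^{\rm CF} \\
P_{\rm S2S}(y_{u} \mid y_{1},\dots,y_{u-1},\bm{x}) &= {\rm softmax}({\rm ReLU}(\bm{W}^{\rm o}\bm{s}_{u}^{\rm CF} + \bm{b}^{\rm o}))
\end{aligned}
$$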

In the original formulation, scores from the external LM are not used. We found that linear interpolation of log probabilities from the LM with those from the S2S model during inference, as in shallow fusion, still has complementary effects that improve performance. Therefore, we adopt it in all experiments.

3.2.3 Deep fusion (fine-tuning fusion)

Deep fusion is another method to integrate an external LM during training. Unlike cold fusion, deep fusion only fine-tunes the gating part, with the parameters of both the pre-trained S2S model and the RNNLM frozen. Although deep fusion was originally formulated with a scalar gating function, we use the same architecture as cold fusion in Section 3.2.2 to make a strict comparison. The differences from cold fusion are then the timing of integrating the external LM (from scratch or in the middle stage) and which parameters are updated after integration (see Figure 1).

Transfer learning of multilingual ASR

We adapt a seed language-independent end-to-end ASR model to an (unseen) target language. We investigate the following four scenarios:

multi10: From 10 non-target languages to an unseen target language

high2: From 2 high-resource languages (English and Japanese) to an unseen target language

multi10+high2: From the mix of 10 non-target languages and 2 high-resource languages to an unseen target language

multi15: From the mix of 10 non-target languages and 5 target languages to a particular target language

The first three conditions are regarded as cross-lingual adaptation.

4.2 LM fusion transfer

During adaptation, all parameters are copied from the seed language-independent S2S model, then training is continued toward a target language. We investigate improved adaptation methods by integrating the external LM during and/or after transfer learning from the seed model. Three methods are considered as follows:

Transfer + SF: Shallow fusion in Section 3.2.1 is conducted in the inference stage after adaptation.

Cold fusion transfer (CF-transfer): Cold fusion in Section 3.2.2 is conducted during adaptation. We integrate the external RNNLM from the start point of adaptation to a target language. The softmax layer is randomly initialized before adaptation due to the additional gating part.

Deep fusion transfer (DF-transfer): Deep fusion in Section 3.2.3 is conducted after adaptation. DF-transfer is composed of two stages: (1) adaptation by updating all parameters until convergence, and (2) fine-tuning only the gating part after integrating the external RNNLM. The softmax layer is randomly initialized before stage (2).

Experimental evaluation

We used data from the IARPA BABEL project and selected 10 languages as non-target languages for training the seed language-independent model: Cantonese, Bengali, Pashto, Turkish, Vietnamese, Haitian, Tamil, Kurmanji, Tok Pisin and Georgian, and 5 languages for adaptation: Assamese (AS), Swahili (SW), Lao (LA), Tagalog (TA) and Zulu (ZU). The full language pack (FLP) is used for all experiments except for Section 5.2.3, where the limited language pack (LLP), which consists of about 10% of FLP, is used for adaptation. We sampled 10% of the training data for each language as the validation set. In addition, we used the Librispeech corpus and the Corpus of Spontaneous Japanese (CSJ) as additional high resources.

We used the Kaldi toolkit for feature extraction. The input features were static 80-channel log-mel filterbank outputs appended with 3-dimensional pitch features, computed with a 25ms window shifted every 10ms. The features were normalized by the mean and the standard deviation of the whole training set. For the vocabulary, we used the universal character set including all characters from all languages, resulting in a vocabulary size of 5,353 classes including 17 language IDs, sos, eos, unk, and blank labels. For multilingual experiments, we prepended the corresponding language ID so that the decoder network can jointly identify the correct target language while recognizing speech.
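
As an illustration of the language-ID prefix, the sketch below builds a character-level target sequence. The bracketed tag format (for example `[SW]`), the `<space>` token, the function name, and the example phrase are assumptions for illustration only. It is also assumed that the ID is prepended to the output label sequence, and the actual label inventory is the universal character set described above.

```python
from typing import List


def build_target(text: str, lang_id: str) -> List[str]:
    """Prepend a language-ID token to the character sequence of a transcription."""
    # Spaces are kept as an explicit token so word boundaries survive character splitting.
    chars = ["<space>" if ch == " " else ch for ch in text]
    return [f"[{lang_id}]"] + chars


# Hypothetical Swahili transcription used only to show the layout of the target.
target = build_target("habari yako", "SW")
```

Because the ID is the first label the decoder must emit, language identification becomes part of the same sequence prediction task as recognition.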

Our encoder network is composed of two VGG-like CNN blocks, each followed by a max-pooling layer with a stride of 2 $\times$ 2, and 5 layers of bidirectional long short-term memory (BLSTM) with 1024 memory cells, which results in time reduction by a factor of 4. The decoder network consists of two layers of LSTM with 1024 memory cells. For both monolingual and multilingual experiments, we used the same architecture. Training was performed with a mini-batch size of 15 utterances using the Adadelta algorithm with an initial epsilon of $10^{-8}$ . Epsilon was decayed by a factor of 0.01 whenever the teacher-forcing accuracy on the validation set did not improve at the end of an epoch. Scheduled sampling with probability 0.4 and dropout for the encoder network with probability 0.2 were performed in all experiments during adaptation. We set the CTC weight during training and decoding to 0.5 and 0.3, respectively. We also set the beam width to 20 and the LM weight to 0.3.
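
The factor-of-4 time reduction follows from two stride-2 pooling steps along the time axis. A minimal sketch of the frame-count arithmetic, assuming each pooling step halves the length with floor rounding (the rounding mode and the function name are assumptions):

```python
def encoder_output_length(num_frames: int, num_pool_layers: int = 2, stride: int = 2) -> int:
    """Number of encoder time steps after repeated strided pooling along time."""
    length = num_frames
    for _ in range(num_pool_layers):
        length //= stride  # floor rounding assumed
    return length


# A 10-second utterance at a 10 ms frame shift has roughly 1000 frames.
frames = 1000
reduced = encoder_output_length(frames)  # 250 encoder steps
```

The BLSTM layers and the attention mechanism therefore operate on about a quarter of the input frames, which is where the computational saving comes from.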

For the RNNLM, we used two layers of LSTM with 650 memory cells. All RNNLMs were trained with transcriptions in the parallel data except for the experiments in Table 4. We used stochastic gradient descent (SGD) for RNNLM optimization. All networks are implemented with the ESPnet toolkit with the PyTorch backend.

5.2 Results

First, we present the results of the baseline monolingual end-to-end systems in Table 1. Our new systems (line 2) significantly outperformed the previously reported baseline. The gain mostly came from adding VGG blocks before the BLSTM encoder and one more decoder LSTM layer, though we also tuned other hyper-parameters. Next, changing the unit size of each LSTM layer from 320 to 1024 drastically improved the performance. This is surprising because increasing the number of parameters often makes the model overfit to a small amount of training data. Finally, shallow fusion with the monolingual RNNLM further boosted the performance although the RNNLM was trained with only the small amount of transcriptions. We use this setting as the default in the rest of the experiments.

We also built BLSTM-HMM hybrid systems for comparison. The BLSTM-HMM architecture includes 3 BLSTM layers, each with 512 memory cells and 300 projection units (increasing the unit size did not lead to any improvement). The BLSTM acoustic model was trained using the latency control technique with 22 past frames and 21 future frames. The acoustic model receives 40-dimensional filterbank features as input. An N-gram language model is built with the training transcriptions. The WERs of our end-to-end systems with shallow fusion are close to those of the hybrid system, with absolute differences of just 3.6% and 1.8% for Tagalog and Zulu, respectively.

5.2.2 Comparison of seed language-independent models

We compared the seed language-independent models for adaptation to target languages. All models were transferred, and shallow fusion with the corresponding monolingual RNNLM trained with the parallel data was performed. The results are shown in Table 2. The overall performance was significantly improved by transfer learning. The transferred S2S models achieved WER comparable to the BLSTM-HMM in Table 1 for Tagalog and outperformed it for Zulu. We can see that the multi10 model is generally better than the high2 model despite the smaller data size, and their combination (multi10+high2) gives a slight improvement. On the other hand, the multi15 model, which includes the target language, does not lead to further improvement even after fine-tuning. We can conclude that the diversity of languages is more important than the total amount of training data, and that 10 languages are almost sufficient for learning language-independent feature representations that generalize well to other languages. Since multi10 shows results competitive with multi10+high2 using only one third of the training data, we use multi10 as the seed model and investigate cross-lingual adaptation in the following experiments.

5.2.3 Effect of LM fusion transfer

The results of our proposed LM fusion transfer are given in Table 3. When training S2S models from scratch, there is no difference among the fusion methods. When transferring from the language-independent S2S model, significant improvement is observed by integrating the external RNNLM. Shallow fusion was more effective than when training the S2S models from scratch in Table 1 because the multilingual training led to generalization and the affinity for the external LM was enhanced. CF-transfer gave some improvement over transfer learning with shallow fusion for 3 target languages, but the effects of DF-transfer and CF-transfer are not significant. This is because the RNNLMs were trained only with text in the small parallel data, so linguistic context during adaptation was not so effective. However, CF-transfer in Tagalog outperformed the monolingual hybrid system in Table 1. Compared to the previous work using the same data, CF-transfer yielded a 21.6% relative gain on average. Furthermore, a 6.8% gain was achieved over transfer learning without the external LM.

To investigate the effect of additional text data, we evaluate LM fusion transfer with LLP on each target language ( $\sim$ 10 hours). The results are shown in Table 4. We used monolingual RNNLMs trained with LLP (parallel data) and FLP ( $\sim$ 50 hours), respectively. The latter setting, with a small speech data set ( $\sim$ 10 hours) and a larger text data set ( $\sim$ 50 hours), is regarded as a more realistic scenario in low-resource languages. When training S2S models from scratch, no model converged in our implementation even when reducing the unit sizes. The Babel corpus is mostly composed of conversational telephone speech (CTS), so it is difficult to optimize the S2S model from scratch with only around 10 hours of training data. In the transfer learning approach, all three fusion methods obtained significant gains by using the external LM except for deep fusion in Assamese. For the RNNLM trained with LLP, all fusion methods achieved a larger improvement than in Table 3. Interestingly, WER dropped significantly even when each RNNLM was trained with 10-hour data only, but all fusion methods show similar performance. In contrast, CF-transfer significantly outperformed simple transfer learning with shallow fusion on all 5 target languages when the RNNLM was trained with FLP, which is five times larger than LLP. Therefore, we can conclude that linguistic context is helpful for adaptation when additional text data is available. This shows that CF-transfer in Table 3 has the potential to surpass transfer learning with shallow fusion if we can access additional text data. (Since the BABEL rules allow only the provided data to be used for system training, we do not explore crawling text data from the web.) In summary, CF-transfer yielded relative gains of 10.4% and 2.3% on average compared to transfer learning without and with shallow fusion, respectively.
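
The relative gains quoted in this section can be read as relative WER reductions, which is an assumption about how they were computed. A small helper makes the arithmetic explicit. The function name is hypothetical, and the WER values in the example are illustrative and not taken from the tables.

```python
def relative_gain(baseline_wer: float, new_wer: float) -> float:
    """Relative WER reduction in percent: positive when new_wer is lower than baseline_wer."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return 100.0 * (baseline_wer - new_wer) / baseline_wer


# Illustrative values only: WER drops from 50.0% to 45.0%, a 5.0-point absolute gain.
gain = relative_gain(50.0, 45.0)
```

In this example a 5.0-point absolute reduction corresponds to a 10% relative gain, so relative figures should not be compared directly with the absolute WER differences reported for the hybrid systems.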

Conclusion

We explored the usage of linguistic context from the external LM during adaptation of the language-independent S2S model to target low-resource languages. We empirically compared various LM fusion methods and confirmed their effectiveness in resource-limited situations. We showed that cold fusion transfer is more effective than simply applying shallow fusion after adaptation when additional text is available, which means that linguistic context is also helpful in addition to acoustic adaptation. Our S2S model drastically narrowed the gap to the BLSTM-HMM hybrid system.