Multitask Learning with Low-Level Auxiliary Tasks for Encoder-Decoder Based Speech Recognition

Shubham Toshniwal, Hao Tang, Liang Lu, Karen Livescu

Introduction

Automatic speech recognition (ASR) has historically been addressed with modular approaches, in which multiple parts of the system are trained separately. For example, traditional ASR systems include components like frame classifiers, phonetic acoustic models, lexicons (which may or may not be learned from data), and language models. These components typically correspond to different levels of representation, such as frame-level triphone states, phones, and words. Breaking up the task into such modules makes it easy to train each of them separately, possibly on different data sets, and to study the effect of modifying each component separately.

Over time, ASR research has moved increasingly toward training multiple components of ASR systems jointly. Typically, such approaches involve training initial separate modules, followed by joint fine-tuning using sequence-level losses. Recently, completely integrated end-to-end training approaches, where all parameters are learned jointly using a loss at the final output level, have become viable and popular. End-to-end training is especially natural for deep neural network-based models, where the final loss gradient can be backpropagated through all layers. Typical end-to-end models are based on recurrent neural network (RNN) encoder-decoders or connectionist temporal classification (CTC)-based models.

End-to-end training is appealing because it is conceptually simple and allows all model parameters to contribute to the same final goal, and to do so in the context of all other model parameters. End-to-end approaches have also achieved impressive results in ASR as well as other domains. On the other hand, end-to-end training has some drawbacks: optimization can be challenging; the intermediate learned representations are not interpretable, making the system hard to debug; and the approach ignores potentially useful domain-specific information about intermediate representations, as well as existing intermediate levels of supervision.

Prior work on analyzing deep end-to-end models has found that different layers tend to specialize for different sub-tasks, with lower layers focusing on lower-level tasks and higher ones on higher-level tasks. This effect has been found in systems for speech processing as well as computer vision.

We propose an approach for deep neural ASR that aims to maintain the advantages of end-to-end approaches, while also including the domain knowledge and intermediate supervision used in modular systems. We use a multitask learning approach that combines the final task loss (in our case, log loss on the output labels) with losses corresponding to lower-level tasks (such as phonetic recognition) applied on lower layers. This approach is intended to encapsulate the intuitive and empirical observation that different layers encode different levels of information, and to encourage this effect more explicitly. In other words, while we want the end-to-end system to take input acoustics and produce output text, we also believe that at some appropriate intermediate layer, the network should do a good job at distinguishing more basic units like states or phones. Similarly, while end-to-end training need not require supervision at intermediate (state/phone) levels, if such supervision is available then our multitask approach can take advantage of it.

We demonstrate this approach on a neural attention-based encoder-decoder character-level ASR model. Our baseline model is inspired by prior work, and our lower-level auxiliary tasks are based on phonetic recognition and frame-level state classification. We find that applying an auxiliary loss at an appropriate intermediate layer of the encoder improves performance over the baseline.

Related Work

Multitask training has been studied extensively in the machine learning literature. Its application to deep neural networks has been successful in a variety of settings in speech and language processing. Most prior work combines multiple losses applied at the final output layer of the model, such as joint Mandarin character and phonetic recognition, and joint CTC and attention-based training for English ASR. Our work differs from this prior work in that our losses relate to different types of supervision and are applied at different levels of the model.

The idea of using low-level supervision at lower levels was, to our knowledge, first introduced by Søgaard & Goldberg for natural language processing tasks, and has since been extended. The closest work to ours is the approach of Rao and Sak, which uses phoneme labels for training a multi-accent CTC-based ASR system in a multitask setting. Here we study the approach in the context of encoder-decoder models, and we compare a number of low-level auxiliary losses.

Models

The multitask approach we propose can in principle be applied to any type of deep end-to-end model. Here we study the approach in the context of attention-based deep RNNs. Below we describe the baseline model, followed by the auxiliary low-level training tasks.

The model is based on attention-enabled encoder-decoder RNNs. The speech encoder reads in acoustic features $\bm{{x}}=(\bm{{x}}_{1},\dots,\bm{{x}}_{T})$ and outputs a sequence of high-level features (hidden states) $\bm{{h}}$, which the character decoder attends to in generating the output character sequence $\bm{{y}}=(y_{1},\dots,y_{K})$, as shown in Figure 1 (the attention mechanism and a pyramidal LSTM layer are not shown in the figure for simplicity).

The speech encoder is a deep pyramidal bidirectional Long Short-Term Memory (BiLSTM) network. In the first layer, a BiLSTM reads in acoustic features $\bm{{x}}$ and outputs $\bm{{h^{(1)}}}=(\bm{{h}}^{(1)}_{1},\dots,\bm{{h}}^{(1)}_{T})$ given by:

where $i\in\{1,\dots,T\}$ denotes the index of the timestep, and $f^{(1)}(\cdot)$ and $b^{(1)}(\cdot)$ denote the first-layer forward and backward LSTMs, respectively. (For brevity we exclude the LSTM equations; the details can be found, e.g., in Zaremba et al.)

The first layer output $\bm{{h^{(1)}}}=(\bm{{h}}^{(1)}_{1},\dots,\bm{{h}}^{(1)}_{T})$ is then processed as follows:

where $f^{(j)}$ and $b^{(j)}$ denote the forward and backward running LSTMs at layer $j$. Following prior work, we use pyramidal layers to reduce the time resolution of the final state sequence $\bm{{h^{(4)}}}$ by a factor of $2^{3}=8$. This reduction brings the input sequence length, initially $T=|\bm{{x}}|$, where $|\cdot|$ denotes the length of a sequence of vectors, close to the output sequence length $K=|\bm{{y}}|$ (for Switchboard, the average number of frames per character is about 7). For simplicity, we will refer to $\bm{{h}}^{(4)}$ as $\bm{{h}}$.
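The effect of the pyramidal layers on sequence length can be checked with simple arithmetic. The following is a minimal sketch, assuming that layer 1 keeps the input length and that each later layer halves the number of frames, rounding up; the function name `pyramid_lengths` is hypothetical and only for illustration.

```python
import math
from typing import List


def pyramid_lengths(num_frames: int, num_layers: int = 4) -> List[int]:
    """Number of frames at each encoder layer.

    Assumption: layer 1 keeps the input length T, and every layer above it
    halves the length of the layer below (rounding up).
    """
    lengths = [num_frames]
    for _ in range(num_layers - 1):
        lengths.append(math.ceil(lengths[-1] / 2))
    return lengths


# A 700-frame utterance: the top layer has about T / 8 states.
print(pyramid_lengths(700))
```

With about 7 frames per character, an 8-fold reduction leaves the top-layer state sequence roughly as long as the character sequence the decoder has to produce.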

1.2 Character Decoder

The character decoder is a single-layer LSTM that predicts a sequence of characters $\bm{{y}}$ as follows:

$$P(\bm{{y}}|\bm{{x}}) = P(\bm{{y}}|\bm{{h}}) = \prod_{t=1}^{K} P(y_{t}|\bm{{h}},\bm{{y_{<t}}})$$

The conditional dependence on the encoder state vectors $\bm{{h}}$ is represented by context vector $\bm{{c_{t}}}$ , which is a function of the current decoder hidden state and the encoder state sequence:

where the vectors $\bm{{v}},\bm{{b_{a}}}$ and the matrices $\bm{{W_{1}}},\bm{{W_{2}}}$ are learnable parameters; $\bm{{d}}_{t}$ is the hidden state of the decoder at time step $t$. The time complexity of calculating the context vector $\bm{{c_{t}}}$ for every time step is $O(|\bm{{h}}|)$; reducing the resolution on the encoder side is crucial to reducing this runtime.
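To make the per-step cost concrete, the sketch below computes one context vector in plain Python. It assumes the standard additive scoring function $u_{i}=\bm{v}^{\top}\tanh(\bm{W_{1}}\bm{h}_{i}+\bm{W_{2}}\bm{d}_{t}+\bm{b_{a}})$ followed by a softmax over encoder positions, which matches the parameters named above but is an assumption about the exact form; the helper names are hypothetical.

```python
import math
from typing import List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def matvec(m: Matrix, x: Vector) -> Vector:
    """Matrix-vector product for row-major nested lists."""
    return [sum(a * b for a, b in zip(row, x)) for row in m]


def attention_context(h: List[Vector], d_t: Vector, W1: Matrix, W2: Matrix,
                      b_a: Vector, v: Vector) -> Tuple[Vector, Vector]:
    """Return (attention weights, context vector) for one decoder step.

    Assumed score: u_i = v . tanh(W1 h_i + W2 d_t + b_a).
    """
    w2d = matvec(W2, d_t)  # shared across all encoder positions
    scores = []
    for h_i in h:  # a single pass over the encoder states: O(|h|) per step
        w1h = matvec(W1, h_i)
        hidden = [math.tanh(a + b + c) for a, b, c in zip(w1h, w2d, b_a)]
        scores.append(sum(vk * hk for vk, hk in zip(v, hidden)))
    # Softmax over encoder positions, shifted by the max for stability.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Context vector: attention-weighted sum of the encoder states.
    dim = len(h[0])
    context = [sum(w * h_i[k] for w, h_i in zip(weights, h)) for k in range(dim)]
    return weights, context
```

The loop over `h` runs once per decoder step, so halving the encoder output length halves the attention cost of every step.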

The hidden state of the decoder, $\bm{{d}}_{t}$ , which captures the previous character context $\bm{{y_{<t}}}$ , is given by:

and the character decoder loss function is then defined as the log loss on the output characters, $-\log P(\bm{{y}}|\bm{{x}})$.

2 Low-Level Auxiliary Tasks

As shown in Figure 1, we explore two types of auxiliary labels for multitask learning: phonemes and sub-phonetic states. We hypothesize that the intermediate representations needed for sub-phonetic state classification are learned at the lowest layers of the encoder, while representations for phonetic prediction may be learned at a somewhat higher level.

We use phoneme-level supervision obtained from the word-level transcriptions and pronunciation dictionary. We consider two types of phoneme transcription loss:

Phoneme Decoder Loss: Similar to the character decoder described above, we can attach a phoneme decoder to the speech encoder as well. The phoneme decoder has exactly the same mathematical form as the character decoder, but with a phoneme label vocabulary at the output. Specifically, the phoneme decoder loss is defined as the log loss on the phoneme labels, $-\log P(\bm{{z}}|\bm{{x}})$, where $\bm{{z}}$ is the target phoneme sequence. Since this decoder can be attached at any depth of the four-layer encoder described above, we have four depths to choose from. We attach the phoneme decoder to layer 3 of the speech encoder, and also attach it to layer 4 (the final layer) for comparison with a more typical multitask training approach.

CTC Loss: A CTC output layer can also be added to various layers of the speech encoder. This involves adding an extra softmax output layer on top of the chosen intermediate layer of the encoder, and applying the CTC loss to the output of this softmax layer. Specifically, let $\bm{{z}}$ be the target phoneme sequence, and $k$ be the speech encoder layer where the loss is applied. The probability of $\bm{{z}}$ given the input sequence is

$$P(\bm{{z}}|\bm{{x}}) = \sum_{\bm{{\pi}}\in\mathcal{B}^{-1}(\bm{{z}})} \prod_{j=1}^{J} P(\pi_{j}|\bm{{h}}^{(k)}_{j})$$

where $\mathcal{B}(\cdot)$ removes repeated symbols and blank symbols, $\mathcal{B}^{-1}$ is the pre-image of $\mathcal{B}$, $J$ is the number of frames at layer $k$, and $P(\pi_{j}|\bm{{h}}^{(k)}_{j})$ is computed by a softmax function. The final CTC objective is the negative log of this probability.

The CTC objective computation requires the output length to be less than the input length, i.e., $|\bm{{z}}|<J$. In our case the encoder reduces the time resolution by a factor of 8 between the input and the top layer, making the top layer occasionally shorter than the number of phonemes in an utterance. We therefore cannot apply this loss to the topmost layer, and use it only at the third layer. (In fact, even at the third layer we find occasional instances, about 10 utterances in our training set, where the hidden state sequence is shorter than the phoneme sequence, due to sequences of phonemes of duration less than 4 frames each. Anecdotally, these examples appear to correspond to incorrect training utterance alignments.)
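The collapsing map $\mathcal{B}$ and the length condition above can be illustrated in a few lines. This is a minimal sketch, assuming a hypothetical blank symbol `<b>` and assuming that each encoder layer above the first halves the frame count (rounding up); it is not the training code.

```python
import math
from typing import List, Sequence

BLANK = "<b>"  # hypothetical blank symbol, chosen for this example only


def collapse(path: Sequence[str]) -> List[str]:
    """The map B: merge runs of repeated symbols, then drop blanks."""
    out = []
    prev = None
    for sym in path:
        if sym != prev and sym != BLANK:
            out.append(sym)
        prev = sym
    return out


def frames_at_layer(num_frames: int, layer: int) -> int:
    """Frames at an encoder layer, assuming halving (rounded up) above layer 1."""
    for _ in range(layer - 1):
        num_frames = math.ceil(num_frames / 2)
    return num_frames


def ctc_applicable(target: Sequence[str], num_frames: int, layer: int) -> bool:
    """The length condition stated in the text: |z| < J at the chosen layer."""
    return len(target) < frames_at_layer(num_frames, layer)


# A blank between two identical symbols keeps them distinct after collapsing.
print(collapse(["k", "k", BLANK, "ae", "ae", BLANK, BLANK, "t"]))
print(collapse(["a", BLANK, "a"]))
```

For a hypothetical 40-frame utterance with 7 phonemes, layer 3 has 10 frames and the condition holds, while layer 4 has only 5 frames and it fails, which is the situation that rules out the topmost layer.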

2.2 State-Level Auxiliary Task

Sub-phonetic state labels provide another type of low-level supervision that can be borrowed from traditional modular HMM-based approaches. We apply this type of supervision at the frame level, as shown in Figure 1, using state alignments obtained from a standard HMM-based system. We apply this auxiliary task at layer 2 of the speech encoder. The probability of a sequence of states $\bm{{s}}$ is defined as

$$P(\bm{{s}}|\bm{{x}}) = \prod_{m=1}^{M} P(s_{m}|\bm{{h}}^{(2)}_{m})$$

where $P(s_{m}|\bm{{h}}^{(2)}_{m})$ is computed by a softmax function, and $M$ is the number of frames at layer 2 (in this case $\lceil T/2\rceil$). Since we use this task at layer 2, we subsample the state labels to match the reduced resolution. The final state-level loss is the negative log of this probability.
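The label subsampling can be sketched as follows. The text does not say which frame of each pair supplies the label, so keeping the first frame of every pair is an assumption made here for illustration; the function name is hypothetical.

```python
from typing import List, Sequence


def subsample_labels(state_labels: Sequence[int], factor: int = 2) -> List[int]:
    """Keep every `factor`-th frame label, starting from the first frame.

    For factor 2 and T input labels this yields ceil(T / 2) labels, matching
    the number of frames at layer 2.
    """
    return list(state_labels[::factor])


# Five frame-level state ids become three layer-2 targets.
print(subsample_labels([11, 11, 42, 42, 7]))
```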

2.3 Training Loss

The final loss function that we minimize is the average of the losses involved. For example, in the case where we use the character and phoneme decoder losses and the state-level loss, the loss would be one third of the sum of these three losses.
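As a small illustration of the averaging, the sketch below combines whichever task losses are active in a given configuration. The task names and numeric values are made up for the example.

```python
from typing import Dict


def multitask_loss(losses: Dict[str, float]) -> float:
    """Unweighted mean of the per-task losses active in this configuration."""
    if not losses:
        raise ValueError("at least one task loss is required")
    return sum(losses.values()) / len(losses)


# Hypothetical per-batch loss values for three tasks.
print(multitask_loss({"char_decoder": 1.2, "phoneme_decoder": 0.9, "state": 2.1}))
```

Because the mean is unweighted, each auxiliary task contributes as much to the gradient scale as the character decoder; task-specific weights would be a different design and are not used in the description above.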

Experiments

We use the Switchboard corpus (LDC97S62), which contains roughly 300 hours of conversational telephone speech, as our training set. We reserve the first 4K utterances as a development set. Since the training set has a large number of repetitions of short utterances (“yeah”, “uh-huh”, etc.), we remove duplicates beyond a count threshold of 300. The final training set has about 192K utterances. For evaluation, we use the HUB5 Eval2000 data set (LDC2002S09), consisting of two subsets: Switchboard (SWB), which is similar in style to the training set, and CallHome (CHE), which contains unscripted conversations between close friends and family.
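One way to read the duplicate-removal step is as a cap on how many utterances may share the same transcript. The sketch below follows that reading; keeping the first occurrences in corpus order is an assumption, and the function name and data layout are hypothetical.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def cap_duplicates(utterances: Iterable[Tuple[str, str]],
                   max_count: int = 300) -> List[Tuple[str, str]]:
    """Keep at most `max_count` utterances per distinct transcript.

    `utterances` holds (utterance_id, transcript) pairs; the first
    `max_count` occurrences of each transcript are kept, in order.
    """
    seen = Counter()
    kept = []
    for utt_id, transcript in utterances:
        if seen[transcript] < max_count:
            kept.append((utt_id, transcript))
        seen[transcript] += 1
    return kept


# Toy example with a cap of 2 instead of 300.
toy = [("u1", "yeah"), ("u2", "yeah"), ("u3", "uh-huh"), ("u4", "yeah")]
print(cap_duplicates(toy, max_count=2))
```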

For input features, we use 40-dimensional log-mel filterbank features along with their deltas, normalized with per-speaker mean and variance normalization. The phoneme labels for the auxiliary task are generated by mapping words to their canonical pronunciations, using the lexicon in the Kaldi Switchboard training recipe. The HMM state labels were obtained via forced alignment with a baseline HMM/DNN hybrid system built using the Kaldi NNet1 recipe. The HMM/DNN has 8396 tied states, which makes the frame-level softmax costly for multitask learning. We use an importance sampling technique to reduce this cost.

The speech encoder is a 4-layer pyramidal bidirectional LSTM, resulting in an 8-fold reduction in time resolution. We use 256 hidden units in each direction of each layer. The decoder for all tasks is a single-layer LSTM with 256 hidden units. We represent the decoders’ output symbols (both characters and, at training time, phonemes) using 256-dimensional embedding vectors. At test time, we use a greedy decoder (beam size = 1) to generate the character sequence. The character with the maximum posterior probability is chosen at every time step and fed as input into the next time step. The decoder stops after encountering the “EOS” (end-of-sentence) symbol. We use no explicit language model.
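The greedy decoding loop described above can be sketched independently of the network. Here `step` stands in for one decoder step (attention plus LSTM plus softmax) and is a hypothetical callable; the start symbol `SOS` and the `max_len` safety limit are likewise assumptions added for the example.

```python
from typing import Callable, Dict, List, Tuple

EOS = "EOS"
# step(previous_symbol, state) -> (posterior over symbols, new state)
StepFn = Callable[[str, object], Tuple[Dict[str, float], object]]


def greedy_decode(step: StepFn, init_state: object,
                  start_symbol: str = "SOS", max_len: int = 200) -> List[str]:
    """Beam size 1: pick the most probable character and feed it back in."""
    output = []
    prev, state = start_symbol, init_state
    for _ in range(max_len):
        posteriors, state = step(prev, state)
        prev = max(posteriors, key=posteriors.get)
        if prev == EOS:  # stop at the end-of-sentence symbol
            break
        output.append(prev)
    return output


def toy_step(prev: str, state: int) -> Tuple[Dict[str, float], int]:
    """Stand-in decoder that spells 'hi' and then emits EOS."""
    script = ["h", "i", EOS]
    probs = {symbol: 0.1 for symbol in script}
    probs[script[state]] = 0.8
    return probs, state + 1


print("".join(greedy_decode(toy_step, 0)))
```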

We train all models using Adam with a minibatch size of 64 utterances. The initial learning rate is 1e-3 and is decayed by a factor of 0.95 whenever the log-likelihood of the development data, calculated after every 1K updates, decreases relative to its previous value. All models are trained for 75K gradient updates (about 25 epochs) with early stopping. To further control overfitting we (a) use dropout at a rate of 0.1 on the output of all LSTM layers, and (b) sample the previous step’s prediction in the character decoder with a constant probability of 0.1.
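The learning rate schedule can be sketched as a function of the sequence of development-set evaluations. The sketch assumes the decay is triggered when the development log loss (negative log-likelihood) gets worse relative to the previous evaluation; the function name and the example loss values are hypothetical.

```python
from typing import List


def learning_rate_schedule(dev_losses: List[float], initial_lr: float = 1e-3,
                           decay: float = 0.95) -> List[float]:
    """Learning rate in effect after each development-set evaluation.

    `dev_losses` holds one development log loss per evaluation (one every
    1K updates in the setup above). The rate is multiplied by `decay`
    whenever the loss is worse than at the previous evaluation.
    """
    lr = initial_lr
    rates = []
    prev = None
    for loss in dev_losses:
        if prev is not None and loss > prev:
            lr *= decay
        rates.append(lr)
        prev = loss
    return rates


# Two of these five evaluations are worse than their predecessor.
print(learning_rate_schedule([2.0, 1.8, 1.9, 1.7, 1.75]))
```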

2 Results

We evaluate performance using word error rate (WER). We report results on the combined Eval2000 test set as well as separately on the SWB and CHE subsets. We also report character error rates (CER) on the development set.
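For reference, both metrics are edit-distance ratios: the minimum number of substitutions, insertions, and deletions needed to turn the hypothesis into the reference, divided by the reference length. The sketch below computes them for a single utterance pair; it is an illustration rather than the scoring tool used for the reported numbers, and corpus-level scores are normally obtained by summing errors and reference lengths over all utterances.

```python
from typing import Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def wer(ref: str, hyp: str) -> float:
    """Word error rate for one utterance (reference must be non-empty)."""
    ref_words = ref.split()
    return edit_distance(ref_words, hyp.split()) / len(ref_words)


def cer(ref: str, hyp: str) -> float:
    """Character error rate for one utterance (reference must be non-empty)."""
    return edit_distance(list(ref), list(hyp)) / len(ref)


print(wer("yeah i think so", "yeah i thing so"))
print(cer("think", "thing"))
```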

Development set results are shown in Table 1. We refer to the baseline model as “Enc-Dec” and the models with multitask training as “Enc-Dec + [auxiliary task]-[layer]”. Adding phoneme recognition as an auxiliary task at layer 3, either with a separate LSTM decoder or with CTC, reduces both the character error rates and the final word error rates.

In order to determine whether the improved performance is a basic multitask training effect or is specific to the low-level application of the loss, we compare these results to those of adding the phoneme decoder at the topmost layer (Enc-Dec + PhoneDec-4). The top-layer application of the phoneme loss produces worse performance than having the supervision at the lower (third) layer. Finally, we obtain the best results by adding both phoneme decoder supervision at the third layer and frame-level state supervision at the second layer (Enc-Dec + PhoneDec-3 + State-2). The results support the hypothesis that lower-level supervision is best provided at lower layers. Table 2 provides test set results, showing the same pattern of improvement on both the SWB and CHE subsets. For comparison, we also include a variety of other recent results with neural end-to-end approaches on this task. Our baseline model has better performance than the most similar previous encoder-decoder result. With the addition of the low-level auxiliary task training, our models are competitive with all of the previous end-to-end systems that do not use a language model.

Figure 2 shows the training set log-likelihood for the baseline model and two multitask variants. The plot suggests that multitask training helps with optimization (improving the training error). Training error is very similar for both multitask models, while the development set performance is better for one of them (see Table 1), suggesting that there may also be an improved generalization effect and not only improved optimization.

Conclusion

We have presented a multitask training approach for deep end-to-end ASR models in which lower-level task losses are applied at lower levels, and we have explored this approach in the context of attention-based encoder-decoder models. Results on Switchboard and CallHome show consistent improvements over baseline attention-based models and support the hypothesis that lower-level supervision is more effective when applied at lower layers of the deep model. We have compared several types of auxiliary tasks, obtaining the best performance with a combination of a phoneme decoder and frame-level state loss. Analysis of model training and performance suggests that the addition of auxiliary tasks can help in either optimization or generalization.

Future work includes studying a broader range of auxiliary tasks and model configurations. For example, it would be interesting to study even deeper models and word-level output, which would allow for more options of intermediate tasks and placements of the auxiliary losses. Viewing the approach more broadly, it may be fruitful to also consider higher-level task supervision, incorporating syntactic or semantic labels, and to view the ASR output as an intermediate output in a more general hierarchy of tasks.

Acknowledgements

We are grateful to William Chan for helpful discussions, and to the speech group at TTIC, especially Shane Settle, Herman Kamper, Qingming Tang, and Bowen Shi for sharing their data processing code. This research was supported by a Google faculty research award.