Joint CTC-Attention based End-to-End Speech Recognition using Multi-task Learning

Suyoun Kim, Takaaki Hori, Shinji Watanabe

Introduction

End-to-end speech recognition is a recently proposed approach that directly transcribes speech to text without requiring a predefined alignment between acoustic frames and characters. The traditional hybrid approach, Deep Neural Network - Hidden Markov Model (DNN-HMM), factorizes the system into several separately trained components (i.e., an acoustic model, a context-dependent phone transducer, a pronunciation model, and a language model) based on conditional independence assumptions (including Markov assumptions) and approximations. Unlike such hybrid approaches, the end-to-end model learns the mapping from acoustic frames to characters in one step towards the final objective of interest, and attempts to rectify the suboptimality that arises from the disjoint training procedure.

Recent work on end-to-end speech recognition can be categorized into two main approaches: Connectionist Temporal Classification (CTC) and the attention-based encoder-decoder. Both methods address the problem of variable-length input and output sequences. The key idea of CTC is to use an intermediate label representation that allows repetitions of labels and occurrences of a blank label, which indicates that no label is output. The CTC loss can be calculated efficiently with the forward-backward algorithm, but CTC still predicts a target for every frame and assumes that the targets are conditionally independent of each other.

The other approach, the attention-based encoder-decoder, directly learns a mapping from acoustic frames to character sequences. At each output time step, the model emits a character conditioned on the inputs and the history of the target characters. Since the attention model does not use any conditional independence assumption, it has often been shown to achieve a lower Character Error Rate (CER) than CTC when no external language model is used. However, in real-environment speech recognition tasks, the model shows poor results because the alignment estimated by the attention mechanism is easily corrupted by noise. Another issue is that the model is hard to train from scratch because of misalignment on longer input sequences. A windowing technique is therefore commonly used to limit the area explored by the attention mechanism, but several windowing parameters need to be determined manually depending on the training data.

To overcome the above misalignment issues, this paper proposes a novel end-to-end speech recognition method that improves performance and accelerates learning by using a joint CTC-attention model within the multi-task learning framework. The key to our approach is a shared-encoder representation trained by both the CTC and attention model objectives simultaneously. We hypothesize that the weakness of the attention model is due to the lack of the left-to-right constraints used in DNN-HMM and CTC, which makes it difficult to train the encoder network with proper alignments in the case of noisy data and/or long input sequences. Our proposed method improves performance by rectifying the alignment problem using the CTC loss function based on the forward-backward algorithm. In addition to improving performance, our framework significantly speeds up learning and converges faster. We evaluate our model on the WSJ and CHiME-4 tasks, and show that our system outperforms both the CTC and attention models in CER and learning speed.

JOINT CTC-ATTENTION MECHANISM

In this section, we review CTC in Section 2.1 and the attention-based encoder-decoder in Section 2.2. Both address variable-length sequences: $T$-length input frames, $\bm{x}=(x_{1},\cdots,x_{T})$, and $U$-length output characters, $\bm{y}=(y_{1},\cdots,y_{U})$, where $y_{u}\in\{1,\cdots,K\}$ and $K$ is the number of distinct labels. Our joint CTC-attention based end-to-end framework is then described in Section 2.3.

The key idea of CTC is to use an intermediate label representation $\bm{\pi}=(\pi_{1},\cdots,\pi_{T})$, allowing repetitions of labels and occurrences of a blank label ( $-$ ), which represents the special emission without labels, i.e., $\pi_{t}\in\{1,\cdots,K\}\cup\{-\}$. CTC trains the model to maximize $P(\bm{y}|\bm{x})$, the probability summed over all possible label sequences $\Phi(\bm{y^{\prime}})$:

$$P(\bm{y}|\bm{x})=\sum_{\bm{\pi}\in\Phi(\bm{y^{\prime}})}P(\bm{\pi}|\bm{x}) \tag{1}$$

where $\bm{y^{\prime}}$ is a modified label sequence of $\bm{y}$, made by inserting a blank symbol between each pair of labels and at the beginning and the end, to allow blanks in the output (e.g., $\bm{y}=(c,a,t),\bm{y^{\prime}}=(-,c,-,a,-,t,-)$ ).
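The construction of $\bm{y^{\prime}}$ can be illustrated with a minimal sketch. The function names are hypothetical, and the `collapse` mapping (merge repeated labels, then drop blanks) is the standard CTC many-to-one map, assumed here rather than spelled out in the text above.

```python
BLANK = "-"  # blank symbol, written "-" as in the text


def insert_blanks(labels):
    """Build y' from y: one blank before, between, and after the labels."""
    augmented = [BLANK]
    for label in labels:
        augmented += [label, BLANK]
    return augmented


def collapse(path):
    """Map a frame-level path to labels: merge repeats, then drop blanks."""
    merged = [p for i, p in enumerate(path) if i == 0 or p != path[i - 1]]
    return [p for p in merged if p != BLANK]


augmented = insert_blanks(list("cat"))   # the (c, a, t) example from the text
labels = collapse(list("cc-aa--t"))      # one of many paths for "cat"
```

A blank between two identical labels keeps them distinct after collapsing, which is why `collapse(list("a-a"))` yields two labels while `collapse(list("aa"))` yields one.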

CTC is generally applied on top of Recurrent Neural Networks (RNNs). Each RNN output unit is interpreted as the probability of observing the corresponding label at a particular time. The probability of a label sequence $P(\bm{\pi}|\bm{x})$ is modeled as conditionally independent across frames, i.e., as the product of the network outputs:

$$P(\bm{\pi}|\bm{x})\approx\prod_{t=1}^{T}P(\pi_{t}|\bm{x})=\prod_{t=1}^{T}q_{t}(\pi_{t}) \tag{2}$$

where $q_{t}(\pi_{t})$ denotes the softmax activation of label $\pi_{t}$ in the RNN output layer $q$ at time $t$.
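As a small illustration of the conditional independence assumption, the sketch below computes the probability of one frame-level path as a product of per-frame softmax outputs. The logits and the label indexing (0 for blank, 1 and 2 for two characters) are hypothetical values chosen only for the example.

```python
import math


def softmax(scores):
    """Numerically stable softmax over one frame's output scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def path_probability(q, path):
    """Product over frames t of q[t][path[t]]."""
    prob = 1.0
    for t, label in enumerate(path):
        prob *= q[t][label]
    return prob


# Hypothetical network outputs: T = 3 frames, 3 output units (0 = blank).
logits = [[0.0, 1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
q = [softmax(row) for row in logits]
p = path_probability(q, [1, 0, 2])  # path (label 1, blank, label 2)
```

Because each frame is normalized independently, the probabilities of all possible paths of length $T$ sum to one.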

The CTC loss to be minimized is defined as the negative log likelihood of the ground truth character sequence $\bm{y^{*}}$, i.e.,

$$\mathcal{L}_{\text{CTC}}\triangleq-\ln P(\bm{y^{*}}|\bm{x}) \tag{3}$$

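The probability $P(\bm{y}|\bm{x})$ can be computed efficiently using the forward-backward algorithm as

$$P(\bm{y}|\bm{x})=\sum_{u=1}^{|\bm{y^{\prime}}|}\frac{\alpha_{t}(u)\beta_{t}(u)}{q_{t}(y^{\prime}_{u})} \tag{4}$$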

where $\alpha_{t}(u)$ is the forward variable, representing the total probability of all possible prefixes ( $y^{\prime}_{1:u}$ ) that end with the $u$ -th label, and $\beta_{t}(u)$ is the backward variable of all possible suffixes ( $y^{\prime}_{u:U}$ ) that start with the $u$ -th label. The network can then be trained with standard backpropagation by taking the derivative of the loss function with respect to $q_{t}(k)$ for any $k$ label including the blank.
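The forward half of this computation can be sketched as follows. This is a minimal illustration, not the authors' implementation: it works in the probability domain (a practical version would use log probabilities), the function names are hypothetical, and the output matrix `q` holds made-up values. A brute-force enumeration over all paths is included only to check the recursion on a tiny example.

```python
from itertools import product


def ctc_forward(q, labels, blank=0):
    """P(y|x) via the forward recursion over the blank-augmented sequence y'."""
    ext = [blank]
    for label in labels:
        ext += [label, blank]
    T, S = len(q), len(ext)
    alpha = [0.0] * S
    alpha[0] = q[0][ext[0]]
    if S > 1:
        alpha[1] = q[0][ext[1]]
    for t in range(1, T):
        prev, alpha = alpha, [0.0] * S
        for s in range(S):
            total = prev[s]
            if s >= 1:
                total += prev[s - 1]
            # A blank may be skipped only between two different labels.
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                total += prev[s - 2]
            alpha[s] = total * q[t][ext[s]]
    # Valid paths end in the last label or in the final blank.
    return alpha[-1] + (alpha[-2] if S > 1 else 0.0)


def ctc_brute_force(q, labels, blank=0):
    """Same quantity by enumerating every path; feasible only for tiny T."""
    T, K = len(q), len(q[0])
    total = 0.0
    for path in product(range(K), repeat=T):
        merged = [p for i, p in enumerate(path) if i == 0 or p != path[i - 1]]
        if [p for p in merged if p != blank] == list(labels):
            prob = 1.0
            for t, p in enumerate(path):
                prob *= q[t][p]
            total += prob
    return total


# Hypothetical softmax outputs: T = 4 frames, 3 output units (0 = blank).
q = [[0.6, 0.3, 0.1], [0.2, 0.5, 0.3], [0.1, 0.2, 0.7], [0.5, 0.25, 0.25]]
likelihood = ctc_forward(q, [1, 2])
```

The recursion costs on the order of $T\cdot|\bm{y^{\prime}}|$ operations, whereas the enumeration grows exponentially in $T$.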

Since CTC does not explicitly model inter-label dependencies, owing to the conditional independence assumption in Eq. (2), its ability to model character-level language information is limited. Therefore, lexicons or language models are commonly incorporated, as in the hybrid framework.

2.2 Attention-based encoder-decoder

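Unlike the CTC approach, the attention model directly predicts each target without requiring an intermediate representation or any conditional independence assumptions, improving CER compared to CTC when no external language model is used. The model emits the distribution of each label $y_{u}$ conditioned on the previous labels according to the following recursive equations:

$$P(\bm{y}|\bm{x})=\prod_{u}P(y_{u}|\bm{x},y_{1:u-1}) \tag{5}$$

$$\bm{h}=\text{Encoder}(\bm{x}) \tag{6}$$

$$y_{u}\sim\text{AttentionDecoder}(\bm{h},y_{1:u-1}) \tag{7}$$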

The framework consists of two RNNs, Encoder and AttentionDecoder, so that it is able to learn sequences of two different lengths based on the cross-entropy criterion. Encoder transforms $\bm{x}$ into a high-level representation $\bm{h}=(h_{1},\cdots,h_{L})$ in Eq. (6); AttentionDecoder then produces the probability distribution over characters, $y_{u}$, conditioned on $\bm{h}$ and all the characters seen previously, $y_{1:u-1}$, in Eq. (7). $L$ is the number of frames of the Encoder output, and $L<T$. Here, a special start-of-sentence (sos)/end-of-sentence (eos) token is added to the target set, so that the decoder completes the generation of the hypothesis when (eos) is emitted. The loss function of the attention model is computed from Eq. (5) as:

$$\mathcal{L}_{\text{Attention}}\triangleq-\ln P(\bm{y^{*}}|\bm{x})=-\sum_{u}\ln P(y^{*}_{u}|\bm{x},y^{*}_{1:u-1}) \tag{8}$$

where $y^{*}_{1:u-1}$ is the ground truth of the previous characters.
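A minimal sketch of this loss follows, assuming it is the negative log likelihood of the ground-truth sequence factorized over output steps, as Eq. (5) and the cross-entropy criterion suggest. The per-step distributions are hypothetical numbers standing in for decoder outputs that were already conditioned on the ground-truth history.

```python
import math


def attention_loss(step_distributions, targets):
    """Negative log likelihood of the targets, summed over output steps.

    step_distributions[u] is the decoder's distribution over labels at
    step u, computed with the ground-truth history y*_{1:u-1} as input.
    """
    return -sum(math.log(dist[y]) for dist, y in zip(step_distributions, targets))


# Hypothetical decoder outputs over 3 labels for a 2-step target sequence.
dists = [[0.7, 0.2, 0.1], [0.1, 0.1, 0.8]]
loss = attention_loss(dists, [0, 2])
```

Summing log probabilities is equivalent to taking the log of the product in Eq. (5), so the loss here equals $-\ln(0.7\times 0.8)$.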

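The energy $e_{u,l}$ and the attention weight $a_{u,l}$ for output step $u$ and encoder frame $l$ are computed as

$$e_{u,l}=\begin{cases}w^{\top}\tanh(Ws_{u-1}+Vh_{l}+b) & \text{(content-based)}\\ w^{\top}\tanh(Ws_{u-1}+Vh_{l}+Uf_{u,l}+b) & \text{(location-based)}\end{cases} \tag{9}$$

$$f_{u}=F*a_{u-1} \tag{10}$$

$$a_{u,l}=\frac{\exp(\gamma e_{u,l})}{\sum_{l^{\prime}}\exp(\gamma e_{u,l^{\prime}})} \tag{11}$$

where $w,W,V,F,U,b$ are trainable parameters, $s_{u-1}$ is the decoder state, $\gamma$ is the sharpening factor, and $*$ denotes convolution.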

The attention weights $a_{u}$ are computed as the softmax of the energies $e_{u,l}$, which come from one of two types of attention mechanism in Eq. (9): content-based or location-based. Both depend on the decoder state, $s_{u-1}$, and the content of the input, $h_{l}$. The location-based attention mechanism additionally uses convolutional feature vectors $f_{u,l}$, extracted from the previous attention $a_{u-1}$ by convolving it with the matrix $F$ along the time axis.
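The effect of the sharpening factor can be sketched as below. Two assumptions are made that the text does not state explicitly: sharpening multiplies the energies by $\gamma$ before the softmax, and the context vector is the attention-weighted sum of the encoder outputs. The energies and encoder frames are hypothetical values.

```python
import math


def attention_weights(energies, gamma=1.0):
    """Softmax over encoder frames of gamma-scaled energies (assumed form)."""
    scaled = [gamma * e for e in energies]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    return [e / z for e in exps]


def context_vector(weights, h):
    """Attention-weighted sum of encoder outputs h[l] (assumed form)."""
    dim = len(h[0])
    return [sum(w * frame[d] for w, frame in zip(weights, h)) for d in range(dim)]


# Hypothetical energies for one output step over L = 4 encoder frames.
energies = [0.1, 1.2, 0.3, -0.5]
soft = attention_weights(energies, gamma=1.0)
sharp = attention_weights(energies, gamma=2.0)  # gamma = 2 as in the experiments
context = context_vector([0.25, 0.75], [[0.0, 4.0], [4.0, 0.0]])
```

With a larger $\gamma$ the weights concentrate more strongly on the highest-energy frame, which is the intended sharpening effect.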

With $c_{u}$ , $s_{u-1}$ , and $y_{u-1}$ , the decoder generates next label $y_{u}$ and updates the state as:

where the Generate and Recurrency functions indicate a feed-forward network and a recurrent network, respectively.

In practice, the approach has two main issues. (1) The model is weak on noisy speech data. The attention model is easily affected by noise and generates misalignments, because it has no constraint that guides the alignments to be monotonic as in DNN-HMM and CTC. (2) The model is hard to train from scratch on longer input sequences via purely data-driven methods. To make training faster, prior work constrains the attention mechanism to consider only inputs within a narrow range. However, this modification may limit the model’s capability to extract useful information from long character sequences.

2.3 Proposed model: Joint CTC-attention (MTL)

The idea of our model is to use a CTC objective function as an auxiliary task to train the attention model encoder within the multi-task learning (MTL) framework.

Figure 1 illustrates the overall architecture of our framework, where the encoder network is shared by the CTC and attention models. Unlike the attention model, the forward-backward algorithm of CTC can enforce monotonic alignment between speech and label sequences. We therefore expect our framework to be more robust in acquiring appropriate alignments in noisy conditions. Another advantage of using CTC as an auxiliary task is that the network learns quickly. In our experiments, rather than depending solely on data-driven attention methods to estimate the desired alignments in long sequences, the forward-backward algorithm in CTC speeds up the estimation of the desired alignment without the aid of rough alignment estimates, which require manual effort. The proposed objective combines the attention model loss in Eq. (8) and the CTC loss in Eq. (3) as follows:

$$\mathcal{L}_{\text{MTL}}=\lambda\mathcal{L}_{\text{CTC}}+(1-\lambda)\mathcal{L}_{\text{Attention}} \tag{14}$$

with a tunable parameter $\lambda:0\leq\lambda\leq 1$ .
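A minimal sketch of the combined objective, assuming it interpolates the two losses so that a larger $\lambda$ gives more weight to CTC. The function name and the loss values are hypothetical.

```python
def mtl_loss(ctc_loss, attention_loss, lam):
    """Interpolate the CTC and attention losses with task weight lam in [0, 1]."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return lam * ctc_loss + (1.0 - lam) * attention_loss


# Hypothetical per-utterance losses, combined with the three tested weights.
ctc, att = 10.0, 5.0
losses = {lam: mtl_loss(ctc, att, lam) for lam in (0.2, 0.5, 0.8)}
```

Setting `lam` to 0 recovers the pure attention objective and setting it to 1 recovers pure CTC, so the two baselines are the endpoints of the same family.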

EXPERIMENTS

We performed three sets of experiments: two on clean speech corpora, WSJ1 (81 hours) and WSJ0 (15 hours), and one on a noisy speech corpus, CHiME-4 (18 hours). The CHiME-4 corpus was recorded using a tablet device in everyday environments: a cafe, a street junction, public transport, and a pedestrian area. As input features, we used 40 mel-scale filterbank coefficients with their first- and second-order temporal derivatives, giving a total of 120 feature values per frame. Evaluation was done on (1) "eval92" for WSJ, and (2) "et05_real_isolated_1ch_track" for CHiME-4. Hyperparameter selection was performed on (1) "dev93" for WSJ, and (2) "dt05_multi_isolated_1ch_track" for CHiME-4. None of our experiments used any language model or lexicon information. For the attention model, we used only 32 distinct labels: 26 characters, apostrophe, period, dash, space, noise, and sos/eos tokens. The CTC model uses the blank instead of sos/eos, and our MTL model uses both sos/eos and the blank.
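The three label inventories can be written out as a short sketch. The token spellings (`<noise>`, `<sos/eos>`, `<blank>`) and the use of lowercase letters are hypothetical; only the composition of the sets comes from the text, and the size of the MTL set is derived from that description rather than stated in it.

```python
import string

# 26 characters plus apostrophe, period, dash, space, and a noise token.
BASE_LABELS = list(string.ascii_lowercase) + ["'", ".", "-", " ", "<noise>"]

# Extra symbols per model, following the description in the text.
EXTRA_LABELS = {
    "attention": ["<sos/eos>"],
    "ctc": ["<blank>"],
    "mtl": ["<sos/eos>", "<blank>"],
}


def label_set(model):
    """Return the output label inventory for 'attention', 'ctc', or 'mtl'."""
    return BASE_LABELS + EXTRA_LABELS[model]


sizes = {model: len(label_set(model)) for model in EXTRA_LABELS}
```

Under these assumptions the attention and CTC inventories have the same size, and the MTL inventory has one more symbol because it keeps both special tokens.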

3.2 Training and Decoding

The encoder was a 4-layer Bidirectional Long Short-Term Memory (BLSTM) network with 320 cells in each layer and direction, and each BLSTM layer was followed by a linear projection layer. The top two layers of the encoder read every second hidden state of the layer below, reducing the utterance length by a factor of 4, $L=T/4$. The decoder was a 1-layer LSTM with 320 cells. For the location-based attention model, 10 centered convolution filters of width 100 were used to extract the convolutional features. We used the sharpening factor $\gamma=2$. The AdaDelta algorithm with gradient clipping was used for optimization. All the weights were initialized from a uniform distribution over the range [-0.1, 0.1]. For our MTL, we tested three different task weights, $\lambda$: 0.2, 0.5, and 0.8.
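The length reduction in the encoder can be illustrated with a short sketch. It only tracks sequence lengths and does not model the BLSTM layers themselves; the function name is hypothetical, and keeping the first of every two states (so that odd lengths round up) is an assumption, since the text only gives $L=T/4$.

```python
def encoder_output_length(num_frames, subsampling_layers=2):
    """Length after the top layers each read every second state from below."""
    states = list(range(num_frames))  # stand-ins for hidden states
    for _ in range(subsampling_layers):
        states = states[::2]  # keep every second hidden state
    return len(states)


length = encoder_output_length(100)  # T = 100 input frames
```

Shorter encoder outputs reduce the number of frames the attention mechanism has to score at every output step.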

For decoding of the attention and MTL models, we used a beam search algorithm similar to prior work, with a beam size of 20 to reduce the computation cost. We adjusted the score by adding a length penalty, $\text{length(hyp)}*0.3$ for the CHiME-4 experiments and $\text{length(hyp)}*0.1$ for the WSJ experiments. For decoding of the CTC model, we took the sequence of most likely outputs. Note that we did not use any lexicon or language models. Our framework is implemented with the Chainer library.
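The two decoding details above can be sketched as follows. Both functions are illustrative assumptions: the CTC decoder takes the most likely label per frame and then applies the standard collapse (merge repeats, drop blanks), and the score adjustment adds the penalty to a hypothesis log probability. The output matrix and the hypothesis scores are made-up values.

```python
def ctc_greedy_decode(q, blank=0):
    """Most likely label per frame, then merge repeats and drop blanks."""
    best = [max(range(len(frame)), key=frame.__getitem__) for frame in q]
    merged = [p for i, p in enumerate(best) if i == 0 or p != best[i - 1]]
    return [p for p in merged if p != blank]


def length_adjusted_score(log_prob, hyp, penalty):
    """Hypothesis log probability plus penalty * length(hyp)."""
    return log_prob + penalty * len(hyp)


# Hypothetical softmax outputs: 4 frames, 3 output units (0 = blank).
q = [[0.1, 0.8, 0.1], [0.1, 0.8, 0.1], [0.9, 0.05, 0.05], [0.2, 0.1, 0.7]]
decoded = ctc_greedy_decode(q)

# Two hypothetical beam hypotheses: a short one and a longer, less likely one.
short_hyp, long_hyp = [1, 2, 1], [1, 2, 1, 2, 1]
short_score = length_adjusted_score(-2.0, short_hyp, 0.3)
long_score = length_adjusted_score(-2.4, long_hyp, 0.3)
```

Since log probabilities decrease with every emitted label, the additive term offsets the bias of beam search towards short hypotheses; in this example the longer hypothesis wins with a weight of 0.3 but not with 0.1.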

3.3 Results

The results in Table 1 show that our proposed MTL model significantly outperformed both CTC and the attention model in CER on both the noisy CHiME-4 and clean WSJ tasks. Our model showed 6.0 - 8.4% and 5.4 - 14.6% relative improvements on the validation and evaluation sets, respectively. Our joint CTC-attention model achieved the best performance with $\lambda=0.2$ on both the noisy CHiME-4 and clean WSJ tasks.
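For reference, the metrics discussed here can be computed as in the sketch below. It assumes CER is the character-level edit distance divided by the reference length, and that relative improvement is the CER reduction divided by the baseline CER. The strings and CER values are hypothetical and are not taken from Table 1.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1]


def cer(ref, hyp):
    """Character error rate in percent, relative to the reference length."""
    return 100.0 * edit_distance(ref, hyp) / len(ref)


def relative_improvement(baseline_cer, proposed_cer):
    """Relative CER reduction in percent."""
    return 100.0 * (baseline_cer - proposed_cer) / baseline_cer


example_cer = cer("speech", "speach")            # one substitution in 6 chars
example_gain = relative_improvement(10.0, 9.0)   # hypothetical CER values
```

Reporting relative rather than absolute improvement makes results comparable across tasks with very different baseline error rates, such as clean WSJ and noisy CHiME-4.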

Notably, our framework significantly outperformed both the CTC and attention models even on the clean corpora WSJ1 and WSJ0. It is possible that CTC improved generalization because its training procedure does not explicitly use character inter-dependencies. This point needs to be verified with additional experiments in future work.

Apart from the CER improvements, MTL can also be very helpful in accelerating the learning of the desired alignment. Figure 2 shows the learning curves of character accuracy on the validation sets of CHiME-4 over training epochs. Note that the accuracies of the attention and MTL models were obtained given the gold-standard history. As $\lambda$ increases, giving more weight to the CTC loss, the network learns quickly and converges early. Figure 3 visualizes the attention alignments between characters and acoustic frames over training epochs. We observed that our MTL model learned the desired alignment at an early training stage, the 5th epoch, while the attention model could not learn the desired alignment even at the 9th epoch. This result indicates that the CTC loss guided the alignment to be monotonic in our MTL approach.

CONCLUSIONS

We have introduced a novel, general method for end-to-end speech recognition based on the multi-task learning approach using the CTC and the attention encoder-decoder. Our method improves performance by training a shared encoder using an auxiliary CTC objective function. Moreover, it significantly speeds up the process of learning the desired alignment without requiring manual restriction of the range of inputs, even in longer sequences. Our method has outperformed both CTC and an attention model on a speech recognition task in real-world noisy conditions as well as in clean conditions. This work can potentially be applied to any sequence-to-sequence learning task.