Latent Sequence Decompositions
William Chan, Yu Zhang, Quoc Le, Navdeep Jaitly
Introduction
Sequence-to-sequence (seq2seq) models (Sutskever et al., 2014; Cho et al., 2014) with attention (Bahdanau et al., 2015) have been successfully applied to many applications including machine translation (Luong et al., 2015; Jean et al., 2015), parsing (Vinyals et al., 2015a), image captioning (Vinyals et al., 2015b; Xu et al., 2015) and Automatic Speech Recognition (ASR) (Chan et al., 2016; Bahdanau et al., 2016a).
Previous work has assumed a fixed deterministic decomposition for each output sequence. The output representation is usually a fixed sequence of words (Sutskever et al., 2014; Cho et al., 2014), phonemes (Chorowski et al., 2015), characters (Chan et al., 2016; Bahdanau et al., 2016a) or even a mixture of characters and words (Luong & Manning, 2016). However, in all these cases, the models are trained towards one fixed decomposition for each output sequence.
We argue against using fixed deterministic decompositions of a sequence that has been defined a priori. Word segmented models (Luong et al., 2015; Jean et al., 2015) often have to deal with large softmax sizes, rare words and Out-of-Vocabulary (OOV) words. Character models (Chan et al., 2016; Bahdanau et al., 2016a) overcome the OOV problem by modelling the smallest output unit, however this typically results in long decoder lengths and computationally expensive inference. And even with mixed (but fixed) character-word models (Luong & Manning, 2016), it is unclear whether such a predefined segmentation is optimal. In all these examples, the output decomposition is only a function of the output sequence. This may be acceptable for problems such as translations, but inappropriate for tasks such as speech recognition, where segmentation should also be informed by the characteristics of the inputs, such as audio.
We want our model to have the capacity and flexibility to learn a distribution of sequence decompositions. Additionally, the decomposition should be a sequence of variable length tokens as deemed most probable. For example, language may be more naturally represented as word pieces (Schuster & Nakajima, 2012) rather than individual characters. In many speech and language tasks, it is probably more efficient to model “qu” as one output unit rather than “q” + “u” as separate output units (since in English, “q” is almost always followed by “u”). Word piece models also naturally solve rare word and OOV problems similar to character models.
The output sequence decomposition should be a function of both the input sequence and the output sequence (rather than output sequence alone). For example, in speech, the choice of emitting “ing” as one word piece or as separate tokens of “i” + “n” + “g” should be a function of the current output word as well as the audio signal (i.e., speaking style).
We present the Latent Sequence Decompositions (LSD) framework. LSD does not assume a fixed decomposition for an output sequence, but rather learns to decompose sequences as function of both the input and the output sequence. Each output sequence can be decomposed to a set of latent sequence decompositions using a dictionary of variable length output tokens. The LSD framework produces a distribution over the latent sequence decompositions and marginalizes over them during training. During test inference, we find the best decomposition and output sequence, by using beam search to find the most likely output sequence from the model.
Latent Sequence Decompositions
In this section, we describe LSD more formally. Let be our input sequence, be our output sequence and be a latent sequence decomposition of . The latent sequence decomposition consists of a sequence of where is the constructed token space. Each token need not be the same length, but rather in our framework, we expect the tokens to have different lengths. Specifically, where is the set of singleton tokens and is the length of the largest output token. In ASR , would typically be the set of English characters, while would be word pieces (i.e., -grams of characters).
To give a concrete example, consider a set of tokens . With this set of tokens, the word “cat” may be represented as the sequence “c”, “a”, “t”, or the sequence “ca”, “t”, or alternatively as the single token “cat”. Since the appropriate decomposition of the word “cat” is not known a priori, the decomposition itself is latent.
Note that the length of a decomposition need not be the same as the length of the output sequence, (for example “ca”, “t” has a length of 2, whereas the sequence is 3 characters long). Similarly, a different decomposition (for example the 3-gram token “cat”) of the same sequence may be of a different length (in this case 1).
If there was a known, unique, correct segmentation for a given pair, , one could simply train the model to output the fixed deterministic decomposition . However, in most problems, we do not know the best possible decomposition ; indeed it may be possible that the output can be correctly decomposed into multiple alternative but valid segmentations. For example, in end-to-end ASR we typically use characters as the output unit of choice (Chan et al., 2016; Bahdanau et al., 2016a) but word pieces may be better units as they more closely align with the acoustic entities such as syllables. However, the most appropriate decomposition for a given is pair is often unknown. Given a particular , the best could even change depending on the input sequence (i.e., speaking style).
In LSD, we want to learn a probabilistic segmentation mapping from . The model produces a distribution of decompositions, , given an input sequence , and the objective is to maximize the log-likelihood of the ground truth sequence . We can accomplish this by factorizing and marginalizing over all possible latent sequence decompositions under our model with parameters :
Similarly, computing the exact gradient is intractable. However, we can derive a gradient estimator by differentiating w.r.t. to and taking its expectation:
Instead, in our implementation we use a heuristic to sample . At each output time step when producing tokens , we sample from in a left-to-right fashion. In other words, we sample valid extensions at each time step . At the start of the training, this left-to-right sampling procedure is not a good approximation to the posterior, since the next step probabilities at a time step include probabilities of all future paths from that point.
For example, consider the case when the target word is “cat”, and the vocabulary includes all possible characters and the tokens “ca”, and “cat”. At time step 1, when the valid next step options are “c”, “ca”, “cat”, their relative probabilities reflect all possible sequences “c*”, “ca*”, “cat*” respectively, that start from the first time step of the model. These sets of sequences include sequences other than the target sequence “cat”. Thus sampling from the distribution at step 1 is a biased procedure.
However, as training proceeds the model places more and more mass only on the correct hypotheses, and the relative probabilities that the model produces between valid extensions gets closer to the posterior. In practice, we find that the when the model is trained with this method, it quickly collapses to using single character targets, and never escapes from this local minimaOne notable exception was the word piece “qu” (“u” is almost always followed by “q” in English). The model does learn to consistently emit “qu” as one token and never produce “q” + “u” as separate tokens.. Thus, we follow an -greedy exploration strategy commonly found in reinforcement learning literature (Sutton & Barto, 1998) – we sample from a mixture of a uniform distribution over valid next tokens and . The relative probability of using a uniform distribution vs. is varied over training. With this modification the model learns to use the longer n-grams of characters appropriately, as shown in later sections.
Model
In this work, we model the latent sequence decompositions with an attention-based seq2seq model (Bahdanau et al., 2015). Each output token is modelled as a conditional distribution over all previously emitted tokens and the input sequence using the chain rule:
The output sequence is generated with an attention-based transducer (Bahdanau et al., 2015) one token at a time:
Decoding
During inference we want to find the most likely word sequence given the input acoustics:
however this is obviously intractable for any non-trivial token space and sequence lengths. We simply approximate this by decoding for the best word piece sequence and then collapsing it to its corresponding word sequence :
We approximate for the best sequence by doing a left-to-right beam search (Chan et al., 2016).
Experiments
Table 1 compares the effect of varying the sized word piece vocabulary. The Latent Sequence Decompositions (LSD) models were trained with the framework described in Section 2 and the (Maximum Extension) MaxExt decomposition is a fixed decomposition. MaxExt is generated in a left-to-right fashion, where at each step the longest word piece extension is selected from the vocabulary. The MaxExt decomposition is not the shortest possible sequence, however it is a deterministic decomposition that can be easily generated in linear time on-the-fly. We decoded these models with simple n-best list beam search without any external dictionary or Language Model (LM).
The baseline model is simply the unigram or character model and achieves 14.76% WER. We find the LSD word piece vocabulary model to perform the best at 12.88% WER or yielding a 12.7% relative improvement over the baseline character model. None of our MaxExt models beat our character model baseline, suggesting the maximum extension decomposition to be a poor decomposition choice. However, all our LSD models perform better than the baseline suggesting the LSD framework is able to learn a decomposition better than the baseline character decomposition.
The MaxExt model were trained to greedily emit the longest possible word piece, consequently this prior meant the model will prefer to emit long word pieces over characters. While this decomposition results in the shorter length, the WER is slightly worse than the character baseline. This suggest the much shorter decompositions generated by the MaxExt prior may not be best decomposition. This falls onto the principle that the best decomposition is not only a function of but as a function of . In the case of ASR, the segmentation is a function of the acoustics as well as the text.
Table 2 compares our WSJ results with other published end-to-end models. The best CTC model achieved 27.3% WER with REINFORCE optimization on WER (Graves & Jaitly, 2014). The previously best reported basic seq2seq model on WSJ WER achieved 18.0% WER (Bahdanau et al., 2016b) with Task Loss Estimation (TLE). Our baseline, also a seq2seq model, achieved 14.8% WER. Main differences between our models is that we did not use convolutional locational-based priors and we used weight noise during training. The deep CNN model with residual connections, batch normalization and convolutions achieved a WER of 11.8% (Zhang et al., 2017) For our CNN architectures, we use and compare to the “(C (3 3) / 2) 2 + NiN” architecture from Table 2 line 4..
Our LSD model using a word piece vocabulary achieves a WER of 12.9% or 12.7% relatively better over the baseline seq2seq model. If we combine our LSD model with the CNN (Zhang et al., 2017) model, we achieve a combined WER of 9.6% WER or 35.1% relatively better over the baseline seq2seq model. These numbers are all reported without the use of any language model.
Please see Appendix A for the decompositions generated by our model. The LSD model learns multiple word piece decompositions for the same word sequence.
Related Work
Singh et al. (2002); McGraw et al. (2013); Lu et al. (2013) built probabilistic pronunciation models for Hidden Markov Model (HMM) based systems. However, such models are still constraint to the conditional independence and Markovian assumptions of HMM-based systems.
Connectionist Temporal Classification (CTC) (Graves et al., 2006; Graves & Jaitly, 2014) based models assume conditional independence, and can rely on dynamic programming for exact inference. Similarly, Ling et al. (2016) use latent codes to generate text, and also assume conditional independence and leverage on dynamic programming for exact maximum likelihood gradients. Such models can not learn the output language if the language distribution is multimodal. Our seq2seq models makes no such Markovian assumptions and can learn multimodal output distributions. Collobert et al. (2016) and Zweig et al. (2016) developed extensions of CTC where they used some word pieces. However, the word pieces are only used in repeated characters and the decompositions are fixed.
Word piece models with seq2seq have also been recently used in machine translation. Sennrich et al. (2016) used word pieces in rare words, while Wu et al. (2016) used word pieces for all the words, however the decomposition is fixed and defined by heuristics or another model. The decompositions in these models are also only a function of the output sequence, while in LSD the decomposition is a function of both the input and output sequence. The LSD framework allows us to learn a distribution of decompositions rather than learning just one decomposition defined by a priori.
Vinyals et al. (2016) used seq2seq to outputs sets, the output sequence is unordered and used fixed length output units; in our decompositions we maintain ordering use variable lengthed output units. Reinforcement learning (i.e., REINFORCE and other task loss estimators) (Sutton & Barto, 1998; Graves & Jaitly, 2014; Ranzato et al., 2016) learn different output sequences can yield different task losses. However, these methods don’t directly learn different decompositions of the same sequence. Future work should incorporate LSD with task loss optimization methods.
Conclusion
We presented the Latent Sequence Decompositions (LSD) framework. LSD allows us to learn decompositions of sequences that are a function of both the input and output sequence. We presented a biased training algorithm based on sampling valid extensions with an -greedy strategy, and an approximate decoding algorithm. On the Wall Street Journal speech recognition task, the sequence-to-sequence character model baseline achieves 14.8% WER while the LSD model achieves 12.9%. Using a a deep convolutional neural network on the encoder with LSD, we achieve 9.6% WER.
We thank Ashish Agarwal, Philip Bachman, Dzmitry Bahdanau, Eugene Brevdo, Jan Chorowski, Jeff Dean, Chris Dyer, Gilbert Leung, Mohammad Norouzi, Noam Shazeer, Xin Pan, Luke Vilnis, Oriol Vinyals and the Google Brain team for many insightful discussions and technical assistance.
References
Appendix A Learning the Decompositions
We give the top 8 hypothesis generated by a baseline seq2seq character model, a Latent Sequence Decompositions (LSD) word piece model and a Maximum Extension (MaxExt) word piece model. We note that “shamrock’s” is an out-of-vocabulary word while “shamrock” is in-vocabulary. The ground truth is “shamrock’s pretax profit from the sale was one hundred twenty five million dollars a spokeswoman said”. Note how the LSD model generates multiple decompostions for the same word sequence, this does not happen with the MaxExt model.