Alternative structures for character-level RNNs

Piotr Bojanowski, Armand Joulin, Tomas Mikolov

Introduction

Modeling sequential data is a fundamental problem in machine learning with many applications, for example in language modeling (Goodman, 2001), speech recognition(Young et al., 1997) and machine translation (Koehn et al., 2007). In particular, for modeling natural language, recurrent neural networks (RNNs) are now widely used and have demonstrated state-of-the-art performance in many standard tasks (Mikolov, 2012).

While RNNs have been shown to outperform the traditional n-gram models and feeedforward neural network language models in numerous experiments, they are usually based on the word level information and thus are oblivious to subword information. For example, RNN language models encode input words such as “build”, “building” and “buildings” using 1-of-N coding, which does not capture any similarity of the written form of the words. This can potentially result in poor representation of words that are rarely seen during training. Even worse, the words that appear only in the test data will be not represented at all. This problem can become significant when working with languages that have extremely large vocabularies, such as agglutinative languages where words can be created by concatenating morphemes (Finnish and Turkish being well-studied examples). Further, in many real-world applications, typos and spelling mistakes artificially increase the size of the vocabulary by adding several versions of the same word. This requires ad-hoc spell checking approaches that are designed disjointly from the main language modeling task.

To overcome these limitations, we investigate the use of character based recurrent neural networks (Char-RNNs) to capture subword information. While this type of models has been widely studied in the past (see for example (Mikolov et al., 2011; Sutskever et al., 2011; Graves, 2013)), Char-RNNs lead to both lower accuracy and higher computational cost than word-based models (Mikolov et al., 2012). This drop of performance is unlikely due to the difficulty for character level model to capture longer short term memory, since also the Longer Short Term Memory (LSTM) recurrent networks work better with word-based input (Graves, 2013).

Ad-hoc solutions based on larger sub-word units seem to be able to both deal with new words and offer reasonable accuracy and training speed (Mikolov et al., 2012). However, these approaches have several issues: one has to specify how to create the sub-word units, which can differ from language to language; and a word can have multiple segmentations into the sub-word units.

In this paper, we investigate at first an extension of a standard Char-RNN that includes both word level and character level information. Arguably, such approach is simpler than the one based on sub-words, and does not have the potential problems mentioned above. Further, we can see that one of the fundamental differences between the word level and character level models is in the number of parameters the RNN has to access during the training and test. The smaller is the input and output layer of RNN, the larger needs to be the fully connected hidden layer, which makes the training of the model expensive. Following this observation, we investigate another architecture of the Char-RNN that does not include the (still somewhat ad-hoc) word level information, and rather attempts to make the computation the model performs more sparse. In our experiments, this is achieved by conditioning the computation of the probability distribution in the output layer using the recent history. This greatly increases the number of parameters in the model without increasing the size of the hidden layer or the output layer, and thus does not increase the computational complexity.

First, we describe the standard RNN in the context of character prediction problem in Sec. 2, then we propose two different structural modifications of this model. The first modification, described in Sec. 3, combines two networks, one working with characters at the input, and the other with words. The second approach, described in Sec. 4, attempts to increase capacity of the RNN model by conditioning the softmax output on the recent history.

Agglutinative languages such as Finnish or Turkish have very large vocabularies, making word based models impractical (Kurimo et al., 2006). Subword units such as morphemes have been used in statstical models for speech recognition (Vergyri et al., 2004; Hirsimäki et al., 2006; Arisoy et al., 2009; Creutz et al., 2007). In particular, Creutz et al. (2007) show that morph-based N-gram models outperform word based ones on most of the agglutinative languages.

A mix of word and character level input for neural network language models has been investigated by Kang et al. (2011) in the context of Chinese. More recently, Kim et al. (2015) propose a model to predict words given character level inputs, while we predict characters based on a mix of word and character level inputs.

Recurrent networks have been popularized for statistical language modeling by Mikolov et al. (2010). Since then, many authors have investigated the use of subword units in order to deal with Out-Of-Vocabulary (OOV) words in the context of recurrent networks. Typical choice of subword units are either characters (Mikolov et al., 2011; Sutskever et al., 2011; Graves, 2013) or syllables (Mikolov et al., 2012).

Others have used embedding of words to deal with OOV words (Bilmes & Kirchhoff, 2003; Alexandrescu & Kirchhoff, 2006; Luong et al., 2013). Luong et al. (2013) build word embeddings by applying a recursive neural network over morpheme embeddings, while Bilmes & Kirchhoff (2003) build their embedding by concatenating features built on previously seen words.

Simple recurrent network

In this section, we describe the simple RNN model popularized by Elman (1990), in the context of language modeling. We formulate language modeling as a discrete sequence prediction problem. That is, we want to predict the next token in a sequence given its past. We suppose a fixed size dictionary of $k$ words formed from $d$ different characters. We denote by $c_{t}$ the one-hot encoding of $t$ -th character in the sequence, and $w_{p}$ the one-hot encoding of the $p$ -th word. Our basic unit is the character.

An RNN consists of an input layer, a hidden layer and an output layer. Its hidden layer has a recurrent connection which allows the propagation through time of information. More precisely, the state of the $m$ hidden units, $h_{t}$ is updated as a function of its previous state, $h_{t-1}$ and the current character one-hot representation $c_{t}$ :

where $\sigma(x)\mapsto 1/(1+\exp(-x))$ is the pointwise sigmoid function, $A$ is the $m\times d$ embedding matrix and $R$ the $m\times m$ recurrent matrix. This hidden representation is supposed to act as a memory, and should be able to convey long-term dependencies. With sufficiently high-dimensional hidden representation, it should be a priori possible to store the whole history. However, using a big hidden layer implies high computational costs which are prohibitive in practice. Using its hidden representation, the RNN compute a probability distribution $y_{t}$ over the next character:

where $U$ is a $d\times m$ matrix and $f$ is the pointwise softmax function, i.e., $[f(x)]_{i}=\exp(x_{i})/\sum_{j}\exp(x_{j})$ .

In order to learn the parameters $\theta=(A,~{}R,~{}U)$ of the model, we minimize the negative log-likelihood (NLL):

with a stochastic gradient descent method and backpropagation through time (Rumelhart et al., 1985; Werbos, 1988). We clip the gradient in order to avoid gradient explosion. The details of the implementation are given in the experiment section.

Character level RNNs have been shown to perform poorly compared to word level ones (Mikolov et al., 2012). In particular, they require a massive hidden layer in order to obtain results which are on par with word level models, this makes them very expensive to compute. In the following sections, we describe two different structural modification of the char-RNNs in order to add capacity while reducing the overall computational cost.

Conditioning on words

In this section, we consider an extension of character-level RNN by conditioning it with word level information. This allows a more direct flow of information from the previous words to the character level prediction. We propose to condition the character level on a context vector $z_{t}$ as follow:

where $Q$ is the conditioning matrix. The context vector $z_{t}$ is built by gathering information at the word level using a word level RNN, with a similar architecture to the one described in the previous section. Its input for the $p$ -th word is its one-hot representation $w_{p}$ . The context vector is then simply the state of the hidden layer $g_{p}$ : if the $t$ -th character belongs to the $p$ -th word, then $z_{t}=g_{p-1}$ . Figure 1 (Left) provides an illustration of this hybrid word and character level RNN.

In order to train this model, we combine a loss on characters with a loss on words. However, computing a full softmax at the word level is expensive, in particular when facing large vocabularies. Many solutions have been proposed to reduce the cost of this step, such as hierachical softmax, or sampling techniques. In this work, we simply restrict the word vocabulary for the output of the word level RNN. We keep the most frequent words (between 3000 and 5000) and associate the other ones with the token. The loss that we use to train our model can be written as:

where $v_{p}$ is the prediction made by the word level RNN and $\lambda>0$ is an interpolation parameter. Note that the restricted vocabulary is only used for the output of the word level RNN, the rest of the model works on the large vocabulary. In the next section, we describe another structural modification that we propose to speed up language modeling on the character level.

Conditioning prediction on recent history

In a character based RNNs, the classifier has a very small number of parameters. We propose to condition this classifier on the recent contextual information to increase its capacity while keeping the computational cost constant. This context information is related to simple short-term statitics, such as coocurrence in a language. Explicitly modeling this information in the classifier, removes some of the burden from the hidden layer, allowing to use smaller recurrent matrices while maintaining the performance.

There are many relevant contextual informations on which we can condition the classifier. In particular, simple short term dependencies are easily captured by n-grams, which are memory expensive but very efficient. Using such cheap information to condition the classifier of the RNN gives a simple way to increase the capacity of the model while encouraging the rest of the RNN to focus on more complex stastitical patterns. Conditioning the classifier on n-grams can be written as a bilinear model. More precisely, denoting by $n_{t}$ the one-hot representation of the $t$ -th n-gram, the prediction of our model is:

where $U$ is a tensor going from the product space of hidden representation and n-grams to characters. This alternative architecture is depicted in Fig. 1 (Right).

Obviously, the exponential number of n-grams makes this model impossible to learn. Worse, it is impossible to generalize at test time to unseen n-grams. To avoid these problems, we restrict our set of n-grams $\mathcal{N}$ to those that contain less than $N$ symbols and appear frequently enough in the training set. If multiple n-grams can be used at the $t$ -th character of the text, we fix $n_{t}$ to be the longest one appearing in $\mathcal{N}$ . This insures that each n-gram $n_{t}$ is associated with enough examples to learn a statistically meaningful tensor $U$ . The great thing with this solution is the fact that apart from characters that don’t appear in the training set, we always have a non-trivial output model at test time. In the worst case, this procedure will select the model corresponding to the last unigram. The n-gram cut-off frequence allows to control the overfitting of the model.

Experimental evaluation

We evaluate the proposed models on the Penn Treebank corpus and on a subset of the Europarl dataset. For the sake of simplicity, we compare our models to a plain RNN but all the modifications that we propose can be applied to more complex units (LSTM etc.). We train our model using stochastic gradient descent and select hyper parameters on a validation set. We set a constant learning rate $\gamma$ and when the validation entropy starts to increase, we divide it by a factor $\alpha$ after every epoch (values arround $\gamma=0.1$ and $\alpha=1.5$ work well in practice). Our implementation is a single threaded CPU code which could easily be parallelized. Code for training both models is publicly availablehttps://github.com/facebook/Conditional-character-based-RNN.

We evaluate our method using entropy in bits per character. It is defined as the empirical estimate of the cross-entropy between the target distribution and the model output in base 2. This corresponds to the negative log likelihood that we use to train our model up to a multiplicative factor: $BPC(\theta)~{}=~{}\frac{1}{T\log(2)}~{}NLL(\theta)$ .

We first carry out experiments on the Penn Treebank corpus (Marcus et al., 1993). This a dataset with a training set composed of 930k normalized words, yielding a total of 5017k characters. All characters are in the ASCII format which leads to a limited size of character vocabulary $C$ . The text was normalized and the word dictionary limited to $10000$ most frequent words of the training set. The other words were replaced by a token in the training, validation and testing sets.

We evaluate both models described in this paper on the Penn Treebank dataset. For the mixed model from Sec. 3 (Mixed), we fix size of the word-level hidden representations to 200. For the conditional model presented in Sec. 4 (Cond.), we choose the optimal $N$ on the validation set. We compare these models with our own implementation of a character-level RNN as it allows us to fairly compare run times (a significant part of the code is shared). All models are trained for various sizes of the hidden layer $m$ . We report entropy in bits per character on the validation set and the training time per epoch in Table 1.

The character-level performance we obtain for the “vanilla” RNN is coherent with numbers published in the past on this dataset (Mikolov et al., 2012). We observe three important things: (a), for any size of character hidden layer, both proposed models perform better than the plain one. This of course comes at the expense of some additional computational cost and the benefits seem to decrease when the size of $h_{t}$ grows. (b) using this model, we manage to obtain a comparable performance to the heavy 1000-dimensional character RNN with a hidden representation of only 300. This corresponds to an important reduction in the number of recurent parameters and to a five times speedup per epoch at training time. (c) when the hidden representation is small, the best working model is the conditional one. However, when $m$ gets larger, the mixed one seems to work best, and provides competitive entropy for a reasonable runtime.

For the conditional model, we observed in our experiments, that there seems to be a clear trade off in the choice of $N$ , reached on Penn Treebank at roughly $N=1000$ . Indeed, in the limit case, when $N$ is very large, we only have one output model, and the network is exactly equivalent to a RNN. On the other hand, when $N$ is small, we keep a separate model for any sequence and therefore overfits to the training set.

2 Binary Penn Treebank

We carry out some experiments on the binary representation of Penn Treebank. As mentionned in the introduction, we would like to develop models that would be independent of the representation in use. Working with binary representation would allow to have models of sequential data that would be agnostic to the nature of the sequence. This could straightforwadly be applied to language modelling but also speech recognition directly from wave files etc.

We run the conditional model and a baseline bit-level RNN, both with a hidden representation of $100$ . For the conditional model we select the optimal $N$ by choosing it on the validation set ( $N=2000$ ). We evaluate both models by computing the entropy per bit and per character. The results for this experiment are presented in Table 2. This setting corresponds to the extreme case where the dictionary is as small as it could be. The input and output model only have $2\times m$ which can be a serious limitation for RNNs. As we see in Table 2, the conditional model works much better as it compensates this small output model by storing several ones instead.

3 The multilingual Europarl dataset

We perform another set of experiments on the Europarl dataset (Koehn, 2005). It is a corpus for machine translation with sentences from 20 different languages aligned with their English correspondence. For almost every language, there is more than 500k sentences, composed of more that 10M words. Because of its size, we restrict our experiments to a subset of sentences for each language. We randomly permute lines of the transcriptions, select 60k sentences for training, 10k for validation and 10k for testing. The permutation we use will be made publicly available upon publication.

In this experiment, as in the one described in Sec. 5.1, we compare our models to a character-level RNN. We train our mixed model with a word hidden of $200$ and a character hidden of $300$ . For the conditional one, we fix the hidden representation and select the optimal $N$ on the validation set. As a baseline, we train a character-level RNN for two sizes of hidden layers: $200$ and $500$ . These results are summarized in Table 3, where we group “light” and “heavy” models together. We also report the word dictionary size ( $k$ ) and out of vocabulary rate (OOVR) for every language.

The CRNN baseline as well as the proposed models still are quite far from the performance of a word-level RNN. As we see in Table 3, the average performance of both proposed models give a per-character entropy of 1.36. The proposed structural modifications allow us to achieve similar performance to a large character-level RNN with a reduced computational cost. For languages such as Finnish and Hungarian, the conditional model (Cond.) yields best performance.

For reference, we computed a word-level RNN baseline using a modified version of SCRNNhttps://github.com/facebook/SCRNNs. If we assign 4 times the average entropy to OOV words, it gives us an entropy of 1.27 BPC. The proposed models allow us to efficiently tackle the problem of learning small vocabulary sequences. However, the gap between the word and character-level models is far from being closed.

Conclusion

In this work we investigated modifications of RNNs for general discrete sequence prediction when the number of symbols is very small, such as in char-RNNLM. We found that with certain tricks, one can train the model much faster, and overall we observed that the fully connected RNN architecture has its weaknesses, especially related to the excessive computational complexity. We believe more research is needed to develop general mechanisms that would allow us to train RNNs with richer internal structure. We expect such research can greatly simplify many pipelines, for example in the NLP applications where we could avoid having separate systems that perform spell checking, text normalization, and modeling of the language disjointly.

We hope that this work will open up new research paths for modeling sequences with small vocabularies. Initial yet promising results on binary representations of Penn Treebank show that RNNs can be trained on that kind of data. We believe that this allows us to define models for sequence modeling that would be agnostic to the nature of the input. Bit-level models could be used on any sequential data, for example speech signal in binary .wav form.