Enriching Word Vectors with Subword Information

Piotr Bojanowski, Edouard Grave, Armand Joulin, Tomas Mikolov

Introduction

Learning continuous representations of words has a long history in natural language processing [Rumelhart et al. (1988]. These representations are typically derived from large unlabeled corpora using co-occurrence statistics [Deerwester et al. (1990, Schütze (1992, Lund and Burgess (1996]. A large body of work, known as distributional semantics, has studied the properties of these methods [Turney et al. (2010, Baroni and Lenci (2010]. In the neural network community, ?) proposed to learn word embeddings using a feedforward neural network, by predicting a word based on the two words on the left and two words on the right. More recently, ?) proposed simple log-bilinear models to learn continuous representations of words on very large corpora efficiently.

Most of these techniques represent each word of the vocabulary by a distinct vector, without parameter sharing. In particular, they ignore the internal structure of words, which is an important limitation for morphologically rich languages, such as Turkish or Finnish. For example, in French or Spanish, most verbs have more than forty different inflected forms, while the Finnish language has fifteen cases for nouns. These languages contain many word forms that occur rarely (or not at all) in the training corpus, making it difficult to learn good word representations. Because many word formations follow rules, it is possible to improve vector representations for morphologically rich languages by using character level information.

In this paper, we propose to learn representations for character nn-grams, and to represent words as the sum of the nn-gram vectors. Our main contribution is to introduce an extension of the continuous skipgram model [Mikolov et al. (2013b], which takes into account subword information. We evaluate this model on nine languages exhibiting different morphologies, showing the benefit of our approach.

Related work

In recent years, many methods have been proposed to incorporate morphological information into word representations. To model rare words better, ?) introduced factored neural language models, where words are represented as sets of features. These features might include morphological information, and this technique was succesfully applied to morphologically rich languages, such as Turkish [Sak et al. (2010]. Recently, several works have proposed different composition functions to derive representations of words from morphemes [Lazaridou et al. (2013, Luong et al. (2013, Botha and Blunsom (2014, Qiu et al. (2014]. These different approaches rely on a morphological decomposition of words, while ours does not. Similarly, ?) introduced a method to jointly learn embeddings for Chinese words and characters. ?) proposed to constrain morphologically similar words to have similar representations. ?) described a method to learn vector representations of morphological transformations, allowing to obtain representations for unseen words by applying these rules. Word representations trained on morphologically annotated data were introduced by ?). Closest to our approach, ?) learned representations of character four-grams through singular value decomposition, and derived representations for words by summing the four-grams representations. Very recently, ?) also proposed to represent words using character nn-gram count vectors. However, the objective function used to learn these representations is based on paraphrase pairs, while our model can be trained on any text corpus.

Character level features for NLP.

Another area of research closely related to our work are character-level models for natural language processing. These models discard the segmentation into words and aim at learning language representations directly from characters. A first class of such models are recurrent neural networks, applied to language modeling [Mikolov et al. (2012, Sutskever et al. (2011, Graves (2013, Bojanowski et al. (2015], text normalization [Chrupała (2014], part-of-speech tagging [Ling et al. (2015] and parsing [Ballesteros et al. (2015]. Another family of models are convolutional neural networks trained on characters, which were applied to part-of-speech tagging [dos Santos and Zadrozny (2014], sentiment analysis [dos Santos and Gatti (2014], text classification [Zhang et al. (2015] and language modeling [Kim et al. (2016]. ?) introduced a language model based on restricted Boltzmann machines, in which words are encoded as a set of character nn-grams. Finally, recent works in machine translation have proposed using subword units to obtain representations of rare words [Sennrich et al. (2016, Luong and Manning (2016].

Model

In this section, we propose our model to learn word representations while taking into account morphology. We model morphology by considering subword units, and representing words by a sum of its character nn-grams. We will begin by presenting the general framework that we use to train word vectors, then present our subword model and eventually describe how we handle the dictionary of character nn-grams.

We start by briefly reviewing the continuous skipgram model introduced by ?), from which our model is derived. Given a word vocabulary of size WW, where a word is identified by its index w  {1,...,W}w~{}\in~{}\{1,...,W\}, the goal is to learn a vectorial representation for each word ww. Inspired by the distributional hypothesis [Harris (1954], word representations are trained to predict well words that appear in its context. More formally, given a large training corpus represented as a sequence of words w1,...,wTw_{1},...,w_{T}, the objective of the skipgram model is to maximize the following log-likelihood:

However, such a model is not adapted to our case as it implies that, given a word wtw_{t}, we only predict one context word wcw_{c}.

The problem of predicting context words can instead be framed as a set of independent binary classification tasks. Then the goal is to independently predict the presence (or absence) of context words. For the word at position tt we consider all context words as positive examples and sample negatives at random from the dictionary. For a chosen context position cc, using the binary logistic loss, we obtain the following negative log-likelihood:

2 Subword model

By using a distinct vector representation for each word, the skipgram model ignores the internal structure of words. In this section, we propose a different scoring function ss, in order to take into account this information.

Each word ww is represented as a bag of character nn-gram. We add special boundary symbols < and > at the beginning and end of words, allowing to distinguish prefixes and suffixes from other character sequences. We also include the word ww itself in the set of its nn-grams, to learn a representation for each word (in addition to character nn-grams). Taking the word where and n=3n=3 as an example, it will be represented by the character nn-grams:

Note that the sequence , corresponding to the word her is different from the tri-gram her from the word where. In practice, we extract all the nn-grams for nn greater or equal to 3 and smaller or equal to 66. This is a very simple approach, and different sets of nn-grams could be considered, for example taking all prefixes and suffixes.

Suppose that you are given a dictionary of nn-grams of size GG. Given a word ww, let us denote by Gw{1,,G}\mathcal{G}_{w}\subset\{1,\dots,G\} the set of nn-grams appearing in ww. We associate a vector representation zg\mathbf{z}_{g} to each nn-gram gg. We represent a word by the sum of the vector representations of its nn-grams. We thus obtain the scoring function:

This simple model allows sharing the representations across words, thus allowing to learn reliable representation for rare words.

In order to bound the memory requirements of our model, we use a hashing function that maps nn-grams to integers in 1 to KK. We hash character sequences using the Fowler-Noll-Vo hashing function (specifically the FNV-1a variant).http://www.isthe.com/chongo/tech/comp/fnv We set K=2.106K=2.10^{6} below. Ultimately, a word is represented by its index in the word dictionary and the set of hashed nn-grams it contains.

Experimental setup

In most experiments (except in Sec. 5.3), we compare our model to the C implementation of the skipgram and cbow models from the word2vechttps://code.google.com/archive/p/word2vec package.

2 Optimization

We solve our optimization problem by performing stochastic gradient descent on the negative log likelihood presented before. As in the baseline skipgram model, we use a linear decay of the step size. Given a training set containing TT words and a number of passes over the data equal to PP, the step size at time tt is equal to γ0(1tTP)\gamma_{0}(1-\frac{t}{TP}), where γ0\gamma_{0} is a fixed parameter. We carry out the optimization in parallel, by resorting to Hogwild [Recht et al. (2011]. All threads share parameters and update vectors in an asynchronous manner.

3 Implementation details

For both our model and the baseline experiments, we use the following parameters: the word vectors have dimension 300300. For each positive example, we sample 55 negatives at random, with probability proportional to the square root of the uni-gram frequency. We use a context window of size cc, and uniformly sample the size cc between 11 and 55. In order to subsample the most frequent words, we use a rejection threshold of 10410^{-4} (for more details, see [Mikolov et al. (2013b]). When building the word dictionary, we keep the words that appear at least 55 times in the training set. The step size γ0\gamma_{0} is set to 0.0250.025 for the skipgram baseline and to 0.050.05 for both our model and the cbow baseline. These are the default values in the word2vec package and work well for our model too.

Using this setting on English data, our model with character nn-grams is approximately 1.5×1.5\times slower to train than the skipgram baseline. Indeed, we process 105105k words/second/thread versus 145145k words/second/thread for the baseline. Our model is implemented in C++, and is publicly available.https://github.com/facebookresearch/fastText

4 Datasets

Except for the comparison to previous work (Sec. 5.3), we train our models on Wikipedia data.https://dumps.wikimedia.org We downloaded Wikipedia dumps in nine languages: Arabic, Czech, German, English, Spanish, French, Italian, Romanian and Russian. We normalize the raw Wikipedia data using Matt Mahoney’s pre-processing perl script.http://mattmahoney.net/dc/textdata All the datasets are shuffled, and we train our models by doing five passes over them.

Results

We evaluate our model in five experiments: an evaluation of word similarity and word analogies, a comparison to state-of-the-art methods, an analysis of the effect of the size of training data and of the size of character nn-grams that we consider. We will describe these experiments in detail in the following sections.

We first evaluate the quality of our representations on the task of word similarity / relatedness. We do so by computing Spearman’s rank correlation coefficient [Spearman (1904] between human judgement and the cosine similarity between the vector representations. For German, we compare the different models on three datasets: Gur65, Gur350 and ZG222 [Gurevych (2005, Zesch and Gurevych (2006]. For English, we use the WS353 dataset introduced by ?) and the rare word dataset (RW), introduced by ?). We evaluate the French word vectors on the translated dataset RG65 [Joubarne and Inkpen (2011]. Spanish, Arabic and Romanian word vectors are evaluated using the datasets described in [Hassan and Mihalcea (2009]. Russian word vectors are evaluated using the HJ dataset introduced by ?).

We report results for our method and baselines for all datasets in Table 1. Some words from these datasets do not appear in our training data, and thus, we cannot obtain word representation for these words using the cbow and skipgram baselines. In order to provide comparable results, we propose by default to use null vectors for these words. Since our model exploits subword information, we can also compute valid representations for out-of-vocabulary words. We do so by taking the sum of its nn-gram vectors. When OOV words are represented using null vectors we refer to our method as sisg- and sisg otherwise (Subword Information Skip Gram).

First, by looking at Table 1, we notice that the proposed model (sisg), which uses subword information, outperforms the baselines on all datasets except the English WS353 dataset. Moreover, computing vectors for out-of-vocabulary words (sisg) is always at least as good as not doing so (sisg-). This proves the advantage of using subword information in the form of character nn-grams.

Second, we observe that the effect of using character nn-grams is more important for Arabic, German and Russian than for English, French or Spanish. German and Russian exhibit grammatical declensions with four cases for German and six for Russian. Also, many German words are compound words; for instance the nominal phrase “table tennis” is written in a single word as “Tischtennis”. By exploiting the character-level similarities between “Tischtennis” and “Tennis”, our model does not represent the two words as completely different words.

Finally, we observe that on the English Rare Words dataset (RW), our approach outperforms the baselines while it does not on the English WS353 dataset. This is due to the fact that words in the English WS353 dataset are common words for which good vectors can be obtained without exploiting subword information. When evaluating on less frequent words, we see that using similarities at the character level between words can help learning good word vectors.

2 Word analogy tasks

We now evaluate our approach on word analogy questions, of the form AA is to BB as CC is to DD, where DD must be predicted by the models. We use the datasets introduced by ?) for English, by ?) for Czech, by ?) for German and by ?) for Italian. Some questions contain words that do not appear in our training corpus, and we thus excluded these questions from the evaluation.

We report accuracy for the different models in Table 2. We observe that morphological information significantly improves the syntactic tasks; our approach outperforms the baselines. In contrast, it does not help for semantic questions, and even degrades the performance for German and Italian. Note that this is tightly related to the choice of the length of character nn-grams that we consider. We show in Sec. 5.5 that when the size of the nn-grams is chosen optimally, the semantic analogies degrade less. Another interesting observation is that, as expected, the improvement over the baselines is more important for morphologically rich languages, such as Czech and German.

3 Comparison with morphological representations

We also compare our approach to previous work on word vectors incorporating subword information on word similarity tasks. The methods used are: the recursive neural network of ?), the morpheme cbow of ?) and the morphological transformations of ?). In order to make the results comparable, we trained our model on the same datasets as the methods we are comparing to: the English Wikipedia data released by ?), and the news crawl data from the 2013 WMT shared task for German, Spanish and French. We also compare our approach to the log-bilinear language model introduced by ?), which was trained on the Europarl and news commentary corpora. Again, we trained our model on the same data to make the results comparable. Using our model, we obtain representations of out-of-vocabulary words by summing the representations of character nn-grams. We report results in Table 3. We observe that our simple approach performs well relative to techniques based on subword information obtained from morphological segmentors. We also observe that our approach outperforms the ?) method, which is based on prefix and suffix analysis. The large improvement for German is due to the fact that their approach does not model noun compounding, contrary to ours.

4 Effect of the size of the training data

Since we exploit character-level similarities between words, we are able to better model infrequent words. Therefore, we should also be more robust to the size of the training data that we use. In order to assess that, we propose to evaluate the performance of our word vectors on the similarity task as a function of the training data size. To this end, we train our model and the cbow baseline on portions of Wikipedia of increasing size. We use the Wikipedia corpus described above and isolate the first 11, 22, 55, 1010, 2020, and 5050 percent of the data. Since we don’t reshuffle the dataset, they are all subsets of each other. We report results in Fig. 1.

As in the experiment presented in Sec. 5.1, not all words from the evaluation set are present in the Wikipedia data. Again, by default, we use a null vector for these words (sisg-) or compute a vector by summing the nn-gram representations (sisg). The out-of-vocabulary rate is growing as the dataset shrinks, and therefore the performance of sisg- and cbow necessarily degrades. However, the proposed model (sisg) assigns non-trivial vectors to previously unseen words.

First, we notice that for all datasets, and all sizes, the proposed approach (sisg) performs better than the baseline. However, the performance of the baseline cbow model gets better as more and more data is available. Our model, on the other hand, seems to quickly saturate and adding more data does not always lead to improved results.

Second, and most importantly, we notice that the proposed approach provides very good word vectors even when using very small training datasets. For instance, on the German Gur350 dataset, our model (sisg) trained on 5%5\% of the data achieves better performance (6666) than the cbow baseline trained on the full dataset (6262). On the other hand, on the English RW dataset, using 1%1\% of the Wikipedia corpus we achieve a correlation coefficient of 4545 which is better than the performance of cbow trained on the full dataset (4343). This has a very important practical implication: well performing word vectors can be computed on datasets of a restricted size and still work well on previously unseen words. In general, when using vectorial word representations in specific applications, it is recommended to retrain the model on textual data relevant for the application. However, this kind of relevant task-specific data is often very scarce and learning from a reduced amount of training data is a great advantage.

5 Effect of the size of n𝑛n-grams

The proposed model relies on the use of character nn-grams to represent words as vectors. As mentioned in Sec. 3.2, we decided to use nn-grams ranging from 33 to 66 characters. This choice was arbitrary, motivated by the fact that nn-grams of these lengths will cover a wide range of information. They would include short suffixes (corresponding to conjugations and declensions for instance) as well as longer roots. In this experiment, we empirically check for the influence of the range of nn-grams that we use on performance. We report our results in Table 4 for English and German on word similarity and analogy datasets.

We observe that for both English and German, our arbitrary choice of 33-66 was a reasonable decision, as it provides satisfactory performance across languages. The optimal choice of length ranges depends on the considered task and language and should be tuned appropriately. However, due to the scarcity of test data, we did not implement any proper validation procedure to automatically select the best parameters. Nonetheless, taking a large range such as 363-6 provides a reasonable amount of subword information.

This experiment also shows that it is important to include long nn-grams, as columns corresponding to n5n\leq 5 and n6n\leq 6 work best. This is especially true for German, as many nouns are compounds made up from several units that can only be captured by longer character sequences. On analogy tasks, we observe that using larger nn-grams helps for semantic analogies. However, results are always improved by taking n3n\geq 3 rather than n2n\geq 2, which shows that character 22-grams are not informative for that task. As described in Sec. 3.2, before computing character nn-grams, we prepend and append special positional characters to represent the beginning and end of word. Therefore, 22-grams will not be enough to properly capture suffixes that correspond to conjugations or declensions, since they are composed of a single proper character and a positional one.

6 Language modeling

In this section, we describe an evaluation of the word vectors obtained with our method on a language modeling task. We evaluate our language model on five languages (Cs, De, Es, Fr, Ru) using the datasets introduced by ?). Each dataset contains roughly one million training tokens, and we use the same preprocessing and data splits as ?).

Our model is a recurrent neural network with 650650 LSTM units, regularized with dropout (with probability of 0.50.5) and weight decay (regularization parameter of 10510^{-5}). We learn the parameters using the Adagrad algorithm with a learning rate of 0.10.1, clipping the gradients which have a norm larger than 1.01.0. We initialize the weight of the network in the range [0.05,0.05][-0.05,0.05], and use a batch size of 2020. Two baselines are considered: we compare our approach to the log-bilinear language model of ?) and the character aware language model of ?). We trained word vectors with character nn-grams on the training set of the language modeling task and use them to initialize the lookup table of our language model. We report the test perplexity of our model without using pre-trained word vectors (LSTM), with word vectors pre-trained without subword information (sg) and with our vectors (sisg). The results are presented in Table 5.

We observe that initializing the lookup table of the language model with pre-trained word representations improves the test perplexity over the baseline LSTM. The most important observation is that using word representations trained with subword information outperforms the plain skipgram model. We observe that this improvement is most significant for morphologically rich Slavic languages such as Czech (8% reduction of perplexity over sg) and Russian (13% reduction). The improvement is less significant for Roman languages such as Spanish (3% reduction) or French (2% reduction). This shows the importance of subword information on the language modeling task and exhibits the usefulness of the vectors that we propose for morphologically rich languages.

Qualitative analysis

We report sample qualitative results in Table 7. For selected words, we show nearest neighbors according to cosine similarity for vectors trained using the proposed approach and for the skipgram baseline. As expected, the nearest neighbors for complex, technical and infrequent words using our approach are better than the ones obtained using the baseline model.

2 Character n𝑛n-grams and morphemes

We want to qualitatively evaluate whether or not the most important nn-grams in a word correspond to morphemes. To this end, we take a word vector that we construct as the sum of nn-grams. As described in Sec. 3.2, each word ww is represented as the sum of its nn-grams: uw=gGwzgu_{w}=\sum_{g\in\mathcal{G}_{w}}z_{g}. For each nn-gram gg, we propose to compute the restricted representation uw\gu_{w\backslash g} obtained by omitting gg:

We then rank nn-grams by increasing value of cosine between uwu_{w} and uw\gu_{w\backslash g}. We show ranked nn-grams for selected words in three languages in Table 6.

For German, which has a lot of compound nouns, we observe that the most important nn-grams correspond to valid morphemes. Good examples include Autofahrer (car driver) whose most important nn-grams are Auto (car) and Fahrer (driver). We also observe the separation of compound nouns into morphemes in English, with words such as lifetime or starfish. However, for English, we also observe that nn-grams can correspond to affixes in words such as kindness or unlucky. Interestingly, for French we observe the inflections of verbs with endings such as ais>, ent> or ions>.

3 Word similarity for OOV words

As described in Sec. 3.2, our model is capable of building word vectors for words that do not appear in the training set. For such words, we simply average the vector representation of its nn-grams. In order to assess the quality of these representations, we analyze which of the nn-grams match best for OOV words by selecting a few word pairs from the English RW similarity dataset. We select pairs such that one of the two words is not in the training vocabulary and is hence only represented by its nn-grams. For each pair of words, we display the cosine similarity between each pair of nn-grams that appear in the words. In order to simulate a setup with a larger number of OOV words, we use models trained on 1%1\% of the Wikipedia data as in Sec. 5.4. The results are presented in Fig. 2.

We observe interesting patterns, showing that subwords match correctly. Indeed, for the word chip, we clearly see that there are two groups of nn-grams in microcircuit that match well. These roughly correspond to micro and circuit, and nn-grams in between don’t match well. Another interesting example is the pair rarity and scarceness. Indeed, scarce roughly matches rarity while the suffix -ness matches -ity very well. Finally, the word preadolescent matches young well thanks to the -adolesc- subword. This shows that we build robust word representations where prefixes and suffixes can be ignored if the grammatical form is not found in the dictionary.

Conclusion

In this paper, we investigate a simple method to learn word representations by taking into account subword information. Our approach, which incorporates character nn-grams into the skipgram model, is related to an idea that was introduced by ?). Because of its simplicity, our model trains fast and does not require any preprocessing or supervision. We show that our model outperforms baselines that do not take into account subword information, as well as methods relying on morphological analysis. We will open source the implementation of our model, in order to facilitate comparison of future work on learning subword representations.

We thank Marco Baroni, Hinrich Schütze and the anonymous reviewers for their insightful comments.

References