Morphological Word Embeddings

Ryan Cotterell, Hinrich Schütze

Introduction

Word representation is fundamental for NLP. Recently, continuous word-embeddings have gained traction as a general-purpose representation framework. While such embeddings have proven themselves useful, they typically treat words holistically, ignoring their internal structure. For morphologically impoverished languages, i.e., languages with a low morpheme-per-word ratio such as English, this is often not a problem. However, for processing morphologically-rich languages, exploiting word-internal structure is necessary.

Word-embeddings are typically trained to produce representations that capture linguistic similarity. The general idea is that words that are close in the embedding space should be close in meaning. A key issue, however, is that meaning is a multi-faceted concept, and thus there are multiple axes along which two words can be similar. For example, ice and cold are topically related, ice and fire are syntactically related as they are both nouns, and ice and icy are morphologically related as they are both derived from the same root. In this work, we are interested in distinguishing between these various axes and guiding the embeddings such that words with similar embeddings are morphologically related.

We augment the log-bilinear model (LBL) of ?) with a multi-task objective. In addition to raw text, our model is trained on a corpus annotated with morphological tags, encouraging the vectors to encode a word’s morphology. To be concrete, the first task is language modeling (the traditional use of the LBL) and the second is akin to unigram morphological tagging. The LBL, described in section 3, is fundamentally a language model (LM): word-embeddings fall out as low-dimensional representations of context used to predict the next word. We extend the model to jointly predict the next morphological tag along with the next word, encouraging the resulting embeddings to encode morphology. We present a novel metric and experiments on German as a case study that demonstrate that our approach produces word-embeddings that better preserve morphological relationships.

Related Work

Here we discuss the role morphology has played in language modeling and offer a brief overview of various approaches to the larger task of computational morphology.

Morphological structure has been previously integrated into LMs. Most notably, ?) introduced factored LMs, which effectively add tiers, allowing easy incorporation of morphological structure as well as part-of-speech (POS) tags. More recently, ?) trained a class-based LM using common suffixes, which are often indicative of morphology, achieving state-of-the-art results when interpolated with a Kneser-Ney LM. In neural probabilistic modeling, ?) described a recursive neural network LM, whose topology was derived from the output of Morfessor, an unsupervised morphological segmentation tool [Creutz and Lagus, 2005]. Similarly, ?) augmented word2vec [Mikolov et al., 2013] to embed morphs as well as whole words, also taking advantage of Morfessor. Language modeling was approached by ?) with a convolutional neural network with a $k$-best max-pooling layer to extract character-level $n$-grams, efficiently inserting orthographic features into the LM; use of the vectors in downstream POS tagging achieved state-of-the-art results in Portuguese. Finally, most similar to our model, ?) introduced the additive log-bilinear model (LBL++). Best summarized as a neural factored LM, the LBL++ created separate embeddings for each constituent morpheme of a word, summing them to get a single word-embedding.

Computational Morphology

Our work is also related to morphological tagging, which can be thought of as ultra-fine-grained POS tagging. For morphologically impoverished languages, such as English, it is natural to consider a small tag set. For instance, in their universal POS tagset, ?) propose the coarse tag Noun to represent all substantives. In inflectionally-rich languages, like German, considering other nominal attributes, e.g., case, gender and number, is also important. An example of an annotated German phrase is found in table 1. This often leads to a large tag set; e.g., in the morphological tag set of ?), English had 137 tags whereas morphologically-rich Czech had 970 tags!

Clearly, much of the information needed to determine a word’s morphological tag is encoded in the word itself. For example, the suffix -ed is generally indicative of the past tense in English. However, distributional similarity has also been shown to be an important cue for morphology [Yarowsky and Wicentowski, 2000, Schone and Jurafsky, 2001]. Just as contextual signatures can be reliably exploited to approximate the semantics of the lexicon [Harris, 1954] (you shall know the meaning of the word by the company it keeps [Firth, 1957]), they can be similarly exploited for morphological analysis. This is not an unexpected result: in German, e.g., we would expect nouns that follow an adjective in the genitive case to also be in the genitive case themselves. Much of what our model is designed to accomplish is the isolation of the components of the contextual signature that are indeed predictive of morphology.

Log-Bilinear Model

The LBL is a generalization of the well-known log-linear model. The key difference lies in how it deals with features—instead of making use of hand-crafted features, the LBL learns the features along with the weights. In the language modeling setting, we define the following model,

where $w$ is a word, $h$ is a history and $s_{\theta}$ is an energy function. Following the notation of ?), in the LBL we define
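As a sketch, the two definitions above can be written in the standard LBL form. The softmax normalization over a vocabulary $V$ follows from the log-linear framing. The symbols $V$, $C_{i}$, $r_{h_{i}}$, $b_{w}$ and the history length $n-1$ are taken from the usual presentation of the LBL and are assumptions here, since the surrounding text names only $w$, $h$, $s_{\theta}$ and $q_{w}$.

```latex
p(w \mid h) = \frac{\exp\left(s_{\theta}(w, h)\right)}
                   {\sum_{w' \in V} \exp\left(s_{\theta}(w', h)\right)},
\qquad
s_{\theta}(w, h) = \Big(\sum_{i=1}^{n-1} C_{i}\, r_{h_{i}}\Big)^{\top} q_{w} + b_{w}
```

Under this reading, $r_{h_{i}}$ is the embedding of the $i$-th history word, $C_{i}$ is a position-specific matrix, $q_{w}$ is the embedding of the predicted word and $b_{w}$ is a per-word bias.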

Morph-LBL

We propose a multi-task objective that jointly predicts the next word $w$ and its morphological tag $t$ given a history $h$ . Thus we are interested in a joint probability distribution defined as

where $f_{t}$ is a hand-crafted feature vector for a morphological tag $t$ and $S$ is an additional weight matrix. Upon inspection, we see that

Hence given a fixed embedding $q_{w}$ for word $w$ , we can interpret $S$ as the weights of a conditional log-linear model used to predict the tag $t$ .
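The step from the joint distribution to the conditional can be made explicit. The sketch below assumes the joint energy is the LBL energy $s_{\theta}(w,h)$ plus a tag term $f_{t}^{\top} S q_{w}$, which matches the roles given to $f_{t}$, $S$ and $q_{w}$ above. The exact form, including whether a tag bias is present, is an assumption, and $T$ denotes the tag set.

```latex
p(w, t \mid h) \propto \exp\left(s_{\theta}(w, h) + f_{t}^{\top} S\, q_{w}\right)

p(t \mid w, h)
  = \frac{p(w, t \mid h)}{\sum_{t' \in T} p(w, t' \mid h)}
  = \frac{\exp\left(f_{t}^{\top} S\, q_{w}\right)}
         {\sum_{t' \in T} \exp\left(f_{t'}^{\top} S\, q_{w}\right)}
```

The factor $\exp(s_{\theta}(w,h))$ does not depend on $t$, so it cancels in the ratio. What remains is a log-linear model over tags with features $f_{t}$, in which $S$ acts on the fixed embedding $q_{w}$.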

Morphological tags lend themselves to easy featurization. As shown in table 1, the morphological tag Adj.Nom.Sg.Fem decomposes into sub-tag units Adj, Nom, Sg and Fem. Our model includes a binary feature for each sub-tag unit in the tag set and only those present in a given tag fire; e.g., $f_{\textsc{Adj.Nom.Sg.Fem}}$ is a vector with exactly four non-zero components.
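The featurization just described can be sketched in a few lines of Python. The helper names `build_subtag_index` and `featurize` and the three-tag toy tag set are hypothetical and chosen for illustration; only the decomposition of a tag into dot-separated sub-tag units comes from the text.

```python
def build_subtag_index(tags):
    """Map every sub-tag unit seen in the tag set to a feature position."""
    units = sorted({unit for tag in tags for unit in tag.split(".")})
    return {unit: i for i, unit in enumerate(units)}


def featurize(tag, index):
    """Return the binary feature vector f_t for one morphological tag."""
    vec = [0] * len(index)
    for unit in tag.split("."):
        vec[index[unit]] = 1  # only sub-tag units present in the tag fire
    return vec


# Toy tag set for illustration only (not the TIGER tag set).
tag_set = ["Adj.Nom.Sg.Fem", "Noun.Nom.Sg.Fem", "Noun.Gen.Pl.Masc"]
index = build_subtag_index(tag_set)
f_t = featurize("Adj.Nom.Sg.Fem", index)
```

Because tags share sub-tag units, two tags that differ only in case or gender get feature vectors that overlap in most positions.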

In the fully supervised case, the method we proposed above requires a corpus annotated with morphological tags to train. This conflicts with a key use case of word-embeddings—they allow the easy incorporation of large, unannotated corpora into supervised tasks [Turian et al., 2010]. To resolve this, we train our model on a partially annotated corpus. The key idea here is that we only need a partial set of labeled data to steer the embeddings to ensure they capture morphological properties of the words. We marginalize out the tags for the subset of the data for which we do not have annotation.
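A minimal sketch of how annotated and unannotated tokens could contribute to one objective, assuming a table of unnormalized joint scores for each word and tag pair under a fixed history. The function `example_log_likelihood` and the toy scores are hypothetical. For an unannotated token the tag is summed out, i.e. $\log p(w \mid h) = \log \sum_{t} p(w, t \mid h)$.

```python
import math


def logsumexp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def example_log_likelihood(scores, word, tag=None):
    """Log-likelihood of one training example.

    scores[w][t] is the unnormalized joint score of word w with tag t for
    a fixed history h. If tag is None the example is unannotated and the
    tag is marginalized out: log p(w | h) = log sum_t p(w, t | h).
    """
    log_z = logsumexp([s for row in scores for s in row])
    if tag is None:
        return logsumexp(scores[word]) - log_z
    return scores[word][tag] - log_z


# Toy scores for 2 words x 3 tags (illustrative numbers only).
scores = [[1.0, 0.5, -1.0], [0.0, 2.0, 0.3]]
labeled = example_log_likelihood(scores, word=1, tag=1)
unlabeled = example_log_likelihood(scores, word=1)
```

The marginal likelihood of a word is never smaller than the joint likelihood of that word with any single tag, so unannotated tokens still inform the word embeddings without committing to a tag.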

Evaluation

In our evaluation, we attempt to intrinsically determine whether it is indeed true that words similar in the embedding space are morphologically related. Qualitative evaluation, shown in figure 1, indicates that this is the case.

We introduce a new evaluation metric for morphologically-driven embeddings to quantitatively score models. Roughly, the question we want to evaluate is: are words that are similar in the embedded space also morphologically related? Given a word $w$ and its embedding $q_{w}$ , let $\mathcal{M}_{w}$ be the set of morphological tags associated with $w$ represented by bit vectors. This is a set because words may have several morphological parses. Our measure is then defined below,

where $m_{w}\in\mathcal{M}_{w}$, $m_{w^{\prime}}\in\mathcal{M}_{w^{\prime}}$, $d_{h}$ is the Hamming distance and $\mathcal{K}_{w}$ is a set of words close to $w$ in the embedding space. We are given some freedom in choosing the set $\mathcal{K}_{w}$; in our experiments we take $\mathcal{K}_{w}$ to be the $k$-nearest neighbors ($k$-NN) in the embedded space using cosine distance. We report performance under this evaluation metric for various $k$. Note that MorphoSim can be viewed as a soft version of $k$-NN: we measure not just whether a word has the same morphological tag as its neighbors, but whether it has a similar morphological tag.
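A minimal sketch of a measure matching this description: for each neighbor $w^{\prime}\in\mathcal{K}_{w}$, take the smallest Hamming distance between any tag of $w$ and any tag of $w^{\prime}$, then sum over neighbors. Whether the score is negated or normalized by $|\mathcal{K}_{w}|$ is left open here and should be treated as an assumption. The function name `morpho_distance` and the toy bit vectors are hypothetical.

```python
def hamming(a, b):
    """Hamming distance between two equal-length bit vectors."""
    if len(a) != len(b):
        raise ValueError("bit vectors must have the same length")
    return sum(x != y for x, y in zip(a, b))


def morpho_distance(word, neighbors, tags):
    """Sum over neighbors of the minimum Hamming distance between parses.

    tags maps each word to its set M_w of tag bit vectors (as tuples);
    neighbors is the set K_w of words near `word` in embedding space.
    Lower values mean the neighbors are morphologically closer.
    """
    total = 0
    for other in neighbors:
        total += min(
            hamming(m_w, m_o) for m_w in tags[word] for m_o in tags[other]
        )
    return total


# Toy bit vectors (hypothetical feature order; not from the paper).
tags = {
    "a": {(1, 0, 1, 0)},
    "b": {(1, 0, 1, 0), (0, 1, 1, 0)},
    "c": {(0, 1, 0, 1)},
}
score = morpho_distance("a", ["b", "c"], tags)
```

Taking the minimum over parses means an ambiguous neighbor is scored by its best-matching analysis, which is what makes the measure tolerant of words with several morphological parses.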

Metrics similar to MorphoSim have been applied in the speech recognition community. For example, ?) had a similar motivation for their evaluation of fixed-length acoustic embeddings that preserve linguistic similarity.

Experiments and Results

To show the potential of our approach, we chose to perform a case study on German, a morphologically-rich language. We conducted experiments on the TIGER corpus of newspaper German [Brants et al., 2004]. To the best of our knowledge, no previous word-embedding techniques have attempted to incorporate morphological tags into embeddings in a supervised fashion. We note again that there has been recent work on incorporating morphological segmentations into embeddings—generally in a pipelined approach using a segmenter, e.g., Morfessor, as a preprocessing step, but we distinguish our model through its use of a different view on morphology.

We opted to compare Morph-LBL with two fully unsupervised models: the original LBL and word2vec (code.google.com/p/word2vec/, ?)). All models were trained on the first 200k words of the train split of the TIGER corpus; Morph-LBL was given the correct morphological annotation for the first 100k words. The LBL and Morph-LBL models were implemented in Python using Theano [Bastien et al., 2012]. All vectors had dimensionality 200. We used the Skip-Gram model of the word2vec toolkit with context $n=5$. We initialized parameters of LBL and Morph-LBL randomly and trained them using stochastic gradient descent [Robbins and Monro, 1951]. We used a history size of $n=4$.

We first investigated whether the embeddings learned by Morph-LBL do indeed encode morphological information. For each word, we selected the most frequently occurring morphological tag for that word (ties were broken randomly). We then treated the problem of labeling a word-embedding with its most frequent morphological tag as a multi-way classification problem. We trained a $k$-nearest neighbors classifier where $k$ was optimized on development data. We used the scikit-learn library [Pedregosa et al., 2011] on all types in the vocabulary with 10-fold cross-validation, holding out 10% of the data for testing at each fold and an additional 10% of training as a development set. The results displayed in table 2 are broken down by whether Morph-LBL observed the morphological tag at training time or not. We see that embeddings from Morph-LBL do store the proper morphological analysis at a much higher rate than both the vanilla LBL and word2vec.
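The classification setup can be illustrated with a dependency-free sketch. This is a stand-in for the scikit-learn classifier, not the code used in the experiments. It assumes Euclidean distance and majority voting, shows a single train/development split instead of the 10-fold procedure, and uses hypothetical function names and toy data throughout.

```python
import math
from collections import Counter


def knn_predict(train, query, k):
    """Majority label among the k training vectors nearest to query."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def accuracy(train, heldout, k):
    """Fraction of held-out (vector, label) pairs predicted correctly."""
    hits = sum(knn_predict(train, vec, k) == label for vec, label in heldout)
    return hits / len(heldout)


def choose_k(train, dev, candidates):
    """Pick the candidate k with the best development-set accuracy."""
    return max(candidates, key=lambda k: accuracy(train, dev, k))


# Toy 2-d "embeddings" labeled with a most-frequent tag (illustrative only).
train = [((0.0, 0.1), "Noun"), ((0.1, 0.0), "Noun"), ((0.2, 0.1), "Noun"),
         ((1.0, 1.1), "Adj"), ((1.1, 1.0), "Adj"), ((0.9, 1.0), "Adj")]
dev = [((0.1, 0.1), "Noun"), ((1.0, 1.0), "Adj")]
best_k = choose_k(train, dev, candidates=[1, 3, 5])
prediction = knn_predict(train, (0.95, 1.05), best_k)
```

In the full procedure the same selection of $k$ would be repeated inside each cross-validation fold, with test accuracy measured on the held-out 10%.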

Word-embeddings, however, are often trained on massive amounts of unlabeled data. To this end, we also explored how word2vec itself encodes morphology when trained on an order of magnitude more data. Using the same experimental setup as above, we trained word2vec on the union of the TIGER German corpus and the German section of Europarl [Koehn, 2005] for a total of $\approx$ 45 million tokens. Looking only at those types found in TIGER, we found that the $k$-NN classifier predicted the correct tag with $\approx$ 22% accuracy (not shown in the table).

Experiment 2: MorphoDist

We also evaluated the three types of embeddings using the MorphoSim metric introduced in section 5.1. This metric roughly tells us how similar each word is to its neighbors, where distance is measured by the Hamming distance between morphological tags. We only evaluated on words that Morph-LBL did not observe at training time to get a fair idea of how well our model has managed to encode morphology purely from the contextual signature. Figure 2 reports results for $k\in\{5,10,25,50\}$ nearest neighbors. We see that the values of $k$ studied do not affect the metric: the closest 5 words are about as similar as the closest 50 words. We see again that the Morph-LBL embeddings generally encode morphology better than the baselines.

Discussion

The superior performance of Morph-LBL over both the original LBL and word2vec under both evaluation metrics is not surprising, as we provide our model with annotated data at training time. That the LBL outperforms word2vec is also not surprising. The LBL looks at a local history, making it more amenable to learning syntactically-aware embeddings than word2vec, whose skip-grams often look at non-local context.

What is of interest, however, is Morph-LBL’s ability to robustly maintain morphological relationships only making use of the distributional signature, without word-internal features. This result shows that in large corpora, a large portion of morphology can be extracted through contextual similarity.

Conclusion and Future Work

We described a new model, Morph-LBL, for the semi-supervised induction of morphologically guided embeddings. The combination of morphologically annotated data with raw text allows us to train embeddings that preserve morphological relationships among words. Our model handily outperformed two baselines trained on the same corpus.

While contextual signatures provide a strong cue for morphological proximity, orthographic features are also requisite for a strong model. Consider the words loving and eating. Both are likely to occur after is/are and thus their local contextual signatures are likely to be similar. However, perhaps an equally strong signal is that the two words end in the same substring ing. Future work will handle such integration of character-level features.

We are interested in the application of our embeddings to morphological tagging and other tasks. Word-embeddings have proven themselves as useful features in a variety of tasks in the NLP pipeline. Morphologically-driven embeddings have the potential to leverage raw text in a way state-of-the-art morphological taggers cannot, improving tagging performance downstream.

Acknowledgements

This material is based upon work supported by a Fulbright fellowship awarded to the first author by the German-American Fulbright Commission and the National Science Foundation under Grant No. 1423276. The second author was supported by Deutsche Forschungsgemeinschaft (grant DFG SCHU 2246/10-1). We thank Thomas Müller for several insightful discussions on morphological tagging and Jason Eisner for discussions about experimental design. Finally, we thank the anonymous reviewers for their many helpful comments.