Multilingual Part-of-Speech Tagging with Bidirectional Long Short-Term Memory Models and Auxiliary Loss
Barbara Plank, Anders Søgaard, Yoav Goldberg
Introduction
Recently, bidirectional long short-term memory networks (bi-LSTM) [Graves and Schmidhuber, 2005, Hochreiter and Schmidhuber, 1997] have been used for language modelling [Ling et al., 2015], POS tagging [Ling et al., 2015, Wang et al., 2015], transition-based dependency parsing [Ballesteros et al., 2015, Kiperwasser and Goldberg, 2016], fine-grained sentiment analysis [Liu et al., 2015], syntactic chunking [Huang et al., 2015], and semantic role labeling [Zhou and Xu, 2015]. LSTMs are recurrent neural networks (RNNs) in which layers are designed to prevent vanishing gradients. Bidirectional LSTMs make a backward and forward pass through the sequence before passing on to the next layer. For further details, see [Goldberg, 2015, Cho, 2015].
We consider using bi-LSTMs for POS tagging. Previous work on using deep learning-based methods for POS tagging has focused either on a single language [Collobert et al., 2011, Wang et al., 2015] or a small set of languages [Ling et al., 2015, Santos and Zadrozny, 2014]. Instead we evaluate our models across 22 languages. In addition, we compare performance with representations at different levels of granularity (words, characters, and bytes). These levels of representation were previously introduced in different efforts [Chrupała, 2013, Zhang et al., 2015, Ling et al., 2015, Santos and Zadrozny, 2014, Gillick et al., 2016, Kim et al., 2015], but a comparative evaluation was missing.
Moreover, deep networks are often said to require large volumes of training data. We investigate to what extent bi-LSTMs are more sensitive to the amount of training data and label noise than standard POS taggers.
Finally, we introduce a novel model, a bi-LSTM trained with auxiliary loss. The model jointly predicts the POS and the log frequency of the word. The intuition behind this model is that the auxiliary loss, being predictive of word frequency, helps to differentiate the representations of rare and common words. We indeed observe performance gains on rare and out-of-vocabulary words. These performance gains transfer into general improvements for morphologically rich languages.
In this paper, we a) evaluate the effectiveness of different representations in bi-LSTMs, b) compare these models across a large set of languages and under varying conditions (data size, label noise) and c) propose a novel bi-LSTM model with auxiliary loss (Logfreq).
Tagging with bi-LSTMs
Recurrent neural networks (RNNs) [Elman, 1990] allow the computation of fixed-size vector representations for word sequences of arbitrary length. An RNN is a function that reads in $n$ vectors $x_1, \ldots, x_n$ and produces an output vector $h_n$ that depends on the entire sequence $x_1, \ldots, x_n$. The vector $h_n$ is then fed as an input to some classifier, or to higher-level RNNs in stacked/hierarchical models. The entire network is trained jointly such that the hidden representation captures the important information from the sequence for the prediction task.
A bidirectional recurrent neural network (bi-RNN) [Graves and Schmidhuber, 2005] is an extension of an RNN that reads the input sequence twice, from left to right and from right to left, and concatenates the two encodings. The literature uses the term bi-RNN to refer to two related architectures, which we refer to here as “context bi-RNN” and “sequence bi-RNN”. In a sequence bi-RNN ($\text{bi-RNN}_{seq}$), the input is a sequence of vectors $x_{1:n}$ and the output is a concatenation ($\circ$) of a forward ($f$) and a reverse ($r$) RNN, each reading the sequence in a different direction:
$$v = \text{bi-RNN}_{seq}(x_{1:n}) = \text{RNN}_f(x_{1:n}) \circ \text{RNN}_r(x_{n:1})$$
In a context bi-RNN ($\text{bi-RNN}_{ctx}$), we get an additional input $i$ indicating a sequence position, and the resulting vector $v_i$ results from concatenating the RNN encodings up to $i$:
$$v_i = \text{bi-RNN}_{ctx}(x_{1:n}, i) = \text{RNN}_f(x_{1:i}) \circ \text{RNN}_r(x_{n:i})$$
Thus, the state vector $v_i$ in this bi-RNN encodes information at position $i$ and its entire sequential context. Another view of the context bi-RNN is of taking a sequence $x_{1:n}$ and returning the corresponding sequence of state vectors $v_{1:n}$.
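The difference between the two bi-RNN variants can be illustrated with a minimal pure-Python sketch. It assumes a simple Elman-style recurrence with random, untrained weights; the helper names (`make_rnn`, `bi_rnn_seq`, `bi_rnn_ctx`) and the dimensions are hypothetical and chosen only for illustration, not taken from the released code.

```python
import math
import random


def make_rnn(input_dim, hidden_dim, seed):
    """Build a toy Elman-style RNN (assumption: tanh recurrence, random weights)."""
    rng = random.Random(seed)
    W_x = [[rng.gauss(0, 0.1) for _ in range(input_dim)] for _ in range(hidden_dim)]
    W_h = [[rng.gauss(0, 0.1) for _ in range(hidden_dim)] for _ in range(hidden_dim)]

    def run(xs):
        """Read the vectors in xs in order and return the state after each one."""
        h = [0.0] * hidden_dim
        states = []
        for x in xs:
            h = [
                math.tanh(
                    sum(w * xi for w, xi in zip(W_x[j], x))
                    + sum(w * hi for w, hi in zip(W_h[j], h))
                )
                for j in range(hidden_dim)
            ]
            states.append(h)
        return states

    return run


def bi_rnn_seq(rnn_f, rnn_r, xs):
    """Sequence bi-RNN: a single vector summarising the whole sequence."""
    return rnn_f(xs)[-1] + rnn_r(xs[::-1])[-1]  # list concatenation


def bi_rnn_ctx(rnn_f, rnn_r, xs):
    """Context bi-RNN: one vector per position, encoding left and right context."""
    fwd = rnn_f(xs)              # fwd[i] has read the inputs from the start up to i
    bwd = rnn_r(xs[::-1])[::-1]  # bwd[i] has read the inputs from the end down to i
    return [f + b for f, b in zip(fwd, bwd)]


rnn_f = make_rnn(input_dim=4, hidden_dim=3, seed=1)
rnn_r = make_rnn(input_dim=4, hidden_dim=3, seed=2)
data_rng = random.Random(0)
xs = [[data_rng.gauss(0, 1) for _ in range(4)] for _ in range(5)]

v = bi_rnn_seq(rnn_f, rnn_r, xs)   # one vector of size 2 * hidden_dim
vs = bi_rnn_ctx(rnn_f, rnn_r, xs)  # five vectors, one per input position
```

In this sketch the sequence variant returns one vector per sequence, which suits encoding the characters of a word, while the context variant returns one vector per position, which suits predicting a tag for every token.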
LSTMs [Hochreiter and Schmidhuber, 1997] are a variant of RNNs that replace the cells of RNNs with LSTM cells that were designed to prevent vanishing gradients. Bidirectional LSTMs are the bi-RNN counterpart based on LSTMs.
Our basic bi-LSTM tagging model is a context bi-LSTM taking as input word embeddings $\vec{w}$. We incorporate subtoken information using a hierarchical bi-LSTM architecture [Ling et al., 2015, Ballesteros et al., 2015]. We compute subtoken-level (either character $\vec{c}$ or Unicode byte $\vec{b}$) embeddings of words using a sequence bi-LSTM at the lower level. This representation is then concatenated with the (learned) word embedding vector $\vec{w}$, which forms the input to the context bi-LSTM at the next layer. This model, illustrated in Figure 1 (lower part in left figure), is inspired by ?). We also test models in which we only keep sub-token information, e.g., either both byte and character embeddings (Figure 1, right) or a single (sub-)token representation alone.
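To make the distinction between the character and byte levels concrete, the following sketch splits a word into the two kinds of subtoken units that a lower-level bi-LSTM could read. The use of UTF-8 as the byte encoding and the function names are assumptions made for illustration; the text above only speaks of Unicode bytes.

```python
def char_units(word):
    """Split a word into Unicode characters (code points)."""
    return list(word)


def byte_units(word, encoding="utf-8"):
    """Split a word into byte values (assumption: UTF-8 encoding)."""
    return list(word.encode(encoding))


word = "S\u00f8gaard"  # "Søgaard", written with an explicit code point for ø
chars = char_units(word)
byte_values = byte_units(word)
```

For words containing non-ASCII characters the two sequences differ in length, since a single character such as ø is encoded as two bytes in UTF-8; for pure ASCII words they coincide.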
In our novel model, cf. Figure 1 left, we train the bi-LSTM tagger to predict both the tags of the sequence and a label that represents the log frequency of the token as estimated from the training data. Our combined cross-entropy loss is now $L(\hat{y}_t, y_t) + L(\hat{y}_a, y_a)$, where $t$ stands for a POS tag and $a$ is the log frequency label, i.e., $a = \mathrm{int}(\log(\mathrm{freq}_{train}(w)))$. Combining this log frequency objective with the tagging task can be seen as an instance of multi-task learning in which the labels are predicted jointly. The idea behind this model is to make the representation predictive for frequency, which encourages the model not to share representations between common and rare words, thus benefiting the handling of rare tokens.
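The auxiliary objective can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the released implementation: the frequency label is taken to be the integer part of the natural logarithm of the training count (the base of the logarithm is an assumption), the two cross-entropy terms are summed with equal weight, and the function names and toy probabilities are hypothetical.

```python
import math
from collections import Counter


def log_freq_labels(train_tokens):
    """Map each training word to int(log(count)); natural log is an assumption."""
    counts = Counter(train_tokens)
    return {word: int(math.log(count)) for word, count in counts.items()}


def cross_entropy(probs, gold):
    """Negative log-probability that a predicted distribution assigns to the gold label."""
    return -math.log(probs[gold])


def combined_loss(tag_probs, gold_tag, freq_probs, gold_freq):
    """Sum of the POS tagging loss and the auxiliary log-frequency loss."""
    return cross_entropy(tag_probs, gold_tag) + cross_entropy(freq_probs, gold_freq)


train_tokens = "the cat saw the dog and the bird".split()
labels = log_freq_labels(train_tokens)  # "the" occurs 3 times, all other words once

# Toy output distributions for the token "the" (made-up numbers).
tag_probs = {"DET": 0.5, "NOUN": 0.5}
freq_probs = {0: 0.5, 1: 0.5}
loss = combined_loss(tag_probs, "DET", freq_probs, labels["the"])
```

Because the labels are truncated logarithms, the auxiliary task has only a handful of classes, so frequent and rare words fall into clearly separated bins.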
Experiments
All bi-LSTM models were implemented in CNN/pycnn (https://github.com/clab/cnn), a flexible neural network library. For all models we use the same hyperparameters, which were set on English dev, i.e., SGD training with cross-entropy loss, no mini-batches, 20 epochs, default learning rate (0.1), 128 dimensions for word embeddings, 100 for character and byte embeddings, 100 hidden states and Gaussian noise with $\sigma$=0.2. As training is stochastic in nature, we use a fixed seed throughout. Embeddings are not initialized with pre-trained embeddings, except when reported otherwise. In that case we use off-the-shelf polyglot embeddings [Al-Rfou et al., 2013] (https://sites.google.com/site/rmyeid/projects/polyglot). No further unlabeled data is considered in this paper. The code is released at: https://github.com/bplank/bilstm-aux
We want to compare POS taggers under varying conditions. We hence use three different types of taggers: our implementation of a bi-LSTM; TnT [Brants, 2000], a second-order HMM with suffix trie handling for OOVs; and a CRF tagger. We use TnT as it was among the best performing taggers evaluated in ?). (They found TreeTagger was closely followed by HunPos, a re-implementation of TnT, and that Stanford and ClearNLP were lower ranked. In an initial investigation, we compared TnT, HunPos and TreeTagger and found TnT to be consistently better than TreeTagger; HunPos followed closely but crashed on some languages, e.g., Arabic.) For the CRF tagger, which complements the NN-based and HMM-based taggers, we use a freely available implementation [Plank et al., 2014] based on crfsuite.
Datasets
For the multilingual experiments, we use the data from the Universal Dependencies project v1.2 [Nivre et al., 2015] (17 POS) with the canonical data splits. For languages with token segmentation ambiguity we use the provided gold segmentation. If there is more than one treebank per language, we use the treebank that has the canonical language name (e.g., Finnish instead of Finnish-FTB). We consider all languages that have at least 60k tokens and are distributed with word forms, resulting in 22 languages. We also report accuracies on WSJ (45 POS) using the standard splits [Collins, 2002, Manning, 2011]. The overview of languages is provided in Table 1.
Results
Our results are given in Table 2. First of all, notice that TnT performs remarkably well across the 22 languages, closely followed by CRF. The bi-LSTM tagger ($\vec{w}$) without a lower-level bi-LSTM for subtokens falls short: it outperforms the traditional taggers on only 3 languages. The bi-LSTM model clearly benefits from character representations. The model using characters alone ($\vec{c}$) works remarkably well; it improves over TnT on 9 languages (incl. Slavic and Nordic languages). The combined word+character representation model is the best representation, outperforming the baseline on all except one language (Indonesian) and providing strong results already without pre-trained embeddings. This model ($\vec{w}+\vec{c}$) reaches the biggest improvement (more than +2% accuracy) on Hebrew and Slovene. Initializing the word embeddings (+Polyglot) with off-the-shelf language-specific embeddings further improves accuracy. The only system we are aware of that evaluates on UD is ?) (last column). However, note that these results are not strictly comparable as they use the earlier UD v1.1 version.
The overall best system is the multi-task bi-LSTM freqbin (it uses the word and character representations $\vec{w}+\vec{c}$ and Polyglot initialization for $\vec{w}$). While on macro average it is on par with bi-LSTM $\vec{w}+\vec{c}$, it obtains the best results on 12/22 languages, and it is successful in predicting POS for OOV tokens (cf. Table 2 OOV Acc columns), especially for languages like Arabic, Farsi, Hebrew, and Finnish.
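The quantities discussed here (overall accuracy, accuracy on OOV tokens, and the macro average over languages) can be computed as in the following sketch. It assumes that an OOV token is one whose word form does not occur in the training data; the function names and the toy data are hypothetical and unrelated to the reported results.

```python
def accuracy(gold, pred):
    """Token-level accuracy over aligned gold and predicted tags."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def oov_accuracy(tokens, gold, pred, train_vocab):
    """Accuracy restricted to tokens whose form is absent from the training vocabulary."""
    pairs = [(g, p) for w, g, p in zip(tokens, gold, pred) if w not in train_vocab]
    if not pairs:
        return None  # no OOV tokens in this test set
    return sum(g == p for g, p in pairs) / len(pairs)


def macro_average(scores_by_language):
    """Unweighted mean of per-language scores."""
    return sum(scores_by_language.values()) / len(scores_by_language)


# Toy example: "wug" and "loudly" are OOV with respect to the training vocabulary.
train_vocab = {"the", "dog", "barks"}
tokens = ["the", "wug", "barks", "loudly"]
gold = ["DET", "NOUN", "VERB", "ADV"]
pred = ["DET", "NOUN", "VERB", "ADJ"]

overall = accuracy(gold, pred)
oov = oov_accuracy(tokens, gold, pred, train_vocab)
macro = macro_average({"lang_a": 0.9, "lang_b": 0.8})
```

The macro average weights every language equally regardless of test set size, which is why two systems can tie on it while one of them wins on more individual languages.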
We examined simple RNNs and confirm the finding of ?) that they performed worse than their LSTM counterparts. Finally, the bi-LSTM tagger is competitive on WSJ, cf. Table 3.
In order to evaluate the effect of modeling sub-token information, we examine accuracy at different token frequencies. Figure 2 shows absolute improvements in accuracy of bi-LSTM over mean log frequency, for different language families. We see that especially for Slavic and non-Indo-European languages, which have high morphological complexity, most of the improvement is obtained in the Zipfian tail. Rare tokens benefit from the sub-token representations.
Data set size
Prior work mostly used large data sets when applying neural network based approaches [Zhang et al., 2015]. We evaluate how brittle such models are with respect to their more traditional counterparts by training bi-LSTM ($\vec{w}+\vec{c}$ without Polyglot embeddings) for increasing amounts of training instances (number of sentences). The learning curves in Figure 3 show similar trends across language families. (We observe the same pattern with more iterations, namely 40.) TnT is better with little data, bi-LSTM is better with more data, and bi-LSTM always wins over CRF. The bi-LSTM model already performs surprisingly well after only 500 training sentences. For non-Indo-European languages it is on par with or above the other taggers with even less data (100 sentences). This shows that bi-LSTMs often need more data than the generative Markovian model, but this is definitely less than what we expected.
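A learning-curve experiment of this kind can be set up as in the sketch below. It assumes that each smaller training set is a prefix of the next larger one; how the subsets were actually drawn is not stated here. The function name and all sizes other than 100 and 500 are hypothetical.

```python
def nested_training_subsets(sentences, sizes):
    """Return {n: first n sentences} for every requested size that the corpus can supply."""
    subsets = {}
    for n in sorted(sizes):
        if n > len(sentences):
            break  # the corpus is too small for this and all larger sizes
        subsets[n] = sentences[:n]
    return subsets


# Toy corpus standing in for a treebank's training sentences.
corpus = [f"sentence {i}" for i in range(1200)]
subsets = nested_training_subsets(corpus, sizes=[100, 500, 1000, 5000])
```

Using nested prefixes means that every larger training set contains the smaller ones, so differences along the curve reflect added data rather than a different sample.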
Label Noise
We investigated the susceptibility of the models to noise by artificially corrupting training labels. Our initial results show that at low noise rates, bi-LSTMs and TnT are affected similarly: their accuracies drop to a similar degree. Only at higher noise levels (more than 30% corrupted labels) are bi-LSTMs less robust, showing larger drops in accuracy than TnT. This is the case for all investigated language families.
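One simple way to corrupt training labels artificially is sketched below. The corruption scheme is an assumption, since the procedure is not specified here: each tag is replaced, with probability `noise_rate`, by a different tag drawn uniformly from the tagset. The function name and the abbreviated tagset are hypothetical.

```python
import random


def corrupt_labels(tags, tagset, noise_rate, seed=0):
    """Replace each tag with a different, uniformly chosen tag with probability noise_rate."""
    rng = random.Random(seed)  # fixed seed so that the corruption is reproducible
    noisy = []
    for tag in tags:
        if rng.random() < noise_rate:
            alternatives = [t for t in tagset if t != tag]
            noisy.append(rng.choice(alternatives))
        else:
            noisy.append(tag)
    return noisy


tagset = ["NOUN", "VERB", "DET", "ADJ"]  # abbreviated tagset for illustration
tags = ["DET", "NOUN", "VERB", "DET", "ADJ", "NOUN"] * 50

clean = corrupt_labels(tags, tagset, noise_rate=0.0)
noisy = corrupt_labels(tags, tagset, noise_rate=0.3)
all_flipped = corrupt_labels(tags, tagset, noise_rate=1.0)
```

Only the training labels would be corrupted in such a setup; the test data stays clean so that accuracy still measures agreement with the true tags.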
Related Work
Character embeddings were first introduced by ?) for language modeling. Early applications include text classification [Chrupała, 2013, Zhang et al., 2015]. Recently, these representations were successfully applied to a range of structured prediction tasks. For POS tagging, ?) were the first to propose character-based models. They use a convolutional neural network (CNN; or convnet) and evaluate their model on English (PTB) and Portuguese, showing that the model achieves state-of-the-art performance close to taggers using carefully designed feature templates. ?) extend this line of work and compare a novel bi-LSTM model that learns word representations through character embeddings. They evaluate their model on a language modeling and POS tagging setup, and show that bi-LSTMs outperform the CNN approach of ?). Similarly, ?) evaluate character embeddings for German. Bi-LSTMs for POS tagging are also reported in ?); however, they only explore word embeddings and orthographic information, and evaluate on WSJ only. A related study is ?), who propose a multi-task RNN for named entity recognition by jointly predicting the next token and the current token’s name label. Our model is simpler: it uses a very coarse set of labels rather than integrating an entire language modeling task, which is computationally more expensive. An interesting recent study is ?), who build a single byte-to-span model for multiple languages based on a sequence-to-sequence RNN [Sutskever et al., 2014], achieving impressive results. We would like to extend this work in their direction.
Conclusions
We evaluated token and subtoken-level representations for neural network-based part-of-speech tagging across 22 languages and proposed a novel multi-task bi-LSTM with auxiliary loss. The auxiliary loss is effective at improving the accuracy of rare words.
Subtoken representations are necessary to obtain a state-of-the-art POS tagger, and character embeddings are particularly helpful for non-Indo-European and Slavic languages.
Combining them with word embeddings in a hierarchical network provides the best representation. The bi-LSTM tagger is as effective as the CRF and HMM taggers with as few as 500 training sentences, but is less robust to label noise (at higher noise rates).
Acknowledgments
We thank the anonymous reviewers for their feedback. AS is funded by the ERC Starting Grant LOWLANDS No. 313695. YG is supported by The Israeli Science Foundation (grant number 1555/15) and a Google Research Award.