Low-Resource Named Entity Recognition with Cross-Lingual, Character-Level Neural Conditional Random Fields

Ryan Cotterell, Kevin Duh

Introduction

Named entity recognition (NER) presents a challenge for modern machine learning, wherein a learner must deduce which word tokens refer to people, locations and organizations (along with other possible entity types). The task demands that the learner generalize from limited training data and to novel entities, often in new domains. Traditionally, state-of-the-art NER models have relied on hand-crafted features that pick up on distributional cues as well as portions of the word forms themselves. In the past few years, however, neural approaches that jointly learn their own features have surpassed the feature-based approaches in performance. Despite their empirical success, neural networks have remarkably high sample complexity and still only outperform hand-engineered feature approaches when enough supervised training data is available, leaving effective training of neural networks in the low-resource case a challenge.

For most of the world’s languages, there is a very limited amount of training data for NER; CoNLL—the standard dataset in the field—only provides annotations for 4 languages Tjong Kim Sang (2002); Tjong Kim Sang and De Meulder (2003). Creating similarly sized datasets for other languages has a prohibitive annotation cost, making the low-resource case an important scenario. To get around this barrier, we develop a cross-lingual solution: given a low-resource target language, we additionally offer large amounts of annotated data in a language that is genetically related to the target language. We show empirically that this improves the quality of the resulting model.

In terms of neural modeling, we introduce a novel neural conditional random field (CRF) for cross-lingual NER that enables cross-lingual transfer by extracting character-level features using recurrent neural networks shared among multiple languages; this tying of parameters enables cross-lingual abstraction. With experiments on 15 languages, we confirm that feature-based CRFs consistently outperform the neural methods in the low-resource training scenario. However, with the addition of cross-lingual information, the tables turn and the neural methods are again on top, demonstrating that cross-lingual supervision is a viable method to reduce the training data that state-of-the-art neural approaches require.

Neural Conditional Random Fields

Named entity recognition is typically framed as a sequence labeling task using the bio scheme Ramshaw and Marcus (1995); Baldwin (2009), i.e., given an input sentence, the goal is to assign a label to each token: b if the token is the beginning of an entity, i if the token is inside an entity, or o if the token is outside an entity (see Fig. 1). Following convention, we focus on person (per), location (loc), organization (org), and miscellaneous (misc) entity types, resulting in 9 tags: {b-org, i-org, b-per, i-per, b-loc, i-loc, b-misc, i-misc, o}.
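
As a minimal sketch of the bio encoding described above, the following Python snippet converts entity spans into one tag per token. The function name `spans_to_bio`, the span format (start, exclusive end, type) and the example sentence are illustrative assumptions rather than part of our system.

```python
ENTITY_TYPES = ["org", "per", "loc", "misc"]
# The full tag inventory: a b- and an i- tag per entity type, plus o.
TAGSET = ["b-" + e for e in ENTITY_TYPES] + ["i-" + e for e in ENTITY_TYPES] + ["o"]


def spans_to_bio(tokens, spans):
    """Convert (start, end, type) entity spans into one bio tag per token.

    Spans use token offsets with an exclusive end; this span format is an
    assumption made for illustration.
    """
    tags = ["o"] * len(tokens)          # default: outside any entity
    for start, end, etype in spans:
        tags[start] = "b-" + etype      # first token of the entity
        for i in range(start + 1, end):
            tags[i] = "i-" + etype      # remaining tokens of the entity
    return tags


# Hypothetical example sentence with one person and one location.
tokens = ["Maria", "visited", "New", "York", "City"]
spans = [(0, 1, "per"), (2, 5, "loc")]
tags = spans_to_bio(tokens, spans)
```

Here `tags` assigns b-per to the first token, o to the verb, and b-loc followed by two i-loc tags to the three-token location.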

Conditional random fields (CRFs), first introduced in Lafferty et al. (2001), generalize the classical maximum entropy models Berger et al. (1996) to distributions over structured objects, and are an effective tool for sequence labeling tasks like NER. We briefly overview the formalism here and then discuss its neural parameterization.

We start with two discrete alphabets $\Sigma$ and $\Delta$. In the case of sentence-level sequence tagging, $\Sigma$ is a set of words (potentially infinite) and $\Delta$ is a set of tags (generally finite; in our case $|\Delta|=9$). Let $\mathbf{t}=t_{1}\cdots t_{n}\in\Delta^{n}$ be a tagging and $\mathbf{w}=w_{1}\cdots w_{n}\in\Sigma^{n}$ a sentence, where $n$ is the sentence length. A CRF is a globally normalized conditional probability distribution,

$$p_{{\boldsymbol{\theta}}}(\mathbf{t}\mid\mathbf{w})=\frac{1}{Z_{{\boldsymbol{\theta}}}(\mathbf{w})}\prod_{i=1}^{n}\psi\left(t_{i-1},t_{i},\mathbf{w};{\boldsymbol{\theta}}\right),$$

where $t_{0}$ is a distinguished beginning-of-tagging symbol (we slightly abuse notation and also use $t_{0}$ as a distinguished beginning-of-sentence symbol), $\psi\left(t_{i-1},t_{i},\mathbf{w};{\boldsymbol{\theta}}\right)\geq 0$ is an arbitrary non-negative potential function that we take to be a parametric function of the parameters ${\boldsymbol{\theta}}$, and the partition function $Z_{{\boldsymbol{\theta}}}(\mathbf{w})$ is the sum over all taggings of length $n$.
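
Although the partition function sums over exponentially many taggings, the first-order structure of the potentials lets it be computed with the standard forward algorithm. The sketch below illustrates this in log-space under stated assumptions: the function names, the reduced tag set and the toy potential are hypothetical, and the sentence $\mathbf{w}$ is assumed to be captured inside the `log_psi` callable.

```python
import itertools
import math


def log_sum_exp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def log_partition(log_psi, n, tags, start="<s>"):
    """Forward algorithm: log Z(w) in O(n * |tags|^2) time, for n >= 1.

    log_psi(prev, cur, i) returns the log-potential at position i (1-indexed).
    """
    alpha = {t: log_psi(start, t, 1) for t in tags}
    for i in range(2, n + 1):
        alpha = {
            t: log_sum_exp([alpha[p] + log_psi(p, t, i) for p in tags])
            for t in tags
        }
    return log_sum_exp(list(alpha.values()))


def log_prob(tagging, log_psi, tags, start="<s>"):
    """log p(t | w) = sum_i log psi(t_{i-1}, t_i, w) - log Z(w)."""
    score, prev = 0.0, start
    for i, t in enumerate(tagging, start=1):
        score += log_psi(prev, t, i)
        prev = t
    return score - log_partition(log_psi, len(tagging), tags, start)


# Toy log-potential with made-up numbers: it favors repeating the previous tag.
TAGS = ["b-per", "i-per", "o"]


def toy_log_psi(prev, cur, i):
    return 1.0 if prev == cur else 0.1 * i


# Sanity check: the probabilities of all 3^4 taggings should sum to one.
n = 4
total = sum(
    math.exp(log_prob(list(t), toy_log_psi, TAGS))
    for t in itertools.product(TAGS, repeat=n)
)
```

The brute-force enumeration at the end is only feasible for toy sizes; it serves to check that the dynamic program normalizes the distribution correctly.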

So how do we choose $\psi\left(t_{i-1},t_{i},\mathbf{w};{\boldsymbol{\theta}}\right)$ ? We discuss two alternatives, which we will compare experimentally in § 5.

2.2 Log-Linear Parameterization

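Traditionally, computational linguists have considered a simple log-linear parameterization, i.e.,

$$\psi\left(t_{i-1},t_{i},\mathbf{w};{\boldsymbol{\theta}}\right)=\exp\left({\boldsymbol{\theta}}^{\top}\mathbf{f}\left(t_{i-1},t_{i},\mathbf{w}\right)\right),$$

where $\mathbf{f}$ is a hand-crafted feature function that maps a tag bigram and the sentence to a real-valued vector.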
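
To make the log-linear potential concrete, here is a small sketch with sparse indicator features. The feature templates, weights and example sentence are hypothetical and chosen only for illustration; they are not the templates used in our experiments.

```python
import math


def features(prev_tag, tag, words, i):
    """Hypothetical hand-crafted feature templates for position i (0-indexed)."""
    word = words[i]
    return {
        "tag=" + tag: 1.0,                                      # tag bias
        "prev=" + prev_tag + ",tag=" + tag: 1.0,                # tag transition
        "word=" + word.lower() + ",tag=" + tag: 1.0,            # word identity
        "suffix3=" + word[-3:] + ",tag=" + tag: 1.0,            # portion of the word form
        "cap=" + str(word[0].isupper()) + ",tag=" + tag: 1.0,   # capitalization cue
    }


def potential(prev_tag, tag, words, i, theta):
    """psi = exp(theta . f(t_{i-1}, t_i, w)), which is always strictly positive."""
    feats = features(prev_tag, tag, words, i)
    return math.exp(sum(theta.get(name, 0.0) * value for name, value in feats.items()))


# Made-up weights: reward capitalized b-per tokens, penalize i-per directly after o.
theta = {"cap=True,tag=b-per": 1.5, "prev=o,tag=i-per": -2.0}
words = ["Yesterday", "Maria", "arrived"]
good = potential("o", "b-per", words, 1, theta)
bad = potential("o", "i-per", words, 1, theta)
```

With these weights, the potential for tagging the second token b-per after o is larger than the potential for tagging it i-per, as one would want under the bio scheme.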

2.3 (Recurrent) Neural Parameterization

Modern CRFs, however, try to obviate the hand-selection of features through deep, non-linear parameterizations of $\psi\left(t_{i-1},t_{i},\mathbf{w};{\boldsymbol{\theta}}\right)$ . This idea is far from novel and there have been numerous attempts in the literature over the past decade to find effective non-linear parameterizations Peng et al. (2009); Do and Artières (2010); Collobert et al. (2011); Vinel et al. (2011); Fujii et al. (2012). Until recently, however, it was not clear that these non-linear parameterizations of CRFs were worth the non-convexity and the extra computational cost. Indeed, on neural CRFs, Wang and Manning (2013) find that “a nonlinear architecture offers no benefits in a high-dimensional discrete feature space.”

However, recently with the application of long short-term memory (LSTM; Hochreiter and Schmidhuber, 1997) recurrent neural networks (RNNs; Elman, 1990) to CRFs, it has become clear that neural feature extractors are superior to the hand-crafted approaches Huang et al. (2015); Lample et al. (2016); Ma and Hovy (2016). As our starting point, we build upon the architecture of Lample et al. (2016), which is currently competitive with the state of the art for NER.

We denote the embedding for the $i^{\text{th}}$ word in the sentence as $\mathbf{s}(\mathbf{w})_{i}$. The input ${\boldsymbol{\omega}}=[\omega_{1},\ldots,\omega_{|\mathbf{w}|}]$ to the BiLSTM is a vector of embeddings: we define

$$\omega_{i}=\left[\mathrm{LSTM}\left(c_{1}\cdots c_{|w_{i}|}\right);\ \mathbf{e}(w_{i})\right],$$

where $c_{1}\cdots c_{|w_{i}|}$ are the characters in word $w_{i}$ and $\mathbf{e}(w_{i})$ is the embedding of the word type $w_{i}$. In other words, we run an LSTM over the character stream and concatenate its output with a word embedding for each type. Now, the parameters ${\boldsymbol{\theta}}$, ${\boldsymbol{\theta}}^{\prime}$ and ${\boldsymbol{\theta}}^{\prime\prime\prime}$ are the set of all the LSTM parameters and other embeddings.
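
The construction of the per-word input can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not our implementation: the class `TinyLSTM`, the helper names, the dimensions and the random untrained weights are all hypothetical, and biases are omitted for brevity.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


class TinyLSTM:
    """Minimal LSTM with random, untrained weights (illustration only)."""

    GATES = ("input", "forget", "output", "candidate")

    def __init__(self, input_dim, hidden_dim, rng):
        self.hidden_dim = hidden_dim
        width = input_dim + hidden_dim
        # One (hidden_dim x width) weight matrix per gate; biases omitted.
        self.weights = {
            gate: [[rng.uniform(-0.1, 0.1) for _ in range(width)]
                   for _ in range(hidden_dim)]
            for gate in self.GATES
        }

    def _affine(self, gate, vector):
        return [sum(w * v for w, v in zip(row, vector))
                for row in self.weights[gate]]

    def final_state(self, inputs):
        """Read the input vectors left to right; return the last hidden state."""
        h = [0.0] * self.hidden_dim
        c = [0.0] * self.hidden_dim
        for x in inputs:
            xh = x + h  # concatenate current input and previous hidden state
            i = [sigmoid(v) for v in self._affine("input", xh)]
            f = [sigmoid(v) for v in self._affine("forget", xh)]
            o = [sigmoid(v) for v in self._affine("output", xh)]
            g = [math.tanh(v) for v in self._affine("candidate", xh)]
            c = [fj * cj + ij * gj for fj, cj, ij, gj in zip(f, c, i, g)]
            h = [oj * math.tanh(cj) for oj, cj in zip(o, c)]
        return h


rng = random.Random(0)
CHAR_DIM, CHAR_HIDDEN, WORD_DIM = 4, 5, 3
char_emb = {}   # character embeddings; these are what can be shared across languages
word_emb = {}   # one embedding per word type
char_lstm = TinyLSTM(CHAR_DIM, CHAR_HIDDEN, rng)


def lookup(table, key, dim):
    """Fetch an embedding, creating a random one on first use."""
    if key not in table:
        table[key] = [rng.uniform(-0.1, 0.1) for _ in range(dim)]
    return table[key]


def omega(word):
    """Character-LSTM summary of the word form, concatenated with its word embedding."""
    chars = [lookup(char_emb, ch, CHAR_DIM) for ch in word]
    return char_lstm.final_state(chars) + lookup(word_emb, word, WORD_DIM)


sentence = ["Santiago", "de", "Compostela"]
inputs = [omega(w) for w in sentence]  # one vector per token, fed to the sentence-level BiLSTM
```

Each entry of `inputs` has `CHAR_HIDDEN + WORD_DIM` dimensions. Because the character embeddings and the character LSTM do not depend on the word type, a word never seen in training still receives an informative representation from its spelling.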

Cross-Lingual Extensions

One of the most striking features of neural networks is their ability to abstract general representations across many words. Our question is: can neural feature-extractors abstract the notion of a named entity across similar languages? For example, if we train a character-level neural CRF on several of the highly related Romance languages, can our network learn a general representation of the entities in these languages?

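Now, suppose we are given a low-resource target language $\tau$ and a source language $\sigma$ (potentially, a set of $m$ high-resource source languages $\{\sigma_{i}\}_{i=1}^{m}$). We consider the following log-likelihood objective

$${\cal L}({\boldsymbol{\theta}})=\sum_{(\mathbf{t},\mathbf{w})\in{\cal D}_{\tau}}\log p_{{\boldsymbol{\theta}}}(\mathbf{t}\mid\mathbf{w})+\mu\cdot\sum_{(\mathbf{t},\mathbf{w})\in{\cal D}_{\sigma}}\log p_{{\boldsymbol{\theta}}}(\mathbf{t}\mid\mathbf{w}),$$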

where $\mu>0$ is a trade-off parameter, ${\cal D}_{\tau}$ is the set of training examples for the target language and ${\cal D}_{\sigma}$ is the set of training examples for the source language $\sigma$. In the case of multiple source languages, we add one summand for each source language used, in which case we have multiple training sets ${\cal D}_{\sigma_{i}}$.
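
A minimal sketch of such a trade-off objective follows, assuming that $\mu$ scales the source-language terms and that a single shared model scores every language. The function names, the stand-in scoring function, the language codes and the toy data are hypothetical.

```python
def joint_log_likelihood(log_prob, target_examples, source_sets, mu):
    """Target log-likelihood plus a mu-weighted log-likelihood per source language.

    log_prob(tags, words) is any function returning log p(t | w) under the
    shared model; source_sets maps a language code to its training examples.
    """
    if mu <= 0:
        raise ValueError("mu must be positive")
    objective = sum(log_prob(t, w) for t, w in target_examples)
    for examples in source_sets.values():
        # One summand per source language, each scaled by the trade-off parameter.
        objective += mu * sum(log_prob(t, w) for t, w in examples)
    return objective


def toy_log_prob(tags, words):
    """Stand-in scorer with made-up values: -0.5 per token."""
    return -0.5 * len(words)


target = [(["b-loc"], ["Vigo"])]                      # tiny target-language set
sources = {
    "es": [(["b-loc", "o"], ["Madrid", "."])],        # first source language
    "ca": [(["o"], ["i"])],                           # second source language
}
value = joint_log_likelihood(toy_log_prob, target, sources, mu=0.1)
```

In practice `log_prob` would be the CRF log-probability and the objective would be maximized by gradient-based training; the sketch only shows how the target and source terms are combined.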

In the case of the log-linear parameterization, we simply add a language-specific atomic feature for the language ID, drawing inspiration from the approach of Daumé III (2007) to domain adaptation. We then conjoin this new atomic feature with the existing feature templates, doubling the number of feature templates: the original and the new feature template conjoined with the language ID.
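
The doubling of feature templates can be illustrated as follows. This is a sketch in the spirit of that augmentation; the function name, the feature strings and the language codes are hypothetical.

```python
def add_language_features(feats, lang):
    """Keep every original feature and add a copy conjoined with the language ID."""
    augmented = dict(feats)                                 # shared, language-independent copy
    for name, value in feats.items():
        augmented["lang=" + lang + "&" + name] = value      # language-specific copy
    return augmented


# The same hypothetical base features fire in a target and in a source language.
base = {"suffix3=ago,tag=b-loc": 1.0, "cap=True,tag=b-loc": 1.0}
target_feats = add_language_features(base, "gl")
source_feats = add_language_features(base, "es")
shared = set(target_feats) & set(source_feats)
```

The features in `shared` receive weight updates from both languages, which is where transfer can happen, while the conjoined copies let the model learn language-specific deviations.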

Related Work

We divide the discussion of related work topically.

In recent years, many authors have incorporated character-level information into taggers using neural networks: dos Santos and Zadrozny (2014) employed a convolutional network for part-of-speech tagging in morphologically rich languages, and Ling et al. (2015) used an LSTM for a myriad of different tasks. Relatedly, Chiu and Nichols (2016) approached NER with character-level LSTMs, but without using a CRF. Our work builds directly upon this line in that we, too, compactly summarize the word form with a recurrent neural component.

Neural Transfer Schemes.

Previous work has also performed transfer learning using neural networks. The novelty of our work lies in the cross-lingual transfer. For example, Peng and Dredze (2017) and Yang et al. (2017), similarly oriented concurrent papers, focus on domain adaptation within the same language. While this is a related problem, cross-lingual transfer is much more involved since the morphology, syntax and semantics change more radically between two languages than between domains.

Projection-based Transfer Schemes.

Projection is a common approach to tagging low-resource languages. The strategy involves annotating one side of a bitext with a tagger for a high-resource language and then projecting the annotations over bilingual alignments obtained through unsupervised learning Och and Ney (2003). Using these projected annotations as weak supervision, one then trains a tagger in the target language. This line of research has a rich history, starting with Yarowsky and Ngai (2001). For a recent take, see Wang and Manning (2014) for projecting NER from English to Chinese. We emphasize that projection-based approaches are not directly comparable to our proposed method, as they additionally assume access to bitext, which is generally not available for low-resource languages.
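
For contrast with our approach, the projection step can be sketched as below. This is a deliberately simplified illustration, not a reimplementation of any cited system: the function name, the alignment format and the example sentence pair are assumptions, and adjacent entities of the same type are merged into one.

```python
def project_tags(source_tags, alignment, target_len):
    """Copy entity types across word alignments, then rebuild the bio prefixes.

    alignment is a list of (source_index, target_index) pairs, e.g. produced by
    an unsupervised word aligner; unaligned target tokens are tagged o.
    """
    types = [None] * target_len
    for s, t in alignment:
        if source_tags[s] != "o":
            types[t] = source_tags[s].split("-", 1)[1]   # keep only the entity type
    projected = []
    for j, etype in enumerate(types):
        if etype is None:
            projected.append("o")
        elif j > 0 and types[j - 1] == etype:
            projected.append("i-" + etype)               # continues the previous entity
        else:
            projected.append("b-" + etype)               # starts a new entity
    return projected


# Hypothetical sentence pair; the target side has one extra, unaligned token.
source_tokens = ["Maria", "visited", "New", "York"]
source_tags = ["b-per", "o", "b-loc", "i-loc"]
target_tokens = ["A", "Maria", "visitou", "Nova", "York"]
alignment = [(0, 1), (1, 2), (2, 3), (3, 4)]
projected = project_tags(source_tags, alignment, len(target_tokens))
```

The projected tags would then serve as weak supervision for training a target-language tagger, which is the step that requires bitext in the first place.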

Experiments

Fundamentally, we want to show that character-level neural CRFs are capable of generalizing the notion of an entity across related languages. To get at this, we compare a linear CRF (see § 2.2) with standard feature templates for the task and a neural CRF (see § 2.3). We further compare three training set-ups: low-resource, high-resource and low-resource with additional cross-lingual data for transfer. Given past results in the literature, we expect the linear CRF to dominate in the low-resource setting and the neural CRF to dominate in the high-resource setting. The novelty of our paper lies in the consideration of the low-resource with transfer case: we show that neural CRFs are better at transferring entity-level abstractions cross-linguistically.

We experiment on 15 languages from the cross-lingual named entity dataset described in Pan et al. (2017). We focus on 5 typologically diverse target languages: Galician, West Frisian, Ukrainian, Marathi and Tagalog. (While most of these languages are from the Indo-European family, they still run the gamut along a number of typological axes, e.g., Dutch and West Frisian have far less inflection than Russian and Ukrainian, and the Indo-Aryan languages employ postpositions (attached to the word) rather than prepositions (space separated).) As related source languages, we consider Spanish, Catalan, Italian, French, Romanian, Dutch, Russian, Cebuano, Hindi and Urdu. For the language code abbreviations and linguistic families, see Tab. 1. For each of the target languages, we emulate a truly low-resource condition, creating a 100-sentence split for training. We then create a 10000-sentence superset to be able to compare to a high-resource condition in those same languages. For the source languages, we only created a 10000-sentence split. We also create disjoint validation and test splits of 1000 sentences each.

5.2 Results

The linear CRF is trained using L-BFGS until convergence with the CRFsuite toolkit (http://www.chokkan.org/software/crfsuite/). We train our neural CRF for 100 epochs using AdaDelta Zeiler (2012) with a learning rate of $1.0$. The results are reported in Tab. 2. To understand the table, take the target language ($\tau$) Galician. In terms of $F_{1}$, while the neural CRF outperforms the log-linear CRF in the high-resource setting ($89.42$ vs. $87.23$), it performs poorly in the low-resource setting ($49.19$ vs. $56.64$); when we add in a source language ($\sigma_{i}$) such as Spanish, $F_{1}$ increases to $76.40$ for the neural CRF and $71.46$ for the log-linear CRF. The trend is similar for other source languages, such as Catalan ($75.40$) and Italian ($70.93$).

Overall, we observe three general trends. i) In the monolingual high-resource case, the neural CRF outperforms the log-linear CRF. ii) In the low-resource case, the log-linear CRF outperforms the neural CRF. iii) In the transfer case, however, the neural CRF wins, indicating that our character-level neural approach is better at generalizing cross-linguistically in the low-resource case (when we have little target language data), as we hoped. In the high-resource case (when we have a lot of target language data), transfer learning has little to no effect. We conclude that our cross-lingual neural CRF is a viable method for the transfer of NER. However, there is still a sizable gap between the neural CRF trained on 10000 target sentences and the transfer case (100 target and 10000 source sentences), indicating that there is still room for improvement.

Conclusion

We have investigated the task of cross-lingual transfer in low-resource named entity recognition using neural CRFs with experiments on 15 typologically diverse languages. Overall, we show that direct cross-lingual transfer is an option for reducing sample complexity for state-of-the-art architectures. In the future, we plan to investigate how exactly the networks manage to induce a cross-lingual entity abstraction.

Acknowledgments

We are grateful to Heng Ji and Xiaoman Pan for sharing their dataset and providing support.