Stanza: A Python Natural Language Processing Toolkit for Many Human Languages

Peng Qi, Yuhao Zhang, Yuhui Zhang, Jason Bolton, Christopher D. Manning

Introduction

The growing availability of open-source natural language processing (NLP) toolkits has made it easier for users to build tools with sophisticated linguistic processing. While existing NLP toolkits such as CoreNLP (Manning et al., 2014), Flair (Akbik et al., 2019), spaCy (https://spacy.io/), and UDPipe (Straka, 2018) are widely used, they also suffer from several limitations. First, existing toolkits often support only a few major languages. This has significantly limited the community’s ability to process multilingual text. Second, widely used tools are sometimes under-optimized for accuracy, either due to a focus on efficiency (e.g., spaCy) or the use of less powerful models (e.g., CoreNLP), potentially misleading downstream applications and insights obtained from them. Third, some tools assume input text has been tokenized or annotated with other tools, lacking the ability to process raw text within a unified framework. This has limited their applicability to text from diverse sources.

We introduce Stanza (called StanfordNLP prior to v1.0.0), a Python natural language processing toolkit supporting many human languages. As shown in Table 1, compared to existing widely-used NLP toolkits, Stanza has the following advantages:

From raw text to annotations. Stanza features a fully neural pipeline which takes raw text as input, and produces annotations including tokenization, multi-word token expansion, lemmatization, part-of-speech and morphological feature tagging, dependency parsing, and named entity recognition.

Multilinguality. Stanza’s architectural design is language-agnostic and data-driven, which allows us to release models supporting 66 languages, by training the pipeline on the Universal Dependencies (UD) treebanks and other multilingual corpora.

State-of-the-art performance. We evaluate Stanza on a total of 112 datasets, and find its neural pipeline adapts well to text of different genres, achieving state-of-the-art or competitive performance at each step of the pipeline.

Additionally, Stanza features a Python interface to the widely used Java CoreNLP package, allowing access to additional tools such as coreference resolution and relation extraction.

Stanza is fully open source and we make pretrained models for all supported languages and datasets available for public download. We hope Stanza can facilitate multilingual NLP research and applications, and drive future research that produces insights from human languages.

System Design and Architecture

At the top level, Stanza consists of two individual components: (1) a fully neural multilingual NLP pipeline; (2) a Python client interface to the Java Stanford CoreNLP software. In this section we introduce their designs.

Stanza’s neural pipeline consists of models that range from tokenizing raw text to performing syntactic analysis on entire sentences (see Figure 1). All components are designed with processing many human languages in mind, with high-level design choices capturing common phenomena in many languages and data-driven models that learn the difference between these languages from data. Moreover, the implementation of Stanza components is highly modular, and reuses basic model architectures when possible for compactness. We highlight the important design choices here, and refer the reader to Qi et al. (2018) for modeling details.

When presented with raw text, Stanza tokenizes it and groups tokens into sentences as the first step of processing. Unlike most existing toolkits, Stanza combines tokenization and sentence segmentation from raw text into a single module. This is modeled as a tagging problem over character sequences, where the model predicts whether a given character is the end of a token, end of a sentence, or end of a multi-word token (MWT, see Figure 2). Following Universal Dependencies (Nivre et al., 2020), we make a distinction between tokens (contiguous spans of characters in the input text) and syntactic words. These are interchangeable aside from the cases of MWTs, where one token can correspond to multiple words. We choose to predict MWTs jointly with tokenization because this task is context-sensitive in some languages.
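To make the tagging formulation concrete, the sketch below decodes per-character labels into sentences and tokens. The integer label scheme, the function name decode, and the hand-written labels are assumptions made for illustration; they are not Stanza's internal encoding, and in the real system the labels come from the neural tagger.

```python
# Hypothetical label scheme (illustration only, not Stanza's internal encoding):
# 0 = inside a unit, 1 = end of token, 2 = end of sentence, 3 = end of multi-word token
INSIDE, END_TOKEN, END_SENT, END_MWT = 0, 1, 2, 3


def decode(text, labels):
    """Turn one label per character into sentences of (token, is_mwt) pairs."""
    sentences, current, start = [], [], 0
    for i, label in enumerate(labels):
        if label == INSIDE:
            continue
        # Every non-zero label closes the token that ends at character i.
        token = text[start:i + 1].strip()
        start = i + 1
        if token:
            current.append((token, label == END_MWT))
        if label == END_SENT:
            sentences.append(current)
            current = []
    # Flush any trailing characters that were never closed by a label.
    tail = text[start:].strip()
    if tail:
        current.append((tail, False))
    if current:
        sentences.append(current)
    return sentences


text = "Vamos al mar. Bien."
labels = [INSIDE] * len(text)
# Hand-written labels standing in for model predictions.
for position, label in {4: END_TOKEN, 7: END_MWT, 11: END_TOKEN,
                        12: END_SENT, 17: END_TOKEN, 18: END_SENT}.items():
    labels[position] = label

sentences = decode(text, labels)
for sentence in sentences:
    print(sentence)
```

Here the Spanish contraction "al" is flagged as an MWT so that a later step can expand it into its syntactic words.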

Multi-word Token Expansion.

Once MWTs are identified by the tokenizer, they are expanded into the underlying syntactic words as the basis of downstream processing. This is achieved with an ensemble of a frequency lexicon and a neural sequence-to-sequence (seq2seq) model, to ensure that frequently observed expansions in the training set are always robustly expanded while maintaining flexibility to model unseen words statistically.
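The lexicon-plus-model ensemble can be sketched as a lookup with a fallback. This is a simplified illustration: build_lexicon, expand, and the identity fallback that stands in for the seq2seq model are hypothetical names, and how Stanza actually combines the two components may differ.

```python
from collections import Counter, defaultdict


def build_lexicon(training_pairs):
    """Map each MWT seen in training to its most frequent expansion."""
    counts = defaultdict(Counter)
    for token, words in training_pairs:
        counts[token.lower()][tuple(words)] += 1
    return {token: counter.most_common(1)[0][0] for token, counter in counts.items()}


def expand(token, lexicon, fallback):
    """Use the lexicon when the token was observed; otherwise ask the fallback model."""
    hit = lexicon.get(token.lower())
    if hit is not None:
        return list(hit)
    return list(fallback(token))


# Toy training data and a stand-in for the neural seq2seq expander (assumption).
pairs = [("al", ["a", "el"]), ("al", ["a", "el"]), ("del", ["de", "el"])]
lexicon = build_lexicon(pairs)
identity_model = lambda token: [token]

print(expand("Al", lexicon, identity_model))
print(expand("dámelo", lexicon, identity_model))
```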

POS and Morphological Feature Tagging.

For each word in a sentence, Stanza assigns it a part-of-speech (POS), and analyzes its universal morphological features (UFeats, e.g., singular/plural, 1st/2nd/3rd person, etc.). To predict POS and UFeats, we adopt a bidirectional long short-term memory network (Bi-LSTM) as the basic architecture. For consistency among universal POS (UPOS), treebank-specific POS (XPOS), and UFeats, we adopt the biaffine scoring mechanism from Dozat and Manning (2017) to condition XPOS and UFeats prediction on that of UPOS.

Lemmatization.

Stanza also lemmatizes each word in a sentence to recover its canonical form (e.g., did→do). Similar to the multi-word token expander, Stanza’s lemmatizer is implemented as an ensemble of a dictionary-based lemmatizer and a neural seq2seq lemmatizer. An additional classifier is built on the encoder output of the seq2seq model, to predict shortcuts such as lowercasing and identity copy for robustness on long input sequences such as URLs.
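The control flow of such an ensemble might look like the sketch below. The function names, the keying of the dictionary on (word, UPOS), and the rule-based stand-ins for the shortcut classifier and the seq2seq model are all assumptions made so that the example runs on its own. In the system described above, both the classifier and the seq2seq model are learned.

```python
def lemmatize(word, upos, dictionary, predict_shortcut, seq2seq):
    """Dictionary first, then a predicted shortcut, then the seq2seq model."""
    if (word, upos) in dictionary:
        return dictionary[(word, upos)]
    shortcut = predict_shortcut(word)
    if shortcut == "identity":
        return word          # copy the input unchanged, e.g. for URLs
    if shortcut == "lower":
        return word.lower()  # lowercase without running the decoder
    return seq2seq(word)


def toy_shortcut(word):
    """Rule-based stand-in for the learned shortcut classifier (assumption)."""
    if "://" in word:
        return "identity"
    if word.isupper():
        return "lower"
    return None


def toy_seq2seq(word):
    """Rule-based stand-in for the neural seq2seq lemmatizer (assumption)."""
    return word[:-1] if word.endswith("s") else word


dictionary = {("did", "VERB"): "do"}
for word, upos in [("did", "VERB"), ("https://stanza.run/", "X"),
                   ("NLP", "PROPN"), ("parsers", "NOUN")]:
    print(word, "->", lemmatize(word, upos, dictionary, toy_shortcut, toy_seq2seq))
```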

Dependency Parsing.

Stanza parses each sentence for its syntactic structure, where each word in the sentence is assigned a syntactic head that is either another word in the sentence, or in the case of the root word, an artificial root symbol. We implement a Bi-LSTM-based deep biaffine neural dependency parser (Dozat and Manning, 2017). We further augment this model with two linguistically motivated features: one that predicts the linearization order of two words in a given language, and the other that predicts the typical distance in linear order between them. We have previously shown that these features significantly improve parsing accuracy (Qi et al., 2018).
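As a rough illustration of biaffine head scoring, the pure-Python sketch below scores each (dependent, head) pair with a bilinear term plus a head-only bias term and picks the best head per word. The vectors, weights, and function names are made up for the example. The greedy per-word argmax is a simplification that does not guarantee a well-formed tree, and the linearization and distance features mentioned above are omitted.

```python
def biaffine_score(dep, head, U, u):
    """Score = dep^T U head + u . head (bilinear term plus head bias term)."""
    bilinear = sum(dep[a] * U[a][b] * head[b]
                   for a in range(len(dep)) for b in range(len(head)))
    bias = sum(u[b] * head[b] for b in range(len(head)))
    return bilinear + bias


def predict_heads(dep_vecs, head_vecs, U, u):
    """head_vecs[0] is the artificial root; dep_vecs[i - 1] belongs to word i."""
    heads = []
    for i, dep in enumerate(dep_vecs, start=1):
        candidates = [j for j in range(len(head_vecs)) if j != i]  # no self-loops
        heads.append(max(candidates,
                         key=lambda j: biaffine_score(dep, head_vecs[j], U, u)))
    return heads


# Toy 2-dimensional representations (in the parser these come from the Bi-LSTM).
U = [[1, 0], [0, 1]]
u = [0, 0]
head_vecs = [[2, 0], [0, 1], [1, 1]]  # root, word 1, word 2
dep_vecs = [[1, 0], [0, 1]]           # word 1, word 2

heads = predict_heads(dep_vecs, head_vecs, U, u)
print(heads)  # heads[i - 1] is the predicted head index of word i (0 = root)
```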

Named Entity Recognition.

For each input sentence, Stanza also recognizes named entities in it (e.g., person names, organizations, etc.). For NER we adopt the contextualized string representation-based sequence tagger from Akbik et al. (2018). We first train a forward and a backward character-level LSTM language model, and at tagging time we concatenate the representations at the end of each word position from both language models with word embeddings, and feed the result into a standard one-layer Bi-LSTM sequence tagger with a conditional random field (CRF)-based decoder.
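A CRF-based decoder typically selects the highest-scoring tag sequence with the Viterbi algorithm. The sketch below shows that decoding step on hand-picked scores. The tag set, the emission and transition values, and the function name viterbi are assumptions for illustration. In the tagger described above, emission scores would come from the Bi-LSTM and transition scores would be learned.

```python
def viterbi(emissions, transitions):
    """emissions[t][k]: score of tag k at position t.
    transitions[j][k]: score of moving from tag j to tag k.
    Returns the highest-scoring sequence of tag indices."""
    n_tags = len(emissions[0])
    best = list(emissions[0])
    backpointers = []
    for scores in emissions[1:]:
        pointers, new_best = [], []
        for k in range(n_tags):
            prev = max(range(n_tags), key=lambda j: best[j] + transitions[j][k])
            pointers.append(prev)
            new_best.append(best[prev] + transitions[prev][k] + scores[k])
        best = new_best
        backpointers.append(pointers)
    # Follow back-pointers from the best final tag.
    tag = max(range(n_tags), key=lambda k: best[k])
    path = [tag]
    for pointers in reversed(backpointers):
        tag = pointers[tag]
        path.append(tag)
    return path[::-1]


TAGS = ["O", "B-PER", "I-PER"]
# Made-up scores for the three words "Chris Manning teaches".
emissions = [[0, 2, 1], [0, 1, 2], [3, 0, 0]]
# A large penalty forbids O -> I-PER; continuing an entity gets a small bonus.
transitions = [[0, 0, -100], [0, 0, 1], [0, 0, 1]]

decoded = [TAGS[k] for k in viterbi(emissions, transitions)]
print(decoded)
```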

CoreNLP Client

Stanford’s Java CoreNLP software provides a comprehensive set of NLP tools especially for the English language. However, these tools are not easily accessible with Python, the programming language of choice for many NLP practitioners, due to the lack of official support. To facilitate the use of CoreNLP from Python, we take advantage of the existing server interface in CoreNLP, and implement a robust client as its Python interface.

When the CoreNLP client is instantiated, Stanza will automatically start the CoreNLP server as a local process. The client then communicates with the server through its RESTful APIs, after which annotations are transmitted in Protocol Buffers, and converted back to native Python objects. Users can also specify JSON or XML as annotation format. To ensure robustness, while the client is being used, Stanza periodically checks the health of the server, and restarts it if necessary.

System Usage

Stanza’s user interface is designed to allow quick out-of-the-box processing of multilingual text. To achieve this, Stanza supports automated model download via Python code and pipeline customization with processors of choice. Annotation results can be accessed as native Python objects to allow for flexible post-processing.

Stanza’s neural NLP pipeline can be initialized with the Pipeline class, taking language name as an argument. By default, all processors will be loaded and run over the input text; however, users can also specify the processors to load and run with a list of processor names as an argument. Users can additionally specify other processor-level properties, such as batch sizes used by processors, at initialization time.

The following code snippet shows a minimal usage of Stanza for downloading the Chinese model, annotating a sentence with customized processors, and printing out all annotations:
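A minimal sketch of this usage, assuming the stanza package is installed and model files can be downloaded. The wrapper functions processor_spec and annotate are hypothetical helpers introduced here, and passing the processor names as a single comma-separated string is the form assumed in this sketch.

```python
def processor_spec(names):
    """Join processor names into the comma-separated string used below."""
    return ",".join(names)


def annotate(text, lang="zh", processors=("tokenize", "pos", "ner"), use_gpu=True):
    """Download models for `lang` if needed, build a pipeline, and annotate `text`."""
    import stanza  # imported lazily so this file loads without Stanza installed

    stanza.download(lang)
    nlp = stanza.Pipeline(lang, processors=processor_spec(processors), use_gpu=use_gpu)
    return nlp(text)


# Example call (commented out because it downloads the Chinese models on first use):
# doc = annotate("达沃斯世界经济论坛是每年全球政商界领袖聚在一起的年度盛事。")
# print(doc)
print(processor_spec(["tokenize", "pos", "ner"]))
```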

After all processors are run, a Document instance will be returned, which stores all annotation results. Within a Document, annotations are further stored in Sentences, Tokens and Words in a top-down fashion (Figure 1). The following code snippet demonstrates how to access the text and POS tag of each word in a document and all named entities in the document:
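The access pattern can be sketched as a small helper that walks a document top-down. The helper name words_and_entities is hypothetical, and the object built with SimpleNamespace is only a stand-in with the same attribute names, so that the example runs without downloading models. With a real pipeline, the doc returned by the pipeline call would be passed instead.

```python
from types import SimpleNamespace


def words_and_entities(doc):
    """Collect (text, UPOS) for every word and (text, type) for every entity."""
    words = [(word.text, word.upos)
             for sentence in doc.sentences
             for word in sentence.words]
    entities = [(entity.text, entity.type) for entity in doc.ents]
    return words, entities


# Stand-in document mimicking the Document -> Sentence -> Word layout (assumption).
mock_doc = SimpleNamespace(
    sentences=[SimpleNamespace(words=[
        SimpleNamespace(text="Chris", upos="PROPN"),
        SimpleNamespace(text="teaches", upos="VERB"),
    ])],
    ents=[SimpleNamespace(text="Chris", type="PERSON")],
)

words, entities = words_and_entities(mock_doc)
print(words)
print(entities)
```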

Stanza is designed to be run on different hardware devices. By default, CUDA devices will be used whenever they are visible to the pipeline; otherwise, CPUs will be used. However, users can force all computation to be run on CPUs by setting use_gpu=False at initialization time.

CoreNLP Client Interface

The CoreNLP client interface is designed in a way that the actual communication with the backend CoreNLP server is transparent to the user. To annotate an input text with the CoreNLP client, a CoreNLPClient instance needs to be initialized, with an optional list of CoreNLP annotators. After the annotation is complete, results will be accessible as native Python objects.

This code snippet shows how to establish a CoreNLP client and obtain the NER and coreference annotations of an English sentence:
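A sketch of such a client session, assuming CoreNLP is installed locally and the CORENLP_HOME environment variable points to it. The helper names annotate_with_corenlp and ner_tags are hypothetical. The field names sentence, token, word, and ner follow the Protocol Buffers document format as understood here and should be checked against the installed version. The stand-in object below only mirrors those field names so that the example runs without a server.

```python
from types import SimpleNamespace

ANNOTATORS = ["tokenize", "ssplit", "pos", "lemma", "ner", "parse", "coref"]


def annotate_with_corenlp(text, annotators=ANNOTATORS):
    """Start a local CoreNLP server, annotate `text`, and shut the server down."""
    from stanza.server import CoreNLPClient  # lazy import; needs Stanza and CoreNLP

    with CoreNLPClient(annotators=list(annotators)) as client:
        return client.annotate(text)


def ner_tags(annotation):
    """Collect (word, NER tag) pairs from an annotated document."""
    return [(token.word, token.ner)
            for sentence in annotation.sentence
            for token in sentence.token]


# Real usage (commented out because it launches a Java server):
# ann = annotate_with_corenlp("Barack Obama was born in Hawaii. He was elected in 2008.")
# print(ner_tags(ann))
# print(ann.corefChain)

# Stand-in annotation with the same field names (assumption).
mock_ann = SimpleNamespace(sentence=[SimpleNamespace(token=[
    SimpleNamespace(word="Obama", ner="PERSON"),
    SimpleNamespace(word="spoke", ner="O"),
])])
tags = ner_tags(mock_ann)
print(tags)
```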

With the client interface, users can annotate text in 6 languages as supported by CoreNLP.

Interactive Web-based Demo

To help visualize documents and their annotations generated by Stanza, we build an interactive web demo that runs the pipeline interactively. For all languages and all annotations Stanza provides in those languages, we generate predictions from the models trained on the largest treebank/NER dataset, and visualize the result with the Brat rapid annotation tool (https://brat.nlplab.org/). This demo runs in a client/server architecture, and annotation is performed on the server side. We make one instance of this demo publicly available at http://stanza.run/. It can also be run locally with proper Python libraries installed. An example of running Stanza on a German sentence can be found in Figure 3.

Training Pipeline Models

For all neural processors, Stanza provides command-line interfaces for users to train their own customized models. To do this, users need to prepare the training and development data in compatible formats (i.e., CoNLL-U format for the Universal Dependencies pipeline and BIO format column files for the NER model). The following command trains a neural dependency parser with user-specified training and development data:
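The sketch below assembles such a training command from Python. The module path stanza.models.parser and the flag names are assumptions based on the command-line interface around v1.0 and may differ in other releases. The helper name parser_train_command and the file names are hypothetical.

```python
import sys


def parser_train_command(train_file, dev_file, output_file):
    """Build the argument list for training a dependency parser (flags are assumptions)."""
    return [
        sys.executable, "-m", "stanza.models.parser",
        "--train_file", train_file,   # CoNLL-U training data
        "--eval_file", dev_file,      # CoNLL-U development data to predict on
        "--gold_file", dev_file,      # gold annotations used for scoring
        "--output_file", output_file, # where predictions are written
    ]


command = parser_train_command("train.conllu", "dev.conllu", "output.conllu")
print(" ".join(command))

# To launch training (requires Stanza and the data files):
# import subprocess
# subprocess.run(command, check=True)
```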

Performance Evaluation

To establish benchmark results and compare with other popular toolkits, we trained and evaluated Stanza on a total of 112 datasets. All pretrained models are publicly downloadable.

We train and evaluate Stanza’s tokenizer/sentence splitter, MWT expander, POS/UFeats tagger, lemmatizer, and dependency parser with the Universal Dependencies v2.5 treebanks (Zeman et al., 2019). For training we use 100 treebanks from this release that have non-copyrighted training data, and for treebanks that do not include development data, we randomly split out 20% of the training data as development data. These treebanks represent 66 languages, mostly European languages, but spanning a diversity of language families, including Indo-European, Afro-Asiatic, Uralic, Turkic, Sino-Tibetan, etc. For NER, we train and evaluate Stanza with 12 publicly available datasets covering 8 major languages as shown in Table 3 (Nothman et al., 2013; Tjong Kim Sang and De Meulder, 2003; Tjong Kim Sang, 2002; Benikova et al., 2014; Mohit et al., 2012; Taulé et al., 2008; Weischedel et al., 2013). For the WikiNER corpora, as canonical splits are not available, we randomly split them into 70% training, 15% dev and 15% test splits. For all other corpora we used their canonical splits.
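A random split of this kind can be sketched as follows. Splitting at the sentence level, the fixed seed, and the function name split_corpus are assumptions of the sketch, since the text does not specify how the split was drawn.

```python
import random


def split_corpus(sentences, fractions=(0.7, 0.15, 0.15), seed=0):
    """Shuffle and split into train/dev/test; the test set takes the remainder."""
    items = list(sentences)
    random.Random(seed).shuffle(items)
    n_train = round(len(items) * fractions[0])
    n_dev = round(len(items) * fractions[1])
    train = items[:n_train]
    dev = items[n_train:n_train + n_dev]
    test = items[n_train + n_dev:]
    return train, dev, test


# 70/15/15 split, as used for corpora without canonical splits.
train, dev, test = split_corpus(range(100))
print(len(train), len(dev), len(test))

# Holding out 20% of the training data as development data.
train80, dev20, _ = split_corpus(range(100), fractions=(0.8, 0.2, 0.0))
print(len(train80), len(dev20))
```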

Training.

On the Universal Dependencies treebanks, we tuned all hyper-parameters on several large treebanks and applied them to all other treebanks. We used the word2vec embeddings released as part of the 2018 UD Shared Task (Zeman et al., 2018), or the fastText embeddings (Bojanowski et al., 2017) whenever word2vec embeddings are not available. For the character-level language models in the NER component, we pretrained them on a mix of the Common Crawl and Wikipedia dumps, and the news corpora released by the WMT19 Shared Task (Barrault et al., 2019), except for English and Chinese, for which we pretrained on the Google One Billion Word (Chelba et al., 2013) and the Chinese Gigaword (https://catalog.ldc.upenn.edu/LDC2011T13) corpora, respectively. We again applied the same hyper-parameters to models for all languages.

Universal Dependencies Results.

For performance on UD treebanks, we compared Stanza (v1.0) against UDPipe (v1.2) and spaCy (v2.2) on treebanks of 5 major languages whenever a pretrained model is available. As shown in Table 2, Stanza achieved the best performance on most scores reported. Notably, we find that Stanza’s language-agnostic architecture is able to adapt to datasets of different languages and genres. This is also shown by Stanza’s high macro-averaged scores over 100 treebanks covering 66 languages.

NER Results.

For performance of the NER component, we compared Stanza (v1.0) against Flair (v0.4.5) and spaCy (v2.2). For spaCy we reported results from its publicly available pretrained model whenever one trained on the same dataset can be found; otherwise we retrained its model on our datasets with default hyper-parameters, following the publicly available tutorial (https://spacy.io/usage/training#ner). Note that, following this public tutorial, we did not use pretrained word embeddings when training spaCy NER models, although using pretrained word embeddings may potentially improve the NER results. For Flair, since their downloadable models were pretrained on dataset versions different from canonical ones, we retrained all models on our own dataset splits with their best reported hyper-parameters. All test results are shown in Table 3. We find that on all datasets Stanza achieved either higher or close F1 scores when compared against Flair. When compared to spaCy, Stanza’s NER performance is much better. It is worth noting that Stanza’s high performance is achieved with much smaller models compared with Flair (up to 75% smaller), as we intentionally compressed the models for memory efficiency and ease of distribution.

Speed comparison.

We compare Stanza against existing toolkits to evaluate the time it takes to annotate text (see Table 4). For GPU tests we use a single NVIDIA Titan RTX card. Unsurprisingly, Stanza’s extensive use of accurate neural models makes it take significantly longer than spaCy to annotate text, but it is still competitive when compared against toolkits of similar accuracy, especially with the help of GPU acceleration.

Conclusion and Future Work

We introduced Stanza, a Python natural language processing toolkit supporting many human languages. We have shown that Stanza’s neural pipeline not only has wide coverage of human languages, but also is accurate on all tasks, thanks to its language-agnostic, fully neural architectural design. In addition, Stanza’s CoreNLP client extends its functionality with additional NLP tools.

For future work, we consider the following areas of improvement in the near term:

Models downloadable in Stanza are largely trained on a single dataset. To make models robust to many different genres of text, we would like to investigate the possibility of pooling various sources of compatible data to train “default” models for each language;

The amount of computation and resources available to us is limited. We would therefore like to build an open “model zoo” for Stanza, so that researchers from outside our group can also contribute their models and benefit from models released by others;

Stanza was designed to optimize for accuracy of its predictions, but this sometimes comes at the cost of computational efficiency and limits the toolkit’s use. We would like to further investigate reducing model sizes and speeding up computation in the toolkit, while still maintaining the same level of accuracy.

We would also like to expand Stanza’s functionality by adding other processors such as neural coreference resolution or relation extraction for richer text analytics.

Acknowledgments

The authors would like to thank the anonymous reviewers for their comments, Arun Chaganty for his early contribution to this toolkit, Tim Dozat for his design of the original architectures of the tagger and parser models, Matthew Honnibal and Ines Montani for their help with spaCy integration and helpful comments on the draft, Ranting Guo for the logo design, and John Bauer and the community contributors for their help with maintaining and improving this toolkit. This research is funded in part by Samsung Electronics Co., Ltd. and in part by the SAIL-JD Research Initiative.

References