Universal Dependencies v2: An Evergrowing Multilingual Treebank Collection

Joakim Nivre, Marie-Catherine de Marneffe, Filip Ginter, Jan Hajič, Christopher D. Manning, Sampo Pyysalo, Sebastian Schuster, Francis Tyers, Daniel Zeman

cs.CL

Introduction

Universal Dependencies (UD) is a project that is developing cross-linguistically consistent treebank annotation for many languages, with the goal of facilitating multilingual parser development and research on parsing and cross-lingual learning. The annotation scheme is based on an evolution of (universal) Stanford dependencies [de Marneffe et al. (2006), de Marneffe and Manning (2008), de Marneffe et al. (2014)], Google universal part-of-speech tags [Petrov et al. (2012)], and the Interset interlingua for morphosyntactic tagsets [Zeman (2008)]. The general philosophy is to provide a universal inventory of categories and guidelines to facilitate consistent annotation of similar constructions across languages, while allowing language-specific extensions when necessary.

The project started in 2014 and has developed into an open community effort with very rapid growth, both in terms of the number of researchers contributing to the project, which now exceeds 300, and in terms of the number of languages represented by treebanks, which is approaching 100. An early snapshot of this development can be found in ?), which describes version 1 of the UD guidelines (UD v1) and the treebank resources available in UD v1.2. Since then, there has been one major change of the guidelines, from UD v1 to UD v2, and the number of treebanks has more than quadrupled. Figure 1 shows the growth in number of languages, treebanks and annotated words from UD v1.0 to UD v2.5. During the same period, the number of downloads or accesses at the official repository at https://lindat.cz has grown to 46,439 (as of November 25, 2019). The UD resources have also made a significant impact on NLP research, most notably for multilingual dependency parsing through two editions of CoNLL shared tasks [Zeman et al. (2017), Zeman et al. (2018)], which have created a new generation of parsers that handle a large number of languages and that parse from raw text rather than relying on pre-tokenized input. Figure 2 visualizes the increase in available data resources and parsing scores for all languages involved in both tasks.

This paper provides an up-to-date description of the project, focusing on the annotation guidelines, especially on the major changes from UD v1 to v2, and on the existing treebank resources. For more information on the project motivation and history, we refer to ?). For more information about UD treebanks and applications of these resources, we refer to the proceedings of the UD workshops held annually since 2017 [de Marneffe et al. (2017), de Marneffe et al. (2018), Rademaker and Tyers (2019)].

Annotation Scheme

In this section, we give a brief introduction to the UD annotation scheme. For more details, we refer to the documentation on the UD website (https://universaldependencies.org/guidelines.html).

UD is based on a lexicalist view of syntax, which means that dependency relations hold between words, and that morphological features are encoded as properties of words with no attempt at segmenting words into morphemes. However, it is important to note that the basic units of annotation are syntactic words (not phonological or orthographic words), which means that it is often necessary to split off clitics, as in Spanish dámelo = da me lo, and undo contractions, as in French au = à le. We refer to such cases as multiword tokens because a single orthographic token corresponds to multiple (syntactic) words. In exceptional cases, it may be necessary to go in the other direction, and combine several orthographic tokens into a single syntactic word (see Section 3.1.).
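To make the token/word distinction concrete, here is a minimal sketch of how multiword tokens could be read from the CoNLL-U format used for UD treebanks, where a multiword token is written as a range line (such as 1-3) followed by one line per syntactic word. The function name and the toy rows are illustrative assumptions, only the ID and FORM columns are kept, and empty nodes are not handled.

```python
def split_multiword_tokens(rows):
    """Group (id, form) rows into (surface token, [syntactic words]) pairs.

    A row whose id is a range such as "1-3" is a multiword token; the rows
    with ids inside that range are the syntactic words it covers.
    """
    tokens = []
    covered_until = 0  # last word id covered by the current multiword token
    for tok_id, form in rows:
        if "-" in tok_id:
            _, end = (int(x) for x in tok_id.split("-"))
            tokens.append((form, []))
            covered_until = end
        elif int(tok_id) <= covered_until:
            tokens[-1][1].append(form)     # word inside the current range
        else:
            tokens.append((form, [form]))  # ordinary one-word token
    return tokens


# Toy input: Spanish "dámelo" split into three syntactic words.
rows = [("1-3", "dámelo"), ("1", "da"), ("2", "me"), ("3", "lo"), ("4", ".")]
result = split_multiword_tokens(rows)
```

Dependency relations and morphological features are then attached to the syntactic words (da, me, lo), not to the orthographic token.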

2. Morphological Annotation

The morphological specification of a (syntactic) word in the UD scheme consists of three levels of representation:

A lemma representing the base form of the word.

A part-of-speech tag representing the grammatical category of the word.

A set of features representing lexical and grammatical properties associated with the particular word form.

The lemma is the canonical form of the word, which is the form typically found in dictionaries. In agglutinative languages, this is typically the form with no inflectional affixes; in fusional languages, the lemma is usually the result of a language-particular convention. The list of universal part-of-speech tags is a fixed list containing 17 tags, shown in Table 1. Languages are not required to use all tags, but the list cannot be extended to cover language-specific categories. Instead, more fine-grained classification of words can be achieved via the use of features, which specify additional information about morphosyntactic properties. We provide an inventory of features that are attested in multiple languages and need to be encoded in a uniform way, listed in Table 1. Users can extend this set of universal features and add language-specific features when necessary.
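The three levels of morphological representation can be pictured as a small record type. The sketch below is only an illustration: the class and function names are assumptions, and the `Feature=Value|Feature=Value` string syntax is the one used in the FEATS column of CoNLL-U files.

```python
from dataclasses import dataclass, field

# The 17 universal part-of-speech tags of UD v2.
UPOS_TAGS = frozenset(
    "ADJ ADP ADV AUX CCONJ DET INTJ NOUN NUM PART PRON PROPN "
    "PUNCT SCONJ SYM VERB X".split()
)


def parse_feats(feats):
    """Turn a FEATS string such as 'Case=Nom|Number=Sing' into a dict."""
    if feats == "_":  # underscore marks an empty feature set
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))


@dataclass
class Word:
    form: str
    lemma: str
    upos: str
    feats: dict = field(default_factory=dict)

    def __post_init__(self):
        # The tag list is closed: language-specific tags are not allowed.
        if self.upos not in UPOS_TAGS:
            raise ValueError(f"not a universal POS tag: {self.upos}")


word = Word("dogs", "dog", "NOUN", parse_feats("Number=Plur"))
```

Finer-grained distinctions go into `feats`, which is open to language-specific additions, while `upos` is checked against the fixed inventory.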

3. Syntactic Annotation

Syntactic annotation in the UD scheme consists of typed dependency relations between words. The basic syntactic representation forms a tree rooted in one word, normally the main clause predicate, on which all other words of the sentence are dependent. In addition to the basic representation, which is obligatory for all UD treebanks, it is possible to give an enhanced dependency representation, which adds (and in a few cases changes) relations in order to give a more complete basis for semantic interpretation. We will focus here on the basic representation and return to the enhanced representation when discussing changes in UD v2.
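The tree constraint on the basic representation can be checked mechanically. The following is a minimal sketch, assuming heads are given as a list of 1-based word ids with 0 marking the root (the convention of the HEAD column in CoNLL-U); the function name is an illustrative assumption.

```python
def is_rooted_tree(heads):
    """heads[i] is the head of word i+1 (1-based ids); 0 marks the root.

    Returns True if exactly one word is the root and every other word
    reaches it by following head links without running into a cycle.
    """
    n = len(heads)
    if sum(1 for h in heads if h == 0) != 1:
        return False
    for start in range(1, n + 1):
        seen = set()
        node = start
        while node != 0:
            if node in seen or not (1 <= node <= n):
                return False  # cycle, or head id outside the sentence
            seen.add(node)
            node = heads[node - 1]
    return True


# "she suddenly went to Paris": went (word 3) is the root.
ok = is_rooted_tree([3, 3, 0, 5, 3])
```

Enhanced representations are graphs rather than trees, so a check like this applies only to the basic layer.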

The syntactic analysis in UD gives priority to predicate-argument and modifier relations that hold directly between content words, as opposed to being mediated by function words. The rationale is that this makes more transparent what grammatical relations are shared across languages, even when the languages differ in the way that they use word order, function words or morphological inflection to encode these relations. This is illustrated in Figure 2.3., which shows three parallel sentences in Czech, English and Swedish. In all three cases, there is a passive predicate with a subject and an oblique modifier (the relations marked in solid blue), but the languages differ in how they encode certain grammatical categories (marked in dashed red): definiteness is indicated by a separate function word (the article the) in English, by a morphological inflection in Swedish and not at all in Czech; passive is expressed by a periphrastic construction involving an auxiliary and a participle in English, by a morphological inflection in Swedish, and by a combination of these strategies in Czech (because the participle is unique to the passive construction); and the oblique modifier is introduced by a preposition in English and Swedish but marked by instrumental case in Czech.

2. Morphological Annotation

The universal part-of-speech tagset is essentially the same in UD v2 as in UD v1, but the tag for coordinating conjunctions has been renamed from CONJ to CCONJ (the motivation is to make it parallel to SCONJ for subordinating conjunctions, more similar to the syntactic relation cc with which it often cooccurs, and less similar to the relation conj with which it practically never cooccurs), and the guidelines have been modified slightly for three tags:

The use of AUX is extended from auxiliary verbs in a narrow sense to also include copula verbs and nonverbal TAME particles (tense, aspect, mood, evidentiality, and, sometimes, voice or polarity particles).

The use of PART is limited to a small set of words that must be listed in the language-specific documentation.

The distinction between PRON and DET is made more flexible to accommodate cross-linguistic variation.

The inventory of universal morphological features has been extended with new features and new values for existing features. In addition, a few features and feature values have been renamed or removed. These changes, which are summarized in Table 2, are motivated by the addition of new languages to UD as well as an effort to harmonize UD with the UniMorph project [Sylak-Glassman et al. (2015)].

3. Syntactic Annotation

Although most syntactic relations are the same in UD v2 as in UD v1, the guidelines have often been improved by providing more explicit criteria and examples from multiple languages. Here we only list cases where relations have been removed, added or renamed, or where the use of an existing relation has changed significantly.

As explained earlier, UD assumes a distinction between core and non-core dependents of predicates. For nominal core arguments, UD v1 used the labels nsubj, dobj and iobj. These relations remain conceptually unchanged, but the second label has been changed from dobj to obj, because this seems to better convey the intended interpretation of “second core argument” or “P/O argument” (without connection to specific cases or semantic roles). In addition, the nsubjpass label for passive subjects is removed, and passive subjects are subsumed under the nsubj relation, but with a strong recommendation to use the subtype nsubj:pass for languages where the distinction is relevant. Analogously, the relations csubjpass (for clausal passive subject) and auxpass (for passive auxiliary) are now subsumed under csubj and aux (with possible subtypes csubj:pass and aux:pass).

The second change in this area concerns the analysis of oblique nominals at the clause level, that is, nominal expressions that are dependents of predicates but not core arguments, and which are typically accompanied by case marking in the form of adpositions or oblique morphological case. In UD v1, such expressions were subsumed under the nmod relation (for nominal modifier), which also applies to nominal expressions that modify other nominals and are not dependents of predicates at the clause level. This violated a fundamental principle of UD, namely that distinct labels should be used for dependents of nominals and dependents of predicates, even if the overt form of the modifier is the same. In UD v2, the obl relation is therefore used for oblique nominals at the clause level, while the nmod relation is reserved for nominals modifying other nominal expressions. The distinction is illustrated in the two examples below, which also show that the core/non-core distinction is only applied at the clause level. Hence, both the nsubj and the obl relations in the clause example correspond to nmod relations in the nominal example.

she/PRON suddenly/ADV went/VERB to/ADP Paris/PROPN
nsubj(went-3, she-1), advmod(went-3, suddenly-2), case(Paris-5, to-4), obl(went-3, Paris-5)

her/PRON sudden/ADJ trip/NOUN to/ADP Paris/PROPN
nmod(trip-3, her-1), amod(trip-3, sudden-2), case(Paris-5, to-4), nmod(trip-3, Paris-5)
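The label changes described above can be summarized as a small conversion table. The sketch below is not an official conversion tool: the function name is hypothetical, and using the part of speech of the head to decide between obl and nmod is a crude stand-in for the real criterion (whether the head is a predicate at the clause level), which fails, for example, for nominal predicates.

```python
# v1 labels that are renamed or subsumed in v2, as described in the text.
V1_TO_V2 = {
    "dobj": "obj",
    "nsubjpass": "nsubj:pass",
    "csubjpass": "csubj:pass",
    "auxpass": "aux:pass",
}

# Assumption: heads with these tags are treated as nominals, all others
# as clause-level predicates.
NOMINAL_UPOS = {"NOUN", "PROPN", "PRON"}


def convert_label(deprel, head_upos):
    """Map a UD v1 relation label to its UD v2 counterpart."""
    if deprel == "nmod":
        return "nmod" if head_upos in NOMINAL_UPOS else "obl"
    return V1_TO_V2.get(deprel, deprel)


clause_label = convert_label("nmod", "VERB")   # "to Paris" under "went"
nominal_label = convert_label("nmod", "NOUN")  # "to Paris" under "trip"
```

The two calls mirror the clause and nominal examples: the same prepositional phrase becomes obl under a verb and stays nmod under a noun.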

The final modification in the annotation of clause structure is a more restricted application of the cop relation. In UD v2, the cop relation is restricted to function words (verbal or nonverbal) whose sole function is to link a nonverbal predicate to its subject and which does not add any meaning other than grammaticalized TAME categories. The range of constructions that are analyzed using the cop relation is subject to language-specific variation but can be identified using universal criteria described in the guidelines.

The question of whether and how coordination can be analyzed as a dependency structure is a vexed one [Popel et al. (2013), Gerdes and Kahane (2015)]. UD treats coordination as an essentially symmetric relation, and uses the special conj relation to connect all non-first conjuncts to the first one. In this respect, UD v2 is exactly the same as UD v1, but UD v2 differs by attaching coordinating conjunctions (cc) and punctuation (punct) inside coordinated structures to the immediately succeeding conjunct (instead of the first conjunct as in UD v1), following the approach of ?), as illustrated in the example below.

bacon/NOUN ,/PUNCT lettuce/NOUN and/CCONJ tomato/NOUN
punct(lettuce-3, ,-2), conj(bacon-1, lettuce-3), cc(tomato-5, and-4), conj(bacon-1, tomato-5)
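The change in attachment can be sketched as a transformation from a v1-style analysis, where cc and punct hang from the first conjunct, to the v2-style analysis. This is a simplified heuristic under stated assumptions, not the project's conversion procedure: the function name and the dictionary encoding are hypothetical, and every cc or punct that follows the first conjunct and precedes a later conjunct is assumed to belong to the coordination.

```python
def reattach_coordination(arcs):
    """arcs maps dependent id -> (head id, relation); returns a new dict.

    cc and punct dependents of a first conjunct are moved to the nearest
    conjunct that follows them, if there is one.
    """
    conjuncts = {}  # first conjunct id -> ids of its conj dependents
    for dep, (head, rel) in arcs.items():
        if rel == "conj":
            conjuncts.setdefault(head, []).append(dep)

    new_arcs = dict(arcs)
    for dep, (head, rel) in arcs.items():
        if rel in ("cc", "punct") and head in conjuncts and dep > head:
            following = [c for c in sorted(conjuncts[head]) if c > dep]
            if following:
                new_arcs[dep] = (following[0], rel)
    return new_arcs


# v1-style analysis of "bacon , lettuce and tomato" (bacon treated as root).
v1 = {1: (0, "root"), 2: (1, "punct"), 3: (1, "conj"), 4: (1, "cc"), 5: (1, "conj")}
v2 = reattach_coordination(v1)
```

In the result the comma attaches to lettuce and the conjunction to tomato, while both conj relations still point to the first conjunct.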

The analysis of elliptical constructions like gapping is completely different in UD v2 compared to UD v1. Let us first note that most cases of ellipsis are simply treated by “promoting” a dependent of the elided element to take its place in the syntactic structure. Thus, adjectival modifiers or even determiners can head nominals if the head noun is omitted. Similarly, auxiliary verbs can head clauses in constructions like VP ellipsis. However, in cases like gapping, this yields a rather unsatisfactory analysis where one core argument is typically attached to another. UD v2 therefore uses a special relation orphan to indicate that this is an anomalous structure where the dependent is really a sibling of the word to which it is attached. As illustrated in the example below, this gives an underspecified analysis of the predicate-argument structure, which can be fully resolved in the enhanced representation (see Section 3.4.).

she/PRON drank/VERB coffee/NOUN and/CCONJ he/PRON tea/NOUN
nsubj(drank-2, she-1), obj(drank-2, coffee-3), cc(he-5, and-4), conj(drank-2, he-5), orphan(he-5, tea-6)

The choice of which dependent to promote is determined by an obliqueness hierarchy (where subjects precede objects) described in the guidelines. This new analysis of gapping is superior to the UD v1 analysis (which used a remnant relation), because it preserves the integrity of the two clauses and introduces fewer non-projective dependencies.
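The promotion step can be sketched as picking the least oblique remnant and attaching the others to it. Only the ordering "subjects precede objects" comes from the text above; the rest of the list below is an assumption made for illustration, and the guidelines should be consulted for the actual hierarchy. The function name is hypothetical.

```python
# Illustrative ordering; only "nsubj before obj" is stated in the text.
OBLIQUENESS = ["nsubj", "obj", "iobj", "obl", "advmod"]


def promote_and_orphan(remnants):
    """remnants: list of (word id, relation the word would bear to the
    elided predicate). Returns the promoted word id and a list of
    (head, dependent, relation) orphan arcs for the remaining words.
    """
    ranked = sorted(remnants, key=lambda r: OBLIQUENESS.index(r[1]))
    promoted = ranked[0][0]
    orphans = [(promoted, word_id, "orphan") for word_id, _ in ranked[1:]]
    return promoted, orphans


# Second clause of "she drank coffee and he tea": he (5) and tea (6).
promoted, orphan_arcs = promote_and_orphan([(6, "obj"), (5, "nsubj")])
```

For the gapping example this promotes he and yields a single orphan arc from he to tea, matching the analysis shown above.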

UD v2 also includes some changes in the annotation of functional relations, that is, relations holding between a function word or grammatical marker and its host (mostly a verb or noun). More specifically:

A new relation clf is added for nominal classifiers.

The aux relation is extended from auxiliary verbs in a narrow sense to also include nonverbal TAME particles in analogy with the extended use of the part-of-speech tag AUX (see Section 3.2.).

The auxpass relation is subsumed under the aux relation (see above).

The cop relation is restricted to pure linking words (see above).

The neg relation is removed from the set of universal relations, and polarity is instead encoded in a feature (see Section 3.2.).

3.1. Multiword Expressions

The guidelines for annotation of multiword expressions have been thoroughly revised in UD v2. Multiword expressions that are morphosyntactically regular (and only exhibit semantic non-compositionality) normally do not receive any special treatment at all. Hence, the UD guidelines in this area only apply to a few subtypes of the many phenomena that have been discussed in the literature on multiword expressions.

The first subtype is compounding. The relation compound is used for any kind of lexical compounds: noun compounds such as phone book, but also verb and adjective compounds, such as the serial verbs that occur in many languages, or a Japanese light verb construction such as benkyō suru (“to study”). The compound relation is also used for phrasal verbs, such as put up: compound(put, up). Despite operating at the lexical level, compounds are regular headed constructions, as illustrated in (3.3.1.). This behavior distinguishes compounds from the other two types of multiword expressions.

hate/NOUN speech/NOUN detection/NOUN
compound(detection-3, speech-2), compound(speech-2, hate-1)

The second subtype is fixed expressions, highly grammaticalized expressions that typically behave as function words or short adverbials, for which the relation fixed is used. The name and rough scope of usage are borrowed from the fixed expressions category of ?). (This relation was called mwe in UD v1, but the name was found to be misleading as the relation only applies to a very small subset of multiword expressions.) Fixed multiword expressions are annotated with a flat structure. Since there is no clear basis for internal syntactic structure, we adopt the convention of always attaching subsequent words to the first one with the fixed label, as shown in the example below.

dogs/NOUN as/ADP well/ADV as/ADP cats/NOUN
fixed(as-2, as-4), fixed(as-2, well-3), cc(cats-5, as-2), conj(dogs-1, cats-5)

As with other clines of grammaticalization, it is not always clear where to draw the line between giving a regular syntactic analysis versus a fixed expression analysis of a conventionalized expression. In practice, the best solution is to be conservative and to prefer a regular syntactic analysis except when an expression is highly opaque and clearly does not have internal syntactic structure (except from a historical perspective).

The final subtype is headless multiword expressions analyzed with the relation flat. This class is less clearly recognized in most grammars of human languages, but in practice there are many linguistic constructions with a sequence of words that do not have any clear synchronic grammatical structure but are not fixed expressions. These include names, dates, and calqued expressions from other languages. We again adopt the convention that in these cases subsequent words are attached to the first word with the flat relation, as exemplified in (3.3.1.).

Hillary/PROPN Rodham/PROPN Clinton/PROPN
flat(Hillary-1, Rodham-2), flat(Hillary-1, Clinton-3)

This relation replaces two more specific relations from UD v1, name and foreign. Subtypes like flat:name and flat:foreign can be used in cases where a flat analysis is appropriate for complex names and foreign expressions.
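The head-first convention shared by fixed and flat is simple enough to state as code. A minimal sketch, with a hypothetical function name and arcs written as (head, dependent, relation) triples over word positions:

```python
def head_first_arcs(word_ids, relation):
    """Attach every non-first word of a headless expression to the first.

    word_ids: positions of the words in the expression, in sentence order.
    relation: "fixed", "flat", or a subtype such as "flat:name".
    """
    if relation.split(":")[0] not in ("fixed", "flat"):
        raise ValueError(f"not a headless MWE relation: {relation}")
    first, *rest = word_ids
    return [(first, dep, relation) for dep in rest]


fixed_arcs = head_first_arcs([2, 3, 4], "fixed")   # "as well as"
name_arcs = head_first_arcs([1, 2, 3], "flat")     # "Hillary Rodham Clinton"
```

Compounds are deliberately excluded: as noted above, they are regular headed constructions, so their internal structure cannot be derived from word order alone.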

4. Enhanced Dependencies

UD v2 now also provides guidelines for enhanced dependency graphs. With a few exceptions, enhanced graphs consist of all the syntactic relations in the basic dependency tree and may contain additional relations and nodes that make otherwise implicit relations between tokens explicit, with the purpose of facilitating downstream natural language understanding tasks. The guidelines are based on the CCprocessed Stanford dependencies representation [de Marneffe et al. (2006)] and a proposal for enhanced dependencies [Schuster and Manning (2016)], and define five types of enhancements. For more information, we refer to the documentation on the UD website (https://universaldependencies.org/u/overview/enhanced-syntax.html).

For sentences with elided predicates, in the basic representation, one word is promoted to be the head of the clause and all words that would have been a sibling of the promoted word if no predicate had been elided are attached with the orphan relation (see Section 3.3.). The enhanced representation for sentences with gapping contains additional null nodes representing elided predicates. Arguments and modifiers of the elided predicate are attached to the null nodes, as illustrated in (3.4.), which contains a null node (E5.1) and relations between the null node and the arguments in the second clause.

she/PRON drank/VERB coffee/NOUN and/CCONJ he/PRON E5.1/VERB tea/NOUN
nsubj(drank-2, she-1), obj(drank-2, coffee-3), cc(E5.1, and-4), nsubj(E5.1, he-5), conj(drank-2, E5.1), obj(E5.1, tea-6)

Conjoined predicates often share dependents (e.g., a subject) and conjoined dependents share a head. In (3.4.), the two predicates (buys and sells) share the subject (the store) and object (cameras). The shared status of dependents and governors is made explicit in the enhanced representation through additional relations, such as the nsubj and obj relations below the sentence. (The placement of arcs above and below the sentence, respectively, is only for perspicuity and does not imply any difference in status between different types of arcs.)

the/DET store/NOUN buys/VERB and/CCONJ sells/VERB cameras/NOUN
above: det(store-2, the-1), nsubj(buys-3, store-2), cc(sells-5, and-4), conj(buys-3, sells-5), obj(buys-3, cameras-6)
below: obj(sells-5, cameras-6), nsubj(sells-5, store-2)
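Part of this enhancement can be approximated by a simple rule over the basic tree. The sketch below only propagates subjects, under the assumption that a non-first conjunct without its own subject shares the subject of the first conjunct; the function name is hypothetical, and deciding whether an object such as cameras is shared requires more than this rule.

```python
def propagate_shared_subjects(arcs):
    """arcs: list of (head, dependent, relation) from the basic tree.

    Returns additional nsubj arcs for conjoined predicates that have no
    subject of their own.
    """
    subject_of = {h: d for h, d, r in arcs if r.split(":")[0] == "nsubj"}
    extra = []
    for head, dep, rel in arcs:
        if rel == "conj" and head in subject_of and dep not in subject_of:
            extra.append((dep, subject_of[head], "nsubj"))
    return extra


# Basic tree of "the store buys and sells cameras".
basic = [(2, 1, "det"), (3, 2, "nsubj"), (5, 4, "cc"), (3, 5, "conj"), (3, 6, "obj")]
extra_arcs = propagate_shared_subjects(basic)
```

Here the rule adds the subject relation from sells to store and nothing else.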

For sentences with control or raising predicates, in the basic representation, the argument that is shared between the matrix predicate and the embedded predicate is only attached to the matrix predicate. Thus, as in the case of shared dependents in conjoined phrases, there is no explicit relation between the embedded predicate and its subject. In the enhanced representation, this implicit subject relation is made explicit with an additional relation, such as the nsubj relation below the sentence in (3.4.). The fact that this relation holds between an embedded predicate and an argument of the matrix verb can optionally be marked with the nsubj:xsubj subtype.

Mary/PROPN wants/VERB to/PART buy/VERB a/DET book/NOUN
above: nsubj(wants-2, Mary-1), mark(buy-4, to-3), xcomp(wants-2, buy-4), det(book-6, a-5), obj(buy-4, book-6)
below: nsubj(buy-4, Mary-1)
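A rough approximation of this enhancement looks for xcomp relations and links the embedded predicate to an argument of the matrix predicate. The choice of controller below (the matrix object if there is one, otherwise the matrix subject) is an assumed heuristic that is wrong for some verbs, and the function name is hypothetical.

```python
def add_controlled_subjects(arcs):
    """arcs: list of (head, dependent, relation) from the basic tree.

    For every xcomp, add a subject arc from the embedded predicate to the
    assumed controller among the matrix predicate's arguments.
    """
    extra = []
    for head, dep, rel in arcs:
        if rel != "xcomp":
            continue
        matrix_args = {r: d for h, d, r in arcs
                       if h == head and r in ("nsubj", "obj")}
        controller = matrix_args.get("obj", matrix_args.get("nsubj"))
        if controller is not None:
            extra.append((dep, controller, "nsubj:xsubj"))
    return extra


# Basic tree of "Mary wants to buy a book".
basic_tree = [(2, 1, "nsubj"), (4, 3, "mark"), (2, 4, "xcomp"),
              (6, 5, "det"), (4, 6, "obj")]
control_arcs = add_controlled_subjects(basic_tree)
```

For the example sentence this yields one extra arc from buy to Mary, labeled with the optional nsubj:xsubj subtype.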

In the enhanced representation, the coreferential status of relative pronouns is marked with the special ref relation. Further, to represent the implicit relation between the predicate of the relative clause and the antecedent of the relative pronoun, there is an additional relation between the predicate and the antecedent, such as the nsubj relation between lived and boy in (3.4.). (The nsubj relation between lived and who is common to the basic and enhanced representation.)

the/DET boy/NOUN who/PRON lived/VERB
above: det(boy-2, the-1), acl:relcl(boy-2, lived-4), nsubj(lived-4, who-3)
below: nsubj(lived-4, boy-2), ref(boy-2, who-3)
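The two additional relations can be derived from the basic tree once the relative pronoun is known. The sketch below assumes the set of relative pronoun positions is supplied by the caller (for instance from a PronType=Rel feature) and only handles pronouns that depend directly on the predicate of the relative clause; the function name is hypothetical.

```python
def enhance_relative_clause(arcs, relative_pronouns):
    """arcs: list of (head, dependent, relation) from the basic tree.
    relative_pronouns: set of word ids known to be relative pronouns.

    Returns the additional arcs: ref from antecedent to pronoun, and a
    copy of the pronoun's relation pointing at the antecedent instead.
    """
    extra = []
    for head, dep, rel in arcs:
        if rel != "acl:relcl":
            continue
        antecedent, predicate = head, dep
        for h, d, r in arcs:
            if h == predicate and d in relative_pronouns:
                extra.append((antecedent, d, "ref"))
                extra.append((predicate, antecedent, r))
    return extra


# Basic tree of "the boy who lived"; who is word 3.
relcl_tree = [(2, 1, "det"), (2, 4, "acl:relcl"), (4, 3, "nsubj")]
relcl_arcs = enhance_relative_clause(relcl_tree, {3})
```

The function only adds arcs, so the basic relation between lived and who stays in place.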

Finally, since many modifier relation types such as obl or acl are used for many different types of relations, and since adpositions or case information often disambiguate the semantic role, the enhanced representation provides augmented modifier relations that include adposition or case information in the relation name, such as the nmod:on relation in (3.4.).

the/DET house/NOUN on/ADP the/DET hill/NOUN
det(house-2, the-1), case(hill-5, on-3), det(hill-5, the-4), nmod:on(house-2, hill-5)

All enhancements are optional and users may decide to implement only a subset of these. As of UD release v2.5, only 24 treebanks include an enhanced representation, and even fewer treebanks implement all five enhancements (see also ?)). In many cases, the enhanced graphs can be computed automatically from a basic dependency tree (see ?) for a discussion and evaluation of a rule-based and a machine learning-based converter from basic to enhanced dependencies), and ?) recently used the Stanford Enhancer [Schuster and Manning (2016)] to automatically predict enhanced dependencies for all UD treebanks.

Available Treebanks

UD release v2.5 [Zeman et al. (2019)] contains 157 treebanks representing 90 languages. (UD releases are numbered by letting the first digit (2) refer to the version of the guidelines and the second digit (5) to the number of releases under that version.) Table 3 specifies for each language the number of treebanks available, as well as the total number of annotated sentences and words in that language. It is worth noting that the amount of data varies considerably between languages, from Skolt Sámi with 36 sentences and 321 words, to German with over 200,000 sentences and nearly 4 million words. The majority of treebanks are small but it should be kept in mind that many of these treebanks are new initiatives and can be expected to grow substantially in the future.

The languages in UD v2.5 represent 20 different language families (or equivalent), listed in Table 4. The selection is very heavily biased towards Indo-European languages (48 out of 90), and towards a few branches of this family – Germanic (10), Romance (8) and Slavic (13) – but it is worth noting that the bias is (slowly) becoming less extreme over time: the proportion of Indo-European languages has gone from 60% in v2.1 to 53% in v2.5. Another way of visualizing the gradual extension of UD to new language families and geographic areas can be found in Figure 4, which shows the approximate geographic locations of languages added in UD v1.0 (green), UD v2.0 (blue) and UD v2.5 (red). It is clear that, whereas UD v1.0 was almost completely restricted to Europe, later versions have extended to other areas, and by v2.5 all inhabited continents are represented – although there are still large white areas on the map.

The treebanks in UD v2.5 are also heterogeneous with respect to the type of text (or spoken data) annotated. A very coarse-grained picture of this variation can be gathered from Table 5, which specifies the number of treebanks that contain some amount of data from different “genres”, as reported by each treebank provider in the treebank documentation. The categories in this classification are neither mutually exclusive nor based on homogeneous criteria, but it is currently the best documentation that can be obtained.

The UD project has come a long way in only five years, and UD treebanks are now widely used in NLP as well as in linguistic research, especially with a typological orientation. Future priorities for the project include obtaining data from more languages – in order to achieve better coverage of major language families – but also obtaining more annotated data for existing languages – in order to make the data more useful for NLP as well as linguistic studies. Finally, the work on achieving cross-linguistic consistency needs to continue. Adopting a common set of categories and guidelines is a first step in this direction, but ensuring that these are applied consistently across a growing set of typologically diverse languages will continue to be a challenge for years to come. Fortunately, efforts in this direction are constantly being pursued in the active UD user community.

We want to thank our colleagues in the UD core guidelines group Yoav Goldberg, Ryan McDonald, Slav Petrov and Reut Tsarfaty for fruitful discussions and comments on a draft version of this paper, as well as all the 345 UD treebank contributors, listed in ?), without whom UD literally would not exist.