Dependency Parsing as Head Selection

Xingxing Zhang, Jianpeng Cheng, Mirella Lapata

Introduction

Dependency parsing plays an important role in many natural language applications, such as relation extraction [Fundel et al., 2007], machine translation [Carreras and Collins, 2009], language modeling [Chelba et al., 1997, Zhang et al., 2016] and ontology construction [Snow et al., 2005]. Dependency parsers represent syntactic information as a set of head-dependent relational arcs, typically constrained to form a tree. Practically all models proposed for dependency parsing in recent years can be described as graph-based [McDonald et al., 2005a] or transition-based [Yamada and Matsumoto, 2003, Nivre et al., 2006b].

Graph-based dependency parsers are typically arc-factored, where the score of a tree is defined as the sum of the scores of all its arcs. An arc is scored with a set of local features and a linear model, the parameters of which can be effectively learned with online algorithms [Crammer and Singer, 2001, Crammer and Singer, 2003, Freund and Schapire, 1999, Collins, 2002]. In order to efficiently find the best scoring tree during training and decoding, various maximization algorithms have been developed [Eisner, 1996, Eisner, 2000, McDonald et al., 2005b]. In general, graph-based methods are optimized globally, using features of single arcs in order to make the learning and inference tractable. Transition-based algorithms factorize a tree into a set of parsing actions. At each transition state, the parser scores a candidate action conditioned on the state of the transition system and the parsing history, and greedily selects the highest-scoring action to execute. This score is typically obtained with a classifier based on non-local features defined over a rich history of parsing decisions [Yamada and Matsumoto, 2003, Zhang and Nivre, 2011].

Regardless of the algorithm used, most well-known dependency parsers, such as the MST-Parser [McDonald et al., 2005b] and the MaltPaser [Nivre et al., 2006a], rely on extensive feature engineering. Feature templates are typically manually designed and aim at capturing head-dependent relationships which are notoriously sparse and difficult to estimate. More recently, a few approaches [Chen and Manning, 2014, Lei et al., 2014a, Kiperwasser and Goldberg, 2016] apply neural networks for learning dense feature representations. The learned features are subsequently used in a conventional graph- or transition-based parser, or better designed variants [Dyer et al., 2015].

In this work, we propose a simple neural network-based model which learns to select the head for each word in a sentence without enforcing tree structured output. Our model which we call DeNSe (as shorthand for Dependency Neural Selection) employs bidirectional recurrent neural networks to learn feature representations for words in a sentence. These features are subsequently used to predict the head of each word. Although there is nothing inherent in the model to enforce tree-structured output, when tested on an English dataset, it is able to generate trees for 95% of the sentences, 87% of which are projective. The remaining non-tree (or non-projective) outputs are post-processed with the Chu-Liu-Edmond (or Eisner) algorithm. DeNSe uses the head selection procedure to estimate arc weights during training. During testing, it essentially reduces to a standard graph-based parser when it fails to produce tree (or projective) output.

We evaluate our model on benchmark dependency parsing corpora, representing four languages (English, Chinese, Czech, and German) with varying degrees of non-projectivity. Despite the simplicity of our approach, experiments show that the resulting parsers are on par with the state of the art.

Related Work

Graph-based dependency parsers employ a model for scoring possible dependency graphs for a given sentence. The graphs are typically factored into their component arcs and the score of a tree is defined as the sum of its arcs. This factorization enables tractable search for the highest scoring graph structure which is commonly formulated as the search for the maximum spanning tree (MST). The Chu-Liu-Edmonds algorithm [Chu and Liu, 1965, Edmonds, 1967, McDonald et al., 2005b] is often used to extract the MST in the case of non-projective trees, and the Eisner algorithm [Eisner, 1996, Eisner, 2000] in the case of projective trees. During training, weight parameters of the scoring function can be learned with margin-based algorithms [Crammer and Singer, 2001, Crammer and Singer, 2003] or the structured perceptron [Freund and Schapire, 1999, Collins, 2002]. Beyond basic first-order models, the literature offers a few examples of higher-order models involving sibling and grand parent relations [Carreras, 2007, Koo et al., 2010, Zhang and McDonald, 2012]. Although more expressive, such models render both training and inference more challenging.

Transition-based Parsing

As the term implies, transition-based parsers conceptualize the process of transforming a sentence into a dependency tree as a sequence of transitions. A transition system typically includes a stack for storing partially processed tokens, a buffer containing the remaining input, and a set of arcs containing all dependencies between tokens that have been added so far [Nivre, 2003, Nivre et al., 2006b]. A dependency tree is constructed by manipulating the stack and buffer, and appending arcs with predetermined operations. Most popular parsers employ an arc-standard [Yamada and Matsumoto, 2003, Nivre, 2004] or arc-eager transition system [Nivre, 2008]. Extensions of the latter include the use of non-local training methods to avoid greedy error propagation [Zhang and Clark, 2008, Huang and Sagae, 2010, Zhang and Nivre, 2011, Goldberg and Nivre, 2012].

Neural Network-based Features

Neural network representations have a long history in syntactic parsing [Mayberry and Miikkulainen, 1999, Henderson, 2004, Titov and Henderson, 2007]. Recent work uses neural networks in lieu of the linear classifiers typically employed in conventional transition- or graph-based dependency parsers. For example, ?) use a feed forward neural network to learn features for a transition-based parser, whereas ?) do the same for a graph-based parser. ?) apply tensor decomposition to obtain word embeddings in their syntactic roles, which they subsequently use in a graph-based parser. ?) redesign components of a transition-based system where the buffer, stack, and action sequences are modeled separately with stack long short-term memory networks. The hidden states of these LSTMs are concatenated and used as features to a final transition classifier. ?) use bidirectional LSTMs to extract features for a transition- and graph-based parser, whereas ?) build a greedy arc-standard parser using similar features.

In our work, we formalize dependency parsing as the task of finding for each word in a sentence its most probable head. Both head selection and the features it is based on are learned using neural networks. The idea of modeling child-parent relations independently dates back to ?) who use an edge-factored model to generate $k$ -best parse trees which are subsequently reranked using a model based on rich global features. Later ?) show that a head selection variant of their loopy belief propagation parser performs worse than a model which incorporates tree structure constraints. Our parser is conceptually simpler: we rely on head selection to do most of the work and decode the best tree directly without using a reranker. In common with recent neural network-based dependency parsers, we aim to alleviate the need for hand-crafting feature combinations. Beyond feature learning, we further show that it is possible to simplify the training of a graph-based dependency parser in the context of bidirectional recurrent neural networks.

Dependency Parsing as Head Selection

In this section we present our parsing model, DeNSe, which tries to predict the head of each word in a sentence. Specifically, the model takes as input a sentence of length $N$ and outputs $N$ $\langle$ head, dependent $\rangle$ arcs. We describe the model focusing on unlabeled dependencies and then discuss how it can be straightforwardly extended to the labeled setting. We begin by explaining how words are represented in our model and then give details on how DeNSe makes predictions based on these learned representations. Since there is no guarantee that the outputs of DeNSe are trees (although they mostly are), we also discuss how to extend DeNSe in order to enforce projective and non-projective tree outputs. Throughout this paper, lowercase boldface letters denote vectors (e.g., $\mathbf{v}$ or $\mathbf{v}_{i}$ ), uppercase boldface letters denote matrices (e.g., $\mathbf{M}$ or $\mathbf{M}_{b}$ ), and lowercase letters denote scalars (e.g., $w$ or $w_{i}$ ).

Let $S=(w_{0},w_{1},\dots,w_{N})$ denote a sentence of length $N$ ; following common practice in the dependency parsing literature [Kübler et al., 2009], we add an artificial root token represented by $w_{0}$ . Analogously, let $A=(\mathbf{a}_{0},\mathbf{a}_{1},\dots,\mathbf{a}_{N})$ denote the representation of sentence $S$ , with $\mathbf{a}_{i}$ representing word $w_{i}\quad(0\leq i\leq N)$ . Besides encoding information about each $w_{i}$ in isolation (e.g., its lexical meaning or POS tag), $\mathbf{a}_{i}$ must also encode $w_{i}$ ’s positional information within the sentence. Such information has been shown to be important in dependency parsing [McDonald et al., 2005a]. For example, in the following sentence:

the head of the first a is dog, whereas the head of the second a is cat. Without considering positional information, a model cannot easily decide which a (nearer or farther) to assign to dog.

Long short-term memory networks (Hochreiter and Schmidhuber, 1997; LSTMs), a type of recurrent neural network with a more complex computational unit, have proven effective at capturing long-term dependencies. In our case LSTMs allow to represent each word on its own and within a sequence leveraging long-range contextual information. As shown in Figure 1, we first use a forward LSTM ( $\text{LSTM}^{F}$ ) to read the sentence from left to right and then a backward LSTM ( $\text{LSTM}^{B}$ ) to read the sentence from right to left, so that the entire sentence serves as context for each word:For more detail on LSTM networks, see e.g., ?) or ?).

Note that bidirectional LSTMs are one of many possible ways of representing word $w_{i}$ . Alternative representations include embeddings obtained from feed-forward neural networks [Chen and Manning, 2014, Lei et al., 2014a], character-based embeddings [Ballesteros et al., 2015], and more conventional features such as those introduced in ?).

2 Head Selection

We now move on to discuss our formalization of dependency parsing as head selection. We begin with unlabeled dependencies and then explain how the model can be extended to predict labeled ones.

In a dependency tree, a head can have multiple dependents, whereas a dependent can have only one head. Based on this fact, dependency parsing can be formalized as follows. Given a sentence $S=(w_{0},w_{1},\dots,w_{N})$ , we aim to find for each word $w_{i}\in\{w_{1},w_{2},\dots,w_{n}\}$ the most probable head $w_{j}\in\{w_{0},w_{1},\dots,w_{N}\}$ . For example, in Figure 1, to find the head for the token love, we calculate probabilities $P_{head}(\text{\sc root}|\text{love},S)$ , $P_{head}(\text{kids}|\text{love},S)$ , and $P_{head}(\text{candy}|\text{love},S)$ , and select the highest. More formally, we estimate the probability of token $w_{j}$ being the head of token $w_{i}$ in sentence $S$ as:

where $\mathbf{a}_{i}$ and $\mathbf{a}_{j}$ are vector-based representations of $w_{i}$ and $w_{j}$ , respectively (described in Section 3.1); $g(\mathbf{a}_{j},\mathbf{a}_{i})$ is a neural network with a single hidden layer that computes the associative score between representations $\mathbf{a}_{i}$ and $\mathbf{a}_{j}$ :

We train our model by minimizing the negative log likelihood of the gold standard $\langle$ head, dependent $\rangle$ arcs in all training sentences:

where $\mathcal{T}$ is the training set, $h(w_{i})$ is $w_{i}$ ’s gold standard headNote that $h(w_{i})$ can be root. within sentence $S$ , and $N_{S}$ the number of words in $S$ (excluding root). During inference, for each word $w_{i}~{}(i\in[1,N_{S}])$ in $S$ , we greedily choose the most likely head $w_{j}~{}(j\in[0,N_{S}])$ :

Note that the prediction for each word $w_{i}$ is made independently of the other words in the sentence.

Given our greedy inference method, there is no guarantee that predicted $\langle$ head, dependent $\rangle$ arcs form a tree (maybe there are cycles). However, we empirically observed that most outputs during inference are indeed trees. For instance, on an English dataset, 95% of the arcs predicted on the development set are trees, and 87% of them are projective, whereas on a Chinese dataset, 87% of the arcs form trees, 73% of which are projective. This indicates that although the model does not explicitly model tree structure during training, it is able to figure out from the data (which consists of trees) that it should predict them.

So far we have focused on unlabeled dependencies, however it is relatively straightforward to extend DeNSe to produce labeled dependencies. We basically train an additional classifier to predict labels for the arcs which have been already identified. The classifier takes as input features $[\mathbf{a}_{i};\mathbf{a}_{j};\mathbf{x}_{i};\mathbf{x}_{j}]$ representing properties of the arc $\langle w_{j},w_{i}\rangle$ . These consist of $\mathbf{a}_{i}$ and $\mathbf{a}_{j}$ , the LSTM-based representations for $w_{i}$ and $w_{j}$ (see Equation (4)), and their word and part-of-speech embeddings, $\mathbf{x}_{i}$ and $\mathbf{x}_{j}$ (see Equation (3)). Specifically, we use a trained DeNSe model to go through the training corpus and generate features and corresponding dependency labels as training data. We employ a two-layer rectifier network [Glorot et al., 2011] for the classification task.

3 Maximum Spanning Tree Algorithms

As mentioned earlier, greedy inference may not produce well-formed trees. In this case, the output of DeNSe can be adjusted with a maximum spanning tree algorithm. We use the Chu-Liu-Edmonds algorithm [Chu and Liu, 1965, Edmonds, 1967] for building non-projective trees and the Eisner [Eisner, 1996] algorithm for projective ones.

Following ?), we view a sentence $S=(w_{0}=\text{\sc root},w_{1},\dots,w_{N})$ as a graph $G_{S}=\langle V_{S},E_{S}\rangle$ with the sentence words and the dummy root symbol as vertices and a directed edge between every pair of distinct words and from the root symbol to every word. The directed graph $G_{S}$ is defined as:

where $s(i,j)$ is the weight of edge $\langle i,j\rangle$ and $P_{head}(w_{i}|w_{j},S)$ is known. The problem of dependency parsing now boils down to finding the tree with the highest score which is equivalent to finding a MST in $G_{S}$ [McDonald et al., 2005b].

To build a non-projective parser, we solve the MST problem with the Chu-Liu-Edmonds algorithm [Chu and Liu, 1965, Edmonds, 1967]. The algorithm selects for each vertex (excluding root) the in-coming edge with the highest weight. If a tree results, it must be the maximum spanning tree and the algorithm terminates. Otherwise, there must be a cycle which the algorithm identifies, contracts into a single vertex and recalculates edge weights going into and out of the cycle. The greedy inference strategy described in Equation (8)) is essentially a sub-procedure in the Chu-Liu-Edmonds algorithm with the algorithm terminating after the first iteration. In implementation, we only run the Chu-Liu-Edmonds algorithm through graphs with cycles, i.e., non-tree outputs.

Projective Parsing

For projective parsing, we solve the MST problem with the Eisner [Eisner, 1996] algorithm. The time complexity of the Eisner algorithm is $O(N^{3})$ , while checking if a tree is projective can be done reasonably faster, with a $O(N\log N)$ algorithm. Therefore, we only apply the Eisner algorithm to the non-projective output of our greedy inference strategy. Finally, it should be noted that the training of our model does not rely on the Chu-Liu-Edmonds or Eisner algorithm, or any other graph-based algorithm. MST algorithms are only used at test time to correct non-tree outputs which are a minority; DeNSe acquires underlying tree structure constraints from the data without an explicit learning algorithm.

Experiments

We evaluated our parser in a projective and non-projective setting. In the following, we describe the datasets we used and provide training details for our models. We also present comparisons against multiple previous systems and analyze the parser’s output.

In the projective setting, we assessed the performance of our parser on the English Penn Treebank (PTB) and the Chinese Treebank 5.1 (CTB). Our experimental setup closely follows ?) and ?).

For English, we adopted the Stanford basic dependencies (SD) representation [De Marneffe et al., 2006].We obtained SD representations using the Stanford parser v.3.3.0. We follow the standard splits of PTB, sections 2–21 were used for training, section 22 for development, and section 23 for testing. POS tags were assigned using the Stanford tagger [Toutanova et al., 2003] with an accuracy of 97.3%. For Chinese, we follow the same split of CTB5 introduced in ?). In particular, we used sections 001–815, 1001–1136 for training, sections 886–931, 1148–1151 for development, and sections 816–885, 1137–1147 for testing. The original constituency trees in CTB were converted to dependency trees with the Penn2Malt tool. http://stp.lingfil.uu.se/~nivre/research/Penn2Malt.html We used gold segmentation and gold POS tags as in ?) and ?).

In the non-projective setting, we assessed the performance of our parser on Czech and German, the largest non-projective datasets released as part of the CoNLL 2006 multilingual dependency parsing shared task. Since there is no official development set in either dataset, we used the last 374/367 sentences in the Czech/German training set as development data.We make the number of sentences in the development and test sets comparable. Projective statistics of the four datasets are summarized in Table 1.

2 Training Details

We trained our models on an Nvidia GPU card; training takes one to two hours. Model parameters were uniformly initialized to $[-0.1,0.1]$ . We used Adam [Kingma and Ba, 2014] to optimize our models with hyper-parameters recommended by the authors (i.e., learning rate 0.001, first momentum coefficient 0.9, and second momentum coefficient 0.999). To alleviate the gradient exploding problem, we rescaled the gradient when its norm exceeded 5 [Pascanu et al., 2013]. Dropout [Srivastava et al., 2014] was applied to our model with the strategy recommended in the literature [Zaremba et al., 2014, Semeniuta et al., 2016]. On all datasets, we used two-layer LSTMs and set $d=s=300$ , where $d$ is the hidden unit size and $s$ is the word embedding size.

As in previous neural dependency parsing work [Chen and Manning, 2014, Dyer et al., 2015], we used pre-trained word vectors to initialize our word embedding matrix $\mathbf{W}_{e}$ . For the PTB experiments, we used 300 dimensional pre-trained GloVehttp://nlp.stanford.edu/projects/glove/ vectors [Pennington et al., 2014]. For the CTB experiments, we trained 300 dimensional GloVe vectors on the Chinese Gigaword corpus which we segmented with the Stanford Chinese Segmenter [Tseng et al., 2005]. For Czech and German, we did not use pre-trained word vectors. The POS tag embedding size was set to $q=30$ in the English experiments, $q=50$ in the Chinese experiments and $q=40$ in both Czech and German experiments.

3 Results

For both English and Chinese experiments, we report unlabeled (UAS) and labeled attachment scores (LAS) on the development and test sets; following ?) punctuation is excluded from the evaluation.

Experimental results on PTB are shown in Table 2. We compared our model with several recent papers following the same evaluation protocol and experimental settings. The first block in the table contains mostly graph-based parsers which do not use neural networks: Bohnet10 [Bohnet, 2010], Martins13 [Martins et al., 2013], and Z&M14 [Zhang and McDonald, 2014]. Z&N11 [Zhang and Nivre, 2011] is a transition-based parser with non-local features. Accuracy results for all four parsers are reported in ?).

The second block in Table 2 presents results obtained from neural network-based parsers. C&M14 [Chen and Manning, 2014] is a transition-based parser using features learned with a feed forward neural network. Although very fast, its performance is inferior compared to graph-based parsers or strong non-neural transition based parsers (e.g., Z&N11). Dyer15 [Dyer et al., 2015] uses (stack) LSTMs to model the states of the buffer, the stack, and the action sequence of a transition system. Weiss15 [Weiss et al., 2015] is another transition-based parser, with a more elaborate training procedure. Features are learned with a neural network model similar to C&M14, but much larger with two layers. The hidden states of the neural network are then used to train a structured perceptron for better beam search decoding. Andor16 [Andor et al., 2016] is similar to Weiss15, but uses a globally normalized training algorithm instead.

Unlike all models above, DeNSe does not use any kind of transition- or graph-based algorithm during training and inference. Nonetheless, it obtains a UAS of 94.02%. Around 95% of the model’s outputs after inference are trees, 87% of which are projective. When we post-process the remaining 13% of non-projective outputs with the Eisner algorithm (DeNSe+E), we obtain a slight improvement on UAS (94.10%).

?) extract features from bidirectional LSTMs and feed them to a graph- (K&G16 graph) and transition-based parser (K&G16 trans). Their LSTMs are jointly trained with the parser objective. DeNSe yields very similar performance to their transition-based parser while it outperforms K&G16 graph. A key difference between DeNSe and K&G16 lies in the training objective. The objective of DeNSe is log-likelihood based without tree structure constraints (the model is trained to produce a distribution over possible heads for each word, where each head selection is independent), while K&G16 employ a max-margin objective with tree structure constraints. Although our probabilistic objective is non-structured, it is perhaps easier to train compared to a margin-based one.

We also assessed the importance of the bidirectional LSTM on its own by replacing our LSTM-based features with those obtained from a feed-forward network. Specifically, we used the 1-order-atomic features introduced in ?) which represent POS-tags, modifiers, heads, and their relative positions. As can be seen in Table 2 (DeNSe-Pei), these features are less effective compared to LSTM-based ones and the contribution of the MST algorithm (Eisner) during decoding is more pronounced (DeNSe-Pei+E). We observe similar trends in the Chinese, German, and Czech datasets (see Tables 3 and 5).

Results on CTB follow a similar pattern. As shown in Table 3, DeNSe outperforms all previous neural models (see the test set columns) on UAS and LAS. DeNSe performs competitively with Z&M14, a non-neural model with a complex high order decoding algorithm involving cube pruning and strategies for encouraging diversity. Post-processing the output of the parser with the Eisner algorithm generally improves performance (by 0.21%; see last row in Table 3). Again we observe that 1-order-atomic features [Lei et al., 2014a] are inferior compared to the LSTM. Table 4 reports unlabeled sentence level exact match (UEM) in Table 4 for English and Chinese. Interestingly, even when using the greedy inference strategy, DeNSe yields a UEM comparable to Dyer15 on PTB. Finally, in Figure 2 we analyze the performance of our parser on sentences of different length. On both PTB and CTB, DeNSe has an advantage on long sentences compared to C&M14 and Dyer15.

For Czech and German, we closely follow the evaluation setup of CoNLL 2006. We report both UAS and LAS, although most previous work has focused on UAS. Our results are summarized in Table 5. We compare DeNSe against three non-projective graph-based dependency parsers: the MST parser [McDonald et al., 2005b], the Turbo parser [Martins et al., 2013], and the RBG parser [Lei et al., 2014b]. We show the performance of these parsers in the first order setting (e.g., MST-1st) and in higher order settings (e.g., Turbo-3rd). The results of MST-1st, MST-2nd, RBG-1st and RBG-3rd are reported in ?) and the results of Turbo-1st and Turbo-3rd are reported in ?). We show results for our parser with greedy inference (see DeNSe in the table) and when we use the Chu-Liu-Edmonds algorithm to post-process non-tree outputs (DeNSe+CLE).

As can been seen, DeNSe outperforms all other first (and second) order parsers on both German and Czech. As in the projective experiments, we observe slight a improvement (on both UAS and LAS) when using a MST algorithm. On German, DeNSe is comparable with the best third-order parser (Turbo-3rd), while on Czech it lags behind Turbo-3rd and RBG-3rd. This is not surprising considering that DeNSe is a first-order parser and only uses words and POS tags as features. Comparison systems use a plethora of hand-crafted features and more sophisticated high-order decoding algorithms. Finally, note that a version of DeNSe with features in [Lei et al., 2014a] is consistently worse (see the second block in Table 5).

Our experimental results demonstrate that using a MST algorithm during inference can slightly improve the model’s performance. We further examined the extent to which the MST algorithm is necessary for producing dependency trees. Table 6 shows the percentage of trees before and after the application of the MST algorithm across the four languages. In the majority of cases DeNSe outputs trees (ranging from 87.0% to 96.7%) and a significant proportion of them are projective (ranging from 65.5% to 86.6%). Therefore, only a small proportion of outputs (14.0% on average) need to be post-processed with the Eisner or Chu-Liu-Edmonds algorithm.

Conclusions

In this work we presented DeNSe, a neural dependency parser which we train without a transition system or graph-based algorithm. Experimental results show that DeNSe achieves competitive performance across four different languages and can seamlessly transfer from a projective to a non-projective parser simply by changing the post-processing MST algorithm during inference. In the future, we plan to increase the coverage of our parser by using tri-training techniques [Li et al., 2014] and multi-task learning [Luong et al., 2015].

We would like to thank Adam Lopez and Frank Keller for their valuable feedback. We acknowledge the financial support of the European Research Council (ERC; award number 681760).

Introduction

Related Work

Transition-based Parsing

Neural Network-based Features

Dependency Parsing as Head Selection

2 Head Selection

3 Maximum Spanning Tree Algorithms

Projective Parsing

Experiments

2 Training Details

3 Results

Conclusions

References