Structured Training for Neural Network Transition-Based Parsing

David Weiss, Chris Alberti, Michael Collins, Slav Petrov

Introduction

Syntactic analysis is a central problem in language understanding that has received a tremendous amount of attention. Lately, dependency parsing has emerged as a popular approach to this problem due to the availability of dependency treebanks in many languages [Buchholz and Marsi (2006, Nivre et al. (2007, McDonald et al. (2013] and the efficiency of dependency parsers.

Transition-based parsers [Nivre (2008] have been shown to provide a good balance between efficiency and accuracy. In transition-based parsing, sentences are processed in a linear left to right pass; at each position, the parser needs to choose from a set of possible actions defined by the transition strategy. In greedy models, a classifier is used to independently decide which transition to take based on local features of the current parse configuration. This classifier typically uses hand-engineered features and is trained on individual transitions extracted from the gold transition sequence. While extremely fast, these greedy models typically suffer from search errors due to the inability to recover from incorrect decisions. ?) showed that a beam-search decoding algorithm utilizing the structured perceptron training algorithm can greatly improve accuracy. Nonetheless, significant manual feature engineering was required before transition-based systems provided competitive accuracy with graph-based parsers [Zhang and Nivre (2011], and only by incorporating graph-based scoring functions were ?) able to exceed the accuracy of graph-based approaches.

In contrast to these carefully hand-tuned approaches, ?) recently presented a neural network version of a greedy transition-based parser. In their model, a feed-forward neural network with a hidden layer is used to make the transition decisions. The hidden layer has the power to learn arbitrary combinations of the atomic inputs, thereby eliminating the need for hand-engineered features. Furthermore, because the neural network uses a distributed representation, it is able to model lexical, part-of-speech (POS) tag, and arc label similarities in a continuous space. However, although their model outperforms its greedy hand-engineered counterparts, it is not competitive with state-of-the-art dependency parsers that are trained for structured search.

In this work, we combine the representational power of neural networks with the superior search enabled by structured training and inference, making our parser one of the most accurate dependency parsers to date. Training and testing on the Penn Treebank [Marcus et al. (1993], our transition-based parser achieves 93.99% unlabeled (UAS) / 92.05% labeled (LAS) attachment accuracy, outperforming the 93.22% UAS / 91.02% LAS of ?) and 93.27 UAS / 91.19 LAS of ?). In addition, by incorporating unlabeled data into training, we further improve the accuracy of our model to 94.26% UAS / 92.41% LAS (93.46% UAS / 91.49% LAS for our greedy model).

In our approach we start with the basic structure of ?), but with a deeper architecture and improvements to the optimization procedure. These modifications (Section 2) increase the performance of the greedy model by as much as 1%. As in prior work, we train the neural network to model the probability of individual parse actions. However, we do not use these probabilities directly for prediction. Instead, we use the activations from all layers of the neural network as the representation in a structured perceptron model that is trained with beam search and early updates (Section 3). On the Penn Treebank, this structured learning approach significantly improves parsing accuracy by 0.8%.

An additional contribution of this work is an effective way to leverage unlabeled data. Neural networks are known to perform very well in the presence of large amounts of training data; however, obtaining more expert-annotated parse trees is very expensive. To this end, we generate large quantities of high-confidence parse trees by parsing unlabeled data with two different parsers and selecting only the sentences for which the two parsers produced the same trees (Section 3.3). This approach is known as “tri-training” [Li et al. (2014] and we show that it benefits our neural network parser significantly more than other approaches. By adding 10 million automatically parsed tokens to the training data, we improve the accuracy of our parsers by almost \sim1.0% on web domain data.

We provide an extensive exploration of our model in Section 5 through ablative analysis and other retrospective experiments. One of the goals of this work is to provide guidance for future refinements and improvements on the architecture and modeling choices we introduce in this paper.

Finally, we also note that neural network representations have a long history in syntactic parsing [Henderson (2004, Titov and Henderson (2007, Titov and Henderson (2010]; however, like ?), our network avoids any recurrent structure so as to keep inference fast and efficient and to allow the use of simple backpropagation to compute gradients. Our work is also not the first to apply structured training to neural networks (see e.g. ?) and ?) for Conditional Random Field (CRF) training of neural networks). Our paper extends this line of work to the setting of inexact search with beam decoding for dependency parsing; ?) concurrently explored a similar approach using a structured probabilistic ranking objective. ?) concurrently developed the Stack Long Short-Term Memory (S-LSTM) architecture, which does incorporate recurrent architecture and look-ahead, and which yields comparable accuracy on the Penn Treebank to our greedy model.

Neural Network Model

In this section, we describe the architecture of our model, which is summarized in Figure 1. Note that we separate the embedding processing to a distinct “embedding layer” for clarity of presentation. Our model is based upon that of ?) and we discuss the differences between our model and theirs in detail at the end of this section. We use the arc-standard [Nivre (2004] transition system.

For all feature groups, we add additional special values for “ROOT” (indicating the POS or word of the root token), “NULL” (indicating no valid feature value could be computed) or “UNK” (indicating an out-of-vocabulary item).

2 Embedding layer

The first learned layer h0h_{0} in the network transforms the sparse, discrete features X\mathbf{X} into a dense, continuous embedded representation. For each feature group Xg\mathbf{X}_{g}, we learn a Vg×DgV_{g}\times D_{g} embedding matrix Eg\mathbf{E}_{g} that applies the conversion:

3 Hidden layers

We experimented with one and two hidden layers composed of MM rectified linear (Relu) units [Nair and Hinton (2010]. Each unit in the hidden layers is fully connected to the previous layer:

where W1\mathbf{W}_{1} is a M1×EM_{1}\times E weight matrix for the first hidden layer and Wi\mathbf{W}_{i} are Mi×Mi1M_{i}\times M_{i-1} matrices for all subsequent layers. The weights bi\mathbf{b}_{i} are bias terms. Relu layers have been well studied in the neural network literature and have been shown to work well for a wide domain of problems [Krizhevsky et al. (2012, Zeiler et al. (2013]. Through most of development, we kept Mi=200M_{i}=200, but we found that significantly increasing the number of hidden units improved our results for the final comparison.

4 Relationship to ?)

Our model is clearly inspired by and based on the work of ?). There are a few structural differences: (1) we allow for much smaller embeddings of POS tags and labels, (2) we use Relu units in our hidden layers, and (3) we use a deeper model with two hidden layers. Somewhat to our surprise, we found these changes combined with an SGD training scheme (Section 3.1) during the “pre-training” phase of the model to lead to an almost 1% accuracy gain over ?). This trend held despite carefully tuning hyperparameters for each method of training and structure combination.

Our main contribution from an algorithmic perspective is our training procedure: as described in the next section, we use the structured perceptron for learning the final layer of our model. We thus present a novel way to leverage a neural network representation in a structured prediction setting.

Semi-Supervised Structured Learning

In this work, we investigate a semi-supervised structured learning scheme that yields substantial improvements in accuracy over the baseline neural network model. There are two complementary contributions of our approach: (1) incorporating structured learning of the model and (2) utilizing unlabeled data. In both cases, we use the neural network to model the probability of each parsing action yy as a soft-max function taking the final hidden layer as its input:

where βy\beta_{y} is a MiM_{i} dimensional vector of weights for class yy and ii is the index of the final hidden layer of the network. At a high level our approach can be summarized as follows:

First, we pre-train the network’s hidden representations by learning probabilities of parsing actions. Fixing the hidden representations, we learn an additional final output layer using the structured perceptron that uses the output of the network’s hidden layers. In practice this improves accuracy by \sim0.6% absolute.

Next, we show that we can supplement the gold data with a large corpus of high quality automatic parses. We show that incorporating unlabeled data in this way improves accuracy by as much as 1% absolute.

To learn the hidden representations, we use mini-batched averaged stochastic gradient descent (ASGD) [Bottou (2010] with momentum [Hinton (2012] to learn the parameters Θ\Theta of the network, where Θ={Eg,Wi,bi,βyg,i,y}\Theta=\{\mathbf{E}_{g},\mathbf{W}_{i},\mathbf{b}_{i},\beta_{y}\mid\forall g,i,y\}. We use back-propagation to minimize the multinomial logistic loss:

where λ\lambda is a regularization hyper-parameter over the hidden layer parameters (we use λ=104\lambda=10^{-4} in all experiments) and jj sums over all decisions and configurations {yj,cj}\{y_{j},c_{j}\} extracted from gold parse trees in the dataset.

The specific update rule we apply at iteration tt is as follows:

where the descent direction gtg_{t} is computed by a weighted combination of the previous direction gt1g_{t-1}and the current gradient ΔL(Θt)\Delta L(\Theta_{t}). The parameter μ[0,1)\mu\in[0,1) is the momentum parameter while ηt\eta_{t} is the traditional learning rate. In addition, since we did not tune the regularization parameter λ\lambda, we apply a simple exponential step-wise decay to ηt\eta_{t}; for every γ\gamma rounds of updates, we multiply ηt=0.96ηt1\eta_{t}=0.96\eta_{t-1}.

The final component of the update is parameter averaging: we maintain averaged parameters Θˉt=αtΘˉt1+(1αt)Θt\bar{\Theta}_{t}=\alpha_{t}\bar{\Theta}_{t-1}+(1-\alpha_{t})\Theta_{t}, where αt\alpha_{t} is an averaging weight that increases from 0.1 to 0.99990.9999 with 1/t1/t. Combined with averaging, careful tuning of the three hyperparameters μ\mu, η0\eta_{0}, and γ\gamma using held-out data was crucial in our experiments.

2 Structured Perceptron Training

In decoding with the perceptron-trained model, we will use beam search to attempt to find:

Thus each decision yjy_{j} receives a score:

In the perceptron with early updates, the parameters v(y)\mathbf{v}(y) are trained as follows. On each training example, we run beam search until the gold-standard parse tree falls out of the beam.If the gold parse tree stays within the beam until the end of the sentence, conventional perceptron updates are used. Define jj to be the length of the beam at this point. A structured perceptron update is performed using the gold-standard decisions y1yjy_{1}\ldots y_{j} as the target, and the highest scoring (incorrect) member of the beam as the negative example.

A key idea in this paper is to use the neural network to define the representation ϕ(x,c)\phi(x,c). Given the sentence xx and the configuration cc, assuming two hidden layers, the neural network defines values for h1\mathbf{h}_{1}, h2\mathbf{h}_{2}, and P(y)P(y) for each decision yy. We experimented with various definitions of ϕ\phi (Section 5.2) and found that ϕ(x,c)=[h1 h2 P(y)]\phi(x,c)=[\mathbf{h}_{1}~{}\mathbf{h}_{2}~{}P(y)] (the concatenation of the outputs from both hidden layers, as well as the probabilities for all decisions yy possible in the current configuration) had the best accuracy on development data.

Note that it is possible to continue to use backpropagation to learn the representation ϕ(x,c)\phi(x,c) during perceptron training; however, we found using ASGD to pre-train the representation always led to faster, more accurate results in preliminary experiments, and we left further investigation for future work.

3 Incorporating Unlabeled Data

Given the high capacity, non-linear nature of the deep network we hypothesize that our model can be significantly improved by incorporating more data. One way to use unlabeled data is through unsupervised methods such as word clusters [Koo et al. (2008]; we follow ?) and use pretrained word embeddings to initialize our model. The word embeddings capture similar distributional information as word clusters and give consistent improvements by providing a good initialization and information about words not seen in the treebank data.

However, obtaining more training data is even more important than a good initialization. One potential way to obtain additional training data is by parsing unlabeled data with previously trained models. ?) and ?) showed that iteratively re-training a single model (“self-training”) can be used to improve parsers in certain settings; ?) built on this work and showed that a slow and accurate parser can be used to “up-train” a faster but less accurate parser.

In this work, we adopt the “tri-training” approach of ?): Two parsers are used to process the unlabeled corpus and only sentences for which both parsers produced the same parse tree are added to the training data. The intuition behind this idea is that the chance of the parse being correct is much higher when the two parsers agree: there is only one way to be correct, while there are many possible incorrect parses. Of course, this reasoning holds only as long as the parsers suffer from different biases.

We show that tri-training is far more effective than vanilla up-training for our neural network model. We use same setup as ?), intersecting the output of the BerkeleyParser [Petrov et al. (2006], and a reimplementation of ZPar [Zhang and Nivre (2011] as our baseline parsers. The two parsers agree only 36% of the time on the tune set, but their accuracy on those sentences is 97.26% UAS, approaching the inter annotator agreement rate. These sentences are of course easier to parse, having an average length of 15 words, compared to 24 words for the tune set overall. However, because we only use these sentences to extract individual transition decisions, the shorter length does not seem to hurt their utility. We generate 10710^{7} tokens worth of new parses and use this data in the backpropagation stage of training.

Experiments

In this section we present our experimental setup and the main results of our work.

We conduct our experiments on two English language benchmarks: (1) the standard Wall Street Journal (WSJ) part of the Penn Treebank [Marcus et al. (1993] and (2) a more comprehensive union of publicly available treebanks spanning multiple domains. For the WSJ experiments, we follow standard practice and use sections 2-21 for training, section 22 for development and section 23 as the final test set. Since there are many hyperparameters in our models, we additionally use section 24 for tuning. We convert the constituency trees to Stanford style dependencies [De Marneffe et al. (2006] using version 3.3.0 of the converter. We use a CRF-based POS tagger to generate 5-fold jack-knifed POS tags on the training set and predicted tags on the dev, test and tune sets; our tagger gets comparable accuracy to the Stanford POS tagger [Toutanova et al. (2003] with 97.44% on the test set. We report unlabeled attachment score (UAS) and labeled attachment score (LAS) excluding punctuation on predicted POS tags, as is standard for English.

For the second set of experiments, we follow the same procedure as above, but with a more diverse dataset for training and evaluation. Following ?), we use (in addition to the WSJ), the OntoNotes corpus version 5 [Hovy et al. (2006], the English Web Treebank [Petrov and McDonald (2012], and the updated and corrected Question Treebank [Judge et al. (2006]. We train on the union of each corpora’s training set and test on each domain separately. We refer to this setup as the “Treebank Union” setup.

In our semi-supervised experiments, we use the corpus from ?) as our source of unlabeled data. We process it with the BerkeleyParser [Petrov et al. (2006], a latent variable constituency parser, and a reimplementation of ZPar [Zhang and Nivre (2011], a transition-based parser with beam search. Both parsers are included as baselines in our evaluation. We select the first 10710^{7} tokens for which the two parsers agree as additional training data. For our tri-training experiments, we re-train the POS tagger using the POS tags assigned on the unlabeled data from the Berkeley constituency parser. This increases POS accuracy slightly to 97.57% on the WSJ.

2 Model Initialization & Hyperparameters

In all cases, we initialized Wi\mathbf{W}_{i} and β\beta randomly using a Gaussian distribution with variance 10410^{-4}. We used fixed initialization with bi=0.2b_{i}=0.2, to ensure that most Relu units are activated during the initial rounds of training. We did not systematically compare this random scheme to others, but we found that it was sufficient for our purposes.

3 Results

Table 1 shows our final results on the WSJ test set, and Table 2 shows the cross-domain results from the Treebank Union. We compare to the best dependency parsers in the literature. For [Chen and Manning (2014] and [Dyer et al. (2015], we use reported results; the other baselines were run by Bernd Bohnet using version 3.3.0 of the Stanford dependencies and our predicted POS tags for all datasets to make comparisons as fair as possible. On the WSJ and Web tasks, our parser outperforms all dependency parsers in our comparison by a substantial margin. The Question (QTB) dataset is more sensitive to the smaller beam size we use in order to train the models in a reasonable time; if we increase to B=32B=32 at inference time only, our perceptron performance goes up to 92.29% LAS.

Since many of the baselines could not be directly compared to our semi-supervised approach, we re-implemented ?) and trained on the tri-training corpus. Although tri-training did help the baseline on the dev set (Figure 4), test set performance did not improve significantly. In contrast, it is quite exciting to see that after tri-training, even our greedy parser is more accurate than any of the baseline dependency parsers and competitive with the BerkeleyParser used to generate the tri-training data. As expected, tri-training helps most dramatically to increase accuracy on the Treebank Union setup with diverse domains, yielding 0.4-1.0% absolute LAS improvement gains for our most accurate model.

Unfortunately we are not able to compare to several semi-supervised dependency parsers that achieve some of the highest reported accuracies on the WSJ, in particular ?), ?) and ?). These parsers use the ?) dependency conversion and the accuracies are therefore not directly comparable. The highest of these is ?), with a reported accuracy of 94.22% UAS. Even though the UAS is not directly comparable, it is typically similar, and this suggests that our model is competitive with some of the highest reported accuries for dependencies on WSJ.

Discussion

In this section, we investigate the contribution of the various components of our approach through ablation studies and other systematic experiments. We tune on Section 24, and use Section 22 for comparisons in order to not pollute the official test set (Section 23). We focus on UAS as we found the LAS scores to be strongly correlated. Unless otherwise specified, we use 200 hidden units in each layer to be able to run more ablative experiments in a reasonable amount of time.

In addition to initialization and hyperparameter tuning, there are several additional choices about model structure and size a practitioner faces when implementing a neural network model. We explore these questions and justify the particular choices we use in the following. Note that we do not use a beam for this analysis and therefore do not train the final perceptron layer. This is done in order to reduce training times and because the trends persist across settings.

Since the learning problem is non-convex, different initializations of the parameters yield different solutions to the learning problem. Thus, for any given experiment, we ran multiple random restarts for every setting of our hyperparameters and picked the model that performed best using the held-out tune set. We found it important to allow the model to stop training early if tune set accuracy decreased.

We visualize the performance of 32 random restarts with one or two hidden layers and with and without pretrained word embeddings in Figure 2, and a summary of the figure in Table 3. While adding a second hidden layer results in a large gain on the tune set, there is no gain on the dev set if pre-trained embeddings are not used. In fact, while the overall UAS scores of the tune set and dev set are strongly correlated (ρ=0.64\rho=0.64, p<1010p<10^{-10}), they are not significantly correlated if pre-trained embeddings are not used (ρ=0.12\rho=0.12, p>0.3p>0.3). This suggests that an additional benefit of pre-trained embeddings, aside from allowing learning to reach a more accurate solution, is to push learning towards a solution that generalizes to more data.

Diminishing returns with increasing embedding dimensions.

Increasing hidden units yields large gains.

2 Impact of Structured Perceptron

We now turn our attention to the importance of structured perceptron training as well as the impact of different latent representations.

To evaluate the impact of structured training, we compare using the estimates P(y)P(y) from the neural network directly for beam search to using the activations from all layers as features in the structured perceptron. Using the probability estimates directly is very similar to ?), where a maximum-entropy model was used to model the distribution over possible actions at each parser state, and beam search was used to search for the highest probability parse. A known problem with beam search in this setting is the label-bias problem. Table 5 shows the impact of using structured perceptron training over using the softmax function during beam search as a function of the beam size used. For reference, our reimplementation of ?) is trained equivalently for each setting. We also show the impact on beam size when tri-training is used. Although the beam does marginally improve accuracy for the softmax model, much greater gains are achieved when perceptron training is used.

Using all hidden layers crucial for structured perceptron.

We also investigated the impact of connecting the final perceptron layer to all prior hidden layers (Table 6). Our results suggest that all intermediate layers of the network are indeed discriminative. Nonetheless, aggregating all of their activations proved to be the most effective representation for the structured perceptron. This suggests that the representations learned by the network collectively contain the information required to reduce the bias of the model, but not when filtered through the softmax layer. Finally, we also experimented with connecting both hidden layers to the softmax layer during backpropagation training, but we found this did not significantly affect the performance of the greedy model.

3 Impact of Tri-Training

To evaluate the impact of the tri-training approach, we compared to up-training with the BerkelyParser [Petrov et al. (2006] alone. The results are summarized in Figure 4 for the greedy and perceptron neural net models as well as our reimplementated ?) baseline.

For our neural network model, training on the output of the BerkeleyParser yields only modest gains, while training on the data where the two parsers agree produces significantly better results. This was especially pronounced for the greedy models: after tri-training, the greedy neural network model surpasses the BerkeleyParser in accuracy. It is also interesting to note that up-training improved results far more than tri-training for the baseline. We speculate that this is due to the a lack of diversity in the tri-training data for this model, since the same baseline model was intersected with the BerkeleyParser to generate the tri-training data.

4 Error Analysis

Regardless of tri-training, using the structured perceptron improved error rates on some of the common and difficult labels: ROOT, ccomp, cc, conj, and nsubj all improved by >>1%. We inspected the learned perceptron weights v\mathbf{v} for the softmax probabilities P(y)P(y) (see Appendix) and found that the perceptron reweights the softmax probabilities based on common confusions; e.g. a strong negative weight for the action RIGHT(ccomp) given the softmax model outputs RIGHT(conj). Note that this trend did not hold when ϕ(x,c)=[P(y)]\phi(x,c)=[P(y)]; without the hidden layer, the perceptron was not able to reweight the softmax probabilities to account for the greedy model’s biases.

Conclusion

We presented a new state of the art in dependency parsing: a transition-based neural network parser trained with the structured perceptron and ASGD. We then combined this approach with unlabeled data and tri-training to further push state-of-the-art in semi-supervised dependency parsing. Nonetheless, our ablative analysis suggests that further gains are possible simply by scaling up our system to even larger representations. In future work, we will apply our method to other languages, explore end-to-end training of the system using structured learning, and scale up the method to larger datasets and network structures.

Acknowledgements

We would like to thank Bernd Bohnet for training his parsers and TurboParser on our setup. This paper benefitted tremendously from discussions with Ryan McDonald, Greg Coppola, Emily Pitler and Fernando Pereira. Finally, we are grateful to all members of the Google Parsing Team.

References