Jointly Predicting Predicates and Arguments in Neural Semantic Role Labeling

Luheng He, Kenton Lee, Omer Levy, Luke Zettlemoyer

Introduction

Semantic role labeling (SRL) captures predicate-argument relations, such as “who did what to whom.” Recent high-performing SRL models He et al. (2017); Marcheggiani et al. (2017); Tan et al. (2018) are BIO-taggers, labeling argument spans for a single predicate at a time (as shown in Figure 1). They are typically only evaluated with gold predicates, and must be pipelined with error-prone predicate identification models for deployment.

We propose an end-to-end approach for predicting all the predicates and their argument spans in one forward pass. Our model builds on a recent coreference resolution model Lee et al. (2017), by making central use of learned, contextualized span representations. We use these representations to predict SRL graphs directly over text spans. Each edge is identified by independently predicting which role, if any, holds between every possible pair of text spans, while using aggressive beam pruning for efficiency. The final graph is simply the union of predicted SRL roles (edges) and their associated text spans (nodes).

Our span-graph formulation overcomes a key limitation of semi-Markov and BIO-based models Kong et al. (2016); Zhou and Xu (2015); Yang and Mitchell (2017); He et al. (2017); Tan et al. (2018): it can model overlapping spans across different predicates in the same output structure (see Figure 1). The span representations also generalize the token-level representations in BIO-based models, letting the model dynamically decide which spans and roles to include, without using previously standard syntactic features Punyakanok et al. (2008); FitzGerald et al. (2015).

To the best of our knowledge, this is the first span-based SRL model that does not assume that predicates are given. In this more realistic setting, where the predicate must be predicted, our model achieves state-of-the-art performance on PropBank. It also reinforces the strong performance of similar span embedding methods for coreference Lee et al. (2017), suggesting that this style of models could be used for other span-span relation tasks, such as syntactic parsing Stern et al. (2017), relation extraction Miwa and Bansal (2016), and QA-SRL FitzGerald et al. (2018).

Model

We consider the space of possible predicates to be all the tokens in the input sentence, and the space of arguments to be all contiguous spans. Our model decides what relation exists between each predicate-argument pair (including no relation).

Formally, given a sequence $X=w_{1},\dots,w_{n}$ , we wish to predict a set of labeled predicate-argument relations $Y\subseteq\mathcal{P}\times\mathcal{A}\times\mathcal{L}$ , where $\mathcal{P}=\{w_{1},\ldots,w_{n}\}$ is the set of all tokens (predicates), $\mathcal{A}=\{(w_{i},\dots,w_{j})\mid 1\leq i\leq j\leq n\}$ contains all the spans (arguments), and $\mathcal{L}$ is the space of semantic role labels, including a null label $\epsilon$ indicating no relation. The final SRL output would be all the non-empty relations $\{(p,a,l)\in Y\mid l\neq\epsilon\}$ .

We then define a set of random variables, where each random variable $y_{p,a}$ corresponds to a predicate $p\in\mathcal{P}$ and an argument $a\in\mathcal{A}$ , taking a value from the discrete label space $\mathcal{L}$ . The random variables $y_{p,a}$ are conditionally independent of each other given the input $X$ :
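The factorization described here can be written out as follows. This is a reconstruction from the surrounding definitions, assuming a softmax normalization over the label space $\mathcal{L}$ (consistent with the logistic-regression remark on the null label); $l'$ is a summation index introduced only for illustration.

```latex
P(Y \mid X) = \prod_{p\in\mathcal{P},\, a\in\mathcal{A}} P(y_{p,a} \mid X),
\qquad
P(y_{p,a}=l \mid X) = \frac{\exp\big(\phi(p,a,l)\big)}{\sum_{l'\in\mathcal{L}} \exp\big(\phi(p,a,l')\big)}
```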

Where $\phi(p,a,l)$ is a scoring function for a possible (predicate, argument, label) combination. $\phi$ is decomposed into two unary scores on the predicate and the argument (defined in Section 3), as well as a label-specific score for the relation:
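Using the factor names that appear in the beam pruning discussion ( $\Phi_{\text{a}}$ , $\Phi_{\text{p}}$ , $\Phi_{\text{rel}}^{(l)}$ ), a decomposition consistent with this description is the following. The argument order of the relation factor is an assumption.

```latex
\phi(p,a,l) = \Phi_{\text{a}}(a) + \Phi_{\text{p}}(p) + \Phi_{\text{rel}}^{(l)}(a,p)
```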

The score for the null label is set to a constant: $\phi(p,a,\epsilon)=0$ , similar to logistic regression.
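To make the logistic-regression analogy concrete, here is a minimal Python sketch (not the authors' implementation) of the label distribution for a single predicate-argument pair, assuming a softmax over labels with the null score pinned to 0. The names `NULL` and `label_distribution` are hypothetical.

```python
import math
from typing import Dict

NULL = "<null>"  # hypothetical name for the null label (epsilon in the text)


def label_distribution(scores: Dict[str, float]) -> Dict[str, float]:
    """Softmax over labels for one (predicate, argument) pair.

    `scores` maps each non-null label l to phi(p, a, l). The null label's
    score is fixed at 0, so it must not be supplied by the caller.
    """
    if NULL in scores:
        raise ValueError("the null score is fixed at 0 and is not an input")
    all_scores = dict(scores)
    all_scores[NULL] = 0.0
    # Subtract the max for numerical stability; it cancels in the ratio.
    m = max(all_scores.values())
    exps = {label: math.exp(s - m) for label, s in all_scores.items()}
    z = sum(exps.values())
    return {label: e / z for label, e in exps.items()}
```

With a single non-null label of score $s$, the probability of that label is $1/(1+e^{-s})$, which is exactly the logistic sigmoid; with more labels the same construction gives multinomial logistic regression with the null label as the reference class.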

For each input $X$ , we minimize the negative log likelihood of the gold structure $Y^{*}$ :
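Because the variables $y_{p,a}$ are conditionally independent given $X$ , the objective decomposes into one term per predicate-argument pair. A form consistent with the definitions above is sketched below; $\mathcal{J}(X)$ and $y^{*}_{p,a}$ (the gold label for the pair, $\epsilon$ if there is no relation) are notation assumed here.

```latex
\mathcal{J}(X) = -\log P(Y^{*} \mid X)
= \sum_{p\in\mathcal{P},\, a\in\mathcal{A}} -\log P\big(y_{p,a}=y^{*}_{p,a} \mid X\big)
```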

Beam pruning

As our model deals with $O(n^{2})$ possible argument spans and $O(n)$ possible predicates, it needs to consider $O(n^{3}|\mathcal{L}|)$ possible relations, which is computationally impractical. To overcome this issue, we define two beams $B_{\text{a}}$ and $B_{\text{p}}$ for storing the candidate arguments and predicates, respectively. The candidates in each beam are ranked by their unary score ( $\Phi_{\text{a}}$ or $\Phi_{\text{p}}$ ). The sizes of the beams are limited by $\lambda_{\text{a}}n$ and $\lambda_{\text{p}}n$ . Elements that fall out of the beam do not participate in computing the edge factors $\Phi_{\text{rel}}^{(l)}$ , reducing the overall number of relational factors evaluated by the model to $O(n^{2}|\mathcal{L}|)$ . We also limit the maximum width of spans to a fixed number $W$ (e.g. $W=30$ ), further reducing the number of computed unary factors to $O(n)$ .
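The pruning procedure can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' implementation: the function names are hypothetical, the unary scorers are passed in as plain callables, and rounding the beam size down with a minimum of one candidate is an assumption. The default values for the beam ratios and span width are those reported in the appendix.

```python
from typing import Callable, List, Tuple

Span = Tuple[int, int]  # inclusive (start, end) token indices, 0-based


def enumerate_spans(n: int, max_width: int) -> List[Span]:
    """All contiguous spans of at most `max_width` tokens: O(n * max_width)."""
    return [(i, j) for i in range(n) for j in range(i, min(n, i + max_width))]


def prune(candidates: list, score: Callable, ratio: float, n: int) -> list:
    """Keep the top floor(ratio * n) candidates by unary score (at least one)."""
    k = max(1, int(ratio * n))
    return sorted(candidates, key=score, reverse=True)[:k]


def candidate_pairs(n, arg_score, pred_score,
                    lambda_a=0.8, lambda_p=0.4, max_width=30):
    """Return the (predicate, span) pairs that survive both beams."""
    arg_beam = prune(enumerate_spans(n, max_width), arg_score, lambda_a, n)
    pred_beam = prune(list(range(n)), pred_score, lambda_p, n)
    # Only these pairs would receive label-specific relation scores.
    return [(p, a) for p in pred_beam for a in arg_beam]
```

At most $\lambda_{\text{a}}n\cdot\lambda_{\text{p}}n$ pairs survive, so scoring every label for every surviving pair costs $O(n^{2}|\mathcal{L}|)$ rather than $O(n^{3}|\mathcal{L}|)$ .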

Neural Architecture

Our model builds contextualized representations for argument spans $a$ and predicate words $p$ based on BiLSTM outputs (Figure 2) and uses feed-forward networks to compute the factor scores in $\phi(p,a,l)$ described in Section 2 (Figure 3).

The bottom layer consists of pre-trained word embeddings concatenated with character-based representations, i.e., for each token $w_{i}$ , we have $\mathbf{x}_{i}=[\textsc{WordEmb}(w_{i});\textsc{CharCNN}(w_{i})]$ . We then contextualize each $\mathbf{x}_{i}$ using an $m$ -layered bidirectional LSTM with highway connections Zhang et al. (2016), and denote the resulting output as $\mathbf{\bar{x}}_{i}$ .

Argument and predicate representation

We build contextualized representations for all candidate arguments $a\in\mathcal{A}$ and predicates $p\in\mathcal{P}$ . The argument representation contains the following: end points from the BiLSTM outputs ( $\mathbf{\bar{x}}_{\textsc{Start}(a)},\mathbf{\bar{x}}_{\textsc{End}(a)}$ ), a soft head word $\mathbf{x}_{\text{h}}(a)$ , and embedded span width features $\mathbf{f}(a)$ , similar to Lee et al. (2017). The predicate representation is simply the BiLSTM output at the position $\textsc{Index}(p)$ .
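In symbols, writing $\mathbf{g}_{a}$ and $\mathbf{g}_{p}$ for the argument and predicate representations (subscripted forms of the $\mathbf{g}$ used in the scoring section; the concatenation order is an assumption), this description corresponds to:

```latex
\mathbf{g}_{a} = [\mathbf{\bar{x}}_{\textsc{Start}(a)};\ \mathbf{\bar{x}}_{\textsc{End}(a)};\ \mathbf{x}_{\text{h}}(a);\ \mathbf{f}(a)],
\qquad
\mathbf{g}_{p} = \mathbf{\bar{x}}_{\textsc{Index}(p)}
```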

The soft head representation $\mathbf{x}_{\text{h}}(a)$ is an attention mechanism over word inputs $\mathbf{x}$ in the argument span, where the weights $\mathbf{e}(a)$ are computed via a linear layer over the BiLSTM outputs $\mathbf{\bar{x}}$ .
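A form consistent with this description is given below. It assumes a softmax normalization of the attention weights and a learned weight vector written here as $\mathbf{w}_{\text{e}}$ ; neither is named explicitly in the text.

```latex
\mathbf{e}(a) = \textsc{Softmax}\big(\mathbf{w}_{\text{e}}^{\top}\, \mathbf{\bar{x}}_{\textsc{Start}(a):\textsc{End}(a)}\big),
\qquad
\mathbf{x}_{\text{h}}(a) = \mathbf{x}_{\textsc{Start}(a):\textsc{End}(a)}\, \mathbf{e}(a)^{\top}
```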

$\mathbf{x}_{\textsc{Start}(a):\textsc{End}(a)}$ is a shorthand for stacking a list of vectors $\mathbf{x}_{t}$ , where $\textsc{Start}(a)\leq t\leq\textsc{End}(a)$ .

Scoring

The scoring functions $\Phi$ are implemented with feed-forward networks based on the predicate and argument representations $\mathbf{g}$ :
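One parameterization consistent with this description and with the MLPs listed in the appendix is sketched below. The names $\mathbf{w}_{\text{a}}$ , $\mathbf{w}_{\text{p}}$ , $\mathbf{w}_{\text{r}}^{(l)}$ and $\textsc{MLP}_{\text{a}}$ , $\textsc{MLP}_{\text{p}}$ , $\textsc{MLP}_{\text{r}}$ are assumed notation for learned output weights and feed-forward networks, and $\mathbf{g}_{a}$ , $\mathbf{g}_{p}$ denote the argument and predicate representations.

```latex
\Phi_{\text{a}}(a) = \mathbf{w}_{\text{a}}^{\top}\, \textsc{MLP}_{\text{a}}(\mathbf{g}_{a}), \qquad
\Phi_{\text{p}}(p) = \mathbf{w}_{\text{p}}^{\top}\, \textsc{MLP}_{\text{p}}(\mathbf{g}_{p}), \qquad
\Phi_{\text{rel}}^{(l)}(a,p) = \mathbf{w}_{\text{r}}^{(l)\top}\, \textsc{MLP}_{\text{r}}([\mathbf{g}_{a};\mathbf{g}_{p}])
```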

Experiments

We experiment on the CoNLL 2005 Carreras and Màrquez (2005) and CoNLL 2012 (OntoNotes 5.0, Pradhan et al. (2013)) benchmarks, using two SRL setups: end-to-end and gold predicates. In the end-to-end setup, a system takes a tokenized sentence as input, and predicts all the predicates and their arguments. Systems are evaluated on the micro-averaged F1 for correctly predicting (predicate, argument span, label) tuples. For comparison with previous systems, we also report results with gold predicates, in which the complete set of predicates in the input sentence is given as well. Other experimental setups and hyperparameters are listed in Appendix A.1.
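As a rough illustration of the end-to-end metric, the following Python sketch computes micro-averaged F1 over exact-match (predicate, argument span, label) tuples. It is not the official CoNLL scorer, which has its own conventions; the tuple layout and the name `micro_f1` are assumptions made for this example.

```python
from typing import Iterable, Set, Tuple

# (predicate index, span start, span end, role label)
Relation = Tuple[int, int, int, str]


def micro_f1(gold: Iterable[Set[Relation]], pred: Iterable[Set[Relation]]) -> float:
    """Micro-averaged F1: pool exact-match counts over all sentences first."""
    correct = n_gold = n_pred = 0
    for g, p in zip(gold, pred):
        correct += len(g & p)  # tuples that match on predicate, span, and label
        n_gold += len(g)
        n_pred += len(p)
    if correct == 0:
        return 0.0
    precision = correct / n_pred
    recall = correct / n_gold
    return 2 * precision * recall / (precision + recall)
```

A tuple counts as correct only when the predicate, both span boundaries, and the label all match, so a missed or spurious predicate directly costs recall or precision in the end-to-end setting.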

To further improve performance, we also add ELMo word representations Peters et al. (2018) to the BiLSTM input (in the +ELMo rows). Since the contextualized representations ELMo provides can be applied to most previous neural systems, the improvement is orthogonal to our contribution. In Tables 1 and 2, we organize all the results into two categories: the comparable single model systems, and the models augmented with ELMo or ensembling (in the PoE rows).

End-to-end results

As shown in Table 1, our joint model outperforms the previous best pipeline system He et al. (2017) by an F1 difference of anywhere between 1.3 and 6.0 in every setting. (For the end-to-end setting on CoNLL 2012, we used a subset of the train/dev data from previous work due to noise in the dataset, so the dev result is not directly comparable; see Appendix A.2 for a detailed explanation.) The improvement is larger on the Brown test set, which is out-of-domain, and the CoNLL 2012 test set, which contains nominal predicates. On all datasets, our model is able to predict over 40% of the sentences completely correctly.

Results with gold predicates

To compare with additional previous systems, we also conduct experiments with gold predicates by constraining our predicate beam to be gold predicates only. As shown in Table 2, our model significantly outperforms He et al. (2017), but falls short of Tan et al. (2018), a very recent attention-based Vaswani et al. (2017) BIO-tagging model that was developed concurrently with our work. By adding the contextualized ELMo representations, we are able to outperform all previous systems, including Peters et al. (2018), which applies ELMo to the SRL model introduced in He et al. (2017).

Analysis

Our model’s architecture differs significantly from previous BIO systems in terms of both input and decision space. To better understand our model’s strengths and weaknesses, we perform four analyses following Lee et al. (2017) and He et al. (2017), studying (1) the effectiveness of beam pruning, (2) the ability to capture long-range dependencies, (3) agreement with syntactic spans, and (4) the ability to predict globally consistent SRL structures. The analyses are performed on the development sets without using ELMo embeddings. For comparability with prior work, analyses (2)-(4) are performed on the CoNLL 2005 dev set with gold predicates.

Figure 4 shows the predicate and argument spans kept in the beam, sorted by their unary scores. Our model efficiently prunes unlikely argument spans and predicates, significantly reducing the number of edges it needs to consider. Figure 5 shows the recall of predicate words on the CoNLL 2012 development set. By retaining $\lambda_{\text{p}}=0.4$ predicates per word, we are able to keep over 99.7% of argument-bearing predicates. Compared to having a part-of-speech tagger (POS:X in Figure 5), our joint beam pruning allows the model to have a soft trade-off between efficiency and recall. (The predicate ID accuracy of our model is not comparable with that reported in He et al. (2017), since our model does not predict non-argument-bearing predicates.)

Long-distance dependencies

Figure 6 shows the performance breakdown by binned distance between arguments and the given predicates. Our model is better at accurately predicting arguments that are farther away from the predicates, even compared to an ensemble model He et al. (2017) that has a higher overall F1. This is very likely due to architectural differences; in a BIO tagger, predicate information passes through many LSTM timesteps before reaching a long-distance argument, whereas our architecture enables direct connections between all predicate-argument pairs.

Agreement with syntax

As mentioned in He et al. (2017), their BIO-based SRL system has good agreement with gold syntactic span boundaries (94.3%) but falls short of previous syntax-based systems Punyakanok et al. (2004). By directly modeling span information, our model achieves comparable syntactic agreement (95.0%) to Punyakanok et al. (2004) without explicitly modeling syntax.

Global consistency

On the other hand, our model suffers from global consistency issues. For example, on the CoNLL 2005 test set, our model has lower complete-predicate accuracy (62.6%) than the BIO systems He et al. (2017); Tan et al. (2018) (64.3%-66.4%). Table 3 shows its violations of global structural constraints compared to previous systems. (Punyakanok et al. (2008) described a list of global constraints for SRL systems, e.g., there can be at most one core argument of each type for each predicate.) Our model makes more constraint violations than previous systems. For example, it predicts duplicate core arguments, i.e., arguments with labels ARG0, ARG1, …, ARG5 and AA (shown in the U column in Table 3), more often than previous work. This is because our model uses independent classifiers to label each predicate-argument pair, making it difficult for them to implicitly track the decisions made for several arguments of the same predicate.

The Ours+decode row in Table 3 shows SRL performance after enforcing the U-constraint using dynamic programming Täckström et al. (2015) at decoding time. Constrained decoding at test time is effective at eliminating all the core-role inconsistencies (shown in the U-column), but does not bring a significant gain on the end result (shown in SRL F1), which only evaluates the piece-wise predicate-argument structures.

Conclusion and Future Work

We proposed a new SRL model that is able to jointly predict all predicates and argument spans, generalized from a recent coreference system Lee et al. (2017). Compared to previous BIO systems, our new model supports joint predicate identification and is able to incorporate span-level features. Empirically, the model does better at long-range dependencies and agreement with syntactic boundaries, but is weaker at global consistency, due to our strong independence assumption.

In the future, we could incorporate higher-order inference methods Lee et al. (2018) to relax this assumption. It would also be interesting to combine our span-based architecture with the self-attention layers Tan et al. (2018); Strubell et al. (2018) for more effective contextualization.

Acknowledgments

This research was supported in part by the ARO (W911NF-16-1-0121), the NSF (IIS-1252835, IIS-1562364), a gift from Tencent, and an Allen Distinguished Investigator Award. We thank Eunsol Choi, Dipanjan Das, Nicholas Fitzgerald, Ariel Holtzman, Julian Michael, Noah Smith, Swabha Swayamdipta, and our anonymous reviewers for helpful feedback.

References

Appendix A Supplemental Material

The word embeddings are fixed 300-dimensional GloVe embeddings Pennington et al. (2014) (context window size of 2 for head word embeddings, and window size of 10 for LSTM inputs), normalized to be unit vectors. Out-of-vocabulary words are represented by a vector of zeros. In the character CNN, characters are represented as learned 8-dimensional embeddings. The convolutions have window sizes of 3, 4, and 5 characters, each consisting of 50 filters.

Network sizes

We use 3 stacked bidirectional LSTMs with highway connections and 200-dimensional hidden states. Each MLP consists of two hidden layers with 150 dimensions and rectified linear units Nair and Hinton (2010).

Inference

We model spans up to length 30. We use $\lambda_{\text{a}}=0.8$ for pruning arguments, $\lambda_{\text{p}}=0.4$ for pruning predicates. At decoding time, we use dynamic programming (a simplified version of Täckström et al. (2015)) to predict a set of non-overlapping arguments for each predicate. (This is mainly a constraint enforced by the official CoNLL evaluation script.)
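The exact simplification of Täckström et al. (2015) is not spelled out here. As one plausible reading, the sketch below selects, for a single predicate, the highest-scoring set of mutually non-overlapping argument spans with a standard weighted interval scheduling dynamic program. The function name, the tuple layout, and the assumption that each candidate carries a positive score are all illustrative.

```python
from bisect import bisect_right
from typing import List, Tuple

Arg = Tuple[int, int, str, float]  # (start, end, label, score), inclusive ends


def decode_non_overlapping(args: List[Arg]) -> List[Arg]:
    """Pick the highest-scoring subset of mutually non-overlapping spans."""
    args = sorted(args, key=lambda a: a[1])  # order by end position
    ends = [a[1] for a in args]
    best = [0.0] * (len(args) + 1)  # best[i]: optimum over the first i spans
    take = [False] * len(args)
    prev = [0] * len(args)
    for i, (start, _end, _label, score) in enumerate(args):
        # Number of earlier spans that end strictly before this one starts.
        prev[i] = bisect_right(ends, start - 1, 0, i)
        with_i = best[prev[i]] + score
        take[i] = with_i > best[i]
        best[i + 1] = max(best[i], with_i)
    # Backtrack to recover the chosen spans.
    chosen, i = [], len(args)
    while i > 0:
        if take[i - 1]:
            chosen.append(args[i - 1])
            i = prev[i - 1]
        else:
            i -= 1
    return chosen[::-1]
```

Sorting dominates the cost, so decoding one predicate takes $O(k\log k)$ time for $k$ candidate arguments.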

Training

We use Adam Kingma and Ba (2015) with initial learning rate $0.001$ and decay rate of 0.1% every 100 steps. The LSTM weights are initialized with random orthonormal matrices Saxe et al. (2014). We apply 0.5 dropout to the word embeddings and character CNN outputs and 0.2 dropout to all hidden layers and feature embeddings. In the LSTMs, we use variational dropout masks that are shared across timesteps Gal and Ghahramani (2016), with $0.4$ dropout rate.
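The learning rate schedule can be read as exponential decay by a factor of 0.999 every 100 steps. The sketch below follows that reading; whether the decay is applied in discrete steps or continuously is not stated, so the `staircase` switch and the function name are assumptions.

```python
def learning_rate(step: int, initial: float = 0.001, decay: float = 0.001,
                  decay_steps: int = 100, staircase: bool = True) -> float:
    """Exponential decay: scale the rate by (1 - decay) every `decay_steps`."""
    exponent = step // decay_steps if staircase else step / decay_steps
    return initial * (1.0 - decay) ** exponent
```

Under this reading the rate falls to roughly 4% of its initial value by step 320,000, the maximum number of training steps.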

Batching

At training time, we randomly shuffle all the documents and then batch at sentence level. Each batch contains at most $40$ sentences and $700$ words. All models are trained for at most 320,000 steps with early stopping on the development set, which takes less than 48 hours on a single Titan X GPU.
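A minimal sketch of this batching scheme is shown below, assuming documents are lists of tokenized sentences and that sentences are packed greedily in order after the document shuffle. The greedy packing, the fixed seed, and the name `make_batches` are assumptions; the text only specifies the two limits.

```python
import random
from typing import List, Sequence

Sentence = List[str]


def make_batches(documents: Sequence[List[Sentence]], max_sentences: int = 40,
                 max_words: int = 700, seed: int = 0) -> List[List[Sentence]]:
    """Shuffle documents, then greedily pack sentences under both limits."""
    docs = list(documents)
    random.Random(seed).shuffle(docs)
    sentences = [sent for doc in docs for sent in doc]
    batches, current, n_words = [], [], 0
    for sent in sentences:
        full = len(current) >= max_sentences or n_words + len(sent) > max_words
        if current and full:
            batches.append(current)
            current, n_words = [], 0
        # A single sentence longer than max_words still gets its own batch.
        current.append(sent)
        n_words += len(sent)
    if current:
        batches.append(current)
    return batches
```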

A.2 OntoNotes Data Statistics

Table 4 shows the data statistics on various splits of OntoNotes. We found that some sentences in the OntoNotes 5.0 train/dev split have missing predicates, which makes them unsuitable for training end-to-end SRL systems. Therefore, our end-to-end SRL models are trained on the smaller but cleaner CoNLL 2012 splits. For experiments with gold predicates, we use the full OntoNotes 5.0 train/dev split and the CoNLL 2012 test set, following previous work.