Learning Global Features for Coreference Resolution

Sam Wiseman, Alexander M. Rush, Stuart M. Shieber

Introduction

While structured, non-local coreference models would seem to hold promise for avoiding many common coreference errors (as discussed further in Section 3), the results of employing such models in practice are decidedly mixed, and state-of-the-art results can be obtained using a completely local, mention-ranking system.

In this work, we posit that global context is indeed necessary for further improvements in coreference resolution, but argue that informative cluster, rather than mention, level features are very difficult to devise, limiting their effectiveness. Accordingly, we instead propose to learn representations of mention clusters by embedding them sequentially using a recurrent neural network (shown in Section 4). Our model has no manually defined cluster features, but instead learns a global representation from the individual mentions present in each cluster. We incorporate these representations into a mention-ranking style coreference system.

The entire model, including the recurrent neural network and the mention-ranking sub-system, is trained end-to-end on the coreference task. We train the model as a local classifier with fixed context (that is, as a history-based model). As such, unlike several recent approaches, which may require complicated inference during training, we are able to train our model in much the same way as a vanilla mention-ranking model.

Experiments compare the use of learned global features to several strong baseline systems for coreference resolution. We demonstrate that the learned global representations capture important underlying information that can help resolve difficult pronominal mentions, which remain a persistent source of errors for modern coreference systems [Durrett and Klein (2013, Kummerfeld and Klein (2013, Wiseman et al. (2015, Martschat and Strube (2015]. Our final system improves over 0.8 points in CoNLL score over the current state of the art, and the improvement is statistically significant on all three CoNLL metrics.

Background and Notation

Coreference resolution is fundamentally a clustering task. Given a sequence $(x_{n})_{n=1}^{N}$ of (intra-document) mentions – that is, syntactic units that can refer or be referred to – coreference resolution involves partitioning $(x_{n})$ into a sequence of clusters $(X^{(m)})_{m=1}^{M}$ such that all the mentions in any particular cluster $X^{(m)}$ refer to the same underlying entity. Since the mentions within a particular cluster may be ordered linearly by their appearance in the document,We assume nested mentions are ordered by their syntactic heads. we will use the notation $X_{j}^{(m)}$ to refer to the $j$ ’th mention in the $m$ ’th cluster.

A valid clustering places each mention in exactly one cluster, and so we may represent a clustering with a vector $\boldsymbol{z}\,{\in}\,\{1,\ldots,M\}^{N}$ , where $z_{n}\,{=}\,m$ iff $x_{n}$ is a member of $X^{(m)}$ . Coreference systems attempt to find the best clustering $\boldsymbol{z}^{*}\in\mathcal{Z}$ under some scoring function, with $\mathcal{Z}$ the set of valid clusterings.

One strategy to avoid the computational intractability associated with predicting an entire clustering $\boldsymbol{z}$ is to instead predict a single antecedent for each mention $x_{n}$ ; because $x_{n}$ may not be anaphoric (and therefore have no antecedents), a “dummy” antecedent $\epsilon$ may also be predicted. The aforementioned strategy is adopted by “mention-ranking” systems [Denis and Baldridge (2008, Rahman and Ng (2009, Durrett and Klein (2013], which, formally, predict an antecedent $\hat{y}\,{\in}\,\mathcal{Y}(x_{n})$ for each mention $x_{n}$ , where $\mathcal{Y}(x_{n})\,{=}\,\{1,\ldots,n-1,\epsilon\}$ . Through transitivity, these decisions induce a clustering over the document.

Mention-ranking systems make their antecedent predictions with a local scoring function $f(x_{n},y)$ defined for any mention $x_{n}$ and any antecedent $y\,{\in}\,\mathcal{Y}(x_{n})$ . While such a scoring function clearly ignores much structural information, the mention-ranking approach has been attractive for at least two reasons. First, inference is relatively simple and efficient, requiring only a left-to-right pass through a document’s mentions during which a mention’s antecedents (as well as $\epsilon$ ) are scored and the highest scoring antecedent is predicted. Second, from a linguistic modeling perspective, mention-ranking models learn a scoring function that requires a mention $x_{n}$ to be compatible with only one of its coreferent antecedents. This contrasts with mention-pair models (e.g., ?)), which score all pairs of mentions in a cluster, as well as with certain cluster-based models (see discussion in ?)). Modeling each mention as having a single antecedent is particularly advantageous for pronominal mentions, which we might like to model as linking to a single nominal or proper antecedent, for example, but not necessarily to all other coreferent mentions.

Accordingly, in this paper we attempt to maintain the inferential simplicity and modeling benefits of mention ranking, while allowing the model to utilize global, structural information relating to $\boldsymbol{z}$ in making its predictions. We therefore investigate objective functions of the form

where $g$ is a global function that, in making predictions for $x_{n}$ , may examine (features of) the clustering $\boldsymbol{z}_{1:n-1}$ induced by the antecedent predictions made through $y_{n-1}$ .

The Role of Global Features

Here we motivate the use of global features for coreference resolution by focusing on the issues that may arise when resolving pronominal mentions in a purely local way. See ?) and ?) for more general motivation for using global models.

Recent empirical work has shown that the resolution of pronominal mentions accounts for a substantial percentage of the total errors made by modern mention-ranking systems. ?) show that on the CoNLL 2012 English development set, almost 59% of mention-ranking precision errors and almost 24% of recall errors involve pronominal mentions. ?) found a similar pattern in their comparison of mention-ranking, mention-pair, and latent-tree models.

DA: um and [I]1 think that is what’s - Go ahead [Linda]2.

LW: Well and uh thanks goes to [you]1 and to [the media]3 to help [us]4…So [our]4 hat is off to all of [you]5 as well.

This example is typical of Broadcast Conversation, and it is difficult because local systems learn to myopically link pronouns such as [you]5 to other instances of the same pronoun that are close by, such as [you]1. While this is often a reasonable strategy, in this case predicting [you]1 to be an antecedent of [you]5 would result in the prediction of an incoherent cluster, since [you]1 is coreferent with the singular [I]1, and [you]5, as part of the phrase “all of you,” is evidently plural. Thus, while there is enough information in the text to correctly predict [you]5, doing so crucially depends on having access to the history of predictions made so far, and it is precisely this access to history that local models lack.

More empirically, there are non-local statistical regularities involving pronouns we might hope models could exploit. For instance, in the CoNLL training data over 70% of pleonastic “it” instances and over 74% of pleonastic “you” instances follow (respectively) previous pleonastic “it” and “you” instances. Similarly, over 78% of referential “I” instances and over 68% of referential “he” instances corefer with previous “I” and “he” instances, respectively.

Accordingly, we might expect non-local models with access to global features to perform significantly better. However, models incorporating non-local features have a rather mixed track record. For instance, ?) found that cluster-level features improved their results, whereas ?) found that they did not. ?) found that incorporating cluster-level features beyond those involving the pre-computed mention-pair and mention-ranking probabilities that form the basis of their agglomerative clustering coreference system did not improve performance. Furthermore, among recent, state-of-the-art systems, mention-ranking systems (which are completely local) perform at least as well as their more structured counterparts [Durrett and Klein (2014, Clark and Manning (2015, Wiseman et al. (2015, Peng et al. (2015].

2 Issues with Global Features

We believe a major reason for the relative ineffectiveness of global features in coreference problems is that, as noted by ?), cluster-level features can be hard to define. Specifically, it is difficult to define discrete, fixed-length features on clusters, which can be of variable size (or shape). As a result, global coreference features tend to be either too coarse or too sparse. Thus, early attempts at defining cluster-level features simply applied the coarse quantifier predicates all, none, most to the mention-level features defined on the mentions (or pairs of mentions) in a cluster [Culotta et al. (2007, Rahman and Ng (2011]. For example, a cluster would have the feature ‘most-female=true’ if more than half the mentions (or pairs of mentions) in the cluster have a ‘female=true’ feature.

On the other extreme, ?) define certain cluster-level features by concatenating the mention-level features of a cluster’s constituent mentions in order of the mentions’ appearance in the document. For example, if a cluster consists, in order, of the mentions (the president, he, he), they would define a cluster-level “type” feature ‘C-P-P=true’, which indicates that the cluster is composed, in order, of a common noun, a pronoun, and a pronoun. While very expressive, these concatenated features are often quite sparse, since clusters encountered during training can be of any size.

Learning Global Features

To circumvent the aforementioned issues with defining global features, we propose to learn cluster-level feature representations implicitly, by identifying the state of a (partial) cluster with the hidden state of an RNN that has consumed the sequence of mentions composing the (partial) cluster. Before providing technical details, we provide some preliminary evidence that such learned representations capture important contextual information by displaying in Figure 1 the learned final states of all clusters in the CoNLL development set, projected using T-SNE [van der Maaten and Hinton (2012]. Each point in the visualization represents the learned features for an entity cluster and the head words of mentions are shown for representative points. Note that the model learns to roughly separate clusters by simple distinctions such as predominant type (nominal, proper, pronominal) and number (it, they, etc), but also captures more subtle relationships such as grouping geographic terms and long strings of pronouns.

where $\boldsymbol{\theta}$ is the set of parameters for the model, which are shared over time.

There are several varieties of RNN, but by far the most commonly used in natural-language processing is the Long Short-Term Memory network (LSTM) [Hochreiter and Schmidhuber (1997], particularly for language modeling (e.g., ?)) and machine translation (e.g., ?)), and we use LSTMs in all experiments.

2 RNNs for Cluster Features

Our main contribution will be to utilize RNNs to produce feature representations of entity clusters which will provide the basis of the global term $g$ . Recall that we view a cluster $X^{(m)}$ as a sequence of mentions $(X^{(m)}_{j})_{j=1}^{J}$ (ordered in linear document order). We therefore propose to embed the state(s) of $X^{(m)}$ by running an RNN over the cluster in order.

We will refer to the $j$ ’th hidden state of the RNN corresponding to $X^{(m)}$ as $\boldsymbol{h}^{(m)}_{j}$ , and we obtain it according to the following formula

again assuming that $\boldsymbol{h}^{(m)}_{0}\,{=}\,\boldsymbol{0}$ . Thus, we will effectively run an RNN over each (sequence of mentions corresponding to a) cluster $X^{(m)}$ in the document, and thereby generate a hidden state $\boldsymbol{h}^{(m)}_{j}$ corresponding to each step of each cluster in the document. Concretely, this can be implemented by maintaining $M$ RNNs – one for each cluster – that all share the parameters $\boldsymbol{\theta}$ . The process is illustrated in the top portion of Figure 2.

Coreference with Global Features

We now describe how the RNN defined above is used within an end-to-end coreference system.

Recall that our inference objective is to maximize the score of both a local mention ranking term as well as a global term based on the current clusters:

We begin by defining the local model $f(x_{n},y)$ with the two layer neural network of ?), which has a specialization for the non-anaphoric case, as follows:

We now specify our global scoring function $g$ based on the history of previous decisions. Define $\boldsymbol{h}_{<n}^{(m)}$ as the hidden state of cluster $m$ before a decision is made for $x_{n}$ – that is, $\boldsymbol{h}_{<n}^{(m)}$ is the state of cluster $m$ ’s RNN after it has consumed all mentions in the cluster preceding $x_{n}$ . We define $g$ as

See Figure 2 for a diagram. The intuition behind the first case in $g$ is that in considering whether $y$ is a good antecedent for $x_{n}$ , we add a term to the score that examines how well $x_{n}$ matches with the mentions already in $X^{(z_{y})}$ ; this matching score is expressed via a dot-product.We also experimented with other non-linear functions, but dot-products performed best. In the second case, when predicting that $x_{n}$ is non-anaphoric, we add the NA term to the score, which examines the (sum of) the current states $\boldsymbol{h}^{(m)}_{<n}$ of all clusters. This information is useful both because it allows the non-anaphoric score to incorporate information about potential antecedents, and because the occurrence of certain singleton-clusters often predicts the occurrence of future singleton-clusters, as noted in Section 3.

The whole system is trained end-to-end on coreference using backpropagation. For a given training document, let $\boldsymbol{z}^{(o)}$ be the oracle mapping from mention to cluster, which induces an oracle clustering. While at training time we do have oracle clusters, we do not have oracle antecedents $(y)_{n=1}^{N}$ , so following past work we treat the oracle antecedent as latent [Yu and Joachims (2009, Fernandes et al. (2012, Chang et al. (2013, Durrett and Klein (2013]. We train with the following slack-rescaled, margin objective:

if $x_{n}$ is anaphoric, and is $\epsilon$ otherwise. The term $\Delta(x_{n},\hat{y})$ gives different weight to different error types. We use a $\Delta$ with 3 different weights $(\alpha_{1},\alpha_{2},\alpha_{3})$ for “false link” (fl), “false new” (fn), and “wrong link” (wl) mistakes [Durrett and Klein (2013], which correspond to predicting an antecedent when non-anaphoric, $\epsilon$ when anaphoric, and the wrong antecedent, respectively.

Note that in training we use the oracle clusters $\boldsymbol{z}^{(o)}$ . Since these are known a priori, we can pre-compute all the hidden states $\boldsymbol{h}_{j}^{(m)}$ in a document, which makes training quite simple and efficient. This approach contrasts in particular with the work of ?) — who also incorporate global information in mention-ranking — in that they train against latent trees, which are not annotated and must be searched for during training. On the other hand, training on oracle clusters leads to a mismatch between training and test, which can hurt performance.

2 Search

When moving from a strictly local objective to one with global features, the test-time search problem becomes intractable. The local objective requires $O(n^{2})$ time, whereas the full clustering problem is NP-Hard. Past work with global features has used integer linear programming solvers for exact search [Chang et al. (2013, Peng et al. (2015], or beam search with (delayed) early update training for an approximate solution [Björkelund and Kuhn (2014]. In contrast, we simply use greedy search at test time, which also requires $O(n^{2})$ time.While beam search is a natural way to decrease search error at test time, it may fail to help if training involves a local margin objective (as in our case), since scores need not be calibrated across local decisions. We accordingly attempted to train various locally normalized versions of our model, but found that they underperformed. We also experimented with training approaches and model variants that expose the model to its own predictions [Daumé III et al. (2009, Ross et al. (2011, Bengio et al. (2015], but found that these yielded a negligible performance improvement. The full algorithm is shown in Algorithm 1. The greedy search algorithm is identical to a simple mention-ranking system, with the exception of line 11, which updates the current RNN representation based on the previous decision that was made, and line 4, which then uses this cluster representation as part of scoring.

Experiments

We use the raw Basic+ feature sets described by ?), with the following modifications:

For training, we use document-size minibatches, which allows for efficient pre-computation of RNN states, and we minimize the loss described in Section 5 with AdaGrad [Duchi et al. (2011] (after clipping LSTM gradients to lie (elementwise) in $(-10,10)$ ). We find that the initial learning rate chosen for AdaGrad has a significant impact on results, and we choose learning rates for each layer out of $\{0.1,0.02,0.01,0.002,0.001\}$ .

2 Results

In Table 1 we present our main results on the CoNLL English test set, and compare with other recent state-of-the-art systems. We see a statistically significant improvement of over 0.8 CoNLL points over the previous state of the art, and the highest F1 scores to date on all three CoNLL metrics.

We now consider in more detail the impact of global features and RNNs on performance. For these experiments, we report MUC, B3, and CEAFe F1-scores in Table 2 as well as errors broken down by mention type and by whether the mention is anaphoric or not in Table 3. Table 3 further partitions errors into fl, fn, and wl categories, which are defined in Section 5.1. We typically think of fl and wl as representing precision errors, and fn as representing recall errors.

In Table 3 we see that the RNN improves performance overall, with the most dramatic improvements on non-anaphoric pronouns, though errors are also decreased significantly for non-anaphoric nominal and proper mentions that follow at least one mention with the same head. While wl errors also decrease for both these mention-categories under the RNN model, fn errors increase. Importantly, the RNN performance is significantly better than that of the Avg baseline, which barely improves over mention-ranking, even with oracle history. This suggests that modeling the sequence of mentions in a cluster is advantageous. We also note that while RNN performance degrades in both precision and recall when moving from the oracle history upper-bound to a greedy setting, we are still able to recover a significant portion of the possible performance improvement.

3 Qualitative Analysis

In this section we consider in detail the impact of the $g$ term in the RNN scoring function on the two error categories that improve most under the RNN model (as shown in Table 3), namely, pronominal wl errors and pronominal fl errors. We consider an example from the CoNLL development set in each category on which the baseline MR model makes an error but the greedy RNN model does not.

The example in Figure 3 involves the resolution of the ambiguous pronoun “his,” which is bracketed and in bold in the figure. Whereas the baseline MR model incorrectly predicts “his” to corefer with the closest gender-consistent antecedent “Justin” — thus making a wl error — the greedy RNN model correctly predicts “his” to corefer with “Mr. Kaye” in the previous sentence. (Note that “the official” also refers to Mr. Kaye). To get a sense of the greedy RNN model’s decision-making on this example, we color the mentions the greedy RNN model has predicted to corefer with “Mr. Kaye” in green, and the mentions it has predicted to corefer with “Justin” in blue. (Note that the model incorrectly predicts the initial “I” mentions to corefer with “Justin.”) Letting $X^{(1)}$ refer to the blue cluster, $X^{(2)}$ refer to the green cluster, and $x_{n}$ refer to the ambiguous mention “his,” we further shade each mention $x_{j}$ in $X^{(1)}$ so that its intensity corresponds to $\boldsymbol{h}_{c}(x_{n})^{\mathsf{T}}\boldsymbol{h}_{<k}^{(1)}$ , where $k\,{=}\,j+1$ ; mentions in $X^{(2)}$ are shaded analogously. Thus, the shading shows how highly $g$ scores the compatibility between “his” and a cluster $X^{(i)}$ as each of $X^{(i)}$ ’s mentions is added. We see that when the initial “Justin” mentions are added to $X^{(1)}$ the $g$ -score is relatively high. However, after “The company” is correctly predicted to corefer with “Justin,” the score of $X^{(1)}$ drops, since companies are generally not coreferent with pronouns like “his.”

We see that the potential antecedent “S-Bahn” has a large gradient, but also that the initial, obviously pleonastic use of “it’s” has a large gradient, which may suggest that earlier, easier predictions of pleonasm can inform subsequent predictions.

Related Work

In addition to the related work noted throughout, we add supplementary references here. Unstructured approaches to coreference typically divide into mention-pair models, which classify (nearly) every pair of mentions in a document as coreferent or not [Soon et al. (2001, Ng and Cardie (2002, Bengtson and Roth (2008], and mention-ranking models, which select a single antecedent for each anaphoric mention [Denis and Baldridge (2008, Rahman and Ng (2009, Durrett and Klein (2013, Chang et al. (2013, Wiseman et al. (2015]. Structured approaches typically divide between those that induce a clustering of mentions [McCallum and Wellner (2003, Culotta et al. (2007, Poon and Domingos (2008, Haghighi and Klein (2010, Stoyanov and Eisner (2012, Cai and Strube (2010], and, more recently, those that learn a latent tree of mentions [Fernandes et al. (2012, Björkelund and Kuhn (2014, Martschat and Strube (2015].

There have also been structured approaches that merge the mention-ranking and mention-pair ideas in some way. For instance, ?) rank clusters rather than mentions; ?) use the output of both mention-ranking and mention pair systems to learn a clustering.

The application of RNNs to modeling (the trajectory of) the state of a cluster is apparently novel, though it bears some similarity to the recent work of ?), who use LSTMs to embed the state of a transition based parser’s stack.

Conclusion

We have presented a simple, state of the art approach to incorporating global information in an end-to-end coreference system, which obviates the need to define global features, and moreover allows for simple (greedy) inference. Future work will examine improving recall, and more sophisticated approaches to global training.

Acknowledgments

We gratefully acknowledge the support of a Google Research Award.