Matching the Blanks: Distributional Similarity for Relation Learning

Livio Baldini Soares, Nicholas FitzGerald, Jeffrey Ling, Tom Kwiatkowski

Introduction

Reading text to identify and extract relations between entities has been a long-standing goal in natural language processing Cardie (1997). Typically, efforts in relation extraction fall into one of three groups. In a first group, supervised Kambhatla (2004); GuoDong et al. (2005); Zeng et al. (2014), or distantly supervised relation extractors Mintz et al. (2009) learn a mapping from text to relations in a limited schema. Forming a second group, open information extraction removes the limitations of a predefined schema by instead representing relations using their surface forms Banko et al. (2007); Fader et al. (2011); Stanovsky et al. (2018), which increases scope but also leads to an associated lack of generality since many surface forms can express the same relation. Finally, the universal schema Riedel et al. (2013) embraces both the diversity of text and the concise nature of schematic relations to build a joint representation that has been extended to arbitrary textual input Toutanova et al. (2015) and arbitrary entity pairs Verga and McCallum (2016). However, like distantly supervised relation extractors, the universal schema relies on large knowledge graphs (typically Freebase Bollacker et al. (2008)) that can be aligned to text.

Building on Lin and Pantel (2001)’s extension of Harris’ distributional hypothesis Harris (1954) to relations, as well as recent advances in learning word representations from observations of their contexts Mikolov et al. (2013); Peters et al. (2018); Devlin et al. (2018), we propose a new method of learning relation representations directly from text. First, we study the ability of the Transformer neural network architecture Vaswani et al. (2017) to encode relations between entity pairs, and we identify a method of representation that outperforms previous work in supervised relation extraction. Then, we present a method of training this relation representation without any supervision from a knowledge graph or human annotators by matching the blanks.

Following Riedel et al. (2013), we assume access to a corpus of text in which entities have been linked to unique identifiers and we define a relation statement to be a block of text containing two marked entities. From this, we create training data that contains relation statements in which the entities have been replaced with a special [blank] symbol, as illustrated in Figure 1. Our training procedure takes in pairs of blank-containing relation statements, and has an objective that encourages relation representations to be similar if they range over the same pairs of entities. After training, we apply the learned relation representations to the recently released FewRel task Han et al. (2018), in which specific relations, such as ‘original language of work’, are represented with a few exemplars, such as The Crowd (Italian: La Folla) is a 1951 Italian film. Han et al. (2018) presented FewRel as a supervised dataset, intended to evaluate models’ ability to adapt to relations from new domains at test time. We show that through training by matching the blanks, we can outperform Han et al. (2018)’s top performance on FewRel, without having seen any of the FewRel training data. We also show that a model pre-trained by matching the blanks and tuned on FewRel outperforms humans on the FewRel evaluation. Similarly, by training by matching the blanks and then tuning on labeled data, we significantly improve performance on the SemEval 2010 Task 8 Hendrickx et al. (2009), KBP-37 Zhang and Wang (2015), and TACRED Zhang et al. (2017) relation extraction benchmarks.

Overview

In this paper, we focus on learning mappings from relation statements to relation representations. Formally, let $\mathbf{x}=[x_{0}\dots x_{n}]$ be a sequence of tokens, where $x_{0}=[\textsc{cls}]$ and $x_{n}=[\textsc{sep}]$ are special start and end markers. Let $\mathbf{s}_{1}=(i,j)$ and $\mathbf{s}_{2}=(k,l)$ be pairs of integers such that $0<i<j-1$ , $j<k$ , $k\leq l-1$ , and $l\leq n$ . A relation statement is a triple $\mathbf{r}=(\mathbf{x},\mathbf{s}_{1},\mathbf{s}_{2})$ , where the indices in $\mathbf{s}_{1}$ and $\mathbf{s}_{2}$ delimit entity mentions in $\mathbf{x}$ : the sequence $[x_{i}\dots x_{j-1}]$ mentions an entity, and so does the sequence $[x_{k}\dots x_{l-1}]$ . Our goal is to learn a function $\mathbf{h}_{r}=f_{\theta}(\mathbf{r})$ that maps the relation statement to a fixed-length vector $\mathbf{h}_{r}\in\mathcal{R}^{d}$ that represents the relation expressed in $\mathbf{x}$ between the entities marked by $\mathbf{s}_{1}$ and $\mathbf{s}_{2}$ .
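As a concrete illustration of the definition above, the following minimal sketch stores a relation statement as a token list plus two spans with exclusive end indices. The class name `RelationStatement`, the method `mentions`, and the example sentence split are hypothetical choices made for illustration. The validity check is deliberately a little looser than the stated constraints: it also accepts single-token and adjacent mentions.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class RelationStatement:
    """Token sequence x with entity spans s1 = (i, j) and s2 = (k, l)."""

    tokens: List[str]
    s1: Tuple[int, int]
    s2: Tuple[int, int]

    def __post_init__(self) -> None:
        n = len(self.tokens) - 1  # index of the closing [SEP] token
        (i, j), (k, l) = self.s1, self.s2
        # Span ends are exclusive: the mentions are x[i:j] and x[k:l].
        # Assumption: a relaxed form of the constraints in the text.
        if not (0 < i < j <= k < l <= n):
            raise ValueError("spans must be non-empty, ordered, and inside x")

    def mentions(self) -> Tuple[List[str], List[str]]:
        """Return the word pieces of the two entity mentions."""
        (i, j), (k, l) = self.s1, self.s2
        return self.tokens[i:j], self.tokens[k:l]


# Hypothetical example based on the FewRel exemplar quoted in the introduction.
example = RelationStatement(
    tokens=["[CLS]", "The", "Crowd", "is", "a", "1951", "Italian", "film", ".", "[SEP]"],
    s1=(1, 3),
    s2=(6, 7),
)
print(example.mentions())
```

An encoder $f_{\theta}$ would consume such an object and return a fixed-length vector; the encoder itself is not sketched here.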

This paper contains two main contributions. First, in Section 3.1 we investigate different architectures for the relation encoder $f_{\theta}$ , all built on top of the widely used Transformer sequence model Devlin et al. (2018); Vaswani et al. (2017). We evaluate each of these architectures by applying them to a suite of relation extraction benchmarks with supervised training.

Our second, more significant, contribution—presented in Section 4—is to show that $f_{\theta}$ can be learned from widely available distant supervision in the form of entity linked text.

Architectures for Relation Learning

The primary goal of this work is to develop models that produce relation representations directly from text. Given the strong performance of recent deep transformers trained on variants of language modeling, we adopt Devlin et al. (2018)’s bert model as the basis for our work. In this section, we explore different methods of representing relations with the Transformer model.

We evaluate the different methods of representation on a suite of supervised relation extraction benchmarks. The relation extraction tasks we use can be broadly categorized into two types: fully supervised relation extraction, and few-shot relation matching.

For the supervised tasks, the goal is to, given a relation statement $\mathbf{r}$ , predict a relation type $t\in\mathcal{T}$ where $\mathcal{T}$ is a fixed dictionary of relation types and $t=0$ typically denotes a lack of relation between the entities in the relation statement. For this type of task we evaluate on SemEval 2010 Task 8 Hendrickx et al. (2009), KBP-37 Zhang and Wang (2015) and TACRED Zhang et al. (2017).

In the case of few-shot relation matching, a set of candidate relation statements are ranked, and matched, according to a query relation statement. In this task, examples in the test and development sets typically contain relation types not present in the training set. For this type of task, we evaluate on the FewRel Han et al. (2018) dataset. Specifically, we are given $K$ sets of $N$ labeled relation statements $\mathcal{S}_{k}=\{(\mathbf{r}_{0},t_{0})\dots(\mathbf{r}_{N},t_{N})\}$ where $t_{i}\in\{1\dots K\}$ is the corresponding relation type. The goal is to predict the $t_{q}\in\{1\dots K\}$ for a query relation statement $\mathbf{r}_{q}$ .

Relation Representations from Deep Transformers Model

In all experiments in this section, we start with the bert ${}_{\textsc{large}}$ model made available by Devlin et al. (2018) and train towards task-specific losses. Since bert has not previously been applied to the problem of relation representation, we aim to answer two primary modeling questions: (1) how do we represent entities of interest in the input to bert, and (2) how do we extract a fixed length representation of a relation from bert’s output. We present three options for both the input encoding, and the output relation representation. Six combinations of these are illustrated in Figure 3.

Recall, from Section 2, that the relation statement $\mathbf{r}=(\mathbf{x},\mathbf{s}_{1},\mathbf{s}_{2})$ contains the sequence of tokens $\mathbf{x}$ and the entity span identifiers $\mathbf{s}_{1}$ and $\mathbf{s}_{2}$ . We present three different options for getting information about the focus spans $\mathbf{s}_{1}$ and $\mathbf{s}_{2}$ into our bert encoder.

First we experiment with a bert model that does not have access to any explicit identification of the entity spans $\mathbf{s}_{1}$ and $\mathbf{s}_{2}$ . We refer to this choice as the standard input. This is an important reference point, since we believe that bert has the ability to identify entities in $\mathbf{x}$ , but with the standard input there is no way of knowing which two entities are in focus when $\mathbf{x}$ contains more than two entity mentions.

For each of the tokens in its input, bert also adds a segmentation embedding, primarily used to add sentence segmentation information to the model. To address the standard representation’s lack of explicit entity identification, we introduce two new segmentation embeddings, one that is added to all tokens in the span $\mathbf{s}_{1}$ , while the other is added to all tokens in the span $\mathbf{s}_{2}$ . This approach is analogous to previous work where positional embeddings have been applied to relation extraction Zhang et al. (2017); Bilan and Roth (2018).

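Finally, we augment $\mathbf{x}$ with four reserved word pieces to mark the beginning and end of each entity mention in the relation statement. We introduce the $[E1_{start}]$ , $[E1_{end}]$ , $[E2_{start}]$ and $[E2_{end}]$ markers and modify $\mathbf{x}$ to give

$$\tilde{\mathbf{x}}=[x_{0}\dots[E1_{start}]\;x_{i}\dots x_{j-1}\;[E1_{end}]\dots[E2_{start}]\;x_{k}\dots x_{l-1}\;[E2_{end}]\dots x_{n}]$$

We feed $\tilde{\mathbf{x}}$ to bert in place of $\mathbf{x}$ and refer to this input as entity markers.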
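The marker insertion can be sketched in a few lines of Python. This is an illustrative sketch rather than the authors' implementation: the function name `add_entity_markers` and the literal marker strings are assumptions, and spans are taken to have exclusive end indices with the first mention preceding the second.

```python
from typing import List, Tuple

Span = Tuple[int, int]


def add_entity_markers(tokens: List[str], s1: Span, s2: Span) -> Tuple[List[str], Span, Span]:
    """Wrap both entity mentions in reserved marker tokens.

    Assumes s1 = (i, j) precedes s2 = (k, l) and that span ends are exclusive.
    Returns the augmented sequence and the shifted entity spans.
    """
    (i, j), (k, l) = s1, s2
    marked = (
        tokens[:i] + ["[E1start]"] + tokens[i:j] + ["[E1end]"]
        + tokens[j:k] + ["[E2start]"] + tokens[k:l] + ["[E2end]"]
        + tokens[l:]
    )
    # One marker now precedes the first mention, three precede the second.
    return marked, (i + 1, j + 1), (k + 3, l + 3)


tokens = ["[CLS]", "The", "Crowd", "is", "a", "1951", "Italian", "film", ".", "[SEP]"]
marked, new_s1, new_s2 = add_entity_markers(tokens, (1, 3), (6, 7))
print(marked)
print(new_s1, new_s2)
```

With this indexing, the two start markers sit at positions $i$ and $k+2$ of the augmented sequence, which is what a start-marker based output representation needs.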

Fixed length relation representation

Recall from Section 2 that each $\mathbf{x}$ starts with a reserved [cls] token. bert’s output state that corresponds to this token is used by Devlin et al. (2018) as a fixed length sentence representation. We adopt the [cls] output, $\mathbf{h}_{0}$ , as our first relation representation.

We obtain $\mathbf{h}_{r}$ by max-pooling the final hidden layers corresponding to the word pieces in each entity mention, to get two vectors $\mathbf{h}_{e_{1}}=\textsc{maxpool}([\mathbf{h}_{i}...\mathbf{h}_{j-1}])$ and $\mathbf{h}_{e_{2}}=\textsc{maxpool}([\mathbf{h}_{k}...\mathbf{h}_{l-1}])$ representing the two entity mentions. We concatenate these two vectors to get the single representation $\mathbf{h}_{r}=\langle\mathbf{h}_{e_{1}}|\mathbf{h}_{e_{2}}\rangle$ where $\langle a|b\rangle$ is the concatenation of $a$ and $b$ . We refer to this architecture as mention pooling.
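Mention pooling can be illustrated with plain Python lists standing in for the encoder's final hidden states. The helper names `max_pool` and `mention_pooling` and the toy hidden states are hypothetical, and spans are again assumed to have exclusive end indices.

```python
from typing import List, Sequence, Tuple

Vector = List[float]


def max_pool(vectors: Sequence[Vector]) -> Vector:
    """Element-wise maximum over a non-empty sequence of equal-length vectors."""
    return [max(column) for column in zip(*vectors)]


def mention_pooling(hidden: Sequence[Vector], s1: Tuple[int, int], s2: Tuple[int, int]) -> Vector:
    """Concatenate the max-pooled hidden states of the two entity mentions."""
    (i, j), (k, l) = s1, s2
    h_e1 = max_pool(hidden[i:j])
    h_e2 = max_pool(hidden[k:l])
    return h_e1 + h_e2  # list concatenation, i.e. <h_e1 | h_e2>


# Toy hidden states: one 2-dimensional vector per token.
hidden = [[0.0, 0.0], [1.0, -1.0], [0.5, 2.0], [3.0, 0.0], [-1.0, 4.0], [0.0, 0.0]]
h_r = mention_pooling(hidden, (1, 3), (3, 5))
print(h_r)  # [1.0, 2.0, 3.0, 4.0]
```

The resulting vector has twice the hidden size, since it concatenates one pooled vector per mention.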

Finally, we propose simply representing the relation between two entities with the concatenation of the final hidden states corresponding to their respective start tokens, when entity markers are used. Recalling that entity markers insert tokens in $\mathbf{x}$ , creating offsets in $\mathbf{s}_{1}$ and $\mathbf{s}_{2}$ , our representation of the relation is $\mathbf{h}_{r}=\langle\mathbf{h}_{i}|\mathbf{h}_{k+2}\rangle$ , where positions $i$ and $k+2$ hold $[E1_{start}]$ and $[E2_{start}]$ in the augmented sequence. We refer to this output representation as entity start output. Note that this can only be applied to the entity markers input.
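A minimal sketch of the entity start output follows. To stay independent of any particular index arithmetic, it locates the start markers by searching the augmented token sequence. The function name `entity_start_output`, the marker strings, and the toy hidden states are assumptions made for illustration.

```python
from typing import List, Sequence

Vector = List[float]


def entity_start_output(marked_tokens: Sequence[str], hidden: Sequence[Vector]) -> Vector:
    """Concatenate the hidden states at the two entity start markers."""
    e1_start = list(marked_tokens).index("[E1start]")
    e2_start = list(marked_tokens).index("[E2start]")
    return hidden[e1_start] + hidden[e2_start]


marked_tokens = ["[CLS]", "[E1start]", "The", "Crowd", "[E1end]", "is",
                 "[E2start]", "Italian", "[E2end]", "[SEP]"]
# Toy hidden states: position t maps to the vector [t, -t].
toy_hidden = [[float(t), float(-t)] for t in range(len(marked_tokens))]
start_repr = entity_start_output(marked_tokens, toy_hidden)
print(start_repr)  # [1.0, -1.0, 6.0, -6.0]
```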

Figure 3 illustrates a few of the variants we evaluated in this section. In addition to defining the model input and output architecture, we fix the training loss used to train the models (which is illustrated in Figure 2). In all models, the output representation from the Transformer network is fed into a fully connected layer that either (1) contains a linear activation, or (2) performs layer normalization Ba et al. (2016) on the representation. We treat the choice of post Transformer layer as a hyper-parameter and use the best performing layer type for each task.

For the supervised tasks, we introduce a new classification layer $W\in\mathcal{R}^{K\times H}$ where $H$ is the size of the relation representation and $K$ is the number of relation types. The classification loss is the standard cross entropy of the softmax of $\mathbf{h}_{r}W^{\top}$ with respect to the true relation type.
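The supervised loss can be written out directly. The sketch below uses plain Python lists for the relation representation and the $K\times H$ weight matrix; the function names and toy values are hypothetical, and no bias term is assumed because the text does not mention one.

```python
import math
from typing import List, Sequence


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def classification_loss(h_r: Sequence[float], W: Sequence[Sequence[float]], true_type: int) -> float:
    """Cross entropy of softmax(h_r W^T) against the true relation type."""
    logits = [sum(w * h for w, h in zip(row, h_r)) for row in W]  # one logit per relation type
    return -math.log(softmax(logits)[true_type])


h_r = [1.0, 0.0]
W_uniform = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]   # K = 3 types, H = 2
W_informed = [[3.0, 0.0], [0.0, 0.0], [0.0, 0.0]]  # favours type 0 for this h_r
print(classification_loss(h_r, W_uniform, 0))   # log(3): all types equally likely
print(classification_loss(h_r, W_informed, 0))  # smaller loss
```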

For the few-shot task, we use the dot product between the relation representation of the query statement and that of each candidate statement as a similarity score. In this case, we also apply a cross entropy loss of the softmax of similarity scores with respect to the true class.
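The few-shot scoring rule admits an equally short sketch. Candidate and query representations are plain lists here, and the names `similarity_scores`, `predict`, and `few_shot_loss` are illustrative assumptions; the sketch also assumes one candidate representation per class, as in a 1-shot setting.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def similarity_scores(query: Vector, candidates: Sequence[Vector]) -> List[float]:
    """Dot product between the query representation and each candidate."""
    return [sum(q * c for q, c in zip(query, cand)) for cand in candidates]


def predict(query: Vector, candidates: Sequence[Vector]) -> int:
    """Index of the candidate with the highest similarity score."""
    scores = similarity_scores(query, candidates)
    return max(range(len(scores)), key=scores.__getitem__)


def few_shot_loss(query: Vector, candidates: Sequence[Vector], true_index: int) -> float:
    """Cross entropy of the softmax over similarity scores."""
    scores = similarity_scores(query, candidates)
    top = max(scores)
    log_norm = top + math.log(sum(math.exp(s - top) for s in scores))
    return log_norm - scores[true_index]


query = [1.0, 0.0]
candidates = [[0.0, 1.0], [2.0, 0.0], [-1.0, 0.0]]
print(similarity_scores(query, candidates))  # [0.0, 2.0, -1.0]
print(predict(query, candidates))            # 1
```

The same ranking step can be used at test time without any task-specific training, since it needs only the encoder's outputs.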

We perform task-specific fine-tuning of the bert model, for all variants, with the following set of hyper-parameters:

Transformer Architecture: 24 layers, 1024 hidden size, 16 heads

Weight Initialization: BERT ${}_{\textsc{LARGE}}$

Post Transformer Layer: Dense with linear activation (KBP-37 and TACRED), or Layer Normalization layer (SemEval 2010 and FewRel).

Learning Rate (supervised): 3e-5 with Adam

Table 1 shows the results of model variants on the three supervised relation extraction tasks and the 5-way-1-shot variant of the few-shot relation classification task. For all four tasks, the model using the entity markers input representation and entity start output representation achieves the best scores.

From the results, it is clear that adding positional information in the input is critical for the model to learn useful relation representations. Unlike previous work that has benefited from positional embeddings Zhang et al. (2017); Bilan and Roth (2018), the deep Transformer benefits the most from seeing the new entity boundary word pieces (entity markers). It is also worth noting that the best variant outperforms previously published models on all four tasks. For the remainder of the paper, we will use this architecture when further training and evaluating our models.

Learning by Matching the Blanks

So far, we have used human labeled training data to train our relation statement encoder $f_{\theta}$ . Inspired by open information extraction Banko et al. (2007); Angeli et al. (2015), which derives relations directly from tagged text, we now introduce a new method of training $f_{\theta}$ without a predefined ontology, or relation-labeled training data. Instead, we declare that for any pair of relation statements $\mathbf{r}$ and $\mathbf{r}^{\prime}$ , the inner product $f_{\theta}(\mathbf{r})^{\top}f_{\theta}(\mathbf{r}^{\prime})$ should be high if the two relation statements, $\mathbf{r}$ and $\mathbf{r}^{\prime}$ , express semantically similar relations. And, this inner product should be low if the two relation statements express semantically different relations.

Unlike related work in distant supervision for information extraction Hoffmann et al. (2011); Mintz et al. (2009), we do not use relation labels at training time. Instead, we observe that there is a high degree of redundancy in web text, and each relation between an arbitrary pair of entities is likely to be stated multiple times. Consequently, $\mathbf{r}=(\mathbf{x},\mathbf{s}_{1},\mathbf{s}_{2})$ is more likely to encode the same semantic relation as $\mathbf{r}^{\prime}=(\mathbf{x}^{\prime},\mathbf{s}^{\prime}_{1},\mathbf{s}^{\prime}_{2})$ if $\mathbf{s}_{1}$ refers to the same entity as $\mathbf{s}^{\prime}_{1}$ , and $\mathbf{s}_{2}$ refers to the same entity as $\mathbf{s}^{\prime}_{2}$ . Starting with this observation, we introduce a new method of learning $f_{\theta}$ from entity linked text, which we call matching the blanks (MTB). In Section 5 we show that MTB learns relation representations that can be used without any further tuning for relation extraction, even beating previous work that trained on human labeled data.

Let $\mathcal{E}$ be a predefined set of entities. And let $\mathcal{D}=[(\mathbf{r}^{0},e_{1}^{0},e_{2}^{0})\dots(\mathbf{r}^{N},e_{1}^{N},e_{2}^{N})]$ be a corpus of relation statements that have been labeled with two entities $e^{i}_{1}\in\mathcal{E}$ and $e^{i}_{2}\in\mathcal{E}$ . Recall, from Section 2, that $\mathbf{r}^{i}=(\mathbf{x}^{i},\mathbf{s}_{1}^{i},\mathbf{s}_{2}^{i})$ , where $\mathbf{s}^{i}_{1}$ and $\mathbf{s}^{i}_{2}$ delimit entity mentions in $\mathbf{x}^{i}$ . Each item in $\mathcal{D}$ is created by pairing the relation statement $\mathbf{r}^{i}$ with the two entities $e_{1}^{i}$ and $e_{2}^{i}$ corresponding to the spans $\mathbf{s}_{1}^{i}$ and $\mathbf{s}_{2}^{i}$ , respectively.

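We aim to learn a relation statement encoder $f_{\theta}$ that we can use to determine whether or not two relation statements encode the same relation. To do this, we define the following binary classifier

$$p(l=1|\mathbf{r},\mathbf{r}^{\prime})=\frac{1}{1+\exp\left(-f_{\theta}(\mathbf{r})^{\top}f_{\theta}(\mathbf{r}^{\prime})\right)}$$

to assign a probability to the case that $\mathbf{r}$ and $\mathbf{r}^{\prime}$ encode the same relation $(l=1)$ , or not $(l=0)$ . We will then learn the parameterization of $f_{\theta}$ that minimizes the loss

$$\mathcal{L}(\mathcal{D})=-\frac{1}{|\mathcal{D}|^{2}}\sum_{(\mathbf{r},e_{1},e_{2})\in\mathcal{D}}\;\sum_{(\mathbf{r}^{\prime},e^{\prime}_{1},e^{\prime}_{2})\in\mathcal{D}}\Big[\delta_{e_{1},e^{\prime}_{1}}\delta_{e_{2},e^{\prime}_{2}}\log p(l=1|\mathbf{r},\mathbf{r}^{\prime})+(1-\delta_{e_{1},e^{\prime}_{1}}\delta_{e_{2},e^{\prime}_{2}})\log(1-p(l=1|\mathbf{r},\mathbf{r}^{\prime}))\Big]\qquad(1)$$

where $\delta_{e,e^{\prime}}$ is the Kronecker delta that takes the value $1$ iff $e=e^{\prime}$ , and $0$ otherwise.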
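To make the objective concrete, the sketch below computes a pairwise matching loss over a small list of already-encoded relation statements. It assumes a logistic classifier over the inner product of the two representations and a binary cross entropy averaged over all ordered pairs, including each statement paired with itself. The names `mtb_loss`, `log_p_same`, and the toy encodings are hypothetical, and the encoder $f_{\theta}$ is replaced by fixed vectors.

```python
import math
from typing import List, Sequence, Tuple

Encoded = Tuple[Sequence[float], str, str]  # (f_theta(r), e1, e2)


def softplus(z: float) -> float:
    """Numerically stable log(1 + exp(z))."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def log_p_same(h: Sequence[float], h_prime: Sequence[float]) -> float:
    """log p(l=1 | r, r') for a sigmoid over the inner product."""
    z = sum(a * b for a, b in zip(h, h_prime))
    return -softplus(-z)


def log_p_diff(h: Sequence[float], h_prime: Sequence[float]) -> float:
    """log (1 - p(l=1 | r, r'))."""
    z = sum(a * b for a, b in zip(h, h_prime))
    return -softplus(z)


def mtb_loss(corpus: List[Encoded]) -> float:
    """Average binary cross entropy over all ordered pairs of statements."""
    total = 0.0
    for h, e1, e2 in corpus:
        for h_prime, f1, f2 in corpus:
            same_pair = (e1 == f1) and (e2 == f2)  # product of Kronecker deltas
            total += log_p_same(h, h_prime) if same_pair else log_p_diff(h, h_prime)
    return -total / (len(corpus) ** 2)


uninformed = [([0.0, 0.0], "a", "b"), ([0.0, 0.0], "a", "b"), ([0.0, 0.0], "c", "d")]
aligned = [([2.0, 0.0], "a", "b"), ([2.0, 0.0], "a", "b"), ([-2.0, 0.0], "c", "d")]
print(mtb_loss(uninformed))  # log(2): every pair gets probability 0.5
print(mtb_loss(aligned))     # much lower
```

In practice the double sum is too large to evaluate exhaustively, which is why negatives are sampled during training.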

Introducing Blanks

Readers may have noticed that the loss in Equation 1 can be minimized perfectly by the entity linking system used to create $\mathcal{D}$ . And, since this linking system does not have any notion of relations, it is not reasonable to assume that $f_{\theta}$ will somehow magically build meaningful relation representations. To avoid simply relearning the entity linking system, we introduce a modified corpus in which entity mentions are probabilistically replaced with a special [blank] symbol.

Matching the Blanks Training

To train a model with the matching the blanks task, we construct a training setup similar to BERT, where two losses are used concurrently: the masked language model loss and the matching the blanks loss. For generating the training corpus, we use English Wikipedia and extract text passages from the HTML paragraph blocks, ignoring lists and tables. We use an off-the-shelf entity linking system (the public Google Cloud Natural Language API, from which we extract the “entity analysis” results: https://cloud.google.com/natural-language/docs/basics#entity_analysis) to annotate text spans with a unique knowledge base identifier (e.g., Freebase ID or Wikipedia URL). The span annotations include not only proper names, but other referential entities such as common nouns and pronouns. From this annotated corpus we extract relation statements where each statement contains at least two grounded entities within a fixed sized window of tokens (we use a window of 40 tokens, which we observed provides some coverage of long range entity relations, while avoiding a large number of co-occurring but unrelated entities). To prevent a large bias towards relation statements that involve popular entities, we limit the number of relation statements that contain the same entity by randomly sampling a constant number of relation statements that contain any given entity.

Instead of summing over all pairs of relation statements that do not contain the same pair of entities, we sample a set of negatives that are either randomly sampled uniformly from the set of all relation statement pairs, or are sampled from the set of relation statements that share just a single entity. We include the second set of ‘hard’ negatives to account for the fact that most randomly sampled relation statement pairs are very unlikely to be even remotely topically related, and we would like to ensure that the training procedure sees pairs of relation statements that refer to similar, but different, relations. Finally, we probabilistically replace each entity’s mention with [blank] symbols, with a probability of $\alpha=0.7$ , as described in Section 3.2, to ensure that the model is not confounded by the absence of [blank] symbols in the evaluation tasks. In total, we generate 600 million relation statement pairs from English Wikipedia, roughly split between 50% positive and 50% strong negative pairs.
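The blanking step can be sketched as follows. This is an illustration under stated assumptions rather than the authors' pipeline: each mention is replaced independently with probability `alpha`, a whole mention collapses to a single `[BLANK]` token, spans have exclusive end indices, and the function name `blank_entities` is hypothetical.

```python
import random
from typing import List, Tuple

Span = Tuple[int, int]


def blank_entities(
    tokens: List[str], s1: Span, s2: Span, alpha: float = 0.7, rng: random.Random = None
) -> Tuple[List[str], Span, Span]:
    """Replace each entity mention with [BLANK] with probability alpha.

    Assumes s1 precedes s2. Returns the new tokens and the adjusted spans.
    """
    rng = rng or random.Random()
    (i, j), (k, l) = s1, s2
    out = list(tokens)
    new_s1, new_s2 = (i, j), (k, l)
    # Edit the later span first so the earlier indices stay valid.
    if rng.random() < alpha:
        out[k:l] = ["[BLANK]"]
        new_s2 = (k, k + 1)
    if rng.random() < alpha:
        out[i:j] = ["[BLANK]"]
        new_s1 = (i, i + 1)
        shift = (j - i) - 1  # tokens removed before the second span
        new_s2 = (new_s2[0] - shift, new_s2[1] - shift)
    return out, new_s1, new_s2


tokens = ["[CLS]", "The", "Crowd", "is", "a", "1951", "Italian", "film", ".", "[SEP]"]
always = blank_entities(tokens, (1, 3), (6, 7), alpha=1.0, rng=random.Random(0))
never = blank_entities(tokens, (1, 3), (6, 7), alpha=0.0, rng=random.Random(0))
print(always[0])
print(never[0])
```

Setting `alpha` strictly between 0 and 1 leaves some mentions visible, so the encoder also sees inputs without blanks, as it will at evaluation time.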

Experimental Evaluation

In this section, we evaluate the impact of training by matching the blanks. We start with the best bert based model from Section 3.3, which we call bert ${}_{\textsc{EM}}$ , and we compare this to a variant that is trained with the matching the blanks task (bert ${}_{\textsc{EM}}$ +mtb). We train the bert ${}_{\textsc{EM}}$ +mtb model by initializing the Transformer weights to the weights from bert ${}_{\textsc{LARGE}}$ and use the following parameters:

We report results on all of the tasks from Section 3.1, using the same task-specific training methodology for both bert ${}_{\textsc{EM}}$ and bert ${}_{\textsc{EM}}$ +mtb.

First, we investigate the ability of bert ${}_{\textsc{EM}}$ +mtb to solve the FewRel task without any task-specific training data. Since FewRel is an exemplar-based approach, we can just rank each candidate relation statement according to its representation’s inner product with the exemplars’ representations.

Figure 4 shows that the task agnostic bert ${}_{\textsc{EM}}$ and bert ${}_{\textsc{EM}}$ +mtb models outperform the previously published state of the art on the FewRel task even when they have not seen any FewRel training data. For bert ${}_{\textsc{EM}}$ +mtb, the increase over Han et al. (2018)’s supervised approach is very significant: 8.8% on the 5-way-1-shot task and 12.7% on the 10-way-1-shot task. bert ${}_{\textsc{EM}}$ +mtb also significantly outperforms bert ${}_{\textsc{EM}}$ in this unsupervised setting, which is to be expected since there is no relation-specific loss during bert ${}_{\textsc{EM}}$ ’s training.

To investigate the impact of supervision on bert ${}_{\textsc{EM}}$ and bert ${}_{\textsc{EM}}$ +mtb, we introduce increasing amounts of FewRel’s training data. Figure 4 shows the increase in performance as we either increase the number of training examples for each relation type, or we increase the number of relation types in the training data. When given access to all of the training data, bert ${}_{\textsc{EM}}$ approaches bert ${}_{\textsc{EM}}$ +mtb’s performance. However, when we keep all relation types during training, and vary the number of examples per type, bert ${}_{\textsc{EM}}$ +mtb only needs 6% of the training data to match the performance of a bert ${}_{\textsc{EM}}$ model trained on all of the training data. We observe that maintaining a diversity of relation types, and reducing the number of examples per type, is the most effective way to reduce annotation effort for this task. The results in Figure 4 show that MTB training could be used to significantly reduce effort in implementing an exemplar based relation extraction system.

Finally, we report bert ${}_{\textsc{EM}}$ +mtb’s performance on all of FewRel’s fully supervised tasks in Table 3. We see that it outperforms the human upper bound reported by Han et al. (2018), and it significantly outperforms all other submissions to the FewRel leaderboard, published or unpublished.

Supervised Relation Extraction

Table 4 contains results for our classifiers tuned on supervised relation extraction data. As was established in Section 3.2, our bert ${}_{\textsc{EM}}$ based classifiers outperform previously published results for these three tasks. The additional MTB based training further increases F1 scores for all tasks.

We also analyzed the performance of our two models while reducing the amount of supervised task specific tuning data. The results displayed in Table 5 show the development set performance when tuning on a random subset of the task specific training data. For all tasks, we see that MTB based training is even more effective for low-resource cases, where there is a larger gap in performance between our bert ${}_{\textsc{EM}}$ and bert ${}_{\textsc{EM}}$ +mtb based classifiers. This further supports our argument that training by matching the blanks can significantly reduce the amount of human input required to create relation extractors, and populate a knowledge base.

Conclusion and Future Work

In this paper we study the problem of producing useful relation representations directly from text. We describe a novel training setup, which we call matching the blanks, that relies solely on entity resolution annotations. When coupled with a new architecture for fine-tuning relation representations in BERT, our model achieves state-of-the-art results on three relation extraction tasks, and outperforms human accuracy on few-shot relation matching. In addition, we show that the new model is particularly effective in low-resource regimes, and we argue that it could significantly reduce the amount of human effort required to create relation extractors.

In future work, we plan to work on relation discovery by clustering relation statements that have similar representations according to bert ${}_{\textsc{EM}}$ +mtb. This would take us some of the way toward our goal of truly general purpose relation identification and extraction. We will also study representations of relations and entities that can be used to store relation triples in a distributed knowledge base. This is inspired by recent work in knowledge base embedding Bordes et al. (2013); Nickel et al. (2016).