Lifted Rule Injection for Relation Embeddings

Thomas Demeester, Tim Rocktäschel, Sebastian Riedel

Introduction

Current successful methods for automated knowledge base construction tasks heavily rely on learned distributed vector representations [Nickel et al., 2012, Riedel et al., 2013, Socher et al., 2013, Chang et al., 2014, Neelakantan et al., 2015, Toutanova et al., 2015, Nickel et al., 2015, Verga et al., 2016, Verga and McCallum, 2016]. Although these models are able to learn robust representations from large amounts of data, they often lack commonsense knowledge. Such knowledge is rarely explicitly stated in texts but can be found in resources like PPDB [Ganitkevitch et al., 2013] or WordNet [Miller, 1995].

Combining neural methods with symbolic commonsense knowledge, for instance in the form of implication rules, is in the focus of current research [Rocktäschel et al., 2014, Wang et al., 2014, Bowman et al., 2015, Wang et al., 2015, Vendrov et al., 2016, Hu et al., 2016, Rocktäschel and Riedel, 2016, Cohen, 2016]. A recent approach [Rocktäschel et al., 2015] regularizes entity-tuple and relation embeddings via first-order logic rules. To this end, every first-order rule is propositionalized based on observed entity-tuples, and a differentiable loss term is added for every propositional rule. This approach does not scale beyond only a few entity-tuples and rules. For example, propositionalizing the rule $\forall x:\verb~isMan~(x)\Rightarrow\verb~isMortal~(x)$ would result in a very large number of loss terms on a large database.

In this paper, we present a method to incorporate simple rules while maintaining the computational efficiency of only modeling training facts. This is achieved by minimizing an upper bound of the loss that encourages the implication between relations to hold, entirely independent from the number of entity pairs. It only involves representations of the relations that are mentioned in rules, as well as a general rule-independent constraint on the entity-tuple embedding space. In the example given above, if we require that every component of the vector representation of sMan } s smaller than the corresponding component of relation sMortal }, then we can show that the rule holds for any \emph{non-negatve representation of an entity-tuple. Hence our method avoids the need for separate loss terms for every ground atom resulting from propositionalizing rules. In statistical relational learning this type of approach is often referred to as lifted inference or learning [Poole, 2003, Braz, 2007] because it deals with groups of random variables at a first-order level. In this sense our approach is a lifted form of rule injection. This allows for imposing large numbers of rules while learning distributed representations of relations and entity-tuples. Besides drastically lower computation time, an important advantage of our method over ?) is that when these constraints are satisfied, the injected rules always hold, even for unseen but inferred facts. While the method presented here only deals with implications and not general first-order rules, it does not rely on the assumption of independence between relations, and is hence more generally applicable.

Our contributions are fourfold: (i) we develop a very efficient way of regularizing relation representations to incorporate first-order logic implications (§3), (ii) we reveal that, against expectation, mapping entity-tuple embeddings to non-negative space does not hurt but instead improves the generalization ability of our model (§5.1) (iii) we show improvements on a knowledge base completion task by injecting mined commonsense rules from WordNet (§5.3), and finally (iv) we give a qualitative analysis of the results, demonstrating that implication constraints are indeed satisfied in an asymmetric way and result in a substantially increased structuring of the relation embedding space (§5.6).

Background

In this section we revisit the matrix factorization relation extraction model by ?) and introduce the notation used throughout the paper. We choose the matrix factorization model for its simplicity as the base on which we develop implication injection.

Model F by ?) measures the compatibility between a relation $r$ and an entity-tuple $t$ using the dot product $\bm{r}^{\top}\bm{t}$ of their respective vector representations. During training, the representations are learned such that valid facts receive high scores, whereas negative ones receive low scores. Typically no negative evidence is available at training time, and therefore a Bayesian Personalized Ranking (BPR) objective [Rendle et al., 2009] is used. Given a pair of facts $f_{p}:=\langle r_{p},t_{p}\rangle\not\in\mathcal{O}$ and $f_{q}:=\langle r_{q},t_{q}\rangle\in\mathcal{O}$ , this objective requires that

where $\alpha$ is the regularization strength.

Lifted Injection of Implications

In this section, we show how an implication

can be imposed independently of the entity-tuples. For simplicity, we abbreviate such implications as $r_{p}\Rightarrow r_{q}$ (e.g., $\text{\tt{professorAt}}\Rightarrow\text{\tt{employeeAt}}$ ).

The implication rule can be imposed by requiring that every tuple $t\in\mathcal{T}$ is at least as compatible with relation $r_{p}$ as with $r_{q}$ . Written in terms of the latent representations, eq. (4) therefore becomes

2 Lifted Loss Formulation

The implication $r_{p}\Rightarrow r_{q}$ can thus be imposed by minimizing the lifted loss $\mathcal{L}_{I}^{U}$ . Note that by minimizing $\mathcal{L}_{I}^{U}$ , the model is encouraged to satisfy the constraint $\bm{r_{p}}\leq\bm{r_{q}}$ on the relation embeddings, where $\leq$ denotes the component-wise comparison. In fact, a sufficient condition for eq. (5) to hold, is

3 Approximately Boolean Entity Tuples

For minimizing the loss, the gradients are hence computed with respect to $\bm{e}$ , and the $L_{2}$ regularization is applied to the components of $\bm{e}$ instead of $\bm{t}$ .

Other choices for ensuring the restriction $\bm{t}\geq\bm{0}$ in eq. (11) are possible, but we found that our approach works better in practice than those (e.g., the exponential transformation proposed by ?)). It can also be observed that the unit tuples over which the implication loss is grounded, form a special case of approximately Boolean embeddings.

In order to investigate the impact of this restriction even when not injecting any rules, we introduce model FS: the original model F, but with sigmoidal entity-tuples:

Here, $\bm{e_{p}}$ and $\bm{e_{q}}$ are the real-valued representations as in eq. (12), for tuples $t_{p}$ and $t_{q}$ , respectively.

With the above choice of a non-negative tuple-embedding space we can now state the full lifted rule injection model (FSL):

$\mathcal{L}^{U}_{I}$ denotes a lifted loss term for every rule in a set $\mathcal{I}$ of implication rules that we want to inject.

4 Convex Implication Loss

with $\delta$ a small positive margin to ensure that the gradient does not disappear before the inequality is actually satisfied. We use $\delta=0.01$ in all experiments.

The main advantage of the presented approach over earlier methods that impose the rules in a grounded way [Rocktäschel et al., 2015, Wang et al., 2015] is the computational efficiency of imposing the lifted loss. Evaluating $\mathcal{L}_{I}^{U}$ or its gradient for one implication rule is comparable to evaluating the reconstruction loss for one pair of training facts. In typical applications there are much fewer rules than training facts and the extra computation time needed to inject these rules is therefore negligible.

Related Work

Recent research on combining rules with learned vector representations has been important for new developments in the field of knowledge base completion. ?) and ?) provided a framework to jointly maximize the probability of observed facts and propositionalized first-order logic rules. ?) demonstrated how different types of rules can be incorporated using an Integer Linear Programming approach. ?) learned embeddings for facts and first-order logic rules using matrix factorization. Yet, all of these approaches ground the rules in the training data, limiting their scalability towards large rule sets and KBs with many entities. As argued in the introduction, this forms an important motivation for the lifted rule injection model put forward in this work, which by construction does not suffer from that limitation. ?) proposed an alternative strategy to tackle the scalability problem by reasoning on a filtered subset of grounded facts.

?) proposed to use a path ranking approach for capturing long-range interactions between entities, and to add these as an extra loss term, besides the loss that models pairwise relations. Our model FSL differs substantially from their approach, in that we consider tuples instead of separate entities, and we inject a given set of rules. Yet, by creating a partial ordering in the relation embeddings as a result of injecting implication rules, model FSL can also capture interactions beyond direct relations. This will be demonstrated in §5.3 by injecting rules between surface patterns only and still measuring an improvement on predictions for structured Freebase relations.

Combining logic and distributed representations is also an active field of research outside of automated knowledge base completion. Recent advances include the work by ?), who injected ontological knowledge from WordNet into word representations. Furthermore, ?) proposed to enforce a partial ordering in an embeddings space of images and phrases. Our method is related to such order embeddings since we define a partial ordering on relation embeddings. However, to ensure that implications hold for all entity-tuples we also need a restriction on the entity-tuple embedding space and derive bounds on the loss. Another important contribution is the recent work by ?), who proposed a framework for injecting rules into general neural network architectures, by jointly training on the actual targets and on the rule-regularized predictions provided by a teacher network. Although quite different at first sight, their work could offer a way to use our model in various neural network architectures, by integrating the proposed lifted loss into the teacher network.

This paper builds upon our previous workshop paper [Demeester et al., 2016]. In that work, we tested different tuple embedding transformations in an ad-hoc manner. We used approximately Boolean representations of relations instead of entity-tuples, strongly reducing the model’s degrees of freedom. We now derive the FSL model from a carefully considered mathematical transformation of the grounded loss. The FSL model only restricts the tuple embedding space, whereby relation vectors remain real valued. Furthermore, previous experiments were performed on small-scale artificial datasets, whereas we now test on a real-world relation extraction benchmark.

Finally, we explicitly discuss the main differences with respect to the strongly related work from ?). Their method is more general, as they cover a wide range of first-order logic rules, whereas we only discuss implications. Lifted rule injection beyond implications will be studied in future research contributions. However, albeit less general, our model has a number of clear advantages:

Scalability – Our proposed model of lifted rule injection scales according to the number of implication rules, instead of the number of rules times the number of observed facts for every relation present in a rule.

Generalizability – Injected implications will hold even for facts not seen during training, because their validity only depends on the order relation imposed on the relation representations. This is not guaranteed when training on rules grounded in training facts by ?).

Training Flexibility – Our method can be trained with various loss functions, including the rank-based loss as used in ?). This was not possible for the model of ?) and already leads to an improved accuracy as seen from the zero-shot learning experiment in §5.2.

Independence Assumption – In ?) an implication of the form $a_{p}\Rightarrow a_{q}$ for two ground atoms $a_{p}$ and $a_{q}$ is modeled by the logical equivalence $\neg(a_{p}\wedge\neg a_{q})$ , and its probability is approximated in terms of the elementary probabilities $\pi(a_{p})$ and $\pi(a_{q})$ as $1-\pi(a_{p})\big{(}1-\pi(a_{q})\big{)}$ . This assumes the independence of the two atoms $a_{p}$ and $a_{q}$ , which may not hold in practice. Our approach does not rely on that assumption and also works for cases of statistical dependence. For example, the independence assumption does not hold in the trivial case where the relations $r_{p}$ and $r_{q}$ in the two atoms are equivalent, whereas in our model, the constraints $\bm{r}_{p}\leq\bm{r}_{q}$ and $\bm{r}_{p}\geq\bm{r}_{q}$ would simply reduce to $\bm{r}_{p}=\bm{r}_{q}$ .

Experiments and Results

We now present our experimental results. We start by describing the experimental setup and hyperparameters. Before turning to the injection of rules, we compare model F with model FS, and show that restricting the tuple embedding space has a regularization effect, rather than limiting the expressiveness of the model (§5.1). We then demonstrate that model FSL is capable of zero-shot learning (§5.2), and show that injecting high-quality WordNet rules leads to an improved precision (§5.3). We proceed with a visual illustration of the relation embeddings with and without injected rules (§5.4), provide details on time efficiency of the lifted rule injection method (§5.5), and show that it correctly captures the asymmetry of implication rules (§5.6).

Before incorporating external commonsense knowledge into relation representations, we were curious how much we lose by restricting the entity-tuple space to approximately Boolean embeddings. We evaluate our models on the New York Times dataset introduced by ?). Surprisingly, we find that the expressiveness of the model does not suffer from this strong restriction. From Table 1 we see that restricting the tuple-embedding space seems to perform slightly better (FS) as opposed to a real-valued tuple-embedding space (F), suggesting that this restriction has a regularization effect that improves generalization. We also provide the original results for model F by ?) (denoted as R13-F) for comparison. Due to a different implementation and optimization procedure, the results for our model F and R13-F are not identical.

Inspecting the top relations for a sampled dimension in the embedding space reveals that the relation space of model FS more closely resembles clusters than that of model F (Table 2). We hypothesize that this might be caused by approximately Boolean entity-tuple representations in model FS, resulting in attribute-like entity-tuple vectors that capture which relation clusters they belong to.

2 Zero-shot Learning

The zero-shot learning experiment performed in ?) leads to an important finding: when injecting implications with right-hand sides for Freebase relations for which no or very limited training facts are available, the model should be able to infer the validity of Freebase facts for those relations based on rules and correlations between textual surface patterns.

We inject the same hand-picked relations as used by ?), after removing all Freebase training facts. The lifted rule injection (model FSL) reaches a weighted MAP of $0.35$ , comparable with $0.38$ by the Joint model from ?) (denoted R15-Joint). Note that for this experiment we initialized the Freebase relations implied by the rules with negative random vectors (sampled uniformly from $[-7.9,-8.1]$ ). The reason is that without any negative training facts for these relations, their components can only go up due to the implication loss, and we do not want to get values that are too high before optimization.

Figure 1 shows how the relation extraction performance improves when more Freebase relation training facts are added. It effictively measures how well the proposed models, matrix factorization (F), propositionalized rule injection (R15-Joint), and our model (FSL), can make use of the provided rules and correlations between textual surface form patterns and increased fractions of Freebase training facts. Although FSL starts at a lower performance than R15-Joint when no Freebase training facts are present, it outperforms R15-Joint and a plain matrix factorization model by a substantial margin when provided with more than $7.5\%$ of Freebase training facts. This indicates that, in addition to being much faster than R15-Joint, it can make better use of provided rules and few training facts. We attribute this to the Bayesian personalized ranking loss instead of the logistic loss used in ?). The former is compatible with our rule-injection method, but not with the approach of maximizing the expectation of propositional rules used by R15-Joint.

3 Injecting Knowledge from WordNet

The main purpose of this work is to be able to incorporate rules from external resources for aiding relation extraction. We use WordNet hypernyms to generate rules for the NYT dataset. To this end we iterate over all surface form patterns in the dataset and attempt to replace words in the pattern by their hypernyms. If the resulting pattern is contained in the dataset, we generate the corresponding rule. For instance, we generate a rule appos->diplomat->amod $\Rightarrow$ appos->official->amod since both patterns are contained in the NYT dataset and we know from WordNet that a diplomat is an official. This leads to $427$ rules from WordNet that we subsequently annotate manually to obtain $36$ high-quality rules. Note that none of these rules directly imply a Freebase relation. Although the test relations all originate from Freebase, we still hope to see improvements by transitive effects, i.e., better surface form representations that in turn help to predict Freebase facts.

We show results obtained by injecting these WordNet rules in Table 1 (column FSL). The weighted MAP measure increases by 2% with respect to model FS, and 4% compared to our reimplementation of the matrix factorization model F. This demonstrates that imposing a partial ordering based on implication rules can be used to incorporate logical commonsense knowledge and increase the quality of information extraction systems. Note that our evaluation setting guarantees that only indirect effects of the rules are measured, i.e., we do not use any rules directly implying test relations. This shows that injecting such rules influences the relation embedding space beyond only the relations explicitly stated in the rules. For example, injecting the rule appos<-father->appos $\Rightarrow$ poss<-parent->appos can contribute to improved predictions for the test relation parent/child.

4 Visualizing Relation Embeddings

5 Efficiency of Lifted Injection of Rules

In order to get an idea of the time efficiency of injecting rules, we measure the time per epoch when restricting the program execution to a single 2.4GHz CPU core. We measure on average 6.33s per epoch without rules (model FS), against 6.76s and 6.97s when injecting the 36 high-quality WordNet rules and the unfiltered 427 rules (model FSL), respectively. Increasing the amount of injected rules from 36 to 427 leads to an increase of only 3% in computation time, even though in our setup all rule losses are used in every training batch. This confirms the high efficiency of our lifted rule injection method.

6 Asymmetric Character of Implications

In order to demonstrate that injecting implications conserves their asymmetric nature, we perform the following experiment. After incorporating high-quality Wordnet rules $r_{p}\Rightarrow r_{q}$ into model FSL we select all of the tuples $t_{p}$ that occur with relation $r_{p}$ in a training fact $\langle r_{p},t_{p}\rangle$ . Matching these with relation $r_{q}$ should result in high values for the scores $\bm{r_{q}}^{\top}\bm{t_{p}}$ , if the implication holds. If however the tuples $t_{q}$ are selected from the training facts $\langle r_{q},t_{q}\rangle$ , and matched with relation $r_{p}$ , the scores $\bm{r_{p}}^{\top}\bm{t_{q}}$ should be much lower if the inverse implication does not hold (in other words, if $r_{q}$ and $r_{p}$ are not equivalent). Table 3 lists the averaged results for 5 example rules, and the average over all relations in WordNet rules, both for the case with injected rules (model FSL), and without rules (model FS). For easier comparison, the scores are mapped to the unit interval via the sigmoid function. This quantity $\sigma(\bm{r}^{\top}\bm{t})$ is often interpreted as the probability that the corresponding fact holds [Riedel et al., 2013], but because of the BPR-based training, only differences between scores play a role here. After injecting rules, the average scores of facts inferred by these rules (i.e., column $\sigma(\bm{r_{q}}^{\top}\bm{t_{p}})$ for model FSL) are always higher than for facts (incorrectly) inferred by the inverse rules (column $\sigma(\bm{r_{p}}^{\top}\bm{t_{q}})$ for model FSL). In the fourth example, the inverse rule leads to high scores as well (on average 0.79, vs. 0.98 for the actual rule). This is due to the fact that the daily and newspaper relations are more or less equivalent, such that the components of $\bm{r_{p}}$ are not much below those of $\bm{r_{q}}$ . For the last example (the ambassador $\Rightarrow$ diplomat rule), the asymmetry in the implication is maintained, although the absolute scores are rather low for these two relations.

The results for model FS reflect how strongly the implications in either direction are latently present in the training data. We can only conclude that model FS manages to capture the similarity between relations, but not the asymmetric character of implications. For example, purely based on the training data, it appears to be more likely that the parent relation implies the father relation, than vice versa. This again demonstrates the importance and added value of injecting external rules capturing commonsense knowledge.

Conclusions

We presented a novel, fast approach for incorporating first-order implication rules into distributed representations of relations. We termed our approach ‘lifted rule injection’, as it avoids the costly grounding of first-order implication rules and is thus independent of the size of the domain of entities. By construction, these rules are satisfied for any observed or unobserved fact. The presented approach requires a restriction on the entity-tuple embedding space. However, experiments on a real-world dataset show that this does not impair the expressiveness of the learned representations. On the contrary, it appears to have a beneficial regularization effect.

By incorporating rules generated from WordNet hypernyms, our model improved over a matrix factorization baseline for knowledge base completion. Especially for domains where annotation is costly and only small amounts of training facts are available, our approach provides a way to leverage external knowledge sources for inferring facts.

In future work, we want to extend the proposed ideas beyond implications towards general first-order logic rules. We believe that supporting conjunctions, disjunctions and negations would enable to debug and improve representation learning based knowledge base completion. Furthermore, we want to integrate these ideas into neural methods beyond matrix factorization approaches.

Acknowledgments

This work was supported by the Research Foundation - Flanders (FWO), Ghent University - iMinds, Microsoft Research through its PhD Scholarship Programme, an Allen Distinguished Investigator Award, and a Marie Curie Career Integration Award.