SimplE Embedding for Link Prediction in Knowledge Graphs

Seyed Mehran Kazemi, David Poole

Introduction

During the past two decades, several knowledge graphs (KGs) containing (perhaps probabilistic) facts about the world have been constructed. These KGs have applications in several fields including search, question answering, natural language processing, recommendation systems, etc. Due to the enormous number of facts that could be asserted about our world and the difficulty in accessing and storing all these facts, KGs are incomplete. However, it is possible to predict new links in a KG based on the existing ones. Link prediction and several other related problems aiming at reasoning with entities and relationships are studied under the umbrella of statistical relational learning (SRL) . The problem of link prediction for KGs is also known as knowledge graph completion. A KG can be represented as a set of (head,relation,tail)(\mathit{head},\mathit{relation},\mathit{tail}) triplesTriples are complete for relations. They are sometimes written as (subject,verb,object)(\mathit{subject},\mathit{verb},\mathit{object}) or (individual,property,value)(\mathit{individual},\mathit{property},\mathit{value}).. The problem of KG completion can be viewed as predicting new triples based on the existing ones.

Tensor factorization approaches have proved to be an effective SRL approach for KG completion . These approaches consider embeddings for each entity and each relation. To predict whether a triple holds, they use a function which takes the embeddings for the head and tail entities and the relation as input and outputs a number indicating the predicted probability. Details and discussions of these approaches can be found in several recent surveys .

One of the first tensor factorization approaches is the canonical Polyadic (CP) decomposition . This approach learns one embedding vector for each relation and two embedding vectors for each entity, one to be used when the entity is the head and one to be used when the entity is the tail. The head embedding of an entity is learned independently of (and is unrelated to) its tail embedding. This independence has caused CP to perform poorly for KG completion . In this paper, we develop a tensor factorization approach based on CP that addresses the independence among the two embedding vectors of the entities. Due to the simplicity of our model, we call it SimplE (Simple Embedding).

We show that SimplE: 1- can be considered a bilinear model, 2- is fully expressive, 3- is capable of encoding background knowledge into its embeddings through parameter sharing (aka weight tying), and 4- performs very well empirically despite (or maybe because of) its simplicity. We also discuss several disadvantages of other existing approaches. We prove that many existing translational approaches (see e.g., ) are not fully expressive and we identify severe restrictions on what they can represent. We also show that the function used in ComplEx , a state-of-the-art approach for link prediction, involves redundant computations.

Background and Notation

Let E\mathcal{E} and R\mathcal{R} represent the set of entities and relations respectively. A triple is represented as (h,r,t)(\mathit{h},\mathit{r},\mathit{t}), where hEh\in\mathcal{E} is the head, rRr\in\mathcal{R} is the relation, and tEt\in\mathcal{E} is the tail of the triple. Let ζ\zeta represent the set of all triples that are true in a world (e.g., (paris,capitalOf,france)(\mathit{paris},\mathit{capitalOf},\mathit{france})), and ζ\zeta^{\prime} represent the ones that are false (e.g., (paris,capitalOf,italy)(\mathit{paris},\mathit{capitalOf},\mathit{italy})). A knowledge graph KG\mathcal{KG} is a subset of ζ\zeta. A relation rr is reflexive on a set E\mathcal{E} of entities if (e,r,e)ζ(\mathit{e},\mathit{r},\mathit{e})\in\zeta for all entities eEe\in\mathcal{E}. A relation rr is symmetric on a set E\mathcal{E} of entities if (e1,r,e2)ζ    (e2,r,e1)ζ(\mathit{e_{1}},\mathit{r},\mathit{e_{2}})\in\zeta\iff(\mathit{e_{2}},\mathit{r},\mathit{e_{1}})\in\zeta for all pairs of entities e1,e2Ee_{1},e_{2}\in\mathcal{E}, and is anti-symmetric if (e1,r,e2)ζ    (e2,r,e1)ζ(\mathit{e_{1}},\mathit{r},\mathit{e_{2}})\in\zeta\iff(\mathit{e_{2}},\mathit{r},\mathit{e_{1}})\in\zeta^{\prime}. A relation rr is transitive on a set E\mathcal{E} of entities if (e1,r,e2)ζ(e2,r,e3)ζ(e1,r,e3)ζ(\mathit{e_{1}},\mathit{r},\mathit{e_{2}})\in\zeta\wedge(\mathit{e_{2}},\mathit{r},\mathit{e_{3}})\in\zeta\Rightarrow(\mathit{e_{1}},\mathit{r},\mathit{e_{3}})\in\zeta for all e1,e2,e3Ee_{1},e_{2},e_{3}\in\mathcal{E}. The inverse of a relation rr, denoted as r1r^{-1}, is a relation such that for any two entities eie_{i} and eje_{j}, (ei,r,ej)ζ    (ej,r1,ei)ζ(\mathit{e_{i}},\mathit{r},\mathit{e_{j}})\in\zeta\iff(\mathit{e_{j}},\mathit{r^{-1}},\mathit{e_{i}})\in\zeta.

An embedding is a function from an entity or a relation to one or more vectors or matrices of numbers. A tensor factorization model defines two things: 1- the embedding functions for entities and relations, 2- a function ff taking the embeddings for hh, rr and tt as input and generating a prediction of whether (h,r,t)(\mathit{h},\mathit{r},\mathit{t}) is in ζ\zeta or not. The values of the embeddings are learned using the triples in a KG\mathcal{KG}. A tensor factorization model is fully expressive if given any ground truth (full assignment of truth values to all triples), there exists an assignment of values to the embeddings of the entities and relations that accurately separates the correct triples from incorrect ones.

Related Work

SimplE: A Simple Yet Fully Expressive Model

Let likes(p,m)likes(p,m) represent if a person pp likes a movie mm and acted(m,a)acted(m,a) represent who acted in which movie. Which actors play in a movie is expected to affect who likes the movie. In CP, observations about likes only update the tt vector of movies and observations about acted only update the hh vector. Therefore, what is being learned about movies through observations about acted does not affect the predictions about likes and vice versa.

SimplE takes advantage of the inverse of relations to address the independence of the two vectors for each entity in CP. While inverse of relations has been used for other purposes (see e.g., ), using them to address the independence of the entity vectors in CP is a novel contribution.

Learning SimplE Models: To learn a SimplE model, we use stochastic gradient descent with mini-batches. In each learning iteration, we iteratively take in a batch of positive triples from the KG\mathcal{KG}, then for each positive triple in the batch we generate nn negative triples by corrupting the positive triple. We use Bordes et al. ’s procedure to corrupt positive triples. The procedure is as follows. For a positive triple (h,r,t)(\mathit{h},\mathit{r},\mathit{t}), we randomly decide to corrupt the head or tail. If the head is selected, we replace hh in the triple with an entity hh^{\prime} randomly selected from E{h}\mathcal{E}-\{h\} and generate the corrupted triple (h,r,t)(\mathit{h^{\prime}},\mathit{r},\mathit{t}). If the tail is selected, we replace tt in the triple with an entity tt^{\prime} randomly selected from E{t}\mathcal{E}-\{t\} and generate the corrupted triple (h,r,t)(\mathit{h},\mathit{r},\mathit{t^{\prime}}). We generate a labelled batch LB\mathbf{LB} by labelling positive triples as +1+1 and negatives as 1-1. Once we have a labelled batch, following we optimize the L2L2 regularized negative log-likelihood of the batch: minθ((h,r,t),l)LBsoftplus(lϕ(h,r,t))+λθ22\min_{\theta}\sum_{((\mathit{h},\mathit{r},\mathit{t}),l)\in\mathbf{LB}}softplus(-l\cdot\phi(\mathit{h},\mathit{r},\mathit{t}))+\lambda||\theta||_{2}^{2}, where θ\theta represents the parameters of the model (the parameters in the embeddings), ll represents the label of a triple, ϕ(h,r,t)\phi(\mathit{h},\mathit{r},\mathit{t}) represents the similarity score for triple (h,r,t)(\mathit{h},\mathit{r},\mathit{t}), λ\lambda is the regularization hyper-parameter, and softplus(x)=log(1+exp(x))softplus(x)=log(1+\exp(x)). While several previous works (e.g., TransE, TransR, STransE, etc.) consider a margin-based loss function, Trouillon and Nickel show that the margin-based loss function is more prone to overfitting compared to log-likelihood.

Theoretical Analyses

In this section, we provide some theoretical analyses of SimplE and other existing approaches.

The following proposition establishes the full expressivity of SimplE.

For any ground truth over entities E\mathcal{E} and relations R\mathcal{R} containing γ\gamma true facts, there exists a SimplE model with embedding vectors of size min(ER,γ+1)min(|\mathcal{E}|\cdot|\mathcal{R}|,\gamma+1) that represents that ground truth.

First, we prove the ER|\mathcal{E}|\cdot|\mathcal{R}| bound. With embedding vectors of size ER|\mathcal{E}|*|\mathcal{R}|, for each entity eie_{i} we let the n-th element of hei=1h_{e_{i}}=1 if (nn mod E)=i|\mathcal{E}|)=i and otherwise, and for each relation rjr_{j} we let the n-th element of vrj=1v_{r_{j}}=1 if (nn div E)=j|\mathcal{E}|)=j and otherwise (see Fig 1). Then for each eie_{i} and rjr_{j}, the product of heih_{e_{i}} and vrjv_{r_{j}} is everywhere except for the (jE+i)(j*|\mathcal{E}|+i)-th element. So for each entity eke_{k}, we set the (jE+i)(j*|\mathcal{E}|+i)-th element of tekt_{e_{k}} to be 11 if (ei,rj,ek)(\mathit{e_{i}},\mathit{r_{j}},\mathit{e_{k}}) holds and 1-1 otherwise.

Now we prove the γ+1\gamma+1 bound. Let γ\gamma be zero (base of the induction). We can have embedding vectors of size 11 for each entity and relation, setting the value for entities to 11 and for relations to 1-1. Then hei,vrj,tek\langle h_{e_{i}},v_{r_{j}},t_{e_{k}}\rangle is negative for every entities eie_{i} and eke_{k} and relation rjr_{j}. So there exists embedding vectors of size γ+1\gamma+1 that represents this ground truth. Let us assume for any ground truth where γ=n1\gamma=n-1 (1nRE2)(1\leq n\leq|\mathcal{R}||\mathcal{E}|^{2}), there exists an assignment of values to embedding vectors of size nn that represents that ground truth (assumption of the induction). We must prove for any ground truth where γ=n\gamma=n, there exists an assignment of values to embedding vectors of size n+1n+1 that represents this ground truth. Let (ei,rj,ek)(\mathit{e_{i}},\mathit{r_{j}},\mathit{e_{k}}) be one of the nn true facts. Consider a modified ground truth which is identical to the ground truth with nn true facts, except that (ei,rj,ek)(\mathit{e_{i}},\mathit{r_{j}},\mathit{e_{k}}) is assigned false. The modified ground truth has n1n-1 true facts and based on the assumption of the induction, we can represent it using some embedding vectors of size nn. Let q=hei,vrj,tekq=\langle h_{e_{i}},v_{r_{j}},t_{e_{k}}\rangle where heih_{e_{i}}, vrjv_{r_{j}} and tekt_{e_{k}} are the embedding vectors that represent the modified ground truth. We add an element to the end of all embedding vectors and set it to . This increases the vector sizes to n+1n+1 but does not change any scores. Then we set the last element of heih_{e_{i}} to 11, vrjv_{r_{j}} to 11, and tekt_{e_{k}} to q+1q+1. This ensures that hei,vrj,tek>0\langle h_{e_{i}},v_{r_{j}},t_{e_{k}}\rangle>0 for the new vectors, and no other score is affected. ∎

DistMult is not fully expressive as it forces relations to be symmetric. It has been shown in that ComplEx is fully expressive with embeddings of length at most ER|\mathcal{E}|\cdot|\mathcal{R}|. According to the universal approximation theorem , under certain conditions, neural networks are universal approximators of continuous functions over compact sets. Therefore, we would expect there to be a representation based on neural networks that can approximate any ground truth, but the number of hidden units might have to grow with the number of triples. Wang et al. prove that TransE is not fully expressive. Proposition 2 proves that not only TransE but also many other translational approaches are not fully expressive. The proposition also identifies severe restrictions on what relations these approaches can represent.

FSTransE is not fully expressive and has the following restrictions. R1:\mathsf{R1:} If a relation rr is reflexive on ΔE\Delta\subset\mathcal{E}, rr must also be symmetric on Δ\Delta, R2:\mathsf{R2:} If rr is reflexive on ΔE\Delta\subset\mathcal{E}, rr must also be transitive on Δ\Delta, and R3:\mathsf{R3:} If entity e1e_{1} has relation rr with every entity in ΔE\Delta\subset\mathcal{E} and entity e2e_{2} has relation rr with one of the entities in Δ\Delta, then e2e_{2} must have the relation rr with every entity in Δ\Delta.

For any entity ee and relation rr, let pre=Prvep_{re}=P_{r}v_{e} and qre=Qrveq_{re}=Q_{r}v_{e}. For a triple (h,r,t)(\mathit{h},\mathit{r},\mathit{t}) to hold, we should ideally have prh+vr=αqrtp_{rh}+v_{r}=\alpha q_{rt} for some α\alpha. We assume s1s_{1}, s2s_{2}, s3s_{3} and s4s_{4} are entities in Δ\Delta.

R1:\mathsf{R1:} A relation rr being reflexive on Δ\Delta implies prs1+vr=α1qrs1p_{rs_{1}}+v_{r}=\alpha_{1}q_{rs_{1}} and prs2+vr=α2qrs2p_{rs_{2}}+v_{r}=\alpha_{2}q_{rs_{2}}. Suppose (s1,r,s2)(\mathit{s_{1}},\mathit{r},\mathit{s_{2}}) holds as well. Then we know prs1+vr=α3qrs2p_{rs_{1}}+v_{r}=\alpha_{3}q_{rs_{2}}. Therefore, prs2+vr=α2qrs2=α2α3(prs1+vr)=α2α3α1qrs1=α4qrs1p_{rs_{2}}+v_{r}=\alpha_{2}q_{rs_{2}}=\frac{\alpha_{2}}{\alpha_{3}}(p_{rs_{1}}+v_{r})=\frac{\alpha_{2}}{\alpha_{3}}\alpha_{1}q_{rs_{1}}=\alpha_{4}q_{rs_{1}}, where α4=α2α1α3\alpha_{4}=\frac{\alpha_{2}\alpha_{1}}{\alpha_{3}}. Therefore, (s2,r,s1)(\mathit{s_{2}},\mathit{r},\mathit{s_{1}}) must holds.

R2:\mathsf{R2:} A relation rr being reflexive implies prs1+vr=α1qrs1p_{rs_{1}}+v_{r}=\alpha_{1}q_{rs_{1}}, prs2+vr=α2qrs2p_{rs_{2}}+v_{r}=\alpha_{2}q_{rs_{2}}, and prs3+vr=α3qrs3p_{rs_{3}}+v_{r}=\alpha_{3}q_{rs_{3}}. Suppose (s1,r,s2)(\mathit{s_{1}},\mathit{r},\mathit{s_{2}}) and (s2,r,s3)(\mathit{s_{2}},\mathit{r},\mathit{s_{3}}) hold. Then we know prs1+vr=α4qrs2p_{rs_{1}}+v_{r}=\alpha_{4}q_{rs_{2}} and prs2+vr=α5qrs3p_{rs_{2}}+v_{r}=\alpha_{5}q_{rs_{3}}. We can conclude prs1+vr=α4qrs2=α4α2(prs2+vr)=α4α2α5qrs3=α6qrs3p_{rs_{1}}+v_{r}=\alpha_{4}q_{rs_{2}}=\frac{\alpha_{4}}{\alpha_{2}}(p_{rs_{2}}+v_{r})=\frac{\alpha_{4}}{\alpha_{2}}\alpha_{5}q_{rs_{3}}=\alpha_{6}q_{rs_{3}}, where α6=α4α5α2\alpha_{6}=\frac{\alpha_{4}\alpha_{5}}{\alpha_{2}}. The above equality proves (s1,r,s3)(\mathit{s_{1}},\mathit{r},\mathit{s_{3}}) must hold.

R3:\mathsf{R3:} Let e2e_{2} have relation rr with s1s_{1}. We know pre1+vr=α1qrs1p_{re_{1}}+v_{r}=\alpha_{1}q_{rs_{1}}, pre1+vr=α2qrs2p_{re_{1}}+v_{r}=\alpha_{2}q_{rs_{2}}, and pre2+vr=α3qrs1p_{re_{2}}+v_{r}=\alpha_{3}q_{rs_{1}}. We can conclude pre2+vr=α3qrs1=α3α1(pre1+vr)=α3α1α2qrs2=α4qrs2p_{re_{2}}+v_{r}=\alpha_{3}q_{rs_{1}}=\frac{\alpha_{3}}{\alpha_{1}}(p_{re_{1}}+v_{r})=\frac{\alpha_{3}}{\alpha_{1}}\alpha_{2}q_{rs_{2}}=\alpha_{4}q_{rs_{2}}, where α4=α3α2α1\alpha_{4}=\frac{\alpha_{3}\alpha_{2}}{\alpha_{1}}. Therefore, (e2,r,s2)(\mathit{e_{2}},\mathit{r},\mathit{s_{2}}) must hold. ∎

Other variants of translational approaches such as TransE, FTransE, STransE, TransH , and TransR also have the restrictions mentioned in Proposition 2.

2 Incorporating Background Knowledge into the Embeddings

In SimplE, each element of the embedding vector of the entities can be considered as a feature of the entity and the corresponding element of a relation can be considered as a measure of how important that feature is to the relation. Such interpretability allows the embeddings learned through SimplE for an entity (or relation) to be potentially transferred to other domains. It also allows for incorporating observed features of entities into the embeddings by fixing one of the elements of the embedding vector of the observed value. Nickel et al. show that incorporating such features helps reduce the size of the embeddings.

Recently, incorporating background knowledge into tensor factorization approaches has been the focus of several studies. Towards this goal, many existing approaches rely on post-processing steps or add additional terms to the loss function to penalize predictions that violate the background knowledge . Minervini et al. show how background knowledge in terms of equivalence and inversion can be incorporated into several tensor factorization models through parameter tyingAlthough their incorporation of inversion into DistMult is not correct as it has side effects.. Incorporating background knowledge by parameter tying has the advantage of guaranteeing the predictions follow the background knowledge for all embeddings. In this section, we show how three types of background knowledge, namely symmetry, anti-symmetry, and inversion, can be incorporated into the embeddings of SimplE by tying the parametersNote that such background knowledge can be exerted on some relations selectively and not on the others. This is different than, e.g., DistMult which enforces symmetry on all relations. (we ignore the equivalence between two relations as it is trivial).

Let rr be a relation such that for any two entities eie_{i} and eje_{j} we have (ei,r,ej)ζ    (ej,r,ei)ζ(\mathit{e_{i}},\mathit{r},\mathit{e_{j}})\in\zeta\iff(\mathit{e_{j}},\mathit{r},\mathit{e_{i}})\in\zeta (i.e. rr is symmetric). This property of rr can be encoded into SimplE by tying the parameters vr1v_{r^{-1}} to vrv_{r}.

If (ei,r,ej)ζ(\mathit{e_{i}},\mathit{r},\mathit{e_{j}})\in\zeta, then a SimplE model makes hei,vr,tej\langle h_{e_{i}},v_{r},t_{e_{j}}\rangle and hej,vr1,tei\langle h_{e_{j}},v_{r^{-1}},t_{e_{i}}\rangle positive. By tying the parameters vr1v_{r^{-1}} to vrv_{r}, we can conclude that hej,vr,tei\langle h_{e_{j}},v_{r},t_{e_{i}}\rangle and hei,vr1,tej\langle h_{e_{i}},v_{r^{-1}},t_{e_{j}}\rangle also become positive. Therefore, the SimplE model predicts (ej,r,ei)ζ(\mathit{e_{j}},\mathit{r},\mathit{e_{i}})\in\zeta. ∎

Let rr be a relation such that for any two entities eie_{i} and eje_{j} we have (ei,r,ej)ζ    (ej,r,ei)ζ(\mathit{e_{i}},\mathit{r},\mathit{e_{j}})\in\zeta\iff(\mathit{e_{j}},\mathit{r},\mathit{e_{i}})\in\zeta^{\prime} (i.e. rr is anti-symmetric). This property of rr can be encoded into SimplE by tying the parameters vr1v_{r^{-1}} to the negative of vrv_{r}.

If (ei,r,ej)ζ(\mathit{e_{i}},\mathit{r},\mathit{e_{j}})\in\zeta, then a SimplE model makes hei,vr,tej\langle h_{e_{i}},v_{r},t_{e_{j}}\rangle and hej,vr1,tei\langle h_{e_{j}},v_{r^{-1}},t_{e_{i}}\rangle positive. By tying the parameters vr1v_{r^{-1}} to the negative of vrv_{r}, we can conclude that hej,vr,tei\langle h_{e_{j}},v_{r},t_{e_{i}}\rangle and hei,vr1,tej\langle h_{e_{i}},v_{r^{-1}},t_{e_{j}}\rangle become negative. Therefore, the SimplE model predicts (ej,r,ei)ζ(\mathit{e_{j}},\mathit{r},\mathit{e_{i}})\in\zeta^{\prime}. ∎

Let r1r_{1} and r2r_{2} be two relations such that for any two entities eie_{i} and eje_{j} we have (ei,r1,ej)ζ    (ej,r2,ei)ζ(\mathit{e_{i}},\mathit{r_{1}},\mathit{e_{j}})\in\zeta\iff(\mathit{e_{j}},\mathit{r_{2}},\mathit{e_{i}})\in\zeta (i.e. r2r_{2} is the inverse of r1r_{1}). This property of r1r_{1} and r2r_{2} can be encoded into SimplE by tying the parameters vr11v_{r_{1}^{-1}} to vr2v_{r_{2}} and vr21v_{r_{2}^{-1}} to vr1v_{r_{1}}.

If (ei,r1,ej)ζ(\mathit{e_{i}},\mathit{r_{1}},\mathit{e_{j}})\in\zeta, then a SimplE model makes hei,vr1,tej\langle h_{e_{i}},v_{r_{1}},t_{e_{j}}\rangle and hej,vr11,tei\langle h_{e_{j}},v_{r_{1}^{-1}},t_{e_{i}}\rangle positive. By tying the parameters vr21v_{r_{2}^{-1}} to vr1v_{r_{1}} and vr2v_{r_{2}} to vr11v_{r_{1}^{-1}}, we can conclude that hei,vr21,tej\langle h_{e_{i}},v_{r_{2}^{-1}},t_{e_{j}}\rangle and hej,vr2,tei\langle h_{e_{j}},v_{r_{2}},t_{e_{i}}\rangle also become positive. Therefore, the SimplE model predicts (ej,r2,ei)ζ(\mathit{e_{j}},\mathit{r_{2}},\mathit{e_{i}})\in\zeta. ∎

3 Time Complexity and Parameter Growth

As described in , to scale to the size of the current KGs and keep up with their growth, a relational model must have a linear time and memory complexity. Furthermore, one of the important challenges in designing tensor factorization models is the trade-off between expressivity and model complexity. Models with many parameters usually overfit and give poor performance. While the time complexity for TransE is O(d)O(d) where dd is the size of the embedding vectors, adding the projections as in STransE (through the two relation matrices) increases the time complexity to O(d2)O(d^{2}). Besides time complexity, the number of parameters to be learned from data grows quadratically with dd. A quadratic time complexity and parameter growth may arise two issues: 1- scalability problems, 2- overfitting. Same issues exist for models such as RESCAL and NTNs that have quadratic or higher time complexities and parameter growths. DistMult and ComplEx have linear time complexities and the number of their parameters grow linearly with dd.

The time complexity of both SimplE-ignr and SimplE is O(d)O(d), i.e. linear in the size of vector embeddings. SimplE-ignr requires one multiplication between three vectors for each triple. This number is 22 for SimplE and 44 for ComplEx. Thus, with the same number of parameters, SimplE-ignr and SimplE reduce the computations by a factor of 44 and 22 respectively compared to ComplEx.

4 Family of Bilinear Models

The constraint over MrM_{r} matrices in SimplE is very similar to the constraint in DistMult. vhTMrv_{h}^{T}M_{r} in both SimplE and DistMult can be considered as an element-wise product of the parameters, except that the MrM_{r}s in SimplE swap the first and second halves of the resulting vector. Compared to ComplEx, SimplE removes the parameters on the main diagonal of MrM_{r}s. Note that several other restrictions on the MrM_{r} matrices are equivalent to SimplE, e.g., restricting MrM_{r} matrices to be zero everywhere except on the counterdiagonal. Viewing SimplE as a single-vector-per-entity model makes it easily integrable (or compatible) with other embedding models (in knowledge graph completion, computer vision and natural language processing) such as .

5 Redundancy in ComplEx

As argued earlier, with the same number of parameters, the number of computations in ComplEx are 4x and 2x more than SimplE-ignr and SimplE. Here we show that a portion of the computations performed by ComplEx to make predictions is redundant. Consider a ComplEx model with embedding vectors of size 11 (for ease of exposition). Suppose the embedding vectors for hh, rr and tt are [α1+β1i][\alpha_{1}+\beta_{1}i], [α2+β2i][\alpha_{2}+\beta_{2}i], and [α3+β3i][\alpha_{3}+\beta_{3}i] respectively. Then the probability of (h,r,t)(\mathit{h},\mathit{r},\mathit{t}) being correct according to ComplEx is proportional to the sum of the following four terms: 1) α1α2α31)~{}\alpha_{1}\alpha_{2}\alpha_{3}, 2) α1β2β32)~{}\alpha_{1}\beta_{2}\beta_{3}, 3) β1α2β33)~{}\beta_{1}\alpha_{2}\beta_{3}, and 4) β1β2α34)~{}\mathnormal{-}\beta_{1}\beta_{2}\alpha_{3}. It can be verified that for any assignment of (non-zero) values to αi\alpha_{i}s and βi\beta_{i}s, at least one of the above terms is negative. This means for a correct triple, ComplEx uses three terms to overestimate its score and then uses a term to cancel the overestimation.

The following example shows how this redundancy in ComplEx may affect its interpretability:

Consider a ComplEx model with embeddings of size 11. Consider entities e1e_{1}, e2e_{2} and e3e_{3} with embedding vectors [1+4i][1+4i], [1+6i][1+6i], and [3+2i][3+2i] respectively, and a relation rr with embedding vector [1+i][1+i]. According to ComplEx, the score for triple (e1,r,e3)(\mathit{e_{1}},\mathit{r},\mathit{e_{3}}) is positive suggesting e1e_{1} probably has relation rr with e3e_{3}. However the score for triple (e2,r,e3)(\mathit{e_{2}},\mathit{r},\mathit{e_{3}}) is negative suggesting e2e_{2} probably does not have relation rr with e3e_{3}. Since the only difference between e1e_{1} and e2e_{2} is that the imaginary part changes from 44 to 66, it is difficult to associate a meaning to these numbers.

Experiments and Results

Datasets: We conducted experiments on two standard benchmarks: WN18 a subset of Wordnet , and FB15k a subset of Freebase . We used the same train/valid/test sets as in . WN18 contains 40,94340,943 entities, 1818 relations, 141,442141,442 train, 5,0005,000 validation and 5,0005,000 test triples. FB15k contains 14,95114,951 entities, 1,3451,345 relations, 483,142483,142 train, 50,00050,000 validation, and 59,07159,071 test triples.

Baselines: We compare SimplE with several existing tensor factorization approaches. Our baselines include canonical Polyadic (CP) decomposition, TransE, TransR, DistMult, NTN, STransE, ER-MLP, and ComplEx. Given that we use the same data splits and objective function as ComplEx, we report the results of CP, TransE, DistMult, and ComplEx from . We report the results of TransR and NTN from , and ER-MLP from for further comparison.

Evaluation Metrics: To measure and compare the performances of different models, for each test triple (h,r,t)(\mathit{h},\mathit{r},\mathit{t}) we compute the score of (h,r,t)(\mathit{h^{\prime}},\mathit{r},\mathit{t}) triples for all hEh^{\prime}\in\mathcal{E} and calculate the ranking rankhrank_{h} of the triple having hh, and we compute the score of (h,r,t)(\mathit{h},\mathit{r},\mathit{t^{\prime}}) triples for all tEt^{\prime}\in\mathcal{E} and calculate the ranking ranktrank_{t} of the triple having tt. Then we compute the mean reciprocal rank (MRR) of these rankings as the mean of the inverse of the rankings: MRR=12tt(h,r,t)tt1rankh+1ranktMRR=\frac{1}{2*|tt|}\sum_{(\mathit{h},\mathit{r},\mathit{t})\in tt}\frac{1}{rank_{h}}+\frac{1}{rank_{t}}, where tttt represents the test triples. MRR is a more robust measure than mean rank, since a single bad ranking can largely influence mean rank.

Bordes et al. identified an issue with the above procedure for calculating the MRR (hereafter referred to as raw MRR). For a test triple (h,r,t)(\mathit{h},\mathit{r},\mathit{t}), since there can be several entities hEh^{\prime}\in\mathcal{E} for which (h,r,t)(\mathit{h^{\prime}},\mathit{r},\mathit{t}) holds, measuring the quality of a model based on its ranking for (h,r,t)(\mathit{h},\mathit{r},\mathit{t}) may be flawed. That is because two models may rank the test triple (h,r,t)(\mathit{h},\mathit{r},\mathit{t}) to be second, when the first model ranks a correct triple (e.g., from train or validation set) (h,r,t)(\mathit{h^{\prime}},\mathit{r},\mathit{t}) to be first and the second model ranks an incorrect triple (h,r,t)(\mathit{h^{\prime\prime}},\mathit{r},\mathit{t}) to be first. Both these models will get the same score for this test triple when the first model should get a higher score. To address this issue, proposed a modification to raw MRR. For each test triple (h,r,t)(\mathit{h},\mathit{r},\mathit{t}), instead of finding the rank of this triple among triples (h,r,t)(\mathit{h^{\prime}},\mathit{r},\mathit{t}) for all hEh^{\prime}\in\mathcal{E} (or (h,r,t)(\mathit{h},\mathit{r},\mathit{t^{\prime}}) for all tEt^{\prime}\in\mathcal{E}), they proposed to calculate the rank among triples (h,r,t)(\mathit{h^{\prime}},\mathit{r},\mathit{t}) only for hEh^{\prime}\in\mathcal{E} such that (h,r,t)∉trainvalidtest(\mathit{h^{\prime}},\mathit{r},\mathit{t})\not\in train\cup valid\cup test. Following , we call this measure filtered MRR. We also report hit@khit@k measures. The hit@khit@k for a model is computed as the percentage of test triples whose ranking (computed as described earlier) is less than or equal kk.

Implementation: We implemented SimplE in TensorFlow . We tuned our hyper-parameters over the validation set. We used the same search grid on embedding size and λ\lambda as to make our results directly comparable to their results. We fixed the maximum number of iterations to 10001000 and the batch size to 100100. We set the learning rate for WN18 to 0.10.1 and for FB15k to 0.050.05 and used adagrad to update the learning rate after each batch. Following , we generated one negative example per positive example for WN18 and 1010 negative examples per positive example in FB15k. We computed the filtered MRR of our model over the validation set every 5050 iterations for WN18 and every 100100 iterations for FB15kFB15k and selected the iteration that resulted in the best validation filtered MRR. The best embedding size and λ\lambda values on WN18 for SimplE-ignr were 200200 and 0.0010.001 respectively, and for SimplE were 200200 and 0.030.03. The best embedding size and λ\lambda values on FB15k for SimplE-ignr were 200200 and 0.030.03 respectively, and for SimplE were 200200 and 0.10.1.

Table 1 shows the results of our experiments. It can be viewed that both SimplE-ignr and SimplE do a good job compared to the existing baselines on both datasets. On WN18, SimplE-ignr and SimplE perform as good as ComplEx, a state-of-the-art tensor factorization model. On FB15k, SimplE outperforms the existing baselines and gives state-of-the-art results among tensor factorization approaches. SimplE (and SimplE-ignr) work especially well on this dataset in terms of filtered MRR and hit@1, so SimplE tends to do well at having its first prediction being correct.

The table shows that models with many parameters (e.g., NTN and STransE) do not perform well on these datasets, as they probably overfit. Translational approaches generally have an inferior performance compared to other approaches partly due to their representation restrictions mentioned in Proposition 2. As an example for the friendship relation in FB15k, if an entity e1e_{1} is friends with 2020 other entities and another entity e2e_{2} is friends with only one of those 20, then according to Proposition 2 translational approaches force e2e_{2} to be friends with the other 19 entities as well (same goes for, e.g., netflix genre in FB15k and has part in WN18). The table also shows that bilinear approaches tend to have better performances compared to translational and deep learning approaches. Even DistMult, the simplest bilinear approach, outperforms many translational and deep learning approaches despite not being fully expressive. We believe the simplicity of embeddings and the scoring function is a key property for the success of SimplE.

2 Incorporating background knowledge

When background knowledge is available, we might expect that a knowledge graph might not include redundant information because it is implied by background knowledge and so the methods that do not include the background knowledge can never learn it. In section 5.2, we showed how background knowledge that can be formulated in terms of three types of rules can be incorporated into SimplE embeddings. To test this empirically, we conducted an experiment on WN18 in which we incorporated several such rules into the embeddings as outlined in Propositions 3, 4, and 5. The rules can be found in Table 2. As can be viewed in Table 2, most of the rules are of the form ei,ejE:(ei,r1,ej)ζ(ej,r2,ei)ζ\forall e_{i},e_{j}\in\mathcal{E}:(\mathit{e_{i}},\mathit{r_{1}},\mathit{e_{j}})\in\zeta\Leftrightarrow(\mathit{e_{j}},\mathit{r_{2}},\mathit{e_{i}})\in\zeta. For (possibly identical) relations such as r1r_{1} and r2r_{2} participating in such a rule, if both (ei,r1,ej)(\mathit{e_{i}},\mathit{r_{1}},\mathit{e_{j}}) and (ej,r2,ei)(\mathit{e_{j}},\mathit{r_{2}},\mathit{e_{i}}) are in the training set, one of them is redundant because one can be inferred from the other. We removed redundant triples from the training set by randomly removing one of the two triples in the training set that could be inferred from the other one based on the background rules. Removing redundant triples reduced the number of triples in the training set from (approximately) 141K141K to (approximately) 90K90K, almost 36%36\% reduction in size. Note that this experiment provides an upper bound on how much background knowledge can improve the performance of a SimplE model.

We trained SimplE-ignr and SimplE (with tied parameters according to the rules) on this new training dataset with the best hyper-parameters found in the previous experiment. We refer to these two models as SimplE-ignr-bk and SimplE-bk. We also trained another SimplE-ignr and SimplE models on this dataset, but without incorporating the rules into the embeddings. For sanity check, we also trained a ComplEx model over this new dataset. We found that the filtered MRR for SimplE-ignr, SimplE, and ComplEx were respectively 0.2210.221, 0.3840.384, and 0.2750.275. For SimplE-ignr-bk and SimplE-bk, the filtered MRRs were 0.7720.772 and 0.7760.776 respectively, substantially higher than the case without background knowledge. In terms of hit@khit@k measures, SimplE-ignr gave 0.2190.219, 0.2200.220, and 0.2240.224 for hit@1hit@1, hit@3hit@3 and hit@10hit@10 respectively. These numbers were 0.3340.334, 0.4040.404, and 0.4820.482 for SimplE, and 0.2540.254, 0.2800.280 and 0.3130.313 for ComplEx. For SimplE-ignr-bk, these numbers were 0.7150.715, 0.8090.809 and 0.8770.877 and for SimplE-bk they were 0.7150.715, 0.8180.818 and 0.8830.883, also substantially higher than the models without background knowledge. The obtained results validate that background knowledge can be effectively incorporated into SimplE embeddings to improve its performance.

Conclusion

We proposed a simple interpretable fully expressive bilinear model for knowledge graph completion. We showed that our model, called SimplE, performs very well empirically and has several interesting properties. For instance, three types of background knowledge can be incorporated into SimplE by tying the embeddings. In future, SimplE could be improved or may help improve relational learning in several ways including: 1- building ensembles of SimplE models as do it for DistMult, 2- adding SimplE to the relation-level ensembles of , 3- explicitly modelling the analogical structures of relations as in , 4- using ’s 1-N scoring approach to generate many negative triples for a positive triple (Trouillon et al. show that generating more negative triples improves accuracy), 5- combining SimplE with logic-based approaches (e.g., with ) to improve property prediction, 6- combining SimplE with (or use SimplE as a sub-component in) techniques from other categories of relational learning as do with ComplEx, 7- incorporating other types of background knowledge (e.g., entailment) into SimplE embeddings.

References