RotatE: Knowledge Graph Embedding by Relational Rotation in Complex Space

Zhiqing Sun, Zhi-Hong Deng, Jian-Yun Nie, Jian Tang

Introduction

Knowledge graphs are collections of factual triplets, where each triplet (h,r,t)({\mathtt{h}},{\mathtt{r}},{\mathtt{t}}) represents a relation r{\mathtt{r}} between a head entity h{\mathtt{h}} and a tail entity t{\mathtt{t}}. Examples of real-world knowledge graphs include Freebase (Bollacker et al., 2008), Yago (Suchanek et al., 2007), and WordNet (Miller, 1995). Knowledge graphs are potentially useful to a variety of applications such as question-answering (Hao et al., 2017), information retrieval (Xiong et al., 2017), recommender systems (Zhang et al., 2016), and natural language processing (Yang & Mitchell, 2017). Research on knowledge graphs is attracting growing interests in both academia and industry communities.

Since knowledge graphs are usually incomplete, a fundamental problem for knowledge graph is predicting the missing links. Recently, extensive studies have been done on learning low-dimensional representations of entities and relations for missing link prediction (a.k.a., knowledge graph embedding) (Bordes et al., 2013; Trouillon et al., 2016; Dettmers et al., 2017). These methods have been shown to be scalable and effective. The general intuition of these methods is to model and infer the connectivity patterns in knowledge graphs according to the observed knowledge facts. For example, some relations are symmetric (e.g., marriage) while others are antisymmetric (e.g., filiation); some relations are the inverse of other relations (e.g., hypernym and hyponym); and some relations may be composed by others (e.g., my mother’s husband is my father). It is critical to find ways to model and infer these patterns, i.e., symmetry/antisymmetry, inversion, and composition, from the observed facts in order to predict missing links.

Indeed, many existing approaches have been trying to either implicitly or explicitly model one or a few of the above relation patterns (Bordes et al., 2013; Wang et al., 2014; Lin et al., 2015b; Yang et al., 2014; Trouillon et al., 2016). For example, the TransE model (Bordes et al., 2011), which represents relations as translations, aims to model the inversion and composition patterns; the DisMult model (Yang et al., 2014), which models the three-way interactions between head entities, relations, and tail entities, aims to model the symmetry pattern. However, none of existing models is capable of modeling and inferring all the above patterns. Therefore, we are looking for an approach that is able to model and infer all the three types of relation patterns.

It turns out that such a simple operation can effectively model all the three relation patterns: symmetric/antisymmetric, inversion, and composition. For example, a relation r{\mathtt{r}} is symmetric if and only if each element of its embedding r{\mathbf{r}}, i.e. rir_{i}, satisfies ri=e0/iπ=±1r_{i}=e^{0/i\pi}=\pm 1; two relations r1{\mathtt{r}}_{1} and r2{\mathtt{r}}_{2} are inverse if and only if their embeddings are conjugates: r2=rˉ1{\mathbf{r}}_{2}=\bar{{\mathbf{r}}}_{1}; a relation r3=eiθ3{\mathbf{r}}_{3}=e^{i\bm{\theta_{3}}} is a combination of other two relations r1=eiθ1{\mathbf{r}}_{1}=e^{i\bm{\theta_{1}}} and r2=eiθ2{\mathbf{r}}_{2}=e^{i\bm{\theta_{2}}} if and only if r3=r1r2{\mathbf{r}}_{3}={\mathbf{r}}_{1}\circ{\mathbf{r}}_{2} (i.e. θ3=θ1+θ2\bm{\theta_{3}}=\bm{\theta_{1}}+\bm{\theta_{2}}). Moreover, the RotatE model is scalable to large knowledge graphs as it remains linear in both time and memory.

To effectively optimizing the RotatE, we further propose a novel self-adversarial negative sampling technique, which generates negative samples according to the current entity and relation embeddings. The proposed technique is very general and can be applied to many existing knowledge graph embedding models. We evaluate the RotatE on four large knowledge graph benchmark datasets including FB15k (Bordes et al., 2013), WN18 (Bordes et al., 2013), FB15k-237 (Toutanova & Chen, 2015) and WN18RR (Dettmers et al., 2017). Experimental results show that the RotatE model significantly outperforms existing state-of-the-art approaches. In addition, RotatE also outperforms state-of-the-art models on Countries (Bouchard et al., 2015), a benchmark explicitly designed for composition pattern inference and modeling. To the best of our knowledge, RotatE is the first model that achieves state-of-the-art performance on all the benchmarks.The codes of our paper are available online: https://github.com/DeepGraphLearning/KnowledgeGraphEmbedding.

Related Work

Predicting missing links with knowledge graph embedding (KGE) methods has been extensively investigated in recent years. The general methodology is to define a score function for the triplets. Formally, let E\mathcal{E} denote the set of entities and R\mathcal{R} denote the set of relations, then a knowledge graph is a collection of factual triplets (h,r,t)({\mathtt{h}},{\mathtt{r}},{\mathtt{t}}), where h,tE{\mathtt{h}},{\mathtt{t}}\in\mathcal{E} and rR{\mathtt{r}}\in\mathcal{R}. Since entity embeddings are usually represented as vectors, the score function usually takes the form fr(h,t)f_{r}({\mathbf{h}},{\mathbf{t}}), where h{\mathbf{h}} and t{\mathbf{t}} are head and tail entity embeddings. The score function fr(h,t)f_{r}({\mathbf{h}},{\mathbf{t}}) measures the salience of a candidate triplet (h,r,t)({\mathtt{h}},{\mathtt{r}},{\mathtt{t}}). The goal of the optimization is usually to score true triplet (h,r,t)({\mathtt{h}},{\mathtt{r}},{\mathtt{t}}) higher than the corrupted false triplets (h,r,t)({\mathtt{h}}^{\prime},{\mathtt{r}},{\mathtt{t}}) or (h,r,t)({\mathtt{h}},{\mathtt{r}},{\mathtt{t}}^{\prime}). Table 1 summarizes different score functions fr(h,t)f_{r}({\mathbf{h}},{\mathbf{t}}) in previous state-of-the-art methods as well as the model proposed in this paper. These models generally capture only a portion of the relation patterns. For example, TransE represents each relation as a bijection between source entities and target entities, and thus implicitly models inversion and composition of relations, but it cannot model symmetric relations; ComplEx extends DistMult by introducing complex embeddings so as to better model asymmetric relations, but it cannot infer the composition pattern. The proposed RotatE model leverages the advantages of both.

A relevant and concurrent work to our work is the TorusE (Ebisu & Ichise, 2018) model, which defines knowledge graph embedding as translations on a compact Lie group. The TorusE model can be regarded as a special case of RotatE, where the modulus of embeddings are set fixed; our RotatE is defined on the entire complex space, which has much more representation capacity. Our experiments show that this is very critical for modeling and inferring the composition patterns. Moreover, TorusE focuses on the problem of regularization in TransE while this paper focuses on modeling and inferring multiple types of relation patterns.

There are also a large body of relational approaches for modeling the relational patterns on knowledge graphs (Lao et al., 2011; Neelakantan et al., 2015; Das et al., 2016; Rocktäschel & Riedel, 2017; Yang et al., 2017). However, these approaches mainly focus on explicitly modeling the relational paths while our proposed RotatE model implicitly learns the relation patterns, which is not only much more scalable but also provides meaningful embeddings for both entities and relations.

Another related problem is how to effectively draw negative samples for training knowledge graph embeddings. This problem has been explicitly studied by Cai & Wang (2017), which proposed a generative adversarial learning framework to draw negative samples. However, such a framework requires simultaneously training the embedding model and a discrete negative sample generator, which are difficult to optimize and also computationally expensive. We propose a self-adversarial sampling scheme which only relies on the current model. It does require any additional optimization component, which make it much more efficient.

RotatE: Relational Rotation in Complex Vector Space

In this section, we introduce our proposed RotatE model. We first introduce three important relation patterns that are widely studied in the literature of link prediction on knowledge graphs. Afterwards, we introduce our proposed RotatE model, which defines relations as rotations in complex vector space. We also show that the RotatE model is able to model and infer all three relation patterns.

The key of link prediction in knowledge graph is to infer the connection patterns, e.g., relation patterns, with observed facts. According to the existing literature (Trouillon et al., 2016; Toutanova & Chen, 2015; Guu et al., 2015; Lin et al., 2015a), three types of relation patterns are very important and widely spread in knowledge graphs: symmetry, inversion and composition. We give their formal definition here:

A relation r{\mathtt{r}} is symmetric (antisymmetric) if x,y\forall{\mathtt{x}},{\mathtt{y}}

A clause with such form is a symmetry (antisymmetry) pattern.

Relation r1{\mathtt{r}}_{1} is inverse to relation r2{\mathtt{r}}_{2} if x,y\forall{\mathtt{x}},{\mathtt{y}}

A clause with such form is a inversion pattern.

Relation r1{\mathtt{r}}_{1} is composed of relation r2{\mathtt{r}}_{2} and relation r3{\mathtt{r}}_{3} if x,y,z\forall{\mathtt{x}},{\mathtt{y}},{\mathtt{z}}

A clause with such form is a composition pattern.

According to the definition of the above three types of relation patterns, we provide an analysis of existing models on their abilities in inferring and modeling these patterns. Specifically, we provide an analysis on TransE, TransX, DistMult, and ComplEx.See discussion at Appendix A We did not include the analysis on HolE and ConvE since HolE is equivalent to ComplEx (Hayashi & Shimbo, 2017), and ConvE is a black box that involves two-layer neural networks and convolution operations, which are hard to analyze. The results are summarized into Table 2. We can see that no existing approaches are capable of modeling all the three relation patterns.

2 Modeling Relations as Rotations in Complex Vector Space

By defining each relation as a rotation in the complex vector spaces, RotatE can model and infer all the three types of relation patterns introduced above. Formally, we have following resultsWe relegate all proofs to the appendix.:

RotatE can infer the symmetry/antisymmetry pattern. (See proof in Appendix B)

RotatE can infer the inversion pattern. (See proof in Appendix C)

RotatE can infer the composition pattern. (See proof in Appendix D)

These results are also summarized into Table 2. We can see that the RotatE model is the only model that can model and infer all the three types of relation patterns.

From Table 2, we can see that TransE is able to infer and model all the other relation patterns except the symmetry pattern. The reason is that in TransE, any symmetric relation will be represented by a 0\mathbf{0} translation vector. As a result, this will push the entities with symmetric relations to be close to each other in the embedding space. RotatE solves this problem and is a able to model and infer the symmetry pattern. An arbitrary vector r{\mathbf{r}} that satisfies ri=±1r_{i}=\pm 1 can be used for representing a symmetric relation in RotatE, and thus the entities having symmetric relations can be distinguished. Different symmetric relations can be also represented with different embedding vectors. Figure 1 provides illustrations of TransE and RotatE with only 1-dimensional embedding and shows how RotatE models a symmetric relation.

3 Optimization

Negative sampling has been proved quite effective for both learning knowledge graph embedding (Trouillon et al., 2016) and word embedding (Mikolov et al., 2013). Here we use a loss function similar to the negative sampling loss (Mikolov et al., 2013) for effectively optimizing distance-based models:

where γ\gamma is a fixed margin, σ\sigma is the sigmoid function, and (hi,r,ti)({\mathtt{h}}_{i}^{\prime},{\mathtt{r}},{\mathtt{t}}_{i}^{\prime}) is the ii-th negative triplet.

We also propose a new approach for drawing negative samples. The negative sampling loss samples the negative triplets in a uniform way. Such a uniform negative sampling suffers the problem of inefficiency since many samples are obviously false as training goes on, which does not provide any meaningful information. Therefore, we propose an approach called self-adversarial negative sampling, which samples negative triples according to the current embedding model. Specifically, we sample negative triples from the following distribution:

where α\alpha is the temperature of sampling. Moreover, since the sampling procedure may be costly, we treat the above probability as the weight of the negative sample. Therefore, the final negative sampling loss with self-adversarial training takes the following form:

In the experiments, we will compare different approaches for negative sampling.

Experiments

We evaluate our proposed model on four widely used knowledge graphs. The statistics of these knowledge graphs are summarized into Table 3.

FB15k (Bordes et al., 2013) is a subset of Freebase (Bollacker et al., 2008), a large-scale knowledge graph containing general knowledge facts. Toutanova & Chen (2015) showed that almost 81%81\% of the test triplets (x,r,y)\mathtt{(x,r,y)} can be inferred via a directly linked triplet (x,r,y)\mathtt{(x,r^{\prime},y)} or (y,r,x)\mathtt{(y,r^{\prime},x)}. Therefore, the key of link prediction on FB15k is to model and infer the symmetry/antisymmetry and inversion patterns.

WN18 (Bordes et al., 2013) is a subset of WordNet (Miller, 1995), a database featuring lexical relations between words. This dataset also has many inverse relations. So the main relation patterns in WN18 are also symmetry/antisymmetry and inversion.

FB15k-237 (Toutanova & Chen, 2015) is a subset of FB15k, where inverse relations are deleted. Therefore, the key of link prediction on FB15k-237 boils down to model and infer the symmetry/antisymmetry and composition patterns.

WN18RR (Dettmers et al., 2017) is a subset of WN18. The inverse relations are deleted, and the main relation patterns are symmetry/antisymmetry and composition.

We use Adam (Kingma & Ba, 2014) as the optimizer and fine-tune the hyperparameters on the validation dataset. The ranges of the hyperparameters for the grid search are set as follows: embedding dimensionFollowing Trouillon et al. (2016), we treat complex number as the same as real number with regard to the embedding dimension. If the same number of dimension is used for both the real and imaginary parts of the complex number as the real number, the number of parameters for the complex embedding would be twice the number of parameters for the embeddings in the real space. k{125,250,500,1000}k\in\{125,250,500,1000\}, batch size b{512,1024,2048}b\in\{512,1024,2048\}, self-adversarial sampling temperature α{0.5,1.0}\alpha\in\{0.5,1.0\}, and fixed margin γ{3,6,9,12,18,24,30}\gamma\in\{3,6,9,12,18,24,30\}. Both the real and imaginary parts of the entity embeddings are uniformly initialized, and the phases of the relation embeddings are uniformly initialized between and 2π2\pi. No regularization is used since we find that the fixed margin γ\gamma could prevent our model from over-fitting.

Evaluation Settings.

We evaluate the performance of link prediction in the filtered setting: we rank test triples against all other candidate triples not appearing in the training, validation, or test set, where candidates are generated by corrupting subjects or objects: (h,r,t)({\mathtt{h}}^{\prime},{\mathtt{r}},{\mathtt{t}}) or (h,r,t)({\mathtt{h}},{\mathtt{r}},{\mathtt{t}}^{\prime}). Mean Rank (MR), Mean Reciprocal Rank (MRR) and Hits at N (H@N) are standard evaluation measures for these datasets and are evaluated in our experiments.

Baseline.

Apart from RotatE, we propose a variant of RotatE as baseline, where the modulus of the entity embeddings are also constrained: hi=ti=C|h_{i}|=|t_{i}|=C, and the distance function is thus 2Csinθh+θrθt22C\left\lVert\sin\frac{\bm{\theta}_{h}+\bm{\theta}_{r}-\bm{\theta}_{t}}{2}\right\rVert (See Equation 17 at Appendix F for a detailed derivation). In this way, we can investigate how RotatE works without modulus information and with only phase information. We refer to the baseline as pRotatE. It is obvious to see that pRotatE can also model and infer all the three relation patterns.

2 Main Results

We compare RotatE to several state-of-the-art models, including TransE (Bordes et al., 2013), DistMult (Yang et al., 2014), ComplEx (Trouillon et al., 2016), HolE (Nickel et al., 2016), and ConvE (Dettmers et al., 2017), as well as our baseline model pRotatE, to empirically show the importance of modeling and inferring the relation patterns for the task of predicting missing links.

Table 4 summarizes our results on FB15k and WN18. We can see that RotatE outperforms all the state-of-the-art models. The performance of pRotatE and RotatE are similar on these two datasets. Table 5 summarizes our results on FB15k-237 and WN18RR, where the improvement is much more significant. The difference between RotatE and pRotatE is much larger on FB15k-237 and WN18RR, where there are a lot of composition patterns. This indicates that modulus is very important for modeling and inferring the composition pattern.

Moreover, the performance of these models on different datasets is consistent with our analysis on the three relation patterns (Table 2):

On FB15K, the main relation patterns are symmetry/antisymmetry and inversion. We can see that ComplEx performs well while TransE does not perform well since ComplEx can infer both symmetry/antisymmetry and inversion patterns while TransE cannot infer symmetry pattern. Surprisingly, DistMult achieves good performance on this dataset although it cannot model the antisymmetry and inversion patterns. The reason is that for most of the relations in FB15K, the types of head entities and tail entities are different. Although DistMult gives the same score to a true triplet (h,r,t)\mathtt{(h,r,t)} and its opposition triplet (t,r,h)\mathtt{(t,r,h)}, (t,r,h)\mathtt{(t,r,h)} is usually impossible to be valid since the entity type of tt does not match the head entity type of h{\mathtt{h}}. For example, DistMult assigns the same score to (Obama,nationality,USA)\mathtt{(Obama,nationality,USA)} and (USA,nationality,Obama)\mathtt{(USA,nationality,Obama)}. But (USA,nationality,Obama)\mathtt{(USA,nationality,Obama)} can be simply predicted as false since USA\mathtt{USA} cannot be the head entity of the relation nationality\mathtt{nationality}.

On WN18, the main relation patterns are also symmetry/antisymmetry and inversion. As expected, ComplEx still performs very well on this dataset. However, different from the results on FB15K, the performance of DistMult significantly decreases on WN18. The reason is that DistMult cannot model antisymmetry and inversion patterns, and almost all the entities in WN18 are words and belong to the same entity type, which do not have the same problem as FB15K.

On FB15k-237, the main relation pattern is composition. We can see that TransE performs really well while ComplEx does not perform well. The reason is that, as discussed before, TransE is able to infer the composition pattern while ComplEx cannot infer the composition pattern.

On WN18RR, one of the main relation patterns is the symmetry pattern since almost each word has a symmetric relation in WN18RR, e.g., also_seealso\_see and similar_tosimilar\_to. TransE does not well on this dataset since it is not able to model the symmetric relations.

3 Inferring Relation Patterns on Countries DataSet

We also evaluate our model on the Countries dataset (Bouchard et al., 2015; Nickel et al., 2016), which is carefully designed to explicitly test the capabilities of the link prediction models for composition pattern modeling and inferring. It contains 2 relations and 272 entities (244 countries, 5 regions and 23 subregions). Unlike link prediction on general knowledge graphs, the queries in Countries are of the form locatedIn(c,?)\mathtt{locatedIn(c,?)}, and the answer is one of the five regions. The Countries dataset has 3 tasks, each requiring inferring a composition pattern with increasing length and difficulty. For example, task S2 requires inferring a relatively simpler composition pattern:

while task S3 requires inferring the most complex composition pattern:

In Table 6, we report the results with respect to the AUC-PR metric, which is commonly used in the literature. We can see that RotatE outperforms all the previous models. The performance of RotatE is significantly better than other methods on S3, which is the most difficult task.

4 Implicit Relation Pattern Inference

In this section, we verify whether the relation patterns are implicitly represented by RotatE relation embeddings. We ignore the specific positions in the relation embedding θr\bm{\theta}_{r} and plot the histogram of the phase of each element in the relation embedding, i.e., {θr,i\theta_{r,i}}.

Symmetry pattern requires the symmetric relations to have property rr=1{\mathbf{r}}\circ{\mathbf{r}}=\mathbf{1}, and the solution is ri=±1r_{i}=\pm 1. We investigate the relation embeddings from a 500500-dimensional RotatE trained on WN18. Figure 2(a) gives the histogram of the embedding phases of a symmetric relation similar_tosimilar\_to. We can find that the embedding phases are either π\pi (ri=1r_{i}=-1) or 0,2π0,2\pi (ri=1r_{i}=1). It indicates that the RotatE model does infer and model the symmetry pattern. Figure 2(b) is the histogram of relation hypernym\mathtt{hypernym}, which shows that the embedding of a general relation does not have such a ±1\pm 1 pattern.

Inversion pattern requires the embeddings of a pair of inverse relations to be conjugate. We use the same RotatE model trained on WN18 for an analysis. Figure 2(c) illustrates the element-wise addition of the embedding phases from relation r1=hypernym{\mathtt{r}}_{1}=\mathtt{hypernym} and its inversed relation r2=hyponym{\mathtt{r}}_{2}=\mathtt{hyponym}. All the additive embedding phases are or 2π2\pi, which represents that r1=r21{\mathbf{r}}_{1}={\mathbf{r}}_{2}^{-1}. This case shows that the inversion pattern is also inferred and modeled in the RotatE model.

Composition pattern requires the embedding phases of the composed relation to be the addition of the other two relations. Since there is no significant composition pattern in WN18, we study the inference of the composition patterns on FB15k-237, where a 10001000-dimensional RotatE is trained. Figure 2(d) - 2(g) illustrate such a r1=r2r3{\mathbf{r}}_{1}={\mathbf{r}}_{2}\circ{\mathbf{r}}_{3} case, where θ2,i+θ3,i=θ1,i\theta_{2,i}+\theta_{3,i}=\theta_{1,i} or θ2,i+θ3,i=θ1,i+2π\theta_{2,i}+\theta_{3,i}=\theta_{1,i}+2\pi.

More results of implicitly inferring basic patterns are presented in the appendix.

5 Comparing different negative sampling techniques

In this part, we compare different negative sampling techniques including uniform sampling, our proposed self-adversarial technique, and the KBGAN model (Cai & Wang, 2017), which aims to optimize a generative adversarial network to generate the negative samples. We re-implement a 5050-dimension TransE model with the margin-based ranking criterion that was used in (Cai & Wang, 2017), and evaluate its performance on FB15k-237, WN18RR and WN18 with self-adversarial negative sampling. Table 7 summarizes our results. We can see that self-adversarial sampling is the most effective negative sampling technique.

6 Further Experiments on TransE and ComplEx

One may argue that the contribution of RotatE comes from the self-adversarial negative sampling technique. In this part, we conduct further experiments on TransE and ComplEx in the same setting as RotatE to make a fair comparison among the three models. Table 8 shows the results of TransE and ComplEx trained with the self-adversarial negative sampling technique on FB15k and FB15k-237 datasets, where a large number of relations are available. In addition, we evaluate these three models on the Countries dataset, which explicitly requires inferring the composition pattern. We also provide a detailed ablation study on TransE and RotatE in the appendix.

From Table 8, we can see that similar results are observed as Table 4 and 5. The RotatE model achieves the best performance on both FB15k and FB15k-237, as it is able to model all the three relation patterns. The TransE model does not work well on the FB15k datasets, which requires modeling the symmetric pattern; the ComplEx model does not work well on FB15k-237, which requires modeling the composition pattern. The results on the Countries dataset are a little bit different, where the TransE model slightly outperforms RoateE on the S3 task. The reason is that the Countries datasets do not have the symmetric relation between different regions, and all the three tasks in the Countries datasets only require inferring the region for a given city. Therefore, the TransE model does not suffer from its inability of modeling symmetric relations. For ComplEx, we can see that it does not perform well on Countries since it cannot infer the composition pattern.

7 Experimental results on FB15k by relation category

We also did some further investigation on the performance of RotatE on different relation categories: one-to-many, many-to-one, and many-to-many relationsFollowing Wang et al. (2014), for each relation r{\mathtt{r}}, we compute the average number of tails per head (tphrtphr) and the average number of head per tail (hptrhptr). If tphr<1.5tphr<1.5 and hptr<1.5hptr<1.5, r{\mathtt{r}} is treated as one-to-one; if tphr1.5tphr\geq 1.5 and hptr1.5hptr\geq 1.5, r{\mathtt{r}} is treated as a many-to-many; if tphr<1.5tphr<1.5 and hptr1.5hptr\geq 1.5, r{\mathtt{r}} is treated as one-to-many.. The results of RotatE on different relation categories on the data set FB15k are summarized into Table 9. We also compare an additional approach KG2E_KL (He et al., 2015), which is a probabilistic framework for knowledge graph embedding methods and aims to model the uncertainties of the entities and relations in knowledge graphs with the TransE model. We also summarize the statistics of different relation categories into Table 10 in the appendix.

We can see that besides the one-to-one relation, the RotatE model also performs quite well on the non-injective relations, especially on many-to-many relations. We also notice that the probabilistic framework KG2E_KL(bern) (He et al., 2015) is quite powerful, which consistently outperforms its corresponding knowledge graph embedding model, showing the importance of modeling the uncertainties in knowledge graphs. We leave the work of modeling the uncertainties in knowledge graphs with RotatE as our future work.

Conclusion

We have proposed a new knowledge graph embedding method called RotatE, which represents entities as complex vectors and relations as rotations in complex vector space. In addition, we propose a novel self-adversarial negative sampling technique for efficiently and effectively training the RotatE model. Our experimental results show that the RotatE model outperforms all existing state-of-the-art models on four large-scale benchmarks. Moreover, RotatE also achieves state-of-the-art results on a benchmark that is explicitly designed for composition pattern inference and modeling. A deep investigation into RotatE relation embeddings shows that the three relation patterns are implicitly represented in the relation embeddings. In the future, we plan to evaluate the RotatE model on more datasets and leverage a probabilistic framework to model the uncertainties of entities and relations.

References

Appendix

Appendix A Discussion on the Ability of Pattern Modeling and Inference

No existing models are capable of modeling all the three relation patterns. For example, TransE cannot model the symmetry pattern because it would yield r=0{\mathbf{r}}=\mathbf{0} for symmetric relations; TransX can infer and model the symmetry/antisymmetry pattern when gr,1=gr,2g_{r,1}=g_{r,2}, e.g. in TransH (Wang et al., 2014), but cannot infer inversion and composition as gr,1g_{r,1} and gr,2g_{r,2} are invertible matrix multiplications; due to its symmetric nature, DistMult is difficult to model the asymmetric and inversion pattern; ComplEx addresses the problem of DisMult and is able to infer both the symmetry and asymmetric patterns with complex embeddings. Moreover, it can infer inversion rules because the complex conjugate of the solution to arg maxr(x,r,y)\operatorname*{arg\,max}_{{\mathbf{r}}}\real(\langle{\mathbf{x}},{\mathbf{r}},\overline{{\mathbf{y}}}\rangle) is exactly the solution to arg maxr(y,r,x)\operatorname*{arg\,max}_{{\mathbf{r}}}\real(\langle{\mathbf{y}},{\mathbf{r}},\overline{{\mathbf{x}}}\rangle). However, ComplEx cannot infer composition rules, since it does not model a bijection mapping from h{\mathbf{h}} to t{\mathbf{t}} via relation r{\mathbf{r}}. These concerns are summarized in Table 2.

Appendix B Proof of Lemma 1

if r(x,y){\mathtt{r}}({\mathtt{x}},{\mathtt{y}}) and r(y,x){\mathtt{r}}({\mathtt{y}},{\mathtt{x}}) hold, we have

Otherwise, if r(x,y){\mathtt{r}}({\mathtt{x}},{\mathtt{y}}) and ¬r(y,x)\neg{\mathtt{r}}({\mathtt{y}},{\mathtt{x}}) hold, we have

Appendix C Proof of Lemma 2

if r1(x,y){\mathtt{r}}_{1}({\mathtt{x}},{\mathtt{y}}) and r2(y,x){\mathtt{r}}_{2}({\mathtt{y}},{\mathtt{x}}) hold, we have

Appendix D Proof of Lemma 3

if r1(x,z){\mathtt{r}}_{1}({\mathtt{x}},{\mathtt{z}}), r2(x,y){\mathtt{r}}_{2}({\mathtt{x}},{\mathtt{y}}) and r3(y,z){\mathtt{r}}_{3}({\mathtt{y}},{\mathtt{z}}) hold, we have

Appendix E Properties of RotatE

A useful property for RotatE is that the inverse of a relation can be easily acquired by complex conjugate. In this way, the RotatE model treats head and tail entities in a uniform way, which is potentially useful for efficient 1-N scoring (Dettmers et al., 2017):

Moreover, considering the embeddings in the polar form, i.e., hi=mh,ieiθh,i,ri=eiθr,i,ti=mt,ieiθt,ih_{i}=m_{h,i}e^{i\theta_{h,i}},r_{i}=e^{i\theta_{r,i}},t_{i}=m_{t,i}e^{i\theta_{t,i}}, we can rewrite the RotatE distance function as:

This equation provides two interesting views of the model:

(1) When we constrain the modulus mh,i=mt,i=Cm_{h,i}=m_{t,i}=C, the distance function is reduced to 2Csinθh+θrθt22C\left\lVert\sin\frac{\bm{\theta}_{h}+\bm{\theta}_{r}-\bm{\theta}_{t}}{2}\right\rVert. We can see that this is very similar to the distance function of TransE: h+rt\left\lVert{\mathbf{h}}+{\mathbf{r}}-{\mathbf{t}}\right\rVert. Based on this intuition, we can show that:

RotatE can degenerate into TransE. (See proof at Appendix F)

which indicates that RotatE is able to simulate TransE.

(2) The modulus provides the lower bound of the distance function, which is mhmt\left\lVert{\mathbf{m}}_{h}-{\mathbf{m}}_{t}\right\rVert.

Appendix F Proof of Theorem 4

By further restricting hi=ti=C|h_{i}|=|t_{i}|=C, we can rewrite h,r,t{\mathbf{h}},{\mathbf{r}},{\mathbf{t}} by

If the embedding of (h,r,t)({\mathtt{h}},{\mathtt{r}},{\mathtt{t}}) in TransE is h,r,t{\mathbf{h}}^{\prime},{\mathbf{r}}^{\prime},{\mathbf{t}}^{\prime}, let θh=ch,θr=cr,θt=ct\bm{\theta}_{h}=c{\mathbf{h}}^{\prime},\bm{\theta}_{r}=c{\mathbf{r}}^{\prime},\bm{\theta}_{t}=c{\mathbf{t}}^{\prime} and C=1/cC=1/c , we have

Appendix G Link Prediction on YAGO3-10

YAGO3-10 is a subset of YAGO3 (Mahdisoltani et al., 2013), which consists of entities that have a minimum of 10 relations each. It has 123,182 entities and 37 relations. Most of the triples deal with descriptive attributes of people, such as citizenship, gender, profession and marital status.

Table 11 shows that the RotatE model also outperforms state-of-the-art models on YAGO3-10.

Appendix H Hyperparameters

We list the best hyperparameter setting of RotatE w.r.t the validation dataset on several benchmarks in Table 12.

Appendix I Ablation Study

Table 13 shows our ablation study of self-adversarial sampling and negative sampling loss on FB15k-237. We also re-implement a 1000-dimension TransE and do ablation study on it. From the table, We can find that self-adversarial sampling boosts the performance for both models, while negative sampling loss is only effective on RotatE; in addition, our re-implementation of TransE also outperforms all the state-of-the-art models on FB15k-237.

Appendix J Variance of the Results

In Table 14, We provide the average and variance of the MRR results on FB15k, WN18, FB15k-237 and WN18RR. Both the average and the variance is calculated by three runs of RotatE with difference random seeds. We can find that the performance of RotatE is quite stable for different random initialization.

Appendix K More results of implicit basic pattern inference

We provide more histograms of embedding phases in Figure 3 - 5.