On Multi-Relational Link Prediction with Bilinear Models

Yanjie Wang, Rainer Gemulla, Hui Li

Introduction

Multi-relational link prediction is the task of predicting missing links in an edge-labeled graph. We focus and use the terminology of knowledge base completion throughout. Large-scale knowledge bases (KB) such as DBPedia (?) or YAGO (?) contain millions of entities and facts, but they are nevertheless far from being complete (?). Given a set of entities (vertices) and relations (edge labels) that hold between these entities, the goal of multi-relational link prediction (?) is to determine whether or not some entity e1e_{1} links to some entity e2e_{2} via a relation RR, i.e., whether the fact R(e1,e2)R(e_{1},e_{2}) is true.

Embedding models have recently received considerable attention for knowledge base completion tasks (?; ?; ?). Such models embed both entities and relations in a low-dimensional latent space such that the structure of the knowledge base is (largely) maintained. The embeddings are subsequently used to predict missing facts or to detect erroneous facts.

The perhaps most basic class of embedding models is given by bilinear models. Such models predict a “score” for each fact R(e1,e2)R(e_{1},e_{2}) by computing a weighted sum—where the weights depend on RR—of the pairwise interactions of the entity embeddings of e1e_{1} and e2e_{2}. The scores are used to rank (pairs of) entities according to their predicted truthfulness. Bilinear models are comparably efficient to train and use and they can provide good prediction performance (?).

A large number of bilinear models has been proposed in the literature, including RESCAL (?), TransE (?), DISTMULT (?), HolE (?), and ComplEx (?). There is, however, little work on the expressiveness of and the connections between various bilinear models. In this paper, we argue that all of the aforementioned models can be seen as bilinear models subject to certain constraints. We study whether and under which conditions each model is universal in that it can represent every possible set of relation instances (or, more precisely, entity rankings). We also explore the size of the embeddings needed for universality and derive upper bounds for the embedding size needed to obtain embeddings consistent with a given dataset. We establish a number of subsumption relationships between various models by giving explicit constructions on how to transform instances of one model to instances of another model (sometimes with a different embedding size). A summary of our results is given in Tab. 3.

We report on an independent experimental study that compared various bilinear models on standard datasets in a common experimental setup. We found that the relative performance among the models is highly relation-dependent. We thus propose a simple relation-level ensemble of multiple bilinear models, which—according to our experiments—significantly and consistently improved prediction performance over individual models. In fact, we found that the ensemble performed competitively to the state-of-the-art embedding approaches, whether or not they are bilinear.

Multi-Relational Link Prediction

Let E\mathcal{E} and R\mathcal{R} be a set of entities and relation names. A knowledge base KE×R×E\mathcal{K}\subseteq\mathcal{E}\times\mathcal{R}\times\mathcal{E} is a collection of triples (i,k,j)(i,k,j) where ii, jj, and kk refer to subject, object and relation, resp. We denote by K=R1K=\lvert\mathcal{R}\rvert\geq 1 and N=E2N=\lvert\mathcal{E}\rvert\geq 2 the number of entities and relations, resp. We represent knowledge base K\mathcal{K} via a binary tensor \mathbfcalX{0,1}N×N×K\mathbfcal{X}\in\{0,1\}^{N\times N\times K}, where xijk=1x_{ijk}=1 if and only if (i,k,j)K(i,k,j)\in\mathcal{K}. By convention, vectors ai{\boldsymbol{a}}_{i} refer to rows of matrix A{\boldsymbol{A}} (as a column vector) and scalars aija_{ij} to individual entries. Given dimensionalities rr and rr^{\prime}, we denote by ei,r{\boldsymbol{e}}_{i,r} the ii-th standard basis vector, by 0r{\boldsymbol{0}}_{r} the zero vector, and by 0r×r{\boldsymbol{0}}_{r\times r^{\prime}} the zero matrix of the respective shape. Finally, let diag()\operatorname{diag}\left(\cdot\right) refer to a block-diagonal matrix built from the arguments (a vector or a list of matrices).

2 Bilinear Models

In this paper, we consider bilinear models as well as models that can be represented as bilinear models with an at most linear increase in model size. Although some of the model considered here may not “look” bilinear at first glance, we show that they are closely related to bilinear models. We denote throughout the set of all models of type tt (and of size rr) and by MtM^{t} (MrtM^{t}_{r}).

RESCAL can be seen as an extension of the low-rank matrix factorization methods prominent in recommender systems to more then one relation.

DISTMULT (?).

DISTMULT can be seen as a variant of RESCAL that puts a diagonality constraint on the relation matrices. Due to this constraint, it can only model symmetric relations. The model is equivalent to the INDSCAL tensor decomposition (?).

HolE (?).

where \star refers to the circular correlation between ai{\boldsymbol{a}}_{i} and aj{\boldsymbol{a}}_{j}, i.e., (aiaj)k=t=1raitaj((k+t2modr)+1)({\boldsymbol{a}}_{i}\star{\boldsymbol{a}}_{j})_{k}=\sum_{t=1}^{r}a_{it}a_{j((k+t-2\mod r)+1)}. The idea of using circular convolution relates to associative memory (?). ? (?) provide an alternative viewpoint in terms of ComplEx, discussed next.

ComplEx (?).

where Re()\operatorname{Re}(\cdot) extracts the real part of a complex number. ComplEx is superficially related to DISTMULT but uses complex-valued parameter matrices. Note that aiTdiag(rk)aj{\boldsymbol{a}}_{i}^{T}\operatorname{diag}\left({\boldsymbol{r}}_{k}\right){\boldsymbol{a}}_{j} is not guaranteed to be real.

TransE (?).

In contrast to the models presented above, TransE is a translation-based model, not a factorization-based model. The use of translations—i.e., differences between entity embeddings—is inspired by Word2Vec’s word analogy results (?). Note that TransE can also be used with L1L_{1} norm instead of L2L_{2}; we focus on the L2L_{2} variant given above throughout.

Subsumption and Expressiveness