Canonical Tensor Decomposition for Knowledge Base Completion

Timothée Lacroix, Nicolas Usunier, Guillaume Obozinski

Introduction

In knowledge base completion, the learner is given triples (subject, predicate, object) of facts about the world, and has to infer new triples that are likely but not yet known to be true. This problem has attracted a lot of attention (Nickel et al., 2016a; Nguyen, 2017) both as an example application of large-scale tensor factorization, and as a benchmark of learning representations of relational data.

The standard completion task is link prediction, which consists in answering queries (subject, predicate, ?) or (?, predicate, object). In that context, the canonical decomposition of tensors (also called CANDECOMP/PARAFAC or CP) (Hitchcock, 1927) is known to perform poorly compared to more specialized methods. For instance, DistMult (Yang et al., 2014), a particular case of CP which shares the factors for the subject and object modes, was recently shown to have state-of-the-art results (Kadlec et al., 2017). This result is surprising because DistMult learns a tensor that is symmetric in the subject and object modes, while the datasets contain mostly non-symmetric predicates.

The goal of this paper is to study whether and how CP can perform as well as its competitors. To that end, we evaluate three possibilities.

First, as Kadlec et al. (2017) showed that performances for these tasks are sensitive to the loss function and optimization parameters, we re-evaluate CP with a broader parameter search and a multiclass log-loss.

Second, since the best performing approaches are less expressive than CP, we evaluate whether regularization helps. On this subject, we show that the standard regularization used in knowledge base completion does not correspond to regularization with a tensor norm. We then propose to use tensor nuclear pp-norms (Friedland & Lim, 2018), with the goal of designing more principled regularizers.

Third, we propose a different formulation of the objective, in which we model separately predicates and their inverse: for each predicate pred{\rm pred}, we create an inverse predicate pred1{\rm pred}^{-1} and create a triple (obj,pred1,sub)({\rm obj},{\rm pred}^{-1},{\rm sub}) for each training triple (sub,pred,obj)({\rm sub},{\rm pred},{\rm obj}). At test time, queries of the form (?,pred,obj)(?,{\rm pred},{\rm obj}) are answered as (obj,pred1,?)({\rm obj},{\rm pred}^{-1},?). Similar formulations were previously used by Shen et al. (2016) and Joulin et al. (2017), but for different models for which there was no clear alternative, so the impact of this reformulation has never been evaluated.

To assess whether the results we obtain are specific to CP, we also carry on the same experiments with a state-of-the-art model, ComplEx (Trouillon et al., 2016). ComplEx has the same expressivity as CP in the sense that it can represent any tensor, but it implements a specific form of parameter sharing. We perform all our experiments on 55 common benchmark datasets of link prediction in knowledge bases.

Our results first confirm that within a reasonable time budget, the performance of both CP and ComplEx are highly dependent on optimization parameters. With systematic parameter searches, we obtain better results for ComplEx than what was previously reported, confirming its status as a state-of-the-art model on all datasets. For CP, the results are still way below its competitors.

Learning and predicting with the inverse predicates, however, changes the picture entirely. First, with both CP and ComplEx, we obtain significant gains in performance on all the datasets. More precisely, we obtain state-of-the-art results with CP, matching those of ComplEx. For instance, on the benchmark dataset FB15K (Bordes et al., 2013), the mean reciprocal rank of vanilla CP and vanilla ComplEx are 0.400.40 and 0.800.80 respectively, and it grows to 0.860.86 for both approaches when modeling the inverse predicates.

Finally, the new regularizer we propose based on the nuclear 33-norm, does not dramatically help CP, which leads us to believe that a careful choice of regularization is not crucial for these CP models. Yet, for both CP and ComplEx with inverse predicates, it yields small but significant improvements on the more difficult datasets.

Tensor Factorization of Knowledge Bases

We describe in this section the formal framework we consider for knowledge base completion and more generally link prediction in relational data, the learning criteria, as well as the approaches that we will discuss.

We consider relational data that comes in the form of triples (subject, predicate, object), where the subject and the object are from the same set of entities. In knowledge bases, these triples represent facts about entities of the world, such as (W ⁣ashington,capital_of,U ⁣S ⁣A)(W\!ashington,capital\_of,U\!S\!A). A training set S{\cal S} contains triples of indices S={(i1,j1,k1),...,(iS,jS,kS)}{\cal S}=\{(i_{1},j_{1},k_{1}),...,(i_{|{\cal S}|},j_{|{\cal S}|},k_{|{\cal S}|})\} that represent predicates that are known to hold. The validation and test sets contain queries of the form (?,j,k)(?,j,k) and (i,j,?)(i,j,?), created from triples (i,j,k)(i,j,k) that are known to hold but held-out from the training set. To give orders of magnitude, the largest datasets we experiment on, FB15K and YAGO3-10, contain respectively 15k/1.3k15k/1.3k and 123k/37123k/37 entities/predicates.

2 Tensor Decomposition for Link Prediction

Several approaches have considered link prediction as a low-rank tensor decomposition problem. These models then differ only by structural constraints on the learned tensor. Three models of interest are:

A representation of this decomposition, and the score of a specific triple is given in Figure 1 (a). Given X{\bm{X}}, the smallest RR for which this decomposition holds is called the canonical rank of X{\bm{X}}.

DistMult.

ComplEx.

where ur(1)\overline{u}^{(1)}_{r} is the complex conjugate of ur(1)u^{(1)}_{r}. This decomposition can represent any real tensor (Trouillon et al., 2016).

The good performances of DistMult on notoriously non-symmetric datasets such as FB15K or WN18 are surprising. First, let us note that for the symmetricity to become an issue, one would have to evaluate queries (i,j,?)(i,j,?) while also trying to answer correctly to queries of the form (?,j,i)(?,j,i) for a non-symmetric predicate jj. The ranking for these two queries would be identical, and thus, we can expect issues with relations such as capital_ofcapital\_of. In FB15K, those type of problematic queries make up only 4%4\% of the test set and thus, have a small impact. On WN18 however, they make up 60%60\% of the test set. We describe in appendix 8.1 a simple strategy for DistMult to have a high filtered MRR on the hierarchical predicates of WN18 despite its symmetricity assumption.

3 Training

Previous work suggested ranking losses (Bordes et al., 2013), binary logistic regression (Trouillon et al., 2016) or sampled multiclass log-loss (Kadlec et al., 2017). Motivated by the solid results in Joulin et al. (2017), our own experimental results, and with a satisfactory speed of about two minutes per epoch on FB15K, we decided to use the full multiclass log-loss.

These two partial losses are represented in Figure 1 (b). For CP, the final tensor is computed by finding a minimizer of a regularized empirical risk formulation, where the factors ur(d)u^{(d)}_{r} are weighted in a data-dependent manner by wS(d)w_{{\cal S}}^{(d)}, which we describe below:

where \odot is the entry-wise multiplication of vectors. For DistMult and ComplEx, the learning objective is similar, up to the appropriate parameter sharing and computation of the tensor.

As discussed in Section 3.2, the weights wS(d)w_{{\cal S}}^{(d)} may improve performances when some rows/columns are sampled more than others. They appear naturally in optimization with stochastic gradient descent when the regularizer is applied only to the parameters that are involved in the computation of the instantaneous loss. For instance, in the case of the logistic loss with negative sampling used by Trouillon et al. (2016), denoting by qidq^{d}_{i} the marginal probability (over S{\cal S}) that index ii appears in mode dd of a data triple, these weights are wS,i(d)=qid+αw_{{\cal S},i}^{(d)}=\sqrt{q^{d}_{i}+\alpha} for some α>0\alpha>0 that depends on the negative sampling scheme.

We focus on redefining the loss (1) and the regularizer (2.3).

Related Work

We discuss here in more details the work that has been done on link prediction in relational data and on regularizers for tensor completion.

There has been extensive research on link prediction in relational data, especially in knowledge bases, and we review here only the prior work that is most relevant to this paper. While some approaches explicitly use the graph structure during inference (Lao et al., 2011), we focus here on representation learning and tensor factorization methods, which are the state-of-the-art on the benchmark datasets we use. We also restrict the discussion to approaches that only use relational information, even though some approaches have been proposed to leverage additional types (Krompass et al., 2015; Ma et al., 2017) or external word embeddings (Toutanova & Chen, 2015).

We can divide the first type of approaches into two broad categories. First, two-way approaches score a triple (i,j,k)(i,j,k) depending only on bigram interaction terms of the form subject-object, subject-predicate, and predicate-object. Even though they are tensor approximation algorithms of limited expressivity, two-way models based on translations TransE, or on bag-of-word representations (Joulin et al., 2017) have proved competitive on many benchmarks. Yet, methods using three-way multiplicative interactions, as described in the previous section, show the strongest performances (Bordes et al., 2011; Garcia-Duran et al., 2016; Nickel et al., 2016b; Trouillon et al., 2016). Compared to general-purpose tensor factorization methods such as CP, a common feature of these approaches is to share parameters between objects and subjects modes (Nickel et al., 2011), an idea that has been widely accepted except for the two-way model of Joulin et al. (2017). DistMult (Yang et al., 2014) is the extreme case of this parameter sharing, in which the predicted tensor is symmetric in the subject and object modes.

2 Regularization for Matrix Completion

Norm-based regularization has been extensively studied in the context of matrix completion. The trace norm (or nuclear norm) has been proposed as a convex relaxation of the rank (Srebro et al., 2005) for matrix completion in the setting of rating prediction, with strong theoretical guarantees (Candès & Recht, 2009). While efficient algorithms to solve the convex problems have been proposed (see e.g. Cai et al., 2010; Jaggi et al., 2010), the practice is still to use the matrix equivalent of the nonconvex formulation (2.3). For the trace norm (nuclear 22-norm), in the matrix case, the regularizer simply becomes the squared 22-norm of the factors and lends itself to alternating methods or SGD optimization (Rennie & Srebro, 2005; Koren et al., 2009). When the samples are not taken uniformly at random from a matrix, some other norms are preferable to the usual nuclear norm. The weighted trace norm reweights elements of the factors based on the marginal rows and columns sampling probabilities, which can improve sample complexity bounds when sampling is non-uniform (Foygel et al., 2011; Negahban & Wainwright, 2012). Direct SGD implementations on the nonconvex formulation implicitly take this reweighting rule into account and were used by the winners of the Netflix challenge (see Srebro & Salakhutdinov, 2010, Section 5).

3 Tensor Completion and Decompositions

There is a large body of literature on low-rank tensor decompositions (see Kolda & Bader, 2009, for a comprehensive review). Closely related to our work is the canonical decomposition of tensor (also called CANDECOMP/PARAFAC or CP) (Hitchcock, 1927), which solves a problem similar to (14) without the regularization (i.e., λ=0\lambda=0), and usually the square loss.

Several norm-based regularizations for tensors have been proposed. Some are based on unfolding a tensor along each of its modes to obtain matricizations, and either regularize by the sum of trace norms of the matricizations (Tomioka et al., 2010) or write the original tensor as a sum of tensors TkT_{k}, regularizing their respective kkth matricizations with the trace norm (Wimalawarne et al., 2014). However, in the large-scale setting, even rank-1 approximations of matricizations involve too many parameters to be tractable.

Recently, the tensor trace norm (nuclear 22-norm) was proposed as a regularizer for tensor completion Yuan & Zhang (2016), and an algorithm based on the generalized conditional gradient has been developed by Cheng et al. (2016). This algorithm requires, in an inner loop, to compute a (constrained) rank-1 tensor that has largest dot-product with the gradient of the data-fitting term (gradient w.r.t. the tensor argument). This algorithm is efficient in our setup only with the square error loss (instead of the multiclass log-loss), because the gradient is then a low-rank + sparse tensor when the argument is low-rank. However, on large-scale knowledge bases, the state of the art is to use a binary log-loss or a multiclass log-loss (Trouillon et al., 2016; Kadlec et al., 2017); in that case, the gradient is not adequately structured, thereby causing the approach of (Cheng et al., 2016) to be too computationally costly.

Nuclear p𝑝p-Norm Regularization

As discussed in Section 3, norm-based regularizers have proved useful for matrices. We aim to reproduce these successes with tensor norms. We use the nuclear pp-norms defined by Friedland & Lim (2018). As shown in Equation (2.3), the community has favored so far a regularizer based on the square Frobenius norms of the factors (Yang et al., 2014; Trouillon et al., 2016). We first show that the unweighted version of this regularizer is not a tensor norm. Then, we propose4 a variational form of the nuclear 33-norm to replace the usual regularization at no additional computational cost when used with SGD. Finally, we discuss a weighting scheme analogous to the weighted trace-norm proposed in Srebro & Salakhutdinov (2010).

To simplify notation, let us introduce the set of CP decompositions of a tensor X{\bm{X}} of rank at most RR:

We will study the family of regularizers:

Note that with p=α=2p=\alpha=2, we recover the familiar squared Frobenius norm regularizer used in (2.3). Similar to showing that the squared Frobenius norm is a variational form of the trace norm on matrices (i.e., its minimizers realize the trace norm, infM=UVT12(UF2+VF2)=M\inf_{M=UV^{T}}\frac{1}{2}(\lVert U\lVert_{F}^{2}+\lVert V\lVert_{F}^{2})=\lVert M\lVert_{*}), we start with a technical lemma that links our regularizer with a function on the spectrum of our decompositions.

Moreover, the minimizers of the left-hand side satisfy:

This Lemma motivates the introduction of the set of pp-norm normalized tensor decompositions:

We show that the sub-level sets of the term on the right are not convex, which implies that Ω22\Omega_{2}^{2} is not the variational form of a tensor norm, and hence, is not the tensor analog to the matrix trace norm.

Cheng et al. (2016, Appendix D) already showed that regularizing with the square Frobenius norm of the factors is not related to the trace norm for tensors of order 33 and above, but their observation is that the regularizer is not positively homogeneous, i.e., minuαUR(X)Ω22(u)αminuUR(X)Ω22(u)\min_{u\in\alpha\mathcal{U}_{R}({\bm{X}})}\Omega_{2}^{2}(u)\neq|\alpha|\min_{u\in\mathcal{U}_{R}({\bm{X}})}\Omega_{2}^{2}(u). Our result in Proposition 1 is stronger in that we show that this regularizer is not a norm even after the rescaling (11) to make it homogeneous.

Given an estimated upper bound on the optimal RR, the original problem (2.3) can then be re-written as a non-convex problem using the equivalence in Lemma 20:

This variational form suggests to use p=3p=3, as a means to make the regularizer separable in each coefficients, given that then \lVert u^{(d)}_{r}\rVert_{p}^{3}=\sum_{i=1}^{{n_{d}}}\big{|}u^{(d)}_{r,i}|^{3}.

2 Weighted Nuclear p𝑝p-Norm

Similar to the weighted trace-norm for matrices, the weighted nuclear 33-norm can be easily implemented by keeping the regularization terms corresponding to the sampled triplets only, as discussed in Section 3.2. This leads to a formulation of the form

For an example (i,j,k)(i,j,k), only the parameters involved in the computation of X^i,j,k\hat{X}_{i,j,k} are regularized. The computational complexity is thus the same as the currently used weighted Frobenius norm regularizer. With q(1)q^{(1)} (resp. q(2)q^{(2)}, q(3)q^{(3)}) the marginal probabilities of sampling a subject (resp. predicate, object), the weighting implied by this regularization scheme is

We justify this weighting only by analogy with the matrix case discussed by (Srebro & Salakhutdinov, 2010): to make the weighted nuclear 33-norm of the all 11 tensor independent of its dimensions for a uniform sampling (since the nuclear 33-norm grows as MNP\sqrt{MNP} for an (M,N,P)(M,N,P) tensor).

Comparatively, for the weighted version of the nuclear 22-norm analyzed in Yuan & Zhang (2016), the nuclear 22-norm of the all 11 tensor scales like NMP\sqrt{NMP}. This would imply a formulation of the form

Contrary to formulation (15), the optimization of formulation (16) with a minibatch SGD leads to an update of every coefficients for each mini-batch considered. Depending on the implementation, and size of the factors, there might be a large difference in speed between the updates of the weighted nuclear {2,3}\{2,3\}-norm. In our implementation, this difference for CP is of about 1.5×1.5\times in favor of the nuclear 33-norm on FB15K.

A New CP Objective

More precisely, the instantaneous loss of a training triple (i,j,k)(i,j,k) becomes :

At test time we use X^i,j,:{\bm{\hat{X}}}_{i,j,:} to rank possible right hand sides for query (i,j,?)(i,j,?) and X^k,j+P,:{\bm{\hat{X}}}_{k,j+P,:} to rank possible left hand sides for query (?,j,k)(?,j,k).

Using CP to factor the tensor described in (17), we beat the previous state of the art on many benchmarks, as shown in Table 2. This reformulation seems to help even the ComplEx decomposition, for which parameters are shared between the entity embeddings of the first and third mode.

Experiments

We conducted all experiments on a Quadro GP 100 GPU. The code is available at https://github.com/facebookresearch/kbc.

WN18 and FB15K are popular benchmarks in the Knowledge Base Completion community. The former comes from the WordNet database, was introduced in Bordes et al. (2014) and describes relations between words. The most frequent types of relations are highly hierarchical (e.g., hypernym, hyponym). The latter is a subsampling of Freebase limited to 1515k entities, introduced in Bordes et al. (2013). It contains predicates with different characteristics (e.g., one-to-one relations such as capital_of to many-to-many such as actor_in_film).

Toutanova & Chen (2015) and Dettmers et al. (2017) identified train to test leakage in both these datasets, in the form of test triplets, present in the train set for the reciprocal predicates. Thus, both of these authors created two modified datasets: FB15K-237 and WN18RR. These datasets are harder to fit, so we expect regularization to have more impact. Dettmers et al. (2017) also introduced the dataset YAGO3-10, which is larger in scale and doesn’t suffer from leakage. All datasets statistics are shown in Table 1.

In all our experiments, we distinguish two settings: Reciprocal, in which we use the loss described in equation (17) and Standard, which uses the loss in equation (1). We compare our implementation of CP and ComplEx with the best published results, then the different performances between the two settings, and finally, the contribution of the regularizer in the reciprocal setting. In the Reciprocal setting, we compare the weighted nuclear 33-norm (N3) against the regularizer described in (2.3) (FRO). In preliminary experiments, the weighted nuclear 22-norm described in (16) did not seem to perform better than N3 and was slightly slower. We used Adagrad (Duchi et al., 2011) as our optimizer, whereas Kadlec et al. (2017) favored Adam (Kingma & Ba, 2014), because preliminary experiments didn’t show improvements.

We ran the same grid for all algorithms and regularizers on the FB15K, FB15K-237, WN18, WN18RR datasets, with a rank set to 20002000 for ComplEx, and 40004000 for CP. Our grid consisted of two learning rates: 10110^{-1} and 10210^{-2}, two batch-sizes: 2525 and 100100, and regularization coefficients in [0,103,5.103,102,5.102,101,5.101][0,10^{-3},5.10^{-3},10^{-2},5.10^{-2},10^{-1},5.10^{-1}]. On YAGO3-10, we limited our models to rank 10001000 and used batch-sizes 500500 and 30003000, the rest of the grid was identical. We used the train/valid/test splits provided with these datasets and measured the filtered Mean Reciprocal Rank (MRR) and Hits@10 (Bordes et al. (2013)). We used the filtered MRR on the validation set for early stopping and report the corresponding test metrics. In this setting, an epoch for ComplEx with batch-size 100 on FB15K took about 110s110s and 325s325s for a batch-size of 2525. We trained for 100100 epochs to ensure convergence, reported performances were reached within the first 2525 epochs.

All our results are reported in Table 2 and will be discussed in the next subsections. Besides our implementations of CP and ComplEx, we include the results of ConvE and DistMult in the baselines. The former because Dettmers et al. (2017) includes performances on the WN18RR and YAGO3-10 benchmarks, the latter because of the good performances on FB15K of DistMult and the extensive experiments on WN18 and FB15K reported in Kadlec et al. (2017). The performances of DistMult on FB15K-237, WN18RR and YAGO3-10 may be slightly underestimated, since our baseline CP results are better. To avoid clutter, we did not include in our table of results algorithms that make use of external data such as types (Krompass et al., 2015), external word embeddings (Toutanova & Chen, 2015), or using path queries as regularizers (Guu et al., 2015). The published results corresponding to these methods are subsumed in the "Best Published" line of Table 2, which is taken, for every single metric and dataset, as the best published result we were able to find.

2 Reimplementation of the Baselines

The performances of our reimplementation of CP and ComplEx appear in the middle rows of Table 2 (Standard setting). We only kept the results for the nuclear 33-norm, which didn’t seem to differ from those with the Frobenius norm. Our results are slightly better than their published counterparts, going from 0.330.33 to 0.460.46 filtered MRR on FB15K for CP and 0.700.70 to 0.800.80 for ComplEx. This might be explained in part by the fact that in the Standard setting (2.3) we use a multi-class log-loss, whereas Trouillon et al. (2016) used binomial negative log-likelihood. Another reason for this increase can be the large rank of 20002000 that we chose, where previously published results used a rank of around 200200; the more extensive search for optimization/regularization parameters and the use of nuclear 33-norm instead of the usual regularization are also most likely part of the explanation.

3 Standard vs Reciprocal

In this section, we compare the effect of reformulation (17), that is, the middle and bottom rows of Table 2. The largest differences are obtained for CP, which becomes a state of the art contender going from 0.20.2 to 0.950.95 filtered MRR on WN18, or from 0.460.46 to 0.860.86 filtered MRR on FB15K.For ComplEx, we notice a weaker, but consistent improvement by using our reformulation, with the biggest improvements observed on FB15K and YAGO3-10. Following the analysis in Bordes et al. (2013), we show in Table 3 the average filtered MRR as a function of the degree of the predicates. We compute the average in and out degrees on the training set, and separate the predicates in 44 categories : 1-1, 1-m, m-1 and m-m, with a cut-off at 1.51.5 on the average degree. We include reciprocal predicates in these statistics. That is, a predicate with an average in-degree of 1.21.2 and average out-degree of 3.23.2 will count as a 1-m when we predict its right-hand side, and as an m-1 when we predict its left-hand side. Most of our improvements come from the 1-m and m-m categories, both on ComplEx and CP.

4 Frobenius vs Nuclear 333

We focus now on the effect of our norm-based N3 regularizer, compared to the Frobenius norm regularizer favored by the community. Comparing the four last rows of Table 2, we notice a small but consistent performance gain across datasets. The biggest improvements appear on the harder datasets WN18RR, FB15K-237 and YAGO3-10. We checked on WN18RR the significance of that gain with a Signed Rank test on the rank pairs for CP.

5 Effect of Optimization Parameters

During these experiments, we noticed a heavy influence of optimization hyper-parameters on final results. This influence can account for as much as 0.10.1 filtered MRR and is illustrated in Figure 2.

Conclusion and Discussion

The main contribution of this paper is to isolate and systematically explore the effect of different factors for large-scale knowledge base completion. While the impact of optimization parameters was well known already, neither the effect of the formulation (adding reciprocals doubles the mean reciprocal rank on FB15K for CP) nor the impact of the regularization was properly assessed. The conclusion is that the CP model performs nearly as well as the competitors when each model is evaluated in its optimal configuration. We believe this observation is important to assess and prioritize directions for further research on the topic.

In addition, our proposal to use nuclear pp-norm as regularizers with p2p\neq 2 for tensor factorization in general is of independent interest.

The results we present leave several questions open. Notably, whereas we give definite evidence that CP itself can perform extremely well on these datasets as long as the problem is formulated correctly, we do not have a strong theoretical justification as to why the differences in performances are so significant.

Acknowledgements

The authors thank Armand Joulin and Maximilian Nickel for valuable discussions.

References

Appendix

Suppose we are trying to embed a single hierarchical predicate pp which is a nn-ary tree of depth dd. This tree will have ndn^{d} leaves, 11 root and Knd=ndnn1nd1K_{n}^{d}=\frac{n^{d}-n}{n-1}\sim n^{d-1} internal nodes. We forget any modeling issues we might have, but focus on the symmetricity assumption in Distmult.

Leaves and the root only appear on one side of the queries (i,p,j)(i,p,j) and hence won’t have any problems with the symmetricity. We now focus on an internal node ii. It has nn children (cki)k=1..n(c^{i}_{k})_{k=1..n} and one ancestor aia_{i}. Assuming n>2n>2, the MRR associated with this node will be higher if the query (i,p,?)(i,p,?) yields the ranked list [c1i,...,cni,ai][c^{i}_{1},...,c^{i}_{n},a_{i}]. Indeed, the filtered rank of the n queries (i,p,cki)(i,p,c^{i}_{k}) will be 11 while the filtered rank of the query (ai,p,i)(a_{i},p,i) will be n+1n+1.

Counting the number of queries for which the filtered rank is 11, we see that they far outweigh the queries for which the filtered rank is n+1n+1 in the final filtered MRR. For each internal nodes, nn queries lead to a rank of 11, and only 11 to a rank of n+1n+1. For the root, nn queries with a rank of 11, for the leaves, ndn^{d} queries with a rank of 11.

Hence for big hierarchies such as hyponym or hypernym in WN, we expect the filtered MRR of DistMult to be high even though its modeling assumptions are incorrect.

2 Proofs

Moreover, the minimizers of the left-hand side satisfy:

We study a summand, for ci,ai>0c_{i},a_{i}>0 :

Using constrained optimization techniques, we obtain that this minimum is obtained for :

and has value (d=13ai)α/3(\prod_{d=1}^{3}a_{i})^{\alpha/3}, which completes the proof. ∎

Let A=12I2A=\frac{1}{2}I_{2}, the mean of these two matrices. Identifying AA with a 2×2×12\times 2\times 1 tensor A{\bm{A}} to obtain the decomposition (σ,u)(\sigma,u) yielding A{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|{\bm{A}}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}, we have that the matrix AA can be written as r=1Rσrur(1)ur(2)\sum_{r=1}^{R}\sigma_{r}u^{(1)}_{r}\otimes u^{(2)}_{r}. This comes from the fact that ur(3)u^{(3)}_{r} is a normalized 1×11\times 1 vector, so its only entry is equal to 11. We then write that trace Tr(A)=r=1RσrTr(ur(1)ur(2))r=1Rσr\operatorname{Tr}(A)=\sum_{r=1}^{R}\sigma_{r}\operatorname{Tr}(u^{(1)}_{r}\otimes u^{(2)}_{r})\leq\sum_{r=1}^{R}\sigma_{r} by Cauchy-Schwarz. Hence σ1Tr(A)=1\lVert\sigma\rVert_{1}\geq\operatorname{Tr}(A)=1. Moreover, we have σ2/3σ1\lVert\sigma\rVert_{2/3}\geq\lVert\sigma\rVert_{1} with equality only for σ\sigma with at most one non-zero coordinate. Since AA is of rank 22, its representation has at least 22 non-zero coordinates, hence A=σ2/3>1{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|A\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}=\lVert\sigma\rVert_{2/3}>1, which contradicts convexity. This proof can naturally be extended to tensors of any sizes. ∎

3 Best Published results

We report in Table 4 the references for each of the results in Table 2 in the article.