Fast Linear Model for Knowledge Graph Embeddings

Armand Joulin, Edouard Grave, Piotr Bojanowski, Maximilian Nickel, Tomas Mikolov

Introduction

Learning representations for Knowledge bases (KBs) such as Freebase (Bollacker et al., 2008) has become a core problem in machine learning, with a large variety of applications from question answering (Yao and Van Durme, 2014) to image classification (Deng et al., 2014). Many approaches have been proposed to learn these representations, or embeddings, with either single relational (Hoff et al., 2002; Perozzi et al., 2014) or multi-relational data (Nickel et al., 2011; Bordes et al., 2013).

These approaches learn graph embeddings by modeling the relation between the different entities in the graphs (Perozzi et al., 2014; Nickel et al., 2016), Instead, we frame this problem in a multiclass multilabel classification problem and model only the co-occurences of entities and relations with a linear classifier based on a Bag-of-Words (BoW) representation and standard cost functions. In practice, this approach works surprisingly well on a variety of standard datasets, obtaining performance competitive with the state-of-the-art approaches while using a standard text library (i.e., fastText) and running in a few minutes (Joulin et al., 2017).

We focus our study on two standard approaches to learn representations for KBs: knowledge base completion and question answering. For KB completion, our conclusions extend those of Kadlec et al. (2017), that simple models like TransE (Bordes et al., 2013) work as well, if not better than more sophisticated ones, if tuned properly. Kadlec et al. (2017) focus on a bilinear model designed for KB completion, DistMul (Yang et al., 2014), that still takes a few hours to train on a high-end GPU. We show that similar performance can be achieved with a linear classifier and a training time reduced to a few minutes. For question answering, we consider datasets where we have guarantees that the question answer pairs are covered by the graph in one hop to indirectly learn graph embeddings (Bordes et al., 2015; Miller et al., 2016). Following Bordes et al. (2014a), we predict the relation between the entities appearing in the question and answer pairs to learn embeddings of the graph edges. The embeddings of the entities, or nodes, are indirectly learned by embedding the questions. In this setting, we achieve competitive performance as long as we have access to a clean KB related to the question answering task.

Approach

Linear models (Joachims, 1998) are powerful and efficient baselines for text classification. In particular, the fastText model proposed by Joulin et al. (2017) achieves state-of-the-art performance on many datasets by combining several standard tricks, such as low rank constraints (Schutze, 1992) and n-gram features (Wang and Manning, 2012). The same approach can be applied to any problem where the input is a set of discrete tokens. For example, a KB is composed of entities (or nodes) and relations (or edges) that can be represented by a unique discrete token.

The model is composed of a matrix $V$ which is used as a look-up table over the discrete tokens and a matrix $W$ for the classifier. The representations of the discrete tokens are averaged into BoW representation, which is in turn fed to the linear classifier. Using a function $f$ to compute the probability distribution over the classes, and $N$ input sets for discrete token (e.g., sentences), leads to minimize:

where $x_{n}$ is the normalized BoW of the $n$ -th input set, $y_{n}$ the label. While BoW models are memory inefficient, their memory footprint can be significantly reduced (Joulin et al., 2016a). The model is trained asynchronously on multiple CPUs with SGD and a linearly decaying learning rate.

2 Loss functions

We consider two loss functions in our experiments: the softmax function and a one-versus-all loss function with negative sampling.

Given $K$ classes, and a score $s_{k}$ for each class $k$ , the softmax function is defined as $f(s)_{k}=\exp(s_{k})/\sum_{i=1}^{K}\exp(s_{i}).$ This function requires the score of every class, leading to a complexity of $O(Kh)$ where $h$ is the size of the embeddings. This function is often used to compute the probability distribution of a finite set of discrete classes.

one-versus-all loss.

Computing the softmax function over a large number of classes is computationally prohibitive. We replace it by an independent binary classifier per class, i.e., a set of one-versus-all losses. During training, for each positive example, we draw randomly $k$ negative classes, and update the $k+1$ classifiers. The number $k$ is significantly smaller than $K$ , reducing the complexity from $O(Kh)$ to $O(kh)$ . This loss has been used for word embeddings (Mikolov et al., 2013; Bojanowski et al., 2017) as well as to object classification (Joulin et al., 2016b).

3 Knowledge base completion

A knowledge base is represented as a set of subject-relation-object triplets $(e,r,p)$ . Typically, the entity $p$ is predicted according to the subject $e$ and the relation $r$ . With the notations of the fastText model described in Sec. 2.1, each entity $e$ is associated with a vector $v_{e}$ and each relation $r$ with a vector $v_{r}$ of the same dimension $h$ . The target entity $p$ is also represented by a $h$ dimensional vector $w_{p}$ . The scoring function $s_{p}$ for a triplet $(e,r,p)$ is simply the dot product between the BoW representation of the input pair $(e,r)$ and the target:

4 Question answering

Question answering problems can be used to learn graph embeddings if framed as edge prediction problems between entities appearing in the question answer pairs (Bordes et al., 2014a). The question is represented as a bag of words and the potential relations are labels. An entity is indirectly represented by the associated words in the question.

The questions and answers are matched to entities in the KB with a string matching algorithm (Bordes et al., 2014a), using a look-up table between entities and their string representations. Every pair of question and answer in the training set is thus matched to a set of potential pairs of entities. Several entities are often matched to a question and we use an ad-hoc euristic to sort them, i.e., using the inverse of their frequency in the training set, and the size of their associated strings in case of ties (to approximate the frequency).

Relation prediction for question answering.

Once a question-answer pair is associated with a set of pairs of entities, candidate relations are extracted. Following Bordes et al. (2014a), we consider the relations as labels and use fastText to predict them. At test time, the answer to a question is inferred by taking the most likely relation and verify if any of the entities matched to the question forms a valid pair in the KB. If not, we move to the next most likely relation and reiterate the process.

Results

We use several standard benchmarks for KB completion:

The WN18 dataset is a subset of WordNet, containing 40,943 entities, 18 relation types, and 151,442 triples. WordNet is a KB built by grouping synonym words and provides lexical relationships between them.

The FB15k dataset is a subset of Freebase, containing 14,951 entities, 1345 relation types, and 592,213 triples. Freebase is a large KB containing general facts about the world.

The FB15k-237 dataset that is a subset of FB15k with no reversible relations (Toutanova et al., 2015). It contains 237 relations and 14,541 entities, for a total of 298,970 triples.

The SVO dataset is a subset of subject-relation-object triplets extracted from Wikipedia articles, containing $30,605$ entities, $4,547$ relation types and $1.3$ M triples.

Experimental protocol.

For WN18, FB15k and FB15k-237, the goal is to predict one end of a triple given the other end and the relation, e.g., the subject given the object and the relation. We report Hit@10, also known as Recall@10, on raw and filtered datasets. Raw means the standard recall measure while filtered means that every relation that already exists in the KB are first removed, even those in the test set. The filtered measure allows a direct comparison of the target entity with negative ones. On SVO, the goal is to predict the relation given a pair of entities. The measure is Hit@5 $\%$ , i.e., Hit@227 for $4,547$ relation types.

Implementation details.

For both WN18, FB15k and FB15k-237, we use a negative sampling approximation of the softmax and select the hyper-parameters based on the filtered hits@10 on the validation set. On WN18 and FB15k,he grid of parameters used is $ $for the embedding size$ h $, <math xmlns="http://www.w3.org/1998/Math/MathML" display="block"><semantics><mrow><mi>f</mi><mi>o</mi><mi>r</mi><mi>t</mi><mi>h</mi><mi>e</mi><mi>n</mi><mi>u</mi><mi>m</mi><mi>b</mi><mi>e</mi><mi>r</mi><mi>o</mi><mi>f</mi><mi>e</mi><mi>p</mi><mi>o</mi><mi>c</mi><mi>h</mi><mi>s</mi><mi>a</mi><mi>n</mi><mi>d</mi></mrow><annotation encoding="application/x-tex">for the number of epochs and</annotation></semantics></math>forthenumberofepochsand for the number of negative examples. Since FB15k-237 is much smaller, we limit the number of epochs to$ $. The initial learning rate is fixed at$ 0.2 $. On WN18, the best set of hyper-parameters are$ 100 $dimensions,$ 100 $epochs and$ 500 $negative samples. On FB15k, the selected hyper-parameters are$ 100 $dimensions,$ 100 $epochs and$ 100 $negative samples. On FB15k-237, the best set of hyper-parameters are a hidden of$ 50 $,$ 10 $epochs and a$ 500 $negative samples. For SVO, the number of relations to predict is quite small, we thus use a full softmax and select hyper-parameters based on hit@5%. The grid of hyper-parameters is$ $for the embedding size$ h $and$ $for the number of epochs. The initial learning rate is fixed at$ 0.2$. For all these experiments, we report both the performance on the model train on the train set and on the concatenation of the train and validation set, run with the same hyper-parameters.

Comparison.

We compare our approach to several standard models in Table 1 on WN18 and FB15k. We report numbers from their original papers. Some of them are not using a fine grid of hyper-parameters, which partially explains the gap in performance. We separate these models from more recent ones for fairer comparison. Despite its simplicity, our approach is competitive with dedicated pipelines both for raw and filtered measurements. This extends the findings of Trouillon and Nickel (2017), i.e., the choice of loss function can have a significant impact on overall performance. Table 3 extends this observation to a harder dataset, FB15k-237, where our BoW model compares favorably with existing KB completion models.

We also report comparison on relation prediction dataset SVO in Table 3. Our approach is competitive with approaches using bigram and high order information, like TATEC (Garcia-Duran et al., 2015). Note TATEC can be, theoretically, used for both relation and entity prediction, while our model only predicts relations.

Table 4 show the running time for a fastText implementation. It runs in a few minutes, which is comparable with optimized pipelines like Fast-TransD and Fast-TransR (Lin et al., 2015). Note that similar running times should be achievable for other linear models like TransE.

2 Question answering.

We consider two standard datasets with a significant amount of question answer pairs.

SimpleQuestion consists of 108,442 question-answer pairs generated from Freebase. It comes with a subset of Freebase with 2M triplets.

WikiMovies consists of more than 100,000 questions about movies generated also from Freebase. It comes with a subset of the KB associated with the question-answer pairs. This dataset also provides with settings where different preprocessed versions of Wikipedia are considered instead of the KB. These settings are beyond the scope of this paper.

Implementation details.

For both SimpleQuestion and MovieWiki, the number of relations are relatively small. We thus use a full softmax. For SimpleQuestion, the grid of hyper-parameters is $for the dimension of the embeddings and$ for the number of epochs. We use bigrams and an initial learning rate of $1$ . For MovieWiki, we fixed the embedding size to $16$ since there are only $16$ relations and the number of epochs was selected on the validation set in $ $. We use an initial learning rate of$ .3$.

SimpleQuestion.

Figure 5 compares this approach with the state-of-the-art. We learn a relation classifier with fastText in 42sec. Using a larger KB, i.e., FB5M, does not degrade the performance, despite having much more irrelevant entities. Our approach compares favorably well other with question answering systems. This suggests that the learned embeddings capture some important information about the KB. Note, however, that the performance is very sensible to the quality of the entity linker and the ad-hoc sorting of extracted subjects. Typically, going from a random order to the one used in this paper gives a boost of up to $10\%$ depending on the hyper-parameters.

WikiMovies.

Table 6 compares our models with several state-of-the-art pipelines. In the case where the clean KB is accessible, our method works very well. fastText runs in 1sec. for relation prediction. Note that this dataset was primarily made for the case where only text is available. This setting goes beyond the scope of our method, while a more general approach like KV-memNN still works reasonably well (Miller et al., 2016).

Conclusion

In this paper, we show that linear models learn good embeddings from a KB by recasting graph related problems into supervised classification ones. The limitations of such approach are that it requires a clean KB and a task that uses direct information about local connectivity in the graph. Moreover, the observation that our non-relational approach provides state-of-the-art performance on KBC benchmarks raises also important questions regarding the evaluation of link-prediction models and the design of benchmarks for this task.

We thank Timothée Lacroix, Nicolas Usunier, Antoine Bordes and the rest of FAIR for their precious help and comments. We also would like to thank Adam Fisch and Alex Miller for their help regarding MovieWiki.

Introduction

Approach

2 Loss functions

one-versus-all loss.

3 Knowledge base completion

4 Question answering

Relation prediction for question answering.

Results

Experimental protocol.

Implementation details.

Comparison.

2 Question answering.

Implementation details.

SimpleQuestion.

WikiMovies.

Conclusion

References