Discriminative Gaifman Models

Mathias Niepert

Introduction

Knowledge bases are attracting considerable interest both from industry and academia . Instances of knowledge bases are the web graph, social and citation networks, and multi-relational knowledge graphs such as Freebase and YAGO . Large knowledge bases motivate the development of scalable machine learning models that can reason about objects as well as their properties and relationships. Research in statistical relational learning (SRL) has focused on particular formalisms such as Markov logic and ProbLog and is often concerned with improving the efficiency of inference and learning . The scalability problems of these statistical relational languages, however, remain an obstacle and have prevented a wider adoption. Another line of work focuses on efficient relational machine learning models that perform well on a particular task such as knowledge base completion and relation extraction. Examples are knowledge base factorization and embedding approaches and random-walk based ML models . We aim to advance the state of the art in relational machine learning by developing efficient models that learn knowledge base embeddings that are effective for probabilistic query answering on the one hand, and interpretable and widely applicable on the other.

Gaifman’s locality theorem is a result in the area of finite model theory . The Gaifman graph of a knowledge base is the undirected graph whose nodes correspond to objects and in which two nodes are connected if the corresponding objects co-occur as arguments of some relation. Gaifman’s locality theorem states that every first-order sentence is equivalent to a Boolean combination of sentences whose quantifiers range over local neighborhoods of the Gaifman graph. With this paper, we aim to explore Gaifman locality from a machine learning perspective. If every first-order sentence is equivalent to a Boolean combination of sentences whose quantifiers range over local neighborhoods only, we ought to be able to develop models that learn effective representations from these local neighborhoods. There is increasing evidence that learning representations that are built up from local structures can be highly successful. Convolutional neural networks, for instance, learn features over locally connected regions of images. The aim of this work is to investigate the effectiveness and efficiency of machine learning models that perform learning and inference within and across locally connected regions of knowledge bases. This is achieved by combining relational features that are often used in statistical relatinal learning with novel ideas from the area of deep learning. The following problem motivates Gaifman models.

Given a knowledge base (relational structure, mega-example, knowledge graph) or a collection of knowledge bases, learn a relational machine learning model that supports complex relational queries. The model learns a probability for each tuple in the query answer.

Note that this is a more general problem than knowledge base completion since it includes the learning of a probability distribution for a complex relational query. The query corresponding to knowledge base completion is $\mathtt{r}(x,y)$ for logical variables $x$ and $y$ , and relation $\mathtt{r}$ . The problem also touches on the problem of open-world probabilistic KBs since tuples whose prior probability is zero will often have a non-zero probability in the query answer.

Background

We first review some important concepts and notation in first-order logic.

An atom $\mathtt{r}(t_{1},...,t_{n})$ consists of predicate $\mathtt{r}$ of arity $n$ followed by $n$ arguments, which are either elements from a finite domain $\mathbf{D}=\{a,b,...\}$ or logical variables $\{x,y,...\}$ . We us the terms domain element and object synonymously. A ground atom is an atom without logical variables. Formulas are built from atoms using the usual Boolean connectives and existential and universal quantification. A free variable in a first-order formula is a variable $x$ not in the scope of a quantifier. We write $\varphi(x,y)$ to denote that $x,y$ are free in $\varphi$ , and $\mathtt{free}(\varphi)$ to refer to the free variables of $\varphi$ . A substitution replaces all occurrences of logical variable $x$ by $t$ in some formula $\varphi$ and is denoted by $\varphi[x/t]$ .

A vocabulary consists of a finite set of predicates $\mathbf{R}$ and a domain $\mathbf{D}$ . Every predicate $\mathtt{r}$ is associated with a positive integer called the arity of $\mathtt{r}$ . A $\mathbf{R}$ -structure (or knowledge base) $\mathcal{D}$ consists of the domain $\mathbf{D}$ , a set of predicates $\mathbf{R}$ , and an interpretation. The Herbrand base of $\mathcal{D}$ is the set of all ground atoms that can be constructed from $\mathbf{R}$ and $\mathbf{D}$ . The interpretation assigns a truth value to every atom in the Herbrand base by specifying $\mathtt{r}^{\mathcal{D}}\subseteq\mathbf{D}^{n}$ for each $n$ -ary predicate $\mathtt{r}\in\mathbf{R}$ . For a formula $\varphi(x_{1},...,x_{n})$ and a structure $\mathcal{D}$ , we write $\mathcal{D}\models\varphi(d_{1},...,d_{n})$ to say that $\mathcal{D}$ satisfies $\varphi$ if the variables $x_{1},...,x_{n}$ are substituted with the domain elements $d_{1},....,d_{n}$ . We define $\varphi(\mathcal{D}):=\{(d_{1},...,d_{n})\in\mathbf{D}^{n}\mid\mathcal{D}\models\varphi(d_{1},...,d_{n})\}$ . For the $\mathbf{R}$ -structure $\mathcal{D}$ and $\mathbf{C}\subseteq\mathbf{D}$ , $\langle\mathbf{C}\rangle^{\mathcal{D}}$ denotes the substructure induced by $\mathbf{C}$ on $\mathcal{D}$ , that is, the $\mathbf{R}$ -structure $\mathcal{C}$ with domain $\mathbf{C}$ and $\mathtt{r}^{\mathcal{C}}:=\mathtt{r}^{\mathcal{D}}\cap\mathbf{C}^{n}$ for every $n$ -ary $\mathtt{r}\in\mathbf{R}$ .

2 Gaifman’s Locality Theorem

The Gaifman graph of a $\mathcal{R}$ -structure $\mathcal{D}$ is the graph $G_{\mathcal{D}}$ with vertex set $\mathbf{D}$ and an edge between two vertices $d,d^{\prime}\in\mathbf{D}$ if and only if there exists an $\mathtt{r}\in\mathbf{R}$ and a tuple $(d_{1},...,d_{k})\in\mathtt{r}^{\mathcal{D}}$ such that $d,d^{\prime}\in\{d_{1},...,d_{k}\}$ . Figure 1a depicts a fragment of a knowledge base and the corresponding Gaifman graph. The distance $\mathtt{d}_{\mathcal{D}}(d_{1},d_{2})$ between two elements $d_{1},d_{2}\in\mathbf{D}$ of a structure $\mathcal{D}$ is the length of the shortest path in $G_{\mathcal{D}}$ connecting $d_{1}$ and $d_{2}$ . For $r\geq 1$ and $d\in\mathbf{D}$ , we define the $r$ -neighborhood of $d$ to be $\mathbf{N}_{r}(d):=\{x\in\mathbf{D}\mid\mathtt{d}_{\mathcal{D}}(d,x)\leq r\}$ . We refer to $r$ also as the depth of the neighborhood. Let $\mathbf{d}=(d_{1},...,d_{n})\in\mathbf{D}^{n}$ . The $r$ -neighborhood of $\mathbf{d}$ is defined as

For the Gaifman graph in Figure 1a, we have that $\mathbf{N}_{1}(d_{4})=\{d_{1},d_{2},d_{5}\}$ and $\mathbf{N}_{1}((d_{1},d_{2}))=\{d_{1},...,d_{6}\}$ . $\varphi^{\mathbf{N}_{r}}(x)$ is the formula obtained from $\varphi(x)$ by relativizing all quantifiers to $\mathbf{N}_{r}(x)$ , that is, by replacing every subformula of the form $\exists y\psi(x,y,\mathbf{z})$ by $\exists y(\mathtt{d}_{\mathcal{D}}(x,y)\leq r\wedge\psi(x,y,\mathbf{z}))$ and every subformula of the form $\forall y\psi(x,y,\mathbf{z})$ by $\forall y(\mathtt{d}_{\mathcal{D}}(x,y)\leq r\rightarrow\psi(x,y,\mathbf{z}))$ . A formula $\psi(x)$ of the form $\varphi^{\mathbf{N}_{r}}(x)$ , for some $\varphi(x)$ , is called $r$ -local. Whether an $r$ -local formula $\psi(x)$ holds depends only on the $r$ -neighborhood of $x$ , that is, for every structure $\mathcal{D}$ and every $d\in\mathbf{D}$ we have $\mathcal{D}\models\psi(d)$ if and only if $\langle\mathbf{N}_{r}(d)\rangle\models\psi(d)$ . For $r,k\geq 1$ and $\psi(x)$ being $r$ -local, a local sentence is of the form

We can now state Gaifman’s locality theorem.

Every first-order sentence is equivalent to a Boolean combination of local sentences.

Gaifman’s locality theorem states that any first-order sentence can be expressed as a Boolean combination of $r$ -local sentences defined for neighborhoods of objects that are mutually far apart (have distance at least $2r+1$ ). Now, a novel approach to (statistical) relational learning would be to consider a large set of objects (or tuples of objects) and learn models from their local neighborhoods in the Gaifman graphs. It is this observation that motivates Gaifman models.

Learning Gaifman Models

Instead of taking the costly approach of applying relational learning and inference directly to entire knowledge bases, the representations of Gaifman models are learned bottom up, by performing inference and learning within bounded-size, locally connected regions of Gaifman graphs. Each Gaifman model specifies the data generating process from a given knowledge base (or collection of knowledge bases), a set of relational features, and a ML model class used for learning.

Given a $\mathbf{R}$ -structure $\mathcal{D}$ , a discriminative Gaifman model for $\mathcal{D}$ is a tuple $(\mathsf{q},r,k,\mathbf{\Phi},\mathcal{M})$ as follows:

$\mathsf{q}$ is a first-order formula called the target query with at least one free variable;

$r$ is the depth of the Gaifman neighborhoods;

$k$ is the size-bound of the Gaifman neighborhoods;

$\mathbf{\Phi}$ is a set of first-order formulas (the relational features);

$\mathcal{M}$ is the base model class (loss, hyper-parameters, etc.).

Throughout the rest of the paper, we will provide detailed explanations of the different parameters of Gaifman models and their interaction with data generation, learning, and inference.

During the training of Gaifman models, neighborhoods are generated for tuples of objects $\mathbf{d}\in\mathbf{D}^{n}$ based on the parameters $r$ and $k$ . We first describe the procedure for arbitrary tuples $\mathbf{d}$ of objects and will later explain where these tuples come from. For a given tuple $\mathbf{d}$ the $r$ -neighborhood of $\mathbf{d}$ within the Gaifman graph is computed. This results in the set of objects ${\mathbf{N}_{r}}(\mathbf{d})$ . Now, from this neighborhood we sample $w$ neighborhoods consisting of at most $k$ objects. Sampling bounded-size sub-neighborhoods from ${\mathbf{N}_{r}}(\mathbf{d})$ is motivated as follows:

The degree distribution of Gaifman graphs is often skewed (see Figure 2a), that is, the number of other objects a domain element is related to varies heavily. Generating smaller, bounded-size neighborhoods allows the transfer of learned representations between more and less connected objects. Moreover, the sampling strategy makes Gaifman models more robust to object uncertainty . We show empirically that larger values for $k$ reduce the effectiveness of the learned models for some knowledge bases.

Relational learning and inference is performed within the generated neighborhoods. ${\mathbf{N}_{r}}(\mathbf{d})$ can be very large, even for $r=1$ (see Figure 2a), and we want full control over the complexity of the computational problems.

Even for a single object tuple $\mathbf{d}$ we can generate a large number of training examples if $|\mathbf{N}_{r}(\mathbf{d})|>k$ . This mitigates the risk of overfitting. The number of training examples per tuple strongly influences the models’ accuracy.

We can now define the set of $(r,k)$ -neighborhoods generated from a $r$ -neighborhood.

For a given tuple of objects $\mathbf{d}$ , Algorithm 1 returns a set of $w$ neighborhoods drawn from $\mathbf{N}_{r,k}(\mathbf{d})$ such that the number of objects for each $d_{i}$ is the same in expectation.

The formulas in the set $\mathbf{\Phi}$ are indexed and of the form $\varphi_{i}(s_{1},...,s_{n},u_{1},...,u_{m})$ with $s_{j}\in\mathtt{free}(\mathsf{q})$ and $u_{j}\not\in\mathtt{free}(\mathsf{q})$ . For every tuple $\mathbf{d}=(d_{1},...,d_{n})$ , generated neighborhood $\mathbf{N}\in\mathbf{N}_{r,k}(\mathbf{d})$ , and $\varphi_{i}\in\mathbf{\Phi}$ , we perform the substitution $[s_{1}/d_{1},...,s_{n}/d_{n}]$ and relativize $\varphi_{i}$ ’s quantifiers to $\mathbf{N}$ , resulting in $\varphi_{i}^{\mathbf{N}}[s_{1}/d_{1},...,s_{n}/d_{n}]$ which we write as $\varphi_{i}^{\mathbf{N}}[\mathbf{s}/\mathbf{d}]$ . Let $\langle\mathbf{N}\rangle$ be the substructure induced by $\mathbf{N}$ on $\mathcal{D}$ . For every formula $\varphi_{i}(s_{1},...,s_{n},u_{1},...,u_{m})$ and every $\mathbf{n}\in\mathbf{N}^{m}$ , we now have that $\mathcal{D}\models\varphi_{i}^{\mathbf{N}}[\mathbf{s}/\mathbf{d},\mathbf{u}/\mathbf{n}]$ if and only if $\langle\mathbf{N}\rangle\models\varphi_{i}^{\mathbf{N}}[\mathbf{s}/\mathbf{d},\mathbf{u}/\mathbf{n}]$ . In other words, satisfaction is now checked locally within the neighborhoods $\mathbf{N}$ , by deciding whether $\langle\mathbf{N}\rangle\models\varphi_{i}^{\mathbf{N}}[\mathbf{s}/\mathbf{d},\mathbf{u}/\mathbf{n}]$ . The relational semantics of Gaifman models is based on the set of formulas $\mathbf{\Phi}$ . The feature vector $\mathbf{v}=(v_{1},...,v_{|\mathbf{\Phi}|})$ for tuple $\mathbf{d}$ , and neighborhood $\mathbf{N}\in\mathbf{N}_{r,k}(\mathbf{d})$ , written as $\mathbf{v}_{\mathbf{N}}$ , is constructed as follows

That is, if $\varphi_{i}^{\mathbf{N}}[\mathbf{s}/\mathbf{d}]$ has free variables, $v_{i}$ is equal to the number of groundings of $\varphi_{i}[\mathbf{s}/\mathbf{d}]$ that are satisfied within the neighborhood substructure $\langle\mathbf{N}\rangle$ ; if $\varphi_{i}[\mathbf{s}/\mathbf{d}]$ has no free variables, $v_{i}=1$ if and only if $\varphi_{i}[\mathbf{s}/\mathbf{d}]$ is satisfied within the neighborhod substructure $\langle\mathbf{N}\rangle$ ; and $v_{i}=0$ otherwise. The neighborhood representations $\mathbf{v}$ capture $r$ -local formulas and help the model learn formula combinations that are associated with negative and positive examples. For the right choices of the parameters $r$ and $k$ , the neighborhood representations of Gaifman models capture the relational structure associated with positive and negative examples.

Deciding $\mathcal{D}\models\varphi$ for a structure $\mathcal{D}$ and a first-order formula $\varphi$ is referred to as model checking and computing $\varphi(\mathcal{D})$ is called $\varphi$ -counting. The combined complexity of model checking is PSPACE-complete and there exists a $||\mathcal{D}||^{O(||\varphi||)}$ algorithm for both problems where $||\cdot||$ is the size of an encoding. Clearly, for most real-world KBs this is not feasible. For Gaifman models, however, where the neighborhoods are bounded-size, typically $10\leq|\mathbf{N}|=k\leq 100$ , the above representation can be computed very efficiently for a large class of relational features. We can now state the following complexity result.

Let $\mathcal{D}$ be a relational structure (knowledge base), let $\overline{\mathtt{d}}$ be the size of the largest $r$ -neighborhood of $\mathcal{D}$ ’s Gaifman graph, and let $\overline{\mathtt{s}}$ be the greatest encoding size of any formula in $\mathbf{\Phi}$ . For a Gaifman model with parameters $r$ and $k$ , the worst-case complexity for computing the feature representations of $N$ neighborhoods is $O(N(\overline{\mathtt{d}}+|\mathbf{\Phi}|k^{\overline{\mathtt{s}}}))$ .

Existing SRL approaches could be applied to the generated neighborhoods, treating each as a possible world for structure and parameter learning. However, our goal is to learn relational models that utilize embeddings computed by multi-layered neural networks.

Let $\mathsf{q}$ be a first-order formula (the relational query) and $\mathcal{S}(\mathsf{q})$ the result set of the query, that is, all groundings that render the formula satisfied in the knowledge base. The feature representations generated for tuples of objects $\mathbf{d}\in\mathcal{S}(\mathsf{q})$ serve as positive training examples. The Gaifman models’ aim is to learn neighborhood embeddings that capture local structure of tuples for which we know that the target query evaluates to true. Similar to previous work, we generate negative examples by corrupting tuples that correspond to positive examples. The corruption mechanism takes a positive input tuple $\mathbf{d}=(d_{1},...,d_{n})$ and substitutes, for each $i\in\{1,...,n\}$ , the domain element $d_{i}$ with objects sampled from $\mathbf{D}$ while keeping the rest of the tuple fixed.

The discriminative Gaifman model performs the following steps.

Evaluate the target query $\mathsf{q}$ and compute the result set $\mathcal{S}(\mathsf{q})$

For each tuple $\mathbf{d}$ in the result set $\mathcal{S}(\mathsf{q})$ :

Learn a ML model with the generated positive and negative training examples.

Learning the final Gaifman model depends on the base ML model class $\mathcal{M}$ and its loss function. We obtained state of the art results with neural networks, gradient-based learning, and categorical cross-entropy as loss function

where $p_{\mathcal{M}}(\mathbf{v}_{\mathbf{N}})$ is the probability the model returns on input $\mathbf{v}_{\mathbf{N}}$ . However, other loss functions are possible. The probability of a particular substitution of the target query to be true is now

The expected probability of a representation of a neighborhood drawn uniformly at random from $\mathbf{N}_{(r,k)}(\mathbf{d})$ . It is now possible to generate several neighborhoods $\mathbf{N}$ and their representations $\mathbf{v}_{\mathbf{N}}$ to estimate $P(\mathsf{q}[\mathbf{s}/\mathbf{d}]=\mathtt{True})$ , simply by averaging the neighborhoods’ probabilities. We have found experimentally that a single neighborhood already leads to highly accurate results but also that more neighborhood samples further improve the accurracy.

Let us emphasize again the novel semantics of Gaifman models. Gaifman models generate a large number of small, bounded-size structures from a large structure, learn a representation for these bounded-size structures, and use the resulting representation to answer queries concerning the original structure as a whole. The advantages are model weight sharing across a large number of neighborhoods and efficiency of the computational problems. Figure 4 and Figure 4 illustrate learning from bounded-size neighborhood structures and inference in Gaifman models.

2 Structure Learning

Structure learning is the problem of determining the set of relational features $\mathbf{\Phi}$ . We provide some directions and leave the problem to future work. Given a collection of bounded-size neighborhoods of the Gaifman graph, the goal is to determine suitable relational features for the problem at hand. There is a set of features which we found to be highly effective. For example, formulas of the form $\exists x\ \mathtt{r}(s_{1},x)$ , $\exists x\ \mathtt{r}(s_{1},x)\wedge\mathtt{r}(x,s_{2})$ , and $\exists x,y\ \mathtt{r_{1}}(s_{1},x)\wedge\mathtt{r_{2}}(x,y)\wedge\mathtt{r_{3}}(y,s_{2})$ for all relations. The latter formulas capture fixed-length paths between $s_{1}$ and $s_{2}$ in the neighborhoods. Hence, Path Ranking type features can be used in Gaifman models as a particular relational feature class. For path formulas with several different relations we cannot include all $|\mathbf{R}|^{3}$ combinations and, hence, we have to determine a subset occurring in the training data. Fortunately, since the neighborhood size is bounded, it is computationally feasible to compute frequent paths in the neighborhoods and to use these as features. The complexity of this learning problem is in the number of elements in the neighborhood and not in the number of all objects in the knowledge base. Relation paths that do not occur in the data can be discarded. Gaifman models can also use features of the form $\forall x,y\ \mathtt{r}(x,y)\Rightarrow\mathtt{r}(y,x)$ , $\exists x,y\ \mathtt{r}(x,y)$ , and $\forall x,y,z\ \mathtt{r}(x,y)\wedge\mathtt{r}(y,z)\Rightarrow\mathtt{r}(x,z)$ , to name but a few. Moreover, features with free variables, such as $\mathtt{r}(s_{1},x)$ are counting features (here: the $\mathtt{r}$ out-degree of $s_{1}$ ). It is even computationally feasible to include specific second-order features (for instance, quantifiers ranging over $\mathbf{R}$ ) and aggregations of feature values.

3 Prior Confidence Values, Types, and Numerical Attributes

Numerous existing knowledge bases assign confidence values (probabilities, weights, etc.) to their statements. Gaifman models can incorporate confidence values during the sampling and learning process. Instead of adding random noise to the representations, which we have found to be beneficial, noise can be added inversely proportional to the confidence values. Statements for which the prior confidence values are lower are more likely to be dropped out during training than statements with higher confidence values. Furthermore, Gaifman models can directly incorporate object types such as Actor and Action Movie as well as numerical features such as location and elevation. One simply has to specify a fixed position in the neighborhood representation $\mathbf{v}$ for each object position within the input tuples $\mathbf{d}$ .

Related Work

Recent work on relational machine learning for knowledge graphs is surveyed in . We focus on a select few methods we deem most related to Gaifman models and refer the interested reader to the above article. A large body of work exists on learning inference rules from knowledge bases. Examples include and where inference rules of length one are learned; and where general inference rules are learned by applying a support threshold. Their method does not scale to large KBs and depends on predetermined thresholds. Lao et al. train a logistic regression classifier with path features to perform KB completion. The idea is to perform a random walk between objects and to exploit the discovered paths as features. SFE improves PRA by making the generation of random walks more efficient. More recent embedding methods have combined paths in KBs with KB embedding methods . Gaifman models support a much broader class of relational features subsuming path features. For instance, Gaifman models incorporate counting features that have shown to be beneficial for relational models.

Experiments

The aim of the experiments is to understand the efficiency and effectiveness of Gaifman models for typical knowledge base inference problems. We evaluate the proposed class of models with two data sets derived from the knowledge bases WordNet and Freebase . Both data sets consist of a list of statements $\mathtt{r}(d_{1},d_{2})$ that are known to be true. For a detailed description of the data sets, whose statistics are listed in Table 1, we refer the reader to previous work .

To compute the probabilities, we averaged the probabilities of $N=1,2,$ or $3$ generated $(r,k)$ -neighborhoods.

We performed runtime experiments to evaluate the models’ efficiency. Embedding models have the advantage that one dot product for every candidate object is sufficient to compute the score for the corresponding statement and we need to assess the performance of Gaifman models in this context. All experiments were run on commodity hardware with 64G RAM and a single 2.8 GHz CPU.

The runtime experiments demonstrate that Gaifman models perform inference very efficiently for $k\leq 20$ . Figure 5 depicts the number of query answers the Gaifman models are able to serve per second, averaged over relation types. A query answer returns the probability for one object pair. These numbers include neighborhood generation and network inference. The results are promising with about $5000$ query answers per second (averaged across relation types) as long as $k$ remains small. Since most object pairs of WN18 have a $1$ -neighborhood whose size is smaller than $20$ , the answers per second rates for $k>20$ is not reduced as drastically as for FB15k.

Conclusion and Future Work

Gaifman models are a novel family of relational machine learning models that perform learning and inference within and across locally connected regions of relational structures. Future directions of research include structure learning, more sophisticated base model classes, and application of Gaifman models to additional relational ML problems.

Many thanks to Alberto García-Durán, Mohamed Ahmed, and Kristian Kersting for their helpful feedback.