End-to-End Differentiable Proving

Tim Rocktäschel, Sebastian Riedel

Introduction

Current state-of-the-art methods for automated Knowledge Base (KB) completion use neural link prediction models to learn distributed vector representations of symbols (i.e. subsymbolic representations) for scoring fact triples . Such subsymbolic representations enable these models to generalize to unseen facts by encoding similarities: If the vector of the predicate symbol $\verb~grandfatherOf~$ is similar to the vector of the symbol $\verb~grandpaOf~$ , both predicates likely express a similar relation. Likewise, if the vector of the constant symbol lisa is similar to maggie, similar relations likely hold for both constants (e.g. they live in the same city, have the same parents etc.).

This simple form of reasoning based on similarities is remarkably effective for automatically completing large KBs. However, in practice it is often important to capture more complex reasoning patterns that involve several inference steps. For example, if abe is the father of homer and homer is a parent of bart, we would like to infer that abe is a grandfather of bart. Such transitive reasoning is inherently hard for neural link prediction models as they only learn to score facts locally. In contrast, symbolic theorem provers like Prolog enable exactly this type of multi-hop reasoning. Furthermore, Inductive Logic Programming (ILP) builds upon such provers to learn interpretable rules from data and to exploit them for reasoning in KBs. However, symbolic provers lack the ability to learn subsymbolic representations and similarities between them from large KBs, which limits their ability to generalize to queries with similar but not identical symbols.

While the connection between logic and machine learning has been addressed by statistical relational learning approaches, these models traditionally do not support reasoning with subsymbolic representations (e.g. ), and when using subsymbolic representations they are not trained end-to-end from training data (e.g. ). Neural multi-hop reasoning models address the aforementioned limitations to some extent by encoding reasoning chains in a vector space or by iteratively refining subsymbolic representations of a question before comparison with answers. In many ways, these models operate like basic theorem provers, but they lack two of their most crucial ingredients: interpretability and straightforward ways of incorporating domain-specific knowledge in form of rules.

Our approach to this problem is inspired by recent neural network architectures like Neural Turing Machines , Memory Networks , Neural Stacks/Queues , Neural Programmer , Neural Programmer-Interpreters , Hierarchical Attentive Memory and the Differentiable Forth Interpreter . These architectures replace discrete algorithms and data structures by end-to-end differentiable counterparts that operate on real-valued vectors. At the heart of our approach is the idea to translate this concept to basic symbolic theorem provers, and hence combine their advantages (multi-hop reasoning, interpretability, easy integration of domain knowledge) with the ability to reason with vector representations of predicates and constants. Specifically, we keep variable binding symbolic but compare symbols using their subsymbolic vector representations.

Concretely, we introduce Neural Theorem Provers (NTPs): End-to-end differentiable provers for basic theorems formulated as queries to a KB. We use Prolog’s backward chaining algorithm as a recipe for recursively constructing neural networks that are capable of proving queries to a KB using subsymbolic representations. The success score of such proofs is differentiable with respect to vector representations of symbols, which enables us to learn such representations for predicates and constants in ground atoms, as well as parameters of function-free first-order logic rules of predefined structure. By doing so, NTPs learn to place representations of similar symbols in close proximity in a vector space and to induce rules given prior assumptions about the structure of logical relationships in a KB such as transitivity. Furthermore, NTPs can seamlessly reason with provided domain-specific rules. As NTPs operate on distributed representations of symbols, a single hand-crafted rule can be leveraged for many proofs of queries with symbols that have a similar representation. Finally, NTPs demonstrate a high degree of interpretability as they induce latent rules that we can decode to human-readable symbolic rules.

Our contributions are threefold: (i) We present the construction of NTPs inspired by Prolog’s backward chaining algorithm and a differentiable unification operation using subsymbolic representations, (ii) we propose optimizations to this architecture by joint training with a neural link prediction model, batch proving, and approximate gradient calculation, and (iii) we experimentally show that NTPs can learn representations of symbols and function-free first-order rules of predefined structure, enabling them to learn to perform multi-hop reasoning on benchmark KBs and to outperform ComplEx , a state-of-the-art neural link prediction model, on three out of four KBs.

Background

Given a query (also called goal) such as $[\verb~grandfatherOf~,{\color[rgb]{0.75,0.3720703125,0}\definecolor[named]{pgfstrokecolor}{rgb}{0.75,0.3720703125,0}\textsc{Q}},\textsc{bart}]$ , we can use Prolog’s backward chaining algorithm to find substitutions for Q (see appendix A for pseudocode). On a high level, backward chaining is based on two functions called Or and And. Or iterates through all rules (including rules with an empty body, i.e., facts) in a KB and unifies the goal with the respective rule head, thereby updating a substitution set. It is called Or since any successful proof suffices (disjunction). If unification succeeds, Or calls And to prove all atoms (subgoals) in the body of the rule. To prove subgoals of a rule body, And first applies substitutions to the first atom that is then proven by again calling Or, before proving the remaining subgoals by recursively calling And. This function is called And as all atoms in the body need to be proven together (conjunction). As an example, a rule such as $[\verb~grandfatherOf~,{\color[rgb]{0.75,0.3720703125,0}\definecolor[named]{pgfstrokecolor}{rgb}{0.75,0.3720703125,0}\textsc{X}},{\color[rgb]{0.75,0.3720703125,0}\definecolor[named]{pgfstrokecolor}{rgb}{0.75,0.3720703125,0}\textsc{Y}}]\ {\color[rgb]{0.6708984375,0.076171875,0.08203125}\definecolor[named]{pgfstrokecolor}{rgb}{0.6708984375,0.076171875,0.08203125}\text{:--}}\ [[\verb~fatherOf~,{\color[rgb]{0.75,0.3720703125,0}\definecolor[named]{pgfstrokecolor}{rgb}{0.75,0.3720703125,0}\textsc{X}},{\color[rgb]{0.75,0.3720703125,0}\definecolor[named]{pgfstrokecolor}{rgb}{0.75,0.3720703125,0}\textsc{Z}}],[\verb~parentOf~,{\color[rgb]{0.75,0.3720703125,0}\definecolor[named]{pgfstrokecolor}{rgb}{0.75,0.3720703125,0}\textsc{Z}},{\color[rgb]{0.75,0.3720703125,0}\definecolor[named]{pgfstrokecolor}{rgb}{0.75,0.3720703125,0}\textsc{Y}}]]$ is used in Or for translating a goal like $[\verb~grandfatherOf~,{\color[rgb]{0.75,0.3720703125,0}\definecolor[named]{pgfstrokecolor}{rgb}{0.75,0.3720703125,0}\textsc{Q}},\textsc{bart}]$ into subgoals $[\verb~fatherOf~,{\color[rgb]{0.75,0.3720703125,0}\definecolor[named]{pgfstrokecolor}{rgb}{0.75,0.3720703125,0}\textsc{Q}},{\color[rgb]{0.75,0.3720703125,0}\definecolor[named]{pgfstrokecolor}{rgb}{0.75,0.3720703125,0}\textsc{Z}}]$ and $[\verb~parentOf~,{\color[rgb]{0.75,0.3720703125,0}\definecolor[named]{pgfstrokecolor}{rgb}{0.75,0.3720703125,0}\textsc{Z}},\textsc{bart}]$ that are subsequently proven by And.For clarity, we will sometimes omit lists when writing rules and atoms, e.g., $\verb~grandfatherOf~({\color[rgb]{0.75,0.3720703125,0}\definecolor[named]{pgfstrokecolor}{rgb}{0.75,0.3720703125,0}\textsc{X}},{\color[rgb]{0.75,0.3720703125,0}\definecolor[named]{pgfstrokecolor}{rgb}{0.75,0.3720703125,0}\textsc{Y}})\ {\color[rgb]{0.6708984375,0.076171875,0.08203125}\definecolor[named]{pgfstrokecolor}{rgb}{0.6708984375,0.076171875,0.08203125}\text{:--}}\ \verb~fatherOf~({\color[rgb]{0.75,0.3720703125,0}\definecolor[named]{pgfstrokecolor}{rgb}{0.75,0.3720703125,0}\textsc{X}},{\color[rgb]{0.75,0.3720703125,0}\definecolor[named]{pgfstrokecolor}{rgb}{0.75,0.3720703125,0}\textsc{Z}}),\verb~parentOf~({\color[rgb]{0.75,0.3720703125,0}\definecolor[named]{pgfstrokecolor}{rgb}{0.75,0.3720703125,0}\textsc{Z}},{\color[rgb]{0.75,0.3720703125,0}\definecolor[named]{pgfstrokecolor}{rgb}{0.75,0.3720703125,0}\textsc{Y}})$ .

Differentiable Prover

In the following, we describe the recursive construction of NTPs– neural networks for end-to-end differentiable proving that allow us to calculate the gradient of proof successes with respect to vector representations of symbols. We define the construction of NTPs in terms of modules similar to dynamic neural module networks . Each module takes as inputs discrete objects (atoms and rules) and a proof state, and returns a list of new proof states (see Figure 1 for a graphical representation).

A proof state $S=(\psi,\rho)$ is a tuple consisting of the substitution set $\psi$ constructed in the proof so far and a neural network $\rho$ that outputs a real-valued success score of a (partial) proof. While discrete objects and the substitution set are only used during construction of the neural network, once the network is constructed a continuous proof success score can be calculated for many different goals at training and test time. To summarize, modules are instantiated by discrete objects and the substitution set. They construct a neural network representing the (partial) proof success score and recursively instantiate submodules to continue the proof.

The shared signature of modules is ${\color[rgb]{0.75,0,0}\definecolor[named]{pgfstrokecolor}{rgb}{0.75,0,0}\mathcal{D}}\times{\color[rgb]{0,0,0.65}\definecolor[named]{pgfstrokecolor}{rgb}{0,0,0.65}\mathcal{S}}\to{\color[rgb]{0,0,0.65}\definecolor[named]{pgfstrokecolor}{rgb}{0,0,0.65}\mathcal{S}}^{N}$ where ${\color[rgb]{0.75,0,0}\definecolor[named]{pgfstrokecolor}{rgb}{0.75,0,0}\mathcal{D}}$ is a domain that controls the construction of the network, ${\color[rgb]{0,0,0.65}\definecolor[named]{pgfstrokecolor}{rgb}{0,0,0.65}\mathcal{S}}$ is the domain of proof states, and $N$ is the number of output proof states. Furthermore, let $S_{\psi}$ denote the substitution set of the proof state $S$ and let $S_{\rho}$ denote the neural network for calculating the proof success.

Unification of two atoms, e.g., a goal that we want to prove and a rule head, is a central operation in backward chaining. Two non-variable symbols (predicates or constants) are checked for equality and the proof can be aborted if this check fails. However, we want to be able to apply rules even if symbols in the goal and head are not equal but similar in meaning (e.g. grandfatherOf and grandpaOf) and thus replace symbolic comparison with a computation that measures the similarity of both symbols in a vector space.

The module unify updates a substitution set and creates a neural network for comparing the vector representations of non-variable symbols in two sequences of terms. The signature of this module is ${\color[rgb]{0.75,0,0}\definecolor[named]{pgfstrokecolor}{rgb}{0.75,0,0}\mathcal{L}\times\mathcal{L}}\times{\color[rgb]{0,0,0.65}\definecolor[named]{pgfstrokecolor}{rgb}{0,0,0.65}\mathcal{S}}\to{\color[rgb]{0,0,0.65}\definecolor[named]{pgfstrokecolor}{rgb}{0,0,0.65}\mathcal{S}}$ where $\mathcal{L}$ is the domain of lists of terms. unify takes two atoms represented as lists of terms and an upstream proof state, and maps these to a new proof state (substitution set and proof success). To this end, unify iterates through the list of terms of two atoms and compares their symbols. If one of the symbols is a variable, a substitution is added to the substitution set. Otherwise, the vector representations of the two non-variable symbols are compared using a Radial Basis Function (RBF) kernel where $\mu$ is a hyperparameter that we set to $\frac{1}{\sqrt{2}}$ in our experiments. The following pseudocode implements unify. Note that " $\_$ " matches every argument and that the order matters, i.e., if arguments match a line, subsequent lines are not evaluated.

Assume that we are unifying two atoms $[\verb~grandpaOf~,\textsc{abe},\textsc{bart}]$ and $[s,{\color[rgb]{0.75,0.3720703125,0}\definecolor[named]{pgfstrokecolor}{rgb}{0.75,0.3720703125,0}\textsc{Q}},i]$ given an upstream proof state $S=(\varnothing,\rho)$ where the latter input atom has placeholders for a predicate $s$ and a constant $i$ , and the neural network $\rho$ would output $0.7$ when evaluated. Furthermore, assume $\verb~grandpaOf~$ , abe and bart represent the indices of the respective symbols in a global symbol vocabulary. Then, the new proof state constructed by unify is:

Thus, the output score of the neural network $S^{\prime}_{\rho}$ will be high if the subsymbolic representation of the input $s$ is close to $\verb~grandpaOf~$ and the input $i$ is close to bart. However, the score cannot be higher than $0.7$ due to the upstream proof success score in the forward pass of the neural network $\rho$ . Note that in addition to extending the neural networks $\rho$ to $S^{\prime}_{\rho}$ , this module also outputs a substitution set $\{{\color[rgb]{0.75,0.3720703125,0}\definecolor[named]{pgfstrokecolor}{rgb}{0.75,0.3720703125,0}\textsc{Q}}/\textsc{abe}\}$ at graph creation time that will be used to instantiate submodules.

2 OR Module

For a goal $[s,{\color[rgb]{0.75,0.3720703125,0}\definecolor[named]{pgfstrokecolor}{rgb}{0.75,0.3720703125,0}\textsc{Q}},i]$ , or would instantiate an and submodule based on the rule $[\verb~grandfatherOf~,{\color[rgb]{0.75,0.3720703125,0}\definecolor[named]{pgfstrokecolor}{rgb}{0.75,0.3720703125,0}\textsc{X}},{\color[rgb]{0.75,0.3720703125,0}\definecolor[named]{pgfstrokecolor}{rgb}{0.75,0.3720703125,0}\textsc{Y}}]$ :– $[[\verb~fatherOf~,{\color[rgb]{0.75,0.3720703125,0}\definecolor[named]{pgfstrokecolor}{rgb}{0.75,0.3720703125,0}\textsc{X}},{\color[rgb]{0.75,0.3720703125,0}\definecolor[named]{pgfstrokecolor}{rgb}{0.75,0.3720703125,0}\textsc{Z}}],[\verb~parentOf~,{\color[rgb]{0.75,0.3720703125,0}\definecolor[named]{pgfstrokecolor}{rgb}{0.75,0.3720703125,0}\textsc{Z}},{\color[rgb]{0.75,0.3720703125,0}\definecolor[named]{pgfstrokecolor}{rgb}{0.75,0.3720703125,0}\textsc{Y}}]]$ as follows

3 AND Module

For implementing and we first define an auxiliary function called substitute which applies substitutions to variables in an atom if possible. This is realized via

For example, $\text{substitute}(\bm{[}\verb~fatherOf~,{\color[rgb]{0.75,0.3720703125,0}\definecolor[named]{pgfstrokecolor}{rgb}{0.75,0.3720703125,0}\textsc{X}},{\color[rgb]{0.75,0.3720703125,0}\definecolor[named]{pgfstrokecolor}{rgb}{0.75,0.3720703125,0}\textsc{Z}}\bm{]},\{{\color[rgb]{0.75,0.3720703125,0}\definecolor[named]{pgfstrokecolor}{rgb}{0.75,0.3720703125,0}\textsc{X}}/{\color[rgb]{0.75,0.3720703125,0}\definecolor[named]{pgfstrokecolor}{rgb}{0.75,0.3720703125,0}\textsc{Q}},{\color[rgb]{0.75,0.3720703125,0}\definecolor[named]{pgfstrokecolor}{rgb}{0.75,0.3720703125,0}\textsc{Y}}/i\})$ results in $\bm{[}\verb~fatherOf~,{\color[rgb]{0.75,0.3720703125,0}\definecolor[named]{pgfstrokecolor}{rgb}{0.75,0.3720703125,0}\textsc{Q}},{\color[rgb]{0.75,0.3720703125,0}\definecolor[named]{pgfstrokecolor}{rgb}{0.75,0.3720703125,0}\textsc{Z}}\bm{]}$ .

Continuing the example from Section 3.2, the and module would instantiate submodules as follows:

4 Proof Aggregation

Finally, we define the overall success score of proving a goal G using a KB $\mathfrak{K}$ with parameters ${{\bm{\theta}}}$ as

where $d$ is a predefined maximum proof depth and the initial proof state is set to an empty substitution set and a proof success score of $1$ .

Figure 2 illustrates an examplary NTP computation graph constructed for a toy KB. Note that such an NTP is constructed once before training, and can then be used for proving goals of the structure $[s,i,j]$ at training and test time where $s$ is the index of an input predicate, and $i$ and $j$ are indices of input constants. Final proof states which are used in proof aggregation are underlined.

5 Neural Inductive Logic Programming

We can use NTPs for ILP by gradient descent instead of a combinatorial search over the space of rules as, for example, done by the First Order Inductive Learner (FOIL) . Specifically, we are using the concept of learning from entailment to induce rules that let us prove known ground atoms, but that do not give high proof success scores to sampled unknown ground atoms.

Optimization

In this section, we present the basic training loss that we use for NTPs, a training loss where a neural link prediction models is used as auxiliary task, as well as various computational optimizations.

where $[s,i,j]$ is a training ground atom and $y$ its target proof success score. Note that since in our application all training facts are ground atoms, we only make use of the proof success score $\rho$ and not the substitution list of the resulting proof state. We can prove known facts trivially by a unification with themselves, resulting in no parameter updates during training and hence no generalization. Therefore, during training we are masking the calculation of the unification success of a known ground atom that we want to prove. Specifically, we set the unification score to to temporarily hide that training fact and assume it can be proven from other facts and rules in the KB.

2 Neural Link Prediction as Auxiliary Loss

At the beginning of training all subsymbolic representations are initialized randomly. When unifying a goal with all facts in a KB we consequently get very noisy success scores in early stages of training. Moreover, as only the maximum success score will result in gradient updates for the respective subsymbolic representations along the maximum proof path, it can take a long time until NTPs learn to place similar symbols close to each other in the vector space and to make effective use of rules.

To speed up learning subsymbolic representations, we train NTPs jointly with ComplEx (Appendix B). ComplEx and the NTP share the same subsymbolic representations, which is feasible as the RBF kernel in unify is also defined for complex vectors. While the NTP is responsible for multi-hop reasoning, the neural link prediction model learns to score ground atoms locally. At test time, only the NTP is used for predictions. Thus, the training loss for ComplEx can be seen as an auxiliary loss for the subsymbolic representations learned by the NTP. We term the resulting model NTP $\lambda$ . Based on the loss in Section 4.1, the joint training loss is defined as

where $[s,i,j]$ is a training atom and $y$ its ground truth target.

3 Computational Optimizations

NTPs as described above suffer from severe computational limitations since the neural network is representing all possible proofs up to some predefined depth. In contrast to symbolic backward chaining where a proof can be aborted as soon as unification fails, in differentiable proving we only get a unification failure for atoms whose arity does not match or when we detect cyclic rule application. We propose two optimizations to speed up NTPs in the Appendix. First, we make use of modern GPUs by batch processing many proofs in parallel (Appendix C). Second, we exploit the sparseness of gradients caused by the $\min$ and $\max$ operations used in the unification and proof aggregation respectively to derive a heuristic for a truncated forward and backward pass that drastically reduces the number of proofs that have to be considered for calculating gradients (Appendix D).

Experiments

Consistent with previous work, we carry out experiments on four benchmark KBs and compare ComplEx with the NTP and NTP $\lambda$ in terms of area under the Precision-Recall-curve (AUC-PR) on the Countries KB, and Mean Reciprocal Rank (MRR) and HITS@ $m$ on the other KBs described below. Training details, including hyperparameters and rule templates, can be found in Appendix E.

The Countries KB is a dataset introduced by for testing reasoning capabilities of neural link prediction models. It consists of $244$ countries, $5$ regions (e.g. europe), $23$ subregions (e.g. western europe, northern america), and $1158$ facts about the neighborhood of countries, and the location of countries and subregions. We follow and split countries randomly into a training set of $204$ countries (train), a development set of $20$ countries (dev), and a test set of $20$ countries (test), such that every dev and test country has at least one neighbor in the training set. Subsequently, three different task datasets are created. For all tasks, the goal is to predict $\verb~locatedIn~(c,r)$ for every test country $c$ and all five regions $r$ , but the access to training atoms in the KB varies. S1: All ground atoms $\verb~locatedIn~(c,r)$ where $c$ is a test country and $r$ is a region are removed from the KB. Since information about the subregion of test countries is still contained in the KB, this task can be solved by using the transitivity rule $\verb~locatedIn~({\color[rgb]{0.75,0.3720703125,0}\definecolor[named]{pgfstrokecolor}{rgb}{0.75,0.3720703125,0}\textsc{X}},{\color[rgb]{0.75,0.3720703125,0}\definecolor[named]{pgfstrokecolor}{rgb}{0.75,0.3720703125,0}\textsc{Y}})\ {\color[rgb]{0.6708984375,0.076171875,0.08203125}\definecolor[named]{pgfstrokecolor}{rgb}{0.6708984375,0.076171875,0.08203125}\text{:--}}\ \verb~locatedIn~({\color[rgb]{0.75,0.3720703125,0}\definecolor[named]{pgfstrokecolor}{rgb}{0.75,0.3720703125,0}\textsc{X}},{\color[rgb]{0.75,0.3720703125,0}\definecolor[named]{pgfstrokecolor}{rgb}{0.75,0.3720703125,0}\textsc{Z}}),\verb~locatedIn~({\color[rgb]{0.75,0.3720703125,0}\definecolor[named]{pgfstrokecolor}{rgb}{0.75,0.3720703125,0}\textsc{Z}},{\color[rgb]{0.75,0.3720703125,0}\definecolor[named]{pgfstrokecolor}{rgb}{0.75,0.3720703125,0}\textsc{Y}}).$ S2: In addition to S1, all ground atoms $\verb~locatedIn~(c,s)$ are removed where $c$ is a test country and $s$ is a subregion. The location of test countries needs to be inferred from the location of its neighboring countries: $\verb~locatedIn~({\color[rgb]{0.75,0.3720703125,0}\definecolor[named]{pgfstrokecolor}{rgb}{0.75,0.3720703125,0}\textsc{X}},{\color[rgb]{0.75,0.3720703125,0}\definecolor[named]{pgfstrokecolor}{rgb}{0.75,0.3720703125,0}\textsc{Y}})\ {\color[rgb]{0.6708984375,0.076171875,0.08203125}\definecolor[named]{pgfstrokecolor}{rgb}{0.6708984375,0.076171875,0.08203125}\text{:--}}\ \verb~neighborOf~({\color[rgb]{0.75,0.3720703125,0}\definecolor[named]{pgfstrokecolor}{rgb}{0.75,0.3720703125,0}\textsc{X}},{\color[rgb]{0.75,0.3720703125,0}\definecolor[named]{pgfstrokecolor}{rgb}{0.75,0.3720703125,0}\textsc{Z}}),\verb~locatedIn~({\color[rgb]{0.75,0.3720703125,0}\definecolor[named]{pgfstrokecolor}{rgb}{0.75,0.3720703125,0}\textsc{Z}},{\color[rgb]{0.75,0.3720703125,0}\definecolor[named]{pgfstrokecolor}{rgb}{0.75,0.3720703125,0}\textsc{Y}}).$ This task is more difficult than S1, as neighboring countries might not be in the same region, so the rule above will not always hold. S3: In addition to S2, all ground atoms $\verb~locatedIn~(c,r)$ where $r$ is a region and $c$ is a training country that has a test or dev country as a neighbor are also removed. The location of test countries can for instance be inferred using the three-hop rule $\verb~locatedIn~({\color[rgb]{0.75,0.3720703125,0}\definecolor[named]{pgfstrokecolor}{rgb}{0.75,0.3720703125,0}\textsc{X}},{\color[rgb]{0.75,0.3720703125,0}\definecolor[named]{pgfstrokecolor}{rgb}{0.75,0.3720703125,0}\textsc{Y}})\ {\color[rgb]{0.6708984375,0.076171875,0.08203125}\definecolor[named]{pgfstrokecolor}{rgb}{0.6708984375,0.076171875,0.08203125}\text{:--}}\ \verb~neighborOf~({\color[rgb]{0.75,0.3720703125,0}\definecolor[named]{pgfstrokecolor}{rgb}{0.75,0.3720703125,0}\textsc{X}},{\color[rgb]{0.75,0.3720703125,0}\definecolor[named]{pgfstrokecolor}{rgb}{0.75,0.3720703125,0}\textsc{Z}}),\verb~neighborOf~({\color[rgb]{0.75,0.3720703125,0}\definecolor[named]{pgfstrokecolor}{rgb}{0.75,0.3720703125,0}\textsc{Z}},{\color[rgb]{0.75,0.3720703125,0}\definecolor[named]{pgfstrokecolor}{rgb}{0.75,0.3720703125,0}\textsc{W}}),\verb~locatedIn~({\color[rgb]{0.75,0.3720703125,0}\definecolor[named]{pgfstrokecolor}{rgb}{0.75,0.3720703125,0}\textsc{W}},{\color[rgb]{0.75,0.3720703125,0}\definecolor[named]{pgfstrokecolor}{rgb}{0.75,0.3720703125,0}\textsc{Y}}).$

Kinship, Nations & UMLS

We use the Nations, Alyawarra kinship (Kinship) and Unified Medical Language System (UMLS) KBs from . We left out the Animals dataset as it only contains unary predicates and can thus not be used for evaluating multi-hop reasoning. Nations contains $56$ binary predicates, $111$ unary predicates, $14$ constants and $2565$ true facts, Kinship contains $26$ predicates, $104$ constants and $10686$ true facts, and UMLS contains $49$ predicates, $135$ constants and $6529$ true facts. Since our baseline ComplEx cannot deal with unary predicates, we remove unary atoms from Nations. We split every KB into $80\%$ training facts, $10\%$ development facts and $10\%$ test facts. For evaluation, we take a test fact and corrupt its first and second argument in all possible ways such that the corrupted fact is not in the original KB. Subsequently, we predict a ranking of every test fact and its corruptions to calculate MRR and HITS@ $m$ .

Results and Discussion

Results for the different model variants on the benchmark KBs are shown in Table 1. Another method for inducing rules in a differentiable way for automated KB completion has been introduced recently by and our evaluation setup is equivalent to their Protocol II. However, our neural link prediction baseline, ComplEx, already achieves much higher HITS@10 results ( $1.00$ vs. $0.70$ on UMLS and $0.98$ vs. $0.73$ on Kinship). We thus focus on the comparison of NTPs with ComplEx.

First, we note that vanilla NTPs alone do not work particularly well compared to ComplEx. They only outperform ComplEx on Countries S3 and Nations, but not on Kinship or UMLS. This demonstrates the difficulty of learning subsymbolic representations in a differentiable prover from unification alone, and the need for auxiliary losses. The NTP $\lambda$ with ComplEx as auxiliary loss outperforms the other models in the majority of tasks. The difference in AUC-PR between ComplEx and NTP $\lambda$ is significant for all Countries tasks ( $p<0.0001$ ).

A major advantage of NTPs is that we can inspect induced rules which provide us with an interpretable representation of what the model has learned. The right column in Table 1 shows examples of induced rules by NTP $\lambda$ (note that predicates on Kinship are anonymized). For Countries, the NTP recovered those rules that are needed for solving the three different tasks. On UMLS, the NTP induced transitivity rules. Those relationships are particularly hard to encode by neural link prediction models like ComplEx, as they are optimized to locally predict the score of a fact.

Related Work

Combining neural and symbolic approaches to relational learning and reasoning has a long tradition and let to various proposed architectures over the past decades (see for a review). Early proposals for neural-symbolic networks are limited to propositional rules (e.g., EBL-ANN , KBANN and C-IL2P ). Other neural-symbolic approaches focus on first-order inference, but do not learn subsymbolic vector representations from training facts in a KB (e.g., SHRUTI , Neural Prolog , CLIP++ , Lifted Relational Neural Networks , and TensorLog ). Logic Tensor Networks are in spirit similar to NTPs, but need to fully ground first-order logic rules. However, they support function terms, whereas NTPs currently only support function-free terms.

Recent question-answering architectures such as translate query representations implicitly in a vector space without explicit rule representations and can thus not easily incorporate domain-specific knowledge. In addition, NTPs are related to random walk and path encoding models . However, instead of aggregating paths from random walks or encoding paths to predict a target predicate, reasoning steps in NTPs are explicit and only unification uses subsymbolic representations. This allows us to induce interpretable rules, as well as to incorporate prior knowledge either in the form of rules or in the form of rule templates which define the structure of logical relationships that we expect to hold in a KB. Another line of work regularizes distributed representations via domain-specific rules, but these approaches do not learn such rules from data and only support a restricted subset of first-order logic. NTPs are constructed from Prolog’s backward chaining and are thus related to Unification Neural Networks . However, NTPs operate on vector representations of symbols instead of scalar values, which are more expressive.

As NTPs can learn rules from data, they are related to ILP systems such as FOIL , Sherlock and meta-interpretive learning of higher-order dyadic Datalog (Metagol) . While these ILP systems operate on symbols and search over the discrete space of logical rules, NTPs work with subsymbolic representations and induce rules using gradient descent. Recently, introduced a differentiable rule learning system based on TensorLog and a neural network controller similar to LSTMs . Their method is more scalable than the NTPs introduced here. However, on UMLS and Kinship our baseline already achieved stronger generalization by learning subsymbolic representations. Still, scaling NTPs to larger KBs for competing with more scalable relational learning methods is an open problem that we seek to address in future work.

Conclusion and Future Work

We proposed an end-to-end differentiable prover for automated KB completion that operates on subsymbolic representations. To this end, we used Prolog’s backward chaining algorithm as a recipe for recursively constructing neural networks that can be used to prove queries to a KB. Specifically, we introduced a differentiable unification operation between vector representations of symbols. The constructed neural network allowed us to compute the gradient of proof successes with respect to vector representations of symbols, and thus enabled us to train subsymbolic representations end-to-end from facts in a KB, and to induce function-free first-order logic rules using gradient descent. On benchmark KBs, our model outperformed ComplEx, a state-of-the-art neural link prediction model, on three out of four KBs while at the same time inducing interpretable rules.

To overcome the computational limitations of the end-to-end differentiable prover introduced in this paper, we want to investigate the use of hierarchical attention and reinforcement learning methods such as Monte Carlo tree search that have been used for learning to play Go and chemical synthesis planning . In addition, we plan to support function terms in the future. Based on , we are furthermore interested in applying NTPs to automated proving of mathematical theorems, either in logical or natural language form, similar to recent approaches by and .

Acknowledgements

We thank Pasquale Minervini, Tim Dettmers, Matko Bosnjak, Johannes Welbl, Naoya Inoue, Kai Arulkumaran, and the anonymous reviewers for very helpful comments on drafts of this paper. This work has been supported by a Google PhD Fellowship in Natural Language Processing, an Allen Distinguished Investigator Award, and a Marie Curie Career Integration Award.

References

Appendix

Appendix A Backward Chaining Pseudocode

Simplified pseudocode for symbolic backward chaining (cycle detection omitted for brevity, see for details).

Appendix B ComplEx

where $\odot$ denotes the element-wise multiplication and $\sigma$ the sigmoid function. The benefit of ComplEx over other neural link prediction models such as RESCAL or DistMult is that by using complex vectors as subsymbolic representations it can capture symmetric as well as asymmetric relations.

Appendix C Batch Proving

where ${\bm{1}}_{N}$ and ${\bm{1}}_{M}$ are vectors of $N$ and $M$ ones respectively, and the square root is taken element-wise. In practice, we partition the KB into rules that have the same structure and batch-unify goals with all rule heads per partition at the same time on a Graphics Processing Unit (GPU). Furthermore, substitution sets bind variables to vectors of symbol indices instead of single symbol indices, and min and max operations are taken per goal.

Appendix D Kmax𝐾K\max Gradient Approximation

NTPs allow us to calculate the gradient of proof success scores with respect to subsymbolic representations and rule parameters. While backpropagating through this large computation graph will give us the exact gradient, it is computationally infeasible for any reasonably-sized KB. Consider the parameterized rule $\bm{\theta}_{1:}({\color[rgb]{0.75,0.3720703125,0}\definecolor[named]{pgfstrokecolor}{rgb}{0.75,0.3720703125,0}\textsc{X}},{\color[rgb]{0.75,0.3720703125,0}\definecolor[named]{pgfstrokecolor}{rgb}{0.75,0.3720703125,0}\textsc{Y}})\ {\color[rgb]{0.6708984375,0.076171875,0.08203125}\definecolor[named]{pgfstrokecolor}{rgb}{0.6708984375,0.076171875,0.08203125}\text{:--}}\ \bm{\theta}_{2:}({\color[rgb]{0.75,0.3720703125,0}\definecolor[named]{pgfstrokecolor}{rgb}{0.75,0.3720703125,0}\textsc{X}},{\color[rgb]{0.75,0.3720703125,0}\definecolor[named]{pgfstrokecolor}{rgb}{0.75,0.3720703125,0}\textsc{Z}}),\bm{\theta}_{3:}({\color[rgb]{0.75,0.3720703125,0}\definecolor[named]{pgfstrokecolor}{rgb}{0.75,0.3720703125,0}\textsc{Z}},{\color[rgb]{0.75,0.3720703125,0}\definecolor[named]{pgfstrokecolor}{rgb}{0.75,0.3720703125,0}\textsc{Y}})$ and let us assume the given KB contains $1\ 000$ facts with binary predicates. While X and Y will be bound to the respective representations in the goal, Z we will be substituted with every possible second argument of the $1\ 000$ facts in the KB when proving the first atom in the body. Moreover, for each of these $1\ 000$ substitutions, we will again need to compare with all facts in the KB when proving the second atom in the body of the rule, resulting in $1\ 000\ 000$ proof success scores. However, note that since we use the max operator for aggregating the success of different proofs, only subsymbolic representations in one out of $1\ 000\ 000$ proofs will receive gradients.

To overcome this computational limitation, we propose the following heuristic. We assume that when unifying the first atom with facts in the KB, it is unlikely for any unification successes below the top $K$ successes to attain the maximum proof success when unifying the remaining atoms in the body of a rule with facts in the KB. That is, after the unification of the first atom, we only keep the top $K$ substitutions and their success scores, and continue proving only with these. This means that all other partial proofs will not contribute to the forward pass at this stage, and consequently not receive any gradients on the backward pass of backpropagation. We term this the $K\max$ heuristic. Note that we cannot guarantee anymore that the gradient of the proof success is the exact gradient, but for a large enough $K$ we get a close approximation to the true gradient.

Appendix E Training Details

Countries S1 3 $\#1({\color[rgb]{0.75,0.3720703125,0}\definecolor[named]{pgfstrokecolor}{rgb}{0.75,0.3720703125,0}\textsc{X}},{\color[rgb]{0.75,0.3720703125,0}\definecolor[named]{pgfstrokecolor}{rgb}{0.75,0.3720703125,0}\textsc{Y}})\ {\color[rgb]{0.6708984375,0.076171875,0.08203125}\definecolor[named]{pgfstrokecolor}{rgb}{0.6708984375,0.076171875,0.08203125}\text{:--}}\ \#1({\color[rgb]{0.75,0.3720703125,0}\definecolor[named]{pgfstrokecolor}{rgb}{0.75,0.3720703125,0}\textsc{Y}},{\color[rgb]{0.75,0.3720703125,0}\definecolor[named]{pgfstrokecolor}{rgb}{0.75,0.3720703125,0}\textsc{X}}).$ 3 $\#1({\color[rgb]{0.75,0.3720703125,0}\definecolor[named]{pgfstrokecolor}{rgb}{0.75,0.3720703125,0}\textsc{X}},{\color[rgb]{0.75,0.3720703125,0}\definecolor[named]{pgfstrokecolor}{rgb}{0.75,0.3720703125,0}\textsc{Y}})\ {\color[rgb]{0.6708984375,0.076171875,0.08203125}\definecolor[named]{pgfstrokecolor}{rgb}{0.6708984375,0.076171875,0.08203125}\text{:--}}\ \#2({\color[rgb]{0.75,0.3720703125,0}\definecolor[named]{pgfstrokecolor}{rgb}{0.75,0.3720703125,0}\textsc{X}},{\color[rgb]{0.75,0.3720703125,0}\definecolor[named]{pgfstrokecolor}{rgb}{0.75,0.3720703125,0}\textsc{Z}}),\#2({\color[rgb]{0.75,0.3720703125,0}\definecolor[named]{pgfstrokecolor}{rgb}{0.75,0.3720703125,0}\textsc{Z}},{\color[rgb]{0.75,0.3720703125,0}\definecolor[named]{pgfstrokecolor}{rgb}{0.75,0.3720703125,0}\textsc{Y}}).$

Countries S2 3 $\#1({\color[rgb]{0.75,0.3720703125,0}\definecolor[named]{pgfstrokecolor}{rgb}{0.75,0.3720703125,0}\textsc{X}},{\color[rgb]{0.75,0.3720703125,0}\definecolor[named]{pgfstrokecolor}{rgb}{0.75,0.3720703125,0}\textsc{Y}})\ {\color[rgb]{0.6708984375,0.076171875,0.08203125}\definecolor[named]{pgfstrokecolor}{rgb}{0.6708984375,0.076171875,0.08203125}\text{:--}}\ \#1({\color[rgb]{0.75,0.3720703125,0}\definecolor[named]{pgfstrokecolor}{rgb}{0.75,0.3720703125,0}\textsc{Y}},{\color[rgb]{0.75,0.3720703125,0}\definecolor[named]{pgfstrokecolor}{rgb}{0.75,0.3720703125,0}\textsc{X}}).$ 3 $\#1({\color[rgb]{0.75,0.3720703125,0}\definecolor[named]{pgfstrokecolor}{rgb}{0.75,0.3720703125,0}\textsc{X}},{\color[rgb]{0.75,0.3720703125,0}\definecolor[named]{pgfstrokecolor}{rgb}{0.75,0.3720703125,0}\textsc{Y}})\ {\color[rgb]{0.6708984375,0.076171875,0.08203125}\definecolor[named]{pgfstrokecolor}{rgb}{0.6708984375,0.076171875,0.08203125}\text{:--}}\ \#2({\color[rgb]{0.75,0.3720703125,0}\definecolor[named]{pgfstrokecolor}{rgb}{0.75,0.3720703125,0}\textsc{X}},{\color[rgb]{0.75,0.3720703125,0}\definecolor[named]{pgfstrokecolor}{rgb}{0.75,0.3720703125,0}\textsc{Z}}),\#2({\color[rgb]{0.75,0.3720703125,0}\definecolor[named]{pgfstrokecolor}{rgb}{0.75,0.3720703125,0}\textsc{Z}},{\color[rgb]{0.75,0.3720703125,0}\definecolor[named]{pgfstrokecolor}{rgb}{0.75,0.3720703125,0}\textsc{Y}}).$ 3 $\#1({\color[rgb]{0.75,0.3720703125,0}\definecolor[named]{pgfstrokecolor}{rgb}{0.75,0.3720703125,0}\textsc{X}},{\color[rgb]{0.75,0.3720703125,0}\definecolor[named]{pgfstrokecolor}{rgb}{0.75,0.3720703125,0}\textsc{Y}})\ {\color[rgb]{0.6708984375,0.076171875,0.08203125}\definecolor[named]{pgfstrokecolor}{rgb}{0.6708984375,0.076171875,0.08203125}\text{:--}}\ \#2({\color[rgb]{0.75,0.3720703125,0}\definecolor[named]{pgfstrokecolor}{rgb}{0.75,0.3720703125,0}\textsc{X}},{\color[rgb]{0.75,0.3720703125,0}\definecolor[named]{pgfstrokecolor}{rgb}{0.75,0.3720703125,0}\textsc{Z}}),\#3({\color[rgb]{0.75,0.3720703125,0}\definecolor[named]{pgfstrokecolor}{rgb}{0.75,0.3720703125,0}\textsc{Z}},{\color[rgb]{0.75,0.3720703125,0}\definecolor[named]{pgfstrokecolor}{rgb}{0.75,0.3720703125,0}\textsc{Y}}).$

Countries S3 3 $\#1({\color[rgb]{0.75,0.3720703125,0}\definecolor[named]{pgfstrokecolor}{rgb}{0.75,0.3720703125,0}\textsc{X}},{\color[rgb]{0.75,0.3720703125,0}\definecolor[named]{pgfstrokecolor}{rgb}{0.75,0.3720703125,0}\textsc{Y}})\ {\color[rgb]{0.6708984375,0.076171875,0.08203125}\definecolor[named]{pgfstrokecolor}{rgb}{0.6708984375,0.076171875,0.08203125}\text{:--}}\ \#1({\color[rgb]{0.75,0.3720703125,0}\definecolor[named]{pgfstrokecolor}{rgb}{0.75,0.3720703125,0}\textsc{Y}},{\color[rgb]{0.75,0.3720703125,0}\definecolor[named]{pgfstrokecolor}{rgb}{0.75,0.3720703125,0}\textsc{X}}).$ 3 $\#1({\color[rgb]{0.75,0.3720703125,0}\definecolor[named]{pgfstrokecolor}{rgb}{0.75,0.3720703125,0}\textsc{X}},{\color[rgb]{0.75,0.3720703125,0}\definecolor[named]{pgfstrokecolor}{rgb}{0.75,0.3720703125,0}\textsc{Y}})\ {\color[rgb]{0.6708984375,0.076171875,0.08203125}\definecolor[named]{pgfstrokecolor}{rgb}{0.6708984375,0.076171875,0.08203125}\text{:--}}\ \#2({\color[rgb]{0.75,0.3720703125,0}\definecolor[named]{pgfstrokecolor}{rgb}{0.75,0.3720703125,0}\textsc{X}},{\color[rgb]{0.75,0.3720703125,0}\definecolor[named]{pgfstrokecolor}{rgb}{0.75,0.3720703125,0}\textsc{Z}}),\#2({\color[rgb]{0.75,0.3720703125,0}\definecolor[named]{pgfstrokecolor}{rgb}{0.75,0.3720703125,0}\textsc{Z}},{\color[rgb]{0.75,0.3720703125,0}\definecolor[named]{pgfstrokecolor}{rgb}{0.75,0.3720703125,0}\textsc{Y}}).$ 3 $\#1({\color[rgb]{0.75,0.3720703125,0}\definecolor[named]{pgfstrokecolor}{rgb}{0.75,0.3720703125,0}\textsc{X}},{\color[rgb]{0.75,0.3720703125,0}\definecolor[named]{pgfstrokecolor}{rgb}{0.75,0.3720703125,0}\textsc{Y}})\ {\color[rgb]{0.6708984375,0.076171875,0.08203125}\definecolor[named]{pgfstrokecolor}{rgb}{0.6708984375,0.076171875,0.08203125}\text{:--}}\ \#2({\color[rgb]{0.75,0.3720703125,0}\definecolor[named]{pgfstrokecolor}{rgb}{0.75,0.3720703125,0}\textsc{X}},{\color[rgb]{0.75,0.3720703125,0}\definecolor[named]{pgfstrokecolor}{rgb}{0.75,0.3720703125,0}\textsc{Z}}),\#3({\color[rgb]{0.75,0.3720703125,0}\definecolor[named]{pgfstrokecolor}{rgb}{0.75,0.3720703125,0}\textsc{Z}},{\color[rgb]{0.75,0.3720703125,0}\definecolor[named]{pgfstrokecolor}{rgb}{0.75,0.3720703125,0}\textsc{Y}}).$ 3 $\#1({\color[rgb]{0.75,0.3720703125,0}\definecolor[named]{pgfstrokecolor}{rgb}{0.75,0.3720703125,0}\textsc{X}},{\color[rgb]{0.75,0.3720703125,0}\definecolor[named]{pgfstrokecolor}{rgb}{0.75,0.3720703125,0}\textsc{Y}})\ {\color[rgb]{0.6708984375,0.076171875,0.08203125}\definecolor[named]{pgfstrokecolor}{rgb}{0.6708984375,0.076171875,0.08203125}\text{:--}}\ \#2({\color[rgb]{0.75,0.3720703125,0}\definecolor[named]{pgfstrokecolor}{rgb}{0.75,0.3720703125,0}\textsc{X}},{\color[rgb]{0.75,0.3720703125,0}\definecolor[named]{pgfstrokecolor}{rgb}{0.75,0.3720703125,0}\textsc{Z}}),\#3({\color[rgb]{0.75,0.3720703125,0}\definecolor[named]{pgfstrokecolor}{rgb}{0.75,0.3720703125,0}\textsc{Z}},{\color[rgb]{0.75,0.3720703125,0}\definecolor[named]{pgfstrokecolor}{rgb}{0.75,0.3720703125,0}\textsc{W}}),\#4({\color[rgb]{0.75,0.3720703125,0}\definecolor[named]{pgfstrokecolor}{rgb}{0.75,0.3720703125,0}\textsc{W}},{\color[rgb]{0.75,0.3720703125,0}\definecolor[named]{pgfstrokecolor}{rgb}{0.75,0.3720703125,0}\textsc{Y}}).$

Kinship, Nations & UMLS 20 $\#1({\color[rgb]{0.75,0.3720703125,0}\definecolor[named]{pgfstrokecolor}{rgb}{0.75,0.3720703125,0}\textsc{X}},{\color[rgb]{0.75,0.3720703125,0}\definecolor[named]{pgfstrokecolor}{rgb}{0.75,0.3720703125,0}\textsc{Y}})\ {\color[rgb]{0.6708984375,0.076171875,0.08203125}\definecolor[named]{pgfstrokecolor}{rgb}{0.6708984375,0.076171875,0.08203125}\text{:--}}\ \#2({\color[rgb]{0.75,0.3720703125,0}\definecolor[named]{pgfstrokecolor}{rgb}{0.75,0.3720703125,0}\textsc{X}},{\color[rgb]{0.75,0.3720703125,0}\definecolor[named]{pgfstrokecolor}{rgb}{0.75,0.3720703125,0}\textsc{Y}}).$ 20 $\#1({\color[rgb]{0.75,0.3720703125,0}\definecolor[named]{pgfstrokecolor}{rgb}{0.75,0.3720703125,0}\textsc{X}},{\color[rgb]{0.75,0.3720703125,0}\definecolor[named]{pgfstrokecolor}{rgb}{0.75,0.3720703125,0}\textsc{Y}})\ {\color[rgb]{0.6708984375,0.076171875,0.08203125}\definecolor[named]{pgfstrokecolor}{rgb}{0.6708984375,0.076171875,0.08203125}\text{:--}}\ \#2({\color[rgb]{0.75,0.3720703125,0}\definecolor[named]{pgfstrokecolor}{rgb}{0.75,0.3720703125,0}\textsc{Y}},{\color[rgb]{0.75,0.3720703125,0}\definecolor[named]{pgfstrokecolor}{rgb}{0.75,0.3720703125,0}\textsc{X}}).$ 20 $\#1({\color[rgb]{0.75,0.3720703125,0}\definecolor[named]{pgfstrokecolor}{rgb}{0.75,0.3720703125,0}\textsc{X}},{\color[rgb]{0.75,0.3720703125,0}\definecolor[named]{pgfstrokecolor}{rgb}{0.75,0.3720703125,0}\textsc{Y}})\ {\color[rgb]{0.6708984375,0.076171875,0.08203125}\definecolor[named]{pgfstrokecolor}{rgb}{0.6708984375,0.076171875,0.08203125}\text{:--}}\ \#2({\color[rgb]{0.75,0.3720703125,0}\definecolor[named]{pgfstrokecolor}{rgb}{0.75,0.3720703125,0}\textsc{X}},{\color[rgb]{0.75,0.3720703125,0}\definecolor[named]{pgfstrokecolor}{rgb}{0.75,0.3720703125,0}\textsc{Z}}),\#3({\color[rgb]{0.75,0.3720703125,0}\definecolor[named]{pgfstrokecolor}{rgb}{0.75,0.3720703125,0}\textsc{Z}},{\color[rgb]{0.75,0.3720703125,0}\definecolor[named]{pgfstrokecolor}{rgb}{0.75,0.3720703125,0}\textsc{Y}}).$