TensorLog: A Differentiable Deductive Database

William W. Cohen

Introduction

Large knowledge bases (KBs) have proven useful in many tasks, but it is unclear how to integrate this sort of knowledge into “deep” gradient-based learning systems. Motivated by this, we describe a probabilistic deductive database (PrDDB) system in which reasoning is performed by a differentiable process. In addition to enabling novel gradient-based learning algorithms for PrDDBs, this approach could potentially enable tight integration of logical reasoning into deep learners (or conversely, of deep learning into reasoning systems.

In a traditional deductive database (DDB), a database DB{\cal{DB}} with a theory T{\cal{T}} together define a set of facts f1,,fnf_{1},\ldots,f_{n} which can be derived by accessing the database and reasoning using T{\cal{T}}. As an example, Figure 1 contains a small theory and an associated database. End users can test to see if a fact ff is derivable, or retrieve all derivable facts that match some query: e.g., one could test if f=uncle(joe,bob)f={\texttt{uncle(joe,bob)}} is derivable in the sample database, or find all values of YY such that uncle(joe,Y) holds. A probabilistic DDB is a “soft” extension of a DDB, where derived facts ff have a numeric confidence, typically based on augmenting DB{\cal{DB}} with a set of parameters Θ\Theta. In many existing PrDDB models computation of confidences is computationally expensive, and often not be conducive to learning the parametersb Θ\Theta.

Here we describe a probabilistic deductive database called TensorLog in which reasoning uses a differentiable process. In TensorLog, each clause in a logical theory is first converted into certain type of factor graph, in which each logical variable appearing in the clause is associated with a random variable in the factor graph, and each literal is associated with a factor (as shown in Figure 2). Then, for each type of query to the factor graph, the message-passing steps required to perform BP are “unrolled” into a function, which is differentiable. Each function will answer queries for a particular combination of evidence variables and query variables in the factor graph, which in turn corresponds to logical queries in a particular syntactic form. We also show how these functions can be composed recursively to perform inference in non-trivial logical theories containing multiple interrelated clauses and predicates.

In TensorLog, compilation is linear in theory size and proof depth, and inference is linear in database size and the number of message-passing steps used in BP. Most importantly, inference is also differentiable, enabling gradient-based parameter learning. Formally, we can show that TensorLog subsumes some prior probabilistic logic programming models, including several variants of stochastic logic programs (SLPs) , and approximates others .

Below, we first present background material, then introduce our main results for differentiable inference, We then discuss related work, in particular the relationship between TensorLog and existing probabilistic logics, present experimental results, and conclude.

Background: Deductive and Probabilistic DBs

To begin, we review the definition for an ordinary DDB, an example of which is in Figure 1. A database, DB{\cal{DB}}, is a set {f1,,fN}\{f_{1},\ldots,f_{N}\} of ground facts. We focus here on DB relations which are unary or binary (e.g., from a “knowledge graph”), hence, facts will be written as p(a,b)p(a,b) or q(c)q(c) where pp and qq are predicate symbols, and a,b,ca,b,c are constants from a fixed domain C{\cal{C}}. A theory, T{\cal{T}}, is a set of function-free Horn clauses. Clauses are written A:-B1,,BkA{\texttt{:-}}B_{1},\ldots,B_{k}, where AA is called the head of the clause, B1,,BkB_{1},\ldots,B_{k} is the body, and AA and the BiB_{i}’s are called literals. Literals must be of the form q(X)q(X), p(X,Y)p(X,Y), p(c,Y)p(c,Y), or p(X,c)p(X,c), where XX and YY are logical variables, and cc is a database constant.

Clauses can be understood as logical implications. Let σ\sigma be a substitution, i.e., a mapping from logical variables to constants in C{\cal{C}}, and let σ(L)\sigma(L) be the result of replacing all logical variables XX in the literal LL with σ(X)\sigma(X). A set of tuples SS is deductively closed with respect to the clause AB1,,BkA\leftarrow{}B_{1},\ldots,B_{k} iff for all substitutions σ\sigma, either σ(A)S\sigma(A)\in{}S or Bi:σ(Bi)∉S\exists B_{i}:\sigma(B_{i})\not\in{}S. For example, if SS contains the facts of Figure 1, SS is not deductively closed with respect to the clause 1 unless it also contains uncle(chip,liam) and uncle(chip,dave). The least model for a pair DB,T{\cal{DB}},{\cal{T}}, written Model(DB,T)\textit{Model}({\cal{DB}},{\cal{T}}), is the smallest superset of DB{\cal{DB}} that is deductively closed with respect to every clause in T{\cal{T}}. In the usual DDB semantics, a ground fact ff is considered “true” iff fModel(DB,T)f\in\textit{Model}({\cal{DB}},T).

To introduce “soft” computation into this model, we add a parameter vector Θ\Theta which associates each fact fDBf\in{\cal{DB}} with a positive scalar θf\theta_{f} (as shown in the example). The semantics of this parameter vary in different PrDDB models, but Θ\Theta will always define a distribution Pr(fT,DB,Θ)\Pr(f|{\cal{T}},{\cal{DB}},\Theta) over the facts in Model(T,DB)\textit{Model}({\cal{T}},{\cal{DB}}).

Differentiable soft reasoning

Numeric encoding of PrDDB’s and queries. We will implement reasoning in PrDDB’s by defining a series of numeric functions, each of finds answers to a particular family of queries. It will be convenient to encode the database numerically. We will assume all constants have been mapped to integers. For a constant cCc\in{\cal{C}}, we define uc\textbf{u}_{c} to be a one-hot row-vector representation for cc, i.e., a row vector of dimension C|{\cal{C}}| where u[c]=1\textbf{u}[c]=1 and u[c]=0\textbf{u}[c^{\prime}]=0 for cCc^{\prime}\not=C. We can also represent a binary predicate pp by a sparse matrix Mp\textbf{M}_{p}, where Mp[a,b]=θp(a,b)\textbf{M}_{p}[a,b]=\theta_{p(a,b)} if p(a,b)DBp(a,b)\in{\cal{DB}}, and a unary predicate qq as an analogous row vector vq\textbf{v}_{q}. Note that Mp\textbf{M}_{p} encodes information not only about the database facts in predicate pp, but also about their parameter values.

PrDDB’s are commonly used test to see if a fact ff is derivable, or retrieve all derivable facts that match some query: e.g., one could test if f=uncle(joe,bob)f={\texttt{uncle(joe,bob)}} is derivable in the sample database, or find all values of YY such that uncle(joe,Y) holds. We focus here on the latter type of query, which we call an argument-retrieval query. An argument-retrieval query QQ is of the form p(c,Y)p(c,Y) or p(Y,c)p(Y,c): we say that p(c,Y)p(c,Y) has an input-output mode of in,out and p(Y,c)p(Y,c) has an input-output mode of out,in. For the sake of brevity, below we will assume below the mode in,out when possible, and abbreviate the two modes as io and io.

The response to a query p(c,Y)p(c,Y) is a distribution over possible substitutions for YY, encoded as a vector vY\textbf{v}_{Y} such that for all constants dCd\in{\cal{C}}, vY[d]=Pr(p(c,d)T,DB,Θ)\textbf{v}_{Y}[d]=\Pr(p(c,d)|{\cal{T}},{\cal{DB}},\Theta). Alternatively (since often we care only about the relative scores of the possible answers), the system might instead return a conditional probability vector vYc\textbf{v}_{Y|c}: if Up(c,Y)U_{p(c,Y)} is the set of facts ff that “match” p(c,Y)p(c,Y), then vYc[d]=Pr(f=p(c,d)fUp(c,Y),T,DB,Θ)\textbf{v}_{Y|c}[d]=\Pr(f=p(c,d)|f\in{}U_{p(c,Y)},{\cal{T}},{\cal{DB}},\Theta).

Since the ultimate goal of our reasoning system is to correctly answer queries using functions, we also introduce a notation for functions that answer particular types of queries: in particular, for a predicate symbol fiopf^{p}_{\texttt{io}} denotes a query response function for all queries with predicate pp and mode io, i.e., queries of the form p(c,Y)p(c,Y), when given a one-hot encoding of cc, fiopf^{p}_{\texttt{io}} returns the appropriate conditional probability vector:

Syntactic restrictions. Algorithmically it will be convenient to constrain the use of constants in clauses. We introduce a special DB predicate assign, which will be used only in literals of the form assign(WW,cc), which in turn will be treated as literals for a special unary predicate assign_cc. Without loss of generality, we can now assume that constants only appear in assign literals. For instance, the clause 3 of Figure 1 would be rewritten as

We will also introduce another special DB predicate any, where any(a,ba,b) is conceptually true for any pair of constants a,ba,b; however, as we show below, the matrix Many\textbf{M}_{\texttt{any}} need not be explicitly stored. We also constrain clause heads to contain distinct variables which all appear also in the body.

A factor graph for a one-clause program. We will start by considering a highly restricted class of theories T{\cal{T}}, namely programs containing only one non-recursive clause rr that obeys the restrictions above. We build a factor graph GrG_{r} for rr as follows: for each logical variable WW in the body, there is a random variable WW; and for every literal q(Wi,Wj)q(W_{i},W_{j}) in the body of the clause, there is a factor with potentials Mq\textbf{M}_{q} linking variables WiW_{i} and WjW_{j}. Finally, if the factor graph is disconnected, we add any factors between the components until it is connected. Figure 2 gives examples. The variables appearing in the clause’s head are starred.

We now argue that GrG_{r} imposes a valid distribution Pr(fT,DB,Θ)\Pr(f|{\cal{T}},{\cal{DB}},\Theta) over facts in Model(T,DB)\textit{Model}({\cal{T}},{\cal{DB}}). In GrG_{r} the variables are multinomials over C{\cal{C}}, the factors represent predicates and the graph GrG_{r} represents a distribution of possible bindings to the logical variables in the clause ff, i.e., to possible substitutions σ\sigma. Let W1,,WmW_{1},\ldots,W_{m} be the variables of GrG_{r}, and for each factor/edge ee let pe(Wie,Wje)p_{e}(W_{i_{e}},W_{j_{e}}) be the literal associated with it. In the distribution defined by GrG_{r}

Recall f,θf>0\forall f,\theta_{f}>0, so if Pr(W1=c1,,Wm=cm)>0\Pr(W_{1}=c_{1},\ldots,W_{m}=c_{m})>0 then for each edge pe(cie,cje)DBp_{e}(c_{i_{e}},c_{j_{e}})\in{\cal{DB}}, and hence the substitution σ={W1=c1,,Wm=cm}\sigma=\{W_{1}=c_{1},\ldots,W_{m}=c_{m}\} makes the body of clause rr true. The converse is also clearly true: so GrG_{r} defines a distribution over exactly those substitutions σ\sigma that make the the body of rr true.

BP over GrG_{r} can now be used to compute the conditional vectors fiop(uc)f^{p}_{\texttt{io}}(\textbf{u}_{c}) and foip(uc)f^{p}_{\texttt{oi}}(\textbf{u}_{c}). For example to compute fiop(uc)f^{p}_{\texttt{io}}(\textbf{u}_{c}) for clause 1, we would set the message for the evidence variable XX to uc\textbf{u}_{c}, run BP, and read out as the value of ff the marginal distribution for YY.

However, we would like to do more: we would like to compute an explicit, differentiable, query response function, which computes fiop(uc)f^{p}_{\texttt{io}}(\textbf{u}_{c}). To do this we “unroll” the message-passing steps into a series of operations, following .

For completeness, we include in Figure 3 a sketch of the algorithm used in the current implementation of TensorLog, which makes the (strong) assumption that GrG_{r} is a tree. In the code, we found it convenient to extend the notion of input-output modes for a query, a variable XX appearing in a literal L=p(X,Y)L=p(X,Y) in a clause body is an nominal input if it appears in the input position of the head, or any literal to the left of LL in the body, and is an nomimal output otherwise. In Prolog a convention is that nominal inputs appear as the first argument of a predicate, and in TensorLog, if the user respects this convention, then “forward” message-passing steps use MpM_{p} rather than MpTM_{p}^{T}w (reducing the cost of transposing large DB{\cal{DB}}-derived matrices, since our message-passing schedule tries to maximize forward messages.) The code contains two mutually recursive routines, and is invoked by requesting a message from the output variable to a fictional output literal. The result will be to emit a series of operations, and return the name of a register that contains the unnormalized conditional probability vector for the output variable: e.g., for the sample clauses the functions returned are:

Here we use gior(uc)g^{r}_{\texttt{io}}(\vec{u}_{c}) for the unnormalized version of the query response function build from GrG_{r}, i.e.,

where rr is the one-clause theory defining pp.

Sets of factor graphs for multi-clause programs. We now extend this idea to theories with many clauses. We first note that if there are several clauses with the same predicate symbol in the head, we simply sum the unnormalized query response functions: e.g., for the predicate cduncle, defined by rules r1r_{1} and r2r_{2}, we can define

and then re-normalize. This is equivalent to building a new factor graph GG, which would be approximately iGri\cup_{i}G_{ri}, together global input and output variables, and a factor that constrains the input variables of the GriG_{ri}’s to be equal, and a factor that constrains the output variable of GG to be the sum of the outputs of the GriG_{ri}’s.

A more complex situation is when the clauses for one predicate, pp, use a second theory predicate qq, in their body: for example, this would be the case if aunt was also defined in the theory, rather than the database. For a theory with no recursion, we can replace the message-passing operations vY=vXMq\textbf{v}_{Y}=\textbf{v}_{X}\textbf{M}_{q} with the function call vY=gioq(vX)\textbf{v}_{Y}=g^{q}_{\texttt{io}}(\textbf{v}_{X}), and likewise the operation vY=vXMqT\textbf{v}_{Y}=\textbf{v}_{X}\textbf{M}_{q}^{T} with the function call vY=goiq(vX)\textbf{v}_{Y}=g^{q}_{\texttt{oi}}(\textbf{v}_{X}). It can be shown that this is equivalent to taking the factor graph for qq and “splicing” it into the graph for pp.

It is also possible to allow function calls to recurse to a fixed maximum depth: we must simply add some sort of extra argument that tracks depth to the recursively-invoked gqg^{q} functions, and make sure that gpg^{p} returns an all-zeros vector (indicating no more proofs can be found) when the depth bound is exceeded. Currently this is implemented by marking learned functions gg with the predicate qq, a mode, and a depth argument dd, and ensuring that function calls inside gio,dpg^{p}_{{\texttt{io}},d} to qq always call the next-deeper version of the function for qq, e.g., gio,d+1qg^{q}_{{\texttt{io}},d+1}.

Uncertain inference rules. Notice that Θ\Theta associates confidences with facts in the databases, not with clauses in the theory. To attach a probability to a clause, a standard trick is to introduce a special clause-specific fact, and add it to the clause body . For example, a soft version of clause 3 could be re-written as

where the (parameterized) fact weighted(c3) appears in DB{\cal{DB}}, and the constant c3 appears nowhere else in T{\cal{T}}. TensorLog supports some special syntax to make it easy to build rules with associated weights: for instance, status(X,tired) :- assign(C3,c3), weighted(C3), child(W,X), infant(W) can be written simply as status(X,tired) :- child(W,X), infant(W) {c3}.

Discussion. Collectively, the computation performed by TensorLog’s functions are equivalent to computing a set of marginals over a particular factor graph GG: specifically GG would be formed by using the construction for multiple clauses with the same head (described above), and then splicing in the factor graphs of subpredicates. The unnormalized messages over this graph, and their functional equivalent, can be viewed implementing a first-order version of weighted model counting, a well-studied problem in satisfiability.

Computationally, the algorithm we describe is quite efficient. Assuming the matrices Mp\textbf{M}_{p} exist, the additional memory needed for the factor-graph GrG_{r} is linear in the size of the clause rr, and hence the compilation to response functions is linear in the theory size and the number of steps of BP. For theories where every GrG_{r} is a tree, the number of message-passing steps is also linear. Message size is (by design) limited to C|{\cal{C}}|, and is often smaller due to sparsity.

The current implementation of TensorLog includes many restrictions that could be relaxed: e.g., predicates must be unary or binary, only queries of the types discussed here are allowed, and every factor graph GrG_{r} must be a tree. Matrix operations are implemented in the scipy sparse-matrix package, and the “unrolling” code performs a number of optimizations to the sequence in-line: one important one is to use the fact that vX(vYMany)=vX ⁣vY ⁣1\textbf{v}_{X}\circ(\textbf{v}_{Y}\textbf{M}_{\texttt{any}})=\textbf{v}_{X}{|\!|{\textbf{v}_{Y}}|\!|_{1}} to avoid explicitly building Many\textbf{M}_{\texttt{any}}.

Related Work

Hybrid logical/neural systems. There is a long tradition of embedding logical expressions in neural networks for the purpose of learning, but generally this is done indirectly, by conversion of the logic to a boolean formula, rather than developing a differentiable theorem-proving mechanism, as considered here. Embedding logic may lead to a useful architecture or regularizer .

Recently have proposed a differentiable theorem prover, in which a proof for an example is unrolled into a network. Their system includes representation-learning as a component, as well as a template-instantiation approach (similar to ), allowing structure learning as well. However, published experiments with the system been limited to very small datasets. Another recent paper describes a system in which non-logical but compositionally defined expressions are converted to neural components for question-answering tasks.

Explicitly grounded probabilistic first-order languages. Many first-order probabilistic models are implemented by “grounding”, i.e., conversion to a more traditional representation.For a survey of such models see . For example, Markov logic networks (MLNs) are a widely-used probabilistic first-order model in which a Bernoulli random variable is associated with each potential ground database fact (e.g., in the binary-predicate case, there would be a random variable for each possible p(a,b)p(a,b) where aa and bb are any facts in the database and pp is any binary predicate) and each ground instance of a clause is a factor. The Markov field built by an MLN is hence of size O(C2)O(|{\cal{C}}|^{2}) for binary predicates, which is much larger than the factor graphs used by TensorLog, which are of size linear in the size of the theory. In our experiments we compare to ProPPR, which has been elsewhere compared extensively to MLNs.

Inference on the Markov field can also be expensive, which motivated the development of probabilistic similarity logic (PSL), a MLN variant which uses a more tractible hinge loss, as well as lifted relational neural networks , a recent model which grounds first-order theories to a neural network. However, any grounded model for a first-order theory can be very large, limiting the scalability of such techniques.

Probabilistic deductive databases and tuple independence.

TensorLog is superficially similar to the tuple independence model for PrDDB’s , which use Θ\Theta to define a distribution, Pr(IDB,Θ)\Pr(I|{\cal{DB}},\Theta), over “hard” databases (aka interpretations) II. In particular, to generate II, each fact fDBf\in{\cal{DB}} sampled by independent coin tosses, i.e., PrTupInd(IDB,Θ)tIθttDBI(1θt)\Pr_{\texttt{TupInd}}(I|{\cal{DB}},\Theta)\equiv\prod_{t\in I}\theta_{t}\cdot\prod_{t\in{\cal{DB}}-I}(1-\theta_{t}). The probability of a derived fact ff is defined as follows, where  ⁣[] ⁣|\![{\cdot}]\!| is a zero-one indicator function:

There is a large literature (for surveys, see ) on approaches to tractibly estimating Eq 3, which naively requires marginalizing over all 2DB2^{|{\cal{DB}}|} interpretations. One approach, taken by the ProbLog system , relies on the notion of an explanation. An explanation EE for ff is a minimal interpretation that supports ff: i.e., fModel(T,E)f\in{}\textit{Model}({\cal{T}},E) but f∉Model(T,E)f\not\in{}\textit{Model}({\cal{T}},E^{\prime}) for all EEE^{\prime}\subset{}E. It is easy to show that if EEE^{\prime\prime}\supset{}E then fModel(T,E)f\in{}\textit{Model}({\cal{T}},E^{\prime\prime}); hence, the set Ex(f){\cal E}x(f) of all explanations for ff is a more concise representation of the interpretations that support ff.

Under the tuple independence model, the marginal probability of drawing some interpretation IEI\supseteq{}E is simply

So TensorLog’s score for a single-explanation fact is the same as under PrTupInd\Pr_{\texttt{TupInd}}, but more generally only approximates Eq 3, since

the inequality occurring because TensorLog overcounts interpretations II that are supersets of more than one explanation.

This approximation step is important to TensorLog’s efficiency, however. Exact computation of probabilities in the tuple independence model are #P hard to compute in the size of the set of explanations, which as noted, can itself be exponentially large. A number of methods have been developed for approximating this computation, or performing it as efficiently as can be done—for example, by grounding to a boolean formula and converting to a decision-tree like format that facilitates counting . Below we experimentally compare inference times to ProbLog2, one system which adopts these semantics.

Stochastic logic programs and ProPPR. TensorLog is more closely related to stochastic logic programs (SLPs) . In an SLP, a probabilistic process is associated with a top-down theorem-prover: i.e., each clause rr used in a derivation has an assocated probability θr\theta_{r}. Let N(r,E)N(r,E) be the number of times rr was used in deriving the explanation EE: then in SLPs, PrSLP(f)=1ZEEx(f)rθrN(r,E)\Pr_{\texttt{SLP}}(f)=\frac{1}{Z}\sum_{E\in{\cal E}x(f)}\prod_{r}\theta_{r}^{N(r,E)}. The same probability distribution can be generated by TensorLog if (1) for each rule rr, the body of rr is prefixed with the literals assign(RuleId,rr),weighted(RuleId), where rr is a unique identifier for the rule and (2) Θ\Theta is constructed so that θf=1\theta_{f}=1 for ordinary database facts ff, and θweighted(r)=θr\theta_{\texttt{weighted(r)}}=\theta^{\prime}_{\texttt{r}}, where Θ\Theta^{\prime} is the parameters for a SLP.

SLPs can be normalized or unnormalized; in normalized SLPs, Θ\Theta is defined so for each set of clauses SpS_{p} of clauses with the same predicate symbol pp in the head, rSpθr=1\sum_{r\in{}S_{p}}\theta_{r}=1. TensorLog can represent both normalized and unnormalized SLPs (although clearly learning must be appropriately constrained to learn parameters for normalized SLPs.) Normalized SLPs generalize probabilistic context-free grammars, and unnormalized SLPs can express Bayesian networks or Markov random fields .

ProPPR is a variant of SLPs in which (1) the stochastic proof-generation process is augmented with a reset, and (2) the transitional probabilities are based on a normalized soft-thresholded linear weighting of features. The first extension to SLPs can be easily modeled in TensorLog, but the second cannot: the equivalent of ProPPR’s clause-specific features can be incorporated, but they are globally normalized, not locally normalized as in ProPPR.

ProPPR also includes an approximate grounding procedure which generates networks of size linear in mm, α1\alpha^{-1}, ϵ1\epsilon^{-1}, and where mm is the number of training examples, α\alpha is the reset parameter, degitmax\it{deg}_{itmax} is the maximum degree of the proof graph, and ϵ\epsilon is the pointwise error of the approximation. Asymptotic analysis suggests that ProPPR should be faster for very large database and small numbers of training examples (assuming moderate values of ϵ\epsilon and α\alpha are feasible to use), but that TensorLog should be faster with large numbers of training examples and moderate-sized databases.

Experiments

We compared TensorLog’s inference time with ProbLog2, a mature probabilistic logic programming system which implements the tuple independence semantics, on two inference problems described in . One is a version of the “friends and smokers” problem, a toy model of social influence. In small graphs were artificially generated using a preferential attachment model, the details of which were not described; instead we used a small existing network datasetThe Citeseer dataset from . which displays preferential-attachment statistics. The inference times we report are for the same inference tasks, for a subset of 120 randomly-selected entities. In spite of querying six times as many entities, TensorLog is many times faster.

We also compare on a path-finding task, also described in , which is intended to test performance on deeply recursive tasks. The goal here is to compute fixed-depth transitive closure on a grid: in a 16-by-16 grid was used, with a maximum path length of 10. Again TensorLog shows much faster performance, and better scalabilityWe set TensorLog’s maximum depth to 10 for the 16-by-16 grid, and to 99 for the larger grid., as shown by run times on a larger 64-by-64 grid.

These results demonstrate that TensorLog’s approximation to ProbLog2’s semantics is efficient, but not that it is useful. To demonstrate that TensorLog can efficiently and usefully approximate deeply recursive concepts, we posed a learning task on the 16-by-16 grid, and trained TensorLog to approximate the distribution for this task. The dataset consists of 256 grid cells connected by 2116 edges, so there are 256 example queries of the form path(a,X) where aa is a particular grid cell. We picked 1/3 of these queries as test, and the remainder as train, and trained so that that the single positive answer to the query path(a,X) is the extreme corner closest to a—i.e., one of the corners (1,1), (1,16), (16,1) or (16,16). Training for 20 epochs brings the accuracy from to 0% to 96.5% (for test), and learning takes approximately 3 sec/epoch. After learning query times are still quite fast.

We note, however, that ProbLog2, in addition to implementing the full tuple-independence semantics, implements a much more expressive logic than considered here, including a large portion of full Prolog. In contrast TensorLog includes only a subset of Datalog.

The table also includes a visualization of the learned weights for a small 6x6 grid. For every pair of adjacent grid cells u,vu,v, there are two weights to learn, one for the edge from uu to vv and one for its converse. For each weight pair, we show a single directed edge (the heavy blue squares are the arrows) colored by the magnitude of the difference.

We also compared experimentally with ProPPR on several tasks. One was a citation-matching task (from ), in which ProPPR was favorable compared to MLNsWe replicated the experiments with the most recent version of ProPPR, obtaining a result slightly higher than the 2013 version’s published AUC of 80.0. Motivated by recent comparisons between ProPPR and embedding-based approaches to knowledge-base completion , we also compared to ProPPR on six relation-prediction tasksWe chose this protocol since the current TensorLog implementation can only learn parameters for one target relation at a time. involving two databases, Wordnet and FreeBase15k, a 15,000-entity subset of FreeBase, using rules from the (non-recursive) theory used in .

In all of these tasks parameters are learned on a separate training set. For TensorLog’s learner, we optimized unregularized cross-entropy loss, using a fixed-rate gradient descent learner. We set the learning rate to 0.1, used no regularization, and used a fixed number of epochs (30), which approximately matched ProPPR’s learning time.Since the current TensorLog implementation is single-threaded we used only one thread for ProPPR as well. The parameters θf\theta_{f} are simply “clipped” to prevent them becoming negative (as in a rectified linear unit) and we use softmax to convert the output of the gpg^{p} functions to distributions. We used the default parameters for ProPPR’s learning.

Encouragingly, the accuracy of the two systems after learning is comparable, even with TensorLog’s rather simplistic learning scheme. ProPPR, of course, is not well suited to tight integration with deep learners.

Concluding Remarks

Large knowledge bases (KBs) are useful in many tasks, but integrating this knowledge into deep learners is a challenge. To address this problem, we described a probabilistic deductive database, called TensorLog, in which reasoning is performed with a differentiable process. The current TensorLog prototype is limited in many respects: for instance, it is not multithreaded, and only the simplest learning algorithms have been tested. In spite of this, it appears to be comparable to more mature first-order probabilistic learners in learning performance and inference time—while holding the promise of allowing large KBs to be tightly integrated with deep learning.

Thanks to William Wang for providing some of the datasets used here; and to William Wang, Katie Mazaitis, and many other colleagues contributed with technical discussions and advice. The author is greatful to Google for financial support, and also to NSF for their support of his work via grants CCF-1414030 and IIS-1250956.

References