Embedding Logical Queries on Knowledge Graphs

William L. Hamilton, Payal Bajaj, Marinka Zitnik, Dan Jurafsky, Jure Leskovec

Introduction

A wide variety of heterogeneous data can be naturally represented as networks of interactions between typed entities, and a fundamental task in machine learning is developing techniques to discover or predict unobserved edges using this graph-structured data. Link prediction , recommender systems , and knowledge base completion are all instances of this common task, where the goal is to predict unobserved edges between nodes in a graph using an observed set of training edges. However, an open challenge in this domain is developing techniques to make predictions about more complex graph queries that involve multiple unobserved edges, nodes, and even variables—rather than just single edges.

One particularly useful set of such graph queries, and the focus of this work, are conjunctive queries, which correspond to the subset of first-order logic using only the conjunction and existential quantification operators . In terms of graph structure, conjunctive queries allow one to reason about the existence of subgraph relationships between sets of nodes, which makes conjunctive queries a natural focus for knowledge graph applications. For example, given an incomplete biological knowledge graph—containing known interactions between drugs, diseases, and proteins—one could pose the conjunctive query: “what protein nodes are likely to be associated with diseases that have both symptoms X and Y?” In this query, the disease node is an existentially quantified variable—i.e., we only care that some disease connects the protein node to these symptom nodes X and Y. Valid answers to such a query correspond to subgraphs. However, since any edge in this biological interaction network might be unobserved, naively answering this query would require enumeration over all possible diseases.

In general, the query prediction task—where we want to predict likely answers to queries that can involve unobserved edges—is difficult because there are a combinatorial number of possible queries of interest, and any given conjunctive query can be satisfied by many (unobserved) subgraphs (Figure 1). For instance, a naive approach to make predictions about conjunctive queries would be the following: First, one would run an edge prediction model on all possible pairs of nodes, and—after obtaining these edge likelihoods—one would enumerate and score all candidate subgraphs that might satisfy a query. However, this naive enumeration approach could require computation time that is exponential in the number of existentially quantified (i.e., bound) variables in the query .

Here we address this challenge and develop graph query embeddings (GQEs), an embedding-based framework that can efficiently make predictions about conjunctive queries on incomplete knowledge graphs. The key idea behind GQEs is that we embed graph nodes in a low-dimensional space and represent logical operators as learned geometric operations (e.g., translation, rotation) in this embedding space. After training, we can use the model to predict which nodes are likely to satisfy any valid conjunctive query, even if the query involves unobserved edges. Moreover, we can make this prediction efficiently, in time complexity that is linear in the number of edges in the query and constant with respect to the size of the input network. We demonstrate the utility of GQEs in two application studies involving networks with millions of edges: discovering new interactions in a biomedical drug interaction network (e.g., “predict drugs that might treat diseases associated with protein X”) and predicting social interactions on the website Reddit (e.g., “recommend posts that user A is likely to downvote, but user B is likely to upvote”).

Related Work

Our framework builds upon a wealth of previous research at the intersection of embedding methods, knowledge graph completion, and logical reasoning.

Logical reasoning and knowledge graphs. Recent years have seen significant progress in using machine learning to reason with relational data , especially within the context of knowledge graph embeddings , probabilistic soft logic , and differentiable tensor-based logic . However, existing work in this area primarily focuses on using logical reasoning to improve edge prediction in knowledge graphs , for example, by using logical rules as regularization . In contrast, we seek to directly make predictions about conjunctive logical queries. Another well-studied thread in this space involves leveraging knowledge graphs to improve natural language question answering (QA) . However, the focus of these QA approaches is understanding natural language, whereas we focus on queries that are in logical form.

Probabilistic databases. Our research also draws inspiration from work on probabilistic databases . The primary distinction between our work and probabilistic databases is the following: Whereas probabilistic databases take a database containing probabilistic facts and score queries, we seek to predict unobserved logical relationships in a knowledge graph. Concretely, a distinguishing challenge in our setting is that while we are given a set of known edge relationships (i.e., facts), all missing edge relationships could possibly be true.

Neural theorem proving. Lastly, our work builds closely upon recent advancements in neural theorem proving , which have demonstrated how neural networks can prove first-order logic statements in toy knowledge bases . Our main contribution in this space is providing an efficient approach to embed a useful subset of first-order logic, demonstrating scalability to real-world network data with millions of edges.

Background and Preliminaries

Example 1: Drug interactions (Figure 2.a). A knowledge graph derived from a number from public biomedical databases (Appendix B). It consists of nodes corresponding to drugs, diseases, proteins, side effects, and biological processes. There are 42 different edge types, including multiple edge types between proteins (e.g., co-expression, binding interactions), edges denoting known drug-disease treatment pairs, and edges denoting experimentally documented side-effects of drugs. In total this dataset contains over $8$ million edges between $97{,}000$ nodes.

Example 2: Reddit dynamics (Figure 2.b). We also consider a graph-based representation of Reddit, one of the most popular websites in the world. Reddit allows users to form topical communities, within which users can create and comment on posts (e.g., images, or links to news stories). We analyze all activity in 105 videogame related communities from May 1-5th, 2017 (Appendix B). In total this dataset contains over 4 million edges denoting interactions between users, communities and posts, with over $700{,}000$ nodes in total (see Figure 2.b for the full schema). Edges exist to denote that a user created,“upvoted”, or “downvoted” a post, as well as edges that indicate whether a user subscribes to a community

In this work we seek to make predictions about conjunctive graph queries (Figure 1). Specifically, the queries $q\in\mathcal{Q}(\mathcal{G})$ that we consider can be written as:

In Equation (1), $V_{?}$ denotes the target variable of the query, i.e., the node that we want the query to return, while $V_{1},...,V_{m}$ are existentially quantified bound variable nodes. The edges $e_{i}$ in the query can involve these variable nodes as well as anchor nodes, i.e., non-variable/constant nodes that form the input to the query, denoted in lower-case as $v_{j}$ .

To give a concrete example using the biological interaction network (Figure 2.a), consider the query “return all drug nodes that are likely to target proteins that are associated with a given disease node $d$ .” We would write this query as:

and we say that the answer or denotation of this query $\llbracket q\rrbracket$ is the set of all drug nodes that are likely to be connected to node $d$ on a length-two path following edges that have types target and assoc, respectively. Note that $d$ is an anchor node of the query: it is the input that we provide. In contrast, the upper-case nodes $C_{?}$ and $P$ , are variables defined within the query, with the $P$ variable being existentially quantified. In terms of graph structure, Equation (2) corresponds to a path. Figure 1 contains a visual illustration of this idea.

Beyond paths, queries of the form in Equation (1) can also represent more complex relationships. For example, the query “return all drug nodes that are likely to target proteins that are associated with the given disease nodes $d_{1}$ and $d_{2}$ ” would be written as:

In this query we have two anchor nodes $d_{1}$ and $d_{2}$ , and the query corresponds to a polytree (Figure 1).

In general, we define the dependency graph of a query $q$ as the graph with edges $\mathcal{E}_{q}=\{e_{1},...,e_{n}\}$ formed between the anchor nodes $v_{1},...,v_{k}$ and the variable nodes $V_{?},V_{1},...,V_{m}$ (Figure 1). For a query to be valid, its dependency graph must be a directed acyclic graph (DAG), with the anchor nodes as the source nodes of the DAG and the query target as the unique sink node. The DAG structure ensures that there are no contradictions or redundancies.

Note that there is an important distinction between the query DAG, which contains variables, and a subgraph structure in the knowledge graph that satisfies this query, i.e., a concrete assignment of the query variables (see Figure 1). For instance, it is possible for a query DAG to be satisfied by a subgraph that contains cycles, e.g., by having two bound variables evaluate to the same node.

Observed vs. unobserved denotation sets. If we view edge relations as binary predicates, the graph queries defined by Equation (1) correspond to a standard conjunctive query language , with the restriction that we allow at most one free variable. However, unlike standard queries on relational databases, we seek to discover or predict unobserved relationship and not just answer queries that exactly satisfy a set of observed edges. Formally, we assume that every query $q\in\mathcal{Q}(\mathcal{G})$ has some unobserved denotation set $\llbracket q\rrbracket$ that we are trying to predict, and we assume that $\llbracket q\rrbracket$ is not fully observed in our training data. To avoid confusion on this point, we also introduce the notion of the observed denotation set of a query, denoted $\llbracket q\rrbracket_{\textrm{train}}$ , which corresponds to the set of nodes that exactly satisfy $q$ according to our observed, training edges. Thus, our goal is to train using example query-answer pairs that are known in the training data, i.e., $(q,v^{*}),v^{*}\in\llbracket q\rrbracket_{\textrm{train}}$ , so that we can generalize to parts of the graph that involve missing edges, i.e., so that we can make predictions for query-answer pairs that rely on edges which are unobserved in the training data $(q,v^{*}),v^{*}\in\llbracket q\rrbracket\setminus\llbracket q\rrbracket_{\textrm{train}}$ .

Proposed Approach

Thus, our goal is to generate an embedding $\mathbf{q}$ of a query that implicitly represents its denotation $\llbracket q\rrbracket$ ; i.e., we want to generate query embeddings so that $\texttt{score}(\mathbf{q},\mathbf{z}_{v})=1,\forall v\in\llbracket q\rrbracket$ and $\texttt{score}(\mathbf{q},\mathbf{z}_{v})=0,\forall v\notin\llbracket q\rrbracket$ . At inference time, we take a query $q$ , generate its corresponding embedding $\mathbf{q}$ , and then perform nearest neighbor search—e.g., via efficient locality sensitive hashing —in the embedding space to find nodes likely to satisfy this query (Figure 3).

To generate the embedding $\mathbf{q}$ for a query $q$ using Algorithm 1, we (i) represent the query using its DAG dependency graph, (ii) start with the embeddings $\mathbf{z}_{v_{1}},...,\mathbf{z}_{v_{n}}$ of its anchor nodes, and then (iii) we apply geometric operators, $\mathcal{P}$ and $\mathcal{I}$ (defined below) to these embeddings to obtain an embedding $\mathbf{q}$ of the query. In particular, we introduce two key geometric operators, both of which can be interpreted as manipulating the denotation set associated with a query in the embedding space.

Geometric projection operator, $\mathcal{P}$ :. Given a query embedding $\mathbf{q}$ and an edge type $\tau$ , the projection operator $\mathcal{P}$ outputs a new query embedding $\mathbf{q}^{\prime}=\mathcal{P}(\mathbf{q},\tau)$ whose corresponding denotation is $\llbracket q^{\prime}\rrbracket=\cup_{v\in\llbracket q\rrbracket}N(v,\tau)$ , where $N(v,\tau)$ denotes the set of nodes connected to $v$ by edges of type $\tau$ . Thus, $\mathcal{P}$ takes an embedding corresponding to a set of nodes $\llbracket q\rrbracket$ and produces a new embedding that corresponds to the union of all the neighbors of nodes in $\llbracket q\rrbracket$ , by edges of type $\tau$ . Following a long line of successful work on encoding edge and path relationships in knowledge graphs , we implement $\mathcal{P}$ as follows:

where $\mathbf{R}_{\tau}^{d\times d}$ is a trainable parameter matrix for edge type $\tau$ . In the base case, if $\mathcal{P}$ is given a node embedding $\mathbf{z}_{v}$ and edge type $\tau$ as input, then it returns an embedding of the neighbor set $N(v,\tau)$ .

Geometric intersection operator, $\mathcal{I}$ :. Suppose we are given a set of query embeddings $\mathbf{q}_{1},...,\mathbf{q}_{n}$ , all of which correspond to queries with the same output node type $\gamma$ . The geometric intersection operator $\mathcal{I}$ takes this set of query embeddings and produces a new embedding $\mathbf{q}^{\prime}$ whose denotation corresponds to $\llbracket q^{\prime}\rrbracket=\cap_{i=1,...,n}\llbracket q\rrbracket_{i}$ , i.e., it performs set intersection in the embedding space. While path projections of the form in Equation (4) have been considered in previous work on edge and path prediction, no previous work has considered such a geometric intersection operation. Motivated by recent advancements in deep learning on sets , we implement $\mathcal{I}$ as:

where $\textrm{NN}_{k}$ is a $k$ -layer feedforward neural network, $\Psi$ is a symmetric vector function (e.g., an elementwise mean or min of a set over vectors), and $\mathbf{W}_{\gamma}$ is a trainable transformation matrix for each node type $\gamma\in\Gamma$ . In principle, any sufficiently expressive neural network that operates on sets could be also employed as the intersection operator, as long as this network is permutation invariant on its inputs .

Query inference using $\mathcal{P}$ and $\mathcal{I}$ . Given the geometric projection operator $\mathcal{P}$ (Equation 4) and the geometric intersection operator $\mathcal{I}$ (Equation 5) we can use Algorithm 1 to efficiently generate an embedding $\mathbf{q}$ that corresponds to any DAG-structured conjunctive query $q$ on the network. To generate a query embedding, we start by projecting the anchor node embeddings according to their outgoing edges; then if a node has more than one incoming edge in the query DAG, we use the intersection operation to aggregate the incoming information, and we repeat this process as necessary until we reach the target variable of the query. In the end, Algorithm 1 generates an embedding $\mathbf{q}$ of a query in $O(d^{2}E)$ operations, where $d$ is the embedding dimension and $E$ is the number of edges in the query DAG. Using the generated embedding $\mathbf{q}$ we can predict nodes that are likely to satisfy this query by doing a nearest neighbor search in the embedding space. Moreover, since the set of nodes is known in advance, this nearest neighbor search can be made highly efficient (i.e., sublinear in | $\mathcal{V}|$ ) using locality sensitive hashing, at a small approximation cost .

Formally, we can show that in an ideal setting Algorithm 1 can exactly answer any conjunctive query on a network. This provides an equivalence between conjunctive queries on a network and sequences of geometric projection and intersection operations in an embedding space.

i.e., the observed denotation set of the query $\llbracket q\rrbracket_{\textrm{train}}$ can be exactly computed in the embeddings space by Algorithm 1 using $O(E)$ applications of the geometric operators $\mathcal{P}$ and $\mathcal{I}$ .

Theorem 1 is a consequence of the correspondence between tensor algebra and logic combined with the efficiency of DAG-structured conjunctive queries , and the full proof is in Appendix A.

2 Node embeddings

3 Other variants of our framework

Above we outlined one concrete implementation of our GQE framework. However, in principle, our framework can be implemented with alternative geometric projection $\mathcal{P}$ and intersection $\mathcal{I}$ operators. In particular, the projection operator can be implemented using any composable, embedding-based edge prediction model, as defined in Guu et al., 2015 . For instance, we also consider variants of the geometric projection operator based on DistMult and TransE . In the DistMult model the matrices in Equation (4) are restricted to be diagonal, whereas in the TransE variant we replace Equation (4) with a translation operation, $\mathcal{P}_{\textrm{TransE}}(\mathbf{q},\tau)=\mathbf{q}+\mathbf{r}_{\tau}$ . Note, however, that our proof of Theorem 1 relies on specific properties of projection operator described in Equation (4).

4 Model training

The geometric projection operator $\mathcal{P}$ , intersection operator $\mathcal{I}$ , and node embedding parameters can be trained using stochastic gradient descent on a max-margin loss. To compute this loss given a training query $q$ , we uniformly sample a positive example node $v^{*}\in\llbracket q\rrbracket_{\textrm{train}}$ and negative example node $v_{N}\notin\llbracket q\rrbracket_{\textrm{train}}$ from the training data and compute:

For queries involving intersection operations, we use two types of negative samples: “standard” negative samples are randomly sampled from the subset of nodes that have the correct type for a query; in contrast, “hard” negative samples correspond to nodes that satisfy the query if a logical conjunction is relaxed to a disjunction. For example, for the query “return all drugs that are likely to treat disease $d_{1}$ and $d_{2}$ ”, a hard negative example would be diseases that treat $d_{1}$ but not $d_{2}$ .

Experiments

We run experiments on the biological interaction (Bio) and Reddit datasets (Figure 2). Code and data is available at https://github.com/williamleif/graphqembed.

We consider variants of our framework using the projection operator in Equation 4 (termed Bilinear), as well as variants using TransE and DistMult as the projection operators (see Section 4.3). All variants use a single-layer neural network in Equation (5). As a baseline, we consider an enumeration approach that is trained end-to-end to perform edge prediction (using Bilinear, TransE, or DistMult) and scores possible subgraphs that could satisfy a query by taking the product (i.e., a soft-AND) of their individual edge likelihoods (using a sigmoid with a learned scaling factor to compute the edge likelihoods). However, this enumeration approach has exponential time complexity w.r.t. to the number of bound variables in a query and is intractable in many cases, so we only include it as a comparison point on the subset of queries with no bound variables. (A slightly less naive baseline variant where we simply use one-hot embeddings for nodes is similarly intractable due to having quadratic complexity w.r.t. to the number of nodes.) As additional ablations, we also consider simplified variants of our approach where we only train the projection operator $\mathcal{P}$ on edge prediction and where the intersection operator $\mathcal{I}$ is just an elementwise mean or min. This tests how well Algorithm 1 can answer conjunctive queries using standard node embeddings that are only trained to perform edge prediction. For all baselines and variants, we used PyTorch , the Adam optimizer, an embedding dimension $d=128$ , a batch size of $256$ , and tested learning rates $\{0.1,0.01,0.001\}$ .

2 Dataset of train and test queries

To test our approach, we sample sets of train/test queries from a knowledge graph, i.e., pairs ( $q$ , $v^{*}$ ), where $q$ is a query and $v^{*}$ is a node that satisfies this query. In our sampling scheme, we sample a fixed number of example queries for each possible query DAG structure (Figure 4, bottom). For each possible DAG structure, we sampled queries uniformly at random using a simple rejection sampling approach (described below).

To sample training queries, we first remove 10% of the edges uniformly at random from the graph and then perform sampling on this downsampled training graph. To sample test queries, we sample from the original graph (i.e., the complete graph without any removed edges), but we ensure that the test query examples are not directly answerable in the training graph. In other words, we ensure that every test query relies on at least one deleted edge (i.e., that for every test query example ( $q$ , $v^{*}$ ), $v^{*}\notin\llbracket q\rrbracket_{\textrm{train}}$ ). This train/test setup ensures that a trivial baseline—which simply tries to answer a query by template matching on the observed training edges—will have an accuracy that is no better random guessing on the test set, i.e., that every test query can only be answered by inferring unobserved relationships.

Sampling details. In our sampling scheme, we sample a fixed number of example queries for each possible query DAG structure. In particular, given a DAG structure with $E$ edges—specified by a vector $\mathbf{d}=[d_{1},d_{2},...,d_{E}]$ of node out degrees, which are sorted in topological order —we sample edges using the following procedure: First we sample the query target node (i.e., the root of the DAG); next, we sample $d_{1}$ out-edges from this node and we add each of these sampled nodes to a queue; we then iteratively pop nodes from the queue, sampling $d_{i+1}$ neighbors from the $i$ th node popped from the queue, and so on. If a node has $d_{i}=0$ , then this corresponds to an anchor node in the query. We use simple rejection sampling to cope with cases where the sampled nodes cannot satisfy a particular DAG structure, i.e., we repeatedly sample until we obtain $S$ example queries satisfying a particular query DAG structure.

Training, validation, and test set details. For training we sampled $10^{6}$ queries with two edges and $10^{6}$ queries with three edges, with equal numbers of samples for each different type of query DAG structure. For testing, we sampled 10,000 test queries for each DAG structure with two or three edges and ensured that these test queries involved missing edges (see above). We further sampled 1,000 test queries for each possible DAG structure to use for validation (e.g., for early stopping). We used all edges in the training graph as training examples for size-1 queries (i.e., edge prediction), and we used a $90/10$ split of the deleted edges to form the test and validation sets for size-1 queries.

3 Evaluation metrics

For a test query $q$ we evaluate how well the model ranks a node $v^{*}$ that does satisfy this query $v^{*}\in\llbracket q\rrbracket$ compared to negative example nodes that do not satisfy it, i.e., $v_{N}\notin\llbracket q\rrbracket$ . We quantify this performance using the ROC AUC score and average percentile rank (APR). For the APR computation, we rank the true node against $\min(1000,|\{v\notin\llbracket q\rrbracket|)$ negative examples (that have the correct type for the query) and compute the percentile rank of the true node within this set. For queries containing intersections, we run both these metrics using both standard and “hard” negative examples to compute the ranking/classification scores, where“hard” negative examples are nodes that satisfy the query if a logical conjunction is relaxed to a disjunction.

4 Results and discussion

Table 1 contains the performance results for three variants of GQEs based on bilinear transformations (i.e., Equation 4), DistMult, and TransE, as well as the ablated models that are only trained on edge prediction (denoted Edge Training).We selected the best $\Psi$ function and learning rate for each variant on the validation set. Overall, we can see that the full Bilinear model performs the best, with an AUC of 91.0 on the Bio data and an AUC of 76.4 on the Reddit data (macro-averaged across all query DAG structures of size 1-3). In Figure 4 we breakdown performance across different types of query dependency graph structures, and we can see that its performance on complex queries is very strong (relative to its performance on simple edge prediction), with long paths being the most difficult type of query.

Table 2 compares the best-performing GQE model to the best-performing enumeration-based baseline. The enumeration baseline is computationally intractable on queries with bound variables, so this comparison is restricted to the subset of queries with no bound variables. Even in this restricted setting, we see that GQE consistently outperforms the baseline. This demonstrates that performing logical operations in the embedding space is not only more efficient, it is also an effective alternative to enumerating the product of edge-likelihoods, even in cases where the latter is feasible.

The importance of training on complex queries. We found that explicitly training the model to predict complex queries was necessary to achieve strong performance (Table 1). Averaging across all model variants, we observed an average AUC improvement of $13.3\%$ on the Bio data and $13.9\%$ on the Reddit data (both $p<0.001$ , Wilcoxon signed-rank test) when using full GQE training compared to Edge Training. This shows that training on complex queries is a useful way to impose a meaningful logical structure on an embedding space and that optimizing for edge prediction alone does not necessarily lead to embeddings that are useful for more complex logical queries.

Conclusion

We proposed a framework to embed conjunctive graph queries, demonstrating how to map a practical subset of logic to efficient geometric operations in an embedding space. Our experiments showed that our approach can make accurate predictions on real-world data with millions of relations. Of course, there are limitations of our framework: for instance, it cannot handle logical negation or disjunction, and we also do not consider features on edges. Natural future directions include generalizing the space of logical queries—for example, by learning a geometric negation operator—and using graph neural networks to incorporate richer feature information on nodes and edges.

The authors thank Alex Ratner, Stephen Bach, and Michele Catasta for their helpful discussions and comments on early drafts. This research has been supported in part by NSF IIS-1149837, DARPA SIMPLEX, Stanford Data Science Initiative, Huawei, and Chan Zuckerberg Biohub. WLH was also supported by the SAP Stanford Graduate Fellowship and an NSERC PGS-D grant.

References

Appendix A: Proof of Theorem 1

The proof of this theorem follows directly from two lemmas:

Lemma 1 shows that any conjunctive query can be exactly represented in an embedding space of dimension $d=O(|\mathcal{V}|)$ .

Lemma 2 notes that Algorithm 1 terminates in $O(E)$ steps.

Now, by Lemma 3 the denotation set $\llbracket q\rrbracket$ of a DAG-structured conjunctive query $q$ can be computed in a sequence $S$ of two kinds of set operations, applied to the initial input sets $\{v_{1}\},...,\{v_{n}\}$ —where $v_{1},...,v_{n}$ are the anchor nodes of the query—and where the final output set is the query denotation:

Set projections, with one defined for each edge type, $\tau$ and which map a set of nodes $\mathcal{S}$ to the set $\cup_{v_{i}\in\mathcal{S}}N(\tau,v_{i})$ .

Set intersection (i.e., the basic set intersection operator) which takes a set of sets $\{\mathcal{S}_{1},...,\mathcal{S}_{n}\}$ and returns $\mathcal{S}_{1}\cap,...,\cap,\mathcal{S}_{n}$ .

And we can easily show that $\mathcal{P}$ and $\mathcal{I}$ perform exactly these operations, when using the parameter settings outlined above, and we can complete our proof by induction. In particular, our inductive assumption is that sets $\mathcal{S}_{i}$ at step $k$ of the sequence $S$ are all represented as binary vectors $\mathbf{z}_{\mathcal{S}}$ with non-zeroes in the entries corresponding to the nodes in this set. Under this assumption, we have two cases, corresponding to what our next operation is in the sequence $S$ :

If the next operation is a projection on a set $\mathcal{S}$ using edge relation $\tau$ , then we can compute it as $\mathbf{R}_{\tau}\mathbf{z}_{\mathcal{S}}$ , and by definition $\mathbf{R}_{\tau}\mathbf{z}_{\mathcal{S}}$ will have a non-zero entry at position $j$ iff there is at least one non-zero entry $i$ in $\mathbf{z}_{\mathcal{S}}$ . That is, we will have that:

If the next operation is an intersection of the set of sets $\{\mathcal{S}_{1},...,\mathcal{S}_{n}\}$ , then we compute it as $\mathbf{z}^{\prime}=\min\left(\{\mathbf{z}_{\mathcal{S}_{1}},...,\mathbf{z}_{\mathcal{S}_{n}}\}\right)$ , and by definition $\mathbf{z}^{\prime}$ will have non-zero entries only in positions where every one of the input vectors $\mathbf{z}_{\mathcal{S}_{1}},...,\mathbf{z}_{\mathcal{S}_{n}}$ has a non-zero. That is,

Finally, for the base case we have that the input anchor embeddings $\mathbf{z}_{v_{1}},...,\mathbf{z}_{v_{n}}$ represent the sets $\{v_{1}\},...,\{v_{n}\}$ by definition. ∎

Algorithm 1 terminates in $O(E)$ operations, where $E$ is the number of edges in the query DAG.

Algorithm 1 is identical to Kahn’s algorithm for topologically sorting a DAG , with the addition that we (i) apply $\mathcal{P}$ whenever we remove an edge from the DAG and (ii) run $\mathcal{I}$ whenever we pop a node from the queue. Thus, by direct analogy to Kahn’s algorithm we require exactly $E$ applications of $\mathcal{P}$ and $V$ applications of $\mathcal{I}$ , where $V$ is the number of nodes in the query DAG. Since $V$ is always less than $E$ , we have $O(E)$ overall. ∎

The denotation of any DAG-structured conjunctive query on a network can be obtained in a sequence of $O(E)$ applications of the following two operations:

Set projections, with one defined for each edge type, $\tau$ and which map a set of nodes $\mathcal{S}$ to the set $\cup_{v_{i}\in\mathcal{S}}N(\tau,v_{i})$ .

Set intersection (i.e., the basic set intersection operator) which takes a set of sets $\{\mathcal{S}_{1},...,\mathcal{S}_{n}\}$ and returns $\mathcal{S}_{1}\cap,...,\cap,\mathcal{S}_{n}$ .

For a query $C_{?}:\tau(v,C_{?})$ the denotation is $N(v,\tau)$ by definition. This is simply a set projection.

For a query $C_{?}:\tau(v_{1},C_{?})\land\tau(v_{2},C_{?})\land....\tau(v_{n},C_{?})$ the denotation is $\cap_{v_{i}\in\{v_{1},...,v_{n}\}}N(v_{i},\tau)$ by definition. This is a sequence of $n$ set projections followed by a set intersection.

Now, suppose we process the query variables in a topological order and we perform induction on this ordering. Our inductive assumption is that after processing $k$ nodes in this ordering, for every variable $V_{j}$ , $j\leq k$ in the query, we have a set $\mathcal{S}(V_{j})$ of possible nodes that could be assigned to this variable.

Now, when we process the node $V_{i}$ , we consider all of its incoming edges, and we have that:

by definition. Moreover, by the inductive assumption the set $\mathcal{S}(V_{j})$ for all nodes that have an outgoing edge to $V_{i}$ is known (because they must be earlier in the topological ordering). And Equation (7) requires only set projection and intersection operations, as defined above.

Finally, for the base case the set of possible assignments for the anchor nodes is given, and these nodes are guaranteed to be first in the DAG’s topological ordering, by definition. ∎

Appendix B: Further dataset details

The biological interaction network contains interactions between five types of biological entities (proteins, diseases, biological processes, side effects, and drugs). The network records 42 distinct types of biologically relevant molecular interactions between these entities, which we describe below.

Protein-protein interaction links describe relationships between proteins. We used the human protein-protein interaction (PPI) network compiled by and , integrated with additional PPI information from , and . The network contains physical interactions experimentally documented in humans, such as metabolic enzyme-coupled interactions and signaling interactions.

Drug-protein links describe the proteins targeted by a given drug. We obtained relationships between proteins and drugs from the STITCH database, which integrates various chemical and protein networks . Drug-drug interaction network contains 29 different types of edges (one for each type of polypharmacy side effects) and describes which drug pairs lead to which side effects . We also pulled from databases detailing side effects (e.g., nausea, vomiting, headache, diarrhoea, and dermatitis) of individual drugs. The SIDER database contains drug-side effect associations obtained by mining adverse events from drug label text. We integrated it with the OFFSIDES database, which details off-label associations between drugs and side effects .

Disease-protein links describe proteins that, when mutated or otherwise genomically altered, lead to the development of a given disease. We obtained relationships between proteins and diseases from the DisGeNET database , which integrates data from expert-curated repositories. Drug-disease links describe diseases that a given drug treats. We used the RepoDB database to obtain drug-disease links for all FDA-approved drugs in the U.S.

Finally, protein-process links describe biological processes (e.g., intracellular transport of molecules) that each protein is involved in. We obtained these links from the Gene Ontology database and we only considered experimentally verified links. Process-process links describe relationships between biological processes and were retrieved from the Gene Ontology graph.

We ignore in experiments any relation/edge-type with less than 1000 edges. Preprocessed versions of these datasets are publicly available at: http://snap.stanford.edu/biodata/.

Reddit data

Appendix C: Further details on empirical evaluation

As noted in the main text, the code for our model is available at: https://github.com/williamleif/graphqembed

As noted in the main text, we tested all models using different learning rates and symmetric vector aggregation functions $\Psi$ , selecting the best performing model on the validation set. The other important hyperparameter for the methods is the embedding dimension $d$ , which was set to $d=128$ in all experiments. We chose $d=128$ based upon early validation datasets on a subset of the Bio data. We tested embedding dimensions of 16, 64, 128, and 256; in these tests, we found performance increased until the dimension of 128 and then plateaued.

Further training details

During training of the full GQE framework, we first trained the model to convergence on edge prediction, and then trained on more complex queries, as we found that this led to better convergence. After training on edge prediction, in every batch of size $B$ we trained on $B$ queries of each type using standard negative samples and $B$ queries using hard negative samples. We weighted the contribution of path queries to the loss function with a factor of $0.01$ and intersection queries with a factor of $0.005$ , as we found this was necessary to prevent exploding/diverging gradient estimates. We performed validation every 5000 batches to test for convergence. All of these settings were determined in early validation studies on a subset of the Bio data. Note that in order to effectively batch on the GPU, in every batch we only select queries that have the same edges/relations and DAG structure. This means that for some query types batches can be smaller than $B$ on occasion.

Compute resources

We trained the models on a server with 16 x Intel(R) Xeon(R) CPU E5-2623 v4 @ 2.60GHz processors, 512 GB RAM, and four NVIDIA Titan X Pascal GPUs with 12 GB of memory. This was a shared resource environment. Each model takes approximately 3 hours and three models could be concurrently run on a single GPU without significant slowdown. We expect all our experiments could be replicated in 48 hours or less on a single GPU, with sufficient RAM.

Inverse edges

Note that following , we explicitly parameterize every edge as both the original edge and the inverse edge. For instance, if there is an edge $\textsc{target}(u,v)$ in the network then we also add an edge $\textsc{target}^{-1}(v,u)$ and the two relations target and $\textsc{target}^{-1}$ have separate parameterizations in the projection operator $\mathcal{P}$ . This is necessary to obtain high performance on path queries because relations can be many-to-one and not necessarily perfect inverses. However, note also that whenever we remove an edge from the training set, we also remove the inverse edge, to prevent the existence of trivial test queries.