Scalable Multi-Hop Relational Reasoning for Knowledge-Aware Question Answering

Yanlin Feng, Xinyue Chen, Bill Yuchen Lin, Peifeng Wang, Jun Yan, Xiang Ren

Introduction

Many recently proposed question answering tasks require not only machine comprehension of the question and context, but also relational reasoning over entities (concepts) and their relationships by referencing external knowledge Talmor et al. (2019); Sap et al. (2019); Clark et al. (2018); Mihaylov et al. (2018). For example, the question in Fig. 1 requires a model to perform relational reasoning over mentioned entities, ${\sl i.e.}$ , to infer latent relations among the concepts: $\{$ Child, Sit, Desk, Schoolroom $\}$ . Background knowledge such as “a child is likely to appear in a schoolroom” may not be readily contained in the questions themselves, but are commonsensical to humans.

Despite the success of large-scale pre-trained language models (PTLMs) Devlin et al. (2019); Liu et al. (2019b), these models fall short of providing interpretable predictions, as the knowledge in their pre-training corpus is not explicitly stated, but rather is implicitly learned. It is thus difficult to recover the evidence used in the reasoning process.

This has led many to leverage knowledge graphs (KGs) (Mihaylov and Frank, 2018; Lin et al., 2019; Wang et al., 2019; Yang et al., 2019). KGs represent relational knowledge between entities with multi-relational edges for models to acquire. Incorporating KGs brings the potential of interpretable and trustworthy predictions, as the knowledge is now explicitly stated. For example, in Fig. 1, the relational path $(\textsc{Child}\rightarrow\texttt{AtLocation}\rightarrow\textsc{Classroom}\rightarrow\texttt{Synonym}\rightarrow\textsc{Schoolroom})$ naturally provides evidence for the answer Schoolroom.

A straightforward approach to leveraging a knowledge graph is to directly model these relational paths. KagNet (Lin et al., 2019) and MHPGM (Bauer et al., 2018) model multi-hop relations by extracting relational paths from KG and encoding them with sequence models. Application of attention mechanisms upon these relational paths can further offer good interpretability. However, these models are hardly scalable because the number of possible paths in a graph is (1) polynomial w.r.t. the number of nodes (2) exponential w.r.t. the path length (see Fig. 2). Therefore, some (Weissenborn et al., 2017; Mihaylov and Frank, 2018) resort to only using one-hop paths, namely, triples, to balance scalability and reasoning capacities.

Graph neural networks (GNNs), in contrast, enjoy better scalability via their message passing formulation, but usually lack transparency. The most commonly used GNNs’ variant, Graph Convolutional Networks (GCNs) (Kipf and Welling, 2017), perform message passing by aggregating neighborhood information for each node, but ignore the relation types. RGCNs (Schlichtkrull et al., 2018) generalize GCNs by performing relation-specific aggregation, making it applicable to multi-relational graphs. However, these models do not distinguish the importance of different neighbors or relation types and thus cannot provide explicit relational paths for model behavior interpretation.

In this paper, we propose a novel graph encoding architecture, Multi-hop Graph Relation Network (MHGRN), which combines the strengths of path-based models and GNNs. Our model inherits scalability from GNNs by preserving the message passing formulation. It also enjoys interpretability of path-based models by incorporating structured relational attention mechanism. Our key motivation is to perform multi-hop message passing within a single layer to allow each node to directly attend to its multi-hop neighbours, towards multi-hop relational reasoning. We outline the favorable features of knowledge-aware QA models in Table 1 and compare MHGRN with them.

We summarize the main contributions of this work as follows: 1) We propose MHGRN, a novel model architecture tailored to multi-hop relational reasoning, which explicitly models multi-hop relational paths at scale. 2) We propose a structured relational attention mechanism for efficient and interpretable modeling of multi-hop reasoning paths, along with its training and inference algorithms. 3) We conduct extensive experiments on two question answering datasets and show that our models bring significant improvements compared to knowledge-agnostic PTLMs, and outperform other graph encoding methods by a large margin.

Problem Formulation and Overview

In this paper, we limit the scope to the task of multiple-choice question answering, although it can be easily generalized to other knowledge-guided tasks (e.g., natural language inference). The overall paradigm of knowledge-aware QA is illustrated in Fig. 3. Formally, given an external knowledge graph (KG) as the knowledge source and a question $q$ , our goal is to identify the correct answer from a set $\mathcal{C}$ of given options. We turn this problem into measuring the plausibility score between $q$ and each option $a\in\mathcal{C}$ then selecting the one with the highest plausibility score.

Denote the vector representations of question $q$ and option $a$ as $\bm{q}$ and $\bm{a}$ . To measure the score for $\bm{q}$ and $\bm{a}$ , we first concatenate them to form a statement vector $\bm{s}=\left[\bm{q};\bm{a}\right]$ . Then we extract from the external KG a subgraph $\mathcal{G}$ (i.e., schema graph in KagNet Lin et al. (2019)), with the guidance of $\bm{s}$ (detailed in §5.1). This contextualized subgraph is defined as a multi-relational graph $\mathcal{G}=(\mathcal{V},\mathcal{E},\phi)$ . Here $\mathcal{V}$ is a subset of entity in the external KG, containing only those relevant to $s$ . $\mathcal{E}\subseteq\mathcal{V}\times\mathcal{R}\times\mathcal{V}$ is the set of edges that connect nodes in $\mathcal{V}$ , where $\mathcal{R}=\{1,\cdots,m\}$ are ids of all pre-defined relation types. The mapping function $\phi(i):\mathcal{V}\rightarrow\mathcal{T}=\{{\bm{\mathsfit{E}}}_{q},{\bm{\mathsfit{E}}}_{a},{\bm{\mathsfit{E}}}_{o}\}$ takes node $i\in V$ as input and outputs ${\bm{\mathsfit{E}}}_{q}$ if $i$ is an entity mentioned in ${q}$ , ${\bm{\mathsfit{E}}}_{a}$ if it is mentioned in ${a}$ , or ${\bm{\mathsfit{E}}}_{o}$ otherwise. We finally encode the statement to $\bm{s}$ , $\mathcal{G}$ to $\bm{g}$ , concatenate $\bm{s}$ and $\bm{g}$ , for calculating the plausibility score.

Background: Multi-Relational Graph Encoding Methods

We leave encoding of $s$ to pre-trained language models and focus on the challenge of encoding graph $\mathcal{G}$ to capture latent relations between entities. Current methods for encoding multi-relational graphs mainly fall into two categories: GNNs and path-based models. GNNs encode structured information by passing messages between nodes, directly operating on the graph structure, while path-based methods first decompose the graph into paths and then pool features over them.

Graph Encoding with GNNs. For a graph with $n$ nodes, a graph neural network (GNN) takes a set of node features $\{\bm{h}_{1},\bm{h}_{2},\dots,\bm{h}_{n}\}$ as input, and computes their corresponding node embeddings $\{\bm{h}_{1}^{\prime},\bm{h}_{2}^{\prime},\dots,\bm{h}_{n}^{\prime}\}$ via message passing (Gilmer et al., 2017). A compact graph representation for $\mathcal{G}$ can thus be obtained by pooling over the node embeddings $\{\bm{h}_{i}^{\prime}\}$ :

As a notable variant of GNNs, graph convolutional networks (GCNs) (Kipf and Welling, 2017) additionally update node embeddings by aggregating messages from its direct neighbors. RGCNs (Schlichtkrull et al., 2018) extend GCNs to encode multi-relational graphs by defining relation-specific weight matrix $\bm{W}_{r}$ for each edge type:

where $\mathcal{N}_{i}^{r}$ denotes neighbors of node $i$ under relation $r$ .For simplicity, we assume a single graph convolutional layer. In practice, multiple layers are stacked to enable message passing from multi-hop neighbors.

While GNNs have proved to have good scalability, their reasoning is done at the node level, making them incompatible with modeling paths. This property also hinders the model’s decisions from being interpretable at the path level.

Graph Encoding with Path-Based Models. In addition to directly modeling the graph with GNNs, a graph can also be viewed as a set of relational paths connecting pairs of entities.

Relation Networks (RNs) (Santoro et al., 2017) can be adapted to multi-relational graph encoding under QA settings. RNs use MLPs to encode all triples (one-hop paths) in $\mathcal{G}$ whose head entity is in $\mathcal{Q}=\{j\mid\phi(j)={\bm{\mathsfit{E}}}_{q}\}$ and tail entity is in $\mathcal{A}=\{i\mid\phi(i)={\bm{\mathsfit{E}}}_{a}\}$ . It then pools the triple embeddings to generate a vector for $\mathcal{G}$ as follows.

Here $\bm{h}_{j}$ and $\bm{h}_{i}$ are features for nodes $j$ and $i$ , $\bm{e}_{r}$ is the embedding of relation $r\in\mathcal{R}$ , $\oplus$ denotes vector concatenation.

To further equip RN with the ability to model nondegenerate paths, KagNet (Lin et al., 2019) adopts LSTMs to encode all paths connecting question entities and answer entities with lengths no more than $K$ . It then aggregates all path embeddings via attention mechanism:

Proposed Method: Multi-Hop Graph Relation Network (MHGRN)

This section presents Multi-hop Graph Relation Network (MHGRN), a novel GNN architecture that unifies both GNNs and path-based models. MHGRN inherits path-level reasoning and interpretabilty from path-based models, while preserving good scalability of GNNs.

We follow the GNN framework introduced in §3, where node features can be initialized with pre-trained weights (details in Appendix C). Here we focus on the computation of node embeddings.

Type-Specific Transformation. To make our model aware of the node type $\phi$ , we first perform node type specific linear transformation on the input node features:

where the learnable parameters $\bm{U}$ and $\bm{b}$ are specific to the type of node $i$ .

Multi-Hop Message Passing. As mentioned before, our motivation is to endow GNNs with the capability of directly modeling paths. To this end, we propose to pass messages directly over all the relational paths of lengths up to $K$ . The set of valid $k$ -hop relational paths is defined as:

We perform $k$ -hop ( $1\leq k\leq K$ ) message passing over these paths, which is a generalization of the single-hop message passing in RGCNs (see Eq. 2):

where the $\bm{W}^{t}_{r}(1\leq t\leq K,0\leq r\leq m)$ matrices are learnable $\bm{W}_{0}^{t}(0\leq t\leq K)$ are introduced as padding matrices so that $K$ transformations are applied regardless of $k$ , thus ensuring comparable scale of $\bm{z}_{i}^{k}$ across different $k$ ., $\alpha(j,r_{1},\dots,r_{k},i)$ is an attention score elaborated in §4.2 and $d_{i}^{k}=\sum_{(j\cdots i)\in\Phi_{k}}\alpha(j\cdots i)$ is the normalization factor. The $\{\bm{W}_{r_{k}}^{k}\cdots\bm{W}_{r_{1}}^{1}\mid 1\leq r_{1},\dots,r_{k}\leq m\}$ matrices can be interpreted as the low rank approximation of a { $m\times\cdots\times m\}_{k}\times d\times d$ tensor that assigns a separate transformation for each $k$ -hop relation, where $d$ is the dim. of $\bm{x}_{i}$ .

Incoming messages from paths of different lengths are aggregated via attention mechanism (Vaswani et al., 2017):

Non-linear Activation. Finally, we apply shortcut connection and nonlinear activation to obtain the output node embeddings.

where $\bm{V}$ and $\bm{V}^{\prime}$ are learnable model parameters, and $\sigma$ is a non-linear activation function.

2 Structured Relational Attention

Here we work towards effectively parameterizing the attention score $\alpha(j,r_{1},\dots,r_{k},i)$ in Eq. 7 for all $k$ -hop paths without introducing $\mathcal{O}(m^{k})$ parameters. We first regard it as the probability of a relation sequence $\left(\phi(j),r_{1},\dots,r_{k},\phi(i)\right)$ conditioned on $\bm{s}$ :

which can naturally be modeled by a probabilistic graphical model, such as conditional random field (Lafferty et al., 2001):

where $f(\cdot)$ , $\delta(\cdot)$ and $g(\cdot)$ are parameterized by two-layer MLPs and $\tau(\cdot)$ by a transition matrix of shape $m\times m$ . Intuitively, $\beta(\cdot)$ models the importance of a $k$ -hop relation while $\gamma(\cdot)$ models the importance of messages from node type $\phi(j)$ to $\phi(i)$ (e.g., the model can learn to pass messages only from question entities to answer entities).

Our model scores a $k$ -hop relation by decomposing it into both context-aware single-hop relations (modeled by $\delta$ ) and two-hop relations (modeled by $\tau$ ). We argue that $\tau$ is indispensable, without which the model may assign high importance to illogical multi-hop relations (e.g., [AtLocation, CapableOf]) or noisy relations (e.g., [RelatedTo, RelatedTo]).

3 Computation Complexity Analysis

Although message passing process in Eq. 7 and attention module in Eq.11 handles potentially exponential number of paths, it can be computed in linear time with dynamic programming (see Appendix D). As summarized in Table 2, time complexity and space complexity of MHGRN on a sparse graph are both linear w.r.t. either the maximum path length $K$ or the number of nodes $n$ .

4 Expressive Power of MHGRN

In addition to efficiency and scalability, we now discuss the modeling capacity of MHGRN. With the message passing formulation and relation-specific transformations, it is by nature the generalization of RGCN. It is also capable of directly modeling paths, making it interpretable as are path-based models like RN and KagNet. To show this, we first generalize RN (Eq. 3) to the multi-hop setting and introduce $K$ -hop RN (formal definition in Appendix E), which models multi-hop relation as the composition of single-hop relations. We show that MHGRN is capable of representing $K$ -hop RN (proof in Appendix F).

5 Learning, Inference and Path Decoding

We now discuss the learning and inference process of MHGRN instantiated for QA tasks. Following the problem formulation in §2, we aim to determine the plausibility of an answer option $\bm{a}\in\mathcal{C}$ given the question $\bm{q}$ with the information from both text $s$ and graph $\mathcal{G}$ . We first obtain the graph representation $\bm{g}$ by performing attentive pooling over the output node embeddings of answer entities $\{\bm{h}_{i}^{\prime}\mid i\in\mathcal{A}\}$ . Next we concatenate it with the text representation $\bm{s}$ and compute the plausibility score by $\rho(\bm{q},\bm{a})=\texttt{MLP}(\bm{s}\oplus\bm{g})$ .

During training, we maximize the plausibility score of the correct answer $\hat{\bm{a}}$ by minimizing the cross-entropy loss:

The whole model is trained end-to-end jointly with the text encoder (e.g., RoBERTa).

During inference, we predict the most plausible answer by $\text{argmax}_{\bm{a}\in\mathcal{C}}~{}\rho(\bm{q},\bm{a})$ . Additionally, we can decode a reasoning path as evidence for model predictions, endowing our model with the interpretability enjoyed by path-based models. Specifically, we first determine the answer entity $i^{*}$ with the highest score in the pooling layer and the path length $k^{*}$ with the highest score in Eq. 8. Then the reasoning path is decoded by $\text{argmax}~{}\alpha(j,r_{1},\dots,r_{k^{*}},i^{*})$ , which can be computed in linear time using dynamic programming.

Experimental Setup

We introduce how we construct $\mathcal{G}$ (§5.1), the datasets (§5.2), as well as the baseline methods (§5.3). Appendix C shows more implementation and experimental details for reproducibility.

We use ConceptNet (Speer et al., 2017), a general-domain knowledge graph as our external KG to test models’ ability to harness structured knowledge source. Following KagNet Lin et al. (2019), we merge relation types to increase graph density and add reverse relations to construct a multi-relational graph with 34 relation types (details in Appendix A). To extract an informative contextualized graph $\mathcal{G}$ from KG, we recognize entity mentions in $s$ and link them to entities in ConceptNet, with which we initialize our node set $\mathcal{V}$ . We then add to $\mathcal{V}$ all the entities that appear in any two-hop paths between pairs of mentioned entities. Unlike KagNet, we do not perform any pruning but instead reserve all the edges between nodes in $\mathcal{V}$ , forming our $\mathcal{G}$ .

2 Datasets

We evaluate models on two multiple-choice question answering datasets, CommonsenseQA and OpenbookQA. Both require world knowledge beyond textual understanding to perform well.

CommonsenseQA (Talmor et al., 2019) necessitates various commonsense reasoning skills. The questions are created with entities from ConceptNet and they are designed to probe latent compositional relations between entities in ConceptNet.

OpenBookQA (Mihaylov et al., 2018) provides elementary science questions together with an open book of science facts. This dataset also probes general common sense beyond the provided facts.

3 Compared Methods

We implement both knowledge-agnostic fine-tuning of pre-trained LMs and models that incorporate KG as external sources as our baselines. Additionally, we directly compare our model with the results from corresponding leaderboard. These methods typically leverage textual knowledge or extra training data, as opposed to external KG. In all our implemented models, we use pre-trained LMs as text encoders for $s$ for fair comparison. We do compare our models with those (Ma et al., 2019; Lv et al., 2019; Khashabi et al., 2020) augmented by other text-form external knowledge (e.g., Wikipedia), although we stick to our focus of encoding structured KG.

Specifically, we fine-tune Bert-Base, Bert-Large (Devlin et al., 2019), and RoBERTa (Liu et al., 2019b) for multiple-choice questions. We take RGCN (Eq. 2 in §3), RNWe use mean pooling for 1-hop RN and attentive pooling for 2-hop RN (detailed in Appendix E). (Eq. 3 in §3), KagNet (Eq. 4 in §3) and GconAttn (Wang et al., 2019) as baselines. GconAttn generalizes match-LSTM (Wang and Jiang, 2016) and achieves success in language inference tasks.

Results and Discussions

In this section, we present the results of our models in comparison with baselines as well as methods on the leaderboards for both CommonsenseQA and OpenbookQA. We also provide analysis of models’ components and characteristics.

For CommonsenseQA (Table 3), we first use the in-house data split (IH) of Lin et al. (2019) (cf. Appendix B) to compare our models with implemented baselines. This is different from the official split used in the leaderboard methods. Almost all KG-augmented models achieve performance gain over vanilla pre-trained LMs, demonstrating the value of external knowledge on this dataset. Additionally, we evaluate our MHGRN (with text encoder being RoBERTa-Large) on official split, OF (Table 5) for fair comparison with other methods on leaderboard, in both single-model setting and ensemble-model setting. With backbone being T5 Raffel et al. (2019), UnifiedQA Khashabi et al. (2020) tops the leaderboard. Considering its training cost, we do not intend to compare our RoBERTa-based model with it. We achieve the best performances among other models.

For OpenbookQA (Table 5), we use official split and build models with RoBERTa-Large as text encoder. MHGRN surpasses all implemented baselines, with an absolute increase of $\sim$ 2% on Test. Also, as our approach is naturally compatible with the methods that utilize textual knowledge or extra data, because in our paradigm the encoding of textual statement and graph are structurally-decoupled (Fig. 3). To empirically show MHGRN can bring gain over textual-knowledge empowered systems, we replace our text encoder with AristoRoBERTaV7https://leaderboard.allenai.org/open_book_qa/submission/blcp1tu91i4gm0vf484g, and fine-tune our MHGRN upon OpenbookQA. Empirically, MHGRN continues to bring benefit to strong-performing textual-knowledge empowered systems. This indicates textual knowledge and structured knowledge can potentially be complementary.

2 Performance Analysis

Ablation Study on Model Components. As shown in Table 6, disabling type-specific transformation results in $\sim 1.3\%$ drop in performance, demonstrating the need for distinguishing node types for QA tasks. Our structured relational attention mechanism is also critical, with its two sub-components contributing almost equally.

Impact of the Amount of Training Data. We use different fractions of training data of CommonsenseQA and report results of fine-tuning text encoders alone and jointly training text encoder and graph encoder in Fig. 5. Regardless of training data fraction, our model shows consistently more performance improvement over knowledge-agnostic fine-tuning compared with the other graph encoding methods, indicating MHGRN’s complementary strengths to text encoders.

Impact of Number of Hops ( $K$ ). We investigate the impact of hyperparameter $K$ for MHGRN by its performance on CommonsenseQA (Fig. 6). The increase of $K$ continues to bring benefits until $K=4$ . However, performance begins to drop when $K>3$ . This might be attributed to exponential noise in longer relational paths in knowledge graph.

3 Model Scalability

Fig. 7 presents the computation cost of MHGRN and RGCN (measured by training time). Both grow linearly w.r.t. $K$ . Although the theoretical complexity of MultiRGN is $m$ times that of RGCN, the ratio of their empirical cost only approaches $2$ , demonstrating that our model can be better parallelized.

4 Model Interpretability

We can analyze our model’s reasoning process by decoding the reasoning path using the method described in §4.5. Fig. 8 shows two examples from CommonsenseQA, where our model correctly answers the questions and provides reasonable path evidences. In the example on the left, the model links question entities and answer entity in a chain to support reasoning, while the example on the right provides a case where our model leverage unmentioned entities to bridge the reasoning gap between question entity and answer entities, in a way that is coherent with the latent relation between Chapel and the desired answer in the question.

Related Work

Knowledge-Aware Methods for NLP Various work have investigated the potential to empower NLP models with external knowledge. Many attempt to extract structured knowledge, either in the form of nodes (Yang and Mitchell, 2017; Wang et al., 2019), triples (Weissenborn et al., 2017; Mihaylov and Frank, 2018), paths (Bauer et al., 2018; Kundu et al., 2019; Lin et al., 2019), or subgraphs (Li and Clark, 2015), and encode them to augment textual understanding.

Recent success of pre-trained LMs motivates many (Pan et al., 2019; Ye et al., 2019; Zhang et al., 2018; Li et al., 2019; Banerjee et al., 2019) to probe LMs’ potential as latent knowledge bases. This line of work turn to textual knowledge (e.g. Wikipedia) to directly impart knowledge to pre-trained LMs. They generally fall into two paradigms: 1) Fine-tuning LMs on large-scale general-domain datasets (e.g. RACE (Lai et al., 2017)) or on knowledge-rich text. 2) Providing LMs with evidence via information retrieval techniques. However, these models cannot provide explicit reasoning and evidence, thus hardly trustworthy. They are also subject to the availability of in-domain datasets and maximum input token of pre-trained LMs.

Neural Graph Encoding Graph Attention Networks (GAT) (Velickovic et al., 2018) incorporates attention mechanism in feature aggregation, RGCN (Schlichtkrull et al., 2018) proposes relational message passing which makes it applicable to multi-relational graphs. However they only perform single-hop message passing and cannot be interpreted at path level. Other work (Abu-El-Haija et al., 2019; Nikolentzos et al., 2019) aggregate for a node its $K$ -hop neighbors based on node-wise distances, but they are designed for non-relational graphs. MHGRN addresses these issues by reasoning on multi-relational graphs and being interpretable via maintaining paths as reasoning chains.

Conclusion

We present a principled, scalable method, MHGRN, that can leverage general knowledge via multi-hop reasoning over interpretable structures (e.g. ConceptNet). The proposed MHGRN generalizes and combines the advantages of GNNs and path-based reasoning models. It explicitly performs multi-hop relational reasoning and is empirically shown to outperform existing methods with superior scalablility and interpretability.

References

Appendix A Merging Types of Relations in ConceptNet

We merge relations that are close in semantics as well as in the general usage of triple instances in ConceptNet.

Appendix B Dataset Split Specifications

CommonsenseQA https://www.tau-nlp.org/commonsenseqa and OpenbookQA https://leaderboard.allenai.org/open_book_qa/submissions/public all have their leaderboards, with training and development set publicly available. As the ground truth labels for CommonsenseQA are not readily available, for model analysis, we take 1,241 examples from official training examples as our in-house test examples and regard the remaining 8,500 ones as our in-house training examples (CommonsenseQA (IH)).

Appendix C Implementation Details

Our models are implemented in PyTorch. We use cross-entropy loss and RAdam (Liu et al., 2019a) optimizer. We find it beneficial to use separate learning rates for the text encoder and the graph encoder. We tune learning rates for text encoders and graph encoders on two datasets. We first fine-tune RoBERTa-Large, Bert-Large, Bert-Base on CommonsenseQA and RoBERTa-Large on OpenbookQA respectively, and choose a dataset-specific learning rate from $\{1\times 10^{-5},2\times 10^{-5},3\times 10^{-5},6\times 10^{-5},1\times 10^{-4}\}$ for each text encoder, based on the best performance on development set, as listed in Table 9. We report the performance of these fine-tuned text encoders and also adopt their dataset-specific optimal learning rates in joint training with graph encoders. For models that involve KG, the learning rate of their graph encoders are chosen from $\{1\times 10^{-4},3\times 10^{-4},1\times 10^{-3},3\times 10^{-3}\}$ , based on their best development set performance with RoBERTa-Large as the text encoder. We report the optimal learning rates for graph encoders in Table 10. In training, we set the maximum input sequence length to text encoders to $64$ , batch size to $32$ , and perform early stopping. AristoRoBERTaV7+MHGRN is the only exception. In order to host fair comparison, we follow AristoRoBERTaV7 and set the batch size to $16$ , max input sequence length to $256$ , and choose a decoder learning rate from $\{1\times 10^{-3},2\times 10^{-5}\}$ .

For the input node features, we first use templates to turn knowledge triples in ConceptNet into sentences and feed them into pre-trained Bert-Large, obtaining a sequence of token embeddings from the last layer of Bert-Large for each triple. For each entity in ConceptNet, we perform mean pooling over the tokens of the entity’s occurrences across all the sentences to form a $1024d$ vector as its corresponding node feature. We use this set of features for all our implemented models.

We use 2-layer RGCN and single-layer MHGRN across our experiments.

The numbers of parameter for each graph encoder are listed in Table 11.

Appendix D Dynamic Programming Algorithm for Eq. 7

To show that multi-hop message passing can be computed in linear time, we observe that Eq. 7 can be re-written in matrix form:

where $\bm{G}=\text{diag}(\exp([g(\phi(v_{1}),\bm{s}),\dots,g(\phi(v_{n}),\\ \bm{s})])$ ( $\bm{F}$ is similarly defined), $\bm{A}_{r}$ is the adjacency matrix for relation $r$ and $\bm{D}^{k}$ is defined as follows:

Using this matrix formulation, we can compute Eq. 7 using dynamic programming:

Appendix E Formal Definition of K𝐾K-hop RN

( $K$ -hop Relation Network) A multi-hop relation network is a function that maps a multi-relational graph to a fixed size vector: