Recurrent Dirichlet Belief Networks for Interpretable Dynamic Relational Data Modelling

Yaqiong Li, Xuhui Fan, Ling Chen, Bin Li, Zheng Yu, Scott A. Sisson

Introduction

Dynamic data is a common feature in many real-world applications, including relational data analysis Mucha et al. (2010); Phan and Airoldi (2015); Yang and Koeppl (2018) for learning time-varying node interactions, and text modelling Guo et al. (2018); Schein et al. (2019) for exploring topic evolution. Modelling dynamic data has become a vibrant research topic, with popular techniques ranging from non-Bayesian methods, such as Collaborative Filtering with Temporal Dynamics (SVD++) Koren (2009), to Bayesian deep probabilistic frameworks such as Deep Poisson-Gamma Dynamical Systems (DPGDS) Guo et al. (2018). The main advantage of Bayesian deep probabilistic frameworks is the flexible model design and the strong modelling performance. However, most of these frameworks are static so that they cannot account for the evolution of relationships over time. It would be highly beneficial if the frameworks can be extended to the dynamic setting to enjoy the modelling advantages.

The Dirichlet Belief Network (DirBN) Zhao et al. (2018) has been proposed recently as a promising deep probabilistic framework for learning interpretable deep latent structures. To date, the DirBN has mainly been used in two applications: (1) topic structure learning Zhao et al. (2018), where latent representations are used to model the word distribution for topics; and (2) relational models Fan et al. (2019a), where latent representations model the nodes’ membership distribution over communities. By constructing a deep architecture for latent distributions, the DirBN can model high-order dependence between topic-word distributions (in topic models) and nodes’ membership distributions (in relational models).

In this work, we propose a Recurrent Dirichlet Belief Network (Recurrent-DBN) to explore the complex latent structures in dynamic relational data. In addition to constructing an interpretable deep architecture for the data within individual time steps, we also study the temporal dependence in the dynamic relational data through (layer-to-layer) connections crossing consecutive time steps. Consequently, our Recurrent-DBN can describe long-term temporal dependence (i.e., the dependence between the current variables and those in the previous several time steps), improving over the one-order Markov structures that usually describe the dependence between the current variables and those in the previous one time step only.

For model inference, we further develop an efficient Gibbs sampling algorithm. Besides upward propagating latent counts as done by DirBN, we also introduce a backward step to propagate the counts from the current time step to the previous time steps. Our experiments on real-world dynamic relational data show significant advantages of the Recurrent-DBN over the state-of-the-art models in tasks of interpretable latent structure discovery and link prediction. Similar to DirBN that can be considered as a self-contained module Zhao et al. (2018), our Recurrent-DBN could be flexibly adapted to account for dynamic data other than evolving relational data, such as time-varying counts and dynamic drifting text data.

We summarise this paper’s main merits as follows:

Recurrent structures are designed to model long term temporal dependence. Also, interpretable and organised latent structures are well explored;

An efficient Gibbs sampling method is devised that first upward-backward propagates latent counts and then downward-forward samples variable;

Significantly improved model performance in real-world dynamic relational models compared to the state-of-the-art, including better link prediction performance and enhanced interpretable latent structure visualisation.

Background information of DirBN

We first give a brief review of the DirBN model. In general, the DirBN constructs a multi-stochastic layered architecture to represent interpretable latent distributions for objects. We describe it within the relational data setting for illustrative purposes. Given a binary observed linkage matrix $\boldsymbol{R}\in\{0,1\}^{N\times N}$ for $N$ nodes, where $R_{ij}$ denotes whether node $i$ has a relation to node $j$ , the DirBN constructs an $L$ -layer and $K$ -length community membership distribution ${\boldsymbol{\pi}_{i}=}\{\boldsymbol{\pi}_{i}^{(l)}\}_{l=1}^{L}$ for each node $i$ . The generative process for the membership distributions $\{\boldsymbol{\pi}_{i}^{(l)}\}_{l=1}^{L}$ , as well as the observed matrix $\boldsymbol{R}$ , can be briefly described as:

$\beta_{i^{\prime}i}^{(l-1)}\sim\text{Gam}(c,\frac{1}{d}),\forall i,i^{\prime}=1,\ldots,N$

$\boldsymbol{\pi}_{i}^{(l)}\sim\text{Dirichlet}(\boldsymbol{\alpha}^{1\times K}\boldsymbol{1}(l=1)+\sum_{i^{\prime}}\beta_{i^{\prime}i}^{(l-1)}\boldsymbol{\pi}_{i^{\prime}}^{(l-1)})$

$\boldsymbol{X}_{i}\sim\text{Multinomial}(M;\boldsymbol{\pi}_{i}^{(L)}),\forall i=1,\ldots,N$ ;

$R_{ij}\sim\text{Bernoulli}\left(f(\boldsymbol{X}_{i},\boldsymbol{X}_{j})\right),\forall i,i^{\prime}=1,\ldots,N$ ;

where $\boldsymbol{\alpha}^{1\times K}$ is a concentration parameter generating the membership distribution in the $1$ st layer, $\beta_{i^{\prime}i}^{(l-1)}$ represents the information propagation coefficient from node $i^{\prime}$ to node $i$ in the $(l-1)$ th layer, $c$ and $d$ are the hyper-parameters generating these propagation coefficients, $\boldsymbol{X}_{i}$ is the latent count information for node $i$ and $M$ is the sum of these counts, and $f(\boldsymbol{X}_{i},\boldsymbol{X}_{j})$ represents the probabilistic function mapping a pair of membership distributions to a linkage probability. A larger value of $\beta_{i^{\prime}i}^{(l-1)}$ indicates higher influence of $\boldsymbol{\pi}_{i^{\prime}}^{(l-1)}$ on the generation of $\boldsymbol{\pi}_{i}^{(l)}$ . Therefore, $\beta_{i^{\prime}i}^{(l-1)}$ is set to if node $i^{\prime}$ is not connected to node $i$ in the observed data $\boldsymbol{R}$ .

It is difficult to directly implement efficient Gibbs sampling for the DirBN because the prior and posterior distributions of the membership distributions $\boldsymbol{\pi}_{i}^{(l)}$ are not conjugate. To address this issue, a strategy of first upward propagating latent counts and then downward sampling variables has been developed in Zhao et al. (2018). Given the count information $\boldsymbol{X}_{i}$ for node $i$ , the DirBN upward propagates $\boldsymbol{X}_{i}$ to all the nodes in the $(L-1)$ th layer through a Chinese Restaurant Table (CRT) distribution. Each node in the $(L-1)$ th layer collects these propagated counts and uses their sum as its latent count $\boldsymbol{X}_{i}^{(L-1)}$ in the $(L-1)$ th layer. This procedure is repeated until the counts have been assigned to all layers. Thus, conjugate constructions can be created for each variable and thereby used to construct efficient Gibbs samplers.

Recurrent-DBN for dynamic relational data modeling

To handle dynamic relational data, we attach an index $t$ to variables to denote the corresponding time step. Thus, the observed dynamic relational data can be described as $\boldsymbol{R}\in\{0,1\}^{N\times N\times T}$ for $N$ nodes at $T$ time steps, where $R_{ij,t}$ denotes whether node $i$ has relation to node $j$ at the $t$ th time step. Each matrix $\{\boldsymbol{R}_{-,t}\}_{t}\in\{0,1\}^{N\times N}$ can be either asymmetric (directional) or symmetric (non-directional) and we do not consider self-linkages $\{R_{ii,t}\}_{i,t}$ .

In the Recurrent-DBN, we assume the time-dependent membership distribution of a node $i$ in the $l$ -th layer at time step $t$ , $\boldsymbol{\pi}_{i,t}^{(l)}$ , follows a Dirichlet distribution. Its generative process can be described as below, with the propagation of $\boldsymbol{\pi}_{i,t}^{(l)}$ illustrated in Fig. 1 (Left). For notation convenience, any parameters with index are set to zero. It is noted that we have already used the observed data into the generative process.

$\beta_{i^{\prime}i,t}^{(l-1)}\left\{\begin{array}[]{ll}=0,&\text{ if }R_{i^{\prime}i,t}=0;\\ \sim\text{Gam}(c_{c}^{(l)},\frac{1}{d_{c}}),&\text{ if }i^{\prime}=i;\\ \sim\text{Gam}(c_{u}^{(l)},\frac{1}{d_{c}}),&\text{ if }i^{\prime}\neq i.\end{array}\right.$

$\gamma_{i^{\prime}i,t-1}^{(l)}\left\{\begin{array}[]{ll}=0,&\text{ if }R_{i^{\prime}i,t-1}=0;\\ \sim\text{Gam}(c_{c}^{(l)},\frac{1}{d_{c}}),&\text{ if }i^{\prime}=i;\\ \sim\text{Gam}(c_{u}^{(l)},\frac{1}{d_{c}}),&\text{ if }i^{\prime}\neq i.\end{array}\right.$

Calculate concentration parameter $\boldsymbol{\psi}_{i,t}^{(l)}$ :

$\boldsymbol{\pi}_{i,t}^{(l)}\sim\text{Dirichlet}(\boldsymbol{\alpha}^{1\times K}\boldsymbol{1}(t=1,l=1)+\boldsymbol{\psi}_{i,t}^{(l)})$ .

We restrict the two nodes to have information propagated only if they are observed with positive relationship (step (a).i and (a).ii). This can reduce the computational cost of calculating $\boldsymbol{\beta}_{-,t}^{(l)},\boldsymbol{\gamma}_{-,t}^{(l)}$ from $\mathcal{O}(N^{2})$ to the scale of the number of positive relationships. Also, it encourages connected nodes to have more similar membership distributions and larger dependencies between each other.

The concentration parameter $\boldsymbol{\psi}_{i,t}^{(l)}$ for generating $\boldsymbol{\pi}_{i,t}^{(l)}$ comprises two parts: the information propagated from all other nodes’ latent representations in the $(l-1)$ -th layer at time $t$ , $\sum_{i^{\prime}}\beta_{i^{\prime}i,t}^{(l-1)}\boldsymbol{\pi}_{i^{\prime},t}^{(l-1)}$ , and those in the $l$ -th layer at time $(t-1)$ , $\sum_{i^{\prime}}\gamma_{i^{\prime}i,t-1}^{(l)}\boldsymbol{\pi}_{i^{\prime},t-1}^{(l)}$ . In other words, $\boldsymbol{\psi}_{i,t}^{(l)}$ is a linear sum of all the previous-layers’ information at the same time step and all the previous-time steps’ information in the same layer. When the coefficients $\boldsymbol{\beta}$ dominate over $\boldsymbol{\gamma}$ , the hierarchical structure plays a more important role. Otherwise, the temporal dependence has higher influence.

2 Application to dynamic relational data

After generating the membership distributions $\{\boldsymbol{\pi}_{i,t}^{(l)}\}$ , we use the Bernoulli-Poisson link function Dunson and Herring (2005); Zhou (2015); Fan et al. (2019a) to generate the relational data at each time step:

$\Lambda_{k_{1}k_{2}}\sim\text{Gamma}(\lambda_{1},\lambda_{0}),\forall k_{1},k_{2}$

$M_{i,t}\sim\text{Poisson}(M),\forall i,t$

$\boldsymbol{X}_{i,t}\sim\text{Multinomial}(M_{i,t};\boldsymbol{\pi}_{i,t}^{(L)}),\forall i,t$ ;

$C_{ij,k_{1}k_{2},t}\sim\text{Poisson}(X_{i,k_{1},t}\Lambda_{k_{1}k_{2}}X_{j,k_{2},t}),\forall k_{1},k_{2}$

$R_{ij,t}={\boldsymbol{1}(\sum_{k_{1},k_{2}}C_{ij,k_{1}k_{2},t}>0)}$ ,

where $\Lambda_{k_{1}k_{2}}$ is a community compatibility parameter such that a larger value of $\Lambda_{k_{1}k_{2}}$ indicates a larger possibility of generating the links between communities $k_{1}$ and $k_{2}$ , $\lambda_{1},\lambda_{0},M$ are hyper-parameters, $M_{i,t}$ is a scaling parameter for generating the related counting information for node $i$ at time $t$ , and $C_{ij,k_{1}k_{2},t}$ is a community-to-community latent integer for linkage $R_{ij}$ at time $t$ .

Through the Multinomial distributions with $\boldsymbol{\pi}_{i,t}^{(L)}$ as event probabilities, $\boldsymbol{X}_{i,t}$ can be regarded as an estimator of $\boldsymbol{\pi}_{i,t}^{(L)}$ . Since the sum $M_{i}\sim{\text{Poisson}(M)}$ , according to the Poisson-Multinomial equivalence, each $X_{i,k,t}$ is equivalently distributed as $X_{i,k,t}\sim{\text{Poisson}(M{\pi}_{i,k,t}^{(L)})}$ . Therefore, both the prior distribution for generating $X_{i,k,t}$ and the likelihood based on $X_{i,k,t}$ are Poisson distributions. We may form feasible categorical distribution on its posterior inference. This trick is inspired by the recent advances in data augmentation and marginalisation techniques Fan et al. (2019a), which allows us to implement posterior sampling for $X_{i,k,t}$ efficiently.

The counts $\boldsymbol{X}_{i,t}$ lead to the generation of the $K\times K$ integer matrix $\boldsymbol{C}_{ij,t}$ . Based on the Bernoulli-Poisson link function Dunson and Herring (2005); Zhou (2015), the observed $R_{ij,t}$ is mapped to the latent Poisson count random variable matrix $\boldsymbol{C}_{ij,t}$ . It is shown in Fan et al. (2019a) that $\{C_{ij,k_{1}k_{2},t}\}_{k_{1},k_{2}}=0$ if $R_{ij,t}=0$ . That is, only the non-zero links are involved during the inference for $\boldsymbol{C}_{ij,k_{1}k_{2},t}$ , which largely reduces the computational complexity, especially for large and sparse dynamic relational data.

Recurrent structure. Before describing the inference of Recurrent-DBN, we discuss the characteristic of the recurrent structure of our model. Instead of using the one-order Markov property to describe the temporal dependence (assuming the state at time $t$ depends on the states at time $t-1$ only), which is adopted by most probabilistic dynamic models, the deep structure of the Recurrent-DBN allows the latent variables at time $t$ depend on those at time steps from $t-1$ to $t-L$ . For example, by using the law of total expectations, we can have the expectation of the latent count $\boldsymbol{X}_{-,t}$ in a $2$ -layered Recurrent-DBN as (We use the notation $-$ to denote a related parameter or variable hereafter):

3 Inference

The joint distribution of the latent variables is expressed as:

While the DirBN only has upward-propagation for the latent counts and downward-sampling for the latent variables, for the Recurrent-DBN we develop an upward-backward propagation and forward-downward Gibbs sampling algorithm for count propagation and latent variable sampling. Posterior simulation for the Recurrent-DBN involves two key steps in each sampling iteration: (1) propagating the counts $\boldsymbol{X}$ upward and backward to the upper layers and previous time steps via a latent count variable $\boldsymbol{m}$ ; (2) forward and downward sampling $\boldsymbol{\pi},\boldsymbol{\beta},\boldsymbol{\gamma}$ given the propagated latent counts $\boldsymbol{m}$ . Full updates for the other variables are similar to those in Fan et al. (2019a).

Figure 1 (right) illustrates the upward-backward propagation of counts $\boldsymbol{X}$ to the latent count variable $\boldsymbol{m}$ at each hidden layers. Generally speaking, for $i,i^{\prime}=1,\ldots,N,l=1,\ldots,L,t=1,\ldots,T,k=1,\ldots,K$ , the latent variable $\boldsymbol{\psi}$ is generated as Eq. (1). $m_{i,k,t}^{(l)}$ refers to the latent counts for the node $i$ in layer $l$ at time $t$ for the $k$ -th community. By integrating the $m_{i,k,t}^{(l)}$ , the likelihood term of $\boldsymbol{\psi}_{i,t}^{(l)}$ can be calculated as:

By introducing the auxiliary variables $q_{i,t}^{(l)}$ and $y_{i,k,t}^{(l)}$ , the likelihood term of $\boldsymbol{\psi}_{i,t}^{(l)}$ can be further augmented as:

where the $q_{i,t}^{(l)}$ and $y_{i,k,t}^{(l)}$ can be generated as:

Consequently, $y_{i,k,t}^{(l)}$ can be considered as the ‘derived latent counts’ for node $i$ derived from the latent counts $m_{i,k,t}^{(l)}$ . Each $y_{i,k,t}^{(l)}$ can then be upward and backward distributed based on the probabilities of $\psi_{i,k,t}^{(l)}$ as follows:

Here, the $y_{i,k,t}^{(l)}$ is divided into two parts: one is delivered to each $i^{\prime}$ at time $t$ of layer $l-1$ ( $(Z_{i1,k,t}^{(l-1)},\ldots,Z_{iN,k,t}^{(l-1)})$ ), and the other to each $i^{\prime}$ at time $t-1$ of layer $l$ ( $A_{i1,t-1,k}^{(l)},\ldots,A_{i1,t-1,k}^{(l)})$ ). We denote them as $\boldsymbol{Z}_{i-,k,t}^{(l-1)}$ and $\boldsymbol{A}_{i-,t-1,k}^{(l)}$ respectively. The latent counts of lower layers and previous time steps can thus be calculated respectively as:

Let $\boldsymbol{m}_{i,T}^{(L)}=\boldsymbol{X}_{i,T}$ , for $t=T-1,\ldots,2,i,i^{\prime}=1,\ldots,N$ , the specification in terms of layer $L$ is as follows,

To summarize, upward and backward propagation derives $\boldsymbol{y}_{i,t}^{(l)}$ from the latent counts $\boldsymbol{m}_{i,t}^{(l)}$ . Then, $\boldsymbol{y}_{i,t}^{(l)}$ is distributed to all $i^{\prime}$ at time $t$ of layer $l-1$ and time $t-1$ of layer $l$ respectively as $\boldsymbol{Z}_{i-,k,t}^{(l-1)}$ and $\boldsymbol{A}_{i-,t-1,k}^{(l)}$ . Lastly, $\boldsymbol{Z}_{-i,k,t}^{(l-1)}$ and $\boldsymbol{A}_{-i,t-1,k}^{(l)}$ contribute to the generation of $\boldsymbol{m}_{i,t}^{(l-1)}$ and $\boldsymbol{m}_{i,t-1}^{(l)}$ respectively. By repeating this process through layers and crossing time steps, we propagate the $\boldsymbol{X}$ to the $\boldsymbol{m}^{(l)}$ upward and backward sequentially.

3.2 Forward-Downward Sampling Latent Variables

The generated $\boldsymbol{\psi},\boldsymbol{q},\boldsymbol{m}^{(l)},(\boldsymbol{Z},\boldsymbol{A})$ can enable to form closed Gibbs sampling algorithm for the following variables:

After obtaining the latent counts $\boldsymbol{m}_{i,t}^{(l)}$ for each layer and each time step, the posterior inference of $\boldsymbol{\pi}_{i,t}^{(l)}$ can be proceeded as:

The likelihood term of ${\beta}_{i^{\prime}i,t}^{(l)}$ can be represented as:

The prior of $\beta_{i^{\prime}i,t}^{(l)},\beta_{i^{\prime}i,t}^{(l)}$ is $\text{Gam}(\gamma^{(l)}_{i},\frac{1}{c^{(l)}})$ . Their posterior distribution is

Related Work

Several Bayesian deep probabilistic frameworks have been proposed to capture the temporal dependence in dynamic data Gan et al. (2015); Gong (2017); Henao et al. (2015). The Deep Dynamic Sigmoid Belief Network Gan et al. (2015) sequentially stacks models of sigmoid belief networks and uses the binary-valued hidden variables to depict the log-range dynamic dependence. The Deep Dynamic Poisson Factor Analysis (DDPFA) Gong (2017) incorporates the Recurrent Neural Networks (RNN) into the Poisson Factor Analysis (PFA) to depict the long-range dynamic dependence. However, in DDPFA, the parameters in RNN and the latent variables in PFA are optimized separately. Poisson Gamma Dynamic Systems (PGDS) Schein et al. (2016) are developed to model the counting data through a “shallow” modelling strategy. Dynamic-PGDS (DPGDS) Guo et al. (2018) is probably the closest work to our approach. Compared with DPGDS, our Recurrent-DBN differs in three aspects: (1) our Recurrent-DBN generates normalized latent representations and thus provides more interpretable structures; (2) the count information is propagated in a different way; (3) our Recurrent-DBN is devised in the setting of relational modelling, while DPGDS is for the topic modelling setting.

For modelling dynamic network data, many of the existing works are “shallow” probabilistic modelling. The dynamic Tensorial Mixed Membership Stochastic Block model (T-MBM) Tarrés-Deulofeu et al. (2019) and the Fragmentation Coagulation Based MMSB (fcMMSB) Yu and Fan (2020) combine the notable mixed-membership stochastic block model with a dynamic setting. The Bayesian Poisson Tensor Factorization (BPTF) Schein et al. (2015) and the Dependent Relational Gamma Process model (DRGPM) Yang and Koeppl (2018) are the representative works that use Poisson matrix factorization techniques to address dynamic counting data. There are also some models using the collaborative filtering techniques such as SVD++. Some methods are not developed for dynamic network data originally, but they have later been applied to the dynamic scenario, such as structure-based models like Common Neighbor (CN) Newman (2001), and network embedding models, including Scalable Multiplex Network Embedding (MNE) Zhang et al. (2018) and DeepWalk Perozzi et al. (2014). It is noted that there is a recent trend in using the graphon theory Lloyd et al. (2012); Orbanz and Roy (2014) to model the network data Fan et al. (2016, 2018b, 2018a, 2019b, 2020).

Experiments

We evaluate the performance of our proposed Recurrent-DBN on five real-world data sets, by comparing with nine baseline methods: Mixed Membership Stochastic Block model (MMSB) Airoldi et al. (2008), T-MBM, fcMMSB, BPTF, DRGPM, SVD++, CN, MNE and DeepWalk. Except MMSB, all of the other eight baseline models are implemented with the released code. For MMSB, we use Gibbs sampling for the inference of all variables.

The real-world relational data sets used in this paper are: Coleman Coleman (1964), Mining Reality Eagle and Pentland (2006), Hypertext Isella et al. (2011), Infectious Isella et al. (2011) and Student Net Fan et al. (2014). The summarized statistics are detailed in Table 2. For the hyper-parameters, we specify $M\sim\text{Gamma}(N,1)$ for all data sets, $\{c_{c}^{(l)},c_{u}^{(l)}\}_{l},d,d_{c}$ and $\Lambda_{k1,k2}$ are all given $\text{Gamma}(1,1)$ priors and $L=3$ . For MMSB, we set the membership distribution according to $\text{Dirichlet}(\boldsymbol{1}^{1\times K})$ .

2 Link prediction

For link prediction, we randomly extract a proportion of $10\%$ of relational data entries (either links or non-links) at each time step as the test set. The remaining $90\%$ is used for training. The test relational data are not used to construct the information propagation matrix (i.e., we set $\{\beta_{i^{\prime}i,t}^{(l)}\}_{l,t}=0$ if $R_{i^{\prime}i}$ is the testing data). We estimate the posterior mean of $e^{-{\sum_{k_{1},k_{2}}X_{i,k_{1},t}\Lambda_{k_{1}k_{2}}X_{j,k_{2},t}}}$ as the linkage probability for each test data. These linkage probabilities are then used to calculate two evaluation metrics: the area under the curve of the receiver operating characteristic (AUC) and the precision-recall (precision). Higher values of AUC and precision indicate better model performance.

The detail results are shown in Table 1. We report the average evaluation results for each model over $16$ runs. Each run uses $3000$ MCMC iterations with the first $1500$ discarded as burn-in. Overall, Recurrent-DBN outperforms the baseline models for both metrics on almost all data sets. As might be expected, the value of AUC and precision increase with higher model complexity of Recurrent-DBN (i.e., larger values of $K$ ). For the other methods, fcMMSB is competitive with DRGPM and outperforms the other baselines. However, they all perform worse than Recurrent-DBN, especially for data sets with large numbers of $N$ or $T$ . We can see that the Recurrent-DBN has clear advantages in learning dynamic relational data, thanks to the deep hierarchical structure and recurrent long-term temporal dependence modelling.

3 Latent variable visualization

To gain further insights, we visualize the latent variables in Figure 2. It can be observed from the top part that: (1) for the same time step, the membership distributions change gradually with the increase of layers; (2) the membership distributions share some similarities for consecutive time steps and the similarities slowly shift along with the time. For example, the left bottom area of $\{\boldsymbol{\pi}_{i=1:30}^{(3)}\}$ seems to have $3$ different patterns: time steps $t=1$ , $t=2\sim 4$ , and $t=5\sim 10$ . The bottom part of Figure 2 visualizes propagation coefficients. It is reasonable to see the values of $\overline{\boldsymbol{\beta}}$ in the first layer and $\overline{\boldsymbol{\gamma}}$ in the first several time steps are small, since less information is propagated in these cases. The values become larger when more information is propagated. Also, the layer-wise propagation seems to have a larger influence than the cross-time propagation, with an average value of $\overline{\boldsymbol{\beta}}/\overline{\boldsymbol{\gamma}}=1.2\sim 1.4$ .

Conclusion

We have presented a probabilistic deep hierarchical structure named Recurrent Dirichlet Belief Networks (Recurrent-DBN) for learning dynamic relational data. Through Recurrent-DBN, the evolution of the latent structure is characterized by both the cross-layer and the cross-time dependencies. We also develop an upward-backward–forward-downward information propagation to enable efficient Gibbs sampling for all variables. The experimental results on a variety of real data sets demonstrate the excellent predictive performance of our model, and the inferred latent structure provides a rich interpretation for both hierarchical and dynamic information propagation. Our Recurrent-DBN can be applied to tasks like dynamic topic models Guo et al. (2018); Zhao et al. (2018)) and dynamic collaborative filtering. We keep these potential applications as the future work.

Acknowledgements

This work is partly supported by ARC Discovery Project DP180100966. Yaqiong Li is a recipient of UTS Research Excellence Scholarship. Xuhui Fan and Scott A. Sisson are supported by the Australian Research Council through the Australian Centre of Excellence in Mathematical and Statistical Frontiers (ACEMS, CE140100049). Bin Li is supported by Shanghai Municipal Science & Technology Commission (16JC1420401) and the Program for Professor of Special Appointment (Eastern Scholar) at Shanghai Institutions of Higher Learning.