Poisson-Randomized Gamma Dynamical Systems

Aaron Schein, Scott W. Linderman, Mingyuan Zhou, David M. Blei, Hanna Wallach

Introduction

Political scientists routinely analyze event counts of the number of times country ii took action aa toward country jj during time step tt . Such data can be represented as a sequence of count tensors Y(1),,Y(T)\boldsymbol{Y}^{{(1)}},\dots,\boldsymbol{Y}^{{(T)}} each of which contains the V ⁣× ⁣V ⁣× ⁣AV\!\times\!V\!\times\!A event counts for that time step for every combination of VV sender countries, VV receivers, and AA action types. International event data sets exhibit “complex dependence structures” like coalitions of countries and bursty temporal dynamics. These dependence structures violate the independence assumptions of the regression-based methods that political scientists have traditionally used to test theories of international relations . Political scientists have therefore advocated for using latent variable models to infer unobserved structures as a way of controlling for them . This approach motivates interpretable yet expressive models that are capable of capturing a variety of complex dependence structures. Recent work has applied tensor factorization methods to international event data sets to infer coalition structures among countries and topic structures among actions; however, these methods assume that the sequentially observed count tensors are exchangeable, thereby failing to capture the bursty temporal dynamics inherent to such data sets.

Sequentially observed count tensors present unique statistical challenges because they tend to be bursty , high-dimensional, and sparse . There are few models that are tailored to the challenging properties of both time series and count tensors. In recent years, Poisson factorization has emerged as a framework for modeling count matrices and tensors . Although factorization methods generally scale with the size of the matrix or tensor, many Poisson factorization models yield inference algorithms that scale linearly with the number of non-zero entries. This property allows researchers to efficiently infer latent structures from massive tensors, provided these tensors are sparse; however, this property is unique to a subset of Poisson factorization models that only posit non-negative prior distributions, which are difficult to chain in state-space models for time series. Hierarchical compositions of non-negative priors—notably, gamma and Dirichlet distributions—typically introduce non-conjugate dependencies that require innovative approaches to posterior inference.

This paper fills a gap in the literature between Poisson factorization models that are tractable—i.e., yielding closed-form complete conditionals that make inference algorithms easy to derive—and those that are expressive—i.e., capable of capturing a variety of complex dependence structures. To do so, we introduce an alternating chain of discrete Poisson and continuous gamma latent states, a new modeling motif that is analytically convenient and computationally tractable. We rely on this motif to construct the Poisson-randomized gamma dynamical system (PRGDS), a model for sequentially observed count tensors that is tractable, expressive, and efficient. The PRGDS is closely related to the Poisson–gamma dynamical system (PGDS) , a recently introduced model for dynamic count matrices, that is based on non-conjugate chains of gamma states. These chains are intractable; thus, posterior inference in the PGDS relies on sophisticated data augmentation schemes that are cumbersome to derive and impose unnatural restrictions on the priors over other variables. In contrast, the PRGDS introduces intermediate Poisson states that break the intractable dependencies between the gamma states (see Fig. 1). Although this motif is only semi-conjugate, it is tractable, yielding closed-form complete conditionals for the Poisson states by way of the little-known Bessel distribution and a novel discrete distribution that we derive and call the shifted confluent hypergeometric (SCH) distribution.

We study the inductive bias of the PRGDS by comparing its smoothing and forecasting performance to that of the PGDS and two other baselines on a range of real-world count data sets of text, international events, and neural spike data. For smoothing, we find that the PRGDS performs better than or similarly to the PGDS; for forecasting, we find the converse relationship. Both models outperform the other baselines. Using a specific hyperparameter setting, the PRGDS permits the continuous gamma latent states to take values of exactly zero, thereby encoding a unique inductive bias tailored to sparsity and burstiness. We find that this sparse variant always obtains better smoothing and forecasting performance than the non-sparse variant. We also find that this sparse variant yields a qualitatively broader range of latent structures—specifically, bursty latent structures that are highly localized in time.

Poisson-randomized gamma dynamical systems (PRGDS)

Generative process. The PRGDS is a form of canonical polyadic decomposition that assumes

where θk(t)\theta_{k}^{{(t)}} represents the activation of the kthk^{\textrm{th}} component at time step tt. Each component represents a dependence structure in the data set by way of a factor vector ϕk(m)\boldsymbol{\phi}^{{(m)}}_{k} for each mode mm. For international events, the first factor vector ϕk(1) ⁣= ⁣(ϕk1(1),,ϕkV(1))\boldsymbol{\phi}^{{(1)}}_{k}\!=\!(\phi^{{(1)}}_{k1},\dots,\phi^{{(1)}}_{kV}) represents the rate at which each of the VV countries acts as a sender in the kthk^{\textrm{th}} component while the second factor vector ϕk(2)\boldsymbol{\phi}^{{(2)}}_{k} represents the rate at which each country acts as a receiver. The weights λk\lambda_{k} and ρ(t)\rho^{{(t)}} represent the scales of component kk and time step tt. The PRGDS is stationary if ρ(t) ⁣= ⁣ρ\rho^{{(t)}}\!=\!\rho. We posit the following conjugate priors:

The PRGDS is characterized by an alternating chain of discrete and continuous latent states. The continuous states θk(1),,θk(T)\theta^{{(1)}}_{k},\ldots,\theta^{{(T)}}_{k} evolve via the intermediate discrete states hk(1),,hk(T)h^{{(1)}}_{k},\ldots,h^{{(T)}}_{k} as follows:

where we define θk(0) ⁣= ⁣λk\theta^{{(0)}}_{k}\!=\!\lambda_{k} to be the per-component weight from Eq. 1. In other words, the PRGDS assumes that θk(t)\theta_{k}^{{(t)}} is conditionally gamma distributed with rate τ\tau and shape equal to hk(t)h_{k}^{{(t)}} plus hyperparameter ϵ0(θ)0\epsilon_{0}^{{(\theta)}}\geq 0. We adopt the convention that a gamma random variable will be zero, almost surely, if its shape is zero. Therefore, setting ϵ0(θ) ⁣= ⁣0\epsilon_{0}^{{(\theta)}}\!=\!0 defines a sparse variant of the PRGDS, where the gamma latent state θk(t)\theta_{k}^{{(t)}} takes the value of exactly zero provided hk(t) ⁣= ⁣0h_{k}^{{(t)}}\!=\!0—i.e., θk(t)=a.s.0\theta_{k}^{{(t)}}\stackrel{{\scriptstyle\textrm{a.s.}}}{{=}}0 if hk(t) ⁣= ⁣0h_{k}^{{(t)}}\!=\!0.

The transition weight πkk2\pi_{kk_{2}} in Eq. 3 represents how strongly component k2k_{2} excites component kk at the next time step. We view these weights collectively as a K ⁣× ⁣KK\!\times\!K transition matrix Π\Pi and impose Dirichlet priors over the columns of this matrix. We also place a gamma prior over concentration parameter τ\tau. This prior is conjugate to the gamma and Poisson distributions in which it appears:

For the per-component weights λ1,,λK\lambda_{1},\ldots,\lambda_{K}, we use a hierarchical prior with a similar flavor to Eq. 3:

where ϵ0(λ)\epsilon_{0}^{{(\lambda)}} is analogous to ϵ0(θ)\epsilon_{0}^{{(\theta)}}. Finally, we use the following gamma priors, which are both conjugate:

The PRGDS has five fixed hyperparameters: ϵ0(θ)\epsilon_{0}^{{(\theta)}}, ϵ0(λ)\epsilon_{0}^{{(\lambda)}}, α0\alpha_{0}, a0a_{0}, and b0b_{0}. For the empirical studies in § 5, we set a0 ⁣= ⁣b0 ⁣= ⁣0.01a_{0}\!=\!b_{0}\!=\!0.01 to define weakly informative gamma and Dirichlet priors and set α0 ⁣= ⁣10\alpha_{0}\!=\!10 to define a gamma prior that promotes values close to 1; we consider ϵ0(θ){0,1}\epsilon_{0}^{{(\theta)}}\in\{0,1\} and set ϵ0(λ) ⁣= ⁣1\epsilon_{0}^{{(\lambda)}}\!=\!1.

Properties. In Eq. 5, both ϵ0(λ)\epsilon_{0}^{{(\lambda)}} and γ\gamma are divided by the number of components KK. This means that as the number of components grows K ⁣ ⁣K\!\rightarrow\!\infty, the expected sum of the weights remains finite and fixed:

This prior encodes an inductive bias toward small values of λk\lambda_{k} and may be interpreted as the finite truncation of a novel Bayesian nonparametric process. A small value of λk\lambda_{k} shrinks the Poisson rates of both yi(t)y^{{(t)}}_{\textbf{{i}}} and the first discrete latent state hk(0)h^{{(0)}}_{k}. As a result, this prior encourages the PRGDS to only infer components that are both predictive of the data and useful for capturing the temporal dynamics.

The marginal expectation of θ(t) ⁣= ⁣(θ1(t),,θK(t))\boldsymbol{\theta}^{{(t)}}\!=\!(\theta^{{(t)}}_{1},\ldots,\theta^{{(t)}}_{K}) takes the form of a linear dynamical system:

Finally, we can analytically marginalize out all of the discrete Poisson latent states to obtain a purely continuous dynamical system. When ϵ0(θ)>0\epsilon_{0}^{{(\theta)}}>0, this dynamical system can be written as follows:

where RG1 denotes the randomized gamma distribution of the first type . When ϵ0(θ) ⁣= ⁣0\epsilon_{0}^{{(\theta)}}\!=\!0, the dynamical system can be written in terms of a limiting form of the RG1. We describe the RG1 in Fig. 2.

Related work

The PRGDS is closely related to the Poisson–gamma dynamical system (PGDS) . In the PGDS,

The PGDS imposes non-conjugate dependencies directly between the gamma latent states. The complete conditional P(θk(t))P(\theta_{k}^{{(t)}}|-) is not available in closed form, and posterior inference relies on a sophisticated data augmentation scheme. The PRGDS instead introduces intermediate Poisson states that break the intractable dependencies between the gamma states; we visualize this in Fig. 1. Although the Poisson distribution is not a conjugate prior for the gamma rate, this motif is still tractable, yielding the complete conditional P(hk(t))P(h_{k}^{{(t)}}|-) in closed form, as we explain in § 4. The PGDS is limited by the data augmentation scheme that it relies on for posterior inference—specifically, this augmentation scheme does not allow λk\lambda_{k} to appear in the Poisson rate of yi(t)y^{{(t)}}_{\textbf{{i}}} in Eq. 1. To encourage parsimony, the PGDS instead draws λkGam(γK,β)\lambda_{k}\sim\textrm{Gam}(\tfrac{\gamma}{K},\beta) and then uses these per-component weights to shrink the transition matrix Π\Pi. This approach introduces additional intractable dependencies that require a different data augmentation scheme for posterior inference. Finally, the data augmentation schemes additionally require that each factor vector ϕk(m)\boldsymbol{\phi}_{k}^{{(m)}} and each column πk\boldsymbol{\pi}_{k} of the transition matrix are Dirichlet distributed. We note that although we also use Dirichlet distributions in this paper, this is a choice rather than a requirement imposed by the PRGDS.

The PGDS and its “deep” variants generalize gamma process dynamic Poisson factor analysis (GP-DPFA) , which assumes a simple random walk θk(t) ⁣ ⁣Gam(θk(t ⁣ ⁣1),c(t))\theta_{k}^{{(t)}}\!\sim\!\textrm{Gam}\left(\theta^{{(t\!-\!1)}}_{k},\,c^{{(t)}}\right); the model of Yang and Koeppl is also closely related . These models belong to a line of work exploring the “augment-and-conquer” data augmentation scheme for posterior inference in hierarchies of gamma variables chained via their shapes and linked to Poisson observations. Beyond models for time series, this motif can be used to build belief networks . An alternative approach is to chain gamma variables via their rates—e.g., θ(t)Gam(a,θ(t ⁣ ⁣1))\theta^{{(t)}}\sim\textrm{Gam}\left(a,\,\theta^{{(t\!-\!1)}}\right). This motif is conjugate and tractable, and has been applied to models for time series and deep belief networks . However, unlike the shape, the rate contributes to the variance of the gamma quadratically. Rate chains can therefore be highly volatile.

More broadly, gamma shape and rate chains are examples of non-negative chains. Such chains are especially well motivated in the context of Poisson factorization, which is particularly efficient when only non-negative prior distributions are used. In general, Poisson factorization assumes that each observed count yiy_{\textbf{{i}}} is drawn from a Poisson distribution with a latent rate μi\mu_{\textbf{{i}}} that is some function of the model parameters—i.e., yiPois(μi)y_{\textbf{{i}}}\sim\textrm{Pois}\left(\mu_{\textbf{{i}}}\right). When the rate is linear—i.e., μi ⁣= ⁣k=1Kμik\mu_{\textbf{{i}}}\!=\!\sum_{k=1}^{K}\mu_{\textbf{{i}}k}—Poisson factorization is allocative and admits a latent source representation , where yik=1Kyiky_{\textbf{{i}}}\triangleq\sum_{k=1}^{K}y_{\textbf{{i}}k} is defined to be the sum of KK latent sources yi1,,yiKy_{\textbf{{i}}1},\ldots,y_{\textbf{{i}}K} and yikPois(μik)y_{\textbf{{i}}k}\sim\textrm{Pois}\left(\mu_{\textbf{{i}}k}\right). Conditioning on the latent sources often induces conditional independencies that, in turn, facilitate closed-form, efficient, and parallelizable posterior inference. The first step in either MCMC or variational inference is therefore to update each latent source from its complete conditional, which is multinomial :

where the normalization of the non-negative rates μi1,,μiK\mu_{\textbf{{i}}1},\ldots,\mu_{\textbf{{i}}K} into a probability vector is left implicit. When the observed count is zero—i.e., yi ⁣= ⁣0y_{\textbf{{i}}}\!=\!0—the sources are also zero—i.e., yik=a.s.0y_{\textbf{{i}}k}\stackrel{{\scriptstyle\textrm{a.s.}}}{{=}}0—and no computation is required to update them. As a result, any Poisson factorization model that admits a latent source representation scales linearly with only the non-zero entries. This property is indispensable when modeling count tensors which typically contain exponentially more zeros than non-zeros . We emphasize that although the PRGDS and PGDS are substantively different models, they are both instances of allocative Poisson factorization, so the time complexity of posterior inference for both models is the same and equal to O(SK)\mathcal{O}\left(SK\right) where SS is the number of non-zero entries.

Because a latent source representation is only available when the rate μi\mu_{\textbf{{i}}} is a linear function of the model parameters and, by definition of the Poisson distribution, the rate must be non-negative, efficient Poisson factorization is only possible with non-negative priors. Modeling time series and other complex dependence structures via efficient Poisson factorization therefore requires developing novel motifs that exclude the Gaussian priors that researchers have traditionally relied on for analytic convenience and tractability. For example, the Poisson linear dynamical system links the widely used Gaussian linear dynamical system to Poisson observations via an exponential link function—i.e., μi=exp(k)\mu_{\textbf{{i}}}=\exp\left({\sum_{k}\cdots}\right). This approach, which is based on the generalized linear model , relies on a non-linear link function and therefore does not admit a latent source representation. Another approach is to use log-normal priors, as in dynamic Poisson factorization ; however, the log-normal is not conjugate to the Poisson distribution and does not yield closed-form conditionals.

There is also a long tradition of autoregressive models for time series of counts, including variational autoregressive models and models that are based on the Hawkes process . This approach avoids the challenge of constructing tractable state-space models from non-negative priors by modeling temporal correlations directly between the observed counts. However, for high-dimensional data, such as sequentially observed count tensors, an autoregressive approach is often impractical.

Posterior inference

Iteratively re-sampling each latent variable in the PRGDS from its complete conditional constitutes a Gibbs sampling algorithm. The complete conditionals for all variables are immediately available in closed form without data augmentation. We provide conditionals for the variables with non-standard priors below; the remaining conditionals are in the supplementary material. The PRGDS is based on a new motif in Bayesian latent variable modeling. We introduce the motif in its general form, derive its conditionals, and then use these to obtain the closed-form complete conditionals for the PRGDS.

Consider the following model of count mm involving variables θ\theta and hh and fixed c1,c2,c3,ϵ0(θ)>0c_{1},c_{2},c_{3},\epsilon_{0}^{{(\theta)}}>0:

This model is semi-conjugate. The gamma prior over θ\theta is conjugate to the Poisson and its posterior is

The Poisson prior over hh is not conjugate to the gamma; however, despite this, the posterior of hh is still available in closed form by way of the Bessel distribution , which we define in Fig. 3(a):

The Bessel distribution can be sampled efficiently ; our Cython implementation is available online.https://github.com/aschein/PRGDS Provided that ϵ0(θ) ⁣> ⁣0\epsilon_{0}^{{(\theta)}}\!>\!0, sampling θ\theta and hh iteratively from Eqs. 13 and 14 constitutes a valid Markov chain for posterior inference. When ϵ0(θ) ⁣= ⁣0\epsilon_{0}^{{(\theta)}}\!=\!0, though, θ=a.s.0\theta\stackrel{{\scriptstyle\textrm{a.s.}}}{{=}}0 if h ⁣= ⁣0h\!=\!0, and vice versa. As a result, this Markov chain has an absorbing condition at h ⁣= ⁣0h\!=\!0 and violates detailed balance. In this case, we must therefore sample hh with θ\theta marginalized out. Toward that end, we prove Theorem 1.

Theorem 1: The incomplete conditional P(hϵ0(θ) ⁣= ⁣0,\θ)P(h,θϵ0(θ) ⁣= ⁣0,)dθP(h\,|\,{\epsilon_{0}^{{(\theta)}}\!=\!0,-\backslash}\theta)\triangleq\int P(h,\theta\,|\,\epsilon_{0}^{{(\theta)}}\!=\!0,-)\,\mathbf{d}\theta is

where SCH denotes the shifted confluent hypergeometric distribution. We describe the SCH in Fig. 3(b) and provide further information in the supplementary material, including the derivation of its PMF, PGF, and mode, along with details of how we sample from it and the proof for Theorem 1.

2 Closed-form complete conditionals for the PRGDS

The PRGDS admits a latent source representation, so the first step of posterior inference is therefore

We may similarly represent hk(t)h_{k}^{{(t)}} under its latent source representation—i.e., hk(t) ⁣ ⁣hk(t) ⁣= ⁣k2=1Khkk2(t)h_{k}^{{(t)}}\!\equiv\!h_{k\boldsymbol{\cdot}}^{{(t)}}\!=\!\sum_{k_{2}=1}^{K}h_{kk_{2}}^{{(t)}}, where hkk2(t) ⁣ ⁣Pois(τπkk2θk2(t ⁣ ⁣1))h_{kk_{2}}^{{(t)}}\!\sim\!\textrm{Pois}\left(\tau\,\pi_{kk_{2}}\theta_{k_{2}}^{{(t\!-\!1)}}\right). When notationally convenient, we use dot-notation (“\boldsymbol{\cdot}”) to denote summing over a mode. In this case, hk(t)h_{k\boldsymbol{\cdot}}^{{(t)}} denotes the sum of the kthk^{\textrm{th}} row of the K ⁣× ⁣KK\!\times\!K matrix of latent counts hkk2(t)h_{kk_{2}}^{{(t)}}. The complete conditional of the kthk^{\textrm{th}} row of counts, when conditioned on their sum hk(t)h_{k\boldsymbol{\cdot}}^{{(t)}}, is

To derive the conditional for θk(t)\theta_{k}^{{(t)}} we aggregate the Poisson variables that depend on it. By Poisson additivity, the column sum hk(t ⁣+ ⁣1) ⁣= ⁣k1=1Khk1k(t ⁣+ ⁣1)h_{\boldsymbol{\cdot}k}^{{(t\!+\!1)}}\!=\!\sum_{k_{1}=1}^{K}h_{k_{1}k}^{{(t\!+\!1)}} is distributed as hk(t ⁣+ ⁣1) ⁣ ⁣Pois(θk(t)τπk)h_{\boldsymbol{\cdot}k}^{{(t\!+\!1)}}\!\sim\!\textrm{Pois}\left(\theta_{k}^{{(t)}}\,\tau\,\pi_{\boldsymbol{\cdot}k}\right) and similarly yk(t)y_{\boldsymbol{\cdot}k}^{{(t)}} is distributed as y_{\boldsymbol{\cdot}k}^{{(t)}}\!\sim\!\textrm{Pois}\big{(}\theta_{k}^{{(t)}}\rho^{{(t)}}\lambda_{k}\prod_{m=1}^{M}\phi^{{(m)}}_{k\boldsymbol{\cdot}}\big{)}. The count mk(t)hk(t ⁣+ ⁣1) ⁣+ ⁣yk(t)m^{{(t)}}_{k}\triangleq h_{\boldsymbol{\cdot}k}^{{(t\!+\!1)}}\!+\!y^{{(t)}}_{k} isolates all dependence on θk(t)\theta_{k}^{{(t)}} and is also Poisson distributed. By gamma–Poisson conjugacy, the conditional of θk(t)\theta_{k}^{{(t)}} is

When ϵ0(θ)>0\epsilon_{0}^{{(\theta)}}>0, we apply the identity in Eq. 14 and sample hk(t)h_{k\boldsymbol{\cdot}}^{{(t)}} from its complete conditional:

When ϵ0(θ) ⁣= ⁣0\epsilon_{0}^{{(\theta)}}\!=\!0, we instead apply Theorem 1 to sample hk(t)h_{k\boldsymbol{\cdot}}^{{(t)}}, where mk(t)m^{{(t)}}_{k} is analogous to mm in Eq. 15:

The complete conditionals for λk\lambda_{k} and gkg_{k} follow from applying the same Poisson–gamma–Poisson identities, while the complete conditionals for γ\gamma, β\beta, ϕk(m)\boldsymbol{\phi}^{{(m)}}_{k}, πk\boldsymbol{\pi}_{k}, and τ\tau all follow from conjugacy.

Empirical studies

As explained in the previous section, the Poisson–gamma–Poisson motif of the PRGDS (see § 4.1) yields a more tractable (see Fig. 1) and flexible (see § 3) model than previous models. This motif also encodes a unique inductive bias tailored to sparsity and burstiness that we test by comparing the PRGDS to the PGDS (described in § 3). As we can see by comparing Eqs. 9 and 10, comparing these models isolates the impact of the Poisson–gamma–Poisson motif. Because the PGDS was previously introduced to model a T ⁣× ⁣VT\!\times\!V matrix YY of sequentially observed VV-dimensional count vectors y(1),,y(T)\boldsymbol{y}^{{(1)}},\dots,\boldsymbol{y}^{{(T)}}, we generalize the PGDS to MM-mode tensors and provide derivations of its complete conditionals in the supplementary material. Our Cython implementation of this generalized PGDS (and the PRGDS) is available online. We also compare the variant of the PRGDS with ϵ0(θ) ⁣= ⁣1\epsilon_{0}^{{(\theta)}}\!=\!1 to the variant with ϵ0(θ) ⁣= ⁣0\epsilon_{0}^{{(\theta)}}\!=\!0, which allows the continuous gamma latent states to take values of exactly zero.

Setup. Our empirical studies all have the following setup. For each data set Y(1),,Y(T)\boldsymbol{Y}^{{(1)}},\dots,\boldsymbol{Y}^{{(T)}}, the counts Y(t)\boldsymbol{Y}^{{(t)}} in randomly selected time steps are held out. Additionally, the counts in the last two time steps are always held out. Each model is fit to the data set using independent MCMC chains that impute the heldout counts and, ultimately, return a set of posterior samples of the latent variables. We distinguish the task of predicting the counts in intermediate time steps, known as smoothing, from the task of predicting the counts in the last two time steps, known as forecasting. To quantify the performance of each model, we use the SS posterior samples returned by the independent chains to approximate the information rate of the heldout counts—i.e., R(\Delta)=-\frac{1}{|\Delta|}\sum_{(t,\textbf{{i}})\in\Delta}\log\Big{[}\frac{1}{S}\sum_{s=1}^{S}\textrm{Pois}\big{(}y^{{(t)}}_{\textbf{{i}}};\mu_{\textbf{{i}},s}^{{(t)}}\big{)}\Big{]}, where Δ\Delta is the set of multi-indices of the heldout counts and μi,s(t)\mu_{\textbf{{i}},s}^{{(t)}} is the expectation of heldout count yi(t)y^{{(t)}}_{\textbf{{i}}} (defined in Eq. 1) computed from the sths^{\textrm{th}} posterior sample. The information rate quantifies the average number of nats needed to compress each heldout count; it is equivalent to log perplexity and to the negative of log pointwise predictive density (LPPD) . In each study, we also fit Bayesian Poisson tensor factorization (BPTF) , a non-dynamic baseline that assumes that the count tensors at different time steps are i.i.d.—i.e., yi(t) ⁣ ⁣Pois(μi)y_{\textbf{{i}}}^{{(t)}}\!\sim\!\textrm{Pois}\left(\mu_{\textbf{{i}}}\right). For each model, we then report the information gain over BPTF, where higher values are better, which we compute by subtracting the information rate of the model from that of BPTF.

Matrices. We first replicated the empirical studies of Schein et al. . These studies followed the setup described above and compared the PGDS to GP-DPFA , a simple dynamic baseline (described in § 3). The matrices in these studies were based on three text data sets—NeurIPS papers , DBLP abstracts , and State of the Union (SOTU) speeches —where yv(t)y^{{(t)}}_{v} is the number of times word vv occurs in time step tt, and two international event data sets—GDELT and ICEWS —where yv(t)y^{{(t)}}_{v} is the number of times sender–receiver pair vv interacted during time step tt. We used the matrices and heldout time steps, along with the posterior samples for both PGDS and GP-DPFA, originally obtained by Schein et al. . We then fit the PRGDS using the MCMC settings that they describe. In this matrix setting, BPTF reduces to yv(t) ⁣ ⁣Pois(μv)y^{{(t)}}_{v}\!\sim\!\textrm{Pois}(\mu_{v}), where vv indexes a single mode, and μv\mu_{v} cannot be meaningfully factorized. We therefore posited a conjugate gamma prior over μv\mu_{v} directly and drew exact posterior samples to compute the information rate. We depict the results in Fig. 4(a).

Tensors. We used two international event data sets—GDELT and ICEWS—where yiaj(t)y_{{i\xrightarrow{a}j}}^{{(t)}} is the number of times country ii took action aa toward country jj during time step tt. Each data set consists of a sequence of count tensors, each of which contains the V×V×AV\times V\times A event counts for that time step, where V ⁣= ⁣249V\!=\!249 countries and A ⁣= ⁣20A\!=\!20 action types. For both data sets, we used months as time steps. For GDELT, we considered the date range 2003–2008, yielding T ⁣= ⁣72T\!=\!72; for ICEWS, we considered the date range 1995–2013, yielding T ⁣= ⁣228T\!=\!228. We also used a data set of multi-neuronal spike train recordings of macaque monkey motor cortexes . In this data set, a count yij(t)y^{{(t)}}_{ij} is the number of times neuron ii spiked in trial jj during time step tt. These counts form a sequence of N ⁣× ⁣VN\!\times\!V matrices, where N ⁣= ⁣100N\!=\!100 is the number of neurons and V ⁣= ⁣1,716V\!=\!1,716 is the number of trials. We used 20-millisecond intervals as time steps, yielding T ⁣= ⁣162T\!=\!162. For each data set, we created three random masks, each corresponding to six heldout time steps in the range [2,T ⁣ ⁣2][2,T\!-\!2]. We fit each model to each data set and mask using two independent chains of 4,000 MCMC iterations, saving every 50th50^{\textrm{th}} posterior sample after the first 1,000 iterations to compute the information rate. We also fit BPTF using variational inference as described by Schein et al. , and then sampled from the fitted variational posterior to compute the information rate. Following Schein et al. , we set K ⁣= ⁣100K\!=\!100 for all models. We depict the results in Fig. 4(b), where the error bars reflect variability across the random masks.

Quantitative results. In all sixteen studies, the dynamic models outperform BPTF. In all but one study, the PGDS and a sparse variant of the PRGDS (i.e., ϵ0(θ) ⁣= ⁣0\epsilon_{0}^{{(\theta)}}\!=\!0) outperform the other models. For smoothing, the PRGDS performs better than or similarly to the PGDS. In five of the eight smoothing studies, the sparse variant of the PRGDS obtains a higher information gain than the PGDS; in the remaining three smoothing studies, there is no discernible difference between the models. For forecasting, we find the converse relationship. In four of the eight forecasting studies, the PGDS obtains a higher information gain than the PGDS; in the remaining forecasting studies, there is no discernible difference. In all studies, the sparse variant of the PRGDS obtains better smoothing and forecasting performance than the non-sparse variant (i.e., ϵ0(θ) ⁣= ⁣1\epsilon_{0}^{{(\theta)}}\!=\!1). We conjecture that the better performance of the sparse variant can be explained by the form of the marginal expectation of θ(t)\boldsymbol{\theta}^{{(t)}} (see Eq. 8). When ϵ0(θ) ⁣> ⁣0\epsilon_{0}^{{(\theta)}}\!>\!0 this expectation includes an additive term that grows as more time steps are forecast. When ϵ0(θ) ⁣= ⁣0\epsilon_{0}^{{(\theta)}}\!=\!0, this term disappears and the expectation matches that of the PGDS (see Eq. 10).

Qualitative analysis. We also performed a qualitative comparison of the latent structures inferred by the different models and found that the sparse variant of the PRGDS inferred some components that the other models did not. Specifically, the sparse variant of the PRGDS is uniquely capable of inferring bursty latent structures that are highly localized in time; we visualize examples in Fig. 5. To compare the latent structures inferred by the PGDS and the PRGDS, we aligned the models’ inferred components using the Hungarian bipartite matching algorithm applied to the models’ continuous gamma latent states. The kthk^{\textrm{th}} component’s activation vector θk ⁣= ⁣(θk(1),,θk(T))\boldsymbol{\theta}_{k}\!=\!(\theta^{{(1)}}_{k},\dots,\theta^{{(T)}}_{k}) constitutes a signature of that component’s activity; these signatures are sufficiently unique to facilitate alignment. In the supplementary material, we provide four components that are well aligned across the models. In Fig. 5(a), we visualize two components inferred by the sparse variant of the PRGDS; one of these components (blue) was also inferred by the other models, while the other component (red) was not.

Conclusion

We presented the Poisson-randomized gamma dynamical system (PRGDS), a tractable, expressive, and efficient model for sequentially observed count tensors. The PRGDS is based on a new modeling motif, an alternating chain of discrete Poisson and continuous gamma latent states that yields closed-form complete conditionals for all variables. We found that a sparse variant of the PRGDS, which allows the continuous gamma latent states to take values of exactly zero, often obtains better predictive performance than other models and infers latent structures that are highly localized in time.

Acknowledgments We thank Saurabh Vyas, Alex Williams, and Krishna Shenoy for kindly providing us with the macaque monkey motor cortex data set and their corresponding preprocessing code. SWL was supported by the Simons Collaboration on the Global Brain (SCGB 418011). MZ was supported by NSF IIS-1812699. DMB was supported by ONR N00014-17-1-2131, ONR N00014-15-1-2209, NIH 1U01MH115727-01, NSF CCF-1740833, DARPA SD2 FA8750-18-C-0130, IBM, 2Sigma, Amazon, NVIDIA, and the Simons Foundation.

References