Poisson-Randomized Gamma Dynamical Systems
Aaron Schein, Scott W. Linderman, Mingyuan Zhou, David M. Blei, Hanna Wallach
Introduction
Political scientists routinely analyze event counts of the number of times country took action toward country during time step . Such data can be represented as a sequence of count tensors each of which contains the event counts for that time step for every combination of sender countries, receivers, and action types. International event data sets exhibit “complex dependence structures” like coalitions of countries and bursty temporal dynamics. These dependence structures violate the independence assumptions of the regression-based methods that political scientists have traditionally used to test theories of international relations . Political scientists have therefore advocated for using latent variable models to infer unobserved structures as a way of controlling for them . This approach motivates interpretable yet expressive models that are capable of capturing a variety of complex dependence structures. Recent work has applied tensor factorization methods to international event data sets to infer coalition structures among countries and topic structures among actions; however, these methods assume that the sequentially observed count tensors are exchangeable, thereby failing to capture the bursty temporal dynamics inherent to such data sets.
Sequentially observed count tensors present unique statistical challenges because they tend to be bursty , high-dimensional, and sparse . There are few models that are tailored to the challenging properties of both time series and count tensors. In recent years, Poisson factorization has emerged as a framework for modeling count matrices and tensors . Although factorization methods generally scale with the size of the matrix or tensor, many Poisson factorization models yield inference algorithms that scale linearly with the number of non-zero entries. This property allows researchers to efficiently infer latent structures from massive tensors, provided these tensors are sparse; however, this property is unique to a subset of Poisson factorization models that only posit non-negative prior distributions, which are difficult to chain in state-space models for time series. Hierarchical compositions of non-negative priors—notably, gamma and Dirichlet distributions—typically introduce non-conjugate dependencies that require innovative approaches to posterior inference.
This paper fills a gap in the literature between Poisson factorization models that are tractable—i.e., yielding closed-form complete conditionals that make inference algorithms easy to derive—and those that are expressive—i.e., capable of capturing a variety of complex dependence structures. To do so, we introduce an alternating chain of discrete Poisson and continuous gamma latent states, a new modeling motif that is analytically convenient and computationally tractable. We rely on this motif to construct the Poisson-randomized gamma dynamical system (PRGDS), a model for sequentially observed count tensors that is tractable, expressive, and efficient. The PRGDS is closely related to the Poisson–gamma dynamical system (PGDS) , a recently introduced model for dynamic count matrices, that is based on non-conjugate chains of gamma states. These chains are intractable; thus, posterior inference in the PGDS relies on sophisticated data augmentation schemes that are cumbersome to derive and impose unnatural restrictions on the priors over other variables. In contrast, the PRGDS introduces intermediate Poisson states that break the intractable dependencies between the gamma states (see Fig. 1). Although this motif is only semi-conjugate, it is tractable, yielding closed-form complete conditionals for the Poisson states by way of the little-known Bessel distribution and a novel discrete distribution that we derive and call the shifted confluent hypergeometric (SCH) distribution.
We study the inductive bias of the PRGDS by comparing its smoothing and forecasting performance to that of the PGDS and two other baselines on a range of real-world count data sets of text, international events, and neural spike data. For smoothing, we find that the PRGDS performs better than or similarly to the PGDS; for forecasting, we find the converse relationship. Both models outperform the other baselines. Using a specific hyperparameter setting, the PRGDS permits the continuous gamma latent states to take values of exactly zero, thereby encoding a unique inductive bias tailored to sparsity and burstiness. We find that this sparse variant always obtains better smoothing and forecasting performance than the non-sparse variant. We also find that this sparse variant yields a qualitatively broader range of latent structures—specifically, bursty latent structures that are highly localized in time.
Poisson-randomized gamma dynamical systems (PRGDS)
Generative process. The PRGDS is a form of canonical polyadic decomposition that assumes
where represents the activation of the component at time step . Each component represents a dependence structure in the data set by way of a factor vector for each mode . For international events, the first factor vector represents the rate at which each of the countries acts as a sender in the component while the second factor vector represents the rate at which each country acts as a receiver. The weights and represent the scales of component and time step . The PRGDS is stationary if . We posit the following conjugate priors:
The PRGDS is characterized by an alternating chain of discrete and continuous latent states. The continuous states evolve via the intermediate discrete states as follows:
where we define to be the per-component weight from Eq. 1. In other words, the PRGDS assumes that is conditionally gamma distributed with rate and shape equal to plus hyperparameter . We adopt the convention that a gamma random variable will be zero, almost surely, if its shape is zero. Therefore, setting defines a sparse variant of the PRGDS, where the gamma latent state takes the value of exactly zero provided —i.e., if .
The transition weight in Eq. 3 represents how strongly component excites component at the next time step. We view these weights collectively as a transition matrix and impose Dirichlet priors over the columns of this matrix. We also place a gamma prior over concentration parameter . This prior is conjugate to the gamma and Poisson distributions in which it appears:
For the per-component weights , we use a hierarchical prior with a similar flavor to Eq. 3:
where is analogous to . Finally, we use the following gamma priors, which are both conjugate:
The PRGDS has five fixed hyperparameters: , , , , and . For the empirical studies in § 5, we set to define weakly informative gamma and Dirichlet priors and set to define a gamma prior that promotes values close to 1; we consider and set .
Properties. In Eq. 5, both and are divided by the number of components . This means that as the number of components grows , the expected sum of the weights remains finite and fixed:
This prior encodes an inductive bias toward small values of and may be interpreted as the finite truncation of a novel Bayesian nonparametric process. A small value of shrinks the Poisson rates of both and the first discrete latent state . As a result, this prior encourages the PRGDS to only infer components that are both predictive of the data and useful for capturing the temporal dynamics.
The marginal expectation of takes the form of a linear dynamical system:
Finally, we can analytically marginalize out all of the discrete Poisson latent states to obtain a purely continuous dynamical system. When , this dynamical system can be written as follows:
where RG1 denotes the randomized gamma distribution of the first type . When , the dynamical system can be written in terms of a limiting form of the RG1. We describe the RG1 in Fig. 2.
Related work
The PRGDS is closely related to the Poisson–gamma dynamical system (PGDS) . In the PGDS,
The PGDS imposes non-conjugate dependencies directly between the gamma latent states. The complete conditional is not available in closed form, and posterior inference relies on a sophisticated data augmentation scheme. The PRGDS instead introduces intermediate Poisson states that break the intractable dependencies between the gamma states; we visualize this in Fig. 1. Although the Poisson distribution is not a conjugate prior for the gamma rate, this motif is still tractable, yielding the complete conditional in closed form, as we explain in § 4. The PGDS is limited by the data augmentation scheme that it relies on for posterior inference—specifically, this augmentation scheme does not allow to appear in the Poisson rate of in Eq. 1. To encourage parsimony, the PGDS instead draws and then uses these per-component weights to shrink the transition matrix . This approach introduces additional intractable dependencies that require a different data augmentation scheme for posterior inference. Finally, the data augmentation schemes additionally require that each factor vector and each column of the transition matrix are Dirichlet distributed. We note that although we also use Dirichlet distributions in this paper, this is a choice rather than a requirement imposed by the PRGDS.
The PGDS and its “deep” variants generalize gamma process dynamic Poisson factor analysis (GP-DPFA) , which assumes a simple random walk ; the model of Yang and Koeppl is also closely related . These models belong to a line of work exploring the “augment-and-conquer” data augmentation scheme for posterior inference in hierarchies of gamma variables chained via their shapes and linked to Poisson observations. Beyond models for time series, this motif can be used to build belief networks . An alternative approach is to chain gamma variables via their rates—e.g., . This motif is conjugate and tractable, and has been applied to models for time series and deep belief networks . However, unlike the shape, the rate contributes to the variance of the gamma quadratically. Rate chains can therefore be highly volatile.
More broadly, gamma shape and rate chains are examples of non-negative chains. Such chains are especially well motivated in the context of Poisson factorization, which is particularly efficient when only non-negative prior distributions are used. In general, Poisson factorization assumes that each observed count is drawn from a Poisson distribution with a latent rate that is some function of the model parameters—i.e., . When the rate is linear—i.e., —Poisson factorization is allocative and admits a latent source representation , where is defined to be the sum of latent sources and . Conditioning on the latent sources often induces conditional independencies that, in turn, facilitate closed-form, efficient, and parallelizable posterior inference. The first step in either MCMC or variational inference is therefore to update each latent source from its complete conditional, which is multinomial :
where the normalization of the non-negative rates into a probability vector is left implicit. When the observed count is zero—i.e., —the sources are also zero—i.e., —and no computation is required to update them. As a result, any Poisson factorization model that admits a latent source representation scales linearly with only the non-zero entries. This property is indispensable when modeling count tensors which typically contain exponentially more zeros than non-zeros . We emphasize that although the PRGDS and PGDS are substantively different models, they are both instances of allocative Poisson factorization, so the time complexity of posterior inference for both models is the same and equal to where is the number of non-zero entries.
Because a latent source representation is only available when the rate is a linear function of the model parameters and, by definition of the Poisson distribution, the rate must be non-negative, efficient Poisson factorization is only possible with non-negative priors. Modeling time series and other complex dependence structures via efficient Poisson factorization therefore requires developing novel motifs that exclude the Gaussian priors that researchers have traditionally relied on for analytic convenience and tractability. For example, the Poisson linear dynamical system links the widely used Gaussian linear dynamical system to Poisson observations via an exponential link function—i.e., . This approach, which is based on the generalized linear model , relies on a non-linear link function and therefore does not admit a latent source representation. Another approach is to use log-normal priors, as in dynamic Poisson factorization ; however, the log-normal is not conjugate to the Poisson distribution and does not yield closed-form conditionals.
There is also a long tradition of autoregressive models for time series of counts, including variational autoregressive models and models that are based on the Hawkes process . This approach avoids the challenge of constructing tractable state-space models from non-negative priors by modeling temporal correlations directly between the observed counts. However, for high-dimensional data, such as sequentially observed count tensors, an autoregressive approach is often impractical.
Posterior inference
Iteratively re-sampling each latent variable in the PRGDS from its complete conditional constitutes a Gibbs sampling algorithm. The complete conditionals for all variables are immediately available in closed form without data augmentation. We provide conditionals for the variables with non-standard priors below; the remaining conditionals are in the supplementary material. The PRGDS is based on a new motif in Bayesian latent variable modeling. We introduce the motif in its general form, derive its conditionals, and then use these to obtain the closed-form complete conditionals for the PRGDS.
Consider the following model of count involving variables and and fixed :
This model is semi-conjugate. The gamma prior over is conjugate to the Poisson and its posterior is
The Poisson prior over is not conjugate to the gamma; however, despite this, the posterior of is still available in closed form by way of the Bessel distribution , which we define in Fig. 3(a):
The Bessel distribution can be sampled efficiently ; our Cython implementation is available online.https://github.com/aschein/PRGDS Provided that , sampling and iteratively from Eqs. 13 and 14 constitutes a valid Markov chain for posterior inference. When , though, if , and vice versa. As a result, this Markov chain has an absorbing condition at and violates detailed balance. In this case, we must therefore sample with marginalized out. Toward that end, we prove Theorem 1.
Theorem 1: The incomplete conditional is
where SCH denotes the shifted confluent hypergeometric distribution. We describe the SCH in Fig. 3(b) and provide further information in the supplementary material, including the derivation of its PMF, PGF, and mode, along with details of how we sample from it and the proof for Theorem 1.
2 Closed-form complete conditionals for the PRGDS
The PRGDS admits a latent source representation, so the first step of posterior inference is therefore
We may similarly represent under its latent source representation—i.e., , where . When notationally convenient, we use dot-notation (“”) to denote summing over a mode. In this case, denotes the sum of the row of the matrix of latent counts . The complete conditional of the row of counts, when conditioned on their sum , is
To derive the conditional for we aggregate the Poisson variables that depend on it. By Poisson additivity, the column sum is distributed as and similarly is distributed as y_{\boldsymbol{\cdot}k}^{{(t)}}\!\sim\!\textrm{Pois}\big{(}\theta_{k}^{{(t)}}\rho^{{(t)}}\lambda_{k}\prod_{m=1}^{M}\phi^{{(m)}}_{k\boldsymbol{\cdot}}\big{)}. The count isolates all dependence on and is also Poisson distributed. By gamma–Poisson conjugacy, the conditional of is
When , we apply the identity in Eq. 14 and sample from its complete conditional:
When , we instead apply Theorem 1 to sample , where is analogous to in Eq. 15:
The complete conditionals for and follow from applying the same Poisson–gamma–Poisson identities, while the complete conditionals for , , , , and all follow from conjugacy.
Empirical studies
As explained in the previous section, the Poisson–gamma–Poisson motif of the PRGDS (see § 4.1) yields a more tractable (see Fig. 1) and flexible (see § 3) model than previous models. This motif also encodes a unique inductive bias tailored to sparsity and burstiness that we test by comparing the PRGDS to the PGDS (described in § 3). As we can see by comparing Eqs. 9 and 10, comparing these models isolates the impact of the Poisson–gamma–Poisson motif. Because the PGDS was previously introduced to model a matrix of sequentially observed -dimensional count vectors , we generalize the PGDS to -mode tensors and provide derivations of its complete conditionals in the supplementary material. Our Cython implementation of this generalized PGDS (and the PRGDS) is available online. We also compare the variant of the PRGDS with to the variant with , which allows the continuous gamma latent states to take values of exactly zero.
Setup. Our empirical studies all have the following setup. For each data set , the counts in randomly selected time steps are held out. Additionally, the counts in the last two time steps are always held out. Each model is fit to the data set using independent MCMC chains that impute the heldout counts and, ultimately, return a set of posterior samples of the latent variables. We distinguish the task of predicting the counts in intermediate time steps, known as smoothing, from the task of predicting the counts in the last two time steps, known as forecasting. To quantify the performance of each model, we use the posterior samples returned by the independent chains to approximate the information rate of the heldout counts—i.e., R(\Delta)=-\frac{1}{|\Delta|}\sum_{(t,\textbf{{i}})\in\Delta}\log\Big{[}\frac{1}{S}\sum_{s=1}^{S}\textrm{Pois}\big{(}y^{{(t)}}_{\textbf{{i}}};\mu_{\textbf{{i}},s}^{{(t)}}\big{)}\Big{]}, where is the set of multi-indices of the heldout counts and is the expectation of heldout count (defined in Eq. 1) computed from the posterior sample. The information rate quantifies the average number of nats needed to compress each heldout count; it is equivalent to log perplexity and to the negative of log pointwise predictive density (LPPD) . In each study, we also fit Bayesian Poisson tensor factorization (BPTF) , a non-dynamic baseline that assumes that the count tensors at different time steps are i.i.d.—i.e., . For each model, we then report the information gain over BPTF, where higher values are better, which we compute by subtracting the information rate of the model from that of BPTF.
Matrices. We first replicated the empirical studies of Schein et al. . These studies followed the setup described above and compared the PGDS to GP-DPFA , a simple dynamic baseline (described in § 3). The matrices in these studies were based on three text data sets—NeurIPS papers , DBLP abstracts , and State of the Union (SOTU) speeches —where is the number of times word occurs in time step , and two international event data sets—GDELT and ICEWS —where is the number of times sender–receiver pair interacted during time step . We used the matrices and heldout time steps, along with the posterior samples for both PGDS and GP-DPFA, originally obtained by Schein et al. . We then fit the PRGDS using the MCMC settings that they describe. In this matrix setting, BPTF reduces to , where indexes a single mode, and cannot be meaningfully factorized. We therefore posited a conjugate gamma prior over directly and drew exact posterior samples to compute the information rate. We depict the results in Fig. 4(a).
Tensors. We used two international event data sets—GDELT and ICEWS—where is the number of times country took action toward country during time step . Each data set consists of a sequence of count tensors, each of which contains the event counts for that time step, where countries and action types. For both data sets, we used months as time steps. For GDELT, we considered the date range 2003–2008, yielding ; for ICEWS, we considered the date range 1995–2013, yielding . We also used a data set of multi-neuronal spike train recordings of macaque monkey motor cortexes . In this data set, a count is the number of times neuron spiked in trial during time step . These counts form a sequence of matrices, where is the number of neurons and is the number of trials. We used 20-millisecond intervals as time steps, yielding . For each data set, we created three random masks, each corresponding to six heldout time steps in the range . We fit each model to each data set and mask using two independent chains of 4,000 MCMC iterations, saving every posterior sample after the first 1,000 iterations to compute the information rate. We also fit BPTF using variational inference as described by Schein et al. , and then sampled from the fitted variational posterior to compute the information rate. Following Schein et al. , we set for all models. We depict the results in Fig. 4(b), where the error bars reflect variability across the random masks.
Quantitative results. In all sixteen studies, the dynamic models outperform BPTF. In all but one study, the PGDS and a sparse variant of the PRGDS (i.e., ) outperform the other models. For smoothing, the PRGDS performs better than or similarly to the PGDS. In five of the eight smoothing studies, the sparse variant of the PRGDS obtains a higher information gain than the PGDS; in the remaining three smoothing studies, there is no discernible difference between the models. For forecasting, we find the converse relationship. In four of the eight forecasting studies, the PGDS obtains a higher information gain than the PGDS; in the remaining forecasting studies, there is no discernible difference. In all studies, the sparse variant of the PRGDS obtains better smoothing and forecasting performance than the non-sparse variant (i.e., ). We conjecture that the better performance of the sparse variant can be explained by the form of the marginal expectation of (see Eq. 8). When this expectation includes an additive term that grows as more time steps are forecast. When , this term disappears and the expectation matches that of the PGDS (see Eq. 10).
Qualitative analysis. We also performed a qualitative comparison of the latent structures inferred by the different models and found that the sparse variant of the PRGDS inferred some components that the other models did not. Specifically, the sparse variant of the PRGDS is uniquely capable of inferring bursty latent structures that are highly localized in time; we visualize examples in Fig. 5. To compare the latent structures inferred by the PGDS and the PRGDS, we aligned the models’ inferred components using the Hungarian bipartite matching algorithm applied to the models’ continuous gamma latent states. The component’s activation vector constitutes a signature of that component’s activity; these signatures are sufficiently unique to facilitate alignment. In the supplementary material, we provide four components that are well aligned across the models. In Fig. 5(a), we visualize two components inferred by the sparse variant of the PRGDS; one of these components (blue) was also inferred by the other models, while the other component (red) was not.
Conclusion
We presented the Poisson-randomized gamma dynamical system (PRGDS), a tractable, expressive, and efficient model for sequentially observed count tensors. The PRGDS is based on a new modeling motif, an alternating chain of discrete Poisson and continuous gamma latent states that yields closed-form complete conditionals for all variables. We found that a sparse variant of the PRGDS, which allows the continuous gamma latent states to take values of exactly zero, often obtains better predictive performance than other models and infers latent structures that are highly localized in time.
Acknowledgments We thank Saurabh Vyas, Alex Williams, and Krishna Shenoy for kindly providing us with the macaque monkey motor cortex data set and their corresponding preprocessing code. SWL was supported by the Simons Collaboration on the Global Brain (SCGB 418011). MZ was supported by NSF IIS-1812699. DMB was supported by ONR N00014-17-1-2131, ONR N00014-15-1-2209, NIH 1U01MH115727-01, NSF CCF-1740833, DARPA SD2 FA8750-18-C-0130, IBM, 2Sigma, Amazon, NVIDIA, and the Simons Foundation.