Discrete Event, Continuous Time RNNs

Michael C. Mozer, Denis Kazakov, Robert V. Lindsey

Introduction

Many classic data sources in machine learning can be characterized as sequences. For example, natural language text is a progression of words; videos consist of a series of still images; and spoken utterances are represented as sampled power spectra. In such sequences, observations are ordered but there is no timing information. In contrast to these ordinal sequences, event sequences consist of observations stamped with a continuous-valued, absolute or relative time of occurrence. Examples of event sequences include online product purchases, criminal activity in a police blotter, web forum postings, an individual’s restaurant reservations, file accesses, outgoing phone calls or text messages, sent emails, player log-ins to gaming sites, and music selections. In each of these examples, discrete events occur in continuous time and not necessarily at uniform intervals.

In this article, we focus on recurrent neural net (RNN) approaches to processing event sequences. We consider standard tasks for event sequences that include classification, prediction of the next event given the time lag from the previous event, and prediction of the time lag to the next event.

Existing sequence learning methods for recurrent nets, e.g., LSTM (Hochreiter and Schmidhuber, 1997) and GRU (Chung et al., 2014) architectures, are not designed for event sequences but might be extended to handle them in various ways. First, time stamps might simply be ignored (e.g., Wu et al., 2016). Second, time might be discretized, allowing an event-based sequence to be transformed into a sequence sampled at fixed clock intervals (e.g., Hidasi et al., 2015; Song et al., 2016; Wang et al., 2016a; Wu et al., 2017). Third, time stamps might be used as additional input features (e.g., Choi et al., 2016; Du et al., 2016). We explore the hypothesis that time should be handled in a more specialized manner.

To explain what we mean by ‘specialized,’ consider the deep learning architecture universally used for vision—the convolutional network (Fukushima, 1980; Lecun et al., 1998; Mozer, 1987). Convolutional nets are successful because they incorporate three forms of inductive bias: (1) spatial locality—features at nearby locations in an image are more likely to have joint causes and consequences than more distant features; (2) spatial position homogeneity—features deemed significant in one region of an image are likely to be significant in other regions; and (3) spatial scale homogeneity—spatial locality and position homogeneity should apply across a range of spatial scales.

Architectures for event sequences should benefit from isomorphic forms of inductive bias, specifically: (1) temporal locality—events closer in time are more likely to have joint causes and consequences than more distant events; (2) temporal position homogeneity—event patterns deemed significant at one point in time are likely to be significant at other points; and (3) temporal scale homogeneity—temporal locality and position homogeneity should apply across a range of time scales.

We examine event-sequence architectures that incorporate these biases, plus one more: (4) temporal scale interactions—sequences have different structure at different scales and these scales interact. Scale interactions are also found in vision, and models have been designed to leverage these interactions by incorporating multi-resolution pyramids at every stage of a convolutional architecture (e.g., Buyssens et al., 2013; Zeng et al., 2017). To illustrate interactions across temporal scales, consider the scenario of online shopping. A customer may browse various TV models one week, home automation devices the next, and phones the next. Each of these activity patterns indicates at least a short-term interest in some topic, but the combination of the searches indicates a long-term interest in electronics. On the flip side, customers may frantically shop for parts when their furnace fails, but that does not imply a long-term interest in furnace paraphernalia—contrary to the annoying inference that shopping sites appear to make.

We believe that event-sequence learning can be improved because existing techniques fall short in incorporating the four biases we listed. For example, one standard technique is to include time stamps as additional inputs; these inputs give deep learning models in principle all the necessary flexibility to handle time. However, the flexibility may simply be too great, in the same way that fully connected deep nets are too flexible to match convolutional net performance in vision tasks (Lecun et al., 1998). The architectural biases serve to constrain learning in a helpful manner.

Given the similarity between the spatial regularities incorporated into the convolutional net and the temporal regularities we described, it seems natural to use a convolutional architecture for sequences, essentially remapping time into space (Waibel et al., 1990; Lockett and Miikkulainen, 2009; Nguyen et al., 2016; Taylor et al., 2010; Kalchbrenner et al., 2014; Sainath et al., 2015; Zeng et al., 2016). In terms of the biases that we conjecture to be helpful, convolutional nets can check all the boxes, and some recent work has begun to investigate multiscale convolutional nets for time series to capture scale interactions (Cui et al., 2016). However, convolutional architectures poorly address the continuous nature of time and the potential wide range of time scales. Consider a domain such as network intrusion detection: event patterns of relevance can occur on a time scale of microseconds to weeks (Mukherjee et al., 1994; Palanivel and Duraiswamy, 2014). It is difficult to conceive how a convolutional architecture could accommodate this dynamic range.

RNN architectures have been proposed to address the multiscale nature of time series and to handle interactions of temporal scale, but these approaches have been focused on ordinal sequences and indexing is based on sequence position rather than chronological time. This work includes clockwork RNNs (Koutník et al., 2014), gated feedback RNNs (Chung et al., 2015), and hierarchical multiscale RNNs (Chung et al., 2016).

A wide range of probabilistic methods have been applied to event sequences, including hidden semi-Markov models and survival analysis (Kapoor et al., 2014, 2015; Zhang et al., 2016), temporal point processes (Dai et al., 2016; Du et al., 2015, 2016; Wang et al., 2016b), nonstationary bandits (Komiyama and Qin, 2014), and time-sensitive latent-factor models (Koren, 2010). All probabilistic methods properly treat chronological time as time, and therefore naturally incorporate temporal locality and position homogeneity biases. These methods also tend to permit a wide dynamic range of time scales. However, they are limited by strong generative assumptions. Our aim is to combine the strength of probabilistic methods—having an explicit theory of temporal dynamics—with the strength of deep learning—having the ability to discover representations.

Continuous-time recurrent networks

All dynamical event-sequence models must construct memories that encapsulate information from the past that is relevant for future prediction, action, or classification. This information may have a limited lifetime of utility, and stale information that is no longer relevant should be forgotten. LSTM (Hochreiter and Schmidhuber, 1997) was originally designed to operate without forgetting, but adding a mechanism of forgetting improved the architecture (Gers et al., 2000). The intrinsic dynamics of the newer GRU (Chung et al., 2014) architecture incorporate forgetting: storage of new information is balanced against the forgetting of old.

In this section, we summarize the GRU architecture and we characterize its forgetting mechanism from a novel perspective that facilitates generalizing the architecture to handling sequences in continuous time. For exposition’s sake, we present our approach in terms of the GRU, but it could be cast in terms of LSTM just as well. There appears to be no functional difference between the two architectures with proper initialization (Jozefowicz et al., 2015).

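The most basic architecture using gated-recurrent units (GRUs) involves an input layer, a recurrent hidden GRU layer, and an output layer. A schematic of the GRU units is shown in the left panel of Figure 1. The reset gate, ${\bm{r}}$ , shunts the activation of the previous hidden state, ${\bm{h}}$ . The shunted state, in conjunction with the external input, ${\bm{x}}$ , is used to detect the presence of task-relevant events ( ${\bm{q}}$ ). The update gate, ${\bm{s}}$ , then determines what proportion of the old hidden state should be retained and what proportion of the detected event should be stored. Formally, given an external input ${\bm{x}}_{k}$ at step $k$ and the previous hidden state ${\bm{h}}_{k-1}$ , the GRU layer updates as follows:

$$\begin{aligned}
{\bm{r}}_{k}&=\mathrm{logistic}({\bm{W}}^{\scriptscriptstyle R}{\bm{x}}_{k}+{\bm{U}}^{\scriptscriptstyle R}{\bm{h}}_{k-1}+{\bm{b}}^{\scriptscriptstyle R})\\
{\bm{q}}_{k}&=\tanh({\bm{W}}^{\scriptscriptstyle Q}{\bm{x}}_{k}+{\bm{U}}^{\scriptscriptstyle Q}({\bm{r}}_{k}\circ{\bm{h}}_{k-1})+{\bm{b}}^{\scriptscriptstyle Q})\\
{\bm{s}}_{k}&=\mathrm{logistic}({\bm{W}}^{\scriptscriptstyle S}{\bm{x}}_{k}+{\bm{U}}^{\scriptscriptstyle S}{\bm{h}}_{k-1}+{\bm{b}}^{\scriptscriptstyle S})\\
{\bm{h}}_{k}&={\bm{s}}_{k}\circ{\bm{q}}_{k}+(1-{\bm{s}}_{k})\circ{\bm{h}}_{k-1}
\end{aligned}$$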

where ${\bm{W}}^{*}$ , ${\bm{U}}^{*}$ , and ${\bm{b}}^{*}$ are model parameters, $\circ$ denotes the Hadamard product, and ${\bm{h}}_{0}=\bm{0}$ .
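The retrieve, detect, and store steps described above can be sketched in a few lines of plain Python. This is a minimal illustration rather than the authors' implementation: it assumes the logistic and tanh nonlinearities of the standard GRU, and the names `gru_step`, `affine`, and the `params` dictionary layout are hypothetical.

```python
import math


def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))


def affine(W, x, U, h, b):
    """Return W x + U h + b for list-of-lists matrices and list vectors."""
    return [
        sum(w * xi for w, xi in zip(W[i], x))
        + sum(u * hi for u, hi in zip(U[i], h))
        + b[i]
        for i in range(len(b))
    ]


def gru_step(x, h_prev, params):
    """One GRU update: retrieve (reset gate), detect event, store (update gate).

    params maps "R", "Q", "S" to (W, U, b) triples; this layout is an assumption.
    """
    WR, UR, bR = params["R"]
    WQ, UQ, bQ = params["Q"]
    WS, US, bS = params["S"]
    # Retrieval (reset) gate r_k
    r = [logistic(z) for z in affine(WR, x, UR, h_prev, bR)]
    # Event signal q_k from the input and the shunted state r_k * h_{k-1}
    shunted = [ri * hi for ri, hi in zip(r, h_prev)]
    q = [math.tanh(z) for z in affine(WQ, x, UQ, shunted, bQ)]
    # Storage (update) gate s_k
    s = [logistic(z) for z in affine(WS, x, US, h_prev, bS)]
    # New state: proportion s_k of the event, 1 - s_k of the old state
    return [si * qi + (1.0 - si) * hi for si, qi, hi in zip(s, q, h_prev)]
```

With all parameters set to zero, both gates sit at 0.5 and the detected event is zero, so each call simply halves the previous hidden state.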

Readers who are familiar with GRUs may notice that our depiction of GRUs in Figure 1 looks a bit different than the depiction in the originating article (Chung et al., 2014). Our intention is to highlight the fact that the ‘update’ gate is actually making a decision about what to store in the memory (hence the notation ${\bm{s}}$ ), and the ‘reset’ gate is actually making a decision about what to retrieve from the memory (hence the notation ${\bm{r}}$ ). The schematic in Figure 1 makes obvious the store and retrieval operations via the gate placement on the input to and output from the hidden state, ${\bm{h}}$ , respectively.

To incorporate time into the GRU, we observe that the storage operation essentially splits each new event, ${\bm{q}}_{k}$ , into a portion ${\bm{s}}_{k}$ that is stored indefinitely and a portion $1-{\bm{s}}_{k}$ that is stored for only an infinitesimally short period of time. Similarly, the retrieval operation reassembles a memory by taking a proportion, ${\bm{r}}_{k}$ , of a long-lasting memory—via the product ${\bm{r}}_{k}\circ{\bm{h}}_{k-1}$ —and a complementary proportion, $1-{\bm{r}}_{k}$ , of a very short-term memory—a memory so brief that it has decayed to $\bm{0}$ . The retrieval operation is thus equivalent to computing the mixture ${\bm{r}}_{k}\circ{\bm{h}}_{k-1}+(1-{\bm{r}}_{k})\circ\bm{0}$ .

The essential idea of the model we will introduce, the CT-GRU, is to endow each hidden unit with multiple memory traces that span a range of time scales, in contrast to the GRU which can be conceived of as having just two time scales: one infinitely long and one infinitesimally short. We define time scale in the standard sense of a linear time-invariant system, operating according to the differential equation $dh/dt=-h/\tau$ , where $h$ is the memory, $t$ is continuous time, and $\tau$ is a (nonnegative) time constant or time scale. These dynamics yield exponential decay, i.e., $h(t)=e^{-t/\tau}h(0)$ and $\tau$ is the time for the state to decay to a proportion $e^{-1}\approx.37$ of its initial level. The short and long time scales of the GRU correspond to the limits $\tau\to 0$ and $\tau\to\infty$ , respectively.
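The decay law and its two limiting time scales can be checked numerically. The sketch below assumes a scalar trace; the function name `decay` and the explicit handling of the limits $\tau\to 0$ and $\tau\to\infty$ are illustrative choices, not part of the model specification.

```python
import math


def decay(h0, t, tau):
    """State after time t under dh/dt = -h/tau, i.e. h(t) = exp(-t/tau) * h0."""
    if tau == 0.0:
        # Limit tau -> 0: the trace vanishes for any t > 0.
        return h0 if t == 0.0 else 0.0
    if math.isinf(tau):
        # Limit tau -> infinity: the trace never decays.
        return h0
    return math.exp(-t / tau) * h0
```

For example, `decay(1.0, 10.0, 10.0)` returns $e^{-1}$, the proportion remaining after one time constant, and decaying for $t_1$ and then $t_2$ gives the same result as decaying once for $t_1+t_2$.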

Continuous-time gated recurrent unit (CT-GRU)

We argued that the storage (or update) gate of the GRU decides how to distribute the memory of a new event across time scales, and the retrieval (or reset) gate decides how to collect information previously stored across time scales. Binding memory operations to a time scale is sensible for any intelligent agent because different activities require different memory durations. To use human cognition as an example, when you are told a phone number, you need remember it only for a few seconds to enter it in your phone; when making a mental shopping list, you need remember the items only until you get to the store; but when a colleague goes on sabbatical and returns a year later, you should still remember her name. Although individuals typically do not wish to forget, forgetting can be viewed as adaptive (Anderson and Milson, 1989): when information becomes stale or is no longer relevant, it only interferes with ongoing processing and clutters memory. Indeed, cognitive scientists have shown that when an attribute must be updated frequently in memory, its current value decays more rapidly (Altmann and Gray, 2002). This phenomenon is related to the benefit of distributed practice on human knowledge retention: when study is spaced versus massed in time, memories are more durable (Mozer et al., 2009).

Returning to the CT-GRU, our goal is to develop a model that—consistent with the GRU—stores each new event at a time scale deemed appropriate for it, and similarly retrieves information from an appropriate time scale. Thus, we wish to replace the GRU storage and retrieval gates with storage and retrieval scales, computed from the external input and the current hidden state. The scale is expressed in terms of a time constant.

Just as the GRU determines the position of the storage (update) gate from its input, the CT-GRU determines the time scale of storage, $\tau_{k}^{\scriptscriptstyle S}$ . We use an exponential transform to ensure nonnegative $\tau_{k}^{\scriptscriptstyle S}$ :
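One form consistent with this description is sketched below. It assumes that the storage scale is driven by the same kind of affine function of the external input and previous hidden state that drives the GRU storage gate, with the parameter names ${\bm{W}}^{\scriptscriptstyle S}$ , ${\bm{U}}^{\scriptscriptstyle S}$ , and ${\bm{b}}^{\scriptscriptstyle S}$ carried over by assumption and the exponential applied elementwise.

```latex
{\bm{\tau}}_{k}^{\scriptscriptstyle S}
  = \exp\left({\bm{W}}^{\scriptscriptstyle S}{\bm{x}}_{k}
  + {\bm{U}}^{\scriptscriptstyle S}{\bm{h}}_{k-1}
  + {\bm{b}}^{\scriptscriptstyle S}\right)
```

Because the exponential is positive for any real argument, every hidden unit receives a valid time constant, and the affine term can be read as the log of the storage scale.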

Experiments

We compare the CT-GRU to a standard GRU that receives additional real-valued $\Delta t$ inputs. Although the CT-GRU is derived from the GRU, the CT-GRU is wired with a specific form of continuous time dynamics, whereas the GRU is free to use the $\Delta t$ input in an arbitrary manner. The conjecture that motivated our work is that the inductive bias built into the CT-GRU would enable it to better leverage temporal information and therefore outperform the overly flexible, poorly constrained GRU.

We have conducted experiments on a diverse variety of event-sequence data sets, synthetic and natural. The synthetic sets were designed to reveal the types of temporal structure that each architecture could discover. We have explored a range of classification and prediction tasks. The punch line of our work is this: Although the CT-GRU and GRU handle time in very different manners, the two architectures perform essentially identically. We found almost no empirical difference between the models. Where one makes errors, the other makes the same errors. Both models perform significantly above sensible baselines, and both models leverage time, albeit in a different manner. Nonetheless, we will argue that the CT-GRU has interesting dynamics and offers lessons for future research.

In all simulations, we present sequences of symbolic event labels. The input is a one-hot representation of the current event, $x_{k}$ . For the CT-GRU, $\Delta t_{k}$ —the lag between events $k$ and $k+1$ —is provided as a special input that modulates decay (see Figure 1b). For the GRU, $\Delta t_{k-1}$ and $\Delta t_{k}$ are included as standard real-valued inputs. The output layer representation and activation function depend on the task. For event-label prediction, the task is to predict the next event, $x_{k+1}$ ; the output layer is a one-hot representation with a softmax activation function. For event-polarity prediction, the task is to predict a binary property of the next event. For this task, the output consists of one logistic unit per event label; only the event that actually occurs is provided a target ( $0$ or $1$ ) value. For classification, the task is to map a complete sequence to one of two classes, $0$ or $1$ , via a logistic output unit.
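As an illustration of the GRU input format, the sketch below builds one vector per event from a time-stamped sequence: the one-hot label followed by the two lags. The function names are hypothetical, and two details are assumptions: the lag before the first event is set to 0, and the final event, which has no successor, is used only as a prediction target.

```python
def one_hot(label, vocab):
    return [1.0 if v == label else 0.0 for v in vocab]


def gru_inputs(events, vocab):
    """events: list of (label, time) pairs sorted by time.

    Returns one input vector per event that has a successor: the one-hot
    label, the lag from the previous event, and the lag to the next event.
    """
    inputs = []
    for k in range(len(events) - 1):
        label, t = events[k]
        # Assumption: no lag is defined before the first event, so use 0.0.
        lag_prev = t - events[k - 1][1] if k > 0 else 0.0
        lag_next = events[k + 1][1] - t
        inputs.append(one_hot(label, vocab) + [lag_prev, lag_next])
    return inputs
```

For the CT-GRU, only the one-hot part would be fed through the input weights, with the lag to the next event routed to the decay mechanism instead.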

We constructed independent Theano (Theano Development Team, 2016) and TensorFlow (Abadi et al., 2015) implementations as a means of verifying the code. For all data sets, 15% of the training set is used as validation data for model selection, performed via early stopping and selection from a range of hidden layer sizes. We assess test-set performance via three measures: accuracy, log likelihood, and a discriminability measure, AUC (Green and Swets, 1966). We report accuracy because it closely mirrors log likelihood and AUC on all data sets, and accuracy is most intuitive. More details of the simulation methodology and a complete description of data sets can be found in the Supplementary Materials.

Discovery of temporal patterns in synthetic data

To illustrate the operation of the CT-GRU, we devised a Working memory task requiring limited-duration information storage. The input sequence consists of commands to store, for a duration of 1, 10, or 100 time units—specified by the commands s, m, or l—a specific symbol— a, b, or c. The input sequence also contains symbols a-c in isolation to probe memory for whether the symbol is currently stored. For example, with $\text{x}/t$ denoting event x at time $t$ , consider the sequence: $\{\text{m}/0,\text{b}/0,\text{b}/5\}$ . The first two events instruct the memory to store b for 10 time units. The third probes for b at time 5, which should produce a response of 1, whereas probes $\text{b}/25$ or $\text{a}/5$ should produce 0.
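The target for each probe follows mechanically from the store commands. The sketch below computes it for a list of (symbol, time) events; the function name is hypothetical, and treating a probe that arrives exactly at the expiry time as still stored is an assumption the task description does not settle.

```python
DURATION = {"s": 1.0, "m": 10.0, "l": 100.0}  # storage durations per command


def probe_targets(events):
    """events: list of (symbol, time); a command applies to the symbol after it.

    Returns (time, symbol, target) for each probe, i.e. each symbol that is
    not immediately preceded by a store command.
    """
    expires = {}    # symbol -> time at which its stored copy lapses
    pending = None  # duration set by the most recent command, if any
    out = []
    for sym, t in events:
        if sym in DURATION:
            pending = DURATION[sym]
        elif pending is not None:
            expires[sym] = t + pending
            pending = None
        else:
            stored = t <= expires.get(sym, float("-inf"))
            out.append((t, sym, 1 if stored else 0))
    return out
```

On the example sequence, `probe_targets([("m", 0), ("b", 0), ("b", 5)])` yields a target of 1 for the probe at time 5, while replacing the probe with b at time 25 or a at time 5 yields 0.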

Both GRU and CT-GRU with 15 hidden units learn the task well, with 98.8% and 98.7% test-set accuracy, respectively. Figure 3a plots probe response to sequences of the form $\{\text{l}/0,\text{x}/0,\ldots,\text{x}/t\}$ for various durations $t$ . The scatterplot represents individual test sequences; the dashed line is a logistic fit. Both the GRU and CT-GRU show a drop off in response around $t=100$ , as desired, although the CT-GRU shows a more ideal, sharper cut off. Due to its explicit representation of time scale, the CT-GRU is amenable to dissection and interpretation. The bottom of Figure 3b shows weights ${\bm{W}}^{\scriptscriptstyle Q}$ for the fifteen CT-GRU hidden units, arranged such that the units which respond more strongly to symbols a–c—and thus will serve as memory for these symbols—are further to the right (blue negative, red positive). The top of the Figure shows the storage timescale, expressed as $\log_{10}{\bm{\tau}}_{k}^{\scriptscriptstyle S}$ , for a symbol a–c when preceded by commands s, m, or l. In accordance with task demands, the CT-GRU modulates the storage time scale based on the command context.

Moving on to more systematic investigations, we devised three synthetic data sets for which inter-event times are required to attain optimal performance. Data set Cluster classifies 100-element sequences according to whether three specific events occur in any order within a given time span. Figure 4a shows a sample sequence with critical elements that both satisfy and fail to satisfy the time-span requirement, indicated by the solid and outline rectangle, respectively. Remembering outputs a binary value for each event label indicating whether the lag from the last occurrence of the event is below or above a critical time threshold. Rhythm classifies 100-element sequences according to whether the inter-event timings follow a set of event-contingent rules, like a type of musical notation. The CT-GRU performs no better than the GRU with $\Delta t$ inputs, although both outperform the GRU without $\Delta t$ (Figures 5a-c), indicating that both architectures are able to use the temporal lags. For these and all other simulations reported, errors produced by the CT-GRU and the GRU are almost perfectly correlated. (Henceforth, we refer to the GRU with $\Delta t$ as the GRU.)

We ran ten replications of Cluster with different initializations and different example sequences, and found no reliable difference between CT-GRU and GRU by a two-sided Wilcoxon signed-rank test ( $p=.43$ ). Because our data sets are almost all large—with between 10k and 100k training and test examples—and because our aim is not to argue that the CT-GRU outperforms the GRU, we report outcomes from a single simulation run in Figures 5a-i.

Having demonstrated that the GRU is able to leverage the $\Delta t$ inputs, we conducted two simulations to show that the CT-GRU requires decay dynamics to achieve its performance. We created a version of the CT-GRU in which the traces did not decay with the passage of time. In principle, such an architecture could be used as a flexible memory, where a unit decides which memory slot to use for information storage and retrieval. However, in practice removing decay dynamics for intrinsically temporal tasks harms the CT-GRU (Figures 5d,e). Data set Hawkes process consists of parallel event streams generated by independent Hawkes processes operating over a range of time scales; an example sequence is shown in Figure 4b. Data set Disperse classifies event streams according to whether two specific events occur in a precise but distant temporal relationship.

Naturalistic data sets

We experimented with five real-world event-sequence data sets, described in detail in the Supplementary Materials. Reddit is the time series of subreddit postings of 30k users, with sequences spanning up to several years and a thousand postings (Figure 4c). Last.fm has 300 time-tagged artist selections of 30k users, spanning a time range from hours to months. Msnbc, from the UCI repository (Lichman, 2013), has the sequence of categorized MSNBC web pages viewed in 120k sessions. Spanish and Japanese are data sets of students practicing foreign language vocabulary over a period of up to 4 months, with the lag between practice of a vocabulary item ranging from seconds to months. Reddit, Last.fm, and Msnbc are event-label prediction tasks; Spanish and Japanese require event-polarity prediction (whether students successfully translated a vocabulary item given their study history). Because students forget with the passage of time, we expected that CT-GRU would be particularly effective for modeling human memory strength.

Figures 5f-j reveal no meaningful performance difference between the GRU and CT-GRU architectures, and both architectures outperform a baseline measure (depicted as the solid black line in the Figures). For Reddit, Last.fm, and Msnbc, the baseline is obtained by predicting that the next event label is the same as the current label; for Spanish and Japanese, the baseline is obtained by predicting the same success or failure for a vocabulary item as on the previous trial. Significantly beating baseline is quite difficult for each of these tasks because they involve modeling human behavior that is governed by many factors external to event history.
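The event-label baseline is simple enough to state as code. The sketch below scores the "repeat the current label" rule over a collection of label sequences; the function name and the list-of-lists input format are assumptions made for illustration.

```python
def repeat_baseline_accuracy(sequences):
    """Accuracy of always predicting that the next label equals the current one."""
    correct = total = 0
    for seq in sequences:
        for current, nxt in zip(seq, seq[1:]):
            correct += current == nxt
            total += 1
    return correct / total if total else 0.0
```

For the single sequence `[1, 1, 2, 2, 2]`, three of the four transitions repeat the previous label, so the baseline accuracy is 0.75.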

The most distressing result, which we do not show in the Figures, is that for each of these tasks, removing the $\Delta t$ inputs from the GRU has only a tiny impact on performance, at most a 5% drop toward baseline. Thus, neither GRU nor CT-GRU is able to leverage the timing information in the event stream. One possibility is that the stochasticity of human behavior overwhelms any signal in event timing. If so, time tags may provide more leverage for event sequences obtained from alternative sources (e.g., computer systems, physical processes). However, we are not hopeful given that our synthetic data sets also failed to show an advantage for the CT-GRU, and those data sets were crafted to benefit an architecture like CT-GRU with intrinsic temporal dynamics.

Summary of other investigations

We conducted a variety of additional investigations that we summarize here. First, we hoped that with smaller data sets, the value of the inductive bias in the CT-GRU would give it an advantage over the GRU, but it did not. Second, we tested other natural and synthetic data sets, but the pattern of results is as we report here. Third, we considered additional tasks that might reveal an advantage of the CT-GRU such as sequence extrapolation and event-timing prediction. And finally, we developed literally dozens of alternative neural net architectures that, like the CT-GRU, incorporate the forms of inductive bias described in the introduction that we expected to be helpful for event-sequence processing. All of these architectures share intrinsic time-based decay whose dynamics are modulated by information contained in the event sequence. These architectures include: variants of the CT-GRU in which the retrieved state is also used for output and computation of the storage and retrieval scales; the LSTM analog of the CT-GRU, with multiple temporal scales; and a variety of memory mechanisms whose internal dynamics are designed to mimic mean-field approximations to stochastic processes, including survival processes and self-excitatory and self-inhibitory point processes (e.g., Hawkes processes). Some of these models are easier to train than others, but, in the end, none beat the performance of generic LSTM or GRU architectures provided with additional $\Delta t$ inputs.

Discussion

Our work is premised on the hypothesis that event-sequence processing in RNN architectures could be improved by incorporating domain-appropriate inductive bias. Despite a concerted, year-long effort, we found no support for this hypothesis. Selling a null result is challenging. We have demonstrated that there is no trivial or pathological explanation for the null result, such as implementation issues with the CT-GRU or the possibility that both architectures simply ignore time. Our methodology is sound and careful, our simulations extensive and thorough. Nevertheless, negative results can be influential, e.g., the failure to learn long-term temporal dependencies (Hochreiter et al., 2001; Hochreiter, 1998; Bengio et al., 1994; Mozer, 1992) led to the discovery of novel RNN architectures. Further, this report may save others from a duplication of effort. We also note, somewhat cynically, that a large fraction of the novel architectures that are claimed to yield promising results one year seem to fall by the wayside a year later.

One possible explanation for our null result may come from the fact that the CT-GRU has no more free parameters than the GRU. In fact, the GRU has more parameters because the inter-event times are treated as additional inputs with associated weights in the GRU. The CT-GRU and GRU have different sorts of flexibility via their free parameters, but perhaps the space of solutions they can encode is roughly the same. Nonetheless, we are a bit mystified as to how they could admit the same solution space, given the very different manners in which they encode and utilize time.

Our work has two key insights that ought to have value for future research. First, we cast the popular LSTM and GRU architectures in terms of time-scale selection rather than in terms of gating information flow. Second, we show that a simple mechanism with a finite set of time scales is capable of storing and retrieving information from a continuous range of time scales.

To end on a more positive note, incorporating continuous-time dynamics into neural architectures has led us to some observations worthy of further pursuit. For example, consider the possibility of multiple events occurring simultaneously, e.g., a stream of outgoing emails might be coded in terms of the recipients, and a single message may be sent to multiple individuals. The state of an LSTM, GRU, or CT-GRU will depend on the order in which the individuals are presented. However, we can incorporate into the CT-GRU absorption time dynamics for an input $x$ , via the closed-form solution to the differential equations $dh/dt=-h/\tau_{hid}+x/\tau_{in}$ and $dx/dt=-x/\tau_{in}$ , yielding a model whose dynamics are invariant to order for simultaneous events, and relatively insensitive to order for events arriving closely in time. Such behavior could have significant benefits for event sequences with measurement noise or random factors influencing arrival times.
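The closed-form solution mentioned above can be written out for a single scalar unit. The sketch below reads both equations as time derivatives and assumes $\tau_{hid}\neq\tau_{in}$ ; the function name `absorb` and the scalar setting are illustrative only.

```python
import math


def absorb(h0, x0, t, tau_hid, tau_in):
    """h(t) for dh/dt = -h/tau_hid + x/tau_in and dx/dt = -x/tau_in.

    h0 and x0 are the values at time 0; requires tau_hid != tau_in.
    """
    decay_h = math.exp(-t / tau_hid)
    decay_x = math.exp(-t / tau_in)
    gain = tau_hid / (tau_hid - tau_in)
    # The old state decays at tau_hid while the input trace is absorbed gradually.
    return h0 * decay_h + x0 * gain * (decay_h - decay_x)
```

Because the solution is linear in `x0`, two inputs that arrive at the same instant contribute the sum of their individual effects regardless of the order in which they are added, which is the order invariance described above.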

This research was supported by NSF grants DRL-1631428 and SES-1461535.

A

We constructed independent Theano (Theano Development Team, 2016) and TensorFlow (Abadi et al., 2015) implementations as a means of verifying the code. For all data sets, 15% of the training set is used as validation data for model selection, performed via early stopping and selection from a range of hidden layer sizes. Optimization was performed via RMSProp. Dropout was not used as it appeared to have little impact on results. We assessed performance on a test set via three measures: accuracy of prediction/classification, log likelihood of correct prediction/classification, and AUC (a discriminability measure) (Green and Swets, 1966). Because accuracy mirrored the other two measures for our data sets and because it is the most intuitive, we report accuracy. For event-label prediction tasks, a response is correct if the highest output probability label is the correct label. For classification tasks and event-polarity prediction tasks, a response is correct if the error magnitude is less than 0.5 for outputs in $[0,1]$.

The GRU ${\bm{U}}^{*}$ and ${\bm{W}}^{*}$ weights are initialized with $L_{2}$ norm 1 and such that the fan-in weights across hidden units are mutually orthogonal. The GRU biases ${\bm{b}}^{*}$ are initialized to zero. Other weights, including the mapping from input to hidden and hidden to output, are initialized by draws from a $\mathcal{N}(0,.01)$ distribution.

A.1.2 CT-GRU initialization

A.2 Data sets

We explored a total of 11 data sets, 6 synthetic and 5 natural.

All synthetic data sets consisted of 10,000 training and 10,000 testing examples.

Working memory. We devised a simple task requiring a duration-limited or working memory. The input sequence consists of commands to store a symbol (a, b, or c) for a short (s), medium (m), or long (l) time interval—1, 10, or 100 time units, respectively. The input sequence also contains the symbols a-c in isolation to probe the memory for whether the symbol is currently stored. For example, consider the sequence: $\{0,\text{m}\},\{0,\text{b}\},\{5,\text{b}\}$ , where $\{t,\text{x}\}$ denotes event x in the input sequence at time $t$ . The first two events instruct the memory to store b for 10 time units. The third event probes for b at time 5. This probe should produce a response of 1, whereas queries $\{25,\text{b}\}$ or $\{5,\text{a}\}$ should produce a response of 0. The specific form of sequences generated consisted of two commands to store distinct symbols, separated in time by $t_{1}$ units, followed by a probe of one of the symbols following $t_{2}$ units. The lags $t_{1}$ and $t_{2}$ were chosen in order to balance the training and test sets with half positive and half negative examples. Only fifteen hidden units were used for this task in order to interpret model behavior.

Cluster. We generated sequences of 100 events drawn uniformly from 12 labels, a–l, with inter-event times drawn from an exponential distribution with mean 1. The task is to classify the sequence depending on the occurrence of events a, b, and c in any order within a 6 time unit window. The data set was balanced with half positive and half negative examples. The positive examples had one or more occurrences of the target pattern at a random position within the sequence. We tested a 20 hidden unit architecture.
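The class label for a Cluster sequence can be computed in one pass by tracking the most recent occurrence of each target event. This is a sketch of the labeling rule only, not of the generator; the function name is hypothetical, and counting a span of exactly 6 time units as inside the window is an assumption.

```python
def has_cluster(events, targets=("a", "b", "c"), window=6.0):
    """events: list of (label, time) sorted by time.

    True if every target label occurs at least once, in any order, inside
    some time span of length at most `window`.
    """
    last_seen = {}
    for label, t in events:
        if label in targets:
            last_seen[label] = t
            # The tightest span ending now uses the latest copy of each target.
            if len(last_seen) == len(targets) and t - min(last_seen.values()) <= window:
                return True
    return False
```

The check at each target event is sufficient because any qualifying cluster ends on a target event, and the tightest span ending there uses the most recent occurrence of the other two labels.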

Remembering. We generated sequences of 100 events drawn uniformly from 12 labels with inter-event time lags drawn uniformly from $\{1,10,100\}$ . Each time a symbol is presented, the task is to remember that symbol for 310 time steps. If the next occurrence of the symbol is within this threshold, the target output for that symbol should be $1$ , otherwise $0$ . The threshold of 310 time steps was chosen in order that the target outputs are roughly balanced. The target output for the first presentation of a symbol is 0. We tested 20 and 40 hidden unit architectures.
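The per-event targets for this task depend only on the lag since the same label last occurred. A minimal sketch follows; the function name is hypothetical, and treating a lag of exactly 310 as within the threshold is an assumption.

```python
def remembering_targets(events, threshold=310.0):
    """events: list of (label, time) sorted by time. Returns one 0/1 target per event."""
    last_time = {}
    targets = []
    for label, t in events:
        seen = label in last_time
        # First presentations get target 0; repeats get 1 only if recent enough.
        targets.append(1 if seen and t - last_time[label] <= threshold else 0)
        last_time[label] = t
    return targets
```

For the sequence a/0, b/100, a/200, b/500, a/600, the targets are 0, 0, 1, 0, 0: only the second a arrives within 310 time steps of its previous occurrence.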

Rhythm. This classification task involved sequences of 100 symbols drawn uniformly from a–d and terminated by e. The target output at the end of the sequence is 1 if the sequence follows a fixed rhythmic pattern, such that the lags following a–d are 1, 2, 4, and 8, respectively. The positive sequences follow the pattern exactly. The negative sequences double or halve between one and four of the lags. The training and test sets are balanced between positive and negative examples. Note that this task cannot be performed above chance without knowing the inter-event lags. We tested 20 and 40 hidden unit architectures.
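The rhythmic rule can be verified by comparing each lag with the lag required by the symbol that precedes it. The sketch below assumes events are given as (label, time) pairs ending with the terminator e; the names are hypothetical.

```python
RHYTHM_LAG = {"a": 1.0, "b": 2.0, "c": 4.0, "d": 8.0}  # lag that must follow each symbol


def follows_rhythm(events, tol=1e-9):
    """events: list of (label, time) sorted by time, ending with the terminator 'e'."""
    for (label, t), (_, t_next) in zip(events, events[1:]):
        if abs((t_next - t) - RHYTHM_LAG[label]) > tol:
            return False
    return True
```

The sequence a/0, c/1, b/5, e/7 satisfies the rule, whereas doubling the final lag (e/9) turns it into a negative example.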

Hawkes process. We generated interspersed event sequences for 12 labels from independent Hawkes processes. A Hawkes process is a self-excitatory point process whose intensity (event rate) at time $t$ depends on its history: $\lambda(t)=\mu+\frac{\alpha}{\tau}\sum_{t_{i}<t}e^{-(t-t_{i})/\tau}$, where $\{t_{i}\}$ is the set of previously generated event times. Using the algorithm of Dassios and Zhao (2013), we synthesized sequences with $\alpha=.5$, $\mu=.02$, and $\tau\in\{1,2,4,8,\ldots,4096\}$. For each sequence, we assigned a random permutation of the possible $\tau$ scales to event labels. The intensity function ensures that the event rate is identical across scales, but labels with shorter time constants are more concentrated and bursty. The task here is to predict the next event label given the time to the next event, $\delta t_{k}$, and the complete event history. Sequences ranged from 240 to 1020 events. Optimal performance for this data set was determined via maximum likelihood inference on the parameters of the model that generated the data. We tested 10, 20, 40, and 80 hidden unit architectures.
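
To make the intensity function concrete, the sketch below evaluates it for a given history and computes the long-run event rate. The function names are hypothetical. The stationary-rate formula $\mu/(1-\alpha)$ is the standard result for a Hawkes process whose kernel integrates to $\alpha<1$, which holds here because of the $1/\tau$ normalization; it is stated as background and is not taken from our simulation code.

```python
import math

def hawkes_intensity(t, history, mu=0.02, alpha=0.5, tau=1.0):
    """Intensity at time t: mu + (alpha / tau) * sum_i exp(-(t - t_i) / tau) over t_i < t."""
    return mu + (alpha / tau) * sum(
        math.exp(-(t - ti) / tau) for ti in history if ti < t
    )

def stationary_rate(mu=0.02, alpha=0.5):
    """Long-run event rate. It does not depend on tau because the kernel integrates to alpha."""
    return mu / (1.0 - alpha)

# A single past event at time 0 raises the intensity far more for a short time constant.
print(hawkes_intensity(1.0, [0.0], tau=1.0))
print(hawkes_intensity(1.0, [0.0], tau=64.0))
print(stationary_rate())
```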

Disperse. We generated sequences of 100 events drawn from 12 labels, a–l, with inter-event times drawn from an exponential distribution with mean 1. The task is to classify a sequence according to whether a and b occur separated by 10 time units anywhere in the sequence. The target output is 1 if they occur at a lag ranging in , or 0 otherwise. The training and test sets are balanced with half positive and half negative examples. We tested 20 and 40 hidden unit architectures.

A.2.2 Naturalistic data sets

Reddit. We collected sequences of subreddit postings from 30,733 users, and divided the users into 15,000 for training and 15,733 for testing. The posting sequences ranged in length from 30 to 976 postings, with a mean length of 61.0. (We excluded users who posted fewer than 30 times.) Each posting was considered an event and the task is to predict the next event label, i.e., the next subreddit to which the user will post. To focus on the temporal pattern of selections rather than the popularity of specific subreddits, we re-indexed each sequence such that each subreddit was mapped to the order in which it appeared in a sequence. Consequently, the first posting for any user will correspond to label 1; the second posting could either be a repetition of 1 or a new subreddit, 2. If the user posted to more than 50 subreddits, the 51st and beyond were assigned to label 50. Baseline performance is obtained by predicting that event $k+1$ will be the same as event $k$.
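
The re-indexing step can be sketched as follows. The function name `reindex` and the example subreddit names are illustrative only.

```python
def reindex(labels, max_label=50):
    """Map each label to the order of its first appearance, capped at max_label."""
    first_seen = {}
    out = []
    for label in labels:
        if label not in first_seen:
            first_seen[label] = len(first_seen) + 1   # 1-based order of appearance
        out.append(min(first_seen[label], max_label))
    return out

print(reindex(["pics", "aww", "pics", "news"]))
```

With the cap in place, the 51st and later distinct labels all share label 50, so the output vocabulary has a fixed size regardless of how many subreddits a user visits.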

Last.fm. We collected sequences of musical artist selections from 30,000 individuals, split evenly into training and testing sets. For each individual, we picked a span of time wide enough to encompass exactly 300 selections. This span ranged from under an hour to more than six years, with a mean span of 76.3 days. To focus on the temporal pattern of selections rather than the popularity of specific artists, we re-indexed each sequence such that each artist was mapped to the order in which it appeared in a sequence. Any sequence with more than 50 distinct artists was rejected. Baseline performance is obtained by predicting that event $k+1$ will be the same as event $k$.

Msnbc. This data set was obtained from the UCI repository and consists of the sequence of requests a user makes for web pages on the MSNBC site. The pages are classified into one of 17 categories, such as frontpage, news, tech, and local. The sequences ranged from 9 selections to 99 selections with a mean length of 17.6. Unfortunately, time tags were not available for these data, and thus we treated the event sequences as ordinal sequences. We were interested in including one data set with ordinal sequences in order to examine whether such sequences might show an advantage or disadvantage for the CT-GRU. Baseline performance is obtained by predicting that event $k+1$ will be the same as event $k$.
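
The repeat-previous-event baseline used for the next-event prediction data sets can be sketched as below. The function name is hypothetical, and we assume that accuracy is the metric and that the first event of each sequence, which has no predecessor, is not scored.

```python
def repeat_baseline_accuracy(sequences):
    """Fraction of events k+1 that equal event k, pooled over all sequences."""
    hits = total = 0
    for seq in sequences:
        for prev, nxt in zip(seq, seq[1:]):
            hits += int(prev == nxt)
            total += 1
    return hits / total if total else float("nan")

# Three of the four transitions repeat the previous label.
print(repeat_baseline_accuracy([[1, 1, 2, 2, 2]]))
```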

Spanish. This data set consists of retrieval practice trials from 180 native English speaking students studying 221 Spanish language vocabulary items over the time span of a semester (Lindsey et al., 2014). On each trial, students were shown an English word or phrase to translate to Spanish, and correct or incorrect performance was recorded. The sequences consist of a student’s entire study history for a single item, and the task is to predict trial-to-trial accuracy. The data set consists of 37,601 sequences split randomly into 18,800 for training and 18,801 for testing. Sequences had a mean length of 15.9 and a maximum length of 190. The input consisted of $221\times 2$ units that represent the current trial as the Cartesian product of item practiced and incorrect/correct performance. The output consisted of 221 logistic units with 0/1 values for the prediction of incorrect/correct performance on each of the 221 items. Training and test set error is based only on the item actually practiced. Baseline performance is obtained by predicting that the accuracy of a student’s response on trial $k+1$ is the same as on trial $k$.
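
The input encoding and the restriction of the error to the practiced item can be illustrated as follows. The unit ordering (two adjacent units per item), the use of log loss, and both function names are assumptions made for this sketch; the text above specifies only the dimensionality and that error is computed on the practiced item.

```python
import math

N_ITEMS = 221

def encode_trial(item, correct):
    """One-hot vector over the 221 x 2 (item, incorrect/correct) combinations.

    Assumed layout: unit 2*item codes an incorrect response, unit 2*item + 1 a correct one.
    """
    x = [0.0] * (N_ITEMS * 2)
    x[2 * item + int(correct)] = 1.0
    return x

def practiced_item_loss(outputs, item, correct):
    """Log loss on the output unit of the practiced item only (assumed loss function)."""
    p = outputs[item]
    return -math.log(p if correct else 1.0 - p)

x = encode_trial(3, True)
print(len(x), x.index(1.0))
print(practiced_item_loss([0.5] * N_ITEMS, 3, True))
```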

Japanese. This data set is from a controlled laboratory study of learning Japanese vocabulary with 32 participants studying 60 vocabulary items over an 84-day period, with times between practice trials ranging from seconds to 50 days. For this data set, we formed one sequence per subject; the sequences ranged from 654 to 659 trials. Because of the small number of subjects, we made an 8-fold split, each time training on 25 subjects, validating on 3, and testing on the remaining 4. Baseline performance is obtained by predicting that the accuracy of a student’s response on trial $k+1$ is the same as on trial $k$.
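
The 8-fold split can be sketched as below. How the three validation subjects were selected within each fold is not specified above, so taking the first three non-test subjects is an assumption, as is the function name.

```python
def eight_fold_splits(subjects, n_folds=8, n_val=3):
    """Yield (train, validation, test) subject lists, one triple per fold.

    Assumes `subjects` is a list whose length is divisible by n_folds.
    """
    size = len(subjects) // n_folds
    for k in range(n_folds):
        test = subjects[k * size:(k + 1) * size]
        rest = subjects[:k * size] + subjects[(k + 1) * size:]
        yield rest[n_val:], rest[:n_val], test

for train, val, test in eight_fold_splits(list(range(32))):
    print(len(train), len(val), len(test))
```

Each subject appears in exactly one test set, so test performance pooled over folds covers all 32 participants.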