Memory Augmented Neural Networks with Wormhole Connections

Caglar Gulcehre, Sarath Chandar, Yoshua Bengio

Introduction

Recurrent Neural Networks (RNNs) are neural network architectures that are designed to handle temporal dependencies in sequential prediction problems. However it is well known that RNNs suffer from the issue of vanishing gradients as the length of the sequence and the dependencies increases (Hochreiter, 1991; Bengio et al., 1994). Long Short Term Memory (LSTM) units (Hochreiter and Schmidhuber, 1997) were proposed as an alternative architecture which can handle long range dependencies better than a vanilla RNN. A simplified version of LSTM unit called Gated Recurrent Unit (GRU), proposed in (Cho et al., 2014), has proven to be successful in a number of applications (Bahdanau et al., 2015; Xu et al., 2015; Trischler et al., 2016; Kaiser and Sutskever, 2015; Serban et al., 2016). Even though LSTMs and GRUs attempt to solve the vanishing gradient problem, the memory in both architectures is stored in a single hidden vector as it is done in an RNN and hence accessing the information too far in the past can still be difficult. In other words, LSTM and GRU models have a limited ability to perform a search through its past memories when it needs to access a relevant information for making a prediction. Extending the capabilities of neural networks with a memory component has been explored in the literature on different applications with different architectures (Weston et al., 2015; Graves et al., 2014; Joulin and Mikolov, 2015; Grefenstette et al., 2015; Sukhbaatar et al., 2015; Bordes et al., 2015; Chandar et al., 2016; Gulcehre et al., 2016; Graves et al., 2016; Rae et al., 2016).

Memory augmented neural networks (MANN) such as neural Turing machines (NTM) (Graves et al., 2014; Rae et al., 2016), dynamic NTM (D-NTM) (Gulcehre et al., 2016), and Differentiable Neural Computers (DNC) (Graves et al., 2016) use an external memory (usually a matrix) to store information and the MANN’s controller can learn to both read from and write into the external memory. As we show here, it is in general possible to use particular MANNs to explicitly store the previous hidden states of an RNN in the memory and that will provide shortcut connections through time, called here wormhole connections, to look into the history of the states of the RNN controller. Learning to read and write into an external memory by using neural networks gives the model more freedom or flexibility to retrieve information from its past, forget or store new information into the memory. However, if the addressing mechanism for read and/or write operations are continuous (like in the NTM and continuous D-NTM), then the access may be too diffuse, especially early on during training. This can hurt especially the writing operation, since a diffused write operation will overwrite a large fraction of the memory at each step, yielding fast vanishing of the memories (and gradients). On the other hand, discrete addressing, as used in the discrete D-NTM, should be able to perform this search through the past, but prevents us from using straight backpropagation for learning how to choose the address.

We investigate the flow of the gradients and how the wormhole connections introduced by the controller effects it. Our results show that the wormhole connections created by the controller of the MANN can significantly reduce the effects of the vanishing gradients by shortening the paths that the signal needs to travel between the dependencies. We also discuss how the MANNs can generalize to the sequences longer than the ones seen during the training.

In a discrete D-NTM, the controller must learn to read from and write into the external memory by itself and additionally, it should also learn the reader/writer synchronization. This can make the learning to be more challenging. In spite of this difficulty, Gulcehre et al. (2016) reported that the discrete D-NTM can learn faster than the continuous D-NTM on some of the bAbI tasks. We provide a formal analysis of gradient flow in MANNs based on discrete addressing and justify this result. In this paper, we also propose a new MANN based on discrete addressing called TARDIS (Temporal Automatic Relation Discovery in Sequences). In TARDIS, memory access is based on tying the write and read heads of the model after memory is filled up. When the memory is not full, the write head store information in memory in the sequential order.

The main characteristics of TARDIS are as follows, TARDIS is a simple memory augmented neural network model which can represent long-term dependencies efficiently by using a external memory of small size. TARDIS represents the dependencies between the hidden states inside the memory. We show both theoretically and experimentally that TARDIS fixes to a large extent the problems related to long-term dependencies. Our model can also store sub-sequences or sequence chunks into the memory. As a consequence, the controller can learn to represent the high-level temporal abstractions as well. TARDIS performs well on several structured output prediction tasks as verified in our experiments.

The idea of using external memory with attention can be justified with the concept of mental-time travel which humans do occasionally to solve daily tasks. In particular, in the cognitive science literature, the concept of chronesthesia is known to be a form of consciousness which allows human to think about time subjectively and perform mental time-travel (Tulving, 2002). TARDIS is inspired by this ability of humans which allows one to look up past memories and plan for the future using the episodic memory.

TARDIS: A Memory Augmented Neural Network

Neural network architectures with an external memory represent the memory in a matrix form, such that at each time step $t$ the model can both read from and write to the external memory. The whole content of the external memory can be considered as a generalization of hidden state vector in a recurrent neural network. Instead of storing all the information into a single hidden state vector, our model can store them in a matrix which has a higher capacity and with more targeted ability to substantially change or use only a small subset of the memory at each time step. The neural Turing machine (NTM) (Graves et al., 2014) is such an example of a MANN, with both reading and writing into the memory.

At every time step, the hidden state $\mathbf{h}_{t}$ of the controller is also conditioned on the content $\mathbf{r}_{t}$ read from the memory. The wormhole connections are created by conditioning $\mathbf{h}_{t}$ on $\mathbf{r}_{t}$ :

As each cell in the memory is a linear projection of one of the previous hidden states, the conditioning of the controller’s hidden state with the content read from the memory can be interpreted as a way of creating short-cut connections across time (from the time $t^{\prime}$ that $\mathbf{h}_{t^{\prime}}$ was written to the time $t$ when it was read through $\mathbf{r}_{t}$ ) which can help to the flow of gradients across time. This is possible because of the discrete addressing used for read and write operations.

However, the main challenge for the model is to learn proper read and write mechanisms so that it can write the hidden states of the previous time steps that will be useful for future predictions and read them at the right time step. We call this the reader/writer synchronization problem. Instead of designing complicated addressing mechanisms to mitigate the difficulty of learning how to properly address the external memory, TARDIS side-steps the reader/writer synchronization problem by using the following heuristics. For the first $k$ time steps, our model writes the micro-states into the $k$ cells of the memory in a sequential order. When the memory becomes full, the most effective strategy in terms of preserving the information stored in the memory would be to replace the memory cell that has been read with the micro-state generated from the hidden state of the controller after it is conditioned on the memory cell that has been read. If the model needs to perfectly retain the memory cell that it has just overwritten, the controller can in principle learn to do that by copying its read input to its write output (into the same memory cell). The pseudocode and the details of the memory update algorithm for TARDIS is presented in Algorithm 1.

There are two missing pieces in Algorithm 1: How to generate the read weights? What is the structure of the controller function $\phi$ ? We will answer these two questions in detail in next two sub-sections.

2 Addressing mechanism

The continuous read weights $\overline{\mathbf{w}}^{r}_{t}$ are generated by an MLP which uses the information coming from $\mathbf{h}_{t}$ , $\mathbf{x}_{t}$ , $\mathbf{M}_{t}$ and the usage vector $\mathbf{u}_{t}$ (described below). The MLP is parametrized as follows:

where $\{\mathbf{a},\mathbf{W}^{\gamma}_{h},\mathbf{W}^{\gamma}_{x},\mathbf{W}^{\gamma}_{m},\mathbf{W}^{\gamma}_{u}\}$ are learnable parameters. $\mathbf{w}^{r}_{t}$ is a one-hot vector obtained by either sampling from $\overline{\mathbf{w}}^{r}_{t}$ or by using argmax over $\overline{\mathbf{w}}^{r}_{t}$ .

$\mathbf{u}_{t}$ is the usage vector which denotes the frequency of accesses to each cell in the memory. $\mathbf{u}_{t}$ is computed from the sum of discrete address vectors $\mathbf{w}^{r}_{t}$ and normalizing them.

$\text{norm}(\cdot)$ applied in Equation 6 is a simple feature-wise computation of centering and divisive variance normalization. This normalization step makes the training easier with the usage vectors. The introduction of the usage vector can help the attention mechanism to choose between the different memory cells based on their frequency of accesses to each cell of the memory. For example, if a memory cell is very rarely accessed by the controller, for the next time step, it can learn to assign more weights to those memory cells by looking into the usage vector. By this way, the controller can learn an LRU access mechanism (Santoro et al., 2016; Gulcehre et al., 2016).

Further, in order to prevent the model to learn deficient addressing mechanisms, for e.g. reading the same memory cell which will not increase the memory capacity of the model, we decrease the probability of the last read memory location by subtracting $100$ from the logit of $\overline{\mathbf{w}}^{r}_{t}$ for that particular memory location.

3 TARDIS Controller

We use an LSTM controller, and its gates are modified to take into account the content $\mathbf{r}_{t}$ of the cell read from the memory:

where ${\text{f}}_{t}$ , ${\text{i}}_{t}$ , and ${\text{o}}_{t}$ are forget gate, input gate, and output gate respectively. $\alpha_{t},\beta_{t}$ are the scalar RESET gates which control the magnitude of the information flowing from the memory and the previous hidden states to the cell of the LSTM $\mathbf{c}_{t}$ . By controlling the flow of information into the LSTM cell, those gates will allow the model to store the sub-sequences or chunks of sequences into the memory instead of the entire context.

We use Gumbel sigmoid (Maddison et al., 2016; Jang et al., 2016) for $\alpha_{t}$ and $\beta_{t}$ due to its behavior close to binary.

As in Equation 8 empirically, we find gumbel-sigmoid to be easier to train than the regular sigmoid. The temperature of the Gumbel-sigmoid is fixed to $0.3$ in all our experiments.

The cell of the LSTM controller, $\mathbf{c}_{t}$ is computed according to the Equation 2.3 with the $\alpha_{t}$ and $\beta_{t}$ RESET gates.

The hidden state of the LSTM controller is computed as follows:

In Figure 1, we illustrate the interaction between the controller and the memory with various heads and components of the controller.

4 Micro-states and Long-term Dependencies

A micro-state of the LSTM for a particular time step is the summary of the information that has been stored in the LSTM controller of the model. By attending over the cells of the memory which contains previous micro-states of the LSTM, the model can explicitly learn to restore information from its own past.

The controller can learn to represent high-level temporal abstractions by creating wormhole connections through the memory as illustrated in Figure 2. In this example, the model takes the token $x_{0}$ at the first timestep and stores its representation to the first memory cell with address $a_{0}$ . In the second timestep, the controller takes $x_{1}$ as input and writes into the second memory cell with the address $a_{1}$ . Furthermore, $\beta_{1}$ gater blocks the connection from $\mathbf{h}_{1}$ to $\mathbf{h}_{2}$ . At the third timestep, the controller starts reading. It receives $x_{2}$ as input and reads the first memory cell where micro-state of $\mathbf{h}_{0}$ was stored. After reading, it computes the hidden-state $\mathbf{h}_{2}$ and writes the micro-state of $\mathbf{h}_{2}$ into the first memory cell. The length of the path passing through the microstates of $\mathbf{h}_{0}$ and $\mathbf{h}_{2}$ would be $1$ . The wormhole connection from $\mathbf{h}_{2}$ to $\mathbf{h}_{0}$ would skip a timestep.

A regular single-layer RNN has a fixed graphical representation of a linear-chain when considering only the connections through its recurrent states or the temporal axis. However, TARDIS is more flexible in terms of that and it can learn directed graphs with more diverse structures using the wormhole connections and the RESET gates. The directed graph that TARDIS can learn through its recurrent states have at most the degree of 4 at each vertex (maximum 2 incoming and 2 outgoing edges) and it depends on the number of cells ( $k$ ) that can be stored in the memory.

In this work, we focus on a variation of TARDIS, where the controller maintains a fixed-size external memory. However as in (Cheng et al., 2016), it is possible to use a memory that grows with respect to the length of its input sequences, but that would not scale and can be more difficult to train with discrete addressing.

Training TARDIS

In this section, we explain how to train TARDIS as a language model. We use language modeling as an example application. However, we would like to highlight that TARDIS can also be applied to any complex sequence to sequence learning tasks.

Consider $N$ training examples where each example is a sequence of length $T$ . At every time-step $t$ , the model receives the input $\mathbf{x}_{t}\in\{0,1\}^{|V|}$ which is a one-hot vector of size equal to the size of the vocabulary $|V|$ and should produce the output $\mathbf{y}_{t}\in\{0,1\}^{|V|}$ which is also a one-hot vector of size equal to the size of the vocabulary $|V|$ .

The output of the model for $i$ -th example and $t$ -th time-step is computed as follows:

where $\mathbf{W}^{o}$ is the learnable parameters and $\text{g}(\mathbf{h}_{t},\mathbf{r}_{t})$ is a single layer MLP which combines both $\mathbf{h}_{t}$ and $\mathbf{r}_{t}$ as in deep fusion by (Pascanu et al., 2013a). The task loss would be the categorical cross-entropy between the targets and model-outputs. Super-script $i$ denotes that the variable is the output for the $i^{th}$ sample in the training set.

However, the discrete decisions taken for memory access during every time-step makes the model not differentiable and hence we need to rely on approximate methods of computing gradients with respect to the discrete address vectors. In this paper we explore two such approaches: REINFORCE (Williams, 1992) and straight-through estimator (Bengio et al., 2013).

REINFORCE is a likelihood-ratio method, which provides a convenient and simple way of estimating the gradients of the stochastic actions. In this paper, we focus on application of REINFORCE on sequential prediction tasks, such as language modelling. For example $i$ , let $R(\mathbf{w}_{j}^{r(i)})$ be the reward for the action $\mathbf{w}_{j}^{r(i)}$ at timestep $j$ . We are interested in maximizing the expected return for the whole episode as defined below:

Ideally we would like to compute the gradients for Equation 13, however computing the gradient of the expectation may not be feasible. We would have to use a Monte-Carlo approximation and compute the gradients by using the REINFORCE for the sequential prediction task which can be written as in Equation 14.

where $b_{j}$ is the reward baseline. However, we can further assume that the future actions do not depend on the past rewards in the episode/trajectory and further reduce the variance of REINFORCE as in Equation 15.

In our preliminary experiments, we find out that the training of the model is easier with the discounted returns, instead of using the centered undiscounted return:

Training models with REINFORCE can be difficult, due to the variance imposed into the gradients. In the recent years, researchers have developed several tricks in order to mitigate the effect of high-variance in the gradients. As proposed by (Mnih and Gregor, 2014), we also use variance normalization on the REINFORCE gradients.

For TARDIS, reward at timestep $j$ ( $R(\mathbf{w}_{j}^{r(i)})$ ) is the log-likelihood of the prediction at that timestep. Our initial experiments showed that REINFORCE with this reward structue often tends to under-utilize the memory and mainly rely on the internal memory of the LSTM controller. Especially, in the beginning of the training model, it can just decrease the loss by relying on the memory of the controller and this can cause the REINFORCE to increase the log-likelihood of the random actions.

In order to deal with this issue, instead of using the log-likelihood of the model as reward, we introduce an auxiliary cost to use as the reward $R^{\prime}$ which is computed based on predictions which are only based on the memory cell $\mathbf{r}_{t}$ which is read by the controller and not the hidden state of the controller:

2 Using Gumbel Softmax

Training with REINFORCE can be challenging due to the high variance of the gradients, gumbel-softmax provides a good alternative with straight-through estimator for REINFORCE to tackle the variance issue. Unlike (Maddison et al., 2016; Jang et al., 2016) instead of annealing the temperature or fixing it, our model learns the inverse-temperature with an MLP $\tau(\mathbf{h}_{t})$ which has a single scalar output conditioned on the hidden state of the controller.

We replace the softmax in Equation 5 with gumbel-softmax defined above. During forward computation, we sample from $\overline{\mathbf{w}}^{r}_{t}$ and use the generated one-hot vector ${\mathbf{w}}^{r}_{t}$ for memory access. However, during backprop, we use $\overline{\mathbf{w}}^{r}_{t}$ for gradient computation and hence the entire model becomes differentiable.

Learning the temperature of the Gumbel-Softmax reduces the burden of performing extensive hyper-parameter search for the temperature.

Related Work

Neural Turing Machine (NTM) (Graves et al., 2014) is the most related class of architecture to our model. NTMs have proven to be successful in terms of generalizing over longer sequences than the sequences that it has been trained on. Also NTM has been shown to be more effective in terms of solving algorithmic tasks than the gated models such as LSTMs. However NTM can have limitations due to some of its design choices. Due to the controller’s lack of precise knowledge on the contents of the information, the contents of the memory can overlap. These memory augmented models are also known to be complicated, which yields to the difficulties in terms of implementing the model and training it. The controller has no information about the sequence of operations and the information such as frequency of the read and write access to the memory. TARDIS tries to address these issues.

Gulcehre et al. (2016) proposed a variant of NTM called dynamic NTM (D-NTM) which had learnable location based addressing. D-NTM can be used with both continuous addressing and discrete addressing. Discrete D-NTM is related to TARDIS in the sense that both models use discrete addressing for all the memory operations. However, discrete D-NTM expects the controller to learn to read/write and also learn reader/writer synchronization. TARDIS do not have this synchronization problem since both reader and writer are tied. Rae et al. (2016) proposed sparse access memory (SAM) mechanism for NTMs which can be seen as a hybrid of continuous and discrete addressing. SAM uses continuous addressing over a selected set of top- $K$ relevant memory cells. Recently, Graves et al. (2016) proposed a differentiable neural computer (DNC) which is a successor of NTM.

Rocktäschel et al. (2015) and (Cheng et al., 2016) proposed models that generate weights to attend over the previous hidden states of the RNN. However, since those models attend over the whole context, the computation of the attention can be inefficient.

Grefenstette et al. (2015) has proposed a model that can store the information in a data structure, such as in a stack, dequeue or queue in a differentiable manner.

Grave et al. (2016) has proposed to use a cache based memory representation which stores the last $k$ states of the RNN in the memory and similar to the traditional cache-based models the model learns to choose a state of the memory for the prediction in the language modeling tasks (Kuhn and De Mori, 1990).

Gradient Flow through the External Memory

In this section, we analyze the flow of the gradients through the external memory and will also investigate its efficiency in terms of dealing with the vanishing gradients problem (Hochreiter, 1991; Bengio et al., 1994). First, we describe the vanishing gradient problem in an RNN and then describe how an external memory model can deal with it. For the sake of simplicity, we will focus on vanilla RNNs during the entire analysis, but the same analysis can be extended to LSTMs. In our analysis, we also assume that the weights for the read/write heads are discrete.

We will show that the rate of the gradients vanishing through time for a memory-augmented recurrent neural network is much smaller than of a regular vanilla recurrent neural network.

where $\mathbf{W}$ and $\mathbf{U}$ are the recurrent and the input weights of the RNN respectively and $\text{f}(\cdot)$ is a non-linear activation function. Let $\mathcal{L}=\sum_{t=1}^{T}\mathcal{L}_{t}$ be the loss function that the RNN is trying to minimize. Given an input sequence of length $T$ , we can write the derivative of the loss $\mathcal{L}$ with respect to parameters ${\boldsymbol{\theta}}$ as,

The multiplication of many Jacobians in the form of $\frac{\partial\mathbf{h}_{t}}{\partial\mathbf{h}_{t-1}}$ to obtain $\frac{\partial\mathbf{h}_{t_{1}}}{\partial\mathbf{h}_{t_{0}}}$ is the main reason of the vanishing and the exploding gradients (Pascanu et al., 2013b):

Let us assume that the singular values of a matrix $\mathbf{M}$ are ordered as, $\sigma_{1}({\mathbf{M}})\geq\sigma_{2}({\mathbf{M}})\geq\dots\geq\sigma_{n}({\mathbf{M}})$ . Let $\alpha$ be an upper bound on the singular values of $\mathbf{W}$ , s.t. $\alpha\geq\sigma_{1}(\mathbf{W})$ , then the norm of the Jacobian will satisfy (Zilly et al., 2016),

Pascanu et al. (2013b) showed that for $||\frac{\partial\mathbf{h}_{t}}{\partial\mathbf{h}_{t-1}}||\leq\sigma_{1}(\frac{\partial\mathbf{h}_{t}}{\partial\mathbf{h}_{t-1}})\leq\eta<1$ , the following inequality holds:

Since $\eta<1$ and the norm of the product of Jacobians grows exponentially on $t_{1}-t_{0}$ , the norm of the gradients will vanish exponentially fast.

Now consider the MANN where the contents of the memory are linear projections of the previous hidden states as described in Equation 2. Let us assume that both reading and writing operation use discrete addressing. Let the content read from the memory at time step $t$ correspond to some memory location $i$ :

where $\mathbf{h}_{i_{t}}$ corresponds to the hidden state of the controller at some previous timestep $i_{t}$ .

Now the hidden state of the controller in the external memory model can be written as,

If the controller reads $\mathbf{M}_{t}[i]$ at time step $t$ and its memory content is $\mathbf{A}\mathbf{h}_{i_{t}}$ as described above, then the Jacobians associated with Equation 5 can be computed as follows:

where $\mathbf{Q}_{t_{1}t_{0}}$ and $\mathbf{R}_{t_{1}t_{0}}$ are defined as below,

As shown in Equation 29, Jacobians of the MANN can be rewritten as a summation of two matrices, $\mathbf{Q}_{t_{1}t_{0}}$ and $\mathbf{R}_{t_{1}t_{0}}$ . The gradients flowing through $\mathbf{R}_{t_{1}t_{0}}$ do not necessarily vanish through time, because it is the sum of jacobians computed over the shorter paths.

The norm of the Jacobian can be lower bounded as follows by using Minkowski inequality:

Assuming that the length of the dependency is very long $||\mathbf{Q}_{t_{1}t_{0}}||$ would vanish to 0. Then we will have,

As one can see that the rate of the gradients vanishing through time depends on the length of the sequence passes through $\mathbf{R}_{t_{1}t_{0}}$ . This is typically lesser than the length of the sequence passing through $\mathbf{Q}_{t_{1}t_{0}}$ . Thus the gradients vanish at lesser rate than in an RNN. In particular the rate would strictly depend on the length of the shortest paths from $t_{1}$ to $t_{0}$ , because for the long enough dependencies, gradients through the longer paths would still vanish.

We can also derive an upper bound for norm of the Jacobian as follows:

Using the result from (Loyka, 2015), we can lower bound $\sigma_{1}(\mathbf{Q}_{t_{1}t_{0}}+\mathbf{R}_{t_{1}t_{0}})$ as follows:

For long sequences we know that $\sigma_{1}(\mathbf{Q}_{t_{1}t_{0}})$ will go to (see equation 25). Hence,

The rate at which $\sigma_{1}(\mathbf{R}_{t_{1}t_{0}})$ reaches zero is strictly smaller than the rate at which $\sigma_{1}(\mathbf{Q}_{t_{1}t_{0}})$ reaches zero and with ideal memory access, it will not reach zero. Hence unlike vanilla RNNs, Equation 38 states that the upper bound of the norm of the Jacobian will not reach to zero for a MANN with ideal memory access.

Consider a memory augmented neural network with $T$ memory cells for a sequence of length $T$ , and each hidden state of the controller is stored in different cells of the memory. If the prediction at time step $t_{1}$ has only a long-term dependency to $t_{0}$ and the prediction at $t_{1}$ is independent from the tokens appear before $t_{0}$ , and the memory reading mechanism is perfect, the model will not suffer from vanishing gradients when we back-propagate from $t_{1}$ to $t_{0}$ .Let us note that, unlike an Markovian $n$ -gram assumption, here we assume that at each time step the $n$ can be different.

If the input sequence has a longest-dependency to $t_{0}$ from $t_{1}$ , we would only be interested in gradients propagating from $t_{1}$ to $t_{0}$ and the Jacobians from $t_{1}$ to $t_{0}$ , i.e. $\frac{\partial\mathbf{h}_{t_{1}}}{\partial\mathbf{h}_{t_{0}}}$ . If the controller learns a perfect reading mechanism at time step $t_{1}$ it would read memory cell where the hidden state of the RNN at time step $t_{0}$ is stored at. Thus following the jacobians defined in the Equation 29, we can rewrite the jacobians as,

In Equation 39, the first two terms might vanish as $t_{1}-t_{0}$ grows. However, the singular values of the third term do not change as $t_{1}-t_{0}$ grows. As a result, the gradients propagated from $t_{1}$ to $t_{0}$ will not necessarily vanish through time. However, in order to obtain stable dynamics for the network, the initialization of the matrices, $\mathbf{V}$ and $\mathbf{A}$ is important. $\square$

This analysis highlights the fact that an external memory model with optimal read/write mechanism can handle long-range dependencies much better than an RNN. However, this is applicable only when we use discrete addressing for read/write operations. Both NTM and D-NTM still have to learn how to read and write from scratch which is a challenging optimization problem. For TARDIS tying the read/write operations make the learning to become much simpler for the model. In particular, the results of the Theorem 1 points the importance of coming up with better ways of designing attention mechanisms over the memory.

The controller of a MANN may not be able learn to use the memory efficiently. For example, some cells of the memory may remain empty or may never be read. The controller can overwrite the memory cells which have not been read. As a result the information stored in those overwritten memory cells can be lost completely. However TARDIS avoids most of these issues by the construction of the algorithm.

On the Length of the Paths Through the Wormhole Connections

As we have discussed in Section 5, the rate at which the gradients vanish for a MANN depends on the length of the paths passing along the wormhole connections. In this section we will analyse those lengths in depth for untrained models such that the model will assign uniform probability to read or write all memory cells. This will give us a better idea on how each untrained model uses the memory at the beginning of the training.

A wormhole connection can be created by reading a memory cell and writing into the same cell in TARDIS. For example, in Figure 2, while the actual path from $\mathbf{h}_{4}$ to $\mathbf{h}_{0}$ is of length 4, memory cell $\mathbf{a}_{0}$ creates a shorter path of length 2 $(\mathbf{h}_{0}\rightarrow\mathbf{h}_{2}\rightarrow\mathbf{h}_{4})$ . We call the length of the actual path as $T$ and length of the shorter path created by wormhole connection as $T_{mem}$ .

Consider a TARDIS model which has $k$ cells in its memory. If TARDIS access each memory cell uniformly random, then the probability of accessing a random cell $i$ , $p[i]~{}=~{}\frac{1}{k}$ . The expected length of the shorter path created by wormhole connections ( $T_{mem}$ ) would be proportional to the number of reads and writes into a memory cell. For TARDIS with reader choosing a memory cell uniformly random this would be $T_{mem}=\sum_{i=k}^{T}p[i]~{}=~{}\frac{T}{k}-1$ at the end of the sequence. We verify this result by simulating the read and write heads of TARDIS as in Figure 3 a).

Now consider a MANN with separate read and write heads each accessing the memory in discrete and uniformly random fashion. Let us call it as uMANN. We will compute the expected length of the shorter path created by wormhole connections ( $T_{mem}$ ) for uMANN. $\mathbf{w}^{r}_{t}$ and $\mathbf{w}^{w}_{t}$ are the read and write head weights, each sampled from a multinomial distribution with uniform probability for each memory cells respectively. Let $j_{t}$ be the index of the memory cell read at timestep $t$ . For any memory cell $i$ , $\text{len}(\cdot)$ , defined below, is a recursive function that computes the length of the path created by wormhole connections in that cell.

This analysis shows that while TARDIS with uniform read head maintains the same expected length of the shorter path created by wormhole connections as uMANN, it completely avoids the reader/writer synchronization problem.

If $k$ is large enough, $T_{mem}<<T$ should hold. In expectation, $\sigma_{1}(\mathbf{R}_{t_{1}t_{0}})$ will decay proportionally to $T_{mem}$ whereas $\sigma_{1}(\mathbf{Q}_{t_{1}t_{0}})$ will decay proportional Exponentially when the Equation 25 holds. to $T$ . With ideal memory access, the rate at which $\sigma_{1}(\mathbf{R}_{t_{1}t_{0}})$ reaches zero would be strictly smaller than the rate at which $\sigma_{1}(\mathbf{Q}_{t_{1}t_{0}})$ reaches zero. Hence, as per Equation 38, the upper bound of the norm of the Jacobian will vanish at a much smaller rate. However, this result assumes that the dependencies which the prediction relies are accessible through the memory cell which has been read by the controller.

In the more general case, consider a MANN with $k\geq T$ . The writer just fills in the memory cells in a sequential manner and the reader chooses a memory cell uniformly at random. Let us call this model as urMANN. Let us assume that there is a dependency between two timesteps $t_{0}$ and $t_{1}$ as shown in Figure 4. If $t_{0}$ was taken uniformly between and $t_{1}-1$ , then there is a probability $0.5$ that the read address invoked at time $t_{1}$ will be greater than or equal to $t_{0}$ (proof by symmetry). In that case, the expected shortest path length through that wormhole connection would be $(t_{1}-t_{0})/2$ , but this still would not scale well. If the reader is very well trained, it could pick exactly $t_{0}$ and the path length will be 1.

Let us consider all the paths of length less than or equal to $k+1$ of the form in Figure 4. Also, let $n\leq k/2$ and $m\leq k/2$ . Then, the shortest path from $t_{0}$ to $t_{1}$ now has length $n+m+1\leq k+1$ , using a wormhole connection that connects the state at $t_{0}+n$ with the state at $t_{1}-m$ . There are $O(k^{2})$ such paths that are realized, but we leave the distribution of the length of that shortest path as an open question. However, the probability of hitting a very short path (of length less than or equal to $k+1$ ) increases exponentially with $k$ . Let the probability of the read at $t_{1}-m$ to hit the interval $(t_{0},~{}t_{0}~{}+~{}k/2)$ be $p$ . Then the probability that the shorter paths over the last $k$ reads hits that interval is $1-(1-p)^{k/2}$ , where $p$ is on the order of $k/t_{1}$ . On the other hand, the probability of not hitting that interval approaches to 0 exponentially with $k$ .

Figure 4 illustrates how wormhole connections can creater shorter paths. In Figure 5 (b), we show that the expected length of the path travelled outside the wormhole connections obtained from the simulations decreases as the size of the memory decreases. In particular, for urMANN and TARDIS the trend is very close to exponential. As shown in Figure 5 (a), this also influences the total length of the paths travelled from timestep 50 to 5 as well. Writing into the memory by using weights sampled with uniform probability for all memory cells can not use the memory as efficiently as other approaches that we compare to. In particular fixing the writing mechanism seems to be useful.

Even if the reader does not manage to learn where to read, there are many "short paths" which can considerably reduce the effect of vanishing gradients.

On Generalization over the Longer Sequences

Graves et al. (2014) have shown that the LSTMs can not generalize well on the sequences longer than the ones seen during the training. Whereas a MANN such as an NTM or a D-NTM has been shown to generalize to sequences longer than the ones seen during the training set on a set of toy tasks.

We believe that the main reason of why LSTMs typically do not generalize to the sequences longer than the ones that are seen during the training is mainly because the hidden state of an LSTM network utilizes an unbounded history of the input sequence and as a result, its parameters are optimized using the maximum likelihood criterion to fit on the sequences with lengths of the training examples. However, an n-gram language model or an HMM does not suffer from this issue. In comparison, an n-gram LM would use an input context with a fixed window size and an HMM has the Markov property in its latent space. As argued below, we claim that while being trained a MANN can also learn the ability to generalize for sequences with a longer length than the ones that appear in the training set by modifying the contents of the memory and reading from it.

A regular RNN will minimize the negative log-likelihood objective function for the targets $\mathbf{y}_{t}$ by using the unbounded history represented with the hidden state of the RNN, and it will model the parametrized conditional distribution $p(\mathbf{y}_{t}|\mathbf{h}_{t};{\boldsymbol{\theta}})$ for the prediction at timestep $t$ and a MANN would learn $p(\mathbf{y}_{t}|\mathbf{h}_{t},\mathbf{r}_{t};{\boldsymbol{\theta}})$ . If we assume that $\mathbf{r}_{t}$ represents all the dependencies that $\mathbf{y}_{t}$ depends on in the input sequence, we will have $p(\mathbf{y}_{t}|\mathbf{h}_{t},\mathbf{r}_{t};{\boldsymbol{\theta}})\approx p(\mathbf{y}_{t}|\mathbf{r}_{t},\mathbf{x}_{t};{\boldsymbol{\theta}})$ where $\mathbf{r}_{t}$ represents the dependencies in a limited context window that only contains paths shorter than the sequences seen during the training set. Due to this property, we claim that MANNs such as NTM, D-NTM or TARDIS can generalize to the longer sequences more easily. In our experiments on PennTreebank, we show that a TARDIS language model trained to minimize the log-likelihood for $p(\mathbf{y}_{t}|\mathbf{h}_{t},\mathbf{r}_{t};{\boldsymbol{\theta}})$ and on the test set both $p(\mathbf{y}_{t}|\mathbf{h}_{t},\mathbf{r}_{t};{\boldsymbol{\theta}})$ and $p(\mathbf{y}_{t}|\mathbf{r}_{t},\mathbf{x}_{t};{\boldsymbol{\theta}})$ for the same model yields to very close results. On the other hand, the fact that the best results on bAbI dataset obtained in (Gulcehre et al., 2016) is with feedforward controller and similarly in (Graves et al., 2014) feedforward controller was used to solve some of the toy tasks also confirms our hypothesis. As a result, what has been written into the memory and what has been read becomes very important to be able to generalize to the longer sequences.

Experiments

As a preliminary study on the performance of our model we consider character-level language modelling. We have evaluated our models on Penn TreeBank (PTB) corpus (Marcus et al., 1993) based on the train, valid and test used in (Mikolov et al., 2012). On this task, we are using layer-normalization (Ba et al., 2016) and recurrent dropout (Semeniuta et al., 2016) as those are also used by the SOTA results on this task. Using layer-normalization and the recurrent dropout improves the performance significantly and reduces the effects of overfitting. We train our models with Adam (Kingma and Ba, 2014) over the sequences of length 150. We show our results in Table 1.

In addition to the regular char-LM experiments, in order to confirm our hypothesis regarding to the ability of MANNs generalizing to the sequences longer than the ones seen during the training. We have trained a language model which learns $p(\mathbf{y}_{t}|\mathbf{h}_{t},\mathbf{r}_{t};{\boldsymbol{\theta}})$ by using a softmax layer as described in Equation 11. However to measure the performance of $p(\mathbf{y}_{t}|\mathbf{r}_{t},\mathbf{x}_{t};{\boldsymbol{\theta}})$ on test set, we have used the softmax layer that gets into the auxiliary cost defined for the REINFORCE as in Equation 17 for a model trained with REINFORCE and with the auxiliary cost. As in Table 1, the model’s performance by using $p(\mathbf{y}_{t}|\mathbf{h}_{t},\mathbf{r}_{t};{\boldsymbol{\theta}})$ is 1.26, however by using $p(\mathbf{y}_{t}|\mathbf{h}_{t},\mathbf{r}_{t};{\boldsymbol{\theta}})$ it becomes 1.28. This gap is small enough to confirm our assumption that $p(\mathbf{y}_{t}|\mathbf{h}_{t},\mathbf{r}_{t};{\boldsymbol{\theta}})\approx p(\mathbf{y}_{t}|\mathbf{r}_{t},\mathbf{x}_{t};{\boldsymbol{\theta}})$ .

2 Sequential Stroke Multi-digit MNIST task

In this subsection, we introduce a new pen-stroke based sequential multi-digit MNIST prediction task as a benchmark for long term dependency modelling. We also benchmark the performance of LSTM and TARDIS in this challenging task.

Recently (de Jong, 2016) introduced an MNIST pen stroke classification task and also provided dataset which consisted of pen stroke sequences representing the skeleton of the digits in the MNIST dataset. Each MNIST digit image $\mathcal{I}$ is represented as a sequence of quadruples $\{dx_{i},dy_{i},eos_{i},eod_{i}\}_{i=1}^{T}$ , where $T$ is the number of pen strokes to define the digit, $(dx_{i},dy_{i})$ denotes the pen offset from the previous to the current stroke (can be 1, -1 or 0), $eos_{i}$ is a binary valued feature to denote end of stroke and $eod_{i}$ is another binary valued feature to denote end of the digit. In the original dataset, first quadruple contains absolute value $(x,y)$ instead of offsets $(dx,dy)$ . Without loss of generality, we set the starting position $(x,y)$ to $(0,0)$ in our experiments. Each digit is represented by 40 strokes on an average and the task is to predict the digit at the end of the stroke sequence.

While this dataset was proposed for incremental sequence learning in (de Jong, 2016), we consider the multi-digit version of this dataset to benchmark models that can handle long term dependencies. Specifically, given a sequence of pen-stroke sequences, the task is to predict the sequence of digits corresponding to each pen-stroke sequences in the given order. This is a challenging task since it requires the model to learn to predict the digit based on the pen-stroke sequence, count the number of digits and remember them and generate them in the same order after seeing all the strokes. In our experiments we consider 3 versions of this task with 5,10, and 15 digit sequences respectively. We generated 200,000 training data points by randomly sampling digits from the training set of the MNIST dataset. Similarly we generated 20,000 validation and test data points by randomly sampling digits from the validation set and test set of the MNIST dataset respectively. Average length of the stroke sequences in each of these tasks are 199, 399, and 599 respectively.

2.2 Results

We benchmark the performance of LSTM and TARDIS in this new task. Both models receive the sequence of pen strokes and at the end of the sequence are expected to generate the sequence of digits followed by a particular token. The tasks is illustrated in Figure 6. We evaluate the models based on per-digit error rate. We also compare the performance of TARDIS with REINFORCE with that of TARDIS with gumbel softmax. All the models were trained for same number of updates with early stopping based on the per-digit error rate in the validation set. Results for all 3 versions of the task are reported in Table-2. From the table, we can see that TARDIS performs better than LSTM in all the three versions of the task. Also TARDIS with gumbel-softmax performs slightly better than TARDIS with REINFORCE, which is consistent with our other experiments.

We also compare the learning curves of all the three models in Figure-7. From the figure we can see that TARDIS learns to solve the task faster that LSTM by effectively utilizing the given memory slots. Also, TARDIS with gumbel softmax converges faster than TARDIS with REINFORCE.

3 NTM Tasks

Graves et al. (2014) proposed associative recall and the copy tasks to evaluate a model’s ability to learn simple algorithms and generalize to the sequences longer than the ones seen during the training. We trained a TARDIS model with 4 features for the address and 32 features for the memory content part of the model. We used a model with hidden state of size 120. Our model uses a memory of size 16. We train our model with Adam and used the learning rate of 3e-3. We show the results of our model in Table 3. TARDIS model was able to solve the both tasks, both with Gumbel-softmax and REINFORCE.

4 Stanford Natural Language Inference

Bowman et al. (2015) proposed a new task to test the machine learning algorithms’ ability to infer whether two given sentences entail, contradict or are neutral(semantic independence) from each other. However, this task can be considered as a long-term dependency task, if the premise and the hypothesis are presented to the model in sequential order as also explored by Rocktäschel et al. (2015). Because the model should learn the dependency relationship between the hypothesis and the premise. Our model first reads the premise, then the hypothesis and at the end of the hypothesis the model predicts whether the premise and the hypothesis contradicts or entails. The model proposed by Rocktäschel et al. (2015), applies attention over its previous hidden states over premise when it reads the hypothesis. In that sense their model can still be considered to have some task-specific architectural design choice. TARDIS and our baseline LSTM models do not include any task-specific architectural design choices. In Table 4, we compare the results of different models. Our model, performs significantly better than other models. However recently it has been shown that with architectural tweaks, it is possible to design a model specifically to solve this task and achieve 88.2% test accuracy (Chen et al., 2016).

Conclusion

In this paper, we propose a simple and efficient memory augmented neural network model which can perform well both on algorithmic tasks and more realistic tasks. Unlike the previous approaches, we show better performance on real-world NLP tasks, such as language modelling and SNLI. We have also proposed a new task to measure the performance of the models dealing with long-term dependencies.

We provide a detailed analysis on the effects of using external memory for the gradients and justify the reason why MANNs generalize better on the sequences longer than the ones seen in the training set. We have also shown that the gradients will vanish at a much slower rate (if they vanish) when an external memory is being used. Our theoretical results should encourage further studies in the direction of developing better attention mechanisms that can create wormhole connections efficiently.

We thank Chinnadhurai Sankar for suggesting the phrase "wormhole connections" and proof-reading the paper. We would like to thank Dzmitry Bahdanau for the comments and feedback for the earlier version of this paper. We would like to also thank the developers of Theano http://deeplearning.net/software/theano/, for developing such a powerful tool for scientific computing Theano Development Team (2016). We acknowledge the support of the following organizations for research funding and computing support: NSERC, Samsung, Calcul Québec, Compute Canada, the Canada Research Chairs and CIFAR. SC is supported by a FQRNT-PBEEE scholarship.