Simple Recurrent Units for Highly Parallelizable Recurrence

Tao Lei, Yu Zhang, Sida I. Wang, Hui Dai, Yoav Artzi

Introduction

Recurrent neural networks (RNN) are at the core of state-of-the-art approaches for a large number of natural language tasks, including machine translation (Cho et al., 2014; Bahdanau et al., 2015; Jean et al., 2015; Luong et al., 2015), language modeling (Zaremba et al., 2014; Gal and Ghahramani, 2016; Zoph and Le, 2016), opinion mining (Irsoy and Cardie, 2014), and situated language understanding (Mei et al., 2016; Misra et al., 2017; Suhr et al., 2018; Suhr and Artzi, 2018). Key to many of these advancements are architectures of increased capacity and computation. For instance, the top-performing models for semantic role labeling and translation use eight recurrent layers, requiring days to train (He et al., 2017; Wu et al., 2016b). The scalability of these models has become an important problem that impedes NLP research.

The difficulty of scaling recurrent networks arises from the time dependence of state computation. In common architectures, such as Long Short-term Memory (LSTM; Hochreiter and Schmidhuber, 1997) and Gated Recurrent Units (GRU; Cho et al., 2014), the computation of each step is suspended until the complete execution of the previous step. This sequential dependency makes recurrent networks significantly slower than other operations, and limits their applicability. For example, recent translation models consist of non-recurrent components only, such as attention and convolution, to scale model training (Gehring et al., 2017; Vaswani et al., 2017).

In this work, we introduce the Simple Recurrent Unit (SRU), a unit with light recurrence that offers both high parallelization and sequence modeling capacity. The design of SRU is inspired by previous efforts, such as Quasi-RNN (QRNN; Bradbury et al., 2017) and Kernel NN (KNN; Lei et al., 2017), but enjoys additional benefits:

SRU exhibits the same level of parallelism as convolution and feed-forward nets. This is achieved by balancing sequential dependence and independence: while the state computation of SRU is time-dependent, each state dimension is independent. This simplification enables CUDA-level optimizations that parallelize the computation across hidden dimensions and time steps, effectively using the full capacity of modern GPUs. Figure 1 compares our architecture’s runtimes to common architectures.

SRU replaces the use of convolutions (i.e., n-gram filters), as in QRNN and KNN, with more recurrent connections. This retains modeling capacity, while using less computation (and hyper-parameters).

SRU improves the training of deep recurrent models by employing highway connections (Srivastava et al., 2015) and a parameter initialization scheme tailored for gradient propagation in deep architectures.

We evaluate SRU on a broad set of problems, including text classification, question answering, translation and character-level language modeling. Our experiments demonstrate that light recurrence is sufficient for various natural language tasks, offering a good trade-off between scalability and representational power. On classification and question answering datasets, SRU outperforms common recurrent and non-recurrent architectures, while achieving 5–9x speed-up compared to cuDNN LSTM. Stacking additional layers further improves performance, while incurring relatively small costs owing to the cheap computation of a single layer. We also obtain an average improvement of 0.7 BLEU score on the English to German translation task by incorporating SRU into Transformer (Vaswani et al., 2017).

Related Work

Improving on common architectures for sequence processing has recently received significant attention (Greff et al., 2017; Balduzzi and Ghifary, 2016; Miao et al., 2016; Zoph and Le, 2016; Lee et al., 2017). One area of research involves incorporating word-level convolutions (i.e. n-gram filters) into recurrent computation (Lei et al., 2015; Bradbury et al., 2017; Lei et al., 2017). For example, Quasi-RNN (Bradbury et al., 2017) proposes to alternate convolutions and a minimalist recurrent pooling function and achieves significant speed-up over LSTM. While Bradbury et al. (2017) focus on the speed advantages of the network, Lei et al. (2017) study the theoretical characteristics of such computation and possible extensions. Their results suggest that simplified recurrence retains strong modeling capacity through layer stacking. This finding motivates the design of SRU for both high parallelization and representational power. SRU also relates to IRNN (Le et al., 2015), which uses an identity diagonal matrix to initialize hidden-to-hidden connections. SRU uses point-wise multiplication for hidden connections, which is equivalent to using a diagonal weight matrix. This can be seen as a constrained version of diagonal initialization.

Various strategies have been proposed to scale network training (Goyal et al., 2017) and to speed up recurrent networks (Diamos et al., 2016; Shazeer et al., 2017; Kuchaiev and Ginsburg, 2017). For instance, Diamos et al. (2016) utilize hardware infrastructures by stashing RNN parameters on cache (or fast memory). Shazeer et al. (2017) and Kuchaiev and Ginsburg (2017) improve the computation via conditional computing and matrix factorization respectively. Our implementation for SRU is inspired by the cuDNN-optimized LSTM (Appleyard et al., 2016), but enables more parallelism – while cuDNN LSTM requires six optimization steps, SRU achieves more significant speed-up via two optimizations.

The design of recurrent networks, such as SRU and related architectures, raises questions about representational power and interpretability (Chen et al., 2018; Peng et al., 2018). Balduzzi and Ghifary (2016) applies type-preserving transformations to discuss the capacity of various simplified RNN architectures. Recent work (Anselmi et al., 2015; Daniely et al., 2016; Zhang et al., 2016; Lei et al., 2017) relates the capacity of neural networks to deep kernels. We empirically demonstrate SRU can achieve compelling results by stacking multiple layers.

Simple Recurrent Unit

We present and explain the design of Simple Recurrent Unit (SRU) in this section. A single layer of SRU involves the following computation:

subscript𝐖𝑓subscript𝐱𝑡direct-productsubscript𝐯𝑓subscript𝐜𝑡1subscript𝐛𝑓\displaystyle=\ \sigma\left(\mathbf{W}_{f}\mathbf{x}_{t}+\mathbf{v}_{f}\odot\mathbf{c}_{t-1}+\mathbf{b}_{f}\right) (1) \displaystyle\mathbf{c}_{t}\ = ftct1+(1ft)(Wxt)\displaystyle=\ \mathbf{f}_{t}\odot\mathbf{c}_{t-1}+(1-\mathbf{f}_{t})\odot(\mathbf{W}\mathbf{x}_{t}) (2) \displaystyle\mathbf{r}_{t}\ = σ(Wrxt+vrct1+br)\displaystyle=\ \sigma\left(\mathbf{W}_{r}\mathbf{x}_{t}+\mathbf{v}_{r}\odot\mathbf{c}_{t-1}+\mathbf{b}_{r}\right) (3) \displaystyle\mathbf{h}_{t}\ = rtct+(1rt)xt\displaystyle=\ \mathbf{r}_{t}\odot\mathbf{c}_{t}+(1-\mathbf{r}_{t})\odot\mathbf{x}_{t} (4) where W\mathbf{W}, Wf\mathbf{W}_{f} and Wr\mathbf{W}_{r} are parameter matrices and vf\mathbf{v}_{f}, vr\mathbf{v}_{r}, bf\mathbf{b}_{f} and bv\mathbf{b}_{v} are parameter vectors to be learnt during training. The complete architecture decomposes to two sub-components: a light recurrence (Equation 1 and 2) and a highway network (Equation 3 and 4).

The light recurrence component successively reads the input vectors xt\mathbf{x}_{t} and computes the sequence of states ct\mathbf{c}_{t} capturing sequential information. The computation resembles other recurrent networks such as LSTM, GRU and RAN (Lee et al., 2017). Specifically, a forget gate ft\mathbf{f}_{t} controls the information flow (Equation 1) and the state vector ct\mathbf{c}_{t} is determined by adaptively averaging the previous state ct1\mathbf{c}_{t-1} and the current observation Wxt\mathbf{W}\mathbf{x}_{t} according to ft\mathbf{f}_{t} (Equation 2).

One key design decision that differs from previous gated recurrent architectures is the way ct1\mathbf{c}_{t-1} is used in the sigmoid gate. Typically, ct1\mathbf{c}_{t-1} is multiplied with a parameter matrix to compute ft\mathbf{f}_{t}, e.g., ft=σ(Wfxt+Vfct1+bf)\mathbf{f}_{t}=\sigma(\mathbf{W}_{f}\mathbf{x}_{t}+\mathbf{V}_{f}\mathbf{c}_{t-1}+\mathbf{b}_{f}). However, the inclusion of Vfct1\mathbf{V}_{f}\mathbf{c}_{t-1} makes it difficult to parallelize the state computation: each dimension of ct\mathbf{c}_{t} and ft\mathbf{f}_{t} depends on all entries of ct1\mathbf{c}_{t-1}, and the computation has to wait until ct1\mathbf{c}_{t-1} is fully computed. To facilitate parallelization, our light recurrence component uses a point-wise multiplication vfct1\mathbf{v}_{f}\odot\mathbf{c}_{t-1} instead. With this simplification, each dimension of the state vectors becomes independent and hence parallelizable.

The highway network component Srivastava et al. (2015) facilitates gradient-based training of deep networks. It uses the reset gate rt\mathbf{r}_{t} (Equation 3) to adaptively combine the input xt\mathbf{x}_{t} and the state ct\mathbf{c}_{t} produced from the light recurrence (Equation 4), where (1rt)xt(1-\mathbf{r}_{t})\odot\mathbf{x}_{t} is a skip connection that allows the gradient to directly propagate to the previous layer. Such connections have been shown to improve scalability Wu et al. (2016a); Kim et al. (2016); He et al. (2016); Zilly et al. (2017).

The combination of the two components makes the overall architecture simple yet expressive, and easy to scale due to enhanced parallelization and gradient propagation.

Despite the parallelization friendly design of SRU, a naive implementation which computes equations (1)–(4) for each step tt sequentially would not achieve SRU’s full potential. We employ two optimizations to enhance parallelism. The optimizations are performed in the context of GPU / CUDA programming, but the general idea can be applied to other parallel programming models.

We re-organize the computation of equations (1)–(4) into two major steps. First, given the input sequence {x1xL}\{\mathbf{x}_{1}\cdots\mathbf{x}_{L}\}, we batch the matrix multiplications across all time steps. This significantly improves the computation intensity (e.g. GPU utilization). The batched multiplication is:

where LL is the sequence length, URL×3d\mathbf{U}\in\mathbb{R}^{L\times 3d} is the computed matrix and dd is the hidden state size. When the input is a mini-batch of BB sequences, U\mathbf{U} would be a tensor of size (L,B,3d)(L,B,3d).

𝐔𝑙𝑖𝑗𝑑subscript𝐯𝑓delimited-[]𝑗𝑐subscript𝐛𝑓delimited-[]𝑗f=\sigma\left(\,\mathbf{U}[l,i,j+d]+\mathbf{v}_{f}[j]\times c+\mathbf{b}_{f}[j]\,\right) c=f×c+(1f)×U[l,i,j]c=f\times c+(1-f)\times\mathbf{U}[l,i,j] r=σ(U[l,i,j+d×2]+vr[j]×c+br[j])r=\sigma\left(\,\mathbf{U}[l,i,j+d\times 2]+\mathbf{v}_{r}[j]\times c+\mathbf{b}_{r}[j]\,\right) h=r×c+(1r)×x[l,i,j]h=r\times c+(1-r)\times\mathbf{x}[l,i,j] c[l,i,j]=c\mathbf{c}[l,i,j]=c h[l,i,j]=h\mathbf{h}[l,i,j]=h return h[,,]\mathbf{h}[\cdot,\cdot,\cdot] and c[,,]\mathbf{c}[\cdot,\cdot,\cdot] The second step computes the remaining point-wise operations. Specifically, we compile all point-wise operations into a single fused CUDA kernel and parallelize the computation across each dimension of the hidden state. Algorithm 1 shows the pseudo code of the forward function. The complexity of this step is O(LBd)O(L\cdot B\cdot d) per layer, where LL is the sequence length and BB is the batch size. In contrast, the complexity of LSTM is O(LBd2)O(L\cdot B\cdot d^{2}) because of the hidden-to-hidden multiplications (e.g. Vht1\mathbf{V}\mathbf{h}_{t-1}), and each dimension can not be independently parallelized. The fused kernel also reduces overhead. Without it, operations such as sigmoid activation would each invoke a separate function call, adding kernel launching latency and more data moving costs.

The implementation of a bidirectional SRU is similar: the matrix multiplications of both directions are batched, and the fused kernel handles and parallelizes both directions at the same time.

2 Initialization

Proper parameter initialization can reduce gradient propagation difficulties and hence have a positive impact on the final performance. We now describe an initialization strategy tailored for SRU.

We start by adopting common initializations derived for feed-forward networks (Glorot and Bengio, 2010; He et al., 2015). The weights of parameter matrices are drawn with zero mean and 1/d1/d variance, for instance, via the uniform distribution [3/d,+3/d][-\sqrt{3/d},+\sqrt{3/d}]. This ensures the output variance remains approximately the same as the input variance after the matrix multiplication.

However, the light recurrence and highway computation would still reduce the variance of hidden representations by a factor of 1/31/3 to 1/21/2:

and the factor converges to 1/21/2 in deeper layers (see Appendix A). This implies the output ht\mathbf{h}_{t} and the gradient would vanish in deep models. To offset the problem, we introduce a scaling correction constant α\alpha in the highway connection

direct-productsubscript𝐫𝑡subscript𝐜𝑡⋅direct-product1subscript𝐫𝑡subscript𝐱𝑡𝛼\displaystyle=\ \mathbf{r}_{t}\odot\mathbf{c}_{t}\,+\,(1-\mathbf{r}_{t})\odot\mathbf{x}_{t}\cdot\alpha\;\;, where α\alpha is set to 3\sqrt{3} such that Var[ht]Var[xt]\text{Var}[\mathbf{h}_{t}]\approx\text{Var}[\mathbf{x}_{t}] at initialization. When the highway network is initialized with a non-zero bias br=b\mathbf{b}_{r}=b, the scaling constant α\alpha can be accordingly set as:

1𝑏2\displaystyle=\ \sqrt{1+\exp(b)\times 2}\;\;. Figure 2 compares the training progress with and without the scaling correction. See Appendix A for the derivation and more discussion.

Experiments

We evaluate SRU on several natural language processing tasks and perform additional analyses of the model. The set of tasks includes text classification, question answering, machine translation, and character-level language modeling. Training time on these benchmarks ranges from minutes (classification) to days (translation), providing a variety of computation challenges.

The main question we study is the performance-speed trade-off SRU provides in comparison to other architectures. We stack multiple layers of SRU to directly substitute other recurrent, convolutional or feed-forward modules. We minimize hyper-parameter tuning and architecture engineering for a fair comparison. Such efforts have a non-trivial impact on the results, which are beyond the scope of our experiments. Unless noted otherwise, the hyperparameters are set identical to prior work.

We use six sentence classification benchmarks: movie review sentiment (MR; Pang and Lee, 2005), sentence subjectivity (SUBJ; Pang and Lee, 2004), customer reviews polarity (CR; Hu and Liu, 2004), question type (TREC; Li and Roth, 2002), opinion polarity (MPQA; Wiebe et al., 2005), and the Stanford sentiment treebank (SST; Socher et al., 2013).222We use the binary version of SST dataset.

Following Kim (2014), we use word2vec embeddings trained on 100100 billion Google News tokens. For simplicity, all word vectors are normalized to unit vectors and are fixed during training.

Setup

We stack multiple SRU layers and use the last output state to predict the class label for a given sentence. We train for 100 epochs and use the validation (i.e., development) set to select the best training epoch. We perform 10-fold cross validation for datasets that do not have a standard train-evaluation split. The result on SST is averaged over five independent trials. We use Adam (Kingma and Ba, 2014) with the default learning rate 0.001, a weight decay 0 and a hidden dimension of 128.

We compare SRU with a wide range of methods on these datasets, including various convolutional models (Kalchbrenner et al., 2014; Kim, 2014; Zhang and Wallace, 2017) and a hierarchical sentence model (Zhao et al., 2015) reported as the state of the art on these datasets (Conneau et al., 2017). Their setups are not exactly the same as ours, and may involve more tuning on word embeddings and other regularizations. We use the setup of Kim (2014) but do not fine-tune word embeddings and the learning method for simplicity. In addition, we directly compare against three baselines trained using our code base: a re-implementation of the CNN model of Kim (2014), a two-layer LSTM model and Quasi-RNN (Bradbury et al., 2017). We use the official implementation of Quasi-RNN and also implement a version with highway connection for a fair comparison. These baselines are trained using the same hyper-parameter configuration as SRU.

Results

Table 1 compares the test results on the six benchmarks. We select the best number reported in previous methods when multiple model variants were explored in their experiments. Despite our simple setup, SRU outperforms most previous methods and achieves comparable results compared to the state-of-the-art but more sophisticated model of Zhao et al. (2015). Figure 3 shows validation performance relative to training time for SRU, cuDNN LSTM and the CNN model. Our SRU implementation runs 5–9 times faster than cuDNN LSTM, and 6–40% faster than the CNN model of Kim (2014). On the movie review (MR) dataset for instance, SRU completes 100 training epochs within 40 seconds, while LSTM takes over 320 seconds.

2 Question Answering

We use the Stanford Question Answering Dataset (SQuAD; Rajpurkar et al., 2016). SQuAD is a large machine comprehension dataset that includes over 100K question-answer pairs extracted from Wikipedia articles. We use the standard train and development sets.

Setup

We use the Document Reader model of Chen et al. (2017) as our base architecture for this task. The model is a combination of word-level bidirectional RNNs and attentions, providing a good testbed to compare our bidirectional SRU implementation with other RNN components.333The current state-of-the-art models (Seo et al., 2016; Wang et al., 2017) make use of additional components such as character-level embeddings, which are not directly comparable to the setup of Chen et al. (2017). However, these models can potentially benefit from SRU since RNNs are incorporated in the model architecture.

We use the open source implementation of Document Reader in our experiments.444https://github.com/hitvoice/DrQA We train models for up to 100 epochs, with a batch size of 32 and a hidden dimension of 128. Following the author suggestions, we use the Adamax optimizer (Kingma and Ba, 2014) and variational dropout (Gal and Ghahramani, 2016) during training. We compare with two alternative recurrent components: the bidirectional LSTM adopted in the original implementation of Chen et al. (2017) and Quasi-RNN with highway connections for improved performance.

Results

Table 2 summarizes the results on SQuAD. SRU achieves 71.4% exact match and 80.2% F1 score, outperforming the bidirectional LSTM model by 1.9% (EM) and 1.4% (F1) respectively. SRU also exhibits over 5x speed-up over LSTM and 53–63% reduction in total training time. In comparison with QRNN, SRU obtains 0.8% improvement on exact match and 0.6% on F1 score, and runs 60% faster. This speed improvement highlights the impact of the fused kernel (Algorithm 1). While the QRNN baseline involves a similar amount of computation, assembling all element-wise operations of both directions in SRU achieves better GPU utilization.

3 Machine Translation

We train translation models on the WMT English\rightarrowGerman dataset, a standard benchmark for translation systems (Peitz et al., 2014; Li et al., 2014; Jean et al., 2015). The dataset consists of 4.5 million sentence pairs. We obtain the pre-tokenized dataset from the OpenNMT project (Klein et al., 2017). The sentences were tokenized using the word-piece model (Wu et al., 2016b), which generates a shared vocabulary of about 32,000 tokens. Newstest-2014 and newstest-2017 are provided and used as the validation and test sets.555https://github.com/OpenNMT/OpenNMT-tf/tree/master/scripts/wmt

Setup

We use the state-of-the-art Transformer model of Vaswani et al. (2017) as our base architecture. In the base model, a single Transformer consists of a multi-head attention layer and a bottleneck feed-forward layer. We substitute the feed-forward network using our SRU implementation:

⋅𝐖ReLU_layer𝐱𝐛\displaystyle\quad\mathbf{W}\cdot\text{ReLU\_layer}(\mathbf{x})+\mathbf{b} ours: WSRU_layer(x)+b    .\displaystyle\quad\mathbf{W}\cdot\text{SRU\_layer}(\mathbf{x})+\mathbf{b}\;\;. The intuition is that SRU can better capture sequential information as a recurrent network, and potentially achieve better performance while requiring fewer layers.

We keep the model configuration the same as Vaswani et al. (2017): the model dimension is dmodel=512d_{\text{model}}=512, the feed-forward and SRU layer has inner dimensionality dff=dsru=2048d_{\text{ff}}=d_{\text{sru}}=2048, and positional encoding Gehring et al. (2017) is applied on the input word embeddings. The base model without SRU has 6 layers, while we set the number of layers to 4 and 5 when SRU is added. Following the original setup, we use a dropout probability 0.1 for all components, except the SRU in the 5-layer model, for which we use a dropout of 0.2 as we observe stronger over-fitting in training.

We use a single NVIDIA Tesla V100 GPU for each model. The published results were obtained using 8 GPUs in parallel, which provide a large effective batch size during training. To approximate the setup, we update the model parameters every 5×\times5120 tokens and use 16,000 warm-up steps following OpenNMT suggestions. We train each model for 40 epochs (250,000 steps), and perform 3 independent trials for each model configuration. A single run takes about 3.5 days with a Tesla V100 GPU.

Results

Table 3 shows the translation results. When SRU is incorporated into the architecture, both the 4-layer and 5-layer model outperform the Transformer base model. For instance, our 5-layer model obtains an average improvement of 0.7 test BLEU score and an improvement of 0.5 BLEU score by comparing the best results of each model achieved across three runs. SRU also exhibits more stable performance, with smaller variance over 3 runs. Figure 4 further compares the validation accuracy of different models. These results confirm that SRU is better at sequence modeling compared to the original feed-forward network (FFN), requiring fewer layers to achieve similar accuracy. Finally, adding SRU does not affect the parallelization or speed of Transformer – the 4-layer model exhibits 10% speed improvement, while the 5-layer model is only 5% slower compared to the base model. We present more results and discussion in Appendix B.3.

4 Character-level Language Modeling

We use Enwik8, a large dataset for character-level language modeling. Following standard practice, we use the first 90M characters for training and the remaining 10M split evenly for validation and test.

Setup

Similar to previous work, we use a batch size of 128 and an unroll size of 100 for truncated backpropagation during training. We also experiment with an unroll size of 256 and a batch size of 64 such that each training instance has longer context. We use a non-zero highway bias br=3\mathbf{b}_{r}={-3} that is shown useful for training language model (Zilly et al., 2017). Previous methods employ different optimizers and learning rate schedulers for training. For simplicity and consistency, we use the Adam optimizer and the same learning rate scheduling (i.e., Noam scheduling) as the translation experiments. We train a maximum of 100 epochs (about 700,000 steps).

We compare various recurrent models and use a parameter budget similar to previous methods. In addition, we experiment with the factorization trick (Kuchaiev and Ginsburg, 2017) to reduce the total number of parameters without decreasing the performance. See details in Appendix B.

Results

Table 4 presents the results of SRU and other recurrent models. The 8-layer SRU model achieves validation and test bits per character (BPC) of 1.21, outperforming previous best reported results of LSTM, QRNN and recurrent highway networks (RHN). Increasing the layer of SRU to 12 and using a longer context of 256 characters in training further improves the BPC to 1.19

5 Ablation Analysis

We perform ablation analyses on SRU by successively disabling different components:

Remove the point-wise multiplication term vct1\mathbf{v}\odot\mathbf{c}_{t-1} in the forget and reset gates. The resulting variant involves less recurrence and has less representational capacity.

Disable the scaling correction by setting the constant α=1\alpha=1.

We train model variants on the classification and question answering datasets. Table 5 and Figure 5 confirm the impact of our design decisions – removing these components result in worse classification accuracies and exact match scores.

Discussion

This work presents Simple Recurrent Unit (SRU), a scalable recurrent architecture that operates as fast as feed-forward and convolutional units. We confirm the effectiveness of SRU on multiple natural language tasks ranging from classification to translation. We open source our implementation to facilitate future NLP and deep learning research.

SRU achieves high parallelization by simplifying the hidden-to-hidden dependency. This simplification is likely to reduce the representational power of a single layer and hence should be balanced to avoid performance loss. However, unlike previous work that suggests additional computation (e.g., n-gram filters) within the layer (Balduzzi and Ghifary, 2016; Bradbury et al., 2017), we argue that increasing the depth of the model suffices to retain modeling capacity. Our empirical results on various tasks confirm this hypothesis.

Acknowledgement

We thank Alexander Rush and Yoon Kim for help with machine translation experiments, and Danqi Chen for help with SQuAD experiments. We thank Adam Yala, Howard Chen, Jeremy Wohlwend, Lili Yu, Kyle Swanson and Kevin Yang for providing useful feedback on the paper and the SRU implementation. A special thanks to Hugh Perkins for his support on the experimental environment setup and Runqi Yang for answering questions about his code.

References

Appendix A Parameter Initialization Derivation

Following the derivation of Glorot and Kaiming initialization Glorot and Bengio (2010); He et al. (2015), we assume the values of each input vector xt\mathbf{x}_{t} are i.i.d with zero mean and a small variance:

We initialize each weight matrix with zero mean and a variance of 1/d1/d. After a matrix multiplication y=Wxt\mathbf{y}=\mathbf{W}\mathbf{x}_{t}, each value yiy_{i} would have

which means the scale of the values after matrix multiplication remains the same.

Let ft,if_{t,i} be the ii-th entry of the forget gate ft\mathbf{f}_{t}:

superscriptsubscript𝐰𝑓𝑖topsubscript𝐱𝑡subscript𝑣𝑓𝑖subscript𝑐𝑡1𝑖subscript𝑏𝑓𝑖\displaystyle f_{t,i}=\sigma(\mathbf{w}_{f,i}^{\top}\mathbf{x}_{t}+v_{f,i}\,c_{t-1,i}+b_{f,i})\;\;. The pre-activation value will be sufficiently close to 0 because the parameters are initialized with zero mean and small variance and the bias value is initially 0. As a result,

The state value ct,ic_{t,i} is computed according to

⋅subscript𝑓𝑡𝑖subscript𝑐𝑡1𝑖⋅1subscript𝑓𝑡𝑖superscriptsubscript𝐰𝑖topsubscript𝐱𝑡\displaystyle c_{t,i}=f_{t,i}\cdot c_{t-1,i}+(1-f_{t,i})\cdot(\mathbf{w}_{i}^{\top}\mathbf{x}_{t})\;\;. Substituting the expectation of ft,if_{t,i} in, we get:666We are ignoring the correlation between ft,i\mathbf{f}_{t,i} and ft,i\mathbf{f}_{t,i^{\prime}} here because their variance is small.

subscript𝐱𝑡2subscript𝐱𝑡14subscript𝐱𝑡28⋯\displaystyle c_{t,i}=\mathbf{w}_{i}^{\top}\left(\frac{\mathbf{x}_{t}}{2}+\frac{\mathbf{x}_{t-1}}{4}+\frac{\mathbf{x}_{t-2}}{8}+\cdots\right)\;\;. Therefore, E[ct,i]=0\text{E}[c_{t,i}]=0 as E[wx]=0\text{E}[\mathbf{w}^{\top}\mathbf{x}]=0. The variance of ct,ic_{t,i} however depends on the correlation between input vectors. When the input vectors are independent:

1superscript221superscript421superscript82⋯\displaystyle=\text{Var}[\mathbf{w}_{i}^{\top}\mathbf{x}]\left(\frac{1}{2^{2}}+\frac{1}{4^{2}}+\frac{1}{8^{2}}+\cdots\right) Var[wix]13=Var[x]/3    .\displaystyle\approx\text{Var}[\mathbf{w}_{i}^{\top}\mathbf{x}]\cdot\frac{1}{3}=\text{Var}[x]/3\;\;. However, the two vectors in the input sequence, for instance xt\mathbf{x}_{t} and xt\mathbf{x}_{t}^{\prime}, are not necessarily independent, for example because two words in an input sentence are often correlated. When the input vectors are perfectly correlated xt=xt1==x\mathbf{x}_{t}=\mathbf{x}_{t-1}=\cdots=\mathbf{x}, on the other hand,

In practice, multiple SRU layers are stacked to construct a deep network. The internal state ct\mathbf{c}_{t} and ht\mathbf{h}_{t} would be a weighted combination of inputs {x1xt}\{\mathbf{x}_{1}\cdots\mathbf{x}_{t}\}, which will increase the correlation of the state vectors at different steps. These state vectors are again fed into the next layer, and keep increasing the correlation. As a result, we expect the actual ratio between the variance of ct\mathbf{c}_{t} and that of the input of the current layer xt\mathbf{x}_{t} lies between the two derived values,

and would finally converge to the upper bound value of 1. Figure 6 confirms our expectation by computing the empirical value of Var[c]/Var[x]\text{Var}[c]/\text{Var}[x] in deep SRU networks.

A.2 Computing Var[𝐡tsubscript𝐡𝑡\mathbf{h}_{t}]

Given the result in Equation (5), we proceed to compute Var[ht]\text{Var}[\mathbf{h}_{t}]. The ii-th entry of ht\mathbf{h}_{t} is similarly computed as

⋅subscript𝑟𝑡𝑖subscript𝑐𝑡𝑖⋅1subscript𝑟𝑡𝑖subscript𝑥𝑡𝑖\displaystyle h_{t,i}=r_{t,i}\cdot c_{t,i}+(1-r_{t,i})\cdot x_{t,i} where rt,i=σ(wr,ixt+vr,ict1,i+br,i)    .\displaystyle r_{t,i}=\sigma(\mathbf{w}_{r,i}^{\top}\mathbf{x}_{t}+v_{r,i}c_{t-1,i}+b_{r,i})\;\;. The highway reset gate is not necessarily initialized with a zero bias. Let the initial bias be bb and u=wr,ixt+vr,ict1,iu=\mathbf{w}_{r,i}^{\top}\mathbf{x}_{t}+v_{r,i}c_{t-1,i} denote the rest of terms in the sigmoid function. We have E[u]=0\text{E}[u]=0 and Var[u]1\text{Var}[u]\ll 1 because xt\mathbf{x}_{t} and ct1\mathbf{c}_{t-1} have small variance.

We approximate the value of rt,ir_{t,i} using its Taylor expansion at u=0u=0:

𝑢𝑏\displaystyle\ =\ \sigma(u+b)   ebeb+1+ebu(eb+1)2\displaystyle\ \approx\ \frac{e^{b}}{e^{b}+1}+\frac{e^{b}\cdot u}{(e^{b}+1)^{2}} E[rt,i2]\displaystyle\text{E}[r_{t,i}^{2}]   e2b(eb+1)2+e2bu2(eb+1)4    .\displaystyle\ \approx\ \frac{e^{2b}}{(e^{b}+1)^{2}}+\frac{e^{2b}\cdot u^{2}}{(e^{b}+1)^{4}}\;\;. We can ignore the term with u2u^{2} since Var[u]1\text{Var}[u]\ll 1, which gives us

superscript𝑒𝑏12\displaystyle\ \approx\ \frac{e^{2b}}{(e^{b}+1)^{2}}\;\;. Substituting this result in Var[ht,i]\text{Var}[h_{t,i}],

superscriptsubscript𝑟𝑡𝑖2superscriptsubscript𝐜𝑡𝑖2superscript1subscript𝑟𝑡𝑖2superscriptsubscript𝑥𝑡𝑖2\displaystyle\ =\ \text{E}\left[r_{t,i}^{2}\mathbf{c}_{t,i}^{2}+(1-r_{t,i})^{2}x_{t,i}^{2}\right]  = e2bVar[c](eb+1)2+Var[x](eb+1)2\displaystyle\ =\ \frac{e^{2b}\cdot\text{Var}[c]}{(e^{b}+1)^{2}}+\frac{\text{Var}[x]}{(e^{b}+1)^{2}} (10) Since from (5) we have Var[x]/3Var[c]Var[x]\text{Var}[x]/3\leq\text{Var}[c]\leq\text{Var}[x], we get the bound of Var[ht,i]\text{Var}[h_{t,i}]

superscript𝑒2𝑏33superscriptsuperscript𝑒𝑏12Vardelimited-[]ℎVardelimited-[]𝑥superscript𝑒2𝑏1superscriptsuperscript𝑒𝑏12\displaystyle\frac{e^{2b}+3}{3(e^{b}+1)^{2}}\ \leq\ \frac{\text{Var}[h]}{\text{Var}[x]}\ \leq\ \frac{e^{2b}+1}{(e^{b}+1)^{2}} which is equivalent to

A.3 Computing the Scaling Constant α𝛼\alpha

Finally, we compute the scaling constant α\alpha (Section 3.2). Using the result in Equation (6), when α\alpha is introduced we get:

⋅superscript𝑒2𝑏Vardelimited-[]𝑐superscriptsuperscript𝑒𝑏12⋅superscript𝛼2Vardelimited-[]𝑥superscriptsuperscript𝑒𝑏12\displaystyle\ =\ \frac{e^{2b}\cdot\text{Var}[c]}{(e^{b}+1)^{2}}+\frac{\alpha^{2}\cdot\text{Var}[x]}{(e^{b}+1)^{2}}   e2b+α2(eb+1)2Var[x]    ,\displaystyle\ \approx\ \frac{e^{2b}+\alpha^{2}}{(e^{b}+1)^{2}}\cdot\text{Var}[x]\;\;, as Var[c]Var[x]\text{Var}[c]\rightarrow\text{Var}[x] according to Equation (5) and the empirical evaluation (Figure 6). This implies e2b+α=(1+eb)2e^{2b}+\alpha=(1+e^{b})^{2} if we want Var[h]Var[x]\text{Var}[h]\approx\text{Var}[x]. By solving for α\alpha we have

1⋅2superscript𝑒𝑏\displaystyle\alpha\ =\ \sqrt{1+2\cdot e^{b}}\;\;, and α=3\alpha=\sqrt{3} when b=0b=0.

Appendix B Experimental Details

We include additional experimental setup and results in this section.

The data and pre-processing code are obtained from the code repository of Harvard NLP.777https://github.com/harvardnlp/sent-conv-torch

We use a batch size of 32 and a dropout probability of 0.5 for all models. In addition, we increment the dropout to 0.55 or 0.6 for the 8-layer SRU model. Following the implementation of (Kim, 2014), out-of-vocabulary words that are not in the pre-trained embeddings are initialized with random vectors with values from [0.25,0.25][-0.25,0.25].

B.2 Question Answering

We use a word embedding dropout of 0.5 and a recurrent dropout of 0.2. In the setup of Chen et al. (2017), the bi-LSTM models concatenates the output of each layer and feed it to subsequent layers. This helps the gradient propagation and improves the final performance. With highway connection, this is no longer necessary. In SRU and Q-RNN (with highway), only the output of the last layer is given to subsequent layers.

B.3 Machine Translation

We use the OpenNMT PyTorch implementation for the translation experiments.Table 6 shows the list of configuration options used for training. For evaluation, we use beam size 5 and length penalty 0.6.

Table 7 shows the averaged BLEU score of each model from 20th to 40th epoch. The improvement over the Transformer base model is consistent across different epochs.

Figure 7 plots the training and validation perplexity of three models. With a higher dropout (0.2) used for the SRU, the 5-layer model gets consistent lower validation perplexity over the base model and the 4-layer model. We also see that models with SRU exhibit much faster training progress with much lower training perplexity, suggesting the models could be tuned better with further training regularization.

B.4 Character-level Language Modeling

We train all models using a weight decay of 10710^{-7} and a gradient clipping of 0.30.3. We set the learning rate factor of Noam scheduling to 33 and the warmup steps to 32,00032,000. We tune the dropout probability from {0.2,0.3}\{0.2,0.3\}.

The projection (bottleneck) trick is implemented as follows. Recall that the batched multiplication of SRU is computed as

The stacked parameter matrices on the left is re-parameterized by a low-rank factorization,

where QRdin×d\mathbf{Q}\in\mathbb{R}^{d_{in}\times d^{\prime}} and PR3dout×d\mathbf{P}\in\mathbb{R}^{3d_{out}\times d^{\prime}} are two new parameter matrices to be learned, and dd^{\prime} is the projection dimension that is much smaller than the input and output dimension of the SRU.