On the Parameterization and Initialization of Diagonal State Space Models

Albert Gu, Ankit Gupta, Karan Goel, Christopher Ré

Introduction

A core class of models in modern deep learning are sequence models, which are parameterized mappings operating on arbitrary sequences of inputs. Recent approaches based on state space models (SSMs) have outperformed traditional deep sequence models such as recurrent neural networks (RNNs), convolutional neural networks (CNNs), and Transformers, in both computational efficiency and modeling ability. In particular, the S4 model displayed strong results on a range of sequence modeling tasks, especially on long sequences . Its ability to model long-range dependencies arises from being defined with a particular state matrix called the “HiPPO matrix” , which allows S4 to be viewed as a convolutional model that decomposes an input onto an orthogonal system of smooth basis functions.

However, beyond its theoretical interpretation, actually computing S4 as a deep learning model requires a sophisticated algorithm with many linear algebraic techniques that are difficult to understand and implement. These techniques were necessitated by parameterizing its state matrix as a diagonal plus low-rank (DPLR) matrix, which is necessary to capture HiPPO matrices. A natural question is whether simplifications of this parameterization and algorithm are possible. In particular, removing the low-rank term would result in a diagonal state space model (diagonal SSM) that is dramatically simpler to implement and understand.

Although it is known that almost all SSMs have an equivalent diagonal form—and therefore (complex) diagonal SSMs are fully expressive algebraically—they may not represent all SSMs numerically, and finding a good initialization is critical. Gu et al. showed that it is difficult to find a performant diagonal SSM, and that many alternative parameterizations of the state matrix – including by random diagonal matrices – are much less effective empirically, which motivated the necessity of the more complicated HiPPO matrix. However, recently Gupta made the empirical observation that a variant of S4 using a particular diagonal matrix is nearly as effective as the original S4 method. This matrix is based on the original HiPPO matrix and is defined by simply chopping off the low-rank term in the DPLR representation.

The discovery of performant diagonal state matrices opens up new possibilities for simplifying deep state space models, and consolidating models such as S4 and DSS to understand and improve them. First, the strongest version of DSS computes the SSM with a complex-valued softmax that complicates the algorithm, and is actually less efficient than S4. Additionally, DSS and S4 differ in several auxiliary aspects of parameterizing SSMs that can conflate performance effects, making it more difficult to isolate the core effects of diagonal versus DPLR state matrices. Most importantly, DSS relies on initializing the state matrix to a particular approximation of S4’s HiPPO matrix. While S4’s matrix has a mathematical interpretation for addressing long-range dependencies, the efficacy of the diagonal approximation to it remains theoretically unexplained.

In this work, we seek to systematically understand how to train diagonal SSMs. We introduce the S4D method, a diagonal SSM which combines the best of S4’s computation and parameterization and DSS’s initialization, resulting in a method that is extremely simple, theoretically princpled, and empirically effective.

First, we describe S4D, a simple method outlined by S4 for computing diagonal instead of DPLR matrices, which is based on Vandermonde matrix multiplication and is even simpler and more efficient than the DSS. Outside of the core state matrix, we categorize different representations of the other components of SSMs, introducing flexible design choices that capture both S4 and DSS and allow different SSM parameterizations to be systematically compared (Section 3).

We provide a new mathematical analysis of DSS’s initialization, showing that the diagonal approximation of the original HiPPO matrix surprisingly produces the same dynamics as S4 when the state size goes to infinity. We propose even simpler variants of diagonal SSMs using different initializations of the state matrix (Section 4).

We perform a controlled study of these various design choices across many domains, tasks, and sequence lengths, and additionally compare diagonal (S4D) versus DPLR (S4) variants. Our best S4D methods are competitive with S4 on almost all settings, with near state-of-the-art results on image, audio, and medical time series benchmarks, and achieving 85% on the Long Range Arena benchmark (Section 5).

Background

S4 investigated state space models (1) that are parameterized maps on signals u(t)y(t)u(t)\mapsto y(t). These SSMs are linear time-invariant systems that can be represented either as a linear ODE (equation (1)) or convolution (equation (2)).

An intuitive way to view the convolution kernel (2) is to interpret it as a linear combination (controlled by C\bm{C}) of basis kernels Kn(t)K_{n}(t) (controlled by A,B\bm{A},\bm{B})

We denote this basis as K(t)=KA,B(t)=etABK(t)=K_{\bm{A},\bm{B}}(t)=e^{t\bm{A}}\bm{B} if necessary to disambiguate; note that it is a vector of NN functions. In the case of diagonal SSMs, each function Kn(t)K_{n}(t) is just etAnBne^{t\bm{A}_{n}}\bm{B}_{n}.

As a deep learning model, SSMs have many elegant properties with concrete empirical and computational benefits . For example, the convolutional form (2) can be converted into a temporal recurrence that is substantially faster for autoregressive applications .

However, making SSMs effective required overcoming two key challenges: choosing appropriate values for the matrices, and computing the kernel (2) efficiently.

First, Gu et al. showed that naive instantiations of the SSM do not perform well, and instead relied on a particular (real-valued) matrix A\bm{A} called the HiPPO-LegS matrix (4).HiPPO also specifies formulas for B\bm{B}, but the state matrix A\bm{A} is more important. There are many other HiPPO instantiations besides LegS, but HiPPO-LegS is the main one that S4 uses and the term “HiPPO matrix” without the suffix refers to this one. These matrices were derived so that the basis kernels Kn(t)K_{n}(t) have closed-form formulas Ln(et)L_{n}(e^{-t}), where Ln(t)L_{n}(t) are normalized Legendre polynomials. Consequently, the SSM has a mathematical interpretation of decomposing the input signal u(t)u(t) onto a set of infinitely-long basis functions that are orthogonal respect to an exponentially-decaying measure, giving it long-range modeling abilities .

Second, S4 introduced a particular parameterization that decomposed this A\bm{A} matrix into the sum of a normal and rank-1 matrix (5), which can be unitarily conjugated into a (complex) diagonal plus rank-1 matrix. Leveraging this structured form, they then introduced a sophisticated algorithm for efficiently computing the convolution kernel (2) for state matrices that are diagonal plus low-rank (DPLR).

S4 was originally motivated by searching for a diagonal state matrix, which would be even more structured and result in very simple computation of the SSM. However, the HiPPO-LegS matrix cannot be stably transformed into diagonal form [9, Lemma 3.2], and they were unable to find any diagonal matrices that performed well, resulting in the DPLR formulation.

Gupta made the surprising empirical observation that simply removing the low-rank portion of the DPLR form of the HiPPO-LegS matrix results in a diagonal matrix that performs comparably to the original S4 method. More precisely, their initialization is the diagonal matrix A(D)\bm{A}^{(D)}, or the diagonalization of A(N)\bm{A}^{(N)} in (5). They termed A(N)\bm{A}^{(N)} the skew-HiPPO matrix, which we will also call the normal-HiPPO matrix. To be more specific and disambiguate these variants, we may also call A(N)\bm{A}^{(N)} the HiPPO-LegS-N or HiPPO-N matrix and A(D)\bm{A}^{(D)} the HiPPO-LegS-D or HiPPO-D matrix.

In addition to this initialization, they proposed a method for computing a diagonal SSM kernel. Beyond these two core differences, several other aspects of their parameterization differ from S4’s.

In Sections 3 and 4, we systematically study the components of DSS: we categorize different ways to parameterize and compute the diagonal state space, and explain the theoretical interpretion of this particular diagonal A\bm{A} matrix.

Because there are several different concrete matrices with different naming conventions, this table summarizes these special matrices and ways to refer to them.

Parameterizing Diagonal State Spaces

We describe various choices for the computation and parameterization of diagonal state spaces. Our categorization of these choices leads to simple variants of the core method. Both DSS and our proposed S4D can be described using a combination of these factors (Section 3.4).

The true continuous-time SSM can be represented as a continuous convolution y(t)=(Ku)(t)=0CesABu(ts) ⁣ds.y(t)=(K\ast u)(t)=\int_{0}^{\infty}\bm{C}e^{s\bm{A}}\bm{B}u(t-s)\mathop{}\!ds.

In discrete time, we view an input sequence u0,u1,u_{0},u_{1},\dots as uniformly-spaced samples from an underlying function u(t)u(t) and must approximate this integral. Standard methods for doing so that preserve the convolutional structure of the model exist. The first step is to discretize the parameters. Two simple choices that have been used in prior work include

With these methods, the discrete-time SSM output is just

These integration rules have both been used in prior works (e.g. LMU and DSS use ZOH while S4 and its predecessors use bilinear ).

In Section 5, we show that there is little empirical difference between them. However, we note that there is a curious phenomenon where the bilinear transform actually perfectly smooths out the kernel used in DSS to match the S4 kernel (Section 4 Fig. 3(d)). We additionally note that numerical integration is a rich and well-studied topic and more stable methods of approximating the convolutional integral may exist. For example, it is well-known that simple rules like the Trapezoid rule can dramatically reduce numerical integration error when the function has bounded second derivative.

2 Convolution Kernel

The main computational difficulty of the original S4 model is computing the convolution kernel K\bm{\overline{K}}. This is extremely slow for general state matrices A\bm{A}, and S4 introduced a complicated algorithm for DPLR state matrices. When A\bm{A} is diagonal, the computation is nearly trivial. By (6),

where \circ is Hadamard product, \cdot is matrix multiplication, and V\mathcal{V} is known as a Vandermonde matrix. Unpacking this a little more, we can write K\bm{\overline{K}} as the following Vandermonde matrix-vector multiplication.

The naive way to compute (7) is by materializing the Vandermonde matrix VL(A)\mathcal{V}_{L}(\bm{\overline{A}}) and performing a matrix multiplication, which requires O(NL)O(NL) time and space.

However, Vandermonde matrices are well-studied and theoretically the multiplication can be computed in O~(N+L)\widetilde{O}(N+L) operations and O(N+L)O(N+L) space. In fact, Vandermonde matrices are closely related to Cauchy matrices, which are the computational core of S4’s DPLR algorithm, and have identical complexity .

The time and space complexity of computing the kernel of diagonal SSMs is equal to that of computing DPLR SSMs.

We note that on modern parallelizable hardware such as GPUs, a simple fast algorithm is to compute (7) with naive summation (using O(NL)O(NL) operations), but without materializing the Vandermonde matrix (using O(N+L)O(N+L) space). Just as with S4, this may require implementing a custom kernel in some modern deep learning frameworks such as PyTorch to achieve the space savings.

3 Parameterization

The next question is how to represent the parameters A,B,C\bm{A},\bm{B},\bm{C}.

Note that the kernel K(t)=CetABK(t)=\bm{C}e^{t\bm{A}}\bm{B} blows up to \infty as tt\to\infty if A\bm{A} has any eigenvalues with positive real part. Goel et al. found that this is a serious constraint that affects the stability of the model, especially when using the SSM as an autoregressive generative model. They propose to force the real part of A\bm{A} to be negative, also known as the left-half plane condition in classical controls, by parameterizing the real part inside an exponential function A=exp(ARe)+iAIm.\bm{A}=-\exp(\bm{A}_{Re})+i\cdot\bm{A}_{Im}.

We note that instead of exp\exp, any activation function can be used as long as its range is bounded on one side, such as ReLU, softplus, etc. The original DSS does not constrain the real part of A\bm{A}, which is sufficient for simple tasks involving fixed-length sequences, but could become unstable in other settings.

Another choice in the parameterization is how to represent B\bm{B} and C\bm{C}. Note that the computation of the final discrete convolution kernel K\bm{\overline{K}} depends only on the elementwise product BC\bm{B}\circ\bm{C} (equation (7)). Therefore DSS chose to parameterize this product directly, which they call W\bm{W}, instead of B\bm{B} and C\bm{C} individually.

However, we observe that this is equivalent to keeping independent B\bm{B} and C\bm{C}, and simply freezing B=1\bm{B}=\bm{1} while training C\bm{C}. Therefore, just as S4 has separate parameters A\bm{A}, B\bm{B}, and C\bm{C} and uses a fixed initialization for A\bm{A} and B\bm{B}, S4D also proposes separate A,B\bm{A},\bm{B}, and C\bm{C} and uses fixed initializations for A\bm{A} (discussed in Section 4) and B\bm{B} (set to 1\bm{1}). Then the difference between S4D and DSS is simply that DSS does not train B\bm{B}. In our ablations, we show that training B\bm{B} gives a minor but consistent improvement in performance.

As described in , S4 initializes C\bm{C} randomly with standard deviation 11 (in contrast to standard deep learning initializations, which scale with the dimension e.g. N12N^{-\frac{1}{2}}), which is variance-preserving for S4’s (A,B)(\bm{A},\bm{B}) as a consequence of the HiPPO theory. Because it turns out that the diagonal approximation to HiPPO has similar theoretical properties, we retain this initialization in the diagonal case.

However, if using complex numbers, this effectively doubles the state dimension and the number of parameters in B,C\bm{B},\bm{C}. Furthermore, when using a complex SSM, the output of the SSM is not guaranteed to be real even if the input is real, and similarly the convolution kernel (7) will in general be complex.

To resolve this discrepency, note that when diagonalizing a real SSM into a complex SSM (see Proposition 2), the resulting parameters always occur in conjugate pairs. Therefore we can throw out half of the parameters.

In other words, to parameterize a real SSM of state size NN, we can instead parameterize a complex SSM of state size N2\frac{N}{2}, and implicitly add back the conjugate pairs of the parameters. This ensures that the total state size and parameter count is actually the equivalent of NN real numbers, and also guarantees that the output of the kernel is real. The implementation of this is very simple; the sum in (7) will implicitly include the conjugate pairs of A,B,C\bm{A},\bm{B},\bm{C} and therefore resolve to twice the real part of the original sum.

4 S4D: the Diagonal Version of S4

A key component of our exposition is disentangling the various choices possible in representing and computing state space models. With this categorization, different choices can be mixed and matched to define variants of the core method. Table 1 compares S4, DSS, and S4D, which have a core structure and kernel computation, but have various choices of other aspects of the parameterization.

We will define the base version of S4D to match the parameterization of S4 (i.e. bilinear discretization, (A)\Re(\bm{A}) parameterized with exp\exp, trainable B\bm{B}, and HiPPO-D initialization), but many other variants are possible. Note that unlike DSS, the output of S4D would be exactly the same as masking out the low-rank component of S4’s DPLR representation. Thus comparing S4D vs. S4 is a comparison of diagonal vs. DPLR representations of A\bm{A} while controlling all other factors. In our empirical study in Section 5, we systematically ablate the effects of each of these components.

We elaborate more on the comparisons between S4, DSS, and S4D below.

The original S4 work briefly considered the diagonal case as motivation [9, Section 3.1], and explicitly mentioned the connection to Vandermonde products and the computational complexity of diagonal SSMs. However, their focus was the more complex DPLR representation because it is difficult to find a performant diagonal state matrix. Compared to S4, we fleshed out details of the Vandermonde connection and its computational complexity, which matches that of S4.

On the other hand, DSS empirically found an effective diagonal state matrix, but introduced a more complicated method based on a complex softmax for computing it. Compared to S4D, this softmax essentially normalizes by the row-sums of the Vandermonde matrix, so we may sometimes refer to this distinction as “softmax normalization”. This makes the kernel more complicated than necessary, and has a few concrete drawbacks. First, the row-normalization effectively makes the model dependent on a particular sequence length LL, and special logic is required to handle different sequence lengths. Second, it does not expose the optimal computational complexity of the method, and the original version of DSS in fact uses O(N)O(N) more memory in the kernel construction than S4(D).An early version of DSS claimed that it did not require a custom kernel while S4 does, but this is because of its extra memory usage. The PyTorch implementation of S4 has an optional custom CUDA kernel primarily to save this factor of NN in space.

S4D disentangles the discretization method from the kernel computation (equation (7)), so that any discretization can be used, whereas previous methods required a specific discretization. For example, DSS requires the zero-order hold (ZOH) discretization because the exp\exp term in the ZOH formula lends itself to be computed with a softmax. On the other hand, when A\bm{A} is not diagonal, ZOH involves a matrix exponential which can be slower to compute, so S4 uses the bilinear discretization which can be computed efficiently for DPLR matrices.

All methods can enforce any constraint on the eigenvalues of A\bm{A}. While DSS found that letting them be unconstrained has slightly better performance, our experiments find that the difference is negligible and we recommend contraining negative real part of A\bm{A} as is standard practice in control systems. This ensures stability even in unbounded autoregressive settings.

The entire S4D method is very straightforward to implement, requiring just a few lines of code each for the parameterization and initialization, kernel computation, and full forward pass (Fig. 2). This minimal model maps an input sequence of length LL to an output of the same length; given multiple input channels, independent S4D layers are broadcast over them. Other details such as the initialization of Δ\Delta and other components of the overall neural network architecture are the same as in S4 and DSS.

Finally, note that different combinations of parameterization choices can lead to slightly different implementations of the kernel. Fig. 1 illustrates the S4D kernel with ZOH discretization which can be simplified even further to just 2 lines of code.

Initialization of Diagonal State Matrices

The critical question remains: which diagonal state matrices A\bm{A} are actually effective? We comment on the limitations of diagonal SSMs, and then provide three instantiations of S4D that perform well empirically.

We first present a simplified view on the expressivity of diagonal SSMs mentioned by . First, it is well-known that almost all matrices diagonalize over the complex plane. Therefore it is critical to use complex-valued matrices in order to use diagonal SSMs.

It is also well known that the state space (A,B,C)(\bm{A},\bm{B},\bm{C}) is exactly equivalent to (i.e. expresses the same map uyu\mapsto y) the state space (V1AV,V1B,CV)(\bm{V}^{-1}\bm{A}\bm{V},\bm{V}^{-1}\bm{B},\bm{C}\bm{V}), known in the SSM literature as a state space transformation. Therefore Proposition 2 says that (almost) all SSMs are equivalent to a diagonal SSM.

However, we emphasize that Proposition 2 is about expressivity which does not guarantee strong performance of a trained model after optimization. For example, Gu et al. and Gupta show that parameterizing A\bm{A} as a dense real matrix or diagonal complex matrix, which are both fully expressive classes, performs poorly if randomly initialized.

Second, Proposition 2 does not take into account numerical representations of data, which was the original reason S4 required a low-rank correction term instead of a pure diagonalization [9, Lemma 3.2]. In Section 5.2, we also show that two different initializations with the same spectrum (i.e., are equivalent to the same diagonal A\bm{A}) can have very different performance.

The HiPPO-LegS matrix has DPLR representation A(D)PP\bm{A}^{(D)}-\bm{P}\bm{P}^{\top}, and Gupta showed that simply approximating it with A(D)\bm{A}^{(D)} works quite well (5). Our first result is providing a clean mathematical interpretation of this method. Theorem 3 shows a surprising fact that does not hold in general for DPLR matrices (Section A.1), and arises out of the special structure of this particular matrix.

Let A=A(N)PP\bm{A}=\bm{A}^{(N)}-\bm{P}\bm{P}^{\top} and B\bm{B} be the HiPPO-LegS matrices, and KA,B(t)K_{\bm{A},\bm{B}}(t) be its basis. As the state size NN\to\infty, the SSM basis KA(N),B/2(t)K_{\bm{A}^{(N)},\bm{B}/2}(t) limits to KA,B(t)K_{\bm{A},\bm{B}}(t) (Fig. 3).

Note that A(N)\bm{A}^{(N)} is then unitarily equivalent to A(D)\bm{A}^{(D)}, which preserves the stability and timescale of the system.

We define S4D-LegS to be the S4D method for this choice of diagonal A=A(D)\bm{A}=\bm{A}^{(D)}. Theorem 3 explains the empirical results in whereby this system performed quite close to S4, but was usually slightly worse. This is because DSS is a variant of S4D-LegS, which by Theorem 3 is a noisy approximation to S4-LegS. Fig. 3 illustrates this result, and also shows a curious phenomenon involving different discretization rules that is open for future work.

To further simplify S4D-LegS, we analyze the structure of A(D)=diagA\bm{A}^{(D)}=\operatorname*{diag}\langle\bm{A}\rangle in more detail. The real part is easy to understand, which follows from the analysis in :

Let the imaginary part be sorted, i.e. (A)n\Im(\bm{A})_{n} is the nn-th largest (positive) imaginary component. We empirically deduced the following conjecture for the asymptotics of the imaginary part.

As NN\to\infty, (A)01πN2+c\Im(\bm{A})_{0}\to\frac{1}{\pi}N^{2}+c where c0.5236c\approx 0.5236 is a constant. For a fixed NN, the other eigenvalues satisfy an inverse scaling in nn: (A)n=Θ(n1)\Im(\bm{A})_{n}=\Theta(n^{-1}).

Fig. 4 empirically supports this conjecture. Based on 5, we propose the initialization S4D-Inv to use the following inverse-law diagonal matrix which closely approximates S4D-LegS.

While S4D-Inv can be seen as an approximation to the original S4-LegS, we propose an even simpler scaling law for the imaginary parts that can be seen as an approximation of S4-FouT (), where the imaginary parts are simply the Fourier series frequencies (i.e. matches the diagonal part of the DPLR form of S4-FouT). Fig. 1 (Right) illustrates the S4D-Lin basis etABe^{t\bm{A}}\bm{B}, which are simply damped Fourier basis functions.

The empirical study in Section 5 performs many ablations of different diagonal initializations, showing that many natural variants of the proposed methods do not perform as well. The overall guiding principles for the diagonal state matrix A\bm{A} are twofold, which can be seen from the closed form of the basis functions Kn(t)=etAnBnK_{n}(t)=e^{t\bm{A}_{n}}\bm{B}_{n} (Eq. 3).

First, the real part of An\bm{A}_{n} controls the decay rate of the function. An=12\bm{A}_{n}=-\frac{1}{2} is a good default that bounds the basis functions by the envelope et2e^{-\frac{t}{2}}, giving a constant timescale (Fig. 1 (Right)).

Second, the imaginary part of An\bm{A}_{n} controls the oscillating frequencies of the basis function. Critically, these should be spread out, which explains why random initializations of A\bm{A} do not perform well. S4D-Inv and S4D-Lin use simple asymptotics for these imaginary components that provide interpretable bases. We believe that alternative initializations that have different mathematical interpretations may exist, which is an interesting question for future work.

Experiments

Our experimental study shows that S4D has strong performance in a wide variety of domains and tasks, including the well-studied Long Range Arena (LRA) benchmark where the best S4D variant is competitive with S4 on all tasks and significantly outperforms all non-SSM baselines.

We begin with controlled ablations of the various representations of diagonal state space models.

In Section 5.1, we compare the different methods of parameterizing and computing a diagonal state space model (Section 3).

In Section 5.2, we compare our proposed initializations of the critical A\bm{A} matrix and perform several ablations showing that simple variants can substantially degrade performance, underscoring the importance of choosing A\bm{A} carefully (Section 4).

In Section 5.3, we compare our proposed S4D methods against the original S4 method (and the variants proposed in ).

In order to study the effects of different S4 and S4D variants in a controlled setting, we propose the following protocol. We focus on three datasets covering a varied range of data modalities (image pixels, biosignal time series, audio waveforms), sequence lengths (1K, 4K, 16K), and tasks (classification and regression with bidirectional and causal models).

Sequential CIFAR (sCIFAR). CIFAR-10 images are flattened into a sequence of length 10241024, and a bidirectional sequence model is used to perform 10-way classification.

BIDMC Vital Signs. EKG and PPG signals of length 40004000 are used to predict respiratory rate (RR), heart rate (HR), and blood oxygen saturation (SpO2). We focus on SpO2 in this study.

Speech Commands (SC).We note that a line of prior work including S4 all used a smaller 10-class subset of SC, so our results on the full dataset are not directly comparable. A 1-second raw audio waveform comprising 1600016000 samples is used for 35-way spoken word classification. We use an autoregressive (AR) model to vary the setting; this causal setting more closely imitates autoregressive speech generation, where SSMs have shown recent promise .

We fix a simple architecture and training protocol that works generically. The architecture has 44 layers and hidden dimension H=128H=128, resulting in 100K\sim 100K parameters. All results are averaged over multiple seeds (full protocol and results including std. reported in Appendix B).

1 Parameterization, Computation, Discretization

Given the same diagonal SSM matrices A,B\bm{A},\bm{B}, there are many variants of how to parameterize the matrices and compute the SSM kernel described in Section 3. We ablate the different choices described in Table 1. Results are in Table 2, and show that:

Computing the model with a softmax instead of Vandermonde product does not make much difference

Training B\bm{B} is consistently slightly better

Different discretizations (Section 3.1) do not make a noticeable difference

Unrestricting the real part of A\bm{A} (Section 3.3) may be slightly better

These ablations show that for a fixed initialization (A,B)(\bm{A},\bm{B}), different aspects of parameterizing SSMs make little difference overall. This justifies the parameterization and algorithm S4D uses (Section 3.4), which preserves the choices of the original S4 model and is simpler than DSS. For the remaining of the experiments in Section 5.2 and Section 5.3, we fix the S4D parameterization and algorithm described in Section 3. Note that this computes exactly the same kernel as the original S4 algorithm when the low-rank portion is set to , allowing controlled comparisons of the critical state matrix A\bm{A} for the remainder of this section.

2 S4D Initialization Ablations

The original S4 model proposed a specific formula for the A\bm{A} matrix, and the first diagonal version used a specific matrix based on it. Our new proposed variants S4D-Inv and S4D-Lin also define precise formulas for the initialization of the A\bm{A} matrix (8). This raises the question of whether the initialization of the A\bm{A} still needs to be so precise, despite the large simplifications from the original version. We perform several natural ablations on these initializations, showing that even simple variations of the precise formula can degrade performance.

The scaling rules for the imaginary parts of S4D-Inv and S4D-Lin are simple polynomial laws, but how is the constant factor chosen and how important is it? These constants are based on approximations to HiPPO methods (e.g. 5). Note that the range of imaginary components for S4D-Inv and S4D-Lin are quite different (Fig. 4); the largest imaginary part is N2π\frac{N^{2}}{\pi} for S4D-Inv and πN\pi N for S4D-Lin.

We consider scaling all imaginary parts by a constant factor of 0.010.01 or 100.0100.0 to investigate whether the constant matters. Note that this preserves the overall shape of the basis functions (Fig. 1, dashed lines) and simply changes the frequencies, and it is not obvious that this should degrade performance. However, both changes substantially reduce the performance of S4D in all settings.

Next, we consider choosing the imaginary parts randomly. For S4D-Inv, we keep the real parts equal to 12-\frac{1}{2} and set each imaginary component to

Note that when uu is equally spaced in $$ instead of uniformly random, this exactly recovers S4D-Inv (8), so this is a sensible random approximation to it.

Similarly, we consider a variant of S4D-Lin

that is equal to equation (9) when uu is equally spaced instead of random.

Table 3(a) (Random Imag) shows that this small change causes minor degradation in performance. We additionally note that the randomly initialized imaginary ablation can be interpreted as follows. Fig. 4 shows the asymptotics of the imaginary parts of SSM matrices, where the imaginary parts of the eigenvalues correspond to y-values corresponding to uniformly spaced nodes on the x-axis. This ablation then replaces the uniform spacing on the x-axis with uniformly random x values.

We considering initializing the real part of each eigenvalue as U-\mathcal{U} instead of fixing them to 12-\frac{1}{2}. Table 3(a)(Left, Random Real) shows that this also causes minor but consistent degradation in performance on the ablation datasets. Finally, we also consider randomizing both real and imaginary parts, which degrades performance even further.

Other simple variants of initializations show that it is not just the range of the eigenvalues but the actual distribution that is important (Fig. 4). Both S4D-Inv2 and S4D-Quad have real part 12-\frac{1}{2} and imaginary part satisfying the same maximum value as 5. The S4D-Inv2 initialization uses the same formula as S4D-Inv, but replaces a 2n+12n+1 in the denominator with n+1n+1. The S4D-Quad initialization uses a polynomial law with power 22 instead of 1-1 (S4D-Inv) or 11 (S4D-Lin). (S4D-Inv2)An=12+iNπ(Nn+11)(\textbf{S4D-Inv2})\quad\bm{A}_{n}=-\frac{1}{2}+i\frac{N}{\pi}\left(\frac{N}{n+1}-1\right) (12) (S4D-Quad)An=1π(1+2n)2(\textbf{S4D-Quad})\quad\bm{A}_{n}=\frac{1}{\pi}(1+2n)^{2} (13)

We include two additional methods here that are not based on the proposed S4D-Inv or S4D-Lin methods. First, S4D-Rand uses a randomly initialized diagonal A\bm{A}, and validates that it performs poorly, in line with earlier findings . Second, S4D-Real uses a particular real initialization with An=(n+1)\bm{A}_{n}=-(n+1). This is the exact same spectrum as the original S4(-LegS) method, which validates that it is not just the diagonalization that matters, highlighting the limitations of Proposition 2.

3 Full Comparisons of S4D and S4 Methods

Table 3(b) shows the performance of all S4D and S4 variants on the ablations datasets. We observe several interesting phenomena:

Freezing the matrices performs comparably to training them on sCIFAR and BIDMC, but is substantially worse on SC. We hypothesize that this results from Δ\Delta being poorly initialized for SC, so that at initialization models do not have context over the entire sequence, and training A\bm{A} and B\bm{B} helps adjust for this. As further evidence, the finite window methods S4-LegT and S4-FouT (defined in ) have the most limited context and suffer the most when A\bm{A} is frozen.

The full DPLR versions are often slightly better than the diagonal version throughout the entire training curve. We report the validation accuracy after 1 epoch of training on sCIFAR and SC to illustrate this phenomenon. Note that this is not a consequence of having more parameters (Appendix B).

Finally, we relax the strict requirements on model size and regularization for the ablation datasets, and show the performance of S4 and S4D variants on the test sets with a larger model (architecture and training details in Appendix B) when the model size and regularization is simply increased (Table 4). We note that results for each dataset are better than the original S4 model, which was already state-of-the-art on these datasets .

We use the same hyperparameter setting for the state-of-the-art S4 model in on the Long Range Arena benchmark for testing long dependencies in sequence models. S4D variants are highly competitive on all datasets except Path-X, and outperform the S4 variants on several of them. On Path-X using this hyperparameter setting with bidirectional models, only S4D-Inv, our simpler approximation to the original S4-LegS model, achieves above random chance, and has an average of 85% on the full LRA suite, more than 30 points better than the original Transformer .

Finally, we return to the parameterization choices presented in Section 3 and ablated in Section 5.1, and ablate them once more on the difficult Path-X dataset. We use small models of between 150K and 200K parameters (differing only depending on whether B\bm{B} is trained). We fix the S4D-LegS initialization (i.e., the diagonal HiPPO initialization (5)).

We start from the base S4D parameterization based on S4: bilinear discretization, exp\exp (A)\Re(\bm{A}), trainable B\bm{B}, and no softmax (Table 1). We ablate each of these choices one at a time for the discretization, constraint on (A)\Re(\bm{A}), trainability of B\bm{B}, and normalization. We also consider the combination that defines DSS: ZOH discretization, identity (A)\Re(\bm{A}), frozen B\bm{B}, softmax normalization.

Table 6 shows that the default S4 parameterization choices are a strong baseline. As in Section 5.1, we find that most of the other choices do not make much difference:

letting (A)\Re(\bm{A}) be unconstrained has little benefit, and can theoretically cause instabilities, so we do not recommend it,

the bilinear vs. ZOH discretizations make no difference,

training B\bm{B} helps slightly, for a minor increase in parameter count and no change in speed.

Finally, on this task – unlike the easier ablation datasets in Section 5.1 – the softmax normalization of DSS actually hurts performance, and we do not recommend it in general.

Conclusion

State space models based on S4 are a promising family of models for modeling many types of sequential data, with particular strengths for continuous signals and long-range interactions. These models are a large departure from conventional sequence models such as RNNs, CNNs, and Transformers, with many new ideas and moving parts. This work provides a more in-depth exposition for all aspects of working with S4-style models, from their core structures and kernel computation algorithms, to miscellaneous choices in their parameterizations, to new theory and methods for their initialization. We systematically analyzed and ablated each of these components, and provide recommendations for building a state space model that is as simple as possible, while as theoretically principled and empirically effective as S4. We believe that S4D can be a strong generic sequence model for a variety of domains, that opens new directions for state space models theoretically, and is much more practical to understand and implement for practitioners.

We gratefully acknowledge the support of DARPA under Nos. FA86501827865 (SDH) and FA86501827882 (ASED); NIH under No. U54EB020405 (Mobilize), NSF under Nos. CCF1763315 (Beyond Sparsity), CCF1563078 (Volume to Velocity), and 1937301 (RTML); ONR under No. N000141712266 (Unifying Weak Supervision); the Moore Foundation, NXP, Xilinx, LETI-CEA, Intel, IBM, Microsoft, NEC, Toshiba, TSMC, ARM, Hitachi, BASF, Accenture, Ericsson, Qualcomm, Analog Devices, the Okawa Foundation, American Family Insurance, Google Cloud, Swiss Re, Brown Institute for Media Innovation, Department of Defense (DoD) through the National Defense Science and Engineering Graduate Fellowship (NDSEG) Program, Fannie and John Hertz Foundation, National Science Foundation Graduate Research Fellowship Program, Texas Instruments, and members of the Stanford DAWN project: Teradata, Facebook, Google, Ant Financial, NEC, VMWare, and Infosys. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright notation thereon. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views, policies, or endorsements, either expressed or implied, of DARPA, NIH, ONR, or the U.S. Government.

References

Appendix A Method Details

We prove Theorem 3, and then show why this it is a surprising result that is not true in general to low-rank perturbations of SSMs.

We start with the interpretation of the S4-LegS matrix shown in , which corresponds to Fig. 1 (Left).

Let A,B,P\bm{A},\bm{B},\bm{P} be the matrices defined in equation (4). The SSM kernels Kn(t)=enetABK_{n}(t)=\bm{e}_{n}^{\top}e^{t\bm{A}}\bm{B} have the closed form formula

where LnL_{n} are the Legendre polynomials shifted and scaled to be orthonormal on the interval $$.

The functions Ln(et)L_{n}(e^{-t}) are a complete orthonormal basis with respect to the measure ω(t)=et\omega(t)=e^{-t}.

The polynomials are defined to be orthonormal on $$, i.e.

By the change of variables t=est=e^{-s} with dtds=es\frac{dt}{ds}=-e^{-s},

Completeness follows from the fact that polynomials are complete. ∎

We start with the standard interpretation of SSMs as convolutional systems. The SSM x(t)=Ax(t)+Bu(t)x^{\prime}(t)=\bm{A}x(t)+\bm{B}u(t) is equivalent to the convolution

Defining u(t)(s)=u(ts)u^{(t)}(s)=u(t-s), we can write this as

where ω(s)=es\omega(s)=e^{-s} and p(s),q(s)ω=0p(s)q(s)ω(s) ⁣ds\langle p(s),q(s)\rangle_{\omega}=\int_{0}^{\infty}p(s)q(s)\omega(s)\mathop{}\!ds is the inner product in the Hilbert space of L2 functions with respect to measure ω\omega.

By Theorem 6, the KnK_{n} are a complete orthonormal basis in this Hilbert space. There xn(t)x_{n}(t) represents a decomposition of the function u(t)u^{(t)} with respect to this basis, and can be recovered as a linear combination of these projections

Intuitively, due to the function reconstruction interpretation of HiPPO , we can approximate u(t)u(t) using knowledge in the current state x(t)x(t). There in the limit NN\to\infty, the original SSM is equivalent to

Finally, we remark that this phenomenon where removing the low-rank correction to a DPLR matrix approximates the original dynamics, is unique to this HiPPO-LegS matrix. We note that if instead of PP\bm{P}\bm{P}^{\top}, a random rank-1 correction is added to the HiPPO-LegS matrix in Theorem 3, the resulting SSM kernels look completely different and in fact diverge rapidly as the magnitude of P\bm{P} increases (Fig. 5).

Appendix B Experiment Details

The architecture has 44 layers and hidden dimension H=128H=128, resulting in around 100K trainable parameters. The A\bm{A} and B\bm{B} parameters were tied across the HH SSM copies; therefore the S4 models have only H×{num. layers}H\times\{\text{num. layers}\} more parameters than S4D models, arising from the P\bm{P} tensor in the DPLR representation A=ΛPP\bm{A}=\bm{\Lambda}-\bm{P}\bm{P}^{\top}. This choice was made because it generally does not affect performance much, while reducing parameter count and ensuring that S4 vs. S4D models have very similar numbers of parameters.

All results are averaged over 2 or 3 seeds.

All models use learning rate 0.0040.004, 0.010.01 weight decay, and no other regularization or data augmentation. For the classification tasks (sCIFAR and SC). we use a cosine scheduler with 1 epoch warmup and decaying to . For the regression task (BIDMC), we use a multistep scheduler following .

Reported results are all best validation accuracy, except for the large models in Table 4.

Table 7 and Table 8 contain the raw results for Table 2 including standard deviations.

Figs. 8, 8 and 9 show full results comparing our proposed methods against the best models from the literature; citations indicate numbers from prior work.

Note that earlier works on the Speech Commands dataset typically use pre-processing such as MFCC features, or a 10-class subset of the full 35-class dataset . As we are not aware of a collection of strong baselines for raw waveform classification using the full dataset, we trained several baselines from scratch for Table 9. The InceptionNet, ResNet-18, and XResNet-50 models are 1D adaptations from Nonaka and Seita of popular CNN architectures for vision. The ConvNet architecture is a generic convolutional neural network that we tuned for strong performance, comprising:

Four stages, each composed of three identical residual blocks.

The first stage has model dimension (i.e. channels, in CNN nomenclature) H=64H=64. Each stage doubles the dimension of the previous stage (with a position-wise linear layer) and ends in an average pooling layer of width 44. Thus, the first stage operates on inputs of length 1638416384, dimension 6464 (the input is zero-padded from 16000 to 16384) and the last on length 256256, dimension 512512.

Each residual block has a (pre-norm) BatchNorm layer followed by a convolution layer and GeLU activation.

Convolution layers have a kernel size of 2525.

Our Long Range Arena experiments follow the same setup as the original S4 paper with some differences in model architecture and hyperparameters. The main global differences are as follows:

The original S4 layer is unidirectional or causal, which is an unnecessary constraint for the classification tasks appearing in LRA. Goel et al. propose a bidirectional version of S4 that simply concatenates two S4 convolution kernels back-to-back. We use this for all tasks.

Instead of the plateau scheduler used in , we use a cosine annealing learning rate scheduler for all tasks.

Almost all tasks used no dropout and 0.050.05 weight decay.

Almost all tasks used an architecture with 66 layers, H=256H=256 hidden units, BatchNorm, pre-norm placement of the normalization layer.

Exceptions to the above rules are described below. Full hyperparameters are in Table 10.

This dataset is grayscale sequential CIFAR-10, and the settings for this task were taken from S4’s hyperparameters on the normal sequential CIFAR-10 task. In particular, this used LayerNorm instead of BatchNorm , a larger number of hidden features HH, post-norm instead of pre-norm, and minor dropout. We note that the choice of normalization and increased HH do not make a significant difference on final performance, still attaining classification accuracy in the high 80’s. Dropout does seem to make a difference.

We used a larger state size of N=256N=256, since we hypothesized that picking up higher frequency features on this dataset would help. We also used a step scheduler that decayed the LR by 0.50.5 every 100100 epochs, following prior work .

We hypothesized that this task benefits from deeper models, because of the explicit hierarchical nature of the task, so the architecture used here had 8 layers and H=128H=128 hidden features. However, results are very close with much smaller models. We also found that post-norm generalized better than pre-norm, but results are again close (less than 1%1\% difference).

As described in , the initialization range for PathX is decreased from (Δmin,Δmax)=(0.001,0.1)(\Delta_{min},\Delta_{max})=(0.001,0.1) to (Δmin,Δmax)=(0.0001,0.01)(\Delta_{min},\Delta_{max})=(0.0001,0.01).