Liquid Structural State-Space Models

Ramin Hasani, Mathias Lechner, Tsun-Hsuan Wang, Makram Chahine, Alexander Amini, Daniela Rus

Introduction

Learning representations from sequences of data requires expressive temporal and structural credit assignment. In this space, the continuous-time neural network class of liquid time-constant networks (LTC) (Hasani et al., 2021b) has shown theoretical and empirical evidence for their expressivity and their ability to capture the cause and effect of a given task from high-dimensional sequential demonstrations (Lechner et al., 2020a; Vorbach et al., 2021). Liquid networks are nonlinear state-space models (SSMs) with an input-dependent state transition module that enables them to learn to adapt the dynamics of the model to incoming inputs, at inference, as they are dynamic causal models (Friston et al., 2003). Their complexity, however, is bottlenecked by their differential equation numerical solver that limits their scalability to longer-term sequences. How can we take advantage of LTC’s generalization and causality capabilities and scale them to competitively learn long-range sequences without gradient issues, compared to advanced recurrent neural networks (RNNs) (Rusch and Mishra, 2021; Erichson et al., 2021; Gu et al., 2020a), convolutional neural networks (CNNs) (Lea et al., 2016; Romero et al., 2021b; Cheng et al., 2022), and attention-based models (Vaswani et al., 2017)?

In this work, we set out to leverage the elegant formulation of structural state-space models (S4) (Gu et al., 2022a) to obtain linear liquid network instances that possess the approximation capabilities of both S4 and LTCs. This is because structural SSMs are shown to largely dominate advanced RNNs, CNNs, and Transformers across many data modalities such as text, sequence of pixels, audio, and time series (Gu et al., 2021, 2022a, 2022b; Gupta, 2022). Structural SSMs achieve such impressive performance by using three main mechanisms: 1) High-order polynomial projection operators (HiPPO) (Gu et al., 2020a) that are applied to state and input transition matrices to memorize signals’ history, 2) diagonal plus low-rank parametrization of the obtained HiPPO (Gu et al., 2022a), and 3) an efficient (convolution) kernel computation of an SSM’s transition matrices in the frequency domain, transformed back in time via an inverse Fourier transformation (Gu et al., 2022a).

To combine S4 and LTCs, instead of modeling sequences by linear state-space models of the form $\dot{x}=\textbf{A}\,x+\textbf{B}\,u$, $y=\textbf{C}\,x$ (as done in structural and diagonal SSMs (Gu et al., 2022a, b)), we propose to use a linearized LTC state-space model (Hasani et al., 2021b), given by the following dynamics: $\dot{x}=(\textbf{A}+\textbf{B}\,u)\,x+\textbf{B}\,u$, $y=\textbf{C}\,x$. We show that this dynamical system can also be efficiently solved via the same parametrization of S4, giving rise to an additional convolutional kernel that accounts for the similarities of lagged signals. We call the obtained model Liquid-S4. Through extensive empirical evaluation, we show that Liquid-S4 consistently leads to better generalization performance compared to all variants of S4, CNNs, RNNs, and Transformers across many time-series modeling tasks. In particular, we achieve SOTA performance on the Long Range Arena benchmark (Tay et al., 2020b) with an average of 87.32%. To sum up, we make the following contributions:

We introduce Liquid-S4, a new state-space model that encapsulates the generalization and causality capabilities of liquid networks as well as the memorization, efficiency and scalability of S4.

We achieve state-of-the-art performance on pixel-level sequence classification, text, speech recognition, and all six tasks of the Long Range Arena benchmark, with an average accuracy of 87.32%. On the full raw Speech Command recognition dataset, Liquid-S4 achieves 96.78% accuracy with a 30% reduction in parameters. Finally, on the BIDMC vital signs dataset, Liquid-S4 achieves SOTA in all modes.

Related Works

Learning Long-Range Dependencies with RNNs. Sequence modeling can be performed autoregressively with RNNs, which possess persistent states (Little, 1974) and originated from Ising (Brush, 1967) and Hopfield networks (Hopfield, 1982; Ramsauer et al., 2020). Discrete RNNs approximate continuous dynamics step by step via dependencies on the history of their hidden states, and continuous-time (CT) RNNs use ordinary differential equation (ODE) solvers to unroll their dynamics with more elaborate temporal steps (Funahashi and Nakamura, 1993).

CT-RNNs can perform remarkable credit assignment in sequence modeling problems on both regularly and irregularly sampled data (Pearson et al., 2003; Li and Marlin, 2016; Belletti et al., 2016; Roy and Yan, 2020; Foster, 1996; Amigó et al., 2012; Kowal et al., 2019), by turning the spatiotemporal dependencies into vector fields (Chen et al., 2018), enabling better generalization and expressivity (Massaroli et al., 2020; Hasani et al., 2021b). Numerous works have studied their characteristics to understand their applicability and limitations in learning sequential data and flows (Lechner et al., 2019; Dupont et al., 2019; Durkan et al., 2019; Jia and Benson, 2019; Grunbacher et al., 2021; Hanshu et al., 2020; Holl et al., 2020; Quaglino et al., 2020; Kidger et al., 2020; Hasani et al., 2020; Liebenwein et al., 2021; Gruenbacher et al., 2022).

However, when these RNNs are trained by gradient descent (Rumelhart et al., 1986; Allen-Zhu and Li, 2019; Sherstinsky, 2020), they suffer from the vanishing/exploding gradients problem, which makes it difficult to learn long-term dependencies in sequences (Hochreiter, 1991; Bengio et al., 1994). This issue occurs both in discrete RNNs, such as GRU-D with its continuous delay mechanism (Che et al., 2018) and Phased-LSTMs (Neil et al., 2016), and in continuous RNNs, such as ODE-RNNs (Rubanova et al., 2019), GRU-ODE (De Brouwer et al., 2019), Log-ODE methods (Morrill et al., 2020), which compress the input time-series by time-continuous path signatures (Friz and Victoir, 2010), neural controlled differential equations (Kidger et al., 2020), and liquid time-constant networks (LTCs) (Hasani et al., 2021b).

Numerous solutions have been proposed to resolve these gradient issues to enable long-range dependency learning. Examples include discrete gating mechanisms in LSTMs (Hochreiter and Schmidhuber, 1997; Greff et al., 2016; Hasani et al., 2019), GRUs (Chung et al., 2014), continuous gating mechanisms such as CfCs (Hasani et al., 2021a), Hawkes LSTMs (Mei and Eisner, 2017), IndRNNs (Li et al., 2018), state regularization (Wang and Niepert, 2019), unitary RNNs (Jing et al., 2019), dilated RNNs (Chang et al., 2017), long memory stochastic processes (Greaves-Tunnell and Harchaoui, 2019), recurrent kernel networks (Chen et al., 2019), Lipschitz RNNs (Erichson et al., 2021), symmetric skew decomposition (Wisdom et al., 2016), infinitely many updates in iRNNs (Kag et al., 2019), coupled oscillatory RNNs (coRNNs) (Rusch and Mishra, 2021), mixed-memory RNNs (Lechner and Hasani, 2021), and Legendre Memory Units (Voelker et al., 2019).

Learning Long-range Dependencies with CNNs and Transformers. RNNs are not the only solution to learning long-range dependencies. Continuous convolutional kernels such as CKConv (Romero et al., 2021b) and FlexConv (Romero et al., 2021a), and circular dilated CNNs (Cheng et al., 2022) have been shown to model long sequences efficiently and faster than RNNs. There has also been a large series of works showing the effectiveness of attention-based methods for modeling spatiotemporal data. Many of these models are listed in Table 1. These baselines have recently been largely outperformed by the structural state-space models (Gu et al., 2022a).

State-Space Models. SSMs are well-established frameworks to study deterministic and stochastic dynamical systems (Kalman, 1960). Their state and input transition matrices can be directly learned by gradient descent to model sequences of observations (Lechner et al., 2020b; Hasani et al., 2021b; Gu et al., 2021). In a seminal work, Gu et al. (2022a) showed that with a couple of fundamental algorithmic methods on memorization and computation of input sequences, SSMs can turn into the most powerful sequence modeling framework to date, outperforming by a significant margin advanced RNNs, temporal and continuous CNNs (Cheng et al., 2022; Romero et al., 2021b, a), and a wide variety of Transformers (Vaswani et al., 2017), as listed in Table 1.

The key to their success is threefold: a diagonal plus low-rank parameterization of the SSM transition matrix via the higher-order polynomial projection (HiPPO) matrix (Gu et al., 2020a), obtained by a scaled Legendre measure (LegS) inspired by the Legendre Memory Units (Voelker et al., 2019) to memorize input sequences; a learnable input transition matrix; and an efficient Cauchy kernel algorithm. Together, these yield the structural SSMs named S4. It was also shown recently that diagonal SSMs (S4D) (Gupta, 2022) could be as performant as S4 in learning long sequences when parametrized and initialized properly (Gu et al., 2022b, c). A new variant of S4, simplified-S4 (S5) (Smith et al., 2022), was also introduced; it tensorizes the 1-D operations of S4 to gain a more straightforward realization of SSMs. Here, we introduce Liquid-S4, which is obtained from a more expressive SSM, namely the liquid time-constant (LTC) representation (Hasani et al., 2021b), and which achieves SOTA performance across many benchmarks.

Setup and Methodology

In this section, we first revisit the necessary background to formulate our Liquid Structural State-Space Models. We then set up and sketch our technical contributions.

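We aim to design an end-to-end sequence modeling framework built by SSMs. A continuous-time SSM representation of a linear dynamical system is given by the following set of equations:

$$\dot{x}(t)=\textbf{A}\,x(t)+\textbf{B}\,u(t),\qquad y(t)=\textbf{C}\,x(t)+\textbf{D}\,u(t). \tag{1}$$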

Here, $x(t)$ is an $N$-dimensional latent state, receiving a 1-dimensional input signal $u(t)$ and computing a 1-dimensional output signal $y(t)$. $\textbf{A}^{(N\times N)}$, $\textbf{B}^{(N\times 1)}$, $\textbf{C}^{(1\times N)}$, and $\textbf{D}^{(1\times 1)}$ are the system's parameters. For the sake of brevity, throughout our analysis, we set $\textbf{D}=0$, as it can be added after the construction of our main results in the form of a skip connection (Gu et al., 2022a).

Discretization of SSMs. In order to create a sequence-to-sequence model similar to a recurrent neural network (RNN), we discretize the continuous-time representation of SSMs by the trapezoidal rule (bilinear transform) $s\leftarrow\frac{2}{\delta t}\frac{1-z^{-1}}{1+z^{-1}}$ as follows (sampling step = $\delta t$) (Gu et al., 2022a):

$$x_{k}=\overline{\textbf{A}}\,x_{k-1}+\overline{\textbf{B}}\,u_{k},\qquad y_{k}=\overline{\textbf{C}}\,x_{k}. \tag{2}$$

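This is obtained via the following modifications to the transition matrices:

$$\overline{\textbf{A}}=\Big(\textbf{I}-\frac{\delta t}{2}\textbf{A}\Big)^{-1}\Big(\textbf{I}+\frac{\delta t}{2}\textbf{A}\Big),\qquad \overline{\textbf{B}}=\Big(\textbf{I}-\frac{\delta t}{2}\textbf{A}\Big)^{-1}\delta t\,\textbf{B},\qquad \overline{\textbf{C}}=\textbf{C}. \tag{3}$$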
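As a minimal sketch of this discretization step, the snippet below applies the standard bilinear-transform formulas used by S4 to a dense state matrix. The function name `discretize_bilinear` and the scalar test system are illustrative assumptions, not part of the original implementation, which operates on structured matrices rather than dense inverses.

```python
import numpy as np

def discretize_bilinear(A, B, C, dt):
    """Bilinear (trapezoidal) discretization of x' = A x + B u, y = C x.

    Returns (A_bar, B_bar, C_bar) for the recurrence
    x_k = A_bar x_{k-1} + B_bar u_k, y_k = C_bar x_k.
    """
    I = np.eye(A.shape[0])
    # (I - dt/2 * A)^{-1} is shared by A_bar and B_bar.
    inv = np.linalg.inv(I - (dt / 2.0) * A)
    A_bar = inv @ (I + (dt / 2.0) * A)
    B_bar = inv @ (dt * B)
    C_bar = C  # the output map is unchanged by the discretization
    return A_bar, B_bar, C_bar

# Hypothetical scalar system x' = -x + u with sampling step 0.1.
A_bar, B_bar, C_bar = discretize_bilinear(
    np.array([[-1.0]]), np.array([[1.0]]), np.array([[1.0]]), dt=0.1
)
print(A_bar, B_bar)
```

For the scalar case the result can be checked by hand: $\overline{A}=0.95/1.05$ and $\overline{B}=0.1/1.05$.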

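Creating a Convolutional Representation of SSMs. The system described by (2) and (3) can be trained via gradient descent to model sequences, but only sequentially, which is not scalable. To improve this, we can write the discretized SSM in (2) as a discrete convolutional kernel. To construct the convolutional kernel, let us unroll the system (2) in time as follows, assuming a zero initial hidden state $x_{-1}=0$:

$$\begin{aligned}
x_{0}&=\overline{\textbf{B}}u_{0}, & x_{1}&=\overline{\textbf{A}}\,\overline{\textbf{B}}u_{0}+\overline{\textbf{B}}u_{1}, & x_{2}&=\overline{\textbf{A}}^{2}\overline{\textbf{B}}u_{0}+\overline{\textbf{A}}\,\overline{\textbf{B}}u_{1}+\overline{\textbf{B}}u_{2}, & &\dots\\
y_{0}&=\overline{\textbf{C}}\,\overline{\textbf{B}}u_{0}, & y_{1}&=\overline{\textbf{C}}\,\overline{\textbf{A}}\,\overline{\textbf{B}}u_{0}+\overline{\textbf{C}}\,\overline{\textbf{B}}u_{1}, & y_{2}&=\overline{\textbf{C}}\,\overline{\textbf{A}}^{2}\overline{\textbf{B}}u_{0}+\overline{\textbf{C}}\,\overline{\textbf{A}}\,\overline{\textbf{B}}u_{1}+\overline{\textbf{C}}\,\overline{\textbf{B}}u_{2}, & &\dots
\end{aligned} \tag{4}$$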

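The mapping $u_{k}\rightarrow y_{k}$ can now be formulated explicitly as a convolutional kernel:

$$y_{k}=\overline{\textbf{C}}\,\overline{\textbf{A}}^{k}\overline{\textbf{B}}u_{0}+\overline{\textbf{C}}\,\overline{\textbf{A}}^{k-1}\overline{\textbf{B}}u_{1}+\dots+\overline{\textbf{C}}\,\overline{\textbf{A}}\,\overline{\textbf{B}}u_{k-1}+\overline{\textbf{C}}\,\overline{\textbf{B}}u_{k},\qquad y=\overline{\textbf{K}}*u, \tag{5}$$

$$\overline{\textbf{K}}\in\mathbb{R}^{L}:=\mathcal{K}_{L}(\overline{\textbf{A}},\overline{\textbf{B}},\overline{\textbf{C}}):=\big(\overline{\textbf{C}}\,\overline{\textbf{A}}^{i}\,\overline{\textbf{B}}\big)_{i\in[L]}=\big(\overline{\textbf{C}}\,\overline{\textbf{B}},\ \overline{\textbf{C}}\,\overline{\textbf{A}}\,\overline{\textbf{B}},\ \dots,\ \overline{\textbf{C}}\,\overline{\textbf{A}}^{L-1}\overline{\textbf{B}}\big). \tag{6}$$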

Equation (5) is a non-circular convolutional kernel. Gu et al. (2022a) showed that under the condition that $\overline{\textbf{K}}$ is known, it could be solved very efficiently by a black-box Cauchy kernel computation pipeline.
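The equivalence between the recurrent and convolutional views can be checked numerically. The following sketch builds the kernel $(\overline{\textbf{C}}\,\overline{\textbf{A}}^{i}\,\overline{\textbf{B}})_{i\in[L]}$ naively by repeated matrix products and compares a causal, non-circular convolution against the step-by-step recurrence with $x_{-1}=0$. The helper names, the random matrices, and the sizes are illustrative assumptions; this is not the efficient Cauchy-kernel pipeline.

```python
import numpy as np

def ssm_kernel(A_bar, B_bar, C_bar, L):
    """Naive kernel K[i] = C_bar @ A_bar^i @ B_bar for i = 0..L-1."""
    K = np.empty(L)
    v = B_bar.copy()  # holds A_bar^i @ B_bar
    for i in range(L):
        K[i] = (C_bar @ v).item()
        v = A_bar @ v
    return K

def ssm_recurrence(A_bar, B_bar, C_bar, u):
    """Sequential evaluation of x_k = A_bar x_{k-1} + B_bar u_k, y_k = C_bar x_k."""
    x = np.zeros_like(B_bar)  # zero initial state x_{-1} = 0
    ys = []
    for u_k in u:
        x = A_bar @ x + B_bar * u_k
        ys.append((C_bar @ x).item())
    return np.array(ys)

rng = np.random.default_rng(0)
N, L = 4, 16
A_bar = 0.2 * rng.standard_normal((N, N))  # arbitrary, small enough to stay stable
B_bar = rng.standard_normal((N, 1))
C_bar = rng.standard_normal((1, N))
u = rng.standard_normal(L)

K = ssm_kernel(A_bar, B_bar, C_bar, L)
y_conv = np.convolve(u, K)[:L]  # causal, non-circular convolution truncated to length L
y_rec = ssm_recurrence(A_bar, B_bar, C_bar, u)
print(np.allclose(y_conv, y_rec))
```

The convolutional form removes the sequential dependency between time steps, which is what makes training on long sequences practical.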

Liquid Structural State-Space Models

In this work, we construct a convolutional kernel corresponding to a linearized version of LTCs (Hasani et al., 2021b), an expressive class of continuous-time neural networks that demonstrate attractive out-of-distribution generalizability and are dynamic causal models (Vorbach et al., 2021; Friston et al., 2003; Hasani et al., 2020). In their general form, the state of a liquid time-constant network at each time-step is given by the set of ODEs described below (Hasani et al., 2021b):

In this expression, $\textbf{x}^{(N\times 1)}(t)$ is the vector of hidden state of size $N$ , $\textbf{u}^{(m\times 1)}(t)$ is an input signal with $m$ features, $\bm{A}^{(N\times 1)}$ is a time-constant state-transition mechanism, $\bm{B}^{(N\times 1)}$ is a bias vector, and $\odot$ represents the Hadamard product. $f(.)$ is a bounded nonlinearity parametrized by $\theta$ .

Our objective is to show how the liquid time-constant (i.e., an input-dependent state transition mechanism) in state-space models can enhance their generalization capabilities by accounting for the covariance of the input samples. To do this, we linearize the LTC formulation of Eq. 7 in the following to better connect the model to SSMs.

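Linear Liquid Time-Constant State-Space Model. A linear LTC SSM can be represented by the following coupled bilinear equation, a first-order bilinear Taylor approximation (Penny et al., 2005):

$$\dot{x}(t)=\big[\textbf{A}+\textbf{B}\,u(t)\big]\,x(t)+\textbf{B}\,u(t),\qquad y(t)=\textbf{C}\,x(t). \tag{8}$$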

Similar to (1), $x(t)$ is an $N$-dimensional latent state, receiving a 1-dimensional input signal $u(t)$ and computing a 1-dimensional output signal $y(t)$. $\textbf{A}^{(N\times N)}$, $\textbf{B}^{(N\times 1)}$, and $\textbf{C}^{(1\times N)}$ are the system's parameters. Note that D is set to zero for simplicity. In (8), the first $\textbf{B}\,u(t)$ is added element-wise to A. This dynamical system allows the coefficient (state transition compartment) of the state vector $x(t)$ to be input dependent, which, as a result, allows us to realize more complex dynamics.

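Discretization of Liquid-SSMs. Similar to SSMs, Liquid-SSMs can also be discretized by a bilinear transform (trapezoidal rule) to construct a sequence-to-sequence model as follows:

$$x_{k}=\big(\overline{\textbf{A}}+\overline{\textbf{B}}\,u_{k}\big)\,x_{k-1}+\overline{\textbf{B}}\,u_{k},\qquad y_{k}=\overline{\textbf{C}}\,x_{k}. \tag{9}$$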

The discretized parameters $\overline{\textbf{A}}$, $\overline{\textbf{B}}$, and $\overline{\textbf{C}}$ are identical to those in (3), which are functions of the continuous-time coefficients A, B, and C, and the discretization step $\delta t$.

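Creating a Convolutional Representation of Liquid-SSMs. Similar to (4), we first unroll the Liquid-SSM in time to construct its convolutional kernel. By assuming $x_{-1}=0$, we have:

$$\begin{aligned}
x_{0}&=\overline{\textbf{B}}u_{0}, & y_{0}&=\overline{\textbf{C}}\,\overline{\textbf{B}}u_{0},\\
x_{1}&=\overline{\textbf{A}}\,\overline{\textbf{B}}u_{0}+\overline{\textbf{B}}u_{1}+{\color{violet}\overline{\textbf{B}}^{2}u_{0}u_{1}}, & y_{1}&=\overline{\textbf{C}}\,\overline{\textbf{A}}\,\overline{\textbf{B}}u_{0}+\overline{\textbf{C}}\,\overline{\textbf{B}}u_{1}+{\color{violet}\overline{\textbf{C}}\,\overline{\textbf{B}}^{2}u_{0}u_{1}},\\
&\ \,\vdots & &\ \,\vdots
\end{aligned}$$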

The resulting expressions of the Liquid-SSM at each time step consist of two types of weight configurations: 1. weights corresponding to the mapping of individual time instances of inputs independently, shown in black in the unrolled Liquid-SSM expressions, and 2. weights associated with all orders of auto-correlation of the input signal, shown in violet in the same expressions. The first set of weights corresponds to the convolutional kernel of the simple SSM, shown by Eq. 5 and Eq. 6, whereas the second set leads to the design of an additional input correlation kernel, which we call the liquid kernel. These kernels generate the following input-output mapping:
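The split into linear and correlation weights can be illustrated on two time steps. The sketch below assumes the discrete recurrence $x_{k}=(\overline{\textbf{A}}+\overline{\textbf{B}}u_{k})\,x_{k-1}+\overline{\textbf{B}}u_{k}$ obtained from the Liquid-SSM dynamics, and it interprets the element-wise addition of $\overline{\textbf{B}}u_{k}$ to $\overline{\textbf{A}}$ as adding $\mathrm{diag}(\overline{\textbf{B}})\,u_{k}$; that interpretation, the helper `run_ssm`, and the random test matrices are assumptions made for illustration only.

```python
import numpy as np

def run_ssm(A_bar, B_bar, C_bar, u, liquid):
    """Run the discrete recurrence from x_{-1} = 0 and return the outputs y_k.

    liquid=False: x_k = A_bar x_{k-1} + B_bar u_k
    liquid=True:  x_k = (A_bar + diag(B_bar) u_k) x_{k-1} + B_bar u_k
    """
    D = np.diag(B_bar[:, 0])  # assumption: element-wise addition acts as diag(B_bar)
    x = np.zeros_like(B_bar)
    ys = []
    for u_k in u:
        transition = A_bar + D * u_k if liquid else A_bar
        x = transition @ x + B_bar * u_k
        ys.append((C_bar @ x).item())
    return np.array(ys)

rng = np.random.default_rng(0)
N = 4
A_bar = 0.2 * rng.standard_normal((N, N))
B_bar = rng.standard_normal((N, 1))
C_bar = rng.standard_normal((1, N))
u = np.array([0.7, -1.3])  # two arbitrary input samples u_0, u_1

y_lin = run_ssm(A_bar, B_bar, C_bar, u, liquid=False)
y_liq = run_ssm(A_bar, B_bar, C_bar, u, liquid=True)

# Unrolling one step predicts a single extra second-order term at k = 1:
# C_bar (B_bar * B_bar) u_0 u_1, with the square taken element-wise.
expected_corr = (C_bar @ (B_bar * B_bar)).item() * u[0] * u[1]
print(np.isclose(y_liq[1] - y_lin[1], expected_corr))
```

At $k=0$ both models agree, and at $k=1$ the liquid model differs from the plain SSM only by the product term in $u_{0}u_{1}$, which is the lowest-order input correlation weight.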

For instance, let us assume we have a 1-dimensional input signal $u(t)$ of length $L=100$ on which we run the liquid-SSM kernel. We set the hyperparameter $\mathcal{P}=4$. This value represents the maximum order of the correlation terms we want to take into account to output a decision. This means that the signal $u_{\text{correlations}}$ will contain all combinations of second-order correlation signals $u_{i}u_{j}$ ($\binom{L+1}{2}$ terms), third-order signals $u_{i}u_{j}u_{k}$ ($\binom{L+1}{3}$ terms), and fourth-order signals $u_{i}u_{j}u_{k}u_{l}$ ($\binom{L+1}{4}$ terms). The kernel weights corresponding to this auto-correlation signal would be:
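To give a sense of scale for this example, the short sketch below evaluates the binomial counts quoted in the text for $L=100$ and $\mathcal{P}=4$. The function name `num_correlation_terms` is a hypothetical helper introduced only for this illustration.

```python
from math import comb

def num_correlation_terms(L, P):
    """Number of order-p correlation terms, binom(L + 1, p), for p = 2..P."""
    return {p: comb(L + 1, p) for p in range(2, P + 1)}

counts = num_correlation_terms(100, 4)
total = sum(counts.values())
print(counts)  # {2: 5050, 3: 166650, 4: 4082925}
print(total)   # 4254625
```

The count grows quickly with the order, which helps explain why small liquid orders are used in practice.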

How to compute Liquid-S4 kernel efficiently? Gu et al. (2022a) showed that the S4 convolution kernel could be computed efficiently using the following elegant parameterization tricks:

To obtain better representations in sequence modeling schemes by SSMs, instead of randomly initializing the transition matrix A, we can use the Normal Plus Low-Rank (NPLR) matrix below, called the HiPPO matrix (Gu et al., 2020a), which is obtained by the Scaled Legendre Measure (LegS) (Gu et al., 2021, 2022a):

The NPLR representation of this matrix is the following (Gu et al., 2022a):

Vectors $B_{n}$ and $P_{n}$ are initialized by $\bm{B_{n}}=(2n+1)^{\frac{1}{2}}$ and $\bm{P_{n}}=(n+1/2)^{\frac{1}{2}}$ (Gu et al., 2022b). Both vectors are trainable.

Furthermore, it was shown in Gu et al. (2022b) that with Decomposition 15, the eigenvalues of A might lie in the right half of the complex plane, thus resulting in numerical instability. To resolve this, Gu et al. (2022b) recently proposed to use the parametrization $\bm{\Lambda}-\bm{P}\bm{P}^{*}$ instead of $\bm{\Lambda}-\bm{P}\bm{Q}^{*}$.

Computing the powers of $\bm{A}$ in a direct calculation of the S4 kernel $\bm{\overline{K}}$ is computationally expensive. Instead, S4 computes the spectrum of $\bm{\overline{K}}$, which reduces the problem of matrix powers to matrix inverse computation (Gu et al., 2022a). S4 then computes this convolution kernel efficiently via a black-box Cauchy kernel, and recovers $\bm{\overline{K}}$ by an inverse Fourier Transform (iFFT) (Gu et al., 2022a).
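The idea of trading matrix powers for matrix inverses can be sketched with dense matrices. Evaluating the truncated generating function of the kernel at the $L$-th roots of unity $\omega$ gives $\sum_{i<L}\overline{\textbf{C}}\,\overline{\textbf{A}}^{i}\,\overline{\textbf{B}}\,\omega^{i}=\overline{\textbf{C}}\,(\textbf{I}-\overline{\textbf{A}}^{L})(\textbf{I}-\omega\overline{\textbf{A}})^{-1}\overline{\textbf{B}}$, so the spectrum needs one matrix power and one linear solve per frequency. This is only a naive illustration under assumed random matrices and hypothetical helper names; it does not implement the structured Cauchy-kernel computation of S4.

```python
import numpy as np

def kernel_direct(A_bar, B_bar, C_bar, L):
    """Reference kernel built from explicit powers: K[i] = C_bar A_bar^i B_bar."""
    K = np.empty(L)
    v = B_bar.copy()
    for i in range(L):
        K[i] = (C_bar @ v).item()
        v = A_bar @ v
    return K

def kernel_via_spectrum(A_bar, B_bar, C_bar, L):
    """Kernel recovered from its DFT, using one inverse (solve) per frequency."""
    I = np.eye(A_bar.shape[0])
    # A single matrix power, folded into the output map.
    C_tilde = C_bar @ (I - np.linalg.matrix_power(A_bar, L))
    # Roots of unity with the sign convention of np.fft.fft.
    omega = np.exp(-2j * np.pi * np.arange(L) / L)
    K_hat = np.array(
        [(C_tilde @ np.linalg.solve(I - w * A_bar, B_bar)).item() for w in omega]
    )
    return np.fft.ifft(K_hat).real  # back to the time domain

rng = np.random.default_rng(0)
N, L = 4, 32
A_bar = 0.2 * rng.standard_normal((N, N))  # spectral radius well below 1
B_bar = rng.standard_normal((N, 1))
C_bar = rng.standard_normal((1, N))

K_direct = kernel_direct(A_bar, B_bar, C_bar, L)
K_spec = kernel_via_spectrum(A_bar, B_bar, C_bar, L)
print(np.allclose(K_direct, K_spec, atol=1e-10))
```

With a diagonal plus low-rank transition matrix, each of these inverses has a closed form, which is where the Cauchy kernel enters in S4.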

$\overline{\textbf{K}}_{\text{liquid}}$ possesses a structure similar to that of the S4 kernel. In particular, we have:

The proof is given in the Appendix. Proposition 3.1 indicates that the Liquid-S4 kernel can be obtained from the precomputed S4 kernel and a Hadamard product of that kernel with the transition vector $\bm{\overline{B}}$ raised to a power set by the chosen liquid order. This is illustrated in Algorithm 1, lines 6 to 10, corresponding to a mode we call KB, which stands for Kernel $\times$ B.

Additionally, we introduce a simplified Liquid-S4 kernel that is easier to compute while being as expressive as, or even better performing than, the KB kernel. To obtain this, we replace the transition matrix $\bm{\overline{A}}$ in the Liquid-S4 kernel of Eq. 13 with an identity matrix, only for the input correlation terms. This way, the Liquid-S4 kernel for a given liquid order $p\in\mathcal{P}$ reduces to the following expression:

We call this kernel Liquid-S4 - PB, as it is obtained by powers of the vector $\bm{\overline{B}}$. The computational steps to get this kernel are outlined in Algorithm 1, lines 11 to 15.

Average LRA accuracy over the six tasks: $(62.75+89.02+91.20+89.50+94.8+96.66)/6\approx 87.32$.

Experiments with Liquid-S4

In this section, we present an extensive evaluation of Liquid-S4 on sequence modeling tasks with very long-term dependencies and compare its performance to a large series of baselines ranging from advanced Transformers and Convolutional networks to many variants of State-space models. In the following, we first outline the baseline models we compare against. We then list the datasets we evaluated these models on and finally present results and discussions.

Baselines. We consider a broad range of advanced models to compare Liquid-S4 with. These baselines include transformer variants such as vanilla Transformer (Vaswani et al., 2017), Sparse Transformers (Child et al., 2019), a Transformer model with local attention (Tay et al., 2020b), Longformer (Beltagy et al., 2020), Linformer (Wang et al., 2020), Reformer (Kitaev et al., 2019), Sinkhorn Transformer (Tay et al., 2020a), BigBird (Zaheer et al., 2020), Linear Transformer (Katharopoulos et al., 2020), and Performer (Choromanski et al., 2020). We also include architectures such as FNets (Lee-Thorp et al., 2021), Nyströmformer (Xiong et al., 2021), Luna-256 (Ma et al., 2021), H-Transformer-1D (Zhu and Soricut, 2021), and Circular Dilated Convolutional Neural Networks (CDIL) (Cheng et al., 2022). We then include a full series of state-space models and their variants such as diagonal SSMs (DSS) (Gupta, 2022), S4 (Gu et al., 2022a), S4-LegS, S4-FouT, S4-LegS/FouT (Gu et al., 2022c), S4D-LegS (Gu et al., 2022b), S4D-Inv, S4D-Lin, and the Simplified Structural State-space models (S5) (Smith et al., 2022).

Datasets. We first evaluate Liquid-S4’s performance on the well-studied Long Range Arena (LRA) benchmark (Tay et al., 2020b), where Liquid-S4 outperforms other S4 and S4D variants in every task, pushing the state-of-the-art further with an average accuracy of 87.32%. The LRA benchmark includes six tasks with sequence lengths ranging from 1k to 16k.

We then report Liquid-S4’s performance compared to other S4 and S4D variants, as well as other models, on the BIDMC Vital Signals dataset (Pimentel et al., 2016; Goldberger et al., 2000). BIDMC uses bio-marker signals of length 4000 to predict heart rate (HR), respiratory rate (RR), and blood oxygen saturation (SpO2).

We also experiment with the sCIFAR dataset that consists of the classification of flattened images in the form of 1024-long sequences into 10 classes.

Finally, we perform raw Speech Command (SC) recognition with the full 35 labels, as conducted very recently in the updated S4 article (Gu et al., 2022a). It is important to note that there is a modified Speech Command dataset that restricts the dataset to only 10 output classes and is used in a couple of works (see, for example, Kidger et al., 2020; Gu et al., 2021; Romero et al., 2021b, a). Aligned with the updated results reported in Gu et al. (2022a) and Gu et al. (2022b), we choose not to break down this dataset and use the full-sized benchmark. The SC dataset contains sequences of length 16k to be classified into 35 commands. Gu et al. (2022a) introduced a new test setting to assess the performance of models trained on 16kHz sequences when evaluated on sequences sampled at 8kHz. S4 and S4D perform exceptionally well in this zero-shot test scenario.

Table 1 depicts a comprehensive list of baselines benchmarked against each other on six long-range sequence modeling tasks in LRA. We observe that Liquid-S4 instances (all use the PB kernel with a scaled Legendre (LegS) configuration) with a small liquid order, $p$, ranging from 2 to 6, consistently outperform all baselines in all six tasks, establishing the new SOTA on LRA with an average performance of 87.32%. In particular, on ListOps, Liquid-S4 improves S4-LegS performance by more than 3%, on character-level IMDB by 2.2%, and on 1-D pixel-level classification (CIFAR) by 0.65%, while establishing the state-of-the-art on the hardest LRA task by gaining 96.54% accuracy. Liquid-S4 performs on par with improved S4 and S4D instances on both AAN and Pathfinder tasks.

The performance of SSM models is generally well beyond what advanced Transformers, RNNs, and convolutional networks achieve on LRA tasks, with the Liquid-S4 variants standing on top. It is worth noting that Liquid-S4 kernels perform better with smaller kernel sizes (see the Appendix for more details). For instance, on ListOps and IMDB, their individual Liquid-S4 kernel state-size could be as small as seven units. This significantly reduces the parameter count in Liquid-S4 in comparison to other variants.

The impact of increasing Liquid Order $p$ . Figure 1 illustrates how increasing the liquid order, $p$ , can consistently improve performance on ListOps and IMDB tasks from LRA.

Results on BIDMC Vital Signs

Table 2 demonstrates the performance of a variety of classical and advanced baseline models on the BIDMC dataset for all three prediction tasks: heart rate (HR), respiratory rate (RR), and blood oxygen saturation (SpO2). We observe that Liquid-S4 with a PB kernel of order $p=3$, $p=2$, and $p=4$ performs better than all S4 and S4D variants. It is worth noting that Liquid-S4 is built with the same parametrization as S4-LegS (which is the official S4 model reported in the updated S4 report (Gu et al., 2022a)). On RR, Liquid-S4 outperforms S4-LegS by a significant margin of 36%. On SpO2, Liquid-S4 performs 26.67% better than S4-LegS. On HR, Liquid-S4 outperforms S4-LegS by 8.7%. The hyperparameters are given in the Appendix.

Results on 1-D Pixel-level Image Classification

Similar to the previous tasks, a Liquid-S4 network with PB kernel of order $p=3$ outperforms all variants of S4 and S4D while being significantly better than Transformer and RNN baselines, as summarized in Table 3. The hyperparameters are given in the Appendix.

Results on Speech Commands

Table 4 demonstrates that Liquid-S4 with liquid order $p=2$ achieves the best performance amongst all benchmarks on the 16kHz testbed with the full dataset. Liquid-S4 also performs competitively on the half-frequency zero-shot experiment, although it does not achieve the best performance there. Although the task is solved to a great degree, the reason could be that the liquid kernel accounts for covariance terms. This might influence the learned representations in a way that hurts performance by a small margin in this zero-shot experiment. The hyperparameters are given in the Appendix.

It is important to note that there is a modified Speech Command dataset that restricts the dataset to only ten output classes, namely SC10, and is used in a couple of works (see, for example, Kidger et al., 2020; Gu et al., 2021; Romero et al., 2021b, a). Aligned with the updated results reported in Gu et al. (2022a) and Gu et al. (2022b), we choose not to break down this dataset and report the full-sized benchmark in the main paper. Nevertheless, we conducted an experiment with SC10 and showed that even on the reduced dataset, with the same hyperparameters, we solved the task with a SOTA accuracy of 98.51%. The results are presented in Table 6.

Conclusions

We showed that the performance of structural state-space models can be considerably improved if they are formulated by a linear liquid time-constant kernel, namely Liquid-S4. Liquid-S4 kernels are obtainable with minimal effort; they compute the similarities between time lags of the input signals in addition to the main S4 diagonal plus low-rank parametrization. Liquid-S4 kernels with smaller parameter counts achieve SOTA performance on all six tasks of the Long Range Arena benchmark, on BIDMC heart rate, respiratory rate, and blood oxygen saturation, on sequential 1-D pixel-level image classification, and on Speech Command recognition.

Acknowledgments

This work is supported by The Boeing Company and the Office of Naval Research (ONR) Grant N00014-18-1-2830.

References

Supplementary Materials

Proof of Proposition 3.1

This can be shown by unrolling the S4 convolution kernel and multiplying its components with $\bm{\overline{B}}^{p-1}$ , performing an anti-diagonal transformation to obtain the corresponding liquid S4 kernel:

For $p=2$ (correlations of order 2), S4 kernel should be multiplied by $\bm{\overline{B}}$ . The resulting kernel would be:

We obtain the liquid kernel by flipping the above kernel to be convolved with the 2-term correlation terms (p=2):

Similarly, we can obtain liquid kernels for higher liquid orders and obtain the statement of the proposition.

Hyperparameters

Learning Rate. Liquid-S4 generally requires a smaller learning rate compared to S4 and S4D blocks.

Setting $\Delta t_{max}$ and $\Delta t_{min}$. We set $\Delta t_{max}$ to 0.2 for all experiments, while $\Delta t_{min}$ was set, based on the recommendations provided in Gu et al. (2022c), to be proportional to $\frac{1}{\text{seq length}}$.
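As a rough sketch of how these bounds might be used, the snippet below derives a step-size range from the sequence length and draws per-channel step sizes log-uniformly within it. The proportionality constant `scale`, the helper names, and the log-uniform sampling (a choice commonly found in S4-style implementations) are assumptions for illustration; only the value 0.2 and the $1/\text{seq length}$ scaling come from the text.

```python
import numpy as np

DT_MAX = 0.2  # upper bound stated in the text

def dt_bounds(seq_len, scale=1.0):
    """Return (dt_min, dt_max) with dt_min proportional to 1 / seq_len.

    `scale` is a hypothetical proportionality constant.
    """
    return scale / seq_len, DT_MAX

def sample_log_uniform_dt(n, dt_min, dt_max, rng):
    """Draw n step sizes uniformly in log-space between dt_min and dt_max."""
    log_dt = rng.uniform(np.log(dt_min), np.log(dt_max), size=n)
    return np.exp(log_dt)

rng = np.random.default_rng(0)
dt_min, dt_max = dt_bounds(seq_len=1024)
dts = sample_log_uniform_dt(8, dt_min, dt_max, rng)
print(dt_min, dt_max, dts.min(), dts.max())
```

Longer sequences thus receive a smaller minimum step, which lets some channels resolve slower dynamics.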

Causal Modeling vs. Bidirectional Modeling. Liquid-S4 works better when it is used as a causal model, i.e., with no bidirectional configuration.

$d_{\text{state}}$. We observed that the Liquid-S4 PB kernel performs best with smaller individual state sizes $d_{\text{state}}$. For instance, we achieve SOTA results in ListOps, IMDB, and Speech Commands with a state size set to 7, significantly reducing the number of parameters required to solve these tasks.

Choice of Liquid-S4 Kernel. In all experiments, we choose our simplified PB kernel over the KB kernel due to its computational cost and performance. We recommend the use of the PB kernel.

Choice of parameter $p$ in liquid kernel. In all experiments, we start by setting the liquid order $p$ to 2. This means that the liquid kernel is computed only for correlation terms of order 2. In principle, we observe that higher $p$ values consistently enhance the representation learning capacity of Liquid-S4 modules, as we showed in all experiments. We recommend $p=3$ as a default for experiments with Liquid-S4.

The kernel computation pipeline uses the PyKeops package (Charlier et al., 2021) for large tensor computations without memory overflow.

All reported results are validation accuracy (similar to Gu et al. (2022a)) performed with 2 to 3 different random seeds, except for the BIDMC dataset, which reports accuracy on the test set.