Theoretical Foundations of Deep Selective State-Space Models

Nicola Muca Cirone, Antonio Orvieto, Benjamin Walker, Cristopher Salvi, Terry Lyons

Introduction

Sequence-to-sequence blocks are fundamental components of modern deep learning models for language, images, video, audio, time series, and genomics. While attention (Vaswani et al., 2017; Dosovitskiy et al., 2020) has been for the last five years the dominant mechanism powering these architectures, competitive results have been recently achieved by replacing attention with state-space models (SSMs): GPU-efficient linear recurrent sequence-to-sequence blocks stemming from S4 (Gu et al., 2021). SSMs achieve state-of-the-art results on long-range-reasoning benchmarks (Tay et al., 2020) and show outstanding performance in various domain including vision (Nguyen et al., 2022), audio (Goel et al., 2022), biological signals (Gu et al., 2021), reinforcement learning (Lu et al., 2023) and online learning (Zucchet et al., 2023). SSMs recently have gained significant interest in the community since their computational complexity scales linearly in sequence length, while attention scales quadratically; moreover, unlike other recurrent mechanisms such as LSTMs (Hochreiter & Schmidhuber, 1997) and GRUs (Cho et al., 2014), they can be efficiently parallelized on GPUs during training using parallel scans (Martin & Cundy, 2017; Smith et al., 2023).

While standard SSMs were shown to be particularly powerful on signal processing tasks, their computation power is limited: the core sequential mechanism of S4 is equivalent to a convolution (filtering) (Li et al., 2022a). This represents a drawback in challenging domains such as text and genetics, where the ability to select data efficiently in an input-dependent manner (i.e., perform content-based reasoning) is crucial (see results in (Wang et al., 2022; Fu et al., 2022)). Towards reaching this goal with recurrent models, various adaptations of S4 have been proposed in the last few months. Notably, Mamba (Gu & Dao, 2023) implements simple and efficient gating mechanisms on the S4 recurrence, unlocking input selectivity in the memory update. Mamba achieved state-of-the-art performance of various language modeling tasks at a fraction of the compute needed to train a transformer model with an equal number of parameters, with inference time still scaling (like S4) with $O(\text{sequence length})$ . Similar ideas can be found in recent S4-inspired recurrent linear attention models, such as RWKV (Peng et al., 2023), RetNet (Sun et al., 2023), Gateloop (Katsch, 2023) and Gated Linear Attention (GLA) (Yang et al., 2023). Last, very recently, (De et al., 2024) surpassed the performance of Mamba with a gated RNN architecture – Griffin – based on an improved version of the LRU (Orvieto et al., 2023b), perhaps the simplest SSM variant.

At the core of the models discussed above is a linear time-varying dynamical system, where content-based reasoning is performed along the input sequence through an efficient and parallelizable update on a hidden state. In this paper, we generalize the structure of such models, drawing a direct link to controlled differential equations (CDEs) (Kidger et al., 2020; Morrill et al., 2021; Arribas et al., 2020; Salvi et al., 2022; Hoglund et al., 2023) and using tools from rough path theory to study expressivity of deep selective SSMs.

We provide a framework for the analysis of (input-controlled) linear (in the hidden state) recurrences such as S4, Mamba and GLA. This framework allows the use of powerful tools and results in the Rough Path Theory literature to reason about the expressivity of modern SSM layers and to obtain generalized versions of the universality results of (Li et al., 2022b). This framework captures both S4 and Mamba, as well as recently proposed linear-attention-powered recurrences.

We prove that wide enough randomly initialized dense input-controlled linear recurrences are fully expressive: with no need for a multi-layer-perceptron (MLP) block transforming pointwise the recurrence output (as instead is done in the transformers architecture or at the S4 recurrence output), we show that the hidden state contains sufficient input statistics to approximate any continuous function from the input sequence to a target value. This is in clear contrast to S4, where the hidden state is just a convolution of the input sequence with a fixed kernel.

We show that diagonal input-controlled linear recurrences (such as Mamba) provably collect input statistics more efficiently than S4. We additionally show that chaining such blocks, by simply placing linear pointwise maps in between, allows computation of higher order global statistics – matching the dense linear input-controlled setting discussed in point (2) above, in the depth limit.

The main objective of our research is to outline the foundational mathematical tools for comprehensive theoretical investigations of modern recurrent blocks. In addition to our results and insights concerning recently proposed SSM models, our work provides a framework that can be useful for analyzing and comparing forthcoming architectural advances.

State-space Models

We start with a quick simplified recap of S4 (Gu et al., 2021) and then describe recently proposed variants such as Mamba (in particular, the S6 block) (Gu & Dao, 2023) and Gated Linear Attention (GLA) (Yang et al., 2023). We restrict our focus to the recurrent mechanism and invite the reader to refer to the original papers for a description of the token-wise operations following and preceding each block.

with $\bar{A}_{i},\bar{B}_{i}$ determined by the discretization technique and the channel-dependent stepsize $\Delta_{i}$ : under the commonly used Zero-Order Hold discretization This corresponds to: (i). considering the continuous underlying signal $X_{t}$ to be piecewise constant, (ii). solving exactly the ODE 1 and finally (iii). sampling $Z_{t}$ at the sample times of $X_{t}$ . ,

The Selective SSM (S6) powering the Mamba architecture (Gu & Dao, 2023) augments S4 with input-controlled transitions:

2 Known properties of (non-linear) recurrences

Mamba (alongside with gated linear attention variants e.g. Yang et al. (2023)) does not fall in the linear RNN or the nonlinear RNN setting: its recurrence is linear on the hidden state (hence it can be parallelized) but it is not linear time-invariant as S4. In this paper, we study this setting.

SSMs as Linear CDEs

As shown by Gu & Dao (2023), the crucial component that unlocks in-context learning and selectivity in modern SSMs is the input-dependent state-to-state transition matrix $A$ , gating the hidden state and thus allowing the system to filter out irrelevant information and remember relevant information indefinitely. As we will shortly see, one can study the structure and features of such systems within a unified convenient continuous-time framework (Linear Controlled Differential equations). We define such framework (Def. 6) in full generality, and draw the connection to S4 and Mamba.

where $t\mapsto\omega^{\text{X},i}_{t}$ is the $i$ -th channel of the path $\omega^{\text{X}}$ .

We show that both S4 and Mamba can be written in continuous-time as Linear CDEs, for difference choices of $\omega$ and $\xi$ . We describe this visually in Fig. 1. This motivates us to study the expressivity of the model in Eq. 6 and how it is affected by the choice of $\omega$ and $\xi$ .

The S4 model (Gu et al., 2021) is a Linear CDE of the form (6) with “gate” functions

2 Mamba is a Linear CDE

As shown Appendix F, Mamba in Eqn. (3) is equivalent to the following Linear CDE

where we defined $X_{t}$ piecewise constant, with values the $x_{l}$ , on a grid of stepsize $\delta$ and

As a quick validation, note that Eqn. (3) becomes

As $\delta\to 0$ , the exponential becomes linear and one has

Which is the standard forward Euler discretization of the Mamba ODE in Eqn. (8).

Expressivity of Linear CDEs

If the matrices $A_{1},...,A_{d_{\omega}}$ are constrained to be diagonal, as it is often the case in practice, the requirements $\omega^{\text{X},1}_{t}=t$ , $\omega^{\text{X},2}_{t}=t^{2}$ can be dropped and the closure reduces to

This result can be seen as a generalization of the Universal Approximation for Linear RNNs presented by Li et al. (2022b)[Thm. 7] for generic gates $\omega,\xi$ . In fact in their setting $Z_{0}=0$ with gates set as in (7) family (9) reduces to linear input filterings given by convolutions

The specific restriction of $\omega$ to subsets of $ $is a crucial part of (9). The family of approximable maps does not include all path-to-path causalA causal map is one which does not ”look in the future” cf. Appendix D. functions$ t\mapsto Y_{t}^{\text{X}} $but a subset of them, of type$ t\mapsto Y_{t}^{\text{X}}:=\Psi(\omega^{\text{X}}_{[0,t]})$, satisfying the specific time-homogeneity specified by the form of the restriction, akin to that in (Li et al., 2022b).

2 Signature expansion

To study expressivity of these generalized SSMs (Linear CDEs), it will be convenient to introduce the so-called signature transform (Lyons et al., 2007; Kidger et al., 2019; Fermanian et al., 2023), a classical path-transform from stochastic analysis. The main reason for doing so is that, as a simple consequence of the Stone-Weirestrass theorem, linear functionals on the signature provide the essential building blocks (analogous to monomials on Euclidean spaces) to approximate continuous functions on path space.

We refer the interested reader to Appendix A for additional details and references on the signature. A classical result from rough path theory states that a Linear CDE can be expanded explicitly as an (infinite) linear combination of terms in the signature of the driving path. Next, we specialize this result to the case of Linear CDEs given by Eqn. (6).

For any choice of matrices $A_{1},...,A_{d_{\omega}}$ and $B$ , the unique solution to

where we used the notation $A_{I}:=A_{i_{n}}...A_{i_{1}}$ .

Notice that the previous result does not rely on any assumptions on the nature of $Z_{0}$ , $A_{i}$ , and $B$ ; for any such choice the result is a time-independent linear map on a feature vector

3 Randomised Linear CDEs

It turns out that the same space of functionals in the closure (9) of a Linear CDE can be induced, with high probability, by sampling the matrices $A_{1},...,A_{d_{\omega}},B$ and $Z_{0}$ at random according to standard ML practices (LeCun et al., 2012), as stated in the following result:

where the matrices $A_{1},...,A_{d_{\omega}}$ and $B$ defining $Z^{X}$ are sampled at random as

In particular, Thm. 4.5 implies that only the final readout $v$ needs to be trained, while the randomly initialised matrices $A_{1},...,A_{d_{\omega}}$ and $B$ can be left untrained (cf. Sec. 6). This is a similar mechanism to the paradigm advocated in reservoir computing (Cuchiero et al., 2021b; Lukoveviius & Jaeger, 2009).

It can be proved, using the results by Dubach & Peled (2021), that the sampling measure does not have to be Gaussian if it satisfies certain moment requirements.

The core argument behind the proof of Thm. 4.5 consists of exhibiting almost-orthogonal vectors trough random linear maps, and then using these as a pseudo-basis of vectors on which to mimic the dynamics of the feature map $T(X)_{0,t}$ . It then suffices to bound the error made by this approximate approach, which is proved to disappear as the dimension $N$ increases.

4 S4 as a special case

This in turns implies the following expression for $Z^{X}$

and the corresponding feature vector $T$ reads

for polynomials $p^{1},\dots,p^{d}$ , which reconciles our framework with the results obtained by Li et al. (2022b).

It is evident from this discussion that the classical choice of input-independent $\omega$ (i.e. $\omega^{\text{X}}_{t}=t$ ) precludes the use of higher order statistics of $X$ .

Motivated by the efficiency requirements in S4 and Mamba, in the next section, we discuss the question of whether this choice of diagonal restricts the family of functionals that one can learn.

5 The Diagonal Case

If $A_{i}:=\operatorname{diag}(v_{i})$ , due to commutativity of diagonal matrices, one has that for any permutation $\sigma(I)$ of a word $I$ the equality $A_{I}=A_{\sigma(I)}$ holds. This means that a linear functional on $Z_{t}^{\text{X}}$ such as (11) can fundamentally only see the symmetric part of $\text{Sig}(\omega^{\text{X}})_{s,t}$ , which is just the tensor exponential

and as such does not capture higher order channel combinations (cf. See Sect. 6 for empirical validation). In fact the explicit solution a can be written as

On top of efficiency, an important feature of selective SSMs, crucial for long-range reasoning (see discussion in (Orvieto et al., 2023b)) is their stability in a dynamical system sense: eigenvalues of $\bar{A}$ for those discrete recurrences always have absolute value bounded by $1$ . Both S4 and Mamba achieve this by restricting the sign of the diagonal components of their recurrent matrix $A$ with positive nonlinearities such as ReLUs (see Sec. 2). Under ZOH numerical integration (see again Sec. 2), the resulting discrete-time $\bar{A}=\exp(\Delta A)$ is guaranteed to be stable. We can apply the same idea in the general linear diagonal CDE setting (cf. Appendix C.1). Instead, if one works with dense matrices, ensuring stability in numerical integration during training requires heavy adaptations of the optimizer on top of higher overall complexity.

Note that the main difference between (9) and (14) is the path-dependence: the latter take as input just a value or increment of $\omega^{\text{X}}$ while the maps in $(\ref{eqn:main_F})$ are functions on whole sections of it. Diagonal selective SSMs (i.e. MambaInstead, expressivity of S4 is unaffected by diagonality in the recurrence, see results above and in (Li et al., 2022b).) are thus fundamentally weaker than their non-diagonal counterparts, as they can at most capture the symmetric part of $\omega^{\text{X}}$ ’s signature and its integrals against $d\xi^{\text{X}}$ (see Appendix for details).

6 Chaining Diagonal CDEs

Fortunately it is possible to re-gain expressivity without sacrificing the computational advantages of diagonal schemes through chaining. This means driving a new linear CDE by the solution of a previous linear CDE, and repeating this procedure $K$ times. With this procedure, one can recover signature entries up to depth $K$ with complexity $\mathcal{O}(KLN)$ .

The intuition behind this result is the following: with a single diagonal layer it’s possible This is not straightforward since the time in the integral part of (14) is going ”backwards”! to recover the terms

which are just length $2$ terms from $\text{Sig}((\omega^{\text{X}},\xi^{\text{X}}))_{0,t}$ . If we pass these terms as $\omega$ in another layer then we can obtain a length $3$ term from $\text{Sig}((\omega^{\text{X}},\xi^{\text{X}}))_{0,t}$ , and so on.

The structure of the chaining in Prop. 4.7 seems quite restrictive, in particular it’s not clear to which extent gates which are non-linear in the input $X$ allow for such a construction. In the Appendix we address this issue in the simple but important case $d\omega^{\text{X}}_{t}=\text{ReLU}(WX_{t}+B)dt$ , where we show that no information is lost and numerical stability can be enforced.

Path-to-Path Learning

Orvieto et al. (2023a) discuss an important feature of S4, which we already presented in previous sections: its action on the input is a simple convolution with fixed (independent of the input) parameters. For this exact reason, the construction of (Orvieto et al., 2023a) defers all non-linear computation (reasoning) to the multi-layer perceptron (MLP) acting on the linear recurrence output, which is simply providing in their setting a loss-less compression of the inputs – with no added reasoning.

In our setting, we instead characterized the processing power of input-controlled (dense or diagonal) SSMs precisely, showing that it greatly surpasses linear filtering. For this reason, the computational burden for an MLP action on a general linear CDE, unlocking path-to-path learning, would be greatly diminished and its actual function reduced to a time-consistent interpolation of already good path-to-point approximations. This method is robust to different input discretization schemes and does not offload all the complexity to the network which follows the SSM (cf. Figure 2).

Actually, $F$ in the above result does not have to be a neural network, it suffices for it to be able to interpolate linear maps in a time-consistent manner cf. Appendix D.

Empirical Validation

A variation on the toy dataset introduced in (Walker et al., 2024) is used to empirically validate our theoretical results. We use two datasets of $100,000$ sample paths each, with dimensions $2$ and $3$ respectively. In both datasets, paths are defined on $100$ regularly spaced time-steps between $t=0$ and $t=1$ . The change in each channel at each time point is an independent sample from a standard Normal distribution rounded to the nearest integer. The datasets are then normalised to range between $-1$ and $1$ . The aim on both datasets is to predict specific terms in the anti-symmetric part of the signature. For the $2$ dimensional dataset the prediction is an area integral and for the $3$ dimensional a volume integral given, respectively, by

A S5 or Mamba recurrence with linear readout.

Two stacked S5 or Mamba recurrences with a linear mixing layer in-between, and the final state of the second layer fed into a linear readout.

A linear NCDE with gates $\omega^{\text{X}}_{t}=\xi^{\text{X}}_{t}=(t,X_{t})$ followed by a linear readout.

All state space models have trainable matrices in their recurrences, whereas the linear NCDE is using fixed matrices. All of the models use a hidden dimension of $256$ , with the state space models using a state dimension of $256$ . The state space models are trained using batch gradient descent with a batch size of $32$ and Adam with a learning rate of $3\times 10^{-5}$ . Results are plotted in Fig. 3.

The output from the linear NCDE’s recurrence is obtained using an adaptive ODE solver, Tsit5, with an absolute and relative tolerance of $1\times 10^{-2}$ . The linear NCDEs linear layer is obtained by ordinary least squares.

Conclusions

In this paper we have considered a family of models, defined by gating functions as solutions to linear CDEs, which extends both classical and modern SSM architectures.

Using analytical tools from Rough Paths theory we have characterized the uniform closure of these models generalizing the results of (Li et al., 2022b), shedding light on the expressive advantages of input-controlled transition dynamics, which allow to capture high order statistics of the input as opposed to just the linear ones extracted by convolutions.

We point out how the full expressive range is already captured by generic models, and how more efficient computational structures, such as imposing diagonality, weaken these expressive capabilities. A remedy to this loss is presented in the chaining of these models.

Finally we show how the substitution of the linear readout with an MLP allows to approximate general path-to-path functions without, in contrast to the S4 case, offloading all the complexity to the neural network.

Acknowledgements

Nicola Muca Cirone is supported by the EPSRC Centre for Doctoral Training in Mathematics of Random Systems: Analysis, Modelling and Simulation (EP/S023925/1). Antonio Orvieto acknowledges the financial support of the Hector Foundation. Benjamin Walker was funded by the Hong Kong Innovation and Technology Commission (InnoHK Project CIMDA). Terry Lyons was funded in part by the EPSRC [grant number EP/S026347/1], in part by The Alan Turing Institute under the EPSRC grant EP/N510129/1, the Data Centric Engineering Programme (under the Lloyd’s Register Foundation grant G0095), the Defence and Security Programme (funded by the UK Government) and the Office for National Statistics & The Alan Turing Institute (strategic partnership) and in part by the Hong Kong Innovation and Technology Commission (InnoHK Project CIMDA).

References

Appendix A Introduction to Signatures

This initial section of the Appendix is devoted to a brief introduction to the topic of Signature Transform. For more in-depth surveys we suggest the reader to refer to (Fermanian, 2020; Chevyrev & Kormilitzin, 2016; Lyons & McLeod, 2024).

In the simplest setting of smooth paths a CDE is a differential equation of form

The theory of Rough Paths has its origins in the study of such types of differential equations and provides a theoretical framework to define and work in rough settings i.e. when $X$ is not kust BV but even $\alpha$ -Hölder for $\alpha\in(0,1)$ cf. (Friz & Victoir, 2010).

where $V_{i}f(z):=df_{y}[V_{i}(z)]$ . Iterating this procedure on the $V_{i}f$ s, i.e. substituting in the previous equation the analogously obtained equality

Keeping with this procedure for $N$ steps we get

where $I=(i_{1},\dots,i_{k})$ runs trough the multi-indices, $V_{I}f:=V_{i_{1}}V_{i_{2}}\dots V_{i_{k}}f$ and

As one can imagine, under reasonable regularity assumptions, the remainder goes to as $N\to\infty$ and at the limit

This is a remarkable result: to know the solution $Z_{t}$ to the original CDE it suffices to know the quantities $V_{I}f$ for all multi-indices and $f$ in the coordinate maps, together with the iterated integrals

This observation is at the core of Rough Path Analysis, the theory can in a sense be considered an extreme development of it. The collection of iterated integrals, the Signature, will be the main for our analysis.

In Appendix E we expand and make rigorous the arguments of this section in the case of affine vector fields.

A.2 Basic Definitions

where $A=(a_{0},a_{1},\dots)$ and $B=(b_{0},b_{1},\dots)$

A.3 Notable Results

Here we present some notable results of which we will make use through the paper. We omit the proofs if they can be easily found in the suggested references.

The first result is about bounding the norm of Signature entries:

The most important fact about Signature is that it acts as the basis for a Taylor expansion in path space. In fact just as finite linear combinations of monomials are dense in the continuous functions with a compact input set, finite linear combinations of Signature entries are dense in continuous functions from compact path-spaces:

for some finite sequence $(\alpha_{I})_{|I|\leq N}$ of real numbers.

There is no magic in this result, it is just an application of Stone-Weiestrass enabled by the rich algebraic structure of iterated integrals, studied originally in (Chen, 1958).

This definition is such that the following important equation holds

With the right augmentation of the paths one can see that the Signature distinguishes between different sections of paths, this will be crucial for some of the original results presented in this work.

The if part is follows from $\text{Sig}(\gamma)_{s,t}=\text{Sig}(\gamma_{[s,t]})_{0,1}$ . For the only if part, If $s=s^{\prime}$ and $t=t^{\prime}$ the statement holds; this is because if the signatures over the time interval $[s,t]$ of two time-augmented paths are equal, then the two paths must be equal on $[s,t]$ . We now show that augmenting the path with $t^{2}$ and imposing equality of signatures, implies $s=s^{\prime}$ and $t=t^{\prime}$ , which will in turn allow us to conclude the proof by the previous remark. Assume $\text{Sig}(\omega)_{s,t}=\text{Sig}(\gamma)_{s^{\prime},t^{\prime}}$ , in particular we must have

Hence it must be true that $t=t^{\prime}$ and $s=s^{\prime}$ . ∎

Appendix B Expressivity

In the body of the paper we have presented the main results with the simplified assumption of $Z^{\text{X}}_{0}=0$ or at best $Z^{\text{X}}_{0}=Z_{0}$ i.e. with an initial value independent from the input. In this appendix we will carry on the proofs in a more general setting in which $Z^{\text{X}}_{0}$ is allowed to be input-dependent, as previously discussed the choice of initial value is, in contrast to the classical setting, meaningful inasmuch it allows to approximate linear maps on the signature of $\omega^{\text{X}}_{}$ . In order to do so we have to introduce a new gate, the initial value gate, in the form of a map

Despite the notation, there is no reason why $(X)_{0}$ should be the initial value of the path $X$ , one should think of this map as the one summarizing the data which still matters for the task but which does not have a time-series nature.

Then the main object of study, ”gated” Linear CDEs, are defined as:

B.2 Main Result - Statement and Strategy

Here we present the unified expressivity result in its most general form:

Moreover generic parameters suffice with high probability in the sense that under LeCun initialization

If the $A_{i}$ s are constrained to be diagonal, as often is the case in practice, the requirements $\omega^{\text{X},1}_{t}\equiv t$ , $\omega^{\text{X},2}_{t}\equiv t^{2}$ can be dropped and the existence result only holds with

Moreover in both the dense and diagonal cases the ”reverse” also holds in the sense that, given any choice of matrices $A_{i},B,C$ there is an $\epsilon$ -close map $F$ in the corresponding family.

As one can see the theorem is composed of different sub-results, which we believe are better understood separately from each other. The proof will thus be split in the following steps:

Using the theory developed in Appendix E we see how linear functions on the $Z_{t}$ s can be seen as linear functions on certain terms of the Signature Transform.

Such terms define a feature map $T(X)_{0,t}$ which generates a Reproducing Kernel Hilbert Space $\mathcal{H}^{\tiny(\cdot)_{0},\omega,\eta}_{t}$ . This abstract space acts as an upper bound on expressivity: linear functions on $Z_{t}$ always belong to its closure (in uniform norm), independently of dimension and weights chosen, hence they cannot reach what functions in $\mathcal{H}^{\tiny(\cdot)_{0},\omega,\eta}_{t}$ can’t approximate.

The full expressive range of $\mathcal{H}^{\tiny(\cdot)_{0},\omega,\eta}_{t}$ is shown to be captured by generic $Z_{t}$ s.

Diagonal systems are shown to be restricted to a subset of the $\mathcal{H}^{\tiny(\cdot)_{0},\omega,\eta}_{t}$ of which they capture the full expressive range.

B.3 Main Result - Proofs

is given, using the notation $A_{Ij}:=A_{j}A_{I}$ , by

Just apply Theorems (E.2) and (E.6) of Appendix E. ∎

A property highlighted by the previous result is the interpretability of these models. After training the NCDEs one can compute the matrix multiplications and observe which entries of the signature the model chooses to take into consideration, to attend.

B.3.2 The feature map T𝑇T and its RKHS

The expression in (34) is a linear map on a feature vector given, at time $t$ , by

This feature vector can be understood as a tensor in the following way:

In fact one readily sees that the only non-zero terms of $T(X)_{0,t}$ defined as above are

This is similar to the tensor-valued CDE defining the signature as a tensor i.e. (Salvi et al., 2021a)

with the addition of two terms to track $X_{0}$ and $\xi^{\text{X}}$ . One could also understand $T(X)_{s,t}$ as a sub-tensor of

Classical RKHS theory tells us that we can characterize its elements as:

taken over those $\gamma,\beta$ for which the above equality holds.

Signature kernels (Salvi et al., 2021a) are a class of universal kernels on sequential data which have received attention in recent years thanks to their efficiency in handling path-dependent problems (Lemercier et al., 2021; Salvi et al., 2021c; Cochrane et al., 2021; Salvi et al., 2021b; Cirone et al., 2023; Issa et al., 2023).

Just as signature kernels, the kernel associated to $T(X)_{0,t}$ can be explicitly written as the solution of a two-parameter CDE:

Let $\mathcal{K}^{\text{X},\text{Y}}(s,t):=\langle T(X)_{0,s},T(Y)_{0,t}\rangle_{l^{2}}$ then

The first expression follows immediately from Eqn. (36) the second one by summing the products of $T(X)_{0,s}$ ’s and $T(Y)_{0,t}$ ’s entries given above. ∎

The following proposition will show how linear maps on $Z^{\text{X}}_{t}$ cannot be more expressive than elements of the RKHS $\mathcal{H}^{\tiny(\cdot)_{0},\omega,\eta}_{}$ since their closure is in the closure of $\mathcal{H}^{\tiny(\cdot)_{0},\omega,\eta}_{}$ . In this precise sense these spaces act like upper bounds to expressiveness.

where $F\in\mathcal{H}^{\tiny(\cdot)_{0},\omega,\eta}_{}$ . In other words, linear maps on the $Z^{\text{X}}_{t}$ are in the uniform closure of $\mathcal{H}^{\tiny(\cdot)_{0},\omega,\eta}_{}$ .

Using Eqn. (34) we see that $\langle v,Z^{\text{X}}_{t}\rangle$ is a linear map on $T(X)_{0,t}$ with coefficients

using Cauchy-Schwartz it’s moreover easy to see the existence of a constant $\lambda\geq 0$ such that for all $I,i,j$ one has

Since $|\text{Sig}(\omega)^{I}_{s,t}|\leq\frac{1}{|I|!}\left\lVert\omega\right\rVert_{1-var,[s,t]}^{|I|}$ we have that, given an integer $M\geq 0$ , the bound

B.3.4 Uniform closure of the RKHS

is a continuous map from a compact space, thus the image must be compact too. Moreover by Prop. A.8 the Signature separates the points in this image. Since any $G$ as above has form

the proof follows from the uniform density on compact sets of linear functionals on the (truncated) Signature (Thm. A.5), by also uniformly bounding thanks to compactness and continuity the norms of $X_{0}$ and $\xi^{\text{X}}_{1}$ . ∎

B.3.5 Generic Weights are fully expressive

We have seen how linear maps on $Z^{\text{X}}_{t}$ are in the uniform closure of $\mathcal{H}^{\tiny(\cdot)_{0},\omega,\eta}_{}$ , and we have explicitly characterized this closure. It is then natural to ask ”how much” of this closure the $Z^{\text{X}}_{t}$ are able to ”explore”. The present section not only shows that the $Z^{\text{X}}_{t}$ ”explore” all the closure, but also that a generic choice of weights is enough to eventually do this with high probability.

The fact that these maps are ”universal” in the above sense is not surprising, since it is well known that Linear CDEs are universal for path-to-point tasks cf. (Kidger, 2022), what is surprising is that this universality can be achieved probabilistically with one of the standard parametrizations used in ML practice (LeCun) It can be proved, using the results of (Dubach & Peled, 2021), that the sampling measure does not have to be Gaussian if it satisfies certain moment requirements..

Moreover generic weight choices suffice with high probability, in the sense that under LeCun initialization

We propose two proofs, the first one of a deterministic character concerns the first claim in the theorem, the second one is probabilistic and concerns the whole result. The deterministic proof follows the same arguments employed by (Kidger, 2022) and is included to highlight the main idea of the probabilistic result, which reduces to a ”spin” on the central argument of this proof.

Any $F\in\mathcal{H}^{\tiny(\cdot)_{0},\omega,\eta}_{}$ has form

where $\pi_{M}$ is the truncation at length $M$ .

In particular note how these matrices are such that

and for all $|I|>M$ one has necessarily $A_{I}=0$ .

Then expanding $Z^{\text{X}}_{t}$ as in (34) and using the equalities above one has

proving that for such $v$ and $Z^{\text{X}}$ it holds

(Probabilistic Proof) Any $F\in\mathcal{H}^{\tiny(\cdot)_{0},\omega,\eta}_{}$ has form

where $\pi_{M}$ is the truncation at length $M$ .

From (Cirone et al., 2023)[Appendix C] we know that

Then expanding $Z^{\text{X}}_{t}$ as in (34)

Which leads, thanks to the same bounds of (Cirone et al., 2023)[Appendix C], to

But then by Markov’s inequality it holds that

and thus there must be a choice of $N,\{A_{i}\},B,C$ such that the inequality holds, and we thus obtain using (54)

and we conclude by arbitrariness of $\epsilon$ .

B.4 The Diagonal Case

Here we study the particular, but empirically important, case where the matrices $A_{i}$ are taken to be diagonalIt is equivalent to ask for them to be commuting..

What we’ll discover is that the $Z^{\text{X}}_{t}$ cannot differentiate between $\text{Sig}(\omega^{\text{X}})_{s,t}^{I}$ and other $\text{Sig}(\omega^{\text{X}})_{s,t}^{\sigma(I)}$ for any permutation $\sigma$ of the letters in the word $I$ .

By Theorem E.1 and Theorem E.6 we know that the solution of

where $W_{s,t}$ is the unique solution to

In case the $A_{i}$ s are commuting matrices one can explicitly write the solution as

since for fixed $s$ one has, using commutativity, that

On the other hand we know, Theorem E.2, that

recalling then how $e_{I_{1}}\shuffle\cdots\shuffle e_{I_{k}}=\sum_{\sigma\in S_{k}}e_{\sigma(I)}$ we get to

In particular we see how in the commuting case

B.4.2 Diagonal Expressiveness

Moreover the ”reverse” also holds i.e. given any choice of matrices $A_{i},B,C$ there is an $\epsilon$ -close map $F$ in the family.

This is just a repetition of the arguments used for the dense case with little more care to get the uniformity in time.

The same argument of Proposition B.10 shows that the uniform closure of the space of linear maps on the $Z^{\text{X}}_{t}$ is contained in the uniform closure of $Sym({\mathcal{H}}^{\tiny(\cdot)_{0},\omega,\eta}_{})$ , and the same bounds show that this latter closure is the same as that of its subset composed of those $F\in Sym({\mathcal{H}}^{\tiny(\cdot)_{0},\omega,\eta}_{})$ having entries eventually equal to .

such maps can be expressed exactly in the form

for polynomial maps $P,Q$ fixed in time. The usual compactness and continuity argument, together with an application of Stone-Weiestrass, thus proves that the uniform closure of $Sym({\mathcal{H}}^{\tiny(\cdot)_{0},\omega,\eta}_{})$ has the form needed.

The final ingredient is the density of the space of linear maps on the $Z^{\text{X}}_{t}$ in $Sym({\mathcal{H}}^{\tiny(\cdot)_{0},\omega,\eta}_{})$ ; this is another consequence of Stone-Weiestrass as seen from Proposition B.17. ∎

It is not necessary to pass trough $Sym({\mathcal{H}}^{\tiny(\cdot)_{0},\omega,\eta}_{})$ to prove the previous result, since it directly follows from Proposition B.17. This choice of presentation has been motivated by the conviction of the usefulness of drawing parallels and comparisons.

This is an application of Stone-Weiestrass: $\mathcal{E}$ is a sub-algebra since

and $\alpha,\beta\in C\implies\alpha+\beta\in C$ by convexity of the cone; $\mathcal{E}$ contains the constant function $e^{\langle 0,x\rangle}=1$ and is clearly point separating since the cone, being $d$ -dimensional, it contains a basis of the whole space. ∎

The usefulness of stating the previous result in such a general setting is the following: with this formalism we can, for example, restrict to $\alpha\leq 0$ , in this way we would have a method to control the stability (cf. Appendix C.1) of the linear CDEs by choosing the gate with a.s. $\dot{\omega}^{\text{X}}\geq 0$ .

Appendix C Stability and Chaining of Diagonal Systems

Note that the present discussion holds even for non-diagonal but commuting matrices, since these can be simultaneously diagonalized (at the cost of considering the complex plane).

Here we explore the stability of the dynamical system $Z^{\text{X}}$ , thus we need to study the eigenvalues of the $W^{\text{X}}_{s,t}$ . Recall how in this setting

Note that because $\omega^{\text{X}}$ is continuous and of bounded variation, it can be reparameterised to be Lipschitz continuous, hence absolutely continuous. Thus we can assume that $\omega^{\text{X}}$ is almost everywhere differentiable and its derivative $\dot{\omega}\in L^{1}$ .

The stability of the dynamical system then depends on the alignment between $\omega^{\text{X}}_{t}-\omega^{\text{X}}_{s}$ and the singular vectors of $V$ . If $V\dot{\omega}^{\text{X}}_{t}\leq 0$ for all times, where the inequality is coordinate-wise, then $W^{\text{X}}_{s,t}$ has eigenvalues all in $$ thus the system is stable making training easier (Orvieto et al., 2023b).

Consider the singular value decomposition (SVD) of the matrix $V$

Then, a sufficient condition for stability is that for any $k=1,...,K$

In the case of Mamba (Gu & Dao, 2023) the matrices are diagonal and

moreover the proposed choices of $V$ are all of type

C.2 Chaining

The diagonal case differs from the general one not only in the fact that the class of approximable functions is much weaker but also in the necessity for the presence of $\xi^{\text{X}}$ in order to obtain any path-dependence. The term

becomes then a crucial component. At first sight one might think that such a term allows to recover at least level two components of the Signature of $(\omega^{\text{X}},\xi^{\text{X}})$ , unfortunately things are not as easy as they may seem. Notice how inside of the integral time is ”going backwards” from the perspective of $\omega^{\text{X}}$ , thus we can in general approximate terms of type

which are indeed terms of the Signature, but of the reverse paths $\overleftarrow{\omega}^{\text{X}}_{r}=\omega^{\text{X}}_{1-r}$ and $\overleftarrow{\xi}_{s}=\xi^{\text{X}}_{1-s}$ !

Under these hypotheses we know that linear functionals on $Z^{\text{X}}_{t}$ are uniformly dense, for continuous $\psi,\phi$ , in

Assume $\xi^{\text{X},j}_{s}=\langle\alpha_{j},\omega^{\text{X}}_{t}\rangle$ and consider the choices

Models like Mamba do not only use diagonal matrices but also consider controls of a specific kind:

The choice of $ReLU$ enforces $\dot{\omega}_{t}\geq 0$ for all times as seen above, but could, a priori, destroy information about $X$ which allows for the recovery, after chaining, of its Signature.

Does this choice keep some expressivity? Fortunately almost all of it: since

one can choose a linear map $W$ which allows to linearly recover

from $\omega^{\text{X}}_{t}$ . By correspondingly modifying the form of $\psi$ and $\phi$ in Eqn. (77) such that

Appendix D Path-to-Path

i.e. G is causal if it does not look in the future.

Fix $\epsilon>0$ . By B.7 the space ${\mathcal{H}}^{\tiny(\cdot)_{0},\omega,\eta}_{}$ contains all functionals of form

Recalling that $\omega^{\text{X},1}_{t}=t$ we get that $X\mapsto\{t\mapsto t\}\in{\mathcal{H}}^{\tiny(\cdot)_{0},\omega,\eta}_{}$ so that, given density of linear maps on $Z^{\text{X}}$ in the space, $\Psi(t,f_{0}(X,t),\dots,f_{M}(X,t))$ can be uniformly approximated. Triangular inequality gives finally

which, by arbitrariness of $\epsilon$ , gives the thesis. ∎

The non-linearity is crucial for the path-to-path result. A map of type $(\omega,t)\mapsto\langle t\alpha,\text{Sig}(\omega)_{0,t}\rangle$ cannot be approximated arbitrarily well by $(\omega,t)\mapsto\langle\beta,\text{Sig}(\omega)_{0,t}\rangle$ .

In any case, note that in the proof the role of the neural network is only that of interpolating the RKHS elements in the right order and at the right time. All the non-linear complexity of learning the particular $G$ is offloaded and taken care of by the RKHS elements.

In the proof we have only considered the part of $T(X)$ concerning $\omega^{\text{X}}$ , but $T(X)_{t}$ depends linearly on $X_{0}$ and $\xi^{\text{X}}$ suggesting that neural networks on ${\mathcal{H}}^{\tiny(\cdot)_{0},\omega,\eta}_{}$ have stronger generalization properties. In fact one can prove that it is possible to approximate all continuous $G(X_{0},\omega^{\text{X}},\xi^{\text{X}},t)$ , this is done by reconstructing $X_{0}$ and $\xi^{\text{X}}_{}$ as in the classical SSM case cf. (Orvieto et al., 2023a).

Appendix E Wronskian Matrix Theory

In this section we obtain a unified theory studying the solutions to general linear CDEs. The results presented here are not new and can be found in different terms in the literature (Friz & Victoir, 2010), despite this we have decided to reproduce them from scratch for completeness, notational reasons and to present a self-contained theory.

Define the map $\Gamma:\Omega\to\Omega$ as

where $\mathcal{W}_{d}$ is the set of words in the alphabet $\{1,\dots,d\}$ and

By defining $M=\max\{\left\lVert A^{i}\right\rVert_{\infty}:i\in\{1,\dots,d\}\}$ then one clearly has

thus definitely (in $k$ ) the map $\Gamma^{k}$ is a contraction. By Banach fixed point there exist a unique fixed point $W\in\Omega$ . ∎

Under the assumptions of the previous theorem one can write $W_{s,t}$ explicitly as

moreover if for all $i$ the matrix-valued maps are constant on all $ $i.e.$ A^{i}_{t}\equiv A_{i}$ then

where $\text{Sig}(\omega)_{s,t}^{I}$ is the Signature of the path $\omega$ .

The second assertion follows from the first by definition of the Signature of a path.

Using the uniformity of this bound and the fact that for all $I\in\mathcal{W}_{d}$ one has

as a function of $(s,t)$ , which moreover is uniformly continuous

Under the previous conditions, the unique solution of the $N$ -dimensional CDE

The solutions are unique by standard results (Friz & Victoir, 2010)[Thm. 3.7], moreover

The Wronskian matrix has the following properties:

$\forall r,s,t\in.\hskip 10.0ptW_{r,t}=W_{s,t}W_{r,s}$

$\forall s,t\in.\hskip 10.0ptW_{s,t}^{-1}=W_{t,s}$

$\forall s,t\in.\hskip 10.0ptW_{s,t}=Id_{N}+\sum_{i=1}^{d}\int_{\sigma=s}^{t}W_{\sigma,t}A^{i}_{\sigma}d\omega^{i}_{\sigma}$

The second statement follows from the previous one setting first $r=t$ and subsequently exchanging $s$ and $t$ .

This just follows from the classical case since we can write

We can now state the main result of the section:

Given the unique solution $X_{t}$ one has

Appendix F ZOH and Exact Solutions

Consider a linear CDE as the one of Eqn.(28)

and recall how the solution can be explicitly written, for times $s<t$ , as

Assume moreover that in the interval $[s,t]$ both drivers have constant derivative i.e.

But the integral can be explicitly solved as

which, setting $\Delta=t-s$ , can be rewritten as