Theoretical Foundations of Deep Selective State-Space Models

Nicola Muca Cirone, Antonio Orvieto, Benjamin Walker, Cristopher Salvi, Terry Lyons

Introduction

Sequence-to-sequence blocks are fundamental components of modern deep learning models for language, images, video, audio, time series, and genomics. While attention (Vaswani et al., 2017; Dosovitskiy et al., 2020) has been for the last five years the dominant mechanism powering these architectures, competitive results have been recently achieved by replacing attention with state-space models (SSMs): GPU-efficient linear recurrent sequence-to-sequence blocks stemming from S4 (Gu et al., 2021). SSMs achieve state-of-the-art results on long-range-reasoning benchmarks (Tay et al., 2020) and show outstanding performance in various domain including vision (Nguyen et al., 2022), audio (Goel et al., 2022), biological signals (Gu et al., 2021), reinforcement learning (Lu et al., 2023) and online learning (Zucchet et al., 2023). SSMs recently have gained significant interest in the community since their computational complexity scales linearly in sequence length, while attention scales quadratically; moreover, unlike other recurrent mechanisms such as LSTMs (Hochreiter & Schmidhuber, 1997) and GRUs (Cho et al., 2014), they can be efficiently parallelized on GPUs during training using parallel scans (Martin & Cundy, 2017; Smith et al., 2023).

While standard SSMs were shown to be particularly powerful on signal processing tasks, their computation power is limited: the core sequential mechanism of S4 is equivalent to a convolution (filtering) (Li et al., 2022a). This represents a drawback in challenging domains such as text and genetics, where the ability to select data efficiently in an input-dependent manner (i.e., perform content-based reasoning) is crucial (see results in (Wang et al., 2022; Fu et al., 2022)). Towards reaching this goal with recurrent models, various adaptations of S4 have been proposed in the last few months. Notably, Mamba (Gu & Dao, 2023) implements simple and efficient gating mechanisms on the S4 recurrence, unlocking input selectivity in the memory update. Mamba achieved state-of-the-art performance of various language modeling tasks at a fraction of the compute needed to train a transformer model with an equal number of parameters, with inference time still scaling (like S4) with O(sequence length)O(\text{sequence length}). Similar ideas can be found in recent S4-inspired recurrent linear attention models, such as RWKV (Peng et al., 2023), RetNet (Sun et al., 2023), Gateloop (Katsch, 2023) and Gated Linear Attention (GLA) (Yang et al., 2023). Last, very recently, (De et al., 2024) surpassed the performance of Mamba with a gated RNN architecture – Griffin – based on an improved version of the LRU (Orvieto et al., 2023b), perhaps the simplest SSM variant.

At the core of the models discussed above is a linear time-varying dynamical system, where content-based reasoning is performed along the input sequence through an efficient and parallelizable update on a hidden state. In this paper, we generalize the structure of such models, drawing a direct link to controlled differential equations (CDEs) (Kidger et al., 2020; Morrill et al., 2021; Arribas et al., 2020; Salvi et al., 2022; Hoglund et al., 2023) and using tools from rough path theory to study expressivity of deep selective SSMs.

We provide a framework for the analysis of (input-controlled) linear (in the hidden state) recurrences such as S4, Mamba and GLA. This framework allows the use of powerful tools and results in the Rough Path Theory literature to reason about the expressivity of modern SSM layers and to obtain generalized versions of the universality results of (Li et al., 2022b). This framework captures both S4 and Mamba, as well as recently proposed linear-attention-powered recurrences.

We prove that wide enough randomly initialized dense input-controlled linear recurrences are fully expressive: with no need for a multi-layer-perceptron (MLP) block transforming pointwise the recurrence output (as instead is done in the transformers architecture or at the S4 recurrence output), we show that the hidden state contains sufficient input statistics to approximate any continuous function from the input sequence to a target value. This is in clear contrast to S4, where the hidden state is just a convolution of the input sequence with a fixed kernel.

We show that diagonal input-controlled linear recurrences (such as Mamba) provably collect input statistics more efficiently than S4. We additionally show that chaining such blocks, by simply placing linear pointwise maps in between, allows computation of higher order global statistics – matching the dense linear input-controlled setting discussed in point (2) above, in the depth limit.

The main objective of our research is to outline the foundational mathematical tools for comprehensive theoretical investigations of modern recurrent blocks. In addition to our results and insights concerning recently proposed SSM models, our work provides a framework that can be useful for analyzing and comparing forthcoming architectural advances.

State-space Models

We start with a quick simplified recap of S4 (Gu et al., 2021) and then describe recently proposed variants such as Mamba (in particular, the S6 block) (Gu & Dao, 2023) and Gated Linear Attention (GLA) (Yang et al., 2023). We restrict our focus to the recurrent mechanism and invite the reader to refer to the original papers for a description of the token-wise operations following and preceding each block.

with Aˉi,Bˉi\bar{A}_{i},\bar{B}_{i} determined by the discretization technique and the channel-dependent stepsize Δi\Delta_{i}: under the commonly used Zero-Order Hold discretization This corresponds to: (i). considering the continuous underlying signal XtX_{t} to be piecewise constant, (ii). solving exactly the ODE 1 and finally (iii). sampling ZtZ_{t} at the sample times of XtX_{t}. ,

The Selective SSM (S6) powering the Mamba architecture (Gu & Dao, 2023) augments S4 with input-controlled transitions:

2 Known properties of (non-linear) recurrences

Mamba (alongside with gated linear attention variants e.g. Yang et al. (2023)) does not fall in the linear RNN or the nonlinear RNN setting: its recurrence is linear on the hidden state (hence it can be parallelized) but it is not linear time-invariant as S4. In this paper, we study this setting.

SSMs as Linear CDEs

As shown by Gu & Dao (2023), the crucial component that unlocks in-context learning and selectivity in modern SSMs is the input-dependent state-to-state transition matrix AA, gating the hidden state and thus allowing the system to filter out irrelevant information and remember relevant information indefinitely. As we will shortly see, one can study the structure and features of such systems within a unified convenient continuous-time framework (Linear Controlled Differential equations). We define such framework (Def. 6) in full generality, and draw the connection to S4 and Mamba.

where tωtX,it\mapsto\omega^{\text{X},i}_{t} is the ii-th channel of the path ωX\omega^{\text{X}}.

We show that both S4 and Mamba can be written in continuous-time as Linear CDEs, for difference choices of ω\omega and ξ\xi. We describe this visually in Fig. 1. This motivates us to study the expressivity of the model in Eq. 6 and how it is affected by the choice of ω\omega and ξ\xi.

The S4 model (Gu et al., 2021) is a Linear CDE of the form (6) with “gate” functions

2 Mamba is a Linear CDE

As shown Appendix F, Mamba in Eqn. (3) is equivalent to the following Linear CDE

where we defined XtX_{t} piecewise constant, with values the xlx_{l}, on a grid of stepsize δ\delta and

As a quick validation, note that Eqn. (3) becomes

As δ0\delta\to 0, the exponential becomes linear and one has

Which is the standard forward Euler discretization of the Mamba ODE in Eqn. (8).

Expressivity of Linear CDEs

If the matrices A1,...,AdωA_{1},...,A_{d_{\omega}} are constrained to be diagonal, as it is often the case in practice, the requirements ωtX,1=t\omega^{\text{X},1}_{t}=t, ωtX,2=t2\omega^{\text{X},2}_{t}=t^{2} can be dropped and the closure reduces to

This result can be seen as a generalization of the Universal Approximation for Linear RNNs presented by Li et al. (2022b)[Thm. 7] for generic gates ω,ξ\omega,\xi. In fact in their setting Z0=0Z_{0}=0 with gates set as in (7) family (9) reduces to linear input filterings given by convolutions

The specific restriction of ω\omega to subsets of $isacrucialpartof(9).ThefamilyofapproximablemapsdoesnotincludeallpathtopathcausalAcausalmapisonewhichdoesnotlookinthefuturecf.AppendixD.functionsis a crucial part of (9). The family of approximable maps does not include all path-to-path causalA causal map is one which does not ”look in the future” cf. Appendix D. functionst\mapsto Y_{t}^{\text{X}}butasubsetofthem,oftypebut a subset of them, of typet\mapsto Y_{t}^{\text{X}}:=\Psi(\omega^{\text{X}}_{[0,t]})$, satisfying the specific time-homogeneity specified by the form of the restriction, akin to that in (Li et al., 2022b).

2 Signature expansion

To study expressivity of these generalized SSMs (Linear CDEs), it will be convenient to introduce the so-called signature transform (Lyons et al., 2007; Kidger et al., 2019; Fermanian et al., 2023), a classical path-transform from stochastic analysis. The main reason for doing so is that, as a simple consequence of the Stone-Weirestrass theorem, linear functionals on the signature provide the essential building blocks (analogous to monomials on Euclidean spaces) to approximate continuous functions on path space.

We refer the interested reader to Appendix A for additional details and references on the signature. A classical result from rough path theory states that a Linear CDE can be expanded explicitly as an (infinite) linear combination of terms in the signature of the driving path. Next, we specialize this result to the case of Linear CDEs given by Eqn. (6).

For any choice of matrices A1,...,AdωA_{1},...,A_{d_{\omega}} and BB, the unique solution to

where we used the notation AI:=Ain...Ai1A_{I}:=A_{i_{n}}...A_{i_{1}}.

Notice that the previous result does not rely on any assumptions on the nature of Z0Z_{0}, AiA_{i}, and BB; for any such choice the result is a time-independent linear map on a feature vector

3 Randomised Linear CDEs

It turns out that the same space of functionals in the closure (9) of a Linear CDE can be induced, with high probability, by sampling the matrices A1,...,Adω,BA_{1},...,A_{d_{\omega}},B and Z0Z_{0} at random according to standard ML practices (LeCun et al., 2012), as stated in the following result:

where the matrices A1,...,AdωA_{1},...,A_{d_{\omega}} and BB defining ZXZ^{X} are sampled at random as

In particular, Thm. 4.5 implies that only the final readout vv needs to be trained, while the randomly initialised matrices A1,...,AdωA_{1},...,A_{d_{\omega}} and BB can be left untrained (cf. Sec. 6). This is a similar mechanism to the paradigm advocated in reservoir computing (Cuchiero et al., 2021b; Lukoveviius & Jaeger, 2009).

It can be proved, using the results by Dubach & Peled (2021), that the sampling measure does not have to be Gaussian if it satisfies certain moment requirements.

The core argument behind the proof of Thm. 4.5 consists of exhibiting almost-orthogonal vectors trough random linear maps, and then using these as a pseudo-basis of vectors on which to mimic the dynamics of the feature map T(X)0,tT(X)_{0,t}. It then suffices to bound the error made by this approximate approach, which is proved to disappear as the dimension NN increases.

4 S4 as a special case

This in turns implies the following expression for ZXZ^{X}

and the corresponding feature vector TT reads

for polynomials p1,,pdp^{1},\dots,p^{d}, which reconciles our framework with the results obtained by Li et al. (2022b).

It is evident from this discussion that the classical choice of input-independent ω\omega (i.e. ωtX=t\omega^{\text{X}}_{t}=t) precludes the use of higher order statistics of XX.

Motivated by the efficiency requirements in S4 and Mamba, in the next section, we discuss the question of whether this choice of diagonal restricts the family of functionals that one can learn.

5 The Diagonal Case

If Ai:=diag(vi)A_{i}:=\operatorname{diag}(v_{i}), due to commutativity of diagonal matrices, one has that for any permutation σ(I)\sigma(I) of a word II the equality AI=Aσ(I)A_{I}=A_{\sigma(I)} holds. This means that a linear functional on ZtXZ_{t}^{\text{X}} such as (11) can fundamentally only see the symmetric part of Sig(ωX)s,t\text{Sig}(\omega^{\text{X}})_{s,t}, which is just the tensor exponential

and as such does not capture higher order channel combinations (cf. See Sect. 6 for empirical validation). In fact the explicit solution a can be written as

On top of efficiency, an important feature of selective SSMs, crucial for long-range reasoning (see discussion in (Orvieto et al., 2023b)) is their stability in a dynamical system sense: eigenvalues of Aˉ\bar{A} for those discrete recurrences always have absolute value bounded by 11. Both S4 and Mamba achieve this by restricting the sign of the diagonal components of their recurrent matrix AA with positive nonlinearities such as ReLUs (see Sec. 2). Under ZOH numerical integration (see again Sec. 2), the resulting discrete-time Aˉ=exp(ΔA)\bar{A}=\exp(\Delta A) is guaranteed to be stable. We can apply the same idea in the general linear diagonal CDE setting (cf. Appendix C.1). Instead, if one works with dense matrices, ensuring stability in numerical integration during training requires heavy adaptations of the optimizer on top of higher overall complexity.

Note that the main difference between (9) and (14) is the path-dependence: the latter take as input just a value or increment of ωX\omega^{\text{X}} while the maps in (\refeqn:mainF)(\ref{eqn:main_F}) are functions on whole sections of it. Diagonal selective SSMs (i.e. MambaInstead, expressivity of S4 is unaffected by diagonality in the recurrence, see results above and in (Li et al., 2022b).) are thus fundamentally weaker than their non-diagonal counterparts, as they can at most capture the symmetric part of ωX\omega^{\text{X}}’s signature and its integrals against dξXd\xi^{\text{X}} (see Appendix for details).

6 Chaining Diagonal CDEs

Fortunately it is possible to re-gain expressivity without sacrificing the computational advantages of diagonal schemes through chaining. This means driving a new linear CDE by the solution of a previous linear CDE, and repeating this procedure KK times. With this procedure, one can recover signature entries up to depth KK with complexity O(KLN)\mathcal{O}(KLN).

The intuition behind this result is the following: with a single diagonal layer it’s possible This is not straightforward since the time in the integral part of (14) is going ”backwards”! to recover the terms

which are just length 22 terms from Sig((ωX,ξX))0,t\text{Sig}((\omega^{\text{X}},\xi^{\text{X}}))_{0,t}. If we pass these terms as ω\omega in another layer then we can obtain a length 33 term from Sig((ωX,ξX))0,t\text{Sig}((\omega^{\text{X}},\xi^{\text{X}}))_{0,t}, and so on.

The structure of the chaining in Prop. 4.7 seems quite restrictive, in particular it’s not clear to which extent gates which are non-linear in the input XX allow for such a construction. In the Appendix we address this issue in the simple but important case dωtX=ReLU(WXt+B)dtd\omega^{\text{X}}_{t}=\text{ReLU}(WX_{t}+B)dt, where we show that no information is lost and numerical stability can be enforced.

Path-to-Path Learning

Orvieto et al. (2023a) discuss an important feature of S4, which we already presented in previous sections: its action on the input is a simple convolution with fixed (independent of the input) parameters. For this exact reason, the construction of (Orvieto et al., 2023a) defers all non-linear computation (reasoning) to the multi-layer perceptron (MLP) acting on the linear recurrence output, which is simply providing in their setting a loss-less compression of the inputs – with no added reasoning.

In our setting, we instead characterized the processing power of input-controlled (dense or diagonal) SSMs precisely, showing that it greatly surpasses linear filtering. For this reason, the computational burden for an MLP action on a general linear CDE, unlocking path-to-path learning, would be greatly diminished and its actual function reduced to a time-consistent interpolation of already good path-to-point approximations. This method is robust to different input discretization schemes and does not offload all the complexity to the network which follows the SSM (cf. Figure 2).

Actually, FF in the above result does not have to be a neural network, it suffices for it to be able to interpolate linear maps in a time-consistent manner cf. Appendix D.

Empirical Validation

A variation on the toy dataset introduced in (Walker et al., 2024) is used to empirically validate our theoretical results. We use two datasets of 100,000100,000 sample paths each, with dimensions 22 and 33 respectively. In both datasets, paths are defined on 100100 regularly spaced time-steps between t=0t=0 and t=1t=1. The change in each channel at each time point is an independent sample from a standard Normal distribution rounded to the nearest integer. The datasets are then normalised to range between 1-1 and 11. The aim on both datasets is to predict specific terms in the anti-symmetric part of the signature. For the 22 dimensional dataset the prediction is an area integral and for the 33 dimensional a volume integral given, respectively, by

A S5 or Mamba recurrence with linear readout.

Two stacked S5 or Mamba recurrences with a linear mixing layer in-between, and the final state of the second layer fed into a linear readout.

A linear NCDE with gates ωtX=ξtX=(t,Xt)\omega^{\text{X}}_{t}=\xi^{\text{X}}_{t}=(t,X_{t}) followed by a linear readout.

All state space models have trainable matrices in their recurrences, whereas the linear NCDE is using fixed matrices. All of the models use a hidden dimension of 256256, with the state space models using a state dimension of 256256. The state space models are trained using batch gradient descent with a batch size of 3232 and Adam with a learning rate of 3×1053\times 10^{-5}. Results are plotted in Fig. 3.

The output from the linear NCDE’s recurrence is obtained using an adaptive ODE solver, Tsit5, with an absolute and relative tolerance of 1×1021\times 10^{-2}. The linear NCDEs linear layer is obtained by ordinary least squares.

Conclusions

In this paper we have considered a family of models, defined by gating functions as solutions to linear CDEs, which extends both classical and modern SSM architectures.

Using analytical tools from Rough Paths theory we have characterized the uniform closure of these models generalizing the results of (Li et al., 2022b), shedding light on the expressive advantages of input-controlled transition dynamics, which allow to capture high order statistics of the input as opposed to just the linear ones extracted by convolutions.

We point out how the full expressive range is already captured by generic models, and how more efficient computational structures, such as imposing diagonality, weaken these expressive capabilities. A remedy to this loss is presented in the chaining of these models.

Finally we show how the substitution of the linear readout with an MLP allows to approximate general path-to-path functions without, in contrast to the S4 case, offloading all the complexity to the neural network.

Acknowledgements

Nicola Muca Cirone is supported by the EPSRC Centre for Doctoral Training in Mathematics of Random Systems: Analysis, Modelling and Simulation (EP/S023925/1). Antonio Orvieto acknowledges the financial support of the Hector Foundation. Benjamin Walker was funded by the Hong Kong Innovation and Technology Commission (InnoHK Project CIMDA). Terry Lyons was funded in part by the EPSRC [grant number EP/S026347/1], in part by The Alan Turing Institute under the EPSRC grant EP/N510129/1, the Data Centric Engineering Programme (under the Lloyd’s Register Foundation grant G0095), the Defence and Security Programme (funded by the UK Government) and the Office for National Statistics & The Alan Turing Institute (strategic partnership) and in part by the Hong Kong Innovation and Technology Commission (InnoHK Project CIMDA).

References

Appendix A Introduction to Signatures

This initial section of the Appendix is devoted to a brief introduction to the topic of Signature Transform. For more in-depth surveys we suggest the reader to refer to (Fermanian, 2020; Chevyrev & Kormilitzin, 2016; Lyons & McLeod, 2024).

In the simplest setting of smooth paths a CDE is a differential equation of form

The theory of Rough Paths has its origins in the study of such types of differential equations and provides a theoretical framework to define and work in rough settings i.e. when XX is not kust BV but even α\alpha-Hölder for α(0,1)\alpha\in(0,1) cf. (Friz & Victoir, 2010).

where Vif(z):=dfy[Vi(z)]V_{i}f(z):=df_{y}[V_{i}(z)]. Iterating this procedure on the VifV_{i}fs, i.e. substituting in the previous equation the analogously obtained equality

Keeping with this procedure for NN steps we get

where I=(i1,,ik)I=(i_{1},\dots,i_{k}) runs trough the multi-indices, VIf:=Vi1Vi2VikfV_{I}f:=V_{i_{1}}V_{i_{2}}\dots V_{i_{k}}f and

As one can imagine, under reasonable regularity assumptions, the remainder goes to as NN\to\infty and at the limit

This is a remarkable result: to know the solution ZtZ_{t} to the original CDE it suffices to know the quantities VIfV_{I}f for all multi-indices and ff in the coordinate maps, together with the iterated integrals

This observation is at the core of Rough Path Analysis, the theory can in a sense be considered an extreme development of it. The collection of iterated integrals, the Signature, will be the main for our analysis.

In Appendix E we expand and make rigorous the arguments of this section in the case of affine vector fields.

A.2 Basic Definitions

where A=(a0,a1,)A=(a_{0},a_{1},\dots) and B=(b0,b1,)B=(b_{0},b_{1},\dots)

A.3 Notable Results

Here we present some notable results of which we will make use through the paper. We omit the proofs if they can be easily found in the suggested references.

The first result is about bounding the norm of Signature entries:

The most important fact about Signature is that it acts as the basis for a Taylor expansion in path space. In fact just as finite linear combinations of monomials are dense in the continuous functions with a compact input set, finite linear combinations of Signature entries are dense in continuous functions from compact path-spaces:

for some finite sequence (αI)IN(\alpha_{I})_{|I|\leq N} of real numbers.

There is no magic in this result, it is just an application of Stone-Weiestrass enabled by the rich algebraic structure of iterated integrals, studied originally in (Chen, 1958).

This definition is such that the following important equation holds

With the right augmentation of the paths one can see that the Signature distinguishes between different sections of paths, this will be crucial for some of the original results presented in this work.

The if part is follows from Sig(γ)s,t=Sig(γ[s,t])0,1\text{Sig}(\gamma)_{s,t}=\text{Sig}(\gamma_{[s,t]})_{0,1}. For the only if part, If s=ss=s^{\prime} and t=tt=t^{\prime} the statement holds; this is because if the signatures over the time interval [s,t][s,t] of two time-augmented paths are equal, then the two paths must be equal on [s,t][s,t]. We now show that augmenting the path with t2t^{2} and imposing equality of signatures, implies s=ss=s^{\prime} and t=tt=t^{\prime}, which will in turn allow us to conclude the proof by the previous remark. Assume Sig(ω)s,t=Sig(γ)s,t\text{Sig}(\omega)_{s,t}=\text{Sig}(\gamma)_{s^{\prime},t^{\prime}}, in particular we must have

Hence it must be true that t=tt=t^{\prime} and s=ss=s^{\prime}. ∎

Appendix B Expressivity

In the body of the paper we have presented the main results with the simplified assumption of Z0X=0Z^{\text{X}}_{0}=0 or at best Z0X=Z0Z^{\text{X}}_{0}=Z_{0} i.e. with an initial value independent from the input. In this appendix we will carry on the proofs in a more general setting in which Z0XZ^{\text{X}}_{0} is allowed to be input-dependent, as previously discussed the choice of initial value is, in contrast to the classical setting, meaningful inasmuch it allows to approximate linear maps on the signature of ωX\omega^{\text{X}}_{}. In order to do so we have to introduce a new gate, the initial value gate, in the form of a map

Despite the notation, there is no reason why (X)0(X)_{0} should be the initial value of the path XX, one should think of this map as the one summarizing the data which still matters for the task but which does not have a time-series nature.

Then the main object of study, ”gated” Linear CDEs, are defined as:

B.2 Main Result - Statement and Strategy

Here we present the unified expressivity result in its most general form:

Moreover generic parameters suffice with high probability in the sense that under LeCun initialization

If the AiA_{i}s are constrained to be diagonal, as often is the case in practice, the requirements ωtX,1t\omega^{\text{X},1}_{t}\equiv t, ωtX,2t2\omega^{\text{X},2}_{t}\equiv t^{2} can be dropped and the existence result only holds with

Moreover in both the dense and diagonal cases the ”reverse” also holds in the sense that, given any choice of matrices Ai,B,CA_{i},B,C there is an ϵ\epsilon-close map FF in the corresponding family.

As one can see the theorem is composed of different sub-results, which we believe are better understood separately from each other. The proof will thus be split in the following steps:

Using the theory developed in Appendix E we see how linear functions on the ZtZ_{t}s can be seen as linear functions on certain terms of the Signature Transform.

Such terms define a feature map T(X)0,tT(X)_{0,t} which generates a Reproducing Kernel Hilbert Space Ht()0,ω,η\mathcal{H}^{\tiny(\cdot)_{0},\omega,\eta}_{t}. This abstract space acts as an upper bound on expressivity: linear functions on ZtZ_{t} always belong to its closure (in uniform norm), independently of dimension and weights chosen, hence they cannot reach what functions in Ht()0,ω,η\mathcal{H}^{\tiny(\cdot)_{0},\omega,\eta}_{t} can’t approximate.

The full expressive range of Ht()0,ω,η\mathcal{H}^{\tiny(\cdot)_{0},\omega,\eta}_{t} is shown to be captured by generic ZtZ_{t}s.

Diagonal systems are shown to be restricted to a subset of the Ht()0,ω,η\mathcal{H}^{\tiny(\cdot)_{0},\omega,\eta}_{t} of which they capture the full expressive range.

B.3 Main Result - Proofs

is given, using the notation AIj:=AjAIA_{Ij}:=A_{j}A_{I}, by

Just apply Theorems (E.2) and (E.6) of Appendix E. ∎

A property highlighted by the previous result is the interpretability of these models. After training the NCDEs one can compute the matrix multiplications and observe which entries of the signature the model chooses to take into consideration, to attend.

B.3.2 The feature map T𝑇T and its RKHS

The expression in (34) is a linear map on a feature vector given, at time tt, by

This feature vector can be understood as a tensor in the following way:

In fact one readily sees that the only non-zero terms of T(X)0,tT(X)_{0,t} defined as above are

This is similar to the tensor-valued CDE defining the signature as a tensor i.e. (Salvi et al., 2021a)

with the addition of two terms to track X0X_{0} and ξX\xi^{\text{X}}. One could also understand T(X)s,tT(X)_{s,t} as a sub-tensor of

Classical RKHS theory tells us that we can characterize its elements as:

taken over those γ,β\gamma,\beta for which the above equality holds.

Signature kernels (Salvi et al., 2021a) are a class of universal kernels on sequential data which have received attention in recent years thanks to their efficiency in handling path-dependent problems (Lemercier et al., 2021; Salvi et al., 2021c; Cochrane et al., 2021; Salvi et al., 2021b; Cirone et al., 2023; Issa et al., 2023).

Just as signature kernels, the kernel associated to T(X)0,tT(X)_{0,t} can be explicitly written as the solution of a two-parameter CDE:

Let KX,Y(s,t):=T(X)0,s,T(Y)0,tl2\mathcal{K}^{\text{X},\text{Y}}(s,t):=\langle T(X)_{0,s},T(Y)_{0,t}\rangle_{l^{2}} then

The first expression follows immediately from Eqn. (36) the second one by summing the products of T(X)0,sT(X)_{0,s}’s and T(Y)0,tT(Y)_{0,t}’s entries given above. ∎

The following proposition will show how linear maps on ZtXZ^{\text{X}}_{t} cannot be more expressive than elements of the RKHS H()0,ω,η\mathcal{H}^{\tiny(\cdot)_{0},\omega,\eta}_{} since their closure is in the closure of H()0,ω,η\mathcal{H}^{\tiny(\cdot)_{0},\omega,\eta}_{}. In this precise sense these spaces act like upper bounds to expressiveness.

where FH()0,ω,ηF\in\mathcal{H}^{\tiny(\cdot)_{0},\omega,\eta}_{}. In other words, linear maps on the ZtXZ^{\text{X}}_{t} are in the uniform closure of H()0,ω,η\mathcal{H}^{\tiny(\cdot)_{0},\omega,\eta}_{}.

Using Eqn. (34) we see that v,ZtX\langle v,Z^{\text{X}}_{t}\rangle is a linear map on T(X)0,tT(X)_{0,t} with coefficients

using Cauchy-Schwartz it’s moreover easy to see the existence of a constant λ0\lambda\geq 0 such that for all I,i,jI,i,j one has

Since Sig(ω)s,tI1I!ω1var,[s,t]I|\text{Sig}(\omega)^{I}_{s,t}|\leq\frac{1}{|I|!}\left\lVert\omega\right\rVert_{1-var,[s,t]}^{|I|} we have that, given an integer M0M\geq 0, the bound

B.3.4 Uniform closure of the RKHS

is a continuous map from a compact space, thus the image must be compact too. Moreover by Prop. A.8 the Signature separates the points in this image. Since any GG as above has form

the proof follows from the uniform density on compact sets of linear functionals on the (truncated) Signature (Thm. A.5), by also uniformly bounding thanks to compactness and continuity the norms of X0X_{0} and ξ1X\xi^{\text{X}}_{1}. ∎

B.3.5 Generic Weights are fully expressive

We have seen how linear maps on ZtXZ^{\text{X}}_{t} are in the uniform closure of H()0,ω,η\mathcal{H}^{\tiny(\cdot)_{0},\omega,\eta}_{}, and we have explicitly characterized this closure. It is then natural to ask ”how much” of this closure the ZtXZ^{\text{X}}_{t} are able to ”explore”. The present section not only shows that the ZtXZ^{\text{X}}_{t} ”explore” all the closure, but also that a generic choice of weights is enough to eventually do this with high probability.

The fact that these maps are ”universal” in the above sense is not surprising, since it is well known that Linear CDEs are universal for path-to-point tasks cf. (Kidger, 2022), what is surprising is that this universality can be achieved probabilistically with one of the standard parametrizations used in ML practice (LeCun) It can be proved, using the results of (Dubach & Peled, 2021), that the sampling measure does not have to be Gaussian if it satisfies certain moment requirements..

Moreover generic weight choices suffice with high probability, in the sense that under LeCun initialization

We propose two proofs, the first one of a deterministic character concerns the first claim in the theorem, the second one is probabilistic and concerns the whole result. The deterministic proof follows the same arguments employed by (Kidger, 2022) and is included to highlight the main idea of the probabilistic result, which reduces to a ”spin” on the central argument of this proof.

Any FH()0,ω,ηF\in\mathcal{H}^{\tiny(\cdot)_{0},\omega,\eta}_{} has form

where πM\pi_{M} is the truncation at length MM.

In particular note how these matrices are such that

and for all I>M|I|>M one has necessarily AI=0A_{I}=0.

Then expanding ZtXZ^{\text{X}}_{t} as in (34) and using the equalities above one has

proving that for such vv and ZXZ^{\text{X}} it holds

(Probabilistic Proof) Any FH()0,ω,ηF\in\mathcal{H}^{\tiny(\cdot)_{0},\omega,\eta}_{} has form

where πM\pi_{M} is the truncation at length MM.

From (Cirone et al., 2023)[Appendix C] we know that

Then expanding ZtXZ^{\text{X}}_{t} as in (34)

Which leads, thanks to the same bounds of (Cirone et al., 2023)[Appendix C], to

But then by Markov’s inequality it holds that

and thus there must be a choice of N,{Ai},B,CN,\{A_{i}\},B,C such that the inequality holds, and we thus obtain using (54)

and we conclude by arbitrariness of ϵ\epsilon.

B.4 The Diagonal Case

Here we study the particular, but empirically important, case where the matrices AiA_{i} are taken to be diagonalIt is equivalent to ask for them to be commuting..

What we’ll discover is that the ZtXZ^{\text{X}}_{t} cannot differentiate between Sig(ωX)s,tI\text{Sig}(\omega^{\text{X}})_{s,t}^{I} and other Sig(ωX)s,tσ(I)\text{Sig}(\omega^{\text{X}})_{s,t}^{\sigma(I)} for any permutation σ\sigma of the letters in the word II.

By Theorem E.1 and Theorem E.6 we know that the solution of

where Ws,tW_{s,t} is the unique solution to

In case the AiA_{i}s are commuting matrices one can explicitly write the solution as

since for fixed ss one has, using commutativity, that

On the other hand we know, Theorem E.2, that

recalling then how eI1\shuffle\shuffleeIk=σSkeσ(I)e_{I_{1}}\shuffle\cdots\shuffle e_{I_{k}}=\sum_{\sigma\in S_{k}}e_{\sigma(I)} we get to

In particular we see how in the commuting case

B.4.2 Diagonal Expressiveness

Moreover the ”reverse” also holds i.e. given any choice of matrices Ai,B,CA_{i},B,C there is an ϵ\epsilon-close map FF in the family.

This is just a repetition of the arguments used for the dense case with little more care to get the uniformity in time.

The same argument of Proposition B.10 shows that the uniform closure of the space of linear maps on the ZtXZ^{\text{X}}_{t} is contained in the uniform closure of Sym(H()0,ω,η)Sym({\mathcal{H}}^{\tiny(\cdot)_{0},\omega,\eta}_{}), and the same bounds show that this latter closure is the same as that of its subset composed of those FSym(H()0,ω,η)F\in Sym({\mathcal{H}}^{\tiny(\cdot)_{0},\omega,\eta}_{}) having entries eventually equal to .

such maps can be expressed exactly in the form

for polynomial maps P,QP,Q fixed in time. The usual compactness and continuity argument, together with an application of Stone-Weiestrass, thus proves that the uniform closure of Sym(H()0,ω,η)Sym({\mathcal{H}}^{\tiny(\cdot)_{0},\omega,\eta}_{}) has the form needed.

The final ingredient is the density of the space of linear maps on the ZtXZ^{\text{X}}_{t} in Sym(H()0,ω,η)Sym({\mathcal{H}}^{\tiny(\cdot)_{0},\omega,\eta}_{}); this is another consequence of Stone-Weiestrass as seen from Proposition B.17. ∎

It is not necessary to pass trough Sym(H()0,ω,η)Sym({\mathcal{H}}^{\tiny(\cdot)_{0},\omega,\eta}_{}) to prove the previous result, since it directly follows from Proposition B.17. This choice of presentation has been motivated by the conviction of the usefulness of drawing parallels and comparisons.

This is an application of Stone-Weiestrass: E\mathcal{E} is a sub-algebra since

and α,βC    α+βC\alpha,\beta\in C\implies\alpha+\beta\in C by convexity of the cone; E\mathcal{E} contains the constant function e0,x=1e^{\langle 0,x\rangle}=1 and is clearly point separating since the cone, being dd-dimensional, it contains a basis of the whole space. ∎

The usefulness of stating the previous result in such a general setting is the following: with this formalism we can, for example, restrict to α0\alpha\leq 0, in this way we would have a method to control the stability (cf. Appendix C.1) of the linear CDEs by choosing the gate with a.s. ω˙X0\dot{\omega}^{\text{X}}\geq 0.

Appendix C Stability and Chaining of Diagonal Systems

Note that the present discussion holds even for non-diagonal but commuting matrices, since these can be simultaneously diagonalized (at the cost of considering the complex plane).

Here we explore the stability of the dynamical system ZXZ^{\text{X}}, thus we need to study the eigenvalues of the Ws,tXW^{\text{X}}_{s,t}. Recall how in this setting

Note that because ωX\omega^{\text{X}} is continuous and of bounded variation, it can be reparameterised to be Lipschitz continuous, hence absolutely continuous. Thus we can assume that ωX\omega^{\text{X}} is almost everywhere differentiable and its derivative ω˙L1\dot{\omega}\in L^{1}.

The stability of the dynamical system then depends on the alignment between ωtXωsX\omega^{\text{X}}_{t}-\omega^{\text{X}}_{s} and the singular vectors of VV. If Vω˙tX0V\dot{\omega}^{\text{X}}_{t}\leq 0 for all times, where the inequality is coordinate-wise, then Ws,tXW^{\text{X}}_{s,t} has eigenvalues all in $$ thus the system is stable making training easier (Orvieto et al., 2023b).

Consider the singular value decomposition (SVD) of the matrix VV

Then, a sufficient condition for stability is that for any k=1,...,Kk=1,...,K

In the case of Mamba (Gu & Dao, 2023) the matrices are diagonal and

moreover the proposed choices of VV are all of type

C.2 Chaining

The diagonal case differs from the general one not only in the fact that the class of approximable functions is much weaker but also in the necessity for the presence of ξX\xi^{\text{X}} in order to obtain any path-dependence. The term

becomes then a crucial component. At first sight one might think that such a term allows to recover at least level two components of the Signature of (ωX,ξX)(\omega^{\text{X}},\xi^{\text{X}}), unfortunately things are not as easy as they may seem. Notice how inside of the integral time is ”going backwards” from the perspective of ωX\omega^{\text{X}}, thus we can in general approximate terms of type

which are indeed terms of the Signature, but of the reverse paths ωrX=ω1rX\overleftarrow{\omega}^{\text{X}}_{r}=\omega^{\text{X}}_{1-r} and ξs=ξ1sX\overleftarrow{\xi}_{s}=\xi^{\text{X}}_{1-s}!

Under these hypotheses we know that linear functionals on ZtXZ^{\text{X}}_{t} are uniformly dense, for continuous ψ,ϕ\psi,\phi, in

Assume ξsX,j=αj,ωtX\xi^{\text{X},j}_{s}=\langle\alpha_{j},\omega^{\text{X}}_{t}\rangle and consider the choices

Models like Mamba do not only use diagonal matrices but also consider controls of a specific kind:

The choice of ReLUReLU enforces ω˙t0\dot{\omega}_{t}\geq 0 for all times as seen above, but could, a priori, destroy information about XX which allows for the recovery, after chaining, of its Signature.

Does this choice keep some expressivity? Fortunately almost all of it: since

one can choose a linear map WW which allows to linearly recover

from ωtX\omega^{\text{X}}_{t}. By correspondingly modifying the form of ψ\psi and ϕ\phi in Eqn. (77) such that

Appendix D Path-to-Path

i.e. G is causal if it does not look in the future.

Fix ϵ>0\epsilon>0. By B.7 the space H()0,ω,η{\mathcal{H}}^{\tiny(\cdot)_{0},\omega,\eta}_{} contains all functionals of form

Recalling that ωtX,1=t\omega^{\text{X},1}_{t}=t we get that X{tt}H()0,ω,ηX\mapsto\{t\mapsto t\}\in{\mathcal{H}}^{\tiny(\cdot)_{0},\omega,\eta}_{} so that, given density of linear maps on ZXZ^{\text{X}} in the space, Ψ(t,f0(X,t),,fM(X,t))\Psi(t,f_{0}(X,t),\dots,f_{M}(X,t)) can be uniformly approximated. Triangular inequality gives finally

which, by arbitrariness of ϵ\epsilon, gives the thesis. ∎

The non-linearity is crucial for the path-to-path result. A map of type (ω,t)tα,Sig(ω)0,t(\omega,t)\mapsto\langle t\alpha,\text{Sig}(\omega)_{0,t}\rangle cannot be approximated arbitrarily well by (ω,t)β,Sig(ω)0,t(\omega,t)\mapsto\langle\beta,\text{Sig}(\omega)_{0,t}\rangle.

In any case, note that in the proof the role of the neural network is only that of interpolating the RKHS elements in the right order and at the right time. All the non-linear complexity of learning the particular GG is offloaded and taken care of by the RKHS elements.

In the proof we have only considered the part of T(X)T(X) concerning ωX\omega^{\text{X}}, but T(X)tT(X)_{t} depends linearly on X0X_{0} and ξX\xi^{\text{X}} suggesting that neural networks on H()0,ω,η{\mathcal{H}}^{\tiny(\cdot)_{0},\omega,\eta}_{} have stronger generalization properties. In fact one can prove that it is possible to approximate all continuous G(X0,ωX,ξX,t)G(X_{0},\omega^{\text{X}},\xi^{\text{X}},t), this is done by reconstructing X0X_{0} and ξX\xi^{\text{X}}_{} as in the classical SSM case cf. (Orvieto et al., 2023a).

Appendix E Wronskian Matrix Theory

In this section we obtain a unified theory studying the solutions to general linear CDEs. The results presented here are not new and can be found in different terms in the literature (Friz & Victoir, 2010), despite this we have decided to reproduce them from scratch for completeness, notational reasons and to present a self-contained theory.

Define the map Γ:ΩΩ\Gamma:\Omega\to\Omega as

where Wd\mathcal{W}_{d} is the set of words in the alphabet {1,,d}\{1,\dots,d\} and

By defining M=max{Ai:i{1,,d}}M=\max\{\left\lVert A^{i}\right\rVert_{\infty}:i\in\{1,\dots,d\}\} then one clearly has

thus definitely (in kk) the map Γk\Gamma^{k} is a contraction. By Banach fixed point there exist a unique fixed point WΩW\in\Omega. ∎

Under the assumptions of the previous theorem one can write Ws,tW_{s,t} explicitly as

moreover if for all ii the matrix-valued maps are constant on all $i.e.i.e.A^{i}_{t}\equiv A_{i}$ then

where Sig(ω)s,tI\text{Sig}(\omega)_{s,t}^{I} is the Signature of the path ω\omega.

The second assertion follows from the first by definition of the Signature of a path.

Using the uniformity of this bound and the fact that for all IWdI\in\mathcal{W}_{d} one has

as a function of (s,t)(s,t), which moreover is uniformly continuous

Under the previous conditions, the unique solution of the NN-dimensional CDE

The solutions are unique by standard results (Friz & Victoir, 2010)[Thm. 3.7], moreover

The Wronskian matrix has the following properties:

r,s,t.Wr,t=Ws,tWr,s\forall r,s,t\in.\hskip 10.0ptW_{r,t}=W_{s,t}W_{r,s}

s,t.Ws,t1=Wt,s\forall s,t\in.\hskip 10.0ptW_{s,t}^{-1}=W_{t,s}

s,t.Ws,t=IdN+i=1dσ=stWσ,tAσidωσi\forall s,t\in.\hskip 10.0ptW_{s,t}=Id_{N}+\sum_{i=1}^{d}\int_{\sigma=s}^{t}W_{\sigma,t}A^{i}_{\sigma}d\omega^{i}_{\sigma}

The second statement follows from the previous one setting first r=tr=t and subsequently exchanging ss and tt.

This just follows from the classical case since we can write

We can now state the main result of the section:

Given the unique solution XtX_{t} one has

Appendix F ZOH and Exact Solutions

Consider a linear CDE as the one of Eqn.(28)

and recall how the solution can be explicitly written, for times s<ts<t, as

Assume moreover that in the interval [s,t][s,t] both drivers have constant derivative i.e.

But the integral can be explicitly solved as

which, setting Δ=ts\Delta=t-s, can be rewritten as