Theoretical Foundations of Deep Selective State-Space Models
Nicola Muca Cirone, Antonio Orvieto, Benjamin Walker, Cristopher Salvi, Terry Lyons
Introduction
Sequence-to-sequence blocks are fundamental components of modern deep learning models for language, images, video, audio, time series, and genomics. While attention (Vaswani et al., 2017; Dosovitskiy et al., 2020) has been for the last five years the dominant mechanism powering these architectures, competitive results have been recently achieved by replacing attention with state-space models (SSMs): GPU-efficient linear recurrent sequence-to-sequence blocks stemming from S4 (Gu et al., 2021). SSMs achieve state-of-the-art results on long-range-reasoning benchmarks (Tay et al., 2020) and show outstanding performance in various domain including vision (Nguyen et al., 2022), audio (Goel et al., 2022), biological signals (Gu et al., 2021), reinforcement learning (Lu et al., 2023) and online learning (Zucchet et al., 2023). SSMs recently have gained significant interest in the community since their computational complexity scales linearly in sequence length, while attention scales quadratically; moreover, unlike other recurrent mechanisms such as LSTMs (Hochreiter & Schmidhuber, 1997) and GRUs (Cho et al., 2014), they can be efficiently parallelized on GPUs during training using parallel scans (Martin & Cundy, 2017; Smith et al., 2023).
While standard SSMs were shown to be particularly powerful on signal processing tasks, their computation power is limited: the core sequential mechanism of S4 is equivalent to a convolution (filtering) (Li et al., 2022a). This represents a drawback in challenging domains such as text and genetics, where the ability to select data efficiently in an input-dependent manner (i.e., perform content-based reasoning) is crucial (see results in (Wang et al., 2022; Fu et al., 2022)). Towards reaching this goal with recurrent models, various adaptations of S4 have been proposed in the last few months. Notably, Mamba (Gu & Dao, 2023) implements simple and efficient gating mechanisms on the S4 recurrence, unlocking input selectivity in the memory update. Mamba achieved state-of-the-art performance of various language modeling tasks at a fraction of the compute needed to train a transformer model with an equal number of parameters, with inference time still scaling (like S4) with . Similar ideas can be found in recent S4-inspired recurrent linear attention models, such as RWKV (Peng et al., 2023), RetNet (Sun et al., 2023), Gateloop (Katsch, 2023) and Gated Linear Attention (GLA) (Yang et al., 2023). Last, very recently, (De et al., 2024) surpassed the performance of Mamba with a gated RNN architecture – Griffin – based on an improved version of the LRU (Orvieto et al., 2023b), perhaps the simplest SSM variant.
At the core of the models discussed above is a linear time-varying dynamical system, where content-based reasoning is performed along the input sequence through an efficient and parallelizable update on a hidden state. In this paper, we generalize the structure of such models, drawing a direct link to controlled differential equations (CDEs) (Kidger et al., 2020; Morrill et al., 2021; Arribas et al., 2020; Salvi et al., 2022; Hoglund et al., 2023) and using tools from rough path theory to study expressivity of deep selective SSMs.
We provide a framework for the analysis of (input-controlled) linear (in the hidden state) recurrences such as S4, Mamba and GLA. This framework allows the use of powerful tools and results in the Rough Path Theory literature to reason about the expressivity of modern SSM layers and to obtain generalized versions of the universality results of (Li et al., 2022b). This framework captures both S4 and Mamba, as well as recently proposed linear-attention-powered recurrences.
We prove that wide enough randomly initialized dense input-controlled linear recurrences are fully expressive: with no need for a multi-layer-perceptron (MLP) block transforming pointwise the recurrence output (as instead is done in the transformers architecture or at the S4 recurrence output), we show that the hidden state contains sufficient input statistics to approximate any continuous function from the input sequence to a target value. This is in clear contrast to S4, where the hidden state is just a convolution of the input sequence with a fixed kernel.
We show that diagonal input-controlled linear recurrences (such as Mamba) provably collect input statistics more efficiently than S4. We additionally show that chaining such blocks, by simply placing linear pointwise maps in between, allows computation of higher order global statistics – matching the dense linear input-controlled setting discussed in point (2) above, in the depth limit.
The main objective of our research is to outline the foundational mathematical tools for comprehensive theoretical investigations of modern recurrent blocks. In addition to our results and insights concerning recently proposed SSM models, our work provides a framework that can be useful for analyzing and comparing forthcoming architectural advances.
State-space Models
We start with a quick simplified recap of S4 (Gu et al., 2021) and then describe recently proposed variants such as Mamba (in particular, the S6 block) (Gu & Dao, 2023) and Gated Linear Attention (GLA) (Yang et al., 2023). We restrict our focus to the recurrent mechanism and invite the reader to refer to the original papers for a description of the token-wise operations following and preceding each block.
with determined by the discretization technique and the channel-dependent stepsize : under the commonly used Zero-Order Hold discretization This corresponds to: (i). considering the continuous underlying signal to be piecewise constant, (ii). solving exactly the ODE 1 and finally (iii). sampling at the sample times of . ,
The Selective SSM (S6) powering the Mamba architecture (Gu & Dao, 2023) augments S4 with input-controlled transitions:
2 Known properties of (non-linear) recurrences
Mamba (alongside with gated linear attention variants e.g. Yang et al. (2023)) does not fall in the linear RNN or the nonlinear RNN setting: its recurrence is linear on the hidden state (hence it can be parallelized) but it is not linear time-invariant as S4. In this paper, we study this setting.
SSMs as Linear CDEs
As shown by Gu & Dao (2023), the crucial component that unlocks in-context learning and selectivity in modern SSMs is the input-dependent state-to-state transition matrix , gating the hidden state and thus allowing the system to filter out irrelevant information and remember relevant information indefinitely. As we will shortly see, one can study the structure and features of such systems within a unified convenient continuous-time framework (Linear Controlled Differential equations). We define such framework (Def. 6) in full generality, and draw the connection to S4 and Mamba.
where is the -th channel of the path .
We show that both S4 and Mamba can be written in continuous-time as Linear CDEs, for difference choices of and . We describe this visually in Fig. 1. This motivates us to study the expressivity of the model in Eq. 6 and how it is affected by the choice of and .
The S4 model (Gu et al., 2021) is a Linear CDE of the form (6) with “gate” functions
2 Mamba is a Linear CDE
As shown Appendix F, Mamba in Eqn. (3) is equivalent to the following Linear CDE
where we defined piecewise constant, with values the , on a grid of stepsize and
As a quick validation, note that Eqn. (3) becomes
As , the exponential becomes linear and one has
Which is the standard forward Euler discretization of the Mamba ODE in Eqn. (8).
Expressivity of Linear CDEs
If the matrices are constrained to be diagonal, as it is often the case in practice, the requirements , can be dropped and the closure reduces to
This result can be seen as a generalization of the Universal Approximation for Linear RNNs presented by Li et al. (2022b)[Thm. 7] for generic gates . In fact in their setting with gates set as in (7) family (9) reduces to linear input filterings given by convolutions
The specific restriction of to subsets of $t\mapsto Y_{t}^{\text{X}}t\mapsto Y_{t}^{\text{X}}:=\Psi(\omega^{\text{X}}_{[0,t]})$, satisfying the specific time-homogeneity specified by the form of the restriction, akin to that in (Li et al., 2022b).
2 Signature expansion
To study expressivity of these generalized SSMs (Linear CDEs), it will be convenient to introduce the so-called signature transform (Lyons et al., 2007; Kidger et al., 2019; Fermanian et al., 2023), a classical path-transform from stochastic analysis. The main reason for doing so is that, as a simple consequence of the Stone-Weirestrass theorem, linear functionals on the signature provide the essential building blocks (analogous to monomials on Euclidean spaces) to approximate continuous functions on path space.
We refer the interested reader to Appendix A for additional details and references on the signature. A classical result from rough path theory states that a Linear CDE can be expanded explicitly as an (infinite) linear combination of terms in the signature of the driving path. Next, we specialize this result to the case of Linear CDEs given by Eqn. (6).
For any choice of matrices and , the unique solution to
where we used the notation .
Notice that the previous result does not rely on any assumptions on the nature of , , and ; for any such choice the result is a time-independent linear map on a feature vector
3 Randomised Linear CDEs
It turns out that the same space of functionals in the closure (9) of a Linear CDE can be induced, with high probability, by sampling the matrices and at random according to standard ML practices (LeCun et al., 2012), as stated in the following result:
where the matrices and defining are sampled at random as
In particular, Thm. 4.5 implies that only the final readout needs to be trained, while the randomly initialised matrices and can be left untrained (cf. Sec. 6). This is a similar mechanism to the paradigm advocated in reservoir computing (Cuchiero et al., 2021b; Lukoveviius & Jaeger, 2009).
It can be proved, using the results by Dubach & Peled (2021), that the sampling measure does not have to be Gaussian if it satisfies certain moment requirements.
The core argument behind the proof of Thm. 4.5 consists of exhibiting almost-orthogonal vectors trough random linear maps, and then using these as a pseudo-basis of vectors on which to mimic the dynamics of the feature map . It then suffices to bound the error made by this approximate approach, which is proved to disappear as the dimension increases.
4 S4 as a special case
This in turns implies the following expression for
and the corresponding feature vector reads
for polynomials , which reconciles our framework with the results obtained by Li et al. (2022b).
It is evident from this discussion that the classical choice of input-independent (i.e. ) precludes the use of higher order statistics of .
Motivated by the efficiency requirements in S4 and Mamba, in the next section, we discuss the question of whether this choice of diagonal restricts the family of functionals that one can learn.
5 The Diagonal Case
If , due to commutativity of diagonal matrices, one has that for any permutation of a word the equality holds. This means that a linear functional on such as (11) can fundamentally only see the symmetric part of , which is just the tensor exponential
and as such does not capture higher order channel combinations (cf. See Sect. 6 for empirical validation). In fact the explicit solution a can be written as
On top of efficiency, an important feature of selective SSMs, crucial for long-range reasoning (see discussion in (Orvieto et al., 2023b)) is their stability in a dynamical system sense: eigenvalues of for those discrete recurrences always have absolute value bounded by . Both S4 and Mamba achieve this by restricting the sign of the diagonal components of their recurrent matrix with positive nonlinearities such as ReLUs (see Sec. 2). Under ZOH numerical integration (see again Sec. 2), the resulting discrete-time is guaranteed to be stable. We can apply the same idea in the general linear diagonal CDE setting (cf. Appendix C.1). Instead, if one works with dense matrices, ensuring stability in numerical integration during training requires heavy adaptations of the optimizer on top of higher overall complexity.
Note that the main difference between (9) and (14) is the path-dependence: the latter take as input just a value or increment of while the maps in are functions on whole sections of it. Diagonal selective SSMs (i.e. MambaInstead, expressivity of S4 is unaffected by diagonality in the recurrence, see results above and in (Li et al., 2022b).) are thus fundamentally weaker than their non-diagonal counterparts, as they can at most capture the symmetric part of ’s signature and its integrals against (see Appendix for details).
6 Chaining Diagonal CDEs
Fortunately it is possible to re-gain expressivity without sacrificing the computational advantages of diagonal schemes through chaining. This means driving a new linear CDE by the solution of a previous linear CDE, and repeating this procedure times. With this procedure, one can recover signature entries up to depth with complexity .
The intuition behind this result is the following: with a single diagonal layer it’s possible This is not straightforward since the time in the integral part of (14) is going ”backwards”! to recover the terms
which are just length terms from . If we pass these terms as in another layer then we can obtain a length term from , and so on.
The structure of the chaining in Prop. 4.7 seems quite restrictive, in particular it’s not clear to which extent gates which are non-linear in the input allow for such a construction. In the Appendix we address this issue in the simple but important case , where we show that no information is lost and numerical stability can be enforced.
Path-to-Path Learning
Orvieto et al. (2023a) discuss an important feature of S4, which we already presented in previous sections: its action on the input is a simple convolution with fixed (independent of the input) parameters. For this exact reason, the construction of (Orvieto et al., 2023a) defers all non-linear computation (reasoning) to the multi-layer perceptron (MLP) acting on the linear recurrence output, which is simply providing in their setting a loss-less compression of the inputs – with no added reasoning.
In our setting, we instead characterized the processing power of input-controlled (dense or diagonal) SSMs precisely, showing that it greatly surpasses linear filtering. For this reason, the computational burden for an MLP action on a general linear CDE, unlocking path-to-path learning, would be greatly diminished and its actual function reduced to a time-consistent interpolation of already good path-to-point approximations. This method is robust to different input discretization schemes and does not offload all the complexity to the network which follows the SSM (cf. Figure 2).
Actually, in the above result does not have to be a neural network, it suffices for it to be able to interpolate linear maps in a time-consistent manner cf. Appendix D.
Empirical Validation
A variation on the toy dataset introduced in (Walker et al., 2024) is used to empirically validate our theoretical results. We use two datasets of sample paths each, with dimensions and respectively. In both datasets, paths are defined on regularly spaced time-steps between and . The change in each channel at each time point is an independent sample from a standard Normal distribution rounded to the nearest integer. The datasets are then normalised to range between and . The aim on both datasets is to predict specific terms in the anti-symmetric part of the signature. For the dimensional dataset the prediction is an area integral and for the dimensional a volume integral given, respectively, by
A S5 or Mamba recurrence with linear readout.
Two stacked S5 or Mamba recurrences with a linear mixing layer in-between, and the final state of the second layer fed into a linear readout.
A linear NCDE with gates followed by a linear readout.
All state space models have trainable matrices in their recurrences, whereas the linear NCDE is using fixed matrices. All of the models use a hidden dimension of , with the state space models using a state dimension of . The state space models are trained using batch gradient descent with a batch size of and Adam with a learning rate of . Results are plotted in Fig. 3.
The output from the linear NCDE’s recurrence is obtained using an adaptive ODE solver, Tsit5, with an absolute and relative tolerance of . The linear NCDEs linear layer is obtained by ordinary least squares.
Conclusions
In this paper we have considered a family of models, defined by gating functions as solutions to linear CDEs, which extends both classical and modern SSM architectures.
Using analytical tools from Rough Paths theory we have characterized the uniform closure of these models generalizing the results of (Li et al., 2022b), shedding light on the expressive advantages of input-controlled transition dynamics, which allow to capture high order statistics of the input as opposed to just the linear ones extracted by convolutions.
We point out how the full expressive range is already captured by generic models, and how more efficient computational structures, such as imposing diagonality, weaken these expressive capabilities. A remedy to this loss is presented in the chaining of these models.
Finally we show how the substitution of the linear readout with an MLP allows to approximate general path-to-path functions without, in contrast to the S4 case, offloading all the complexity to the neural network.
Acknowledgements
Nicola Muca Cirone is supported by the EPSRC Centre for Doctoral Training in Mathematics of Random Systems: Analysis, Modelling and Simulation (EP/S023925/1). Antonio Orvieto acknowledges the financial support of the Hector Foundation. Benjamin Walker was funded by the Hong Kong Innovation and Technology Commission (InnoHK Project CIMDA). Terry Lyons was funded in part by the EPSRC [grant number EP/S026347/1], in part by The Alan Turing Institute under the EPSRC grant EP/N510129/1, the Data Centric Engineering Programme (under the Lloyd’s Register Foundation grant G0095), the Defence and Security Programme (funded by the UK Government) and the Office for National Statistics & The Alan Turing Institute (strategic partnership) and in part by the Hong Kong Innovation and Technology Commission (InnoHK Project CIMDA).
References
Appendix A Introduction to Signatures
This initial section of the Appendix is devoted to a brief introduction to the topic of Signature Transform. For more in-depth surveys we suggest the reader to refer to (Fermanian, 2020; Chevyrev & Kormilitzin, 2016; Lyons & McLeod, 2024).
In the simplest setting of smooth paths a CDE is a differential equation of form
The theory of Rough Paths has its origins in the study of such types of differential equations and provides a theoretical framework to define and work in rough settings i.e. when is not kust BV but even -Hölder for cf. (Friz & Victoir, 2010).
where . Iterating this procedure on the s, i.e. substituting in the previous equation the analogously obtained equality
Keeping with this procedure for steps we get
where runs trough the multi-indices, and
As one can imagine, under reasonable regularity assumptions, the remainder goes to as and at the limit
This is a remarkable result: to know the solution to the original CDE it suffices to know the quantities for all multi-indices and in the coordinate maps, together with the iterated integrals
This observation is at the core of Rough Path Analysis, the theory can in a sense be considered an extreme development of it. The collection of iterated integrals, the Signature, will be the main for our analysis.
In Appendix E we expand and make rigorous the arguments of this section in the case of affine vector fields.
A.2 Basic Definitions
where and
A.3 Notable Results
Here we present some notable results of which we will make use through the paper. We omit the proofs if they can be easily found in the suggested references.
The first result is about bounding the norm of Signature entries:
The most important fact about Signature is that it acts as the basis for a Taylor expansion in path space. In fact just as finite linear combinations of monomials are dense in the continuous functions with a compact input set, finite linear combinations of Signature entries are dense in continuous functions from compact path-spaces:
for some finite sequence of real numbers.
There is no magic in this result, it is just an application of Stone-Weiestrass enabled by the rich algebraic structure of iterated integrals, studied originally in (Chen, 1958).
This definition is such that the following important equation holds
With the right augmentation of the paths one can see that the Signature distinguishes between different sections of paths, this will be crucial for some of the original results presented in this work.
The if part is follows from . For the only if part, If and the statement holds; this is because if the signatures over the time interval of two time-augmented paths are equal, then the two paths must be equal on . We now show that augmenting the path with and imposing equality of signatures, implies and , which will in turn allow us to conclude the proof by the previous remark. Assume , in particular we must have
Hence it must be true that and . ∎
Appendix B Expressivity
In the body of the paper we have presented the main results with the simplified assumption of or at best i.e. with an initial value independent from the input. In this appendix we will carry on the proofs in a more general setting in which is allowed to be input-dependent, as previously discussed the choice of initial value is, in contrast to the classical setting, meaningful inasmuch it allows to approximate linear maps on the signature of . In order to do so we have to introduce a new gate, the initial value gate, in the form of a map
Despite the notation, there is no reason why should be the initial value of the path , one should think of this map as the one summarizing the data which still matters for the task but which does not have a time-series nature.
Then the main object of study, ”gated” Linear CDEs, are defined as:
B.2 Main Result - Statement and Strategy
Here we present the unified expressivity result in its most general form:
Moreover generic parameters suffice with high probability in the sense that under LeCun initialization
If the s are constrained to be diagonal, as often is the case in practice, the requirements , can be dropped and the existence result only holds with
Moreover in both the dense and diagonal cases the ”reverse” also holds in the sense that, given any choice of matrices there is an -close map in the corresponding family.
As one can see the theorem is composed of different sub-results, which we believe are better understood separately from each other. The proof will thus be split in the following steps:
Using the theory developed in Appendix E we see how linear functions on the s can be seen as linear functions on certain terms of the Signature Transform.
Such terms define a feature map which generates a Reproducing Kernel Hilbert Space . This abstract space acts as an upper bound on expressivity: linear functions on always belong to its closure (in uniform norm), independently of dimension and weights chosen, hence they cannot reach what functions in can’t approximate.
The full expressive range of is shown to be captured by generic s.
Diagonal systems are shown to be restricted to a subset of the of which they capture the full expressive range.
B.3 Main Result - Proofs
is given, using the notation , by
Just apply Theorems (E.2) and (E.6) of Appendix E. ∎
A property highlighted by the previous result is the interpretability of these models. After training the NCDEs one can compute the matrix multiplications and observe which entries of the signature the model chooses to take into consideration, to attend.
B.3.2 The feature map T𝑇T and its RKHS
The expression in (34) is a linear map on a feature vector given, at time , by
This feature vector can be understood as a tensor in the following way:
In fact one readily sees that the only non-zero terms of defined as above are
This is similar to the tensor-valued CDE defining the signature as a tensor i.e. (Salvi et al., 2021a)
with the addition of two terms to track and . One could also understand as a sub-tensor of
Classical RKHS theory tells us that we can characterize its elements as:
taken over those for which the above equality holds.
Signature kernels (Salvi et al., 2021a) are a class of universal kernels on sequential data which have received attention in recent years thanks to their efficiency in handling path-dependent problems (Lemercier et al., 2021; Salvi et al., 2021c; Cochrane et al., 2021; Salvi et al., 2021b; Cirone et al., 2023; Issa et al., 2023).
Just as signature kernels, the kernel associated to can be explicitly written as the solution of a two-parameter CDE:
Let then
The first expression follows immediately from Eqn. (36) the second one by summing the products of ’s and ’s entries given above. ∎
The following proposition will show how linear maps on cannot be more expressive than elements of the RKHS since their closure is in the closure of . In this precise sense these spaces act like upper bounds to expressiveness.
where . In other words, linear maps on the are in the uniform closure of .
Using Eqn. (34) we see that is a linear map on with coefficients
using Cauchy-Schwartz it’s moreover easy to see the existence of a constant such that for all one has
Since we have that, given an integer , the bound
B.3.4 Uniform closure of the RKHS
is a continuous map from a compact space, thus the image must be compact too. Moreover by Prop. A.8 the Signature separates the points in this image. Since any as above has form
the proof follows from the uniform density on compact sets of linear functionals on the (truncated) Signature (Thm. A.5), by also uniformly bounding thanks to compactness and continuity the norms of and . ∎
B.3.5 Generic Weights are fully expressive
We have seen how linear maps on are in the uniform closure of , and we have explicitly characterized this closure. It is then natural to ask ”how much” of this closure the are able to ”explore”. The present section not only shows that the ”explore” all the closure, but also that a generic choice of weights is enough to eventually do this with high probability.
The fact that these maps are ”universal” in the above sense is not surprising, since it is well known that Linear CDEs are universal for path-to-point tasks cf. (Kidger, 2022), what is surprising is that this universality can be achieved probabilistically with one of the standard parametrizations used in ML practice (LeCun) It can be proved, using the results of (Dubach & Peled, 2021), that the sampling measure does not have to be Gaussian if it satisfies certain moment requirements..
Moreover generic weight choices suffice with high probability, in the sense that under LeCun initialization
We propose two proofs, the first one of a deterministic character concerns the first claim in the theorem, the second one is probabilistic and concerns the whole result. The deterministic proof follows the same arguments employed by (Kidger, 2022) and is included to highlight the main idea of the probabilistic result, which reduces to a ”spin” on the central argument of this proof.
Any has form
where is the truncation at length .
In particular note how these matrices are such that
and for all one has necessarily .
Then expanding as in (34) and using the equalities above one has
proving that for such and it holds
(Probabilistic Proof) Any has form
where is the truncation at length .
From (Cirone et al., 2023)[Appendix C] we know that
Then expanding as in (34)
Which leads, thanks to the same bounds of (Cirone et al., 2023)[Appendix C], to
But then by Markov’s inequality it holds that
and thus there must be a choice of such that the inequality holds, and we thus obtain using (54)
and we conclude by arbitrariness of .
B.4 The Diagonal Case
Here we study the particular, but empirically important, case where the matrices are taken to be diagonalIt is equivalent to ask for them to be commuting..
What we’ll discover is that the cannot differentiate between and other for any permutation of the letters in the word .
By Theorem E.1 and Theorem E.6 we know that the solution of
where is the unique solution to
In case the s are commuting matrices one can explicitly write the solution as
since for fixed one has, using commutativity, that
On the other hand we know, Theorem E.2, that
recalling then how we get to
In particular we see how in the commuting case
B.4.2 Diagonal Expressiveness
Moreover the ”reverse” also holds i.e. given any choice of matrices there is an -close map in the family.
This is just a repetition of the arguments used for the dense case with little more care to get the uniformity in time.
The same argument of Proposition B.10 shows that the uniform closure of the space of linear maps on the is contained in the uniform closure of , and the same bounds show that this latter closure is the same as that of its subset composed of those having entries eventually equal to .
such maps can be expressed exactly in the form
for polynomial maps fixed in time. The usual compactness and continuity argument, together with an application of Stone-Weiestrass, thus proves that the uniform closure of has the form needed.
The final ingredient is the density of the space of linear maps on the in ; this is another consequence of Stone-Weiestrass as seen from Proposition B.17. ∎
It is not necessary to pass trough to prove the previous result, since it directly follows from Proposition B.17. This choice of presentation has been motivated by the conviction of the usefulness of drawing parallels and comparisons.
This is an application of Stone-Weiestrass: is a sub-algebra since
and by convexity of the cone; contains the constant function and is clearly point separating since the cone, being -dimensional, it contains a basis of the whole space. ∎
The usefulness of stating the previous result in such a general setting is the following: with this formalism we can, for example, restrict to , in this way we would have a method to control the stability (cf. Appendix C.1) of the linear CDEs by choosing the gate with a.s. .
Appendix C Stability and Chaining of Diagonal Systems
Note that the present discussion holds even for non-diagonal but commuting matrices, since these can be simultaneously diagonalized (at the cost of considering the complex plane).
Here we explore the stability of the dynamical system , thus we need to study the eigenvalues of the . Recall how in this setting
Note that because is continuous and of bounded variation, it can be reparameterised to be Lipschitz continuous, hence absolutely continuous. Thus we can assume that is almost everywhere differentiable and its derivative .
The stability of the dynamical system then depends on the alignment between and the singular vectors of . If for all times, where the inequality is coordinate-wise, then has eigenvalues all in $$ thus the system is stable making training easier (Orvieto et al., 2023b).
Consider the singular value decomposition (SVD) of the matrix
Then, a sufficient condition for stability is that for any
In the case of Mamba (Gu & Dao, 2023) the matrices are diagonal and
moreover the proposed choices of are all of type
C.2 Chaining
The diagonal case differs from the general one not only in the fact that the class of approximable functions is much weaker but also in the necessity for the presence of in order to obtain any path-dependence. The term
becomes then a crucial component. At first sight one might think that such a term allows to recover at least level two components of the Signature of , unfortunately things are not as easy as they may seem. Notice how inside of the integral time is ”going backwards” from the perspective of , thus we can in general approximate terms of type
which are indeed terms of the Signature, but of the reverse paths and !
Under these hypotheses we know that linear functionals on are uniformly dense, for continuous , in
Assume and consider the choices
Models like Mamba do not only use diagonal matrices but also consider controls of a specific kind:
The choice of enforces for all times as seen above, but could, a priori, destroy information about which allows for the recovery, after chaining, of its Signature.
Does this choice keep some expressivity? Fortunately almost all of it: since
one can choose a linear map which allows to linearly recover
from . By correspondingly modifying the form of and in Eqn. (77) such that
Appendix D Path-to-Path
i.e. G is causal if it does not look in the future.
Fix . By B.7 the space contains all functionals of form
Recalling that we get that so that, given density of linear maps on in the space, can be uniformly approximated. Triangular inequality gives finally
which, by arbitrariness of , gives the thesis. ∎
The non-linearity is crucial for the path-to-path result. A map of type cannot be approximated arbitrarily well by .
In any case, note that in the proof the role of the neural network is only that of interpolating the RKHS elements in the right order and at the right time. All the non-linear complexity of learning the particular is offloaded and taken care of by the RKHS elements.
In the proof we have only considered the part of concerning , but depends linearly on and suggesting that neural networks on have stronger generalization properties. In fact one can prove that it is possible to approximate all continuous , this is done by reconstructing and as in the classical SSM case cf. (Orvieto et al., 2023a).
Appendix E Wronskian Matrix Theory
In this section we obtain a unified theory studying the solutions to general linear CDEs. The results presented here are not new and can be found in different terms in the literature (Friz & Victoir, 2010), despite this we have decided to reproduce them from scratch for completeness, notational reasons and to present a self-contained theory.
Define the map as
where is the set of words in the alphabet and
By defining then one clearly has
thus definitely (in ) the map is a contraction. By Banach fixed point there exist a unique fixed point . ∎
Under the assumptions of the previous theorem one can write explicitly as
moreover if for all the matrix-valued maps are constant on all $A^{i}_{t}\equiv A_{i}$ then
where is the Signature of the path .
The second assertion follows from the first by definition of the Signature of a path.
Using the uniformity of this bound and the fact that for all one has
as a function of , which moreover is uniformly continuous
Under the previous conditions, the unique solution of the -dimensional CDE
The solutions are unique by standard results (Friz & Victoir, 2010)[Thm. 3.7], moreover
The Wronskian matrix has the following properties:
The second statement follows from the previous one setting first and subsequently exchanging and .
This just follows from the classical case since we can write
We can now state the main result of the section:
Given the unique solution one has
Appendix F ZOH and Exact Solutions
Consider a linear CDE as the one of Eqn.(28)
and recall how the solution can be explicitly written, for times , as
Assume moreover that in the interval both drivers have constant derivative i.e.
But the integral can be explicitly solved as
which, setting , can be rewritten as