Log Neural Controlled Differential Equations: The Lie Brackets Make a Difference

Benjamin Walker, Andrew D. McLeod, Tiexin Qin, Yichuan Cheng, Haoliang Li, Terry Lyons

Introduction

Neural controlled differential equations (NCDEs) are a method for modelling multivariate time series that offers a number of advantages for real-world applications. These include decoupling the number of forward passes through their neural network from the number of observations in the time series, as well as being robust to irregular sampling rates. However, there exists a gap in performance between NCDEs and current state-of-the-art approaches for time series modelling, such as S5 and the linear recurrent unit (LRU).

This paper demonstrates that, on a range of multivariate time series classification benchmarks, the gap in performance between NCDEs and other state-of-the-art approaches can be closed by utilising the Log-ODE method during training. We refer to this new approach as Log-NCDEs.

2 Neural Controlled Differential Equations

for $t\in[t_{0},t_{n}]$ , where $f_{\theta}(h_{s})\text{d}X_{s}$ is matrix-vector multiplication . Details on the regularity required for existence and uniqueness of the solution to (1) can be found in appendix A.2. A sufficient condition is $X$ being of bounded variation and $f_{\theta}$ being Lipschitz continuous .

NCDEs are an attractive option for modelling multivariate time series. They are universal approximators of continuous real-valued functions on time series data [5, Theorem 3.9]. Additionally, since they interact with time series data through the continuous interpolation $X$ , NCDEs are agnostic to when the data was sampled. This makes them robust to irregular sampling rates. Furthermore, the number of forward passes through $f_{\theta}$ when evaluating (1) is controlled by the differential equation solver used. This is opposed to recurrent models, where it is controlled by the number of observations $n$ . By decoupling the number of forward passes through their neural network from the number of observations in the time series, NCDEs can mitigate exploding or vanishing gradients on highly-sampled time series.

To make use of the techniques developed for training neural ordinary differential equations (ODEs) , NCDEs are typically rewritten as an ODE,
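As an illustration of how the solver, rather than the data, sets the number of forward passes, the following is a minimal sketch (not the implementation used in this paper) of solving an NCDE in ODE form. It assumes the rewriting $\mathrm{d}h_{t}/\mathrm{d}t=f_{\theta}(h_{t})\,\mathrm{d}X_{t}/\mathrm{d}t$ for a differentiable control. The control $X_{t}=(t,\sin t)$, the stand-in vector field `f`, and all helper names are illustrative assumptions, chosen so that the exact solution is available in closed form.

```python
import math


def matvec(M, v):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def solve_ncde_heun(f, h0, dX_dt, t0, t1, num_steps):
    """Integrate dh/dt = f(h) dX/dt with Heun's method.

    Returns the final hidden state and the number of calls made to f.
    """
    h, t = list(h0), t0
    dt = (t1 - t0) / num_steps
    calls = 0
    for _ in range(num_steps):
        k1 = matvec(f(h), dX_dt(t))
        h_pred = [a + dt * b for a, b in zip(h, k1)]
        k2 = matvec(f(h_pred), dX_dt(t + dt))
        calls += 2
        h = [a + 0.5 * dt * (b + c) for a, b, c in zip(h, k1, k2)]
        t += dt
    return h, calls


# Hypothetical example: hidden dimension u = 1, control dimension v = 2.
def f(h):
    # A 1 x 2 matrix-valued vector field standing in for the network f_theta.
    return [[-0.5 * h[0], h[0]]]


def dX_dt(t):
    # Derivative of the assumed control X_t = (t, sin t).
    return [1.0, math.cos(t)]


h_final, calls = solve_ncde_heun(f, [1.0], dX_dt, 0.0, 1.0, num_steps=200)
# Closed-form solution of dh = h (-0.5 dt + d sin t) with h_0 = 1.
exact = math.exp(-0.5 + math.sin(1.0))
```

In this sketch the vector field is evaluated twice per Heun step, so `calls` depends only on `num_steps` and not on how many observations were used to build the control.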

3 Neural Rough Differential Equations

Compared to NCDEs, NRDEs can reduce the number of forward passes through the network while evaluating the model, as the vector field is autonomous on each interval $[r_{i},r_{i+1}]$. This has been shown to lead to improved classification accuracy, alongside reduced time and memory usage, on time series with up to 17,000 observations. Furthermore, as it is no longer necessary to apply a differentiable interpolation to the time series data, NRDEs are applicable to a wider range of input signals.

4 Contributions

Mathematical Background

Let $U$ , $V$ , and $W$ be vector spaces. The tensor product space $U\otimes V$ is the unique (up to isomorphism) space such that for all bilinear functions $\kappa:U\times V\rightarrow W$ there exists a unique linear map $\tau:U\otimes V\rightarrow W$ , such that $\kappa=\tau\circ\otimes$ .

where any bilinear function $\kappa(v,w)$ can be written as a linear function $\tau(v\otimes w)$ .

Details on the choice of norm used for $V^{\otimes j}$ are left to appendix A.3.

3 Lie Brackets

A Lie algebra is a vector space $V$ with a bilinear map $[\cdot,\cdot]:V\times V\rightarrow V$ satisfying $[w,w]=0$ and the Jacobi identity, $[x,[y,z]]+[y,[z,x]]+[z,[x,y]]=0$, for all $w,x,y,z\in V$. The map $[\cdot,\cdot]$ is called the Lie bracket.

Any associative algebra, $(V,\times)$, has a Lie bracket structure with Lie bracket defined by $[x,y]=x\times y-y\times x$ for all $x,y\in V$.

For a Lie algebra $V$ , let $[V,V]$ denote the span of elements $[v_{1},v_{2}]$ , where $v_{1},v_{2}\in V$ .
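As a quick numerical check of the commutator construction, the sketch below takes the associative algebra of $2\times 2$ integer matrices and verifies that $AB-BA$ vanishes on the diagonal pair $(A,A)$ and satisfies the Jacobi identity. The matrices and helper names are arbitrary choices made for illustration.

```python
def matmul(A, B):
    """Product of two square matrices stored as lists of rows."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def add(A, B):
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]


def sub(A, B):
    return [[a - b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]


def bracket(A, B):
    """Commutator [A, B] = AB - BA."""
    return sub(matmul(A, B), matmul(B, A))


X = [[1, 2], [3, 4]]
Y = [[0, 1], [-1, 5]]
Z = [[2, 0], [7, -3]]
zero = [[0, 0], [0, 0]]

# [X, [Y, Z]] + [Y, [Z, X]] + [Z, [X, Y]]
jacobi = add(add(bracket(X, bracket(Y, Z)), bracket(Y, bracket(Z, X))),
             bracket(Z, bracket(X, Y)))
```

Integer entries keep the arithmetic exact, so `jacobi` and `bracket(X, X)` can be compared with the zero matrix directly.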

4 The Log-Signature

Let $X:[t_{0},t_{n}]\rightarrow V$ have bounded variation and define

The signature describes the path $X$ over the interval $[t_{0},t_{n}]$ . In fact, assuming $X$ contains time as a channel, linear maps on $S(X)_{[t_{0},t_{n}]}$ are universal approximators for continuous, real-valued functions of $X$ . This property of the signature relies on the shuffle-product identity, which states that certain terms in the signature can be written as polynomials of lower-order terms in the signature . A consequence of the shuffle-product identity is that the signature contains information redundancy, i.e. not every term in the signature provides new information about the path $X$ . The transformation which removes this information redundancy is the logarithm.
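The redundancy described above can be seen already at depth two. The following sketch computes the depth-two signature of a piecewise-linear path from its increments using Chen's identity, checks the shuffle-product identity $S^{i}S^{j}=S^{ij}+S^{ji}$, and extracts the antisymmetric part of the second level (the Lévy area), which is what the depth-two log-signature retains beyond the increments. The example path and the function names are assumptions made for illustration.

```python
def signature_depth2(increments):
    """Depth-2 signature of the piecewise-linear path with the given increments."""
    d = len(increments[0])
    level1 = [0.0] * d
    level2 = [[0.0] * d for _ in range(d)]
    for dx in increments:
        for i in range(d):
            for j in range(d):
                # Chen's identity: cross term with the path so far, plus the
                # second level of a single linear segment, dx_i dx_j / 2.
                level2[i][j] += level1[i] * dx[j] + 0.5 * dx[i] * dx[j]
        for i in range(d):
            level1[i] += dx[i]
    return level1, level2


def levy_area(level2, i, j):
    """Antisymmetric part of the second level of the signature."""
    return 0.5 * (level2[i][j] - level2[j][i])


increments = [(1.0, 0.0), (0.0, 1.0), (-1.0, 0.5)]
level1, level2 = signature_depth2(increments)

# Shuffle-product identity at depth two: the symmetric part of level 2 is
# determined by level 1, so it carries no new information about the path.
shuffle_gap = max(
    abs(level1[i] * level1[j] - (level2[i][j] + level2[j][i]))
    for i in range(2) for j in range(2)
)
area = levy_area(level2, 0, 1)
```

For a $d$-dimensional path, the depth-two signature has $d+d^{2}$ entries, while the increments together with the Lévy areas for $i<j$ amount to $d+d(d-1)/2$ numbers.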

where $\mathbf{t}=(0,x^{1},x^{2},\ldots)$ .

(Free Lie Algebra) Let $A$ be a non-empty set, $L_{0}$ and $L$ be Lie algebras, and $\phi:A\rightarrow L_{0}$ be a map. The Lie algebra $L_{0}$ is said to be the free Lie algebra generated by $A$ if for all maps $f:A\rightarrow L$, there exists a unique Lie algebra homomorphism $g:L_{0}\rightarrow L$ such that $g\circ\phi=f$.

The free Lie algebra generated by a Banach space $V$ is the space

which was originally shown by K.T. Chen .

5 The Log-ODE Method

Over an interval, the Log-ODE method approximates a CDE using an autonomous ODE constructed by applying the linear map $\bar{f}$ to the truncated log-signature of the control, as seen in (4). There exist theoretical results bounding the error in the Log-ODE method’s approximation, including when the control and solution paths live in infinite dimensional Banach spaces . However, for a given set of intervals, the series of vector fields $\{g_{X}(\cdot)\}_{N=1}^{\infty}$ is not guaranteed to converge. In practice, $N$ is typically chosen as the smallest $N$ such that a reasonably sized set of intervals $\{r_{i}\}_{i=0}^{m}$ gives an approximation error of the desired level. A recent development has been the introduction of an algorithm which adaptively updates $N$ and $\{r_{i}\}_{i=0}^{m}$ .
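To make the role of the truncation depth concrete, the sketch below applies depth-one and depth-two Log-ODE approximations to a hand-picked linear CDE, $\mathrm{d}y=A_{1}y\,\mathrm{d}X^{1}+A_{2}y\,\mathrm{d}X^{2}$, over a single interval. The matrices are chosen to be nilpotent so that all Lie brackets beyond depth two vanish and every matrix exponential is a finite sum; this is an assumption made so the example is exact, not a property of general vector fields. The control first moves $X^{1}$ by $a$ and then $X^{2}$ by $b$, and the bracket convention $[f,g](y)=J_{g}(y)f(y)-J_{f}(y)g(y)$ is assumed.

```python
def matmul(A, B):
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def lincomb(terms):
    """Return the sum of c * M over (c, M) pairs of scalars and square matrices."""
    n = len(terms[0][1])
    return [[sum(c * M[i][j] for c, M in terms) for j in range(n)]
            for i in range(n)]


def expm_nilpotent(M):
    """exp(M) = I + M + M^2 / 2, exact when M^3 = 0 (strictly upper-triangular 3 x 3)."""
    identity = [[1.0 if i == j else 0.0 for j in range(3)] for i in range(3)]
    return lincomb([(1.0, identity), (1.0, M), (0.5, matmul(M, M))])


A1 = [[0.0, 1.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
A2 = [[0.0, 0.0, 0.0], [0.0, 0.0, 1.0], [0.0, 0.0, 0.0]]
a, b = 1.0, 1.0

# Exact flow: solve along the first segment (X^1 moves), then the second (X^2 moves).
exact = matmul(expm_nilpotent(lincomb([(b, A2)])),
               expm_nilpotent(lincomb([(a, A1)])))

# Depth-2 log-signature of the control over the whole interval:
# increments (a, b) and Levy area (S^{12} - S^{21}) / 2 = a * b / 2.
levy_area = 0.5 * a * b

# For linear fields y -> A1 y and y -> A2 y, the assumed bracket is y -> (A2 A1 - A1 A2) y.
bracket = lincomb([(1.0, matmul(A2, A1)), (-1.0, matmul(A1, A2))])

# Each Log-ODE approximation is an autonomous linear ODE solved over a unit interval.
depth1 = expm_nilpotent(lincomb([(a, A1), (b, A2)]))
depth2 = expm_nilpotent(lincomb([(a, A1), (b, A2), (levy_area, bracket)]))
```

In this example `depth2` reproduces `exact`, whereas `depth1` differs in the top-right entry, because the depth-one approximation cannot see the order in which the two channels moved.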

Method

Log-NCDEs use the same underlying model as NRDEs

These changes have two benefits. The first is that the output dimension of $f_{\theta}$ is $u\times v$, whereas the output dimension of $\bar{f}_{\theta}$ is $u\times\beta(v,N)$. Figure 2 compares these values for paths of dimension $v$ from $1$ to $10$ and truncation depths of $N=1,2,3$. For $N>1$, Log-NCDEs are exploring a significantly smaller output space during training than NRDEs, while maintaining the same expressivity, as NCDEs are universal approximators. The second benefit of these changes is that $\{r_{i}\}_{i=0}^{m}$ and $N$ are no longer hyperparameters which impact the model’s architecture. Instead, they control the error in the approximation arising from using the Log-ODE method. This allows them to be changed during training and inference. Both of these benefits come at the cost of needing to calculate the iterated Lie brackets when evaluating Log-NCDEs, which will be quantified in section 3.4.
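The gap between the two output dimensions can be tabulated directly. The sketch below assumes that $\beta(v,N)$ denotes the dimension of the depth-$N$ truncated log-signature of a $v$-dimensional path, and computes it with Witt's formula for the dimensions of the graded pieces of the free Lie algebra. The function names are illustrative.

```python
def mobius(n):
    """Moebius function computed by trial division."""
    result, p = 1, 2
    while p * p <= n:
        if n % p == 0:
            n //= p
            if n % p == 0:
                return 0  # n has a squared prime factor
            result = -result
        p += 1
    if n > 1:
        result = -result
    return result


def witt(v, k):
    """Dimension of the degree-k part of the free Lie algebra on v generators."""
    total = sum(mobius(d) * v ** (k // d) for d in range(1, k + 1) if k % d == 0)
    return total // k


def beta(v, N):
    """Assumed meaning of beta(v, N): dimension of the depth-N truncated log-signature."""
    return sum(witt(v, k) for k in range(1, N + 1))


# Per hidden channel, the Log-NCDE network outputs v numbers, whereas an NRDE
# network outputs beta(v, N) numbers.
table = {(v, N): beta(v, N) for v in range(1, 11) for N in (1, 2, 3)}
```

Under this assumption, a $10$-dimensional path at depth $2$ gives $\beta(10,2)=55$ against $v=10$, and the two coincide only at $N=1$.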

Hence, in this case the only difference between Log-NCDEs and NRDEs is the regularisation of $f_{\theta}$. Furthermore, (20) and (3) are equivalent when $X$ is a linear interpolation, so the approaches of NCDEs, NRDEs, and Log-NCDEs all coincide when using a depth-$1$ Log-ODE approximation.

where $C$ is a constant depending on $n_{in}$, $n_{h}$, and $m$, $\{W^{i}\}_{i=1}^{m}$ and $\{\mathbf{b}^{i}\}_{i=1}^{m}$ are the weights and biases of the $i^{\text{th}}$ layer of $f_{\theta}$, and $P_{m!}$ is a polynomial of order $m!$.

Assuming that each layer $\{L^{i}\}_{i=1}^{m}$ of $f_{\theta}$ satisfies $||L^{i}||_{\text{Lip}(2)}=1$ , an explicit evaluation of (21) gives

For a depth $5$ FCNN, this is greater than the maximum value of a double precision floating point number. Hence, it may be necessary to control $||f_{\theta}||_{\text{Lip}(2)}$ explicitly during training. This is achieved by modifying the neural network’s loss function $L$ to

where $\lambda$ is a hyperparameter controlling the weight of the penalty. This is an example of weight regularisation, which has long been understood to improve generalisation in NNs . Equation 23 is specifically a variation of spectral norm regularisation .
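As a sketch of how a penalty of this general kind might be computed, the code below estimates the spectral norm of each weight matrix by power iteration and adds the weighted sum to a task loss. The function names and the specific form of the penalty are assumptions made for illustration; they are not the loss defined in (23).

```python
import math


def spectral_norm(W, num_iters=100):
    """Estimate the largest singular value of W by power iteration on W^T W."""
    n_rows, n_cols = len(W), len(W[0])
    v = [1.0 / math.sqrt(n_cols)] * n_cols
    for _ in range(num_iters):
        u = [sum(w * x for w, x in zip(row, v)) for row in W]  # u = W v
        v = [sum(W[i][j] * u[i] for i in range(n_rows)) for j in range(n_cols)]  # v = W^T u
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
    u = [sum(w * x for w, x in zip(row, v)) for row in W]
    return math.sqrt(sum(x * x for x in u))


def penalised_loss(task_loss, weights, lam):
    """Hypothetical penalised loss: task loss plus lam times the sum of spectral norms."""
    return task_loss + lam * sum(spectral_norm(W) for W in weights)


W1 = [[3.0, 0.0], [0.0, 4.0]]
W2 = [[0.0, 2.0], [1.0, 0.0]]
loss = penalised_loss(1.0, [W1, W2], lam=0.1)
```

Here the largest singular values are $4$ and $2$, so the hypothetical penalised loss is $1+0.1\times 6$.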

3 Constructing the Log-ODE Vector Field

where $\lambda_{k}$ is the term in the log-signature corresponding to the basis elements $\hat{e}_{k}$ . Since each $\hat{e}_{k}$ can be written as iterated Lie brackets of $\{e_{j}\}_{j=1}^{v}$ , it is now possible to replace $\bar{f}_{\theta}(\cdot)\hat{e}_{k}$ with the iterated Lie brackets of $f_{\theta}(\cdot)e_{i}$ using (17) and (18). The $f_{\theta}(\cdot)e_{i}$ are the vector fields defined by the columns of the neural network’s output. Hence, the vector field $g_{\theta,X}$ is evaluated at a point using Jacobian-vector products of the neural network, which can be efficiently calculated using forward-mode automatic differentiation.
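The Jacobian-vector products involved can be sketched without an automatic differentiation library by using dual numbers, which implement forward-mode differentiation for simple arithmetic. The code below evaluates the Lie bracket of two vector fields at a point with two such products, assuming the convention $[a,b](y)=J_{b}(y)a(y)-J_{a}(y)b(y)$. The `Dual` class and the example fields are illustrative stand-ins for the columns of a neural network's output, and only addition, subtraction, and multiplication are supported.

```python
class Dual:
    """Number val + tan * eps with eps^2 = 0: a value and a directional derivative."""

    def __init__(self, val, tan=0.0):
        self.val, self.tan = val, tan

    @staticmethod
    def lift(x):
        return x if isinstance(x, Dual) else Dual(x)

    def __add__(self, other):
        other = Dual.lift(other)
        return Dual(self.val + other.val, self.tan + other.tan)

    __radd__ = __add__

    def __sub__(self, other):
        other = Dual.lift(other)
        return Dual(self.val - other.val, self.tan - other.tan)

    def __rsub__(self, other):
        return Dual.lift(other) - self

    def __mul__(self, other):
        other = Dual.lift(other)
        return Dual(self.val * other.val,
                    self.val * other.tan + self.tan * other.val)

    __rmul__ = __mul__


def jvp(field, y, direction):
    """Return J_field(y) applied to direction using one forward-mode pass."""
    out = field([Dual(yi, di) for yi, di in zip(y, direction)])
    return [Dual.lift(o).tan for o in out]


def lie_bracket(a, b):
    """Return the vector field [a, b](y) = J_b(y) a(y) - J_a(y) b(y)."""
    def bracket(y):
        a_y = [Dual.lift(o).val for o in a(y)]
        b_y = [Dual.lift(o).val for o in b(y)]
        return [p - q for p, q in zip(jvp(b, y, a_y), jvp(a, y, b_y))]
    return bracket


# Illustrative vector fields on R^2.
def a(y):
    return [y[0] * y[1], 1.0]


def b(y):
    return [0.0, y[0] * y[0]]


value = lie_bracket(a, b)([2.0, 3.0])
```

For these fields the bracket is $(-y_{0}^{3},\,2y_{0}^{2}y_{1})$, so `value` is `[-8.0, 24.0]` at $y=(2,3)$. Deeper brackets can be formed by nesting `lie_bracket`, at the price of nested forward-mode passes.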

4 Computational Cost

Constructing the iterated Lie brackets of $f_{\theta}$ incurs an additional computational cost for each evaluation of the vector field. In order to quantify this additional cost, assume that a NCDE, NRDE, and Log-NCDE are all using an identical FCNN as their vector field, except for the dimension of the final layer in the NRDE. Let $n_{h}$ denote the dimension of the hidden layers, and $\mathcal{O}(C)$ be the cost of evaluating the vector field of the NCDE. Then for a $v$-dimensional input path, the cost of evaluating a NRDE’s vector field is $\mathcal{O}(C+v^{N}n_{h})$ and the cost of evaluating a Log-NCDE’s vector field is $\mathcal{O}(C+v^{N-1}C)$. The idea behind NRDEs and Log-NCDEs is that their additional cost is compensated for by the Log-ODE method requiring fewer vector field evaluations to attain a solution to the CDE of similar accuracy to other methods. The idea behind Log-NCDEs is that the additional cost of constructing the Lie brackets is compensated for by the smaller output dimension of the model’s neural network when compared to NRDEs.

5 Limitations

Despite these limitations, in the experiments conducted for this paper, Log-NCDEs significantly improve upon the test set accuracies of NCDEs and NRDEs. Furthermore, they achieve test set accuracies higher than current state-of-the-art deep learning approaches, including structured state-space models. We hope these results motivate the importance of further work on developing a complete implementation of the Log-ODE method for NCDEs.

Experiments

Log-NCDEs are compared against four models, which represent the state of the art for a range of deep learning approaches to time series modelling. The first two are discrete methods: a recurrent neural network using LRU blocks and a structured state-space model, S5. The other two baseline models are continuous: a NCDE using a Hermite cubic spline with backward differences and a NRDE.

2 Toy Dataset

We construct a toy dataset with dimension $6$ and $100$ regularly spaced samples. For every time step, the change in each channel is sampled independently from the discrete probability distribution with density

We consider four different binary classifications on the toy dataset. Each classification is a specific term in the signature of the path which depends on a different number of channels.

Was the change in the third channel, $\int_{0}^{1}\text{d}X^{3}_{s}$ , greater than zero?

Was the area integral of the third and sixth channels, $\int_{0}^{1}\int_{0}^{u}\text{d}X^{3}_{s}\text{d}X^{6}_{u}$ , greater than zero?

Was the volume integral of the third, sixth, and first channels, $\int_{0}^{1}\int_{0}^{v}\int_{0}^{u}\text{d}X^{3}_{s}\text{d}X^{6}_{u}\text{d}X^{1}_{v}$ , greater than zero?

Was the $4$ D volume integral of the third, sixth, first, and fourth channels, $\int_{0}^{1}\int_{0}^{w}\int_{0}^{v}\int_{0}^{u}\text{d}X^{3}_{s}\text{d}X^{6}_{u}\text{d}X^{1}_{v}\text{d}X^{4}_{w}$ , greater than zero?
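The four labels above are iterated integrals of the path, and for a piecewise-linear path they can be computed exactly from the increments with Chen's identity. The sketch below does this for an arbitrary sequence of channels. The increment distribution used to draw the example path is a placeholder assumption (a uniform choice from $\{-1,0,1\}$), not the density used for the toy dataset, and channels are indexed from zero.

```python
import math
import random


def iterated_integral(increments, word):
    """Signature term of a piecewise-linear path for the channel sequence `word`."""
    n = len(word)
    # prefix[k] holds the iterated integral for word[:k] over the path so far.
    prefix = [1.0] + [0.0] * n
    for dx in increments:
        new = list(prefix)
        for k in range(1, n + 1):
            total = 0.0
            for j in range(k + 1):
                # Chen's identity: word[:j] on the earlier path, word[j:k] on
                # this linear segment, whose signature term is a product of
                # increments divided by (k - j)!.
                seg = 1.0
                for c in word[j:k]:
                    seg *= dx[c]
                total += prefix[j] * seg / math.factorial(k - j)
            new[k] = total
        prefix = new
    return prefix[n]


# Placeholder path: 6 channels, 100 steps, increments drawn uniformly from {-1, 0, 1}.
rng = random.Random(0)
increments = [[rng.choice([-1.0, 0.0, 1.0]) for _ in range(6)] for _ in range(100)]

# Third; third and sixth; third, sixth, first; third, sixth, first, fourth channels.
words = [(2,), (2, 5), (2, 5, 0), (2, 5, 0, 3)]
labels = [iterated_integral(increments, w) > 0 for w in words]
```

The cost of this recursion grows with the length of the word but not with the number of channels, since only the prefixes of the requested word are tracked.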

On this dataset, all models used a hidden state of dimension $64$ and Adam with a learning rate of $0.0001$ . Both LRU and S5 used six blocks with GLU layers. S5 used a latent dimension of $64$ . NRDE and Log-NCDE used a stepsize of $4$ and a depth of $2$ .

3 UEA Multivariate Time Series Classification Archive

The models considered in this paper are evaluated on a subset of six datasets from the UEA multivariate time series classification archive (UEA-MTSCA). These six datasets were chosen via the following two criteria. First, only datasets with more than $200$ total cases were considered. Second, the six datasets with the most observations were chosen, as datasets with many observations have previously proved challenging for deep learning approaches to time series modelling. Following , the original train and test cases are combined and resplit into new random train, validation, and test cases using a $70:15:15$ split.

Hyperparameters for all models were found using a grid search over the validation accuracy on a fixed random split of the data. Full details on the hyperparameter grid search are in appendix C. Having fixed their hyperparameters, models are compared on their average test set accuracy over five different random splits of the data.
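The resplitting protocol can be sketched as follows. The function name, the use of integer seeds to index the random splits, and the rounding rule for the split sizes are assumptions made for illustration.

```python
import random


def resplit(cases, seed, fractions=(0.70, 0.15, 0.15)):
    """Shuffle the combined cases and cut them into train, validation, and test sets."""
    rng = random.Random(seed)
    shuffled = list(cases)
    rng.shuffle(shuffled)
    n_train = round(fractions[0] * len(shuffled))
    n_val = round(fractions[1] * len(shuffled))
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test


# Hypothetical usage: combine the original train and test cases, then draw
# five random splits, one per seed.
combined = list(range(200))
splits = [resplit(combined, seed) for seed in range(5)]
```

Each split partitions the combined cases, so every case appears in exactly one of the three sets.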

Results

Figure 4 compares the performance of LRU, S5, NCDE, NRDE, and Log-NCDE on the four different classifications considered for the toy dataset. As expected, all of the models perform well when the label depends on one channel. As the number of channels the classification depends on increases, LRU and S5 begin to take significantly more training steps to converge. When the classification label is the volume integral containing four channels, S5 and LRU struggle to learn anything within $100{,}000$ steps of training. Increasing the number of training steps to $1{,}000{,}000$ , S5 and LRU are able to achieve validation accuracies of $82.5\%$ and $75.8\%$ respectively. However, this is still significantly lower than the $90.9\%$ validation accuracy achieved by NCDEs in $10{,}000$ steps of training.

As can be seen in figure 4, NCDEs outperform both NRDEs and Log-NCDEs using a stepsize of $4$ and a depth of $2$ . This is expected, as the classification labels considered are exactly the solution of a CDE, and for $N>1$ , NRDEs and Log-NCDEs are approximations to CDEs. Additionally, Log-NCDEs outperform NRDEs on each classification, with the gap in performance growing as the classification label depends on more channels.

2 UEA-MTSCA

Table 1 reports the average and standard deviation of each model’s test set accuracy over five different splits of the data. On the datasets considered, NCDEs achieve the lowest average accuracy. NRDEs improve upon NCDEs in both average accuracy and rank, but are still outperformed by the current state-of-the-art models, LRU and S5. Log-NCDEs achieve the highest average accuracy and best average rank of the models considered. Furthermore, Log-NCDEs have the lowest standard deviation on four of the six datasets. These results highlight how the modifications introduced by Log-NCDEs can significantly increase performance.

Discussion

On the toy dataset, NCDEs, NRDEs, and Log-NCDEs are all able to model the $N$-dimensional volume integrals with high validation accuracy. This is not surprising, as the classification label can be written as a CDE driven by the input path. However, LRU and S5 both struggle to capture the necessary information as $N$ increases. The fact that Log-NCDEs outperform NRDEs when the classification is the solution of a CDE indicates that NRDEs do not always converge to solutions where $\bar{f}_{\theta}$ is the sum of Lie brackets of a NCDE vector field, $f_{\theta}$.

Conclusion

There are many possible directions of future work. Log-NCDEs could be extended to depth-$N$ Log-ODE methods for $N>2$. This would require extending Theorem 3.1 to $\gamma>2$. Furthermore, it would be necessary to address the computational cost of the iterated Lie brackets, which could be done by using a structured neural network with cheap Jacobian-vector products as the CDE vector field. Another avenue could be to incorporate the recently developed adaptive version of the Log-ODE method.

Acknowledgements

Benjamin Walker and Terry Lyons were funded by the Hong Kong Innovation and Technology Commission (InnoHK Project CIMDA). Andrew McLeod and Terry Lyons were funded by the EPSRC [grant number EP/S026347/1] and The Alan Turing Institute under the EPSRC grant EP/N510129/1. Terry Lyons was funded by the Data Centric Engineering Programme (under the Lloyd’s Register Foundation grant G0095), the Defence and Security Programme (funded by the UK Government) and the Office for National Statistics & The Alan Turing Institute (strategic partnership).

References

Appendix A Additional Mathematical Details

Let $V$ and $W$ be Banach spaces and $\mathbf{L}(V,W)$ denote the space of linear mappings from $V$ to $W$ .

In this paper, we will always take spaces of linear maps to be equipped with their operator norms.

A linear map $l\in\mathbf{L}(V^{\otimes j},W)$ is $j$-symmetric if for all $v_{1}\otimes\cdots\otimes v_{j}\in V^{\otimes j}$ and all bijective functions $p:\{1,\ldots,j\}\rightarrow\{1,\ldots,j\}$, $l(v_{1}\otimes\cdots\otimes v_{j})=l(v_{p(1)}\otimes\cdots\otimes v_{p(j)})$.

The set of all $j$-symmetric linear maps is denoted $\mathbf{L}_{s}(V^{\otimes j},W)$.

$\text{Lip}(\gamma)$ Let $k$ be a non-negative integer, $\gamma\in(k,k+1]$ be a real number, $F$ a closed subset of $V$ , and $f^{0}:F\rightarrow W$ . For $j=1,\ldots,k$ , let $f^{j}:F\rightarrow\mathbf{L}_{s}(V^{\otimes j},W)$ . The collection $(f^{0},f^{1},\ldots,f^{k})$ is an element of $\text{Lip}(\gamma,F,W)$ if there exists $M\geq 0$ such that for $j=0,\ldots,k$ ,

and for $j=0,\ldots,k$ , all $x,y\in F$ , and each $v\in V^{\otimes j}$ ,

When there is no confusion over $F$ and $W$ , the shorthand notation $\text{Lip}(\gamma)$ will be used. If a collection $f=(f^{0},f^{1},\ldots,f^{k})$ is $\text{Lip}(\gamma)$ , then the $\text{Lip}(\gamma)-$ norm, denoted $||f||_{\text{Lip}(\gamma)}$ , is the smallest $M$ for which (28) and (29) hold.

In order to illustrate the definition of $\text{Lip}(\gamma)$ , two examples are given. When $0<\gamma\leq 1$ , then $k=0$ and $f^{0}\in\text{Lip}(\gamma)$ implies $f^{0}$ is bounded and $\gamma-$ Hölder continuous. When $\gamma=1$ , then $f^{0}$ is bounded and Lipschitz. When $1<\gamma\leq 2$ , then $k=1$ and $(f^{0},f^{1})\in\text{Lip}(\gamma)$ if $f^{0}$ is bounded and there exists $f^{1}$ that is bounded, $(\gamma-1)-$ Hölder continuous, and satisfies

for all $x,y\in F$ and some constant $M>0$ .

A.2 Existence and Uniqueness

Let $X:[0,T]\rightarrow V$ and $y:[0,T]\rightarrow W$ be continuous paths. The existence and uniqueness of the solution to a CDE,

depends on the smoothness of the control path $X$ and the vector field $f$ . We will measure the smoothness of a path by the smallest $p\geq 1$ for which the $p-$ variation is finite and the smoothness of a vector field by the largest $\gamma>0$ such that the function is $\text{Lip}(\gamma)$ (defined in section A.1).

(Partition) A partition of a real interval $[0,T]$ is a set of real numbers $\{r_{i}\}_{i=0}^{m}$ satisfying $0=r_{0}<\ldots<r_{m}=T$ .

( $p-$ variation ) Let $V$ be a Banach space, $\mathcal{D}=(r_{0},\cdots,r_{m})\subset[0,T]$ be a partition of $[0,T]$ , and $p\geq 1$ be a real number. The $p-$ variation of a path $X:[0,T]\rightarrow V$ is defined as
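For sampled data, the quantity can be computed by dynamic programming. The sketch below assumes the common convention $||X||_{p}=\left(\sup_{\mathcal{D}}\sum_{i}||X_{r_{i+1}}-X_{r_{i}}||^{p}\right)^{1/p}$, restricts the supremum to partitions drawn from the sample points, and uses the Euclidean norm; all three choices, and the function name, are assumptions made for illustration.

```python
def p_variation(path, p):
    """p-variation of a sampled path, maximising over partitions of the sample points."""
    def dist(x, y):
        return sum((a - b) ** 2 for a, b in zip(x, y)) ** 0.5

    n = len(path)
    # best[j]: largest sum of dist^p over partitions of the samples ending at index j.
    best = [0.0] * n
    for j in range(1, n):
        best[j] = max(best[i] + dist(path[i], path[j]) ** p for i in range(j))
    return best[-1] ** (1.0 / p)


zigzag = [(0.0,), (1.0,), (0.0,), (1.0,)]
monotone = [(0.0,), (1.0,), (2.0,)]
```

For the zigzag path the finest partition is optimal for every $p\geq 1$, whereas for the monotone path with $p=2$ the coarsest partition wins, giving a $2$-variation of $2$ rather than $\sqrt{2}$. The cost is quadratic in the number of samples.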

Let $1\leq p<2$ and $p-1<\gamma\leq 1$ . If $W$ is finite-dimensional, $X$ has finite $p-$ variation, and $f$ is $\text{Lip}(\gamma)$ , then (31) admits a solution for every $y_{0}\in W$ .

Let $1\leq p<2$ and $p<\gamma$ . If $X$ has finite $p-$ variation and $f$ is $\text{Lip}(\gamma)$ , then (31) admits a unique solution for every $y_{0}\in W$ .

These theorems extend the classic differential equation existence and uniqueness results to controls with unbounded variation but finite $p$-variation for $p<2$. A proof of these theorems can be found in . These theorems are sufficient for the differential equations considered in this paper. However, there are many settings where the control has infinite $p$-variation for all $p<2$, such as Brownian motion. The theory of rough paths was developed in order to give meaning to (31) when the control’s $p$-variation is finite only for $p\geq 2$. An introduction to rough path theory can be found in .

A.3 Norm of Tensor Product Space

Let $V$ be a Banach space and $V^{\otimes n}$ denote the tensor powers of $V$ ,

There is choice in the norm of $V^{\otimes n}$ . In this paper, we follow the setting of and . It is assumed that each $V^{\otimes n}$ is endowed with a norm such that the following conditions hold for all $v\in V^{\otimes n}$ and $w\in V^{\otimes m}$ :

$||v||=||v_{1}\otimes\cdots\otimes v_{n}||=||v_{p(1)}\otimes\cdots\otimes v_{p(n)}||$ for all bijective functions $p:\{1,\ldots,n\}\rightarrow\{1,\ldots,n\}$,

for any bounded linear functional $f$ on $V^{\otimes n}$ and $g$ on $V^{\otimes m}$ , there exists a unique bounded linear functional $f\otimes g$ on $V^{\otimes(m+n)}$ such that $(f\otimes g)(v\otimes w)=f(v)g(w).$

with product $\mathbf{z}=\mathbf{x}\otimes\mathbf{y}$ defined by

The tensor algebra’s product is associative and has unit $\mathbf{1}=(1,0,0,\ldots)$ . As $T((V))$ is an associative algebra, it has a Lie algebra structure, with Lie bracket

Appendix B Proof of Theorem 3.1

(Composed $\text{Lip}(\gamma)$-norm) Let $U$, $V$, and $W$ be Banach spaces and $\Sigma\subset U$ and $\Omega\subset V$ be closed. For $\gamma\geq 1$, let $f=(f^{(0)},f^{(1)})\in\text{Lip}(\gamma,\Sigma,\Omega)$ and $g=(g^{(0)},g^{(1)})\in\text{Lip}(\gamma,\Omega,W)$. Then the composition $g\circ f:\Sigma\rightarrow W$ is $\text{Lip}(\gamma)$ with

where $k$ is the unique integer such that $\gamma\in(k,k+1]$ and $C_{\gamma}$ is a constant independent of $f$ and $g$ .

The original statement of lemma B.1 in gives (37) as

We believe this is a small erratum, as for $g:\rightarrow$ defined as $g(x)=x$ , (38) implies there exists $C_{1}>0$ such that

for all bounded and Lipschitz $f:\rightarrow$ . As a counterexample, for any $C_{1}>0$ , take $f(x)=x^{n}$ with $n>\max\{C_{1},1\}$ . The following proof of lemma B.1 is given in .

Let $(g\circ f)^{0},\ldots,(g\circ f)^{k}$ be defined by the generalisation of the chain rule to higher derivatives. Explicit calculation can be used to verify that if $f$ and $g$ are $\text{Lip}(\gamma)$ , definition A.3 implies $g\circ f$ is $\text{Lip}(\gamma)$ with $||g\circ f||_{\text{Lip}(\gamma)}$ obeying (37). ∎

Bounding the $\text{Lip}(\gamma)-$ norm of a neural network (NN) requires an explicit form for $C_{\gamma}$ in (37). This can be obtained via the explicit calculations mentioned in the proof of lemma B.1. Here, we present the case $\gamma\in(1,2]$ .

Let $U$, $V$, and $W$ be Banach spaces and $\Sigma\subset U$ and $\Omega\subset V$ be closed. For $\gamma\in(1,2]$, let $f=(f^{(0)},f^{(1)})\in\text{Lip}(\gamma,\Sigma,\Omega)$ and $g=(g^{(0)},g^{(1)})\in\text{Lip}(\gamma,\Omega,W)$. Consider $h^{(0)}:\Sigma\to W$ and $h^{(1)}:\Sigma\to\mathbf{L}(U,W)$ defined for $p\in\Sigma$ and $u\in U$ by

Then $h:=\left(h^{(0)},h^{(1)}\right)\in\text{Lip}(\gamma,\Sigma,W)$ and

From definition A.3, $f^{(0)}:\Sigma\to\Omega$ , $f^{(1)}:\Sigma\to\mathbf{L}(U,V)$ , $g^{(0)}:\Omega\to W$ and $g^{(1)}:\Omega\to\mathbf{L}(V,W)$ . Furthermore, for all $p\in\Sigma$

Similarly, for all $x\in\Omega$ we have that

Define $R^{f}_{0}:\Sigma\times\Sigma\to V$ and $R^{f}_{1}:\Sigma\times\Sigma\to\mathbf{L}(U,V)$ by

for any $p,q\in\Sigma$ and $u\in U$ . Then

Similarly, define $R^{g}_{0}:\Omega\times\Omega\to W$ and $R^{g}_{1}:\Omega\times\Omega\to\mathbf{L}(V,W)$ by

Define $h^{(0)}:\Sigma\to W$ and $h^{(1)}:\Sigma\to\mathbf{L}(U,W)$ as in (40),

for $p\in\Sigma$ and $u\in U$ . Finally, define remainder terms $R^{h}_{0}:\Sigma\times\Sigma\to W$ and $R^{h}_{1}:\Sigma\times\Sigma\to\mathbf{L}(U,W)$ by

for $p,q\in\Sigma$ and $u\in U$ . We now establish that $h=(h^{(0)},h^{(1)})\in\text{Lip}(\gamma,\Sigma,W)$ and that the norm estimate claimed in (41) is satisfied.

First we consider the bounds on $h^{(0)}$ and $h^{(1)}$ . For any $p\in\Sigma$ , (I) in (43) implies that

since $f^{(0)}(p)\in\Omega$ . Further, for any $p\in\Sigma$ and any $u\in U$ , (43) and (II) in (42) imply that

since $f^{(0)}(p)\in\Omega$ . Taking the supremum over $u\in U$ with unit $U$ -norm, it follows that

Now we consider the bounds on $R^{h}_{0}$ and $R^{h}_{1}$ . For this purpose we fix $p,q\in\Sigma$ and $u\in U$ . We first assume that $||q-p||_{U}>1$ . In this case we may use (50) and (51) to compute that

Since $\gamma>1$ means that $1<||q-p||_{U}<||q-p||_{U}^{\gamma}$ , we deduce that

Similarly, we may use (51) and that $||q-p||_{U}^{\gamma-1}>1$ to compute that

Taking the supremum over $u\in U$ with unit $U$ -norm in (53) yields the estimate that

Together, (52) and (54) establish the remainder term estimates required to conclude that $h=(h^{(0)},h^{(1)})\in\text{Lip}(\gamma,\Sigma,W)$ in the case that $||q-p||_{U}>1$ . We next establish similar remainder term estimates when $||q-p||_{U}<1$ . Thus we fix $p,q\in\Sigma$ and assume that $||q-p||_{U}<1$ . Note that $\gamma>1$ means that $||q-p||_{U}^{\gamma}<||q-p||_{U}<1$ . Additionally,

where (II) in (42) and (I) in (45) have been used. We now consider the term $R^{h}_{0}(p,q)$ . We start by observing that

Consequently, by using (II) in (43) to estimate the term $g^{(1)}\left(f^{(0)}(p)\right)$ , (I) in (45) to estimate the term $R^{f}_{0}(p,q)$ , and (I) in (47) to estimate the term $R^{g}_{0}\left(f^{(0)}(p),f^{(0)}(q)\right)$ , we may deduce that

The combination of (55) and (56) yields the estimate

Turning our attention to $R^{h}_{1}$ , we fix $u\in U$ and compute that

Consequently, by using (II) in (43) to estimate the term $g^{(1)}\left(f^{(0)}(p)\right)$ , (II) in (42) to estimate the term $f^{(1)}(q)$ , (II) in (45) to estimate the term $R^{f}_{1}(p,q)$ , and (II) in (47) to estimate the term $R^{g}_{1}\left(f^{(0)}(p),f^{(0)}(q)\right)$ , we may deduce that

The combination of (55) and (58) yields the estimate that

Taking the supremum over $u\in U$ with unit $U$ -norm in (59) yields the estimate that

Finally, we complete the proof by combining the various estimates we have established for $h$ to obtain the $\text{Lip}(\gamma,\Sigma,W)$ -norm bound claimed in (41).

We start this task by combining (52) and (57) to deduce that for every $p,q\in\Sigma$ we have

Moreover, the combination of (54) and (60) yields the estimate that

Therefore, by combining (50), (51), (63), and (64), we conclude both that $h=(h^{(0)},h^{(1)})\in\text{Lip}(\gamma,\Sigma,W)$ and that

Note that (41) is a stricter bound than (37), as for $\gamma\in(k,k+1]$ ,

There is equality when $||f||_{\text{Lip}(\gamma)}\leq 1$ or $\gamma=2$ .

B.2 $\text{Lip}(2)$-norm of a Neural Network Layer

Lemma B.2 allows us to bound the $\text{Lip}(2)-$ norm of a neural network (NN) given a bound on the $\text{Lip}(2)-$ norm of each layer of a NN. We demonstrate this here for a simple NN.

Let $f_{\theta}$ be a fully connected NN with activation function SiLU. Assume the input is normalised such that $\mathbf{x}=[x_{1},\ldots,x_{n_{in}}]^{T}$ satisfies $|x_{j}|\leq 1$ for $j=1,\ldots,n_{in}$ . Then

where $l^{1}_{j}(\mathbf{x})=W^{1}_{j}\cdot\mathbf{x}+b^{1}_{j}.$ So,

Since $L^{1}_{j}$ is at least twice differentiable,

where $\mathbf{b}=\mathbf{x}+t(\mathbf{y}-\mathbf{x})$ for some $t\in(0,1)$ and $\beta(A)=\max\{|\lambda_{\max}(A)|,|\lambda_{\min}(A)|\}$ . Similarly,

where $\mathbf{z}=\mathbf{x}+\alpha(\mathbf{y}-\mathbf{x})$ for some $\alpha\in(0,1)$ . Now,

The calculations for subsequent layers are very similar, except that the input to each layer is no longer restricted to $A$ . For example,

B.3 Proof of Theorem 3.1

where $C$ is a constant depending on $n$ and $n_{in}$ and $P_{j}$ is a $j^{\text{th}}$ order polynomial. Applying lemma B.2 gives the bound in (86). ∎

Appendix C UEA-MTSCA Hyperparameter Optimisation

NCDEs, NRDEs, and Log-NCDEs use a single linear layer as $\xi_{\phi}$. NCDEs and NRDEs use FCNNs as their vector fields configured in the same way as their original papers. NCDEs use ReLU activation functions for the hidden layers and a final activation function of $\tanh$. NRDEs use the same, but move the $\tanh$ activation function to be before the final linear layer in the FCNN. NRDEs and Log-NCDEs take their intervals $r_{i+1}-r_{i}$ to be a fixed number of observations, referred to as the Log-ODE step. NCDEs, NRDEs, and Log-NCDEs all use Heun as their differential equation solver with a fixed stepsize of $\max\{500,1+(\text{Time series length}/\text{Log-ODE step})\}$, with $\text{Log-ODE step}=1$ for NCDEs. Table 2 gives an overview of the different hyperparameters optimised over for each model on the UEA-MTSCA. The optimisation was performed using a grid search of the validation accuracy.