Log Neural Controlled Differential Equations: The Lie Brackets Make a Difference
Benjamin Walker, Andrew D. McLeod, Tiexin Qin, Yichuan Cheng, Haoliang Li, Terry Lyons
Introduction
Neural controlled differential equations (NCDEs) are a method for modelling multivariate time series, which offer a number of advantages for real-world applications. These include decoupling the number of forward passes through their neural network from the number of observations in the time series, as well as being robust to irregular sampling rates. However, there exists a gap in performance between NCDEs and current state-of-the-art approaches for time series modelling, such as S5 and the linear recurrent unit (LRU) .
This paper demonstrates that, on a range of multivariate time series classification benchmarks, the gap in performance between NCDEs and other state-of-the-art approaches can be closed by utilising the Log-ODE method during training. We refer to this new approach as Log-NCDEs.
2 Neural Controlled Differential Equations
for , where is matrix-vector multiplication . Details on the regularity required for existence and uniqueness of the solution to (1) can be found in appendix A.2. A sufficient condition is being of bounded variation and being Lipschitz continuous .
NCDEs are an attractive option for modelling multivariate time series. They are universal approximators of continuous real-valued functions on time series data [5, Theorem 3.9]. Additionally, since they interact with time series data through the continuous interpolation , NCDEs are agnostic to when the data was sampled. This makes them robust to irregular sampling rates. Furthermore, the number of forward passes through when evaluating (1) is controlled by the differential equation solver used. This is opposed to recurrent models, where it is controlled by the number of observations . By decoupling the number of forward passes through their neural network from the number of observations in the time series, NCDEs can mitigate exploding or vanishing gradients on highly-sampled time series.
To make use of the techniques developed for training neural ordinary differential equations (ODEs) , NCDEs are typically rewritten as an ODE,
3 Neural Rough Differential Equations
Compared to NCDEs, NRDEs can reduce the number of forward passes through the network while evaluating the model, as the vector field is autonomous on each interval . This has been shown to lead to improved classification accuracy, alongside reduced time and memory-usage, on time series with up to 17,000 observations . Furthermore, as it is no longer necessary to apply a differentiable interpolation to the time series data, NRDEs are applicable to a wider range of input signals.
4 Contributions
Mathematical Background
Let , , and be vector spaces. The tensor product space is the unique (up to isomorphism) space such that for all bilinear functions there exists a unique linear map , such that .
where any bilinear function can be written as a linear function .
Details on the choice of norm used for are left to appendix A.3.
3 Lie Brackets
A Lie algebra is a vector space with a bilinear map satisfying and the Jacobi identity,
for all . The map is called the Lie bracket .
Any associative algebra, , has a Lie bracket structure with Lie bracket defined by
For a Lie algebra , let denote the span of elements , where .
4 The Log-Signature
Let have bounded variation and define
The signature describes the path over the interval . In fact, assuming contains time as a channel, linear maps on are universal approximators for continuous, real-valued functions of . This property of the signature relies on the shuffle-product identity, which states that certain terms in the signature can be written as polynomials of lower-order terms in the signature . A consequence of the shuffle-product identity is that the signature contains information redundancy, i.e. not every term in the signature provides new information about the path . The transformation which removes this information redundancy is the logarithm.
where .
(Free Lie Algebra ) Let be a non-empty set, and be Lie algebras, and be a map. The Lie algebra is said to be the free Lie algebra generated by if for all maps , there exists a unique Lie algebra homomorphism such that .
The free Lie algebra generated by a Banach space is the space
which was originally shown by K.T. Chen .
5 The Log-ODE Method
Over an interval, the Log-ODE method approximates a CDE using an autonomous ODE constructed by applying the linear map to the truncated log-signature of the control, as seen in (4). There exist theoretical results bounding the error in the Log-ODE method’s approximation, including when the control and solution paths live in infinite dimensional Banach spaces . However, for a given set of intervals, the series of vector fields is not guaranteed to converge. In practice, is typically chosen as the smallest such that a reasonably sized set of intervals gives an approximation error of the desired level. A recent development has been the introduction of an algorithm which adaptively updates and .
Method
Log-NCDEs use the same underlying model as NRDEs
These changes have two benefits. The first is that the output dimension of is , whereas the output dimension of is . Figure 2 compares these values for paths of dimension from to and truncation depths of . For , Log-NCDEs are exploring a significantly smaller output space during training than NRDEs, while maintaining the same expressivity, as NCDEs are universal approximators. The second benefit of these changes is that and are no longer hyperparameters which impact the model’s architecture. Instead, they control the error in the approximation arising from using the Log-ODE method. This allows them to be changed during training and inference. Both of these benefits come at the cost of needing to calculate the iterated Lie brackets when evaluating Log-NCDEs, which will be quantified in section 3.4.
Hence, in this case the only difference between Log-NCDEs and NRDEs is the regularisation of . Furthermore, (20) and (3) are equivalent when is a linear interpolation, so the approach of NCDEs, NRDEs, and Log-NCDEs all coincide when using a depth Log-ODE approximation .
where is a constant depending on and , and are the weights and biases of layer of , and is a polynomial of order .
Assuming that each layer of satisfies , an explicit evaluation of (21) gives
For a depth FCNN, this is greater than the maximum value of a double precision floating point number. Hence, it may be necessary to control explicitly during training. This is achieved by modifying the neural network’s loss function to
where is a hyperparameter controlling the weight of the penalty. This is an example of weight regularisation, which has long been understood to improve generalisation in NNs . Equation 23 is specifically a variation of spectral norm regularisation .
3 Constructing the Log-ODE Vector Field
where is the term in the log-signature corresponding to the basis elements . Since each can be written as iterated Lie brackets of , it is now possible to replace with the iterated Lie brackets of using (17) and (18). The are the vector fields defined by the columns of the neural network’s output. Hence, the vector field is evaluated at a point using Jacobian-vector products of the neural network, which can be efficiently calculated using forward-mode automatic differentiation.
4 Computational Cost
Constructing the iterated Lie brackets of incurs an additional computational cost for each evaluation of the vector field. In order to quantify this additional cost, assume that a NCDE, NRDE, and Log-NCDE are all using an identical FCNN as their vector field, except for the dimension of the final layer in the NRDE. Let denote the dimension of the hidden layers, and be the cost of evaluating the vector field of the NCDE. Then for a dimensional input path, the cost of evaluating a NRDE’s vector field is and the cost of evaluating a Log-NCDE’s vector field is . The idea behind NRDEs and Log-NCDEs is that their additional cost is compensated for by the Log-ODE method requiring less vector field evaluations to attain a solution to the CDE of similar accuracy as other methods . The idea behind Log-NCDEs, is that the additional cost of constructing the Lie brackets is compensated for by the smaller output dimension of the model’s neural network when compared to NRDEs.
5 Limitations
Despite these limitations, in the experiments conducted for this paper, Log-NCDEs significantly improve upon the test set accuracies of NCDEs and NRDEs. Furthermore, they achieve test set accuracies higher than current state-of-the-art deep learning approaches, including structured state-space models. We hope these results motivate the importance of further work on developing a complete implementation of the Log-ODE method for NCDEs.
Experiments
Log-NCDEs are compared against four models, which represent the state-of-the-art for a range of deep learning approaches to time series modelling. The first two are discrete methods; a recurrent neural network using LRU blocks and a structured state-space model, S5 . The other two baseline models are continuous; a NCDE using a Hermite cubic spline with backward differences and a NRDE .
2 Toy Dataset
We construct a toy dataset with dimension and regularly spaced samples. For every time step, the change in each channel is sampled independently from the discrete probability distribution with density
We consider four different binary classifications on the toy dataset. Each classification is a specific term in the signature of the path which depends on a different number of channels.
Was the change in the third channel, , greater than zero?
Was the area integral of the third and sixth channels, , greater than zero?
Was the volume integral of the third, sixth, and first channels, , greater than zero?
Was the D volume integral of the third, sixth, first, and fourth channels, , greater than zero?
On this dataset, all models used a hidden state of dimension and Adam with a learning rate of . Both LRU and S5 used six blocks with GLU layers. S5 used a latent dimension of . NRDE and Log-NCDE used a stepsize of and a depth of .
3 UEA Multivariate Time Series Classification Archive
The models considered in this paper are evaluated on a subset of six datasets from the UEA multivariate time series classification archive (UEA-MTSCA). These six datasets were chosen via the following two criteria. First, only datasets with more than total cases were considered. Second, the six datasets with the most observations were chosen, as datasets with many observations have previously proved challenging for deep learning approaches to time series modelling. Following , the original train and test cases are combined and resplit into new random train, validation, and test cases using a split.
Hyperparameters for all models were found using a gird search over the validation accuracy on a fixed random split of the data. Full details on the hyperparameter grid search are in appendix C. Having fixed their hyperparameters, models are compared on their average test set accuracy over five different random splits of the data.
Results
Figure 4 compares the performance of LRU, S5, NCDE, NRDE, and Log-NCDE on the four different classifications considered for the toy dataset. As expected, all of the models perform well when the label depends on one channel. As the number of channels the classification depends on increases, LRU and S5 begin to take significantly more training steps to converge. When the classification label is the volume integral containing four channels, S5 and LRU struggle to learn anything within steps of training. Increasing the number of training steps to , S5 and LRU are able to achieve validation accuracies of and respectively. However, this is still significantly lower than the validation accuracy achieved by NCDEs in steps of training.
As can be seen in figure 4, NCDEs outperform both NRDEs and Log-NCDEs using a stepsize of and a depth of . This is expected, as the classification labels considered are exactly the solution of a CDE, and for , NRDEs and Log-NCDEs are approximations to CDEs. Additionally, Log-NCDEs outperform NRDEs on each classification, with the gap in performance growing as the classification label depends on more channels.
2 UEA-MTSCA
Table 1 reports the average and standard deviation of each model’s test set accuracy over five different splits of the data. On the datasets considered, NCDEs achieve the lowest average accuracy. NRDEs improve upon NCDEs in both average accuracy and rank, but are still outperformed by the current state-of-the-art models, LRU and S5. Log-NCDEs achieve the highest average accuracy and best average rank of the models considered. Furthermore, Log-NCDEs have the lowest standard deviation on four of the six datasets. These results highlight how the modifications introduced by Log-NCDEs can significantly increase performance.
Discussion
On the toy dataset, NCDEs, NRDEs, and Log-NCDEs are all able to model the dimensional volume integrals with high validation accuracy. This is not surprising, as the classification label can be written as a CDE driven by the input path. However, LRU and S5 both struggle to capture the necessary information as increases. Log-NCDEs outperforming NRDEs when the classification is the solution of a CDE indicate that NRDEs do not always converge to solutions where is the sum of Lie brackets of a NCDE vector field, .
Conclusion
There are many possible directions of future work. Log-NCDE’s could be extended to depth- Log-ODE methods for . This would require extending Theorem 3.1 to . Furthermore, it would be necessary to address the computational cost of the iterated Lie brackets, which could be done by using a structured neural network with cheap Jacobian-vector products as the CDE vector field. Another avenue could be to incorporate the recently developed adaptive version of the Log-ODE method .
Acknowledgements
Benjamin Walker and Terry Lyons were funded by the Hong Kong Innovation and Technology Commission (InnoHK Project CIMDA). Andrew McLeod and Terry Lyons were funded by the EPSRC [grant number EP/S026347/1] and The Alan Turing Institute under the EPSRC grant EP/N510129/1. Terry Lyons was funded by the Data Centric Engineering Programme (under the Lloyd’s Register Foundation grant G0095), the Defence and Security Programme (funded by the UK Government) and the Office for National Statistics & The Alan Turing Institute (strategic partnership).
References
Appendix A Additional Mathematical Details
Let and be Banach spaces and denote the space of linear mappings from to .
In this paper, we will always take spaces of linear maps to be equipped with their operator norms.
A linear map is symmetric if for all and all bijectiive functions ,
The set of all symmetric linear maps is denoted
Let be a non-negative integer, be a real number, a closed subset of , and . For , let . The collection is an element of if there exists such that for ,
and for , all , and each ,
When there is no confusion over and , the shorthand notation will be used. If a collection is , then the norm, denoted , is the smallest for which (28) and (29) hold.
In order to illustrate the definition of , two examples are given. When , then and implies is bounded and Hölder continuous. When , then is bounded and Lipschitz. When , then and if is bounded and there exists that is bounded, Hölder continuous, and satisfies
for all and some constant .
A.2 Existence and Uniqueness
Let and be continuous paths. The existence and uniqueness of the solution to a CDE,
depends on the smoothness of the control path and the vector field . We will measure the smoothness of a path by the smallest for which the variation is finite and the smoothness of a vector field by the largest such that the function is (defined in section A.1).
(Partition) A partition of a real interval is a set of real numbers satisfying .
(variation ) Let be a Banach space, be a partition of , and be a real number. The variation of a path is defined as
Let and . If is finite-dimensional, has finite variation, and is , then (31) admits a solution for every .
Let and . If has finite variation and is , then (31) admits a unique solution for every .
These theorems extend the classic differential equation existence and uniqueness results to controls with unbounded variation but finite variation for . A proof of these theorems is can be found in . These theorems are sufficient for the differential equations considered in this paper. However, there are many settings where the control has infinite variation for all , such as Brownian motion. The theory of rough paths was developed in order to give meaning to (31) when the control’s variation is finite only for . An introduction to rough path theory can be found in .
A.3 Norm of Tensor Product Space
Let be a Banach space and denote the tensor powers of ,
There is choice in the norm of . In this paper, we follow the setting of and . It is assumed that each is endowed with a norm such that the following conditions hold for all and :
for all all bijectiive functions ,
for any bounded linear functional on and on , there exists a unique bounded linear functional on such that
with product defined by
The tensor algebra’s product is associative and has unit . As is an associative algebra, it has a Lie algebra structure, with Lie bracket
Appendix B Proof of Theorem 3.1
(Composed norm ) Let , , and be Banach spaces and and be closed. For , let and . Then the composition is with
where is the unique integer such that and is a constant independent of and .
The original statement of lemma B.1 in gives (37) as
We believe this is a small erratum, as for defined as , (38) implies there exists such that
for all bounded and Lipschitz . As a counterexample, for any , take with . The following proof of lemma B.1 is given in .
Let be defined by the generalisation of the chain rule to higher derivatives. Explicit calculation can be used to verify that if and are , definition A.3 implies is with obeying (37). ∎
Bounding the norm of a neural network (NN) requires an explicit form for in (37). This can be obtained via the explicit calculations mentioned in the proof of lemma B.1. Here, we present the case .
Let , , and be Banach spaces and and be closed. For , let and . Consider and defined for and by
Then and
From definition A.3, , , and . Furthermore, for all
Similarly, for all we have that
Define and by
for any and . Then
Similarly, define and by
Define and as in (40),
for and . Finally, define remainder terms and by
for and . We now establish that and that the norm estimate claimed in (41) is satisfied.
First we consider the bounds on and . For any , (I) in (43) implies that
since . Further, for any and any , (43) and (II) in (42) imply that
since . Taking the supremum over with unit -norm, it follows that
Now we consider the bounds on and . For this purpose we fix and . We first assume that . In this case we may use (50) and (51) to compute that
Since means that , we deduce that
Similarly, we may use (51) and that to compute that
Taking the supremum over with unit -norm in (53) yields the estimate that
Together, (52) and (54) establish the remainder term estimates required to conclude that in the case that . We next establish similar remainder term estimates when . Thus we fix and assume that . Note that means that . Additionally,
where (II) in (42) and (I) in (45) have been used. We now consider the term . We start by observing that
Consequently, by using (II) in (43) to estimate the term , (I) in (45) to estimate the term , and (I) in (47) to estimate the term , we may deduce that
The combination of (55) and (56) yields the estimate
Turning our attention to , we fix and compute that
Consequently, by using (II) in (43) to estimate the term , (II) in (42) to estimate the term , (II) in (45) to estimate the term , and (II) in (47) to estimate the term , we may deduce that
The combination of (55) and (58) yields the estimate that
Taking the supremum over with unit -norm in (59) yields the estimate that
Finally, we complete the proof by combining the various estimates we have established for to obtain the -norm bound claimed in (41).
We start this task by combining (52) and (57) to deduce that for every we have
Moreover, the combination of (54) and (60) yields the estimate that
Therefore, by combining (50), (51), (63), and (64), we conclude both that and that
Note that (41) is a stricter bound than (37), as for ,
There is equality when or .
B.2 Lip(2)−limit-fromLip2\text{Lip}(2)-norm of a Neural Network Layer
Lemma B.2 allows us to bound the norm of a neural network (NN) given a bound on the norm of each layer of a NN. We demonstrate this here for a simple NN.
Let be a fully connected NN with activation function SiLU. Assume the input is normalised such that satisfies for . Then
where So,
Since is at least twice differentiable,
where for some and . Similarly,
where for some . Now,
The calculations for subsequent layers are very similar, except that the input to each layer is no longer restricted to . For example,
B.3 Proof of Theorem 3.1
where is a constant depending on and , and are the weights and biases of layer of , and is a polynomial of order .
where is a constant depending on and and is a order polynomial. Applying lemma B.2 gives the bound in (86). ∎
Appendix C UEA-MTSCA Hyperparameter Optimisation
NCDEs, NRDEs, and Log-NCDEs use a single linear layer as . NCDEs and NRDEs use FCNNs as their vector fields configured in the same way as their original papers . NCDEs use ReLU activation functions for the hidden layers and a final activation function of . NRDEs use the same, but move the activaion function to be before the final linear layer in the FCNN. NRDEs and Log-NCDEs take their intervals to be a fixed number of observations, referred to as the Log-ODE step. NCDEs, NRDEs, and Log-NCDEs all use Heun as their differential equation solver with a fixed stepsize of , with for NCDEs. Table 2 gives an overview of the different hyperparameters optimised over for each model on the UEA-MTSCA. The optimisation was performed using a grid search of the validation accuracy.