Neural Ordinary Differential Equations

Ricky T. Q. Chen, Yulia Rubanova, Jesse Bettencourt, David Duvenaud

Introduction

Models such as residual networks, recurrent neural network decoders, and normalizing flows build complicated transformations by composing a sequence of transformations to a hidden state:

What happens as we add more layers and take smaller steps? In the limit, we parameterize the continuous dynamics of hidden units using an ordinary differential equation (ODE) specified by a neural network:

Starting from the input layer h(0)\mathbf{h}(0), we can define the output layer h(T)\mathbf{h}(T) to be the solution to this ODE initial value problem at some time TT. This value can be computed by a black-box differential equation solver, which evaluates the hidden unit dynamics ff wherever necessary to determine the solution with the desired accuracy. Figure 1 contrasts these two approaches.

Defining and evaluating models using ODE solvers has several benefits:

In Section 2, we show how to compute gradients of a scalar-valued loss with respect to all inputs of any ODE solver, without backpropagating through the operations of the solver. Not storing any intermediate quantities of the forward pass allows us to train our models with constant memory cost as a function of depth, a major bottleneck of training deep models.

Adaptive computation

Euler’s method is perhaps the simplest method for solving ODEs. There have since been more than 120 years of development of efficient and accurate ODE solvers (Runge, 1895; Kutta, 1901; Hairer et al., 1987). Modern ODE solvers provide guarantees about the growth of approximation error, monitor the level of error, and adapt their evaluation strategy on the fly to achieve the requested level of accuracy. This allows the cost of evaluating a model to scale with problem complexity. After training, accuracy can be reduced for real-time or low-power applications.

Scalable and invertible normalizing flows

An unexpected side-benefit of continuous transformations is that the change of variables formula becomes easier to compute. In Section 4, we derive this result and use it to construct a new class of invertible density models that avoids the single-unit bottleneck of normalizing flows, and can be trained directly by maximum likelihood.

Continuous time-series models

Unlike recurrent neural networks, which require discretizing observation and emission intervals, continuously-defined dynamics can naturally incorporate data which arrives at arbitrary times. In Section 5, we construct and demonstrate such a model.

Reverse-mode automatic differentiation of ODE solutions

The main technical difficulty in training continuous-depth networks is performing reverse-mode differentiation (also known as backpropagation) through the ODE solver. Differentiating through the operations of the forward pass is straightforward, but incurs a high memory cost and introduces additional numerical error.

We treat the ODE solver as a black box, and compute gradients using the adjoint sensitivity method (Pontryagin et al., 1962). This approach computes gradients by solving a second, augmented ODE backwards in time, and is applicable to all ODE solvers. This approach scales linearly with problem size, has low memory cost, and explicitly controls numerical error.

Consider optimizing a scalar-valued loss function L()L(), whose input is the result of an ODE solver:

To optimize LL, we require gradients with respect to θ\theta. The first step is to determining how the gradient of the loss depends on the hidden state z(t)\mathbf{z}(t) at each instant. This quantity is called the adjoint a(t)=\nicefracLz(t)\mathbf{a}(t)=\nicefrac{{\partial{L}}}{{\partial\mathbf{z}(t)}}. Its dynamics are given by another ODE, which can be thought of as the instantaneous analog of the chain rule:

We can compute \nicefracLz(t0)\nicefrac{{\partial L}}{{\partial\mathbf{z}({t_{\textnormal{0}}})}} by another call to an ODE solver. This solver must run backwards, starting from the initial value of \nicefracLz(t1)\nicefrac{{\partial L}}{{\partial\mathbf{z}({t_{\textnormal{1}}})}}. One complication is that solving this ODE requires the knowing value of z(t)\mathbf{z}(t) along its entire trajectory. However, we can simply recompute z(t)\mathbf{z}(t) backwards in time together with the adjoint, starting from its final value z(t1)\mathbf{z}({t_{\textnormal{1}}}).

Computing the gradients with respect to the parameters θ\theta requires evaluating a third integral, which depends on both z(t)\mathbf{z}(t) and a(t)\mathbf{a}(t):

The vector-Jacobian products a(t)Tfz\mathbf{a}(t)^{T}\frac{\partial f}{\partial\mathbf{z}} and a(t)Tfθ\mathbf{a}(t)^{T}\frac{\partial f}{\partial\theta} in (4) and (5) can be efficiently evaluated by automatic differentiation, at a time cost similar to that of evaluating ff. All integrals for solving z\mathbf{z}, a\mathbf{a} and Lθ\frac{\partial L}{\partial\theta} can be computed in a single call to an ODE solver, which concatenates the original state, the adjoint, and the other partial derivatives into a single vector. Algorithm 1 shows how to construct the necessary dynamics, and call an ODE solver to compute all gradients at once.

Most ODE solvers have the option to output the state z(t)\mathbf{z}(t) at multiple times. When the loss depends on these intermediate states, the reverse-mode derivative must be broken into a sequence of separate solves, one between each consecutive pair of output times (Figure 2). At each observation, the adjoint must be adjusted in the direction of the corresponding partial derivative \nicefracLz(ti)\nicefrac{{\partial L}}{{\partial\mathbf{z}(t_{i})}}.

The results above extend those of Stapor et al. (2018, section 2.4.2). An extended version of Algorithm 1 including derivatives w.r.t. t0{t_{\textnormal{0}}} and t1{t_{\textnormal{1}}} can be found in Appendix C. Detailed derivations are provided in Appendix B. Appendix D provides Python code which computes all derivatives for scipy.integrate.odeint by extending the autograd automatic differentiation package. This code also supports all higher-order derivatives. We have since released a PyTorch (Paszke et al., 2017) implementation, including GPU-based implementations of several standard ODE solvers at github.com/rtqichen/torchdiffeq.

Replacing residual networks with ODEs for supervised learning

In this section, we experimentally investigate the training of neural ODEs for supervised learning.

To solve ODE initial value problems numerically, we use the implicit Adams method implemented in LSODE and VODE and interfaced through the scipy.integrate package. Being an implicit method, it has better guarantees than explicit methods such as Runge-Kutta but requires solving a nonlinear optimization problem at every step. This setup makes direct backpropagation through the integrator difficult. We implement the adjoint sensitivity method in Python’s autograd framework (Maclaurin et al., 2015). For the experiments in this section, we evaluated the hidden state dynamics and their derivatives on the GPU using Tensorflow, which were then called from the Fortran ODE solvers, which were called from Python autograd code.

Model Architectures

Error Control in ODE-Nets

ODE solvers can approximately ensure that the output is within a given tolerance of the true solution. Changing this tolerance changes the behavior of the network. We first verify that error can indeed be controlled in Figure 3a. The time spent by the forward call is proportional to the number of function evaluations (Figure 3b), so tuning the tolerance gives us a trade-off between accuracy and computational cost. One could train with high accuracy, but switch to a lower accuracy at test time.

Figure 3c) shows a surprising result: the number of evaluations in the backward pass is roughly half of the forward pass. This suggests that the adjoint sensitivity method is not only more memory efficient, but also more computationally efficient than directly backpropagating through the integrator, because the latter approach will need to backprop through each function evaluation in the forward pass.

Network Depth

It’s not clear how to define the ‘depth‘ of an ODE solution. A related quantity is the number of evaluations of the hidden state dynamics required, a detail delegated to the ODE solver and dependent on the initial state or input. Figure 3d shows that he number of function evaluations increases throughout training, presumably adapting to increasing complexity of the model.

Continuous Normalizing Flows

The discretized equation (1) also appears in normalizing flows (Rezende and Mohamed, 2015) and the NICE framework (Dinh et al., 2014). These methods use the change of variables theorem to compute exact changes in probability if samples are transformed through a bijective function ff:

An example is the planar normalizing flow (Rezende and Mohamed, 2015):

Generally, the main bottleneck to using the change of variables formula is computing of the determinant of the Jacobian \nicefracfz\nicefrac{{\partial f}}{{\partial{\mathbf{z}}}}, which has a cubic cost in either the dimension of z\mathbf{z}, or the number of hidden units. Recent work explores the tradeoff between the expressiveness of normalizing flow layers and computational cost (Kingma et al., 2016; Tomczak and Welling, 2016; Berg et al., 2018).

Surprisingly, moving from a discrete set of layers to a continuous transformation simplifies the computation of the change in normalizing constant:

Let z(t){\mathbf{z}}(t) be a finite continuous random variable with probability p(z(t))p({\mathbf{z}}(t)) dependent on time. Let dzdt=f(z(t),t)\frac{d{\mathbf{z}}}{dt}=f({\mathbf{z}}(t),t) be a differential equation describing a continuous-in-time transformation of z(t){\mathbf{z}}(t). Assuming that ff is uniformly Lipschitz continuous in z{\mathbf{z}} and continuous in tt, then the change in log probability also follows a differential equation,

Proof in Appendix A. Instead of the log determinant in (6), we now only require a trace operation. Also unlike standard finite flows, the differential equation ff does not need to be bijective, since if uniqueness is satisfied, then the entire transformation is automatically bijective.

As an example application of the instantaneous change of variables, we can examine the continuous analog of the planar flow, and its change in normalization constant:

Given an initial distribution p(z(0))p({\mathbf{z}}(0)), we can sample from p(z(t))p({\mathbf{z}}(t)) and evaluate its density by solving this combined ODE.

While det\det is not a linear function, the trace function is, which implies tr(nJn)=ntr(Jn)\textnormal{tr}(\sum_{n}J_{n})=\sum_{n}\textnormal{tr}(J_{n}). Thus if our dynamics is given by a sum of functions then the differential equation for the log density is also a sum:

This means we can cheaply evaluate flow models having many hidden units, with a cost only linear in the number of hidden units MM. Evaluating such ‘wide’ flow layers using standard normalizing flows costs O(M3)\mathcal{O}(M^{3}), meaning that standard NF architectures use many layers of only a single hidden unit.

Time-dependent dynamics

We can specify the parameters of a flow as a function of tt, making the differential equation f(z(t),t)f({\mathbf{z}}(t),t) change with tt. This is parameterization is a kind of hypernetwork (Ha et al., 2016). We also introduce a gating mechanism for each hidden unit, dzdt=nσn(t)fn(z)\frac{d{\mathbf{z}}}{dt}=\sum_{n}\sigma_{n}(t)f_{n}({\mathbf{z}}) where σn(t)(0,1)\sigma_{n}(t)\in(0,1) is a neural network that learns when the dynamic fn(z)f_{n}({\mathbf{z}}) should be applied. We call these models continuous normalizing flows (CNF).

1 Experiments with Continuous Normalizing Flows

We first compare continuous and discrete planar flows at learning to sample from a known distribution. We show that a planar CNF with MM hidden units can be at least as expressive as a planar NF with K=MK=M layers, and sometimes much more expressive.

We configure the CNF as described above, and train for 10,000 iterations using Adam (Kingma and Ba, 2014). In contrast, the NF is trained for 500,000 iterations using RMSprop (Hinton et al., 2012), as suggested by Rezende and Mohamed (2015). For this task, we minimize KL(q(x)p(x))\operatorname{KL}\left(q(\mathbf{x})\middle\|p(\mathbf{x})\right) as the loss function where qq is the flow model and the target density p()p(\cdot) can be evaluated. Figure 4 shows that CNF generally achieves lower loss.

Maximum Likelihood Training

For this task, we use 64 hidden units for CNF, and 64 stacked one-hidden-unit layers for NF. Figure 6 shows the learned dynamics. Instead of showing the initial Gaussian distribution, we display the transformed distribution after a small amount of time which shows the locations of the initial planar flows. Interestingly, to fit the Two Circles distribution, the CNF rotates the planar flows so that the particles can be evenly spread into circles. While the CNF transformations are smooth and interpretable, we find that NF transformations are very unintuitive and this model has difficulty fitting the two moons dataset in Figure 6(b).

A generative latent function time-series model

Applying neural networks to irregularly-sampled data such as medical records, network traffic, or neural spiking data is difficult. Typically, observations are put into bins of fixed duration, and the latent dynamics are discretized in the same way. This leads to difficulties with missing data and ill-defined latent variables. Missing data can be addressed using generative time-series models (Álvarez and Lawrence, 2011; Futoma et al., 2017; Mei and Eisner, 2017; Soleimani et al., 2017a) or data imputation (Che et al., 2018). Another approach concatenates time-stamp information to the input of an RNN (Choi et al., 2016; Lipton et al., 2016; Du et al., 2016; Li, 2017).

We present a continuous-time, generative approach to modeling time series. Our model represents each time series by a latent trajectory. Each trajectory is determined from a local initial state, zt0\mathbf{z}_{t_{0}}, and a global set of latent dynamics shared across all time series. Given observation times t0,t1,,tNt_{0},t_{1},\dots,t_{N} and an initial state zt0\mathbf{z}_{t_{0}}, an ODE solver produces zt1,,ztN\mathbf{z}_{t_{1}},\dots,\mathbf{z}_{t_{N}}, which describe the latent state at each observation.We define this generative model formally through a sampling procedure:

Function ff is a time-invariant function that takes the value z\mathbf{z} at the current time step and outputs the gradient: \nicefracz(t)t=f(z(t),θf)\nicefrac{{\partial\mathbf{z}(t)}}{{\partial t}}=f(\mathbf{z}(t),\theta_{f}). We parametrize this function using a neural net. Because ff is time-invariant, given any latent state z(t)\mathbf{z}(t), the entire latent trajectory is uniquely defined. Extrapolating this latent trajectory lets us make predictions arbitrarily far forwards or backwards in time.

We can train this latent-variable model as a variational autoencoder (Kingma and Welling, 2014; Rezende et al., 2014), with sequence-valued observations. Our recognition net is an RNN, which consumes the data sequentially backwards in time, and outputs qϕ(z0x1,x2,,xN)q_{\phi}(\mathbf{z}_{0}|\mathbf{x}_{1},\mathbf{x}_{2},\dots,\mathbf{x}_{N}). A detailed algorithm can be found in Appendix E. Using ODEs as a generative model allows us to make predictions for arbitrary time points t1t_{1}tMt_{M} on a continuous timeline.

Poisson Process likelihoods

The fact that an observation occurred often tells us something about the latent state. For example, a patient may be more likely to take a medical test if they are sick. The rate of events can be parameterized by a function of the latent state: p(\textnormal{event at timet}|\,\mathbf{z}(t))=\lambda(\mathbf{z}(t)). Given this rate function, the likelihood of a set of independent observation times in the interval [tstart,tend][t_{\textnormal{start}},t_{\textnormal{end}}] is given by an inhomogeneous Poisson process (Palm, 1943):

We can parameterize λ()\lambda(\cdot) using another neural network. Conveniently, we can evaluate both the latent trajectory and the Poisson process likelihood together in a single call to an ODE solver. Figure 8 shows the event rate learned by such a model on a toy dataset.

A Poisson process likelihood on observation times can be combined with a data likelihood to jointly model all observations and the times at which they were made.

1 Time-series Latent ODE Experiments

We investigate the ability of the latent ODE model to fit and extrapolate time series. The recognition network is an RNN with 25 hidden units. We use a 4-dimensional latent space. We parameterize the dynamics function ff with a one-hidden-layer network with 20 hidden units. The decoder computing p(xtizti)p(\mathbf{x}_{t_{i}}|\mathbf{z}_{t_{i}}) is another neural network with one hidden layer with 20 hidden units. Our baseline was a recurrent neural net with 25 hidden units trained to minimize negative Gaussian log-likelihood. We trained a second version of this RNN whose inputs were concatenated with the time difference to the next observation to aid RNN with irregular observations.

We generated a dataset of 1000 2-dimensional spirals, each starting at a different point, sampled at 100 equally-spaced timesteps. The dataset contains two types of spirals: half are clockwise while the other half counter-clockwise. To make the task more realistic, we add gaussian noise to the observations.

Time series with irregular time points

To generate irregular timestamps, we randomly sample points from each trajectory without replacement (n={30,50,100}n=\{30,50,100\}). We report predictive root-mean-squared error (RMSE) on 100 time points extending beyond those that were used for training. Table 2 shows that the latent ODE has substantially lower predictive RMSE.

Figure 9 shows examples of spiral reconstructions with 30 sub-sampled points. Reconstructions from the latent ODE were obtained by sampling from the posterior over latent trajectories and decoding it to data-space. Examples with varying number of time points are shown in Appendix F. We observed that reconstructions and extrapolations are consistent with the ground truth regardless of number of observed points and despite the noise.

Latent space interpolation

Figure 9(c) shows latent trajectories projected onto the first two dimensions of the latent space. The trajectories form two separate clusters of trajectories, one decoding to clockwise spirals, the other to counter-clockwise. Figure 10 shows that the latent trajectories change smoothly as a function of the initial point z(t0)\mathbf{z}(t_{0}), switching from a clockwise to a counter-clockwise spiral.

Scope and Limitations

The use of mini-batches is less straightforward than for standard neural networks. One can still batch together evaluations through the ODE solver by concatenating the states of each batch element together, creating a combined ODE with dimension D×KD\times K. In some cases, controlling error on all batch elements together might require evaluating the combined system KK times more often than if each system was solved individually. However, in practice the number of evaluations did not increase substantially when using minibatches.

Uniqueness

When do continuous dynamics have a unique solution? Picard’s existence theorem (Coddington and Levinson, 1955) states that the solution to an initial value problem exists and is unique if the differential equation is uniformly Lipschitz continuous in z\mathbf{z} and continuous in tt. This theorem holds for our model if the neural network has finite weights and uses Lipshitz nonlinearities, such as tanh or relu.

Setting tolerances

Our framework allows the user to trade off speed for precision, but requires the user to choose an error tolerance on both the forward and reverse passes during training. For sequence modeling, the default value of 1.5e-8 was used. In the classification and density estimation experiments, we were able to reduce the tolerance to 1e-3 and 1e-5, respectively, without degrading performance.

Reconstructing forward trajectories

Reconstructing the state trajectory by running the dynamics backwards can introduce extra numerical error if the reconstructed trajectory diverges from the original. This problem can be addressed by checkpointing: storing intermediate values of z\mathbf{z} on the forward pass, and reconstructing the exact forward trajectory by re-integrating from those points. We did not find this to be a practical problem, and we informally checked that reversing many layers of continuous normalizing flows with default tolerances recovered the initial states.

Related Work

The use of the adjoint method for training continuous-time neural networks was previously proposed (LeCun et al., 1988; Pearlmutter, 1995), though was not demonstrated practically. The interpretation of residual networks He et al. (2016a) as approximate ODE solvers spurred research into exploiting reversibility and approximate computation in ResNets (Chang et al., 2017; Lu et al., 2017). We demonstrate these same properties in more generality by directly using an ODE solver.

One can adapt computation time by training secondary neural networks to choose the number of evaluations of recurrent or residual networks (Graves, 2016; Jernite et al., 2016; Figurnov et al., 2017; Chang et al., 2018). However, this introduces overhead both at training and test time, and extra parameters that need to be fit. In contrast, ODE solvers offer well-studied, computationally cheap, and generalizable rules for adapting the amount of computation.

Constant memory backprop through reversibility

Recent work developed reversible versions of residual networks (Gomez et al., 2017; Haber and Ruthotto, 2017; Chang et al., 2017), which gives the same constant memory advantage as our approach. However, these methods require restricted architectures, which partition the hidden units. Our approach does not have these restrictions.

Learning differential equations

Much recent work has proposed learning differential equations from data. One can train feed-forward or recurrent neural networks to approximate a differential equation (Raissi and Karniadakis, 2018; Raissi et al., 2018a; Long et al., 2017), with applications such as fluid simulation (Wiewel et al., 2018). There is also significant work on connecting Gaussian Processes (GPs) and ODE solvers (Schober et al., 2014). GPs have been adapted to fit differential equations (Raissi et al., 2018b) and can naturally model continuous-time effects and interventions (Soleimani et al., 2017b; Schulam and Saria, 2017). Ryder et al. (2018) use stochastic variational inference to recover the solution of a given stochastic differential equation.

Differentiating through ODE solvers

The dolfin library (Farrell et al., 2013) implements adjoint computation for general ODE and PDE solutions, but only by backpropagating through the individual operations of the forward solver. The Stan library (Carpenter et al., 2015) implements gradient estimation through ODE solutions using forward sensitivity analysis. However, forward sensitivity analysis is quadratic-time in the number of variables, whereas the adjoint sensitivity analysis is linear (Carpenter et al., 2015; Zhang and Sandu, 2014). Melicher et al. (2017) used the adjoint method to train bespoke latent dynamic models.

In contrast, by providing a generic vector-Jacobian product, we allow an ODE solver to be trained end-to-end with any other differentiable model components. While use of vector-Jacobian products for solving the adjoint method has been explored in optimal control (Andersson, 2013; Andersson et al., In Press, 2018), we highlight the potential of a general integration of black-box ODE solvers into automatic differentiation (Baydin et al., 2018) for deep learning and generative modeling.

Conclusion

We investigated the use of black-box ODE solvers as a model component, developing new models for time-series modeling, supervised learning, and density estimation. These models are evaluated adaptively, and allow explicit control of the tradeoff between computation speed and accuracy. Finally, we derived an instantaneous version of the change of variables formula, and developed continuous-time normalizing flows, which can scale to large layer sizes.

Acknowledgements

We thank Wenyi Wang and Geoff Roeder for help with proofs, and Daniel Duckworth, Ethan Fetaya, Hossein Soleimani, Eldad Haber, Ken Caluwaerts, Daniel Flam-Shepherd, and Harry Braviner for feedback. We thank Chris Rackauckas, Dougal Maclaurin, and Matthew James Johnson for helpful discussions. We also thank Yuval Frommer for pointing out an unsupported claim about parameter efficiency.

References

Appendix A Proof of the Instantaneous Change of Variables Theorem

Let z(t)\mathbf{z}(t) be a finite continuous random variable with probability p(z(t))p(\mathbf{z}(t)) dependent on time. Let dzdt=f(z(t),t)\frac{d\mathbf{z}}{dt}=f(\mathbf{z}(t),t) be a differential equation describing a continuous-in-time transformation of z(t)\mathbf{z}(t). Assuming that ff is uniformly Lipschitz continuous in z\mathbf{z} and continuous in tt, then the change in log probability also follows a differential equation:

To prove this theorem, we take the infinitesimal limit of finite changes of logp(z(t))\log p({\mathbf{z}}(t)) through time. First we denote the transformation of z{\mathbf{z}} over an ε\varepsilon change in time as

We assume that ff is Lipschitz continuous in z(t){\mathbf{z}}(t) and continuous in tt, so every initial value problem has a unique solution by Picard’s existence theorem. We also assume z(t){\mathbf{z}}(t) is bounded. These conditions imply that ff, TεT_{\varepsilon}, and zTε\frac{\partial}{\partial{\mathbf{z}}}T_{\varepsilon} are all bounded. In the following, we use these conditions to exchange limits and products.

We can write the differential equation logp(z(t))t\frac{\partial\log p({\mathbf{z}}(t))}{\partial t} using the discrete change of variables formula, and the definition of the derivative:

The derivative of the determinant can be expressed using Jacobi’s formula, which gives

Substituting TεT_{\varepsilon} with its Taylor series expansion and taking the limit, we complete the proof.

Let f(z)=uh(wz+b)f({\mathbf{z}})=uh(w^{\mathbf{z}}+b), then fz=uhzT\frac{\partial f}{\partial{\mathbf{z}}}=u\frac{\partial h}{\partial{\mathbf{z}}}^{\mkern-1.5mu\mathsf{T}}{}. Since the trace of an outer product is the inner product, we have

This is the parameterization we use in all of our experiments.

Hamiltonian CNF.

The continuous analog of NICE (Dinh et al., 2014) is a Hamiltonian flow, which splits the data into two equal partitions and is a volume-preserving transformation, implying that logp(z)t=0\frac{\partial\log p({\mathbf{z}})}{\partial t}=0. We can verify this. Let

Then because the Jacobian is all zeros on its diagonal, the trace is zero. This is a volume-preserving flow.

A.2 Connection to Fokker-Planck and Liouville PDEs

The Fokker-Planck equation is a well-known partial differential equation (PDE) that describes the probability density function of a stochastic differential equation as it changes with time. We relate the instantaneous change of variables to the special case of Fokker-Planck with zero diffusion, the Liouville equation.

However, (30) cannot be easily used as it requires the partial derivatives of p(z,t)z\frac{p(\mathbf{z},t)}{\partial\mathbf{z}}, which is typically approximated using finite difference. This type of PDE has its own literature on efficient and accurate simulation (Stam, 1999).

Instead of evaluating p(,t)p(\cdot,t) at a fixed point, if we follow the trajectory of a particle z(t)\mathbf{z}(t), we obtain

We arrive at the instantaneous change of variables by taking the log,

While still a PDE, (32) can be combined with z(t)\mathbf{z}(t) to form an ODE of size D+1D+1,

Compared to the Fokker-Planck and Liouville equations, the instantaneous change of variables is of more practical impact as it can be numerically solved much more easily, requiring an extra state of DD for following the trajectory of z(t)\mathbf{z}(t). Whereas an approach based on finite difference approximation of the Liouville equation would require a grid size that is exponential in DD.

Appendix B A Modern Proof of the Adjoint Method

We present an alternative proof to the adjoint method (Pontryagin et al., 1962) that is short and easy to follow.

Let z(t)\mathbf{z}(t) follow the differential equation dz(t)dt=f(z(t),t,θ)\frac{d\mathbf{z}(t)}{dt}=f(\mathbf{z}(t),t,\theta), where θ\theta are the parameters. We will prove that if we define an adjoint state

then it follows the differential equation

For ease of notation, we denote vectors as row vectors, whereas the main text uses column vectors.

The adjoint state is the gradient with respect to the hidden state at a specified time tt. In standard neural networks, the gradient of a hidden layer ht\mathbf{h}_{t} depends on the gradient from the next layer ht+1\mathbf{h}_{t+1} by chain rule

With a continuous hidden state, we can write the transformation after an ε\varepsilon change in time as

The proof of (35) follows from the definition of derivative:

We pointed out the similarity between adjoint method and backpropagation (eq. 38). Similarly to backpropagation, ODE for the adjoint state needs to be solved backwards in time. We specify the constraint on the last time point, which is simply the gradient of the loss wrt the last time point, and can obtain the gradients with respect to the hidden state at any time, including the initial value.

Here we assumed that loss function LL depends only on the last time point tNt_{N}. If function LL depends also on intermediate time points t1,t2,,tN1t_{1},t_{2},\dots,t_{N-1}, etc., we can repeat the adjoint step for each of the intervals [tN1,tN][t_{N-1},t_{N}], [tN2,tN1][t_{N-2},t_{N-1}] in the backward order and sum up the obtained gradients.

B.2 Gradients wrt. θ𝜃\theta and t𝑡t

We can generalize (35) to obtain gradients with respect to θ\theta–a constant wrt. tt–and and the initial and end times, t0t_{0} and tNt_{N}. We view θ\theta and tt as states with constant differential equations and write

We can then combine these with zz to form an augmented stateNote that we’ve overloaded tt to be both a part of the state and the (dummy) independent variable. The distinction is clear given context, so we keep tt as the independent variable for consistency with the rest of the text. with corresponding differential equation and adjoint state,

Note this formulates the augmented ODE as an autonomous (time-invariant) ODE, but the derivations in the previous section still hold as this is a special case of a time-variant ODE. The Jacobian of ff has the form

where each 0\mathbf{0} is a matrix of zeros with the appropriate dimensions. We plug this into (35) to obtain

The first element is the adjoint differential equation (35), as expected. The second element can be used to obtain the total gradient with respect to the parameters, by integrating over the full interval and setting aθ(tN)=0\mathbf{a}_{\theta}(t_{N})=\mathbf{0}.

Finally, we also get gradients with respect to t0t_{0} and tNt_{N}, the start and end of the integration interval.

Between (35), (46), (51), and (52) we have gradients for all possible inputs to an initial value problem solver.

Appendix C Full Adjoint sensitivities algorithm

This more detailed version of Algorithm 1 includes gradients with respect to the start and end times of integration.

Appendix D Autograd Implementation

Appendix E Algorithm for training the latent ODE model

To obtain the latent representation zt0\mathbf{z}_{t_{0}}, we traverse the sequence using RNN and obtain parameters of distribution q(zt0{xti,ti}i,θenc)q(\mathbf{z}_{t_{0}}|\{\mathbf{x}_{t_{i}},t_{i}\}_{i},\theta_{enc}). The algorithm follows a standard VAE algorithm with an RNN variational posterior and an ODESolve model:

Run an RNN encoder through the time series and infer the parameters for a posterior over zt0\mathbf{z}_{t_{0}}:

where μz0,σz0\mu_{\mathbf{z}_{0}},\sigma_{\mathbf{z}_{0}} comes from hidden state of RNN({xti,ti}i,ϕ\{\mathbf{x}_{t_{i}},t_{i}\}_{i},\phi)

Sample zt0q(zt0{xti,ti}i)\mathbf{z}_{t_{0}}\sim q(\mathbf{z}_{t_{0}}|\{\mathbf{x}_{t_{i}},t_{i}\}_{i})

Obtain zt1,zt2,,ztM\mathbf{z}_{t_{1}},\mathbf{z}_{t_{2}},\dots,\mathbf{z}_{t_{M}} by solving ODE ODESolve(zt0,f,θf,t0,,tM)\textnormal{ODESolve}(\mathbf{z}_{t_{0}},f,\theta_{f},t_{0},\dots,t_{M}), where ff is the function defining the gradient dz/dtd\mathbf{z}/dt as a function of z\mathbf{z}

Maximize ELBO=i=1Mlogp(xtizti,θx)+logp(zt0)logq(zt0{xti,ti}i,ϕ)\textnormal{ELBO}=\sum_{i=1}^{M}\log p(\mathbf{x}_{t_{i}}|\mathbf{z}_{t_{i}},\theta_{\mathbf{x}})+\log p(\mathbf{z}_{t_{0}})-\log q(\mathbf{z}_{t_{0}}|\{\mathbf{x}_{t_{i}},t_{i}\}_{i},\phi), where p(zt0)=N(0,1)p(\mathbf{z}_{t_{0}})=\mathcal{N}(0,1)

Appendix F Extra Figures