On Learning Hamiltonian Systems from Data
Tom Bertalan, Felix Dietrich, Igor Mezić, Ioannis G. Kevrekidis
I Introduction
The Koopman operator has also been employed in combination with neural networks to extract conservation laws and special group structures kaiser-2018 ; lusch_deep_2018 . Symmetries in relation to conserved quantities are a well-studied problem in physics hamilton_general_1834 ; noether-1971 ; livio-2012 ; kondor-2018b . A recent thread of research consists of learning physics models from observation data mottaghi_newtonian_2016 , including modeling discrete-time data as observations of a continuous-time dynamical system raissi_inferring_2017 ; raissi_hidden_2018 .
It is informative to study physical systems through their conserved quantities, such as total energy, which can be encoded in a Hamiltonian function. hamilton_general_1834 ; almeida1992 The measure-preserving property of Hamiltonian systems has recently been exploited for transport of densities in Markov-chain Monte-Carlo methods brooks_mcmc_2011 , and variational autoencoders caterini_hamiltonian_2018 ; rezende_variational_2015 . For our purposes, a natural progression of the thread of computational modeling of physics from observations is to represent the Hamiltonian function directly.
Concurrently with this submission, two papers independently addressing similar issues appeared as preprintsgreydanus_hamiltonian_2019 ; toth_hamiltonian_2019 . In the first of thesegreydanus_hamiltonian_2019 , a loss function very similar to parts of our Eqn. (12) was used. The second paper focuses on the transformation of densities generated through the Hamiltonian. It is possible to draw some analogies between the second preprint and the old (non-Hamiltonian) work we mentioned aboveGonzalez-Garcia1998b ; Rico-Martinez1994 ; Rico-Martinez1994b ; Rico-Martinez1993 ; Rico-Martinez1995 ; Rico-Martinez1992a ; rico-martinez_noninvertibility_1993 ; Rico-Martinez2000d , as this newer work also used rollout with a time step templated on a classical numerical integration method (here, symplectic Euler and leapfrog). Both papers also use the pendulum as an example, emphasizing conditions where the system can be well approximated by a linear system: a thin annulus around the stable steady state at where the trajectories are nearly circular.
The remainder of the paper is structured as follows:
We derive data-driven approximations (through two approaches: Gaussian processes and neural networks) of a Hamiltonian function on a given phase space, from time series data. The Hamiltonian functions we consider do not need to be separable as a sum , and in our illustrative example we always work in the fully nonlinear regime of the pendulum.
We build data-driven reconstructions of a phase space from (a) linear and (b) nonlinear, non-symplectic transformations of the original Hamiltonian phase space. The reconstruction then leads to a symplectomorphic copy of the original Hamiltonian system.
We construct a completely data-driven pipeline combining (a) the construction of an appropriate phase space and (b) the approximation of a Hamiltonian function on this new phase space, from nonlinear, high-dimensional observations (e.g. from movies/sequences of movie snapshots).
II General description
Equations (1, 2) can be restated as a partial differential equation for at every :
In the next section, we discuss how to approximate the function from given data points . This involves solving the partial differential Eqn. (4) for . Since these equations determine only up to an additive constant, we assume that we also know the value of at a single point in phase space. This is not a major restriction for the approach, because as well as can be chosen arbitrarily.
III Example: the nonlinear pendulum
As an example, consider the case , and the Hamiltonian
This Hamiltonian forms the basis for the differential equations of the nonlinear pendulum, , or, in first-order form, and . In this section, we numerically solve PDE (4) by approximating the solution using two approaches: Gaussian Processes rasmussen-2005 (§III.1) and neural networks (§III.2).
We model the solution as a Gaussian Process with a Gaussian covariance kernel,
Together with an arbitrary pinning term at a point , the list of known derivatives leads to a linear system of equations, where we write
for the derivative of the given kernel function with respect to its first argument, evaluated at a given point in the first argument and all points in the dataset in the second argument. For each , we thus have a (column) vector of derivative evaluations, which is transposed and stacked into a large matrix to form the full linear system
III.2 Approximation using an artificial neural network
Another possibility for learning the form of using data is to represent the function with an artificial neural networkGoodfellow-et-al-2016 (ANN). We write
where the activation function is nonlinear (except where otherwise indicated, we used ) for (if ) and the identity for . The learnable parameters of this ANN are , and we gather all such learnable parameters from the multiple layers that may be used in one experiment into a parameter vector . If there are no hidden layers (), then we learn an affine transformation . This format provides a surrogate function , where the input is the row vector . (Treating inputs as row vectors and using right-multiplication for weight matrices is convenient, as a whole batch of inputs can be presented as an -by- array.) For all the experiments shown here, this network for computing has two hidden layers of width 16.
Similarly, in the case(s) we need to learn additional transformations \hat{{\color[rgb]{0,0,0}\definecolor[named]{pgfstrokecolor}{rgb}{0,0,0}\pgfsys@color@gray@stroke{0}\pgfsys@color@gray@fill{0}{\mathbf{\theta}}}} and \hat{{\color[rgb]{0,0,0}\definecolor[named]{pgfstrokecolor}{rgb}{0,0,0}\pgfsys@color@gray@stroke{0}\pgfsys@color@gray@fill{0}{\mathbf{\theta}}}}^{-1} (see §IV), they are also learned using such networks.
We collect training data by sampling a number of initial conditions in the rectangle , then simulate short trajectories from each to their final points. For each of these, we additionally evaluate . Shuffling over simulations once per epoch, and dividing this dataset into batches, we then perform batchwise stochastic gradient descent to learn the parameters using an Adam optimizer on the objective function defined below.
For this paper, all neural networks were constructed and trained using TensorFlow, and the gradients necessary for evaluating the Hamiltonian loss terms in Eqn. (12) were computed using TensorFlow’s core automatic differentiation capabilities.
This objective function comprises a scalar function evaluated on each data 4-tuple in the batch, and then averaged over the batch. This scalar function is written as
where the dependence on is through the learned Hamiltonian , the loss-term weights are chosen to emphasize certain properties of the problem thus posed, and the partial derivatives of {\color[rgb]{0,0,0}\definecolor[named]{pgfstrokecolor}{rgb}{0,0,0}\pgfsys@color@gray@stroke{0}\pgfsys@color@gray@fill{0}\frac{\partial\hat{H}}{\partial p}} and {\color[rgb]{0,0,0}\definecolor[named]{pgfstrokecolor}{rgb}{0,0,0}\pgfsys@color@gray@stroke{0}\pgfsys@color@gray@fill{0}\frac{\partial\hat{H}}{\partial q}} are computed explicitly through automatic differentiation. Except for , all values are set to either or depending on whether the associated loss term is to be included or excluded. Because of the square term in Eqn. (5), we set arbitrarily to if nonzero, so the loss is not dominated by . An alternative might be to set to .
As an ablation study, we explored the effect (not shown here) of removing the first, second, and fourth terms. By construction, the true function is zero for all . Note that this is only ever achieved to any degree in the central box in the figure, where data was densely sampled. Removing made no visible difference in the quality of our approximation, which was expected due to the redundancy in the set of equations (1), (2), and (3). However, removing either or gives poor results across the figure, despite the apparent redundancy of these terms with . This might be due to not balancing the contributions of the and terms, for which we attempted to compensate by unequal weighting values and .
IV Estimating Hamiltonian structure from observations
An interesting and important feature common to the next three examples is that one cannot expect to systematically recover the original values from the given observation data . Only symplectic transformations can be recovered, which are enough to define a Hamiltonian. Once the coordinates are fixed, the Hamiltonian function in these coordinates is unique up to an additive constant.
Table I contains training and validation loss of the networks for all experiments. Additionally, we trained the network from §IV.4 with only 331 images, and did observe a significantly higher validation loss (not shown), consistent with overfitting.
The following loss function is used to train an autoencoder component together with a Hamiltonian function approximation network component:
where the dependence on is through the learned Hamiltonian and the learned transformations \hat{{\color[rgb]{0,0,0}\definecolor[named]{pgfstrokecolor}{rgb}{0,0,0}\pgfsys@color@gray@stroke{0}\pgfsys@color@gray@fill{0}{\mathbf{\theta}}}} and \hat{{\color[rgb]{0,0,0}\definecolor[named]{pgfstrokecolor}{rgb}{0,0,0}\pgfsys@color@gray@stroke{0}\pgfsys@color@gray@fill{0}{\mathbf{\theta}}}}^{-1}, and the time derivatives in the space are computed as \dot{\hat{q}}=\dot{x}{\color[rgb]{0,0,0}\definecolor[named]{pgfstrokecolor}{rgb}{0,0,0}\pgfsys@color@gray@stroke{0}\pgfsys@color@gray@fill{0}\frac{\partial\hat{q}}{\partial x}}+\dot{y}{\color[rgb]{0,0,0}\definecolor[named]{pgfstrokecolor}{rgb}{0,0,0}\pgfsys@color@gray@stroke{0}\pgfsys@color@gray@fill{0}\frac{\partial\hat{q}}{\partial y}} and \dot{\hat{p}}=\dot{x}{\color[rgb]{0,0,0}\definecolor[named]{pgfstrokecolor}{rgb}{0,0,0}\pgfsys@color@gray@stroke{0}\pgfsys@color@gray@fill{0}\frac{\partial\hat{p}}{\partial x}}+\dot{y}{\color[rgb]{0,0,0}\definecolor[named]{pgfstrokecolor}{rgb}{0,0,0}\pgfsys@color@gray@stroke{0}\pgfsys@color@gray@fill{0}\frac{\partial\hat{p}}{\partial y}}, using the Jacobian D\hat{{\color[rgb]{0,0,0}\definecolor[named]{pgfstrokecolor}{rgb}{0,0,0}\pgfsys@color@gray@stroke{0}\pgfsys@color@gray@fill{0}{\mathbf{\theta}}}}^{-1}=\left[\begin{array}[]{cc}{\color[rgb]{0,0,0}\definecolor[named]{pgfstrokecolor}{rgb}{0,0,0}\pgfsys@color@gray@stroke{0}\pgfsys@color@gray@fill{0}\frac{\partial\hat{q}}{\partial x}}&{\color[rgb]{0,0,0}\definecolor[named]{pgfstrokecolor}{rgb}{0,0,0}\pgfsys@color@gray@stroke{0}\pgfsys@color@gray@fill{0}\frac{\partial\hat{q}}{\partial y}}\\ {\color[rgb]{0,0,0}\definecolor[named]{pgfstrokecolor}{rgb}{0,0,0}\pgfsys@color@gray@stroke{0}\pgfsys@color@gray@fill{0}\frac{\partial\hat{p}}{\partial x}}&{\color[rgb]{0,0,0}\definecolor[named]{pgfstrokecolor}{rgb}{0,0,0}\pgfsys@color@gray@stroke{0}\pgfsys@color@gray@fill{0}\frac{\partial\hat{p}}{\partial y}}\\ \end{array}\right] of the transformation \hat{{\color[rgb]{0,0,0}\definecolor[named]{pgfstrokecolor}{rgb}{0,0,0}\pgfsys@color@gray@stroke{0}\pgfsys@color@gray@fill{0}{\mathbf{\theta}}}}^{-1} (computed pointwise with automatic differentiation). When we learn an (especially nonlinear) transformation \hat{{\color[rgb]{0,0,0}\definecolor[named]{pgfstrokecolor}{rgb}{0,0,0}\pgfsys@color@gray@stroke{0}\pgfsys@color@gray@fill{0}{\mathbf{\theta}}}}^{-1}, in addition to the Hamiltonian , including the term in the composite loss can have a detrimental effect on the learned transformation. There exists an easily-encountered naive local minimum in which \hat{{\color[rgb]{0,0,0}\definecolor[named]{pgfstrokecolor}{rgb}{0,0,0}\pgfsys@color@gray@stroke{0}\pgfsys@color@gray@fill{0}{\mathbf{\theta}}}}^{-1} maps all of the sampled values from {\color[rgb]{0,0,0}\definecolor[named]{pgfstrokecolor}{rgb}{0,0,0}\pgfsys@color@gray@stroke{0}\pgfsys@color@gray@fill{0}{\mathbf{\theta}}}(E)\ni(x,y) to a single point in , and the Hamiltonian learned is merely the constant function at the pinning value, . In this state (or an approximation of this state), all of {\color[rgb]{0,0,0}\definecolor[named]{pgfstrokecolor}{rgb}{0,0,0}\pgfsys@color@gray@stroke{0}\pgfsys@color@gray@fill{0}\frac{\partial H}{\partial\hat{q}}}, {\color[rgb]{0,0,0}\definecolor[named]{pgfstrokecolor}{rgb}{0,0,0}\pgfsys@color@gray@stroke{0}\pgfsys@color@gray@fill{0}\frac{\partial H}{\partial\hat{p}}}, and the elements of D\hat{{\color[rgb]{0,0,0}\definecolor[named]{pgfstrokecolor}{rgb}{0,0,0}\pgfsys@color@gray@stroke{0}\pgfsys@color@gray@fill{0}{\mathbf{\theta}}}}^{-1} are zero, so the loss (with terms , , , and optionally ) is zero (resp., small). A related failure is that in which the input in {\color[rgb]{0,0,0}\definecolor[named]{pgfstrokecolor}{rgb}{0,0,0}\pgfsys@color@gray@stroke{0}\pgfsys@color@gray@fill{0}{\mathbf{\theta}}}(E) is collapsed by \hat{{\color[rgb]{0,0,0}\definecolor[named]{pgfstrokecolor}{rgb}{0,0,0}\pgfsys@color@gray@stroke{0}\pgfsys@color@gray@fill{0}{\mathbf{\theta}}}}^{-1} to a line or curve in .
To alleviate both of these problems we added a new loss component . That is, we require that the learned transformation not collapse the input. It is sufficient for the corresponding weighting factor to be a very small nonzero value (e.g. ). The addition of to our loss helps us to avoid falling early in training into the unrecoverable local minimum described above, and also helps keep the scale of the transformed variables and macroscopic.
IV.2 Example: linear transformation of the pendulum
We could learn \hat{{\color[rgb]{0,0,0}\definecolor[named]{pgfstrokecolor}{rgb}{0,0,0}\pgfsys@color@gray@stroke{0}\pgfsys@color@gray@fill{0}{\mathbf{\theta}}}}^{-1} from a general class of nonlinear functions, as a small neural network, but here we simply learn and as linear transformations (that is, we have a linear “neural network”, where in Eqn. (11), and is fixed as ). As we include the reconstruction error of this autoencoder in our loss function, is constrained to be the inverse of to a precision no worse than the and terms in Eqn. (14), after all three are scaled by their corresponding values. In fact, for the linear case, initially the autoencoder’s contribution to the loss is significantly lower than the Hamiltonian components (see Fig. 3), but, as training proceeds and the and terms are improved (at the initial expense of raising the autoencoder loss), larger reductions in loss are possible by optimizing rather than \hat{{\color[rgb]{0,0,0}\definecolor[named]{pgfstrokecolor}{rgb}{0,0,0}\pgfsys@color@gray@stroke{0}\pgfsys@color@gray@fill{0}{\mathbf{\theta}}}}, so is decreased as quickly as (the larger of) or . That is, the autoencoder portion of the loss falls quickly to the level where it no longer contributes to the total loss given its weighting in the loss sum.
We find that the learned symplectomorphism {\color[rgb]{0,0,0}\definecolor[named]{pgfstrokecolor}{rgb}{0,0,0}\pgfsys@color@gray@stroke{0}\pgfsys@color@gray@fill{0}S}(q,p)=\hat{A}^{-1}\cdot A\cdot[q,p]^{T}, depicted in its portion in Fig. 3, preserves unmixed with in one or the other of its two discovered coordinates. This is because both (a) as well as (b) for any smooth function are symplectomorphisms. They are special because , i.e. they even preserve the Hamiltonian formulation. For the map (a), the transformation of the Hamiltonian can be seen from the following derivation.
Here, the first equality of (15, 16) follows from the map and the last equality of (15, 16) follows from the requirement that follow Hamiltonian dynamics with respect to the new Hamiltonian . Equations (IV.2,IV.2) then show that the new Hamiltonian is the same as the old one (modulo an additive constant) when considered as a map on the old coordinates.
IV.3 Example: nonlinear transformation of the pendulum
In addition to the linear {\color[rgb]{0,0,0}\definecolor[named]{pgfstrokecolor}{rgb}{0,0,0}\pgfsys@color@gray@stroke{0}\pgfsys@color@gray@fill{0}{\mathbf{\theta}}} of §IV.2, we show comparable results for a nonlinear transformation {\color[rgb]{0,0,0}\definecolor[named]{pgfstrokecolor}{rgb}{0,0,0}\pgfsys@color@gray@stroke{0}\pgfsys@color@gray@fill{0}{\mathbf{\theta}}} and learned \hat{{\color[rgb]{0,0,0}\definecolor[named]{pgfstrokecolor}{rgb}{0,0,0}\pgfsys@color@gray@stroke{0}\pgfsys@color@gray@fill{0}{\mathbf{\theta}}}}. Specifically, we transform the data through (x,y)={\color[rgb]{0,0,0}\definecolor[named]{pgfstrokecolor}{rgb}{0,0,0}\pgfsys@color@gray@stroke{0}\pgfsys@color@gray@fill{0}{\mathbf{\theta}}}(q,p) where
the inverse of which is given by and . We use the analytical Jacobian of this {\color[rgb]{0,0,0}\definecolor[named]{pgfstrokecolor}{rgb}{0,0,0}\pgfsys@color@gray@stroke{0}\pgfsys@color@gray@fill{0}{\mathbf{\theta}}} to compute the necessary and for input to our network.
We proceed as before, except that we no longer restrict the form of the learned \hat{{\color[rgb]{0,0,0}\definecolor[named]{pgfstrokecolor}{rgb}{0,0,0}\pgfsys@color@gray@stroke{0}\pgfsys@color@gray@fill{0}{\mathbf{\theta}}}} and \hat{{\color[rgb]{0,0,0}\definecolor[named]{pgfstrokecolor}{rgb}{0,0,0}\pgfsys@color@gray@stroke{0}\pgfsys@color@gray@fill{0}{\mathbf{\theta}}}}^{-1} to linear transformations, but instead allow small multi-layer perceptrons of a form similar to that used for .
The resulting induced symplectomorphism is again one which appears to preserve an approximately monotonically increasing or decreasing in either or . This can be seen in Fig. 4.
IV.4 Example: constructing a Hamiltonian system from nonlinear, high-dimensional observations of q,p𝑞𝑝q,p
As a further demonstration of the method, we use a graphical rendering of the moving pendulum example from before as the transformation {\color[rgb]{0,0,0}\definecolor[named]{pgfstrokecolor}{rgb}{0,0,0}\pgfsys@color@gray@stroke{0}\pgfsys@color@gray@fill{0}{\mathbf{\theta}}} from the intrinsic state to an image as our high-dimensional observable. We use a symplectic semi-implicit Euler’s method
to generate trajectories for various initial conditions, and then a simple graphical renderer to display these as images (see Fig. 5). When rendering our video frames, we drag a tail of decaying images behind the moving pendulum head, so that information about both position and velocity is present in each rendered frame. This is done by iterating over each trajectory, and, for each , (1) multiplying the entire current image by a cooling factor of , (2) adding a constant heating amount to the image in a circle of fixed radius centered around the current point, and (3) clipping the image per-pixel to lie within $$. Samples of the resulting images are visible in Fig. 5.
Though we do not use directly in this procedure, is value is observed in the length of the resulting tail dragged behind the moving pendulum head. We create trajectories long enough that the effect of the initial formation of the tail (during which its length is not necessarily a good indicator of ) is not visible any more, and then use only the final two observations from these trajectories.
In order to make the approach agnostic to the data, we do not want to assume that the space is periodic, so instead, we use a four-dimensional phase space with elements {\color[rgb]{0,0,0}\definecolor[named]{pgfstrokecolor}{rgb}{0,0,0}\pgfsys@color@gray@stroke{0}\pgfsys@color@gray@fill{0}\hat{{\mathbf{z}}}}=[\hat{q}_{1},\hat{q}_{2},\hat{p}_{1},\hat{p}_{2}]=[{\color[rgb]{0,0,0}\definecolor[named]{pgfstrokecolor}{rgb}{0,0,0}\pgfsys@color@gray@stroke{0}\pgfsys@color@gray@fill{0}\hat{{\mathbf{q}}}},{\color[rgb]{0,0,0}\definecolor[named]{pgfstrokecolor}{rgb}{0,0,0}\pgfsys@color@gray@stroke{0}\pgfsys@color@gray@fill{0}\hat{{\mathbf{p}}}}] and consider the splitting into and during training. In the space of input images, the manifold does not fill up four-dimensional space, but a cylinder, which is mapped to the four-dimensional encoding layer by the autoencoder.
In addition, to simplify the learning problem, we learn \hat{{\color[rgb]{0,0,0}\definecolor[named]{pgfstrokecolor}{rgb}{0,0,0}\pgfsys@color@gray@stroke{0}\pgfsys@color@gray@fill{0}{\mathbf{\theta}}}}^{-1} as the combination of a projection onto the first twenty principal components of the training dataset followed by a dense autoencoder, reserving learning \hat{{\color[rgb]{0,0,0}\definecolor[named]{pgfstrokecolor}{rgb}{0,0,0}\pgfsys@color@gray@stroke{0}\pgfsys@color@gray@fill{0}{\mathbf{\theta}}}}^{-1} as an end-to-end convolutional autoencoder for future work. The encoding provides {\color[rgb]{0,0,0}\definecolor[named]{pgfstrokecolor}{rgb}{0,0,0}\pgfsys@color@gray@stroke{0}\pgfsys@color@gray@fill{0}\hat{{\mathbf{z}}}} and, as before, we learn \hat{{\color[rgb]{0,0,0}\definecolor[named]{pgfstrokecolor}{rgb}{0,0,0}\pgfsys@color@gray@stroke{0}\pgfsys@color@gray@fill{0}{\mathbf{\theta}}}}^{-1} in tandem with , where now the conditions of Eqn. (14) are upgraded to vector equivalents to accommodate {\color[rgb]{0,0,0}\definecolor[named]{pgfstrokecolor}{rgb}{0,0,0}\pgfsys@color@gray@stroke{0}\pgfsys@color@gray@fill{0}\hat{{\mathbf{z}}}}.
In §IV.1, we added a loss term proportional to the reciprocal of the determinant of the transformation’s Jacobian in order to avoid transformations that collapsed the phase space. Here, this was not such an issue–of course, some collapsing of the high-dimensional representation is obviously required. Instead, a common mode of failure turned out to be learning constant functions, which automatically satisfy the Hamiltonian requirements (a constant is naturally a conserved quantity). To avoid this, we considered several possible ways to promote a non-flat function, ultimately settling on (a) adding a term that encouraged the standard deviation of values to be nonzero, and (b) minimizing not just the mean squared error in our and terms, but also the max squared error, to avoid trivial or bi-level functions. The spherical Gaussian prior used in training variational autoencoders toth_hamiltonian_2019 also avoids, as a byproduct, learning constant functions.
V Conclusions
We described an approach to approximate Hamiltonian systems from observation data. It is a completely data-driven pipeline to (a) construct an appropriate phase space and (b) approximate a Hamiltonian function on the new phase space, from nonlinear, possibly high-dimensional observations (here, movies).
When only transformations of the original Hamiltonian phase space can be observed, it is only possible to recover a symplectic copy of the original phase space, with additional freedom in a constant scaling of the coordinates, and an additive constant to the Hamiltonian function. If no additional information about the original space is available, this is a fundamental limitation, and not particular to our approach. It is not necessary that there is an “original phase space” at all, and so the resulting symplectic phase space is by no means unique. The choice of a single such space has to rely on other factors, such as, possibly, interpretability by humans, or simplicity of the equations.
The approach may be extended to time-dependent Hamiltonian functions. This would allow us to cope with certain dissipative systemsmcdonald_hamiltonian_nodate . An even broader extension may allow transformations to arbitrary normal forms as the “target vector field”, and thus would not be constrained to Hamiltonian systems. In the general case, it will become important to explore whether the transformation we approximate remains bounded over our data, or whether it starts showing signs of approaching a singularity, suggesting that the problem may not be solvable.