Hamiltonian Generative Networks

Peter Toth, Danilo Jimenez Rezende, Andrew Jaegle, Sébastien Racanière, Aleksandar Botev, Irina Higgins

Introduction

Any system capable of a wide range of intelligent behaviours within a dynamic environment requires a good predictive model of the environment’s dynamics. This is true for intelligence in both biological (Friston, 2009; 2010; Clark, 2013) and artificial (Hafner et al., 2019; Battaglia et al., 2013; Watter et al., 2015; Watters et al., 2019) systems. Predicting environmental dynamics is also of fundamental importance in physics, where Hamiltonian dynamics and the structure-preserving transformations it provides have been used to unify, categorise and discover new physical entities (Noether, 1915; Livio, 2012).

Hamilton’s fundamental result was a system of two first-order differential equations that, in a stroke, unified the predictions made by prior Newtonian and Lagrangian mechanics (Hamilton, 1834). After well over a century of development, it has proven to be essential for parsimonious descriptions of nearly all of physics. Hamilton’s equations provide a way to predict a system’s future behavior from its current state in phase space (that is, its position and momentum for classical Newtonian systems, and its generalized position and momentum more broadly). Hamiltonian mechanics induce dynamics with several nice properties: they are smooth, they include paths along which certain physical quantities are conserved (symmetries) and their time evolution is fully reversible. These properties are also useful for machine learning systems. For example, capturing the time-reversible dynamics of the world state might be useful for agents attempting to account for how their actions led to effects in the world; recovering an abstract low-dimensional manifold with paths that conserve various properties is tightly connected to outstanding problems in representation learning (see e.g. Higgins et al. (2018) for more discussion); and the ability to conserve energy is related to expressive density modelling in generative approaches (Rezende & Mohamed, 2015). Hence, we propose a reformulation of the well-known image manifold hypothesis by extending it with a Hamiltonian assumption (illustrated in Fig. 1): natural images lie on a low-dimensional manifold embedded within a high-dimensional pixel space and natural sequences of images trace out paths on this manifold that follow the equations of an abstract Hamiltonian.

Given the rich set of established tools provided by Hamiltonian descriptions of system dynamics, can we adapt these to solve outstanding machine learning problems? When it comes to adapting the Hamiltonian formalism to contemporary machine learning, two questions need to be addressed: 1) how should a system’s Hamiltonian be learned from data; and 2) how should a system’s abstract phase space be inferred from the high-dimensional observations typically available to machine learning systems? Note that the inferred state may need to include information about properties that play no physical role in classical mechanics but which can still affect their behavior or function, like the colour or shape of an object. The first question was recently addressed by the Hamiltonian Neural Network (HNN) (Greydanus et al., 2019) approach, which was able to learn the Hamiltonian of three simple physical systems from noisy phase space observations. However, to address the second question, HNN makes assumptions that restrict it to Newtonian systems and appear to limit its ability to scale to more challenging video datasets.

In this paper we introduce the first model that answers both of these questions without relying on restrictive domain assumptions. Our model, the Hamiltonian Generative Network (HGN), is a generative model that infers the abstract state from pixels and then unrolls the learned Hamiltonian following the Hamiltonian equations (Goldstein, 1980). We demonstrate that HGN is able to reliably learn the Hamiltonian dynamics from noisy pixel observations on four simulated physical systems: a pendulum, a mass-spring and two- and three- body systems. Our approach outperforms HNN by a significant margin. After training, we demonstrate that HGN produces meaningful samples with reversible dynamics and that the speed of rollouts can be controlled by changing the time derivative of the integrator at test time. Finally, we show that a small modification of our architecture yields a flexible, normalising flow-based generative model that respects Hamiltonian dynamics. We show that this model, which we call Neural Hamiltonian Flow (NHF), inherits the beneficial properties of the Hamiltonian formalism (including volume preservation) and is capable of expressive density modelling, while offering computational benefits over standard flow-based models.

Related work

Most machine learning approaches to modeling dynamics use discrete time steps, which often results in an accumulation of the approximation errors when producing rollouts and, therefore, to a fast drop in accuracy. Our approach, on the other hand, does not discretise continuous dynamics and models them directly using the Hamiltonian differential equations, which leads to slower divergence for longer rollouts. The density model version of HGN (NHF) uses the Hamiltonian dynamics as normalising flows along with a numerical integrator, making our approach somewhat related to the recently published neural ODE work (Chen et al., 2018; Grathwohl et al., 2018). What makes our approach different is that Hamiltonian dynamics are both invertible and volume-preserving (as discussed in Sec. 3.3), which makes our approach computationally cheaper than the alternatives and more suitable as a model of physical systems and other processes that have these properties. Also related is recent work attempting to learn a model of physical system dynamics end-to-end from image sequences using an autoencoder (de Avila Belbute-Peres et al., 2018). Unlike our work, this model does not exploit Hamiltonian dynamics and is trained in a supervised or semi-supervised regime.

One of the most comparable approaches to ours is the Hamiltonian Neural Network (HNN) (Greydanus et al., 2019). This work, done concurrently to ours, proposes a way to learn Hamiltonian dynamics from data by training the gradients of a neural network (obtained by backpropagation) to match the time derivative of a target system in a supervised fashion. In particular, HNN learns a differentiable function $\mathcal{H}(q,p)$ that maps a system’s state (its position, $q$ , and momentum, $p$ ) to a scalar quantity interpreted as the system’s Hamiltonian. This model is trained so that $H(p,q)$ satisfies the Hamiltonian equation by minimizing

where the derivatives $\frac{\partial\mathcal{H}}{\partial q}$ and $\frac{\partial\mathcal{H}}{\partial p}$ are computed by backpropagation. Hence, this learning procedure is most directly applicable when the true state space (in canonical coordinates) and its time derivatives are known. Accordingly, in the majority of the experiments presented by the authors, the Hamiltonian was learned from the ground truth state space directly, rather than from pixel observations.

This assumption is appropriate in the case of the simple pendulum system presented in the paper, however it does not hold more generally. Note that our approach does not make any assumptions on the dimensionality of the learned phase space, or the form of the momenta coordinates, which makes our approach more general and allows it to perform well on a wider range of image domains as presented in Sec. 4.

Methods

2 Learning Hamiltonians with the Hamiltonian Generative Network

Our goal is to build a model that can learn a Hamiltonian from observations. We assume that the data $X=\{({\bm{x}}_{0}^{1},...,{\bm{x}}_{T}^{1}),...,({\bm{x}}_{0}^{K},...,{\bm{x}}_{T}^{K})\}$ comes in the form of high-dimensional noisy observations, where each ${\bm{x}}_{i}=\text{G}({\bm{s}}_{i})=\text{G}({\bm{q}}_{i})$ is a non-deterministic function of the generalised position in the phase space, and the full state is a non-deterministic function of a sequence of images ${\bm{s}}_{i}=({\bm{q}}_{i},{\bm{p}}_{i})=\text{D}({\bm{x}}_{0}^{i},...,{\bm{x}}_{t}^{i})$ , since the momentum (and hence the full state) cannot in general be recovered from a single observation. Our goal is to infer the abstract state and learn the Hamiltonian dynamics in phase space by observing $K$ motion sequences, discretised into $T+1$ time steps each. In the process, we also want to learn an approximation to the generative process $\text{G}({\bm{s}})$ in order to be able to move in both directions between the high dimensional observations and the low-dimensional abstract phase space.

Although the Hamiltonian formalism is general and does not depend on the form of the observations, we present our model in terms of visual observations, since many known physical Hamiltonian systems, like a mass-spring system, can be easily observed visually. In this section we introduce the Hamiltonian Generative Network (HGN), a generative model that is trained to behave according to the Hamiltonian dynamics in an abstract phase space learned from raw observations of image sequences. HGN consists of three parts (see Fig. 2): an inference network, a Hamiltonian network and a decoder network, which are discussed next.

The Hamiltonian network

The decoder network

is a standard deconvolutional network (we use the architecture from Karras et al. (2018)) that takes in a low-dimensional representation vector and produces a high-dimensional pixel reconstruction. Given that each instantaneous image does not depend on the momentum information, we restrict the decoder to take only the position coordinates of the abstract state as input: $p_{\theta}({\bm{x}}_{t})=d_{\theta}({\bm{q}}_{t})$ .

The objective function.

Given a sequence of $T+1$ images, HGN is trained to optimise the following objective:

which can be seen as a temporally extended variational autoencoder (VAE) (Kingma & Welling, 2014; Rezende et al., 2014) objective, consisting of a reconstruction term for each frame, and an additional term that encourages the inferred posterior to match a prior. The key difference with a standard VAE lies in how we generate rollouts – these are produced using the Hamiltonian equations of motion in learned Hamiltonian phase space.

3 Learning Hamiltonian Flows

In this section, we describe how the architecture described above can be modified to produce a model for flexible density estimation. Learning computationally feasible and accurate estimates of complex densities is an open problem in generative modelling. A common idea to address this problem is to start with a simple prior distribution $\pi({\bm{u}})$ and then transform it into a more expressive form $p({\bm{x}})$ through a series of composable invertible transformations $f_{i}({\bm{u}})$ called normalising flows (Rezende & Mohamed, 2015) (see Fig. 3A). Sampling can then be done according to ${\bm{x}}=f_{T}\circ\ldots\circ f_{1}({\bm{u}})$ , where ${\bm{u}}\sim\pi(\cdot)$ . Density evaluation, however, requires more expensive computations of both inverting the flows and calculating the determinants of their Jacobians. For a single flow step, this equates to the following: $p({\bm{x}})=\pi(f^{-1}({\bm{x}}))\left|\text{det}\left(\frac{\partial f^{-1}}{\partial{\bm{x}}}\right)\right|$ . While a lot of work has been done recently into proposing better alternatives for flow-based generative models in machine learning (Rezende & Mohamed, 2015; Kingma et al., 2016; Papamakarios et al., 2017; Dinh et al., 2017; Huang et al., 2018; Kingma & Dhariwal, 2018; Hoogeboom et al., 2019; Chen et al., 2019; Grathwohl et al., 2018), none of the approaches manage to produce both sampling and density evaluation steps that are computationally scalable.

The two requirements for normalising flows are that they are invertible and volume preserving, which are exactly the two properties that Hamiltonian dynamics possess. This can be seen by computing the determinant of the Jacobian of the infinitesimal transformation induced by the Hamiltonian $\mathcal{H}$ . By Jacobi’s formula, the derivative of the determinant at the identity is the trace and:

Our proposed NHF is more computationally efficient that many other flow-based approaches, because it does not require the expensive step of calculating the trace of the Jacobian. Hence, the NHF model constitutes a more structured form of a Neural ODE flow (Chen et al., 2018), but with a few notable differences: (i) The Hamiltonian ODE is volume-preserving, which makes the computation of log-likelihood cheaper than for a general ODE flow. (ii) General ODE flows are only invertible in the limit $dt\rightarrow 0$ , whereas for some Hamiltonians we can use more complex integrators (like the symplectic leapfrog integrator described in Sec. A.6) that are both invertible and volume-preserving for any $dt>0$ .

The structure ${\bm{s}}=({\bm{q}},{\bm{p}})$ on the state-space imposed by the Hamiltonian dynamics can be constraining from the point of view of density estimation. We choose to use the trick proposed in the Hamiltonian Monte Carlo (HMC) literature (Neal et al., 2011; Salimans et al., 2015; Levy et al., 2017), which treats the momentum ${\bm{p}}$ as a latent variable (see Fig. 4). This is an elegant solution which avoids having to artificially split the density into two disjoint sets. As a result, the data density that our Hamiltonian flows are modelling becomes exclusively parametrised by $p({\bm{q}}_{T})$ , which takes the following form: $p({\bm{q}}_{T})=\int p({\bm{q}}_{T},{\bm{p}}_{T})d{\bm{p}}_{T}=\int\pi(\mathcal{H}_{1}^{-dt}\circ\ldots\circ\mathcal{H}_{T}^{-dt}({\bm{q}}_{T},{\bm{p}}_{T}))d{\bm{p}}_{T}$ . This integral is intractable, but the model can still be trained via variational methods where we introduce a variational density $f_{\psi}({\bm{p}}_{T}|{\bm{q}}_{T})$ with parameters $\psi$ and instead optimise the following ELBO:

Note that, in contrast to VAEs (Kingma & Welling, 2014; Rezende et al., 2014), the ELBO in Eq. 10 is not explicitly in the form of a reconstruction error term plus a KL term.

Results

In order to directly compare the performance of HGN to that of its closest baseline, HNN, we generated four datasets analogous to the data used in Greydanus et al. (2019). The datasets contained observations of the time evolution of four physical systems: mass-spring, pendulum, two- and three-body (see Fig. 5). In order to generate each trajectory, we first randomly sampled an initial state, then produced a 30 step rollout following the ground truth Hamiltonian dynamics, before adding Gaussian noise with standard deviation $\sigma^{2}=0.1$ to each phase-space coordinate, and rendering a corresponding 64x64 pixel observation. We generated 50 000 train and 10 000 test trajectories for each dataset. When sampling initial states, we start by first sampling the total energy of the system denoted as a radius $r$ in the phase space, before sampling the initial state $(q,p)$ uniformly on the circle of radius $r$ . Note that our pendulum dataset is more challenging than the one described in Greydanus et al. (2019), where the pendulum had a fixed radius and was initialized at a maximum angle of $30^{\circ}$ from the central axis.

Pendulum.

Two- and three- body problems.

Learning the Hamiltonian

We tested whether HGN and the HNN baseline could learn the dynamics of the four systems described above. To ensure that our re-implementation of HNN was correct, we replicated all the results presented in the original paper (Greydanus et al., 2019) by verifying that it could learn the dynamics of the mass-spring, pendulum and two-body systems well from the ground truth state, and the dynamics of a restricted pendulum from pixels. We also compared different modifications of HGN: a version trained and tested with an Euler rather than a leapfrog integrator (HGN Euler), a version trained with no additional function between the posterior and the prior (HGN no $f_{\psi}$ ) and a deterministic version (HGN determ), which did not include the sampling step from the posterior $q_{\phi}({\bm{z}}|{\bm{x}}_{0}...{\bm{x}}_{T})$ .

Tbl. 1 and Fig. 6 demonstrate that HGN and its modifications learned well on all four datasets. However, when we attempted to train HNN on the four datasets described above, its Hamiltonian often collapsed to 0 and the model failed to reproduce any dynamics, defaulting to a static single image. We were unable to improve on this performance despite our best efforts, including a modification of the architecture to closely match ours (referred to as HNN Conv) (see Sec. A.3 of the appendix for details). Tbl. 1 shows that the average mean squared error (MSE) of the pixel reconstructions on both the train and test data is an order of magnitude better for HGN compared to both versions of HNN. The same holds when visualising the average per-frame MSE of a single train and test rollout for each dataset shown in Fig. 6.

Note that the different versions of HGN have different trade-offs. The deterministic version produces more accurate reconstructions but it does not allow sampling. This effect is equivalent to a similar distinction between autoencoders (Hinton & Salakhutdinov, 2006) and VAEs. Using the simpler Euler integrator rather than the more involved leapfrog one might be conceptually more appealing, however it does not provide the same energy conservation and reversibility properties as the leapfrog integrator, as evidenced by the increase by an order of magnitude of the variance of the learned Hamiltonian throughout a sequence rollout as shown in Tbl. 2. The full version of HGN, on the other hand, is capable of reproducing the dynamics well, is capable of producing diverse yet plausible rollout samples (Fig. 8) and its rollouts can be reversed in time, sped up or slowed down by either changing the value or the sign of $dt$ used in the integrator (Fig. 7).

Expressive density modelling using learned Hamiltonian flows

We evaluate whether NHF is capable of expressive density modelling by stacking learned Hamiltonians into a series of normalising flows. Fig. 9 demonstrates that NHF can transform a simple soft-uniform prior distribution $\pi({\bm{s}}_{0};\sigma,\beta)$ into significantly more complex densities with arbitrary numbers of modes. The soft-uniform density, $\pi({\bm{s}}_{0};\sigma,\beta)\propto f(\beta(s+\sigma\frac{1}{2}))f(-\beta(s-\sigma\frac{1}{2}))$ , where $f$ is the sigmoid function and $\beta$ is a constant, was chosen to make it easier to visualise the learned attractors. The model also performed well with other priors, including a Normal distribution. It is interesting to note that the trained model is very interpretable. When decomposed into the equivalents of the kinetic and potential energies, it can be seen that the learned potential energy $V({\bm{q}})$ learned to have several local minima, one for each mode of the data. As a consequence, the trajectory of the initial samples through the flow has attractors at the modes of the data. We have also compared the performance of NHF to that of the RNVP baseline (Dinh et al., 2017). Fig. 10 shows that the two approaches are comparable in their performance, but NHF is more computationally efficient as discussed at the end of Sec. 3.3.

Conclusions

We have presented HGN, the first deep learning approach capable of reliably learning Hamiltonian dynamics from pixel observations. We have evaluated our approach on four classical physical systems and demonstrated that it outperformed the only relevant baseline by a large margin. Hamiltonian dynamics have a number of useful properties that can be exploited more widely by the machine learning community. For example, the Hamiltonian induces a smooth manifold and a vector field in the abstract phase space along which certain physical quantities are conserved. The time evolution along these paths is also completely reversible. These properties can have wide implications in such areas of machine learning as reinforcement learning, representation learning and generative modelling. We have demonstrated the first step towards applying the learnt Hamiltonian dynamics as normalising flows for expressive yet computationally efficient density modelling. We hope that this work serves as the first step towards a wider adoption of the rich body of physics literature around the Hamiltonian principles in the machine learning community.

We thank Alvaro Sanchez for useful feedback.

References

Appendix A Supplementary Materials

The Hamiltonian Generative Network (HGN) consists of three major parts, an encoder, the Hamiltonian transition network and a decoder. During training the encoder starts with a sequence

of raw training images and encodes it into a probabilistic prior representation transformed with an additional network on top

consisting of a downsized spatial representation in latent space ( $4\times 4$ ), where each abstract pixel is the the concatenation of abstract position ( ${\bm{q}}$ ) and momentum ( ${\bm{p}}$ ) (each of dimension 16). The encoder network is a convolutional neural network with 8 layers, with 32 filters on the first layer, then 64 filters on each subsequent layer, while in the last layer we have 48 filters. The final encoder transformer network is a convolutional neural network with 3 layers and 64 filters on each layer. Starting from this initial embedded state, the Hamiltonian transition network generates subsequent states using a symplectic integrator approximating the Hamiltonian equations. The Hamiltonian transition network represents the Hamiltonian function

as a function from the abstract position and momentum space to the real numbers at any time step $t$ . The Hamiltonian transition network is a convolutional neural network of 6 layers, each consisting of 64 filters. The discrete timestep we use for the symplectic integrator update step is $dt=0.125$ .

The decoder network is a progressive network consisting of 3 residual blocks, where each residual block resizes the current input image by a factor of 2 using the nearest neighbor method (at the end we have to upscale our latent spatial dimension of 4 to the desired output image dimension of 32 in these steps), followed by 2 blocks of a one layer convolutional neural network with 64 filters and a leaky ReLU activation function, closing by a sigmoid activation in each block. After the 3 blocks a final one layer convolutional neural network outputs the output image with the right number of channels.

We use Adam optimisier (Kingma & Ba, 2014) with learning rate 1.5e-4. When optimising the loss, in practice we do not learn the variance of the decoder $p_{\theta}({\bm{x}}|{\bm{s}})$ and fix it to 1, which makes the reconstruction objective equivalent to a scaled L2 loss. Furthermore, we introduce a Lagrange multiplier in front of the KL term and optimise it using the same method as in Rezende & Viola (2018).

A.2 Neural Hamiltonian Flow

For all NHF experiments the Hamiltonian was of the form $\mathcal{H}(q,p)=K(p)+V(q)$ . The kinetic energy term $K$ and the potential energy term $V$ are soft-plus MLPs with layer-sizes $[d,128,128,1]$ where $d$ is the dimension of the data. Soft-plus non-linearities were chosen because the optimisation of Hamiltonian flows involves second-order derivatives of the MLPs used for parametrising the Hamiltonians. This makes ReLU non-linearities unsuitable. The encoder network was parametrized as $f_{\psi}({\bm{p}}|{\bm{q}})=\mathcal{N}({\bm{p}};\mu({\bm{q}}),\sigma({\bm{q}}))$ , where $\mu$ and $\sigma$ are ReLU MLPs with size $[d,128,128,d]$ . The Hamiltonian flow $\mathcal{H}^{dt}$ , was approximated using a leapfrog integrator (Neal et al., 2011) since it preserves volume and is invertible for any $dt$ (see also section A.6). We found that only two leapfrog steps where sufficient for all our examples. Parameters were optimised using Adam (Kingma & Ba, 2014) (learning rate 3e-4) and Lagrange multipliers were optimised using the same method as in Rezende & Viola (2018). All shown kernel density estimate (KDE) plots used 1000 samples and isotropic Gaussian kernel bandwidth of $0.3$ . For the RNVP baseline (Dinh et al., 2017) we used alternating masks with two or three layers. Each RNVP layer used an affine coupling parametrized by a two-layer relu-MLP that matched those used in the leapfrog.

A.3 Hamiltonian Neural Network

The Hamiltonian Neural Network (HNN) (Greydanus et al., 2019) learns a differentiable function $\mathcal{H}({\bm{q}},{\bm{p}})$ that maps a system’s state in phase space (its position ${\bm{q}}$ and momentum ${\bm{p}}$ ) to a scalar quantity interpreted as the system’s Hamiltonian. This model is trained so that $\mathcal{H}({\bm{q}},{\bm{p}})$ satisfies the Hamiltonian equation by minimizing

where the derivatives $\frac{\partial\mathcal{H}}{\partial{\bm{q}}}$ and $\frac{\partial\mathcal{H}}{\partial{\bm{p}}}$ are computed by backpropagation. In the original paper, these targets are either assumed to be known or are estimated by finite differences using the state at times $t$ and $t+1$ . Accordingly, in the majority of the experiments presented by the authors, the Hamiltonian was learned from the ground truth state space directly, rather than from pixel observations.

The original HNN model is trained in a supervised fashion on the ground truth state of a physical system and its time derivatives. As such, it is not directly comparable to our method, which learns a Hamiltonian directly from pixels. Instead, we compare to the PixelHNN variant of the HNN, which is introduced in the same paper, and which is able to learn a Hamiltonian from images and in the absence of true state or time derivative in some settings.

This loss is motivated by the observation that in simple Newtonian systems with unit mass, the system’s state is fully described by the position and its time derivative (the system’s velocity). An image embedding that corresponds to the position and velocity of the system will minimize this loss. This assumption is appropriate in the case of the simple pendulum system presented in the paper, however it does not hold more generally.

As mentioned earlier, PixelHNN takes as input a concatenated, flattened pair of images and maps them to an embedding space ${\bm{z}}_{t}=({\bm{q}}_{t},{\bm{p}}_{t})$ , which is treated as an estimate of the position and momentum of the system depicted in the images. Note that $X_{t}$ always consists of two images in order to make the momentum observable. This embedding space is used as the input to an HNN, which is trained to learn the Hamiltonian of the system as before, but using the embedding instead of the true system state.

To enable stable learning in this configuration, the PixelHNN uses a standard mean-squared error autoencoding loss:

where $\hat{X}_{t}$ is the autoencoder output and $X^{i}_{t}$ is the value of pixel $i$ of $N$ total pixels in $X_{t}$ . In our experiments, which use RGB images, this expectation is taken over color channels as well as pixels. This loss encourages the network embedding to reflect the content of the input images and to avoid the trivial solution to (16).

where $\mathcal{L}_{\text{HNN}}$ is computed using the finite difference estimate of the time derivative of the embedding. $\lambda_{\text{HNN}}$ is a Lagrange multiplier, which is set to 0.1, as in the original paper. $\mathcal{L}_{\text{WD}}$ is a standard L2 weight decay and its Lagrange multiplier $\lambda_{\text{WD}}$ is set to 1e-5, as in the original paper.

In the experiments presented here, we reimplemented the PixelHNN architecture as described in Greydanus et al. (2019) and trained it using the full loss (19). As in the original paper, we used a PixelHNN with HNN, encoder, and decoder subnetworks each parameterized by a multi-layer perceptron (MLP). The encoder and decoder MLPs use ReLU nonlinearities. Each consists of 4 layers, with 200 units in each hidden layer and an embedding of the same size as the true position and momentum of the system depicted (2 for mass-spring and pendulum, 8 for two-body, and 12 for 3-body). The HNN MLP uses tanh nonlinearities and consists of two hidden layers with 200 units and a one-dimensional output.

To ensure the difference in performance between the PixelHNN and HGN are not due primarily to archiectural choices, we also compare to a variant of the PixelHNN architecture using the same convolutional encoder and decoder as used in HGN. We used identical hyperparameters to those described in section A.1. We map between the convolutional latent space used by the encoder and decoder and the vector-valued latent required by the HNN using one additional linear layer for the encoder and decoder.

In the original paper, the PixelHNN model is trained using full-batch gradient descent. To make it more comparable to our approach, we train it here using stochastic gradient descent using minibatches of size 64 and around 15000 training steps. As in the original paper, we train the model using the Adam optimizer (Kingma & Ba, 2014) and a learning rate of 1e-3. As in the original paper, we produce rollouts of the model using a Runge-Kutta integrator (RK4). See Section A.6 for a description of RK4. Note that, as in the original paper, we use the more sophisticated algorithm implemented in scipy (scipy.integrate.solve_ivp) (Jones et al., 2001).

A.4 Datasets

The datasets for the experiments described in 4 were generated in a similar manner to Greydanus et al. (2019) for comparative purposes. All of the datasets simulate the exact Hamiltonian dynamics of the underlying differential equation using the default scipy initial value problem solver Jones et al. (2001). After creating a dataset of trajectories for each system, we render those into a sequence of images. The system depicted in each dataset can be visualized by rendering circular objects:

For the mass-spring the mass object is rendered as a circle and the spring and pivot are invisible.

For the pendulum only the weight (the bob) is rendered as a circle, while the massless rod and pivot are invisible.

For the two and three body problem we render each point mass as a circle in a different color.

Additionally, we smooth out the circles such that they do not have hard edges, as can be seen in Fig. 7.

A.5 Convergence of HGN and HNN

In order to obtain the results presented in Tbl. 1 we trained both HGN and HNN for 15000 iterations, with batch size of 16 for HGN, and 64 for the HNN. Given the dataset sizes, this means that HGN was trained for around 5 epochs and HNN was trained for around 19 epochs, which took around 16 hours. Figs. 11-12 plot the convergence rates for HGN (leapfrog) and HNN (conv) on the four datasets.

A.6 Integrators

Throughout this paper, we estimate the future state of systems from inferred values of the system position and momentum by numerically integrated the Hamiltonian. We explore three methods of numerical integration: (i) Euler integration, (ii) Runge-Kutta integration and (iii) leapfrog integration.

Euler integration estimates the value of a function at time $t+dt$ by incrementing the function’s value with the value accumulated by the function’s derivative, assuming it stays constant in the interval $[t,t+dt]$ . In the Hamiltonian framework, Euler integration takes the form:

Because Euler integration estimates a function’s future value by extrapolating along its first derivative, the method ignores the contribution of higher-order derivatives to the function’s change in time. Accordingly, while Euler integration can reasonably estimate a function’s value over short periods, its errors accumulate rapidly as it is integrated over longer periods or when it is applied multiple times. This limitation motivates the use of methods that are stable over more steps and longer integration times.

One such method is four-step Runge-Kutta integration (RK4), the most widely used member of the Runge-Kutta family of integrators. Whereas Euler integration estimates the value of a function at time $t+dt$ using only the function’s derivative at time $t$ , RK4 accumulates multiple estimates of the function’s value in the interval $[t,t+dt]$ . This integral more correctly reflects the behavior of the function in the interval, resulting in a more stable estimate of the function’s value.

RK4 estimates the state at time $t+dt$ as:

where, for compactness, we write the full system state as ${\bm{x}}_{t}=[{\bm{q}}_{t},{\bm{p}}_{t}]$ and the Hamiltonian as $\mathcal{H}({\bm{x}})$ . In the Hamiltonian framework, ${\bm{k}}_{1},{\bm{k}}_{2},{\bm{k}}_{3},{\bm{k}}_{4}$ are obtained by evaluating the derivative at four points in the interval $[t,t+dt]$ :

rCl k_1 & = dH(x)dt|_x=x_t k_2 = dH(x)dt|_x=x_t + dt2k_1 k_3 = dH(x)dt|_x=x_t + dt2k_2 k_4 = dH(x)dt|_x=x_t + dt ⋅k_3

While RK4 may produce reasonably stable estimates over short periods of time, it is not guaranteed to behave stably indefinitely. Neither RK4 nor Euler integration is guaranteed to preserve the energy of the system being integrated, and in practice both will produce estimates that drift away from the true system dynamics over timescales that are relevant for simulating real systems.

Fortunately, there are well-known methods for numerical integration that preserve energy and can be applied to Hamiltonian systems, like the one we propose here. One such method is leapfrog integration, which is a special method for integrating differential equations of the form:

If we assume that the Hamiltonian equations take this form, we can integrate them using the leapfrog integrator, which in essence updates the position and momentum variables at interleaved time points in a way that resembles the updates “leapfrogging” over each other (see Fig. 13B for an illustration).

In particular, the following updates can be applied to a Hamiltonian of the form $\mathcal{H}=V({\bm{q}})+T({\bm{p}})$ , where $V$ is the potential energy and $T$ is the kinetic energy of the system:

As discussed above, leapfrog integration is more stable and accurate over long rollouts than integrators like Euler or RK4. This is because the leapfrog integrator is an example of a symplectic integrator, which means it is guaranteed to preserve the special form of the Hamiltonian even after repeated application. An example visual comparison between a symplectic (leapfrog) and non-symplectic (Euler) integrator applied over the Hamiltonian for a harmonic oscilator is shown in Fig. 13A. For a more thorough discussion of the properties of leapfrog integration, see (Neal et al., 2011).

A.7 Proof of the Hamiltonian Flow ELBO

The Hamiltonian Flow consists of two components. Firstly, it defines a density model over the joint space $s_{T}=(q_{T},p_{T})$ using the Hamiltonian Flow as described in section 3.3. However, we assume that the observable variable represents only $q_{T}$ and treat $p_{T}$ as a latent variable which we have to marginalize over. Since the integral is intractable using the introduced variational distribution we can derive the lower bound on the marginal likelihood: