Embed to Control: A Locally Linear Latent Dynamics Model for Control from Raw Images

Manuel Watter, Jost Tobias Springenberg, Joschka Boedecker, Martin Riedmiller

Introduction

Control of non-linear dynamical systems with continuous state and action spaces is one of the key problems in robotics and, in a broader context, in reinforcement learning for autonomous agents. A prominent class of algorithms that aim to solve this problem are model-based locally optimal (stochastic) control algorithms such as iLQG control, which approximate the general non-linear control problem via local linearization. When combined with receding horizon control and machine learning methods for learning approximate system models, such algorithms are powerful tools for solving complicated control problems; however, they either rely on a known system model or require the design of relatively low-dimensional state representations. For real autonomous agents to succeed, we ultimately need algorithms that are capable of controlling complex dynamical systems from raw sensory input (e.g. images) only. In this paper we tackle this difficult problem.

If stochastic optimal control (SOC) methods were applied directly to control from raw image data, they would face two major obstacles. First, sensory data is usually high-dimensional – i.e. images with thousands of pixels – rendering a naive SOC solution computationally infeasible. Second, the image content is typically a highly non-linear function of the system dynamics underlying the observations; thus model identification and control of this dynamics are non-trivial.

While both problems could, in principle, be addressed by designing more advanced SOC algorithms we approach the “optimal control from raw images” problem differently: turning the problem of locally optimal control in high-dimensional non-linear systems into one of identifying a low-dimensional latent state space, in which locally optimal control can be performed robustly and easily. To learn such a latent space we propose a new deep generative model belonging to the class of variational autoencoders that is derived from an iLQG formulation in latent space. The resulting Embed to Control (E2C) system is a probabilistic generative model that holds a belief over viable trajectories in sensory space, allows for accurate long-term planning in latent space, and is trained fully unsupervised. We demonstrate the success of our approach on four challenging tasks for control from raw images and compare it to a range of methods for unsupervised representation learning. As an aside, we also validate that deep up-convolutional networks are powerful generative models for large images.

The Embed to Control (E2C) model

We briefly review the problem of SOC for dynamical systems, introduce approximate locally optimal control in latent space, and finish with the derivation of our model.

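We consider the control of unknown dynamical systems of the form

$$\mathbf{s}_{t+1}=f(\mathbf{s}_{t},\mathbf{u}_{t})+{\boldsymbol{\xi}},\quad{\boldsymbol{\xi}}\sim\mathcal{N}(\mathbf{0},\mathbf{\Sigma}_{{\boldsymbol{\xi}}}),\qquad(1)$$

where $\mathbf{s}_{t}$ is the system state at time step $t$, $\mathbf{u}_{t}$ the applied control and ${\boldsymbol{\xi}}$ the system noise. We only have access to visual depictions $\mathbf{x}_{t}$ of the state $\mathbf{s}_{t}$, from which a low-dimensional latent state $\mathbf{z}_{t}$ is inferred through a function $m$:

$$\mathbf{z}_{t}=m(\mathbf{x}_{t})+{\boldsymbol{\omega}},\quad{\boldsymbol{\omega}}\sim\mathcal{N}(\mathbf{0},\mathbf{\Sigma}_{{\boldsymbol{\omega}}}),\qquad(2)$$

where ${\boldsymbol{\omega}}$ accounts for system noise; or equivalently $\mathbf{z}_{t}\sim\mathcal{N}(m(\mathbf{x}_{t}),\mathbf{\Sigma}_{{\boldsymbol{\omega}}})$. Assuming for the moment that such a function can be learned (or approximated), we will first define SOC in a latent space and introduce our model thereafter.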

2.2 Stochastic locally optimal control in latent spaces

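Let $f^{\text{lat}}$ denote the transition dynamics in latent space, i.e. $\mathbf{z}_{t+1}=f^{\text{lat}}(\mathbf{z}_{t},\mathbf{u}_{t})$. The optimal controls for a trajectory of length $T$ minimize the expected cost

$$J(\mathbf{z}_{1:T},\mathbf{u}_{1:T})=\mathbb{E}_{\mathbf{z}}\left[c_{T}(\mathbf{z}_{T},\mathbf{u}_{T})+\sum_{t=1}^{T-1}c(\mathbf{z}_{t},\mathbf{u}_{t})\right],\qquad(3)$$

where $c(\mathbf{z}_{t},\mathbf{u}_{t})$ are instantaneous costs, $c_{T}(\mathbf{z}_{T},\mathbf{u}_{T})$ denotes terminal costs and $\mathbf{z}_{1:T}=\{\mathbf{z}_{1},\dots,\mathbf{z}_{T}\}$ and $\mathbf{u}_{1:T}=\{\mathbf{u}_{1},\dots,\mathbf{u}_{T}\}$ are state and action sequences respectively. If $\mathbf{z}_{t}$ contains sufficient information about $\mathbf{s}_{t}$, i.e., $\mathbf{s}_{t}$ can be inferred from $\mathbf{z}_{t}$ alone, and $f^{\text{lat}}$ is differentiable, the cost-minimizing controls can be computed from $J(\mathbf{z}_{1:T},\mathbf{u}_{1:T})$ via SOC algorithms. These optimal control algorithms approximate the global non-linear dynamics with locally linear dynamics at each time step $t$. Locally optimal actions can then be found in closed form. Formally, given a reference trajectory $\bar{\mathbf{z}}_{1:T}$ – the current estimate for the optimal trajectory – together with corresponding controls $\bar{\mathbf{u}}_{1:T}$ the system is linearized as

$$\mathbf{z}_{t+1}=\mathbf{A}(\bar{\mathbf{z}}_{t})\mathbf{z}_{t}+\mathbf{B}(\bar{\mathbf{z}}_{t})\mathbf{u}_{t}+\mathbf{o}(\bar{\mathbf{z}}_{t})+{\boldsymbol{\omega}},\quad{\boldsymbol{\omega}}\sim\mathcal{N}(\mathbf{0},\mathbf{\Sigma}_{{\boldsymbol{\omega}}}),\qquad(4)$$

where $\mathbf{A}(\bar{\mathbf{z}}_{t})=\frac{\delta f^{\text{lat}}(\bar{\mathbf{z}}_{t},\bar{\mathbf{u}}_{t})}{\delta\bar{\mathbf{z}}_{t}}$, $\mathbf{B}(\bar{\mathbf{z}}_{t})=\frac{\delta f^{\text{lat}}(\bar{\mathbf{z}}_{t},\bar{\mathbf{u}}_{t})}{\delta\bar{\mathbf{u}}_{t}}$ are local Jacobians, and $\mathbf{o}(\bar{\mathbf{z}}_{t})$ is an offset. To enable efficient computation of the local controls we assume the costs to be a quadratic function of the latent representation

$$c(\mathbf{z}_{t},\mathbf{u}_{t})=(\mathbf{z}_{t}-\mathbf{z}_{\text{goal}})^{T}\mathbf{R}_{z}(\mathbf{z}_{t}-\mathbf{z}_{\text{goal}})+\mathbf{u}_{t}^{T}\mathbf{R}_{u}\mathbf{u}_{t},\qquad(5)$$

where $\mathbf{R}_{z}$ and $\mathbf{R}_{u}$ are cost weighting matrices and $\mathbf{z}_{\text{goal}}$ is the latent representation of the goal state.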
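
As a rough illustration of the local linearization and quadratic cost described here, the following sketch estimates the Jacobians by central differences and picks the offset so that the linear model reproduces the dynamics at the reference point. The dynamics function `f_lat`, the step size `eps` and all numeric values are assumptions made for this example only; they are not taken from the paper.

```python
import numpy as np

def f_lat(z, u):
    """Hypothetical smooth latent dynamics (pendulum-like), for illustration only."""
    theta, omega = z
    dt = 0.1
    return np.array([theta + dt * omega, omega + dt * (-np.sin(theta) + u[0])])

def linearize(f, z_bar, u_bar, eps=1e-6):
    """Central-difference Jacobians A, B and offset o around (z_bar, u_bar)."""
    n, m = z_bar.size, u_bar.size
    A = np.zeros((n, n))
    B = np.zeros((n, m))
    for i in range(n):
        dz = np.zeros(n)
        dz[i] = eps
        A[:, i] = (f(z_bar + dz, u_bar) - f(z_bar - dz, u_bar)) / (2 * eps)
    for j in range(m):
        du = np.zeros(m)
        du[j] = eps
        B[:, j] = (f(z_bar, u_bar + du) - f(z_bar, u_bar - du)) / (2 * eps)
    # Offset chosen so that A z_bar + B u_bar + o equals f(z_bar, u_bar).
    o = f(z_bar, u_bar) - A @ z_bar - B @ u_bar
    return A, B, o

def quadratic_cost(z, u, z_goal, R_z, R_u):
    """Quadratic cost on the deviation from the goal and on the control."""
    d = z - z_goal
    return float(d @ R_z @ d + u @ R_u @ u)

z_bar = np.array([0.3, -0.2])
u_bar = np.array([0.5])
A, B, o = linearize(f_lat, z_bar, u_bar)
```

The linear model is exact at the reference point and only approximate away from it, which is why trajectory optimizers of this kind re-linearize around each new trajectory estimate.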

2.3 A locally linear latent state space model for dynamical systems

Starting from the SOC formulation, we now turn to the problem of learning an appropriate low-dimensional latent representation $\mathbf{z}_{t}\sim P(Z_{t}|m(\mathbf{x}_{t}),\mathbf{\Sigma}_{{\boldsymbol{\omega}}})$ of $\mathbf{x}_{t}$ . The representation $\mathbf{z}_{t}$ has to fulfill three properties: (i) it must capture sufficient information about $\mathbf{x}_{t}$ (enough to enable reconstruction); (ii) it must allow for accurate prediction of the next latent state $\mathbf{z}_{t+1}$ and thus, implicitly, of the next observation $\mathbf{x}_{t+1}$ ; (iii) the prediction $f^{\text{lat}}$ of the next latent state must be locally linearizable for all valid control magnitudes $\mathbf{u}_{t}$ . Given some representation $\mathbf{z}_{t}$ , properties (ii) and (iii) in particular require us to capture possibly highly non-linear changes of the latent representation due to transformations of the observed scene induced by control commands. Crucially, these are particularly hard to model and subsequently linearize. We circumvent this problem by taking a more direct approach: instead of learning a latent space $\mathbf{z}$ and transition model $f^{\text{lat}}$ which are then linearized and combined with SOC algorithms, we directly impose desired transformation properties on the representation $\mathbf{z}_{t}$ during learning. We will select these properties such that prediction in the latent space as well as locally linear inference of the next observation according to Equation (4) are easy.

The transformation properties that we desire from a latent representation can be formalized directly from the iLQG formulation given in Section 2.2. Formally, following Equation (2), let the latent representation be Gaussian $P(Z|X)=\mathcal{N}(m(\mathbf{x}_{t}),\mathbf{\Sigma}_{{\boldsymbol{\omega}}})$. To infer $\mathbf{z}_{t}$ from $\mathbf{x}_{t}$ we first require a method for sampling latent states. Ideally, we would generate samples directly from the unknown true posterior $P(Z|X)$, to which we, however, have no access. Following the variational Bayes approach (see Jordan et al. for an overview) we resort to sampling $\mathbf{z}_{t}$ from an approximate posterior distribution $Q_{{\boldsymbol{\phi}}}(Z|X)$ with parameters ${\boldsymbol{\phi}}$.

Given a sample $\mathbf{z}_{t}\sim Q_{\boldsymbol{\phi}}(Z|X)=\mathcal{N}(\boldsymbol{\mu}_{t},\mathbf{\Sigma}_{t})$ and the applied control $\mathbf{u}_{t}$, the next latent state is predicted as

$$\hat{\mathbf{z}}_{t+1}\sim\hat{Q}_{\boldsymbol{\psi}}(\hat{Z}|Z,\mathbf{u})=\mathcal{N}(\mathbf{A}_{t}\boldsymbol{\mu}_{t}+\mathbf{B}_{t}\mathbf{u}_{t}+\mathbf{o}_{t},\mathbf{C}_{t}),$$

where $\hat{Q}_{\boldsymbol{\psi}}$ is the next latent state posterior distribution, which exactly follows the linear form required for stochastic optimal control. With ${\boldsymbol{\omega}}_{t}\sim\mathcal{N}(\mathbf{0},\mathbf{H}_{t})$ as an estimate of the system noise, $\mathbf{C}$ can be decomposed as $\mathbf{C}_{t}=\mathbf{A}_{t}\mathbf{\Sigma}_{t}\mathbf{A}^{T}_{t}+\mathbf{H}_{t}$. Note that while the transition dynamics in our generative model operates on the inferred latent space, it takes untransformed controls into account. That is, we aim to learn a latent space such that the transition dynamics in $\mathbf{z}$ linearizes the non-linear observed dynamics in $\mathbf{x}$ and is locally linear in the applied controls $\mathbf{u}$. Reconstruction of an image from $\mathbf{z}_{t}$ is performed by passing the sample through multiple hidden layers of a decoding neural network which computes the mean $\mathbf{p}_{t}$ of the generative Bernoulli distribution $P_{\boldsymbol{\theta}}(X|Z)$ (a Bernoulli distribution for $P_{\boldsymbol{\theta}}$ is a common choice when modeling black-and-white images).
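
To make the transition step concrete, here is a minimal NumPy sketch of one locally linear Gaussian transition. It assumes the rank-one parameterization $\mathbf{A}_{t}=\mathbf{I}+\mathbf{v}_{t}\mathbf{r}_{t}^{T}$ listed in the appendix architectures; the random vectors standing in for the outputs of the transition network, the noise covariance `H` and the dimensions are placeholders chosen for this example.

```python
import numpy as np

def transition(mu, Sigma, u, v, r, B, o, H):
    """Mean and covariance of the next latent state under a locally linear model.

    A = I + v r^T is a rank-one perturbation of the identity.
    """
    A = np.eye(mu.size) + np.outer(v, r)
    mean = A @ mu + B @ u + o
    C = A @ Sigma @ A.T + H  # C_t = A_t Sigma_t A_t^T + H_t
    return mean, C, A

rng = np.random.default_rng(0)
n_z, n_u = 3, 1
mu = rng.normal(size=n_z)                          # encoder mean (placeholder)
Sigma = np.diag(rng.uniform(0.1, 0.5, size=n_z))   # diagonal encoder covariance
u = np.array([0.7])
# Placeholders for what a transition network would output given z_t.
v, r = rng.normal(size=n_z), rng.normal(size=n_z)
B = rng.normal(size=(n_z, n_u))
o = rng.normal(size=n_z)
H = 0.01 * np.eye(n_z)                             # assumed system-noise estimate

mean, C, A = transition(mu, Sigma, u, v, r, B, o, H)
z_next = rng.multivariate_normal(mean, C)          # one sample of the next latent state
```

One plausible motivation for the rank-one form is that the network only has to emit $2n_z$ numbers for $\mathbf{A}_{t}$ instead of $n_z^{2}$, and the determinant of $\mathbf{A}_{t}$ is available in closed form as $1+\mathbf{v}_{t}^{T}\mathbf{r}_{t}$.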

A sketch of the complete architecture is shown in Figure 1. It also visualizes an additional constraint that is essential for learning a representation for long-term predictions: we require samples $\hat{\mathbf{z}}_{t+1}$ from the state transition distribution $\hat{Q}_{\boldsymbol{\psi}}$ to be similar to the encoding of $\mathbf{x}_{t+1}$ through $Q_{\boldsymbol{\phi}}$ . While it might seem that just learning a perfect reconstruction of $\mathbf{x}_{t+1}$ from $\hat{\mathbf{z}}_{t+1}$ is enough, we require multi-step predictions for planning in $Z$ which must correspond to valid trajectories in the observed space $X$ . Without enforcing similarity between samples from $\hat{Q}_{\boldsymbol{\psi}}$ and $Q_{\boldsymbol{\phi}}$ , following a transition in latent space from $\mathbf{z}_{t}$ with action $\mathbf{u}_{t}$ may lead to a point $\hat{\mathbf{z}}_{t+1}$ , from which reconstruction of $\mathbf{x}_{t+1}$ is possible, but that is not a valid encoding (i.e. the model will never encode any image as $\hat{\mathbf{z}}_{t+1}$ ). Executing another action in $\hat{\mathbf{z}}_{t+1}$ then does not result in a valid latent state – since the transition model is conditional on samples coming from the inference network – and thus long-term predictions fail. In a nutshell, such a divergence between encodings and the transition model results in a generative model that does not accurately model the Markov chain formed by the observations.

2.4 Learning via stochastic gradient variational Bayes

For training the model we use a data set $\mathcal{D}=\{(\mathbf{x}_{1},\mathbf{u}_{1},\mathbf{x}_{2}),\dots,(\mathbf{x}_{T-1},\mathbf{u}_{T-1},\mathbf{x}_{T})\}$ containing observation tuples with corresponding controls obtained from interactions with the dynamical system. Using this data set, we learn the parameters of the inference, transition and generative model by minimizing a variational bound on the true data negative log-likelihood $-\log P(\mathbf{x}_{t},\mathbf{u}_{t},\mathbf{x}_{t+1})$ plus an additional constraint on the latent representation. The complete loss function (note that this is the loss for the latent state space model and distinct from the SOC costs) is given as

$$\mathcal{L}(\mathcal{D})=\sum_{(\mathbf{x}_{t},\mathbf{u}_{t},\mathbf{x}_{t+1})\in\mathcal{D}}\left[\mathcal{L}^{\text{bound}}(\mathbf{x}_{t},\mathbf{u}_{t},\mathbf{x}_{t+1})+\lambda\,\text{KL}\left(\hat{Q}_{\boldsymbol{\psi}}(\hat{Z}|\mathbf{z}_{t},\mathbf{u}_{t})\,\|\,Q_{\boldsymbol{\phi}}(Z|\mathbf{x}_{t+1})\right)\right].\qquad(11)$$

The first part of this loss is the per-example variational bound on the log-likelihood

$$\mathcal{L}^{\text{bound}}(\mathbf{x}_{t},\mathbf{u}_{t},\mathbf{x}_{t+1})=\mathbb{E}_{\mathbf{z}_{t}\sim Q_{\boldsymbol{\phi}},\,\hat{\mathbf{z}}_{t+1}\sim\hat{Q}_{\boldsymbol{\psi}}}\left[-\log P_{\boldsymbol{\theta}}(\mathbf{x}_{t}|\mathbf{z}_{t})-\log P_{\boldsymbol{\theta}}(\mathbf{x}_{t+1}|\hat{\mathbf{z}}_{t+1})\right]+\text{KL}\left(Q_{\boldsymbol{\phi}}\,\|\,P(Z)\right),\qquad(12)$$

where $Q_{{\boldsymbol{\phi}}}$, $P_{\boldsymbol{\theta}}$ and $\hat{Q}_{\boldsymbol{\psi}}$ are the parametric inference, generative and transition distributions from Section 2.3 and $P(Z_{t})$ is a prior on the approximate posterior $Q_{\boldsymbol{\phi}}$, which we always chose to be an isotropic Gaussian distribution with mean zero and unit variance. The second KL divergence in Equation (11) is an additional contraction term with weight $\lambda$ that enforces agreement between the transition and inference models. This term is essential for establishing a Markov chain in latent space that corresponds to the real system dynamics (see Section 2.3 above for an in-depth discussion). This KL divergence can also be seen as a prior on the latent transition model. Note that all KL terms can be computed analytically for our model (see the supplementary material for details).

During training we approximate the expectation in $\mathcal{L}(\mathcal{D})$ via sampling. Specifically, we take one sample $\mathbf{z}_{t}$ for each input $\mathbf{x}_{t}$ and transform that sample using Equation (10) to give a valid sample $\hat{\mathbf{z}}_{t+1}$ from $\hat{Q}_{\boldsymbol{\psi}}$ . We then jointly learn all parameters of our model by minimizing $\mathcal{L}(\mathcal{D})$ using SGD.
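
The structure of the per-example loss can be sketched in NumPy as follows: two Bernoulli reconstruction terms, a KL term against the unit Gaussian prior, and the weighted contraction KL between the transition and inference distributions. This is only an illustration under simplifying assumptions: the function names are invented here, the KL uses the generic full-covariance formula rather than the efficient form derived in the appendix, and gradients are not handled. The default `lam=0.25` follows the value of $\lambda$ listed in the appendix.

```python
import numpy as np

def bernoulli_nll(x, p, eps=1e-8):
    """Negative Bernoulli log-likelihood (cross entropy) of pixels x under means p."""
    p = np.clip(p, eps, 1 - eps)
    return float(-np.sum(x * np.log(p) + (1 - x) * np.log(1 - p)))

def kl_gauss(mu0, S0, mu1, S1):
    """KL(N(mu0, S0) || N(mu1, S1)) for full covariance matrices."""
    k = mu0.size
    S1_inv = np.linalg.inv(S1)
    d = mu1 - mu0
    _, logdet0 = np.linalg.slogdet(S0)
    _, logdet1 = np.linalg.slogdet(S1)
    return 0.5 * float(np.trace(S1_inv @ S0) + d @ S1_inv @ d - k + logdet1 - logdet0)

def e2c_loss(x_t, x_next, p_t, p_next,
             mu_t, Sig_t, mu_hat, C_hat, mu_next, Sig_next, lam=0.25):
    """Single-sample estimate of the loss for one tuple (x_t, u_t, x_{t+1}).

    p_t, p_next:       decoder means for z_t and the predicted z_hat_{t+1}
    mu_t, Sig_t:       encoder distribution for x_t
    mu_hat, C_hat:     transition distribution for the next latent state
    mu_next, Sig_next: encoder distribution for x_{t+1}
    """
    recon = bernoulli_nll(x_t, p_t) + bernoulli_nll(x_next, p_next)
    prior = kl_gauss(mu_t, Sig_t, np.zeros_like(mu_t), np.eye(mu_t.size))
    contraction = kl_gauss(mu_hat, C_hat, mu_next, Sig_next)
    return recon + prior + lam * contraction
```

When the transition distribution matches the encoding of the next image exactly, the contraction term vanishes and the loss reduces to the variational bound.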

Experimental Results

We evaluate our model on four visual tasks: an agent in a plane with obstacles, a visual version of the classic inverted pendulum swing-up task, balancing a cart-pole system, and control of a three-link arm with larger images. These are described in detail below.

Model training. We consider two different network types for our model: standard fully connected neural networks with up to three layers, which work well for moderately sized images, are used for the planar and swing-up experiments; a deep convolutional network for the encoder in combination with an up-convolutional network as the decoder, which, in accordance with recent findings from the literature, we found to be an adequate model for larger images, is used for the cart-pole and arm experiments. Training was performed using Adam throughout all experiments. The training data set $\mathcal{D}$ for all tasks was generated by randomly sampling $N$ state observations and actions with corresponding successor states. For the plane we used $N\!\!=\!\!3,000$ samples, for the inverted pendulum and cart-pole system we used $N\!\!=\!\!15,000$ and for the arm $N\!\!=\!\!30,000$. A complete list of architecture parameters and hyperparameter choices as well as an in-depth explanation of the up-convolutional network are specified in the supplementary material. We will make our code and a video containing controlled trajectories for all systems available under http://ml.informatik.uni-freiburg.de/research/e2c .

Model variants. In addition to the Embed to Control (E2C) dynamics model derived above, we also consider two variants. By removing the latent dynamics network $h^{\text{trans}}_{\boldsymbol{\psi}}$, i.e. setting its output to one in Equation (10), we obtain a variant in which $\mathbf{A}_{t}$, $\mathbf{B}_{t}$ and $\mathbf{o}_{t}$ are estimated as globally linear matrices (Global E2C). If we instead replace the transition model with a network estimating the dynamics as a non-linear function $\hat{f}^{\text{lat}}$ and only linearize during planning, estimating $\mathbf{A}_{t}$, $\mathbf{B}_{t}$, $\mathbf{o}_{t}$ as Jacobians to $\hat{f}^{\text{lat}}$ as described in Section 2.2, we obtain a variant with non-linear latent dynamics (Non-linear E2C).

Baseline models. For a thorough comparison and to exhibit the complicated nature of the tasks, we also test a set of baseline models on the plane and the inverted pendulum task (using the same architecture as the E2C model): a standard variational autoencoder (VAE) and a deep autoencoder (AE) are trained on the autoencoding subtask for visual problems. That is, given a data set $\mathcal{D}$ used for training our model, we remove all actions from the tuples in $\mathcal{D}$ and disregard temporal context between images. After autoencoder training we learn a dynamics model in latent space, approximating $f^{\text{lat}}$ from Section 2.2. We also consider a VAE variant with a slowness term on the latent representation – a full description of this variant is given in the supplementary material.

Optimal control algorithms. To perform optimal control in the latent space of different models, we employ two trajectory optimization algorithms: iterative linear quadratic regulation (iLQR) (for the plane and inverted pendulum) and approximate inference control (AICO) (all other experiments). For all VAEs both methods operate on the mean of distributions $Q_{\boldsymbol{\phi}}$ and $\hat{Q}_{\boldsymbol{\psi}}$. AICO additionally makes use of the local Gaussian covariances $\mathbf{\Sigma}_{t}$ and $\mathbf{C}_{t}$. Except for the experiments on the planar system, control was performed in a model predictive control fashion using a receding horizon scheme. To obtain closed loop control given an image $\mathbf{x}_{t}$, it is first passed through the encoder to obtain the latent state $\mathbf{z}_{t}$. A locally optimal trajectory is subsequently found by optimizing $(\mathbf{z}^{*}_{t:t+T},\mathbf{u}^{*}_{t:t+T})\approx\arg\min_{\begin{subarray}{c}\mathbf{z}_{t:t+T}\\ \mathbf{u}_{t:t+T}\end{subarray}}J(\mathbf{z}_{t:t+T},\mathbf{u}_{t:t+T})$ with fixed, small horizon $T$ (with $T=10$ unless noted otherwise). Controls $\mathbf{u}^{*}_{t}$ are applied to the system and a transition to $\mathbf{z}_{t+1}$ is observed (by encoding the next image $\mathbf{x}_{t+1}$). Then a new control sequence – with horizon $T$ – starting in $\mathbf{z}_{t+1}$ is found using the last estimated trajectory as a bootstrap. Note that planning is performed entirely in the latent state without access to any observations except for the depiction of the current state. To compute the cost function $c(\mathbf{z}_{t},\mathbf{u}_{t})$ required for trajectory optimization in $\mathbf{z}$ we assume knowledge of the observation $\mathbf{x}_{\text{goal}}$ of the goal state $\mathbf{s}_{\text{goal}}$. This observation is then transformed into latent space and costs are computed according to Equation (5).
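
The receding horizon loop described above can be sketched with a plain finite-horizon LQR in place of iLQR or AICO. This is a simplified stand-in, not the planner used in the experiments: it assumes fixed linear latent dynamics with $\mathbf{A}=\mathbf{B}=\mathbf{I}$ and no offset (so the goal is a fixed point of the uncontrolled dynamics), and it replaces "encode the next image" with a direct application of the linear model. The function names and the start and goal values are hypothetical.

```python
import numpy as np

def lqr_gain(A, B, R_z, R_u, horizon):
    """Backward Riccati recursion; returns the feedback gain for the first step."""
    P = R_z.copy()  # terminal cost-to-go matrix
    K = np.zeros((B.shape[1], A.shape[0]))
    for _ in range(horizon):
        K = np.linalg.solve(R_u + B.T @ P @ B, B.T @ P @ A)
        P = R_z + A.T @ P @ (A - B @ K)
    return K

def receding_horizon_control(z0, z_goal, A, B, R_z, R_u, horizon=10, steps=40):
    """Plan over a short horizon, apply only the first control, then replan."""
    z = z0.copy()
    trajectory = [z.copy()]
    for _ in range(steps):
        K = lqr_gain(A, B, R_z, R_u, horizon)  # in E2C, A and B would be re-estimated here
        u = -K @ (z - z_goal)                  # first control of the planned sequence
        z = A @ z + B @ u                      # stand-in for encoding the next image
        trajectory.append(z.copy())
    return np.array(trajectory)

A = np.eye(2)
B = np.eye(2)
R_z, R_u = 0.1 * np.eye(2), np.eye(2)
traj = receding_horizon_control(np.array([5.0, 35.0]), np.array([35.0, 35.0]), A, B, R_z, R_u)
```

Because the dynamics are fixed in this toy setting, replanning returns the same gain at every step; the loop structure matters once the local matrices change with the current latent state.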

3.2 Control in a planar system

The agent in the planar system can move in a bounded two-dimensional plane by choosing a continuous offset in x- and y-direction. The high-dimensional representation of a state is a $40\times 40$ black-and-white image. Obstructed by six circular obstacles, the task is to move to the bottom right of the image, starting from a random x position at the top of the image. The encodings of obstacles are obtained prior to planning and an additional quadratic cost term penalizes proximity to them.

A depiction of the observations on which control is performed – together with their corresponding state values and embeddings into latent space – is shown in Figure 2. The figure also clearly shows a fundamental advantage the E2C model has over its competitors: While the separately trained autoencoders make for aesthetically pleasing pictures, the models failed to discover the underlying structure of the state space, complicating dynamics estimation and largely invalidating costs based on distances in said space. Including the latent dynamics constraints in these end-to-end models on the other hand, yields latent spaces approaching the optimal planar embedding.

We test the long-term accuracy by accumulating latent and real trajectory costs to quantify whether the imagined trajectory reflects reality. The results for all models when starting from random positions at the top and executing $40$ pre-computed actions are summarized in Table 1, using a separate test set for evaluating reconstructions. While all methods achieve a low reconstruction loss, the difference in accumulated real costs per trajectory shows the superiority of the E2C model. Using the globally or locally linear E2C model, trajectories planned in latent space are as good as trajectories planned on the real state. All models besides E2C fail to give long-term predictions that result in good performance.

3.3 Learning swing-up for an inverted pendulum

We next turn to the task of controlling the classical inverted pendulum system from images. We create depictions of the state by rendering a fixed-length line starting from the center of the image at an angle corresponding to the pendulum position. The goal in this task is to swing up and balance an underactuated pendulum from a resting position (pendulum hanging down). Exemplary observations and reconstructions for this system are given in Figure 3(d). In the visual inverted pendulum task our algorithm faces two additional difficulties: first, the observed space is non-Markov, as the angular velocity cannot be inferred from a single image; second, discretization errors due to rendering pendulum angles as small $48\times 48$ pixel images make exact control difficult. To restore the Markov property, we stack two images (as input channels), thus observing a one-step history.
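
The image-stacking idea can be illustrated with a small rendering sketch. The drawing routine, line length and angle convention below are assumptions made for this example and are not the rendering code used in the experiments; only the $48\times 48$ image size and the two-channel stacking follow the text.

```python
import numpy as np

def render_pendulum(theta, size=48, length=20):
    """Draw a fixed-length line from the image center at angle theta."""
    img = np.zeros((size, size), dtype=np.float32)
    c = (size - 1) / 2.0
    for s in np.linspace(0.0, length, 4 * length):
        row = int(round(c + s * np.cos(theta)))  # theta = 0 points down (resting position)
        col = int(round(c + s * np.sin(theta)))
        img[row, col] = 1.0
    return img

def stacked_observation(theta_prev, theta_curr):
    """Two consecutive frames as input channels, giving a one-step history."""
    return np.stack([render_pendulum(theta_prev), render_pendulum(theta_curr)])

# Same current angle, reached from two different previous angles.
obs_a = stacked_observation(0.0, 0.3)
obs_b = stacked_observation(0.6, 0.3)
```

The two observations share the same current frame, so a single image cannot tell them apart, while the stacked input separates the two directions of motion. Flattened, the stacked observation has $2\cdot 48^{2}=4608$ entries.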

Figure 3 shows the topology of the latent space for our model, as well as one sample trajectory in true state and latent space. The fact that the model can learn a meaningful embedding, separating velocities and positions, from this data is remarkable (no other model recovered this shape). Table 1 again compares the different models quantitatively. While the E2C model is not the best in terms of reconstruction performance, it is the only model resulting in stable swing-up and balance behavior. We attribute the failure of the other models to two facts: the non-linear latent dynamics model cannot be guaranteed to be linearizable for all control magnitudes, resulting in undesired behavior around unstable fixed points of the real system dynamics, and a globally linear dynamics model is inadequate for this task.

3.4 Balancing a cart-pole and controlling a simulated robot arm

Finally, we consider control of two more complex dynamical systems from images using a six layer convolutional inference and six layer up-convolutional generative network, resulting in a 12-layer deep path from input to reconstruction. Specifically, we control a visual version of the classical cart-pole system from a history of two $80\times 80$ pixel images as well as a three-link planar robot arm based on a history of two $128\times 128$ pixel images. The latent space was set to be 8-dimensional in both experiments. The real state dimensionality for the cart-pole is four and is controlled using one action, while for the arm the real state can be described in 6 dimensions (joint angles and velocities) and controlled using a three-dimensional action vector corresponding to motor torques.

As in previous experiments the E2C model seems to have no problem finding a locally linear embedding of images into latent space in which control can be performed. Figure 4 depicts exemplary images – for both problems – from a trajectory executed by our system. The costs for these trajectories ( $11.13$ for the cart-pole, $85.12$ for the arm) are only slightly worse than trajectories obtained by AICO operating on the real system dynamics starting from the same start-state ( $7.28$ and $60.74$ respectively). The supplementary material contains additional experiments using these domains.

Comparison to recent work

In the context of representation learning for control (see Böhmer et al. for a review), deep autoencoders (ignoring state transitions) similar to our baseline models have been applied previously, e.g. by Lange and Riedmiller. A more direct route to control based on image streams is taken by recent work on (model-free) deep end-to-end Q-learning for Atari games by Mnih et al., as well as kernel-based and deep policy learning for robot control.

Close to our approach is a recent paper by Wahlström et al., where autoencoders are used to extract a latent representation for control from images, on which a non-linear model of the forward dynamics is learned. Their model is trained jointly and is thus similar to the non-linear E2C variant in our comparison. In contrast to our model, their formulation requires PCA pre-processing and ensures neither that long-term predictions in latent space do not diverge, nor that they are linearizable.

As stated above, our system belongs to the family of VAEs and is generally similar to recent work such as Kingma and Welling, Rezende et al., Gregor et al., and Bayer and Osendorfer. Two additional parallels between our work and recent advances for training deep neural networks can be observed. First, the idea of enforcing desired transformations in latent space during learning – such that the data becomes easy to model – has appeared several times already in the literature. This includes the development of transforming auto-encoders and recent probabilistic models for images. Second, learning relations between pairs of images – although without control – has received considerable attention from the community in recent years. In a broader context our model is related to work on state estimation in Markov decision processes (see Langford et al. for a discussion) through, e.g., hidden Markov models and Kalman filters.

Conclusion

We presented Embed to Control (E2C), a system for stochastic optimal control on high-dimensional image streams. Key to the approach is the extraction of a latent dynamics model which is constrained to be locally linear in its state transitions. An evaluation on four challenging benchmarks revealed that E2C can find embeddings on which control can be performed with ease, reaching performance close to that achievable by optimal control on the real system model.

We thank A. Radford, L. Metz, and T. DeWolf for sharing code, as well as A. Dosovitskiy for useful discussions. This work was partly funded by a DFG grant within the priority program “Autonomous learning” (SPP1597) and the BrainLinks-BrainTools Cluster of Excellence (grant number EXC 1086). M. Watter is funded through the State Graduate Funding Program of Baden-Württemberg.

Appendix A Supplementary to the E2C description

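The KL divergence between two multivariate Gaussians $\mathcal{N}_{0}=\mathcal{N}(\boldsymbol{\mu}_{0},\mathbf{\Sigma}_{0})$ and $\mathcal{N}_{1}=\mathcal{N}(\boldsymbol{\mu}_{1},\mathbf{\Sigma}_{1})$ of dimension $k$ is given by

$$\text{KL}(\mathcal{N}_{0}||\mathcal{N}_{1})=\frac{1}{2}\left(\text{Tr}(\mathbf{\Sigma}_{1}^{-1}\mathbf{\Sigma}_{0})+(\boldsymbol{\mu}_{1}-\boldsymbol{\mu}_{0})^{T}\mathbf{\Sigma}_{1}^{-1}(\boldsymbol{\mu}_{1}-\boldsymbol{\mu}_{0})-k+\log\frac{\det\mathbf{\Sigma}_{1}}{\det\mathbf{\Sigma}_{0}}\right).$$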

For a simplified notation, such that $\text{KL}(\mathcal{N}_{0}||\mathcal{N}_{1})=\text{KL}(\hat{Q}||Q)$ , let us assume

The main point of the derivation that follows is to make the partial derivatives of the above KL divergence efficiently computable. To this end, we cannot compute the trace or the determinant via numerical algorithms, because we need to be able to take the gradients in symbolic form. In addition, we would like to process a batch of samples, so the computation should have a convenient form and not require an excessive number of intermediate tensor products. We start our simplification with the trace term, which results in

The last equation is easy to implement and only requires summing over the non-batch dimension. The difference of means can be derived very quickly with the same summing scheme:

What remains is the ratio of determinants, which we simplify with the matrix determinant lemma, giving

Putting the above formulas together finally yields
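
As a hedged illustration of the kind of simplification described in this appendix, the sketch below evaluates the KL divergence for a batch using only element-wise operations and sums over the non-batch dimension. It assumes that the transition covariance has the form $\mathbf{A}\,\text{diag}(\boldsymbol{\sigma}_{0}^{2})\,\mathbf{A}^{T}$ with $\mathbf{A}=\mathbf{I}+\mathbf{v}\mathbf{r}^{T}$ (the parameterization listed with the architectures), that the second Gaussian is diagonal, and that no additional noise covariance is added; the function name and argument layout are invented for this example.

```python
import numpy as np

def kl_rank_one(mu0, sig0_sq, v, r, mu1, sig1_sq):
    """Batched KL(N(mu0, A diag(sig0_sq) A^T) || N(mu1, diag(sig1_sq))), A = I + v r^T.

    All inputs have shape (batch, k); only sums over the last axis are required.
    """
    k = mu0.shape[-1]
    # Diagonal of A S A^T: s_i + 2 v_i r_i s_i + v_i^2 * sum_j r_j^2 s_j
    diag_cov = (sig0_sq + 2 * v * r * sig0_sq
                + v**2 * np.sum(r**2 * sig0_sq, axis=-1, keepdims=True))
    trace = np.sum(diag_cov / sig1_sq, axis=-1)
    maha = np.sum((mu1 - mu0) ** 2 / sig1_sq, axis=-1)
    # Matrix determinant lemma: det(I + v r^T) = 1 + v^T r
    logdet0 = (np.sum(np.log(sig0_sq), axis=-1)
               + 2 * np.log(np.abs(1 + np.sum(v * r, axis=-1))))
    logdet1 = np.sum(np.log(sig1_sq), axis=-1)
    return 0.5 * (trace + maha - k + logdet1 - logdet0)
```

Every term is a differentiable element-wise expression, so the same computation carries over directly to a framework with symbolic or automatic differentiation.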

Appendix B Supplementary to the experimental setup

We used convolutional inference networks for the cart-pole and three-link arm task. While these networks help us overcome the problem of large input dimensionalities (i.e. $2\times 128\times 128$ pixel images in the three-link arm task), we still have to generate full resolution images with the decoder network. For high-dimensional image generation, fully connected neural networks are simply not an option. We thus decided to use up-convolutional networks, which were recently shown to be powerful models for image generation.

To set up these models we basically “mirror” the convolutional architecture used for the encoder. More specifically, for each $5\times 5$ convolution followed by a $2\times 2$ max-pooling step in the encoder network, we introduce a $2\times 2$ up-sampling and $5\times 5$ convolution step in the decoder network. The complete network architecture is given below. It is similar to the up-convolution networks used in Dosovitskiy et al. The upsampling strategy we use is simple “perforated” upsampling.
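
A minimal sketch of a $2\times 2$ up-sampling step is given below. It assumes that “perforated” upsampling means copying each input value to the top-left position of its $2\times 2$ output block and filling the remaining positions with zeros; this reading, the function name and the array layout are assumptions of this example.

```python
import numpy as np

def perforated_upsample(x, factor=2):
    """Upsample (channels, H, W) to (channels, factor*H, factor*W).

    Each input value lands in the top-left corner of its output block;
    all other positions stay zero.
    """
    c, h, w = x.shape
    out = np.zeros((c, factor * h, factor * w), dtype=x.dtype)
    out[:, ::factor, ::factor] = x
    return out

feature_map = np.arange(1, 9, dtype=np.float32).reshape(2, 2, 2)
upsampled = perforated_upsample(feature_map)
```

The subsequent $5\times 5$ convolution then spreads the retained values over the zero positions, which is what lets the decoder grow the spatial resolution step by step.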

B.2 Variational Autoencoder with slowness

Enforcing temporal slowness during learning has previously been found to be a good proxy for learning representations in reinforcement learning and representation learning from videos. We also consider a VAE variant with a slowness term on the latent representation by enforcing similarity of the encodings of temporally close images. This can be achieved by augmenting the standard VAE objective $\mathcal{L}^{\text{bound}}$ with an additional KL divergence term on the latent posterior $Q_{\boldsymbol{\phi}}$ :

Indeed there seems to be a slightly better coherence of similar states in the latent spaces, as e.g. depicted in Figure 8 in the main paper. Yet, our experiments show that a slowness term alone does not suffice to structure the latent space, such that locally linear predictions and control become feasible.

B.3 Evaluation criteria

For comparing the performance of all variants of E2C and the baselines, the following criteria are of importance:

Autoencoding. Being able to reconstruct the given observations is the basic necessity for a model to work. The reconstruction cost drives a model to identify single states from its observations.

Decoding the next state. For any planning to be possible at all, the decoder must be able to generate the correct images from transitions the dynamics model performed. If this is not the case, we know that the latent states of the encoding and the transition model do not coincide, thus preventing any planning.

Optimizing latent trajectory costs. The action sequences for achieving a specified goal will be determined completely by locally linearized dynamics in the latent space. Therefore minimizing trajectory costs in latent space is, again, a necessity for successful control.

Optimizing real trajectory costs. While the action sequence has been determined for the latent dynamics, the deciding criterion is whether this reflects the true state trajectory costs. Therefore carrying out the “dreamed” plans in reality is the optimality criterion for every model. To make the different models comparable, we use the same cost matrices for evaluation, which are not necessarily the same as for optimization.

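We reflected these four criteria in the evaluation table in the paper. For the reconstruction of the current and next state we specified the mean log loss, which in the case of Bernoulli distributions is the cross entropy error function:

$$-\log P_{\boldsymbol{\theta}}(\mathbf{x}_{t}|\mathbf{z}_{t})=-\sum_{i}\left[x_{t,i}\log p_{t,i}+(1-x_{t,i})\log(1-p_{t,i})\right].$$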

For the costs a model imagines and truly achieves, we sample from different starting states and accumulate the distances in latent and true state space according to the SOC method.

B.4 The three-link robot arm

The robot arm we used in the last experiment in the main paper was simulated using dynamics generated by the MapleSim simulator (http://www.maplesoft.com/products/maplesim/), wrapped in Python and visualized with PyGame to produce the inputs to E2C. We simulated a fairly standard robot arm with three links. The lengths of the links were set to $2$, $1.2$ and $0.7$ meters. The masses of the corresponding links were all set to $10\,\text{kg}$.

B.5 Evaluating the true system model

To compare the efficacy of different models when combined with optimal control algorithms, we always reported the cost in latent space (as used by the optimal control algorithm) as well as the “real” trajectory cost. To compute this real cost, we evaluated the same cost function as in the latent space (quadratic costs on the deviation from a given goal state), but using the real system states during execution and different cost matrices for a fair comparison.
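
Accumulating such a quadratic trajectory cost can be sketched as follows. The function name and the toy trajectory are hypothetical; the cost matrices in the usage lines follow the evaluation costs listed for the planar system ($\mathbf{R}_{z}=0.1\cdot\mathbf{I}$, $\mathbf{R}_{u}=\mathbf{I}$).

```python
import numpy as np

def trajectory_cost(states, controls, goal, R_z, R_u):
    """Sum of quadratic state-deviation and control costs along a trajectory."""
    total = 0.0
    for s, u in zip(states, controls):
        d = s - goal
        total += d @ R_z @ d + u @ R_u @ u
    return float(total)

# Toy two-step trajectory; the same function can be applied to latent
# states (imagined cost) or to true system states (real cost).
states = np.array([[1.0, 0.0], [0.0, 0.0]])
controls = np.array([[1.0, 0.0], [0.0, 0.0]])
cost = trajectory_cost(states, controls, np.zeros(2), 0.1 * np.eye(2), np.eye(2))
```

Evaluating all models with the same function and the same matrices on the true states is what makes the reported real costs comparable across models.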

As an upper bound on the performance achievable for control by any of the models, we also computed the true system cost by applying iLQR/AICO to a model of the real system dynamics. We have this model available since all experiments were performed in simulation.

B.6 Neural Network training

All the datasets were created in advance as $\mathcal{D}=\{(\mathbf{x}_{1},\mathbf{u}_{1},\mathbf{x}_{2}),\dots,(\mathbf{x}_{T-1},\mathbf{u}_{T-1},\mathbf{x}_{T})\}$ for the training, validation and test split. The E2C models were trained on the full $\mathcal{D}$, while the models that do not incorporate any transition information (i.e. AE, VAE) were trained on the images $\mathcal{D}_{\text{images}}=\{\mathbf{x}_{1},\dots,\mathbf{x}_{T}\}$ extracted from the original dataset $\mathcal{D}$. The slowness VAE was trained on the subset of image pairs $\mathcal{D}_{\text{pairs}}=\{(\mathbf{x}_{1},\mathbf{x}_{2}),\dots,(\mathbf{x}_{T-1},\mathbf{x}_{T})\}$.

In order to learn dynamics predictions for the image-only autoencoders, we extracted the latent representations and combined them with the actions from $\mathcal{D}$ into $\mathcal{D}_{\text{dynamics}}=\{(\mathbf{z}_{1},\mathbf{u}_{1},\mathbf{z}_{2}),\dots,(\mathbf{z}_{T-1},\mathbf{u}_{T-1},\mathbf{z}_{T})\}$ . On these low-dimensional representations we trained the dynamics MLPs, thus ensuring that all methods were trained on exactly the same data.
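
The derived data sets can be sketched as simple transformations of $\mathcal{D}$. The sketch assumes, as the notation above suggests, that the tuples form one consecutive trajectory; `derive_datasets` and the stand-in `encode` function are hypothetical names introduced for this example.

```python
import numpy as np

def derive_datasets(D, encode):
    """Build image-only, image-pair and latent-dynamics data sets from D.

    D is a list of (x_t, u_t, x_next) tuples from one consecutive trajectory;
    encode maps an image to its latent representation.
    """
    images = [x for (x, _, _) in D] + [D[-1][2]]   # x_1, ..., x_T
    pairs = [(x, x_next) for (x, _, x_next) in D]  # actions removed
    dynamics = [(encode(x), u, encode(x_next)) for (x, u, x_next) in D]
    return images, pairs, dynamics

# Toy trajectory of T = 5 random "images" with scalar actions.
rng = np.random.default_rng(0)
xs = [rng.random((4, 4)) for _ in range(5)]
us = [rng.random(1) for _ in range(4)]
D = [(xs[t], us[t], xs[t + 1]) for t in range(4)]
images, pairs, dynamics = derive_datasets(D, encode=lambda x: x.mean(axis=0))
```

Since every derived set is a projection of the same tuples, no model sees observations that the others do not.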

B.6.2 Implementation details

We used orthogonal weight initialization for every layer. As described in the main paper, Adam was used as the learning rule for all networks. We found both of these techniques to be fundamentally important for stabilizing training and achieving good reconstructions for all methods. Both also clearly helped to reduce the required hyperparameter search to a minimum. During training, we could make out three phases: the unfolding of the latent space, the overcoming of the trivial solution (the average image of the dataset) and the minimization of the latent KL term. The architectures used for our experiments were as follows (where ReLU stands for rectified linear units and conv. for convolutions):

Input: $40^{2}$ image dimensions, $2$ action dimensions

Encoder: 150 ReLU - 150 ReLU - 150 ReLU - 4 Linear (2 for AE)

Decoder: 200 ReLU - 200 ReLU - 1600 Linear (Sigmoid for AE)

Dynamics: 100 ReLU - 100 ReLU + Output layer (except Global E2C)

AE, VAE, VAE with slowness, Non-linear E2C: 2 Linear

E2C: 8 Linear ( $2\cdot 2$ for $\mathbf{A}_{t}$ , $2\cdot 1$ for $\mathbf{B}_{t}$ , 2 for $\mathbf{o}_{t}$ ), $\lambda=0.25$

Evaluation costs: $\mathbf{R}_{z}=0.1\cdot\mathbf{I}$ , $\mathbf{R}_{u}=\mathbf{I}$ , $\mathbf{R}_{o}=\mathbf{I}$

Input: $2\cdot 48^{2}$ image dimensions, $1$ action dimension

Encoder: 800 ReLU - 800 ReLU - 6 Linear (3 for AE)

Decoder: 800 ReLU - 800 ReLU - 4608 Linear (Sigmoid for AE)

Dynamics: 100 ReLU - 100 ReLU + Output layer (except Global E2C)

AE, VAE, VAE with slowness, Non-linear E2C: 3 Linear

E2C: 12 Linear ( $2\cdot 3$ for $\mathbf{A}_{t}=(\mathbf{I}+\mathbf{v}_{t}\mathbf{r}_{t}^{T})$ , $3\cdot 1$ for $\mathbf{B}_{t}$ , 3 for $\mathbf{b}_{t}$ ), $\lambda=0.25$

Adam: $\alpha=3\cdot 10^{-4},\beta_{2}=0.1$

Evaluation costs: $\mathbf{R}_{z}=\mathbf{I}$ , $\mathbf{R}_{u}=0.1\mathbf{I}$

Input: $2\cdot 80^{2}$ image dimensions, $1$ action dimension

Encoder: $32\times 5\times 5$ ReLU - $32\times 5\times 5$ ReLU - $32\times 5\times 5$ ReLU - 512 ReLU - 512 ReLU

Decoder: 512 ReLU - 512 ReLU - $2\times 2$ up-sampling - $32\times 5\times 5$ ReLU - $2\times 2$ up-sampling - $32\times 5\times 5$ ReLU - $2\times 2$ up-sampling - $32\times 5\times 5$ conv. ReLU

Dynamics: 200 ReLU - 200 ReLU + 32 Linear ( $2\cdot 8$ for $\mathbf{A}_{t}=(\mathbf{I}+\mathbf{v}_{t}\mathbf{r}_{t}^{T})$ , $8\cdot 1$ for $\mathbf{B}_{t}$ , $8$ for $\mathbf{b}_{t}$ ), $\lambda=1$

Evaluation costs: $\mathbf{R}_{z}=\mathbf{I}$ , $\mathbf{R}_{u}=\mathbf{I}$

Input: $2\cdot 128^{2}$ image dimensions, $3$ action dimensions

Encoder: $64\times 5\times 5$ conv. ReLU - $2\times 2$ max-pooling - $32\times 5\times 5$ conv. ReLU - $2\times 2$ max-pooling - $32\times 5\times 5$ conv. ReLU - $2\times 2$ max-pooling - 512 ReLU - 512 ReLU

Decoder: 512 ReLU - 512 ReLU - $2\times 2$ up-sampling - $32\times 5\times 5$ ReLU - $2\times 2$ up-sampling - $32\times 5\times 5$ ReLU - $2\times 2$ up-sampling - $64\times 5\times 5$ conv. ReLU

Dynamics: 200 ReLU - 200 ReLU + 48 Linear ( $2\cdot 8$ for $\mathbf{A}_{t}=(\mathbf{I}+\mathbf{v}_{t}\mathbf{r}_{t}^{T})$ , $8\cdot 3$ for $\mathbf{B}_{t}$ , $8$ for $\mathbf{b}_{t}$ ), $\lambda=1$

Evaluation costs: $\mathbf{R}_{z}=\mathbf{I}$ , $\mathbf{R}_{u}=0.001\mathbf{I}$
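The output-layer sizes listed above for the models that use $\mathbf{A}_{t}=(\mathbf{I}+\mathbf{v}_{t}\mathbf{r}_{t}^{T})$ follow from a parameter count: with latent dimension $n_{z}$ (read off from the size of $\mathbf{b}_{t}$) and action dimension $n_{u}$, the vectors $\mathbf{v}_{t}$ and $\mathbf{r}_{t}$ need $2\cdot n_{z}$ values, $\mathbf{B}_{t}$ needs $n_{z}\cdot n_{u}$ and $\mathbf{b}_{t}$ needs $n_{z}$. The sketch below checks this count and builds $\mathbf{A}_{t}$ from a flat output vector. The function names and the ordering of the outputs are assumptions made for illustration.

```python
def dynamics_output_size(n_z, n_u):
    """Number of linear outputs for A_t = I + v r^T, B_t and b_t."""
    return 2 * n_z + n_z * n_u + n_z


def split_dynamics_output(h, n_z, n_u):
    """Split a flat output vector into v, r, B and b.

    The ordering (v, r, B row by row, b) is an assumption for illustration.
    """
    assert len(h) == dynamics_output_size(n_z, n_u)
    v = h[:n_z]
    r = h[n_z:2 * n_z]
    flat_B = h[2 * n_z:2 * n_z + n_z * n_u]
    B = [flat_B[i * n_u:(i + 1) * n_u] for i in range(n_z)]
    b = h[2 * n_z + n_z * n_u:]
    return v, r, B, b


def rank_one_transition(v, r):
    """Build A = I + v r^T as a nested list."""
    n = len(v)
    return [[(1.0 if i == j else 0.0) + v[i] * r[j] for j in range(n)]
            for i in range(n)]


# Output sizes for (latent dim, action dim) = (3, 1), (8, 1) and (8, 3).
sizes = [dynamics_output_size(3, 1),
         dynamics_output_size(8, 1),
         dynamics_output_size(8, 3)]

# Toy output vector 0..11 for the (3, 1) case.
v, r, B, b = split_dynamics_output([float(i) for i in range(12)], 3, 1)
A = rank_one_transition(v, r)
```

The three sizes evaluate to 12, 32 and 48, matching the linear output layers given above for the configurations with 1, 1 and 3 action dimensions.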

Appendix C Supplementary evaluations

To qualitatively measure the predictive accuracy, the starting state for a trajectory is encoded and the actions are applied on the latent representation. After each transition, the predicted latent position is decoded and visualized. In this manner, multi-step predictions can be generated for the planar system in Figure 5 and for the inverted pendulum in Figures 6 and 7.
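The multi-step prediction procedure just described can be sketched as a simple rollout. The callables `encode`, `decode` and `local_dynamics` are hypothetical stand-ins for the trained networks, and the sketch propagates only the latent mean (no covariance), which is a simplification.

```python
def multi_step_prediction(x0, actions, encode, decode, local_dynamics):
    """Encode the start image, roll out in latent space, decode every step.

    local_dynamics(z) is assumed to return (A, B, o) such that the next
    latent state is A z + B u + o.
    """
    z = encode(x0)
    frames = []
    for u in actions:
        A, B, o = local_dynamics(z)
        z = [sum(A[i][j] * z[j] for j in range(len(z)))
             + sum(B[i][k] * u[k] for k in range(len(u)))
             + o[i]
             for i in range(len(z))]
        frames.append(decode(z))
    return frames


# Toy stand-ins: identity encoder/decoder and fixed linear dynamics.
def toy_dynamics(z):
    A = [[1.0, 0.0], [0.0, 1.0]]
    B = [[1.0], [0.0]]
    o = [0.0, 0.0]
    return A, B, o


frames = multi_step_prediction(
    x0=[0.0, 0.0],
    actions=[[1.0], [1.0], [1.0]],
    encode=lambda x: list(x),
    decode=lambda z: list(z),
    local_dynamics=toy_dynamics,
)
```

Note that only the first image is ever encoded; every later frame is produced from the predicted latent state alone, so prediction errors can accumulate over the rollout.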

C.2 Inverted pendulum latent space

Encoding the pendulum depictions into a 3-dimensional latent space allows for a visual comparison in Figure 8.

C.3 Trajectories for cart-pole and three-link arm

Finally, similar to the images in Section C.1, Figure 9 shows multi-step predictions for the cart-pole system. We depict three important cases: (1) a long-term prediction with the cart-pole standing still (essentially the unstable fixed point of the underlying dynamics); (2) the cart-pole moving to the right, changing the direction of the pole's angular velocity (middle column); and (3) the pole moving farthest to the right. The long-term predictions by the E2C model are all of high quality. Note that for the uncontrolled dynamics the predictions show a slight bias of the pole moving to the right (an effect that we consistently saw in trained models for the cart-pole). We attribute this problem to the fact that discretization errors in the image rendering process of the pole angle make it hard to predict small velocities accurately.

C.4 Exemplary trajectory taken for three-link arm task

Figure 10 shows a segment of a controlled trajectory for the three-link arm as executed by the E2C system. Note that, in contrast to other figures in this supplementary material, it does not show a long-term prediction but rather 10 steps of a trajectory (together with one-step-ahead predictions) that was taken by the E2C system when combined with model predictive control. For additional visualizations and controlled trajectories for all tasks we refer to the supplementary video.

C.5 Comparison of different models for cart-pole and robot arm

In Table 2 we compare the different models in terms of real trajectory cost and task success percentage on the cart-pole and the robot arm. All results are averaged over 30 different starting states with a fixed goal state.

The cart-pole always starts in the goal state (zero angle and zero velocity) with small additive Gaussian noise ( $\sigma=0.01$ ). Success is defined as preventing the pole from falling beyond an angle of $\pm 0.85$ rad. The three-link arm system begins in a random configuration and the goal is to unroll all joints (i.e. make all angles zero) and stay $\epsilon$ -close to that position.
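For the cart-pole, the success criterion and the resulting success percentage could be computed along the following lines. This is a sketch under assumptions: the function names are hypothetical, the criterion is read as the absolute pole angle staying strictly inside the limit at every recorded step, and the example angles are made up.

```python
def pole_stayed_up(angles, limit=0.85):
    """True if the absolute pole angle (rad) stays inside the limit."""
    return all(abs(a) < limit for a in angles)


def success_percentage(runs, limit=0.85):
    """Percentage of runs in which the pole never left the allowed range."""
    successes = sum(1 for angles in runs if pole_stayed_up(angles, limit))
    return 100.0 * successes / len(runs)


# Made-up pole-angle traces (rad) for four runs.
runs = [
    [0.01, 0.05, -0.10, 0.02],    # stays balanced
    [0.00, 0.40, 0.90, 1.50],     # falls to one side
    [-0.02, -0.30, -0.60, -0.20], # recovers
    [0.01, -0.50, -1.00, -1.40],  # falls to the other side
]
rate = success_percentage(runs)
```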

The results show that only E2C and its non-linear variant can perform this task successfully, although there is still a large performance gap between the two. We conclude that the error of linearizing non-linear dynamics after training the corresponding model grows to the point of no longer allowing accurate control of the system.

C.6 Comparison of trajectory optimizers for cart-pole and robot arm

To assess how well AICO deals with the covariance matrices estimated in latent space, we performed an additional experiment on the cart-pole and three-link robot arm tasks, comparing it to iLQR. We performed model predictive control using the locally linear E2C model, starting from 10 different start states for each task. The remaining settings are as given in Section C.5.

As reported in Table 3, both methods performed about the same for these tasks, indicating that the covariance matrices estimated by our model do not “hurt” planning, but considering them does not improve performance either.