Visual Dynamics: Probabilistic Future Frame Synthesis via Cross Convolutional Networks

Tianfan Xue, Jiajun Wu, Katherine L. Bouman, William T. Freeman

Introduction

From just a single snapshot, humans are often able to easily imagine how a scene will visually change over time. For instance, due to the pose of the girl in Figure 1, most would predict that her arms are stationary but her leg is moving. However, the exact motion is often unpredictable due to intrinsic ambiguity. Is the girl’s leg moving up or down? In this work, we study the problem of visual dynamics: modeling the conditional distribution of future frames given an observed image. We propose to tackle this problem using a probabilistic, content-aware motion prediction model that learns this distribution without using annotations. Sampling from this model allows us to visualize the many possible ways that an input image is likely to change over time.

Modeling the conditional distribution of future frames given only a single image as input is a very challenging task for a number of reasons. First, natural images come from a very high dimensional distribution that is difficult to model. Modeling the conditional distribution of future frames further increases the dimensionality of the problem. Not only do the sampled, synthesized images need to look like real images, but the motion between the input and synthesized images should also be realistic. Second, in order to properly predict motion distributions, the model must first learn about image parts and the correlation of their respective motions in an unsupervised fashion.

In this work, we propose a neural network structure, based on a variational autoencoder (Kingma and Welling, 2014) and our newly proposed cross convolutional layer, to tackle this problem. During training, the network observes a set of consecutive image pairs in videos, and automatically infers the relationship between images in each pair without any supervision. Then, during testing, the network predicts the conditional distribution, $P(J|I)$ , of future RGB images $J$ (Figure 1b) given an RGB input image $I$ that was not in the training set (Figure 1a). Using this distribution, the network is able to synthesize multiple different image samples corresponding to possible future frames of the input image (Figure 1c). Our network contains a number of key components that contribute to its success:

We use a conditional variational autoencoder to model the complex conditional distribution of future frames (Kingma and Welling, 2014; Yan et al., 2015). This allows us to approximate a sample, $J$, from the distribution of future images by using a trainable function $J=f(I,z)$. The argument $z$ is a sample from a simple distribution, e.g. Gaussian, which introduces randomness into the sampling of $J$. This formulation makes the problem of learning the distribution much more tractable than explicitly modeling the distribution.

Instead of finding an intrinsic representation of the image itself, as most previous work has done (Radford et al., 2016; Reed et al., 2015), our network finds an intrinsic representation of intensity changes between two images, also known as the difference image or Eulerian motion (Wu et al., 2012). This representation is typically sparser and easier to model than content in an original image.

We model motion using a set of image-dependent convolution kernels operating over an image pyramid. Unlike normal convolutional layers, these kernels vary between images, as different images may have different motions. Our proposed cross convolutional layer allows us to convolve image-dependent kernels with feature maps from an observed frame, to synthesize a probable future frame.

We test the proposed model on two synthetic datasets as well as a dataset generated from real videos. We show that, given an RGB input image, the algorithm can successfully model a distribution of possible future frames, and generate different samples that cover a variety of realistic motions. In addition, we demonstrate that our model can be easily applied to tasks such as visual analogy-making, and present some analysis of the learned network representations.

Related Work

Research studying the human visual system and motion priors provides evidence for low-level statistics of object motion. Pioneering work by Weiss and Adelson (1998) found that the human visual system prefers slow and smooth motion fields. More recent work by Lu and Yuille (2006) found that humans make motion predictions similar to those of a Bayesian ideal observer. Roth and Black (2005) analyzed the response of spatial filters applied to optical flow fields and concluded that the spatial distribution of motion resembles that of a heavy-tailed distribution. Fleet et al. (2000) found that a local motion field can be represented by a linear combination of a small number of bases.

These prior works focus on modeling the distribution of an image’s motion field using low-level statistics without any additional information. However, the distribution of a motion field is not independent of image content. For example, given an image of a car in front of a building, many would predict that the car is moving and the building is fixed. Thus, in our work, rather than modeling a motion prior as a context-free distribution, we propose to model the conditional motion distribution of future frames given an input image by incorporating a series of low- and high-level image cues.

Motion Prediction

Our problem is closely related to the motion prediction problem, whose goal is to predict a motion field (Pintea et al., 2014) or trajectory of objects (Walker et al., 2014) based on image content. Unlike our proposed algorithm, which models the conditional distribution of a future frame, work in motion prediction traditionally makes a deterministic prediction. For example, Liu et al. (2011) predicted a motion field of an image by transferring a similar motion field from a database, Pintea et al. (2014) learned a random-forest based mapping from image content to a motion field, and Vondrick et al. (2016) estimated the variation of future image features from observed frames.

However, as demonstrated in Figure 1, deterministic prediction is often impossible due to the intrinsic ambiguity of the problem. In order to model a distribution of possible motions, Walker et al. (2015) posed the motion prediction problem as a classification task and predicted the motion class label for each pixel in the image. This model was, however, not designed to capture pixel-wise correlations in the motion field, i.e., neighboring pixels belonging to the same object may move in opposite directions. Recently, and concurrently with our own work, Walker et al. (2016) introduced a variational autoencoder to model pixel-wise correlations in the motion field. This inspiring work nicely complements our own approach: we aim to predict Eulerian motions and to synthesize future RGB frames, while they focus on predicting the (Lagrangian) motion field. Different from their method, our model further learns feature maps and motion kernels jointly without supervision via our newly proposed cross convolutional network.

Image and Video Synthesis

Techniques that exploit the periodic structure of motion in videos have also been successful at generating novel frames from an input sequence. Early work in video textures proposed to shuffle frames from an existing video to generate a temporally consistent, looping image sequence (Schödl et al., 2000). These ideas were later extended to generate cinemagraphs (Joshi et al., 2012), seamlessly looping videos containing a variety of objects with different motion patterns (Agarwala et al., 2005; Liao et al., 2013), or video inpainting (Wexler et al., 2004). While high-resolution and realistic looking videos are generated using these techniques, they are often limited to periodic motion and require an input reference video. In contrast, we build an image generation model that does not require a reference video at test time.

Recently, several network structures have been proposed to synthesize a new frame from observed frames. Srivastava et al. (2015) designed an LSTM network that synthesized future frames in a sequence from a set of observed frames. Mathieu et al. (2016) proposed to synthesize the next frame in a sequence from a few previous frames, using a multi-scale network, and Oh et al. (2015) and Finn et al. (2016) proposed to synthesize a future frame assuming a certain action is taken. Specifically, concurrent work from Finn et al. (2016) also discussed the idea of learning output convolutional kernels. While they applied the learned kernels on input images, in this paper we explore a more general and principled framework, namely the cross convolutional network, which jointly learns feature maps and kernels without direct supervision.

Early work in parametric texture synthesis developed a set of hand-crafted features that could be used to synthesize textures (Portilla and Simoncelli, 2000). More recently, works in image synthesis have begun to produce impressive results by training variants of neural network structures to produce novel images (Gregor et al., 2015; Xie et al., 2016). Generative adversarial networks (Goodfellow et al., 2014; Denton et al., 2015; Radford et al., 2016) and variational autoencoders (Kingma and Welling, 2014; Yan et al., 2015) have been used to model and sample from natural image distributions. Our proposed algorithm is based on the variational autoencoder, but unlike in this previous work, we also model temporal consistency.

Formulation

In this section, we describe how to sample future frames from a current observation image. Here we focus on next frame synthesis; given an RGB image $I$ observed at time $t$ , our goal is to model the conditional distribution of possible frames observed at time $t+1$ .

Formally, let $\{(I^{(1)},J^{(1)}),\dots,(I^{(n)},J^{(n)})\}$ be the set of image pairs in our training set, where $I^{(i)}$ and $J^{(i)}$ are images observed at two consecutive time steps. Using this data, our task is to model the distribution $p_{\theta}(J|I)$ of all possible next frames $J$ for a new, previously unseen test image $I$ , and to sample new images from this distribution.

In practice, we choose not to directly predict the next frame, but instead to predict the difference image $v=J-I$ , also known as the Eulerian motion, between the observed frame $I$ and the future frame $J$ ; these two problems are equivalent. The task is then to learn the conditional distribution $p_{\theta}(v|I)$ from a set of training pairs $\{(I^{(1)},v^{(1)}),\dots,(I^{(n)},v^{(n)})\}$ .
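The equivalence between predicting $J$ and predicting $v=J-I$ can be checked with a few lines of NumPy. This is only an illustrative sketch: the array shapes and the random test data are assumptions, not values from the paper. It also shows one practical caveat when frames are stored as 8-bit integers, where the subtraction must be done in a signed or floating-point type.

```python
import numpy as np

rng = np.random.default_rng(0)
I = rng.random((4, 4, 3))   # observed frame (hypothetical 4x4 RGB image)
J = rng.random((4, 4, 3))   # next frame

v = J - I                   # difference image (Eulerian motion)
J_rec = I + v               # the next frame is recovered exactly from (I, v)

# Caveat: unsigned 8-bit arrays wrap around on subtraction.
I8 = np.array([[10]], dtype=np.uint8)
J8 = np.array([[5]], dtype=np.uint8)
wrapped = J8 - I8                                        # wraps modulo 256
signed = J8.astype(np.float32) - I8.astype(np.float32)   # correct signed difference
```

Because $v$ takes both positive and negative values, a signed representation is needed throughout.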

A Toy Example

We consider a simple toy example to illustrate how to learn a distribution of future frames at time $t+1$ given the current image at time $t$. Consider a world of circles and squares with corresponding images that contain exactly one shape. All circles move vertically while all squares move horizontally, as shown in Figure 2(a). Although in practice we choose $v$ to be the difference image between consecutive frames, for this toy example we show $v$ as a 2D motion field for a more intuitive visualization. Consider the three types of models shown in Figure 2.

(1) Deterministic motion prediction

In this structure, the model tries to find a deterministic relationship between the input image and object motion (Figure 2(b)). To do this, it attempts to find a function $f$ that minimizes the reconstruction error $\sum_{i}||v^{(i)}-f(I^{(i)})||$ on a training set. However, because the model does not formulate motions in a probabilistic manner, it cannot capture the multiple possible motions that a shape can have. It is possible that this model may disambiguate circles from squares, but it cannot generalize to predict motion on a new, previously unseen image. At most, the algorithm can only learn a mean motion for each object. In the case of zero-mean, symmetric motion distributions, the algorithm would produce an output frame with almost no motion.

(2) Motion prior

Variational autoencoders (Kingma and Welling, 2014) can be used to model the distribution of motion fields, as shown in Figure 2(c). This model contains a latent representation, $z$ , which encodes the intrinsic dimensionality of the motion fields. The network that learns this intrinsic representation consists of two parts: an encoder network that maps the motion field $v$ to an intrinsic representation $z$ (the gray network in Figure 2(c), which corresponds to $p(z|v)$ ), and a decoder network that maps the intrinsic representation $z$ to the motion field $v$ (the yellow network, which corresponds to $p(v|z)$ ). A shortcoming of this model is that it does not see the input image during inference. Therefore, it will only learn a joint distribution of motion fields for both circles and squares, without distinguishing the particular motion pattern for each class of objects.

(3) Probabilistic frame predictor

In order to model the conditional distribution of motion fields given an input image, we combine the deterministic motion prediction structure with that of a content-agnostic motion prior. Refer to Figure 2(d). The decoder (the yellow network in Figure 2(d), which corresponds to $p(v|I,z)$ ) now takes two inputs, the intrinsic representation $z$ and an image $I$ . Therefore, instead of modeling a joint distribution of motion $v$ , it will learn a conditional distribution of motion given the input image $I$ .

In this toy example, since squares and circles only move in one (although different) direction, we would only need a scalar $z\in\mathbb{R}$ for encoding the velocity of the object. The model is then able to infer the location and direction of the motion conditioned on the shape appearing in the input image.
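The contrast between the deterministic predictor and the probabilistic frame predictor can be illustrated numerically. The sketch below is a toy stand-in, not the paper's network: the `DIRECTION` table, the unit-Gaussian velocities, and the use of a squared error (whose minimizer is the mean) are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy world: the shape fixes the motion axis (x, y).
DIRECTION = {"circle": np.array([0.0, 1.0]), "square": np.array([1.0, 0.0])}

def sample_velocity(shape, z):
    """Conditional model v = f(I, z): the shape picks the axis, scalar z the velocity."""
    return z * DIRECTION[shape]

# Training motions for circles: zero-mean, symmetric velocities along y.
z_train = rng.standard_normal(10_000)
circle_v = np.stack([sample_velocity("circle", z) for z in z_train])

# (1) Deterministic predictor: the squared-error minimizer is the mean motion,
# which is close to zero for a symmetric distribution.
det_prediction = circle_v.mean(axis=0)

# (3) Probabilistic predictor: fresh z values give distinct, valid motions.
samples = np.stack([sample_velocity("circle", z) for z in rng.standard_normal(5)])
```

Here `det_prediction` is nearly the zero vector, whereas every row of `samples` is a non-trivial vertical motion, which matches the behavior described for the toy example.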

Conditional Variational Autoencoder

In this section, we will formally derive the training objective of our model, following derivations similar to those in Kingma and Welling (2014); Kingma et al. (2014); Gregor et al. (2015); Yan et al. (2015). Consider the following generative process that samples a future frame from a $\theta$-parametrized model, conditioned on an observed image $I$ (see the graphical model in Figure 2(d)). First the algorithm samples the hidden variable $z$ from a prior distribution $p_{z}(z)$; in this work we assume $p_{z}(z)$ is a multivariate Gaussian distribution where each dimension is i.i.d. with zero-mean and unit-variance. Then, given a value of $z$, the algorithm samples the intensity difference image $v$ from the conditional distribution $p_{\theta}(v|I,z)$. The final image, $J=I+v$, is then returned as output.

In the training stage, the algorithm attempts to maximize the log-likelihood of the conditional marginal distribution $\sum_{i}\log p(v^{(i)}|I^{(i)})$. Assuming $I$ and $z$ are independent, the marginal distribution is expanded as $\sum_{i}\log\int_{z}p(v^{(i)}|I^{(i)},z)p_{z}(z)dz$. Directly maximizing this marginal distribution is hard, thus we instead maximize its variational lower bound, as proposed by Kingma and Welling (2014). Each term in the marginal distribution is lower-bounded by

$$\mathcal{L}(\theta,\phi,v^{(i)}|I^{(i)})=-D_{\text{KL}}\big(q_{\phi}(z|v^{(i)},I^{(i)})\,\|\,p_{z}(z)\big)+\mathbb{E}_{q_{\phi}(z|v^{(i)},I^{(i)})}\big[\log p_{\theta}(v^{(i)}|z,I^{(i)})\big],\qquad(1)$$

where $D_{\text{KL}}$ is the KL-divergence, and $q_{\phi}(z|v^{(i)},I^{(i)})$ is the variational distribution that approximates the posterior $p(z|v^{(i)},I^{(i)})$ . For simplicity, we refer to the conditional data distribution, $p_{\theta}(v^{(i)}|z,I^{(i)})$ , as the generative model, and the variational distribution, $q_{\phi}(z|v^{(i)},I^{(i)})$ , as the recognition model.

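The first KL-divergence term in Eq. 1 has an analytical form. To make the second term tractable, we approximate the variational distribution, $q_{\phi}(z|v^{(i)},I^{(i)})$, by its empirical distribution over $L$ samples,

$$\mathcal{L}(\theta,\phi,v^{(i)}|I^{(i)})\approx-D_{\text{KL}}\big(q_{\phi}(z|v^{(i)},I^{(i)})\,\|\,p_{z}(z)\big)+\frac{1}{L}\sum_{l=1}^{L}\log p_{\theta}(v^{(i)}|z^{(i,l)},I^{(i)}),\qquad(2)$$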

where $z^{(i,l)}$ are samples from the variational distribution.

Distribution Reparametrization

Now we need to define distributions for the generative model, $p_{\theta}(v^{(i)}|z^{(i,l)},I^{(i)})$, and for the recognition model, $q_{\phi}(z^{(i,l)}|v^{(i)},I^{(i)})$. Using the reparameterization trick (Kingma and Welling, 2014), we approximate both distributions as Gaussian, where the mean and variance of the distributions are functions specified by a generative network and a recognition network, respectively. Specifically, let us define (here the bold $\mathbf{I}$ denotes an identity matrix, whereas the normal-font $I$ denotes the observed image):

$$p_{\theta}(v^{(i)}|z^{(i,l)},I^{(i)})=\mathcal{N}\big(v^{(i)};f_{\mbox{{mean}}}(z^{(i,l)},I^{(i)}),\sigma^{2}\mathbf{I}\big),$$

$$q_{\phi}(z^{(i,l)}|v^{(i)},I^{(i)})=\mathcal{N}\big(z^{(i,l)};g_{\mbox{{mean}}}(v^{(i)},I^{(i)}),g_{\mbox{{var}}}(v^{(i)},I^{(i)})\big),\qquad(3)$$

where $\mathcal{N}(\;\cdot\;;a,b)$ is a Gaussian distribution with mean $a$ and variance $b$. $f_{\mbox{{mean}}}$ is a function that predicts the mean of the conditional data distribution, defined by the generative network (the yellow network in Figure 2(d)). $g_{\mbox{{mean}}}$ and $g_{\mbox{{var}}}$ are functions that predict the mean and variance of the variational distribution, respectively, defined by the recognition network (the gray network in Figure 2(d)). Here we assume that all dimensions of the conditional data distribution have the same variance $\sigma^{2}$, where $\sigma$ is a hand-tuned hyperparameter. In the next section, we will describe the details of the network structure.
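As a rough illustration of how the objective described above could be evaluated for one training pair, the sketch below combines the analytic KL-divergence term with a sampled Gaussian log-likelihood term. It is an assumption-laden sketch rather than the authors' implementation: the function names, the number of samples, and the stand-in networks `toy_g_mean`, `toy_g_var`, and `toy_f_mean` are hypothetical, and the variance is treated as a per-dimension (diagonal) vector.

```python
import numpy as np

def kl_to_standard_normal(mean, var):
    """Analytic KL( N(mean, diag(var)) || N(0, I) )."""
    return 0.5 * np.sum(mean**2 + var - np.log(var) - 1.0)

def gaussian_log_likelihood(v, v_mean, sigma):
    """log N(v; v_mean, sigma^2 I), summed over all pixels."""
    sq_err = np.sum((v - v_mean) ** 2)
    return -0.5 * sq_err / sigma**2 - 0.5 * v.size * np.log(2.0 * np.pi * sigma**2)

def bound_estimate(v, image, g_mean, g_var, f_mean, sigma, num_samples, rng):
    """Monte Carlo estimate of the variational bound for one pair (image, v)."""
    mu, var = g_mean(v, image), g_var(v, image)
    log_lik = 0.0
    for _ in range(num_samples):
        eps = rng.standard_normal(mu.shape)
        z = mu + np.sqrt(var) * eps        # reparameterized sample of z
        log_lik += gaussian_log_likelihood(v, f_mean(z, image), sigma)
    return -kl_to_standard_normal(mu, var) + log_lik / num_samples

# Hypothetical stand-ins for the recognition and generative networks.
def toy_g_mean(v, image):
    return np.zeros(4)

def toy_g_var(v, image):
    return np.ones(4)

def toy_f_mean(z, image):
    return np.zeros_like(image)

rng = np.random.default_rng(0)
image = rng.random((8, 8))
v = 0.1 * rng.standard_normal((8, 8))
bound = bound_estimate(v, image, toy_g_mean, toy_g_var, toy_f_mean,
                       sigma=1.0, num_samples=3, rng=rng)
```

With a fixed $\sigma$, the log-likelihood term differs from a squared error only by the factor $1/(2\sigma^{2})$ and an additive constant, so $\sigma$ effectively sets the relative weight of the reconstruction and KL-divergence terms.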

Method

In this section we present a trainable neural network structure, defining the generative function $f_{\mbox{{mean}}}$ and the recognition functions $g_{\mbox{{mean}}}$ and $g_{\mbox{{var}}}$. Once trained, these functions can be used in conjunction with an input image to sample future frames. We first describe our newly proposed cross convolutional layer, which naturally characterizes a layered motion representation (Wang and Adelson, 1993). We then explain our network structure and demonstrate how we integrate the cross convolutional layer into the network for future frame synthesis.

Motion can often be decomposed in a layer-wise manner (Wang and Adelson, 1993). Intuitively, different semantic segments in an image should have different distributions over all possible motions; for example, a building is often static, but a river flows.

To model the layered motion, we propose a novel cross convolutional network (Figure 3). The network first decomposes an input image pyramid into multiple feature maps through an image encoder (Figure 3(c)). It then convolves these maps using different convolutional kernels (Figure 3(d)), and uses the outputs to synthesize a difference image (Figure 3(e)). This network structure naturally fits the layered motion representation, as each feature map characterizes an image layer (note this is different from a network layer) and the corresponding kernel characterizes the motion of that layer. In other words, we model motions as convolutional kernels, which are applied to image segments (feature maps) at multiple scales.

Unlike a traditional convolutional network, these kernels used in our network should not be identical for all inputs, as different images typically have different motions (kernels). We therefore propose a novel cross convolutional layer to tackle this problem. The cross convolutional layer does not learn the weights of the kernels itself. Instead, it takes both kernel weights and feature maps as input and computes convolution during a forward pass; for back propagation, it also computes the gradients of both convolutional kernels and feature maps.
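The forward pass of such a layer can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function name `cross_convolve`, the zero padding that keeps the output the same size as the input, and the one-kernel-per-feature-map pairing are choices made here for clarity. Only the forward computation is shown; in an automatic-differentiation framework the gradients with respect to both inputs would follow from the same operations.

```python
import numpy as np

def cross_convolve(feature_maps, kernels):
    """Convolve every feature map with its own, input-dependent kernel.

    feature_maps: array of shape (N, C, H, W)
    kernels:      array of shape (N, C, k, k), k odd
    Returns an array of shape (N, C, H, W) (zero padding, "same" size).
    """
    n, c, h, w = feature_maps.shape
    k = kernels.shape[-1]
    pad = k // 2
    padded = np.pad(feature_maps, ((0, 0), (0, 0), (pad, pad), (pad, pad)))
    flipped = kernels[:, :, ::-1, ::-1]      # flip for a true convolution
    out = np.zeros_like(feature_maps, dtype=float)
    for dy in range(k):
        for dx in range(k):
            weight = flipped[:, :, dy, dx][:, :, None, None]
            out += weight * padded[:, :, dy:dy + h, dx:dx + w]
    return out

rng = np.random.default_rng(0)
feats = rng.random((2, 3, 6, 6))

# A centered delta kernel leaves each map unchanged.
identity = np.zeros((2, 3, 5, 5))
identity[:, :, 2, 2] = 1.0
same = cross_convolve(feats, identity)

# A delta one row below the center shifts the content down by one pixel.
shift_down = np.zeros((2, 3, 5, 5))
shift_down[:, :, 3, 2] = 1.0
moved = cross_convolve(feats, shift_down)
```

The shifted-delta case shows why a kernel can be read as a motion: translating the peak of the kernel translates the corresponding image layer.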

Network Structure

As shown in Figure 3, our network consists of five components: (a) a motion encoder, which is a variational autoencoder learning the compact representation $z$ of possible motions; (b) a kernel decoder, which learns motion kernels from the compact motion representation $z$ ; (c) an image encoder, which consists of convolutional layers extracting feature maps from the input image $I$ ; (d) a cross convolutional layer, which takes the output of the image encoder and the kernel decoder, and convolves the feature maps with motion kernels; and (e) a motion decoder which regresses the difference image from the combined feature maps. The recognition functions $g_{\mbox{{mean}}}$ and $g_{\mbox{{var}}}$ are defined by the motion encoder, whereas the generative function $f_{\mbox{{mean}}}$ is defined by the image encoder, the kernel decoder, the cross convolutional layer, and the motion decoder. We now introduce each part in detail.

During training, our motion encoder (Figure 3(a)) takes two adjacent frames in time as input, both at resolution $128\times 128$. The network then applies six $5\times 5$ convolutional and batch normalization layers (the numbers of channels are $\{96,96,128,128,256,256\}$) to the concatenated images, with some pooling layers in between. The output has a size of $256\times 5\times 5$. The motion encoder then reshapes the output to a vector, and splits it into a $3,200$-dimensional mean vector and a $3,200$-dimensional variance vector, from which the network samples the latent motion representation $z$.

Next, the kernel decoder (Figure 3(b)) sends the $3200=128\times 5\times 5$ tensor into two additional convolutional layers, each with $128$ channels and a kernel size of $5$ . They are then split into four sets, each with $32$ kernels of size $5\times 5$ .

Our image encoder (Figure 3(c)) operates on four different scaled versions of the input image $I$ ($256\times 256$, $128\times 128$, $64\times 64$, and $32\times 32$). At each scale, there are four sets of $5\times 5$ convolutional and batch normalization layers (the numbers of channels are $\{64,64,64,32\}$), two of which are followed by a $2\times 2$ max pooling layer. Therefore, the output sizes at the four scales are $32\times 64\times 64$, $32\times 32\times 32$, $32\times 16\times 16$, and $32\times 8\times 8$, respectively. This multi-scale convolutional network allows us to model both global and local structures in the image, which may have different motions.
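The tensor sizes quoted for the motion encoder, kernel decoder, and image encoder are mutually consistent, which the short check below makes explicit. The variable names are illustrative, and the check assumes that the convolutions preserve spatial size so that only the two $2\times 2$ pooling steps change the resolution.

```python
# Motion encoder: the 256 x 5 x 5 output is flattened and split in half.
encoder_out = 256 * 5 * 5
z_dim = encoder_out // 2              # one half for the mean, one for the variance

# Kernel decoder: z is viewed as a 128 x 5 x 5 tensor, then split into
# four sets of 5 x 5 kernels, one set per image scale.
kernel_channels = z_dim // (5 * 5)
kernels_per_scale = kernel_channels // 4

# Image encoder: two 2 x 2 max-pooling steps shrink each side by a factor of 4,
# and the last convolution has 32 channels.
scales = [256, 128, 64, 32]
feature_sizes = [(32, s // 4, s // 4) for s in scales]
```

Each scale therefore provides 32 feature maps and receives 32 kernels, so every feature map is paired with exactly one kernel in the cross convolutional layer.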

The core of our network is a cross convolutional layer (Figure 3(d)) which, as discussed in Section 4.1, applies the kernels learned by the kernel decoder to the feature maps learned by the image encoder, respectively. The output size of the cross convolutional layer is identical to that of the image encoder.

Our motion decoder (Figure 3(e)) starts with an up-sampling layer at each scale, making the output of all scales of the cross convolutional layer have a resolution of $64\times 64$ . This is then followed by one $9\times 9$ and two $1\times 1$ convolutional and batch normalization layers, with $\{128,128,3\}$ channels. These final feature maps are then used to regress the output difference image (Eulerian motion map).

During training, the image encoder takes a single frame $I^{(i)}$ as input, and the motion encoder takes both $I^{(i)}$ and the difference image $v^{(i)}=J^{(i)}-I^{(i)}$ as input, where $J^{(i)}$ is the next frame. The network aims to regress the difference image using an L2 loss.

During testing, the image encoder still sees a single image $I$ ; however, instead of using a motion encoder, we directly sample motion vectors $z^{(j)}$ from the prior distribution $p_{z}(z)$ . In practice, we use an empirical distribution of $z$ over all training samples as an approximation to the prior, as we find it produces better synthesis results. The network then synthesizes possible difference images $v^{(j)}$ by taking sampled latent representations $z^{(j)}$ and an RGB image $I$ as input. We then generate a set of future frames $\{J^{(j)}\}$ from these difference images: $J^{(j)}=I+v^{(j)}$ .
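The test-time procedure can be sketched as below. The sketch makes several assumptions that are not specified in the text: the empirical distribution is represented as a stored bank of latent vectors (`z_bank`) that is resampled uniformly, and `toy_generator` is a hypothetical stand-in for the trained generative network.

```python
import numpy as np

def sample_future_frames(image, z_bank, generator, num_samples, rng):
    """Draw latent codes from an empirical bank and decode future frames."""
    idx = rng.integers(0, len(z_bank), size=num_samples)
    frames = []
    for z in z_bank[idx]:
        v = generator(image, z)       # predicted difference image
        frames.append(image + v)      # J = I + v
    return np.stack(frames)

def toy_generator(image, z):
    """Hypothetical decoder: a uniform intensity change set by z[0]."""
    return np.full_like(image, z[0])

rng = np.random.default_rng(0)
image = rng.random((4, 4, 3))
z_bank = rng.standard_normal((100, 8))   # stand-in for latent codes from training
frames = sample_future_frames(image, z_bank, toy_generator, num_samples=5, rng=rng)
```

Each call yields a different set of plausible next frames for the same input image, since only the sampled latent codes change.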

Evaluations

We now present a series of experiments to evaluate our method. We start with a dataset of 2D shapes, which serves to benchmark our model on objects with simple yet nontrivial motion distributions. Following Reed et al. (2015), we then test our method on a dataset of video game sprites (Liberated Pixel Cup: http://lpc.opengameart.org) with diverse motions. In addition to these synthetic datasets, we further evaluate our framework on a new real-world video dataset. Again, note that our model uses consecutive frames for training, requiring no supervision. Experimental results are also available on our project page (http://visualdynamics.csail.mit.edu), which offers a better visualization.

The synthetic 2D shape dataset contains three types of objects: circles, squares, and triangles, where circles always move vertically, squares horizontally, and triangles diagonally. The motions of circles and squares are independent; however, the motions of circles and triangles are strongly correlated. The shapes can be heavily occluded, and their sizes, positions, and colors are chosen randomly. We synthesized $20,000$ image pairs for training, and $500$ for testing.

Results are shown in Figure 4. Figure 4(a) and (b) show a sample of consecutive frames in the dataset, and Figure 4(c) shows the reconstruction of the second frame after encoding and decoding with the ground truth image. Figure 4(d) and (e) show samples of the second frame; in these results the network only takes the first image as input, and the compact motion representation, $z$ , is randomly sampled from the prior distribution $p_{z}(z)$ . Note that the network is able to capture the distinctive motion pattern for each shape, including the strong correlation of triangle and circle motion. To quantitatively evaluate our algorithm, we compare the velocity distributions of circles, squares, and triangles in the sampled images with their ground truth distributions. We sampled $50,000$ images and used the optical flow package by Liu (2009) to calculate the speed of each object. We compare our algorithm with a simple baseline that copies the optical flow field from the training set (‘Flow’ in Figure 4); for each test image, we find its 10-nearest neighbors in the training set, and randomly transfer one of the corresponding optical flow fields. To illustrate the advantage of using a variational autoencoder over a standard autoencoder, we also modify our network by removing the KL-divergence loss and sampling layer (‘AE’ in Figure 4). Figure 4 shows our predicted distribution is very close to the ground-truth distribution, and our algorithm performs much better than the naive flow field transfer. It also shows that a variational autoencoder helps to capture the true distribution of future frames.
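The flow-transfer baseline described above can be sketched as a nearest-neighbor lookup. The text does not state which distance is used to find neighbors, so the squared pixel distance below is an assumption, as are the function name `transfer_flow` and the random stand-in data.

```python
import numpy as np

def transfer_flow(test_image, train_images, train_flows, rng, k=10):
    """Pick the flow field of one of the k nearest training images at random."""
    # Assumed metric: sum of squared pixel differences.
    dists = np.sum((train_images - test_image) ** 2, axis=(1, 2, 3))
    nearest = np.argsort(dists)[:k]
    choice = rng.choice(nearest)
    return train_flows[choice], choice

rng = np.random.default_rng(0)
train_images = rng.random((50, 8, 8, 3))         # stand-in training frames
train_flows = rng.standard_normal((50, 8, 8, 2)) # stand-in optical flow fields
test_image = train_images[7] + 1e-3              # nearly identical to sample 7

flow, idx = transfer_flow(test_image, train_images, train_flows, rng, k=1)
_, idx10 = transfer_flow(test_image, train_images, train_flows, rng)
```

Because the transferred flow is copied from another image, this baseline ignores where the objects actually are in the test image, which is one reason it serves as a weak reference point.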

Movement of Video Game Sprites

We then evaluate our framework on a video game sprites dataset, also used by Reed et al. (2015), where characters have more complicated motion. The dataset consists of $672$ unique characters, and for each character there are $5$ animations (spellcast, thrust, walk, slash, shoot) from $4$ different viewpoints. Each animation ranges from $6$ to $13$ frames. We collect $102,364$ pairs of neighboring frames for training, and $3,140$ pairs for testing. When building the dataset, we ensure that the same character will not appear in both the training and the testing sets. Synthesized sample frames are shown in Figure 5. The result shows that our algorithm is able to capture various possible motions from a single input frame that are consistent with the motions in the training set.

For a quantitative evaluation, we conduct behavioral experiments on Amazon Mechanical Turk. We randomly select $200$ images, sample possible next frames using our algorithm, and show them to multiple human subjects as an animation side by side with the ground truth animation. We then ask the subject to choose which animation is real (not synthesized). An ideal algorithm should achieve a success rate of $50\%$. In our experiments, we present the animation in both the original resolution ($64\times 64$) and a lower resolution ($32\times 32$). We only evaluate on subjects that have a past approval rating of $>95\%$ and also pass our qualification tests. Figure 5 shows that our algorithm significantly outperforms a baseline algorithm that warps an input image by transferring a randomly selected flow field from the training set. Subjects are more easily fooled by the $32\times 32$ pixel images, as it is harder to hallucinate realistic details in high-resolution images.

Movement in Real Videos Captured in the Wild

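Table 2: Mean squared pixel error on test analogies, by animation. The last column is the average over the five animations.

Model | Per-animation error | Average
+ Cls (Reed et al., 2015) | 13.3, 24.6, 17.2, 18.9, 40.8 | 23.0
Our Model | 9.5, 11.5, 11.1, 28.2, 19.0 | 15.9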

To demonstrate that our algorithm can also handle real videos, we collect $20$ workout videos from YouTube, each about $30$ to $60$ minutes long. We first apply motion stabilization to the training data as a pre-processing step to remove camera motion. We then extract $56,838$ pairs of frames for training and $6,243$ pairs for testing. The training and testing pairs come from different video sequences. Figure 6 shows that our framework works well in predicting the movement of the legs and torso. Additionally, Mechanical Turk behavior experiments show that the synthesized frames are visually realistic.

Zero-Shot Visual Analogy-Making

Inspired by some recent work on visual analogy-making (Reed et al., 2015; Sadeghi et al., 2015), we demonstrate that our framework can also be easily applied to the same task, even without supervision on analogies during training.

Specifically, Reed et al. (2015) studied the problem of inferring the relationship between a pair of images and synthesizing a new image by applying the inferred relationship to a new input image. Our motion encoder, which aims to extract motion information from two consecutive frames, can also be used to extract and synthesize relationships between pairs of images, as shown in Figure 7. In addition to qualitative experiments, we also evaluate our cross-convolutional network on zero-shot visual analogy-making quantitatively, and show the results in Table 2. Although our method requires no analogy supervision, it still performs better than those introduced in Reed et al. (2015), which required visual analogy labels during training.

Visualizing Feature Maps

We visualize the learned feature maps (see Figure 3(c)) in Figure 8. Even without supervision, our network learns to detect objects or contours in the image. For example, we see that the network automatically learns object (triangle and circle) detectors and edge detectors on the shape dataset. It also learns a hair detector and a body detector on the sprites and exercise datasets, respectively.

Dimension of Latent Representation $z$

Although our latent motion representation $z$ has $3,200$ dimensions, its intrinsic dimensionality is much smaller. Table 1 shows the number of non-zero elements in predicted $z_{\mbox{{mean}}}$ for $1,000$ test samples. Note $z_{\mbox{{mean}}}$ is very sparse. We further run principal component analysis (PCA) on the $z_{\mbox{{mean}}}$s and find that fewer than $30$ principal components are needed to cover $95\%$ of the variance. This indicates that our network has learned a sparse representation of motion in an unsupervised fashion, and encodes high-level knowledge using a small number of bits, rather than simply remembering the difference images. It automatically learns this sparse representation due to the use of the KL-divergence criterion in Eq. 2, which forces the latent representation $z$ to carry minimal information, as discussed by Hinton and Van Camp (1993) and concurrently by Higgins et al. (2016).
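The PCA analysis described above amounts to counting how many principal components are needed to reach a given fraction of the total variance. A minimal sketch follows; the function name and the synthetic low-rank data (five hidden factors in 200 dimensions, rather than real $z_{\text{mean}}$ vectors) are assumptions used only to make the example self-contained.

```python
import numpy as np

def num_components_for_variance(z_means, fraction=0.95):
    """Smallest number of principal components covering `fraction` of the variance."""
    centered = z_means - z_means.mean(axis=0)
    singular_values = np.linalg.svd(centered, compute_uv=False)
    explained = singular_values**2            # proportional to per-component variance
    cumulative = np.cumsum(explained) / explained.sum()
    return int(np.searchsorted(cumulative, fraction) + 1)

rng = np.random.default_rng(0)
# Synthetic stand-in: 1,000 latent vectors driven by only 5 hidden factors.
factors = rng.standard_normal((1000, 5))
mixing = rng.standard_normal((5, 200))
z_means = factors @ mixing

n_components = num_components_for_variance(z_means)
```

On data generated from five factors, the returned count is at most five, mirroring how a small count on the real latent vectors indicates a low intrinsic dimensionality.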

Conclusion

In this paper, we have proposed a novel framework that can sample future frames from a single input image. Our method incorporates a variational autoencoder for learning compact motion representations, and a novel cross convolutional layer for regressing Eulerian motion maps. We have demonstrated that our framework works well on both synthetic and real-life videos.

More generally, results suggest that our probabilistic visual dynamics model may be useful for additional applications, such as inferring objects’ higher-order relationships by examining correlations in their motion distributions. Furthermore, this learned representation could be potentially used as a sophisticated motion prior in other computer vision and computational photography applications.

The authors thank Yining Wang for helpful discussions. This work is in part supported by NSF Robust Intelligence 1212849 Reconstructive Recognition, ONR MURI 6923196, Adobe, and Shell Research. The authors would also like to thank Nvidia for GPU donations.