Deep multi-scale video prediction beyond mean square error

Michael Mathieu, Camille Couprie, Yann LeCun

Introduction

Unsupervised feature learning of video representations is a promising direction of research because the resources are quasi-unlimited and the progress remaining to be made in this area is substantial. In this paper, we address the problem of frame prediction. A significant difference with the more classical problem of image reconstruction (Vincent et al., 2008; Le, 2013) is that the ability of a model to predict future frames requires building accurate, non-trivial internal representations, even in the absence of other constraints (such as sparsity). Therefore, we postulate that the better the predictions of such a system are, the better the feature representation should be. Indeed, the work of Srivastava et al. (2015) demonstrates that learning representations by predicting the next sequence of image features improves classification results on two action recognition datasets. In this work, however, we focus on predicting directly in pixel space and try to address the inherent problems related to this approach.

Top-performing algorithms for action recognition exploit the temporal information in a supervised way, such as the 3D convolutional network of Tran et al. (2015), or the spatio-temporal convolutional model of Simonyan & Zisserman (2014), which can require months of training and heavily labeled datasets. This cost could be reduced using unsupervised learning. Wang & Gupta (2015) compete with supervised learning performance on ImageNet by using a siamese architecture (Bromley et al., 1993) to mine positive and negative examples from patch triplets of videos in an unsupervised fashion. Unsupervised learning from video is also exploited in the work of Vondrick et al. (2015), where a convolutional model is trained to predict sets of future possible actions, and in (Jayaraman & Grauman, 2015), which focuses on learning a feature space equivariant to ego-motion. Goroshin et al. (2015) trained a convolutional network to learn to linearize motion in the code space and tested it on the NORB dataset. Besides unsupervised learning, a video predictive system may find applications in robotics (Kosaka & Kak, 1992), video compression (Ascenso et al., 2005) and inpainting (Flynn et al., 2015), to name a few.

In this work, we address the problem of lack of sharpness in the predictions. We assess different loss functions, show that generative adversarial training (Goodfellow et al., 2014; Denton et al., 2015) may be successfully employed for next frame prediction, and finally introduce a new loss based on the image gradients, designed to preserve the sharpness of the frames. Combining these two losses produces the most visually satisfying results.

Our paper is organised as follows: the model section describes the different model architectures (simple, multi-scale, adversarial) and presents our gradient difference loss function. The experimental section compares the proposed architectures and losses on video sequences from the Sports1m dataset of Karpathy et al. (2014) and UCF101 (Soomro et al., 2012). We further compare our results with those of Srivastava et al. (2015) and Ranzato et al. (2014). We measure the quality of image generation by computing similarity and sharpness measures.

Models

Let $Y=\{Y^{1},...,Y^{n}\}$ be a sequence of frames to predict from input frames $X=\{X^{1},...,X^{m}\}$ in a video sequence. Our approach is based on a convolutional network (LeCun et al., 1998), alternating convolutions and Rectified Linear Units (ReLU) (Nair & Hinton, 2010).

Therefore, the network makes a series of predictions, starting from the lowest resolution, and uses the prediction of size $s_{k}$ as a starting point to make the prediction of size $s_{k+1}$ . At the lowest scale $s_{1}$ , the network takes only $X_{1}$ as an input. This architecture is illustrated in Figure 2, and the specific details are given in Section 3. The set of trainable parameters is denoted $W_{G}$ and the minimization is performed via Stochastic Gradient Descent (SGD).
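The coarse-to-fine procedure can be illustrated with a minimal NumPy sketch. Several details here are assumptions made for illustration and are not stated in the text above: scales differ by a factor of 2, downscaling is average pooling, upscaling is nearest-neighbour, and each per-scale network predicts a residual that is added to the upscaled coarser guess. The callables in `nets` are hypothetical stand-ins for the per-scale networks.

```python
import numpy as np

def downscale(frames, factor):
    """Average-pool a (T, H, W) stack of frames by an integer factor."""
    t, h, w = frames.shape
    return frames.reshape(t, h // factor, factor, w // factor, factor).mean(axis=(2, 4))

def upscale2(frame):
    """Nearest-neighbour upsampling of an (H, W) frame by a factor of 2."""
    return frame.repeat(2, axis=0).repeat(2, axis=1)

def multiscale_predict(frames, nets):
    """Coarse-to-fine prediction of one frame.

    nets[k] is a hypothetical callable (inputs_k, guess_k) -> residual_k,
    ordered from the lowest to the highest resolution.
    """
    n_scales = len(nets)
    guess = None
    for k, net in enumerate(nets):
        factor = 2 ** (n_scales - 1 - k)
        x_k = downscale(frames, factor) if factor > 1 else frames
        if guess is None:
            # Lowest scale: no coarser prediction exists yet (assumed zero guess).
            guess = np.zeros_like(x_k[-1])
        else:
            # Use the previous, coarser prediction as the starting point.
            guess = upscale2(guess)
        guess = guess + net(x_k, guess)
    return guess
```

With a stand-in network such as `lambda x_k, guess: x_k[-1] - guess`, the pipeline reduces to copying the last input frame, which is a convenient sanity check for the plumbing between scales.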

Despite the multi-scale architecture, the search for $Y$ from $X$ without making any assumption on the space of possible configurations still leads to blurry predictions, because of Problem 2. In order to further reduce this effect, the next two sections introduce an adversarial strategy and the image gradient difference loss.

2 Adversarial training

Generative adversarial networks were introduced by Goodfellow et al. (2014), where image patches are generated from random noise using two networks trained simultaneously. In that work, the authors propose to use a discriminative network $D$ to estimate the probability that a sample comes from the dataset instead of being produced by a generative model $G$ . The two models are simultaneously trained so that $G$ learns to generate frames that are hard for $D$ to classify, while $D$ learns to discriminate the frames generated by $G$ . Ideally, when $G$ is trained, it should not be possible for $D$ to perform better than chance.

We adapted this approach for the purpose of frame prediction, which constitutes, to our knowledge, the first application of adversarial training to video prediction. The generative model $G$ is typically the one described in the previous section. The discriminative model $D$ takes a sequence of frames, and is trained to predict the probability that the last frames of the sequence are generated by $G$ . Note that only the last frames are either real or generated by $G$ ; the rest of the sequence is always from the dataset. This allows the discriminative model to make use of temporal information, so that $G$ learns to produce sequences that are temporally coherent with its input. Since $G$ is conditioned on the input frames $X$ , there is variability in the input of the generator even in the absence of noise, so noise is no longer a necessity. We trained the network with and without adding noise and did not observe any difference. The results we present are obtained without random noise.

The discriminative model $D$ is a multi-scale convolutional network with a single scalar output. The training of the pair ( $G$ , $D$ ) consists of two alternating steps, described below. For the sake of clarity, we assume that we use pure SGD (minibatches of size 1), but it is straightforward to generalize the algorithm to minibatches of size $M$ by summing the losses over the samples.

Let $(X,Y)$ be a sample from the dataset. Note that $X$ (respectively $Y$ ) is a sequence of $m$ (respectively $n$ ) frames. We train $D$ to classify the input $(X,Y)$ into class $1$ and the input $(X,G(X))$ into class $0$ . More precisely, for each scale $k$ , we perform one SGD iteration of $D_{k}$ while keeping the weights of $G$ fixed. It is trained with the target $1$ for the datapoint $(X_{k},Y_{k})$ , and the target $0$ for $(X_{k},G_{k}(X_{k}))$ . Therefore, the loss function we use to train $D$ is

where $L_{bce}$ is the binary cross-entropy loss, defined as

where $Y_{i}$ takes its values in $\{0,1\}$ and $\hat{Y_{i}}$ in $[0,1]$.
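As an illustration of the discriminator step, the following minimal NumPy sketch evaluates a binary cross-entropy and sums it over scales with target $1$ for real pairs and target $0$ for generated pairs, as described above. The function names, the argument order `bce(pred, target)` and the clipping constant `eps` are assumptions of this sketch, and the inputs are taken to be the scalar outputs of each $D_{k}$ .

```python
import numpy as np

def bce(pred, target, eps=1e-12):
    """Binary cross-entropy between predictions in [0, 1] and targets in {0, 1}."""
    # Clip to avoid log(0); eps is a numerical safeguard assumed for this sketch.
    pred = np.clip(np.asarray(pred, dtype=float), eps, 1.0 - eps)
    target = np.asarray(target, dtype=float)
    return float(-np.sum(target * np.log(pred) + (1.0 - target) * np.log(1.0 - pred)))

def discriminator_loss(d_real, d_fake):
    """Sum over scales of bce(D_k(X_k, Y_k), 1) + bce(D_k(X_k, G_k(X_k)), 0).

    d_real[k] and d_fake[k] are the scalar outputs of D_k on the real and
    generated sequences at scale k.
    """
    return sum(bce(r, 1.0) + bce(f, 0.0) for r, f in zip(d_real, d_fake))
```

A discriminator that outputs 0.5 everywhere (chance level) incurs a loss of $2\log 2$ per scale, while a perfect discriminator drives the loss towards zero.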

Let $(X,Y)$ be a different data sample. While keeping the weights of $D$ fixed, we perform one SGD step on $G$ to minimize the adversarial loss:
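A minimal sketch of the quantity minimized in the generator step, under the common assumption that $G$ is trained so that $D$ assigns its outputs to class $1$ (the "real" class), summed over scales. The function name and the clipping constant `eps` are assumptions of this sketch.

```python
import numpy as np

def generator_adversarial_loss(d_fake, eps=1e-12):
    """Sum over scales of the binary cross-entropy between D_k(X_k, G_k(X_k)) and target 1.

    d_fake[k] is the scalar output of D_k on the generated sequence at scale k.
    With target 1, the cross-entropy reduces to -log(d_fake[k]).
    """
    d_fake = np.clip(np.asarray(d_fake, dtype=float), eps, 1.0)
    return float(-np.sum(np.log(d_fake)))
```

The loss decreases as the discriminator becomes more convinced that the generated frames are real, and vanishes when every $D_{k}$ outputs 1 on generated data.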

3 Image Gradient Difference Loss (GDL)

where $\alpha$ is an integer greater than or equal to 1, and $|.|$ denotes the absolute value function. To the best of our knowledge, the closest related work to this idea is the work of Mahendran & Vedaldi (2015), using a total variation regularization to generate images from learned features. Our GDL is fundamentally different in two respects. First, in (Mahendran & Vedaldi, 2015), the total variation takes only the reconstructed frame as input, whereas our loss penalises gradient differences between the prediction and the true output. Second, we chose the simplest possible image gradient by considering the differences of neighboring pixel intensities, rather than adopting a more sophisticated norm on a larger neighborhood, for the sake of keeping the training time low.
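As a hedged illustration of a loss of this kind, the NumPy sketch below compares the absolute neighbor-pixel differences of the prediction with those of the target along both image axes, and raises the discrepancy to the power $\alpha$ . The function name `gdl` and the restriction to single-channel $(H, W)$ images are assumptions of this sketch.

```python
import numpy as np

def gdl(pred, target, alpha=1):
    """Gradient difference loss between two (H, W) images.

    Penalises differences between the absolute neighbour-pixel gradients
    of the prediction and of the target, along rows and along columns.
    """
    pred = np.asarray(pred, dtype=float)
    target = np.asarray(target, dtype=float)
    # Absolute differences between vertically adjacent pixels.
    gi_pred = np.abs(pred[1:, :] - pred[:-1, :])
    gi_true = np.abs(target[1:, :] - target[:-1, :])
    # Absolute differences between horizontally adjacent pixels.
    gj_pred = np.abs(pred[:, 1:] - pred[:, :-1])
    gj_true = np.abs(target[:, 1:] - target[:, :-1])
    return float(np.sum(np.abs(gi_true - gi_pred) ** alpha)
                 + np.sum(np.abs(gj_true - gj_pred) ** alpha))
```

For a target with a sharp vertical edge, `[[0, 1], [0, 1]]`, a uniformly gray prediction has no gradients at all and is penalised for the missing edge, whereas a prediction that reproduces the edge exactly yields a loss of zero.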

4 Combining losses

In our experiments, we combine the losses previously defined with different weights. The final loss is:
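A minimal sketch of such a weighted combination, assuming each loss has already been evaluated to a scalar. The term names (`adv`, `gdl`) and the weight values used in the usage note are hypothetical placeholders chosen for illustration, not the settings used in the experiments.

```python
def combined_loss(losses, weights):
    """Weighted sum of scalar loss terms; every term must have a weight."""
    missing = set(losses) - set(weights)
    if missing:
        raise KeyError(f"no weight given for: {sorted(missing)}")
    return sum(weights[name] * value for name, value in losses.items())
```

For example, `combined_loss({"adv": 2.0, "gdl": 4.0}, {"adv": 0.5, "gdl": 1.0})` weights the adversarial term by 0.5 and the gradient term by 1.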

Experiments

We now provide a quantitative evaluation of the quality of our video predictions on UCF101 (Soomro et al., 2012) and Sports1m (Karpathy et al., 2014) video clips. We train and compare two configurations: (1) We use $4$ input frames to predict one future frame. In order to generate further into the future, we apply the model recursively by using the newly generated frame as an input. (2) We use $8$ input frames to produce $8$ frames simultaneously. This second configuration represents a significantly harder problem and is presented in the Appendix.
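The recursive use of a one-step predictor in configuration (1) can be sketched as follows. The callable `model` is a hypothetical stand-in that maps a stack of `n_inputs` frames to the next frame; the function name and signature are assumptions of this sketch.

```python
import numpy as np

def predict_recursively(model, frames, n_future, n_inputs=4):
    """Roll a one-step predictor forward by feeding its outputs back in.

    model:  hypothetical callable mapping an (n_inputs, H, W) array to an (H, W) frame.
    frames: array of at least n_inputs observed frames, shape (T, H, W).
    """
    history = list(frames[-n_inputs:])
    predictions = []
    for _ in range(n_future):
        # The input window slides forward and gradually fills with generated frames.
        next_frame = model(np.stack(history[-n_inputs:]))
        predictions.append(next_frame)
        history.append(next_frame)
    return np.stack(predictions)
```

After `n_inputs` steps the input window consists entirely of generated frames, so prediction errors can accumulate over time.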

2 Network architecture

The architecture of the generative model $G$ is presented in Table 1. It contains padded convolutions interlaced with ReLU nonlinearities. A hyperbolic tangent (Tanh) is added at the end of the model to ensure that the output values are between -1 and 1. The learning rate $\rho_{G}$ starts at $0.04$ and is reduced over time to $0.005$ . The minibatch size is set to $4$ , or $8$ in the case of the adversarial training, to take advantage of GPU hardware capabilities. We train the network on small patches, and since it is fully convolutional, we can seamlessly apply it to larger images at test time.

The discriminative model $D$ , also presented in Table 1, uses standard non-padded convolutions followed by fully connected layers and ReLU nonlinearities. For the largest scale $s_{4}$ , a $2\times 2$ pooling is added after the convolutions. The network is trained by setting the learning rate $\rho_{D}$ to $0.02$ .

3 Quantitative evaluations

To evaluate the quality of the image predictions resulting from the different tested systems, we compute the Peak Signal to Noise Ratio (PSNR) between the true frame $Y$ and the prediction $\hat{Y}$ :

where $\max_{\hat{Y}}$ is the maximum possible value of the image intensities. We also provide the Structural Similarity Index Measure (SSIM) of Wang et al. (2004). It ranges between -1 and 1, a larger score meaning a greater similarity between the two images.
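For reference, a minimal NumPy sketch of the PSNR in its standard form, $10\log_{10}$ of the squared maximum intensity divided by the mean squared error. The default `max_val=1.0` is an assumption of this sketch and must be set to match the actual intensity range of the images.

```python
import numpy as np

def psnr(target, pred, max_val=1.0):
    """Peak Signal to Noise Ratio (in dB) between a true frame and a prediction."""
    target = np.asarray(target, dtype=float)
    pred = np.asarray(pred, dtype=float)
    mse = np.mean((target - pred) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return float(10.0 * np.log10(max_val ** 2 / mse))
```

A uniform error of 0.1 on images in $[0,1]$ gives a mean squared error of 0.01 and therefore a PSNR of about 20 dB.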

To measure the loss of sharpness between the true frame and the prediction, we define the following sharpness measure based on the difference of gradients between two images $Y$ and $\hat{Y}$ :

where $\nabla_{i}Y=|Y_{i,j}-Y_{i-1,j}|$ and $\nabla_{j}Y=|Y_{i,j}-Y_{i,j-1}|$ .
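A hedged sketch of one such measure, built by analogy with the PSNR: the mean absolute difference between the summed gradients $\nabla_{i}+\nabla_{j}$ of the two images takes the place of the mean squared error. The exact normalization, the cropping to pixels where both gradients are defined, and the default `max_val` are assumptions of this sketch.

```python
import numpy as np

def grad_sum(img):
    """Sum of absolute vertical and horizontal neighbour differences.

    Evaluated at pixels (i, j) with i >= 1 and j >= 1, where both
    |Y[i, j] - Y[i-1, j]| and |Y[i, j] - Y[i, j-1]| are defined.
    """
    img = np.asarray(img, dtype=float)
    grad_i = np.abs(img[1:, 1:] - img[:-1, 1:])
    grad_j = np.abs(img[1:, 1:] - img[1:, :-1])
    return grad_i + grad_j

def sharpness_difference(target, pred, max_val=1.0):
    """PSNR-like score (in dB) on the difference of image gradients; larger is better."""
    diff = np.mean(np.abs(grad_sum(target) - grad_sum(pred)))
    if diff == 0:
        return float("inf")  # identical gradient maps
    return float(10.0 * np.log10(max_val ** 2 / diff))
```

A prediction that smooths out an edge present in the true frame has smaller gradients at that location and therefore receives a lower score than one that preserves the edge.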

As for the other measures, a larger score is better. These quantitative measures on 378 test videos from UCF101 (we extracted every tenth video from the test set list, starting at 1, 11, 21, etc.) are given in Table 2. As it is trivial to predict pixel values in static areas, especially on the UCF101 dataset where most of the images are still, we performed our evaluation in the moving areas as displayed in Figure 3. To this end, we use the EpicFlow method of Revaud et al. (2015), and compute the different quality measures only in the areas where the optical flow is higher than a fixed threshold. (We use default parameters for the EpicFlow computation, and transformed the .flo file to png using the Matlab code http://vision.middlebury.edu/flow/code/flow-code-matlab.zip. If at least one color channel is lower than 0.2, with the image color range between 0 and 1, we set the corresponding pixel intensity of the output and ground truth to 0, and compute similarity measures in the resulting masked images.) Similarity and sharpness measures computed on the whole images are given in the Appendix.
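The idea of restricting the evaluation to moving areas can be sketched as below. This is a simplified illustration that thresholds the magnitude of a dense flow field directly, rather than the color-coded png procedure described above; the function name, the $(H, W, 2)$ flow layout and the threshold value are assumptions of this sketch.

```python
import numpy as np

def mask_static_areas(pred, target, flow, threshold):
    """Zero out pixels whose optical flow magnitude does not exceed the threshold.

    pred, target: (H, W) or (H, W, C) images.
    flow:         (H, W, 2) displacement field (assumed layout).
    Returns the masked prediction, the masked target and the boolean moving mask.
    """
    moving = np.linalg.norm(np.asarray(flow, dtype=float), axis=-1) > threshold
    pred = np.asarray(pred, dtype=float)
    target = np.asarray(target, dtype=float)
    # Broadcast the mask over the channel axis for color images.
    mask = moving if pred.ndim == 2 else moving[..., None]
    return np.where(mask, pred, 0.0), np.where(mask, target, 0.0), moving
```

The similarity and sharpness measures are then computed on the two masked images, so that static pixels, which are identical (zero) in both, do not contribute to the error.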

Figure 4 shows results on test sequences from the Sports1m dataset, as movements are more visible in this dataset.

4 Comparison to Ranzato et al. (2014)

In this section, we compare our results to those of Ranzato et al. (2014). To obtain grayscale images, we make RGB predictions with our Adv+GDL model and extract the Y channel. The images of Ranzato et al. (2014) are generated by averaging 64 results obtained using different tilings, which avoids a blockiness effect but creates a blurriness effect instead. We compare the PSNR and SSIM values on the first predicted images of Figure 5.

We note that the results of Ranzato et al. appear slightly lighter than our results because of a normalization that does not take place in the original images; therefore, the errors given here do not reflect the full capacity of their approach. We tried to apply the blind deconvolution method of Krishnan et al. (2011) to improve the results of Ranzato et al. and our own. As expected, the obtained sharpness scores are higher, but the image similarity measures deteriorate because the contours of the predictions often do not exactly match the targets. More importantly, the results of Ranzato et al. appear to be more static in moving areas. Visually, the optical flow result appears similar to the target, but a closer look at thin details reveals that lines and heads of people are bent or squeezed.

Conclusion

We provided a benchmark of several strategies for next frame prediction, by evaluating the quality of the prediction in terms of Peak Signal to Noise Ratio, Structural Similarity Index Measure and image sharpness. We display our results on small UCF video clips at http://cs.nyu.edu/~mathieu/iclr2016.html. The presented architectures and losses may be used as building blocks for more sophisticated prediction models, involving memory and recurrence. Unlike most optical flow algorithms, the model is fully differentiable, so it can be fine-tuned for another task if necessary. Future work will deal with the evaluation of the classification performance of the learned representations in a weakly supervised context, for instance on the UCF101 dataset. Another extension of this work could be the combination of the current system with optical flow predictions. Alternatively, we could replace optical flow predictions in applications that do not explicitly require optical flow but rather next frame predictions. A simple example is causal (where the next frame is unknown) segmentation of video streams.

We thank Florent Perronnin for fruitful discussions, and Nitish Srivastava, Marc’Aurelio Ranzato and Piotr Dollár for providing us with their results on some video sequences.

References

Appendix

In this section, we trained our different multi-scale models (architecture described in Table 3) with $8$ input frames to predict $8$ frames simultaneously. Image similarity measures are given between the ground truth and the predictions in Table 4.

Compared to the recursive frame prediction employed in the rest of the paper, predicting several frames simultaneously leads to better long-term results but worse short-term ones. The gap between the two performances could be reduced by the design of time multi-scale strategies.

2 Comparison to the LSTM approach of Srivastava et al. (2015)

3 Additional results on the UCF101 dataset

We trained the model described in Table 1 with our different losses to predict 1 frame from the 4 previous ones. We provide in Table 5 similarity (PSNR and SSIM) and sharpness measures between the predictions of the different tested models and the frame to predict. The evaluation is performed on the full images but is not really meaningful, because predicting the future location of static pixels is most accurately done by copying the last input frame.