Unsupervised Learning of Disentangled Representations from Video

Remi Denton, Vighnesh Birodkar

Introduction

Unsupervised learning from video is a long-standing problem in computer vision and machine learning. The goal is to learn, without explicit labels, a representation that generalizes effectively to a previously unseen range of tasks, such as semantic classification of the objects present, predicting future frames of the video or classifying the dynamic activity taking place. There are several prevailing paradigms: the first, known as self-supervision, uses domain knowledge to implicitly provide labels (e.g. predicting the relative position of patches on an object or using feature tracks). This allows the problem to be posed as a classification task with self-generated labels. The second general approach relies on auxiliary action labels, available in real or simulated robotic environments. These can either be used to train action-conditional predictive models of future frames or inverse-kinematics models which attempt to predict actions from current and future frame pairs. The third and most general approaches are predictive auto-encoders (e.g.) which attempt to predict future frames from current ones. To learn effective representations, some kind of constraint on the latent representation is required.

In this paper, we introduce a form of predictive auto-encoder which uses a novel adversarial loss to factor the latent representation for each video frame into two components, one that is roughly time-independent (i.e. approximately constant throughout the clip) and another that captures the dynamic aspects of the sequence, thus varying over time. We refer to these as content and pose components, respectively. The adversarial loss relies on the intuition that while the content features should be distinctive of a given clip, individual pose features should not. Thus the loss encourages pose features to carry no information about clip identity. Empirically, we find training with this loss to be crucial to inducing the desired factorization.

We explore the disentangled representation produced by our model, which we call Disentangled-Representation Net (DrNet), on a variety of tasks. The first of these is predicting future video frames, something that is straightforward to do using our representation. We apply a standard LSTM model to the pose features, conditioning on the content features from the last observed frame. Despite the simplicity of our model relative to other video generation techniques, we are able to generate convincing long-range frame predictions, out to hundreds of time steps in some instances. This is significantly further than existing approaches that use real video data. We also show that DrNet can be used for classification. The content features capture the semantic content of the video and thus can be used to predict object identity. Alternatively, the pose features can be used for action prediction.

Related work

On account of its natural invariances, image data lends itself to an explicit “what” and “where” representation. The capsule model of Hinton et al. performed this separation via an explicit auto-encoder structure. Zhao et al. proposed a multi-layered version, which has similarities to ladder networks. These methods all operate on static images, whereas our approach uses temporal structure to separate the components.

Other approaches explore general methods for learning disentangled representations from video. Kulkarni et al. show how explicit graphics code can be learned from datasets with systematic dimensions of variation. Whitney et al. use a gating principle to encourage each dimension of the latent representation to capture a distinct mode of variation.

A range of generative video models, based on deep nets, have recently been proposed. Ranzato et al. adopt a discrete vector quantization approach inspired by text models. Srivastava et al. use LSTMs to generate entire frames. Video Pixel Networks use these models in a conditional manner, generating one pixel at a time in raster-scan order (similar image models include ). Finn et al. use an LSTM framework to model motion via transformations of groups of pixels. Cricri et al. use a ladder of stacked autoencoders. Other works predict optical flow fields that can be used to extrapolate motion beyond the current frame, e.g. . In contrast, a single pose vector is predicted in our model, rather than a spatial field.

Chiappa et al. and Oh et al. focus on prediction in video game environments, where known actions at each frame permit action-conditional generative models that can give accurate long-range predictions. In contrast to the above works, whose latent representations combine both content and motion, our approach relies on a factorization of the two, with a predictive model only being applied to the latter. Furthermore, we do not attempt to predict pixels directly, instead applying the forward model in the latent space. Chiappa et al., like our approach, produce convincing long-range generations. However, the video game environment is somewhat more constrained than the real-world video we consider since actions are provided during generation.

Approach

In our model, two separate encoders produce distinct feature representations of content and pose for each frame. They are trained by requiring that the content representation of frame $x^{t}$ and the pose representation of future frame $x^{t+k}$ can be combined (via concatenation) and decoded to predict the pixels of future frame $x^{t+k}$. However, this reconstruction constraint alone is insufficient to induce the desired factorization between the two encoders. We thus introduce a novel adversarial loss on the pose features that prevents them from being discriminable from one video to another, thus ensuring that they cannot contain content information. A further constraint, motivated by the notion that content information should vary slowly over time, encourages temporally close content vectors to be similar to one another.

The loss function used during training has several terms:

Note that many recent works on video prediction rely on more complex losses that can capture uncertainty, such as GANs.
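The reconstruction term, referred to later as $\mathcal{L}_{reconstruction}$, follows from the description at the start of this section: decode the concatenation of $h_{c}^{t}$ and $h_{p}^{t+k}$ and compare the result with $x^{t+k}$. The sketch below assumes a per-pixel squared-error penalty, $\|D(h_{c}^{t},h_{p}^{t+k})-x^{t+k}\|_{2}^{2}$, which is a reconstruction from the surrounding text. The function names and the toy linear decoder are hypothetical stand-ins, not the architectures of Section 4.1.

```python
import numpy as np

def reconstruction_loss(decode, h_c_t, h_p_tk, x_tk):
    """Squared error between the decoded prediction and the true frame x^{t+k}.

    decode : callable mapping a concatenated latent vector to a frame (assumed).
    h_c_t  : content vector of frame x^t.
    h_p_tk : pose vector of frame x^{t+k}.
    x_tk   : ground-truth frame x^{t+k}.
    """
    latent = np.concatenate([h_c_t, h_p_tk])  # combine content and pose
    x_pred = decode(latent)
    return float(np.sum((x_pred - x_tk) ** 2))

# Toy usage with a hypothetical linear "decoder" onto an 8x8 frame.
rng = np.random.default_rng(0)
W = rng.normal(size=(64, 128 + 5))  # |h_c| = 128, |h_p| = 5

def toy_decode(z):
    return (W @ z).reshape(8, 8)

h_c, h_p = rng.normal(size=128), rng.normal(size=5)
target = toy_decode(np.concatenate([h_c, h_p]))
loss_exact = reconstruction_loss(toy_decode, h_c, h_p, target)
loss_offset = reconstruction_loss(toy_decode, h_c, h_p, target + 1.0)
```

In the toy usage, a perfect prediction gives zero loss, and a uniform offset of one intensity unit on each of the 64 pixels gives a loss of about 64.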

Similarity loss: To ensure the content encoder extracts mostly time-invariant representations, we penalize the squared error between the content features $h_{c}^{t},h_{c}^{t+k}$ of neighboring frames, with offset $k\in[0,K]$:
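Read literally, this penalty is $\|h_{c}^{t}-h_{c}^{t+k}\|_{2}^{2}$ with $h_{c}=E_{c}(x)$. A minimal NumPy sketch under that reading follows. The name `similarity_loss` and the uniform sampling of $k$ are assumptions made for illustration.

```python
import numpy as np

def similarity_loss(h_c_t, h_c_tk):
    """Squared error between the content vectors of two nearby frames."""
    return float(np.sum((h_c_t - h_c_tk) ** 2))

def sample_offset(rng, K):
    """Draw a frame offset k uniformly from [0, K], both ends inclusive."""
    return int(rng.integers(0, K + 1))
```

The loss is zero when both frames map to the same content vector, which is the behavior the term rewards.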

Adversarial loss: We now introduce a novel adversarial loss that exploits the fact that the objects present do not typically change within a video, but they do between different videos. Our desired disentanglement would thus have the content features be (roughly) constant within a clip, but distinct between clips. This implies that the pose features should not carry any information about the identity of objects within a clip.

We impose this via an adversarial framework between the scene discriminator network $C$ and pose encoder $E_{p}$, shown in Fig. 1. The latter provides pairs of pose vectors, either computed from the same video $(h_{p,i}^{t},h_{p,i}^{t+k})$ or from different ones $(h_{p,i}^{t},h_{p,j}^{t+k})$, for some other video $j$. The discriminator then attempts to classify the pair as being from the same/different video using a cross-entropy loss:
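For a binary same/different decision, a standard cross-entropy takes the form $-\log C(h_{p,i}^{t},h_{p,i}^{t+k})-\log\big(1-C(h_{p,i}^{t},h_{p,j}^{t+k})\big)$, where $C(\cdot)$ is read as the predicted probability of "same video". The sketch below assumes that form. The function name and the `eps` guard are illustrative additions.

```python
import numpy as np

def discriminator_loss(p_same_pair, p_diff_pair, eps=1e-12):
    """Binary cross-entropy for the scene discriminator C.

    p_same_pair : C's predicted probability of "same video" for a pair of
                  pose vectors taken from the same clip (target label 1).
    p_diff_pair : the same output for a pair from different clips (target 0).
    eps         : small constant guarding against log(0).
    """
    return float(-np.log(p_same_pair + eps) - np.log(1.0 - p_diff_pair + eps))
```

A discriminator at chance (both outputs 0.5) incurs a loss of $2\log 2$, while confident correct outputs drive the loss toward zero.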

The other half of the adversarial framework imposes a loss function on the pose encoder $E_{p}$ that tries to maximize the uncertainty (entropy) of the discriminator output on pairs of frames from the same clip:
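One common way to push a binary classifier toward maximal uncertainty is to minimize its cross-entropy against a uniform target, $-\tfrac{1}{2}\log C(\cdot)-\tfrac{1}{2}\log(1-C(\cdot))$, which is smallest when the output is 0.5. The sketch below assumes this form. It is an illustration of the stated goal, and other entropy-based formulations would serve the same purpose.

```python
import math

def pose_adversarial_loss(p_same, eps=1e-12):
    """Cross-entropy of C's output against a uniform (0.5, 0.5) target.

    p_same : C's predicted probability of "same video" for a pair of pose
             vectors that do come from the same clip.
    The value is minimised when p_same == 0.5, i.e. when the discriminator
    is maximally uncertain.
    """
    return -0.5 * math.log(p_same + eps) - 0.5 * math.log(1.0 - p_same + eps)
```

Gradients of this quantity would flow only into $E_{p}$, so the encoder is rewarded for pose vectors on which $C$ cannot do better than chance.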

Thus the pose encoder is encouraged to produce features that the discriminator cannot classify as coming from the same clip or not. In so doing, the pose features cannot carry information about object content, yielding the desired factorization. Note that this does assume that the object’s pose is not distinctive to a particular clip. While adversarial training is also used by GANs, our setup purely considers classification; there is no generator network, for example.

Overall training objective: During training we minimize the sum of the above losses, with respect to $E_{c},E_{p},D$ and $C$:

where $\alpha$ and $\beta$ are hyper-parameters. The first three terms can be jointly optimized, but the discriminator $C$ is updated while the other parts of the model ($E_{c},E_{p},D$) are held constant. The overall model is shown in Fig. 1. Details of the training procedure and model architectures for $E_{c},E_{p},D$ and $C$ are given in Section 4.1.
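The weighting can be sketched as follows, assuming that $\alpha$ scales the similarity term and $\beta$ scales both adversarial terms, i.e. $\mathcal{L}_{reconstruction}+\alpha\,\mathcal{L}_{similarity}+\beta\,(\mathcal{L}_{adversarial}(E_{p})+\mathcal{L}_{adversarial}(C))$. This assignment is inferred from the way $\beta$ is described as the adversarial weight in the experiments. The function names are hypothetical.

```python
def encoder_decoder_objective(l_rec, l_sim, l_adv_pose, alpha=1.0, beta=0.1):
    """Weighted sum minimised w.r.t. E_c, E_p and D, with C held fixed."""
    return l_rec + alpha * l_sim + beta * l_adv_pose

def discriminator_objective(l_adv_disc, beta=0.1):
    """Term minimised w.r.t. C only, with E_c, E_p and D held fixed."""
    return beta * l_adv_disc

# One training iteration would alternate the two updates:
#   1. step on encoder_decoder_objective (first three terms, jointly)
#   2. step on discriminator_objective  (discriminator only)
```

Splitting the objective this way mirrors the alternating schedule described above: the discriminator never receives gradients from the reconstruction or similarity terms.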

Note that while pose estimates are generated in a recurrent fashion, the content features $h^{t}_{c}$ remain fixed from the last observed real frame. This relies on the nature of $\mathcal{L}_{reconstruction}$, which ensures that content features can be combined with future pose vectors to give valid reconstructions.
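The generation loop this implies can be sketched in plain Python. Here `predict_pose` is a hypothetical stand-in for one step of the LSTM (its hidden state is omitted for brevity) and `decode` stands in for the decoder $D$; neither signature is taken from the released code.

```python
def rollout(predict_pose, decode, h_c_last, h_p_last, n_steps):
    """Generate n_steps future frames from the last observed frame.

    predict_pose(h_c, h_p) -> next pose vector (stand-in for the LSTM step).
    decode(h_c, h_p)       -> frame (stand-in for the decoder D).
    """
    frames, h_p = [], h_p_last
    for _ in range(n_steps):
        h_p = predict_pose(h_c_last, h_p)     # pose is fed back recurrently
        frames.append(decode(h_c_last, h_p))  # content is never updated
    return frames

# Toy usage: "pose" is a counter and the decoder just pairs its inputs.
toy_frames = rollout(lambda h_c, h_p: h_p + 1,
                     lambda h_c, h_p: (h_c, h_p),
                     "content", 0, 3)
```

In the toy usage every generated frame carries the same content value while the pose advances by one per step.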

3.2 Classification

Another application of our disentangled representation is to use it for classification tasks. Content features, which are trained to be invariant to local temporal changes, can be used to classify the semantic content of an image. Conversely, a sequence of pose features can be used to classify actions in a video sequence. In either case, we train a two-layer classifier network $S$ on top of either $h_{c}$ or $h_{p}$, with its output predicting the class label $y$.

Experiments

We evaluate our model on both synthetic (MNIST, NORB, SUNCG) and real (KTH Actions) video datasets. We explore several tasks with our model: (i) the ability to cleanly factorize into content and pose components; (ii) forward prediction of video frames using the approach from Section 3.1; (iii) using the pose/content features for classification tasks.

We explored a variety of convolutional architectures for the content encoder $E_{c}$, pose encoder $E_{p}$ and decoder $D$. For MNIST, $E_{c},E_{p}$ and $D$ all use a DCGAN architecture with $\left|{h_{p}}\right|=5$ and $\left|{h_{c}}\right|=128$. The encoders consist of 5 convolutional layers with subsampling. Batch normalization and Leaky ReLUs follow each convolutional layer except the final layer, which normalizes the pose/content vectors to have unit norm. The decoder is a mirrored version of the encoder with 5 deconvolutional layers and a sigmoid output layer.
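As a rough check on how 5 subsampling convolutions can reduce a 64x64 frame to a single feature vector, the sketch below traces spatial sizes through an encoder and shows the final unit-norm step. The kernel, stride and padding values are assumed DCGAN-style settings chosen for illustration; they are not stated in the text.

```python
import numpy as np

def conv_out(size, kernel, stride, pad):
    """Spatial output size of a convolution along one dimension."""
    return (size + 2 * pad - kernel) // stride + 1

# Assumed (kernel, stride, pad) per layer: four stride-2 layers, then a
# final 4x4 convolution that collapses the 4x4 map to 1x1.
layers = [(4, 2, 1)] * 4 + [(4, 1, 0)]
sizes = [64]
for k, s, p in layers:
    sizes.append(conv_out(sizes[-1], k, s, p))

def unit_norm(h, eps=1e-12):
    """Scale a pose/content vector to unit Euclidean norm."""
    return h / (np.linalg.norm(h) + eps)
```

With these assumed settings the spatial size goes 64, 32, 16, 8, 4, 1, so the last layer's channel count directly sets $\left|{h_{p}}\right|$ or $\left|{h_{c}}\right|$.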

For both NORB and SUNCG, $D$ is a DCGAN architecture while $E_{c}$ and $E_{p}$ use a ResNet-18 architecture up until the final pooling layer with $\left|{h_{p}}\right|=10$ and $\left|{h_{c}}\right|=128$.

For KTH, $E_{p}$ uses a ResNet-18 architecture with $\left|{h_{p}}\right|=5$. $E_{c}$ uses the same architecture as VGG16 up until the final pooling layer with $\left|{h_{c}}\right|=128$. The decoder is a mirrored version of the content encoder with pooling layers replaced with spatial up-sampling. In the style of U-Net, we add skip connections from the content encoder to the decoder, enabling the model to easily generate static background features.

In all experiments the scene discriminator $C$ is a fully connected neural network with 2 hidden layers of 100 units. We trained all our models with the ADAM optimizer and learning rate $\eta=0.002$. We used $\beta=0.1$ for MNIST, NORB and SUNCG and $\beta=0.0001$ for KTH experiments. We used $\alpha=1$ for all datasets.
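A forward pass through a discriminator of this shape can be sketched in NumPy. The text fixes only the two hidden layers of 100 units; the ReLU activations, the sigmoid output, the concatenated-pair input and the weight initialization below are assumptions for illustration.

```python
import numpy as np

def init_discriminator(pose_dim, hidden=100, seed=0):
    """Random weights for a pair-input MLP with two hidden layers."""
    rng = np.random.default_rng(seed)
    dims = [2 * pose_dim, hidden, hidden, 1]  # input is a pair of pose vectors
    return [(rng.normal(scale=0.1, size=(o, i)), np.zeros(o))
            for i, o in zip(dims[:-1], dims[1:])]

def discriminator_forward(params, h_p_a, h_p_b):
    """Return the predicted probability that both poses share a video."""
    x = np.concatenate([h_p_a, h_p_b])
    for W, b in params[:-1]:
        x = np.maximum(W @ x + b, 0.0)  # assumed ReLU hidden activation
    W, b = params[-1]
    logit = (W @ x + b)[0]
    return float(1.0 / (1.0 + np.exp(-logit)))  # sigmoid output

params = init_discriminator(pose_dim=5)
p_same = discriminator_forward(params, np.ones(5), np.zeros(5))
```

Because the input is only a pair of low-dimensional pose vectors, the discriminator stays small relative to the encoders and decoder.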

For future prediction experiments we train a two-layer LSTM with 256 cells using the ADAM optimizer. On MNIST, we train the model by observing 5 frames and predicting 10 frames. On KTH, we train the model by observing 10 frames and predicting 10 frames.

4.2 Synthetic datasets

MNIST: We start with a toy dataset consisting of two MNIST digits bouncing around a 64x64 image. Each video sequence consists of a different pair of digits with independent trajectories. Fig. 3(left) shows how the content vector from one frame and the pose vector from another generate new examples that transfer the content and pose from the original frames. This demonstrates the clean disentanglement produced by our model. Interestingly, for this data we found it to be necessary to use a different color for the two digits. Our adversarial term is so aggressive that it prevents the pose vector from capturing any content information, thus without a color cue the model is unable to determine which pose information to associate with which digit. In Fig. 3(right) we perform forward modeling using our representation, demonstrating the ability to generate crisp digits 500 time steps into the future.
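The content/pose swap used for this kind of visualization amounts to decoding every combination of a content vector with a pose vector. A minimal sketch, with a hypothetical `decode` callable standing in for $D$ and strings standing in for vectors:

```python
def swap_grid(decode, contents, poses):
    """Row i, column j holds decode(content_i, pose_j)."""
    return [[decode(h_c, h_p) for h_p in poses] for h_c in contents]

# Toy usage: the "decoder" just records which content and pose it was given.
grid = swap_grid(lambda c, p: f"{c}@{p}",
                 ["digitsA", "digitsB"],
                 ["left", "right"])
```

If the factorization is clean, each row of such a grid shows the same digits in different positions and each column shows different digits in the same position.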

NORB: We apply our model to the NORB dataset, converted into videos by taking sequences of different azimuths, while holding object identity, lighting and elevation constant. Fig. 4(left) shows that our model is able to factor content and pose cleanly on held-out data. In Fig. 4(center) we train a version of our model without the adversarial loss term, which results in a significant degradation in the model: the pose vectors are no longer isolated from content. For comparison, we also show the factorizations produced by Mathieu et al., which are less clean than our approach, both in terms of disentanglement and generation quality. Table 1 shows classification results on NORB, following the training of a classifier on pose features and also content features. When the adversarial term is used ($\beta=0.1$) the content features perform well. Without the term, content features become less effective for classification.

SUNCG: We use the rendering engine from the SUNCG dataset to generate sequences where the camera rotates around a range of 3D chair models. DrNet is able to generate high quality examples of this data, as shown in Fig. 5.

4.3 KTH Action Dataset

Finally, we apply DrNet to the KTH dataset. This is a simple dataset of real-world videos of people performing one of six actions (walking, jogging, running, boxing, handwaving, hand-clapping) against fairly uniform backgrounds. In Fig. 6 we show forward generations of different held-out examples, comparing against two baselines: (i) the MCNet of Villegas et al. which, to the best of our knowledge, produces the current best quality generations on real-world video and (ii) a baseline auto-encoder LSTM model (AE-LSTM). This is essentially the same as ours, but with a single encoder whose features thus combine content and pose (as opposed to factoring them as in DrNet). It is also similar to .

Fig. 7 shows more examples, with generations out to 100 time steps. For most actions this is sufficient time for the person to have left the frame, thus further generations would be of a fixed background. In Fig. 9 we attempt to quantify the fidelity of the generations by comparing the Inception score of our approach to MCNet. This metric is used for assessing generations from GANs and is more appropriate for our scenario than traditional metrics such as PSNR or SSIM (see appendix B for further discussion). The curves show the mean scores of our generations decaying more gracefully than MCNet. Further examples and generated movies may be viewed in appendix A and also at https://sites.google.com/view/drnet-paper//.

A natural concern with high capacity models is that they might be memorizing the training examples. We probe this in Fig. 10, where we show the nearest neighbors to our generated frames from the training set. Fig. 8 uses the pose representation produced by DrNet to train an action classifier from very few examples. We extract pose vectors from video sequences of length 24 and train a fully connected classifier on these vectors to predict the action class. We compare against an autoencoder baseline, which is the same as ours but with a single encoder whose features thus combine content and pose. We find the factorization significantly boosts performance.

Figure 8: Classification of KTH actions from pose vectors with few labeled examples, with autoencoder baseline. N.B. SOA (fully supervised) is 93.9%.

Figure 9: Comparison of KTH video generation quality using Inception score. The x-axis indicates how far from the conditioned input the start of the generated sequence is.

Discussion

In this paper we introduced a model based on a pair of encoders that factor video into content and pose. This separation is achieved during training through a novel adversarial loss term. The resulting representation is versatile, in particular allowing for stable and coherent long-range prediction through nothing more than a standard LSTM. Our generations compare favorably with leading approaches, despite being a simple model, e.g. lacking the GAN losses or probabilistic formulations of other video generation approaches. Source code is available at https://github.com/edenton/drnet.

Appendix A Further KTH generations

Fig. 11 shows additional long-range KTH sequences generated from our model and MCNet. Generations in movie form are viewable at https://sites.google.com/view/drnet-paper/.

Appendix B Quantitative metrics for evaluating generations

Evaluating samples from generative models is generally problematic. Pixel-wise measures like PSNR and SSIM are appropriate when objects are well aligned, but for long-range generations this is unlikely to be the case. Fig. 12 shows sequences generated from our model and MCNet, as well as their difference with respect to the ground truth. While the person remains sharp, there is some error in the velocity prediction, which accumulates to a significant offset in position. Consequently, the resulting PSNR and SSIM scores are very misleading and we adopt the Inception score as an alternative in the main paper.
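The effect of misalignment on a pixel-wise score can be illustrated with a synthetic example. The frame, the rectangle standing in for a person and the 12-pixel offset below are all made up for illustration and are not taken from the experiments.

```python
import numpy as np

def psnr(a, b, peak=1.0):
    """Peak signal-to-noise ratio in dB between two images in [0, peak]."""
    mse = np.mean((a - b) ** 2)
    return float("inf") if mse == 0 else float(10 * np.log10(peak ** 2 / mse))

frame = np.zeros((64, 64))
frame[20:40, 10:20] = 1.0                    # sharp "person" on a dark background
shifted = np.roll(frame, 12, axis=1)         # equally sharp, wrong position
blurry = np.full_like(frame, frame.mean())   # featureless constant image

psnr_shifted = psnr(frame, shifted)
psnr_blurry = psnr(frame, blurry)
```

In this toy case the featureless constant image scores a higher PSNR than the sharp but displaced one, which is the kind of misleading ranking described above.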

Appendix C KTH experimental settings

The KTH dataset consists of 25 different subjects performing six different actions (boxing, hand waving, hand clapping, jogging, running and walking) against a static background. Each person is observed performing every action in four different scenarios with varied clothing and background conditions. Following prior work, we used persons 1-16 for training and persons 17-25 for testing. Note that this is not the standard train/test split used for KTH action classification. We also resize frames to 128 $\times$ 128 pixels.
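The subject-based split can be written down directly. The helper name `split_of` is hypothetical; the subject ranges are the ones stated above.

```python
TRAIN_SUBJECTS = range(1, 17)   # persons 1-16
TEST_SUBJECTS = range(17, 26)   # persons 17-25

def split_of(person_id):
    """Return 'train' or 'test' for a KTH subject id in 1..25."""
    if person_id in TRAIN_SUBJECTS:
        return "train"
    if person_id in TEST_SUBJECTS:
        return "test"
    raise ValueError(f"unknown KTH subject id: {person_id}")
```

Splitting by subject rather than by clip keeps every video of a given person on one side of the split, so test generations are always of unseen people.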

Appendix D Details of classification experiments

NORB object classification: We used a two-layer fully connected network with 256 hidden units as the classifier. Leaky ReLUs, batch normalization and dropout were used in every layer. We trained with ADAM and used early stopping on a validation set to prevent overfitting.

KTH action classification: We used a two-layer fully connected network with 1200 hidden units as the classifier. Leaky ReLUs, batch normalization and dropout were used in every layer. Both DrNet and the autoencoder baseline produced 24-dimensional latent vectors. The classifier was trained on sequences of length 24, so the input to the classifier was 24 $\times$ 24. We also tried an autoencoder baseline with a 128-dimensional latent space (i.e., the same dimensionality as the content vectors of DrNet) but found this model performed worse. We trained the action classifier on the pose representations from DrNet and the autoencoder for varying training set sizes. Specifically, we varied the number of subjects used in the training set from 1 to 12.
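The shapes involved can be made concrete with a forward pass in NumPy. The sketch flattens the 24 $\times$ 24 input to 576 values, applies one hidden layer of 1200 units and outputs probabilities over the six KTH actions. Batch normalization and dropout are omitted, and the Leaky ReLU slope of 0.2, the softmax output and the random weights are assumptions for illustration.

```python
import numpy as np

def leaky_relu(x, slope=0.2):
    return np.where(x > 0, x, slope * x)

def classify_actions(pose_seq, W1, b1, W2, b2):
    """Map a (24, 24) sequence of pose vectors to 6 class probabilities."""
    x = pose_seq.reshape(-1)              # 24 frames x 24 dims -> 576
    hidden = leaky_relu(W1 @ x + b1)      # 1200 hidden units
    logits = W2 @ hidden + b2             # one logit per action class
    e = np.exp(logits - logits.max())     # numerically stable softmax
    return e / e.sum()

rng = np.random.default_rng(0)
W1, b1 = rng.normal(scale=0.01, size=(1200, 576)), np.zeros(1200)
W2, b2 = rng.normal(scale=0.01, size=(6, 1200)), np.zeros(6)
probs = classify_actions(rng.normal(size=(24, 24)), W1, b1, W2, b2)
```

Flattening the sequence discards explicit temporal structure, which is workable here because the clips have a fixed length of 24 frames.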