Video Representation Learning by Dense Predictive Coding

Tengda Han, Weidi Xie, Andrew Zisserman

Introduction

Videos are very appealing as a data source for self-supervision: there is an almost infinite supply available (from YouTube etc.); image-level proxy losses can be used at the frame level; and there are plenty of additional proxy losses that can be employed from the temporal information. One of the most natural, and consequently one of the first, video proxy losses is to predict future frames in a video based on frames in the past. This has ample scope for exploration by varying the extent of the past knowledge (the temporal aggregation window used for the prediction) and also the temporal distance into the future for the predicted frames. However, future frame prediction does have a serious disadvantage: the future is not deterministic, so methods may have to consider multiple hypotheses with multiple instance losses, or other distributions and losses over their predictions.

Previous approaches to future frame prediction in video can roughly be divided into two types: those that predict a reconstruction of the actual frames; and those that only predict the latent representation (the embedding) of the frames. If our goal of self-supervision is only to learn a representation that allows generalization for downstream discriminative tasks, e.g. action recognition in video, then it may not be necessary to waste model capacity on resolving the stochasticity of frame appearance in detail, e.g. appearance changes due to shadows, illumination changes, camera motion, etc. Approaches that only predict the frame embedding, such as Vondrick et al., avoid this potentially unnecessary task of detailed reconstruction, and use a mixture model to resolve the uncertainty in future prediction. Although not applied to videos (but rather to speech signals and images), the Contrastive Predictive Coding (CPC) model of Oord et al. also learns embeddings, in their case by using a multi-way classification over temporal audio frames (or image patches), rather than a regression loss.

In this paper we propose a new idea for learning spatio-temporal video embeddings, which we term “Dense Predictive Coding” (DPC). The model is designed to predict future representations based on the recent past. It is inspired by the CPC framework, and more generally by previous research on learning word embeddings. DPC is also trained using a variant of noise contrastive estimation. In practice, therefore, the model is never optimized to predict the exact future; it is only asked to solve a multiple choice question, i.e. to pick the correct future states from many distractors. In order to succeed in this task, the model only needs to learn the shared semantics of the multiple possible future states, and this common/shared representation is the kind of invariance required in many vision tasks, e.g. action recognition in videos. In other words, the optimization objective actually benefits from the fact that the future is not deterministic, and maps the representations of all possible future states to a space in which their embeddings are close. Concurrent work applies a similar method to reinforcement learning.

The contributions of this paper are three-fold. First, we introduce the Dense Predictive Coding (DPC) framework for self-supervised representation learning on videos: we task the model with predicting the future embedding of the spatio-temporal blocks recurrently (as used in N-gram prediction). The model is trained to pick the “correct” future states from a pool of distractors, so the task is treated as a multi-way classification problem. Second, we propose a curriculum training scheme that enables the model to gradually predict further into the future (up to 2 seconds) with progressively less temporal context, leading to more challenging training samples and preventing the model from using shortcuts such as optical flow. Third, we evaluate the approach by first training the DPC model on the Kinetics-400 dataset using self-supervised learning, and then fine-tuning on action recognition benchmarks. Our DPC model achieves state-of-the-art self-supervised performance on both UCF101 ($75.7\%$ top1 acc) and HMDB51 ($35.7\%$ top1 acc), outperforming all previous single-stream (RGB only) self-supervised learning methods by a significant margin.

Related Work

Self-supervised learning from images. In recent years, methods for self-supervised learning on images have achieved impressive performance in learning high-level image representations. Inspired by the variants of Word2vec that rely on predicting words from their context, Doersch et al. proposed the pretext task of predicting the relative location of image patches. This work spawned a line of context-based self-supervised visual representation learning methods. In contrast to the context-based idea, another set of pretext tasks includes carefully designed image-level classification, such as rotation or pseudo-labels from clustering. Another class of pretext tasks is for dense predictions, e.g. image inpainting, image colorization, and motion segmentation prediction. Other methods instead enforce structural constraints on the representation space.

Self-supervised learning from videos. Other than the predictive tasks reviewed in the introduction, another class of proxy tasks is based on temporal sequence ordering of the frames. Some methods use temporal coherence as a proxy loss. Other approaches use egomotion to enforce equivariance in feature space. In contrast, another approach predicts the transformation applied to a spatio-temporal block. A 3D puzzle has also been proposed as the proxy loss. Recently, the natural temporal coherency of color in videos has been leveraged to train a network for tracking and correspondence related tasks.

Action recognition with two-stream architectures. Recently, the two-stream architecture has been a foundation for many competitive methods. Its authors show that optical flow is a powerful representation that improves action recognition dramatically. Other modalities, such as the audio signal, can also benefit visual representation learning. In this paper, however, we deliberately avoid using any information from optical flow or audio, and aim to probe the upper bound of self-supervised learning with only RGB streams. We leave it as future work to explore how much of a boost an optical flow branch and an audio branch can bring to our self-supervised learning architecture.

Dense Predictive Coding (DPC)

In this section, we describe the learning framework, details of the architecture, and the curriculum training that gradually learns to predict further into the future with progressively less temporal context.

The goal of DPC is to predict a slowly varying semantic representation based on the recent past, e.g. we construct a prediction task that observes about 2.5 seconds of the video and predicts the embedding for the future 1.5 seconds, as illustrated in Figure 2. A video clip is partitioned into multiple non-overlapping blocks $x_{1},x_{2},\dots,x_{n}$, with each block containing an equal number of frames. First, a non-linear encoder function $f(.)$ maps each input video block $x_{t}$ to its latent representation $z_{t}$; then an aggregation function $g(.)$ temporally aggregates $t$ consecutive latent representations into a context representation $c_{t}$:

$$z_{t}=f(x_{t}),\qquad c_{t}=g(z_{1},z_{2},\dots,z_{t})$$

The intuition behind the predictive task is that if one can infer future semantics from $c_{t}$, then the context representation $c_{t}$ and the latent representations $z_{1},z_{2},\dots,z_{t}$ must have encoded strong semantics of the input video clip. Thus, we introduce a predictive function $\phi(.)$ to predict the future. In detail, $\phi(.)$ takes the context representation as the input and predicts the future clip representation:

$$\hat{z}_{t+1}=\phi(c_{t})$$

where $c_{t}$ denotes the context representation from time step $1$ to $t$, and $\hat{z}_{t+1}$ denotes the predicted latent representation of time step $t+1$. In the spirit of Seq2seq, representations are predicted in a sequential manner. We predict $q$ steps into the future: at each time step $t$, the model consumes the previously generated embedding ($\hat{z}_{t-1}$) as input when generating the next ($\hat{z}_{t}$). This further enforces the prediction to be conditioned on all previous observations and predictions, and therefore encourages an N-gram like video representation.
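The recurrent prediction described above can be sketched as a short loop. This is only an illustration under strong simplifications: embeddings are scalars, and `f`, `g_step`, `phi` and `predict_future` are hypothetical stand-ins for the encoder, the aggregation function and the predictor, not the networks used in the paper.

```python
import math


def f(block):
    # Toy encoder: the mean of the block's values (stand-in for the 3D-ResNet).
    return sum(block) / len(block)


def g_step(hidden, z):
    # Toy recurrent aggregation: one update of the running context.
    return math.tanh(0.5 * hidden + 0.5 * z)


def phi(context):
    # Toy predictor (stand-in for the two-layer perceptron).
    return 0.9 * context


def predict_future(blocks, q):
    """Observe all blocks, then predict q future embeddings one at a time."""
    hidden = 0.0
    for block in blocks:
        hidden = g_step(hidden, f(block))   # c_t = g(z_1, ..., z_t)
    predictions = []
    for _ in range(q):
        z_hat = phi(hidden)                 # z_hat_{t+1} = phi(c_t)
        predictions.append(z_hat)
        hidden = g_step(hidden, z_hat)      # the prediction is fed back as input
    return predictions


observed = [[float(i)] * 5 for i in range(5)]   # 5 observed blocks of 5 frames
future = predict_future(observed, q=3)          # the 5pred3 setting
```

Because each prediction is fed back into the aggregation step, the prediction for a later step depends on all earlier observations and predictions.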

3.2 Contrastive Loss

Noise Contrastive Estimation (NCE) constructs a binary classification task: a classifier is fed with real samples and noise samples, and the objective is to distinguish them. A variant of NCE classifies one real sample among many noise samples. Following prior work, we use a loss based on NCE for the predictive task. NCE over feature embeddings encourages the predicted representation $\hat{z}$ to be close to the ground truth representation $z$, but not so strictly that it has to resolve the low-level stochasticity.

Let $z_{i,k}$ and $\hat{z}_{i,k}$ denote the ground-truth and predicted feature vectors at time step $i$ and spatial location $k$, and let the similarity of a Pred-GT pair be the dot product of its two vectors. The loss is then

$$\mathcal{L}=-\sum_{i,k}\log\frac{\exp(\hat{z}_{i,k}^{\top}z_{i,k})}{\sum_{j,m}\exp(\hat{z}_{i,k}^{\top}z_{j,m})}$$

In essence, this is simply a cross-entropy loss (negative log-likelihood) that distinguishes the positive Pred-GT pair from all the negative pairs. For a predicted feature vector $\hat{z}_{i,k}$, the only positive pair is $(\hat{z}_{i,k},{z}_{i,k})$, i.e. the predicted and ground-truth features at the same time step and same spatial location. All the other pairs $(\hat{z}_{i,k},{z}_{j,m})$, where $(i,k)\neq(j,m)$, are negative pairs. The loss encourages the positive pair to have a higher similarity than any negative pair. If the network is trained with a mini-batch consisting of $B$ video clips, and each of the $B$ clips is from a distinct video, more negative pairs can be obtained.
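A minimal sketch of such a loss in plain Python follows. It assumes dot-product similarity and averages over the predicted vectors for readability; the function name `dpc_nce_loss` and the dictionary layout are hypothetical choices made for this example, not the authors' implementation.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def dpc_nce_loss(pred, gt):
    """Cross-entropy over Pred-GT similarity scores.

    pred, gt: dicts mapping an index such as (time step, position) to a
    feature vector. For each predicted vector the positive is the
    ground-truth vector stored under the same index; every other
    ground-truth vector acts as a negative.
    """
    keys = sorted(gt)
    total = 0.0
    for key in keys:
        scores = [dot(pred[key], gt[other]) for other in keys]
        top = max(scores)  # subtract the maximum for numerical stability
        log_denominator = top + math.log(sum(math.exp(s - top) for s in scores))
        total += log_denominator - dot(pred[key], gt[key])
    return total / len(keys)
```

When the predictions carry no information (for example all-zero vectors), every candidate scores the same and the loss equals the logarithm of the number of candidates; it approaches zero as the positive pair's score dominates all others.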

To distinguish the different types of negative pairs, given a Pred-GT pair $(\hat{z}_{i,k},{z}_{j,m})$, we define the terminology as follows:

Easy negative: a Pred-GT pair formed from two distinct videos. These pairs are naturally easy because the two videos usually have distinct color distributions, and thus the predicted feature and the ground-truth feature have low similarity.

Spatial negative: a Pred-GT pair formed from the same video but at a different spatial position in the feature map, i.e. $k\neq m$, while $i,j$ can be any index.

Hard negative: a Pred-GT pair that comes from the same video and the same spatial position, but from different time steps, i.e. $k=m,i\neq j$. These are the hardest pairs to classify because their scores will be very close to those of the positive pairs.

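Overall, we use a similar idea to Multi-batch training. If the mini-batch has batch size $B$, the feature map has spatial dimension $H^{\prime}\times W^{\prime}$ and the task is to classify one of $q$ time steps, then each predicted feature vector is compared against $B\times q\times H^{\prime}W^{\prime}$ ground-truth feature vectors, and the number of pairs in each class follows from the definitions above.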
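Counting directly from the definitions above, one predicted vector has 1 positive, $q-1$ hard negatives, $(H^{\prime}W^{\prime}-1)\,q$ spatial negatives and $(B-1)\,H^{\prime}W^{\prime}q$ easy negatives. These closed forms are derived here from the definitions rather than quoted from the paper; the sketch below checks them by brute-force enumeration. The function names and the index layout (video, time step, spatial position) are assumptions made for the example.

```python
def pair_type(pred_idx, gt_idx):
    """Classify a Pred-GT pair; each index is (video, time step, position)."""
    (b1, i1, k1), (b2, i2, k2) = pred_idx, gt_idx
    if b1 != b2:
        return "easy negative"      # different videos
    if k1 != k2:
        return "spatial negative"   # same video, different position, any time
    if i1 != i2:
        return "hard negative"      # same video and position, different time
    return "positive"


def count_pairs(B, H, W, q):
    """Enumerate every ground-truth vector paired with one predicted vector."""
    counts = {"positive": 0, "hard negative": 0,
              "spatial negative": 0, "easy negative": 0}
    anchor = (0, 0, 0)
    for b in range(B):
        for i in range(q):
            for k in range(H * W):
                counts[pair_type(anchor, (b, i, k))] += 1
    return counts


def closed_form(B, H, W, q):
    hw = H * W
    return {"positive": 1, "hard negative": q - 1,
            "spatial negative": (hw - 1) * q,
            "easy negative": (B - 1) * hw * q}
```

The four counts always sum to $B\times q\times H^{\prime}W^{\prime}$, the total number of ground-truth vectors in the mini-batch.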

A curriculum learning strategy is designed by progressively increasing the number of prediction steps of the model (Sec. 4.1.4). For instance, the training process can start by predicting only 2 steps (about 1 second), i.e. only computing $\hat{z}_{t+1}$ and $\hat{z}_{t+2}$, with the Pred-GT pairs constructed between $\{z_{t+1},z_{t+2}\}$ and $\{\hat{z}_{t+1},\hat{z}_{t+2}\}$. After the network has learnt this simple task, it can be trained to predict 3 steps (about 1.5 seconds), i.e. computing $\hat{z}_{t+1}$, $\hat{z}_{t+2}$ and $\hat{z}_{t+3}$ and constructing Pred-GT pairs accordingly. Importantly, curriculum learning introduces more hard negatives throughout the training process, and forces the model to gradually learn to predict further into the future with progressively less temporal context. Meanwhile, the model is gradually trained to grasp the uncertain nature of its predictions.

3.3 Avoiding Shortcuts and Learning Semantics

Empirical experience in self-supervised learning indicates that if the proxy task is well-designed and requires semantic understanding, a more difficult learning task usually leads to a better-quality representation. However, ConvNets are notorious for learning shortcuts to solve tasks. In our training, we employ a number of mechanisms to avoid potential shortcuts, as detailed next.

A trivial solution of our predictive task is that $f(.)$, $g(.)$ and $\phi(.)$ together learn to capture low-level optical flow information and perform feature extrapolation as the prediction. To force the model to learn high-level semantics, a critical operation is frame-wise augmentation, i.e. random augmentation for each individual frame in the video blocks, such as frame-wise color jittering including random brightness, contrast, saturation, hue and random greyscale during training. Furthermore, the curriculum of predicting further into the future, i.e. predicting the semantics for the next few seconds, also ensures that optical flow alone will not be able to solve this prediction task.

The temporal receptive field (RF) of $f(.)$ is limited by cutting the input video clip into non-overlapping blocks before feeding it into $f(.)$. Thus, the effective temporal RF of each feature map $z_{i}$ is strictly restricted to be within each video block. This prevents the network from discriminating positives from hard negatives by recognizing relative temporal position.

Due to the depth of the CNN, each feature vector $\hat{z}_{i,k}$ in the final predicted feature map $\hat{z}_{i}$ has a large spatial RF that (almost) covers the entire input spatial dimension. This creates a shortcut for discriminating positives from spatial negatives by using padding patterns. One can limit the spatial RF by cutting input frames into patches. However, this brings some drawbacks. First, the self-supervised pre-trained network will have a limited receptive field, so the representation may not generalize well to downstream tasks where a large RF is required. Second, limiting the spatial RF in videos makes the context feature too weak: the context feature would then have a spatio-temporal RF that covers only a thin cube in the video. Neglecting context is also not ideal for understanding video semantics and brings ambiguity to the predictive task. Considering this trade-off, our method does not restrict the spatial RF.

Common practice uses Batch Normalization (BN) in deep CNN architectures. The BN layer may provide a shortcut by letting the network exploit the statistics of the mini-batch, which benefits the classification. Prior work demonstrates that BN results in network cheating, and that a ResNet trained with BN does not generalize to the downstream image classification task. In our method, we find the effect of the BN shortcut to be very limited. The self-supervised training gives similar accuracy using either BN or Instance Normalization (IN). For downstream tasks like classification, a network with BN gives a 5%-10% accuracy gain compared with a network with IN. It is hard to train a deep CNN without normalization for either self-supervised training or supervised training. Overall, we use BN in our encoder function $f(.)$.

3.4 Network Architecture

We choose a 3D-ResNet as the encoder $f(.)$. Following the usual convention, there are four residual blocks in the ResNet architecture, namely $\text{res}_{2}$, $\text{res}_{3}$, $\text{res}_{4}$ and $\text{res}_{5}$, and we only expand the convolutional kernels in $\text{res}_{4}$ and $\text{res}_{5}$ to be 3D ones. For the experimental analysis, we use 3D-ResNet18, denoted as R-18 below.

To train a strong encoder $f(.)$ , a weak aggregation function $g(.)$ is preferable. Specifically, a one-layer Convolutional Gated Recurrent Unit (ConvGRU) with kernel size $(1,1)$ is used, which shares the weights amongst all spatial positions in the feature map. This design allows the aggregation function to propagate features in the temporal axis. A dropout with $p=0.1$ is used when computing hidden state in each time step. A shallow two-layer perceptron is used as the predictive function $\phi(.)$ .
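With a $(1,1)$ kernel, the convolutions inside the ConvGRU act on each spatial position separately, so the aggregation amounts to running one GRU cell with shared weights at every position of the feature map. The sketch below illustrates this under simplifying assumptions: plain Python lists instead of tensors, randomly initialized weights, one common GRU update convention, no bias terms and no dropout. The class `PointwiseGRU` and its methods are hypothetical and not taken from the authors' code.

```python
import math
import random


def matvec(matrix, vector):
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


class PointwiseGRU:
    """One GRU cell applied independently at every spatial position."""

    def __init__(self, channels, seed=0):
        rng = random.Random(seed)

        def mat():
            return [[rng.uniform(-0.5, 0.5) for _ in range(channels)]
                    for _ in range(channels)]

        # Input-to-hidden (W*) and hidden-to-hidden (U*) weights,
        # shared by all spatial positions.
        self.Wz, self.Uz, self.Wr, self.Ur, self.Wh, self.Uh = (
            mat() for _ in range(6))

    def cell(self, x, h):
        z = [sigmoid(a + b) for a, b in zip(matvec(self.Wz, x), matvec(self.Uz, h))]
        r = [sigmoid(a + b) for a, b in zip(matvec(self.Wr, x), matvec(self.Ur, h))]
        gated = [ri * hi for ri, hi in zip(r, h)]
        cand = [math.tanh(a + b)
                for a, b in zip(matvec(self.Wh, x), matvec(self.Uh, gated))]
        return [(1 - zi) * hi + zi * ci for zi, hi, ci in zip(z, h, cand)]

    def step(self, feature_map, hidden_map):
        # Both maps are dicts: spatial position -> channel vector.
        return {pos: self.cell(feature_map[pos], hidden_map[pos])
                for pos in feature_map}
```

Because no information is exchanged between positions, such an aggregator can only combine features along the temporal axis, which keeps $g(.)$ weak relative to the encoder.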

3.5 Self-Supervised Training

For data pre-processing, we use 30 fps videos with a uniform temporal downsampling by a factor of 3, i.e. we take one frame from every 3 frames. These consecutive frames are grouped into 8 video blocks, where each block consists of 5 frames. Frames are sampled consecutively with a consistent temporal stride to preserve the temporal regularity, because a random temporal stride introduces uncertainty into the predictive task, especially when the network needs to distinguish between different time steps. Specifically, each video block spans 0.5s and the 8 blocks together span 4s of the raw video. The predictive task is initially designed to observe the first 5 blocks and predict the remaining 3 blocks (denoted as ‘5pred3’ afterwards), i.e. observing 2.5 seconds to predict the following 1.5 seconds. We also experiment with different predictive configurations such as 4pred4 in Sec. 4.1.4.
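The sampling arithmetic above can be checked with a small sketch. The helper names `block_frame_indices` and `span_seconds` are hypothetical; the default values are the ones stated in the text (30 fps, stride 3, 5 frames per block, 8 blocks).

```python
def block_frame_indices(start, num_blocks=8, frames_per_block=5, stride=3):
    """Raw-video frame indices for each block, using a fixed temporal stride."""
    blocks = []
    for b in range(num_blocks):
        first = start + b * frames_per_block * stride
        blocks.append([first + j * stride for j in range(frames_per_block)])
    return blocks


def span_seconds(num_blocks, frames_per_block=5, stride=3, fps=30):
    """Duration of raw video covered by the given number of blocks."""
    return num_blocks * frames_per_block * stride / fps


blocks = block_frame_indices(start=0)
observed, future = blocks[:5], blocks[5:]   # the 5pred3 split
```

With these defaults one block covers 15 raw frames, i.e. 0.5 s, so 5 observed blocks cover 2.5 s and 3 predicted blocks cover 1.5 s.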

For data augmentation, we apply random crop, random horizontal flip, random grey, and color jittering. Note that the random crop and random horizontal flip are applied to the entire clip in a consistent way. Random grey and color jittering are applied in a frame-wise manner to prevent the network from learning low-level flow information, as mentioned above (in Sec. 3.3); e.g. each video block may contain both colored and grey-scale images with different contrast. All models are trained end-to-end using the Adam optimizer with an initial learning rate of $10^{-3}$ and weight decay of $10^{-5}$. The learning rate is decayed to $10^{-4}$ when the validation loss plateaus. A batch size of 64 samples per GPU is used, and our experiments use 4 GPUs.
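The difference between clip-consistent and frame-wise augmentation can be sketched by looking only at how the random parameters are drawn. The function name, the parameter ranges and the grey-scale probability below are illustrative assumptions, not the settings used in the paper.

```python
import random


def sample_clip_augmentation(num_frames, rng, grey_prob=0.5):
    """Draw augmentation parameters for one clip (hypothetical helper)."""
    # Geometric parameters: drawn once, shared by every frame of the clip.
    flip = rng.random() < 0.5
    crop_offset = (rng.randint(0, 16), rng.randint(0, 16))
    params = []
    for _ in range(num_frames):
        params.append({
            "flip": flip,
            "crop_offset": crop_offset,
            # Photometric parameters: drawn independently for each frame.
            "brightness": rng.uniform(0.5, 1.5),
            "contrast": rng.uniform(0.5, 1.5),
            "grey": rng.random() < grey_prob,
        })
    return params


clip_params = sample_clip_augmentation(num_frames=40, rng=random.Random(0))
```

Keeping the crop and flip fixed preserves the spatial layout across the clip, while the per-frame photometric changes break the brightness constancy that low-level flow estimation relies on.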

Experiments and Analysis

In the following sections we present controlled experiments, and aim to investigate four aspects: First, an ablation study on the DPC model to show the function of different design choices, e.g. sequential prediction, dense prediction. Second, the benefits of training on a larger, and more diverse dataset. Third, the correlation between performance on self-supervised learning and performance on the downstream supervised learning task. Fourth, the variation in the learnt representations when predicting further into the future.

DPC is a general self-supervised learning framework for any type of video, but we focus here on human action videos, e.g. the UCF101, HMDB51 and Kinetics-400 datasets. UCF101 contains 13K videos spanning 101 human action classes. HMDB51 contains 7K videos from 51 human action classes. Kinetics-400 (K400) is a large video dataset containing 306K video clips for 400 human action classes.

The self-supervised model is trained either on UCF101 or K400. The representation is evaluated by its performance on a downstream task, i.e. action classification on UCF101 and HMDB51. For all the experiments below, we report top1 accuracy for self-supervised learning in the middle column of all tables, and the top1 accuracy for supervised action classification on UCF101 in the rightmost column. In self-supervised learning, the top1 accuracy refers to how often the multi-way classifier picks the right Pred-GT pair, i.e. it is unrelated to any action classes. For supervised learning, the top1 accuracy indicates the action classification accuracy on UCF101. Note that we report results on the first training/testing splits of UCF101 and HMDB51 in all the experiments, apart from the comparison with the state of the art in Table 4, where we report the average accuracy over three splits.

4.1 Performance Analysis

In this section, we present an ablation study by gradually removing components from the DPC model (see Table 1). For efficiency, all the self-supervised learning experiments use the 5pred3 setting, i.e. 5 video blocks (2.5 seconds) are used as input to predict the future 3 steps (1.5 seconds).

Compared with the baseline model trained with random initialization and fully supervised learning, our DPC model pre-trained with self-supervised learning gives a significant boost (top1 acc: 60.6% vs. 46.5%). When the sequential prediction is removed, i.e. all 3 future steps are predicted in parallel with three different fully-connected layers, the accuracy of both self-supervised learning and supervised learning drops. Lastly, when we further replace the dense feature map by the average-pooled feature vector, i.e. the model becomes CPC-like, we are not able to train the model on either the self-supervised learning task or the supervised learning task. This demonstrates that dense predictive coding is essential to our success, and that sequential prediction also helps to boost the model performance.

4.1.2 Benefits of Large Datasets

In this section, we investigate the benefits of pre-training on a large-scale dataset (UCF101 vs. K400). We keep the 5pred3 setting and evaluate the effectiveness on the downstream task on UCF101. Results are shown in Table 2.

Training the model on K400 increases the self-supervised accuracy to 61.1%, and the supervised accuracy from $60.6\%$ to $65.9\%$, suggesting that the model has captured more regularities than when trained on a smaller dataset like UCF101. It is clear that DPC benefits from a large-scale video dataset (of which there is an almost infinite supply), which naturally provides more diverse negative Pred-GT pairs.

4.1.3 Self-Supervised vs. Classification Accuracy

In this section, we investigate the correlation between the accuracy of self-supervised learning and downstream supervised learning. While training DPC (5pred3 task on K400), we evaluate the representation at different training stages (number of epochs) on the downstream task (on UCF101). The results are shown in Figure 3.

It can be seen that, across the training stages evaluated, a higher accuracy in the self-supervised task consistently corresponds to a higher accuracy in downstream classification. The result indicates that DPC has learnt visual representations that are not only specific to the self-supervised task, but are also generic enough to be beneficial for the downstream task.

4.1.4 Benefits of Predicting Further into the Future

Due to the increase of uncertainty, predicting further into the future in video sequences gets more difficult, therefore more abstract (semantic) understanding is required. We hypothesize that if we can train the model to predict further, the learnt representation should be even better. In this section, we employ curriculum learning to gradually train the model to predict further with progressively less temporal context, i.e. from 5pred3 to 4pred4 (4 video blocks as input and predict the future 4 steps).

The result shows that the 4pred4 setting gives a substantially lower accuracy on the self-supervised learning than 5pred3. This is actually not surprising, as 4pred4 naturally introduces 33% more hard negative pairs than predicting future 3 steps, making the self-supervised learning more difficult (explained in Section 3.2).

Interestingly, despite a lower accuracy on the self-supervised learning task, curriculum learning on 4pred4 provides a $2.3\%$ performance boost over 5pred3 on the downstream supervised task (top1 acc: 68.2% vs. 65.9%). The experiment also shows that curriculum learning is effective, as it achieves higher performance than training the 4pred4 task from scratch (top1 acc: 68.2% vs. 64.9%). A similar effect has also been observed in prior work.

4.1.5 Summary

Through the experiments above, we have demonstrated the keys to the success of DPC. First, it is critical to do dense predictive coding, i.e. predicting both the temporal and spatial representation in the future blocks, and sequential prediction enables a further boost in the quality of the learnt representation. Second, a large-scale dataset helps to improve the self-supervised learning, as it naturally contains more world patterns and provides more diverse negative sample pairs. Third, the representation learnt by DPC is generic, as a higher accuracy in the self-supervised task also yields a higher accuracy in the downstream classification task. Fourth, predicting further into the future is also beneficial, as the model is forced to encode high-level semantic representations and ignore low-level information.

Comparison with State-of-the-art Methods

The results are given in Table 4. Four phenomena can be observed. First, when self-supervised training uses only UCF101, our DPC (60.6%) outperforms all previous methods under similar settings. Note that OPN performs worse when the input resolution increases, which indicates that a simple self-supervised task like order prediction may not capture the rich semantics of videos. Second, when using Kinetics-400 for self-supervised pre-training, our DPC (68.2%) outperforms all the previous methods by a large margin. Note that the Space-Time Cubic Puzzles work uses a full-scale 3D-ResNet18 architecture (33.6M parameters), i.e. all convolutions are 3D, whereas our modified 3D-ResNet18 has fewer parameters (only the last 2 blocks use 3D convolutions). Its authors obtain 65.8% accuracy by combining rotation classification with their Space-Time Cubic Puzzles method, which is essentially multi-task learning. When only considering their Space-Time Cubic Puzzles method, they obtain 63.9% top1 accuracy. On HMDB51, our method also outperforms the previous state-of-the-art result by 0.8% (34.5% vs. 33.7%). Third, when applied to a larger input resolution ($224\times 224$) and using a model with more capacity (3D-ResNet34), our DPC clearly outperforms all other self-supervised learning methods (75.7% on UCF101 and 35.7% on HMDB51), further demonstrating that DPC is able to take advantage of networks with more capacity and today’s large-scale datasets. Fourth, ImageNet pre-trained weights have been a gold-standard baseline for action recognition; our self-supervised DPC is the first model that surpasses the performance of models (VGG-M) pre-trained with ImageNet ($75.7\%$ vs. $73.0\%$ on UCF101).

We visualize the Nearest Neighbours (NN) of the video segments in the spatio-temporal feature space in Figure 4 and Figure 1. In detail, one video segment is randomly sampled from each video, then the spatio-temporal feature $z_{i}=f(x_{i})$ is extracted and pooled into a vector. The feature vector is then used to compute the cosine similarity score. Figure 4(a) shows the video clips retrieved using our DPC model from self-supervised learning; note that the network does not receive any class label information during training. In comparison, Figure 4(b) uses the inflated ImageNet pre-trained weights.

It can be seen that the ImageNet model is able to encode the scene semantics, e.g. human faces and crowds, but does not capture any semantics about the human actions. In contrast, our DPC model has learnt the video semantics without using any manual annotation; for instance, despite the background change in running, DPC can still correctly retrieve the video block.

5.2 Discussion

Why should the DPC model succeed in learning a representation suitable for action recognition, given the problem of a non-deterministic future? There are three reasons. First, the use of the softmax function and multi-way classification loss allows for multi-modal, skewed, peaked or long-tailed distributions; the model can therefore handle the task of predicting the non-deterministic future. Second, by avoiding the shortcuts, the model has been prevented from learning a simple smooth extrapolation of the embeddings; it is forced to learn semantic embeddings to succeed in its learning task. Third, in essence, DPC is trained by predicting future representations and using them as a “query” to pick the correct “key” from many distractors. In order to succeed in this task, the model has to learn the shared semantics of the multiple possible future states, as this is the only way to always solve the multiple choice problem, no matter which future state appears along with the distractors. This common/shared representation is the invariance we are wishing for, i.e. higher-level semantics. In other words, the representations of all these possible future states will be mapped to a space in which their embeddings are close.

Conclusion

In this paper, we have introduced the Dense Predictive Coding (DPC) framework for self-supervised representation learning on videos, and outperformed the previous state-of-the-art by a large margin on the downstream tasks of action classification on UCF101 and HMDB51. As for future work, one straightforward extension of this idea is to employ different methods for aggregating the temporal information – instead of using a ConvGRU for temporal aggregation ( $g(.)$ in the paper), other methods like masked CNN and attention based methods are also promising. In addition, empirical evidence shows that optical flow is able to boost the performance for action recognition significantly; it will be interesting to explore how optical flow can be trained jointly with DPC with self-supervised learning to further enhance the representation quality.

Acknowledgements

Funding for this research is provided by the Oxford-Google DeepMind Graduate Scholarship, and by the EPSRC Programme Grant Seebibyte EP/M013774/1.

Appendix A Architectures in detail

We use tables to display CNN structures. The dimensions of convolutional kernels are denoted by $\{\text{temporal}\times\text{spatial}^{2}\text{, channel size}\}$. The strides are denoted by $\{\text{temporal stride, }\text{spatial stride}^{2}\}$. The ‘output sizes’ column displays the dimensions of the feature map after the operation (except for the dimensions of the input data in the first row), where $\{t\times d^{2}\times C\}$ denotes $\{\text{temporal size}\times\text{spatial size}^{2}\times\text{channel size}\}$, and $T$ denotes the number of video blocks. In the following tables we take the 3D-ResNet18 backbone with $128\times 128$ input resolution as an example.

Table 5 gives the details of the action classifier which is used to evaluate the learned representation. Figure 5 is a diagram of the action classifier structure. For an input video with 30 fps, first a temporal stride 3 is applied, i.e. every 3rd frame is taken, resulting in 10 fps. Then $T\times 5$ consecutive frames are sampled and truncated into $T$ video blocks, i.e. each video block has a size $5\times 128^{2}\times 3$ , and we take $T=5$ for the action classifier.

The action classifier is built with $f(.)$ and $g(.)$. The encoder function $f(.)$ takes 5 video blocks as input, each block containing 5 video frames ($5\times(5\times 128^{2}\times 3)$), and spatio-temporal features ($z$) are extracted from the 5 video blocks with the shared encoder $f(.)$. Then the aggregation function $g(.)$ (ConvGRU) aggregates the 5 spatio-temporal feature maps into one spatio-temporal feature map, which is referred to as the context $c$ in the paper. The context $c$ is then pooled into a feature vector, followed by a fully-connected layer.

DPC is built from $f(.)$ and $g(.)$ with an additional prediction mechanism, which is described in Table 6. Here we use the 5pred3 setting as an example, where $f(.)$ takes 5 video blocks and extracts 5 spatio-temporal feature maps, then $g(.)$ aggregates the feature maps into the context $c$. The prediction function $\phi(.)$ is a two-layer perceptron, which takes the context $c$ as input and produces a predicted feature $\hat{z}$ as output. The contrastive loss is computed using $z$ and $\hat{z}$ as described in Sec. 3.2 of the paper.

The detailed structure of the encoder function $f(.)$ is shown in Table 7. Note that $f(.)$ processes the input video blocks independently, so the number of video blocks $T$ is omitted in the table.

The structure of the temporal aggregation function $g(.)$ is shown in Table 8. It aggregates the feature maps over the past $T$ time steps. Note that in the case of sequential prediction, $T$ increments by 1 after each prediction step. Table 8 shows the case where $g(.)$ aggregates the feature maps over the past 5 steps.

Appendix B t-SNE clustering of DPC context representation

This section shows the t-SNE clustering of the context representation on UCF101 extracted by $f(.)$ and $g(.)$ (Figure 6). In detail, 5 consecutive video blocks are sampled from each video in the validation set, then the feature maps $\{z_{1},...,z_{5}\}$ are extracted from each video block and aggregated into context representation $c_{5}$ and then pooled into vectors. We use t-SNE to visualize the context vectors in 2D. For clarity, only 10 action classes (out of 101 classes from UCF101) are displayed. The upper-left figure visualizes the context features extracted by randomly initialized $f(.)$ and $g(.)$ . The following 3 figures show the context features extracted by $f(.)$ and $g(.)$ after $\{13,48,109\}$ epochs of DPC training on K400, without any finetuning on UCF101.

It can be seen that as the DPC training proceeds the intra-class distance is reduced (compared to the random initialization) and also the inter-class distance is increased, i.e. the self-supervised DPC method is clustering the feature vectors into action classes.

Appendix C Cosine distance histogram of DPC context representation

This section shows the cosine distance of the context representation on UCF101 extracted by the DPC pre-trained $f(.)$ and $g(.)$ (Figure 7). We use the same setting as Figure 6, extracting one context representation for each video and pooling it into a vector. Then we compute the cosine distance of each pair of context vectors across the entire UCF101 validation set. The cosine distances are summarized by histograms, where ‘positive’ means the two source videos are from the same action class and ‘negative’ means the two source videos are from different action classes. For clarity, 17 out of 101 action classes are evenly sampled from UCF101 and visualized. Note that there is no fine-tuning in this stage, i.e. the network does not see any action labels.
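The pairwise computation behind such a histogram can be sketched as follows. The helper names are hypothetical, cosine distance is taken here as one minus cosine similarity, and the class labels are used only to group the distances for evaluation, never for training.

```python
import math
from itertools import combinations


def cosine_distance(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def split_pair_distances(vectors, labels):
    """Cosine distance of every pair, grouped by same or different class."""
    positive, negative = [], []
    for a, b in combinations(range(len(vectors)), 2):
        distance = cosine_distance(vectors[a], vectors[b])
        if labels[a] == labels[b]:
            positive.append(distance)
        else:
            negative.append(distance)
    return positive, negative
```

The two returned lists can then be binned separately to produce the ‘positive’ and ‘negative’ histograms.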

It can be seen that for all action classes, the context representations from the same action class have higher cosine similarity, i.e. DPC can cluster actions without knowing action labels.