Self-supervised learning of a facial attribute embedding from video

Olivia Wiles, A. Sophia Koepke, Andrew Zisserman

Introduction

Babies and children are highly sensitive to the facial expressions of the people they interact with [Gerull and Rapee(2002), Ekman and Oster(1979)]. The ability to understand and respond to changes in people’s emotional state is similarly important for computer vision systems and affective systems when interacting with a human user. Being able to predict head pose and expression is therefore of vital importance.

Recently, deep learning has led to state-of-the-art results on a variety of tasks such as emotion recognition and facial landmark detection. Despite these advances, supervised methods require large amounts of labelled data, which may be expensive or difficult to obtain in realistic, unconstrained settings, or may necessitate assigning data to ill-defined categories. For example, categorising emotions with three human annotators leads to only 46% agreement [Barsoum et al.(2016)Barsoum, Zhang, Ferrer, and Zhang], and labelling pose in the wild is notoriously difficult. Moreover, performing each task independently does not leverage the fact that detecting landmarks requires understanding pose and facial features, which in turn correspond to expression.

Consequently, we consider the following question: is it possible to learn an embedding of facial attributes that encodes landmarks, pose, emotion, etc. in a self-supervised manner without any hand labelling? The learned embedding can then be used for another task (e.g. landmark, pose, and expression prediction) using a linear layer. To do this, we contribute FAb-Net, a self-supervised framework for learning a low-dimensional face embedding for facial attributes (Section 3). We take advantage of video data which contains a large collection of images of the same person from different viewpoints and with varied expressions. Given only the embeddings corresponding to a source and target frame, the network is tasked to map the source frame to the target frame by predicting a flow field between them. This proxy task forces the network to distil the information required to compute the flow field (e.g. the head pose and expression) into the source and target embeddings.

After explaining the setup for a single source frame in Section 3.1, we introduce our additional contributions: a method for leveraging multiple source frames to improve the learned embedding (Section 3.2), and a curriculum strategy for training FAb-Net that further improves performance (Section 4).

The learned embedding is extracted and used for a variety of tasks such as landmark detection, pose regression, and expression classification in Section 5 by simply learning a linear layer. Our results on these tasks are comparable or superior to other self-supervised methods, approaching the performance of supervised methods. These experiments verify the hypothesis that our self-supervised framework learns to encode facial attributes which are useful for a variety of tasks. Finally, the method is tested qualitatively by using the learned embedding to retrieve images with similar facial attributes across different identities.

Related Work

Self-supervised learning. Self-supervised methods such as [Pathak et al.(2016)Pathak, Krähenbühl, Donahue, Darrell, and Efros, Zhang et al.(2016a)Zhang, Isola, and Efros, Doersch et al.(2015)Doersch, Gupta, and Efros, Noroozi and Favaro(2016)] require no manual labelling of the images; instead, they directly use image data to provide proxy supervision for learning good feature representations. The benefit of these approaches is that the features learned using large quantities of available data can be transferred to other tasks/domains which have less or even no annotated data.

To provide further supervision from image data, the images themselves can be transformed via a synthetic warp or rotation and the network trained to recognise the rotation [Gidaris et al.(2018)Gidaris, Singh, and Komodakis] or to learn equivariant pixel embeddings [Thewlis et al.(2017a)Thewlis, Bilen, and Vedaldi, Thewlis et al.(2017b)Thewlis, Bilen, and Vedaldi, Novotny et al.(2018)Novotny, Albanie, Larlus, and Vedaldi].

Of more direct relevance to our training framework are self-supervised frameworks that use video data ([Wang and Gupta(2015), Fernando et al.(2017)Fernando, Bilen, Gavves, and Gould, Misra et al.(2016)Misra, Zitnick, and Hebert, Lee et al.(2017)Lee, Huang, Singh, and Yang, Jakab et al.(2018)Jakab, Gupta, Bilen, and Vedaldi, Gan et al.(2018)Gan, Gong, Liu, Su, and Guibas, Wiles et al.(2018)Wiles, Koepke, and Zisserman, Xue et al.(2016)Xue, Wu, Bouman, and Freeman, Jia et al.(2016)Jia, De Brabandere, Tuytelaars, and Gool, Pătrăucean et al.(2016)Pătrăucean, Handa, and Cipolla, Chung et al.(2017)Chung, Jamaludin, and Zisserman, Denton and Birodkar(2017), Agrawal et al.(2015)Agrawal, Carreira, and Malik, Zamir et al.(2016)Zamir, Wekel, Agrawal, Wei, Malik, and Savarese]).

Our approach builds in particular on those that use frame synthesis [Xue et al.(2016)Xue, Wu, Bouman, and Freeman, Jia et al.(2016)Jia, De Brabandere, Tuytelaars, and Gool, Pătrăucean et al.(2016)Pătrăucean, Handa, and Cipolla, Denton and Birodkar(2017), Chung et al.(2017)Chung, Jamaludin, and Zisserman, Wiles et al.(2018)Wiles, Koepke, and Zisserman], though for us synthesis is a proxy task rather than the end goal. Note, unlike [Wang and Gupta(2015), Misra et al.(2016)Misra, Zitnick, and Hebert, Fernando et al.(2017)Fernando, Bilen, Gavves, and Gould, Lee et al.(2017)Lee, Huang, Singh, and Yang], we do not make use of the temporal ordering information inherent in a video; nor do we predict future frames conditioned on a number of past frames [Pătrăucean et al.(2016)Pătrăucean, Handa, and Cipolla], or explicitly predict the motion between frames as a convolutional kernel [Xue et al.(2016)Xue, Wu, Bouman, and Freeman, Jia et al.(2016)Jia, De Brabandere, Tuytelaars, and Gool], or condition the generation on another modality (e.g. voice [Chung et al.(2017)Chung, Jamaludin, and Zisserman]). Instead, we treat the frames as an unordered set, and propose a simple formulation: that by embedding the source and target frames in a common space and conditioning the transformation from source to target frame on these embeddings, the learned embeddings must learn to encode the relevant modes of variation necessary for the transformation.

Concurrent to our work, Zhang et al. [Zhang et al.(2018)Zhang, Guo, Jin, Luo, He, and Lee] and Jakab et al. [Jakab et al.(2018)Jakab, Gupta, Bilen, and Vedaldi] build on [Thewlis et al.(2017a)Thewlis, Bilen, and Vedaldi] by using the discovered landmarks to reconstruct the original image. However, unlike these works, we do not place any constraints on the learned representation, such as an explicit representation that encodes landmarks as heatmaps.

Supervised learning of face embeddings. Given known (labelled) attribute information, e.g. for pose or expression, the embedding can be learned by training in a supervised manner to directly predict the attribute [Kumar et al.(2011)Kumar, Berg, Belhumeur, and Nayar, Rudd et al.(2016)Rudd, Günther, and Boult, Liu et al.(2015)Liu, Luo, Wang, and Tang], or to generate images (of faces, cars, or other classes) at a new, known pose, expression, etc. [Tran et al.(2017)Tran, Yin, and Liu, Yang et al.(2017)Yang, Ren, Chen, Wen, Li, and Hua, Dosovitskiy et al.(2015)Dosovitskiy, Springenberg, and Brox, Kulkarni et al.(2015)Kulkarni, Whitney, Kohli, and Tenenbaum, Zhou et al.(2016)Zhou, Tulsiani, Sun, Malik, and Efros].

Another way of supervising a face embedding is to explicitly learn the parameters of a 3D morphable model (3DMM) [Blanz et al.(2002)Blanz, Romdhani, and Vetter]. As fitting a 3DMM is relatively expensive, [Bas et al.(2017)Bas, Huber, Smith, Awais, and Kittler, Tewari et al.(2017)Tewari, Zollhöfer, Kim, Garrido, Bernard, Perez, and Theobalt] learn this end-to-end using either landmarks or a photometric error as supervision. However, unlike our method, these methods require either ground truth labels or a morphable model which fixes the modes of variation and the embedding.

An interesting half-way point is weak supervision, where the learned object or face embedding is conditioned for instance on object labels [Novotny et al.(2017)Novotny, Larlus, and Vedaldi] or weather/geo-location information [Li et al.(2015)Li, Wang, Liu, Jiang, Shan, and Chen] respectively. This requires additional meta-data, but results in embeddings that can represent attributes such as age and expression for faces or keypoints for objects.

Method

The aim is to train a network to learn an embedding that encodes facial attributes in a self-supervised manner, without any labels. To do this, the network is trained to generate a target frame from one or multiple source frames by learning how to transform the source into the target frame. The source and target frames are taken from the same face-track of a person speaking, i.e. the frames are of the same identity but with different expressions/poses. An overview of the architecture is given in Fig. 1 and further described for a single source frame in Section 3.1 and for multiple source frames in Section 3.2 (additional details are given in the supp. material).

The input to the network is a source frame $s$ and a target frame $t$ from the same face-track. These are passed through encoders with shared weights which learn a mapping $f$ from the input frames to a $256$-dimensional vector embedding (as shown in Fig. 1). The embeddings corresponding to the target and source frames are $v_{t}=f(t)$ and $v_{s}=f(s)$ respectively. The source and target embeddings are concatenated to give a $512$-dimensional vector which is upsampled via a decoder. The decoder learns a mapping $g$ from the concatenated embeddings to a bilinear grid sampler, which samples from the source frame to create a new, generated frame $s^{\prime}=g(v_{t},v_{s})(s)$. Precisely, $g$ predicts offsets $(\delta x,\delta y)$ for each pixel location $(x,y)$ in the target frame; the generated frame $s^{\prime}$ at location $(x,y)$ is obtained by sampling from the source frame $s$ according to these offsets: $s^{\prime}(x,y)=s(x+\delta x,y+\delta y)$. The network is trained to minimise the $L_{1}$ loss between the generated and the target frame: $\mathcal{L}(s^{\prime},t)=||t-s^{\prime}||_{1}$.

This setup enforces that the embeddings $v_{s}$ and $v_{t}$ represent facial attributes of the source and target frames respectively since the decoder maps from the source frame $s$ to generate the frame $s^{\prime}$ (i.e. it uses pixel RGB values from the source frame $s$ to create the generated $s^{\prime}$ – a similar formulation has been proposed concurrently to this work by [Vondrick et al.(2018)Vondrick, Shrivastava, Fathi, Guadarrama, and Murphy]).

As the decoder is a function of the target and source attribute embeddings, and the decoder is the only place in the network where information is shared, the target attribute embedding must encode information about expression and pose in order for the decoder to know where to sample from in the source frame and where to place this information in the generated frame.
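The sampling step and the loss described above can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: it assumes single-channel images stored as lists of rows, offsets expressed in pixel units, and coordinates clamped at the image border. The function names `bilinear_sample`, `warp` and `l1_loss` are hypothetical, and the offsets are supplied by hand rather than predicted by a decoder.

```python
import math


def bilinear_sample(img, x, y):
    """Sample a single-channel image (list of rows) at a real-valued (x, y).

    Assumption of this sketch: coordinates are clamped to the image border.
    """
    h, w = len(img), len(img[0])
    x = min(max(x, 0.0), w - 1.0)
    y = min(max(y, 0.0), h - 1.0)
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    ax, ay = x - x0, y - y0
    # Interpolate along x on the two neighbouring rows, then along y.
    top = (1 - ax) * img[y0][x0] + ax * img[y0][x1]
    bottom = (1 - ax) * img[y1][x0] + ax * img[y1][x1]
    return (1 - ay) * top + ay * bottom


def warp(source, offsets):
    """Compute s'(x, y) = s(x + dx, y + dy), where offsets[y][x] = (dx, dy)."""
    return [[bilinear_sample(source, x + dx, y + dy)
             for x, (dx, dy) in enumerate(row)]
            for y, row in enumerate(offsets)]


def l1_loss(generated, target):
    """Sum of absolute pixel differences, i.e. ||t - s'||_1."""
    return sum(abs(t - g)
               for t_row, g_row in zip(target, generated)
               for t, g in zip(t_row, g_row))
```

With all offsets set to zero, `warp` returns the source frame unchanged and the loss against the source is zero; non-zero offsets move pixel values around without creating new ones, which is why the offsets must carry the pose and expression information.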

Multi-source frames architecture

While using two frames for training enforces that the network learns a high-quality embedding, additional source frames can be leveraged to improve the learned embedding. This is achieved by also predicting a confidence heatmap (a single-channel image) for each source frame via an additional decoder. The heatmaps denote how confident the network is of the flow at each pixel location; e.g. if the source frame has a very different pose from the target frame, the confidence heatmap would have low certainty. Moreover, it can express this for sub-parts of the image: if the mouth is closed in the source but open in the target frame, the confidence heatmap can express uncertainty in this region. The confidence heatmaps $C_{i}$ are combined pixel-wise for each source frame $s_{i}$ using a soft-max operation. For $n$ source frames, the loss function to be minimised is given as $\mathcal{L}=||t-\frac{\sum_{i=1}^{n}e^{C_{i}}*(g(v_{t},v_{s_{i}})(s_{i}))}{\sum_{i=1}^{n}e^{C_{i}}}||_{1}$.
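The pixel-wise soft-max combination in this loss can be sketched as follows. This is an illustrative pure-Python version under stated assumptions, not the authors' code: frames and heatmaps are single-channel lists of rows, the warped frames are taken as given, and `fuse_sources` is a hypothetical name. Subtracting the per-pixel maximum before exponentiating is a standard numerical-stability step that leaves the soft-max weights unchanged.

```python
import math


def fuse_sources(warped_frames, confidences):
    """Combine n warped source frames pixel-wise using soft-max weights.

    warped_frames[i][y][x] is g(v_t, v_{s_i})(s_i) at pixel (x, y);
    confidences[i][y][x] is the confidence heatmap C_i at the same pixel.
    """
    n = len(warped_frames)
    h, w = len(warped_frames[0]), len(warped_frames[0][0])
    fused = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            logits = [confidences[i][y][x] for i in range(n)]
            m = max(logits)  # shift for numerical stability
            weights = [math.exp(c - m) for c in logits]
            total = sum(weights)
            fused[y][x] = sum(wt * warped_frames[i][y][x]
                              for i, wt in enumerate(weights)) / total
    return fused
```

When the confidences are equal at a pixel, the fused value is the plain average of the warped frames there; when one source is far more confident, its value dominates. The multi-source loss is then the $L_1$ distance between this fused frame and the target.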

Curriculum Strategy

The training of the network is divided into stages, so that knowledge can be built up over time as the examples given become progressively more difficult, as inspired by [Bengio et al.(2009)Bengio, Louradour, Collobert, and Weston, Kumar et al.(2010)Kumar, Packer, and Koller]. The loss computed by a forward pass is used to rank samples (i.e. source and target frame pairs) in the batch according to their difficulty in a manner similar to [Loshchilov and Hutter(2016), Simo-Serra et al.(2015)Simo-Serra, Trulls, Ferraz, Kokkinos, Fua, and Moreno-Noguer, Shrivastava et al.(2016)Shrivastava, Gupta, and Girshick, Nagrani et al.(2018)Nagrani, Albanie, and Zisserman]. However, these methods use only the most difficult samples, which was found to stop our network from learning. Similarly to [Nagrani et al.(2018)Nagrani, Albanie, and Zisserman], using progressively more difficult samples proved crucial for the strategy’s success.

Given a batch of $N$ randomly chosen samples (i.e. source and target frame pairs), a forward pass is executed and the loss for each sample is computed. The samples are then ranked by this loss. Initially, the loss is back-propagated only on the samples in the 0th-50th percentile range (i.e. the $0.5N$ samples with the lowest loss computed by the forward pass). These are assumed to be the easier samples. When the loss on the validation set plateaus, the subset to be back-propagated on is shifted up by 10 percentile points (i.e. to the samples in the 10th-60th percentile range). This is repeated 4 times, until the samples being back-propagated on fall into the 40th-90th percentile range. At this point the curriculum strategy is terminated, as it is assumed that the samples in the 90th-100th percentile range are too challenging or may be problematic (e.g. there is a large shift in the background which is too challenging to learn).
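The sample selection at each stage of the curriculum can be sketched as below. This is a hedged illustration rather than the authors' code: the function name `curriculum_indices` and the stage numbering (0 for the 0th-50th window up to 4 for the 40th-90th window) are assumptions, as is rounding the window boundaries down when the batch size is not a multiple of 10.

```python
def curriculum_indices(losses, stage):
    """Return the indices of the samples to back-propagate on.

    Stage 0 keeps the 0th-50th percentile range (lowest losses); each later
    stage shifts the window up by 10 points, ending at 40th-90th for stage 4.
    """
    if not 0 <= stage <= 4:
        raise ValueError("stage must be in 0..4")
    n = len(losses)
    order = sorted(range(n), key=lambda i: losses[i])  # easiest first
    lo = (10 * stage * n) // 100
    hi = ((50 + 10 * stage) * n) // 100
    return order[lo:hi]
```

In a training loop, the per-sample losses from the forward pass would be passed in, and only the losses at the returned indices would be averaged and back-propagated. The stage counter would be advanced whenever the validation loss plateaus.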

Experiments

In this section, we evaluate the network and the learned embedding. In Section 5.1, the performance of using FAb-Net’s learned representation is compared to that of state-of-the-art self-supervised and supervised methods on a variety of tasks: facial landmark prediction, head pose regression and expression classification.

Section 5.2 discusses the benefit of using additional source frames, and Section 5.3 shows how the learned representation can be used for retrieving images with similar facial attributes.

Training. The model is trained on the VoxCeleb1 and VoxCeleb2 video datasets [Nagrani et al.(2017)Nagrani, Chung, and Zisserman, Chung et al.(2018)Chung, Nagrani, and Zisserman]; we refer to the combined datasets as VoxCeleb+. The VoxCeleb+ dataset consists of videos of interviews containing more than 1 million utterances of around 7,000 speakers. The frames are extracted at 1 fps. The frames are cropped, resized to $256\times 256$ , and the identities are randomly split into train/val/test (with a split of 75/15/10).

The models are trained in PyTorch [Paszke et al.(2017)Paszke, Gross, Chintala, Chanan, Yang, DeVito, Lin, Desmaison, Antiga, and Lerer] using SGD with an initial learning rate of $0.001$ and momentum $0.9$. When using the curriculum strategy described in Section 4, the batch size is $N=32$; otherwise $N=8$. The learning rate is divided by 10 when the loss on the validation set plateaus. (If the curriculum strategy is used, the learning rate is updated only once the 40th-90th percentile range is being considered.) This is repeated until the loss converges. Further details about the training can be found in the supp. material.

First, we investigate the representation learned in our embedding and evaluate whether it indeed encodes facial attributes by challenging it to predict three different attributes: landmarks, pose, and expression.

Setup. Given a network trained on VoxCeleb+, a linear regressor or classifier is trained from the learned embedding to the output task. The linear regressor/classifier consists of two layers: batch-norm [Ioffe and Szegedy(2015)] followed by a linear fully connected layer with no bias. The regression tasks are trained using an MSE loss, and the classification tasks using a cross-entropy loss. The parameters of the encoder are fixed while the two additional layers are trained on the training set of the target dataset using Adam [Kingma and Ba(2015)] with a learning rate of $0.001$, $\beta_{1}=0.9$ and $\beta_{2}=0.999$.

Self-supervised. There are prior publications on using self-supervision for landmark prediction on the datasets we evaluate on, but none for predicting emotion on standard datasets. Consequently, we implement an autoencoder and a set of state-of-the-art self-supervised methods [Gidaris et al.(2018)Gidaris, Singh, and Komodakis, Zhang et al.(2017)Zhang, Isola, and Efros] for object detection and segmentation. The baselines are trained using the same architecture as FAb-Net but with their associated loss functions and training objectives. For [Zhang et al.(2017)Zhang, Isola, and Efros], the regression loss for both the L and ab channels is used. These models are trained on VoxCeleb+ until convergence, with the same training parameters and data augmentation as FAb-Net. More details are given in the supp. material.

VGG-Face descriptor. We additionally compare to the VGG-Face descriptor which is obtained from the 4096-dimensional FC7 features from a VGG-16 network trained on the VGG-Face dataset [Parkhi et al.(2015)Parkhi, Vedaldi, and Zisserman]. Contrary to popular belief, it has been recently shown that a network trained for identity does retain information about other facial attributes [Cole et al.(2017)Cole, Belanger, Krishnan, Sarna, Mosseri, and Freeman, Ephrat et al.(2018)Ephrat, Mosseri, Lang, Dekel, Wilson, Hassidim, Freeman, and Rubinstein]. We use the VGG-Face descriptor to learn a linear regression/classification layer to the desired attribute task. This provides a strong baseline, and the results obtained confirm the finding that a network trained for identity does indeed encode expression and to some extent also pose information. However, note that unlike our method, this face descriptor requires a large dataset of labelled face images for training.

Results

Facial landmark locations are regressed from the learned embedding and compared to state-of-the-art methods on MAFL [Zhang et al.(2016b)Zhang, Luo, Loy, and Tang] and the more challenging $300$ -W [Sagonas et al.(2016)Sagonas, Antonakos, Tzimiropoulos, Zafeiriou, and Pantic] datasets. The evaluation is performed as outlined in [Zhang et al.(2016b)Zhang, Luo, Loy, and Tang, Thewlis et al.(2017a)Thewlis, Bilen, and Vedaldi], and the errors given in inter-ocular distance. For MAFL, 5 facial landmarks are regressed for 19k/1k train/test images. For $300$ -W, $68$ landmarks are regressed for $3148/689$ train/test images which are obtained (as described in [Thewlis et al.(2017a)Thewlis, Bilen, and Vedaldi]) from combining multiple datasets [Zhu and Ramanan(2012), Belhumeur et al.(2013)Belhumeur, Jacobs, Kriegman, and Kumar, Zhou et al.(2013)Zhou, Fan, Cao, Jiang, and Yin].
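As an illustration of an error expressed in inter-ocular distance, the sketch below computes the mean Euclidean distance between predicted and ground-truth landmarks and divides it by the distance between the two ground-truth eye landmarks. It is only a sketch under assumptions: the exact protocol is the one defined in the cited works, the function name `inter_ocular_error` is hypothetical, and the default eye indices (0 and 1) are an assumed landmark ordering.

```python
import math


def inter_ocular_error(pred, truth, left_eye=0, right_eye=1):
    """Mean landmark error normalised by the ground-truth inter-ocular distance.

    pred and truth are equal-length lists of (x, y) landmark coordinates;
    left_eye and right_eye index the eye landmarks (assumed ordering).
    """
    (lx, ly), (rx, ry) = truth[left_eye], truth[right_eye]
    iod = math.hypot(rx - lx, ry - ly)
    mean_dist = sum(math.hypot(px - tx, py - ty)
                    for (px, py), (tx, ty) in zip(pred, truth)) / len(truth)
    return mean_dist / iod
```

Normalising by the inter-ocular distance makes the error independent of the face size in pixels, so results are comparable across images of different resolution.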

The results are reported in Table 2 and some qualitative results are visualised in Fig. 2 and Fig. 3. These results first demonstrate that fine-tuning with additional views and our curriculum strategy improve the embedding learned by FAb-Net. Second, these results show that our method performs competitively with or better than state-of-the-art unsupervised landmark detection methods, better than the VGG-Face descriptor baseline, and competitively with state-of-the-art supervised methods. This is achieved even though the other self-supervised methods [Zhang et al.(2018)Zhang, Guo, Jin, Luo, He, and Lee, Thewlis et al.(2017a)Thewlis, Bilen, and Vedaldi, Thewlis et al.(2017b)Thewlis, Bilen, and Vedaldi, Jakab et al.(2018)Jakab, Gupta, Bilen, and Vedaldi] are explicitly engineered to detect landmarks whereas our method is not. In addition, our method is able to bridge the domain gap between VoxCeleb+ and CelebA [Liu et al.(2015)Liu, Luo, Wang, and Tang]; the other self-supervised methods that we compare to are pre-trained on CelebA.

Pose. The learned embedding is used for pose prediction and compared to a supervised method [Kumar et al.(2017)Kumar, Alavi, and Chellappa] and to using the VGG-Face descriptor. To perform the evaluation, the linear regression is trained from the given embedding to head pose labels using the AFLW dataset [Koestinger et al.(2011)Koestinger, Wohlhart, Roth, and Bischof], but after leaving out the $1,000$ images of the AFLW test set from [Kumar et al.(2017)Kumar, Alavi, and Chellappa]. As can be seen in Table 2, FAb-Net performs better in predicting the roll angle, and the MAE is comparable to [Kumar et al.(2017)Kumar, Alavi, and Chellappa] which is supervised with head pose labels. Furthermore, our embedding outperforms the VGG-Face descriptor which is trained on identities; i.e. our learned embedding encodes more information about head pose.

We evaluate the performance of our learned embedding for expression estimation on two datasets: AffectNet [Mollahosseini et al.(2017)Mollahosseini, Hasani, and Mahoor] and EmotioNet [Benitez-Quiroz et al.(2016)Benitez-Quiroz, Srinivasan, and Martinez], which both contain over 900,000 images. These datasets are taken ‘in-the-wild’ as opposed to in a constrained environment. AffectNet contains 8 facial expressions (neutral, happy, sad, surprise, fear, disgust, anger, contempt) and EmotioNet contains 11 action units (AUs) (combinations of AUs correspond to facial expressions).

Both of these datasets were organised for challenges with a held out, unreleased test set. Therefore, the train set is subdivided into two subsets; one is used for training and the other for validation. The validation set of the original dataset is used to test the different models. The linear classifier for EmotioNet is trained with a binary cross-entropy loss for each AU, whereas for AffectNet, a cross-entropy loss is used. Both training datasets are highly imbalanced. As a result, the examples from the under-represented classes are re-weighted inversely proportionally to the class frequencies to penalise the loss more heavily for mis-classifying images of the under-represented classes.
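The re-weighting can be sketched as follows. The text only states that weights are inversely proportional to class frequency; the particular normalisation used here (total count divided by the number of classes times the class count, so that every class contributes the same total weight) is an assumption of this sketch, and `inverse_frequency_weights` is a hypothetical name.

```python
from collections import Counter


def inverse_frequency_weights(labels):
    """Map each class to a loss weight inversely proportional to its frequency.

    Normalisation (an assumption): weight = n / (k * count), where n is the
    number of examples and k the number of classes.
    """
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * count) for cls, count in counts.items()}
```

The resulting weights would multiply the per-example loss terms, so that a mistake on a rare class costs more than a mistake on a frequent one.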

The embedding learned by FAb-Net is compared to a number of self-supervised and supervised methods by measuring the Area Under the ROC curve (AUC). For each class (e.g. emotion or AU), the AUC is computed independently and the result is averaged over all classes. The results are reported in Table 3 and Table 4 for EmotioNet and AffectNet respectively, showing that our network performs better than other self-supervised methods over both metrics when given the same training data. This is likely because the network must learn to transform the source frame in order to generate the target frame. As parts of the face move together (e.g. an eyebrow raise or the lips when the mouth opens), the embedding must learn to encode information about facial features and thereby encode expression. Interestingly, the autoencoder performs well, presumably due to the restricted nature of this domain.
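The class-averaged AUC can be sketched in pure Python using the pairwise-ranking definition of the AUC: the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as one half. The function names `auc` and `mean_auc` are hypothetical, and the quadratic pairwise loop is chosen for clarity rather than speed.

```python
def auc(scores, labels):
    """Area under the ROC curve for one class (labels are 0 or 1)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    # Count positive/negative pairs ranked correctly; ties count as half.
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0
               for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def mean_auc(score_columns, label_columns):
    """Average the per-class AUC over all classes (one column per class)."""
    per_class = [auc(s, y) for s, y in zip(score_columns, label_columns)]
    return sum(per_class) / len(per_class)
```

Because the AUC depends only on the ranking of scores within each class, it is insensitive to the class imbalance noted above, which makes it a reasonable summary for these datasets.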

FAb-Net is also not far off supervised methods despite the domain shift; VoxCeleb+ consists only of people being interviewed (and consequently with mostly neutral/smiling faces), so it does not include the range/extremity of expressions found in AffectNet or EmotioNet. Finally, it can be observed that the VGG-Face descriptor trained to predict identities does surprisingly well at predicting emotion.

Discussion. FAb-Net has achieved impressive performance, as most self-supervised methods when transferred to another task have a large gap in comparison to supervised methods. There is no gap or a small gap for the smaller datasets (landmarks/pose) and the model approaches supervised performance for the larger datasets (expression).

What is the benefit of additional source frames?

The previous sections have shown that using additional source frames improves performance. This is at the expense of performing additional forward passes through the encoder (in this case two). Given enough GPU memory, these forward passes can be done in parallel, affecting only the memory requirements and not the computational speed.

Using multiple source frames is further investigated by visualising confidence heatmaps for a given set of source frames in Fig. 4. The confidence heatmaps allow images with more similar pose to be used for creating the generated frame. Furthermore, the network can focus on one frame for generating a part of the face (e.g. the mouth) and on another one for a different part.

Image retrieval

This section considers an application of the learned embedding: retrieving images with similar facial attributes (e.g. pose) but across different identities. To perform this task, a subset of 10,000 randomly sampled test images from VoxCeleb+ is obtained. For a given query image, all other images (the gallery) are ranked based on their similarity to the query image using the cosine similarity metric between the corresponding embeddings. For a given query image $Q$ , the embedding $x_{q}$ is extracted by performing a forward pass through the network. Similarly, the embedding $x_{i}$ is extracted for each image $I_{i}$ in the gallery. Each image $I_{i}$ is then ranked according to the cosine similarity between $x_{q}$ and $x_{i}$ . If the network does indeed encode salient information about facial attributes, the cosine similarity can be used to identify images with similar poses and facial attributes. For a set of query images, the results are visualised in Fig. 5. From these results it is again affirmed that our embedding encodes information about facial attributes, as the retrieved images have poses and expressions similar to those of the query images. Note, the embedding is largely unaffected by facial decorations (e.g. glasses) and identity, as these do not change within a face-track and so do not need to be learned in order to predict the transformation.
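The ranking step can be sketched as below, assuming the embeddings have already been extracted and are available as plain lists of floats. The names `cosine_similarity` and `rank_gallery` are hypothetical, and the sketch assumes no embedding is the zero vector.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_gallery(query, gallery):
    """Return gallery indices sorted from most to least similar to the query."""
    sims = [cosine_similarity(query, g) for g in gallery]
    return sorted(range(len(gallery)), key=lambda i: sims[i], reverse=True)
```

For example, `rank_gallery([1.0, 0.0], [[0.0, 1.0], [2.0, 0.1], [-1.0, 0.0], [1.0, 1.0]])` returns `[1, 3, 0, 2]`: the nearly parallel vector comes first and the opposite vector last, regardless of vector length.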

Conclusion

We have introduced FAb-Net: a self-supervised framework for learning facial attributes from videos. Our method learns about pose and expression by watching faces move and change over a large number of videos without any hand labels. The features of our trained network can then be used to predict pose, landmarks, and expression on other datasets (despite the domain shift) by just training a linear layer on top of the learned embedding. The features have been shown to give performance comparable or superior to that of self-supervised and supervised methods on a variety of tasks. This is impressive, as the performance of self-supervised methods has generally been found to be worse than that of supervised methods, yet our method is competitive with or superior to supervised methods for pose regression and facial landmark detection, and approaches supervised performance on expression classification.

Acknowledgements

The authors would like to thank James Thewlis for helpfully sharing code and datasets. This work was funded by an EPSRC studentship and EPSRC Programme Grant Seebibyte EP/M013774/1.