From Facial Parts Responses to Face Detection: A Deep Learning Approach

Shuo Yang, Ping Luo, Chen Change Loy, Xiaoou Tang

Introduction

Neural network based methods were once widely applied for localizing faces , but they were soon replaced by various non-neural network-based face detectors, which are based on cascade structure and deformable part models (DPM) detectors. Deep convolutional networks (DCN) have recently achieved remarkable performance in many computer vision tasks, such as object detection, object classification, and face recognition. Given the recent advances of deep learning and graphical processing units (GPUs), it is worthwhile to revisit the face detection problem from the neural network perspective.

In this study, we wish to design a deep convolutional network for face detection, with the aim of not only exploiting the representation learning capacity of DCN, but also formulating a novel way for handling the severe occlusion issue, which has been a bottleneck in face detection. To this end, we design a new deep convolutional network with the following appealing properties: (1) It is robust to severe occlusion. As depicted in Fig. 1, our method can detect faces even more than half of the face region is occluded; (2) it is capable of detecting faces with large pose variation, e.g. profile view without training separate models under different viewpoints; (3) it accepts full image of arbitrary size and the faces of different scales can appear anywhere in the image.

All the aforementioned properties, which are challenging to achieve with conventional approaches, are made possible with the following considerations:

(1) Generating face parts responses from attribute-aware deep networks: We believe the reasoning of unique structure of local facial parts (e.g. eyes, nose, mouths) is the key to address face detection in unconstrained environment. To this end, we design a set of attribute-aware deep networks, which are pre-trained with generic objects and then fine-tuned with specific part-level binary attributes (e.g. mouth attributes including big lips, opened mouth, smiling, wearing lipstick). We show that these networks could generate response maps in deep layers that strongly indicate the locations of the parts. The examples depicted in Fig. 1(b) show the responses maps (known as ‘partness map’ in our paper) of five different face parts.

(2) Computing faceness score from responses configurations: Given the parts responses, we formulate an effective method to reason the degree of face likeliness through analysing their spatial arrangement. For instance, the hair should appear above the eyes, and the mouth should only appear below the nose. Any inconsistency would be penalized. Faceness scores will be derived and used to re-rank candidate windows of any generic object proposal generator to obtain a set of face proposals. Our experiment shows that our face proposal enjoys a high recall with just modest number of proposals (over 90% of face recall with around $150$ proposals, $\approx$ 0.5% of full sliding windows, and $\approx$ 10% of generic object proposals).

(3) Refining the face hypotheses – Both the aforementioned components offer us the chance to find a face even under severe occlusion and pose variations. The output of these components is a small set of high-quality face bounding box proposals that cover most faces in an image. Given the face proposals, we design a multitask deep convolutional network in the second stage to refine the hypotheses further, by simultaneously recognizing the true faces and estimating more precise face locations.

Our main contribution in this study is the novel use of DCN for discovering facial parts responses from arbitrary uncropped face images. Interestingly, in our method, part detectors emerge within CNN trained to classify attributes from uncropped face images, without any part supervision. This is new in the literature. We leverage this new capability to further propose a face detector that is robust to severe occlusion. Our network achieves the state-of-the-art performance on challenging face detection benchmarks including FDDB, PASCAL Faces, and AFW. We show that practical runtime speed can be achieved albeit the use of DCN.

Related Work

There is a long history of using neural network for the task of face detection . An early face detection survey provides an extensive coverage on relevant methods. Here we highlight a few notable studies. Rowley et al. exploit a set of neural network-based filters to detect presence of faces in multiple scales, and merge the detections from individual filters. Osadchy et al. demonstrate that a joint learning of face detection and pose estimation significantly improves the performance of face detection. The seminal work of Vaillant et al. adopt a two-stage coarse-to-fine detection. Specifically, the first stage approximately locates the face region, whilst the second stage provides a more precise localization. Our approach is inspired by these studies, but we introduce innovations on many aspects. In particular, we employ contemporary deep learning strategies, e.g. pre-training, to train deeper networks for more robust feature representation learning. Importantly, our first stage network is conceptually different from that of , and many recent deep learning detection frameworks – we train attribute-aware deep convolutional networks to achieve precise localization of facial parts, and exploit their spatial structure for inferring face likeliness. This concept is new and it allows our model to detect faces under severe occlusion and pose variations. While great efforts have been devoted for addressing face detection under occlusion , these methods are all confined to frontal faces. In contrast, our model can discover faces under variations of both pose and occlusion.

In the last decades, cascade based and deformable part models (DPM) detectors dominate the face detection approaches. Viola and Jones introduced fast Haar-like features computation via integral image and boosted cascade classifier. Various studies thereafter follow a similar pipeline. Amongst the variants, SURF cascade was one of the top performers. Later Chen et al. demonstrate state-of-the-art face detection performance by learning face detection and face alignment jointly in the same cascade framework. Deformable part models define face as a collection of parts. Latent Support Vector Machine is typically used to find the parts and their relationships. DPM is shown more robust to occlusion than the cascade based methods. A recent study demonstrates state-of-the-art performance with just a vanilla DPM, achieving better results than more sophisticated DPM variants .

A recent study shows that face detection can be further improved by using deep learning, leveraging the high capacity of deep convolutional networks. In this study, we push the performance limit further. Specifically, the network proposed by does not have explicit mechanism to handle occlusion, the face detector therefore fails to detect faces with heavy occlusions, as acknowledged by the authors. In contrast, our two-stage architecture has its first stage designated to handle partial occlusions. In addition, our network gains improved efficiency by adopting the more recent fully convolutional architecture, in contrast to the previous work that relies on the conventional sliding window approach to obtain the final face detector.

The first stage of our model is partially inspired by the generic object proposal approaches . Generic object proposal generators are now an indispensable component of standard object detection algorithms through providing high-quality and category-independent bounding boxes. These generic methods, however, are devoted to generic objects therefore not suitable to propose windows specific to face. In particular, applying a generic proposal generator directly would produce enormous number of candidate windows but only minority of them contain faces. In addition, a generic method does not consider the unique structure and parts on the face. Hence, there will be no principled mechanism to recall faces when the face is only partially visible. These shortcomings motivate us to formulate the new faceness measure to achieve high recall on faces, whilst reduce the number of candidate windows to half the original.

Faceness-Net

This section introduces the proposed attribute-aware face proposal and face detection approach, Faceness-Net. In the following, we first briefly overview the entire pipeline and then discuss the details.

Faceness-Net’s pipeline consists of three stages, i.e. generating partness maps, ranking candidate windows by faceness scores, and refining face proposals for face detection. In the first stage as shown in Fig. 2(a), a full image ${\bf x}$ is used as input to five CNNs. Note that all the five CNNs can share deep layers to save computational time. Each CNN outputs a partness map, which is obtained by weighted averaging over all the label maps at its top convolutional layer. Each of these partness maps indicates the location of a specific facial component presented in the image, e.g. hair, eyes, nose, mouth, and beard, denoted by ${\bf h}^{a}$ , ${\bf h}^{e}$ , ${\bf h}^{n}$ , ${\bf h}^{m}$ , and ${\bf h}^{b}$ , respectively. We combine all these partness maps into a face label map ${\bf h}^{f}$ , which clearly designates faces’ locations.

In the second stage, given a set of candidate windows that are generated by existing object proposal methods such as , we rank these windows according to their faceness scores, which are extracted from the partness maps with respect to different facial parts configurations, as illustrated at the bottom of Fig. 2(b). For example, as visualized in Fig. 2(b), a candidate window ‘A’ covers a local region of ${\bf h}^{a}$ (i.e. hair) and its faceness score is calculated by dividing the values at its upper part with respect to the values at its lower part, because hair is more likely to present at the top of a face region. A final faceness score of ‘A’ is obtained by averaging over the scores of these parts. In this case, large number of false positive windows can be pruned. Notably, the proposed approach is capable of coping with severe face occlusions, as shown in Fig. 2(c), where face windows ‘A’ and ‘E’ can be retrieved by objectness only if large amount of windows are proposed, whilst they rank top $50$ by using our method.

In the last stage, the proposed candidate windows are refined by training a multitask CNN, where face classification and bounding box regression are jointly optimized.

Network structure. Fig. 3 depicts the structure and hyper-parameters of the CNN in Fig. 2(a), which stacks seven convolutional layers (conv1 to conv7) and two max-pooling layers (max1 and max2). This convolutional structure is inspired by the AlexNet in image classification. Many recent studies showed that stacking many convolutions as AlexNet did can roughly capture object locations.

Learning partness maps. As shown in Fig. 4(b), a deep network trained on generic objects, e.g. AlexNet , is not capable of providing us with precise faces’ locations, let alone partness map. The partness maps can be learned in multiple ways. The most straight-forward manner is to use the image and its pixelwise segmentation label map as input and target, respectively. This setting is widely employed in image labeling . However, it requires label maps with pixelwise annotations, which are expensive to collect. Another setting is image-level classification (i.e. faces and non-faces), as shown in Fig. 4(c). It works well where the training images are well-aligned, such as face recognition . Nevertheless, it suffers from complex background clutter because the supervisory information is not sufficient to account for face variations. Its learned feature maps contain too much noises, which overwhelm the actual faces’ locations. Attribute learning in Fig. 4(d) extends the binary classification in (c) to the extreme by using a combination of attributes to capture face variations. For instance, an ‘Asian’ face can be distinguished from a ‘European’ face. However, our experiments demonstrate that the setting is not robust to occlusion. Hence, as shown in Fig. 4(e), this work extends (d) by partitioning attributes into groups based on facial components. For instance, ‘black hair’, ‘blond hair’, ‘bald’, and ‘bangs’ are grouped together, as all of them are related to hair. The grouped attributes are summarized in Table 1. In this case, different face parts can be modeled by different CNNs (with option to share some deep layers). If one part is occluded, the face region can still be localized by CNNs of the other parts.

However, the partness map generated by the Hair-CNN trained as above contains erroneous responses at the background, revealing that this training scheme is not sufficient to account for background clutter. To obtain a cleaner partness map, we employ the merit from object categorization, where CNN is pre-trained with massive general object categories in ImageNet as in . It can be viewed as a supervised pre-training for the Hair-CNN.

2 Ranking Windows by Faceness Measure

Our approach is loosely coupled with existing generic object proposal generators - it accepts candidate windows from the latter but generates its own faceness measure to return a ranked set of top-scoring face proposals. Fig. 5 takes hair and eyes to illustrate the procedure of deriving the faceness measure from a partness map. Let $\Delta_{w}$ be the faceness score of a window $w$ . For example, as shown in Fig. 5(a), given a partness map of hair, ${\bf h}^{a}$ , $\Delta_{w}$ is attained by dividing the sum of values in ABEF (red) by the sum of values in FECD. Similarly, Fig. 5(b) expresses that $\Delta_{w}$ is obtained by dividing the sum of values in EFGH (red) with respect to ABEF+HGCD of ${\bf h}^{e}$ .

For both of the above examples, larger value of $\Delta_{w}$ indicates $w$ has higher overlapping ratio with face. These spatial configurations, such as ABEF in (a) and EFGH in (b), can be learned from data. We take hair as an example. We need to learn the positions of points E and F, which can be represented by the $(x,y)$ -coordinates of ABCD, i.e. the proposed window. For instance, the position of E in (a) can be represented by $x_{e}=x_{b}$ and $y_{e}=\lambda y_{b}+(1-\lambda)y_{c}$ , implying that the value of its $y$ -axis is a linear combination of $y_{b}$ and $y_{c}$ . With this representation, $\Delta_{w}$ can be efficiently computed by using the integral image (denoted as ${\bf I}$ ) of the partness map. For instance, $\Delta_{w}$ in (a) is attained by

where ${\bf I}(x,y)$ signifies the value at the location $(x,y)$ .

Given a training set $\{w_{i},r_{i},{\bf h}_{i}\}_{i=1}^{M}$ , where $w_{i}$ and $r_{i}\in\{0,1\}$ denote the $i$ -th window and its label (i.e. face/non-face), respectively. ${\bf h}_{i}$ is the cropped partness map with respect to the $i$ -th window, e.g. region ABCD in ${\bf h}^{a}$ . This problem can be simply formulated as maximum a posteriori (MAP)

where $\lambda$ represents a set of parameters when learning the spatial configuration of hair (Fig. 5(a)). $p(r_{i}|\lambda,w_{i},{\bf h}_{i})$ and $p(\lambda,w_{i},{\bf h}_{i})$ stand for the likelihood and prior, respectively. The likelihood of faceness can be modeled by a sigmoid function, i.e. $p(r_{i}|\lambda,w_{i},{\bf h}_{i})=\frac{1}{1+\exp(\frac{-\alpha}{\Delta_{w_{i}}})}$ , where $\alpha$ is a coefficient. This likelihood measures the confidence of partitioning face and non-face, given a certain spatial configuration. The prior term can be factorized, $p(\lambda,w_{i},{\bf h}_{i})=p(\lambda)p(w_{i})p({\bf h}_{i})$ , where $p(\lambda)$ is a uniform distribution between zero and one, as it indicates the coefficients of linear combination, $p(w_{i})$ models the prior of the candidate window, which can be generated by object proposal methods, and $p({\bf h}_{i})$ is the partness map as in Sec. 3.1. Since $\lambda$ typically has low dimension in this work (e.g. one dimension of hair), it can be simply obtained by line search. Nevertheless, Eq.(2) can be easily extended to model more complex spatial configurations.

3 Face Detection

The proposed windows achieved by faceness measure have high recall rate. To improve it further, we refine these windows by joint training face classification and bounding box regression using a CNN similar to the AlexNet .

In particular, we fine-tune AlexNet using face images from AFLW and person-free images from PASCAL VOC 2007 . For face classification, a proposed window is assigned with a positive label if the IoU between it and the ground truth bounding box is larger than $0.5$ ; otherwise it is negative. For bounding box regression, each proposal is trained to predict the positions of its nearest ground truth bounding box. If the proposed window is a false positive, the CNN outputs a vector of $$. We adopt the Euclidean loss and cross-entropy loss for bounding box regression and face classification, respectively.

Experimental Settings

Training datasets. (i) We employ CelebFaces dataset to train our attribute-aware networks. The dataset contains 87,628 web-based images exclusive from the LFW , FDDB , AFW and PASCAL datasets. We label all images in the CelebFaces dataset with $25$ facial attributes and divide the labeled attributes into five categories based on their respective facial parts as shown in Table 1. We randomly select $75,000$ images from the CelebFaces dataset for training and the remaining is reserved as validation set. (ii) For face detection training, we choose $13,205$ images from the AFLW dataset to ensure a balanced out-of-plane pose distribution and $5,771$ random person-free images from the PASCAL VOC $2007$ dataset.

Part response testing dataset. In Sec. 5.1, we use LFW dataset for evaluating the quality of part response maps for part localization. We select $2,927$ LFW images following since it provides manually labeled hair+beard superpixel labels, on which the minimal and maximal coordinates can be used to generate the ground truth of face parts bounding boxes. Similarly, face parts boxes for eye, nose and mouth are manually labeled guided by the $68$ dense facial landmarks.

Face proposal and detection testing datasets. In Sec. 5.2 and Sec. 5.3, we use the following datasets. (i) FDDB dataset contains the annotations for $5,171$ faces in a set of $2,845$ images. For the face proposal evaluation, we follow the standard evaluation protocol in object proposal studies and transform the original FDDB ellipses ground truth into bounding boxes by minimal bounding rectangle. For the face detection evaluation, the original FDDB ellipse ground truth is used. (ii) AFW dataset is built using Flickr images. It has $205$ images with $473$ annotated faces with large variations in both face viewpoint and appearance. (iii) PASCAL faces is a widely used face detection benchmark dataset. It consists of $851$ images and $1,341$ annotated faces.

Evaluation settings. Following , we employ the Intersection over Union (IoU) as evaluation metric. We fix the IoU threshold to $0.5$ following the strict PASCAL criterion. In particular, an object is considered being covered/detected by a proposal if IoU is no less than $0.5$ . To evaluate the effectiveness of different object proposal algorithms, we use the detection rate (DR) given the number of proposals per image . For face detection, we use standard precision and recall (PR) to evaluate the effectiveness of face detection algorithms.

Results

Robustness to unconstrained training input. In the testing stage, the proposed approach does not assume well-cropped faces as input. In the training stage, our approach neither requires well-cropped faces for learning. This is an unique advantage over existing approaches.

To support this statement, we conduct an experiment by fine-tuning two different CNNs as in Fig. 2(a), each of which taking different inputs: (1) uncropped images, which may include large portion of background clutters apart the face; and (2) cropped images, which encompass roughly the face and shoulder regions. The performance is measured based on the part detection rateThe face part bounding box is generated by first conducting non-maximum suppression (NMS) on the partness maps, and finding bounding boxes centered on NMS points.. Note that we combine the evaluation on ‘Hair+Beard’ to suit the ground truth provided by (see Sec. 4). The detection results are summarized in Table 2. As can be observed, the proposed approach performs similarly given both the uncropped and cropped images as training inputs. The results suggest the robustness of the method in handling unconstrained images for training. In particular, thanks to the facial attribute-driven training, despite the use of uncropped images, the deep model is encouraged to discover and capture the facial part representation in the deep layers, it is therefore capable of generating response maps that precisely pinpoint the locations of parts. In the following experiments, all the proposed models are trained on uncropped images. Fig. 6(a) shows the qualitative results. Note that facial parts can be discovered despite challenging poses.

2 From Part Responses to Face Proposal

Comparison with generic object proposals. In this experiment, we show the effectiveness of adapting different generic object proposal generators to produce face-specific proposals. Since the notion of face proposal is new, no suitable methods are comparable therefore we use the original generic methods as baselines. We first apply any object proposal generator to generate the proposals and we use our method described in Sec. 3.2 to obtain the face proposals. We experiment with different parameters for the generic methods, and choose parameters that produce moderate number of proposals with very high recall. Evaluation is conducted following the standard protocol .

The results are shown in Fig. 7. It can be observed that our method consistently improves the state-of-the-art methods for proposing face candidate windows, under different IoU thresholds. Table 3 shows that our method achieves high recall with small number of proposals.

Evaluate the contribution of each face part. We factor the contributions of different face parts to face proposal. Specifically, we generate face proposals with partness maps from each face part individually using the same evaluation protocol in previous experiment. As can be observed from Fig. 8(a), the hair, eye, and nose parts perform much better than mouth and beard. The lower part of the face is often occluded, making the mouth and beard less effective in proposing face windows. In contrast, hair, eye, and nose are visible in most cases. Nonetheless, mouth and beard can provide complementary cues.

Face proposals with different training strategies. As discussed in Sec. 3.1, there are different fine-tuning strategies that can be considered for generating a response map. We compare face proposal performance between different training strategies. Quantitative results in Fig. 9 shows that our approach performs significantly better than approaches (c) and (d). This suggests that attributes-driven fine-tuning is more effective than ‘face and non-face’ supervision. As can be observed in Fig. 4 our method generates strong response even on the occluded face compared with approach (d), which leads to higher quality of face proposal.

3 From Face Proposal to Face Detection

In this experiment, we first show the influence of training a face detector using generic object proposals and our face proposals. Next we compare our face detector, Faceness-Net, with state-of-the-art face detection approaches.

Generic object proposal versus face proposal. We choose the best performer in Fig. 7, i.e. MCG, to conduct this comparison. The result is shown in Fig. 8(b). The best performance, a recall of $93\%$ , is achieved by using our faceness measure to re-rank the MCG top $200$ proposals (Faceness+MCG top- $200$ ). Using MCG top $200$ proposals alone yields the worst result. Even if we adjust the number of MCG proposal to $1,100$ with a high recall rate similar to that of our method, the result is still inferior due to the enormous number of false positives. The results suggest that the face proposal generated by our approach is more accurate in finding faces than generic object proposals for face detection.

Comparison with face detectors. We conduct face detection experiment on three datasets FDDB , AFW and PASCAL faces . Our face detector, Faceness-Net, is trained with top $200$ proposals by re-ranking MCG proposals following the process described in Sec. 3.3. We adopt the PASCAL VOC precision-recall protocol for evaluation.

We compare Faceness-Net against all published methods in the FDDB. For the PASCAL faces and AFW we compare with (1) deformable part based methods, e.g. structure model and Tree Parts Model (TSM) ; (2) cascade-based methods, e.g. Headhunter . Figures 10, 11, and 12 show that Faceness-Net outperforms all previous approaches by a considerable margin, especially on the FDDB dataset. Fig 6(b) shows some qualitative results on FDDB dataset together with the partness maps. More detection results are shown in Fig 13.

Discussion

There is a recent and concurrent study that proposed a Cascade-CNN for face detection. Our method differs significantly to this method in that we explicitly handle partial occlusion by inferring face likeliness through part responses. This difference leads to a significant margin of $2.65\%$ in recall rate (Cascade-CNN $85.67\%$ , our method $88.32\%$ ) when the number of false positives is fixed at 167 on the FDDB dataset. The complete recall rate of the proposed Faceness-Net is $90.99\%$ compared to $85.67\%$ of Cascade-CNN.

At the expense of recall rate, the fast version of Cascade-CNN achieves $14$ fps on CPU and $100$ fps on GPU for $640\times 480$ VGA images. The fast version of the proposed Faceness-Net can also achieve practical runtime efficiency, but still with a higher recall rate than the Cascade-CNN. The speed up of our method is achieved in two ways. First, we share the layers from conv1 to conv5 in the first stage of our model since the face part responses are only captured in layer conv7 (Fig. 2). The computations below conv7 in the ensemble are mostly redundant, since their filters capture global information e.g. edges and regions. Second, to achieve further efficiency, we replace MCG with Edgebox for faster generic object proposal, and reduce the number of proposal to $150$ per image. Under this aggressive setting, our method still achieves a $87\%$ recall rate on FDDB, higher than the $85.67\%$ achieved by the full Cascade-CNN. The new runtime of our two-stage model is $50$ ms on a single GPUWe use the same Nvidia Titan Black GPU as in Cascade-CNN . for VGA images. The runtime speed of our method is comparatively lower than because our implementation is currently based on unoptimized MATLAB code.

We note that further speed-up is possible without much trade-off on detection performance. Specifically, our method will benefit from Jaderberg et al. , who show that a CNN structure can enjoy a $2.5\times$ speedup with no loss in accuracy by approximating non-linear filtering with low-rank expansions. Our method will also benefit from the recent model compression technique .

Acknowledgement This workFor more technical details, please contact the corresponding author Ping Luo via pluo.lhi@gmail.com. is partially supported by the National Natural Science Foundation of China (91320101, 61472410, 61503366).