Context-aware CNNs for person head detection

Tuan-Hung Vu, Anton Osokin, Ivan Laptev

Introduction

Common images and videos primarily focus on people. Indeed, about $35\%$ of pixels in movies and YouTube videos, as well as about $25\%$ of pixels in photographs, belong to people. This strong bias, together with the growing volume of everyday videos and photographs, calls for reliable methods for person analysis in visual data.

Person detection is a key component of many tasks, including person identification, action recognition, age and gender recognition, autonomous driving, clothing recognition and many others. While face detection has reached maturity, the more general task of finding people in images and video remains very challenging. For example, state-of-the-art object detectors reach only $65\%$ Average Precision for the person class on the Pascal VOC benchmark. Common difficulties arise from variations in human pose, background clutter, motion blur, low image resolution, occlusions and poor lighting conditions.

Recent advances in Convolutional Neural Networks (CNNs) have brought significant progress in image classification and other vision tasks. In particular, CNN-based object detectors such as R-CNN have shown large gains compared to previous models. Most existing methods, however, treat objects independently and model appearance inside object bounding boxes only. Meanwhile, information available in the scene around objects, as well as relations among objects, is known to provide complementary contextual cues for recognition. Such cues are likely to be particularly helpful when object appearance is not discriminative enough because of low image resolution, poor lighting and other factors.

In this work we build on the recent CNN model for object detection and extend it to contextual reasoning. We particularly focus on person detection and aim to locate human heads on images coming from video data. The choice of heads is motivated by frequent occlusions of other body parts. When visible, however, other body parts and the rest of the scene constrain locations of heads in the image. Moreover, interactions between people put constraints on the relative positions and appearance of heads. We aim to leverage such constraints for detection by introducing the following two models.

First, we propose a Global CNN model which we train to predict coarse locations and scales of objects given the full low-resolution image on the input. In contrast to our base Local model limited to object appearance only, the Global model uses all pixels of the image for prediction. Interestingly, we find this simple model to provide quite accurate localization of heads across positions and scales of the image. Second, we introduce a Pairwise CNN model that explicitly models relations among pairs of objects. Motivated by Desai et al. , we build a joint score function for multiple object hypotheses in the image. This score function considers the relative positions, scales and appearance of heads. All parameters of the score function depend on the image data and are learned by optimizing a structured-output loss function. Our final joint model combines Local, Global and Pairwise CNN models (see Figure 1).

To train and test our model, we introduce a new large dataset with $369,846$ human heads annotated in $224,740$ video frames from 21 movies. We show the importance of our large dataset for training and evaluate our method on the new and two existing datasets. The results demonstrate improvements of the proposed contextual CNN model compared to other recent baselines including R-CNN on all three datasets. We also demonstrate a speed-up of object detection provided by our Global model. Our new dataset and the code are publicly available from the project web-page .

The rest of the paper is organized as follows. We review related work in Section 2. Section 3 describes the parts of our contextual CNN model. Section 4 introduces datasets followed by the presentation of experimental results in Section 5. Section 6 concludes the paper.

Related works

The history of object detection with neural networks dates back to the 90s, but methods of this group started to outperform others, e.g. DPM, only after the seminal work of Krizhevsky et al. Szegedy et al. and Sermanet et al. applied CNNs as sliding-window detectors at multiple scales. The R-CNN model is a combination of a CNN and a support vector machine (SVM) operating on object proposals generated by selective search. The pipeline of our Local model is similar to that of R-CNN (see Section 3.1 for details).

The use of image context to support object detection has been proposed in prior work. Contextual information can be modeled at a global scene level as well as at the level of object relations. For example, Murphy et al. propose a CRF model for jointly solving the tasks of object detection and scene classification. Modolo et al. use a context forest to predict object locations and to speed up object detection using global scene information. Erhan et al. use a CNN to predict the coordinates of object bounding boxes. Our Global CNN model predicts likely locations and scales of objects by producing a multi-scale heat map for the whole image.

Desai et al. model spatial constellations of objects in the image and construct an energy with unary and pairwise potentials. Unary potentials represent the confidence of object hypotheses based on the local image evidence, while pairwise potentials model the spatial arrangement of objects in the image. Hoai and Zisserman substitute the pairwise dependencies with a latent variable that represents the preferable configuration of object hypotheses. In both works the pairwise potentials do not depend on the actual image data; moreover, unary potentials are trained independently of the joint model. Our Pairwise model exploits object context, i.e. it builds a graphical model (an energy function) that reasons about multiple image locations jointly. Our approach is richer than both, as it allows pairwise dependencies to be conditioned on the image data and lets us train the base detector jointly with the graphical model on top of it.

Our Pairwise CNN model incorporates a structured-output loss. The idea of combining a structured-prediction objective with neural networks has been explored before. Recently Domke and Chen et al. used the dual message-passing formulation of the inference task to construct a joint objective over the CNN parameters and the message-passing variables. This approach was applied to small-scale denoising and binary segmentation tasks as well as to image tagging and word recognition tasks. Jaderberg et al. show how to directly combine the structured SVM (SSVM) objective with the procedure of training a CNN for text recognition. CNNs with structured prediction have also recently been explored for the task of human pose estimation. Chen and Yuille propose a model with data-dependent pairwise potentials, but the different parts of the model were trained separately. Tompson et al. construct a specific neural network that mimics the behaviour of several rounds of a message-passing inference algorithm. Our Pairwise model is trained with an explicit structured-output surrogate loss with an external inference routine inside, which makes it possible to fine-tune all parameters of the model jointly.

Context-aware CNN model

This section presents main components of our contextual CNN model. In Section 3.1, we describe our Local model building on R-CNN . In Section 3.2, we introduce the Global CNN model trained to score object proposals using the context of the full image. Section 3.3 describes our extension of CNNs with a structured-output loss function aimed to model pairwise relations between objects.

Our Local model follows R-CNN and uses selective search proposals to restrict the set of object hypotheses. We extend the bounding box of each proposal with a small margin to capture local image context around objects. The image patch corresponding to each proposal is then resized to fit the input layer of the CNN. As we are interested in head detection, we select bounding boxes with square-like aspect ratios $\mathcal{R}\in[2/3,3/2]$ and refer to them as candidates.

The R-CNN model is based on the AlexNet architecture pre-trained on the ImageNet dataset. We have considered several alternatives including VGG-S, VGG-verydeep-16 and the network of Oquab et al. In our experiments VGG-S slightly outperformed AlexNet but was significantly slower in both training and testing. VGG-verydeep-16 showed better performance but was much slower. The network of Oquab et al. had better accuracy and similar speed compared to AlexNet (see Section 5.3 for details). For the experiments in this paper we use the pre-trained network of Oquab et al. extended by one randomly initialized fully-connected layer (with 2048 nodes) followed by ReLU and dropout.

To train the network, we optimize its parameters by minimizing the sum of independent log-losses using stochastic gradient descent (SGD) with momentum. Unlike R-CNN, which uses a second training pass with an SVM, we use the outputs of the CNN directly to score candidates. We found this training procedure to work better for our problem than the standard R-CNN training. More details on our training procedure can be found in Appendix A.1.

3.2 Global model

Our Global model uses image-level information to reason about locations of objects in the image. The Global model is a CNN that takes the whole image as input and outputs a score for each cell of a multi-scale heat map. The input image is isotropically rescaled and zero-padded to fit the standard CNN input of $224\times 224$ pixels. The output of the network is defined as a multi-scale grid of scores, corresponding to object hypotheses with coarsely discretized locations and scales in the image (see Figure 3). Object hypotheses form a grid of $\textsc{c}=284$ square cells of four sizes ($28\times 28$, $56\times 56$, $112\times 112$ and $224\times 224$ pixels) with a stride equal to $50\%$ of the cell size. Except for the output layer, the architecture of the Global CNN is identical to our Local model described in Section 3.1.
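The cell count can be checked by enumerating the grid directly. The sketch below assumes that cells are axis-aligned squares lying fully inside the $224\times 224$ input and that the stride is exactly half the cell size; the function name `build_grid` is illustrative and not part of the released code.

```python
def build_grid(image_size=224, cell_sizes=(28, 56, 112, 224)):
    """Enumerate square cells (x, y, size) with a stride of half the cell size."""
    cells = []
    for size in cell_sizes:
        stride = size // 2
        # Top-left corners that keep the cell fully inside the image.
        offsets = range(0, image_size - size + 1, stride)
        for y in offsets:
            for x in offsets:
                cells.append((x, y, size))
    return cells


grid = build_grid()
# Number of cells at each of the four scales.
counts = {s: sum(1 for c in grid if c[2] == s) for s in (28, 56, 112, 224)}
print(len(grid), counts)
```

Under these assumptions the four scales contribute $15^2$, $7^2$, $3^2$ and $1$ cells, which adds up to the 284 outputs of the network.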

The Global CNN is trained with SGD, minimizing the sum of $\textsc{c}$ log-loss functions, one per grid cell $c\in\{1,\dots,\textsc{c}\}$.

Due to the coarse resolution of grid cells, our Global model does not provide accurate localization. We therefore use the Global model to rescore the candidates of the Local and Pairwise models. For this purpose, we match each candidate with the corresponding grid cell and compute an affine combination of their scores. Each candidate is matched to the grid cell with which it has the maximum IoU overlap ratio. The parameters of the affine score combination are optimized by cross-validation on the validation set.
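A minimal sketch of this rescoring step is given here. It assumes that candidate boxes and grid cells are expressed in the same rescaled image coordinates as `(x1, y1, x2, y2)`, and the names `alpha`, `beta` and `bias` for the affine parameters are hypothetical; the values used are placeholders, not the ones selected on the validation set.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    iw = min(a[2], b[2]) - max(a[0], b[0])
    ih = min(a[3], b[3]) - max(a[1], b[1])
    if iw <= 0 or ih <= 0:
        return 0.0
    inter = iw * ih
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)


def rescore(box, local_score, cells, cell_scores, alpha, beta, bias=0.0):
    """Affine combination of a candidate score and its best-matching cell score."""
    # Match the candidate to the grid cell with the maximum IoU.
    best = max(range(len(cells)), key=lambda c: iou(box, cells[c]))
    return alpha * local_score + beta * cell_scores[best] + bias, best


# Three illustrative cells of one scale and made-up Global scores.
cells = [(0, 0, 112, 112), (56, 0, 168, 112), (112, 0, 224, 112)]
cell_scores = [0.1, 0.9, 0.3]
score, cell = rescore((60, 10, 150, 100), 2.0, cells, cell_scores,
                      alpha=1.0, beta=0.5)
```

In this toy case the candidate lies entirely inside the middle cell, so that cell is selected and its score is mixed into the candidate score.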

3.3 Pairwise model

In this section we describe our Pairwise model that aims to jointly reason about multiple object candidates. Following Desai et al., we formulate the model as a joint score function where variables correspond to object candidates. In the prior work, unary potentials of the score function are defined by the response of the local object detector at the corresponding locations, whereas higher-order potentials model spatial relations between candidates. Our Pairwise model enriches this earlier model by making all potentials of the score function (2) dependent on the image data and, unlike prior approaches, allows joint training of all parameters. We describe details of our model in Section 3.3.1.

We train parameters of our model by minimizing the structured surrogate loss using stochastic gradient descent algorithm. The details of our training procedure are presented in Section 3.3.2.

Consider a set $\mathcal{V}$ of candidate bounding boxes (nodes) extracted from an image. Let each bounding box have a binary variable $y_{i}$, $i\in\mathcal{V}$, assigned to it. We associate label $1$ with the object class and label $0$ with the background class. We assume that the ground-truth labels $\hat{y}_{i}$ are available for all candidates in training images.

For each pair of nodes we choose an order based on the coordinates of the corresponding bounding boxes: the left box is defined to be the first and the right one the second. Let $\mathcal{E}$ denote the set of oriented pairs of candidates (the set of edges). We cluster all edges based on the relative locations and scales of bounding boxes and denote the cluster index of edge $(i,j)\in\mathcal{E}$ by $k_{ij}\in\{1,\dots,K\}$. (To cluster edges we apply the k-means algorithm with $K=20$ to a subset of oriented edges in training images. Edges in this subset connect object candidates with positive labels as well as any other candidates with high scores of the pre-trained Local model. For the clustering we use relative location features (horizontal and vertical displacements, ratio of sizes) converted to the log scale and normalized to have zero mean and unit standard deviation. Further details of the clustering are available in Appendix A.3.)

Inspired by Desai et al., we construct a joint score function $S(\mathbold{y};\mathbold{w})$ that ties together the labels of candidates in the same image:

$$S(\mathbold{y};\mathbold{w})=\sum_{i\in\mathcal{V}}\theta^{U}_{i}(y_{i};\mathbold{w})+\sum_{(i,j)\in\mathcal{E}}\theta^{P}_{ij}(y_{i},y_{j};\mathbold{w}),\qquad(2)$$

where $\mathbold{w}$ denotes trainable parameters, $\theta^{U}_{i}$ and $\theta^{P}_{ij}$ are unary and pairwise potentials depending on $\mathbold{w}$, and $\mathbold{y}=(y_{i})_{i\in\mathcal{V}}$ is a vector of all binary variables.

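Note that different values of potentials in (2) can lead to exactly the same score function $S$. We rewrite Eq. (2) in a more compact form (the set of all representable functions of binary variables stays the same up to an additive constant):

$$S(\mathbold{y};\mathbold{w})=\sum_{i\in\mathcal{V}}\theta^{U}_{i}\,y_{i}+\sum_{(i,j)\in\mathcal{E}}\theta^{P}_{ij,k_{ij}}\,y_{i}\,y_{j},\qquad(3)$$

where unary potentials $\theta^{U}_{i}$ and pairwise potentials $\theta^{P}_{ij,k_{ij}}$ are represented by real values.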
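A minimal sketch of how such a joint score could be evaluated is given here. It assumes that the compact form adds the unary potential of every candidate labeled $1$ and, for every edge whose two candidates are both labeled $1$, the pairwise potential of the edge's cluster $k_{ij}$. All potential values and the two-cluster setup are invented for illustration.

```python
def joint_score(y, unary, pairwise, edges, cluster):
    """Joint score of a binary labeling y under unary and pairwise potentials.

    unary[i]         : real-valued unary potential of candidate i
    pairwise[(i, j)] : list of K potentials of edge (i, j), one per cluster
    cluster[(i, j)]  : cluster index k_ij of edge (i, j)
    """
    score = sum(unary[i] * y[i] for i in range(len(y)))
    for (i, j) in edges:
        # Only the potential of the edge's own cluster contributes,
        # and only when both endpoints are labeled as objects.
        score += pairwise[(i, j)][cluster[(i, j)]] * y[i] * y[j]
    return score


unary = [1.5, -0.5, -2.0]
edges = [(0, 1), (0, 2), (1, 2)]
cluster = {(0, 1): 0, (0, 2): 1, (1, 2): 0}
pairwise = {(0, 1): [1.0, 0.0], (0, 2): [0.0, -1.0], (1, 2): [0.3, 0.0]}

s_110 = joint_score([1, 1, 0], unary, pairwise, edges, cluster)
s_111 = joint_score([1, 1, 1], unary, pairwise, edges, cluster)
```

In this toy setting the labeling that selects the first two candidates scores higher than the one selecting all three, because the third candidate has a strongly negative unary potential and a negative interaction with the first.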

Now we connect the image with potentials of the score function (3) using several feed-forward neural networks. First, from the Local model described in Section 3.1 we create a feature extractor (FE), i.e. a function $\varphi^{E}$ that constructs feature vector $\mathbold{f}_{i}$ for the image data $\mathbold{x}_{i}$ of candidate $i$ : $\mathbold{f}_{i}=\varphi^{E}(\mathbold{x}_{i},\mathbold{w}^{E})$ . Here $\mathbold{w}^{E}$ is a vector of trainable parameters of FE.

To connect features $\mathbold{f}_{i}$ with potentials in (3) we construct two additional feed-forward networks: the unary network (UN) and the pairwise network (PN). The unary network $\varphi^{U}$ maps the feature vector $\mathbold{f}_{i}$ of a candidate $i$ to the value of the corresponding unary potential, i.e. $\theta^{U}_{i}=\varphi^{U}(\mathbold{f}_{i},\mathbold{w}^{U})$. The pairwise network $\varphi^{P}$ maps the concatenated feature vectors of its two candidates to a vector $\mathbold{\theta}^{P}_{ij}$ whose $k$-th component $\theta^{P}_{ij,k}$ corresponds to one of the $K$ cluster indices, i.e. $\theta^{P}_{ij,k_{ij}}=\varphi^{P}_{k_{ij}}(\mathbold{f}_{i},\mathbold{f}_{j},\mathbold{w}^{P})$. Vectors $\mathbold{w}^{U}$ and $\mathbold{w}^{P}$ are the trainable parameters of the UN and PN, respectively.

In our experiments we found the following architectures to work best. The FE was of the same structure as our Local model (based on the network of Oquab et al. ) leading to $2048$ features. In both UN and PN we use just one fully-connected layer. Addition of more hidden layers did not improve results.

Object detection methods are typically evaluated in terms of precision-recall (PR) and average precision (AP) values. To construct the precision-recall curve given the joint score (3), we follow the approach of Desai et al. For each candidate bounding box $i$, we compute an individual score $s_{i}(\mathbold{w})$ defined as the difference of the max-marginals of the joint score:

$$s_{i}(\mathbold{w})=\max_{\mathbold{y}:\,y_{i}=1}S(\mathbold{y};\mathbold{w})-\max_{\mathbold{y}:\,y_{i}=0}S(\mathbold{y};\mathbold{w}).\qquad(4)$$

The individual scores are used in the standard precision-recall evaluation pipeline .

When the number of candidates is small, i.e. $|\mathcal{V}|\leq 20$, both maximization problems of (4) can be solved exactly using exhaustive search. When the number of candidates becomes larger, exhaustive search becomes too slow. In this case one can use a cascade of the QPBO and TRW-S methods to approximate $s_{i}$. Specifically, QPBO quickly determines the optimal label for some candidates. On our dataset QPBO works surprisingly well, i.e., in many cases it is able to label all nodes. If some nodes are left unlabeled by QPBO, one can apply exhaustive search when the number of unlabeled nodes is at most 20 and TRW-S otherwise.
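The exhaustive-search case can be sketched as follows. The sketch assumes the compact score form (a unary term per selected candidate plus one pairwise term per edge whose endpoints are both selected) and, to stay short, stores only the potential of each edge's own cluster; QPBO and TRW-S are not covered. The potential values are invented for illustration.

```python
from itertools import product


def joint_score(y, unary, pair):
    """Joint score; pair[(i, j)] holds the potential of the edge's own cluster."""
    total = sum(u * yi for u, yi in zip(unary, y))
    total += sum(p * y[i] * y[j] for (i, j), p in pair.items())
    return total


def individual_scores(unary, pair):
    """Difference of max-marginals for every candidate, by enumeration."""
    n = len(unary)
    # best[i][t] is the best joint score over labelings with y_i = t.
    best = [[float("-inf"), float("-inf")] for _ in range(n)]
    for y in product((0, 1), repeat=n):  # 2**n labelings
        s = joint_score(y, unary, pair)
        for i in range(n):
            best[i][y[i]] = max(best[i][y[i]], s)
    return [b[1] - b[0] for b in best]


unary = [1.5, -0.5, -2.0]
pair = {(0, 1): 1.0, (0, 2): -1.0, (1, 2): 0.3}
scores = individual_scores(unary, pair)
```

In this toy example the second candidate has a negative unary potential but still receives a positive individual score, because a supportive pairwise term with the first candidate outweighs it. The enumeration visits $2^{|\mathcal{V}|}$ labelings, which is why it is only practical for small candidate sets.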

We have tried using 16 and 32 candidates per image. Exact inference is tractable only in the first case. In this paper we use 16 candidates per image, as the larger number of candidates did not improve performance on our validation set.

3.3.2 Training the model

We train the parameters of our model by minimizing a structured surrogate loss using the stochastic gradient descent algorithm. (As is common in the deep learning literature, we ignore non-differentiability issues and assume that in practice we can always compute the gradient.) The algorithm for a parameter update consists of the following four steps:

1. Select the set of candidates by applying non-maximum suppression on top of the scores produced by the Local model.

2. Perform the forward pass through the model to compute the potentials of the joint score function.

3. Perform the inference to compute the structured loss and its gradient (see below).

4. Back-propagate the gradient through the model.

We explain details of the algorithm below.

A structured loss is a function that maps the current values of parameters, image data $\mathbold{x}=(\mathbold{x}_{i})_{i\in\mathcal{V}}$ and the ground-truth labeling $\hat{\mathbold{y}}=(\hat{y}_{i})_{i\in\mathcal{V}}$ to a real number. A popular choice for the surrogate loss for structured-prediction tasks is the structured SVM (SSVM) objective:

$$\max_{\mathbold{y}}\big(S(\mathbold{y};\mathbold{w})+h(\mathbold{y},\hat{\mathbold{y}})\big)-S(\hat{\mathbold{y}};\mathbold{w}),\qquad(5)$$

where $h(\mathbold{y},\hat{\mathbold{y}})\geq 0$ measures the discrepancy between the two labelings. Possible choices for $h$ include the Hamming loss, the Hamming loss with penalties normalized by the frequency of classes, or higher-order losses making use of the assumption that each ground-truth object is assigned to exactly one object candidate. Notice that in (5) the joint score $S$ depends on parameters $\mathbold{w}$ and image data $\mathbold{x}$ implicitly through potentials $\theta^{U}$ and $\theta^{P}$.

However, in our experiments we have observed that the SSVM loss is less suited for the detection task, i.e. optimizing the objective (5) does not lead to good results in terms of precision-recall measure. To tackle this problem, we propose a new surrogate loss which directly imposes penalties on the wrong values of individual scores (4) extracted from the joint score $S$ . Specifically, this loss can be written as

where $v$ can be any non-increasing function bounded from below. We use $v(t)=\log(1+\exp(-t))$, which brings us closer to the training of a conventional detector with a soft-max loss.
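A minimal sketch of a loss of this kind is given here. It assumes that a ground-truth label of $1$ should push the individual score up and a label of $0$ should push it down, so that each candidate contributes $v(\pm s_{i})$ with the sign taken from its label; this sign convention and the absence of any per-class weighting are assumptions made for illustration.

```python
import math


def v(t):
    """Logistic penalty v(t) = log(1 + exp(-t)), computed in a stable way."""
    return max(-t, 0.0) + math.log1p(math.exp(-abs(t)))


def surrogate_loss(scores, labels):
    """Sum of per-candidate penalties on individual scores.

    Assumption: label 1 maps to sign +1 and label 0 maps to sign -1.
    """
    return sum(v((2 * y - 1) * s) for s, y in zip(scores, labels))


labels = [1, 0, 0]
loss_bad = surrogate_loss([2.0, 0.5, -2.7], labels)    # second score has the wrong sign
loss_good = surrogate_loss([2.0, -0.5, -2.7], labels)  # all scores agree with labels
```

The penalty is non-increasing in its argument and bounded from below by zero, so a score on the wrong side of zero is penalized more than one on the correct side.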

To optimize the structured loss w.r.t. the model parameters $\mathbold{w}$ , we need to compute the gradient of the objective w.r.t. model parameters. We can always achieve this goal using the back-propagation method under two assumptions: 1) the gradient can be back-propagated through the modules of the model, i.e. all the partial derivatives of $\varphi^{E}$ , $\varphi^{U}$ , $\varphi^{P}$ w.r.t. the input and the parameters can be computed; 2) the scores of the candidates (4) can be computed exactly.

To start the back-propagation procedure, we compute the gradient of the structured loss w.r.t. potentials $\theta^{U}_{i}$, $\theta^{P}_{ij,k}$ of the joint score function $S$. Jaderberg et al. have explained in detail how to do this for the SSVM loss (5). Here we explain how to differentiate the loss (6). First, the gradient of the loss (6) w.r.t. the scores can be expressed as

The gradient of the score (when it exists) w.r.t. potentials can be computed exactly if we can compute all max-marginals exactly:

where $y_{q}^{i,t}$ is the $q$ -th component of $\mathbold{y}^{i,t}=\operatornamewithlimits{argmax}\limits_{\mathbold{y}:y_{i}=t}S(\mathbold{y};\mathbold{w})$ for $t\in\{0,1\}$ . Here, $[\cdot]$ is the Iverson bracket notation. Combining the two derivatives via the chain rule we get

The next step of the back-propagation procedure is to compute the derivatives of the loss w.r.t. parameters of the UN and PN

and w.r.t. the output of the feature extractor

Notice that all the derivatives of potentials w.r.t. parameters and features can be computed by propagating the gradient through networks $\varphi^{U}$ and $\varphi^{P}$ . Finally, propagation of the gradient (8) through $\varphi^{E}$ gives us the direction of the update for parameters $\mathbold{w}^{E}$ of the FE.
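The gradient computation with respect to the potentials can be sketched as follows. The sketch assumes the compact score form, the logistic penalty $v(t)=\log(1+\exp(-t))$ with a $\pm 1$ sign taken from the ground-truth label, and exhaustive search for the two constrained maximizers; only the potential of each edge's own cluster is stored, since the Iverson bracket gives the other cluster components a zero gradient. All values are illustrative.

```python
import math
from itertools import product


def joint_score(y, unary, pair):
    """Joint score; pair[(i, j)] holds the potential of the edge's own cluster."""
    total = sum(u * yi for u, yi in zip(unary, y))
    total += sum(p * y[i] * y[j] for (i, j), p in pair.items())
    return total


def constrained_argmax(i, t, unary, pair):
    """Best labeling with y_i fixed to t, found by exhaustive search."""
    configs = (y for y in product((0, 1), repeat=len(unary)) if y[i] == t)
    return max(configs, key=lambda y: joint_score(y, unary, pair))


def loss(unary, pair, labels):
    """Sum of logistic penalties on the individual scores (assumed form)."""
    total = 0.0
    for i, label in enumerate(labels):
        y1 = constrained_argmax(i, 1, unary, pair)
        y0 = constrained_argmax(i, 0, unary, pair)
        s_i = joint_score(y1, unary, pair) - joint_score(y0, unary, pair)
        total += math.log1p(math.exp(-(2 * label - 1) * s_i))
    return total


def loss_gradients(unary, pair, labels):
    """Gradient of the loss w.r.t. unary and pairwise potentials (chain rule)."""
    g_unary = [0.0] * len(unary)
    g_pair = {e: 0.0 for e in pair}
    for i, label in enumerate(labels):
        y1 = constrained_argmax(i, 1, unary, pair)
        y0 = constrained_argmax(i, 0, unary, pair)
        s_i = joint_score(y1, unary, pair) - joint_score(y0, unary, pair)
        sign = 2 * label - 1
        # Derivative of log(1 + exp(-sign * s)) with respect to s.
        dl_ds = -sign / (1.0 + math.exp(sign * s_i))
        # Derivative of s_i w.r.t. each potential: difference of the two argmaxes.
        for q in range(len(unary)):
            g_unary[q] += dl_ds * (y1[q] - y0[q])
        for (p, q) in pair:
            g_pair[(p, q)] += dl_ds * (y1[p] * y1[q] - y0[p] * y0[q])
    return g_unary, g_pair


unary = [1.5, -0.5, -2.0]
pair = {(0, 1): 1.0, (0, 2): -1.0, (1, 2): 0.3}
labels = [1, 1, 0]
g_unary, g_pair = loss_gradients(unary, pair, labels)
```

When the constrained maximizers are unique, as in this example, the result can be checked against finite differences of `loss`. In a full model these gradients would then be passed on to the unary, pairwise and feature-extractor networks by ordinary back-propagation.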

Datasets

In this section we present our new head detection dataset, HollywoodHeads (HH), and discuss two other datasets we use for evaluation: TVHI and Casablanca .

The HollywoodHeads dataset contains $369,846$ human heads annotated in $224,740$ video frames from 21 Hollywood movies. The movies vary in genre and represent different time epochs. To create the annotation, we have manually annotated tracks of human heads in action-rich movie clips. For each head track, head bounding boxes, i.e., the smallest axis-parallel rectangles including all visible pixels of the head, were manually annotated on several key frames. The bounding boxes on the remaining frames were linearly interpolated and manually verified to be correct. In total, we have collected $2,380$ clips with $3,872$ human tracks, spanning over $3.5$ hours of video. The dataset is divided into training, validation and test subsets which have no overlap in terms of movies. Given the redundancy of consecutive video frames, we have temporally subsampled videos in the validation and test subsets. In summary, the training set of HollywoodHeads contains $216,719$ frames from $15$ movies, the validation set contains $6,719$ frames from $3$ movies and the test set contains $1,302$ frames from another set of $3$ movies. Human heads with poor visibility (e.g., strong occlusions, low lighting conditions) were marked with the “difficult” flag and were excluded from the evaluation. The HollywoodHeads dataset is available from the project web page.

List of movies used in the HollywoodHeads dataset. Training set: American beauty, As Good As It Gets, Big Fish, Big Lebowski, Bringing out the dead, Capote, Clerks, Crash, Dead Poets Society, Double Indemnity, Erin Brockovich, Fantastic 4, Fargo, Fear And Loathing In Las Vegas, Fight Club. Validation set: Five Easy Pieces, Forrest Gump, Gang Related. Test set: Gandhi, Charade, I Am Sam.

4.2 TVHI dataset

The extended TV Human Interaction (TVHI) dataset consists of $1,313$ frames of TV show episodes annotated with bounding boxes of human upper bodies. Frames are split into the two sets: $599$ for training and $714$ for testing. To evaluate head detection using upper-body annotation, we have applied bounding-box regression to the output of head detectors . The parameters of regression were tuned on the TVHI training subset for each tested method.

4.3 Casablanca dataset

The Casablanca dataset contains $1,466$ frames from the movie “Casablanca”. The frames are annotated with head bounding boxes; however, the annotation of frontal heads is typically reduced to face bounding boxes and therefore differs in scale and aspect ratio from the HollywoodHeads annotation. Given some mistakes in the original annotation, we have added missing bounding boxes for the heads of all people in the foreground. We have also applied bounding-box regression to compensate for differences in annotation policies.

Experiments

This section presents our experimental results. First, we demonstrate the effect of different combinations of proposed models (Section 5.1) and provide the comparison with the state-of-the-art (Section 5.2). Section 5.3 compares different architectures of the Local model. We then justify the need of our new large dataset for training (Section 5.4) and show improvements in computational complexity that can be achieved with the Global model (Section 5.5).

To evaluate the detection performance, we use the standard Average Precision (AP) measure based on the Precision-Recall (PR) curve . Detections having high overlap ratio with the ground truth (IoU > 0.5) are considered as true positives. Multiple detections assigned to the same ground truth are penalized and declared as false positives. Matches to “difficult” head annotations are ignored in the evaluation, i.e. such detections are considered neither as true positives nor as false positives.
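The matching rules of this protocol can be sketched as follows. The sketch assumes boxes given as `(x1, y1, x2, y2)`, processes detections of one image in decreasing score order, and matches each detection to the ground-truth box with the highest IoU; it labels detections only and does not compute AP itself. The function name and the example boxes are illustrative.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    iw = min(a[2], b[2]) - max(a[0], b[0])
    ih = min(a[3], b[3]) - max(a[1], b[1])
    if iw <= 0 or ih <= 0:
        return 0.0
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union


def label_detections(detections, gt_boxes, gt_difficult, thr=0.5):
    """Label (score, box) detections as 'tp', 'fp' or 'ignore', best score first."""
    matched = [False] * len(gt_boxes)
    labels = []
    for _, box in sorted(detections, key=lambda d: -d[0]):
        overlaps = [iou(box, g) for g in gt_boxes]
        best = max(range(len(gt_boxes)), key=overlaps.__getitem__, default=None)
        if best is None or overlaps[best] <= thr:
            labels.append("fp")      # no sufficiently overlapping ground truth
        elif gt_difficult[best]:
            labels.append("ignore")  # neither a true nor a false positive
        elif matched[best]:
            labels.append("fp")      # duplicate detection of the same head
        else:
            matched[best] = True
            labels.append("tp")
    return labels


gt_boxes = [(0, 0, 10, 10), (20, 20, 30, 30)]
gt_difficult = [False, True]
detections = [(0.9, (0, 0, 10, 10)), (0.8, (1, 1, 10, 10)),
              (0.7, (20, 20, 30, 30)), (0.6, (50, 50, 60, 60))]
labels = label_detections(detections, gt_boxes, gt_difficult)
```

Precision and recall at each score threshold then follow from the running counts of `tp` and `fp` labels, with `ignore` entries left out of both counts.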

We compare the performance of the following four models: the Local model (Section 3.1), the combination of the Local and Global models (Section 3.2), the combination of the Local and Pairwise models (Section 3.3) and the combination of all three proposed models. The performance of head detection is evaluated on the HollywoodHeads, Casablanca and TVHI datasets. Qualitative results of the Global and Pairwise models are illustrated in Figures 3 and 4 respectively. Table 1 presents quantitative results for all models. We observe that the Global and Pairwise models consistently improve the performance of the baseline Local model. The combination of all three models demonstrates the best performance on all three datasets.

5.2 Comparison with the state-of-the-art methods

We compare our approach against several baselines: the CNN-based object detector (R-CNN), a DPM-based face detector (DPM Face), as well as other methods reporting results on TVHI (UBC+S) and Casablanca (VJ-CRF). We have trained the R-CNN object detector (https://github.com/rbgirshick/rcnn) on human heads using the training subset of the HollywoodHeads dataset. The CNN model was first fine-tuned on all region proposals used to train our Local model. Given memory limitations, the SVM phase of R-CNN training was done on a subset of training images. For the DPM-based face detector we have used a publicly provided vanilla DPM model. Results of other methods were taken from the original papers.

Results of all compared methods are presented in Figure 5. Our joint model outperforms other methods on all three datasets. Consistently with other recent evaluations, we observe the advantage of CNN-based methods compared to other baselines. As expected, methods trained to detect faces achieve lower recall on the head detection task given the large variation of view points in natural images. Our method significantly outperforms R-CNN on two out of three datasets and performs slightly better than R-CNN on the TVHI dataset.

Note that our evaluation on the Casablanca dataset differs from the original one due to the improved annotation and the use of the VOC evaluation procedure. Our results using the original evaluation setup by Ren are reported in Appendix B. Additional results of our method are available from the project web page and in Appendix C.

5.3 Architectures of the Local model

In this section we compare the performance and speed of different architectures of the Local model. We consider the AlexNet architecture, VGG-S and VGG-verydeep-16 provided with the MatConvNet framework (http://www.vlfeat.org/matconvnet/pretrained/), and the model of Oquab et al. All models were pre-trained on the ImageNet dataset and fine-tuned on the training set of the HollywoodHeads dataset as the Local model. In Table 2 we report the AP values produced by different models together with the train/test speed. We measure the speed as the number of image patches processed per second. For each model we choose the size of the training batch such that the training speed is maximal. In all cases it happens to be the maximum batch size that fits into the GPU memory. Experiments of this section were run on an NVIDIA TITAN X with 12 GB of memory.

5.4 Size of the training set

In this experiment we analyze the amount of training data required to train our models. Our full training set is constructed from $15$ movies. We also examine the use of smaller training sets corresponding to the first 8 movies and the first 4 movies of the full training set respectively. We use each training set to train parameters of our full model and evaluate it on three datasets. Corresponding results are reported in Table 3. We observe that the amount of the training data and, maybe more importantly, its diversity helps to improve the performance.

5.5 Complexity reduction with the Global model

Here we show that the Global model can suppress false candidates and reduce the computational complexity of R-CNN and our Local model at test time. We achieve this by transferring the scores of the Global model to detection proposals. We then filter out low-score candidates and thus reduce the number of candidates that have to pass through the Local CNN. We evaluate the performance of detectors with different percentages of candidates left after the filtering. Table 4 presents the results of this experiment. We observe that detection performance remains high despite aggressive filtering by the scores of the Global model.
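The filtering step can be sketched as follows, assuming that every proposal has already received the score of its matched Global grid cell. The function name `filter_candidates`, the rounding rule for the number of kept candidates and the example scores are illustrative assumptions.

```python
import math


def filter_candidates(global_scores, keep_fraction):
    """Return indices of the top `keep_fraction` of candidates by Global score."""
    n_keep = max(1, math.ceil(keep_fraction * len(global_scores)))
    order = sorted(range(len(global_scores)), key=lambda i: -global_scores[i])
    # Keep the original proposal order among the survivors.
    return sorted(order[:n_keep])


global_scores = [0.1, 0.8, 0.3, 0.9, 0.05]
kept = filter_candidates(global_scores, keep_fraction=0.4)
```

Only the surviving indices would then be passed through the more expensive Local CNN.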

Conclusion

In this work we have addressed the task of detecting people in still images. We proposed two context-aware CNN-based models. To train and evaluate our method, we have collected the new large-scale HollywoodHeads dataset consisting of movie frames and human head annotations. The combination of our context-aware models and the CNN-based local detector achieves state-of-the-art results on our dataset and the two existing human detection datasets, TVHI and Casablanca.

We believe that our context-aware models can be extended to tackle general object classes. In particular, the Microsoft COCO dataset contains many small object classes with implied spatial constraints. Another possible direction for future work is to take into account motion information to extend our methods to perform long-term tracking.

Acknowledgements This work was partly supported by the ERC grant Activia (no. 307574) and the MSR-INRIA Joint Center.

Appendix A Implementation details

To train the Local model, we assign each candidate region to the positive (head) or negative (background) class. For a given image, we make this assignment based on the intersection-over-union (IoU) overlap ratio $o$ of the candidate bounding box with the best matching ground-truth bounding box. Specifically, candidates with $o>0.6$ are labeled as positives and candidates with $o<0.5$ are labeled as negatives. The remaining candidates are considered ambiguous and are not used during training. Following prior work, we use context padding. Each candidate is resized to a $188\times 188$ square patch which is extended by $18$ pixels on each side, filled from the original image. The input images of our CNN are thus of size $224\times 224$. For each image, we form a training batch by sampling 64 proposals such that the balance between classes is roughly maintained.
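The label assignment and the padding arithmetic can be sketched as follows. The sketch assumes boxes in `(x, y, w, h)` form and that the 18-pixel border at network resolution corresponds to a proportional extension of the box in the original image; both function names are illustrative.

```python
def assign_label(overlap, pos_thr=0.6, neg_thr=0.5):
    """Training label from the IoU with the best-matching ground-truth box."""
    if overlap > pos_thr:
        return 1      # positive (head)
    if overlap < neg_thr:
        return 0      # negative (background)
    return None       # ambiguous, not used for training


def padded_box(x, y, w, h, inner=188, pad=18):
    """Extend a box so that, after resizing, it has a `pad`-pixel context border."""
    px = pad * w / inner
    py = pad * h / inner
    return (x - px, y - py, w + 2 * px, h + 2 * py)


example_labels = [assign_label(o) for o in (0.7, 0.55, 0.3)]
example_box = padded_box(100, 50, 94, 94)
network_input = 188 + 2 * 18
```

With these numbers the inner patch of 188 pixels plus two 18-pixel borders gives the 224-pixel network input.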

We initialize parameters of the network using the ImageNet pre-trained network of Oquab et al. . We optimize the parameters of the network by minimizing the sum of independent log-losses with a stochastic gradient descent (SGD) algorithm with momentum $0.9$ and weight decay $0.0005$ . We initialize the learning rate at $0.01$ , and decrease it several times by a factor of $10$ after the validation error reaches saturation.

A.2 Global model

The Global model takes the whole image (isotropically rescaled and zero-padded to size $224\times 224$) as input and provides a vector of $284$ numbers as output. Each element of the output vector is associated with a cell of our multi-scale grid. For each cell we construct a target label: $1$ if the corresponding image patch has an IoU ratio of at least $0.3$ with at least one ground-truth object bounding box, and $0$ otherwise. To train the Global model we optimize the sum of independent log-losses with an SGD algorithm. We initialize the model with the ImageNet pre-trained network. The learning rate of SGD is set to $0.0001$, the momentum to $0.9$ and the weight decay to $0.0005$.

A.3 Pairwise model

The number of candidates per image that our Pairwise model can process is limited by the complexity of the inference procedure. To select good candidates out of the thousands produced by selective search, we apply non-maximum suppression (with threshold 0.3) on top of the scores provided by the Local model. We find that 16 candidates per image selected this way provide a good balance between quality and speed.
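The selection step can be sketched as below. The text only specifies the threshold (0.3) and the number of candidates kept (16). The standard greedy variant of non-maximum suppression with an IoU overlap criterion is an assumption here, as are the function names and the `(x, y, w, h)` box layout.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x, y, w, h)."""
    iw = max(0.0, min(a[0] + a[2], b[0] + b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[1] + a[3], b[1] + b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0


def select_candidates(boxes, scores, nms_thr=0.3, max_keep=16):
    """Greedy NMS: visit boxes by decreasing Local-model score and keep a
    box only if it overlaps no already-kept box by more than nms_thr.
    Returns the indices of at most max_keep surviving boxes."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= nms_thr for j in keep):
            keep.append(i)
            if len(keep) == max_keep:
                break
    return keep
```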

To construct clusters of candidate pairs (edges) that incorporate the layout information, we use three features representing the vertical and horizontal displacements and the ratio of the candidate sizes. To be precise, if the position of each candidate is defined by a tuple $(x_{i},y_{i},w_{i},h_{i})$, we define the size of the candidate as $s_{i}=(w_{i}+h_{i})/2$, its horizontal position as $x^{c}_{i}=x_{i}+w_{i}/2$ and its vertical position as $y^{c}_{i}=y_{i}+h_{i}/2$. For two candidates sorted such that $x_{i}\leq x_{j}$, we compute the features as follows: $f_{ij}^{1}=\log(s_{i}/s_{j})$, $f_{ij}^{2}=\varphi((x^{c}_{j}-x^{c}_{i})/s_{i})$, $f_{ij}^{3}=\varphi((y^{c}_{j}-y^{c}_{i})/s_{i})$, where $\varphi(x)=\operatorname{sign}(x)\log(|x|+1)$. All three features are normalized to have zero mean and unit standard deviation on the training set. We find that increasing the number of clusters beyond $20$ does not improve the performance.
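The three layout features can be computed as in the following sketch. It reads the size-ratio feature as $\log(s_{i}/s_{j})$, which is an interpretation of the formula above. The function names are hypothetical, and both the normalization on the training set and the subsequent clustering are omitted.

```python
import math


def phi(x):
    """Signed logarithmic compression: sign(x) * log(|x| + 1)."""
    return math.copysign(math.log(abs(x) + 1.0), x)


def pair_features(box_a, box_b):
    """Unnormalized layout features (f1, f2, f3) for a pair of candidates
    given as (x, y, w, h); the pair is first sorted so that x_i <= x_j."""
    (xi, yi, wi, hi), (xj, yj, wj, hj) = sorted(
        (box_a, box_b), key=lambda b: b[0])
    s_i = (wi + hi) / 2.0
    s_j = (wj + hj) / 2.0
    f1 = math.log(s_i / s_j)                              # size ratio
    f2 = phi(((xj + wj / 2.0) - (xi + wi / 2.0)) / s_i)   # horizontal shift
    f3 = phi(((yj + hj / 2.0) - (yi + hi / 2.0)) / s_i)   # vertical shift
    return f1, f2, f3
```

Because the pair is sorted by $x$ before the features are computed, the result does not depend on the order in which the two candidates are passed.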

To train the Pairwise model we assign each selected candidate a binary target label based on its maximum IoU ratio with the ground-truth bounding boxes (threshold 0.5). We form a training batch from 64 candidates coming from 4 different images. The FE part of the model is initialized from the Local model. The weights of the UN and PN are initialized randomly using zero-mean Gaussians with standard deviation 0.01. The structured surrogate objective is optimized using SGD with momentum 0.9, weight decay $0.000005$ and learning rate $0.00001$. We decrease the learning rate by a factor of $10$ after 4 passes over the training data.

A.4 Combining models

We now describe the process of computing the scores of the joint model. First, we compute the scores of the Local model for all candidates and perform non-maximum suppression with threshold 0.3. The 16 top-scoring detections produced by NMS are then used as input for the Pairwise model. This number of candidates is sufficient for scenes with a few people, but can reduce recall for crowded scenes (especially for some scenes of the Casablanca dataset). To compensate for this drop, we combine the scores $s_{l}$ and $s_{p}$ produced by the Local and Pairwise models, respectively. For candidates that have both scores, we use the affine combination $s_{lp}=\alpha s_{l}+(1-\alpha)s_{p}+\beta$. For candidates without a Pairwise score, we use the score of the Local model, $s_{lp}=s_{l}$. The parameters $\alpha$ and $\beta$ are selected by maximizing AP on the validation set using grid search.

To combine the scores $s_{lp}$ with the Global model, we associate each candidate with the output cell of the Global model that has the maximum IoU overlap ratio with it. The score of the joint model $s^{*}$ is computed as an affine combination of the detection score $s_{lp}$ and the grid cell score $s_{g}$, i.e. $s^{*}=\gamma s_{lp}+(1-\gamma)s_{g}$, where $\gamma$ is obtained by maximizing AP on the validation set.
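The score combination described in this subsection can be sketched as follows. The function names and the `(x, y, w, h)` box layout are hypothetical, and a missing Pairwise score is represented by `None`. The values of $\alpha$, $\beta$ and $\gamma$ are not given here, so they are left as arguments to be chosen on a validation set.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x, y, w, h)."""
    iw = max(0.0, min(a[0] + a[2], b[0] + b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[1] + a[3], b[1] + b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0


def combine_local_pairwise(s_l, s_p, alpha, beta):
    """s_lp: affine combination when a Pairwise score exists,
    otherwise fall back to the Local score."""
    if s_p is None:
        return s_l
    return alpha * s_l + (1.0 - alpha) * s_p + beta


def match_cell(box, cells):
    """Index of the Global-model grid cell with maximum IoU with the box."""
    return max(range(len(cells)), key=lambda k: iou(box, cells[k]))


def joint_score(s_lp, s_g, gamma):
    """s*: affine combination of the detection and grid-cell scores."""
    return gamma * s_lp + (1.0 - gamma) * s_g
```

A candidate's final score is then `joint_score(s_lp, cell_scores[match_cell(box, cells)], gamma)`, where `cell_scores` stands for the output vector of the Global model.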

A.5 Implementation details

All our experiments were run on NVIDIA GPUs using the MATLAB-based MatConvNet framework with the cuDNN backend. To avoid speed bottlenecks, we found it important to crop and resize the image patches corresponding to object proposals on the GPU, which can be easily implemented using, e.g., the NVIDIA Performance Primitives (NPP) library provided in the CUDA package (see https://github.com/aosokin/cropRectanglesMex).

We report the running times of the different parts of our model measured on an NVIDIA TITAN X. The forward and backward passes of the Local model on a batch of 64 proposals take $0.08$ s and $0.18$ s, respectively. The forward and backward passes of the Global model on a batch of 32 images take $0.06$ s and $0.12$ s, respectively. The Pairwise model consists of several parts: the feature extractor, the unary network, the pairwise network and the structured loss. For a batch of 64 candidates taken from 4 images (16 candidates from each), the forward pass takes 0.07 s through the feature extractor, 0.003 s through the unary network and 0.003 s through the pairwise network. The backward pass through these networks takes 0.2 s, 0.004 s and 0.004 s, respectively. The computation of the structured loss and its derivatives takes 0.01 s per image. Overall, the forward and backward passes through the joint Pairwise model take 0.36 s for a batch coming from 4 images.

Appendix B Evaluation on the original Casablanca dataset

To compare our results with the exact results reported in , we evaluate head detection on the Casablanca dataset using the original set of annotations and the evaluation procedure used in . Figure 6 shows the corresponding precision-recall curves. Our method significantly outperforms VJ-CRF as well as the other baselines.

As mentioned in Section 4.3, the original Casablanca dataset contains many cases of missing and imprecisely localized head annotations. Figure 7 (left) depicts some examples with missing annotations. To provide more conclusive results in Section 5, we have improved the original annotation by adding missing annotations and correcting existing ones on all test frames defined in , see Figure 7 (right). Despite our effort, some crowded scenes may still contain missing annotations of very small heads.

Appendix C Qualitative results

C.1 Global model

In this section we illustrate the multi-scale grids of scores produced by the Global model (see Section 3.2). Each output consists of $1\times 1$, $3\times 3$, $7\times 7$ and $15\times 15$ score grids corresponding to grids of cells with $224\times 224$, $112\times 112$, $56\times 56$ and $28\times 28$ pixels, respectively. Figure 8 illustrates the output of the Global model for a few test examples. Note the high responses at positions and scales corresponding to human heads in the image.

C.2 Pairwise model

In Figure 9 we provide a few qualitative results of our Pairwise model. The bounding boxes and the links in this figure have the same meaning as the ones in Figure 4 of the main paper. We use the same thresholds for the links and the candidates as in Figure 4 of the main paper.