Actions and Attributes from Wholes and Parts

Georgia Gkioxari, Ross Girshick, Jitendra Malik

Introduction

For the tasks of human attribute and action classification, it is difficult to infer from the recent literature whether part-based modeling is essential or, to the contrary, obsolete. Consider action classification. Here, the method from Oquab et al. uses a holistic CNN classifier that outperforms part-based approaches. Turning to attribute classification, Zhang et al.’s CNN-based PANDA system shows that parts bring dramatic improvements over a holistic CNN model. How should we interpret these results? We aim to bring clarity by presenting a single approach for both tasks that shows consistent results.

We develop a part-based system, leveraging convolutional network features, and apply it to attribute and action classification. For both tasks, we find that a properly trained holistic model matches current approaches, while parts contribute further. Using deep CNNs we establish new top-performing results on the standard PASCAL human attribute and action classification benchmarks.

Figure 1 gives an outline of our approach. We compute CNN features on a set of bounding boxes associated with the instance to classify. One of these bounding boxes corresponds to the whole instance and is either provided by an oracle or comes from a person detector. The other bounding boxes (three in our implementation) come from poselet-like part detectors.

Our part detectors are a novel “deep” version of poselets. We define three human body parts (head, torso, and legs) and cluster the keypoints of each part into several distinct poselets. Traditional poselets would then operate as sliding-window detectors on top of low-level gradient orientation features, such as HOG. Instead, we train a sliding-window detector for each poselet on top of a deep feature pyramid, using the implementation of . Unlike HOG-based poselets, our parts are capable of firing on difficult-to-detect structures, such as sitting versus standing legs. Also, unlike recent deep parts based on bottom-up regions, our sliding-window parts can span useful but inhomogeneous regions that are unlikely to group together through a bottom-up process (e.g., bare arms and a t-shirt).

Another important aspect of our approach is task-specific CNN fine-tuning. We show that a fine-tuned holistic model (i.e., no parts) is capable of matching the attribute classification performance of the part-based PANDA system . Then, when we add parts our system outperforms PANDA. This result indicates that PANDA’s dramatic improvement from parts comes primarily from the weak holistic classifier baseline used in their work, rather than from the parts themselves. While we also observe an improvement from adding parts, our marginal gain over the holistic model is smaller, and the gain becomes even smaller as our network becomes deeper. This observation suggests a possible trend: as more powerful convolutional network architectures are engineered, the marginal gain from explicit parts may vanish.

As a final contribution, we show that our system can operate “without training wheels.” In the standard evaluation protocol for benchmarking attributes and actions , an oracle provides a perfect bounding box for each test instance. While this was a reasonable “cheat” a couple of years ago, it is worth revisiting. Due to recent substantial advances in detection performance, we believe it is time to drop the oracle bounding box at test time. We show, for the first time, experiments doing just this; we replace ground-truth bounding boxes with person detections from a state-of-the-art R-CNN person detector . Doing so only results in a modest drop in performance compared to the traditional oracle setting.

Related Work

Low-level image features. Part-based approaches using low-level features have been successful for a variety of computer vision tasks. DPMs capture different aspects of an object using mixture components and deformable parts, leading to good performance on object detection and attribute classification. Similarly, poselets are an ensemble of models that capture parts of an object under different viewpoints and have been used for object detection, action and attribute classification, and pose estimation. Pictorial structures and their variants explicitly model parts of objects and their geometric relationships in order to accurately predict their location.

Convolutional network features. Turning away from hand-designed feature representations, convolutional networks (CNNs) have shown remarkable results on computer vision tasks, such as digit recognition and more recently image classification . Girshick et al. show that a holistic CNN-based approach performs significantly better than previous methods on object detection. They classify region proposals using a CNN fine-tuned on object boxes. Even though their design has no explicit part or component structure, it is able to detect objects under a wide variety of appearance and occlusion patterns.

Hybrid feature approaches. Even more recently, a number of methods incorporate HOG-based parts into deep models, showing significant improvements. Zhang et al. use HOG-poselet activations and train CNNs, one for each poselet type, for the task of attribute classification. They show a large improvement on the task compared to HOG-based approaches. However, their approach includes a number of suboptimal choices. They use pretrained HOG poselets to detect parts and they train a “shallow” CNN (by today’s standards) from scratch using a relatively small dataset of 25k images. We train poselet-like part detectors on a much richer feature representation than HOG, derived from the pool5 layer of . Indeed, show an impressive jump in object detection performance using pool5 instead of HOG. In addition, the task-specific CNN that we use for action or attribute classification shares the architecture of and is initialized by pre-training on the large ImageNet-1k dataset prior to task-specific fine-tuning.

In the same vein, Branson et al. tackle the problem of bird species categorization by first detecting bird parts with a HOG-DPM and then extracting CNN features from the aligned parts. They experimentally show the superiority of CNN-based features to hand-crafted representations. However, they work from relatively weak HOG-DPM part detections, using CNNs solely for classification purposes. Switching to the person category, HOG-DPM does not generate accurate part/keypoint predictions as shown by , and thus cannot be regarded as a source for well aligned body parts.

Deep parts. Zhang et al. introduce part-based R-CNNs for the task of bird species classification. They discover parts of birds from region proposals and combine them for classification. They gain from using parts and also from fine-tuning a CNN for the task starting from ImageNet weights. However, region proposals are not guaranteed to produce parts. Most techniques, such as , are designed to generate candidate regions that contain whole objects based on bottom-up cues. While this approach works for birds, it may fail in general as parts can be defined arbitrarily in an object and need not be of distinct color and texture with regard to the rest of the object. Our sliding-window parts provide a more general solution. Indeed, we find that the recall of selective search regions for our parts is 15.6% lower than our sliding-window parts across parts at 50% intersection-over-union.

Tompson et al. and Chen and Yuille train keypoint specific part detectors, in a CNN framework, for human body pose estimation and show significant improvement compared to . Their models assume that all parts are visible or self-occluded, which is reasonable for the datasets they show results on. The data for our task contain significantly more clutter, truncation, and occlusion and so our system is designed to handle missing parts.

Bourdev et al. introduce a form of deep poselets by training a network with a cross entropy loss. Their system uses a hybrid approach which first uses HOG poselets to bootstrap the collection of training data. They substitute deep poselets in the poselet detection pipeline to create person hypotheses. Their network is smaller than and they train it from scratch without hard negative mining. They show a marginal improvement over R-CNN for person detection, after feeding their hypotheses through R-CNN for rescoring and bounding box regression. Our parts look very much like poselets, since they capture parts of a pose. However, we cluster the space of poses instead of relying on random selection and train our models using a state-of-the-art network with hard negative mining.

Deep part detectors

Figure 2 schematically outlines the design of our deep part detectors, which can be viewed as a multi-scale fully convolutional network. The first stage produces a feature pyramid by convolving the levels of the Gaussian pyramid of the input image with a 5-layer CNN, similar to Girshick et al. for training DeepPyramid DPMs. The second stage outputs a pyramid of part scores by convolving the feature pyramid with the part models.

Feature pyramids allow for object and part detections at multiple scales while the corresponding models are designed at a single scale. This is one of the oldest “tricks” in computer vision and has been implemented by sliding-window object detection approaches throughout the years .

Given an input image, we construct the feature pyramid by first building the Gaussian pyramid of the image for a fixed number of scales and then extracting features from each scale. For feature extraction, we use a CNN; more precisely, we use a variant of the single-scale network proposed by Krizhevsky et al. More details can be found in . Their software is publicly available, and we draw on their implementation.

Part models

We design models to capture parts of the human body under a particular viewpoint and pose. Ideally, part models should be (a) pose-sensitive, i.e. produce strong activations on examples of similar pose and viewpoint, (b) inclusive, i.e. cover all the examples in the training set, and (c) discriminative, i.e. score higher on the object than on the background. To achieve all the above properties, we build part models by clustering the keypoint configurations of all the examples in the training set and train linear SVMs on pool5 features with hard negative mining.

We model the human body with three high-level parts: the head, the torso and the legs. Even though the pose of the parts is tied to the global pose of the person, each one has its own degrees of freedom. In addition, the number of possible part combinations that cover the space of possible human poses is large, yet not infinite due to the kinematic constraints of the human body.

We design parts defined by the three body areas, head ( $H$ ), torso ( $T$ ) and legs ( $L$ ). Let $t\in\{H,T,L\}$ denote the part type and $K_{t}^{(i)}$ the set of 2D keypoints of the $i$ -th training example corresponding to part $t$ . The keypoints correspond to predefined landmarks of the human body. Specifically, $K_{H}=\{\textit{Eyes, Nose, Shoulders}\}$ , $K_{T}=\{\textit{Shoulders, Hips}\}$ and $K_{L}=\{\textit{Hips, Knees, Ankles}\}$ .

For each $t$ , we cluster the set of $K_{t}^{(i)},i=1,...,N$ , where $N$ is the size of the training set. The output is a set of clusters $C_{t}=\{c_{j}\}_{j=1}^{P_{t}}$ , where $P_{t}$ is the number of clusters for $t$ . The clusters correspond to distinct part configurations.

We use a greedy clustering algorithm, similar to . Examples are processed in a random order. An example is added to an existing cluster if its distance to the cluster center is less than $\epsilon$ ; otherwise it starts a new cluster. The distance between two examples is defined as the Euclidean distance of their normalized keypoint distributions. For each cluster $c\in C_{t}$ , we collect the $M$ cluster members closest to its center. Those form the set of positive examples that represent the cluster. From now on, we describe a part by its body part type $t$ and its cluster index $j$ , with $c_{j}\in C_{t}$ , while $S_{t,j}$ represents the set of positive examples for part $(t,j)$ .

Figure 3 (left) shows examples of clusters as produced by our clustering algorithm with $\epsilon=1$ and $M=100$ . We show 4 examples for each cluster example. We use the PASCAL VOC 2012 train set, along with keypoint annotations as provided by , to design and train the part detectors. In total we obtain 30 parts, 13 for head, 11 for torso and 6 for legs.
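The greedy clustering step can be illustrated with a minimal Python sketch. Several details are assumptions made only for illustration and are not stated in the text: keypoints are normalized by centering and scaling to unit norm, the center of a cluster is its first example, an example joins the nearest center within $\epsilon$ , and all function names are hypothetical.

```python
import math
import random


def normalize(keypoints):
    """Center the keypoints and scale them to unit norm (assumed normalization)."""
    n = len(keypoints)
    cx = sum(x for x, _ in keypoints) / n
    cy = sum(y for _, y in keypoints) / n
    centered = [(x - cx, y - cy) for x, y in keypoints]
    scale = math.sqrt(sum(x * x + y * y for x, y in centered)) or 1.0
    return [(x / scale, y / scale) for x, y in centered]


def distance(a, b):
    """Euclidean distance between two normalized keypoint configurations."""
    return math.sqrt(sum((ax - bx) ** 2 + (ay - by) ** 2
                         for (ax, ay), (bx, by) in zip(a, b)))


def greedy_cluster(examples, eps, m, seed=0):
    """Return one list of example indices per cluster (its m closest members)."""
    normed = [normalize(k) for k in examples]
    order = list(range(len(normed)))
    random.Random(seed).shuffle(order)  # examples are processed in random order

    centers, members = [], []
    for i in order:
        # Assumption: the center of a cluster is its first example.
        best, best_d = None, eps
        for c, center in enumerate(centers):
            d = distance(normed[i], normed[center])
            if d < best_d:
                best, best_d = c, d
        if best is None:
            centers.append(i)       # no center within eps: start a new cluster
            members.append([i])
        else:
            members[best].append(i)

    # Keep the m members closest to each center as the positive examples.
    return [sorted(ms, key=lambda i: distance(normed[i], normed[c]))[:m]
            for c, ms in zip(centers, members)]
```

With this normalization the distance between two configurations lies in $[0,2]$ , so a threshold such as $\epsilon=1$ separates clearly different layouts while tolerating small keypoint jitter.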

Learning part models

For each part $(t,j)$ , we define the part model to be the vector of weights ${\bf w}_{t,j}$ which, when convolved with a feature pyramid, gives stronger activations near the ground-truth location and scale (rightmost part of Figure 2).
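To make the scoring step concrete, the following sketch slides a part filter over every level of a feature pyramid. It is an illustration under stated assumptions rather than the actual implementation: feature maps are plain nested lists indexed as `[row][column][channel]`, the filter has the same channel count, no padding is used, and the names `score_map` and `score_pyramid` are hypothetical.

```python
def score_map(features, weights):
    """Slide a part filter over one pyramid level and return its response map."""
    fh, fw = len(features), len(features[0])
    wh, ww = len(weights), len(weights[0])
    out = []
    for y in range(fh - wh + 1):
        row = []
        for x in range(fw - ww + 1):
            s = 0.0
            for dy in range(wh):
                for dx in range(ww):
                    f = features[y + dy][x + dx]
                    w = weights[dy][dx]
                    # Dot product over the channel dimension.
                    s += sum(fc * wc for fc, wc in zip(f, w))
            row.append(s)
        out.append(row)
    return out


def score_pyramid(pyramid, weights):
    """Apply the same single-scale filter to every level of the pyramid."""
    return [score_map(level, weights) for level in pyramid]
```

Because the filter is defined at a single scale, multi-scale detection comes entirely from applying it to each pyramid level.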

Figure 3 (right) shows the top few detections of a subset of parts on PASCAL VOC val 2009 set. Each row shows activations of a different part, which is displayed at the left part of the same row.

We quantify the performance of our part detectors by computing the average precision (AP) on val 2009, as is done for object detection in PASCAL VOC. For every image, we detect part activations at all scales and locations and apply non-maximum suppression with a threshold of 0.3 across all parts of the same type. Since keypoint annotations are available on the val set, we are able to construct ground-truth part boxes. A detection is marked as positive if its intersection-over-union with a ground-truth part box is more than $\sigma$ . In PASCAL VOC, $\sigma$ is set to 0.5. However, this threshold is rather strict for small objects, such as our parts. We therefore report AP for various values of $\sigma$ for a fair assessment of the quality of our parts. Table 1 shows the results.
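The ingredients of this evaluation can be sketched as follows. The sketch assumes boxes are given as `(x1, y1, x2, y2)` tuples and that the suppression threshold of 0.3 refers to intersection-over-union overlap; the text does not state the overlap measure, and the function names are hypothetical.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def nms(detections, threshold=0.3):
    """Greedy non-maximum suppression over (score, box) pairs of one part type."""
    kept = []
    for score, box in sorted(detections, key=lambda d: d[0], reverse=True):
        # Keep a detection only if it does not overlap a stronger kept one.
        if all(iou(box, k[1]) <= threshold for k in kept):
            kept.append((score, box))
    return kept


def is_positive(box, gt_boxes, sigma):
    """A detection is positive if it overlaps some ground-truth box by more than sigma."""
    return any(iou(box, g) > sigma for g in gt_boxes)
```

Lowering `sigma` below 0.5 relaxes the matching criterion, which matters for small boxes where a shift of a few pixels changes the overlap substantially.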

Since our part models operate independently, we need to group part activations and link them to an instance in question. Given a candidate region $box$ in an image $I$ , for each part $t$ we keep the highest scoring part within $box$ :

$$j^{*}=\arg\max_{j}\max_{(x,y)\in box}{\bf w}_{t,j}*F_{(x,y)}(I),$$

where $F_{(x,y)}(I)$ is the point in the feature pyramid for $I$ corresponding to the image coordinates $(x,y)$ . This results in three parts being associated with each $box$ , as shown in Figure 1. A part is considered absent if the score of the part activation is below a threshold, which we set to -0.1.
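A small sketch of this selection rule for one part type $t$ is given below. It assumes that activations have already been computed and are available as `(j, x, y, score)` tuples in image coordinates, and that an activation belongs to $box$ when its location falls inside it; both the data layout and the name `best_part` are hypothetical.

```python
ABSENT_THRESHOLD = -0.1  # threshold quoted in the text


def best_part(activations, box, threshold=ABSENT_THRESHOLD):
    """Return the top-scoring (j, x, y, score) inside box, or None if absent."""
    x1, y1, x2, y2 = box
    inside = [a for a in activations
              if x1 <= a[1] <= x2 and y1 <= a[2] <= y2]
    if not inside:
        return None
    top = max(inside, key=lambda a: a[3])
    # The part is treated as absent when even the best score is too low.
    return top if top[3] >= threshold else None
```

Calling this once per part type (head, torso, legs) yields at most three parts per instance, with `None` marking an absent part.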

In the case when an oracle gives ground-truth bounding boxes at test time, one can refine the search of parts even further. If $box$ is the oracle box in question, we retrieve the $k$ nearest neighbor instances $i=\{i_{1},...,i_{k}\}$ from the training set based on the inner product of their $L_{2}$ -normalized pool5 feature maps $F(\cdot)$ , i.e. $\frac{F(box)^{T}F(box_{i_{j}})}{||F(box)||\cdot||F(box_{i_{j}})||}$ . If $K_{i_{j}}$ are the keypoints for the nearest examples, we consider the average keypoint locations $K_{box}=\frac{1}{k}\sum_{j=1}^{k}K_{i_{j}}$ to be an estimate of the keypoints for the test instance $box$ . Based on $K_{box}$ we can reduce the regions of interest for each part within $box$ by only searching for them in the corresponding estimates of the body parts.
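The nearest-neighbor keypoint estimate can be sketched as follows. The sketch assumes flattened feature vectors, keypoints expressed in box-relative coordinates so that averaging across instances is meaningful, and every neighbor having all keypoints annotated; none of these details is specified in the text, and the function names are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Inner product of two vectors divided by the product of their L2 norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def estimate_keypoints(box_feat, train_feats, train_keypoints, k):
    """Average the keypoints of the k training instances most similar to the box."""
    ranked = sorted(range(len(train_feats)),
                    key=lambda i: cosine_similarity(box_feat, train_feats[i]),
                    reverse=True)[:k]
    n = len(ranked)
    num_kp = len(train_keypoints[0])
    return [(sum(train_keypoints[i][p][0] for i in ranked) / n,
             sum(train_keypoints[i][p][1] for i in ranked) / n)
            for p in range(num_kp)]
```

The averaged keypoints then define a coarse region for each body part, and the part search is restricted to that region.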

Part-based Classification

In this section we investigate the role of parts for fine-grained classification tasks. We focus on the tasks of action classification (e.g. running, reading, etc.) and attribute classification (e.g. male, wears hat, etc.). Figure 4 schematically outlines our approach at test time. We start with the part activations mapped to an instance and forward propagate the corresponding part and instance boxes through a CNN. The output is an fc7 feature vector for each part as well as for the whole instance. We concatenate the feature vectors and classify the example with a linear SVM, which predicts the confidence for each class (action or attribute).

For each task, we consider four variants of our approach in order to understand which design factors are important.

No parts. This approach is our baseline and does not use part detectors. Instead, each instance is classified according to the fc7 feature vector computed from the instance bounding box. The CNN used for this system is fine-tuned from an ImageNet initialization, as in , on jittered instance bounding boxes.

Instance fine-tuning. This method uses our part detectors. Each instance is classified based on concatenated fc7 feature vectors from the instance and all three parts. The CNN used for this system is fine-tuned on instances, just as in the “no parts” system. We note that because some instances are occluded, and due to jittering, training samples may resemble parts, though typically only the head and torso (since occlusion tends to happen from the torso down).

Joint fine-tuning. This method also uses our part detectors and concatenated fc7 feature vectors. However, unlike the previous two methods we fine-tune the CNN jointly using instance and part boxes from each training sample. During fine-tuning the network can be seen as a four-stream CNN, with one stream for each bounding box. Importantly, we tie weights between the streams so that the number of CNN parameters is the same in all system variants. This design explicitly forces the CNN to see each part box during fine-tuning.

3-way split. To test the importance of our part detectors, we employ a baseline that splits the instance bounding box into three horizontal strips (top, middle, and bottom) in order to simulate crude part detectors. This variation uses a CNN fine-tuned on instances.
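For illustration, the crude split baseline can be written in a few lines. The sketch assumes boxes in `(x1, y1, x2, y2)` form with $y$ increasing downward and three strips of equal height; the text does not state the strip proportions, and the name `three_way_split` is hypothetical.

```python
def three_way_split(box):
    """Split a box into top, middle and bottom strips of equal height."""
    x1, y1, x2, y2 = box
    h = (y2 - y1) / 3.0
    return [(x1, y1 + i * h, x2, y1 + (i + 1) * h) for i in range(3)]
```

The three strips stand in for the head, torso and legs boxes that the part detectors would otherwise provide.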

Action Classification

We focus on the problem of action classification as defined by the PASCAL VOC action challenge. The task involves predicting actions from a set of predefined action categories.

We train all networks with backpropagation using Caffe , starting from the ImageNet weights, similar to the fine-tuning procedure introduced in . A small learning rate of $10^{-5}$ and a dropout ratio of 50% were used. During training, and at test time, if a part is absent from an instance then we use a box filled with the ImageNet mean image values (i.e., all zeros after mean subtraction). Subsequently, we train linear SVMs, one for each action, on the concatenated fc7 feature vectors.
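The way absent parts enter the final descriptor can be sketched as below. The sketch assumes that the fc7 vectors have already been extracted, that `absent_feat` is the fc7 vector the network produces for the mean-filled box (computed once and reused), and that parts are keyed by the hypothetical names `"head"`, `"torso"` and `"legs"`.

```python
PART_TYPES = ("head", "torso", "legs")


def concat_features(instance_feat, part_feats, absent_feat):
    """Concatenate the instance fc7 vector with one fc7 vector per part type.

    part_feats maps a part type to its fc7 vector, or to None if the part
    is absent; absent parts are replaced by absent_feat.
    """
    out = list(instance_feat)
    for t in PART_TYPES:
        f = part_feats.get(t)
        out.extend(absent_feat if f is None else f)
    return out
```

Substituting a fixed vector for absent parts keeps the descriptor length constant, so a single linear SVM per class can be trained on all instances regardless of which parts were detected.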

In order to make the most of the context in the image, we rescore our predictions by using the output of R-CNN for the 20 PASCAL VOC object categories and the presence of other people performing actions. We train a linear SVM on the action score of the test instance, the maximum scores of other instances (if any) and the object scores, to obtain a final prediction. Context rescoring is used for all system variations on the test set.
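One possible layout of the input to the rescoring SVM is sketched below. The details are assumptions: the maximum over other people is taken separately per action, a constant `fill` value is used when the image contains no other person, and the name `context_features` is hypothetical.

```python
def context_features(instance_scores, other_scores, object_scores, fill=0.0):
    """Build the feature vector used to rescore one instance.

    instance_scores: action scores of the instance being rescored.
    other_scores: list of action-score vectors of other people in the image.
    object_scores: detector scores for the object categories in the image.
    """
    num_actions = len(instance_scores)
    if other_scores:
        max_other = [max(s[a] for s in other_scores) for a in range(num_actions)]
    else:
        max_other = [fill] * num_actions  # assumed placeholder: no other people
    return list(instance_scores) + max_other + list(object_scores)
```

A linear SVM trained on such vectors can learn, for example, that a high-scoring bicycle detection supports the riding-bike action.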

Table 2 shows the results of our approach on the PASCAL VOC 2012 test set. These results are in the standard setting, where an oracle gives ground-truth person bounds at test time. We conduct experiments using two different network architectures: an 8-layer CNN as defined in , and a 16-layer CNN as defined in . Ours (no parts) is the baseline approach, with no parts. Ours is our full approach when we include the parts. For the 8-layer network, we use the CNN trained on instances, while for the 16-layer network we use the CNN trained jointly on instances and their parts; both choices are based on results on the val set (Table 3). For our final system, we also present results when we add features extracted from the whole image, using a 16-layer network trained on ImageNet-1k (Ours (w/ image features)).

We also show results as reported by action poselets, a part-based approach using action-specific poselets with HOG features, and by Oquab et al., Hoai, and Simonyan and Zisserman, three CNN-based approaches on the task. The best performing method by uses a 16- and 19-layer network. Their 16-layer network is equivalent to Ours (no parts) with 16 layers, thus the additional boost in performance comes from the 19-layer network. This is not surprising, since deeper networks perform better, as is also evident from our experiments.

From the comparison with the baseline, we conclude that parts improve the performance. For the 8-layer CNN, parts contribute 3% of mAP, with the biggest improvement coming from Phoning, Reading and Taking Photo. For the 16-layer CNN, the improvement from parts is smaller, 1.7% of mAP, and the actions that benefit the most are Reading, Taking Photo and Using Computer. The image features capture cues from the scene and give an additional boost to our final performance.

Table 3 shows results on the PASCAL VOC action val set for a variety of different implementations of our approach. Ours (no parts) is the baseline approach, with no parts, while Ours (3-way split) uses as parts the three horizontal splits comprising the instance box. Ours (joint fine-tuning) shows the results when using a CNN fine-tuned jointly on instances and parts, while Ours (instance fine-tuning) shows our approach when using a CNN fine-tuned on instances only. We note that all variations that use parts significantly outperform the no-parts system.

We also show results of our best system when ground-truth information is not available at test time (Ours (R-CNN bbox)). In place of oracle boxes we use R-CNN detections for person. For evaluation purposes, we associate an R-CNN detection to a ground-truth instance as follows: we pick the highest scoring detection for person that overlaps more than 0.5 with the ground truth. Another option would be to define object categories as “person+action” and then just follow the standard detection AP protocol. However, this is not possible because not all people are marked in the dataset (this is true for the attribute dataset as well). We report numbers on the val action dataset. We observe a drop in performance, as expected due to the imperfect person detector, but our method still works reasonably well under those circumstances. Figure 5 shows the top few predictions on the test set. Each block corresponds to a different action.
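The association between detections and ground-truth instances can be sketched as follows, assuming that overlap is measured by intersection-over-union and that detections are `(score, box)` pairs with boxes in `(x1, y1, x2, y2)` form. The function names are hypothetical.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def match_detection(gt_box, detections, min_overlap=0.5):
    """Pick the highest scoring detection overlapping gt_box by more than min_overlap."""
    candidates = [d for d in detections if iou(d[1], gt_box) > min_overlap]
    # None means the person detector missed this ground-truth instance.
    return max(candidates, key=lambda d: d[0]) if candidates else None
```

The matched detection box then replaces the oracle box as input to the classification pipeline.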

Attribute Classification

We focus on the problem of attribute classification, as defined by . There are 9 different categories of attributes, such as Is Male, Has Long Hair, and the task involves predicting attributes, given the location of the people. Our approach is shown in Figure 4. We use the Berkeley Attributes of People Dataset as proposed by .

Similar to the task of action classification, we separately learn the parameters of the CNN and the linear SVM. Again, we fine-tune a CNN for the task in question with the difference that the softmax layer is replaced by a cross entropy layer (sum of logistic regressions).
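The difference from a softmax loss is that each attribute is treated as an independent binary problem. A minimal sketch of such a loss for one instance is given below; it assumes labels in $\{0,1\}$ and, as a further assumption not stated in the text, that unlabeled attributes are marked with `None` and skipped. The function name is hypothetical.

```python
import math


def multilabel_logistic_loss(logits, labels):
    """Sum of per-attribute logistic (sigmoid cross entropy) losses."""
    total = 0.0
    for z, y in zip(logits, labels):
        if y is None:  # assumed convention for an unlabeled attribute
            continue
        # Numerically stable form of -[y*log(s(z)) + (1-y)*log(1-s(z))].
        total += max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))
    return total
```

Unlike softmax, this loss lets several attributes be active for the same person, which matches a setting where attributes such as Is Male and Has Long Hair are not mutually exclusive.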

Table 4 shows AP on the test set. We show results of our approach with and without parts, as well as results as reported by Zhang et al., the state-of-the-art on the task, on the same test set. With an 8-layer network, parts improve the performance of all categories, indicating their impact on attribute classification. Also, a network jointly fine-tuned on instances and parts seems to work significantly better than a CNN trained solely on the instance boxes. In the case of a 16-layer network, joint fine-tuning and instance fine-tuning seem to work equally well. The gain in performance from adding parts is less significant in this case. This might be because of the already high performance achieved by the holistic network. Interestingly, our 8-layer holistic approach matches the current state-of-the-art on this task, PANDA, showcasing the importance of deeper networks and good initialization.

Table 4 also shows the effectiveness of our best model, namely the jointly fine-tuned 16-layer CNN, when we use R-CNN detections instead of ground truth boxes on the Berkeley Attributes of People test set. Figure 7 shows the top few predictions on the test set. Each block corresponds to a different attribute. Figure 6 shows top errors for two of our lowest performing attribute classes.