Return of the Devil in the Details: Delving Deep into Convolutional Nets

Ken Chatfield, Karen Simonyan, Andrea Vedaldi, Andrew Zisserman

Introduction

Perhaps the single most important design choice in current state-of-the-art image classification and object recognition systems is the choice of visual features, or image representation. In fact, most of the quantitative improvements to image understanding obtained in the past dozen years can be ascribed to the introduction of improved representations, from the Bag-of-Visual-Words (BoVW) to the (Improved) Fisher Vector (IFV). A common characteristic of these methods is that they are largely handcrafted. They are also relatively simple, comprising dense sampling of local image patches, describing them by means of visual descriptors such as SIFT, encoding them into a high-dimensional representation, and then pooling over the image. Recently, these handcrafted approaches have been substantially outperformed by the introduction of the latest generation of Convolutional Neural Networks (CNNs) to the computer vision field. These networks have a substantially more sophisticated structure than standard representations, comprising several layers of non-linear feature extractors, and are therefore said to be deep (in contrast, classical representations will be referred to as shallow). Furthermore, while their structure is handcrafted, they contain a very large number of parameters learnt from data. When applied to standard image classification and object detection benchmark datasets such as ImageNet ILSVRC and PASCAL VOC, such networks have demonstrated excellent performance, significantly better than standard image encodings.

Despite these impressive results, it remains unclear how different deep architectures compare to each other and to shallow computer vision methods such as IFV. Most papers did not test these representations extensively on a common ground, so a systematic evaluation of the effect of different design and implementation choices remains largely missing. As noted in our previous work, which compared the performance of various shallow visual encodings, the performance of computer vision systems depends significantly on implementation details. For example, state-of-the-art methods not only involve the use of a CNN, but also include other improvements such as the use of very large scale datasets, GPU computation, and data augmentation (also known as data jittering or virtual sampling). These improvements could also transfer to shallow representations such as the IFV, potentially explaining a part of the performance gap.

Scenarios

This section introduces the three types of image representation $\phi(I)$ considered in this paper, describing them within the context of three different scenarios. Having outlined details specific to each, general methodologies which apply to all three scenarios are reviewed, such as data augmentation and feature normalisation, together with the linear classifier (trained with a standard hinge loss). We also specify here the benchmark datasets used in the evaluation.

2.2 Scenario 2: Deep representation (CNN) with pre-training

2.3 Scenario 3: Deep representation (CNN) with pre-training and fine-tuning

In Scenario 2 features are trained on one (large) dataset and applied to another (usually smaller). However, it was demonstrated that fine-tuning a pre-trained CNN on the target data can significantly improve the performance. We consider this scenario separately from that of Scenario 2, as the image features become dataset-specific after the fine-tuning.

2.4 Commonalities

We now turn to what is in common across the scenarios.

Data augmentation is a method applicable to both shallow and deep representations, but has so far mostly been applied to the latter. By augmentation we mean perturbing an image $I$ by transformations that leave the underlying class unchanged (e.g. cropping and flipping) in order to generate additional examples of the class. Augmentation can be applied at training time, at test time, or both. The augmented samples can either be taken as-is or combined to form a single feature, e.g. using sum/max-pooling or stacking.

2.4.2 Linear predictors

2.5 Benchmark data

As reference benchmark we use the PASCAL VOC data as already done in . The VOC-2007 edition contains about 10,000 images split into train, validation, and test sets, and labelled with twenty object classes. A one-vs-rest SVM classifier for each class is learnt and evaluated independently and the performance is measured as mean Average Precision (mAP) across all classes. The VOC-2012 edition contains roughly twice as many images and does not include test labels; instead, evaluation uses the official PASCAL Evaluation Server. To train deep representations we use the ILSVRC-2012 challenge dataset. This contains 1,000 object categories from ImageNet with roughly 1.2M training images, 50,000 validation images, and 100,000 test images. Performance is evaluated using the top-5 classification error. Finally, we also evaluate over the Caltech-101 and Caltech-256 image classification benchmarks . For Caltech-101, we followed the protocol of , and considered three random splits into training and testing data, each of which comprises 30 training and up to 30 testing images per class. For Caltech-256, two random splits were generated, each of which contains 60 training images per class, and the rest are used for testing. On both Caltech datasets, performance is measured using mean class accuracy.
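Two of the evaluation measures named above, the top-5 classification error used for ILSVRC and the mean class accuracy used for the Caltech datasets, can be sketched as follows. This is a minimal illustration rather than the official evaluation code: the function names and the array layout (one row of class scores per image) are assumptions made here for the example.

```python
import numpy as np

def top5_error(scores, labels):
    """Fraction of images whose true label is not among the 5 highest scores.

    scores: (num_images, num_classes) array, num_classes >= 5 (assumed layout).
    labels: (num_images,) array of true class indices.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    # Indices of the 5 largest scores per row (their internal order is irrelevant).
    top5 = np.argpartition(-scores, 4, axis=1)[:, :5]
    hit = (top5 == labels[:, None]).any(axis=1)
    return float(1.0 - hit.mean())

def mean_class_accuracy(predicted, labels):
    """Average of the per-class accuracies, so that every class counts equally."""
    predicted = np.asarray(predicted)
    labels = np.asarray(labels)
    per_class = [(predicted[labels == c] == c).mean() for c in np.unique(labels)]
    return float(np.mean(per_class))
```

Mean class accuracy differs from plain accuracy when classes have different numbers of test images, which is the case for the Caltech splits where the number of test images per class varies.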

Details

This section gives the implementation details of the methods introduced in Sect. 2.

Our IFV representation uses a slightly improved setting compared to the best result of Chatfield et al.

The second modification is the use of spatially-extended local descriptors instead of a spatial pyramid. Here descriptors $\mathbf{x}_{i}$ are appended with their image location $(x_{i},y_{i})$ before quantization with the GMM. Formally, $\mathbf{x}_{i}$ is extended, after PCA projection, with its normalised spatial coordinates: $[\mathbf{x}_{i}^{\top},x_{i}/W-0.5,y_{i}/H-0.5]^{\top}$ , where $W\times H$ are the dimensions of the image. Since the GMM quantizes both appearance and location, this allows for spatial information to be captured directly by the soft-quantization process. This method is significantly more memory-efficient than using a spatial pyramid. Specifically, the PCA-reduced SIFT features are spatially augmented by appending $(x,y)$ yielding $D=82$ dimensional descriptors pooled in a $2KD=41,984$ dimensional IFV.
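The spatial extension described above can be sketched in a few lines. This is an illustrative sketch only: the function names are hypothetical, the 80-dimensional input is inferred from $D=82$ after appending two coordinates, and $K=256$ is the number of GMM components assumed for the dimensionality check.

```python
import numpy as np

def add_spatial_coordinates(descriptors, xs, ys, width, height):
    """Append normalised (x, y) locations to PCA-reduced local descriptors.

    descriptors: (N, d) array; xs, ys: (N,) pixel coordinates;
    width, height: image dimensions W and H.
    """
    descriptors = np.asarray(descriptors, dtype=float)
    x_norm = np.asarray(xs, dtype=float) / width - 0.5   # x_i / W - 0.5
    y_norm = np.asarray(ys, dtype=float) / height - 0.5  # y_i / H - 0.5
    return np.hstack([descriptors, x_norm[:, None], y_norm[:, None]])

def ifv_dimension(num_gaussians, descriptor_dim):
    """Dimensionality 2KD of the Fisher vector, as given in the text."""
    return 2 * num_gaussians * descriptor_dim

# Example: three 80-D descriptors in a 100 x 200 image become 82-D.
extended = add_spatial_coordinates(np.zeros((3, 80)), [0, 50, 100], [0, 100, 200], 100, 200)
print(extended.shape)          # (3, 82)
print(ifv_dimension(256, 82))  # 41984
```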

The third modification is the use of colour features in addition to SIFT descriptors. While colour information is used in CNNs and by the original FV paper , it was not explored in our previous comparison . We do so here by adopting the same Local Colour Statistics (LCS) features as used by . LCS is computed by dividing an input patch into a $4\times 4$ spatial grid (akin to SIFT), and computing the mean and variance of each of the Lab colour channels for each cell of the grid. The LCS dimensionality is thus $4\times 4\times 2\times 3=96$ . This is then encoded in a similar manner to SIFT.
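A minimal sketch of the LCS computation described above is given below. It assumes the patch has already been converted to the Lab colour space, uses the population variance, and orders the output cell by cell; the function name, the cell-splitting rule for patch sizes not divisible by 4, and the output ordering are all choices made for this example and are not specified in the text.

```python
import numpy as np

def local_colour_statistics(patch_lab, grid=4):
    """Mean and variance of each colour channel in each cell of a grid x grid layout.

    patch_lab: (H, W, 3) array, assumed to be in Lab colour space already.
    Returns a vector of length grid * grid * 2 * 3 (96 for the default grid).
    """
    patch_lab = np.asarray(patch_lab, dtype=float)
    h, w, c = patch_lab.shape
    row_groups = np.array_split(np.arange(h), grid)
    col_groups = np.array_split(np.arange(w), grid)
    feats = []
    for rows in row_groups:
        for cols in col_groups:
            # All pixels of this cell, flattened to (num_pixels, channels).
            cell = patch_lab[np.ix_(rows, cols)].reshape(-1, c)
            feats.append(cell.mean(axis=0))
            feats.append(cell.var(axis=0))
    return np.concatenate(feats)

print(local_colour_statistics(np.ones((16, 16, 3))).shape)  # (96,)
```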

3.2 Convolutional neural networks details

Our Fast (CNN-F) architecture is similar to the one used by Krizhevsky et al. It comprises 8 learnable layers, 5 of which are convolutional, and the last 3 are fully-connected. The input image size is $224\times 224$. Fast processing is ensured by the 4 pixel stride in the first convolutional layer. The main differences between our architecture and that of Krizhevsky et al. are the reduced number of convolutional layers and the dense connectivity between convolutional layers (Krizhevsky et al. used sparse connections to enable training on two GPUs).

Our Medium (CNN-M) architecture is similar to the one used by Zeiler and Fergus. It is characterised by the decreased stride and smaller receptive field of the first convolutional layer, which was shown to be beneficial on the ILSVRC dataset. At the same time, conv2 uses a larger stride (2 instead of 1) to keep the computation time reasonable. The main difference between our net and that of Zeiler and Fergus is that we use fewer filters in the conv4 layer (512 vs. 1024).

Our Slow (CNN-S) architecture is related to the ‘accurate’ network from the OverFeat package. It also uses $7\times 7$ filters with stride $2$ in conv1. Unlike CNN-M and the network of Zeiler and Fergus, the stride in conv2 is smaller (1 pixel), but the max-pooling window in conv1 and conv5 is larger ($3\times 3$) to compensate for the increased spatial resolution. Compared to OverFeat, we use 5 convolutional layers as in the previous architectures (OverFeat used 6), and fewer filters in conv5 (512 instead of 1024); we also incorporate an LRN layer after conv1 (OverFeat did not use contrast normalisation).

In general, our CNN training procedure follows that of Krizhevsky et al., learning on ILSVRC-2012 using gradient descent with momentum. The hyper-parameters are the same as used by Krizhevsky et al.: momentum $0.9$; weight decay $5\cdot 10^{-4}$; initial learning rate $10^{-2}$, which is decreased by a factor of $10$ when the validation error stops decreasing. The layers are initialised from a Gaussian distribution with zero mean and variance equal to $10^{-2}$. We also employ similar data augmentation in the form of random crops, horizontal flips, and RGB colour jittering. Test time crop sampling is discussed in Sect. 3.3; at training time, $224\times 224$ crops are sampled randomly, rather than deterministically. Thus, the only notable difference to Krizhevsky et al. is that the crops are taken from the whole training image $P\times 256,P\geq 256$, rather than its $256\times 256$ centre. Training was performed on a single NVIDIA GTX Titan GPU and the training time varied from 5 days for CNN-F to 3 weeks for CNN-S.
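The hyper-parameters listed above can be put together in a small sketch of one training step. The values (momentum, weight decay, initial learning rate, initialisation variance) come from the text; the exact form of the update, with weight decay folded into the gradient, and the plateau test used to trigger the learning rate drop are assumptions made for this illustration, as are all function names.

```python
import numpy as np

MOMENTUM = 0.9
WEIGHT_DECAY = 5e-4
INITIAL_LR = 1e-2

def init_weights(shape, rng):
    """Zero-mean Gaussian initialisation; variance 1e-2 means std 0.1."""
    return rng.normal(loc=0.0, scale=np.sqrt(1e-2), size=shape)

def sgd_momentum_step(w, v, grad, lr):
    """One momentum update (assumed form: weight decay added to the gradient)."""
    v = MOMENTUM * v - lr * (grad + WEIGHT_DECAY * w)
    return w + v, v

def step_down(lr, val_errors, factor=10.0):
    """Divide lr by `factor` when the latest validation error is no better
    than the best earlier one (a hypothetical plateau criterion)."""
    if len(val_errors) >= 2 and val_errors[-1] >= min(val_errors[:-1]):
        return lr / factor
    return lr

rng = np.random.default_rng(0)
w = init_weights((4, 3), rng)
v = np.zeros_like(w)
w, v = sgd_momentum_step(w, v, grad=np.ones_like(w), lr=INITIAL_LR)
```

In practice the decision to lower the learning rate is often made by inspecting the validation curve rather than by a fixed rule, so `step_down` should be read as one possible automation of that step.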

3.2.2 CNN fine-tuning on the target dataset

In our experiments, we fine-tuned CNN-S using VOC-2007, VOC-2012, or Caltech-101 as the target data. Fine-tuning was carried out using the same framework (and the same data augmentation) as we used for CNN training on ILSVRC. The last fully-connected layer (full8) has output dimensionality equal to the number of classes, which differs between datasets, so we initialised it from a Gaussian distribution (as used for CNN training above). Now we turn to dataset-specific fine-tuning details.

VOC-2007 and VOC-2012. Considering that PASCAL VOC is a multi-label dataset (i.e. a single image might have multiple labels), we replaced the softmax regression loss with a more appropriate loss function, for which we considered two options: one-vs-rest classification hinge loss (the same loss as used in the SVM experiments) and ranking hinge loss. Both losses define constraints on the scores of positive ( $I_{pos}$ ) and negative ( $I_{neg}$ ) images for each class: $w_{c}\phi(I_{pos})>1-\xi,w_{c}\phi(I_{neg})<-1+\xi$ for the classification loss, $w_{c}\phi(I_{pos})>w_{c}\phi(I_{neg})+1-\xi$ for the ranking loss ( $w_{c}$ is the $c$ -th row of the last fully-connected layer, which can be seen as a linear classifier on deep features $\phi(I)$ ; $\xi$ is a slack variable). Our fine-tuned networks are denoted as “CNN S TUNE-CLS” (for the classification loss) and “CNN S TUNE-RNK” (for the ranking loss). In the case of both VOC datasets, the training and validation subsets were combined to form a single training set. Given the smaller size of the training data when compared to ILSVRC-2012, we controlled for over-fitting by using lower initial learning rates for the fine-tuned hidden layers. The learning rate schedule for the last layer / hidden layers was: $10^{-2}/10^{-4}\rightarrow 10^{-3}/10^{-4}\rightarrow 10^{-4}/10^{-4}\rightarrow 10^{-5}/10^{-5}$ .
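The two sets of constraints above translate into hinge penalties on the slack $\xi$. The sketch below computes both for a batch of images, taking the scores $w_{c}\phi(I)$ as given. It is an illustration under stated assumptions: the slacks are simply summed without normalisation, every positive/negative pair of a class contributes to the ranking loss, and the function names and array layout are hypothetical.

```python
import numpy as np

def classification_hinge_loss(scores, labels):
    """One-vs-rest hinge loss.

    scores: (N, C) array of class scores w_c . phi(I).
    labels: (N, C) array with +1 for positive and -1 for negative images.
    The slack for each entry is max(0, 1 - label * score).
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=float)
    return float(np.maximum(0.0, 1.0 - labels * scores).sum())

def ranking_hinge_loss(scores, labels):
    """Pairwise ranking hinge loss: each positive image of a class should
    score at least 1 higher than each negative image of that class."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=float)
    total = 0.0
    for c in range(scores.shape[1]):
        pos = scores[labels[:, c] > 0, c]
        neg = scores[labels[:, c] < 0, c]
        # Slack for every (positive, negative) pair of class c.
        margins = 1.0 - (pos[:, None] - neg[None, :])
        total += np.maximum(0.0, margins).sum()
    return float(total)

scores = np.array([[1.0], [-0.5], [0.5]])   # one class, three images
labels = np.array([[1], [-1], [-1]])
print(classification_hinge_loss(scores, labels))  # 2.0
print(ranking_hinge_loss(scores, labels))         # 0.5
```

The example shows the difference in emphasis: the classification loss penalises both negatives for lying above $-1$, whereas the ranking loss only penalises the negative that comes within a margin of 1 of the positive.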

Caltech-101. This dataset contains a single class label per image, so fine-tuning was performed using the softmax regression loss. Other settings (including the learning rate schedule) were the same as used for the VOC fine-tuning experiments.

3.2.3 Low-dimensional CNN feature training

Our baseline networks (Table I) have the same dimensionality of the last hidden layer (full7): 4096. This design choice is in accordance with the state-of-the-art architectures, and leads to a 4096-dimensional image representation, which is already rather compact compared to IFV. We further trained three modifications of the CNN-M network, with lower dimensional full7 layers of 2048, 1024, and 128 dimensions respectively. The networks were learnt on ILSVRC-2012. To speed up training, all layers aside from full7/full8 were set to those of the CNN-M net and a lower initial learning rate of $10^{-3}$ was used. The initial learning rate of full7/full8 was set to $10^{-2}$.

3.3 Data augmentation details

We explore three data augmentation strategies. The first strategy is to use no augmentation. In contrast to IFV, however, CNNs require images to be transformed to a fixed size ($224\times 224$) even when no augmentation is used. Hence the image is downsized so that the smallest dimension is equal to $224$ pixels and a $224\times 224$ crop is extracted from the centre. (Extracting a $224\times 224$ centre crop from a $256\times 256$ image resulted in worse performance.) The second strategy is to use flip augmentation, mirroring images about the $y$-axis and producing two samples from each image. The third strategy, termed C+F augmentation, combines cropping and flipping. For CNN-based representations, the image is downsized so that the smallest dimension is equal to $256$ pixels. Then $224\times 224$ crops are extracted from the four corners and the centre of the image. Note that the crops are sampled from the whole image, rather than its $256\times 256$ centre, as done by Krizhevsky et al. These crops are then flipped about the $y$-axis, producing $10$ perturbed samples per input image. In the case of the IFV encoding, the same crops are extracted, but at the original image resolution.
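The C+F strategy can be sketched as follows. The sketch assumes the image has already been resized so that its smallest side is 256 pixels (the resizing itself is omitted), and the function name and the order in which the crops are returned are choices made for this example.

```python
import numpy as np

def crop_and_flip(image, crop=224):
    """Four corner crops and the centre crop, each with its horizontal mirror.

    image: (H, W, channels) array with min(H, W) >= crop, taken from the
    whole image rather than a square centre region.
    Returns a list of 10 arrays of shape (crop, crop, channels).
    """
    h, w = image.shape[:2]
    tops = [0, 0, h - crop, h - crop, (h - crop) // 2]
    lefts = [0, w - crop, 0, w - crop, (w - crop) // 2]
    samples = []
    for top, left in zip(tops, lefts):
        patch = image[top:top + crop, left:left + crop]
        samples.append(patch)
        samples.append(patch[:, ::-1])  # mirror about the vertical (y) axis
    return samples

image = np.zeros((256, 300, 3))
print(len(crop_and_flip(image)))  # 10
```

For a non-square image such as the $256\times 300$ example, the corner crops reach the left and right borders of the image, which is the difference from sampling within a $256\times 256$ centre region.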

Analysis

This section describes the experimental results, comparing different features and data augmentation schemes. The results are given in Table II for VOC-2007 and analysed next, starting from generally applicable methods such as augmentation and then discussing the specifics of each scenario. We then move onto other datasets and the state of the art in Sect. 4.7.

We experiment with no data augmentation (denoted Image Aug=– in Tab. II), flip augmentation (Image Aug=F), and C+F augmentation (Image Aug=C). Augmented images are used as stand-alone samples (f), or by fusing the corresponding descriptors using sum (s) or max (m) pooling or stacking (t). So for example Image Aug=(C) f s in row II of Tab. II means that C+F augmentation is used to generate additional samples in training (f), and is combined with sum-pooling in testing (s).
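The three ways of fusing the descriptors of augmented samples, sum-pooling (s), max-pooling (m), and stacking (t), can be sketched as below. The function name and the use of the single-letter codes as arguments are conveniences for this example; any normalisation applied after fusion is left out.

```python
import numpy as np

def fuse(descriptors, mode):
    """Combine the descriptors of the augmented samples of one image.

    descriptors: (num_samples, dim) array, one row per augmented sample.
    mode: 's' sum-pooling, 'm' max-pooling, 't' stacking (concatenation).
    """
    descriptors = np.asarray(descriptors, dtype=float)
    if mode == "s":
        return descriptors.sum(axis=0)
    if mode == "m":
        return descriptors.max(axis=0)
    if mode == "t":
        return descriptors.reshape(-1)
    raise ValueError("mode must be 's', 'm' or 't'")

d = [[1.0, 4.0], [3.0, 2.0]]
print(fuse(d, "s"))  # [4. 6.]
print(fuse(d, "m"))  # [3. 4.]
print(fuse(d, "t"))  # [1. 4. 3. 2.]
```

Sum- and max-pooling keep the descriptor dimensionality fixed, whereas stacking multiplies it by the number of samples (10 for C+F augmentation).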

Augmentation consistently improves performance by $\sim 3\%$ for both IFV (e.g. II vs. II) and CNN (e.g. II vs. II). Using additional samples for training and sum-pooling for testing works best (II) followed by sum-pooling II, max pooling II, and stacking II. In terms of the choice of transformations, flipping improves only marginally (II vs. II), but using the more expensive C+F sampling improves, as seen, by about $2\sim 3\%$ (II vs. II). We experimented with sampling more transformations, taking a higher density of crops from the centre of the image, but observed no benefit.

4.2 Colour

Colour information can be added and subtracted in CNN and IFV. In IFV, replacing SIFT with the LCS colour descriptors (denoted COL in Method) yields significantly worse performance (II vs. II). However, when SIFT and colour descriptors are combined by stacking the corresponding IFVs (COL+) there is a small but significant improvement of around $1\%$ in the non-augmented case (e.g. II vs. II) but little impact in the augmented case (e.g. II vs. II). For CNNs, retraining the network after converting all the input images to grayscale (denoted GS in Methods) has a more significant impact, resulting in a performance drop of $\sim 3\%$ (II vs. II, II vs. II).

4.3 Scenario 1: Shallow representation (IFV)

The baseline IFV encoding using a spatial pyramid II performs slightly better than the results [I] taken from Chatfield et al. , primarily due to a larger number of spatial scales being used during SIFT feature extraction, and the resultant SIFT features being square-rooted. Intra-normalisation, denoted as IN in the Method column of the table, improves the performance by $\sim 1\%$ (e.g. II vs. II). More interestingly, switching from spatial pooling (denoted spm in the SPool column) to feature spatial augmentation (SPool=(x,y)) has either little effect on the performance or results in a marginal increase (II vs. II, II vs. II), whilst resulting in a representation which is over 10 $\times$ smaller. We also experimented with augmenting with scale in addition to position as in but observed no improvement. Finally, we investigate pushing the parameters of the representation setting $K=512$ (rows II-II). Increasing the number of GMM centres in the model from $K=256$ to $512$ results in a further performance increase (e.g. II vs. II), but at the expense of higher-dimensional codes (125K dimensional).

4.4 Scenario 2: Deep representation (CNN) with pre-training

4.5 Scenario 3: Deep representation (CNN) with pre-training and fine-tuning

We fine-tuned our CNN-S architecture on VOC-2007 using the ranking hinge loss, and achieved a significant improvement: $2.7\%$ (II vs. II). This demonstrates that in spite of the small amount of VOC training data (5,011 images), fine-tuning is able to adjust the learnt deep representation to better suit the dataset in question.

4.6 Combinations

For the CNN-M 2048 representation II, stacking deep and shallow representations to form a higher-dimensional descriptor makes little difference (II vs. II). For the weaker CNN-F it results in a small boost of $\sim 0.8\%$ (II vs. II).

4.7 Comparison with the state of the art

In Table III we report our results on ILSVRC-2012, VOC-2007, VOC-2012, Caltech-101, and Caltech-256 datasets, and compare them to the state of the art. First, we note that the ILSVRC error rates of our CNN-F, CNN-M, and CNN-S networks are better than those reported for the related configurations of Krizhevsky et al., Zeiler and Fergus, and OverFeat. This validates our implementation, and the difference is likely to be due to the sampling of image crops from the uncropped image plane (instead of the centre). When using our CNN features on other datasets, the relative performance generally follows the same pattern as on ILSVRC, where the nets are trained: the CNN-F architecture exhibits the worst performance, with CNN-M and CNN-S performing considerably better.

Further fine-tuning of CNN-S on the VOC datasets turns out to be beneficial; on VOC-2012, using the ranking loss is marginally better than the classification loss (III vs. III), which can be explained by the ranking-based VOC evaluation criterion. Fine-tuning on Caltech-101 also yields a small improvement, but no gain is observed over Caltech-256.

Our CNN-S net is competitive with recent CNN-based approaches and on a number of datasets (VOC-2007, VOC-2012, Caltech-101, Caltech-256) and sets the state of the art on VOC-2007 and VOC-2012 across methods pre-trained solely on ILSVRC-2012 dataset. While the CNN-based methods of achieve better performance on VOC (86.3% and 90.3% respectively), they were trained using extended ILSVRC datasets, enriched with additional categories semantically close to the ones in VOC. Additionally, used a significantly more complex classification pipeline, driven by bounding box proposals , pre-trained on ILSVRC-2013 detection dataset. Their best reported result on VOC-2012 (90.3%) was achieved by the late fusion with a complex hand-crafted method of ; without fusion, they get 84.2%. On Caltech-101, achieves the state of the art using spatial pyramid pooling of conv5 layer features, while we used full7 layer features consistently across all datasets (for full7 features, they report $87.08\%$ ).

In addition to achieving performance comparable to the state of the art with a very simple approach (but powerful CNN-based features), with the modifications outlined in the paper (primarily the use of data augmentation similar to the CNN-based methods) we are able to improve the performance of shallow IFV to 68.02% (Table II, II).

4.8 Performance Evolution on VOC-2007

A comparative plot of the evolution in the performance of the methods evaluated in this paper, along with a selection from our earlier review of shallow methods is presented in Fig. 1. Classification accuracy over PASCAL VOC was 54.48% mAP for the BoVW model in 2008, 61.7% for the IFV in 2010 , and 73.41% for DeCAF and similar CNN-based methods introduced in late 2013. Our best performing CNN-based method (CNN-S with fine-tuning) achieves 82.42%, comparable to the most recent state-of-the-art.

4.9 Timings and dimensionality

One of our best-performing CNN representations CNN-M-2048 II is $\sim 42\times$ more compact than the best performing IFV II (84K vs. 2K) and CNN-M features are also $\sim 50\times$ faster to compute ( $\sim 120s$ vs. $\sim 2.4s$ per image with augmentation enabled, over a single CPU core). Non-augmented CNN-M features II take around $0.3s$ per image, compared to $\sim 0.4s$ for CNN-S features and $\sim 0.13s$ for CNN-F features.

Conclusion

In this paper we presented a rigorous empirical evaluation of CNN-based methods for image classification, along with a comparison with more traditional shallow feature encoding methods. We have demonstrated that the performance of shallow representations can be significantly improved by adopting data augmentation, typically used in deep learning. In spite of this improvement, deep architectures still outperform the shallow methods by a large margin. We have shown that the performance of deep representations on the ILSVRC dataset is a good indicator of their performance on other datasets, and that fine-tuning can further improve on already very strong results achieved using the combination of deep representations and a linear SVM. Source code and CNN models to reproduce the experiments presented in the paper are available on the project website in the hope that it would provide common ground for future comparisons, and good baselines for image representation research.

Acknowledgements

This work was supported by the EPSRC and ERC grant VisRec no. 228180. We gratefully acknowledge the support of NVIDIA Corporation with the donation of the GPUs used for this research.