Learning to See by Moving

Pulkit Agrawal, Joao Carreira, Jitendra Malik

Introduction

Recent advances in computer vision have shown that visual features learnt by neural networks trained for the task of object recognition using more than a million labelled images are useful for many computer vision tasks like semantic segmentation, object detection and action classification. However, object recognition is one among many tasks for which vision is used. For example, humans use visual perception for recognizing objects, understanding spatial layouts of scenes and performing actions such as moving around in the world. Is there something special about the task of object recognition or is it the case that useful visual representations can be learnt through other modes of supervision? Clearly, biological agents perform complex visual tasks and it is unlikely that they require external supervision in the form of millions of labelled examples. Unlabelled visual data is freely available and in theory this data can be used to learn useful visual representations. However, unsupervised learning approaches have so far not delivered on their promise and are largely absent from current applications on complex real world imagery.

Biological agents use perceptual systems for obtaining sensory information about their environment that enables them to act and accomplish their goals. Both biological and robotic agents employ their motor system for executing actions in their environment. Is it also possible that these agents can use their own motor system as a source of supervision for learning useful perceptual representations? Motor theories of perception have a long history, but there has been little work in formulating computational models of perception that make use of motor information. In this work we focus on visual perception and present a model based on egomotion (i.e. self motion) for learning useful visual representations. By useful visual representations, we mean representations that possess the following two characteristics: (1) the ability to perform multiple visual tasks and (2) the ability to perform new visual tasks by learning from only a few labeled examples provided by an extrinsic teacher.

Mobile agents are naturally aware of their egomotion (i.e. self-motion) through their own motor system. In other words, knowledge of egomotion is “freely” available. For example, the vestibular system provides the sense of orientation in many mammals. In humans and other animals, the brain has access to information about eye movements and the actions performed by the animal. A mobile robotic agent can estimate its egomotion either from the motor commands it issues to move or from odometry sensors such as gyroscopes and accelerometers mounted on the agent itself.

We propose that useful visual representations can be learnt by performing the simple task of correlating visual stimuli with egomotion. A mobile agent can be treated like a camera moving in the world and thus the knowledge of egomotion is the same as the knowledge of camera motion. Using this insight, we pose the problem of correlating visual stimuli with egomotion as the problem of predicting the camera transformation from the pairs of images that the agent receives while it moves. Intuitively, the task of predicting the camera transformation between two images should force the agent to learn features that are adept at identifying visual elements that are present in both images (i.e. visual correspondence). In the past, features such as SIFT, which were hand engineered for finding correspondences, were also found to be very useful for tasks such as object recognition. This suggests that egomotion based learning can also result in features that are useful for such tasks.

In order to test our hypothesis of feature learning using egomotion, we trained multilayer neural networks to predict the camera transformation between pairs of images. As a proof of concept, we first demonstrate the usefulness of our approach on the MNIST dataset. We show that features learnt using our method outperform previous approaches of unsupervised feature learning when class-label supervision is available only for a limited number of examples (section 3.4). Next, we evaluated the efficacy of our approach on real world imagery. For this purpose, we used image and odometry data recorded from a car moving through urban scenes, made available as part of the KITTI and the San Francisco (SF) city datasets. This data mimics the scenario of a robotic agent moving around in the world. The quality of features learnt from this data was evaluated on four tasks: (1) scene recognition on SUN (section 5.1), (2) visual odometry (section 5.4), (3) keypoint matching (section 5.3) and (4) object recognition on Imagenet (section 5.2). Our results show that for the same amount of training data, features learnt using egomotion as supervision compare favorably to features learnt using class-label as supervision. We also show that egomotion based pretraining outperforms a previous approach based on slow feature analysis for unsupervised learning from videos. To the best of our knowledge, this work provides the first effective demonstration of learning visual representations from non-visual access to egomotion information in a real world setting.

The rest of this paper is organized as follows: in section 2 we discuss the related work, in sections 3, 4 and 5 we present the method, the dataset details and the evaluation, and we conclude with the discussion in section 6.

Related Work

Past work in unsupervised learning has been dominated by approaches that pose feature learning as the problem of discovering compact and rich representations of images that are also sufficient to reconstruct the images. Another line of work has focused on learning features that are invariant to transformations either from video or from images. Other work performs feature learning by modeling spatial transformations using Boltzmann machines, but does not evaluate the quality of the learnt features.

Despite a lot of work in unsupervised learning, a method that works on complex real world imagery is yet to be developed. An alternative to unsupervised learning is to learn features using intrinsic reward signals that are freely available to a system (i.e. self-supervised learning). For instance, intrinsic reward signals available to a robot have been used for learning features that predict path traversability, and neural networks have been trained for driving vehicles directly from visual input.

In this work we propose to use non-visual access to egomotion information as a form of self-supervision for visual feature learning. Unlike previous work, we show that our method works on real world imagery. Closest to our method is the work on transforming auto-encoders, which used egomotion to reconstruct the transformed image from an input source image. That work was purely conceptual in nature and the quality of the learned features was not evaluated. In contrast, our method uses egomotion as supervision by predicting the transformation between two images using a siamese-like network model.

Our method can also be seen as an instance of feature learning from videos. Prior approaches perform feature learning from videos by imposing the constraint that temporally close frames should have similar feature representations (i.e. slow feature analysis) without accounting for either the camera motion or the motion of objects in the scene. In many settings the camera motion dominates the motion content of the video. Our key observation is that knowledge of camera motion (i.e. egomotion) is freely available to mobile agents and can be used as a powerful source of self-supervision.

A Simple Model of Motion-based Learning

We model the visual system of the agent with a Convolutional Neural Network (CNN). The agent optimizes its visual representations (i.e. updating the weights of the CNN) by minimizing the error between the egomotion information (i.e. camera transformation) obtained from its motor system and the egomotion predicted using its visual inputs only. Performing this task is equivalent to training a CNN with two streams (i.e. Siamese Style CNN or SCNN) that takes two images as inputs and predicts the egomotion that the agent underwent as it moved between the two spatial locations from which the two images were obtained. In order to learn useful visual representations, the agent continuously performs this task as it moves around in its environment.

In this work we use the pretraining-finetuning paradigm for evaluating the utility of learnt features. Pretraining is the process of optimizing the weights of a randomly initialized CNN for an auxiliary task that is not the same as the target task. Finetuning is the process of modifying the weights of a pretrained CNN for the given target task. Our experiments compare the utility of features learnt using egomotion based pretraining against class-label based and slow-feature based pretraining on multiple target tasks.

Each stream of the CNN independently computes features for one image. Both streams share the same architecture and the same set of weights and consequently perform the same set of operations for computing features. The individual streams are called Base-CNN (BCNN). Features from the two BCNNs are concatenated and passed downstream into another CNN called the Top-CNN (TCNN) (see figure 2). The TCNN is responsible for using the BCNN features to predict the camera transformation between the input pair of images. After pretraining, the TCNN is removed and a single BCNN is used as a standard CNN for feature computation for the target task.
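The weight sharing described above can be illustrated with a toy sketch. This is not the networks used in this work: the single linear layer standing in for each of BCNN and TCNN, the names `bcnn` and `scnn_forward`, and all sizes are assumptions made only to show that one set of base weights serves both streams before concatenation.

```python
import random

def relu(v):
    return [max(0.0, x) for x in v]

def linear(weights, v):
    # weights is a list of rows; returns the matrix-vector product.
    return [sum(w * x for w, x in zip(row, v)) for row in weights]

def bcnn(shared_weights, image):
    """Base stream (toy stand-in): the same weights process either image."""
    return relu(linear(shared_weights, image))

def scnn_forward(shared_weights, top_weights, image_a, image_b):
    """Two-stream pass: shared base stream, concatenation, then top network."""
    feats = bcnn(shared_weights, image_a) + bcnn(shared_weights, image_b)
    return linear(top_weights, feats)  # unnormalized scores over transformation bins

# Hypothetical sizes chosen only for illustration.
rng = random.Random(0)
n_pixels, n_feats, n_bins = 4, 3, 5
W_base = [[rng.uniform(-1, 1) for _ in range(n_pixels)] for _ in range(n_feats)]
W_top = [[rng.uniform(-1, 1) for _ in range(2 * n_feats)] for _ in range(n_bins)]
scores = scnn_forward(W_base, W_top, [0.1, 0.2, 0.3, 0.4], [0.4, 0.3, 0.2, 0.1])
```

After pretraining, only `bcnn` with its learned weights would be kept, which mirrors the removal of the TCNN.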

3.2 Shorthand for CNN architectures

The abbreviations Ck, Fk, P, D, Op represent a convolutional (C) layer with k filters, a fully-connected (F) layer with k units, pooling (P), dropout (D) and the output (Op) layers respectively. We used ReLU non-linearity after every convolutional/fully-connected layer, except for the output layer. The dropout layer was always used with a dropout rate of 0.5. The output layer was a fully connected layer with the number of units equal to the number of desired outputs. As an example of our notation, C96-P-F500-D refers to a network with 96 filters in the convolution layer followed by ReLU non-linearity, a pooling layer, a fully-connected layer with 500 units, ReLU non-linearity and a dropout layer. We used for training all our models.
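To make the notation concrete, the following is a small sketch that expands a shorthand string into an explicit layer list. The function name `parse_architecture` and the tuple representation are illustrative assumptions and not part of any training framework.

```python
import re

def parse_architecture(spec):
    """Expand a shorthand string such as 'C96-P-F500-D' into a list of layers."""
    layers = []
    for token in spec.split("-"):
        match = re.fullmatch(r"(C|F)(\d+)", token)
        if match:
            kind = "conv" if match.group(1) == "C" else "fc"
            layers.append((kind, int(match.group(2))))
            layers.append(("relu", None))    # ReLU follows every conv/fc layer
        elif token == "P":
            layers.append(("pool", None))
        elif token == "D":
            layers.append(("dropout", 0.5))  # dropout rate was always 0.5
        elif token == "Op":
            layers.append(("output", None))  # fully connected, no ReLU
        else:
            raise ValueError(f"unknown token: {token!r}")
    return layers

example = parse_architecture("C96-P-F500-D")
```

Applied to the example from the text, the sketch yields a convolution with 96 filters, ReLU, pooling, a fully-connected layer with 500 units, ReLU and dropout, in that order.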

3.3 Slow Feature Analysis (SFA) Baseline

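Slow Feature Analysis (SFA) is a method for feature learning based on the principle that useful features change slowly in time. We used the following contrastive loss formulation of SFA:

$$L(x_{t_{1}},x_{t_{2}},W)=\begin{cases}D(x_{t_{1}},x_{t_{2}}) & \text{if } |t_{1}-t_{2}|\leq T\\ \max(0,\,m-D(x_{t_{1}},x_{t_{2}})) & \text{if } |t_{1}-t_{2}|>T\end{cases}\qquad(1)$$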

where $L$ is the loss, $x_{t_{1}},x_{t_{2}}$ refer to feature representations of frames observed at times $t_{1},t_{2}$ respectively, $W$ are the parameters that specify the feature extraction process, $D$ is a measure of distance, $m$ is a predefined margin and $T$ is a predefined time threshold for determining whether the two frames are temporally close or not. In this work, $x_{t}$ are features computed using a CNN with weights $W$ and $D$ was chosen to be the L2 distance. SFA pretraining was performed using two stream architectures that took pairs of images as inputs and produced $x_{t_{1}},x_{t_{2}}$ as outputs from the two streams respectively.
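As a minimal sketch of this loss, assuming the usual contrastive form (the distance itself for temporally close frames, and a hinge on the margin otherwise), the computation for one pair of feature vectors can be written as follows. The argument names `margin` and `time_threshold` correspond to $m$ and $T$; the function names are hypothetical.

```python
import math

def l2_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def sfa_contrastive_loss(x_t1, x_t2, t1, t2, margin, time_threshold):
    """Contrastive SFA loss for one pair of feature vectors."""
    d = l2_distance(x_t1, x_t2)
    if abs(t1 - t2) <= time_threshold:
        return d                      # pull temporally close frames together
    return max(0.0, margin - d)       # push distant frames at least `margin` apart

close = sfa_contrastive_loss([0.0, 0.0], [3.0, 4.0], t1=0, t2=1, margin=10.0, time_threshold=7)
far = sfa_contrastive_loss([0.0, 0.0], [3.0, 4.0], t1=0, t2=20, margin=10.0, time_threshold=7)
```

In this sketch a temporally distant pair contributes nothing once its features are already further apart than the margin, which is what keeps the features from collapsing to a constant.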

3.4 Proof of Concept using MNIST

On MNIST, egomotion was emulated by generating synthetic data consisting of random transformations (translations and rotations) of digit images. From the training set of 60K images, digits were randomly sampled and then transformed using two different sets of random transformations to generate image pairs. CNNs were trained for predicting the transformations between these image pairs.

For egomotion based pretraining, relative translation between the digits was constrained to be an integer value in the range and relative rotation $\theta$ was constrained to lie within the range [-30°, 30°]. The prediction of the transformation was posed as a classification task with three separate soft-max losses (one each for translation along the X and Y axes and the rotation about the Z-axis). The SCNN was trained to minimize the sum of these three losses. Translations along X and Y were separately binned into seven uniformly spaced bins each. The rotations were binned into bins of size 3° each, resulting in a total of 20 bins (or classes). For SFA based pretraining, image pairs with relative translation in the range and relative rotation within [-3°, 3°] were considered to be temporally close to each other (see equation 1). A total of 5 million image pairs were used for both pretraining procedures.
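The rotation binning and the summed loss can be sketched as below. This is an illustration under assumptions: the treatment of the upper boundary (30° is folded into the last bin), the function names, and the use of raw scores as network outputs are choices made here and are not specified in the text.

```python
import math

def rotation_class(theta_deg, lo=-30.0, hi=30.0, bin_size=3.0):
    """Map a rotation in [lo, hi] degrees to one of (hi - lo) / bin_size classes."""
    if not lo <= theta_deg <= hi:
        raise ValueError("rotation outside the supported range")
    n_bins = int(round((hi - lo) / bin_size))      # 20 bins for the defaults
    idx = int((theta_deg - lo) // bin_size)
    return min(idx, n_bins - 1)                    # fold the upper edge into the last bin

def softmax_cross_entropy(scores, label):
    """Negative log-probability of `label` under a softmax over `scores`."""
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_z - scores[label]

def egomotion_loss(scores_x, scores_y, scores_rot, labels):
    """Sum of the three soft-max losses (X translation, Y translation, Z rotation)."""
    label_x, label_y, label_rot = labels
    return (softmax_cross_entropy(scores_x, label_x)
            + softmax_cross_entropy(scores_y, label_y)
            + softmax_cross_entropy(scores_rot, label_rot))

# Uninformative scores over 7, 7 and 20 classes.
chance_loss = egomotion_loss([0.0] * 7, [0.0] * 7, [0.0] * 20, (3, 3, rotation_class(0.0)))
```

With uninformative scores the sketch returns $2\ln 7+\ln 20$, the chance-level value for seven, seven and twenty classes.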

3.4.2 Network Architectures

We experimented with multiple BCNN architectures and chose the optimal architecture for each pretraining method separately. For egomotion based pretraining, the two BCNN streams were concatenated using the TCNN: F1000-D-Op. Pretraining was performed for 40K iterations (i.e. 5M examples) using an initial learning rate of 0.01 which was reduced by a factor of 2 after every 10K iterations.
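The step schedule described above can be written as a one-line function. The sketch below is only an illustration; the function name is hypothetical, and the batch size is not stated in the text but is inferred here from 5M examples over 40K iterations, assuming one batch per iteration.

```python
def step_learning_rate(iteration, base_lr=0.01, factor=2.0, step=10_000):
    """Learning rate after dividing base_lr by `factor` once every `step` iterations."""
    return base_lr / factor ** (iteration // step)

# Inferred, not stated in the text: examples per iteration during pretraining.
implied_batch_size = 5_000_000 // 40_000

schedule = [step_learning_rate(i) for i in (0, 10_000, 20_000, 30_000)]
```

Under these assumptions the learning rate takes the values 0.01, 0.005, 0.0025 and 0.00125 over the four 10K-iteration stages of pretraining.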

The following architecture was used for finetuning: BCNN-F500-D-Op. In order to evaluate the quality of BCNN features, the learning rate of all layers in the BCNN was set to 0 during finetuning for digit classification. Finetuning was performed for 4K iterations (which is equivalent to training for 50 epochs for the 10K labelled training examples) with a constant learning rate of 0.01.

3.4.3 Results

The BCNN features were evaluated by computing the error rates on the task of digit classification using 100, 300, 1K and 10K class-labelled examples for training. These sets were constructed by randomly sampling digits from the standard training set of 60K digits. For this part of the experiment, the original digit images were used (i.e. without any transformations or data augmentation). The standard test set of 10K digits was used for evaluation and error rates averaged across 3 runs are reported in table 1.

The BCNN architecture C96-P-C256-P was found to be optimal for egomotion and SFA based pretraining and also for training from scratch (i.e. random weight initialization). Results for other architectures are provided in the supplementary material. For SFA based pretraining, we experimented with multiple values of the margin $m$ and found that $m=10,100$ led to the best performance. Our method outperforms convolutional deep belief networks, a previous approach based on learning features invariant to transformations and SFA based pretraining.

Learning Visual Features From Egomotion in Natural Environments

We used two main sources of real world data for feature learning: the KITTI and SF datasets, which were collected using cameras and odometry sensors mounted on a car driving through urban scenes. Details about the data, the experimental procedure, the network architectures and the results are provided in sections 4.1, 4.2, 4.3 and 5 respectively.

The KITTI dataset provided odometry and image data recorded during 11 short trips of variable length made by a car moving through urban landscapes. The total number of frames in the entire dataset was 23,201. Of the 11 sequences, 9 were used for training and 2 for validation. The total number of images in the training set was 20,501.

The odometry data was used to compute the camera transformation between pairs of images recorded from the car. The direction in which the camera pointed was assumed to be the Z axis and the image plane was taken to be the XY plane. The X-axis and Y-axis refer to the horizontal and vertical directions in the image plane. As significant camera transformations in the KITTI data were either due to translations along the Z/X axis or rotation about the Y axis, only these three dimensions were used to express the camera transformation. The rotation was represented as the Euler angle about the Y-axis. The task of predicting the transformation between pairs of images was posed as a classification problem. The three dimensions of camera transformation were individually binned into 20 uniformly spaced bins each. The training image pairs were selected from frames that were at most $\pm 7$ frames apart to ensure that images in any given pair would have a reasonable overlap. For SFA based pretraining, pairs of frames that were separated by at most $\pm 7$ frames were considered to be temporally close to each other.
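A sketch of how training pairs and class labels could be produced from one sequence is given below. Several details are assumptions: the two frames are taken to be distinct, the bin limits `lo` and `hi` are hypothetical because the text does not state the range covered by the 20 bins, and out-of-range values are clipped.

```python
import random

def sample_frame_pair(n_frames, max_gap=7, rng=random):
    """Pick two distinct frame indices from one sequence, at most max_gap apart."""
    i = rng.randrange(n_frames)
    lo, hi = max(0, i - max_gap), min(n_frames - 1, i + max_gap)
    j = rng.choice([k for k in range(lo, hi + 1) if k != i])
    return i, j

def uniform_bin(value, lo, hi, n_bins=20):
    """Map a value to one of n_bins equal-width classes over [lo, hi] (clipped)."""
    clipped = min(max(value, lo), hi)
    idx = int((clipped - lo) / (hi - lo) * n_bins)
    return min(idx, n_bins - 1)

pair = sample_frame_pair(100, rng=random.Random(0))
# Hypothetical limits: a quantity assumed to lie in [-1, 1].
label = uniform_bin(0.0, -1.0, 1.0)
```

Each of the three transformation dimensions (Z translation, X translation, Y rotation) would receive its own label from `uniform_bin`, giving three 20-way classification targets per pair.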

The SCNN was trained to predict camera transformation from pairs of $227\times 227$ pixel sized image regions extracted from images of overall size $370\times 1226$ pixels. For each image pair, the coordinates for cropping image regions were randomly chosen. Figure 1 illustrates typical image crops.
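The random cropping step can be sketched as follows. The function name is hypothetical, and the text does not say whether the two images of a pair share the same crop coordinates, so the sketch only draws one box and leaves that choice to the caller.

```python
import random

def random_crop_box(image_h=370, image_w=1226, crop=227, rng=random):
    """Return (top, left, bottom, right) of a random crop that fits in the image."""
    top = rng.randint(0, image_h - crop)     # randint is inclusive at both ends
    left = rng.randint(0, image_w - crop)
    return top, left, top + crop, left + crop

# Number of distinct crop positions available in one KITTI-sized image.
n_positions = (370 - 227 + 1) * (1226 - 227 + 1)

box = random_crop_box(rng=random.Random(0))
```

Because the images are much wider than they are tall, most of the variation between crops comes from the horizontal offset.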

4.2 SF Dataset

The SF dataset provides the camera transformation between $\approx$ 136K pairs of images (constructed from a set of 17,357 unique images). This dataset was constructed using Google StreetView. $\approx$ 130K image pairs were used for training and $\approx$ 6K pairs for validation.

Just like KITTI, the task of predicting the camera transformation was posed as a classification problem. Unlike KITTI, significant camera transformation was found along all six dimensions of transformation (i.e. the 3 Euler angles and the 3 translations). Since it is unreasonable to expect that visual features can be used to infer big camera transformations, rotations between [-30°, 30°] were binned into 10 uniformly spaced bins and two extra bins were used for rotations larger than 30° and smaller than -30° respectively. The three translations were individually binned into 10 uniformly spaced bins each. Images were resized to a size of $360\times 480$ and image regions of size $227\times 227$ were used for training the SCNN.
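The rotation binning with two overflow classes can be sketched as below. The ordering of the 12 class indices (underflow first, overflow last) and the function name are assumptions made for illustration.

```python
def sf_rotation_class(theta_deg, limit=30.0, n_inner=10):
    """Map a rotation to one of n_inner + 2 classes.

    Class 0 collects rotations below -limit, classes 1..n_inner cover
    [-limit, limit] uniformly, and class n_inner + 1 collects rotations above limit.
    """
    if theta_deg < -limit:
        return 0
    if theta_deg > limit:
        return n_inner + 1
    width = 2 * limit / n_inner                    # 6 degrees for the defaults
    idx = int((theta_deg + limit) // width)
    return 1 + min(idx, n_inner - 1)               # fold +limit into the last inner bin

classes = [sf_rotation_class(t) for t in (-45.0, -30.0, 0.0, 30.0, 31.0)]
```

The overflow classes let the network signal that a rotation is large without having to estimate its exact magnitude.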

4.3 Network Architecture

The BCNN closely followed the architecture of the first five AlexNet layers: C96-P-C256-P-C384-C384-C256-P. The TCNN architecture was: C256-C128-F500-D-Op. The convolutional filters in the TCNN were of spatial size $3\times 3$. The networks were trained for 60K iterations with a batch size of 128. The initial learning rate was set to 0.001 and was reduced by a factor of two after every 20K iterations.

We term the networks pretrained using egomotion on the KITTI and SF datasets KITTI-Net and SF-Net respectively. The net pretrained on KITTI with SFA is called KITTI-SFA-Net. Figure 3 shows the layer-1 filters of KITTI-Net and SF-Net. A large majority of layer-1 filters are color detectors, while some of them are edge detectors. As color is a useful cue for determining correspondences between nearby frames of a video sequence, the learning of color detectors as layer-1 filters is not surprising. The fraction of filters that detect edges is higher for the SF-Net. This is not surprising either, because a higher fraction of images in the SF dataset contain structured objects like buildings and cars.

Evaluating Motion-based Learning

For evaluating the merits of the proposed approach, features learned using egomotion based supervision were compared against features learned using class-label and SFA based supervision on the challenging tasks of scene recognition, intra-class keypoint matching, visual odometry and object recognition. The ultimate goal of feature learning is to find features that can generalize from only a few supervised examples on a new task. Therefore it makes sense to evaluate the quality of features when only a few labelled examples for the target task are provided. Consequently, the scene and object recognition experiments were performed in the setting where only 1-20 labelled examples per class were available for finetuning.

The KITTI-Net and SF-Net (examples of models trained using egomotion based supervision) were trained using only $\approx$ 20K unique images. To make a fair comparison with class-label based supervision, a model with AlexNet architecture was trained using only 20K images taken from the training set of the ILSVRC12 challenge (i.e. 20 examples per class). This model is referred to as AlexNet-20K. In addition, some experiments presented in this work also make comparisons with AlexNet models trained with 100K and 1M images, which are named AlexNet-100K and AlexNet-1M respectively.

The SUN dataset consisting of 397 indoor/outdoor scene categories was used for evaluating scene recognition performance. This dataset provides 10 standard splits of 5 and 20 training images per class and a standard test set of 50 images per class. Because running all 10 splits was too time consuming, we evaluated the performance using only 3 train/test splits.

For evaluating the utility of CNN features produced by different layers, separate linear (SoftMax) classifiers were trained on features produced by individual CNN layers (i.e. BCNN layers of KITTI-Net, KITTI-SFA-Net and SF-Net). Table 2 reports recognition accuracy (averaged over 3 train/test splits) for the various networks considered in this study. KITTI-Net outperforms SF-Net and is comparable to AlexNet-20K. This indicates that given a fixed budget of pretraining images, egomotion based supervision learns features that are almost as good as the features learnt using class-based supervision on the task of scene recognition. The performance of features computed by layers 1-3 (abbreviated as L1, L2, L3 in table 2) of the KITTI-SFA-Net and KITTI-Net is comparable, whereas layer 4, 5 features of KITTI-Net significantly outperform layer 4, 5 features of KITTI-SFA-Net. This indicates that egomotion based pretraining results in learning of higher-level features, while SFA based pretraining results in learning of lower-level features only.

The KITTI-Net outperforms GIST, which was specifically developed for scene classification, but is outperformed by Dense SIFT with the spatial pyramid matching (SPM) kernel. The KITTI-Net was trained using limited visual data ($\approx$ 20K frames) containing visual imagery of limited diversity. The KITTI data mainly contains images of roads, buildings, cars, few pedestrians, trees and some vegetation. It is in fact surprising that a network trained on data with such little diversity is competitive on classifying indoor and outdoor scenes with the AlexNet-20K that was trained on a much more diverse set of images. We believe that with more diverse training data for egomotion based learning, the performance of learnt features will be better than currently reported numbers.

The KITTI-Net outperformed the SF-Net except for the performance of layer 1 (L1). As it was possible to extract a larger number of image region pairs from the KITTI dataset as compared to the SF dataset (see section 4.1, 4.2), the result that KITTI-Net outperforms SF-Net is not surprising. Because KITTI-Net was found to be superior to the SF-Net in this experiment, the KITTI-Net was used for all other experiments described in this paper.

5.2 Object Recognition

If egomotion based pretraining learns useful features for object recognition, then a net initialized with KITTI-Net weights should outperform a net initialized with random weights on the task of object recognition. For testing this, we trained CNNs using 1, 5, 10 and 20 images per class from the ILSVRC-2012 challenge. As this dataset contains 1000 classes, the total number of training examples available for these networks was 1K, 5K, 10K and 20K respectively. All layers of KITTI-Net, KITTI-SFA-Net and AlexNet-Scratch (i.e. CNN with random weight initialization) were finetuned for image classification.

The results of the experiment presented in table 3 show that egomotion based supervision (KITTI-Net) clearly outperforms SFA based supervision (KITTI-SFA-Net) and AlexNet-Scratch. As expected, the improvement offered by motion-based pretraining is larger when fewer examples are provided for the target task. These results show that egomotion based pretraining learns features useful for object recognition.

5.3 Intra-Class Keypoint Matching

Identifying the same keypoint of an object across different instances of the same object class is an important visual task. Visual features learned using egomotion, SFA and class-label based supervision were evaluated for this task using keypoint annotations on the PASCAL dataset.

Keypoint matching was computed in the following way: First, ground-truth object bounding boxes (GT-BBOX) from the PASCAL-VOC2012 dataset were extracted and re-sized (while preserving the aspect ratio) to ensure that the smaller side of the boxes was of length 227 pixels. Next, feature maps from layers 2-5 of various CNNs were computed for every GT-BBOX. The keypoint matching score was computed between all pairs of GT-BBOX belonging to the same object class. For a given pair of GT-BBOX, the features associated with keypoints in the first image were used to predict the location of the same keypoints in the second image. The normalized pixel distance between the actual and predicted keypoint locations was taken as the error in keypoint matching. More details about this procedure are provided in the supplementary material.

It is natural to expect that the accuracy of keypoint matching would depend on the camera transformation between the two viewpoints of the object (i.e. viewpoint distance). In order to make a holistic evaluation of the utility of features learnt by different pretraining methods on this task, matching error was computed as a function of viewpoint distance. Figure 4 reports the matching error averaged across all keypoints, all pairs of GT-BBOX and all classes using features extracted from layers conv-3 and conv-4.

KITTI-Net trained only with 20K unique frames was superior to AlexNet-20K and AlexNet-100K and inferior only to AlexNet-1M. A net with AlexNet architecture initialized with random weights (AlexNet-Rand), surprisingly performed better than AlexNet-20K. One possible explanation for this observation is that with only 20K examples, features learnt by AlexNet-20K only capture coarse global appearance of objects and are therefore poor at keypoint matching. SIFT has been hand engineered for finding correspondences across images and performs as well as the best AlexNet-1M features for this task (i.e. conv-4 features). KITTI-Net also significantly outperforms KITTI-SFA-Net. These results indicate that features learnt by egomotion based pretraining are superior to SFA and class-label based pretraining for the task of keypoint matching.

5.4 Visual Odometry

Visual odometry is the task of estimating the camera transformation between image pairs. All layers of KITTI-Net and AlexNet-1M were finetuned for 25K iterations using the training set of the SF dataset on the task of visual odometry (see section 4.2 for task description). The performance of various CNNs was evaluated on the validation set of the SF dataset and the results are reported in table 4.

Performance of KITTI-Net was either superior or comparable to AlexNet-1M on this task. As the evaluation was made on the SF dataset itself, it was not surprising that on some metrics SF-Net outperformed KITTI-Net. The results of this experiment indicate that egomotion based feature learning is superior to class-label based feature learning on the task of visual odometry.

Discussion

In this work, we have shown that egomotion is a useful source of intrinsic supervision for visual feature learning in mobile agents. In contrast to class labels, knowledge of egomotion is “freely” available. On MNIST, egomotion-based feature learning outperforms many previous unsupervised methods of feature learning. Given the same budget of pretraining images, on the task of scene recognition, egomotion-based learning performs almost as well as class-label-based learning. Further, egomotion based features outperform features learnt by a CNN trained using class-label supervision on two orders of magnitude more data (AlexNet-1M) on the task of visual odometry and one order of magnitude more data on the task of intra-class keypoint matching. In addition to demonstrating the utility of egomotion based supervision, these results also suggest that features learnt by class-label based supervision are not optimal for all visual tasks. This means that future work should look at what kinds of pretraining are useful for what tasks.

One potential criticism of our work is that we have trained and evaluated high capacity deep models on relatively little data (e.g. only 20K unique images available on the KITTI dataset). In theory, we could have learnt better features by downsizing the networks. For example, in our experiments with MNIST we found that pretraining a 2-layer network instead of a 3-layer network results in better performance (table 1). In this work, we have made a conscious choice of using standard deep models because the main goal of this work was not to explore novel feature extraction architectures but to investigate the value of egomotion for learning visual representations on architectures known to perform well on practical applications. Future research focused on exploring architectures that are better suited for egomotion based learning can only make a stronger case for this line of work. While egomotion is freely available to mobile agents, there are currently no publicly available egomotion datasets as large as Imagenet. Consequently, we were unable to evaluate the utility of motion-based supervision across the full spectrum of training set sizes.

In this work, we chose to first pretrain our models using a base task (i.e. egomotion) and then finetune these models for target tasks. An equally interesting setting is that of online learning where the agent has continuous access to intrinsic supervision (such as egomotion) and occasional explicit access to an extrinsic teacher signal (such as the class labels). We believe that such a training procedure is likely to result in learning of better features. Our intuition behind this is that seeing different views of the same instance of an object, say a car, may not be sufficient to learn that different instances of the car class should be grouped together. The occasional extrinsic signal about object labels may prove useful for the agent to learn such concepts. Also, the current work makes use of passively collected egomotion data and it would be interesting to investigate if it is possible to learn better visual representations if the agent can actively decide on how to explore its environment (i.e. active learning).

Acknowledgements

This work was supported in part by ONR MURI-N00014-14-1-0671. Pulkit Agrawal was partially supported by Fulbright Science and Technology Fellowship. João Carreira was supported by the Portuguese Science Foundation, FCT, under grant SFRH/BPD/84194/2012. We gratefully acknowledge NVIDIA corporation for the donation of Tesla GPUs for this research.

Appendix

Appendix A Keypoint Matching Score

Consider images of two instances of the same object class (for example airplane images as shown in first row of figure 5) for which keypoint matching score needs to be computed.

The images are pre-processed in the following way:

Crop the groundtruth bounding box from the image.

Pad the images by 30 pixels along each dimension.

Resize each image so that the smallest side is 227 pixels. The aspect ratio of the image is preserved.

Assume that the $l^{th}$ layer of the CNN is used for feature computation. The feature map produced by the $l^{th}$ layer is of dimensionality $I\times J\times M$, where $(I,J)$ are the spatial dimensions and $M$ is the number of filters in the $l^{th}$ layer. Thus, the $l^{th}$ layer produces an $M$ dimensional feature vector for each of the $I\times J$ grid positions in the feature map.

The coordinates of the keypoints are provided in the image coordinate system. For the keypoints in the first image, we first determine their grid position in the $I\times J$ feature map. Each grid position has an associated receptive field in the image. Each keypoint is assigned to the grid position whose receptive field center is closest to the keypoint. This means that each keypoint is assigned one location in the feature map.
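The assignment of a keypoint to a grid position can be sketched as follows, under the assumption that receptive field centers lie on a regular lattice at `offset + index * stride` pixels along each axis. The function names and the example values of `stride` and `offset` are hypothetical, since the actual values depend on the layer.

```python
def nearest_grid_position(keypoint_xy, grid_h, grid_w, stride, offset):
    """Return the (i, j) grid cell whose receptive field center is closest."""
    x, y = keypoint_xy
    j = round((x - offset) / stride)     # column index from the x coordinate
    i = round((y - offset) / stride)     # row index from the y coordinate
    i = min(max(i, 0), grid_h - 1)       # clamp to the feature map
    j = min(max(j, 0), grid_w - 1)
    return i, j

def grid_to_image(i, j, stride, offset):
    """Image coordinates (x, y) of the receptive field center of cell (i, j)."""
    return offset + j * stride, offset + i * stride

# Hypothetical 13x13 feature map with stride 16 and offset 8.
cell = nearest_grid_position((100, 50), grid_h=13, grid_w=13, stride=16, offset=8)
center = grid_to_image(*cell, stride=16, offset=8)
```

The inverse mapping `grid_to_image` is what converts a matched grid position back to image coordinates before the matching error is measured.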

Let the $M$ dimensional feature vector associated with the $k^{th}$ keypoint in the first image be $F_{1}^{k}$. Let the $M$ dimensional feature vector at grid location $C_{ij}$ for the second image be $F_{2}(C_{ij})$. The location of the matching keypoint in the second image is determined by solving:

$$C_{*}=\arg\min_{C_{ij}}\;\|F_{1}^{k}-F_{2}(C_{ij})\|^{2}$$

$C_{*}$ is transformed into the image coordinate system by computing the center of the receptive field (in the image) associated with this grid position. Let the transformed coordinates be $C_{*}^{im}$ and the coordinates of the corresponding keypoint (in the second image) be $C_{gt}^{im}$. The matching error for the $k^{th}$ keypoint ($E_{k}$) is defined as:

$$E_{k}=\frac{\|C_{*}^{im}-C_{gt}^{im}\|_{2}}{L^{2}_{D}}$$

where $L^{2}_{D}$ is the length of the diagonal (in pixels) of the second image. As different images have different sizes, dividing by $L^{2}_{D}$ normalizes for the difference in sizes. The matching error for a pair of images of instances belonging to the same class is calculated as:

$$E_{inst}=\frac{1}{\#keypoints}\sum_{k=1}^{\#keypoints}E_{k}$$

The average matching error across all pairs of instances of the same class is given by $E_{class}$:

$$E_{class}=\frac{1}{\#pairs}\sum_{inst=1}^{\#pairs}E_{inst}$$

where, $\#pairs$ is the number of pairs of object instances belonging to the same class. In Figure 4 of the main paper we report the matching error averaged across all the 20 classes.
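The matching procedure of this appendix can be sketched end to end as below. This is an illustration under assumptions: the match is taken to be the grid position with the smallest squared Euclidean distance in feature space, the feature map is a nested list indexed as `feature_map[i][j]`, and all function names are hypothetical.

```python
import math

def match_keypoint(feature_k, feature_map_2):
    """Return the grid position (i, j) whose feature is closest to feature_k."""
    best, best_pos = float("inf"), None
    for i, row in enumerate(feature_map_2):
        for j, feat in enumerate(row):
            d = sum((a - b) ** 2 for a, b in zip(feature_k, feat))
            if d < best:
                best, best_pos = d, (i, j)
    return best_pos

def matching_error(pred_xy, gt_xy, image_h, image_w):
    """Pixel distance between prediction and ground truth, divided by the image diagonal."""
    return math.dist(pred_xy, gt_xy) / math.hypot(image_h, image_w)

def mean(values):
    values = list(values)
    return sum(values) / len(values)

# Toy 2x2 feature map with 2-dimensional features.
fmap = [[[0.0, 0.0], [1.0, 0.0]],
        [[0.0, 1.0], [1.0, 1.0]]]
best_cell = match_keypoint([0.9, 0.1], fmap)
err = matching_error((0, 0), (3, 4), image_h=30, image_w=40)
```

Averaging `matching_error` over the keypoints of a pair with `mean`, and then over all pairs of a class, gives the per-pair and per-class errors described above.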

A.2 Keypoint Matching using SIFT

SIFT features are extracted using a square window of size 72 pixels and a stride of 8 pixels using the open source code from . The stride of 8 pixels was chosen to have a fair comparison with the CNN features. The CNN features were computed with a stride of 8 for layer conv-2 and stride of 16 for layers conv-3, conv-4 and conv-5 respectively. The matching error using SIFT was calculated in the same way as for the CNNs.

A.3 Effect of Viewpoint on Keypoint Matching

Intuitively, matching instances of the same object that are related by a large transformation (i.e. viewpoint distance) should be harder than matching instances with a small viewpoint distance. Therefore, in order to obtain a holistic understanding of the accuracy of features in performing keypoint matching it is instructive to study the accuracy of matching as a function of viewpoint distance.

Prior work aligned instances of the same class (from PASCAL-VOC-2012) in a global coordinate system and provides a rotation matrix ($R$) for each instance in the class. To measure the viewpoint distance, we computed the Riemannian metric on the manifold of rotation matrices, $||\log(R_{i}R_{j}^{T})||_{F}$, where $\log$ is the matrix logarithm, $||.||_{F}$ is the Frobenius norm of the matrix and $R_{i},R_{j}$ are the rotation matrices for the $i^{th},j^{th}$ instances respectively. We binned the distances into 10 uniform bins (of 18° each). In Figure 4 of the main paper we show the mean error in keypoint matching in each of these viewpoint bins. The matching error in the $k^{th}$ bin is calculated by considering all the instances with a viewpoint distance $\leq k\times 18^{\circ}$, for $k\in\{1,\dots,10\}$. As expected, we find that keypoint matching is worse for larger viewpoint distances.
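For 3D rotation matrices this metric has a closed form: if $\theta$ is the angle of the relative rotation $R_{i}R_{j}^{T}$, then $||\log(R_{i}R_{j}^{T})||_{F}=\sqrt{2}\,|\theta|$, and $\theta$ can be read off the trace. The sketch below uses that identity. Binning on the angle in degrees is an assumption (the text does not say whether the angle or the scaled norm is binned; the angle is used here because 10 bins of 18° cover its 0° to 180° range), and `viewpoint_bin` returns the index of the 18° interval rather than the cumulative set described in the text. The helper `rot_z` is hypothetical and only builds test rotations.

```python
import math

def matmul_transpose(a, b):
    """Return a @ b^T for 3x3 matrices given as nested lists."""
    return [[sum(a[i][k] * b[j][k] for k in range(3)) for j in range(3)]
            for i in range(3)]

def rotation_angle(r_i, r_j):
    """Angle (radians) of the relative rotation R_i R_j^T, from its trace."""
    rel = matmul_transpose(r_i, r_j)
    trace = rel[0][0] + rel[1][1] + rel[2][2]
    cos_theta = min(1.0, max(-1.0, (trace - 1.0) / 2.0))  # guard against round-off
    return math.acos(cos_theta)

def viewpoint_distance(r_i, r_j):
    """Frobenius norm of the matrix logarithm of R_i R_j^T."""
    return math.sqrt(2.0) * rotation_angle(r_i, r_j)

def viewpoint_bin(r_i, r_j, bin_deg=18.0, n_bins=10):
    """Index of the 18-degree interval containing the relative rotation angle."""
    angle_deg = math.degrees(rotation_angle(r_i, r_j))
    return min(int(angle_deg // bin_deg), n_bins - 1)

def rot_z(deg):
    """Rotation about the Z axis, used here only to build an example."""
    c, s = math.cos(math.radians(deg)), math.sin(math.radians(deg))
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]

identity = rot_z(0.0)
dist_45 = viewpoint_distance(identity, rot_z(45.0))
bin_45 = viewpoint_bin(identity, rot_z(45.0))
```

Using the trace avoids computing a matrix logarithm explicitly, which keeps the sketch free of external dependencies.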