Weakly Supervised Object Localization Using Things and Stuff Transfer

Miaojing Shi, Holger Caesar, Vittorio Ferrari

Introduction

The goal of object class detection is to place a tight bounding box on every instance of an object class. Given an input image, recent object detectors first extract object proposals and then score them with a classifier to determine their probabilities of containing an instance of the class. Manually annotated bounding boxes are typically required for training (full supervision).

Annotating bounding boxes is tedious and time-consuming. To reduce the annotation cost, many previous works learn the detector in a weakly supervised setting, i.e. given a set of images known to contain instances of a certain object class, but without their locations. This weakly supervised object localization (WSOL) setting bypasses the need for bounding box annotation and substantially reduces annotation time.

Despite the low annotation cost, the performance of WSOL is considerably lower than that of full supervision. To improve WSOL, various advanced cues can be added, e.g. objectness, which estimates how likely a proposal is to contain an object; co-occurrence among multiple classes in the same training images; object size estimates based on an auxiliary dataset with size annotations; and appearance models transferred from object classes with bounding box annotations to new object classes.

There are two types of classes that can be transferred from a source set with manually annotated locations: things (objects) and stuff (materials and backgrounds). Things have a specific spatial extent and shape (e.g. helicopter, cow, car), while stuff does not (e.g. sky, grass, road). Current transfer works mostly focus on transferring appearance models among similar thing classes (things-to-things). In contrast, using stuff to find things is largely unexplored, particularly in the WSOL setting (stuff-to-things).

In this paper, we transfer a fully supervised segmentation model from the source set to help WSOL on the target set. We introduce several schemes to conduct the transfer of both things and stuff knowledge, guided by the similarity between classes. Particularly, we transfer the co-occurrence knowledge between thing and stuff classes in the source via a second order scheme to thing classes in the target. We propagate the transferred knowledge from the pixel level to the object proposal level and inject it as a new cue into a multiple instance learning framework (MIL).

In extensive experiments, we show that our method: (1) improves over a standard MIL baseline on three datasets: ILSVRC, COCO, and PASCAL VOC 2007; (2) outperforms the things-to-things transfer method and the state-of-the-art WSOL methods on VOC 2007; (3) outperforms another things-to-things transfer method (LSDA) on ILSVRC.

Related Work

Weakly supervised object localization. In WSOL the training images are known to contain instances of a certain object class but their locations are unknown. The task is both to localize the objects in the training images and to learn a detector for the class.

Due to the use of strong CNN features, recent works on WSOL have shown remarkable progress. Researchers have also incorporated various advanced cues into the WSOL process, e.g. objectness, object size, co-occurrence among classes, and the transfer of appearance models from source thing classes to help localize similar target thing classes. This paper introduces a new cue called things and stuff transfer (TST), which learns a semantic segmentation model from the source on both things and stuff annotations and transfers its knowledge to help localize the target thing class.

Transfer learning. The goal of transfer learning is to improve the learning of a target task by leveraging knowledge from a source task. It is intensively studied in image classification, segmentation and object detection. Many methods use the parameters of the source classifiers as priors for the target model. Other works transfer knowledge through an intermediate attribute layer, which captures visual qualities shared by many object classes (e.g. “striped”, “yellow”). A third family of works transfers object parts between classes, e.g. wheels between cars and bicycles.

In this work we are interested in the task where we have location annotations in the source and transfer them to help learn the classes in the target. We categorize the transfer into two types: 1) Things-to-things. Guillaumin et al. transferred spatial location, appearance, and context information from the source thing classes to localize the things in the target; Shi et al. and Rochan et al. follow a similar spirit; while Kuettel et al. instead transferred segmentation masks. 2) Stuff-to-things. Heitz et al. proposed a context model to utilize stuff regions to find things, in a fully supervised setting for the target objects; Lee et al. also made use of stuff annotations in the source to discover things in the target, in an unsupervised setting.

Our work offers several new elements over these: (1) we encode the transfer as a combination of both things-to-things and stuff-to-things; (2) we propose a model to propagate the transferred knowledge from the pixel level to the proposal level; (3) we introduce a second order transfer, i.e. stuff-to-things-to-things.

Overview of our method

In this section we define the notations and introduce our method on a high level, providing some details for each part.

Notations. We have a source set $\mathcal{A}$ and a target set $\mathcal{B}$. Every image in $\mathcal{A}$ is annotated pixelwise for both stuff and things, whereas images in $\mathcal{B}$ have only image level labels. We denote by $\mathcal{A}^{T}$ the set of thing classes in $\mathcal{A}$, and by $a^{t}$ an individual thing class; analogously, we have $\mathcal{A}^{S}$ and $a^{s}$ for stuff classes in $\mathcal{A}$, and $\mathcal{B}^{T}$ and $b^{t}$ for thing classes in $\mathcal{B}$. Note that there are no stuff classes in $\mathcal{B}$, as datasets labeled only by thing classes are more common in practice (e.g. PASCAL VOC, ImageNet, COCO).

Method overview. Our goal is to conduct WSOL on $\mathcal{B}$ , where the training images are known to contain instances of a certain object class but their locations are unknown. A standard WSOL approach, e.g. MIL, treats images as bags of object proposals (instances). The task is both to localize the objects (select the best proposal) in the training images and to learn a detector for the target class. To improve MIL, we transfer knowledge from $\mathcal{A}$ to $\mathcal{B}$ , incorporating new cues into it.

Fig. 1 illustrates our transfer. We first acquire three types of knowledge in the source $\mathcal{A}$ (Sec. 4): 1) a semantic segmentation model (Sec. 4.1), 2) the thing class similarities between $\mathcal{A}$ and $\mathcal{B}$ (Sec. 4.2) and 3) the co-occurrence frequencies between thing and stuff classes in $\mathcal{A}$ (Sec. 4.3). Afterwards, we transfer the knowledge to $\mathcal{B}$ (Sec. 5). Given an image in $\mathcal{B}$, we first use the segmentation model to generate its thing ($T$) and stuff ($S$) maps (Sec. 5.1). $T$ contains one score map ($R$) and one label map ($L$), as does $S$. The segmentation model transfers knowledge generically to every image in $\mathcal{B}$. Building upon its result, we propose three proposal scoring schemes: label weighting (LW, Sec. 5.2), contrast weighting (CW, Sec. 5.3), and area weighting (AW, Sec. 5.4). These link the pixel level segmentation to the proposal level score. In each scheme, two scoring functions are proposed, one on the thing map and one on the stuff map. We combine the three schemes to provide an even better proposal score to help MIL (Sec. 5.5).

Scoring schemes. LW transfers the similarity and co-occurrence relations as weighting functions to the thing and stuff label maps, respectively. Since we do not have stuff annotations on $\mathcal{B}$, we conduct the co-occurrence knowledge transfer as a second-order transfer by finding the target class’ most similar thing class in $\mathcal{A}$. We believe that the target class should appear against a background similar to that of its most similar class. For example, in Fig. 1 the most similar class in $\mathcal{A}$ to the target class bear is cat, so LW up-weights the cat score on $T$ and the score of its frequently co-occurring stuff class tree on $S$.

LW favours small proposals with high weighted scores. To counter this effect, we introduce the CW score. It measures the dissimilarity of a proposal to its surroundings, measured on the thing/stuff score maps (Fig. 3). CW up-weights proposals that are more likely to contain an entire object in $T$ or an entire stuff region in $S$ .

Finally, the AW score encourages proposals to incorporate as much as possible of the connected components of pixels on a target’s $K$ most similar classes in $\mathcal{A}$ (e.g. Fig. 1: the cat area in the $T$ map). While CW favors objects in general, AW focuses on objects of the target class in particular.

Acquiring knowledge from the source $\mathcal{A}$

We employ the popular fully convolutional network (FCN-16s) to train an end-to-end semantic segmentation model on both thing and stuff classes of $\mathcal{A}$. Given a new image, the FCN model is able to predict a likelihood distribution over all classes at each pixel. Notice that the FCN model is first pretrained for image classification on ILSVRC 2012, then fine-tuned for semantic segmentation on $\mathcal{A}$. While it is possible that some of the target classes are seen during pretraining, only image-level labels are used. Therefore the weakly supervised setting still holds for the target classes.

4.2 Similarity relations

We compute the thing class similarities $V(a^{t},b^{t})$ between any thing class pair $(a^{t},b^{t})$. We consider two similarity measures to compute $V$: appearance similarity (APP) and semantic similarity (SEM).

4.3 Co-occurrence relation

We denote by $U(a^{s},a^{t})$ the co-occurrence frequency of any stuff and thing class pair $(a^{s},a^{t})$ in $\mathcal{A}$ . This frequency is computed and normalized over all the images in $\mathcal{A}$ .
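As an illustration of how such a co-occurrence table could be built, the following minimal sketch counts, for each stuff and thing class pair, the images in which both appear. The text does not specify the normalization; dividing by the number of images containing the thing class is an assumption made here, and the function name `cooccurrence` and the toy labels are hypothetical.

```python
from collections import defaultdict


def cooccurrence(images, stuff_classes, thing_classes):
    """Estimate U(a_s, a_t) from image-level class presence.

    images: list of sets, each holding the class names present in one image.
    Returns a dict mapping (stuff, thing) -> frequency in [0, 1].
    """
    counts = defaultdict(float)
    for labels in images:
        for s in labels & stuff_classes:
            for t in labels & thing_classes:
                counts[(s, t)] += 1.0
    # Assumed normalization: divide by the number of images containing the thing class.
    thing_totals = {t: sum(1 for labels in images if t in labels)
                    for t in thing_classes}
    return {(s, t): c / thing_totals[t] for (s, t), c in counts.items()}


# Toy source set with four images.
toy_images = [{"cat", "tree"}, {"cat", "grass"}, {"cat", "tree", "sky"}, {"car", "road"}]
U = cooccurrence(toy_images, {"tree", "grass", "sky", "road"}, {"cat", "car"})
```

In this toy example cat co-occurs with tree in two of its three images, so tree receives the largest frequency for cat.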

Transferring knowledge to the target $\mathcal{B}$

This section transfers the source knowledge to the target set $\mathcal{B}$ . In this set, we have access only to image level labels, but no location annotations. We call the classes that are listed on the image level label list target classes. Given a new image of class $b^{t}$ , we first use the FCN model trained on $\mathcal{A}$ to generate the thing ( $T$ ) and stuff ( $S$ ) segmentations separately (Sec. 5.1). Then we introduce three proposal scoring schemes to propagate the information from pixel level to proposal level (Sec. 5.2 - 5.4). Finally we combine the three scoring schemes into a single window score (Sec. 5.5). The scoring scheme parameters are learned in Sec. 5.6.

We apply the trained FCN model (Sec. 4.1) to a target image in $\mathcal{B}$. Usually, the output semantic segmentation is obtained by maximizing over all the class scores at each pixel. In this paper, we instead generate two output segmentations, one for things $T$ and one for stuff $S$. We index the pixels of the image by $i$. We use $R^{T}=\{r^{T}_{i}\}$ and $L^{T}=\{l^{T}_{i}\}$ to denote the score ($R$) and label ($L$) maps for $T$. They are generated by keeping the maximum score and the corresponding label over all the thing classes $\mathcal{A}^{T}$ at each pixel $i$. Similarly, $R^{S}=\{r^{S}_{i}\}$ and $L^{S}=\{l^{S}_{i}\}$ are generated by keeping the maximum score and the corresponding label over all the stuff classes $\mathcal{A}^{S}$ at each pixel.
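The construction of the thing and stuff maps can be sketched as follows. This is a minimal illustration on a flattened image with plain Python lists, not the authors' implementation; the function name `thing_stuff_maps` and the toy scores are hypothetical.

```python
def thing_stuff_maps(scores, thing_classes, stuff_classes):
    """Split per-pixel class scores into thing (T) and stuff (S) maps.

    scores: dict mapping class name -> list of per-pixel scores (flattened image).
    Returns (R_T, L_T, R_S, L_S): score and label lists for things and for stuff.
    """
    n_pixels = len(next(iter(scores.values())))

    def best_over(classes):
        R, L = [], []
        for i in range(n_pixels):
            # Keep the maximum score and the corresponding label at pixel i.
            label = max(classes, key=lambda c: scores[c][i])
            L.append(label)
            R.append(scores[label][i])
        return R, L

    R_T, L_T = best_over(thing_classes)
    R_S, L_S = best_over(stuff_classes)
    return R_T, L_T, R_S, L_S


# Toy two-pixel image with two thing classes and two stuff classes.
toy_scores = {"cat": [0.6, 0.1], "table": [0.1, 0.2],
              "tree": [0.2, 0.5], "sky": [0.1, 0.2]}
R_T, L_T, R_S, L_S = thing_stuff_maps(toy_scores, ["cat", "table"], ["tree", "sky"])
```

Note that every pixel receives both a thing label and a stuff label, unlike a standard segmentation that keeps a single label per pixel.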

Fig. 1 shows an example of a bear image (target). The thing and stuff maps are produced by the semantic segmentation model. The $R$ heatmaps indicate the probability of assigning a certain thing or stuff label to each pixel. Building upon these heatmaps, we propose three proposal scoring schemes to link the pixel level result to the proposal level score (Sec. 5.2 - 5.4). These try to give high scores to proposals containing the target class.

5.2 Label weighting (LW)

Because bear is more similar to cat than to table, we want to up-weight the proposal area in the thing map if it is predicted as cat. Meanwhile, because bear frequently appears against tree, we also want to up-weight the proposal area in the stuff map if it is predicted as tree. To do this, we transfer the knowledge of similarity and co-occurrence relations acquired in the source to the target class (bear), and use both relations to modulate the segmentation scores in $T$ and $S$ . Both relations and segmentation scores play a role in the label weighting proposal scoring scheme.

Thing label weighting. We can generate a thing label weighting map depending on how close the predicted class $l^{T}_{i}$ at pixel $i$ in $L^{T}$ is to the target class $b^{t}$ . The thing label ( $l^{T}_{i}$ ) weight is given by the class similarity score $V(l^{T}_{i},b^{t})$ (Sec. 4.2). In Fig. 1 the target class bear is more similar to cat than to table. If a pixel is predicted as cat, then we assign a high label weight, otherwise we assign a low one.

In Fig. 1, cat frequently co-occurs with trees, and so does bear. So, if a certain pixel is predicted as tree, it gets assigned a high stuff label weight.

Notice that the thing label weighting can be viewed as a first-order transfer where the information goes directly from the source thing classes to the target thing classes. Instead, the stuff label weighting can be viewed as second-order transfer where the information first goes from the source stuff classes to the source thing classes, and then to the target thing classes. To the best of our knowledge, such second-order transfer has not been proposed before.
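To make the first-order and second-order transfers concrete, the sketch below scores a proposal by averaging the weighted segmentation scores over its pixels. The averaging, the omission of any further nonlinearity or scoring parameter, and all function names are assumptions for illustration; only the use of $V$ for things and of $U$ through the most similar source thing class for stuff follows the text.

```python
def thing_label_weight(R_T, L_T, pixels, V, target):
    """First-order transfer: weight each thing score by V(l_i, target)."""
    return sum(R_T[i] * V[(L_T[i], target)] for i in pixels) / len(pixels)


def stuff_label_weight(R_S, L_S, pixels, U, V, target, source_things):
    """Second-order transfer: stuff -> most similar source thing -> target."""
    # Nearest source thing class of the target according to V.
    nn = max(source_things, key=lambda a: V[(a, target)])
    return sum(R_S[i] * U.get((L_S[i], nn), 0.0) for i in pixels) / len(pixels)


# Toy values: bear is closer to cat than to table; cat co-occurs mostly with tree.
V = {("cat", "bear"): 0.9, ("table", "bear"): 0.1}
U = {("tree", "cat"): 0.8, ("sky", "cat"): 0.2}
lw_thing = thing_label_weight([0.6, 0.2], ["cat", "table"], [0, 1], V, "bear")
lw_stuff = stuff_label_weight([0.2, 0.5], ["tree", "tree"], [0, 1],
                              U, V, "bear", ["cat", "table"])
```

The stuff score never consults a similarity between stuff classes and the target directly, which is what makes the transfer second order.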

5.3 Contrast weighting (CW)

The LW scheme favours small proposals with high label weights, which typically cover only part of an object (top right image in Fig. 1). To counter this effect, contrast weighting (CW) measures the dissimilarity of a proposal to its immediate surrounding area on the thing/stuff score maps. It up-weights proposals that are more likely to contain an entire object or an entire stuff region.

The surrounding $Surr(w,\theta)$ of a proposal $w$ is a rectangular ring obtained by enlarging it by a factor $\theta$ in all directions (Fig. 3, the yellow ring). The CW between a window and its surrounding ring is computed as the Chi-square distance between their score map ($R$) histograms $h(\cdot)$.
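A minimal sketch of this contrast computation is given below. Several details are assumptions rather than facts from the text: the ring extends each side by $\theta$ times the box width or height, histograms use 10 equal bins over $[0,1]$ and are normalized, and the Chi-square distance uses the common $\frac{1}{2}\sum (a-b)^2/(a+b)$ convention. All function names are hypothetical.

```python
def histogram(values, bins=10):
    """Normalized histogram of scores assumed to lie in [0, 1]."""
    h = [0.0] * bins
    for v in values:
        h[min(int(v * bins), bins - 1)] += 1.0
    total = sum(h)
    return [x / total for x in h] if total else h


def chi_square(h1, h2):
    """Chi-square distance between two histograms (0.5 * sum convention)."""
    return 0.5 * sum((a - b) ** 2 / (a + b) for a, b in zip(h1, h2) if a + b > 0)


def contrast_weight(R, box, theta):
    """Contrast between a box and its surrounding ring on a 2D score map R."""
    x0, y0, x1, y1 = box  # x1 and y1 are exclusive pixel coordinates
    height, width = len(R), len(R[0])
    dx, dy = int((x1 - x0) * theta), int((y1 - y0) * theta)
    ex0, ey0 = max(0, x0 - dx), max(0, y0 - dy)
    ex1, ey1 = min(width, x1 + dx), min(height, y1 + dy)
    inside = [R[y][x] for y in range(y0, y1) for x in range(x0, x1)]
    # Ring: pixels of the enlarged box that are not inside the original box.
    ring = [R[y][x] for y in range(ey0, ey1) for x in range(ex0, ex1)
            if not (x0 <= x < x1 and y0 <= y < y1)]
    return chi_square(histogram(inside), histogram(ring))


# A 4x4 map with a high-scoring 2x2 centre surrounded by low scores.
toy_R = [[0.95 if 1 <= x < 3 and 1 <= y < 3 else 0.15 for x in range(4)]
         for y in range(4)]
cw_tight = contrast_weight(toy_R, (1, 1, 3, 3), 0.5)
```

A box that tightly encloses the high-scoring region differs strongly from its ring, whereas a box on a uniform map has zero contrast.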

5.4 Area weighting (AW)

If none of the $K$ -NN classes occurs in $w$ , we simply set $Ratio^{t}$ to zero. Throughout this paper, $K$ is set to 3.

For thing and stuff area weighting we apply a cumulative distribution function (CDF) of the normal distribution

where $\mu^{t}$ and $\sigma^{t}$ are the mean and standard deviation. We choose $\mu^{t}=\mu^{s}=0$, while $\sigma^{t}$ and $\sigma^{s}$ are free parameters (Sec. 5.6).
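For reference, the normal CDF with mean $\mu$ and standard deviation $\sigma$ can be evaluated through the error function, as in the short sketch below. Applying it to an area ratio such as $Ratio^{t}$ is our reading of the text, and the function name `normal_cdf` is hypothetical.

```python
import math


def normal_cdf(x, mu=0.0, sigma=1.0):
    """CDF of the normal distribution N(mu, sigma^2) evaluated at x."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


# With mu = 0, a zero ratio maps to 0.5 and larger ratios approach 1;
# a smaller sigma makes the score saturate faster.
aw_zero = normal_cdf(0.0, 0.0, 0.5)
aw_full = normal_cdf(1.0, 0.0, 0.5)
```

With $\mu=0$ the score is monotonically increasing in the ratio, so $\sigma$ alone controls how quickly large ratios are rewarded.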

5.5 Combining the scoring schemes

For each proposal in an image, the above scoring schemes can be computed independently, each on the thing and stuff map. The scoring schemes tackle different problems and are complementary to each other. This section combines them to give our final TST (things and stuff transfer) window score $W$.

5.6 Parameter learning

In the WSOL setting, we do not have the ground truth bounding box annotations in the target set $\mathcal{B}$. Thus we learn the score parameters $\alpha^{t}$, $\alpha^{s}$, $\theta^{t}$, $\theta^{s}$, $\sigma^{t}$ and $\sigma^{s}$ on the source set $\mathcal{A}$, where we have ground truth. We train the semantic segmentation model on the train set of $\mathcal{A}$, and then apply it to the val set of $\mathcal{A}$. For each image in the val set, we rank all its proposals using (5). We jointly learn the score parameters by maximizing the performance over the entire validation set.

Overall system

In WSOL, given the target training set in $\mathcal{B}$ with image level labels, the goal is to localize the object instances in it and to train good object detectors for the target test set. We explain here how we build a complete WSOL system by building on a MIL framework and incorporating our transfer cues into it.

Basic MIL. We build a Basic MIL pipeline as follows. We represent each image in the target set $\mathcal{B}$ as a bag of object proposals extracted using Edge Boxes. Edge Boxes returns about 5,000 proposals per image, which are likely to cover all objects. We describe the proposals by the output of the fc7 layer of the AlexNet CNN architecture. The CNN model is pre-trained for whole-image classification on ILSVRC, using the Caffe implementation. This produces a 4,096-dimensional feature vector for each proposal. Based on this feature representation for each target class, we iteratively build an SVM appearance model (object detector) in two alternating steps: (1) Re-localization: in each positive image, we select the proposal with the highest SVM score. This produces the positive set, which contains the current selection of one instance from each positive image. (2) Re-training: we train the SVM using the current selection of positive samples, and all proposals from the negative images as negative samples. We also linearly combine the SVM score with a general measure of objectness. This leads to a higher MIL baseline.

Incorporating things and stuff transfer (TST). We incorporate our things and stuff transfer (TST) into Basic MIL by linearly combining the SVM score with our proposal scoring function (5). Note how the behavior of (5) depends on the class similarity measure used within it (either appearance or semantic similarity, Sec. 4.2).
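The alternation between re-training and re-localization, together with the linear combination of the appearance score and an extra proposal cue, can be sketched as below. To keep the example self-contained, the SVM is replaced by a simple difference-of-class-means linear scorer, which is a stand-in and not the model used in the paper. The initial selection, the weight `lam`, the iteration count, and all function names are likewise assumptions.

```python
def dot(w, x):
    return sum(a * b for a, b in zip(w, x))


def train_stand_in(positives, negatives):
    """Stand-in for SVM training: difference of class means (not an SVM)."""
    dim = len(positives[0])
    mean_pos = [sum(x[d] for x in positives) / len(positives) for d in range(dim)]
    mean_neg = [sum(x[d] for x in negatives) / len(negatives) for d in range(dim)]
    return [p - n for p, n in zip(mean_pos, mean_neg)]


def run_mil(pos_bags, neg_bags, init_selection, cue=None, lam=0.0, iters=5):
    """Alternate re-training and re-localization over bags of proposal features.

    cue[b][j] is an optional extra score (e.g. a transfer cue) for proposal j
    of positive bag b; it is added to the appearance score with weight lam.
    """
    selection = list(init_selection)
    negatives = [x for bag in neg_bags for x in bag]
    w = None
    for _ in range(iters):
        # Re-training: selected positives vs. all proposals of negative images.
        positives = [bag[k] for bag, k in zip(pos_bags, selection)]
        w = train_stand_in(positives, negatives)
        # Re-localization: pick the highest scoring proposal in each positive bag.
        new_selection = []
        for b, bag in enumerate(pos_bags):
            def total(j):
                extra = lam * cue[b][j] if cue is not None else 0.0
                return dot(w, bag[j]) + extra
            new_selection.append(max(range(len(bag)), key=total))
        selection = new_selection
    return selection, w


# Two positive images with two proposals each, one negative image.
pos_bags = [[[1.0, 0.0], [0.0, 1.0]], [[0.9, 0.1], [0.1, 0.9]]]
neg_bags = [[[0.0, 1.0], [0.1, 0.8]]]
selection, w = run_mil(pos_bags, neg_bags, init_selection=[1, 0])
```

In this toy run the wrongly initialized first image is corrected after one iteration, because the second proposal resembles the negative proposals.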

Deep MIL. Basic MIL uses an SVM on top of fixed deep features as the appearance model. Now we change the model to fine-tune all layers of the deep network during the re-training step of MIL. We take the output of Basic MIL as an initialization for two additional MIL iterations. During these iterations, we use Fast R-CNN.

Experiments

We use one source set $\mathcal{A}$ (PASCAL Context) and several different target sets $\mathcal{B}$ in turn (ILSVRC-20, COCO-07 and PASCAL VOC 2007). Each target set contains a training set and a test set. We perform WSOL on the target training set to localize objects within it. Then we train a Fast R-CNN detector from it and apply it on the target test set.

Evaluation protocol. We quantify localization performance in the target training set with the CorLoc measure. We quantify object detection performance on the target test set using mean average precision (mAP). As in most previous WSOL methods, our scheme returns exactly one bounding box per class per training image. At test time the object detector is capable of localizing multiple objects of the same class in the same image (and this is captured in the mAP measure).
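As a concrete reading of the CorLoc measure, the sketch below counts an image as correctly localized when the single returned box has intersection-over-union above 0.5 with any ground truth box of the class. The box format `(x0, y0, x1, y1)` and the function names are assumptions for illustration.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x0, y0, x1, y1)."""
    iw = min(a[2], b[2]) - max(a[0], b[0])
    ih = min(a[3], b[3]) - max(a[1], b[1])
    if iw <= 0 or ih <= 0:
        return 0.0
    inter = iw * ih
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)


def corloc(predicted, ground_truth, thresh=0.5):
    """Percentage of images whose predicted box overlaps some ground truth box.

    predicted: one box per image; ground_truth: list of boxes per image.
    """
    hits = sum(1 for p, gts in zip(predicted, ground_truth)
               if any(iou(p, g) > thresh for g in gts))
    return 100.0 * hits / len(predicted)


# One exact hit and one poor overlap (IoU = 1/7) give a CorLoc of 50.
toy_corloc = corloc([(0, 0, 2, 2), (0, 0, 2, 2)],
                    [[(0, 0, 2, 2)], [(1, 1, 3, 3)]])
```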

Source set: PASCAL Context. PASCAL Context augments PASCAL VOC 2010 with class labels at every pixel. As in , we select the 59 most frequent classes. We categorize them into things and stuff. There are 40 thing classes, including the original 20 PASCAL classes and new classes such as book, cup and window. There are 19 stuff classes, such as sky, water and grass. We train the semantic segmentation model (Sec. 4.1) on the train set of $\mathcal{A}$ and set the score parameters (Sec. 5.6) on the val set, using the 20 PASCAL classes from $\mathcal{A}$ as targets.

Target set: ILSVRC-20. The ILSVRC dataset originates from the ImageNet dataset, but is much harder. As the target training set we use the train60k subset of ILSVRC 2014. As the target test set we use the 20k images of the validation set. To conduct WSOL on train60k, we carefully select 20 target classes: ant, baby-bed, basketball, bear, burrito, butterfly, cello, coffee-maker, electric-fan, elephant, goldfish, golfcart, monkey, pizza, rabbit, strainer, tape-player, turtle, waffle-iron and whale. We choose these classes because: (1) they are visually considerably different from any source class; (2) they appear against background classes similar to those of the source classes, so we can show the benefits of stuff transfer; (3) they are diverse, covering a broad range of object types. ILSVRC-20 contains 3,843 target training set images and 877 target test set images.

Target set: COCO-07. The COCO 2014 dataset has fewer object classes (80) than ILSVRC (200), but more instances. COCO is generally more difficult than ILSVRC for detection, as objects are smaller. There are also more instances per image: 7.7 in COCO compared to 3.0 in ILSVRC. We select 7 target classes to carry out WSOL: apple, giraffe, kite, microwave, snowboard, tennis racket and toilet. COCO-07 contains 11,489 target training set images and 5,443 target test set images.

Target set: PASCAL VOC 2007. The PASCAL VOC 2007 dataset is one of the most important object detection datasets. It includes 5,011 training (trainval) images and 4,952 test images, which we directly use as our target training set and target test set, respectively. For our experiments we use all 20 thing classes in VOC 2007. Since the thing classes in our source set (PASCAL Context) overlap with those of VOC 2007, when doing our TST transfer to a target class we remove it from the sources. For example, when we transfer to “dog” in VOC 2007, we remove “dog” from the FCN model trained on PASCAL Context.

7.2 ILSVRC-20

Table 1 presents results for our method (TST) and several alternative methods on ILSVRC-20.

Our transfer (TST). Our results (TST) vary depending on the underlying class similarity measure used, either appearance (APP) or semantic (SEM) (Sec. 4.2). TST (APP) leads to slightly better results than TST (SEM). We achieve a $+7\%$ improvement in CorLoc (46.7) compared to Basic MIL without objectness, and $+5\%$ improvement (52.7) over Basic MIL with objectness. Hence, our transfer method is effective, and is complementary to objectness. Fig. 5 shows example localizations by Basic MIL with objectness and TST (APP).

Comparison to direct transfer (DT). We compare here to a simpler way to transfer knowledge. We train a fully supervised object detector for each source thing class. Then, for every target class we find the most similar source class from the 40 PASCAL Context thing classes, and use it to directly detect the target objects. For the appearance similarity measure (APP) all NN classes of ILSVRC-20 are part of PASCAL VOC and PASCAL Context. Therefore we have bounding box annotations for these classes. However, for the semantic similarity measure (SEM) not all NN classes of ILSVRC-20 are part of PASCAL VOC. Therefore we do not have bounding box annotations for these classes and cannot apply DT. DT is similar to the ‘transfer only’ method in (see Sec. 4.2 and Table 2 in ).

As Table 1 shows, the results are quite poor as the source and target classes are visually quite different, e.g. the most similar class to ant according to APP is bird; while for waffle-iron, it is table; for golfcart, it is person. This shows that the transfer task we address (from PASCAL Context to ILSVRC-20) is challenging and cannot be solved by simply using object detectors pre-trained on the source classes.

Comparison to direct transfer with MIL (DT+MIL). We improve the direct transfer method by using the DT detector to score all proposals in a target image, and then combining this score with the standard SVM score for the target class during the MIL re-localization step. This is very similar to the full method of and is also close to . The main difference from is that we train the target class’ SVM model in an MIL framework (Sec. 6), whereas simply trains it by using proposals with high objectness as positive samples.

As Table 1 shows, DT+MIL performs substantially better than DT alone, but it only slightly exceeds MIL without transfer, again due to the source and target classes being visually different ( $+1.5\%$ over Basic MIL with objectness). Importantly, our method (TST) achieves higher results, demonstrating that it is a better way to transfer knowledge ( $+5\%$ over Basic MIL with objectness).

Deep MIL. As Table 1 shows, Deep MIL improves slightly over Basic MIL (from 47.6 to 48.4, both with objectness). When built on Deep MIL, our TST transfer raises CorLoc to 54.0 (APP) and 53.8 (SEM), a $+5\%$ improvement over Deep MIL (confirming what we observed when building on Basic MIL). Table 2 shows the mAP of Deep MIL and our method (TST) on the test set. The observed improvements in CorLoc on the training set nicely translate to better mAP on the test set ( $+3.3\%$ over Deep MIL).

Comparison to LSDA. We compare to LSDA, which trains fully supervised detectors for 100 classes of the ILSVRC 2013 dataset (sources) and transfers to the other 100 classes (targets). We report in Table 2 the mAP on the 8 classes common to both their target set and ours. On these 8 classes, we improve on LSDA by $+1.7\%$ mAP while using a substantially smaller source set (5K images in PASCAL Context, compared to 105K images in their 100 source classes from ILSVRC 2013).

Furthermore, we can also incorporate detectors for their 100 source classes in our method, in a similar manner as for the DT+MIL method. For each target class we use the detectors of the 3 most similar source classes as a proposal scoring function during MIL’s re-localization step. We choose the SEM measure to guide the transfer as it is fast to compute. This new scoring function is referred to as ILSVRC-dets in Tables 1 and 2. When using the ILSVRC-dets score, our mAP improves further, to a final value $+2.5\%$ better than LSDA.

7.3 COCO-07

Table 3 presents results on COCO-07, which is a harder dataset. Compared to Deep MIL with objectness, our transfer method improves CorLoc by $+3.0\%$ and mAP by $+2.2\%$ (APP).

7.4 PASCAL VOC 2007

Table 4 presents results on PASCAL VOC 2007. As our baseline system, we use both objectness and multifolding in Deep MIL. This performs at $50.7$ CorLoc and $28.1$ mAP. Our transfer method TST strongly improves CorLoc to $59.9$ ( $+9.2\%$ ) and mAP to $33.8$ ( $+5.7\%$ ).

Comparison to . They present results on this dataset in a transfer setting, by using detectors trained in a fully supervised setting for all 200 classes of ILSVRC (excluding the target class). Adopting their protocol, we also use those detectors in our method (analog to the LSDA comparison above). This leads to our highest CorLoc of $60.8$ , which outperforms , as well as state-of-the-art WSOL works (which do not use such transfer). For completeness, we also report the corresponding mAPs. Our mAP 34.5 matches the result of based on their ’S’ neural network, which corresponds to the AlexNet we use. They propose an advanced WSOL technique that integrates both recognition and detection tasks to jointly train a weakly supervised deep network, whilst we build on a weaker MIL system. We believe our contributions are complementary: we could incorporate our TST transfer cues into their WSOL technique and get even better results.

Finally, we note that our experimental protocol guarantees no overlap in either images or classes between source and target sets (Sec. 7.1). However, in general VOC 2007 and PASCAL Context (VOC 2010) share similar attributes, which makes this transfer task easier in our setting.

Conclusion

We present weakly supervised object localization using things and stuff transfer. We transfer knowledge by training a semantic segmentation model on the source set and using it to generate thing and stuff maps on a target image. Class similarity and co-occurrence relations are also transferred and used as weighting functions. We devise three proposal scoring schemes on both thing and stuff maps and combine them to produce our final TST score. We plug the score into an MIL pipeline and show significant improvements on the ILSVRC-20, VOC 2007 and COCO-07 datasets. We compare favourably to two previous transfer works.

Acknowledgements. Work supported by the ERC Starting Grant VisCul.

Weakly Supervised Object Localization Using Things and Stuff Transfer: Supplemental Material

Miaojing Shi (1,2), Holger Caesar (1), Vittorio Ferrari (1)
(1) University of Edinburgh, (2) Tencent Youtu Lab
name.surname@ed.ac.uk

Appendix A Proxy measures

We propose two proxy measures to jointly learn the score parameters by maximizing the performance over the entire validation set in $\mathcal{A}$ (Sec. 5.6):

Rank: the highest rank of any proposal whose intersection-over-union (IoU) with ground truth bounding box is $>$ 0.5.

CorLoc@1: the percentage of images in which the highest scoring proposal localizes an object of the target class correctly (IoU $>$ 0.5).

These two measures characterize well whether a proposal scoring function gives a higher score to the target objects than to other proposals. Hence they are good proxy measures for their usefulness within MIL. The behavior of the proposal scoring functions (Eqn. 5) depends on the class similarity measure used within them. Referring to Sec. 4.2, the guided similarities can be either appearance or semantic similarity (APP/SEM).

Results. We notice that roughly the same parameters are obtained from both criteria. We now test how well they work on two of our target sets: ILSVRC-20 and COCO-07. We gradually add each proposal scoring scheme from Sec. 5.2 - 5.4 and denote them by +LW, +CW, and +AW in Fig. 6. Both Rank and CorLoc@1 improve gradually: using APP we achieve the highest CorLoc@1 (22.9 on ILSVRC-20 and 6.2 on COCO-07) and the highest Rank (0.94 on ILSVRC-20 and 0.89 on COCO-07). SEM is lower than APP: 19.2 and 4.3 in terms of CorLoc@1, and 0.94 and 0.83 in terms of Rank, on ILSVRC-20 and COCO-07, respectively. Comparing our proposal scoring schemes with a modern version of objectness, we see that both perform similarly well. In Sec. 7.2 and 7.3 we integrate our scheme with objectness and achieve a big improvement (+5%), which shows that the two are complementary.

Appendix B Ablation study

We report here an ablation study to justify each component of our proposed system. We incorporate the LW, AW and CW scores (Sec. 5.2 - 5.4) separately into the Basic MIL framework (Sec. 7.1). We report experiments on ILSVRC-20 in Table 5, using the same protocol as for the first table in the paper (column guided by appearance similarity, APP). The three scores bring +0.9%, +3.3%, and +3.2% CorLoc, respectively, on top of Basic MIL’s 39.7%. This demonstrates that each individual score brings an improvement. Moreover, we also tried combining multiple scores: LW+CW reaches 44.0%, AW+CW reaches 45.8%, and using all three scores AW+LW+CW gives us the highest CorLoc, 47.6%. Here we can see that LW brings an additional improvement of +1.8% when added to AW+CW. This shows that by carefully designing and integrating each component into our system, we are able to boost the overall performance over each individual component or any two of them.