Rethinking Diversified and Discriminative Proposal Generation for Visual Grounding

Zhou Yu, Jun Yu, Chenchao Xiang, Zhou Zhao, Qi Tian, Dacheng Tao

Introduction

Recent advances in deep neural networks have helped to solve many challenges in computer vision and natural language processing. These advances have also stimulated high-level research into the connections that exist between vision and language such as visual captioning Donahue et al. (2015); Xu et al. (2015), visual question answering Fukui et al. (2016); Yu et al. (2017b, 2018) and visual grounding Rohrbach et al. (2016); Chen et al. (2017).

Visual grounding (a.k.a., referring expressions) aims to localize an object in an image referred to by a textual query phrase. It is a challenging task that requires a fine-grained understanding of the semantics of the image and the query phrase and the ability to predict the location of the object in the image. The visual grounding task is a natural extension of object detection Ren et al. (2015). While object detection aims to localize all possible pre-defined objects, visual grounding introduces a textual query phrase to localize only one best matching object from an open vocabulary.

Most existing visual grounding approaches Rohrbach et al. (2016); Fukui et al. (2016); Hu et al. (2016) can be modularized into the general framework shown in Figure 1. Given an input image, a fixed number of object proposals are first generated using a proposal generation model and visual features for each proposal are then extracted. Meanwhile, the input query is encoded by a language model (e.g., LSTM Hochreiter and Schmidhuber (1997)) that outputs a textual feature vector. The textual and visual features are then integrated by a multi-modal fusion model and fed into a proposal ranking model to output the location of the proposal with the highest ranking score.

Of the three modules shown in Figure 1, most existing visual grounding approaches focus on designing robust multi-modal feature fusion models. SCRC Hu et al. (2016) and phrase-based CCA Plummer et al. (2015) learn the multi-modal common space via a recurrent neural network (RNN) and canonical correlation analysis (CCA), respectively. MCB learns a compact bilinear pooling model to fuse the multi-modal features and obtain a more discriminative fused feature Fukui et al. (2016). Further, some approaches have explored designing different loss functions to help improve the accuracy of the final object localization. Rohrbach et al. use an auxiliary reconstruction loss function to regularize the model training Rohrbach et al. (2016). Wu et al. propose a model that gradually refines the predicted bounding box via reinforcement learning Wu et al. (2017).

Compared with the other two modules, proposal generation has been less thoroughly investigated. Most visual grounding approaches use class-agnostic models (e.g., selective search Uijlings et al. (2013) or data-driven region proposal networks (RPNs) Ren et al. (2015) trained on specific datasets) to generate object proposals and extract visual features for each proposal Fukui et al. (2016); Hu et al. (2016); Rohrbach et al. (2016); Chen et al. (2017); Li et al. (2017). Although many visual grounding approaches have been proposed, what makes a good proposal generator for visual grounding remains uncertain.

Choosing the optimal number of generated proposals for visual grounding is difficult. If the number is small, recall of the true objects is limited, which seriously degrades visual grounding performance. If the number is large, recall is satisfactory, but this does not necessarily improve visual grounding performance, because a larger candidate set makes accurate prediction harder for the proposal ranking model. Here, we rethink this problem and suggest that the ideal generated proposal set should contain as many different objects as possible but be of a relatively small size. To achieve this goal, we introduce diversity and discrimination simultaneously when generating proposals, and in doing so propose Diversified and Discriminative Proposal Networks (DDPN).

Based on the proposals generated by DDPN, we propose a high performance baseline model to verify the effectiveness of the DDPN for visual grounding. Our model uses simple feature concatenation as the multi-modal fusion model, and is trained with two novel loss functions: Kullback-Leibler Divergence (KLD) loss with soft labels to penalize the proposals while capturing contextual information of the generated proposals, and smoothed $L_{1}$ regression loss to refine the proposal bounding boxes.

The main contributions of this study are as follows: 1) we analyze the limitations of existing proposal generators for visual grounding and propose DDPN to generate high-quality proposals; 2) based on DDPN, we propose a high performance baseline model trained with two novel losses; and 3) by way of extensive ablation studies, we evaluate our models on four benchmark datasets. Our experimental results demonstrate that our model delivers a significant improvement on all the tested datasets.

The Diversified and Discriminative Proposal Networks (DDPN)

For visual grounding, we propose that the ideal generated proposals should be diversified and discriminative simultaneously: 1) the proposals of all images should be diversified to detect objects from open-vocabulary classes (see Figure 2(a)). As the objects to be localized during visual grounding may vary significantly, proposals only covering objects from a small set of classes will be a bottleneck for the model that follows and limit final visual grounding performance; 2) the proposals of an individual image should be discriminative to guarantee that the proposals and visual features accurately represent the true semantics (see Figure 2(b)).

In practice, learning a proposal generator that fully meets these two properties is challenging. The diversified property requires a class-agnostic object detector, while the discriminative property requires a class-aware detector. Since class-aware object detectors are usually trained on datasets with a small set of object classes (e.g., 20 classes for PASCAL VOC Everingham et al. (2010) or 80 classes for COCO Lin et al. (2014)), directly using the output bounding boxes of the detection model as the proposals for visual grounding may compromise diversity. As a trade-off, most existing visual grounding approaches only consider diversity and ignore discrimination during proposal generation. Specifically, they use a class-agnostic object detector (e.g., Selective Search Uijlings et al. (2013)) to generate proposals and then extract the visual features for the proposals using a pre-trained model Rohrbach et al. (2016); Hu et al. (2016); Lin et al. (2014); Fukui et al. (2016) or a fine-tuned model Chen et al. (2017). However, when only a limited number of proposals is available, such proposals have the following limitations: 1) they may not be accurate enough to localize the corresponding objects; 2) they may fail to localize small objects; and 3) they may contain some noisy information (i.e., meaningless proposals).

To overcome the drawbacks inherent in the existing approaches, we relax the diversity constraint to a relatively large set of object classes (e.g., more than 1k classes) and propose a simple solution by training a class-aware object detector that can detect a large number of object classes as an approximation. We choose the commonly used class-aware model Faster R-CNN Ren et al. (2015) as the object detector, and train the model on the Visual Genome dataset which contains a large number of object classes and extra attribute labels Krishna et al. (2017). We call our scheme Diversified and Discriminative Proposal Networks (DDPN).

Following Anderson et al. (2017), we clean and filter the training data in Visual Genome by truncating the low frequency classes and attributes. Our final training set contains 1,600 object classes and 400 attribute classes. To train this model, we initialize Faster R-CNN with the CNN model (e.g., VGG-16 Simonyan and Zisserman (2014) or ResNet-101 He et al. (2016)) pre-trained on ImageNet. We then train Faster R-CNN on the processed Visual Genome dataset. Slightly different from the standard Faster R-CNN model with four losses, we add an additional loss to the last fully-connected layer to predict attribute classes in addition to object classes. Specifically, we concatenate the fully-connected feature with a learned embedding of the ground-truth object class and feed this into an additional output layer defining a softmax distribution over the attribute classes.

Note that although DDPN is similar to Anderson et al. (2017), we have a different motivation. In their approach, Faster R-CNN is used to build bottom-up attention to provide high-level image understanding. In this paper, we use Faster R-CNN to learn diversified and discriminative proposals for visual grounding.

Baseline Model for Visual Grounding

We first introduce a high performance baseline model for visual grounding based on DDPN. Given an image-query pair, our model is trained to predict the bounding box of the referred object in the image. The flowchart of our model is shown in Figure 3.

To better demonstrate the capacity of the visual features, we do not introduce a complex feature fusion model (e.g., bilinear pooling) and only use simple feature concatenation followed by a fully-connected layer for fusion. For each pair $\{q,v\}$ , the output feature $f$ is obtained as follows.
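The fusion equation itself is not reproduced in this text, so the following is only a minimal sketch of "concatenation followed by a fully-connected layer". The function name `fuse`, the weight layout, and the ReLU activation are assumptions made for illustration, not the authors' exact formulation.

```python
def fuse(q, v, weights, bias):
    """Fuse a textual feature q and a visual feature v.

    The two vectors are concatenated and passed through a single
    fully-connected layer. The ReLU non-linearity is an assumption;
    the source only states that a fully-connected layer is used.
    """
    x = list(q) + list(v)  # simple feature concatenation
    fused = []
    for row, b in zip(weights, bias):
        z = sum(w * xi for w, xi in zip(row, x)) + b  # affine map
        fused.append(max(0.0, z))                     # assumed ReLU
    return fused


# Toy usage: 2-d query feature, 1-d visual feature, 2-d fused output.
f = fuse([1.0, 2.0], [3.0],
         weights=[[1.0, 0.0, 0.0], [0.0, 0.0, -1.0]],
         bias=[0.5, 0.0])
```

In the paper's setting the inputs would have dimensionalities $d_{q}$ and $d_{v}$ and the output $d_{o}$; the toy sizes above are only for readability.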

The problem now becomes how to define the loss function to make the predicted ranking scores $S$ consistent with their ground-truth ranking scores $S^{*}$ . In most existing visual grounding approaches Fukui et al. (2016); Hu et al. (2016); Chen et al. (2017), the ground-truth ranking scores $S^{*}=[s_{1}^{*},...,s_{N}^{*}]\in\{0,1\}^{N}$ are defined as a one-hot vector in which a single element is set to 1, namely the one whose proposal overlaps most with the ground-truth bounding box (i.e., has the largest IoU score), and all other elements are set to 0. Based on the one-hot single label, softmax cross-entropy loss is used to learn the ranking model.
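The single-label scheme described above can be sketched as follows. The helper names `one_hot_labels` and `cross_entropy` are hypothetical, and the IoU scores are assumed to have been computed beforehand for each of the $N$ proposals.

```python
import math


def one_hot_labels(ious):
    """Return S* with a 1 at the proposal of largest IoU and 0 elsewhere."""
    best = max(range(len(ious)), key=lambda i: ious[i])
    return [1.0 if i == best else 0.0 for i in range(len(ious))]


def softmax(scores):
    """Numerically stable softmax over the predicted ranking scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(scores, labels):
    """Softmax cross-entropy between predicted scores and target labels."""
    probs = softmax(scores)
    return -sum(t * math.log(p) for t, p in zip(labels, probs) if t > 0)


labels = one_hot_labels([0.1, 0.7, 0.6])
loss = cross_entropy([0.0, 0.0, 0.0], labels)
```

Note that the third proposal, with IoU 0.6, receives exactly the same target as the first one, with IoU 0.1, which is the information that a one-hot label discards.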

The benefits of using soft labels are three-fold: 1) apart from the max-overlapping proposal, other proposals may also contain useful information that provides contextual knowledge of the ground-truth; 2) the soft label can be seen as a model regularization strategy, as introduced in Szegedy et al. (2016) as label smoothing; and 3) during testing, the predicted bounding box is considered to be correct when its IoU score with the ground-truth is larger than a threshold $\eta$ . Therefore, optimizing the model with soft labels makes training consistent with testing and may improve visual grounding performance.
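The exact soft-label definition and the KLD loss equation are not reproduced in this text. The sketch below therefore assumes one plausible construction: IoU scores below the threshold $\eta$ are zeroed and the remainder is normalized to sum to 1, and the loss is the KL divergence between these targets and the softmax of the predicted scores. The names `soft_labels` and `kld_loss` are hypothetical.

```python
import math


def soft_labels(ious, eta=0.5):
    """Assumed soft labels: keep IoUs >= eta, then normalize to sum to 1."""
    kept = [u if u >= eta else 0.0 for u in ious]
    total = sum(kept)
    if total == 0.0:
        return kept  # no proposal reaches the threshold
    return [k / total for k in kept]


def kld_loss(scores, targets):
    """KL(targets || softmax(scores)), skipping zero-probability targets."""
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    loss = 0.0
    for s, t in zip(scores, targets):
        if t > 0:
            loss += t * (math.log(t) - (s - log_z))
    return loss


targets = soft_labels([0.8, 0.6, 0.2])
loss = kld_loss([2.0, 1.0, 0.0], targets)
```

Under this construction every proposal that would count as correct at test time (IoU of at least $\eta$) receives a non-zero target, which is the training/testing consistency argued for in point 3.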

Although using the KLD loss in Eq.(4) can capture the proposals’ contextual information, the accuracy of the region proposals themselves could be a performance bottleneck for visual grounding. If none of the generated region proposals overlaps sufficiently with the ground-truth bounding box, the grounding accuracy will be zero. Inspired by the strategy used in Faster R-CNN Ren et al. (2015), we append an additional fully-connected layer on top of the fused feature $f$ and add a regression layer to refine the proposal coordinates.
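As an illustration of the Faster R-CNN style refinement mentioned above, the sketch below computes the standard box-regression targets (center offsets scaled by the proposal size, and log size ratios) together with a smoothed $L_{1}$ penalty. Whether the paper uses exactly this parameterization is an assumption; boxes are taken to be `(x1, y1, x2, y2)` tuples, and the function names are hypothetical.

```python
import math


def to_center_size(box):
    """Convert (x1, y1, x2, y2) to (center_x, center_y, width, height)."""
    x1, y1, x2, y2 = box
    w, h = x2 - x1, y2 - y1
    return x1 + 0.5 * w, y1 + 0.5 * h, w, h


def regression_targets(proposal, gt):
    """Faster R-CNN style targets that map a proposal onto the ground truth."""
    px, py, pw, ph = to_center_size(proposal)
    gx, gy, gw, gh = to_center_size(gt)
    return [(gx - px) / pw, (gy - py) / ph,
            math.log(gw / pw), math.log(gh / ph)]


def smooth_l1(pred, target):
    """Smoothed L1: quadratic for |d| < 1, linear otherwise, summed over coords."""
    total = 0.0
    for p, t in zip(pred, target):
        d = abs(p - t)
        total += 0.5 * d * d if d < 1.0 else d - 0.5
    return total


t = regression_targets((0.0, 0.0, 10.0, 10.0), (5.0, 0.0, 15.0, 10.0))
loss = smooth_l1([0.5, 2.0], [0.0, 0.0])
```

The quadratic region keeps gradients small near the target, while the linear region limits the influence of badly misaligned proposals.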

The overall loss for our model is defined as:

where $\gamma$ is a hyper-parameter to balance the two terms.
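The displayed equation is missing from this text. One form consistent with the surrounding description (a KLD ranking term, a smoothed $L_{1}$ regression term, and a single balancing weight) is given below; the symbols $L_{\mathrm{kld}}$ and $L_{\mathrm{reg}}$ and the placement of $\gamma$ on the regression term are assumptions.

```latex
L = L_{\mathrm{kld}} + \gamma \, L_{\mathrm{reg}}
```

With $\gamma=1$, as used in the experiments, the two terms would contribute with equal weight under this form.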

Experiments

We evaluate our approach on four benchmark datasets: Flickr30K Entities Plummer et al. (2015), ReferItGame Kazemzadeh et al. (2014), RefCOCO and RefCOCO+ Yu et al. (2016). These are all commonly used benchmark datasets for visual grounding.

We use the same hyper-parameters as in Anderson et al. (2017) to train the DDPN model. We train DDPN with different CNN backbones (VGG-16 and ResNet-101). Both models are trained for up to 30 epochs, which takes two GPUs 2 $\sim$ 3 weeks to finish.

Based on the optimized DDPN model, we train our visual grounding model with the hyper-parameters listed as follows. The loss weight $\gamma$ is set to 1 for all experiments. The dimensionality of the visual feature $d_{v}$ is 2048 (for ResNet-101) or 4096 (for VGG-16), the dimensionality of the word embedding feature $d_{e}$ is 300, the dimensionality of the output feature of the LSTM network $d_{q}$ is 1024, the dimensionality of the fused feature $d_{o}$ is 512, the threshold of IoU scores $\eta$ is 0.5, and the number of proposals $N$ is 100. During training, all the weight parameters, including the word embedding layer, the LSTM network and the multi-modal fusion model, are initialized using the Xavier method. We use the Adam solver to train the model with $\beta_{1}=0.9$ , $\beta_{2}=0.99$ . The base learning rate is set to 0.001 with an exponential decay rate of 0.1. The mini-batch size is set to 64. All the models are trained for up to 10k iterations. During testing, we feed the input forward through the network to obtain the scores for all the proposals, pick the proposal with the highest score, and use its refined bounding box as the final output.
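The test-time procedure in the last sentence can be sketched as follows. The sketch assumes that the regression layer predicts Faster R-CNN style offsets `(dx, dy, dw, dh)` for each proposal; this parameterization and the function names `decode_box` and `ground` are illustrative assumptions.

```python
import math


def decode_box(proposal, deltas):
    """Apply predicted offsets (dx, dy, dw, dh) to an (x1, y1, x2, y2) box."""
    x1, y1, x2, y2 = proposal
    w, h = x2 - x1, y2 - y1
    cx, cy = x1 + 0.5 * w, y1 + 0.5 * h
    dx, dy, dw, dh = deltas
    new_cx, new_cy = cx + dx * w, cy + dy * h        # shift the center
    new_w, new_h = w * math.exp(dw), h * math.exp(dh)  # rescale the size
    return (new_cx - 0.5 * new_w, new_cy - 0.5 * new_h,
            new_cx + 0.5 * new_w, new_cy + 0.5 * new_h)


def ground(proposals, scores, deltas):
    """Pick the top-scoring proposal and return its refined bounding box."""
    best = max(range(len(scores)), key=lambda i: scores[i])
    return decode_box(proposals[best], deltas[best])


box = ground(
    proposals=[(0.0, 0.0, 10.0, 10.0), (20.0, 20.0, 40.0, 40.0)],
    scores=[0.1, 0.9],
    deltas=[[0.0, 0.0, 0.0, 0.0], [0.5, 0.0, 0.0, 0.0]],
)
```

Here the second proposal wins the ranking and is shifted right by half its width before being returned.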

2 Datasets

Flickr30K Entities contains 32k images, 275k bounding boxes and 360k query phrases. Some phrases in the dataset may correspond to multiple boxes. In such cases, we merge the boxes and use their union as the ground-truth Rohrbach et al. (2016). We use the standard split in our setting, i.e., 1k images for validation, 1k for testing, and 30k for training.
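Merging several boxes for one phrase can be illustrated with a short sketch. It assumes that the "union" is the smallest axis-aligned box enclosing all annotated boxes, with boxes given as `(x1, y1, x2, y2)`; the function name `union_box` is hypothetical.

```python
def union_box(boxes):
    """Return the smallest (x1, y1, x2, y2) box enclosing all input boxes."""
    return (
        min(b[0] for b in boxes),
        min(b[1] for b in boxes),
        max(b[2] for b in boxes),
        max(b[3] for b in boxes),
    )


merged = union_box([(10, 20, 50, 60), (40, 10, 80, 30)])
```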

2.2 ReferItGame

ReferItGame contains over 99k bounding boxes and 130k query phrases from 20k images. Each bounding box is associated with 1-3 query phrases. We use the same data split as in Rohrbach et al. (2016), namely 10k images for testing, 9k for training and 1k for validation.

2.3 RefCOCO and RefCOCO+

Both the RefCOCO and RefCOCO+ datasets utilize images from the COCO dataset Lin et al. (2014). Both of them consist of 142k query phrases and 50k bounding boxes from 20k images. The difference between the two datasets is that the query phrases in RefCOCO+ do not contain any location words, which makes the query intent more difficult to understand. The datasets are split into four sets: train, validation, testA and testB. The images in testA contain multiple people, while images in testB contain objects in other categories.

3 Experimental Setup

For all datasets, we adopt accuracy as the evaluation metric, which is defined as the percentage of queries for which the IoU between the predicted bounding box and the ground-truth is larger than 0.5 (IoU $>0.5$).
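A minimal sketch of this metric is shown below, assuming boxes in `(x1, y1, x2, y2)` form with continuous coordinates; the function names are hypothetical.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def grounding_accuracy(predictions, ground_truths, threshold=0.5):
    """Fraction of queries whose predicted box has IoU > threshold."""
    hits = sum(1 for p, g in zip(predictions, ground_truths)
               if iou(p, g) > threshold)
    return hits / len(ground_truths)


acc = grounding_accuracy(
    predictions=[(0, 0, 10, 10), (0, 0, 10, 10)],
    ground_truths=[(0, 0, 10, 10), (5, 0, 15, 10)],
)
```

In this toy case the first prediction matches exactly and the second has IoU 1/3, so one of the two queries counts as correct.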

3.2 Compared Methods

We compare our approach with state-of-the-art visual grounding methods: SCRC Hu et al. (2016), GroundeR Rohrbach et al. (2016), MCB Fukui et al. (2016), QRC Chen et al. (2017), and the approaches of Li et al. Li et al. (2017), Wu et al. Wu et al. (2017) and Yu et al. Yu et al. (2017a).

4 Ablation Studies

We also perform the following ablation experiments on Flickr30k Entities to verify the efficacy of DDPN and of the loss functions used to train our baseline visual grounding model.

4.2 Effects of Different Losses on Visual Grounding

In Table 2, we report the performance of our model variants trained with different losses. It can be seen that: 1) training with the KLD loss leads to a 1.2 $\sim$ 1.7-point improvement over the models with classical single-label softmax loss. This verifies our hypothesis that using soft labels captures contextual information and enhances the model’s capacity. Compared to the context modeling strategy proposed by Chen et al. (2017), which exploits the ground-truth bounding boxes within an image from multiple queries, our strategy is more general and easier to implement; 2) training with an additional regression loss to refine the proposals leads to a 4.7 $\sim$ 5.2-point improvement. We believe this will become a commonly-used strategy in future visual grounding studies.

5 Comparisons with State-of-the-Art

We next compare our models to current state-of-the-art approaches. We report the results of our approach with DDPN of two backbones, namely VGG-16 Simonyan and Zisserman (2014) and ResNet-101 He et al. (2016).

Tables 3, 4 and 5 show the comparative results on RefCOCO and RefCOCO+, Flickr30k Entities, and ReferItGame, respectively. We note the following: 1) with the same CNN backbone (i.e., VGG-16), our model achieves an absolute improvement of 3.2 points on RefCOCO (testA), 6.3 points on RefCOCO+ (testA), 4.9 points on Flickr30k Entities and 16.1 points on ReferItGame. The improvement is primarily due to the high-quality proposals generated by DDPN and the loss functions we use for our visual grounding model; 2) by replacing VGG-16 with ResNet-101 as the backbone for DDPN, all results improve by 3 $\sim$ 4 points, illustrating that the representation capacity of DDPN significantly influences the visual grounding performance.

6 Qualitative Results

We visualize some visual grounding results of our model with the ResNet-101 backbone on Flickr30k Entities (the first row) and ReferItGame (the second row) in Figure 5. It can be seen that our approach achieves good visual grounding performance and is able to handle small and fine-grained objects. Moreover, the bounding box regression helps refine the results, rectifying some inaccurate proposals. However, our approach still has some limitations, especially when faced with complex queries or visually confusing objects. These observations are useful to guide further improvements for visual grounding in the future.

Conclusions and Future Work

In this paper, we re-examine proposal generation for visual grounding and in doing so propose Diversified and Discriminative Proposal Networks (DDPN) to produce high-quality proposals. Based on the proposals and visual features extracted from the DDPN model, we design a high performance baseline for visual grounding trained with two novel losses: KLD loss to capture the contextual information of the proposals and regression loss to refine the proposals. We conduct extensive experiments on four benchmark datasets and achieve significantly better results on all datasets.

Since the models studied here represent the baseline, there remains significant room for improvement, for example by introducing a more advanced backbone model for DDPN or introducing a more powerful multi-modal feature fusion model such as bilinear pooling.

Acknowledgments

This work was supported in part by the National Natural Science Foundation of China under Grant 61702143, Grant 61622205, Grant 61472110 and Grant 61602405, in part by the Zhejiang Provincial Natural Science Foundation of China under Grant LR15F020002, and in part by the Australian Research Council Projects under Grant FL-170100117 and Grant DP-180103424.