Grounding of Textual Phrases in Images by Reconstruction

Anna Rohrbach, Marcus Rohrbach, Ronghang Hu, Trevor Darrell, Bernt Schiele

Introduction

Language grounding in visual data is an interesting problem studied in both the computer vision and natural language processing communities. Such grounding can be done on different levels of granularity: from coarse, e.g. associating a paragraph of text to a scene in a movie, to fine, e.g. localizing a word or phrase in a given image. In this work we focus on the latter scenario. Many prior efforts in this area have focused on rather constrained settings with a small number of nouns to ground. In contrast, we want to tackle the problem of grounding arbitrary natural language phrases in images. Most parallel corpora of sentence/visual data do not provide localization annotations (e.g. bounding boxes) and the annotation process is costly. We propose an approach which can learn to localize phrases relying only on phrases associated with images, without bounding box annotations, but which is also able to incorporate phrases with bounding box supervision when available (see Fig. 1).

The main idea of our approach is shown in Fig. 1(b,c). Let us first consider the scenario where no localization supervision is available. Given images paired with natural language phrases we want to localize these phrases with a bounding box in the image (Fig. 1c). To do this we propose a model (Fig. 1b) which learns to attend to a bounding box proposal and, based on the selected bounding box, reconstructs the phrase. As the second part of the model (Fig. 1b, bottom) is able to predict the correct phrase only if the first part of the model attended correctly (Fig. 1b, top), this can be learned without additional bounding box supervision. Our method is based on Grounding with a Reconstruction loss and hence named GroundeR. Additional supervision is integrated in our model by adding a loss function which directly penalizes incorrect attention before the reconstruction step. At test time we evaluate whether the model attends to the correct bounding box.

We propose a novel approach to grounding of textual phrases in images which can operate in all supervision modes: with no, a few, or all grounding annotations available. We evaluate our GroundeR approach on the Flickr 30k Entities and ReferItGame datasets and show that our unsupervised variant is better than prior work and our supervised approach significantly outperforms state-of-the-art on both datasets. Interestingly, our semi-supervised approach can effectively exploit small amounts of labeled data and surpasses the supervised variant by exploiting multiple losses.

Related work

Grounding natural language in images and video. For grounding language in images, the approach of is based on a Markov Random Field which aligns 3D cuboids to words. However, it is limited to nouns of 21 object classes relevant to indoor scenes. uses a Conditional Random Field to ground the specifically designed scene graph query in the image. grounds dependency-tree relations to image regions using Multiple Instance Learning and a ranking objective. simplifies this objective to just the maximum score and replaces the dependency tree with a learned recurrent network. Neither work has been evaluated for grounding, but we discuss a quantitative comparison in Section 4. Recently, presented a new dataset, Flickr 30k Entities, which augments the Flickr30k dataset with bounding boxes for all noun phrases present in textual descriptions. report the localization performance of their proposed CCA embedding approach. proposes Deep Structure-Preserving Embedding for image-sentence retrieval and also applies it to phrase localization, formulated as a ranking problem. The Spatial Context Recurrent ConvNet (SCRC) and the approach of use a caption generation framework to score the phrase on the set of proposal boxes and select the box with the highest probability. One advantage of our approach over is its applicability to un- and semi-supervised training regimes. We believe that our approach of encoding the phrase optimizes a better objective for grounding than scoring the phrase with a text generation pipeline as in . As for the fully-supervised regime we empirically show our advantage over . attempts to localize relation phrases of type Subject-Verb-Object at a large scale in order to verify their correctness, while relying on detectors from .

In the video domain some of the representative works on spatial-temporal language grounding are and . These are limited to a small set of nouns.

Object co-localization focuses on discovering and detecting an object in images or videos without any bounding box annotation, but only from image/video level labels . These works are similar to ours with respect to the amount of supervision, but they focus on a few discrete classes, while our approach can handle arbitrary phrases and allows for localization of novel phrases. There are also works that propose to train detectors for a wide range of concepts using image-level annotated data from web image search, e.g. and . These approaches are complementary to ours in the sense of obtaining large scale concept detectors with little supervision, however they do not tackle complex phrases e.g. “a blond boy on the left” which is the focus of our work.

Attention in vision tasks. Recently, different attention mechanisms have been applied to a range of computer vision tasks. The general idea is that given a visual input, e.g. a set of features, at any given moment we might want to focus only on part of it, e.g. attend to a specific subset of features . integrates spatial attention into their image captioning pipeline. They consider two variants: “soft” and “hard” attention, meaning that in the latter case the model is only allowed to pick a single location, while in the former the attention “weights” can be distributed over multiple locations. adapts the soft-attention mechanism and attends to bounding box proposals, one word at a time, while generating an image caption. relies on a similar mechanism to perform temporal attention for selecting frames in a video description task. uses an attention mechanism to densely label actions in a video sequence. Our approach relies on a soft-attention mechanism, similar to the one of . We apply it to the language grounding task where attention helps us to select a bounding box proposal for a given phrase.

Bi-directional mapping. In our model, a phrase is first mapped to an image region through attention, and then the image region is mapped back to the phrase during reconstruction. There is conceptual similarity between previous work and ours on the idea of bi-directional mapping from one domain to another. In autoencoders , input data is first mapped to a compressed vector during encoding, and then reconstructed during decoding. uses a bi-directional mapping from visual features to words and from words to visual features in a recurrent neural network model. The idea is to generate descriptions from visual features and then to reconstruct visual features given a description. Similar to , our model can also learn to associate input text with visual features, but through attending to an image region rather than reconstructing directly from words. In the linguistic community, proposed a CRF Autoencoder, which generates latent structures for the given language input and then reconstructs the input from these latent structures, with the application to e.g. part-of-speech tagging.

GroundeR: Grounding by Reconstruction

The goal of our approach is to ground natural language phrases in images. More specifically, to ground a phrase $p$ in an image $I$ means to find a region $r_{j}$ in the image which corresponds to this phrase. $r_{j}$ can be any subset of $I$, e.g. a segment or a bounding box. The core insight of our method is that there is a bi-directional correspondence between an image region and the phrase describing it. A correct grounding of a textual phrase should result in an image region which a human would describe using this phrase, i.e. it should be possible to reconstruct the phrase based on the grounded image region. Thus, the key idea of our approach is to learn to ground a phrase by reconstructing this phrase from an automatically localized region. Fig. 1 gives an overview of our approach.

In this work, we utilize a set of automatically generated bounding box proposals $\{r_{i}\}_{i=1,\cdots,N}$ for the image $I$. Given a phrase $p$, during training our model works in two parts: the first part aims to attend to the most relevant region $r_{j}$ (or potentially also multiple regions) based on the phrase $p$, and then the second part tries to reconstruct the same phrase $p$ from the region(s) $r_{j}$ it attended to in the first part. Therefore, by training to reconstruct the text phrase, the model learns to first ground the phrase in the image, and then generate the phrase from that region. Fig. 2a visualizes the network structure. At test time, we remove the phrase reconstruction part, and use the first part for phrase grounding. The described pipeline can be extended to accommodate partial supervision, i.e. ground-truth phrase localization. For that we integrate an additional loss into the model, which directly optimizes for correct attention prediction, see Fig. 2b. Finally, we can adapt our model to the fully supervised scenario by removing the reconstruction part, see Fig. 2c.

In the following we present the details of the two parts in our approach: learning to attend to the correct region for a given phrase and learning to reconstruct the phrase from the attended region. For simplicity, but without loss of generality, we will refer to $r_{j}$ as a single bounding box.

We frame the problem of grounding a phrase $p$ in image $I$ as selecting a bounding box $r_{j}$ from a set of image region proposals $\{r_{i}\}_{i=1,\cdots,N}$. To select the correct bounding box, we define an attention function $f_{ATT}$ and select the box $j$ which receives the maximum attention:

$$j=\arg\max_{i}f_{ATT}(p,r_{i}) \quad (1)$$

In the following we describe the details of how we model the attention in $f_{ATT}$ . The attention mechanism used in our model is inspired by and similar to the soft attention formulations of . However, our inputs to the attention predictor are not single words but rather multi-word phrases, and consequently we also do not have a “doubly stochastic attention” which is used in to normalize the attention across words.

The phrases that we are dealing with can be very complex, so we require a good language model to represent them. We choose a Long Short-Term Memory network (LSTM) as our phrase encoder, as it has been shown effective in various language modeling tasks, e.g. translation . We encode our query phrase word by word with an LSTM and obtain a representation of the phrase using the hidden state $h$ at the final time step as:

$$h=f_{LSTM}(p) \quad (2)$$

Each word $w_{t}$ in the phrase $p$ is first encoded as a one-hot vector. It is then embedded in a lower-dimensional space and fed to the LSTM.

Next, each bounding box $r_{i}$ is encoded using a convolutional neural network (CNN) to compute the visual feature vector $v_{i}$:

$$v_{i}=f_{CNN}(r_{i}) \quad (3)$$

Based on the encoded phrase and feature representation of each proposal, we use a two layer perceptron to compute the attention on the proposal $r_{i}$ :

$$\bar{\alpha}_{i}=f_{ATT}(p,r_{i})=W_{2}\phi(W_{h}h+W_{v}v_{i}+b_{1})+b_{2} \quad (4)$$

where $\phi$ is the rectified linear unit (ReLU): $\phi(x)=\max(0,x)$. We found that this architecture performs better than e.g. a single layer perceptron with a hyperbolic tangent nonlinearity used in .

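We get normalized attention weights $\alpha_{i}$ by using softmax, which can be interpreted as the probability of region $r_{i}$ being the correct region $r_{\hat{j}}$:

$$\alpha_{i}=P(i=\hat{j}\,|\,\bar{\alpha})=\frac{\exp(\bar{\alpha}_{i})}{\sum_{k=1}^{N}\exp(\bar{\alpha}_{k})} \quad (5)$$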
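The attention computation described above (a two-layer perceptron over the phrase encoding and each proposal feature, followed by a softmax and a maximum over proposals) can be sketched in a few lines. This is a minimal illustration with toy dimensions and hand-picked weights; the helper names (`matvec`, `attention_scores`, `softmax`) and all numeric values are assumptions made for the example, not part of the original implementation.

```python
import math

def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def attention_scores(h, feats, W_h, W_v, b1, W2, b2):
    """Unnormalized attention per proposal: W2 . relu(W_h h + W_v v_i + b1) + b2."""
    ph = matvec(W_h, h)  # phrase projection, shared by all proposals
    scores = []
    for v in feats:
        pv = matvec(W_v, v)
        hidden = [max(0.0, a + b + c) for a, b, c in zip(ph, pv, b1)]  # ReLU
        scores.append(sum(w * x for w, x in zip(W2, hidden)) + b2)
    return scores

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

# Toy example (hypothetical values): 2-d phrase encoding, three proposals.
h = [1.0, -1.0]
feats = [[1.0, 0.0], [0.0, 1.0], [2.0, 1.0]]
W_h = [[1.0, 0.0], [0.0, 1.0]]
W_v = [[1.0, 0.0], [0.0, 1.0]]
b1 = [0.0, 0.0]
W2 = [1.0, 1.0]
b2 = 0.0

scores = attention_scores(h, feats, W_h, W_v, b1, W2, b2)
alpha = softmax(scores)
selected = max(range(len(alpha)), key=lambda i: alpha[i])  # box with maximum attention
```

Because the softmax is monotonic, taking the maximum over `alpha` selects the same box as taking it over the unnormalized scores.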

If at training time we have ground truth information, i.e. that $r_{\hat{j}}$ is the correct proposal box, then we can compute the loss $L_{att}$ based on our prediction as:

$$L_{att}=-\frac{1}{B}\sum_{b=1}^{B}\log P(\hat{j}\,|\,\bar{\alpha}) \quad (6)$$

where $B$ is the number of phrases per batch. This loss activates only if the training sample has the ground-truth attention value, otherwise, it is zero. If we do not have ground truth annotations then we have to define a loss function to learn the parameters of $f_{ATT}$ in a weakly supervised manner. In the next section we describe how we define this loss by aiming to reconstruct the phrase based on the boxes that are attended to. At test time, we calculate the IOU (intersection over union) value between the selected box $r_{j}$ and the ground truth box $r_{\hat{j}}$ .
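The behaviour of the attention loss, which is active only for phrases with a ground-truth box and zero otherwise, can be illustrated as follows. This is a hedged sketch: the function name `attention_loss`, the use of `None` to mark unlabeled phrases, and the choice to average over all $B$ phrases in the batch (labeled or not) are assumptions for illustration.

```python
import math

def attention_loss(alphas, gt_indices):
    """Negative log-probability of the correct proposal, averaged over the batch.

    alphas:     one normalized attention distribution per phrase
    gt_indices: index of the correct proposal, or None if the phrase is unlabeled
    """
    B = len(alphas)
    total = 0.0
    for alpha, gt in zip(alphas, gt_indices):
        if gt is None:
            continue  # no ground truth: this phrase contributes zero
        total += -math.log(alpha[gt])
    return total / B

# Toy batch of three phrases; the second one has no box annotation.
alphas = [[0.5, 0.5], [0.25, 0.75], [0.9, 0.1]]
gt_indices = [0, None, 0]
l_att = attention_loss(alphas, gt_indices)
l_att_unlabeled = attention_loss(alphas, [None, None, None])
```

With no labeled phrases in the batch the loss is exactly zero, so in that case all learning signal has to come from the reconstruction objective.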

3.2 Learning to reconstruct

The key idea of our phrase reconstruction model is to learn to reconstruct the phrase only from the attended boxes. Given an attention distribution over the boxes, we compute a weighted sum over the visual features and the attention weights $\alpha_{i}$:

$$v_{att}=\sum_{i=1}^{N}\alpha_{i}v_{i} \quad (7)$$

which aggregates the visual features from the attended boxes. Then, the visual features $v_{att}$ are further encoded into $v^{\prime}_{att}$ using a non-linear encoding layer:

$$v^{\prime}_{att}=f_{REC}(v_{att})=\phi(W_{a}v_{att}+b_{a}) \quad (8)$$

We reconstruct the input phrase based on this encoded visual feature $v^{\prime}_{att}$ over attended regions. During reconstruction, we use an image description LSTM that takes $v^{\prime}_{att}$ as input to generate a distribution over phrases $p$:

$$P(p\,|\,v^{\prime}_{att})=f_{LSTM}(v^{\prime}_{att}) \quad (9)$$

where $P(p|v^{\prime}_{att})$ is a distribution over the phrases conditioned on the input visual feature. Our approach for phrase generation is inspired by who have effectively used LSTM for generating image descriptions based on visual features. Given a visual feature, it learns to predict a word sequence $\{w_{t}\}$ . At each time step $t$ , the model predicts a distribution over the next word $w_{t+1}$ conditioned on the input visual feature $v^{\prime}_{att}$ and all the previous words. We use a single LSTM layer and we feed the visual input only at the first time step. We use LSTM as our phrase encoder as well as decoder. Although one could potentially use other approaches to map phrases into a lower dimensional semantic space, it is not clear how one would do the reconstruction without the recurrent network, given that we have to train encoding and decoding end-to-end.
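The visual input that is fed to the decoder at the first time step is the attention-weighted sum of proposal features passed through a ReLU encoding layer. A minimal sketch of these two steps is given below; the function names and the toy weights are hypothetical and chosen only so the arithmetic is easy to follow.

```python
def attended_feature(alpha, feats):
    """Weighted sum of proposal features using attention weights alpha."""
    dim = len(feats[0])
    return [sum(a * v[d] for a, v in zip(alpha, feats)) for d in range(dim)]

def encode(v_att, W_a, b_a):
    """Non-linear encoding layer: relu(W_a v_att + b_a)."""
    return [
        max(0.0, sum(w * x for w, x in zip(row, v_att)) + b)
        for row, b in zip(W_a, b_a)
    ]

# Toy example (hypothetical values): three proposals with 2-d features.
alpha = [0.5, 0.25, 0.25]
feats = [[2.0, 0.0], [0.0, 4.0], [4.0, 4.0]]
v_att = attended_feature(alpha, feats)

W_a = [[1.0, -1.0], [0.5, 0.5], [-1.0, 0.0]]
b_a = [0.0, 1.0, 0.0]
v_enc = encode(v_att, W_a, b_a)
```

Since the sum is soft, every proposal with non-zero attention contributes to the decoder input, which is what lets the reconstruction gradient reach the attention parameters.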

Importantly, the entire grounding+reconstruction model is trained as a single deep network through back-propagation by maximizing the likelihood of the ground truth phrase $\hat{p}$ generated during reconstruction, where we define the training loss for batch size $B$:

$$L_{rec}=-\frac{1}{B}\sum_{b=1}^{B}\log P(\hat{p}\,|\,v^{\prime}_{att}) \quad (10)$$

Finally, in the semi-supervised model we have both losses $L_{att}$ and $L_{rec}$ , which are combined as follows:

$$L=\lambda L_{att}+L_{rec} \quad (11)$$

where parameter $\lambda$ regulates the importance of the attention loss.
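The reconstruction loss and its combination with the attention loss can be sketched as follows. The example assumes that the decoder has already produced, for every phrase, the probability it assigned to each ground-truth word given the previous words; the function names and numbers are illustrative assumptions.

```python
import math

def reconstruction_loss(word_probs_per_phrase):
    """Average negative log-likelihood of the ground-truth phrases.

    Each inner list holds the decoder's probability for the correct word at
    each time step, so the phrase likelihood is their product (chain rule).
    """
    B = len(word_probs_per_phrase)
    total = 0.0
    for probs in word_probs_per_phrase:
        total += -sum(math.log(p) for p in probs)
    return total / B

def total_loss(l_att, l_rec, lam):
    """Semi-supervised objective: lambda * L_att + L_rec."""
    return lam * l_att + l_rec

# Toy batch (hypothetical values): a two-word phrase and a one-word phrase.
l_rec = reconstruction_loss([[0.5, 0.5], [0.25]])
loss = total_loss(0.1, l_rec, lam=50.0)
```

Setting `lam` to zero recovers the unsupervised objective, while dropping `l_rec` corresponds to the fully supervised variant.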

Experiments

We first discuss the experimental setup and design choices of our implementation and then present quantitative results on the test sets of Flickr 30k Entities (Tables 1,2) and ReferItGame (Table 3) datasets. We find our best results to outperform state-of-the-art on both datasets by a significant margin. Figures 3 and 4 show qualitatively how well we can ground phrases in images.

We evaluate GroundeR on the datasets Flickr 30k Entities and ReferItGame . Flickr 30k Entities contains over 275K bounding boxes from 31K images associated with natural language phrases. Some phrases in the dataset correspond to multiple boxes, e.g. “two men”. For consistency with , in such cases we consider the union of the boxes as ground truth. We use 1,000 images for validation, 1,000 for testing and 29,783 for training. The ReferItGame dataset contains over 99K regions from 20K images. Regions are associated with natural language expressions, constructed to disambiguate the described objects. We use the bounding boxes provided by and the same test split, namely 10K images for testing; the rest we split in 9K training and 1K validation images.

We obtain 100 bounding box proposals for each image using Selective Search for Flickr 30k Entities and Edge Boxes for ReferItGame dataset. For our semi-supervised and fully supervised models we obtain the ground-truth attention by selecting the proposal box which overlaps most with the ground-truth box, while the overlap IOU (intersection over union) is above 0.5. Thus, our fully supervised model is not trained with all available training phrase-box pairs, but only with those where such proposal boxes exist.
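The rule for deriving ground-truth attention from annotated boxes can be made concrete with a short sketch. Boxes are assumed to be `(x1, y1, x2, y2)` tuples; the function names and the toy coordinates are hypothetical.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def ground_truth_proposal(proposals, gt_box, threshold=0.5):
    """Index of the proposal with the highest IOU, if that IOU exceeds the threshold.

    Returns None when no proposal qualifies; such phrase-box pairs provide
    no attention supervision.
    """
    best_idx, best_iou = None, threshold
    for i, box in enumerate(proposals):
        overlap = iou(box, gt_box)
        if overlap > best_iou:
            best_idx, best_iou = i, overlap
    return best_idx

gt_box = (0, 0, 10, 10)
proposals = [(0, 0, 5, 10), (0, 0, 10, 8), (20, 20, 30, 30)]
target = ground_truth_proposal(proposals, gt_box)
no_target = ground_truth_proposal([(0, 0, 5, 10)], gt_box)
```

In the second call the only proposal has an IOU of exactly 0.5, which is not above the threshold, so the pair would be left without an attention label.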

On the Flickr 30k Entities for the visual representation we rely on the VGG16 network trained on ImageNet . For each box we extract a 4,096 dimensional feature from the fully connected fc7 layer. We also consider a VGG16 network fine-tuned for object detection on PASCAL , trained using Fast R-CNN . In the following we refer to both features as VGG-CLS and VGG-DET, respectively. We do not fine-tune the VGG representation for our task to reduce computational and memory load, however, our model trivially allows back-propagation into the image representation which likely would lead to further improvements. For the ReferItGame dataset we use the VGG-CLS features and additional spatial features provided by . We concatenate both and refer to the obtained feature as VGG+SPAT. For the language encoding and decoding we rely on the LSTM variant implemented in Caffe which we initialize randomly and jointly train with the grounding task.

At test time we compute the accuracy as the ratio of phrases for which the attended box overlaps with the ground-truth box by more than 0.5 IOU.

4.2 Design choices and findings

In all experiments we use the Adam solver , which adaptively changes the learning rate during training. We train our models for about 20/50 epochs for the Flickr 30k Entities/ReferItGame dataset, respectively, and pick the best iteration on the validation set.

Next, we report our results for optimizing hyperparameters on the validation set of Flickr 30k Entities while using the VGG-CLS features.

Regularization. Applying L2 regularization to parameters (weight decay) is important for the best performance of our unsupervised model. By introducing the weight decay of $0.0005$ we improve the accuracy from $20.33\%$ to $22.96\%$ . In contrast, when supervision is available, we introduce batch normalization for the phrase encoding LSTM and visual feature, which leads to a performance improvement, in particular from 37.42% to 40.93% in the supervised scenario.

Layer initialization. We experiment with different ways to initialize the layer parameters. The configuration which works best for us is using uniform initialization for LSTM, MSRA for convolutional layers, and Xavier for all other layers. Switching from Xavier to MSRA initialization for the convolutional layers improves the accuracy of the unsupervised model from $21.04\%$ to $22.96\%$ .

4.3 Experiments on Flickr 30k Entities dataset

We report the performance of our approach with multiple levels of supervision in Table 1. In the last line of the table we report the proposal upper-bound accuracy, namely the presence of the correct box among the proposals (which overlaps with the ground-truth box with $IOU>0.5$ ).
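The relation between grounding accuracy and the proposal upper bound can be illustrated with precomputed overlaps. In this sketch, `all_ious[p][i]` is assumed to hold the IOU between proposal `i` and the ground-truth box of phrase `p`; the names and values are made up for the example.

```python
def accuracy(selected_ious, threshold=0.5):
    """Fraction of phrases whose attended box overlaps the ground truth by more than the threshold."""
    return sum(1 for v in selected_ious if v > threshold) / len(selected_ious)

def proposal_upper_bound(all_ious, threshold=0.5):
    """Fraction of phrases for which at least one proposal exceeds the threshold."""
    return sum(1 for row in all_ious if max(row) > threshold) / len(all_ious)

# Toy data (hypothetical): four phrases, three proposals each.
all_ious = [
    [0.2, 0.7, 0.6],
    [0.1, 0.3, 0.4],
    [0.9, 0.55, 0.0],
    [0.51, 0.2, 0.8],
]
attended = [1, 2, 1, 1]  # index of the box the model attended to for each phrase
selected_ious = [row[i] for row, i in zip(all_ious, attended)]

acc = accuracy(selected_ious)
upper = proposal_upper_bound(all_ious)
```

No attention model can exceed `upper`, since a phrase without any sufficiently overlapping proposal is counted as an error regardless of which box is selected.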

Unsupervised training. We start with the unsupervised scenario, i.e. no phrase localization ground-truth is used at training time. Our approach, which relies on VGG-CLS features, is able to achieve 24.66% accuracy. Note that the VGG network trained on ImageNet has not seen any bounding box annotations at training time. VGG-DET, which was fine-tuned for detection, performs better and achieves 28.94% accuracy. We can further improve this by taking a sentence constraint into account. Namely, it is unlikely that two different phrases from one sentence are grounded to the same box. Thus we post-process the attended boxes: we jointly process the phrases from one sentence and greedily select the highest scoring box for each phrase, while the same box cannot be selected twice. This allows us to reach the accuracy of 25.01% for VGG-CLS and 29.02% for VGG-DET. While we currently only use a sentence constraint as a simple post processing step at test time, it would be interesting to include a sentence level constraint during training as part of future work. We compare to the unsupervised Deep Fragments approach of . Note, that does not report the grounding performance and does not allow for direct comparison with our work. With our best case evaluation (see footnote 1) of Deep Fragments , which also relies on detection boxes and features, we achieve an accuracy of 21.78%. Overall, the ranking objective in can be seen as complementary to our reconstruction objective. It might be possible, as part of future work, to combine both objectives to learn even better models without grounding supervision.

Footnote 1: We train the Deep Fragments model on the Flickr 30k dataset and evaluate with the Flickr 30k Entities ground truth phrases and boxes. Our trained Deep Fragments model achieves 11.2%/16.5% recall@1 for image annotation/search compared to 10.3%/16.4% reported in . As there is a large number of dependency tree fragments per sentence (on average 9.5) which are matched to proposal boxes, rather than on average 3.0 noun phrases per sentence in Flickr 30k Entities, we make a best case study in favor of . For each ground-truth phrase we take the maximum overlapping dependency tree fragments (w.r.t. word overlap), compute the IOU between their matched boxes and the ground truth, and take the highest IOU.
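The sentence-constraint post-processing described in this paragraph can be sketched as a greedy assignment. The text does not specify the order in which phrases are processed, so this sketch assumes one plausible reading: repeatedly take the highest remaining phrase-box score among phrases not yet assigned and boxes not yet used. The function name and scores are hypothetical.

```python
def greedy_unique_assignment(scores):
    """Assign each phrase of a sentence a distinct box, highest scores first.

    scores[p][b] is the attention of phrase p on box b.
    Returns a dict mapping phrase index to box index.
    """
    candidates = sorted(
        ((s, p, b) for p, row in enumerate(scores) for b, s in enumerate(row)),
        key=lambda t: -t[0],
    )
    assignment, used_boxes = {}, set()
    for _, p, b in candidates:
        if p in assignment or b in used_boxes:
            continue  # phrase already grounded, or box already taken
        assignment[p] = b
        used_boxes.add(b)
    return assignment

# Toy sentence with two phrases that both prefer box 0.
scores = [
    [0.6, 0.3, 0.1],
    [0.7, 0.2, 0.1],
]
assignment = greedy_unique_assignment(scores)
```

Here the second phrase keeps box 0 because its score is higher, and the first phrase falls back to its next best box. A phrase is left unassigned only if a sentence has more phrases than there are proposals.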

Supervised training. Next we look at the fully supervised scenario. The accuracy achieved by is 27.42% (see footnote 2) and by SCRC is 27.80%. The recent approach of achieves 43.89% with VGG-DET features. Our approach, when using VGG-CLS features, achieves an accuracy of 41.56%, significantly improving over prior works that use VGG-CLS. We further improve our result to 47.81% when using VGG-DET features.

Footnote 2: The number was provided by the authors of , while in they report 25.30% for phrases automatically extracted with a parser.

Semi-supervised training. Finally, we move to the semi-supervised scenario. The notation “ $x$ % annot.” means that $x$ % of the annotated data (where ground-truth attention is available) is used. As described in Section 3.2 we have a parameter $\lambda$ which controls the weight of the attention loss $L_{att}$ vs. the reconstruction loss $L_{rec}$ . We estimate the value of $\lambda$ on validation set and fix it for all iterations. We found that we need higher weight on $L_{att}$ when little supervision is available. E.g. for 3.12% of supervision $\lambda=200$ and for 12.5% supervision $\lambda=50$ . This is due to the fact that in these cases only 3.12% / 12.5% of labeled instances contribute to $L_{att}$ , while all instances contribute to $L_{rec}$ .
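One back-of-the-envelope way to read these two settings, offered as an illustrative assumption rather than a reported result: if $L_{att}$ is averaged over the whole batch and only a fraction $x$ of the phrases is labeled, the overall weight of the attention term scales roughly with $x\lambda$. For the two values quoted above this product is nearly identical:

```latex
\begin{align*}
0.0312 \times 200 &= 6.24,\\
0.125 \times 50 &= 6.25.
\end{align*}
```

Under this reading, the larger $\lambda$ mainly compensates for the smaller number of labeled instances contributing to $L_{att}$.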

When integrating 3.12% of the available annotated data into the model we significantly improve the accuracy from 24.66% to 33.02% (VGG-CLS) and from 28.94% to 42.32% (VGG-DET). The accuracy further increases when providing more annotations, reaching 42.43% for VGG-CLS and 48.38% for VGG-DET when using all annotations. As an ablation of our semi-supervised model we evaluated the supervised model while only using the respective $x$% of annotated data. We observed consistent improvement of our semi-supervised model over the supervised model. Interestingly, when using all available supervision, $L_{rec}$ still helps to improve performance over the supervised model (42.43% vs. 41.56%, 48.38% vs. 47.81%). Our intuition for this is that $L_{att}$ only has a single correct bounding box (which overlaps most with the ground truth), while $L_{rec}$ can also learn from overlapping boxes with high but not best overlap.

Results per phrase type. Flickr 30k Entities dataset provides a “type of phrase” annotation for each phrase, which we analyze in Table 2. Our unsupervised approach does well on phrases like “people”, “animals”, “vehicles” and worse on “clothing” and “body parts”. This could be due to confusion between people and their clothing or body parts. To address this, one could jointly model the phrases and add spatial relations between them in the model. Body parts are also the most challenging type to detect, with the proposal upper-bound of only $41.3\%$ . The supervised model with VGG-CLS features outperforms in all types except “body parts” and “instruments”, while with VGG-DET it is better or similar in all types. Semi-supervised model brings further significant performance improvements, in particular for “body parts”. In the last column we report the accuracy for novel phrases, i.e. the ones which did not appear in the training data. On these phrases our approach maintains high performance, although it is lower than the overall accuracy. This shows that learned language representation is effective and allows transfer to unseen phrases.

Summary Flickr 30k Entities. Our unsupervised approach performs similarly to (VGG-CLS) or better than (VGG-DET) the fully supervised methods of and (Table 1). Incorporating a small amount of supervision (e.g. 3.12% of annotated data) allows us to outperform and also when VGG-CLS features are used. Our best supervised model achieves 47.81%, surpassing all the previously reported results, including . Our semi-supervised model efficiently exploits the reconstruction loss $L_{rec}$ which allows it to outperform the supervised model.

4.4 Experiments on ReferItGame dataset

Table 3 summarizes results on the ReferItGame dataset. We compare our approach to the previously introduced fully supervised method SCRC , as well as provide reference numbers for two other baselines: LRCN and CAFFE-7K reported in . The LRCN baseline of is using the image captioning model LRCN trained on MSCOCO to score how likely the query phrase is to be generated for the proposal box. CAFFE-7K is a large scale object classifier trained on ImageNet to distinguish 7K classes. predicts a class for each proposal box and constructs a word bag with all the synonyms of the class-name based on WordNet . The obtained word bag is then compared to the query phrase after both are projected to a joint vector space. Both approaches are unsupervised w.r.t. the phrase bounding box annotations. Table 3 reports the results of our approach with VGG, as well as VGG+SPAT features of .

Unsupervised training. In the unsupervised scenario our GroundeR performs competitively with the LRCN and CAFFE-7K baselines, achieving 10.7% accuracy. We note that in this case VGG and VGG+SPAT perform similarly.

Supervised training. In the supervised scenario we compare to the best prior work on this dataset, SCRC , which reaches 17.93% accuracy. Our supervised approach, which uses identical visual features, significantly improves this performance to 26.93%.

Semi-supervised training. Moving to the semi-supervised scenario again demonstrates performance improvements, similar to the ones observed on the Flickr 30k Entities dataset. Even a small amount of supervision (3.12%) significantly improves performance to 15.03% (VGG+SPAT), while with 100% of annotations we achieve 28.51%, outperforming the supervised model.

Summary ReferItGame dataset. While the unsupervised model only slightly improves over prior work, the semi-supervised version can effectively learn from few labeled training instances, and with all supervision it achieves 28.51%, improving over by a large margin of 10.6%. Overall the performance on the ReferItGame dataset is significantly lower than on Flickr 30k Entities. We attribute this to two facts. First, the training set of ReferItGame is rather small compared to Flickr 30k (9k vs. 29k images). Second, the proposal upper bound on ReferItGame is significantly lower than on Flickr 30k Entities (59.38% vs. 77.90%) due to the complex nature of the described objects and “stuff” image regions.

4.5 Qualitative results

We provide qualitative results on Flickr 30K Entities dataset in Figure 3. We compare our unsupervised and supervised approaches, both with VGG-DET features. The supervised approach visibly improves the localization quality over the unsupervised approach, which nevertheless is able to localize many phrases correctly. Figure 4 presents qualitative results on ReferItGame dataset. We show the predictions of our supervised approach, as well as the ground-truth boxes. One can see the difficulty of the task from the presented examples, including two failures in the bottom row. One requires good language understanding in order to correctly ground such complex phrases. In order to ground expressions like “hut to the nearest left of the person on the right” we would need to additionally model relations between objects, an interesting direction for future work.

Conclusion

In this work we address the challenging task of grounding unconstrained natural phrases in images. We consider different scenarios of available bounding box supervision at training time, namely none, little, and full supervision. We propose a novel approach, GroundeR, which learns to localize phrases in images by attending to the correct box proposal and reconstructing the phrase, and which is able to operate in all of these supervision scenarios. In the unsupervised scenario we are competitive with or better than related work. Our semi-supervised approach works well with a small portion of available annotated data and takes advantage of the unsupervised data to outperform purely supervised training using the same amount of labeled data. It outperforms state-of-the-art on both the Flickr 30k Entities and ReferItGame datasets, by 4.5% and 10.6%, respectively.

Our approach is rather general and it could be applied to other regions such as segmentation proposals instead of bounding box proposals. Possible extensions are to include constraints within sentences at training time, jointly reason about multiple phrases, and to take into account spatial relations between them.

Marcus Rohrbach was supported by a fellowship within the FITweltweit-Program of the German Academic Exchange Service (DAAD). This work was supported by DARPA, AFRL, DoD MURI award N000141110688, NSF awards IIS-1427425 and IIS-1212798, and the Berkeley Artificial Intelligence Research (BAIR) Lab.