Learning Deep Structure-Preserving Image-Text Embeddings

Liwei Wang, Yin Li, Svetlana Lazebnik

Introduction

Computer vision is moving from predicting discrete, categorical labels to generating rich descriptions of visual data, for example, in the form of natural language. There is a surge of interest in image-text tasks such as image captioning and visual question answering. A core problem for these applications is how to measure the semantic similarity between visual data (e.g., an input image or region) and text data (a sentence or phrase). A common solution is to learn a joint embedding for images and text into a shared latent space where vectors from the two different modalities can be compared directly. This space is usually of low dimension and is very convenient for cross-view tasks such as image-to-text and text-to-image retrieval.

Several recent embedding methods are based on Canonical Correlation Analysis (CCA), which finds linear projections that maximize the correlation between projected vectors from the two views. Kernel CCA is an extension of CCA in which maximally correlated nonlinear projections, restricted to reproducing kernel Hilbert spaces with corresponding kernels, are found. Extensions of CCA to a deep learning framework have also been proposed. However, as has been pointed out previously, CCA is hard to scale to large amounts of data. In particular, stochastic gradient descent (SGD) techniques cannot guarantee a good solution to the original generalized eigenvalue problem, since the covariance estimated in each small batch (due to the GPU memory limit) is extremely unstable.

An alternative to CCA is to learn a joint embedding space using SGD with a ranking loss. WSABIE and DeVISE learn linear transformations of visual and textual features to the shared space using a single-directional ranking loss that applies a margin-based penalty to incorrect annotations that get ranked higher than correct ones for each training image. Compared to CCA-based methods, this ranking loss easily scales to large amounts of data with stochastic optimization in training. As a more powerful objective function, a few other works have proposed a bi-directional ranking loss that, in addition to ensuring that correct sentences for each training image get ranked above incorrect ones, also ensures that for each sentence, the image described by that sentence gets ranked above images described by other sentences. However, to date, it has proven frustratingly difficult to beat CCA with an SGD-trained embedding: Klein et al. have shown that properly normalized CCA on top of state-of-the-art image and text features can outperform considerably more complex models.

Another strand of research on multi-modal embeddings is based on deep learning, utilizing such techniques as deep Boltzmann machines, autoencoders, LSTMs, and recurrent neural networks. By making it possible to learn nonlinear mappings, deep methods can in principle provide greater representational power than methods based on linear projections.

In this work, we propose to learn an image-text embedding using a two-view neural network with two layers of nonlinearities on top of any representations of the image and text views (Figure 1). These representations can be given by the outputs of two pre-trained networks, off-the-shelf feature extractors, or trained jointly end-to-end with the embedding. To train this network, we use a bi-directional loss function similar to that of previous bi-directional ranking methods, combined with constraints that preserve neighborhood structure within each individual view. Specifically, in the learned latent space, we want images (resp. sentences) with similar meaning to be close to each other. Such within-view structure preservation constraints have been extensively explored in the metric learning literature. In particular, the Large Margin Nearest Neighbor (LMNN) approach tries to ensure that for each image its target neighbors from the same class are closer than samples from other classes. As our work will show, these constraints can also provide a useful regularization term for the cross-view matching task.

From the viewpoint of architecture, our method is similar to the two-branch Deep CCA models, though it avoids Deep CCA’s training-time difficulties associated with covariance matrix estimation. Our network also gains in accuracy by performing feature normalization (L2 and batch normalization) before the embedding loss layer. Finally, our work is related to deep similarity learning, though we are solving a cross-view, not a within-view, matching problem. Siamese networks for similarity learning can be considered as special cases of our framework where the two views come from the same modality and the two branches share weights.

Our proposed approach substantially improves the state of the art for image-to-sentence and sentence-to-image retrieval on the Flickr30K and MSCOCO datasets. We are also able to obtain convincing improvements over CCA on phrase localization for the Flickr30K Entities dataset.

Deep Structure-Preserving Embedding

Let $X$ and $Y$ denote the collections of training images and sentences, each encoded according to their own feature vector representation. We want to map the image and sentence vectors (which may have different dimensions initially) to a joint space of common dimension. We use the inner product over the embedding space to measure similarity, which is equivalent to the Euclidean distance since the outputs of the two embeddings are L2-normalized. In the following, $d(x,y)$ will denote the Euclidean distance between image and sentence vectors in the embedded space.
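The stated equivalence between inner product and Euclidean distance can be checked with a short derivation. It assumes only that both embedded vectors have unit L2 norm, as described above:

```latex
\begin{aligned}
d(x,y)^2 &= \|x - y\|_2^2 = \|x\|_2^2 + \|y\|_2^2 - 2\,x^\top y \\
         &= 2 - 2\,x^\top y \qquad \text{when } \|x\|_2 = \|y\|_2 = 1 .
\end{aligned}
```

Ranking candidates by increasing Euclidean distance therefore gives the same order as ranking them by decreasing inner product.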

We propose to learn a nonlinear embedding in a deep neural network framework. As shown in Figure 1, our deep model has two branches, each composed of fully connected layers with weight matrices $W_l$ and $V_l$. Successive layers are separated by Rectified Linear Unit (ReLU) nonlinearities. We apply batch normalization right after the last linear layer, and add L2 normalization at the end of each branch.
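As a minimal sketch of one branch as described above (linear layer, ReLU, linear layer, batch normalization, L2 normalization), the following pure-Python code runs a toy forward pass. The input values, weights, and dimensions are hypothetical, the batch normalization has no learned scale or shift, and bias terms are omitted; none of this is taken from the authors' implementation.

```python
import math

def linear(batch, weights):
    """Apply a fully connected layer; `weights` has shape in_dim x out_dim."""
    return [[sum(x * w for x, w in zip(row, col)) for col in zip(*weights)]
            for row in batch]

def relu(batch):
    return [[max(0.0, v) for v in row] for row in batch]

def batch_norm(batch, eps=1e-5):
    """Standardize each feature over the mini-batch (no learned scale or shift)."""
    n = len(batch)
    columns = list(zip(*batch))
    means = [sum(col) / n for col in columns]
    variances = [sum((v - m) ** 2 for v in col) / n
                 for col, m in zip(columns, means)]
    return [[(v - m) / math.sqrt(var + eps)
             for v, m, var in zip(row, means, variances)] for row in batch]

def l2_normalize(batch):
    """Scale each row to unit length (assumes no row is all zeros)."""
    return [[v / math.sqrt(sum(u * u for u in row)) for v in row] for row in batch]

def branch_forward(batch, w1, w2):
    """Linear -> ReLU -> linear -> batch norm -> L2 norm, for one branch."""
    hidden = relu(linear(batch, w1))
    return l2_normalize(batch_norm(linear(hidden, w2)))

# Toy 2-D inputs and hand-picked weights (hypothetical values, not the paper's).
features = [[1.0, 2.0], [3.0, 1.0], [0.0, 4.0]]
w1 = [[1.0, -1.0], [-1.0, 1.0]]
w2 = [[1.0, 0.5], [0.5, 1.0]]
embedded = branch_forward(features, w1, w2)
```

Every row of `embedded` has unit length, so inner products between the outputs of the two branches can be used directly as similarities.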

In general, each branch can have a different number of layers, and if the inputs of the two branches $X$ and $Y$ are produced by their own networks, the parameters of those networks can be trained (or fine-tuned) together with the parameters of the embedding layers. However, in this paper, we have obtained very satisfactory results by using two embedding layers per branch on top of pre-computed image and text features (see Section 3.1 for details).

2.2 Training Objective

Our training objective is a stochastic margin-based loss that includes bidirectional cross-view ranking constraints, together with within-view structure-preserving constraints.

Bi-directional ranking constraints. Given a training image $x_i$, let $Y_i^{+}$ and $Y_i^{-}$ denote its sets of matching (positive) and non-matching (negative) sentences, respectively. We want the distance between $x_i$ and each positive sentence $y_j$ to be smaller than the distance between $x_i$ and each negative sentence $y_k$ by some enforced margin $m$:

$$d(x_i, y_j) + m < d(x_i, y_k) \quad \forall y_j \in Y_i^{+},\ \forall y_k \in Y_i^{-}. \tag{1}$$

Similarly, given a sentence $y_{i'}$, we have

$$d(x_{j'}, y_{i'}) + m < d(x_{k'}, y_{i'}) \quad \forall x_{j'} \in X_{i'}^{+},\ \forall x_{k'} \in X_{i'}^{-}, \tag{2}$$

where $X_{i'}^{+}$ and $X_{i'}^{-}$ denote the sets of matching (positive) and non-matching (negative) images for $y_{i'}$.

Structure-preserving constraints. Let $N(x_i)$ denote the neighborhood of $x_i$ containing images that share the same meaning. In our case, this is the set of images described by the same sentence as $x_i$. Then we want to enforce a margin of $m$ between $N(x_i)$ and any point outside of the neighborhood:

$$d(x_i, x_j) + m < d(x_i, x_k) \quad \forall x_j \in N(x_i),\ \forall x_k \notin N(x_i). \tag{3}$$

Analogously, for the sentence side we require

$$d(y_{i'}, y_{j'}) + m < d(y_{i'}, y_{k'}) \quad \forall y_{j'} \in N(y_{i'}),\ \forall y_{k'} \notin N(y_{i'}), \tag{4}$$

where $N(y_{i'})$ contains sentences describing the same image.

Figure 2 gives an intuitive illustration of how within-view structure preservation can help with cross-view matching. The embedding space on the left satisfies the cross-view matching property. That is, each square (representing an image) is closer to all circles of the same color (representing its corresponding sentences) than to any circles of the other color. Similarly, for any circle (sentence), the closest square (image) has the same color. However, for the new image query (white square), the embedding space gives an ambiguous matching result since both red and blue circles are very close to it. This problem is mitigated in the embedding on the right, where within-view structure constraints are added, pushing semantically similar sentences (same color circles) closer to each other.

Note that our two image-sentence datasets, Flickr30K and MSCOCO, consist of images paired with five sentences each. The neighborhood of each image, $N(x_i)$, generally only contains $x_i$ itself, since it is rare for two different images to be described by an identical sentence. Thus, the image-view constraints (eq. 3) are trivial, while the neighborhood of each sentence $N(y_{i'})$ has five members. However, for the region-phrase dataset of Section 3.3, many phrases have multiple region exemplars, so we get a non-trivial set of constraints for the image view.

Embedding Loss Function. We convert the constraints to our training objective in the standard way using hinge loss. The resulting loss function is given by

$$\begin{split}L(X,Y)=&\sum_{i,j,k}\max[0,\,m+d(x_{i},y_{j})-d(x_{i},y_{k})]\\ +&\lambda_{1}\sum_{i',j',k'}\max[0,\,m+d(x_{j'},y_{i'})-d(x_{k'},y_{i'})]\\ +&\lambda_{2}\sum_{i,j,k}\max[0,\,m+d(x_{i},x_{j})-d(x_{i},x_{k})]\\ +&\lambda_{3}\sum_{i',j',k'}\max[0,\,m+d(y_{i'},y_{j'})-d(y_{i'},y_{k'})]\,,\end{split} \tag{5}$$

where the sums are over all triplets defined as in the constraints (1-4). The margin $m$ could be different for different types of distance or even different instances. But to make it easy to optimize, we fix $m$ for all terms across all training samples ($m=0.1$ in the experiments). The weight $\lambda_{1}$ balances the strengths of both ranking terms. In other work with a bi-directional ranking loss, this is always set to 1, but in our case, we found that $\lambda_{1}=2$ produces the best results. The weights $\lambda_{2},\lambda_{3}$ control the importance of the structure-preserving terms, which act as regularizers for the bi-directional retrieval tasks. We usually set both to small values like 0.1 or 0.2 (see Section 3 for details).
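To make the four terms of eq. (5) concrete, here is a small sketch that evaluates the loss on a hand-made example with two images and three sentences (sentences 0 and 1 describe image 0, sentence 2 describes image 1). The distance values are invented for illustration, and the helper `hinge_sum` is a hypothetical name rather than part of any released code. The weights follow the values quoted in the text, with $\lambda_2 = 0$ because no two images share a sentence in this toy set.

```python
def hinge_sum(dist, positives, negatives, margin):
    """Sum max(0, m + d(a, p) - d(a, n)) over anchors a, positives p, negatives n."""
    return sum(
        max(0.0, margin + dist[a][p] - dist[a][n])
        for a in range(len(dist))
        for p in positives[a]
        for n in negatives[a]
    )

m, lam1, lam2, lam3 = 0.1, 2.0, 0.0, 0.2

# d_xy[i][j]: distance between image i and sentence j (made-up values).
d_xy = [[0.2, 0.3, 0.35],
        [0.9, 0.8, 0.3]]
d_yx = [list(col) for col in zip(*d_xy)]  # same distances, indexed by sentence
# d_yy[i][j]: distance between sentences i and j (made-up values).
d_yy = [[0.0, 0.4, 0.45],
        [0.4, 0.0, 0.6],
        [0.45, 0.6, 0.0]]

# Term 1: each image against its positive and negative sentences.
img_to_sent = hinge_sum(d_xy, [[0, 1], [2]], [[2], [0, 1]], m)
# Term 2: each sentence against its positive and negative images.
sent_to_img = hinge_sum(d_yx, [[0], [0], [1]], [[1], [1], [0]], m)
# Term 3: image-view structure; empty here because every N(x_i) is trivial.
img_struct = 0.0
# Term 4: sentence-view structure; sentences 0 and 1 are neighbors.
sent_struct = hinge_sum(d_yy, [[1], [0], []], [[2], [2], [0, 1]], m)

loss = img_to_sent + lam1 * sent_to_img + lam2 * img_struct + lam3 * sent_struct
```

In this example only one triplet per active term violates the margin (each by 0.05), so the total is 0.05 + 2(0.05) + 0.2(0.05) = 0.16.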

Triplet sampling. Our loss involves all triplets consisting of a target instance, a positive match, and a negative match. Optimizing over all such triplets is computationally infeasible. Therefore, we sample triplets within each mini-batch and optimize our loss function using SGD. Inspired by previous work, instead of choosing the most violating negative match over the entire instance space, we select the top $K$ most violated matches in each mini-batch. This is done by computing pairwise similarities between all $(x_i, y_j)$, $(x_i, x_j)$ and $(y_i, y_j)$ within the mini-batch. For each positive pair (i.e., a ground truth image-sentence pair, two neighboring images, or two neighboring sentences), we then find at most the top $K$ violations of each relevant constraint (we use $K=50$ in the implementation, although most pairs have many fewer violations). Theoretical guarantees of such a sampling strategy have been discussed previously, though not in the context of deep learning. In our experiments, we observe convergence within 30 epochs on average.
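The top-$K$ selection for a single positive pair can be sketched as follows. The function name, the distance values, and the choice to rank violations by the size of the hinge term are assumptions made for illustration; the text only states that at most $K$ violations are kept per positive pair.

```python
def top_k_violations(dist_row, pos, negatives, margin, k):
    """Return up to k negatives that most violate the margin for one (anchor, positive) pair.

    dist_row[c] is the distance from the anchor to candidate c.
    """
    scored = [(margin + dist_row[pos] - dist_row[n], n) for n in negatives]
    violated = [(v, n) for v, n in scored if v > 0]  # keep only active hinge terms
    violated.sort(reverse=True)                      # largest violation first
    return [n for _, n in violated[:k]]

# Anchor-to-candidate distances within a mini-batch (made-up values).
dist_row = [0.2, 0.25, 0.9, 0.15, 0.28]
selected = top_k_violations(dist_row, pos=0, negatives=[1, 2, 3, 4], margin=0.1, k=2)
```

Here candidates 1, 3, and 4 violate the margin, candidate 2 does not, and the two largest violations come from candidates 3 and 1.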

In Section 3, we will demonstrate the performance of our method both with and without structure-preserving constraints. For training the network without these constraints, we randomly sample 1500 pairs $(x_i, y_i)$ to form our mini-batches. For the experiments with the structure-preserving constraints, in order to get a non-empty set of constraint triplets, we need a moderate number of positive pairs (i.e., at least two sentences that are matched to the same image) in each mini-batch. However, random sampling of pairs cannot guarantee this. Therefore, for each $x_i$ in a given mini-batch, we add one more positive sentence distinct from the ones that may already be included among the sampled pairs, resulting in mini-batches of variable size.
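The mini-batch augmentation described above might look roughly like the sketch below. The data layout (a dictionary from image id to its sentence ids) and the function name `augment_batch` are hypothetical, and picking the extra sentence uniformly at random is an assumption.

```python
import random

def augment_batch(pairs, sentences_of, rng):
    """Add one extra positive sentence per image that is not already in the batch."""
    batch = list(pairs)
    present = set(batch)
    for image in sorted({img for img, _ in pairs}):
        candidates = [s for s in sentences_of[image] if (image, s) not in present]
        if candidates:  # nothing to add if all sentences are already present
            extra = rng.choice(candidates)
            batch.append((image, extra))
            present.add((image, extra))
    return batch

# Two images with five sentences each, as in Flickr30K and MSCOCO.
sentences_of = {0: [0, 1, 2, 3, 4], 1: [5, 6, 7, 8, 9]}
sampled_pairs = [(0, 0), (1, 7)]
batch = augment_batch(sampled_pairs, sentences_of, random.Random(0))
```

After augmentation, every image in the batch is paired with at least two distinct sentences, so the sentence-view constraints have triplets to draw from.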

Experiments

In this section, we analyze the contributions of different components of our method and evaluate it on image-to-sentence and sentence-to-image retrieval on the popular Flickr30K and MSCOCO datasets, and on phrase localization on the new Flickr30K Entities dataset.

In image-sentence retrieval experiments, to represent images, we follow previously published implementation details. Given an image, we extract the 4096-dimensional activations from the 19-layer VGG model. Following standard procedure, the original $256\times 256$ image is cropped in ten different ways into $224\times 224$ images: the four corners, the center, and their x-axis mirror images. The mean intensity is then subtracted from each color channel, the resulting images are encoded by the network, and the network outputs are averaged.
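The ten-crop layout can be enumerated with a few lines of code. This sketch only computes crop coordinates and a mirror flag; the tuple layout and function name are hypothetical, and the actual cropping, mean subtraction, and VGG encoding are not shown.

```python
def ten_crop_boxes(size=256, crop=224):
    """Return (left, top, right, bottom, mirrored) for four corners and center, plus mirrors."""
    far = size - crop   # offset of the right/bottom corner crops
    mid = far // 2      # offset of the center crop
    origins = [(0, 0), (far, 0), (0, far), (far, far), (mid, mid)]
    boxes = []
    for left, top in origins:
        for mirrored in (False, True):
            boxes.append((left, top, left + crop, top + crop, mirrored))
    return boxes

boxes = ten_crop_boxes()
```

For a 256-pixel image and 224-pixel crops, the corner crops start at offsets 0 or 32 and the center crop starts at offset 16 in both directions.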

To represent sentences and phrases, we primarily use the Fisher vector (FV) representation as suggested by Klein et al. Starting with 300-dimensional word2vec vectors of the sentence words, we apply ICA as in that work and construct a codebook with 30 centers using both first- and second-order information, resulting in sentence features of dimension $300 \times 30 \times 2 = 18000$. We only use the Hybrid Gaussian-Laplacian mixture model (HGLMM) for our experiments rather than the combined HGLMM+GMM model, which obtained the best performance in that work. To save memory and training time, we perform PCA on these 18000-dimensional vectors to reduce them to 6000 dimensions. PCA also makes the original features less sparse, which is good for the numerical stability of our training procedure.

Since FV is already a powerful hand-crafted nonlinear transformation of the original sentences, we are also interested in exploring the effectiveness of our approach on top of simpler text representations. To this end, we include results on 300-dimensional means of word2vec vectors of words in each sentence/phrase, and on tf-idf-weighted bag-of-words vectors. For tf-idf, we pre-process all the sentences with WordNet’s lemmatizer and remove stop words. For the Flickr30K dataset, our dictionary size (and descriptor dimensionality) is 3000, and for MSCOCO, it is 5600.

For our experiments using tf-idf or FV text features, we set the embedding dimension to be 512. On the image ($X$) side, when using 4096-dimensional visual features, $W_1$ is a $4096\times 2048$ matrix, and $W_2$ is a $2048\times 512$ matrix. That is, the output dimensions of the two layers are 2048 and 512. On the text ($Y$) side, the $V_1$ and $V_2$ layers map the text features to the same 512-dimensional embedding space. For the experiments using 300-D word2vec features, we use a lower dimension (256) for the embedding space, and the intermediate layer outputs are changed accordingly.

We train our networks using SGD with momentum 0.9 and weight decay 0.0005. We use a small learning rate starting with 0.1 and decay the learning rate by 0.1 after every 10 epochs. To accelerate the training and also make gradient updates more stable, we apply batch normalization right after the last linear layer of both network branches. We also use a Dropout layer after ReLU with probability = 0.5. We set the mini-batch size to 1500 ground truth image-sentence pairs and augment these pairs as necessary as described in the previous section. Compared with CCA-based methods, our method has much smaller memory requirements and is scalable to larger amounts of data.
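One plausible reading of the learning rate schedule above is a step schedule that multiplies the rate by 0.1 every 10 epochs. The sketch below encodes that reading. The multiplicative interpretation and the function name are assumptions, since the text does not spell out the update rule.

```python
def learning_rate(epoch, base_lr=0.1, decay=0.1, step=10):
    """Step schedule: multiply the base rate by `decay` once every `step` epochs."""
    return base_lr * decay ** (epoch // step)

# Rates at the start of training and after each decay step.
schedule = [learning_rate(e) for e in (0, 10, 20)]
```

Under this reading the rate is 0.1 for epochs 0 to 9, about 0.01 for epochs 10 to 19, and about 0.001 for epochs 20 to 29, which covers the roughly 30 epochs of training mentioned earlier.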

3.2 Image-sentence retrieval

In this section, we report results on image-to-sentence and sentence-to-image retrieval on the standard Flickr30K and MSCOCO datasets. Flickr30K consists of 31783 images accompanied by five descriptive sentences each. The larger MSCOCO dataset consists of 123000 images, also with five sentences each.

For evaluation, we follow the same protocols as other recent work. For Flickr30K, given a test set of 1000 images and 5000 corresponding sentences, we use the images to retrieve sentences and vice versa, and report performance as Recall@$K$ ($K=1,5,10$), or the percentage of queries for which at least one correct ground truth match was ranked among the top $K$ matches. For MSCOCO, consistent with prior work, we also report results on 1000 test images and their corresponding sentences.
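The Recall@$K$ definition above translates directly into code. The following sketch uses made-up rankings for four queries; the function name and data layout are hypothetical.

```python
def recall_at_k(ranked_lists, ground_truth, k):
    """Percentage of queries with at least one correct item among the top k results."""
    hits = sum(
        1 for ranking, correct in zip(ranked_lists, ground_truth)
        if any(item in correct for item in ranking[:k])
    )
    return 100.0 * hits / len(ranked_lists)

# Each ranking lists candidate ids from best to worst for one query.
rankings = [[0, 1, 2], [2, 1, 0], [0, 1, 2], [1, 2, 0]]
# Each set holds the ids of the correct matches for the corresponding query.
correct = [{0}, {1}, {2}, {0}]

r1 = recall_at_k(rankings, correct, 1)
r2 = recall_at_k(rankings, correct, 2)
r3 = recall_at_k(rankings, correct, 3)
```

For image-to-sentence retrieval each `correct` set would hold the five ground truth sentences of the query image, while for sentence-to-image retrieval it would hold a single image.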

For Flickr30K, bidirectional retrieval results are listed in Table 1. Part (a) of the table summarizes the performance reported by a number of competing recent methods. In Part (b) we demonstrate the impact of different components of our model by reporting results for the following variants.

Linear + one-directional: In this setting, we keep only the first layers in each branch with parameters $W_1, V_1$, immediately followed by L2 normalization. The output dimensions of $W_1$ and $V_1$ are changed to be the embedding space dimension. In the objective function (eq. 5), we set $\lambda_1=0, \lambda_2=0, \lambda_3=0$, only retaining the image-to-sentence ranking constraints. This results in a model similar to WSABIE.

Linear + bi-directional: The model structure is as above, and in eq. (5), we set $\lambda_1=2, \lambda_2=0, \lambda_3=0$. This form of embedding is similar to previous bi-directional ranking models (though the details of the representations used by those works are quite different).

Linear + bi-directional + structure: same linear model, eq. (5) with $\lambda_1=2, \lambda_2=0, \lambda_3=0.2$.

Nonlinear + one-directional: Network as in Figure 1, eq. (5) with $\lambda_1=0, \lambda_2=0, \lambda_3=0$.

Nonlinear + bi-directional: Network as in Figure 1, eq. (5) with $\lambda_1=2, \lambda_2=0, \lambda_3=0$.

Nonlinear + bi-directional + structure: Network as in Figure 1, eq. (5) with $\lambda_1=2, \lambda_2=0, \lambda_3=0.2$.

Note that in all the above configurations we have $\lambda_2=0$, that is, the structure-preserving constraint associated with the image space is inactive, since in the Flickr30K and MSCOCO datasets we do not have direct supervisory information about multiple images that can be described by the same sentence. However, our results for the region-phrase dataset of Section 3.3 will incorporate structure-preserving constraints on both spaces.
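The six variants listed above differ only in the number of layers per branch and in the loss weights, so they can be summarized as a small configuration table. The dictionary keys and the helper `active_terms` are hypothetical names chosen for this sketch; the numeric values are the ones given in the list.

```python
VARIANTS = {
    "linear + one-directional":
        {"layers": 1, "lambda1": 0.0, "lambda2": 0.0, "lambda3": 0.0},
    "linear + bi-directional":
        {"layers": 1, "lambda1": 2.0, "lambda2": 0.0, "lambda3": 0.0},
    "linear + bi-directional + structure":
        {"layers": 1, "lambda1": 2.0, "lambda2": 0.0, "lambda3": 0.2},
    "nonlinear + one-directional":
        {"layers": 2, "lambda1": 0.0, "lambda2": 0.0, "lambda3": 0.0},
    "nonlinear + bi-directional":
        {"layers": 2, "lambda1": 2.0, "lambda2": 0.0, "lambda3": 0.0},
    "nonlinear + bi-directional + structure":
        {"layers": 2, "lambda1": 2.0, "lambda2": 0.0, "lambda3": 0.2},
}

def active_terms(config):
    """List the terms of eq. (5) that contribute to the loss for a given variant."""
    terms = ["image-to-sentence"]  # the first term has no weight and is always on
    if config["lambda1"] > 0:
        terms.append("sentence-to-image")
    if config["lambda2"] > 0:
        terms.append("image structure")
    if config["lambda3"] > 0:
        terms.append("sentence structure")
    return terms

full_model_terms = active_terms(VARIANTS["nonlinear + bi-directional + structure"])
```

Laying the variants out this way makes it easy to see that the image-structure term is never active in the image-sentence experiments.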

From Table 1 (b), we can see that changing the embedding function from linear to nonlinear improves the accuracy by about 4% across the board. Going from one-directional to bi-directional constraints improves the accuracy by 1-2% for image-to-sentence retrieval and by a bigger amount for sentence-to-image retrieval. Finally, adding the structure-preserving constraints provides an additional improvement of 1-2% in both linear and nonlinear cases. The methods from Table 1 (a) most comparable to ours are the CCA (HGLMM) results, since they use the same underlying feature representation with linear CCA. Our linear model with all the constraints of eq. (5) does not outperform linear CCA, but our nonlinear one does.

Finally, to check how much our method relies on the power of the input features, parts (c) and (d) of Table 1 report results for our nonlinear models with and without structure-preserving constraints applied on top of weaker text representations, namely mean of word2vec vectors of the sentence and tf-idf vectors, as described in Section 3.1. Once again, we can see that structure-preserving constraints give us an additional improvement. Our results with the mean vector are considerably better than previously reported CCA results on the same feature, and are in fact comparable with previously reported results on top of the more powerful FV representation. For tf-idf, we achieve results that are just below our best FV results, showing that we do not require a highly nonlinear feature as an input in order to learn a good embedding. Another possible reason why tf-idf performs so strongly may be that word2vec features are pre-trained on an unrelated text corpus, so they may not be as well adapted to our specific data.

For MSCOCO, results on 1000 test images are listed in Table 2. The trends are the same as in Table 1: adding structure-preserving constraints on the sentence space consistently improves performance, and our results with the FV text feature considerably exceed the state of the art. We have also tried fine-tuning the VGG network by back-propagating our loss function through all the VGG layers, and obtained about 0.5% additional improvement.

3.3 Phrase Localization on Flickr30K Entities

The recently published Flickr30K Entities dataset allows us to learn correspondences between phrases and image regions. Specifically, the annotations in this dataset provide links from 244K mentions of distinct entities in sentences to 276K ground truth bounding boxes (some entities consist of multiple instances, such as “group of people”). We are interested in this dataset because unlike the global image-sentence datasets, it provides many-to-many correspondences, i.e., each region may be described by multiple phrases and each phrase may have multiple region exemplars across multiple images. This allows us to take advantage of structure-preserving constraints on both the visual and textual spaces.

As formulated in previous work, the goal of phrase localization is to predict a bounding box in an image for each entity mention (noun phrase) from a caption that goes with that image. For a particular phrase, we perform the search by extracting 100 EdgeBox region proposals and scoring them using our embedding. To get good performance, the best-scoring box should have high overlap with the ground truth region. This can be considered as a ranking problem, and both CCA and our methods can be trained to match phrases and regions. On the other hand, this problem also resembles detection, where the algorithm should be able to distinguish foreground objects from boxes that contain only background or poorly localized objects. CCA and Deep CCA are not well suited to this scenario, since there is no way to add negative boxes into their learning stage. However, our margin-based loss function makes it possible.

Plummer et al. reported baseline results for a region-phrase embedding using CCA on top of ImageNet-trained VGG features. Following Rohrbach et al., who obtained big improvements on phrase localization using detection-based VGG features, we also use Fast R-CNN features fine-tuned on a union of the PASCAL 2007 and 2012 train-val sets. Consistent with prior work, we do not average multiple crops for region features. For text, in this section we use only the FV feature. Thus, the input dimension of $X$ is 4096 and the input dimension of $Y$ is 6000 as before (reduced by PCA from the original 18000-D FV). We use the two-layer network structure with the same intermediate layer dimensions on both the $X$ and $Y$ sides (note that on the $X$ side, the intermediate layer actually doubles the feature dimension).

For our first experiment, we train our embedding without negative mining, using the same positive region-phrase pairs as CCA. For this, we use the same training set as the CCA baseline, which is resampled with at most ten regions per phrase, for a total of 137133 region-phrase pairs, 70759 of which are unique. As in the previous section, we use an initial mini-batch size of 1500. But now, for the full version of our objective (eq. 5), we augment the mini-batches by sampling not only additional positive phrases for regions, but also additional positive regions for phrases, to make sure that we have as many triplets as possible for structure-preserving constraints on the region side (eq. 3) and the phrase side (eq. 4).

The results of training our model without negative mining for 28 epochs are shown in the top part of Table 3. We follow a previously proposed evaluation protocol. First, we treat phrase localization as the problem of retrieving instances of a query phrase from a set of region proposals extracted from test images, and report Recall@$K$, or the percentage of queries for which a correct match has rank of at most $K$ (a region proposal is considered to be a correct match if it has IOU of at least 0.5 with the ground-truth bounding box for that phrase). Second, we report average precision (AP) of ranking bounding boxes for each phrase in the test images that contain that phrase, following non-maximum suppression. The last column of Table 3 shows mAP over all unique phrases in the test set, with each unique phrase being treated as its own class label.
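The IOU criterion used to decide whether a region proposal is a correct match can be computed as in the sketch below. The box format `(x1, y1, x2, y2)` and the example coordinates are assumptions made for illustration.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def is_correct_match(proposal, ground_truth, threshold=0.5):
    """A proposal counts as correct if its IOU with the ground truth is at least 0.5."""
    return iou(proposal, ground_truth) >= threshold

ground_truth = (0, 0, 10, 10)
half_box = (0, 0, 10, 5)        # covers the top half of the ground truth
shifted_box = (5, 5, 15, 15)    # overlaps only one quarter of the ground truth
```

The first proposal has an IOU of exactly 0.5 and is accepted, while the second has an IOU of 25/175 and is rejected.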

Table 3 (a-d) shows the performance of our bi-directional ranking objective with different combinations of structure terms. We can see that including the structure terms generally gives better results than excluding them, though the effects of turning on each term separately do not differ too much. In large part, this is because of the limited number of structure-preserving constraint triplets for each view. In the Flickr30K Entities training set, for all 130K pairs, there are around 70K unique phrases and 80K regions described by a single phrase. This means that for most phrases/regions, there are no more than two corresponding regions/phrases. The top line of Table 3 gives baseline CCA results. For the pre-trained model without using negative mining, our deep embedding has comparable results with CCA on Recall@5 and Recall@10, but lower results on Recall@1. As mentioned earlier, in our past experience we have found CCA to be surprisingly hard to beat with more complex methods.

In order to further improve the accuracy of our embedding, we need to refine it using negative data from background and poorly localized regions. To do this, we take the embedding trained without negative mining, and for each unique phrase in the training set, calculate the distance between this phrase and the ground truth boxes as well as all our proposal boxes. Then we record those “hard negative” boxes that are closer to the phrase than the ground truth boxes. For efficiency, we only sample at most 50 hard negative regions for each unique phrase. Next, we continue training our region-phrase model on a training set augmented with these hard negative boxes, using only the bi-directional ranking constraints (eqs. 1 and 2). We exclude the structure-preserving constraints because they would now be even more severely outnumbered by the bi-directional ranking constraints.
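The hard negative selection step for one phrase can be sketched as follows. Several details are assumptions, because the text leaves them open: proposals are compared against the closest ground truth box, the hardest (closest) negatives are kept when more than 50 qualify, and proposals that overlap the ground truth well are assumed to have been filtered out beforehand. The function name and distance values are hypothetical.

```python
def mine_hard_negatives(gt_distances, proposal_distances, max_negatives=50):
    """Return indices of proposals that are closer to the phrase than its ground truth.

    gt_distances: distances from the phrase to its ground truth boxes.
    proposal_distances: distances from the phrase to candidate negative proposals.
    """
    threshold = min(gt_distances)  # assumption: compare with the closest ground truth box
    hard = [(d, idx) for idx, d in enumerate(proposal_distances) if d < threshold]
    hard.sort()                    # assumption: keep the closest (hardest) proposals first
    return [idx for _, idx in hard[:max_negatives]]

gt_distances = [0.4, 0.5]
proposal_distances = [0.3, 0.6, 0.35, 0.45, 0.1]
all_hard = mine_hard_negatives(gt_distances, proposal_distances)
capped = mine_hard_negatives(gt_distances, proposal_distances, max_negatives=2)
```

The selected proposals would then be added as negatives when continuing training with only the bi-directional ranking constraints.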

The last four lines of Table 3 show the results of fine-tuning the models from Table 3 (a-d) with hard negative samples. Compared to the best model trained with only positive regions, our Recall@1 and mAP have improved by almost 6%, and are now considerably better than CCA. Note that in absolute terms, Rohrbach et al. get higher results, with an R@1 of over 47%, but they use a much more complex method that includes LSTMs with a phrase reconstruction objective.

Finally, Figure 3 shows examples of phrase localization in four images where our model improves upon the CCA baseline.

Conclusion

This paper has proposed an image-text embedding method in which a two-branch network with multiple layers is trained using a margin-based objective function consisting of bi-directional ranking terms and structure-preserving terms inspired by metric learning. Our architecture is simple and flexible, and can be applied to various kinds of visual and textual features. Extensive experiments demonstrate that the components of our system are well chosen and all the terms in our objective function are justified. To the best of our knowledge, our retrieval results on Flickr30K and MSCOCO datasets considerably exceed the state of the art, and we also demonstrate convincing improvements over CCA on the new problem of phrase localization on the Flickr30K Entities dataset.

Acknowledgments

This material is based upon work supported by the National Science Foundation under Grant CIF-1302438, Xerox UAC, and the Sloan Foundation. We would like to thank Bryan Plummer for help with phrase localization evaluation.

References