Predicting Visual Features from Text for Image and Video Caption Retrieval

Jianfeng Dong, Xirong Li, Cees G. M. Snoek

I Introduction

This paper attacks the problem of image and video caption retrieval, i.e., finding amidst a set of possible sentences the one best describing the content of a given image or video. Before the advent of deep learning based approaches to feature extraction, an image or video was typically represented by a bag of quantized local descriptors (known as visual words) while a sentence was represented by a bag of words. These hand-crafted features do not represent the visual and lingual modalities well, and are not directly comparable. Hence, feature transformations are performed on both sides to learn a common latent subspace where the two modalities are better represented and a cross-modal similarity can be computed. This tradition continues, as the prevailing image and video caption retrieval methods prefer to represent the visual and lingual modalities in a common latent subspace. Like others before us, we consider caption retrieval an important enough problem by itself, and we question the dependence on latent subspace solutions. For image retrieval by caption, recent evidence shows that a one-way mapping from the visual to the textual modality outperforms the state-of-the-art subspace based solutions. Our work shares a similar spirit but targets the opposite direction, i.e., image and video caption retrieval. Our key novelty is that we find the most likely caption for a given image or video by looking for their similarity in the visual feature space exclusively, as illustrated in Fig. 1.

From the visual side we are inspired by the recent progress in predicting images from text. We also depart from the text, but instead of predicting pixels, our model predicts visual features. We consider features from deep convolutional neural networks (ConvNet). These neural networks learn a textual class prediction for an image by successive layers of convolutions, non-linearities, pooling, and full connections, with the aid of large amounts of labeled images, e.g., ImageNet. Apart from classification, visual features derived from the layers of these networks are superior representations for various challenges in vision and multimedia. We also rely on a layered neural network architecture, but rather than predicting a class label for an image, we strive to predict a deep visual feature from a natural language description for the purpose of caption retrieval.

From the lingual side we are inspired by the encouraging progress in sentence encoding by neural language modeling for cross-modal matching. In particular, word2vec pre-trained on large-scale text corpora provides distributed word embeddings, an important prerequisite for vectorizing sentences towards a representation shared with image or video. In prior work, a sentence is fed as a word sequence into a recurrent neural network (RNN). The RNN output at the last time step is taken as the sentence feature, which is further projected into a latent subspace. We employ word2vec and RNN as part of our sentence encoding strategy as well. What is different is that we continue to transform the encoding into a higher-dimensional visual feature space via a multi-layer perceptron. As we predict visual features from text, we call our approach Word2VisualVec. While both visual and textual modalities are used during training, Word2VisualVec performs a mapping from the textual to the visual modality. Hence, at run time, Word2VisualVec allows the caption retrieval to be performed in the visual space.

We make the following three contributions in this paper:

$\bullet$ First, to the best of our knowledge we are the first to solve the caption retrieval problem in the visual space. We consider this counter-tradition approach promising thanks to the effectiveness of deep learning based visual features which are continuously improving. For cross-modal matching, we consider it beneficial to rely on the visual space, instead of a joint space, as it allows us to learn a one-way mapping from natural language text to the visual feature space, rather than a more complicated joint space.

$\bullet$ Second, we propose Word2VisualVec to effectively realize the above proposal. Word2VisualVec is a deep neural network based on multi-scale sentence vectorization and a multi-layer perceptron. While its components are known, we consider their combined usage in our overall system novel and effective to transform a natural language sentence into a visual feature vector. We consider prediction of several recent visual features based on text, but the approach is general and can, in principle, predict any deep visual feature it is trained on.

$\bullet$ Third, we show how Word2VisualVec can be easily generalized to the video domain, by predicting from text both 3-D convolutional neural network features as well as a visual-audio representation including Mel Frequency Cepstral Coefficients.

Experiments on Flickr8k, Flickr30k, the Microsoft Video Description dataset and the very recent NIST TrecVid challenge for video caption retrieval detail Word2VisualVec’s properties, its benefit over the word2vec textual embedding, the potential for multimodal query composition and its state-of-the-art results.

Before detailing our approach, we first review related work.

II Related Work

Prior to deep visual features, methods for image caption retrieval often resort to relatively complicated models to learn a shared representation to compensate for the deficiency of traditional low-level visual features. Hodosh et al. leverage Kernel Canonical Correlation Analysis (CCA), finding a joint embedding by maximizing the correlation between the projected image and text kernel matrices. With deep visual features, we observe an increased use of relatively light embeddings on the image side. Using the fc6 layer of a pre-trained AlexNet as the image feature, Gong et al. show that linear CCA compares favorably to its kernel counterpart. Linear CCA is also adopted by Klein et al. for visual embedding. More recent models utilize affine transformations to reduce the image feature to a much shorter $h$-dimensional vector, with the transformation optimized in an end-to-end fashion within a deep learning framework.

Similar to the image domain, the state-of-the-art methods for video caption retrieval also operate in a shared subspace. Xu et al. propose to vectorize each subject-verb-object triplet extracted from a given sentence by a pre-trained word2vec, and subsequently aggregate the vectors into a sentence-level vector by a recursive neural network. A joint embedding model projects both the sentence vector and the video feature vector, obtained by temporal pooling over frame-level features, into a latent subspace. Otani et al. improve upon this model by exploiting web image search results of an input sentence, which are deemed helpful for word disambiguation, e.g., telling if the word “keyboard” refers to a musical instrument or an input device for computers. To learn a common multimodal representation for videos and text, Yu et al. use two distinct Long Short Term Memory (LSTM) modules to encode the video and text modalities respectively. They then employ a compact bilinear pooling layer to capture implicit interactions between the two modalities.

Different from the existing works, we propose to perform image and video caption retrieval directly in the visual space. This change is important as it allows us to completely remove the learning part from the visual side and focus our energy on learning an effective mapping from natural language text to the visual feature space.

II-B Sentence Vectorization

To convert variably-sized sentences to fixed-sized feature vectors for subsequent learning, bag-of-words (BoW) is arguably the most popular choice. A BoW vocabulary has to be prespecified based on the availability of words describing the training images. As collecting image-sentence pairs at a large scale is both labor intensive and time consuming, the number of words covered by BoW is bounded. To overcome this limit, a distributional text embedding provided by word2vec is gaining increased attention. The word embedding matrix used in prior work is instantiated by a word2vec model pre-trained on large-scale text corpora. In Frome et al., for instance, the input text is vectorized by averaging the word2vec vectors of its words. Such a mean pooling strategy results in a dense representation that could be less discriminative than the initial BoW feature. As an alternative, Klein et al. and their follow-up perform Fisher vector pooling over word vectors.

Besides BoW and word2vec, we observe an increased use of RNN-based sentence vectorization. Socher et al. design a Dependency-Tree RNN that learns vector representations for sentences based on their dependency trees. Lev et al. propose RNN Fisher vectors, replacing the Gaussian model of the Fisher vector by an RNN model that takes into account the order of elements in the sequence. Kiros et al. employ an LSTM to encode a sentence, using the LSTM’s hidden state at the last time step as the sentence feature. In a follow-up work, Vendrov et al. replace LSTM by a Gated Recurrent Unit (GRU), which has fewer parameters to tune. While RNN and its LSTM or GRU variants have demonstrated promising results for generating visual descriptions, they tend to be over-sensitive to word order by design. Indeed, Socher et al. suggest that for caption retrieval, models invariant to surface changes, such as word order, perform better.

In order to jointly exploit the merits of the BoW, word2vec and RNN based representations, we consider in this paper multi-scale sentence vectorization. Ma et al. have made a first attempt in this direction. In their approach three multimodal ConvNets are trained on feature maps, formed by merging the image embedding vector with word, phrase and sentence embedding vectors. The relevance between an image and a sentence is estimated by late fusion of the individual matching scores. By contrast, we perform multi-scale sentence vectorization in an early stage, by merging BoW, word2vec and GRU sentence features and letting the model figure out the optimal way for combining them. Moreover, at run time the multi-modal network by Ma et al. requires a query image to be paired with each of the test sentences as the network input. By contrast, our Word2VisualVec model predicts visual features from text alone, meaning the vectorization can be precomputed, an advantageous property for caption retrieval on large-scale image and video datasets.

III Word2VisualVec

Multi-scale sentence vectorization. To handle sentences of varying length, we choose to first vectorize each sentence. We propose multi-scale sentence vectorization that utilizes BoW, word2vec and RNN based text encodings.

BoW is a classical text encoding method. Each dimension in a BoW vector corresponds to the occurrence of a specific word in the input sentence, i.e.,

$$s_{bow}(q)=(c(w_{1},q),c(w_{2},q),\ldots,c(w_{m},q)),\tag{1}$$

where $c(w,q)$ returns the occurrence of word $w$ in $q$, and $m$ is the size of a prespecified vocabulary. A drawback of BoW is that its vocabulary is bounded by the words used in the multi-modal training data, which is at a relatively small scale compared to a text corpus containing millions of words. Given faucet as a novel word, for example, “A little girl plays with a faucet” will not have the main object encoded in its BoW vector. Notice that setting a large vocabulary for BoW is unhelpful, as words without training images will always have zero value and thus will not be effectively modeled. To compensate for such a loss, we further leverage word2vec. By learning from a large-scale text corpus, the vocabulary of word2vec is much larger than its BoW counterpart. We obtain the embedding vector of the sentence by mean pooling over its words, i.e.,

$$s_{word2vec}(q)=\frac{1}{|q|}\sum_{w\in q}v(w),\tag{2}$$

where $v(w)$ denotes the embedding vector of word $w$ and $|q|$ is the sentence length. Previous works employ word2vec trained on web documents as their word embedding matrix. However, recent studies suggest that word2vec trained on Flickr tags better captures visual relationships than its counterpart learned from web documents. We therefore train a 500-dimensional word2vec model on English tags of 30 million Flickr images, using the skip-gram algorithm. This results in a vocabulary of 1.7 million words.
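The two order-free encodings described above can be sketched in a few lines of plain Python. This is a minimal illustration rather than the authors' implementation: the function names, the toy vocabulary and the 2-dimensional embeddings are made up, and words without an embedding are simply skipped when averaging, which is an assumption.

```python
from collections import Counter


def bow_vector(sentence, vocabulary):
    """Count how often each vocabulary word occurs in the sentence."""
    counts = Counter(sentence.lower().split())
    return [counts[w] for w in vocabulary]


def mean_pooled_embedding(sentence, embeddings, dim):
    """Average the embedding vectors of the words that have one.

    Assumption: out-of-vocabulary words are skipped instead of
    contributing a zero vector.
    """
    vectors = [embeddings[w] for w in sentence.lower().split() if w in embeddings]
    if not vectors:
        return [0.0] * dim
    return [sum(column) / len(vectors) for column in zip(*vectors)]


# Toy BoW vocabulary that does not contain "faucet" (hypothetical).
vocabulary = ["a", "girl", "plays", "dog"]
# Toy 2-dimensional word embeddings that do contain "faucet" (hypothetical).
embeddings = {
    "a": [0.0, 0.0],
    "girl": [1.0, 0.0],
    "plays": [0.0, 1.0],
    "faucet": [1.0, 1.0],
}

sentence = "A girl plays with a faucet"
bow = bow_vector(sentence, vocabulary)
w2v = mean_pooled_embedding(sentence, embeddings, dim=2)
```

In this toy case the BoW vector has no dimension for faucet, while the mean-pooled embedding still reflects it, which is the complementarity the text argues for.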

Despite their effectiveness, the BoW and word2vec representations ignore word order in the input sentence. As such, they cannot discriminate between “a dog follows a person” and “a person follows a dog”. To tackle this downside, we employ an RNN, which is known to be effective for modeling long-term word dependency in natural language text. In particular, we adopt a GRU, which has fewer parameters than LSTM and presumably requires less training data. At a specific time step $t$, let $v_{t}$ be the embedding vector of the $t$-th word, obtained by performing a lookup on a word embedding matrix $W_{e}$. GRU receives inputs from $v_{t}$ and the previous hidden state $h_{t-1}$, and accordingly the new hidden state $h_{t}$ is updated as follows,
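For reference, a standard GRU update that is consistent with the parameter names $W_{z.}$, $W_{r.}$, $W_{h.}$, $b_{z}$, $b_{r}$ and $b_{h}$ appearing in $\theta$ can be written as below. The gate symbols $z_{t}$ (update) and $r_{t}$ (reset), the second subscripts on the weight matrices and the placement of $z_{t}$ versus $1-z_{t}$ are assumptions, since conventions differ between GRU formulations; $\odot$ denotes element-wise multiplication.

```latex
\begin{aligned}
z_t &= \operatorname{sigmoid}(W_{zv} v_t + W_{zh} h_{t-1} + b_z),\\
r_t &= \operatorname{sigmoid}(W_{rv} v_t + W_{rh} h_{t-1} + b_r),\\
\tilde{h}_t &= \tanh\bigl(W_{hv} v_t + W_{hh}(r_t \odot h_{t-1}) + b_h\bigr),\\
h_t &= (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t .
\end{aligned}
```

Under this reading, the hidden state at the last time step would serve as the GRU based sentence feature, in line with the RNN encoders discussed in the introduction.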

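Multi-scale sentence vectorization is obtained by concatenating the three representations, that is

$$s(q)=[s_{bow}(q),s_{word2vec}(q),s_{gru}(q)],\tag{4}$$

where $s_{bow}(q)$, $s_{word2vec}(q)$ and $s_{gru}(q)$ denote the BoW, word2vec and GRU based encodings of $q$, respectively.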

Text transformation via a multilayer perceptron. The sentence vector $s(q)$ goes through subsequent hidden layers until it reaches the output layer $r(q)$, which resides in the visual feature space. More concretely, by applying an affine transformation on $s(q)$, followed by an element-wise ReLU activation $\sigma(z)=\max(0,z)$, we obtain the first hidden layer $h_{1}(q)$ of an $l$-layer Word2VisualVec as:

$$h_{1}(q)=\sigma(W_{1}s(q)+b_{1}).\tag{5}$$

The following hidden layers are expressed by:

$$h_{i}(q)=\sigma(W_{i}h_{i-1}(q)+b_{i}),\quad i=2,\ldots,l-1,\tag{6}$$

where $W_{i}$ parameterizes the affine transformation of the $i$-th hidden layer and $b_{i}$ is the corresponding bias term. In a similar manner, we compute the output layer $r(q)$ as:

$$r(q)=\sigma(W_{l}h_{l-1}(q)+b_{l}).\tag{7}$$

Putting it all together, the learnable parameters are represented by $\theta=[W_{e},W_{z.},W_{r.},W_{h.},b_{z},b_{r},b_{h},W_{1},b_{1},\ldots,W_{l},b_{l}]$ .
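To make the text transformation concrete, the following sketch runs a sentence vector through a stack of affine layers with ReLU activations. It is a toy illustration in plain Python, not the released code: the function names and the tiny 3-2-2 network are hypothetical, and applying ReLU on the output layer as well is an assumption based on the phrase "in a similar manner".

```python
def relu(z):
    """Element-wise ReLU, sigma(z) = max(0, z)."""
    return [max(0.0, value) for value in z]


def affine(W, b, x):
    """Compute W x + b, with W given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]


def predict_visual_feature(s, layers):
    """Map a sentence vector s to r(q) through a list of (W, b) pairs.

    Assumption: every layer, including the output layer, uses ReLU.
    """
    h = s
    for W, b in layers:
        h = relu(affine(W, b, h))
    return h


# Toy network: 3-dim sentence vector -> 2-dim hidden layer -> 2-dim output.
layers = [
    ([[1.0, 0.0, -1.0], [0.5, 0.5, 0.0]], [0.0, -1.0]),
    ([[1.0, 1.0], [-1.0, 2.0]], [0.0, 0.5]),
]
r = predict_visual_feature([2.0, 1.0, 3.0], layers)
```

In the actual model the input would be the multi-scale vector $s(q)$ and the output dimensionality would match the chosen ConvNet feature.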

In principle, the learning capacity of our model grows as more layers are used. This also means more solutions exist which minimize the training loss, yet are suboptimal for unseen test data. We analyze in the experiments how deep Word2VisualVec can go without losing its generalization ability.

III-B Learning algorithm

Objective function. For a given image, different persons might describe the same visual content with different words. For example, “A dog leaps over a log” versus “A dog is leaping over a fallen tree”. The verb leap in different tenses essentially describes the same action, while a log and a fallen tree can have similar visual appearance. Projecting the two sentences into the same visual feature space has the effect of implicitly finding such correlations. In order to reconstruct the visual feature $\phi(x)$ directly from $q$, we use Mean Squared Error (MSE) as our objective function. We have also experimented with the marginal ranking loss, as commonly used in previous works, but found MSE yields better performance.

The MSE loss $l_{mse}$ for a given training pair is defined as:

$$l_{mse}(x,q;\theta)=(r(q)-\phi(x))^{2}.\tag{8}$$

We train Word2VisualVec to minimize the overall MSE loss on a given training set $\mathcal{D}=\{(x,q)\}$, containing a number of relevant image-sentence pairs:

$$\operatorname*{argmin}_{\theta}\sum_{(x,q)\in\mathcal{D}}l_{mse}(x,q;\theta).\tag{9}$$

Optimization. We solve Eq. (9) using stochastic gradient descent with RMSprop. This optimization algorithm divides the learning rate by the square root of an exponentially decaying average of squared gradients, which keeps the effective learning rate from shrinking steadily over time. We empirically set the initial learning rate $\eta=0.0001$, decay weight $\gamma=0.9$ and small constant $\epsilon=10^{-6}$ for RMSprop. We apply dropout to all hidden layers in Word2VisualVec to mitigate model overfitting. Lastly, we take an empirical learning schedule as follows. Once the validation performance does not increase in three consecutive epochs, we divide the learning rate by 2. Early stop occurs if the validation performance does not improve in ten consecutive epochs. The maximal number of epochs is 100.
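The learning schedule can be replayed on a list of validation scores, as in the sketch below. The function name `run_schedule` and its parameter names are hypothetical, and one detail is an assumption: the learning rate is halved each time the number of epochs without improvement reaches a multiple of three, which is one possible reading of the rule.

```python
def run_schedule(val_scores, lr=1e-4, patience_decay=3, patience_stop=10, max_epochs=100):
    """Replay per-epoch validation scores and return (epochs_run, final_lr)."""
    best = float("-inf")
    since_best = 0  # consecutive epochs without validation improvement
    epochs_run = 0
    for score in val_scores[:max_epochs]:
        epochs_run += 1
        if score > best:
            best = score
            since_best = 0
            continue
        since_best += 1
        if since_best >= patience_stop:
            break  # early stop
        if since_best % patience_decay == 0:
            lr /= 2  # halve the learning rate (assumed to repeat every three stale epochs)
    return epochs_run, lr


# Two improving epochs followed by a plateau: three halvings, then early stop.
plateau_epochs, plateau_lr = run_schedule([0.1, 0.2] + [0.2] * 10)

# Steady improvement: training runs for the maximal 100 epochs at the initial rate.
steady_epochs, steady_lr = run_schedule([i / 100 for i in range(150)])
```

In a real training loop the scores would arrive one epoch at a time; replaying a list only keeps the example self-contained.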

III-C Image Caption Retrieval

For a given image, we select from a given sentence pool the sentence deemed most relevant with respect to the image. Note that image-sentence pairs are required only for training Word2VisualVec. For a test sentence, its $r(q)$ is obtained by forward computation through the Word2VisualVec network, without the need of any test image. Hence, the sentence pool can be vectorized in advance. Image caption retrieval in our case boils down to finding the sentence nearest to the given image in the visual feature space. We use the cosine similarity between $r(q)$ and the image feature $\phi(x)$, as this similarity normalizes feature vectors and is found to be better than the dot product or mean square error according to our preliminary experiments.
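Retrieval in the visual space then reduces to a nearest-neighbor search under cosine similarity. The sketch below ranks precomputed caption vectors against one image feature; the function names and the 2-dimensional toy vectors are illustrative assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equally sized vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def rank_captions(image_feature, caption_features):
    """Return caption indices sorted by decreasing similarity to the image."""
    scores = [cosine(image_feature, r) for r in caption_features]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


# Hypothetical predicted features r(q) for three captions, computed in advance.
caption_features = [[1.0, 0.0], [0.6, 0.8], [-1.0, 2.0]]
# Hypothetical image feature phi(x).
ranking = rank_captions([1.0, 1.0], caption_features)
```

Because the caption vectors do not depend on the query image, they can be computed once and reused for every query.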

III-D Video Caption Retrieval

Word2VisualVec is also applicable for video as long as we have an effective vectorized representation of video. Again, different from previous methods for video caption retrieval that execute in a joint subspace, we project sentences into the video feature space.

Following the good practice of using pre-trained ConvNets for video content analysis, we extract features by applying image ConvNets on individual frames and 3-D ConvNets on consecutive-frame sequences. For short video clips, as used in our experiments, mean pooling over video frames is considered reasonable. Hence, the visual feature vector of each video is obtained by averaging the feature vectors of its frames. Note that longer videos open up possibilities for further improvement of Word2VisualVec by exploiting the temporal order of video frames. The audio channel of a video sometimes provides complementary information to the visual channel, for instance, to help decide whether a person is talking or singing. To exploit this channel, we extract a bag of quantized Mel-frequency Cepstral Coefficients (MFCC) and concatenate it with the previous visual feature. Word2VisualVec is trained to predict such a visual-audio feature, as a whole, from input text.
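The construction of the video-level target feature can be sketched as follows: frame features are mean pooled and, when available, a bag of MFCC is appended. The function names and toy numbers are hypothetical; real frame features would come from a ConvNet and the audio bag from an MFCC quantizer.

```python
def mean_pool(frame_features):
    """Average a list of frame-level feature vectors into one vector."""
    n = len(frame_features)
    return [sum(column) / n for column in zip(*frame_features)]


def video_feature(frame_features, audio_bag=None):
    """Return the visual feature, or the visual-audio feature if a bag of MFCC is given."""
    visual = mean_pool(frame_features)
    if audio_bag is None:
        return visual
    return visual + list(audio_bag)


# Two hypothetical 2-dimensional frame features and a 3-bin bag of MFCC.
frames = [[1.0, 3.0], [3.0, 5.0]]
visual_only = video_feature(frames)
visual_audio = video_feature(frames, audio_bag=[0.0, 1.0, 0.0])
```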

Word2VisualVec is used in a principled manner, transforming an input sentence to a video feature vector, be it visual or visual-audio. For the sake of clarity we term the video variant Word2VideoVec.

IV Experiments

We first investigate the impact of major design choices, e.g., how to vectorize an input sentence. Before detailing the investigation, we introduce the data and the evaluation protocol.

Data. For image caption retrieval, we use two popular benchmark sets, Flickr8k and Flickr30k. Each image is associated with five crowd-sourced English sentences, which briefly describe the main objects and scenes present in the image. For video caption retrieval we rely on the Microsoft Video Description dataset (MSVD). Each video is labeled with 40 English sentences on average. The videos are short, usually less than 10 seconds long. For the ease of cross-paper comparison, we follow the identical data partitions as used in prior work for images and for videos. That is, training / validation / test is 6k / 1k / 1k for Flickr8k, 29k / 1,014 / 1k for Flickr30k, and 1,200 / 100 / 670 for MSVD.

Visual features. A deep visual feature is determined by a specific ConvNet and its layers. We experiment with four pretrained 2-D ConvNets, i.e., CaffeNet, GoogLeNet, GoogLeNet-shuffle and ResNet-152. CaffeNet, GoogLeNet and ResNet-152 were trained using images containing 1K different visual objects as defined in the Large Scale Visual Recognition Challenge. GoogLeNet-shuffle follows GoogLeNet’s architecture, but is re-trained using a bottom-up reorganization of the complete 22K ImageNet hierarchy, excluding over-specific classes and classes with few images and thus making the final classes more balanced. For the video dataset, we further experiment with a 3-D ConvNet (C3D), trained on one million sports videos containing 487 sport-related concepts. As the MSVD videos are muted, we cannot evaluate Word2VideoVec with audio features on this dataset. We tried multiple layers of each ConvNet model and report the best performing layer. Finally we use the fc7 layer for CaffeNet (4,096-dim), the pool5 layer for GoogLeNet (1,024-dim), GoogLeNet-shuffle (1,024-dim) and ResNet-152 (2,048-dim), and the fc6 layer for C3D (4,096-dim).

Details of the model. The size of the word2vec and GRU layers is 500 and 1,024, respectively. The size of the BoW layer depends on training data, which is 2,535, 7,379 and 3,030 for Flickr8k, Flickr30k and MSVD, respectively (with words appearing fewer than five times in the corresponding training set removed). Accordingly, the size of the composite vectorization layer is 4,059, 8,903 and 4,554, respectively. The size of the hidden layers is 2,048. The number of layers is three unless otherwise stated. Code is available at https://github.com/danieljf24/w2vv.

Evaluation protocol. The training, validation and test sets are used exclusively for model training, model selection and performance evaluation, respectively. For performance evaluation, each test caption is first vectorized by a trained Word2VisualVec. Given a test image/video query, we then rank all the test captions in terms of their similarities with the image/video query in the visual feature space. The performance is evaluated based on the caption ranking. Following the common convention, we report rank-based performance metrics $R@K$ ($K=1,5,10$). $R@K$ computes the percentage of test images for which at least one correct result is found among the top-$K$ retrieved sentences. Hence, higher $R@K$ means better performance.
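The $R@K$ metric can be computed directly from the per-query caption rankings. The sketch below is a minimal illustration with hypothetical names and toy data, where each query has a set of correct caption indices.

```python
def recall_at_k(rankings, relevant, k):
    """Percentage of queries with at least one correct caption in the top k.

    rankings[i] lists caption ids by decreasing similarity for query i;
    relevant[i] is the set of caption ids that are correct for query i.
    """
    hits = sum(
        1
        for ranked, correct in zip(rankings, relevant)
        if any(caption in correct for caption in ranked[:k])
    )
    return 100.0 * hits / len(rankings)


# Three toy queries over a pool of four captions.
rankings = [[3, 0, 1, 2], [2, 3, 0, 1], [0, 1, 2, 3]]
relevant = [{0}, {1}, {0}]
r_at_1 = recall_at_k(rankings, relevant, 1)
r_at_2 = recall_at_k(rankings, relevant, 2)
r_at_4 = recall_at_k(rankings, relevant, 4)
```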

How to vectorize an input sentence? As shown in Tables I, II and III, multi-scale sentence vectorization outperforms its single-scale counterparts. Table IV shows examples for which a particular vectorization method is particularly suited. In the first two rows, word2vec performs better than BoW and GRU, because the main words rottweiler and quad are not in the vocabularies of BoW and GRU. However, the use of word2vec sometimes has the side effect of overweighting high-level semantic similarity between words. E.g., beagle in the third row is found to be closer to dog than to hound, and woman in the fourth row is found to be closer to man than to lady in the word2vec space. In this case, the resultant Word2VisualVec vector is less discriminative than its BoW counterpart. Since GRU is good at modeling long-term word dependency, it performs the best in the last two rows, where the captions are more narrative.

Which visual feature? Tables I and II show the performance of image caption retrieval on Flickr8k and Flickr30k, respectively. As the ConvNets go deeper, predicting the corresponding visual features by Word2VisualVec improves. This result is encouraging as better performance can be expected from the continuous progress in deep learning features. Table III shows the performance of video caption retrieval on MSVD, where the more compact GoogLeNet-shuffle feature tops the performance when combined with multi-scale sentence vectorization. Although MSVD has more visual / sentence pairs than Flickr8k, it has far fewer visual examples for training, namely 1,200 videos. Substituting GoogLeNet-shuffle for ResNet-152 reduces the number of trainable parameters by 18%, making it easier for Word2VisualVec to learn from relatively limited examples. Ideally, the learning process shall allow the model to automatically discover which elements in the composite sentence vectorization layer are the most important for the problem in consideration. This advantage cannot be properly leveraged when training examples are in short supply. In such a case, using word2vec instead of the composite vectorization is preferred, resulting in a Word2VisualVec with 73% fewer parameters when using ResNet-152 (60% fewer parameters when using CaffeNet or C3D) and thus easier to train. A similar phenomenon is observed on the image data, when given only 3k image-sentence pairs for training (see Fig. 3). Word2VisualVec with word2vec is more suited for small-scale training data regimes.

Given a fixed amount of training pairs, having more visual examples might be better for Word2VisualVec. To verify this conjecture, we take from the Flickr30k training set a random subset of 3k images with one sentence per image. We then incrementally increase the amount of image / sentence pairs for training, using the following two strategies. One is to increase the number of sentences per image from 1 to 2, 3, 4, and 5 with the number of images fixed, while the other is to let the amount of images increase to 6k, 9k, 12k and 15k with the number of sentences per image fixed to one. As the performance curves in Fig. 3 show, given the same amount of training pairs, adding more images results in better models. The result is also instructive for more effective acquisition of training data for image and video caption retrieval.

How deep? In this experiment, we use word2vec as sentence vectorization for its efficient execution. We vary the number of MLP layers, and observe a performance peak when using three-layers, i.e., 500-2048-2048, on Flickr8k and four-layers, i.e., 500-2048-2048-2048, on Flickr30k. Recall that the model is chosen in terms of its performance on the validation set. While its learning capacity increases as the model goes deeper, the chance of overfitting also increases. To improve generalization we also tried $l_{2}$ regularization on the network weights. This tactic brings a marginal improvement, yet introduces extra hyper parameters. So we did not go further in that direction. Overall the three-layer Word2VisualVec strikes the best balance between model capacity and generalization ability, so we use this network configuration in what follows.

How fast? We implement Word2VisualVec using Keras with the Theano backend. The three-layer model with multi-scale sentence vectorization takes about 1.3 hours to learn from the 30k image-sentence pairs in Flickr8k on a GeForce GTX 1070 GPU. Predicting visual features for a given sentence is swift, taking 20 milliseconds on average. Retrieving captions from a pool of 5k sentences takes 8 milliseconds per test image. Based on the above evaluations we recommend Word2VisualVec that uses multi-scale sentence vectorization, and predicts the 2,048-dim ResNet-152 feature when adequate training data is available (over 2k training images with five sentences per image) or the 1,024-dim GoogLeNet-shuffle feature when training data is more scarce.

IV-B Word2VisualVec versus word2vec

Although our model is meant for caption retrieval, it essentially generates a new representation of text. How meaningful is this new representation as compared to word2vec? To answer this question, we take all the 5K test sentences from Flickr30k, vectorizing them by word2vec and Word2VisualVec, respectively. The word2vec model was trained on Flickr tags as described in Section III-A. For a fair comparison, we let Word2VisualVec use the same word2vec as its first layer. Fig. 4 presents t-SNE visualizations of sentence distributions in the word2vec and Word2VisualVec spaces, showing that sentences describing the same image stay closer while sentences from distinct images are more distant in the latter space. Recall that sentences associated with the same image are meant for describing the same visual content. Moreover, since they were independently written by distinct users, the wording may vary across the users, requiring a text representation to capture shared semantics among distinct words. Word2VisualVec better handles such variance in captions as illustrated in the first two examples in Fig. 4(e).

The last example in Fig. 4(e) shows failures of both models, where the two sentences (#5 and #6) are supposed to be close. Large difference between their subject (teenagers versus people) and object (shirt versus paper) makes it difficult for Word2VisualVec to predict similar visual features from the two sentences. Actually, we find in the Word2VisualVec space that the sentence nearest to #5 is “A woman is completing a picture of a young woman” (which resembles subjects, i.e., teenager versus young woman and action, i.e., holding paper or easel) and the one to #6 is “Kids scale a wall as two other people watch” (which depicts similar subjects, i.e., two people and objects, i.e., concrete versus wall). This example shows the existence of large divergence between manually written descriptions of the same visual content, and thus the challenging nature of the caption retrieval problem.

Note that the above comparison is not completely fair as word2vec is not intended for fitting the relevance between image and text. By contrast, Word2VisualVec is designed to exploit the link between the two modalities, producing a new representation of text that is well suited for image and video caption retrieval.

IV-C Word2VisualVec for multi-modal querying

Fig. 5 presents an example of Word2VisualVec’s learned representation and its ability for multi-modal query composition. Given the query image, its composed queries are obtained by subtracting and/or adding the visual features of the query words, as predicted by Word2VisualVec. A deep dream visualization is performed on an average (gray) image guided by each composed query. Consider the query in the second row for instance, where we instruct the search to replace bicycle with motorbike via a textual specification. The predicted visual feature of the word bicycle is subtracted (effect visible in the first row) and the predicted visual feature of the word motorbike is added. Imagery of motorbikes is indeed present in the dream. Hence, the nearest retrieved images emphasize motorbikes in street scenes.
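Query composition amounts to element-wise vector arithmetic in the visual feature space. The sketch below illustrates the bicycle-to-motorbike example with made-up 3-dimensional vectors; `compose_query` is a hypothetical helper, and in practice the word features would be predicted by a trained Word2VisualVec.

```python
def compose_query(image_feature, subtract=(), add=()):
    """Subtract and add predicted word features to an image feature, element-wise."""
    query = list(image_feature)
    for vec in subtract:
        query = [q - v for q, v in zip(query, vec)]
    for vec in add:
        query = [q + v for q, v in zip(query, vec)]
    return query


# Hypothetical features: the image, and predicted features of two query words.
image = [0.9, 0.1, 0.5]
bicycle = [0.8, 0.0, 0.0]
motorbike = [0.0, 0.7, 0.0]

composed = compose_query(image, subtract=[bicycle], add=[motorbike])
```

The composed vector can then be used like any other query, e.g., ranked against image features by cosine similarity.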

IV-D Comparison to the State-of-the-Art

Image caption retrieval. We compare a number of recently developed models for image caption retrieval. All the methods, including ours, require image-sentence pairs to train. They all perform caption retrieval on a provided set of test sentences. Note that the compared methods have no reported performance on the ResNet-152 feature. We have tried the VGGNet feature as used in prior work and found Word2VisualVec less effective. This is not surprising as the choice of the visual feature is an essential ingredient of our model. While it would be ideal to replicate all methods using the same ResNet feature, only two of them have released their source code. So we re-train these two models with the same ResNet features we use. Table V presents the performance of the above models on both Flickr8k and Flickr30k. Word2VisualVec compares favorably against the state-of-the-art. Given the same visual feature, our model outperforms the two re-trained models, especially for $R@1$. Notice that Plummer et al. employ extra bounding-box level annotations. Still our results are better, indicating that we can expect further gains by including locality in the Word2VisualVec representation. As all the competitor models use joint subspaces, the results justify the viability of directly using the deep visual feature space for image caption retrieval.

We compare the run-time complexity with that of the two top-performing methods. The run-time complexity of the multi-scale Word2VisualVec is $O(m\times s+s\times g+(m+s+g)\times 2048+2048\times d)$, where $s$ indicates the dimensionality of word embedding, $g$ denotes the size of GRU and $d$ is the dimensionality of the visual feature. This complexity is larger than that of one of the two methods, which has a complexity of $O(m\times s+s\times g+g\times d)$, but lower than that of the other, which vectorizes a sentence by a time-consuming Fisher vector encoding.

Video caption retrieval. We also participated in the NIST TrecVid 2016 video caption retrieval task. The test set consists of 1,915 videos collected from Twitter Vine. Each video is about 6 sec long. The videos were given to 8 annotators to generate a total of 3,830 sentences, with each video associated with two sentences written by two different annotators. The sentences have been split into two equal-sized subsets, set $A$ and set $B$, with the rule that sentences describing the same video are not in the same subset. Per test video, participants are asked to rank all sentences in the two subsets. Notice that we have no access to the ground-truth, as the test set is used for blind testing by the organizers only. NIST also provides a training set of 200 videos, which we consider insufficient for training Word2VideoVec. Instead, we learn the network parameters using video-text pairs from MSR-VTT, with hyper-parameters tuned on the provided TrecVid training set. By the time of TrecVid submission, we used GoogLeNet-shuffle as the visual feature, a 1,024-dim bag of MFCC as the audio feature, and word2vec for sentence vectorization. The performance metric is Mean Inverted Rank (MIR) at which the annotated item is found. Higher MIR means better performance.
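As a reading aid, Mean Inverted Rank can be sketched as the average of the reciprocal rank at which the annotated sentence appears. The function below is a minimal illustration under the assumption that each video has exactly one ground-truth sentence per subset; it is not the official TrecVid scoring tool.

```python
def mean_inverted_rank(rankings, ground_truth):
    """Average of 1 / rank of the annotated sentence over all test videos.

    rankings[i] lists sentence ids by decreasing similarity for video i;
    ground_truth[i] is the id of the sentence annotated for video i.
    """
    total = 0.0
    for ranked, truth in zip(rankings, ground_truth):
        rank = ranked.index(truth) + 1  # ranks start at 1
        total += 1.0 / rank
    return total / len(rankings)


# Two toy videos: the annotated sentence is found at rank 2 and rank 1.
mir = mean_inverted_rank([[2, 0, 1], [1, 2, 0]], [0, 1])
```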

As shown in Fig. 6, with MIR ranging from 0.097 to 0.110, Word2VisualVec leads the evaluation on both set A and set B in the context of 21 submissions from seven teams worldwide. Moreover, the results can be further improved by predicting the visual-audio feature. Besides us, two other teams submitted technical reports, with best MIR scores of 0.076 and 0.006 , respectively. Given a video-sentence pair, the model from iteratively combines the video and sentence features into one vector, followed by a fully connected layer to predict the similarity score. The model from learns an embedding space by minimizing a cross-media distance.

Some qualitative image and video caption retrieval results are shown in Fig. 7. Consider the last image in the top row. Its ground-truth caption is “A man playing an accordion in front of buildings”, while the top-retrieved caption is “People walk through an arch in an old-looking city”. Though the ResNet feature describes the overall scene well, it fails to capture the accordion, which is small but has successfully drawn the attention of the annotator who wrote the ground-truth caption. The last video in the bottom row of Fig. 7 shows “A man throws his phone into a river”. This action is not well described by the average-pooled video feature. Hence, the main source of errors is cases where the visual features do not represent the visual content well.

IV-E Limits of caption retrieval and possible extensions

The caption retrieval task assumes that for a query image or video, there is at least one sentence relevant to the query. In a general scenario where the query is unconstrained and has arbitrary content, this assumption is unlikely to hold. A naive remedy would be to enlarge the sentence pool. A more advanced solution is to combine retrieval with methods that construct novel captions. In for instance, a caption is formed using a set of visually relevant phrases extracted from a large-scale image collection. From the top-n sentences retrieved by Word2VisualVec, one can also generate a new caption, using the methods of . As this paper aims to retrieve rather than to construct a caption, we leave this for future exploration.

V Conclusions

This paper shows the viability of resolving image and video caption retrieval in a visual feature space exclusively. We contribute Word2VisualVec, which is capable of transforming a natural language sentence into a meaningful visual feature representation. Compared to the word2vec space, sentences describing the same image tend to stay closer, while sentences from different images are more distant in the Word2VisualVec space. As the sentences are meant for describing visual content, the new textual encoding captures both semantic and visual similarities. Word2VisualVec also supports multi-modal query composition, by subtracting and/or adding the predicted visual features of specific words to a given query image. What is more, Word2VisualVec is easily generalized to predict a visual-audio representation from text for video caption retrieval. For state-of-the-art results, we suggest Word2VisualVec with multi-scale sentence vectorization, predicting the ResNet feature when adequate training data is available or the GoogLeNet-shuffle feature when training data is in short supply.