Order-Embeddings of Images and Language

Ivan Vendrov, Ryan Kiros, Sanja Fidler, Raquel Urtasun

Introduction

Computer vision and natural language processing are becoming increasingly intertwined. Recent work in vision has moved beyond discriminating between a fixed set of object classes, to automatically generating open-ended lingual descriptions of images (Vinyals et al., 2015). Recent methods for natural language processing such as Young et al. (2014) learn the semantics of language by grounding it in the visual world. Looking to the future, autonomous artificial agents will need to jointly model vision and language in order to parse the visual world and communicate with people.

But what, precisely, is the relationship between images and the words or captions we use to describe them? It is akin to the hypernym relation between words, and textual entailment among phrases: captions are simply abstractions of images. In fact, all three relations can be seen as special cases of a partial order over images and language, illustrated in Figure 1, which we refer to as the visual-semantic hierarchy. As a partial order, this relation is transitive: “woman walking her dog”, “woman walking”, “person walking”, “person”, and “entity” are all valid abstractions of the rightmost image. Our goal in this work is to learn representations that respect this partial order structure.

Most recent approaches to modeling the hypernym, entailment, and image-caption relations involve learning distributed representations or embeddings. This is a very powerful and general approach which maps the objects of interest (words, phrases, images) to points in a high-dimensional vector space. One line of work, exemplified by Chopra et al. (2005) and first applied to the caption-image relationship by Socher et al. (2014), requires the mapping to be distance-preserving: semantically similar objects are mapped to points that are nearby in the embedding space. A symmetric distance measure such as Euclidean or cosine distance is typically used. Since the visual-semantic hierarchy is an antisymmetric relation, we expect this approach to introduce systematic model error.

Other approaches do not have such explicit constraints, learning a more-or-less general binary relation between the objects of interest, e.g. Bordes et al. (2011); Socher et al. (2013); Ma et al. (2015). Notably, no existing approach directly imposes the transitivity and antisymmetry of the partial order, leaving the model to induce these properties from data.

In contrast, we propose to exploit the partial order structure of the visual-semantic hierarchy by learning a mapping which is not distance-preserving but order-preserving between the visual-semantic hierarchy and a partial order over the embedding space. We call embeddings learned in this way order-embeddings. This idea can be integrated into existing relational learning methods simply by replacing their comparison operation with ours. By modifying existing methods in this way, we find that order-embeddings provide a marked improvement over the state of the art for hypernymy prediction and caption-image retrieval, and near state-of-the-art performance for natural language inference.

This paper is structured as follows. We begin, in Section 2, by giving a unified mathematical treatment of our tasks, and describing the general approach of learning order-embeddings. In the next three sections we describe in detail the tasks we tackle, how we apply the order-embeddings idea to each of them, and the results we obtain. The tasks are hypernym prediction (Section 3), caption-image retrieval (Section 4), and textual entailment (Section 5).

In the supplementary material, we visualize novel vector regularities that emerge in our learned embeddings of images and language.

Learning Order-Embeddings

To unify our treatment of various tasks, we introduce the problem of partial order completion. In partial order completion, we are given a set of positive examples $P=\{(u,v)\}$ of ordered pairs drawn from a partially ordered set $(X,\preceq_{X})$, and a set of negative examples $N$ which we know to be unordered. Our goal is to predict whether an unseen pair $(u',v')$ is ordered. Note that hypernym prediction, caption-image retrieval, and textual entailment are all special cases of this task, since they all involve classifying pairs of concepts in the (partially ordered) visual-semantic hierarchy.

We tackle this problem by learning a mapping from $X$ into a partially ordered embedding space $(Y,\preceq_{Y})$. The idea is to predict the ordering of an unseen pair in $X$ based on its ordering in the embedding space. This is possible only if the mapping satisfies the following crucial property:

Definition 1. A function $f:(X,\preceq_{X})\to(Y,\preceq_{Y})$ is an order-embedding if for all $u,v\in X$,

$$u\preceq_{X}v \quad\text{if and only if}\quad f(u)\preceq_{Y}f(v).$$

This definition implies that each combination of embedding space $Y$, order $\preceq_{Y}$, and order-embedding $f$ determines a unique completion of our data as a partial order $\preceq_{X}$. In the following, we first consider the choice of $Y$ and $\preceq_{Y}$, and then discuss how to find an appropriate $f$.

The Reversed Product Order on $\mathbb{R}_{+}^{N}$

The choice of $Y$ and $\preceq_{Y}$ is somewhat application-dependent. For the purpose of modeling the semantic hierarchy, our choices are narrowed by the following considerations.

Much of the expressive power of human language comes from abstraction and composition. For any two concepts, say “dog” and “cat”, we can name a concept that is an abstraction of the two, such as “mammal”, as well as a concept that composes the two, such as “dog chasing cat”. So, in order to represent the visual-semantic hierarchy, we need to choose an order $\preceq_{Y}$ that is rich enough to embed these two relations.

We also restrict ourselves to orders $\preceq_{Y}$ with a top element, which is above every other element in the order. In the visual-semantic hierarchy, this element represents the most general possible concept; practically, it provides an anchor for the embedding.

Finally, we choose the embedding space $Y$ to be continuous in order to allow optimization with gradient-based methods.

A natural choice that satisfies all three properties is the reversed product order on $\mathbb{R}_{+}^{N}$, defined by the conjunction of total orders on each coordinate:

$$x\preceq y \quad\text{if and only if}\quad \bigwedge_{i=1}^{N}x_{i}\geq y_{i} \tag{1}$$

for all vectors $x,y$ with nonnegative coordinates. Note the reversal of direction: smaller coordinates imply higher position in the partial order. The origin is then the top element of the order, representing the most general concept.
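As a rough illustration, the sketch below checks the order-embedding condition exhaustively on a tiny hand-built hierarchy. It assumes the usual reading of that condition ($u\preceq_{X}v$ exactly when $f(u)\preceq_{Y}f(v)$) and takes $x\preceq y$ in the reversed product order to mean $x_{i}\geq y_{i}$ in every coordinate. The concept names, the `ancestors` table, the 2-D coordinates, and all function names are assumptions made for this example, not part of the experiments.

```python
from itertools import product


def below(x, y):
    """Reversed product order: x is below y iff x_i >= y_i in every coordinate."""
    return all(xi >= yi for xi, yi in zip(x, y))


def is_order_embedding(elements, leq_x, f, leq_y):
    """Exhaustively check that u <=_X v holds exactly when f(u) <=_Y f(v)."""
    return all(
        leq_x(u, v) == leq_y(f(u), f(v))
        for u, v in product(elements, repeat=2)
    )


# Hypothetical toy hierarchy: each concept lists itself and everything above it.
ancestors = {
    "entity": {"entity"},
    "person": {"person", "entity"},
    "woman": {"woman", "person", "entity"},
    "dog": {"dog", "entity"},
}


def leq_x(u, v):
    return v in ancestors[u]


# Hand-picked coordinates; the origin (most general concept) is the top element.
good = {"entity": (0, 0), "person": (1, 0), "woman": (2, 0), "dog": (0, 1)}
# Collapsing "dog" onto "person" wrongly orders the two, so the check fails.
bad = dict(good, dog=(1, 0))

concepts = list(ancestors)
print(is_order_embedding(concepts, leq_x, good.get, below))
print(is_order_embedding(concepts, leq_x, bad.get, below))
```

Unrelated concepts such as "dog" and "person" need at least one coordinate on which each is smaller than the other, which is why a one-dimensional embedding could not represent this hierarchy.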

Instead of viewing our embeddings as single points $x\in\mathbb{R}_{+}^{N}$, we can also view them as sets $\{y:x\preceq y\}$. The meaning of a word is then the union of all concepts of which it is a hypernym, and the meaning of a sentence is the union of all sentences that entail it. The visual-semantic hierarchy can then be seen as a special case of the subset relation, a connection also used by Young et al. (2014).

Penalizing Order Violations

Having fixed the embedding space and order, we now consider the problem of finding an order-embedding into this space. In practice, the order-embedding condition (Definition 1) is too restrictive to impose as a hard constraint. Instead, we aim to find an approximate order-embedding: a mapping which violates the order-embedding condition, imposed as a soft constraint, as little as possible.

More precisely, we define a penalty that measures the degree to which a pair of points violates the product order. In particular, we define the penalty for an ordered pair $(x,y)$ of points in $\mathbb{R}_{+}^{N}$ as

$$E(x,y)=\|\max(0,y-x)\|^{2} \tag{2}$$

Crucially, $E(x,y)=0\iff x\preceq y$ according to the reversed product order; if the order is not satisfied, $E(x,y)$ is positive. This effectively imposes a strong prior on the space of relations, encouraging our learned relation to satisfy the partial order properties of transitivity and antisymmetry. This penalty is key to our method. Throughout the remainder of the paper, we will use it where previous work has used symmetric distances or learned comparison operators.
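A minimal sketch of the penalty, assuming it takes the form $E(x,y)=\|\max(0,y-x)\|^{2}$, which is zero exactly when every coordinate of $x$ is at least the corresponding coordinate of $y$. The function name and the two example vectors are illustrative assumptions.

```python
def order_violation(x, y):
    """E(x, y) = ||max(0, y - x)||^2 for nonnegative vectors x and y."""
    return sum(max(0.0, yi - xi) ** 2 for xi, yi in zip(x, y))


specific = (2.0, 1.0)  # hypothetical detailed concept: larger coordinates
general = (1.0, 0.0)   # hypothetical abstraction of it: smaller coordinates

print(order_violation(specific, general))  # 0.0: specific is below general
print(order_violation(general, specific))  # 2.0: the reversed pair violates the order
```

The two calls differ only in argument order, which shows the asymmetry that a Euclidean or cosine distance cannot express.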

Recall that $P$ and $N$ are our positive and negative examples, respectively. Then, to learn an approximate order-embedding $f$, we could use a max-margin loss which encourages positive examples to have zero penalty, and negative examples to have penalty greater than a margin:

$$\sum_{(u,v)\in P}E(f(u),f(v))+\sum_{(u',v')\in N}\max\{0,\alpha-E(f(u'),f(v'))\} \tag{3}$$

In practice we are often not given negative examples, in which case this loss admits the trivial solution of mapping all objects to the same point. The best way of dealing with this problem depends on the application, so we will describe task-specific variations of the loss in the next several sections.
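The loss in Eq. (3) and its degenerate solution can be sketched as follows. This is an illustration only: the penalty is assumed to be $E(x,y)=\|\max(0,y-x)\|^{2}$, and the toy pairs, the lookup table, and the function names are hypothetical.

```python
def order_violation(x, y):
    """Assumed penalty E(x, y) = ||max(0, y - x)||^2."""
    return sum(max(0.0, yi - xi) ** 2 for xi, yi in zip(x, y))


def max_margin_loss(positives, negatives, f, alpha=1.0):
    """Eq. (3): zero penalty for positives, penalty of at least alpha for negatives."""
    pos_term = sum(order_violation(f(u), f(v)) for u, v in positives)
    neg_term = sum(
        max(0.0, alpha - order_violation(f(u), f(v))) for u, v in negatives
    )
    return pos_term + neg_term


positives = [("woman", "person"), ("person", "entity")]
negatives = [("person", "woman")]
table = {"entity": (0.0, 0.0), "person": (1.0, 0.0), "woman": (2.0, 0.0)}


def collapsed(_):
    """Degenerate mapping that sends every object to the same point."""
    return (0.0, 0.0)


print(max_margin_loss(positives, negatives, table.get))  # well-ordered table
print(max_margin_loss(positives, [], collapsed))         # no negatives: collapse costs nothing
print(max_margin_loss(positives, negatives, collapsed))  # negatives penalize the collapse
```

Without negative pairs the collapsed mapping reaches zero loss, which is the trivial solution mentioned above; a single negative pair is enough to make it pay the full margin.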

Hypernym Prediction

To test the ability of our model to learn partial orders from incomplete data, our first task is to predict withheld hypernym pairs in WordNet (Miller, 1995). A hypernym pair is a pair of concepts where the first concept is a specialization or an instance of the second, e.g., (woman, person) or (New York, city). Our setup differs significantly from previous work in that we use only the WordNet hierarchy as training data. The most similar evaluation has been that of Baroni et al. (2012), who use external linguistic data in the form of distributional semantic vectors. Bordes et al. (2011) and Socher et al. (2013) also evaluate on the WordNet hierarchy, but they use other relations in WordNet as training data (and external linguistic data, in Socher’s case).

Additionally, the latter two consider only direct hypernyms, rather than the full, transitive hypernymy relation. But predicting the transitive hypernym relation is a better-defined problem because individual hypernym edges in WordNet vary dramatically in the degree of abstraction they require. For instance, (person, organism) is a direct hypernym pair, but it takes eight hypernym edges to get from cat to organism.

To apply order-embeddings to hypernymy, we follow the setup of Socher et al. (2013) in learning an $N$-dimensional vector for each concept in WordNet, but we replace their neural tensor network with our order-violation penalty defined in Eq. (2). Just like them, we corrupt each hypernym pair by replacing one of the two concepts with a randomly chosen concept, and use these corrupted pairs as negative examples for both training and evaluation. We use their max-margin loss, which encourages the order-violation penalty to be zero for positive examples, and greater than a margin $\alpha$ for negative examples:

$$\sum_{(u,v)\in \text{WordNet}}E(f(u),f(v))+\max\{0,\alpha-E(f(u'),f(v'))\} \tag{4}$$

where $E$ is our order-violation penalty, and $(u',v')$ is a corrupted version of $(u,v)$. Since we learn an independent embedding for each concept, the mapping $f$ is simply a lookup table.
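The corruption step and the loss of Eq. (4) might look like the sketch below. The 50/50 choice of which side to replace, the toy lookup table, and the function names are assumptions; the penalty is assumed to be $E(x,y)=\|\max(0,y-x)\|^{2}$. This sketch also does not filter out corrupted pairs that happen to be true hypernym pairs, and the text does not say whether the original setup does.

```python
import random


def order_violation(x, y):
    """Assumed penalty E(x, y) = ||max(0, y - x)||^2."""
    return sum(max(0.0, yi - xi) ** 2 for xi, yi in zip(x, y))


def corrupt(pair, concepts, rng):
    """Replace one side of a hypernym pair with a randomly chosen concept."""
    u, v = pair
    replacement = rng.choice(concepts)
    # Assumption: either side is replaced with equal probability.
    return (replacement, v) if rng.random() < 0.5 else (u, replacement)


def hypernym_loss(pairs, table, concepts, rng, alpha=1.0):
    """Eq. (4): one corrupted negative per positive pair; f is a lookup table."""
    total = 0.0
    for u, v in pairs:
        u_neg, v_neg = corrupt((u, v), concepts, rng)
        total += order_violation(table[u], table[v])
        total += max(0.0, alpha - order_violation(table[u_neg], table[v_neg]))
    return total


# Hypothetical 2-D lookup table standing in for the learned embeddings.
table = {"entity": (0.0, 0.0), "person": (1.0, 0.0),
         "woman": (2.0, 0.0), "dog": (0.0, 1.0)}
concepts = list(table)
pairs = [("woman", "person"), ("person", "entity"), ("dog", "entity")]

rng = random.Random(0)
print(hypernym_loss(pairs, table, concepts, rng))
```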

Dataset

The transitive closure of the WordNet hierarchy gives us 838,073 edges between 82,192 concepts in WordNet. Like Bordes et al. (2011), we randomly select 4000 edges for the test split, and another 4000 for the development set. Note that the majority of test set edges can be inferred simply by applying transitivity, giving us a strong baseline.

Details of Training

We learn a 50-dimensional nonnegative vector for each concept in WordNet using the max-margin objective (4) with margin $\alpha=1$, sampling 500 true and 500 false hypernym pairs in each batch. We train for 30-50 epochs using the Adam optimizer (Kingma & Ba, 2015) with learning rate 0.01 and early stopping on the validation set. During evaluation, we find the optimal classification threshold on the validation set, then apply it to the test set.
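The threshold selection used at evaluation time can be sketched as a simple search over candidate thresholds. This is one plausible reading, not the authors' code: a pair is classified as positive when its penalty is at most the threshold, accuracy is the selection criterion, and the scores and labels are made up.

```python
def best_threshold(scores, labels):
    """Return (threshold, accuracy) maximizing accuracy of 'positive iff score <= t'."""
    # -inf covers the option of classifying every pair as negative.
    candidates = [float("-inf")] + sorted(set(scores))
    best_t, best_acc = candidates[0], -1.0
    for t in candidates:
        correct = sum((s <= t) == y for s, y in zip(scores, labels))
        acc = correct / len(scores)
        if acc > best_acc:  # strict comparison keeps the smallest best threshold
            best_t, best_acc = t, acc
    return best_t, best_acc


def classify(scores, threshold):
    """Apply a fixed threshold, e.g. one chosen on the validation set."""
    return [s <= threshold for s in scores]


# Hypothetical validation penalties and their true labels.
val_scores = [0.0, 0.1, 0.9, 1.5]
val_labels = [True, True, False, False]

threshold, val_acc = best_threshold(val_scores, val_labels)
print(threshold, val_acc)
print(classify([0.05, 1.2], threshold))  # applied unchanged to held-out penalties
```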

Results

Since our setup is novel, there are no published numbers to compare to. We therefore compare three variants of our model to two baselines, with results shown in Table 1.

The transitive closure baseline involves no learning; it simply classifies hypernym pairs as positive if they are in the transitive closure of the union of edges in the training and validation sets. The word2gauss baseline evaluates the approach of Vilnis & McCallum (2015) to represent words as Gaussian densities rather than points in the embedding space. This allows a natural representation of hierarchies using the KL divergence. We used 50-dimensional diagonal Gaussian embeddings, trained for 200 epochs on a max-margin objective with margin 7, chosen by grid search. (We used the code at http://github.com/seomoz/word2gauss.)
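The transitive closure baseline can be sketched with a graph traversal over the known edges. The edge list and function names below are hypothetical, and a real run over WordNet would use the training and validation edges instead.

```python
from collections import defaultdict


def transitive_closure(edges):
    """All (descendant, ancestor) pairs reachable by following hypernym edges."""
    parents = defaultdict(set)
    for child, parent in edges:
        parents[child].add(parent)

    closure = set()
    for start in list(parents):
        stack = list(parents[start])
        seen = set()
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            closure.add((start, node))
            stack.extend(parents.get(node, ()))
    return closure


def baseline_predict(pair, closure):
    """No learning: a pair is positive iff it lies in the closure of known edges."""
    return pair in closure


# Hypothetical known edges (child, parent).
known_edges = [("woman", "person"), ("person", "organism"), ("organism", "entity")]
closure = transitive_closure(known_edges)

print(baseline_predict(("woman", "entity"), closure))  # inferred by transitivity
print(baseline_predict(("entity", "woman"), closure))  # wrong direction
```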

order-embeddings (symmetric) is our full model, but using symmetric cosine distance instead of our asymmetric penalty. order-embeddings (bilinear) replaces our penalty with the bilinear model used by Socher et al. (2013). order-embeddings is our full model.

Only our full model can do better than the transitive baseline, showing the value of exploiting partial order structure in contrast to using symmetric similarity or learning a general binary relation as most previous work and our bilinear baseline do.

The resulting 50-dimensional embeddings are difficult to visualize. To give some intuition for the structure being learned, Figure 2 shows the results of a toy 2D experiment.

Caption-Image Retrieval

The caption-image retrieval task has become a standard evaluation of joint models of vision and language (Hodosh et al., 2013; Lin et al., 2014a). The task involves ranking a large dataset of images by relevance for a query caption (Image Retrieval), and ranking captions by relevance for a query image (Caption Retrieval). Given a set of aligned image-caption pairs as training data, the goal is then to learn a caption-image compatibility score $S(c,i)$ to be used at test time.

Many modern approaches model the caption-image relationship symmetrically, either by embedding into a common “visual-semantic” space with inner-product similarity (Socher et al., 2014; Kiros et al., 2014), or by using Canonical Correlations Analysis between distributed representations of images and captions (Klein et al., 2015). While Karpathy & Li (2015) and Plummer et al. (2015) model a finer-grained alignment between regions in the image and segments of the caption, the similarity they use is still symmetric. An alternative is to learn an unconstrained binary relation, either with a neural language model conditioned on the image (Vinyals et al., 2015; Mao et al., 2015) or using a multimodal CNN (Ma et al., 2015).

In contrast to these lines of work, we propose to treat the caption-image pairs as a two-level partial order with captions above the images they describe, and let

$$S(c,i)=-E(f_{i}(i),f_{c}(c))$$

where $E$ is our order-violation penalty defined in Eq. (2), and $f_{c},f_{i}$ are embedding functions from captions and images into $\mathbb{R}_{+}^{N}$.

To facilitate comparison, we use the same pairwise ranking loss that Socher et al. (2014), Kiros et al. (2014) and Karpathy & Li (2015) have used on this task, simply replacing their symmetric similarity measure with our asymmetric order-violation penalty. This loss function encourages $S(c,i)$ for ground truth caption-image pairs to be greater than that for all other pairs, by a margin:

$$\sum_{(c,i)}\left(\sum_{c'}\max\{0,\alpha-S(c,i)+S(c',i)\}+\sum_{i'}\max\{0,\alpha-S(c,i)+S(c,i')\}\right) \tag{5}$$

where $(c,i)$ is a ground truth caption-image pair, $c'$ goes over captions that do not describe $i$, and $i'$ goes over images not described by $c$.

Image and Caption Embeddings

To learn $f_{c}$ and $f_{i}$, we use the approach of Kiros et al. (2014) except, since we are embedding into $\mathbb{R}_{+}^{N}$, we constrain the embedding vectors to have nonnegative entries by taking their absolute value. Thus, to embed images, we use

$$f_{i}(i)=|W_{i}\cdot CNN(i)|$$

where $W_{i}$ is a learned $N\times 4096$ matrix, $N$ being the dimensionality of the embedding space. $CNN(i)$ is the same image feature used by Klein et al. (2015): we rescale images to have smallest side 256 pixels, we take $224\times 224$ crops from the corners, center, and their horizontal reflections, run the 10 crops through the 19-layer VGG network of Simonyan & Zisserman (2015) (weights pre-trained on ImageNet and fixed during training), and average their fc7 features.

To embed the captions, we use a recurrent neural net encoder with GRU activations (Cho et al., 2014), so $f_{c}(c)=|GRU(c)|$, the absolute value of the hidden state after processing the last word.
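The nonnegativity constraint can be illustrated with a toy version of the image embedding, assuming it has the form $f_{i}(i)=|W_{i}\cdot CNN(i)|$ (a linear map followed by an elementwise absolute value). The tiny weight matrix and feature vector are made-up stand-ins for the learned $N\times 4096$ matrix and the averaged fc7 features; the caption side would apply the same absolute value to the final GRU hidden state.

```python
def embed_image(weights, features):
    """Linear projection followed by elementwise absolute value."""
    return [abs(sum(w * x for w, x in zip(row, features))) for row in weights]


# Hypothetical 2 x 3 weight matrix and 3-dimensional CNN feature.
W_i = [
    [1.0, -2.0, 0.0],
    [0.5, 0.5, -3.0],
]
cnn_feature = [1.0, 1.0, 1.0]

embedding = embed_image(W_i, cnn_feature)
print(embedding)  # every coordinate is nonnegative, so the point lies in R_+^N
```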

Dataset

We evaluate on the Microsoft COCO dataset (Lin et al., 2014b), which has over 120,000 images, each with at least five human-annotated captions. This is by far the largest dataset commonly used for caption-image retrieval. We use the data splits of Karpathy & Li (2015) for training (113,287 images), validation (5000 images), and test (5000 images).

Details of Training

To train the model, we use the standard pairwise ranking objective from Eq. (5). We sample minibatches of 128 random image-caption pairs, and draw all contrastive terms from the minibatch, giving us 127 contrastive images for each caption and 127 contrastive captions for each image. We train for 15-30 epochs using the Adam optimizer with learning rate 0.001, and early stopping on the validation set.
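The ranking objective of Eq. (5) with in-batch contrastive terms can be sketched as below. The score is assumed to be $S(c,i)=-E(f_{i}(i),f_{c}(c))$ with $E(x,y)=\|\max(0,y-x)\|^{2}$; the two-pair "minibatch" of hand-written vectors is purely illustrative, and the sketch assumes that every other item in the batch is a valid contrastive example.

```python
def order_violation(x, y):
    """Assumed penalty E(x, y) = ||max(0, y - x)||^2."""
    return sum(max(0.0, yi - xi) ** 2 for xi, yi in zip(x, y))


def score(caption_vec, image_vec):
    """S(c, i) = -E(f_i(i), f_c(c)): captions should sit above their images."""
    return -order_violation(image_vec, caption_vec)


def ranking_loss(caption_vecs, image_vecs, alpha=0.05):
    """Eq. (5), where pair k is (caption k, image k) and all others are contrastive."""
    n = len(caption_vecs)
    s = [[score(c, i) for i in image_vecs] for c in caption_vecs]
    total = 0.0
    for k in range(n):
        for j in range(n):
            if j == k:
                continue
            total += max(0.0, alpha - s[k][k] + s[j][k])  # contrastive caption j
            total += max(0.0, alpha - s[k][k] + s[k][j])  # contrastive image j
    return total


# Hypothetical embeddings: each image lies below its own caption only.
images = [(2.0, 0.0), (0.0, 2.0)]
captions = [(1.0, 0.0), (0.0, 1.0)]

print(ranking_loss(captions, images))        # aligned pairs
print(ranking_loss(captions[::-1], images))  # deliberately mismatched pairs
```

With a batch of 128 pairs the inner loop visits 127 contrastive captions and 127 contrastive images per ground truth pair, matching the counts given above.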

We set the dimension of the embedding space and the GRU hidden state $N$ to 1024, the dimension of the learned word embeddings to 300, and the margin $\alpha$ to 0.05. All these hyperparameters, as well as the learning rate and batch size, were selected using the validation set. For consistency with Kiros et al. (2014) and to mitigate overfitting, we constrain the caption and image embeddings to have unit L2 norm. This constraint implies that no two points can be exactly ordered with zero order-violation penalty, but since we use a ranking loss, only the relative size of the penalties matters.
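The claim about the unit-norm constraint can be checked with a short derivation, assuming the penalty vanishes exactly when $x_{i}\geq y_{i}$ for all $i$. For nonnegative $x,y$ with $\|x\|_{2}=\|y\|_{2}=1$:

```latex
\begin{align*}
E(x,y)=0 &\iff x_i \ge y_i \ge 0 \quad \text{for all } i \\
&\implies x_i^2 \ge y_i^2 \quad \text{for all } i \\
&\implies 1=\|x\|_2^2=\sum_i x_i^2 \;\ge\; \sum_i y_i^2=\|y\|_2^2=1 \\
&\implies x_i^2=y_i^2 \quad \text{for all } i \;\implies\; x=y .
\end{align*}
```

So under the norm constraint only identical points reach zero penalty, and any two distinct embeddings incur a strictly positive penalty in at least one direction of comparison.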

Results

Given a query caption or image, we sort all the images or captions of the test set in order of increasing penalty. We use standard ranking metrics for evaluation. We measure Recall@$K$, the percent of queries for which the ground truth (GT) term is one of the first $K$ retrieved; and median and mean rank, which are statistics over the position of the GT term in the retrieval order.
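These ranking metrics can be computed as in the sketch below. The penalty lists and ground truth indices are invented for illustration, ties are broken by index order (an assumption), and each query is assumed to have a single ground truth item.

```python
from statistics import mean, median


def rank_of_ground_truth(penalties, gt_index):
    """1-based position of the ground truth item when sorted by increasing penalty."""
    order = sorted(range(len(penalties)), key=lambda j: penalties[j])
    return order.index(gt_index) + 1


def recall_at_k(ranks, k):
    """Percent of queries whose ground truth item is among the first k retrieved."""
    return 100.0 * sum(r <= k for r in ranks) / len(ranks)


# Hypothetical penalties of three candidates for each of two queries.
queries = [
    ([0.3, 0.1, 0.2], 1),  # ground truth has the smallest penalty
    ([0.3, 0.1, 0.2], 0),  # ground truth has the largest penalty
]
ranks = [rank_of_ground_truth(p, gt) for p, gt in queries]

print(ranks)
print(recall_at_k(ranks, 1), median(ranks), mean(ranks))
```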

Table 2 shows a comparison between all state-of-the-art and some older methods along with our own; see Ma et al. (2015) for a more complete listing. (Note that the numbers for MNLM come not from the published paper but from the recently released code at http://github.com/ryankiros/visual-semantic-embedding.)

The best results overall are in bold, and the best results using 1-crop VGG image features are underlined. Note that the comparison is additionally complicated by the following:

m-CNN_ENS is an ensemble of four different models, whereas the other entries are all single models.

STV and FV use external text corpora to learn their language features, whereas the other methods learn them from scratch.

To facilitate the comparison and to evaluate the contributions of various components of our model, we evaluate four variations of order-embeddings:

order-embeddings is our full model as described above.

order-embeddings (reversed) reverses the order of captions and image embeddings in our order-violation penalty—placing images above captions in the partial order learned by our model. This seemingly slight variation performs atrociously, confirming our prior that captions are much more abstract than images, and should be placed higher in the semantic hierarchy.

order-embeddings (1-crop) computes the image feature using just the center crop, instead of averaging over 10 crops.

order-embeddings (symm.) replaces our asymmetric penalty with the symmetric cosine distance, and allows embedding coordinates to be negative, essentially replicating MNLM but with better image features. Here we find that a different margin ($\alpha=0.2$) works best.

With these four variants, the only previous work whose results are incommensurable with ours is DVSA, since it uses the less discriminative CNN of Krizhevsky et al. (2012) but 20 region features instead of a single whole-image feature.

Aside from this limitation, and if only single models are considered, order-embeddings significantly outperform the state-of-the-art approaches for image retrieval even when we control for image features.

Exploration

Why would order-embeddings do well on such a shallow partial order? Why are they much more helpful for image retrieval than for caption retrieval?

Intuitively, symmetric similarity should fail when an image has captions with very different levels of detail, because the captions are so dissimilar that it is impossible to map both their embeddings close to the same image embedding. Order-embeddings don’t have this problem: the less detailed caption can be embedded very far away from the image while remaining above it in the partial order.

To evaluate this intuition, we use caption length as a proxy for level of detail and select, among pairs of co-referring captions in our validation set, the 100 pairs with the biggest length difference. For image retrieval with 1000 target images, the mean rank over captions in this set is 6.4 for order-embeddings and 9.7 for cosine similarity, a much bigger difference than over the entire dataset. Some particularly dramatic examples of this are shown in Figure 3. Moreover, if we use the shorter caption as a query, and retrieve captions in order of increasing error, the mean rank of the longer caption is 34.0 for order-embeddings and 47.6 for cosine similarity, showing that order-embeddings are able to capture the relatedness of co-referring captions with very different lengths.

This also explains why order-embeddings provide a much smaller improvement for caption retrieval than for image retrieval: all the caption retrieval metrics are based on the position of the first ground truth caption in the retrieval order, so the embeddings need only learn to retrieve one of each image’s five captions well, which symmetric similarity is well suited for.

Textual Entailment / Natural Language Inference

Natural language inference can be seen as a generalization of hypernymy from words to sentences. For example, from “woman walking her dog in a park” we can infer both “woman walking her dog” and “dog in a park”, but not “old woman” or “black dog”. Given a pair of sentences, our task is to predict whether we can infer the second sentence (the hypothesis) from the first (the premise).

To apply order-embeddings to this task, we again view it as partial order completion—we can infer a hypothesis from a premise exactly when the hypothesis is above the premise in the visual-semantic hierarchy.

Unlike our other tasks, for which we had to generate contrastive negatives, datasets for natural language inference include labeled negative examples. So, we can simply use a max-margin loss:

$$\sum_{(p,h)}E(f(p),f(h))+\sum_{(p',h')}\max\{0,\alpha-E(f(p'),f(h'))\} \tag{7}$$

where $(p,h)$ are positive and $(p',h')$ negative pairs of premise and hypothesis. To embed sentences, we use the same GRU encoder as in the caption-image retrieval task.

Dataset

To evaluate order-embeddings on the natural language inference task, we use the recently proposed SNLI corpus (Bowman et al., 2015), which contains 570,000 pairs of sentences, each labeled with “entailment” if the inference is valid, “contradiction” if the two sentences contradict, or “neutral” if the inference is invalid but there is no contradiction. Our method only allows us to discriminate between entailment and non-entailment, so we merge the “contradiction” and “neutral” classes together to serve as our negative examples.
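The merging of classes described above amounts to a small label mapping. The sketch below is illustrative: the function name and the example sentence pairs are assumptions, and only the three label strings named in the text are handled.

```python
def is_entailment(label):
    """Map a 3-class label to the binary target used for partial order completion."""
    if label == "entailment":
        return True
    if label in ("contradiction", "neutral"):
        return False
    raise ValueError(f"unexpected label: {label!r}")


# Hypothetical (premise, hypothesis, label) triples.
examples = [
    ("woman walking her dog in a park", "woman walking her dog", "entailment"),
    ("woman walking her dog in a park", "old woman", "neutral"),
    ("woman walking her dog in a park", "nobody is outside", "contradiction"),
]

positives = [(p, h) for p, h, y in examples if is_entailment(y)]
negatives = [(p, h) for p, h, y in examples if not is_entailment(y)]
print(len(positives), len(negatives))
```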

Implementation Details

Just as for caption-image ranking, we set the dimensions of the embedding space and GRU hidden state to be 1024, the dimension of the word embeddings to be 300, and constrain the embeddings to have unit L2 norm. We train for 10 epochs with batches of 128 sentence pairs. We use the Adam optimizer with learning rate 0.001 and early stopping on the validation set. During evaluation, we find the optimal classification threshold on validation, then use the threshold to classify the test set.

Results

The state-of-the-art method for 3-class classification on SNLI is that of Rocktäschel et al. (2015). Unfortunately, they do not compute 2-class accuracy, so we cannot compare to them directly.

As a bridge to facilitate comparison, we use a challenging baseline which can be evaluated on both the 2-class and 3-class problems. The baseline, referred to as skip-thoughts, involves a feedforward neural network on top of skip-thought vectors (Kiros et al., 2015), a state-of-the-art semantic representation of sentences. Given pairs of sentence vectors $u$ and $v$, the input to the network is the concatenation of $u$, $v$ and the absolute difference $|u-v|$. We tuned the number of layers, layer dimensionality and dropout rates to optimize performance on the development set, using the Adam optimizer. Batch normalization (Ioffe & Szegedy, 2015) and PReLU units (He et al., 2015) were used. Our best network used 2 hidden layers of 1000 units each, with dropout rate of 0.5 across both the input and hidden layers. We did not backpropagate through the skip-thought encoder.
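The input features of the skip-thoughts baseline can be sketched as a plain concatenation. The function name and the two short vectors are illustrative assumptions; real sentence vectors would be far higher-dimensional.

```python
def pair_features(u, v):
    """Concatenate u, v, and the elementwise absolute difference |u - v|."""
    if len(u) != len(v):
        raise ValueError("sentence vectors must have the same dimensionality")
    return list(u) + list(v) + [abs(a - b) for a, b in zip(u, v)]


# Hypothetical 2-dimensional sentence vectors.
u = [1.0, 2.0]
v = [3.0, 0.0]

features = pair_features(u, v)
print(features)       # [1.0, 2.0, 3.0, 0.0, 2.0, 2.0]
print(len(features))  # three times the sentence-vector dimensionality
```

The absolute difference is symmetric in $u$ and $v$, so any sensitivity of this baseline to the direction of inference has to come from the ordering of the first two blocks.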

We also evaluate against the EOP classifier, a 2-class baseline introduced by Bowman et al. (2015), and against a version of our model where our order-violation penalty is replaced with the symmetric cosine distance, order-embeddings (symmetric).

The results for all models are shown in Table 3. We see that order-embeddings outperform the skip-thought baseline despite not using external text corpora. While our method is almost certainly worse than the state-of-the-art method of Rocktäschel et al. (2015), which uses a word-by-word attention mechanism, it is also much simpler.

Conclusion and Future Work

We introduced a simple method to encode order into learned distributed representations, which allows us to explicitly model the partial order structure of the visual-semantic hierarchy. Our method can be easily integrated into existing relational learning methods, as we demonstrated on three challenging tasks involving computer vision and natural language processing. On two of these tasks, hypernym prediction and caption-image retrieval, our methods outperform all previous work.

A promising direction of future work is to learn better classifiers on ImageNet (Deng et al., 2009), which has over 21k image classes arranged by the WordNet hierarchy. Previous approaches, including Frome et al. (2013) and Norouzi et al. (2014), have embedded words and images into a shared semantic space with symmetric similarity, which our experiments suggest to be a poor fit with the partial order structure of WordNet. We expect significant progress on ImageNet classification, and the related problems of one-shot and zero-shot learning, to be possible using order-embeddings.

Going further, order-embeddings may enable learning the entire semantic hierarchy in a single model which jointly reasons about hypernymy, entailment, and the relationship between perception and language, unifying what have been until now almost independent lines of work.

We thank Kaustav Kundu for many fruitful discussions throughout the development of this paper. The work was supported in part by an NSERC Graduate Scholarship.

References

Supplementary Material

Mikolov et al. (2013) showed that word representations learned using word2vec exhibit semantic regularities, such as $king-man+woman\sim queen$. Kiros et al. (2014) showed that similar regularities hold for joint image-language models. We find that order-embeddings exhibit a novel form of regularity, shown in Figure 4. The elementwise $\max$ and $\min$ operations in the embedding space roughly correspond to composition and abstraction, respectively.
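The link between these operations and the order can be made concrete under the reversed product order, where $x\preceq y$ is assumed to mean $x_{i}\geq y_{i}$ for all $i$: the elementwise maximum of two points lies below both, and the elementwise minimum lies above both. The vectors for "dog" and "cat" are invented for illustration and are not learned embeddings.

```python
def below(x, y):
    """Reversed product order: x is below y iff x_i >= y_i in every coordinate."""
    return all(xi >= yi for xi, yi in zip(x, y))


def compose(x, y):
    """Elementwise max: a point at least as specific as both inputs."""
    return tuple(max(a, b) for a, b in zip(x, y))


def abstract(x, y):
    """Elementwise min: a point at least as general as both inputs."""
    return tuple(min(a, b) for a, b in zip(x, y))


# Hypothetical embeddings of two unrelated concepts.
dog = (1.0, 0.0, 2.0)
cat = (0.0, 3.0, 1.0)

both = compose(dog, cat)    # plays the role of something like "dog chasing cat"
either = abstract(dog, cat)  # plays the role of something like "mammal"

print(below(both, dog), below(both, cat))      # the composition is below each part
print(below(dog, either), below(cat, either))  # the abstraction is above each part
```

In lattice terms the maximum is the greatest lower bound and the minimum the least upper bound of the two points, which fits the "roughly correspond" phrasing above; learned embeddings need not satisfy this exactly.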