Microsoft COCO Captions: Data Collection and Evaluation Server

Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedantam, Saurabh Gupta, Piotr Dollar, C. Lawrence Zitnick

Introduction

The automatic generation of captions for images is a long-standing and challenging problem in artificial intelligence. Research in this area spans numerous domains, such as computer vision, natural language processing, and machine learning. Recently there has been a surprising resurgence of interest in this area, due to the renewed interest in neural network learning techniques and increasingly large datasets.

In this paper, we describe our process of collecting captions for the Microsoft COCO Caption dataset, and the evaluation server we have set up to evaluate the performance of different algorithms. The MS COCO caption dataset contains human generated captions for images contained in the Microsoft Common Objects in COntext (COCO) dataset. Similar to previous datasets, we collect our captions using Amazon’s Mechanical Turk (AMT). Upon completion, the dataset will contain over a million captions.

When evaluating image caption generation algorithms, it is essential that a consistent evaluation protocol is used. Comparing results from different approaches can be difficult since numerous evaluation metrics exist. To further complicate matters, the implementations of these metrics often differ. To help alleviate these issues, we have built an evaluation server to enable consistency in the evaluation of different caption generation approaches. Using the testing data, our evaluation server evaluates captions output by different approaches using numerous automatic metrics: BLEU, METEOR, ROUGE and CIDEr. We hope to augment these results with human evaluations on an annual basis.

This paper is organized as follows: First we describe the data collection process. Next, we describe the caption evaluation server and the various metrics used. Human performance using these metrics is then provided. Finally, the annotation format and instructions for using the evaluation server are described for those who wish to submit results. We conclude by discussing future directions and known issues.

Data Collection

In this section we describe how the data is gathered for the MS COCO captions dataset. For images, we use the dataset collected by Microsoft COCO. These images are split into training, validation and testing sets. The images were gathered by searching for pairs of 80 object categories and various scene types on Flickr. The goal of the MS COCO image collection process was to gather images containing multiple objects in their natural context. Given the visual complexity of most images in the dataset, they pose an interesting and difficult challenge for image captioning.

For generating a dataset of image captions, the same training, validation and testing sets were used as in the original MS COCO dataset. Two datasets were collected. The first dataset, MS COCO c5, contains five reference captions for every image in the MS COCO training, validation and testing datasets. The second dataset, MS COCO c40, contains 40 reference sentences for 5,000 randomly chosen images from the MS COCO testing dataset. MS COCO c40 was created since many automatic evaluation metrics achieve higher correlation with human judgement when given more reference sentences. MS COCO c40 may be expanded to include the MS COCO validation dataset in the future.

Our process for gathering captions received significant inspiration from the work of Young et al. and Hodosh et al., who collected captions on Flickr images using Amazon’s Mechanical Turk (AMT). Each of our captions is also generated using human subjects on AMT. Each subject was shown the user interface in Figure 2. The subjects were given the following instructions:

Describe all the important parts of the scene.

Do not start the sentences with “There is”.

Do not describe things that might have happened in the future or past.

The sentences should contain at least 8 words.

We gathered 413,915 captions for 82,783 images in training, 202,520 captions for 40,504 images in validation and 379,249 captions for 40,775 images in testing, including 179,189 for MS COCO c5 and 200,060 for MS COCO c40. For each testing image, we collected one additional caption to compute human performance scores for comparison with the scores of machine generated captions. The total number of collected captions is 1,026,459. We plan to collect captions for the MS COCO 2015 dataset when it is released, which should approximately double the size of the caption dataset. The AMT interface may be obtained from the MS COCO website.

Caption evaluation

In this section we describe the MS COCO caption evaluation server. Instructions for using the evaluation server are provided in Section 5. As input the evaluation server receives candidate captions for both the validation and testing datasets in the format specified in Section 5. The validation and test images are provided to the submitter. However, the human generated reference sentences are only provided for the validation set. The reference sentences for the testing set are kept private to reduce the risk of overfitting.

Numerous evaluation metrics are computed on both MS COCO c5 and MS COCO c40. These include BLEU-1, BLEU-2, BLEU-3, BLEU-4, ROUGE-L, METEOR and CIDEr-D. The details of these metrics are described next.

Both the candidate captions and the reference captions are pre-processed by the evaluation server. To tokenize the captions, we use Stanford PTBTokenizer in Stanford CoreNLP tools (version 3.4.1), which mimics Penn Treebank 3 tokenization. In addition, punctuation marks are removed from the tokenized captions. The full list of removed punctuation tokens is: {“, ”, ‘, ’, -LRB-, -RRB-, -LCB-, -RCB-, ., ?, !, ,, :, -, –, …, ;}.
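As a minimal sketch of the punctuation-removal step, the following Python snippet filters an already tokenized caption against the list above. It assumes tokenization has been done beforehand (for example by PTBTokenizer); the function name `remove_punctuation` is hypothetical and is not part of the evaluation server.

```python
# Punctuation tokens as listed in the text, including the Penn Treebank
# bracket codes (-LRB-, -RRB-, -LCB-, -RCB-).
PUNCTUATIONS = {
    "“", "”", "‘", "’", "-LRB-", "-RRB-", "-LCB-", "-RCB-",
    ".", "?", "!", ",", ":", "-", "–", "…", ";",
}


def remove_punctuation(tokens):
    """Drop punctuation tokens from an already tokenized caption.

    Hypothetical helper: tokenization itself is assumed to have been
    performed upstream and is not reproduced here.
    """
    return [t for t in tokens if t not in PUNCTUATIONS]


tokens = ["a", "man", "-LRB-", "left", "-RRB-", "rides", "a", "horse", "."]
cleaned = remove_punctuation(tokens)
```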

Evaluation metrics

Our goal is to automatically evaluate for an image $I_{i}$ the quality of a candidate caption $c_{i}$ given a set of reference captions $S_{i}=\{s_{i1},\ldots,s_{im}\}\in S$. The caption sentences are represented using sets of $n$-grams, where an $n$-gram $\omega_{k}\in\Omega$ is an ordered sequence of one or more words. In this paper we explore $n$-grams with one to four words. No stemming is performed on the words. The number of times an $n$-gram $\omega_{k}$ occurs in a sentence $s_{ij}$ is denoted $h_{k}(s_{ij})$, or $h_{k}(c_{i})$ for the candidate sentence $c_{i}\in C$.
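The count $h_{k}(\cdot)$ can be illustrated with a short Python sketch. The helper name `ngram_counts` is an assumption made for illustration; it maps each $n$-gram (as a tuple of words) to the number of times it occurs in a tokenized sentence.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Return h_k for every n-gram of length n in a tokenized sentence.

    Each n-gram is represented as a tuple of n consecutive words.
    Sentences shorter than n yield an empty Counter.
    """
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


sentence = "a dog chases a dog".split()
unigrams = ngram_counts(sentence, 1)
bigrams = ngram_counts(sentence, 2)
```

Here the bigram ("a", "dog") occurs twice, so its count is 2, while every other bigram has a count of 1.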

BLEU

BLEU is a popular machine translation metric that analyzes the co-occurrences of $n$-grams between the candidate and reference sentences. It computes a corpus-level clipped $n$-gram precision between sentences as follows:

where $k$ indexes the set of possible $n$-grams of length $n$. The clipped precision metric limits the number of times an $n$-gram may be counted to the maximum number of times it is observed in a single reference sentence. Because $CP_{n}$ is a precision score, it favors short sentences, so a brevity penalty is also used:

where $l_{C}$ is the total length of the candidate sentences $c_{i}$ and $l_{S}$ is the corpus-level effective reference length. When there are multiple references for a candidate sentence, we use the closest reference length for the brevity penalty.

The overall BLEU score is computed using a weighted geometric mean of the individual $n$-gram precisions:

where $N=1,2,3,4$ and $w_{n}$ is typically held constant for all $n$.
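As an illustrative sketch (not the evaluation server's code), the BLEU computation can be written in Python under the standard definitions, which are assumed here: the clipped precision $CP_{n}(C,S)=\frac{\sum_{i}\sum_{k}\min(h_{k}(c_{i}),\max_{j}h_{k}(s_{ij}))}{\sum_{i}\sum_{k}h_{k}(c_{i})}$, the brevity penalty $b(C,S)=1$ if $l_{C}>l_{S}$ and $e^{1-l_{S}/l_{C}}$ otherwise, and $BLEU_{N}(C,S)=b(C,S)\exp\left(\sum_{n=1}^{N}w_{n}\log CP_{n}(C,S)\right)$ with $w_{n}=1/N$. The function names and the tie-breaking rule for the closest reference length are assumptions.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count the n-grams of length n in a tokenized sentence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(candidates, references, max_n=4):
    """Corpus-level BLEU_N with uniform weights w_n = 1 / max_n.

    candidates: list of token lists, one per image.
    references: list (one per image) of lists of token lists.
    """
    l_c = sum(len(c) for c in candidates)
    # Effective reference length: closest reference length per candidate.
    # Ties are broken toward the shorter reference (an assumption).
    l_s = sum(
        min((abs(len(r) - len(c)), len(r)) for r in refs)[1]
        for c, refs in zip(candidates, references)
    )

    log_precisions = []
    for n in range(1, max_n + 1):
        matched = 0
        total = 0
        for c, refs in zip(candidates, references):
            cand = ngram_counts(c, n)
            # Maximum count of each n-gram in any single reference.
            max_ref = Counter()
            for r in refs:
                for gram, count in ngram_counts(r, n).items():
                    max_ref[gram] = max(max_ref[gram], count)
            matched += sum(min(count, max_ref[gram]) for gram, count in cand.items())
            total += sum(cand.values())
        if matched == 0 or total == 0:
            return 0.0  # the geometric mean is zero if any precision is zero
        log_precisions.append(math.log(matched / total))

    brevity = 1.0 if l_c > l_s else math.exp(1.0 - l_s / l_c)
    return brevity * math.exp(sum(log_precisions) / max_n)


perfect = corpus_bleu(
    [["a", "cat", "sits", "on", "the", "mat"]],
    [[["a", "cat", "sits", "on", "the", "mat"]]],
)
clipped = corpus_bleu([["the"] * 4], [[["the", "cat"]]], max_n=1)
```

In the second call the candidate repeats "the" four times, but clipping counts it only once, giving a unigram precision of 1/4.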

BLEU has shown good performance for corpus-level comparisons, over which a high number of $n$-gram matches exist. However, at the sentence level, $n$-gram matches for higher $n$ rarely occur. As a result, BLEU performs poorly when comparing individual sentences.

ROUGE

ROUGE is a set of evaluation metrics designed to evaluate text summarization algorithms.

ROUGE$_{N}$: The first ROUGE metric computes a simple $n$-gram recall over all reference summaries given a candidate sentence:

ROUGE$_{L}$: ROUGE$_{L}$ uses a measure based on the Longest Common Subsequence (LCS). An LCS is a set of words shared by two sentences which occur in the same order. However, unlike $n$-grams, there may be words in between the words that create the LCS. Given the length $l(c_{i},s_{ij})$ of the LCS between a pair of sentences, ROUGE$_{L}$ is found by computing an F-measure:

$R_{l}$ and $P_{l}$ are the recall and precision of the LCS. $\beta$ is usually set to favor recall ($\beta=1.2$). Since $n$-grams are implicit in this measure due to the use of the LCS, they need not be specified.
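The ROUGE$_{L}$ computation can be sketched in Python as follows. The sketch assumes the usual definitions $R_{l}=\max_{j}\frac{l(c_{i},s_{ij})}{|s_{ij}|}$, $P_{l}=\max_{j}\frac{l(c_{i},s_{ij})}{|c_{i}|}$ and $ROUGE_{L}=\frac{(1+\beta^{2})R_{l}P_{l}}{R_{l}+\beta^{2}P_{l}}$; the function names are hypothetical, and non-empty candidate and reference sentences are assumed.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            if x == y:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def rouge_l(candidate, references, beta=1.2):
    """F-measure over LCS-based recall and precision (beta favors recall)."""
    lcs = [lcs_length(candidate, r) for r in references]
    recall = max(l / len(r) for l, r in zip(lcs, references))
    precision = max(lcs) / len(candidate)
    if recall == 0 or precision == 0:
        return 0.0
    return (1 + beta ** 2) * recall * precision / (recall + beta ** 2 * precision)


score = rouge_l("a b c d".split(), ["a x c d".split()])
```

In this example the LCS is "a c d" (length 3), so recall and precision are both 3/4 and the F-measure is 3/4 regardless of $\beta$.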

ROUGE$_{S}$: The final ROUGE metric uses skip bi-grams instead of the LCS or $n$-grams. Skip bi-grams are pairs of ordered words in a sentence. However, similar to the LCS, words may be skipped between pairs of words. Thus, a sentence with 4 words would have $C^{4}_{2}=6$ skip bi-grams. Precision and recall are again incorporated to compute an F-measure score. If $f_{k}(s_{ij})$ is the skip bi-gram count for sentence $s_{ij}$, ROUGE$_{S}$ is computed as:

Skip bi-grams are capable of capturing long range sentence structure. In practice, skip bi-grams are computed so that the component words occur at a distance of at most 4 from each other.
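Skip bi-gram extraction can be sketched in Python as below. The helper name `skip_bigrams` and its `max_gap` parameter are assumptions; in particular, "distance" is interpreted here as the difference between the two word positions, which may differ from the convention of a specific ROUGE implementation.

```python
from collections import Counter
from itertools import combinations


def skip_bigrams(tokens, max_gap=None):
    """Count ordered word pairs (skip bi-grams) in a tokenized sentence.

    max_gap limits the difference between the two word positions;
    None means no limit (every ordered pair is counted).
    """
    return Counter(
        (tokens[i], tokens[j])
        for i, j in combinations(range(len(tokens)), 2)
        if max_gap is None or j - i <= max_gap
    )


four_words = skip_bigrams("a dog chases cats".split())
limited = skip_bigrams("one two three four five six".split(), max_gap=4)
```

The four-word sentence yields the 6 skip bi-grams mentioned above; with `max_gap=4`, the six-word sentence loses only the pair formed by its first and last words.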

METEOR

METEOR is calculated by generating an alignment between the words in the candidate and reference sentences, with an aim of 1:1 correspondence. This alignment is computed while minimizing the number of chunks, $ch$, of contiguous and identically ordered tokens in the sentence pair. The alignment is based on exact token matching, followed by WordNet synonyms, stemmed tokens and then paraphrases. Given a set of alignments, $m$, the METEOR score is the harmonic mean of precision $P_{m}$ and recall $R_{m}$ between the best scoring reference and candidate:

Thus, the final METEOR score includes a penalty $Pen$ based on the chunkiness of resolved matches and a harmonic mean term that gives the quality of the resolved matches. The default parameters $\alpha$, $\gamma$ and $\theta$ are used for this evaluation. Note that, similar to BLEU, statistics of precision and recall are first aggregated over the entire corpus and are then combined to give the corpus-level METEOR score.
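The scoring step can be sketched in Python once an alignment is available. The sketch assumes the commonly stated form $Pen=\gamma\left(\frac{ch}{m}\right)^{\theta}$, $F_{mean}=\frac{P_{m}R_{m}}{\alpha P_{m}+(1-\alpha)R_{m}}$ and $METEOR=(1-Pen)F_{mean}$, with $P_{m}$ and $R_{m}$ the fraction of matched words in the candidate and the reference. It does not compute the alignment itself, it ignores the match-type weights of the full METEOR tool, and the parameter values in the example are illustrative rather than the tool's defaults.

```python
def meteor_from_alignment(num_matches, num_chunks, cand_len, ref_len,
                          alpha, gamma, theta):
    """Combine alignment statistics into a METEOR-style score.

    num_matches: number of aligned words (m).
    num_chunks:  number of contiguous, identically ordered chunks (ch).
    alpha, gamma, theta: metric parameters supplied by the caller.
    """
    if num_matches == 0:
        return 0.0
    precision = num_matches / cand_len
    recall = num_matches / ref_len
    f_mean = precision * recall / (alpha * precision + (1 - alpha) * recall)
    penalty = gamma * (num_chunks / num_matches) ** theta
    return (1 - penalty) * f_mean


# Illustrative parameter values only (not claimed to be the defaults).
example = meteor_from_alignment(4, 1, 4, 4, alpha=0.5, gamma=0.5, theta=3)
```

With all four words matched in a single chunk, precision and recall are 1 and the only deduction is the small fragmentation penalty $0.5\cdot(1/4)^{3}$.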

CIDEr

The CIDEr metric measures consensus in image captions by performing a Term Frequency Inverse Document Frequency (TF-IDF) weighting for each $n$-gram. The number of times an $n$-gram $\omega_{k}$ occurs in a reference sentence $s_{ij}$ is denoted by $h_{k}(s_{ij})$, or $h_{k}(c_{i})$ for the candidate sentence $c_{i}$. CIDEr computes the TF-IDF weighting $g_{k}(s_{ij})$ for each $n$-gram $\omega_{k}$ using:

where $\Omega$ is the vocabulary of all $n$-grams and $I$ is the set of all images in the dataset. The first term measures the TF of each $n$-gram $\omega_{k}$, and the second term measures the rarity of $\omega_{k}$ using its IDF. Intuitively, TF places higher weight on $n$-grams that frequently occur in the reference sentences describing an image, while IDF reduces the weight of $n$-grams that commonly occur across all descriptions. That is, the IDF provides a measure of word saliency by discounting popular words that are likely to be less visually informative. The IDF is computed using the logarithm of the number of images in the dataset $|I|$ divided by the number of images for which $\omega_{k}$ occurs in any of its reference sentences.

The CIDEr$_{n}$ score for $n$-grams of length $n$ is computed using the average cosine similarity between the candidate sentence and the reference sentences, which accounts for both precision and recall:

where $\bm{g^{n}}(c_{i})$ is a vector formed by $g_{k}(c_{i})$ corresponding to all $n$-grams of length $n$ and $\|\bm{g^{n}}(c_{i})\|$ is the magnitude of the vector $\bm{g^{n}}(c_{i})$. Similarly for $\bm{g^{n}}(s_{ij})$.

Higher order (longer) $n$-grams are used to capture grammatical properties as well as richer semantics. Scores from $n$-grams of varying lengths are combined as follows:

Uniform weights $w_{n}=1/N$ are used, with $N=4$.
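The CIDEr computation described above can be sketched in Python. This is a minimal illustration under stated assumptions, not the evaluation server's implementation: TF is the $n$-gram count divided by the total number of $n$-grams in the sentence, IDF is $\log(|I|/\text{df})$ with df the number of images whose references contain the $n$-gram, and $n$-grams never seen in any reference are given df = 1 to avoid division by zero. All function names are hypothetical.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count the n-grams of length n in a tokenized sentence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def document_frequency(all_refs, n):
    """Number of images whose reference sentences contain each n-gram."""
    df = Counter()
    for refs in all_refs:
        seen = set()
        for r in refs:
            seen.update(ngram_counts(r, n))
        df.update(seen)
    return df


def tfidf(tokens, n, df, num_images):
    """TF-IDF vector g^n of a sentence, as a dict from n-gram to weight."""
    counts = ngram_counts(tokens, n)
    total = sum(counts.values())
    # Unseen n-grams get document frequency 1 (an assumption).
    return {g: (c / total) * math.log(num_images / max(1, df[g]))
            for g, c in counts.items()}


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * v.get(g, 0.0) for g, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u > 0 and norm_v > 0 else 0.0


def cider(candidate, refs, all_refs, max_n=4):
    """Average cosine similarity over references, combined with w_n = 1/N."""
    num_images = len(all_refs)
    score = 0.0
    for n in range(1, max_n + 1):
        df = document_frequency(all_refs, n)
        g_c = tfidf(candidate, n, df, num_images)
        sims = [cosine(g_c, tfidf(r, n, df, num_images)) for r in refs]
        score += (1.0 / max_n) * sum(sims) / len(sims)
    return score


all_refs = [
    [["a", "dog", "runs", "fast"]],
    [["the", "cat", "sleeps", "now"]],
]
same = cider(["a", "dog", "runs", "fast"], all_refs[0], all_refs)
different = cider(["the", "cat", "sleeps", "now"], all_refs[0], all_refs)
```

A candidate identical to its only reference scores 1, while a candidate sharing no $n$-grams with the reference scores 0. Note that an $n$-gram occurring in the references of every image receives an IDF of zero and so contributes nothing.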

CIDEr-D is a modification to CIDEr that makes it more robust to gaming. Gaming refers to the phenomenon where a sentence that is poorly judged by humans tends to score highly with an automated metric. To defend the CIDEr metric against gaming effects, clipping and a length-based Gaussian penalty are added to the CIDEr metric described above. This results in the following equations for CIDEr-D:

where $l(c_{i})$ and $l(s_{ij})$ denote the lengths of the candidate and reference sentences respectively. $\sigma=6$ is used. A factor of 10 is used in the numerator to make the CIDEr-D scores numerically similar to the other metrics.
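The two modifications can be sketched in Python, assuming the form $CIDEr\text{-}D_{n}(c_{i},S_{i})=\frac{10}{m}\sum_{j}e^{-\frac{(l(c_{i})-l(s_{ij}))^{2}}{2\sigma^{2}}}\cdot\frac{\min(\bm{g^{n}}(c_{i}),\bm{g^{n}}(s_{ij}))\cdot\bm{g^{n}}(s_{ij})}{\|\bm{g^{n}}(c_{i})\|\|\bm{g^{n}}(s_{ij})\|}$, where the minimum is taken element-wise. The function name `cider_d_n` is hypothetical, the TF-IDF vectors are assumed to be precomputed with non-negative weights, and this is an illustration rather than the server's code.

```python
import math


def cider_d_n(g_c, len_c, refs, sigma=6.0):
    """CIDEr-D score for one n-gram length.

    g_c:   TF-IDF vector of the candidate (dict from n-gram to weight).
    len_c: candidate sentence length in words.
    refs:  list of (g_s, len_s) pairs for the reference sentences.
    """
    norm_c = math.sqrt(sum(w * w for w in g_c.values()))
    total = 0.0
    for g_s, len_s in refs:
        norm_s = math.sqrt(sum(w * w for w in g_s.values()))
        if norm_c == 0 or norm_s == 0:
            continue
        # Clipping: a candidate weight cannot exceed the reference weight.
        clipped_dot = sum(
            min(w, g_s.get(g, 0.0)) * g_s.get(g, 0.0) for g, w in g_c.items()
        )
        # Gaussian penalty on the difference in sentence lengths.
        penalty = math.exp(-((len_c - len_s) ** 2) / (2 * sigma ** 2))
        total += penalty * clipped_dot / (norm_c * norm_s)
    return 10.0 * total / len(refs)


vector = {("a",): 0.3, ("dog",): 0.4}
identical = cider_d_n(vector, 5, [(vector, 5)])
longer = cider_d_n(vector, 11, [(vector, 5)])
```

An identical candidate of equal length scores 10; the same vector with a length difference of 6 words is scaled by $e^{-1/2}$.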

The final CIDEr-D metric is computed in a similar manner to CIDEr (analogous to eqn. 18):

Note that just like the BLEU and ROUGE metrics, CIDEr-D does not use stemming. We adopt the CIDEr-D metric for the evaluation server.

Human performance

In this section, we study human agreement on this task. We start by analyzing the inter-human agreement for image captioning (Section 4.1) and then analyze human agreement for the word prediction sub-task and provide a simple model which explains human agreement for this sub-task (Section 4.2).

When examining human agreement on captions, it becomes clear that there are many equivalent ways to say essentially the same thing. We quantify this by conducting the following experiment: We collect one additional human caption for each image in the test set and treat this caption as the prediction. Using the MS COCO caption evaluation server we compute the various metrics. The results are tabulated in Table I.

Human Agreement for Word Prediction

We can do a similar analysis for human agreement at the sub-task of word prediction. Consider the task of tagging the image with words that occur in the captions. For this task, we can compute the human precision and recall for a given word $w$ by benchmarking words used in the $(k+1)$-th human caption with respect to words used in the first $k$ reference captions. Note that we use weighted versions of precision and recall, where each negative image has a weight of 1 and each positive image has a weight equal to the number of captions containing the word $w$. Human precision ($H_{p}$) and human recall ($H_{r}$) can be computed from the counts of how many subjects out of $k$ use the word $w$ to describe a given image over the whole dataset.

We plot $H_{p}$ versus $H_{r}$ for a set of nouns, verbs and adjectives, and all 1000 words considered in Figure 3. Nouns referring to animals like ‘elephant’ have a high recall, which means that if an ‘elephant’ exists in the image, a subject is likely to talk about it (which makes intuitive sense, given ‘elephant’ images are somewhat rare, and there are no alternative words that could be used instead of ‘elephant’). On the other hand, an adjective like ‘bright’ is used inconsistently and hence has low recall. Interestingly, words with high recall also have high precision. Indeed, all the points of human agreement appear to lie on a one-dimensional curve in the two-dimensional precision-recall space.

This observation motivates us to propose a simple model for when subjects use a particular word $w$ for describing an image. Let $o$ denote an object or visual concept associated with word $w$, $n$ be the total number of images, and $k$ be the number of reference captions. Next, let $q=P(o=1)$ be the probability that object $o$ exists in an image. For clarity these definitions are summarized in Table II. We make two simplifications. First, we ignore image level saliency and instead focus on word level saliency. Specifically, we only model $p=P(w=1|o=1)$, the probability a subject uses $w$ given that $o$ is in the image, without conditioning on the image itself. Second, we assume that $P(w=1|o=0)=0$, i.e. that a subject does not use $w$ unless $o$ is in the image. As we will show, even with these simplifications our model suffices to explain the empirical observations in Figure 3 to a reasonable degree of accuracy.

Given these assumptions, we can model human precision $\widetilde{H_{p}}$ and recall $\widetilde{H_{r}}$ for a word $w$ given only $p$ and $k$. First, given $k$ captions per image, we need to compute the expected number of (1) captions containing $w$ ($cw$), (2) true positives ($tp$), and (3) false positives ($fp$). Note that in our definition there can be up to $k$ true positives per image (if $cw=k$, i.e. each of the $k$ captions contains word $w$) but at most 1 false positive (if none of the $k$ captions contains $w$). The expectations, in terms of $k$, $p$, and $q$ are:

In the above, $w^{i}=1$ denotes that $w$ appeared in the $i^{th}$ caption. Note that we are also assuming independence between subjects conditioned on $o$. We can now define model precision and recall as:

Note that these expressions are independent of $q$ and only depend on $p$. Interestingly, because of the use of weighted precision and recall, the recall for a category comes out to be exactly equal to $p$, the probability a subject uses $w$ given that $o$ is in the image.
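The model can be checked numerically with a short Python sketch. Under the two simplifications and the independence assumption stated above, the expectations work out to $E[cw]=kpq$, $E[tp]=kp^{2}q$ and $E[fp]=qp(1-p)^{k}$, which gives $\widetilde{H_{p}}=\frac{E[tp]}{E[tp]+E[fp]}=\frac{pk}{pk+(1-p)^{k}}$ and $\widetilde{H_{r}}=\frac{E[tp]}{E[cw]}=p$. These closed forms are a reconstruction from the definitions in the text, and the function names are hypothetical.

```python
def model_expectations(k, p, q):
    """Expected counts per image under the word-usage model.

    k: number of reference captions per image.
    p: P(w = 1 | o = 1), probability a subject uses w when o is present.
    q: P(o = 1), probability the object is in the image.
    """
    e_cw = k * p * q               # captions containing w
    e_tp = k * p * p * q           # (k+1)-th caption and a reference both use w
    e_fp = q * p * (1 - p) ** k    # (k+1)-th caption uses w, no reference does
    return e_cw, e_tp, e_fp


def model_precision_recall(k, p, q=0.5):
    """Model precision and recall; q cancels out of both ratios."""
    e_cw, e_tp, e_fp = model_expectations(k, p, q)
    return e_tp / (e_tp + e_fp), e_tp / e_cw


precision, recall = model_precision_recall(k=4, p=0.5)
```

For $k=4$ and $p=0.5$ the model gives a precision of $2/2.0625\approx 0.97$ and a recall of exactly $0.5$, and changing $q$ leaves both values unchanged.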

We set $k=4$ and vary $p$ to plot $\widetilde{H_{p}}$ versus $\widetilde{H_{r}}$, obtaining the curve shown in blue in Figure 3 (bottom left). The curve explains the observed data quite well, closely matching the precision-recall tradeoffs of the empirical data (although not perfectly). We can also reduce the number of captions from four, and look at how the empirical and predicted precision and recall change. Figure 3 (bottom right) shows this variation as we reduce the number of reference captions per image from four to one. We see that the points of human agreement remain at the same recall value, but decrease in their precision, which is consistent with what the model predicts. Also, the human precision at infinite subjects will approach one, which is again reasonable given that a subject will only use the word $w$ if the corresponding object is in the image (and in the presence of infinite subjects someone else will also use the word $w$).

In fact, the fixed recall value can help us recover $p$, the probability that a subject will use the word $w$ in describing the image given the object is present. Nouns like ‘elephant’ and ‘tennis’ have large $p$, which is reasonable. Verbs and adjectives, on the other hand, have smaller $p$ values, which can be justified from the fact that a) subjects are less likely to describe attributes of objects and b) subjects might use a different word (synonym) to describe the same attribute.

This analysis of human agreement also motivates using a different metric for measuring performance. We propose Precision at Human Recall (PHR) as a metric for measuring performance of a vision system performing this task. Given that human recall for a particular word is fixed and precision varies with the number of annotations, we can look at system precision at human recall and compare it with human precision to report the performance of the vision system.
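One possible way to compute PHR for a single word is sketched below in Python. This is a simplified, unweighted illustration under several assumptions: the system produces a confidence score per image for the word, binary labels indicate whether the word applies, and precision is read off at the first ranking cutoff whose recall reaches the human recall. The function name and this exact procedure are assumptions, not the authors' implementation, which uses weighted precision and recall.

```python
def precision_at_recall(scores, labels, target_recall):
    """Precision of a ranked list at the first cutoff reaching target_recall.

    scores: system confidence that the word applies to each image.
    labels: 1 if the word applies to the image, else 0.
    target_recall: the human recall for this word.
    """
    num_pos = sum(labels)
    if num_pos == 0:
        raise ValueError("at least one positive label is required")
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    true_pos = 0
    for rank, (_, label) in enumerate(ranked, 1):
        true_pos += label
        if true_pos / num_pos >= target_recall:
            return true_pos / rank
    return true_pos / len(ranked)


phr = precision_at_recall(
    scores=[0.9, 0.8, 0.7, 0.6],
    labels=[1, 0, 1, 1],
    target_recall=0.6,
)
```

In this example a recall of at least 0.6 is first reached after three images, two of which are positive, so the precision at that point is 2/3. This value would then be compared with the human precision for the same word.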

Evaluation Server Instructions

Directions on how to use the MS COCO caption evaluation server can be found on the MS COCO website. The evaluation server is hosted by CodaLab. To participate, a user account on CodaLab must be created. The participants need to generate results on both the validation and testing datasets. When training for the generation of results on the test dataset, the training and validation dataset may be used as the participant sees fit. That is, the validation dataset may be used for training if desired. However, when generating results on the validation set, we ask participants to only train on the training dataset, and only use the validation dataset for tuning meta-parameters. Two JSON files should be created corresponding to results on each dataset in the following format:

The results may then be placed into a zip file and uploaded to the server for evaluation. Code is also provided on GitHub to evaluate results on the validation dataset without having to upload to the server. The number of submissions per user is limited to a fixed amount.
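As a hedged illustration of preparing a results file, the Python sketch below serializes captions as a JSON list of records with an integer `image_id` and a string `caption`. This layout is assumed from the publicly released COCO caption evaluation code and should be checked against the instructions on the MS COCO website; the image ids, captions and helper name are made up for the example.

```python
import json


def to_results_json(results):
    """Serialize caption results as a JSON list of records.

    Each record is assumed to have an integer "image_id" and a string
    "caption"; this layout is an assumption, not confirmed by the text.
    """
    for record in results:
        if not isinstance(record["image_id"], int):
            raise TypeError("image_id must be an integer")
        if not isinstance(record["caption"], str):
            raise TypeError("caption must be a string")
    return json.dumps(results)


# Hypothetical image ids and captions, for illustration only.
results = [
    {"image_id": 1, "caption": "a man riding a horse on a beach"},
    {"image_id": 2, "caption": "two dogs playing with a frisbee in a park"},
]
payload = to_results_json(results)
```

The resulting string can be written to a file, one file per dataset, before the files are zipped for upload.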

Discussion

Many challenges exist when creating an image caption dataset. The captions generated by human subjects can vary significantly. However, even though two captions may be very different, they may be judged equally “good” by human subjects. Designing effective automatic evaluation metrics that are highly correlated with human judgment remains a difficult challenge. We hope that by releasing results on the validation data, we can help enable future research in this area.

Since automatic evaluation metrics do not always correspond to human judgment, we hope to conduct experiments using human subjects to judge the quality of automatically generated captions, which captions are most similar to human captions, and whether they are grammatically correct. This is essential to determining whether future algorithms are indeed improving, or whether they are merely overfitting to a specific metric. These human experiments will also allow us to evaluate the automatic evaluation metrics themselves, and see which ones are correlated with human judgment.