Neural Baby Talk

Jiasen Lu, Jianwei Yang, Dhruv Batra, Devi Parikh

Introduction

Image captioning is a challenging problem that lies at the intersection of computer vision and natural language processing. It involves generating a natural language sentence that accurately summarizes the contents of an image. Image captioning is also an important first step towards real-world applications with significant practical impact, ranging from aiding visually impaired users to personal assistants to human-robot interaction.

State-of-the-art image captioning models today tend to be monolithic neural models, essentially of the “encoder-decoder” paradigm. Images are encoded into a vector with a convolutional neural network (CNN), and captions are decoded from this vector using a Recurrent Neural Network (RNN), with the entire system trained end-to-end. While there are many recent extensions of this basic idea to include attention, it is well-understood that models still lack visual grounding (i.e., do not associate named concepts to pixels in the image). They often tend to ‘look’ at different regions than humans would and tend to copy captions from training data.

For instance, in Fig. 1 a neural image captioning approach describes the image as “A dog is sitting on a couch with a toy.” This is not quite accurate. But if one were to really squint at the image, it (arguably) does perhaps look like a scene where a dog could be sitting on a couch with a toy. It certainly is common to find dogs sitting on couches with toys. A priori, the description is reasonable. That’s exactly what today’s neural captioning models tend to do – produce generic plausible captions based on the language model (frequently, directly reproduced from a caption in the training data) that match a first-glance gist of the scene. While this may suffice for common scenes, images that differ from canonical scenes – given the diversity in our visual world, there are plenty of such images – tend to be underserved by these models.

If we take a step back – do we really need the language model to do the heavy lifting in image captioning? Given the unprecedented progress we are seeing in object recognition (e.g., object detection, semantic segmentation, instance segmentation, pose estimation), such as an 11% absolute increase in average precision in object detection in the COCO challenge in the last year, it seems like the vision pipeline can certainly do better than rely on just a first-glance gist of the scene. In fact, today’s state-of-the-art object detectors can successfully detect the table and cake in the image in Fig. 1(c)! The caption ought to be able to talk about the table and cake actually detected as opposed to letting the language model hallucinate a couch and a toy simply because that sounds plausible.

Interestingly, some of the first attempts at image captioning – before the deep learning “revolution” – relied heavily on outputs of object detectors and attribute classifiers to describe images. For instance, consider the output of Baby Talk in Fig. 1, that used a slot filling approach to talk about all the objects and attributes found in the scene via a templated caption. The language is unnatural but the caption is very much grounded in what the model sees in the image. Today’s approaches fall at the other extreme on the spectrum – the language generated by modern neural image captioning approaches is much more natural but tends to be much less grounded in the image.

In this paper, we introduce Neural Baby Talk that reconciles these methodologies. It produces natural language explicitly grounded in entities found by object detectors. It is a neural approach that generates a sentence “template” with slot locations explicitly tied to image regions. These slots are then filled by object recognizers with concepts found in the regions. The entire approach is trained end-to-end. This results in natural sounding and grounded captions.

Our main technical contribution is a novel neural decoder for grounded image captioning. Specifically, at each time step, the model decides whether to generate a word from the textual vocabulary or generate a “visual” word. The visual word is essentially a token that will hold the slot for a word that is to describe a specific region in the image. For instance, for the image in Fig. 1, the generated sequence may be “A <region-17> is sitting at a <region-123> with a <region-3>.” The visual words (<region-[.]>’s) are then filled in during a second stage that classifies each of the indicated regions (e.g., <region-17> $\rightarrow$ puppy, <region-123> $\rightarrow$ table), resulting in a final description of “A puppy is sitting at a table with a cake.” – a free-form natural language description that is grounded in the image. One nice feature of our model is that it allows for different object detectors to be plugged in easily. As a result, a variety of captions can be produced for the same image using different detection backends. See Fig. 2 for an illustration.

Contributions: Our contributions are as follows:

We present Neural Baby Talk – a novel framework for visually grounded image captioning that explicitly localizes objects in the image while generating free-form natural language descriptions.

Ours is a two-stage approach that first generates a hybrid template that contains a mix of (text) words and slots explicitly associated with image regions, and then fills in the slots with (text) words by recognizing the content in the corresponding image regions.

We propose a robust image captioning task to benchmark compositionality of image captioning algorithms where at test time the model encounters images containing known objects but in novel combinations (e.g., the model has seen dogs on couches and people at tables during training, but at test time encounters a dog at a table). Generalizing to such novel compositions is one way to demonstrate image grounding as opposed to simply leveraging correlations from training data.

Our proposed method achieves state-of-the-art performance on COCO and Flickr30k datasets on the standard image captioning task, and significantly outperforms existing approaches on the robust image captioning and novel object captioning tasks.

Related Work

Some of the earlier approaches generated templated image captions via slot-filling. For instance, Kulkarni et al. detect objects, attributes, and prepositions, jointly reason about these through a CRF, and finally fill appropriate slots in a template. Farhadi et al. compute a triplet for a scene, and use this templated “meaning” representation to retrieve a caption from a database. Other approaches use more powerful language templates such as a syntactically well-formed tree. These approaches tend to either produce captions that are relevant to the image but not natural sounding, or captions that are natural (e.g., retrieved from a database of captions) but may not be sufficiently grounded in the image.

Neural models for image captioning have been receiving increased attention in the last few years. State-of-the-art neural approaches include attention mechanisms that identify regions in the image to “ground” emitted words. In practice, these attention regions tend to be quite blurry, and rarely correspond to semantically meaningful individual entities (e.g., object instances) in the image. Our approach grounds words in object detections, which by design identify concrete semantic entities (object instances) in the image.

There has been some recent interest in grounding natural language in images. Dense Captioning generates descriptions for specific image regions. In contrast, our model produces captions for the entire image, with words grounded in concrete entities in the image. Another related line of work is on resolving referring expressions (or description-based object retrieval – given a description of an object in the image, identify which object is being referred to) or referring expression generation (given an object in the image, generate a discriminative description of the object). While the interest in grounded language is in common, our task is different.

One natural strength of our model is its ability to incorporate different object detectors, including the ability to generate captions with novel objects (never seen before in training captions). In that context, our work is related to prior works on novel object captioning. As we describe in Sec. 4.3, our method outperforms these approaches by 14.6% on the averaged F1 score.

Method

Given an image $\bm{I}$, the goal of our method is to generate visually grounded descriptions $\bm{y}=\{y_{1},\ldots,y_{T}\}$. Let $\bm{r}_{\bm{I}}=\{r_{1},\ldots,r_{N}\}$ be the set of $N$ image regions extracted from $\bm{I}$. When generating an entity word in the caption, we want to ground it in a specific image region $r\in\bm{r}_{\bm{I}}$. Following the standard supervised learning paradigm, we learn parameters $\bm{\theta}$ of our model by maximizing the likelihood of the correct caption:

$$\bm{\theta}^{*}=\arg\max_{\bm{\theta}}\sum_{(\bm{I},\bm{y})}\log p(\bm{y}\mid\bm{I};\bm{\theta}) \qquad (1)$$

Using the chain rule, the joint probability distribution can be decomposed over a sequence of tokens:

$$p(\bm{y}\mid\bm{I})=\prod_{t=1}^{T}p(y_{t}\mid\bm{y}_{1:t-1},\bm{I}) \qquad (2)$$

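where we drop the dependency on model parameters to avoid notational clutter. We introduce a latent variable $r_{t}$ to denote a specific image region so that $y_{t}$ can be explicitly grounded in it. Thus the probability of $y_{t}$ is decomposed into:

$$p(y_{t}\mid\bm{y}_{1:t-1},\bm{I})=p(y_{t}\mid r_{t},\bm{y}_{1:t-1},\bm{I})\,p(r_{t}\mid\bm{y}_{1:t-1},\bm{I}) \qquad (3)$$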

With this, Eq. 1 can be decomposed into two cascaded objectives. First, maximizing the probability of generating the sentence “template”. A sequence of grounding regions associated with the visual words interspersed with the textual words can be viewed as a sentence “template”, where the grounding regions are slots to be filled in with visual words. (Our approach is not limited to any pre-specified bank of templates. Rather, it automatically generates a template, with placeholders (slots) for visually grounded words, which may be any one of the exponentially many possible templates.) An example template (Fig. 3) is “A <region-2> is laying on the <region-4> near a <region-7>.” Second, maximizing the probability of visual words $y_{t}^{vis}$ conditioned on the grounding regions and object detection information, e.g., categories recognized by the detector. In the template example above, the model will fill the slots with ‘cat’, ‘laptop’ and ‘chair’ respectively.
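The two-stage idea can be illustrated with a toy slot-filling step. This is only a sketch: the function name `fill_template`, the regular expression, and the `region_words` dictionary are hypothetical, and in the actual model the words come from the learned refinement stage rather than a lookup table.

```python
import re

# Matches slots of the form <region-k>, where k is a region index.
SLOT = re.compile(r"<region-(\d+)>")

def fill_template(template, region_words):
    """Replace every <region-k> slot with the word chosen for region k."""
    def lookup(match):
        region_id = int(match.group(1))
        return region_words[region_id]
    return SLOT.sub(lookup, template)

# Stage one output (the template) and hypothetical stage two output (the words).
template = "A <region-2> is laying on the <region-4> near a <region-7>."
region_words = {2: "cat", 4: "laptop", 7: "chair"}

caption = fill_template(template, region_words)
print(caption)
```

Swapping in a different `region_words` mapping, for instance one produced by another detector, changes the caption without touching the template.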

In the following, we first describe how we generate the slotted caption template (Sec. 3.1), and then how the slots are filled in to obtain the final image description (Sec. 3.2). The overall objective function is described in Sec. 3.3 and the implementation details in Sec. 3.4.

where we drop the dependency on $\bm{I}$ to avoid clutter.

We first describe how the visual sentinel is computed, and then how the textual words are determined based on the visual sentinel. Following prior work, when the decoder RNN is an LSTM, the representation for the visual sentinel $\bm{s}_{t}$ can be obtained by:
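For reference, a visual sentinel for an LSTM decoder is commonly written as a gated read of the memory cell. The form below is a sketch under assumed notation: $\bm{x}_{t}$ is the LSTM input, $\bm{c}_{t}$ the memory cell, $\bm{W}_{x}$ and $\bm{W}_{h}$ learned weights, $\sigma$ the logistic sigmoid, and $\odot$ the element-wise product. None of these symbols is defined in the surrounding text.

```latex
\begin{aligned}
\bm{g}_{t} &= \sigma\left(\bm{W}_{x}\bm{x}_{t} + \bm{W}_{h}\bm{h}_{t-1}\right),\\
\bm{s}_{t} &= \bm{g}_{t}\odot\tanh\left(\bm{c}_{t}\right)
\end{aligned}
```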

We feed the hidden state $\bm{h}_{t}$ into a softmax layer to obtain the probability over textual words conditioned on the image, all previous words, and the visual sentinel:
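A softmax layer over the hidden state can be sketched as follows. The weight matrix name `W_q`, the absence of a bias term, and the toy numbers are assumptions made for illustration only.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def textual_word_probs(W_q, h_t):
    """Project the hidden state h_t onto the vocabulary and normalize.

    W_q has one row per vocabulary word (hypothetical name and shape).
    """
    logits = [sum(w * h for w, h in zip(row, h_t)) for row in W_q]
    return softmax(logits)

# Toy example: a 3-word vocabulary and a 2-dimensional hidden state.
W_q = [[0.5, -0.2], [0.1, 0.3], [-0.4, 0.8]]
h_t = [1.0, 2.0]
probs = textual_word_probs(W_q, h_t)
print(probs)
```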

3.2 Caption Refinement: Filling in The Slots

To fill the slots in the generated template with visual words grounded in image regions, we leverage the outputs of an object detection network. Given a grounding region, the category can be obtained through any detection framework. But outputs of detection networks are typically singular coarse labels, e.g., “dog”. Captions often refer to these entities in a fine-grained fashion, e.g., “puppy”, or in the plural form “dogs”. In order to accommodate these linguistic variations, the visual word $y^{vis}$ in our model is a refinement of the category name that considers the following two factors: first, the plurality (whether the word should be singular or plural); second, the fine-grained class (if any). Using two single-layer MLPs with ReLU activation $f(\cdot)$, we compute them with:
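The refinement step can be sketched as a simple selection rule once the two classifiers have produced their scores. In this sketch the function name, the score formats, and the `plural_forms` lookup are hypothetical; the scores stand in for the outputs of the two MLPs, and the naive "add s" fallback is purely illustrative.

```python
def refine_visual_word(category, plural_probs, fine_scores, plural_forms):
    """Turn a coarse detector label into the word used in the caption.

    category:      coarse detector label, e.g. "dog"
    plural_probs:  (p_singular, p_plural) from the plurality classifier
    fine_scores:   dict mapping candidate fine-grained words to scores
    plural_forms:  dict of irregular plural spellings (hypothetical helper)
    """
    # Pick the highest-scoring fine-grained word, or keep the coarse label.
    word = max(fine_scores, key=fine_scores.get) if fine_scores else category
    is_plural = plural_probs[1] > plural_probs[0]
    if not is_plural:
        return word
    return plural_forms.get(word, word + "s")

fine = {"dog": 0.2, "puppy": 0.7, "hound": 0.1}   # made-up scores
plurals = {"puppy": "puppies"}
print(refine_visual_word("dog", (0.9, 0.1), fine, plurals))
print(refine_visual_word("dog", (0.2, 0.8), fine, plurals))
```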

3.3 Objective

Most standard image captioning datasets (e.g., COCO) do not contain phrase grounding annotations, while some datasets do (e.g., Flickr30k). Our training objective (presented next) can incorporate different kinds of supervision – be it strong annotations indicating which words in the caption are grounded in which boxes in the image, or weak supervision where objects are annotated in the image but are not aligned to words in the caption. Given the target ground truth caption $\bm{y}_{1:T}^{*}$ and an image captioning model with parameters $\bm{\theta}$, we minimize the cross entropy loss:

Visual word extraction. During training, visual words in a caption are dynamically identified by matching the base form of each word (obtained with the Stanford lemmatization toolbox) against a vocabulary of visual words (details of how the visual words are obtained can be found in the dataset description in Sec. 4). The grounding regions $\{r_{t}^{i}\}_{i=1}^{m}$ for a visual word $y_{t}$ are identified by computing the IoU of all boxes detected by the object detection network with the ground truth bounding box associated with the category corresponding to $y_{t}$. If the IoU exceeds a threshold of $0.5$ and the grounding region label matches the visual word, the bounding boxes are selected as the grounding regions. If no proposal matches the target bounding box, e.g., for a target visual word “cat”, the model predicts the textual word “cat” instead.
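The matching rule described above can be sketched with a plain IoU computation. The box format `(x_min, y_min, x_max, y_max)`, the function names, and the example boxes are assumptions for illustration.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x_min, y_min, x_max, y_max)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def grounding_regions(detections, gt_box, category, threshold=0.5):
    """Indices of detections whose label matches and whose IoU exceeds the threshold.

    detections: list of (box, label) pairs from the detector.
    An empty result means the word is treated as a textual word instead.
    """
    return [i for i, (box, label) in enumerate(detections)
            if label == category and iou(box, gt_box) > threshold]

detections = [((12, 8, 108, 112), "cat"),     # overlaps the ground truth well
              ((10, 10, 110, 110), "dog"),    # right place, wrong label
              ((200, 200, 260, 260), "cat")]  # right label, wrong place
print(grounding_regions(detections, (10, 10, 110, 110), "cat"))
```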

3.4 Implementation Details

Detection model. We use Faster R-CNN with ResNet-101 to obtain region proposals for the image. We use an IoU threshold of 0.7 for region proposal suppression and 0.3 for class suppression. A class detection confidence threshold of 0.5 is used to select regions.
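The thresholds above can be read as a confidence filter combined with greedy non-maximum suppression. The sketch below handles a single class with plain Python; the function names, the order of the two steps, and the example boxes are assumptions, and a real detector applies suppression per class inside the detection framework.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x_min, y_min, x_max, y_max)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, iou_threshold):
    """Greedy NMS: keep a box only if it does not overlap a higher-scoring kept box."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep

def select_regions(boxes, scores, conf_threshold=0.5, iou_threshold=0.3):
    """Drop low-confidence boxes, then suppress overlapping ones (single class)."""
    confident = [i for i, s in enumerate(scores) if s >= conf_threshold]
    kept = nms([boxes[i] for i in confident],
               [scores[i] for i in confident], iou_threshold)
    return [confident[k] for k in kept]

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30), (40, 40, 50, 50)]
scores = [0.9, 0.8, 0.7, 0.2]
selected = select_regions(boxes, scores)
print(selected)
```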

Region feature. We use a pre-trained ResNet-101 in our model. The image is first resized to $576\times 576$ and we randomly crop $512\times 512$ as the input to the CNN. Given proposals from the pre-trained detection model, the feature $\bm{v}_{i}$ for region $i$ is a concatenation of 3 different features $\bm{v}_{i}=[\bm{v}_{i}^{p};\bm{v}_{i}^{l};\bm{v}_{i}^{g}]$ where $\bm{v}_{i}^{p}$ is the pooling feature of the RoI align layer given the proposal coordinates, $\bm{v}_{i}^{l}$ is the location feature and $\bm{v}_{i}^{g}$ is the GloVe vector embedding of the class label for region $i$. Let $x_{\text{min}},y_{\text{min}},x_{\text{max}},y_{\text{max}}$ be the bounding box coordinates of region $i$; $W_{I}$ and $H_{I}$ be the width and height of the image $I$. Then the location feature $\bm{v}_{i}^{l}$ can be obtained by projecting the normalized location $[\dfrac{x_{\text{min}}}{W_{I}},\dfrac{y_{\text{min}}}{H_{I}},\dfrac{x_{\text{max}}}{W_{I}},\dfrac{y_{\text{max}}}{H_{I}}]$ into another embedding space.
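The normalized location that feeds the location embedding is a four-number vector. A minimal sketch follows; the function name and the example box are hypothetical, and the learned projection into the embedding space is omitted.

```python
def normalized_location(box, image_width, image_height):
    """Scale box corners (x_min, y_min, x_max, y_max) to the [0, 1] range."""
    x_min, y_min, x_max, y_max = box
    return [x_min / image_width, y_min / image_height,
            x_max / image_width, y_max / image_height]

# A box inside a 512 x 512 crop.
location = normalized_location((128, 64, 384, 256), 512, 512)
print(location)
```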

Language model. We use an attention model with two LSTM layers as our base attention model. Given $N$ region features from detection proposals $\bm{V}=\{\bm{v}_{1},\ldots,\bm{v}_{N}\}$ and CNN features from the last convolution layer at $K$ grids $\bm{\hat{V}}=\{\bm{\hat{v}}_{1},\ldots,\bm{\hat{v}}_{K}\}$, the language model has two separate attention layers shown in Fig. 4. The attention distribution over the image features for detection proposals is:
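A standard additive attention over the $N$ region features has the following form. This is a sketch under assumed notation: $\bm{w}_{z}$, $\bm{W}_{v}$ and $\bm{W}_{g}$ are hypothetical names for learned parameters, $\bm{h}_{t}$ is the attention LSTM hidden state, and $\bm{\alpha}^{t}$ is the resulting weight vector over regions.

```latex
\begin{aligned}
z_{i}^{t} &= \bm{w}_{z}^{\top}\tanh\left(\bm{W}_{v}\bm{v}_{i} + \bm{W}_{g}\bm{h}_{t}\right),\\
\bm{\alpha}^{t} &= \operatorname{softmax}\left(\bm{z}^{t}\right)
\end{aligned}
```

The attended region feature is then the weighted sum $\sum_{i=1}^{N}\alpha_{i}^{t}\bm{v}_{i}$, and the same construction applies to the $K$ grid features.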

Training details. In our experiments, we use a two-layer LSTM with hidden size $1024$. The number of hidden units in the attention layer and the size of the input word embedding are $512$. We use the Adam optimizer with an initial learning rate of $5\times 10^{-4}$ and anneal the learning rate by a factor of $0.8$ every three epochs. We train the model for up to 50 epochs with early stopping. Note that we do not finetune the CNN during training. We set the batch size to 100 for COCO and 50 for Flickr30k.
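The annealing schedule can be written as a step decay. The sketch assumes epochs are counted from zero and that the decay is applied at the start of every third epoch; the function name is hypothetical.

```python
def learning_rate(epoch, base_lr=5e-4, decay=0.8, step=3):
    """Step decay: multiply the base rate by `decay` once every `step` epochs."""
    return base_lr * decay ** (epoch // step)

for epoch in (0, 2, 3, 6):
    print(epoch, learning_rate(epoch))
```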

Experimental Results

Datasets. We experiment with two datasets. Flickr30k Entities contains 275,755 bounding boxes from 31,783 images associated with natural language phrases. Each image is annotated with 5 crowdsourced captions. For each annotated phrase in the caption, we identify visual words by selecting the innermost NP (noun phrase) tag from the Stanford part-of-speech tagger. We use the Stanford Lemmatization Toolbox to get the base form of the entity words, resulting in 2,567 unique words.

COCO contains 82,783, 40,504 and 40,775 images for training, validation and testing respectively. Each image has around 5 crowdsourced captions. Unlike Flickr30k Entities, COCO does not have bounding box annotations associated with specific phrases or entities in the caption. To identify visual words, we manually constructed an object category to word mapping that maps object categories like <person> to a list of potential fine-grained labels like [“child”, “baker”, …]. This results in 80 categories with a total of 413 fine-grained classes. See the appendix for details.

Detector pre-training. We use an open-source implementation of Faster R-CNN to train the detector. For Flickr30k Entities, we use visual words that occur at least 100 times as detection labels, resulting in a total of $460$ detection labels. Since detection labels and visual words have a one-to-one mapping, we do not have fine-grained classes for the Flickr30k Entities dataset – the caption refinement process only determines the plurality of detection labels. For COCO, ground truth detection annotations are used to train the object detector.

Caption pre-processing. We truncate captions longer than 16 words for both the COCO and Flickr30k Entities datasets. We then build a vocabulary of words that occur at least 5 times in the training set, resulting in 9,587 and 6,864 words for COCO and Flickr30k Entities, respectively.
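The pre-processing amounts to truncation followed by a frequency cut-off. The sketch below assumes lowercase whitespace tokenization and counts words after truncation; both choices, along with the function name and the toy captions, are assumptions rather than details given in the text.

```python
from collections import Counter

def build_vocab(captions, max_len=16, min_count=5):
    """Truncate each caption to max_len tokens and keep words seen >= min_count times."""
    truncated = [c.lower().split()[:max_len] for c in captions]
    counts = Counter(word for caption in truncated for word in caption)
    vocab = sorted(word for word, n in counts.items() if n >= min_count)
    return truncated, vocab

captions = ["a dog sits on a couch"] * 5
captions.append("a rare zebra")
captions.append(" ".join(f"w{i}" for i in range(20)))  # a 20-token caption

truncated, vocab = build_vocab(captions)
print(vocab)
print(len(truncated[-1]))
```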

For standard image captioning, we use splits from Karpathy et al. on COCO/Flickr30k. We report results using the COCO captioning evaluation toolkit, which reports the widely used automatic evaluation metrics BLEU, METEOR, CIDEr and SPICE.

We present our methods trained on different object detectors: Flickr and COCO. We compare our approach (referred to as NBT) to the recently proposed Hard-Attention, ATT-FCN and Adaptive on Flickr30k, and Att2in and Up-Down on COCO. Since object detectors have not yet achieved near-perfect accuracies on these datasets, we also report the performance of our model under an oracle setting (referred to as NBT$^{\text{oracle}}$), where the ground truth object region and category are also provided at test time. This can be viewed as the upper bound of our method when we have perfect object detectors.

Table 1 shows results on the Flickr30k dataset. We see that our method achieves state-of-the-art results on all automatic evaluation metrics, outperforming the previous state-of-the-art model Adaptive by 2.0 and 4.4 on BLEU4 and CIDEr. When using ground truth proposals, NBT$^{\text{oracle}}$ significantly outperforms previous methods, improving 5.1 on SPICE, which implies that our method could further benefit from improved object detectors.

Table 2 shows results on the COCO dataset. Our method outperforms the state of the art on 4 out of 5 automatic evaluation metrics without using better visual features or directly optimizing the CIDEr metric. Interestingly, NBT$^{\text{oracle}}$ shows little improvement over NBT. We suspect the reason is that explicit ground truth annotation is absent for visual words. Our model can be further improved with explicit co-reference supervision where the ground truth location annotation of the visual word is provided. Fig. 5 shows qualitative results on both datasets. We see that our model learns to correctly identify the visual word, and ground it in image regions even under weak supervision (COCO). Our model is also robust to erroneous detections and produces correct captions (3rd column).

4.2 Robust Image Captioning

To quantitatively evaluate image captioning models for novel scene compositions, we present a new split of the COCO dataset, called the robust-COCO split. This new split is created by re-organizing the train and val splits of the COCO dataset such that the distribution of co-occurring objects in train is different from test. We also present a new metric to evaluate grounding.

Robust split. To create the new split, we first identify entity words that belong to the 80 COCO object categories by following the same pre-processing procedure. For each image, we get a list of object categories that are mentioned in the caption. We then calculate the co-occurrence statistics for these 80 object categories. Starting from the least co-occurring category pairs, we greedily add them to the test set while ensuring that at least half the instances of each category remain in the train set. As a result, there are sufficient examples from each category in train, but at test time we see novel compositions (pairs) of categories. Remaining images are assigned to the training set. The final split has 110,234/3,915/9,138 images in train/val/test respectively.
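One way to read the greedy procedure is sketched below. Several details are assumptions: an "instance" of a category is counted as an image whose captions mention it, ties between equally rare pairs are broken alphabetically, a pair is moved to test only if all of its remaining images can be moved without violating the half constraint, and the validation split is ignored. The function name and the toy data are hypothetical.

```python
from collections import Counter
from itertools import combinations

def robust_split(image_categories):
    """Greedy train/test split over category pairs.

    image_categories: dict mapping image id -> set of mentioned categories.
    Returns (train_ids, test_ids).
    """
    pair_images = {}
    total = Counter()  # number of images mentioning each category
    for img, cats in image_categories.items():
        total.update(cats)
        for pair in combinations(sorted(cats), 2):
            pair_images.setdefault(pair, []).append(img)

    in_test = Counter()  # per-category images already moved to test
    test = set()
    # Visit pairs from least to most frequently co-occurring.
    for pair, imgs in sorted(pair_images.items(), key=lambda kv: (len(kv[1]), kv[0])):
        new_imgs = [i for i in imgs if i not in test]
        added = Counter()
        for i in new_imgs:
            added.update(image_categories[i])
        # Move the pair only if every affected category keeps >= half in train.
        if all(2 * (in_test[c] + added[c]) <= total[c] for c in added):
            test.update(new_imgs)
            in_test.update(added)
    train = set(image_categories) - test
    return train, test

images = {1: {"dog", "couch"}, 2: {"dog", "couch"}, 3: {"dog", "couch"},
          4: {"dog", "table"},
          5: {"person", "table"}, 6: {"person", "table"}, 7: {"person", "table"}}
train, test = robust_split(images)
print(sorted(train), sorted(test))
```

In this toy data the rare pair (dog, table) ends up in test while the common pairs stay in train, mirroring the dog-at-a-table example used earlier in the paper.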

Evaluation metric. To evaluate visual grounding on the robust-COCO split, we want a metric that indicates whether or not a generated caption includes the new object combination. Common automatic evaluation metrics such as BLEU and CIDEr measure the overall sentence fluency. We also measure whether the generated caption contains the novel co-occurring categories that exist in the ground truth caption. A generated caption is deemed 100% accurate if it contains at least one mention of the compositionally novel category-pairs in any ground truth annotation that describes the image.
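The proposed metric can be sketched as a per-image hit test averaged over the test set. The sketch assumes captions have already been mapped to sets of object categories and interprets "contains a novel pair" as mentioning both categories of at least one novel pair; the function names and toy data are hypothetical.

```python
def contains_novel_pair(caption_categories, novel_pairs):
    """True if the caption mentions both categories of at least one novel pair."""
    cats = set(caption_categories)
    return any(a in cats and b in cats for a, b in novel_pairs)

def accuracy(predictions, ground_truth_pairs):
    """Percentage of generated captions that mention a novel pair for their image."""
    hits = sum(contains_novel_pair(p, g)
               for p, g in zip(predictions, ground_truth_pairs))
    return 100.0 * hits / len(predictions)

predictions = [{"dog", "table"}, {"cat", "couch"}]
ground_truth_pairs = [[("dog", "table")], [("cat", "remote")]]
score = accuracy(predictions, ground_truth_pairs)
print(score)
```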

Results and analysis. We compare our method with the state-of-the-art Att2in and Up-Down models. These are implemented using an open-source implementation that can replicate results on Karpathy’s split. We follow the same experimental setting and train the model using the robust-COCO train set. Table 3 shows the results on the robust-COCO split. As we can see, all models perform worse on the robust-COCO split than on Karpathy’s split, by 2 to 3 points in general. Our method outperforms the previous state-of-the-art methods on all metrics, outperforming Up-Down by 2.7 on the proposed metric. The oracle setting (NBT$^{\text{oracle}}$) has consistent improvements on all metrics, improving 3.3 on the proposed metric.

Fig. 6 shows qualitative results on the robust image captioning task. Our model successfully produces a caption with novel compositions, such as “cat-remote”, “man-bird” and “dog-skateboard” to describe the image. The last column shows failure cases where our model didn’t select “orange” in the caption. We can force our model to produce a caption containing “orange” and “bird” using constrained beam search, further illustrated in Sec. 4.3.

4.3 Novel Object Captioning

Since our model directly fills the “slotted” caption template with the concept, it can seamlessly generate descriptions for out-of-domain images. We replicated an existing experimental design on COCO which excludes all the image-sentence pairs that contain at least one of eight objects in COCO. The excluded objects are “bottle”, “bus”, “couch”, “microwave”, “pizza”, “racket”, “suitcase” and “zebra”. We follow the same splits for training, validation, and testing as in prior work. We use Faster R-CNN in conjunction with ResNet-101, pre-trained on the COCO train split, as the detection model. Note that we do not pre-train the language model using COCO captions as in prior work, and simply replace the novel object’s word embedding with an existing one which belongs to the same super-category in COCO (e.g., bus $\leftarrow$ car).

Following prior work, the test set is split into in-domain and out-of-domain subsets. We report F1 as in prior work, which checks if the specific excluded object is mentioned in the generated caption. To evaluate the quality of the generated caption, we use SPICE, METEOR and CIDEr metrics, and the scores on out-of-domain test data are macro-averaged across the eight excluded categories. For consistency with previous work, the inverse document frequency statistics used by CIDEr are determined across the entire test set.
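The mention-based F1 and the macro-average can be sketched as follows. The sketch assumes that, for one excluded object, each test image yields a pair of booleans: whether the generated caption mentions the object and whether the reference captions do. The function names and example values are hypothetical.

```python
def f1_score(predicted, actual):
    """F1 over per-image booleans (object mentioned in generated vs. reference captions)."""
    tp = sum(p and a for p, a in zip(predicted, actual))
    fp = sum(p and not a for p, a in zip(predicted, actual))
    fn = sum(a and not p for p, a in zip(predicted, actual))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def macro_average(per_category_scores):
    """Unweighted mean over categories, so each excluded object counts equally."""
    return sum(per_category_scores) / len(per_category_scores)

zebra_f1 = f1_score([True, True, False, False], [True, False, True, False])
print(zebra_f1, macro_average([zebra_f1, 1.0]))
```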

As illustrated in Table 4, simply using greedy decoding, our model (NBT∗+G) can successfully caption novel concepts with minimal changes to the model. When using ResNet-101 and constrained beam search, our model significantly outperforms prior works under F1 scores, SPICE, METEOR, and CIDEr, across both out-of-domain and in-domain test data. Specifically, NBT†+T2 outperforms the previous state-of-the-art model C-LSTM by 14.6% on average F1 scores. From the category F1 scores, we can see that our model is less likely to select small objects, e.g., “bottle” and “racket”, when only using greedy decoding. Since the visual words are grounded at the object level, by using constrained beam search, our model was able to significantly boost the captioning performance on out-of-domain images. Fig. 7 shows qualitative novel object captioning results. Also see the rightmost example in Fig. 2.

Conclusion

In this paper, we introduce Neural Baby Talk, a novel image captioning framework that produces natural language explicitly grounded in entities object detectors find in images. Our approach is a two-stage approach that first generates a hybrid template that contains a mix of words from a text vocabulary as well as slots corresponding to image regions. It then fills the slots based on categories recognized by object detectors in the image regions. We also introduce a robust image captioning split by re-organizing the train and val splits of the COCO dataset. Experimental results on standard, robust, and novel object image captioning tasks validate the effectiveness of our proposed approach.

Acknowledgements This work was funded in part by: NSF CAREER awards to DB, DP; ONR YIP awards to DP, DB; ONR Grants N00014-14-1-{0679,2713}; PGA Family Foundation award to DP; Google FRAs to DP, DB; and Amazon ARAs to DP, DB; DARPA XAI grant to DB, DP.

Appendix: COCO Fine-grained Categories

The COCO dataset does not have bounding box annotations associated with specific phrases or entities in the caption. We use category-level detection annotations and create a category mapping list that maps object categories like <Person> to a list of potential fine-grained labels like [“child”, “man”, “baker”, …]. We first use the Stanford lemmatization toolbox to get the base form of the entity words in the caption. For each category class, we retrieve the top 200 similar words in the word2vec space. We then manually verify each word in the list, resulting in 413 fine-grained classes. A complete list of the fine-grained classes for each object category can be found in Table 5 and Table 6.