Attention on Attention for Image Captioning

Lun Huang, Wenmin Wang, Jie Chen, Xiao-Yong Wei

Introduction

Image captioning is one of the primary goals of computer vision, aiming to automatically generate natural descriptions for images. It requires not only recognizing the salient objects in an image and understanding their interactions, but also verbalizing them in natural language, which makes the task very challenging.

Inspired by the development of neural machine translation, attention mechanisms have been widely used in current encoder/decoder frameworks for visual captioning and achieved impressive results. In such a framework for image captioning, an image is first encoded to a set of feature vectors via a CNN based network and then decoded to words via an RNN based network, where the attention mechanism guides the decoding process by generating a weighted average over the extracted feature vectors for each time step.

The attention mechanism plays a crucial role in any system that must capture global dependencies, e.g. a model for a sequence-to-sequence learning task such as image or video captioning, since the output is directly conditioned on the attention result. However, the decoder has little idea of whether or how well the attention result is related to the query. In some cases the attention result is not what the decoder expects, and the decoder can be misled into giving erroneous results. This can happen when the attention module performs poorly, or when the candidate vectors contain no useful information at all. The former case cannot be avoided entirely, since mistakes always happen. As for the latter, when nothing meets the requirement of a specific query, the attention module still returns a weighted average of the candidate vectors, which is then totally irrelevant to the query.

To address this issue, we propose Attention on Attention (AoA), which extends conventional attention mechanisms by adding another attention. First, AoA generates an “information vector” and an “attention gate” with two linear transformations, which is similar to GLU. The information vector is derived from the current context (i.e. the query) and the attention result via a linear transformation, and stores the newly obtained information from the attention result together with the information from the current context. The attention gate is derived from the query and the attention result via another linear transformation followed by a sigmoid activation, and the value of each channel indicates the relevance/importance of the information on the corresponding channel of the information vector. AoA then adds another attention by applying the attention gate to the information vector using element-wise multiplication, finally obtaining the “attended information”, the expected useful knowledge.

AoA can be applied to various attention mechanisms. For traditional single-head attention, AoA helps to determine the relevance between the attention result and the query. In particular, for the recently proposed multi-head attention, AoA helps to build relationships among different attention heads, filtering all the attention results and keeping only the useful ones.

We apply AoA to both the image encoder and the caption decoder of our image captioning model, AoANet. For the encoder, it extracts feature vectors of objects in the image, applies self-attention to the vectors to model relationships among the objects, and then applies AoA to determine how they are related to each other. For the decoder, it applies AoA to filter out the irrelevant/misleading attention results and keep only the useful ones.

We evaluate the impact of applying AoA to the encoder and the decoder respectively. Both quantitative and qualitative results show that the AoA module is effective. The proposed AoANet outperforms all previously published image captioning models: a single AoANet model achieves a CIDEr-D score of 129.8 on the MS COCO offline test split, and an ensemble of 4 models achieves a CIDEr-D (C40) score of 129.6 on the online testing server. The main contributions of this paper are:

We propose the Attention on Attention (AoA) module, an extension to the conventional attention mechanism, to determine the relevance of attention results.

We apply AoA to both the encoder and the decoder to constitute AoANet: in the encoder, AoA helps to better model relationships among different objects in the image; in the decoder, AoA filters out irrelevant attention results and keeps only the useful ones.

Our method achieves a new state-of-the-art performance on the MS COCO dataset.

Related Work

Earlier approaches to image captioning are rule/template-based: they generate slotted caption templates and use the outputs of object detection, attribute prediction and scene recognition to fill in the slots. Recent approaches are neural-based and, specifically, use a deep encoder/decoder framework inspired by the development of neural machine translation. For instance, an end-to-end framework has been proposed with a CNN encoding the image into a feature vector and an LSTM decoding it into a caption. Later works use a spatial attention mechanism on the CNN feature map to incorporate visual context, propose a spatial and channel-wise attention model, or introduce an adaptive attention mechanism that decides when to activate the visual attention. More recently, more complex information such as objects, attributes and relationships has been integrated to generate better descriptions.

2 Attention Mechanisms

The attention mechanism, which is derived from human intuition, has been widely applied and has yielded significant improvements for various sequence learning tasks. It first calculates an importance score for each candidate vector, then normalizes the scores to weights using the softmax function, and finally applies these weights to the candidates to generate the attention result, a weighted average vector. Other attention mechanisms include spatial and channel-wise attention, adaptive attention, stacked attention, multi-level attention, multi-head attention and self-attention.

Recently, Vaswani et al. showed that solely using self-attention can achieve state-of-the-art results for machine translation. Several works extend the idea of employing self-attention to some tasks in computer vision, which inspires us to apply self-attention to image captioning to model relationships among objects in an image.

3 Other Work

AoA generates an attention gate and an information vector via two linear transformations and applies the gate to the vector to add a second attention. These techniques are similar to some other work: GLU, which replaces RNN and CNN to capture long-range dependencies for language modeling; multi-modal fusion, which models interactions between different modalities (e.g. text and image) and combines information from them; and LSTM/GRU, which use gates and memories to model their inputs in a sequential manner.

4 Summarization

We summarize the differences between our method and the work discussed above, as follows: We apply Attention on Attention (AoA) to image captioning in this paper; AoA is a general extension to attention mechanisms and can be applied to any of them; AoA determines the relevance between the attention result and query, while multi-modal fusion combines information from different modalities; AoA requires only one “attention gate” but no hidden states. In contrast, LSTM/GRU requires hidden states and more gates, and is applicable only to sequence modeling.

Method

We first introduce the Attention on Attention (AoA) module and then show how we derive AoANet for image captioning by applying AoA to the image encoder and the caption decoder.

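An attention module $f_{att}(\boldsymbol{Q,K,V})$ operates on some queries, keys and values and generates some weighted average vectors (denoted by ${\boldsymbol{Q}}$, ${\boldsymbol{K}}$, ${\boldsymbol{V}}$ and $\hat{{\boldsymbol{V}}}$ respectively), as shown in Figure 2(a). It first measures the similarities between ${\boldsymbol{Q}}$ and ${\boldsymbol{K}}$ and then uses the similarity scores to compute weighted average vectors over ${\boldsymbol{V}}$, which can be formulated as:

$$\alpha_{i,j}=\frac{\exp\left(f_{sim}(\boldsymbol{q}_{i},\boldsymbol{k}_{j})\right)}{\sum_{j'}\exp\left(f_{sim}(\boldsymbol{q}_{i},\boldsymbol{k}_{j'})\right)},\qquad \boldsymbol{\hat{v}}_{i}=\sum_{j}\alpha_{i,j}\boldsymbol{v}_{j}$$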

where $\boldsymbol{q}_{i}\in{\boldsymbol{Q}}$ is the $i^{th}$ query, $\boldsymbol{k}_{j}\in\boldsymbol{K}$ and $\boldsymbol{v}_{j}\in\boldsymbol{V}$ are the $j^{th}$ key/value pair; $f_{sim}$ is a function that computes the similarity score of each $\boldsymbol{k}_{j}$ and $\boldsymbol{q}_{i}$ ; and $\boldsymbol{\hat{v}}_{i}$ is the attended vector for the query $\boldsymbol{q}_{i}$ .

The attention module outputs a weighted average for each query, no matter whether or how ${\boldsymbol{Q}}$ and ${\boldsymbol{K}}$/${\boldsymbol{V}}$ are related. Even when there are no relevant vectors, the attention module still generates a weighted average vector, which can carry irrelevant or even misleading information.
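The behaviour described above can be illustrated with a minimal NumPy sketch. It assumes scaled dot-product similarity for $f_{sim}$ (the text leaves $f_{sim}$ generic), and the function names `softmax` and `attention` are illustrative only. A query that matches no key still receives a normalized weighted average, here simply the mean of the values.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over the last axis.
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    # f_sim is assumed to be a scaled dot product; other choices are possible.
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    weights = softmax(scores)          # one weight distribution per query
    return weights @ V, weights        # weighted average of the values

rng = np.random.default_rng(0)
K = V = rng.normal(size=(5, 8))        # five candidate vectors of dimension 8
q_unrelated = np.zeros((1, 8))         # a query with zero similarity to every key

v_hat, w = attention(q_unrelated, K, V)
# The weights are uniform, so v_hat is just the mean of V: a vector is
# returned even though nothing in K is related to the query.
```

The module has no way to signal "nothing relevant found", which is the gap the AoA gate is meant to address.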

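Thus we propose the AoA module (as shown in Figure 2(b)) to measure the relevance between the attention result and the query. The AoA module generates an “information vector” $\boldsymbol{i}$ and an “attention gate” $\boldsymbol{g}$ via two separate linear transformations, which are both conditioned on the attention result $\hat{\boldsymbol{v}}$ and the current context (i.e. the query) $\boldsymbol{q}$:

$$\boldsymbol{i}=W^{i}_{q}\boldsymbol{q}+W^{i}_{v}\hat{\boldsymbol{v}}+b^{i},\qquad \boldsymbol{g}=\sigma(W^{g}_{q}\boldsymbol{q}+W^{g}_{v}\hat{\boldsymbol{v}}+b^{g})$$

where $W^{i}_{q}$, $W^{i}_{v}$, $W^{g}_{q}$, $W^{g}_{v}$ and $b^{i}$, $b^{g}$ are the weights and biases of the two linear transformations, $\hat{\boldsymbol{v}}=f_{att}(\boldsymbol{Q,K,V})$ is the attention result, and $\sigma$ denotes the sigmoid activation function.

Then AoA adds another attention by applying the attention gate to the information vector using element-wise multiplication and obtains the attended information $\hat{\boldsymbol{i}}$:

$$\hat{\boldsymbol{i}}=\boldsymbol{g}\odot\boldsymbol{i}$$

where $\odot$ denotes element-wise multiplication. The overall pipeline of AoA is formulated as:

$$\textrm{AoA}(f_{att},\boldsymbol{Q},\boldsymbol{K},\boldsymbol{V})=\sigma(W^{g}_{q}\boldsymbol{Q}+W^{g}_{v}f_{att}(\boldsymbol{Q,K,V})+b^{g})\odot(W^{i}_{q}\boldsymbol{Q}+W^{i}_{v}f_{att}(\boldsymbol{Q,K,V})+b^{i})$$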
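A minimal NumPy sketch of this pipeline follows. It is an illustration under stated assumptions rather than the reference implementation: the attention function is single-head scaled dot-product, vectors are stored as rows (so the linear maps are written `Q @ W`), and the parameter names `W_q_i`, `W_v_i`, `W_q_g`, `W_v_g`, `b_i`, `b_g` are chosen here for readability.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def dot_attention(Q, K, V):
    # Assumed f_att: single-head scaled dot-product attention.
    return softmax(Q @ K.T / np.sqrt(K.shape[-1])) @ V

def aoa(f_att, Q, K, V, p):
    V_hat = f_att(Q, K, V)                                    # attention result
    info = Q @ p["W_q_i"] + V_hat @ p["W_v_i"] + p["b_i"]     # information vector
    gate = sigmoid(Q @ p["W_q_g"] + V_hat @ p["W_v_g"] + p["b_g"])  # attention gate
    return gate * info                                        # attended information

rng = np.random.default_rng(0)
D = 8
p = {name: rng.normal(scale=0.1, size=(D, D))
     for name in ("W_q_i", "W_v_i", "W_q_g", "W_v_g")}
p["b_i"] = np.zeros(D)
p["b_g"] = np.zeros(D)

Q = rng.normal(size=(3, D))            # three queries
K = V = rng.normal(size=(5, D))        # five key/value pairs
out = aoa(dot_attention, Q, K, V, p)   # shape (3, D)

# With a strongly negative gate bias the gate closes and the output vanishes,
# which is how AoA can suppress an attention result it judges irrelevant.
closed = dict(p, b_g=np.full(D, -50.0))
out_closed = aoa(dot_attention, Q, K, V, closed)
```

Because the gate is computed per channel, the module can pass some channels of the information vector and suppress others, rather than accepting or rejecting the attention result as a whole.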

2 AoANet for Image Captioning

We build the model, AoANet, for image captioning based on the encoder/decoder framework (Figure 3), where both the encoder and the decoder are incorporated with an AoA module.

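Instead of directly feeding the extracted feature vectors ${\boldsymbol{A}}$ to the decoder, we build a refining network which contains an AoA module to refine their representations (Figure 4). The AoA module in the encoder, denoted $\textrm{AoA}^{E}$, adopts the multi-head attention function, where ${\boldsymbol{Q}}$, ${\boldsymbol{K}}$ and ${\boldsymbol{V}}$ are three individual linear projections of the feature vectors ${\boldsymbol{A}}$. The AoA module is followed by a residual connection and layer normalization:

$${\boldsymbol{A}}^{\prime}=\textrm{LayerNorm}({\boldsymbol{A}}+\textrm{AoA}^{E}(f_{mh\text{-}att},W^{Q_{e}}{\boldsymbol{A}},W^{K_{e}}{\boldsymbol{A}},W^{V_{e}}{\boldsymbol{A}}))$$

where $W^{Q_{e}}$, $W^{K_{e}}$ and $W^{V_{e}}$ are the three linear projection matrices and $f_{mh\text{-}att}$ is the multi-head attention function.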

In this refining module, the self-attentive multi-head attention module seeks the interactions among objects in the image, and AoA is applied to measure how well they are related. After refining, we update the feature vectors ${\boldsymbol{A}}\leftarrow{\boldsymbol{A}}^{\prime}$. The refining module does not change the dimension of ${\boldsymbol{A}}$, and thus can be stacked $N$ times ($N=6$ in this paper).
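The refining step can be sketched in NumPy as below. This is a simplified illustration, not the authors' code: it uses single-head attention instead of multi-head attention, a layer normalization without learnable scale and shift, and separate randomly initialized parameters for each of the $N$ layers. All parameter names are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def layer_norm(x, eps=1e-6):
    # Normalizes each feature vector; learnable gain/bias are omitted here.
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def aoa_self_attention(A, p):
    # Q, K, V are three separate linear projections of the same features A.
    Q, K, V = A @ p["WQ"], A @ p["WK"], A @ p["WV"]
    V_hat = softmax(Q @ K.T / np.sqrt(K.shape[-1])) @ V
    info = Q @ p["Wqi"] + V_hat @ p["Wvi"] + p["bi"]
    gate = sigmoid(Q @ p["Wqg"] + V_hat @ p["Wvg"] + p["bg"])
    return gate * info

def refine(A, layers):
    # A <- LayerNorm(A + AoA(A)), repeated once per layer.
    for p in layers:
        A = layer_norm(A + aoa_self_attention(A, p))
    return A

rng = np.random.default_rng(0)
D, k, N = 8, 5, 6                      # feature size, number of regions, layers

def init_layer():
    p = {n: rng.normal(scale=0.1, size=(D, D))
         for n in ("WQ", "WK", "WV", "Wqi", "Wvi", "Wqg", "Wvg")}
    p["bi"] = np.zeros(D)
    p["bg"] = np.zeros(D)
    return p

layers = [init_layer() for _ in range(N)]
A = rng.normal(size=(k, D))
A_refined = refine(A, layers)          # same shape as A, so layers can be stacked
```

The shape of `A_refined` equals that of `A`, which is the property that allows the module to be stacked.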

Note that the refining module adopts a different structure from the original transformer encoder: the feed-forward layer is dropped. This change is optional and is made for two reasons: 1) the feed-forward layer is added to provide non-linear representations, which is also achieved by applying AoA; 2) dropping the feed-forward layer does not perceptibly change the performance of AoANet but makes the model simpler.

2.2 Decoder with AoA

The decoder (Figure 5) generates a caption sequence $\boldsymbol{y}$ from the (refined) feature vectors ${\boldsymbol{A}}$.

We model a context vector $\boldsymbol{c}_{t}$ to compute the conditional probabilities on the vocabulary:

The context vector $\boldsymbol{c}_{t}$ saves the decoding state and the newly acquired information, which is generated with the attended feature vector $\hat{\boldsymbol{a}}_{t}$ and the output $\boldsymbol{h}_{t}$ of an LSTM, where $\hat{\boldsymbol{a}}_{t}$ is the attended result from an attention module which could have a single head or multiple heads.

The LSTM in the decoder models the caption decoding process. Its input consists of the embedding of the input word at the current time step and a visual vector $(\bar{\boldsymbol{a}}+\boldsymbol{c}_{t-1})$, where $\bar{\boldsymbol{a}}=\frac{1}{k}\sum_{i}\boldsymbol{a}_{i}$ denotes the mean pooling of ${\boldsymbol{A}}$ and $\boldsymbol{c}_{t-1}$ denotes the context vector at the previous time step ($\boldsymbol{c}_{-1}$ is initialized to zeros at the beginning step):

As shown in Figure 5, for the AoA decoder, $\boldsymbol{c}_{t}$ is obtained from an AoA module, denoted $\textrm{AoA}^{D}$:
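One decoding step, as described in this subsection, can be sketched in NumPy as follows. Several details are assumptions made for the illustration and are not stated in the text: the LSTM gate ordering, zero initial LSTM states, single-head attention, a softmax output layer with weight `Wp`, and all parameter names and sizes.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def aoa(q, K, V, p):
    # Single-head AoA on one query vector q.
    v_hat = softmax(q @ K.T / np.sqrt(K.shape[-1])) @ V
    info = q @ p["Wqi"] + v_hat @ p["Wvi"] + p["bi"]
    gate = sigmoid(q @ p["Wqg"] + v_hat @ p["Wvg"] + p["bg"])
    return gate * info

def lstm_cell(x, h, m, W, b):
    # Standard LSTM cell; the gate order (input, forget, output, candidate)
    # is a convention chosen for this sketch.
    i, f, o, g = np.split(np.concatenate([x, h]) @ W + b, 4)
    m = sigmoid(f) * m + sigmoid(i) * np.tanh(g)
    return sigmoid(o) * np.tanh(m), m

def decoder_step(word_id, A, c_prev, h, m, p):
    a_bar = A.mean(axis=0)                                  # mean pooling of A
    x = np.concatenate([p["E"][word_id], a_bar + c_prev])   # word embedding + visual vector
    h, m = lstm_cell(x, h, m, p["W_lstm"], p["b_lstm"])
    c = aoa(h @ p["WQ"], A @ p["WK"], A @ p["WV"], p)       # LSTM output is the query
    probs = softmax(c @ p["Wp"])                            # distribution over the vocabulary
    return probs, c, h, m

rng = np.random.default_rng(0)
D, E, k, vocab = 8, 6, 5, 20           # toy sizes, not the paper's settings

def w(*shape):
    return rng.normal(scale=0.1, size=shape)

p = {"E": w(vocab, E), "W_lstm": w(E + 2 * D, 4 * D), "b_lstm": np.zeros(4 * D),
     "WQ": w(D, D), "WK": w(D, D), "WV": w(D, D),
     "Wqi": w(D, D), "Wvi": w(D, D), "bi": np.zeros(D),
     "Wqg": w(D, D), "Wvg": w(D, D), "bg": np.zeros(D),
     "Wp": w(D, vocab)}

A = rng.normal(size=(k, D))
c = np.zeros(D)                        # c_{-1} is initialized to zeros
h, m = np.zeros(D), np.zeros(D)        # zero LSTM states are an assumption
probs, c, h, m = decoder_step(0, A, c, h, m, p)
```

Feeding the previous context vector back into the LSTM input is what lets the decoder carry the gated, attended information from one step to the next.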

3 Training and Objectives

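Training with Cross Entropy Loss. We first train AoANet by optimizing the cross entropy (XE) loss $L_{XE}$:

$$L_{XE}(\theta)=-\sum_{t=1}^{T}\log p_{\theta}({\boldsymbol{y}}_{t}^{*}\mid{\boldsymbol{y}}_{1:t-1}^{*})$$

where $\theta$ denotes the model parameters and ${\boldsymbol{y}}_{1:T}^{*}$ denotes the target ground truth sequence.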
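As a small worked illustration, the XE loss for one caption is the negative sum of the log-probabilities that the model assigns to the ground-truth words. The sketch below assumes the per-step log-probabilities are already available as a `(T, vocab)` array; the function name `xe_loss` is hypothetical.

```python
import numpy as np

def xe_loss(log_probs, target):
    # log_probs[t, w] = log p(y_t = w | ground-truth prefix), shape (T, vocab)
    # target[t]       = index of the ground-truth word at step t
    steps = np.arange(len(target))
    return -log_probs[steps, target].sum()

# Toy case: a model that is uniform over a 4-word vocabulary for T = 3 steps.
T, vocab = 3, 4
log_probs = np.log(np.full((T, vocab), 1.0 / vocab))
target = np.array([2, 0, 1])
loss = xe_loss(log_probs, target)      # equals 3 * ln(4)
```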

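CIDEr-D Score Optimization. Then we directly optimize the non-differentiable metrics with Self-Critical Sequence Training (SCST):

$$L_{RL}(\theta)=-\mathbf{E}_{{\boldsymbol{y}}_{1:T}\sim p_{\theta}}\left[r({\boldsymbol{y}}_{1:T})\right]$$

where the reward $r(\cdot)$ uses the score of some metric (e.g. CIDEr-D). The gradients can be approximated as:

$$\nabla_{\theta}L_{RL}(\theta)\approx-\left(r({\boldsymbol{y}}_{1:T}^{s})-r(\hat{{\boldsymbol{y}}}_{1:T})\right)\nabla_{\theta}\log p_{\theta}({\boldsymbol{y}}_{1:T}^{s})$$

where ${\boldsymbol{y}}^{s}$ is a result sampled from the probability distribution, while $\hat{{\boldsymbol{y}}}$ is the result of greedy decoding.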
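In practice the self-critical gradient is usually obtained by minimizing a surrogate loss: the log-probability of the sampled caption, weighted by how much its reward exceeds the reward of the greedy caption. The sketch below assumes the rewards have already been computed by some metric; the function name `scst_loss` and the numbers are illustrative only.

```python
import numpy as np

def scst_loss(sample_log_probs, r_sample, r_greedy):
    # sample_log_probs: per-word log-probabilities of the sampled caption y^s.
    # The greedy caption's reward acts as a baseline; treating the advantage as
    # a constant, the gradient of this value w.r.t. the log-probabilities is
    # -(r(y^s) - r(y_hat)).
    advantage = r_sample - r_greedy
    return -advantage * np.sum(sample_log_probs)

sample_log_probs = np.log(np.array([0.5, 0.25]))   # a two-word sampled caption

# Sampled caption scores higher than greedy: the loss is positive, and
# minimizing it increases the probability of the sampled caption.
loss_better = scst_loss(sample_log_probs, r_sample=1.2, r_greedy=1.0)

# Sampled caption scores the same as greedy: no learning signal.
loss_equal = scst_loss(sample_log_probs, r_sample=1.0, r_greedy=1.0)
```

Samples that score below the greedy baseline receive a negative advantage, so their probability is pushed down.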

4 Implementation Details

We employ a Faster-RCNN model pre-trained on ImageNet and Visual Genome to extract bottom-up feature vectors of images. The dimension of the original vectors is 2048, and we project them to a new space with dimension $D=1024$, which is also the hidden size of the LSTM in the decoder. For training, we first train AoANet under the XE loss for 30 epochs with a mini-batch size of 10, using the ADAM optimizer with a learning rate initialized to 2e-4 and annealed by a factor of 0.8 every 3 epochs. We increase the scheduled sampling probability by 0.05 every 5 epochs. We then optimize the CIDEr-D score with SCST for another 15 epochs, with an initial learning rate of 2e-5 that is annealed by a factor of 0.5 when the score on the validation split does not improve for some training steps.
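The XE-stage schedules can be written down as two small functions. This sketch assumes zero-indexed epochs, step changes applied at epoch boundaries, and a scheduled sampling probability that starts at 0, none of which is specified above; the function names are hypothetical.

```python
def xe_learning_rate(epoch, base=2e-4, decay=0.8, every=3):
    # Learning rate multiplied by 0.8 after every 3 completed epochs.
    return base * decay ** (epoch // every)

def scheduled_sampling_prob(epoch, step=0.05, every=5):
    # Probability raised by 0.05 after every 5 completed epochs
    # (assumed to start from 0).
    return step * (epoch // every)

# Schedule over the 30 XE epochs.
schedule = [(e, xe_learning_rate(e), scheduled_sampling_prob(e)) for e in range(30)]
first_epoch = schedule[0]      # (0, 2e-4, 0.0)
last_epoch = schedule[-1]      # epoch 29: lr = 2e-4 * 0.8**9, probability 0.25
```

Under these assumptions the learning rate at the end of the XE stage is roughly 2.7e-5, which is of the same order as the 2e-5 used to start the SCST stage.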

Experiments

We evaluate our proposed method on the popular MS COCO dataset. The MS COCO dataset contains 123,287 images, each labeled with 5 captions, including 82,783 training images and 40,504 validation images. MS COCO also provides 40,775 images as a test set for online evaluation. The offline “Karpathy” data split is used for the offline performance comparisons, where 5,000 images are used for validation, 5,000 images for testing and the rest for training. We convert all sentences to lower case and drop the words that occur fewer than 5 times, ending up with a vocabulary of 10,369 words. We use different metrics, including BLEU, METEOR, ROUGE-L, CIDEr-D and SPICE, to evaluate the proposed method and compare it with other methods. All the metrics are computed with the publicly released code (https://github.com/tylin/coco-caption).
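The preprocessing described above can be sketched as follows. Whitespace tokenization is an assumption made for the illustration (the tokenizer is not specified here), and `build_vocab` is a hypothetical helper name.

```python
from collections import Counter

def build_vocab(captions, min_count=5):
    # Lower-case every caption, count words, and keep those occurring
    # at least `min_count` times.
    counts = Counter(w for c in captions for w in c.lower().split())
    return sorted(w for w, n in counts.items() if n >= min_count)

# Toy corpus: "a" occurs 9 times, "cat" 5 times, "dog" only 4 times.
captions = ["A cat"] * 5 + ["a dog"] * 4
vocab = build_vocab(captions)          # "dog" is dropped

# Size of the training portion of the split: all images minus the
# 5,000 validation and 5,000 test images.
n_train = 123_287 - 5_000 - 5_000
```

With the figures quoted above, "the rest" amounts to 113,287 training images.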

2 Quantitative Analysis

Offline Evaluation. We report the performance on the offline test split of our model as well as the compared models in Table 1. The models include: LSTM, which encodes the image using a CNN and decodes it using an LSTM; SCST, which employs a modified visual attention and is the first to use SCST to directly optimize the evaluation metrics; Up-Down, which employs a two-layer LSTM model with bottom-up features extracted from Faster-RCNN; RFNet, which fuses encoded features from multiple CNN networks; GCN-LSTM, which predicts visual relationships between every two entities in the image and encodes the relationship information into feature vectors; and SGAE, which introduces auto-encoding scene graphs into its model.

For fair comparison, all the models are first trained under the XE loss and then optimized for the CIDEr-D score. For the XE loss training stage in Table 1, our single model achieves the highest scores among all compared methods on all metrics, even when compared with ensembles of their models. For the CIDEr-D score optimization stage, an ensemble of 4 AoANet models with different parameter initializations outperforms all other models and sets a new state-of-the-art CIDEr-D score of 132.0.

Online Evaluation. We also evaluate our model on the online COCO test server (https://competitions.codalab.org/competitions/3221#results) in Table 2. The results of AoANet are obtained with an ensemble of 4 models trained on the “Karpathy” training split. AoANet achieves the highest scores for most metrics, with only a slightly lower score for BLEU-1 (C40).

3 Qualitative Analysis

Table 3 shows a few examples with images and captions generated by our AoANet and a strong baseline, as well as the human-annotated ground truths. We derive the baseline model by re-implementing the Up-Down model with the settings of AoANet. From these examples, we find that the baseline model generates captions which are in line with the logic of language but inaccurate for the image content, while AoANet generates accurate, high-quality captions. More specifically, AoANet is superior in the following two aspects: 1) AoANet counts objects of the same kind more accurately. There are two birds/cats in the image of the first/second example; the baseline model finds only one, while AoANet counts correctly. 2) AoANet figures out the interactions of objects in an image. For example, AoANet knows that the birds are on top of a giraffe rather than the tree in the first example, and that the boy is hitting the tennis ball with a racket rather than just holding it in the fourth example. AoANet has these advantages because it can figure out the connections among objects and also knows how they are connected: in the encoder, the refining module uses self-attention to seek interactions among objects and uses AoA to measure how well they are related; in the decoder, AoA helps to filter out irrelevant objects which do not have the required interactions and keeps only the related ones. The baseline model, in contrast, generates captions which are logically right but might not match the image contents.

4 Ablative Analysis

To quantify the impact of the proposed AoA module, we compare AoANet against a set of other ablated models with various settings. We first design the “base” model which doesn’t have a refining module in its encoder and adopts a “base” decoder in Figure 7(a), using a linear transformation to generate the context vector ${\boldsymbol{c}}_{t}$ .

Effect of AoA on the encoder. To evaluate the effect of applying AoA to the encoder, we design a refining module without AoA, which contains a self-attention module and a following feed-forward transition, in Figure 6(a). From Table 4 we observe that refining the feature representations brings positive effects, and adding a refining module without AoA improves the CIDEr-D score of “base” by 3.0. We then apply AoA to the attention mechanism in the refining module and we drop the feed-forward layer. The results show that our AoA further improves the CIDEr-D score by 2.0.

Effect of AoA on the decoder. We compare the performance of using different schemes to model the context vector ${\boldsymbol{c}}_{t}$: “base” (Figure 7(a)), via a linear transformation; “LSTM” (Figure 7(b)), via an LSTM; and AoA (Figure 7(c)), by applying AoA. We conduct experiments with both single attention and multi-head attention (MH-Att). From Table 4, we observe that replacing single attention with multi-head attention brings slightly better performance. Using LSTM improves the performance of the base model, and AoA further outperforms LSTM. Compared to LSTM, which uses some memories (hidden states) and gates to model attention results in a sequential manner, AoA is more lightweight, as it involves only two linear transformations and requires little computation. Even so, AoA still outperforms LSTM. We also find that the training process of “LSTM + AoA” (building AoA upon LSTM) is unstable and could reach a sub-optimal point, which indicates that stacking more gates does not provide further performance improvements.

To qualitatively show the effect of AoA, we visualize the caption generation process in Figure 8 with the attended image regions for each decoding time step. Two models are compared: the “base” model, which does not incorporate the AoA module, and “decoder with AoA”, which employs an AoA module in its caption decoder. Observing the attended image regions in Figure 8, we find that the attention module is not always reliable for the caption decoder to generate a word, and directly using the attention result might result in wrong captions. In the example, the book is attended by the base model when generating the caption fragment “A teddy bear sitting on a …”. As a result, the base model outputs “book” for the next word, which is not consistent with what the image shows, since the teddy bear is actually sitting on a chair and not on a book. In contrast, “decoder with AoA” is less likely to be misled by irrelevant attention results, because its AoA module adds another attention on the attention result, which suppresses the irrelevant/misleading information and keeps only the useful information.

5 Human Evaluation

Following prior practice, we invited 30 evaluators to evaluate 100 randomly selected images. For each image, we show the evaluators two captions generated by “decoder with AoA” and the “base” model in random order, and ask them which one is more descriptive. The percentages of “decoder with AoA”, “base”, and “comparable” are 49.15%, 21.2%, and 29.65% respectively, which shows the effectiveness of AoA as confirmed by the evaluators.

6 Generalization

To show the general applicability of AoA, we perform experiments on a video captioning dataset, MSR-VTT: we use ResNet-101 to extract feature vectors from 20 sampled frames of each video and then pass them to a bi-LSTM and a decoder, either “base” or “decoder with AoA”. We find that “decoder with AoA” improves over “base” from 33.53 to 37.22 in BLEU-4, from 38.83 to 42.44 in CIDEr-D, and from 56.90 to 58.32 in ROUGE-L, which shows that AoA is also promising for other tasks which involve attention mechanisms.

Conclusion

In this paper, we propose the Attention on Attention (AoA) module, an extension to conventional attention mechanisms, to address the irrelevant attention issue. Furthermore, we propose AoANet for image captioning by applying AoA to both the encoder and decoder. More remarkably, we achieve a new state-of-the-art performance with AoANet. Extensive experiments conducted on the MS COCO dataset demonstrate the superiority and general applicability of our proposed AoA module and AoANet.

Acknowledgment

This project was supported by Shenzhen Key Laboratory for Intelligent Multimedia and Virtual Reality (ZDSYS201703031405467), National Natural Science Foundation of China (NSFC, No.U1613209, 61872256, 61972217), and National Engineering Laboratory for Video Technology - Shenzhen Division. We would also like to thank Qian Wu, Yaxian Xia and Qixiang Ye, as well as the anonymous reviewers for their insightful comments.