Recurrent Fusion Network for Image Captioning

Wenhao Jiang, Lin Ma, Yu-Gang Jiang, Wei Liu, Tong Zhang

Introduction

Captioning, the task of automatically describing images or videos in natural sentences, has been an active research topic in computer vision and machine learning. Generating natural descriptions of images is very useful in practice. For example, it can improve the quality of image retrieval by discovering salient contents and help visually impaired people understand image contents.

Even though great success has been achieved in object recognition, describing images with natural sentences is still a very challenging task. Image captioning models need to understand an input image thoroughly and capture the complicated relationships among objects. Moreover, they need to capture the interactions between images and language and thereby translate image representations into natural sentences.

The encoder-decoder framework, following its advances in machine translation, has demonstrated promising performance in image captioning. This framework consists of two parts, namely an encoder and a decoder. The encoder is usually a convolutional neural network (CNN), while the decoder is a recurrent neural network (RNN). The encoder extracts image representations, from which the decoder generates the corresponding captions. Usually, a CNN pre-trained for image classification is used to extract the image representations.

All existing models employ only one encoder, so the performance heavily depends on the expressive ability of the deployed CNN. Fortunately, there are quite a few well-established CNNs, e.g., ResNet, Inception-X, etc. It is natural to improve captioning performance by extracting diverse representations with multiple encoders, which play complementary roles in depicting and characterizing the semantic information of images. However, to the best of our knowledge, no existing model exploits the complementary behaviors of multiple encoders for image captioning.

In this paper, to exploit complementary information from multiple encoders, we propose a Recurrent Fusion Network (RFNet) for image captioning. Our framework, as illustrated in Fig. 1, introduces a fusion procedure between the encoders and the decoder. Multiple CNNs, serving as the encoders, can provide diverse and more comprehensive descriptions of the input image. The fusion procedure performs a given number of RNN steps and outputs the hidden states as thought vectors. Our fusion procedure consists of two stages. The first stage contains multiple components, and each component processes the information from one encoder. The interactions among the components are captured to generate thought vectors. Hence, each component can communicate with the other components and extract complementary information from them. The second stage compresses the outputs of the first stage into one set of thought vectors. During this procedure, the interactions among the sets of thought vectors are further exploited. Thus, useful information is absorbed into the final thought vectors, which will be used as the input of the attention model in the decoder. The intuition behind our proposed RFNet is to fuse all the information encoded by multiple encoders and produce thought vectors that are more comprehensive and representative than the original ones.

Related Works

Recently, inspired by advances in machine translation, the encoder-decoder framework has also been introduced to image captioning. In this framework, a CNN pre-trained on an image classification task is used as the encoder, while an RNN is used as the decoder to translate the information from the encoder into natural sentences. This framework is simple and elegant, and several extensions have been proposed. A spatial attention mechanism was introduced, so that the model could automatically determine which subregions to focus on at each time step. Channel-wise attention was proposed to modulate the sentence generation context in multi-layer feature maps. ReviewNet was proposed, whose review steps learn annotation vectors and initial states for the decoder that are more representative than those generated by the encoder directly. A guiding network that models attribute properties of input images was introduced for the decoder. Finally, it was observed that the decoder does not need to attend to the image when predicting non-visual words, which led to an adaptive attention model that attends to the image or to the visual sentinel automatically at each time step.

Besides, several approaches that introduce useful information into this framework have been proposed to improve the image captioning performance. In one approach, word occurrence prediction was treated as a multi-label classification problem, and a region-based multi-label classification framework was proposed to extract visually semantic information. This prediction is then used to initialize the memory cell of a long short-term memory (LSTM) model. Yao et al. improved this procedure and discussed different approaches to incorporating word occurrence predictions into the decoder.

Recently, in order to optimize the non-differentiable evaluation metrics directly, policy gradient methods from reinforcement learning have been employed to train the encoder-decoder framework. In one such method, the cross-entropy loss was replaced by the negative CIDEr score, and the system was then trained with the REINFORCE algorithm, which significantly improved the performance. Such a training strategy can be leveraged to improve the performance of all existing models under the encoder-decoder framework.

2 Encoder-Decoder Framework with Multiple Encoders or Decoders

Multi-task learning (MTL) has been combined with sequence-to-sequence learning with multiple encoders or decoders. In multi-task sequence-to-sequence learning, the encoders or decoders are shared among different tasks. The goal is to transfer knowledge among tasks to improve the performance. For example, the tasks of translation and image captioning can be formulated together as a model with only one decoder. The decoder is shared between both tasks and is responsible for translating from both the image and the source language. Both tasks can benefit from each other. A similar structure was also exploited to perform multi-lingual translation. In this paper, we propose a model that combines representations from multiple encoders for the decoder. In the multi-task setting, the inputs of the encoders are different, whereas in our model they are the same. Our goal is to leverage complementary information from different encoders to form better representations for the decoder.

3 Ensemble and Fusion

Our proposed RFNet also relates to information fusion, multi-view learning, and ensemble learning. Each representation extracted from an individual image CNN can be regarded as an individual view depicting the input image. Combining diverse representations is a well-known technique to improve the performance. The combining process can occur at the input, intermediate, or output stage of the target model. For input fusion, the simplest way is to concatenate all the representations and use the concatenation as the input of the target model. This method usually leads to limited improvements. For output fusion, the results of base learners for individual views are combined to form the final results. The common ensemble technique in image captioning can be regarded as an output fusion technique, combining the outputs of the decoders at each time step. For intermediate fusion, the representations from different views are preprocessed by exploiting the relationships among them to form the input for the target model. Our proposed RFNet can be regarded as an intermediate fusion method.
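To make the difference between input fusion and output fusion concrete, the following minimal sketch contrasts the two. It is an illustration only and not part of the proposed model: the function names, the feature sizes, and the choice of plain averaging for output fusion are assumptions made for this example.

```python
import numpy as np

def input_fusion(features):
    """Input-level fusion: concatenate the per-encoder feature vectors."""
    return np.concatenate(features, axis=-1)

def output_fusion(word_probs):
    """Output-level fusion: average per-model word distributions at one time step.
    Plain averaging is an assumption; other combination rules are possible."""
    return np.mean(np.stack(word_probs, axis=0), axis=0)

# Hypothetical global features from two encoders (sizes are illustrative).
feat_a = np.ones(4)
feat_b = np.zeros(3)
fused_input = input_fusion([feat_a, feat_b])

# Hypothetical next-word distributions from two captioning models.
p1 = np.array([0.7, 0.2, 0.1])
p2 = np.array([0.5, 0.4, 0.1])
fused_output = output_fusion([p1, p2])
```

Intermediate fusion, in contrast, lets the views interact before the target model consumes them, which is the route taken by RFNet.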

Background

To provide a clear description of our method, we present a short review of the encoder-decoder framework for image captioning in this section.

Under the encoder-decoder framework for image captioning, a CNN pre-trained for an image classification task is usually employed as the encoder to extract the global representation and subregion representations of an input image. The global representation is usually the output of a fully-connected layer, while subregion representations are usually the outputs of a convolutional layer. The extracted global representation and subregion representations are denoted as $\mathbf{a}_{0}$ and $A=\{\mathbf{a}_{1},\dots,\mathbf{a}_{k}\}$ , respectively, where $k$ denotes the number of subregions.

2 Decoder

Given the image representations $\mathbf{a}_{0}$ and $A$, a decoder, which is usually a gated recurrent unit (GRU) or long short-term memory (LSTM), is employed to translate an input image into a natural sentence. In this paper, we use an LSTM equipped with an attention mechanism as the basic unit of the decoder.

Recall that an LSTM with an attention mechanism is a function that outputs the results based on the current hidden state, current input, and context vector. The context vector is the weighted sum of elements from $A$, with the weights determined by an attention model. We adopt the LSTM formulation of prior work and express the LSTM unit with the attention strategy as follows:

where $\mathbf{i}_{t}$, $\mathbf{f}_{t}$, $\mathbf{c}_{t}$, $\mathbf{o}_{t}$, and $\mathbf{h}_{t}$ are the input gate, forget gate, memory cell, output gate, and hidden state of the LSTM, respectively. Here,

$$\mathbf{H}_{t}=\left[\begin{array}{c}\mathbf{x}_{t}\\ \mathbf{h}_{t-1}\end{array}\right]$$

is the concatenation of input $\mathbf{x}_{t}$ at time step $t$ and hidden state $\mathbf{h}_{t-1}$, and $\mathbf{T}$ is a linear transformation operator. $\mathbf{z}_{t}$ is the context vector, which is the output of the attention model $f_{\textrm{att}}(A,\mathbf{h}_{t-1})$. Specifically,

where $\text{sim}(\mathbf{a}_{i},\mathbf{h}_{t})$ is a function to measure the similarity between $\mathbf{a}_{i}$ and $\mathbf{h}_{t}$ , which is usually realized by a multilayer perceptron (MLP). In this paper, we use the shorthand notation

The purpose of image captioning is to generate a caption $\mathcal{C}=(y_{1},y_{2},\cdots,y_{N})$ for a given image $\mathcal{I}$. The objective adopted is usually to minimize a negative log-likelihood:

$$-\log p(\mathcal{C}\mid\mathcal{I})=-\sum_{t=0}^{N-1}\log p(y_{t+1}\mid y_{t}),$$

where $p(y_{t+1}|y_{t})=\textrm{Softmax}(\mathbf{W}\mathbf{h}_{t})$ and $\mathbf{h}_{t}$ is computed by setting $\mathbf{x}_{t}=\mathbf{E}\mathbf{y}_{t}$ . Here, $\mathbf{W}$ is a matrix for linear transformation and $y_{0}$ is the sign for the start of sentences. $\mathbf{E}\mathbf{y}_{t}$ denotes the distributed representation of the word $\mathbf{y}_{t}$ , in which $\mathbf{y}_{t}$ is the one-hot representation for word $y_{t}$ and $\mathbf{E}$ is the word embedding matrix.
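The attention model and the captioning objective reviewed in this section can be sketched as follows. This is a minimal illustration under stated assumptions: the similarity function is taken to be a one-hidden-layer MLP with hypothetical parameters `W_a`, `W_h`, and `v`, and all shapes and names are chosen for the example rather than taken from the model described here.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def attention_context(A, h, W_a, W_h, v):
    """Soft attention over subregion features.
    A: (k, d_a) subregion features, h: (d_h,) hidden state.
    W_a: (d_a, d), W_h: (d_h, d), v: (d,) are hypothetical MLP parameters.
    Returns the context vector (weighted sum of rows of A) and the weights."""
    scores = np.tanh(A @ W_a + h @ W_h) @ v   # one similarity score per subregion
    alpha = softmax(scores)                   # attention weights, sum to 1
    return alpha @ A, alpha

def caption_nll(step_probs, word_ids):
    """Negative log-likelihood of a caption: -sum_t log p(y_{t+1} | y_t).
    step_probs[t] is the predicted word distribution at step t and
    word_ids[t] is the index of the ground-truth next word."""
    return -sum(np.log(p[w]) for p, w in zip(step_probs, word_ids))

rng = np.random.default_rng(0)
A = rng.normal(size=(5, 8))
h = rng.normal(size=6)
z, alpha = attention_context(A, h, rng.normal(size=(8, 4)),
                             rng.normal(size=(6, 4)), rng.normal(size=4))

nll = caption_nll([np.array([0.5, 0.5]), np.array([0.25, 0.75])], [0, 1])
```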

Our Method

In this section, we propose our Recurrent Fusion Network (RFNet) for image captioning. The fusion process in RFNet consists of two stages. The first stage combines the representations from multiple encoders to form multiple sets of thought vectors, which will be compressed into one set of thought vectors in the second stage. The goal of our model is to generate more representative thought vectors for the decoder. Two special designs are adopted: 1) employing interactions among components in the first stage; 2) reviewing the thought vectors from the previous stage in the second stage. We will describe the details of our RFNet in this section and analyze our design in the experiments section.

In our model, $M$ CNNs serve as the encoders. The global representation and subregion representations extracted from the $m$ -th CNN are denoted as $\mathbf{a}_{0}^{(m)}$ and $A^{(m)}=\{\mathbf{a}_{1}^{(m)},\dots,\mathbf{a}_{k_{m}}^{(m)}\}$ , respectively.

The framework of our proposed RFNet is illustrated in Fig. 2. The fusion procedure of RFNet consists of two stages, specifically fusion stages I and II. Both stages perform a number of RNN steps with attention mechanisms and output hidden states as the thought vectors. The numbers of steps in stages I and II are denoted as $T_{1}$ and $T_{2}$, respectively. The thought vectors of stage I will be used as the input of the attention model of stage II. The hidden states and memory cells after the last step of fusion stage I are aggregated to form the initial hidden state and the memory cell for fusion stage II. The thought vectors generated by stage II will be used in the decoder. RFNet is designed to capture the interactions among the components in stage I, and to compress the $M$ sets of thought vectors into more compact and informative ones in stage II. The details will be described in the following subsections.

2 Fusion Stage I

Fusion stage I takes $M$ sets of annotation vectors as inputs and generates $M$ sets of thought vectors, which will be aggregated into one set of thought vectors in fusion stage II. This stage contains $M$ review components. In order to capture the interactions among the review components, each review component needs to know what has been generated by all the components at the previous time step.

Here, $\mathbf{H}_{t}=\left[\begin{array}{c}\mathbf{h}_{t-1}^{(1)}\\ \vdots\\ \mathbf{h}_{t-1}^{(M)}\end{array}\right]$ is the concatenation of hidden states of all review components at the previous time step, $f_{\textrm{att-fusion-I}}^{(m,t)}(\cdot,\cdot)$ is the attention model for the $m$-th review component, and $\text{LSTM}_{t}^{(m)}(\cdot,\cdot)$ is the LSTM unit used by the $m$-th review component at time step $t$. Stage I can be regarded as a grid LSTM with independent attention mechanisms. In our model, the LSTM unit $\text{LSTM}_{t}^{(m)}(\cdot,\cdot)$ can be different for different $t$ and $m$. Hence, $M\times T_{1}$ LSTMs are used in fusion stage I. The set of thought vectors generated from the $m$-th component is denoted as:

$$B^{(m)}=\left\{\mathbf{h}_{1}^{(m)},\mathbf{h}_{2}^{(m)},\dots,\mathbf{h}_{T_{1}}^{(m)}\right\}.$$

In fusion stage I, the interactions among review components are realized via Eq. (23). The vector $\mathbf{H}_{t}$ contains the hidden states of all the components after time step $t-1$ and is shared as the input among them. Hence, each component is aware of the states of the other components and thus can absorb complementary information from $\mathbf{H}_{t}$. This hidden state sharing mechanism provides a way for the review components to communicate with each other, which facilitates the generation of thought vectors.
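The hidden state sharing of fusion stage I can be sketched as below. This is a simplified illustration, not the exact implementation: the attention is a dot-product stand-in with a hypothetical projection `P`, the LSTM gate ordering and the parameter containers `T_params` and `P_params` are assumptions, and all sizes are chosen for the example. What the sketch preserves is the structure described above: $M$ components, a separate LSTM unit per component and time step, attention of each component over its own annotation vectors, and the concatenation of all previous hidden states as the shared input.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(inp, z, c_prev, T):
    """One LSTM step on the concatenation of input `inp` and context `z`.
    T: (4d, len(inp) + len(z)); the gate order i, f, o, g is an assumption."""
    d = c_prev.shape[0]
    gates = T @ np.concatenate([inp, z])
    i, f, o = (sigmoid(gates[j * d:(j + 1) * d]) for j in range(3))
    g = np.tanh(gates[3 * d:])
    c = f * c_prev + i * g
    return o * np.tanh(c), c

def attend(A, h, P):
    """Dot-product attention stand-in. A: (k, d_a), h: (d,), P: (d_a, d)."""
    scores = A @ (P @ h)
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ A

def fusion_stage_one(A_sets, h0, c0, T_params, P_params):
    """Run T1 = len(T_params) steps of M interacting review components."""
    M = len(A_sets)
    h, c = list(h0), list(c0)
    B = [[] for _ in range(M)]          # B[m] collects thought vectors of component m
    for t in range(len(T_params)):
        H = np.concatenate(h)           # shared input: all hidden states from step t-1
        new_h, new_c = [], []
        for m in range(M):
            z = attend(A_sets[m], h[m], P_params[t][m])
            h_m, c_m = lstm_step(H, z, c[m], T_params[t][m])
            new_h.append(h_m)
            new_c.append(c_m)
            B[m].append(h_m)
        h, c = new_h, new_c
    return B, h, c

rng = np.random.default_rng(0)
M, d, T1 = 2, 4, 2
dims = [6, 5]                           # illustrative feature sizes per encoder
A_sets = [rng.normal(size=(3, dims[0])), rng.normal(size=(7, dims[1]))]
h0 = [rng.normal(size=d) for _ in range(M)]
c0 = [np.zeros(d) for _ in range(M)]
T_params = [[rng.normal(scale=0.1, size=(4 * d, M * d + dims[m])) for m in range(M)]
            for _ in range(T1)]
P_params = [[rng.normal(scale=0.1, size=(dims[m], d)) for m in range(M)]
            for _ in range(T1)]
B, h_last, c_last = fusion_stage_one(A_sets, h0, c0, T_params, P_params)
```

In this sketch, changing the annotation vectors of one encoder leaves the first thought vector of the other component unchanged but alters its later ones, which is the effect of the shared input.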

3 Fusion Stage II

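The hidden state and memory cell of fusion stage II are initialized from $\mathbf{h}_{T_{1}}^{(1)},\cdots,\mathbf{h}_{T_{1}}^{(M)}$ and $\mathbf{c}_{T_{1}}^{(1)},\cdots,\mathbf{c}_{T_{1}}^{(M)}$, respectively. We use averaging in our model: the initial hidden state is the mean of the $M$ final hidden states of stage I, and the initial memory cell is the mean of the $M$ final memory cells.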

Fusion stage II combines $M$ sets of thought vectors to form a new one using the multi-attention mechanism. At each time step, the concatenation of context vectors is calculated as:

where $\text{LSTM}_{t}$ is the LSTM unit at time step $t$ . Please note that all the LSTM units in this stage are also different.

Fusion stage II can be regarded as review steps with $M$ independent attention models, which perform attention on the thought vectors yielded in the first stage. It combines and compresses the outputs from stage I and generates only one set of thought vectors. Hence, the generated thought vectors can provide more information for the decoder.

The hidden states of fusion stage II are collected to form the thought vector set $C$, which will be used as the input of the attention model in the decoder.
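A minimal sketch of fusion stage II is given below, under stated assumptions. The averaging initialization and the use of $M$ attention models whose context vectors are concatenated follow the description above; the dot-product attention, the choice of the previous hidden state as the LSTM input, the gate ordering, and the parameter containers `T_params` and `P_params` are assumptions made for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(inp, z, c_prev, T):
    """One LSTM step on the concatenation of `inp` and `z` (gate order assumed)."""
    d = c_prev.shape[0]
    gates = T @ np.concatenate([inp, z])
    i, f, o = (sigmoid(gates[j * d:(j + 1) * d]) for j in range(3))
    g = np.tanh(gates[3 * d:])
    c = f * c_prev + i * g
    return o * np.tanh(c), c

def attend(V, h, P):
    """Dot-product attention stand-in over thought vectors V: (n, d); P: (d, d)."""
    scores = V @ (P @ h)
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ V

def fusion_stage_two(B, h_last, c_last, T_params, P_params):
    """Compress M sets of thought vectors B[m] into one set C.
    h_last, c_last: lists of the M final hidden states / memory cells of stage I."""
    M = len(B)
    h = np.mean(h_last, axis=0)         # averaging initialization
    c = np.mean(c_last, axis=0)
    C = []
    for t in range(len(T_params)):
        # one attention model per stage-I set; the contexts are concatenated
        z = np.concatenate([attend(B[m], h, P_params[t][m]) for m in range(M)])
        h, c = lstm_step(h, z, c, T_params[t])   # a different LSTM unit per step
        C.append(h)
    return C, h, c

rng = np.random.default_rng(0)
M, d, T1, T2 = 3, 4, 2, 3
B = [rng.normal(size=(T1, d)) for _ in range(M)]
h_last = [B[m][-1] for m in range(M)]
c_last = [rng.normal(size=d) for _ in range(M)]
T_params = [rng.normal(scale=0.1, size=(4 * d, d + M * d)) for _ in range(T2)]
P_params = [[rng.normal(scale=0.1, size=(d, d)) for _ in range(M)] for _ in range(T2)]
C, h_final, c_final = fusion_stage_two(B, h_last, c_last, T_params, P_params)
```

The final `h_final` and `c_final` play the role of the states handed to the decoder, and `C` plays the role of the single set of thought vectors it attends to.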

4 Decoder

The decoder translates the information generated by the fusion procedure into natural sentences. The initial hidden state and memory cell are inherited directly from the last step of fusion stage II. The decoder step in our model is the same as in other encoder-decoder models and is expressed as:

where $\text{LSTM}_{dec}(\cdot,\cdot)$ is the LSTM unit for all the decoder steps, $f_{\textrm{att-dec}}(\cdot,\cdot)$ is the corresponding attention model, $\mathbf{x}_{t}$ is the word embedding for the input word at the current time step, and $\mathbf{H}_{t}=\left[\begin{array}{c}\mathbf{x}_{t}\\ \mathbf{h}_{t-1}\end{array}\right]$.

5 Discriminative Supervision

We adopt discriminative supervision in our model to further boost image captioning performance, similar to prior work. Given a set of thought vectors $V$, a matrix $\mathbf{V}$ is formed by selecting elements from $V$ as column vectors. A score vector $\mathbf{s}$ of words is then calculated as

where $\mathbf{W}$ is a trainable linear transformation matrix and $\text{Row-Max-Pool}(\cdot)$ is a max-pooling operator along the rows of the input matrix. The $i$ -th element of $\mathbf{s}$ is denoted as $s_{i}$ , which represents the score for the $i$ -th word. Adopting a multi-label margin loss, we obtain the loss function fulfilling discriminative supervision as:

where $\mathcal{W}$ is the set of all frequent words in the current caption. In this paper, we only consider the 1,000 most frequent words in the captions of the training set.

By considering both the discriminative supervision loss in Eq. (35) and the captioning loss in Eq. (18), the complete loss function of our model is expressed as:

where $\lambda$ is a trade-off parameter, and $B^{(m)}$ and $C$ are sets of thought vectors from fusion stages I and II, respectively.
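The score computation and a multi-label margin loss of the kind described above can be sketched as follows. This is an illustration under assumptions: the margin of 1, the restriction of the inner sum to words outside $\mathcal{W}$, and the normalization by the vocabulary size follow the common definition of the multi-label margin loss and may differ in detail from the loss used in our model. How the per-set losses are weighted in the complete loss is not shown here.

```python
import numpy as np

def word_scores(V, W):
    """s = Row-Max-Pool(W V).
    V: (d, n) matrix whose columns are thought vectors; W: (vocab, d)."""
    return (W @ V).max(axis=1)

def multilabel_margin_loss(s, positives):
    """Hinge penalty whenever a word in the caption (positive) does not outscore
    a word outside it by a margin of 1 (margin and normalization are assumptions)."""
    pos = set(positives)
    neg = [i for i in range(len(s)) if i not in pos]
    total = 0.0
    for j in sorted(pos):
        for i in neg:
            total += max(0.0, 1.0 - (s[j] - s[i]))
    return total / len(s)

V = np.array([[1.0, 2.0],
              [0.0, -1.0]])             # two thought vectors of size 2
W = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])              # three-word toy vocabulary
s = word_scores(V, W)

loss = multilabel_margin_loss(np.array([2.0, 0.5, 1.5]), positives=[0])
```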

Experiments

The MSCOCO dataset (http://mscoco.org/) is the largest benchmark dataset for the image captioning task, containing 82,783, 40,504, and 40,775 images for training, validation, and test, respectively. This dataset is challenging because most images contain multiple objects in the context of complex scenes. Each image in this dataset is associated with five captions annotated by humans. For offline evaluation, we follow the conventional evaluation procedure and employ the commonly used Karpathy's split, which contains 5,000 images for validation, 5,000 images for test, and 113,287 images for training. For online evaluation on the MSCOCO evaluation server, we add the testing set into the training set to form a larger training set.

For the captions, we discard all the non-alphabetic characters, transform all letters into lowercase, and tokenize the captions using white space. Moreover, all words occurring fewer than 5 times are replaced by the unknown token. Thus, a vocabulary consisting of 9,487 words is finally constructed.
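The preprocessing steps above can be sketched as follows. The spelling of the unknown token (`<unk>`) and the function names are assumptions for illustration, and the regular expression keeps only ASCII letters and white space.

```python
import re
from collections import Counter

UNK = "<unk>"  # hypothetical spelling of the unknown token

def tokenize(caption):
    """Lowercase, drop non-alphabetic characters, split on white space."""
    cleaned = re.sub(r"[^a-z\s]", "", caption.lower())
    return cleaned.split()

def build_vocab(captions, min_count=5):
    """Keep words that occur at least `min_count` times; add the unknown token."""
    counts = Counter(tok for cap in captions for tok in tokenize(cap))
    vocab = {w for w, n in counts.items() if n >= min_count}
    return vocab | {UNK}

def encode(caption, vocab):
    """Replace out-of-vocabulary words by the unknown token."""
    return [tok if tok in vocab else UNK for tok in tokenize(caption)]

captions = ["A dog runs."] * 5 + ["A cat sleeps."]
vocab = build_vocab(captions)
encoded = encode("A cat runs", vocab)
```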

2 Configurations and Settings

For the experiments, we use ResNet, DenseNet, Inception-V3, Inception-V4, and Inception-ResNet-V2 as encoders to extract 5 groups of representations. Each group of representations contains a global feature vector and a set of subregion feature vectors. The outputs of the last convolution layer (before pooling) are extracted as subregion features. For Inception-V3, the output of the last fully connected layer is used as the global feature vector. For the other CNNs, the means of the subregion representations are regarded as the global feature vectors. The parameters of the encoders are fixed during the training procedure. Since reinforcement learning (RL) has become a common method to boost image captioning performance, we first train our model with cross-entropy loss and then fine-tune the trained model with CIDEr optimization using reinforcement learning. The performance of models trained with both cross-entropy loss and CIDEr optimization is reported and compared.

When training with cross-entropy loss, scheduled sampling, label-smoothing regularization (LSR), dropout, and early stopping are adopted. For scheduled sampling, the probability of sampling a token from the model is $\min(0.25,\frac{epoch}{100})$, where $epoch$ is the number of passes over the training data. For LSR, the prior distribution over labels is the uniform distribution and the smoothing parameter is set to 0.1. Dropout is only applied to the hidden states and the probability is set to 0.3 for all LSTM units. We terminate training when the evaluation measurement on the validation set, specifically CIDEr, reaches its maximum value. When training with RL, only dropout and early stopping are used.
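The scheduled sampling rule above can be written as a short sketch. The function names are illustrative, and whether epochs are counted from 0 or 1 is an assumption (counting from 0 is used here).

```python
import random

def sampling_probability(epoch):
    """Probability of feeding the model's own token: min(0.25, epoch / 100)."""
    return min(0.25, epoch / 100)

def choose_input_token(gold_token, model_token, epoch, rng=random):
    """Pick the next decoder input: the model's prediction with the scheduled
    probability, otherwise the ground-truth token."""
    if rng.random() < sampling_probability(epoch):
        return model_token
    return gold_token
```

The probability grows linearly with the epoch and is capped at 0.25 from epoch 25 onward.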

The hidden state size is set to 512 for all LSTM units in our model. The parameters of the LSTMs are initialized from a uniform distribution in $[-0.1,0.1]$. Adam is used to optimize the network, with the learning rate set to $5\times 10^{-4}$ and decayed every 3 epochs by a factor of 0.8 when training with cross-entropy loss. Each mini-batch contains 10 images. For RL training, the learning rate is fixed at $5\times 10^{-5}$. For training with cross-entropy loss, the weight of discriminative supervision $\lambda$ is set to 10. Discriminative supervision is not used for training with RL. To improve the performance, data augmentation is adopted. Both flipping and cropping strategies are used. We crop $90\%$ of the width and height at the four corners. Hence, $10\times$ images are used for training.
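The learning rate schedule and the augmentation count can be illustrated with the following sketch. Two points are assumptions rather than statements from the text: the decay is taken to be a staircase that starts at epoch 0, and the factor of 10 is read as the original image plus four corner crops, each with and without horizontal flipping.

```python
def learning_rate(epoch, base_lr=5e-4, decay=0.8, every=3):
    """Staircase decay: multiply by `decay` once every `every` epochs."""
    return base_lr * decay ** (epoch // every)

def augmented_views(width, height, ratio=0.9):
    """Enumerate (crop_box, flipped) pairs; boxes are (left, top, right, bottom).
    Assumed reading of the 10x factor: (original + 4 corner crops) x 2 flips."""
    cw, ch = round(width * ratio), round(height * ratio)
    boxes = [
        (0, 0, width, height),                       # original image
        (0, 0, cw, ch),                              # top-left corner crop
        (width - cw, 0, width, ch),                  # top-right corner crop
        (0, height - ch, cw, height),                # bottom-left corner crop
        (width - cw, height - ch, width, height),    # bottom-right corner crop
    ]
    return [(box, flip) for box in boxes for flip in (False, True)]

views = augmented_views(200, 100)
```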

For sentence generation in the testing stage, there are two common strategies. The first one is greedy search, which chooses the word with maximum probability at each time step and sets it as the LSTM input for the next time step until the end-of-sentence sign is emitted or the maximum sentence length is reached. The second one is beam search, which selects the top-$k$ best sentences at each time step and considers them as the candidates to generate new top-$k$ best sentences at the next time step. Usually, beam search provides better performance for models trained with cross-entropy loss. For models trained with RL, beam search and greedy search generate similar results, but greedy search is faster. Hence, for models trained with RL, we use greedy search to generate captions.
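The two decoding strategies can be contrasted with a small sketch. The toy next-word table, the token names, and the `step_fn` interface (a function returning a word-to-probability dictionary for a given prefix) are hypothetical stand-ins for the LSTM decoder, and no length normalization is applied.

```python
import math

def greedy_search(step_fn, start, end, max_len):
    """Pick the most probable word at each step until `end` or `max_len`."""
    seq = [start]
    while len(seq) - 1 < max_len:
        probs = step_fn(seq)
        word = max(probs, key=probs.get)
        seq.append(word)
        if word == end:
            break
    return seq

def beam_search(step_fn, start, end, max_len, k):
    """Keep the k best partial sentences (by log-probability) at each step."""
    beams = [([start], 0.0)]
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq[-1] == end:                      # finished sentences are carried over
                candidates.append((seq, score))
                continue
            for word, p in step_fn(seq).items():
                if p > 0:
                    candidates.append((seq + [word], score + math.log(p)))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:k]
        if all(seq[-1] == end for seq, _ in beams):
            break
    return beams[0][0]

# Toy next-word distributions that depend only on the previous word.
toy = {
    "<s>": {"a": 0.6, "the": 0.4},
    "a": {"dog": 0.5, "cat": 0.5},
    "the": {"dog": 0.9, "cat": 0.1},
    "dog": {"</s>": 1.0},
    "cat": {"</s>": 1.0},
}
step_fn = lambda seq: toy[seq[-1]]

greedy = greedy_search(step_fn, "<s>", "</s>", max_len=5)
beam = beam_search(step_fn, "<s>", "</s>", max_len=5, k=2)
```

In this toy example, greedy search commits to "a" at the first step and ends with a sentence of probability 0.3, whereas beam search with $k=2$ recovers the sentence starting with "the", whose probability is 0.36.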

3 Performance and Analysis

We compare our proposed RFNet with the state-of-the-art approaches on image captioning, including Neural Image Caption (NIC), Attribute LSTM, LSTM-A3, Recurrent Image Captioner (RIC), Recurrent Highway Network (RHN), Soft Attention model, Attribute Attention model, Sentence Attention model, ReviewNet, Text Attention model, Att2in model, Adaptive model, and Up-Down model. Please note that the encoder of the Up-Down model is not a CNN pre-trained on the ImageNet dataset. It used Faster R-CNN trained on Visual Genome to encode the input image.

Following the standard evaluation process, five types of metrics are used for performance comparisons, specifically BLEU, METEOR, ROUGE-L, CIDEr, and SPICE. These metrics measure the similarity between generated sentences and the ground truth sentences from specific viewpoints, e.g., n-gram occurrences and semantic contents. We use the official MSCOCO caption evaluation scripts (https://github.com/tylin/coco-caption) and the source code of SPICE (https://github.com/peteanderson80/coco-caption) for the performance evaluation.

Performance comparisons.

The performance of models trained with cross-entropy loss is shown in Table 1, including the performance of single models and ensembles of models. First, it can be observed that our single-model RFNet significantly outperformed the existing image captioning models, except the Up-Down model. Our RFNet performed worse than Up-Down in BLEU-1, BLEU-4, and CIDEr, but better than Up-Down in METEOR, ROUGE-L, and SPICE. However, the encoder of the Up-Down model was pre-trained on the ImageNet dataset and fine-tuned on the Visual Genome dataset. The Visual Genome dataset is heavily annotated with objects, attributes, and region descriptions, and 51K of its images are extracted from the MSCOCO dataset. Hence, the encoder of the Up-Down model is trained with far more information than the CNNs trained on ImageNet. With the recurrent fusion strategy, RFNet can extract useful information from different encoders to compensate for the lack of information about objects and attributes in the representations.

Moreover, we can observe that our single RFNet model performed significantly better than other ensemble models, such as NIC and Att2in, and behaved comparably with ReviewNetΣ, which is an ensemble of 40 ReviewNets (8 models for each CNN). RFNetΣ, an ensemble of 4 RFNets, significantly outperformed all the ensemble models.

The performance comparisons with RL training are presented in Table 2. We compared RFNet with Att2all, the Up-Down model, and an ensemble of ReviewNets. For our method, the performance of a single model and of an ensemble of 4 models is provided. We can see that our RFNet outperformed the other methods. For online evaluation, we used an ensemble of 7 models, and the comparisons are provided in Table 3. We can see that our RFNet still achieved the best performance. The C5 and C40 CIDEr scores were improved by 5.0 and 4.6, respectively, compared to the state-of-the-art Up-Down model.

Ablation study of fusion stages I and II.

To study the effects of the two fusion stages, we present the performance of the following models:

$\textrm{RFNet}_{-\textrm{I}}$ denotes RFNet without fusion stage I, with only fusion stage II preserved. The global representations are concatenated to form one global representation, and multi-attention mechanisms are performed on the subregion representations from the multiple encoders. The rest is the same as in RFNet.

$\textrm{RFNet}_{-\textrm{II}}$ denotes RFNet without fusion stage II. Multiple attention models are employed in the decoder and the rest is the same as RFNet.

$\textrm{RFNet}_{-\textrm{inter}}$ denotes RFNet without the interactions in fusion stage I. Each component in the fusion stage I is independent. Specifically, at each time step, the input of the $m$ -th component is just $\mathbf{h}_{t-1}^{(m)}$ , and it is unaware of the hidden states of the other components.

The CIDEr scores on the test set of Karpathy's split are presented in Table 4. We can see that both the two-stage structure and the interactions in the first stage are important for our model. With the interactions, the quality of the thought vectors in the first stage can be improved. With the two-stage structure, the thought vectors of the first stage can be refined and compressed into a more compact and informative set of thought vectors in the second stage. Therefore, with the specifically designed recurrent fusion strategy, our proposed RFNet provides the best performance.

Effects of discriminative supervision.

We examined the effects of the discriminative supervision with different values of $\lambda$. First, it can be observed that introducing discriminative supervision properly can help improve the captioning performance, with 107.2 ($\lambda=1$) and 107.3 ($\lambda=10$) vs. 105.2 ($\lambda=0$). However, if $\lambda$ is too large, e.g., 100, the discriminative supervision degrades the performance. Therefore, in this paper, $\lambda$ is set to 10, which provided the best performance.

Conclusions

In this paper, we proposed a novel Recurrent Fusion Network (RFNet) to exploit complementary information of multiple image representations for image captioning. In the RFNet, a recurrent fusion procedure is inserted between the encoders and the decoder. This recurrent fusion procedure consists of two stages, and each stage can be regarded as a special RNN. In the first stage, each image representation is compressed into a set of thought vectors by absorbing complementary information from the other representations. The generated sets of thought vectors are then compressed into another set of thought vectors in the second stage, which will be used as the input to the attention module of the decoder. The proposed RFNet achieved leading performance on the MSCOCO evaluation server, corroborating the effectiveness of our proposed network architecture.

Appendix A Appendices

To gain a qualitative understanding of our model, some examples of captions generated by our model and ReviewNet are shown in Fig. 3. We only provide the results of ReviewNets with ResNet, DenseNet, and Inception-V3 as encoders. We can observe that the improved performance of our model is mostly attributed to the improvement in the recognition of salient objects. For example, our model recognized “two children”, “a kite”, and “beach” in the first image, whereas ReviewNet-ResNet failed to recognize “beach”, ReviewNet-DenseNet wrongly recognized “a man”, and ReviewNet-Inception-V3 failed to recognize “kite”. Similar results can also be observed in the other cases, such as the “zebra” and “giraffe” in the second image, the “police officer” in the third image, and the “two beds” and “a lamp” in the last image. ReviewNet with only one kind of CNN does not recognize the salient objects as our model does. Hence, through the recurrent fusion process, our model is good at combining multiple representations and finding the salient objects neglected in individual representations. More examples can be found in Fig. 4.

A.2 Ablation Study on Different CNNs.

We perform the ablation studies to show the effects of different CNNs used in our experiments, with the results shown in Table 6. We can see that generally speaking, ResNet contributes the most while DenseNet contributes the least to the final image captioning performance.