Auto-Encoding Scene Graphs for Image Captioning

Xu Yang, Kaihua Tang, Hanwang Zhang, Jianfei Cai

Introduction

Modern image captioning models employ an end-to-end encoder-decoder framework, i.e., the encoder encodes an image into vector representations and then the decoder decodes them into a language sequence. Since its invention inspired from neural machine translation, this framework has experienced several significant upgrades such as the top-down and bottom-up visual attentions for dynamic encoding, and the reinforced mechanism for sequence decoding. However, a ubiquitous problem has never been substantially resolved: when we feed an unseen image scene into the framework, we usually get a simple and trivial caption about the salient objects such as “there is a dog on the floor”, which is no better than just a list of object detections. This situation is particularly embarrassing in front of the booming “mid-level” vision techniques nowadays: we can already detect and segment almost everything in an image.

We humans are good at telling sentences about a visual scene. Not surprisingly, cognitive evidence shows that visually grounded language generation is not end-to-end and is largely attributed to “high-level” symbolic reasoning, that is, once we abstract the scene into symbols, the generation will be almost disentangled from the visual perception. For example, as shown in Figure 1, from the scene abstraction “helmet-on-human” and “road dirty”, we can say “a man with a helmet in countryside” by using the common sense knowledge like “country road is dirty”. In fact, such collocations and contextual inference in human language can be considered as the inductive bias that is apprehended by us from everyday practice, which makes us perform better than machines in high-level reasoning. However, the direct exploitation of the inductive bias, e.g., early template/rule-based caption models, is well known to be ineffective compared to the encoder-decoder ones, due to the large gap between visual perception and language composition.

In this paper, we propose to incorporate the inductive bias of language generation into the encoder-decoder framework for image captioning, benefiting from the complementary strengths of both symbolic reasoning and end-to-end multi-modal feature mapping. In particular, we use scene graphs to bridge the gap between the two worlds. A scene graph ( $\mathcal{G}$ ) is a unified representation that connects 1) the objects (or entities), 2) their attributes, and 3) their relationships in an image ( $\mathcal{I}$ ) or a sentence ( $\mathcal{S}$ ) by directed edges. Thanks to the recent advances in spatial Graph Convolutional Networks (GCNs) , we can embed the graph structure into vector representations, which can be seamlessly integrated into the encoder-decoder. Our key insight is that the vector representations are expected to transfer the inductive bias from the pure language domain to the vision-language domain.

Specifically, to encode the language prior, we propose the Scene Graph Auto-Encoder (SGAE) that is a sentence self-reconstruction network in the $\mathcal{S}\rightarrow\mathcal{G}\rightarrow\mathcal{D}\rightarrow\mathcal{S}$ pipeline, where $\mathcal{D}$ is a trainable dictionary for the re-encoding purpose of the node features, the $\mathcal{S}\rightarrow\mathcal{G}$ module is a fixed off-the-shelf scene graph language parser , and the $\mathcal{D}\rightarrow\mathcal{S}$ is a trainable RNN-based language decoder . Note that $\mathcal{D}$ is the “juice” — the language inductive bias — we extract from training SGAE. By sharing $\mathcal{D}$ in the encoder-decoder training pipeline: $\mathcal{I}\rightarrow\mathcal{G}\rightarrow\mathcal{D}\rightarrow\mathcal{S}$ , we can incorporate the language prior to guide the end-to-end image captioning. In particular, the $\mathcal{I}\rightarrow\mathcal{G}$ module is a visual scene graph detector and we introduce a multi-modal GCN for the $\mathcal{G}\rightarrow\mathcal{D}$ module in the captioning pipeline, to complement necessary visual cues that are missing due to the imperfect visual detection. Interestingly, $\mathcal{D}$ can be considered as a working memory that helps to re-key the encoded nodes from $\mathcal{I}$ or $\mathcal{S}$ to a more generic representation with smaller domain gaps. More motivations and the incarnation of $\mathcal{D}$ will be discussed in Section 4.3.

We implement the proposed SGAE-based captioning model by using the recently released visual encoder and language decoder with RL-based training strategy. Extensive experiments on MS-COCO validate the superiority of using SGAE in image captioning. Particularly, in terms of the popular CIDEr-D metric, we achieve an absolute 7.2 points improvement over a strong baseline: an upgraded version of Up-Down. Then, we advance to a new state-of-the-art single model achieving 127.8 on the Karpathy split and a competitive 125.5 on the official test server even compared to many ensemble models.

In summary, we would like to make the following technical contributions:

A novel Scene graph Auto-Encoder (SGAE) for learning the feature representation of the language inductive bias.

A multi-modal graph convolutional network for modulating scene graphs into visual representations.

A SGAE-based encoder-decoder image captioner with a shared dictionary guiding the language decoding.

Related Work

Image Captioning. There is a long history for researchers to develop automatic image captioning methods. Compared with early works which are rules/templates based, the modern captioning models have achieved striking advances by three techniques inspired from the NLP field, i.e., encoder-decoder based pipeline, attention technique, and RL-based training objective. Afterwards, researchers tried to discover more semantic information from images and incorporated it into captioning models for better descriptive abilities. For example, some methods incorporate object, attribute, and relationship knowledge into their captioning models. Compared with these approaches, we use the scene graph as the bridge to integrate object, attribute, and relationship knowledge together to discover more meaningful semantic contexts for better caption generation.

Scene Graphs. The scene graph contains the structured semantic information of an image, which includes the knowledge of present objects, their attributes, and pairwise relationships. Thus, the scene graph can provide a beneficial prior for other vision tasks like image retrieval, VQA, and image generation. By observing the potential of exploiting scene graphs in vision tasks, a variety of approaches are proposed to improve the scene graph generation from images. On the other hand, some researchers also tried to extract scene graphs only from textual data. In this research, we parse scene graphs from both images and captions with existing parsers.

Memory Networks. Recently, many researchers have tried to augment networks with a working memory that preserves a dynamic knowledge base for facilitating subsequent inference. Among these methods, differentiable attention mechanisms are usually applied to extract useful knowledge from memory for the tasks at hand. Inspired by these methods, we also implement a memory architecture to preserve humans’ inductive bias, guiding our image captioning model to generate more descriptive captions.

Encoder-Decoder Revisited

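As illustrated in Figure 2, given an image $\mathcal{I}$ , the target of image captioning is to generate a natural language sentence $\mathcal{S}=\{w_{1},w_{2},...,w_{T}\}$ describing the image. A state-of-the-art encoder-decoder image captioner can be formulated as:

$$\begin{aligned}\text{Encoder:}\quad&\mathcal{V}\leftarrow\mathcal{I},\\ \text{Map:}\quad&\hat{\mathcal{V}}\leftarrow\mathcal{V},\\ \text{Decoder:}\quad&\mathcal{S}\leftarrow\hat{\mathcal{V}}.\end{aligned}\tag{1}$$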

Usually, an encoder is a Convolutional Neural Network (CNN) that extracts the image feature $\mathcal{V}$ ; the map is the widely used attention mechanism that re-encodes the visual features into more informative $\hat{\mathcal{V}}$ that is dynamic to language generation; a decoder is an RNN-based language decoder for the sequence prediction of $\mathcal{S}$ . Given a ground truth caption $\mathcal{S}^{*}$ for $\mathcal{I}$ , we can train this encoder-decoder model by minimizing the cross-entropy loss:

$$L_{XE}=-\log P(\mathcal{S}^{*}),\tag{2}$$

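or by maximizing a reinforcement learning (RL) based reward as:

$$R_{RL}=\mathbb{E}_{\mathcal{S}^{s}\sim P(\mathcal{S})}\left[r(\mathcal{S}^{s};\mathcal{S}^{*})\right],\tag{3}$$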

where $r$ is a sentence-level metric for the sampled sentence $\mathcal{S}^{s}$ and the ground-truth $\mathcal{S}^{*}$ , e.g., the CIDEr-D metric.
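The two training objectives described above can be illustrated with a small sketch. It assumes the model exposes per-step word distributions and a way to sample captions; `toy_reward` is a hypothetical unigram-overlap stand-in for a sentence-level metric such as CIDEr-D, not the metric itself, and all function names are illustrative.

```python
import math

def cross_entropy_loss(step_probs, target_ids):
    """Negative log-likelihood of the ground-truth caption S*.

    step_probs: list of per-step word distributions (one list of
                probabilities per time step).
    target_ids: ground-truth word index at each time step.
    """
    return -sum(math.log(p[w]) for p, w in zip(step_probs, target_ids))

def toy_reward(sampled, reference):
    """Hypothetical stand-in for r(S^s; S*): fraction of sampled words
    that also occur in the reference (this is NOT CIDEr-D)."""
    if not sampled:
        return 0.0
    ref = set(reference)
    return sum(w in ref for w in sampled) / len(sampled)

def expected_reward(samples, reference, reward_fn=toy_reward):
    """Monte Carlo estimate of the expected reward over sampled captions."""
    return sum(reward_fn(s, reference) for s in samples) / len(samples)

# Toy usage: a two-step caption over a two-word vocabulary.
loss = cross_entropy_loss([[0.5, 0.5], [0.25, 0.75]], [0, 1])
reward = expected_reward([["a", "dog"], ["a", "cat"]],
                         ["a", "dog", "on", "floor"])
```

In practice the reward would be optimized with a policy-gradient estimator, which this sketch leaves out.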

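This encoder-decoder framework is the core pillar underpinning almost all state-of-the-art image captioners. However, it has been widely shown to be brittle to dataset bias. We propose to exploit the language inductive bias, which is beneficial, to confront the dataset bias, which is pernicious, for more human-like image captioning. As shown in Figure 2, the proposed framework is formulated as:

$$\begin{aligned}\text{Encoder:}\quad&\mathcal{V}\leftarrow\mathcal{I},\\ \text{Map:}\quad&\hat{\mathcal{V}}\leftarrow R(\mathcal{V},\mathcal{G};\mathcal{D}),\quad\mathcal{G}\leftarrow\mathcal{V},\\ \text{Decoder:}\quad&\mathcal{S}\leftarrow\hat{\mathcal{V}}.\end{aligned}\tag{4}$$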

As can be clearly seen, we focus on modifying the Map module by introducing the scene graph $\mathcal{G}$ into a re-encoder $R$ parameterized by a shared dictionary $\mathcal{D}$ . As we will detail in the rest of the paper, we first propose a Scene Graph Auto-Encoder (SGAE) to learn the dictionary $\mathcal{D}$ which embeds the language inductive bias from sentence-to-sentence self-reconstruction (cf. Section 4) with the help of scene graphs. Then, we equip the encoder-decoder with the proposed SGAE to be our overall image captioner (cf. Section 5). Specifically, we use a novel Multi-modal Graph Convolutional Network (MGCN) (cf. Section 5.1) to re-encode the image features by using $\mathcal{D}$ , narrowing the gap between vision and language.

Auto-Encoding Scene Graphs

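In this section, we will introduce how to learn $\mathcal{D}$ through self-reconstructing sentence $\mathcal{S}$ . As shown in Figure 2, the process of reconstructing $\mathcal{S}$ is also an encoder-decoder pipeline. Thus, by slightly abusing the notations, we can formulate SGAE as:

$$\begin{aligned}\text{Encoder:}\quad&\mathcal{X}\leftarrow\mathcal{G}\leftarrow\mathcal{S},\\ \text{Map:}\quad&\hat{\mathcal{X}}\leftarrow R(\mathcal{X};\mathcal{D}),\\ \text{Decoder:}\quad&\mathcal{S}\leftarrow\hat{\mathcal{X}}.\end{aligned}\tag{5}$$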

Next, we will detail every component mentioned in Eq. (5).

We introduce how to implement the step $\mathcal{G}\leftarrow\mathcal{S}$ , i.e., from sentence to scene graph. Formally, a scene graph is a tuple $\mathcal{G}=(\mathcal{N},\mathcal{E})$ , where $\mathcal{N}$ and $\mathcal{E}$ are the sets of nodes and edges, respectively. There are three kinds of nodes in $\mathcal{N}$ : object node $o$ , attribute node $a$ , and relationship node $r$ . We denote $o_{i}$ as the $i$ -th object, $r_{ij}$ as the relationship between object $o_{i}$ and $o_{j}$ , and $a_{i,l}$ as the $l$ -th attribute of object $o_{i}$ . For each node in $\mathcal{N}$ , it is represented by a $d$ -dimensional vector, i.e., $\bm{e}_{o}$ , $\bm{e}_{a}$ , and $\bm{e}_{r}$ . In our implementation, $d$ is set to $1,000$ . In particular, the node features are trainable label embeddings. The edges in $\mathcal{E}$ are formulated as follows:

if an object $o_{i}$ owns an attribute $a_{i,l}$ , assigning a directed edge from $o_{i}$ to $a_{i,l}$ ;

if a relationship triplet $<o_{i}-r_{ij}-o_{j}>$ appears, assigning two directed edges from $o_{i}$ to $r_{ij}$ and from $r_{ij}$ to $o_{j}$ , respectively.

Figure 3 shows one example of $\mathcal{G}$ , which contains $7$ nodes in $\mathcal{N}$ and $6$ directed edges in $\mathcal{E}$ .
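The node and edge rules above can be sketched as a small data structure. The example graph below is hypothetical (it is not the content of Figure 3); it is only chosen so that it also has 7 nodes and 6 directed edges. The class and function names are illustrative assumptions.

```python
from dataclasses import dataclass, field

@dataclass
class SceneGraph:
    nodes: list = field(default_factory=list)  # (kind, label) tuples
    edges: list = field(default_factory=list)  # (source index, target index)

    def add_node(self, kind, label):
        self.nodes.append((kind, label))
        return len(self.nodes) - 1

def build_scene_graph(objects, attributes, triplets):
    """objects: list of object labels.
    attributes: dict mapping an object index to its attribute labels.
    triplets: list of (subject index, predicate label, object index)."""
    g = SceneGraph()
    obj_ids = [g.add_node("object", label) for label in objects]
    for i, attrs in attributes.items():
        for label in attrs:
            a_id = g.add_node("attribute", label)
            g.edges.append((obj_ids[i], a_id))   # o_i -> a_{i,l}
    for i, predicate, j in triplets:
        r_id = g.add_node("relationship", predicate)
        g.edges.append((obj_ids[i], r_id))       # o_i -> r_ij
        g.edges.append((r_id, obj_ids[j]))       # r_ij -> o_j
    return g

# Hypothetical example: 3 objects, 2 attributes, 2 relationship triplets.
g = build_scene_graph(
    objects=["man", "helmet", "road"],
    attributes={0: ["young"], 2: ["dirty"]},
    triplets=[(1, "on", 0), (0, "on", 2)],
)
```

Each attribute contributes one node and one edge, while each triplet contributes one relationship node and two edges, which is why the counts come out as 7 and 6 here.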

We use an off-the-shelf scene graph parser to obtain scene graphs $\mathcal{G}$ from sentences: a syntactic dependency tree is first built, and then a rule-based method is applied for transforming the tree to a scene graph.

4.2 Graph Convolution Network

We present the implementation for the step $\mathcal{X}\leftarrow\mathcal{G}$ in Eq. (5), i.e., how to transform the original node embeddings $\bm{e}_{o}$ , $\bm{e}_{a}$ , and $\bm{e}_{r}$ into a new set of context-aware embeddings $\mathcal{X}$ . Formally, $\mathcal{X}$ contains three kinds of $d$ -dimensional embeddings: relationship embedding $\bm{x}_{r_{ij}}$ for relationship node $r_{ij}$ , object embedding $\bm{x}_{o_{i}}$ for object node $o_{i}$ , and attribute embedding $\bm{x}_{a_{i}}$ for object node $o_{i}$ . In our implementation, $d$ is set to $1,000$ . We use four spatial graph convolutions: $g_{r}$ , $g_{a}$ , $g_{s}$ , and $g_{o}$ for generating the above mentioned three kinds of embeddings. In our implementation, all these four functions have the same structure with independent parameters: a vector concatenation input to a fully-connected layer, followed by an ReLU.

Relationship Embedding $\bm{x}_{r_{ij}}$ : Given one relationship triplet $<o_{i}-r_{ij}-o_{j}>$ in $\mathcal{G}$ , we have:

$$\bm{x}_{r_{ij}}=g_{r}(\bm{e}_{o_{i}},\bm{e}_{r_{ij}},\bm{e}_{o_{j}}),\tag{6}$$

where the context of a relationship triplet is incorporated together. Figure 3 (a) shows such an example.

Attribute Embedding $\bm{x}_{a_{i}}$ : Given one object node $o_{i}$ with all its attributes $a_{i,1:Na_{i}}$ in $\mathcal{G}$ , where $Na_{i}$ is the number of attributes that the object $o_{i}$ has, then $\bm{x}_{{a}_{i}}$ for $o_{i}$ is:

$$\bm{x}_{a_{i}}=\frac{1}{Na_{i}}\sum_{l=1}^{Na_{i}}g_{a}(\bm{e}_{o_{i}},\bm{e}_{a_{i,l}}),\tag{7}$$

where the context of this object and all its attributes are incorporated. Figure 3 (b) shows such an example.

Object Embedding $\bm{x}_{o_{i}}$ : In $\mathcal{G}$ , $o_{i}$ can act as “subject” or “object” in relationships, which means $o_{i}$ will play different roles due to different edge directions. Then, different functions should be used to incorporate such knowledge. To avoid ambiguity in the meaning of the same “predicate” in different contexts, knowledge of the whole relationship triplets where $o_{i}$ appears should be incorporated into $\bm{x}_{o_{i}}$ . One simple example of such ambiguity is that, in $<$ hand-with-cup $>$ , the predicate “with” may mean “hold”, while in $<$ head-with-hat $>$ , “with” may mean “wear”. Therefore, $\bm{x}_{o_{i}}$ can be calculated as:

$$\bm{x}_{o_{i}}=\frac{1}{Nr_{i}}\Big[\sum_{o_{j}\in sbj(o_{i})}g_{s}(\bm{e}_{o_{i}},\bm{e}_{o_{j}},\bm{e}_{r_{ij}})+\sum_{o_{k}\in obj(o_{i})}g_{o}(\bm{e}_{o_{k}},\bm{e}_{o_{i}},\bm{e}_{r_{ki}})\Big].\tag{8}$$

For each node $o_{j}\in sbj(o_{i})$ , it acts as “object” while $o_{i}$ acts as “subject”, e.g., $sbj(o_{1})=\{o_{2}\}$ in Figure 3 (c). Conversely, each node $o_{k}\in obj(o_{i})$ acts as “subject” while $o_{i}$ acts as “object”. $Nr_{i}=|sbj(o_{i})|+|obj(o_{i})|$ is the number of relationship triplets where $o_{i}$ is present. Figure 3 (c) shows this example.
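The three kinds of context-aware embeddings can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions: each graph convolution is a concatenation fed to one fully-connected layer and a ReLU, as described in the text; the argument order inside each function, the random initialisation, and the toy dimension `d = 8` (instead of 1,000) are assumptions made for the example, and the helper names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # toy dimension for illustration; the text sets d = 1,000

def make_gcn(n_inputs):
    """One spatial graph convolution: concatenation -> fc -> ReLU."""
    W = rng.normal(scale=0.1, size=(d, n_inputs * d))
    b = np.zeros(d)
    def g(*vectors):
        return np.maximum(W @ np.concatenate(vectors) + b, 0.0)
    return g

# Four functions with the same structure but independent parameters.
g_r, g_a, g_s, g_o = make_gcn(3), make_gcn(2), make_gcn(3), make_gcn(3)

def relationship_embedding(e_oi, e_rij, e_oj):
    # Context of the whole triplet <o_i - r_ij - o_j>.
    return g_r(e_oi, e_rij, e_oj)

def attribute_embedding(e_oi, e_attrs):
    # Average over all attributes of o_i (assumes at least one attribute).
    return np.mean([g_a(e_oi, e_a) for e_a in e_attrs], axis=0)

def object_embedding(e_oi, as_subject, as_object):
    """as_subject: (e_oj, e_rij) pairs for triplets <o_i - r_ij - o_j>.
    as_object:  (e_ok, e_rki) pairs for triplets <o_k - r_ki - o_i>.
    Assumes o_i appears in at least one triplet (Nr_i >= 1)."""
    terms = [g_s(e_oi, e_oj, e_rij) for e_oj, e_rij in as_subject]
    terms += [g_o(e_ok, e_oi, e_rki) for e_ok, e_rki in as_object]
    return np.sum(terms, axis=0) / len(terms)

# Toy usage with random label embeddings.
e = [rng.normal(size=d) for _ in range(5)]
x_r = relationship_embedding(e[0], e[1], e[2])
x_a = attribute_embedding(e[0], [e[3], e[4]])
x_o = object_embedding(e[0], as_subject=[(e[2], e[1])], as_object=[(e[2], e[1])])
```

Using separate functions for the subject and object roles is what lets the same predicate contribute differently depending on the edge direction.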

4.3 Dictionary

Now we introduce how to learn the dictionary $\mathcal{D}$ and then use it to re-encode $\hat{\mathcal{X}}\leftarrow R(\mathcal{X};\mathcal{D})$ in Eq. (5). Our key idea is inspired by using the working memory to preserve a dynamic knowledge base for run-time inference, which is widely used in textual QA, VQA, and one-shot classification. Our $\mathcal{D}$ aims to embed language inductive bias in language composition. Therefore, we propose to place the dictionary learning into the sentence self-reconstruction framework. Formally, we denote $\mathcal{D}$ as a $d\times K$ matrix $\bm{D}=\{\bm{d}_{1},\bm{d}_{2},...,\bm{d}_{K}\}$ . The $K$ is set as $10,000$ in implementation. Given an embedding vector $\bm{x}\in\mathcal{X}$ , the re-encoder function $R_{\mathcal{D}}$ can be formulated as:

$$\hat{\bm{x}}=R_{\mathcal{D}}(\bm{x})=\bm{D}\bm{\alpha}=\sum_{k=1}^{K}\alpha_{k}\bm{d}_{k},\tag{9}$$

where $\bm{\alpha}=\text{softmax}(\bm{D}^{T}\bm{x})$ can be viewed as the “key” operation in memory network . As shown in Figure 4, this re-encoding offers some interesting “imagination” in human common sense reasoning. For example, from “yellow and dotted banana”, after re-encoding, the feature will be more likely to generate “ripe banana”.
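The re-encoding step (a softmax over $\bm{D}^{T}\bm{x}$ followed by a weighted sum of the dictionary atoms) can be sketched in a few lines of NumPy. The function names and the tiny dictionary used in the example are illustrative assumptions; in the paper $\bm{D}$ is a learned $d\times K$ matrix with $K=10,000$.

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)          # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def re_encode(x, D):
    """Re-encode x with dictionary D.

    x: (d,) embedding vector.
    D: (d, K) dictionary whose columns are the atoms d_1, ..., d_K.
    Returns the re-encoded vector x_hat (d,) and the weights alpha (K,).
    """
    alpha = softmax(D.T @ x)   # "key" operation: similarity to each atom
    x_hat = D @ alpha          # weighted sum of atoms
    return x_hat, alpha

# Toy usage: three orthogonal atoms, query close to the first atom.
D = np.eye(3)
x_hat, alpha = re_encode(np.array([10.0, 0.0, 0.0]), D)
```

Because $\hat{\bm{x}}$ is always a convex combination of the atoms, inputs from either the sentence or the image side are mapped into the same span, which matches the text's description of $\mathcal{D}$ as a way to reduce the domain gap.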

We deploy the attention structure in for reconstructing $\mathcal{S}$ . Given a reconstructed $\mathcal{S}$ , we can use the training objective in Eq. (2) or (3) to train SGAE parameterized by $\mathcal{D}$ in an end-to-end fashion. Note that training SGAE is unsupervised: it needs only sentences, so SGAE offers the potential for never-ending learning of the inductive bias in $\mathcal{D}$ from large-scale unlabeled text. Some preliminary studies are reported in Section 6.2.2.

Overall Model: SGAE-based Encoder-Decoder

In this section, we will introduce the overall model: SGAE-based Encoder-Decoder as sketched in Figure 2 and Eq. (4).

The original image features extracted by CNN are not ready for use for the dictionary re-encoding as in Eq. (9), due to the large gap between vision and language. To this end, we propose a Multi-modal Graph Convolution Network (MGCN) to first map the visual features $\mathcal{V}$ into a set of scene graph-modulated features ${\mathcal{V}}^{\prime}$ .

Here, the scene graph $\mathcal{G}$ is extracted by an image scene graph parser that contains an object proposal detector, an attribute classifier, and a relationship classifier. In our implementation, we use Faster-RCNN as the object detector , MOTIFS relationship detector as the relationship classifier, and we use our own attribute classifier: a small fc-ReLU-fc-Softmax network head. The key representation difference between the sentence-parsed $\mathcal{G}$ and the image-parsed $\mathcal{G}$ is that the node $o_{i}$ is not only the label embedding. In particular, we use the RoI features pre-trained from Faster RCNN and then fuse the detected label embedding $\bm{e}_{o_{i}}$ with the visual feature $\bm{v}_{o_{i}}$ , into a new node feature $\bm{u}_{o_{i}}$ :

where $\bm{W}_{1}$ and $\bm{W}_{2}$ are the fusion parameters following . Compared to the popular bi-linear fusion , Eq. (10) empirically shows faster convergence when training the label embeddings in our experiments. The remaining node embeddings, $\bm{u}_{r_{ij}}$ and $\bm{u}_{a_{i}}$ , are obtained in a similar way. The differences between two scene graphs generated from $\mathcal{I}$ and $\mathcal{S}$ are visualized in Figure 1, where the image $\mathcal{G}$ is usually simpler and noisier than the sentence $\mathcal{G}$ .

Similar to the GCN used in Section 4.2, MGCN also has an ensemble of four functions $f_{r}$ , $f_{a}$ , $f_{s}$ and $f_{o}$ , each of which is a two-layer structure: fc-ReLU with independent parameters. The computation of relationship, attribute and object embeddings is similar to Eq. (6), Eq. (7), and Eq. (8), respectively. After computing $\mathcal{V}^{\prime}$ by using MGCN, we can adopt Eq. (9) to re-encode $\mathcal{V}^{\prime}$ as $\hat{\mathcal{V}}$ and feed $\hat{\mathcal{V}}$ to the decoder for generating language $\mathcal{S}$ . In particular, we deploy the attention structure in for the generation.

5.2 Training and Inference

Following the common practice in deep-learning feature transfer, we use the SGAE pre-trained $\mathcal{D}$ as the initialization for the $\mathcal{D}$ in our overall encoder-decoder for image captioning. In particular, we intentionally use a very small learning rate (e.g., $10^{-5}$ ) for fine-tuning $\mathcal{D}$ so that it stays close to the pre-trained dictionary and remains shared. The overall training loss is hybrid: we use the cross-entropy loss in Eq. (2) for $20$ epochs and then use the RL-based reward in Eq. (3) for another $40$ epochs.

For inference in language generation, we adopt the beam search strategy with a beam size of 5.
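As an illustration of beam search decoding, the following generic sketch keeps the `beam_size` highest-scoring partial captions at each step. It is not the paper's decoder: `step_log_probs` is a hypothetical callback standing in for the language model, and the toy model in the usage example is invented to show a case where a wider beam finds a better caption than greedy decoding.

```python
import math

def beam_search(step_log_probs, beam_size=5, max_len=16, eos="<eos>"):
    """step_log_probs(prefix) -> dict mapping next token to log-probability."""
    beams = [([], 0.0)]   # (tokens so far, cumulative log-probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            for tok, lp in step_log_probs(tokens).items():
                candidates.append((tokens + [tok], score + lp))
        # Keep only the best beam_size expansions.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for tokens, score in candidates[:beam_size]:
            if tokens[-1] == eos:
                finished.append((tokens, score))
            else:
                beams.append((tokens, score))
        if not beams:
            break
    finished.extend(beams)  # hypotheses cut off at max_len
    return max(finished, key=lambda c: c[1])[0]

def toy_model(prefix):
    """Hypothetical model: 'a' is the likelier first word, but 'the dog'
    is the likeliest full caption."""
    if not prefix:
        return {"a": math.log(0.6), "the": math.log(0.4)}
    if prefix == ["a"]:
        return {"dog": math.log(0.5), "cat": math.log(0.5)}
    if prefix == ["the"]:
        return {"dog": math.log(0.9), "cat": math.log(0.1)}
    return {"<eos>": 0.0}

best = beam_search(toy_model, beam_size=5)
greedy = beam_search(toy_model, beam_size=1)
```

With a beam size of 1 the search commits to "a" at the first step, whereas the wider beam recovers the higher-probability caption starting with "the".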

Experiments

MS-COCO . There are two standard splits of MS-COCO: the official online test split and the 3rd-party Karpathy split for offline test. The first split has $82,783/40,504/40,775$ train/val/test images, each of which has 5 human labeled captions. The second split has $113,287/5,000/5,000$ train/val/test images, each of which has 5 captions.

Visual Genome (VG). This dataset has abundant scene graph annotations, e.g., objects’ categories, objects’ attributes, and pairwise relationships, which can be exploited to train the object proposal detector, attribute classifier, and relationship classifier as our image scene graph parser.

Settings. We pre-processed the captions as follows: we first tokenized the texts on white space; then we changed all the words to lowercase; we also deleted the words which appear fewer than $5$ times; at last, we trimmed each caption to a maximum of $16$ words. This results in a vocabulary of $10,369$ words. This pre-processing was also applied in VG. It is noteworthy that except for ablative studies, these additional text descriptions from VG were not used for training the captioner. Since the object, attribute, and relationship annotations are very noisy in the VG dataset, we filter them by keeping the objects, attributes, and relationships which appear more than $2,000$ times in the training set. After filtering, the remaining $305$ objects, $103$ attributes, and $64$ relationships are used to train our object detector, attribute classifier and relationship classifier.
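The caption pre-processing steps can be sketched as below. This is an assumption-laden illustration: the function name is hypothetical, and rare words are simply dropped, following the wording "deleted"; some implementations instead map them to an unknown-word token.

```python
from collections import Counter

def preprocess_captions(captions, min_count=5, max_len=16):
    """Tokenize on white space, lowercase, drop rare words, trim length.

    Returns the processed captions (lists of tokens) and the sorted vocabulary.
    """
    tokenized = [c.lower().split() for c in captions]
    counts = Counter(w for tokens in tokenized for w in tokens)
    # Keep words that appear at least min_count times in the corpus.
    vocab = {w for w, n in counts.items() if n >= min_count}
    processed = [[w for w in tokens if w in vocab][:max_len]
                 for tokens in tokenized]
    return processed, sorted(vocab)

# Toy usage: "rare" and "zebra" appear only once and are removed.
captions = ["A dog runs"] * 5 + ["a RARE zebra"]
processed, vocab = preprocess_captions(captions)
```

The same counting pattern, with a threshold of 2,000 occurrences, would apply to filtering the Visual Genome object, attribute, and relationship labels.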

We chose the language decoder proposed in . The number of hidden units of both LSTMs used in this decoder is set to $1000$ . For training SGAE in Eq. (5), the decoder is first set as $\mathcal{S}\leftarrow\mathcal{X}$ and $\mathcal{D}$ is not trained, in order to learn a rudimentary encoder and decoder. We used the cross-entropy loss in Eq. (2) to train them for $20$ epochs. Then the decoder was set as $\mathcal{S}\leftarrow\hat{\mathcal{X}}$ to train $\mathcal{D}$ by cross-entropy loss for another $20$ epochs. The learning rate was initialized to $5\times10^{-4}$ for all parameters and we decayed it by $0.8$ every $5$ epochs. For training our SGAE-based encoder-decoder, we followed Eq. (4) to generate $\mathcal{S}$ with shared $\mathcal{D}$ pre-trained from SGAE. The decoder was set as $\mathcal{S}\leftarrow\{\hat{\mathcal{V}},\mathcal{V}^{\prime}\}$ , where $\mathcal{V}^{\prime}$ and $\hat{\mathcal{V}}$ can provide visual clues and high-level semantic contexts respectively. In this process, cross-entropy loss was first used to train the network for $20$ epochs and then the RL-based reward was used to train for another $40$ epochs. The learning rate for $\mathcal{D}$ was initialized to $5\times10^{-5}$ and for other parameters it was $5\times10^{-4}$ , and all these learning rates were decayed by $0.8$ every $5$ epochs. The Adam optimizer was used with a batch size of $100$ .
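The step-wise learning rate decay described above can be written as a one-line schedule. This sketch assumes that epochs are counted from 0 and that the rate is multiplied by 0.8 once every 5 completed epochs; whether the schedule restarts between the cross-entropy and RL phases is not stated in the text, so the function leaves that choice to the caller.

```python
def learning_rate(epoch, base_lr=5e-4, decay=0.8, every=5):
    """Learning rate at a given (0-indexed) epoch under step decay."""
    return base_lr * decay ** (epoch // every)

# Rates for the first 20 epochs: general parameters and the dictionary D.
lr_params = [learning_rate(e) for e in range(20)]
lr_dict = [learning_rate(e, base_lr=5e-5) for e in range(20)]
```

Under these assumptions, the dictionary's rate stays ten times smaller than that of the other parameters at every epoch.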

Metrics. We used four standard automatic evaluation metrics: CIDEr-D, BLEU, METEOR and ROUGE.

6.2 Ablative Studies

We conducted extensive ablations for architecture (Section 6.2.1), language corpus (Section 6.2.2), and sentence reconstruction quality (Section 6.2.3). For simplicity, we use SGAE to denote our SGAE-based encoder-decoder captioning model.

Comparing Methods. For quantifying the importance of the proposed GCN, MGCN, and dictionary $\mathcal{D}$ , we ablated our SGAE with the following baselines:

Base: We followed the pipeline given in Eq. (1) without using GCN, MGCN, and $\mathcal{D}$ . This baseline is the benchmark for other ablative baselines.

Base+MGCN: We added MGCN to compute the multi-modal embedding set $\hat{\mathcal{V}}$ . This baseline is designed for validating the importance of MGCN.

Base+ $\bm{D}$ w/o GCN: We learned $\mathcal{D}$ by using Eq. (5), while GCN was not used and only word embeddings of $\mathcal{S}$ were input to the decoder. Also, MGCN in Eq. (4) was not used. This baseline is designed for validating the importance of GCN.

Base+ $\bm{D}$ : Compared to Base, we learned $\mathcal{D}$ by using GCN, while MGCN in Eq. (4) was not used. This baseline is designed for validating the importance of the shared $\mathcal{D}$ .

Results. The middle section of Table 1 shows the performances of the ablative baselines on MS-COCO Karpathy split. Compared with Base, our SGAE can boost the CIDEr-D by absolute $7.2$ . By comparing Base+MGCN, Base+ $\bm{D}$ w/o GCN, and Base+ $\bm{D}$ with Base, we can find that all the performances are improved, which demonstrates that the proposed MGCN, GCN, and $\mathcal{D}$ all contribute to advancing the performances. We can also observe that the performances of Base+ $\bm{D}$ or Base+ $\bm{D}$ w/o GCN are better than Base+MGCN, which suggests that the language inductive bias plays an important role in generating better captions.

Qualitative Examples. Figure 5 shows $6$ examples of the generated captions using different baselines. We can see that compared with captions generated by Base, Base+MGCN’s descriptions usually contain more details about objects’ attributes and pairwise relationships. Captions generated by SGAE are more complex and descriptive. For example, in Figure 5 (a), the word “busy” will be used to describe the heavy traffic; in (b) the scene “forest” can be deduced from “trees”; and in (d), the weather “rain” will be inferred from “umbrella”.

6.2.2 Language Corpus

Comparing Methods. To test the potential of using a large-scale corpus for learning a better $\mathcal{D}$ , we used the texts provided by VG instead of MS-COCO to learn $\mathcal{D}$ , and then shared the learned $\mathcal{D}$ in the encoder-decoder pipeline. The results are shown in Table 2, where Web means results obtained by using sentences from VG.

Results. We can observe that by using the web description texts, the performances of generated captions are boosted compared with Base, which validates the potential of our proposed model in exploiting additional Web texts. We can also see that by using texts provided by MS-COCO itself (SGAE), the generated captions have better scores than using Web texts. This is intuitively reasonable since $\mathcal{D}$ can preserve more useful clues when a matched language corpus is given. Both of these two comparisons validate the effectiveness of $\mathcal{D}$ in two aspects: $\mathcal{D}$ can memorize common inductive bias from the additional unmatched Web texts or specific inductive bias from a matched language corpus.

Qualitative Examples. Figure 6 shows $6$ examples of generated captions by using different language corpora. Generally, compared with captions generated by Base, the captions of Web and SGAE are more descriptive. Specifically, the captions generated by using the matched language corpus can usually describe a scene by some specific expressions in the dataset, while more general expressions will appear in captions generated by using Web texts. For example, in Figure 6 (b), SGAE uses “lush green field”, as the GT captions do, while Web uses “grass”; or in (e), SGAE prefers “dirt” while Web prefers “sand”.

Human Evaluation. For better evaluating the qualities of the generated captions by using different language corpora, we conducted human evaluation with $30$ workers. We showed them two captions generated by different methods and asked them which one is more descriptive. For each pairwise comparison, $100$ images are randomly extracted from the Karpathy split for them to compare. The results of the comparisons are shown in Figure 7. From these pie charts, we can observe that when a $\mathcal{D}$ is used, the generated captions are evaluated to be more descriptive.

6.2.3 Sentence Reconstruction

Comparing Methods. We investigated how well the sentences are reconstructed in training SGAE in Eq. (5), with or without using the re-encoding by $\mathcal{D}$ , that is, we denote $\widehat{\mathcal{X}}$ as the pipeline using $\mathcal{D}$ and $\mathcal{X}$ as the pipeline directly reconstructing sentences from their scene graph node features. Such results are given in Table 3.

Analysis. As we can see, the performances of using direct scene graph features $\mathcal{X}$ are much better than those ( $\hat{\mathcal{X}}$ ) imposed with $\mathcal{D}$ for re-encoding. This is reasonable since $\mathcal{D}$ will regularize the reconstruction and thus encourages the learning of language inductive bias. Interestingly, the gap between $\hat{\mathcal{X}}$ and SGAE suggests that we should develop a more powerful image scene graph parser for improving the quality of $\mathcal{G}$ in Eq. (4), and that a stronger re-encoder should be designed for extracting more preserved inductive bias when only low-quality visual scene graphs are available.

6.3 Comparisons with State-of-The-Arts

Comparing Methods. Though there are various captioning models developed in recent years, for fair comparison, we only compared SGAE with some encoder-decoder methods trained by the RL-based reward (Eq. (3)), due to their superior performances. Specifically, we compared our methods with SCST, StackCap, Up-Down, LSTM-A, GCN-LSTM, and CAVP. Among these methods, SCST and Up-Down are two baselines where the more advanced self-critic reward and visual features are used. Compared with SCST, StackCap proposes a more complex RL-based reward for learning captions with more details. All of LSTM-A, GCN-LSTM, and CAVP try to exploit information of visual scene graphs, e.g., LSTM-A and GCN-LSTM exploit attributes and relationships information respectively, while CAVP tries to learn pairwise relationships in the decoder. Notably, GCN-LSTM sets the batch size as $1,024$ and the number of training epochs as $250$ , which is quite large compared with some other methods like Up-Down or CAVP, and is beyond our computation resources. For fair comparison, we also re-implemented a version of their work (since they do not publish the code), and set the batch size and the number of training epochs both as $100$ ; this result is denoted as GCN-LSTM† in Table 1. In addition, the best result reported by GCN-LSTM is obtained by fusing two probabilities computed from two different kinds of relationships, which is denoted as GCN-LSTMfuse, and our counterpart is denoted as SGAEfuse.

Analysis. From Table 1, we can see that our single model achieves a new state-of-the-art score among all the compared methods in terms of CIDEr-D, which is $127.8$ . Compared with GCN-LSTMfuse, our fusion model SGAEfuse also achieves better performances. By exploiting the inductive bias in $\mathcal{D}$ , even when our decoder or RL-reward is not as sophisticated as CAVP or StackCap, our method still has better performances. Moreover, our small batch size and fewer training epochs still lead to higher performances than GCN-LSTM, whose batch size and training epochs are much larger. Table 4 reports the performances of different methods tested on the official server. Compared with the published captioning methods (by the date of 16/11/2018), our single model has competitive performances and can achieve the highest CIDEr-D score.

Conclusions

We proposed to incorporate the language inductive bias — a prior for more human-like language generation — into the prevailing encoder-decoder framework for image captioning. In particular, we presented a novel unsupervised learning method: Scene Graph Auto-Encoder (SGAE), for embedding the inductive bias into a dictionary, which can be shared as a re-encoder for language generation and significantly improve the performance of the encoder-decoder. We validate the SGAE-based framework by extensive ablations and comparisons with state-of-the-art performances on MS-COCO. As we believe that SGAE is a general solution for capturing the language inductive bias, we are going to apply it in other vision-language tasks.

Network Architecture

Here, we introduce the detailed network architectures of all the components in our model, which include Graph Convolutional Network (GCN), Multi-modal Graph Convolutional Network (MGCN), Dictionary, and Decoders.

2 Multi-modal Graph Convolutional Network

Relationship Embedding $\bm{v}_{r_{ij}}^{{}^{\prime}}$ (Table 6 (12)):

Attribute Embedding $\bm{v}_{a_{i}}^{{}^{\prime}}$ (Table 6 (13)):

Object Embedding $\bm{v}_{o_{i}}^{{}^{\prime}}$ (Table 6 (14)):

3 Dictionary

The re-encoder function in Section 4.3 is used to re-encode a new representation $\hat{\bm{x}}$ from an index vector $\bm{x}$ and a dictionary $\mathcal{D}$ ; this operation is given in Table 7. As shown in Table 7 (2) and (3) respectively, given an index vector $\bm{x}$ , we first compute the inner product between each element in $\bm{D}$ and $\bm{x}$ and then use softmax to normalize the computed results. At last, the re-encoded $\hat{\bm{x}}$ is the weighted sum of each atom in $\bm{D}$ as $\sum_{k=1}^{K}\alpha_{k}\bm{d}_{k}$ , where $K$ is set as 10,000.

4 Decoders

We followed the language decoder proposed by to set our two decoders of Eq. (4) and Eq. (5) in the main paper. Both decoders have the same architecture, as shown in Table 8, except for the different embedding sets used as their inputs. For convenience, we introduce the decoders’ common architecture without differentiating them between Eq. (4) and Eq. (5), and then detail the difference between them at the end of this section.

The implemented decoder contains two LSTM layers and one attention module. The input of the first LSTM is the concatenation of three terms: the word embedding vector $\bm{W}_{\Sigma}\bm{w}_{t-1}$ , the mean pooling of the embedding set $\bar{\bm{z}}$ , and the output of the second LSTM $\bm{h}_{t-1}^{2}$ . We use these as input because they provide abundant accumulated context information. LSTM1 then produces an index vector $\bm{h}_{t-1}^{1}$ (Table 8 (7)), which instructs the attention module to attend to suitable embeddings in $\mathcal{Z}$ . Given $\mathcal{Z}$ and $\bm{h}_{t-1}^{1}$ , the formulations in Table 8 (8) and (9) compute an $M$ -dimensional attention distribution $\bm{\beta}$ , and the attended embedding $\hat{\bm{z}}$ is then obtained as a weighted sum as in (10). By feeding $\hat{\bm{z}}$ and $\bm{h}_{t-1}^{1}$ into LSTM2 and applying (11) to (13), we obtain the word distribution $P_{t}$ , from which a word is sampled at time $t$ .
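
As a rough illustration of the data flow in one decoding step, the sketch below builds the LSTM1 input by concatenation and runs an attention module over the embedding set. The exact scoring function of Table 8 (8) is not reproduced in this text, so the additive form used here (a `tanh` of two linear projections followed by a dot product) is an assumption, as are the parameter names `W_z`, `W_h`, `w_a` and the toy values. The LSTM cells themselves are omitted, and `h1` simply stands in for the LSTM1 output.

```python
import math

def matvec(W, v):
    """Multiply matrix W (a list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def mean_pool(Z):
    """Mean of the embedding set, i.e. z_bar."""
    return [sum(z[j] for z in Z) / len(Z) for j in range(len(Z[0]))]

def attend(Z, h1, W_z, W_h, w_a):
    """Attention over the M embeddings in Z, queried by the LSTM1 state h1.

    Assumed additive scoring: score_m = w_a . tanh(W_z z_m + W_h h1).
    Returns the M-dimensional distribution beta and the attended z_hat.
    """
    proj_h = matvec(W_h, h1)
    scores = []
    for z in Z:
        hidden = [math.tanh(a + b) for a, b in zip(matvec(W_z, z), proj_h)]
        scores.append(sum(w * u for w, u in zip(w_a, hidden)))
    # Softmax over the M scores.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    beta = [e / total for e in exps]
    # Attended embedding: weighted sum of the embeddings.
    z_hat = [sum(b * z[j] for b, z in zip(beta, Z)) for j in range(len(Z[0]))]
    return beta, z_hat

# Toy values (all hypothetical): M = 3 embeddings of dimension 2.
Z = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
word_emb = [0.1, 0.2]   # stands in for W_Sigma w_{t-1}
h2_prev = [0.0, 0.0]    # previous output of LSTM2
h1 = [0.5, -0.5]        # stands in for the LSTM1 output

lstm1_input = word_emb + mean_pool(Z) + h2_prev   # concatenation of three terms
beta, z_hat = attend(Z, h1, identity, identity, [1.0, 1.0])
lstm2_input = z_hat + h1                          # fed to LSTM2
```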

The two decoders in Eq. (4) and Eq. (5) differ only in the embedding set $\mathcal{Z}$ used as input. In SGAE (Eq. (5)), $\mathcal{Z}$ is set to $\hat{\mathcal{X}}$ . In the SGAE-based encoder-decoder (Eq. (4)), we make a small modification: each vector $\bm{z}\in\mathcal{Z}$ is set as $\bm{z}=[\bm{v}^{\prime},\hat{\bm{v}}]$ , where $\bm{v}^{\prime}\in\mathcal{V}^{\prime}$ ( $\mathcal{V}^{\prime}$ is the scene graph-modulated feature set in Section 5.1), and $\hat{\bm{v}}\in\hat{\mathcal{V}}$ ( $\hat{\mathcal{V}}$ is the re-encoded feature set in Section 5.1).
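
The concatenation $\bm{z}=[\bm{v}^{\prime},\hat{\bm{v}}]$ can be illustrated as follows. The sketch assumes that the two feature sets are aligned one-to-one (the $i$-th re-encoded feature corresponds to the $i$-th scene graph-modulated feature); the function name `build_decoder_inputs` and the toy values are hypothetical.

```python
def build_decoder_inputs(V_prime, V_hat):
    """Build the embedding set Z for the SGAE-based encoder-decoder.

    Each z is the concatenation [v', v_hat] of a scene graph-modulated
    feature and its re-encoded counterpart (assumed to be index-aligned).
    """
    if len(V_prime) != len(V_hat):
        raise ValueError("V_prime and V_hat must contain the same number of vectors")
    return [list(v) + list(v_hat) for v, v_hat in zip(V_prime, V_hat)]

# Toy example: two features of dimension 2 give two inputs of dimension 4.
V_prime = [[1.0, 2.0], [3.0, 4.0]]
V_hat = [[0.5, 0.5], [0.1, 0.9]]
Z = build_decoder_inputs(V_prime, V_hat)
```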

Details of Scene Graph

For each sentence, we directly used the software provided by to parse its scene graph. We then filtered the parsed scene graphs by removing objects, relationships, and attributes that appear fewer than 10 times among all of them. After filtering, 5,364 objects, 1,308 relationships, and 3,430 attributes remain. We grouped them together and used the word embedding matrix $\bm{W}_{\Sigma_{S}}$ in Table 5 to transform the nodes’ labels into continuous vector representations.
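
The frequency filtering described above can be sketched as follows. The scene graph layout used here (a dict holding three lists of labels) is a simplification assumed for illustration and is not the parser's actual output format; `build_vocab` and `filter_graph` are hypothetical helper names.

```python
from collections import Counter

KINDS = ("objects", "relationships", "attributes")

def build_vocab(scene_graphs, min_count=10):
    """Keep labels that appear at least min_count times over all graphs."""
    counts = {kind: Counter() for kind in KINDS}
    for graph in scene_graphs:
        for kind in KINDS:
            counts[kind].update(graph.get(kind, []))
    return {kind: {label for label, c in counts[kind].items() if c >= min_count}
            for kind in KINDS}

def filter_graph(graph, vocab):
    """Drop every label of a graph that is not in the filtered vocabulary."""
    return {kind: [label for label in graph.get(kind, []) if label in vocab[kind]]
            for kind in KINDS}

# Toy corpus with a threshold of 2 instead of 10.
graphs = [
    {"objects": ["dog", "floor"], "relationships": ["on"], "attributes": ["brown"]},
    {"objects": ["dog"], "relationships": ["on"], "attributes": []},
]
vocab = build_vocab(graphs, min_count=2)
filtered = filter_graph(graphs[0], vocab)
```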

2 Image Scene Graph

Compared with sentence scene graphs, the parsing of image scene graphs is more complicated: we used Faster-RCNN as the object detector to detect and classify objects, the MOTIFS relationship detector to classify relationships between objects, and a simple attribute classifier to predict attributes. Their details are given as follows.

Object Detector: For detecting objects and extracting their RoI features, we followed to train Faster-RCNN. After training, we used 0.7 as the IoU threshold for proposal NMS and 0.3 as the threshold for object NMS. We also selected at least 10 and at most 100 objects for each image. RoI pooling was used to extract these objects’ features, which serve as the input to the relationship classifier, attribute classifier, and MGCN.
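
For readers unfamiliar with the post-processing, the sketch below shows greedy non-maximum suppression with the 0.3 object threshold, followed by a selection of between 10 and 100 objects. How the 10 to 100 objects are chosen is not specified in the text; ranking by detection score with a confidence cut-off (`score_thresh`, a hypothetical parameter) is an assumption made here, as are the function names and the `(x1, y1, x2, y2)` box layout.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, iou_thresh=0.3):
    """Greedy NMS: visit boxes by descending score, drop those overlapping a kept box."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_thresh for j in keep):
            keep.append(i)
    return keep  # indices in descending score order

def select_objects(boxes, scores, iou_thresh=0.3, score_thresh=0.2,
                   min_boxes=10, max_boxes=100):
    """Keep confident boxes after NMS, but at least min_boxes and at most max_boxes."""
    keep = nms(boxes, scores, iou_thresh)
    confident = [i for i in keep if scores[i] >= score_thresh]
    if len(confident) < min_boxes:
        confident = keep[:min_boxes]  # fall back to the top-scoring survivors
    return confident[:max_boxes]

# Toy example: the second box overlaps the first heavily and is suppressed.
boxes = [(0, 0, 10, 10), (1, 1, 10, 10), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
selected = select_objects(boxes, scores)
```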

Relationship Classifier: We used the LSTM structure proposed in as our relationship classifier. After training, we predicted a relationship for each pair of objects whose IoU is larger than 0.2.
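
The pair selection can be illustrated with a short sketch that computes the IoU of two boxes and lists the pairs exceeding the 0.2 threshold. Treating pairs as unordered and representing boxes as `(x1, y1, x2, y2)` tuples are assumptions made for this example; the relationship classifier itself is not shown.

```python
from itertools import combinations

def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def candidate_pairs(boxes, threshold=0.2):
    """Index pairs (i, j), i < j, whose boxes overlap with IoU above the threshold."""
    return [(i, j) for i, j in combinations(range(len(boxes)), 2)
            if iou(boxes[i], boxes[j]) > threshold]

# Toy example: boxes 0 and 1 overlap (IoU = 1/3); box 2 is far away.
boxes = [(0, 0, 2, 2), (1, 0, 3, 2), (10, 10, 12, 12)]
pairs = candidate_pairs(boxes)
```

Restricting prediction to overlapping pairs keeps the number of classifier calls far below the quadratic number of all object pairs when an image contains up to 100 detections.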

Attribute Classifier: The detailed structure of our attribute classifier is given in Table 9. After training, we predicted the top-3 attributes for each object.

For each image, an image scene graph can be built from the predicted objects, relationships, and attributes. As detailed in Section 6.1 of the main paper, the total number of objects, relationships, and attributes used here is 472; thus we used a 472 $\times$ 1,000 word embedding matrix to transform the nodes’ labels into continuous vectors as in Table 6 (6) to (8).

The code and all of these parsed scene graphs will be published for further research upon paper acceptance.

More Qualitative Examples

Figures 8 and 9 show more examples of captions generated by our methods and some baselines. We find that the captions generated by SGAE tend to use more accurate words to describe the objects, attributes, relationships, and scenes that appear in the image. For instance, in Figure 8 (a), the object ‘weather vane’ is used even though this object is not accurately recognized by the object detector; in Figure 8 (c), SGAE prefers the attribute ‘old rusty’; in Figure 9, SGAE describes the relationship between the boat and the water as ‘floating’ instead of ‘swimming’; and, also in Figure 9, the scene ‘mountains’ is inferred by SGAE.