Scene Graph Generation with External Knowledge and Image Reconstruction

Jiuxiang Gu, Handong Zhao, Zhe Lin, Sheng Li, Jianfei Cai, Mingyang Ling

Introduction

With recent breakthroughs in deep learning and image recognition, higher-level visual understanding tasks, such as visual relationship detection, have become a popular research topic. A scene graph, as an abstraction of objects and their complex relationships, provides rich semantic information about an image. Generating it involves detecting all $\langle$ subject-predicate-object $\rangle$ triplets in an image and localizing all objects. A scene graph provides a structured representation of an image that can support a wide range of high-level visual tasks, including image captioning, visual question answering, image retrieval, and image generation. However, it is not easy to extract scene graphs from images, since this involves not only detecting and localizing pairs of interacting objects but also recognizing their pairwise relationships. Currently, there are two categories of approaches for scene graph generation. Both categories group object proposals into pairs and use the phrase features (features of their union area) for predicate inference. The two categories differ in their procedures. The first category detects the objects first and then recognizes the relationships between those objects. The second category jointly identifies the objects and their relationships based on the object and relationship proposals.

Despite the promising progress introduced by these approaches, most of them suffer from the limitations of existing scene graph datasets. First, comprehensively depicting an image with a scene graph requires a wide variety of relation triplets $\langle$ subject-predicate-object $\rangle$. Unfortunately, current datasets, e.g., the Visual Relationship Detection (VRD) dataset, capture only a small portion of this knowledge. Training on such a dataset with long-tail distributions will bias the prediction model towards the most frequent relationships. Second, predicate labels are highly determined by the identification of object pairs. However, due to the difficulty of exhaustively labeling bounding boxes of all instances of each object, current large-scale crowd-sourced datasets like Visual Genome (VG) are contaminated by noise (e.g., missing annotations and meaningless proposals). Such a noisy dataset will inevitably result in poor performance of the trained object detector, which further hinders the performance of predicate detection.

Human beings are capable of reasoning over the visual elements of an image based on commonsense knowledge. For example, in Figure 1, humans have the background knowledge: the subject (woman) appears / stands on something; the object (snow) enhances the evidence of the predicate (skiing). Commonsense knowledge can also help correct object detection. For example, the specific external knowledge for skiing benefits inference of the object (snow) as well. This motivates us to leverage commonsense knowledge to help scene graph generation.

Meanwhile, despite the crucial role of object labels for relationship prediction, existing datasets are very noisy due to the significant amount of missing object annotations. However, our goal is to obtain scene graphs with more complete scene representation. Motivated by this goal, we regularize our scene graph generation network by reconstructing the image from detected objects. Considering the case in Figure 1, a method might recognize snow as grass by mistake. If we generate an image based on the falsely predicted scene graph, this minor error would be heavily penalized, even though most of the snow’s relationships might be correctly identified.

The contributions of this paper are threefold. 1) We propose a knowledge-based feature refinement module to incorporate commonsense knowledge from an external knowledge base. Specifically, the module extracts useful information from ConceptNet to refine object and phrase features before scene graph generation. We exploit a Dynamic Memory Network (DMN) to implement multi-hop reasoning over the retrieved facts and infer the most probable relations accordingly. 2) We introduce an image-level supervision module that regularizes our scene graph generation model by reconstructing the image. We view this auxiliary branch as a regularizer, which is only present during training. 3) We conduct extensive experiments on two benchmark datasets: VRD and VG. Our empirical results demonstrate that our approach can significantly improve over the state of the art on scene graph generation.

Related Works

Incorporating Knowledge in Neural Networks. There has been growing interest in improving data-driven models with external Knowledge Bases (KBs) in the natural language processing and computer vision communities. Large-scale structured KBs are constructed either by manual effort (e.g., Wikipedia, DBpedia), or by automatic extraction from unstructured or semi-structured data (e.g., ConceptNet). One direction to improve the data-driven model is to distill external knowledge into Deep Neural Networks. Wu et al. encode the mined knowledge from DBpedia into a vector and combine it with visual features to predict answers. Instead of aggregating the textual vectors with an average-pooling operation, Li et al. distill the retrieved context-relevant external knowledge triplets through a DMN for open-domain visual question answering. Unlike , Yu et al. extract linguistic knowledge from training annotations and Wikipedia, and distill this knowledge to regularize training and provide extra cues for inference. A teacher-student framework is adopted to minimize the KL-divergence between the prediction distributions of teacher and student.

Visual Relationship Detection. Visual relationship detection has been investigated by many works in the last decade. Lu et al. introduce generic visual relationship detection as a visual task, where they detect objects first, and then recognize predicates between object pairs. Recently, some works have explored message passing for context propagation and feature refinement. Xu et al. construct the scene graph by refining the object and relationship features jointly with message passing. Dai et al. exploit the statistical dependencies between objects and their relationships and refine the posterior probabilities iteratively with a Conditional Random Field (CRF) network. More recently, Zellers et al. achieve a strong baseline by predicting relationships with frequency priors. To deal with the large number of potential relations between objects, Yang et al. propose a relation proposal network that prunes out uncorrelated object pairs, and captures the contextual information with an attentional graph convolutional network. In , they propose a clustering method which factorizes the full graph into subgraphs, where each subgraph is composed of several objects and a subset of their relationships.

Most related to our work are the approaches proposed by Li et al. and Yu et al. Unlike , which focuses on efficient scene graph generation, our approach addresses the long-tail distribution of relationships with commonsense cues along with visual cues. Unlike , which leverages linguistic knowledge to regularize the network, our knowledge-based module improves the feature refining procedure by reasoning over a basket of commonsense knowledge retrieved from ConceptNet.

Methodology

Figure 2 gives an overview of our proposed scene graph generation framework. The entire framework can be divided into the following steps: (1) generate object and subgraph proposals for a given image; (2) refine object and subgraph features with external knowledge; (3) generate the scene graph by recognizing object categories with object features and recognizing object relations by fusing subgraph features and object feature pairs; (4) reconstruct the input image via an additional generative path. During training, we use two types of supervision: scene graph level supervision and image-level supervision. For scene graph level supervision, we optimize our model by guiding the generated scene graph with the ground truth object and predicate categories. The image-level supervision is introduced to overcome the aforementioned missing annotations by reconstructing the image from objects and enforcing the reconstructed image to be close to the original image.

Object Proposal Generation. Given an image $\mathbf{I}$, we first use the Region Proposal Network (RPN) to extract a set of object proposals:

$$[o_{0},\cdots,o_{N-1}]=f_{\text{RPN}}(\mathbf{I})$$

where $f_{\text{RPN}}(\cdot)$ stands for the RPN module, and $o_{i}$ is the $i$-th object proposal represented by a bounding box $r_{i}=[x_{i},y_{i},w_{i},h_{i}]$ with $(x_{i},y_{i})$ being the coordinates of the top left corner and $w_{i}$ and $h_{i}$ being the width and the height of the bounding box, respectively. For any two different objects $\langle o_{i},o_{j}\rangle$, there are two possible relationships in opposite directions. Thus, for $N$ object proposals, there are $N(N-1)$ potential relations in total. Although more object proposals lead to a bigger scene graph, the number of potential relations grows quadratically, which significantly increases the computational cost and slows down inference. To address this issue, subgraph is introduced in to reduce the number of potential relations by clustering.
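To make the growth of $N(N-1)$ concrete, the following minimal sketch enumerates the ordered subject-object pairs for a given number of proposals. The function name `candidate_relations` is introduced here only for illustration and is not part of the described framework.

```python
from itertools import permutations

def candidate_relations(num_proposals):
    """Return all ordered (subject, object) index pairs with subject != object."""
    return list(permutations(range(num_proposals), 2))

# Both directions of each pair are kept, so 4 proposals give 4 * 3 pairs.
print(len(candidate_relations(4)))    # 12
# With the 256 training proposals used later in the paper: 256 * 255 pairs.
print(len(candidate_relations(256)))  # 65280
```

Even a moderate number of proposals therefore yields tens of thousands of candidate relations, which is the cost that subgraph clustering is meant to reduce.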

Feature Refinement with External Knowledge

Object and Subgraph Inter-refinement. Considering that each object $\mathbf{o}_{i}$ is connected to a set of subgraphs $\mathbf{S}^{i}$ and each subgraph $\mathbf{s}_{k}$ is associated with a set of objects $\mathbf{O}^{k}$ , we refine the object vector (resp. the subgraph) by attending the associated subgraph feature maps (resp. the associated object vectors):

where $\alpha_{k}^{s\rightarrow o}$ (resp. $\alpha_{i}^{o\rightarrow s}$ ) is the output of a softmax layer indicating the weight for passing $\mathbf{s}_{k}^{i}$ (resp. $\mathbf{o}_{i}^{k}$ ) to $\mathbf{o}_{i}$ (resp. $\mathbf{s}_{k}$ ), and $f_{s\rightarrow o}$ and $f_{o\rightarrow s}$ are non-linear mapping functions. This part is similar to . Note that due to different dimensions of $\mathbf{o}_{i}$ and $\mathbf{s}_{k}$ , pooling or spatial location based attention needs to be respectively applied for $s\rightarrow o$ or $o\rightarrow s$ refinement. Interested readers are referred to for details.
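The attention step described above can be sketched as a softmax-weighted sum of the associated vectors. This is only an illustration under stated assumptions: the dot-product scoring rule and the helper names `softmax` and `attend` are ours, the subgraph feature maps are assumed to be already pooled to vectors, and the non-linear mappings $f_{s\rightarrow o}$ and $f_{o\rightarrow s}$ are omitted.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(target, sources):
    """Softmax-weighted sum of source vectors.

    Scores are dot products between the target and each source; this
    scoring rule is an assumption made for the sketch.
    """
    scores = [sum(t * s for t, s in zip(target, src)) for src in sources]
    weights = softmax(scores)
    pooled = [sum(w * src[d] for w, src in zip(weights, sources))
              for d in range(len(target))]
    return weights, pooled

# One object vector attending over two (already pooled) subgraph vectors.
weights, message = attend([1.0, 0.0], [[2.0, 0.0], [0.0, 2.0]])
```

The same routine covers the opposite direction by swapping roles: a subgraph vector attends over its associated object vectors.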

Knowledge Retrieval and Embedding. To address the relationship distribution bias of the current visual relationship datasets, we propose a novel feature refinement network to further improve the feature representation by taking advantage of the commonsense relationships in an external knowledge base (KB). In particular, we predict the object label $a_{i}$ from the refined object vector $\bar{\mathbf{o}}_{i}$, and match $a_{i}$ with the corresponding semantic entities in the KB. Afterwards, we retrieve the corresponding commonsense relationships from the KB using the object label $a_{i}$:

where $a^{r}_{i,j}$ , $a^{o}_{j}$ and $w_{i,j}$ are the top- $K$ corresponding relationships, the object entity and the weight, respectively. Note that the weight $w_{i,j}$ is provided by KB (i.e., ConceptNet ), indicating how common a triplet $\langle{a}_{i},{a}_{i,j}^{r},{a}_{j}^{o}\rangle$ is. Based on the weight $w_{i,j}$ , we can identify the top- $K$ most common relationships for $a_{i}$ . Figure 3 illustrates the process of our proposed knowledge-based feature refinement module.
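The retrieval step can be illustrated with a toy in-memory knowledge base. The entries, weights, and the function name `retrieve_top_k` below are hypothetical and chosen for illustration; they are not taken from ConceptNet and do not use its API. Only the selection rule (keep the $K$ highest-weighted facts for a predicted label) follows the text.

```python
import heapq

# Toy knowledge base of (subject, relation, object, weight) tuples.
# Entries and weights are made up for illustration, not taken from ConceptNet.
TOY_KB = [
    ("snow", "AtLocation", "mountain", 3.0),
    ("snow", "UsedFor", "skiing", 2.5),
    ("snow", "HasProperty", "cold", 4.0),
    ("woman", "CapableOf", "skiing", 1.0),
]

def retrieve_top_k(kb, label, k):
    """Return the k highest-weighted (relation, object, weight) facts for label."""
    facts = [(rel, obj, w) for subj, rel, obj, w in kb if subj == label]
    return heapq.nlargest(k, facts, key=lambda fact: fact[2])

top_facts = retrieve_top_k(TOY_KB, "snow", 2)
print(top_facts)
```

A label with no matching entity simply yields an empty list, so downstream code should be prepared for objects without retrieved facts.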

To encode the retrieved commonsense relationships, we first transform each symbolic triplet $\langle{a}_{i},{a}_{i,j}^{r},{a}_{j}^{o}\rangle$ into a sequence of words: $[X^{0},\cdots,X^{T_{a}-1}]$ , and then map each word in the sentence into a continuous vector space with word embedding $\mathbf{x}^{t}=\mathbf{W}_{e}X^{t}$ . The embedded vectors are then fed into an RNN-based encoder as

where $\mathbf{x}_{k}^{t}$ is the $t$ -th word embedding of the $k$ -th sentence, and $\mathbf{h}_{k}^{t}$ is the hidden state of the encoder. We use a bi-directional Gated Recurrent Unit (GRU) for $\textrm{RNN}_{\text{fact}}$ and the final hidden state $\mathbf{h}_{k}^{T_{a}-1}$ is treated as the vector representation for the $k$ -th retrieved sentence or fact, denoted as $\mathbf{f}_{k}^{i}$ for object $\mathbf{o}_{i}$ .

Attention-based Knowledge Fusion. The knowledge units are stored in memory slots for reasoning and updating. Our target is to incorporate the external knowledge into the procedure of feature refining. However, for $N$ objects, we have $N\times K$ relevant fact vectors in memory slots. This makes it difficult to distill the useful information from the candidate knowledge when $N\times K$ is large. DMN provides a mechanism to pick out the most relevant facts by using an episodic memory module. Inspired by this, we adopt the improved DMN to reason over the retrieved facts $\mathbf{F}$, where $\mathbf{F}$ denotes the set of fact embeddings $\{\mathbf{f}_{k}\}$. It consists of an attention component which generates a contextual vector using the episodic memory $\mathbf{m}^{t-1}$. Specifically, we feed the object vector $\mathbf{\bar{o}}$ to a non-linear fully-connected layer and attend the facts as follows:

where $\mathbf{z}^{t}$ is the interactions between the facts $\mathbf{F}$ , the episode memory $\mathbf{m}^{t-1}$ and the mapped object vector $\mathbf{q}$ , $\mathbf{g}^{t}$ is the output of a softmax layer, $\circ$ is the element-wise product, $|\cdot|$ is the element-wise absolute value, and $[\ ;\ ]$ is the concatenation operation. Note that $\mathbf{q}$ and $\mathbf{m}$ need to be expanded via duplication in order to have the same dimension as $\mathbf{F}$ for the interactions. In (9), $\text{AGRU}(\cdot)$ refers to the Attention based GRU which replaces the update gate in GRU with the output attention weight $\mathbf{g}_{k}^{t}$ for fact $k$ :

where $\mathbf{e}_{K}^{t}$ is the final state of the episode which is the state of the GRU after all the $K$ sentences have been seen.
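The two ingredients described above, the interaction features and the attention-gated recurrent update, can be sketched as follows. This is a simplified illustration under assumptions: the order of the concatenated terms follows the usual improved-DMN formulation, which matches the operators listed in the text but is not confirmed by it; the candidate states are passed in precomputed rather than produced by GRU weights; and the names `interaction` and `agru_episode` are ours.

```python
def interaction(fact, query, memory):
    """Build z = [f*q ; f*m ; |f-q| ; |f-m|] for a single fact vector."""
    return (
        [f * q for f, q in zip(fact, query)]
        + [f * m for f, m in zip(fact, memory)]
        + [abs(f - q) for f, q in zip(fact, query)]
        + [abs(f - m) for f, m in zip(fact, memory)]
    )

def agru_episode(candidates, gates, initial_state):
    """Run attention-gated updates over K facts and return the final state.

    candidates[k] stands in for the GRU candidate state of fact k. In a full
    GRU it would depend on the previous state; it is precomputed here to keep
    the sketch short.
    """
    state = list(initial_state)
    for candidate, gate in zip(candidates, gates):
        # The scalar attention weight takes the place of the update gate.
        state = [gate * c + (1.0 - gate) * s for c, s in zip(candidate, state)]
    return state

z = interaction([1, 2], [3, 4], [0, 1])
episode = agru_episode([[1.0, 1.0], [3.0, 3.0]], [1.0, 0.5], [0.0, 0.0])
```

A gate of 0 leaves the state untouched, so facts that receive little attention contribute little to the final episode state.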

After one pass of the attention mechanism, the memory is updated using the current episode state and the previous memory state:

where $\mathbf{m}^{t}$ is the new episodic memory state. By the final pass $T_{m}$, the episodic memory $\mathbf{m}^{T_{m}-1}$ has memorized useful knowledge for relationship prediction.

The final episodic memory $\mathbf{m}^{T_{m}-1}$ is passed to refine the object feature $\mathbf{\bar{o}}$ as

Scene Graph Generation

Relation Prediction. After the feature refinement, we can predict object labels as well as predicate labels with the refined object and subgraph features. The object label is predicted directly from the object features. For the relationship label, as the subgraph feature is related to several object pairs, we predict the label based on the subject and object feature vectors along with their corresponding subgraph feature map. We formulate the inference process as

where $f_{\text{rel}}(\cdot)$ and $f_{\text{node}}(\cdot)$ denote the mapping layers for predicate and object recognition, respectively, and $\otimes$ denotes the convolution operation . Then, we can construct the scene graph as: $\mathcal{G}=\langle V_{i},P_{i,j},V_{j}\rangle,i\neq j$ .

Scene Graph Level Supervision. Like other approaches, during training we encourage the generated scene graph to be close to the ground-truth scene graph by optimizing the scene graph generation process with an object detection loss and a relationship classification loss

where $\mathcal{L}_{\text{pred}}$, $\mathcal{L}_{\text{cls}}$ and $\mathcal{L}_{\text{reg}}$ are the predicate classification loss, the object classification loss and the bounding box regression loss, respectively, $\lambda_{\text{pred}}$, $\lambda_{\text{cls}}$ and $\lambda_{\text{reg}}$ are hyper-parameters, and $\mathbf{1}$ is the indicator function with $u$ being the object label, $u\geq 1$ for object categories and $u=0$ for background.
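As a rough sketch of how these terms combine, the snippet below forms a weighted sum in which the regression term is switched off for background proposals. The additive form is an assumption inferred from the description of the weights and the indicator function; the default weights are the values given in the implementation details, and the function name `scene_graph_loss` is ours.

```python
def scene_graph_loss(loss_pred, loss_cls, loss_reg, object_label,
                     lambda_pred=2.0, lambda_cls=1.0, lambda_reg=0.5):
    """Weighted sum of the three losses; box regression counts only for u >= 1."""
    is_foreground = 1.0 if object_label >= 1 else 0.0
    return (lambda_pred * loss_pred
            + lambda_cls * loss_cls
            + lambda_reg * is_foreground * loss_reg)

# Background proposal (u = 0): the regression term is ignored.
print(scene_graph_loss(1.0, 1.0, 1.0, object_label=0))  # 3.0
# Foreground proposal (u >= 1): the regression term is included.
print(scene_graph_loss(1.0, 1.0, 1.0, object_label=3))  # 3.5
```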

For the predicate detection, the output is the probability over all the candidate predicates. $\mathcal{L}_{\text{pred}}$ is defined as the softmax loss. Like the predicate classification, the output of the object detection is the probability over all the object categories. $\mathcal{L}_{\text{cls}}$ is also defined as the softmax loss. For the bounding box regression loss $\mathcal{L}_{\text{reg}}$ , we use smooth $L_{1}$ loss .
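For reference, the two per-term losses named above can be written in a few lines. This is a generic sketch of the standard definitions, not the paper's code: the softmax loss is the cross-entropy of a softmax over class scores, and the smooth $L_{1}$ loss is shown for a single coordinate difference with the common threshold of 1, which is an assumption here.

```python
import math

def softmax_loss(logits, target):
    """Cross-entropy of a softmax over logits against an integer target class."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_norm - logits[target]

def smooth_l1(diff):
    """Smooth L1 on a single coordinate difference (threshold of 1 assumed)."""
    magnitude = abs(diff)
    if magnitude < 1.0:
        return 0.5 * magnitude * magnitude
    return magnitude - 0.5

print(smooth_l1(0.5))   # 0.125 (quadratic region)
print(smooth_l1(-2.0))  # 1.5 (linear region)
```

The quadratic region keeps gradients small near zero, while the linear region limits the influence of large box errors.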

Image Generation

Given the scene layout, we synthesize an image that respects the object positions with an image generator $G$. Here, we adopt a cascaded refinement network which consists of a series of convolutional refinement modules to generate the image. The spatial resolution doubles between the convolutional refinement modules. This allows the generation to proceed in a coarse-to-fine manner. Each module takes two inputs. One is the output from the previous module (the first module takes Gaussian noise), and the other is the scene layout $S^{\text{layout}}$, which is downsampled to the input resolution of the module. These inputs are concatenated channel-wise and passed to a pair of $3\times 3$ convolution layers. The outputs are then upsampled using nearest-neighbor interpolation before being passed to the next module. The output from the last module is passed to two final convolution layers to produce the output image.
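The coarse-to-fine schedule can be sketched without any convolution code. In the snippet below, the starting resolution of 4 is an assumption made for illustration (the text only states that the resolution doubles between modules and, later, that the output is $64\times 64$), and both helper names are ours.

```python
def module_resolutions(start, final):
    """Resolutions seen by successive modules when each step doubles the size."""
    sizes = [start]
    while sizes[-1] < final:
        sizes.append(sizes[-1] * 2)
    if sizes[-1] != final:
        raise ValueError("final must equal start times a power of two")
    return sizes

def upsample_nearest(grid):
    """Double height and width of a 2D list by repeating each value."""
    out = []
    for row in grid:
        wide = [value for value in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out

print(module_resolutions(4, 64))           # [4, 8, 16, 32, 64]
print(upsample_nearest([[1, 2], [3, 4]]))  # each value becomes a 2x2 block
```

Under this assumed starting size, five modules are enough to reach the $64\times 64$ output, with the layout downsampled to each listed resolution in turn.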

Image-level Supervision. In addition to the common pixel reconstruction loss $\mathcal{L}_{\text{pixel}}$, we also adopt a conditional GAN loss, considering that the image is generated based on the objects. In particular, we train the discriminator $D_{i}$ and the generator $G_{i}$ by alternately maximizing $\mathcal{L}_{D_{i}}$ in Eq. (16) and $\mathcal{L}_{G_{i}}$ in Eq. (17):

As shown in Figure 2, we view the object-to-image generation branch as a regularizer. During training, backpropagation from losses (15), (16), and (17) influences the model parameter updates. This image-level supervision can be seen as a corrective model for scene graph generation that improves the performance of object detection. The gradients back-propagated from the object-to-image branch update the parameters of our object detector and the feature refinement module, which is followed by the relation prediction.

Alg. 1 summarizes the entire training procedure.

Experiments

We evaluate our approach on two datasets: VRD and VG. VRD is the most widely used benchmark dataset for visual relationship detection. Compared with VRD, the raw VG contains a large number of noisy labels. In our experiment, we use a cleansed-version VG-MSDN in . Detailed statistics of both datasets are shown in Table 1.

For the external KB, we employ the English subgraph of ConceptNet as our knowledge graph. ConceptNet is a large-scale graph of general knowledge which aims to align its knowledge resources on its core set of 40 relations. A large portion of these relation types can be considered as visual relations, such as spatial co-occurrence (e.g., AtLocation, LocatedNear), visual properties of objects (e.g., HasProperty, PartOf), and actions (e.g., CapableOf, UsedFor).

Implementation Details

As shown in Alg. 1, we train our model in two phases. The initial phase looks only at the object annotations of the training set, ignoring the relationship triplets. For each dataset, we filter the objects according to the category and relation vocabularies in Table 1. We then learn an image-level regularizer that reconstructs the image based on the object labels and bounding boxes. The output size of the image generator is $64\times 64\times 3$, and the real image is resized before being input to the discriminator. We train the regularizer with learning rate $10^{-4}$ and batch size 32. For each mini-batch we first update $G_{i}$, and then update $D_{i}$.

The second phase jointly trains the scene graph generation model and the auxiliary reconstruction branch. We adopt the Faster R-CNN associated with VGG-16 as the backbone. During training, the number of object proposals is 256. For each proposal, we use ROI align pooling to generate object and subgraph features. The subgraph regions are pooled to $5\times 5$ feature maps. The dimension $D$ of the pooled object vector and the subgraph feature map is set to 512. For the knowledge-based refinement module, we set the dimension of word embedding to 300 and initialize it with the GloVe 6B pre-trained word vectors. We keep the top-8 commonsense relationships. The number of hidden units of the fact encoder is set to 300, and the dimension of episodic memory is set to 512. The iteration number $T_{m}$ of DMN update is set to 2. For the relation inference module, we adopt the same bottleneck layer as . All the newly introduced layers are randomly initialized except the auxiliary regularizer. We set $\lambda_{\text{pred}}=2.0$, $\lambda_{\text{cls}}=1.0$, and $\lambda_{\text{reg}}=0.5$ in Eq. (15). The hyperparameter $\lambda_{p}$ in Eq. (17) is set to 1.0. The iteration number $T_{r}$ of the feature refinement is set to 2. We first train RPNs and then jointly train the entire network. The initial learning rate is 0.01, the decay rate is 0.1, and stochastic gradient descent (SGD) is used as the optimizer. We deploy weight decay and dropout to prevent over-fitting.

During testing, the image reconstruction branch is discarded. We set the RPN non-maximum suppression (NMS) threshold to 0.6 and the subgraph clustering threshold to 0.5. We output all the predicates and use the top-1 category as the prediction for objects and relations. Models are evaluated on two tasks: Visual Phrase Detection (PhrDet) and Scene Graph Generation (SGGen). PhrDet is to detect the $\langle$ subject-predicate-object $\rangle$ phrases. SGGen is to detect the objects within the image and recognize their pairwise relationships. Following , the Top-$K$ Recall (denoted as Rec@$K$) is used as the performance metric; it measures the fraction of labeled relationships that are hit in the top $K$ predictions. In our experiments, Rec@50 and Rec@100 are reported. Note that Li et al. and Yang et al. reported results on two more metrics: Predicate Recognition and Phrase Recognition. These two evaluation metrics are based on ground-truth object locations, which is not the case we consider. In our setting, we use detected objects for image reconstruction and scene graph generation. To be consistent with the training, we choose PhrDet and SGGen as the evaluation metrics, which is also more practical.
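The Rec@$K$ metric can be sketched for a single image as follows. This simplified version matches triplets by their labels only; the box-overlap check that a full evaluation of PhrDet or SGGen would also require is left out, and the function name and example triplets are ours.

```python
def recall_at_k(predictions, ground_truth, k):
    """Fraction of ground-truth triplets found among the k best-scored predictions.

    predictions: list of ((subject, predicate, object), score) pairs.
    ground_truth: set of (subject, predicate, object) triplets.
    Matching is by label only; box overlap is ignored in this sketch.
    """
    if not ground_truth:
        return 0.0
    ranked = sorted(predictions, key=lambda item: item[1], reverse=True)[:k]
    top_triplets = {triplet for triplet, _ in ranked}
    hits = sum(1 for triplet in ground_truth if triplet in top_triplets)
    return hits / len(ground_truth)

predictions = [
    (("woman", "on", "snow"), 0.9),
    (("woman", "wear", "hat"), 0.2),
    (("dog", "on", "snow"), 0.5),
]
ground_truth = {("woman", "on", "snow"), ("woman", "wear", "hat")}
print(recall_at_k(predictions, ground_truth, 2))  # 0.5
print(recall_at_k(predictions, ground_truth, 3))  # 1.0
```

Because unlabeled but correct predictions cannot count as hits, recall is less sensitive to missing annotations than precision-based metrics.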

Baseline Approaches for Comparisons

Baseline. This baseline model is the re-implementation of Factorizable Net. We re-train it based on our backbone. Specifically, we use the same RPN model, and jointly train the scene graph generator until convergence.

KB. This model is a KB-enhanced version of the baseline model. External knowledge triples are incorporated in DMN. The explicit knowledge-based reasoning is incorporated in the feature refining procedure.

GAN. This model improves the baseline model by attaching an auxiliary branch that generates the image from objects with GAN. We train this model in two phases. The first phase trains the image reconstruction branch only with the object annotations. Then we refine the model jointly with the scene graph generation model.

KB-GAN. This is our full model containing both KB and GAN. It is initialized with the trained parameters from KB and GAN, and fine-tuned with Alg. 1.

Quantitative Results

In this section, we present our quantitative results and analysis. To verify the effectiveness of our approach and analyze the contribution of each component, we first compare different baselines in Table 2, and investigate the improvement in recognizing objects in Table 3. Then, we conduct a simulation experiment on VRD to investigate the effectiveness of our auxiliary regularizer in Table 4. The comparison of our approach with the state-of-the-art methods is reported in Table 5.

Component Analysis. In our framework, we propose two novel modules: KB-based feature refinement (KB) and auxiliary image generation (GAN). To get a clear sense of how these components affect the final performance, we perform ablation studies in Table 2. The left-most columns in Table 2 indicate whether or not we use KB and GAN in our approach. To further investigate the improvement of our approach on recognizing objects, we also report object detection performance (mAP) in Table 3.

In Table 2, we observe that KB boosts PhrDet and SGGen significantly. This indicates our knowledge-based feature refinement can effectively learn the commonsense knowledge of objects to achieve high recall for the correct relationships. By adding the image-level supervision to the baseline model, the performance is further improved. This improvement demonstrates that the proposed image-level supervision is capable of capturing meaningful context across the objects. These results align with our intuitions discussed in the introduction. With KB and GAN, our model can generate scene graphs with high recall.

Table 3 demonstrates the improvement in recognizing objects. We can see that our full model (KB-GAN) outperforms Faster R-CNN and ViP-CNN in terms of mAP. It is worth noting that the large gain from KB indicates that the introduction of commonsense knowledge substantially contributes to the object detection task.

Investigation on Image-level Supervision. As mentioned above, our image-level supervision can exploit the instances of rare categories. To demonstrate that the introduced image-level supervision helps with this issue, we exaggerate the problem by randomly removing 20% of the object instances as well as their corresponding relationships from the dataset. In Table 4, we can see that when training on such a sub-sampled dataset (with only 80% of the object instances), Rec@50 of the baseline model drops from 25.57 to 15.44 for PhrDet and from 18.16 to 10.94 for SGGen. However, with the help of GAN, Rec@50 of our final model decreases only slightly, from 27.39 to 26.62 for PhrDet and from 20.31 to 19.78 for SGGen.
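To put these numbers on a common scale, the short calculation below converts each pair of Rec@50 values quoted above into a relative drop. It only re-expresses the reported figures; the helper name `relative_drop` is ours.

```python
def relative_drop(before, after):
    """Percentage of the original score lost when going from before to after."""
    return 100.0 * (before - after) / before

# Rec@50 values quoted in the paragraph above (full data -> 80% of instances).
results = {
    ("baseline", "PhrDet"): (25.57, 15.44),
    ("baseline", "SGGen"): (18.16, 10.94),
    ("full model", "PhrDet"): (27.39, 26.62),
    ("full model", "SGGen"): (20.31, 19.78),
}

for (model, task), (before, after) in results.items():
    print(f"{model:10s} {task:6s} {relative_drop(before, after):5.1f}% relative drop")
```

The baseline loses roughly 40% of its Rec@50 on both tasks, whereas the full model loses less than 3%.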

We explain this significant performance improvement as follows. Too many low-frequency categories deteriorate the training gain when only the class labels are used as training targets. With the explicit image-level supervision, the proposed image reconstruction path can utilize the large quantities of instances of rare classes. This image-level supervision idea is generic and can be applied to many potential applications such as object detection.

Comparison with Existing Methods. Table 5 shows the comparison of our approach with the existing methods. We can see that our proposed method outperforms all the existing methods in the recall on both datasets. Compared with these methods, our model recognizes the objects and their relationships not only in the graph domain but also in the image domain.

Qualitative Results

Figure 5 visualizes some examples of our full model. We show the generated scene graph as well as the reconstructed image for each sample. It is clear that our method can generate high-quality relationship predictions in the generated scene graph. Also notable is that our auxiliary output images are reasonable. This demonstrates our model’s capability to generate rich scene graphs by learning with both the external KB and the auxiliary image-level regularizer.

Conclusion

In this work, we have introduced a new model for scene graph generation which includes a novel knowledge-based feature refinement network that effectively propagates contextual information across the graph, and an image-level supervision that regularizes the scene graph generation from the image domain. Our framework outperforms state-of-the-art methods for scene graph generation on the VRD and VG datasets. Our experiments show that it is fruitful to incorporate commonsense knowledge as well as image-level supervision into scene graph generation. Our work shows a promising way to improve high-level image understanding via scene graphs.

Acknowledgments

This work was supported in part by Adobe Research, NTU-IGS, NTU-Alibaba Lab, and NTU ROSE Lab.