Image Generation from Scene Graphs

Justin Johnson, Agrim Gupta, Li Fei-Fei

Introduction

What I cannot create, I do not understand. (Richard Feynman)

The act of creation requires a deep understanding of the thing being created: chefs, novelists, and filmmakers must understand food, writing, and film at a much deeper level than diners, readers, or moviegoers. If our computer vision systems are to truly understand the visual world, they must be able not only to recognize images but also to generate them.

Aside from imparting deep visual understanding, methods for generating realistic images can also be practically useful. In the near term, automatic image generation can aid the work of artists or graphic designers. One day, we might replace image and video search engines with algorithms that generate customized images and videos in response to the individual tastes of each user.

As a step toward these goals, there has been exciting recent progress on text to image synthesis by combining recurrent neural networks and Generative Adversarial Networks to generate images from natural language descriptions.

These methods can give stunning results on limited domains, such as fine-grained descriptions of birds or flowers. However, as shown in Figure 1, leading methods for generating images from sentences struggle with complex sentences containing many objects.

A sentence is a linear structure, with one word following another; however, as shown in Figure 1, the information conveyed by a complex sentence can often be more explicitly represented as a scene graph of objects and their relationships. Scene graphs are a powerful structured representation for both images and language; they have been used for semantic image retrieval and for evaluating and improving image captioning; methods have also been developed for converting sentences to scene graphs and for predicting scene graphs from images.

In this paper we aim to generate complex images with many objects and relationships by conditioning our generation on scene graphs, allowing our model to reason explicitly about objects and their relationships.

With this new task comes new challenges. We must develop a method for processing scene graph inputs; for this we use a graph convolution network which passes information along graph edges. After processing the graph, we must bridge the gap between the symbolic graph-structured input and the two-dimensional image output; to this end we construct a scene layout by predicting bounding boxes and segmentation masks for all objects in the graph. Having predicted a layout, we must generate an image which respects it; for this we use a cascaded refinement network (CRN) which processes the layout at increasing spatial scales. Finally, we must ensure that our generated images are realistic and contain recognizable objects; we therefore train adversarially against a pair of discriminator networks operating on image patches and generated objects. All components of the model are learned jointly in an end-to-end manner.

We experiment on two datasets: Visual Genome , which provides human annotated scene graphs, and COCO-Stuff where we construct synthetic scene graphs from ground-truth object positions. On both datasets we show qualitative results demonstrating our method’s ability to generate complex images which respect the objects and relationships of the input scene graph, and perform comprehensive ablations to validate each component of our model.

Automated evaluation of generative image models is a challenging problem unto itself, so we also evaluate our results with two user studies on Amazon Mechanical Turk. Compared to StackGAN, a leading system for text to image synthesis, users find that our results better match COCO captions in 68% of trials, and contain 59% more recognizable objects.

Related Work

Generative Image Models fall into three recent categories: Generative Adversarial Networks (GANs) jointly learn a generator for synthesizing images and a discriminator classifying images as real or fake; Variational Autoencoders use variational inference to jointly learn an encoder and decoder mapping between images and latent codes; autoregressive approaches model likelihoods by conditioning each pixel on all previous pixels.

Conditional Image Synthesis conditions generation on additional input. GANs can be conditioned on category labels by providing labels as an additional input to both generator and discriminator or by forcing the discriminator to predict the label ; we take the latter approach.

Reed et al. generate images from text using a GAN; Zhang et al. extend this approach to higher resolutions using multistage generation. Related to our approach, Reed et al. generate images conditioned on sentences and keypoints using both GANs and multiscale autoregressive models ; in addition to generating images they also predict locations of unobserved keypoints using a separate generator and discriminator operating on keypoint locations.

Chen and Koltun generate high-resolution images of street scenes from ground-truth semantic segmentation using a cascaded refinement network (CRN) trained with a perceptual feature reconstruction loss ; we use their CRN architecture to generate images from scene layouts.

Related to our layout prediction, Chang et al. have investigated text to 3D scene generation; other approaches to image synthesis include stochastic grammars, probabilistic programming, inverse graphics, neural de-rendering, and generative ConvNets.

Scene Graphs represent scenes as directed graphs, where nodes are objects and edges give relationships between objects. Scene graphs have been used for image retrieval and to evaluate image captioning ; some work converts sentences to scene graphs or predicts grounded scene graphs for images . Most work on scene graphs uses the Visual Genome dataset , which provides human-annotated scene graphs.

Deep Learning on Graphs. Some methods learn embeddings for graph nodes given a single large graph, similar to word2vec, which learns embeddings for words given a text corpus. These differ from our approach, since we must process a new graph on each forward pass.

More closely related to our work are Graph Neural Networks (GNNs) which generalize recursive neural networks to operate on arbitrary graphs. GNNs and related models have been applied to molecular property prediction , program verification , modeling human motion , and premise selection for theorem proving . Some methods operate on graphs in the spectral domain though we do not take this approach.

Method

Our goal is to develop a model which takes as input a scene graph describing objects and their relationships, and which generates a realistic image corresponding to the graph. The primary challenges are threefold: first, we must develop a method for processing the graph-structured input; second, we must ensure that the generated images respect the objects and relationships specified by the graph; third, we must ensure that the synthesized images are realistic.

We convert scene graphs to images with an image generation network $f$ , shown in Figure 2, which inputs a scene graph $G$ and noise $z$ and outputs an image $\hat{I}=f(G,z)$ .

The scene graph $G$ is processed by a graph convolution network which gives embedding vectors for each object; as shown in Figures 2 and 3, each layer of graph convolution mixes information along edges of the graph.

We respect the objects and relationships from $G$ by using the object embedding vectors from the graph convolution network to predict bounding boxes and segmentation masks for each object; these are combined to form a scene layout, shown in the center of Figure 2, which acts as an intermediate between the graph and the image domains.

The output image $\hat{I}$ is generated from the layout using a cascaded refinement network (CRN) , shown in the right half of Figure 2; each of its modules processes the layout at increasing spatial scales, eventually generating the image $\hat{I}$ .

We generate realistic images by training $f$ adversarially against a pair of discriminator networks $D_{img}$ and $D_{obj}$ which encourage the image $\hat{I}$ to both appear realistic and to contain realistic, recognizable objects.

Each of these components is described in more detail below; the supplementary material describes the exact architectures used in our experiments.

Scene Graphs. The input to our model is a scene graph describing objects and relationships between objects. Given a set of object categories $\mathcal{C}$ and a set of relationship categories $\mathcal{R}$ , a scene graph is a tuple $(O,E)$ where $O=\{o_{1},\ldots,o_{n}\}$ is a set of objects with each $o_{i}\in\mathcal{C}$ , and $E\subseteq O\times\mathcal{R}\times O$ is a set of directed edges of the form $(o_{i},r,o_{j})$ where $o_{i},o_{j}\in O$ and $r\in\mathcal{R}$ .
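The tuple $(O,E)$ can be sketched as a small data structure. This is an illustrative sketch only: the class name `SceneGraph`, the `validate` helper, and the example categories are assumptions made here, not names from the paper. Objects are referenced by index so that two instances of the same category remain distinct nodes.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class SceneGraph:
    """A scene graph (O, E): object categories plus directed, labeled edges."""
    objects: List[str]                 # o_1..o_n, each a category from C
    edges: List[Tuple[int, str, int]]  # (subject index, relationship r, object index)

    def validate(self, categories, relationships):
        """Check that every object is in C and every edge lies in O x R x O."""
        n = len(self.objects)
        for o in self.objects:
            if o not in categories:
                raise ValueError(f"unknown object category: {o}")
        for s, r, t in self.edges:
            if not (0 <= s < n and 0 <= t < n):
                raise ValueError(f"edge endpoint out of range: {(s, r, t)}")
            if r not in relationships:
                raise ValueError(f"unknown relationship: {r}")
        return True


# Hypothetical category sets and graph, for illustration only.
C = {"sheep", "grass", "sky"}
R = {"standing on", "above"}
graph = SceneGraph(
    objects=["sheep", "sheep", "grass", "sky"],
    edges=[(0, "standing on", 2), (1, "standing on", 2), (3, "above", 2)],
)
```

Calling `graph.validate(C, R)` returns `True` for this example and raises `ValueError` if a label falls outside $\mathcal{C}$ or $\mathcal{R}$.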

As a first stage of processing, we use a learned embedding layer to convert each node and edge of the graph from a categorical label to a dense vector, analogous to the embedding layer typically used in neural language models.

Graph Convolution Network. In order to process scene graphs in an end-to-end manner, we need a neural network module which can operate natively on graphs. To this end we use a graph convolution network composed of several graph convolution layers.

A traditional 2D convolution layer takes as input a spatial grid of feature vectors and produces as output a new spatial grid of vectors, where each output vector is a function of a local neighborhood of its corresponding input vector; in this way a convolution aggregates information across local neighborhoods of the input. A single convolution layer can operate on inputs of arbitrary shape through the use of weight sharing across all neighborhoods in the input.

Our graph convolution layer performs a similar function: given an input graph with vectors of dimension $D_{in}$ at each node and edge, it computes new vectors of dimension $D_{out}$ for each node and edge. Output vectors are a function of a neighborhood of their corresponding inputs, so that each graph convolution layer propagates information along edges of the graph. A graph convolution layer applies the same function to all edges of the graph, allowing a single layer to operate on graphs of arbitrary shape.

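Each layer uses three functions $g_{s}$ , $g_{p}$ , and $g_{o}$ , which take as input the triple of vectors $(v_{i},v_{r},v_{j})$ for an edge $(o_{i},r,o_{j})$ and output new vectors for the subject $o_{i}$ , the predicate $r$ , and the object $o_{j}$ respectively. To compute the output vectors $v_{r}^{\prime}$ for edges we simply set $v_{r}^{\prime}=g_{p}(v_{i},v_{r},v_{j})$ . Updating object vectors is more complex, since an object may participate in many relationships; as such the output vector $v_{i}^{\prime}$ for an object $o_{i}$ should depend on all vectors $v_{j}$ for objects to which $o_{i}$ is connected via graph edges, as well as the vectors $v_{r}$ for those edges. To this end, for each edge starting at $o_{i}$ we use $g_{s}$ to compute a candidate vector, collecting all such candidates in the set $V_{i}^{s}$ ; we similarly use $g_{o}$ to compute a set of candidate vectors $V_{i}^{o}$ for all edges terminating at $o_{i}$ . Concretely,

$$V_{i}^{s}=\{g_{s}(v_{i},v_{r},v_{j}):(o_{i},r,o_{j})\in E\}$$

$$V_{i}^{o}=\{g_{o}(v_{j},v_{r},v_{i}):(o_{j},r,o_{i})\in E\}$$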

The output vector $v_{i}^{\prime}$ for object $o_{i}$ is then computed as $v_{i}^{\prime}=h(V_{i}^{s}\cup V_{i}^{o})$ where $h$ is a symmetric function which pools an input set of vectors to a single output vector. An example computational graph for a single graph convolution layer is shown in Figure 3.

In our implementation, the functions $g_{s}$ , $g_{p}$ , and $g_{o}$ are implemented using a single network which concatenates its three input vectors, feeds them to a multilayer perceptron (MLP), and computes three output vectors using fully-connected output heads. The pooling function $h$ averages its input vectors and feeds the result to an MLP.
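A single graph convolution layer of this kind can be sketched in plain Python. This is a simplified illustration under several assumptions, not the architecture from the supplementary tables: the shared network is reduced to one hidden layer, biases are omitted, weights are random, and an object with no edges receives a zero vector before pooling. The names `GraphConvLayer`, `linear`, and `random_matrix` are hypothetical.

```python
import random


def linear(weights, x):
    """Multiply a (rows x cols) weight matrix by the vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def relu(x):
    return [max(0.0, xi) for xi in x]


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


class GraphConvLayer:
    """gconv(d_in -> hidden -> d_out) with shared weights across all edges."""

    def __init__(self, d_in, hidden, d_out, seed=0):
        rng = random.Random(seed)
        self.w_hidden = random_matrix(hidden, 3 * d_in, rng)  # shared trunk
        self.w_s = random_matrix(d_out, hidden, rng)          # head for g_s
        self.w_p = random_matrix(d_out, hidden, rng)          # head for g_p
        self.w_o = random_matrix(d_out, hidden, rng)          # head for g_o
        self.w_pool = random_matrix(d_out, d_out, rng)        # layer inside h
        self.d_out = d_out

    def forward(self, obj_vecs, edge_vecs, edges):
        """edges[k] = (i, j): edge k runs from subject i to object j."""
        candidates = [[] for _ in obj_vecs]
        new_edge_vecs = []
        for (i, j), v_r in zip(edges, edge_vecs):
            # Concatenate (v_i, v_r, v_j) and run the shared trunk.
            h = relu(linear(self.w_hidden, obj_vecs[i] + v_r + obj_vecs[j]))
            candidates[i].append(linear(self.w_s, h))  # candidate for the subject
            new_edge_vecs.append(linear(self.w_p, h))  # new edge vector v_r'
            candidates[j].append(linear(self.w_o, h))  # candidate for the object
        new_obj_vecs = []
        for cands in candidates:
            if cands:
                avg = [sum(col) / len(cands) for col in zip(*cands)]
            else:
                avg = [0.0] * self.d_out  # assumption: isolated objects pool to zero
            new_obj_vecs.append(relu(linear(self.w_pool, avg)))
        return new_obj_vecs, new_edge_vecs


layer = GraphConvLayer(d_in=4, hidden=8, d_out=4)
objs = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
rels = [[0.5, 0.5, 0.0, 0.0], [0.0, 0.0, 0.5, 0.5]]
edges = [(0, 2), (1, 2)]
new_objs, new_rels = layer.forward(objs, rels, edges)
```

Because the pooling step averages the candidate set, the object outputs do not depend on the order in which edges are listed, which is the symmetry property required of $h$.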

Scene Layout. Processing the input scene graph with a series of graph convolution layers gives an embedding vector for each object which aggregates information across all objects and relationships in the graph.

In order to generate an image, we must move from the graph domain to the image domain. To this end, we use the object embedding vectors to compute a scene layout which gives the coarse 2D structure of the image to generate; we compute the scene layout by predicting a segmentation mask and bounding box for each object using an object layout network, shown in Figure 4.

The object layout network receives an embedding vector $v_{i}$ of shape $D$ for object $o_{i}$ and passes it to a mask regression network to predict a soft binary mask $\hat{m}_{i}$ of shape $M\times M$ and a box regression network to predict a bounding box $\hat{b}_{i}=(x_{0},y_{0},x_{1},y_{1})$ . The mask regression network consists of several transpose convolutions terminating in a sigmoid nonlinearity so that elements of the mask lie in the range $(0,1)$ ; the box regression network is an MLP.

We multiply the embedding vector $v_{i}$ elementwise with the mask $\hat{m}_{i}$ to give a masked embedding of shape $D\times M\times M$ which is then warped to the position of the bounding box using bilinear interpolation to give an object layout. The scene layout is then the sum of all object layouts.
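The layout computation can be sketched as follows. This is a minimal illustration, assuming box coordinates normalized to the unit square, pixel centers at half-integer positions, and border clamping during interpolation; none of these conventions are stated in the text. Since $v_{i}$ is constant over space, sampling the mask bilinearly and then scaling by $v_{i}$ gives the same result as warping the $D\times M\times M$ masked embedding. All function names are hypothetical.

```python
def bilinear_sample(mask, u, v):
    """Sample an M x M mask at continuous pixel coords (u, v), clamped to the border."""
    m = len(mask)
    u = min(max(u, 0.0), m - 1.0)
    v = min(max(v, 0.0), m - 1.0)
    x0, y0 = int(u), int(v)
    x1, y1 = min(x0 + 1, m - 1), min(y0 + 1, m - 1)
    fx, fy = u - x0, v - y0
    top = mask[y0][x0] * (1 - fx) + mask[y0][x1] * fx
    bottom = mask[y1][x0] * (1 - fx) + mask[y1][x1] * fx
    return top * (1 - fy) + bottom * fy


def object_layout(embedding, mask, box, height, width):
    """Return a D x H x W layout: embedding * mask, placed inside the box."""
    x0, y0, x1, y1 = box
    m = len(mask)
    layout = [[[0.0] * width for _ in range(height)] for _ in embedding]
    if x1 <= x0 or y1 <= y0:
        return layout  # degenerate box contributes nothing
    for row in range(height):
        for col in range(width):
            px, py = (col + 0.5) / width, (row + 0.5) / height
            if not (x0 <= px <= x1 and y0 <= py <= y1):
                continue  # pixel center lies outside the box
            # Map the pixel center into mask coordinates.
            u = (px - x0) / (x1 - x0) * m - 0.5
            v = (py - y0) / (y1 - y0) * m - 0.5
            weight = bilinear_sample(mask, u, v)
            for d, value in enumerate(embedding):
                layout[d][row][col] = value * weight
    return layout


def scene_layout(layouts):
    """Sum object layouts elementwise to form the scene layout."""
    depth, height, width = len(layouts[0]), len(layouts[0][0]), len(layouts[0][0][0])
    return [[[sum(l[d][r][c] for l in layouts) for c in range(width)]
             for r in range(height)] for d in range(depth)]


ones = [[1.0, 1.0], [1.0, 1.0]]
small = object_layout([2.0, -1.0], ones, (0.0, 0.0, 0.5, 0.5), 4, 4)
full = object_layout([1.0, 1.0], ones, (0.0, 0.0, 1.0, 1.0), 4, 4)
scene = scene_layout([small, full])
```

In this toy case the top-left quadrant of `scene` holds the sum of both embeddings, while the rest of the grid holds only the embedding of the full-image object.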

During training we use ground-truth bounding boxes $b_{i}$ to compute the scene layout; at test-time we instead use predicted bounding boxes $\hat{b}_{i}$ .

Cascaded Refinement Network. Given the scene layout, we must synthesize an image that respects the object positions given in the layout. For this task we use a Cascaded Refinement Network (CRN). A CRN consists of a series of convolutional refinement modules, with spatial resolution doubling between modules; this allows generation to proceed in a coarse-to-fine manner.

Each module receives as input both the scene layout (downsampled to the input resolution of the module) and the output from the previous module. These inputs are concatenated channelwise and passed to a pair of $3\times 3$ convolution layers; the output is then upsampled using nearest-neighbor interpolation before being passed to the next module.

The first module takes Gaussian noise $z\sim p_{z}$ as input, and the output from the last module is passed to two final convolution layers to produce the output image.

Discriminators. We generate realistic output images by training the image generation network $f$ adversarially against a pair of discriminator networks $D_{img}$ and $D_{obj}$ .

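A discriminator $D$ attempts to classify its input $x$ as real or fake by maximizing the objective

$$\mathcal{L}_{GAN}=\mathbb{E}_{x\sim p_{real}}\log D(x)+\mathbb{E}_{x\sim p_{fake}}\log(1-D(x))$$

where $x\sim p_{fake}$ are outputs from the generation network $f$ . At the same time, $f$ attempts to generate outputs that fool the discriminator by minimizing $\mathcal{L}_{GAN}$ .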
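As a small numerical illustration, and assuming the standard GAN objective (the expected log-probability assigned to real inputs plus the expected log-probability of rejecting fake inputs), the quantity a discriminator maximizes can be estimated from a batch of scores. The function name and the example scores are hypothetical.

```python
import math


def discriminator_objective(real_scores, fake_scores):
    """Batch estimate of E[log D(x_real)] + E[log(1 - D(x_fake))].

    Each score is D(x), the probability in (0, 1) that x is real.
    """
    real_term = sum(math.log(p) for p in real_scores) / len(real_scores)
    fake_term = sum(math.log(1.0 - p) for p in fake_scores) / len(fake_scores)
    return real_term + fake_term


# A discriminator that separates real from fake scores higher than one at chance.
strong = discriminator_objective([0.9, 0.8], [0.1, 0.2])
chance = discriminator_objective([0.5, 0.5], [0.5, 0.5])
```

The discriminator is updated to increase this value, while the generator is updated to decrease it through the fake term.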

The patch-based image discriminator $D_{img}$ ensures that the overall appearance of generated images is realistic; it classifies a regularly spaced, overlapping set of image patches as real or fake, and is implemented as a fully convolutional network, similar to the discriminator used in .

The object discriminator $D_{obj}$ ensures that each object in the image appears realistic; its input is the pixels of an object, cropped and rescaled to a fixed size using bilinear interpolation. In addition to classifying each object as real or fake, $D_{obj}$ also ensures that each object is recognizable using an auxiliary classifier which predicts the object’s category; both $D_{obj}$ and $f$ attempt to maximize the probability that $D_{obj}$ correctly classifies objects.

Training. We jointly train the generation network $f$ and the discriminators $D_{obj}$ and $D_{img}$ . The generation network is trained to minimize the weighted sum of six losses:

Box loss $\mathcal{L}_{box}=\sum_{i=1}^{n}\|b_{i}-\hat{b}_{i}\|_{1}$ penalizing the $L_{1}$ difference between ground-truth and predicted boxes

Mask loss $\mathcal{L}_{mask}$ penalizing differences between ground-truth and predicted masks with pixelwise cross-entropy; not used for models trained on Visual Genome

Pixel loss $\mathcal{L}_{pix}=\|I-\hat{I}\|_{1}$ penalizing the $L_{1}$ difference between ground-truth and generated images

Image adversarial loss $\mathcal{L}_{GAN}^{img}$ from $D_{img}$ encouraging generated image patches to appear realistic

Object adversarial loss $\mathcal{L}_{GAN}^{obj}$ from $D_{obj}$ encouraging each generated object to look realistic

Auxiliary classifier loss $\mathcal{L}_{AC}^{obj}$ from $D_{obj}$ , ensuring that each generated object can be classified by $D_{obj}$
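The box loss and the weighted combination of the six terms can be sketched as below. The loss weights are not given in this section, so the values of 1.0 are placeholders, and the scalar values for the pixel, adversarial, and classifier terms are made-up numbers standing in for quantities that a real training step would compute. All names are hypothetical.

```python
def l1_distance(a, b):
    """Sum of absolute differences between two equal-length sequences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def box_loss(gt_boxes, pred_boxes):
    """L_box: sum over objects of the L1 distance between (x0, y0, x1, y1) boxes."""
    return sum(l1_distance(g, p) for g, p in zip(gt_boxes, pred_boxes))


def generator_loss(terms, weights):
    """Weighted sum of the named loss terms."""
    missing = set(terms) - set(weights)
    if missing:
        raise KeyError(f"no weight given for: {sorted(missing)}")
    return sum(weights[name] * value for name, value in terms.items())


l_box = box_loss([(0.0, 0.0, 1.0, 1.0)], [(0.1, 0.0, 0.9, 1.0)])
terms = {
    "box": l_box,
    "mask": 0.0,     # omitted (zero) for models trained on Visual Genome
    "pix": 0.5,      # placeholder value
    "gan_img": 0.7,  # placeholder value
    "gan_obj": 0.6,  # placeholder value
    "ac_obj": 1.2,   # placeholder value
}
weights = {name: 1.0 for name in terms}  # placeholder weights, not from the paper
total = generator_loss(terms, weights)
```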

Implementation Details. We augment all scene graphs with a special image object, and add special in image relationships connecting each true object with the image object; this ensures that all scene graphs are connected.
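The augmentation step can be sketched as follows, together with a check that the result is connected. The labels `__image__` and `__in_image__` and the function names are assumptions chosen for illustration; edges are `(subject index, relationship, object index)` triples.

```python
def add_image_object(objects, edges):
    """Append a special image node and link every true object to it."""
    objects = list(objects) + ["__image__"]
    image_idx = len(objects) - 1
    extra = [(i, "__in_image__", image_idx) for i in range(image_idx)]
    return objects, list(edges) + extra


def is_connected(num_objects, edges):
    """Depth-first search over the graph, ignoring edge direction."""
    if num_objects == 0:
        return True
    neighbors = {i: set() for i in range(num_objects)}
    for s, _, t in edges:
        neighbors[s].add(t)
        neighbors[t].add(s)
    seen, stack = {0}, [0]
    while stack:
        node = stack.pop()
        for nxt in neighbors[node] - seen:
            seen.add(nxt)
            stack.append(nxt)
    return len(seen) == num_objects


# The tree is not related to anything, so the original graph is disconnected.
objects = ["man", "horse", "tree"]
edges = [(0, "riding", 1)]
aug_objects, aug_edges = add_image_object(objects, edges)
```

Connectivity matters because graph convolution only moves information along edges; without the image node, the isolated tree here could never exchange information with the man or the horse.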

We train all models using Adam with learning rate $10^{-4}$ and batch size 32 for 1 million iterations; training takes about 3 days on a single Tesla P100. For each minibatch we first update $f$ , then update $D_{img}$ and $D_{obj}$ .

We use ReLU for graph convolution; the CRN and discriminators use LeakyReLU and batch normalization. Full details about our architecture can be found in the supplementary material, and code will be made publicly available.

Experiments

We train our model to generate $64\times 64$ images on the Visual Genome and COCO-Stuff datasets. In our experiments we aim to show that our method generates images of complex scenes which respect the objects and relationships of the input scene graph.

COCO. We perform experiments on the 2017 COCO-Stuff dataset , which augments a subset of the COCO dataset with additional stuff categories. The dataset annotates 40K train and 5K val images with bounding boxes and segmentation masks for 80 thing categories (people, cars, etc.) and 91 stuff categories (sky, grass, etc.).

We use these annotations to construct synthetic scene graphs based on the 2D image coordinates of the objects, using six mutually exclusive geometric relationships: left of, right of, above, below, inside, and surrounding.
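One way to assign one of the six relationships to an ordered pair of boxes is sketched below. The exact decision rules are not given in this section, so the rules here are an assumption: strict containment decides inside and surrounding, and otherwise the dominant axis of the offset between box centers decides the direction. Boxes are `(x0, y0, x1, y1)` with the y axis growing downward.

```python
def center(box):
    return (box[0] + box[2]) / 2.0, (box[1] + box[3]) / 2.0


def contains(outer, inner):
    """True if inner lies strictly within outer on all four sides."""
    return (outer[0] < inner[0] and outer[1] < inner[1]
            and outer[2] > inner[2] and outer[3] > inner[3])


def relationship(subject_box, object_box):
    """Return the relationship r such that (subject, r, object) holds."""
    if contains(object_box, subject_box):
        return "inside"
    if contains(subject_box, object_box):
        return "surrounding"
    sx, sy = center(subject_box)
    ox, oy = center(object_box)
    dx, dy = sx - ox, sy - oy
    if abs(dx) >= abs(dy):
        return "left of" if dx < 0 else "right of"
    return "above" if dy < 0 else "below"


r1 = relationship((0.0, 0.0, 0.2, 0.2), (0.6, 0.0, 0.8, 0.2))
r2 = relationship((0.4, 0.4, 0.6, 0.6), (0.0, 0.0, 1.0, 1.0))
r3 = relationship((0.4, 0.0, 0.6, 0.2), (0.4, 0.6, 0.6, 0.8))
```

Because exactly one branch is taken for each ordered pair, the six relationships are mutually exclusive under these rules.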

We ignore objects covering less than 2% of the image, and use images with 3 to 8 objects; we divide the COCO-Stuff 2017 val set into our own val and test sets, leaving us with 24,972 train, 1024 val, and 2048 test images.
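The filtering rule can be sketched as below, assuming normalized box coordinates, box area as the measure of coverage, and an object count taken after small objects are dropped. The text does not specify whether coverage is measured by box or by mask, nor when the count is taken, so these are assumptions; the function names are hypothetical.

```python
def box_area(box):
    """Area of a normalized (x0, y0, x1, y1) box as a fraction of the image."""
    return max(0.0, box[2] - box[0]) * max(0.0, box[3] - box[1])


def keep_image(boxes, min_area=0.02, min_objects=3, max_objects=8):
    """Drop small objects; return the kept boxes, or None if the image is rejected."""
    kept = [b for b in boxes if box_area(b) >= min_area]
    if min_objects <= len(kept) <= max_objects:
        return kept
    return None


boxes = [
    (0.0, 0.0, 0.5, 0.5),
    (0.5, 0.0, 1.0, 0.5),
    (0.0, 0.5, 1.0, 1.0),
    (0.0, 0.0, 0.1, 0.1),  # covers 1% of the image, so it is ignored
]
kept = keep_image(boxes)
rejected = keep_image(boxes[:2])
```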

Visual Genome. We experiment on Visual Genome version 1.4 (VG) which comprises 108,077 images annotated with scene graphs. We divide the data into 80% train, 10% val, and 10% test; we use object and relationship categories occurring at least 2000 and 500 times respectively in the train set, leaving 178 object and 45 relationship types.

We ignore small objects, and use images with between 3 and 30 objects and at least one relationship; this leaves us with 62,565 train, 5,506 val, and 5,088 test images with an average of ten objects and five relationships per image.

Visual Genome does not provide segmentation masks, so we omit the mask prediction loss for models trained on VG.

Qualitative Results

Figure 5 shows example scene graphs from the Visual Genome and COCO test sets and generated images using our method, as well as predicted object bounding boxes and segmentation masks.

From these examples it is clear that our method can generate scenes with multiple objects, and even multiple instances of the same object type: for example Figure 5 (a) shows two sheep, (d) shows two busses, (g) contains three people, and (i) shows two cars.

These examples also show that our method generates images which respect the relationships of the input graph; for example in (i) we see one broccoli left of a second broccoli, with a carrot below the second broccoli; in (j) the man is riding the horse, and both the man and the horse have legs which have been properly positioned.

Figure 5 also shows examples of images generated by our method using ground-truth rather than predicted object layouts. In some cases we see that our predicted layouts can vary significantly from the ground-truth object layout. For example in (k) the graph does not specify the position of the bird and our method renders it standing on the ground, but in the ground-truth layout the bird is flying in the sky. Our model is sometimes bottlenecked by layout prediction, such as (n) where using the ground-truth rather than predicted layout significantly improves the image quality.

In Figure 6 we demonstrate our model’s ability to generate complex images by starting with simple graphs on the left and progressively building up to more complex graphs. From this example we can see that object positions are influenced by the relationships in the graph: in the top sequence adding the relationship car below kite causes the car to shift to the right and the kite to shift to the left so that the relationship is respected. In the bottom sequence, adding the relationship boat on grass causes the boat’s position to shift.

Ablation Study

We demonstrate the necessity of all components of our model by comparing the image quality of several ablated versions of our model, shown in Table 1; see supplementary material for example images from ablated models.

No gconv omits graph convolution, so boxes and masks are predicted from initial object embedding vectors. It cannot reason jointly about the presence of different objects, and can only predict one box and mask per category.

No relationships uses graph convolution layers but ignores all relationships from the input scene graph except for trivial in image relationships; graph convolution allows this model to reason jointly about objects. Its poor performance demonstrates the utility of the scene graph relationships.

No discriminators omits both $D_{img}$ and $D_{obj}$ , relying on the pixel regression loss $\mathcal{L}_{pix}$ to guide the generation network. It tends to produce overly smoothed images.

No $\mathbf{D_{obj}}$ and No $\mathbf{D_{img}}$ omit one of the discriminators. On both datasets, using any discriminator leads to significant improvements over models trained with $\mathcal{L}_{pix}$ alone. On COCO the two discriminators are complementary, and combining them in our full model leads to large improvements. On VG, omitting $D_{img}$ does not degrade performance.

In addition to ablations, we also compare with two GT Layout versions of our model which omit the $\mathcal{L}_{box}$ and $\mathcal{L}_{mask}$ losses, and use ground-truth bounding boxes during both training and testing; on COCO they also use ground-truth segmentation masks, similar to Chen and Koltun . These methods give an upper bound to our model’s performance in the case of perfect layout prediction.

Omitting graph convolution degrades performance even when using ground-truth layouts, suggesting that scene graph relationships and graph convolution have benefits beyond simply predicting object positions.

Object Localization

In addition to looking at images, we can also inspect the bounding boxes predicted by our model. One measure of box quality is high agreement between predicted and ground-truth boxes; in Table 2 we show the object recall of our model at two intersection-over-union thresholds.

Another measure for boxes is variety: predicted boxes for objects should vary in response to the other objects and relationships in the graph. Table 2 shows the mean per-category standard deviations of box position and area.
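Both box measures can be sketched in a few lines. The thresholds used in the example (0.3 and 0.5) and the use of the population standard deviation are assumptions, since this section does not state them; the function names are hypothetical.

```python
import statistics
from collections import defaultdict


def area(box):
    return max(0.0, box[2] - box[0]) * max(0.0, box[3] - box[1])


def iou(a, b):
    """Intersection-over-union of two (x0, y0, x1, y1) boxes."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0


def recall_at(gt_boxes, pred_boxes, threshold):
    """Fraction of objects whose predicted box overlaps its ground truth enough."""
    hits = sum(1 for g, p in zip(gt_boxes, pred_boxes) if iou(g, p) >= threshold)
    return hits / len(gt_boxes)


def mean_per_category_std(categories, values):
    """Average over categories of the standard deviation of a box statistic."""
    by_category = defaultdict(list)
    for c, v in zip(categories, values):
        by_category[c].append(v)
    return statistics.mean([statistics.pstdev(vs) for vs in by_category.values()])


gt = [(0.0, 0.0, 1.0, 1.0), (0.0, 0.0, 1.0, 1.0)]
pred = [(0.0, 0.0, 1.0, 1.0), (0.5, 0.0, 1.5, 1.0)]  # IoUs: 1 and 1/3
recall_low = recall_at(gt, pred, 0.3)
recall_high = recall_at(gt, pred, 0.5)

# A model that predicts one fixed box per category has zero variety.
fixed = mean_per_category_std(["cat", "cat", "dog"], [0.2, 0.2, 0.7])
```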

Without graph convolution, our model can only learn to predict a single bounding box per object category. This model achieves nontrivial object recall, but has no variety in its predicted boxes, as $\sigma_{x}=\sigma_{area}=0$ .

Using graph convolution without relationships, our model can jointly reason about objects when predicting bounding boxes; this leads to improved variety in its predictions. Without relationships, this model’s predicted boxes have less agreement with ground-truth box positions.

Our full model with graph convolution and relationships achieves both variety and high agreement with ground-truth boxes, indicating that it can use the relationships of the graph to help localize objects with greater fidelity.

User Studies

Automatic metrics such as Inception scores and box statistics give a coarse measure of image quality; the true measure of success is human judgement of the generated images. For this reason we performed two user studies on Mechanical Turk to evaluate our results.

We are unaware of any previous end-to-end methods for generating images from scene graphs, so we compare our method with StackGAN, a state-of-the-art method for generating images from sentence descriptions.

Despite the different input modalities between our method and StackGAN, we can compare the two on COCO, which in addition to object annotations also provides captions for each image. We use our method to generate images from synthetic scene graphs built from COCO object annotations, and StackGAN to generate images from COCO captions for the same images; for StackGAN we use the pretrained COCO model provided by the authors at https://github.com/hanzhanggit/StackGAN-Pytorch. Though the methods receive different inputs, they should generate similar images due to the correspondence between COCO captions and objects.

For user studies we downsample StackGAN images to $64\times 64$ to compensate for differing resolutions; we repeat all trials with three workers and randomize order in all trials.

Caption Matching. We measure semantic interpretability by showing users a COCO caption, an image generated by StackGAN from that caption, and an image generated by our method from a scene graph built from the COCO objects corresponding to the caption. We ask users to select the image that better matches the caption. An example image pair and results are shown in Figure 7.

This experiment is biased toward StackGAN, since the caption may contain information not captured by the scene graph. Even so, a majority of workers preferred the result from our method in 67.6% of image pairs, demonstrating that compared to StackGAN our method more frequently generates complex, semantically meaningful images.

Object Recall. This experiment measures the number of recognizable objects in each method’s images. In each trial we show an image from one method and a list of COCO objects and ask users to identify which objects appear in the image. An example and results are shown in Figure 8.

We compute the fraction of objects that a majority of users believed were present, dividing the results into things and stuff. Both methods achieve higher recall for stuff than things, and our method achieves significantly higher object recall, with 65% and 61% relative improvements for thing and stuff recall respectively.
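The recall computation and the relative improvement can be sketched as follows. The vote data is invented for illustration and does not reproduce any result from the study; the function names are hypothetical.

```python
def majority_recall(votes_per_object):
    """Fraction of objects that a strict majority of workers marked as present.

    votes_per_object holds one list of booleans per ground-truth object,
    with one entry per worker.
    """
    recognized = sum(1 for votes in votes_per_object if 2 * sum(votes) > len(votes))
    return recognized / len(votes_per_object)


def relative_improvement(ours, baseline):
    """Improvement of ours over baseline, as a fraction of the baseline."""
    return (ours - baseline) / baseline


# Three workers per trial, two objects: only the first is recognized by a majority.
example_recall = majority_recall([[True, True, False], [False, False, True]])
example_gain = relative_improvement(0.5, 0.25)
```

With these toy numbers the recall is 0.5 and the relative improvement is 1.0, that is, a 100% gain over the baseline.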

This experiment is biased toward our method since the scene graph may contain objects not mentioned in the caption, but it demonstrates that compared to StackGAN, our method produces images with more recognizable objects.

Conclusion

In this paper we have developed an end-to-end method for generating images from scene graphs. Compared to leading methods which generate images from text descriptions, generating images from structured scene graphs rather than unstructured text allows our method to reason explicitly about objects and relationships, and generate complex images with many recognizable objects.

Acknowledgments We thank Shyamal Buch, Christopher Choy, De-An Huang, and Ranjay Krishna for helpful comments and suggestions.


Appendix A Network Architecture

Here we describe the exact network architectures for all components of our model.

As described in Section 3 of the main paper, we process the input scene graph with a graph convolution network composed of several graph convolution layers.

A graph convolution layer accepts as input a vector of dimension $D_{in}$ for each node and edge in the graph, and computes new vectors of dimension $D_{out}$ for each node and edge. A single graph convolution layer can be applied to graphs of any size or shape due to weight sharing. A single graph convolution layer proceeds in two stages.

As a second stage of processing, for each object in the scene graph we collect all of its candidate vectors and process them with a symmetric pooling function $h$ which converts the set of candidate vectors into a single vector of dimension $D_{out}$ . Concretely, for object $o_{i}$ in the scene graph $G$ , let $V_{i}^{s}=\{g_{s}(v_{i},v_{r},v_{j}):(o_{i},r,o_{j})\in G\}$ be the set of candidate vectors for $o_{i}$ from relationships where $o_{i}$ appears as the subject, and let $V_{i}^{o}=\{g_{o}(v_{j},v_{r},v_{i}):(o_{j},r,o_{i})\in G\}$ be the set of candidate vectors for $o_{i}$ from relationships where $o_{i}$ appears as the object of the relationship. The pooling function $h$ takes as input the two sets of vectors $V_{i}^{s}$ and $V_{i}^{o}$ , averages them, and feeds the result to an MLP to compute the output vector $v_{i}^{\prime}$ for object $o_{i}$ from the graph convolution layer. The exact architecture of the network we use for $h$ is shown in Table 4.

Overall a graph convolution layer has three hyperparameters defining its size: the input dimension $D_{in}$ , the hidden dimension $H$ , and the output dimension $D_{out}$ . We can therefore specify a graph convolution layer with the notation gconv( $D_{in}\to H\to D_{out}$ ).

A.2 Graph Convolution Network

The input scene graph is processed by a graph convolution network, the exact architecture of which is shown in Table 5. Our network first embeds the objects and relationships of the graph with embedding layers to produce vectors of dimension $D_{in}=128$ ; we then use five layers of graph convolution with $D_{in}=D_{out}=128$ and $H=512$ .

A.3 Box Regression Network

We predict bounding boxes for images using a box regression network. The inputs to the box regression network are the final embedding vectors for objects produced by the graph convolution network. The output from the box regression network is a predicted bounding box for the object, parameterized as $(x_{0},y_{0},x_{1},y_{1})$ where $x_{0},x_{1}$ are the left and right coordinates of the box and $y_{0},y_{1}$ are the top and bottom coordinates of the box; all box coordinates are normalized to be in the range $[0,1]$ . The architecture of the box regression network is shown in Table 6.

A.4 Mask Regression Network

We predict segmentation masks for images using a mask regression network. The inputs to the mask regression network are the final embedding vectors for objects from the graph convolution network, and the output from the mask regression network is an $M\times M$ segmentation mask with all elements in the range $(0,1)$ . The mask regression network is composed of a sequence of upsampling and convolution layers, terminating in a sigmoid nonlinearity; its exact architecture is shown in Table 7.

The main text of the paper states that the mask regression network uses transpose convolution, but in fact it uses upsampling and stride-1 convolutions as shown in Table 7. This error will be corrected in the camera-ready version of the paper.

A.5 Scene Layout

The final embedding vectors for objects from the graph convolution network are combined with the predicted bounding boxes and segmentation masks for objects to give a scene layout. The conversion from vectors, masks, and boxes to scene layouts does not have any learnable parameters.

The scene layout has shape $D\times H\times W$ where $D=128$ is the dimension of embedding vectors for objects from the graph convolution network and $H\times W=64\times 64$ is the output resolution at which images will be generated.

A.6 Cascaded Refinement Network

The scene layout is converted to an image using a Cascaded Refinement Network (CRN) consisting of a number of Cascaded Refinement Modules (CRMs).

Each CRM receives as input the scene layout of shape $D\times H\times W=128\times 64\times 64$ and the previous feature map, and outputs a new feature map twice the spatial size of the input feature map. Internally each CRM upsamples the input feature map by a factor of 2, and downsamples the layout using average pooling to match the size of the upsampled feature map; the two are concatenated and processed with two convolution layers. A CRM taking input of shape $C_{in}\times H_{in}\times W_{in}$ and producing an output of shape $C_{out}\times H_{out}\times W_{out}$ (with $H_{out}=2H_{in}$ and $W_{out}=2W_{in}$ ) is denoted as CRM( $H_{in}\times W_{in},\;C_{in}\to C_{out}$ ). The exact architecture of our CRMs is shown in Table 8.

Our Cascaded Refinement Network consists of five Cascaded Refinement Modules. The input to the first module is Gaussian noise of shape $32\times 2\times 2$ and the output from the final module is processed with two final convolution layers to produce the output image. The architecture of the CRN is shown in Table 9.
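The spatial sizes implied by this design can be checked with a few lines of arithmetic. The sketch below tracks only spatial resolution; channel counts are given in Table 9 and are deliberately not reproduced here. The function name and dictionary keys are hypothetical.

```python
def crn_schedule(noise_size=2, num_modules=5, layout_size=64):
    """List the input/output spatial size of each CRM and the layout pooling factor."""
    schedule = []
    size = noise_size
    for i in range(num_modules):
        out_size = size * 2  # each CRM doubles the spatial size
        schedule.append({
            "module": i + 1,
            "in_size": size,
            "out_size": out_size,
            # Factor by which the layout is average-pooled to match out_size.
            "layout_pool": layout_size // out_size,
        })
        size = out_size
    return schedule


for step in crn_schedule():
    print(step)
```

Starting from $2\times 2$ noise, five doublings give $4, 8, 16, 32, 64$, so the final module outputs a $64\times 64$ feature map and sees the layout at full resolution. If the noise size were kept at $2\times 2$ (an assumption), a sixth module would reach $128\times 128$.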

A.7 Batch Normalization in the Generator

Most implementations of batch normalization operate in two modes. In train mode, minibatches are normalized using the empirical mean and variance of features; in eval mode, running averages of the feature means and variances are used to normalize minibatches instead. We found that training models in train mode and running them in eval mode at test-time led to significant image artifacts. To overcome this limitation while still retaining the optimization benefits that batch normalization provides, we train our models for 100K iterations using batch normalization in train mode, then continue training for an additional 900K iterations with batch normalization in eval mode.
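The two modes and the training schedule can be illustrated with a toy batch normalization layer over scalar features. This is a simplified sketch under stated assumptions: there is no learned scale or shift, the biased variance is used for the running estimate, and the class name, `momentum` value, and `generator_bn_train_mode` helper are hypothetical rather than taken from the paper's code.

```python
import math


class ToyBatchNorm:
    """Minimal batch normalization over a batch of scalar features (no affine)."""

    def __init__(self, momentum=0.1, eps=1e-5):
        self.momentum = momentum
        self.eps = eps
        self.running_mean = 0.0
        self.running_var = 1.0

    def __call__(self, batch, train_mode):
        if train_mode:
            # Train mode: normalize with the statistics of this minibatch
            # and update the running averages.
            mean = sum(batch) / len(batch)
            var = sum((v - mean) ** 2 for v in batch) / len(batch)
            m = self.momentum
            self.running_mean = (1 - m) * self.running_mean + m * mean
            self.running_var = (1 - m) * self.running_var + m * var
        else:
            # Eval mode: normalize with the stored running averages,
            # which are left unchanged.
            mean, var = self.running_mean, self.running_var
        return [(v - mean) / math.sqrt(var + self.eps) for v in batch]


def generator_bn_train_mode(iteration, switch_at=100_000):
    """True for the first 100K iterations, False for the remaining 900K."""
    return iteration < switch_at
```

In a training loop, the generator's normalization layers would be called with `train_mode=generator_bn_train_mode(t)` at iteration `t`, so that the later phase of training sees the same normalization behavior as test-time.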

Since discriminators are not used at test-time, batch normalization in the discriminators is always used in train mode.

A.8 Object Discriminator

Our object discriminator $D_{obj}$ takes as input image pixels corresponding to objects in real or generated images; objects are cropped using their bounding boxes to a spatial size of $32\times 32$ using differentiable bilinear interpolation. The object discriminator serves two roles: it classifies objects as real or fake, and it uses an auxiliary classifier to predict the category of each object. The exact architecture of our object discriminator is shown in Table 10.
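The cropping step can be sketched as a bilinear resampling of the box region onto a fixed grid. The version below works on a single-channel image stored as nested lists; the function name, the pixel-coordinate box convention, and the half-pixel sampling offsets are assumptions for illustration, not details taken from the paper.

```python
def crop_bilinear(image, box, out_size=32):
    """Resample box (x0, y0, x1, y1), in pixel coordinates, to out_size x out_size.

    image is an H x W nested list (one channel, for brevity).
    """
    H, W = len(image), len(image[0])
    x0, y0, x1, y1 = box
    out = []
    for i in range(out_size):
        # Vertical sampling position for output row i, clamped to the image.
        sy = y0 + (y1 - y0) * (i + 0.5) / out_size - 0.5
        sy = min(max(sy, 0.0), H - 1)
        ya = int(sy)
        yb = min(ya + 1, H - 1)
        wy = sy - ya
        row = []
        for j in range(out_size):
            # Horizontal sampling position for output column j.
            sx = x0 + (x1 - x0) * (j + 0.5) / out_size - 0.5
            sx = min(max(sx, 0.0), W - 1)
            xa = int(sx)
            xb = min(xa + 1, W - 1)
            wx = sx - xa
            # Blend the four neighboring pixels.
            top = image[ya][xa] * (1 - wx) + image[ya][xb] * wx
            bottom = image[yb][xa] * (1 - wx) + image[yb][xb] * wx
            row.append(top * (1 - wy) + bottom * wy)
        out.append(row)
    return out
```

Each output value is a weighted sum of input pixels, so when the same computation is written in an automatic differentiation framework, gradients from the discriminator can flow back to the generated image.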

A.9 Image Discriminator

Our image discriminator $D_{img}$ takes as input a real or generated image, and classifies an overlapping grid of $8\times 8$ image patches from its input image as real or fake. The exact architecture of our image discriminator is shown in Table 11.

A.10 Higher Image Resolutions

We performed preliminary experiments with a version of our model that produces $128\times 128$ images rather than $64\times 64$ images. For these models we compute the scene layout at $128\times 128$ rather than at $64\times 64$, and we add an extra Cascaded Refinement Module to our Cascaded Refinement Network. We also add one additional convolutional layer to both $D_{obj}$ and $D_{img}$, and $D_{obj}$ receives a $64\times 64$ crop of objects rather than a $32\times 32$ crop. During training we reduce the batch size from 32 to 24.

The images in Figure 6 from the main paper were generated from a version of our model trained to produce $128\times 128$ images from Visual Genome.

Appendix B Image Loss Functions

In Figure 9 we show additional qualitative results from our model trained on COCO, comparing the results from different ablated versions of our model.

Omitting the discriminators from the model (L1 only) tends to produce images that are overly smoothed. Without the object discriminator (No $D_{obj}$ ) objects tend to be less recognizable, and without the image discriminator (No $D_{img}$ ) the generated images tend to appear less realistic overall, with low-level artifacts. Our model trained to use ground-truth layouts rather than predicting its own layouts (GT Layout) tends to produce higher-quality images, but requires both bounding-box and segmentation mask annotations at test-time.

The bottom row of Figure 9 also shows a typical failure case, where all models struggle to synthesize a realistic image from a complex scene graph for an indoor scene.

Appendix C User Study

As discussed in Section 4.5 of the main paper, we perform two user studies on Amazon Mechanical Turk to compare the perceptual quality of images generated from our method with those generated using StackGAN.

In the first user study, we show users an image generated from a COCO caption using StackGAN, and an image generated using our method from a scene graph built from the COCO object annotations corresponding to the caption. We ask users to select the image that better matches the caption. In each trial of this user study the order of our image and the image from StackGAN is randomized.

In the second user study, we again show users images generated using both methods, and we ask users to select the COCO objects that are visible in the image. In this experiment, if a single image contains multiple instances of the same object category then we only ask about its presence once. In each Mechanical Turk HIT users see an equal number of results from StackGAN and our method, and the order in which they are presented is randomized.

For both studies we use 1024 images from each method generated from COCO val annotations. All images are seen by three workers, and we report all results using majority opinions.
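Aggregating three worker responses by majority opinion can be sketched as below for the first study, where each worker picks one of two images. The function names and the `"ours"` / `"stackgan"` labels are hypothetical, and the votes in the usage example are made up purely to show the call pattern.

```python
from collections import Counter


def majority_opinion(votes):
    """Return the answer chosen by most workers.

    With three workers and two options, a tie cannot occur.
    """
    return Counter(votes).most_common(1)[0][0]


def preference_rate(all_votes, method="ours"):
    """Fraction of trials whose majority opinion prefers the given method."""
    winners = [majority_opinion(votes) for votes in all_votes]
    return sum(w == method for w in winners) / len(winners)


# Illustrative votes only, not data from the study.
example = [
    ["ours", "stackgan", "ours"],
    ["stackgan", "stackgan", "ours"],
]
print(preference_rate(example))
```

For the second study the same majority rule would be applied per image and per object category, to decide whether an object counts as visible.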

StackGAN produces $256\times 256$ images, but our method produces $64\times 64$ images. To prevent the differing image resolution from affecting worker opinion, we downsample StackGAN results to $64\times 64$ using bicubic interpolation before presenting them to users.