Inferring Semantic Layout for Hierarchical Text-to-Image Synthesis
Seunghoon Hong, Dingdong Yang, Jongwook Choi, Honglak Lee
Introduction
Generating images from text descriptions has been an active research topic in computer vision. By allowing users to describe visual concepts in natural language, it provides a natural and flexible interface for conditioning image generation. Recently, approaches based on conditional Generative Adversarial Networks (GANs) have shown promising results on the text-to-image synthesis task. Conditioning both the generator and the discriminator on text, these approaches are able to generate realistic images that are both diverse and relevant to the input text. Building on the conditional GAN framework, recent approaches further improve the prediction quality by generating high-resolution images or augmenting text information.
However, the success of existing approaches has been mainly limited to simple datasets such as birds and flowers, while generation of complicated, real-world images such as those in MS-COCO remains an open challenge. As illustrated in Figure 1, generating an image from a general sentence such as “people riding on elephants that are walking through a river” requires reasoning about multiple visual concepts, such as object categories (people and elephants), spatial configurations of objects (riding), scene context (walking through a river), etc., which is much more complicated than generating a single, large object as in simpler datasets. Existing approaches have not been successful in generating reasonable images for such complex text descriptions because of the difficulty of learning a direct text-to-pixel mapping from general images.
Instead of learning a direct mapping from text to image, we propose an alternative approach that constructs semantic layout as an intermediate representation between text and image. Semantic layout defines a structure of scene based on object instances and provides fine-grained information of the scene, such as the number of objects, object category, location, size, shape, etc. (Figure 1). By introducing a mechanism that explicitly aligns the semantic structure of an image to text, the proposed method can generate complicated images that match complex text descriptions. In addition, conditioning the image generation on semantic structure allows our model to generate semantically more meaningful images that are easy to recognize and interpret.
Our model for hierarchical text-to-image synthesis consists of two parts: the layout generator that constructs a semantic label map from a text description, and the image generator that converts the estimated layout to an image using the text. Since learning a direct mapping from text to fine-grained semantic layout is still challenging, we further decompose the task into two manageable subtasks: we first estimate the bounding box layout of an image using the box generator, and then refine the shape of each object inside the box by the shape generator. The generated layout is then used to guide the image generator for pixel-level synthesis. The box generator, shape generator and image generator are implemented by independent neural networks, and trained in parallel with corresponding supervisions.
Generating semantic layout not only improves quality of text-to-image synthesis, but also provides a number of potential benefits. First, the semantic layout provides instance-wise annotations on generated images, which can be directly exploited for automated scene parsing and object retrieval. Second, it offers an interactive interface for controlling image generation process; users can modify the semantic layout to generate a desired image by removing/adding objects, changing size and location of objects, etc.
The contributions of this paper are as follows:
We propose a novel approach for synthesizing images from complicated text descriptions. Our model explicitly constructs semantic layout from the text description, and guides image generation using the inferred semantic layout.
By conditioning image generation on explicit layout prediction, our method is able to generate images that are semantically meaningful and well-aligned with input descriptions.
We conduct extensive quantitative and qualitative evaluations on the challenging MS-COCO dataset, and demonstrate substantial improvement in generation quality over existing works.
The rest of the paper is organized as follows. We briefly review related work in Section 2, and provide an overview of the proposed approach in Section 3. Our models for layout and image generation are introduced in Sections 4 and 5, respectively. We discuss the experimental results on the MS-COCO dataset in Section 6.
Related Work
Generating images from text descriptions has recently drawn a lot of attention from the research community. Formulating the task as a conditional image generation problem, various approaches have been proposed based on Variational Auto-Encoders (VAE), auto-regressive models, optimization techniques, etc. Recently, approaches based on conditional Generative Adversarial Networks (GANs) have shown promising results in text-to-image synthesis. Reed et al. proposed to learn both generator and discriminator conditioned on text embedding. Zhang et al. improved the image quality by increasing image resolution with a two-stage GAN. Other approaches include improving conditional generation by augmenting text data with synthesized captions, or adding conditions on class labels. Although these approaches have demonstrated impressive generation results on datasets of specific categories (e.g., birds and flowers), the perceptual quality of generation tends to substantially degrade on datasets with complicated images (e.g., MS-COCO). We investigate a way to improve text-to-image synthesis on general images, by conditioning generation on inferred semantic layout.
The problem of generating images from pixel-wise semantic labels has been explored recently. In these approaches, the task of image generation is formulated as translating semantic labels to pixels. Isola et al. proposed a pixel-to-pixel translation network that converts dense pixel-wise labels to an image, and Chen et al. proposed a cascaded refinement network that generates high-resolution output from dense semantic labels. Karacan et al. employed both dense layout and attribute vectors for image generation using a conditional GAN. Notably, Reed et al. utilized sparse label maps like our method. Unlike previous approaches that require ground-truth layouts for generation, our method infers the semantic layout, and thus is more generally applicable to various generation tasks. Note that our main contribution is complementary to these approaches, and we can integrate existing segmentation-to-pixel generation methods to generate an image conditioned on a layout inferred by our method.
The idea of inferring scene structure for image generation is not new, as it has been explored by some recent works in several domains. For example, Wang et al. proposed to infer a surface normal map as an intermediate structure to generate indoor scene images, and Villegas et al. predicted human joints for future frame prediction. The most relevant work to our method is Reed et al., which predicted local key-points of birds or humans for text-to-image synthesis. Contrary to the previous approaches that predict such specific types of structure for image generation, our proposed method aims to predict semantic label maps, which are a general representation of natural images.
Overview
The overall pipeline of the proposed framework is illustrated in Figure 2. Given a text description, our model progressively constructs a scene by refining semantic structure of an image using the following sequence of generators:
Box generator takes a text embedding $s$ as input, and generates a coarse layout by composing object instances in an image. The output of the box generator is a set of bounding boxes $B_{1:T} = \{B_1, \ldots, B_T\}$, where each bounding box $B_t$ defines the location, size and category label of the $t$-th object (Section 4.1).
Shape generator takes the set of bounding boxes generated by the box generator, and predicts the shapes of the objects inside the boxes. The output of the shape generator is a set of binary masks $M_{1:T} = \{M_1, \ldots, M_T\}$, where each mask $M_t$ defines the foreground shape of the $t$-th object (Section 4.2).
Image generator takes the semantic label map $M$ obtained by aggregating instance-wise masks, and the text embedding as inputs, and generates an image by translating the semantic layout to pixels matching the text description (Section 5).
By conditioning the image generation process on the semantic layouts that are explicitly inferred, our method is able to generate images that preserve detailed object shapes, which makes their semantic contents easier to recognize. In our experiments, we show that the images generated by our method are semantically more meaningful and better aligned with the input text, compared to ones generated by previous approaches (Section 6).
Inferring Semantic Layout from Text
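The box generator $G_{box}$ defines a stochastic mapping from the input text $s$ to a set of $T$ object bounding boxes $B_{1:T} = \{B_1, \ldots, B_T\}$:
$$\hat{B}_{1:T} \sim G_{box}(s), \tag{1}$$
where $B_t = (b_t, l_t)$ denotes the bounding box of the $t$-th object, with box coordinates $b_t$ (location and size) and class label $l_t$.
We employ an auto-regressive decoder for the box generator, by decomposing the conditional joint bounding box probability as $p(B_{1:T} \mid s) = \prod_{t=1}^{T} p(B_t \mid B_{1:t-1}, s)$, where the conditionals are approximated by an LSTM. In the generative process, we first sample a class label $l_t$ for the $t$-th object and then generate the box coordinates $b_t$ conditioned on $l_t$, i.e., $p(B_t \mid \cdot) = p(b_t, l_t \mid \cdot) = p(l_t \mid \cdot)\, p(b_t \mid l_t, \cdot)$. The coordinate conditional is modeled by a Gaussian Mixture Model (GMM) and the label conditional by a categorical distribution:
$$p(b_t \mid l_t, \cdot) = \sum_{k=1}^{K} \pi_{t,k}\, \mathcal{N}(b_t;\, \mu_{t,k}, \Sigma_{t,k}), \tag{2}$$
$$p(l_t \mid \cdot) = \mathrm{Softmax}(e_t), \tag{3}$$
where the mixture parameters $\pi_{t,k}$, $\mu_{t,k}$, $\Sigma_{t,k}$ and the logits $e_t$ are computed from the LSTM output at step $t$.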
Training.
We train the box generator by minimizing the negative log-likelihood of ground-truth bounding boxes:
$$\mathcal{L}_{box} = -\lambda_l \frac{1}{T} \sum_{t=1}^{T} \log p(l_t^{*}) - \lambda_b \frac{1}{T} \sum_{t=1}^{T} \log p(b_t^{*}), \tag{4}$$
where $T$ is the number of objects in an image, and $\lambda_l$, $\lambda_b$ are balancing hyper-parameters, which are set to 4 and 1 in our experiment, respectively. $b_t^{*}$ and $l_t^{*}$ are the ground-truth bounding box coordinates and label of the $t$-th object, respectively, which are ordered based on their bounding box locations from left to right. Note that we drop the conditioning in Eq. (4) for notational brevity.
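The training objective above can be illustrated with a small NumPy sketch. It is not the authors' implementation: it assumes a diagonal-covariance mixture for the box coordinates, assumes that the weight 4 applies to the label term and the weight 1 to the coordinate term, and the names `diag_gmm_log_prob` and `box_nll` are hypothetical.

```python
import numpy as np

def diag_gmm_log_prob(b, weights, means, stds):
    """log p(b) under a K-component diagonal Gaussian mixture over 4 box coordinates.

    weights: (K,), means: (K, 4), stds: (K, 4). Diagonal covariance is an assumption.
    """
    b = np.asarray(b, dtype=float)
    # Per-component log density, summed over the 4 coordinates.
    log_comp = -0.5 * np.sum(((b - means) / stds) ** 2 + np.log(2 * np.pi * stds ** 2), axis=1)
    a = np.log(weights) + log_comp
    m = a.max()
    return m + np.log(np.sum(np.exp(a - m)))  # log-sum-exp for numerical stability

def box_nll(label_probs, gt_labels, gmm_params, gt_boxes, lam_l=4.0, lam_b=1.0):
    """Weighted negative log-likelihood, averaged over the T ground-truth objects."""
    T = len(gt_labels)
    label_term = -sum(np.log(label_probs[t][gt_labels[t]]) for t in range(T)) / T
    box_term = -sum(diag_gmm_log_prob(gt_boxes[t], *gmm_params[t]) for t in range(T)) / T
    return lam_l * label_term + lam_b * box_term
```

In a real model, `label_probs[t]` and `gmm_params[t]` would come from the decoder at step `t`, conditioned on the text and on the ground-truth boxes of the previous steps.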
At test time, we generate bounding boxes via ancestral sampling of box coordinates and class label by Eq. (2) and (3), respectively. We terminate the sampling when the sampled class label corresponds to the termination indicator, so the number of objects is determined adaptively based on the text.
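The test-time procedure can be sketched as a short sampling loop. This is a toy illustration under stated assumptions, not the paper's code: `step_fn` is a hypothetical stand-in for the LSTM decoder, the mixture is diagonal, the termination indicator is taken to be the last label index, and `max_steps` is an added safety bound.

```python
import numpy as np

def sample_boxes(step_fn, num_classes, max_steps=20, seed=0):
    """Ancestral sampling: draw the label first, then the box given the label.

    step_fn(history) returns (label_probs, gmm_fn). label_probs has
    num_classes + 1 entries; the last index is the termination indicator.
    gmm_fn(label) returns (weights, means, stds) over (x, y, w, h).
    """
    rng = np.random.default_rng(seed)
    boxes = []
    for _ in range(max_steps):
        label_probs, gmm_fn = step_fn(boxes)
        label = int(rng.choice(num_classes + 1, p=label_probs))
        if label == num_classes:  # termination indicator sampled
            break
        weights, means, stds = gmm_fn(label)
        k = int(rng.choice(len(weights), p=weights))  # pick a mixture component
        coords = rng.normal(means[k], stds[k])        # sample (x, y, w, h)
        boxes.append((label, np.clip(coords, 0.0, 1.0)))
    return boxes

def toy_step_fn(history):
    """Toy decoder: emits class 0 twice, then the termination label."""
    num_classes = 2
    probs = np.zeros(num_classes + 1)
    probs[0 if len(history) < 2 else num_classes] = 1.0

    def gmm_fn(label):
        weights = np.array([0.5, 0.5])
        means = np.array([[0.3, 0.5, 0.2, 0.4], [0.7, 0.5, 0.2, 0.4]])
        stds = np.full((2, 4), 0.01)
        return weights, means, stds

    return probs, gmm_fn
```

Calling `sample_boxes(toy_step_fn, num_classes=2)` yields two boxes and then stops, which mirrors how the number of objects is decided by the sampled termination label rather than fixed in advance.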
4.2 Shape Generation
Given the bounding boxes $B_{1:T}$ from the box generator, the shape generator $G_{mask}$ predicts a binary mask for each object:
$$\hat{M}_{1:T} = G_{mask}(B_{1:T}, z_{1:T}), \tag{5}$$
where $z_t$ is a random noise vector.
Generating an accurate object shape should meet two requirements: (i) First, each instance-wise mask $M_t$ should match the location and class information of $B_t$, and be recognizable as an individual instance (instance-wise constraints). (ii) Second, each object shape must be aligned with its surrounding context (global constraints). To satisfy both, we design the shape generator as a recurrent neural network, which is trained with two conditional adversarial losses as described below.
We build the shape generator using a convolutional recurrent neural network, as illustrated in Figure 2. At each step $t$, the model takes $B_t$ through an encoder CNN, and encodes information of all object instances by a bi-directional convolutional LSTM (Bi-convLSTM). On top of the convLSTM output at the $t$-th step, we add noise $z_t$ by spatial tiling and concatenation, and generate a mask $M_t$ by forwarding it through a decoder CNN.
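The "spatial tiling and concatenation" step can be shown in a few lines of NumPy. This is a minimal sketch assuming a channels-last `(H, W, C)` feature layout; the function name `tile_and_concat` is hypothetical.

```python
import numpy as np

def tile_and_concat(feature_map, z):
    """Tile a noise vector over the spatial grid and append it as extra channels.

    feature_map: (H, W, C) array; z: (D,) noise vector. Returns (H, W, C + D).
    """
    h, w, _ = feature_map.shape
    z_tiled = np.broadcast_to(z, (h, w, z.shape[0]))  # same z at every location
    return np.concatenate([feature_map, z_tiled], axis=-1)
```

Every spatial location receives the same noise vector, so the noise influences the object shape as a whole rather than individual pixels independently.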
Training.
Training of the shape generator is based on the GAN framework, in which the generator and discriminator are alternately trained. To enforce both the global and the instance-wise constraints discussed earlier, we employ two conditional adversarial losses with the instance-wise discriminator $D_{inst}$ and the global discriminator $D_{global}$.
First, we encourage each object mask to be compatible with the class and location information encoded by the object bounding box. We train an instance-wise discriminator $D_{inst}$ by optimizing the following instance-wise adversarial loss:
$$\mathcal{L}_{inst}^{(t)} = \mathbb{E}_{(B_t, M_t)}\big[\log D_{inst}(B_t, M_t)\big] + \mathbb{E}_{B_{1:T}, z_{1:T}}\big[\log\big(1 - D_{inst}(B_t, G_{mask}^{(t)}(B_{1:T}, z_{1:T}))\big)\big], \tag{6}$$
where $G_{mask}^{(t)}$ indicates the $t$-th output from the mask generator. The instance-wise loss is applied to each of the $T$ instance-wise masks, and aggregated over all instances as $\mathcal{L}_{inst} = \frac{1}{T} \sum_{t} \mathcal{L}_{inst}^{(t)}$.
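On the other hand, the global loss encourages all the instance-wise masks to form a globally coherent context. To consider the relation between different objects, we aggregate them into a global mask $G_{global}(B_{1:T}, z_{1:T}) = \sum_{t} G_{mask}^{(t)}(B_{1:T}, z_{1:T})$ (computed by addition, to model overlap between objects), and compute a global adversarial loss analogous to Eq. (6) as
$$\mathcal{L}_{global} = \mathbb{E}_{(B_{1:T}, M_{1:T})}\big[\log D_{global}(B_{global}, M_{global})\big] + \mathbb{E}_{B_{1:T}, z_{1:T}}\big[\log\big(1 - D_{global}(B_{global}, G_{global}(B_{1:T}, z_{1:T}))\big)\big], \tag{7}$$
where $M_{global}$ is the aggregated ground-truth mask and $B_{global}$ is the aggregated box tensor.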
Finally, we additionally impose a reconstruction loss $\mathcal{L}_{rec}$ that encourages the predicted instance masks to be similar to the ground-truths. We implement this idea using a perceptual loss, which measures the distance between real and fake images in the feature space of a pre-trained CNN by
$$\mathcal{L}_{rec} = \sum_{l} \big\| \Phi_l(G_{global}) - \Phi_l(M_{global}) \big\|, \tag{8}$$
where $G_{global}$ and $M_{global}$ denote the aggregated predicted and ground-truth masks, and $\Phi_l$ is the feature extracted from the $l$-th layer of a CNN. We use the VGG-19 network pre-trained on ImageNet in our experiments. Since our input to the pre-trained network is a binary mask, we replicate masks along the channel dimension and use the converted mask to compute Eq. (8). We found that using the perceptual loss significantly improves the stability of GAN training and the quality of object shapes, as also discussed in prior work.
Combining Eq. (6), (7) and (8), the overall training objective for the shape generator becomes
$$\mathcal{L}_{shape} = \lambda_i \mathcal{L}_{inst} + \lambda_g \mathcal{L}_{global} + \lambda_r \mathcal{L}_{rec}, \tag{9}$$
where $\lambda_i$, $\lambda_g$ and $\lambda_r$ are hyper-parameters that balance the different losses, which are set to 1, 1 and 10 in the experiment, respectively. We provide more details of training and network architecture in the appendix (Section A.2).
Synthesizing Images from Text and Layout
The outputs from the layout generator define location, size, shape and class information of objects, which provide the semantic structure of a scene relevant to the text. Given the semantic structure and text, the objective of the image generator is to generate an image that conforms to both conditions. To this end, we first aggregate the binary object masks $M_{1:T}$ into a semantic label map $M \in \{0,1\}^{H \times W \times L}$, such that $M_{ijl} = 1$ if and only if there exists an object of class $l$ whose mask $M_t$ covers the pixel $(i, j)$. Then, given the semantic layout $M$ and the text $s$, the image generator is defined by
$$\hat{X} = G_{img}(M, s, z), \tag{10}$$
where $z$ is random noise. In the following, we describe the network architecture and training procedures of the image generator.
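The aggregation of instance masks into a semantic label map described above can be sketched as follows. The channels-last layout and the function name `aggregate_label_map` are assumptions made for illustration.

```python
import numpy as np

def aggregate_label_map(masks, labels, num_classes):
    """Build a binary (H, W, L) label map from per-object masks.

    masks: list of (H, W) binary arrays; labels: class index of each mask.
    An entry is 1 iff some object of that class covers the pixel.
    """
    h, w = masks[0].shape
    label_map = np.zeros((h, w, num_classes), dtype=np.uint8)
    for mask, label in zip(masks, labels):
        # Logical OR keeps the map binary where same-class objects overlap.
        label_map[:, :, label] |= (mask > 0).astype(np.uint8)
    return label_map
```

Note that this map discards instance identity: two overlapping objects of the same class become a single region in that class channel.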
For the discriminator network , we first concatenate the generated image and the semantic layout . It is fed through a series of down-sampling blocks, resulting in a feature map of size . We concatenate it with a spatially tiled text embedding, from which we compute a decision score of the discriminator.
Training.
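Conditioned on both the semantic layout $M$ and the text embedding $s$, the image generator $G_{img}$ is jointly trained with the discriminator $D_{img}$. We define the objective function by $\mathcal{L}_{img} = \lambda_a \mathcal{L}_{adv} + \lambda_r \mathcal{L}_{rec}$, where
$$\mathcal{L}_{adv} = \mathbb{E}_{(M, s, X)}\big[\log D_{img}(M, s, X)\big] + \mathbb{E}_{(M, s), z}\big[\log\big(1 - D_{img}(M, s, G_{img}(M, s, z))\big)\big], \tag{11}$$
$$\mathcal{L}_{rec} = \sum_{l} \big\| \Phi_l(G_{img}(M, s, z)) - \Phi_l(X) \big\|,$$
where $X$ is a ground-truth image associated with the semantic layout $M$. As in the mask generator, we apply the same perceptual loss $\mathcal{L}_{rec}$, which is found to be effective. We set the hyper-parameters , in our experiment. More details on the network architecture and training procedure are provided in the appendix (Section A.3).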
Experiments
We use the MS-COCO dataset to evaluate our model. It contains 164,000 training images over 80 semantic classes, where each image is associated with instance-wise annotations (i.e., object bounding boxes and segmentation masks) and 5 text descriptions. The dataset has complex scenes with many objects in a diverse context, which makes generation very challenging. We use the official train and validation splits from MS-COCO 2014 for training and evaluating our model, respectively.
Evaluation metrics.
We evaluate text-conditional image generation performance using various metrics: Inception score, caption generation, and human evaluation.
Inception score — We compute the Inception score by applying pre-trained classifier on synthesized images and investigating statistics of their score distributions. It measures recognizability and diversity of generated images, and has been known to be correlated with human perceptions on visual quality . We use the Inception-v3 network pre-trained on ImageNet for evaluation, and measure the score for all validation images.
Caption generation — In addition to the Inception score, assessing performance of text-conditional image generation necessitates measuring the relevance of generated image to the input text. To this end, we generate sentences from the synthesized image and measure the similarity between input text and predicted sentence. The underlying intuition is that if the generated image is relevant to input text and its contents are recognizable, one should be able to guess the original text from the synthesized image. We employ an image caption generator trained on MS-COCO to generate sentences, where one sentence is generated per image by greedy decoding. We report three standard language similarity metrics: BLEU , METEOR and CIDEr .
Human evaluation — Evaluation based on caption generation is beneficial for large-scale evaluation but may introduce unintended bias by the caption generator. To verify the effectiveness of caption-based evaluation, we conduct human evaluation using Amazon Mechanical Turk. For each text randomly selected from MS-COCO validation set, we presented 5 images generated by different methods, and asked users to rank the methods based on the relevance of generated images to text. We collected results for 1000 sentences, each of which is annotated by 5 users. We report results based on the ratio of each method ranked as the best, and one-to-one comparison between ours and the baselines.
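For reference, the Inception score described above is commonly computed as the exponential of the average KL divergence between each image's class posterior and the marginal class distribution. The sketch below assumes the posteriors have already been obtained from a pre-trained classifier; the function name is hypothetical, and the usual practice of averaging over several splits is omitted for brevity.

```python
import numpy as np

def inception_score(probs, eps=1e-12):
    """exp( mean_x KL( p(y|x) || p(y) ) ).

    probs: (N, C) array of class posteriors p(y|x), one row per generated image.
    """
    probs = np.asarray(probs, dtype=float)
    marginal = probs.mean(axis=0, keepdims=True)  # p(y), averaged over images
    kl = np.sum(probs * (np.log(probs + eps) - np.log(marginal + eps)), axis=1)
    return float(np.exp(kl.mean()))
```

The score is high when each image is classified confidently (recognizability) and the predicted classes vary across images (diversity); it reaches its minimum of 1 when every image yields the same posterior.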
6.2 Quantitative Analysis
We compare our method with two state-of-the-art approaches based on conditional GANs. Table 1 and Table 2 summarize the quantitative evaluation results.
We first present systematic evaluation results based on Inception score and caption generation performance. The results are summarized in Table 1. The proposed method substantially outperforms existing approaches on both evaluation metrics. In terms of Inception score, our method outperforms the existing approaches by a substantial margin, presumably because our method generates more recognizable objects. Caption generation performance shows that captions generated from our synthesized images are more strongly correlated with the input text than those of the baselines. This shows that images generated by our method are better aligned with the descriptions and that their semantic contents are easier to recognize.
Table 2 summarizes comparison results based on human evaluation. When users are asked to rank images based on their relevance to input text, they choose images generated by our method as the best in about of all presented sentences, which is substantially higher than baselines (about ). This is consistent with the caption generation results in Table 1, in which our method substantially outperforms the baselines while their performances are comparable.
Figure 4 illustrates qualitative comparisons. Due to adversarial training, images generated by the other methods, especially StackGAN, tend to be clear and exhibit high-frequency details. However, it is difficult to recognize the contents of these images, since the methods often fail to predict important semantic structure of objects and scenes. As a result, the reconstructed captions from the generated images are usually not relevant to the input text. Compared to them, our method generates much more recognizable and semantically meaningful images by conditioning the generation on the inferred semantic layout, and is able to reconstruct descriptions that better align with the input sentences.
Ablative Analysis.
To understand the quality and impact of the predicted semantic layout, we conduct an ablation study by gradually replacing the bounding box and mask layouts predicted by the layout generator with the ground-truths. Table 1 summarizes the quantitative evaluation results. As it shows, replacing the predicted layouts with ground-truths leads to gradual performance improvements, which indicates that prediction errors remain in both the bounding box and mask layouts.
6.3 Qualitative Analysis
Figure 5 shows qualitative results of our method. For each text, we present the generated images alongside the predicted semantic layouts. As in the previous section, we also present our results conditioned on ground-truth layouts. As it shows, our method generates a reasonable semantic layout and image matching the input text; it generates bounding boxes corresponding to the fine-grained scene structure implied in the text (i.e., object categories, the number of objects), and object masks capturing class-specific visual attributes as well as relations to other objects. Given the inferred layouts, our image generator produces correct object appearances and a background compatible with the text. Replacing the predicted layouts with ground-truths makes the generated images share a similar context with the original images.
To assess the diversity in generation, we sample multiple images while fixing the input text. Figure 6 illustrates the example images generated by our method. Our method generates diverse semantic structures given the same text description, while preserving semantic details such as the number of objects and object categories.
Text-conditional generation.
To see how our model incorporates text description in generation process, we generate images while modifying parts of the descriptions. Figure 7 illustrates the example results. When we change the context of descriptions such as object class, number of objects, spatial composition of objects and background patterns, our method correctly adapts semantic structure and images based on the modified part of the text.
Controllable image generation.
We demonstrate controllable image generation by modifying the bounding box layout. Figure 8 illustrates the example results. Our method updates object shapes and context based on the modified semantic layout (e.g., adding new objects, changing the spatial configuration of objects) and generates reasonable images. See Figures 13 and 14 for more examples of various types of layout modifications.
Conclusion
We proposed an approach for text-to-image synthesis which explicitly infers and exploits a semantic layout as an intermediate representation from text to image. Our model hierarchically constructs a semantic layout in a coarse-to-fine manner by a series of generators. By conditioning image generation on explicit layout prediction, our method generates complicated images that preserve semantic details and are highly relevant to the text description. We also showed that the predicted layout can be used to control the generation process. We believe that end-to-end training of layout and image generation would be an interesting direction for future work.
This work was supported in part by ONR N00014-13-1-0762, NSF CAREER IIS-1453651, DARPA Explainable AI (XAI) program #313498, and Sloan Research Fellowship.
Appendix
Appendix A Implementation Details
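A.1 Box Generator
This section describes the details of the box generator. Denoting the bounding box of the $t$-th object as $B_t = (b_t, l_t)$, the joint probability of sampling $B_t$ from the box generator is given by
$$p(b_t, l_t \mid \cdot) = p(l_t \mid \cdot)\, p(b_t \mid l_t, \cdot). \tag{12}$$
We drop the conditioning variables for notational brevity. As described in the main paper, we implement $p(l_t)$ by a categorical distribution and $p(b_t \mid l_t)$ by a mixture of quadrivariate Gaussians. However, modeling the full covariance matrix of a quadrivariate Gaussian is expensive as it involves many parameters. Therefore, we decompose the box coordinate probability as $p(b_t \mid l_t) = p(b_t^x, b_t^y \mid l_t)\, p(b_t^w, b_t^h \mid b_t^x, b_t^y, l_t)$, and approximate it with two bivariate Gaussian mixtures by
$$p(b_t^x, b_t^y \mid l_t) = \sum_{k=1}^{K} \pi_{t,k}^{xy}\, \mathcal{N}(b_t^x, b_t^y;\, \mu_{t,k}^{xy}, \Sigma_{t,k}^{xy}),$$
$$p(b_t^w, b_t^h \mid b_t^x, b_t^y, l_t) = \sum_{k=1}^{K} \pi_{t,k}^{wh}\, \mathcal{N}(b_t^w, b_t^h;\, \mu_{t,k}^{wh}, \Sigma_{t,k}^{wh}). \tag{13}$$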
Then the parameters for Eq. (13) are obtained from LSTM outputs at each step by
where are the parameters for GMM concatenated to a vector.
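The decomposition into two bivariate mixtures can be illustrated by evaluating the box log-probability as a sum of two terms. This sketch assumes the factorization is position first and then size given position, and that the size-mixture parameters passed in have already been computed conditioned on the position; all function names are hypothetical.

```python
import numpy as np

def bivariate_gmm_log_prob(v, weights, means, covs):
    """log of sum_k weights[k] * N(v; means[k], covs[k]) for a 2-D point v."""
    v = np.asarray(v, dtype=float)
    log_terms = []
    for pi_k, mu_k, cov_k in zip(weights, means, covs):
        diff = v - mu_k
        maha = diff @ np.linalg.solve(cov_k, diff)   # squared Mahalanobis distance
        log_det = np.linalg.slogdet(cov_k)[1]
        # 2-D Gaussian log density: -0.5 * (maha + log|cov| + 2 * log(2*pi))
        log_terms.append(np.log(pi_k) - 0.5 * (maha + log_det + 2 * np.log(2 * np.pi)))
    a = np.array(log_terms)
    return a.max() + np.log(np.exp(a - a.max()).sum())

def box_log_prob(box, xy_params, wh_params):
    """log p(x, y) + log p(w, h | x, y) for box = (x, y, w, h)."""
    x, y, w, h = box
    return bivariate_gmm_log_prob([x, y], *xy_params) + bivariate_gmm_log_prob([w, h], *wh_params)
```

The saving is in the covariance: a symmetric 4x4 matrix has 10 free entries per mixture component, while two symmetric 2x2 matrices have 3 each.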
For training, we employ an Adam optimizer with learning rate 0.001, and exponentially decrease the learning rate with rate 0.5 at every epoch after the initial 10 epochs.
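One reading of this schedule is sketched below. It assumes 0-indexed epochs, a constant rate for the first 10 epochs, and a halving at the start of every epoch after that; the exact epoch at which the first decay is applied is an assumption, and the function name is hypothetical.

```python
def box_generator_lr(epoch, base_lr=1e-3, decay=0.5, warm_epochs=10):
    """Learning rate for a 0-indexed epoch.

    Constant for the first `warm_epochs` epochs, then multiplied by
    `decay` once per subsequent epoch.
    """
    return base_lr * decay ** max(0, epoch - warm_epochs + 1)
```

Under this reading, epochs 0 through 9 use 0.001, epoch 10 uses 0.0005, and epoch 12 uses 0.000125.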
A.2 Shape Generator
We provide a detailed architecture of the shape generator $G_{mask}$ and the two discriminators $D_{inst}$ and $D_{global}$ in Figure 9. At each step $t$, we encode a box tensor $B_t$ by a series of downsampling layers, where each downsampling layer is implemented by a stride-2 convolution followed by instance-wise normalization and ReLU. The encoded feature is fed into the bidirectional convolutional LSTM (bi-convLSTM), and combined with features from all object instances. On top of the bi-convLSTM output at each step $t$, we add noise $z_t$ by spatial replication and depth concatenation, and apply a masking operation so that regions outside the object bounding box are all set to 0. The masked feature is fed into several residual blocks, and mapped to a binary mask by a series of upsampling layers. Similar to the downsampling layers, we implement an upsampling layer by a stride-2 deconvolution followed by instance-wise normalization and ReLU, except for the last one, which is a convolution followed by the sigmoid nonlinearity.
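The masking operation mentioned above can be sketched in NumPy as follows. It assumes a channels-last feature map and a box given in pixel indices at the feature map's resolution with end-exclusive corners; the function name `mask_outside_box` is hypothetical.

```python
import numpy as np

def mask_outside_box(feature_map, box):
    """Zero every spatial location outside the box.

    feature_map: (H, W, C) array.
    box: (x0, y0, x1, y1) in pixel indices, end-exclusive.
    """
    x0, y0, x1, y1 = box
    keep = np.zeros(feature_map.shape[:2], dtype=feature_map.dtype)
    keep[y0:y1, x0:x1] = 1
    return feature_map * keep[:, :, None]  # broadcast over channels
```

Masking before decoding means the decoder only receives signal inside the box, which constrains the generated shape to the location the box generator chose.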
The instance-wise discriminator $D_{inst}$ and the global discriminator $D_{global}$ share the same architecture but have separate parameters. The input to the instance-wise discriminator is constructed by concatenating the box tensor $B_t$ and the corresponding binary mask $M_t$ along the channel dimension, while the one for the global discriminator is constructed by concatenating the aggregated box tensor $B_{global}$ and the aggregated masks $M_{global}$. Both discriminators encode the input by a series of downsampling layers, which are implemented by stride-2 convolutions followed by instance-wise normalization and Leaky-ReLU.
For training, we employ an Adam optimizer with learning rate 0.0002, and linearly decrease the learning rate after the first 50-epochs training.
A.3 Image Generator
A detailed architecture of the image generator is illustrated in Figure 10. The architectures of the downsampling and residual blocks are the same as the ones used in the shape generator. To encourage the model to generate images that match the input layout, we implement upsampling layers based on the cascaded refinement network. Each upsampling layer takes the output from the previous layer and the semantic layout resized to the same spatial size as inputs, and combines them by depth concatenation followed by convolution. The combined feature map is then spatially upscaled by bilinear upsampling followed by instance-wise normalization and ReLU, and subsequently fed into the next upsampling layer.
To encourage the model to generate images that match input text descriptions, we employ a matching-aware loss from prior work on text-conditional GANs. Denoting a ground-truth training example as $(M, s, X)$, where $M$, $s$ and $X$ denote the semantic layout, text embedding and image, respectively, we construct an additional mismatching triple $(M, \tilde{s}, X)$ by sampling a random text embedding $\tilde{s}$ non-relevant to the image. We consider it as an additional fake example in adversarial training, and extend the conditional adversarial loss for the image generator (Eq. (11) in the main paper) as
We found that employing matching-aware loss substantially improves text-conditional generation and stabilizes overall GAN training.
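A matching-aware discriminator loss can be sketched as below. The equal 1/2 weighting of the mismatched-text and generated-image terms follows the GAN-CLS formulation of Reed et al. and is an assumption here, not a detail confirmed by the text above; the function name and argument names are hypothetical.

```python
import numpy as np

def matching_aware_d_loss(d_real, d_mismatch, d_fake, eps=1e-12):
    """Discriminator loss to minimize, given probabilities in (0, 1).

    d_real:     D(layout, matching text, real image)
    d_mismatch: D(layout, mismatching text, real image)
    d_fake:     D(layout, matching text, generated image)
    """
    real_term = np.mean(np.log(d_real + eps))
    # Both kinds of fake examples share the "fake" half of the objective.
    fake_term = 0.5 * (np.mean(np.log(1 - d_mismatch + eps))
                       + np.mean(np.log(1 - d_fake + eps)))
    return -(real_term + fake_term)
```

Because a real image paired with the wrong text is scored as fake, the discriminator cannot succeed by judging realism alone and has to check agreement between image and text.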
For training, we employ an Adam optimizer with learning rate 0.0002, and linearly decrease the learning rate after the first 30-epoch training.
Appendix B Additional Experiment Results
To understand the impact of each component in the proposed framework, we conduct an ablation study by varying configurations of the proposed model. Table 3 summarizes the results based on caption generation performance.
We first investigate the impact of the shape generator. To this end, we remove the shape generator from our generation pipeline, and modify the image generator to generate images directly from the box generator outputs. Specifically, we feed the aggregated bounding box tensor $B_{global}$ as an input to the image generator, which is constructed by taking the pixel-wise maximum over all box tensors as $B_{global}(i, j, l) = \max_t B_t(i, j, l)$. Note that the aggregated box tensor can be considered as a semantic layout in which the shape of each object is a rectangular box. The result is presented in the second row of Table 3. Removing the shape generator leads to substantial performance degradation, since predicting accurate object shapes and textures directly from a bounding box is a complicated task; the image generator tends to miss detailed object shapes such as body parts, which are critical for humans to recognize the image content. By explicitly inferring object shapes, the full model improves the overall image quality and the interpretability of content.
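The aggregated box tensor used in this ablation can be sketched as follows. The per-object box tensor is assumed to be a binary `(H, W, L)` array that is 1 inside the box on the channel of the object's class, with boxes given as end-exclusive pixel indices; both function names are hypothetical.

```python
import numpy as np

def box_tensor(box, label, height, width, num_classes):
    """Binary (H, W, L) tensor: 1 inside the box on the channel of its class."""
    x0, y0, x1, y1 = box
    tensor = np.zeros((height, width, num_classes), dtype=np.uint8)
    tensor[y0:y1, x0:x1, label] = 1
    return tensor

def aggregate_box_tensors(tensors):
    """Pixel-wise maximum over all per-object box tensors."""
    return np.maximum.reduce(tensors)
```

The result has the same format as a semantic label map, which is why it can be fed to the image generator in place of the mask-based layout.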
Impact of perceptual loss in shape generator
Impact of perceptual loss in image generator
Impact of attention in image generator
Our image generator combines features from the text embedding and semantic layout by attention mechanism. To see its impact on text-conditional image generation, we remove the attention mechanism from the image generator (computation of and in Figure 10) and concatenate the layout feature directly to text embedding. As shown in the last row of Table 3, employing attention mechanism improves the text-conditional image generation performance, since it forces the model to exploit text information in generation process. We found that the attention mechanism helps the model to generate textures and background relevant to the input text.
B.2 More qualitative examples
We present the end-to-end image generation results of our method in Figure 11, including object bounding boxes and masks obtained by the layout generator. As illustrated in the figure, our model generates object bounding boxes that match the content of the input text, and shapes capturing class-specific visual attributes and relations with other objects (e.g., person riding a motorcycle, person swinging a bat, etc.). Given the layout, the image generator correctly predicts object textures and a background that match the description.
Diversity of samples.
Figure 12 presents a set of samples generated by our method, which corresponds to Figure 6 in the main paper. Our method generates diverse samples by generating semantic layouts that are both diverse and highly related to the input text description.
Controllable image generation.
Semantic layout provides a natural and interactive interface for image editing. By modifying the bounding box layout of the scene, our model can generate object shapes and images compatible with the modified layout. Figure 13 illustrates the generated images obtained by adding new objects to the existing semantic layout. When new object bounding boxes are placed in a scene, our model not only creates the corresponding object instances but also adapts the surrounding context to the change. For instance, adding cars and pedestrians in front of a tower makes the model generate a street in the background (the 4th row in Figure 13). Similarly, one can modify the semantic layout by changing the size and spatial location of existing objects. Figure 14 illustrates the results. Modifying the spatial configuration of objects sometimes changes the relationship between objects and leads to images in a different context. For instance, changing the locations of a soccer ball and players leads to various images such as dribbling, shooting and competing for the ball (the first row in Figure 14).