Synthesizing Programs for Images using Reinforced Adversarial Learning

Yaroslav Ganin, Tejas Kulkarni, Igor Babuschkin, S. M. Ali Eslami, Oriol Vinyals

Introduction

Recovering structured representations from raw sensations is an ability that humans readily possess and frequently use. Given a picture of a hand-written character, decomposing it into strokes can make it easier to classify or re-imagine that character, and similarly, knowing the underlying layout of a room can aid with planning, navigation and interaction in that room. Furthermore, this structure can be exploited for generalization, rapid learning, and even communication with other agents. It is commonly believed that humans exploit simulations to learn this skill (Lake et al., 2017). By experimenting with a pen and a piece of paper we learn how our hand movements lead to written characters, and via imagination we learn how architectural layouts manifest themselves in reality.

In the visual domain, inversion of a renderer for the purposes of scene understanding is typically referred to as inverse graphics (Mansinghka et al., 2013; Kulkarni et al., 2015a). Training vision systems using the inverse graphics approach has remained a challenge. Renderers typically expect as input programs that have sequential semantics, are composed of discrete symbols (e.g., keystrokes in a CAD program), and are long (tens or hundreds of symbols). Additionally, matching rendered images to real data poses an optimization problem as black-box graphics simulators are not differentiable in general.

To address these problems, we present a new approach for interpreting and generating images using Deep Reinforced Adversarial Learning. In this approach, an adversarially trained agent generates visual programs which are in turn executed by a graphics engine to generate images, either conditioned on data or unconditionally. The agent is rewarded by fooling a discriminator network, and is trained with distributed reinforcement learning without any extra supervision. The discriminator network itself is trained to distinguish between rendered and real images.

An adversarially trained reinforcement learning agent that interprets and generates images in the space of visual programs. Crucially, the architecture of our agent is agnostic both to the semantics of the visual program and to the domain.

Scaling inverse graphics to real world and procedural datasets without the need for labels. In particular, our model discovers pen strokes that give rise to MNIST and Omniglot characters, brush strokes that give rise to celebrity faces, and scene descriptions that, once rendered, reconstruct an image of a 3D scene (Figure 1).

Evidence that utilizing a discriminator’s output as the reward signal for reinforcement learning is significantly better at optimizing the pixel error between renderings and data, compared to directly optimizing pixel error.

A showcase of state-of-the-art deep reinforcement learning techniques, which can provide a scaling path for inverse graphics, and could lead to broader implications for program synthesis in future work.

Related Work

The idea of inverting simulators to interpret images has been explored extensively in recent years (Nair et al., 2008; Paysan et al., 2009; Mansinghka et al., 2013; Loper & Black, 2014; Kulkarni et al., 2015a; Jampani et al., 2015). Structured object-attribute based ‘de-rendering’ models have been proposed for interpretation of images (Wu et al., 2017b) and videos (Wu et al., 2017a). Concurrent work has explored the use of Constructive Solid Geometry primitives for explaining binary images (Sharma et al., 2017). Loper & Black (2014) proposed the idea of differentiable inverse graphics, which is efficient for optimizing continuous variables but cannot handle discrete variables. Earlier work has also explored using reinforcement learning for automatic generation of single brush strokes (Xie et al., 2013). However, scaling these approaches to larger real-world datasets, particularly at test-time, has remained a challenge.

Inferring motor programs for the reconstruction of MNIST digits was first studied in (Nair & Hinton, 2006). The generative model is parametrized by two pairs of opposing springs whose stiffness is controlled by a motor program. The training procedure involved starting with a prototype program and its corresponding observation. Random noise was then added to this prototype in order to produce new training examples until the generated distribution stretched to cover the manifold of the training digits. In contrast, our model automatically learns the training curriculum via the discriminator and the same agent is suitable for a range of scene understanding problems, including those in 3D.

Visual program induction has recently been studied in the context of hand-written characters on the Omniglot dataset (Lake et al., 2015). This model achieves impressive performance but requires parses from a hand-crafted algorithm to initialize training and was not demonstrated to generalize beyond hand-written characters. Ellis et al. (2017) proposed a visual program induction model to infer LaTeXprograms for diagram understanding. More recently, the sketch-rnn model (Ha & Eck, 2017) used sequence-to-sequence learning (Sutskever et al., 2014) to produce impressive sketches both unconditionally and conditioned on data. However, similar to the aforementioned works and unlike SPIRAL, the model requires supervision in the form of sketches and corresponding sequences of strokes.

In the neural network community, there have been analogous attempts at inferring and learning feed-forward or recurrent procedures for image generation (LeCun et al., 2015; Hinton & Salakhutdinov, 2006; Goodfellow et al., 2014; Ackley et al., 1987; Kingma & Welling, 2013; Oord et al., 2016; Kulkarni et al., 2015b; Eslami et al., 2016; Reed et al., 2017; Gregor et al., 2015). These models demonstrate impressive image generation capabilities but generally lack the ability to infer structured representations of images.

Our approach employs adversarial training techniques, first used for generative modeling (Goodfellow et al., 2014) and domain adaptation (Ganin & Lempitsky, 2015). Generative Adversarial Networks (GANs) (Goodfellow et al., 2014) were orignally used for image generation but have now been successfully applied to model audio, text and motor behaviors (Ho & Ermon, 2016; Merel et al., 2017). Perhaps the most interesting extension in our context is their use in domain transfer, where images from one domain (e.g., segmentations) were mapped to another (e.g., pixels). Models such as pix2pix (Isola et al., 2017), CycleGAN (Zhu et al., 2017) and AIGN (Tung et al., 2017) fall in this category.

The SPIRAL agent builds upon this literature, has minimal hand-crafting in its design, requires no supervision in the form of pairs of programs and corresponding images and, as we demonstrate in the following sections, is applicable across a wide range of domains.

The SPIRAL Agent

Our goal is to construct a generative model $G$ capable of sampling from some target data distribution $p_{d}$ . To that end, we propose using an external black-box rendering simulator $\mathcal{R}$ that accepts a sequence of commands $a=(a_{1},a_{2},\ldots,a_{N})$ and transforms them into the domain of interest, e.g., a bitmap. For example, $\mathcal{R}$ could be a CAD program rendering descriptions of primitives into 3D scenes. Thus, our task is equivalent to recovering a distribution $p_{a}$ such that $p_{d}\approx\mathcal{R}(p_{a})$ . We model $p_{a}$ with a recurrent neural network $\pi$ which we call the policy network (or, somewhat sloppily, the agent). The generation process $G$ , which consists of a policy $\pi$ and a renderer $\mathcal{R}$ , is illustrated in Figure 2(a).

In order to optimize $\pi$ , we employ the adversarial framework (Goodfellow et al., 2014). In this framework, the generator (denoted as $G$ ) aims to maximally confuse a discriminator network $D$ which is trained to distinguish between the samples drawn from $p_{d}$ and the samples generated by the model. As a result, the distribution defined by $G$ (denoted as $p_{g}$ ) gradually becomes closer to $p_{d}$ . Crucially, and unlike previous work, our training procedure does not require any strong supervision in the form of aligned examples from $p_{d}$ and $p_{a}$ .

We give the concrete training objectives for $G$ and $D$ below.

2 Objectives

In our experiments, we found that the original minimax objective from Goodfellow et al. (2014) was hard to optimize in the context of our model, so we opted for using a variant that employs Wasserstein distance as a measure of divergence between distributions (Gulrajani et al., 2017), as it appears to handle vastly dissimilar $p_{g}$ and $p_{d}$ more gracefully. In this case, the discriminator is considered “confused” if it assigns the same scores (in expectation) to the inputs coming both from $p_{d}$ and $p_{g}$ . We note that our approach can be used in conjunction with other forms of GAN objectives as well.

Discriminator. Following (Gulrajani et al., 2017), we define the objective for $D$ as:

where $R$ is a regularization term softly constraining $D$ to stay in the set of Lipschitz continuous functions (for some fixed Lipschitz constant). It is worth noting that the solution to (1) is defined up to an additive constant. Due to the nature of how we use discriminator outputs during training of $\pi$ , we found it beneficial to resolve this ambiguity by encouraging $D({\mathbf{x}})$ to be close to on average for ${\mathbf{x}}\sim\frac{1}{2}p_{g}+\frac{1}{2}p_{d}$ .

Generator. We formally define $\pi$ as a network that at every time step $t$ predicts a distribution over all possible commands $\pi_{t}=\pi(a_{t}|s_{t};\theta)$ , where $s_{t}$ is the recurrent state of the network, and $\theta$ is a set of learnable parameters. Given a sequence of samples $\left(a_{t}\,|\,a_{t}\sim\pi_{t},1\leq t\leq N\right)$ , a sample from $p_{g}$ is then computed as $\mathcal{R}(a_{1},a_{2},\ldots,a_{N})$ . Since the generator is an arbitrary non-differentiable function, we cannot optimize

with naive gradient descent. Therefore we pose this problem as maximization of the expected return which can be solved using standard techniques from reinforcement learning. Specifically, we employ a variant of the REINFORCE (Williams, 1992) algorithm, advantage actor-critic (A2C):

where $V^{\pi}$ is an approximation to the value function which is considered to be independent of $\theta$ , and $R_{t}=\sum_{t}^{N}r_{t}$ is a 1-sample Monte-Carlo estimate of the return. Optimizing (3) recovers the solution to (2) if the rewards are set to:

One interesting aspect of this new formulation is that we can also bias the search by introducing intermediate rewards which may depend not only on the output of $\mathcal{R}$ but also on commands used to generate that output. We present several examples of such rewards in Section 4.

3 Conditional Generation

So far, we have described the case of unconditional generation, but in many situations it is useful to condition the model on auxiliary input (Mirza & Osindero, 2014). For instance, one might be interested in finding a specific program that generates a given image $\mathbf{x}_{\mbox{\scriptsize target}}$ . That could be achieved by supplying $\mathbf{x}_{\mbox{\scriptsize target}}$ both to the policy and to the discriminator networks. In other words,

while $p_{d}$ becomes a Dirac $\delta$ -function centered at $\mathbf{x}_{\mbox{\scriptsize target}}$ . The first two terms in (1) thus reduce to

4 Distributed Learning

Our training pipeline is outlined in Figure 2(b). It is an extension of the recently proposed IMPALA architecture (Espeholt et al., 2018). For training, we define three kinds of workers:

Actors are responsible for generating the training trajectories through interaction between the policy network and the rendering simulator. Each trajectory contains a sequence $((\pi_{t},a_{t})\,|\,1\leq t\leq N)$ as well as all intermediate renderings produced by $\mathcal{R}$ .

A policy learner receives trajectories from the actors, combines them into a batch and updates $\pi$ by performing an SGD step on $\mathcal{L}_{G}$ (2). Following common practice (Mnih et al., 2016), we augment $\mathcal{L}_{G}$ with an entropy penalty encouraging exploration.

In contrast to the base IMPALA setup, we define an additional discriminator learner. This worker consumes random examples from $p_{d}$ , as well as generated data (final renders) coming from the actor workers, and optimizes $\mathcal{L}_{D}$ (1).

In the original paper introducing WGAN with gradient penalty (Gulrajani et al., 2017), the authors note that in order to obtain better performance, the discriminator has to be updated more frequently than the generator. In our setting, generation of each model sample is expensive since it involves multiple invocations of an external simulator. We therefore do not omit any trajectories in the policy learner. Instead, we decouple the $D$ updates from the $\pi$ updates by introducing a replay buffer that serves as a communication layer between the actors and the discriminator learner. That allows the latter to optimize $D$ at a higher rate than the training of the policy network due to the difference in network sizes ( $\pi$ is a multi-step RNN, while $D$ is a plain CNN). We note that even though sampling from a replay buffer inevitably leads to smoothing of $p_{g}$ , we found this setup to work well in practice.

Experiments

We validate our approach on three real-world and one synthetic image dataset. The first, MNIST (LeCun et al., 1998), is regarded as a standard sanity check for newly proposed generative models. It contains $70000$ examples of handwritten digits, of which $10000$ constitute a test set. Each example is a $28\times 28$ grayscale image. Although the dataset is often considered “solved” by neural decoder-based approaches (including GANs and VAEs), these approaches do not focus on recovering interpretable structure from the data. We, therefore, choose not to discard MNIST from the empirical evaluation since it is likely that additional constraints increase the difficulty of the modeling task.

The second dataset, Omniglot (Lake et al., 2015), comprises $1623$ handwritten characters from 50 alphabets. Compared to MNIST, this dataset introduces three additional challenges: higher data variability, higher complexity of symbols (e.g., disjoint subcurves) and fewer (only 20) data points per symbol class.

Since both MNIST and Omniglot represent a restricted line drawing domain, we diversify our set of experiments by testing the proposed method on CelebA (Liu et al., 2015). The dataset contains over $200000$ color headshots of celebrities with large variation in poses, backgrounds and lighting conditions.

Lastly, we are interested in evaluating our approach on the task of unsupervised 3D scene understanding which is a crucial precursor for manipulating and reasoning about objects in the real world. To that end, we created a procedural dataset called MuJoCo Scenes consisting of renders of simple 3D primitives (up to 5 objects) scattered around a square platform (see Figure 10). The training set is comprised of $50000$ RGB images generated by means of the MuJoCo environment discussed in the next section.

In each case, we rescale the images to $64\times 64$ , which allows us to reuse the same network architectures in all the experiments, demonstrating the generality of our method.

2 Environments

We introduce two new rendering environments. For MNIST, Omniglot and CelebA generation we use an open-source painting library libmypaint (libmypaint contributors, 2018). The agent controls a brush and produces a sequence of (possibly disjoint) strokes on a canvas $C$ . The state of the environment is comprised of the contents of $C$ as well as the current brush location $l_{t}$ . Each action $a_{t}$ is a tuple of 8 discrete decisions $(a_{t}^{1},a_{t}^{2},\ldots,a_{t}^{8})$ (see Figure 3). The first two components are the control point $p_{c}$ and the end point $l_{t+1}$ of the stroke, which is specified as a quadratic Bézier curve:

where $\tau\in$ . In our experiments, we define the valid range of locations as a $32\times 32$ grid imposed on $C$ . We set $l_{0}$ to the upper left corner of the canvas. The next 5 components represent the appearance of the stroke: the pressure that the agent applies to the brush (10 levels), the brush size, and the stroke color characterized by mixture of red, green and blue (20 bins for each color component). The last element of $a_{t}$ is a binary flag specifying the type of action: the agent can choose either to produce a stroke or to jump right to $l_{t+1}$ . For grayscale datasets (MNIST and Omniglot), we omit the color components.

In the MuJoCo Scenes experiment, we render images using a MuJoCo-based environment (Todorov et al., 2012). At each time step, the agent has to decide on the object type (4 options), its location on a $16\times 16$ grid, its size (3 options) and the color (3 color components with 4 bins each). The resulting tuple is sent to the environment, which adds an object to the scene according to the specification. Additionally, the agent can decide to skip a move or change the most recently emitted object. All three types of actions are illustrated in Figure 2(a).

3 MNIST

For the MNIST dataset, we conduct two sets of experiments. In the first set, we train an unconditional agent to model the data distribution. Along with the reward provided by the discriminator we also use auxiliary penalties expressing our inductive biases for the particular type of data. To encourage the agent to draw a digit in a single continuous motion of the brush, we provide a small negative reward for starting each continuous sequence of strokes. We also found it beneficial to penalize our model for not producing any visible strokes at all. The resulting agent manages to generate samples clearly recognizable as hand-written digits. Examples of such generations are shown in Figure 4(a).

Following (Sharma et al., 2017), we also train a “blind” version of the agent, i.e., we do not feed intermediate canvas states as an input to $\pi$ . That means that the model cannot rely on reactive behaviour since it does not “see” the immediate consequences of its decisions. The training curve for this experiment is shown in Figure 8(a) (dotted blue line). Although the agent does not reach the level of performance of the full model, it can still produce sensible reconstructions which suggests that our approach could be used in the more general setting of program synthesis, where access to intermediate states of the execution pipeline is not assumed.

4 Omniglot

In the previous section, we showed that our approach works reasonably well for handwritten digits. In this series of experiments, we test our agent in a similar but more challenging setting of handwritten characters. The difficulty of the dataset manifests itself in lower quality of unconditional generations (Figure 5(a)). Note that this task appears to be hard for other neural network based approaches as well: models that do produce good samples, such as (Rezende et al., 2016), do not do so in a manner that mimics actual strokes.

Since Omniglot contains a highly diverse set of symbols, over the course of training our model could learn a general notion of image reproduction rather than simply memorizing dataset-specific strokes. In order to test this, we feed a trained agent with previously unseen line drawings. The resulting reconstructions are shown in Figure 6. The agent handles out-of-domain images well, although it is slightly better at reconstructing the Omniglot test set.

5 CelebA

Although blurry, the model’s reconstruction closely matches the high-level structure of each image. For instance the background color, the position of the face and the color of the person’s hair. In some cases, shadows around eyes and the nose are visible. However, we observe that our model tends to generate strokes in the first half of the episode that are fully occluded by strokes in the second half. We hypothesize that this phenomenon is a consequence of credit assignment being quite challenging in this task. One possible remedy is to provide the agent with a mid-episode reward for reproducing a blurred version of the target image. We leave this prospect for future work.

6 MuJoCo Scenes

We note that our agent has to deal with a high-cardinality action space intractable for a brute-force search. Indeed, the total number of possible execution traces is $M^{N}$ , where $M=4\cdot 16^{2}\cdot 3\cdot 4^{3}\cdot 3$ is the total number of attribute settings for a single object (see Section 4.2 for details) and $N=20$ is the length of an episode.The actual number of scene configurations is smaller but still intractable. In order to demonstrate the computational hardness of the task, we ran a general-purpose Metropolis-Hastings inference algorithm on a set of 100 images. The algorithm samples an execution trace defining attributes for a maximum of 20 primitives. These attributes are treated as latent variables. During each time step of inference, a block of attributes (including the presence/absence flag) corresponding to a single object is flipped uniformly within appropriate ranges. The resulting trace is rendered by the environement into an output sample which is then accepted or rejected using the Metropolis-Hastings update rule, with a Gaussian likelihood centered around the test image and a fixed diagonal covariance of $0.25$ . As shown in Figure 9, the MCMC search baseline was unable to solve the task even after a large number of evaluations.

Discussion

Scaling visual program synthesis to real world and combinatorial datasets has been a challenge. We have shown that it is possible to train an adversarial generative agent employing black-box rendering simulators. Our results indicate that using the Wasserstein discriminator’s output as a reward function with asynchronous reinforcement learning can provide a scaling path for visual program synthesis. The current exploration strategy used in the agent is entropy-based, but future work should address this limitation by employing sophisticated search algorithms for policy improvement. For instance, Monte Carlo Tree Search can be used, analogous to AlphaGo Zero (Silver et al., 2017). General-purpose inference algorithms could also be used for this purpose.

Future work should explore different parameterizations of action spaces. For instance, the use of two arbitrary control points is perhaps not the best way to represent strokes, as it is hard to deal with straight lines. Actions could also directly parametrize 3D surfaces, planes and learned texture models to invert richer visual scenes. On the reward side, using a joint image-action discriminator similar to BiGAN/ALI (Donahue et al., 2016; Dumoulin et al., 2016) (in this case, the policy can viewed as an encoder, while the renderer becomes a decoder) could result in a more meaningful learning signal, since $D$ will be forced to focus on the semantics of the image.

We hope that this paper provides an avenue to further explore inverse simulation and program synthesis on applications ranging from vision, graphics, speech, music and scientific simulators.

Acknowledgements

We would like to thank Mike Chrzanowski, Michael Figurnov, David Warde-Farley, Pushmeet Kohli, Sasha Vezhnevets and Suman Ravuri for helping with the manuscript preparation as well as Sergey Bartunov, Ian Goodfellow, Jacob Menick, Lasse Espeholt, Ivo Danihelka, Junyoung Chung, Çağlar Gülçehre, Dzmitry Bahdanau and Koray Kavukcuoglu for insightful discussions and support.

References

Appendix A Optimal D𝐷D for Conditional Generation

It turns out that in the case of conditional generation (i.e., $p_{d}$ is a Dirac $\delta$ -function), we can derive an explicit form of the optimal (non-parametric) discriminator $D$ . Indeed, (1) corresponds to the dual representation of the Wasserstein-1 metric (Villani, 2008). The primal form of that metric is defined as

where $\Gamma(p_{g},p_{d})$ is a set of all couplings between $p_{g}$ and $p_{d}$ . Taking into account that the data distribution is a point mass, we can simplify (8):

The expression above gives the optimal value for $\mathcal{L}_{D}$ in (1). Therefore $D({\mathbf{x}})=\left\lVert{\mathbf{x}}-\mathbf{x}_{\mbox{\scriptsize target}}\right\rVert_{2}$ is a solution of (1):

where the last term ( $R$ ) is zero since the Euclidean distance belongs to the set of 1-Lipschitz functions.

and $\operatorname{Lip}(D)\leq 1$ . One example of such function would be a hyperplane $H$ containing the segment (11). Let us now consider a set of points

One other reason why discriminator training is different from using a fixed image distance is that in practice, we do not optimize the exact dual formulation of the Wasserstein distance and, on top of that, use stochastic gradient descent methods which we do not run until convergence. A toy example illustrating that difference is presented in Figure 11.

Appendix B Network Architectures

The policy network (shown in Figure 12) takes the observation (i.e., the current state of the canvas $C_{t}$ ) and conditions it on a tuple corresponding to the last performed action $a_{t}$ . The resulting features are then downsampled to a lower-dimensional spatial resolution by means of strided convolutions and passed through a stack of ResNet blocks (He et al., 2016) followed by a fully-connected layer. This yields an embedding which we feed into an LSTM (Hochreiter & Schmidhuber, 1997). The LSTM produces a hidden vector $z_{0}$ serving as a seed for the action sampling procedure described below.

In order to obtain $a_{t+1}$ , we employ an autoregressive decoder depicted in Figure 13. Each component $a_{t+1}^{i}$ is sampled from a categorical distribution whose parameters are computed as a function of $z_{i}$ . We use two kinds of functions depending on whether $a_{t+1}^{i}$ corresponds to a scalar (e.g., brush size) or to a spatial location (e.g., a control point of a Bézier curve). In the scalar case, $z_{i}$ is transformed by a fully-connected layer, otherwise we process it using several ResNet blocks followed by a series of transpose convolutions and a final convolution. After $a_{t+1}^{i}$ is sampled, we obtain an updated hidden vector $z_{i+1}$ by embedding $a_{t+1}^{i}$ into a 16-dimensional code and combining it with $z_{i}$ . The procedure is repeated until the entire action tuple has been generated.

For the discriminator network, we use a conventional architecture similar to DCGAN (Radford et al., 2015).

Appendix C Training Details

Following standard practice in the GAN literature, we optimize the discriminator objective using Adam (Kingma & Ba, 2014) with a learning rate of $10^{-4}$ and $\beta_{1}$ set to $0.5$ . For generator training, we employ population-based exploration of hyperparameters (PBT) (Jaderberg et al., 2017) to find values for the entropy loss coefficient and learning rate of the policy learner. A population contains 12 training instances with each instance running 64 CPU actor jobs and 2 GPU jobs (1 for the policy learner and 1 for the discriminator learner). We assume that discriminator scores are compatible across different instances and use them as a measure of fitness in the exploitation phase of PBT.

The batch size is set to 64 on both the policy learner and discriminator learner. The generated data is sampled uniformly from a replay buffer with a capacity of 20 batches.