StyleRig: Rigging StyleGAN for 3D Control over Portrait Images

Ayush Tewari, Mohamed Elgharib, Gaurav Bharaj, Florian Bernard, Hans-Peter Seidel, Patrick Pérez, Michael Zollhöfer, Christian Theobalt

cs.CV cs.GR

Introduction

Photorealistic synthesis of portrait face images has applications in many fields, including special effects, extended reality, virtual worlds, and next-generation communication. During the content creation process for such applications, artist control over the face rig’s semantic parameters, such as geometric identity, expressions, reflectance, or scene illumination, is desired. The computer vision and graphics communities have a rich history of modeling face rigs. These models provide artist-friendly control (often called a face rig) while navigating the various parameters of a morphable face model (3DMM). Such methods are often limited by the lack of training data and, more importantly, by the lack of photorealism in the final rendering.

High-quality face geometry datasets can be obtained through 3D face scanning techniques. However, models derived from these datasets are bound by the diversity of the scanned faces, which may limit generalization over the rich semantic parameterization of human faces. Further, deep learning-based models trained on in-the-wild data also often rely on data-driven priors and other forms of regularization obtained from scan-based datasets. With respect to photorealism, perceptual losses recently showed an improvement of face modeling quality over existing methods. However, they still do not yield photorealistic face renderings. Mouth interiors, hair, and eyes, let alone the image background, are often not modeled by such approaches. Generative Adversarial Networks (GANs) have lately achieved photorealism, especially for faces. Karras et al. show that through a progressive growth of the GAN’s generator and discriminator, one can better stabilize and speed up training. When trained on the CelebA-HQ dataset, this yields a remarkable level of photorealism for faces. Their approach also shows how photorealistic face images of non-existent people can be sampled from the learned GAN distribution. Building on Karras et al., StyleGAN uses ideas from the style transfer literature and proposes an architecture capable of disentangling various face attributes. Promising results of control over various attributes, including coarse (hair, geometry), medium (expressions, facial hair), and fine (color distribution, freckles) attributes, were shown. However, these controllable attributes are not semantically well defined, and contain several similar yet entangled semantic attributes. For example, both coarse and medium level attributes contain face identity information. In addition, the coarse levels contain several entangled attributes such as face identity and head pose.

We present a novel solution to rig StyleGAN using a semantic parameter space for faces. Our approach combines the best of both worlds: the controllable parametric nature of existing morphable face models, and the high photorealism of generative face models. We employ a fixed and pretrained StyleGAN and do not require more data for training. Our focus is to provide computer graphics style rig-like control over the various semantic parameters. Our novel training procedure is based on a self-supervised two-way cycle consistency loss that is enabled by the combination of a face reconstruction network with a differentiable renderer. This allows us to measure the photometric rerendering error in the image domain and leads to high quality results. We show compelling results of our method, including interactive control of StyleGAN generated imagery as well as image synthesis conditioned on well-defined semantic parameters.

Related Work

In the following, we discuss deep generative models for the synthesis of imagery with a focus on faces, as well as 3D parametric face models. For an in-depth overview of parametric face models and their possible applications, we refer to the recent survey papers.

Generative adversarial networks (GANs) contain two main blocks: a generator and a discriminator. The generator takes a noise vector as input and produces an output that tries to fool the discriminator, whose purpose is to classify whether the output is real or fake. When the input to the network is a noise vector, the output is a sample from the learned distribution. Karras et al. show that such a noise vector can generate high-resolution photorealistic images of human faces. To achieve this, they employ a progressive strategy of slowly increasing the size of the generator and the discriminator by adding more layers during training. This enables a more stable training phase, and in turn helps learn high-resolution images of faces. StyleGAN can synthesize highly photorealistic images while allowing for more control over the output, compared to Karras et al. However, StyleGAN still suffers from a clear entanglement of semantically different attributes. Therefore, it does not provide semantic and interpretable control over the image synthesis process. The latent space of GANs has recently been explored for image editing by Jahanian et al. They can only achieve simple transformations, such as zoom and 2D translations, as they need ground truth images for each transformation during training. For faces, concurrent efforts have been made in controlling images synthesized by GANs, but they lack explicit rig-like 3D control of the generative model. Isola et al. use conditional GANs to produce image-to-image translations. Here, the input is not a noise vector, but a conditional image from a source domain, which is translated to the target domain by the generator. Their approach, however, requires paired training data. CycleGAN and UNIT learn to perform image-to-image translation from unpaired data using cycle-consistency losses. GauGAN shows interactive semantic image synthesis based on spatially adaptive normalization.
The remarkable quality achieved by GANs has inspired the development of several neural rendering applications for faces and other objects.

3D Morphable Models

3D Morphable Models (3DMMs) are commonly used to represent faces. Here, faces are parameterized by the identity geometry, expressions, skin reflectance, and scene illumination. Expressions are commonly modeled using blendshapes, and illumination is generally modeled via spherical harmonics parameters. The models are learned from 3D scans of people, or more recently from in-the-wild internet footage. The parametric nature of 3DMMs allows navigating and exploring the space of plausible faces, e.g., in terms of geometry, expressions, and so on. Thus, synthetic images can be rendered based on different parameter configurations. The rendered images, however, often look synthetic and lack photorealism. More recently, neural rendering has been used to bridge the gap between synthetic computer graphics renderings and corresponding real versions. Several methods have been proposed for fitting face models to images. Our work, however, focuses on learning-based approaches, which can be categorized into reconstruction-only techniques and techniques that combine reconstruction with model learning. MoFA projects a face into the 3DMM space using a CNN, followed by a differentiable renderer to synthesize the reconstructed face. The network is trained in a self-supervised manner based on a large collection of face images. Tran et al. use a perceptual loss to enhance the renderings of the reconstruction. RingNet and FML impose multi-image consistency losses to enforce identity similarity. RingNet also enforces identity dissimilarity between pictures of different people. Several approaches learn to reconstruct the parameters of a 3DMM by training on large-scale synthetic data. For a more comprehensive overview of all techniques, please refer to the recent survey papers.

Overview

Semantic Rig Parameters

Training Corpus

Network Architecture

The first term is a dense photometric alignment loss:

Here, $\mathbf{M}$ is a binary mask with all pixels where the face mesh is rendered set to $1$, and $\odot$ is element-wise multiplication. We also use a sparse landmark loss.
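The masked photometric term and the sparse landmark term can be illustrated with a minimal NumPy sketch. The squared-error form, the absence of any normalization, and the function names `photometric_loss` and `landmark_loss` are assumptions made for illustration; only the binary mask $\mathbf{M}$ and the element-wise product are taken from the text.

```python
import numpy as np

def photometric_loss(image, rendering, mask):
    """Squared colour error summed over pixels covered by the rendered face mesh.

    image, rendering: float arrays of shape (H, W, 3)
    mask: binary array of shape (H, W), 1 where the face mesh is rendered
    """
    # Element-wise masking; the mask is broadcast over the colour channels.
    diff = mask[..., None] * (image - rendering)
    return float(np.sum(diff ** 2))

def landmark_loss(detected, projected):
    """Squared 2D distance between detected and projected landmarks, shape (K, 2)."""
    return float(np.sum((detected - projected) ** 2))
```

Pixels outside the mask contribute nothing, so background and hair do not influence the dense term.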

RigNet Encoder. The encoder takes the latent vector $\mathbf{w}$ as input and linearly transforms it into a lower dimensional vector $\mathbf{l}$ of size $18\times 32$. Each sub-vector $\mathbf{w}_{i}$ of $\mathbf{w}$ of size $512$ is independently transformed into a sub-vector $\mathbf{l}_{i}$ of size $32$, for all $i\in\{0,\ldots,17\}$.

RigNet Decoder. The decoder transforms $\mathbf{l}$ and the input control parameters $\mathbf{p}$ into the output $\hat{\mathbf{w}}$. Similar to the encoder, we use independent linear decoders for each $\mathbf{l}_{i}$. Each layer first concatenates $\mathbf{l}_{i}$ and $\mathbf{p}$, and transforms the result into ${\mathbf{d}}_{i}$, for all $i\in\{0,\ldots,17\}$. The final output is computed as $\hat{\mathbf{w}}=\mathbf{d}+\mathbf{w}$.
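The encoder and decoder described above can be sketched in NumPy as follows. This is an illustrative sketch, not the trained network: the bias terms, the weight initialization, the dimension of $\mathbf{p}$, and the names `init_rignet` and `rignet_forward` are assumptions. The sizes $18$, $512$ and $32$ and the residual output $\hat{\mathbf{w}}=\mathbf{d}+\mathbf{w}$ follow the text.

```python
import numpy as np

N_LAYERS, W_DIM, L_DIM = 18, 512, 32  # sizes stated in the text

def init_rignet(param_dim, rng):
    """Create one independent linear encoder and decoder per latent sub-vector."""
    return {
        "enc_W": rng.normal(0.0, 0.01, (N_LAYERS, L_DIM, W_DIM)),
        "enc_b": np.zeros((N_LAYERS, L_DIM)),
        "dec_W": rng.normal(0.0, 0.01, (N_LAYERS, W_DIM, L_DIM + param_dim)),
        "dec_b": np.zeros((N_LAYERS, W_DIM)),
    }

def rignet_forward(w, p, weights):
    """w: (18, 512) latent code, p: (param_dim,) control parameters."""
    # Encoder: each w_i (512) is mapped independently to l_i (32).
    l = np.einsum("iab,ib->ia", weights["enc_W"], w) + weights["enc_b"]
    # Decoder: concatenate every l_i with the same p, then map to d_i (512).
    p_tiled = np.broadcast_to(p, (N_LAYERS, p.shape[0]))
    x = np.concatenate([l, p_tiled], axis=1)
    d = np.einsum("iab,ib->ia", weights["dec_W"], x) + weights["dec_b"]
    # Residual output: w_hat = d + w.
    return w + d
```

Because the output is a residual on $\mathbf{w}$, a decoder that outputs zero leaves the latent code unchanged.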

Self-supervised Training

Our goal is to train RigNet such that we can inject a subset of parameters into a given latent code $\mathbf{w}$. For example, we might want to inject a new head pose, while maintaining the facial identity, expression, and illumination in the original image synthesized from $\mathbf{w}$. We employ the following loss function for training:

It consists of a reconstruction loss $\mathcal{L}_{\text{rec}}$, an editing loss $\mathcal{L}_{\text{edit}}$, and a consistency loss $\mathcal{L}_{\text{consist}}$. Since we do not have ground truth for the desired modifications (our training corpus only contains one image per person), we employ self-supervision based on cycle-consistent editing and consistency losses. We optimize $\mathcal{L}_{\text{total}}$ using AdaDelta with a learning rate of $0.01$. In the following, we provide details.
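Written out, a loss composed of these three terms takes the form below. The equal, unit weighting is an assumption, since the description does not state per-term weights.

```latex
\mathcal{L}_{\text{total}}
  = \mathcal{L}_{\text{rec}}
  + \mathcal{L}_{\text{edit}}
  + \mathcal{L}_{\text{consist}}
```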

This constraint anchors the learned mapping at the right location in the latent space. Without this constraint, learning the mapping is underconstrained, which leads to a degradation in the image quality (see Sec. 8). Since $\mathcal{F}$ is pretrained and not updated, the semantics of the control space are enforced.

Cycle-Consistent Per-Pixel Editing Loss

Given two latent codes $\mathbf{w}$ and $\mathbf{v}$ with corresponding images $\mathbf{I}_{\mathbf{w}}$ and $\mathbf{I}_{\mathbf{v}}$, we transfer the semantic parameters of $\mathbf{v}$ to $\mathbf{w}$ during training. We first extract the target parameter vector $\mathbf{p}_{\mathbf{v}}=\mathcal{F}(\mathbf{v})$ using the differentiable face reconstruction network. Next, we inject a subset of the parameters of $\mathbf{p}_{\mathbf{v}}$ (the ones we want to modify) into the latent code $\mathbf{w}$ to yield a new latent code $\mathbf{\hat{w}}=\mathit{RigNet}(\mathbf{w},\mathbf{p}_{\mathbf{v}})$, so that $\mathbf{I}_{\mathbf{\hat{w}}}=\textit{StyleGAN}(\mathbf{\hat{w}})$ (ideally) corresponds to the image $\mathbf{I}_{\mathbf{w}}$, modified according to the subset of the parameters of $\mathbf{p}_{\mathbf{v}}$. For example, $\mathbf{\hat{w}}$ might retain the facial identity, expression and scene illumination of $\mathbf{w}$, but should perform the head rotation specified in $\mathbf{p}_{\mathbf{v}}$.

Since we do not have ground truth for such a modification, i.e., the image $\mathbf{I}_{\mathbf{\hat{w}}}$ is unknown, we employ supervision based on a cycle-consistent editing loss. The editing loss enforces that the latent code $\mathbf{\hat{w}}$ contains the modified parameters. We enforce this by mapping from the latent to the parameter space, $\mathbf{\hat{p}}=\mathcal{F}(\mathbf{\hat{w}})$. The regressed parameters $\mathbf{\hat{p}}$ should have the same rotation as $\mathbf{p}_{\mathbf{v}}$. We could measure this directly in the parameter space, but this has been shown not to be very effective. We also observed in our experiments that minimizing a loss in the parameter space does not lead to the desired results, since the perceptual effect of different parameters in the image space can be very different.

Instead, we employ a rerendering loss similar to the one used for differentiable face reconstruction. We take the original target parameter vector $\mathbf{p}_{\mathbf{v}}$ and replace its rotation parameters with the regressed rotation from $\mathbf{\hat{p}}$, resulting in $\mathbf{p}_{\text{edit}}$. We can now compare the rendering of $\mathbf{p}_{\text{edit}}$ to $\mathbf{I}_{\mathbf{v}}$ using the rerendering loss (see Eq. 1):

We do not use any regularization terms here. Such a loss function ensures that the rotation component of $\mathbf{p}_{\text{edit}}$ aligns with $\mathbf{I}_{\mathbf{v}}$, which is the desired output. The component of $\mathbf{p}_{\mathbf{v}}$ that is replaced from $\mathbf{\hat{p}}$ depends on the property we want to change. It could be the pose, expression, or illumination parameters.

Cycle-Consistent Per-Pixel Consistency Loss

In addition to the editing loss, we enforce consistency of the parameters that should not be changed by the performed edit operation. The regressed parameters $\mathbf{\hat{p}}$ should have the same unmodified parameters as $\mathbf{p}_{\mathbf{w}}$. As above, we impose this in terms of a rerendering loss. We take the original parameter vector $\mathbf{p}_{\mathbf{w}}$ and replace all parameters that should not be modified by the regressed ones from $\mathbf{\hat{p}}$, resulting in $\mathbf{p}_{\text{consist}}$. In the case of modifying rotation values, the parameters that should not change are the expression, illumination, and identity parameters (shape and skin reflectance). This leads to the loss function:

Siamese Training. Since we have already sampled two latent codes $\mathbf{w}$ and $\mathbf{v}$ during training, we perform the same operations in reverse order, i.e., in addition to injecting $\mathbf{p}_{\mathbf{v}}$ into $\mathbf{w}$, we also inject $\mathbf{p}_{\mathbf{w}}$ into $\mathbf{v}$. To this end, we use a Siamese network with two towers that have shared weights. This results in a two-way cycle consistency loss.
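The construction of $\mathbf{p}_{\text{edit}}$ and $\mathbf{p}_{\text{consist}}$ and the two-way application can be sketched as follows. This is a schematic sketch under stated assumptions: parameters are stored in a dictionary keyed by component name, and `regress` (standing in for $\mathcal{F}$), `rignet`, and `rerender_loss` (render a parameter set and compare it to the image synthesized from a latent code) are hypothetical callables supplied by the caller. The reconstruction term is omitted.

```python
from typing import Dict, Sequence
import numpy as np

Params = Dict[str, np.ndarray]

def mix(base: Params, donor: Params, names: Sequence[str]) -> Params:
    """Copy `base`, taking the listed components from `donor` instead."""
    out = dict(base)
    for name in names:
        out[name] = donor[name]
    return out

def one_way_loss(w, v, edited, regress, rignet, rerender_loss) -> float:
    """Inject the parameters of v into w and score the edit without ground truth."""
    p_w, p_v = regress(w), regress(v)
    p_hat = regress(rignet(w, p_v))        # parameters regressed from the edited code
    kept = [k for k in p_w if k not in edited]
    p_edit = mix(p_v, p_hat, edited)       # rendered and compared against I_v
    p_consist = mix(p_w, p_hat, kept)      # rendered and compared against I_w
    return rerender_loss(p_edit, v) + rerender_loss(p_consist, w)

def two_way_loss(w, v, edited, regress, rignet, rerender_loss) -> float:
    """Siamese variant: the same shared functions are applied in both directions."""
    args = (edited, regress, rignet, rerender_loss)
    return one_way_loss(w, v, *args) + one_way_loss(v, w, *args)
```

Passing the same `rignet` callable in both directions mirrors the shared weights of the two towers.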

Results

At test time, StyleRig allows control over the pose, expression, and illumination parameters of StyleGAN generated images. We demonstrate the efficacy of our approach with three applications: Style Mixing (8.1), Interactive Rig Control (8.2) and Conditional Image Generation (8.3).

Interactive Rig Control

Since the parameters of the 3DMM can also be controlled independently, StyleRig allows for explicit semantic control of StyleGAN generated images. We develop a user interface where a user can interact with a face mesh by interactively changing its pose, expression, and scene illumination parameters. These updated parameters are then fed into RigNet to generate new images at interactive frame rates ($\sim 5$ fps). Fig. 1 shows the results for various controls over StyleGAN images: pose, expression, and illumination edits. The control rig carries out the edits in a smooth interactive manner. Please refer to the supplemental video for more results.

The interactive editor allows us to easily inspect the trained networks. We observe that while the network does a good job at most controls, some expressivity of the 3D parametric face model is lost. That is, RigNet cannot transfer all modes of parametric control to similar changes in the StyleGAN generated images. For example, we notice that in-plane rotation of the face mesh is ignored. Similarly, many expressions of the face mesh do not translate well into the resultant generated images. We attribute these problems to the bias in the images StyleGAN has been trained on. To analyze these modes, we look at the distribution of face model parameters in our training data, generated from StyleGAN, see Fig. 6. We notice that in-plane rotations (rotation around the Z-axis) are hardly present in the data. In fact, most variation is only around the Y-axis. This could be because StyleGAN is trained on the Flickr-Faces-HQ (FFHQ) dataset. Most static images of faces in such a dataset would not include in-plane rotations. The same reasoning can be applied to expressions, where most generated images consist of either neutral or smiling/laughing faces. These expressions can be captured using up to three blendshapes. Even though the face rig contains $64$ blendshape vectors, we cannot control them well because of the biases in the distribution of the training data. Similarly, the lighting conditions are also limited in the dataset. We note that there are larger variations in the global color and azimuth dimensions, as compared to the other dimensions. Our approach provides an intuitive and interactive user interface which allows us to inspect not only StyleRig, but also the biases present in StyleGAN.

Conditional Image Generation

Explicit and implicit control of a pretrained generative model allows us to turn it into a conditional one. We can simply fix the pose, expression, or illumination inputs to RigNet in order to generate images which correspond to the specified parameters, see Fig. 7. This is a straightforward way to convert an unconditional generative model into a conditional model, and can produce high-resolution photorealistic results. It is also very efficient, as it takes us less than $24$ hours to train StyleRig, while training a conditional generative model from scratch should take at least as much time as StyleGAN, which takes more than $41$ days to train (both numbers are for an Nvidia Volta GPU).

Comparisons to Baseline Approaches

In the following, we compare our approach with several baseline approaches.

“Steering” the latent vector. Inspired by Jahanian et al., we design a network architecture which tries to steer the StyleGAN latent vector based on the change in parameters. This network architecture does not use the latent vector $\mathbf{w}$ as an input, and thus does not require an encoder. The inputs to the network are the delta in the face model parameters, with the output being the delta in the latent vector. In our settings, such an architecture does not lead to desirable results, as the network is unable to deform the geometry of the faces, see Fig. 8. Thus, the semantic deltas in latent space should also be conditional on the latent vectors, in addition to the target parameters.
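The structural difference between the steering baseline and a latent-conditioned edit can be made concrete with a small sketch. The single linear maps `A` and `B`, the flattened latent vector, and the function names are assumptions chosen for illustration; the text only specifies which inputs each network receives.

```python
import numpy as np

def steer(w, delta_p, A):
    """Steering baseline: the latent offset is a function of delta_p only."""
    return w + A @ delta_p

def conditioned_edit(w, p_target, B):
    """Conditioned alternative: the offset also depends on the latent code w."""
    return w + B @ np.concatenate([w, p_target])
```

With `steer`, every latent code receives the same offset for a given parameter delta, whereas `conditioned_edit` can produce a different offset for each code.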

Different Loss Functions. As explained in Eq. 2, our loss function consists of three terms. For the first baseline, we switch off the reconstruction loss. This can lead to the output latent vectors drifting from the space of StyleGAN latent codes, thus resulting in non-face images. Next, we switch off the consistency loss. This loss term enforces the consistency of all face model parameters, other than the one being changed. Without this term, changing one dimension, for example the illumination, also changes others such as the head pose. Our final model ensures the desired edits with consistent identity and scene information. Note that switching off the editing loss is not a good baseline, as it would not add any control over the generator.

Simultaneous Parameter Control

In addition to controlling different parameters independently, we can also control them simultaneously. To this end, we train RigNet such that it receives target pose, expression, and illumination parameters as input. For every $(\mathbf{w},\mathbf{v})$ training code vector pair, we sample three training samples. Here, one out of the three parameters (pose, expression or illumination) is changed in each sample. We then use the loss function defined in Eq. 2 for each such sample. Thus, RigNet learns to edit each dimension of the control space independently, while also being able to combine the edits using the same network. Fig. 9 shows mixing results where pose, expression and illumination parameters are transferred from the source to target images.
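The per-pair sampling described above can be sketched as follows. This sketch assumes that each sample's target parameters are those of $\mathbf{w}$ with exactly one control replaced by the corresponding value from $\mathbf{v}$; the dictionary layout and the name `samples_for_pair` are hypothetical.

```python
from typing import Dict, List, Tuple
import numpy as np

Params = Dict[str, np.ndarray]
CONTROLS = ("pose", "expression", "illumination")

def samples_for_pair(p_w: Params, p_v: Params) -> List[Tuple[str, Params]]:
    """Return three (edited control, target parameters) samples for one (w, v) pair."""
    samples = []
    for name in CONTROLS:
        target = dict(p_w)         # start from the parameters of w
        target[name] = p_v[name]   # change exactly one control per sample
        samples.append((name, target))
    return samples
```

Each of the three samples would then be scored with the same training loss, so a single network sees edits of every control.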

Limitations

While we have demonstrated high quality semantic control of StyleGAN-generated facial imagery, our approach is still subject to a few limitations that can be addressed in follow-up work. In the analysis sections, we have already discussed that StyleRig is not able to exploit the full expressivity of the parametric face model. This provides a nice insight into the inner workings of StyleGAN and allows us to introspect the biases it learned. In the future, this might lead the way to designing better generative models. Our approach is also limited by the quality of the employed differentiable face reconstruction network. Currently, this model does not allow us to reconstruct fine-scale detail, thus we cannot explicitly control it. Finally, there is no explicit constraint that tries to preserve parts of the scene that are not explained by the parametric face model, e.g., the background or hair style. Therefore, these parts cannot be controlled and might change when editing the parameters.

Conclusion

We have proposed StyleRig, a novel approach that provides face rig-like control over a pretrained and fixed StyleGAN network. Our network is trained in a self-supervised manner and does not require any additional images or manual annotations. At test time, our method generates images of faces with the photorealism of StyleGAN, while providing explicit control over a set of semantic control parameters. We believe that the combination of computer graphics control with deep generative models enables many exciting editing applications, provides insights into the inner workings of the generative model, and will inspire follow-up work.

Acknowledgements: We thank True-VisionSolutions Pty Ltd for providing the 2D face tracker. This work was supported by the ERC Consolidator Grant 4DReply (770784), the Max Planck Center for Visual Computing and Communications (MPC-VCC), and by Technicolor.