Encoding in Style: a StyleGAN Encoder for Image-to-Image Translation

Elad Richardson, Yuval Alaluf, Or Patashnik, Yotam Nitzan, Yaniv Azar, Stav Shapiro, Daniel Cohen-Or

Introduction

In recent years, Generative Adversarial Networks (GANs) have significantly advanced image synthesis, particularly on face images. State-of-the-art image generation methods have achieved high visual quality and fidelity, and can now generate images with phenomenal realism. Most notably, StyleGAN proposes a novel style-based generator architecture and attains state-of-the-art visual quality on high-resolution images. Moreover, it has been demonstrated to have a disentangled latent space, $\mathcal{W}$, which offers control and editing capabilities.

Recently, numerous methods have shown competence in controlling StyleGAN’s latent space and performing meaningful manipulations in $\mathcal{W}$. These methods follow an “invert first, edit later” approach, where one first inverts an image into StyleGAN’s latent space and then edits the latent code in a semantically meaningful manner to obtain a new code that is then used by StyleGAN to generate the output image. However, it has been shown that inverting a real image into a 512-dimensional vector $\mathbf{w}\in\mathcal{W}$ does not lead to an accurate reconstruction. Motivated by this, it has become common practice to encode real images into an extended latent space, $\mathcal{W}+$, defined by the concatenation of 18 different 512-dimensional $\mathbf{w}$ vectors, one for each input layer of StyleGAN. These works usually resort to per-image optimization over $\mathcal{W}+$, requiring several minutes for a single image. To accelerate this optimization process, some methods have trained an encoder to infer an approximate vector in $\mathcal{W}+$, which serves as a good initial point from which additional optimization is required. However, a fast and accurate inversion of real images into $\mathcal{W}+$ remains a challenge.

In this paper, we first introduce a novel encoder architecture tasked with encoding an arbitrary image directly into $\mathcal{W}+$. The encoder is based on a Feature Pyramid Network, where style vectors are extracted from different pyramid scales and inserted directly into a fixed, pretrained StyleGAN generator in correspondence to their spatial scales. We show that our encoder can directly reconstruct real input images, allowing one to perform latent space manipulations without requiring time-consuming optimization. While these manipulations allow for extensive editing of real images, they are inherently limited. That is because the input image must be invertible, i.e., there must exist a latent code that reconstructs the image. This requirement is a severe limitation for tasks, such as conditional image generation, where the input image does not reside in the same StyleGAN domain. To overcome this limitation, we propose using our encoder together with the pretrained StyleGAN generator as a complete image-to-image translation framework. In this formulation, input images are directly encoded into the desired output latents, which are then fed into StyleGAN to generate the desired output images. This allows one to utilize StyleGAN for image-to-image translation even when the input and output images are not from the same domain.

While many previous approaches to solving image-to-image translation tasks involve dedicated architectures specific to a single problem, we follow the spirit of pix2pix and define a generic framework able to solve a wide range of tasks, all using the same architecture. Besides the simplification of the training process, as no adversarial discriminator needs to be trained, using a pretrained StyleGAN generator offers several intriguing advantages over previous works. For example, many image-to-image architectures explicitly feed the generator with residual feature maps from the encoder, creating a strong locality bias. In contrast, our generator is governed only by the styles, with no direct spatial input. Another notable advantage of the intermediate style representation is the inherent support for multi-modal synthesis for ambiguous tasks such as image generation from sketches, segmentation maps, or low-resolution images. In such tasks, the generated styles can be resampled to create variations of the output image with no change to the architecture or training process. In a sense, our method performs pixel2style2pixel translation, as every image is first encoded into style vectors and then into an image, and is therefore dubbed pSp.

The main contributions of this paper are: (i) a novel StyleGAN encoder able to directly encode real images into the $\mathcal{W}+$ latent domain; and (ii) a new methodology for utilizing a pretrained StyleGAN generator to solve image-to-image translation tasks.

Related Work

With the rapid evolution of GANs, many works have tried to understand and control their latent space. A specific task that has received substantial attention is GAN Inversion, which was first introduced by Zhu et al. In this task, one seeks the latent vector from which a pretrained GAN most accurately reconstructs a given, known image. Motivated by its state-of-the-art image quality and latent space semantic richness, many recent works have used StyleGAN for this task. Generally, inversion methods either directly optimize the latent vector to minimize the error for the given image, train an encoder to map the given image to the latent space, or use a hybrid approach combining both. Typically, methods performing optimization are superior in reconstruction quality to a learned encoder mapping, but require substantially more time. Unlike the above methods, our encoder can accurately and efficiently embed a given face image into the extended latent space $\mathcal{W}+$ with no further optimization.

Recently, numerous papers have presented diverse methods for learning semantic edits of the latent code. One popular approach is to find linear directions that correspond to changes in a given binary labeled attribute, such as young $\leftrightarrow$ old, or no-smile $\leftrightarrow$ smile. Tewari et al. utilize a pretrained 3DMM to learn semantic face edits in the latent space. Jahanian et al. find latent space paths that correspond to a specific transformation, such as zoom or rotation, in a self-supervised manner. Härkönen et al. find useful paths in an unsupervised manner by using the principal component axes of an intermediate activation space. Collins et al. perform local semantic editing by manipulating corresponding components of the latent code. These methods generally follow an “invert first, edit later” procedure, where an image is first embedded into the latent space, and then its latent is edited in a semantically meaningful manner. This differs from our approach, which directly encodes input images into the corresponding output latents, allowing one to also handle inputs that do not reside in the StyleGAN domain.

Image-to-Image translation techniques aim at learning a conditional image generation function that maps an input image of a source domain to a corresponding image of a target domain. Isola et al. first introduced the use of conditional GANs to solve various image-to-image translation tasks. Since then, their work has been extended to many scenarios: high-resolution synthesis, unsupervised learning, multi-modal image synthesis, and conditional image synthesis. The aforementioned works have constructed dedicated architectures, which require training the generator network and generally do not generalize to other translation tasks. This is in contrast to our method, which uses the same architecture for solving a variety of tasks.

The pSp Framework

Our pSp framework builds upon the representative power of a pretrained StyleGAN generator and the $\mathcal{W}+$ latent space. To utilize this representation, one needs a strong encoder that is able to match each input image to an accurate encoding in the latent domain. A simple technique to embed into this domain is directly encoding a given input image into $\mathcal{W}+$ using a single 512-dimensional vector obtained from the last layer of the encoder network, thereby learning all 18 style vectors together. However, such an architecture presents a strong bottleneck, making it difficult to fully represent the finer details of the original image, and is therefore limited in reconstruction quality.

In StyleGAN, the authors have shown that the different style inputs correspond to different levels of detail, which are roughly divided into three groups: coarse, medium, and fine. Following this observation, in pSp we extend an encoder backbone with a feature pyramid, generating three levels of feature maps from which styles are extracted using a simple intermediate network, map2style, shown in Figure 2. The styles, aligned with the hierarchical representation, are then fed into the generator in correspondence to their scale to generate the output image, thus completing the translation from input pixels to output pixels through the intermediate style representation. The complete architecture is illustrated in Figure 2.
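To make the three-way grouping concrete, the following minimal sketch assigns each of the 18 style inputs to a pyramid level. The exact boundaries are an assumption inferred from the layer ranges quoted elsewhere in the text (layers 4-7 described as medium-level, layers 8-18 as fine-level); the function name `pyramid_level` is hypothetical and not taken from the authors' code.

```python
def pyramid_level(style_index: int) -> str:
    """Return the feature-pyramid level assumed to produce a given style input.

    Indices are 1-based, matching the layer ranges quoted in the text.
    The boundaries (1-3, 4-7, 8-18) are an assumption, not a confirmed detail.
    """
    if not 1 <= style_index <= 18:
        raise ValueError("style_index must be in 1..18")
    if style_index <= 3:
        return "coarse"  # smallest feature map, global structure
    if style_index <= 7:
        return "medium"  # intermediate feature map
    return "fine"        # largest feature map, fine details


# Group all 18 style inputs by the level that would feed them.
groups = {"coarse": [], "medium": [], "fine": []}
for index in range(1, 19):
    groups[pyramid_level(index)].append(index)
```

Under this assumed split, three styles come from the coarse level, four from the medium level, and eleven from the fine level.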

As in StyleGAN, we further define $\overline{\mathbf{w}}$ to be the average style vector of the pretrained generator. Given an input image, $\mathbf{x}$, the output of our model is then defined as

$$pSp(\mathbf{x}) := G\left(E(\mathbf{x}) + \overline{\mathbf{w}}\right),$$

where $E(\cdot)$ and $G(\cdot)$ denote the encoder and StyleGAN generator, respectively. In this formulation, our encoder aims to learn the latent code with respect to the average style vector. We find that this results in better initialization.
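The formulation above can be sketched in a few lines. This is only an illustration under stated assumptions: the encoder and generator are passed in as plain callables, the toy stand-ins below are hypothetical, and the style dimension is shrunk from 512 to 4 for readability.

```python
from typing import Callable, List

Styles = List[List[float]]  # one style vector per generator input layer


def psp_forward(
    x: object,
    encoder: Callable[[object], Styles],
    generator: Callable[[Styles], object],
    w_avg: List[float],
) -> object:
    """Compute G(E(x) + w_avg), adding the average style to every layer."""
    offsets = encoder(x)
    styles = [[o + a for o, a in zip(layer, w_avg)] for layer in offsets]
    return generator(styles)


# Toy stand-ins (hypothetical): 18 layers, 4 dimensions instead of 512.
toy_encoder = lambda image: [[0.0] * 4 for _ in range(18)]
toy_generator = lambda styles: styles  # identity, so the styles can be inspected
w_avg = [0.5, -0.5, 1.0, 0.0]
output = psp_forward("image", toy_encoder, toy_generator, w_avg)
```

With an encoder that outputs all zeros, every layer receives exactly the average style, which illustrates why learning relative to the average gives the encoder a reasonable starting point.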

While the style-based translation is the core part of our framework, the choice of losses is also crucial. Our encoder is trained using a weighted combination of several objectives. First, we utilize the pixel-wise $\mathcal{L}_{2}$ loss,

$$\mathcal{L}_{2}(\mathbf{x}) = \left\| \mathbf{x} - pSp(\mathbf{x}) \right\|_{2}.$$

In addition, to learn perceptual similarities, we utilize the LPIPS loss, which has been shown to better preserve image quality compared to the more standard perceptual loss:

$$\mathcal{L}_{\text{LPIPS}}(\mathbf{x}) = \left\| F(\mathbf{x}) - F(pSp(\mathbf{x})) \right\|_{2},$$

where $F(\cdot)$ denotes the perceptual feature extractor.

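To encourage the encoder to output latent style vectors closer to the average latent vector, we additionally define the following regularization loss:

$$\mathcal{L}_{\text{reg}}(\mathbf{x}) = \left\| E(\mathbf{x}) - \overline{\mathbf{w}} \right\|_{2}.$$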

Similar to the truncation trick introduced in StyleGAN, we find that adding this regularization in the training of our encoder improves image quality without harming the fidelity of our outputs, especially in some of the more ambiguous tasks explored below.

Finally, a common challenge when handling the specific task of encoding facial images is the preservation of the input identity. To tackle this, we incorporate a dedicated recognition loss measuring the cosine similarity between the output image and its source,

$$\mathcal{L}_{\text{ID}}(\mathbf{x}) = 1 - \left\langle R(\mathbf{x}), R(pSp(\mathbf{x})) \right\rangle,$$

where $R$ is the pretrained ArcFace network.
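As a small illustration of the identity term, the sketch below computes one minus the cosine similarity of two feature vectors. The vectors stand in for the recognition network's embeddings of the source and output images; explicitly normalizing them inside the function is an assumption, and the function name is hypothetical.

```python
import math
from typing import Sequence


def identity_loss(feat_source: Sequence[float], feat_output: Sequence[float]) -> float:
    """Return 1 - cosine similarity between two recognition embeddings.

    Both vectors are normalized here (an assumption), so the result is
    0 for embeddings pointing in the same direction and 1 for orthogonal ones.
    """
    dot = sum(a * b for a, b in zip(feat_source, feat_output))
    norm_source = math.sqrt(sum(a * a for a in feat_source))
    norm_output = math.sqrt(sum(b * b for b in feat_output))
    return 1.0 - dot / (norm_source * norm_output)


# Hypothetical embeddings: same direction gives a loss near 0.
same_direction = identity_loss([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = identity_loss([1.0, 0.0], [0.0, 1.0])
```

Because only the direction of the embedding matters, the loss ignores differences in embedding magnitude between the two images.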

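In summary, the total loss function is defined as

$$\mathcal{L}(\mathbf{x}) = \lambda_{1}\mathcal{L}_{2}(\mathbf{x}) + \lambda_{2}\mathcal{L}_{\text{LPIPS}}(\mathbf{x}) + \lambda_{3}\mathcal{L}_{\text{ID}}(\mathbf{x}) + \lambda_{4}\mathcal{L}_{\text{reg}}(\mathbf{x}),$$

where $\lambda_{1}$, $\lambda_{2}$, $\lambda_{3}$, $\lambda_{4}$ are constants defining the loss weights. This curated set of loss functions allows for more accurate encoding into StyleGAN compared to previous works and can be easily tuned for different encoding tasks according to their nature. Constants and other implementation details can be found in Appendix A.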
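The weighted combination can be sketched as a plain function of the four already-computed loss values. The function name and the example loss values are hypothetical; only the ordering of the weights (pixel-wise, LPIPS, identity, regularization) follows the text.

```python
from typing import Tuple


def total_loss(
    l2: float,
    lpips: float,
    identity: float,
    reg: float,
    lambdas: Tuple[float, float, float, float],
) -> float:
    """Weighted sum of the four objectives, in the order (l2, lpips, id, reg)."""
    lambda_1, lambda_2, lambda_3, lambda_4 = lambdas
    return lambda_1 * l2 + lambda_2 * lpips + lambda_3 * identity + lambda_4 * reg


# Hypothetical per-batch loss values; a zero weight switches a term off.
example = total_loss(l2=0.5, lpips=0.25, identity=0.2, reg=3.0,
                     lambdas=(1.0, 0.8, 0.1, 0.0))
```

Setting a weight to zero removes the corresponding objective, which is how a single training loop can serve tasks that use different subsets of the losses.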

2 The Benefits of The StyleGAN Domain

Applications and Experiments

To explore the effectiveness of our approach we evaluate pSp on numerous image-to-image translation tasks.

We start by evaluating the usage of the pSp framework for StyleGAN Inversion, that is, finding the latent code of real images in the latent domain. We compare our method to the optimization technique from Karras et al., the ALAE encoder, and the encoder from IDInvert. The ALAE method proposes a StyleGAN-based autoencoder, where the encoder is trained alongside the generator to generate latent codes. In IDInvert, images are embedded into the latent domain of a pretrained StyleGAN by first encoding the image into $\mathcal{W}+$ and then directly optimizing over the generated image to tune the latent. For a fair comparison, we compare with IDInvert where no further optimization is performed after encoding.

Figure 4 shows a qualitative comparison between the methods. One can see that the ALAE method, operating in the $\mathcal{W}$ domain, cannot accurately reconstruct the input images. While IDInvert better preserves the image attributes, it still fails to accurately preserve identity and the finer details of the input image. In contrast, our method is able to preserve identity while also reconstructing fine details such as lighting, hairstyle, and glasses.

Next, we conduct an ablation study to analyze the effectiveness of the pSp architecture. We compare our architecture to two simpler variations. First, we define an encoder generating a 512-dimensional style vector in the $\mathcal{W}$ latent domain, extracted from the last layer of the encoder network. We then expand this and define an encoder with an additional layer to transform the 512-dimensional feature vector to a full $18\times 512$ $\mathcal{W}+$ vector. Figure 5 shows that while this simple extension into $\mathcal{W}+$ significantly improves the results, it still cannot preserve the finer details generated by our architecture. In Figure 6 we show the importance of the identity loss in the reconstruction task.

Finally, Table 1 presents a quantitative evaluation measuring the different inversion methods. Compared to other encoders, pSp is able to better preserve the original images in terms of both perceptual similarity and identity. To make sure the similarity score is independent of our loss function, we utilize the CurricularFace method for evaluation.

2 Face Frontalization

Face frontalization is a challenging task for image-to-image translation frameworks due to the required non-local transformations and the lack of paired training data. RotateAndRender (R&R) overcomes this challenge by incorporating a geometric 3D alignment process before the translation process. Alternatively, we show that our style-based translation mechanism is able to overcome these challenges, even when trained with no labeled data.

For this task, training is the same as the encoder formulation with two important changes. First, we randomly flip the target image during training, effectively forcing the model to output an image that is close to both the original image and the mirrored one. The underlying idea behind this augmentation is that it guides the model to converge to a fixed frontal pose. Next, we increase $\mathcal{L}_{\text{ID}}$ and decrease the $\mathcal{L}_{2}$ and $\mathcal{L}_{\text{LPIPS}}$ losses for the outer part of the image. This change is based on the fact that for frontalization we are less interested in preserving the background region compared to the face region and the facial identity.
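The target-flipping augmentation can be sketched as follows. Images are represented as nested lists of pixel rows for simplicity, the helper names are hypothetical, and the flip probability of 0.5 is an assumption, since the text only states that the flip is random.

```python
import random
from typing import List

Image = List[List[int]]  # rows of pixel values; a real pipeline would use tensors


def flip_horizontal(image: Image) -> Image:
    """Mirror an image left-to-right by reversing every row."""
    return [row[::-1] for row in image]


def sample_target(target: Image, rng: random.Random, flip_prob: float = 0.5) -> Image:
    """Return the target or its mirror image; flip_prob = 0.5 is an assumption."""
    if rng.random() < flip_prob:
        return flip_horizontal(target)
    return target


# The input stays fixed while the target is sometimes mirrored, so the only
# output close to both targets is a pose symmetric about the vertical axis.
source = [[1, 2, 3], [4, 5, 6]]
training_target = sample_target(source, random.Random(0))
```

Since the input itself is never flipped, the model cannot satisfy both possible targets by copying the input pose, which is the intuition the paragraph gives for convergence to a frontal pose.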

Results are illustrated in Figure 7. When trained with the same data and methodology, pix2pixHD is unable to converge to satisfactory results, as it is much more dependent on the correspondence between the input and output pairs. Conversely, our method is able to handle the task successfully, generating realistic frontal faces that are comparable to the more involved R&R approach. This shows the benefit of using a pretrained StyleGAN for image translation, as it allows us to achieve visually pleasing results even with weak supervision. Table 2 provides a quantitative evaluation on the FEI Database. While R&R outperforms pSp, our simple approach provides a fast and elegant alternative, without requiring specialized modules, such as R&R’s 3DMM fitting and inpainting steps.

3 Conditional Image Synthesis

Conditional image synthesis aims at generating photo-realistic images conditioned on certain input types. In this section, our pSp architecture is tested on two conditional image generation tasks: generating high-quality face images from sketches and semantic segmentation maps. We demonstrate that, with only minimal changes, our encoder successfully utilizes the expressiveness of StyleGAN to generate high-quality and diverse outputs.

The training of the two conditional generation tasks is similar to that of the encoder, where the input is the conditioned image and the target is the corresponding real image. To generate multiple images at inference time we perform style-mixing on the fine-level features, taking layers (1-7) from the latent code of the input image and layers (8-18) from a randomly drawn w vector.
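The style-mixing step described above can be sketched as follows. The sketch assumes a latent code is a list of 18 style vectors and that the randomly drawn vector is simply repeated across the replaced layers; the function name and the tiny 4-dimensional vectors are hypothetical.

```python
from typing import List

Styles = List[List[float]]


def style_mix(encoded: Styles, random_w: List[float], first_mixed_layer: int = 8) -> Styles:
    """Keep layers 1..first_mixed_layer-1 from the encoded latent and replace
    the remaining layers with a randomly drawn style vector (1-based layers)."""
    mixed = []
    for layer_index, style in enumerate(encoded, start=1):
        if layer_index < first_mixed_layer:
            mixed.append(list(style))      # coarse and medium styles from the input
        else:
            mixed.append(list(random_w))   # fine styles from the random vector
    return mixed


# Toy latent: layer i holds the value i in every dimension (4 dims, not 512).
encoded_latent = [[float(i)] * 4 for i in range(1, 19)]
random_style = [-1.0] * 4
mixed_latent = style_mix(encoded_latent, random_style)
```

Drawing a new random vector for each call yields a different output image that shares layers 1-7, and therefore the coarse structure, with the encoded input.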

3.1 Face From Sketch

Common approaches for sketch-to-image synthesis incorporate hard constraints that require pixel-wise correspondence between the input sketch and generated image, making them ill-suited when given incomplete, sparse sketches. DeepFaceDrawing addresses this using a set of dedicated mapping networks. We show that pSp provides a simple alternative to past approaches. As there are currently no publicly available datasets representative of hand-drawn face sketches, we elect to construct our own dataset, which we describe in Appendix B.

Figure 9 compares the results of our method to those of pix2pixHD and DeepFaceDrawing. As no code release is available for DeepFaceDrawing, we compare directly with sketches and results published in their paper. While DeepFaceDrawing obtains more visually pleasing results compared to pix2pixHD, its outputs are still limited in their diversity. Conversely, although our model is trained on a different dataset, we are still able to generalize well to their sketches. Notably, we observe our ability to obtain more diverse outputs that better retain finer details (e.g. facial hair). Additional results, including those on non-frontal sketches, are provided in the Appendix.

3.2 Face from Segmentation Map

Here, we evaluate using pSp for synthesizing face images from segmentation maps. In addition to pix2pixHD, we compare our approach to two additional state-of-the-art label-to-image methods: SPADE and CC_FPSE, both of which are based on pix2pixHD.

In Figure 9 we provide a visual comparison of the competing approaches on the CelebAMask-HQ dataset containing 19 semantic categories. As the competing methods are based on pix2pixHD, the results of all three suffer from similar artifacts. Conversely, our approach is able to generate high-quality outputs across a wide range of inputs of various poses and expressions. Additionally, using our multi-modal technique, pSp can easily generate various possible outputs with the same pose and attributes but varying fine styles for a single input semantic map or sketch image. We provide examples in Figure 1 with additional results in the Appendix.

We additionally perform a human evaluation to compare the visual quality of each method presented above. Each worker is given two images synthesized by different methods on the same input and is given unlimited time to select which output looks more realistic. Each of our three workers reviews approximately 2,800 pairs for each task, resulting in over 8,400 human judgements for each method. Table 3 shows that pSp significantly outperforms the other respective methods in both synthesis tasks.

4 Extending to Other Applications

Besides the applications presented above, we have found pSp to be applicable to a wide variety of additional tasks with minimal changes to the training process. Specifically, we present samples of super-resolution and inpainting results using pSp in Figure 1 with further details and results presented in Appendix C. For both tasks, paired data is generated and training is performed in a supervised fashion. Additionally, we show multi-modal support for super-resolution via style-mixing on medium-level features and evaluate pSp on several image editing tasks, including image interpolation and local patch editing.

5 Going Beyond the Facial Domain

In this section we show that our pSp framework can be trained to solve the various tasks explored above without relying on the advantages provided by the identity loss in the facial domain. While our method does require a pretrained StyleGAN generator, recent works have shown that such a generator can be easily trained with significantly fewer examples than required in the past.

Figure 20 shows the results on the AFHQ Cat and AFHQ Dog datasets for the StyleGAN inversion and sketch-to-image tasks. For these tasks, we use a pretrained StyleGAN-ADA model for each of the two domains and train our pSp encoder using only the $\mathcal{L}_{2}$, $\mathcal{L}_{\text{LPIPS}}$, and $\mathcal{L}_{\text{reg}}$ losses with the same $\lambda$ values as those used for the facial domain. As shown, we are able to generalize well to the examined domains, obtaining high-quality, accurate reconstruction results while also supporting multi-modal synthesis via our style-mixing approach. The accompanying Appendix provides additional results for super-resolution and inpainting on these domains.

Discussion

Although our suggested framework for image-to-image translation achieves compelling results in various applications, it has some inherent assumptions that should be considered. First, the high-quality images that are generated by utilizing the pretrained StyleGAN come with a cost — the method is limited to images that can be generated by StyleGAN. Thus, generating faces which are not close to frontal, or have certain expressions may be challenging if such examples were not available when training the StyleGAN model. Also, the global approach of pSp, although advantageous for many tasks, does introduce a challenge in preserving finer details of the input image, such as earrings or background details. This is especially significant in tasks such as inpainting or super-resolution where standard image-to-image architectures can simply propagate local information. Figure 11 presents some examples of such reconstruction failures.

Conclusion

In this work, we propose a novel encoder architecture that can be used to directly map a real image into the $\mathcal{W}+$ latent space with no optimization required. There, styles are extracted in a hierarchical fashion and fed into the corresponding inputs of a fixed StyleGAN generator. Combining our encoder with a StyleGAN decoder, we present a generic framework for solving various image-to-image translation tasks, all using the same architecture. Notably, in contrast to the “invert first, edit later” approach of previous StyleGAN encoders, we show pSp can be used to directly encode these translation tasks into StyleGAN, thereby supporting input images that do not reside in the StyleGAN domain. Additionally, differing from previous works that typically rely on dedicated architectures for solving a single translation task, we show pSp to be capable of solving a wide variety of problems, requiring only minimal changes to the training losses and methodology. We hope that the ease-of-use of our approach will encourage further research into utilizing StyleGAN for real image-to-image translation tasks.

Appendix A Implementation Details

For our backbone network we use the ResNet-IR architecture, pretrained on face recognition, which accelerated convergence. We use a fixed StyleGAN2 generator trained on the FFHQ dataset. That is, only the pSp encoder network is trained on the given translation task. For all applications, the input image resolution is $256\times 256$, where the generated $1024\times 1024$ output is resized before being fed into the loss functions. Specifically for $\mathcal{L}_{\text{ID}}$, the images are cropped around the face region and resized to $112\times 112$ before being fed into the recognition network. For training, we use the Ranger optimizer, a combination of Rectified Adam with the Lookahead technique, with a constant learning rate of 0.001. Only horizontal flips are used as augmentations. All experiments are performed using a single NVIDIA Tesla P40 GPU.

For the StyleGAN inversion task, the $\lambda$ values are set as $\lambda_{1}=1$, $\lambda_{2}=0.8$, and $\lambda_{3}=0.1$. For face frontalization, we increase the weight of $\mathcal{L}_{\text{ID}}$, setting $\lambda_{3}=1$, and decrease the $\mathcal{L}_{2}$ and $\mathcal{L}_{\text{LPIPS}}$ loss functions, setting $\lambda_{1}=0.01$, $\lambda_{2}=0.8$ over the inner part of the face and $\lambda_{1}=0.001$, $\lambda_{2}=0.08$ elsewhere. Additionally, the constants used in the conditional image synthesis tasks are identical to those used in the inversion task except for the omission of the identity loss (i.e. $\lambda_{3}=0$). Finally, $\lambda_{4}$ is set to 0.005 in all applications except for the StyleGAN inversion task, which does not utilize the regularization loss.
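The per-task constants above can be collected into a small configuration table. The dictionary layout, key names, and helper function are hypothetical conveniences; only the numeric values come from the paragraph above, with the frontalization entry holding the inner-face weights and a separate entry holding the weights used elsewhere in the image.

```python
# Loss weights per task: l2 = lambda_1, lpips = lambda_2, id = lambda_3, reg = lambda_4.
LOSS_WEIGHTS = {
    "inversion": {"l2": 1.0, "lpips": 0.8, "id": 0.1, "reg": 0.0},
    "frontalization": {"l2": 0.01, "lpips": 0.8, "id": 1.0, "reg": 0.005},
    "conditional_synthesis": {"l2": 1.0, "lpips": 0.8, "id": 0.0, "reg": 0.005},
}

# Frontalization uses smaller pixel-wise and LPIPS weights outside the inner face.
FRONTALIZATION_OUTER = {"l2": 0.001, "lpips": 0.08}


def active_terms(task: str) -> list:
    """List the loss terms that have a non-zero weight for a task."""
    return [name for name, weight in LOSS_WEIGHTS[task].items() if weight != 0.0]
```

For example, `active_terms("conditional_synthesis")` omits the identity term, reflecting that a sketch or segmentation map has no identity to preserve.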

Appendix B Dataset Details

We conduct our experiments on the CelebA-HQ dataset, which contains 30,000 high-quality images. We use a standard train-test split of the dataset, resulting in approximately 24,000 training images. The FFHQ dataset, which contains 70,000 face images, is used for the StyleGAN inversion and face frontalization tasks.

For the generation of real images from sketches, we construct a dataset representative of hand-drawn sketches using the CelebA-HQ dataset. Given an input image, we first apply a “pencil sketch” filter, which retains most facial details of the original image while removing the remaining noise. We then apply a sketch-simplification method, resulting in images resembling hand-drawn sketches. The same approach is also used for generating the sketch images on the AFHQ Cat and AFHQ Dog datasets.

Appendix C Application Details

C.1 Super Resolution

In super resolution, the pSp framework is used to construct high-resolution (HR) images from corresponding low-resolution (LR) input images. PULSE approaches this task in an unsupervised manner by traversing the HR image manifold in search of an image that downsamples to the input LR image.

We train both our model and pix2pixHD in a supervised fashion, where for each input we perform random bi-cubic down-sampling of $\times 1$ (i.e. no down-sampling), $\times 2$, $\times 4$, $\times 8$, $\times 16$, or $\times 32$ and set the original, full resolution image as the target.
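The arithmetic behind these factors is easy to check with a short sketch. It assumes the $256\times 256$ input resolution stated in Appendix A and uniform sampling over the factors; the uniform choice and the helper names are assumptions, and the bi-cubic resampling itself is left to an image library.

```python
import random

DOWNSAMPLE_FACTORS = (1, 2, 4, 8, 16, 32)


def low_res_size(factor: int, full_size: int = 256) -> int:
    """Side length of the low-resolution input for a given down-sampling factor."""
    if factor not in DOWNSAMPLE_FACTORS:
        raise ValueError("unsupported down-sampling factor")
    return full_size // factor


def sample_factor(rng: random.Random) -> int:
    """Pick a factor for one training example (uniform sampling is an assumption)."""
    return rng.choice(DOWNSAMPLE_FACTORS)


# Map every factor to the resolution the encoder would see before upsampling.
resolutions = {factor: low_res_size(factor) for factor in DOWNSAMPLE_FACTORS}
```

Under these assumptions, a factor of 8 corresponds to a $32\times 32$ image, a factor of 16 to $16\times 16$, and a factor of 32 to $8\times 8$.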

Figures 12-14 demonstrate the visual quality of the resulting images from our method along with those of the previous approaches. Although PULSE is able to achieve very high-quality results due to its usage of StyleGAN to generate images, it is unable to accurately reconstruct the original image even when performing down-sampling of $\times 8$ to a resolution of $32\times 32$. By learning a pixel-wise correspondence between the LR and HR images, pix2pixHD is able to obtain satisfactory results even when down-sampled to a resolution of $16\times 16$ (i.e. $\times 16$ down-sampling). However, visually, its results appear less photo-realistic. Contrary to these previous works, we are able to obtain high-quality results even when down-sampling to resolutions of $16\times 16$ and $8\times 8$. Finally, in Figure 15 we generate multiple outputs for a given LR image using our multi-modal technique by performing style-mixing with a randomly sampled $\mathbf{w}$ vector on layers (4-7) with an $\alpha$ value of 0.5. Doing so alters medium-level styles that mainly control facial features.
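The blended style-mixing used for super-resolution can be sketched as a convex combination on the selected layers. Which of the two vectors $\alpha$ weights is an assumption (with $\alpha = 0.5$ the result is the same either way), and the function name and 4-dimensional toy vectors are hypothetical.

```python
from typing import List, Sequence

Styles = List[List[float]]


def blend_styles(
    encoded: Styles,
    random_w: Sequence[float],
    layers: Sequence[int] = (4, 5, 6, 7),
    alpha: float = 0.5,
) -> Styles:
    """Blend a random style into the chosen 1-based layers of an encoded latent.

    On the selected layers the result is alpha * random_w + (1 - alpha) * encoded;
    applying alpha to the random vector is an assumption.
    """
    blended = []
    for layer_index, style in enumerate(encoded, start=1):
        if layer_index in layers:
            blended.append([alpha * r + (1.0 - alpha) * s for r, s in zip(random_w, style)])
        else:
            blended.append(list(style))
    return blended


# Toy latent with 18 layers of 4 dimensions each (real styles are 512-dimensional).
encoded_latent = [[2.0] * 4 for _ in range(18)]
blended_latent = blend_styles(encoded_latent, [4.0] * 4)
```

Leaving layers 1-3 and 8-18 untouched keeps the coarse structure and fine styles of the encoded image while varying only the medium-level styles.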

C.2 Inpainting

In the task of inpainting, we wish to reconstruct missing or occluded regions in a given image. Due to their local nature, pix2pix and other local-based translation methods have shown success in tackling this problem, as they can simply propagate non-occluded regions.

We train both pSp and pix2pixHD in a supervised fashion, where each input image is occluded with a symmetric triangular mask.

Figure 16 presents results for both our method and pix2pixHD. As shown, due to the lack of information in the occluded regions, pix2pixHD is unable to accurately reconstruct the original image and incurs many artifacts. In contrast, since pSp is trained to encode images into realistic face latents, it is able to accurately reconstruct the occluded region, resulting in high-quality outputs with no artifacts.

C.3 Local Editing

Our framework allows for a simple approach to local image editing using a trained pSp encoder where altering specific attributes of an input sketch (e.g. eyes, smile) or segmentation map (e.g. hair) results in local edits of the generated images. We can further extend this and perform local patch editing on real face images. As shown in Figure 18, pSp is able to seamlessly merge the desired patch into the original image.

C.4 Face Interpolation

Given two real images, one can obtain their respective latent codes $w_{1},w_{2}\in\mathcal{W}+$ by feeding the images through our encoder. We can then naturally interpolate between the two images by computing their intermediate latent code $w^{\prime}=\alpha w_{1}+(1-\alpha)w_{2}$ for $0\leq\alpha\leq 1$ and generate the corresponding image using the new code $w^{\prime}$.
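The interpolation formula translates directly into code. In this minimal sketch a latent code is a list of per-layer style vectors, the function name is hypothetical, and the toy codes use 2 layers of 3 dimensions instead of 18 layers of 512.

```python
from typing import List

Styles = List[List[float]]


def interpolate_latents(w1: Styles, w2: Styles, alpha: float) -> Styles:
    """Return alpha * w1 + (1 - alpha) * w2, computed element-wise per layer."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return [
        [alpha * a + (1.0 - alpha) * b for a, b in zip(layer1, layer2)]
        for layer1, layer2 in zip(w1, w2)
    ]


# Toy latent codes standing in for the encodings of two real images.
w_first = [[4.0, 4.0, 4.0], [8.0, 8.0, 8.0]]
w_second = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
quarter_way = interpolate_latents(w_first, w_second, alpha=0.25)
```

Sweeping $\alpha$ from 0 to 1 and passing each intermediate code to the generator would produce a sequence of images morphing from the second image to the first.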