GAN Inversion: A Survey

Weihao Xia, Yulun Zhang, Yujiu Yang, Jing-Hao Xue, Bolei Zhou, Ming-Hsuan Yang

Introduction

The generative adversarial network (GAN) is a deep generative model that learns to generate new data through adversarial training. It consists of two neural networks, a generator $G$ and a discriminator $D$, which are trained jointly. The objective of $G$ is to synthesize fake data that resemble real data, while the objective of $D$ is to distinguish real data from fake data. Through this adversarial process, $G$ learns to generate fake data that match the real data distribution well enough to fool $D$. In recent years, GANs have been applied to computer vision tasks ranging from image translation and image manipulation to image restoration.

Many GAN models, e.g., PGGAN , BigGAN and StyleGAN , have been developed to synthesize images with high quality and diversity from random latent code. Recent studies have shown that GANs effectively encode rich semantic information in intermediate features and latent spaces from the supervision of image generation. These methods can synthesize images with a diverse range of attributes, such as faces with different ages and expressions, and scenes with different lighting conditions. By varying the latent code, we can manipulate certain attributes while retaining the other attributes for the generated image. However, such manipulation in the latent space is only applicable to the images generated by the GAN generator rather than any given real images due to the lack of inference capability in GANs.

GAN inversion aims to invert a given image back into the latent space of a pretrained GAN model. The image can then be faithfully reconstructed from the inverted code by the generator. Since GAN inversion plays an essential role in bridging real and fake image domains, significant advances have been made . GAN inversion makes the controllable directions found in latent spaces of the existing trained GANs applicable to editing real images, without requiring any ad-hoc supervision or expensive optimization. As shown in Fig. 1, after the real image is inverted into the latent space, we can vary its code along one specific direction to edit the corresponding attribute of the image. As a rapidly growing direction that combines GANs and interpretable machine learning techniques, GAN inversion is not only a flexible image editing framework but also helps reveal the inner workings of deep generative models.

In this paper, we present a comprehensive survey of GAN inversion methods with an emphasis on algorithms and applications. To the best of our knowledge, this is the first survey on the rapidly growing topic of GAN inversion. Our contributions are as follows. We provide a comprehensive review of GAN inversion methods and compare their properties and performance. We further discuss the challenges, open issues, and trends for future research.

The rest of this survey paper is organized as follows. We first give a problem formulation of GAN inversion in Section 2. The obtained latent code for a given image should have two properties: 1) reconstructing the input image faithfully and photorealistically and 2) facilitating downstream tasks. Achieving these two properties is also the goal of GAN inversion. Section 3.1 introduces many different pretrained GAN models $G(\mathbf{z})$ . Subsequent sections introduce the efforts taken by different GAN inversion methods to reach the goal. To evaluate the performance of GAN inversion methods, we consider the two important aspects, how photorealistic (perceptual quality) and faithful (inversion accuracy) the reconstructed image is, in Section 3.2. The first aspect depends on how the formulation is solved. It is usually a nonconvex optimization problem due to the nonconvexity of $G(\mathbf{z})$ , for which finding accurate solutions is difficult. The second aspect is primarily decided by which latent space to use. Section 4.1 introduces, analyses, and compares the characteristics of different latent spaces. In Sections 4.2, 4.3, and 4.4, we introduce how existing methods have attempted to provide solutions and discuss some important characteristics of these GAN inversion methods. Applications and future directions of GAN inversion are introduced in Sections 5 and 6.

Problem Definition and Overview

It is well known that GANs can generate high-resolution and photorealistic fake images. However, it remains challenging to apply these unconditional GANs to the editing of real images due to the lack of inference capability. Given an image, GAN inversion aims to recover the latent code in a latent space of a pretrained unconditional GAN model, and thus enables numerous image editing applications by manipulating the latent code. In this case, the pretrained unconditional GAN model can be used without modifying the architecture. Ideally the found latent code of the given image should achieve two goals: 1) reconstructing the input image faithfully and photorealistically and 2) facilitating downstream tasks.
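For reference, the inversion objective that the text refers to as Equation (1) can be sketched as follows. This is an assumed form inferred from the surrounding discussion rather than a verbatim restoration: $\ell(\cdot,\cdot)$ stands for a generic image-space distance (e.g., pixel-wise or perceptual), and $\mathbf{z}^{*}$ denotes the recovered latent code.

```latex
\begin{equation}
\mathbf{z}^{*} = \underset{\mathbf{z}}{\arg\min}\; \ell\big(G(\mathbf{z}),\, x\big) \tag{1}
\end{equation}
```

Here $x$ is the given image and $G$ is the pretrained generator, which stays fixed during inversion.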

The second goal, facilitating downstream tasks, is primarily decided by which latent space is used (see Section 4.1). The first goal depends on how accurately Equation (1) is solved. This is usually a nonconvex optimization problem due to the nonconvexity of $G(\mathbf{z})$, so accurate solutions are difficult to find. Many methods have been developed to solve Equation (1) with formulations based on learning, optimization, or both. A learning-based inversion method learns an encoder network that maps an image into the latent space such that the image reconstructed from the latent code looks as similar to the original as possible. An optimization-based inversion approach directly solves the objective function through back-propagation to find a latent code that minimizes a pixel-wise reconstruction loss. A hybrid approach first uses an encoder to generate an initial latent code and then refines it with an optimization algorithm. Generally, learning-based GAN inversion methods cannot faithfully reconstruct the image content. For example, they are known to sometimes fail to preserve identity and other details when reconstructing face images. While optimization-based techniques achieve superior reconstruction quality, their inevitable drawback is a significantly higher computational cost. Thus, recent improvements to learning-based methods mainly focus on reconstructing images more faithfully, e.g., by integrating an additional facial identity loss during training or by introducing an iterative feedback mechanism. Recent improvements to optimization-based methods focus on finding the desired latent code more quickly, and accordingly propose several initialization strategies and optimizers. Existing inversion approaches cannot achieve high reconstruction quality and short inference time simultaneously, resulting in a “quality-time tradeoff”. Although some hybrid approaches have been proposed to balance this tradeoff, quickly finding an accurate latent code remains a challenge.

Similar to GAN inversion, some tasks also aim to learn the inverse mapping of GAN models. Some methods use additional encoder networks to learn the inverse mapping of GANs, but their goals are to jointly train the encoder with both the generator and the discriminator, instead of using a trained GAN model. Some other methods, e.g., PULSE , ILO , or PICGM , also rely on a pretrained generator to solve inverse problems such as inpainting, super-resolution, or denoising. They design different optimization mechanisms to search for latent codes that satisfy the given degraded observations. Since they aim to search for accurate and reliable estimation (e.g., denoised image) from a degraded observation (e.g., noisy image) instead of faithful reconstruction of the given image, we do not categorize them as GAN inversion methods in this survey paper. But it would be beneficial to pay attention to those works as they share the same idea of finding desired latent code in the latent space of pretrained GAN models.

Preliminaries

Deep generative models such as GANs have been used to model natural image distributions and synthesize photorealistic images. Recent advances in GANs, such as DCGAN, WGAN, PGGAN, BigGAN, StyleGAN, StyleGAN2, StyleGAN2-ADA, and StyleGAN3, have developed better architectures, losses, and training schemes. These models are trained on diverse datasets, including faces (CelebA-HQ, FFHQ, AnimeFaces, and AnimalFace), scenes (LSUN), and objects (LSUN and ImageNet). Specifically, BigGAN pretrained on ImageNet, PGGAN on CelebA-HQ, and style-based GANs on FFHQ or LSUN are widely used in GAN inversion methods. In contrast to the above-mentioned 2D GANs, the recently developed 3D-aware GANs bridge the gap between 2D images and the 3D physical world. Inversion methods based on these 3D-aware GANs are currently less studied but have great potential for image, video, and 3D applications.

DCGAN uses convolutions in the discriminator and fractional-strided convolutions in the generator.

WGAN minimizes the Wasserstein distance between the generated and real data distributions, which offers more model stability and makes the training process easier.

BigGAN generates high-resolution and high-quality images, with modifications for scaling up, architectural changes and orthogonal regularization to improve the scalability, robustness and stability of large-scale GANs. BigGAN can be trained on ImageNet at 256 $\times$ 256 and 512 $\times$ 512.

PGGAN, also denoted as ProGAN or progressive GAN, uses a growing strategy for the training process. The key idea is to start with a low resolution for both the generator and the discriminator and then add new layers that model increasingly fine-grained details as the training progresses. This approach improves both training speed and stability, thereby facilitating image synthesis at higher resolution, e.g., CelebA images at $1024\times 1024$ pixels.

3.1.2 Datasets

ImageNet is a large-scale hand-annotated dataset for visual object recognition research and contains more than 14 million images with more than 20,000 categories.

CelebA is a large-scale face attribute dataset consisting of 200K celebrity images with 40 attribute annotations each. CelebA and its successors, CelebA-HQ and CelebAMask-HQ, are widely used in face image generation and manipulation.

Flickr-Faces-HQ (FFHQ) is a high-quality image dataset of human faces crawled from Flickr, which consists of 70,000 high-quality human face images of $1024\times 1024$ pixels and contains considerable variation in terms of age, ethnicity, and image background.

LSUN contains approximately one million labeled images for each of 10 scene categories (e.g., bedroom, church, or tower) and 20 object classes (e.g., bird, cat, or bus). The church and bedroom scene images and car and bird object images are commonly used in the GAN inversion methods.

Some GAN inversion studies also use other datasets in their experiments, such as DeepFashion , AnimeFaces , and StreetScapes .

3.2 Evaluation Metrics

There are different dimensions to evaluate GAN inversion methods, such as photorealism, faithfulness of the reconstructed image, and editability of the inverted latent code.

The IS, FID, and LPIPS metrics are widely used to assess the photorealistic quality of GAN-generated images. Other metrics such as the Fréchet segmentation distance (FSD) and sliced Wasserstein discrepancy (SWD) have also been used for perceptual quality evaluation. Xu et al. present an empirical study on the evaluation metrics of GAN models.

Inception score (IS) is a widely used metric to measure the quality and diversity of images generated from GAN models. It computes statistics of the synthesized images using the Inception-v3 network pretrained on ImageNet. A higher score is better.

Fréchet inception distance (FID) is defined by the Fréchet distance between feature vectors from the real and generated images based on the Inception-v3 pool3 layer. Lower FID indicates better perceptual quality.
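Under the common assumption that the Inception features of real and generated images are each Gaussian, the Fréchet distance has a closed form, written out below as a reading aid. The symbols $\mu_{r},\Sigma_{r}$ (mean and covariance of real-image features) and $\mu_{g},\Sigma_{g}$ (those of generated-image features) are notation introduced here for illustration.

```latex
\begin{equation*}
\mathrm{FID} = \left\lVert \mu_{r}-\mu_{g} \right\rVert_{2}^{2}
  + \mathrm{Tr}\!\left(\Sigma_{r}+\Sigma_{g}-2\left(\Sigma_{r}\Sigma_{g}\right)^{1/2}\right)
\end{equation*}
```

The first term compares the feature means and the second compares the feature covariances, so the distance is zero only when both statistics agree.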

Learned perceptual image patch similarity (LPIPS) measures image perceptual quality using a VGG model pretrained on the ImageNet. A lower value means higher similarity between image patches.

3.2.2 Faithfulness

Faithfulness measures the similarity between the real image and the generated one. It can be approximated by the image similarity. The most widely used metrics are PSNR and SSIM. Some methods use the pixel-wise reconstruction distance, e.g., mean absolute error (MAE), mean squared error (MSE), or root mean squared error (RMSE).

Peak signal-to-noise ratio (PSNR) is one of the most widely used criteria to measure the quality of reconstruction. The PSNR between the ground truth image and the reconstruction is defined by the maximum possible pixel value of the image and the mean squared error between images.
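The definition above can be sketched in a few lines. This is a minimal illustration that assumes 8-bit images (maximum value 255) supplied as flat lists of pixel values; the function name `psnr` and the sample pixels are hypothetical.

```python
import math


def psnr(reference, reconstruction, max_value=255.0):
    """PSNR in dB between two equally sized images given as flat pixel lists."""
    if len(reference) != len(reconstruction):
        raise ValueError("images must have the same number of pixels")
    # Mean squared error between the ground truth and the reconstruction.
    mse = sum((a - b) ** 2 for a, b in zip(reference, reconstruction)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


# Toy 4-pixel "images" (assumed values, for illustration only).
x = [52, 60, 61, 200]
x_rec = [50, 63, 61, 198]
print(f"PSNR: {psnr(x, x_rec):.2f} dB")
```

A higher PSNR means a smaller mean squared error, i.e., a more faithful reconstruction.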

Structural similarity (SSIM) measures the structural similarity between images based on independent comparisons in terms of luminance, contrast, and structures. The details of these terms can be found in the original SSIM work.

3.2.3 Editability

Editability measures how flexibly the inverted latent code can be edited with respect to certain attributes of the generator's output image. Directly evaluating the editability of a latent code is difficult. Existing methods use cosine distance, Euclidean distance, or classification accuracy to compare certain attributes between the input $x$ and the output $x^{\prime}$, checking that the target attribute is modified while the others remain unchanged. Existing methods focus on evaluating editability on face data and facial attributes. For example, Nitzan et al. use the cosine similarity and Euclidean distances to measure attribute preservation: facial expression preservation is calculated as the Euclidean distance between the 2D landmarks of $x$ and $x^{\prime}$, whereas pose preservation is calculated as the Euclidean distance between the Euler angles of $x$ and $x^{\prime}$. Abdal et al. develop the edit consistency score (regressed by an attribute classifier) to measure the consistency across edited face images, based on the assumption that different permutations of edits should yield the same attribute score when classified with an attribute classifier. These methods also measure the preservation of face identity to evaluate the quality of the edited images. We note that the above-discussed measures may not be applicable to image domains other than faces.
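The distance computations mentioned above can be sketched as follows. The landmark coordinates, Euler angles, and embedding vectors are made-up values, and the helper names are hypothetical; in practice these quantities would come from a landmark detector, a pose estimator, and an attribute or feature network.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two feature vectors (1.0 means same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def euclidean(p, q):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def mean_landmark_distance(landmarks_a, landmarks_b):
    """Average Euclidean distance between matching 2D landmarks."""
    dists = [euclidean(p, q) for p, q in zip(landmarks_a, landmarks_b)]
    return sum(dists) / len(dists)


# Assumed 2D landmarks of the input x and the edited output x'.
landmarks_x = [(30.0, 40.0), (70.0, 40.0), (50.0, 80.0)]
landmarks_edit = [(30.0, 40.0), (70.0, 40.0), (53.0, 84.0)]
expression_shift = mean_landmark_distance(landmarks_x, landmarks_edit)

# Assumed Euler angles (yaw, pitch, roll) in degrees.
pose_shift = euclidean((10.0, 0.0, 5.0), (13.0, 4.0, 5.0))

# Assumed feature embeddings of x and x'.
embedding_sim = cosine_similarity([1.0, 2.0, 2.0], [2.0, 4.0, 4.0])
```

Smaller landmark and pose distances indicate that the edit left expression and pose largely untouched, while a cosine similarity close to 1 indicates that the compared features are preserved.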

3.2.4 Subjective Metrics

Aside from the above-mentioned metrics, some studies include human raters or user studies for performance evaluation. For example, for subjective image quality assessment, human raters are asked to assign perceptual quality scores to images, e.g., from $1$ (bad) to $5$ (good). The final score, usually called the mean opinion score (MOS) or difference mean opinion score (DMOS), is calculated as the arithmetic mean over all ratings. A typical user study presents participants with a triple of images (the source, the result of a baseline, and the result of the proposed method) and asks them to choose the result that best answers a given question. The question can be “choose one from the given two edited images that better preserves the identity of the person in the source image” or “which edited image is more realistic?” The final percentage of responses indicates the preference rate of the proposed method over the baseline. Drawbacks of these metrics include the nonlinear scale of human judgement, potential bias and variance, and high human cost.
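Both quantities reduce to simple averages, as the sketch below shows. The ratings, the response labels `"proposed"` and `"baseline"`, and the function names are hypothetical placeholders.

```python
from statistics import mean


def mean_opinion_score(ratings, low=1, high=5):
    """Arithmetic mean of human ratings on a fixed scale (here 1 = bad, 5 = good)."""
    if any(r < low or r > high for r in ratings):
        raise ValueError("rating outside the allowed scale")
    return mean(ratings)


def preference_rate(choices, method="proposed"):
    """Fraction of user-study responses that picked the given method."""
    return sum(choice == method for choice in choices) / len(choices)


ratings = [4, 5, 3, 4]  # assumed scores from four raters
choices = ["proposed", "baseline", "proposed", "proposed"]  # assumed responses

mos = mean_opinion_score(ratings)
rate = preference_rate(choices)
print(f"MOS = {mos}, preference rate = {rate:.0%}")
```

Because both numbers are plain averages, they inherit the rater bias and variance noted above, so reporting the number of raters alongside them is advisable.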

GAN Inversion Methods

This section introduces different latent spaces of GAN models, representative GAN inversion methods, and their properties. As the StyleGAN models achieve state-of-the-art image synthesis, numerous GAN inversion methods have been developed using various latent spaces based on the StyleGANs. In addition to the $\mathcal{Z}$ space for generic GANs, several latent spaces are designed specifically for StyleGANs, including $\mathcal{W}$ , $\mathcal{W}^{+}$ , $\mathcal{S}$ , and $\mathcal{P}$ spaces.

Regardless of the GAN inversion method, one important design choice is which latent space to embed the image into. A good latent space should be disentangled and easy to embed into. The latent code in such a latent space has the following two properties: it reconstructs the input image faithfully and photorealistically, and it facilitates downstream image editing tasks. This section introduces the efforts of latent space analysis and regularization on the latent spaces from the original $\mathcal{Z}$ space to the most recent $\mathcal{P}$ space. The $\mathcal{Z}$ space is applicable to all GANs, whereas some latent spaces are designed specifically for StyleGANs. The choice of latent space depends on the pretrained models and tasks. For instance, image editing with StyleGANs is mostly performed in the $\mathcal{W}^{+}$ space.

$\mathcal{Z}$ Space. The generative model in the GAN architecture learns to map the values sampled from a simple distribution, e.g., normal or uniform distribution, to the generated images. These values, sampled directly from the distribution, are often called latent codes or latent representations (denoted by $\mathbf{z}\in\mathcal{Z}$ ), as shown in Fig. 2. The structure they form is typically called latent $\mathcal{Z}$ space. The $\mathcal{Z}$ space is applicable to all the unconditional GAN models such as DCGAN , PGGAN , BigGAN , and StyleGANs . However, the constraint of the $\mathcal{Z}$ space subject to a normal distribution limits its representation capacity and disentanglement for the semantic attributes.

$\mathcal{W}$ and $\mathcal{W}^{+}$ Space. Recent GAN inversion methods mostly adopt the latent spaces used in StyleGANs. These latent spaces have higher degrees of freedom and thus are significantly more expressive than the $\mathcal{Z}$ space. Fig. 2 illustrates the latent spaces from which the inversion methods are constructed. Various latent spaces are derived from the original $\mathcal{Z}$ space. StyleGAN converts native $\mathbf{z}$ to the mapped style vectors $\mathbf{w}$ by a nonlinear mapping network $f$ implemented with an $8$ -layer multilayer perceptron (MLP). This intermediate latent space is named as $\mathcal{W}$ space. Due to the mapping network and affine transformations, the $\mathcal{W}$ space of StyleGAN contains more disentangled features than does the $\mathcal{Z}$ space. Some studies analyze the separability and semantics of both $\mathcal{W}$ and $\mathcal{Z}$ spaces. The expressiveness of $\mathcal{W}$ space is, however, still limited, restricting the range of images that can be faithfully reconstructed. Therefore, some works make use of another layer-wise latent space, $\mathcal{W}^{+}$ , where a different intermediate latent vector, $\mathbf{w}$ , is fed into each of the generator’s layers via AdaIN . However, inverting images into the $\mathcal{W}^{+}$ space alleviates distortion at the expense of compromised editability. Recent methods aim to balance the reconstruction-editability tradeoff by predicting latent codes in $\mathcal{W}^{+}$ that reside close to $\mathcal{W}$ . For a StyleGAN with 18 layers, $\mathbf{w}\in\mathcal{W}$ has 512 dimensions, and $\mathbf{w}\in\mathcal{W}^{+}$ has 18 $\times$ 512 dimensions.
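The dimension counts above can be made concrete with a small sketch. It assumes the 18-layer, 512-dimensional setting quoted in the text and represents codes as plain Python lists; the function names are hypothetical. A code from $\mathcal{W}$ corresponds to a $\mathcal{W}^{+}$ code whose per-layer vectors are all identical, whereas a general $\mathcal{W}^{+}$ code lets each layer differ.

```python
NUM_LAYERS = 18  # number of generator layers, as quoted in the text
W_DIM = 512      # dimensionality of a single w vector


def broadcast_to_w_plus(w, num_layers=NUM_LAYERS):
    """Replicate one w vector so that every layer receives its own copy."""
    return [list(w) for _ in range(num_layers)]


def is_replicated(w_plus):
    """True when all per-layer vectors coincide, i.e. the code is a broadcast w."""
    return all(row == w_plus[0] for row in w_plus)


w = [0.1] * W_DIM
w_plus = broadcast_to_w_plus(w)
n_dims = len(w_plus) * len(w_plus[0])  # 18 x 512 free parameters

in_w_before = is_replicated(w_plus)
w_plus[-1][0] += 1.0                   # change only the last layer's vector
in_w_after = is_replicated(w_plus)
```

Editing a single layer's vector moves the code away from the replicated form, which is the extra freedom that improves reconstruction but can compromise editability.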

$\mathcal{S}$ Space. The style space $\mathcal{S}$ is spanned by channel-wise style parameters $\mathbf{s}$, where $\mathbf{s}$ is transformed from $\mathbf{w}\in\mathcal{W}$ by a different learned affine transformation for each layer of the generator. In a $1024\times 1024$ StyleGAN2 with 18 layers, $\mathcal{W}$, $\mathcal{W}^{+}$, and $\mathcal{S}$ have 512, 9216, and 9088 dimensions, respectively. The $\mathcal{S}$ space is proposed to achieve better disentanglement in the spatial dimension, beyond the semantic level. Spatial entanglement is primarily caused by the intrinsic complexity of style-based generators and the spatial invariance of AdaIN normalization. Xu et al. replace the original style codes with disentangled multilevel visual features learned by an encoder. They refer to the space spanned by these style parameters as the $\mathcal{Y}$ space, but it can be seen as a type of $\mathcal{S}$ space. By directly intervening on the style code $\mathbf{s}\in\mathcal{S}$, methods based on the $\mathcal{S}$ space achieve fine-grained control over local translations.

4.2 GAN Inversion Methods

Fig. 3 shows three main techniques of GAN inversion, i.e., projecting images into the latent space based on learning, optimization, or hybrid formulations. The inverted codes have other properties, i.e., having supported resolution, being semantic-aware, being layerwise, and having out-of-distribution generalizability. Table I lists some important properties of the existing GAN inversion methods.

Learning-based GAN inversion typically involves training an encoding neural network $E(x;\theta_{E})$ to map an image $x$ into the latent code $\mathbf{z}$ by

$$\theta_{E}^{*}=\underset{\theta_{E}}{\arg\min}\sum_{n}\mathcal{L}\big(G(E(x_{n};\theta_{E})),x_{n}\big), \tag{2}$$

where $x_{n}$ denotes the $n$-th image in the dataset and $\mathcal{L}$ is a reconstruction loss. The objective in (2) is reminiscent of an autoencoder pipeline, with an encoder $E$ and a decoder $G$. The decoder $G$ is fixed throughout the training. Aside from accurate reconstruction, a good encoder for GAN inversion should have the following properties: 1) being lightweight; 2) being data-efficient; 3) supporting high-resolution images (see Section 4.3.1); and 4) generalizing to arbitrary images (see Section 4.3.4).
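The autoencoder-like training described above can be illustrated with a deliberately tiny sketch. Everything in it is an assumption made for illustration: the frozen "generator" is the scalar map $G(z)=2z+1$, the encoder is the affine map $E(x)=ax+b$, the loss is the squared error, and training uses plain gradient descent. Real encoders are deep networks trained on images.

```python
def G(z):
    """Frozen toy generator (stand-in for a pretrained GAN generator)."""
    return 2.0 * z + 1.0


def E(x, theta):
    """Toy encoder with trainable parameters theta = (a, b)."""
    a, b = theta
    return a * x + b


def train_encoder(images, steps=200, lr=0.05):
    """Minimize the mean of (G(E(x_n)) - x_n)^2 over the encoder parameters only."""
    a, b = 0.0, 0.0
    n = len(images)
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for x in images:
            residual = G(E(x, (a, b))) - x
            # Chain rule: d(residual^2)/da = 2 * residual * dG/dz * x, with dG/dz = 2.
            grad_a += 2.0 * residual * 2.0 * x / n
            grad_b += 2.0 * residual * 2.0 / n
        a -= lr * grad_a  # G is never updated, only the encoder
        b -= lr * grad_b
    return a, b


images = [-1.0, 0.0, 1.0, 2.0]  # assumed toy "dataset"
theta = train_encoder(images)

x_new = 3.0                      # an image not seen during training
x_rec = G(E(x_new, theta))       # inversion is a single forward pass
```

Once trained, inverting a new image costs one encoder evaluation, which is the main appeal of learning-based methods.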

One early learning-based GAN inversion method is proposed by Perarnau et al. Given a conditional GAN (cGAN) model, a real image $x$ is encoded into a latent code $\mathbf{z}$ and an attribute vector $y$, and a modified image $x^{\prime}$ is synthesized by changing $y$. This approach trains an encoder $E$ on top of a trained cGAN. Different from Zhu et al., this encoder $E$ is composed of two modules: $E_{z}$, which encodes an image to $\mathbf{z}$, and $E_{y}$, which encodes an image to $y$. To train $E_{z}$, this method uses the generator to create a dataset of generated images $x^{\prime}$ and latent vectors $\mathbf{z}$, and minimizes a squared reconstruction loss $\mathcal{L}_{ez}$ between $\mathbf{z}$ and $E_{z}(G(\mathbf{z},y^{\prime}))$. $E_{y}$ is initially trained using generated images $x^{\prime}$ and their conditional information $y^{\prime}$, and is then improved by directly training with $\|y-E_{y}(x)\|_{2}^{2}$.

Due to the prevalence of StyleGANs, most recent learning-based methods design an encoder for StyleGANs. Richardson et al. propose map2style modules that learn styles from the corresponding feature maps, where 18 single-layer latent codes are predicted separately. Instead of using 18 modules to learn the styles for StyleGANs, Wei et al. propose a simple and efficient head that consists of only an average pooling layer and a fully connected layer. Given three semantic levels of features obtained by a feature pyramid network (FPN), three such heads produce $\mathbf{w}_{15},\cdots,\mathbf{w}_{18}$, $\mathbf{w}_{10},\cdots,\mathbf{w}_{14}$, and $\mathbf{w}_{1},\cdots,\mathbf{w}_{9}$ from the shallow, medium, and deep features, respectively. Tov et al. analyze the trade-offs between distortion, perceptual quality, and editability within the StyleGAN latent space. An encoder is used to control these trade-offs and facilitate downstream image editing. To improve inversion accuracy, Alaluf et al. introduce an iterative refinement mechanism for the encoder. Instead of directly predicting the latent code of a given real image in a single forward pass, at step $t$ the encoder operates on an extended input obtained by concatenating the given image $\mathbf{x}$ with the current predicted image: $\Delta_{t}=E(\mathbf{x},y_{t})$, where $y_{t}=G(\mathbf{w}_{t})$. The latent code at step $t+1$ is then updated as $\mathbf{w}_{t+1}=\Delta_{t}+\mathbf{w}_{t}$. The initial values $\mathbf{w}_{0}$ and $y_{0}$ are set to the average latent code and its corresponding image, respectively.
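The iterative refinement update $\mathbf{w}_{t+1}=\Delta_{t}+\mathbf{w}_{t}$ can be sketched with scalar stand-ins. The generator $G(w)=2w+1$ and the residual encoder $E(x,y)=0.25\,(x-y)$ are hypothetical toys chosen so that the loop is easy to follow; they are not the networks used by Alaluf et al.

```python
def G(w):
    """Toy generator (stand-in for a pretrained StyleGAN)."""
    return 2.0 * w + 1.0


def E(x, y):
    """Hypothetical residual encoder: predicts a latent offset from the pair (x, y)."""
    return 0.25 * (x - y)


def iterative_inversion(x, w_avg=0.0, steps=10):
    """Start from the average code and repeatedly apply w <- w + E(x, G(w))."""
    w = w_avg
    y = G(w)                  # y_0: image of the average latent code
    errors = []
    for _ in range(steps):
        delta = E(x, y)       # encoder sees the target and the current prediction
        w = w + delta
        y = G(w)
        errors.append(abs(x - y))
    return w, errors


x = 5.0                       # target "image"; the exact code is w = 2
w_hat, errors = iterative_inversion(x)
```

In this toy setting the reconstruction error halves at every step, which mirrors the idea that each pass only needs to predict a correction to the current estimate rather than the full latent code.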

Although some methods use additional encoder networks to learn the inverse mapping of GANs, we do not categorize them as GAN inversion since their goals are to jointly train the encoder with both the generator and the discriminator, instead of determining the latent space of a trained GAN model.

4.2.2 Optimization-based GAN Inversion

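Existing optimization-based GAN inversion methods typically reconstruct a target image by optimizing the latent vector

$$\mathbf{z}^{*}=\underset{\mathbf{z}}{\arg\min}\;\ell\big(x,G(\mathbf{z};\theta)\big),$$

where $x$ is the target image, $G$ is a GAN generator parameterized by $\theta$, and $\ell$ is a reconstruction loss.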
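A minimal sketch of this procedure is given below. It assumes a toy linear generator $G(\mathbf{z})=\mathbf{W}\mathbf{z}$ with a hand-picked matrix, a squared-error loss, and plain gradient descent with an analytic gradient; actual methods back-propagate through a deep generator and use optimizers such as those discussed next.

```python
W = [[1.0, 0.0],
     [0.0, 1.0],
     [1.0, 1.0]]  # assumed toy generator weights: 2-D latent -> 3-pixel "image"


def G(z):
    """Toy linear generator G(z) = W z with fixed weights."""
    return [sum(w_ij * z_j for w_ij, z_j in zip(row, z)) for row in W]


def loss(z, x):
    """Squared pixel-wise reconstruction error."""
    return sum((g - t) ** 2 for g, t in zip(G(z), x))


def grad(z, x):
    """Gradient of ||W z - x||^2 with respect to z, i.e. 2 W^T (W z - x)."""
    residual = [g - t for g, t in zip(G(z), x)]
    return [2.0 * sum(W[i][j] * residual[i] for i in range(len(W)))
            for j in range(len(z))]


def invert(x, z_init, steps=200, lr=0.1):
    """Gradient descent on the latent code; the generator is never modified."""
    z = list(z_init)
    for _ in range(steps):
        g = grad(z, x)
        z = [z_j - lr * g_j for z_j, g_j in zip(z, g)]
    return z


x = G([1.0, -2.0])            # target image with a known latent code
z_star = invert(x, z_init=[0.0, 0.0])
```

Note that the loop must be rerun from scratch for every new target image, which is the source of the computational cost discussed in this section.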

The choice of optimizer is critical, since a good optimizer helps alleviate the local minima problem. There are two types of optimizers: gradient-based methods (Adam, L-BFGS, Hamiltonian Monte Carlo (HMC)) and gradient-free methods (covariance matrix adaptation (CMA)). Different optimization-based GAN inversion methods use different optimizers. For example, Adam is used in Image2StyleGAN, and L-BFGS is used by Zhu et al. Huh et al. systematically experiment with both gradient-based and gradient-free optimizers and find that CMA and its variant BasinCMA perform the best for optimizing the latent vector when inverting images from challenging datasets (e.g., LSUN Cars) into the latent space of StyleGAN2.

Another important issue for optimization-based GAN inversion is the initialization of latent code. Since Equation (1) is highly nonconvex, the reconstruction quality strongly relies on a good initialization of $\mathbf{z}$ (sometimes $\mathbf{w}$ for StyleGAN ). Experiments show that different initial values lead to a significant perceptual difference in generated images . An intuitive solution is to start with several random initial values and obtain the best result with minimal cost. Image2StyleGAN studies two initialization choices, one based on random selection and the other based on mean latent code $\overline{\mathbf{w}}$ . However, a prohibitively large number of random initial values may be tested before obtaining a stable reconstruction , which makes real-time processing impossible. Thus, some instead train a deep neural network to minimize (1) directly, as introduced in Section 4.2.1. Some propose using an encoder to provide better initialization for optimization, which is discussed in Section 4.2.3.
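The multi-start strategy can be sketched on a one-dimensional nonconvex toy problem. The generator $G(z)=z+2\sin(2z)$, the fixed grid of starting points (standing in for random draws), and the step size are all assumptions chosen for illustration.

```python
import math


def G(z):
    """Toy non-monotonic generator, so the inversion loss is nonconvex."""
    return z + 2.0 * math.sin(2.0 * z)


def loss(z, x):
    return (G(z) - x) ** 2


def grad(z, x):
    dG = 1.0 + 4.0 * math.cos(2.0 * z)  # derivative of G
    return 2.0 * (G(z) - x) * dG


def descend(z, x, steps=2000, lr=0.005):
    """Plain gradient descent from a single starting point."""
    for _ in range(steps):
        z -= lr * grad(z, x)
    return z


def multi_start_invert(x, inits):
    """Run descent from every start and keep the code with the lowest loss."""
    results = [descend(z0, x) for z0 in inits]
    best = min(results, key=lambda z: loss(z, x))
    return best, results


x = G(1.2)                              # target with a known latent code
inits = [-3.0, -1.5, 0.0, 1.5, 3.0]     # stand-in for random initial values
best_z, all_z = multi_start_invert(x, inits)
losses = [loss(z, x) for z in all_z]
```

Starts that fall into a poor basin stall at a local minimum with a large loss, while others reach a near-zero loss. The cost grows linearly with the number of starts, which is why encoder-based initialization is attractive.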

We note that optimization-based methods typically require an iterative process that is expensive in terms of both memory and runtime, as the optimization has to be run for each image independently.

4.2.3 Hybrid GAN Inversion

The hybrid methods exploit the advantages of both approaches discussed above. As one of the pioneering works in this field, Zhu et al. propose a framework that first predicts $\mathbf{z}$ of a given real photo $x$ by training a separate encoder $E(x;\theta_{E})$ and then uses the obtained $\mathbf{z}$ as the initialization for optimization. The learned predictive model serves as a fast bottom-up initialization for the nonconvex optimization problem (1).
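The two-stage idea (encoder prediction followed by optimization) can be sketched with scalar stand-ins. The generator $G(z)=2z+1$ and the deliberately imperfect encoder $E(x)=0.45x-0.3$ are hypothetical; the refinement stage is plain gradient descent on the squared reconstruction error.

```python
def G(z):
    """Toy generator (stand-in for a pretrained GAN generator)."""
    return 2.0 * z + 1.0


def E(x):
    """Hypothetical imperfect encoder: a fast but slightly biased inverse of G."""
    return 0.45 * x - 0.3


def refine(z, x, steps=50, lr=0.05):
    """Gradient descent on (G(z) - x)^2, started from the encoder's estimate."""
    for _ in range(steps):
        z -= lr * 2.0 * (G(z) - x) * 2.0  # chain rule with dG/dz = 2
    return z


x = 5.0
z_init = E(x)                 # stage 1: one forward pass of the encoder
z_star = refine(z_init, x)    # stage 2: a short optimization run

loss_init = (G(z_init) - x) ** 2
loss_star = (G(z_star) - x) ** 2
```

Because the encoder already places the code near a good solution, the refinement needs only a few steps, which is how hybrid methods trade a small amount of extra runtime for improved accuracy.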

Subsequent studies follow this framework and propose several variants. For example, to invert $G$, Bau et al. begin by training a network $E$ to obtain a suitable initialization of the latent code $\mathbf{z}_{0}=E(x)$ and its intermediate representation $\mathbf{r}_{0}=g_{n}(\cdots(g_{1}(\mathbf{z}_{0})))$, where $g_{n}(\cdots(g_{1}(\cdot)))$ is a layerwise representation of $G(\cdot)$. This method then uses $\mathbf{r}_{0}$ to initialize a search for $\mathbf{r}^{*}$ to obtain a reconstruction $x^{\prime}=G(\mathbf{r}^{*})$ close to the target $x$ (see Section 4.3.3 for more details). Zhu et al. show that in most existing methods, the generator $G$ does not provide its domain knowledge to guide the training of the encoder $E$, since the gradients from $G(\cdot)$ are not taken into account at all. To address this, a domain-specific GAN inversion approach is developed, which both reconstructs the input image and ensures that the inverted code is meaningful for semantic editing (see Section 4.3.2 for more details). In contrast to previous methods, Roich et al. develop a generator-tuning technique. Using an initial latent code as the pivot, they lightly tune the pretrained generator so that the input image can be faithfully reconstructed. This process, referred to as pivotal tuning, helps map an out-of-domain image to an in-domain latent code faithfully. Alaluf et al. further introduce a hypernetwork that learns to refine the generator weights with respect to a given input image. The hypernetwork is composed of a lightweight feature extractor and a set of refinement blocks.
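The generator-tuning idea can be illustrated with a small sketch in which the latent code is frozen and the generator weights are adjusted instead. The two-parameter generator $G(w;\theta)=\theta_{0}w+\theta_{1}$, the pivot value, and the plain squared-error objective are assumptions for illustration; any additional regularization used in practice is omitted.

```python
def G(w, theta):
    """Toy generator with tunable weights theta = (scale, shift)."""
    scale, shift = theta
    return scale * w + shift


def pivotal_tuning(x, w_pivot, theta, steps=100, lr=0.05):
    """Keep the pivot code fixed and lightly tune the generator weights."""
    scale, shift = theta
    for _ in range(steps):
        residual = G(w_pivot, (scale, shift)) - x
        # Gradients of residual^2 with respect to the generator weights.
        scale -= lr * 2.0 * residual * w_pivot
        shift -= lr * 2.0 * residual
    return scale, shift


pretrained = (2.0, 1.0)   # assumed pretrained weights
x = 5.0                   # target image
w_pivot = 1.95            # assumed initial inversion, slightly off (exact code is 2.0)

tuned = pivotal_tuning(x, w_pivot, pretrained)
error_before = abs(G(w_pivot, pretrained) - x)
error_after = abs(G(w_pivot, tuned) - x)
```

The tuned weights stay close to the pretrained ones while the pivot now reconstructs the target, which is the sense in which the generator is tuned only lightly.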

4.3 Properties of GAN Inversion Methods

In this section, we discuss the important properties of GAN inversion methods, i.e., having supported resolution, being semantic-aware, being layerwise, and having out-of-distribution generalizability.

The image resolution that a GAN inversion method can support is mainly determined by the capacity of the generator and the inversion mechanism. Zhu et al. use DCGANs trained on several datasets with images of $64\times 64$ pixels, and Bau et al. adopt PGGANs trained with images of $256\times 256$ pixels from LSUN. However, some methods cannot fully leverage the pretrained GAN model. Zhu et al. propose an encoder to map the given images to the latent space of StyleGAN. This method (Fig. 4 (a)) performs well for images of $256\times 256$ pixels but does not scale well to images of $1024\times 1024$ pixels due to the high computational cost (in the figure, 1/n denotes semantic feature maps at 1/n of the original input resolution). Conversely, the pSp method proposed by Richardson et al. (Fig. 4 (b)) can synthesize images of $1024\times 1024$ pixels regardless of the input image size, since its 18 map2style modules predict 18 single-layer latent codes separately. Wei et al. propose a similar model but with a lightweight encoder. As in pSp, features from three semantic levels are used to predict different parts of the latent codes. However, this model predicts 9, 5, and 4 layers of latent codes from the three semantic levels, respectively, as shown in Fig. 4 (c). Recent applications such as face swapping on megapixel images and infinite-resolution image synthesis build on inversion methods that support high-resolution image editing.

4.3.2 Semantic Awareness

GAN inversion methods with semantic-aware properties can perform image reconstruction at the pixel level and align the inverted code with the knowledge that emerges in the latent space. Semantic-aware latent codes can better support image editing by reusing the rich knowledge encoded in the GAN models. The existing approaches typically randomly sample a collection of latent codes $\mathbf{z}$ and feed them into $G(\cdot)$ to obtain the corresponding synthesis $x^{\prime}$. The encoder $E(\cdot)$ is then trained by

$$\min_{\Theta_{E}}\mathcal{L}_{E}=\|\mathbf{z}-E(G(\mathbf{z}))\|_{2},$$

where $\|\cdot\|_{2}$ denotes the $l_{2}$ distance, and $\Theta_{E}$ represents the parameters of the encoder $E(\cdot)$ . Collins et al. use a latent object representation to synthesize images with different styles and reduce artifacts. However, the supervision by only reconstructing $\mathbf{z}$ (or equivalently, the synthesized data) is not sufficient to train an accurate encoder.

To alleviate this issue, Zhu et al. propose a domain-specific GAN inversion approach to recover the input real image at both the pixel and semantic levels. This method first trains a domain-guided encoder $E$ to map the image space to the latent space such that all codes produced by the encoder are in-domain latent codes. The encoder $E$ is trained to recover the real images, instead of being trained with synthesized data to recover the latent code. Then, they perform the instance-level domain-regularized optimization by involving this well-trained $E$ as a regularization term to fine-tune the latent code in the semantic domain during $\mathbf{z}$ optimization. Such optimization helps better reconstruct the pixel values without affecting the semantic property of the inverted code. The training process is formulated as

where $x$ is the target image to invert, and $\lambda_{1}^{\prime}$ and $\lambda_{2}^{\prime}$ are the loss weights corresponding to the perceptual loss and the encoder regularizer, respectively.
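With the symbols defined above, the domain-regularized optimization can be sketched as a three-term objective. Here $F(\cdot)$ is assumed to be a perceptual feature extractor (for example, a pretrained VGG network); this symbol does not appear in the text and is introduced only for illustration:

```latex
\mathbf{z}^{*} = \arg\min_{\mathbf{z}}\;
\left\| x - G(\mathbf{z}) \right\|_{2}
+ \lambda_{1}^{\prime} \left\| F(x) - F\big(G(\mathbf{z})\big) \right\|_{2}
+ \lambda_{2}^{\prime} \left\| \mathbf{z} - E\big(G(\mathbf{z})\big) \right\|_{2}
```

The last term keeps the optimized code close to what the domain-guided encoder would predict for the current reconstruction, which is how the encoder acts as a regularizer.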

3.3 Layerwise

When the number of layers is large, it is not feasible to determine the generator for the full inversion problem defined by Equation (1). Some recent approaches are developed to solve a tractable subproblem by decomposing the generator $G$ into layers:

where $g_{1},\ldots,g_{n}$ are the early layers of $G$ , and $G_{f}$ constructs all the later layers of $G$ .
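In the notation just introduced, the decomposition can be written as the following composition (a sketch reconstructed from the definitions of $g_{1},\ldots,g_{n}$ and $G_{f}$ above):

```latex
G = G_{f}\big(g_{n}(\cdots(g_{1}(\mathbf{z})))\big)
```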

The simplest layerwise GAN inversion is based on one layer. Lei et al. consider a one-layer model in the form of $G=g(\mathbf{z})=\text{ReLU}(\mathbf{W}\mathbf{z}+\mathbf{b})$ . When the problem is realizable, to find a feasible $\mathbf{z}$ such that $x=G(\mathbf{z})$ , one could invert the function by solving a linear programming problem:
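For the one-layer ReLU model above, realizability means that each observed coordinate $x_{i}$ either equals its pre-activation (when $x_{i}>0$) or comes from a non-positive pre-activation (when $x_{i}=0$). A sketch of the resulting feasibility program follows, with $\mathbf{w}_{i}^{\top}$ denoting the $i$-th row of $\mathbf{W}$ (notation assumed here, not taken from the text):

```latex
\text{find } \mathbf{z} \quad \text{s.t.} \quad
\begin{cases}
\mathbf{w}_{i}^{\top}\mathbf{z} + b_{i} = x_{i}, & \forall i:\ x_{i} > 0,\\
\mathbf{w}_{i}^{\top}\mathbf{z} + b_{i} \le 0, & \forall i:\ x_{i} = 0.
\end{cases}
```

Both constraint sets are linear in $\mathbf{z}$, which is why the inversion reduces to linear programming.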

To invert complex state-of-the-art GANs, Bau et al. propose solving the easier problem of inverting the final layers $G_{f}$ :

where $||\cdot||_{1}$ denotes an $\mathcal{L}_{1}$ loss, and $\lambda_{\text{R}}$ is set as 0.01 to emphasize the reconstruction of $\mathbf{r}_{i-1}$ . To focus on training near the manifold of representations produced by the generator, this method uses sample $\mathbf{z}$ and layers $g_{i}$ to compute samples of $\mathbf{r}_{i-1}$ and $\mathbf{r}_{i}$ such that $\mathbf{r}_{i-1}=g_{i-1}(\cdots g_{1}(\mathbf{z}))$ . Once all the layers are inverted, an inversion network for all of $G$ can be composed as follows:

The results can be further improved by fine-tuning the composed network $E^{*}$ to invert $G$ jointly as a whole and obtain the final result $E$ .
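The layerwise training described above can be summarized by the following sketch. The symbols $e_{i}$ (a small network inverting layer $g_{i}$), $e_{f}$ (a network inverting $G_{f}$), and the loss names $\mathcal{L}_{\text{L}}$ and $\mathcal{L}_{\text{R}}$ are assumptions introduced here for illustration; the exact form in the cited work may differ.

```latex
\begin{aligned}
\mathcal{L}_{\text{L}} &= \mathbb{E}_{\mathbf{z}}\big[\,\|\mathbf{r}_{i-1} - e(g_{i}(\mathbf{r}_{i-1}))\|_{1}\,\big],\\
\mathcal{L}_{\text{R}} &= \mathbb{E}_{\mathbf{z}}\big[\,\|\mathbf{r}_{i} - g_{i}(e(\mathbf{r}_{i}))\|_{1}\,\big],\\
e_{i} &= \arg\min_{e}\; \mathcal{L}_{\text{L}} + \lambda_{\text{R}}\,\mathcal{L}_{\text{R}},\\
E^{*} &= e_{1}(e_{2}(\cdots(e_{n}(e_{f}(x))))).
\end{aligned}
```

With a small $\lambda_{\text{R}}$, the first term dominates, which is consistent with the stated emphasis on reconstructing $\mathbf{r}_{i-1}$.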

For StyleGANs , the intermediate latent vector $\mathbf{w}\in\mathcal{W}^{+}$ or $\mathbf{s}\in\mathcal{S}$ is different across layers and is fed into the corresponding layer of the generator via AdaIN or affine transformations . Therefore, inverting images into $\mathcal{W}^{+}$ or $\mathcal{S}$ space can be seen as being layerwise.

3.4 Out-of-Distribution Generalizability

GAN inversion methods can support inverting images, in particular real images that are not generated by the same process as the training data. We refer to this ability as out-of-distribution generalizability . Specifically, given a StyleGAN pretrained on the FFHQ dataset, this property is closely related to the following two aspects: 1) generating face images with all combinations of facial attributes, even if some combinations do not exist in the training dataset; 2) handling images that differ from the samples of the training set, such as corrupted images, caricatures, or black and white photos. This property is a prerequisite for GAN inversion methods to edit a wider range of images. Out-of-distribution generalizability has been demonstrated in many GAN inversion methods. Zhu et al. propose a domain-specific GAN inversion approach to recover the input image at both the pixel and semantic levels. Although trained only with the FFHQ dataset, their model can generalize not only to real face images from multiple face datasets but also to paintings, caricatures, and black and white photos collected from the Internet. Kang et al. propose a method to invert out-of-range images. Taking facial images as an example, out-of-range images could be images with extreme poses or corrupted images, which previous methods often fail to handle. Being able to invert out-of-range images allows GAN inversion methods to be applied to wider domains rather than limited settings. Some methods explore the potential of inverting an image into a desired latent code given only a degraded or partial observation. In addition to images, recent methods also show out-of-distribution generalization ability for other modalities, e.g., sketch and text .

The out-of-distribution generalizability of GAN inversion facilitates open-world image manipulation when combined with the latent code-based editing methods (see Section 4.4) . One notable drawback is that inverting images that contain unseen attributes can easily lead to unexpected results as they lie outside the domain of the pretrained image generators. This limits extending GAN inversion to broader applications such as image synthesis guided by uncommon textual descriptions . Some recent approaches aim to alleviate this issue by transferring the GANs pretrained on one image domain to a new one, guided by certain references or semantics from one or few target images (few-shot and one-shot), pretrained language-image models (zero-shot), or both .

4 Latent Space Navigation

GAN inversion is not the end goal. The reason that we invert a real image into the latent space of a trained GAN model is that it allows us to manipulate the image by varying the inverted code in the latent space for a certain attribute. This technique is usually known as latent space navigation or traversals , GAN steerability , or latent code manipulation . Although often regarded as an independent research field, it has become an indispensable application of GAN inversion . Many inversion methods also explore the efficient discovery of a desired latent code. Section 4.1 has introduced different latent spaces. This section introduces discovering interpretable and disentangled directions in the latent spaces of GANs.

Some methods support discovering interpretable directions in the latent space, i.e., controlling the generation process by varying the latent codes $\mathbf{z}$ in the desired directions $\mathbf{n}$ with step $\alpha$ , which is considered as the vector arithmetic $\mathbf{z}^{\prime}=\mathbf{z}+\alpha\mathbf{n}$ . Such directions can be identified through supervised, unsupervised, or self-supervised manners. Recent methods have also been proposed to directly compute the interpretable directions in closed form from the pretrained models without any kind of training or optimization.

Supervised Setting. Existing supervised learning-based approaches typically sample a large number of latent codes at random, synthesize a collection of corresponding images, and annotate them with some predefined labels by introducing a pretrained classifier (e.g., predicting face attributes or light directions) or extracting statistical image information (e.g., color variations) . For example, to interpret the face representation learned by GANs, Shen et al. employ some off-the-shelf classifiers to learn a hyperplane in the latent space serving as the separation boundary and predict semantic scores for synthesized images. Abdal et al. learn a semantic mapping between the $\mathcal{Z}$ space and the $\mathcal{W}$ space by using continuous normalizing flows (CNF). Both methods rely on the availability of attributes (typically obtained by a face classifier network), which might be difficult to obtain for new datasets and could require manual labeling effort.

Unsupervised Setting. The supervised setting can introduce bias into the experiment, since the sampled codes and synthesized images used as supervision differ in each sampling and may lead to different discoveries of interpretable directions . It also severely restricts the range of directions that existing approaches can discover, especially when the labels are missing. Furthermore, the individual controls discovered by these methods are typically entangled, affecting multiple attributes, and are often nonlocal. Thus, some methods aim to discover interpretable directions in the latent space in an unsupervised manner, i.e., without requiring paired data. For example, Härkönen et al. create interpretable controls for image synthesis by identifying important latent directions based on PCA applied in the latent or feature space. The obtained principal components correspond to certain attributes, and the selective application of the principal components allows for the control of many image attributes. This method is considered “unsupervised” since the directions can be discovered by PCA without using any labels. However, manual intervention and supervision are still required to associate these directions with the target operations and to decide which layers they should be applied to. In contrast, Jahanian et al. optimize trajectories (both linear and nonlinear) in a self-supervised manner. Taking the linear walk $\boldsymbol{w}$ as an example, given an inverted source image $G(\mathbf{z})$ , they learn $\boldsymbol{w}$ as

where $\mathcal{L}$ measures the distance between the generated image $G(\mathbf{z}+\alpha\boldsymbol{w})$ after taking an $\alpha$ -step in the latent direction and the target image edit( $G(\mathbf{z}),\alpha$ ). This method is considered “self-supervised” because the target image edit( $G(\mathbf{z}),\alpha$ ) can be derived from the source image $G(\mathbf{z})$ .
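Using the quantities described above, the linear-walk objective can be sketched as an expectation over sampled codes and step sizes. The expectation notation is an assumption made here for concreteness:

```latex
\boldsymbol{w}^{*} = \arg\min_{\boldsymbol{w}}\;
\mathbb{E}_{\mathbf{z},\alpha}\Big[\, \mathcal{L}\big( G(\mathbf{z} + \alpha\boldsymbol{w}),\ \text{edit}(G(\mathbf{z}), \alpha) \big) \Big]
```

No labels enter the objective: the supervision signal is produced by applying the known image-space edit to the generator's own output.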

Closed-form Solution. A few recent methods show that interpretable directions for image synthesis can be directly obtained in closed form without training or optimization. Shen et al. propose a semantic factorization method based on the singular value decomposition of the weights of the first layer of a pretrained GAN. They observe that the semantic transformation of an image, usually denoted by moving the latent code toward a certain direction $\mathbf{z}^{\prime}=\mathbf{z}+\alpha\mathbf{n}$ , is actually determined by the latent direction $\mathbf{n}$ , which is independent of the sampled code $\mathbf{z}$ . A Semantics Factorization (SeFa) method is developed to discover the directions $\mathbf{n}$ that can cause a significant change in the output image $\Delta\mathbf{y}$ , i.e., $\Delta\mathbf{y}=\mathbf{y}^{\prime}-\mathbf{y}=(\mathbf{A}(\mathbf{z}+\alpha\mathbf{n})+\mathbf{b})-(\mathbf{A}\mathbf{z}+\mathbf{b})=\alpha\mathbf{A}\mathbf{n}$ , where $\mathbf{A}$ and $\mathbf{b}$ are the weight and bias of certain layers in $G$ , respectively. The obtained formula, $\Delta\mathbf{y}=\alpha\mathbf{A}\mathbf{n}$ , suggests that the desired editing with direction $\mathbf{n}$ can be achieved by adding the term $\alpha\mathbf{A}\mathbf{n}$ onto the projected code and indicates that the weight parameter $\mathbf{A}$ should contain the essential knowledge of image variations. The problem of exploring the latent semantics can thus be factorized by solving the following optimization problem:

The desired directions $\mathbf{n}^{*}$ , i.e., a closed-form factorization of latent semantics in GANs, should be the eigenvectors of the matrix $\mathbf{A}^{T}\mathbf{A}$ associated with the largest eigenvalues. In contrast to SeFa , a method based on orthogonal Jacobian regularization is applied to multiple layers of the generator to determine interpretable directions for image synthesis .
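To make the closed-form idea concrete, the factorization can be read as maximizing $\|\mathbf{A}\mathbf{n}\|_{2}^{2}$ over unit vectors $\mathbf{n}$. The sketch below is a toy illustration under stated assumptions: the matrix `A` is a hypothetical 2x2 stand-in for a layer weight, and power iteration is used in place of a full eigendecomposition purely to keep the example dependency-free. It is not the SeFa implementation.

```python
import math

def matvec(M, v):
    """Multiply matrix M (a list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def transpose(M):
    """Return the transpose of M as a list of rows."""
    return [list(col) for col in zip(*M)]

def normalize(v):
    """Scale v to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def top_direction(A, iters=200):
    """Approximate the unit n maximizing ||A n||_2 by power iteration on A^T A."""
    At = transpose(A)
    # Fixed all-ones start; a different start would be needed if it were
    # orthogonal to the leading eigenvector.
    n = normalize([1.0] * len(A[0]))
    for _ in range(iters):
        n = normalize(matvec(At, matvec(A, n)))
    return n

# Hypothetical layer weight: the first latent axis is amplified the most.
A = [[3.0, 0.0], [0.0, 1.0]]
n_star = top_direction(A)
```

For this toy matrix, the recovered direction aligns with the first latent axis, the one along which $\mathbf{A}$ stretches inputs the most. In practice one would take several leading eigenvectors rather than only the first.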

4.2 Discovering Disentangled Directions

When several attributes are involved, editing one may affect another since some semantics are not separated. Some methods aim to tackle multi-attribute image manipulation without interference. This characteristic is also named multidimensional or conditional editing in the literature. The goal is to discover disentangled directions for the desired attributes. For example, to edit multiple attributes, Shen et al. formulate the inversion-based image manipulation as $x^{\prime}=G(\mathbf{z}^{*}+\alpha\mathbf{n})$ , where $\mathbf{n}$ is a unit normal vector indicating a hyperplane defined by two latent codes $\mathbf{z}_{1}$ and $\mathbf{z}_{2}$ . In this method, $k$ attributes $\{\mathbf{z}_{1},\cdots,\mathbf{z}_{k}\}$ can form $m$ (where $m\leq k(k-1)/2$ ) hyperplanes $\{\mathbf{n}_{1},\cdots,\mathbf{n}_{m}\}$ . To edit multiple attributes without interfering with each other, these disentangled directions $\{\mathbf{n}_{1},\cdots,\mathbf{n}_{m}\}$ should be orthogonal. If this condition does not hold, then some semantics will correlate with each other, and $\mathbf{n}_{i}^{\top}\mathbf{n}_{j}$ can be used to measure the entanglement between the $i$ -th and $j$ -th semantics. In particular, this method uses projection to orthogonalize different vectors. As shown in Fig. 5, given two hyperplanes with normal vectors $\mathbf{n}_{1}$ and $\mathbf{n}_{2}$ , the goal is to find a projected direction $\mathbf{n}_{1}-(\mathbf{n}_{1}^{\top}\mathbf{n}_{2})\mathbf{n}_{2}$ such that moving samples along this new direction can change “attribute one” without affecting “attribute two”. For the case where multiple attributes are involved, they subtract the projection from the primal direction onto the plane that is constructed by all conditioned directions. Other GAN inversion methods based on pretrained StyleGAN or StyleGAN2 models can also manipulate multiple attributes due to the stronger separability of $\mathcal{W}$ space than of $\mathcal{Z}$ space.
However, as observed by recent methods , some attributes remain entangled in the $\mathcal{W}$ space, leading to some unwanted changes when we manipulate a given image. Instead of manipulating in the semantic $\mathcal{W}$ space, Wu et al. propose the $\mathcal{S}$ space (style space). The style code is formed by concatenating the output of all affine layers of the StyleGAN2 generator. Experiments show that the $\mathcal{S}$ space can alleviate spatially entangled changes and exert precise local modifications. By intervening on the style code $s\in\mathcal{S}$ directly, their method can manipulate different facial attributes along various semantic directions without affecting others and can achieve fine-grained controls on local translations.
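The two-hyperplane projection discussed in this section, $\mathbf{n}_{1}-(\mathbf{n}_{1}^{\top}\mathbf{n}_{2})\mathbf{n}_{2}$, can be sketched in a few lines. The attribute normals below are hypothetical toy vectors, and the final renormalization is an added convenience that the text does not specify.

```python
import math

def dot(u, v):
    """Euclidean inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def unit(v):
    """Scale v to unit length (fails if v is the zero vector)."""
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]

def conditional_direction(n1, n2):
    """Return n1 - (n1^T n2) n2 for unit normals, renormalized.

    The result is orthogonal to n2, so stepping along it leaves the
    signed distance to the second hyperplane unchanged. If n1 and n2
    are parallel the projection is zero and normalization fails.
    """
    n1, n2 = unit(n1), unit(n2)
    overlap = dot(n1, n2)  # entanglement measure n1^T n2
    projected = [a - overlap * b for a, b in zip(n1, n2)]
    return unit(projected)

# Hypothetical normals for "attribute one" and "attribute two".
n_one = [1.0, 1.0, 0.0]
n_two = [0.0, 1.0, 0.0]
n_new = conditional_direction(n_one, n_two)
```

Here `overlap` is the quantity $\mathbf{n}_{i}^{\top}\mathbf{n}_{j}$ used above to measure entanglement; when it is zero, the projection leaves the direction unchanged.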

Applications

Finding an accurate solution to the inversion problem allows us to match the target image without compromising the editing capabilities in the downstream tasks. GAN inversion does not require task-specific dense-labeled datasets and can be applied to many tasks such as image manipulation, image interpolation, image restoration, style transfer, novel-view synthesis, and even adversarial defense. In addition to the common image editing applications, in the last few months, GAN inversion techniques have been widely introduced to many other tasks, such as 3D reconstruction , image understanding , multimodal learning , and medical imaging , which shows its versatility for different tasks and strength to benefit a larger research community.

Given an image $x$ , we want to edit certain regions by varying its latent codes $\mathbf{z}$ and obtain ${\mathbf{z^{\prime}}}$ of the target image ${x^{\prime}}$ by linearly transforming the latent representation from a trained GAN model $G$ . This can be formulated in the framework of GAN inversion as the operation of adding a scaled difference vector:

where $\mathbf{n}$ is the normal direction corresponding to a particular semantic in the latent space, and $\alpha$ is the step for manipulation. In other words, if a latent code is moved in a certain direction, then the semantics contained in the output image should vary accordingly. For example, Voynov et al. gradually determine the direction corresponding to the background removal or background blur without changing the foreground. Shen et al. achieve single and multiple facial attribute manipulation by projecting and orthogonalizing different vectors. Recently, Zhu et al. perform semantic manipulation by either decreasing or increasing the semantic degree. Both methods use a projection strategy to search for the semantic direction $\mathbf{n}$ .
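Written out with the symbols defined above, the manipulation is a single step along the semantic direction followed by regeneration (a sketch consistent with the vector arithmetic used earlier in the survey):

```latex
\mathbf{z}^{\prime} = \mathbf{z} + \alpha\,\mathbf{n}, \qquad x^{\prime} = G(\mathbf{z}^{\prime})
```

A positive or negative $\alpha$ increases or decreases the semantic degree of the attribute associated with $\mathbf{n}$.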

Some methods can perform region-of-interest editing, which allows for the editing of some desired regions in a given image with user manipulation. Such operations often involve additional tools to select the desired region. For example, Abdal et al. analyze the defective image embedding of StyleGAN trained on FFHQ , i.e., the embedding of images with masked regions. The experiments show that the StyleGAN embedding is quite robust to the defects in images, and the embeddings of different facial features are independent of each other . Based on their observation, they develop a mask-based local manipulation method. They find a plausible embedding for regions outside the mask and fill in reasonable semantic content in the masked pixels. Zhu et al. use their in-domain inversion method for semantic diffusion. This task is to insert the target face into the context and make them compatible. Their method can keep the salient features of the target image (e.g., face identity) and adapt to the context information at the same time.

Some methods can also manipulate image properties other than semantics, e.g., geometry, texture, and color. For example, some methods change the pose (rotation) for face manipulation, while others can manipulate geometry (e.g., zoom/shift/rotation), texture (e.g., background blur/add grass/sharpness), and color (e.g., lighting/saturation).

2 Image Generation

Several GAN inversion-based methods are proposed for image generation tasks, such as hairstyle transfer , few-shot semantic image synthesis , and infinite-resolution image synthesis . Saha et al. develop a photorealistic hairstyle transfer method by optimizing the extended latent space and the noise space of StyleGAN2 . Endo et al. assume pixels sharing the same semantics have similar StyleGAN features to generate images and corresponding pseudosemantic masks from random noise in the latent space, and use a nearest-neighbor search for synthesis. This method integrates an encoder with the fixed StyleGAN generator and trains the encoder with the pseudolabeled data in a supervised fashion to control the generator. Cheng et al. propose a GAN inversion-based method for image inpainting and outpainting. A coordinate-conditioned generator is designed to synthesize patches to be concatenated for a full image. The latent codes, depending on the joint latent codes and their coordinates, synthesize the images overlapping with the input image. The optimal latent code for the available input patches is determined in the latent space of the trained patch-based generator during the outpainting stage. GAN inversion methods can be applied to interactive generation, i.e., starting with strokes drawn by a user and generating natural images that best satisfy the user constraints. Zhu et al. show that users can employ the brush tools to generate an image from scratch and then continually add more scribbles to refine the result. Abdal et al. invert the StyleGAN to perform semantic local edits based on user scribbles. With this method, simple scribbles can be converted into photorealistic edits by embedding them into certain layers of StyleGAN. This application is helpful for existing interactive image processing tasks such as sketch-to-image generation and sketch-based image retrieval , which usually require densely labeled datasets.

3 Image Restoration

Suppose that $\hat{x}$ is obtained via $\hat{x}=\phi(x)$ during acquisition, where $x$ is the distortion-free image, and $\phi$ is a degradation transform. Many image restoration tasks can be regarded as recovering $x$ given $\hat{x}$ . A common practice is to learn a mapping from $\hat{x}$ to $x$ , which often requires task-specific training for different $\phi$ . Alternatively, GAN inversion can employ statistics of $x$ stored in some prior and search in the space of $x$ for an optimal $x$ that best matches $\hat{x}$ by viewing $\hat{x}$ as partial observations of $x$ . For example, Abdal et al. observe that StyleGAN embedding is quite robust to the defects in images, e.g., masked regions. Based on that observation, they propose an inversion-based image inpainting method by embedding the source defective image into the early layers of the $\mathcal{W}^{+}$ space to predict the missing content and into the later layers to maintain color consistency. Pan et al. claim that a fixed GAN generator is inevitably limited by the distribution of training data and its inversion cannot faithfully reconstruct unseen and complex images. Thus, they present a relaxed and more practical reconstruction formulation for capturing the statistics of natural images in a trained GAN model as do the prior methods, i.e., the deep generative prior (DGP). Specifically, they reformulate (3) such that it allows the generator parameters to be fine-tuned on the target image on the fly:
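A sketch of this relaxed formulation, using the degradation transform $\phi$ and observation $\hat{x}$ defined above, is given below. The symbol $\ell$ for a generic distance metric and the notation $G(\mathbf{z};\theta)$ for a generator with parameters $\theta$ are assumptions introduced here:

```latex
\theta^{*}, \mathbf{z}^{*} = \arg\min_{\theta,\,\mathbf{z}}\; \ell\big( \hat{x},\ \phi( G(\mathbf{z}; \theta) ) \big)
```

Compared with optimizing $\mathbf{z}$ alone, the generator parameters $\theta$ are also free variables, so the search is no longer confined to the image manifold of the fixed pretrained generator.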

Their method performs comparably to state-of-the-art methods in terms of colorization , inpainting , and super-resolution . While artifacts sometimes occur in synthesized face images by GAN models , Shen et al. show that the quality information encoded in the latent space can be used for restoration. The artifacts generated by PGGAN can be corrected by moving the latent code toward the positive quality direction that is defined by a separating hyperplane using a linear SVM .

4 Image Interpolation

With GAN inversion, new results can be interpolated by morphing between corresponding latent vectors of given images. Given a well-trained GAN generator $G$ and two target images $x_{A}$ and $x_{B}$ , morphing between them could naturally be achieved by interpolating between their latent vectors $\mathbf{z}_{A}$ and $\mathbf{z}_{B}$ . Typically, morphing between $x_{A}$ and $x_{B}$ can be obtained by applying linear interpolation :

Such an operation can be found in . Moreover, in DGP , reconstructing two target images $x_{A}$ and $x_{B}$ would result in two generators $G_{\theta_{A}}$ and $G_{\theta_{B}}$ , respectively, and the corresponding latent vectors $\mathbf{z}_{A}$ and $\mathbf{z}_{B}$ since they also fine-tune $G$ . In this case, morphing between $x_{A}$ and $x_{B}$ can be achieved by linear interpolation of both the latent vectors and the generator parameters:

and images can be generated with the new $\mathbf{z}$ and $\theta$ .
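The two interpolation schemes above can be illustrated with a small sketch. It assumes the convention $\mathbf{z}=\lambda\mathbf{z}_{A}+(1-\lambda)\mathbf{z}_{B}$ and $\theta=\lambda\theta_{A}+(1-\lambda)\theta_{B}$ with $\lambda\in[0,1]$, treats generator parameters as a flat list of numbers, and uses toy values; the function names are hypothetical.

```python
def lerp(a, b, lam):
    """Elementwise lam * a + (1 - lam) * b for equal-length sequences."""
    return [lam * x + (1.0 - lam) * y for x, y in zip(a, b)]

def morph(z_a, z_b, theta_a, theta_b, steps=5):
    """Yield (lam, z, theta) triples along the interpolation path.

    Both the latent vector and the flattened generator parameters are
    interpolated with the same lam, as in the fine-tuned-generator case.
    Requires steps >= 2.
    """
    for k in range(steps):
        lam = k / (steps - 1)
        yield lam, lerp(z_a, z_b, lam), lerp(theta_a, theta_b, lam)

# Toy latent codes and one-parameter "generators" for images A and B.
z_A, z_B = [0.0, 2.0], [4.0, 0.0]
theta_A, theta_B = [1.0], [3.0]
path = list(morph(z_A, z_B, theta_A, theta_B))
```

Each triple in `path` would be passed to the generator to render one frame of the morph; when the generator is kept fixed, only the latent interpolation is needed.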

5 3D Reconstruction

For 3D data, Pan et al. and Zhang et al. propose 3D shape reconstruction from single images and point cloud completion based on GAN inversion. Given an image generated by GAN, starting with an initial ellipsoid 3D object shape, Pan et al. first render a number of unnatural images with various randomly sampled viewpoints and lighting conditions (called pseudosamples). By reconstructing them with the GAN, these pseudosamples could guide the original image toward the sampled viewpoints and lighting conditions in the GAN manifold, producing a number of natural-looking images (called projected samples). These projected samples could be adopted as the ground truth of the differentiable rendering process to refine the prior 3D shape. Instead of using existing 2D GANs trained on images, Zhang et al. first train a generator $G$ on 3D shapes in the form of point clouds. Latent codes are used by the pretrained generator to produce complete shapes. Given a partial shape, they look for a target latent vector $\mathbf{z}$ and fine-tune the parameters $\theta$ of $G$ that best reconstruct the complete shape via gradient descent.

6 Image Understanding

7 Multimodal Learning

For multimodal learning, several recent studies have focused on language-driven image generation and manipulation using StyleGAN. Xia et al. propose a unified framework for both text-to-image generation and text-guided image manipulation tasks by training an encoder to map texts into the latent space of StyleGAN and performing style-mixing to produce diverse results. Wang et al. propose a similar idea but introduce cycle-consistency training during inversion to learn more robust and consistent inverted latent codes. On the other hand, a few methods first obtain the latent code of a given image and find the target latent code of desired attributes with the guidance of some powerful pretrained language models, e.g., CLIP or ALIGN . Logacheva et al. present a generative model for landscape animation videos based on StyleGAN inversion. Lee et al. propose a sound-guided image editing framework. They train an audio encoder to encode sounds into a multimodal latent space, where audio representations are aligned with text-image representations to guide image manipulation.

8 Medical Imaging

GAN inversion techniques have been recently introduced to medical applications . These methods are used for data augmentation, where publicly available medical datasets are often outdated, limited, or inadequately annotated. Typically, these methods train the GAN models on domain-specific medical image datasets, e.g., Computed Tomography (CT) or Magnetic Resonance (MR), and use existing GAN inversion methods for inversion and manipulation. Fetty et al. present a method based on the StyleGAN model in which CT or MR images with desired attributes can be synthesized by traversing points in the latent space (see Section 4.4) or style mixing . To synthesize medical images with desired attributes, Ren et al. use the domain-specific GAN inversion technique to generate mammograms with desired shape and texture for psychophysical experiments. Overall, these methods based on GAN inversion achieve better interpretability and controllability in medical image synthesis.

Challenges and Future Directions

Theoretical Understanding. While significant effort has been made on applying GAN inversion to image editing applications, much less attention is paid to a better theoretical understanding of the latent space. Nonlinear structure in data can be represented compactly, and the induced geometry necessitates the use of nonlinear statistical tools , Riemannian manifolds, and locally linear methods. Well-established theories in related areas can facilitate the theoretical understanding from different perspectives. Some recent methods treat the latent space as a manifold, which involves different concepts and metrics.

Inversion Type. In addition to GAN inversion, some methods have been developed to invert generative models based on the encoder-decoder architecture. The IIN method learns invertible disentangled interpretations of variational autoencoders (VAEs) . Zhu et al. develop the latently invertible autoencoder to learn a disentangled representation of face images from which contents can be edited based on attributes. The LaDDer approach uses a meta-embedding based on a generative prior (including an additive VAE and a mixture of hyperpriors) to project the latent space of a well-trained VAE to a lower-dimensional latent space, where multiple VAE models are used to form a hierarchical representation. It is beneficial to explore combining GAN inversion and encoder-decoder inversion so that we can exploit the best of both worlds.

Domain Generalization. As discussed in Section 5, GAN inversion proves to be effective in cross-domain applications such as style transfer and image restoration, which indicates that pretrained models have learned domain-agnostic features. The images from different domains can be inverted into the same latent space from which effective metrics can be derived. Multitask methods have been developed to collaboratively exploit visual cues, such as image restoration and image segmentation or semantic segmentation and depth estimation , within the GAN framework. It is challenging but worthwhile to develop effective and consistent methods to invert the intermediate shared representations so that we can tackle different vision tasks under a unified framework.

Implicit Representation. Some methods based on pretrained GANs can manipulate geometry (e.g., zoom, shift, and rotate), texture (e.g., background blur and sharpness) and color (e.g., lighting and saturation). This ability indicates that GAN models pretrained on large-scale datasets have learned some physical information from real-world scenes. Implicit neural representation learning , a recent trend in 3D computer vision, learns implicit functions for 3D shapes or scenes and enables control of scene properties such as illumination, camera parameters, pose, geometry, appearance, and semantic structure. It has been used for volumetric performance capture , novel-view synthesis , face shape generation , object modeling , and human reconstruction . The recent StyleRig method is trained to align the parameters of the 3D morphable model (3DMM) with the input of StyleGAN . It opens an interesting research direction to invert such implicit representations of a pretrained GAN for 3D reconstruction, e.g., using StyleGAN for human face modeling or time-lapse video generation.

Precise Control. GAN inversion can be used to find directions for image manipulation while preserving the identity and other attributes . However, some tuning is needed to achieve the desired granularity of control at a fine-grained level, e.g., gaze redirection , relighting , and continuous view control . These tasks require precise control, e.g., $1^{\circ}$ of camera view or gaze direction. Current GAN inversion methods are incapable of handling these tasks. Thus, more efforts are needed, such as creating more disentangled latent spaces and discovering more interpretable directions.

Multimodal Inversion. The existing GAN inversion methods primarily focus on images. However, recent advances in generative models are beyond the image domain, such as the GPT-3 language model and WaveNet for audio synthesis. Trained on diverse large-scale datasets, these sophisticated deep neural networks prove to be capable of representing an extensive range of different contents, styles, sentiments, and topics. Applying GAN inversion techniques on these different modalities could provide a novel perspective for tasks such as language style transfer. Furthermore, many GAN models are developed for multimodality generation or translation . It is a promising direction to invert such GAN models as multimodal representations to create novel kinds of content, behavior, and interaction.

Evaluation Metrics. New perceptual quality metrics, which can better evaluate photorealistic and diverse images or identity consistent with the original image, remain to be explored. Current evaluations mostly concentrate on measuring photorealism or if the distribution of generated images is consistent with the real images with regard to classification or segmentation accuracy using models trained for real images. However, there is still a lack of effective assessment tools to evaluate the difference between the predicted results and the expected outcome or to measure the inverted latent codes more directly.

Conclusion

Deep generative models such as GANs learn the underlying variation factors of the training data through the weak supervision of image generation. Discovering and steering the interpretable latent representations in image generation facilitate a wide range of image editing applications. This paper presents a comprehensive survey of GAN inversion methods with an emphasis on algorithms and applications. We summarize the important properties of GAN latent spaces and models and then introduce four kinds of GAN inversion methods and their key properties. We then go through several fascinating applications of GAN inversion, including image manipulation, image generation, image restoration, and recent applications beyond image processing. We finally discuss the challenges and the future directions of GAN inversion.