Editing in Style: Uncovering the Local Semantics of GANs

Edo Collins, Raja Bala, Bob Price, Sabine Süsstrunk

Introduction

In the short span of five years, generative adversarial neural networks (GANs) have come to dominate the field of data-driven image synthesis. Like most other neural network models, however, the exact model they learn for the data is not straightforwardly interpretable.

There have been significant steps towards alleviating this issue. For instance, state-of-the-art image GANs such as PG-GAN and StyleGAN, by virtue of their progressive training, encourage each layer to model the variation exhibited at a given image resolution (e.g., 8×8 images capture coarse structure, 32×32 images add finer details, etc.).

The notion of a disentangled representation has been used to describe such phenomena. While definitions of disentanglement are many and varied, the common idea is that an attribute of interest, which we often consider semantic, can be manipulated independently of other attributes.

In this paper we show that deep generative models like PG-GAN, StyleGAN and the recent StyleGAN2 learn a representation of objects and object-parts that is disentangled in the sense that various semantic parts (e.g., the mouth of a person or the pillows in a bedroom) have a significant ability to vary independently of the rest of the scene.

Based on this observation we propose an algorithm that performs spatially-localized semantic editing on the outputs of GANs, primarily StyleGAN. Editing is performed by transferring semantically localized style from a reference image to a target image, both outputs of a GAN. Our method is simple and effective, requiring no more than an off-the-shelf pre-trained GAN. It is unique in that it enacts a localized change through a global operation, akin to style transfer. As a result, unlike other GAN editing methods that make use of additional datasets and trained networks, or traditional image morphing methods requiring complex spatial operations, our method relies upon and benefits solely from the rich semantic representation learned by the GAN itself. Applications include forensic art, where a human face is composited from various sources, and interior design, where various combinations of design elements such as furniture and upholstery can be visualized. Extension to semantic editing of real images can be envisioned by combining our approach with recent work that embeds natural images into the latent space of StyleGAN.

We provide insight into the structure of hidden activations of the StyleGAN generator, showing that the learned representations are largely disentangled with respect to semantic objects in the synthesized image.

We exploit this structure to develop a novel image editor that performs semantic part transfer from a reference to a target synthesized image. The underlying formulation is simple and elegant and achieves naturalistic part transfer without the need for complex spatial processing, or supervision from additional training data and models.

The paper is structured as follows. In Section 2 we review work related to GAN editing and interpretability. In Section 3 we detail our observations regarding spatial disentanglement in GAN latent space and introduce our local editing method. In Section 4 we show experimental results that validate our claims, and in Section 5 we conclude with a discussion of the results and future work.

Related Work

The literature on the use of GANs for image synthesis has exploded since the seminal work by Goodfellow et al., with today’s state-of-the-art methods such as StyleGAN, StyleGAN2, and BigGAN producing extremely realistic outputs. For a thorough review of the GAN literature we refer the reader to recent surveys. Our goal here is not to propose another GAN, but to offer a local editing method for its output, by changing the style of specific objects or object parts to the style given in a reference image. We next review past work germane to semantic image editing, paying particular attention to recent GAN-based methods.

Several works have explored the use of deep generative models for semantic image editing. We distinguish between two flavors: latent code-based methods for global attribute editing and activation-based methods for local editing.

Latent code-based techniques learn a manifold for natural images in the latent code space facilitated by a GAN and perform semantic edits by traversing paths along this manifold. A variant of this framework employs auto-encoders to disentangle the image into semantic subspaces and reconstruct the image, thus facilitating semantic edits along the individual subspaces. Examples of edits accomplished by these techniques include global changes in color, lighting, pose, facial expression, gender, age, hair appearance, eyewear and headwear. AttGAN uses supervised learning with external attribute classifiers to accomplish attribute editing.

Activation-based techniques for local editing directly manipulate specific spatial positions on the activation tensor at certain convolutional layers of the generator. In this way, GAN Dissection controls the presence or absence of objects at given positions, guided by supervision from an independent semantic segmentation model. Similarly, feature blending transfers objects between a target GAN output and a reference by “copy-pasting” activation values from the reference onto the target. We compare that technique, together with traditional Poisson blending, to our approach in Fig. 5.

Distinct from all these works, our approach is a latent code-based approach for local editing. Crucially, it neither relies on external supervision by image segmentation models nor involves complex spatial blending operations. Instead, we uncover and exploit the disentangled structure in the embedding space of the generator that naturally permits spatially localized part editing.

2.2 Face Swapping

Our technique for object-specific editing, when applied to face images, is akin to the problems of face swapping and transfer. Previous efforts describe methods for exchanging global properties between a pair of facial images. Our method stands out from these approaches by offering editing that is localized to semantic object parts. Furthermore, a primary motivation for face swapping is de-identification for privacy preservation, which is not relevant for our goal of editing synthetic images. Yang et al. present a method for transferring expression from one face to another. Certain specific cases of expression transfer (e.g., smile) involve localized part (e.g., mouth) transfer, and are thus similar to our setting. However, even in these common scenarios, our editing framework is unique in that it requires no explicit spatial processing such as warping and compositing.

Local Semantics in Generative Models

Deep feature factorization (DFF) is a recent method that explains a convolutional neural network’s (CNN) learned representation through a set of saliency maps, extracted by factorizing a matrix of hidden layer activations. With such a factorization, it has been shown that CNNs trained for ImageNet classification learn features that act as semantic object and object-part detectors.

The main result of this analysis is that at certain layers of the generator, clusters correspond well to semantic objects and parts. Fig. 2 shows the clusters produced for a 32×32 layer of StyleGAN generator networks trained on Flickr-Faces-HQ (FFHQ) and LSUN-Bedrooms. Each pixel in the heatmap is color-coded to indicate its cluster. As can be seen, clusters spatially span coherent semantic objects and object-parts, such as eyes, nose and mouth for faces, and bed, pillows and windows for bedrooms.

The cluster membership encoded in $\bm{\mathsfit{U}}$ allows us to compute the contribution $\bm{M}_{k,c}$ of channel $c$ towards each semantic cluster $k$ as follows:

Assuming that the feature maps of $\bm{\mathsfit{A}}_{l}$ have zero mean and unit variance, the contribution of each channel is bounded between 0 and 1, i.e., $\bm{M}\in[0,1]^{K\times C}$.
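As an illustration of such a contribution matrix, the following NumPy sketch assumes that $\bm{M}_{k,c}$ is the mean, over images and spatial positions, of the squared standardized activation of channel $c$ restricted to the positions belonging to cluster $k$. This form is an assumption chosen to be consistent with the stated $[0,1]$ bound; the function name, argument names and toy data are hypothetical.

```python
import numpy as np

def channel_contributions(acts, masks):
    """Return a (K, C) matrix of per-cluster channel contributions.

    acts:  (N, C, H, W) activations of one generator layer.
    masks: (N, K, H, W) one-hot cluster memberships (the tensor U).
    """
    # Standardize each channel over images and positions.
    mean = acts.mean(axis=(0, 2, 3), keepdims=True)
    std = acts.std(axis=(0, 2, 3), keepdims=True)
    a = (acts - mean) / (std + 1e-8)
    n, _, h, w = acts.shape
    # Assumed form: mean squared activation of channel c inside cluster k.
    return np.einsum("nchw,nkhw->kc", a ** 2, masks) / (n * h * w)

# Toy data: 4 images, 3 channels, 8x8 positions, 2 clusters.
rng = np.random.default_rng(0)
acts = rng.normal(size=(4, 3, 8, 8))
labels = rng.integers(0, 2, size=(4, 8, 8))
masks = np.stack([labels == k for k in range(2)], axis=1).astype(float)
M = channel_contributions(acts, masks)
```

Under this assumption, and because the clusters partition the spatial positions, the contributions of each channel sum to 1 across clusters.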

Furthermore, by bilinearly up- or down-sampling the spatial dimensions of the tensor $\bm{\mathsfit{U}}$ to an appropriate size, we are able to find a matrix $\bm{M}$ for all layers in the generator, with respect to the same semantic clusters.
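The resampling step can be sketched in NumPy as follows. The half-pixel-centered sampling convention is an assumption (the text does not specify one), and `resize_bilinear` is a hypothetical helper rather than part of any released code.

```python
import numpy as np

def resize_bilinear(x, out_h, out_w):
    """Bilinearly resize the last two axes of x, shape (..., H, W)."""
    h, w = x.shape[-2:]
    # Source coordinates of each output pixel center, clamped to the grid.
    ys = np.clip((np.arange(out_h) + 0.5) * h / out_h - 0.5, 0, h - 1)
    xs = np.clip((np.arange(out_w) + 0.5) * w / out_w - 0.5, 0, w - 1)
    y0 = np.floor(ys).astype(int)
    x0 = np.floor(xs).astype(int)
    y1 = np.minimum(y0 + 1, h - 1)
    x1 = np.minimum(x0 + 1, w - 1)
    wy = (ys - y0)[:, None]   # vertical weights, shape (out_h, 1)
    wx = (xs - x0)[None, :]   # horizontal weights, shape (1, out_w)
    top = x[..., y0[:, None], x0[None, :]] * (1 - wx) + x[..., y0[:, None], x1[None, :]] * wx
    bot = x[..., y1[:, None], x0[None, :]] * (1 - wx) + x[..., y1[:, None], x1[None, :]] * wx
    return top * (1 - wy) + bot * wy

# Toy membership tensor: 1 image, 2 clusters, 4x4 positions (left half / right half).
labels = np.array([[0, 0, 1, 1]] * 4)
u = np.stack([labels == k for k in range(2)]).astype(float)[None]
u_up = resize_bilinear(u, 8, 8)
u_down = resize_bilinear(u, 2, 2)
```

Because the interpolation is linear, the resampled memberships remain in $[0,1]$ and still sum to 1 over clusters at every position, although they are no longer strictly binary.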

Using this approach we produced a semantic catalog for each GAN. We chose at which layer and with which $K$ to apply spherical k-means guided by a qualitative evaluation of the cluster membership maps. This process requires only minutes of human supervision.
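For readers unfamiliar with spherical k-means, a minimal NumPy sketch is given below. It assumes that each row of the input is the activation vector of one spatial position, and it uses a plain random initialization and a fixed iteration count; these choices, and all names, are illustrative assumptions rather than the settings used in the paper.

```python
import numpy as np

def spherical_kmeans(x, k, iters=20, seed=0):
    """Cluster the rows of x by cosine similarity; return (labels, centers)."""
    rng = np.random.default_rng(seed)
    # Project every vector onto the unit sphere.
    x = x / (np.linalg.norm(x, axis=1, keepdims=True) + 1e-8)
    centers = x[rng.choice(len(x), size=k, replace=False)]
    for _ in range(iters):
        # Assign each vector to the center with the largest cosine similarity.
        labels = np.argmax(x @ centers.T, axis=1)
        for j in range(k):
            members = x[labels == j]
            if len(members):
                # New center: re-normalized sum of the member vectors.
                c = members.sum(axis=0)
                centers[j] = c / (np.linalg.norm(c) + 1e-8)
    labels = np.argmax(x @ centers.T, axis=1)
    return labels, centers

# Toy data: two groups of directions in 2-D.
feats = np.array([[1.0, 0.1], [1.0, -0.1], [0.9, 0.0],
                  [0.1, 1.0], [-0.1, 1.0], [0.0, 0.8]])
labels, centers = spherical_kmeans(feats, k=2)
```

The only differences from ordinary k-means are the normalization of the data and of the centers, which makes the assignment depend on direction rather than magnitude.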

3.2 Local editing

This style-based control mechanism is motivated by style transfer, where it has been shown that manipulating per-channel mean and variance is sufficient to control the style of an image. By fixing the input to the StyleGAN convolutional generator to be a constant image, the authors of StyleGAN showed that this mechanism is sufficient to determine all aspects of the generated image: the style at one layer determines the content at the next layer.

3.2.2 Conditioned interpolation

Given a target image $\bm{S}$ and a reference image $\bm{R}$, both GAN outputs, we would like to transfer the appearance of a specified local object or part from $\bm{R}$ to $\bm{S}$, creating the edited image $\bm{G}$. Let $\bm{\sigma}^{\bm{S}}$ and $\bm{\sigma}^{\bm{R}}$ be two style scaling coefficients of the same layer corresponding to the two images.

For global transfer, due to the properties of linearity and separability exhibited by StyleGAN’s latent space, a mixed style $\bm{\sigma}^{\bm{G}}$ produced by linear interpolation between $\bm{\sigma}^{\bm{S}}$ and $\bm{\sigma}^{\bm{R}}$ yields plausible, fluid morphings between the two images (Karras et al. (2019) interpolate in the latent space of $\bm{w}$, but the effect is similar):

$$\bm{\sigma}^{\bm{G}} = \bm{\sigma}^{\bm{S}} + \lambda\left(\bm{\sigma}^{\bm{R}} - \bm{\sigma}^{\bm{S}}\right) \qquad (2)$$

for $0\leq\lambda\leq 1$. As $\lambda$ increases, this transfers all the properties of $\bm{\sigma}^{\bm{R}}$ onto $\bm{\sigma}^{\bm{G}}$, eventually leaving no trace of $\bm{\sigma}^{\bm{S}}$.

To enable selective local editing, we control the style interpolation with a matrix transformation:

$$\bm{\sigma}^{\bm{G}} = \bm{\sigma}^{\bm{S}} + \bm{Q}\left(\bm{\sigma}^{\bm{R}} - \bm{\sigma}^{\bm{S}}\right) \qquad (3)$$

where the matrix $\bm{Q}$ is positive semi-definite and is chosen such that $\bm{\sigma}^{\bm{G}}$ effects a local style transfer from $\bm{\sigma}^{\bm{R}}$ to $\bm{\sigma}^{\bm{S}}$. In practice we choose $\bm{Q}$ to be a diagonal matrix whose elements form $\bm{q}\in\left[0,1\right]^{C}$, which we refer to as the query vector.
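With a diagonal $\bm{Q}$ the interpolation reduces to an element-wise operation on the style vectors. The sketch below assumes the update $\bm{\sigma}^{\bm{G}} = \bm{\sigma}^{\bm{S}} + \bm{Q}(\bm{\sigma}^{\bm{R}} - \bm{\sigma}^{\bm{S}})$ with $\bm{Q} = \mathrm{diag}(\bm{q})$; the function name and the toy values are hypothetical.

```python
import numpy as np

def interpolate_style(sigma_s, sigma_r, q):
    """Per-channel interpolation from the target style towards the reference.

    Equivalent to sigma_s + Q (sigma_r - sigma_s) with Q = diag(q), q in [0, 1]^C.
    """
    return sigma_s + q * (sigma_r - sigma_s)

sigma_s = np.array([1.0, 2.0, 3.0])   # target style coefficients
sigma_r = np.array([3.0, 0.0, 3.0])   # reference style coefficients

# Local edit: channel 0 fully transferred, channel 1 untouched, channel 2 halfway.
local = interpolate_style(sigma_s, sigma_r, np.array([1.0, 0.0, 0.5]))

# Global interpolation is the special case q = lambda for every channel.
global_half = interpolate_style(sigma_s, sigma_r, np.full(3, 0.5))
```

Setting every entry of $\bm{q}$ to 0 returns the target style unchanged, and setting every entry to 1 returns the reference style.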

3.2.3 Choosing the query

For local editing, an appropriate choice for the query $\bm{q}$ is one that favors channels that affect the region of interest (ROI), while ignoring channels that have an effect outside the ROI. When specifying the ROI using one of the semantic clusters computed in section 3.1, say $k^{\prime}$, the vector $\bm{M}_{k^{\prime},c}$ encodes exactly this information.

A simple approach is to use $\bm{M}_{k^{\prime},c}$, computed offline from Eq. (1) for a given genre and dataset of images, to control the slope of the interpolation, clipping at 1:

$$\bm{q}_{c} = \min\left(1,\ \lambda\,\bm{M}_{k^{\prime},c}\right) \qquad (4)$$

where $\bm{q}_{c}$ is the $c$-th channel element of $\bm{q}$, and $\lambda$, as in Eq. (2), is the global strength of the interpolation. We refer to this approach as simultaneous as it updates all channels at the same time. Intuitively, when $\lambda$ is small or intermediate, channels with large $\bm{M}_{k^{\prime},c}$ will have a higher weight, thus having an effect of localizing the interpolation.
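A short NumPy sketch of the simultaneous query, assuming it takes the clipped-slope form $\bm{q}_{c} = \min(1, \lambda\,\bm{M}_{k^{\prime},c})$ described above; the function name and the toy contribution values are hypothetical.

```python
import numpy as np

def simultaneous_query(m_roi, lam):
    """Scale the ROI contributions by the global strength and clip at 1."""
    return np.minimum(1.0, lam * m_roi)

m_roi = np.array([0.1, 0.5, 0.9])   # row k' of M: contribution of each channel to the ROI
q_weak = simultaneous_query(m_roi, lam=1.0)
q_strong = simultaneous_query(m_roi, lam=2.0)
```

At moderate strength the channels most tied to the ROI dominate the edit, while a large $\lambda$ saturates every channel at 1 and the edit becomes global.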

We propose an approach which achieves superior localization compared to Eq. (4), referred to as sequential. We first set the most relevant channel to the maximum slope of 1, before raising the slope of the second-most relevant, third-most, etc. This definition of the query corresponds to solving for the following objective:

We solve this objective by sorting channels based on $\bm{M}_{k^{\prime}}$, and greedily assigning $\bm{q}_{c}=1$ to the most relevant channels as long as the total effect outside the ROI is no more than some budget $\epsilon$. Additionally, a non-zero weight is only assigned to channels where $\bm{M}_{k^{\prime},c}>\frac{\rho}{1+\rho}$, which improves the robustness of local editing by ignoring irrelevant channels even when the budget $\epsilon$ allows more change.
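The greedy procedure can be sketched as follows. The sketch assumes that the effect of channel $c$ outside the ROI is measured by $1-\bm{M}_{k^{\prime},c}$, which is an assumption made for illustration since the exact cost is defined by the objective; the function name and the toy values are likewise hypothetical.

```python
import numpy as np

def sequential_query(m_roi, eps, rho):
    """Greedily switch on the most ROI-relevant channels within a budget.

    m_roi: (C,) contributions of each channel to the ROI cluster k'.
    eps:   budget on the total effect outside the ROI.
    rho:   channels need m_roi > rho / (1 + rho) to receive any weight.
    """
    q = np.zeros_like(m_roi)
    threshold = rho / (1.0 + rho)
    spent = 0.0
    for c in np.argsort(-m_roi):          # most relevant channel first
        if m_roi[c] <= threshold:
            break                         # all remaining channels are less relevant
        cost = 1.0 - m_roi[c]             # assumed out-of-ROI effect of channel c
        if spent + cost > eps:
            break                         # budget exhausted
        q[c] = 1.0
        spent += cost
    return q

m_roi = np.array([0.9, 0.5, 0.05, 0.7])
rho = 1.0 / 9.0                           # gives a threshold of 0.1
q_tight = sequential_query(m_roi, eps=0.5, rho=rho)
q_loose = sequential_query(m_roi, eps=10.0, rho=rho)
```

With the tight budget only the two most relevant channels are switched on; with the loose budget the threshold, not the budget, is what excludes the least relevant channel.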

Experiments

In Figs. 3 and 4 we demonstrate our editing method with StyleGAN generators trained on two datasets: FFHQ, comprising 70K facial images, and LSUN-Bedrooms, comprising about 3M color images depicting bedrooms. Our code is available online at: https://github.com/IVRL/GANLocalEditing

In both datasets, we found the first 32×32 resolution layer of the generator to be “most semantic”. We therefore chose this layer to apply spherical k-means clustering. We set $\rho$ such that $\frac{\rho}{1+\rho}=0.1$ and tune $20\leq\epsilon\leq 100$ for best performance. We found that the tuning of $\epsilon$ depends mostly on the target image and object of interest, and not the style reference. Note that by nature of the local edit, changes to the target image may be subtle, and best viewed on screen.

Fig. 5 compares our method with feature-level blending and pixel-level (Poisson) blending methods. Feature blending is applied once to all layers of resolution 32×32 or lower, and once to those of 64×64 or lower. While these approaches are strictly localized (see section 2.1), their outputs lack photorealism. For instance, the target and reference faces are facing slightly different directions, which causes a misalignment problem most visible in the nose. In contrast, our editing method primarily affects the ROI, and yet maintains the photorealism of the baseline GAN by admitting some necessary global changes. However, our method does not always copy the appearance of an object “faithfully”, as seen in the window row of Fig. 4.

Fig. 6 demonstrates the applicability of our method to the recent StyleGAN2 model trained on LSUN-Cats and LSUN-Cars. Unlike traditional blending methods, our technique is able to transfer parts between unaligned images as seen here and in Fig. 4.

4.2 Quantitative analysis

We quantitatively evaluate the results of editing on two aspects of interest: locality and photorealism.

Locality

To evaluate the locality of editing, we examine the squared error in pixel space between target images and their edited outputs. Fig. 7 (a) shows the difference between unedited and edited images averaged over 50K FFHQ-StyleGAN samples, where at every pixel location we compute the squared distance in CIELAB color space. This figure indicates that the transfers are both perceptible and localized, and that not all object parts are equally disentangled. Compared to eyes and mouth, where edits are very localized, editing the nose seems to force a subtle degree of correlation with the other face parts. Such correlations trade off control over the appearance of individual parts against plausibility and realism of the overall output.

We further examine the localization ability of our method and variants described in Section 3.2. First, we obtain for each image the binary mask indicating the ROI, using the pre-computed spherical k-means clusters of Section 3.1. Then, we perform interpolation with various values of $\lambda$ (Eqs. 2 and 4) and $\epsilon$ (Eq. 5). For each such setting we measure the (normalized) In- and Out-MSE of each target-output pair, i.e., the MSE inside the ROI and MSE outside the ROI, respectively. In Fig. 7 (b) and (c), we show that for both FFHQ and LSUN-Bedrooms, respectively, our method (sequential) has better localization, i.e., less change outside the ROI for the same amount of change inside the ROI.
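The In- and Out-MSE measurement for a single target-output pair can be sketched as below. The normalization mentioned in the text is not specified and is therefore omitted; the function name, array layout and toy images are assumptions made for illustration.

```python
import numpy as np

def in_out_mse(target, edited, roi):
    """Return (MSE inside the ROI, MSE outside the ROI).

    target, edited: (H, W, 3) images; roi: (H, W) binary mask of the ROI.
    """
    # Per-pixel squared error, averaged over the color channels.
    err = ((edited - target) ** 2).mean(axis=-1)
    roi = roi.astype(bool)
    return err[roi].mean(), err[~roi].mean()

# Toy example: an edit that changes only the top-left 2x2 block.
target = np.zeros((4, 4, 3))
edited = target.copy()
roi = np.zeros((4, 4), dtype=bool)
roi[:2, :2] = True
edited[:2, :2] += 1.0
in_mse, out_mse = in_out_mse(target, edited, roi)
```

A perfectly localized edit, as in this toy case, has a positive In-MSE and an Out-MSE of zero; comparing methods at equal In-MSE then amounts to comparing their Out-MSE.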

Photorealism

Measuring photorealism is challenging, as there is not yet a direct computational way of assessing the perceived photorealism of an image. The Fréchet Inception Distance (FID), however, has been shown to correlate well with human judgement and has become a standard metric for GAN evaluation.

An aggregate statistic, FID compares the distributions of two image sets in the feature space of a deep CNN layer. In Table 1 we report the FID of 50K edited images against the original FFHQ and LSUN-Bedrooms datasets. The FID scores indicate that our edited images are not significantly different from the vanilla output of the baseline GAN.
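For reference, FID is commonly computed as the Fréchet distance between two Gaussians fitted to the feature vectors of each image set. The sketch below assumes the features have already been extracted (the feature network is not included) and uses hypothetical names and random toy features.

```python
import numpy as np

def frechet_distance(feats_a, feats_b):
    """Frechet distance between Gaussians fitted to two (num_samples, dim) feature sets."""
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    # Tr((cov_a cov_b)^(1/2)) equals the sum of the square roots of the
    # eigenvalues of cov_a @ cov_b, which are real and non-negative in theory;
    # clipping guards against small numerical errors.
    eig = np.linalg.eigvals(cov_a @ cov_b)
    tr_sqrt = np.sqrt(np.clip(eig.real, 0.0, None)).sum()
    mean_term = ((mu_a - mu_b) ** 2).sum()
    return float(mean_term + np.trace(cov_a) + np.trace(cov_b) - 2.0 * tr_sqrt)

rng = np.random.default_rng(0)
feats = rng.normal(size=(500, 4))
same = frechet_distance(feats, feats)           # identical sets
shifted = frechet_distance(feats, feats + 2.0)  # same covariance, shifted mean
```

Because the distance depends only on the mean and covariance of each set, it summarizes whole collections of images and says nothing about any individual sample.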

However, the same result was achieved when we computed the FID of 50K FFHQ images edited with feature blending, although Fig. 5 shows qualitatively that these produced outputs lack photorealism. This reemphasizes the difficulty of correctly measuring photorealism in an automated way. We did not run a similar analysis with Poisson blending since the many failure cases we observed with this approach did not justify the heavy computational cost required to process a large collection of 1024×1024 images. For both feature blending and Poisson editing, we could not test the Bedrooms dataset since these methods are not suitable for unaligned image pairs.

Conclusion

We have demonstrated that StyleGAN’s latent representations spatially disentangle semantic objects and parts. We leverage this finding to introduce a simple method for local semantic part editing in StyleGAN images. The core idea is to let the latent object representation guide the style interpolation to produce realistic part transfers without introducing any artifacts not already inherent to StyleGAN. The locality of the result depends on the extent to which an object’s representation is disentangled from other object representations, which in the case of StyleGAN is significant. Importantly, our technique does not involve external supervision by semantic segmentation models, or complex spatial operations to define the edit region and ensure a seamless transition from edited to unedited regions.

For future investigation, our observations open the door to explicitly incorporating editing capabilities into the adversarial training itself, which we believe will improve the extent of disentanglement between semantic objects, and yield even better localization.

Finally, the method can, in principle, be extended to semantic editing of real images by leveraging existing embedding frameworks to first map natural images into the latent space of StyleGAN. This opens up interesting applications in photo enhancement, augmented reality, visualization for plastic surgery, and privacy preservation.

Appendix A Spherical k-means for semantic clustering

In this section we elaborate on the layer-wise analysis described in Section 3.

The matrix $\bm{U}$ can be reshaped to a tensor $\bm{\mathsfit{U}}\in\{0,1\}^{N\times K\times H\times W}$ which represents $K$ sets of $N$ masks (one per image), where each mask spatially shows the cluster memberships.
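The reshaping can be sketched in NumPy as follows, assuming the rows of $\bm{U}$ are ordered image by image and, within each image, row by row; this ordering and the helper name are assumptions.

```python
import numpy as np

def labels_to_masks(labels, n, h, w, k):
    """Turn flat cluster labels of length N*H*W into a (N, K, H, W) one-hot tensor."""
    u = np.eye(k)[labels]                                # one-hot matrix U, (N*H*W, K)
    return u.reshape(n, h, w, k).transpose(0, 3, 1, 2)   # tensor U, (N, K, H, W)

# Toy labels for 2 images of 2x2 positions and 3 clusters.
labels = np.array([0, 1, 2, 1, 2, 2, 0, 0])
masks = labels_to_masks(labels, n=2, h=2, w=2, k=3)
```

Each slice `masks[n, k]` is then the binary mask of cluster `k` in image `n`.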

In Figs. 9 and 10 we show examples produced with StyleGAN, where the tensor $\bm{\mathsfit{U}}$ is up-sampled and overlaid on RGB images for ease of interpretation. The color-coding in these figures indicates to which cluster a spatial position belongs. In Fig. 11 we similarly show results for ProgGAN on CelebA-HQ.

The main observation emerging from this analysis is that at certain layers (e.g., the 32×32 layer 6 of StyleGAN), activations capture abstract semantic concepts (e.g., eyes for faces, pillow for bedrooms).

By manually examining the cluster membership masks of a few (five to ten) samples, an annotator can easily label a cluster as representing a certain object. Thus, we randomly generated $N=200$ samples and recorded all their activations. We tested several layers and rank $K$ combinations and selected the one that qualitatively yielded the most semantic decomposition into objects, as shown in Figures 9 and 10. We then manually labeled the resulting clusters. In the case that multiple clusters matched a part of interest, we merged their masks into a single mask. Note that this process is a one-time, offline process (per dataset/GAN) that then drives a fully automated semantic editing operation.

Appendix B Squared-error maps

Squared-error “diff” maps between edited outputs and the target image help detect changes between the two images and evaluate the locality of the edit operation. We compute the error in CIELAB color-space.
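A self-contained sketch of such a diff map is given below. It assumes sRGB inputs in $[0,1]$ and the D65 reference white for the CIELAB conversion; the text does not state these conventions, and the function names are hypothetical.

```python
import numpy as np

# sRGB-to-XYZ matrix and D65 reference white (assumed conventions).
RGB2XYZ = np.array([[0.4124564, 0.3575761, 0.1804375],
                    [0.2126729, 0.7151522, 0.0721750],
                    [0.0193339, 0.1191920, 0.9503041]])
WHITE = np.array([0.95047, 1.0, 1.08883])

def rgb_to_lab(rgb):
    """Convert sRGB values in [0, 1], shape (..., 3), to CIELAB."""
    # Undo the sRGB gamma to obtain linear RGB.
    linear = np.where(rgb <= 0.04045, rgb / 12.92, ((rgb + 0.055) / 1.055) ** 2.4)
    xyz = linear @ RGB2XYZ.T / WHITE
    delta = 6.0 / 29.0
    f = np.where(xyz > delta ** 3, np.cbrt(xyz), xyz / (3 * delta ** 2) + 4.0 / 29.0)
    lightness = 116.0 * f[..., 1] - 16.0
    a = 500.0 * (f[..., 0] - f[..., 1])
    b = 200.0 * (f[..., 1] - f[..., 2])
    return np.stack([lightness, a, b], axis=-1)

def diff_map(target_rgb, edited_rgb):
    """Per-pixel squared distance in CIELAB between two (H, W, 3) images."""
    d = rgb_to_lab(edited_rgb) - rgb_to_lab(target_rgb)
    return (d ** 2).sum(axis=-1)

black = np.zeros((2, 2, 3))
white = np.ones((2, 2, 3))
unchanged = diff_map(white, white)
extreme = diff_map(black, white)
```

Averaging such maps over many target-output pairs gives an aggregate picture of where an edit changes the image.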

In Figs. 13 and 14 we show the diff maps corresponding to Figs. 3 and 4 respectively.

Appendix C Additional qualitative results with StyleGAN2

In this section we show additional results with StyleGAN2. Figs. 15 and 17 are extended versions of Fig. 6. Figs. 16 and 18 show their diff maps. Figs. 19 and 20 show results for StyleGAN2 trained on FFHQ. Additional examples can be found on the paper’s GitHub page, linked above.