Diagonal Attention and Style-based GAN for Content-Style Disentanglement in Image Generation and Translation

Gihyun Kwon, Jong Chul Ye

Introduction

Recent developments in Generative Adversarial Networks (GANs) have enabled the generation of high-quality images that are indistinguishable from real images to the human eye. Despite this progress, disentangling the attributes of the generated images is still an open problem.

For example, content and style disentanglement is an important issue in many image generation tasks, such as face generation. Here, content refers to spatial information such as face direction and expression, whereas style relates to other features such as color, makeup, and gender. StyleGAN, which shows state-of-the-art performance in image generation, tries to disentangle style and content using the AdaIN codes and the content feature vectors from random per-pixel noises, respectively. The AdaIN layer then combines the style and the content features to generate more realistic features at each resolution (see Fig. 2(a)). However, content control by per-pixel noises mostly affects minor spatial variations, so the disentanglement of global content and style is by no means complete.

Recently, generative models that simultaneously use AdaIN and independent content latent codes have shown good performance in separating global style and content information. For example, in the recent structured noise injection (SNI) approach, the latent code for content is generated by an additional neural network and used as the input tensor of the image generator, whose subsequent layers control style using AdaIN (see Fig. 2(b)). Although SNI showed good performance in disentanglement, one major drawback is that the input tensor is limited to a relatively small resolution (e.g. 4 $\times$ 4). Therefore, the intended content control often fails to work properly due to the limited capacity.

To address these issues, here we introduce a novel Diagonal spatial ATtention (DAT) module to manipulate the content feature in a hierarchical manner. Specifically, the content code is applied to multiple layer features as diagonal attention maps at various resolutions, as shown in Fig. 2(c). Despite the simplicity of diagonal attention, one important advantage of DAT is that the image content and style can be modulated independently in a symmetric manner: similar to AdaIN for the styles, DAT enables hierarchical control of the spatial content. Together, these properties lead to an effective disentanglement of the content and style components in generated images.

In addition, our method can be easily integrated into state-of-the-art GAN inversion, allowing much more flexible post-hoc control of the content and style of the translated images in multi-domain image translation.

Related works

Spatial attention helps visual learning tasks by highlighting the specific regions that contain important information. Several methods have used spatial attention to improve performance on visual tasks such as object detection, semantic segmentation, and image captioning. Spatial attention has been further extended to non-traditional image generation tasks. For example, self-attention GAN enhances the generation of geometrical and structural patterns. In image-to-image translation tasks, recent methods achieved realistic generation performance with attention maps that focus on the spatial areas of targeted objects or face components.

Disentangled representation

For disentangled image generation, several approaches have been proposed. Direct approaches rely on increasing the connection between the latent and image spaces, specialized training to constrain the latent space, manipulating the prior distribution of the latent, or using external attribute information. Other approaches for disentanglement rely on the hierarchical structure of networks: using layer-dependent latent variables in a VAE to encode the disentangled attributes, using a tree-like latent variable structure, or synthesizing image components in several stages. Despite the theoretical motivations, the above methods often suffer from poor generation quality due to limited network capacity, or from the need for additional attribute labels.

Recently, several authors have proposed using an additional latent vector that controls attributes independent of those controlled by the original one. For example, SC-GAN separates style and content information using AdaIN along with input content codes. Additionally, there are methods that employ style-content disentanglement to improve style transfer and image translation. Recently, a state-of-the-art style-content disentanglement method was proposed in the structured noise injection (SNI) approach, which allows the control of various spatial attributes by injecting structured noise as an input tensor of StyleGAN. Specifically, as shown in Fig. 2(b), a content latent $z_{c}$ is processed by a specific neural network and directly used as an input tensor of the generator network. However, one drawback of this approach is its asymmetric architecture: although the style can be manipulated in a multi-resolution manner using hierarchical AdaIN layers, the content is controlled using a single input.

Our contributions

The architecture of our method, which we call the diagonal GAN, is shown in Fig. 2(c) and has several advantages over existing disentanglement methods.

In contrast to the original StyleGAN in Fig. 2(a), the content and style code generation is symmetric, using similar code generators. Similar to the AdaIN layer, the diagonal attention (DAT) layer enables spatial control of the content in a hierarchical way, which is difficult with SNI in Fig. 2(b).

Whereas existing attention approaches are implemented by multiplying a fully populated attention matrix, our approach is unique in that it uses a diagonal attention matrix to manipulate content information. Despite its simple network architecture, this method is more efficient, as it enables much more powerful control of global content compared to the baseline StyleGAN model (see Fig. 2(a)).

Theory

To understand the motivation of the proposed DAT layer, here we provide some mathematical analysis of existing normalization and attention modules in neural networks. Our analysis shows that normalization and spatial attention have a similar structure that can be exploited for style and content disentanglement.

Let ${\bm{X}}$ denote the feature map arranged as a matrix whose rows correspond to the $HW$ spatial positions and whose columns correspond to the channels. The AdaIN normalization can then be represented by

$${\bm{Y}}={\bm{X}}{\bm{T}}+{\bm{R}},\qquad(1)$$

where the channel-directional transform ${\bm{T}}$ and the bias ${\bm{R}}$ are learned from the statistics of the feature maps. Specifically, ${\bm{T}}$ is a diagonal matrix that is calculated as the ratio of the standard deviations of the channel-wise input and target features, and ${\bm{R}}$ is the bias term that converts the input mean to the target mean.

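Similarly, the spatial attention can be represented by

$${\bm{Y}}={\bm{A}}{\bm{X}},\qquad(2)$$

where ${\bm{A}}$ is the attention matrix that acts on the spatial axis of the feature ${\bm{X}}$ .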

In StyleGAN, the content code ${\bm{C}}$ , which is generated from per-pixel noises via the scaling network ${\bm{B}}$ (see Fig. 2(a)), is added to the feature ${\bm{X}}$ before the AdaIN layer. This leads to the following feature transform:

$${\bm{Y}}=({\bm{X}}+{\bm{C}}){\bm{T}}+{\bm{R}}={\bm{X}}{\bm{T}}+{\bm{R}}+{\bm{C}}{\bm{T}}.\qquad(3)$$

Accordingly, the last term, ${\bm{C}}{\bm{T}}$ , works as an additional bias term, which is different from the spatial attention that is multiplicative to the feature ${\bm{X}}$ (see (2)). Although one could potentially generate ${\bm{C}}$ so that the net effect is similar to ${\bm{A}}{\bm{X}}$ , this would require a complicated content code generation network. This explains the fundamental limitation of the content control in the original StyleGAN.

Diagonal Attention (DAT)

If normalization and attention in (1) and (2) are applied together, the output feature can be represented by

$${\bm{Y}}={\bm{A}}{\bm{X}}{\bm{T}}+{\bm{R}}.\qquad(4)$$

One of the most important observations in this paper is that the combined equation (4) is the key to systematic style-content disentanglement. Specifically, ${\bm{T}}$ in (4) from the AdaIN layer is a diagonal matrix obtained from a style code generator. Mathematically, the diagonal matrix ${\bm{T}}$ controls the row space of the feature ${\bm{X}}$ , which turns out to be the style control. Accordingly, we conjecture that the spatial content can be controlled by manipulating the remaining factor: the column space of the feature ${\bm{X}}$ . Mathematically, this can be easily implemented by (4) using a diagonal attention matrix ${\bm{A}}$ that is obtained from another content code generator. The diagonal attention and diagonal normalization are then complementary to each other: they are applied to different axes of the feature tensor to simultaneously control the two independent factors of the feature tensor ${\bm{X}}$ . Furthermore, due to the symmetric roles of AdaIN and DAT, they can be applied at each layer in a hierarchical manner, as shown in Fig. 2(c).
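To make the row/column picture concrete, the combined transform can be written entry by entry. This is only a sketch: it assumes ${\bm{X}}\in\mathbb{R}^{HW\times C}$ (rows index pixels, columns index channels), ${\bm{A}}=\operatorname{diag}(a_{1},\dots,a_{HW})$ and ${\bm{T}}=\operatorname{diag}(t_{1},\dots,t_{C})$ . The lowercase symbols $a_{i}$ , $t_{j}$ , $x_{ij}$ , $r_{ij}$ are introduced here for illustration and are not notation from the paper.

```latex
\begin{align*}
\left[{\bm{A}}{\bm{X}}{\bm{T}}+{\bm{R}}\right]_{ij} &= a_{i}\,x_{ij}\,t_{j}+r_{ij},\\
({\bm{A}}{\bm{X}}){\bm{T}} &= {\bm{A}}({\bm{X}}{\bm{T}}).
\end{align*}
```

Under these assumptions, each entry is scaled by one factor that depends only on the pixel index $i$ and one that depends only on the channel index $j$ , so the two diagonal factors act on different axes and the order in which they are applied does not matter.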

Specifically, Fig. 2(c) describes the overall architecture of our proposed model. We adopt a method of using two different latent codes. In addition to the style latent code $z_{s}$ , we use an independent latent code $z_{c}$ to control the content information. More specifically, our style code $z_{s}$ is mapped into a linearly distributed space ${\bm{W}}_{s}$ by several MLPs. Then the mapped code $s$ is transformed into the parameters that can be applied to multiple layers as AdaIN. Similar to the style code mapping, the content code $z_{c}$ is also mapped to a linear space ${\bm{W}}_{I}$ through a mapping function consisting of a series of MLPs. The mapped intermediate content code $c$ can change the spatial information of the convolution output through the proposed attention mapping.

Figure 3 is a detailed diagram of our attention mapping network. Here, rather than directly estimating the diagonal component of ${\bm{A}}$ , we are interested in estimating the perturbation from the identity attention. Specifically, the mapped content code $c$ is converted into a vector of dimension $HW\times 1$ . The vector is then reshaped into a differential attention map ${\bm{d}}\in{\bm{W}}_{c}$ that has the same spatial dimension $H\times W$ as the output of the convolution layer. To avoid undesired artifacts from excessive diversity in the attention map, we limit the value range of the differential attention with a sigmoid activation. Because the attention is diagonal, the attention map is simply multiplied element-wise with the feature map at each channel, and the result is added to the original feature map. At this stage, we use an additional parameter $\beta$ that lets the network learn the layer-wise contribution of content control. Since $\beta$ calibrates the contribution of attention depending on whether the layer is responsible for minor or major changes, artifacts from overemphasizing minor changes can be prevented.

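Accordingly, the resulting feature output can be represented by

$${\bm{Y}}={\bm{X}}+\beta\operatorname{diag}({\bm{d}}){\bm{X}}=\left({\bm{I}}+\beta\operatorname{diag}({\bm{d}})\right){\bm{X}},\qquad(5)$$

where $\operatorname{diag}({\bm{d}})$ denotes the diagonal matrix whose diagonal elements are given by ${\bm{d}}$ . This suggests that the resulting diagonal attention matrix is

$${\bm{A}}={\bm{I}}+\beta\operatorname{diag}({\bm{d}}).\qquad(6)$$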

The DAT layer can also easily incorporate the per-pixel noises used in StyleGAN. However, care should be taken: per-pixel noise is only additive, so it can only introduce minor spatial variations, whereas our diagonal spatial attention is multiplicative, so it can control global spatial variations.
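The attention mapping described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the single linear layer (`weight`, `bias`), the helper names, and the toy sizes are all assumptions, and the feature map is stored as a list of $HW$ rows with $C$ channel values each.

```python
import math
import random


def sigmoid(v):
    """Logistic function used to bound the differential attention."""
    return 1.0 / (1.0 + math.exp(-v))


def attention_map(content_code, weight, bias):
    """Map a content code to a flattened differential attention map d.

    `weight` (H*W rows, one column per code entry) and `bias` are
    hypothetical parameters of a single linear layer followed by a sigmoid.
    """
    return [
        sigmoid(sum(w * c for w, c in zip(row, content_code)) + b)
        for row, b in zip(weight, bias)
    ]


def apply_dat(features, d, beta):
    """Return X + beta * diag(d) X: pixel i is scaled by (1 + beta * d[i])."""
    return [[(1.0 + beta * d_i) * x for x in row] for row, d_i in zip(features, d)]


def add_pixel_noise(features, noise, channel_scales):
    """Additive per-pixel noise with per-channel scaling, shown for contrast."""
    return [
        [x + s * n_i for x, s in zip(row, channel_scales)]
        for row, n_i in zip(features, noise)
    ]


if __name__ == "__main__":
    rng = random.Random(0)
    hw, channels, code_dim = 4, 3, 5  # toy sizes: a 2x2 map with 3 channels
    x = [[rng.gauss(0.0, 1.0) for _ in range(channels)] for _ in range(hw)]
    c = [rng.gauss(0.0, 1.0) for _ in range(code_dim)]
    w = [[rng.gauss(0.0, 1.0) for _ in range(code_dim)] for _ in range(hw)]
    d = attention_map(c, w, [0.0] * hw)
    print(apply_dat(x, d, beta=0.5))
```

Because the attention is diagonal, no $HW\times HW$ matrix is ever formed: the whole operation is one multiplication per feature entry. Setting `beta` to zero recovers the identity attention, which matches the idea of learning a perturbation around the identity.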

Method

Our implementation is inspired by the original StyleGAN paper and the source code (https://github.com/rosinality/style-based-gan-pytorch) implemented in PyTorch. Similar to StyleGAN, we choose the non-saturating loss with R1 regularization as the adversarial loss. We also use the Diversity-Sensitive (DS) loss to encourage the attention network to yield diverse maps. More specifically, if we sample two content codes $z_{c}^{1}$ , $z_{c}^{2}$ and a style code $z_{s}$ , our DS loss is defined as:

$$\mathcal{L}_{ds}=\max\left(\lambda-\left\|G(z_{s},z_{c}^{1})-G(z_{s},z_{c}^{2})\right\|_{1},\,0\right),$$

where $G(z_{s},z_{c})$ is the output of our generator for the style code $z_{s}$ and the content code $z_{c}$ . The objective of our DS loss is to maximize the $L_{1}$ distance between images generated from different content codes with the same style. However, directly optimizing the negative $L_{1}$ distance would lead to the explosion of the loss value. Therefore, we threshold the DS loss with $\lambda$ so that the distance is not pushed beyond $\lambda$ . Accordingly, our total loss function combines the adversarial loss with the DS loss.
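The thresholded DS loss described above can be sketched as follows. This is an illustrative sketch under stated assumptions: images are flat lists of pixel values, the $L_{1}$ distance is taken as a mean absolute difference (the normalization is an assumption), and `toy_generator` is a hypothetical stand-in for the trained generator.

```python
def l1_distance(image_a, image_b):
    """Mean absolute difference between two images given as flat lists."""
    return sum(abs(a - b) for a, b in zip(image_a, image_b)) / len(image_a)


def ds_loss(generator, z_s, z_c1, z_c2, lam):
    """Thresholded diversity-sensitive loss.

    The loss falls as the two images move apart and is exactly zero once
    their distance reaches `lam`, so nothing pushes the distance further.
    """
    distance = l1_distance(generator(z_s, z_c1), generator(z_s, z_c2))
    return max(lam - distance, 0.0)


def toy_generator(z_s, z_c):
    """Hypothetical generator used only to exercise the loss."""
    return [s + c for s, c in zip(z_s, z_c)]


if __name__ == "__main__":
    style = [0.0, 0.0]
    print(ds_loss(toy_generator, style, [0.0, 0.0], [0.1, 0.1], lam=0.3))
    print(ds_loss(toy_generator, style, [0.0, 0.0], [1.0, 1.0], lam=0.3))
```

The second call returns zero because the two toy images are already farther apart than the threshold, which is the mechanism that keeps the loss bounded.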

Experiment Settings

For qualitative evaluation, we report the results from the models trained on 1024 $\times$ 1024 CelebA-HQ and 512 $\times$ 512 AFHQ. In Supplementary Material, we also provide experimental results using the flowers, birds, and cars data sets. Considering the number of parameters needed for attention mapping at high resolutions, we include the DAT layers up to the resolution of 256 $\times$ 256.

For quantitative evaluation, we compared our method with the baseline models that use input noises for content control. Among several methods using this approach, we use the state-of-the-art SNI as a representative method. For fair comparison, we also included SNI trained with the content DS loss as a baseline. We also use the original StyleGAN results with per-pixel noises as another comparative model. For comparative studies with various parameter settings, we trained the models at the reduced resolution of 256 $\times$ 256 using 500K iterations (a total of $\sim$ 4.7M samples). As the baseline SNI presented results on models with and without per-pixel noises, we report our results under both conditions. When training our models, we set the parameter $\lambda$ in the DS loss to 0.3, as it showed the best performance. For more experimental settings, see Supplementary Materials.

For quantitative metrics, we use FID to measure image quality and Perceptual Path Length (PPL) to measure disentanglement. PPL was first proposed in StyleGAN to measure the perceptual distance between output images obtained by slightly perturbing the interpolated codes. A low PPL value means better disentanglement, since there is little interference from irrelevant features between two latent points. This can also be interpreted to mean that the latent space follows a linear trend. To measure the performance of the mapping networks that map both style and content codes into their respective linear spaces, we compare the disentanglement performance by the PPL in the $W$ (i.e. $W_{s}$ and $W_{c}$ ) space.

Experimental Results

Content and Style Disentanglement: Fig. 1 illustrates full-resolution images synthesized with different DAT and AdaIN codes. The left panel shows 1024 $\times$ 1024 images generated by our method trained on CelebA-HQ, whereas the right panel shows 512 $\times$ 512 images generated by our method trained on AFHQ. For a given source image (a), which is generated from arbitrary DAT content and AdaIN style codes, the images in (b) show the generated samples with varying style codes and the fixed content code, whereas (c) illustrates samples with varying content codes and the fixed style codes. We can clearly see the effect of the content code: the content of the faces, such as the direction and components, varies. This is different from the effect of the style codes in (b), which change the hair color, gender, etc., while the face direction and components are fixed. By using specific style and content codes in (b)(c), the images in (c) show that the face direction and components follow the content in (b), whereas the hair color, gender, etc. are controlled by the style in (b). This experiment clearly indicates the powerful content and style disentanglement achieved by our method.

Hierarchical Content Disentanglement: We also show the hierarchical disentanglement ability by controlling the diagonal attention map at each layer. The generated samples are presented in Figure 4. The source images in the top row are sampled from arbitrary DAT content and AdaIN style codes. The images in the second row are generated by changing the entire content code under fixed style codes. We can observe variations of all spatial attributes, including shape, rotation and facial expressions, with consistent styles. The images in the following rows are sampled by changing the content codes at specific layers while fixing those of the other layers. The first DAT at the 4 $\times$ 4 layer mainly focuses on geometrical change, and the second 4 $\times$ 4 DAT changes hairstyles and eye accessories. The 8 $\times$ 8 DAT layer mainly changes the lower part of facial expressions, and the DAT layers at higher resolutions give relatively minor variations such as hair curls and eyes.

Quantitatively, our CelebA-HQ model achieved a satisfactory FID of 7.32, compared with 5.17 for the original StyleGAN.

Hierarchical Latent Interpolation: Figure 5 shows examples generated by interpolating DAT content codes $c\in W_{I}$ between two randomly sampled points with a fixed style. The first row shows results from interpolating the content codes of all layers, whereas the remaining rows illustrate the results of interpolating the content codes of specific layers. Although a latent interpolation similar to the first row (Figure 5(a)) could be done by StyleGAN, the interpolation of fine spatial details in Figure 5(b)-(d), such as mouth expressions, is not possible in StyleGAN. In contrast, our method allows hierarchical content interpolation by interpolating the content codes of specific layers. This can also be seen in our AFHQ results in Figure 6. In addition to the changes of global content in Figure 6(a), we can smoothly change the specific attribute of the mouth by interpolating the content code of a specific layer, as shown in Figure 6(b). For additional interpolation results using other data sets, see Supplementary Materials.
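Layer-specific interpolation can be sketched as below. The sketch assumes that each DAT layer can be fed its own copy of the mapped content code, so that a code is represented as a list with one vector per layer; the function names and the `layers` argument are hypothetical and chosen for illustration only.

```python
def lerp(a, b, t):
    """Linear interpolation between two vectors of equal length."""
    return [(1.0 - t) * x + t * y for x, y in zip(a, b)]


def interpolate_content(codes_a, codes_b, t, layers=None):
    """Interpolate per-layer content codes.

    `codes_a` and `codes_b` hold one content vector per DAT layer.
    `layers` lists the layer indices to interpolate (None means all);
    every other layer keeps the code from `codes_a`.
    """
    selected = set(range(len(codes_a))) if layers is None else set(layers)
    return [
        lerp(a, b, t) if i in selected else list(a)
        for i, (a, b) in enumerate(zip(codes_a, codes_b))
    ]


if __name__ == "__main__":
    source = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
    target = [[2.0, 4.0], [2.0, 4.0], [2.0, 4.0]]
    # Sweep only the second layer while the others stay at the source code.
    for step in range(5):
        print(interpolate_content(source, target, step / 4, layers=[1]))
```

Passing `layers=None` corresponds to interpolating the content codes of all layers, while a single index corresponds to the layer-specific rows of the figures.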

Direct Manipulation of Diagonal Attention: In order to verify the meaning of our diagonal attention maps, Figure 7 shows the generated samples by directly manipulating the diagonal attention maps at specific layers. With 4 $\times$ 4 maps, we can generate the faces with arbitrary direction by changing the activated regions. Also for 8 $\times$ 8 maps, we can control the mouth expression with high values on larger mouth areas. In 16 $\times$ 16 maps, we can control the size of eyes by manipulating activated pixel areas of eyes. These show that our diagonal attention maps have a clear and intuitive relationship to different spatial regions. More examples can be found in our Supplementary Materials.

Quantitative Comparison Results

In Table 1, our model shows better performance in terms of the disentanglement metric in almost all of the settings. Specifically, both with and without per-pixel noises, our models show improved disentanglement metrics compared to SNI. The results indicate that our diagonal attention map obtains better disentanglement with richer control of the content than SNI. Even when the baseline SNI is trained with the DS loss, the model still cannot overcome the limitation of insufficient capacity, as indicated by the higher PPL scores. For further comparison, we measured the disentanglement not only of the entire $W$ space, but also of the style space $W_{s}$ and the content space $W_{c}$ separately. In all cases, our model achieved improved disentanglement performance with lower PPL scores. In addition, our models show comparable FID scores in almost all experimental settings. Although there is a slight degradation in some cases, it stems from the expected trade-off between image quality and disentanglement reported in prior work. To support the quantitative results, qualitative comparisons with other methods are provided in Supplementary Materials, in addition to extensive ablation studies.

Inverting Disentangled Model

To further highlight the advantage of our method, we additionally implement a GAN inversion framework in which real images are encoded into the latent spaces, from which various output images are generated by simply manipulating the content and style codes. For realistic image reconstruction, we use a version of the state-of-the-art inversion method IDinvert modified to include both DAT and AdaIN.

Specifically, we first pretrained our Diagonal GAN with multi-domain styles. Then, as shown in Fig. 9, we train the style encoder $SE$ , which has a double-head structure so that the style codes sampled from each head represent a specific domain style (e.g. male, female). Additionally, the content encoder $CE$ is trained to generate the content code. The generated style and content codes are fed into the pre-trained Diagonal GAN through AdaIN and DAT. Then, we train the network to reconstruct realistic input images. For encoder and Diagonal GAN training, we use 28,000 CelebA-HQ images at 256 $\times$ 256 resolution, which are split into the two domains of males and females. For testing, we use 2,000 images (1,000 male, 1,000 female). The detailed training process is elaborated in our Supplementary Materials.

Fig. 8 shows the synthesis results from our inversion model. First, the auto-encoding reconstruction results confirm that the network can successfully generate outputs similar to the input images. Then, Fig. 8(b) shows the results of changing the style codes: we can change the global styles of the inputs. In Fig. 8(c), we show the results of varying the content codes at each resolution layer. Thanks to the DAT layers, our model has much more flexibility than existing image translation models, as it allows hierarchical control of both content and style in the generated images.

For further evaluation, in Table 2, we compared the performance with the state-of-the-art image translation model StarGANv2. Since the existing StarGANv2 can only change the style, similar to Figure 8(b), we measured the quantitative performance of style synthesis for a fair comparison. Surprisingly, we achieved better image quality with comparable diversity even in style synthesis, for both latent-based sampling and reference-based transfer. The results show that our method has clear advantages, as it offers better image generation quality and more flexible content control than the existing state-of-the-art model. Detailed experiment settings and qualitative comparisons are provided in Supplementary Materials.

Conclusions

In this paper, we proposed a novel diagonal spatial attention (DAT) module as a complement to AdaIN in order to disentangle the style and content information. The symmetric structure of DAT and AdaIN enabled the independent control of the style and content of features in a hierarchical manner. Our extensive experiments showed that the style and content attributes of images can be independently manipulated in a hierarchical manner, confirming the style and content disentanglement in high-quality image generation. Moreover, the proposed method has also been successfully integrated into GAN inversion to achieve high-quality image translation with better disentanglement of content and style.

Supplementary Material

Diagonal GAN Experiments

For qualitative evaluation in the main text, our models were trained using the images from 1024 $\times$ 1024 CelebA-HQ and 512 $\times$ 512 AFHQ.

Specifically, using full-resolution CelebA-HQ images, we trained our model by accessing 20 million training samples. For efficient training with limited GPU capacity, we started with batch size of 512 in the first 8 $\times$ 8 resolution, and reduced the batch size by half each time when we proceeded to a larger image size. With the aforementioned training strategy, the overall training took about one month using a single Tesla V100 GPU. For learning rate, we used 0.001 until accessing 12 million samples, then decreased the learning rate to 0.0001. To test the effect of our diagonal attention (DAT) module in the qualitative evaluation, we removed per-pixel noises at each layer. We increased the $\lambda$ value of DS loss from 0.3 to 0.5 after accessing 12 million samples for the training. We used the left and right flips for data augmentation in all training procedures. For the case of full resolution AFHQ images, we used the same training settings as we did before for the CelebA-HQ dataset case, except that we used the fixed value of $\lambda$ as 0.3 throughout the training.
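The CelebA-HQ schedule described above can be written down as a small configuration sketch. It assumes that the resolution doubles at every stage from 8 $\times$ 8 up to 1024 $\times$ 1024 and that the switches happen exactly at 12 million samples; the function names are hypothetical.

```python
def batch_size_schedule(start_resolution=8, final_resolution=1024, start_batch=512):
    """Halve the batch size each time the resolution doubles (assumed staging)."""
    schedule = {}
    resolution, batch = start_resolution, start_batch
    while resolution <= final_resolution:
        schedule[resolution] = batch
        resolution *= 2
        batch = max(batch // 2, 1)
    return schedule


def learning_rate(samples_seen, switch_at=12_000_000):
    """0.001 until 12 million samples have been seen, then 0.0001."""
    return 1e-3 if samples_seen < switch_at else 1e-4


def ds_threshold(samples_seen, switch_at=12_000_000):
    """DS-loss threshold: 0.3 at first, raised to 0.5 after 12 million samples."""
    return 0.3 if samples_seen < switch_at else 0.5


if __name__ == "__main__":
    for resolution, batch in batch_size_schedule().items():
        print(f"{resolution}x{resolution}: batch size {batch}")
```

Under these assumptions, the batch size reaches 4 at the final 1024 $\times$ 1024 stage, since the resolution doubles seven times from 8 $\times$ 8.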

For further qualitative evaluations, we carried out experiments with additional image data sets: Oxford Flowers 102, Caltech-UCSD Birds (CUB2011), and Stanford Cars. For the flower data set, we first extracted the flower regions with center cropping. Then we resized the cropped images to 512 $\times$ 512. For the bird dataset, we extracted bird image areas using bounding box information. Then we changed the size of the extracted images to 256 $\times$ 256. For the car dataset, we also extracted the car image areas using bounding box information, then resized the cropped images to 384 $\times$ 512. We also used the left-right flips for data augmentation.

When training the models with the flower and car data sets, we continued training for up to 12 million samples. For the bird dataset, training was continued until we accessed 10 million samples. The training settings used for the AFHQ model were also used for the flower, car, and bird datasets.

To improve the perceptual quality, the images are generated by applying the truncation trick used in prior work. More specifically, we found that the best perceptual image quality was obtained by truncating the mapped style code in $\bm{W}_{s}$ at 0.7, whereas no truncation was used for $\bm{W}_{c}$ .
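For reference, the truncation trick as commonly used with StyleGAN pulls a mapped code toward the average mapped code. The sketch below assumes that the value 0.7 plays the role of the truncation factor `psi` in that formulation, which is an interpretation rather than something the text states; `mean_code` and `truncate` are hypothetical helper names.

```python
def mean_code(codes):
    """Average of a list of mapped codes, used as the truncation center."""
    count = len(codes)
    return [sum(values) / count for values in zip(*codes)]


def truncate(w, w_mean, psi):
    """Move w toward w_mean: psi=1 leaves w unchanged, psi=0 returns w_mean."""
    return [m + psi * (x - m) for x, m in zip(w, w_mean)]


if __name__ == "__main__":
    sampled = [[0.0, 2.0], [2.0, 4.0], [4.0, 0.0]]
    center = mean_code(sampled)
    # Truncate only the style code; the content code would be left untouched.
    print(truncate([4.0, 0.0], center, psi=0.7))
```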

AFHQ results: Fig. 11 shows the results of direct attention map manipulation of our full-resolution AFHQ model. Similar to our CelebA-HQ results, we could obtain the faces with the desired direction by manipulating the 4 $\times$ 4 map, and control the mouth opening by changing the values around mouth in 8 $\times$ 8 map. We also show the results of the interpolation of the content codes of our AFHQ model in Figs. 12 and 13. When the content codes at all levels are changed, the global spatial attributes are changed, and when the 8 $\times$ 8 maps are changed, the lower parts of the areas change. Quantitatively, our model trained with AFHQ in full resolution achieved 10.79 in FID.

CelebA-HQ results: We also show more results of the content code interpolation with the model trained with full-resolution CelebA-HQ in Figures 14 and 15. When the content codes for all layers are changed, the global spatial attributes are changed, and when the first 4 $\times$ 4 codes are changed, the face direction changes. When the 8 $\times$ 8 codes are changed, the lower parts of the faces change.

Flower results: To test the versatility of our proposed model, we trained our model using additional datasets. Fig. 16 illustrates the generated images from the model trained on flower dataset. When we change the content code with fixed style codes, we can change the spatial information such as flower shape, number, and location of flowers. On the other hand, if we vary style codes with fixed content codes, we can observe the changes of the global style attributes including species, flower color and background. Fig. 17 also shows the results by changing the content smoothly with interpolating the content codes. Our flower model scored 46.43 in FID.

Birds results: Fig. 18 shows the generated images from the model trained on the birds dataset. Fig. 18(b) shows samples with varying style codes and the fixed content code. We can observe the changes of the global style attributes, including species, feather colors and patterns. Fig. 18(c) illustrates samples generated with varying content codes and the fixed style. We can observe that the content attributes, including location, rotation and global shapes, change with different content codes. Fig. 19 also shows the content interpolation results, in which the birds smoothly change their head orientation. Our model trained on the birds dataset scored 14.27 in FID.

Cars results: Figure 20(a) shows the sample images from random content and style code. Then, Figure 20(b) shows the samples with varying style codes and the fixed content code. We can observe the changes of the global style attributes including car type, colors and background. Samples generated with varying content codes and the fixed style are provided in Figure 20(c). We can observe that the content attributes including rotation and global shapes change with different content codes. In Fig. 21, we can see the effect of content interpolation in terms of rotation angle. Our model on car dataset scored 8.96 in FID.

Quantitative Experiments

For quantitative evaluation, we compared our method with the baseline SNI model, SNI with DS loss, and the original StyleGAN. In order to carry out extensive comparative studies with various models, we trained the models with the reduced resolution of $256\times 256$ using 500,000 iterations (total of $\sim$ 4.7 million samples). We also used batch-size scheduling for efficient training. It took four days for training each model with a single NVIDIA RTX2080Ti GPU. For a fair comparison, we used the same non-saturating loss with R1 regularization in all the experiments. The same settings are also used in all models for our ablation studies and additional disentanglement studies.

For training the baseline SNI model and SNI with DS loss, we implemented the models in PyTorch based on the official source code (https://github.com/yalharbi/StructuredNoiseInjection). To follow the best settings in the original paper, we used the input tensor of 8 $\times$ 8 resolution for training the models. For the DS loss, $\lambda$ is set to 0.3, which shows the best results.

In order to quantitatively evaluate the image quality of the generated samples, we calculated the FID values. For CelebA-HQ with 30,000 training images, we computed the FID values with 50,000 generated samples. For the AFHQ and other data sets with relatively fewer training images, we calculated the FID values with 20,000 generated samples.

To calculate the total PPL of the $W$ space, we follow the same calculation proposed in StyleGAN. If we sample two style codes $s_{1},s_{2}\in W_{s}$ and two content codes $c_{1},c_{2}\in W_{c}$ , the PPL score is calculated as

$$\text{PPL}_{W}=\mathbb{E}\left[\frac{1}{\epsilon^{2}}\,d\Big(G\big(s(t),c(t)\big),\,G\big(s(t+\epsilon),c(t+\epsilon)\big)\Big)\right],$$

where $s(t)=(1-t)s_{1}+ts_{2}$ and $c(t)=(1-t)c_{1}+tc_{2}$ are linear interpolations of the sampled codes, $t$ is uniformly sampled between 0 and 1, $G(s,c)$ is the generator output for the style code $s$ and the content code $c$ , and $d(X,Y)$ denotes the perceptual distance between two images $X$ and $Y$ . We use $\epsilon=10^{-4}$ for all the calculations and report the average values computed using 10,000 generated samples.

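For the calculation of PPL for $W_{s}$ , we use a fixed content code $c_{fix}\in W_{c}$ and paired style codes $s_{1},s_{2}\in W_{s}$ :

$$\text{PPL}_{W_{s}}=\mathbb{E}\left[\frac{1}{\epsilon^{2}}\,d\Big(G\big(s(t),c_{fix}\big),\,G\big(s(t+\epsilon),c_{fix}\big)\Big)\right],\qquad s(t)=(1-t)s_{1}+ts_{2}.$$

To calculate $\text{PPL}_{W_{c}}$ , we use a fixed style code $s_{fix}\in W_{s}$ and sampled content codes $c_{1},c_{2}\in W_{c}$ :

$$\text{PPL}_{W_{c}}=\mathbb{E}\left[\frac{1}{\epsilon^{2}}\,d\Big(G\big(s_{fix},c(t)\big),\,G\big(s_{fix},c(t+\epsilon)\big)\Big)\right],\qquad c(t)=(1-t)c_{1}+tc_{2}.$$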

For the computation of PPL for $W_{s}$ (resp. $W_{c}$ ), for each fixed code, the style codes (resp. content codes) are sampled fifty times, and the average value is calculated by repeating this 200 times. Therefore, the final PPL value is calculated using 10,000 samples.
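The sampling protocol for the style-space PPL can be sketched as follows. This is a toy illustration, not the evaluation code: `toy_generator` and `squared_distance` are hypothetical stand-ins for the trained generator and the perceptual distance, and the samplers simply draw Gaussian vectors instead of running the mapping networks.

```python
import random


def lerp(a, b, t):
    """Linear interpolation between two codes."""
    return [(1.0 - t) * x + t * y for x, y in zip(a, b)]


def path_length(generator, distance, s1, s2, c1, c2, t, epsilon=1e-4):
    """Scaled distance between images at positions t and t + epsilon."""
    img_a = generator(lerp(s1, s2, t), lerp(c1, c2, t))
    img_b = generator(lerp(s1, s2, t + epsilon), lerp(c1, c2, t + epsilon))
    return distance(img_a, img_b) / epsilon ** 2


def ppl_style(generator, distance, sample_style, sample_content, rng,
              n_fixed=200, n_pairs=50, epsilon=1e-4):
    """PPL in W_s: each fixed content code is combined with n_pairs style pairs."""
    total = 0.0
    for _ in range(n_fixed):
        c_fix = sample_content(rng)
        for _ in range(n_pairs):
            s1, s2 = sample_style(rng), sample_style(rng)
            t = rng.uniform(0.0, 1.0)
            total += path_length(generator, distance, s1, s2, c_fix, c_fix, t, epsilon)
    return total / (n_fixed * n_pairs)


def toy_generator(s, c):
    """Hypothetical generator: the 'image' is the sum of both codes."""
    return [a + b for a, b in zip(s, c)]


def squared_distance(x, y):
    """Stand-in for a perceptual distance such as LPIPS."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


if __name__ == "__main__":
    rng = random.Random(0)
    sample = lambda r: [r.gauss(0.0, 1.0) for _ in range(4)]
    print(ppl_style(toy_generator, squared_distance, sample, sample, rng,
                    n_fixed=4, n_pairs=5))
```

With the default arguments the estimate averages over 200 fixed codes times 50 pairs, i.e. 10,000 samples. The content-space variant follows by swapping the roles of the two codes.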

Ablation studies

In our ablation study, we compared the quantitative performance of models trained with different settings. In all of the experiments, we used models trained with per-pixel noises.

In order to validate the choice of the value of $\lambda$ for the diversity-sensitive loss, we first show the results of various models that were trained with different $\lambda$ . In Table 3, the models trained with $\lambda=0.3$ show the best performance in the disentanglement capability, exhibiting the lowest PPL scores; furthermore, they showed the best image quality with the lowest FID. The models trained with lower or higher $\lambda$ values show degraded disentanglement performance. The results show that we can achieve the most balanced content-style control with both codes when we set $\lambda=0.3$ .

We also investigated the effect of different network architectures for the attention mapping, and show the results in Table 4. To ensure stability in training, we use attention mapping with a single layer MLP followed by a sigmoid applied to layers with a resolution of up to $256\times 256$ . To verify our choice of mapping network architecture, we implemented two additional networks for ablation study: 2 $\times$ MLP-256 and CNN-256. Here, 2 $\times$ MLP-256 represents a model which has attention mapping of 2-layer MLP instead of a single MLP. The model CNN-256 uses CNN layer-wise upsampling network to generate the diagonal attention. In all the experiments, we fixed $\lambda=0.3$ and used mapping network up to 256 $\times$ 256 layers.

Table 4 shows that a 2-layer MLP yields a well-disentangled model with relatively low PPL scores, but it still does not achieve the best performance. When a CNN is used as the attention network, the disentanglement scores are severely degraded, which may be due to the imbalance between the simple AdaIN network and the CNN-based attention mapping network.

Next, we carried out a comparative study by changing the maximum resolution of the diagonal attention (DAT) layers. In Table 4, the models that use DAT up to 32 $\times$ 32 (single MLP-32) and 64 $\times$ 64 (single MLP-64) have relatively high PPL values, suggesting that using DAT layers only at the lower levels limits the expressiveness for varied content information. The disentanglement quality is particularly impaired in the models trained on the AFHQ dataset. We suspect that the limited capacity for content control makes it more difficult to cover the variations of images in AFHQ, which are more diverse than those in CelebA-HQ. Therefore, we use DAT layers up to 256 $\times$ 256 resolution (single MLP-256) as our default model.

In terms of image quality measured by FID, our default model performed better than most of the baseline settings, with single MLP-64 as the exception. However, the disentanglement performance of single MLP-64 is relatively poor. Therefore, single MLP-256 gives the best overall result.

4 Disentanglement Experiments

To further quantify the disentanglement performance, we additionally measure the content and style diversity of the generated images. As a measure of image diversity, we use Learned Perceptual Image Patch Similarity (LPIPS). Since our model and the baseline SNI both have independent content and style codes, we can compare their style and content diversity. To measure the diversity when both codes vary, we compute the average LPIPS of 40,000 images sampled with arbitrary content and style codes. To measure the style and content diversity separately, we calculate the LPIPS of 40 images sampled by varying one code with the other code fixed, which is repeated 10,000 times to calculate the averaged LPIPS. This makes the total number of images for LPIPS calculation equal to 40,000 for both cases.
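The separate style and content diversity measurement can be sketched as below. This is a hedged illustration: it assumes diversity is the mean distance over all image pairs within each group, and `toy_generator` and `toy_distance` are hypothetical stand-ins for the trained generator and LPIPS. The group size and repeat count are left as parameters, and the toy run uses small values.

```python
import itertools
import random

def mean_pairwise_distance(images, distance):
    """Average distance over all unordered pairs in a group of images."""
    pairs = list(itertools.combinations(images, 2))
    return sum(distance(a, b) for a, b in pairs) / len(pairs)

def separate_diversity(generator, distance, sample_code,
                       n_images, n_repeats, vary="content", rng=None):
    """Vary one code while the other is fixed, then average over repeats."""
    rng = rng or random.Random(0)
    scores = []
    for _ in range(n_repeats):
        fixed = sample_code(rng)
        group = []
        for _ in range(n_images):
            varied = sample_code(rng)
            if vary == "content":
                style, content = fixed, varied
            else:
                style, content = varied, fixed
            group.append(generator(style, content))
        scores.append(mean_pairwise_distance(group, distance))
    return sum(scores) / len(scores)

# Toy generator in which the content code has only a weak effect (assumption).
def toy_generator(style, content):
    return [s + 0.1 * c for s, c in zip(style, content)]

def toy_distance(img_a, img_b):
    return sum((a - b) ** 2 for a, b in zip(img_a, img_b))

def toy_code(rng):
    return [rng.gauss(0.0, 1.0) for _ in range(4)]

content_div = separate_diversity(toy_generator, toy_distance, toy_code,
                                 n_images=5, n_repeats=20, vary="content")
style_div = separate_diversity(toy_generator, toy_distance, toy_code,
                               n_images=5, n_repeats=20, vary="style")
```

In this toy setting the content diversity is much lower than the style diversity, which mirrors the kind of imbalance the separate measurement is designed to expose.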

Table 5 reports the LPIPS scores. When both content and style codes are varied, SNI and our model show similar diversity in the generated images. However, when the diversity of style and content is examined separately, the content diversity of the baseline SNI is much lower than its style diversity, whereas our model has more balanced diversity in style and content. SNI trained with the DS loss on the content code shows slightly better content diversity than the baseline SNI. However, its content diversity is still lower than that of our model, as it cannot overcome the capacity limit of content control through the input tensor.

To support the above quantitative comparison in terms of LPIPS scores, we also qualitatively compared the effect of various content control methods: the per-pixel noises of StyleGAN, the input tensor of SNI, and our DAT mapping. Since the code spaces of the baseline models are slightly different, we compare the images generated from the mean style code by varying the content codes. In Figure 22, we can see that StyleGAN with different per-pixel noises produces only minor spatial variations such as curls of hair and fur. The baseline SNI can only change simple geometrical information such as rotation. For SNI trained with the DS loss, the generated samples have more diversity than those of the basic SNI, but the variation is still limited to geometry, as in the basic SNI. In contrast, our model allows more diverse changes in spatial information, including geometry, hairstyle (fur pattern), and facial expression, by controlling specific DAT layers.

Diagonal GAN Inversion

As discussed in the main text, our Diagonal GAN can easily be combined with GAN inversion. Specifically, our inversion model consists of two steps, as shown in Figure 10. The details of each step are as follows.

Step 1: The first step is to train our proposed Diagonal GAN. For domain-aware (i.e., females and males) image generation, we train a multi-domain Diagonal GAN in which the style mapping network $SM$ can sample multiple style codes with a multi-head structure. Specifically, we use two types of style codes that represent the male and female domains (see Fig. 10(a)). In contrast, the content mapping generates a unified content code $c=CM(z_{c})$ that can be used for both style domains. We used mapped style codes $s\in W_{s}$ with dimension 512 and mapped content codes $c\in W_{c}$ with dimension 512. Our discriminator $D$ also has a multi-head structure to simultaneously enable realistic generation and domain classification.

Step 2: After pre-training our generator network (Fig. 10(a)), we invert the real images into the latent spaces with our inversion network in Step 2 (Figs. 10(b)(c)). In this step, we use a modified version of the state-of-the-art GAN inversion method IDinvert. To encode the real images into the style and content code spaces, we introduce the style encoder $SE$ and the content encoder $CE$. Similar to the style mapping in Step 1, our style encoder has a multi-head structure after the last convolution layer.

The main idea of IDinvert is that when encoding a real image into a latent space, realistic reconstruction is possible only when the encoded latent code is constrained to lie within the learned latent space. To achieve this goal, as shown in Fig. 10(b), we first sample a random style code $s$ with a random domain label $y$ and a random content code $c$ using the pre-trained mapping networks $SM$ and $CM$, and generate a fake image $X_{fake}$. By feeding the generated $X_{fake}$ into the encoders, we obtain the encoded style and content codes $s_{f}$ and $c_{f}$, respectively. Then, our loss for latent codes is given by

which reduces the mean-squared error (MSE) between the encoded codes and the learned codes so that our encoder networks can generate the codes within the learned latent spaces.
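One form of the latent loss consistent with this description is sketched below. It assumes an unweighted sum of the squared errors for the two codes; the exact normalization and any expectation over sampled codes are assumptions.

```latex
L_{lat} = \left\| s - s_{f} \right\|_{2}^{2} + \left\| c - c_{f} \right\|_{2}^{2},
\qquad s_{f} = SE_{y}(X_{fake}), \quad c_{f} = CE(X_{fake}), \quad X_{fake} = G(s, c)
```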

Additionally, in Fig. 10(c), we feed the real image $X_{real}$ with the corresponding domain label $\hat{y}$ into the style and content encoders to get the style code $s_{e}$ and content code $c_{e}$, respectively. Then, the codes $c_{e}$ and $s_{e}$ are used in the DAT and AdaIN layers, respectively, of the pre-trained generator to obtain the reconstructed image $X_{rec}$. The goal of this step is to make $X_{rec}$ as close as possible to $X_{real}$. For realistic reconstruction, we reduce the distance between $X_{real}$ and $X_{rec}$ by using an MSE loss, an LPIPS loss that reduces the perceptual distance, and an adversarial loss using a new discriminator $D$, which also has a multi-head structure. For the adversarial loss, we used the same loss function as StyleGAN, which is composed of the non-saturating Softplus, $f(t)=\text{softplus}(t)=\text{log}(1+\text{exp}(t))$, with $R_{1}$ regularization.
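The non-saturating Softplus terms can be sketched as follows. This is a minimal illustration on scalar discriminator outputs, assuming the standard non-saturating logistic formulation; the function names are hypothetical, and the $R_{1}$ gradient penalty is omitted because it requires automatic differentiation.

```python
import math

def softplus(t):
    """Numerically stable log(1 + exp(t))."""
    return max(t, 0.0) + math.log1p(math.exp(-abs(t)))

def discriminator_adv_loss(d_real, d_fake):
    """Push D(real) up and D(reconstruction) down (R1 term not included)."""
    return softplus(d_fake) + softplus(-d_real)

def encoder_adv_loss(d_fake):
    """Non-saturating loss for the encoders: push D(reconstruction) up."""
    return softplus(-d_fake)

# A confident, correct discriminator has a low loss, and the encoder loss is
# then high because the reconstruction is easily recognized as fake.
good_d = discriminator_adv_loss(d_real=5.0, d_fake=-5.0)
bad_d = discriminator_adv_loss(d_real=-5.0, d_fake=5.0)
enc = encoder_adv_loss(d_fake=-5.0)
```

Here the reconstruction $X_{rec}$ plays the role of the fake sample, since the pre-trained generator is driven by the encoder outputs.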

Accordingly, our total loss function for the content and style encoder is given by

where $\lambda_{lat}$ and $\lambda_{adv}$ are weight parameters. On the other hand, the loss for the discriminator is

where $D_{\hat{y}}(\cdot)$ denotes the output of the discriminator $D$ corresponding to the domain ${\hat{y}}$ , $\gamma$ is a weight parameter for gradient $R_{1}$ regularization (the last term in $L_{D}$ ).
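A plausible form of the two objectives, consistent with the terms named above, is sketched below. The unit weights on the MSE and LPIPS terms, the $\gamma/2$ scaling of the $R_{1}$ term, and the omission of expectations over the data are assumptions.

```latex
\begin{aligned}
L_{E} &= \left\| X_{real} - X_{rec} \right\|_{2}^{2}
        + \text{LPIPS}(X_{real}, X_{rec})
        + \lambda_{lat} L_{lat}
        + \lambda_{adv}\, f\!\left(-D_{\hat{y}}(X_{rec})\right), \\
L_{D} &= f\!\left(D_{\hat{y}}(X_{rec})\right)
        + f\!\left(-D_{\hat{y}}(X_{real})\right)
        + \frac{\gamma}{2} \left\| \nabla_{X_{real}} D_{\hat{y}}(X_{real}) \right\|_{2}^{2}
\end{aligned}
```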

Inference time Latent-Regularized Optimization: At test time, we search for better latent codes by further optimizing them for better reconstruction. In this process, we use the latent optimization method proposed by IDinvert. Specifically, as an initialization for the style and content codes $s$ and $c$, we use the codes $s_{e}$ and $c_{e}$ from the pre-trained encoders in Step 2. In addition to reducing the distance between the input and its reconstruction, we include a latent regularization loss to keep the latent vectors $s$ and $c$ within the learned space of the encoders and the generator. The resulting loss function for optimization is:

where $G(s,c)$ denotes the generator output with style and content codes $s$ and $c$ , respectively, $SE_{\hat{y}}$ refers to the style encoder on the domain $\hat{y}$ , and $CE$ is the content encoder.
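A form of this objective consistent with the symbols defined above is sketched below. It assumes unit weights on the MSE and LPIPS reconstruction terms and a single weight $\lambda_{reg}$ shared by the two regularization terms.

```latex
(s^{*}, c^{*}) = \arg\min_{s,\, c}\;
  \left\| X_{real} - G(s,c) \right\|_{2}^{2}
  + \text{LPIPS}\!\left(X_{real}, G(s,c)\right)
  + \lambda_{reg} \left(
      \left\| s - SE_{\hat{y}}(G(s,c)) \right\|_{2}^{2}
      + \left\| c - CE(G(s,c)) \right\|_{2}^{2}
    \right)
```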

2 Method Details

In the GAN inversion experiments, we used the CelebA-HQ dataset at 256 $\times$ 256 resolution. Of the 30,000 images in total, 28,000 are used as the training set and 2,000 as the test set. The test set consists of 1,000 male and 1,000 female face images.

When training our Diagonal GAN network in Step 1, we trained the model until it had seen a total of 10 million training samples, which took about a week with a single NVIDIA RTX 2080 Ti GPU. Except for the maximum resolution, the training settings are the same as in our full-resolution CelebA-HQ experiments.

Our style encoder is a CNN with multi-head fully-connected layers, which has the same structure as the discriminator. The content encoder has the same architecture, except that it has a single-head structure. In Step 2, we trained the model with a batch size of 2 for 200,000 iterations. We used the Adam optimizer with an initial learning rate of 0.001, which was decreased to 0.0001 after 100,000 iterations. For the weight parameters, we set $\lambda_{lat}=1$ and $\lambda_{adv}=0.1$. This training took 2 days with a single NVIDIA RTX 2080 Ti GPU.
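The Step 2 learning-rate schedule can be written as a small helper. This is a sketch with a hypothetical function name; it assumes that iterations are counted from zero and that the lower rate applies from iteration 100,000 onward.

```python
def step2_learning_rate(iteration, base_lr=1e-3, decayed_lr=1e-4,
                        decay_at=100_000):
    """Piecewise-constant schedule: base_lr before decay_at, decayed_lr after."""
    return base_lr if iteration < decay_at else decayed_lr

# Number of image presentations in Step 2: batch size x iterations.
batch_size, total_iterations = 2, 200_000
images_seen = batch_size * total_iterations
```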

For our inference time latent-regularized optimization using (9), both the style and content latent codes were optimized for 100 iterations per input image. We used the Adam optimizer with a learning rate of 0.01 and set the loss weight to $\lambda_{reg}=2$. The optimization took 4 seconds per input image.

3 Experiment Details

To show the superior disentanglement performance of our model, we compared our method with the state-of-the-art diverse image translation model StarGANv2. Note that our inverted model can control both the content and style spaces using DAT and AdaIN layers, whereas StarGANv2 can only convert the styles of input images because it uses AdaIN layers exclusively. For a fair comparison, we used the pre-trained StarGANv2 that can be downloaded from the official GitHub repository https://github.com/clovaai/stargan-v2.

For quantitative evaluation, we measured quality in terms of FID and diversity in terms of LPIPS. Since StarGANv2 can only convert the style of the images, we consider only style conversion by the two methods in this quantitative comparison. We consider two image translation scenarios: 1) latent-based image translation, which converts the style of the input image to a random style by sampling the style codes, and 2) reference-based image translation, in which we convert the style of the inputs to that of the reference images. In each scenario, we measured the performance by converting a single image of one domain into 10 different target domain images. In our GAN inversion, the experiment was conducted by varying the style codes while using the same content code of the input image. As mentioned before, this compares the image quality during style translation, as StarGANv2 supports only style translation. Since the test set contains 1,000 images for each domain (female, male) and 10 target styles are used for each image, 10,000 synthesized images are obtained. Furthermore, we consider the domain conversion scenario (i.e., females $\leftrightarrows$ males), which doubles the number of synthesized images. Therefore, for each of the latent-based and reference-based experiments, we measured the metrics on 20,000 generated images. For more details, please refer to the original StarGANv2 paper, as we use the same evaluation process.
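The image count above can be checked by enumerating the translation jobs. The sketch below uses hypothetical names and treats each job as a (source domain, target domain, image index, style index) tuple, counting cross-domain translations in both directions.

```python
from itertools import product

def translation_jobs(images_per_domain=1000, styles_per_image=10,
                     domains=("female", "male")):
    """Enumerate every cross-domain translation performed in one experiment."""
    jobs = []
    for source, target in product(domains, repeat=2):
        if source == target:
            continue  # only cross-domain conversion is counted here
        for image_idx, style_idx in product(range(images_per_domain),
                                            range(styles_per_image)):
            jobs.append((source, target, image_idx, style_idx))
    return jobs

jobs = translation_jobs()
num_generated = len(jobs)
```

The same enumeration applies to both the latent-based and the reference-based experiment; only the source of the ten target style codes differs.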

In all the experiments, we used style and content codes obtained using inference time latent-regularized optimization process, except for the qualitative experiment of reference-based style synthesis, where style codes without latent-regularized optimization still provide better perceptual quality.

4 Inversion Experimental Results

Auto-encoder Reconstruction Results: Our inversion model, which extends the state-of-the-art inversion model, showed satisfactory reconstruction performance. To evaluate the reconstruction performance, we measured the distance between the input and the reconstructed image with MSE and LPIPS. Without inference time latent-regularized optimization, we obtained an MSE of 0.095 and an LPIPS of 0.246. With the additional latent-regularized optimization process, the model improved to an MSE of 0.042 and an LPIPS of 0.155. These results confirm that our model shows good reconstruction performance, and that more accurate reconstruction is possible when latent-regularized optimization is additionally used.

Qualitative comparison: In the main text, we have already compared our model with the baseline StarGANv2 to show that our model outperforms it in generation quality. Here, we provide more extensive qualitative comparison results to highlight the advantages of our model.

Figure 23 shows samples synthesized from an input image to follow the styles of reference images. Although StarGANv2 shows good performance in style synthesis from typical images (see Fig. 23(a)), it is still a conventional image translation model that uses the spatial information of the image as it is. Therefore, as shown in Fig. 23(b), when the content information of the input is complicated or rare, the generation performance is often severely degraded. In contrast, our model finds the content code that best expresses the input content in the pre-trained space, so that it can generate realistic images even with complex or rare input contents (see Fig. 23(b)). Figs. 24(a)(b) show the results of converting the input image to follow random styles. Again, our model generates more realistic images even if the input content is complex (see Fig. 24(b)).

To clearly show the strength of our model, Figure 25 shows our image translation results with content manipulation through content code interpolation. As explained before, StarGANv2 can change the contents only by using different input images, as in Figs. 25(a)(c). In contrast, as our model can control the content code space, Figure 25(b) shows that the content information of the translated images changes smoothly when the content codes are interpolated. Furthermore, Figure 26 shows the image translation experiment in which the hierarchical content codes are changed in addition to style synthesis. Unlike StarGANv2, which can only change the style of the input, we can further change specific content attributes: the face geometry by changing the 4 $\times$ 4 content codes, the hair shape by changing the 8 $\times$ 8 codes, and the mouth expression by changing the 16 $\times$ 16 codes.
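The two content manipulations described above can be sketched on plain lists of floats. This is an illustration under assumptions: content codes are linearly interpolated, the content code fed to each DAT layer can be set independently and is keyed here by layer resolution, and all names are hypothetical.

```python
def lerp_code(code_a, code_b, t):
    """Linear interpolation between two content codes."""
    return [(1.0 - t) * a + t * b for a, b in zip(code_a, code_b)]

def interpolate_content(code_a, code_b, steps):
    """Evenly spaced content codes from code_a to code_b (steps >= 2)."""
    return [lerp_code(code_a, code_b, i / (steps - 1)) for i in range(steps)]

def swap_level(codes_by_resolution, resolution, new_code):
    """Return a copy in which only one resolution's content code is replaced."""
    edited = dict(codes_by_resolution)
    edited[resolution] = list(new_code)
    return edited

# Smooth content change between two inverted images (toy 2-dimensional codes).
path = interpolate_content([0.0, 0.0], [1.0, 2.0], steps=3)

# Hierarchical edit: replace only the code used by the 8x8 layer.
source_codes = {4: [0.1, 0.2], 8: [0.3, 0.4], 16: [0.5, 0.6]}
edited_codes = swap_level(source_codes, 8, [9.0, 9.0])
```

Each interpolated or edited code would then be passed to the generator together with the chosen style code.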

The results clearly show that our model is capable of flexible content control in addition to the style control, which is not possible with the existing image translation models such as StarGANv2.