MaskGAN: Towards Diverse and Interactive Facial Image Manipulation

Cheng-Han Lee, Ziwei Liu, Lingyun Wu, Ping Luo

Introduction

Facial image manipulation is an important task in computer vision and computer graphics, enabling many applications such as automatic transfer of facial expressions and styles (e.g. hairstyle, skin color). This task can be roughly categorized into two types: semantic-level manipulation and geometry-level manipulation. However, these methods either operate on a pre-defined set of attributes or leave users little freedom to interactively manipulate the face images.

To overcome the aforementioned drawbacks, we propose a novel framework termed MaskGAN, which aims to enable diverse and interactive face manipulation. Our key insight is that semantic masks serve as a suitable intermediate representation for flexible face manipulation with fidelity preservation. Instead of directly transforming images in the pixel space, MaskGAN learns the face manipulation process as traversing on the mask manifold, thus producing more diverse results with respect to facial components, shapes, and poses. An additional advantage of MaskGAN is that it provides users an intuitive way to specify the shape, location, and facial component categories for interactive editing.

MaskGAN has two main components: 1) Dense Mapping Network and 2) Editing Behavior Simulated Training. The former learns the mapping between the semantic mask and the rendered image, while the latter learns to model the user editing behavior when manipulating masks. Specifically, Dense Mapping Network consists of an Image Generation Backbone and a Spatial-Aware Style Encoder. The Spatial-Aware Style Encoder takes both the target image and its corresponding semantic label mask as inputs and supplies spatial-aware style features to the Image Generation Backbone. After receiving a source mask with user modification, the Image Generation Backbone learns to synthesize faces according to the spatial-aware style features. In this way, our Dense Mapping Network is capable of learning the fine-grained style mapping between a user-modified mask and a target image.

Editing Behavior Simulated Training is a training strategy that models the user editing behavior on the source mask and introduces dual-editing consistency as an auxiliary supervision signal. Its training pipeline comprises a trained Dense Mapping Network, a pre-trained MaskVAE, and an alpha blender sub-network. The core idea is that the generation results of two locally perturbed input masks (obtained by traversing the mask manifold learned by MaskVAE), when blended together, should retain the subject’s appearance and identity information. Specifically, the MaskVAE with encoder-decoder architecture is responsible for modeling the manifold of geometrical structure priors. The alpha blender sub-network learns to perform alpha blending as image composition, which helps maintain the manipulation consistency. After training with editing behavior simulation, Dense Mapping Network is more robust to the various changes of the user-input mask during inference.

MaskGAN is comprehensively evaluated on two challenging tasks, including attribute transfer and style copy, showing superior performance compared to other state-of-the-art methods. To facilitate large-scale studies, we construct a large-scale high-resolution face dataset with fine-grained mask labels named CelebAMask-HQ. Specifically, CelebAMask-HQ consists of over 30,000 face images of 512 $\times$ 512 resolution, where each image is annotated with a semantic mask of 19 facial component categories, e.g. eye region, nose region, mouth region.

To summarize, our contributions are three-fold: 1) We present MaskGAN for diverse and interactive face manipulation. Within the MaskGAN framework, Dense Mapping Network is further proposed to provide users an interactive way of manipulating a face through its semantic label mask. 2) We introduce a novel training strategy termed Editing Behavior Simulated Training, which enhances the robustness of Dense Mapping Network to the shape variations of the user-input mask during inference. 3) We contribute CelebAMask-HQ, a large-scale high-resolution face dataset with mask annotations. We believe this geometry-oriented dataset will open new research directions for the face editing and manipulation community.

Related Work

Generative Adversarial Network. A GAN generally consists of a generator and a discriminator that compete with each other. Because GANs can generate realistic images, they enjoy pervasive applications on tasks such as image-to-image translation, image inpainting, and virtual try-on.

Semantic-level Face Manipulation. Deep semantic-level face editing has been studied for a few years, and many works have achieved impressive results. IcGAN introduced an encoder to learn the inverse mappings of conditional GAN. DIAT utilized adversarial loss to transfer attributes and learned to blend the predicted face and the original face. Fader Network leveraged adversarial training to disentangle attribute-related features from the latent space. StarGAN was proposed to perform multi-domain image translation using a single network conditioned on the target domain label. However, these methods cannot generate images by exemplars.

Geometry-level Face Manipulation. Some recent studies have started to discuss the possibility of transferring facial attributes at instance level from exemplars. For example, ELEGANT was proposed to exchange attributes between two faces by exchanging their latent codes. However, ELEGANT cannot transfer the attributes (e.g. ‘smiling’) from exemplars accurately. As for 3D-based face manipulation, although such methods achieve promising results on normal poses, they are often computationally expensive and their performance may degrade with large and extreme poses.

Our Approach

Training Pipeline. As shown in Fig. 11, MaskGAN is composed of three key elements: Dense Mapping Network (DMN), MaskVAE, and Alpha Blender, which are trained by Editing Behavior Simulated Training (EBST). DMN (see Sec. 3.1) provides users an interface for manipulating a face through its semantic label mask, and it learns a style mapping between $I^{t}$ and $M^{src}$. MaskVAE is responsible for modeling the manifold of structure priors (see Sec. 3.2). Alpha Blender is responsible for maintaining manipulation consistency (see Sec. 3.2). To make DMN more robust to changes of the user-defined mask $M^{src}$ at inference time, we propose a novel training strategy called EBST (see Sec. 3.2), which models the user editing behavior on $M^{src}$. This training method needs a well-trained DMN, a MaskVAE trained until its reconstruction error is low, and an Alpha Blender trained from scratch. The training pipeline can be divided into two stages. During training, we replace $M^{src}$ with $M^{t}$ as input. In Stage-I, we first update DMN with $M^{t}$ and $I^{t}$. In Stage-II, we use MaskVAE to generate two new masks $M^{inter}$ and $M^{outer}$ that differ slightly from $M^{t}$, and generate two faces $I^{inter}$ and $I^{outer}$ from them. Then, Alpha Blender blends these two faces into $I^{blend}$ to maintain manipulation consistency. After EBST, DMN is more robust to changes of $M^{src}$ in the inference stage. The details of the objective functions are given in Sec. 3.3.

Inference Pipeline. We only need DMN at test time. As shown in Fig. 12, unlike in the training stage, we simply replace the input of the Image Generation Backbone with $M^{src}$, where $M^{src}$ can be defined by the user.

Dense Mapping Network adopts the architecture of Pix2PixHD as a backbone and we extend it with an external encoder $Enc_{style}$ which will receive $I^{t}$ and $M^{t}$ as inputs. The detailed architecture is shown in Fig. 12.

Spatial-Aware Style Encoder. We propose a Spatial-Aware Style Encoder network $Enc_{style}$ which receives style information $I^{t}$ and its corresponding spatial information $M^{t}$ at the same time. To fuse these two domains, we utilize the Spatial Feature Transform (SFT) of SFT-GAN. The SFT layer learns a mapping function $\mathcal{M}:\Psi\mapsto(\gamma,\beta)$, where the affine transformation parameters $(\gamma,\beta)$ are obtained from the prior condition $\Psi$ as $(\gamma,\beta)=\mathcal{M}(\Psi)$. After obtaining $\gamma$ and $\beta$, the SFT layer performs both feature-wise and spatial-wise modulation on the feature map $F$ as $SFT(F|\gamma,\beta)=\gamma\odot F+\beta$, where the dimension of $F$ is the same as that of $\gamma$ and $\beta$, and $\odot$ denotes the element-wise product. Here we obtain the prior condition $\Psi$ from the features of $M^{t}$ and the feature map $F$ from $I^{t}$. Therefore, we can condition the spatial information $M^{t}$ on the style information $I^{t}$ and generate $x_{i},y_{i}$ as follows:

$x_{i},y_{i}=Enc_{style}(I^{t},M^{t})$

where $x_{i},y_{i}$ are affine parameters which contain spatial-aware style information. To transfer the spatial-aware style information to the target mask input, we leverage adaptive instance normalization (AdaIN) on the residual blocks $z_{i}$ in the DMN. The AdaIN operation, a state-of-the-art method in style transfer, is defined as:

$AdaIN(z_{i},x_{i},y_{i})=x_{i}\left(\frac{z_{i}-\mu(z_{i})}{\sigma(z_{i})}\right)+y_{i}$

which is similar to Instance Normalization, but replaces the affine parameters from IN with conditional style information.
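The two modulation steps described above can be illustrated with a minimal sketch. It assumes the standard AdaIN form (per-channel normalization by mean and standard deviation, followed by a scale and a shift), treats a feature channel as a flat list of values, and uses hypothetical function names; it is not the authors' implementation.

```python
from statistics import fmean, pstdev


def sft(feature, gamma, beta):
    """Element-wise SFT modulation: gamma * F + beta (all inputs share one shape)."""
    return [g * f + b for f, g, b in zip(feature, gamma, beta)]


def adain(channel, scale, shift, eps=1e-5):
    """Normalize one feature channel, then apply the conditional scale and shift.

    `channel` holds the flattened spatial values of a single channel;
    `scale` and `shift` play the roles of x_i and y_i for that channel.
    `eps` is an assumed small constant that avoids division by zero.
    """
    mean = fmean(channel)
    std = pstdev(channel)
    return [scale * (v - mean) / (std + eps) + shift for v in channel]
```

After `adain`, the channel mean equals `shift` and its spread is governed by `scale`, which is how style statistics can be injected into the residual blocks regardless of the statistics of the incoming features.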

DMN is a generator defined as $G_{A}$ where $I^{out}=G_{A}(Enc_{style}(I^{t},M^{t}),M^{t})$. With the Spatial-Aware Style Encoder, DMN learns the style mapping between $I^{t}$ and $M^{src}$ according to the spatial information provided by $M^{t}$. Therefore, styles (e.g. hairstyle and skin style) in $I^{t}$ are transferred to the corresponding positions on $M^{src}$ so that DMN can synthesize the final manipulated face $I^{out}$.

3.2 Editing Behavior Simulated Training

Structural Priors by MaskVAE. Similar to the Variational Autoencoder, the objective function for learning a MaskVAE consists of two parts: (i) $\mathcal{L}_{reconstruct}$, which controls the pixel-wise semantic label difference, and (ii) $\mathcal{L}_{KL}$, which controls the smoothness of the latent space. The overall objective is to minimize the following loss function:

where $\sigma_{j}$ denotes the $j$-th element of vector $\sigma$. Then, we can sample the latent vector by $z=\mu+r\odot\exp(\sigma)$ in the training phase, where $r\sim N(0,I)$ is a random vector and $\odot$ denotes element-wise multiplication.
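The sampling rule $z=\mu+r\odot\exp(\sigma)$ can be sketched in a few lines. This is a minimal illustration with a hypothetical function name; `mu` and `sigma` are assumed to be plain lists produced by the encoder.

```python
import math
import random


def sample_latent(mu, sigma, rng=random):
    """Draw z = mu + r * exp(sigma), with r ~ N(0, 1) sampled per element."""
    return [m + rng.gauss(0.0, 1.0) * math.exp(s) for m, s in zip(mu, sigma)]
```

Because the randomness enters only through `r`, the sample stays a deterministic function of `mu` and `sigma`, which is what allows gradients to flow through the sampling step in a VAE.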

Fig. 13 shows samples of linear interpolation between two masks. MaskVAE can perform smooth transition on masks and EBST relies on a smooth latent space to operate.
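Linear interpolation between two latent codes can be sketched as follows. The function name is hypothetical, and the decoding of each interpolated code back into a mask by the MaskVAE decoder is omitted.

```python
def interpolate_latents(z_a, z_b, steps):
    """Return `steps` latent codes evenly spaced from z_a to z_b (both inclusive)."""
    if steps < 2:
        raise ValueError("steps must be at least 2")
    codes = []
    for k in range(steps):
        t = k / (steps - 1)  # interpolation weight in [0, 1]
        codes.append([(1 - t) * a + t * b for a, b in zip(z_a, z_b)])
    return codes
```

A smooth latent space means that each intermediate code decodes to a plausible mask, so small steps along this line correspond to small, realistic changes in face geometry.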

Manipulation Consistency by Alpha Blender. To maintain the consistency of manipulation between $I^{blend}$ and $I^{t}$, we realize the alpha blending used in image composition with a deep neural network based Alpha Blender $B$, which learns the alpha blending weight $\alpha$ from two input images $I^{inter}$ and $I^{outer}$ as $\alpha=B(I^{inter},I^{outer})$. After learning an appropriate $\alpha$, Alpha Blender blends $I^{inter}$ and $I^{outer}$ according to $I^{blend}=\alpha\times I^{inter}+(1-\alpha)\times I^{outer}$. As shown in Stage-II of Fig. 11, Alpha Blender is jointly optimized with two weight-sharing Dense Mapping Networks. This group of models is defined as $G_{B}$.
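The blending rule itself is simple to illustrate. The sketch below assumes that $\alpha$ is a per-pixel weight map with the same layout as the flattened images; the text does not state whether $\alpha$ is a map or a scalar, so this is an assumption, and the network $B$ that predicts $\alpha$ is omitted.

```python
def alpha_blend(img_inter, img_outer, alpha):
    """Per-pixel blend: alpha * I_inter + (1 - alpha) * I_outer."""
    return [a * x + (1.0 - a) * y for x, y, a in zip(img_inter, img_outer, alpha)]
```

With `alpha` equal to 1 the result is `img_inter`, with 0 it is `img_outer`, and values in between mix the two generated faces.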

3.3 Multi-Objective Learning

The objective function for learning both $G_{A}$ and $G_{B}$ consists of three parts: (i) $\mathcal{L}_{adv}$, the conditional adversarial loss that makes generated images more realistic and corrects the generation structure according to the conditional mask $M^{t}$; (ii) $\mathcal{L}_{feat}$, which encourages the generator to produce natural statistics at multiple scales; and (iii) $\mathcal{L}_{percept}$, which improves content generation from low-frequency to high-frequency details by matching deep features of a VGG-19 trained on ImageNet. To improve the synthesis quality of high-resolution images, we leverage a multi-scale discriminator to increase the receptive field and reduce repeated patterns in the generated image. We use two discriminators, denoted $D_{1,2}$, with identical network structure that operate at two different scales. The overall objective is to minimize the following loss function:

$\mathcal{L}=\mathcal{L}_{adv}+\lambda_{feat}\mathcal{L}_{feat}+\lambda_{percept}\mathcal{L}_{percept}$

where $\lambda_{feat}$ and $\lambda_{percept}$ are both set to $10$, a value obtained through cross validation.

$\mathcal{L}_{adv}$ is the conditional adversarial loss defined by

$\mathcal{L}_{feat}$ is the feature matching loss which computes the $L1$ distance between the real and generated image using the intermediate features from discriminator by

$\mathcal{L}_{percept}$ is the perceptual loss which computes the $L1$ distance between the real and generated image using the intermediate features from a fixed VGG-19 model by
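Both the feature matching loss and the perceptual loss come down to an $L1$ distance between intermediate features of the real and generated images, taken from the discriminator in one case and from VGG-19 in the other. The following is a minimal sketch of that shared computation. It assumes each layer's features are given as a flat list and that each layer's distance is averaged over its elements; this normalization and the function name are assumptions, not details confirmed by the text.

```python
def l1_feature_distance(real_feats, fake_feats):
    """Sum over layers of the mean absolute difference between feature lists.

    `real_feats` and `fake_feats` are lists with one flat feature list per
    layer, extracted from the same network for the real and generated image.
    """
    total = 0.0
    for real, fake in zip(real_feats, fake_feats):
        total += sum(abs(r - f) for r, f in zip(real, fake)) / len(real)
    return total
```

Passing discriminator activations gives a feature matching style term, and passing activations of a fixed VGG-19 gives a perceptual style term.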

CelebAMask-HQ Dataset

We built a large-scale face semantic label dataset named CelebAMask-HQ, which was labeled according to CelebA-HQ, a dataset that contains 30,000 high-resolution face images from CelebA. It has several appealing properties:

Comprehensive Annotations. CelebAMask-HQ was precisely hand-annotated with the size of 512 $\times$ 512 and 19 classes including all facial components and accessories such as ‘skin’, ‘nose’, ‘eyes’, ‘eyebrows’, ‘ears’, ‘mouth’, ‘lip’, ‘hair’, ‘hat’, ‘eyeglass’, ‘earring’, ‘necklace’, ‘neck’, and ‘cloth’.

Label Size Selection. The size of images in CelebA-HQ was 1024 $\times$ 1024. However, we chose the size of 512 $\times$ 512 because the cost of labeling faces at 1024 $\times$ 1024 would be quite high. Besides, we could easily extend the labels from 512 $\times$ 512 to 1024 $\times$ 1024 by nearest-neighbor interpolation without introducing noticeable artifacts.
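Nearest-neighbor interpolation suits label masks because it only copies existing class indices and never creates blended, invalid labels. A minimal sketch with a hypothetical function name, treating a mask as a list of rows of integer labels:

```python
def upsample_nearest(mask, factor=2):
    """Upsample a 2D label mask by repeating each label `factor` times per axis."""
    out = []
    for row in mask:
        wide = [label for label in row for _ in range(factor)]
        out.extend(list(wide) for _ in range(factor))
    return out
```

With `factor=2`, a 512 $\times$ 512 mask becomes 1024 $\times$ 1024 while the set of labels stays unchanged.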

Quality Control. After manual labeling, we performed a quality control check on every single segmentation mask. Furthermore, we asked annotators to refine all masks over several rounds of iterations.

Amodal Handling. For occlusion handling, if a facial component was partly occluded, we asked annotators to label the occluded parts of the component by inferring them. On the other hand, we skipped the annotations for components that were totally occluded.

Table 5 compares the dataset statistics of CelebAMask-HQ with those of the Helen dataset.

Experiments

We comprehensively evaluated our approach by showing quantitative and visual quality on different benchmarks.

CelebA-HQ is a high-quality facial image dataset that consists of 30,000 images picked from the CelebA dataset. These images are processed with quality improvement to the size of 1024 $\times$ 1024. We resize all images to the size of 512 $\times$ 512 for our experiments.

CelebAMask-HQ. Based on CelebA-HQ, we propose a new dataset named CelebAMask-HQ which has 30000 semantic segmentation labels with a size of 512 $\times$ 512. Each label in the dataset has 19 classes.

5.2 Implementation Details

Network Architectures. The Image Generation Backbone in Dense Mapping Network follows the design of Pix2PixHD with 4 residual blocks. Alpha Blender also follows the design of Pix2PixHD but downsamples only 3 times and uses 3 residual blocks. The architecture of MaskVAE is similar to UNet without skip connections. The Spatial-Aware Style Encoder in DMN does not use any Instance Normalization (IN) layers, since they would remove style information. All the other convolutional layers in DMN, Alpha Blender, and Discriminator are followed by IN layers. MaskVAE utilizes Batch Normalization in all layers.

Comparison Methods. We choose the state-of-the-art StarGAN, ELEGANT, Pix2PixHD, and SPADE as our baselines. StarGAN performs semantic-level facial attribute manipulation. ELEGANT performs geometry-level facial attribute manipulation. Pix2PixHD performs photo-realistic image synthesis from the semantic mask. We simply remove the branch for receiving $M^{t}$ in the Spatial-Aware Style Encoder of Dense Mapping Network to obtain a baseline called Pix2PixHD-m. SPADE performs structure-conditional image manipulation on natural images.

5.3 Evaluation Metrics

Semantic-level Evaluation. To evaluate a method of manipulating a target attribute, we examined the classification accuracy of synthesized images. We trained binary facial attribute classifiers for specific attributes on the CelebA dataset by using ResNet-18 architecture.

Geometry-level Evaluation. To measure the quality of mask-conditional image generation, we applied a pre-trained face parsing model with U-Net architecture to the generated images and measured the consistency between the input layout and the predicted parsing results in terms of pixel-wise accuracy.
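Pixel-wise accuracy is the fraction of pixels whose predicted label matches the input layout. A minimal sketch with a hypothetical function name, treating masks as lists of rows of integer labels and omitting the parsing model itself:

```python
def pixel_accuracy(input_mask, predicted_mask):
    """Fraction of pixels where the predicted label equals the input label."""
    total = 0
    correct = 0
    for row_in, row_pred in zip(input_mask, predicted_mask):
        for label_in, label_pred in zip(row_in, row_pred):
            total += 1
            correct += label_in == label_pred
    return correct / total
```

A score of 1.0 means the parsing of the generated face reproduces the input mask exactly.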

Distribution-level Evaluation. To compare images generated by different models, we used the Fréchet Inception Distance (FID), which measures the quality and diversity of generated images.

Human Perception Evaluation. We performed a user survey to evaluate perceptual generation quality. Given a target image (and a source image in the experiment of style copy), the user was required to choose the best-generated image based on two criteria: 1) quality of transfer in attributes and style 2) perceptual realism. The options were randomly shuffled images generated from different methods.

Identity Preserving Evaluation. To further evaluate the identity preservation ability, we conducted an additional face verification experiment by ArcFace (99.52% on LFW). In the experimental setting, we selected 400 pairs of faces from testing set in CelebA-HQ, and each pair contained a modified face (Smiling) and an unmodified face. Besides, in the testing stage, each face was resized to 112 $\times$ 112.

5.4 Comparisons with Prior Works

The comparison is performed w.r.t. three aspects: semantic-level evaluation, geometry-level evaluation, and distribution-level evaluation. We denote our approach as MaskGAN and MaskGAN† for reference, where † indicates that the model is equipped with Editing Behavior Simulated Training. We refer to Pix2PixHD with modification as Pix2PixHD-m.

Evaluation on Attribute Transfer. We choose Smiling for comparison, since it is the most challenging attribute type to transfer in previous works. More specifically, smiling influences the whole expression of a face and has large geometric variety. To generate the user-modified mask as input, we conducted head pose estimation on the testing set using HopeNet. With the angle information of roll, pitch, and yaw, we selected 400 source and target pairs with a similar pose from the testing set. Then, we directly replaced the mask of mouth, upper lip and lower lip from target mask to source mask. Fig. 14, Fig. 15 and Table 2 show the visual and quantitative results of MaskGAN and state-of-the-art methods. For a fair comparison, StarGAN* and ELEGANT* denote models trained on images with a size of 256 $\times$ 256. StarGAN has the best classification accuracy and FID scores but fails in the smiling region, possibly because its performance is influenced by the size of the training data and the network design. ELEGANT has plausible results but sometimes cannot transfer smiling from the source image accurately because it exchanges attributes from the source image in latent space. SPADE gets the best segmentation accuracy but has inferior reconstruction ability compared with the others, since the target image does not carry spatial information for learning a better mapping with the user-defined mask. MaskGAN has plausible visual quality and relatively high classification accuracy and segmentation accuracy.

Evaluation on Style Copy. To illustrate the robustness of our model, we test MaskGAN on a more difficult task: geometry-level style copy. Style copy can also be seen as manipulating one face structure into another. We selected 1000 target images from the testing set, and the source images were selected from the target images in a different order. In this setting, about half of the pairs differ in gender. Fig. 16, Fig. 17 and Table 3 show the visual and quantitative results of MaskGAN and state-of-the-art methods. Judging from the visual results and attribute classification accuracy (from left to right: Male, Heavy Makeup, and No Beard), SPADE obtains the best segmentation accuracy by using Spatially-Adaptive Normalization, but it fails to keep attributes (e.g. gender and beard). MaskGAN shows a better ability to transfer styles such as makeup and gender than SPADE and Pix2PixHD-m, since it introduces spatial information into the style features and simulates the user editing behavior via dual-editing consistency during training.

Evaluation on Identity Preserving. As the experimental results in Table 4 show, our MaskGAN is superior to other state-of-the-art mask-to-image methods in identity preservation. We also explored adding a face identification loss; however, the performance gain was limited, so we removed the loss from our final framework.

5.5 Ablation Study

In the ablation study, we consider two variants of our model: (i) MaskGAN and (ii) MaskGAN†.

Dense Mapping Network. In Fig. 15, we observe that Pix2PixHD-m is influenced by the prior information contained in the user-modified mask. For example, if the user modifies the mask to be a female while the target image looks like a male, the predicted image tends to be a female with makeup and no beard. Besides, Pix2PixHD-m cannot transfer the style from the target image to the user-modified mask accurately. With the Spatial-Aware Style Encoder, MaskGAN not only prevents the generated results from being influenced by prior knowledge in the user-modified mask, but also accurately transfers the style of the target image.

Editing Behavior Simulated Training. Table 2 and Table 3 show that simulating editing behavior in training can prevent content generation in the inference stage from being influenced by structural changes in the user-modified mask. It improves the robustness of attribute keeping, so that MaskGAN† achieves better evaluation scores.

5.6 Interactive Face Editing

Our MaskGAN allows users to interactively edit the shape, location, and category of facial components at geometry-level through a semantic mask interface. The interactive face editing results are illustrated in Fig. 16. The first row shows examples of adding accessories like eyeglasses, earrings, and hats. The second row shows examples of editing face shape and nose shape. The third row shows examples of adding hair. More results are in the supplementary materials.

Conclusions

In this work, we have proposed a novel geometry-oriented face manipulation framework, MaskGAN, with two carefully designed components: 1) Dense Mapping Network and 2) Editing Behavior Simulated Training. Our key insight is that semantic masks serve as a suitable intermediate representation for flexible face manipulation with fidelity preservation. MaskGAN is comprehensively evaluated on two challenging tasks, attribute transfer and style copy, showing superior performance over other state-of-the-art methods. We further contribute a large-scale high-resolution face dataset with fine-grained mask annotations, named CelebAMask-HQ. Future work includes combining MaskGAN with image completion techniques to further preserve details in the regions without editing.

Acknowledgement. This work was partially supported by HKU Seed Fund for Basic Research, Start-up Fund and Research Donation from SenseTime.

A Additional Implementation Details

Our MaskGAN is composed of four key components: MaskVAE, Dense Mapping Network, Alpha Blender, and Discriminator. Specifically, Dense Mapping Network contains two elements: Image Generation Backbone, Spatial-Aware Style Encoder. More details about the architecture design of these components and training details are shown below.

Image Generation Backbone. We choose the architecture of Pix2PixHD as the Image Generation Backbone. The detailed architecture is as follows: $c7s1-64,d128,d256,d512,d1024,R1024,R1024,R1024,R1024,u512,u256,u128,u64-c7s1$. We utilize AdaIN for all residual blocks; the other layers use IN. We do not further utilize a local enhancer because we conduct all experiments on images with a size of 512 $\times$ 512.
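To make the architecture string easier to read, the sketch below traces the spatial resolution through the listed layers. It assumes the Pix2PixHD naming convention, in which `c7s1-k` is a 7 $\times$ 7 stride-1 convolution, `dk` is a stride-2 downsampling convolution, `Rk` is a residual block, and `uk` is a stride-1/2 upsampling convolution, each with `k` filters. The final output convolution is left out of the list, and the names `BACKBONE` and `trace_resolution` are hypothetical.

```python
# Layer tokens copied from the architecture string (output convolution omitted).
BACKBONE = [
    "c7s1-64", "d128", "d256", "d512", "d1024",
    "R1024", "R1024", "R1024", "R1024",
    "u512", "u256", "u128", "u64",
]


def trace_resolution(layers, size):
    """Return (layer, spatial size after the layer) for a square input."""
    trace = []
    for layer in layers:
        if layer.startswith("d"):    # stride-2 convolution halves the size
            size //= 2
        elif layer.startswith("u"):  # stride-1/2 convolution doubles the size
            size *= 2
        trace.append((layer, size))
    return trace
```

For a 512 $\times$ 512 input, the four downsampling layers bring the feature maps to 32 $\times$ 32, where the residual blocks operate, and the four upsampling layers restore the 512 $\times$ 512 resolution.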

Spatial-Aware Style Encoder. As shown in Fig. 11, the Spatial-Aware Style Encoder consists of two branches that receive style and spatial information, respectively. To fuse the two different domains, we leverage the SFT layers of SFT-GAN. The detailed architecture of the SFT layer is shown in Fig. 12; none of its layers use normalization.

Alpha Blender. Alpha Blender also follows the design of Pix2PixHD but downsamples only three times and uses three residual blocks. The detailed architecture is as follows: $c7s1-32,d64,d128,d256,R256,R256,R256,u128,u64,u32-c7s1$, which uses IN for all layers.

Discriminator. Our discriminator design also follows Pix2PixHD, which utilizes PatchGAN. We concatenate the masks and images as inputs to realize a conditional GAN. The detailed architecture is as follows: $c64,c128,c256,c512$, which uses IN for all layers.

Training Details. Our Dense Mapping Network and MaskVAE are both updated with the Adam optimizer ($\beta_{1}=0.5$, $\beta_{2}=0.999$, learning rate of $2\times10^{-4}$). For Editing Behavior Simulated Training, we reduce the learning rate to $5\times10^{-5}$. MaskVAE is trained with a batch size of 16 and MaskGAN is trained with a batch size of 8.

B Additional Ablation Study

A simple quantitative comparison is shown in Table 5. SFT layers utilize more parameters to fuse two different domains together. As a result, it is reasonable that SFT layers perform better than concatenation.

In Fig. 13, we show a visual comparison of style copy. The results with EBST have better color saturation and attribute keeping quality (heavy makeup).

C Additional Visual Results

In Fig. 14, Fig. 15, Fig. 16, and Fig. 17, we show additional visual results of attribute transfer for a specific attribute: Smiling. We compare our MaskGAN with state-of-the-art methods including Pix2PixHD with modification, ELEGANT, and StarGAN.

In Fig. 18, Fig. 19, Fig. 20 and Fig. 21, we show additional visual results of style copy. We compare our MaskGAN with state-of-the-art methods including Pix2PixHD with modification.

In the accompanying video, we demonstrate our interactive facial image manipulation interface. Users can edit the shape of facial components or add accessories by manipulating the semantic segmentation mask.