Free-Form Image Inpainting with Gated Convolution

Jiahui Yu, Zhe Lin, Jimei Yang, Xiaohui Shen, Xin Lu, Thomas Huang

Introduction

Image inpainting (a.k.a. image completion or image hole-filling) is the task of synthesizing alternative contents in missing regions such that the modification is visually realistic and semantically correct. It allows users to remove distracting objects or retouch undesired regions in photos. It can also be extended to tasks including image/video un-cropping, rotation, stitching, re-targeting, re-composition, compression, super-resolution, harmonization and many others.

In computer vision, two broad approaches to image inpainting exist: patch matching using low-level image features and feed-forward generative models with deep convolutional networks. The former approach can synthesize plausible stationary textures, but usually makes critical failures in non-stationary cases like complicated scenes, faces and objects. The latter approach can exploit semantics learned from large scale datasets to synthesize contents in non-stationary images in an end-to-end fashion.

However, deep generative models based on vanilla convolutions are naturally ill-fitted for image hole-filling because the spatially shared convolutional filters treat all input pixels or features as equally valid. For hole-filling, the input to each layer is composed of valid pixels/features outside holes and invalid ones in masked regions. Vanilla convolutions apply the same filters to all valid, invalid and mixed (for example, those on hole boundaries) pixels/features, leading to visual artifacts such as color discrepancy, blurriness and obvious edge responses surrounding holes when tested on free-form masks.

To address this limitation, partial convolution was recently proposed, where the convolution is masked and normalized to be conditioned only on valid pixels. It is then followed by a rule-based mask-update step to update valid locations for the next layer. Partial convolution categorizes all input locations as either invalid or valid, and multiplies a zero-or-one mask to inputs throughout all layers. The mask can also be viewed as a single un-learnable feature gating channel. (Footnote: applying the mask before or after the convolution is equivalent when convolutions are stacked layer by layer in neural networks, because the output of the current layer is the input to the next layer and the masked region of the input image is already filled with zeros.) However, this assumption has several limitations. First, considering the input spatial locations across different layers of a network, they may include (1) valid pixels in the input image, (2) masked pixels in the input image, (3) neurons with a receptive field covering no valid pixel of the input image, (4) neurons with receptive fields covering different numbers of valid pixels of the input image (these valid image pixels may also have different relative locations), and (5) synthesized pixels in deep layers. Heuristically categorizing all locations as either invalid or valid ignores this important information. Second, if we extend to user-guided image inpainting where users provide a sparse sketch inside the mask, should these pixel locations be considered valid or invalid? How should the mask be updated for the next layer? Third, for partial convolution the "invalid" pixels progressively disappear layer by layer and the rule-based mask will be all ones in deep layers. However, to synthesize pixels in the hole, these deep layers may also need to know whether the current locations are inside or outside the hole. Partial convolution with an all-ones mask cannot provide such information.
We will show that if we allow the network to learn the mask automatically, the mask may have different values based on whether the current locations are masked or not in the input image, even in deep layers.

We propose gated convolution for free-form image inpainting. It learns a dynamic feature gating mechanism for each channel and each spatial location (for example, inside or outside masks, RGB channels or user-guidance channels). Specifically, we consider the formulation where the input feature is first used to compute gating values $g=\sigma(w_{g}x)$ ($\sigma$ is the sigmoid function, $w_{g}$ is a learnable parameter). The final output is a multiplication of the learned feature and the gating values, $y=\phi(wx)\odot g$, where $\phi$ can be any activation function. Gated convolution is easy to implement and performs significantly better when (1) the masks have arbitrary shapes and (2) the inputs are no longer simply RGB channels with a mask but also have conditional inputs like a sparse sketch. For network architectures, we stack gated convolutions to form an encoder-decoder network following prior work. Our inpainting network also integrates a contextual attention module within the same refinement network to better capture long-range dependencies.

For practical image inpainting tools, enabling user interactivity is crucial because there could exist many plausible solutions for filling a hole in an image. To this end, we present an extension to allow user sketch as guided input. Comparison to other methods is summarized in Table 1. Our main contributions are as follows: (1) We introduce gated convolution to learn a dynamic feature selection mechanism for each channel at each spatial location across all layers, significantly improving the color consistency and inpainting quality of free-form masks and inputs. (2) We present a more practical patch-based GAN discriminator, SN-PatchGAN, for free-form image inpainting. It is simple, fast and produces high-quality inpainting results. (3) We extend our inpainting model to an interactive one, enabling user sketch as guidance to obtain more user-desired inpainting results. (4) Our proposed inpainting system achieves higher-quality free-form inpainting than previous state-of-the-art methods on benchmark datasets including Places2 natural scenes and CelebA-HQ faces. We show that the proposed system helps users quickly remove distracting objects, modify image layouts, clear watermarks and edit faces in images.

Related Work

A variety of approaches have been proposed for image inpainting. Traditionally, patch-based algorithms progressively extend pixels close to the hole boundaries based on low-level features (for example, features of mean square difference on RGB space), to search and paste the most similar image patch. These algorithms work well on stationary textural regions but often fail on non-stationary images. Further, Simakov et al. propose a bidirectional similarity synthesis approach to better capture and summarize non-stationary visual data. To reduce the high cost of memory and computation during search, tree-based acceleration structures of memory and randomized algorithms are proposed. Moreover, inpainting results are improved by matching local features like image gradients and offset statistics of similar patches. Recently, image inpainting systems based on deep learning have been proposed to directly predict pixel values inside masks. A significant advantage of these models is the ability to learn adaptive image features for different semantics. Thus they can synthesize more visually plausible contents, especially for images like faces, objects and natural scenes. Among all these methods, Iizuka et al. propose a fully convolutional image inpainting network with both global and local consistency to handle high-resolution images on a variety of datasets. This approach, however, still heavily relies on Poisson image blending with traditional patch-based inpainting results. Yu et al. propose an end-to-end image inpainting model by adopting stacked generative networks to further ensure the color and texture consistency of generated regions with surroundings. Moreover, for capturing long-range spatial dependencies, a contextual attention module is proposed and integrated into networks to explicitly borrow information from distant spatial locations. However, this approach is mainly trained on large rectangular masks and does not generalize well to free-form masks.
To better handle irregular masks, partial convolution is proposed where the convolution is masked and re-normalized to utilize valid pixels only. It is then followed by a rule-based mask-update step to re-compute new masks layer by layer.

2 Guided Image Inpainting and Synthesis

To improve image inpainting, user guidance is explored including dots or lines, structures, transformation or distortion information and image exemplars. Notably, Hays and Efros first utilize millions of photographs as a database to search for an example image which is most similar to the input, and then complete the image by cutting and pasting the corresponding regions from the matched image.

Recent advances in conditional generative networks empower user-guided image processing, synthesis and manipulation learned from large-scale datasets. Here we selectively review several related works. Zhang et al. propose colorization networks which can take user guidance as additional inputs. Wang et al. propose to synthesize high-resolution photo-realistic images from semantic label maps using conditional generative adversarial networks. The Scribbler explores a deep generative network conditioned on sketched boundaries and sparse color strokes to synthesize cars, bedrooms, or faces.

3 Feature-wise Gating

Feature-wise gating has been explored widely in vision, language, speech and many other tasks. For example, Highway Networks utilize feature gating to ease gradient-based training of very deep networks. Squeeze-and-Excitation Networks re-calibrate feature responses by explicitly multiplying each channel with learned sigmoidal gating values. WaveNets achieve better results by employing a special feature gating $y=\tanh(w_{1}*x)\cdot\text{sigmoid}(w_{2}*x)$ for modeling audio signals.
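
To make the WaveNet-style gating formula above concrete, the following is a minimal NumPy sketch, not taken from any of the cited works. It assumes a 1-D signal, reads `*` as a "same"-padded 1-D convolution, and uses hypothetical names (`gated_activation`, `w1`, `w2`) chosen only for illustration.

```python
import numpy as np

def sigmoid(z):
    """Logistic function, maps any real value into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def gated_activation(x, w1, w2):
    """y = tanh(w1 * x) . sigmoid(w2 * x), where '*' is a 1-D convolution.

    x  : (T,) input signal
    w1 : (k,) filter producing the candidate feature
    w2 : (k,) filter producing the gate
    """
    feature = np.convolve(x, w1, mode="same")
    gate = np.convolve(x, w2, mode="same")
    return np.tanh(feature) * sigmoid(gate)

rng = np.random.default_rng(0)
x = rng.standard_normal(16)
w1 = rng.standard_normal(3)
w2 = rng.standard_normal(3)
y = gated_activation(x, w1, w2)
```

The sigmoid branch acts as a soft, per-position switch on the tanh branch, which is the same idea that gated convolution applies per channel and per spatial location.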

Approach

In this section, we describe our approach from bottom to top. We first introduce the details of the Gated Convolution, SN-PatchGAN, and then present the overview of inpainting network in Figure 3 and our extension to allow optional user guidance.

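We first explain why vanilla convolutions used in prior work are ill-fitted for the task of free-form image inpainting. We consider a convolutional layer in which a bank of filters is applied to the input feature map to produce the output. Assuming the input is $C$-channel, each pixel located at $(y,x)$ in the $C^{\prime}$-channel output map is computed as

$$O_{y,x}=\sum_{i=-k^{\prime}_{h}}^{k^{\prime}_{h}}\sum_{j=-k^{\prime}_{w}}^{k^{\prime}_{w}}W_{k^{\prime}_{h}+i,\,k^{\prime}_{w}+j}\cdot I_{y+i,\,x+j},$$

where $x$, $y$ denote the x-axis and y-axis of the output map, $k_{h}$ and $k_{w}$ are the kernel sizes (for example, $3\times 3$), $k^{\prime}_{h}=\frac{k_{h}-1}{2}$, $k^{\prime}_{w}=\frac{k_{w}-1}{2}$, $W\in\mathbb{R}^{k_{h}\times k_{w}\times C^{\prime}\times C}$ represents the convolutional filters, and $I_{y+i,x+j}\in\mathbb{R}^{C}$ and $O_{y,x}\in\mathbb{R}^{C^{\prime}}$ are the inputs and outputs. For simplicity, the bias term is omitted.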

The equation shows that for all spatial locations $(y,x)$, the same filters are applied to produce the output in vanilla convolutional layers. This makes sense for tasks such as image classification and object detection, where all pixels of the input image are valid, to extract local features in a sliding-window fashion. However, for image inpainting, the input is composed of both regions with valid pixels/features outside holes and invalid pixels/features (in shallow layers) or synthesized pixels/features (in deep layers) in masked regions. This causes ambiguity during training and leads to visual artifacts such as color discrepancy, blurriness and obvious edge responses during testing, as reported in prior work.

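Recently, partial convolution was proposed, which adopts a masking and re-normalization step to make the convolution dependent only on valid pixels as

$$O_{y,x}=\begin{cases}\sum\sum W\cdot\left(I\odot\dfrac{M}{\text{sum}(M)}\right), & \text{if sum}(M)>0\\ 0, & \text{otherwise}\end{cases}$$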

in which $M$ is the corresponding binary mask, $1$ indicates that the pixel at location $(y,x)$ is valid, $0$ indicates that the pixel is invalid, and $\odot$ denotes element-wise multiplication. After each partial convolution operation, a mask-update step is required to propagate the new $M$ with the following rule: $m^{\prime}_{y,x}=1$ iff $\text{sum}(M)>0$.
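
The masking, re-normalization and rule-based mask update described above can be sketched in a few lines of NumPy. This is an illustrative single-channel version under stated assumptions (stride 1, zero padding, odd square kernel); the function name `partial_conv2d` is hypothetical and this is not the PartialConv authors' implementation.

```python
import numpy as np

def partial_conv2d(image, mask, weight):
    """Single-channel partial convolution with the rule-based mask update.

    image, mask : (H, W) arrays; mask is 1 for valid pixels, 0 for holes.
    weight      : (k, k) filter with odd k.
    Returns (output, updated_mask).
    """
    k = weight.shape[0]
    pad = k // 2
    img_p = np.pad(image * mask, pad)   # I (.) M, zero padded
    mask_p = np.pad(mask, pad)          # padding counts as invalid
    H, W = image.shape
    out = np.zeros((H, W))
    new_mask = np.zeros((H, W))
    for y in range(H):
        for x in range(W):
            valid = mask_p[y:y + k, x:x + k].sum()   # sum(M) in the window
            if valid > 0:
                patch = img_p[y:y + k, x:x + k]
                out[y, x] = np.sum(weight * patch) / valid  # re-normalize
                new_mask[y, x] = 1.0                        # m' = 1 iff sum(M) > 0
    return out, new_mask

image = np.ones((5, 5))
mask = np.ones((5, 5))
mask[1:4, 1:4] = 0.0                    # 3x3 hole in the centre
weight = np.ones((3, 3))
out1, mask1 = partial_conv2d(image, mask, weight)
out2, mask2 = partial_conv2d(out1, mask1, weight)
print(int(mask.sum()), int(mask1.sum()), int(mask2.sum()))  # prints: 16 24 25
```

In this toy case the hole shrinks to a single pixel after one layer and vanishes after two, so later layers receive an all-ones mask and can no longer tell which locations were originally inside the hole.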

Partial convolution improves the quality of inpainting on irregular masks, but it still has remaining issues: (1) It heuristically classifies all spatial locations as either valid or invalid. The mask in the next layer will be set to ones no matter how many pixels are covered by the filter range in the previous layer (for example, 1 valid pixel and 9 valid pixels are treated the same when updating the current mask). (2) It is incompatible with additional user inputs. We aim at a user-guided image inpainting system where users can optionally provide a sparse sketch inside the mask as conditional channels. In this situation, should these pixel locations be considered valid or invalid? How should the mask be updated for the next layer? (3) For partial convolution the invalid pixels progressively disappear in deep layers, gradually converting all mask values to ones. However, our study shows that if we allow the network to learn the optimal mask automatically, the network assigns soft mask values to every spatial location even in deep layers. (4) All channels in each layer share the same mask, which limits the flexibility. Essentially, partial convolution can be viewed as un-learnable single-channel feature hard-gating.

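We propose gated convolution for image inpainting networks, as shown in Figure 2. Instead of a hard-gating mask updated with rules, gated convolutions learn a soft mask automatically from data. It is formulated as:

$$\begin{aligned}Gating_{y,x}&=\sum\sum W_{g}\cdot I,\\ Feature_{y,x}&=\sum\sum W_{f}\cdot I,\\ O_{y,x}&=\phi(Feature_{y,x})\odot\sigma(Gating_{y,x}),\end{aligned}$$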

where $\sigma$ is the sigmoid function, so the output gating values are between zero and one. $\phi$ can be any activation function (for example, ReLU, ELU or LeakyReLU). $W_{g}$ and $W_{f}$ are two different convolutional filters.
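
The gated convolution described here, with two filter banks $W_{f}$ and $W_{g}$ applied to the same input, can be sketched as follows. This is a minimal NumPy illustration, not the released model code: it assumes stride 1, zero padding and an odd kernel, stores filters with shape `(k, k, C, C_out)` for convenience, and the helper names (`conv2d`, `gated_conv2d`, `elu`) are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def elu(z):
    """ELU activation; any other activation could be passed instead."""
    return np.where(z > 0, z, np.exp(np.minimum(z, 0.0)) - 1.0)

def conv2d(inp, weight):
    """Stride-1, zero-padded convolution without bias.

    inp    : (H, W, C) feature map
    weight : (k, k, C, C_out) filters with odd k
    Returns a (H, W, C_out) feature map.
    """
    k = weight.shape[0]
    pad = k // 2
    H, W, _ = inp.shape
    padded = np.pad(inp, ((pad, pad), (pad, pad), (0, 0)))
    out = np.zeros((H, W, weight.shape[-1]))
    for y in range(H):
        for x in range(W):
            patch = padded[y:y + k, x:x + k, :]               # (k, k, C)
            out[y, x] = np.tensordot(patch, weight, axes=3)   # sum over k, k, C
    return out

def gated_conv2d(inp, w_f, w_g, activation=elu):
    """O = phi(Feature) (.) sigma(Gating), one gate per channel and location."""
    feature = conv2d(inp, w_f)
    gating = conv2d(inp, w_g)
    return activation(feature) * sigmoid(gating)

rng = np.random.default_rng(0)
inp = rng.standard_normal((8, 8, 5))          # e.g. RGB + mask + sketch channels
w_f = rng.standard_normal((3, 3, 5, 4)) * 0.1
w_g = rng.standard_normal((3, 3, 5, 4)) * 0.1
out = gated_conv2d(inp, w_f, w_g)
```

Because the gate is computed from the input itself, it can differ across channels and locations and stays informative in deep layers, in contrast to a binary mask that is shared by all channels.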

The proposed gated convolution learns a dynamic feature selection mechanism for each channel and each spatial location. Interestingly, visualization of intermediate gating values shows that it learns to select features not only according to background, mask and sketch, but also by considering semantic segmentation in some channels. Even in deep layers, gated convolution learns to highlight the masked regions and sketch information in separate channels to better generate inpainting results.

2 Spectral-Normalized Markovian Discriminator (SN-PatchGAN)

For previous inpainting networks which try to fill a single rectangular hole, an additional local GAN is used on the masked rectangular region to improve results. However, we consider the task of free-form image inpainting where there may be multiple holes with any shape at any location. Motivated by global and local GANs, MarkovianGANs, perceptual loss and recent work on spectral-normalized GANs, we present a simple and effective GAN loss, SN-PatchGAN, for training free-form image inpainting networks.
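
For readers unfamiliar with the spectral normalization that SN-PatchGAN builds on, the sketch below shows the general technique: a weight tensor is divided by an estimate of its largest singular value, obtained by power iteration. This is a generic NumPy illustration under assumptions (the function name `spectral_normalize`, the iteration count and the `(C_out, C_in, k, k)` kernel layout are all hypothetical), not the discriminator used in this paper.

```python
import numpy as np

def spectral_normalize(weight, n_iters=50, seed=0):
    """Divide `weight` by an estimate of its largest singular value.

    The tensor is flattened to a (C_out, -1) matrix and the top singular
    value is estimated with power iteration.
    Returns (normalized_weight, sigma_estimate).
    """
    w = weight.reshape(weight.shape[0], -1)
    rng = np.random.default_rng(seed)
    u = rng.standard_normal(w.shape[0])
    u /= np.linalg.norm(u)
    for _ in range(n_iters):
        v = w.T @ u
        v /= np.linalg.norm(v)
        u = w @ v
        u /= np.linalg.norm(u)
    sigma = float(u @ w @ v)     # Rayleigh-quotient estimate of the top singular value
    return weight / sigma, sigma

rng = np.random.default_rng(1)
kernel = rng.standard_normal((8, 3, 3, 3))   # hypothetical discriminator kernel
kernel_sn, sigma = spectral_normalize(kernel)
```

Training frameworks typically run a single power-iteration step per update and reuse `u` across steps; the loop above runs many steps only to keep the example self-contained.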

3 Inpainting Network Architecture

We customize a generative inpainting network with the proposed gated convolution and SN-PatchGAN loss. Specifically, we adapt the full model architecture from prior work, with both coarse and refinement networks. The full framework is summarized in Figure 3.

For the coarse and refinement networks, we use a simple encoder-decoder network instead of the U-Net used in PartialConv. We found that skip connections in a U-Net have no significant effect for non-narrow masks. This is mainly because, for the center of a masked region, the inputs of these skip connections are almost zeros and thus cannot propagate detailed color or texture information to the decoder of that region. For hole boundaries, our encoder-decoder architecture equipped with gated convolution is sufficient to generate seamless results.

We replace all vanilla convolutions with gated convolutions. One potential problem is that gated convolutions introduce additional parameters. To maintain the same efficiency as our baseline model, we slim the model width by $25\%$ and have not found an obvious performance drop either quantitatively or qualitatively. The inpainting network is trained end-to-end and can be tested on free-form holes at arbitrary locations. Our network is fully convolutional and supports different input resolutions in inference.
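
A rough parameter count helps explain why slimming the width offsets the extra gating filters. The sketch below assumes that a gated convolution holds two full filter banks ($W_{f}$ and $W_{g}$) and that every layer's input and output widths are scaled by the same factor; the layer width of 64 is a hypothetical example, not a figure from the paper.

```python
def conv_params(c_in, c_out, k=3):
    """Weights plus biases of one vanilla k x k convolution."""
    return k * k * c_in * c_out + c_out

def gated_conv_params(c_in, c_out, k=3):
    """Assumes two independent filter banks: one for features, one for gates."""
    return 2 * conv_params(c_in, c_out, k)

base = conv_params(64, 64)            # vanilla layer at full width
slim = gated_conv_params(48, 48)      # gated layer at 75% width
ratio = slim / base
print(base, slim, round(ratio, 3))    # prints: 36928 41568 1.126
```

Under these assumptions the cost scales as $2\times 0.75^{2}=1.125$, so a slimmed gated layer stays within roughly 13% of the vanilla layer instead of doubling it.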

4 Free-Form Mask Generation

The algorithm to automatically generate free-form masks is important and non-trivial. The sampled masks, in essence, should be (1) similar to masks drawn in real use-cases, (2) diverse to avoid over-fitting, (3) efficient in computation and storage, (4) controllable and flexible. A previous method collects a fixed set of irregular masks from an occlusion estimation method between two consecutive frames of videos. Although random dilation, rotation and cropping are added to increase its diversity, the method does not meet the other requirements listed above.

We introduce a simple algorithm to automatically generate random free-form masks on-the-fly during training. For the task of hole filling, users behave like using an eraser to brush back and forth to mask out undesired regions. This behavior can be simply simulated with a randomized algorithm by drawing lines and rotating angles repeatedly. To ensure smoothness of two lines, we also draw a circle in joints between the two lines. More details are included in the supplementary materials due to space limit.

5 Extension to User-Guided Image Inpainting

We use sketch as an example user guidance to extend our image inpainting network as a user-guided system. Sketch (or edge) is simple and intuitive for users to draw. We show both cases with faces and natural scenes. For faces, we extract landmarks and connect related landmarks. For natural scene images, we directly extract edge maps using the HED edge detector and set all values above a certain threshold (i.e. 0.6) to ones. Sketch examples are shown in the supplementary materials due to space limit.

For training the user-guided image inpainting system, intuitively we will need additional constraint loss to enforce the network generating results conditioned on the user guidance. However with the same combination of pixel-wise reconstruction loss and GAN loss (with conditional channels as input to the discriminator), we are able to learn conditional generative network in which the generated results respect user guidance faithfully. We also tried to use additional pixel-wise loss on HED output features with the raw image or the generated result as input to enforce constraints, but the inpainting quality is similar. The user-guided inpainting model is separately trained with a 5-channel input (R,G,B color channels, mask channel and sketch channel).
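
The 5-channel input described above can be assembled as in the following sketch. It assumes a mask convention of 1 inside the hole and 0 outside, an edge-detector response map already scaled to [0, 1] (for example from HED, which is not run here), and that the sketch is kept only inside the hole; the function name `build_guided_input` is hypothetical.

```python
import numpy as np

def build_guided_input(image, mask, edge_prob, threshold=0.6):
    """Stack R, G, B, mask and sketch channels into one (H, W, 5) array.

    image     : (H, W, 3) floats in [0, 1]
    mask      : (H, W) floats, 1 inside the hole, 0 outside (assumed convention)
    edge_prob : (H, W) edge response in [0, 1]
    """
    sketch = (edge_prob > threshold).astype(np.float32)   # values above threshold -> 1
    sketch = sketch * mask                                 # keep guidance inside the hole
    masked_image = image * (1.0 - mask)[..., None]         # erase pixels in the hole
    return np.concatenate(
        [masked_image, mask[..., None], sketch[..., None]], axis=-1
    )

rng = np.random.default_rng(0)
image = rng.random((16, 16, 3)).astype(np.float32)
edge_prob = rng.random((16, 16)).astype(np.float32)
mask = np.zeros((16, 16), dtype=np.float32)
mask[4:12, 4:12] = 1.0
net_input = build_guided_input(image, mask, edge_prob)
```

The same stacked tensor shape is what lets a gated convolution learn separate gates for color, mask and sketch channels.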

Results

We evaluate the proposed free-form image inpainting system on Places2 and CelebA-HQ faces. Our model has 4.1M parameters in total, and is trained with TensorFlow v1.8, CUDNN v7.0, CUDA v9.0. For testing, it runs at 0.21 seconds per image on a single NVIDIA(R) Tesla(R) V100 GPU and 1.9 seconds on an Intel(R) Xeon(R) CPU @ 2.00GHz for images of resolution $512\times 512$ on average, regardless of hole size.

2 Qualitative Comparisons

Next, we compare our model with previous state-of-the-art methods. Figure 4 and Figure 5 show automatic and user-guided inpainting results with several representative images. For automatic image inpainting, the result of PartialConv is obtained from its online demo (https://www.nvidia.com/research/inpainting). For user-guided image inpainting, we train PartialConv* with the exact same setting as GatedConv, except for the convolution type (sketch regions are treated as valid pixels for rule-based mask updating). For all learning-based methods, no post-processing step is performed to ensure fairness.

As reported in prior work, simple uniform regions (last row of Figure 4 and Figure 5) are hard cases for learning-based image inpainting networks. Previous methods with vanilla convolution have obvious visual artifacts and edge responses in and around holes. PartialConv produces better results but still exhibits observable color discrepancy. Our method based on gated convolution obtains more visually pleasing results without noticeable color inconsistency. In Figure 5, given a sparse sketch, our method produces realistic results with seamless boundary transitions.

3 Object Removal and Creative Editing

Moreover, we study two important real use cases of image inpainting: object removal and creative editing.

Object Removal. In the first example, we try to remove the distracting person in Figure 6. We compare our method with the commercial product Photoshop (based on PatchMatch) and the previous state-of-the-art generative inpainting network (official released model trained on Places2). The results show that the Content-Aware Fill function from Photoshop incorrectly copies half of a face from the left. This example reflects the fact that traditional methods without learning from large-scale data ignore the semantics of an image, which leads to critical failures in non-stationary/complicated scenes. For learning-based methods with vanilla convolution, artifacts exist near hole boundaries.

Creative Editing. Next we study the case where user interacts with the inpainting system to produce more desired results. The examples on both faces and natural scenes are shown in Figure 7. Our inpainting results nicely follow the user sketch, which is useful for creatively editing image layouts, faces and many others.

4 User Study

We performed a user study by first collecting 30 test images (with holes but no sketches) from the Places2 validation dataset without knowing their inpainting results on each model. We then computed results of the following four methods for comparison: (1) ground truth, (2) our model, (3) re-implemented PartialConv within the same framework, and (4) official PartialConv. We conducted two types of user study. (A) We evaluate each method individually to rate the naturalness/inpainting quality of results (from 1 to 10, the higher the better), and (B) we compare our model and the official PartialConv model to evaluate which method produces better results. 104 users finished the user study with the results shown as follows.

(A) Naturalness: (1) 9.89, (2) 7.72, (3) 7.07, (4) 6.54

(B) Pairwise comparison of (2) our model vs. (4) official PartialConv model: 79.4% vs. 20.6% (the higher the better).

5 Ablation Study of SN-PatchGAN

Conclusions

References

Appendix A Free-Form Mask Generation

The algorithm to automatically generate free-form masks is important and non-trivial. The sampled masks, in essence, should be (1) similar in shape to holes drawn in real use-cases, (2) diverse to avoid over-fitting, (3) efficient in computation and storage, (4) controllable and flexible. A previous method collects a fixed set of irregular masks from an occlusion estimation method between two consecutive frames of videos. Although random dilation, rotation and cropping are added to increase its diversity, the method does not meet the other requirements listed above.

We introduce a simple algorithm to automatically generate random free-form masks on-the-fly during training. For the task of hole filling, users behave like using an eraser to brush back and forth to mask out undesired regions. This behavior can be simply simulated with a randomized algorithm by drawing lines and rotating angles repeatedly. To ensure smoothness of two lines, we also draw a circle in joints between the two lines.

We use maxVertex, maxLength, maxWidth and maxAngle as four hyper-parameters to provide large varieties of sampled masks. Moreover, our algorithm generates masks on-the-fly with little computational overhead, and no storage is required. In practice, the computation of free-form masks on CPU can be easily hidden behind training networks on GPU in modern deep learning frameworks. The overall mask generation algorithm is illustrated in Algorithm 1. Additionally, we can sample multiple strokes in a single image to mask multiple regions, and add regular masks (e.g. rectangular) on top of sampled free-form masks. Example masks compared with the previous method are shown in Figure 9.
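
The stroke-drawing procedure described in this appendix can be sketched with NumPy alone. The code below is one reading of the prose (random lines, alternating angles, rounded joints) and is not a transcription of Algorithm 1; the default hyper-parameter values, the minimum brush width of 2 pixels and the helper names are assumptions made for illustration.

```python
import numpy as np

def _draw_segment(mask, p0, p1, width):
    """Set to 1 every pixel within width/2 of the segment p0 -> p1.

    Points are (x, y). The rounded ends of each segment act as the circles
    drawn at the joints between consecutive lines.
    """
    H, W = mask.shape
    ys, xs = np.mgrid[0:H, 0:W]
    dx, dy = p1[0] - p0[0], p1[1] - p0[1]
    denom = max(dx * dx + dy * dy, 1e-12)
    # Projection of each pixel onto the segment, clamped to its endpoints.
    t = np.clip(((xs - p0[0]) * dx + (ys - p0[1]) * dy) / denom, 0.0, 1.0)
    dist2 = (xs - (p0[0] + t * dx)) ** 2 + (ys - (p0[1] + t * dy)) ** 2
    mask[dist2 <= (width / 2.0) ** 2] = 1.0

def random_free_form_mask(height, width, max_vertex=8, max_length=40.0,
                          max_width=12.0, max_angle=2.0 * np.pi / 3.0, rng=None):
    """Sample one brush stroke; returns an (H, W) float mask with 1 = hole."""
    rng = np.random.default_rng() if rng is None else rng
    mask = np.zeros((height, width), dtype=np.float32)
    num_vertex = int(rng.integers(1, max_vertex + 1))
    x = float(rng.uniform(0, width - 1))
    y = float(rng.uniform(0, height - 1))
    for i in range(num_vertex):
        angle = rng.uniform(0.0, max_angle)
        if i % 2 == 0:
            angle = 2.0 * np.pi - angle      # alternate direction: brush back and forth
        length = rng.uniform(0.0, max_length)
        brush = rng.uniform(2.0, max_width)  # assumed minimum width of 2 pixels
        nx = float(np.clip(x + length * np.sin(angle), 0, width - 1))
        ny = float(np.clip(y + length * np.cos(angle), 0, height - 1))
        _draw_segment(mask, (x, y), (nx, ny), brush)
        x, y = nx, ny
    return mask

stroke = random_free_form_mask(64, 64, rng=np.random.default_rng(0))
```

Several strokes can be combined with `np.maximum` to mask multiple regions, and a rectangular mask can be merged in the same way.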

Appendix B Sketch Generation

We use sketch as an example user guidance to extend our image inpainting network as a user-guided system. We show both cases on faces and natural scenes. For faces, we extract landmarks and connect related landmarks. For natural scene images, we directly extract edge maps using the HED edge detector and set all values above a certain threshold (i.e. 0.6) to ones. Sketch examples are shown in Figure 10. Alternative methods to generate better sketches or other user guidance should also work well with our user-guided image inpainting system.

Appendix C The Effects of Sketch Input

As shown in Section 4.3, our inpainting network can nicely follow the user sketch, which is useful for creative editing of images. We show in Figure 11 an additional comparison case where the input image uses the same mask but different sketches.

Appendix D Visualization and Interpretation

In Figure 12, we provide the visualization and interpretation of learned gating values in our inpainting network, and compare them with those of PartialConv.

Appendix E Ablation Study of SN-PatchGAN

In this section, we present an ablation study to demonstrate the effectiveness of SN-PatchGAN. It is noteworthy that SN-PatchGAN is proposed because free-form masks may appear anywhere in images with any shape. Global and local GANs designed for a single rectangular mask are not applicable. Previous work has already shown that (1) one vanilla global discriminator has much worse performance than two local and global discriminators, and (2) GAN with spectral normalization has better stability and performance. We also provide experiments of SN-PatchGAN in the context of image inpainting in Figure 13. Our image inpainting network trained on a global GAN without spectral normalization has significantly worse performance on all examples.

Appendix F More Comparison Results

In this section, we show more comparison results of learning-based image inpainting systems including Global&Local, ContextAttention, PartialConv (both our implementation within the same framework and the official model via online demo) and our proposed method based on gated convolution. Note that the models of scenes and faces are trained separately, following all other methods. None of the testing images are in the training set. Results are shown in Figure 14 and Figure 15. Compared with our baseline PartialConv, our inpainting system generates higher-quality inpainting results. Although PartialConv significantly improves over previous baselines like Global&Local and ContextAttention, it still produces observable color inconsistency or shadows in both the official online demo and our reproduced version (best viewed with zoom-in on PDF to see color shadows and artifacts). Moreover, PartialConv fails especially on cases (1) when holes are large and involve transitions between two segments (e.g., a mask covering both sky and ground), and (2) when the image has a strong structure/contour/edge prior. The reasons are discussed in the introduction of the main paper: un-learnable rule-based hard-gating heuristically categorizes all input locations as either invalid or valid, ignoring much other important information. Gated convolution is able to leverage this information by learning a soft-gating end-to-end.

Appendix G More Inpainting Results of Our System

In this section, we present more examples of real use cases based on our proposed image inpainting system. We show inpainting results on both natural scenes and faces in Figure 16, Figure 17 and Figure 18. We show that our inpainting system helps users quickly remove distracting objects, modify image layouts, edit faces and interactively create novel objects in images.