Resolution-robust Large Mask Inpainting with Fourier Convolutions

Roman Suvorov, Elizaveta Logacheva, Anton Mashikhin, Anastasia Remizova, Arsenii Ashukha, Aleksei Silvestrov, Naejin Kong, Harshith Goka, Kiwoong Park, Victor Lempitsky

cs.CV eess.IV

Introduction

The solution to the image inpainting problem (realistic filling of missing parts) requires both "understanding" the large-scale structure of natural images and performing image synthesis. The subject was studied in the pre-deep-learning era, and progress has accelerated in recent years through the use of deep and wide neural networks and adversarial learning.

The usual practice is to train inpainting systems on a large automatically generated dataset, created by randomly masking real images. It is common to use complicated two-stage models with intermediate predictions, such as smoothed images, edges, and segmentation maps. In this work, we achieve state-of-the-art results with a simple single-stage network.

A large effective receptive field is essential for understanding the global structure of an image and hence solving the inpainting problem. Moreover, in the case of a large mask, even a large yet limited receptive field may not be enough to access the information necessary for generating a quality inpainting. We notice that popular convolutional architectures might lack a sufficiently large effective receptive field. We carefully intervene in each component of the system to alleviate the problem and to unlock the potential of the single-stage solution. Specifically:

i) We propose an inpainting network based on recently developed fast Fourier convolutions (FFCs). FFCs allow for a receptive field that covers an entire image even in the early layers of the network. We show that this property of FFCs improves both the perceptual quality and the parameter efficiency of the network. Interestingly, the inductive bias of FFC allows the network to generalize to high resolutions that are never seen during training (Figure 5, Figure 6). This finding brings significant practical benefits, as less training data and computation are needed.

ii) We propose the use of a perceptual loss based on a semantic segmentation network with a high receptive field. This builds on the observation that an insufficient receptive field impairs not only the inpainting network, but also the perceptual loss. Our loss promotes the consistency of global structures and shapes.

iii) We introduce an aggressive strategy for training mask generation, to unlock the potential of a high receptive field of the first two components. The procedure produces wide and large masks, which force the network to fully exploit the high receptive field of the model and the loss function.

This leads us to large mask inpainting (LaMa), a novel single-stage image inpainting system. The main components of LaMa are the high receptive field architecture $(i)$, the high receptive field loss function $(ii)$, and the aggressive algorithm of training mask generation $(iii)$. We meticulously compare LaMa with state-of-the-art baselines and analyze the influence of each proposed component. Through evaluation, we find that LaMa can generalize to high-resolution images after training only on low-resolution data. LaMa can capture and generate complex periodic structures, and is robust to large masks. Furthermore, this is achieved with significantly fewer trainable parameters and lower inference time costs compared to competitive baselines.

Method

Our goal is to inpaint a color image $x$ masked by a binary mask of unknown pixels $m$; the masked image is denoted as $x\odot m$. The mask $m$ is stacked with the masked image $x\odot m$, resulting in a four-channel input tensor $x^{\prime}=\texttt{stack}(x\odot m,m)$. We use a feed-forward inpainting network $f_{\theta}(\cdot)$, which we also refer to as the generator. Taking $x^{\prime}$, the inpainting network processes the input in a fully convolutional manner and produces an inpainted three-channel color image $\hat{x}=f_{\theta}(x^{\prime})$. Training is performed on a dataset of (image, mask) pairs obtained from real images and synthetically generated masks.
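The construction of the four-channel input can be sketched in a few lines of NumPy. This is only an illustration: the function name `make_input` is hypothetical, and the sketch assumes that $m$ equals 1 on known pixels and 0 inside the hole, so that $x\odot m$ keeps the visible content (the text does not fix this convention explicitly).

```python
import numpy as np

def make_input(x, m):
    """Stack the masked image with the mask into a 4-channel tensor.

    x: float array (3, H, W), the colour image.
    m: binary array (1, H, W). Assumption of this sketch: 1 = known pixel,
       0 = missing pixel, so that x * m zeroes out the hole.
    """
    masked = x * m                               # broadcast over the 3 colour channels
    return np.concatenate([masked, m], axis=0)   # shape (4, H, W)

rng = np.random.default_rng(0)
x = rng.random((3, 8, 8))
m = np.ones((1, 8, 8))
m[:, 2:6, 2:6] = 0                               # a square hole
x_prime = make_input(x, m)
print(x_prime.shape)                             # (4, 8, 8)
```

The generator never sees the original content under the hole: those pixels are zero in the first three channels, and the fourth channel tells the network where the hole is.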

In challenging cases, e.g. filling of large masks, generating a proper inpainting requires considering the global context. Thus, we argue that a good architecture should have units with as wide a receptive field as possible, as early as possible in the pipeline. Conventional fully convolutional models, e.g. ResNet, suffer from slow growth of the effective receptive field. The receptive field might be insufficient, especially in the early layers of the network, due to the typically small (e.g. $3\times 3$) convolutional kernels. Thus, many layers in the network will lack global context and will waste computations and parameters to create it. For wide masks, the whole receptive field of a generator at a specific position may be inside the mask, thus observing only missing pixels. The issue becomes especially pronounced for high-resolution images.
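A short calculation illustrates how slowly the receptive field grows. The helper below (a hypothetical name, not part of any library) computes the theoretical receptive field of a plain stack of convolutions; the effective receptive field discussed above is typically smaller still, so these numbers are an optimistic bound.

```python
def receptive_field(layers):
    """Theoretical receptive field (in input pixels) of a conv stack.

    layers: list of (kernel_size, stride) tuples, applied in order.
    """
    rf, jump = 1, 1      # jump = input-pixel distance between adjacent outputs
    for kernel, stride in layers:
        rf += (kernel - 1) * jump
        jump *= stride
    return rf

# Nine stride-1 3x3 convolutions see only a 19x19 window.
print(receptive_field([(3, 1)] * 9))      # 19

# Covering a 256-pixel-wide image with stride-1 3x3 convolutions alone
# takes 128 layers.
print(receptive_field([(3, 1)] * 128))    # 257
```

Downsampling increases `jump` and therefore speeds up the growth, but a unit inside a wide hole can still end up seeing nothing except missing pixels.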

Fast Fourier convolution (FFC) is a recently proposed operator that allows the use of global context in early layers. FFC is based on a channel-wise fast Fourier transform (FFT) and has a receptive field that covers the entire image. FFC splits channels into two parallel branches: i) the local branch uses conventional convolutions, and ii) the global branch uses real FFT to account for global context. Real FFT can be applied only to real-valued signals, and inverse real FFT ensures that the output is real-valued. Real FFT uses only half of the spectrum compared to the FFT. Specifically, FFC performs the following steps:

a) applies real FFT2d to the input tensor and concatenates real and imaginary parts

b) applies a convolution block in the frequency domain

c) applies inverse transform to recover a spatial structure

Finally, the outputs of the local (i) and global (ii) branches are fused together. The illustration of FFC is available in Figure 2.
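The global branch can be sketched with NumPy's real FFT routines. This is a minimal illustration, not the authors' implementation: `spectral_transform` and `weight` are hypothetical names, the convolution block is assumed to be a $1\times 1$ convolution followed by ReLU, and normalisation layers as well as the fusion with the local branch are omitted.

```python
import numpy as np

def spectral_transform(x, weight):
    """Global-branch sketch: rFFT2 -> 1x1 conv + ReLU on (real, imag) -> irFFT2.

    x:      real array (C, H, W).
    weight: real array (2C, 2C), a hypothetical 1x1 convolution kernel that is
            shared by every frequency.
    """
    c, h, w = x.shape
    spec = np.fft.rfft2(x, axes=(-2, -1))                    # (C, H, W//2 + 1), complex
    feat = np.concatenate([spec.real, spec.imag], axis=0)    # (2C, H, W//2 + 1), real
    feat = np.einsum("oc,chw->ohw", weight, feat)            # 1x1 conv: channel mixing per frequency
    feat = np.maximum(feat, 0.0)                             # ReLU
    spec = feat[:c] + 1j * feat[c:]                          # back to a complex spectrum
    return np.fft.irfft2(spec, s=(h, w), axes=(-2, -1))      # (C, H, W), real

rng = np.random.default_rng(0)
x = rng.random((2, 16, 16))
weight = rng.standard_normal((4, 4))
y = spectral_transform(x, weight)

# Perturb a single input pixel and see how much of the output changes.
x_perturbed = x.copy()
x_perturbed[0, 0, 0] += 1.0
y_perturbed = spectral_transform(x_perturbed, weight)
changed = np.abs(y_perturbed - y) > 1e-9
print(y.shape, changed.mean())
```

A single changed input pixel alters essentially every output pixel after one layer, whereas one $3\times 3$ convolution would only affect a $3\times 3$ neighbourhood. Note that the real FFT keeps `W // 2 + 1` frequency columns rather than exactly half.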

The power of FFCs. FFCs are a fully differentiable and easy-to-use drop-in replacement for conventional convolutions. Due to the image-wide receptive field, FFCs allow the generator to account for the global context starting from the early layers, which is crucial for high-resolution image inpainting. This also leads to better efficiency: trainable parameters can be used for reasoning and generation instead of "waiting" for information to propagate.

We show that FFCs are well suited to capturing periodic structures, which are common in human-made environments, e.g. bricks, ladders, windows, etc. (Figure 4). Interestingly, sharing the same convolutions across all frequencies shifts the model towards scale equivariance (Figures 5, 6).

2.2 Loss functions

The inpainting problem is inherently ambiguous. There can be many plausible fillings for the same missing areas, especially when the "holes" become wider. We discuss the components of the proposed loss, which together make it possible to handle the complex nature of the problem.

Naive supervised losses require the generator to reconstruct the ground truth precisely. However, the visible parts of the image often do not contain enough information for the exact reconstruction of the masked part. Therefore, using naive supervision leads to blurry results due to the averaging of multiple plausible modes of the inpainted content.

In contrast, perceptual loss evaluates a distance between features extracted from the predicted and the target images by a base pre-trained network $\phi(\cdot)$. It does not require an exact reconstruction, allowing for variations in the reconstructed image. The focus of large-mask inpainting is shifted towards understanding of global structure. Therefore, we argue that it is important to use a base network with a fast growth of the receptive field. We introduce the high receptive field perceptual loss (HRF PL), which uses a high receptive field base model $\phi_{\text{\it HRF}}(\cdot)$:

$$\mathcal{L}_{\text{\it HRFPL}}(x,\hat{x})=\mathcal{M}\left(\left[\phi_{\text{\it HRF}}(x)-\phi_{\text{\it HRF}}(\hat{x})\right]^{2}\right),$$

where $[\cdot-\cdot]^{2}$ is an element-wise operation, and $\mathcal{M}$ is the sequential two-stage mean operation (interlayer mean of intra-layer means). The $\phi_{\text{\it HRF}}(x)$ can be implemented using Fourier or dilated convolutions. The HRF perceptual loss appears to be crucial for our large-mask inpainting system, as demonstrated in the ablation study (Table 3).
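The two-stage mean (an interlayer mean of intra-layer means of the squared feature difference) can be written out explicitly. In this sketch the features are passed in directly as lists of arrays, one per layer; how the base network produces them is outside its scope, and the function name is hypothetical.

```python
import numpy as np

def hrf_perceptual_loss(feats_x, feats_x_hat):
    """Mean over layers of the per-layer mean squared feature difference.

    feats_x, feats_x_hat: lists of arrays, one entry per layer of the base
    network; different layers may have different shapes.
    """
    per_layer = [np.mean((a - b) ** 2) for a, b in zip(feats_x, feats_x_hat)]  # intra-layer means
    return float(np.mean(per_layer))                                            # interlayer mean

# Toy features: layer 1 differs by 1 everywhere, layer 2 by 3 everywhere.
feats_x = [np.zeros(2), np.zeros((2, 4))]
feats_x_hat = [np.ones(2), np.full((2, 4), 3.0)]
loss = hrf_perceptual_loss(feats_x, feats_x_hat)
print(loss)   # 5.0
```

Averaging within each layer first gives every layer the same weight regardless of its size: a single mean over all ten elements of the toy example would give 7.4 instead of 5.0.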

Pretext problem. The pretext problem on which the base network for a perceptual loss was trained is important. For example, using a segmentation model as a backbone for the perceptual loss may help to focus on high-level information, e.g. objects and their parts. In contrast, classification models are known to focus more on textures, which can introduce biases harmful for high-level information.

2.2.2 Adversarial loss

We use an adversarial loss to ensure that the inpainting model $f_{\theta}(x^{\prime})$ generates natural-looking local details. We define a discriminator $D_{\xi}(\cdot)$ that works on a local patch level, discriminating between "real" and "fake" patches. Only patches that intersect with the masked area get the "fake" label. Due to the supervised HRF perceptual loss, the generator quickly learns to copy the known parts of the input image, thus we label the known parts of generated images as "real". Finally, we use the non-saturating adversarial loss:

$$\begin{aligned}
\mathcal{L}_{D}&=-E_{x}\left[\log D_{\xi}(x)\right]-E_{x,m}\left[\log D_{\xi}(\hat{x})\odot m\right]-E_{x,m}\left[\log\left(1-D_{\xi}(\hat{x})\right)\odot(1-m)\right],\\
\mathcal{L}_{G}&=-E_{x,m}\left[\log D_{\xi}(\hat{x})\right],\\
L_{\textit{Adv}}&=\texttt{sg}_{\theta}(\mathcal{L}_{D})+\texttt{sg}_{\xi}(\mathcal{L}_{G})\rightarrow\min_{\theta,\xi},
\end{aligned}$$

where $x$ is a sample from a dataset, $m$ is a synthetically generated mask, $\hat{x}=f_{\theta}(x^{\prime})$ is the inpainting result for $x^{\prime}=\texttt{stack}(x\odot m,m)$, $\texttt{sg}_{\textit{var}}$ stops gradients w.r.t. $\textit{var}$, and $L_{\textit{Adv}}$ is the joint loss to optimise.
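The patch labelling rule (a patch of a generated image is "fake" only if it intersects the masked area) can be illustrated as follows. This is a simplified sketch: `patch_targets` is a hypothetical helper, patches are assumed to be square and non-overlapping, whereas a real patch-level discriminator has overlapping receptive fields, and `hole` is assumed to equal 1 on missing pixels.

```python
import numpy as np

def patch_targets(hole, patch):
    """Per-patch discriminator targets for a generated image.

    hole:  binary array (H, W), 1 where the pixel was missing (inpainted).
    patch: side length of the square, non-overlapping patches.
    Returns an (H // patch, W // patch) array: 0 = "fake", 1 = "real".
    """
    h, w = hole.shape
    blocks = hole[: h - h % patch, : w - w % patch]           # drop incomplete border patches
    blocks = blocks.reshape(h // patch, patch, w // patch, patch)
    touches_hole = blocks.max(axis=(1, 3)) > 0                # any inpainted pixel in the patch?
    return np.where(touches_hole, 0, 1)

hole = np.zeros((8, 8), dtype=int)
hole[1:3, 1:6] = 1                  # hole spans the two upper patches
targets = patch_targets(hole, 4)
print(targets.tolist())             # [[0, 0], [1, 1]]
```

Patches that lie entirely in the known region of a generated image keep the "real" target, so the discriminator is not rewarded for flagging content that the generator merely copied from the input.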

2.2.3 The final loss function

In the final loss we also use the $R_{1}=E_{x}||\nabla D_{\xi}(x)||^{2}$ gradient penalty and a discriminator-based perceptual loss, the so-called feature matching loss: a perceptual loss on the features of the discriminator network, $\mathcal{L}_{\text{\it DiscPL}}$. $\mathcal{L}_{\text{\it DiscPL}}$ is known to stabilize training, and in some cases slightly improves the performance.

The final loss function for our inpainting system

$$\mathcal{L}_{\textit{final}}=\kappa L_{\textit{Adv}}+\alpha\mathcal{L}_{\text{\it HRFPL}}+\beta\mathcal{L}_{\text{\it DiscPL}}+\gamma R_{1}$$

is the weighted sum of the discussed losses, where $L_{\textit{Adv}}$ and $\mathcal{L}_{\text{\it DiscPL}}$ are responsible for the generation of natural-looking local details, while $\mathcal{L}_{\text{\it HRFPL}}$ is responsible for the supervised signal and the consistency of the global structure.
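As a small worked example, the weighted sum can be written as a plain function. The default weights are the values $\kappa=10$, $\alpha=30$, $\beta=100$, $\gamma=0.001$ reported in the implementation details; the sketch assumes they pair with the adversarial, HRF perceptual, discriminator-feature, and $R_{1}$ terms in that order, and the function and argument names are hypothetical.

```python
def final_loss(l_adv, l_hrfpl, l_discpl, r1,
               kappa=10.0, alpha=30.0, beta=100.0, gamma=0.001):
    """Weighted sum of the four loss terms (scalars).

    Assumed pairing: kappa -> adversarial, alpha -> HRF perceptual,
    beta -> discriminator feature matching, gamma -> R1 gradient penalty.
    """
    return kappa * l_adv + alpha * l_hrfpl + beta * l_discpl + gamma * r1

# With every term equal to 1 the total is simply the sum of the weights.
total = final_loss(1.0, 1.0, 1.0, 1.0)
print(total)
```

The raw magnitudes of the individual terms differ in practice, so the weights alone do not show which term dominates the gradient.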

2.3 Generation of masks during training

The last component of our system is a mask generation policy. Each training example $x^{\prime}$ is a real photograph from a training dataset overlaid with a synthetically generated mask. As with discriminative models, where data augmentation has a strong influence on the final performance, we find that the mask generation policy noticeably influences the performance of the inpainting system.

We thus opted for an aggressive large-mask generation strategy. This strategy samples uniformly from two families of masks: polygonal chains dilated by a high random width (wide masks) and rectangles of arbitrary aspect ratios (box masks). Examples of our masks are shown in Figure 3.

We tested large-mask training against narrow-mask training for several methods, and found that training with the large-mask strategy generally improves performance on both narrow and wide masks (Table 4). This suggests that increasing the diversity of the masks might be beneficial for various inpainting systems. The sampling algorithm is provided in the supplementary material.
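A rough sketch of such a policy is given below. It is not the sampling algorithm from the supplementary material: all function names, parameter ranges, and the 50/50 choice between the two mask families are assumptions made for illustration, and the returned maps use 1 for pixels to be inpainted.

```python
import numpy as np

def wide_mask(h, w, rng, max_vertices=5, min_width=10, max_width=40):
    """Polygonal chain dilated by a random width; 1 marks pixels to inpaint."""
    mask = np.zeros((h, w), dtype=np.uint8)
    yy, xx = np.mgrid[:h, :w]
    n = rng.integers(2, max_vertices + 1)                  # number of chain vertices
    pts = rng.uniform([0, 0], [h, w], size=(n, 2))         # vertices as (y, x)
    radius = rng.uniform(min_width, max_width) / 2
    for (y0, x0), (y1, x1) in zip(pts[:-1], pts[1:]):
        # Squared distance from every pixel to the segment (y0, x0)-(y1, x1).
        dy, dx = y1 - y0, x1 - x0
        t = ((yy - y0) * dy + (xx - x0) * dx) / max(dy * dy + dx * dx, 1e-12)
        t = np.clip(t, 0.0, 1.0)
        dist2 = (yy - (y0 + t * dy)) ** 2 + (xx - (x0 + t * dx)) ** 2
        mask[dist2 <= radius ** 2] = 1
    return mask

def box_mask(h, w, rng, min_side=30):
    """Axis-aligned rectangle with independently sampled height and width."""
    mask = np.zeros((h, w), dtype=np.uint8)
    bh = rng.integers(min_side, h // 2 + 1)
    bw = rng.integers(min_side, w // 2 + 1)
    top = rng.integers(0, h - bh + 1)
    left = rng.integers(0, w - bw + 1)
    mask[top:top + bh, left:left + bw] = 1
    return mask

def sample_mask(h, w, rng):
    """Pick one of the two mask families with equal probability."""
    return wide_mask(h, w, rng) if rng.random() < 0.5 else box_mask(h, w, rng)

rng = np.random.default_rng(0)
masks = [sample_mask(256, 256, rng) for _ in range(8)]
print([float(m.mean().round(3)) for m in masks])   # fraction of masked pixels per sample
```

Raising `max_width` or the number of vertices produces wider holes, which is the property the text identifies as important for training.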

Experiments

In this section we demonstrate that the proposed technique outperforms a range of strong baselines at standard low resolutions, and the difference is even more pronounced when inpainting wider holes. Then we conduct the ablation study, showing the importance of FFC, the high receptive field perceptual loss, and large masks. The model, surprisingly, can generalise to high, never-seen resolutions, while having significantly fewer parameters than the most competitive baselines.

Implementation details. For the LaMa inpainting network we use a ResNet-like architecture with 3 downsampling blocks, 6 to 18 residual blocks, and 3 upsampling blocks. In our model, the residual blocks use FFC. Further details on the discriminator architecture are provided in the supplementary material. We use the Adam optimizer with fixed learning rates of $0.001$ and $0.0001$ for the inpainting and discriminator networks, respectively. All models are trained for 1M iterations with a batch size of 30 unless otherwise stated. In all experiments, we select hyperparameters using the coordinate-wise beam-search strategy. That scheme led to the weight values $\kappa=10$, $\alpha=30$, $\beta=100$, $\gamma=0.001$. We use these hyperparameters for the training of all models, except those described in the loss ablation study (shown in Sec. 3.2). In all cases, the hyperparameter search is performed on a separate validation subset. More information about dataset splits is provided in the supplementary material.

Data and metrics. We use the Places and CelebA-HQ datasets. We follow the established practice in recent image2image literature and use the Learned Perceptual Image Patch Similarity (LPIPS) and Fréchet inception distance (FID) metrics. Compared to pixel-level L1 and L2 distances, LPIPS and FID are more suitable for measuring the performance of large-mask inpainting when multiple natural completions are plausible. The experimentation pipeline is implemented using PyTorch, PyTorch-Lightning, and Hydra. The code and the models are publicly available at github.com/saic-mdal/lama.

We compare the proposed approach with a number of strong baselines that are presented in Table 1. Only publicly available pretrained models are used to calculate these metrics. For each dataset, we validate the performance across narrow, wide, and segmentation-based masks. LaMa-Fourier consistently outperforms most of the baselines, while having fewer parameters than the strongest competitors. The only two competitive baselines, CoModGAN and MADF, use $\approx 4\times$ and $\approx 3\times$ more parameters, respectively. The difference is especially noticeable for wide masks.

User study. To alleviate a possible bias of the selected metrics, we have conducted a crowdsourced user study. The results of the user study correlate well with the quantitative evaluation and demonstrate that the inpainting produced by our method is preferred over, and less detectable than, that of other methods. The protocol and the results of the user study are provided in the supplementary material.

3.2 Ablation Study

The goal of the study is to carefully examine the influence of different components of the method. In this section, we present results on the Places dataset; additional results for the CelebA dataset are available in the supplementary material.

Receptive field of $f_{\theta}(\cdot)$. FFCs increase the effective receptive field of our system. Adding FFCs substantially improves FID scores of inpainting in wide masks (Table 2).

The importance of the receptive field is most noticeable when a model is applied to a higher resolution than it was trained on. As demonstrated in Figure 5, the model with regular convolutions produces visible artifacts as the resolution increases beyond those used at training time. The same effect is validated quantitatively (Figure 6). FFCs also substantially improve the generation of repetitive structures such as windows (Figure 4). Interestingly, LaMa-Fourier is only $20\%$ slower than LaMa-Regular while being $40\%$ smaller.

Dilated convolutions are an alternative option that allows fast growth of the receptive field. Similar to FFCs, dilated convolutions boost the performance of our inpainting system. This further supports our hypothesis on the importance of the fast growth of the effective receptive field for image inpainting. However, dilated convolutions have a more restrictive receptive field and rely heavily on scale, leading to inferior generalization to higher resolutions (Figure 6). Dilated convolutions are implemented in most frameworks and may serve as a practical replacement for Fourier ones when resources are limited, e.g. on mobile devices. We provide more details on the LaMa-Dilated architecture in the supplementary material.
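The dependence on scale can be made concrete with a back-of-the-envelope calculation. The helper below is hypothetical and assumes stride-1 layers; the dilation rates are illustrative and are not those of LaMa-Dilated.

```python
def dilated_rf(kernel, dilations):
    """Theoretical receptive field of stride-1 convs with the given dilations."""
    rf = 1
    for d in dilations:
        rf += (kernel - 1) * d
    return rf

plain = dilated_rf(3, [1, 1, 1])      # three ordinary 3x3 convolutions
dilated = dilated_rf(3, [1, 2, 4])    # same depth, growing dilation
print(plain, dilated)                 # 7 15

# The window is fixed in pixels, so its share of the image shrinks with resolution.
for size in (256, 1536):
    print(size, dilated / size)
```

Dilation widens the window at no parameter cost, but the window remains a fixed number of pixels. An image-wide operator such as FFC covers the whole input at any resolution, which is consistent with the weaker high-resolution generalization reported for the dilated variant.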

Loss. We verify that the high receptive field of the perceptual loss, implemented with dilated convolutions, indeed improves the quality of inpainting (Table 3). The pretext problem and the design choices beyond the use of dilation layers also prove to be important. For each loss variant, we performed a weight coefficient search to ensure a fair evaluation.

Mask generation. Wider training masks improve inpainting of both wide and narrow holes for LaMa (ours) and RegionWise (Table 4). However, wider masks may make results worse, which is the case for DeepFill v2 and EdgeConnect on narrow masks. We hypothesize that this difference is caused by specific design choices (e.g. the high receptive field of a generator or the loss functions) that make a method more or less suitable for inpainting both narrow and wide masks at the same time.

3.3 Generalization to higher resolution

Training directly at high resolution is slow and computationally expensive. Still, most real-world image editing scenarios require inpainting to work at high resolution. We therefore evaluate our models, which were trained using $256\times 256$ crops from $512\times 512$ images, on much larger images. We apply models in a fully convolutional fashion, i.e. an image is processed in a single pass, not patch-wise.

FFC-based models transfer to higher resolutions significantly better (Figure 6). We hypothesize that FFCs are more robust across different scales due to $i)$ the image-wide receptive field, $ii)$ the preservation of the low frequencies of the spectrum after a scale change, and $iii)$ the inherent scale equivariance of $1\times 1$ convolutions in the frequency domain. While all models generalize reasonably well to the $512\!\times\!512$ resolution, the FFC-enabled models preserve much more quality and consistency at the $1536\!\times\!1536$ resolution, compared to all other models (Figure 5). It is worth noting that they achieve this quality at a significantly lower parameter cost than the competitive baselines.
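Point $ii)$ can be illustrated with a one-dimensional toy example, which is far simpler than the two-dimensional setting of the paper: the same pattern (here three cycles per image width, an arbitrary choice) is sampled at two resolutions and its length-normalised spectrum is compared. The function name is hypothetical.

```python
import numpy as np

def normalised_spectrum(n, cycles=3):
    """Magnitude of the real FFT, divided by length, of a cosine sampled at n points."""
    t = np.arange(n) / n
    signal = np.cos(2 * np.pi * cycles * t)
    return np.abs(np.fft.rfft(signal)) / n

low = normalised_spectrum(64)      # 33 frequency bins
high = normalised_spectrum(128)    # 65 frequency bins

# The pattern occupies the same low-frequency bin at both resolutions.
print(np.argmax(low), np.argmax(high))   # 3 3
```

Because the frequency index counts cycles per image rather than per pixel, the low-frequency part of the spectrum looks the same to the frequency-domain convolution after a change of resolution; the higher resolution only appends bins at the high-frequency end.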

3.4 Teaser model: Big LaMa

To verify the scalability and applicability of our approach to real high-resolution images, we trained a larger inpainting model, Big LaMa, with more resources.

Big LaMa-Fourier differs from LaMa-Fourier in three aspects: the depth of the generator, the training dataset, and the size of the batch. It has 18 residual blocks, all based on FFC, resulting in 51M parameters. The model was trained on a subset of 4.5M images from the Places-Challenge dataset. Just like our standard base model, Big LaMa was trained only on low-resolution $256\times 256$ crops of approximately $512\times 512$ images. Big LaMa uses a larger batch size of 120 (instead of 30 for our other models). Although we consider this model relatively large, it is still smaller than some of the baselines. It was trained on eight NVIDIA V100 GPUs for approximately 240 hours. Inpainting examples of the Big LaMa model are presented in Figures 1 and 5.

Related Work

Early data-driven approaches to image inpainting relied on patch-based and nearest-neighbor-based generation. One of the first inpainting works of the deep learning era used a convnet with an encoder-decoder architecture trained in an adversarial way. This approach remains commonly used for deep inpainting to date. Another popular choice for the completion network is the family of architectures based on U-Net.

One common concern is the ability of the network to grasp the local and global context. Towards this end, earlier work proposed incorporating dilated convolutions to expand the receptive field; in addition, two discriminators were used to encourage global and local consistency separately. The use of branches with varying receptive fields in the completion network has also been suggested. To borrow information from spatially distant patches, the contextual attention layer was proposed, and alternative attention mechanisms followed. Our study confirms the importance of the efficient propagation of information between distant locations. One variant of our approach relies heavily on dilated convolutional blocks, inspired by this earlier work. As an even better alternative, we propose a mechanism based on transformations in the frequency domain (FFC). This also aligns with a recent trend of using Transformers in computer vision and treating the Fourier transform as a lightweight replacement for self-attention.

At a more global level, a coarse-to-fine framework that involves two networks was introduced. In that approach, the first network completes the coarse global structure in the holes, while the second network then uses it as guidance to refine local details. Such two-stage approaches, which follow the relatively old idea of structure-texture decomposition, became prevalent in subsequent works. Some studies modify the framework so that coarse and fine result components are obtained simultaneously rather than sequentially. Several works suggest two-stage methods that use completion of other structure types as an intermediate step: salient edges, semantic segmentation maps, foreground object contours, gradient maps, and edge-preserved smooth images. Another trend is progressive approaches. In contrast to all these works, we demonstrate that a meticulously designed single-stage approach can achieve very strong results.

To deal with irregular masks, several works modified convolutional layers, introducing partial, gated, light-weight gated, and region-wise convolutions. Various shapes of training masks were explored, including random, free-form, and object-shaped masks. We found that as long as the contours of training masks are diverse enough, the exact way of mask generation is not as important as the width of the masks.

Discussion

In this study, we have investigated the use of a simple, single-stage approach for large-mask inpainting. We have shown that such an approach is very competitive and can push the state of the art in image inpainting, given appropriate choices of the architecture, the loss function, and the mask generation strategy. The proposed method is arguably good at generating repetitive visual structures (Figures 1, 4), which appears to be an issue for many inpainting methods. However, LaMa usually struggles when a strong perspective distortion is involved (see supplementary material). We would like to note that this is usually the case for complex images from the Internet that do not belong to a dataset. It remains an open question whether FFCs can account for these deformations of periodic signals. Interestingly, FFCs allow the method to generalize to never-seen high resolutions and to be more parameter-efficient than state-of-the-art baselines. Fourier and dilated convolutions are not the only options for obtaining a high receptive field. For instance, a high receptive field can be obtained with a vision transformer, which is also an exciting topic for future research. We believe that models with a large receptive field will open new opportunities for the development of efficient high-resolution computer vision models.

Acknowledgements. We want to thank Nikita Dvornik, Gleb Sterkin, Aibek Alanov, Anna Vorontsova, Alexander Grishin, and Julia Churkina for their valuable feedback.

Supplementary material. For more details and visual samples, please refer to the project page https://saic-mdal.github.io/lama-project/ or supplementary material https://bit.ly/3zhv2rD.