RePaint: Inpainting using Denoising Diffusion Probabilistic Models

Andreas Lugmayr, Martin Danelljan, Andres Romero, Fisher Yu, Radu Timofte, Luc Van Gool

Introduction

Image Inpainting, also known as Image Completion, aims at filling missing regions within an image. Such inpainted regions need to harmonize with the rest of the image and be semantically reasonable. Inpainting approaches thus require strong generative capabilities. To this end, current State-of-the-Art approaches rely on GANs or Autoregressive Modeling. Moreover, inpainting methods need to handle various forms of masks such as thin or thick brushes, squares, or even extreme masks where the vast majority of the image is missing. This is highly challenging since existing approaches train with a certain mask distribution, which can lead to poor generalization to novel mask types. In this work, we investigate an alternative generative approach for inpainting, aiming to design an approach that requires no mask-specific training.

The Denoising Diffusion Probabilistic Model (DDPM) is an emerging alternative paradigm for generative modelling. Recently, Dhariwal and Nichol demonstrated that DDPM can even outperform the state-of-the-art GAN-based method for image synthesis. In essence, the DDPM is trained to iteratively denoise the image by reversing a diffusion process. Starting from randomly sampled noise, the DDPM is then iteratively applied for a certain number of steps, which yields the final image sample. While founded in principled probabilistic modeling, DDPMs have been shown to generate diverse and high-quality images.

We propose RePaint: an inpainting method that solely leverages an off-the-shelf unconditionally trained DDPM. Specifically, instead of learning a mask-conditional generative model, we condition the generation process by sampling from the given pixels during the reverse diffusion iterations. Remarkably, our model is therefore not trained for the inpainting task itself. This has two important advantages. First, it allows our network to generalize to any mask during inference. Second, it enables our network to learn more semantic generation capabilities since it has a powerful DDPM image synthesis prior (see the introductory figure).

Although the standard DDPM sampling strategy produces matching textures, the inpainting is often semantically incorrect. Therefore, we introduce an improved denoising strategy that resamples (RePaint) iterations to better condition the image. Notably, instead of slowing down the diffusion process, our approach goes forward and backward in diffusion time, producing remarkably semantically meaningful images. Our approach allows the network to effectively harmonize the generated image information during the entire inference process, leading to a more effective conditioning on the given image information.

We perform experiments on CelebA-HQ and ImageNet, and compare with other State-of-the-Art inpainting approaches. Our approach generalizes better and has overall more semantically meaningful inpainted regions.

Related Work

Early attempts at Image Inpainting or Image Completion exploited low-level cues within the input image, or within neighbors retrieved from a large image dataset, to fill the missing region.

Deterministic Image Inpainting: Since the introduction of GANs, most of the existing methods follow a standard configuration, first proposed by Pathak et al., that is, using an encoder-decoder architecture as the main inpainting generator, adversarial training, and tailored losses that aim at photo-realism. Follow-up works have produced impressive results in recent years.

Because image inpainting requires high-level semantic context, hand-crafted architectural designs have been proposed to include it explicitly in the generation pipeline: Dilated Convolutions to increase the receptive field, Partial Convolutions and Gated Convolutions to guide the convolution kernel according to the inpainted mask, Contextual Attention to leverage global information, Edge maps or Semantic Segmentation maps to further guide the generation, and Fourier Convolutions to include both global and local information efficiently. Although recent works produce photo-realistic results, GANs are well known for textural synthesis, so these methods shine on background completion or removing objects, which require repetitive structural synthesis, and struggle with semantic synthesis (Figure 4).

Diverse Image Inpainting: Most GAN-based Image Inpainting methods are prone to deterministic transformations due to the lack of control during the image synthesis. To overcome this issue, Zheng et al. and Zhao et al. propose VAE-based networks that trade off between diversity and reconstruction. Zhao et al., inspired by the StyleGAN2 modulated convolutions, introduce a co-modulation layer for the inpainting task in order to improve both diversity and reconstruction. A new family of auto-regressive methods, which can handle irregular masks, has recently emerged as a powerful alternative for free-form image inpainting.

Usage of Image Prior: In a different direction closer to ours, Richardson et al. exploit the StyleGAN prior to successfully inpaint missing regions. However, similar to super-resolution methods that leverage the StyleGAN latent space, it is limited to specific scenarios like faces. Notably, Ulyanov et al. showed that the structure of a non-trained generator network contains an inherent prior that can be used for inpainting and other applications. In contrast to these methods, we leverage the high expressiveness of a pretrained Denoising Diffusion Probabilistic Model (DDPM) and therefore use it as a prior for generic image inpainting. Our method generates very detailed, high-quality images for both semantically meaningful generation and texture synthesis. Moreover, our method is not trained for the image inpainting task, and instead, we take full advantage of the prior DDPM, so each image is optimized independently.

Image Conditional Diffusion Models: The work by Sohl-Dickstein et al. applied early diffusion models to inpainting. More recently, Song et al. develop a score-based formulation using stochastic differential equations for unconditional image generation, with an additional application to inpainting. However, both these works only show qualitative results, and do not compare with other inpainting approaches. In contrast, we aim to advance the state-of-the-art in image inpainting, and provide comprehensive comparisons with the top competing methods in literature.

A different line of research is guided image synthesis with DDPM-based approaches. In the case of ILVR, a trained diffusion model is guided using the low-frequency information from a conditional image. However, this conditioning strategy cannot be adopted for inpainting, since both high and low-frequency information is absent in the masked-out regions. Another approach for image-conditional synthesis performs guided generation by initializing the reverse diffusion process from the guiding image at some intermediate diffusion time. An iterative strategy, repeating the reverse process several times, is further adopted to improve harmonization. Since a guiding image is required to start the reverse process at an intermediate time step, this approach is not applicable to inpainting, where new image content needs to be generated solely conditioned on the non-masked pixels. Furthermore, the resampling strategy proposed in this work differs from that of this concurrent approach. We proceed through the full reverse diffusion process, starting at the end time, at each step jumping back and forth a fixed number of time steps to progressively improve generation quality.

While we propose a method that conditions an unconditionally trained model, one concurrent work is based on classifier-free guidance for training an image-conditional diffusion model. Another direction for image manipulation is image-to-image translation using diffusion models, as explored in another concurrent work. It trains an image-conditional DDPM and shows an application to inpainting. Unlike both these concurrent works, we leverage an unconditional DDPM and only condition through the reverse diffusion process itself. This allows our approach to effortlessly generalize to any mask shape for free-form inpainting. Moreover, we propose a sampling schedule for the reverse process, which greatly improves image quality.

Preliminaries: Denoising Diffusion Probabilistic Models

In this paper, we use diffusion models as a generative method. Like other generative models, the DDPM learns a distribution of images given a training set. The inference process works by sampling a random noise vector $x_{T}$ and gradually denoising it until it reaches a high-quality output image $x_{0}$. During training, DDPM methods define a diffusion process that transforms an image $x_{0}$ to white Gaussian noise $x_{T}\sim\mathcal{N}(0,\mathbf{I})$ in $T$ time steps. Each step in the forward direction is given by,
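
The display for this step does not appear in the text. Written out to match the description in the next paragraph and the Gaussian used for re-noising in the resampling discussion, it reads:

$$q(x_{t}|x_{t-1})=\mathcal{N}\big(x_{t};\sqrt{1-\beta_{t}}\,x_{t-1},\beta_{t}\mathbf{I}\big) \tag{1}$$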

The sample $x_{t}$ is obtained by adding i.i.d. Gaussian noise with variance $\beta_{t}$ at timestep $t$ and scaling the previous sample $x_{t-1}$ with $\sqrt{1-\beta_{t}}$ according to a variance schedule.
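
As a minimal sketch of the step just described, the snippet below scales a toy sample by $\sqrt{1-\beta_{t}}$ and adds Gaussian noise of variance $\beta_{t}$. The function name `forward_step`, the flat-list image representation, and the value $\beta_{t}=0.02$ are illustrative assumptions, not part of the method description.

```python
import math
import random


def forward_step(x_prev, beta_t, rng):
    """One forward diffusion step: scale by sqrt(1 - beta_t), add N(0, beta_t) noise."""
    scale = math.sqrt(1.0 - beta_t)
    sigma = math.sqrt(beta_t)
    return [scale * v + sigma * rng.gauss(0.0, 1.0) for v in x_prev]


rng = random.Random(0)            # fixed seed so the sketch is repeatable
x0 = [0.5, -0.25, 1.0, 0.0]       # toy "image" as a flat list of pixel values
x1 = forward_step(x0, beta_t=0.02, rng=rng)  # beta_t chosen arbitrarily
```

With $\beta_{t}=0$ the step returns its input unchanged, which is a quick sanity check on the scaling.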

The DDPM is trained to reverse the process in (1). The reverse process is modeled by a neural network that predicts the parameters $\mu_{\theta}(x_{t},t)$ and $\Sigma_{\theta}(x_{t},t)$ of a Gaussian distribution,
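
The corresponding display is missing here; from the two predicted parameters named above, the reverse transition can be written as:

$$p_{\theta}(x_{t-1}|x_{t})=\mathcal{N}\big(x_{t-1};\mu_{\theta}(x_{t},t),\Sigma_{\theta}(x_{t},t)\big) \tag{2}$$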

The learning objective for the model (2) is derived by considering the variational lower bound,
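
The bound itself is not shown in the text. In the standard DDPM formulation of Ho et al., which this section follows, it takes the form:

$$\mathbb{E}\left[-\log p_{\theta}(x_{0})\right]\leq\mathbb{E}_{q}\left[-\log\frac{p_{\theta}(x_{0:T})}{q(x_{1:T}|x_{0})}\right]=L \tag{3}$$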

As extended by Ho et al. , this loss can be further decomposed as,
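
For reference, the standard decomposition from Ho et al., with the term labels used in the following paragraph, is:

$$L=\mathbb{E}_{q}\Big[\underbrace{D_{KL}\big(q(x_{T}|x_{0})\,\|\,p(x_{T})\big)}_{L_{T}}+\sum_{t>1}\underbrace{D_{KL}\big(q(x_{t-1}|x_{t},x_{0})\,\|\,p_{\theta}(x_{t-1}|x_{t})\big)}_{L_{t-1}}\underbrace{-\log p_{\theta}(x_{0}|x_{1})}_{L_{0}}\Big] \tag{4}$$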

Importantly, the term $L_{t-1}$ trains the network (2) to perform one reverse diffusion step. Furthermore, it allows for a closed-form expression of the objective since $q(x_{t-1}|x_{t},x_{0})$ is also Gaussian.

As reported by Ho et al., the best way to parametrize the model is to predict the cumulative noise $\epsilon_{0}$ that is added to the current intermediate image $x_{t}$. Thus, we obtain the following parametrization of the predicted mean $\mu_{\theta}(x_{t},t)$,
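
The parametrization is not displayed in the text. In the form given by Ho et al., with the network's noise prediction written as $\epsilon_{\theta}(x_{t},t)$ and the shorthand $\alpha_{t}=1-\beta_{t}$ (a symbol not otherwise used in this section), it is:

$$\mu_{\theta}(x_{t},t)=\frac{1}{\sqrt{\alpha_{t}}}\left(x_{t}-\frac{\beta_{t}}{\sqrt{1-\bar{\alpha}_{t}}}\,\epsilon_{\theta}(x_{t},t)\right) \tag{5}$$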

From $L_{t-1}$ in (4), the following simplified training objective is derived by Ho et al.,
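
In the notation of Ho et al., with $\epsilon$ the noise used to produce $x_{t}$ from $x_{0}$ and $\epsilon_{\theta}$ the network's prediction of it, this objective is:

$$L_{\text{simple}}=\mathbb{E}_{t,x_{0},\epsilon}\left[\|\epsilon-\epsilon_{\theta}(x_{t},t)\|^{2}\right] \tag{6}$$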

As introduced by Nichol and Dhariwal, learning the variance $\Sigma_{\theta}(x_{t},t)$ in (2) of the reverse process helps to reduce the number of sampling steps by an order of magnitude. They, therefore, add the variational lower bound loss. Specifically, we base our training and inference approach on the recent work which further reduced the inference time by a factor of four.

To train the DDPM, we need a sample $x_{t}$ and the corresponding noise that is used to transform $x_{0}$ to $x_{t}$. By using the independence property of the noise added at each step (1), we can express the total noise variance through $\bar{\alpha}_{t}=\prod_{s=1}^{t}(1-\beta_{s})$. We can thus rewrite (1) as a single step,
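
The single-step form is not displayed in the text. Composing the per-step Gaussians in (1) with the cumulative product $\bar{\alpha}_{t}$ defined above gives:

$$q(x_{t}|x_{0})=\mathcal{N}\big(x_{t};\sqrt{\bar{\alpha}_{t}}\,x_{0},(1-\bar{\alpha}_{t})\mathbf{I}\big) \tag{7}$$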

It allows us to efficiently sample pairs of training data to train a reverse transition step.
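
The closed-form sampling described above can be sketched as follows. The helper names and the linear schedule with endpoints $10^{-4}$ and $0.02$ are assumptions made for illustration only; the text does not specify the variance schedule.

```python
import math
import random


def linear_betas(T, beta_start=1e-4, beta_end=0.02):
    """Hypothetical linear variance schedule (an assumption, not taken from the text)."""
    if T == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (T - 1)
    return [beta_start + i * step for i in range(T)]


def alpha_bars(betas):
    """Cumulative products: alpha_bar_t = prod_{s<=t} (1 - beta_s)."""
    out, prod = [], 1.0
    for b in betas:
        prod *= 1.0 - b
        out.append(prod)
    return out


def q_sample(x0, t, abar, rng):
    """Draw x_t ~ N(sqrt(abar_t) x0, (1 - abar_t) I); t is 1-indexed. Returns (x_t, eps)."""
    a = abar[t - 1]
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    x_t = [math.sqrt(a) * v + math.sqrt(1.0 - a) * e for v, e in zip(x0, eps)]
    return x_t, eps


abar = alpha_bars(linear_betas(250))
x_t, eps = q_sample([0.5, -0.25, 1.0], t=100, abar=abar, rng=random.Random(0))
```

The pair `(x_t, eps)` is exactly what a noise-prediction training step needs: the noised input and the noise it should recover.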

Method

In this section, we first present our approach for conditioning the reverse diffusion process of an unconditional DDPM for image inpainting in Section 4.1. Then, we introduce an approach to improve the reverse process itself for inpainting in Section 4.2.

The goal of inpainting is to predict missing pixels of an image using a mask region as a condition. In the remainder of the paper, we consider a trained unconditional denoising diffusion probabilistic model (2). We denote the ground truth image as $x$, the unknown pixels as $(1-m)\odot x$ and the known pixels as $m\odot x$.

Since every reverse step (2) from $x_{t}$ to $x_{t-1}$ depends solely on $x_{t}$, we can alter the known regions $m\odot x_{t}$ as long as we keep the correct properties of the corresponding distribution. Since the forward process is defined by a Markov Chain (1) of added Gaussian noise, we can sample the intermediate image $x_{t}$ at any point in time using (7). This allows us to sample the known regions $m\odot x_{t}$ at any time step $t$. Thus, using (2) for the unknown region and (7) for the known regions, we achieve the following expression for one reverse step in our approach,
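
The expression is not displayed in the text. A form consistent with (2), (7), and the explanation in the next paragraph, assuming $m$ equals one on known pixels and the index $t-1$ in the known-pixel term follows from applying (7) at time $t-1$, is:

$$x_{t-1}^{\text{known}}\sim\mathcal{N}\big(\sqrt{\bar{\alpha}_{t-1}}\,x_{0},(1-\bar{\alpha}_{t-1})\mathbf{I}\big) \tag{8a}$$

$$x_{t-1}^{\text{unknown}}\sim\mathcal{N}\big(\mu_{\theta}(x_{t},t),\Sigma_{\theta}(x_{t},t)\big) \tag{8b}$$

$$x_{t-1}=m\odot x_{t-1}^{\text{known}}+(1-m)\odot x_{t-1}^{\text{unknown}} \tag{8c}$$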

Thus, $x_{t-1}^{\text{known}}$ is sampled using the known pixels in the given image $m\odot x_{0}$, while $x_{t-1}^{\text{unknown}}$ is sampled from the model, given the previous iteration $x_{t}$. These are then combined to the new sample $x_{t-1}$ using the mask. Our approach is illustrated in Figure 1.
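
The combination step can be sketched in a few lines. Everything here is a simplified assumption: images are flat lists, `mask[i] == 1` marks a known pixel, the variance is diagonal, and `toy_model` is a hypothetical stand-in for the trained network rather than anything from the paper.

```python
import math
import random


def repaint_reverse_step(x_t, x0_given, mask, t, abar, predict_mean_var, rng):
    """One conditioned reverse step from x_t to x_{t-1}; t is 1-indexed with t >= 1.

    abar[t - 1] holds alpha_bar_t; alpha_bar_0 is taken to be 1 (no noise).
    """
    a_prev = abar[t - 2] if t > 1 else 1.0
    mean, var = predict_mean_var(x_t, t)
    out = []
    for i in range(len(x_t)):
        # Known pixels: noise the given image to time t-1 with the closed form.
        known = math.sqrt(a_prev) * x0_given[i] + math.sqrt(1.0 - a_prev) * rng.gauss(0.0, 1.0)
        # Unknown pixels: sample from the model's Gaussian (no noise on the last step).
        z = rng.gauss(0.0, 1.0) if t > 1 else 0.0
        unknown = mean[i] + math.sqrt(var[i]) * z
        out.append(mask[i] * known + (1 - mask[i]) * unknown)
    return out


def toy_model(x_t, t):
    """Hypothetical placeholder for the network: shrinks x_t, constant variance."""
    return [0.9 * v for v in x_t], [0.01 for _ in x_t]


abar = [0.99, 0.97, 0.94]
x_prev = repaint_reverse_step([0.3, -0.7, 1.2], [0.5, 0.5, 0.5], [1, 0, 1],
                              t=3, abar=abar, predict_mean_var=toy_model,
                              rng=random.Random(0))
```

Because the mask is applied per pixel after both samples are drawn, any mask shape can be used without changing the model.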

4.2 Resampling

When directly applying the method described in Section 4.1, we observe that only the content type matches with the known regions. For example, in Figure 2 for $n=1$, the inpainted area is a furry texture matching the hair of the dog. Although the inpainted region matches the texture of the neighboring region, it is semantically incorrect. Therefore, the DDPM is leveraging the context of the known region, yet it is not harmonizing it well with the rest of the image. Next, we discuss possible reasons for this behavior.

From Figure 1, we analyze how the method is conditioning the known regions. As shown in (8), the model predicts $x_{t-1}$ using $x_{t}$, which comprises the output of the DDPM (2) and the sample from the known region. However, the sampling of the known pixels using (7) is performed without considering the generated parts of the image, which introduces disharmony. Although the model tries to harmonize the image again in every step, it can never fully converge because the same issue occurs in the next step. Moreover, in each reverse step, the maximum change to an image declines due to the variance schedule of $\beta_{t}$. Thus, the method cannot correct mistakes that lead to disharmonious boundaries in the subsequent steps due to restricted flexibility. As a consequence, the model needs more time to harmonize the conditional information $x_{t-1}^{\text{known}}$ with the generated information $x_{t-1}^{\text{unknown}}$ in one step before advancing to the next denoising step.

Since the DDPM is trained to generate an image that lies within a data distribution, it naturally aims at producing consistent structures. In our resampling approach, we use this DDPM property to harmonize the input of the model. Consequently, we diffuse the output $x_{t-1}$ back to $x_{t}$ by sampling from (1) as $x_{t}\sim\mathcal{N}(\sqrt{1-\beta_{t}}x_{t-1},\beta_{t}\mathbf{I})$. Although this operation scales back the output and adds noise, some information incorporated in the generated region $x_{t-1}^{\text{unknown}}$ is still preserved in $x_{t}^{\text{unknown}}$. It leads to a new $x_{t}^{\text{unknown}}$ which is both more harmonized with $x_{t}^{\text{known}}$ and contains conditional information from it.
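
A minimal sketch of this one-step resampling loop is given below, under stated assumptions: `reverse_step` is any callable returning a sample of $x_{t-1}$ (for inpainting, the mask-conditioned step), `betas[t - 1]` holds $\beta_{t}$, and the final step is not repeated. The function names are illustrative and the code is not the authors' implementation.

```python
import math
import random


def renoise(x_prev, beta_t, rng):
    """Diffuse x_{t-1} back to x_t: x_t ~ N(sqrt(1 - beta_t) x_{t-1}, beta_t I)."""
    scale, sigma = math.sqrt(1.0 - beta_t), math.sqrt(beta_t)
    return [scale * v + sigma * rng.gauss(0.0, 1.0) for v in x_prev]


def sample_with_resampling(x_T, betas, n_resample, reverse_step, rng):
    """Run the reverse process from t = T down to t = 0, repeating each step n_resample times."""
    x = x_T
    for t in range(len(betas), 0, -1):
        repeats = n_resample if t > 1 else 1   # the final step is not repeated
        for u in range(repeats):
            x_prev = reverse_step(x, t)        # sample x_{t-1} from x_t
            if u < repeats - 1:
                x = renoise(x_prev, betas[t - 1], rng)  # go back from t-1 to t
        x = x_prev
    return x


calls = []


def counting_step(x, t):
    """Hypothetical reverse step that records the time it was called at."""
    calls.append(t)
    return [0.5 * v for v in x]


out = sample_with_resampling([1.0, -1.0], [0.01, 0.02, 0.03, 0.04], 3,
                             counting_step, random.Random(0))
```

In this toy run with $T=4$ and three resamplings, the reverse step is evaluated three times at each of $t=4,3,2$ and once at $t=1$, which makes the extra cost of resampling explicit.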

Since this operation can only harmonize one step, it might not be able to incorporate the semantic information over the entire denoising process. To overcome this problem, we denote the time horizon of this operation as jump length, which is $j=1$ for the previous case. Similar to the standard change in diffusion speed (a.k.a. slowing down), the resampling also increases the runtime of the reverse diffusion. Slowing down applies smaller but more numerous steps by reducing the added variance in each denoising step. However, that is a fundamentally different approach because slowing down the diffusion still has the problem of not harmonizing the image, as described above. We empirically demonstrate this advantage of our approach in Sec. 5.6.

Experiments

We perform extensive experiments for face and generic inpainting, compare to the state-of-the-art solutions, and conduct an ablative analysis. In Sections 5.3 and 5.4, we report a detailed discussion of mask robustness and diversity, respectively. We also report additional results, analysis, and visuals in the appendix.

We validate our solution over the CelebA-HQ and ImageNet datasets. As our method relies on a pretrained guided diffusion model, we use the provided ImageNet model. For CelebA-HQ, we follow the same training hyper-parameters as for ImageNet. We use $256\times 256$ crops in three batches on 4 $\times$ V100 GPUs each. In contrast to the pretrained ImageNet model, the CelebA-HQ one is only trained for 250,000 iterations during roughly five days. Note that all our qualitative and quantitative results in the main paper are for 256 image size.

For our final approach, we use $T=250$ timesteps and apply resampling $r=10$ times with jump size $j=10$.

5.2 Metrics

We compare our RePaint with the baseline methods in a user study described as follows. The user is shown the input image with the blanked missing regions. Next to this image, we display two different inpainting solutions. The user is asked to select “Which image looks more realistic?”. The user thus compares the realism of RePaint against the result of a baseline. To avoid biasing the user towards an approach, the methods were anonymized and shown in a different random order for each image. Moreover, each user was asked every question twice and could only submit their answers if they were consistent with themselves in at least 75% of their answers. A self-consistency in 100% of the cases is often not possible since, for example, the LaMa method can have a very similar quality to RePaint on the mask settings they provide. Our user study evaluates all 100 test images of the test datasets CelebA-HQ and ImageNet for the following masks: Wide, Narrow, Every Second Line, Half Image, Expand, and Super-Resolve. We use the answers of five different humans for every image query, resulting in 1000 votes per method-to-method comparison in each dataset and mask setting, and show the 95% confidence interval next to the mean votes. In addition to the user study, we report the commonly used perceptual metric LPIPS, which is a learned distance metric based on the deep feature space of AlexNet. We compute the LPIPS over the same 100 test images used in the user study. The results are shown in Table 1. Furthermore, please refer to the appendix for additional quantitative results.
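
To make the role of the 95% confidence interval concrete, the sketch below computes a normal-approximation interval for a vote share. The text does not state which interval construction is used, so this choice, the function names, and the vote counts in the example are assumptions for illustration.

```python
import math


def vote_share_ci(votes_for, total_votes, z=1.96):
    """Normal-approximation confidence interval for a vote share (z = 1.96 for 95%)."""
    p = votes_for / total_votes
    half_width = z * math.sqrt(p * (1.0 - p) / total_votes)
    return p, p - half_width, p + half_width


def is_conclusive(votes_for, total_votes):
    """A comparison is conclusive if the interval does not contain the 50% border."""
    _, lo, hi = vote_share_ci(votes_for, total_votes)
    return lo > 0.5 or hi < 0.5


# Illustrative counts out of the 1000 votes per comparison described above.
share, lo, hi = vote_share_ci(524, 1000)
close_call = is_conclusive(524, 1000)
clear_win = is_conclusive(731, 1000)
```

Under this approximation, a share close to 50% out of 1000 votes yields an interval that straddles the 50% border, while a share above 70% does not.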

5.3 Comparison with State-of-the-Art

In this section, we first compare our approach against state-of-the-art on standard mask distributions, commonly employed for benchmarking. We then analyze the generalization capabilities of our method against other approaches. To this end, we evaluate their robustness under four challenging mask settings. Firstly, two different masks that probe if the methods can incorporate information from thin structures. Secondly, two masks that require inpainting a large connected area of the image. All quantitative results are reported in Table 1 and visual results in Figures 3 and 4.

Methods: We compare our approach against several state-of-the-art autoregressive-based or GAN-based approaches. The autoregressive methods are DSI and ICT, and the GAN methods are DeepFillv2, AOT, and LaMa. We use their publicly available pretrained models. We used the existing FFHQ pretrained model of ICT for our CelebA-HQ testing. As LaMa does not provide ImageNet models, we trained their system for 300,000 iterations of batch size five using the original implementation.

Settings: We use 100 images of size 256 $\times$ 256 from CelebA-HQ and ImageNet test sets. The resulting LPIPS and the average votes of the user study are shown in Table 1. Additionally, refer to the appendix for qualitative and quantitative results over the Places2 dataset.

Wide and Narrow masks: To validate our method on the standard image inpainting scenario, we use the LaMa settings for Wide and Narrow masks. RePaint outperforms all other methods with a significance margin of 95% in both CelebA-HQ and ImageNet, for both Wide and Narrow settings. See qualitative results in Figures 3 and 4 and quantitative results in Table 1. The best autoregressive method ICT seems to have less global consistency, as observed in Figure 3 in the second row, where the eyes do not match well. In general, the best GAN approach LaMa has better global consistency, yet it produces notorious checkerboard artifacts. Those observations might have influenced the users to vote for RePaint for the majority of images, in which our method generates more realistic images.

Thin Masks: Similar to a Nearest-Neighbor Super Resolution problem, the “Super-Resolution $2\times$” mask only leaves pixels with a stride of 2 in the height and width dimensions, and the “Alternating Lines” mask removes every second row of pixels of an image. As seen in Figures 3 and 4, AOT fails completely, while the others either produce blurry images or generate visible artifacts, or both. These observations are also confirmed by the user study, where RePaint achieves between 73.1% and 99.3% of the user votes.

Thick Masks: The “Expand” mask leaves only a center crop of $64\times 64$ from a $256\times 256$ image, and the “Half” mask provides only the left half of the image as input. As there is less contextual information, most of the methods struggle (see Figures 3 and 4). Qualitatively, LaMa comes closer to ours, yet our generated images are sharper and hallucinate more semantically meaningful content overall. Notably, LaMa outperforms RePaint in terms of LPIPS on “Expand” and “Half” for both CelebA-HQ and ImageNet (Tab. 1). We argue that this behavior is due to our method being more flexible and diverse in the generation. Because it may generate an image that is semantically different from the Ground Truth, LPIPS becomes an unsuitable metric in this particular setting.
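
The four robustness masks described above can be generated with a few lines of code. This is a sketch under assumptions: a value of 1 marks a known pixel, and which rows or columns are kept (even indices here) is an arbitrary choice not specified in the text.

```python
def make_mask(kind, size=256):
    """Return a size x size mask as a list of rows; 1 = known pixel, 0 = pixel to inpaint."""
    if kind == "super_resolution_2x":
        # Keep only pixels on a stride-2 grid in both height and width.
        return [[1 if (r % 2 == 0 and c % 2 == 0) else 0 for c in range(size)]
                for r in range(size)]
    if kind == "alternating_lines":
        # Remove every second row.
        return [[1 if r % 2 == 0 else 0] * size for r in range(size)]
    if kind == "expand":
        # Keep only a centered crop with a quarter of the side length (64 for 256).
        crop = size // 4
        lo = (size - crop) // 2
        hi = lo + crop
        return [[1 if (lo <= r < hi and lo <= c < hi) else 0 for c in range(size)]
                for r in range(size)]
    if kind == "half":
        # Keep only the left half of the image.
        return [[1 if c < size // 2 else 0 for c in range(size)] for r in range(size)]
    raise ValueError(f"unknown mask kind: {kind}")


def known_fraction(mask):
    """Fraction of pixels that are given as input."""
    total = sum(len(row) for row in mask)
    return sum(sum(row) for row in mask) / total


fractions = {k: known_fraction(make_mask(k))
             for k in ("super_resolution_2x", "alternating_lines", "expand", "half")}
```

The known fractions make the difficulty explicit: the “Expand” mask leaves 6.25% of the pixels, the “Super-Resolution $2\times$” mask 25%, and the other two 50%.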

The artifacts produced by the baselines can be explained by strong overfitting to the training masks. In contrast, as our method does not involve mask training, our RePaint can handle any type of mask. In the case of large-area inpainting, RePaint produces a semantically meaningful filling, while others generate artifacts or copy texture. Finally, RePaint is preferred by the users with confidence 95% except for the inconclusive result of ICT with “Half” masks as shown in Table 1.

5.4 Analysis of Diversity

As shown in (2), every reverse diffusion step is inherently stochastic since it incorporates new noise from a Gaussian Distribution. Moreover, as we do not directly guide the inpainted area with any loss, the model is, therefore, free to inpaint anything that semantically aligns with the training set. The introductory figure illustrates the diversity and flexibility of our model.

5.5 Class conditional Experiment

The pretrained ImageNet DDPM is capable of class-conditional generation sampling. In Figure 5 we show examples for the “Expand” mask for the “Granny Smith” class, as well as other classes.

5.6 Ablation Study

Comparison to slowing down: To analyze if the increased computational budget causes the improved performance of resampling, we compare it with the commonly used technique of slowing down the diffusion process as described in Section 4.2. Therefore, in Figure 6 and Table 2, we show a comparison between resampling and slowing down the diffusion, using the same computational budget for each setting. We observe that the resampling uses the extra computational budget for harmonizing the image, whereas there is no visible improvement from slowing down the diffusion process.

Jump Length: Moreover, to ablate the jump length $j$ and the number of resamplings $r$, we study nine different settings in Table 3. We obtain better performance with the larger jump length $j=10$ than with smaller jump lengths. We observe that for jump length $j=1$, the DDPM is more likely to output a blurry image. This observation is stable across different numbers of resamplings. Furthermore, increasing the number of resamplings improves the performance.

Comparison to alternative sampling strategy: To compare our resampling approach to SDEdit, we first perform reverse diffusion from $t=T$ to $t=T/2$ to obtain the required initial inpainting at $t=T/2$. We then apply the resampling method from SDEdit, which repeats the reverse process from $t=T/2$ to $t=0$ several times. The results are shown in Table 4. Our approach achieves significantly better performance across all mask types except for one “Expand” case, where LPIPS $>0.6$ is outside a meaningful range for comparisons. In the case of ‘super-resolution masks’, our approach reduces the LPIPS by over 53% on all datasets, clearly demonstrating the advantage of our resampling strategy.

Limitations

Our method produces sharp, highly detailed, and semantically meaningful images. We believe that our work opens interesting research directions for addressing the current limitations of the method. Two directions are of particular interest. First, naturally, the per-image DDPM optimization process is significantly slower than the GAN-based and Autoregressive-based counterparts. That makes it currently difficult to apply it for real-time applications. Nonetheless, DDPM is gaining in popularity, and recent publications are working on improving the efficiency. Secondly, for the extreme mask cases, RePaint can produce realistic image completions that are very different from the Ground Truth image. That makes the quantitative evaluation challenging for those conditions. An alternative solution is to employ the FID score over a test set. However, a reliable FID for inpainting is usually computed with more than 1,000 images. For current DDPM, this would result in a runtime that is not feasible for most research institutes.

Potential Negative Societal Impact

On the one hand, RePaint is an inpainting method that relies on an unconditional pretrained DDPM. Therefore, the algorithm might be biased towards the dataset on which it was trained. Since the model aims to generate images of the same distribution as the training set, it might reflect the same biases, such as gender, age, and ethnicity. On the other hand, RePaint could be used for the anonymization of faces. For example, one could remove the information about the identity of people shown at public events and hallucinate artificial faces for data protection.

Conclusions

We presented a novel denoising diffusion probabilistic model solution for the image inpainting task. In detail, we developed a mask-agnostic approach that widely increases the degree of freedom of masks for the free-form inpainting. Since the novel conditioning approach of RePaint complies with the model assumptions of a DDPM, it produces a photo-realistic image regardless of the type of the mask.

Acknowledgements: This work was supported by the ETH Zürich Fund (OK), a Huawei Technologies Oy (Finland) project, and an Nvidia GPU grant.

Appendix

In this appendix, we provide additional details and analysis of our approach. We give more explanation on our user study in Section A. Further, we present additional details on how we implemented the diffusion time schedule for jumps in Section B. Visual results for our ablation for jump size and the number of resamplings are provided in Section C. The evaluation on the second part of the LaMa Benchmark on Places2 is presented in Section D. Furthermore, to compare the diversity of the inpaintings for RePaint compared with state-of-the-art, we provide a quantitative analysis in Section E. Details on failure cases and data bias on the ImageNet dataset are provided in Section F. For gaining a better intuitive understanding of the evolution of the latent space, we provide a video of the inference in Section G. And finally, we show additional visual examples in Section I.

A User Study

As described in Section 5.2 in the main paper, we conduct a user study to determine which method is best perceived by the human eye. In Figure 7, we depict the user interface, where the user selects the most realistic solution from an input reference. To reduce bias, we show the two candidate images in random order. Additionally, to improve the consistency of the user decision and prevent answers with low effort, we show every example twice. Users who are consistent with themselves in less than 75% of their votes are discarded.

B Algorithm for jump size larger than one

In addition to the resampling introduced in Algorithm 1 in the main paper, we use jumps in diffusion time as described in Section 4.2 in the main paper. Figure 8 shows a pseudo-code to further clarify the generation of state transitions. Note that each transition increases or decreases the diffusion time $t$ by one. For example, for a chosen jump length of $j=10$ shown in Figure 10, we apply ten forward transitions before applying ten reverse transitions. The diffusion time $t$ for the latent vector $x_{t}$ is plotted in Figure 9.
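
One possible way to generate such a sequence of diffusion times is sketched below. It is an illustrative assumption, not the pseudo-code of Figure 8: the function name, the choice of jump points at multiples of the jump length, and the handling of the ends of the schedule are all hypothetical. It does satisfy the two properties stated above: every transition changes $t$ by exactly one, and each jump applies `jump_length` forward transitions before the reverse transitions resume.

```python
def jump_schedule(T, jump_length, n_resample):
    """Return the list of diffusion times visited, starting at T and ending at 0."""
    # Remaining jumps per jump point; each point is revisited n_resample - 1 times.
    jumps_left = {t: n_resample - 1 for t in range(0, T - jump_length, jump_length)}
    times = [T]
    t = T
    while t > 0:
        t -= 1                       # one reverse transition
        times.append(t)
        if jumps_left.get(t, 0) > 0:
            jumps_left[t] -= 1
            for _ in range(jump_length):
                t += 1               # one forward transition
                times.append(t)
    return times


small = jump_schedule(T=4, jump_length=2, n_resample=2)
full = jump_schedule(T=250, jump_length=10, n_resample=10)
n_reverse = sum(1 for a, b in zip(full, full[1:]) if b == a - 1)
```

Counting the reverse transitions in such a schedule gives the number of network evaluations, which is the quantity to hold fixed when comparing against slowing down the diffusion.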

C Ablation

In addition to the quantitative analysis in Table 3 in the main paper, this section shows visual examples for different jump lengths $j$ and numbers of resamplings $r$. As discussed in Section 5.6 in the main paper, smaller jump lengths $j$ tend to produce blurrier images as shown in Figure 11, and an increased number of resamplings $r$ improves the overall image consistency.

D Evaluation on Places2

For a more comprehensive experimental framework, in this section, we provide the second part of the benchmark proposed in LaMa, which is over the Places2 dataset. The experiments on Places2 were conducted using an unconditional model that we trained for 300k iterations with batch size four on four V100 GPUs, taking about six days in total. All other training settings were kept as originally used for ImageNet. The model checkpoint will be published. We will clarify these aspects and add further details in the paper. We use the same mask generation procedure and settings described in the main paper. The results shown in Table 5 are in line with those on CelebA-HQ and ImageNet in Table 1 of the main paper. RePaint outperforms all other methods for all masks with significance 95% except for one inconclusive case. This case is when comparing RePaint to LaMa on Wide Masks, where 52.4% of the user votes are for RePaint, but the significance interval overlaps with the 50% border. The visual comparison on the Wide and Narrow masks is shown in Figure 21. Moreover, the visual results further confirm the robustness against sparse masks as shown in Figure 22. The mask pattern is clearly visible in all competing methods, while RePaint shows better harmonization. Regarding large masks, RePaint is able to inpaint semantically meaningful content, such as a companion of similar age in the bar with consistent overall lighting conditions, as shown in the second row of Figure 23.

E Diversity

For our quantitative evaluation in the main paper, we sample a single image per input. However, since our method is stochastic, we can draw multiple samples from it. To compare the diversity among the stochastic methods, we use the Diversity Score (higher is better). In contrast to the standard diversity metric that only computes the mean LPIPS across pairs of outputs, this score is designed to describe meaningful diversity while also weighting the overall performance in LPIPS. It aims at measuring the diversity of the generations inside the manifold of plausible predictions. In detail, too extreme predictions or failures are therefore penalized. As shown in Table 6, for “Wide” and “Half”, there is no method with both best LPIPS and Diversity Score, and for “Expand” ICT beats RePaint by $0.81\%$ in Diversity Score and $1.1\%$ in LPIPS. RePaint is superior by a large margin in both LPIPS and Diversity Score for the thin structured masks “Narrow”, “Super-Resolution $2\times$”, and “Alternating Lines” compared to both ICT and DSI.

F Failure Cases

As depicted in Figure 12, RePaint sometimes confuses the semantic context and mixes non-matching objects. Our model on ImageNet seems to be biased towards inpainting dogs more frequently than expected. Since ImageNet has many different breeds of dogs for classification tasks, dogs are over-represented in the training set, hence our model bias.

G Attached Video

To inspect the latent space of the diffusion process, we provide a video in the attachment, as shown in the screenshot in Figure 13. There we show the Ground Truth and the latent space $x_{t}$ after every transition in the diffusion process. Note that the diffusion time $t$, shown on top, jumps up and down according to the following schedule: The jump length is $j=5$, and the number of resamplings is $r=9$. To focus more on the visually interesting part of the diffusion process, we set the number of diffusion steps to $T=100$ and start resampling below $t=50$.

H Experiment on larger resolution

As shown in Figure 14, our inpainting method also works with a pretrained model for $512\times 512$ resolution. However, we were not able to conduct our full analysis on that resolution due to limited computational resources.

I Additional Visual Results

We also provide additional visual examples for CelebA-HQ and ImageNet, comparing our approach to the same state-of-the-art methods as in the main paper. We show the results for Wide and Narrow masks in Figures 15 and 18, respectively, for the sparse masks “Super-Resolve $2\times$” and “Alternating Lines” in Figures 16 and 19, and for “Half” and “Expand” in Figures 17 and 20.