SmartBrush: Text and Shape Guided Object Inpainting with Diffusion Model

Shaoan Xie, Zhifei Zhang, Zhe Lin, Tobias Hinz, Kun Zhang

Introduction

Traditional image inpainting aims to fill the missing area in images conditioned on surrounding pixels, and thus offers no control over the inpainted content. To alleviate this, multi-modal image inpainting offers more control through additional information, e.g., class labels, text descriptions, or segmentation maps. In this paper, we consider the task of multi-modal object inpainting conditioned on both a text description and the shape of the object to be inpainted (see Fig. 1). In particular, we explore diffusion models for this task, motivated by their superior performance in modeling complex image distributions and generating high-quality images.

Diffusion models (DMs), e.g., Stable Diffusion, DALL-E, and Imagen, have shown promising results in text-to-image generation. They can also be adapted to the inpainting task by replacing the random noise in the background region with a noisy version of the original image during the reverse diffusion process. However, this leads to undesirable samples since the model cannot see the global context during sampling. To address this, GLIDE and Stable Inpainting (inpainting specialist v1.5 from Stable Diffusion) randomly erase part of the image and fine-tune the model to recover the missing area conditioned on the corresponding image caption. However, semantic misalignment between the missing area (local content) and the global text description may cause the model to fill the masked region with background instead of precisely following the text prompt, as shown in Fig. 1 (“Glide” and “Stable Inpainting”). We refer to this phenomenon as text misalignment.

An alternative way to perform multi-modal image inpainting is to utilize powerful language-vision models, e.g., CLIP. Blended Diffusion uses CLIP to compute the difference between the image embedding and the input text embedding and then injects the difference into the sampling process of a pretrained unconditional diffusion model. However, CLIP models tend to capture global and high-level image features, so there is no incentive to generate objects aligned with the given mask (see “Blended Diffusion” in Fig. 1). We denote this phenomenon as mask misalignment. Another issue for existing inpainting methods is background preservation: they often distort the background surrounding the inpainted object, as shown in Fig. 1 (bottom row).

To address the above challenges, we introduce a precision factor into the input masks, i.e., our model takes as input not only a mask but also information about how closely the inpainted object should follow the mask’s shape. To achieve this, we generate different types of masks from fine to coarse by applying Gaussian blur to accurate instance masks, and use the masks and their precision type to train the guided diffusion model. With this setup, users can either provide coarse masks, which will contain the desired object somewhere within the mask, or detailed masks that outline the shape of the object exactly. Thus, we can supply very accurate masks and the model will fill the entire mask with the object described by the text prompt (see the first row in Fig. 1), while, on the other hand, we can also provide very coarse masks (e.g., a bounding box) and the model is free to insert the desired object within the mask area such that the object is roughly bounded by the mask.

One important characteristic, especially for coarse masks such as bounding boxes, is that we want to keep the background within the inpainted area consistent with the original image. To achieve this, we not only encourage the model to inpaint the masked region but also use a regularization loss to encourage the model to predict an instance mask of the object it is generating.

At test time, we replace the coarse mask with the predicted mask during sampling to preserve the background as much as possible, which leads to more consistent results (second row in Fig. 1).

We evaluate our model on several challenging object inpainting tasks and show that it achieves state-of-the-art results on object inpainting across several datasets and examples. Our model offers more flexibility due to the mask precision control, which allows users to specify how closely they want the model to follow a given mask. Due to our foreground mask prediction during sampling, our model is much better at preserving the background within the inpainted areas than other baselines, leading to more realistic results, especially for coarser masks such as bounding boxes. Our user study shows that users prefer the outputs of our model over those of DALLE-2 and Stable Inpainting across several axes of evaluation, namely shape, text alignment, and realism. To summarize our contributions:

We introduce a text and shape guided object inpainting diffusion model, which is conditioned on object masks of different precision, achieving a new level of control for object inpainting.

To preserve the image background with coarse input masks, the model is trained to predict a foreground object mask during inpainting for preserving original background surrounding the synthesized object.

Instead of training with random masks and text captions that describe the entire images, we use instance segmentation masks and train our model with local text descriptions of the inpainted area.

We propose a multi-task training strategy by jointly training object inpainting with text-to-image generation to leverage more training data.

Related Work

Diffusion Models Diffusion models (DMs) learn the data distribution by inverting a Markov noising process, and they have gained wide attention recently due to their stability and superior performance in image synthesis as compared to GANs. Given a clean image $x_{0}$, the diffusion process adds noise to the image at each step $t$, obtaining a set of noisy latents $x_{t}$. Then, the model is trained to recover the clean image $x_{0}$ from $x_{t}$ in the backward process. DMs have shown appealing results in different tasks, e.g., unconditional image generation, text-to-image generation, video generation, image inpainting, image translation, and image editing.

Text-Guided Image Inpainting Taking advantage of the recent success of diffusion-based text-to-image generation models, an intuitive adaptation from text-to-image generation to text-guided inpainting is to replace the pure random noise with the noisy background outside the mask region. However, this leads to strong artifacts, e.g., generating partial objects or inconsistent content in the background. To address this problem, GLIDE further fine-tunes a pre-trained text-to-image model toward the inpainting task. It first generates a random mask and then provides the masked image and the mask as additional inputs to the diffusion model, which learns to utilize the information outside of the mask region. Blended Diffusion adapts a pre-trained unconditional diffusion model and encourages the output to align with the text prompt using the CLIP score. Repaint builds on a pre-trained unconditional diffusion model and proposes to resample in each reverse step, but it does not support text input. Some recent works also tackle image editing tasks, e.g., Prompt2Prompt allows partial modification of the original prompt such that the newly generated image is partially edited correspondingly, but it is difficult to control object shape and target regions, especially when the image content is complicated. DiffEdit follows the spirit of Prompt2Prompt but derives masks from the difference before and after modifying the prompt. PaintbyWord pairs a large-scale GAN with a full-text image retrieval network to enable multi-modal image editing. However, due to the structure of the GAN, it cannot specifically modify the region given by the mask. TDANet proposes a dual attention mechanism to exploit the text features about the masked region by comparing the text with the corrupted image and its counterpart.

Preliminary: Diffusion Model

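Given an input image $x_{0}$, we apply a forward diffusion Markov process to add noise to the image over a number of time steps $t$ with scheduled variance $\beta_{t}$:

$$q(x_{t}|x_{t-1})=\mathcal{N}\left(x_{t};\sqrt{1-\beta_{t}}\,x_{t-1},\beta_{t}\text{I}\right),\quad t=1,\dots,T,$$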

where $T$ is the total number of steps. If $T\rightarrow\infty$, the output $x_{T}$ will be isotropic Gaussian. The Markov process defined above allows us to obtain $x_{t}$ in closed form:

$$x_{t}=\sqrt{\bar{\alpha}_{t}}\,x_{0}+\sqrt{1-\bar{\alpha}_{t}}\,\epsilon_{t},$$

where $\alpha_{t}=1-\beta_{t}$, $\bar{\alpha}_{t}=\prod_{i=1}^{t}\alpha_{i}$, and $\epsilon_{t}\sim\mathcal{N}(0,\text{I})$.
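The closed form above can be illustrated with a minimal NumPy sketch. The linear $\beta_{t}$ schedule, its endpoints, and the helper names `make_schedule` and `q_sample` are assumptions made for illustration only; the text does not specify which schedule is used.

```python
import numpy as np

def make_schedule(T=1000, beta_start=1e-4, beta_end=0.02):
    """Linear variance schedule (an assumed choice) and its cumulative products."""
    betas = np.linspace(beta_start, beta_end, T)
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)  # alpha_bar_t = prod_{i<=t} alpha_i
    return betas, alphas, alpha_bars

def q_sample(x0, t, alpha_bars, rng):
    """Draw x_t directly from x_0 with the closed form; t is 1-indexed."""
    eps = rng.standard_normal(x0.shape)
    a_bar = alpha_bars[t - 1]
    x_t = np.sqrt(a_bar) * x0 + np.sqrt(1.0 - a_bar) * eps
    return x_t, eps

rng = np.random.default_rng(0)
betas, alphas, alpha_bars = make_schedule()
x0 = rng.uniform(-1.0, 1.0, size=(3, 8, 8))  # toy "image" in [-1, 1]
x_t, eps = q_sample(x0, t=500, alpha_bars=alpha_bars, rng=rng)
```

Because $\bar{\alpha}_{t}$ decreases monotonically toward zero, the signal term vanishes for large $t$ and $x_{t}$ approaches pure Gaussian noise.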

To generate images from random noise, we need to invert the above diffusion process, i.e., learn $q(x_{t-1}|x_{t})$, which is also Gaussian when $\beta_{t}$ is small enough. However, $q(x_{t-1}|x_{t})$ is unknown since it depends on the true distribution of $x_{0}$, which is inaccessible. Thus, we train a neural network $p_{\theta}$ to approximate the conditional distribution:

$$p_{\theta}(x_{t-1}|x_{t})=\mathcal{N}\left(x_{t-1};\mu_{\theta}(x_{t},t),\Sigma_{\theta}(x_{t},t)\right),$$

where $\mu_{\theta}$ is trained to predict $x_{t-1}=\frac{1}{\sqrt{\alpha_{t}}}\left(x_{t}-\frac{1-\alpha_{t}}{\sqrt{1-\bar{\alpha}_{t}}}\epsilon_{t}\right)$, which is derived from Sec. 3. Since we already have $x_{t}$ during training, we can train a network $\epsilon_{\theta}$ to predict $\epsilon_{t}$ instead of training $\mu_{\theta}$. We obtain the objective for training the diffusion model:

$$\mathcal{L}=\mathbb{E}_{x_{0},t,\epsilon_{t}}\left\|\epsilon_{t}-\epsilon_{\theta}(x_{t},t)\right\|_{2}^{2}.$$
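As a sanity check on the mean expression, the following sketch compares it numerically with the standard Gaussian posterior mean of $q(x_{t-1}|x_{t},x_{0})$. The schedule, the function names, and the use of the true noise in place of a network prediction are assumptions for illustration.

```python
import numpy as np

def mean_from_noise(x_t, eps, t, alphas, alpha_bars):
    """(x_t - (1 - alpha_t) / sqrt(1 - alpha_bar_t) * eps) / sqrt(alpha_t); t is 1-indexed."""
    a_t, ab_t = alphas[t - 1], alpha_bars[t - 1]
    return (x_t - (1.0 - a_t) / np.sqrt(1.0 - ab_t) * eps) / np.sqrt(a_t)

def noise_prediction_loss(eps_true, eps_pred):
    """Mean squared error between the true and the predicted noise."""
    return float(np.mean((eps_true - eps_pred) ** 2))

rng = np.random.default_rng(0)
betas = np.linspace(1e-4, 0.02, 1000)  # assumed linear schedule
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

t = 300
x0 = rng.standard_normal((4, 4))
eps = rng.standard_normal((4, 4))
x_t = np.sqrt(alpha_bars[t - 1]) * x0 + np.sqrt(1.0 - alpha_bars[t - 1]) * eps

# Mean obtained from the noise, as in the expression above.
mu = mean_from_noise(x_t, eps, t, alphas, alpha_bars)

# Posterior mean of q(x_{t-1} | x_t, x_0), written in terms of x_0 and x_t.
ab_prev = alpha_bars[t - 2]
mu_posterior = (
    np.sqrt(ab_prev) * betas[t - 1] * x0
    + np.sqrt(alphas[t - 1]) * (1.0 - ab_prev) * x_t
) / (1.0 - alpha_bars[t - 1])
```

The two expressions agree up to floating-point error, which is why predicting $\epsilon_{t}$ is sufficient to recover the mean of the reverse step.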

At test time, we start from random noise $x_{T}\sim\mathcal{N}(0,\text{I})$ and then iteratively apply the model $\epsilon_{\theta}$ to obtain $x_{t-1}$ from $x_{t}$ until $t=0$. We may employ more efficient sampling techniques such as DDIM and PNDM to speed up sampling, and adopt classifier-free guidance to improve the sample quality.
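A minimal sketch of this sampling loop with classifier-free guidance is given below. It uses plain ancestral (DDPM-style) steps with $\sigma_{t}^{2}=\beta_{t}$ rather than DDIM or PNDM, and `toy_eps_model` is a hypothetical placeholder for a trained network; both choices are assumptions made to keep the example self-contained.

```python
import numpy as np

def guided_eps(eps_cond, eps_uncond, scale):
    """Classifier-free guidance: move from the unconditional toward the conditional prediction."""
    return eps_uncond + scale * (eps_cond - eps_uncond)

def ddpm_step(x_t, eps_hat, t, betas, alphas, alpha_bars, rng):
    """One ancestral step x_t -> x_{t-1}; t is 1-indexed, sigma_t^2 = beta_t (assumed)."""
    a_t, ab_t = alphas[t - 1], alpha_bars[t - 1]
    mean = (x_t - (1.0 - a_t) / np.sqrt(1.0 - ab_t) * eps_hat) / np.sqrt(a_t)
    if t == 1:
        return mean  # no noise is added at the final step
    return mean + np.sqrt(betas[t - 1]) * rng.standard_normal(x_t.shape)

def sample(eps_model, shape, betas, scale, rng):
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    x = rng.standard_normal(shape)  # x_T ~ N(0, I)
    for t in range(len(betas), 0, -1):
        eps_hat = guided_eps(eps_model(x, t, cond=True), eps_model(x, t, cond=False), scale)
        x = ddpm_step(x, eps_hat, t, betas, alphas, alpha_bars, rng)
    return x

def toy_eps_model(x, t, cond):
    # Placeholder only: a real model would be a trained, text-conditioned network.
    return 0.1 * x if cond else np.zeros_like(x)

rng = np.random.default_rng(0)
betas = np.linspace(1e-4, 0.02, 50)
x0_hat = sample(toy_eps_model, (2, 4, 4), betas, scale=3.0, rng=rng)
```

A guidance scale of 1 recovers the conditional prediction, while larger scales push samples further toward the condition.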

As for conditional diffusion models, e.g., text-to-image and inpainting models, conditional information can be fed into the network $\epsilon_{\theta}$ without changing the loss function. The model will learn to utilize the conditions to generate high quality conditional images.

Our Approach

Existing inpainting models randomly erase part of the images and are trained to inpaint the erased region. As a result, the randomly erased region may contain only parts of an object or contain areas of background around a given object. Therefore, we propose to utilize the text and shape information from existing instance or panoptic segmentation datasets. These datasets contain annotated masks $\{m_{i}\}_{i=1}^{N}$ where $N$ is the number of annotations and each masked region $x\odot m_{i}$ contains only one object. For each mask we also have a corresponding class label $c_{i}$ , e.g., hat or cat.

In the forward process, we randomly draw a segmentation mask $m$ and its corresponding class text label $c$ for image $x$. We define $x_{0}=x$ and only add noise in the masked region instead of all pixels:

where $\epsilon\sim\mathcal{N}(\textbf{0},\textbf{I})$ and $t$ is the timestep in the forward process. We use $x_{t}$, $m$, and $c$ as input to the model so it can learn to utilize the clean background information and learn to recover the masked region $x_{0}\odot m$. This ensures that generated objects in the foreground $m$ are consistent with the background. Following the standard noise-prediction objective, we train a network $\epsilon_{\theta}$ to predict the noise $\epsilon$ from the noisy $x_{t}$:

In the inference phase, we generate random Gaussian noise in the masked region, $x_{T}=\epsilon\odot m+x_{0}\odot(1-m)$, where $T$ is the number of sampling steps. Then we reverse the diffusion process and obtain the inpainted result $x_{0}$.
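The masked noising described above can be sketched as follows. The training-time form, in which the closed-form noisy image is blended with the clean image via $m$, is an assumption inferred from the inference-time expression for $x_{T}$; the function names and the schedule are likewise illustrative.

```python
import numpy as np

def masked_q_sample(x0, mask, t, alpha_bars, rng):
    """Add noise only where mask == 1 and keep the clean background elsewhere (assumed form)."""
    eps = rng.standard_normal(x0.shape)
    a_bar = alpha_bars[t - 1]
    noisy = np.sqrt(a_bar) * x0 + np.sqrt(1.0 - a_bar) * eps
    x_t = noisy * mask + x0 * (1.0 - mask)
    return x_t, eps

def init_inference(x0, mask, rng):
    """x_T = eps * m + x_0 * (1 - m), as stated for the inference phase."""
    eps = rng.standard_normal(x0.shape)
    return eps * mask + x0 * (1.0 - mask)

rng = np.random.default_rng(0)
alpha_bars = np.cumprod(1.0 - np.linspace(1e-4, 0.02, 1000))  # assumed schedule

x0 = rng.uniform(-1.0, 1.0, size=(3, 8, 8))
mask = np.zeros((1, 8, 8))
mask[:, 2:6, 3:7] = 1.0  # object region, broadcast over channels

x_t, eps = masked_q_sample(x0, mask, t=800, alpha_bars=alpha_bars, rng=rng)
x_T = init_inference(x0, mask, rng)
```

In both cases the pixels outside the mask are exactly those of the original image, so the network always sees an uncorrupted background.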

4.2 Shape Precision Control

Our training masks come from the segmentation annotations and thus are accurate instance masks. Training the model with these masks will encourage the model to exactly follow the shape of the input mask at test time. To allow users to provide masks that are either accurate (e.g., in the shape of a cat) or coarse (e.g., a bounding box), we propose to generate masks with different precision. To achieve this, we randomly augment the masks during training to degrade the shape of the original mask. Specifically, given an accurate instance mask $m$, we use a mask precision indicator $s\sim[0,S]$ and define a set of parameters for each indicator:

where $k_{s}$ denotes the Gaussian kernel size and $\sigma_{s}$ is the standard deviation of the kernel. If $s=0$, the mask stays unchanged and corresponds to the accurate instance mask from the dataset annotation. When $s=S$, the mask $m_{s}$ is the bounding box of the instance mask $m$ and loses all detailed shape information. During training, for each training sample (object), we employ a set of masks $\{m_{s},s\}$ from fine to coarse and condition the diffusion model on the precision indicator $s$:

Through this, we can control whether the generated object should align with the input mask by specifying different mask precision indicators $s$ . We present a sample of masks in Fig. 5.
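The fine-to-coarse mask family can be sketched in NumPy as below. The specific kernel sizes and standard deviations, the threshold used to binarize the blurred mask, and the choice to take the union with the original mask are all assumptions for illustration; the text only states that Gaussian blur with parameters $(k_{s},\sigma_{s})$ is used and that $s=S$ yields the bounding box.

```python
import numpy as np

def gaussian_kernel1d(k, sigma):
    ax = np.arange(k) - (k - 1) / 2.0
    w = np.exp(-0.5 * (ax / sigma) ** 2)
    return w / w.sum()

def gaussian_blur(mask, k, sigma):
    """Separable Gaussian blur with zero padding (k must not exceed the image size)."""
    w = gaussian_kernel1d(k, sigma)
    rows = np.apply_along_axis(lambda r: np.convolve(r, w, mode="same"), 1, mask)
    return np.apply_along_axis(lambda c: np.convolve(c, w, mode="same"), 0, rows)

def bounding_box_mask(mask):
    ys, xs = np.nonzero(mask)
    box = np.zeros_like(mask)
    box[ys.min():ys.max() + 1, xs.min():xs.max() + 1] = 1.0
    return box

def coarsen_mask(mask, s, params, thresh=0.05):
    """s = 0: original mask; 0 < s < S: blurred and re-binarized; s = S: bounding box."""
    S = len(params) + 1
    if s == 0:
        return mask.copy()
    if s == S:
        return bounding_box_mask(mask)
    k, sigma = params[s - 1]
    blurred = gaussian_blur(mask, k, sigma)
    # Union with the original mask so the coarse mask always covers the object.
    return np.maximum(mask, (blurred > thresh).astype(mask.dtype))

# Toy elliptical "instance mask" on a 32 x 32 grid.
yy, xx = np.mgrid[:32, :32]
mask = ((((yy - 16) / 6.0) ** 2 + ((xx - 16) / 9.0) ** 2) <= 1.0).astype(float)

params = [(5, 1.0), (9, 2.0), (15, 4.0)]  # hypothetical (k_s, sigma_s) for s = 1, 2, 3
S = len(params) + 1
masks = [coarsen_mask(mask, s, params) for s in range(S + 1)]
```

During training, one would draw $s$ at random, feed the pair $(m_{s},s)$ to the network, and keep the accurate mask $m$ as the target shape.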

4.3 Background Preservation

During inference, the diffusion model will denoise the masked region and generate objects according to the given text prompt. As a result, the background in the masked region will be changed if the input masks are coarse. For example, the model may generate a cat in the given square box mask region, but the other pixels in the box will also be changed. Ideally, we would like to preserve the background. However, this is challenging since we do not know where in the coarse mask the model will generate the desired object.

We address this challenge by utilizing the information of mask precision. Specifically, we train our diffusion network to also predict an accurate instance mask $m$ from the coarse input version $m_{s}$ :

where $H$ can be any suitable criterion for segmentation. We choose to use the Dice loss, i.e., $H(X,Y)=1-\frac{2|X\cap Y|}{|X|+|Y|}$. For this, we simply add an extra output channel to our diffusion model, which contains the instance mask prediction.
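The Dice criterion can be written in a few lines of NumPy. The soft formulation (products and sums instead of set operations) and the small stabilizing constant are common practice and are assumed here, not taken from the text.

```python
import numpy as np

def dice_loss(pred, target, eps=1e-8):
    """H(X, Y) = 1 - 2|X n Y| / (|X| + |Y|), with |.| computed as a sum over pixels."""
    intersection = np.sum(pred * target)
    return 1.0 - 2.0 * intersection / (np.sum(pred) + np.sum(target) + eps)

target = np.zeros((4, 4))
target[0, :] = 1.0             # 4 object pixels in the first row

same = target.copy()           # perfect prediction
disjoint = np.zeros((4, 4))
disjoint[3, :] = 1.0           # no overlap with the target
half = np.zeros((4, 4))
half[0, :2] = 1.0
half[1, :2] = 1.0              # 4 predicted pixels, 2 of them correct
```

The loss is close to 0 for a perfect prediction, 1 for disjoint masks, and 0.5 for the half-overlapping case above.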

During inference, we are able to predict where the object is generated inside the coarse mask $m_{s}$ using the diffusion model’s prediction. We first feed a coarse mask $m_{s}$ into the diffusion model and then switch to using the predicted mask to perform denoising. With the predicted mask, we know where the object is generated within the masked region, which helps to preserve background information around the generated object.
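One possible reading of this procedure is sketched below. The step at which the switch happens, the 0.5 threshold, the intersection with the coarse mask, and the `denoise_step` interface are all assumptions; `toy_denoise_step` is a hypothetical stand-in for one reverse step of the trained model and its mask output channel.

```python
import numpy as np

def inpaint_with_mask_prediction(x_orig, coarse_mask, denoise_step, num_steps, rng,
                                 switch_after=1):
    """denoise_step(x, mask, t) -> (x_next, mask_pred) is a hypothetical model interface."""
    mask = coarse_mask
    x = rng.standard_normal(x_orig.shape) * mask + x_orig * (1.0 - mask)
    for i, t in enumerate(range(num_steps, 0, -1)):
        x_next, mask_pred = denoise_step(x, mask, t)
        if i + 1 >= switch_after:
            # Use the predicted object mask, restricted to the user's coarse mask.
            mask = (mask_pred > 0.5).astype(x_orig.dtype) * coarse_mask
        # Pixels outside the current mask are restored from the original image.
        x = x_next * mask + x_orig * (1.0 - mask)
    return x, mask

def toy_denoise_step(x, mask, t):
    # Placeholder only: always "predicts" a small object inside the box.
    pred = np.zeros((8, 8))
    pred[3:5, 3:6] = 1.0
    return 0.9 * x, pred

rng = np.random.default_rng(0)
x_orig = rng.uniform(-1.0, 1.0, size=(8, 8))
coarse = np.zeros((8, 8))
coarse[2:6, 2:7] = 1.0  # bounding-box style input mask

result, final_mask = inpaint_with_mask_prediction(
    x_orig, coarse, toy_denoise_step, num_steps=5, rng=rng
)
```

After the switch, every pixel of the box that the model does not claim as object is copied back from the original image, which is what preserves the background inside a coarse mask.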

4.4 Training Strategy

Combining Eqs. 8 and 9, our final training objective can be expressed as follows.

where $\lambda$ is a hyper-parameter which balances the two losses. In our experiments, we set $\lambda=0.01$.
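Assuming the total objective is the noise-prediction loss plus $\lambda$ times the mask-prediction loss, as the description of $\lambda$ suggests, a minimal sketch looks as follows. The function name, the mean-squared-error form of the diffusion term, and the soft Dice term are illustrative assumptions.

```python
import numpy as np

def total_loss(eps_true, eps_pred, mask_true, mask_pred, lam=0.01):
    """Weighted sum of a noise-prediction term and a Dice mask term (assumed form)."""
    l_dm = np.mean((eps_true - eps_pred) ** 2)
    intersection = np.sum(mask_pred * mask_true)
    l_seg = 1.0 - 2.0 * intersection / (np.sum(mask_pred) + np.sum(mask_true) + 1e-8)
    return float(l_dm + lam * l_seg), float(l_dm), float(l_seg)

eps_true = np.zeros((4, 4))
eps_pred = np.ones((4, 4))     # noise error of 1 per pixel
mask_true = np.zeros((4, 4))
mask_true[0, :] = 1.0
mask_pred = np.zeros((4, 4))
mask_pred[3, :] = 1.0          # completely wrong mask prediction

loss, l_dm, l_seg = total_loss(eps_true, eps_pred, mask_true, mask_pred)
```

With $\lambda=0.01$, even a completely wrong mask contributes only 0.01 to the total, so the mask term acts as a light regularizer on top of the diffusion loss.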

Our model can be built on pre-trained text-to-image generation models, e.g., Stable Diffusion and Imagen, to speed up the training process. In the experiments, we fine-tune the Stable Diffusion text-to-image model v1.2 with our conditions (Fig. 2) and loss function $\mathcal{L}_{\text{total}}$ (Eq. 10). To align text descriptions with the local mask content and thus avoid the text misalignment discussed above, we train on the training split of OpenImages v6, which provides segmentation masks and corresponding labels that can serve as local descriptions. In our empirical study, training only on such categorical text degrades the generation quality for long sentences. Therefore, we employ the BLIP model to collect richer and longer captions for those local segments. During training, we randomly pair either the segmentation label or the BLIP caption with the corresponding mask. As a result, the model can handle both single-word text and short phrases well during inference.

Multi-task Training In addition, to leverage more training data and handle more diverse text descriptions and image contents beyond the domain of the segmentation dataset, we propose a multi-task training strategy that jointly trains our main task and the foundational text-to-image generation task, using image/text paired data from the LAION-Aesthetics v2 5+ subset following Stable Diffusion. For text-to-image generation, we set the input mask to cover the entire image and treat it as a special case of inpainting. As demonstrated in Sec. 5, our final model trained with all these components significantly outperforms state-of-the-art methods in terms of the visual quality of generated objects, as well as their consistency with the text description and mask shape.

Experimental Evaluation

We set $\lambda=0.01$ in the total loss function (Eq. 10) and use a batch size of 1024. Following the training strategy discussed in Sec. 4.4, we train the inpainting task and the text-to-image generation task with probabilities of 80% and 20%, respectively. Our model was trained for around 20K steps on 8 A100 GPUs. For reference, Stable Inpainting was trained for around 440K steps on 256 A100 GPUs.
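The 80%/20% task mixing, together with the full-image mask used for text-to-image samples, can be sketched as below. Whether the task is drawn per sample or per batch is not stated, so per-draw sampling is an assumption, as are the function names.

```python
import numpy as np

def sample_task(rng, p_inpaint=0.8):
    """Pick the training task: inpainting with probability 0.8, text-to-image otherwise."""
    return "inpainting" if rng.random() < p_inpaint else "text_to_image"

def training_mask(task, instance_mask):
    """Text-to-image is treated as inpainting with a mask that covers the entire image."""
    if task == "text_to_image":
        return np.ones_like(instance_mask)
    return instance_mask

rng = np.random.default_rng(0)
instance_mask = np.zeros((4, 4))
instance_mask[1:3, 1:3] = 1.0

task = sample_task(rng)
mask = training_mask(task, instance_mask)
```

Because a full mask means that every pixel is noised, the same network and loss cover both tasks without any architectural change.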

Baselines We choose state-of-the-art image inpainting methods as our baselines, i.e., Blended Diffusion, GLIDE, Stable Diffusion, and Stable Inpainting. We also compare with DALLE-2 on a limited number of images since its model is not yet open source. Stable Diffusion, Stable Inpainting, and our SmartBrush support image generation at a resolution of 512 $\times$ 512. Since Blended Diffusion and GLIDE only support images of size 256 $\times$ 256, we resize all results to 256 $\times$ 256 for a fair comparison.

Testing Datasets We evaluate our model on two popular segmentation datasets, i.e., OpenImages and MSCOCO. We sample 2 masks for each image in the testing set of MSCOCO, so the number of testing images is 9311. For OpenImages, we sample images with resolution higher than 512 and use one mask for each image, which gives 13400 testing images. The input prompts are taken directly from the segmentation class labels.

Evaluation Metrics We first measure the image quality by Fréchet Inception Distance (FID). Since our main task is object generation in the masked region, the global FID cannot reflect the generation quality well, as the masked region may occupy only a small part of the image. Therefore, we crop the images according to the bounding box of the mask and measure FID on the local regions, which is referred to as “Local FID”. To measure the alignment between text and generated content, we adopt the CLIP score.
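The cropping step behind Local FID can be sketched as follows. Only the crop is shown; the FID computation itself would be delegated to a standard implementation, and any resizing of the crops before feature extraction is not specified in the text. The function name and array layout are assumptions.

```python
import numpy as np

def crop_to_mask_bbox(image, mask):
    """Crop an (H, W, C) image to the bounding box of a binary (H, W) mask."""
    ys, xs = np.nonzero(mask)
    return image[ys.min():ys.max() + 1, xs.min():xs.max() + 1]

rng = np.random.default_rng(0)
image = rng.uniform(0.0, 1.0, size=(16, 16, 3))
mask = np.zeros((16, 16))
mask[4:10, 2:13] = 1.0  # object occupies rows 4..9 and columns 2..12

local_patch = crop_to_mask_bbox(image, mask)
```

Applying the same crop to the real and the generated image of each test pair yields the two patch sets on which the local statistic is computed.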

5.2 Text and Shape Guided Inpainting

The proposed SmartBrush can inpaint not only objects but also generic scenes, such as a sunset sky, by following the text and shape guidance. For object inpainting, we consider two common use cases: 1) accurate object masks and 2) bounding box masks. The former expects the generated object to follow the given mask shape, while the latter does not constrain the shape of generated objects as long as they are inside the box. The corresponding quantitative results are listed in Tabs. 1 and 2.

As a strong baseline, Stable Inpainting presents lower CLIP scores than ours, which suggests that random masking is not an optimal training strategy for text-guided inpainting. Blended Diffusion achieves a relatively high CLIP score but lags far behind in FID, since the CLIP model focuses on the global content instead of local objects. By contrast, our SmartBrush achieves the best performance in both tasks on all metrics, which demonstrates the effectiveness of our proposed training strategy with text and shape guidance.

Fig. 3 visualizes inpainting examples from the baselines and our SmartBrush. In general, our model generates high-quality objects and scenes that follow both the mask shape and the text well, whether the text is a short word or a long sentence. By contrast, all baselines fail to follow the mask shape. Blended Diffusion and GLIDE cannot even generate decent objects given these local text descriptions. Stable Diffusion, Stable Inpainting, and DALLE-2 perform better but have a high chance of misunderstanding the text due to text misalignment.

Besides object inpainting, our SmartBrush also supports scene inpainting as illustrated by the last two rows in Fig. 3. More examples can be found in the supplementary. Still, as compared to our SmartBrush, it is difficult for existing inpainting models to follow the mask shape.

We also conduct user studies through Amazon Mechanical Turk. Over 300 workers were asked 1) which result follows the object mask best, 2) which result follows the input text description best, and 3) which result looks most natural/realistic. The survey result is shown in Fig. 4, where more than 50% of users voted our results as the best on each question.

5.3 Mask Precision Control

In the real world, users will not always provide a precise mask of the object they want to inpaint. Since the mask may be coarse, SmartBrush lets users control how closely the inpainted object follows the given mask. Fig. 5 shows the results with different types of masks, which follow the blurring rule used during training, i.e., applying Gaussian blur iteratively to obtain masks from fine to coarse. The Stable Diffusion results are not affected by mask types since it is not trained that way. The results of Stable Inpainting only change the object size with the mask size but do not follow the mask shape. By contrast, ours strictly follow the mask shape when given a finer mask, while roughly following the mask when given a coarser one. In the extreme case of a box-like mask (the last column), we allow the generation to happen anywhere inside the box.

5.4 Background Preservation

To inpaint an object, especially when given a box-like mask, it is important to preserve the background since the inpainted object will only partially occupy the mask area. Fig. 6 compares different methods in background preservation when given box-like masks. Without any background preservation regularization, DALLE-2 generates objects inside the mask and changes the non-object pixels inside the mask. Our SmartBrush, with object mask prediction (shown in Fig. 7), preserves the background much better by utilizing the predicted mask during sampling.

Conclusion, Limitation, and Future Work

Existing text and shape guided image inpainting models face three typical challenges: mask misalignment, text misalignment, and background preservation. In this paper, we propose a novel training method that utilizes the text and shape guidance from the segmentation dataset to address the text misalignment problem. Then we further propose to create different levels of masks (from fine to coarse) to allow precision control of the generation. Finally, we propose an additional training loss function to encourage the model to make object predictions from the input box mask. Then we can utilize the predicted mask to avoid unnecessary changes inside the mask. The quantitative and qualitative results demonstrate the superiority of our method.

The main limitation of our method is the large shadow case, where the shadow of the object exceeds the object mask, e.g., the shadow of a person can be very long in the morning while the bounding box usually fails to cover the whole shadow. Our method may not be able to generate such a long shadow since the coarsest mask is the object bounding box. We will explore this in the near future.