ILVR: Conditioning Method for Denoising Diffusion Probabilistic Models

Jooyoung Choi, Sungwon Kim, Yonghyun Jeong, Youngjune Gwon, Sungroh Yoon

Introduction

Generative models, such as generative adversarial networks (GANs), normalizing flows, and variational autoencoders, have shown remarkable quality in image generation and have been applied to numerous tasks such as image-to-image translation and image editing.

There are two main approaches to controlling generative models so that they generate images as desired: one is to design a conditional generative model for the desired purpose, and the other is to leverage well-performing unconditional generative models.

The first approach learns to control generation by providing the desired condition during training and has shown remarkable performance on various tasks, such as segmentation-mask-conditioned generation, style transfer, and inpainting. The second approach utilizes high-quality generative models, such as StyleGAN or BigGAN. Shen et al. and Härkönen et al. manipulate semantic attributes of images by analyzing the latent space of pre-trained generative models, while Huh et al. and Zhu et al. perform image editing by projecting an image into the latent space.

Denoising diffusion probabilistic models (DDPM), a class of iterative generative models, have shown performance comparable to state-of-the-art models in unconditional image generation. DDPM learns to model the Markov transition from a simple distribution to the data distribution and generates diverse samples through sequential stochastic transitions. Samples obtained from DDPM depend on the initial state drawn from the simple distribution and on each transition. However, it is challenging to control DDPM to generate images with desired semantics, since the stochasticity of the transitions produces images with inconsistent high-level semantics, even from the same initial state.

In this work, we propose a learning-free method, iterative latent variable refinement (ILVR), to condition the generation process of a well-performing unconditional DDPM. Each transition in the generation process is refined using a given reference image. By matching each latent variable, ILVR enforces the given condition in each transition and thus enables sampling from a conditional distribution. As a result, ILVR generates high-quality images that share the desired semantics.

We describe the user controllability of our method, which enables control over the semantic similarity of generated images to the reference. Fig. 1(a) and Fig. 4 show samples sharing semantics ranging from coarse to fine information. In addition, reference images can be selected from unseen data domains. These properties motivated us to apply an unconditional DDPM trained on a single data domain to multi-domain image translation, a challenging task for which existing works had to train on multiple data domains. Furthermore, we extend our method to paint-to-image and editing with scribbles (Fig. 1(c) and (d)). We demonstrate that ILVR enables a single unconditional DDPM to be used for these various tasks without any additional learning or models. Measuring Fréchet Inception Distance (FID) and Learned Perceptual Image Patch Similarity (LPIPS), we confirm that generation with various downsampling factors provides control over diversity while maintaining visual quality.

Our paper makes the following contributions:

We propose ILVR, a method of refining each transition in the generative process by matching each latent variable with a given reference image.

We investigate several properties that allow user controllability over semantic similarity to the reference.

We demonstrate that our ILVR enables leveraging unconditional DDPM in various image generation tasks including multi-domain image translation, paint-to-image, and editing with scribbles.

Background

Denoising diffusion probabilistic models (DDPM) are a class of generative models that show superior performance in unconditional image generation. DDPM learns a Markov chain that gradually converts a simple distribution, such as an isotropic Gaussian, into the data distribution. The generative process learns the reverse of DDPM's forward (diffusion) process, a fixed Markov chain that gradually adds noise to data while sequentially sampling latent variables $x_1, \dots, x_T$ of the same dimensionality. Here, each step in the forward process is a Gaussian transition.
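
In the standard DDPM formulation (Ho et al.), which matches the symbols defined next, this forward step can be written as follows. The form is reconstructed from that formulation, and the equation number is taken from the later references to Eq. 1:

```latex
q(x_t \mid x_{t-1}) := \mathcal{N}\!\left(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t \mathbf{I}\right) \tag{1}
```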

where $\beta_1, \dots, \beta_T$ is a fixed variance schedule rather than learned parameters. Eq. 1 describes obtaining $x_t$ by adding a small amount of Gaussian noise to the latent variable $x_{t-1}$. Given clean data $x_0$, sampling of $x_t$ is expressed in closed form:
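
Assuming the standard DDPM closed form, composing the Gaussian steps of the forward process gives the marginal below (a reconstruction consistent with the definitions of $\alpha_t$ and $\overline{\alpha}_t$ that follow, and with the appendix reference to Eq. 2):

```latex
q(x_t \mid x_0) := \mathcal{N}\!\left(x_t;\ \sqrt{\overline{\alpha}_t}\,x_0,\ (1-\overline{\alpha}_t)\,\mathbf{I}\right) \tag{2}
```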

where $\alpha_t := 1 - \beta_t$ and $\overline{\alpha}_t := \prod_{s=1}^{t} \alpha_s$. Therefore, $x_t$ can be expressed as a linear combination of $x_0$ and $\epsilon$:
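
Reparameterizing the Gaussian marginal $q(x_t \mid x_0)$ with mean $\sqrt{\overline{\alpha}_t}\,x_0$ and variance $(1-\overline{\alpha}_t)\mathbf{I}$ gives the linear combination, reconstructed here from the standard DDPM formulation:

```latex
x_t = \sqrt{\overline{\alpha}_t}\,x_0 + \sqrt{1-\overline{\alpha}_t}\,\epsilon \tag{3}
```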

where $\epsilon \sim N(0, \mathbf{I})$ has the same dimensionality as the data $x_0$ and the latent variables $x_1, \dots, x_T$.
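
The closed-form sampling of $x_t$ from $x_0$ can be illustrated with a minimal NumPy sketch. The schedule endpoints (`1e-4`, `2e-2`), the number of steps, and the function names `linear_beta_schedule` and `q_sample` are assumptions made for illustration; they are not taken from the paper's code.

```python
import numpy as np

def linear_beta_schedule(T=1000, beta_start=1e-4, beta_end=2e-2):
    """Fixed linear variance schedule beta_1..beta_T (endpoint values are assumed)."""
    return np.linspace(beta_start, beta_end, T)

def q_sample(x0, t, alpha_bar, rng):
    """Draw x_t ~ q(x_t | x_0) in closed form; t is 1-indexed."""
    eps = rng.standard_normal(x0.shape)
    a = alpha_bar[t - 1]
    x_t = np.sqrt(a) * x0 + np.sqrt(1.0 - a) * eps
    return x_t, eps

betas = linear_beta_schedule()
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)  # cumulative product alpha_1 * ... * alpha_t

rng = np.random.default_rng(0)
x0 = rng.uniform(-1.0, 1.0, size=(3, 8, 8))  # toy "image" scaled to [-1, 1]
x_t, eps = q_sample(x0, t=500, alpha_bar=alpha_bar, rng=rng)
```

Because $\overline{\alpha}_t$ decreases monotonically toward zero under this schedule, late latents are dominated by the noise term, while early latents stay close to $x_0$.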

Since the reverse of the forward process, $q(x_{t-1}|x_t)$, is intractable, DDPM learns parameterized Gaussian transitions $p_\theta(x_{t-1}|x_t)$. The generative (or reverse) process has the same functional form as the forward process and is expressed as a Gaussian transition with learned mean and fixed variance:
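
Following the standard DDPM parameterization, this transition can be written as below. The symbol $\sigma_t^2$ for the fixed variance is assumed from Ho et al.; the equation number follows the later references to Eq. 4:

```latex
p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\!\left(x_{t-1};\ \mu_\theta(x_t, t),\ \sigma_t^2 \mathbf{I}\right) \tag{4}
```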

Further, by decomposing $\mu_\theta$ into a linear combination of $x_t$ and the noise approximator $\epsilon_\theta$, the generative process is expressed as:
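
In the standard DDPM sampler this update takes the following form, reconstructed here to be consistent with the definitions of $\alpha_t$, $\overline{\alpha}_t$, and $\epsilon_\theta$ above (with $\sigma_t$ denoting the fixed standard deviation of the transition, an assumed symbol):

```latex
x_{t-1} = \frac{1}{\sqrt{\alpha_t}} \left( x_t - \frac{1-\alpha_t}{\sqrt{1-\overline{\alpha}_t}}\,\epsilon_\theta(x_t, t) \right) + \sigma_t \mathbf{z} \tag{5}
```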

where $\mathbf{z} \sim N(0, \mathbf{I})$, which shows that each generation step is stochastic. The accumulation of many stochastic steps makes the DDPM generative process difficult to control. Here $\epsilon_\theta$ is a neural network with the same input and output dimensions, and the noise it predicts at each step is used for the denoising process in Eq. 5.
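
To make the stochasticity of a single generation step concrete, the sketch below applies one reverse step twice from the same $x_t$ with different noise draws. It assumes the common choice $\sigma_t^2 = \beta_t$, and the noise predictor is a zero-returning placeholder standing in for a trained network; `p_sample_step` and `dummy_eps_model` are hypothetical names.

```python
import numpy as np

def p_sample_step(x_t, t, eps_model, betas, alpha_bar, rng):
    """One reverse step x_t -> x_{t-1}; t is 1-indexed."""
    beta_t = betas[t - 1]
    alpha_t = 1.0 - beta_t
    eps_hat = eps_model(x_t, t)
    mean = (x_t - beta_t / np.sqrt(1.0 - alpha_bar[t - 1]) * eps_hat) / np.sqrt(alpha_t)
    if t == 1:
        return mean  # no noise is added at the final step
    z = rng.standard_normal(x_t.shape)
    return mean + np.sqrt(beta_t) * z  # assumes sigma_t^2 = beta_t

def dummy_eps_model(x, t):
    """Placeholder for a trained noise predictor (same input and output shape)."""
    return np.zeros_like(x)

betas = np.linspace(1e-4, 2e-2, 1000)  # assumed linear schedule
alpha_bar = np.cumprod(1.0 - betas)

x_t = np.ones((3, 4, 4))
x_prev_a = p_sample_step(x_t, 500, dummy_eps_model, betas, alpha_bar, np.random.default_rng(1))
x_prev_b = p_sample_step(x_t, 500, dummy_eps_model, betas, alpha_bar, np.random.default_rng(2))
same_start_same_result = np.allclose(x_prev_a, x_prev_b)
```

Here `same_start_same_result` is `False`: identical inputs yield different $x_{t-1}$ because the noise $\mathbf{z}$ differs between the two calls, and this divergence compounds over many steps.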

Method

Leveraging the capabilities of DDPM, we propose a method of controlling unconditional DDPM. We introduce our method, Iterative Latent Variable Refinement (ILVR), in Section 3.1. Section 3.2 investigates several properties of ILVR, which motivate control of two factors: the downsampling factor and the conditioning range.

In this section, we introduce Iterative Latent Variable Refinement (ILVR), a method of conditioning the generative process of the unconditional DDPM model to generate images that share high-level semantics with given reference images. For this purpose, we sample images from the conditional distribution $p(x_0|c)$ with the condition $c$:
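
Assuming the conditional chain factorizes in the same way as the unconditional DDPM, this distribution can be written as follows (a reconstruction from the surrounding text, corresponding to what is later cited as Eq. 6):

```latex
p_\theta(x_0 \mid c) = \int p_\theta(x_{0:T} \mid c)\, dx_{1:T}, \qquad
p_\theta(x_{0:T} \mid c) = p(x_T) \prod_{t=1}^{T} p_\theta(x_{t-1} \mid x_t, c) \tag{6}
```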

Each transition $p_\theta(x_{t-1}|x_t, c)$ of the generative process depends on the condition $c$. However, the unconditionally trained DDPM represents the unconditional transition $p_\theta(x_{t-1}|x_t)$ of Eq. 4. ILVR provides the condition $c$ to the unconditional transition $p_\theta(x_{t-1}|x_t)$ without additional learning or models. Specifically, we refine each unconditional transition with a downsampled reference image.

Let $\phi_N(\cdot)$ denote a linear low-pass filtering operation, a sequence of downsampling and upsampling by a factor of $N$ that therefore maintains the dimensionality of the image. Given a reference image $y$, the condition $c$ is that the downsampled image $\phi_N(x_0)$ of the generated image $x_0$ equals $\phi_N(y)$.
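
A simple stand-in for $\phi_N$ helps make its two key properties (linearity and preserved dimensionality) concrete. The sketch below uses average pooling followed by nearest-neighbour upsampling, which is an assumption chosen for brevity; it is not the bicubic resizing used in the paper, and `low_pass` is a hypothetical name.

```python
import numpy as np

def low_pass(x, N):
    """Stand-in for phi_N: average-pool by N, then nearest-neighbour upsample by N.

    x has shape (C, H, W), with H and W divisible by N. The output has the
    same shape as the input.
    """
    C, H, W = x.shape
    down = x.reshape(C, H // N, N, W // N, N).mean(axis=(2, 4))
    return np.repeat(np.repeat(down, N, axis=1), N, axis=2)

rng = np.random.default_rng(0)
a = rng.standard_normal((3, 16, 16))
b = rng.standard_normal((3, 16, 16))

same_shape = low_pass(a, 4).shape == a.shape
is_linear = np.allclose(low_pass(2 * a + 3 * b, 4), 2 * low_pass(a, 4) + 3 * low_pass(b, 4))
is_idempotent = np.allclose(low_pass(low_pass(a, 4), 4), low_pass(a, 4))
```

This particular filter is also idempotent (applying it twice changes nothing), a property that bicubic resizing satisfies only approximately.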

Utilizing the forward process $q(x_t|x_0)$ of Eq. 3 and the linearity of $\phi_N$, each Markov transition under the condition $c$ is approximated as follows:
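
Based on the local condition described in the surrounding text, the approximation can be sketched as below (a reconstruction; the equation number follows the later references to Eq. 7):

```latex
p_\theta(x_{t-1} \mid x_t, c) \approx p_\theta\!\left(x_{t-1} \mid x_t,\ \phi_N(x_{t-1}) = \phi_N(y_{t-1})\right) \tag{7}
```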

where $y_t$ can be sampled following Eq. 3. The condition $c$ in each transition from $x_t$ to $x_{t-1}$ can be replaced with a local condition, wherein the latent variable $x_{t-1}$ and the corrupted reference $y_{t-1}$ share low-frequency content. To ensure the local condition in each transition, we first use DDPM to compute the unconditional proposal distribution of $x'_{t-1}$ from $x_t$. Then, since the operation $\phi$ maintains dimensionality, we refine the proposal distribution by matching $\phi(x'_{t-1})$ of the proposal $x'_{t-1}$ with that of $y_{t-1}$ as follows:
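
Written out, the proposal and refinement described here can be sketched as follows, where $I$ denotes the identity operator (a reconstruction from the description above; the equation number follows the later references to Eq. 8):

```latex
x'_{t-1} \sim p_\theta(x'_{t-1} \mid x_t), \qquad
x_{t-1} = \phi(y_{t-1}) + (I - \phi)(x'_{t-1}) \tag{8}
```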

By matching latent variables following Eq. 8, ILVR ensures the local condition in Eq. 7 and thus enables conditional generation with unconditional DDPM. Fig. 2 and Algorithm 1 illustrate ILVR. Although we approximate the conditional transition with a simple modification of the unconditional proposal distribution, Fig. 1(a) and Fig. 4 show diverse, high-quality samples sharing the semantics of the references.
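
The full sampling loop described above can be sketched in a few lines of NumPy. This is an illustrative sketch under several assumptions, not the authors' implementation: the low-pass filter is average pooling plus nearest-neighbour upsampling rather than bicubic resizing, $\sigma_t^2 = \beta_t$, the schedule is a short linear one, and the noise predictor is a zero-returning placeholder. The names `low_pass`, `ilvr_sample`, and `dummy_eps_model` are hypothetical.

```python
import numpy as np

def low_pass(x, N):
    """Stand-in for phi_N: average-pool by N, then nearest-neighbour upsample."""
    C, H, W = x.shape
    down = x.reshape(C, H // N, N, W // N, N).mean(axis=(2, 4))
    return np.repeat(np.repeat(down, N, axis=1), N, axis=2)

def ilvr_sample(y, eps_model, betas, N, rng):
    """Sample x_0 whose low-frequency content follows the reference y."""
    alphas = 1.0 - betas
    alpha_bar = np.cumprod(alphas)
    T = len(betas)
    x = rng.standard_normal(y.shape)  # x_T ~ N(0, I)
    for t in range(T, 0, -1):
        # Unconditional proposal x'_{t-1} ~ p_theta(x_{t-1} | x_t)
        eps_hat = eps_model(x, t)
        mean = (x - betas[t - 1] / np.sqrt(1.0 - alpha_bar[t - 1]) * eps_hat) / np.sqrt(alphas[t - 1])
        noise = rng.standard_normal(y.shape) if t > 1 else 0.0
        proposal = mean + np.sqrt(betas[t - 1]) * noise
        # Corrupted reference y_{t-1} ~ q(y_{t-1} | y); at t = 1 this is y itself
        if t > 1:
            a = alpha_bar[t - 2]
            y_prev = np.sqrt(a) * y + np.sqrt(1.0 - a) * rng.standard_normal(y.shape)
        else:
            y_prev = y
        # Refinement: replace the low-frequency band of the proposal with that of y_{t-1}
        x = proposal + low_pass(y_prev, N) - low_pass(proposal, N)
    return x

def dummy_eps_model(x, t):
    """Placeholder for a trained noise predictor."""
    return np.zeros_like(x)

rng = np.random.default_rng(0)
y = rng.uniform(-1.0, 1.0, size=(3, 16, 16))  # toy reference image
betas = np.linspace(1e-4, 2e-2, 50)           # short assumed schedule
x0 = ilvr_sample(y, dummy_eps_model, betas, N=4, rng=rng)
low_freq_matches = np.allclose(low_pass(x0, 4), low_pass(y, 4))
```

With this idempotent filter the final sample matches the reference exactly in the low-pass band, while its high-frequency content comes entirely from the generative process.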

3.2 Reference selection and user controllability

Let $\mu$ be the set of images that an unconditional DDPM can generate. Our method enables sampling from a conditional distribution with a given reference image $y$. In other words, we sample images from a subset of $\mu$, which is directed by the reference image.

To extend our method to various applications, we investigate 1) the minimum requirement on reference image selection and 2) user controllability over the reference-directed subset, which defines semantic similarity to the reference. To provide an intuition for reference selection and control, we investigate several properties. Fig. 3 visualizes how ILVR guides each generation step toward the subset directed by the reference.

We consider a range of conditioning steps by extending the above notation:
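
A plausible way to write this is sketched below, assuming the reference-directed subset is defined by low-pass agreement with $y$, with $R_N(y)$ denoting the fully conditioned case. Both definitions are reconstructions based on how the notation is used in the rest of the text:

```latex
R_N(y) = \{\, x : \phi_N(x) = \phi_N(y) \,\}, \qquad
R_{N,(a,b)}(y) = \{\, x : \phi_N(x_t) = \phi_N(y_t),\ t \in [b, a] \,\}
```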

where $R_{N,(a,b)}(y)$ represents the distribution of images obtained by matching latent variables (line 9 of Alg. 1) in steps $b$ to $a$. We now discuss several properties of reference selection and subset control.

Property 1. The reference image can be any image selected from the set:
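
Read together with the explanation that follows, this set can be sketched as below, where the symbol $Y$ is a name assumed here for illustration and the equation number follows the later references to Eq. 11:

```latex
Y = \{\, y : \phi_N(y) = \phi_N(x),\ x \in \mu \,\} \tag{11}
```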

the reference image only needs to match the low-resolution space of the learned data distribution. Even reference images from unseen data domains are possible. Thus, we can select a reference from an unseen data domain and perform multi-domain image translation, as demonstrated in Section 4.2.

Property 2. Considering downsampling factors $N$ and $M$ where $N \leq M$,
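
the relation implied by the surrounding discussion is a nesting of the reference-directed subsets, sketched here under the assumption that $R_N$ denotes the subset for factor $N$ (the equation number follows the later references to Eq. 12):

```latex
R_N \subset R_M \subset \mu \tag{12}
```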

which suggests that higher factors correspond to broader image subsets.

As a higher factor $N$ enables sampling from a broader set of images, sampled images are more diverse and exhibit lower semantic similarity to the reference. In Fig. 4, perceptual similarity to the reference image is controlled by the downsampling factor. Samples obtained with a higher factor $N$ share coarse features of the reference, while samples obtained with a lower $N$ also share finer features. Note that since $R_N$ is a subset of $\mu$, our sampling method maintains the sample quality of unconditional DDPM.
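
The nesting of subsets can be checked numerically for a simple filter. The sketch below assumes a box filter (average pooling plus nearest-neighbour upsampling) with $N$ dividing $M$, for which the inclusion holds exactly; for the bicubic resizing used in the paper it should be read as an approximation. `low_pass` is a hypothetical helper.

```python
import numpy as np

def low_pass(x, N):
    """Stand-in for phi_N: average-pool by N, then nearest-neighbour upsample."""
    C, H, W = x.shape
    down = x.reshape(C, H // N, N, W // N, N).mean(axis=(2, 4))
    return np.repeat(np.repeat(down, N, axis=1), N, axis=2)

rng = np.random.default_rng(0)
y = rng.standard_normal((3, 32, 32))  # toy reference
h = rng.standard_normal((3, 32, 32))  # arbitrary perturbation
N, M = 4, 8

# x agrees with y under phi_N, so it lies in the subset for factor N ...
x = y + h - low_pass(h, N)
in_R_N = np.allclose(low_pass(x, N), low_pass(y, N))
# ... and therefore also in the broader subset for factor M.
in_R_M = np.allclose(low_pass(x, M), low_pass(y, M))

# The converse fails: agreeing under phi_M does not imply agreeing under phi_N.
x2 = y + h - low_pass(h, M)
x2_in_R_M = np.allclose(low_pass(x2, M), low_pass(y, M))
x2_in_R_N = np.allclose(low_pass(x2, N), low_pass(y, N))
```

Here `x2` belongs to the subset for $M$ but not to the one for $N$, which is the sense in which a larger factor admits a broader, more diverse set of samples.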

Property 3. Limiting the range of conditioning steps enables sampling from a broader subset, while sampling from learned image distribution is still guaranteed.
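
In set notation, and assuming conditioning is applied only from step $T$ down to a hypothetical step $k$, this property can be sketched as follows (a reconstruction; the equation number follows the later references to Eq. 13):

```latex
R_N \subset R_{N,(T,k)} \subset \mu \tag{13}
```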

Fig. 5 shows the tendency of generated images when gradually limiting the range of conditioned steps. Compared to changing downsampling factors, changing conditioning range has a fine-grained influence on sample diversity.

Experiments and Applications

As discussed previously, ILVR generates high-quality images and allows control over semantic similarity to the reference. We first show qualitative results of controlled generation in Section 4.1. Then we demonstrate ILVR on various image generation tasks in Sections 4.2, 4.3, and 4.4. Quantitative evaluations of the visual quality and diversity of ILVR are presented in Section 4.5.

We trained DDPM models on the FFHQ, MetFaces, AFHQ, LSUN-Church, and Places365 datasets to demonstrate applicability to various tasks. We used a correctly implemented resizing library for the operation $\phi_N$. Reference face images were collected from the web and were unseen during training. See the supplementary materials for details on implementation and evaluation.

Semantic similarity to the reference varies with the downsampling factor $N$ and the conditioning step range $[b, a]$. In Fig. 4, images are generated from the reference image downsampled by various factors. As the factor $N$ increases, samples become more diverse and perceptually less similar to the reference, as stated in Eq. 12. For example, samples obtained with $N=8$ differ from the references in fine details (e.g., hair curls, eye color, earrings), while samples obtained with $N=64$ share only coarse features (e.g., color scheme) with the reference. This user controllability over similarity to the reference supports learning-free adaptation of a single pre-trained model to various tasks, as described subsequently.

In addition to the models we reproduced, we also utilize the publicly available guided-diffusion, a recent state-of-the-art DDPM. Fig. 9 shows samples generated with unconditional models trained on LSUN datasets. Samples share either coarse ($N=64$) or fine ($N=16$) features with the references. These results suggest that our method can be applied to any unconditional DDPM without retraining.

Fig. 5 shows samples generated with varying ranges of conditioning steps. Here, a narrower range allows image sampling from a broader subset following Eq. 13, resulting in more diverse images. When conditioning on fewer than 500 steps, facial features differ from the references. The downsampling factor and the conditioning range both provide user controllability, where the latter offers finer control over sample diversity.

4.2 Multi-Domain Image Translation

Image-to-image translation aims to learn the mapping between two visual domains. More specifically, generated images need to take on the texture of the target domain while preserving the structure of the input images. ILVR performs this task by matching the coarse information in reference images. We chose $N=32$ to preserve the coarse structure of the reference.

The first two rows in Fig. 6 show samples generated with a DDPM trained on the FFHQ dataset, which contains high-quality photos of human faces. Samples from portrait references show successful translation into photo-realistic faces. We also generated portraits from photos with a DDPM trained on METFACES, a dataset of face portraits. Here, ILVR generates diverse samples, whereas some existing image translation models fail to produce stochastic samples.

Generally, image translation models, including multi-domain translation models, learn translation between different domains. Thus they can only translate from domains learned in the training phase. However, ILVR requires only a single model trained on the target domain. Therefore ILVR enables image translation from unseen source domains, with reference images from the low-resolution space of the learned dataset, as suggested in Eq. 11. A quantitative comparison to existing translation models is presented in the supplementary materials.

With a DDPM trained on AFHQ-dog, we translated images of dogs, cats, and wildlife animals from the validation set. The fourth to sixth rows of Fig. 6 show the results. The DDPM trained only on dog images translates unseen cat and wildlife images well into dog images.

4.3 Paint-to-Image

Paint-to-image is the task of transferring unnatural paintings into photo-realistic images. We validate our extension on this task using a model trained on the waterfall category from Places365.

As shown in Fig. 7, clip art, oil paintings, and watercolors are well translated into photo-realistic images. Paintings and photo-realistic images differ in detailed texture. We chose a factor of $N=64$ to preserve only the coarse aspects (e.g., color scheme) of the reference. From Eq. 11, we can infer that the given paintings share coarse features of the learned dataset.

4.4 Editing with Scribbles

We extend our method to editing with user scribbles, an application that was also presented in Image2StyleGAN++. We generated samples with DDPMs trained on LSUN-Church and FFHQ. We added scribbles to reference images from the validation set. The scribbled images are then provided as references with factor $N=8$ on time steps from 1000 to 200, in order to both maintain details of the original images and harmonize the scribbles. Interesting samples are shown in Fig. 8. In the second row, DDPM generated the "Shutterstock" watermark in the middle and the article number at the bottom. Since this pairing of watermark and article number is common in the dataset, DDPM generated such features from a white scribble at the bottom. See the supplementary materials for more samples.

4.5 Quantitative Evaluation

We evaluated the quality and diversity of our generated images with the widely used FID and LPIPS metrics. The FID score evaluates visual quality as the distance between real and generated image distributions. LPIPS measures the perceptual similarity between two images.

Table 1 reports FID scores measured for each downsampling factor $N$ and for unconditional generation, with models trained on the FFHQ and METFACES datasets. Scores (lower is better) are mostly comparable to those of the unconditional models, suggesting that our conditioning method does not harm the generation quality of the unconditional model. In addition, FID scores of lower downsampling factors are better, as generated images from lower factors align almost perfectly with reference images.

To evaluate the diversity among samples generated from the same reference, we generated 10 images for each reference image and calculated the average pairwise (45 pairs) LPIPS distance, following StarGAN2. Table 2 shows that the higher the factor $N$, the higher the LPIPS; thus more diverse samples are generated, as suggested in Eq. 12. In contrast, samples from lower $N$ share more content with the references and are therefore less diverse.
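
The pairwise averaging protocol can be sketched as follows. The distance function here is a root-mean-square placeholder standing in for LPIPS, which requires a pretrained network; `mean_pairwise_distance` and `rms_placeholder` are hypothetical names, and the random arrays stand in for generated samples.

```python
from itertools import combinations

import numpy as np

def mean_pairwise_distance(images, dist):
    """Average dist(a, b) over all unordered pairs of images."""
    pairs = list(combinations(range(len(images)), 2))
    values = [dist(images[i], images[j]) for i, j in pairs]
    return float(np.mean(values)), len(pairs)

def rms_placeholder(a, b):
    """Stand-in for LPIPS; a real evaluation would use a learned perceptual metric."""
    return float(np.sqrt(np.mean((a - b) ** 2)))

rng = np.random.default_rng(0)
samples = [rng.standard_normal((3, 16, 16)) for _ in range(10)]  # 10 samples per reference
score, n_pairs = mean_pairwise_distance(samples, rms_placeholder)
```

With 10 samples per reference, `n_pairs` equals $\binom{10}{2} = 45$, matching the number of pairs stated above.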

Related Work

Successful iterative generative models gradually add noise to the data and learn to reverse this process. Score-based models estimate a score (the gradient of the log-likelihood) and sample images with Langevin dynamics. Denoising score matching is utilized to learn the scores in a scalable manner. DDPM learns to reverse the diffusion process that corrupts data, and utilizes the same functional form for the diffusion and reverse processes. Ho et al. show superior performance in image generation by achieving exceptionally low FID. Diffusion models also show superior performance in other domains such as speech synthesis and point cloud generation. Our conditioning method allows this powerful DDPM to be utilized for a variety of purposes.

5.2 Conditional generative models

Depending on the input type, such as class label, segmentation mask, classifier feature, or image, various conditional generative models are available. The studies employing images as a condition began with Isola et al. and extended to unsupervised, few-shot, and multi-domain image translation. Concurrent to our work, SR3 trained a conditional DDPM for super-resolution. These models show remarkable performance, but only in the setting they were designed for. In contrast, we demonstrate adaptation of a single unconditional model to various applications.

5.3 Leveraging unconditional models

Research on leveraging pre-trained unconditional generators for various purposes, such as image editing, style transfer, and super-resolution, is ongoing. Specifically, images are easily edited by projecting them into latent vectors and manipulating those vectors. Leveraging the capability of the unconditional models, these works produce high-quality images. GANs form the cornerstone of these works. In contrast, we utilize an iterative generative model, DDPM, which has not been explored in this context.

5.4 High-level semantics

Image semantics contained in CNN features, segmentation masks, and low-resolution images are actively used as conditions in generative models. From our derivation in Eq. 6, various kinds of semantic conditions, such as features or segmentations, can provide high-level semantics. However, they require additional models (classifiers or segmentation models). Since we are interested in controlling DDPM without any additional models, we provide semantics with a low-resolution image by utilizing the iterative nature of DDPM.

Conclusion

We proposed a learning-free method of conditioning the generation process of unconditional DDPM. By refining each transition with a given reference, we enable sampling from the space of plausible images. Further, the downsampling factor and the conditioning range provide user controllability over this method. We demonstrated that a single unconditional DDPM can be leveraged for various tasks without any additional learning or models.

Acknowledgements: This work was supported by Samsung Research Funding & Incubation Center of Samsung Electronics under Project Number SRFC-IT1901-12, the National Research Foundation of Korea (NRF) grant funded by the Korea government (Ministry of Science and ICT)[2018R1A2B3001628], AIRS Company in Hyundai Motor and Kia through HMC/KIA-SNU AI Consortium Fund, and the BK21 FOUR program of the Education and Research Program for Future ICT Pioneers, Seoul National University in 2021.

A Derivation of approximation

In the main paper, we proposed iterative latent variable refinement (ILVR), where each transition of the generative process is matched with a given reference image. The condition in each transition was replaced with a local condition based on our approximation, as suggested in Eq. 7 of the main text.

Before detailing the derivation of the approximation (Eq. 7), we review the notation used in the main text. With the pre-defined hyperparameter $\overline{\alpha}_t$, the latent variable $x_t$ can be sampled in closed form: $x_t \sim q(x_t|x_0)$ (Eq. 2). The trained model $\epsilon_\theta(x_t, t)$ predicts the noise added to $x_t$, conditioned on the time step $t$.

From the property of the forward process that the latent variable $x_t$ can be sampled from $x_0$ in closed form, the denoised data $x_0$ can be approximated with the model prediction $\epsilon_\theta(x_t, t)$:
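
Solving the closed-form relation $x_t = \sqrt{\overline{\alpha}_t}\,x_0 + \sqrt{1-\overline{\alpha}_t}\,\epsilon$ for $x_0$ and substituting the predicted noise for $\epsilon$ gives the estimate below. This is a reconstruction of what the text refers to as Eq. A, using the symbol $f_\theta$ introduced in the next paragraph:

```latex
x_0 \approx f_\theta(x_t, t) = \frac{x_t - \sqrt{1-\overline{\alpha}_t}\,\epsilon_\theta(x_t, t)}{\sqrt{\overline{\alpha}_t}} \tag{A}
```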

Below is a derivation of Eq. 7, where we approximated each conditioned Markov transition. We write $\phi_N$ as $\phi$ and $f_\theta(x_t, t)$ as $f(x_t)$ for brevity. From Eq. A, each conditional Markov transition with a given reference image $y$ can be approximated as follows:

With the linearity of the operation $\phi$ and Eq. A, we have

As shown in Eq. 8 and Algorithm 1 of the main text, we first compute the unconditional proposal $x'_{t-1}$, then refine it by ensuring $\phi(x_{t-1}) = \phi(y_{t-1})$. Therefore,

B Additional evaluations

We provide additional qualitative and quantitative evaluations of the generation quality of ILVR. We evaluate images generated from low-resolution (LR) images downsampled by factors of 16 and 64. Here, we compare ILVR with bicubic interpolation and PULSE, a super-resolution study that leverages a pre-trained StyleGAN. PULSE finds a latent vector that generates an image that downscales to the given LR image. We used a publicly available StyleGAN2 model (https://github.com/rosinality/stylegan2-pytorch) trained at $256 \times 256$. Combining the loss functions from PULSE and StyleGAN2, we search for latent vectors with a loss as follows:

where each term refers to mean square error (MSE), geodesic cross loss, and noise regularization, respectively. MSE ensures that the generated image $G(z)$ and the reference image $y$ match in low-resolution space. The geodesic cross loss ensures that the latent vectors $v_1, \dots, v_{14}$ remain in the learned latent space. Noise regularization $L_{noise}$ discourages signal from sneaking into the noise maps of StyleGAN2. We chose $\alpha = 5e^{3}$. Refer to the StyleGAN2 literature for details on the noise regularization. We inherited the initialization and learning rate schedule from StyleGAN2.

Fig. A presents additional qualitative results. ILVR and PULSE both show high-quality images generated from extremely downscaled images. Table A shows the NIQE score, a no-reference metric that measures the perceptual quality of an image. ILVR shows higher perceptual quality, even better than the original $256^2$ reference images (HR). We measured NIQE with the reference images in Fig. B.

B.2 Image translation

We compare Fréchet inception distance (FID) with image translation models on cat-to-dog (AFHQ dataset) translation. Table B shows the results. FID scores are calculated with the test set from AFHQ. ILVR presents comparable performance to CUT, which is the state of the art on cat-to-dog translation. Note that ILVR requires a model trained only on dog images, unlike the other models, which are trained on both cat and dog images. We expect our result to broaden the applicability of DDPM to such image translation tasks.

B.3 Additional samples

Fig. 9 shows samples generated with publicly available guided-diffusion trained on LSUN datasets. We present additional editing with scribbles in Fig. C.

C Implementation details

We trained unconditional DDPM with a publicly available PyTorch implementation (https://github.com/rosinality/denoising-diffusion-pytorch).

We used bicubic downsampling and upsampling with a correctly implemented resizing function. In Fig. D, we compare generated samples where the same noise was added throughout the generative process and only the resizing kernels differ. Across kernels, the images are almost identical, suggesting that our method is robust to the choice of kernel.

C.2 Datasets and training

Here we describe datasets and training details. For all datasets, we trained at $256^2$ resolution with a batch size of 8.

FFHQ consists of 70,000 high-resolution face images. We trained a model for 1.2M steps.

METFACES consists of 1,000 high-resolution portrait images. To avoid overfitting, we fine-tuned a model pre-trained on FFHQ for 20k steps.

AFHQ consists of 15,000 high-resolution animal face images, which are equally split into three categories: dog, cat, and wild. We trained on the train set of dog category, then used test sets of three categories as reference images to demonstrate multi-domain image translation.

Places365 consists of 10M images of over 400 scene categories. We trained a model on the waterfall category, which consists of 5,000 images. We used this model for the paint-to-image task.

LSUN Church consists of 126,227 images of churches. We trained a model for 1M steps.

Paintings used for the paint-to-image task were collected from the web.

C.3 Architecture

We trained the same neural network architecture as Ho et al., a U-Net based on Wide ResNet. Details include group normalization, self-attention blocks at $16 \times 16$ resolution, sinusoidal positional embedding, and a fixed linear variance schedule $\beta_1, \dots, \beta_T$.

C.4 Evaluation

In Table 1 of the main text, we calculated FID with 50k real and 50k generated images using a PyTorch implementation (https://github.com/mseitzer/pytorch-fid).

References