Improved Techniques for Training Score-Based Generative Models

Yang Song, Stefano Ermon

Introduction

Score-based generative models represent probability distributions through the score, a vector field pointing in the direction where the likelihood of data increases most rapidly. Remarkably, these score functions can be learned from data without requiring adversarial optimization, and can produce realistic image samples that rival GANs on simple datasets such as CIFAR-10.

Despite this success, existing score-based generative models only work on low resolution images ($32\times 32$) due to several limiting factors. First, the score function is learned via denoising score matching. Intuitively, this means a neural network (named the score network) is trained to denoise images blurred with Gaussian noise. A key insight from prior work is to perturb the data using multiple noise scales so that the score network captures both coarse and fine-grained image features. However, it is an open question how these noise scales should be chosen. The previously recommended settings work well for $32\times 32$ images, but perform poorly at higher resolutions. Second, samples are generated by running Langevin dynamics. This method starts from white noise and progressively denoises it into an image using the score network. This procedure, however, might fail or take an extremely long time to converge in high dimensions and with a necessarily imperfect (learned) score network.

We propose a set of techniques to scale score-based generative models to high resolution images. Based on a new theoretical analysis of a simplified mixture model, we provide a method to analytically compute an effective set of Gaussian noise scales from training data. Additionally, we propose an efficient architecture to amortize the score estimation task across a large (possibly infinite) number of noise scales with a single neural network. Based on a simplified analysis of the convergence properties of the underlying Langevin dynamics sampling procedure, we also derive a technique to approximately optimize its performance as a function of the noise scales. Combining these techniques with an exponential moving average (EMA) of model parameters, we are able to significantly improve the sample quality, and successfully scale to images of resolutions ranging from $64\times 64$ to $256\times 256$, which was previously impossible for score-based generative models. As illustrated in Fig. 1, the samples are sharp and diverse.

Background

For any continuously differentiable probability density $p(\mathbf{x})$, we call $\nabla_{\mathbf{x}}\log p(\mathbf{x})$ its score function. In many situations the score function is easier to model and estimate than the original probability density function. For example, for an unnormalized density it does not depend on the partition function. Once the score function is known, we can employ Langevin dynamics to sample from the corresponding distribution. Given a step size $\alpha>0$, a total number of iterations $T$, and an initial sample $\mathbf{x}_{0}$ from any prior distribution $\pi(\mathbf{x})$, Langevin dynamics iteratively evaluate the following update:

$$\mathbf{x}_{t}\leftarrow\mathbf{x}_{t-1}+\alpha\,\nabla_{\mathbf{x}}\log p(\mathbf{x}_{t-1})+\sqrt{2\alpha}\,\mathbf{z}_{t},\quad 1\leq t\leq T, \tag{1}$$

where $\mathbf{z}_{t}\sim\mathcal{N}(\mathbf{0},\mathbf{I})$. When $\alpha$ is sufficiently small and $T$ is sufficiently large, the distribution of $\mathbf{x}_{T}$ will be close to $p(\mathbf{x})$ under some regularity conditions. Suppose we have a neural network $\mathbf{s}_{\boldsymbol{\theta}}(\mathbf{x})$ (called the score network) parameterized by $\boldsymbol{\theta}$, and it has been trained such that $\mathbf{s}_{\boldsymbol{\theta}}(\mathbf{x})\approx\nabla_{\mathbf{x}}\log p(\mathbf{x})$. We can approximately generate samples from $p(\mathbf{x})$ using Langevin dynamics by replacing $\nabla_{\mathbf{x}}\log p(\mathbf{x}_{t-1})$ with $\mathbf{s}_{\boldsymbol{\theta}}(\mathbf{x}_{t-1})$ in Eq. 1. Note that Eq. 1 can be interpreted as noisy gradient ascent on the log-density $\log p(\mathbf{x})$.
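As an illustration of the Langevin update $\mathbf{x}_{t}\leftarrow\mathbf{x}_{t-1}+\alpha\,\mathbf{s}(\mathbf{x}_{t-1})+\sqrt{2\alpha}\,\mathbf{z}_{t}$, the following is a minimal NumPy sketch. The function name `langevin_dynamics` and the toy Gaussian target (whose score is known in closed form) are assumptions made for this example only; they stand in for a learned score network.

```python
import numpy as np

def langevin_dynamics(score_fn, x0, alpha, n_steps, rng):
    """Iterate x <- x + alpha * score(x) + sqrt(2 * alpha) * z with z ~ N(0, I)."""
    x = np.array(x0, dtype=float)
    for _ in range(n_steps):
        z = rng.standard_normal(x.shape)
        x = x + alpha * score_fn(x) + np.sqrt(2.0 * alpha) * z
    return x

# Toy target (assumption for illustration): N(mu, I), whose score is -(x - mu).
mu = np.array([2.0, -1.0])
rng = np.random.default_rng(0)
x0 = rng.uniform(-5.0, 5.0, size=(5000, 2))   # samples from an arbitrary prior
samples = langevin_dynamics(lambda x: -(x - mu), x0, alpha=0.01, n_steps=1000, rng=rng)
print(samples.mean(axis=0), samples.var(axis=0))
```

With a small step size and enough iterations, the empirical mean and variance of `samples` should approach those of the target, regardless of the prior used for `x0`.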

2.2 Score-based generative modeling

We can estimate the score function from data and generate new samples with Langevin dynamics. This idea has been named score-based generative modeling in prior work. Because the estimated score function is inaccurate in regions without training data, Langevin dynamics may not converge correctly when a sampling trajectory encounters those regions. As a remedy, prior work proposes to perturb the data with Gaussian noise of different intensities and jointly estimate the score functions of all noise-perturbed data distributions. During inference, the information from all noise scales is combined by sampling from each noise-perturbed distribution sequentially with Langevin dynamics.

where all expectations can be efficiently estimated using empirical averages. When trained to the optimum (denoted as $\mathbf{s}_{\boldsymbol{\theta}^{*}}(\mathbf{x},\sigma)$), the noise conditional score network (NCSN) satisfies $\forall i:\mathbf{s}_{\boldsymbol{\theta}^{*}}(\mathbf{x},\sigma_{i})=\nabla_{\mathbf{x}}\log p_{\sigma_{i}}(\mathbf{x})$ almost everywhere, assuming enough data and model capacity.
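To make the phrase "estimated using empirical averages" concrete, here is a hedged sketch of a Monte Carlo estimate of a noise-conditional denoising score matching loss. It assumes the objective takes the commonly used form $\frac{1}{2L}\sum_{i=1}^{L}\mathbb{E}\big[\lVert\sigma_{i}\,\mathbf{s}_{\boldsymbol{\theta}}(\mathbf{x}+\sigma_{i}\mathbf{z},\sigma_{i})+\mathbf{z}\rVert_{2}^{2}\big]$ with $\mathbf{z}\sim\mathcal{N}(\mathbf{0},\mathbf{I})$; this form, the function name `ncsn_loss`, and the toy single-point dataset are assumptions of the example.

```python
import numpy as np

def ncsn_loss(score_fn, x, sigmas, rng):
    """Empirical average of 0.5 * ||sigma * s(x + sigma z, sigma) + z||^2 over data and scales."""
    total = 0.0
    for sigma in sigmas:
        z = rng.standard_normal(x.shape)
        x_tilde = x + sigma * z                      # noise-perturbed data
        residual = sigma * score_fn(x_tilde, sigma) + z
        total += 0.5 * np.mean(np.sum(residual ** 2, axis=1))
    return total / len(sigmas)

# Toy check (assumption): every data point equals x_star, so the perturbed
# distribution is N(x_star, sigma^2 I) and its exact score is -(x - x_star) / sigma^2.
x_star = np.array([0.3, 0.7])
data = np.tile(x_star, (2000, 1))
sigmas = [1.0, 0.1, 0.01]
exact_score = lambda x, sigma: -(x - x_star) / sigma ** 2
zero_score = lambda x, sigma: np.zeros_like(x)

loss_exact = ncsn_loss(exact_score, data, sigmas, np.random.default_rng(0))
loss_zero = ncsn_loss(zero_score, data, sigmas, np.random.default_rng(0))
print(loss_exact, loss_zero)
```

In this toy setting the exact score drives the loss to (numerically) zero, while an uninformative score leaves it near $D/2$, where $D$ is the data dimension.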

After training an NCSN, samples are generated by annealed Langevin dynamics, a method that combines information from all noise scales. We provide its pseudo-code in Algorithm 1. The approach amounts to sampling from $p_{\sigma_{1}}(\mathbf{x}),p_{\sigma_{2}}(\mathbf{x}),\cdots,p_{\sigma_{L}}(\mathbf{x})$ sequentially with Langevin dynamics, using a special step size schedule $\alpha_{i}=\epsilon\,\sigma_{i}^{2}/\sigma_{L}^{2}$ for the $i$-th noise scale. Samples from each noise scale are used to initialize Langevin dynamics for the next noise scale until reaching the smallest one, which provides the final samples for the NCSN.

Following the first public release of this work, it was observed that adding an extra denoising step after the original annealed Langevin dynamics often significantly improves FID scores without affecting the visual appearance of samples. Instead of directly returning $\mathbf{x}_{T}$, this denoising step returns $\mathbf{x}_{T}+\sigma_{T}^{2}\mathbf{s}_{\boldsymbol{\theta}}(\mathbf{x}_{T},\sigma_{T})$ (see Algorithm 1), which essentially removes the unwanted noise $\mathcal{N}(\mathbf{0},\sigma_{T}^{2}\mathbf{I})$ from $\mathbf{x}_{T}$ using Tweedie's formula. Therefore, we have updated results in the main paper by incorporating this denoising trick, but kept some original results without this denoising step in the appendix for reference.
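The sampling procedure described above (annealed Langevin dynamics with step sizes $\alpha_{i}=\epsilon\,\sigma_{i}^{2}/\sigma_{L}^{2}$, followed by an optional denoising step) can be sketched as follows. This is an illustrative reading rather than a reproduction of Algorithm 1: the function name, the toy closed-form score, and the choice to apply the denoising step at the smallest noise scale are assumptions of this example.

```python
import numpy as np

def annealed_langevin(score_fn, x_init, sigmas, eps, n_steps, rng, denoise=True):
    """Run Langevin dynamics at each noise scale in turn, from largest to smallest."""
    x = np.array(x_init, dtype=float)
    sigma_last = sigmas[-1]
    for sigma in sigmas:                               # sigmas are in descending order
        alpha = eps * sigma ** 2 / sigma_last ** 2     # step size for this scale
        for _ in range(n_steps):
            z = rng.standard_normal(x.shape)
            x = x + alpha * score_fn(x, sigma) + np.sqrt(2.0 * alpha) * z
    if denoise:
        # Denoising step, read here as using the last (smallest) noise scale.
        x = x + sigma_last ** 2 * score_fn(x, sigma_last)
    return x

# Toy setting (assumption): a dataset with the single point m, so that
# p_sigma = N(m, sigma^2 I) and the exact score is -(x - m) / sigma^2.
m = np.array([0.5, 0.5])
score = lambda x, sigma: -(x - m) / sigma ** 2
sigmas = np.geomspace(1.0, 0.01, num=10)
x_init = np.random.default_rng(1).uniform(0.0, 1.0, size=(16, 2))

raw = annealed_langevin(score, x_init, sigmas, 2e-5, 100, np.random.default_rng(0), denoise=False)
denoised = annealed_langevin(score, x_init, sigmas, 2e-5, 100, np.random.default_rng(0), denoise=True)
print(np.abs(raw - m).max(), np.abs(denoised - m).max())
```

In this toy case the raw samples retain noise of roughly the smallest scale, whereas the denoised output collapses onto the data point, which mirrors the role of the denoising step.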

There are many design choices that are critical to the successful training and inference of NCSNs, including (i) the set of noise scales $\{\sigma_{i}\}_{i=1}^{L}$, (ii) the way that $\mathbf{s}_{\boldsymbol{\theta}}(\mathbf{x},\sigma)$ incorporates information about $\sigma$, (iii) the step size parameter $\epsilon$, and (iv) the number of sampling steps per noise scale $T$ in Algorithm 1. Below we provide theoretically motivated ways to configure them without manual tuning, which significantly improve the performance of NCSNs on high resolution images.

Choosing noise scales

Noise scales are critical for the success of NCSNs. As shown in prior work, score networks trained with a single noise scale can never produce convincing samples for large images. Intuitively, high noise facilitates the estimation of score functions, but also leads to corrupted samples, while lower noise gives clean samples but makes score functions harder to estimate. One should therefore leverage different noise scales together to get the best of both worlds.

When the range of pixel values is $[0,1]$, the original work on NCSN recommends choosing $\{\sigma_{i}\}_{i=1}^{L}$ as a geometric sequence where $L=10$, $\sigma_{1}=1$, and $\sigma_{L}=0.01$. It is reasonable that the smallest noise scale $\sigma_{L}=0.01\ll 1$, because we sample from perturbed distributions with descending noise scales and we want to add low noise at the end. However, some important questions remain unanswered, which turn out to be critical to the success of NCSNs on high resolution images: (i) Is $\sigma_{1}=1$ appropriate? If not, how should we adjust $\sigma_{1}$ for different datasets? (ii) Is geometric progression a good choice? (iii) Is $L=10$ good across different datasets? If not, how many noise scales are ideal?

Below we provide answers to the above questions, motivated by theoretical analyses on simple mathematical models. Our insights are effective for configuring score-based generative modeling in practice, as corroborated by experimental results in Section 6.

The algorithm of annealed Langevin dynamics (Algorithm 1) is an iterative refining procedure that starts from generating coarse samples with rich variation under large noise, before converging to fine samples with less variation under small noise. The initial noise scale $\sigma_{1}$ largely controls the diversity of the final samples. In order to promote sample diversity, we might want to choose $\sigma_{1}$ to be as large as possible. However, an excessively large $\sigma_{1}$ will require more noise scales (to be discussed in Section 3.2) and make annealed Langevin dynamics more expensive. Below we present an analysis to guide the choice of $\sigma_{1}$ and provide a technique to strike the right balance.

Real-world data distributions are complex and hard to analyze, so we approximate them with empirical distributions. Suppose we have a dataset $\{\mathbf{x}^{(1)},\mathbf{x}^{(2)},\cdots,\mathbf{x}^{(N)}\}$ which is i.i.d. sampled from $p_{\text{data}}(\mathbf{x})$. Assuming $N$ is sufficiently large, we have $p_{\text{data}}(\mathbf{x})\approx\hat{p}_{\text{data}}(\mathbf{x})\triangleq\frac{1}{N}\sum_{i=1}^{N}\delta(\mathbf{x}=\mathbf{x}^{(i)})$, where $\delta(\cdot)$ denotes a point mass distribution. When perturbed with $\mathcal{N}(\mathbf{0},\sigma_{1}^{2}\mathbf{I})$, the empirical distribution becomes $\hat{p}_{\sigma_{1}}(\mathbf{x})\triangleq\frac{1}{N}\sum_{i=1}^{N}p^{(i)}(\mathbf{x})$, where $p^{(i)}(\mathbf{x})\triangleq\mathcal{N}(\mathbf{x}\mid\mathbf{x}^{(i)},\sigma_{1}^{2}\mathbf{I})$. For generating diverse samples regardless of initialization, we naturally expect that Langevin dynamics can explore any component $p^{(i)}(\mathbf{x})$ when initialized from any other component $p^{(j)}(\mathbf{x})$, where $i\neq j$. The performance of Langevin dynamics is governed by the score function $\nabla_{\mathbf{x}}\log\hat{p}_{\sigma_{1}}(\mathbf{x})$ (see Eq. 1).

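Let $\hat{p}_{\sigma_{1}}(\mathbf{x})\triangleq\frac{1}{N}\sum_{i=1}^{N}p^{(i)}(\mathbf{x})$, where $p^{(i)}(\mathbf{x})\triangleq\mathcal{N}(\mathbf{x}\mid\mathbf{x}^{(i)},\sigma_{1}^{2}\mathbf{I})$. With $r^{(i)}(\mathbf{x})\triangleq\frac{p^{(i)}(\mathbf{x})}{\sum_{k=1}^{N}p^{(k)}(\mathbf{x})}$, the score function is $\nabla_{\mathbf{x}}\log\hat{p}_{\sigma_{1}}(\mathbf{x})=\sum_{i=1}^{N}r^{(i)}(\mathbf{x})\nabla_{\mathbf{x}}\log p^{(i)}(\mathbf{x})$. Moreover,

$$\mathbb{E}_{p^{(i)}(\mathbf{x})}\big[r^{(j)}(\mathbf{x})\big]\leq\frac{1}{2}\exp\left(-\frac{\lVert\mathbf{x}^{(i)}-\mathbf{x}^{(j)}\rVert_{2}^{2}}{8\sigma_{1}^{2}}\right). \tag{3}$$

This bound decays exponentially fast when $\lVert\mathbf{x}^{(i)}-\mathbf{x}^{(j)}\rVert_{2}$ is large compared to $\sigma_{1}$. When initialized near one component, Langevin dynamics then receives almost no score signal from a distant component, so $\sigma_{1}$ should be comparable to the largest pairwise distance in the data for all components to be reachable.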

Technique 1. Choose $\sigma_{1}$ to be as large as the maximum Euclidean distance between all pairs of training data points.
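The rule above reduces to a maximum over pairwise distances, which can be computed without materializing the full $N\times N$ distance matrix. The sketch below is one possible way to do so; the function name `max_pairwise_distance` and the `max_points` subsampling parameter are hypothetical and introduced here only for illustration.

```python
import numpy as np

def max_pairwise_distance(data, max_points=10000, rng=None):
    """Largest Euclidean distance between any two (flattened) data points."""
    x = np.asarray(data, dtype=float)
    x = x.reshape(len(x), -1)                     # flatten images to vectors
    if len(x) > max_points:                       # optional random subsample for large datasets
        rng = np.random.default_rng() if rng is None else rng
        x = x[rng.choice(len(x), size=max_points, replace=False)]
    sq_norms = np.sum(x ** 2, axis=1)
    best = 0.0
    for i in range(len(x)):                       # one row at a time keeps memory at O(N)
        d2 = sq_norms[i] + sq_norms - 2.0 * (x @ x[i])
        best = max(best, float(d2.max()))
    return float(np.sqrt(max(best, 0.0)))

points = np.array([[0.0, 0.0], [3.0, 4.0], [1.0, 1.0]])
sigma_1 = max_pairwise_distance(points)
print(sigma_1)
```

Subsampling trades exactness for cost: the maximum over a random subset is a lower bound on the true maximum, so it is best treated as an approximation of $\sigma_{1}$.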

3.2 Other noise scales

After setting $\sigma_{L}$ and $\sigma_{1}$, we need to choose the number of noise scales $L$ and specify the other elements of $\{\sigma_{i}\}_{i=1}^{L}$. As analyzed in prior work, it is crucial for the success of score-based generative models to ensure that $p_{\sigma_{i}}(\mathbf{x})$ generates a sufficient number of training data in high density regions of $p_{\sigma_{i-1}}(\mathbf{x})$ for all $1<i\leq L$. The intuition is that we need reliable gradient signals for $p_{\sigma_{i}}(\mathbf{x})$ when initializing Langevin dynamics with samples from $p_{\sigma_{i-1}}(\mathbf{x})$.

However, an extensive grid search on $\{\sigma_{i}\}_{i=1}^{L}$ can be very expensive. To give some theoretical guidance on finding good noise scales, we consider a simple case where the dataset contains only one data point, or equivalently, $\forall 1\leq i\leq L:p_{\sigma_{i}}(\mathbf{x})=\mathcal{N}(\mathbf{x}\mid\mathbf{0},\sigma_{i}^{2}\mathbf{I})$. Our first step is to understand the distributions $p_{\sigma_{i}}(\mathbf{x})$ better, especially when $\mathbf{x}$ has high dimensionality. We can decompose $p_{\sigma_{i}}(\mathbf{x})$ in hyperspherical coordinates into $p(\boldsymbol{\phi})p_{\sigma_{i}}(r)$, where $r$ and $\boldsymbol{\phi}$ denote the radial and angular coordinates of $\mathbf{x}$ respectively. Because $p_{\sigma_{i}}(\mathbf{x})$ is an isotropic Gaussian, the angular component $p(\boldsymbol{\phi})$ is uniform and shared across all noise scales. As for $p_{\sigma_{i}}(r)$, we have the following result. Let $\mathbf{x}\in\mathbb{R}^{D}\sim\mathcal{N}(\mathbf{0},\sigma^{2}\mathbf{I})$ and $r=\lVert\mathbf{x}\rVert_{2}$. Then

$$p(r)=\frac{1}{2^{D/2-1}\Gamma(D/2)}\frac{r^{D-1}}{\sigma^{D}}\exp\left(-\frac{r^{2}}{2\sigma^{2}}\right),\qquad r-\sqrt{D}\sigma\stackrel{d}{\to}\mathcal{N}(0,\sigma^{2}/2)\ \text{ as } D\to\infty.$$

In practice, dimensions of image data can range from several thousand to millions, and are typically large enough to warrant $p(r)\approx\mathcal{N}(r\mid\sqrt{D}\sigma,\sigma^{2}/2)$ with negligible error. We therefore take $p_{\sigma_{i}}(r)=\mathcal{N}(r\mid m_{i},s_{i}^{2})$ to simplify our analysis, where $m_{i}\triangleq\sqrt{D}\sigma_{i}$ and $s_{i}^{2}\triangleq\sigma_{i}^{2}/2$.

Recall that our goal is to make sure samples from $p_{\sigma_{i}}(\mathbf{x})$ will cover high density regions of $p_{\sigma_{i-1}}(\mathbf{x})$. Because $p(\boldsymbol{\phi})$ is shared across all noise scales, $p_{\sigma_{i}}(\mathbf{x})$ already covers the angular component of $p_{\sigma_{i-1}}(\mathbf{x})$. Therefore, we need the radial components of $p_{\sigma_{i}}(\mathbf{x})$ and $p_{\sigma_{i-1}}(\mathbf{x})$ to have large overlap. Since $p_{\sigma_{i-1}}(r)$ has high density in $\mathcal{I}_{i-1}\triangleq[m_{i-1}-3s_{i-1},m_{i-1}+3s_{i-1}]$ (employing the "three-sigma rule of thumb"), a natural choice is to fix $p_{\sigma_{i}}(r\in\mathcal{I}_{i-1})=\Phi(\sqrt{2D}(\gamma_{i}-1)+3\gamma_{i})-\Phi(\sqrt{2D}(\gamma_{i}-1)-3\gamma_{i})=C$ with some moderately large constant $C>0$ for all $1<i\leq L$, where $\gamma_{i}\triangleq\sigma_{i-1}/\sigma_{i}$ and $\Phi(\cdot)$ is the CDF of the standard Gaussian. This choice immediately implies that $\gamma_{2}=\gamma_{3}=\cdots=\gamma_{L}$ and thus $\{\sigma_{i}\}_{i=1}^{L}$ is a geometric progression.

Ideally, we should choose as many noise scales as possible to make $C\approx 1$. However, having too many noise scales will make sampling very costly, as we need to run Langevin dynamics for each noise scale in sequence. On the other hand, $L=10$ (for $32\times 32$ images) as in the original NCSN setting is arguably too small, for which $C=0$ up to numerical precision. To strike a balance, we recommend $C\approx 0.5$, which performs well in our experiments. In summary,

Technique 2. Choose $\{\sigma_{i}\}_{i=1}^{L}$ as a geometric progression with common ratio $\gamma$, such that $\Phi(\sqrt{2D}(\gamma-1)+3\gamma)-\Phi(\sqrt{2D}(\gamma-1)-3\gamma)\approx 0.5$.
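The condition above is a one-dimensional root-finding problem in $\gamma$. The following sketch solves it by bisection and converts the resulting ratio into a number of noise scales. The helper names, the bracketing strategy, and the rounding rule in `num_scales` are assumptions of this example; it also assumes $\sqrt{2D}>3$, which holds for image-sized $D$.

```python
import math

def normal_cdf(x):
    """CDF of the standard Gaussian."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def overlap(gamma, dim):
    """C(gamma) = Phi(sqrt(2D)(gamma-1) + 3 gamma) - Phi(sqrt(2D)(gamma-1) - 3 gamma)."""
    a = math.sqrt(2.0 * dim) * (gamma - 1.0)
    return normal_cdf(a + 3.0 * gamma) - normal_cdf(a - 3.0 * gamma)

def solve_gamma(dim, target=0.5, tol=1e-10):
    """Bisection for overlap(gamma) = target; overlap(1) is about 0.997 and decays for larger gamma."""
    lo, hi = 1.0, 2.0
    while overlap(hi, dim) > target:      # widen the bracket if needed
        hi *= 2.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if overlap(mid, dim) > target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def num_scales(sigma_1, sigma_last, gamma):
    """Smallest L such that a geometric progression with ratio <= gamma spans both endpoints."""
    return math.ceil(math.log(sigma_1 / sigma_last) / math.log(gamma) - 1e-9) + 1

dim = 3 * 32 * 32                                   # e.g. 32x32 RGB images
gamma = solve_gamma(dim)
c_ten_scales = overlap((1.0 / 0.01) ** (1.0 / 9.0), dim)   # ratio of a 10-scale sequence from 1 to 0.01
print(gamma, overlap(gamma, dim), c_ten_scales)
```

For this dimensionality the ten-scale geometric sequence from 1 to 0.01 gives an overlap that underflows to zero in double precision, consistent with the remark that $L=10$ is too small.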

3.3 Incorporating the noise information

For high resolution images, we need a large $\sigma_{1}$ and a huge number of noise scales as per Techniques 1 and 2. Recall that the NCSN is a single amortized network that takes a noise scale and gives the corresponding score. In the original NCSN, the authors use a separate set of scale and bias parameters in normalization layers to incorporate the information from each noise scale. However, the memory consumption of this approach grows linearly w.r.t. $L$, and it is not applicable when the NCSN has no normalization layers.

Technique 3. Parameterize the NCSN with $\mathbf{s}_{\boldsymbol{\theta}}(\mathbf{x},\sigma)=\mathbf{s}_{\boldsymbol{\theta}}(\mathbf{x})/\sigma$, where $\mathbf{s}_{\boldsymbol{\theta}}(\mathbf{x})$ is an unconditional score network.
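In code, this parameterization amounts to wrapping an unconditional network and dividing its output by $\sigma$. The sketch below shows the idea with a stand-in function in place of a real network; `make_conditional_score` and the placeholder `unconditional_score` are hypothetical names used only for illustration.

```python
import numpy as np

def make_conditional_score(unconditional_score):
    """Build s(x, sigma) = s(x) / sigma from an unconditional score function s(x)."""
    def score(x, sigma):
        sigma = np.asarray(sigma, dtype=float)
        # Append singleton axes so a per-example sigma broadcasts over the data axes.
        sigma = sigma.reshape(sigma.shape + (1,) * (np.ndim(x) - sigma.ndim))
        return unconditional_score(x) / sigma
    return score

# Placeholder for a neural network (assumption): any function mapping x to an array like x.
unconditional_score = lambda x: -x
score = make_conditional_score(unconditional_score)

x = np.ones((2, 3))
print(score(x, 2.0))                      # one noise scale shared by the batch
print(score(x, np.array([1.0, 2.0])))     # one noise scale per example
```

Because $\sigma$ enters only through this division, the same wrapper works for any number of noise scales, including values of $\sigma$ not seen on a fixed grid.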

It is typically hard for deep networks to automatically learn this rescaling, because $\sigma_{1}$ and $\sigma_{L}$ can differ by several orders of magnitude. This simple choice is easier to implement, and can easily handle a large number of noise scales (even continuous ones). As shown in Fig. 3 (detailed settings in Appendix B), it achieves similar training losses compared to the original noise conditioning approach, and generates samples of better quality (see Section C.4).

Configuring annealed Langevin dynamics

In order to sample from an NCSN with annealed Langevin dynamics, we need to specify the number of sampling steps per noise scale $T$ and the step size parameter $\epsilon$ in Algorithm 1. The authors of the original NCSN recommend $\epsilon=2\times 10^{-5}$ and $T=100$. It remains unclear how we should change $\epsilon$ and $T$ for different sets of noise scales.

To gain some theoretical insight, we revisit the setting in Section 3.2 where the dataset has one point (i.e., $p_{\sigma_{i}}(\mathbf{x})=\mathcal{N}(\mathbf{x}\mid\mathbf{0},\sigma_{i}^{2}\mathbf{I})$). Annealed Langevin dynamics connect two adjacent noise scales $\sigma_{i-1}>\sigma_{i}$ by initializing the Langevin dynamics for $p_{\sigma_{i}}(\mathbf{x})$ with samples obtained from $p_{\sigma_{i-1}}(\mathbf{x})$. When applying Langevin dynamics to $p_{\sigma_{i}}(\mathbf{x})$, we have $\mathbf{x}_{t+1}\leftarrow\mathbf{x}_{t}+\alpha\nabla_{\mathbf{x}}\log p_{\sigma_{i}}(\mathbf{x}_{t})+\sqrt{2\alpha}\mathbf{z}_{t}$, where $\mathbf{x}_{0}\sim p_{\sigma_{i-1}}(\mathbf{x})$ and $\mathbf{z}_{t}\sim\mathcal{N}(\mathbf{0},\mathbf{I})$. The distribution of $\mathbf{x}_{T}$ can be computed in closed form:

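Let $\gamma=\frac{\sigma_{i-1}}{\sigma_{i}}$. For $\alpha=\epsilon\cdot\frac{\sigma_{i}^{2}}{\sigma_{L}^{2}}$ (as in Algorithm 1), we have $\mathbf{x}_{T}\sim\mathcal{N}(\mathbf{0},s^{2}_{T}\mathbf{I})$, where

$$\frac{s_{T}^{2}}{\sigma_{i}^{2}}=\left(1-\frac{\epsilon}{\sigma_{L}^{2}}\right)^{2T}\left(\gamma^{2}-\frac{2\epsilon}{\sigma_{L}^{2}-\sigma_{L}^{2}\left(1-\frac{\epsilon}{\sigma_{L}^{2}}\right)^{2}}\right)+\frac{2\epsilon}{\sigma_{L}^{2}-\sigma_{L}^{2}\left(1-\frac{\epsilon}{\sigma_{L}^{2}}\right)^{2}}. \tag{4}$$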

When $\{\sigma_{i}\}_{i=1}^{L}$ is a geometric progression as advocated by Technique 2, we immediately see that $s^{2}_{T}/\sigma_{i}^{2}$ is identical across all $1<i\leq L$ because of the shared $\gamma$. Furthermore, the value of $s^{2}_{T}/\sigma_{i}^{2}$ has no explicit dependency on the dimensionality $D$.

For better mixing of annealed Langevin dynamics, we hope $s^{2}_{T}/\sigma_{i}^{2}$ approaches 1 across all noise scales, which can be achieved by finding $\epsilon$ and $T$ that minimize the difference between Eq. 4 and 1. Unfortunately, this often results in an unnecessarily large $T$ that makes sampling very expensive for large $L$. As an alternative, we propose to first choose $T$ based on a reasonable computing budget (typically $T\times L$ is several thousand), and subsequently find $\epsilon$ by making Eq. 4 as close to 1 as possible. In summary:

Technique 4. Choose $T$ as large as allowed by a computing budget and then select an $\epsilon$ that makes Eq. 4 maximally close to 1.

We follow this guidance to generate all samples in this paper, except for those from the original NCSN, where we adopt the same settings as in the original work. When finding $\epsilon$ with Technique 4 and Eq. 4, we recommend performing grid search over $\epsilon$, rather than using gradient-based optimization methods.
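The grid search over $\epsilon$ can be sketched as below. The closed-form ratio is written from the single-point analysis, in which one Langevin step maps the variance $v$ (in units of $\sigma_{i}^{2}$) to $(1-\rho)^{2}v+2\rho$ with $\rho=\epsilon/\sigma_{L}^{2}$ and starts from $v_{0}=\gamma^{2}$. The function names and the example grid are assumptions of this illustration; the grid is assumed to satisfy $0<\epsilon<2\sigma_{L}^{2}$ so that the recursion is stable.

```python
import numpy as np

def variance_ratio(eps, n_steps, gamma, sigma_last):
    """Closed-form s_T^2 / sigma_i^2 after n_steps Langevin steps with alpha = eps * sigma_i^2 / sigma_L^2."""
    rho = eps / sigma_last ** 2                          # equals alpha / sigma_i^2
    stationary = 2.0 * rho / (1.0 - (1.0 - rho) ** 2)    # fixed point of v -> (1 - rho)^2 v + 2 rho
    return (1.0 - rho) ** (2 * n_steps) * (gamma ** 2 - stationary) + stationary

def select_eps(n_steps, gamma, sigma_last, grid):
    """Grid search for the eps whose variance ratio is closest to 1."""
    grid = np.asarray(grid, dtype=float)
    errors = np.abs(variance_ratio(grid, n_steps, gamma, sigma_last) - 1.0)
    return float(grid[np.argmin(errors)])

# Example values (assumptions): T = 5 steps per scale, gamma = 1.04, sigma_L = 0.01.
grid = np.geomspace(1e-7, 1e-4, 200)
best_eps = select_eps(5, 1.04, 0.01, grid)
print(best_eps, variance_ratio(best_eps, 5, 1.04, 0.01))
```

A grid search is cheap here because the ratio is a scalar function of $\epsilon$ with no dependence on the data or the network.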

Improving stability with moving average

Unlike GANs, score-based generative models have one unified objective (Eq. 2) and require no adversarial training. However, even though the loss function of NCSNs typically decreases steadily over the course of training, we observe that the generated image samples sometimes exhibit unstable visual quality, especially for images of larger resolutions. We empirically demonstrate this fact by training NCSNs on CIFAR-10 $32\times 32$ and CelebA $64\times 64$ following the settings of the original NCSN, which exemplifies typical behavior on other image datasets. We report FID scores computed on 1000 samples every 5000 iterations. Results in Fig. 4 are computed with the denoising step, but results without the denoising step are similar (see Fig. 8 in Section C.1). As shown in Figs. 4 and 8, the FID scores for the vanilla NCSN often fluctuate significantly during training. Additionally, samples from the vanilla NCSN sometimes exhibit characteristic artifacts: image samples from the same checkpoint have a strong tendency to share a common color shift. Moreover, samples are shifted towards different colors throughout training. We provide more samples in Section C.3 to illustrate this artifact.

This issue can be easily fixed by exponential moving average (EMA). Specifically, let $\boldsymbol{\theta}_{i}$ denote the parameters of an NCSN after the $i$-th training iteration, and $\boldsymbol{\theta}^{\prime}$ be an independent copy of the parameters. We update $\boldsymbol{\theta}^{\prime}$ with $\boldsymbol{\theta}^{\prime}\leftarrow m\boldsymbol{\theta}^{\prime}+(1-m)\boldsymbol{\theta}_{i}$ after each optimization step, where $m$ is the momentum parameter and typically $m=0.999$. When producing samples, we use $\mathbf{s}_{\boldsymbol{\theta}^{\prime}}(\mathbf{x},\sigma)$ instead of $\mathbf{s}_{\boldsymbol{\theta}_{i}}(\mathbf{x},\sigma)$. As shown in Fig. 4, EMA can effectively stabilize FIDs, remove artifacts (more samples in Section C.3) and give better FID scores in most cases. Empirically, we observe the effectiveness of EMA is universal across a large number of different image datasets. As a result, we recommend the following rule of thumb:

Technique 5. Apply exponential moving average to parameters when sampling.
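The EMA update $\boldsymbol{\theta}^{\prime}\leftarrow m\boldsymbol{\theta}^{\prime}+(1-m)\boldsymbol{\theta}_{i}$ is framework-independent. A minimal sketch, assuming parameters are stored as a dictionary of NumPy arrays (the dictionary layout and the name `ema_update` are illustrative choices, not part of the original method description):

```python
import numpy as np

def ema_update(ema_params, params, momentum=0.999):
    """Return the updated EMA copy: theta' <- m * theta' + (1 - m) * theta_i."""
    return {name: momentum * ema_params[name] + (1.0 - momentum) * params[name]
            for name in params}

# Toy usage: the "training" parameters jump around, the EMA copy moves slowly.
params = {"w": np.array([1.0, 1.0])}
ema_params = {name: value.copy() for name, value in params.items()}
for step in range(3):
    params = {"w": params["w"] + 1.0}          # stand-in for an optimizer step
    ema_params = ema_update(ema_params, params, momentum=0.9)
print(params["w"], ema_params["w"])
```

Only the EMA copy would be used when sampling; the optimizer continues to update the raw parameters.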

Combining all techniques together

Employing Techniques 1–5, we build NCSNs that can readily work across a large number of different datasets, including high resolution images that were previously out of reach with score-based generative modeling. Our modified model is named NCSNv2. For a complete description of experimental details and more results, please refer to Appendices B and C.

Table 1: Inception scores (higher is better) and FID scores (lower is better). An asterisk marks the best value in each column for each dataset.

Model                       Inception ↑     FID ↓

CIFAR-10 Unconditional
PixelCNN                    4.60            65.93
IGEBM                       6.02            40.58
WGAN-GP                     7.86 ± .07      36.4
SNGAN                       8.22 ± .05      21.7
NCSN                        8.87 ± .12 *    25.32
NCSN (w/ denoising)         7.32 ± .12      29.8
NCSNv2 (w/o denoising)      8.73 ± .13      31.75
NCSNv2 (w/ denoising)       8.40 ± .07      10.87 *

CelebA 64×64
NCSN (w/o denoising)        -               26.89
NCSN (w/ denoising)         -               25.30
NCSNv2 (w/o denoising)      -               28.86
NCSNv2 (w/ denoising)       -               10.23 *

Quantitative results: We consider CIFAR-10 $32\times 32$ and CelebA $64\times 64$, where NCSN and NCSNv2 both produce reasonable samples. We report FIDs (lower is better) every 5000 iterations of training on 1000 samples and give results in Fig. 5 (with denoising) and Fig. 10 (without denoising, deferred to Section C.1). As shown in Figs. 5 and 10, we observe that the FID scores of NCSNv2 (with all techniques applied) are on average better than those of NCSN, and have much smaller variance over the course of training. Following prior work, we select checkpoints with the smallest FIDs (on 1000 samples) encountered during training, and compute full FID and Inception scores on more samples from them. As shown by results in Table 1, NCSNv2 (w/ denoising) is able to significantly improve the FID scores of NCSN on both CIFAR-10 and CelebA, while bearing a slight loss of Inception scores on CIFAR-10. However, we note that Inception and FID scores have known issues and should be interpreted with caution, as they may not correlate with visual quality in the expected way. In particular, they can be sensitive to slight noise perturbations, as shown by the difference of scores with and without denoising in Table 1. To verify that NCSNv2 indeed generates better images than NCSN, we provide additional uncurated samples in Section C.4 for visual comparison.

Ablation studies: We conduct ablation studies to isolate the contributions of different techniques. We partition all techniques into three groups: (i) Technique 5, (ii) Techniques 1, 2, and 4, and (iii) Technique 3, where different groups can be applied simultaneously. Techniques 1, 2, and 4 are grouped together because Techniques 1 and 2 collectively determine the set of noise scales, and to sample from NCSNs trained with these noise scales we need Technique 4 to configure annealed Langevin dynamics properly. We test the performance of successively removing groups (iii), (ii), (i) from NCSNv2, and report results in Fig. 5 for sampling with denoising and in Fig. 10 (Section C.1) for sampling without denoising. All groups of techniques improve over the vanilla NCSN. Although the FID scores are not strictly increasing when removing (iii), (ii), and (i) progressively, we note that FIDs may not always correlate well with sample quality. In fact, we do observe decreasing sample quality by visual inspection (see Section C.4), and combining all techniques gives the best samples.

Towards higher resolution: The original NCSN only succeeds at generating images of low resolution. In fact, it was only tested on MNIST $28\times 28$ and CelebA/CIFAR-10 $32\times 32$. For slightly larger images such as CelebA $64\times 64$, NCSN can generate images of consistent global structure, yet with strong color artifacts that are easily noticeable (see Fig. 4 and compare Fig. 9(c) with Fig. 9(d)). For images with resolutions beyond $96\times 96$, NCSN will completely fail to produce samples with correct structure or color (see Fig. 7). All samples shown here are generated without the denoising step, but since $\sigma_{L}$ is very small, they are visually indistinguishable from ones with the denoising step.

By combining Techniques 1–5, NCSNv2 can work on images of much higher resolution. Note that we directly calculated the noise scales for training NCSNs, and computed the step size for annealed Langevin dynamics sampling without manual hyper-parameter tuning. The network architectures are the same across datasets, except that for ones with higher resolution we use more layers and more filters to ensure the receptive field and model capacity are large enough (see details in Section B.1). In Figs. 6 and 1, we show NCSNv2 is capable of generating high-fidelity image samples with resolutions ranging from $96\times 96$ to $256\times 256$. To show that this high sample quality is not a result of dataset memorization, we provide the loss curves for training/test, as well as nearest neighbors for samples in Section C.5. In addition, NCSNv2 can produce smooth interpolations between two given samples as in Fig. 6 (details in Section B.2), indicating the ability to learn generalizable image representations.

Conclusion

Motivated by both theoretical analyses and empirical observations, we propose a set of techniques to improve score-based generative models. Our techniques significantly improve the training and sampling processes, lead to better sample quality, and enable high-fidelity image generation at high resolutions. Although our techniques work well without manual tuning, we believe that the performance can be improved even more by fine-tuning various hyper-parameters. Future directions include theoretical understandings on the sample quality of score-based generative models, as well as alternative noise distributions to Gaussian perturbations.

Broader Impact

Our work represents another step towards more powerful generative models. While we focused on images, it is quite likely that similar techniques could be applicable to other data modalities such as speech or behavioral data (in the context of imitation learning). Like other generative models that have been previously proposed, such as GANs and WaveNets, score models have a multitude of applications. Among many other applications, they could be used to synthesize new data automatically, detect anomalies and adversarial examples, and also improve results in key tasks such as semi-supervised learning and reinforcement learning. In turn, these techniques can have both positive and negative impacts on society, depending on the application. In particular, the models we trained on image datasets can be used to synthesize new images that are hard to distinguish from real ones by humans. Synthetic images from generative models have already been used to deceive humans in malicious ways. There are also positive uses of these technologies, for example in the arts and as a tool to aid design in engineering. We also note that our models have been trained on datasets that have biases (e.g., CelebA is not gender-balanced), and the learned distribution is likely to have inherited them, in addition to others that are caused by the so-called inductive bias of models.

Acknowledgments and Disclosure of Funding

The authors would like to thank Aditya Grover, Rui Shu and Shengjia Zhao for reviewing an early draft of this paper, as well as Gabby Wright and Sharon Zhou for resolving technical issues in computing HYPE∞ scores. This research was supported by NSF (#1651565, #1522054, #1733686), ONR (N00014-19-1-2145), AFOSR (FA9550-19-1-0024), and Amazon AWS.

Appendix A Proofs

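Let $\hat{p}_{\sigma_{1}}(\mathbf{x})\triangleq\frac{1}{N}\sum_{i=1}^{N}p^{(i)}(\mathbf{x})$, where $p^{(i)}(\mathbf{x})\triangleq\mathcal{N}(\mathbf{x}\mid\mathbf{x}^{(i)},\sigma_{1}^{2}\mathbf{I})$. With $r^{(i)}(\mathbf{x})\triangleq\frac{p^{(i)}(\mathbf{x})}{\sum_{k=1}^{N}p^{(k)}(\mathbf{x})}$, the score function is $\nabla_{\mathbf{x}}\log\hat{p}_{\sigma_{1}}(\mathbf{x})=\sum_{i=1}^{N}r^{(i)}(\mathbf{x})\nabla_{\mathbf{x}}\log p^{(i)}(\mathbf{x})$. Moreover,

$$\mathbb{E}_{p^{(i)}(\mathbf{x})}\big[r^{(j)}(\mathbf{x})\big]\leq\frac{1}{2}\exp\left(-\frac{\lVert\mathbf{x}^{(i)}-\mathbf{x}^{(j)}\rVert_{2}^{2}}{8\sigma_{1}^{2}}\right).$$

Proof. According to the definition of $\hat{p}_{\sigma_{1}}(\mathbf{x})$ and $r^{(i)}(\mathbf{x})$, we have

$$\nabla_{\mathbf{x}}\log\hat{p}_{\sigma_{1}}(\mathbf{x})=\frac{\sum_{i=1}^{N}\nabla_{\mathbf{x}}p^{(i)}(\mathbf{x})}{\sum_{k=1}^{N}p^{(k)}(\mathbf{x})}=\sum_{i=1}^{N}\frac{p^{(i)}(\mathbf{x})}{\sum_{k=1}^{N}p^{(k)}(\mathbf{x})}\nabla_{\mathbf{x}}\log p^{(i)}(\mathbf{x})=\sum_{i=1}^{N}r^{(i)}(\mathbf{x})\nabla_{\mathbf{x}}\log p^{(i)}(\mathbf{x}).$$

Next, for $i\neq j$,

$$\mathbb{E}_{p^{(i)}(\mathbf{x})}\big[r^{(j)}(\mathbf{x})\big]=\int\frac{p^{(i)}(\mathbf{x})p^{(j)}(\mathbf{x})}{\sum_{k=1}^{N}p^{(k)}(\mathbf{x})}\,\mathrm{d}\mathbf{x}\leq\int\frac{p^{(i)}(\mathbf{x})p^{(j)}(\mathbf{x})}{p^{(i)}(\mathbf{x})+p^{(j)}(\mathbf{x})}\,\mathrm{d}\mathbf{x}=\frac{1}{2}\int\frac{2}{\frac{1}{p^{(i)}(\mathbf{x})}+\frac{1}{p^{(j)}(\mathbf{x})}}\,\mathrm{d}\mathbf{x}\stackrel{(1)}{\leq}\frac{1}{2}\int\sqrt{p^{(i)}(\mathbf{x})p^{(j)}(\mathbf{x})}\,\mathrm{d}\mathbf{x}=\frac{1}{2}\exp\left(-\frac{\lVert\mathbf{x}^{(i)}-\mathbf{x}^{(j)}\rVert_{2}^{2}}{8\sigma_{1}^{2}}\right),$$

where $(1)$ is due to the geometric mean–harmonic mean inequality, and the final equality follows from completing the square in the Gaussian integral. ∎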

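Proof. Since $\mathbf{x}\sim\mathcal{N}(\mathbf{0},\sigma^{2}\mathbf{I})$, we have $s\triangleq\lVert\mathbf{x}\rVert_{2}^{2}/\sigma^{2}\sim\chi^{2}_{D}$, i.e.,

$$p_{s}(s)=\frac{1}{2^{D/2}\Gamma(D/2)}s^{D/2-1}e^{-s/2}.$$

Because $r=\lVert\mathbf{x}\rVert_{2}=\sigma\sqrt{s}$, we can use the change of variables formula to get

$$p(r)=\frac{2r}{\sigma^{2}}\,p_{s}\!\left(\frac{r^{2}}{\sigma^{2}}\right)=\frac{1}{2^{D/2-1}\Gamma(D/2)}\frac{r^{D-1}}{\sigma^{D}}\exp\left(-\frac{r^{2}}{2\sigma^{2}}\right).$$

For the limit, note that $r^{2}$ is a sum of $D$ i.i.d. terms $x_{k}^{2}$, each with mean $\sigma^{2}$ and variance $2\sigma^{4}$, so the central limit theorem gives $\sqrt{D}(r^{2}/D-\sigma^{2})\stackrel{d}{\to}\mathcal{N}(0,2\sigma^{4})$. Applying the delta method with the square-root function yields $\sqrt{D}(r/\sqrt{D}-\sigma)\stackrel{d}{\to}\mathcal{N}(0,\sigma^{2}/2)$, and therefore $r-\sqrt{D}\sigma\stackrel{d}{\to}\mathcal{N}(0,\sigma^{2}/2)$. ∎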

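Let $\gamma=\frac{\sigma_{i-1}}{\sigma_{i}}$. For $\alpha=\epsilon\cdot\frac{\sigma_{i}^{2}}{\sigma_{L}^{2}}$ (as in Algorithm 1), we have $\mathbf{x}_{T}\sim\mathcal{N}(\mathbf{0},s^{2}_{T}\mathbf{I})$, where

$$\frac{s_{T}^{2}}{\sigma_{i}^{2}}=\left(1-\frac{\epsilon}{\sigma_{L}^{2}}\right)^{2T}\left(\gamma^{2}-\frac{2\epsilon}{\sigma_{L}^{2}-\sigma_{L}^{2}\left(1-\frac{\epsilon}{\sigma_{L}^{2}}\right)^{2}}\right)+\frac{2\epsilon}{\sigma_{L}^{2}-\sigma_{L}^{2}\left(1-\frac{\epsilon}{\sigma_{L}^{2}}\right)^{2}}. \tag{6}$$

Proof. In this setting, $\mathbf{x}_{0}\sim p_{\sigma_{i-1}}(\mathbf{x})=\mathcal{N}(\mathbf{x}\mid\mathbf{0},\sigma_{i-1}^{2}\mathbf{I})$ and

$$\mathbf{x}_{t+1}\leftarrow\mathbf{x}_{t}+\alpha\nabla_{\mathbf{x}}\log p_{\sigma_{i}}(\mathbf{x}_{t})+\sqrt{2\alpha}\,\mathbf{z}_{t}=\mathbf{x}_{t}-\alpha\frac{\mathbf{x}_{t}}{\sigma_{i}^{2}}+\sqrt{2\alpha}\,\mathbf{z}_{t},$$

where $\mathbf{z}_{t}\sim\mathcal{N}(\mathbf{0},\mathbf{I})$. Therefore, the variance of $\mathbf{x}_{t}$ satisfies

$$\operatorname{Var}[\mathbf{x}_{t}]=\begin{cases}\sigma_{i-1}^{2}\mathbf{I}&\text{if }t=0,\\\left(1-\frac{\alpha}{\sigma_{i}^{2}}\right)^{2}\operatorname{Var}[\mathbf{x}_{t-1}]+2\alpha\mathbf{I}&\text{otherwise}.\end{cases}$$

Now let $\mathbf{v}\triangleq\frac{2\alpha}{1-\left(1-\frac{\alpha}{\sigma_{i}^{2}}\right)^{2}}\mathbf{I}$, we have

$$\operatorname{Var}[\mathbf{x}_{t}]-\mathbf{v}=\left(1-\frac{\alpha}{\sigma_{i}^{2}}\right)^{2}\left(\operatorname{Var}[\mathbf{x}_{t-1}]-\mathbf{v}\right),$$

and hence, since every $\mathbf{x}_{t}$ is a zero-mean Gaussian,

$$s_{T}^{2}\mathbf{I}=\operatorname{Var}[\mathbf{x}_{T}]=\left(1-\frac{\alpha}{\sigma_{i}^{2}}\right)^{2T}\left(\sigma_{i-1}^{2}\mathbf{I}-\mathbf{v}\right)+\mathbf{v}. \tag{7}$$

Substituting $\epsilon\,\sigma_{i}^{2}/\sigma_{L}^{2}$ for $\alpha$ in Eq. 7 and dividing by $\sigma_{i}^{2}$, we immediately obtain Eq. 6. ∎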

Appendix B Experimental details

The original NCSN uses a network structure based on RefineNet, a classical architecture for semantic segmentation. There are three major modifications to the original RefineNet in NCSN: (i) adding an enhanced version of conditional instance normalization (named CondInstanceNorm++) for every convolutional layer; (ii) replacing max pooling with average pooling in RefineNet blocks; and (iii) using dilated convolutions in the ResNet backend of RefineNet. We use exactly the same architecture for NCSN experiments, but for NCSNv2 or any other architecture implementing Technique 3, we apply the following modifications: (i) setting the number of classes in CondInstanceNorm++ to 1 (which we name InstanceNorm++); (ii) changing average pooling back to max pooling; and (iii) removing all normalization layers in RefineNet blocks. Here (ii) and (iii) do not affect the results much, but they are included because we hope to minimize the number of unnecessary changes to the standard RefineNet architecture (the original RefineNet blocks use max pooling and have no normalization layers). We name a ResNet block (with InstanceNorm++ instead of BatchNorm) "ResBlock", and a RefineNet block "RefineBlock". When CondInstanceNorm++ is added, we name them "CondResBlock" and "CondRefineBlock" respectively. We use the ELU activation function throughout all architectures.

To ensure sufficient capacity and receptive fields, the network structures for images of different resolutions have different numbers of layers and filters. We summarize the architectures in Table 2 and Table 3.

We use the Adam optimizer for all models. When Technique 3 is not in effect, we use a learning rate of $0.001$; otherwise we use a learning rate of $0.0001$ to avoid loss explosion. We set the $\epsilon$ parameter of Adam to $10^{-3}$ for FFHQ and $10^{-8}$ otherwise. We provide other hyperparameters in Table 4, where $\sigma_{1}$, $L$, $T$, and $\epsilon$ of NCSNv2 are all chosen in accordance with our proposed techniques. When the training set contains more than 60000 images, we randomly sample 10000 of them and compute the maximum pairwise distance, which is set as $\sigma_{1}$ for NCSNv2.
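The choice of $\sigma_{1}$ described above can be sketched as follows. This is a minimal illustration that assumes images are given as flattened, equal-length sequences of floats and that the distance is Euclidean. The function name `estimate_sigma_1` and its parameters are hypothetical, and the pure-Python loop over all pairs is only practical for small inputs; at the scale of 10000 images a vectorized implementation would be needed.

```python
import itertools
import math
import random

def estimate_sigma_1(data, subsample_threshold=60000, subsample_size=10000, seed=0):
    """Return the maximum pairwise Euclidean distance, to be used as sigma_1.

    `data` is a list of flattened images (equal-length sequences of floats).
    Datasets with more than `subsample_threshold` items are randomly
    subsampled to `subsample_size` items before distances are computed.
    """
    if len(data) > subsample_threshold:
        data = random.Random(seed).sample(data, subsample_size)
    return max(math.dist(a, b) for a, b in itertools.combinations(data, 2))

# Toy usage with three 2-D points standing in for flattened images.
toy_points = [[0.0, 0.0], [3.0, 4.0], [1.0, 1.0]]
print(estimate_sigma_1(toy_points))  # largest distance is between the first two points
```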

B.2 Additional settings

Datasets: We use the following datasets in our experiments: CIFAR-10, CelebA, LSUN, and FFHQ. CIFAR-10 contains 50000 training images and 10000 test images, all of resolution $32\times 32$. CelebA contains 162770 training images and 19962 test images with various resolutions. For preprocessing, we first center crop them to size $140\times 140$, and then resize them to $64\times 64$. We choose the church_outdoor, bedroom and tower categories in the LSUN dataset. They contain 126227, 3033042, and 708264 training images respectively, and all have 300 validation images. For preprocessing, we first resize them so that the smallest dimension of each image is $96$ (for church_outdoor) or $128$ (for bedroom and tower), and then center crop them so that their widths and heights are equal. Finally, the FFHQ dataset consists of 70000 high-quality facial images at resolution $1024\times 1024$. We resize them to $256\times 256$ in our experiments. Because FFHQ does not have an official test dataset, we randomly select 63000 images for training and use the remaining 7000 as the test dataset. In addition, we apply random horizontal flips as data augmentation in all cases.
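The geometry of the two preprocessing pipelines can be sketched without any image library. The helper names below are hypothetical, the rounding rule for the longer side is an assumption, and the input sizes in the usage lines are illustrative rather than taken from the datasets.

```python
def resize_shorter_side(width, height, target):
    """Return the new (width, height) after scaling the shorter side to `target`."""
    if width <= height:
        return target, round(height * target / width)
    return round(width * target / height), target

def center_crop_box(width, height, crop_w, crop_h):
    """Return (left, top, right, bottom) of a centered crop of size crop_w x crop_h."""
    left = (width - crop_w) // 2
    top = (height - crop_h) // 2
    return left, top, left + crop_w, top + crop_h

def lsun_geometry(width, height, target):
    """Resize so the shorter side equals `target`, then center crop to a square."""
    new_w, new_h = resize_shorter_side(width, height, target)
    side = min(new_w, new_h)
    return (new_w, new_h), center_crop_box(new_w, new_h, side, side)

# LSUN-style: a hypothetical 400x300 image with target 96 (church_outdoor setting).
print(lsun_geometry(400, 300, 96))

# CelebA-style: a 140x140 center crop of a hypothetical 178x218 image,
# which would then be resized to 64x64.
print(center_crop_box(178, 218, 140, 140))
```

The returned boxes use the (left, top, right, bottom) convention, so they can be passed to whichever image library performs the actual cropping and resampling.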

Metrics: We use FID and HYPE∞ scores for quantitative comparison of results. When computing FIDs on CIFAR-10 $32\times 32$, we measure the distance between the statistics of samples and training data. When computing FIDs on CelebA $64\times 64$, we follow the settings of prior work, where the distance is measured between 10000 samples and the test dataset. We use the official website https://hype.stanford.edu for computing HYPE∞ scores. For model selection, we again follow prior work: we compute FID scores on 1000 samples every 5000 training iterations and choose the checkpoint with the smallest FID for computing both full FID scores (with more samples from it) and the HYPE∞ scores.
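The model selection rule above amounts to taking the checkpoint with the lowest FID among those evaluated. A minimal sketch follows; the function names are hypothetical, and the FID values in the usage example are made-up placeholders, not results from the paper.

```python
def evaluation_iterations(total_iterations, every=5000):
    """Training iterations at which a checkpoint is evaluated."""
    return list(range(every, total_iterations + 1, every))

def select_checkpoint(fid_by_iteration):
    """Return (iteration, fid) for the checkpoint with the smallest FID."""
    best_iteration = min(fid_by_iteration, key=fid_by_iteration.get)
    return best_iteration, fid_by_iteration[best_iteration]

# Placeholder FIDs (each would be computed on 1000 samples); values are illustrative only.
placeholder_fids = dict(zip(evaluation_iterations(20000), [80.0, 41.5, 44.0, 43.2]))
print(select_checkpoint(placeholder_fids))
```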

Training: We use the Adam optimizer with default hyperparameters. The learning rates and batch sizes are provided in Section B.1 and Table 4. We observe that for images at resolution $128\times 128$ or $256\times 256$, training can be unstable when the loss is near convergence. We note, however, that this is a well-known problem of the Adam optimizer, and can be mitigated by techniques such as AMSGrad. We trained all models on Nvidia Tesla V100 GPUs.

Settings for Section 3.3: The loss curves in Fig. 3 are results of two settings: (i) Techniques 1, 2, 4 and 5 are in effect, but the model architecture is the same as the original NCSN (i.e., Table 2(a)); and (ii) all techniques are in effect, i.e., the model is the same as NCSNv2 depicted in Table 3(a). We apply EMA with momentum 0.9 to smooth the curves in Fig. 3. We observe that despite being simpler to implement, the new noise conditioning method proposed in Technique 3 performs as well as the original and arguably more complex one in terms of the training loss. See the ablation studies in Section 6 and Section C.4 for additional results.
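Curve smoothing with an exponential moving average can be sketched as below. The update convention (new value weighted by one minus the momentum, first value used as initialization) is an assumption, since the text only states the momentum; `ema_smooth` is a hypothetical helper name.

```python
def ema_smooth(values, momentum=0.9):
    """Smooth a sequence with an exponential moving average.

    Assumed convention: running = momentum * running + (1 - momentum) * value,
    with the first value used to initialize the running average.
    """
    smoothed = []
    running = None
    for value in values:
        if running is None:
            running = value
        else:
            running = momentum * running + (1.0 - momentum) * value
        smoothed.append(running)
    return smoothed

# A noisy toy loss curve; the smoothed curve reacts slowly to the spike at index 2.
toy_losses = [1.0, 1.0, 5.0, 1.0, 1.0]
print(ema_smooth(toy_losses))
```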

Interpolation: We can interpolate between two different samples from NCSN/NCSNv2 via interpolating the Gaussian random noise injected by annealed Langevin dynamics. Specifically, suppose we have a total of $L$ noise levels, and for each noise level we run $T$ steps of Langevin dynamics. Let $\{\mathbf{z}_{ij}\}_{1\leq i\leq L,1\leq j\leq T}\triangleq\{\mathbf{z}_{11},\mathbf{z}_{12},\cdots,\mathbf{z}_{1T},\mathbf{z}_{21},\mathbf{z}_{22},\cdots,\mathbf{z}_{2T},\cdots,\mathbf{z}_{L1},\mathbf{z}_{L2},\cdots,\mathbf{z}_{LT}\}$ denote the set of all Gaussian noise used in this procedure, where $\mathbf{z}_{ij}$ is the noise injected at the $j$-th iteration of Langevin dynamics corresponding to the $i$-th noise level. Next, suppose we have two samples $\mathbf{x}^{(1)}$ and $\mathbf{x}^{(2)}$ with the same initialization $\mathbf{x}_{0}$, and denote the corresponding sets of Gaussian noise as $\{\mathbf{z}^{(1)}_{ij}\}_{1\leq i\leq L,1\leq j\leq T}$ and $\{\mathbf{z}^{(2)}_{ij}\}_{1\leq i\leq L,1\leq j\leq T}$ respectively. We can generate $N$ interpolated samples between $\mathbf{x}^{(1)}$ and $\mathbf{x}^{(2)}$, where for the $k$-th interpolated sample we use Gaussian noise $\{\cos\big(\frac{k\pi}{2(N+1)}\big)\mathbf{z}_{ij}^{(1)}+\sin\big(\frac{k\pi}{2(N+1)}\big)\mathbf{z}_{ij}^{(2)}\}_{1\leq i\leq L,1\leq j\leq T}$ and initialization $\mathbf{x}_{0}$.
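The noise mixing used for interpolation can be sketched as follows. This is an illustration only: noise is stored as nested lists indexed by noise level, Langevin step, and coordinate, and the helper names are hypothetical. Running annealed Langevin dynamics with the mixed noise is outside the scope of the sketch.

```python
import math
import random

def interpolation_weights(k, n):
    """Weights (cos, sin) for the k-th of n interpolated samples."""
    angle = k * math.pi / (2 * (n + 1))
    return math.cos(angle), math.sin(angle)

def sample_noise(num_levels, steps, dim, rng):
    """Standard Gaussian noise z[i][j] for every noise level i and step j."""
    return [
        [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(steps)]
        for _ in range(num_levels)
    ]

def interpolate_noise(z1, z2, k, n):
    """Mix two noise sets elementwise: cos(.) * z1[i][j] + sin(.) * z2[i][j]."""
    c, s = interpolation_weights(k, n)
    return [
        [[c * a + s * b for a, b in zip(v1, v2)] for v1, v2 in zip(row1, row2)]
        for row1, row2 in zip(z1, z2)
    ]

rng = random.Random(0)
z_first = sample_noise(num_levels=3, steps=2, dim=4, rng=rng)
z_second = sample_noise(num_levels=3, steps=2, dim=4, rng=rng)

n_interp = 5
mixed = [interpolate_noise(z_first, z_second, k, n_interp) for k in range(1, n_interp + 1)]
print(len(mixed), interpolation_weights(3, n_interp))
```

Because the squared weights sum to one, each mixed noise vector keeps unit variance when the two noise sets are independent, so every interpolated run still sees standard Gaussian noise. Setting $k=0$ recovers the first noise set and $k=N+1$ recovers the second, up to floating-point error.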

Appendix C Additional experimental results

We further demonstrate the stabilizing effect of EMA in Fig. 8, where FIDs are computed without the denoising step. As indicated by Figs. 8 and 4, EMA can stabilize training and remove sample artifacts regardless of whether denoising is used or not.

FID scores should be interpreted with caution because they may not align well with human judgement. For example, the NCSNv2 samples shown in Fig. 9(d) have an FID score of 28.9 (without denoising), worse than the 26.9 (without denoising) of the NCSN samples in Fig. 9(c), yet they are arguably much more visually appealing. To investigate whether FID scores align well with human ratings, we use the HYPE∞ score (higher is better), a metric of sample quality based on human evaluation, to compare the two models that generated the samples in Figs. 9(c) and 9(d). We provide full results in Table 5, where all numbers except those for NCSN and NCSNv2 are taken directly from prior work. As Table 5 shows, our NCSNv2 achieves 37.3 on CelebA $64\times 64$, which is comparable to ProgressiveGAN, whereas NCSN achieves 19.8. This ranking is the opposite of the one indicated by FIDs.

Finally, we provide ablation results without the denoising step in Fig. 10. They are qualitatively similar to those in Fig. 5, where results are computed with denoising.

C.2 Training and sampling speed

In Table 6, we provide the time cost for training and sampling from NCSNv2 models on various datasets considered in our experiments.

C.3 Color shifts

C.4 Additional results on ablation studies

As discussed in Section 6, we partition all techniques into three groups: (i) Technique 5, (ii) Techniques 1, 2, 4, and (iii) Technique 3, and investigate the performance of models after successively removing (iii), (ii), and (i) from NCSNv2. Aside from the FID curves in Figs. 5 and 10, we also provide samples from different models for visual inspection in Figs. 13 and 14. To generate these samples, we compute the FID scores on 1000 samples every 5000 training iterations for each considered model, and sample from the checkpoint with the smallest FID (the same setting used for model selection in Section B.2). From the samples in Figs. 13 and 14, we easily observe that removing any group of techniques leads to worse samples.

C.5 Generalization

First, we demonstrate that our NCSNv2 does not overfit to the training dataset by showing the curves of training/test loss in Fig. 15. The loss on the test dataset stays close to the loss on the training dataset throughout training, which indicates that our model does not simply memorize training data.

C.5.2 Nearest neighbors

C.5.3 Additional interpolation results

We generate samples from NCSNv2 and interpolate between them using the method described in Section B.2.

C.6 Additional uncurated samples