Diffusion Posterior Sampling for General Noisy Inverse Problems

Hyungjin Chung, Jeongsol Kim, Michael T. Mccann, Marc L. Klasky, Jong Chul Ye

Introduction

Diffusion models learn the implicit prior of the underlying data distribution by matching the gradient of the log density (i.e. Stein score; $\nabla_{\bm{x}}\log p({\bm{x}})$ ) (Song et al., 2021b). The prior can be leveraged when solving inverse problems, which aim to recover ${\bm{x}}$ from the measurement ${\bm{y}}$ , related through the forward measurement operator ${\mathcal{A}}$ and the detector noise ${\bm{n}}$ . When we know such forward models, one can incorporate the gradient of the log likelihood (i.e. $\nabla_{\bm{x}}\log p({\bm{y}}|{\bm{x}})$ ) in order to sample from the posterior distribution $p({\bm{x}}|{\bm{y}})$ . While this looks straightforward, the likelihood term is in fact analytically intractable in terms of diffusion models, due to their dependence on time $t$ . Due to its intractability, one often resorts to projections onto the measurement subspace (Song et al., 2021b; Chung et al., 2022b; Chung & Ye, 2022; Choi et al., 2021). However, the projection-type approach fails dramatically when 1) there is noise in the measurement, since the noise is typically amplified during the generative process due to the ill-posedness of the inverse problems; and 2) the measurement process is nonlinear.

One line of works that aim to solve noisy inverse problems run the diffusion in the spectral domain (Kawar et al., 2021; 2022) so that they can tie the noise in the measurement domain into the spectral domain via singular value decomposition (SVD). Nonetheless, the computation of SVD is costly and even prohibitive when the forward model gets more complex. For example, kawar2022denoising only considered seperable Gaussian kernels for deblurring, since they were restricted to the family of inverse problems where they could effectively perform the SVD. Hence, the applicability of such methods is restricted, and it would be useful to devise a method to solve noisy inverse problems without the computation of SVD. Furthermore, while diffusion models were applied to various inverse problems including inpainting (song2020score; chung2022come; kawar2022denoising; chung2022improving), super-resolution (choi2021ilvr; chung2022come; kawar2022denoising), colorization (song2020score; kawar2022denoising; chung2022improving), compressed-sensing MRI (CS-MRI) (song2022solving; chung2022score; chung2022come), computed tomography (CT) (song2022solving; chung2022improving), etc., to our best knowledge, all works so far considered linear inverse problems only, and have not explored nonlinear inverse problems.

In this work, we devise a method to circumvent the intractability of posterior sampling by diffusion models via a novel approximation, which can be generally applied to noisy inverse problems. Specifically, we show that our method can efficiently handle both the Gaussian and the Poisson measurement noise. Also, our framework easily extends to any nonlinear inverse problems, when the gradients can be obtained through automatic differentiation. We further reveal that a recently proposed method of manifold constrained gradients (MCG) (chung2022improving) is a special case of the proposed method when the measurement is noiseless. With a geometric interpretation, we further show that the proposed method is more likely to yield desirable sample paths in noisy setting than the previous approach (chung2022improving). In addition, the proposed method fully runs on the image domain rather than the spectral domain, thereby avoiding the computation of SVD for efficient implementation. With extensive experiments including various inverse problems—inpainting, super-resolution, (Gaussian/motion/non-uniform) deblurring, Fourier phase retrieval—we show that our method serves as a general framework for solving general noisy inverse problems with superior quality (Representative results shown in Fig. 1).

Background

Our aim is to recover the data generating distribution starting from the tractable distribution, which can be achieved by writing down the corresponding reverse SDE of (1) (anderson1982reverse):

where $dt$ corresponds to time running backward and $d\bar{\bm{w}}$ to the standard Wiener process running backward. The drift function now depends on the time-dependent score function $\nabla_{{\bm{x}}_{t}}\log p_{t}({\bm{x}}_{t})$ , which is approximated by a neural network ${\bm{s}}_{\theta}$ trained with denoising score matching (vincent2011connection):

where $\varepsilon\simeq 0$ is a small positive constant. Once $\theta^{*}$ is acquired through (3), one can use the approximation $\nabla_{{\bm{x}}_{t}}\log p_{t}({\bm{x}}_{t})\simeq{\bm{s}}_{\theta^{*}}({\bm{x}}_{t},t)$ as a plug-in estimateThe approximation error comes from optimization/parameterization error of the neural network. to replace the score function in (2). Discretization of (2) and solving using, e.g. Euler-Maruyama discretization, amounts to sampling from the data distribution $p({\bm{x}})$ , the goal of generative modeling.

2 Inverse problem solving with diffusion models

For various scientific problems, we have a partial measurement ${\bm{y}}$ that is derived from ${\bm{x}}$ . When the mapping ${\bm{x}}\mapsto{\bm{y}}$ is many-to-one, we arrive at an ill-posed inverse problem, where we cannot exactly retrieve ${\bm{x}}$ . In the Bayesian framework, one utilizes $p({\bm{x}})$ as the prior, and samples from the posterior $p({\bm{x}}|{\bm{y}})$ , where the relationship is formally established with the Bayes’ rule: $p({\bm{x}}|{\bm{y}})=p({\bm{y}}|{\bm{x}})p({\bm{x}})/p({\bm{y}})$ . Leveraging the diffusion model as the prior, it is straightforward to modify (2) to arrive at the reverse diffusion sampler for sampling from the posterior distribution:

In (4), we have two terms that should be computed: the score function $\nabla_{{\bm{x}}_{t}}\log p_{t}({\bm{x}}_{t})$ , and the likelihood $\nabla_{{\bm{x}}_{t}}\log p_{t}({\bm{y}}|{\bm{x}}_{t})$ . To compute the former term involving $p_{t}({\bm{x}})$ , we can simply use the pre-trained score function ${\bm{s}}_{\theta^{*}}$ . However, the latter term is hard to acquire in closed-form due to the dependence on the time $t$ , as there only exists explicit dependence between ${\bm{y}}$ and ${\bm{x}}_{0}$ .

Formally, the general form of the forward modelTo be precise, when we have signal-dependent noise model (e.g. Poisson), we cannot write the forward model with additive noise. We shall still write the forward model with additive noise for simplicity, and discuss which treatments are required when dealing with signal-dependent noise later in the paper. can be stated as

In order to circumvent using the likelihood term directly, alternating projections onto the measurement subspace is a widely used strategy (song2020score; chung2022score; chung2022come). Namely, one can disregard the likelihood term in (4), and first take an unconditional update with (2), and then take a projection step such that measurement consistency can be met, when assuming $\bm{{\bm{n}}}\simeq 0$ . Another line of work (jalal2021robust) solves linear inverse problems where ${\mathcal{A}}({\bm{x}})\triangleq{\bm{A}}{\bm{x}}$ and utilizes an approximation $\nabla_{{\bm{x}}_{t}}\log p_{t}({\bm{y}}|{\bm{x}})\simeq\frac{{\bm{A}}^{H}({\bm{y}}-{\bm{A}}{\bm{x}})}{\sigma^{2}}$ , which is obtained when $\bm{{\bm{n}}}$ is assumed to be Gaussian noise with variance $\sigma^{2}$ . Nonetheless, the equation is only correct when $t=0$ , while being wrong at all other noise levels that are actually used in the generative process. The incorrectness is counteracted by a heuristic of assuming higher levels of noise as $t\rightarrow T$ , such that $\nabla_{{\bm{x}}_{t}}\log p_{t}({\bm{y}}|{\bm{x}})\simeq\frac{{\bm{A}}^{H}({\bm{y}}-{\bm{A}}{\bm{x}})}{\sigma^{2}+\gamma_{t}^{2}}$ , where $\{\gamma_{t}\}_{t=1}^{T}$ are hyperparameters. While both lines of works aim to perform posterior sampling given the measurements and empirically work well on noiseless inverse problems, it should be noted that 1) they do not provide means to handle measurement noise, and 2) using such methods to solve nonlinear inverse problems either fails to work or is not straightforward to implement. The aim of this paper is to take a step toward a more general inverse problem solver, which can address noisy measurements and also scales effectively to nonlinear inverse problems.

Diffusion Posterior Sampling (DPS)

Recall that no analytical formulation for $p({\bm{y}}|{\bm{x}}_{t})$ exists. In order to exploit the measurement model $p({\bm{y}}|{\bm{x}}_{0})$ , we factorize $p({\bm{y}}|{\bm{x}}_{t})$ as follows:

where the second equality comes from that ${\bm{y}}$ and ${\bm{x}}_{t}$ are conditionally independent on ${\bm{x}}_{0}$ , as shown in Fig. 2. Here, $p({\bm{x}}_{0}|{\bm{x}}_{t})$ , as was shown with blue dotted lines in Fig. 2, is intractable in general. Note however, that for the case of diffusion models such as VP-SDE or DDPM, the forward diffusion can be simply represented by

so that we can obtain the specialized representation of the posterior mean as shown in Proposition 1 through the Tweedie’s approach (efron2011tweedie; kim2021noisescore). Detailed derivations can be found in Appendix A.

For the case of VP-SDE or DDPM sampling, $p({\bm{x}}_{0}|{\bm{x}}_{t})$ has the unique posterior mean at

By replacing $\nabla_{{\bm{x}}_{t}}\log p({\bm{x}}_{t})$ in (9) with the score estimate ${\bm{s}}_{\theta^{*}}({\bm{x}}_{t})$ , we can approximate the posterior mean from $p({\bm{x}}_{0}|{\bm{x}}_{t})$ as:

In fact, the result is closely related to the well established field of denoising. Concretely, consider the problem of retrieving the estimate of clean ${\bm{x}}_{0}$ from the given Gaussian noisy ${\bm{x}}_{t}$ . A classic result of Tweedie’s formula (robbins1992empirical; stein1981estimation; efron2011tweedie; kim2021noisescore) states that one can retrieve the empirical Bayes optimal posterior mean $\hat{\bm{x}}_{0}$ using the formula in (10).

implying that the outer expectation of $p({\bm{y}}|{\bm{x}}_{0})$ over the posterior distribution is replaced with inner expectation of ${\bm{x}}_{0}$ . In fact, this type of the approximation is closely related to the Jensen’s inequality, so we need the following definition to quantify the approximation error:

Let ${\bm{x}}$ be a random variable with distribution $p({\bm{x}})$ . For some function $f$ that may or may not be convex, the Jensen gap is defined as

where the expectation is taken over $p({\bm{x}})$ .

The following theorem derives the closed-form upper bound of the Jensen gap for the inverse problem from (6) when $\bm{{\bm{n}}}\sim{\mathcal{N}}(0,\sigma^{2}{\bm{I}})$ :

For the given measurement model (6) with $\bm{{\bm{n}}}\sim{\mathcal{N}}(0,\sigma^{2}{\bm{I}})$ , we have

where the approximation error can be quantified with the Jensen gap, which is upper bounded by

where $\|\nabla_{\bm{x}}{\mathcal{A}}({\bm{x}})\|:=\max_{{\bm{x}}}\|\nabla_{\bm{x}}{\mathcal{A}}({\bm{x}})\|$ and $m_{1}:=\int\|{\bm{x}}_{0}-\hat{\bm{x}}_{0}\|p({\bm{x}}_{0}|{\bm{x}}_{t})\,d{\bm{x}}_{0}$ .

Note that $\|\nabla_{\bm{x}}{\mathcal{A}}({\bm{x}})\|$ is finite in most of the inverse problems. This should not be confused with the ill-posedness of the inverse problems, which refers to the unboundedness of the inverse operator ${\mathcal{A}}^{-1}$ . Accordingly, if $m_{1}$ is also finite (which is the case for most of the distribution in practice), the Jensen gap in Theorem 1 can approach to 0 as $\sigma\rightarrow\infty$ , suggesting that the approximation error reduces with higher measurement noise. This may explain why our DPS works well for noisy inverse problems. In addition, although we have specified the measurement distribution to be Gaussian, we can also determine the Jensen gap for other measurement distributions (e.g. Poisson) in an analogous fashion.

By leveraging the result of Theorem 1, we can use the approximate gradient of the log likelihood

where the latter is now analytically tractable, as the measurement distribution is given.

2 Model dependent likelihood of the measurement

Note that we may have different measurement models $p({\bm{y}}|{\bm{x}}_{0})$ for each application. Two of the most common cases in inverse problems are the Gaussian noise and the Poisson noise. Here, we explore how our diffusion posterior sampling described above can be adapted to each case.

Gaussian noise. The likelihood function takes the form

where $n$ denotes the dimension of the measurement ${\bm{y}}$ . By differentiating $p({\bm{y}}|{\bm{x}}_{t})$ with respect to ${\bm{x}}_{t}$ , using Theorem 1 and (15), we get

where we have explicitly denoted $\hat{\bm{x}}_{0}:=\hat{\bm{x}}_{0}({\bm{x}}_{t})$ to emphasize that $\hat{\bm{x}}_{0}$ is a function of ${\bm{x}}_{t}$ . Consequently, taking the gradient $\nabla_{{\bm{x}}_{t}}$ amounts to taking the backpropagation through the network. Plugging in the result from Theorem 1 to (5) with the trained score function, we finally conclude that

where $\rho\triangleq 1/\sigma^{2}$ is set as the step size.

Poisson noise. The likelihood function for the Poisson measurements under the i.i.d. assumption is given as

where $j$ indexes the measurement bin. In most cases where the measured values are not too small, the model can be approximated by a Gaussian distribution with very high accuracyFor ${\bm{y}}_{j}>20$ , the approximation holds within 1% of the error (hubbard1970approximation).. Namely,

where we have used the standard approximation for the shot noise model $[{\mathcal{A}}({\bm{x}}_{0})]_{j}\simeq{\bm{y}}_{j}$ to arrive at the last equation (kingston2013detection). Then, similar to the Gaussian case, by differentiation and the use of Theorem 1, we have that

where $\|\bm{a}\|_{\bm{\Lambda}}^{2}\triangleq\bm{a}^{T}\bm{\Lambda}\bm{a}$ , and we have included ${\rho}$ to define the step size as in the Gaussian case. We can summarize our strategy for each noise model as follows:

Incorporation of (16) or (21) into the usual ancestral sampling (ho2020denoising) steps leads to Algorithm 1,2In the discrete implementation, we instead use $\zeta_{i}$ to express the step size. From the experiments, we observe that taking $\zeta_{i}=\zeta^{\prime}/\|{\bm{y}}-{\mathcal{A}}(\hat{\bm{x}}_{0}({\bm{x}}_{i}))\|$ , with $\zeta^{\prime}$ set to constant, yields highly stable results. See Appendix D for details in the choice of step size.. Here, we name our algorithm Diffusion Posterior Sampling (DPS), as we construct our method in order to perform sampling from the posterior distribution. Notice that unlike prior methods that limit their applications to linear inverse problems ${\mathcal{A}}({\bm{x}})\triangleq{\bm{A}}{\bm{x}}$ , our method is fully general in that we can also use nonlinear operators ${\mathcal{A}}(\cdot)$ . To show that this is indeed the case, in experimental section we take the two notoriously hard nonlinear inverse problems: Fourier phase retrieval and non-uniform deblurring, and show that our method has very strong performance even in such challenging problem settings.

Geometry of DPS and connection to manifold constrained gradient (MCG). Interestingly, our method in the Gaussian measurement case corresponds to the manifold constrained gradient (MCG) step that was proposed in chung2022improving, when setting ${\bm{W}}={\bm{I}}$ from chung2022improving. However, chung2022improving additionally performs projection onto the measurement subspace after the update step via (16), which can be thought of as corrections that are made for deviations from perfect data consistency. Borrowing the interpretation of diffusion models from chung2022improving, we compare the generative procedure geometrically. It was shown that in the context of diffusion models, a single denoising step via ${\bm{s}}_{\theta^{*}}$ corresponds to the orthogonal projection to the data manifold, and the gradient step $\nabla_{{\bm{x}}_{i}}\|{\bm{y}}-{\mathcal{A}}(\hat{\bm{x}}_{0})\|_{2}^{2}$ takes a step tangent to the current manifold. For noisy inverse problems, when taking projections on the measurement subspace after every gradient step as in chung2022improving, the sample may fall off the manifold, accumulate error, and arrive at the wrong solution, as can be seen in Fig. 3(a), due to the overly imposing the data consistency that works only for noiseless measurement. On the other hand, our method without the projections on the measurement subspace is free from such drawbacks for noisy measurement (see Fig. 3(b)). Accordingly, while projections on the measurement subspace are useful for noiseless inverse problems that chung2022improving tries to solve, they fail dramatically for noisy inverse problems that we try to solve. Finally, when used together with the projection steps on the measurement subspace, it was shown that choosing different ${\bm{W}}$ for different applications was necessary for MCG, whereas our method is free from such heuristics.

Experiments

Experimental setup. We test our experiment on two datasets that have diverging characteristic - FFHQ 256 $\times$ 256 (karras2019style), and Imagenet 256 $\times$ 256 (deng2009imagenet), on 1k validation images each. The pre-trained diffusion model for ImageNet was taken from dhariwal2021diffusion and was used directly without finetuning for specific tasks. The diffusion model for FFHQ was trained from scratch using 49k training data (to exclude 1k validation set) for 1M steps. All images are normalized to the range $ $. Forward measurement operators are specified as follows: (i) For box-type inpainting, we mask out 128$ \times $128 box region following chung2022improving, and for random-type we mask out 92% of the total pixels (all RGB channels). (ii) For super-resolution, bicubic downsampling is performed. (iii) Gaussian blur kernel has size$ 61\times 61 $with standard deviation of$ 3.0 $, and motion blur is randomly generated with the codehttps://github.com/LeviBorodenko/motionblur, with size$ 61\times 61 $and intensity value$ 0.5 $. The kernels are convolved with the ground truth image to produce the measurement. (iv) For phase retrieval, Fourier transform is performed to the image, and only the Fourier magnitude is taken as the measurement. (v) For nonlinear deblurring, we leverage the neural network approximated forward model as in tran2021explore. All Gaussian noise is added to the measurement domain with$ \sigma=0.05 $. Poisson noise level is set to$ \lambda=1.0$. More details including the hyper-parameters can be found in Appendix B,D.

We perform comparison with the following methods: Denoising diffusion restoration models (DDRM) (kawar2022denoising), manifold constrained gradients (MCG) (chung2022improving), Plug-and-play alternating direction method of multipliers (PnP-ADMM) (chan2016plug) using DnCNN zhang2017beyond in place of proximal mappings, total-variation (TV) sparsity regularized optimization method (ADMM-TV), and Score-SDE (song2020score). Note that song2020score only proposes a method for inpainting, and not for general inverse problems. However, the methodology of iteratively applying projections onto convex sets (POCS) was applied in the same way for super-resolution in iterative latent variable refinement (ILVR) (choi2021ilvr), and more generally to linear inverse problems in chung2022come; thus we simply refer to these methods as score-SDE henceforth.For a fair comparison, we used the same score function for all the different methods that are based on diffusion (i.e. DPS, DDRM, MCG, score-SDE).

For phase retrieval, we compare with three strong baselines that are considered standards: oversampling smoothness (OSS) (rodriguez2013oversampling), Hybrid input-output (HIO) (fienup1987phase), and error reduction (ER) algorithm (fienup1982phase). For nonlinear deblurring, we compare against the prior arts: blur kernel space (BKS) - styleGAN2 (tran2021explore), based on GAN priors, blur kernel space (BKS) - generic (tran2021explore), based on Hyper-Laplacian priors, and MCG.

Further experimental details are provided in Appendix D. For quantitative comparison, we focus on the following two widely-used perceptual metrics - Fréchet Inception Distance (FID), and Learned Perceptual Image Patch Similarity (LPIPS) distance, with further evaluation with standard metrics: peak signal-to-noise-ratio (PSNR), and structural similarity index (SSIM) provided in Appendix E.

Noisy linear inverse problems. We first test our method on diverse linear inverse problems with Gaussian measurement noises. The quantitative results shown in Tables 1,2 illustrate that the proposed method outperforms all the other comparison methods by large margins. Particularly, MCG and Score-SDE (or ILVR) are methods that rely on projections on the measurement subspace, where the generative process is controlled such that the measurement consistency is perfectly met. While this is useful for noiseless (or negligible noise) problems, in the case where we cannot ignore noise, the solutions overfit to the corrupted measurement (for further discussion, see Appendix C.1). In Fig. 4, we specifically compare our methods with DDRM and PnP-ADMM, which are two methods that are known to be robust to measurement noise. Our method is able to provide high-quality reconstructions that are crisp and realistic on all tasks. On the other hand, we see that DDRM performs poorly on image inpainting tasks where the dimensionality of the measurements are very low, and tend to produce blurrier results on both SR, and deblurring tasks. We further note that DDRM relies on SVD, and hence is only able to solve problems where the forward measurement matrix can be efficiently implemented (e.g. separable kernel in the case of deblurring). Hence, while one can solve Gaussian deblurring, one cannot solve problems such as motion deblur, where the point spread function (PSF) is much more complex. Contrarily, our method is not restricted by such conditions, and can be always used regardless of the complexity. The results of the Poisson noisy linear inverse problems are presented in Fig. 5. Consistent with the Gaussian case, DPS is capable of producing high quality reconstructions that closely mimic the ground truth. From the experiments, we further observe that the weighted least squares method adopted in Algorithm 2 works best compared to other choices that can be made for Poisson inverse problems (for further analysis, see Appendix C.4).

Nonlinear inverse problems. We show the quantitative results of phase retrieval in Table 3, and the results of nonlinear deblurring in Table 4. Representative results are illustrated in Fig. 6.

We first observe that the proposed method is capable of highly accurate reconstruction for the given phase retrieval problem, capturing most of the high frequency details. However, we also observe that we do not always get high quality reconstructions. In fact, due to the non-uniqueness of the phase-retrieval under some conditions, widely used methods such as HIO are also dependent on the initializations (fienup1978reconstruction), and hence it is considered standard practice to first generate multiple reconstructions, and take the best sample. Following this, when reporting our quantitative metrics, we generate 4 different samples for all the methods, and report the metric based on the best samples. We see that DPS outperforms other methods by a large margin. For the case of nonlinear deblurring, we again see that our method performs the best, producing highly realistic samples. BKS-styleGAN2 (tran2021explore) leverages GAN prior and hence generates feasible human faces, but heavily distorts the identity. BKS-generic utilizes the Hyper-Laplacian prior (krishnan2009fast), but is unable to remove artifacts and noise properly. MCG amplifies noise in a similar way that was discussed in Fig. 7.

Conclusion

In this paper, we proposed diffusion posterior sampling (DPS) strategy for solving general noisy (both signal dependent/independent) inverse problems in imaging. Our method is versatile in that it can also be used for highly noisy and nonlinear inverse problems. Extensive experiments show that the proposed method outperforms existing state-of-the-art by large margins, and also covers the widest range of problems.

This work was supported by the National Research Foundation of Korea under Grant NRF-2020R1A2B5B03001980, by the Korea Medical Device Development Fund grant funded by the Korea government (the Ministry of Science and ICT, the Ministry of Trade, Industry and Energy, the Ministry of Health & Welfare, the Ministry of Food and Drug Safety) (Project Number: 1711137899, KMDF_PR_20200901_0015), and by the KAIST Key Research Institute (Interdisciplinary Research Group) Project.

References

Appendix A Proofs

Let $p({\bm{y}}|{\bm{\eta}})$ belong to the exponential family distribution

Marginal distribution $p({\bm{y}})$ could be expressed as

Then, the derivative of the marginal distribution $p({\bm{y}})$ with respect to ${\bm{y}}$ becomes

For the case of VP-SDE and DDPM forward sampling in (8), we have

which is a Gaussian distribution. The corresponding canonical decomposition is then given by

where $L=\frac{1}{\sqrt{2\pi\sigma^{2}}}\exp{(-\frac{1}{2\sigma^{2}})}$ .

As $\phi^{\prime}$ is continuous and bounded, we use the mean value theorem to get

Since $L$ is the minimal value for (34), we have that $L\leq\|\phi^{\prime}\|_{\infty}$ . Taking the limit $y\rightarrow x$ gives $|\phi^{\prime}(x)|\leq L$ , and thus $\|\phi^{\prime}\|_{\infty}\leq L$ . Hence

Since the derivative of $\phi^{\prime}$ is given as

and the maximum is attained when $x=1\pm\sigma^{2}\mu$ , we have

where $L=\frac{d}{\sqrt{2\pi\sigma^{2}}}e^{-1/2\sigma^{2}}$ .

where the second inequality comes from that each element of $\nabla_{\bm{z}}{\bm{\phi}}({\bm{z}})$ is bounded by $\frac{1}{\sqrt{2\pi\sigma^{2}}}\exp{\left(-\frac{1}{2\sigma^{2}}\right)}$ . ∎

Here, $f(\cdot):=h({\mathcal{A}}(\cdot))$ where ${\mathcal{A}}$ is the forward operator and $h({\bm{x}})$ is the multivariate normal distribution with mean ${\bm{y}}$ and the covariance $\sigma^{2}{\bm{I}}$ . Therefore, we have

where $dP({\bm{x}}_{0}|{\bm{x}}_{t})=p({\bm{x}}_{0}|{\bm{x}}_{t})\,d{\bm{x}}_{0}$ , (b) is the result of Lemma 3, (c) is from the intermediate value theorem, and (d) is from Proposition 2. ∎

Appendix B Inverse problem setup

Super-resolution. The forward model for super-resolution is defined as

Inpainting. For both box-type and random-type inpainting, the forward model reads

where ${\bm{P}}\in\{0,1\}^{n\times d}$ is the masking matrix that consists of elementary unit vectors.

Linear Deblurring. For both Gaussian and motion deblurring, the measurement model is given as

Nonlinear deblurring. We leverage the nonlinear blurring process that was proposed in the GOPRO dataset (nah2017deep), where the blurring process is not defined as a convolution, but rather as an integration of sharp images through the time frame. Specifically, in the discrete sense the measurement model reads

where $b({\bm{x}})={\bm{x}}^{1/2.2}$ is the nonlinear camera response function, and $T$ denotes the total time frames. While we could directly use (56) as our forward model, note that this is only possible when we have multiple sharp time frames at hand (e.g. when leveraging GOPRO dataset directly). Recently, there was an effort to distill the forward model through a neural network (tran2021explore). Particularly, when we have a set of blurry-sharp image pairs $\{({\bm{x}}_{i},{\bm{y}}_{i})\}_{i=1}^{N}$ , one can train a neural network to estimate the forward model as

where ${\mathcal{G}}_{\phi}({\bm{x}}_{i},{\bm{y}}_{i})$ extracts the implicit kernel information from the pair, and ${\mathcal{F}}_{\phi}$ takes in ${\bm{x}}_{i},{\mathcal{G}}_{\phi}({\bm{x}}_{i},{\bm{y}}_{i})$ to generate the blurry image. When using ${\mathcal{F}}_{\phi}$ at deployment to generate new synthetic blurry images, one can simply replace ${\mathcal{G}}_{\phi}({\bm{x}}_{i},{\bm{y}}_{i})$ with a Gaussian random vector ${\bm{k}}$ . Consequently, our forward model reads

where $k$ is the dimensionality of the latent vector ${\bm{k}}$ , and $\sigma_{k}^{2}$ is the variance of the vector.

Phase Retrieval. The forward measurement model is usually given as

where ${\bm{F}}$ denotes the 2D Discrete Fourier Trasform (DFT) matrix. In another words, the phase of the Fourier measurements are nulled, and our aim is to impute the missing phase information. As the problem is highly ill-posed, one typically incorporates the oversampling in order to induce the uniqueness condition (hayes1982reconstruction; bruck1979ambiguity), usually specified as

where ${\bm{P}}$ denotes the oversampling matrix with ratio $k/n$ .

Poisson noise simulation. To simulate the Poisson noise, we assume that each measurement pixel is a source of photon, where the number of photons is proportional to the discrete pixel value between 0 and 255. Thus, we sample noisy measurement values from the Poisson distribution with the mean value of the clean measurement values. Here, the clean measurement is ${\mathcal{A}}({\bm{x}}_{0})$ , which is an image after applying the forward operation. Then, we clip the values by and normalize to .

Appendix C Ablation studies and Discussion

As discussed in the experiments, methods that rely on projections fail dramatically when solving inverse problems with excessive amount of noise in the measurement. Even worse, for many problems such as SR or deblurring, noise gets amplified during the projection step due to the operator transpose ${\bm{A}}^{T}$ being applied. This downside is clearly depicted in Fig. 7, where we show the failure cases of MCG (chung2022improving) on noisy super-resolution. In contrast, our method does not rely on such projections, and thus is much more robust to the corrupted measurements. Notably, we find that MCG also fails dramatically in SR even when there is no noise existent, while it performs well on some of the other tasks (e.g. inpainting). We can conclude that the proposed method works generally well across a broader range of inverse problems, whether or not there is noise in the measurement.

There is one hyper-parameter in our DPS solver, and that is the step size. As this value is essentially the weight that is given to the likelihood (i.e. data consistency) of the inverse problem, we can expect that the values being too high or too low will cause problems. In Fig. 8, we show the trend of the reconstruction results when varying the step size $\zeta_{i}$ . Note that we instead use the notation $\zeta^{\prime}\triangleq\zeta_{i}\|{\bm{y}}-{\mathcal{A}}(\hat{\bm{x}}_{0}({\bm{x}}_{i}))\|$ for brevity. Here, we see that with low values of $\zeta^{\prime}<0.1$ , we achieve results that are not consistent with the given measurement. On the other hand, when we crank up the values too high ( $\zeta^{\prime}>5$ ), we observe saturation arfiacts that tend to amplify the noise. From our experiments, we conclude that it is best practice to set the $\zeta^{\prime}$ values in the range $[0.1,1.0]$ for best results. Specific values for all the experiments are presented in Appendix D.

C.3 Other step size schedules

While the proposed step size schedule in C.2 yields good results, there can be many other choices that one can take. In this section, we conduct an ablation study to compare against other choices. Specifically, we test 100 images for Gaussian deblurring (Gaussian noise, $\sigma=0.05$ ) on FFHQ, and compute the average perceptual distance (LPIPS) against the ground truth. We compare against the following three choices: 1) Linearly decaying steps $\zeta^{\prime}_{i}=\zeta^{\prime}_{rminit}\times\left(1-\frac{i}{N}\right)$ , 2) exponentially decaying steps $\zeta^{\prime}_{i}=\zeta^{\prime}_{\rm init}\times\gamma^{i}$ , with $\gamma=0.99$ , 3) directly using step size proportional to $1/\sigma^{2}$ as in eq. 16.

We present qualitative analysis in Fig. 9. From the figure, it is clear that the proposed schedule produces the best result that most closely matches the ground truth in terms of perception. For decaying step sizes, we often yield results that are coarsely similar to the ground truth, but varies in the fine details, as the information about the measurement is less incorporated in the later steps of the diffusion. From Fig. 9, we see that taking step sizes proportional to $1/\sigma^{2}$ , motivated by direct derivation from the gaussian forward model, yields poor results. We see similar results with the quantitative metrics presented in Table. 5.

C.4 Poisson inverse problems

For inverse problems corrupted with Poisson noise, more care needs to be taken compared to the Gaussian noise counterparts, as the noise is signal-dependent and therefore harder to account for. In this section, we discuss the different choices of likelihood functions that can be made, and clarify the choice (20) used in all our experiments. One straightforward option is to directly use the Poisson likelihood model without the Gaussian approximation. From (17), we have that

which we refer to as Poisson-direct. Moreover, one can use the Gaussian approximated version of the Poisson measurement model given in (18)

which we refer to as Poisson-Gaussian. Next, we can use our choice in (19) to arrive at (20), which is the proposed method. Finally, while irrelevant with the noise model, we can also still use the same least sqaures (LS) method used for Gaussian noise (we refer to this method as Poisson-LS), as due to the central limit theorem, Poisson noise is nearly Gaussian in the high SNR level regime. In Fig. 10, we show representative results achieved by using each choice. From the experiments, we observe that Poisson-direct is unstable due to the log term in the likelihood, hence often diverging. We also observe that the residual ${\bm{y}}-{\mathcal{A}}(\hat{\bm{x}}_{0})$ fails to converge, hinting that the information from the measurement is not effectively integrated into the generative process. For Poisson-Gaussian, we see that the weighting term of the MSE is problematic, and this term prevents the process from proper convergence. Both the proposed method and Poisson-LS are stable, but Poisson-LS tends to blur out the relevant details from the reconstruction, while Poisson-shot preserves the high-frequency details better, and does not alter the identity of the ground truth person.

C.5 Sampling speed

As widely known in the literature, diffusion model-based methods are heavily dependent on the number of neural function evaluations (NFE). We investigate the performance in terms of LPIPS with respect to the change in NFEs in Fig. 11. For the experiment, we take the case of noisy SR $\times 4$ , which is a problem where DDRM tends to perform well, in contrast to other problems, e.g. inpainting. In the high NFE regime ( $\geq$ 250), DPS outperforms all the other methods, whereas in the low NFE regime ( $\leq$ 100), DDRM takes over. This can be attributed to DDIM (song2020denoising) sampling strategy that DDRM adopts, known for better performance in the low NFE regimes. Orthogonal to the direction presented in this work, devising a method to improve the performance of DPS in such regime with advanced samplers (e.g. lu2022dpm; liu2022pseudo) would benefit the method.

C.6 Limitations

Inheriting the characteristics of the diffusion model-based methods, the proposed method is relatively slow, as can be seen in the runtime analysis of Fig. 6(a). However, we note that our method is still faster than the GAN-based optimization methods, as we do not have to finetune the network itself. Moreover, the slow sampling speed could be mitigated with the incorporation of advanced samplers. Our method tends to preserve the high frequency details (e.g. beard, hair, texture) of the image, while methods such as DDRM tends to produce rather blurry images. In the qualitative view, and in the perception oriented metris (i.e. FID, LPIPS), our method clearly outperforms DDRM. In contrast, in standard distortion metrics such as PSNR, our method underperforms DDRM. This can be explained by the perception-distortion tradeoff phenomena (blau2018perception), where preserving high frequency details may actually penalize the reconstructions from having better distortion metrics. Finally, we note that the reconstruction quality of phase retrieval is not as robust as compared to other problems - linear inverse problems and nonlinear deblurring. Due to the stochasticity, we often encounter failures among the posterior samples, which can be potentially counteracted by simply taking multiple samples, as was done in other methods. Devising methods to stabilize the samplers, especially for nonlinear phase retrieval problems, would be a promising direction of research.

Appendix D Experimental details

Step size. Here, we list the step sizes used in our DPS algorithm for each problem setting.

Super-resolution: ${\zeta_{i}}=1/\|{\bm{y}}-{\mathcal{A}}(\hat{\bm{x}}_{0}({\bm{x}}_{i}))\|$

Inpainting: ${\zeta_{i}}=1/\|{\bm{y}}-{\mathcal{A}}(\hat{\bm{x}}_{0}({\bm{x}}_{i}))\|$

Deblurring (Gauss): ${\zeta_{i}}=1/\|{\bm{y}}-{\mathcal{A}}(\hat{\bm{x}}_{0}({\bm{x}}_{i}))\|$

Deblurring (motion): ${\zeta_{i}}=1/\|{\bm{y}}-{\mathcal{A}}(\hat{\bm{x}}_{0}({\bm{x}}_{i}))\|$

Super-resolution: ${\zeta_{i}}=1/\|{\bm{y}}-{\mathcal{A}}(\hat{\bm{x}}_{0}({\bm{x}}_{i}))\|$

Inpainting: ${\zeta_{i}}=1/\|{\bm{y}}-{\mathcal{A}}(\hat{\bm{x}}_{0}({\bm{x}}_{i}))\|$

Deblurring (Gauss): ${\zeta_{i}}=0.4/\|{\bm{y}}-{\mathcal{A}}(\hat{\bm{x}}_{0}({\bm{x}}_{i}))\|$

Deblurring (motion): ${\zeta_{i}}=0.6/\|{\bm{y}}-{\mathcal{A}}(\hat{\bm{x}}_{0}({\bm{x}}_{i}))\|$

Super-resolution: ${\zeta_{i}}=0.3/\|{\bm{y}}-{\mathcal{A}}(\hat{\bm{x}}_{0}({\bm{x}}_{i}))\|$

Deblurring (Gauss): ${\zeta_{i}}=0.3/\|{\bm{y}}-{\mathcal{A}}(\hat{\bm{x}}_{0}({\bm{x}}_{i}))\|$

Deblurring (motion): ${\zeta_{i}}=0.3/\|{\bm{y}}-{\mathcal{A}}(\hat{\bm{x}}_{0}({\bm{x}}_{i}))\|$

Phase retrieval: ${\zeta_{i}}=0.4/\|{\bm{y}}-{\mathcal{A}}(\hat{\bm{x}}_{0}({\bm{x}}_{i}))\|$

non-uniform deblurring: ${\zeta_{i}}=1.0/\|{\bm{y}}-{\mathcal{A}}(\hat{\bm{x}}_{0}({\bm{x}}_{i}))\|$

Score functions used. Pre-trained score function for the FFHQ dataset was taken from choi2021ilvrhttps://github.com/jychoi118/ilvr_adm, and the score function for the ImageNet dataset was taken from dhariwal2021diffusionhttps://github.com/openai/guided-diffusion. Unconditional version.

Compute time. All experiments were performed on a single RTX 2080Ti GPU. FFHQ experiments take about 95 seconds per image (1000 NFE), while ImageNet experiments take about 600 seconds per image (1000 NFE) for reconstruction due to the much larger network size.

Code availability. Code is available at https://github.com/DPS2022/diffusion-posterior-sampling.

D.2 Comparison methods

For DDRM, MCG, Score-SDE, and our method we use the same checkpoint for the score functions.

DDRM. All experiments were performed with the default setting of $\eta_{B}=1.0,\eta=0.85$ , and leveraging DDIM (song2020denoising) sampling for 20 NFEs. For the Gaussian deblurring experiment, the forward model was implemented by separable 1D convolutions for efficient SVD.

MCG. We set the same values of $\alpha$ that are used in our methods (DPS). At each step, the additional data consistency steps are applied as Euclidean projections onto the measurement set ${\mathcal{C}}:=\{{\bm{x}}_{i}|{\mathcal{A}}({\bm{x}}_{i})={\bm{y}}_{i},\,{\bm{y}}_{i}\sim p({\bm{y}}_{i}|{\bm{y}}_{0})\}$ .

Score-SDE. Score-SDE solves inverse problems by iteratively applying denoising followed by data consistency projections. As in MCG, we apply Euclidean projections onto the measurment set ${\mathcal{C}}$ .

PnP-ADMM. We take the implementation from the scico library (scico-2022). The parameters are set as follows: $\rho=0.2$ (ADMM penalty parameter), maxiter $=12$ . For proximal mappings, we utilize the pretrained DnCNN zhang2017beyond denoiser.

ADMM-TV. We minimize the following objective

where ${\bm{D}}=[{\bm{D}}_{x},{\bm{D}}_{y}]$ computes the finite difference with respect to both axes, $\lambda$ is the regularization weight, and $\|\cdot\|_{2,1}$ implements the isotropic TV regularization. Note that the optimization is solved with ADMM, and hence we have an additional parameter $\rho$ . We take the implementation from the scico library (scico-2022). The parameters $\lambda,\rho$ were found with grid search for each optimization problems. We use the following settings: $(\lambda,\rho)=(2.7e-2,1.4e-1)$ for deblurring, $(\lambda,\rho)=(2.7e-2,1.0e-2)$ for SR and inpainting.

ER, HIO, OSS. For all algorithms, we initialize a real signal by sampling from the normal distribution as the problem statement of (fienup1982phase). For the object domain constraint, we apply both the non-negative constraint and the finite support constraint. We set the number of iterations to 10,000 for sufficient convergence. To mitigate the instability of reconstruction depending on initialization, we repeat each algorithm four times per data and report the best one with the smallest mean squared error between the measurement and amplitude of the estimation in the Fourier domain. In the case of HIO and OSS, we set $\beta$ to 0.9, which yields best results.

Appendix E Further experimental results

We first provide quantitative evaluations based on the standard PSNR and SSIM metrics in Table 6 and Table 7.

Further experimental results that show the ability of our method to sample multiple reconstructions are presented in Figs. 13,14, 15, 16, 17, 18 (Gaussian measurement with $\sigma=0.05$ ), and Fig. 19,20 (Poisson measurement with $\lambda=1.0$ ).