Improving Diffusion Models for Inverse Problems using Manifold Constraints

Hyungjin Chung, Byeongsu Sim, Dohoon Ryu, Jong Chul Ye

Introduction

Diffusion models have shown impressive performance both as generative models themselves [song2020score, dhariwal2021diffusion], and also as unsupervised inverse problem solvers [song2020score, choi2021ilvr, chung2022come, kawar2022denoising] that do not require problem-specific training. Specifically, given a pre-trained unconditional score function (i.e. denoiser), solving the reverse stochastic differential equation (SDE) numerically would amount to sampling from the data generating distribution [song2020score]. For many different inverse problems (e.g. super-resolution [choi2021ilvr, chung2022come], inpainting [song2020score, chung2022come], compressed-sensing MRI (CS-MRI) [song2022solving, chung2022come], sparse view CT (SV-CT) [song2022solving], etc.), it was shown that simple incorporation of the measurement process produces satisfactory conditional samples, even when the model was not trained for the specific problem.

Nevertheless, for certain problems (e.g. inpainting), currently used algorithms often produce unsatisfactory results when implemented naively (e.g. boundary artifacts, as shown in Fig. 1 (b)). The authors in [lugmayr2022repaint] showed that in order to produce high quality reconstructions, one needs to iterate back and forth between the noising and the denoising step at least $>10$ times per iteration. These iterations are computationally demanding and should be avoided, considering that diffusion models are slow to sample from even without such iterations. On the other hand, a classic result of Tweedie’s formula [robbins1992empirical, stein1981estimation] shows that one can perform Bayes optimal denoising in one step, once we know the gradient of the log density. Extending such result, it was recently shown that one can indeed perform a single-step denoising with learned score functions for denoising problems from the general exponential family [kim2021noisescore].

In this work, we leverage the denoising result through Tweedie’s formula and show that such denoised samples can be the key to significantly improving the performance of reconstruction using diffusion models across arbitrary linear inverse problems, despite the simplicity in the implementation. Moreover, we theoretically prove that if the score function estimation is globally optimal, the correction term from the manifold constraint enforces the sample path to stay on the plane tangent to the data manifoldWe coin our method Manifold Constrained Gradient (MCG)., so by combining with the reverse diffusion step, the solution becomes more stable and accurate.

Related Works

with $dt$ denoting the infinitesimal negative time step, and $\bar{{\boldsymbol{w}}}$ defining the standard Wiener process running backward in time. Note that the reverse SDE defines the generative process through the score function $\nabla_{\boldsymbol{x}}\log~{}p_{t}({\boldsymbol{x}})$ , which in practice, is typically replaced with $\nabla_{\boldsymbol{x}}\log p_{0t}({\boldsymbol{x}}(t)|{\boldsymbol{x}}(0))$ to minimize the following denoising score-matching objective

Once the parameter $\theta^{*}$ for the score function is estimated, one can replace the score function in (2) with $s_{\theta^{*}}({\boldsymbol{x}}(t),t)$ to solve the reverse SDE [song2020score].

Due to the linearity of $\bar{\boldsymbol{f}}$ and $\bar{g}$ , the forward diffusion step can be implemented with a simple reparameterization trick [kingma2013auto]. Namely, the general form of the forward diffusion is

where we have replaced the ground truth score function with the trained one. We detail the choice of $a_{i},b_{i},{\boldsymbol{f}},g$ in Appendix. B.

2 Conditional Generative models for Inverse problems

where ${\boldsymbol{A}},{\boldsymbol{b}}_{i}$ are functions of ${\boldsymbol{H}},{\boldsymbol{y}},$ and ${\boldsymbol{x}}_{0}$ . Note that (7) is identical to the unconditional reverse diffusion step in (5), whereas (8) effectively imposes the condition. It was shown in [chung2022come] that any general contraction mapping (e.g. projection onto convex sets, gradient step) may be utilized as (8) to impose the constraint.

Another recent work [kawar2022denoising] advancing [kawar2021snips] establishes the state-of-the-art (SOTA) in solving noisy inverse problems with unconditional diffusion models, by running the conditional reverse diffusion process in the spectral domain achieved by performing singular value decomposition (SVD), and leveraging approximate gradient of the log likelihood term in the spectral space. The authors show that feasible solutions can be obtained with as small as 20 diffusion steps.

Prior to the development of diffusion models, Plug-and-Play (PnP) models [venkatakrishnan2013plug, zhang2017learning, tirer2018image] were used in a similar fashion by utilizing a general-purpose unconditional denoiser in the place of proximal mappings in model-based iterative reconstruction methods [boyd2011distributed, beck2009fast]. Similarly, outside the context of diffusion models, iterative denoising followed by projection-based data consistency was proposed in [tirer2018image]. In such view, diffusion models can be understood as generative variant of PnPs trained with multiple scales of noise.

GAN-based solvers are also widly explored [bora2017compressed, daras2021intermediate, hussein2020image], where the pre-trained generators are tuned at the test time by optimizing over the latent, the parameters, or jointly.

3 Tweedie’s formula for denoising

In the case of Gaussian noise, a classic result of Tweedie’s formula [robbins1992empirical] tells us that one can achieve the denoised result by computing the posterior expectation:

Tweedie’s formula is in fact not only relevant to Gaussian denoising in the Bayesian framework, but have also been extended to be in close relation with kernel regression [ong2019local]. Moreover, it was shown that it can be applied to arbitrary exponential noise distributions beyond Gaussian [efron2011tweedie, kim2021noisescore]. In the following, we use this key property to develop our algorithm.

Conditional Diffusion using Manifold Constraints

Although our original motivation of using the measurement constraint step in (8) was to utilize the unconditionally trained score function in the reverse diffusion step in (7), there is room for imposing additional constraints while still using the unconditionally trained score function.

Specifically, the Bayes rule $p({\boldsymbol{x}}|{\boldsymbol{y}})=p({\boldsymbol{y}}|{\boldsymbol{x}})p({\boldsymbol{x}})/p({\boldsymbol{y}})$ leads to

Hence, the score function in the reverse SDE in (7) can be replaced by (11), leading to

where $\alpha$ and ${\boldsymbol{W}}$ depend on the noise covariance, if the noise $\bm{\epsilon}$ in (6) is Gaussian.

Now, one of the important contributions of this paper is to reveal that the Bayes optimal denoising step in (10) from the Tweedie’s formula leads to a preferred condition both empirically and theoretically. Specifically, we define the set constraint for ${\boldsymbol{x}}_{i}$ , called the manifold constrained gradient (MCG), so that the gradient of the measurement term stays on the manifold (see Theorem 1):

To deal with the potential deviation from the measurement consistency, we again impose the data consistency step (8). Putting them together, the discrete reverse diffusion under the additional manifold constraint and the data consistency can be represented by

We illustrate our scheme visually in Fig. 1 (a), specifically for the task of image inpainting. The additional step leads to a dramatic performance boost, as can be seen in Fig. 1 (b). Note that while the mapping (10) does not rely on the measurement, our gradient term in (14) incorporates the information of ${\boldsymbol{y}}$ so that the gradient of the measurement terms stays on the manifold. In the following, we study the theoretical properties of the method. Further algorithmic details and adaptations to each problem that we tackle are presented in Section C.

We note that the authors of [ho2022video] proposed a similar gradient method for the application of temporal imputation and super-resolution. When combining (14) with (15), one can arrive at a similar gradient method proposed in [ho2022video], and hence our method can be seen as a generalization to arbitrary linear inverse problems. Furthermore, there are vast literature in the context of PnP models that utilize pre-trained denoisers together with gradient of the log-likelihood to solve inverse problems [laumont2022bayesian, vidal2020maximum, de2020maximum]. Among them, [laumont2022bayesian] is especially relevant to this work since their method relies on modified Langevin diffusion, together with Tweedie’s denoising and projections to the measurement subspace.

Geometry of Diffusion Models and Manifold Constrained Gradient

In this section, we theoretically support the effectiveness of the proposed algorithm by showing the problematic behavior of the earlier algorithm and how the proposed algorithm resolves the problem. We defer all proofs in the supplementary section. To begin with, we borrow a geometrical viewpoint of the data manifold.

Notation For a scalar $a$ , points ${\boldsymbol{x}},{\boldsymbol{y}}$ and a set $A$ , we use the following notations. $aA:=\{a{\boldsymbol{x}}:{\boldsymbol{x}}\in A\}$ ; $d({\boldsymbol{x}},A):=\inf_{{\boldsymbol{y}}\in A}||{\boldsymbol{x}}-{\boldsymbol{y}}||_{2}$ ; $B_{r}(A):=\{{\boldsymbol{x}}:d({\boldsymbol{x}},A)<r\}$ ; $T_{\boldsymbol{x}}{\mathcal{M}}$ : the tangent space to a manifold ${\mathcal{M}}$ at ${\boldsymbol{x}}$ ; ${\boldsymbol{J}}_{f}$ : the Jacobian matrix of a vector valued function $f$ . We define $p_{0}=p_{data}$ .

To develop the theory, we need an assumption on the data distribution.

Moreover, the data distribution $p_{0}$ is the uniform distribution on the data manifold ${\mathcal{M}}$ .

We need to recall that the conventional manifold assumption is about the intrinsic geometry of data points having a low dimensional nature. However, we assume more in this work: the manifold is locally linear. Although this stronger assumption might narrow the practice of the theory, the geometric approach may provide new insights on diffusion models. Under this assumption, the following proposition shows how the data perturbed by noise lies in the ambient space, illustrated pictorially in Fig. 2(a).

Considering Proposition 1, the manifolds of noisy data can be interpreted as interpolating manifolds between the two: the hypersphere, where pure noise ${\mathcal{N}}(a_{\infty}{\boldsymbol{x}}_{0},b_{\infty}^{2})$ is concentrated, and the clean data manifold. In this regard, the diffusion steps are mere transitions from one manifold to another and the diffusion process is a transport from the data manifold to the hypersphere through interpolating manifolds. See Fig. 2(a).

We can infer from the proposition that the score functions are trained only with the data points concentrated on the noisy data manifolds. Therefore, inaccurate inference might be caused by application of a score function on points away from the noisy data manifold.

Suppose $s_{\theta}$ is the minimizer of the denoising score matching loss in (3). Let $Q_{i}$ be the function that maps ${\boldsymbol{x}}_{i}$ to $\hat{{\boldsymbol{x}}}_{0}$ for each $i$ ,

According to the proposition, the score function only concerns the normal direction of the data manifold. In other words, the score function cannot discriminate two data points whose difference is tangent to the manifold. In solving inverse problems, however, we desire to discriminate data points to reconstruct the original signal, and the discrimination is achievable by measurement fidelity. In order to achieve the original signal, the measurement plays a role in correcting the tangent component near the data manifold. Furthermore, with regard to remark 2, diffusion model-based inverse problem solvers should follow the tangent component. The following theorem shows how existing algorithms and the proposed method are different in this regard.

A correction by the manifold constrained gradient does not leave the data manifold. Formally,

the gradient is the projection of the data fidelity term onto $T_{\hat{\boldsymbol{x}}_{0}}{\mathcal{M}}$ ,

This theorem suggests that in diffusion models, the naive measurement fidelity step (without considering the data manifold) pushes the inference path out of the manifolds and might lead to inaccurate reconstruction. (To see this pictorially, see section. D, and Fig. 7.) On the other hand, our correction term from the manifold constraint guides the diffusion to lie on the data manifold, leading to better reconstruction. Such geometric views are illustrated in Fig. 2(b).

One may concern that the suboptimality of the denoising score matching loss optimization may lead to inaccurate inference of the MCG steps. In practice, however, most of the error in denoising score matching is concentrated on $t\sim 1$ [chung2022come], and in such region, the Tweedie’s inference cannot make meaningful images. That is, the score function cannot detect the data manifold. Nonetheless, in this regime, the magnitudes of the MCGs are small when the denoising score is inaccurate, and hence the matters arising from suboptimality is minimal. As $t\rightarrow 0$ , the estimation becomes exact, and subsequently leads to accurate implementation of the MCG.

Experiments

For all tasks, we aim to verify the superiority of our method against other diffusion model-based approaches, and also against strong supervised learning-based baselines. Further details can be found in Section. F.

For inpainting, we use FFHQ 256 $\times$ 256 [karras2019style], and ImageNet 256 $\times$ 256 [deng2009imagenet] to validate our method. We utilize pre-trained models from the open sourced repository based on the implementation of ADM (VP-SDE) [dhariwal2021diffusion]. We validate the performance on 1000 held-out validation set images for both FFHQ and ImageNet dataset. For the colorization task, we use FFHQ 256 $\times$ 256, and LSUN-bedroom 256 $\times$ 256 [yu2015lsun]. We use pre-trained score functions from score-SDE [song2020score] based on VE-SDE. We use 300 validation images for testing the performance with respect to the LSUN-bedroom dataset. For experiments with CT, we train our model based on ncsnpp as a VE-SDE from score-SDE [song2020score], on the 2016 American Association of Physicists in Medicine (AAPM) grand challenge dataset, and we process the data as in [kang2017deep]. Specifically, the dataset contains 3839 training images resized to 256 $\times$ 256 resolution. We simulate the CT measurement process with parallel beam geometry with evenly-spaced 180 degrees. Evaluation is performed on 421 held-out validation images from the AAPM challenge.

Score-SDE [song2020score], REPAINT [lugmayr2022repaint], DDRM [kawar2022denoising] were chosen as baseline diffusion models to compare against the proposed method. For a fair comparison, we use the same score function for all methods including MCG, and only differentiate the inference method that is used. Another class of generative models: GAN-based inverse problem solver, IAGAN [hussein2020image] is considered as a comparison method for FFHQ specifically. We also include comparisons against supervised learning based baselines: LaMa [suvorov2022resolution], AOT-GAN [zeng2022aggregated], ICT [wan2021high], and DSI [peng2021generating]. We use various forms of inpainting masks: box (128 $\times$ 128 sized square region is missingThe location of the box is sampled uniformly within 16 pixel margin of each side.), extreme (only the box region is existent), random (90-95% of pixels are missing), and LaMa-wide. Quantitative evaluation is performed with two metrics - Frechet Inception Distance (FID)-1k [NIPS2017_8a1d6947], and Learned Perceptual Image Patch Similarity (LPIPS) [zhang2018unreasonable].

Our method outperforms the diffusion model baselines [song2020score, lugmayr2022repaint, kawar2022denoising] by a large margin. Moreover, our method is also competitive with, or even better than the best-in-class fully supervised methods, as can be seen in Table 1. In Fig. 3, we depict representative results that show the superiority of the method, where we see in both the box-type and random dropping that MCG performs very well on all experiments.

We choose score-SDE [song2020score], and DDRM [kawar2022denoising] as diffusion-model based comparison methods, and also compare against cINN [ardizzone2019guided], and pix2pix [isola2017image]. Two metrics were used for evaluation: structural similarity index (SSIM), and LPIPS. Consistent with the findings from inpainting, we achieve much improved performance than score-SDE, and also is favorable against state-of-the-art (SOTA) superivsed learning based methods. In Table 2, we see that the proposed method outperforms all other methods in terms of both PSNR/LPIPS in LSUN-bedroom, and also achieves strong performance in the colorization of FFHQ dataset.

To the best of our knowledge, [song2022solving] is the only method that tackles CT reconstruction directly with diffusion models. We compare our method against [song2022solving], which we refer to as score-CT henceforth. We also compare with the best-in-class supervised learning methods, cGAN [ghani2018deep] and SIN-4c-PRN [wei20202]. As a compressed sensing baseline, FISTA-TV [beck2009fast] was included, along with the analytical reconstruction method, FBP. We use two standard metrics - peak-signal-to-noise-ratio (PSNR), and SSIM for quantitative evaluation. From Table 3, we see that the newly proposed MCG method outperforms the previous score-CT [song2022solving] by a large margin. We can observe the superiority of MCG over other methods more clearly in Fig. 4, where MCG reconstructs the measurement with high fidelity and detail. All other methods including the fully supervised baselines fall behind the proposed method.

We perform three ablation studies: 1) As both the MCG term and the projection term contain information about the measurement ${\boldsymbol{y}}$ , we observe the contribution of each term to the fixed solution. To further clarify the efficacy of the gradient step combined with Tweedie’s denoising, we also consider the case where the gradient of the log likelihood is computed not in the noiseless regime, but in the noise level matching the current iteration. Specifically, we define ${\boldsymbol{x}}^{\prime}_{i-1}:={\boldsymbol{f}}({\boldsymbol{x}}_{i},{\boldsymbol{s}}_{\theta})+g({\boldsymbol{x}}_{i}){\boldsymbol{z}},\,{\boldsymbol{z}}\sim{\mathcal{N}}(0,{\boldsymbol{I}})$ , ${\boldsymbol{y}}_{i-1}\sim p({\boldsymbol{y}}_{i-1}|{\boldsymbol{y}}_{0})$ , and implement the gradient step as $\nabla_{{\boldsymbol{x}}_{i}}\|{\boldsymbol{y}}_{i-1}-{\boldsymbol{H}}{\boldsymbol{x}}^{\prime}_{i-1}\|_{2}^{2}$ . 2) As the performance of diffusion models depend heavily on the number of NFEs, we observe the trade-off of each diffusion model when varying the NFE from 20 to 1000. Moreover, for completeness, we measure the runtime of each algorithms including the non-diffusion based methods in wall-clock time computed with a commodity GPU in Table. 4. 3) Setting $\alpha=0.0$ reduces our method to [chung2022come]. We show the difference in the performance by varying the values of $\alpha$ .

First, we see in Table. 5 that using only the MCG step leads to improved performance in terms of LPIPS, but introduces error in the measurement consistency (measured with MSE). Combining both the projection and MCG leads to perfect data consistency along with further improved reconstruction. When considering gradient steps without Tweedie’s denoising (i.e. keeping the noise level at the $i^{\text{th}}$ step), the performance heavily degrades, especially when implemented without the projection steps. Here, we see that the proposed denoising step to utilize $\hat{\boldsymbol{x}}_{0}$ is indeed the key to the superior performance.

Second, looking at Fig. 5(a), we immediately see that the graph of MCG stays in the lowest (best) LPIPS regime across all NFEs by a large margin, except for when the NFE drops below 100. Here, DDRM [kawar2022denoising] takes over the 1st place - allegedly due to the DDIM sampling strategy they take. The performance of RePAINT deteriorates rapidly as we decrease NFE. Furthermore, we observe that the LPIPS of score-SDE [song2020score] actually increases (i.e. worsen), as we increase the number of NFEs from a few hundred to one thousand. This suggests that the inference process that score-SDE takes (i.e. projection only) is inherently flawed, and cannot be corrected by taking small enough steps. In Table. 4, we list the runtime of all the methods that were used for comparison in the task of inpainting. Note that the proposed method takes longer for compute than score-SDE albeit having the same NFE. The gap is due to the backpropagation steps that are required for the MCG step, where the gap can be potentially ameliorated by switching to JAX [jax2018github] implementation from the current PyTorch implementation.

Lastly, we observe the difference in the performance as we vary the values of $\alpha$ . Implementation-wise, we find that we yield superior results when normalizing the squared norm with the norm of itself (e.g. $\alpha=\alpha^{\prime}/\|{\boldsymbol{W}}({\boldsymbol{y}}-{\boldsymbol{H}}\hat{\boldsymbol{x}}_{0})\|$ , where $\alpha^{\prime}$ is some constant). In order to avoid cluttered notation, we instead experiment with changing the values of $\alpha^{\prime}$ in Fig. 5(b). Inspecting Fig. 5(b), we see that $\alpha$ values within the range $[0.1,1.0]$ produce satisfactory results. $\alpha$ values that are too low do not fully enjoy the advantages of MCG and collapses to the projection-only method, while using too high values of $\alpha$ results in exploding gradients, and the reconstruction saturates.

Our proposed method is fully unsupervised and is not trained on solving a specific inverse problem. For example, our box masks and random masks have very different forms of erasing the pixel values. Nevertheless, our method generalizes perfectly well to such different measurement conditions, while other methods have a large performance gap between the different mask shapes. We further note two appealing properties of our method as an inverse problem solver: 1) the ability to generate multiple solutions given a condition, and 2) the ability to maintain perfect measurement consistency. The former ability often lacks in supervised learning-based methods [suvorov2022resolution, wei20202], and the latter is often not satisfied for some unsupervised GAN-based solutions [daras2021intermediate, bora2017compressed].

Conclusion

In this work, we proposed a general framework that can greatly enhance the performance of the diffusion model-based solvers for solving inverse problems. We showed several promising applications - inpainting, colorization, sparse view CT reconstruction, and showed that our method can outperform the current state-of-the-art methods. We analyzed our method theoretically and show that MCG prevents the data generation process from falling off the manifold, thereby reducing the errors that might accumulate at every step. Further, we showed that MCG controls the direction tangent to the data manifold, whereas the score function controls the direction that is normal, such that the two components complement each other.

The proposed method is inherently stochastic since the diffusion model is the main workhorse of the algorithm. When the dimension $m$ is pushed to low values, at times, our method fails to produce high quality reconstructions, albeit being better than the other methods overall. For extreme cases of inpainting (e.g. Half masks) with the ImageNet model, we often observe artifacts in our reconstruction (e.g. generating perfectly symmetric images), which we discuss in further detail in Sec. E. We note that our method is slow to sample from, inheriting the existing limitations of diffusion models. This would likely benefit from leveraging recent solvers aimed at accelerating the inference speed of diffusion models. In line with the arguments of other generative model-based inverse problem solvers, our method is a solver that relies heavily on the underlying diffusion model, and can thus potentially create malicious content such as deepfakes. Further, the reconstructions could intensify the social bias that is already existent in the training dataset.

Acknowledgments and Disclosure of Funding

This research was supported by Field-oriented Technology Development Project for Customs Administration through National Research Foundation of Korea(NRF) funded by the Ministry of Science & ICT and Korea Customs Service (NRF-2021M3I1A1097938), by the Korea Health Technology R&D Project through the Korea Health Industry Development Institute (KHIDI), which is funded by the Ministry of Health & Welfare, Republic of Korea (grant number: HU21C0222), and by the KAIST Key Research Institute (Interdisciplinary Research Group) Project.

References

Appendix A Proofs

First, we remind our notation and the assumption.

Notation For a scalar $a$ , points ${\boldsymbol{x}},{\boldsymbol{y}}$ and a set $A$ , we use the following notations: $aA:=\{a{\boldsymbol{x}}:{\boldsymbol{x}}\in A\}$ ; $d({\boldsymbol{x}},A):=\inf_{{\boldsymbol{y}}\in A}||{\boldsymbol{x}}-{\boldsymbol{y}}||_{2}$ ; $B_{r}(A):=\{{\boldsymbol{x}}:d({\boldsymbol{x}},A)<r\}$ ; $T_{\boldsymbol{x}}{\mathcal{M}}$ : the tangent space to a manifold ${\mathcal{M}}$ at ${\boldsymbol{x}}$ ; ${\boldsymbol{J}}_{f}$ : the jacobian matrix of a vector valued function $f$ .

As $\frac{x_{l+1}^{2}}{b_{i}^{2}}+\dots+\frac{x_{n}^{2}}{b_{i}^{2}}$ is a chi-square distribution with $n-l$ degrees of freedom, by substituting $t=(n-l)\varepsilon^{\prime}$ in the above bound,

Note that the above inequality does not depend on $x_{1},\dots x_{l}$ , thus the choice of ${\boldsymbol{x}}^{\prime}\in{\mathcal{M}}$ . As a result, by setting $\varepsilon=\min\{1-\sqrt{1-2\sqrt{\varepsilon^{\prime}}},\sqrt{1+2\sqrt{\varepsilon^{\prime}}+2\varepsilon^{\prime}}-1\}$ and $\delta=2e^{-(n-l)\varepsilon^{\prime}}$ ,

By differentiating the objective with respect to $s_{\theta}({\boldsymbol{x}}_{t},t)$ , we have

where we used $p({\boldsymbol{x}}_{t}|{\boldsymbol{x}})p({\boldsymbol{x}})=p({\boldsymbol{x}},{\boldsymbol{x}}_{t})=p({\boldsymbol{x}}_{t})p({\boldsymbol{x}}|{\boldsymbol{x}}_{t})$ , $p({\boldsymbol{x}}_{t})>0$ , and $\int p({\boldsymbol{x}}|{\boldsymbol{x}}_{t})d{\boldsymbol{x}}=1$ in each line. Here, $Q_{i}({\boldsymbol{x}}_{i})=\int{\boldsymbol{x}}p({\boldsymbol{x}}|{\boldsymbol{x}}_{i})d{\boldsymbol{x}}$ is the weighted average vector of points on the data manifold as $p({\boldsymbol{x}}|{\boldsymbol{x}}_{i})$ is supported on the data manifold. Combining it with the assumption that the manifold is linear, $Q_{i}({\boldsymbol{x}}_{i})\in{\mathcal{M}}$ .

where we applied ${\boldsymbol{v}}_{n}^{T}{\boldsymbol{u}}_{t}=0={\boldsymbol{u}}_{n}^{T}{\boldsymbol{v}}_{t}$ . Therefore, ${\boldsymbol{J}}_{Q_{i}}$ is symmetric, i.e. ${\boldsymbol{J}}_{Q_{i}}^{T}={\boldsymbol{J}}_{Q_{i}}$ , which concludes this proof. ∎

where $d=-2{\boldsymbol{H}}^{T}{\boldsymbol{W}}^{T}{\boldsymbol{W}}({\boldsymbol{y}}-{\boldsymbol{H}}\hat{\boldsymbol{x}}_{0})$ . The first and second equality is given by the chain rule and the last line is by Proposition 2. ∎

In Fig 6, we illustrate how the proposed algorithm benefits from mixing the MCG step with the conventional POCS step. Pushing the points to the tangent directions, we expect less deviation from the manifold which is attributed to POCS.

Appendix B Discrete forms of SDE

Here, we review the different types of SDEs and sampling algorithms that we use throughout the paper for completeness. We assume that the time horizon $ $is linearly split up into$ N $discretization segments, such that all intervals have the length$ 1/N$, if not specified otherwise.

Due to the linearity of the drift and diffusion functions, we can analytically sample from $p({\boldsymbol{x}}_{i}|{\boldsymbol{x}}_{0})$ via reparameterization trick:

In VP-SDE [ho2020denoising], one defines a linearly increasing noise schedule $\beta_{1},\beta_{2},\dots,\beta_{N}\in(0,1)$ . Further, we define $\alpha_{i}=1-\beta_{i}$ , and $\bar{\alpha}_{i}=\prod_{j=1}^{i}\alpha_{j}$ . Then, the forward diffusion process can be implemented as

In VE-SDE [song2020score], one defines a geometrically increasing noise schedule $\sigma_{i}=\sigma_{0}\left(\frac{\sigma_{N}}{\sigma_{0}}\right)^{\frac{i-1}{N-1}}$ . Since the drift function is zero, the forward diffusion simply becomes Brownian motion. Concretely,

B.2 Reverse diffusion

First, for the case of VP-SDE, the reverse diffusion step is implemented by

Next, for the VE-SDE, the reverse diffusion step using the Euler-Maruyama solver [sarkka2019applied] is given as

Appendix C Algorithms

The forward model for inpainting is given as

where ${\boldsymbol{P}}\in\{0,1\}^{m\times n}$ is the matrix consisting of the columns with standard coordinate vectors indicating the indices of measurement. For the steps in (14), (15), we choose the following

Specifically, ${\boldsymbol{A}}$ takes the orthogonal complement of ${\boldsymbol{x}}^{\prime}_{i-1}$ , meaning that the measurement subspace is corrected by ${\boldsymbol{y}}_{i}$ , while the orthogonal components are updated from ${\boldsymbol{x}}^{\prime}_{i-1}$ . Note that we use ${\boldsymbol{y}}_{i}$ sampled from ${\boldsymbol{y}}$ to match the noise level of the current estimate.

We provide the algorithm used for inpainting in Algorithm. 1. The sampler is based on basic ancestral sampling (AS) of [ho2020denoising], and the default configuration requires $N=1000$ , $\alpha=1.0/\|{\boldsymbol{y}}-{\boldsymbol{P}}\hat{\boldsymbol{x}}_{0}\|$ for sampling.

The forward model for colorization is specified as

where ${\boldsymbol{P}}$ is the matrix that was used in inpainting, and ${\boldsymbol{M}}$ is an orthogonal matrix that couples the RGB colormapsThe matrix ${\boldsymbol{M}}$ is adopted from the colorization matrix of [song2020score].. ${\boldsymbol{M}}^{T}$ is a matrix that de-couples the channels back to the original space. In other words, one can view colorization as performing imputation in some spectral space. Subsequently, for our colorization method we choose

Again, our forward measurement matrix is orthogonal, and we choose ${\boldsymbol{A}}$ such that we only affect the orthogonal complement of the measurement subspace.

The sampler for colorization is based on the predictor-corrector (PC) sampler of [song2020score] (VE-SDE), and we choose to apply MCG after every iteration of both predictor, and corrector steps. $N=2000,\alpha=0.1/\|{\boldsymbol{C}}^{T}({\boldsymbol{y}}-{\boldsymbol{C}}\hat{\boldsymbol{x}}_{0})\|$ are chosen as hyper-parameters.

For the case of CT reconstruction, the forward model reads

where ${\boldsymbol{R}}$ is the discretized Radon transfrom [buzug2011computed] that measures the projection images from different angles. Note that for CT applications, ${\boldsymbol{R}}^{T}$ corresponds to performing backprojection (BP), and ${\boldsymbol{R}}^{\dagger}$ corresponds to performing filtered backprojection (FBP). We choose

where the choice of ${\boldsymbol{A}}$ reflects that the Radon transform is not orthogonal, and we need the term $({\boldsymbol{R}}{\boldsymbol{R}}^{T})^{\dagger}$ as a term analogous to the filtering step. Indeed, this form of update is known as the algebraic reconstruction technique (ART), a classic technique in the context of CT [gordon1970algebraic]. We note that this choice is different from what was proposed in [song2022solving], where the authors repeatedly apply projection/FBP by explicitly replacing the sinogram in the measured locations. From our experiments, we find that repeated application of FBP is highly numerically unstable, often leading to overflow. This is especially the case when we have limited resources for training data (we use 4k, whereas [song2022solving] uses 50k), as we further show in Section 5.

Algorithm for SV-CT reconstruction uses PC sampler (VE-SDE), where we use MCG step after one sweep of corrector-predictor update. We note that this is a design choice, and one may as well use the MCG update step after both the predictor and corrector steps, as was proposed in [song2020score]. We set $N=2000,\alpha=0.1/\|{\boldsymbol{R}}^{\dagger}({\boldsymbol{y}}-{\boldsymbol{R}}\hat{\boldsymbol{x}}_{0})\|$ .

Appendix D Generative process of the proposed method

In Fig. 7, we depict the comparison of the generative process between the two methods: score-SDE [song2020score], which relies on alternating projections; and our method, which utilizes MCG as correcting steps. In Fig. 7 (a), we can clearly see the unnatural boundary between the masked and the unmasked region forming, and evolving as $t\rightarrow 0$ , without getting corrected (Visible more clearly in $\hat{\boldsymbol{x}}_{0}$ ). On the other hand, thanks to the additional gradient step that corrects the errors in the boundary, we see a much more natural evolution of the signal as $t\rightarrow 0$ in Fig. 7 (b).

Appendix E Limitations

There exists a limitation specifically for the ImageNet dataset when using the proposed algorithm for inpainting. Specifically, as shown in Fig. 8, for the case of half-mask (i.e. the left or right half of the image is zeroed-out), we often see the reconstructions are generated showing symmetries that are unrealistic. Note that this kind of effect is not observed in our FFHQ experiments. Hence, we conjecture that this phenomenon arises from the imperfectness of the learned score function $s_{\theta}$ . Namely, due to the ImageNet dataset being much more diverse and therefore widely known to be a much harder dataset to learn, the subobtimality of the score function may be greater than the FFHQ score function. This could possibly lead to such deficiencies.

Appendix F Experimental Details

For inpainting experiments, we take the pre-trained score functions that are available online (FFHQhttps://github.com/jychoi118/ilvr_adm, imagenethttps://github.com/openai/guided-diffusion). For CT reconstruction experiment, we train a ncsnpp model with default configurations as guided in [song2020score] with the VE-SDE framework. The model was trained for 200 epochs with the full training dataset, with a single RTX 3090 GPU. Training took about one week wall-clock time.

All our sampling steps detailed in Algorithm C was performed with a single RTX 3090 GPU. The inpainting algorithm based on ADM [dhariwal2021diffusion] takes about 90 seconds (1000 NFE) to reconstruct a single image of size 256 $\times$ 256. Our colorization and CT reconstruction algorithm based on score-SDE [song2020score] takes about 600 seconds (4000 NFE) to infer a single 256 $\times$ 256 image.

We will open-source our code used in our experiments upon publication to boost reproducibility.

F.2 Comparison methods

Score-SDE [song2020score] demonstrated that unconditional diffusion models can be adopted to various inverse problems, such as inpainting and colorization. Our method without the MCG step is identical to score-SDE, and hence we use the same score function, parameters, and sampler as used in the proposed method for reconstruction.

RePAINT [lugmayr2022repaint] proposes to iterate between denoising-noising steps multiple times in order to better incorporate inter-dependency between the known and the unknown regions in the case of image inpainting. We use the same score function and sampler for RePAINT as in the proposed method. Following the default configurations in [lugmayr2022repaint], we take $N=200$ (corresponding to $T$ in [lugmayr2022repaint]), and $U=10$ , where $U$ denotes the count of iterated denoising-noising steps used within a single update index $i$ .

DDRM [kawar2022denoising] demonstrates that linear inverse problems can be solved via diffusion models by decomposing the generative process with singular value decomposition (SVD), and performing reverse diffusion sampling in the spectral space. The same score function adopted for the proposed method is used. Using the notations from [kawar2022denoising], we choose $\sigma_{{\boldsymbol{y}}}=0$ , as we are aiming to solve noiseless inverse problem, and $\eta=0.85,\eta_{b}=1$ . The number of NFE is set to 20 with the DDRM sampling steps.

LaMA contains fast Fourier convolution in generator architecture for reconstructing images. We trained the model from scratch using adversarial loss with r1 regularization term with its coefficient 10 and gradient penalty coefficient 0.001. Adam optimizer is used with the fixed learning rate of 0.001 and 0.0001 for discriminator network. For FFHQ and Imagenet dataset, 500k iterations of trainings were done with batch size of 8.

AOT-GAN consists of a deep image generator with a AOT block which consists of multiple length of residual blocks in parallel. The discriminator is the same architecture with PatchGAN from [zhu2017unpaired]. We trained the model from the scratch with 0.0001 learning rate using Adam optimizer $\beta_{1}=0$ and $\beta_{2}=0.9$ for both FFHQ and Imagent dataset. 500k iterations of trainings were done with batch size of 8. Also, for style loss and the perceptual loss, VGG19 [simonyan2014very] pretrained on ImageNet [deng2009imagenet] was used.

Image completion transformer (ICT) consists of two modules - a transformer model that follows the tokenization procedure to process information in the lower dimensional space, and another guided upsampling module to retrieve the data dimensionality. The encoded features are sampled from a probability distribution via Gibbs sampling, such that one can capture multimodal reconstructions from the same measurement. For both the FFHQ and Imagenet dataset, we used pretrained models provided by the authors.

Image adaptive GAN (IAGAN) uses a pre-trained generator and adapts it at test time for the given forward model. Specifically, following compressed sensing using generative model (CSGM) [bora2017compressed], one initializes the latent vector ${\boldsymbol{z}}$ such that ${\boldsymbol{z}}^{*}=\operatorname*{arg\,min}_{\boldsymbol{z}}\|y-AG_{\theta}({\boldsymbol{z}})\|$ . Then, the latent code and the neural network parameters are jointly optimized through some iterations of ${\boldsymbol{z}}^{**},\theta^{*}=\operatorname*{arg\,min}_{{\boldsymbol{z}},\theta}\|y-AG_{\theta}({\boldsymbol{z}})\|$ . The final result is achieved by the forward pass through the generator, after which follows the projection into the measurement subspace. For tuning the generator, we follow the default configurations from the official codebase. Since the codebase uses a GAN that generates 1024 $\times$ 1024 images, we downscale the result into 256 $\times$ 256 image as a final post-processing step.

DSI is structured with the combination of VQ-VAE [van2017neural], structure generator and texture generator. The architectures were trained separately, with Adam optimizer. When inference, only structure and texture generator was used. We trained the model from scratch. During optimization, the structure generator used linear warm-up schedular and square-root decay schedule used in [razavi2019generating]. We used Adam optimizer on training all models with learning rate of 0.0001 and $\beta_{1}=0.5$ using exponetial moving average (EMA). Training was done for 500k iteration for both FFHQ and Imagenet dataset.

cINN is an invertible neural network which can take in additional conditions as input, and in our case grayscale images. We train the model using default configurations as advised in https://github.com/VLL-HD/conditional_INNs without modifications. FFHQ model was trained with the learning rate of 0.0001 for 100 epohcs using the Adam optimizer. LSUN bedroom model was trained with the learning rate of 0.0001 for 30 epochs.

Pix2pix is a variant of conditional GAN (cGAN) that takes in as input, the corrupted image. The model is trained in a supervised fashion, with the loss consisting of the reconstruction loss, and the adversarial loss. As the discriminator architecture, we adopt patchGAN [isola2017image], and utilize the LSGAN [mao2017least] loss, weighting the adversarial loss by the value of 0.1. Similar to cINN, FFHQ model was trained with the learning rate of 0.0001 for 100 epochs using Adam optimizer. LSUN bedroom model was trained with the same configuration for 30 epochs.

F.3 CT reconstruction

We use the hyper-parameters as advised in [song2022solving] and set $\eta=0.246,\lambda=0.841$ . The measurement consistency step is imposed after every corrector-predictor sweep as in the proposed method.

Directly using the official implementationhttps://github.com/anonyr7/Sinogram-Inpainting [wei20202], we train the sinogram inpainting network (SIN) with the AAPM dataset for 200 epochs with the batch size of 8, and learning rate of 0.0001. We train two models separately for different number of views - 18, and 30.

We adopt the implementation of cGAN [ghani2018deep] from SIN-4c-PRN repository \@footnotemark. We train the two separate networks for 18 view, and 30 view projection, with the same configuration - 200 epochs, learning rate of 0.0001, and batch size of 8.

We perform FISTA-TV [beck2009fast] reconstruction using TomoBAR [tooldkazanc], together with the CCPi regularization toolkit [kazantsev2019ccpi]. Leveraging the default setting, we use the least-squares (LS) data model, and run the FISTA iteration for 300 iterations per image, with the total variation regularization strength set to 0.001.

Appendix G Further Experimental Results

We provide extensive set of comparison study for each task in Fig. 9, 11, and 12. Furthermore, in order to illustrate the ability of our method to generate multimodal reconstructions given a measurement, we present further experimental results of inpainting and colorization in the following figures: Fig. 13, 14, 15, and 16