Deblurring via Stochastic Refinement

Jay Whang, Mauricio Delbracio, Hossein Talebi, Chitwan Saharia, Alexandros G. Dimakis, Peyman Milanfar

Introduction

Image deblurring is a long-standing problem in computer vision. Various conditions such as moving objects, camera shakes, or an out-of-focus lens may contribute to blurring artifacts. Single image deblurring is a highly ill-posed inverse problem where multiple plausible sharp images could lead to the very same blurry observation. Nonetheless, most existing methods produce a single deterministic estimate of the clean image.

Traditional methods formulate deblurring as a variational optimization problem and find a solution that is consistent with a prior on the image and/or the blur kernel. With the emergence of deep learning, convolutional neural networks (CNNs) have become the de facto standard for deblurring models. Typically, these CNNs are trained with simulated sharp-blurry image pairs through supervised learning. Minimizing an $L_1$ or $L_2$ pixel loss is perhaps the most widely adopted approach for training such models. These losses provide a straightforward learning objective and optimize for the popular PSNR (peak signal-to-noise ratio) metric. Unfortunately, PSNR and other distortion metrics are well known to correspond only partially to human perception, and optimizing them can actually lead to visibly lower quality in the reconstructed images. To alleviate this problem, recent works introduced additional loss terms that seek to improve the quality of generated images under metrics that represent human perception more reliably. Training networks to map corrupted images to a known ground truth in a supervised way belongs to the family of end-to-end methods. These methods perform very well in-distribution, but can be quite fragile to distributional shifts or changes in the corruption process.

A second body of work has focused on using deep generative models to solve inverse problems. For deblurring, Generative Adversarial Networks (GANs) have been successfully applied with competitive performance. GAN-based restoration methods train the deblurring network with an adversarial loss to make the restored images more perceptually plausible. However, the methods proposed so far have been deterministic, and adversarial losses often introduce artifacts not present in the original clean image, leading to large distortion (e.g. for super-resolution).

In this work, we adopt a different perspective and view deblurring as a conditional generative modeling task, where we seek to generate diverse samples from the posterior distribution. Specifically, we introduce a “predict-and-refine” conditional diffusion model, where a deterministic data-adaptive predictor is jointly trained with a stochastic sampler that refines the output of the said predictor (see Fig. 2).

Our predict-and-refine approach enables more efficient sampling compared to the standard diffusion model. This formulation also naturally leads to a stochastic model capable of producing realistic images without sacrificing pixel-level distortion. To the best of our knowledge, this is the first blind deblurring technique that leverages a deep generative model and is capable of producing diverse samples.

Overall, our method produces a variety of plausible and photo-realistic results, while achieving state-of-the-art performance under many quantitative metrics in terms of both distortion and perceptual quality across multiple standard datasets. In addition, by aggregating a different number of generated deblurred samples, our framework allows us to conveniently traverse the Perception-Distortion curve as shown in Fig. 1, without any expensive retraining or finetuning. These results show clear benefits of stochastic diffusion-based methods for deblurring and challenge the currently dominant strategy of producing deterministic reconstructions.

Related Work

Deblurring through point estimates. Traditional deblurring methods formulate the problem as one of blind deconvolution. In this setup, the blur is generally modeled as a noisy linear operator acting on the clean image. While the exact values of the blur operator are not assumed to be known, one can enforce some prior distribution on the blur and the sharp image and try to find the most likely solution.

Alternatively, many recent methods adopt an end-to-end approach where a deep neural network is trained to directly produce a point estimate. These methods generally rely on pairs of blurry-sharp images as training data and cast the deblurring problem as a supervised regression task. Much of the effort has gone into developing specialized network architectures and loss functions to achieve better pixel-level reconstruction metrics such as PSNR or SSIM. For example, MIMO-UNet proposed an architecture that facilitates information flow across different image resolutions in a multi-scale U-Net. Another work, HINet, introduced Half Instance Normalization, which can be used as a building block for image restoration networks. MPRNet presented an improved multi-stage architecture designed to incorporate both high-level global features and local details.

Issue of regression to the mean. While the aforementioned approaches lead to state-of-the-art PSNR, they share the limitation that they can only produce a deterministic output. This is at odds with the nature of blind image deblurring, which is an inherently ill-posed inverse problem with multiple valid solutions for a single input. In fact, the current trend of developing point-estimators that directly minimize a distortion loss suffers from the problem of “regression to the mean”. If there are multiple possible clean images that correspond to the blurry input, the optimal reconstruction according to the given loss function will be an average of them. Consequently, the resultant deterministic reconstruction often lacks details as it learns to produce the average of all possible solutions at best.
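The effect can be illustrated numerically. The sketch below assumes two equally likely candidate reconstructions (the toy vectors are made up for illustration and are not from any dataset) and compares the expected squared error of their mean against that of each candidate.

```python
import numpy as np

# Two hypothetical, equally likely "sharp" candidates for the same blurry input.
candidates = np.array([[0.0, 1.0, 0.0, 1.0],
                       [1.0, 0.0, 1.0, 0.0]])

def expected_l2(estimate, candidates):
    """Squared error of `estimate`, averaged over all candidates and pixels."""
    return float(np.mean((candidates - estimate) ** 2))

# The L2-optimal point estimate is the mean: a flat, detail-free signal.
mean_estimate = candidates.mean(axis=0)
loss_mean = expected_l2(mean_estimate, candidates)

# Committing to either (fully detailed) candidate gives a larger expected loss.
loss_candidates = [expected_l2(c, candidates) for c in candidates]

print(mean_estimate, loss_mean, loss_candidates)
```

In this toy setting the mean attains the lowest expected loss even though it matches neither candidate, which is the behavior a distortion-minimizing point estimator is rewarded for.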

Diverse image restoration. One way to circumvent the regression to the mean phenomenon is to avoid point estimates and directly learn to generate samples from the posterior distribution. While techniques based on adversarial training have been explored for blind deblurring, in general they are not trained to produce multiple samples. Additionally, non-reference based adversarial losses can introduce significant hallucinations and distortions.

Likelihood-based deep generative models such as Variational Autoencoders, Normalizing Flows, and Diffusion Probabilistic Models (DPMs) have also been successfully applied to other image enhancement tasks such as super-resolution, where a diverse set of candidates can be generated from the learned posterior. Compared to point estimates, solving imaging inverse problems by sampling from the posterior has additional benefits such as uncertainty quantification, near-optimal sample complexity, and better fairness guarantees.

Diffusion Probabilistic Models

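A DPM is built on a fixed forward diffusion process that gradually adds Gaussian noise to a data sample $\bm{x}_0$ over $T$ steps, with transition kernel

$$q(\bm{x}_t \,|\, \bm{x}_{t-1}) = \mathcal{N}\big(\bm{x}_t;\ \sqrt{\alpha_t}\,\bm{x}_{t-1},\ (1-\alpha_t)\bm{I}_d\big), \tag{1}$$

where $\alpha_t\in(0,1)$ for all $t=1,\ldots,T$. The noise schedule $\bm{\alpha}_{1:T}\triangleq(\alpha_1,\ldots,\alpha_T)$ is a hyperparameter that controls the variance of noise added at each step. The latent variables $\bm{x}_{1:T}$ have the same dimensionality as the original data sample $\bm{x}_0$.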

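While this particular choice of diffusion process may seem arbitrary, it results in closed-form expressions for the following distributions: the marginal distribution $q(\bm{x}_t \,|\, \bm{x}_0)$ and the reverse diffusion step $q(\bm{x}_{t-1} \,|\, \bm{x}_t,\bm{x}_0)$. (For notational brevity, we use the term “marginal” to include distributions conditioned on $\bm{x}_0$.) Writing $\bar{\alpha}_t\triangleq\prod_{j=1}^{t}\alpha_j$, we get

$$q(\bm{x}_t \,|\, \bm{x}_0) = \mathcal{N}\big(\bm{x}_t;\ \sqrt{\bar{\alpha}_t}\,\bm{x}_0,\ (1-\bar{\alpha}_t)\bm{I}_d\big), \tag{2}$$

$$q(\bm{x}_{t-1} \,|\, \bm{x}_t,\bm{x}_0) = \mathcal{N}\big(\bm{x}_{t-1};\ \bm{\mu}_t(\bm{x}_t,\bm{x}_0),\ \beta_t\bm{I}_d\big), \tag{3}$$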

where $\bm{\mu}_t(\bm{x}_t,\bm{x}_0)$ and $\beta_t$ are quantities that depend on $\bm{x}_t$, $\bm{x}_0$, and $\bm{\alpha}_{1:T}$. Their full expressions and derivations are included in Appendix D.

The marginal distribution in Eq. 2 allows us to sample a partially noisy image $\bm{x}_t$ at an arbitrary time step, and the reverse diffusion step in Eq. 3 is a stochastic denoising procedure that tells us how to reverse a single diffusion step by sampling a slightly less noisy image $\bm{x}_{t-1}$ from $\bm{x}_t$. The ability to sample from arbitrary marginals is important to make training of a DPM practical, as the training objective relies on it (see Eq. 5).

We note that the diffusion process defined here has no learnable parameters. It is a fixed process that gradually destroys the original signal $\bm{x}_0$ and produces $\bm{x}_T$ that looks indistinguishable from pure Gaussian noise given a sufficiently large $T$. Thus, if we could apply the reverse diffusion step $T$ times starting from pure Gaussian noise, we would obtain a clean sample $\bm{x}_0$. However, this is not possible because the reverse diffusion step itself requires access to $\bm{x}_0$, which is exactly what we are trying to generate.

Reverse process and denoiser network. A key component of a DPM is the denoiser network $f_\theta$ that tries to estimate $\bm{x}_0$ from the partially noisy image $\bm{x}_t$. With it, we can apply the reverse diffusion step without knowing $\bm{x}_0$ by using the estimate $f_\theta(\bm{x}_t,t)$ in place of $\bm{x}_0$:

$$p_\theta(\bm{x}_{t-1} \,|\, \bm{x}_t) \triangleq q\big(\bm{x}_{t-1} \,|\, \bm{x}_t,\ f_\theta(\bm{x}_t,t)\big). \tag{4}$$

This defines a Markov chain that runs backwards in time from $\bm{x}_T$ to $\bm{x}_0$, which we call the reverse process. The goal of a DPM is to train $f_\theta$ to make $p_\theta(\bm{x}_{t-1} \,|\, \bm{x}_t)$ as close to the true reverse diffusion step $q(\bm{x}_{t-1} \,|\, \bm{x}_t,\bm{x}_0)$ as possible. This is done by optimizing $f_\theta$ to maximize the variational lower bound of the marginal likelihood $\log p_\theta(\bm{x})$.

In practice, we use an alternative parametrization of $f_\theta$ from prior work that instead predicts the Gaussian noise $\bm{\epsilon}$ that deterministically relates $\bm{x}_t$ and $\bm{x}_0$ via Equation 2. Specifically, we write $\bm{x}_t=\sqrt{\bar{\alpha}_t}\,\bm{x}_0+\sqrt{1-\bar{\alpha}_t}\,\bm{\epsilon}$ for $\bm{\epsilon}\sim\mathcal{N}(\mathbf{0},\bm{I}_d)$ and train $f_\theta$ to predict $\bm{\epsilon}$. The coefficient on $\bm{\epsilon}$ is the standard deviation $\sqrt{1-\bar{\alpha}_t}$, matching the variance $(1-\bar{\alpha}_t)\bm{I}_d$ of the marginal.
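The relation between $\bm{x}_t$, $\bm{x}_0$, and $\bm{\epsilon}$ can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function names, the toy array standing in for an image, and the noise level of 0.3 are all assumptions made for the example.

```python
import numpy as np

def noise_image(x0, alpha_bar, rng):
    """Draw x_t ~ N(sqrt(alpha_bar) * x0, (1 - alpha_bar) * I); also return eps."""
    eps = rng.standard_normal(x0.shape)
    xt = np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * eps
    return xt, eps

def x0_from_eps(xt, eps, alpha_bar):
    """Invert the relation above: recover x0 from x_t and the noise eps."""
    return (xt - np.sqrt(1.0 - alpha_bar) * eps) / np.sqrt(alpha_bar)

rng = np.random.default_rng(0)
x0 = rng.uniform(-1.0, 1.0, size=(8, 8, 3))  # toy stand-in for an image
xt, eps = noise_image(x0, alpha_bar=0.3, rng=rng)
x0_rec = x0_from_eps(xt, eps, alpha_bar=0.3)

print(np.abs(x0_rec - x0).max())  # round-trip error, at floating-point level
```

Because the map between $\bm{\epsilon}$ and $\bm{x}_0$ is invertible for a given $\bm{x}_t$, predicting either quantity carries the same information, which is why the noise-prediction parametrization is a valid substitute.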

Continuous noise level. Chen et al. propose a modified formulation based on a continuous noise level $\bar{\alpha}$, which we also adopt. An important property of this formulation is that it allows us to sample from the model using a noise schedule $\bm{\alpha}_{1:T}$ different from the one used during training. This flexibility enables us to control the trade-off between the distortion and the perceptual quality of generated samples without having to retrain the model, as we show later.

Conditional DPM. So far we have defined a DPM that is trained to model the unconditional data distribution. For conditional models that must estimate $p(\bm{x} \,|\, \bm{y})$, we make $f_\theta$ accept $\bm{y}$ as the conditioning input, as was done in prior work. This way, the iterative denoising procedure becomes dependent on $\bm{y}$. The final training objective is:

where the expectation is over $\bm{y}$, $\bm{x}_0$, $\bar{\alpha}$, and $\bm{\epsilon}$.

Sampling from a DPM. As mentioned earlier, sampling an image from a DPM is done by running the reverse process. Given some inference-time noise schedule $\bar{\alpha}_{1:T}$, we start from pure Gaussian noise $\bm{x}_T\sim\mathcal{N}(\mathbf{0},\bm{I}_d)$ and repeatedly apply the reverse process transition $p_\theta(\bm{x}_{t-1} \,|\, \bm{x}_t)$ defined in Eq. 4. Notice that this procedure requires a total of $T$ calls to the denoiser network. At the end of this sampling procedure, we are left with a single sample $\bm{x}_0$.

Predict-and-Refine Diffusion Model

One of the main drawbacks of DPMs is the computational cost of generating samples, which may require up to thousands of forward passes of the denoiser network due to the iterative denoising procedure. As such, many recent works have explored alternative sampling strategies that reduce the number of sampling steps.

We introduce a simple technique that reduces this cost by exploiting the fact that it is often possible to get a cheap initial guess for conditional generative models. Specifically, we augment our conditional diffusion model with a deterministic initial predictor (Fig. 2), which provides a data-adaptive candidate for the clean image. Then the denoiser network only needs to model the residual.

Letting $g_\theta$ denote the initial predictor, the new objective becomes: $L_{\text{Ours}}(\theta)=$

We include pseudocode for the modified sampling procedure in Algorithm 1. Notice that the initial predictor $g_\theta$ does not require an extra loss or pretraining because the gradient from the loss flows through $f_\theta$ into $g_\theta$.

Since the initial predictor runs only once, it is beneficial to keep the denoiser network small by offloading most of the computation to the initial predictor. This leads to much more efficient sampling because any reduction in the computational cost of the denoiser network gets amplified by the number of sampling steps used. We further explore this effect in Sec. 6.
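A minimal NumPy sketch of predict-and-refine sampling follows. It is written under several assumptions rather than taken from the authors' code: the final output is taken to be the initial prediction plus the sampled residual, the reverse step uses the standard closed-form Gaussian posterior with a noise-predicting denoiser, and `toy_predictor` and `oracle_denoiser` are hypothetical stand-ins for the two networks.

```python
import numpy as np

def predict_and_refine(predictor, denoiser, y, alphas, rng):
    """Sketch: one predictor call, then T denoiser calls on the residual."""
    x_init = predictor(y)                      # deterministic initial guess
    alpha_bars = np.cumprod(alphas)
    r = rng.standard_normal(x_init.shape)      # residual starts as pure noise
    for t in range(len(alphas) - 1, -1, -1):   # 0-based index, last step first
        a, ab = alphas[t], alpha_bars[t]
        ab_prev = alpha_bars[t - 1] if t > 0 else 1.0
        eps_hat = denoiser(r, y, ab)           # predicted noise
        r0_hat = (r - np.sqrt(1 - ab) * eps_hat) / np.sqrt(ab)
        # Mean and variance of the reverse step, with r0_hat in place of r0.
        mean = (np.sqrt(ab_prev) * (1 - a) * r0_hat
                + np.sqrt(a) * (1 - ab_prev) * r) / (1 - ab)
        var = (1 - ab_prev) * (1 - a) / (1 - ab)
        r = mean + np.sqrt(var) * rng.standard_normal(r.shape)
    return x_init + r                          # add the refined residual back

# Toy check with hypothetical stand-ins for the two networks.
rng = np.random.default_rng(0)
y = rng.uniform(0.0, 1.0, size=(8, 8))         # "blurry" input
target = y + 0.1                               # pretend sharp image
calls = {"predictor": 0, "denoiser": 0}

def toy_predictor(y):
    calls["predictor"] += 1
    return y + 0.08                            # close to, but not equal to, target

def oracle_denoiser(r, y, alpha_bar):
    """Returns the exact noise, which a trained network would only approximate."""
    calls["denoiser"] += 1
    true_residual = target - (y + 0.08)
    return (r - np.sqrt(alpha_bar) * true_residual) / np.sqrt(1 - alpha_bar)

alphas = 1.0 - np.linspace(1e-4, 0.1, 20)      # illustrative 20-step schedule
restored = predict_and_refine(toy_predictor, oracle_denoiser, y, alphas, rng)
print(calls, np.abs(restored - target).max())
```

The call counter makes the cost structure explicit: the predictor runs once per image while the denoiser runs once per sampling step, so savings in the denoiser are multiplied by the step count.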

As explained in Sec. 3, conditioning the diffusion model on continuous noise level makes it possible to use a different noise schedule during inference. We observe that using many steps with small noise level generally leads to better perceptual quality, and using fewer steps with large noise level leads to lower distortion.

For our experiments, we run a small grid search over the noise schedule hyperparameters and use the model with the best LPIPS score (labeled “Ours”). We emphasize that this inference-time hyperparameter tuning is cheap as it does not involve retraining or finetuning the model itself.

Traversing the Perception-Distortion curve. By appropriately setting the inference-time hyperparameters mentioned above (sampling steps $T$, noise schedule $\bar{\alpha}_{1:T}$, and sample averaging), we can smoothly traverse the P-D curve as shown in Fig. 1.

For example, the LPIPS-optimized model (“Ours”) uses a relatively large step count of $T=500$ without sample averaging to achieve high perceptual quality at a slight cost in PSNR. The distortion-optimized model (“Ours-SA”) does the opposite by using $T=10$ with sample averaging to sacrifice perceptual quality for higher PSNR. Each point on the P-D curve in Fig. 1 thus corresponds to a specific choice of these hyperparameters.
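Why sample averaging raises PSNR can be seen in a toy experiment. The sketch below models the "samples" as the reference plus independent zero-mean perturbations, which is a strong simplifying assumption (real posterior samples are not of this form), and the sample count of 16 and perturbation scale of 0.05 are arbitrary.

```python
import numpy as np

def psnr(x, ref, max_val=1.0):
    """Peak signal-to-noise ratio in dB for images scaled to [0, max_val]."""
    mse = np.mean((x - ref) ** 2)
    return 10.0 * np.log10(max_val ** 2 / mse)

rng = np.random.default_rng(0)
reference = rng.uniform(0.0, 1.0, size=(64, 64))

# Toy stand-in for diverse samples: reference plus independent perturbations.
samples = reference + 0.05 * rng.standard_normal((16, 64, 64))

psnr_single = psnr(samples[0], reference)
psnr_avg = psnr(samples.mean(axis=0), reference)
print(f"single sample: {psnr_single:.2f} dB, average of 16: {psnr_avg:.2f} dB")
```

Averaging cancels the independent perturbations and so lowers the mean squared error, but in a real image those perturbations include plausible texture, which is why the averaged result trades perceptual quality for distortion.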

Resolution-agnostic Architecture

Unlike the image benchmarks commonly used to evaluate DPMs, blind deblurring benchmarks contain images with various sizes. To support arbitrary input shapes, we use a fully-convolutional architecture for both initial predictor and denoiser network.

Our architecture is based on SR3, which uses a variant of the U-Net architecture with residual blocks replaced by those of BigGAN. To make our model agnostic to image resolution, we removed self-attention, positional encoding, and group normalization. The exact specification of our architecture can be found in Appendix E.

We note that, to the best of our knowledge, this is the first time a conditional diffusion model has been made to support arbitrary image sizes. Our preliminary experiments show that the fully-convolutional architecture had little to no degradation in sample quality for deblurring at non-native resolutions. Because the denoiser network is a relatively simple U-Net, DPMs provide a particularly convenient choice for conditional image generation that must work on any input size.

Experiments

We train and evaluate our models on two widely-used image deblurring datasets. For a fair comparison, we follow the same setup used by prior work and train our model using only the provided training data.

GoPro. The GoPro dataset contains 3214 pairs of clean and blurry $1280\times720$ images, of which 1111 are reserved for evaluation. These images are generated by recording video clips with high shutter speed, then averaging consecutive frames to simulate blurs caused by slow shutter speed.

HIDE. We additionally evaluate our GoPro-trained model on the HIDE dataset, which contains 2025 images, also of size $1280\times720$. By training and evaluating our model on different datasets, we can test its ability to generalize under a distributional shift.

Model Training

We jointly train the initial predictor and denoiser network by minimizing the loss in Eq. 6. Since our model is fully convolutional, we use random $128\times128$ crops during training, but apply the model on full-size images for evaluation. We also perform training-time data augmentation with random horizontal/vertical flips and rotations.
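The cropping and augmentation step can be sketched as below. This is an illustrative implementation, not the authors' pipeline: the function name is hypothetical, each flip is assumed to be applied with probability 0.5, and the rotations are assumed to be multiples of 90 degrees since the exact angles are not stated here. The key point is that the blurry and sharp images must receive identical transforms.

```python
import numpy as np

def random_crop_and_augment(blurry, sharp, crop=128, rng=None):
    """Apply the same random crop, flips, and 90-degree rotation to a pair."""
    rng = np.random.default_rng() if rng is None else rng
    h, w = blurry.shape[:2]
    top = int(rng.integers(0, h - crop + 1))
    left = int(rng.integers(0, w - crop + 1))
    pair = [img[top:top + crop, left:left + crop] for img in (blurry, sharp)]
    if rng.random() < 0.5:
        pair = [img[:, ::-1] for img in pair]   # horizontal flip
    if rng.random() < 0.5:
        pair = [img[::-1, :] for img in pair]   # vertical flip
    k = int(rng.integers(0, 4))                 # number of 90-degree turns (assumed)
    pair = [np.rot90(img, k) for img in pair]
    return pair[0], pair[1]

rng = np.random.default_rng(0)
sharp = rng.random((720, 1280, 3))
blurry = 0.5 * sharp                            # toy "blur" so pairing is checkable
blurry_crop, sharp_crop = random_crop_and_augment(blurry, sharp, rng=rng)
print(blurry_crop.shape, sharp_crop.shape)
```

Square crops keep the output shape fixed under 90-degree rotations, which is convenient for batching.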

A note on training data. Most currently leading methods only report distortion-based metrics (PSNR and SSIM) and provide pre-trained models for GoPro. Since our work focuses on perceptual quality, we need to compute perceptual metrics ourselves using outputs from other methods. Thus to ensure a fair comparison, we are limited to using models trained on the GoPro dataset, as it is the only dataset with widely available pre-trained models. Nonetheless, we provide additional results and the details of how we obtained the outputs of other methods in Appendices H and F.

Evaluation

Evaluation Metrics. We evaluate our method on four different perceptual metrics: LPIPS, NIQE, FID (Fréchet Inception Distance), and KID (Kernel Inception Distance). Because our datasets do not have enough examples to reliably compute FID and KID, we extract 15 non-overlapping patches of size $256\times240$ from each $1280\times720$ image and compute the Inception-based metrics at the patch level, similar to prior work. For completeness, we also include two distortion-based metrics: PSNR and SSIM.
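The patch count follows from the image size: $1280/256=5$ columns and $720/240=3$ rows give 15 patches. A minimal sketch of the extraction, assuming sizes are given as width by height and images are stored as height by width by channels (the function name is hypothetical):

```python
import numpy as np

def extract_patches(image, patch_h=240, patch_w=256):
    """Split an (H, W, C) image into non-overlapping (patch_h, patch_w) tiles."""
    h, w, c = image.shape
    rows, cols = h // patch_h, w // patch_w
    image = image[:rows * patch_h, :cols * patch_w]   # drop any remainder
    tiles = image.reshape(rows, patch_h, cols, patch_w, c).swapaxes(1, 2)
    return tiles.reshape(rows * cols, patch_h, patch_w, c)

image = np.arange(720 * 1280 * 3, dtype=np.int64).reshape(720, 1280, 3)
patches = extract_patches(image)
print(patches.shape)
```

Patches are ordered row by row, so index 1 is the tile immediately to the right of the top-left tile.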

We note the importance of including full-reference metrics for conditional image generation. A method can achieve near-perfect score on a no-reference metric such as NIQE by producing highly realistic images that are completely unrelated to the input. This is particularly relevant for GAN-based methods, since the discriminator may not penalize the generator for producing natural-looking images that do not match the input. This is why we included LPIPS (and to some extent, PSNR and SSIM), even though it is technically not a perceptual metric. For a qualitative comparison, we also conduct a human study and provide sample restorations.

Quantitative Results

Table 1 shows quantitative results on the GoPro dataset. We compared our model with the current state-of-the-art (SOTA) methods HINet, MPRNet, and DeblurGAN-v2.

Our model achieves SOTA performance across all perceptual metrics while maintaining PSNR and SSIM competitive with existing methods. Notably, we obtain an FID of 4.04, nearly a 70% reduction compared to DeblurGAN-v2, the current SOTA method in terms of perceptual quality. Moreover, the sample-averaging variant of our method achieves a new SOTA PSNR of 33.23 while still outperforming all other methods with respect to LPIPS. All in all, these results highlight our framework’s flexibility to control the trade-off between perception and distortion using a single model. As shown in Figure 1, our result sets a new Pareto frontier on the Perception-Distortion plot.

HIDE Results

We also evaluate our GoPro-trained model on the HIDE dataset to test its ability to generalize to out-of-distribution input. As the results in Table 2 clearly show, the gains in perceptual quality do translate over to the HIDE dataset. In particular, both of our models significantly outperform the baseline methods across all perceptual metrics while maintaining competitive distortion values.

Fig. 4 includes several sample reconstructions from both GoPro and HIDE datasets. Despite sometimes containing a little more noise (some of which was presumably learned from the training data itself), we see that our model shows a clear improvement in perceptual quality. Additional full-size comparisons are provided in Appendix G.

Human Study for Qualitative Evaluation

We ran a perceptual study with human subjects to further quantify the performance of the proposed deblurring framework. Our results are presented in Table 3. We used Amazon Mechanical Turk to obtain pairwise ratings comparing different deblurring methods applied to the GoPro dataset. In this study, the human subjects had a minimum approval rating of 70% and were asked to select the image with the better quality from side-by-side crops of size $512\times512$.

Results in Table 3 show the average rater’s preference computed from 480 comparisons. As the highlighted cells show, these results indicate that both variations of our deblurring model outperform the competing methods.

We also observed that raters showed a modest preference for the sample-averaged variant in crops with relatively flat content. On the other hand, raters preferred individual samples for highly-textured crops. Fig. 5 shows that the level of detail produced by our model is adaptive to the blur present in the input. As expected, blurrier images generally lead to higher variance in the resulting samples.

Discussion and Analysis

For the analysis of various aspects of our model, we used a custom dataset created by applying synthetic camera shake blur and noise (described in Appendix C) to the images of the DIV2K dataset. This was done to enable qualitative evaluation in a more controlled environment, since the low-quality ground truth images in existing paired datasets make qualitative assessment difficult.

More efficient sampling. The main benefit of residual modeling is the reduction in the computational cost of sampling. Due to the iterative nature of diffusion sampling, the denoiser network must run many times for each generated sample – sometimes up to hundreds to thousands of steps. Thus, any reduction in the cost of running the denoiser is particularly valuable, and our initial predictor provides a simple way to offload some of this computation.

A key question is then whether the initial predictor can compensate for the decrease in the sample quality from using a smaller denoiser network. We empirically explore this by comparing sampling latency against sample quality with and without the initial predictor. In Fig. 6, the non-residual model refers to a regular conditional diffusion model with a large denoiser network. The residual model follows our architecture and has a large initial predictor and a small denoiser. Overall, the residual model has more parameters (33M vs. 28M).

We see that the residual model requires much less time to sample an image despite being larger than the non-residual model. Importantly, this reduction in sampling cost does not negatively affect the sample quality; in fact, the residual model is up to $7\times$ faster for a comparable sample quality.

Output of the initial predictor. One unexpected discovery from our experiments is that the output of the initial predictor is often a fairly reasonable reconstruction of the reference image. We can see this in Fig. 3. While lacking in detail, the initial prediction is certainly less blurry than the input.

It is perhaps surprising that this happens even though there is no explicit loss on the initial predictor’s output $g_\theta(\bm{y})$ to match the reference. We also note that our method is not the only possible parameterization of a diffusion model with an explicit decoupling of the iterative portion (denoiser network) from the single-pass portion (initial predictor). For instance, we could have simply fed $g_\theta(\bm{y})$ as an auxiliary input to the denoiser $f_\theta$ without computing the residual. We leave these investigations around the initial predictor as future work.

Residual images are simpler to model. One may wonder why adding a deterministic initial predictor would help with the model’s performance. We posit that the benefits of residual modeling may be due to the distribution of residual images being “simpler” than that of reference images.

While it is impractical to approximate the true entropy of the two distributions, we can look at related quantities that may serve as a proxy. Specifically, we compute the entropy of pixel values aggregated across all pixel locations for residual and reference images. As expected from natural images, the reference pixel distribution is reasonably spread out and has an entropy of 7.42 bits per dimension (bpd). On the other hand, the residual pixel values follow a much more sharply concentrated distribution, leading to a substantially lower entropy of 3.91 bpd. This suggests that the residual images may indeed be simpler to model.
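One way to compute such a pixel-value entropy is from a histogram of integer intensities. The sketch below is an assumption about the procedure (the binning and value ranges are not specified here), and it uses synthetic arrays rather than real reference or residual images, so its numbers are not comparable to the values reported above.

```python
import numpy as np

def pixel_entropy_bits(values, num_bins=256, value_range=(0, 256)):
    """Shannon entropy (bits) of the histogram of pixel values."""
    hist, _ = np.histogram(values, bins=num_bins, range=value_range)
    p = hist[hist > 0] / hist.sum()          # empirical probabilities, zeros dropped
    return float(np.sum(p * np.log2(1.0 / p)))

rng = np.random.default_rng(0)
# Synthetic stand-ins: spread-out "reference" pixels, concentrated "residuals".
reference = rng.integers(0, 256, size=(64, 64, 3))
residual = rng.integers(-4, 5, size=(64, 64, 3))

h_ref = pixel_entropy_bits(reference)
# Residuals of 8-bit images lie in [-255, 255]: 511 unit-width bins.
h_res = pixel_entropy_bits(residual, num_bins=511, value_range=(-255, 256))
print(f"reference: {h_ref:.2f} bits, residual: {h_res:.2f} bits")
```

A distribution concentrated on a few values yields a low entropy regardless of the nominal range, which is the sense in which sharply peaked residuals are "simpler".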

Network Architecture Ablation

To better understand where the performance gains of our method originate, we trained a regression-based baseline that only uses the initial predictor. Surprisingly, we observed that the initial predictor alone was able to achieve a state-of-the-art PSNR of 33.07 when trained with a simple $L_2$ loss. Through a detailed ablation study, we identified three key hyperparameters: exponential moving average (EMA) of weights, large batch size, and network size.

In Table 4, we start from a simple U-Net architecture and gradually enable each of the aforementioned hyperparameters. All models were trained for 1M steps to ensure the differences are not due to insufficient training. As the results show, all three hyperparameters were critical to the model’s performance.

Conclusion and Future Directions

We presented a new framework for stochastic blind image deblurring with a focus on perceptual quality using a conditional diffusion model. We introduced a novel technique for reducing the computational burden of diffusion sampling. We empirically showed that our method achieves significantly improved perceptual quality and competitive distortion metrics as compared to the current state-of-the-art methods. We believe that our work opens a new direction for blind deblurring with a focus on perceptual quality and establishes a strong benchmark for future works to improve upon.

There are a number of avenues to explore to further address the limitations of our work. Due to slow sampling and large network size, diffusion models are computationally too expensive to be incorporated into consumer-level devices. One way to combat this is to use more efficient sampling schemes such as DDIM or distillation. Another promising direction is to replace our initial predictor and denoiser network with U-Net architectures that are optimized for both distortion and run time.

Appendix A Additional Perception-Distortion Plots

The Perception-Distortion plot provided in Section 1 of the main text shows the trade-off between PSNR and Kernel Inception Distance (KID). We observe that other combinations of perceptual (NIQE, LPIPS, FID) and distortion metrics (PSNR, SSIM) follow a similar trend, as shown in Figure 7. We note that formally LPIPS is also a distortion metric, as it is a full-reference based distance computed in a deep feature space. We nonetheless observed that LPIPS corresponds to human perception much better than PSNR or SSIM.

Appendix B Diversity Analysis

Figure 8 shows the relation between the blurriness (or sharpness) of the input image and the diversity of the generated deblurred samples. The blurrier the input image is, the more diversity we get in the samples (see figure caption for more details).

Appendix C Synthetic DIV2K Deblurring Dataset

To better analyze various aspects of our diffusion deblurring model, we created a custom dataset by applying synthetic camera shake blur and noise to the DIV2K dataset. This allows us to make qualitative evaluations in a more controlled environment, since the low-quality ground truth images in existing paired datasets make qualitative assessment difficult and lessen the benefits of using a powerful generative model.

The synthetically generated random kernels are of varying size ($31\times31$ maximal support). Figure 9 shows example kernels. The kernels can be of any size from a perfect delta (sharp) to about 30 pixels. In addition to the blur, white Gaussian noise with a random standard deviation $\sigma\sim\mathcal{U}$ is added.

Appendix D Omitted Details for DPM Formulation

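Equation 2: Marginal at time step $t$. We proceed by induction. For $t=1$, we have $\bar{\alpha}_1=\alpha_1$, so Eq. 2 reduces to the diffusion transition kernel:

$$q(\bm{x}_1 \,|\, \bm{x}_0) = \mathcal{N}\big(\bm{x}_1;\ \sqrt{\alpha_1}\,\bm{x}_0,\ (1-\alpha_1)\bm{I}_d\big).$$

Now suppose we have $q(\bm{x}_t \,|\, \bm{x}_0)=\mathcal{N}(\bm{x}_t;\sqrt{\bar{\alpha}_t}\bm{x}_0,(1-\bar{\alpha}_t)\bm{I}_d)$ for some $t\geq 1$, which we reparameterize as

$$\bm{x}_t = \sqrt{\bar{\alpha}_t}\,\bm{x}_0 + \sqrt{1-\bar{\alpha}_t}\,\bm{\epsilon}, \qquad \bm{\epsilon}\sim\mathcal{N}(\mathbf{0},\bm{I}_d).$$

Then by applying a single diffusion step $q(\bm{x}_{t+1} \,|\, \bm{x}_t)$ to the above, we get

$$\begin{aligned}
\bm{x}_{t+1} &= \sqrt{\alpha_{t+1}}\,\bm{x}_t + \sqrt{1-\alpha_{t+1}}\,\bm{\epsilon}^{\prime} \\
&= \sqrt{\alpha_{t+1}}\left(\sqrt{\bar{\alpha}_t}\,\bm{x}_0 + \sqrt{1-\bar{\alpha}_t}\,\bm{\epsilon}\right) + \sqrt{1-\alpha_{t+1}}\,\bm{\epsilon}^{\prime} \\
&= \sqrt{\bar{\alpha}_{t+1}}\,\bm{x}_0 + \sqrt{1-\bar{\alpha}_{t+1}}\,\bm{\epsilon}^{\prime\prime},
\end{aligned}$$

where the first step uses a reparameterization with $\bm{\epsilon}^{\prime}\sim\mathcal{N}(\mathbf{0},\bm{I}_d)$ independent of $\bm{\epsilon}$, the second step is from the inductive hypothesis, and the last step follows from summing two independent Gaussian random variables: their variances add to $\alpha_{t+1}(1-\bar{\alpha}_t)+(1-\alpha_{t+1})=1-\bar{\alpha}_{t+1}$, so $\bm{\epsilon}^{\prime\prime}\sim\mathcal{N}(\mathbf{0},\bm{I}_d)$. Thus

$$q(\bm{x}_{t+1} \,|\, \bm{x}_0) = \mathcal{N}\big(\bm{x}_{t+1};\ \sqrt{\bar{\alpha}_{t+1}}\,\bm{x}_0,\ (1-\bar{\alpha}_{t+1})\bm{I}_d\big).$$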

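Reverse diffusion step expressions. Applying Bayes’ Rule to Eq. 3 leads to the following expressions for the mean and variance of the reverse diffusion step (with $\bar{\alpha}_0\triangleq 1$):

$$\bm{\mu}_t(\bm{x}_t,\bm{x}_0) = \frac{\sqrt{\bar{\alpha}_{t-1}}\,(1-\alpha_t)}{1-\bar{\alpha}_t}\,\bm{x}_0 + \frac{\sqrt{\alpha_t}\,(1-\bar{\alpha}_{t-1})}{1-\bar{\alpha}_t}\,\bm{x}_t, \qquad \beta_t = \frac{(1-\bar{\alpha}_{t-1})(1-\alpha_t)}{1-\bar{\alpha}_t}.$$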

We refer the reader to Ho et al. for a more thorough treatment of the DPM formulation.
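As a sanity check on the reverse diffusion step, the standard closed-form posterior for this kind of Gaussian diffusion can be compared numerically against a direct application of the Gaussian product rule, with prior $q(\bm{x}_{t-1}\,|\,\bm{x}_0)$ and likelihood $q(\bm{x}_t\,|\,\bm{x}_{t-1})$. The sketch works with scalars for simplicity; the function names, the three-step schedule, and the input values are made up for the check.

```python
import numpy as np

def posterior_mean_var(xt, x0, alpha_t, alpha_bar_t, alpha_bar_prev):
    """Closed-form mean and variance of q(x_{t-1} | x_t, x_0)."""
    denom = 1.0 - alpha_bar_t
    mean = (np.sqrt(alpha_bar_prev) * (1.0 - alpha_t) * x0
            + np.sqrt(alpha_t) * (1.0 - alpha_bar_prev) * xt) / denom
    var = (1.0 - alpha_bar_prev) * (1.0 - alpha_t) / denom
    return mean, var

def posterior_by_bayes(xt, x0, alpha_t, alpha_bar_prev):
    """Same quantities from the Gaussian product rule (prior times likelihood)."""
    prior_mean = np.sqrt(alpha_bar_prev) * x0
    prior_var = 1.0 - alpha_bar_prev
    # Likelihood q(x_t | x_{t-1}) = N(sqrt(alpha_t) * x_{t-1}, 1 - alpha_t).
    precision = 1.0 / prior_var + alpha_t / (1.0 - alpha_t)
    var = 1.0 / precision
    mean = var * (prior_mean / prior_var
                  + np.sqrt(alpha_t) * xt / (1.0 - alpha_t))
    return mean, var

alphas = np.array([0.99, 0.95, 0.9])   # illustrative schedule
alpha_bars = np.cumprod(alphas)
t = 2                                  # 0-based index of the step being reversed
closed = posterior_mean_var(0.7, -0.2, alphas[t], alpha_bars[t], alpha_bars[t - 1])
bayes = posterior_by_bayes(0.7, -0.2, alphas[t], alpha_bars[t - 1])
print(closed, bayes)
```

The two computations agree to floating-point precision, and the variance does not depend on $\bm{x}_t$ or $\bm{x}_0$, only on the schedule.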

Specifying the noise schedule. Following prior work, given a fixed budget of $T$ steps, we sample the continuous noise level $\sqrt{\bar{\alpha}}$ from a piecewise uniform distribution. Specifically, we define $T$ intervals $(l_{i-1},l_i)$, where $l_0\triangleq 1$ and $l_i\triangleq\sqrt{\bar{\alpha}_i}$ for $i>0$. Then to sample a continuous noise level, we first randomly pick an interval $(l_{k-1},l_k)$ and then sample $\sqrt{\bar{\alpha}}\sim\mathcal{U}[l_{k-1},l_k]$.
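The two-stage sampling can be sketched as follows. This assumes the interval is picked uniformly at random, which is one reading of "randomly pick"; the function name and the 50-step schedule are hypothetical. Since the $l_i$ decrease with $i$, the lower endpoint of each interval is $l_k$.

```python
import numpy as np

def sample_sqrt_alpha_bar(alpha_bars, rng):
    """Sample sqrt(alpha_bar) from the piecewise uniform distribution."""
    # Interval endpoints l_0 = 1, l_i = sqrt(alpha_bar_i); decreasing in i.
    levels = np.concatenate([[1.0], np.sqrt(alpha_bars)])
    k = int(rng.integers(1, len(levels)))        # interval index in 1..T
    return rng.uniform(levels[k], levels[k - 1])  # l_k < l_{k-1}

rng = np.random.default_rng(0)
alpha_bars = np.cumprod(1.0 - np.linspace(1e-4, 0.05, 50))  # illustrative schedule
draws = np.array([sample_sqrt_alpha_bar(alpha_bars, rng) for _ in range(1000)])
print(draws.min(), draws.max())
```

Every draw lies between $\sqrt{\bar{\alpha}_T}$ and 1, so the denoiser sees the whole range of noise levels rather than only the $T$ discrete ones.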

Now all that remains is to specify the schedule $\alpha_1,\ldots,\alpha_T$. While there are many options (e.g. as explored by Chen et al.), we used a simple linear schedule on the variance of the forward process by fixing the two endpoints and linearly interpolating the intermediate values.

Appendix E Model Details

Network architecture. We use a U-Net architecture similar to the one used by SR3. A crucial difference is that our network was made fully-convolutional by removing self-attention, group normalization, and positional encoding. At the input, the noisy sample $\bm{x}_t$ is concatenated with the conditioning input $\bm{y}$ channel-wise.

As shown in Fig. 10, our U-Net has four resolution depths with channel multipliers $\{1,2,3,4\}$. Both the denoiser network and initial predictor use this architecture. Their main difference is size: the starting channel count is 64 for the initial predictor and 32 for the denoiser. This results in the initial predictor having $\sim$26M parameters and the denoiser having $\sim$7M parameters. Note that the input and output change slightly when this architecture is used for the initial predictor, which tries to estimate $\bm{x}$ from $\bm{y}$ (no $\bm{x}_t$ and $\bar{\alpha}$ in the input, and the output is not $\bm{\epsilon}$).

Training details. We train all of our models for 1M steps using 32 TPUv3 cores. Our main model with the initial predictor and the denoiser network takes about 27 hours to train. We used the AdamW optimizer with a fixed learning rate of 0.0001, weight decay rate of 0.0001, and EMA decay rate of 0.9999. During training, we used a fine-grained diffusion process with $T=2000$ steps. As described above, we used a linear noise schedule with the two endpoints set as $1-\alpha_0=1\times10^{-6}$ and $1-\alpha_T=0.01$.
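A linear schedule with these endpoints can be built in a few lines. The sketch assumes the two endpoint variances are assigned to the first and last of the $T$ steps, which is one reading of the indexing; the function name is hypothetical.

```python
import numpy as np

def linear_schedule(num_steps, first_var=1e-6, last_var=0.01):
    """Return (alphas, alpha_bars) for a linear forward-variance schedule."""
    variances = np.linspace(first_var, last_var, num_steps)  # values of 1 - alpha_t
    alphas = 1.0 - variances
    return alphas, np.cumprod(alphas)

alphas, alpha_bars = linear_schedule(2000)
print(alpha_bars[0], alpha_bars[-1])
```

With 2000 steps the cumulative product $\bar{\alpha}_T$ ends up on the order of $10^{-5}$, so the final latent retains almost none of the original signal, consistent with starting the reverse process from pure Gaussian noise.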

Appendix F Evaluation Details

For all our experiments (on all datasets: GoPro, HIDE, DIV2K), we performed a grid search over the following hyperparameter combinations during inference:

Inference steps ($T$): 10, 20, 30, 50, 100, 200, 300, 500.

Noise schedule ($\bm{\alpha}_{1:T}$): We fixed the initial forward process variance ($1-\alpha_0$) to $1\times10^{-6}$. For the final variance ($1-\alpha_T$), we sweep over $\{0.01,0.02,0.05,0.1,0.2,0.5\}$. The intermediate values are linearly interpolated.
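The grid above contains $8\times6=48$ configurations. A sketch of enumerating them and picking the best one under some score is shown below; the dictionary keys and helper names are hypothetical, and the score function is a placeholder standing in for an actual LPIPS evaluation, which is not reproduced here.

```python
import itertools

STEP_GRID = [10, 20, 30, 50, 100, 200, 300, 500]
FINAL_VAR_GRID = [0.01, 0.02, 0.05, 0.1, 0.2, 0.5]
INITIAL_VAR = 1e-6

def inference_configs():
    """Enumerate every (steps, final variance) combination in the grid."""
    return [{"steps": s, "initial_var": INITIAL_VAR, "final_var": v}
            for s, v in itertools.product(STEP_GRID, FINAL_VAR_GRID)]

def select_best(configs, score_fn):
    """Return the config with the lowest score (e.g. LPIPS, lower is better)."""
    return min(configs, key=score_fn)

configs = inference_configs()
# Placeholder score, NOT a real metric: stands in for an LPIPS evaluation.
best = select_best(configs, lambda c: abs(c["steps"] - 200) + c["final_var"])
print(len(configs), best)
```

Since every configuration reuses the same trained weights, the cost of the search is inference only.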

How baseline samples are obtained. As mentioned in Sec. 5 of the main text, we computed various perceptual metrics ourselves as the existing literature often only reports PSNR and SSIM. To ensure fairness in our comparisons, we tried to use author-produced restoration results whenever possible. Otherwise, we used the official implementations and pre-trained models released by the authors of each paper and produced restorations ourselves.

Specifically, for HINet, MPRNet, and SAPHNet, we used restorations produced by the authors for both GoPro and HIDE results. For MIMO-UNet+ and DeblurGAN-v2, we used the authors’ implementation and model checkpoints from their respective GitHub repositories. For SimpleNet, we could obtain neither the restorations nor the code, so we only reported the metrics from the paper (PSNR, SSIM, LPIPS).

Appendix G Large GoPro and HIDE Results

In Figures 11–13, we include larger versions of the GoPro and HIDE restorations shown in the main text. Figures 11 and 12 are from the GoPro dataset, and Figure 13 is from the HIDE dataset.

Appendix H Additional Results

GoPro dataset. In Figures 14–18 we present additional results on the GoPro dataset where we compare our diffusion deblurring method to SAPHNet, DeblurGAN-v2, MIMO-UNet+, MPRNet, and HINet. Consistent with the main text, “Ours-SA” refers to the sample averaging variant of our method.

DIV2K Deblurring dataset. In Figures 19–22 we present additional results on the synthetically generated DIV2K deblurring dataset. For comparison purposes, we train a regression-based model (to minimize L2 loss, thus maximizing PSNR) that has the same architecture as the one we used for the initial predictor. Compared to the over-smoothed restorations from the regression-based baseline trained to minimize distortion, our method produces more realistic textural details.