Solving Linear Inverse Problems Using the Prior Implicit in a Denoiser
Zahra Kadkhodaie, Eero P. Simoncelli
Introduction
Many problems in image processing and computer vision rely, explicitly or implicitly, on prior probability models. Describing the full density of natural images is a daunting problem, given the high dimensionality of the signal space. Traditionally, models have been developed by combining assumed symmetry properties (e.g., translation-invariance, dilation-invariance), with simple parametric forms (e.g., Gaussian, exponential, Gaussian mixtures), often within pre-specified transformed coordinate systems (e.g., Fourier transform, multi-scale wavelets). While these models have led to steady advances in problems such as denoising (e.g., ), they are too simplistic to generate complex features that occur in our visual world, or to solve more demanding statistical inference problems.
In recent years, nearly all problems in image processing and computer vision have been revolutionalized by the use of deep Convolutional Neural Networks (CNNs). These networks are generally trained to perform tasks in a supervised fashion, without explicit use of an image prior. Their phenomenal performance suggests that they embed far more sophisticated priors than traditional methods. But these implicit priors arise from a combination of the distribution of the training data, the architecture of the network , regularization terms included in the optimization objective, and the optimization algorithm. Moreover, they are intertwined with the task for which they are optimized.
Nearly a decade ago, a strategy known as “plug and play” was proposed for using the prior embedded in a denoiser to solve other inverse problems , and a number of recent extensions have used this concept to develop MAP solutions for linear inverse problems. Generally, the objective is decoupled into a data fidelity term and a regularization, introducing a slack variable for use in a proximal optimization algorithm (e.g., ADMM). In place of a proximal operator, a denoiser is substituted for the gradient of the regularization term in the objective function. A parallel thread of research has focused on the use of generative models based on score matching . The connection between score matching and denoising autoencoders was first shown in , by proving that the training criterion of a denoising autoencoder is equivalent to matching the score of the model and a Parzan density estimate of the data. This has been used as the basis for drawing samples from the prior implicit in the denoiser . In Section 4 we elaborate on how these methods are related to our results.
Here, we derive a general algorithm for solving linear inverse problems using the prior implicit in a denoiser. We start with a result from classical statistics that states that a denoiser that aims to minimize squared error of images corrupted by additive Gaussian noise may be interpreted as computing the gradient of the log of the density of noisy images. This result is related to score matching, but provides more direct relationship between least-squares optimal denoising and the embedded prior . Starting from a random initialization, we develop a stochastic ascent algorithm that uses this denoiser-estimated gradient to draw high-probability samples from the embedded prior. The gradient step sizes and the amplitude of injected noise are jointly and adaptively controlled by the denoiser, producing robust and efficient convergence. More generally, we combine this procedure with constraints arising from any linear measurement of an image to draw samples from the prior conditioned on this measurement, thus providing a stochastic solution for any linear inverse problem. We demonstrate that our method, using the prior implicit in a pre-trained state-of-the-art CNN image denoiser, produces high-quality results on image synthesis, inpainting, super-resolution, deblurring and recovery of missing pixels. We also apply our method to recovering images from projections onto a random low-dimensional basis, demonstrating results that greatly improve on those obtained using sparse union-of-subspace priors typically assumed in the compressive sensing literature. A software impliementation of the sampling and linear inverse algorithms is available at https://github.com/LabForComputationalVision/universal_inverse_problem
Suppose we make a noisy observation of an image, , where is the original image drawn from , and is a sample of Gaussian white noise. The observation density (also known as the prior predictive density) is related to the prior via marginalization:
Equation (1) is in the form of a convolution, and thus is a Gaussian-blurred version of the signal prior, . Moreover, the family of observation densities over different noise variances, , forms a Gaussian scale-space representation of the prior , analogous to the temporal evolution of a diffusion process.
2 Least squares denoising and CNNs
Given a noisy observation, , the least squares estimate (also called "minimum mean squared error", MMSE) of the true signal is well known to be the conditional mean of the posterior:
Traditionally, one obtains such estimators by choosing a prior probability model, (often with parameters fit to sets of images), combining it with a likelihood function describing the noise, , and solving. For example, the Wiener filter is derived by assuming a Gaussian prior in which variance falls inversely with spatial frequency . Modern denoising solutions, on the other hand, are often based on discriminative training. One expresses the estimation function (as opposed to the prior) in parametric form, and sets the parameters by minimizing the denoising MSE over a large training set of example signals and their noise-corrupted counterparts . The size of the training set is virtually unlimited, since it can be constructed automatically from a set of photographic images, and does not rely on human labelling.
Current state-of-the-art denoising results using CNNs are far superior, both numerically and visually, to results of previous methods . Recent work demonstrates that these architectures can be simplified by removing all additive bias terms, with no loss of performance. The resulting bias-free networks offer two important advantages. First, they automatically generalize to all noise levels: a network trained on images with barely noticeable levels of noise can produce high quality results when applied to images corrupted by noise of any amplitude. Second, they may be analyzed as adaptive linear systems, which reveals that they perform an approximate projection onto a low-dimensional subspace. In our context, we interpret this subspace as a tangent hyperplane of the image manifold at a specific location. Moreover, the dimensionality of these subspaces falls inversely with , and for a given noise sample, the subspaces associated with different noise amplitude are nested, with high-noise subspaces lying within their lower-noise counterparts. In the limit as the noise variance goes to zero, the subspace dimensionality grows to match that of the manifold at that particular point.
3 Exposing the implicit prior through Empirical Bayes estimation
The trained CNN denoisers mentioned above embed detailed prior knowledge of image structure. Given such a denoiser, how can we obtain access to this implicit prior? Recent results have derived relationships between Score matching density estimates and denoising , and have used these relationships to make use of implicit prior information. Here, we exploit a much more direct but little-known result from the literature on Empirical Bayesian estimation. The idea was introduced in , extended to the case of Gaussian additive noise in , and generalized to many other measurement models in . For the Gaussian noise case, the least-squares estimate of Eq. (2) may be rewritten as:
The proof of this result is relatively straightforward. The gradient of the observation density expressed in Eq. (1) is:
Multiplying both sides by and separating the right side into two terms gives:
Rearranging terms and using the chain rule to compute the gradient of the log gives Miyasawa’s result, as expressed in Eq. (3).
Intuitively, the Empirical Bayesian form in Eq. (3) suggests that denoisers use a form of gradient ascent, removing noise from an observed signal by moving up a probability gradient. But note that: 1) the relevant density is not the prior, , but the noisy observation density, ; 2) the gradient is computed on the log density (the associated “energy function”); and 3) the adjustment is not iterative - the optimal solution is achieved in a single step, and holds for any noise level, .
Drawing high-probability samples from the implicit prior
Suppose we wish to draw a sample from the prior implicit in a denoiser. Equation (4) allows us to generate an image proportional to the gradient of by computing the denoiser residual, . Previous work developed a related computation in a Markov chain Monte Carlo (MCMC) scheme, combining gradient steps derived from Score Matching and injected noise in a Langevin sampling algorithm to draw samples from a sequence of densities , while reducing in discrete steps, each associated with an appropriately trained denoiser. In contrast, starting from a random initialization, , we aim to find a high-probability image (i.e., an image from the manifold) using a simpler and more efficient stochastic gradient ascent procedure.
We compute gradients using the residual of a bias-free universal CNN denoiser, which automatically adapts to each noise level. On each iteration, we take a small step in the direction specified by the denoiser, which moves closer to the image manifold, thereby reducing the amplitude of the effective noise. The reduction of noise is achieved by decreasing the amplitude in the directions orthogonal to the observable manifold while retaining the amplitude of image in the directions parallel to manifold which gives rise to synthesis of image content. As the effective noise decreases, the observable dimensionality of the image manifold increases , allowing the synthesis of detailed structures. Since the family of observation densities, forms a scale-space representation of , the algorithm may be viewed as an adaptive form of coarse-to-fine optimization . Assuming the step sizes are adequately-controlled, the procedure converges to a point on the manifold. Figure 10 illustrates this process in two dimensions.
Each iteration operates by taking a deterministic step in the direction of the gradient (as obtained from the denoising function) and injecting some additional noise:
where is the residual of the denoising function, which is proportional to the gradient of , from Eq. (4). The parameter controls the fraction of the denoising correction that is taken, and controls the amplitude of a sample of white Gaussian noise, , whose purpose is to avoid getting stuck in local maxima. The effective noise variance of image is:
where the first term is the variance of the noise remaining after the denoiser correction , and the second term is the variance arising from the injected noise. We assume the denoiser step is highly effective in estimating the direction and magnitude of noise, reducing the noise standard deviation by a factor of - this is justified in the case of recent DNN denoisers, as shown empirically in .
To ensure convergence, we require the effective noise variance on each time step to be reduced, despite the injection of additional noise. For this purpose, we introduce a parameter to control the proportion of injected noise ( indicates no noise), and enforce the convergence by requiring that:
Combining this with Eq. (6) yields an expression for in terms of :
where the second line assumes that the magnitude of the denoising residual provides a good estimate of the effective noise standard deviation, as was found in . This allows the denoiser to adaptively control the gradient ascent step sizes, reducing them as the approaches the manifold (see Figure 10 for a visualization in 2D). This automatic adjustment results in efficient and reliable convergence, as demonstrated empirically in Figure 11.
We found that initial implementations with a small constant fractional step size produced high quality results, but required many iterations. Inuitively, step sizes that are a fixed proportion of the distance to the manifold lead to exponential decay - a form of Zeno’s paradox. To accelerate convergence, we introduced a schedule for increasing the step size proportion according to , starting from . Note that the injected noise amplitude, , is adjusted using Eq. (8), maintaining convergence, albeit at a slower rate. The sampling algorithm is summarized below (Algorithm 1), and is laid out in a block diagram in Figure 9 in the appendix. Example convergence behavior is shown in Figure 11.
We examined the properties of images synthesized using this stochastic coarse-to-fine ascent algorithm. For a denoiser, we used BF-CNN , a bias-free variant of DnCNN . We trained the same network on three different datasets: patches cropped from Berkeley segmentation training set , in color and grayscale, and MNIST dataset (see Appendix A for further details). We obtained similar results (not shown) using other bias-free CNN denoisers (see ). For the sampling algorithm, we chose parameters , and for all experiments.
Figure 1 illustrates the iterative generation of four images, starting from different random initializations, , with no additional noise injected (i.e, ), demonstrating the way that the algorithm amplifies and "hallucinates" structure found in the initial (noise) images. Convergence is typically achieved in less than 40 iterations with stochasticity disabled (). The left panel in Figure 2 shows samples drawn with different initializations, , using a moderate level of injected noise (). Images contain natural-looking features, with sharp contours, junctions, shading, and in some cases, detailed texture regions. The right panel in Figure 2 shows a set of samples drawn with more substantial injected noise (). The additional noise helps to avoid local maxima, and arrives at images that are smoother and higher probability, but still containing sharp boundaries. As expected, the additional noise also lengthens convergence time (see Figure 11). Figure 12 shows a set of samples drawn from the implicit prior of a denoiser trained on the MNIST dataset of handwritten digits.
Solving linear inverse problems using the implicit prior
Many applications in signal processing can be expressed as deterministic linear inverse problems - deblurring, super-resolution, estimating missing pixels (e.g., inpainting), and compressive sensing are all examples. Given a set of linear measurements of an image, , where is a low-rank measurement matrix, one attempts to recover the original image. In Section 2, we developed a stochastic gradient-ascent algorithm for obtaining a high-probability sample from . Here, we modify this algorithm to solve for a high-probability sample from the conditional density .
Consider the distribution of a noisy image, , conditioned on the linear measurements, . Without loss of generality, we assume the columns of the matrix M are orthogonal unit vectors (if not, we can re-parameterize to an equivalent constraint using the SVD). In this case, is the pseudo-inverse of , and that matrix can be used to project an image onto the measurement subspace. Using Bayes’ rule, we write the conditional density of the noisy image conditioned on the linear measureement as
where , and (the projection of onto the orthogonal complement of ). As with the algorithm of Section 2, we wish to obtain a local maximum of this function using stochastic coarse-to-fine gradient ascent. Applying the operator yields
The second term is the gradient of the observation noise distribution, projected into the measurement space. If this is Gaussian with variance , it reduces to . The first term is the gradient of a function defined only within the subspace orthogonal to the measurements, and thus can be computed by projecting the measurement subspace out of the full gradient. Combining these gives:
Thus, we see that the gradient of the conditional density is partitioned into two orthogonal components, capturing the gradient of the (log) noisy density, and the deviation from the constraints, respectively. To draw a high-probability sample from , we use the same algorithm described in Section 2, substituting Eq. (9) for the deterministic update vector, (see Algorithm 2, and Figure 9).
2 Linear inverse examples
We demonstrate the results of applying our method to several linear inverse problems. The same algorithm and parameters are used on all problems - only the measurement matrix and measured values are altered. In particular, as in section 2.1, we used BF-CNN , and chose parameters . For each example, we show a row of original images (), a row of direct least-squares reconstructions (), and a row of restored images generated by our algorithm. For these applications, comparisons to ground truth are not particularly meaningful, at least when the measurement matrix is of very low rank. In these cases, the algorithm relies heavily on the prior to "hallucinate" the missing information, and the goal is not so much to reproduce the original image, but to create an image that looks natural while being consistent with the measurements. Thus, the best measure of performance is a judgement of perceptual quality by a human observer.
Inpainting. A simple example of a linear inverse problem involves restoring a block of missing pixels, conditioned on the surrounding content. Here, the columns of the measurement matrix are a subset of the identity matrix, corresponding to the measured (outer) pixel locations. We choose a missing block of size pixels, which is less than the size of the receptive field of the BF-CNN network (), the largest extent over which this denoiser can be expected to directly capture joint statistical relationships. There is no single correct solution for this problem: Figure 3 shows multiple samples, resulting from different initalizations. Each appears plausible and consistent with the surrounding content. Figure 4 shows additional examples.
Random missing pixels. Consider a measurement process that discards a random subset of pixels. is a low rank matrix whose columns consist of a subset of the identity matrix corresponding to the randomly chosen set of preserved pixels. Figure 5 shows examples with of pixels retained. Despite the significant number of missing pixels, the recovered images are remarkably similar to the originals.
Spatial super-resolution. In this problem, the goal is to construct a high resolution image from a low resolution (i.e. downsampled) image. Downsampling is typically performed after lowpass filtering, and the combination of downsampling factor and filter kernel determines the measurement model, . Here, we use a block-averaging filter, and downsampling (i.e., measurements are averages over non-overlapping blocks). For comparison, we also reconstructed high resolution images using Deep image Prior (DIP) and DeepRED . DIP chooses a random input vector, and adjusts the weights of a CNN to minimize the mean square error between the output and the corrupted image. The authors interpret the denoising capability of this algorithm as arising from inductive biases imposed by the CNN architecture that favor clean natural images over those corrupted by artifacts or noise. By stopping the optimization early, the authors obtain surprisingly good solutions to the linear inverse problem. Regularization by denoising (RED) develops a MAP solution, by using a least squares denoiser as a regularizer . DeepRED combines DIP and RED, obtaining better performance than either method alone. Results for three example images are shown in Figure 6. In all three cases, our method produces an image that is sharper with less noticeable artifacts than the others. Despite this, the PSNR and SSIM values are slightly worse (see Tables 1 and 2). These can be improved by averaging over realizations (i.e., running the algorithm with different random initializations), at the expense of some blurring (see last column of Figure 6). We can interpret this in the context of the prior embedded in our denoiser: if each super-resolution reconstruction corresponds to a point on the (curved) manifold of natural images, then the average (a convex combination of those points) will lie off the manifold. This illustrates the point made earlier that comparison to ground truth (e.g. PSNR, SSIM) is not particularly meaningful when the measurement matrix is very low-rank. Finally, our method is more than two orders of magnitude faster than either DIP or DeepRED, as can be seen from average execution times provided in Table 7.
Deblurring (spectral super-resolution). The applications described above are based on partial measurements in the pixel domain. Here, we consider a blurring operator that operates by retaining a set of low-frequency coefficient in the Fourier domain, discarding the rest. This is equivalent to blurring the image with a sinc kenerl. In this case, consists of the preserved low-frequency columns of the discrete Fourier transform, and is a blurred version of . Note that, unlike Gaussian deblurring, this problem does not have a trivial solution. Examples are shown in Figure 7.
Compressive sensing. Compressive sensing provides a set of theoretical results regarding recovery of sparse signals from a small number of linear measurements. Specifically, if one assumes that signals can be represented with at most non-zero coefficients in a known basis, they can be recovered from a measurements obtained by projecting onto a surprisingly small number of axes (approaching ), far fewer than expected from traditional Shannon-Whitaker sampling theory. The theory relies on the sparsity property, which corresponds to a “union of subspaces” prior , and on the measurement axes being incoherent (essentially, weakly correlated) with the sparse basis. Typically, one chooses a sensing matrix containing a set of random orthogonal axes. Recovery is achieved by solving the sparse inverse problem, using any of a number of methods.
Photographic images are not truly sparse in any fixed linear basis, but they can be reasonably approximated by low-dimensional subsets of Fourier or wavelet basis functions, and compressive sensing results are typically demonstrated using one of these. The manifold prior embedded within our CNN denoiser corresponds to a nonlinear form of sparsity, and analogous to sparse inverse algorithms used in compressed sensing, our stochastic coarse-to-fine ascent algorithm can be used to recover an image from a set of linear projections onto a set of random basis functions. Table 5 shows average numerical performance on Set68 , in terms of both PSNR and SSIM. TVAL3 is an optimization-based algorithm with total variation regularization, ISTA-Net is a CNN supervised method trained to reconstruct images from measurements obtained from one particular random matrix. DIP is an unsupervised general linear inverse solver, BNN is an unsupervised Bayesian method for solving compressive sensing problems. Our method outperforms almost all of the other methods. All the numbers are taken from except for ISTA-Net which were obtained by running their open-source code. Figure 8 shows three examples of images recovered from random projections using our denoiser-induced manifold prior, compared with. In all cases, the denoiser-recovered images exhibit fewer artifacts.
Related work
Our method is conceptually similar to “plug and play”, in using a denoiser to solve linear inverse problems. But it differs in a number of important ways: (1) We derive our method from Miyasawa’s explicit relationship between the mapping and the prior, which is exact and direct, and makes the algorithm interpretable. The rationale for using a denoiser as a regularizer in P&P framework arises from interpreting the proximal operator of the regularizer as a MAP solution of a denoising problem. RED, which is based on P&P, offers some intuitions as to why the smoothness regularization is a proper choice, but the connection to the prior is is still not explicit; (2) We sample a high probability image from the implicit prior that is consistent with a linear constraint. This stochastic solution does not minimize MSE, but has high perceptual quality. RED and other P&P methods are derived as MAP solutions, and although this is not equivalent to minimizing MSE (maximizing PSNR), the results generally have better PSNR than our sampling results, but are visually more blurred (see Figure 6 and 7). As shown in Fig. 7 and Table 4, averaging over samples from our method results in improved PSNR (but more blurring); (3) RED, which uses ADMM for optimization, relies on hyper-parameter selection that affects convergence (as discussed extensively in ). On the contrary, our algorithm adjusts step-size automatically using the denoiser (which must be “blind”, operating on images of unknown noise level), with only two primary hyper-parameters ( and ), and performance is robust to choices of these (Appendix D); (4) Our method requires that the denoiser be trained on Gaussian-noise contaminated images to minimize squared error and must operate "blind" (without knowledge of noise level), whereas RED relies on a set of three additional properties.
Our method is also similar to recent work that uses Score Matching to draw samples from an implicit prior. This line of work is rooted in the connection between denoising autoencoders and score matching which was described in . The idea that denoising autoencoders learn the gradient of log density, and its connection to an underlying data manifold has been explored in . More recently, directly train a neural network to estimate energy (a scalar value related to density) and equated its gradient to the score of the data. Finally, in a breakthrough paper, Song and Ermon trained a sequence of denoisers with decreasing levels of noise and used Langevin dynamics to sample from the underlying probability distributions in succession, using the sample from each stage to initialize the process at the next stage. Our work differs in several important ways: (1) Our derivation of the relationship between the denoiser mapping and the density is direct and significantly simpler (i.e. a few lines of proof in Section 1.3), exploiting a result from the classical statistical literature on Empirical Bayes estimation . The theoretical connection between score matching and denoinsing autoencoders is approximate and significantly more complex. More importantly, our method makes explicit the noise level, whereas the score matching and denoising score matching derivations hold only in the limit of small noise. (2) Our method assumes and uses a single (universal) blind denoiser, rather than a family of denoisers trained for a pre-specified set of noise levels. The noise level schedule, hence the step size, is automatically adjusted by the denoiser, based on distance to the manifold. (3) Our algorithm is efficient - we use stochastic gradient ascent to maximize probability, rather than MCMC methods (e.g., Langevin dynamics) to draw proper samples from each of a discrete sequence of densities. (4) Finally, we’ve demonstrate the generality of our algorithm by using it as a general solver for five different linear inverse problems, whereas the Score Matching methods are generally focused on unconditional image synthesis.
Discussion
We’ve described a framework for using the prior embedded in a denoiser to solve inverse problems. Specifically, we developed a stochastic coarse-to-fine gradient ascent algorithm that uses the denoiser to draw high-probability samples from its implicit prior, and a constrained variant that can be used to solve any linear inverse problem. The derivation relies on the denoiser being optimized for mean squared error in removing additive Gaussian noise of unspecified amplitude. Denoisers can be trained using discriminative learning (nonlinear regression) on virtually unlimited amounts of unlabeled data, and thus, our method extends the power of supervised learning to a much broader set of problems, without further training.
As mentioned previously, the performance of our method is not well-captured by comparisons to the original image (using, for example, PSNR or SSIM). Performance should ultimately be tested using experiments with human observers, which might be approximated using using a no-reference perceptual quality metric (e.g., ). Handling of nonlinear inverse problems with convex measurements (e.g. recovery from quantized representation, such as JPEG) is a natural extension of the method, in which the algorithm would need to be modified to incorporate projection onto convex sets. Finally, our method for image generation offers a means of visualizing the implicit prior of a denoiser, which arises from the combination of architecture, optimization, regularization, and training set. As such, it might offer a means of experimentally isolating and elucidating the effects of these components.
References
Appendix A Description of BF-CNN denoiser
Examples in this paper were computed using the publicly-available implementation of the BF-CNN denoiser , which is trained to minimize MSE for images corrupted with additive Gaussian white noise.
The network contains bias-free convolutional layers, each consisting of filters and channels, batch normalization, and a ReLU nonlinearity. Note that to construct a bias-free network, we remove all sources of additive bias, including the mean parameter of the batch-normalization, in every layer.
Training Scheme.
We follow the training procedure described in . The network is trained to denoise images corrupted by i.i.d. Gaussian noise with standard deviations drawn from the range (relative to image intensity range $40\times 40180\times 180$. Training is carried out on batches of size 128, for 70 epochs.
Appendix B Block diagram of Universal Linear Inverse Sampler
Appendix C Visualization of Universal Inverse Sampler on a 2D manifold prior
Appendix D Convergence
Figure 11 illustrates the convergence of our iterative sampling algorithm, expressed in terms of the effective noise standard deviation averaged over synthesis of three images, for three different levels of the stochasticity parameter . Convergence is well-behaved and efficient in all cases. As expected, with smaller (larger amounts of injected noise), effective standard deviation falls more slowly, and convergence takes longer. For (no injected noise), the convergence is approximately what is expected from the formulation of the alorithm (Eq. 7). For larger amounts of injected noise, the algorithm converges faster than expected, we believe because a portion of the additive noise is parallel to the manifold, so does not contribute to calculated variance.
In addition to the total effective noise, we can compare the evolution of the removed noise versus injected noise. Figure LABEL:fig:convergence_parts shows the reduction in effective standard deviation, in each iteration, along with the standard deviation of the added noise, . The amount of noise added relative to the amount removed is such that effective noise drops as . When , the addedtive noise is zero, , and the convergence of is the fastest. When , a lot of noise is added in each iteration, and the convergence is the slowest.