Image Restoration using Total Variation Regularized Deep Image Prior

Jiaming Liu, Yu Sun, Xiaojian Xu, Ulugbek S. Kamilov

Introduction

Image reconstruction is one of the most widely studied problems in computational imaging. Since the problem is often ill-posed, the process is traditionally regularized by constraining the solutions to be consistent with our prior knowledge about the image. Some traditional imaging priors include nonnegativity, transform-domain sparsity, and self-similarity. Recently, however, the attention in the field has been shifting towards new imaging formulations based on deep learning.

The most common deep-learning approach is based on end-to-end training of a convolutional neural network (CNN) to reproduce the desired image from its noisy measurements. A popular alternative trains a CNN as an image denoiser and uses it within an iterative reconstruction algorithm. Recently, however, it was also shown that a CNN can by itself regularize image reconstruction without data-driven training. This deep image prior (DIP) framework naturally regularizes reconstruction by optimizing the weights of a CNN so that it synthesizes the measurements from a given random input vector. The intuition behind DIP is that natural images can be well represented by CNNs, which is not the case for random noise and certain other image degradations. DIP was shown to achieve remarkable performance on a number of image reconstruction tasks.

In this paper, we propose to further improve DIP by combining the implicit CNN regularization with an explicit total variation (TV) penalty. The idea of our DIP-TV approach is simple: by including an additional TV term in the objective function, we restrict the solutions synthesized by the CNN to those that are piecewise smooth. We experimentally show that our DIP-TV method outperforms the traditional formulations of DIP and TV, and performs on par with other state-of-the-art image restoration methods such as BM3D and IRCNN.

Background

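Consider the restoration as a linear inverse problem

$$\mathbf{y}=\mathbf{H}\mathbf{x}+\mathbf{e}, \tag{1}$$

where the goal is to recover the unknown image $\mathbf{x}$ from the measurements $\mathbf{y}$, $\mathbf{H}$ is the degradation operator (for example, a blur), and $\mathbf{e}$ is the noise, taken here to be additive white Gaussian noise (AWGN).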

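As practical inverse problems are often ill-posed, it is common to regularize the task by constraining the solution according to some prior knowledge. In practice, the reconstruction often relies on the regularized least-squares formulation

$$\widehat{\mathbf{x}}=\mathop{\mathrm{arg\,min}}_{\mathbf{x}}\left\{\frac{1}{2}\|\mathbf{y}-\mathbf{H}\mathbf{x}\|_{2}^{2}+\lambda\rho(\mathbf{x})\right\}, \tag{2}$$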

where the data-fidelity term ensures the consistency with measurements, and regularizer $\rho$ constrains the solution to the desired image class. The parameter $\lambda>0$ controls the strength of regularization.

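One of the most widely used regularizers is the anisotropic TV penalty

$$\rho_{\text{TV}}(\mathbf{x})=\|\mathbf{D_{1}}\mathbf{x}\|_{1}+\|\mathbf{D_{2}}\mathbf{x}\|_{1}, \tag{3}$$

where $\mathbf{D_{1}}$ and $\mathbf{D_{2}}$ denote the finite-difference operators along the first and second dimensions of a two-dimensional (2D) image with appropriate boundary conditions.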
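As a minimal sketch of how an anisotropic TV penalty can be evaluated from the two finite differences, the following NumPy snippet may help. The function name `anisotropic_tv` and the test image are hypothetical, and the boundary handling is an assumption: `np.diff` simply drops the last difference along each axis, which is only one of several possible boundary conditions.

```python
import numpy as np

def anisotropic_tv(x):
    """Sum of absolute finite differences along both image dimensions."""
    x = np.asarray(x, dtype=float)
    d1 = np.diff(x, axis=0)  # differences along the first dimension
    d2 = np.diff(x, axis=1)  # differences along the second dimension
    return np.abs(d1).sum() + np.abs(d2).sum()

# Piecewise-constant test image: left half 0, right half 1.
edge = np.zeros((8, 8))
edge[:, 4:] = 1.0

# The same image with a small amount of Gaussian noise added.
rng = np.random.default_rng(0)
noisy = edge + 0.1 * rng.standard_normal(edge.shape)

tv_edge = anisotropic_tv(edge)    # one unit jump per row
tv_noisy = anisotropic_tv(noisy)  # noise adds many small jumps
print(tv_edge, tv_noisy)
```

The piecewise-constant image pays only for its single edge, whereas the noisy version pays for every pixel-to-pixel fluctuation, which is why penalizing TV favors piecewise-smooth solutions.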

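Currently, deep learning achieves the state-of-the-art performance for different image restoration problems. The core idea is to train a CNN via the following optimization

$$\mathbf{\Theta}^{\ast}=\mathop{\mathrm{arg\,min}}_{\mathbf{\Theta}}\mathcal{L}(f_{\mathbf{\Theta}}(\mathbf{y}),\mathbf{x}),\qquad {\mathbf{x}}^{\ast}=f_{\mathbf{\Theta}^{\ast}}(\mathbf{y}), \tag{4}$$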

where ${\mathbf{x}}^{\ast}$ is the restored image, $f_{\mathbf{\Theta}}(\cdot)$ represents the CNN parametrized by $\mathbf{\Theta}$, and $\mathcal{L}$ denotes the loss function. In practice, this optimization can be effectively carried out using the family of stochastic gradient descent (SGD) methods, such as adaptive moment estimation (ADAM).
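For readers less familiar with ADAM, a minimal NumPy sketch of its update rule is given below. It is an illustration only: the function `adam_minimize`, the quadratic test loss, and all hyperparameter values are assumptions chosen for the example, not settings taken from the experiments in this paper.

```python
import numpy as np

def adam_minimize(grad_fn, theta0, lr=0.05, beta1=0.9, beta2=0.999,
                  eps=1e-8, steps=2000):
    """Minimize a function given its gradient, using the ADAM update rule."""
    theta = np.array(theta0, dtype=float)
    m = np.zeros_like(theta)  # running mean of the gradient
    v = np.zeros_like(theta)  # running mean of the squared gradient
    for t in range(1, steps + 1):
        g = grad_fn(theta)
        m = beta1 * m + (1.0 - beta1) * g
        v = beta2 * v + (1.0 - beta2) * g * g
        m_hat = m / (1.0 - beta1 ** t)  # bias correction
        v_hat = v / (1.0 - beta2 ** t)
        theta -= lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta

# Toy loss 0.5 * ||theta - target||^2, whose gradient is theta - target.
target = np.array([1.0, -2.0, 3.0])
theta_star = adam_minimize(lambda th: th - target, np.zeros(3))
print(theta_star)
```

When training a CNN, `theta` would hold the network weights and `grad_fn` would be evaluated by backpropagation, typically on mini-batches.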

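Recently, Ulyanov et al. proposed to use CNN-based methods in an alternative way. They discovered that the architecture of deep CNN models is well-suited for representing natural images, but not random noise. Given a random input vector, a CNN can reproduce the clean image without supervised training on a large dataset. In the context of image restoration, the associated optimization for DIP can be formulated as

$$\mathbf{\Theta}^{\ast}=\mathop{\mathrm{arg\,min}}_{\mathbf{\Theta}}\frac{1}{2}\|\mathbf{y}-\mathbf{H}f_{\mathbf{\Theta}}(\mathbf{z})\|_{2}^{2},\qquad {\mathbf{x}}^{\ast}=f_{\mathbf{\Theta}^{\ast}}(\mathbf{z}), \tag{5}$$

where $\mathbf{z}$ denotes the random input vector.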

Proposed Method

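We propose to combine the implicit DIP regularization with an explicit TV penalty by solving

$$\mathbf{\Theta}^{\ast}=\mathop{\mathrm{arg\,min}}_{\mathbf{\Theta}}\left\{\frac{1}{2}\|\mathbf{y}-\mathbf{H}f_{\mathbf{\Theta}}(\mathbf{z})\|_{2}^{2}+\lambda\rho_{\text{TV}}(f_{\mathbf{\Theta}}(\mathbf{z}))\right\},\qquad {\mathbf{x}}^{\ast}=f_{\mathbf{\Theta}^{\ast}}(\mathbf{z}). \tag{6}$$

Optimization in (6) is similar to the training of a CNN, and one can rely on any standard optimization algorithm.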
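To make the structure of such an objective concrete, the sketch below minimizes a data-fidelity term plus a TV penalty by plain gradient descent. It is a deliberately simplified stand-in and not the method of this paper: the pixels `x` are optimized directly in place of the CNN output $f_{\mathbf{\Theta}}(\mathbf{z})$, the operator $\mathbf{H}$ is taken to be the identity (denoising), and the TV term is smoothed with a small constant `eps` so that it is differentiable. All names and the values of `lam`, `step`, and `eps` are illustrative assumptions.

```python
import numpy as np

def smoothed_tv(x, eps=1e-3):
    """Smoothed anisotropic TV: sum of sqrt(d^2 + eps) over finite differences."""
    d1, d2 = np.diff(x, axis=0), np.diff(x, axis=1)
    return np.sqrt(d1 ** 2 + eps).sum() + np.sqrt(d2 ** 2 + eps).sum()

def smoothed_tv_grad(x, eps=1e-3):
    """Gradient of smoothed_tv with respect to the pixels of x."""
    d1, d2 = np.diff(x, axis=0), np.diff(x, axis=1)
    p1 = d1 / np.sqrt(d1 ** 2 + eps)
    p2 = d2 / np.sqrt(d2 ** 2 + eps)
    g = np.zeros_like(x)
    g[1:, :] += p1   # adjoint of the difference along the first dimension
    g[:-1, :] -= p1
    g[:, 1:] += p2   # adjoint of the difference along the second dimension
    g[:, :-1] -= p2
    return g

def objective(x, y, lam):
    """Data fidelity (with H = identity) plus the weighted smoothed TV penalty."""
    return 0.5 * np.sum((y - x) ** 2) + lam * smoothed_tv(x)

rng = np.random.default_rng(0)
clean = np.zeros((16, 16))
clean[:, 8:] = 1.0
y = clean + 0.1 * rng.standard_normal(clean.shape)  # noisy measurements

lam, step = 0.1, 0.02  # illustrative values only
x = y.copy()
history = [objective(x, y, lam)]
for _ in range(200):
    grad = (x - y) + lam * smoothed_tv_grad(x)
    x -= step * grad
    history.append(objective(x, y, lam))

print(history[0], history[-1])
```

In a DIP-style setting, the same gradient with respect to the image would be backpropagated through the network to update $\mathbf{\Theta}$ rather than applied to the pixels directly.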

Figure 3 illustrates the CNN architecture we used in this paper, which was adapted from . In particular, the popular U-net architecture is modified such that the skip connections contain a convolutional layer. The network uses a scaling-expanding structure based on down-sampling and up-sampling, which makes the effective receptive field increase as the input goes deeper into the network. In addition, the skip connections enable the later layers to reconstruct the feature maps with both local details and global texture. Here, the input $\mathbf{z}$ can be initialized with uniform noise and be further optimized. The proposed framework can deal with both grayscale and color images; for color images, the anisotropic TV jointly regularizes all three channels.
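The growth of the receptive field with depth can be illustrated with the standard recursion for the theoretical receptive field of stacked convolutions. The layer configuration below (a 3x3 stride-2 convolution followed by a 3x3 stride-1 convolution per level, five levels) is a hypothetical example and is not the configuration of Figure 3; the theoretical receptive field is also only an upper bound on the effective one.

```python
def receptive_field(layers):
    """Theoretical receptive field of a chain of (kernel_size, stride) layers."""
    rf, jump = 1, 1
    for kernel, stride in layers:
        rf += (kernel - 1) * jump  # each layer widens the field by (k - 1) input steps
        jump *= stride             # striding multiplies the spacing between outputs
    return rf

# Hypothetical encoder level: stride-2 convolution, then stride-1 convolution.
level = [(3, 2), (3, 1)]
fields = [receptive_field(level * depth) for depth in range(1, 6)]
print(fields)  # [7, 19, 43, 91, 187]
```

Each additional down-sampling level roughly doubles the receptive field, which is the effect the scaling-expanding structure relies on.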

Experiments

We now present the experimental results on image denoising and deblurring. We consider 14 grayscale images and 8 standard color images ($256\times 256$ and $512\times 512$) from Set12, Set14, and BSD68 as our testing images. The grayscale images are shown in Figure 1, while the color images are: Monarch, Parrots, House, Lena, Peppers, Baby, and Jet.

1 Image Denoising

In this subsection, we analyze the performance of the DIP-TV method on image denoising problems. The CNN architecture in Figure 3 is used for both color and grayscale images, with $n_{s}[i]=4$ for each skip layer. All algorithmic hyperparameters were optimized in each experiment for the best signal-to-noise ratio (SNR) performance with respect to the ground truth test image. Both DIP-TV and DIP were set to run 5000 optimization steps. We use the average SNR to denote the SNR values averaged over the associated set of test images.
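For concreteness, one common way to compute the SNR and its average over a test set is sketched below. The definition used here, the ratio of the reference energy to the error energy in decibels, is an assumption; the exact variant used in the experiments is not spelled out in the text, and the function names are hypothetical.

```python
import numpy as np

def snr_db(reference, estimate):
    """SNR in dB: energy of the reference over energy of the error."""
    reference = np.asarray(reference, dtype=float)
    estimate = np.asarray(estimate, dtype=float)
    error = reference - estimate
    return 10.0 * np.log10(np.sum(reference ** 2) / np.sum(error ** 2))

def average_snr_db(references, estimates):
    """Mean of the per-image SNR values over a set of test images."""
    return float(np.mean([snr_db(r, e) for r, e in zip(references, estimates)]))

# An error with one tenth of the reference amplitude gives about 20 dB.
ref = np.ones((4, 4))
est = ref + 0.1
print(snr_db(ref, est))
```

Note that averaging per-image SNR values in dB, as done here, is not the same as computing a single SNR from pooled energies.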

We first present the results of the experiments on grayscale images, where we compared DIP-TV with EPLL, BM3D, TV, and DIP. To directly evaluate the range of noise levels over which DIP-TV performs better, the relationship between input SNR and output SNR is presented in Table 1. The grayscale images were corrupted by AWGN corresponding to input SNRs of 5 dB, 10 dB, 15 dB, 20 dB, and 25 dB. In particular, DIP-TV outperforms the original DIP by around 0.5 dB for a wide range of noise levels from 5 dB to 20 dB. Note that the proposed method also bridges the gap between DIP and the state-of-the-art methods at high noise levels. Figure 4 illustrates the visual comparisons for the grayscale images Tower and Jet under two different noise levels. DIP-TV significantly improves the denoising performance of DIP in terms of both visual quality and SNR. The noise is effectively filtered out and the details of the image are preserved because of the TV regularization. For instance, on Tower, DIP-TV improves the SNR by over 1.06 dB compared with DIP and outperforms BM3D by 0.35 dB. Visually, the door highlighted in Tower is clearly restored, while the other methods introduce serious distortion.
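Corrupting an image at a prescribed input SNR amounts to scaling the noise variance to the signal power. The sketch below shows one way to do this; the function `add_awgn_at_snr` is hypothetical, the random array is only a stand-in for a test image, and the power-ratio definition of input SNR is an assumption.

```python
import numpy as np

def add_awgn_at_snr(image, input_snr_db, rng):
    """Add white Gaussian noise whose power yields the requested input SNR."""
    image = np.asarray(image, dtype=float)
    signal_power = np.mean(image ** 2)
    noise_power = signal_power / (10.0 ** (input_snr_db / 10.0))
    noise = np.sqrt(noise_power) * rng.standard_normal(image.shape)
    return image + noise

rng = np.random.default_rng(0)
image = rng.uniform(0.0, 1.0, size=(256, 256))  # stand-in for a test image

for target in (5, 10, 15, 20, 25):
    noisy = add_awgn_at_snr(image, target, rng)
    measured = 10.0 * np.log10(np.sum(image ** 2) / np.sum((noisy - image) ** 2))
    print(target, round(float(measured), 2))
```

The measured SNR matches the target only up to sampling fluctuations, which become smaller as the image gets larger.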

In color image denoising, we compared our method with CBM3D and NLM as well as DIP itself. We considered AWGN with noise level $\sigma$ ranging from 25 to 75. Figure 2 compares the SNR performance of CBM3D, DIP, and DIP-TV on the image Monarch. Table 2 summarizes the average SNR of the different methods. Overall, DIP-TV exceeds DIP by at least 0.2 dB on the testing images. Moreover, DIP-TV outperforms CBM3D as the noise level increases (e.g., $\sigma\geq 35$). Considering that the whole procedure of DIP-TV and DIP is image-agnostic and that no prior information is learned from other images, it is notable that DIP-TV achieves performance comparable to the state of the art at high noise levels.

2 Image Deblurring

In image deblurring, one is given a blurry image that is synthesized by first applying a blur kernel $\mathbf{H}$ and then adding AWGN with noise level $\sigma$. The goal is to restore the image from its degraded version. We tested DIP and DIP-TV based on the network architecture illustrated in Figure 3, with $n_{s}[i]=128$.

Both DIP and DIP-TV were set to run 5500 optimization steps. Taking advantage of recent progress in CNNs and of GPU computation, we implemented the blur as a convolution. As a baseline, we compared our method with IRCNN and DIP itself on the same set of images as in denoising. Two blur kernels were applied: a general Gaussian kernel with standard deviation 1.6 and a realistic kernel defined in . AWGN with a different noise level $\sigma$ is added in each experiment.
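The degradation model for the Gaussian case can be sketched in a few lines of NumPy. This is an illustration under stated assumptions rather than the implementation used in the experiments: the kernel support (`radius=12`, i.e. 25x25), the edge-replicating padding, the separable implementation, and the function names are all choices made for this example, and only a single-channel image is handled.

```python
import numpy as np

def gaussian_kernel_1d(sigma, radius):
    """Normalized 1D Gaussian kernel with 2 * radius + 1 taps."""
    t = np.arange(-radius, radius + 1, dtype=float)
    k = np.exp(-0.5 * (t / sigma) ** 2)
    return k / k.sum()

def blur(image, sigma=1.6, radius=12):
    """Separable Gaussian blur of a 2D image via 1D convolutions."""
    k = gaussian_kernel_1d(sigma, radius)
    padded = np.pad(image, radius, mode="edge")  # replicate border pixels
    rows = np.apply_along_axis(lambda r: np.convolve(r, k, mode="valid"), 1, padded)
    return np.apply_along_axis(lambda c: np.convolve(c, k, mode="valid"), 0, rows)

def degrade(image, noise_sigma, rng, blur_sigma=1.6, radius=12):
    """Blur the image, then add white Gaussian noise of the given level."""
    blurred = blur(image, blur_sigma, radius)
    return blurred + noise_sigma * rng.standard_normal(image.shape)

rng = np.random.default_rng(0)
sharp = np.zeros((32, 32))
sharp[:, 16:] = 255.0  # a vertical edge on a 0..255 intensity scale
y = degrade(sharp, noise_sigma=2.0, rng=rng)
print(y.shape)
```

Because the blur is a convolution, the same operation can be expressed as a fixed convolutional layer, which is what allows gradients to flow through $\mathbf{H}$ during the optimization.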

Figure 5 shows the visual results for Peppers obtained by different methods. All methods can effectively remove the blur and noise from the image. In particular, our method further enhances the piecewise smoothness and mitigates the noise of the image, and thus increases the peak signal-to-noise ratio (PSNR) by over 0.45 dB compared with DIP. Also note that the TV regularization makes DIP outperform even IRCNN by 0.15 dB on Peppers. Table 3 reports the average PSNR comparison with IRCNN and DIP on color and grayscale images, respectively.
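As a reminder of how PSNR relates to the mean squared error (MSE), a minimal sketch follows. The peak value of 255 assumes 8-bit intensities, which is an assumption of this example, and `psnr_db` is a hypothetical helper name.

```python
import numpy as np

def psnr_db(reference, estimate, peak=255.0):
    """PSNR in dB for images whose maximum possible intensity is `peak`."""
    reference = np.asarray(reference, dtype=float)
    estimate = np.asarray(estimate, dtype=float)
    mse = np.mean((reference - estimate) ** 2)
    return 10.0 * np.log10(peak ** 2 / mse)

# A constant error of 1% of the peak value corresponds to about 40 dB.
ref = np.zeros((8, 8))
est = ref + 2.55
print(psnr_db(ref, est))
```

Under this definition, a PSNR gain of 0.45 dB corresponds to an MSE that is roughly 10% lower, since $10^{-0.045}\approx 0.90$.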

In general, TV regularization improves on DIP by at least 0.54 dB in terms of PSNR and makes the DIP framework more comparable with IRCNN. For example, DIP-TV is only 0.01 dB lower than IRCNN in terms of the average PSNR on color images, with the standard Gaussian blur kernel and $\sigma=2$.