Learning a Single Convolutional Super-Resolution Network for Multiple Degradations
Kai Zhang, Wangmeng Zuo, Lei Zhang
Introduction
Single image super-resolution (SISR) aims to recover a high-resolution (HR) version of a low-resolution (LR) input. As a classical problem, SISR is still an active yet challenging research topic in the field of computer vision due to its ill-posed nature and high practical value. In the typical SISR framework, an LR image y is modeled as the output of the following degradation process:
$$y = (x \otimes k)\downarrow_s + n, \tag{1}$$
where $x \otimes k$ represents the convolution between a blur kernel k and a latent HR image x, $\downarrow_s$ is a subsequent downsampling operation with scale factor $s$, and $n$ usually is additive white Gaussian noise (AWGN) with standard deviation (noise level) $\sigma$.
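The degradation process above (blur, then downsample, then add noise) can be sketched in a few lines of NumPy. This is only an illustrative sketch under stated assumptions: the function names `blur` and `degrade` are hypothetical, the image is grayscale with intensities in [0, 1], and a direct (strided) downsampler stands in for the bicubic one to keep the example short.

```python
import numpy as np

def blur(x, k):
    """'Same'-size 2-D convolution of a grayscale image with an odd-sized kernel."""
    p = k.shape[0] // 2
    xp = np.pad(x, p, mode="edge")          # replicate borders
    kf = k[::-1, ::-1]                      # flip the kernel for true convolution
    out = np.zeros(x.shape, dtype=np.float64)
    for i in range(k.shape[0]):
        for j in range(k.shape[1]):
            out += kf[i, j] * xp[i:i + x.shape[0], j:j + x.shape[1]]
    return out

def degrade(x, k, s, sigma, rng):
    """Blur with k, downsample by s, add AWGN with standard deviation sigma."""
    blurred = blur(x, k)
    down = blurred[::s, ::s]                # assumption: direct downsampler, not bicubic
    return down + rng.normal(0.0, sigma, size=down.shape)

rng = np.random.default_rng(0)
x = rng.random((24, 24))                    # stand-in HR image
ax = np.arange(7) - 3
k = np.exp(-(ax[:, None] ** 2 + ax[None, :] ** 2) / (2 * 1.6 ** 2))
k /= k.sum()                                # 7x7 isotropic Gaussian kernel
y = degrade(x, k, s=3, sigma=0.05, rng=rng) # LR image of shape (8, 8)
```

Swapping the strided slice for a bicubic resize would bring the sketch closer to the degradation assumed in the rest of the paper.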
SISR methods can be broadly classified into three categories, i.e., interpolation-based methods, model-based optimization methods and discriminative learning methods. Interpolation-based methods such as nearest-neighbor, bilinear and bicubic interpolators are simple and efficient but have very limited performance. By exploiting powerful image priors (e.g., the non-local self-similarity prior, sparsity prior and denoiser prior), model-based optimization methods can flexibly reconstruct relatively high-quality HR images, but they usually involve a time-consuming optimization procedure. Although the integration of a convolutional neural network (CNN) denoiser prior and model-based optimization can improve the efficiency to some extent, it still suffers from the typical drawbacks of model-based optimization methods, e.g., it is not trained end to end and involves hand-designed parameters. As an alternative, discriminative learning methods have attracted considerable attention due to their favorable SISR performance in terms of effectiveness and efficiency. Notably, recent years have witnessed a dramatic upsurge in the use of CNNs for SISR.
In this paper, we focus on discriminative CNN methods for SISR so as to exploit the merits of CNNs, such as fast speed through parallel computing, high accuracy through end-to-end training, and tremendous advances in training and designing networks. While several SISR models based on discriminative CNNs have reported impressive results, they suffer from a common drawback: their models are specialized for a single simplified degradation (e.g., bicubic degradation) and lack the scalability to handle multiple degradations with a single model. Because the practical degradation of SISR is much more complex, the performance of learned CNN models may deteriorate seriously when the assumed degradation deviates from the true one, making them less effective in practical scenarios. It has been pointed out that the blur kernel plays a vital role in the success of SISR methods and that a mismatch of blur kernels will largely deteriorate the final SISR results. However, little work has been done on how to design a CNN to address this crucial issue.
Given the facts above, it is natural to raise the following questions, which are the focus of our paper: (i) Can we learn a single model to effectively handle multiple and even spatially variant degradations? (ii) Is it possible to use synthetic data to train a model with high practicability? This work aims to make one of the first attempts towards answering these two questions.
To answer the first question, we revisit and analyze the general model-based SISR methods under the maximum a posteriori (MAP) framework. We then argue that one may tackle this issue by taking the LR input, blur kernel and noise level as input to the CNN, but their dimensionality mismatch makes it difficult to design a single convolutional super-resolution network. In view of this, we introduce a dimensionality stretching strategy which enables the network to handle multiple and even spatially variant degradations with respect to blur kernel and noise. To the best of our knowledge, there has been no attempt to consider both the blur kernel and noise for SISR via training a single CNN model.
For the second question, we will show that it is possible to learn a practical super-resolver using synthetic data. To this end, a large variety of degradations with different combinations of blur kernels and noise levels are sampled to cover the degradation space. In a practical scenario, even if the degradation is more complex (e.g., the noise is non-AWGN), we can select the best-fitting degradation model rather than the bicubic degradation to produce a better result. It turns out that, by choosing a proper degradation, the learned SISR model can yield perceptually convincing results on real LR images. It should be noted that we make no effort to use specialized network architectures but use a plain CNN.
The main contributions of this paper are summarized as follows:
We propose a simple yet effective and scalable deep CNN framework for SISR. The proposed model goes beyond the widely-used bicubic degradation assumption and works for multiple and even spatially variant degradations, thus making a substantial step towards developing a practical CNN-based super-resolver for real applications.
We propose a novel dimensionality stretching strategy to address the dimensionality mismatch between LR input image, blur kernel and noise level. Although this strategy is proposed for SISR, it is general and can be extended to other tasks such as deblurring.
We show that the proposed convolutional super-resolution network learned from synthetic training data can not only produce competitive results against state-of-the-art SISR methods on synthetic LR images but also give rise to visually plausible results on real LR images.
Related Work
The first work using a CNN to solve SISR proposed a three-layer super-resolution network (SRCNN). In the extended work, the authors investigated the impact of depth on super-resolution and empirically showed that the difficulty of training deeper models hinders the performance improvement of CNN super-resolvers. To overcome the training difficulty, Kim et al. proposed a very deep super-resolution (VDSR) method with a residual learning strategy. Interestingly, they showed that VDSR can handle multi-scale super-resolution. By analyzing the relation between CNNs and MAP inference, Zhang et al. pointed out that CNNs mainly model the prior information, and they empirically demonstrated that a single model can handle multi-scale super-resolution, image deblocking and image denoising. While achieving good performance, the above methods take the bicubically interpolated LR image as input, which not only incurs a high computational cost but also hinders the effective expansion of the receptive field.
To improve the efficiency, some researchers resort to directly manipulating the LR input and adopting an upscaling operation at the end of the network. Shi et al. introduced an efficient sub-pixel convolution layer to upscale the LR feature maps into HR images. Dong et al. adopted a deconvolution layer at the end of the network to perform upsampling. Lai et al. proposed a Laplacian pyramid super-resolution network (LapSRN) that takes an LR image as input and progressively predicts the sub-band residuals with transposed convolutions in a coarse-to-fine manner. To improve the perceptual quality at a large scale factor, Ledig et al. proposed a generative adversarial network based super-resolution (SRGAN) method. In the generator network of SRGAN, two sub-pixel convolution layers are used to efficiently upscale the LR input by a factor of 4.
Although various techniques have been proposed for SISR, the above CNN-based methods are tailored to the widely-used setting of bicubic degradation, which limits their applicability in practical scenarios. An interesting line of CNN-based methods that can go beyond bicubic degradation adopts a CNN denoiser to solve SISR via a model-based optimization framework. For example, one such method can handle the widely-used Gaussian degradation. However, manually selecting the hyper-parameters for different degradations is not a trivial task. As a result, it is desirable to learn a single SISR model which can handle multiple degradations with high practicability. This paper attempts to give a positive answer.
Due to the limited space, we can only discuss some of the related works here. Other CNN-based SISR methods can be found in .
Method
Before solving the problem of SISR, it is important to have a clear understanding of the degradation model, which is not limited to Eqn. (1). Another practical degradation model can be given by
$$y = (x\downarrow_s) \otimes k + n. \tag{2}$$
When $\downarrow$ is the bicubic downsampler, Eqn. (2) corresponds to a deblurring problem followed by a SISR problem with bicubic degradation. Thus, it can benefit from existing deblurring methods and bicubic degradation based SISR methods. Due to limited space, we only consider the more widely assumed degradation model given in Eqn. (1). Nevertheless, our method is general and can be easily extended to handle Eqn. (2). In the following, we briefly discuss the blur kernel k, noise n and downsampler $\downarrow$.
Blur kernel.
Different from image deblurring, the blur kernel setting of SISR is usually simple. The most popular choice is the isotropic Gaussian blur kernel parameterized by its standard deviation or kernel width. Anisotropic Gaussian blur kernels have also been used. In practice, more complex blur kernel models used in the deblurring task, such as motion blur, can be further considered. Empirical and theoretical analyses have revealed that the influence of an accurate blur kernel is much larger than that of sophisticated image priors. Specifically, when the assumed kernel is smoother than the true kernel, the recovered image is over-smoothed. Most SISR methods actually favor this case. On the other hand, when the assumed kernel is sharper than the true kernel, high-frequency ringing artifacts will appear.
Noise.
Besides being of low resolution, LR images are usually also noisy. Directly super-resolving the noisy input without noise removal would amplify the unwanted noise, resulting in visually unpleasant results. To address this problem, the straightforward way is to perform denoising first and then enhance the resolution. However, the denoising pre-processing step tends to lose detail information and would deteriorate the subsequent super-resolution performance. Thus, it would be highly desirable to jointly perform denoising and super-resolution.
Downsampler.
The existing literature has considered two types of downsamplers, namely the direct downsampler and the bicubic downsampler. In this paper, we consider the bicubic downsampler since, when k is the delta kernel and the noise level is zero, Eqn. (1) turns into the widely-used bicubic degradation model. It should be pointed out that, unlike the blur kernel and noise, which vary in a general degradation model, the downsampler is assumed to be fixed.
Though blur kernel and noise have been recognized as key factors for the success of SISR and several methods have been proposed to consider those two factors, there has been little effort towards simultaneously considering blur kernel and noise in a single CNN framework. It is a challenging task since the degradation space with respect to blur kernel and noise is rather large (see Figure 1 as an example). One relevant work is done by Zhang et al.; nonetheless, their method is essentially a model-based optimization method and thus suffers from several drawbacks as mentioned previously. In another related work, Riegler et al. incorporated blur kernel information into the SISR model. Our method differs from theirs in two major aspects. First, our method considers a more general degradation model. Second, our method exploits a more effective way to parameterize the degradation model.
2 A Perspective from MAP Framework
Though existing CNN-based SISR methods are not necessarily derived under the traditional MAP framework, they have the same goal. We revisit and analyze the general MAP framework of SISR, aiming to find the intrinsic connections between the MAP principle and the working mechanism of CNN. Consequently, more insights on CNN architecture design can be obtained.
Due to the ill-posed nature of SISR, regularization needs to be imposed to constrain the solution. Mathematically, the HR counterpart of an LR image y can be estimated by solving the following MAP problem
$$\hat{x} = \arg\min_x \frac{1}{2\sigma^2}\|(x \otimes k)\downarrow_s - y\|^2 + \lambda\Phi(x), \tag{3}$$
where $\frac{1}{2\sigma^2}\|(x \otimes k)\downarrow_s - y\|^2$ is the data fidelity term, $\Phi(x)$ is the regularization term (or prior term) and $\lambda$ is the trade-off parameter. Simply speaking, Eqn. (3) conveys two points: (i) the estimated solution should not only accord with the degradation process but also have the desired property of clean HR images; (ii) $\hat{x}$ is a function of LR image y, blur kernel k, noise level $\sigma$, and trade-off parameter $\lambda$. Therefore, the MAP solution of (non-blind) SISR can be formulated as
$$\hat{x} = \mathcal{F}(y, k, \sigma, \lambda; \Theta), \tag{4}$$
where $\Theta$ denotes the parameters of the MAP inference.
By treating CNN as a discriminative learning solution to Eqn. (4), we can have the following insights.
Because the data fidelity term corresponds to the degradation process, accurate modeling of the degradation plays a key role for the success of SISR. However, existing CNN-based SISR methods with bicubic degradation actually aim to solve the following problem
$$\hat{x} = \mathcal{F}(y; \Theta). \tag{5}$$
Inevitably, their practicability is very limited.
To design a more practical SISR model, it is preferable to learn a mapping function like Eqn. (4), which covers more extensive degradations. It should be stressed that, since $\lambda$ can be absorbed into $\sigma$, Eqn. (4) can be reformulated as
$$\hat{x} = \mathcal{F}(y, k, \sigma; \Theta). \tag{6}$$
Considering that the MAP framework (Eqn. (3)) can perform generic image super-resolution with the same image prior, it is intuitive to jointly perform denoising and SISR in a unified CNN framework. Moreover, the work indicates that the parameters of the MAP inference mainly model the prior; therefore, CNN has the capacity to deal with multiple degradations via a single model.
From the viewpoint of the MAP framework, one can see that the goal of SISR is to learn a mapping function $\hat{x} = \mathcal{F}(y, k, \sigma; \Theta)$ rather than $\hat{x} = \mathcal{F}(y; \Theta)$. However, it is not an easy task to directly model $\hat{x} = \mathcal{F}(y, k, \sigma; \Theta)$ via a CNN. The reason lies in the fact that the three inputs y, k and $\sigma$ have different dimensions. In the next subsection, we will propose a simple dimensionality stretching strategy to resolve this problem.
3 Dimensionality Stretching
The proposed dimensionality stretching strategy is schematically illustrated in Figure 2. Suppose the inputs consist of a blur kernel of size $p \times p$, a noise level $\sigma$ and an LR image of size $W \times H \times C$, where $C$ denotes the number of channels. The blur kernel is first vectorized into a vector of size $p^2 \times 1$ and then projected onto a $t$-dimensional linear space by the PCA (Principal Component Analysis) technique. After that, the concatenated low-dimensional vector and the noise level, denoted by v, are stretched into degradation maps $\mathcal{M}$ of size $W \times H \times (t+1)$, where all the elements of the $i$-th map are $v_i$. By doing so, the degradation maps can then be concatenated with the LR image, making it possible for the CNN to handle the three inputs. Such a simple strategy can be easily exploited to deal with spatially variant degradations, since the degradation maps can be non-uniform.
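The stretching step can be sketched as follows. This is a minimal illustration, assuming a kernel of size p × p, a precomputed PCA projection matrix `P` of shape (t, p²), and an LR image stored as (H, W, C); the function name `stretch_degradation` and the random stand-in for `P` are hypothetical and not taken from the released code.

```python
import numpy as np

def stretch_degradation(lr, kernel, sigma, P):
    """Concatenate an LR image with uniform degradation maps.

    lr: (H, W, C) image; kernel: (p, p) blur kernel;
    sigma: scalar noise level; P: (t, p*p) PCA projection matrix.
    """
    H, W, _ = lr.shape
    v = np.concatenate([P @ kernel.reshape(-1), [sigma]])  # (t+1,) degradation vector
    maps = np.broadcast_to(v, (H, W, v.size))              # i-th map is filled with v[i]
    return np.concatenate([lr, maps], axis=-1)             # (H, W, C + t + 1)

rng = np.random.default_rng(0)
lr = rng.random((10, 12, 3))
kernel = np.full((15, 15), 1.0 / 225)      # placeholder kernel
P = rng.normal(size=(4, 225))              # stand-in for a learned projection, t = 4
net_input = stretch_degradation(lr, kernel, sigma=0.1, P=P)
```

Because the maps have the same spatial size as the LR image, the network input is an ordinary multi-channel image and no architectural change is needed to condition on the degradation.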
4 Proposed Network
The proposed super-resolution network for multiple degradations, denoted by SRMD, is illustrated in Figure 3. As one can see, the distinctive feature of SRMD is that it takes the concatenated LR image and degradation maps as input. To show the effectiveness of the dimensionality stretching strategy, we resort to a plain CNN without complex architectural engineering. Typically, to super-resolve an LR image with a scale factor of $s$, SRMD first takes the concatenated LR image and degradation maps of size $W \times H \times (C + t + 1)$ as input. Then, a cascade of convolutional layers is applied to perform the non-linear mapping. Each layer is composed of three types of operations: Convolution (Conv), Rectified Linear Units (ReLU), and Batch Normalization (BN). Specifically, “Conv + BN + ReLU” is adopted for each convolutional layer except the last convolutional layer, which consists of a single “Conv” operation. Finally, a sub-pixel convolution layer follows the last convolutional layer to convert multiple HR subimages of size $W \times H \times s^2 C$ to a single HR image of size $sW \times sH \times C$.
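The rearrangement performed by the final sub-pixel layer can be sketched in NumPy. The function name `sub_pixel_shuffle` is hypothetical, and the channel ordering (row offset, then column offset, then image channel) is an assumption; deep learning frameworks differ in the ordering they use.

```python
import numpy as np

def sub_pixel_shuffle(feat, s):
    """Rearrange (H, W, s*s*C) sub-images into one (s*H, s*W, C) image."""
    H, W, D = feat.shape
    C = D // (s * s)
    x = feat.reshape(H, W, s, s, C)     # split channels into (row offset, col offset, C)
    x = x.transpose(0, 2, 1, 3, 4)      # interleave offsets with spatial axes: (H, s, W, s, C)
    return x.reshape(H * s, W * s, C)

feat = np.arange(2 * 2 * 4, dtype=np.float64).reshape(2, 2, 4)
hr = sub_pixel_shuffle(feat, s=2)       # shape (4, 4, 1)
```

Since all convolutions run at LR resolution and only this cheap rearrangement produces the HR grid, the cost of the network scales with the LR image size rather than the HR one.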
For all scale factors 2, 3 and 4, the number of convolutional layers is set to , and the number of feature maps in each layer is set to . We separately learn models for each scale factor. In particular, we also learn the models for noise-free degradation, namely SRMDNF, by removing the connection of the noise level map in the first convolutional filter and fine-tuning with new training data.
It is worth pointing out that neither residual learning nor a bicubically interpolated LR image is used in the network design, for the following reasons. First, with a moderate network depth and advanced CNN training and design techniques such as ReLU, BN and Adam, it is easy to train the network without the residual learning strategy. Second, since the degradation involves noise, bicubic interpolation of the LR image would aggravate the complexity of the noise, which in turn would increase the difficulty of training.
5 Why not Learn a Blind Model?
To enhance the practicability of CNNs for SISR, the most straightforward way seems to be to learn a blind model with training data synthesized by different degradations. However, such a blind model does not perform as well as expected. First, the performance deteriorates seriously when the blur kernel model is complex, e.g., motion blur. This phenomenon can be explained by the following example. Given an HR image, a blur kernel and the corresponding LR image, shifting the HR image to the left by one pixel and shifting the blur kernel to the right by one pixel would result in the same LR image. Thus, an LR image may correspond to different HR images with pixel shift. This in turn would aggravate the pixel-wise average problem, typically leading to over-smoothed results. Second, a blind model without a specially designed architecture has inferior generalization ability and performs poorly in real applications.
In contrast, a non-blind model for multiple degradations suffers little from the pixel-wise average problem and has better generalization ability. First, the degradation maps contain the warping information and thus can enable the network to have spatial transformation capability. For clarity, one can treat the degradation maps induced by blur kernel and noise level as the output of a spatial transformer. Second, by anchoring the model with degradation maps, the non-blind model generalizes easily to unseen degradations and has the ability to control the tradeoff between the data fidelity term and the regularization term.
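The shift ambiguity described above can be checked numerically. The sketch below assumes circular (periodic) boundary handling so that the shifts are exact, and uses a direct downsampler; both are simplifications chosen for the illustration, and `circ_conv` is a hypothetical helper.

```python
import numpy as np

def circ_conv(x, k):
    """Circular 2-D convolution via FFT (kernel zero-padded to the image size)."""
    kp = np.zeros_like(x)
    kp[:k.shape[0], :k.shape[1]] = k
    return np.real(np.fft.ifft2(np.fft.fft2(x) * np.fft.fft2(kp)))

rng = np.random.default_rng(0)
x = rng.random((16, 16))                 # stand-in HR image
k = np.zeros((5, 5))
k[2, 1:4] = [0.25, 0.5, 0.25]            # horizontal blur with a zero margin

x_shift = np.roll(x, -1, axis=1)         # HR image shifted left by one pixel
k_shift = np.roll(k, 1, axis=1)          # blur kernel shifted right by one pixel

y1 = circ_conv(x, k)[::2, ::2]           # LR image from the original pair
y2 = circ_conv(x_shift, k_shift)[::2, ::2]
same_lr = np.allclose(y1, y2)            # the two pairs yield the same LR image
```

A blind model sees only the LR image and therefore has to average over such equally valid HR candidates, whereas a model that is also given the kernel can tell them apart.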
Experiments
Before synthesizing LR images according to Eqn. (1), it is necessary to define the blur kernels and noise level range, and to provide a large-scale clean HR image set.
For the blur kernels, we follow the kernel model of isotropic Gaussian with a fixed kernel width which has been proved practically feasible in SISR applications. Specifically, the kernel width ranges are set to , and for scale factors 2, 3 and 4, respectively. We sample the kernel width by a stride of . The kernel size is fixed to $15 \times 15$. To further expand the degradation space, we also consider a more general kernel assumption, i.e., anisotropic Gaussian, which is characterized by a Gaussian probability density function with zero mean and varying covariance matrix $\Sigma$. The space of such Gaussian kernels is determined by the rotation angle of the eigenvectors of $\Sigma$ and the scaling of the corresponding eigenvalues. We set the rotation angle range to . For the scaling of eigenvalues, it is set from to , and for scale factors 2, 3 and 4, respectively.
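A zero-mean Gaussian kernel parameterized by a rotation angle and two eigenvalues can be generated as sketched below. The function name `anisotropic_gaussian_kernel` and the specific parameter values are illustrative assumptions; the isotropic case is recovered by setting both eigenvalues to the squared kernel width.

```python
import numpy as np

def anisotropic_gaussian_kernel(size, theta, l1, l2):
    """Zero-mean Gaussian kernel with covariance R diag(l1, l2) R^T, normalized to sum 1."""
    R = np.array([[np.cos(theta), -np.sin(theta)],
                  [np.sin(theta),  np.cos(theta)]])
    cov = R @ np.diag([l1, l2]) @ R.T
    inv = np.linalg.inv(cov)
    ax = np.arange(size) - (size - 1) / 2.0
    xx, yy = np.meshgrid(ax, ax)                  # xx varies along columns, yy along rows
    pts = np.stack([xx, yy], axis=-1)             # (size, size, 2) coordinates
    expo = np.einsum("...i,ij,...j->...", pts, inv, pts)
    k = np.exp(-0.5 * expo)
    return k / k.sum()

k_iso = anisotropic_gaussian_kernel(15, theta=0.0, l1=1.6 ** 2, l2=1.6 ** 2)
k_aniso = anisotropic_gaussian_kernel(15, theta=0.0, l1=4.0, l2=1.0)  # wider along x
```

Sampling `theta`, `l1` and `l2` over chosen ranges then yields a family of kernels covering the degradation space.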
Although we adopt the bicubic downsampler throughout the paper, it is straightforward to train a model with the direct downsampler. Alternatively, we can also include the degradations with the direct downsampler by approximating it. Specifically, given a blur kernel $k_d$ under the direct downsampler $\downarrow^d$, we can find the corresponding blur kernel $k_b$ under the bicubic downsampler $\downarrow^b$ by solving the following problem with a data-driven method
$$k_b = \arg\min_{k_b} \|(x \otimes k_b)\downarrow_s^b - (x \otimes k_d)\downarrow_s^d\|^2, \quad \forall x. \tag{7}$$
In this paper, we also include such degradations for scale factor 3.
Once the blur kernels are well-defined or learned, we then uniformly sample substantial kernels and aggregate them to learn the PCA projection matrix. By preserving about of the energy, the kernels are projected onto a space of dimension (i.e., ). The visualization of some typical blur kernels for scale factor 3 and some PCA eigenvectors is shown in Figure 4.
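Learning the PCA projection from a set of sampled kernels can be sketched with an SVD. This is an assumption-laden illustration: the function name `learn_pca_projection`, the uncentered formulation, the sampled kernel widths and the 0.99 energy threshold are all hypothetical choices made for the example.

```python
import numpy as np

def learn_pca_projection(kernels, energy):
    """kernels: (n, p, p). Returns P of shape (t, p*p) retaining `energy` of the spectrum."""
    X = kernels.reshape(len(kernels), -1)             # each row is a vectorized kernel
    _, S, Vt = np.linalg.svd(X, full_matrices=False)  # principal directions are rows of Vt
    ratio = np.cumsum(S ** 2) / np.sum(S ** 2)        # cumulative energy per component
    t = int(np.searchsorted(ratio, energy) + 1)       # smallest t reaching the threshold
    return Vt[:t]

ax = np.arange(15) - 7
r2 = ax[:, None] ** 2 + ax[None, :] ** 2
widths = np.linspace(0.5, 2.0, 16)                    # illustrative kernel widths
kernels = np.stack([np.exp(-r2 / (2 * w ** 2)) for w in widths])
kernels /= kernels.sum(axis=(1, 2), keepdims=True)

P = learn_pca_projection(kernels, energy=0.99)        # illustrative threshold
codes = kernels.reshape(len(kernels), -1) @ P.T       # (n, t) low-dimensional kernel codes
```

Each row of `codes` is the compact representation of one kernel that would be stretched into degradation maps.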
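For the noise level range, we set it as [0, 75]. For the clean HR image set, we collect 400 BSD images, 800 training images from the DIV2K dataset and 4,744 images from the WED dataset.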
Then, given an HR image, we synthesize LR image by blurring it with a blur kernel k and bicubicly downsampling it with a scale factor , followed by an addition of AWGN with noise level . The LR patch size is set to which means the corresponding HR patch sizes for scale factors 2, 3, and 4 are , and , respectively.
In the training phase, we randomly select a blur kernel and a noise level to synthesize an LR image and crop LR/HR patch pairs (along with the degradation maps) for each epoch. We optimize the following loss function using Adam
$$\mathcal{L}(\Theta) = \frac{1}{2N}\sum_{i=1}^{N}\|\mathcal{F}(y_i, \mathcal{M}_i; \Theta) - x_i\|^2. \tag{8}$$
The mini-batch size is set to . The learning rate starts from and reduces to when the training error stops decreasing. When the training error keeps unchanged in five sequential epochs, we merge the parameters of each batch normalization into the adjacent convolution filters. Then, a small learning rate of is used for additional epochs to fine-tune the model. Since SRMDNF is obtained by fine-tuning SRMD, its learning rate is fixed to for epochs.
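Merging batch normalization into the preceding convolution can be sketched as below. The function name `fold_bn_into_conv`, the weight layout (out_ch, in_ch, kh, kw) and the `eps` value are assumptions for illustration; the identity used is the standard one for inference-mode BN.

```python
import numpy as np

def fold_bn_into_conv(w, b, gamma, beta, mean, var, eps=1e-5):
    """Fold y = gamma * (conv(x) - mean) / sqrt(var + eps) + beta into conv weights.

    w: (out_ch, in_ch, kh, kw), b: (out_ch,); BN parameters and statistics: (out_ch,).
    """
    scale = gamma / np.sqrt(var + eps)           # per-output-channel scale
    w_folded = w * scale[:, None, None, None]
    b_folded = (b - mean) * scale + beta
    return w_folded, b_folded

rng = np.random.default_rng(0)
w = rng.normal(size=(4, 3, 1, 1))                # 1x1 convolution, so it acts as a matrix
b = rng.normal(size=4)
gamma, beta = rng.normal(size=4), rng.normal(size=4)
mean, var = rng.normal(size=4), rng.random(4) + 0.5
w_f, b_f = fold_bn_into_conv(w, b, gamma, beta, mean, var)
```

After folding, the network contains only convolutions and ReLUs, so the subsequent fine-tuning updates the merged filters directly.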
We train the models in Matlab (R2015b) environment with MatConvNet package and an Nvidia Titan X Pascal GPU. The training of a single SRMD model can be done in about two days. The source code can be downloaded at https://github.com/cszn/SRMD.
2 Experiments on Bicubic Degradation
As mentioned above, instead of handling the bicubic degradation only, our aim is to learn a single network to handle multiple degradations. However, in order to show the advantage of the dimensionality stretching strategy, the proposed method is also compared with other CNN-based methods specifically designed for bicubic degradation.
Table 1 shows the PSNR and SSIM results of state-of-the-art CNN-based SISR methods on four widely-used datasets. As one can see, SRMD achieves results comparable to VDSR at small scale factors and outperforms VDSR at large scale factors. In particular, SRMDNF achieves the best overall quantitative results. Using the ImageNet dataset to train a specific model with bicubic degradation, SRResNet performs slightly better than SRMDNF on scale factor 4. To further compare with other methods such as VDSR, we have also trained an SRMDNF model (for scale factor 3) which operates on the Y channel with 291 training images. The learned model achieves 33.97dB, 29.96dB, 28.95dB and 27.42dB on Set5, Set14, BSD100 and Urban100, respectively. As a result, it can still outperform other competing methods. A possible reason is that SRMDNF trained with multiple degradations shares the same prior in the MAP framework, which facilitates implicit prior learning and thus benefits PSNR. This can also explain why training VDSR with multiple scales improves its performance.
For the GPU run time, SRMD spends 0.084, 0.042 and 0.027 seconds to reconstruct an HR image of size for scale factors 2, 3 and 4, respectively. As a comparison, the run time of VDSR is 0.174 second for all scale factors. Figure 5 shows the visual results of different methods. One can see that our proposed method yields very competitive performance against other methods.
3 Experiments on General Degradations
In this subsection, we evaluate the performance of the proposed method on general degradations. The degradation settings are given in Table 2. We only consider the isotropic Gaussian blur kernel for an easy comparison. To further show the scalability of the proposed method, another widely-used degradation, which involves a $7 \times 7$ Gaussian kernel with width 1.6 and the direct downsampler with scale factor 3, is also included. We compare the proposed method with VDSR, two model-based methods (i.e., NCSR and IRCNN), and a cascaded denoising-SISR method (i.e., DnCNN+SRMDNF).
The quantitative results of different methods with different degradations on Set5 are provided in Table 2, from which we make the following observations. First, the performance of VDSR deteriorates seriously when the assumed bicubic degradation deviates from the true one. Second, SRMD produces much better results than NCSR and IRCNN, and outperforms DnCNN+SRMDNF. In particular, the PSNR gain of SRMD over DnCNN+SRMDNF increases with the kernel width, which verifies the advantage of joint denoising and super-resolution. Third, by setting a proper blur kernel, the proposed method delivers good performance in handling the degradation with the direct downsampler. The visual comparison is given in Figure 6. One can see that NCSR and IRCNN produce more visually pleasant results than VDSR since their assumed degradation matches the true one. However, they cannot recover edges as sharp as those of SRMD and SRMDNF.
4 Experiments on Spatially Variant Degradation
To demonstrate the effectiveness of SRMD for spatially variant degradation, we synthesize an LR image with spatially variant blur kernels and noise levels. Figure 7 shows the visual result of the proposed SRMD for the spatially variant degradations. One can see that the proposed SRMD is effective in recovering the latent HR image. Note that the blur kernel is assumed to be isotropic Gaussian.
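For a spatially variant degradation, the degradation maps simply stop being constant across pixels. The sketch below builds such maps from a per-pixel field of kernel codes and a per-pixel noise level; the function name `spatially_variant_maps`, the two kernel codes and the noise ramp are hypothetical values chosen for illustration.

```python
import numpy as np

def spatially_variant_maps(code_field, sigma_field):
    """code_field: (H, W, t) per-pixel kernel codes; sigma_field: (H, W) noise levels."""
    return np.concatenate([code_field, sigma_field[..., None]], axis=-1)  # (H, W, t+1)

H, W, t = 6, 8, 3
code_a = np.array([0.9, 0.1, 0.0])       # illustrative PCA code of one kernel
code_b = np.array([0.4, -0.3, 0.2])      # illustrative PCA code of another kernel
code_field = np.empty((H, W, t))
code_field[:, : W // 2] = code_a         # left half blurred by the first kernel
code_field[:, W // 2 :] = code_b         # right half blurred by the second kernel
sigma_field = np.tile(np.linspace(0.0, 50.0, W), (H, 1))  # noise grows left to right
maps = spatially_variant_maps(code_field, sigma_field)
```

The resulting array is concatenated with the LR image in the same way as uniform maps, so the trained network is reused without modification.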
5 Experiments on Real Images
Besides the above experiments on LR images synthetically downsampled from HR images with known blur kernels and corrupted by AWGN with known noise levels, we also do experiments on real LR images to demonstrate the effectiveness of the proposed SRMD. Since there are no ground-truth HR images, we only provide the visual comparison.
As mentioned above, while we also use anisotropic Gaussian kernels in training, it is generally feasible to use isotropic Gaussian kernels for most real LR images in testing. To find degradation parameters that give good visual quality, we use a grid search strategy rather than adopting any blur kernel or noise level estimation method. Specifically, the kernel width is uniformly sampled from 0.1 to 2.4 with a stride of 0.1, and the noise level from 0 to 75 with a stride of 5.
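The search grid described above can be enumerated as follows. Only the grid itself comes from the text; how each candidate is scored (here, by visual inspection of the super-resolved output) is left outside the sketch, and the variable names are hypothetical.

```python
import numpy as np

widths = np.round(np.arange(1, 25) * 0.1, 1)    # kernel widths 0.1, 0.2, ..., 2.4
noise_levels = np.arange(0, 80, 5)              # noise levels 0, 5, ..., 75

# Every (kernel width, noise level) pair that would be tried on a real LR image.
candidates = [(w, n) for w in widths for n in noise_levels]
num_candidates = len(candidates)                # 24 widths x 16 noise levels
```

Each candidate corresponds to one forward pass of the network with a different pair of degradation maps.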
Figures 8 and 9 illustrate the SISR results on two real LR images “Cat” and “Chip”, respectively. The VDSR is used as one of the representative CNN-based methods for comparison. For image “Cat” which is corrupted by compression artifacts, Waifu2x is also used for comparison. For image “Chip” which contains repetitive structures, a self-similarity based method SelfEx is also included for comparison.
It can be observed from the visual results that SRMD can produce much more visually plausible HR images than the competing methods. Specifically, one can see from Figure 8 that the performance of VDSR is severely affected by the compression artifacts. While Waifu2x can successfully remove the compression artifacts, it fails to recover sharp edges. In comparison, SRMD can not only remove the unsatisfying artifacts but also produce sharp edges. From Figure 9, we can see that VDSR and SelfEx both tend to produce over-smoothed results, whereas SRMD can recover a sharp image that better matches the intensity and gradient statistics of clean images.
Conclusion
In this paper, we proposed an effective super-resolution network with high scalability of handling multiple degradations via a single model. Different from existing CNN-based SISR methods, the proposed super-resolver takes both LR image and its degradation maps as input. Specifically, degradation maps are obtained by a simple dimensionality stretching of the degradation parameters (i.e., blur kernel and noise level). The results on synthetic LR images demonstrated that the proposed super-resolver can not only produce state-of-the-art results on bicubic degradation but also perform favorably on other degradations and even spatially variant degradations. Moreover, the results on real LR images showed that the proposed method can reconstruct visually plausible HR images. In summary, the proposed super-resolver offers a feasible solution toward practical CNN-based SISR applications.
Acknowledgements
This work is supported by National Natural Science Foundation of China (grant no. 61671182, 61471146), HK RGC General Research Fund (PolyU 152240/15E) and PolyU-Alibaba Collaborative Research Project “Quality Enhancement of Surveillance Images and Videos”. We gratefully acknowledge the support from NVIDIA Corporation for providing us the Titan Xp GPU used in this research.