Detail-revealing Deep Video Super-resolution

Xin Tao, Hongyun Gao, Renjie Liao, Jue Wang, Jiaya Jia

Introduction

As one of the fundamental problems in image processing and computer vision, video or multi-frame super-resolution (SR) aims at recovering high-resolution (HR) images from a sequence of low-resolution (LR) ones. In contrast to single-image SR where details have to be generated based on only external examples, an ideal video SR system should be able to correctly extract and fuse image details in multiple frames. To achieve this goal, two important sub-problems are to be answered: (1) how to align multiple frames to construct accurate correspondence; and (2) how to effectively fuse image details for high-quality outputs.

While large motion between consecutive frames makes it harder to locate corresponding image regions, subtle sub-pixel motion actually benefits the restoration of details. Most previous methods compensate for inter-frame motion by estimating optical flow or applying block-matching. After motion is estimated, traditional methods reconstruct the HR output based on various imaging models and image priors, typically under an iterative estimation framework. Most of these methods involve intensive case-by-case parameter tuning and costly computation.

Recent deep-learning-based video SR methods compensate inter-frame motion by aligning all other frames to the reference one, using backward warping. We show that such a seemingly reasonable technical choice is actually not optimal for video SR, and improving motion compensation can directly lead to higher quality SR results. In this paper, we achieve this by proposing a sub-pixel motion compensation (SPMC) strategy, which is validated by both theoretical analysis and extensive experiments.

Detail Fusion

Besides motion compensation, proper fusion of image details from multiple frames is key to the success of video SR. We propose a new CNN framework that incorporates the SPMC layer and effectively fuses image information from aligned frames. Although previous CNN-based video SR systems can produce sharp-edged images, it is not entirely clear whether the image details are inherent in the input frames or learned from external data. In many practical applications such as face or text recognition, only true HR details are useful. In this paper, we provide an ablation study to examine this point.

Scalability

A traditionally overlooked but practically meaningful property of SR systems is scalability. In many previous learning-based SR systems, the network structure is closely coupled with SR parameters, making them less flexible when new SR parameters need to be applied. For example, the number of output channels in ESPCN is determined by the scale factor. VSRnet and VESPCN can only take a fixed number of temporal frames as input once trained.

In contrast, our system is fully scalable. First, it can take arbitrary-size input images. Second, the new SPMC layer does not contain trainable parameters and can be applied for arbitrary scaling factors during testing. Finally, the ConvLSTM-based network structure makes it possible to accept an arbitrary number of frames for SR in testing phase.

1 Related Work

With the seminal work of SRCNN, a majority of recent SR methods employ deep neural networks. Most of them resize input frames before sending them to the network, and use very deep, recursive, or other networks to predict HR results. Shi et al. proposed a subpixel network, which directly takes low-resolution images as input and produces a high-resolution one with subpixel location. Ledig et al. used a trainable deconvolution layer instead.

For deep video SR, Liao et al. adopted a separate step to construct high-resolution SR-drafts, which are obtained under different flow parameters. Kappeler et al. estimated optical flow and selected corresponding patches across frames to train a CNN. In both methods, motion estimation is separated from training. Recently, Caballero et al. proposed the first end-to-end video SR framework, which incorporates motion compensation as a submodule.

Motion Estimation

Deep neural networks were also used to solve motion estimation problems. Zbontar and LeCun and Luo et al. used CNNs to learn a patch distance measure for stereo matching. Fischer et al. and Mayer et al. proposed end-to-end networks to predict optical flow and stereo disparity.

Progress was made in spatial transformer networks where a differentiable layer warps images according to predicted affine transformation parameters. Based on it, WarpNet used a similar scheme to extract sparse correspondence. Yu et al. warped output based on predicted optical flow as a photometric loss for unsupervised optical flow learning. Different from these strategies, we introduce a Sub-pixel Motion Compensation (SPMC) layer, which is suitable for the video SR task.

Sub-pixel Motion Compensation (SPMC)

We first introduce our notations for video SR. It takes a sequence of $N_{F}=(2T+1)$ LR images as input ( $T$ is the size of temporal span in terms of number of frames), where $\Omega_{L}=\{I^{L}_{-T},\cdots,I^{L}_{0},\cdots,I^{L}_{T}\}$ . The output HR image $I^{H}_{0}$ corresponds to center reference frame $I^{L}_{0}$ .

The classical imaging model for LR images is expressed as

$$I^{L}_{i}=SKW_{0\rightarrow i}I^{H}_{0}+n_{i},\quad i\in[-T,T], \tag{1}$$

where $W_{0\rightarrow i}$ is the warping operator that warps from the $0$th to the $i$th frame. $K$ and $S$ are the downsampling blur and decimation operators, respectively. $n_{i}$ is the additive noise in frame $i$. For simplicity, we neglect operator $K$ in the following analysis, since it can be absorbed by $S$.

Flow Direction and Transposed Operators

Operator $W_{0\rightarrow i}$ indicates the warping process. To compute it, one needs to first calculate the motion field $F_{i\rightarrow 0}$ (from the $i$th to the $0$th frame), and then perform backward warping to produce the warped image. However, current deep video SR methods usually align other frames back to $I^{L}_{0}$, which actually makes use of flow $F_{0\rightarrow i}$.

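More specifically, directly minimizing the $L_{2}$-norm reconstruction error $\sum_{i}\|SW_{0\rightarrow i}I^{H}_{0}-I^{L}_{i}\|^{2}$ results in

$$I^{H}_{0}=\Big(\sum_{i=-T}^{T}W_{0\rightarrow i}^{T}S^{T}SW_{0\rightarrow i}\Big)^{-1}\Big(\sum_{i=-T}^{T}W_{0\rightarrow i}^{T}S^{T}I^{L}_{i}\Big). \tag{2}$$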

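With certain assumptions, $W_{0\rightarrow i}^{T}S^{T}SW_{0\rightarrow i}$ becomes a diagonal matrix. The solution in Eq. (2) then reduces to a feed-forward generation process (the division is element-wise) of

$$I^{H}_{0}=\frac{\sum_{i=-T}^{T}W_{0\rightarrow i}^{T}S^{T}I^{L}_{i}}{\sum_{i=-T}^{T}W_{0\rightarrow i}^{T}S^{T}\mathbf{1}}, \tag{3}$$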

where $\mathbf{1}$ is an all-one vector with the same size as $I_{i}^{L}$ . The operators that are actually applied to $I_{i}^{L}$ are $S^{T}$ and $W_{0\rightarrow i}^{T}$ . $S^{T}$ is the transposed decimation corresponding to zero-upsampling. $W_{0\rightarrow i}^{T}$ is the transposed forward warping using flow $F_{i\rightarrow 0}$ . A 1D signal example for these operators is shown in Fig. 1. We will further analyze the difference of forward and backward warping after explaining our system.
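The role of the two transposed operators can be illustrated with a small 1D sketch. It is not the authors' implementation: it assumes known integer displacements on the HR grid (`hr_offsets`, a hypothetical input), uses a plain shift as a stand-in for $W^{T}$, and all function names are illustrative. Zero-upsampling plays the role of $S^{T}$, and the final division mirrors the normalization by the warped all-one vector.

```python
def zero_upsample(lr, alpha):
    """S^T: put each LR sample on the HR grid and leave zeros in between."""
    hr = [0.0] * (len(lr) * alpha)
    for k, value in enumerate(lr):
        hr[k * alpha] = float(value)
    return hr


def forward_shift(signal, offset):
    """Stand-in for W^T: push every sample `offset` HR pixels to the right."""
    out = [0.0] * len(signal)
    for k, value in enumerate(signal):
        if 0 <= k + offset < len(signal):
            out[k + offset] = value
    return out


def fuse(lr_frames, hr_offsets, alpha):
    """Feed-forward fusion: sum of W^T S^T I_i divided by sum of W^T S^T 1."""
    n = len(lr_frames[0]) * alpha
    num, den = [0.0] * n, [0.0] * n
    for lr, offset in zip(lr_frames, hr_offsets):
        values = forward_shift(zero_upsample(lr, alpha), offset)
        hits = forward_shift(zero_upsample([1.0] * len(lr), alpha), offset)
        for k in range(n):
            num[k] += values[k]
            den[k] += hits[k]
    # Positions that no frame reaches stay undefined (None).
    return [num[k] / den[k] if den[k] > 0 else None for k in range(n)]


hr_truth = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
frame_a = hr_truth[0::2]  # observes HR positions 0, 2, 4, 6
frame_b = hr_truth[1::2]  # observes HR positions 1, 3, 5, 7
fused = fuse([frame_a, frame_b], hr_offsets=[0, 1], alpha=2)
```

In this toy setting the two frames sample complementary HR positions, so the fused signal covers the whole HR grid, whereas a single frame leaves every other position empty.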

Our Method

Our method takes a sequence of $N_{F}$ LR images as input and produces one HR image $I^{H}_{0}$. It is an end-to-end, fully trainable framework that comprises three modules: motion estimation, motion compensation, and detail fusion. They are respectively responsible for estimating the motion field between frames, aligning frames by compensating motion, and finally increasing image scale and adding image details. We elaborate on each module in the following.

The motion estimation module takes two LR frames as input and produces an LR motion field as

$$F_{i\rightarrow j}=\mathbf{Net}_{ME}(I^{L}_{i},I^{L}_{j};\theta_{ME}), \tag{4}$$

where $F_{i\rightarrow j}=(u_{i\rightarrow j},v_{i\rightarrow j})$ is the motion field from frame $I^{L}_{i}$ to $I^{L}_{j}$ . $\theta_{ME}$ is the set of module parameters.

Using neural networks for motion estimation is not a new idea, and existing work already achieves good results. We have tested FlowNet-S and the motion compensation transformer (MCT) module from VESPCN for our task. We choose MCT because it has fewer parameters and accordingly lower computation cost. It can process 500+ single-channel image pairs ($100\times 100$ in pixels) per second. The result quality is also acceptable in our system.

2 SPMC Layer

According to the analysis in Sec. 2, we propose a novel layer to utilize sub-pixel information from motion and simultaneously achieve sub-pixel motion compensation (SPMC) and resolution enhancement. It is defined as

$$J^{H}=\mathbf{Layer}_{SPMC}(J^{L},F;\alpha), \tag{5}$$

where $J^{L}$ and $J^{H}$ are input LR and output HR images, $F$ is optical flow used for transposed warping and $\alpha$ is the scaling factor. The layer contains two submodules.

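In the first submodule, transformed coordinates are calculated according to the estimated flow $F=(u,v)$ as

$$\begin{pmatrix}x^{s}_{p}\\ y^{s}_{p}\end{pmatrix}=W_{F;\alpha}\begin{pmatrix}x_{p}\\ y_{p}\end{pmatrix}=\alpha\begin{pmatrix}x_{p}+u_{p}\\ y_{p}+v_{p}\end{pmatrix}, \tag{6}$$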

where $p$ indexes pixels in LR image space. $x_{p}$ and $y_{p}$ are the two coordinates of $p$ . $u_{p}$ and $v_{p}$ are the flow vectors estimated from previous stage. We denote transform of coordinates as operator $W_{F;\alpha}$ , which depends on flow field $F$ and scale factor $\alpha$ . $x_{p}^{s}$ and $y_{p}^{s}$ are the transformed coordinates in an enlarged image space, as shown in Fig. 3.

Differentiable Image Sampler

The output image is constructed in the enlarged image space according to $x_{p}^{s}$ and $y_{p}^{s}$. The resulting image $J_{q}^{H}$ is

$$J^{H}_{q}=\sum_{p}J^{L}_{p}\,M(x^{s}_{p}-x_{q})\,M(y^{s}_{p}-y_{q}), \tag{7}$$

where $q$ indexes HR image pixels. $x_{q}$ and $y_{q}$ are the two coordinates for pixel $q$ in the HR grid. $M(\cdot)$ is the sampling kernel, which defines the image interpolation methods (e.g. bicubic, bilinear, and nearest-neighbor).
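The two submodules can be sketched together as a forward "splatting" pass. This is a minimal pure-Python illustration, not the authors' TensorFlow layer: it assumes the transformed coordinates are $\alpha(x_{p}+u_{p})$ and $\alpha(y_{p}+v_{p})$ with pixel indices used directly as coordinates, uses the bilinear kernel for $M(\cdot)$, and the function names are hypothetical.

```python
import math


def bilinear_kernel(x):
    """Sampling kernel M(x) = max(0, 1 - |x|)."""
    return max(0.0, 1.0 - abs(x))


def spmc_forward(lr, flow_u, flow_v, alpha):
    """Scatter every LR pixel onto an alpha-times larger HR grid."""
    h, w = len(lr), len(lr[0])
    hr_h, hr_w = h * alpha, w * alpha
    hr = [[0.0] * hr_w for _ in range(hr_h)]
    for y in range(h):
        for x in range(w):
            # Transformed (sub-pixel) coordinates in the enlarged space.
            xs = alpha * (x + flow_u[y][x])
            ys = alpha * (y + flow_v[y][x])
            x0, y0 = math.floor(xs), math.floor(ys)
            # Only the four HR neighbours get a non-zero bilinear weight.
            for yq in (y0, y0 + 1):
                for xq in (x0, x0 + 1):
                    if 0 <= yq < hr_h and 0 <= xq < hr_w:
                        weight = bilinear_kernel(xs - xq) * bilinear_kernel(ys - yq)
                        hr[yq][xq] += lr[y][x] * weight
    return hr


lr_image = [[1.0, 2.0], [3.0, 4.0]]
zero_flow = [[0.0, 0.0], [0.0, 0.0]]
quarter_u = [[0.25, 0.25], [0.25, 0.25]]
static_hr = spmc_forward(lr_image, zero_flow, zero_flow, alpha=2)
shifted_hr = spmc_forward(lr_image, quarter_u, zero_flow, alpha=2)
```

With zero flow the result is plain zero-upsampling. With a quarter-pixel horizontal flow and $\alpha=2$, each LR value lands halfway between two HR columns and is split evenly between them, which is how sub-pixel motion ends up populating different HR positions.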

We further investigate the differentiability of this layer. As indicated in Eq. (5), the SPMC layer takes one LR image $J^{L}$ and one flow field $F=(u,v)$ as input, without other trainable parameters. For each output pixel, the partial derivative with respect to each input pixel is

$$\frac{\partial J^{H}_{q}}{\partial J^{L}_{p}}=M(x^{s}_{p}-x_{q})\,M(y^{s}_{p}-y_{q}). \tag{8}$$

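Partial derivatives with respect to the flow field $(u_{p},v_{p})$ are obtained similarly using the chain rule as

$$\frac{\partial J^{H}_{q}}{\partial u_{p}}=\frac{\partial J^{H}_{q}}{\partial x^{s}_{p}}\cdot\frac{\partial x^{s}_{p}}{\partial u_{p}}=\alpha\,J^{L}_{p}\,M^{\prime}(x^{s}_{p}-x_{q})\,M(y^{s}_{p}-y_{q}), \tag{9}$$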

where $M^{\prime}(\cdot)$ is the gradient of the sampling kernel $M(\cdot)$. Similar derivatives can be derived for $\frac{\partial J_{q}}{\partial v_{p}}$. We choose $M(x)=\max(0,1-|x|)$, which corresponds to the bilinear interpolation kernel, because of its simplicity and the convenience of calculating its gradients. Our final layer is fully differentiable, allowing the loss to be back-propagated to the flow fields smoothly. The advantages of this type of layer are threefold.

This layer can simultaneously achieve motion compensation and resolution enhancement. Note in most previous work, they are separate steps (e.g. backward warping + bicubic interpolation).

This layer is parameter-free and fully differentiable, so it can be incorporated into neural networks with almost no additional cost.

The rationale behind this layer is rooted in an accurate LR imaging model, which ensures good performance in theory. It also demonstrates good results in practice, as we will present later.
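The differentiability claim can be checked numerically on a 1D analogue of the layer. The sketch below is illustrative only: it assumes the transformed coordinate is $\alpha(p+u_{p})$, and the helper names are hypothetical. It compares the chain-rule derivative with respect to one flow value against a finite-difference estimate.

```python
def kernel(x):
    """Bilinear kernel M(x) = max(0, 1 - |x|)."""
    return max(0.0, 1.0 - abs(x))


def kernel_grad(x):
    """Derivative of M away from its kinks at -1, 0 and 1."""
    if -1.0 < x < 0.0:
        return 1.0
    if 0.0 < x < 1.0:
        return -1.0
    return 0.0


def splat_1d(lr, flow, alpha, n_out):
    """1D analogue of the layer: J_q = sum_p J_p * M(alpha * (p + u_p) - q)."""
    out = [0.0] * n_out
    for p, value in enumerate(lr):
        xs = alpha * (p + flow[p])
        for q in range(n_out):
            out[q] += value * kernel(xs - q)
    return out


def grad_wrt_flow(lr, flow, alpha, q, p):
    """Chain rule: dJ_q/du_p = alpha * J_p * M'(alpha * (p + u_p) - q)."""
    xs = alpha * (p + flow[p])
    return alpha * lr[p] * kernel_grad(xs - q)


lr_signal = [2.0, 5.0]
flow = [0.3, -0.2]
eps = 1e-6
base = splat_1d(lr_signal, flow, alpha=2, n_out=4)
bumped = splat_1d(lr_signal, [flow[0] + eps, flow[1]], alpha=2, n_out=4)
numeric = (bumped[1] - base[1]) / eps
analytic = grad_wrt_flow(lr_signal, flow, alpha=2, q=1, p=0)
```

The two values agree up to finite-difference error, which is what lets a loss defined on the HR output propagate back to the estimated flow.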

3 Detail Fusion Net

The SPMC layer produces a series of motion-compensated frames $\{J^{H}_{i}\}$ expressed as

$$J^{H}_{i}=\mathbf{Layer}_{SPMC}(I^{L}_{i},F_{i\rightarrow 0};\alpha). \tag{10}$$

Design of the following network is non-trivial due to the following considerations. First, $\{J^{H}_{i}\}$ are already HR-size images that produce large feature maps, thus computational cost becomes an important factor.

Second, due to the properties of forward warping and zero-upsampling, the frames $\{J^{H}_{i}\}$ are sparse and the majority of their pixels are zero-valued (e.g. about 15/16 are zeros for scale factor $4\times$). This requires the network to have large receptive fields to capture image patterns in $J^{H}_{i}$. Using simple interpolation to fill these holes is not a good solution because interpolated values would dominate during training.
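The 15/16 figure follows from zero-upsampling alone: one LR pixel occupies one cell of every $\alpha\times\alpha$ block of the HR grid. A small sketch (function names are hypothetical; bilinear splatting under non-zero flow can touch a few more cells, so this gives the zero-flow case) makes the arithmetic explicit.

```python
from fractions import Fraction


def zero_fraction(alpha):
    """Fraction of HR pixels left empty by zero-upsampling with factor alpha."""
    return 1 - Fraction(1, alpha * alpha)


def count_zero_fraction(lr_h, lr_w, alpha):
    """Same quantity measured on an explicit occupancy grid."""
    grid = [[0] * (lr_w * alpha) for _ in range(lr_h * alpha)]
    for y in range(lr_h):
        for x in range(lr_w):
            grid[y * alpha][x * alpha] = 1  # cell hit by an LR sample
    total = lr_h * lr_w * alpha * alpha
    empty = sum(row.count(0) for row in grid)
    return Fraction(empty, total)
```

For example, `zero_fraction(4)` evaluates to `Fraction(15, 16)`, and `zero_fraction(2)` to `Fraction(3, 4)`.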

Finally, special attention needs to be paid to the use of the reference frame. On the one hand, we rely on the reference frame as the guidance for SR so that the output HR image is consistent with the reference frame in terms of image structures. On the other hand, over-emphasizing the reference frame could impose an adverse effect of neglecting information in other frames. The extreme case is that the system behaves like a single-image SR one.

We design an encoder-decoder style structure with skip-connections (see Fig. 2) to tackle the above issues. This type of structure has been proven effective in many image regression tasks. The encoder sub-network reduces the size of the input HR image to $1/4$ of its original size in our case, leading to reduced computation cost. It also makes the feature maps less sparse so that information can be effectively aggregated without the need to employ very deep networks. Skip-connections are used for all stages to accelerate training.

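A ConvLSTM module is inserted in the middle stage as a natural choice for sequential input. The network structure is

$$\begin{aligned}\mathbf{f}_{i}&=\mathbf{Net}_{E}(J^{H}_{i};\theta_{E}),\\ \mathbf{g}_{i},\mathbf{s}_{i}&=\mathbf{ConvLSTM}(\mathbf{f}_{i},\mathbf{s}_{i-1};\theta_{LSTM}),\\ I^{(i)}_{0}&=\mathbf{Net}_{D}(\mathbf{g}_{i},\textbf{S}^{E}_{i};\theta_{D})+I^{L\uparrow}_{0},\end{aligned} \tag{11}$$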

where $\mathbf{Net}_{E}$ and $\mathbf{Net}_{D}$ are encoder and decoder CNNs with parameters $\theta_{E}$ and $\theta_{D}$ . $\mathbf{f}_{i}$ is the output of encoder net. $\mathbf{g}_{i}$ is the input of decoder net. $\mathbf{s}_{i}$ is the hidden state for LSTM at the $i$ th step. $\textbf{S}^{E}_{i}$ for all $i$ are intermediate feature maps of $\mathbf{Net}_{E}$ , used for skip-connection. $I^{L\uparrow}_{0}$ is the bicubic upsampled $I^{L}_{0}$ . $I^{(i)}_{0}$ is the $i$ th time step output.

The first layer of $\mathbf{Net}_{E}$ and the last layer of $\mathbf{Net}_{D}$ have kernel size $5\times 5$. All other convolution layers use kernel size $3\times 3$, including those inside the ConvLSTM. Deconvolution layers use kernel size $4\times 4$ and stride 2. Rectified Linear Units (ReLU) are used as the activation function for every conv/deconv layer. For skip-connections, we use the SUM operator between connected layers. Other parameters are labeled in Fig. 2.
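The recurrent data flow through the detail fusion net can be sketched independently of the actual layers. The function below is a schematic only: `encoder`, `conv_lstm`, and `decoder` are hypothetical callables standing in for $\mathbf{Net}_{E}$, the ConvLSTM, and $\mathbf{Net}_{D}$, and it assumes that the decoder output is added to the bicubic-upsampled reference frame at every time step.

```python
def detail_fusion(frames_hr, ref_upsampled, encoder, conv_lstm, decoder, init_state):
    """Run the encoder / ConvLSTM / decoder chain over a frame sequence.

    frames_hr     -- motion-compensated HR-size frames J_i, in temporal order
    ref_upsampled -- bicubic-upsampled reference frame
    encoder       -- callable: frame -> (features f_i, skip features S_i)
    conv_lstm     -- callable: (f_i, state s_{i-1}) -> (g_i, state s_i)
    decoder       -- callable: (g_i, S_i) -> residual image
    """
    state = init_state
    outputs = []
    for frame in frames_hr:
        features, skips = encoder(frame)
        fused, state = conv_lstm(features, state)
        residual = decoder(fused, skips)
        # One HR estimate of the reference frame per time step.
        outputs.append(residual + ref_upsampled)
    return outputs


# Scalar stand-ins, only to make the data flow visible.
toy_outputs = detail_fusion(
    frames_hr=[1, 2, 3],
    ref_upsampled=10,
    encoder=lambda x: (2 * x, x),
    conv_lstm=lambda f, s: (f + s, f + s),
    decoder=lambda g, skips: g + skips,
    init_state=0,
)
```

Because the state is carried across iterations, the loop accepts any number of frames, and later outputs have seen more frames than earlier ones.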

4 Training Strategy

Our framework consists of three major components, each with a distinct function. Training the whole system in an end-to-end fashion with random initialization would result in zero flow in motion estimation, making the final results similar to those of single-image SR. We therefore separate training into three phases.

Phase 1

We only consider $\mathbf{Net}_{ME}$ at the beginning of training. Since we do not have ground-truth flow, an unsupervised warping loss, denoted $\mathcal{L}_{ME}$, is used.

Phase 2

We then fix the learned weights $\theta_{ME}$ and only train $\mathbf{Net}_{DF}$. This time we use the Euclidean loss between our estimated HR reference frame and the ground truth as

$$\mathcal{L}_{SR}=\sum_{i=-T}^{T}\kappa_{i}\,\|I^{H}_{0}-I^{(i)}_{0}\|_{2}^{2}, \tag{13}$$

where $I^{(i)}_{0}$ is our network output in the $i$ th time step, corresponding to reference frame $I^{L}_{0}$ . $\{\kappa_{i}\}$ are the weights for each time step. We empirically set $\kappa_{-T}=0.5$ and $\kappa_{T}=1.0$ , and linearly interpolate intermediate values.
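The per-step weighting can be made concrete with a short sketch. It assumes a weighted sum of squared errors over time steps; the function names and the flat-list image representation are illustrative only.

```python
def step_weights(num_steps, first=0.5, last=1.0):
    """Linearly interpolate the weights kappa from `first` to `last`."""
    if num_steps == 1:
        return [last]
    step = (last - first) / (num_steps - 1)
    return [first + k * step for k in range(num_steps)]


def sr_loss(outputs, target, weights):
    """Weighted sum over time steps of the squared error to the ground truth."""
    total = 0.0
    for output, weight in zip(outputs, weights):
        total += weight * sum((o - t) ** 2 for o, t in zip(output, target))
    return total


kappa = step_weights(3)  # three time steps, as in an F3 model
loss = sr_loss([[1.0, 2.0], [1.0, 1.5], [1.0, 1.0]], [1.0, 1.0], kappa)
```

Later time steps, which have seen more input frames, receive larger weights, so errors in the final output are penalized most.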

Phase 3

In the last stage, we jointly tune the whole system using the total loss

$$\mathcal{L}=\mathcal{L}_{SR}+\lambda_{2}\mathcal{L}_{ME}, \tag{14}$$

where $\lambda_{2}$ is the weight balancing two losses.

Experiments

We conduct our experiments on a PC with an Intel Xeon E5 CPU and an NVIDIA Titan X GPU. We implement our framework on the TensorFlow platform, which enables us to easily develop our special layers and experiment with different network configurations.

For the super-resolution task, training data needs to be of high quality, free of noise, and rich in fine details. To our knowledge, there is no publicly available video dataset of this kind that is large enough to train our deep networks. We thus collect 975 sequences from high-quality 1080p HD video clips. Most of them are commercial videos shot with high-end cameras and contain both natural-world and urban scenes that have rich details. Each sequence contains 31 frames, following the configuration used in prior work. We downsample the original frames to $540\times 960$ pixels as HR ground truth using bicubic interpolation. LR input is obtained by further downsampling HR frames to $270\times 480$, $180\times 320$, and $135\times 240$ sizes. We randomly choose 945 of them as training data, and the remaining 30 sequences are used for validation and testing.

Model Training

For model training, we use the Adam solver with a learning rate of $0.0001$, $\beta_{1}=0.9$, and $\beta_{2}=0.999$. We apply gradient clipping only to the weights of the ConvLSTM module (clipped by global norm 3) to stabilize the training process. At each iteration, we randomly sample $N_{F}$ consecutive frames (e.g. $N_{F}=3,5,7$) from one sequence, and randomly crop a $100\times 100$ image region as training input. The corresponding ground truth is cropped accordingly from the reference frame with size $100\alpha\times 100\alpha$, where $\alpha$ is the scaling factor. The above parameters are fixed for all experiments. Batch size varies across settings and is set to the maximal value allowed by GPU memory.
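The sampling step described above can be sketched as follows. This is an assumption-laden illustration rather than the authors' data pipeline: the function name, the box convention `(top, left, bottom, right)`, and the choice of the middle frame as reference are all hypothetical details.

```python
import random


def sample_training_crop(seq_len, lr_h, lr_w, n_frames, alpha, crop=100, rng=random):
    """Pick n_frames consecutive frames and matching LR / HR crop boxes."""
    start = rng.randrange(seq_len - n_frames + 1)
    frame_ids = list(range(start, start + n_frames))
    ref_id = frame_ids[n_frames // 2]  # center frame is the reference
    top = rng.randrange(lr_h - crop + 1)
    left = rng.randrange(lr_w - crop + 1)
    lr_box = (top, left, top + crop, left + crop)
    # The ground-truth crop covers the same region on the alpha-times larger grid.
    hr_box = tuple(alpha * c for c in lr_box)
    return frame_ids, ref_id, lr_box, hr_box


rng = random.Random(0)
ids, ref, lr_box, hr_box = sample_training_crop(
    seq_len=31, lr_h=135, lr_w=240, n_frames=3, alpha=4, rng=rng
)
```

For a $\times 4$ model with $135\times 240$ LR frames, the HR box is $400\times 400$ and always lies inside the $540\times 960$ ground-truth frame.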

We first train the motion estimation module using only loss $\mathcal{L}_{ME}$ in Eq. (12) with $\lambda_{1}=0.01$. After about 70,000 iterations, we fix the parameters $\theta_{ME}$ and train the system using only loss $\mathcal{L}_{SR}$ in Eq. (13) for 20,000 iterations. Finally, all parameters are trained using the total loss $\mathcal{L}$ in Eq. (14), where $\lambda_{2}$ is empirically chosen as 0.01. All trainable variables are initialized using the Xavier method.
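The three-phase schedule can be summarized as a small selection function. The sketch treats the stated iteration counts as hard thresholds, although the text gives the first as approximate, and it assumes the total loss has the form $\mathcal{L}_{SR}+\lambda_{2}\mathcal{L}_{ME}$; the phase labels and function names are illustrative.

```python
def training_phase(iteration, me_iters=70000, sr_iters=20000):
    """Return which phase a given iteration belongs to."""
    if iteration < me_iters:
        return "ME"      # phase 1: motion estimation only
    if iteration < me_iters + sr_iters:
        return "SR"      # phase 2: flow weights fixed, detail fusion only
    return "joint"       # phase 3: everything trained together


def phase_loss(phase, loss_me, loss_sr, lambda2=0.01):
    """Select the objective optimized in each phase."""
    if phase == "ME":
        return loss_me
    if phase == "SR":
        return loss_sr
    return loss_sr + lambda2 * loss_me
```

For instance, `training_phase(75000)` returns `"SR"`, and only from iteration 90,000 onward is the joint objective used.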

In the following analysis and experiments, we train several models under different settings. For simplicity, we use $\times(\cdot)$ to denote scaling factors (e.g. $\times 2$, $\times 3$, and $\times 4$), and F$(\cdot)$ to denote the number of input frames (e.g. F3, F5, and F7). Moreover, since our ConvLSTM-based DF net produces multiple outputs (one for each time step), we use $\{\#1,\#2,\cdots\}$ to index the outputs.

1 Effectiveness of SPMC Layer

We first evaluate the effectiveness of the proposed SPMC layer. For comparison, a baseline model BW (F3-$\times 4$) is used. It is obtained by keeping our system in Fig. 2 unchanged, except that the SPMC layer is replaced with backward warping followed by bicubic interpolation, which is a standard alignment procedure. An example is shown in Fig. 4. In Fig. 4(a), the bicubic $\times 4$ result for the reference frame contains severe aliasing in the tile patterns. Baseline model BW produces 3 outputs corresponding to three time steps in Fig. 4(b)-(d). Although results are sharper when more frames are used, the tile patterns are obviously wrong compared to the ground truth in Fig. 4(e). This is due to the loss of sub-pixel information, as analyzed in Section 2. The results are similar to the output of single-image SR, where the reference frame dominates.

As shown in Fig. 4(f), if we only use one input image in our method, the recovered pattern is also similar to Fig. 4(a)-(d). However, with more input frames fed into the system, the restored images dramatically improve, as shown in Fig. 4(g)-(h), which are both sharper and closer to the ground truth. Quantitative values on our validation set are listed in Table 1.

2 Detail Fusion vs. Synthesis

We further investigate whether our recovered details truly exist in the original frames. One example is already shown in Fig. 4. Here we conduct a more illustrative experiment by replacing all input frames with the same reference frame. Specifically, Fig. 5(f)-(h) are outputs using 3 consecutive frames (F3-$\times 3$). The numbers and logo are recovered nicely. However, if we use 3 copies of the same reference frame as input and test them on the same pre-trained model, the results are almost the same as using only one frame. This indicates that our final result shown in Fig. 5(h) is truly recovered from the 3 different input frames based on their internal detail information, rather than synthesized from external examples. If the latter were the case, the synthesized details would also appear when only one reference frame is used.
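The intuition behind this experiment can be illustrated by counting which HR positions receive samples. The sketch below is a 1D toy model with assumed integer HR-grid displacements (`hr_offsets`, a hypothetical input); it says nothing about the trained network itself.

```python
def covered_positions(num_lr, alpha, hr_offsets):
    """HR grid positions that receive at least one LR sample."""
    covered = set()
    for offset in hr_offsets:
        for k in range(num_lr):
            pos = k * alpha + offset
            if 0 <= pos < num_lr * alpha:
                covered.add(pos)
    return covered


copies = covered_positions(num_lr=4, alpha=3, hr_offsets=[0, 0, 0])
shifted = covered_positions(num_lr=4, alpha=3, hr_offsets=[0, 1, 2])
```

Three identical copies cover only the 4 positions that a single frame already covers, while three frames with distinct sub-pixel shifts cover all 12 HR positions. Repeating the reference frame therefore adds no new information for fusion.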

3 DF-Net with Various Inputs

Our proposed detail fusion (DF) net takes only $J^{H}_{i}$ as input. To further evaluate if the reference frame is needed, we design two baseline models. Model DF-bic and DF-0up respectively add bicubic and zero-upsampled $I^{L}_{0}$ as another channel of input to DF net. Visual comparison in Fig. 6 shows that although all models can recover reasonable details, the emphasis on the reference frame may mislead detail recovery and slightly degrade results quantitatively on the evaluation set (see Table 1).

4 Comparisons with Video SR Methods

We compare our method against previous video SR methods on the evaluation dataset. BayesSR is viewed as the best-performing traditional method; it iteratively estimates motion flow, blur kernel, noise, and the HR image. DESR ensembles SR drafts based on estimated flow, which makes it an intermediate solution between traditional and CNN-based methods. We also include a recent deep-learning-based method, VSRnet, in the comparison. We use the author-provided implementations for all these methods. No code or pre-trained model is available for VESPCN, so we only list its reported PSNR/SSIM on the 4-video dataset VID4. The quantitative results are listed in Table 2. Visual comparisons are shown in Fig. 7.
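For reference, PSNR can be computed as in the sketch below. It uses the standard definition with an assumed peak value of 255 for 8-bit images; the exact evaluation protocol (color channel, border cropping) is not specified here, so those details are left out rather than guessed.

```python
import math


def psnr(reference, estimate, peak=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized 2D images."""
    ref_flat = [v for row in reference for v in row]
    est_flat = [v for row in estimate for v in row]
    mse = sum((a - b) ** 2 for a, b in zip(ref_flat, est_flat)) / len(ref_flat)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(peak * peak / mse)
```

A uniform error of 25.5 gray levels on an 8-bit image gives 20 dB under this definition, and each halving of the root-mean-square error adds about 6 dB.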

5 Comparisons with Single Image SR

Since our framework is flexible, we can set $N_{F}=1$ to turn it into a single-image SR solution. We compare this approach with three recent image SR methods, SRCNN, FSRCNN, and VDSR, on the Set5 and Set14 datasets. To further compare the performance of using multiple frames against a single frame, we compare all single-image methods with our method under the F3 setting on our evaluation dataset SPMCS. The quantitative results are listed in Table 3.

For the F1 setting on Set5 and Set14, our method produces comparable or slightly lower PSNR or SSIM results. Under the F3 setting, our method outperforms image SR methods by a large margin, indicating that our multi-frame setting can effectively fuse information in multiple frames. An example is shown in Fig. 9, where single image SR cannot recover the tiled structure of the building. In contrast, our F3 model can faithfully restore it.

6 Real-World Examples

The LR images in the above evaluation are produced through downsampling (bicubic interpolation). Although this is a standard approach for evaluation, the generated LR images may not fully resemble real-world cases. To verify the effectiveness of our method on real-world data, we captured four examples as shown in Fig. 8. For each object, we capture a short video using a hand-held cellphone camera and extract 31 consecutive frames from it. We then crop a $135\times 240$ region from the center frame, and use TLD tracking to track and crop the same region from all other frames as the input data to our system. Fig. 8 shows the SR result of the center frame for each sequence. Our method faithfully recovers the textbook characters and fine image details using the F7-$\times 4$ model. More examples are included in our supplementary material.

7 Model Complexity and Running Time

Using our unoptimized TensorFlow code, the F7-$\times 4$ model takes about 0.26 s to process 7 input images of size $180\times 120$ for one HR output. In comparison, reported timings for other methods (F31) are 2 hours for Liu et al., 10 min. for Ma et al., and 8 min. for DESR. VSRnet requires $\approx$ 40 s for the F5 configuration. Our method can be further accelerated to 0.19 s for F5 and 0.14 s for F3.

Concluding Remarks

We have proposed a new deep-learning-based approach for video SR. Our method includes a sub-pixel motion compensation layer that can better handle inter-frame motion for this task. Our detail fusion (DF) network can effectively fuse image details from multiple images after SPMC alignment. We have conducted extensive experiments to validate the effectiveness of each module. Results show that our method achieves high-quality results both qualitatively and quantitatively, while remaining flexible with respect to scaling factors and the number of input frames.