Online Video Deblurring via Dynamic Temporal Blending Network

Tae Hyun Kim, Kyoung Mu Lee, Bernhard Schölkopf, Michael Hirsch

Introduction

Moving objects in dynamic scenes as well as camera shake can cause undesirable motion blur in video recordings, often severely degrading video quality. This is especially true in low-light situations, where the exposure time of each frame is increased, and for videos recorded with action (hand-held) cameras, which have enjoyed widespread popularity in recent years. Therefore, not only to improve video quality but also to facilitate other vision tasks such as tracking, SLAM, and dense 3D reconstruction, video deblurring techniques and their applications have attracted ever-increasing interest. However, removing motion blur and restoring sharp frames in a blind manner (i.e., without knowing the blur of each frame) is a highly ill-posed problem and an active research topic in the field of computational photography.

In this paper we propose a novel discriminative video deblurring method. Our method leverages recent insights within the field of deep learning and proposes a novel neural network architecture that enables run-times which are orders of magnitude faster than previous methods without significantly sacrificing restoration quality. Furthermore, our approach is the first online (sequential) video deblurring technique that is able to remove general motion blur stemming from both egomotion and object motion in real-time (for VGA video resolution).

Our novel network architecture employs deep convolutional residual networks with a layout that is recurrent both in time and space. For temporal sequence modeling we propose a network layer that implements a novel mechanism that we dub dynamic temporal blending, which compares the feature representation at consecutive time steps and allows for dynamic (i.e. input-dependent) pixel-specific information propagation. Recurrence in the spatial domain is implemented through a novel network layout that is able to extend the spatial receptive field over time without increasing the size of the network. In doing so, we can handle large blurs better than typical networks for video frames, without run-time overhead.

Due to the lack of publicly available training data for video deblurring, we have collected a large number of blurry and sharp videos similar to the work of Kim et al. and the recent work of Su et al. Specifically, we recorded sharp frames using a high-speed camera and generated realistic blurry frames by averaging over several consecutive sharp frames. Using this new dataset, we successfully trained our novel video deblurring network in an end-to-end manner.

Using the proposed network and new dataset, we perform deblurring in a sequential manner, in contrast to many previous methods that require access to all frames, while at the same time being hundreds to thousands of times faster than existing state-of-the-art video deblurring methods. In the experimental section, we validate the performance of our proposed model on a number of challenging real-world videos capturing dynamic scenes such as the one shown in Fig. 1, and illustrate the superiority of our method in a comprehensive comparison with the state of the art, both qualitatively and quantitatively. In particular, we make the following contributions:

we present, to the best of our knowledge, the first discriminative learning approach to video deblurring which is capable of removing spatially varying motion blurs in a sequential manner with real-time performance

we introduce a novel spatio-temporal recurrent residual architecture with small computational footprint and increased receptive field along with a dynamic temporal blending mechanism that enables adaptive information propagation during test time

we release a large-scale high-speed video dataset that enables discriminative learning

we show promising results on a wide range of challenging real-world video sequences

Related Work

Multi-frame Deblurring. Early attempts to handle motion blur caused by camera shake considered multiple blurry images, and adapted techniques for removing uniform blur in single blurry images. Other works include Cai et al. and Zhang et al., which obtained sharp frames by exploiting the sparsity of the blur kernels and the gradient distribution of the latent frames. More recently, Delbracio and Sapiro proposed Fourier Burst Accumulation (FBA) for burst deblurring, an efficient method that combines multiple blurry images without explicit kernel estimation by averaging complex pixel coefficients of multiple observations in the Fourier domain. Wieschollek et al. extended the work with a recent neural network approach for single image blind deconvolution, and achieved promising results by training the network in an end-to-end manner.

Most of the aforementioned methods assume stationarity, i.e., shift-invariant blur, and cannot handle the more challenging case of spatially varying blur. To deal with spatially varying blur, often caused by rotational camera motion (roll) around the optical axis, additional non-trivial alignment of multiple images is required. Several methods have been proposed to simultaneously solve the alignment and restoration problem. In particular, Li et al. proposed a method to jointly perform camera motion (global motion) estimation and multi-frame deblurring, in contrast to previous methods that estimate a single latent image from multiple frames.

Video Deblurring. Despite some of these methods being able to handle non-uniform blur caused by camera shake, none of them is able to remove spatially-varying blur stemming from object motion in a video recording of a dynamic scene. More generally, blur in a typical video might originate from various sources including moving objects, camera shake, depth variation, and thus it is required to estimate pixel-wise different blur kernels which is a highly intricate problem.

Some early approaches make use of sharp “lucky” frames which sometimes exist in long videos. Matsushita et al. detect sharp frames using image statistics, perform global image registration and transfer pixel intensities from neighboring sharp frames to blurry ones in order to remove blur. Cho et al. improved deblurring quality significantly by employing additional local search and a blur model for aligning differently blurred image regions. However, their exemplar-based method still has some limitations in treating distinct blurs caused by fast moving objects, due to the difficulty of accurately finding corresponding points between severely blurred objects.

Other deblurring attempts segment differently blurred regions. Both Levin and Bar et al. automatically segment a motion blurred object in the foreground from a (constant) background, and assume a uniform motion blur model in the foreground region. Wulff and Black consider differently blurred bi-layered scenes and estimate segment-wise accurate blur kernels by constraining those through a temporally consistent affine motion model. While they achieve impressive results, especially at the motion boundaries, extending and generalizing their model to handle multi-layered scenes in real situations is difficult, as the number and depth ordering of the layers are not known in advance.

In contrast, there has been some recent work that estimates pixel-wise varying kernels directly without segmentation. Kim and Lee proposed a method to parametrize pixel-wise varying kernels with motion flows in a single image, and they naturally extended it to deal with blurs in videos. To handle spatially and temporally varying blurs, they parametrize the blur kernels with bi-directional optical flow between latent sharp frames that are estimated concurrently. Delbracio and Sapiro also use bi-directional optical flow for pixel-wise registration of consecutive frames, but manage to keep processing time low by using their fast FBA method for local blur removal. Recently, Sellent et al. tackled independent object motions with local homographies, and their adaptive boundary handling rendered promising results with stereo video datasets. Although these methods are applicable to removing general motion blurs, they are rather time consuming due to optical flow estimation and/or pixel-wise varying kernel estimation. Probably most closely related to our approach is the concurrent work of Su et al., which trains a CNN with skip connections to remove blur stemming from both ego and object motion. In a comprehensive comparison we show the merits of our novel network architecture both in terms of computation time and restoration quality.

Training Datasets

A key factor for the recent success of deep learning in computer vision is the availability of large amounts of training data. However, the situation is trickier for the task of blind deblurring. Previous learning-based single-image blind deconvolution and burst deblurring approaches have considered only ego motion and assumed a uniform blur model. Adapting these techniques to the case of spatially and temporally varying motion blurs caused by both ego motion and object motion is not straightforward. Therefore, we pursue a different strategy and employ a recently proposed technique that generates pairs of sharp and blurry videos using a high-speed camera.

Given a high-speed video, we “simulate” long shutter times by averaging several consecutive short-exposure images, thereby synthesizing a video with fewer longer-exposed frames. The rendered (averaged) frames are likely to feature motion blur which might arise from camera shake and/or object motion. At the same time we use the center short-exposure image as a reference sharp frame. We thus have,
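Consistent with the definitions that follow, this relation can be written in the form below; the index offsets (a window starting at $nT$ with the center frame at offset $\lfloor\tau/2\rfloor$) are assumptions, since only the averaging and the choice of the center frame are stated in the text.

```latex
\begin{equation}
\textbf{B}_{n} = \frac{1}{\tau}\sum_{j=0}^{\tau-1}\textbf{X}_{nT+j},
\qquad
\textbf{S}_{n} = \textbf{X}_{nT+\lfloor\tau/2\rfloor}
\end{equation}
```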

where $n$ denotes the time step, and $\textbf{X}_{nT}$, $\textbf{B}_{n}$, and $\textbf{S}_{n}$ are the short-exposure frame, the synthesized blurry frame, and the reference sharp frame, respectively. The parameter $\tau$ corresponds to the effective shutter speed and determines the number of frames to be averaged. The time interval $T$, which satisfies $T\geq\tau$, controls the frame rate of the synthesized video. For example, the frame rate of the generated video is $\frac{f}{T}$ for a high-speed video captured at a frame rate $f$. Note that with these datasets we can handle motion blur only, and not other types of blur (e.g., defocus blur). We can control the strength of the blur by adjusting $\tau$ (a larger $\tau$ generates blurrier videos), and can also change the duty cycle of the generated video by controlling the time interval $T$. The whole process is visualized in Fig. 2.
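The synthesis step described above can be sketched in a few lines of Python. This is a minimal illustration rather than the pipeline actually used: the function names `synthesize_pair` and `output_frame_rate` are hypothetical, frames are represented as flat lists of intensities instead of image arrays, and the averaging window is assumed to start at index $nT$.

```python
def synthesize_pair(frames, n, tau, T):
    """Return (blurry, sharp) for time step n from a list of sharp frames.

    Each frame is a flat list of pixel intensities; a real pipeline would
    use image arrays instead (assumption made for illustration).
    """
    if T < tau:
        raise ValueError("T must satisfy T >= tau")
    start = n * T  # assumed start of the averaging window
    window = frames[start:start + tau]
    if len(window) < tau:
        raise IndexError("not enough high-speed frames for this time step")
    # Averaging tau short exposures simulates one long exposure.
    blurry = [sum(pixels) / tau for pixels in zip(*window)]
    # The middle frame of the window serves as the sharp reference.
    sharp = list(window[tau // 2])
    return blurry, sharp


def output_frame_rate(f, T):
    """Frame rate of the synthesized video for a capture rate f."""
    return f / T
```

For instance, with $f=240$ and $T=10$ the synthesized video would run at 24 frames per second.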

For our experiments, we collected high-speed sharp frames using a GoPro HERO4 Black camera, which supports recording HD (1280x720) video at a speed of $f=240$ frames per second, and then downsampled the frames to a resolution of 960x540 to reduce noise and JPEG artifacts. To generate more realistic blurry frames, we carefully captured videos to have small motions (ideally less than 1 pixel) between high-speed sharp frames, as suggested in prior work. Moreover, we randomly selected parameters as $\tau\in\{7,9,11,13,15\}$ and $\tau\leq T<2\tau$ to generate various datasets with different blur sizes and duty cycles.
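The parameter selection above could be sketched as follows. Uniform sampling, an integer-valued $T$, and the reading of the duty cycle as $\tau/T$ are assumptions; the names `sample_blur_parameters` and `duty_cycle` are hypothetical.

```python
import random

TAU_CHOICES = (7, 9, 11, 13, 15)


def sample_blur_parameters(rng=random):
    """Draw (tau, T) with tau from TAU_CHOICES and tau <= T < 2 * tau."""
    tau = rng.choice(TAU_CHOICES)
    # randrange excludes the upper bound, matching T < 2 * tau.
    T = rng.randrange(tau, 2 * tau)
    return tau, T


def duty_cycle(tau, T):
    """Fraction of each output frame interval covered by the exposure
    (assumed definition: tau / T)."""
    return tau / T
```

Under these assumptions the duty cycle always lies in the interval (0.5, 1].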

Method Overview

In this paper, using our large dataset of blurry and sharp video pairs, we propose a video deblurring network that estimates the latent sharp frames from blurry ones. As suggested in the work of Su et al., a straightforward and naive technique to deal with a video rather than a single image is to apply a neural network repeatedly, as shown in Fig. 3 (a). Here, the input to the network is a set of consecutive blurry frames $\langle\textbf{B}_{n}\rangle_{m}=\{\textbf{B}_{n-m},\ldots,\textbf{B}_{n+m}\}$, where $\textbf{B}_{n}$ is the mid-frame and $m$ is some small integer (for simplicity, we drop the index $m$ from $\langle\textbf{B}_{n}\rangle_{m}$ in the figures). The network predicts a single sharp frame $\textbf{L}_{n}$ for time step $n$. In contrast, we present networks specialized for treating videos by exploiting temporal information, and improve the deblurring performance drastically without increasing the number of parameters and overall size of the networks.

In the present section, we introduce network architectures which we have found to improve the performance significantly. First, in Fig. 3 (b), we propose a spatio-temporal recurrent network which effectively extends the receptive field without increasing the number of parameters of the network, facilitating the removal of large blurs caused by severe motion. Next, in Fig. 3 (c), we additionally introduce a network architecture that implements our dynamic temporal blending mechanism which enforces temporal coherence between consecutive frames and further improves our spatio-temporal recurrent model. In the following we describe our proposed network architectures in more detail.

A large receptive field is essential for a neural network to be capable of handling large blurs. For example, conventional deep residual networks using small 3x3 filters require about 50 convolutional layers to handle blur kernels of size 101x101 pixels. Although using a deeper network or larger filters is a straightforward way to ensure a large receptive field, the overall run-time increases with the number of layers and with the filter size. Therefore, we propose an effective network that retains a large receptive field without increasing its depth or filter size, and hence without increasing its number of parameters.
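The figure of about 50 layers follows from standard receptive-field arithmetic: each stride-1 convolution with a $k\times k$ filter widens the receptive field by $k-1$ pixels. A short sketch, assuming stride 1 and no dilation or pooling (the function names are hypothetical):

```python
def receptive_field(num_layers, kernel_size=3):
    """Receptive field (pixels per side) of stacked stride-1 convolutions."""
    return 1 + num_layers * (kernel_size - 1)


def layers_needed(target, kernel_size=3):
    """Smallest number of stride-1 layers whose receptive field covers target."""
    return -(-(target - 1) // (kernel_size - 1))  # ceiling division
```

Here `receptive_field(50)` evaluates to 101, matching the 101x101 kernel size quoted above.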

The architecture of the proposed spatio-temporal network in Fig. 3 (b) is based on conventional recurrent networks, but differs from them in one important respect. Specifically, we feed $\textbf{F}_{n-1}$, the feature map computed at time step $(n-1)$ from the blurry input frames $\langle\textbf{B}_{n-1}\rangle_{m}$ together with the previous feature map $\textbf{F}_{n-2}$, as an additional input to our network alongside the blurry input frames $\langle\textbf{B}_{n}\rangle_{m}$ at time step $n$. By doing so, at time step $n$, the features of blurry frame $\textbf{B}_{n}$ have passed through the same network $(m+1)$ times, and ideally, we could increase the receptive field by the same factor without having to change the number of layers and parameters of our network. Note that, in practice, the increase of the receptive field is limited by the network capacity.

In other words, in a high dimensional feature space, each blurry input frame is recurrently processed multiple times by our recurrent network over time, thereby effectively experiencing a deeper spatial feature extraction with an increased receptive field. Moreover, further (temporal) information obtained from previous time steps is also transferred to enhance the current frame, thus we call such a network spatio-temporal recurrent or simply STRCNN.
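The recurrence described above amounts to a simple sequential loop in which the feature map is carried from one time step to the next. The sketch below only illustrates this data flow: `run_online`, `network`, and `initial_features` are hypothetical names, and `network(window, prev_features)` stands in for the trained model, returning a sharp estimate and a new feature map.

```python
def run_online(blurry_frames, m, network, initial_features):
    """Deblur frames sequentially, carrying the feature map across time steps.

    `network(window, prev_features)` is a hypothetical callable that returns
    (sharp_estimate, features); it is a placeholder for the trained model.
    """
    features = initial_features  # e.g. zeros at the first time step
    outputs = []
    for n in range(m, len(blurry_frames) - m):
        # 2m+1 consecutive frames centred on frame n.
        window = blurry_frames[n - m:n + m + 1]
        sharp, features = network(window, features)
        outputs.append(sharp)
    return outputs
```

Because the same `network` is applied at every step, the parameter count stays fixed while information from earlier steps keeps flowing through `features`.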

Dynamic temporal blending network

When handling video rather than single frames, it is important to enforce temporal consistency. Although we recurrently transfer previous feature maps over time and implicitly share information between consecutive frames, we developed a novel mechanism for temporal information propagation that significantly improves the deblurring performance.

Motivated by recent deep learning approaches that dynamically adapt network parameters to input data at test time, we also generate weight parameters for temporal feature blending that encourage temporal consistency, as depicted in Fig. 3 (c). Specifically, based on our spatio-temporal recurrent network, we additionally propose a dynamic temporal blending network, which generates a weight parameter $\textbf{w}_{n}$ at time step $n$ that is used for linear blending between the feature maps of consecutive time steps, i.e.
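A linear blend of this kind could take the following form, where $\textbf{h}_{n}$ is the feature map computed at step $n$, $\tilde{\textbf{h}}_{n-1}$ is the blended feature map from the previous step, and $\otimes$ denotes element-wise multiplication. The symbol $\tilde{\textbf{h}}$ and the convention that $\textbf{w}_{n}$ weights the current features are assumptions.

```latex
\begin{equation}
\tilde{\textbf{h}}_{n} = \textbf{w}_{n}\otimes\textbf{h}_{n} + (1-\textbf{w}_{n})\otimes\tilde{\textbf{h}}_{n-1}
\end{equation}
```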

Notably, to this end, we need only one additional convolutional layer and some non-linear activations such as $\tanh$ , and thus, the computation is fast. Although the proposed dynamic temporal blending network is simple and light, we demonstrate that it helps improve deblurring quality significantly in our experiments, and we refer to this network as STRCNN+DTB.
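To make the blending mechanism concrete, the following sketch computes per-element weights and blends two feature vectors. It rests on several assumptions: the scalars `a` and `b` are a stand-in for the additional convolutional layer, the weights are squashed with $\tanh$ and clamped to $[0, 1]$, `beta` is a hypothetical bias, and flat lists replace feature maps.

```python
import math


def blending_weights(h_cur, h_prev, a=1.0, b=1.0, beta=0.0):
    """Per-element blending weights in [0, 1].

    The scalars a and b replace the convolutional layer of the text;
    the tanh-and-clamp form is an assumption for illustration.
    """
    return [
        min(1.0, max(0.0, abs(math.tanh(a * p + b * c)) + beta))
        for c, p in zip(h_cur, h_prev)
    ]


def temporal_blend(h_cur, h_prev, w):
    """Element-wise linear blend: w * current + (1 - w) * previous."""
    return [wi * c + (1.0 - wi) * p for wi, c, p in zip(w, h_cur, h_prev)]
```

A weight of 1 keeps only the current features and a weight of 0 keeps only the previous ones, so the blend can differ from pixel to pixel depending on the input.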

Implementation and Training

In this section, we describe our proposed network architecture in full detail. An illustration is shown in Fig. 4, where we show a configuration at a single time step $n$ only since our model shares all trainable variables across time. Our network comprises three modules, i.e. encoder, dynamic temporal blending network, and decoder. Furthermore, we also discuss our objective function and training procedure.

Encoder

Figure 4 (a) depicts the encoder of our proposed network. The inputs are $(2m+1)$ consecutive blurry frames $\langle\textbf{B}_{n}\rangle_{m}$, where $\textbf{B}_{n}$ is the mid-frame, along with the feature activations $\textbf{F}_{n-1}$ from the previous stage. All input images are in color and range in intensity from 0 to 1. The feature map $\textbf{F}_{n-1}$ is half the size of a single input image, and has 32 channels. All blurry input images are filtered first, before being concatenated with the feature map and fed into a deep residual network. Our encoder has a stack of 5 residual blocks (10 convolutional layers). Each convolutional layer within a residual block consists of 64 filters of size 3x3 pixels. The output of our encoder is the feature map $\textbf{h}_{n}$.

Dynamic temporal blending

We tested different layout configurations by changing the location of our dynamic temporal blending network. Best results were obtained when placing the dynamic temporal blending network right between encoder and decoder as shown in Fig. 3 (c) rather than somewhere in the middle of the encoder or decoder network.

Decoder

Objective function

As an objective function we use the mean squared error (MSE) between the latent frames and their corresponding sharp ground-truth frames, i.e.

where $N_{mse}$ denotes the number of pixels in a latent frame. In addition, we use weight decay to prevent overfitting, i.e.

where W denotes the trainable network parameters. Our final objective function E is given by

where $\lambda$ trades off the data fidelity and regularization terms. In all our experiments we set $\lambda$ to $10^{-5}$.
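The objective described above can be sketched as a mean squared error plus a weighted regularization term, $E = E_{mse} + \lambda E_{reg}$. In this sketch the weight decay is assumed to be the squared L2 norm of the parameters, a single frame is evaluated, and flat lists stand in for images and parameter tensors; the function names are hypothetical.

```python
def mse_loss(pred, target):
    """Mean squared error over the pixels of one latent frame."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def weight_decay(weights):
    """Squared L2 norm of the trainable parameters (assumed form)."""
    return sum(w * w for w in weights)


def objective(pred, target, weights, lam=1e-5):
    """Data fidelity plus lam times the regularization term."""
    return mse_loss(pred, target) + lam * weight_decay(weights)
```

With $\lambda=10^{-5}$ the regularizer contributes little unless the weights grow large, so the data term dominates.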

Training parameters

For training, we randomly select 13 consecutive blurry frames from artificially blurred videos (i.e., $\textbf{B}_{1},\ldots,\textbf{B}_{13}$), and crop one patch per frame. Each patch is 128x128 pixels in size, and the same randomly chosen pixel location is used for cropping all 13 patches. Moreover, we use a batch size of 8, and employ Adam for optimization with an initial learning rate of 0.0001, which is decreased exponentially (decay rate = 0.96) with an increasing number of iterations.
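Two details of this procedure lend themselves to a short sketch: cropping all frames of a clip at one shared random location, and the exponentially decaying learning rate. The names `crop_sequence` and `learning_rate` are hypothetical, frames are given as lists of rows, and `decay_steps` is an assumed value because the decay interval is not stated in the text.

```python
import random


def crop_sequence(frames, patch=128, rng=random):
    """Crop the same randomly chosen window from every frame of a clip.

    frames: list of 2-D frames, each given as a list of rows.
    """
    height, width = len(frames[0]), len(frames[0][0])
    top = rng.randrange(height - patch + 1)
    left = rng.randrange(width - patch + 1)
    return [
        [row[left:left + patch] for row in frame[top:top + patch]]
        for frame in frames
    ]


def learning_rate(step, base=1e-4, decay_rate=0.96, decay_steps=10000):
    """Exponential decay; decay_steps is a hypothetical setting."""
    return base * decay_rate ** (step / decay_steps)
```

Sharing the crop location keeps the 13 patches spatially aligned, so the temporal relationship between consecutive frames is preserved within a training sample.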

Experiments

We study the three different network architectures that we discussed in Sec. 4, and evaluate deblurring quality in terms of peak signal-to-noise ratio (PSNR). For fair comparison, we use the same number of network parameters, except for one additional convolutional layer that is required in the dynamic temporal blending network. We use our own recorded dataset (described in Sec. 3) for training, and use the dataset of for evaluation at test time.
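For reference, PSNR can be computed from the mean squared error as sketched below. The peak value of 1 is chosen to match images whose intensities range from 0 to 1; the function name `psnr` and the flat-list representation of frames are illustrative assumptions.

```python
import math


def psnr(pred, target, peak=1.0):
    """Peak signal-to-noise ratio in dB for intensities in [0, peak]."""
    mse = sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)
    if mse == 0:
        return math.inf  # identical frames
    return 10.0 * math.log10(peak ** 2 / mse)
```

Since PSNR is logarithmic in the error, a gain of a few tenths of a dB corresponds to a reduction of the MSE by several percent.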

First, we compare the PSNR values of the three different models for varying blur strength by changing the effective shutter speed $\tau$ in Eq. (1). We take five consecutive blurry frames as input to the networks. As shown in Fig. 7, our STRCNN+DTB model shows consistently better results for all blur sizes. On average, the PSNR value of our STRCNN is 0.2dB higher than the baseline (CNN) model, and STRCNN+DTB achieves a gain of 0.37dB against the baseline.

Next, in Table 2, we evaluate and compare the performance of the models with a varying number of input blurry frames. Our STRCNN+DTB model outperforms other networks for all input settings. We choose STRCNN+DTB using five input frames ( $m=2$ ) as our final model.

Our method processes a video sequence in an online fashion, thus we also show how the PSNR value changes with an increasing number of processed frames in Fig. 7. Although our proposed method initially (i.e., at $n=1$) performs worse due to the lack of temporal information (the initial feature map is set to zeros), restoration quality improves and stabilizes quickly after one or two frames.

Quantitative results

For objective evaluations, we compare with state-of-the-art video deblurring methods whose source code was available at the time of submission. In particular, Su et al. provide their fully trained network parameters for three different input alignment methods: they align input images with optical flow (FLOW) or homography (HOMOG.), and they also take raw inputs without alignment (NOALIGN). For fair comparison, we train our STRCNN+DTB model with their dataset, and evaluate performance with our own dataset.

We provide a quantitative comparison for 25 test videos captured with our high-speed camera described in Sec. 3. Our model outperforms the state-of-the-art methods in terms of PSNR as shown in Table 2.

Qualitative results

To verify the generalization capabilities of our trained network, we provide qualitative results for a number of challenging videos. Figure 8 shows a comparison on challenging video clips. All these frames have spatially varying blurs caused by distinct object motion and/or rotational camera shake. In particular, the blurry frames shown in the third and fourth rows were downloaded from YouTube, and thus contain high levels of noise and severe encoding artifacts. Nevertheless, our method successfully restores the sharp frames, especially at the motion boundaries, in real-time. In the last row, the offline (batch) deblurring approach by Kim and Lee shows the best result, but at the cost of long computation times. In contrast, our approach yields competitive results while being orders of magnitude faster.

Run time evaluations

At test time, our online approach can process VGA (640x480) video frames at $\sim$24 frames per second with a recent NVIDIA GTX 1080 graphics card, and HD (1280x720) frames at $\sim$8 frames per second. In contrast, other conventional (offline) video deblurring methods take much longer. In Table 2, we compare run-times for processing 100 HD (1280x720) video frames. Notably, our proposed method runs at a much faster rate than other conventional methods.
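A quick back-of-the-envelope check relates the two reported frame rates. It treats the approximate rates of 24 and 8 frames per second as exact, which is an assumption, and the helper names are hypothetical.

```python
def pixels_per_second(width, height, fps):
    """Pixel throughput at a given resolution and frame rate."""
    return width * height * fps


def seconds_for(num_frames, fps):
    """Time needed to process num_frames at a steady frame rate."""
    return num_frames / fps
```

Both settings correspond to the same throughput of 7,372,800 pixels per second, which suggests that run-time scales roughly linearly with the number of pixels; at 8 frames per second, 100 HD frames would take about 12.5 seconds.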

Effects of dynamic temporal blending

In Fig. 9, we show a qualitative comparison of the results obtained with STRCNN and STRCNN+DTB. Although STRCNN also removes the motion blur caused by camera shake in the blurry frames well, it causes some artifacts on the car window. In contrast, STRCNN+DTB successfully restores sharp frames with fewer artifacts by enforcing temporal consistency using the proposed dynamic temporal blending network.

Conclusion

In this work, we proposed a novel network architecture for discriminative video deblurring. To this end, we acquired a large dataset of blurry/sharp video pairs for training, and introduced a novel spatio-temporal recurrent network which enables near real-time performance by adding the feature activations of the last layer as an additional input to the network at the following time step. In doing so, we retain a large receptive field, which is crucial for handling large blurs, without introducing computational overhead. Furthermore, we proposed a dynamic temporal blending network that enforces temporal consistency and provides a significant performance gain. We demonstrated the efficiency and superiority of our proposed method through extensive experiments on challenging real-world videos.