Video Propagation Networks

Varun Jampani, Raghudeep Gadde, Peter V. Gehler

Introduction

In this work, we focus on the problem of propagating structured information across video frames. This problem appears in many forms (e.g., semantic segmentation or depth estimation) and is a pre-requisite for many applications. An example instance is shown in Fig. 1. Given an object mask for the first frame, the problem is to propagate this mask forward through the entire video sequence. Propagation of semantic information through time and video color propagation are other problem instances.

Videos pose both technical and representational challenges. The presence of scene and camera motion leads to the difficult pixel association problem of optical flow. Video data is computationally more demanding than static images: a naive per-frame approach scales at least linearly with the number of frames. These challenges complicate the use of standard convolutional neural networks (CNNs) for video processing. As a result, many previous works on video propagation use slow optimization-based techniques.

We propose a generic neural network architecture that propagates information across video frames. The main innovation is the use of image-adaptive convolutional operations that automatically adapt to the video stream content. This yields networks that can be applied to several types of information (e.g., labels or colors) and that run online, that is, they require only the current and previous frames.

Our architecture is composed of two components (see Fig. 1). The first is a temporal bilateral network that performs image-adaptive spatio-temporal dense filtering. It densely connects all pixels from the current and previous frames and propagates the associated pixel information to the current frame. The bilateral network also allows the specification of a metric between video pixels and a straightforward integration of temporal information. The second is a standard spatial CNN, applied to the bilateral network output, that refines the result and predicts the output for the present video frame. We call this combination a Video Propagation Network (VPN). In effect, we combine video-adaptive filtering with rather small spatial CNNs, which leads to a favorable runtime compared to many previous approaches.

VPNs have the following properties that make them suitable for video processing:

General applicability:

VPNs can be used to propagate any type of information content, i.e., both discrete (e.g., semantic labels) and continuous (e.g., color) information, across video frames.

Online propagation:

The method needs no future frames and can be used for online video analysis.

Long-range and image adaptive:

VPNs can efficiently handle a large number of input frames and are adaptive to the video with long-range pixel connections.

End-to-end trainable:

VPNs can be trained end-to-end, so they can be used in other deep network architectures.

Favorable runtime:

VPNs have a favorable runtime in comparison to many current best methods, which makes them amenable to learning with large datasets.

Empirically we show that VPNs, despite being generic, perform better than published approaches on video object segmentation and semantic label propagation while being faster. VPNs can easily be integrated into sequential per-frame approaches and require only a small fine-tuning step that can be performed separately.

Related Work

Techniques for propagating content across image/video pixels are predominantly optimization-based or filtering techniques. Optimization-based techniques typically formulate the propagation as an energy minimization problem on a graph constructed across video pixels or frames. A classic example is the color propagation technique from . Although efficient closed-form solutions exist for some scenarios, optimization tends to be slow due to large graph structures for videos and/or the use of complex connectivity. Fully-connected conditional random fields (CRFs) open a way for incorporating dense and long-range pixel connections while retaining fast inference.

Filtering techniques aim to propagate information with the use of image/video filters, resulting in fast runtimes compared to optimization techniques. Bilateral filtering is one of the popular filters for long-range information propagation. A popular application is joint bilateral upsampling, which upsamples a low-resolution signal with the use of a high-resolution guidance image. The works of showed that one can back-propagate through the bilateral filtering operation for learning filter parameters or doing optimization in the bilateral space . Recently, several works proposed to do upsampling in images by learning CNNs that mimic edge-aware filtering or that directly learn to upsample . Most of these works are confined to images and are either not extendable or computationally too expensive for videos. We leverage some of these previous works and propose a scalable yet robust neural network approach for video propagation. We discuss bilateral filtering, which forms the core of our approach, in more detail in Section 3.

Video object segmentation

Prior work on video object segmentation can be broadly categorized into two types: semi-supervised methods that require manual annotation to define the foreground object, and unsupervised methods that perform segmentation completely automatically. Unsupervised techniques such as use some prior information about the foreground objects, such as distinctive motion or saliency.

In this work, we focus on the semi-supervised task of propagating the foreground mask from the first frame to the entire video. Existing works predominantly use graph-based optimization that performs graph-cuts on video. Several of these works aim to reduce the complexity of the graph structure with clustering techniques such as spatio-temporal superpixels and optical flow . Another direction was to estimate correspondence between different frame pixels by using nearest neighbor fields or optical flow . Closest to our technique are the works of and . proposed to use a fully-connected CRF over the object proposals across frames. proposed a graph-cut in the bilateral space. Instead of graph-cuts, we learn propagation filters in the high-dimensional bilateral space. This results in a more generic architecture and allows integration into other deep networks. Two contemporary works proposed CNN-based approaches for object segmentation that rely on fine-tuning a deep network using the first-frame annotation of a given test sequence. This could result in overfitting to the test background. In contrast, the proposed approach relies only on offline training and thus can be easily adapted to different problem scenarios, as demonstrated in this paper.

Semantic video segmentation

Earlier methods such as use structure from motion on video frames to compute geometrical and/or motion features. More recent works construct large graphical models on videos and enforce temporal consistency across frames. used dynamic temporal links in their CRF energy formulation. proposes to use Perturb-and-MAP random field model with spatial-temporal energy terms and propagate predictions across time by learning a similarity function between pixels of consecutive frames.

In recent years, there has been a big leap in the performance of semantic segmentation with the use of CNNs, but these are mostly applied to images. Recently, proposed to retain the intermediate CNN representations while sliding an image CNN across the frames. Another approach is to take unary predictions from a CNN and then propagate semantic information across the frames. A recent prominent approach in this direction is that of , which proposes a technique for optimizing feature spaces for fully-connected CRFs.

Bilateral Filtering

We briefly review bilateral filtering and the extensions that we need to build VPNs. Bilateral filtering has its roots in image denoising and was developed as an edge-preserving filter. It has found numerous applications and recently found its way into neural network architectures . We use this filtering at the core of VPN and make use of the image/video-adaptive connectivity as a way to cope with scenes in motion.
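
In generic notation (the symbols below are chosen to match the paragraphs that follow and are otherwise assumptions), a bilateral filter takes an input $\mathbf{v}$ with $n$ pixels and one feature vector $F^{i}$ per pixel, and computes each output value as a feature-weighted sum over all pixels. This is presumably the form of operation that the text refers to as Eq. 1. The Gaussian weights shown are the traditional choice, and normalization is omitted for brevity:

```latex
\mathbf{v}'^{\,i} \;=\; \sum_{j=1}^{n} W^{i,j}\,\mathbf{v}^{j},
\qquad
W^{i,j} \;=\; \exp\!\left(-\tfrac{1}{2}\,\bigl\lVert F^{i}-F^{j}\bigr\rVert^{2}\right)
```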

The filter values $W^{i,j}$ change for every pixel pair $i,j$ and depend on the image/video content. Since the number of image/video pixels is usually large, a naive implementation of Eq. 1 is prohibitively expensive. Due to the importance of this filtering operation, several fast algorithms have been proposed that directly compute Eq. 1 without explicitly building the matrix $W$. One natural view that inspired several implementations was offered by , who viewed the bilateral filtering operation as a computation in a higher-dimensional space. Their observation was that bilateral filtering can be implemented by (1) projecting $\mathbf{v}$ into a high-dimensional grid defined by the features $F$ (splatting), (2) high-dimensional filtering (convolving) of the projected signal, and (3) projecting down the result at the points of interest (slicing). The high-dimensional grid is also called the bilateral space/grid. All these operations are linear and written as:
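
Composing the three steps in their order of application, and writing $\mathbf{v}'$ for the filtered output (an assumed symbol), this takes the form:

```latex
\mathbf{v}' \;=\; S_{slice}\, B\, S_{splat}\, \mathbf{v}
```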

where $S_{splat}$ and $S_{slice}$ denote the mappings from image pixels to the bilateral grid and back, and $B$ denotes convolution (traditionally Gaussian) in the bilateral space. The bilateral space has the same dimensionality $g$ as the features $F^{i}$ . The problem with this approach is that a standard $g$ -dimensional convolution on a regular grid requires handling a number of grid points that is exponential in $g$ . This was circumvented by a special data structure, the permutohedral lattice, as proposed in . Effectively, permutohedral filtering scales linearly with dimension, resulting in fast execution time.
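
The splat, convolve and slice steps can be sketched on a sparse regular grid. The following is a minimal illustration and not the permutohedral implementation: the nearest-cell splatting, the separable `(0.25, 0.5, 0.25)` blur and the function name `bilateral_grid_filter` are all assumptions made for this example.

```python
import math


def bilateral_grid_filter(values, features, cell=1.0, kernel=(0.25, 0.5, 0.25)):
    """Filter scalar `values` using a sparse regular grid over `features`.

    Illustrative stand-in for the permutohedral lattice: a regular grid stored
    as a dict, so only occupied cells (and their blurred neighbours) exist.
    """
    dim = len(features[0])

    def cell_of(f):
        # Nearest grid cell for a feature vector.
        return tuple(int(math.floor(x / cell + 0.5)) for x in f)

    # 1. Splat: accumulate each value and a homogeneous weight in its cell.
    grid, keys = {}, []
    for v, f in zip(values, features):
        key = cell_of(f)
        keys.append(key)
        s, w = grid.get(key, (0.0, 0.0))
        grid[key] = (s + v, w + 1.0)

    # 2. Convolve: separable 3-tap blur along every feature axis.
    for axis in range(dim):
        blurred = {}
        for key, (s, w) in grid.items():
            for offset, k in zip((-1, 0, 1), kernel):
                nkey = key[:axis] + (key[axis] + offset,) + key[axis + 1:]
                bs, bw = blurred.get(nkey, (0.0, 0.0))
                blurred[nkey] = (bs + k * s, bw + k * w)
        grid = blurred

    # 3. Slice: read the result back at each input point and normalize.
    return [grid[key][0] / grid[key][1] for key in keys]


# A step edge filtered with position-only features and with (position, value) features.
step = [0.0] * 4 + [1.0] * 4
smoothed = bilateral_grid_filter(step, [(float(i),) for i in range(8)])
edge_aware = bilateral_grid_filter(step, [(float(i), 10.0 * v) for i, v in enumerate(step)])
```

With the signal value included as a feature, the two sides of the step land in distant cells and do not mix, so `edge_aware` keeps the edge, whereas the position-only filter in `smoothed` blurs across it.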

The recent work of then generalized the bilateral filter in the permutohedral lattice and demonstrated how it can be learned via back-propagation. This allowed the construction of image-adaptive filtering operations into deep learning architectures, which we build upon. See Fig. 2 for an illustration of 2D permutohedral lattices. Refer to for more details on bilateral filtering using the permutohedral lattice and to for details on learning general permutohedral filters via back-propagation.

Video Propagation Networks

We aim to adapt the bilateral filtering operation to predict information forward in time, across video frames. Formally, we work on a sequence of $h$ (color or grayscale) images $S=(\mathbf{s}_{1},\mathbf{s}_{2},\ldots,\mathbf{s}_{h})$ and denote with $V=(\mathbf{v}_{1},\mathbf{v}_{2},\ldots,\mathbf{v}_{h})$ a sequence of outputs, one per frame. Consider as an example a sequence $\mathbf{v}_{1},\ldots,\mathbf{v}_{h}$ of foreground masks for a moving object in the scene. Our goal is to develop an online propagation method that can predict $\mathbf{v}_{t}$ , having observed the video up to frame $t$ and possibly the previous outputs $\mathbf{v}_{1,\ldots,t-1}$ , that is, a function $\mathcal{F}$ with $\mathcal{F}(\mathbf{v}_{1,\ldots,t-1},\mathbf{s}_{1,\ldots,t})=\mathbf{v}_{t}$ .

If training examples $\{(S_{i},V_{i})|i=1,\ldots,l\}$ with full or partial knowledge of $\mathbf{v}$ are available, it is possible to learn $\mathcal{F}$ , and for a complex and unknown input-output relationship, a deep CNN is a natural design choice. However, any learning-based method has to face one key challenge: the scene/camera motion and its effect on $\mathbf{v}$ . Since no motion in two different videos is the same, the fixed-size static receptive fields of a CNN are insufficient. We propose to resolve this with a video-adaptive filtering component, an adaptation of bilateral filtering to videos. Our Bilateral Network (Section 4.1) has a connectivity that adapts to video sequences; its output is then fed into a Spatial Network (Section 4.2) that further refines the desired output. The combined network layout of this VPN is depicted in Fig. 3. It is a sequence of learnable bilateral and spatial filters that is efficient, trainable end-to-end and adaptive to the video input.

Several properties of bilateral filtering make it well suited for information propagation in videos. In particular, our method is inspired by two main ideas that we extend in this work: joint bilateral upsampling and learnable bilateral filters . Although bilateral filtering has been used for filtering video data before , its use has been limited to fixed filter weights (say, Gaussian).

We will use this idea to propagate content from previous frames ( $\mathbf{v}_{in}=\mathbf{v}_{1,\ldots,t-1}$ ) to the current frame ( $\mathbf{v}_{out}=\mathbf{v}_{t}$ ). The summation in Eq. 1 now runs over all previous frames and pixels. This is illustrated in Fig. 2. We take all previous frame results $\mathbf{v}_{1,\ldots,t-1}$ and splat them into a lattice using the features $F_{1,\ldots,t-1}$ computed on video frames $\mathbf{s}_{1,\ldots,t-1}$ . A filtering (described below) is then applied to every lattice point and the result is sliced back using the features $F_{t}$ of the current frame $\mathbf{s}_{t}$ . This result need not be the final $\mathbf{v}_{t}$ ; in fact, we compute a filter bank of responses and continue with further processing, as will be discussed.

Standard bilateral features $F^{i}=(x,y,r,g,b)^{\top}$ used for images need not be optimal for videos. A recent work of proposes to optimize bilateral feature spaces for videos. Instead, we choose to simply add the frame index $t$ as an additional time feature, yielding a six-dimensional feature vector $F^{i}=(x,y,r,g,b,t)^{\top}$ for every video pixel. Imagine a video where an object moves to reveal some background. Pixels of the object and background will be close spatially $(x,y)^{\top}$ and temporally $(t)$ but are likely to be of different color $(r,g,b)^{\top}$ . Therefore, they will have no strong influence on each other (being splatted to distant positions in the six-dimensional bilateral space). One can understand the filter to be adaptive to color changes across frames: only pixels that are static and have similar color have a strong influence on each other (they end up nearby in the bilateral space). In all our experiments, we used time $t$ as an additional feature for information propagation across frames.
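
The moving-object argument can be checked numerically on a toy example. The sketch below builds $(x,y,r,g,b,t)$ features for a tiny two-frame video; the helper names, the nested-list frame layout and the feature scales are hypothetical choices for illustration only.

```python
def video_features(frames, scale=(1.0, 1.0, 1.0, 1.0, 1.0, 1.0)):
    """Return one scaled (x, y, r, g, b, t) feature vector per video pixel.

    `frames` is a list of frames, each a list of rows of (r, g, b) tuples.
    `scale` holds hypothetical per-dimension feature scales.
    """
    feats = []
    for t, frame in enumerate(frames):
        for y, row in enumerate(frame):
            for x, (r, g, b) in enumerate(row):
                raw = (x, y, r, g, b, t)
                feats.append(tuple(s * v for s, v in zip(scale, raw)))
    return feats


def sq_dist(f1, f2):
    """Squared Euclidean distance between two feature vectors."""
    return sum((a - b) ** 2 for a, b in zip(f1, f2))


# Two 1x2 frames: a red object pixel at x=0 moves away and reveals blue background.
frame0 = [[(255, 0, 0), (0, 0, 255)]]
frame1 = [[(0, 0, 255), (0, 0, 255)]]
feats = video_features([frame0, frame1], scale=(1, 1, 0.1, 0.1, 0.1, 1))
object_px, background_px, revealed_px, _ = feats

# The revealed pixel shares its position with the old object pixel,
# but in feature space it is much closer to the old background pixel.
d_object = sq_dist(revealed_px, object_px)
d_background = sq_dist(revealed_px, background_px)
```

Here `d_background` is far smaller than `d_object`, so a filter defined in this feature space lets the revealed pixel inherit information from the background rather than from the object that used to occupy its position.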

In addition to adding time $t$ as an additional feature, we also experimented with using optical flow. We make use of optical flow estimates (of the previous frames with respect to the current frame) by warping the pixel position features $(x,y)^{\top}$ of previous frames by their optical flow displacement vectors $(u_{x},u_{y})^{\top}$ to $(x+u_{x},y+u_{y})^{\top}$ . If perfect flow were available, the video frames could be warped into a common frame of reference. This would resolve the correspondence problem and make information propagation much easier. We refer to the VPN model that uses the modified positional features $(x+u_{x},y+u_{y})^{\top}$ as VPN-Flow.
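
As a small illustration of the modified positional features, the sketch below shifts the $(x,y)$ entries of previous-frame feature vectors by per-pixel flow displacements. The function name and the tuple-based data layout are assumptions; how the flow itself is estimated is outside the scope of this example.

```python
def warp_positions(features, flow):
    """Replace (x, y) by (x + u_x, y + u_y) in each (x, y, r, g, b, t) feature.

    `flow` holds one (u_x, u_y) displacement per feature vector, pointing from
    the previous-frame pixel towards its position in the current frame.
    """
    warped = []
    for (x, y, *rest), (ux, uy) in zip(features, flow):
        warped.append((x + ux, y + uy, *rest))
    return warped


# One previous-frame pixel at (2, 3) that moves by (1.5, -1.0).
prev_features = [(2, 3, 10, 20, 30, 0)]
warped = warp_positions(prev_features, [(1.5, -1.0)])
```

Only the first two feature dimensions change; the color and time entries are left untouched.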

Another property of permutohedral filtering that we exploit is that the input points need not lie on a regular grid, since the filtering is done in the high-dimensional lattice. Instead of splatting millions of pixels onto the lattice, we randomly sample points or use superpixels, and perform filtering using these sampled points as input to the filter. In practice, we observe that this results in big computational gains with only a minor drop in performance (more in Section 5.1).

Learnable Bilateral Filters. Bilateral filters help in video-adaptive information propagation across frames. But the standard Gaussian filter may be insufficient; further, we would like to increase the capacity by using a filter bank instead of a single fixed filter. We propose to use the technique of to learn a filter bank in the permutohedral lattice using back-propagation.

BNN Architecture. The Bilateral Network (BNN) is illustrated in the green box of Fig. 3. The input is a video sequence $S$ and the corresponding predictions $V$ up to frame $t$ . Those are filtered using two bilateral convolution layers (BCLs), BCLa and BCLb, with $32$ filters each. For both BCLs, we use the same features $F^{i}$ but scale them with different diagonal matrices: $\Lambda_{a}F^{i},\Lambda_{b}F^{i}$ . The feature scales ( $\Lambda_{a},\Lambda_{b}$ ) are found by validation. The two $32$ -dimensional outputs are concatenated, passed through a ReLU non-linearity and passed to a second layer of two separate BCL filters that use the same feature spaces $\Lambda_{a}F^{i},\Lambda_{b}F^{i}$ . The output of the second filter bank is then reduced using a $1\times 1$ spatial filter to map to the original dimension $d$ of $\mathbf{v}$ . We investigated scaling frame inputs with an exponential time decay and found that, when processing frame $t$ , a re-weighting with $(\alpha\mathbf{v}_{t-1},\alpha^{2}\mathbf{v}_{t-2},\alpha^{3}\mathbf{v}_{t-3}\ldots)$ with $0\leq\alpha\leq 1$ improved the performance slightly.
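
The exponential time decay is simple to state in code. This sketch assumes each previous output is stored as a flat list of floats, with the most recent frame first; the function name is hypothetical.

```python
def decay_weighted(previous, alpha=0.5):
    """Scale previous outputs by alpha, alpha^2, alpha^3, ...

    `previous[0]` is v_{t-1}, `previous[1]` is v_{t-2}, and so on.
    """
    return [
        [(alpha ** (k + 1)) * value for value in frame]
        for k, frame in enumerate(previous)
    ]


# Three previous frames of all-ones predictions.
weighted = decay_weighted([[1.0, 1.0], [1.0, 1.0], [1.0, 1.0]], alpha=0.5)
# weighted == [[0.5, 0.5], [0.25, 0.25], [0.125, 0.125]]
```

With $\alpha=1$ all previous frames contribute equally, and with $\alpha=0$ they are suppressed entirely, so $\alpha$ controls how quickly older predictions lose influence.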

In the experiments, we also included a simple BNN variant in which no filters are applied inside the permutohedral space: we only splat and slice with the two layers BCLa and BCLb and add the results. We refer to this model as BNN-Identity, as it is equivalent to using a filter $B$ that is the identity matrix. It corresponds to an image-adaptive smoothing of the inputs $V$ . We found this filtering to already have a positive effect in our experiments.
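
Splatting followed directly by slicing can be illustrated with a toy propagation example. The sketch below is an assumption-laden stand-in: it uses nearest-cell lookup on a sparse regular grid instead of the permutohedral lattice, a single feature space instead of two, and pre-scaled (position, intensity) features without a time dimension.

```python
import math


def splat_and_slice(prev_feats, prev_values, cur_feats, cell=1.0, default=0.0):
    """Propagate values from previous-frame points to current-frame points.

    Values are averaged per grid cell (splat) and read back at the cells of
    the current-frame features (slice); no filtering happens in between.
    """
    def cell_of(f):
        return tuple(int(math.floor(x / cell)) for x in f)

    sums, counts = {}, {}
    for f, v in zip(prev_feats, prev_values):
        key = cell_of(f)
        sums[key] = sums.get(key, 0.0) + v
        counts[key] = counts.get(key, 0) + 1

    out = []
    for f in cur_feats:
        key = cell_of(f)
        out.append(sums[key] / counts[key] if key in counts else default)
    return out


def features(intensities):
    # Hypothetical scales: coarse in position (x / 4), finer in intensity (I / 50).
    return [(x / 4.0, i / 50.0) for x, i in enumerate(intensities)]


# A dark object (intensity 10) on a bright background (200) moves one pixel right.
prev_row = [10, 10, 10, 200, 200, 200]
prev_mask = [1.0, 1.0, 1.0, 0.0, 0.0, 0.0]
cur_row = [200, 10, 10, 10, 200, 200]

cur_mask = splat_and_slice(features(prev_row), prev_mask, features(cur_row))
# cur_mask == [0.0, 1.0, 1.0, 1.0, 0.0, 0.0]
```

Because the current-frame pixels are looked up by appearance as well as position, the propagated mask follows the object to its new location without any motion estimate.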

4.2 Spatial Network

The BNN was designed to propagate information from the previous frames to the present one, respecting the scene and object motion. We then add a small spatial CNN with 3 layers, each with $32$ filters of size $3\times 3$ , interleaved with ReLU non-linearities. The final result is then mapped to the desired output of $\mathbf{v}_{t}$ using a $1\times 1$ convolution. The main role of this spatial CNN is to refine the information in frame $t$ . Depending on the problem and the size of the available training data, other network designs are conceivable. We use the same network architecture shown in Fig. 3 for all the experiments to demonstrate the generality of VPNs.
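
To give a sense of how small this spatial CNN is, the sketch below counts its learnable parameters. It assumes that every convolution has a bias term and that the network receives `d_in` input channels; neither detail is stated in the text, and the function names are hypothetical.

```python
def conv_params(c_in, c_out, k):
    """Weights plus biases of a k x k convolution layer."""
    return k * k * c_in * c_out + c_out


def spatial_cnn_params(d_in, d_out, width=32, layers=3):
    """Parameter count of `layers` 3x3 convolutions of `width` filters each,
    followed by a final 1x1 convolution to `d_out` channels."""
    total, channels = 0, d_in
    for _ in range(layers):
        total += conv_params(channels, width, 3)
        channels = width
    return total + conv_params(channels, d_out, 1)


# Example: two input channels and two output channels (foreground/background).
n_params = spatial_cnn_params(d_in=2, d_out=2)
# n_params == 19170
```

Under these assumptions the refinement network has on the order of twenty thousand parameters, which is tiny compared to typical image segmentation CNNs.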

Experiments

We evaluated VPN on three different propagation tasks: propagation of foreground masks, semantic labels and color in videos. Our implementation runs in Caffe using standard settings. We used Adam stochastic optimization for training VPNs, with multinomial-logistic loss for label propagation networks and Euclidean loss for color propagation networks. We use a fixed learning rate of 0.001 and choose the trained models with minimum validation loss. Runtime computations were performed on a machine with an Nvidia TitanX GPU and a 6-core Intel i7-5820K CPU clocked at 3.30GHz. The code is available online at http://varunjampani.github.io/vpn/.

We focus on the semi-supervised task of propagating a given first frame foreground mask to all the video frames. Object segmentation in videos is useful for several high level tasks such as video editing, rotoscoping etc.

We use the recently published DAVIS dataset for experiments on this task. It consists of 50 high-quality videos. All the frames come with high-quality per-pixel annotation of the foreground object. For robust evaluation and to get results on all the dataset videos, we evaluate our technique using 5-fold cross-validation. We randomly divided the data into 5 folds, where in each fold we used 35 videos for training, 5 for validation and the remaining 10 for testing. For the evaluation, we used the 3 metrics that are proposed in : Intersection over Union (IoU) score, contour accuracy ( $\mathcal{F}$ ) score and temporal instability ( $\mathcal{T}$ ) score. The widely used IoU score is defined as $TP/(TP+FN+FP)$ , where TP denotes true positives, FN false negatives and FP false positives. Refer to for the definition of the other two metrics.
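
The IoU definition translates directly into code. The sketch below operates on flat binary masks; returning 1.0 when both masks are empty is a convention assumed here, not something the text specifies.

```python
def iou(pred, truth):
    """Intersection over Union, TP / (TP + FN + FP), for flat binary masks."""
    tp = sum(1 for p, t in zip(pred, truth) if p and t)
    fp = sum(1 for p, t in zip(pred, truth) if p and not t)
    fn = sum(1 for p, t in zip(pred, truth) if not p and t)
    denom = tp + fp + fn
    # Assumed convention: two empty masks count as a perfect match.
    return tp / denom if denom else 1.0


# One true positive, one false positive, one false negative.
score = iou([1, 1, 0, 0], [1, 0, 1, 0])
```

True negatives do not enter the score, so large background regions cannot inflate it.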

VPN and Results

In this task, we only have access to the foreground mask for the first frame $\mathbf{v}_{1}$ . For ease of training VPN, we obtain an initial set of predictions with BNN-Identity. We sequentially apply BNN-Identity at each frame and obtain an initial set of foreground masks for the entire video. These BNN-Identity propagated masks are then used as inputs to train a VPN to predict the refined masks at each frame. We refer to this VPN model as VPN-Stage1. Once VPN-Stage1 is trained, its refined mask predictions are in turn used as inputs to train another VPN model, which we refer to as VPN-Stage2. This resulted in further refinement of the foreground masks. Training further stages did not result in any improvements. Instead, one could consider VPN as an RNN unit processing one frame after another. But, due to GPU memory constraints, we opted for stage-wise training.

Following the recent work of on video object segmentation, we used $F^{i}=(x,y,Y,Cb,Cr,t)^{\top}$ features with YCbCr color features for bilateral filtering. To be comparable with one of the fastest state-of-the-art techniques , we do not use any optical flow information. First, we analyze the performance of BNN-Identity by changing the number of randomly sampled input points. Figure 4 shows how the segmentation IoU changes with the number of sampled points (out of 2 million points) from the previous frames. The IoU levels out after sampling 25% of the points. For further computational efficiency, we used superpixel sampling instead of random sampling. Compared to random sampling, the use of superpixels reduced the IoU slightly (by 0.5), while reducing the number of input points by a factor of 10. We used 12000 SLIC superpixels from each frame, computed using the fast GPU implementation from . As input to VPN, we use the mask probabilities of the previous 9 frames, as we observe no improvements with more frames. We set $\alpha=0.5$ and the feature scales ( $\Lambda_{a},\Lambda_{b}$ ) are presented in Tab. A.1.
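
For illustration, the $(x,y,Y,Cb,Cr,t)$ feature of a single pixel can be computed from RGB as sketched below. The conversion uses the full-range BT.601 (JFIF) constants; which YCbCr variant was actually used in the experiments is not stated, so this choice and the function names are assumptions.

```python
def rgb_to_ycbcr(r, g, b):
    """Full-range BT.601 (JFIF) RGB -> YCbCr for 8-bit channel values."""
    y = 0.299 * r + 0.587 * g + 0.114 * b
    cb = 128.0 - 0.168736 * r - 0.331264 * g + 0.5 * b
    cr = 128.0 + 0.5 * r - 0.418688 * g - 0.081312 * b
    return y, cb, cr


def ycbcr_feature(x, y, rgb, t):
    """Unscaled (x, y, Y, Cb, Cr, t) feature vector for one video pixel."""
    return (x, y, *rgb_to_ycbcr(*rgb), t)


# A mid-gray pixel at position (4, 7) in frame 2.
feature = ycbcr_feature(4, 7, (128, 128, 128), 2)
```

Separating luminance from chrominance lets the feature scales weight brightness and color differences independently, which is not possible with a single scale shared by all RGB channels.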

Table 1 shows the IoU scores for each of the 5 folds and Tab. 2 shows the overall scores and runtimes of different VPN models along with the best performing techniques. The performance improved consistently across all 5 folds with the addition of new VPN stages. BNN-Identity already performed reasonably well. VPN outperformed the present fastest BVS method by a significant margin on all the performance measures while being comparable in runtime. VPN performs marginally better than the OFL method while being at least 80 $\times$ faster; moreover, OFL relies on optical flow, whereas we obtain similar performance without using any optical flow. Further, VPN has the advantage of online processing, as it looks only at previous frames, whereas BVS processes the entire video at once.

Augmentation of Pre-trained Models

One of the main advantages of VPN is that it is end-to-end trainable and can be easily integrated into other deep networks. To demonstrate this, we augmented the VPN architecture with the standard DeepLab segmentation network . We replaced the last classification layer of the DeepLab-LargeFOV model to output 2 classes (foreground and background) in our case and bilinearly upsampled the resulting low-resolution probability map to the original image dimension. 5-fold fine-tuning of the DeepLab model on the DAVIS dataset resulted in an average IoU of 57.0; other scores are shown in Tab. 2. To construct a joint model, the outputs from DeepLab and the bilateral network (in VPN) are concatenated and then passed on to the spatial CNN. In other words, the bilateral network propagates label information from previous frames to the present frame, whereas the DeepLab network does the prediction for the present frame. The results of both are then combined and refined by the spatial network in the VPN. We call this the ‘VPN-DeepLab’ model. We trained this model end-to-end and observed big improvements in performance. As shown in Tab. 2, the VPN-DeepLab model has an IoU score of 75.0, which is a significant improvement over the published results. The total runtime of VPN-DeepLab is only 0.63s, which also makes it one of the fastest techniques. Figure 5 shows some qualitative results, with more in Figs. A.1, A.2 and A.3. One can obtain better VPN performance by using better superpixels and by incorporating optical flow, but this increases the runtime as well. Visual results indicate that the learned VPN is able to retain foreground masks even with large variations in viewpoint and object size.

5.2 Semantic Video Segmentation

This is the task of assigning a semantic label to every video pixel. Since the semantics between adjacent frames do not change radically, propagating semantics across frames should intuitively improve the segmentation quality of each frame. Unlike video object segmentation, where the mask for the first frame is given, we approach semantic video segmentation in a fully automatic fashion. Specifically, we start with the unary predictions of standard CNNs and use VPN for propagating semantics across the frames.

We use the CamVid dataset that contains 4 high quality videos captured at 30Hz while the semantically labeled 11-class ground truth is provided at 1Hz. While the original dataset comes at a resolution of 960 $\times$ 720, we operate on a resolution of 640 $\times$ 480 as in . We use the same splits as in resulting in 367, 100 and 233 frames with ground truth for training, validation and testing.

VPN and Results

Since we already have CNN predictions for every frame, we train a VPN that takes the CNN predictions of previous and present frames as input and predicts the refined semantics for the present frame. We compare with a state-of-the-art CRF approach , which we refer to as FSO-CRF. We also experimented with optical flow in VPN and refer to that model as VPN-Flow. We used the fast DIS optical flow to modify the positional features of previous frames. We used superpixels computed with Dollar et al. , as gSLICr introduced artifacts.

We experimented with predictions from two different CNNs: one with dilated convolutions (CNN-1) and another (CNN-2) trained with additional video game data , which is the present state-of-the-art on this dataset. For CNN-1 and CNN-2, using 2 and 3 previous frames respectively as input to VPN was found to be optimal. Other parameters of VPN are presented in Tab. A.1. Table 3 shows quantitative results. Using BNN-Identity only slightly improved the performance, whereas training the entire VPN significantly improved the CNN-1 performance by over 1.2 IoU, with both VPN and VPN-Flow. Moreover, VPN is at least 25 $\times$ faster and simpler to use compared to the optimization-based FSO-CRF, which relies on LDOF optical flow , long-term tracks and edges . Replacing bilateral filters with spatial filters in VPN improved the CNN-1 performance by only 0.3 IoU, showing the importance of video-adaptive filtering. We further improved the performance of the state-of-the-art CNN-2 with the VPN-Flow model. Using better optical flow estimation might give even better results. Figure 6 shows some qualitative results, with more in Fig. A.4.

5.3 Video Color Propagation

We also evaluate VPNs on a regression task of propagating color information in a grayscale video. Given the color image for the first video frame, the task is to propagate the color to the entire video. For experiments on this task, we again used the DAVIS segmentation dataset with the first 25 frames from each video. We randomly divided the dataset into 30 train, 5 validation and 15 test videos.

We work with the YCbCr representation of images and propagate CbCr values from previous frames, with pixel intensity, position and time features as guidance for VPN. The same strategy as in object segmentation is used: an initial set of color-propagated results is obtained with BNN-Identity and then used to train a VPN-Stage1 model. Training further VPN stages did not improve the performance. We use 300K randomly sampled points from the previous 3 frames as input to the VPN. Table 4 shows the PSNR results. We also show a baseline result of that does graph-based optimization using optical flow. We used fast DIS optical flow in the baseline method and did not observe significant differences when using LDOF optical flow . Figure 7 shows a visual result, with more in Fig. A.5. VPN works reliably better than while being 20 $\times$ faster. The method of relies heavily on optical flow, and so the color drifts away with incorrect flow. We observe that our method also bleeds color in some regions, especially when there are large viewpoint changes. We could not compare against recent color propagation techniques , as their codes are not available online. This application shows the general applicability of VPNs in propagating different kinds of information.
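
For reference, PSNR follows the standard definition $10\log_{10}(\mathrm{peak}^{2}/\mathrm{MSE})$. The sketch below assumes 8-bit values with a peak of 255 and flat lists of samples; which channels the reported PSNR is computed over is not stated in the text, so that is left to the caller.

```python
import math


def psnr(pred, target, peak=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized flat lists."""
    mse = sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)
    if mse == 0:
        return math.inf  # identical signals
    return 10.0 * math.log10(peak ** 2 / mse)


# A constant error of 25.5 (10% of the peak) gives roughly 20 dB.
value = psnr([25.5, 25.5, 25.5], [0.0, 0.0, 0.0])
```

Higher values indicate propagated colors closer to the ground truth; the measure is logarithmic, so a 20 dB gain corresponds to a tenfold reduction in root-mean-square error.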

Conclusion

We proposed a fast, scalable and generic neural network approach for propagating information across video frames. The VPN uses a bilateral network for long-range video-adaptive propagation of information from previous frames to the present frame, which is then refined by a spatial network. Experiments on diverse tasks show that VPNs, despite being generic, outperformed the current state-of-the-art task-specific methods. At the core of our technique is the exploitation and modification of learnable bilateral filtering for use in video processing. We used a simple VPN architecture to showcase the generality. Depending on the problem and the availability of data, using more filters or deeper layers may result in better performance. In this work, we manually tuned the feature scales, which could be amenable to learning. Finding optimal yet fast-to-compute bilateral features for videos, together with the learning of their scales, is an important future research direction.

We thank Vibhav Vineet for providing the trained image segmentation CNN models for CamVid dataset.

Appendix A Parameters and Additional Results

In this appendix, we present experiment protocols and additional qualitative results for experiments on video object segmentation, semantic video segmentation and video color propagation. Table A.1 shows the feature scales and other parameters used in different experiments. Figures A.1 and A.2 show some qualitative results on video object segmentation, with some failure cases in Fig. A.3. Figure A.4 shows some qualitative results on semantic video segmentation and Fig. A.5 shows results on video color propagation.