SfM-Net: Learning of Structure and Motion from Video

Sudheendra Vijayanarasimhan, Susanna Ricco, Cordelia Schmid, Rahul Sukthankar, Katerina Fragkiadaki

Introduction

We propose SfM-Net, a neural network that is trained to extract 3D structure, ego-motion, segmentation, object rotations and translations in an end-to-end fashion in videos, by exploiting the geometry of image formation. Given a pair of frames and camera intrinsics, SfM-Net, depicted in Figure 1, computes depth, 3D camera motion, a set of 3D rotations and translations for the dynamic objects in the scene, and corresponding pixel assignment masks. Those in turn provide a geometrically meaningful motion field (optical flow) that is used to differentiably warp each frame to the next. Pixel matching across consecutive frames, constrained by forward-backward consistency on the computed motion and 3D structure, provides gradients during training in the case of self-supervision. SfM-Net can take advantage of varying levels of supervision, as demonstrated in our experiments: completely unsupervised (self-supervised), supervised by camera motion, or supervised by depth (from Kinect).

SfM-Net is inspired by works that impose geometric constraints on optical flow, exploiting rigidity of the visual scene, such as early low-parametric optical flow methods or the so-called direct methods for visual SLAM (Simultaneous Localization and Mapping) that perform dense pixel matching from frame to frame while estimating a camera trajectory and depth of the pixels in the scene. In contrast to those, instead of optimizing directly over optical flow vectors, 3D point coordinates or camera rotation and translation, our model optimizes over neural network weights that, given a pair of frames, produce such 3D structure and motion. In this way, our method learns to estimate structure and motion, and can in principle improve as it processes more videos, in contrast to non-learning based alternatives. It can thus be made robust to lack of texture, degenerate camera motion trajectories or dynamic objects (our model explicitly accounts for those), by providing appropriate supervision. Our work is also inspired by and builds upon recent works on learning geometrically interpretable optical flow fields for point cloud prediction in time and backpropagating through camera projection for 3D human pose estimation or single-view depth estimation.

A method for self-supervised learning in videos in-the-wild, through explicit modeling of the geometry of scene motion and image formation.

A deep network that predicts pixel-wise depth from a single frame along with camera motion, object motion, and object masks directly from a pair of frames.

Forward-backward constraints for learning a consistent 3D structure from frame to frame and for better exploiting self-supervision, extending left-right consistency constraints of .

We show results of our approach on KITTI, MoSeg, and RGB-D SLAM benchmarks under different levels of supervision. SfM-Net learns to predict structure, object, and camera motion by training on realistic video sequences using limited ground-truth annotations.

Related work

Differentiable warping has been used to learn end-to-end unsupervised optical flow, disparity flow in a stereo rig and video prediction. The closest previous works to ours are SE3-Nets, 3D image interpreter, and Garg et al.’s depth CNN. SE3-Nets use an actuation force from a robot and an input point cloud to forecast a set of 3D rigid object motions (rotations and translations) and corresponding pixel motion assignment masks under a static camera assumption. Our work uses a similar representation of pixel motion masks and 3D motions to capture the dynamic objects in the scene. However, our work differs in that 1) we predict depth and camera motion while SE3-Nets operate on given point clouds and assume no camera motion, 2) SE3-Nets are supervised with pre-recorded 3D optical flow, while this work admits diverse and much weaker supervision, as well as complete lack of supervision, 3) SE3-Nets consider one frame and an action as input to predict the future motion, while our model uses pairs of frames as input to estimate the inter-frame motion, and 4) SE3-Nets are applied to toy or lab-like setups whereas we show results on real videos.

Wu et al. learn 3D sparse landmark positions of chairs and human body joints from a single image by computing a simplified camera model and minimizing a camera re-projection error of the landmark positions. They use synthetic data to pre-train the 2D to 3D mapping of their network. Our work considers dense structure estimation and uses videos to obtain the necessary self-supervision, instead of static images. Garg et al. also predict depth from a single image, supervised by photometric error. However, they do not infer camera motion or object motion, instead requiring stereo pairs with known baseline during training.

Concurrent work to ours removes the constraint that the ground-truth pose of the camera be known at training time, and instead estimates the camera motion between frames using another neural network. Our approach tackles the more challenging problem of simultaneously estimating both camera and object motion.

Geometry-aware motion estimation.

Motion estimation methods that exploit rigidity of the video scene and the geometry of image formation to impose constraints on optical flow fields have a long history in computer vision. Instead of non-parametric dense flow fields, researchers have proposed affine or projective transformations that better exploit the low dimensionality of rigid object motion. When depth information is available, motions are rigid rotations and translations. Similarly, direct methods for visual SLAM, which take RGB or RGBD video as input, perform dense pixel matching from frame to frame while estimating a camera trajectory and depth of the pixels in the scene, with impressive 3D point cloud reconstructions.

These works typically make a static world assumption, which makes them susceptible to the presence of moving objects in the scene. Instead, SfM-Net explicitly accounts for moving objects using motion masks and 3D translation and rotation prediction.

Learning-based motion estimation.

Recent works propose learning frame-to-frame motion fields with deep neural networks supervised with ground-truth motion obtained from simulation or synthetic movies. This enables efficient motion estimation that learns to deal with lack of texture using training examples rather than relying only on smoothness constraints of the motion field, as previous optimization methods do. Instead of directly optimizing over unknown motion parameters, such approaches optimize neural network weights that allow motion prediction in the presence of ambiguities in the given pair of frames.

Unsupervised learning in videos.

Video holds great potential for learning semantically meaningful visual representations under weak supervision. Recent works have explored this direction by using videos to propagate semantic labels in time using motion constraints, impose temporal coherence (slowness) on the learnt visual features, predict temporal evolution, learn temporal instance-level associations, predict the temporal ordering of video frames, etc.

Most of those unsupervised methods are shown to be good pre-training mechanisms for object detection or classification, as done in . In contrast and complementary to the works above, our model extracts fine-grained 3D structure and 3D motion from monocular videos with weak supervision, instead of semantic feature representations.

Learning SfM

We compute per-frame depth using a standard conv/deconv subnetwork operating on a single frame (the structure network in Figure 2). We use a RELU activation at our final layer, since depth values are non-negative. Given depth $d_{t}$, we obtain the 3D point cloud $\mathbf{X}_{t}^{i}=(X_{t}^{i},Y_{t}^{i},Z_{t}^{i}),i\in 1,\ldots,w\times h$ corresponding to the pixels in the scene using a pinhole camera model. Let $(x_{t}^{i},y_{t}^{i})$ be the column and row positions of the $i^{th}$ pixel in frame $I_{t}$ and let $(c_{x},c_{y},f)$ be the camera intrinsics, then

$\mathbf{X}_{t}^{i}=\begin{bmatrix}X_{t}^{i}\\ Y_{t}^{i}\\ Z_{t}^{i}\end{bmatrix}=\frac{d_{t}^{i}}{f}\begin{bmatrix}\frac{x_{t}^{i}}{w}-c_{x}\\ \frac{y_{t}^{i}}{h}-c_{y}\\ f\end{bmatrix}$

where $d_{t}^{i}$ denotes the depth value of the $i^{th}$ pixel. We use the camera intrinsics when available and revert to default values of $(0.5,0.5,1.0)$ otherwise. Therefore, the predicted depth will only be correct up to a scalar multiplier.
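The back-projection from a depth map to a point cloud can be sketched in NumPy as follows. This is a minimal illustration rather than the authors' implementation: it assumes pixel positions are normalized by the image width and height (which is what the default intrinsics $(0.5,0.5,1.0)$ suggest), and the function name `backproject` is hypothetical.

```python
import numpy as np

def backproject(depth, cx=0.5, cy=0.5, f=1.0):
    """Lift an (h, w) depth map to an (h, w, 3) point cloud.

    Assumption: pixel positions are divided by the image width/height
    before the principal point (cx, cy) is subtracted.
    """
    h, w = depth.shape
    # x holds the column index and y the row index of every pixel.
    x, y = np.meshgrid(np.arange(w), np.arange(h))
    X = depth / f * (x / w - cx)
    Y = depth / f * (y / h - cy)
    Z = depth  # equals depth / f * f
    return np.stack([X, Y, Z], axis=-1)

# A fronto-parallel plane two units away from the camera.
depth = np.full((4, 6), 2.0)
cloud = backproject(depth)
```

With the default intrinsics, the pixel at column $w/2$ and row $h/2$ lies on the optical axis, so its $X$ and $Y$ coordinates are zero.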

Scene motion.

Let $\{R^{c}_{t},t^{c}_{t}\}\in SE3$ denote the 3D rotation and translation of the camera from frame $I_{t}$ to frame $I_{t+1}$ (relative camera pose across consecutive frames). We represent $R^{c}_{t}$ using an Euler angle representation as ${R^{c}_{t}}^{x}(\alpha){R^{c}_{t}}^{y}(\beta){R^{c}_{t}}^{z}(\gamma)$ where

to be in the interval $$ by using RELU activation and the minimum function.

Optical flow.

$\mathbf{X}^{\prime\prime}_{t}=R^{c}_{t}(\mathbf{X}^{\prime}_{t}-p^{c}_{t})+t^{c}_{t}.$

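Finally, we obtain the column and row positions of the pixel in the second frame $(x_{t+1}^{i},y_{t+1}^{i})$ by projecting the corresponding 3D point $\mathbf{X}^{\prime\prime}_{t}=(X^{\prime\prime}_{t},Y^{\prime\prime}_{t},Z^{\prime\prime}_{t})$ back to the image plane as follows:

$\begin{bmatrix}\frac{x_{t+1}^{i}}{w}\\ \frac{y_{t+1}^{i}}{h}\end{bmatrix}=\frac{f}{Z^{\prime\prime}_{t}}\begin{bmatrix}X^{\prime\prime}_{t}\\ Y^{\prime\prime}_{t}\end{bmatrix}+\begin{bmatrix}c_{x}\\ c_{y}\end{bmatrix}$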

The flow $U,V$ between the two frames at pixel $i$ is then $(U_{t}(i),V_{t}(i))=(x^{i}_{t+1}-x^{i}_{t},y^{i}_{t+1}-y^{i}_{t})$ .
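The chain from camera motion to optical flow can be sketched as below. This is an illustrative sketch under stated assumptions: normalized pixel coordinates as in the default intrinsics, a single rigid camera transform with no object motion, and hypothetical helper names `project` and `rigid_flow`.

```python
import numpy as np

def project(points, w, h, cx=0.5, cy=0.5, f=1.0):
    """Project (..., 3) camera-frame points to pixel columns and rows."""
    X, Y, Z = points[..., 0], points[..., 1], points[..., 2]
    x = w * (f * X / Z + cx)
    y = h * (f * Y / Z + cy)
    return x, y

def rigid_flow(points, R, t, pivot, w, h):
    """Flow (U, V) induced by X'' = R (X' - p) + t."""
    x0, y0 = project(points, w, h)
    moved = (points - pivot) @ R.T + t
    x1, y1 = project(moved, w, h)
    return x1 - x0, y1 - y0

# Plane at depth 2, translated by 0.1 along X without rotation.
h, w = 4, 6
x, y = np.meshgrid(np.arange(w), np.arange(h))
depth = np.full((h, w), 2.0)
points = np.stack([depth * (x / w - 0.5), depth * (y / h - 0.5), depth], axis=-1)
U, V = rigid_flow(points, np.eye(3), np.array([0.1, 0.0, 0.0]), np.zeros(3), w, h)
```

For this example every pixel moves horizontally by $w \cdot f \cdot 0.1 / 2 = 0.3$ pixels, and the vertical flow is zero.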

Supervision

SfM-Net inverts the image formation and extracts depth, camera and object motions that gave rise to the observed temporal differences, similar to previous SfM works. Such inverse problems are ill-posed as many solutions of depth, camera and object motion can give rise to the same observed frame-to-frame pixel values. A learning-based solution, as opposed to direct optimization, has the advantage of learning to handle such ambiguities through partial supervision of its weights or appropriate pre-training, or simply because the same coefficients (network weights) need to explain a large amount of video data consistently. We detail the various supervision modes below and explore a subset of them in the experimental section.

Given unconstrained video, without accompanying ground-truth structure or motion information, our model is trained to minimize the photometric error between the first frame and the second frame warped towards the first according to the predicted motion field, based on well-known brightness constancy assumptions:

$\mathcal{L}_{t}^{\text{color}}=\frac{1}{w\,h}\sum_{x,y}\|I_{t}(x,y)-I_{t+1}(x^{\prime},y^{\prime})\|_{1}$

where $x^{\prime}=x+U_{t}(x,y)$ and $y^{\prime}=y+V_{t}(x,y)$ . We use differentiable image warping proposed in the spatial transformer work and compute color constancy loss in a fully differentiable manner.
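The warping-based photometric error can be sketched with bilinear sampling in NumPy. This is an illustration, not the training code: it assumes an L1 penalty averaged over pixels, clamps samples that fall outside the image (an assumption about border handling), and uses the hypothetical names `bilinear_sample` and `photometric_loss`.

```python
import numpy as np

def bilinear_sample(img, xs, ys):
    """Sample an (h, w) or (h, w, c) image at float positions (xs, ys)."""
    h, w = img.shape[:2]
    xs = np.clip(xs, 0, w - 1)  # clamp out-of-image samples to the border
    ys = np.clip(ys, 0, h - 1)
    x0 = np.floor(xs).astype(int)
    y0 = np.floor(ys).astype(int)
    x1 = np.minimum(x0 + 1, w - 1)
    y1 = np.minimum(y0 + 1, h - 1)
    ax = xs - x0
    ay = ys - y0
    if img.ndim == 3:
        ax, ay = ax[..., None], ay[..., None]
    top = (1 - ax) * img[y0, x0] + ax * img[y0, x1]
    bottom = (1 - ax) * img[y1, x0] + ax * img[y1, x1]
    return (1 - ay) * top + ay * bottom

def photometric_loss(frame_t, frame_t1, U, V):
    """Mean absolute difference between frame t and frame t+1 warped back."""
    h, w = frame_t.shape[:2]
    x, y = np.meshgrid(np.arange(w), np.arange(h))
    warped = bilinear_sample(frame_t1, x + U, y + V)
    return np.abs(frame_t - warped).mean()

# Horizontal intensity ramp whose content shifts right by one pixel.
h, w = 4, 8
x, y = np.meshgrid(np.arange(w), np.arange(h))
frame_t = x.astype(float)
frame_t1 = np.maximum(frame_t - 1.0, 0.0)
loss_true_flow = photometric_loss(frame_t, frame_t1, np.ones((h, w)), np.zeros((h, w)))
loss_zero_flow = photometric_loss(frame_t, frame_t1, np.zeros((h, w)), np.zeros((h, w)))
```

The correct flow leaves a residual only in the last column, where the sample falls outside the image, so `loss_true_flow` is much smaller than `loss_zero_flow`.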

Spatial smoothness priors.

When our network is self-supervised, we add robust spatial smoothness penalties on the optical flow field, the depth, and the inferred motion maps, by penalizing the L1 norm of the gradients across adjacent pixels, as is usually done in previous works. For depth prediction, we penalize the norm of second-order gradients in order to encourage smoothly changing rather than constant depth values.
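The difference between the first-order and second-order penalties can be seen in a small NumPy sketch. It is a simplified illustration: finite differences are taken along rows and columns only, mixed second derivatives are omitted, and the function names are hypothetical.

```python
import numpy as np

def first_order_smoothness(field):
    """Mean L1 norm of horizontal and vertical finite differences."""
    dx = np.abs(np.diff(field, axis=1))
    dy = np.abs(np.diff(field, axis=0))
    return dx.mean() + dy.mean()

def second_order_smoothness(depth):
    """Mean L1 norm of second-order finite differences."""
    dxx = np.abs(np.diff(depth, n=2, axis=1))
    dyy = np.abs(np.diff(depth, n=2, axis=0))
    return dxx.mean() + dyy.mean()

# A slanted plane: depth grows linearly with the column index.
x, y = np.meshgrid(np.arange(6.0), np.arange(4.0))
ramp = 1.0 + 0.5 * x
first = first_order_smoothness(ramp)
second = second_order_smoothness(ramp)
```

The slanted plane is penalized by the first-order term but not by the second-order term, which is why the latter suits depth maps of scenes with receding surfaces such as roads.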

Forward-backward consistency constraints.

We incorporate forward-backward consistency constraints between inferred scene depth in different frames as follows. Given inferred depth $d_{t}$ from frame pair $I_{t},I_{t+1}$ and $d_{t+1}$ from frame pair $I_{t+1},I_{t}$, we ask for those to be consistent under the inferred scene motion, that is:

$\mathcal{L}_{t}^{FB}=\frac{1}{w\,h}\sum_{x,y}\left|\left(d_{t}(x,y)+W_{t}(x,y)\right)-d_{t+1}\left(x+U_{t}(x,y),\,y+V_{t}(x,y)\right)\right|$

where $W_{t}(x,y)$ is the $Z$ component of the scene flow obtained from the point cloud transformation. Composing scene flow forward and backward across consecutive frames allows us to impose such forward-backward consistency cycles across gaps of more than one frame; however, we have not yet seen empirical gain from doing so.
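A forward-backward depth consistency check of this kind can be sketched as follows. The sketch assumes the penalty is the mean absolute difference between the depth at time $t$ advanced by $W_{t}$ and the depth at time $t+1$ looked up at the flow target. For brevity it uses a nearest-neighbour lookup, whereas a trainable version would need a differentiable (for example bilinear) sampler. The function name is hypothetical.

```python
import numpy as np

def forward_backward_loss(depth_t, depth_t1, U, V, W):
    """Mean |(d_t + W_t) - d_{t+1}(x + U, y + V)| over all pixels."""
    h, w = depth_t.shape
    x, y = np.meshgrid(np.arange(w), np.arange(h))
    # Nearest-neighbour lookup at the flow target, clamped to the image.
    xt = np.clip(np.rint(x + U).astype(int), 0, w - 1)
    yt = np.clip(np.rint(y + V).astype(int), 0, h - 1)
    return np.abs(depth_t + W - depth_t1[yt, xt]).mean()

# The camera approaches a wall by 0.5 units: depth drops from 3.0 to 2.5.
shape = (4, 6)
depth_t = np.full(shape, 3.0)
depth_t1 = np.full(shape, 2.5)
zeros = np.zeros(shape)
consistent = forward_backward_loss(depth_t, depth_t1, zeros, zeros, np.full(shape, -0.5))
inconsistent = forward_backward_loss(depth_t, depth_t1, zeros, zeros, zeros)
```

When the predicted $Z$ motion explains the depth change the loss vanishes, and when it is ignored the residual equals the unexplained change of 0.5.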

Supervising depth.

If depth is available on parts of the input image, such as with video sequences captured by a Kinect sensor, we can use depth supervision in the form of robust depth regression:
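A depth regression restricted to pixels with sensor readings could be sketched as below. The details are assumptions made for illustration: an L1 penalty stands in for the robust regression, missing readings are marked by a validity mask (here derived from zero-valued ground truth), and the loss is normalized by the number of valid pixels. The function name `masked_depth_loss` is hypothetical.

```python
import numpy as np

def masked_depth_loss(pred, gt, valid):
    """Mean absolute depth error over pixels that have a ground-truth reading."""
    valid = valid.astype(bool)
    if not valid.any():
        return 0.0  # nothing to supervise in this frame
    return float(np.abs(pred - gt)[valid].mean())

pred = np.array([[1.0, 2.0], [3.0, 4.0]])
gt = np.array([[1.5, 0.0], [3.0, 0.0]])  # zeros mark missing readings (assumed)
loss = masked_depth_loss(pred, gt, gt > 0)
```

Only the two pixels in the first column contribute, so predictions in regions without sensor data are left to the self-supervised terms.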

Supervising camera motion.

Supervising optical flow and object motion.

Ground-truth optical flow, object masks, or object motions require expensive human annotation on real videos. However, these signals are available in recent synthetic datasets. In such cases, our model could be trained to minimize, for example, an L1 regression loss between predicted $\{U(x,y),V(x,y)\}$ and ground-truth $\{U^{GT}(x,y),V^{GT}(x,y)\}$ flow vectors.

Implementation details

Our depth-predicting structure and object-mask-predicting motion conv/deconv networks share similar architectures but use independent weights. Each consists of a series of $3\times 3$ convolutional layers alternating between stride 1 and stride 2, followed by deconvolutional operations consisting of a depth-to-space upsampling, concatenation with corresponding feature maps from the convolutional portion, and a $3\times 3$ convolutional layer. Batch normalization is applied to all convolutional layer outputs. The structure network takes a single frame as input, while the motion network takes a pair of frames. We predict depth values using a $1\times 1$ convolutional layer on top of the image-sized feature map. We use RELU activations because depths are positive and a bias of $1$ to prevent small depth values. The maximum predicted depth value is further clipped at $100$ to prevent large gradients. We predict object masks from the image-sized feature map of the motion network using a $1\times 1$ convolutional layer with sigmoid activations. To encourage sharp masks, we multiply the logits of the masks by a parameter that is a function of the number of steps for which the network has been trained. The pivot variables are predicted as heat maps using a softmax function over all the locations in the image followed by a weighted average of the pixel locations.
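The pivot prediction described above, a softmax over all image locations followed by a weighted average of pixel locations, is often called a soft-argmax. A minimal NumPy sketch is given below. It covers only the 2D image location, and the function name `soft_argmax` is hypothetical.

```python
import numpy as np

def soft_argmax(heatmap):
    """Softmax over all locations, then the expected (column, row) position."""
    h, w = heatmap.shape
    z = heatmap - heatmap.max()  # subtract the maximum for numerical stability
    p = np.exp(z)
    p /= p.sum()
    x, y = np.meshgrid(np.arange(w), np.arange(h))
    return float((p * x).sum()), float((p * y).sum())

peaked = np.zeros((5, 7))
peaked[1, 4] = 50.0  # strong response at row 1, column 4
px, py = soft_argmax(peaked)
ux, uy = soft_argmax(np.zeros((5, 7)))  # uniform map gives the image center
```

Unlike a hard argmax, this operation is differentiable with respect to the heat map, so the pivot location can be trained with the rest of the network.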

Experimental results

The main contribution of SfM-Net is the ability to explicitly model both camera and object motion in a sequence, allowing us to train on unrestricted videos containing moving objects. To demonstrate this, we trained self-supervised networks (using zero ground-truth supervision) on the KITTI datasets and on the MoSeg dataset. KITTI contains pairs of frames captured from a moving vehicle in which other independently moving vehicles are visible. MoSeg contains sequences with challenging object motion, including articulated motions from moving people and animals.

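Our first experiment validates that explicitly modeling object motion is necessary to effectively learn from unconstrained videos. We evaluate unsupervised depth prediction using our models on the KITTI 2012 and KITTI 2015 datasets, which contain close to 200 frame-sequence and stereo pairs. We use a scale-invariant error metric (log RMSE) proposed in due to the global scale ambiguity in monocular setups, which is defined as

$E_{\text{scale-inv}}=\frac{1}{N}\sum_{i}\bar{d}_{i}^{2}-\frac{1}{N^{2}}\Big(\sum_{i}\bar{d}_{i}\Big)^{2},$

where $\bar{d}_{i}=\log(d_{i})-\log(d_{i}^{GT})$ is the log-depth difference at pixel $i$ and $N$ is the number of pixels with ground-truth depth.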
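A scale-invariant error in log-depth space can be computed as sketched below. The sketch assumes the metric is the mean squared log-depth difference minus the squared mean difference, evaluated on pixels with valid ground truth. The function name is hypothetical.

```python
import numpy as np

def scale_invariant_log_error(pred, gt):
    """mean(d^2) - mean(d)^2 with d = log(pred) - log(gt)."""
    d = np.log(pred) - np.log(gt)
    return float(np.mean(d ** 2) - np.mean(d) ** 2)

gt = np.array([1.0, 2.0, 5.0, 10.0])
err_scaled = scale_invariant_log_error(3.0 * gt, gt)  # globally scaled prediction
err_mixed = scale_invariant_log_error(gt * np.array([1.0, 2.0, 1.0, 2.0]), gt)
```

A prediction that differs from the ground truth by a single global factor yields zero error, which is the property needed when depth is recoverable only up to scale. Errors that vary from pixel to pixel are still penalized.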

We compare with the results of Garg et al., who use stereo pairs to estimate depth. Their approach assumes the camera pose between the frames is a known constant (stereo baseline) and optimizes the photometric error in order to estimate the depth. In contrast, our model considers a more challenging “in the wild” setting where we are only given sequences of frames from a video, and camera pose, depth and object motion are all estimated without any form of supervision. Garg et al. report a log RMSE of 0.273 on a subset of the KITTI dataset. To compare with our approach on the full set, we emulate the model of Garg et al. using our architecture by removing object masks from our network and using stereo pairs with photometric error. We also evaluate our full model on frame sequence pairs with camera motion estimation, both with and without explicit object motion estimation.

Table 1 shows the log RMSE error between the ground-truth depth and the three approaches. When using stereo pairs we obtain a value of $0.31$ which is on par with existing results on the KITTI benchmark (see ). When using frame sequence pairs instead of calibrated stereo pairs the problem becomes more difficult, as we must now infer the unknown camera and object motion between the two frames. As expected, the depth estimates learned in this scenario are less accurate, but performance is much worse when no motion masks are used. The gap between the two approaches is wider on the KITTI 2015 dataset which contains more moving objects. This shows that it is important to account for moving objects when training on videos in the wild.

Figure 3 shows qualitative examples comparing the depth obtained when using stereo pairs with a fixed baseline and when using frame sequences without camera pose information. When there is large translation between the frames, depth estimation without camera pose information is as good as using stereo pairs. The failure cases in the last two rows show that the network did not learn to accurately predict depth for scenes where it saw little or no translation between the frames during training. This is not the case when using stereo pairs as there is always a constant offset between the frames. Using more data could help here because it increases the likelihood of generic scenes appearing in a sequence containing interesting camera motion.

Figure 4 provides qualitative examples of the predicted motion masks and flow fields along with the ground-truth in the KITTI 2015 dataset. Often, the predicted motion masks are fairly close to the ground truth and help explain part of the motion in the scene. We notice that object masks tended to miss very small, distant moving objects. This may be due to the fact that these objects and their motions are too small to be separated from the background. The bottom two rows show cases where the predicted masks do not correspond to moving objects. In the first example, although the mask is not semantically meaningful, note that the estimated flow field is reasonable, with some mistakes in the region occluded by the moving car. In the second failure case, the moving car on the left is completely missed but the motion of the static background is well captured. This is a particularly difficult example for the self-supervised photometric loss because the moving object appears in heavy shadow.

Analysis of our failure cases suggests possible directions for improvement. Moving objects introduce significant occlusions, which should be handled carefully. Because our network has no direct supervision on object masks or object motion, it does not necessarily learn that object and camera motions should be different. These priors could be built into our loss or learned directly if some ground-truth masks or object motions are provided as explicit supervision.

MoSeg.

The moving objects in KITTI are primarily vehicles, which undergo rigid-body transformations, making it a good match for our model. To verify that our network can still learn in the presence of non-rigid motion, we retrained it from scratch under self-supervision on the MoSeg dataset, using frames from all sequences. Because each motion mask corresponds to a rigid 3D rotation and translation, we do not expect a single motion mask to capture a deformable object. Instead, different rigidly moving object parts will be assigned to different masks. This is not a problem from the perspective of accurate camera motion estimation, where the important issue is distinguishing pixels whose motion is caused by the camera pose transformation directly from those whose motion is affected by independent object motions in the scene.

Qualitative results on sampled frames from the dataset are shown in Fig. 5. Because MoSeg only contains ground-truth annotations for segmentation, we cannot quantitatively evaluate the estimated depth, camera trajectories, or optical flow fields. However, we did evaluate the quality of the object motion masks by computing Intersection over Union (IoU) for each ground-truth segmentation mask against the best matching motion mask and its complement (a total of six proposed segments in each frame, two from each of the three motion masks), averaging across frames and ground-truth objects. We obtain an IoU of 0.29 which is similar to previous unsupervised approaches for the small number of segmentation proposals we use per frame. See, for example, the last column of Figure 5 from , whose proposed methods for moving object proposals achieve IoU around 0.3 with four proposals. They require more than 800 proposals to reach an IoU above 0.57.
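The mask evaluation described above can be sketched as follows. The sketch assumes the predicted motion masks have already been binarized (for example by thresholding the sigmoid outputs, which is an assumption), and the helper names `iou` and `best_iou` are hypothetical.

```python
import numpy as np

def iou(a, b):
    """Intersection over Union of two boolean masks."""
    union = np.logical_or(a, b).sum()
    return np.logical_and(a, b).sum() / union if union else 0.0

def best_iou(gt_mask, motion_masks):
    """Best IoU of a ground-truth segment against each mask and its complement."""
    proposals = []
    for m in motion_masks:
        proposals += [m, ~m]  # two proposals per motion mask
    return max(iou(gt_mask, p) for p in proposals)

gt = np.zeros((4, 4), dtype=bool)
gt[:2, :] = True
m1 = np.zeros((4, 4), dtype=bool)
m1[:2, :2] = True
m2 = np.zeros((4, 4), dtype=bool)
m2[:3, :] = True
m3 = np.zeros((4, 4), dtype=bool)
score = best_iou(gt, [m1, m2, m3])  # six proposals from three masks
```

The reported figure would then be the average of this score over frames and ground-truth objects.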

Kinect depth supervision.

While the fully unsupervised results show promise, our network can benefit from extra supervision of depth or camera motion when available. The improved depth prediction given ground-truth camera poses on KITTI stereo demonstrates some gain. We also experimented with adding depth supervision to improve camera motion estimation using the RGB-D SLAM dataset. Given ground-truth camera pose trajectories, we estimated relative camera pose (camera motion) from each frame to the next and compared it with the predicted camera motion from our model by measuring the translation and rotation error of their relative transformation, as done in the corresponding evaluation script for relative camera pose error and detailed in Eq. 2. We report camera rotation and translation error in Table 2 for each of the Freiburg1 sequences compared to the error in the benchmark’s baseline trajectories. Our model was trained from scratch for each sequence and used the focal length value provided with the dataset. We observe that our results better estimate the frame-to-frame translation and are comparable for rotation.
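A relative pose error between a ground-truth and a predicted frame-to-frame motion can be sketched as below. It assumes both motions are given as $4\times 4$ homogeneous matrices, and that the error is the translation norm and rotation angle of their relative transformation. The exact conventions of the benchmark's evaluation script may differ, and the function name is hypothetical.

```python
import numpy as np

def relative_pose_error(T_gt, T_pred):
    """Translation norm and rotation angle (radians) of inv(T_gt) @ T_pred."""
    E = np.linalg.inv(T_gt) @ T_pred
    trans_err = float(np.linalg.norm(E[:3, 3]))
    # Rotation angle from the trace, clipped to guard against round-off.
    cos_angle = np.clip((np.trace(E[:3, :3]) - 1.0) / 2.0, -1.0, 1.0)
    return trans_err, float(np.arccos(cos_angle))

# Ground truth: pure translation. Prediction: same translation plus a
# spurious rotation of 0.05 rad about the optical axis.
T_gt = np.eye(4)
T_gt[:3, 3] = [0.1, 0.0, 0.0]
a = 0.05
T_pred = np.eye(4)
T_pred[:3, :3] = [[np.cos(a), -np.sin(a), 0.0],
                  [np.sin(a), np.cos(a), 0.0],
                  [0.0, 0.0, 1.0]]
T_pred[:3, 3] = [0.1, 0.0, 0.0]
t_err, r_err = relative_pose_error(T_gt, T_pred)
```

Here the translation error is zero and the rotation error equals the spurious angle, so the two components of the error can be reported separately.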

Conclusion

Current geometric SLAM methods obtain excellent ego-motion and rigid 3D reconstruction results, but often come at a price of extensive engineering, low tolerance to moving objects (which are treated as noise during reconstruction) and sensitivity to camera calibration. Furthermore, matching and reconstruction are difficult in low-textured regions. Incorporating learning into depth reconstruction, camera motion prediction and object segmentation, while still preserving the constraints of image formation, is a promising way to robustify SLAM and visual odometry even further. However, the exact training scenario required to solve this more difficult inference problem remains an open question. Exploiting longer temporal history and forward-backward constraints between frames far apart in time, together with visibility reasoning, is an important future direction. Further, exploiting a small amount of annotated videos for object segmentation, depth, and camera motion, and combining those with an abundance of self-supervised videos, could help initialize the network weights in the right regime and facilitate learning. Many other curriculum learning regimes, including those that incorporate synthetic datasets, can also be considered.

We thank our colleagues Tinghui Zhou, Matthew Brown, Noah Snavely, and David Lowe for their advice and Bryan Seybold for his work generating synthetic datasets for our initial experiments.