Atlas: End-to-End 3D Scene Reconstruction from Posed Images

Zak Murez, Tarrence van As, James Bartolozzi, Ayan Sinha, Vijay Badrinarayanan, Andrew Rabinovich

Introduction

Reconstructing the world around us is a long-standing goal of computer vision. Recently, many applications have emerged, such as autonomous driving and augmented reality, which rely heavily upon accurate 3D reconstructions of the surrounding environment. These reconstructions are often estimated by fusing depth measurements from special sensors, such as structured light, time of flight, or LIDAR, into 3D models. While these sensors can be extremely effective, they require special hardware, making them more cumbersome and expensive than systems that rely solely on RGB cameras. Furthermore, they often suffer from noise and missing measurements due to low albedo and glossy surfaces as well as occlusion.

Another approach to 3D reconstruction is to use monocular, binocular, or multiview stereo methods, which take RGB images (one, two, or multiple, respectively) and predict depth maps for the images. Despite the plethora of recent research, these methods are still much less accurate than depth sensors and do not produce satisfactory results when fused into a 3D model.

In this work, we observe that depth maps are often just intermediate representations that are then fused with other depth maps into a full 3D model. As such, we propose a method that takes a sequence of RGB images and directly predicts a full 3D model in an end-to-end trainable manner. This allows the network to fuse more information and learn better geometric priors about the world, producing much better reconstructions. Furthermore, it reduces the complexity of the system by eliminating steps like frame selection, as well as reducing the required compute by amortizing the cost over the entire sequence.

Our method is inspired by two main lines of work: cost volume based multiview stereo and Truncated Signed Distance Function (TSDF) refinement. Cost volume based multiview stereo methods construct a cost volume using a plane sweep. Here, a reference image is warped onto the target image for each of a fixed set of depth planes, and the results are stacked into a 3D cost volume. For the correct depth plane, the reference and target images will match, while for other depth planes they will not. As such, the depth is computed by taking the argmin over the planes. This is made more robust by warping image features extracted by a CNN instead of the raw pixel measurements, and by filtering the cost volume with another CNN prior to taking the argmin.
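
The plane sweep idea can be illustrated with a deliberately small sketch. It assumes a rectified 1D stereo pair, so that warping onto a fronto-parallel depth plane reduces to a horizontal shift by a candidate disparity; the function name and the toy intensities are hypothetical, and raw pixel differences stand in for the CNN features and learned filtering used by real methods.

```python
# Toy 1D plane sweep: one candidate disparity per depth plane.
# Assumption: rectified stereo, so warping onto a plane is a pixel shift.

def plane_sweep_disparity(target, reference, candidates):
    """Return, per target pixel, the candidate with the lowest matching cost."""
    n = len(target)
    best = []
    for x in range(n):
        costs = []
        for d in candidates:
            xr = x - d  # reference pixel that plane d maps target pixel x to
            if 0 <= xr < n:
                costs.append(abs(target[x] - reference[xr]))
            else:
                costs.append(float("inf"))  # warp falls outside the image
        # argmin over the plane axis of the cost volume
        best.append(candidates[costs.index(min(costs))])
    return best


reference = [0, 5, 9, 3, 7, 1, 8, 2]
target = [4, 6, 0, 5, 9, 3, 7, 1]  # reference shifted right by 2 pixels
disparity = plane_sweep_disparity(target, reference, candidates=[0, 1, 2, 3])
print(disparity[2:])
```

For every pixel whose match lies inside the image, the cost is zero only at the true shift of 2, so the argmin recovers it.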

TSDF refinement starts by fusing depth maps from a depth sensor into an initial voxel volume using TSDF fusion, in which each voxel stores the truncated signed distance to the nearest surface. Note that a triangulated mesh can then be extracted from this implicit representation by finding the zero crossing surface using marching cubes. TSDF refinement methods take this noisy, incomplete TSDF as input and refine it by passing it through a 3D convolutional encoder-decoder network.
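
To make the TSDF representation concrete, the following is a minimal 1D sketch along a single camera ray. It assumes unit fusion weights and a distance normalized by the truncation distance; the function names and measurement values are hypothetical, and linear interpolation of the sign change stands in for marching cubes, which performs the analogous step on a 3D grid.

```python
def tsdf_along_ray(voxel_depths, surface_depth, truncation):
    """Truncated signed distance in [-1, 1]; positive in front of the surface."""
    return [max(-1.0, min(1.0, (surface_depth - z) / truncation))
            for z in voxel_depths]


def fuse(tsdfs):
    """TSDF fusion with unit weights: per-voxel average over measurements."""
    return [sum(values) / len(values) for values in zip(*tsdfs)]


def zero_crossing(voxel_depths, tsdf):
    """Locate the surface by linearly interpolating the first sign change."""
    for i in range(len(tsdf) - 1):
        d0, d1 = tsdf[i], tsdf[i + 1]
        if d0 > 0 >= d1:
            z0, z1 = voxel_depths[i], voxel_depths[i + 1]
            return z0 + (z1 - z0) * d0 / (d0 - d1)
    return None  # no surface observed along this ray


depths = [0.1 * i for i in range(21)]   # voxel centers every 10 cm, 0 m to 2 m
noisy_surfaces = [0.96, 1.02, 1.05]     # three noisy depth measurements (m)
fused = fuse([tsdf_along_ray(depths, s, truncation=0.3) for s in noisy_surfaces])
surface = zero_crossing(depths, fused)
print(surface)
```

Averaging the three truncated distance fields places the zero crossing near the mean of the noisy measurements, which is why fusion needs many observations to suppress noise.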

Similar to cost volume multiview stereo approaches, we start by using a 2D CNN to extract features from a sequence of RGB images. These features are then back projected into a 3D volume using the known camera intrinsics and extrinsics. However, unlike cost volume approaches, which back project the features into a target view frustum using image warping, we back project into a canonical voxel volume, where each pixel gets mapped to a ray in the volume (similar to prior work). This avoids the need to choose a target image and allows us to fuse an entire sequence of frames into a single volume. We fuse all the frames into the volume using a simple running average. Next, as in both cost volume and TSDF refinement, we pass our voxel volume through a 3D convolutional encoder-decoder to refine the features. Finally, as in TSDF refinement, our feature volume is used to regress the TSDF values at each voxel (see Figure 1).

We train and evaluate our network on real scans of indoor rooms from the ScanNet dataset. Our method significantly outperforms state-of-the-art multiview stereo baselines, producing accurate and complete meshes.

As a bonus, for minimal extra compute, we can add an additional head to our 3D CNN and perform 3D semantic segmentation. While the problems of 3D semantic and instance segmentation have received a lot of attention recently, all previous methods assume the depth was acquired using a depth sensor. Although our 3D segmentations are not competitive with the top performers on the ScanNet benchmark leaderboard, we establish a strong baseline for the new task of 3D semantic segmentation from multiview RGB.

Related Work

Reconstructing a 3D model of a scene usually involves acquiring depth for a sequence of images and fusing the depth maps using a 3D data structure. The most common 3D structure for depth accumulation is the voxel volume used by TSDF fusion. However, surfels (oriented point clouds) are starting to gain popularity. These methods are usually used with a depth sensor, but can also be applied to depth maps predicted from monocular or stereo images.

With the rise of deep learning, monocular depth estimation has seen huge improvements; however, its accuracy is still far below that of state-of-the-art stereo methods. A popular classical approach to stereo uses mutual information and semi-global matching to compute the disparity between two images. Similar approaches have been incorporated into SLAM systems such as COLMAP and CNN-SLAM. More recently, several end-to-end plane sweep algorithms have been proposed. DeepMVS uses a patch matching network. MVDepthNet constructs the cost volume from raw pixel measurements and performs 2D convolutions, treating the planes as feature channels. GPMVS builds upon this and aggregates information into the cost volume over long sequences using a Gaussian process. MVSNet and DPSNet construct the cost volume from features extracted from the images using a 2D CNN. They then filter the cost volume using 3D convolutions on the 4D tensor. R-MVSNet reduces the memory requirements of MVSNet by replacing the 3D CNN with a recurrent CNN, while P-MVSNet starts with a low resolution MVSNet and then iteratively refines the estimate using their point flow module. All of these methods require choosing a target image to predict depth for and then finding suitable neighboring reference images. Recent binocular stereo methods use a similar cost volume approach, but avoid frame selection by using a fixed baseline stereo pair. Depth maps over a sequence are computed independently (or are at most weakly coupled). In contrast to these approaches, our method constructs a single coherent 3D model from a sequence of input images directly.

While TSDF fusion is simple and effective, it cannot reconstruct partially occluded geometry and requires averaging many measurements to reduce noise. As such, learned methods have been proposed to improve the fusion. OctNet-Fusion uses a 3D encoder-decoder to aggregate multiple depth maps into a TSDF and shows results on single objects and portions of scans. ScanComplete builds upon this and shows results for entire rooms. SG-NN improves upon ScanComplete by increasing the resolution using sparse convolutions and training using a novel self-supervised training scheme. 3D-SIC focuses on 3D instance segmentation using region proposals and adds a per instance completion head. Routed fusion uses 2D filtering and 3D convolutions in view frustums to improve aggregation of depth maps.

More similar in spirit to ours are networks that take one or more images and directly predict a 3D representation. 3D-R2N2 encodes images to a latent space and then decodes a voxel occupancy volume. Octtree-Gen increases the resolution by using an octree data structure to improve the efficiency of 3D voxel volumes. Deep SDF chooses to learn a generative model that can output an SDF value for any input position instead of discretizing the volume. These methods encode the input to a small latent code and report results on single objects, mostly from ShapeNet. This small latent code is unlikely to contain enough information to be able to reconstruct an entire scene (follow-up work, concurrent with ours, addresses this problem, but does not apply it to RGB-only reconstruction). Pix2Vox encodes each image to a latent code, decodes a voxel representation for each, and then fuses them. This is similar to ours, but we explicitly model the 3D geometry of camera rays, allowing us to learn better representations and scale to full scenes. SurfNet learns a 3D offset from a template UV map of a surface. Point set generating networks learn to generate point clouds with a fixed number of points. Pixel2Mesh++ uses a graph convolutional network to directly predict a triangulated mesh. Mesh-RCNN builds upon 2D object detection and adds an additional head to predict a voxel occupancy grid for each instance and then refines them using a graph convolutional network on a mesh.

Back projecting image features into a voxel volume and then refining them using a 3D CNN has also been used for human pose estimation. These works regress 3D heat maps that are used to localize joint locations.

Deep Voxels and the follow up work of scene representation networks accumulate features into a 3D volume forming an unsupervised representation of the world which can then be used to render novel views without the need to form explicit geometric intermediate representations.

3D Semantic Segmentation

In addition to reconstructing geometry, many applications require semantic labeling of the reconstruction to provide a richer representation. Broadly speaking, there are two approaches to solving this problem: 1) predict semantics on 2D input images using a 2D segmentation network and back project the labels to 3D; 2) directly predict the semantic labels in the 3D space. All of these methods assume depth is provided by a depth sensor. A notable exception is Kimera, which uses multiview stereo to predict depth; however, they only show results on synthetic data and ground truth 2D segmentations.

SGPN formulates instance segmentation as a 3D point cloud clustering problem: it predicts a similarity matrix and clusters the 3D point cloud to derive semantic and instance labels. 3D-SIS improves upon these approaches by fusing 2D features in a 3D representation. RGB images are encoded using a 2D CNN and back projected onto the 3D geometry reconstructed from depth maps. A 3D CNN is then used to predict 3D object bounding boxes and semantic labels. SSCN predicts semantics on a high resolution voxel volume enabled by sparse convolutions.

In contrast to these approaches, we propose a strong baseline for the relatively untouched problem of 3D semantic segmentation without a depth sensor.

Method

Our method takes as input an arbitrary length sequence of RGB images, each with known intrinsics and pose. These images are passed through a 2D CNN backbone to extract features. The features are then back projected into a 3D voxel volume and accumulated using a running average. Once the image features have been fused into 3D, we regress a TSDF directly using a 3D CNN (see Fig. 2). We also experiment with adding an additional head to predict semantic segmentation.

Each voxel is filled with the feature vector of the pixel that it projects to under the known camera model. In this mapping, $P_{t}$ and $K_{t}$ are the extrinsics and intrinsics matrices for image $t$ respectively, $\Pi$ is the perspective mapping and $:$ is the slice operator. Here $(i,j,k)$ are the voxel coordinates in world space and $(\hat{i},\hat{j})$ are the pixel coordinates in image space. Note that this means that all voxels along a camera ray are filled with the same features corresponding to that pixel.
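
The back projection described here can be sketched in a few lines. In symbols, one consistent way to write it is $V_{t}(:,i,j,k)=F_{t}(:,\hat{i},\hat{j})$ with $[\hat{i},\hat{j}]^{T}=\Pi K_{t}P_{t}[i,j,k,1]^{T}$, where $V_{t}$ and $F_{t}$ are assumed names for the per-frame feature volume and the 2D feature map. The sketch below further assumes a pinhole camera, a world-to-camera extrinsics matrix, voxel centers given in world units, and nearest-pixel sampling; the function names and toy camera are hypothetical.

```python
def project(K, P, point):
    """Apply extrinsics P (3x4, world to camera), intrinsics K (3x3),
    then the perspective divide. Returns (col, row) or None."""
    hom = (point[0], point[1], point[2], 1.0)
    cam = [sum(P[r][c] * hom[c] for c in range(4)) for r in range(3)]
    u, v, w = (sum(K[r][c] * cam[c] for c in range(3)) for r in range(3))
    if w <= 0:
        return None  # point is behind the camera
    return u / w, v / w


def back_project(features, K, P, voxel_centers):
    """Fill each voxel with the feature vector of the pixel it projects to.

    features[row][col] is a feature vector. Returns (volume, mask), where
    mask[n] is 1 if voxel n lands inside the image and 0 otherwise.
    """
    height, width = len(features), len(features[0])
    channels = len(features[0][0])
    volume, mask = [], []
    for center in voxel_centers:
        pixel = project(K, P, center)
        if pixel is not None:
            col, row = round(pixel[0]), round(pixel[1])  # nearest pixel (assumption)
            if 0 <= row < height and 0 <= col < width:
                volume.append(list(features[row][col]))
                mask.append(1)
                continue
        volume.append([0.0] * channels)  # outside the view frustum
        mask.append(0)
    return volume, mask


# Toy camera at the world origin: focal length 2, principal point (1, 1), 3x3 image.
K = [[2.0, 0.0, 1.0], [0.0, 2.0, 1.0], [0.0, 0.0, 1.0]]
P = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
features = [[[float(10 * row + col)] for col in range(3)] for row in range(3)]
voxels = [(0.5, 0.0, 1.0), (1.0, 0.0, 2.0), (1.5, 0.0, 3.0),  # same camera ray
          (0.0, 0.0, -1.0),                                    # behind the camera
          (5.0, 0.0, 1.0)]                                     # outside the image
volume, mask = back_project(features, K, P, voxels)
print(volume, mask)
```

The first three voxels lie on the ray through pixel (row 1, column 2), so all three receive the same feature, while the last two fall outside the frustum and are masked out.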

These feature volumes are accumulated over the entire sequence using a weighted running average similar to TSDF fusion.

For the weights we use a binary mask $W_{t}(i,j,k)\in\{0,1\}$ which stores if voxel $(i,j,k)$ is inside or outside the view frustum of the camera.
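
A minimal sketch of such a weighted running average follows. It assumes the standard TSDF-fusion style update $\bar{V}_{t}=(\bar{V}_{t-1}\bar{W}_{t-1}+V_{t}W_{t})/(\bar{W}_{t-1}+W_{t})$ and $\bar{W}_{t}=\bar{W}_{t-1}+W_{t}$, where the barred symbols are assumed names for the accumulated volume and weights; the function name and toy volumes are hypothetical.

```python
def accumulate(avg_volume, avg_weight, volume, mask):
    """One step of the weighted running average over per-frame feature volumes."""
    new_volume, new_weight = [], []
    for feat_avg, w_avg, feat, w in zip(avg_volume, avg_weight, volume, mask):
        total = w_avg + w
        if total == 0:
            new_volume.append(list(feat_avg))  # voxel not observed so far
        else:
            new_volume.append([(a * w_avg + f * w) / total
                               for a, f in zip(feat_avg, feat)])
        new_weight.append(total)
    return new_volume, new_weight


# Three voxels with one feature channel, observed by two frames.
frames = [
    ([[2.0], [4.0], [0.0]], [1, 1, 0]),  # frame 1 sees voxels 0 and 1
    ([[6.0], [0.0], [0.0]], [1, 0, 0]),  # frame 2 sees voxel 0 only
]
avg, weights = [[0.0], [0.0], [0.0]], [0, 0, 0]
for frame_volume, frame_mask in frames:
    avg, weights = accumulate(avg, weights, frame_volume, frame_mask)
print(avg, weights)
```

With a binary frustum mask, the accumulated weight is simply the number of frames that observed each voxel, so the result is a per-voxel mean over the observing frames.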

3D Encoder-Decoder

Once the features are accumulated into the voxel volume, we use a 3D convolutional encoder-decoder network to refine the features and regress the output TSDF (Fig. 3). Each layer of the encoder and decoder uses a set of 3x3x3 residual blocks. Downsampling is implemented with 3x3x3 stride 2 convolution, while upsampling uses trilinear interpolation followed by a 1x1x1 convolution to change the feature dimension. The feature dimension is doubled with each downsampling and halved with each upsampling. All convolution layers are followed by batchnorm and relu. We also include additive skip connections from the encoder to the decoder.

At the topmost layer of the encoder-decoder, we use a 1x1x1 convolution followed by a tanh activation to regress the final TSDF values. For our semantic segmentation models we also include an additional 1x1x1 convolution to predict the segmentation logits.

We also include intermediate output heads at each decoded resolution prior to upsampling. These additional predictions are used both for intermediate supervision to help the network train faster, as well as to guide the later resolutions to focus on refining predictions near surfaces. At each resolution, any voxel that is predicted beyond a fraction (0.99) of the truncation distance is clamped to one at the following resolutions. Furthermore, loss is only backpropagated for non-clamped voxels. Without this, the loss at the higher resolutions is dominated by the large number of empty space voxels and the network has a harder time learning fine details.
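
The clamping and loss masking can be illustrated with a 1D toy. This is a sketch under stated assumptions rather than the training code: the TSDF is normalized to $[-1,1]$, the coarse prediction is upsampled by a factor of 2 with nearest-neighbor lookup, clamped voxels take the sign of the coarse prediction, and an $L_{1}$ loss is averaged over the remaining voxels. The function name and values are hypothetical.

```python
def refine_with_clamping(coarse_tsdf, fine_pred, fine_target, threshold=0.99):
    """Clamp fine voxels whose coarse parent is far from any surface and
    average the L1 loss over the remaining (non-clamped) voxels."""
    output, losses = [], []
    for i, pred in enumerate(fine_pred):
        parent = coarse_tsdf[i // 2]  # nearest-neighbor upsampling by 2
        if abs(parent) > threshold:
            # Far from a surface: fix the value, contribute no loss.
            output.append(1.0 if parent > 0 else -1.0)
        else:
            output.append(pred)
            losses.append(abs(pred - fine_target[i]))
    loss = sum(losses) / len(losses) if losses else 0.0
    return output, loss


coarse = [1.0, 0.4, -0.995]                    # only the middle voxel is near a surface
fine_pred = [0.9, 0.8, 0.5, 0.1, -0.7, -0.9]
fine_target = [1.0, 1.0, 0.6, 0.2, -1.0, -1.0]
output, loss = refine_with_clamping(coarse, fine_pred, fine_target)
print(output, loss)
```

Only the two fine voxels under the near-surface coarse voxel contribute to the loss, which keeps empty-space voxels from dominating it.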

Note that since our features are back projected along entire rays, the voxel volume is filled densely and thus we cannot take advantage of sparse convolutions in the encoder. However, the multiscale outputs can be used to sparsify the feature volumes in the decoder, allowing for the use of sparse convolutions as in prior work. In practice, we found that we were able to train our models at 4 cm voxel resolution without the need for sparse convolutions.

Implementation Details

We use a ResNet50-FPN followed by a previously proposed merging method, with 32 output feature channels, as our 2D backbone. Our 3D CNN consists of a four scale resolution pyramid where we double the number of channels each time we halve the resolution. The encoder consists of (1,2,3,4) residual blocks at each scale respectively, and the decoder consists of (3,2,1) residual blocks.
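
The resulting pyramid can be tabulated with a short sketch. The block counts and the channel doubling come from the text; the base width of 32 channels (taken to equal the 2D backbone output), the 4 cm base voxel size, the mirrored assignment of decoder blocks to scales, and the function name are assumptions made for illustration.

```python
def build_pyramid(base_channels, base_voxel_cm,
                  encoder_blocks=(1, 2, 3, 4), decoder_blocks=(3, 2, 1)):
    """List scale, channels, voxel size and residual blocks per stage."""
    encoder = [
        {"scale": s, "channels": base_channels * 2 ** s,
         "voxel_cm": base_voxel_cm * 2 ** s, "blocks": blocks}
        for s, blocks in enumerate(encoder_blocks)
    ]
    # The decoder walks back up from the second-coarsest scale to the finest.
    decoder_scales = range(len(encoder_blocks) - 2, -1, -1)
    decoder = [
        {"scale": s, "channels": base_channels * 2 ** s,
         "voxel_cm": base_voxel_cm * 2 ** s, "blocks": blocks}
        for s, blocks in zip(decoder_scales, decoder_blocks)
    ]
    return encoder, decoder


encoder, decoder = build_pyramid(base_channels=32, base_voxel_cm=4)
for stage in encoder + decoder:
    print(stage)
```

Under these assumptions, the coarsest encoder stage works on 32 cm voxels with 256 channels, and each decoder stage mirrors the encoder stage at the same scale.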

Results

We evaluate our method on ScanNet, which consists of 2.5M images across 707 distinct spaces. Standard train/validation/test splits are adopted. The 3D reconstructions are benchmarked using standard 2D depth metrics (Table 2) and 3D metrics (Table 3), which are defined in Table 1. We also show qualitative comparisons in Figure 6 where our method really stands out.

We compare our method to 4 state-of-the-art baselines: COLMAP, MVDepthNet, GPMVS, and DPSNet. For COLMAP we use the default dense reconstruction parameters but use the ground truth poses provided by ScanNet. For each of the learned methods we fine-tuned the models provided by the authors on ScanNet. At inference time, 6 reference frames were selected temporally with stride 10 centered around the target view. We also mask the boundary pixels since the networks have visible edge effects that cause poor depth predictions here (leading to 92.8% completeness).

To evaluate these in 3D we fuse the predicted depth maps using two techniques: TSDF Fusion and point cloud fusion. For COLMAP we use their default point cloud fusion, while for the other methods we use an existing point cloud fusion implementation. We found point cloud fusion was more robust to the outliers present in the depth predictions than our implementation of TSDF Fusion. As such, we only report the point cloud fusion results in Table 3, which are strictly better than the TSDF Fusion results (note that the $L_{1}$ metric is computed using the TSDF Fusion approach, as it is not computed in the point cloud fusion approach).

As seen in Figure 4, our method is able to fill holes where geometry is missing from the ground truth. These holes arise from two causes: A) limitations of depth sensors on low albedo and specular surfaces, and B) unobserved regions caused by occlusion and incomplete scans. While other multiview stereo methods often learn to predict depth for these troublesome surfaces, they are not able to complete unobserved geometry. On the other hand, since our method directly regresses the full TSDF for a scene, it is able to reason about and complete unobserved regions. However, this means that we must take extra care when evaluating the point cloud metrics, otherwise we will be falsely penalized in these regions. We remove geometry that was not observed in the ground truth by taking the rendered depth maps from our predicted mesh and re-fusing them using TSDF Fusion into a trimmed mesh. This guarantees that there is no mesh in areas that were not observed in the ground truth.

Our method achieves state-of-the-art results on about half of the metrics and is competitive on all metrics. However, as seen in Figure 6, qualitatively our results are significantly better than those of previous methods. While the $L_{1}$ metric on the TSDF seems to reflect this performance gap better, the inability of the other metrics to capture this indicates a need for additional, more perceptual metrics.

As mentioned previously, we augment the existing 3D-CNN with a semantic segmentation head, requiring only a single $1\times 1\times 1$ convolution, to be able to not only reconstruct the 3D structure of the scene but also provide semantic labels to the surfaces. Since no prior work attempts to do 3D semantic segmentation from only RGB images, and there are no established benchmarks, we propose a new evaluation procedure. The semantic labels from the predicted mesh are transferred onto the ground truth mesh using nearest neighbor lookup on the vertices, and then the standard IOU metric can be used. The results are reported in Table 4 and Fig. 7 (note that this is an unfair comparison since all prior methods include depth as input).
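
The evaluation procedure can be sketched as follows. This is a brute-force illustration on a handful of vertices, not the evaluation code; the function names, vertex coordinates and class names are hypothetical, and a spatial index would replace the linear search on real meshes.

```python
def transfer_labels(pred_vertices, pred_labels, gt_vertices):
    """Give each ground truth vertex the label of its nearest predicted vertex."""
    transferred = []
    for g in gt_vertices:
        nearest = min(
            range(len(pred_vertices)),
            key=lambda n: sum((a - b) ** 2 for a, b in zip(pred_vertices[n], g)),
        )
        transferred.append(pred_labels[nearest])
    return transferred


def iou_per_class(predicted, ground_truth, classes):
    """Intersection over union for each class, computed over vertices."""
    scores = {}
    for c in classes:
        inter = sum(p == c and g == c for p, g in zip(predicted, ground_truth))
        union = sum(p == c or g == c for p, g in zip(predicted, ground_truth))
        if union:
            scores[c] = inter / union
    return scores


gt_vertices = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0), (3.0, 0.0, 0.0)]
gt_labels = ["floor", "floor", "wall", "wall"]
pred_vertices = [(0.1, 0.0, 0.0), (1.2, 0.0, 0.0), (2.9, 0.0, 0.0)]
pred_labels = ["floor", "wall", "wall"]

transferred = transfer_labels(pred_vertices, pred_labels, gt_vertices)
scores = iou_per_class(transferred, gt_labels, classes=["floor", "wall"])
print(transferred, scores)
```

Because labels are looked up through the predicted geometry, a vertex is counted as correct only if both the surface position and the semantic label are right.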

From the results in Table 4 we see that our approach is surprisingly competitive with (and even beats some) prior methods that include depth as input. Having depth as an input makes the problem significantly easier because the only source of error is from the semantic predictions. In our case, in order to correctly label a vertex we must predict both the geometry and the semantic label correctly. From Fig. 7 we can see that mistakes in geometry compound with mistakes in semantics, which leads to lower IOUs.

In Figure 5 we show an example of how our method degrades as the number of frames is reduced at inference time. We see that there is almost no degradation with as few as 25 frames. See accompanying video for more examples.

Since our method only requires running a small 2D CNN on each frame, the cost of running the large 3D CNN is amortized over a sequence of images. On the other hand, MVS methods must run all their compute on every frame. Note that they must also run depth map fusion to accumulate the depth maps into a mesh, but we do not include this additional time here. We report inference times using 2 neighbors. All models are run on a single NVIDIA Titan RTX GPU. From Table 5 we can see that after approximately 4 frames, ours becomes faster than DPSNet (note that most ScanNet scenes are a few thousand frames).
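
The amortization argument can be made explicit with a simple cost model. The symbols below are illustrative assumptions rather than quantities reported in the paper: $c_{2D}$ is the per-frame cost of the 2D CNN and back projection, $c_{3D}$ is the one-time cost of the 3D CNN, $c_{MVS}$ is the per-frame cost of a multiview stereo baseline, and $N$ is the number of frames.

$$T_{\mathrm{ours}}(N) = N\,c_{2D} + c_{3D}, \qquad T_{\mathrm{MVS}}(N) = N\,c_{MVS}$$

$$T_{\mathrm{ours}}(N) < T_{\mathrm{MVS}}(N) \iff N > \frac{c_{3D}}{c_{MVS} - c_{2D}} \quad (\text{assuming } c_{MVS} > c_{2D})$$

Under this model, the break-even point of roughly 4 frames against DPSNet corresponds to the ratio on the right-hand side, and the advantage grows linearly with sequence length beyond it.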

Conclusions

In this work, we present a novel approach to 3D scene reconstruction. Notably, our approach does not require depth inputs; is unbounded temporally, allowing the integration of long frame sequences; completes unobserved geometry; and supports the efficient prediction of other quantities such as semantics. We have experimentally verified that the classical approach to 3D reconstruction via per-view depth estimation is inferior to direct regression to a 3D model from an input RGB sequence. We have also demonstrated that without significant additional compute, a semantic segmentation objective can be added to the model to accurately label the resultant surfaces. In our future work, we aim to improve the back projection and accumulation process. One approach is to allow the network to learn where along a ray to place the features (instead of uniformly). This will improve the model's ability to handle occlusions and large multi-room scenes. We also plan to add additional tasks such as instance segmentation and intrinsic image decomposition. Our method is particularly well suited for intrinsic image decomposition because the network has the ability to reason with information from multiple views in 3D.