Multiview Compressive Coding for 3D Reconstruction

Chao-Yuan Wu, Justin Johnson, Jitendra Malik, Christoph Feichtenhofer, Georgia Gkioxari

Introduction

Images depict objects and scenes in diverse settings. Popular 2D visual tasks, such as object classification and segmentation, aim to recognize them on the image plane. But image planes do not capture scenes in their entirety. Consider Fig. LABEL:fig:teaser:a. The toy’s left arm is not visible in the image. This is framed by the task of 3D reconstruction: given an image, fully reconstruct the scene in 3D.

3D reconstruction is a longstanding problem in AI with applications in robotics and AR/VR. Structure from Motion lifts images to 3D by triangulation. More recently, NeRF optimizes radiance fields to synthesize novel views. These approaches require many views of the same scene during inference and do not generalize to novel scenes from a single image. Others predict 3D from a single image but rely on expensive CAD supervision. Reminiscent of generalized cylinders, some introduce object-specific priors via category-specific 3D templates, pose or symmetries. While impressive, these methods cannot scale, as they rely on onerous 3D annotations and category-specific priors that do not hold in general. Alas, large-scale learning, which has shown promising generalization results for images and language, is largely underexplored for 3D reconstruction.

Image-based recognition is entering a new era thanks to domain-agnostic architectures, like transformers, and large-scale category-agnostic learning. Motivated by these advances, we present a scalable, general-purpose model for 3D reconstruction from a single image. We introduce a simple, yet effective, framework that operates directly on 3D points. 3D points are general as they can capture any objects or scenes and are more versatile and efficient than meshes and voxels. Their generality and efficiency enable large-scale category-agnostic training. In turn, large-scale training makes our 3D model effective.

Central to our approach are an input encoding and a queriable 3D-aware decoder. The input to our model is a single RGB-D image, which yields the visible (seen) 3D points via unprojection. Image and points are encoded with transformers. A new 3D point, sampled from 3D space, queries a transformer decoder conditioned on the input to predict its occupancy and its color. The decoder reconstructs the full, seen and unseen, 3D geometry, as shown in Fig. LABEL:fig:teaser:a. Our occupancy-based formulation, introduced in OccNets, frames 3D reconstruction as a binary classification problem and removes constraints pertinent to specialized representations (e.g., deformations of a 3D template) or a fixed resolution. Being tasked with predicting the unseen 3D geometry of diverse objects or scenes, our decoder learns a strong 3D representation. This finding directly connects to recent advances in image-based self-supervised learning and masked autoencoders (MAE), which learn powerful image representations by predicting masked (unseen) image patches.
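The unprojection step that turns an RGB-D image into seen 3D points can be sketched as follows. This is a minimal illustration that assumes a pinhole camera; the function name `unproject_depth` and the intrinsics `fx`, `fy`, `cx`, `cy` are illustrative choices, not names from the paper.

```python
import numpy as np

def unproject_depth(depth, fx, fy, cx, cy):
    """Lift an (H, W) depth map to an (H, W, 3) map of camera-space points.

    Assumes a pinhole camera model; pixels with depth <= 0 are treated
    as invalid and become NaN.
    """
    h, w = depth.shape
    # Pixel coordinates: u runs along the width, v along the height.
    u, v = np.meshgrid(np.arange(w, dtype=np.float64),
                       np.arange(h, dtype=np.float64))
    z = np.where(depth > 0, depth, np.nan)
    x = (u - cx) / fx * z
    y = (v - cy) / fy * z
    return np.stack([x, y, z], axis=-1)

# Toy example: a flat surface 2 units in front of the camera.
depth = np.full((4, 6), 2.0)
points = unproject_depth(depth, fx=100.0, fy=100.0, cx=3.0, cy=2.0)
```

Each pixel keeps its image location in the output array, so the seen points stay aligned with the RGB patches they came from.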

Our model inputs single RGB-D images, which are ubiquitous thanks to advances in hardware. Nowadays, depth sensors are found in iPhone’s front and back cameras. We show results from iPhone captures in §4 and Fig. LABEL:fig:teaser:b. Our decoder predicts point cloud occupancies. Supervision is sourced from multiple RGB-D views, e.g., video frames, with relative camera poses, e.g., from COLMAP. The posed views produce 3D point clouds which serve as proxy ground truth. These point clouds are far from “perfect” as they are susceptible to sensor and camera pose noise. However, we show that when used at scale they are sufficient for our model. This suggests that 3D annotations, which are expensive to acquire, can be replaced with many RGB-D video captures, which are much easier to collect.

We call our approach Multiview Compressive Coding (MCC), as it learns from many views, compresses appearance and geometry and learns a 3D-aware decoder. We demonstrate the generality of MCC by experimenting on six diverse data sources: CO3D, Hypersim, Taskonomy, ImageNet, in-the-wild iPhone captures and DALL$\cdot$E 2 generations. These datasets range from large-scale captures of more than 50 common object types, to holistic scenes, such as warehouses, auditoriums, lofts and restaurants, and imaginary objects. We compare to state-of-the-art methods tailored for single-object and scene reconstruction and show our model’s superiority in both settings with a unified architecture. Enabled by MCC’s general-purpose design, we show the impact of large-scale learning in terms of reconstruction quality and zero-shot generalization on novel object and scene types.

Related Work

Multiview 3D reconstruction is a longstanding problem in computer vision. Traditional techniques include binocular stereopsis, SfM, and SLAM. Reconstruction by analysis or synthesis via volume rendering of implicit and explicit representations has been shown to produce strong results. Supervised approaches predict voxels or meshes by training deep nets. These techniques produce high-quality outputs, but rely on multiple views at test time. In this work, we assume a single RGB-D image during inference.

Single-view 3D reconstruction is challenging. One line of work trains models that predict 3D geometry via CAD, mesh, voxel or point cloud supervision. Results are commonly demonstrated on simplistic synthetic benchmarks, such as ShapeNet, or for a small set of object categories, as in Pix3D. Weakly supervised approaches use category-specific priors via 3D shape templates and pose or learn via 2D silhouettes and re-projection on posed views. While impressive, these approaches are limited to specific objects from a closed-world vocabulary. Some explore category-agnostic models, but focus on synthetic datasets. In this work, we learn a general-purpose 3D representation from RGB-D views from a diverse and large set of data sources of real-world objects and scenes.

Shape completion methods complete the 3D geometry of partial reconstructions. For objects, methods directly output full point clouds or deploy generative models, but are typically tied to a fixed resolution. For scenes, techniques include plane fitting and 3D model fitting and retrieval; others leverage symmetries or predict 3D semantics. Our model tackles both objects and scenes with a unified architecture and outputs any-resolution 3D geometry with a 3D-aware decoder. We compare to recent shape completion techniques.

Implicit 3D representations such as SDFs and occupancy nets (OccNets) have proven effective. NeRF optimizes per-scene neural fields for view synthesis. NeRF extensions target scene generalization by encoding input views with deep nets or improve reconstruction quality by supervising with depth. MCC adopts an occupancy-based representation, similar to OccNets, with an attention mechanism on encoded appearance and geometric cues which allows it to predict efficiently in any 3D region, even outside the camera frustum. We show that this strategy outperforms the global-feature strategy from OccNets and the single-location features used in NeRF-based methods.

Self-supervised learning has advanced image and language understanding. For images, masked autoencoders paired with transformers and large-scale category-agnostic training learn general representations for 2D recognition. We draw from these findings and extend the architecture and learning for the task of 3D reconstruction.

Multiview Compressive Coding (MCC)

During training, we supervise MCC with “true” points derived from posed RGB-D views. These point clouds serve as ground truth: $q_{i}$ is labeled as positive if it is close to the ground truth and negative otherwise. Intuitively, the other views guide the model to reason about what parts of the unseen space belong to the object or scene. As a result, the input encoding $R$ learns a representation of the full 3D geometry and guides the decoder to make the right prediction.

During inference, the model predicts occupancy and color for a grid of points at any desired resolution. The set of occupied colored points forms the final reconstruction.

MCC requires only points for supervision, extracted from posed RGB-D views, e.g., video frames. Note that the derived point clouds, which serve as ground truth, are far from perfect due to noise in the captures and pose estimation. However, when used at scale they are sufficient. This deviates from OccNets and other distance-based works which rely on clean CAD models or 3D meshes. This is an important finding as it suggests that expensive CAD supervision can be replaced with cheap RGB-D video captures. This property of MCC allows us to train on a wide range of diverse data. In §4, we show that large-scale training is crucial for high-quality reconstruction.

The proposed two-tower design is general and performant. Alternative designs are ablated in §4.

3.2 MCC Decoder

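The decoder takes as input the output of the encoder, $R$, and $N^{q}$ 3D point queries $q_{i}$, $i=0,\ldots,N^{q}-1$, to predict occupancy $\sigma_{i}$ and color $c_{i}$ for each point,

$$Dec(R, q_{0}, \ldots, q_{N^{q}-1}) \rightarrow (\sigma_{0}, c_{0}), \ldots, (\sigma_{N^{q}-1}, c_{N^{q}-1}). \quad (2)$$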

The decoder $Dec$ linearly projects each query $q_{i}$ to $C$ dimensions (the same as $R$), concatenates them with $R$ in the token dimension, and then uses a transformer to model the interactions between $R$ and queries. We draw inspiration from MAE for this design. The output feature of each query token is passed through a binary classification head that predicts its occupancy $\sigma_{i}$, and a 256-way classification head that predicts its RGB color $c_{i}$.

As described in Eq. 2, we feed multiple queries to the decoder for efficiency via parallelization, which significantly speeds up training and inference. However, since all tokens attend to all tokens in a standard transformer, this creates undesirable dependencies among queries. To break the unwanted dependencies, we mask out the attention weights such that tokens cannot attend to the other queries (except for self). This masking pattern is illustrated in Fig. 3.
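The masking pattern described above can be sketched as a boolean attention mask. This is an illustration under stated assumptions rather than the released implementation: tokens are assumed to be ordered as the encoding tokens of $R$ followed by the query tokens, and the helper names `query_attention_mask` and `masked_softmax` are hypothetical.

```python
import numpy as np

def query_attention_mask(num_enc, num_queries):
    """Boolean (N, N) mask; entry [i, j] is True if token i may attend to token j.

    Assumed token order: num_enc encoding tokens, then num_queries query tokens.
    """
    n = num_enc + num_queries
    allowed = np.zeros((n, n), dtype=bool)
    allowed[:, :num_enc] = True      # every token may attend to the encoding R
    np.fill_diagonal(allowed, True)  # each token may attend to itself
    return allowed

def masked_softmax(scores, allowed):
    """Row-wise softmax over attention scores with disallowed entries zeroed."""
    scores = np.where(allowed, scores, -np.inf)
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)

mask = query_attention_mask(num_enc=3, num_queries=2)
weights = masked_softmax(np.zeros((5, 5)), mask)
```

With this mask, the output for a query depends only on $R$ and the query itself, so batching many queries in one forward pass gives the same predictions as evaluating them one at a time.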

MCC’s attention architecture differentiates it from prior 3D reconstruction approaches. In OccNets, points condition on a globally pooled image feature; in NeRF-based methods, they condition on the projected locations of the image feature map. In §4 we show that MCC’s design performs better.

The computation of the decoder grows with the number of queries, while the encoder embeds the input image once regardless of the final output resolution. By using a relatively lightweight decoder, our inference is made efficient even at high resolutions, and the encoder cost is amortized. This allows us to dynamically change output resolutions and does not require re-computing the input encoding $R$ .

3.3 Query Sampling

Training. For each training example, MCC samples $N^{q}=550$ queries uniformly from the 3D world space. We ablate sampling strategies in §4. A query is considered “occupied” (positive) if it is located within radius $\tau=0.1$ of a ground truth point, and “unoccupied” (negative) otherwise. The ground truth is defined as the union of all unprojected points from all RGB-D views of the scene.
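The labeling rule can be sketched with a brute-force nearest-neighbor check. The function name `label_queries` and the toy ground-truth points are illustrative assumptions; the sampling range of $[-3,3]$ per axis matches the CO3D coordinate system stated later in the paper. For large ground-truth clouds a spatial index such as a KD-tree would likely be preferable to the dense distance matrix used here.

```python
import numpy as np

def label_queries(queries, gt_points, tau=0.1):
    """Return a boolean label per query: True if a ground-truth point lies within tau."""
    # Pairwise squared distances, shape (num_queries, num_gt).
    diff = queries[:, None, :] - gt_points[None, :, :]
    sq_dist = np.sum(diff * diff, axis=-1)
    return sq_dist.min(axis=1) <= tau ** 2

rng = np.random.default_rng(0)
gt_points = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])  # toy ground truth
queries = rng.uniform(-3.0, 3.0, size=(550, 3))           # uniform training queries
labels = label_queries(queries, gt_points)
```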

Inference. We uniformly sample a grid of points covering the 3D space. Queries with an occupancy score greater than a threshold of $0.1$, together with their color predictions, form the final reconstruction. Techniques such as octrees could be easily integrated to further speed up test-time sampling.
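A minimal sketch of this inference procedure is given below. The default range and step follow the CO3D settings reported later ($[-3,3]$ per axis, granularity 0.1); including both endpoints of the range is an assumption, and the helper names `make_query_grid` and `select_occupied` are hypothetical. The occupancy scores in the example are a stand-in for decoder outputs.

```python
import numpy as np

def make_query_grid(lo=-3.0, hi=3.0, step=0.1):
    """Return an (M, 3) array of grid points covering [lo, hi]^3."""
    # Number of samples per axis so that both endpoints are included (assumption).
    n = int(round((hi - lo) / step)) + 1
    axis = np.linspace(lo, hi, n)
    gx, gy, gz = np.meshgrid(axis, axis, axis, indexing="ij")
    return np.stack([gx, gy, gz], axis=-1).reshape(-1, 3)

def select_occupied(queries, occupancy, colors, threshold=0.1):
    """Keep the queries (and colors) whose occupancy score exceeds the threshold."""
    keep = occupancy > threshold
    return queries[keep], colors[keep]

grid = make_query_grid(-1.0, 1.0, 0.5)  # coarse grid, 5 samples per axis
# Stand-in for decoder scores: a small ball around the origin is "occupied".
scores = (np.linalg.norm(grid, axis=1) <= 0.5).astype(float)
colors = np.zeros((len(grid), 3))
points, point_colors = select_occupied(grid, scores, colors)
```

Because the grid is built independently of the input encoding, the resolution can be changed by adjusting `step` without re-running the encoder.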

3.4 Implementation Details

Object Reconstruction Experiments

MCC works naturally for both objects and scenes. In §4, we show results and compare to competing methods for single object reconstruction. In §5, we show results on scenes.

Dataset. We use CO3D-v2 as our main dataset for single object reconstruction. It consists of $\sim$37k short videos of 51 object categories, the largest dataset of 3D objects in the wild. To show generalization to new objects, we hold out 10 randomly selected categories for evaluation and train on the remaining 41. The list of held-out categories is available in the Appendix. Since CO3D is object-centric, we focus on foreground objects specified by segmentation masks provided in CO3D. Full 3D annotations, such as 3D meshes, are not available. CO3D extracts point clouds from the videos via COLMAP; these are inevitably noisy, and we use them to train our model. Despite imperfect supervision, we show that MCC learns to reconstruct 3D shapes and texture and even corrects the noisy depth inputs.

Metrics. Following Kulkarni et al., we report: accuracy (acc), the percentage of predicted points within $\rho$ of a ground truth point; completeness (cmp), the percentage of ground truth points within $\rho$ of a predicted point; and their F-score (F1), which drives our comparisons. We set $\rho=0.1$.
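These metrics can be sketched as follows. The code assumes F1 is the harmonic mean of accuracy and completeness, which is the usual definition but is not spelled out in the text; the function names are illustrative, and the dense distance matrix is only suitable for small point sets.

```python
import numpy as np

def nearest_distances(src, dst):
    """For each point in src, the Euclidean distance to its nearest point in dst."""
    diff = src[:, None, :] - dst[None, :, :]
    return np.sqrt(np.sum(diff * diff, axis=-1)).min(axis=1)

def reconstruction_metrics(pred, gt, rho=0.1):
    """Return (accuracy, completeness, F1), each as a percentage."""
    acc = 100.0 * np.mean(nearest_distances(pred, gt) <= rho)
    cmp_ = 100.0 * np.mean(nearest_distances(gt, pred) <= rho)
    # Harmonic mean of accuracy and completeness (assumed definition of F1).
    f1 = 0.0 if acc + cmp_ == 0 else 2.0 * acc * cmp_ / (acc + cmp_)
    return acc, cmp_, f1

gt = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
pred = np.array([[0.0, 0.0, 0.05], [1.0, 0.0, 0.05],
                 [5.0, 5.0, 5.0], [6.0, 6.0, 6.0]])
acc, cmp_, f1 = reconstruction_metrics(pred, gt)
```

In this toy example half of the predicted points are spurious, so accuracy drops while completeness stays at its maximum, and F1 lands between the two.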

Training Details. We train with Adam for 150k iterations with an effective batch size of 512 using 32 GPUs, a base learning rate of $10^{-4}$ with a cosine schedule and a linear warm-up for the first 5% of iterations. Training takes $\sim$2.5 days. We randomly scale augment images by $s\in\left[0.8,1.2\right]$. We also perform 3D augmentations by randomly rotating 3D points along each axis by $\theta\in\left[-180^{\circ},180^{\circ}\right]$. Rotation is applied to the seen points $P$, the queries and the ground truth. Image $I$ and points $P$ are aligned through the concatenation of their encodings (Eq. 1). Points $P$ and queries are consistent as well, as both are rotated. Essentially, our 3D augmentations build in rotation equivariance.
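The 3D rotation augmentation can be sketched as below. Sampling an independent angle per axis and composing the rotations in the order $R_z R_y R_x$ are assumptions, as the text does not fix these details; the function names are illustrative.

```python
import numpy as np

def random_rotation(rng):
    """Sample a rotation matrix from independent per-axis angles in [-180, 180] degrees."""
    ax, ay, az = rng.uniform(-np.pi, np.pi, size=3)
    cx, sx = np.cos(ax), np.sin(ax)
    cy, sy = np.cos(ay), np.sin(ay)
    cz, sz = np.cos(az), np.sin(az)
    rx = np.array([[1.0, 0.0, 0.0], [0.0, cx, -sx], [0.0, sx, cx]])
    ry = np.array([[cy, 0.0, sy], [0.0, 1.0, 0.0], [-sy, 0.0, cy]])
    rz = np.array([[cz, -sz, 0.0], [sz, cz, 0.0], [0.0, 0.0, 1.0]])
    return rz @ ry @ rx  # composition order is an assumption

def rotate_example(seen, queries, gt, rng):
    """Apply one shared random rotation to seen points, queries and ground truth."""
    rot = random_rotation(rng)
    return seen @ rot.T, queries @ rot.T, gt @ rot.T

rng = np.random.default_rng(0)
seen = rng.normal(size=(5, 3))
queries = rng.normal(size=(4, 3))
gt = rng.normal(size=(6, 3))
seen_r, queries_r, gt_r = rotate_example(seen, queries, gt, rng)
```

Since the same rotation is applied to all three point sets, distances between queries and ground truth are preserved and the occupancy labels remain valid after augmentation.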

Coordinate System. We adopt the original CO3D coordinate system, where objects are normalized to have zero mean and unit variance. Training and testing points are sampled from $\left[-3,3\right]$ along each axis. Evaluation points are sampled with a granularity of 0.1.
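One plausible reading of this normalization is sketched below: subtract the centroid and divide by a single scalar standard deviation computed over all coordinates. Whether CO3D uses a scalar or a per-axis scale is not stated here, so this choice, like the function name `normalize_points`, is an assumption.

```python
import numpy as np

def normalize_points(points):
    """Center an (N, 3) point cloud and scale it to unit standard deviation.

    Returns the normalized points together with the center and scale,
    so that the transform can be inverted.
    """
    center = points.mean(axis=0)
    centered = points - center
    scale = centered.std()  # single scalar over all coordinates (assumption)
    return centered / scale, center, scale

rng = np.random.default_rng(0)
cloud = rng.normal(loc=[2.0, -1.0, 5.0], scale=4.0, size=(1000, 3))
normalized, center, scale = normalize_points(cloud)
restored = normalized * scale + center
```

Under this normalization, the sampling range of $[-3,3]$ corresponds to three standard deviations around the object center.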

Fig. 4 shows qualitative results on the CO3D test set of novel categories. We show reconstructions for a variety of shapes and object types. MCC tackles heavy self-occlusions, e.g., the mug handle is barely visible in the input image, and complex shapes, e.g., the toy airplane. In addition to shape, MCC predicts texture, which is difficult especially for unseen regions. For instance, the left and back sides of the kid’s backpack are completely invisible, but MCC propagates the color from the right side. We also note that MCC is robust to noisy depth from COLMAP, present at varying degrees and depicted in the seen points of each example (top row). MCC corrects and completes the geometry in spite of the noise in depth inputs. We emphasize that we neither make geometric assumptions nor use any priors such as symmetry or mean templates when reconstructing objects. MCC learns only from data.

4.2 Ablation Study

Encoder Structure. In Table LABEL:tab:abl:enc, we ablate our encoder design which models $I$ and $P$ with two separate transformers (decoupled) and compare to a shared transformer which models the fused (sum) patch embeddings of $I$ and $P$ (shared). Our decoupled design performs slightly better.

Training Query Sampling. In Table LABEL:tab:abl:sampling, we compare our uniform sampling strategy with a contrastive-style sampling, where each example samples a fixed number of positives and negatives. Both work similarly. We choose uniform sampling because of its simplicity.

Decoder Design. As described in §3, MCC’s decoder concatenates queries to the input encoding $R$ in the token dimension, and a transformer models their interactions (concat+attn). We compare this design with two popular ones. Recent works on image-conditioned NeRF condition points on their projected location in the feature map followed by an MLP (loc+MLP); this comparison was also presented in the context of feature conditioning strategies. Another approach is cross-attention (cross-attn), where the encoded input $R$ only serves as keys/values but not as queries to a transformer, e.g., in Perceiver models. Table LABEL:tab:abl:dec shows that our decoder is critical for performance.

Comparison to Prior Work with an Explicit Design. Finally, we compare MCC and its queriable 3D decoder with PoinTr, a state-of-the-art 3D point completion method. PoinTr inputs an incomplete point cloud and predicts a fixed-resolution output using a transformer which models explicit geometric point relations (via nearest neighbors). We train PoinTr on CO3D, with the set of seen points $P$ as input. For a fair comparison, we implement PoinTr with the same 12-layer architecture as ours, which is stronger than their 6-layer one. Since PoinTr does not use RGB, we compare with an MCC variant that ignores texture by encoding $P$ but not $I$. We additionally report chamfer distance (CD) and use the same number of points for a fair comparison. Table LABEL:tab:abl:explicit shows that MCC outperforms PoinTr by a large margin. Fig. 6 presents a qualitative comparison. In §4.5, we also compare to NeRF-based methods.
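For reference, one common form of the chamfer distance is sketched below. Conventions differ across papers (squared or unsquared distances, sum or mean over directions), and the variant used in this comparison is not specified here, so the choice of mean unsquared distances summed over both directions is an assumption.

```python
import numpy as np

def chamfer_distance(a, b):
    """Symmetric chamfer distance between point sets a (N, 3) and b (M, 3).

    Uses the mean nearest-neighbor Euclidean distance in each direction
    and sums the two directions (one of several common conventions).
    """
    diff = a[:, None, :] - b[None, :, :]
    dist = np.sqrt(np.sum(diff * diff, axis=-1))  # (N, M) pairwise distances
    return dist.min(axis=1).mean() + dist.min(axis=0).mean()

a = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
b = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 1.0]])
cd = chamfer_distance(a, b)
```

Because each direction is averaged over its own point count, using the same number of predicted points for both methods keeps the comparison balanced.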

4.3 Scaling Behavior Analysis

MCC’s strength is that it only requires points for training and does not rely on any shape priors. As a result, MCC can train on a large number of examples. We analyze our model’s performance as a function of data size. Fig. 7 shows that scaling the training data leads to steady performance improvements. Furthermore, if we increase the number of categories, and thus the visual diversity of our training data, the improvements are even larger. This suggests two things. First, building category-agnostic scalable models like MCC is a promising direction towards general-purpose 3D reconstruction. Second, expanding the datasets, and especially the set of categories, is promising.

4.4 Zero-Shot Generalization In-the-Wild

In §4.1, we show generalization to novel categories from the CO3D dataset. Now, we turn to in-the-wild settings and show MCC reconstructions on ImageNet, iPhone captures, and AI-generated images.

iPhone Captures. This is arguably the most popular in-the-wild setting: our personal use of an off-the-shelf smartphone for capturing everyday objects. Specifically, we use iPhones and their depth sensor to take RGB-D images of a diverse set of objects in two of the coauthors’ homes (using a 12 and 14 Pro iPhone). This is a challenging setting due to the domain shift from the training data and the difference in the depth estimation pipeline (COLMAP in CO3D vs. sensor from iPhone). Fig. LABEL:fig:iphone shows our results. Examples such as the vacuum or the VR headset in Fig. LABEL:fig:teaser:b stand out as they deviate from our training set. Fig. LABEL:fig:iphone demonstrates MCC’s ability to learn general shape priors, instead of memorizing the training set.

ImageNet. We turn to ImageNet, which contains highly diverse Internet photos, ranging from bears and elephants in their natural habitat to Japanese mailboxes, drastically different from the staged CO3D objects. For depth, we use an off-the-shelf model from Ranftl et al., which differs from CO3D’s COLMAP output. Fig. LABEL:fig:imagenet shows results on ImageNet images of diverse objects.

AI-generated Images. We test MCC on DALL $\cdot$ E 2 which generates images of imaginary objects. Fig. LABEL:fig:dalle shows MCC reconstructions including the Internet-famous avocado chair and a cat-shaped marshmallow with a mustache!

4.5 Comparison to Image-Conditioned NeRF

Scene Reconstruction Experiments

MCC naturally handles single objects and scenes without modifications to its design. So, now we turn to scenes.

Task. We test 3D scene reconstruction from a single RGB-D image. Formally, we aim to reconstruct everything in front of the camera ( $z>0$ in camera coordinate system) up to a certain range. Note that this includes areas outside the camera frustum, which increases the complexity of the task.

Dataset. We experiment on the Hypersim dataset, which contains complex, diverse scenes, such as warehouses, lofts, restaurants and churches, with over 77k images. We split the dataset into 365 scenes for training and 46 scenes for testing. We use images along with the associated depth as ground truth for training. Since 3D meshes are available, we use them for evaluation and report the metrics from §4.

Qualitative Results. Fig. LABEL:fig:hypersim shows qualitative results on Hypersim. While MCC only sees the scene within the view frustum, it is able to complete furniture, walls, floors, and ceilings. For instance, in the left example, MCC predicts the space behind the kitchen, including the floors, which are almost entirely occluded in the input view. In the right example, MCC predicts the wall on the left, which is entirely outside of the view frustum. Scene reconstruction from a single view is hard; while MCC reconstructs the room geometry, it fails to capture fine details in both shape and texture. We expect more data to significantly improve performance, as suggested by our scaling analysis in §4.3.

Quantitative Evaluation. We compare to DRDF, a recent state-of-the-art method for scene reconstruction, which we extend to take RGB-D inputs like MCC. Table 3 shows that MCC outperforms DRDF across all metrics. We also extend DRDF to use MCC’s architecture while keeping its original loss and ray-based inference. This variant performs better than the original DRDF but still worse than MCC.

5.2 Zero-Shot Generalization to Taskonomy

Finally, we deploy MCC, trained on Hypersim, on novel scenes from Taskonomy. Although photorealistic, Hypersim is synthetic, whereas Taskonomy is real. So, we test both generalization to novel scenes and “sim-to-real” transfer. Fig. LABEL:fig:taskonomy shows MCC’s reconstructions, which demonstrate that our model is able to reconstruct the room layout (floors, walls, ceilings) in this challenging setting.

Failure Cases

While MCC has demonstrated promising results, we observe three error modes: (1) Sensitivity to depth input. MCC can recover from noisy depth inputs. But if depth is largely incorrect, it will fail to reconstruct accurate 3D geometry. (2) Distribution shifts. For targets far from the training distribution, we see errors in texture and geometry (e.g., Rubik’s cubes). (3) High-fidelity texture. Detailed texture predictions from a single view are difficult and MCC often omits details (e.g., text on volleyball in Fig. 4).

Conclusions

We present MCC, a general-purpose 3D reconstruction model that works for both objects and scenes. We show generalization to challenging settings, including in-the-wild captures and AI-generated images of imagined objects. Our results show that a simple point-based method coupled with category-agnostic large-scale training is effective. We hope this is a step towards building a general vision system for 3D understanding. Models and code are available online.

From an ethics standpoint, as with all data-driven methods, MCC can potentially inherit the bias (if any) in data. In this project, we solely train on inanimate objects and scenes to minimize the risk. We do not foresee immediate negative repercussions with the model, but caution against future use without paying careful attention to the training dataset.

Appendix A

We provide 360-view animations and interactive 3D visualizations for all qualitative results in Figures 4, 7 and 9, and more, on our project page. Our video animations are shown in the main window and interactive 3D visualizations are available by clicking on the 3D icon, per the instructions on the webpage.

A.2 Architecture Specifications

LayerNorm is used in all self-attention and MLP layers following standard practice.

A.3 Held-Out CO3D Categories

In our experiments, we hold out 10 randomly selected categories, which we use as our test set. The held-out categories are: {apple, ball, baseballglove, book, bowl, carrot, cup, handbag, suitcase, toyplane}. They have 8,453 example videos in total. Please see the original CO3D paper for more information about CO3D.

A.4 Additional Implementation Details for Scene Reconstruction Experiments

Similar to the object reconstruction experiments, we train MCC on Hypersim with Adam for 100k iterations with an effective batch size of 512 using 32 GPUs, a base learning rate of $5\times10^{-5}$ with a cosine schedule and a linear warm-up for the first 10% of iterations. Training takes $\sim$1.6 days. We normalize each scene to have zero mean and unit variance. At inference time, we predict points up to 6.0 units (i.e., 6$\times$ the standard deviation) away from the camera origin. Since we aim at predicting the scene in front of the camera, we use the camera view coordinate system ($X$-axis points top/down, $Y$-axis points left/right, and $Z$-axis points out from the image plane). We randomly scale augment images by $s\in\left[0.8,1.2\right]$, as in the object reconstruction model, but do not perform rotation augmentation. Other implementation details follow the CO3D experiments.
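The learning-rate schedule described above (linear warm-up followed by cosine decay) can be sketched as follows. Starting the warm-up near zero and decaying to exactly zero at the end of training are assumptions, since the text does not state the start and end values; the function name `learning_rate` is illustrative.

```python
import math

def learning_rate(step, total_steps, base_lr, warmup_frac):
    """Linear warm-up to base_lr, then cosine decay to zero (assumed endpoints)."""
    warmup_steps = int(round(total_steps * warmup_frac))
    if step < warmup_steps:
        # Linear ramp that reaches base_lr on the last warm-up step.
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))

# Hypersim setting from the text: 100k iterations, base LR 5e-5, 10% warm-up.
scene_lrs = [learning_rate(s, 100_000, 5e-5, 0.10) for s in (0, 9_999, 55_000, 100_000)]
# CO3D setting from the text: 150k iterations, base LR 1e-4, 5% warm-up.
object_peak = learning_rate(7_500, 150_000, 1e-4, 0.05)
```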