What Do Single-view 3D Reconstruction Networks Learn?

Maxim Tatarchenko, Stephan R. Richter, René Ranftl, Zhuwen Li, Vladlen Koltun, Thomas Brox

Introduction

Object-based single-view 3D reconstruction calls for generating the 3D model of an object given a single image. Consider the motorcycle in the teaser figure. Inferring its 3D structure requires a complex process that combines low-level image cues, knowledge about the structural arrangement of parts, and high-level semantic information. We refer to the extremes of this spectrum as reconstruction and recognition. Reconstruction implies reasoning about the 3D structure of the object shown in the input image using cues such as texture, shading, and perspective effects. Recognition amounts to classifying the input image and retrieving the most suitable 3D model from a database, in our example finding a pre-existing 3D model of a motorcycle based on the input image.

While various architectures and 3D representations have been proposed in the literature, existing methods for single-view 3D understanding all use an encoder-decoder structure, where the encoder maps the input image to a latent representation and the decoder is supposed to perform non-trivial reasoning about the 3D structure of the output space. To solve the task, the overall network is expected to incorporate low-level as well as high-level information.

In this work, we analyze the results of state-of-the-art encoder-decoder methods and find that they rely primarily on recognition to address the single-view 3D reconstruction task, while showing only limited reconstruction abilities. To support this claim, we design two pure recognition baselines: one that combines 3D shape clustering and image classification and one that performs image-based 3D shape retrieval. Based on these, we demonstrate that the performance of modern convolutional networks for single-view 3D reconstruction can be surpassed even without explicitly inferring the 3D structure of objects. In many cases the predictions of the recognition baselines are not only better quantitatively, but also appear visually more appealing, as demonstrated in the teaser figure.

We argue that the dominance of recognition in convolutional networks for single-view 3D reconstruction is a consequence of certain aspects of popular experimental procedures, including dataset composition and evaluation protocols. These allow the network to find a shortcut solution, which happens to be image recognition.

Related work

Historically, single-image 3D reconstruction has been approached via shape-from-shading. More exotic cues for reconstruction are texture and defocus. These techniques only reason about visible parts of a surface using a single depth cue. More general approaches for depth estimation from a single monocular image use multiple cues as well as structural knowledge to infer an estimate of the depth of visible surfaces. Saxena et al. estimated depth from a single image by training an MRF on local and global image features. Oswald et al. solved the same problem with interactive user input. Hoiem et al. used recognition together with simple geometric assumptions to construct 3D models from a single image. Karsch et al. proposed a non-parametric framework that uses part- and object-level recognition to assemble an estimate from a database of images and corresponding depth maps. More recently, significant advances have been made in monocular depth estimation by employing convolutional networks.

This paper focuses on methods that not only reason about the 3D structure of object parts visible in the input image, but also hallucinate the invisible parts using priors learned from data. Tulsiani et al. approached this task with deformable models for specific object categories. Most of the recent methods trained convolutional networks that map 2D images to 3D shapes using direct 3D supervision. A cluster of approaches used voxel-based representations of 3D shapes and generated them with 3D up-convolutions from a latent representation. Several works performed hierarchical partitioning of the output space to achieve computational and memory efficiency, which allows predicting higher-resolution 3D shapes. Johnston et al. reconstructed high-resolution 3D shapes with an inverse discrete cosine transform decoder. Wang et al. generated meshes by deforming a sphere into a desired shape, assuming a fixed distance between camera and objects. Groueix et al. assembled surfaces from small patches. Multiple methods produced multi-view depth maps that are fused together into an output point cloud. Richter et al. extended this with nested shapes fused into a single voxel grid. Fan et al. directly regressed point clouds. Wu et al. learned the mapping from input images to 2.5D sketches in a fully-supervised fashion, and then trained a network to map these intermediate representations to the final 3D shapes. Kong et al. used 2D landmark locations together with silhouettes to retrieve and deform CAD models. Pontes et al. improved upon this work by using a free-form deformation parametrization to model shape variation.

Tulsiani et al. and Niu et al. aimed for structural 3D understanding, approximating 3D shapes with a pre-defined set of primitives.

Recently, there has been a trend towards using weaker forms of supervision for single-view 3D shape prediction with convolutional networks. Multiple approaches trained shape regressors by comparing projections of ground-truth and predicted shapes. Kanazawa et al. predicted deformations from mean shapes trained from multiple learning signals.

There are only very few datasets available for the task of single-image 3D reconstruction – a consequence of the cost of data collection. Most existing methods use subsets of ShapeNet for training and testing. Recently, Wiles and Zisserman introduced two new synthetic datasets: Blobby objects and Sculptures. The Pix3D dataset provides pairs of perfectly aligned natural images and CAD models. This dataset, however, contains a low number of 3D samples, which is problematic for training deep networks.

Reconstruction vs. recognition

Single-view 3D understanding is a complex task that requires interpreting visual data both geometrically and semantically. In fact, these two modes are not disjoint, but span a spectrum from pure geometric reconstruction to pure semantic recognition.

Reconstruction implies per-pixel reasoning about the 3D structure of the object shown in the input image, which can be achieved by using low-level image cues such as color, texture, shading, perspective, shadows, and defocus. This mode does not require semantic understanding of the image content.

Recognition is an extreme case of using semantic priors: it operates on the level of whole objects and amounts to classifying the object in the input image and retrieving a corresponding 3D shape from a database. While it provides a robust prior for reasoning about the invisible parts of objects, this kind of purely semantic solution is only valid if the new object can be explained by an object in the database.

As reconstruction and recognition represent opposing ends of a spectrum, resorting exclusively to either is unlikely to produce the most accurate 3D shapes, since both ignore valuable information present in the input image. It is thus commonly hypothesized that a successful approach to single-view 3D reconstruction needs to combine low-level image cues, structural knowledge, and high-level object understanding .

In the following sections, we argue that current methods tackle the problem predominantly using recognition.

Conventional setup

In this section, we analyze current methods for single-view 3D reconstruction and their relation to reconstruction and recognition. We employ a standard setup for single-view 3D shape estimation. We use the ShapeNet dataset . Unlike several recent approaches, which evaluated only on the 13 largest classes, we deliberately use all 55 classes, as was done in . This allows us to investigate how the number of samples within a class influences shape estimation performance. Within each class, the shapes are randomly split into training, validation, and test sets, containing 70%, 10%, and 20% of the samples respectively. Every shape was rendered using the ShapeNet-Viewer from five uniformly sampled viewpoints $\left(\theta_{azimuth}\in[0^{\circ},360^{\circ}),\theta_{elevation}\in[0^{\circ},50^{\circ})\right)$ . The distance to the camera was set such that each rendered shape roughly fits the frame. We rendered RGB images of size $224\times 224$ , which were downsampled to the input resolution that is required by each method.
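The per-class split and viewpoint sampling described above can be sketched as follows. This is a minimal illustration, not the rendering pipeline itself: the function names are hypothetical, and drawing azimuth and elevation independently and uniformly in angle is an assumption about what "uniformly sampled" means here.

```python
import random

def split_shapes(shape_ids, seed=0):
    """Randomly split the shapes of one class into 70% train, 10% val, 20% test."""
    ids = list(shape_ids)
    random.Random(seed).shuffle(ids)
    n_train = len(ids) * 7 // 10   # integer arithmetic avoids float rounding
    n_val = len(ids) // 10
    return ids[:n_train], ids[n_train:n_train + n_val], ids[n_train + n_val:]

def sample_viewpoints(num_views=5, seed=0):
    """Draw (azimuth, elevation) pairs in degrees from [0, 360) x [0, 50).

    Assumption: both angles are drawn independently and uniformly.
    """
    rng = random.Random(seed)
    return [(rng.random() * 360.0, rng.random() * 50.0) for _ in range(num_views)]
```

Splitting within each class, rather than over the pooled dataset, keeps the 70/10/20 proportions intact even for very small classes.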

All 3D shapes have a consistent canonical orientation and are represented as $128^{3}$ voxel grids. Using high-resolution ground truth (compared to the conventionally used $32^{3}$ voxel grids) is crucial for evaluating a method’s ability to reconstruct fine detail. Evaluating on a higher resolution than $128^{3}$ does not offer additional benefits, since the performance of state-of-the-art methods saturates at this level , while training and evaluation become much more costly. We follow standard procedure and measure shape similarity with the mean Intersection over Union (mIoU) metric, aggregating predictions within semantic classes .
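As a small illustration of the metric, the sketch below computes the IoU of two occupancy grids and averages it within semantic classes. Shapes are represented as sets of occupied voxel indices purely to keep the example dependency-free; the names and the convention that two empty shapes score 1.0 are assumptions.

```python
from collections import defaultdict

def iou(a, b):
    """IoU of two shapes given as sets of occupied voxel indices (i, j, k)."""
    union = len(a | b)
    # Assumed convention: two empty shapes are treated as identical.
    return len(a & b) / union if union else 1.0

def mean_iou_per_class(samples):
    """samples: iterable of (class_name, predicted_voxels, ground_truth_voxels)."""
    scores = defaultdict(list)
    for cls, pred, gt in samples:
        scores[cls].append(iou(pred, gt))
    return {cls: sum(vals) / len(vals) for cls, vals in scores.items()}
```

At a resolution of $128^{3}$ a dense array representation would be the more practical choice; the set-based form only serves to make the definition explicit.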

We base our experiments on modern convolutional networks that predict high-resolution 3D models from a single image. A taxonomy of approaches arises by categorizing them based on their output representation: voxel grids, meshes, point clouds, and depth maps. For our evaluation, we therefore chose state-of-the-art methods that cover the dominant output representations or have been clearly shown to outperform other related representations.

We use Octree Generating Networks (OGN) as the representative method that predicts the output directly on a voxel grid. Compared to earlier works that operate on this output representation, OGN allows predicting higher-resolution shapes by using octrees to represent the occupied space efficiently. We evaluate AtlasNet as the representative approach for surface-based methods. AtlasNet predicts a collection of parametric surfaces and constitutes the state-of-the-art among methods that operate on this output representation. It was shown to outperform the only approach that directly produces point clouds as output , as well as another octree-based approach . Finally, we evaluate the current state-of-the-art in the field, Matryoshka Networks . Matryoshka Networks use a shape representation that is composed of multiple, nested depth maps, which are volumetrically fused into a single output object.

For IoU-based evaluation of the surface predictions from AtlasNet, we project them to depth maps, which we further fuse to a volumetric representation. In our experiments, this approach reliably closed holes in the reconstructed surfaces while retaining fine details. For surface-based evaluation metrics, we use the marching cubes algorithm to extract meshes from volumetric representations.

Recognition baselines

We implemented two straightforward baselines that approach the problem purely in terms of recognition. The first is based on clustering of the training shapes in conjunction with an image classifier; the second performs database retrieval.

Clustering. In this baseline, we cluster the training shapes into $K$ sub-categories using the K-means algorithm . Since using $128^{3}$ voxelizations as feature vectors for clustering is too costly, we run the algorithm on $32^{3}$ voxelizations flattened into a vector. Once the cluster assignments are determined, we switch back to working with high-resolution models.

Within each of the $K$ clusters, we calculate the mean shape as

$$m_{k}=\frac{1}{N_{k}}\sum_{n=1}^{N_{k}}v_{n},$$

where $v_{n}$ is one of the $N_{k}$ shapes belonging to the $k$-th cluster. We threshold the mean shapes at $\tau_{k}$, where the optimal $\tau_{k}$ value is determined by maximizing the average IoU over the models belonging to the $k$-th cluster:

$$\tau_{k}=\operatorname*{arg\,max}_{\tau}\frac{1}{N_{k}}\sum_{n=1}^{N_{k}}\mathrm{IoU}\left(m_{k}>\tau,\,v_{n}\right),$$

where the thresholding operation is applied per voxel. We enumerate $\tau$ in the interval $[0.05,0.5]$ with a step size of $0.05$ to find the optimal threshold. We set $K=500$.
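The mean-shape and threshold-selection steps for a single cluster can be sketched as below. Shapes are flat lists of 0/1 occupancies; the function names are hypothetical, and the use of a strict `>` comparison for thresholding is an assumption.

```python
def iou_flat(a, b):
    """IoU of two binary occupancy vectors of equal length."""
    inter = sum(1 for x, y in zip(a, b) if x and y)
    union = sum(1 for x, y in zip(a, b) if x or y)
    return inter / union if union else 1.0

def mean_shape(shapes):
    """Per-voxel average occupancy over all shapes in a cluster."""
    n = len(shapes)
    return [sum(column) / n for column in zip(*shapes)]

def best_threshold(shapes):
    """Pick tau in {0.05, 0.10, ..., 0.50} maximizing the average IoU."""
    mean = mean_shape(shapes)
    taus = [0.05 * i for i in range(1, 11)]

    def binarize(tau):
        return [1 if m > tau else 0 for m in mean]  # assumed strict comparison

    def average_iou(tau):
        binary = binarize(tau)
        return sum(iou_flat(binary, s) for s in shapes) / len(shapes)

    tau = max(taus, key=average_iou)
    return tau, binarize(tau)
```

In this toy form, a voxel occupied in only a minority of the cluster's shapes is dropped whenever removing it raises the average IoU, which is the mechanism by which thin structures get averaged out.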

Since correspondences between images and 3D shapes are known for the training set, images can be readily matched with the respective cluster $k$ . Subsequently, we train a 1-of- $K$ classifier that assigns images to cluster labels. At test time, we set the mean shape of the predicted cluster as the inferred solution. For classification, we use the ResNet-50 architecture , pre-trained on the ImageNet dataset , and fine-tuned for 30 epochs on our data.

Retrieval. Our retrieval baseline is inspired by the work of Li et al. , which learns to embed images and shapes in a joint space. The embedding space is constructed from the pairwise similarity matrix of all 3D shapes in the training set by compressing each row of the matrix to a low-dimensional descriptor via Multi-Dimensional Scaling with Sammon mapping . To compute the similarity of two arbitrary shapes, Li et al. employ the lightfield descriptor . To embed images in the space spanned by the shape descriptors, a convolutional network is trained to map images to the descriptor given by the corresponding shape in the training set. During training, the network optimizes the Euclidean distance between predicted and ground-truth descriptors.

We adapt the work of Li et al. in several ways. As with our clustering baseline, we determine the similarity between two shapes via the IoU of their $32^{3}$ voxel grid representation. We then compute a low-dimensional descriptor via principal component analysis. We further use a larger descriptor (512 vs. 128) and a network with larger capacity (ResNet-50 , pre-trained on ImageNet , without fixing any layers during fine-tuning). Finally, instead of minimizing the Euclidean distance, we maximize the cosine similarity between descriptors during training.
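To make the role of the cosine similarity concrete, the sketch below looks up the training shape whose descriptor is most similar to a predicted descriptor. That the test-time lookup uses the same similarity as the training objective is an assumption, as are the names `cosine_similarity` and `retrieve`.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two descriptors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def retrieve(query, database):
    """Return the id of the shape whose descriptor is most similar to `query`.

    database: dict mapping shape id -> descriptor (sequence of floats).
    """
    return max(database, key=lambda sid: cosine_similarity(query, database[sid]))
```

Because the cosine similarity ignores descriptor magnitude, the network only needs to predict the direction of the target descriptor correctly.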

Oracle nearest neighbor. To gain more insight into the characteristics of the dataset, we evaluate an Oracle Nearest Neighbor (Oracle NN) baseline. For each of the test 3D shapes, we find the closest shape from the training set in terms of IoU. This method cannot be applied in practice, but gives an upper bound on how well a retrieval method can solve the task.

Analysis

We start by conducting a standard comparison of all methods in terms of their mean IoU scores. The results are summarized in Fig. 1. We find that state-of-the-art methods, despite being backed by different architectures, perform at a remarkably similar level. Interestingly, the retrieval baseline, a pure recognition method, outperforms all other approaches both in terms of mean and median IoU. The simple clustering baseline is competitive and outperforms both AtlasNet and OGN. We further observe that a perfect retrieval method (Oracle NN) performs significantly better than all other methods. Strikingly, the variance in the results is extremely high (between 35% and 50%) for all methods. This implies that quantitative comparisons that rely solely on the mean IoU do not provide a full picture at this level of performance. To shed more light on the behavior of the methods, we proceed with a more detailed analysis.

Per-class analysis. The similarity in average accuracy cannot be attributed to methods specializing in different subsets of classes. In Fig. 2 we observe consistent relative performance between methods across different classes. The retrieval baseline achieves the best results for 30 out of 55 classes. The classes are sorted from left to right in ascending order according to the performance of the retrieval baseline. The variance is high for all classes and all methods.

One might assume that the per-class performance depends on the number of training samples that are available for a class. However, we find no correlation between the number of samples in a class and its mean IoU score; see Fig. 3. The correlation coefficient between the two quantities is close to zero for all methods. This implies that there is no justification for only using 13 out of the 55 classes, as was done in many prior works .
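The correlation check can be sketched as follows. The text does not state which correlation coefficient was used, so Pearson's coefficient is an assumption here, and the function name is hypothetical.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if std_x == 0 or std_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (std_x * std_y)
```

Applied to the per-class sample counts and the per-class mean IoU of one method, a value near zero corresponds to the observation above.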

The quantitative results are backed by qualitative results shown in Fig. 4. For most classes, there is no significant visual difference between the predictions of the decoder-based methods and our clustering baseline. Clustering fails when the sample is far from the mean shape of the cluster, or when the cluster itself cannot be described well by the mean shape (this is often the case for chairs or tables because of thin structures that get averaged out in the mean shape). The predictions of the retrieval baseline look more appealing in most cases due to the presence of fine details, even though these details are not necessarily correct. We provide additional qualitative results in the supplementary material.

Statistical evaluation. To further investigate the hypothesis that convolutional networks bypass true reconstruction via image recognition, we visualize the histograms of IoU scores for individual object classes in Fig. 5. For histograms of all 55 classes we refer to the supplementary material. Although the distributions differ between classes, the within-class distributions of decoder-based methods and recognition baselines are surprisingly similar.

For reference, we also plot the results of the Oracle NN baseline, which differs substantially for many classes. To verify this observation rigorously, we perform the Kolmogorov-Smirnov test on the 50-binned versions of the histograms for all classes and all pairs of methods. The null hypothesis is that the two samples are drawn from the same distribution. We visualize the results of the test in the rightmost part of Fig. 5. Every cell of the heat map shows the number of classes for which the test does not allow us to reject the null hypothesis, i.e., where the p-value is larger than $0.05$. We find that for decoder-based methods and recognition baselines the null hypothesis cannot be rejected for the vast majority of classes.
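For readers unfamiliar with the test, a minimal two-sample Kolmogorov-Smirnov sketch is given below. It operates on raw per-sample IoU scores rather than on 50-bin histograms, and it uses the large-sample critical value instead of an exact p-value, so it is an approximation of the procedure described above; all names are hypothetical.

```python
import math

def ks_statistic(xs, ys):
    """Largest absolute gap between the empirical CDFs of two samples."""
    xs, ys = sorted(xs), sorted(ys)
    i = j = 0
    gap = 0.0
    while i < len(xs) and j < len(ys):
        v = min(xs[i], ys[j])
        while i < len(xs) and xs[i] <= v:   # advance past all values <= v
            i += 1
        while j < len(ys) and ys[j] <= v:
            j += 1
        gap = max(gap, abs(i / len(xs) - j / len(ys)))
    return gap

def rejects_null(xs, ys, alpha=0.05):
    """Large-sample decision rule: reject if D exceeds c(alpha) * sqrt((n+m)/(n*m))."""
    n, m = len(xs), len(ys)
    c_alpha = math.sqrt(-0.5 * math.log(alpha / 2.0))  # about 1.358 for alpha = 0.05
    return ks_statistic(xs, ys) > c_alpha * math.sqrt((n + m) / (n * m))
```

Failing to reject the null hypothesis does not prove that two methods behave identically; it only means the IoU distributions are not distinguishable at the chosen significance level.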

Problems

In the preceding section we provided evidence that current methods for single-view 3D object reconstruction predominantly rely on recognition. Here we discuss aspects of popular experimental procedures that may need to be reconsidered to elicit more detailed reconstruction behavior from the models.

The vast majority of existing methods predict output shapes in an object-centered coordinate system, which aligns objects of the same semantic category to a common orientation. Aligning objects this way makes it particularly easy to find spatial regularities. It encourages learning-based approaches to recognize the object category first, and refine the shape later if at all.

Shin et al. studied how the choice of coordinate frames affects reconstruction performance and generalization abilities of learning-based methods, comparing object-centered and viewer-centered coordinate frames. They found that a viewer-centered frame leads to significantly better generalization to object classes that were not seen during training, a result that can only be achieved when a method operates in a geometric reconstruction regime.

To validate these conclusions, we repeated the experimental evaluation (Sec. 4) in a viewer-centered coordinate frame. We attempted to extend the clustering baseline with a viewpoint prediction network which would regress the azimuth and elevation angles of the camera w.r.t. the canonical frame. This naive approach failed because the canonical frame has a different meaning for each object class, implying that the viewpoint network needs to use class information in order to solve the task. For the retrieval baseline, we retrained the method, treating each training view as a separate sample. To avoid artifacts from rotating voxelized shapes, we synthesized ground-truth shapes by rotating and then voxelizing the original meshes, resulting in a distinct target shape for each view of each object. Results are shown in Fig. 6, where we observe a mild decrease in performance for OGN and Matryoshka networks, and a larger drop for the retrieval baseline. For the retrieval setting, the viewer-centered setup is computationally more demanding, as different views of the same object now refer to different shapes to be retrieved. Consequently, less learning capacity is available for each individual object.
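The difference between the two coordinate frames amounts to rotating the canonical shape by the camera angles before voxelization. The sketch below shows this for a set of mesh vertices; the axis convention (y up, azimuth about y, elevation about x) and the rotation order are assumptions and will differ between renderers.

```python
import math

def to_viewer_frame(points, azimuth_deg, elevation_deg):
    """Rotate object-centered points (x, y, z) into a camera-aligned frame.

    Assumed convention: y is up; azimuth rotates about the y axis,
    followed by elevation about the x axis.
    """
    a, e = math.radians(azimuth_deg), math.radians(elevation_deg)
    ca, sa = math.cos(a), math.sin(a)
    ce, se = math.cos(e), math.sin(e)
    rotated = []
    for x, y, z in points:
        x1, z1 = ca * x + sa * z, -sa * x + ca * z   # rotation about y
        y2, z2 = ce * y - se * z1, se * y + ce * z1  # rotation about x
        rotated.append((x1, y2, z2))
    return rotated
```

Rotating the mesh vertices and voxelizing afterwards, rather than rotating an existing voxel grid, avoids resampling artifacts and yields one distinct target per view.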

Evaluation metric

Intersection over union. The mean IoU is commonly used as the primary quantitative measure for benchmarking single-view reconstruction approaches. This can be problematic if it is used as the sole metric to argue for the merits of an approach, since it is only indicative of the quality of a predicted shape if it reaches sufficiently high values. Low to mid-range scores indicate a significant discrepancy between two shapes.

An example is shown in Fig. 7, which compares a car model to different shapes in the dataset and illustrates their similarity in terms of IoU scores. As shown in the figure, even an IoU of 0.59 allows for considerable deviation from the ground-truth shape. For reference, note that 75% of the predictions by the best-performing approach, our retrieval baseline, have an IoU below 0.66; 50% are below 0.43 (cf. Fig. 1).

All information about an object’s shape is situated on its surface. However, for voxel-based representations with a solid interior, the IoU is dominated by the interior parts of objects. As a consequence, even seemingly high IoU values may poorly reflect the actual surface similarity.
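A toy calculation illustrates how strongly the interior dominates. Below, a solid $20^{3}$ cube is compared with the same cube after removing its one-voxel-thick outer layer: the two shapes share no voxel of the original surface, yet the IoU remains high. The shapes are constructed for this illustration only and are not taken from the dataset.

```python
from itertools import product

def solid_cube(lo, hi):
    """Set of voxel indices of a solid axis-aligned cube [lo, hi)^3."""
    return set(product(range(lo, hi), repeat=3))

full = solid_cube(0, 20)     # 20^3 = 8000 voxels
eroded = solid_cube(1, 19)   # 18^3 = 5832 voxels, strictly inside `full`
shell = full - eroded        # the 2168 surface voxels of `full`

# None of the original surface voxels is present in the eroded shape ...
missing_surface = len(shell & eroded) == 0
# ... but the IoU is still 5832 / 8000 = 0.729.
iou_value = len(full & eroded) / len(full | eroded)
```

Nearly three quarters of the score is contributed by interior voxels that carry no information about the surface.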

Moreover, while IoU can easily be evaluated for a volumetric representation, there is no straightforward way to evaluate it for point clouds. A good measure should allow comparing different 3D representations within the same unified framework. Point-based measures are most suitable for this, because a point cloud can be obtained from any other 3D representation via (a) surface point sampling for meshes, (b) per-pixel reprojection for depth maps, or (c) running the marching cubes algorithm followed by point sampling for voxel grids.

Chamfer distance. Some recent methods use the Chamfer Distance (CD) for evaluation . Although it is defined on point clouds and by design satisfies the requirement of being applicable (after conversion) to different 3D representations, it is a problematic measure because of its sensitivity to outliers. Consider the example in Fig. 8. Both target chairs perfectly match the source chair in the lower part and are completely wrong in the upper part. However, according to the CD score, the second target is much better than the first. As this example shows, the CD measure can be significantly perturbed by the geometric layout of outliers. It is affected by how far the outliers are from the reference shape. We argue that in order to reliably reflect real reconstruction performance, a good quantitative measure should be robust to the detailed geometry of outliers.
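The sensitivity to the placement of outliers can be reproduced with a few points. The sketch below assumes a symmetric Chamfer distance with unsquared Euclidean distances averaged per point set, which is one common variant; the 2D point sets are made up for illustration.

```python
import math

def chamfer(g_pts, r_pts):
    """Symmetric Chamfer distance (assumed variant: unsquared, per-set mean)."""
    def one_way(src, dst):
        return sum(min(math.dist(p, q) for q in dst) for p in src) / len(src)
    return one_way(g_pts, r_pts) + one_way(r_pts, g_pts)

# Reference shape: ten points on a line.
source = [(float(x), 0.0) for x in range(10)]
# Both targets match the first half exactly and get the second half wrong,
# but the wrong points lie at different distances from the reference.
target_near = [(float(x), 0.0) for x in range(5)] + [(float(x), 1.0) for x in range(5, 10)]
target_far = [(float(x), 0.0) for x in range(5)] + [(float(x), 10.0) for x in range(5, 10)]

cd_near = chamfer(source, target_near)
cd_far = chamfer(source, target_far)
```

Both targets get the same five points wrong, but `cd_far` is several times larger than `cd_near`, purely because of where the incorrect points happen to lie.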

F-score. Motivated by the insight that both IoU and CD can be misleading, we propose to use the F-score, an established and easily interpretable metric that is actively used in the multi-view 3D reconstruction community. The F-score explicitly evaluates the distance between object surfaces and is defined as the harmonic mean of precision and recall. Precision measures the accuracy of the reconstruction by counting the percentage of reconstructed points that lie within a certain distance to the ground truth. Recall measures the completeness of the reconstruction by counting the percentage of points on the ground truth that lie within a certain distance to the reconstruction. The strictness of the F-score can be controlled by varying the distance threshold $d$. The metric has an intuitive interpretation: the percentage of points (or surface area) that was reconstructed correctly.
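A brute-force sketch of the metric on two small point clouds is given below. It returns fractions in $[0,1]$ rather than percentages, uses a strict comparison against $d$, and has quadratic cost, so it is meant only to make the definition concrete; a spatial index would be needed for 10K-point clouds. The function name is hypothetical.

```python
import math

def f_score(ground_truth, reconstruction, d):
    """Return (F-score, precision, recall) at distance threshold d."""
    def fraction_within(src, dst):
        # Fraction of points in `src` whose nearest neighbor in `dst` is closer than d.
        hits = sum(1 for p in src if min(math.dist(p, q) for q in dst) < d)
        return hits / len(src)

    precision = fraction_within(reconstruction, ground_truth)  # accuracy
    recall = fraction_within(ground_truth, reconstruction)     # completeness
    if precision + recall == 0:
        return 0.0, precision, recall
    return 2 * precision * recall / (precision + recall), precision, recall
```

For example, a reconstruction that covers exactly half of the ground-truth points and adds nothing spurious has precision 1.0, recall 0.5, and an F-score of about 0.67.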

We plot the F-score of viewer-centered reconstructions for different distance thresholds $d$ in Fig. 9 (left). At $d=2\%$ of the side length of the reconstructed volume, the absolute F-score values are in the same range as the current mIoU scores, which, as we argued before, are not indicative of the prediction quality. We therefore suggest evaluating the F-score at distance thresholds of $1\%$ and below.

In Fig. 9 (right), we show the percentage of models with an F-score of 0.5 or higher at a threshold ${d=1\%}$ . Only a small number of shapes is reconstructed accurately, indicating that the task is still far from solved. Our retrieval baseline is no longer a clear winner, further showing that a reasonable solution in viewer-centered mode is harder to get using a pure recognition method.

We observe that AtlasNet often produces qualitatively good surfaces. It even outperforms the Oracle NN baseline at more liberal thresholds (above $2\%$), as shown in Fig. 9 (left). Perceptually, humans tend to judge quality by global and semi-global features and tolerate parts that are slightly wrong in position or shape. We observe that AtlasNet, which was trained to optimize surface correspondence, rarely misses parts of the model completely, but tends to produce poorly localized parts. This is reflected in the high-performance range analysis, shown in Fig. 9 (right), where AtlasNet trails all other approaches.

Analyzing precision and recall separately provides additional insights into each method’s behavior. In Fig. 10 we see that OGN and Matryoshka Networks outperform Oracle NN in terms of precision. However, both Oracle NN and the retrieval baseline show higher recall. This is supported by qualitative observations that OGN and Matryoshka Networks tend to produce incomplete models.

Both recall and precision can be easily visualized to gain further insights, as illustrated in Fig. 12.

Dataset

The problem of networks finding a semantic shortcut solution is closely related to the choice of training data. The ShapeNet dataset has been used extensively because of its size. However, its particular composition – single objects of representative types, aligned to a canonical reference frame – enables recognition models to masquerade as reconstruction. In Fig. 1, we demonstrate that a retrieval solution (Oracle NN) outperforms all other methods on this dataset, i.e., the test data can be explained by simply retrieving models from the training set. This indicates a critical problem in using ShapeNet to evaluate 3D reconstruction: for a typical shape in the test set, there is a very similar shape in the training set. In effect, the train/test split is contaminated, because so many shapes within a class are similar. A reconstruction model evaluated on ShapeNet does not need to actually perform reconstruction: it merely needs to retrieve a similar shape from the training set.

Conclusion

In this paper, we reasoned about the spectrum of approaches to single-view 3D reconstruction, spanned by reconstruction and recognition. We introduced two baselines, clustering and retrieval, which leverage only recognition. We showed that the simple retrieval baseline outperforms recent state-of-the-art methods. Our analysis indicates that state-of-the-art approaches to single-view 3D reconstruction primarily perform recognition rather than reconstruction. We identified aspects of common experimental procedures that elicit this behavior and made a number of recommendations, including the use of a viewer-centered coordinate frame and a robust and informative evaluation measure (the F-score). Another critical problem, the dataset composition, is identified but left unaddressed. We are working towards remedying this in subsequent work.

Acknowledgements

We thank Jaesik Park for his help with F-score evaluation. We also thank Max Argus and Estibaliz Gómez for valuable discussions and suggestions. This project used the Open3D library.

Appendix

Appendix A Metrics and evaluation protocol

For completeness, we provide the definitions of the evaluation metrics used and additional details for converting different shape representations for evaluation.

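A.1 Intersection over Union (IoU)

In the context of 3D shape reconstruction, the IoU between two shapes $\mathcal{G}$ and $\mathcal{R}$, represented as binary occupancy maps, is commonly defined as

$$\mathrm{IoU}(\mathcal{G},\mathcal{R})=\frac{|\mathcal{G}\cap\mathcal{R}|}{|\mathcal{G}\cup\mathcal{R}|}.$$

In our evaluation protocol, we compare shapes $\mathcal{G},\mathcal{R}$ at a resolution of $128^{3}$ binary cells (voxels).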

A.2 Chamfer Distance (CD)

The Chamfer Distance (CD) between the ground truth shape $\mathcal{G}$ and the reconstructed shape $\mathcal{R}$ (both represented as point clouds) is defined as
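A commonly used symmetric form of this distance is sketched below, stated as an assumption: variants with squared distances, or without the per-set normalization, also appear in the literature, and the surrounding text does not determine which one is meant.

```latex
\mathrm{CD}(\mathcal{G},\mathcal{R})
  = \frac{1}{|\mathcal{G}|}\sum_{g\in\mathcal{G}}\min_{r\in\mathcal{R}}\lVert g-r\rVert
  + \frac{1}{|\mathcal{R}|}\sum_{r\in\mathcal{R}}\min_{g\in\mathcal{G}}\lVert r-g\rVert
```

Each term averages nearest-neighbor distances in one direction, so a few points that lie far from the other shape can dominate the sum.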

A.3 F-score

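Here we provide a full definition of the F-score measure. Consider a ground truth shape $\mathcal{G}$ and a reconstructed shape $\mathcal{R}$, both represented as point clouds. For every point $r\in\mathcal{R}$ its distance to $\mathcal{G}$ is calculated as

$$e_{r\rightarrow\mathcal{G}}=\min_{g\in\mathcal{G}}\lVert r-g\rVert.$$

Subsequently, we calculate the percentage of points reconstructed better than a certain threshold $d$, which results in the precision value

$$P(d)=\frac{100}{|\mathcal{R}|}\sum_{r\in\mathcal{R}}\left[e_{r\rightarrow\mathcal{G}}<d\right],$$

where $[\cdot]$ is the Iverson bracket. The same procedure is repeated in the opposite direction to produce the recall value

$$R(d)=\frac{100}{|\mathcal{G}|}\sum_{g\in\mathcal{G}}\left[e_{g\rightarrow\mathcal{R}}<d\right].$$

The final F-score is given by the harmonic mean of the precision and recall values

$$F(d)=\frac{2\,P(d)\,R(d)}{P(d)+R(d)}.$$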

In practice, we set $d$ as a fraction of the side length of the reconstructed volume (e.g., $1\%$ ).

To evaluate a method using the F-score, we convert each shape prediction to a mesh representation, from which we evenly sample 10K points from the surface. We show how predictions by different methods compare in terms of their visual quality, precision and recall for a qualitative example in Fig. 12. OGN, Matryoshka Networks, and the clustering baseline completely miss parts of the plane, resulting in high precision but comparably low recall. AtlasNet reconstructs a complete shape, but misplaces individual parts, resulting in both low precision and low recall. The retrieval baseline finds a reasonably similar model, leading to comparably high precision and recall values.
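One way to sample points evenly from a mesh surface is area-weighted random sampling of triangles followed by uniform sampling inside each triangle, sketched below. This interpretation of "evenly sample" is an assumption, and the function names are hypothetical rather than taken from any library.

```python
import math
import random

def triangle_area(a, b, c):
    """Area of a 3D triangle via the cross product of two edge vectors."""
    ab = [b[i] - a[i] for i in range(3)]
    ac = [c[i] - a[i] for i in range(3)]
    cross = (ab[1] * ac[2] - ab[2] * ac[1],
             ab[2] * ac[0] - ab[0] * ac[2],
             ab[0] * ac[1] - ab[1] * ac[0])
    return 0.5 * math.sqrt(sum(x * x for x in cross))

def sample_surface(vertices, faces, num_points=10_000, seed=0):
    """Draw points uniformly (in expectation) from a triangle mesh surface."""
    rng = random.Random(seed)
    triangles = [(vertices[i], vertices[j], vertices[k]) for i, j, k in faces]
    areas = [triangle_area(*t) for t in triangles]
    points = []
    # Larger triangles are chosen proportionally more often.
    for a, b, c in rng.choices(triangles, weights=areas, k=num_points):
        u, v = rng.random(), rng.random()
        if u + v > 1.0:               # reflect into the triangle
            u, v = 1.0 - u, 1.0 - v
        points.append(tuple(a[i] + u * (b[i] - a[i]) + v * (c[i] - a[i])
                            for i in range(3)))
    return points
```

Weighting by area matters: sampling triangles uniformly would over-represent finely tessellated regions and bias both precision and recall.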

Appendix B Quantitative results

In Tab. 1 we provide the exact F-score values at 1% threshold in the viewer-centered mode.

Appendix C Qualitative examples

In addition to the qualitative examples for a selection of classes in the main paper, we show a randomly sampled qualitative example for each class of the ShapeNet dataset in Fig. 13. As in the main paper, we show, from left to right: input image, ground truth shape, and predictions from AtlasNet, OGN, Matryoshka Networks, our clustering baseline, our retrieval baseline, and an Oracle Nearest Neighbor. Numbers in the bottom left of each prediction indicate the IoU (dark gray) and the F-score at a $1\%$ threshold (bold), respectively.

Appendix D Statistical evaluation

In the main paper we showed within-class IoU histograms for a selection of three classes. We visualize such histograms for all 55 classes in Fig. 19.

We performed the Kolmogorov-Smirnov test on the within-class distributions for each ShapeNet class and each pairing of methods. The null hypothesis is that the two samples are drawn from the same distribution. We plot the p-values for each test result in Fig. 22. The color of each cell indicates whether the null hypothesis can be rejected (orange) or not (green). Aggregated results can be found in Fig. 5 (right) in the main paper.