ImVoxelNet: Image to Voxels Projection for Monocular and Multi-View General-Purpose 3D Object Detection

Danila Rukhovich, Anna Vorontsova, Anton Konushin

Introduction

RGB images are an affordable and universal data source; therefore, RGB-based 3D object detection has been actively investigated in recent years. RGB images provide visual clues about the scene and its objects, yet they do not contain explicit information about the scene geometry and the absolute scale of the data. Consequently, detecting 3D objects from RGB images is an ill-posed task. Given a monocular image, deep learning-based 3D object detection methods can only deduce the scale of the data. Moreover, the scene geometry cannot be unambiguously derived from RGB images since some areas may be invisible. However, using several posed images might help obtain more information about the scene than a monocular RGB image. Accordingly, some 3D object detection methods run multi-view inference. These methods obtain predictions on each monocular RGB image independently, then aggregate these predictions.

In contrast, we use multi-view inputs not only for inference but also for training. During both training and inference, the proposed method accepts posed multi-view inputs with an arbitrary number of views; this number might be unique for each multi-view input. Besides, our method can accept posed monocular inputs (treated as a special case of multi-view inputs). Furthermore, it works surprisingly well on monocular benchmarks.

All RGB-based 3D object detection methods are designed to be indoor or outdoor and work under certain assumptions about the scene and the objects. For instance, outdoor methods are typically evaluated on cars. In general, cars are of similar size, they are located on the ground, and their projections onto the Bird’s Eye View (BEV) do not intersect. Accordingly, a BEV-plane projection contains much information on the 3D location of a car. So, a common approach in outdoor 3D object detection is to reduce a 3D object detection in a point cloud to a 2D object detection in the BEV plane. At the same time, indoor objects might have different heights and be randomly located in space, so their projections onto the floor plane provide little information about their 3D positions. Overall, the design of RGB-based 3D object detection methods tends to be domain-specific.

To accumulate information from multiple inputs, we construct a voxel representation of the 3D space. We use this unified approach to detect objects in both indoor and outdoor scenes: we only choose between an indoor and outdoor head, while the meta-architecture remains the same.

In the proposed method, final predictions are obtained from 3D feature maps, which corresponds to the formulation of the point cloud-based detection problem. On this basis, we use off-the-shelf necks and heads from point cloud-based object detectors with no modifications.

As far as we know, we are the first to formulate a task of end-to-end training for multi-view 3D object detection based on posed RGB images only.

We propose a novel fully convolutional 3D object detector that works in both monocular and multi-view settings.

With domain-specific heads, the proposed method achieves state-of-the-art results for both indoor and outdoor datasets.

Related Works

Many scene understanding methods accept multi-view inputs. For instance, some scene understanding sub-tasks can only be solved given multi-view inputs. For example, the SLAM task implies reconstructing 3D scene geometry and estimating camera poses given a sequence of frames. Structure-from-Motion (SfM) approaches are designed to estimate camera poses and intrinsics from an unordered set of images, whereas Multi-View Stereo (MVS) methods use SfM outputs to build a 3D point cloud.

Other scene understanding sub-tasks might be reformulated to be multi-view. Several methods that use multi-view inputs to address these tasks have been proposed recently. For instance, 3D-SIS performs 3D instance segmentation based on a set of RGB-D inputs. MVPointNet uses multi-view RGB-D inputs for 3D semantic segmentation. Atlas processes several monocular RGB images to perform 3D semantic segmentation and TSDF reconstruction jointly.

3D Object Detection

Point cloud-based. Point clouds are three-dimensional, so it seems natural to employ a 3D convolutional network for detection. However, this approach requires exhaustive computation that causes slow inference on large outdoor scenes. Recent outdoor methods decrease the runtime by projecting the 3D point cloud to the BEV plane. The common practice in point cloud processing is to subdivide a point cloud into voxels. The projection onto the BEV plane implies that all voxels in each vertical column should be encoded into a fixed-length feature map. Then, this pseudo-image can be passed to a 2D object detection network to obtain final predictions.

Indoor object detection methods generate object proposals for each point in a point cloud. However, some indoor objects are not convex, so the geometrical center of an indoor object may not belong to this object (e.g., the center of a table or a chair might be in between legs). Accordingly, an object proposal given by a single center point might be irrelevant, so indoor methods use deep Hough voting to generate proposals.

Stereo-based. Despite accepting more than one image, stereo-based methods cannot be considered multi-view, as they use exactly two images. In contrast, multi-view methods can process an arbitrary number of inputs. Moreover, camera poses might be arbitrary for multi-view inputs, whereas for stereo inputs the relative transformation between the two cameras is known precisely and remains fixed while recording. This makes it possible to perform stereo reconstruction by estimating optical flow between the left and right images. Stereo-based methods rely heavily on the stereo assumptions, e.g., 3DOP uses stereo reconstruction to generate object proposals, while TLNet runs triangulation to merge proposals obtained for left and right images independently. Stereo R-CNN generates object proposals given both left and right images, then estimates object location by triangulating keypoints.

Monocular-based. Mono3D generates 3D anchors by aggregating clues from semantic maps, visible contours of the objects, and location priors via a complex energy function. Deep3DBox uses discretization to estimate the orientation of each object and derives its 3D pose from constraints between 2D and 3D bounding boxes. MonoGRNet decomposes the 3D object detection problem into sub-tasks, namely object distance estimation, object location estimation, and object corners estimation. These sub-tasks are solved by separate networks, trained first stage-wise then altogether to refine 3D bounding boxes.

Other methods exploit 2D detection and lift information from 2D to 3D, for instance by extending a 2D detection network with a 3D branch that regresses object pose. Some methods make use of external data sources, e.g., DeepMANTA uses an iterative coarse-to-fine algorithm of generating 2D object proposals, which are used to select a CAD model. 3D-RCNN also performs 2D detection and matches the outputs to 3D models. Then, it uses a render-and-compare approach to recover the shape and pose of an object.

Monocular indoor 3D object detection is a less explored problem, with the SUN RGB-D benchmark being the only one available. This benchmark implies that indoor 3D object detection is a sub-task of total scene understanding. Besides detecting 3D objects, methods evaluated on it estimate camera poses and room layouts. The most recent, Total3DUnderstanding, reconstructs object meshes using an attention mechanism to consider relationships between objects.

Some outdoor 3D object detection methods are evaluated on the nuScenes dataset on multi-view inputs. Specifically, these methods run inference on each monocular RGB image, then aggregate the outputs. Aggregation is an inevitable part of the pipeline; however, deferring it to the final stage is questionable, as spatial information might not be exploited as effectively as possible.

So, none of the existing methods formulate 3D object detection given multiple RGB images as an end-to-end optimization problem.

Proposed Method

Our method accepts an arbitrary-sized set of RGB inputs along with camera poses. First, we extract features from the given images using a 2D convolutional backbone. Then, we project the obtained image features to a 3D voxel volume. For each voxel, the projected features from several images are aggregated via a simple element-wise averaging. Next, the voxel volume with assigned features is passed to a 3D convolutional network referred to as neck. The outputs of the neck serve as inputs to the last few convolutional layers (head) that predict bounding box features for each anchor. The resulting bounding boxes are parameterized as $(x,y,z,w,h,l,\theta)$, where $(x,y,z)$ are the coordinates of the center, $w,h,l$ are the width, height, and length, and $\theta$ is the rotation angle around the $z$-axis. The general scheme of the proposed method is depicted in Fig. 1.

2D feature projection and the 3D neck network have been proposed in prior work. First, we briefly outline these steps. Then, we introduce a novel multi-scale 3D head designed for indoor detection.

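We use a pinhole camera model to relate 2D coordinates $(u,v)$ in the feature map $F_t$ to 3D coordinates $(x,y,z)$ in the volume $V_t$:

$$[u, v]^{T} = \Pi K R_t [x, y, z, 1]^{T},$$

where $K$ and $R_t$ are the intrinsic and extrinsic matrices, and $\Pi$ is a perspective mapping. After projecting 2D features, all voxels along a camera ray get filled with the same features. We also define a binary mask $M_t$ of the same shape as $V_t$, which indicates whether each voxel is inside the camera frustum. Thus, for each image $I_t$, the mask $M_t$ is defined as:

$$M_t(x,y,z) = \begin{cases} 1, & \text{if } (u,v) \text{ falls inside } F_t, \\ 0, & \text{otherwise.} \end{cases}$$

Then, we project $F_t$ for each valid voxel in a volume $V_t$:

$$V_t(x,y,z) = \begin{cases} F_t(u,v), & \text{if } M_t(x,y,z) = 1, \\ 0, & \text{otherwise.} \end{cases}$$

The aggregated binary mask $M$ is a sum of $M_1,\dots,M_t$:

$$M = \sum_t M_t.$$

Finally, we obtain the 3D volume $V$ by averaging projected features in volumes $V_1,\dots,V_t$ across valid voxels:

$$V = \frac{1}{M} \sum_t V_t.$$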
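The projection and averaging steps can be sketched in NumPy as follows. This is a minimal illustration rather than the authors' implementation: the function names, the nearest-pixel sampling, the assumption that $K$ is already scaled to the feature-map resolution, the camera convention (depth along the third camera axis), and the extra positive-depth check are all assumptions made for this sketch.

```python
import numpy as np

def project_features(feats, K, Rt, origin, voxel_size, grid_shape):
    """Back-project one 2D feature map F_t into a voxel volume V_t.

    feats: (H_f, W_f, C) feature map; K: (3, 3) intrinsics at feature
    resolution (assumption); Rt: (3, 4) world-to-camera extrinsics.
    Returns V_t of shape (Nx, Ny, Nz, C) and a boolean mask M_t.
    """
    nx, ny, nz = grid_shape
    h_f, w_f, c = feats.shape
    # World coordinates of all voxel centers.
    idx = np.stack(np.meshgrid(np.arange(nx), np.arange(ny), np.arange(nz),
                               indexing="ij"), axis=-1)
    centers = origin + (idx + 0.5) * voxel_size
    homo = np.concatenate([centers, np.ones((nx, ny, nz, 1))], axis=-1)
    # Pinhole model: pixel coordinates are K * Rt * [x, y, z, 1]^T up to depth.
    pix = (homo @ Rt.T) @ K.T
    depth = pix[..., 2]
    in_front = depth > 1e-6  # extra check, not part of the text above
    safe_depth = np.where(in_front, depth, 1.0)
    u = np.floor(pix[..., 0] / safe_depth).astype(int)
    v = np.floor(pix[..., 1] / safe_depth).astype(int)
    mask = in_front & (u >= 0) & (u < w_f) & (v >= 0) & (v < h_f)
    volume = np.zeros((nx, ny, nz, c), dtype=feats.dtype)
    volume[mask] = feats[v[mask], u[mask]]
    return volume, mask

def aggregate(volumes, masks):
    """Average per-view volumes over the views in which each voxel is valid."""
    m = np.sum(masks, axis=0)    # M: number of views that see each voxel
    v = np.sum(volumes, axis=0)  # invalid voxels contribute zeros
    return v / np.maximum(m, 1)[..., None], m
```

Voxels that no camera observes keep zero features here; the clamp `np.maximum(m, 1)` only avoids division by zero for those voxels.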

3D Feature Extraction

Indoor. Following prior work, we pass the voxel volume $V$ through a 3D convolutional encoder-decoder network to refine the features. For indoor scenes, we adopt an existing encoder-decoder architecture. However, with over 48 3D convolutional layers, the original network is computationally heavy and slow at inference. To speed it up, we simplify the network by reducing the number of time-consuming 3D convolutional layers. The simplified encoder has only three downsampling residual blocks, each with three 3D convolutional layers. The simplified decoder consists of three upsampling blocks, each made up of a transposed 3D convolutional layer with stride 2 followed by another 3D convolutional layer. The decoder branch outputs three feature maps of the following shapes: $\frac{N_x}{4}\times\frac{N_y}{4}\times\frac{N_z}{4}\times c_2$, $\frac{N_x}{2}\times\frac{N_y}{2}\times\frac{N_z}{2}\times c_2$, and $N_x\times N_y\times N_z\times c_2$. For the actual value of $c_2$, see Implementation Details.

Outdoor. Outdoor methods reduce 3D object detection in 3D space to 2D object detection in the BEV plane. In these methods, both the neck and head are composed of 2D convolutions. The outdoor head accepts a 2D feature map, so we should obtain a 2D representation of the constructed 3D voxel volume to use it in our method. To do that, we use the encoder part of the encoder-decoder architecture from prior work. After passing through several 3D convolutional and downsampling layers of this encoder, a voxel volume $V$ of shape $N_x\times N_y\times N_z\times c_1$ is mapped to a tensor of shape $N_x\times N_y\times c_2$.

Detection Heads

ImVoxelNet constructs a 3D voxel representation of the space; thus, it can use the head from point cloud-based 3D object detection methods. Therefore, instead of time-consuming custom architecture implementation, one can employ state-of-the-art methods with no modifications. However, the design of heads significantly differs for outdoor and indoor methods.

We reformulate outdoor 3D object detection as 2D object detection in the BEV plane following the common practice. We use the 2D anchor head that appeared to be efficient on KITTI and nuScenes datasets. Since outdoor 3D detection methods are evaluated on cars, all objects are of a similar scale and belong to the same category. For single-scale and single-class detection, the head consists of two parallel 2D convolutional layers. One layer estimates class probability, while the other regresses seven parameters of the bounding box.

Input. The input is a tensor of shape $N_x\times N_y\times c_2$.

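Output. For each 2D BEV anchor, the head returns a class probability $p$ and a 3D bounding box as a 7-tuple:

$$\Delta x = \frac{x^{\mathsf{gt}} - x^{\mathsf{a}}}{d^{\mathsf{a}}}, \quad \Delta y = \frac{y^{\mathsf{gt}} - y^{\mathsf{a}}}{d^{\mathsf{a}}}, \quad \Delta z = \frac{z^{\mathsf{gt}} - z^{\mathsf{a}}}{h^{\mathsf{a}}},$$

$$\Delta w = \log\frac{w^{\mathsf{gt}}}{w^{\mathsf{a}}}, \quad \Delta l = \log\frac{l^{\mathsf{gt}}}{l^{\mathsf{a}}}, \quad \Delta h = \log\frac{h^{\mathsf{gt}}}{h^{\mathsf{a}}}, \quad \Delta\theta = \sin(\theta^{\mathsf{gt}} - \theta^{\mathsf{a}}).$$

Here, $\cdot^{\mathsf{gt}}$ and $\cdot^{\mathsf{a}}$ denote the ground truth and anchor boxes, respectively. The length of the bounding box diagonal is $d^{\mathsf{a}}=\sqrt{(w^{\mathsf{a}})^{2}+(l^{\mathsf{a}})^{2}}$. $z^{\mathsf{a}}$ is constant for all anchors since they are located in the BEV plane.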
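As an illustration of the anchor-relative parameterization, the sketch below encodes a ground truth box against an anchor. It assumes the SECOND-style residual encoding (center offsets normalized by the anchor diagonal $d^{\mathsf{a}}$ and anchor height, log size ratios, and the sine of the angle difference); the function name `encode_box` is hypothetical.

```python
import math

def encode_box(gt, anchor):
    """Encode a ground truth box relative to an anchor.

    Both boxes are tuples (x, y, z, w, h, l, theta), as in the text.
    Assumes a SECOND-style residual encoding.
    """
    xg, yg, zg, wg, hg, lg, tg = gt
    xa, ya, za, wa, ha, la, ta = anchor
    d = math.hypot(wa, la)  # anchor diagonal in the BEV plane
    return (
        (xg - xa) / d,
        (yg - ya) / d,
        (zg - za) / ha,
        math.log(wg / wa),
        math.log(hg / ha),
        math.log(lg / la),
        math.sin(tg - ta),
    )
```

A ground truth box identical to its anchor encodes to all zeros, so the head only has to regress small residuals around each anchor.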

Loss. We use the loss function introduced in SECOND. The total outdoor loss consists of several loss terms, namely smooth mean absolute error as a location loss $L_{\mathsf{loc}}$, focal loss for classification $L_{\mathsf{cls}}$, and cross-entropy loss for direction $L_{\mathsf{dir}}$. Overall, we can formulate the outdoor loss as

$$L_{\mathsf{outdoor}} = \frac{1}{n_{\mathsf{pos}}}\left(\lambda_{\mathsf{loc}} L_{\mathsf{loc}} + \lambda_{\mathsf{cls}} L_{\mathsf{cls}} + \lambda_{\mathsf{dir}} L_{\mathsf{dir}}\right),$$

where $n_{\mathsf{pos}}$ is the number of positive anchors, $\lambda_{\mathsf{loc}}=2$, $\lambda_{\mathsf{cls}}=1$, and $\lambda_{\mathsf{dir}}=0.2$.
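The weighting can be illustrated with a small sketch. It assumes the total loss is the weighted sum of the three terms divided by the number of positive anchors, and that "smooth mean absolute error" refers to the usual smooth L1 function with threshold `beta=1.0`; both function names and the `beta` default are assumptions of this example.

```python
def smooth_l1(residual, beta=1.0):
    """Smooth L1: quadratic near zero, linear for large residuals."""
    a = abs(residual)
    return 0.5 * a * a / beta if a < beta else a - 0.5 * beta

def outdoor_loss(loc_terms, cls_terms, dir_terms, n_pos,
                 lam_loc=2.0, lam_cls=1.0, lam_dir=0.2):
    """Weighted sum of per-anchor loss terms, normalized by positives."""
    n = max(n_pos, 1)  # guard against images with no positive anchors
    total = (lam_loc * sum(loc_terms)
             + lam_cls * sum(cls_terms)
             + lam_dir * sum(dir_terms))
    return total / n
```

With these weights, a unit error in localization costs ten times as much as a unit error in the direction term.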

Indoor Head

All modern indoor 3D object detection methods perform deep Hough voting on a sparse point cloud representation. In contrast, we use a dense voxel representation of intermediate features. To the best of our knowledge, there is no dense 3D multi-scale head for 3D object detection. We construct such a head inspired by the 2D detection method FCOS. The original FCOS head accepts 2D features from FPN and estimates 2D bounding boxes via 2D convolutional layers. To adapt FCOS for 3D detection, we replace 2D convolutions with 3D convolutions to process 3D inputs. Following FCOS and ATSS, we apply center sampling to select candidate object locations. In these works, 9 ($3\times 3$) candidates were chosen; since we operate in 3D space, we set a limit of 27 candidate locations per object ($3\times 3\times 3$). The resulting head consists of three 3D convolutional layers for classification, location, and centerness, respectively, with weights shared across all object scales.

Input. A multi-scale input is composed of three tensors of shapes $\frac{N_x}{4}\times\frac{N_y}{4}\times\frac{N_z}{4}\times c_2$, $\frac{N_x}{2}\times\frac{N_y}{2}\times\frac{N_z}{2}\times c_2$, and $N_x\times N_y\times N_z\times c_2$.

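Output. For each 3D location $(x^{\mathsf{a}},y^{\mathsf{a}},z^{\mathsf{a}})$ and each of three scales, the head estimates a class probability $p$, a centerness $c$, and a 3D bounding box as a 7-tuple:

$$\delta_1 = x_{\mathsf{max}}^{\mathsf{gt}} - x^{\mathsf{a}}, \quad \delta_2 = x^{\mathsf{a}} - x_{\mathsf{min}}^{\mathsf{gt}}, \quad \delta_3 = y_{\mathsf{max}}^{\mathsf{gt}} - y^{\mathsf{a}}, \quad \delta_4 = y^{\mathsf{a}} - y_{\mathsf{min}}^{\mathsf{gt}},$$

$$\delta_5 = z_{\mathsf{max}}^{\mathsf{gt}} - z^{\mathsf{a}}, \quad \delta_6 = z^{\mathsf{a}} - z_{\mathsf{min}}^{\mathsf{gt}}, \quad \delta_7 = \theta^{\mathsf{gt}}.$$

Here, $x_{\mathsf{min}}^{\mathsf{gt}},x_{\mathsf{max}}^{\mathsf{gt}},y_{\mathsf{min}}^{\mathsf{gt}},y_{\mathsf{max}}^{\mathsf{gt}},z_{\mathsf{min}}^{\mathsf{gt}},z_{\mathsf{max}}^{\mathsf{gt}}$ denote the minimum and maximum coordinates along the axes of a ground truth bounding box.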

Loss. We adapt the loss function used in the original FCOS. It consists of focal loss for classification $L_{\mathsf{cls}}$, cross-entropy loss for centerness $L_{\mathsf{cntr}}$, and IoU loss for location $L_{\mathsf{loc}}$. Since we address the 3D detection task instead of the 2D detection task, we replace the 2D IoU loss with a rotated 3D IoU loss. In addition, we update the ground truth centerness with the third dimension. The resulting indoor loss can be written as

$$L_{\mathsf{indoor}} = \frac{1}{n_{\mathsf{pos}}}\left(L_{\mathsf{loc}} + L_{\mathsf{cls}} + L_{\mathsf{cntr}}\right),$$

where $n_{\mathsf{pos}}$ is the number of positive 3D locations.
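To make the FCOS-style targets and the 3D centerness concrete, here is a small sketch for an axis-aligned box (rotation is ignored for simplicity). The centerness formula is an assumption: it extends the FCOS definition by multiplying in a third min/max ratio and keeps the square root, although a cube root would be another plausible choice. Function names are hypothetical.

```python
import math

def regression_targets(loc, box):
    """Distances from a location to the six faces of an axis-aligned box.

    loc = (x, y, z); box = (x_min, x_max, y_min, y_max, z_min, z_max).
    """
    x, y, z = loc
    x_min, x_max, y_min, y_max, z_min, z_max = box
    return (x_max - x, x - x_min, y_max - y, y - y_min, z_max - z, z - z_min)

def centerness_3d(targets):
    """FCOS centerness extended with a third axis (assumed form)."""
    if min(targets) <= 0:
        return 0.0  # the location lies outside the box or on its surface
    ratio = 1.0
    # Pair up the two face distances along each axis.
    for a, b in zip(targets[0::2], targets[1::2]):
        ratio *= min(a, b) / max(a, b)
    return math.sqrt(ratio)
```

The centerness equals 1 at the box center and decays towards the faces, which down-weights low-quality predictions made from locations far from the object center.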

Extra 2D Head

In some indoor benchmarks, the 3D object detection task is formulated as a sub-task of scene understanding. Accordingly, evaluation protocols imply solving various scene understanding tasks rather than only estimating 3D bounding boxes. Following prior work, we predict camera rotations and room layouts. To this end, we add a simple head for joint $R_t$ and 3D layout estimation. This extra head consists of two parallel branches: two fully connected layers output the room layout, and the other two fully connected layers estimate the camera rotation.

Input. The input is a single tensor of shape $8c_0$, obtained through global average pooling of the backbone output.

Output. The head outputs the camera pose as a tuple of pitch $\beta$ and roll $\gamma$, and a 3D layout box as a 7-tuple $(x,y,z,w,l,h,\theta)$. As in prior work, we set the yaw angle and the shift to zero.

Loss. We modify the losses used in prior work to make them consistent with the losses used to train a detection head. Accordingly, we define the layout loss $L_{\mathsf{layout}}$ as the rotated 3D IoU loss between predicted and ground truth layout boxes; this is the same loss as we use for the indoor head. For camera rotation estimation, we use $L_{\mathsf{pose}}=|\sin(\beta^{\mathsf{gt}}-\beta)|+|\sin(\gamma^{\mathsf{gt}}-\gamma)|$, similar to the angle term of the outdoor head. Overall, the extra loss can be formulated as

$$L_{\mathsf{extra}} = \lambda_{\mathsf{layout}} L_{\mathsf{layout}} + \lambda_{\mathsf{pose}} L_{\mathsf{pose}},$$

where $\lambda_{\mathsf{layout}}=0.1$ and $\lambda_{\mathsf{pose}}=1.0$.
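A short sketch of the pose term and the weighted combination follows. It assumes that the extra loss is the weighted sum of the layout and pose terms with the weights stated above, and that angles are given in radians; the function names are hypothetical, and the layout IoU loss is passed in as a precomputed number.

```python
import math

def pose_loss(beta_gt, gamma_gt, beta, gamma):
    """|sin| of the pitch and roll errors (angles in radians)."""
    return abs(math.sin(beta_gt - beta)) + abs(math.sin(gamma_gt - gamma))

def extra_loss(layout_loss, pose, lam_layout=0.1, lam_pose=1.0):
    """Weighted sum of a precomputed layout loss and the pose loss."""
    return lam_layout * layout_loss + lam_pose * pose
```

Using the sine of the difference keeps the loss periodic, so an error of a full turn is not penalized.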

Experiments

We evaluate the proposed method on four datasets: indoor ScanNet and SUN RGB-D, and outdoor KITTI and nuScenes. SUN RGB-D and KITTI are benchmarked in monocular mode, while for ScanNet and nuScenes, we address the detection problem in multi-view formulation.

KITTI. The KITTI object detection dataset is the most decisive outdoor benchmark for monocular 3D object detection. It consists of 3712 training, 3769 validation, and 7518 test images. The common practice is to report results on the validation subset and submit test predictions to an open leaderboard. All 3D object annotations have a difficulty level: easy, moderate, or hard. A 3D object detection method is assessed according to the results on moderate objects from the test set. Following prior work, we evaluate our method only on objects of the car category.

nuScenes. The nuScenes dataset provides data for developing algorithms addressing self-driving-related tasks. It contains LiDAR point clouds and RGB images captured by six cameras, accompanied by IMU and GPS measurements. The dataset covers 1000 video sequences, each recorded for 20 seconds, totalling 1.4 million images and 390 000 point clouds. The training split covers 28 130 scenes, and the validation split contains 6019 scenes. The annotation contains 1.4 million objects divided into 23 categories. Following prior work, the accuracy of 3D detection is measured only on the car category. In this benchmark, not only the average precision (AP) metric but also the average translation error (ATE), average scale error (ASE), and average orientation error (AOE) are calculated.

SUN RGB-D. SUN RGB-D is one of the first and most well-known indoor 3D datasets. It contains 10 335 images captured in various indoor places alongside corresponding depth maps obtained with four different sensors and camera poses. The training split is composed of 5285 frames, while the remaining 5050 frames comprise the validation subset. The annotation includes 58 657 objects. For each frame, a room layout is provided.

ScanNet. The ScanNet dataset contains 1513 scans covering over 700 unique indoor scenes, out of which 1201 scans belong to a training split, and 312 scans are used for validation. Overall, this dataset contains over 2.5 million images with corresponding depth maps and camera poses, alongside reconstructed point clouds with 3D semantic annotation. We estimate 3D bounding boxes from semantic point clouds following the standard protocol. The resulting object bounding boxes are axis-aligned, so we do not predict the rotation angle $\theta$ for ScanNet.

Implementation Details

3D Volume. We use ResNet-50 as a feature extractor. Accordingly, the number of channels output by the first convolutional block, $c_0$, equals 256. We set both the 3D volume feature size $c_1$ and the output feature size $c_2$ to 256 as proposed in prior work.

Indoor and outdoor scenes are of different absolute scales. Therefore, we choose the spatial sizes of the feature volume for each dataset considering the data domain. We use the values provided in previous works, as shown in Tab. 1. Thus, following the anchor settings of the 3D head in prior work, we set the voxel size $s$ to $0.32$ meters for outdoor datasets. Minimal and maximal values for all three axes for outdoor datasets also follow the point cloud ranges used there for the car class. For the indoor dataset constraints, we follow prior work in which the room size is $6.4\times 6.4\times 2.56$ meters. The only change is that we increase the voxel size $s$ from $0.04$ to $0.16$ meters to reduce memory consumption.
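The effect of the voxel size on the grid dimensions is simple arithmetic, shown below for the indoor room size quoted above. The helper name `grid_shape` is hypothetical, and rounding to the nearest integer is used only to absorb floating-point error.

```python
def grid_shape(extent, voxel_size):
    """Number of voxels along each axis for a given spatial extent (meters)."""
    return tuple(round(e / voxel_size) for e in extent)

room = (6.4, 6.4, 2.56)           # indoor room size in meters
coarse = grid_shape(room, 0.16)   # voxel size used in this work
fine = grid_shape(room, 0.04)     # original, finer voxel size

def num_voxels(shape):
    nx, ny, nz = shape
    return nx * ny * nz

# Quadrupling the voxel size divides the voxel count by 4**3 = 64.
reduction = num_voxels(fine) // num_voxels(coarse)
```

Here `coarse` is a 40 x 40 x 16 grid and `fine` is a 160 x 160 x 64 grid, so the feature volume shrinks by a factor of 64.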

Training. During training, we optimize $L_{\mathsf{indoor}}$ for indoor datasets and $L_{\mathsf{outdoor}}$ for outdoor datasets, unless stated otherwise. We use the Adam optimizer with an initial learning rate of $0.0001$ and a weight decay of $0.0001$. The implementation is based on the MMDetection framework and uses its default training settings. The network is trained for 12 epochs, and the learning rate is reduced by a factor of ten after the 8th and 11th epochs. For ScanNet, SUN RGB-D, and KITTI, the network sees each scene three times every training epoch. We use 8 Nvidia Tesla P40 GPUs for training, distributing one scene (multi-view scenario) or four images (monocular scenario) per GPU. In monocular experiments, we randomly apply horizontal flips and resize inputs by no more than 25% of their original resolution. Moreover, in indoor scenes, we can augment 3D voxel representations similarly to point cloud-based methods, so we randomly shift the voxel grid center by at most 1 m along each axis.
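The step schedule described above can be written as a small function. This is an illustrative sketch, not the framework's scheduler: `epoch` is assumed to be the 0-indexed number of completed epochs, and the function name is hypothetical.

```python
def learning_rate(epoch, base_lr=1e-4, milestones=(8, 11), gamma=0.1):
    """Step schedule: multiply the rate by gamma after each milestone epoch.

    epoch is the number of epochs already completed (0-indexed).
    """
    drops = sum(1 for m in milestones if epoch >= m)
    return base_lr * gamma ** drops

# Learning rate used in each of the 12 training epochs.
schedule = [learning_rate(e) for e in range(12)]
```

Under this reading, the first eight epochs run at 0.0001, the next three at 0.00001, and the final epoch at 0.000001.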

Inference. During inference, outputs are filtered with a Rotated NMS algorithm, which is applied to the projections of objects onto the ground plane.

Results

First, we report the results of detecting cars on outdoor KITTI and nuScenes benchmarks. Then, we discuss the results of multi-class 3D object detection on SUN RGB-D and ScanNet indoor datasets.

KITTI. We present the results of monocular car detection on KITTI in Tab. 2. ImVoxelNet achieves the best moderate AP on the test split, which is the main metric in the KITTI benchmark. Moreover, our method surpasses previous state-of-the-art by 6% AP3D and 4% APBEV for easy objects. Overall, ImVoxelNet is superior in terms of almost all metrics on both test and val splits.

nuScenes. For nuScenes, unlike other methods that only run inference on images from 6 onboard cameras, ImVoxelNet uses multi-view inputs for training. As shown in Tab. 3, the proposed method outperforms MonoDIS by more than 1% of mean AP, which is the main metric. According to AP@0.5, ImVoxelNet outputs almost twice as many highly accurate estimates as MonoDIS. For car detection, two boxes might have IoU = 0 when the distance between their centers exceeds 1 meter. Therefore, AP@1.0m, AP@2.0m, and AP@4.0m might be calculated for non-intersecting bounding boxes, which seems counter-intuitive (e.g., for the KITTI dataset, only boxes with IoU > 0.7 are considered to be true positives). Hence, we argue that AP@0.5 is the most decisive metric.

Moreover, we report values of the ATE, ASE, and AOE metrics. As shown in Tab. 3, the ATE of ImVoxelNet is at least 0.09 meters smaller than that of other monocular methods.

SUN RGB-D. We compare ImVoxelNet with existing methods on the most recent monocular benchmark, which includes objects of the NYU-37 categories. Since the chosen benchmark implies estimating camera pose and layout, we optimize $L_{\mathsf{indoor}}+L_{\mathsf{extra}}$ for training. For a fair comparison with Total3DUnderstanding, we report their results without joint training since it requires an additional mesh-annotated dataset. Tab. 4 demonstrates that ImVoxelNet surpasses all previous methods by a margin exceeding 18% in terms of mAP. Furthermore, ImVoxelNet outperforms Total3DUnderstanding in both layout and camera pose estimation. We also report metrics on other benchmarks: the PerspectiveNet benchmark with 30 object categories, and the VoteNet benchmark with 10 categories, which is used by point cloud-based methods (see Appendix A).

ScanNet. We compare ImVoxelNet to existing methods on the common benchmark with 18 classes. During training, we use $T=50$ images per scene, as was proposed in prior work. We conduct an ablation study to choose an optimal number of test images per scene (Tab. 6). We run our method five times on different samples for each number of test images and report an average result with a 0.95 confidence interval. Experiments show that the more images per test scene, the better. The most time-consuming part of the pipeline is processing a voxel volume with 3D convolutions, while extracting 2D features adds only a minor overhead. Consequently, with an increase in the number of test images per scene, the runtime grows sublinearly.
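One common way to report an average over five runs with a 0.95 confidence interval is a Student's t interval, sketched below. The text does not state how the interval was computed, so this choice is an assumption; the constant 2.776 is the two-sided 0.95 critical value for 4 degrees of freedom, and the function name is hypothetical.

```python
import math
import statistics

# Two-sided 0.95 critical value of Student's t with 4 degrees of freedom,
# i.e. for five independent runs (assumed interval type).
T_CRIT_95_DOF4 = 2.776

def mean_with_ci(values, t_crit=T_CRIT_95_DOF4):
    """Return the sample mean and the half-width of its confidence interval."""
    mean = statistics.mean(values)
    half_width = t_crit * statistics.stdev(values) / math.sqrt(len(values))
    return mean, half_width
```

The result would then be reported as the mean plus or minus the half-width; the default critical value has to be changed if the number of runs differs from five.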

According to Tab. 5, ImVoxelNet still shows competitive results despite not using point clouds. Notably, it outperforms point cloud-based 3D-SIS which builds a voxel volume representation using RGB images as an additional modality.

Performance. We report the inference time on the KITTI dataset in Tab. 7. All the methods were examined in the same experimental setup on a single GPU. ImVoxelNet uses computationally expensive 3D convolutions, so it is expected to be slower than the methods that rely on 2D convolutions only. In our experiments, ImVoxelNet appeared to be inferior in speed to most of the listed methods, yet the runtimes differ by less than an order of magnitude. The listed methods use different backbones, and this affects the total speed. In ImVoxelNet, extracting features with a backbone is a simple, lightweight procedure compared to processing the voxel volume with 3D convolutions. Accordingly, the choice of a backbone has a negligible effect on runtime: experiments show that replacing ResNet-50 with a more lightweight version has a minor influence on inference speed.

Conclusion

In this paper, we formulate the task of multi-view RGB-based 3D object detection as an end-to-end optimization problem. To address this problem, we have proposed ImVoxelNet, a novel fully convolutional method of 3D object detection given posed monocular or multi-view RGB inputs. During both training and inference, ImVoxelNet accepts multi-view inputs with an arbitrary number of views. Besides, our method can accept monocular inputs (treated as a special case of multi-view inputs). The proposed method has achieved state-of-the-art results in outdoor car detection on both the monocular KITTI benchmark and the multi-view nuScenes benchmark. Moreover, it has surpassed existing methods of 3D object detection on the indoor SUN RGB-D dataset. For the ScanNet dataset, ImVoxelNet has set a new benchmark for indoor multi-view 3D object detection. Overall, ImVoxelNet successfully works on both indoor and outdoor data, which makes it general-purpose.

Appendix

Appendix A More results on SUN RGB-D

For a comprehensive comparison, we also mention PerspectiveNet, which is evaluated following a different protocol. In that protocol, the annotations are mapped into 30 object categories. Accordingly, we train ImVoxelNet using the same object categories. The results are reported in Tab. 10. Among these 30 categories, 10 coincide with the 10 categories used in another benchmark. So, we can merge these benchmarks and report metrics obtained on the same subset of 10 object categories (Tab. 9). Following prior work, we assume camera poses are known, so we optimize only $L_{\mathsf{indoor}}$ and do not use any additional camera pose loss.

Another SUN RGB-D benchmark has been proposed for the evaluation of point cloud-based methods. This benchmark implies detecting objects of 10 categories with mAP@0.25 chosen as the main metric. In Tab. 8, we report the results of our method against point cloud-based methods. This comparison is unfair, favoring point cloud-based methods since they have access to more complete data. Nevertheless, we report the metrics to establish a baseline for monocular 3D object detection on SUN RGB-D.

A comparison with Total3DUnderstanding on all NYU-37 object categories is presented in Tab. 11. In this experiment, we optimize $L_{\mathsf{indoor}}+L_{\mathsf{extra}}$ since the camera pose is assumed unknown.

Appendix B Visualization

All visualized images belong to validation subsets of the corresponding datasets. Different colors of the depicted bounding boxes mark different object categories; the color encoding is consistent within each dataset.