An End-to-End Transformer Model for 3D Object Detection

Ishan Misra, Rohit Girdhar, Armand Joulin

Introduction

3D object detection aims to identify and localize objects in 3D scenes. Such scenes, often represented using point clouds, contain an unordered, sparse and irregular set of points captured using a depth scanner. This set-like nature makes point clouds significantly different from the traditional grid-like vision data like images and videos. While there are other 3D representations such as multiple views, voxels or meshes, they require additional post-processing to be constructed, and often lose information due to quantization. Hence, point clouds have emerged as a popular 3D representation, and spurred the development of specialized 3D architectures.

Many recent 3D detection models directly work on the 3D points to produce the bounding boxes. Of particular interest, VoteNet casts 3D detection as a set-to-set problem, i.e., transforming an unordered set of inputs (point cloud) into an unordered set of outputs (bounding boxes). VoteNet uses an encoder-decoder architecture: the encoder is a PointNet++ network which converts the unordered point set into an unordered set of point features. The point features are then input to a decoder that produces the 3D bounding boxes. While effective, such architectures have required years of careful development by hand-encoding inductive biases, radii, and designing special 3D operators and loss functions.

In parallel to 3D, set-to-set encoder-decoder models have emerged as a competitive way to model 2D object detection. In particular, the recent Transformer-based model, called DETR, casts 2D object detection as a set-to-set problem. The self-attention operation in Transformers is designed to be permutation-invariant and capture long range contexts, making them a natural candidate for processing unordered 3D point cloud data. Inspired by this observation, we ask the following question: can we leverage Transformers to learn a 3D object detector without relying on hand-designed inductive biases?

To that end, we develop 3D DEtection TRansformer (3DETR), a simple-to-implement 3D detection method that uses fewer hand-coded design decisions and also casts detection as a set-to-set problem. We explore the similarities between VoteNet and DETR, as well as between the core mechanisms of PointNet++ and the self-attention of Transformers, to build our end-to-end Transformer-based detection model. Our model follows the general encoder-decoder structure that is common to both DETR and VoteNet. For the encoder, we replace PointNet++ with a standard Transformer applied directly on the point clouds. For the decoder, we consider the parallel decoding strategy from DETR with Transformer layers, making two important changes to adapt it to 3D detection, namely non-parametric query embeddings and Fourier positional embeddings.

3DETR removes many of the hard-coded design decisions in VoteNet and PointNet++ while being simple to implement and understand. Unlike DETR, 3DETR does not employ a ConvNet backbone, and solely relies on Transformers trained from scratch. Our transformer-based detection pipeline is flexible, and as in VoteNet, any component can be replaced by other existing modules. Finally, we show that 3D-specific inductive biases can be easily incorporated in 3DETR to further improve its performance. On two standard indoor 3D detection benchmarks, ScanNetV2 and SUN RGB-D, we achieve 65.0% AP and 59.0% AP respectively, outperforming an improved VoteNet baseline by $9.5\%$ AP50 on ScanNetV2.

Related Work

We propose a 3D object detection model composed of Transformer blocks. We build upon prior work in 3D architectures, detection, and Transformers.

Grid-based 3D Architectures. Convolution networks can be applied to irregular 3D data after converting it into regular grids. Projection methods project 3D data into 2D planes and convert it into 2D grids. 3D data can also be converted into a volumetric 3D grid by voxelization . We use 3D point clouds directly since they are suitable for set based architectures such as the transformer.

Point cloud Architectures. 3D sensors often acquire data in the form of unordered point clouds. When using unordered point clouds as input, it is desirable to obtain permutation invariant features. Point-wise MLP based architectures such as PointNet and PointNet++ use permutation equivariant set aggregation (downsampling) and pointwise MLPs to learn effective representations. We use a single set-aggregation downsampling operation, as in PointNet++, to keep the number of input points tractable in our model.

Graph-based models can operate on unordered 3D data. Graphs are constructed from 3D data in a variety of ways – DGCNN and PointWeb use local neighborhoods of points, SPG uses attribute and context similarity and Jiang et al. use point-edge interactions.

Finally, continuous point convolution based architectures can also operate on point clouds. The continuous weights can be defined using polynomial functions as in SpiderCNN or linear functions as in Flex-Convolutions. Convolutions can also be applied by soft-assignment matrices or specific ordering. PointConv and KPConv dynamically generate convolutional weights based on the input point coordinates, while InterpCNN uses these coordinates to interpolate weights. We build upon the Transformer, which is applicable to sets but not tailored for 3D.

3D Object Detection is a well studied research area where methods predict three dimensional bounding boxes from 3D input data. Many methods avoid expensive 3D operations by using 2D projection. MV3D and VoxelNet use a combination of 3D and 2D convolutions. Yan et al. simplify the 3D operation, while other methods use a 2D projection or ‘pillars’ of voxels. We focus on methods that directly use 3D point clouds. PointRCNN and PVRCNN are 2-stage detection pipelines similar to the popular R-CNN framework for 2D images. While these methods are related to our work, for simplicity we build a single stage detection model. VoteNet uses Hough Voting on sparse point cloud inputs and detects boxes by feature sampling, grouping and voting operations designed for 3D data. VoteNet is a building block for many follow up works. 3D-MPA combines voting with a graph ConvNet for refining object proposals and uses specially designed 3D geometric features for aggregating detections. HGNet improves Hough Voting and uses a hierarchical graph network with feature pyramids. H3DNet improves VoteNet by predicting 3D primitives and uses a geometric loss function. We propose a simple detection method that can serve as a building block for such innovations in 3D detection.

Transformers in Vision. The Transformer architecture by Vaswani et al. has been immensely successful across domains like NLP, speech recognition, image recognition, and for cross-domain applications. Transformers are well suited for operating on 3D points since they are naturally permutation invariant. Attention based methods have been used for building 3D point representations for retrieval, outdoor 3D detection, and object classification. Concurrent work also uses the Transformer architecture for 3D. While these methods use 3D specific information to modify the Transformer, we push the limits of the standard Transformer. Our work is inspired by the recent DETR model for object detection in images by Carion et al. Different from Carion et al., our model is an end-to-end transformer (no convolutional backbone) that can be trained from scratch and has important design differences such as non-parametric queries to enable 3D detection.

Approach

We briefly review prior work in 3D detection and their conceptual similarities to 3DETR. Next, we describe 3DETR, simplifications in bounding box parametrization and the simpler set-to-set objective function.

The recent VoteNet framework forms the basis for many detection models in 3D, and like our method, is a set-to-set prediction framework. VoteNet uses a specialized 3D encoder and decoder architecture for detection. It combines these models with a Hough Voting loss designed for sparse point clouds. The encoder is a PointNet++ model that uses a combination of multiple downsampling (set-aggregation) and upsampling (feature-propagation) operations that are specifically designed for 3D point clouds. The VoteNet “decoder” predicts bounding boxes in three steps: 1) each point ‘votes’ for the center coordinate of a box; 2) votes are aggregated within a fixed radius to obtain ‘centers’; 3) bounding boxes are predicted around ‘centers’. BoxNet is a non-voting alternative to VoteNet that randomly samples ‘seed’ points from the input and treats them as ‘centers’. However, BoxNet achieves much worse performance than VoteNet as the voting captures additional context in sparse point clouds and yields better ‘center’ points. As noted by the authors, the multiple hand-encoded radii used in the encoder, decoder, and the loss function are important for detection performance and have been carefully tuned.

The Transformer is a generic architecture that can work on set inputs and capture large contexts by computing self-attention between all pairs of input points. Both these properties make it a good candidate model for 3D point clouds. Next, we present our 3DETR model which uses a Transformer for both the encoder and decoder with minimal modifications and has minimal hand-coded information for 3D. 3DETR uses a simpler training and inference procedure. We also highlight similarities and differences to the DETR model for 2D detection.

3.2 3DETR: Encoder-decoder Transformer

3DETR takes as input a 3D point cloud and predicts the positions of objects in the form of 3D bounding boxes. A point cloud is an unordered set of $N$ points where each point is associated with its $3$-dimensional XYZ coordinates. The number of points is very large, so we use the set-aggregation downsampling operation from PointNet++ to downsample to $N^{\prime}$ points and project them to $d$-dimensional features. The resulting set of $N^{\prime}$ features is passed through an encoder, which also outputs a set of $N^{\prime}$ features. A decoder takes these features as input and predicts multiple bounding boxes using a parallel decoding scheme inspired by DETR. Both encoder and decoder use standard Transformer blocks with ‘pre-norm’ and we refer the reader to Vaswani et al. for details. Fig. 2 illustrates our model.

Encoder. The downsample and set-aggregation steps provide a set of $N^{\prime}$ features of $d=256$ dimensions using an MLP with two hidden layers of $64,128$ dimensions. The set of $N^{\prime}$ features is then passed to a Transformer, which also produces a set of $N^{\prime}$ features of $d=256$ dimensions. The Transformer applies multiple layers of self-attention and non-linear projections. We do not use downsampling operations in the Transformer, and use the standard self-attention formulation. Thus, the Transformer encoder has no specific modifications for 3D data. We omit positional embeddings of the coordinates from the encoder since the input already contains information about the XYZ coordinates.

Decoder. Following Carion et al., we frame detection as a set prediction problem, i.e., we simultaneously predict a set of boxes with no particular ordering. This is achieved with a parallel decoder composed of Transformer blocks. This decoder takes as input the $N^{\prime}$ point features and a set of $B$ query embeddings $\{\mathbf{q}^{e}_{1},\dots,\mathbf{q}^{e}_{B}\}$ to produce a set of $B$ features that are then used to predict 3D bounding boxes. In our framework, the query embeddings $\mathbf{q}^{e}$ represent locations in 3D space around which our final 3D bounding boxes are predicted. We use positional embeddings in the decoder as it does not have direct access to the coordinates (it operates on encoder features and query embeddings).

Non-parametric query embeddings. Inspired by seed points used in VoteNet and BoxNet, we use non-parametric embeddings computed from ‘seed’ XYZ locations. We sample a set of $B$ ‘query’ points $\{\mathbf{q}_{i}\}_{i=1}^{B}$ randomly from the $N^{\prime}$ input points (see Fig. 2). We use Farthest Point Sampling for the random samples as it ensures a good coverage of the original set of points. We associate each query point $\mathbf{q}_{i}$ with a query embedding $\mathbf{q}^{e}_{i}$ by converting the coordinates of $\mathbf{q}_{i}$ into Fourier positional embeddings followed by projection with an MLP.
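The query sampling step can be illustrated with a minimal pure-Python sketch of Farthest Point Sampling. It is not the authors' implementation: the function name `farthest_point_sampling`, the `seed` argument and the toy point cloud are assumptions made for illustration, and a practical version would run on the GPU over thousands of points.

```python
import math
import random

def farthest_point_sampling(points, num_samples, seed=0):
    """Greedy FPS: each new index is the point farthest from those chosen so far."""
    rng = random.Random(seed)
    n = len(points)
    chosen = [rng.randrange(n)]  # random starting point
    # min_dist[i] = distance from point i to its nearest chosen point
    min_dist = [math.dist(p, points[chosen[0]]) for p in points]
    while len(chosen) < num_samples:
        nxt = max(range(n), key=min_dist.__getitem__)
        chosen.append(nxt)
        for i, p in enumerate(points):
            min_dist[i] = min(min_dist[i], math.dist(p, points[nxt]))
    return chosen

# Toy cloud (hypothetical data): two tight clusters far apart.
cloud = [(0.0, 0.0, 0.0), (0.1, 0.0, 0.0), (0.0, 0.1, 0.0),
         (5.0, 5.0, 5.0), (5.1, 5.0, 5.0)]
query_idx = farthest_point_sampling(cloud, num_samples=2)
query_xyz = [cloud[i] for i in query_idx]  # one query point per cluster
```

With two samples, the second query always lands in the cluster that the random first pick did not hit, which is the coverage property the text relies on.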

3.3 Bounding box parametrization and prediction

The encoder-decoder architecture produces a set of $B$ features, that are fed into prediction MLPs to predict bounding boxes. A 3D bounding box has the attributes (a) its location, (b) size, (c) orientation, and (d) the class of the object contained in it. We describe the parametrization of these attributes and their associated prediction problems.

The prediction MLPs produce a box around every query coordinate $\mathbf{q}$. (a) Location: We use the XYZ coordinates of the box’s center $\mathbf{c}$. We predict this in terms of an offset $\mathbf{\Delta q}$ that is added to the query coordinates, i.e., $\mathbf{c}=\mathbf{q}+\mathbf{\Delta q}$.

(b) Size: Every box is a 3D rectangle and we define its size around the center coordinate $\mathbf{c}$ using XYZ dimensions $\mathbf{d}$ .

(c) Orientation: In some settings, we must predict the orientation of the box, i.e., the angle it forms relative to a given reference direction. Following prior work, we quantize the angles in $[0,2\pi)$ into $12$ bins and note the quantization residual. Angular prediction involves predicting the quantized ‘class’ of the angle and the residual to obtain the continuous angle $a$.

(d) Semantic Class: We use a one-hot vector $\mathbf{s}$ to encode the object class contained in the bounding box. We include a ‘background’ or ‘not an object’ class as some of the predicted boxes may not contain an object.
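The angle parametrization in (c) can be sketched as an encode/decode pair. The convention below, where bin $k$ is centered at $k\cdot 2\pi/12$ and the residual is measured from the bin center, is an assumption chosen for illustration; the text only fixes the number of bins and the range. The names `encode_angle` and `decode_angle` are hypothetical.

```python
import math

NUM_BINS = 12
BIN_SIZE = 2 * math.pi / NUM_BINS

def encode_angle(angle):
    """Return (bin index, residual from the bin center) for an angle in radians."""
    angle = angle % (2 * math.pi)
    # Shift by half a bin so that bin k covers [k - 0.5, k + 0.5) * BIN_SIZE.
    shifted = (angle + BIN_SIZE / 2) % (2 * math.pi)
    # min() guards against floating-point rounding at the upper edge.
    bin_idx = min(int(shifted // BIN_SIZE), NUM_BINS - 1)
    residual = shifted - (bin_idx * BIN_SIZE + BIN_SIZE / 2)
    return bin_idx, residual

def decode_angle(bin_idx, residual):
    """Recover the continuous angle from the predicted class and residual."""
    return (bin_idx * BIN_SIZE + residual) % (2 * math.pi)

bin_idx, residual = encode_angle(1.0)      # classification + regression targets
recovered = decode_angle(bin_idx, residual)  # approximately 1.0 again
```

Splitting the angle this way turns one hard regression over $[0,2\pi)$ into a 12-way classification plus a small regression whose target magnitude is at most half a bin.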

Putting together the attributes of a box, we have two quantities: the predicted boxes $\hat{\mathbf{b}}$ and the ground truth boxes $\mathbf{b}$. Each predicted box $\hat{\mathbf{b}}=[\hat{\mathbf{c}},\hat{\mathbf{d}},\hat{\mathbf{a}},\hat{\mathbf{s}}]$ consists of (1) geometric terms $\hat{\mathbf{c}},\hat{\mathbf{d}}\in[0,1]^{3}$ that define the box center and dimensions respectively, and $\hat{\mathbf{a}}=[\hat{\mathbf{a}}_{c},\hat{\mathbf{a}}_{r}]$ that defines the quantized class and residual for the angle; (2) a semantic term $\hat{\mathbf{s}}\in[0,1]^{K+1}$ that contains the probability distribution over the $K$ semantic object classes and the ‘background’ class. The ground truth boxes $\mathbf{b}$ also have the same terms.

3.4 Set Matching and Loss Function

To train the model, we first match the set of $B$ predicted 3D bounding boxes $\{\hat{\mathbf{b}}\}$ to the ground truth bounding boxes $\{\mathbf{b}\}$. While VoteNet uses hand-defined radii to do such set matching, we follow DETR and perform a bipartite graph matching, which is simpler, generic (see § 4.2.1) and robust to removing Non-Maximal Suppression. We compute a loss for each predicted box using its matched ground truth box.

Bipartite Matching. We define a matching cost for a pair of boxes, predicted box $\hat{\mathbf{b}}$ and ground truth box $\mathbf{b}$ , using a geometric and a semantic term.

We compute the optimal bipartite matching between all the predicted boxes $\{\hat{\mathbf{b}}\}$ and ground truth boxes $\{\mathbf{b}\}$ using the Hungarian algorithm as in prior work. As we predict a larger number of boxes than the ground truth, the predicted boxes that do not get matched are considered matched to the ‘background’ class. This encourages the model to not over-predict, a property that helps our model be robust to removing Non-Maximal Suppression (see § 5).
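As a rough illustration of the matching step, the sketch below finds the minimum-cost one-to-one assignment by exhaustive search. On tiny inputs this yields the same optimum as the Hungarian algorithm, but it scales factorially and is only meant to make the objective concrete. The cost used here (L1 center distance minus the predicted probability of the ground truth class) and the names `match_cost`, `bipartite_match`, `w_center` and `w_class` are assumptions for illustration, not the cost or weights used in the paper.

```python
import itertools
import math

def match_cost(pred, gt, w_center=1.0, w_class=1.0):
    """Hypothetical pairwise cost: a geometric term plus a semantic term."""
    center = sum(abs(p - g) for p, g in zip(pred["center"], gt["center"]))
    return w_center * center - w_class * pred["probs"][gt["label"]]

def bipartite_match(preds, gts):
    """Exhaustive minimum-cost assignment; suitable for toy sizes only."""
    best, best_cost = (), math.inf
    for perm in itertools.permutations(range(len(preds)), len(gts)):
        cost = sum(match_cost(preds[p], gts[g]) for g, p in enumerate(perm))
        if cost < best_cost:
            best, best_cost = perm, cost
    matched = {p: g for g, p in enumerate(best)}
    # Predictions without a partner are treated as 'background' (None here).
    return [matched.get(i) for i in range(len(preds))]

# Toy data: 3 predictions, 2 ground truth boxes, K = 2 classes + background.
preds = [
    {"center": (0.0, 0.0, 0.0), "probs": [0.1, 0.8, 0.1]},
    {"center": (5.0, 5.0, 5.0), "probs": [0.7, 0.2, 0.1]},
    {"center": (9.0, 9.0, 9.0), "probs": [0.1, 0.1, 0.8]},
]
gts = [
    {"center": (5.0, 5.0, 4.8), "label": 0},
    {"center": (0.2, 0.0, 0.0), "label": 1},
]
assignment = bipartite_match(preds, gts)  # ground truth index per prediction
```

Here the third prediction has no nearby ground truth box, so it stays unmatched and would be trained toward the ‘background’ class.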

Our final loss function is a weighted combination of five terms, one for each predicted quantity ($\hat{\mathbf{c}}$, $\hat{\mathbf{d}}$, $\hat{\mathbf{a}}_{c}$, $\hat{\mathbf{a}}_{r}$ and $\hat{\mathbf{s}}$), and we provide the full details in the appendix. For predicted boxes matched to the ‘background’ class, we only compute the semantic classification loss with the background class ground truth label. For datasets with axis-aligned 3D bounding boxes, we also use a loss directly on the GIoU. We do not use the GIoU loss for oriented 3D bounding boxes as it is computationally involved.

Intermediate decoder layers. At training time, we use the same bounding box prediction MLPs to predict bounding boxes at every layer in the decoder. We compute the set loss for each layer independently and sum all the losses to train the model. At test time, we only use the bounding boxes predicted from the last decoder layer.

3.5 Implementation Details

We implement 3DETR using PyTorch and use the standard nn.MultiheadAttention module to implement the Transformer. We use a single set aggregation operation to subsample $N^{\prime}=2048$ points and obtain $256$ dimensional point features. The 3DETR encoder has 3 layers where each layer uses multiheaded attention with four heads and a two layer MLP with a ‘bottleneck’ of $128$ hidden dimensions. The 3DETR decoder has 8 layers and closely follows the encoder, except that the MLP hidden dimensions are $256$. We use Fourier positional encodings of the XYZ coordinates in the decoder. The bounding box prediction MLPs are two layer MLPs with a hidden dimension of $256$. Full architecture details are in Appendix A.1.

Experiments

Dataset and metrics. We evaluate models on two standard 3D indoor detection benchmarks: ScanNetV2 and SUN RGB-D-v1. SUN RGB-D has 5K single-view RGB-D training samples with oriented bounding box annotations for 37 object categories. ScanNetV2 has 1.2K training samples (reconstructed meshes converted to point clouds) with axis-aligned bounding box labels for 18 object categories. For both datasets, we follow the experimental protocol from prior work: we report the detection performance on the val set using mean Average Precision (mAP) at two different IoU thresholds of $0.25$ and $0.5$, denoted as AP25 and AP50. Along with the metric, this protocol evaluates on the 10 most frequent categories for SUN RGB-D.
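To make the two IoU thresholds concrete, here is a small sketch of 3D IoU for axis-aligned boxes, as used for ScanNetV2-style labels. The (center, size) box layout and the function name `iou_3d_axis_aligned` are assumptions for illustration; oriented boxes, as in SUN RGB-D, need a more involved intersection computation that is not shown.

```python
def iou_3d_axis_aligned(box_a, box_b):
    """IoU of two axis-aligned boxes given as (center_xyz, size_xyz) tuples."""
    (center_a, size_a), (center_b, size_b) = box_a, box_b
    inter = 1.0
    for ca, sa, cb, sb in zip(center_a, size_a, center_b, size_b):
        lo = max(ca - sa / 2, cb - sb / 2)   # start of the overlap on this axis
        hi = min(ca + sa / 2, cb + sb / 2)   # end of the overlap on this axis
        inter *= max(0.0, hi - lo)           # zero if the boxes do not overlap
    vol_a = size_a[0] * size_a[1] * size_a[2]
    vol_b = size_b[0] * size_b[1] * size_b[2]
    return inter / (vol_a + vol_b - inter)

# Two 2x2x2 cubes whose centers are 1 unit apart along X.
pred = ((0.0, 0.0, 0.0), (2.0, 2.0, 2.0))
gt = ((1.0, 0.0, 0.0), (2.0, 2.0, 2.0))
iou = iou_3d_axis_aligned(pred, gt)  # intersection 4, union 12, IoU = 1/3
```

In this example the prediction would count as a correct detection at the $0.25$ threshold (AP25) but not at $0.5$ (AP50), which is why AP50 is the stricter localization measure.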

In this set of experiments, we validate 3DETR for 3D detection. We compare it to the BoxNet and VoteNet models since they are conceptually similar to 3DETR and are the foundations of many recent detection models. For fair comparison, we use our own implementation of these models with the same optimization improvements used in 3DETR, leading to a boost of +2-4% AP over the original paper (details in supplemental). We also compare against a state-of-the-art method, H3DNet, and provide a more detailed comparison against other recent methods in the appendix. 3DETR models use $256$ and $128$ queries for the ScanNetV2 and SUN RGB-D datasets, respectively.

Observations. We summarize results in Table 1. The comparison between BoxNet and 3DETR is particularly relevant since both methods predict boxes around location queries while VoteNet uses 3D Hough Voting to obtain queries. Our method significantly outperforms BoxNet on both the datasets with a gain of $+13\%$ AP25 on ScanNetV2 and $+3.9\%$ AP25 on SUN RGB-D. Even when compared with VoteNet, our model achieves competitive performance, with $+2.3\%$ AP25 on ScanNetV2 and $-1.5\%$ AP25 on SUN RGB-D. 3DETR-m, which uses the masked Transformer encoder, achieves comparable performance to VoteNet on SUN RGB-D and a gain of $+4.6\%$ AP25 and $+9.5\%$ AP50 on ScanNetV2.

Compared to a state-of-the-art method, H3DNet, that builds upon VoteNet, 3DETR-m is within a couple of AP25 points on both datasets (more detailed comparison in Appendix B). These experiments validate that an encoder-decoder detection model based on the standard Transformer is competitive with similar models tailored for 3D data. Just as the VoteNet model was improved by the innovations of H3DNet, HGNet, and 3D-MPA, similar innovations could be integrated into our model in the future.

Qualitative Results. In Fig. 3, we visualize a few detections and ground truth boxes from SUN RGB-D. 3DETR detects boxes despite the partial (single-view) depth scans, and also predicts amodal bounding boxes, including for objects whose annotations are missing in SUN RGB-D.

4.2 Analyzing 3DETR

We conduct a series of experiments to understand 3DETR. In § 4.2.1, we explore the similarities between 3DETR, VoteNet and BoxNet. Next, in § 4.2.2, we compare the design decisions in 3DETR that enable 3D detection to the original components in DETR.

The encoder-decoder paradigm is flexible and we can test if the different modules in VoteNet, BoxNet and 3DETR are interchangeable. We focus on the encoders, decoders and losses and report the detection performance in Tables 2 and 3. For simplicity, we denote the decoders and the losses used in BoxNet and VoteNet as Box and Vote respectively. We use PointNet++ to refer to the modified PointNet++ architecture used in VoteNet .

Replacing the encoder. We train 3DETR with a PointNet++ encoder (Table 2) and observe that the detection performance is unchanged or slightly worse compared to 3DETR with a transformer encoder. This shows that the design decisions in 3DETR are broadly compatible with prior work, and can be used for designing better encoder models.

Replacing the decoder. In Table 3, we observe that replacing our Transformer-based decoders by Box or Vote decoders leads to poor detection performance on both benchmarks. Additionally, the Box and Vote decoders work only with their respective losses and our preliminary experiments using set loss on these decoders led to worse results. Thus, the drop of performance could be attributed to changing the decoder used with our transformer encoder. We inspect this next by replacing the loss in 3DETR while using the transformer encoder and decoder.

Replacing the loss. We train 3DETR, i.e., both Transformer encoder and decoder, with the Box and Vote losses. We observe (Table 3 rows 4 and 5) that this leads to similar degradation in performance, suggesting that the losses are not applicable to our model. This is not surprising since the design decisions in the Vote loss, e.g., voting radius, aggregation radius etc., were specifically designed for the radius parameters in the PointNet++ encoder. This set of observations exposes that the decoder and loss function used in VoteNet depend greatly on the nature of the encoder (additional results in § B.4). In contrast, our set loss has no design decisions specific to our encoder-decoder.

Visualizing self-attention. We visualize the self-attention in the decoder in Fig. 1. The decoder focuses on whole instances and groups points within instances. This presumably makes it easier to predict bounding boxes for each instance. We provide visualizations for the encoder self-attention in the supplemental.

Encoder applied to Shape classification. To verify that our encoder design is not specific to the detection task, we test the encoder on shape classification of models including 3D Warehouse.

We use the three layer encoder from 3DETR with vanilla self-attention (no decoder) or the three layer encoder from 3DETR-m. To obtain global features for the point cloud, we use the ‘CLS token’ formulation from Transformer, i.e., append a constant point to the input and use this point’s output encoder features as global features (see supplemental for details). The global features from the encoder are input to a 2-layer MLP to perform shape classification. Table 4 shows that both the 3DETR and 3DETR-m encoders are competitive with state-of-the-art encoders tailored for 3D. These results suggest that our encoder design is not specific to detection and can be used for other 3D tasks.

4.2.2 Design decisions in 3DETR

Our model is inspired by the DETR architecture but has major differences: (1) it is an end-to-end transformer without a ConvNet, (2) it is trained from scratch, (3) it uses non-parametric queries, and (4) it uses Fourier positional embeddings. In Table 5, we show the impact of the last two differences by evaluating various versions of our model on ScanNetV2. The version with minimal modifications is a DETR model applied to 3D with our training and loss function.

First, this version does not perform well on the ScanNetV2 benchmark, achieving 15% AP25. However, when replacing the parametric queries by non-parametric queries, we observe a significant improvement of +40% in AP25 (Table 5 rows 3 and 5). In fact, only using the non-parametric queries (row 4) without positional embeddings doubles the performance. This shows the importance of using non-parametric queries with 3D point clouds. A reason is that point clouds are irregular and sparse, making the learning of parametric queries harder than on a 2D image grid. Non-parametric queries are directly sampled from the point clouds and hence are less impacted by these irregularities. Unlike the fixed number of parametric queries in DETR, non-parametric queries easily enable the use of a different number of queries at train and test time (see § 5.1).

Finally, replacing the sinusoidal positional embedding by low-frequency Fourier encodings provides an additional improvement of +5% in AP25 (Table 5 rows 2 and 3). As a side note, using positional encodings benefits the decoder more than the encoder because the decoder does not have direct access to coordinates.
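A Fourier positional encoding of an XYZ coordinate can be sketched as follows. This uses the common random Fourier feature form $[\sin(2\pi \mathbf{B}\mathbf{x}), \cos(2\pi \mathbf{B}\mathbf{x})]$ with a fixed Gaussian matrix $\mathbf{B}$; the exact variant, the number of frequencies and the `scale` value used in 3DETR are not given in this text, so every parameter and name below is an assumption for illustration.

```python
import math
import random

def make_fourier_matrix(num_freqs, in_dim=3, scale=1.0, seed=0):
    """Fixed (non-learned) Gaussian projection; a small `scale` favors low frequencies."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, scale) for _ in range(in_dim)]
            for _ in range(num_freqs)]

def fourier_embed(xyz, proj_matrix):
    """Map one XYZ coordinate to sin and cos features of random projections."""
    phases = [2 * math.pi * sum(b * x for b, x in zip(row, xyz))
              for row in proj_matrix]
    return [math.sin(p) for p in phases] + [math.cos(p) for p in phases]

B_matrix = make_fourier_matrix(num_freqs=4)
embedding = fourier_embed((0.25, -0.5, 1.0), B_matrix)  # 8 values in [-1, 1]
```

In a full model, an embedding like this would then be projected by an MLP to the decoder width, as described for the query embeddings.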

Ablations

We conduct a series of ablation experiments to understand the components of 3DETR with settings from § 4.

Effect of NMS. 3DETR uses the set loss of DETR (§ 3.4) that forces a 1-to-1 mapping between the ground truth box and the predicted box. This loss penalizes models that predict too many boxes, since excess predictions are not matched to ground truth. In contrast, the loss used in VoteNet does not discourage multiple predictions of the same object and thus relies on Non-Maximal Suppression to remove them as a post-processing step. We compare 3DETR and VoteNet with and without NMS in Table 6 with the detection AP metric, which penalizes duplicate detections. Without NMS, 3DETR drops in performance by only 3% AP while VoteNet drops by 50%, showing our set loss works without NMS.

Effect of encoder/decoder layers. We assess the importance of the number of layers in the encoder and decoder in Fig. 4. While a higher number of layers improves detection performance in general, adding the layers in the decoder instead of the encoder has a greater impact on performance. For instance, for a model with three encoder and three decoder layers, adding five decoder layers improves performance by +7% AP50 while adding five encoder layers improves it by +2% AP50. This preference toward the decoder arises because in our parallel decoder, each layer further refines the prediction quality of the bounding boxes.

An advantage of our model is that we can adapt its computation during inference, without retraining, by using fewer decoder layers or fewer queries to predict boxes.

Adapting decoder depth. The parallel decoder of 3DETR is trained to predict boxes at each layer with the same bounding box prediction MLPs. Thus far, in all our results we used the predictions only from the last decoder layer. We now test the performance of the intermediate layers for a decoder with six layers in Fig. 5 (left). We compare this to training different models with a varying number of decoder layers. We make two observations: (1) similar to Fig. 4, detection performance increases with the number of decoder layers; and (2) more importantly, the same model with reduced depth at test time performs as well as or better than models trained from scratch with reduced depth. This second property is shared with DETR, but not with VoteNet. It allows adapting the number of layers in the decoder to a computation budget during inference without retraining.

Adapting number of queries. As we increase the number of queries, 3DETR predicts more bounding boxes, resulting in better performance at a cost of longer running time. However, our non-parametric queries in 3DETR allow us to adapt the number of box predictions to trade performance for running time. Note that this is also possible with VoteNet, but not with DETR. In Fig. 5 (right), we compare changing the number of queries at test time to different models trained with varying number of queries. The same 3DETR model can adapt to a varying number of queries at test time and performs comparably to different models. Performance increases until the number of queries is enough to cover the point cloud well. We found this adaptation to number of queries at test time works best with a 3DETR model trained with $128$ queries (see Appendix B for other models). This adaptive computation is promising and research into efficient self-attention should benefit our model. We provide inference time comparisons to VoteNet in § A.1 for different versions of the 3DETR model.

Conclusion

We presented 3DETR, an end-to-end Transformer model for 3D detection on point clouds. 3DETR requires few 3D specific design decisions or hyperparameters. We show that using non-parametric queries and Fourier encodings is critical for good 3D detection performance. Our proposed design decisions enable powerful Transformers for 3D detection, and also benefit other 3D tasks like shape classification. Additionally, our set loss function generalizes to prior 3D architectures. In general, 3DETR is a flexible framework and can easily incorporate prior components used in 3D detection and can be leveraged to build more advanced 3D detectors. Finally, it also combines the flexibility of both VoteNet and DETR, allowing for a variable number of predictions at test time (like VoteNet) with a variable number of decoder layers (like DETR).

Acknowledgments: We thank Zaiwei Zhang for helpful discussions and Laurens van der Maaten for feedback on the paper.

Supplemental Material

Appendix A Implementation Details

We describe the 3DETR architecture in detail.

Encoder. The encoder has three layers of self-attention followed by an MLP. The self-attention operation uses multi-headed attention with four heads. The self-attention produces a $2048\times 2048$ attention matrix which is used to attend to the features to produce a $256$ dimensional output. The MLPs in each layer have a hidden dimension of $128$. All the layers use LayerNorm and the ReLU non-linearity.

3DETR-m Encoder. The masked 3DETR-m encoder has three layers of self-attention followed by an MLP. At each layer the self-attention matrix of size #points $\times$ #points is multiplied with a binary mask $M$ of the same size. The binary mask entry $M_{ij}$ is $1$ if the point coordinates for points $i$ and $j$ are within a radius $r$ of each other. We use radius values of $[0.4,0.8,1.2]$ for the three layers. The first layer operates on $2048$ points and is followed by a downsample + set aggregation operator that downsamples to $1024$ points using a radius of $0.4$ , similar to PointNet++. The encoder layers follow the same structure as the vanilla Encoder described above, i.e., MLPs with hidden dimension of $128$ , multi-headed attention with four heads etc. The encoder produces $256$ dimensional features for $1024$ points.
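The radius mask can be sketched in a few lines. The text states that the attention matrix is multiplied by the binary mask $M$ but not whether this happens before or after normalization; the sketch below zeroes out masked entries before normalizing so that each row still sums to one, which is an assumption. The names `radius_mask` and `masked_softmax` are hypothetical.

```python
import math

def radius_mask(points, radius):
    """M[i][j] = 1 if points i and j are within `radius` of each other, else 0."""
    return [[1 if math.dist(p, q) <= radius else 0 for q in points]
            for p in points]

def masked_softmax(scores, mask_row):
    """Softmax over the unmasked scores; masked positions get weight 0."""
    kept = [s for s, m in zip(scores, mask_row) if m]
    top = max(kept)  # subtract the max for numerical stability
    exps = [math.exp(s - top) if m else 0.0 for s, m in zip(scores, mask_row)]
    total = sum(exps)
    return [e / total for e in exps]

# Toy points: the first two are 0.3 apart, the third is far from both.
points = [(0.0, 0.0, 0.0), (0.3, 0.0, 0.0), (2.0, 0.0, 0.0)]
M = radius_mask(points, radius=0.4)
weights = masked_softmax([1.0, 1.0, 5.0], M[0])  # row 0 ignores the far point
```

Every point is within distance zero of itself, so each row keeps at least one entry and the normalization is always well defined. Growing the radius across layers, as in $[0.4,0.8,1.2]$, lets later layers attend to a wider neighborhood.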

Decoder. The decoder operates on the $N^{\prime}\times 256$ encoder features and $B\times 256$ location query embeddings. It produces a $B\times 256$ matrix of box features as output. The decoder has eight layers and uses cross-attention between the location query embeddings (Sec 3.2 main paper) and the encoder features, and self-attention between the box features. Each layer has the self-attention operation followed by a cross-attention operation (implemented exactly as self-attention) and an MLP with a hidden dimension of $256$. All the layers use LayerNorm, ReLU non-linearity and a dropout of $0.3$.

Inference speed. 3DETR has very few 3D-specific tweaks and uses standard PyTorch. VoteNet relies on custom GPU CUDA kernels for 3D operations. We measured the inference time of 3DETR (256 queries) and VoteNet (256 boxes) on a V100 GPU with a batch size of 8 samples. Both models downsample the point cloud to $2048$ points. 3DETR needs 170 ms while VoteNet needs 132 ms. As research into efficient self-attention becomes more mature (several recent works show promise), it will benefit the runtime and memory efficiency of our model.

A.2 Set Loss

For $B$ predicted boxes and $G$ ground truth boxes, we compute a $B\times G$ matrix of costs by using the above pairwise cost term. We then compute an optimal assignment between each ground truth box and a predicted box using the Hungarian algorithm. Since the number of predicted boxes is larger than the number of ground truth boxes, the remaining $B-G$ predicted boxes are considered to match to background. We set $\lambda_{1},\lambda_{2},\lambda_{3},\lambda_{4}$ to $2,1,0,0$ for ScanNetV2 and $3,5,1,5$ for SUN RGB-D.
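To make the assignment step concrete, the sketch below matches predictions to ground truth boxes for a toy $B\times G$ cost matrix. It is an illustration only: instead of the Hungarian algorithm it uses an exhaustive search over assignments, which returns the same optimum but is only feasible for very small $B$ and $G$ (in practice one would use a Hungarian solver such as SciPy's `linear_sum_assignment`). The function name `match_boxes` and the cost values are hypothetical.

```python
from itertools import permutations

def match_boxes(cost):
    """Optimal one-to-one matching for a B x G cost matrix (B >= G).

    Brute-force stand-in for the Hungarian algorithm, for illustration only.
    Returns (matched, background): `matched` maps a prediction index to its
    ground truth index; `background` lists the unmatched prediction indices.
    """
    num_pred = len(cost)
    num_gt = len(cost[0]) if num_pred else 0
    best_perm, best_total = (), float("inf")
    # perm[g] is the prediction assigned to ground truth box g.
    for perm in permutations(range(num_pred), num_gt):
        total = sum(cost[perm[g]][g] for g in range(num_gt))
        if total < best_total:
            best_perm, best_total = perm, total
    matched = {pred: gt for gt, pred in enumerate(best_perm)}
    background = [b for b in range(num_pred) if b not in matched]
    return matched, background

# Toy example with B = 4 predictions and G = 2 ground truth boxes.
cost = [
    [0.9, 0.1],
    [0.2, 0.8],
    [0.5, 0.5],
    [0.3, 0.05],
]
matched, background = match_boxes(cost)
```

Here predictions 1 and 3 are matched to ground truth boxes 0 and 1, and the remaining $B-G=2$ predictions (0 and 2) are treated as background.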

For each predicted box that is matched to a ground truth box, our loss function is:

For each unmatched box that is considered background, we compute only the semantic loss term. The semantic loss is implemented as a weighted cross entropy loss with a weight of $0.2$ for the ‘background’ class and a weight of $0.8$ for the $K$ object classes.
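The class weighting can be illustrated with a small, self-contained sketch of a per-box weighted cross entropy. The helper names are hypothetical, and two details are assumptions rather than statements about the released code: the background class is placed at the last index $K$, and no normalization across boxes is applied.

```python
import math

def semantic_class_weights(num_object_classes, object_weight=0.8, background_weight=0.2):
    """Weights for K object classes followed by one background class.

    Assumption: background is stored at the last index (K).
    """
    return [object_weight] * num_object_classes + [background_weight]

def weighted_cross_entropy(logits, target, class_weights):
    """Per-box loss: -w[target] * log softmax(logits)[target]."""
    max_logit = max(logits)  # subtract the max for numerical stability
    log_z = max_logit + math.log(sum(math.exp(x - max_logit) for x in logits))
    return -class_weights[target] * (logits[target] - log_z)

weights = semantic_class_weights(num_object_classes=3)
uniform_logits = [0.0, 0.0, 0.0, 0.0]
loss_background = weighted_cross_entropy(uniform_logits, target=3, class_weights=weights)
loss_object = weighted_cross_entropy(uniform_logits, target=0, class_weights=weights)
```

For the same prediction confidence, a box labeled as background contributes a quarter of the loss of a box labeled with an object class, which counteracts the large number of background predictions.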

Appendix B Experiments

We provide additional experimental details and hyperparameter settings.

We improve the VoteNet and BoxNet baselines by doing a grid search and improving the optimization hyperparameters. We train the baseline models for $360$ epochs using the Adam optimizer with a learning rate of $1\times 10^{-3}$ decayed by a factor of 10 after $160,240,320$ epochs and a weight decay of . We found that using a cosine learning rate schedule, even longer training than 360 epochs or the AdamW optimizer did not make a significant difference in performance for the baselines. These improvements to the baseline lead to an increase in performance, summarized in Table 8.
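As a small worked example of the step schedule used for the baselines, the sketch below computes the learning rate at a given epoch. The function name `baseline_lr` is hypothetical, and we assume epochs are counted from zero, so that the first decay applies from epoch 160 onwards.

```python
def baseline_lr(epoch, base_lr=1e-3, milestones=(160, 240, 320), factor=10.0):
    """Step schedule: divide the learning rate by `factor` at each milestone.

    Assumption: `epoch` is 0-indexed and a decay takes effect at the milestone.
    """
    num_decays = sum(1 for m in milestones if epoch >= m)
    return base_lr / factor ** num_decays

# Learning rate in each of the four phases of the 360-epoch schedule.
phase_lrs = [baseline_lr(e) for e in (0, 160, 240, 320)]
```

Under these assumptions the learning rate takes the values $10^{-3}$, $10^{-4}$, $10^{-5}$ and $10^{-6}$ over the four phases of training.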

B.2 Per-class Results

We provide the per-class mAP results for ScanNetV2 in Table 10 and SUN RGB-D in Table 9. The overall results for these models were reported in the main paper (Table 1).

B.3 Detailed state-of-the-art comparison

We provide a detailed comparison to state-of-the-art detection methods in Table 11. Most state-of-the-art methods build upon VoteNet. H3DNet uses 3D primitives with VoteNet for better localization. HGNet improves VoteNet by using a hierarchical graph network with higher resolution output from its PointNet++ backbone. 3D-MPA uses clustering-based geometric aggregation and graph convolutions on top of the VoteNet method. 3DETR does not use voting and has fewer 3D-specific design decisions than all other methods. 3DETR performs favorably compared to these methods and outperforms VoteNet. This suggests that, like VoteNet, 3DETR can be used as a building block for future 3D detection methods.

B.4 3DETR-m with Vote loss

We tuned the VoteNet loss with the 3DETR-m encoder, and our best tuned model gave 60.7% and 56.1% mAP on ScanNetV2 and SUN RGB-D respectively (settings from Table 3 of the main paper). The VoteNet loss performs better with the 3DETR-m encoder than with the vanilla 3DETR encoder (gains of 6% and 3%), confirming that the VoteNet loss depends on the inductive biases and design of the encoder. Using our set loss is still better than using the VoteNet loss for 3DETR-m (Table 1 vs. results stated in this paragraph). Thus, our set loss design decisions are more broadly applicable than those of VoteNet.

B.5 Adapt queries at test time

We provide additional results for Section 5.1 of the main paper. We change the number of queries used at test time for the same 3DETR model. We show these results in Fig. 7 for two different 3DETR models trained with $64$ and $256$ queries respectively. We observe that the model trained with $64$ queries is more robust to changing the number of queries at test time, but at its best setting achieves worse detection performance than the model trained with $256$ queries. In the main paper, we show results of changing queries at test time for a model trained with $128$ queries, which achieves a good balance between overall performance and robustness to changes at test time.

B.6 Visualizing the encoder attention

We visualize the encoder attention for a 3DETR model trained on the SUN RGB-D dataset in Fig. 8. The encoder focuses on parts of objects.

B.7 Shape Classification setup

We use the processed point clouds with normals from , and sample 8192 points as input for both training and testing our models. Following prior work, we report two metrics to evaluate shape classification performance: 1) Overall Accuracy (OA) evaluates how many point clouds we classify correctly; and 2) Class-Mean Accuracy (mAcc) evaluates the accuracy for each class independently, followed by an average over the per-class accuracies. The latter metric ensures that tail classes contribute equally to the final performance.
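The difference between the two metrics is easiest to see on an imbalanced toy example. The sketch below is a minimal illustration with hypothetical function names and made-up labels; it is not the evaluation code used for our experiments.

```python
from collections import defaultdict

def overall_accuracy(predictions, labels):
    """OA: fraction of point clouds that are classified correctly."""
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)

def class_mean_accuracy(predictions, labels):
    """mAcc: accuracy computed per class, then averaged over classes."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for p, y in zip(predictions, labels):
        total[y] += 1
        correct[y] += int(p == y)
    return sum(correct[c] / total[c] for c in total) / len(total)

# Four samples of a head class (0) and one sample of a tail class (1).
labels = [0, 0, 0, 0, 1]
predictions = [0, 0, 0, 0, 0]  # a classifier that always predicts class 0
oa = overall_accuracy(predictions, labels)       # 4 of 5 correct
macc = class_mean_accuracy(predictions, labels)  # mean of 1.0 and 0.0
```

A classifier that ignores the tail class still reaches an OA of $0.8$ here, while its mAcc drops to $0.5$ because each class contributes equally.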

We use the base 3DETR and 3DETR-m encoder architectures, followed by a 2-layer MLP with batch norm and a dropout of $0.5$ to transform the final features into a distribution over the 40 predefined shape classes. Unlike in the object detection experiments, our point features include the 3D position information concatenated with the 3D normal information at each point. Hence, the first linear layer is correspondingly larger, while the rest of the network follows the same architecture as the encoder used for detection. For the experiments with 3DETR, we prepend a [CLS] token, whose output is used as input to the classification MLP. For the experiments with 3DETR-m that involve masked transformers, we max pool the final layer features, which are then passed into the classifier.

All models are trained for 250 epochs with a learning rate of $4\times 10^{-4}$ and a weight decay of $0.1$, using the AdamW optimizer. We use a linear warmup from $4\times 10^{-7}$ to the initial LR over 20 epochs, and then decay to $4\times 10^{-5}$ over the remaining 230 epochs. The models are trained on 4 GPUs with a batch size of 2 per GPU.
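The schedule above can be written as a function of the epoch. The sketch below is illustrative only: the function name `shape_cls_lr` is hypothetical, and since the shape of the decay is not specified in the text, a cosine decay is assumed here; a linear or step decay between the same endpoints would be equally consistent with the description.

```python
import math

def shape_cls_lr(epoch, base_lr=4e-4, warmup_start_lr=4e-7, final_lr=4e-5,
                 warmup_epochs=20, total_epochs=250):
    """Linear warmup followed by a decay to `final_lr`.

    Assumption: the decay follows a cosine curve (not stated in the text).
    `epoch` may be fractional to query the rate within an epoch.
    """
    if epoch < warmup_epochs:
        progress = epoch / warmup_epochs
        return warmup_start_lr + progress * (base_lr - warmup_start_lr)
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return final_lr + (base_lr - final_lr) * cosine

lr_start = shape_cls_lr(0)    # start of warmup
lr_peak = shape_cls_lr(20)    # end of warmup, start of decay
lr_end = shape_cls_lr(250)    # end of training
```

The three endpoints match the values given in the text regardless of the assumed decay shape: $4\times 10^{-7}$ at the start, $4\times 10^{-4}$ after warmup, and $4\times 10^{-5}$ at the end of training.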