CoBEVT: Cooperative Bird's Eye View Semantic Segmentation with Sparse Transformers

Runsheng Xu, Zhengzhong Tu, Hao Xiang, Wei Shao, Bolei Zhou, Jiaqi Ma

Introduction

Autonomous vehicles (AVs) need accurate perception of their surroundings and robust online mapping capabilities for reliable and safe autonomy. AVs normally operate on the ground plane, so it is natural to represent the semantic and geometric information of their surroundings in bird’s eye view (BEV) maps. Projecting multi-camera views onto the holistic BEV space brings clear strengths in preserving the location and scale of road elements both spatially and temporally, which is critical for various autonomous driving tasks, including scene understanding and planning. It also presents a scalable vision-based solution for real-world deployment without relying on costly LiDAR sensors.

Map-view (or BEV) semantic segmentation is a fundamental task that aims to predict road segments from single or multiple calibrated camera inputs. Significant efforts have been made toward precise camera-based BEV semantic segmentation. One of the most popular techniques is to leverage depth information to infer the correspondences between camera views and the canonical maps. Another family of works directly learns the camera-to-BEV space transformation, either implicitly or explicitly, using attention-based models. Despite the promising results, vision-based perception systems have intrinsic limitations: camera sensors are known to be sensitive to object occlusions and limited depth-of-field, which can lead to inferior performance in areas that are heavily occluded or far from the camera lens.

Recent advancements in Vehicle-to-Vehicle (V2V) communication technologies have made it possible to overcome the limitations of single-agent line-of-sight sensing. That is, multiple connected AVs can share their sensory information with each other through broadcasting, thereby providing multiple viewpoints of the same scene. Several prior works have demonstrated the efficacy of cooperative perception utilizing LiDAR sensors. Nevertheless, whether, when, and how this V2V cooperation can benefit camera-based perception systems has not been explored yet.

In this paper, we present CoBEVT, the first framework in which multiple agents cooperatively generate BEV segmentation maps from their multi-camera sensors via sparse vision transformers. Fig. 1 illustrates the proposed framework. Each AV computes its own BEV representation from its camera rig with the SinBEVT Transformer and then transmits it to the others after compression. Each receiver (i.e., another AV) transforms the received BEV features into its own coordinate system and employs the proposed FuseBEVT for BEV-level aggregation. The core ingredient of these two transformers is a novel fused axial attention (FAX) module, which can search over the whole BEV or camera image space across all agents or camera views via local and global spatial sparsity. FAX contains global attention to model long-distance dependencies and local attention to aggregate regional detailed features, both with low computational complexity. Our extensive experiments on the V2V perception dataset OPV2V show that CoBEVT achieves performance gains of 22.7% and 6.9% over the single-agent baseline and leading multi-agent fusion models, respectively.

Furthermore, we demonstrate the generalizability of the proposed framework in two additional tasks. First, we evaluate SinBEVT alone for single-agent multi-view BEV segmentation. Second, we validate the attention fusion on a different sensor modality, namely multi-agent LiDAR fusion. Our experiments on the nuScenes dataset and the LiDAR track of OPV2V show that CoBEVT exhibits outstanding performance and generalizes well to other tasks. Our contributions are:

We present a generic Transformer framework (CoBEVT) for cooperative camera-based BEV semantic segmentation. CoBEVT delivers superior performance and flexibility, achieving state-of-the-art results on multi-agent camera-based BEV semantic segmentation, single-vehicle multi-view BEV semantic segmentation, and multi-agent LiDAR-based 3D detection.

We propose a novel sparse attention module called fused axial (FAX) attention, which can efficiently capture both local and global relationships between different agents or cameras. We build two instantiations, self-attention (FAX-SA) and cross-attention (FAX-CA), to accommodate different application scenarios.

We construct a large-scale benchmark study on the cooperative BEV map segmentation task with a total of eight strong baseline models. Extensive experimental results and ablation studies show the strong performance and efficiency of the proposed model. All code, baselines, and pre-trained models will be released.

Related Work

V2V perception leverages communication technologies to let AVs share sensing information and thereby enhance their perception. Previous works mainly focus on cooperative 3D object detection with LiDAR. A straightforward sharing strategy is to transmit raw point clouds (i.e., early fusion) or detection outputs (i.e., late fusion). However, these strategies either require a large bandwidth or ignore context information. Recently, V2VNet proposed circulating the intermediate features extracted from 3D backbones (i.e., intermediate fusion) and then using a spatial-aware graph neural network for multi-agent feature aggregation. Following a similar transmission paradigm, OPV2V employs a simple agent-wise single-head attention to fuse all features. F-Cooper uses a simple maxout operation to fuse features. DiscoNet explores knowledge distillation by constraining intermediate feature maps to match their counterparts in the early-fusion teacher model.

Compared to previous multi-agent algorithms, our CoBEVT is the first to employ sparse transformers to explore the correlations between vehicles efficiently and exhaustively. Furthermore, previous approaches mainly focus on cooperative perception with LiDAR, while we propose a low-cost camera-based cooperative perception solution free of LiDAR devices.

2.2 BEV Semantic Segmentation

BEV semantic segmentation takes camera views as input and predicts a rasterized map of the surrounding semantics in the BEV view. A common approach for this task is to use inverse perspective mapping (IPM) to learn the homography matrix for view transformation. As camera images lack explicit 3D information, another family of models includes depth estimation to inject auxiliary 3D information. Recently, researchers have started to directly model the image-to-map correspondence using transformers or MLPs. VPN learns the map-view transformation in a spatial MLP module on flattened camera-view image features. CVT develops a positional embedding for each individual camera depending on its intrinsic and extrinsic calibration. BEVFormer explicitly exploits the camera intrinsics and extrinsics to compute spatial features in the regions of interest of the BEV grid across camera views using a deformable transformer. Our CoBEVT builds upon CVT but further improves on it with our proposed 3D FAX attention, which is more efficient and thus supports a larger BEV embedding size to achieve better accuracy. Furthermore, we develop a hierarchical architecture that aggregates multi-scale camera features to preserve finer image details at a low computational cost.

2.3 Transformers in Vision

Transformers were originally proposed for natural language processing. ViT demonstrated for the first time that a pure Transformer that simply regards image patches as visual words is sufficient for vision tasks given large-scale pre-training. Swin Transformer further improves the generality and flexibility of pure Transformers by restricting attention to local (shifted) windows. For high-dimensional data, Video Swin Transformer extends the Swin approach to shifted 3D space-time windows, achieving high performance with low complexity. Recent works have focused on improving the architectures of attention models, including sparse attention, enlarged receptive fields, pyramidal designs, and efficient alternatives. Our work belongs to the efficient model designs of 3D Transformers for high-dimensional data. While we have only validated the efficacy of the proposed FAX attention for multi-view and multi-agent autonomous perception, we expect it to apply broadly to other vision tasks such as video and multi-modality.

Methodology

We consider a V2V communication system where all AVs can exchange sensing information with others. Assuming the poses of all the agents are accurate and transmitted messages are synchronized, we propose a robust cooperative framework that can exploit the shared information across multiple agents to obtain a holistic BEV segmentation map. The overall architecture of CoBEVT is illustrated in Fig. 1, which consists of: SinBEVT for BEV feature computation (Sec. 3.2), feature compression and sharing (Sec. 3.3), and FuseBEVT for multi-agent BEV fusion (Sec. 3.3). We propose a novel 3D attention mechanism called fused axial attention (FAX, Sec. 3.1) as the core component of SinBEVT and FuseBEVT that can efficiently aggregate features across agents or camera views both locally and globally. We will later show that this FAX attention has great generality, showing efficacy on different modalities for multiple perception tasks, including cooperative/single-agent BEV segmentation based on multi-view cameras and cooperative 3D LiDAR object detection.

Fusing BEV features from multiple agents requires both local and global interactions across all agents’ spatial positions. On the one hand, neighboring AVs often have different occlusion levels on the same object; hence, local attention, which focuses on details, can help construct pixel-to-pixel correspondence on that object. Take the scene in Fig. 2(a) as an example: the ego vehicle should aggregate all the BEV features per location from nearby AVs to obtain reliable estimates. On the other hand, long-range global contextual awareness can also assist in understanding the road topological semantics or traffic states, since the road topology and traffic density ahead of the vehicle are often highly correlated with those behind it. This global reasoning is also beneficial for multi-camera view understanding. In Fig. 2(b), for instance, the same vehicle is split across multiple views, and global attention is well suited to connecting them for semantic reasoning.

Combining this 3D local and global attention with typical designs of Transformers, including Layer Normalization (LN), MLPs, and skip-connections, forms our proposed 3D FAX self-attention (FAX-SA) block, as shown in Fig. 3b. Our 3D FAX attention only requires $\mathcal{O}(2(NP)^{2}HWC)$ complexity assuming $P\sim G$ (typically $N\leq 5$, $P,G\in\{8,16\}$), significantly cheaper than the full attention $\mathcal{O}((NHW)^{2}C)$. Here $N$ is the number of agents (or camera views), $H\times W$ is the spatial size of the feature map, $C$ is the channel dimension, and $P$ and $G$ are the local window size and the global grid size. Still, FAX attention enjoys non-local 3D interactions by seeing through all the agents, which is more expressive than local attention approaches.
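As a rough check of the complexity comparison in this paragraph, the following sketch counts pairwise attention operations for dense attention and for one local plus one global FAX branch. The function names and the example sizes ($N=5$, $H=W=32$, $C=128$, $P=G=8$) are illustrative assumptions, and constant factors such as the linear projections are ignored.

```python
def full_attention_cost(n_agents: int, h: int, w: int, c: int) -> int:
    """Dense attention over all N*H*W tokens: (N*H*W)^2 * C."""
    tokens = n_agents * h * w
    return tokens * tokens * c


def fax_attention_cost(n_agents: int, h: int, w: int, c: int, p: int) -> int:
    """One local plus one global branch, assuming P == G.

    Each 3D window holds N*P*P tokens, so attention inside it costs
    (N*P*P)^2 * C. There are (H/P)*(W/P) windows, which gives
    (N*P)^2 * H*W*C per branch and twice that for both branches.
    """
    if h % p or w % p:
        raise ValueError("H and W must be divisible by the window size P")
    window_tokens = n_agents * p * p
    num_windows = (h // p) * (w // p)
    per_branch = num_windows * window_tokens * window_tokens * c
    return 2 * per_branch


# Illustrative sizes only (assumed for this sketch).
full = full_attention_cost(5, 32, 32, 128)
fax = fax_attention_cost(5, 32, 32, 128, p=8)
print(full, fax, full / fax)
```

The ratio between the two costs is $HW/(2P^{2})$, so for a fixed window size the saving grows with the BEV resolution.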

3.2 SinBEVT for Single-agent BEV Feature Computation

We take a BEV processing architecture similar to CVT, wherein a learnable BEV embedding is initialized as the query to interact with encoded multi-view camera features, as shown in Fig. 3a. We have observed that CVT uses a low-resolution BEV query that fully cross-attends to image features, which is efficient but degrades performance on small objects. Thus, CoBEVT learns a high-resolution BEV embedding instead and then uses a hierarchical structure to refine the BEV features at reduced resolution. To efficiently query features from camera encoders at high resolution, the FAX-SA module is further extended to build a FAX cross-attention (FAX-CA) module (Fig. 3b), in which the query vector is obtained from the BEV embedding, whereas the key/value vectors are projected from multi-view camera features. Before applying cross-attention, we add a camera-aware positional encoding derived from camera intrinsics and extrinsics to learn implicit geometric reasoning from individual camera views to a canonical map-view representation, following CVT. This rather simple, implicit approach strikes a good balance between performance and efficiency, and our FAX attention allows for global interactions in a hierarchical network, yielding better accuracy than low-resolution isotropic approaches such as CVT.

3.3 FuseBEVT for Multi-agent BEV Feature Fusion

Decoder. We apply a series of lightweight convolutional layers and bilinear upsampling operations to the aggregated BEV representation to generate the final segmentation output.

Experiments

We evaluate the effectiveness of the proposed CoBEVT on the camera track of the V2V perception dataset OPV2V. To show the flexibility and generality of our CoBEVT, we also conduct experiments on the LiDAR track of OPV2V and the autonomous driving dataset nuScenes.

OPV2V is a large-scale V2V perception dataset collected with CARLA and the cooperative driving automation tool OpenCDA. It contains 73 diverse scenarios with an average duration of 25 seconds. In each scenario, a varying number (2 to 7) of AVs appear simultaneously, and each one is equipped with one LiDAR sensor and 4 cameras facing different directions to cover a 360° horizontal field of view. Our main experiment only utilizes the camera rigs of the dataset, and we use Intersection over Union (IoU) between the map prediction and the ground-truth map-view labels as the performance metric. Since OPV2V has multiple AVs in the same scene, we select a fixed one as the ego vehicle during testing and evaluate the 100m×100m area around it with a 39cm map resolution.
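For reference, the IoU metric for a single class can be computed as in the sketch below. The nested-list map representation, the `cls` argument, and returning NaN when the class is absent from both maps are assumptions made for illustration, not details of the OPV2V evaluation code.

```python
def iou(pred, target, cls=1):
    """Intersection over Union of one class between two label maps.

    `pred` and `target` are equally sized 2D lists of integer class ids.
    """
    intersection = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            p_hit = p == cls
            t_hit = t == cls
            intersection += int(p_hit and t_hit)
            union += int(p_hit or t_hit)
    # Assumed convention: undefined IoU is reported as NaN.
    return intersection / union if union else float("nan")


pred = [[1, 1],
        [0, 0]]
target = [[1, 0],
          [1, 0]]
print(iou(pred, target))
```

In this toy case one pixel is shared and three pixels are covered by either map, giving an IoU of 1/3.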

To demonstrate its generality, we also evaluate CoBEVT on the OPV2V LiDAR-track 3D detection task. We use the same evaluation range as in prior work, and the detection performance is measured by Average Precision (AP) at an IoU threshold of 0.7. For both the camera and LiDAR tracks, there are 6764/1981/2719 frames in the train/validation/test sets, respectively.

The nuScenes dataset contains 1000 diverse scenes, each around 20 seconds long. In total, there are 40K sampled frames in this dataset, and the data capture a 360° view of the surroundings using 6 cameras. We use the ground truth from prior work. The evaluation ranges are [-50m, 50m] for the X and Y axes, and the resolution of the BEV grid is 0.5m.

4.2 Experiments Setup

Implementation details. We assume all the AVs have a 70m communication range, following prior work, and vehicles outside this broadcasting radius of the ego vehicle do not collaborate with it. For the OPV2V camera track, we choose ResNet34 as the image feature extractor in SinBEVT. The transmitted BEV intermediate representation has a resolution of $32\times 32\times 128$. For the multi-agent fusion, our FuseBEVT component has 3 encoder layers and a window size of 8 for both local and global attention. We train the whole model end-to-end with the Adam optimizer and a cosine annealing learning rate scheduler. We use a weighted cross-entropy loss and train all models for 60 epochs with a batch size of 1 per GPU. Please refer to the supplementary materials for more details, as well as the configurations on nuScenes and the OPV2V LiDAR track.
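The communication-range rule above amounts to a distance filter on candidate collaborators. A minimal sketch follows, assuming 2D positions in meters in a shared frame and treating the 70m boundary as inclusive; the function name, the agent ids, and the dictionary layout are hypothetical.

```python
import math


def collaborators_in_range(ego_xy, agents_xy, comm_range=70.0):
    """Return the ids of agents within `comm_range` meters of the ego vehicle.

    `agents_xy` maps an agent id to its (x, y) position in meters.
    """
    ego_x, ego_y = ego_xy
    return [
        agent_id
        for agent_id, (x, y) in agents_xy.items()
        if math.hypot(x - ego_x, y - ego_y) <= comm_range
    ]


agents = {"cav_1": (30.0, 40.0), "cav_2": (70.0, 1.0), "cav_3": (0.0, 70.0)}
print(collaborators_in_range((0.0, 0.0), agents))
```

Here `cav_1` (50m away) and `cav_3` (exactly 70m away) are kept, while `cav_2` falls just outside the radius.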

Compared methods. For the multi-agent perception task, we consider the single-agent perception system No Fusion as the baseline. We compare with the state-of-the-art multi-agent perception algorithms F-Cooper, AttFuse, V2VNet, and DiscoNet. We also implement a straightforward fusion strategy, Map Fusion, which transmits the segmentation map instead of BEV features and fuses all maps by selecting the closest agent’s prediction for each pixel.
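The Map Fusion baseline described above can be sketched as follows. This is only an illustration under stated assumptions: every agent's segmentation map is taken to be already warped into the ego frame, agent positions are given as (row, column) pixel coordinates in that frame, and ties go to the first agent listed. The names `map_fusion` and `agent_pixels` are hypothetical.

```python
def map_fusion(maps, agent_pixels):
    """Fuse per-agent label maps by taking, per pixel, the nearest agent's label.

    maps:         list of equally sized 2D lists (one label map per agent).
    agent_pixels: list of (row, col) positions, one per agent.
    """
    height, width = len(maps[0]), len(maps[0][0])
    fused = [[0] * width for _ in range(height)]
    for r in range(height):
        for c in range(width):
            # Squared Euclidean distance is enough to rank agents.
            nearest = min(
                range(len(maps)),
                key=lambda k: (agent_pixels[k][0] - r) ** 2
                + (agent_pixels[k][1] - c) ** 2,
            )
            fused[r][c] = maps[nearest][r][c]
    return fused


# Two agents at opposite ends of a 1 x 4 strip, predicting labels 1 and 2.
maps = [[[1, 1, 1, 1]], [[2, 2, 2, 2]]]
print(map_fusion(maps, agent_pixels=[(0, 0), (0, 3)]))
```

Each half of the strip takes the label of the agent nearest to it, which is the hard per-pixel selection that the feature-level fusion methods are compared against.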

For the nuScenes dataset, we compare against state-of-the-art models including CVT, FIERY, View Parsing Network (VPN), Orthographic Feature Transform (OFT), and Lift-Splat-Shoot. All models only utilize single-timestamp data for fair comparisons. We intentionally use the same image feature extractor (EfficientNet-B4) and decoder as CVT and FIERY.

4.3 Quantitative Evaluation

OPV2V camera-track results. To make a fair comparison, we first employ CVT to extract the BEV feature from camera rigs for all methods and only use the fusion component of CoBEVT (i.e., FuseBEVT) to compare with other fusion models. We then compare it with our complete CoBEVT to show the effectiveness of SinBEVT as well. As shown in Tab. 3, all cooperative methods perform better than No Fusion, which demonstrates the benefit of a multi-agent perception system. Among all fusion models, our FuseBEVT achieves the best IoU for all classes, outperforming the second-best method by 5.5%, 1.4%, and 3.4% on vehicle, drivable area, and lane, respectively. More importantly, by replacing CVT with our SinBEVT for feature extraction, our CoBEVT further increases the accuracy by 1.4%, 0.9%, and 3.8% on the three classes compared to using FuseBEVT only.

OPV2V LiDAR-track results. As Tab. 3 reveals, our FuseBEVT also has the best performance on the LiDAR-track task, improving on the single-agent system by 25.0% and outperforming the leading algorithm DiscoNet by 1.7%. Furthermore, our method exhibits great robustness against LiDAR feature compression, with only a 0.3% drop at a $64\times$ compression rate.

nuScenes vehicle map-view segmentation. Our SinBEVT runs at 35 FPS on an RTX 2080 with a 37.1 IoU score and 1.6 M parameters, achieving the best accuracy with real-time performance. Compared to the state-of-the-art method CVT, it is 1.1% higher with similar parameters and latency.

Effect of compression rate. Data transmission size is a critical factor in V2V applications. Here we study the effect of different compression rates on our CoBEVT by adjusting the $1\times 1$ convolution. Tab. 4 shows that CoBEVT is insensitive to compression, and it can still beat other fusion methods even at a large compression rate of $64\times$.
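To give a sense of scale, the sketch below estimates the payload of one shared BEV feature map at different compression rates. It assumes that compression shrinks only the channel dimension of the $32\times 32\times 128$ feature (a unit-stride $1\times 1$ convolution mixes channels per pixel and leaves the spatial size unchanged) and that values are stored as 32-bit floats; both are assumptions rather than reported settings.

```python
def message_bytes(h=32, w=32, c=128, rate=1, bytes_per_value=4):
    """Approximate payload of one BEV feature map after channel compression.

    `rate` is the compression rate; the channel count is assumed to be
    divided by it. `bytes_per_value=4` assumes float32 storage.
    """
    if rate < 1 or c % rate:
        raise ValueError("rate must be a positive divisor of the channel count")
    return h * w * (c // rate) * bytes_per_value


for rate in (1, 8, 64):
    print(rate, message_bytes(rate=rate))
```

Under these assumptions the uncompressed message is 512 KiB per agent per frame, and a $64\times$ rate leaves 2 channels and 8 KiB.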

4.4 Qualitative Analysis

Fig. 4 shows the qualitative results of CoBEVT on scenes containing 3 AVs. In each row, we draw the front camera image of each AV along with the ground truth and prediction pairs. Our framework can overcome most of the occlusions and perceive distant objects accurately, benefiting from our Transformer design that learns from all agents and views. However, one limitation we have observed is that the predictions for multiple nearby vehicles are sometimes merged, which may be attributed to the combined effects of the low-resolution BEV embedding and the complicated ground truth in dense traffic.

4.5 Ablation Study

Component analysis. Tab. 5 shows the importance of local and global attention in the multi-agent fusion model FuseBEVT, while other components are retained in CoBEVT. Both attention blocks significantly contribute to the final performance.

Robustness to camera dropout. Sensor failure during driving can lead to fatal accidents, so we investigate how well our CoBEVT handles it. We randomly drop $n$ of the ego vehicle’s 4 cameras and report the performance decrease for both SinBEVT (no collaboration) and CoBEVT in Fig. 5a. By introducing sensing cooperation, driving safety can be significantly improved: even if all ego cameras break down, CoBEVT can still reach an IoU score of 44.3.
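The dropout protocol can be simulated with a few lines. This is a sketch only: the camera names are hypothetical, uniform sampling without replacement is an assumption, and how the model treats the missing views is not covered.

```python
import random


def drop_cameras(camera_ids, n, rng):
    """Return the cameras that remain after dropping `n` of them at random."""
    if not 0 <= n <= len(camera_ids):
        raise ValueError("n must be between 0 and the number of cameras")
    dropped = set(rng.sample(camera_ids, n))
    # Preserve the original camera order for the surviving views.
    return [cam for cam in camera_ids if cam not in dropped]


cameras = ["front", "right", "left", "back"]  # hypothetical names
rng = random.Random(0)  # fixed seed so the experiment is repeatable
for n in range(len(cameras) + 1):
    print(n, drop_cameras(cameras, n, rng))
```

With `n` equal to the number of cameras the ego vehicle contributes no views at all, which corresponds to the all-cameras-failed case discussed above.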

Number of agents. Here we study the influence of the number of collaborators on CoBEVT. As Fig. 5b shows, increasing the number of collaborators generally improves performance, whereas the gain becomes marginal when the number of agents is greater than 4.

Inference speed of FuseBEVT. Real-time multi-agent feature fusion is critical for real-world deployment. Here we examine the inference speed of FuseBEVT with different BEV feature map spatial resolutions (from 16 to 64) and numbers of agents on an RTX 3090. Fig. 5c shows that our fusion algorithm can achieve real-time performance under various collaboration scenarios.

Conclusion and Limitations

In this paper, we propose a holistic vision Transformer dubbed CoBEVT for multi-view cooperative semantic segmentation. We propose a fused axial attention (FAX) mechanism that allows for local and global interactions across all views and agents. Extensive experiments on both simulated and real-world datasets show that CoBEVT achieves superior performance on multi-camera cooperative BEV segmentation. It can also be adapted to other tasks and substantially improve multi-agent LiDAR detection and single-agent map-view segmentation.

Limitations. Although the proposed single-agent model outperforms prior methods on the real-world nuScenes dataset, the entire cooperative framework has been trained and validated on simulated datasets only, and thus its real-world generalization capability remains unknown. The proposed approach does not explicitly model realistic V2V challenges such as asynchronization and position errors, which may impair its robustness under such noise. The perception robustness against different domains such as severe weather or lighting conditions needs further examination. Addressing these limitations requires future research on real-world, realistic, and diverse cooperative datasets and benchmarks.

This material is supported in part by the Federal Highway Administration Exploratory Advanced Research (EAR) Program, and by the US National Science Foundation through Grant CMMI #1901998. We thank Xiaoyu Dong for her insightful discussions.

Appendix

In this supplementary material, we will first provide more details about the camera track of the OPV2V dataset (Sec. A). Afterwards, the model details of the proposed FAX attention, and implementation details of our CoBEVT models on different datasets will be illustrated in Sec. B and Sec. C. Finally, we show more qualitative results for all three tasks tested in the main paper in Sec. D.

Appendix A The Camera Track of OPV2V dataset

Sensor Configuration. In OPV2V, every AV is equipped with 4 cameras facing different directions to cover the $360^{\circ}$ surroundings, as Fig. 6 shows. Each camera has an $800\times 600$ spatial resolution and a $110^{\circ}$ FOV, which introduces a $10^{\circ}$ view overlap between any neighboring pair.

Groundtruth. The BEV semantic segmentation ground-truth mask has a pixel resolution of $256\times 256$ and covers a $100\times 100$ m area around the ego vehicle, which corresponds to a map sampling resolution of $0.39$ m/pixel. The authors also provide corresponding visibility masks, in which all dynamic objects that can be seen by any AV’s camera rig are marked as visible and all others as invisible. Similar to previous works, we only consider visible objects during both training and testing.
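The stated sampling resolution follows from dividing the covered extent by the mask size, $100/256\approx 0.39$ m per pixel. The sketch below makes this concrete and maps a point in the ego frame to a mask index; the axis orientation (ego at the mask center, indices increasing with $x$ and $y$) and the function name are assumptions made for illustration.

```python
import math

MAP_SIZE_M = 100.0   # side length of the covered area in meters
MAP_SIZE_PX = 256    # side length of the ground-truth mask in pixels
RESOLUTION = MAP_SIZE_M / MAP_SIZE_PX  # 0.390625 m per pixel


def world_to_pixel(x, y):
    """Map ego-frame coordinates in meters to (row, col) mask indices.

    Assumed convention: the ego vehicle sits at the mask center and the
    row and column indices grow with y and x respectively.
    """
    half = MAP_SIZE_M / 2
    if not (-half <= x < half and -half <= y < half):
        raise ValueError("point lies outside the evaluated area")
    col = math.floor((x + half) / RESOLUTION)
    row = math.floor((y + half) / RESOLUTION)
    return row, col


print(RESOLUTION)
print(world_to_pixel(0.0, 0.0))
print(world_to_pixel(49.9, -50.0))
```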

Appendix B Model Details

We give more details about the proposed 3D fused axial attention (FAX) below.

3D Relative Attention. The vanilla attention mechanism is a global mixing operator based on the weighted sum of all the spatial locations, where the weights are calculated by normalized pairwise similarity. Formally, the attention operator can be defined as

$$\mathsf{Attention}(\mathbf{Q},\mathbf{K},\mathbf{V})=\mathsf{softmax}\left(\frac{\mathbf{Q}\mathbf{K}^{\top}}{\sqrt{d}}\right)\mathbf{V},\qquad(3)$$

where $\mathbf{Q},\mathbf{K},\mathbf{V}$ are the query, key, and value matrices projected from the input tensor, and $d$ is the channel dimension of the queries and keys. Multi-head attention is an extension of (3) in which we split the channels into multiple “heads” and run attention on each head separately and in parallel. For simplicity, we only write single-head equations here, but we always use multi-head variants in the actual implementations.
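As a concrete reading of the single-head operator, the following dependency-free sketch computes attention as a softmax-weighted sum of values, assuming the standard scaled dot-product form $\mathsf{softmax}(\mathbf{Q}\mathbf{K}^{\top}/\sqrt{d})\mathbf{V}$. It omits the learned projections, multiple heads, and any relative position term, and the list-of-lists layout is chosen only for readability.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(q, k, v):
    """Single-head scaled dot-product attention.

    q: list of query vectors, k: list of key vectors (same length as q's
    vectors), v: list of value vectors (one per key).
    """
    scale = 1.0 / math.sqrt(len(q[0]))
    out = []
    for query in q:
        # Normalized pairwise similarity between this query and every key.
        weights = softmax(
            [scale * sum(a * b for a, b in zip(query, key)) for key in k]
        )
        # Weighted sum of the value vectors.
        out.append(
            [sum(wt * val[c] for wt, val in zip(weights, v))
             for c in range(len(v[0]))]
        )
    return out


print(attention(q=[[0.0, 0.0]], k=[[1.0, 0.0], [0.0, 1.0]], v=[[2.0], [4.0]]))
```

With a zero query both keys receive equal weight, so the output is the mean of the two values.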

We then denote the $\mathsf{Fused\text{-}Unblock}(\cdot)$ operation as the reverse of the above 3D window partition procedure. Likewise, for the global attention branch, we define another 3D grid partitioning operator as $\mathsf{Fused\text{-}Grid}$ with the grid parameter $G$ , representing dividing the input feature using a uniform 3D grid of size $N\times G\times G$ . Note that unlike Eq. (5), we need to apply an extra $\mathsf{Transpose}$ to place the grid dimension in the assumed “spatial axis”:

with its inverse operator $\mathsf{Fused\text{-}Ungrid}$ that reverses the 3D-gridded input back to the original tensor shape.
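The difference between the two partitions can be illustrated by listing which spatial positions end up in the same attention group. The sketch below assumes that the local branch groups contiguous $P\times P$ tiles and that the grid branch gathers positions at a fixed stride so that each group of $G\times G$ positions spans the whole map; the stride convention and the helper names are assumptions of this sketch. In the 3D case, every group additionally contains the features of all $N$ agents at those positions.

```python
from collections import defaultdict


def block_key(h, w, P):
    """Local branch: positions in the same P x P tile share a group id."""
    return (h // P, w // P)


def grid_key(h, w, H, W, G):
    """Global branch: positions sampled at stride H/G and W/G share a group id."""
    return (h % (H // G), w % (W // G))


def group_positions(H, W, key_fn):
    """Collect the (h, w) positions of an H x W map by group id."""
    groups = defaultdict(list)
    for h in range(H):
        for w in range(W):
            groups[key_fn(h, w)].append((h, w))
    return dict(groups)


H = W = 4
local_groups = group_positions(H, W, lambda h, w: block_key(h, w, P=2))
global_groups = group_positions(H, W, lambda h, w: grid_key(h, w, H, W, G=2))
print(local_groups[(0, 0)])   # a contiguous 2 x 2 tile
print(global_groups[(0, 0)])  # four positions spread over the whole map
```

Both partitions produce groups of the same size when $P=G$, which is why the two branches have matching cost while covering complementary neighborhoods.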

Now we are ready to present the whole 3D FAX attention module. The 3D local block attention can be expressed as:

while the sparse global 3D Attention can be expressed as:

where the $\mathbf{QKV}$ matrices in Eq. (4) are linearly projected from the input $\mathbf{x}$ and are omitted for simplicity. LN denotes Layer Normalization, and MLP is a standard MLP network consisting of two linear layers applied on the channel dimension: $\mathbf{x}\leftarrow W_{2}\text{GELU}(W_{1}\mathbf{x})$.

Appendix C Implementation Details

In the following, we show the detailed architectures for the three experiments, respectively.

We illustrate the architectural specifications of CoBEVT in Table A2. Further illustrations are presented below.

Model Separation. Following prior work, we train separate models for dynamic-object and static-layout BEV semantic segmentation. Both models have the same configuration except for the last layer of the network.

C.2 nuScenes

To make a fair comparison, we strictly follow the same experiment setting as CVT.

Image Encoder. We follow CVT and FIERY and use EfficientNet-B4 as the image feature extractor. We compute features at three scales: (56, 120), (28, 60), and (14, 30).

SinBEVT. The BEV query starts with a size of $100\times 100\times 32$ and ends with a size of $25\times 25\times 128$. We set the window/grid sizes of the image features and the BEV query for the three FAX-CA blocks to (6, 12), (6, 12), (14, 30) and (10, 10), (10, 10), (25, 25), respectively. The main architecture is the same as the SinBEVT specifications shown in Table A2.

Decoder. The decoder structure is the same as in CVT. It consists of three (bilinear upsample + conv) layers that upsample the BEV feature to the final output size ($200\times 200$).
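The sizes are consistent with each of the three stages doubling the spatial resolution, starting from the final $25\times 25$ BEV feature. The per-stage factor of 2 is an assumption inferred from the start and end sizes, and the helper below is purely illustrative.

```python
def decoder_sizes(start=25, num_layers=3, scale=2):
    """Spatial side length after each (upsample + conv) stage.

    Assumes every stage upsamples by the same integer factor `scale`.
    """
    sizes = [start]
    for _ in range(num_layers):
        sizes.append(sizes[-1] * scale)
    return sizes


print(decoder_sizes())
```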

Training. We train our models with focal loss and a batch size of 4 per GPU for 30 epochs. We employ the AdamW optimizer with a one-cycle learning rate scheduler. The whole training process takes around 8 hours on 4 RTX 3090 GPUs.

Evaluation. We evaluate the 100m×100m area around the vehicle with a 50cm sampling resolution. We use the Intersection-over-Union (IoU) score between the model predictions and the ground-truth segmentation mask.

C.3 OPV2V LiDAR Track

All the comparison methods have the same configurations except for the fusion component.

Point Cloud Encoder. We select PointPillar as the point cloud feature extractor and set the voxel resolution to (0.4, 0.4, 4) on the x, y, and z axes. The architecture settings are the same as in prior work. The extracted feature has a final resolution of $176\times 48\times 256$.

FuseBEVT. The configurations of FuseBEVT are the same as the ones in OPV2V camera track.

Detection Head and Training. We simply apply two $3\times 3$ convolution layers for classification and regression, respectively. We train the models using the AdamW optimizer with a multi-step learning rate scheduler. The learning rate starts at $0.001$ and decays by a factor of 10 every 10 epochs.
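The learning rate schedule described above can be written as a step function of the epoch index. In this sketch the zero-based epoch counter and the function name are assumptions; the base rate, step length, and decay factor are the values stated in the text.

```python
def multistep_lr(epoch, base_lr=1e-3, step=10, gamma=0.1):
    """Learning rate at a zero-based epoch: multiplied by `gamma` every `step` epochs."""
    return base_lr * gamma ** (epoch // step)


for epoch in (0, 9, 10, 25):
    print(epoch, multistep_lr(epoch))
```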

Appendix D More Qualitative Results

OPV2V camera track. Fig. 7 and Fig. 8 show the visual comparisons between our CoBEVT and others on OPV2V camera track. Our method significantly outperforms others both on dynamic object prediction and road topology segmentation in most of the scenarios.

OPV2V LiDAR track. We show detection visualizations on the OPV2V LiDAR track at 4 different busy intersections in Fig. 9 and Fig. 10. Compared to other state-of-the-art fusion methods, including AttFuse, F-Cooper, V2VNet, and DiscoNet, our CoBEVT achieves more robust performance in general. We carefully examined the detection visualizations of our method and the previous SOTA method DiscoNet. In Fig. 9 and Fig. 10, red circles highlight the objects whose detection results clearly differ between the two methods. Our results have fewer undetected objects and fewer displacements.

nuScenes. Fig. 11 depicts the qualitative results of our SinBEVT on nuScenes under different road topologies, traffic situations, and lighting conditions. Our method can recognize most objects and robustly estimate the complicated road layout, demonstrating the strong generalization ability of the proposed FAX attention for various autonomous driving tasks.