Overview of BEV Perception Algorithms: A Next-Generation Perception Algorithm for Autonomous Driving

1 The BEV perception algorithm concept

BEV is short for Bird's-Eye-View, i.e. a bird's-eye (top-down) view of the scene. BEV perception algorithms have several advantages.

First, the BEV view suffers far less from occlusion. Because of perspective projection, real-world objects are easily blocked by other objects in 2D images, so traditional 2D perception methods can only perceive visible targets and can do nothing about occluded ones.

In BEV space, temporal information can be fused easily, so the algorithm can reason about occluded regions based on prior knowledge and "imagine" whether objects are present there. Although such "imagined" objects inevitably contain some guesswork, they are still valuable to the downstream planning and control modules.
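
To make the temporal-fusion idea concrete, the toy sketch below warps the previous frame's BEV feature map into the current ego frame using the ego motion and then blends it with the current features. It is a minimal PyTorch sketch: the grid size, BEV extent, coordinate convention, and the naive averaging fusion are illustrative assumptions, not any particular paper's implementation.

```python
import math
import torch
import torch.nn.functional as F

def warp_prev_bev(prev_bev, yaw, tx, ty, bev_extent_m=50.0):
    """Warp the previous BEV feature map (1, C, H, W) into the current ego frame.

    yaw, tx, ty describe the ego motion from the current frame back to the previous
    frame (radians / meters); bev_extent_m is the half-size of the BEV grid in
    meters, used to normalize translations to the [-1, 1] grid coordinates.
    """
    cos_y, sin_y = math.cos(yaw), math.sin(yaw)
    # Affine transform in normalized BEV coordinates (sign/axis conventions are
    # illustrative; a real system derives this from the ego-pose records).
    theta = torch.tensor([[cos_y, -sin_y, tx / bev_extent_m],
                          [sin_y,  cos_y, ty / bev_extent_m]],
                         dtype=prev_bev.dtype).unsqueeze(0)
    grid = F.affine_grid(theta, prev_bev.shape, align_corners=False)
    return F.grid_sample(prev_bev, grid, align_corners=False)

# Toy usage: a 256-channel 200x200 BEV grid; the ego moved 2 m and turned 5 degrees.
prev_bev = torch.randn(1, 256, 200, 200)
curr_bev = torch.randn(1, 256, 200, 200)
aligned = warp_prev_bev(prev_bev, yaw=math.radians(5.0), tx=2.0, ty=0.0)
fused = 0.5 * (curr_bev + aligned)   # naive blend; methods like BEVFormer use attention here
print(fused.shape)                   # torch.Size([1, 256, 200, 200])
```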

In addition, object scale varies little in the BEV representation, and feeding the network data with relatively consistent scale leads to better perception results.

2 Introduction to BEV perception datasets

2.1 The KITTI-360 dataset

KITTI-360 is a large-scale dataset with rich sensory information and complete annotations. The data were recorded in several suburbs of Karlsruhe, Germany, over a driving distance of 73.7 km, corresponding to more than 320,000 images and 100,000 laser scans. Static and dynamic 3D scene elements are annotated with coarse bounding primitives, and this information is transferred to the image domain, providing dense semantic and instance annotations for both 3D point clouds and 2D images.

For data collection, the station wagon was equipped with a 180° fisheye camera on each side and a 90° perspective stereo camera (60 cm baseline) at the front. In addition, a Velodyne HDL-64E and a SICK LMS 200 laser scanner were mounted on the roof in a push-broom configuration. This setup is similar to the one used for KITTI, except that the additional fisheye cameras and push-broom laser scanner provide a full 360° field of view, whereas KITTI only offers perspective images and Velodyne laser scans with a 26.8° vertical field of view. The system is also equipped with an IMU/GPS localization unit. The sensor layout of the collection vehicle is shown in the figure below.

Figure 1. The KITTI-360 data collection vehicle

2.2 The nuScenes dataset

nuScenes was the first large-scale dataset to provide the full sensor suite of an autonomous vehicle: 6 cameras, 1 LiDAR, 5 millimeter-wave radars, plus GPS and IMU. Compared with the KITTI dataset, it contains more than 7 times as many object annotations. The sensor layout of the collection vehicle is shown in the figure below.
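
As a quick illustration, the snippet below shows how the full sensor suite of one annotated keyframe can be browsed. It is a sketch assuming the nuscenes-devkit package is installed and the v1.0-mini split has been downloaded; the dataroot path is a placeholder.

```python
from nuscenes.nuscenes import NuScenes

# dataroot is a placeholder path; adjust it to the local nuScenes installation.
nusc = NuScenes(version='v1.0-mini', dataroot='/data/sets/nuscenes', verbose=True)

sample = nusc.sample[0]                        # one annotated keyframe (2 Hz)
for channel, sd_token in sample['data'].items():
    sd = nusc.get('sample_data', sd_token)     # camera / LiDAR / radar record
    print(f"{channel:>20s}: {sd['filename']}")

# Object annotations attached to this keyframe.
for ann_token in sample['anns'][:5]:
    ann = nusc.get('sample_annotation', ann_token)
    print(ann['category_name'], ann['translation'])
```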

Figure 2. The nuScenes data collection vehicle

3 BEV perception algorithm classification

Based on the input data, BEV perception research is mainly divided into three branches: BEV Camera, BEV LiDAR, and BEV Fusion. The figure below gives an overview of the BEV perception family. Specifically, BEV Camera denotes vision-only or vision-centric algorithms that perform 3D object detection or segmentation from multiple surround-view cameras; BEV LiDAR covers detection or segmentation from point cloud input; and BEV Fusion describes mechanisms that fuse multiple sensor inputs, such as cameras, LiDAR, global navigation satellite systems, odometry, high-definition maps, and the CAN bus.
Figure 3. Overview of the BEV perception family
As shown in the figure, the basic perception tasks of autonomous driving (classification, detection, segmentation, tracking, etc.) sit in a three-level hierarchy, with the concept of BEV perception in the middle. By combining different sensor inputs, basic tasks, and product scenarios, a specific BEV perception algorithm can be formulated. For example, M2BEV and BEVFormer belong to the vision-based BEV direction and perform multiple tasks including 3D object detection and BEV map segmentation, while BEVFusion designs a fusion strategy in BEV space to perform 3D detection and tracking from camera and LiDAR inputs simultaneously.

A representative work in the BEV Camera branch is BEVFormer, which performs both 3D object detection and map segmentation and achieved state-of-the-art (SOTA) results at the time.

3.1 The BEVFormer pipeline:

1) A Backbone + Neck (ResNet-101-DCN + FPN) extracts multi-scale features from the surround-view images;

2) The Encoder module proposed in the paper (comprising a Temporal Self-Attention module and a Spatial Cross-Attention module) transforms the surround-view image features into BEV features;

3) A Decoder module similar to that of Deformable DETR performs the classification and localization tasks of 3D object detection;

4) Positive and negative samples are defined with the Hungarian matching algorithm commonly used in DETR-style Transformers, which selects the assignment minimizing the total cost of Focal Loss + L1 Loss;

5) The loss is computed as a Focal Loss classification term plus an L1 Loss regression term (a sketch of steps 4 and 5 is given below, after Figure 4);

6) Backpropagation updates the network parameters.
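
The following is a heavily simplified, PyTorch-style sketch of how steps 1)-3) of this pipeline fit together. All module names and sizes are illustrative assumptions, and ordinary multi-head attention stands in for the deformable attention used in the paper; this is not the official BEVFormer implementation.

```python
import torch
import torch.nn as nn

class TinyBEVFormer(nn.Module):
    def __init__(self, bev_h=50, bev_w=50, embed_dim=256, num_queries=100, num_classes=10):
        super().__init__()
        # 1) Backbone + Neck stand-in: a tiny CNN instead of ResNet-101-DCN + FPN.
        self.backbone = nn.Sequential(
            nn.Conv2d(3, embed_dim, kernel_size=7, stride=4, padding=3), nn.ReLU(),
            nn.Conv2d(embed_dim, embed_dim, kernel_size=3, stride=2, padding=1),
        )
        # Learnable BEV queries, one per BEV grid cell.
        self.bev_queries = nn.Parameter(torch.randn(bev_h * bev_w, embed_dim))
        # 2) Encoder: temporal self-attention + spatial cross-attention (standard
        #    multi-head attention stands in for the paper's deformable attention).
        self.temporal_self_attn = nn.MultiheadAttention(embed_dim, 8, batch_first=True)
        self.spatial_cross_attn = nn.MultiheadAttention(embed_dim, 8, batch_first=True)
        # 3) Decoder: DETR-style object queries attending to the BEV features.
        self.object_queries = nn.Parameter(torch.randn(num_queries, embed_dim))
        self.decoder_attn = nn.MultiheadAttention(embed_dim, 8, batch_first=True)
        self.cls_head = nn.Linear(embed_dim, num_classes)  # classification branch
        self.reg_head = nn.Linear(embed_dim, 7)            # (x, y, z, w, l, h, yaw)

    def forward(self, imgs, prev_bev=None):
        # imgs: (B, N_cams, 3, H, W) surround-view images.
        b = imgs.shape[0]
        feats = self.backbone(imgs.flatten(0, 1))           # (B*N, C, h', w')
        feats = feats.flatten(2).transpose(1, 2)             # (B*N, h'*w', C)
        feats = feats.reshape(b, -1, feats.shape[-1])        # (B, N*h'*w', C)

        bev = self.bev_queries.unsqueeze(0).repeat(b, 1, 1)  # (B, H_bev*W_bev, C)
        # Temporal self-attention: fuse BEV features from the previous timestep.
        history = bev if prev_bev is None else prev_bev
        bev, _ = self.temporal_self_attn(bev, history, history)
        # Spatial cross-attention: lift multi-camera image features into BEV space.
        bev, _ = self.spatial_cross_attn(bev, feats, feats)

        # Decoder: object queries read out 3D detections from the BEV features.
        q = self.object_queries.unsqueeze(0).repeat(b, 1, 1)
        q, _ = self.decoder_attn(q, bev, bev)
        return self.cls_head(q), self.reg_head(q), bev       # class logits, boxes, BEV map

# Toy usage with two timesteps to show the temporal fusion of BEV features.
model = TinyBEVFormer()
_, _, bev_t0 = model(torch.randn(1, 6, 3, 128, 128))
cls_logits, boxes, _ = model(torch.randn(1, 6, 3, 128, 128), prev_bev=bev_t0)
print(cls_logits.shape, boxes.shape)  # torch.Size([1, 100, 10]) torch.Size([1, 100, 7])
```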

Figure 4. The BEVFormer framework
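
For steps 4) and 5), the sketch below illustrates Hungarian matching with a Focal Loss + L1 cost (using SciPy's linear_sum_assignment as the Hungarian solver), followed by the training loss on the matched pairs. The cost weights and tensor shapes are illustrative assumptions rather than the exact BEVFormer configuration.

```python
import torch
import torch.nn.functional as F
from scipy.optimize import linear_sum_assignment

def sigmoid_focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    # Standard sigmoid focal loss, averaged over all query/class entries.
    prob = logits.sigmoid()
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p_t = prob * targets + (1 - prob) * (1 - targets)
    a_t = alpha * targets + (1 - alpha) * (1 - targets)
    return (a_t * (1 - p_t) ** gamma * ce).mean()

def focal_matching_cost(cls_prob, gt_labels, alpha=0.25, gamma=2.0):
    # Pairwise focal-style classification cost, shape (num_queries, num_gt).
    pos = alpha * (1 - cls_prob) ** gamma * (-(cls_prob + 1e-8).log())
    neg = (1 - alpha) * cls_prob ** gamma * (-(1 - cls_prob + 1e-8).log())
    return pos[:, gt_labels] - neg[:, gt_labels]

def match_and_loss(cls_logits, pred_boxes, gt_labels, gt_boxes,
                   cls_weight=2.0, reg_weight=0.25):
    """cls_logits: (Q, num_classes); pred_boxes: (Q, 7); gt_boxes: (G, 7)."""
    cls_prob = cls_logits.sigmoid()
    cost_cls = focal_matching_cost(cls_prob, gt_labels)       # (Q, G)
    cost_reg = torch.cdist(pred_boxes, gt_boxes, p=1)         # pairwise L1 distance
    cost = cls_weight * cost_cls + reg_weight * cost_reg
    # Step 4: Hungarian matching; the minimum-cost assignment defines the positives.
    rows, cols = linear_sum_assignment(cost.detach().cpu().numpy())
    rows, cols = torch.as_tensor(rows), torch.as_tensor(cols)

    # Step 5: focal classification loss (unmatched queries are background) + L1 regression.
    targets = torch.zeros_like(cls_logits)
    targets[rows, gt_labels[cols]] = 1.0
    loss_cls = sigmoid_focal_loss(cls_logits, targets)
    loss_reg = F.l1_loss(pred_boxes[rows], gt_boxes[cols])
    return cls_weight * loss_cls + reg_weight * loss_reg

# Toy usage: 100 queries, 10 classes, 3 ground-truth boxes.
cls_logits, pred_boxes = torch.randn(100, 10), torch.randn(100, 7)
gt_labels, gt_boxes = torch.tensor([1, 4, 7]), torch.randn(3, 7)
print(match_and_loss(cls_logits, pred_boxes, gt_labels, gt_boxes))
```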

The BEV Fusion branch builds on the BEV LiDAR and BEV Camera algorithms and usually relies on a fusion module to combine point cloud and image features; BEVFusion is one of its representative works.

3.2 The BEVFusion pipeline:

1) Given the different sensor inputs, modality-specific encoders are first applied to extract their features;

2) The multi-modal features are converted into a unified BEV representation that preserves both geometric and semantic information;

3) The efficiency bottleneck of the camera-to-BEV view transformation is addressed through precomputation and interval reduction, which accelerate the BEV pooling step (a sketch of this idea is given after Figure 5);

4) A convolution-based BEV encoder is then applied to the unified BEV features to alleviate the local misalignment between features from different modalities;

5) Finally, task-specific heads are added to support different 3D scene understanding tasks.
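
A minimal PyTorch sketch of this pipeline is shown below. The encoders, the pooling-based view transform, and the head sizes are placeholders standing in for the real backbones and the depth-based camera-to-BEV projection; it is not the official BEVFusion implementation.

```python
import torch
import torch.nn as nn

class TinyBEVFusion(nn.Module):
    def __init__(self, bev_h=128, bev_w=128, cam_ch=80, lidar_ch=80, num_classes=10):
        super().__init__()
        # 1) Modality-specific encoders (stand-ins for the real image / LiDAR backbones).
        self.cam_encoder = nn.Conv2d(3, cam_ch, kernel_size=3, padding=1)
        self.lidar_encoder = nn.Conv2d(1, lidar_ch, kernel_size=3, padding=1)
        # 2)+3) Camera-to-BEV view transform stand-in: resample image features onto the BEV grid.
        self.view_transform = nn.AdaptiveAvgPool2d((bev_h, bev_w))
        # 4) Convolution-based BEV encoder fusing the concatenated BEV features.
        self.bev_encoder = nn.Sequential(
            nn.Conv2d(cam_ch + lidar_ch, 256, 3, padding=1), nn.ReLU(),
            nn.Conv2d(256, 256, 3, padding=1), nn.ReLU(),
        )
        # 5) Task-specific heads, e.g. a detection heatmap and a BEV segmentation map.
        self.det_head = nn.Conv2d(256, num_classes, 1)
        self.seg_head = nn.Conv2d(256, 2, 1)

    def forward(self, images, lidar_bev):
        cam_feat = self.view_transform(self.cam_encoder(images))   # (B, cam_ch, H_bev, W_bev)
        lidar_feat = self.lidar_encoder(lidar_bev)                  # (B, lidar_ch, H_bev, W_bev)
        fused = self.bev_encoder(torch.cat([cam_feat, lidar_feat], dim=1))
        return self.det_head(fused), self.seg_head(fused)

# Toy usage: one camera image and a rasterized LiDAR BEV grid.
model = TinyBEVFusion()
det, seg = model(torch.randn(1, 3, 256, 704), torch.randn(1, 1, 128, 128))
print(det.shape, seg.shape)  # torch.Size([1, 10, 128, 128]) torch.Size([1, 2, 128, 128])
```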

Figure 5. The BEVFusion framework
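
Regarding step 3) above, the sketch below illustrates the precomputation + reduction idea behind fast BEV pooling: the BEV cell index of each camera-frustum point depends only on the fixed calibration and can be computed once offline, so the per-frame work reduces to a single scatter-add. The grid size, extent, and sum reduction are assumptions for illustration, not the optimized CUDA kernel used in practice.

```python
import torch

def precompute_bev_indices(frustum_xyz, bev_h=128, bev_w=128, extent_m=51.2):
    """Map 3D frustum points (N, 3) in ego coordinates to flat BEV cell indices."""
    res = (2 * extent_m) / bev_h                          # meters per BEV cell
    ix = ((frustum_xyz[:, 0] + extent_m) / res).long().clamp(0, bev_w - 1)
    iy = ((frustum_xyz[:, 1] + extent_m) / res).long().clamp(0, bev_h - 1)
    return iy * bev_w + ix                                 # (N,)

def bev_pool(features, flat_idx, bev_h=128, bev_w=128):
    """Scatter-add per-point features (N, C) into a (C, H, W) BEV grid."""
    c = features.shape[1]
    bev = torch.zeros(bev_h * bev_w, c, dtype=features.dtype)
    bev.index_add_(0, flat_idx, features)                  # the per-frame "reduction" step
    return bev.t().reshape(c, bev_h, bev_w)

# Toy usage: 100k frustum points with 80-channel features.
pts = (torch.rand(100_000, 3) - 0.5) * 100.0
idx = precompute_bev_indices(pts)                          # done once, offline
bev_feat = bev_pool(torch.randn(100_000, 80), idx)         # done every frame
print(bev_feat.shape)                                      # torch.Size([80, 128, 128])
```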

4 Advantages and disadvantages of BEV perception algorithms

At present, industry research on pure-vision perception and prediction algorithms usually focuses on image-view solutions to a single sub-problem of the pipeline, such as 3D object detection, semantic map recognition, or object motion prediction, and the outputs of the different networks are then combined through early or late fusion. As a result, the overall system can only be built by stacking multiple sub-modules in a serial structure. Although this approach decomposes the problem and facilitates independent academic research, the serial architecture has several important drawbacks:

1) Errors made by upstream modules propagate downstream. Because independent research on each sub-problem usually takes ground-truth values as input, the error that accumulates in a deployed system can significantly degrade the performance of downstream tasks.

2) Different sub-modules repeat computations such as feature extraction and dimension conversion, but the serial architecture cannot share these redundant calculations, which limits the overall efficiency of the system.

3) Temporal information cannot be fully exploited. On the one hand, temporal information complements spatial information, helping to detect objects that are occluded at the current moment and providing additional cues for localization. On the other hand, temporal information helps determine an object's motion state; without it, pure-vision methods can hardly estimate an object's speed.

Unlike image-view solutions, the BEV solution uses multiple cameras or radars to convert sensory information into a bird's-eye view for the downstream perception tasks. It provides a larger field of view for autonomous driving perception and can perform multiple perception tasks in parallel. Moreover, because the BEV perception algorithm integrates information in BEV space, it is well suited to studying the 2D-to-3D conversion process.

At the same time, camera-based BEV perception still lags behind existing point cloud solutions on 3D detection tasks. Exploring vision-based BEV algorithms helps reduce cost: a LiDAR setup often costs roughly 10 times as much as a camera setup, so vision-based BEV is an attractive direction for the future, although the huge data volume it produces demands substantial computing resources.

5 Summary

To sum up, current research on pure-vision perception and prediction algorithms usually addresses a single sub-problem and builds the overall system by fusing the outputs of different networks. This serial architecture has important drawbacks, including error propagation, redundant computation, and underuse of temporal information.

In contrast, the BEV solution provides a broader field of view for autonomous driving perception by converting visual information into a bird's-eye view and can perform multiple perception tasks in parallel. Integrating information in BEV space also helps in studying the 2D-to-3D conversion process.

However, there is still a gap between current BEV perception algorithms and point cloud solutions on 3D detection tasks. Although vision-based BEV has the advantage of lower cost, it also brings large data volumes and heavy computing requirements. Future research needs to address these challenges to further advance the application of BEV perception algorithms in autonomous driving.

Origin blog.csdn.net/weixin_47869094/article/details/135185359