0. Preface
I'd like to share my recent work on BEV perception and welcome peers in autonomous driving to exchange ideas and learn together, to help autonomous driving come to fruition as soon as possible.
1. Overview
For autonomous driving, object detection under BEV (Bird's Eye View) is a very important task. Although this task has attracted considerable research effort, it remains a challenge to flexibly handle arbitrary camera configurations (single or multiple cameras) installed on autonomous vehicles.
To this end, BEVFormer is proposed. It leverages the Transformer's powerful feature-extraction capabilities together with its ability to query time-stamped historical features, aggregating feature information along both the spatial and temporal dimensions to improve the detection performance of the overall perception system.
Paper link: https://arxiv.org/pdf/2203.17270v1.pdf
Code link: GitHub - zhiqi-li/BEVFormer
About BEVFormer
BEVFormer exploits both spatial and temporal information through predefined grid-shaped BEV queries. To aggregate spatial information, a spatial cross-attention is designed in which each BEV query extracts spatial features from its regions of interest across the camera views. For temporal information, a temporal self-attention recurrently fuses historical BEV information. On the nuScenes dataset, BEVFormer reaches a state-of-the-art NDS of 56.9%, which is 9 points higher than the previous best camera-based method and on par with LiDAR-based baselines. The authors further show that BEVFormer significantly improves the accuracy of object velocity estimation and the recall of objects under low-visibility conditions.
Figure 1
2. Structural framework
Figure 2
The encoder layer of BEVFormer consists of grid-shaped BEV queries, temporal self-attention, and spatial cross-attention.
In spatial cross-attention, each BEV query interacts only with the image features of its regions of interest.
In temporal self-attention, each BEV query interacts with two features: the BEV queries at the current timestamp and the BEV features at the previous timestamp.
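The two attention steps above can be sketched with toy shapes. This is a hedged, minimal NumPy illustration of the idea (grid-shaped queries, temporal fusion with the previous BEV, then cross-attention to image features), not the actual BEVFormer implementation, which uses learned deformable attention.

```python
import numpy as np

# Toy shapes (hypothetical): BEV grid height/width and embedding dim.
H, W, C = 4, 4, 8
rng = np.random.default_rng(0)

bev_queries = rng.standard_normal((H * W, C))   # current grid-shaped BEV queries
prev_bev    = rng.standard_normal((H * W, C))   # BEV features from the previous timestamp

def attend(q, kv):
    """Plain scaled dot-product attention of queries over key/value features."""
    scores = q @ kv.T / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ kv

# Temporal self-attention: each query attends over the current queries
# together with the previous timestamp's BEV features.
temporal_out = attend(bev_queries, np.concatenate([bev_queries, prev_bev], axis=0))

# Spatial cross-attention: each query attends to image features; in BEVFormer
# this is restricted to the regions of interest the query projects into,
# approximated here by one small set of toy multi-camera image tokens.
image_feats = rng.standard_normal((16, C))
spatial_out = attend(temporal_out, image_feats)
print(spatial_out.shape)  # (16, 8)
```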
3. Environment configuration
Refer to the source repository for the detailed setup; I won't go through it step by step. Here I share the problems I ran into during configuration and their solutions.
- Error: No module named 'tools'. Analysis: the path is not on Python's module search path.
- Solution: export PYTHONPATH=${PYTHONPATH}:/home/mnt/mmdetection3d/BEVFormer/tools
- Then run: source ~/.profile
Execute in the terminal: python tools/create_data.py nuscenes --root-path ./data/nuscenes --out-dir ./data/nuscenes --extra-tag nuscenes --version v1.0 --canbus ./data
If the output shown in the figure below is generated, the data has been processed correctly.
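Beyond checking the console output, you can inspect the generated info files directly. The filename and key layout below are assumptions based on typical mmdetection3d-style outputs and may differ in your version, so treat this as a hedged sanity-check sketch.

```python
import pickle
from pathlib import Path

# Hypothetical top-level keys of the generated info pickle (may vary by version).
EXPECTED_KEYS = {"infos", "metadata"}

def check_infos(infos: dict) -> bool:
    """Return True if the loaded info dict has the expected top-level keys."""
    return EXPECTED_KEYS.issubset(infos.keys())

# Example filename; adjust to whatever create_data.py actually produced for you.
path = Path("./data/nuscenes/nuscenes_infos_temporal_train.pkl")
if path.exists():
    with path.open("rb") as f:
        print("looks valid:", check_infos(pickle.load(f)))
```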
4. Experimental results and demo
nuScenes contains 1000 scenes, each about 20 s of data, annotated at 2 Hz; each sample includes 6 cameras covering a 360-degree horizontal view. For the object detection task, 1.4M 3D boxes are annotated across 10 categories. Five true-positive error metrics are used: ATE, ASE, AOE, AVE, and AAE; in addition, nuScenes proposes NDS as a composite score.
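The NDS composite score can be computed from mAP and the five true-positive error metrics: each error is clipped to [0, 1] and converted to a score (1 − error), mAP is weighted by 5 and each error score by 1, and the sum is normalized by 10. A small sketch:

```python
def nds(mAP, mATE, mASE, mAOE, mAVE, mAAE):
    """nuScenes Detection Score: weighted mix of mAP and TP error scores."""
    tp_scores = [1.0 - min(1.0, e) for e in (mATE, mASE, mAOE, mAVE, mAAE)]
    return (5.0 * mAP + sum(tp_scores)) / 10.0

# Toy numbers for illustration only (not results from the paper):
print(round(nds(0.4, 0.5, 0.3, 0.4, 0.3, 0.2), 3))  # 0.53
```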
BEV features can serve both 3D object detection and map semantic segmentation. Commonly used 2D detection heads can be migrated to 3D detection with minor modifications. Experiments verify that the same BEV features can simultaneously support both tasks, and that multi-task learning improves 3D detection performance.
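The multi-task idea is simply that one shared BEV feature map feeds several task heads. A toy NumPy sketch with hypothetical shapes (real heads are learned convolutional or transformer modules):

```python
import numpy as np

# Hypothetical shapes: a shared BEV feature map and two linear task heads.
rng = np.random.default_rng(0)
H, W, C = 4, 4, 8
bev = rng.standard_normal((H, W, C))       # shared BEV features

W_det = rng.standard_normal((C, 10))       # detection head: 10 classes (as in nuScenes)
W_seg = rng.standard_normal((C, 3))        # segmentation head: 3 toy map classes

det_logits = bev @ W_det                   # per-cell detection class logits
seg_logits = bev @ W_seg                   # per-cell segmentation logits
print(det_logits.shape, seg_logits.shape)  # (4, 4, 10) (4, 4, 3)
```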
Video demo of continuous frames:
Written at the end: due to my limited hardware, I reduced the training data. It is recommended that you train on 8 GPUs.
Answer: From the perspective of vision algorithms, identifying whether an object exists is largely a semantic problem. The process depends on training data, so errors such as missed and false detections are inevitable. Recognizing the physical presence of objects with devices such as LiDAR is more reliable. In addition, classic vision problems such as multi-scale and small-object detection also limit the system's performance.
You can ask about the specific process in the bilibili comment section, and I will answer there. More high-quality material is shared on my CSDN account; feel free to follow it and leave a message.