SparseOcc: Fully sparse 3D panoptic occupancy prediction (joint semantic + instance tasks)


Today, Heart of Autonomous Driving shares a fully sparse 3D panoptic occupancy prediction method that handles semantic and instance occupancy jointly.


Paper: Fully Sparse 3D Panoptic Occupancy Prediction

Link: https://arxiv.org/pdf/2312.17118.pdf

What is the motivation behind this paper?

Occupancy prediction plays a key role in autonomous driving. Previous methods usually construct dense 3D volumes, ignoring the inherent sparsity of the scene and incurring high computational cost. Moreover, these methods are limited to semantic occupancy and cannot distinguish between different instances. To exploit sparsity and ensure instance awareness, the authors introduce SparseOcc, a fully sparse panoptic occupancy network. SparseOcc first reconstructs a sparse 3D representation from visual input, then uses sparse instance queries to predict each object instance from that sparse representation.

In short, SparseOcc distinguishes both semantics and instances!


Additionally, the authors establish the first vision-centric panoptic occupancy benchmark. SparseOcc achieves 26.0 mIoU on the Occ3D-nuScenes dataset while maintaining a real-time inference speed of 25.4 FPS. By incorporating temporal modeling over 8 frames, SparseOcc further improves to 30.9 mIoU. The code will be open-sourced later.


SparseOcc architecture and pipeline

SparseOcc consists of two steps. First, the authors propose a sparse voxel decoder that reconstructs the sparse geometry of the scene, modeling only the non-free regions and thus saving substantial computation. Second, a mask transformer uses sparse instance queries to predict the mask and label of each object in the sparse space, as sketched below.
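To make the two-stage flow concrete, here is a minimal PyTorch-style skeleton. The module names and interfaces are hypothetical, not the authors' actual code:

```python
import torch.nn as nn

class SparseOccSketch(nn.Module):
    """Hypothetical skeleton of the two-stage SparseOcc pipeline."""

    def __init__(self, sparse_voxel_decoder, mask_transformer):
        super().__init__()
        self.sparse_voxel_decoder = sparse_voxel_decoder  # stage 1: sparse geometry
        self.mask_transformer = mask_transformer          # stage 2: masks + labels

    def forward(self, img_feats):
        # Stage 1: reconstruct only the non-free voxels of the scene.
        voxel_coords, voxel_feats = self.sparse_voxel_decoder(img_feats)
        # Stage 2: sparse instance queries predict a per-query mask over the
        # sparse voxel set plus a class label, yielding panoptic occupancy.
        masks, labels = self.mask_transformer(voxel_coords, voxel_feats, img_feats)
        return voxel_coords, masks, labels
```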


In addition, the authors propose mask-guided sparse sampling to avoid dense cross-attention in the mask transformer. SparseOcc can therefore exploit both kinds of sparsity at once, forming a fully sparse architecture: it neither relies on dense 3D features nor uses sparse-to-dense global attention operations. At the same time, SparseOcc can distinguish different instances in the scene, unifying semantic occupancy and instance occupancy into panoptic occupancy!

(Figure 4: the sparse voxel decoder, coarse-to-fine with per-level sparsification.)

The designed sparse voxel decoder is shown in Figure 4. It follows a coarse-to-fine structure but takes a sparse set of voxel tokens as input. At the end of each level, the occupancy score of each voxel is estimated, and the voxel set is sparsified based on the predicted scores. There are two sparsification schemes: threshold-based (e.g., keep only scores > 0.5) and top-k based. The authors choose top-k, because thresholding produces unequal sample lengths and hurts training efficiency. k is a dataset-dependent parameter obtained by counting the maximum number of non-free voxels per sample at each resolution, and the sparsified voxel tokens serve as input to the next level!
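As an illustration of the per-level pruning, here is a small sketch of top-k sparsification, assuming occupancy scores have already been predicted for the current sparse voxel set (function and tensor names are ours, not the paper's):

```python
import torch

def topk_sparsify(voxel_coords, voxel_feats, occ_scores, k):
    """Keep the k voxels with the highest predicted occupancy scores.

    voxel_coords: (N, 3) integer voxel indices at the current resolution
    voxel_feats:  (N, C) voxel features
    occ_scores:   (N,)   predicted occupancy probabilities
    k:            dataset-dependent budget for this resolution
    """
    k = min(k, occ_scores.numel())
    keep = occ_scores.topk(k).indices  # fixed-size output, so samples in a
                                       # batch stay the same length (unlike
                                       # thresholding at score > 0.5)
    return voxel_coords[keep], voxel_feats[keep]
```

Each kept voxel would then be subdivided to form the sparse token set for the next, finer level.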

Temporal modeling. Previous dense occupancy methods usually warp historical BEV/3D features to the current timestamp and fuse temporal information with deformable attention or 3D convolution. However, this approach does not suit our case, since the 3D features here are sparse. To handle this, the authors exploit the flexibility of sampling points: the points are warped to previous timestamps to sample image features there. Features sampled from multiple timestamps are stacked and aggregated via adaptive mixing.
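A hedged sketch of the point-warping idea: move the 3D sampling points defined in the current ego frame into a past ego frame via the ego poses, before projecting them into that frame's cameras. The 4×4 ego-to-global transforms are assumed to come from the dataset, and all names are ours:

```python
import torch

def warp_points_to_past(pts_cur, T_cur_to_global, T_past_to_global):
    """Warp 3D sampling points from the current ego frame into a past ego frame.

    pts_cur:          (N, 3) sampling points in the current ego frame
    T_cur_to_global:  (4, 4) current ego-to-global transform
    T_past_to_global: (4, 4) past ego-to-global transform
    """
    ones = torch.ones_like(pts_cur[:, :1])
    pts_h = torch.cat([pts_cur, ones], dim=-1)                # homogeneous (N, 4)
    T = torch.linalg.inv(T_past_to_global) @ T_cur_to_global  # cur ego -> past ego
    return (pts_h @ T.T)[:, :3]
```

The warped points would then be projected into the past frame's cameras to bilinearly sample features, and the per-timestamp features stacked and fused.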

Loss design: every level is supervised. Since this stage reconstructs class-agnostic occupancy, a binary cross-entropy (BCE) loss supervises the occupancy head. Only a sparse set of locations (determined by the predicted occupancy) is supervised, which means regions discarded at earlier levels receive no supervision.

In addition, due to severe class imbalance, the model is easily dominated by high-proportion categories such as the ground, thereby ignoring other important elements of the scene such as cars and pedestrians. Therefore, voxels belonging to different classes are assigned different loss weights: a voxel belonging to class c receives a loss weight of

$$w_c = \frac{\sum_{i=1}^{C} M_i}{M_c}$$

where $M_i$ is the number of voxels belonging to the i-th class in the ground truth, and $C$ is the number of classes!
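A small sketch of how such class-balanced weighting could enter the BCE loss on the supervised sparse locations. Here `voxel_cls` gives each supervised voxel's ground-truth semantic class; free-space handling is omitted for brevity, and all names are ours:

```python
import torch
import torch.nn.functional as F

def class_balanced_bce(occ_logits, occ_target, voxel_cls, num_classes):
    """Class-balanced BCE for the class-agnostic occupancy head.

    occ_logits: (N,) predicted occupancy logits at the supervised locations
    occ_target: (N,) binary occupancy ground truth
    voxel_cls:  (N,) ground-truth semantic class, used only for weighting
    """
    M = torch.bincount(voxel_cls, minlength=num_classes).float()  # M_i per class
    w_per_class = M.sum() / M.clamp(min=1.0)                      # w_c = sum_i M_i / M_c
    weights = w_per_class[voxel_cls]
    return F.binary_cross_entropy_with_logits(
        occ_logits, occ_target.float(), weight=weights)
```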

Mask-guided sparse sampling. A straightforward baseline for the mask transformer is to use the masked cross-attention module from Mask2Former. However, it attends to all key locations, which can be very computationally intensive. The authors instead devise a simple alternative: given the mask prediction from the (l − 1)-th transformer decoder layer, a set of 3D sampling points is generated by randomly selecting voxels within the mask, and these points are projected onto the images to sample image features. Moreover, this sparse sampling mechanism makes temporal modeling easy, by simply warping the sampling points to past frames (as in the sparse voxel decoder). A minimal sketch follows.
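To illustrate, here is one plausible reading of "randomly selecting voxels within the mask": draw each query's 3D points with the previous layer's mask probabilities as sampling weights. This is our sketch, not the authors' exact procedure:

```python
import torch

def mask_guided_sample(voxel_coords, mask_logits, num_points):
    """Pick per-query 3D sampling points inside the predicted masks.

    voxel_coords: (N, 3) sparse voxel coordinates
    mask_logits:  (Q, N) per-query mask logits from decoder layer l-1
    num_points:   number of sampling points per query
    Returns:      (Q, num_points, 3) sampling points
    """
    probs = mask_logits.sigmoid()  # (Q, N), strictly positive weights
    idx = torch.multinomial(probs, num_points, replacement=True)  # (Q, P)
    return voxel_coords[idx]       # fancy indexing -> (Q, P, 3)
```

The selected points are then projected onto the images to sample features, avoiding attention over all key locations.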


Experimental results

3D occupancy prediction performance on the Occ3D-nuScenes dataset. "8f" means fusing temporal information from 8 frames (7 history + 1 current). Our method matches or even surpasses previous methods under weaker settings!

