[Paper Notes] MetaBEV: Solving Sensor Failures for BEV Detection and Map Segmentation

Original link: https://arxiv.org/abs/2304.09801

1 Introduction

  A major problem in current multi-modal fusion perception is that the impact of sensor failures is largely ignored. Key issues with previous work include:

  1. Feature misalignment: CNNs are typically used to process the concatenated feature maps, and geometric noise may cause feature misalignment; this can be attributed to CNNs' limited long-range perception and lack of input-adaptive attention.
  2. Heavy dependence on complete modalities: query-based and channel-fusion methods rely heavily on complete modality inputs, and performance degrades severely when any modality fails.

  This paper proposes MetaBEV to solve the above problems under a unified BEV representation, regardless of the modalities used or the specific task. Since the bottleneck of existing methods is that the fusion module cannot handle an arbitrary subset of modalities on its own, this paper proposes an arbitrary-modality BEV-evolving decoder that uses cross-modal attention to associate single-modal or multi-modal features.
  This paper evaluates MetaBEV under 6 types of sensor corruption, namely limited field of view (LF), beam reduction (BR), missing object (MO), view drop (VD), view noise (VN), and obstacle occlusion (OO), as well as 2 sensor-missing cases, missing LiDAR (ML) and missing camera (MC). Experiments show that MetaBEV is highly robust.
  Furthermore, this paper handles multiple tasks with a shared framework. However, conflicts between tasks lead to performance degradation, and little prior work analyzes this or designs corresponding solutions. This paper integrates MetaBEV with a multi-task mixture-of-experts ($\text{M}^2\text{oE}$) module to offer a possible solution for multi-task learning.

3. MetaBEV method

  This paper connects each modality through a parameterized meta-BEV and uses cross-modal attention to integrate geometric and semantic information from each modality. The network, shown in the figure below, consists of a feature encoder, a BEV-evolving decoder (with cross-modal deformable attention), and task heads.
  

3.1 Overview of BEV feature encoder

  MetaBEV generates fused features in the BEV space to combine multi-modal features and adapt to different tasks.
  Camera/LiDAR to BEV: following BEVFusion, multi-view image features are extracted with an image backbone, lifted into 3D space with LSS, and compressed to obtain the camera BEV feature map $B_c$. The LiDAR point cloud is voxelized and then encoded with 3D sparse convolutions to obtain the LiDAR BEV representation $B_l$.
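
  As a rough sketch of the two encoder branches (my own pseudocode, not the authors' or BEVFusion's implementation), the camera and LiDAR streams can be wrapped so that either one may be absent; the image backbone, LSS-style view transform, and sparse voxel encoder are assumed to be provided externally:

```python
# Minimal sketch of the BEV feature encoder interface (assumed structure,
# not the official code). The three sub-modules are placeholders supplied
# by the caller.
import torch.nn as nn

class BEVFeatureEncoder(nn.Module):
    def __init__(self, img_backbone, view_transform, voxel_encoder):
        super().__init__()
        self.img_backbone = img_backbone      # multi-view 2D image backbone
        self.view_transform = view_transform  # LSS-style lift-splat to BEV
        self.voxel_encoder = voxel_encoder    # voxelization + 3D sparse conv

    def forward(self, images=None, points=None):
        b_c = b_l = None
        if images is not None:
            feats = self.img_backbone(images)   # per-view image features
            b_c = self.view_transform(feats)    # camera BEV map B_c
        if points is not None:
            b_l = self.voxel_encoder(points)    # LiDAR BEV map B_l
        return b_c, b_l                         # either may be None (missing modality)
```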

3.2 BEV-evolving decoder

  This part consists of three components: a cross-modal attention layer, a self-attention layer, and plug-and-play $\text{M}^2\text{oE}$ blocks. The structure is shown in the figure below.
[Figure: structure of the BEV-evolving decoder, with cross-modal attention, self-attention, and the two M²oE variants (I and II)]
  Cross-modal attention layer: first, a dense BEV query, called meta-BEV $B_m$, is initialized. After adding positional encodings, it interacts with each modality. For efficiency, this paper uses deformable attention $\text{DAttn}(\cdot)$. However, the original deformable attention is not suited to arbitrary modality inputs, so modality-specific MLPs (C-MLP and L-MLP in the figure above) are used to predict the sampling points and attention weights $A$. Given a BEV representation $x\in\{B_c,B_l\}$, the modality-specific sampling offsets $\Delta p^x$ and attention weights $A^x$ are first generated; the former locate the sampled features and the latter scale them. The meta-BEV is then updated with the scaled sampled features. The whole process can be written as
$$\text{DAttn}(B_m,p,x)=\sum_{m=1}^{M}W_m\Big[\sum_{x\in\{B_c,B_l\}}\sum_{k=1}^{K}A_{mk}^{x}\cdot W'_m\,x(p+\Delta p_{mk}^{x})\Big]$$
where $m$ indexes the attention heads, $K$ is the number of sampling points, $p$ is the reference point, and $W_m$ and $W'_m$ are learnable projection matrices.
  The cross-attention mechanism fuses features layer by layer, allowing meta-BEV to iteratively "evolve" into fused features.
  Self-attention layer: the above process does not involve interaction among the queries, so this paper also applies self-attention. Replacing $x$ with $B_m$ in the formula above gives the self-attention form $\text{DAttn}(B_m,p,B_m)$.
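
  The following is a simplified, single-scale sketch of this cross-modal deformable attention (my own reconstruction, not the official code); attention heads are folded into the sampling-point dimension for brevity, and the module and argument names are illustrative:

```python
# Sketch of modality-aware deformable attention over meta-BEV queries.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossModalDeformAttn(nn.Module):
    def __init__(self, dim=256, heads=8, points=4, modalities=("cam", "lidar")):
        super().__init__()
        self.h, self.k = heads, points
        # modality-specific MLPs predicting sampling offsets and attention weights
        self.offset = nn.ModuleDict({m: nn.Linear(dim, heads * points * 2) for m in modalities})
        self.weight = nn.ModuleDict({m: nn.Linear(dim, heads * points) for m in modalities})
        self.value_proj = nn.Linear(dim, dim)   # plays the role of W'_m
        self.out_proj = nn.Linear(dim, dim)     # plays the role of W_m

    def forward(self, query, ref_points, feats):
        # query:      (B, N, C) meta-BEV queries (positional encoding already added)
        # ref_points: (B, N, 2) normalized reference points in [0, 1]
        # feats:      dict of the *available* BEV maps, e.g. {"cam": (B, C, H, W)}
        B, N, C = query.shape
        samples, weights = [], []
        for m, bev in feats.items():            # missing modalities are simply absent
            off = self.offset[m](query).view(B, N, self.h * self.k, 2)
            w = self.weight[m](query).view(B, N, self.h * self.k)
            loc = (ref_points.unsqueeze(2) + off).clamp(0, 1) * 2 - 1   # grid in [-1, 1]
            v = self.value_proj(bev.flatten(2).transpose(1, 2))         # (B, H*W, C)
            v = v.transpose(1, 2).reshape(B, C, *bev.shape[-2:])        # back to (B, C, H, W)
            samples.append(F.grid_sample(v, loc, align_corners=False))  # (B, C, N, h*k)
            weights.append(w)
        a = torch.softmax(torch.cat(weights, dim=-1), dim=-1)           # A over all modalities/points
        s = torch.cat(samples, dim=-1)                                  # (B, C, N, |feats|*h*k)
        out = (s * a.unsqueeze(1)).sum(dim=-1).transpose(1, 2)          # (B, N, C)
        return self.out_proj(out)
```

  Passing the meta-BEV map itself as the only entry in `feats` recovers the self-attention case $\text{DAttn}(B_m,p,B_m)$, and a missing modality simply removes its terms from the sum.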
  $\text{M}^2\text{oE}$ block: following prior work that builds large language models with mixture-of-experts (MoE) layers, this paper introduces MoE into the MLPs of the BEV-evolving block and proposes the $\text{M}^2\text{oE}$ block for multi-task learning, as shown in I and II of the figure above.
  $\text{RM}^2\text{oE}$ is introduced first:
$$\text{M}^2\text{oE}(x)=\sum_{i=1}^{t}\mathcal{R}(x)_i\,\mathcal{E}_i(x),\quad t\ll E$$
where $x$ is the input token of the $\text{RM}^2\text{oE}$-FFN, $\mathcal{R}:\mathbb{R}^D\rightarrow\mathbb{R}^E$ is the routing function that assigns tokens to the corresponding experts, and $\mathcal{E}_i:\mathbb{R}^D\rightarrow\mathbb{R}^D$ is the $i$-th expert processing the token. Both $\mathcal{R}$ and $\mathcal{E}_i$ are MLPs, and $E$ is a hyperparameter that sets the number of experts. For each token, $\mathcal{R}$ selects only $t$ experts, so a large fraction of the experts remain inactive.
  $\text{HM}^2\text{oE}$ is a degenerate version of $\text{RM}^2\text{oE}$ (with $E$ equal to the number of tasks and $t=1$). Tokens bypass the routing step and pass through the FFN of the corresponding task, after which they are merged in a task-fusion network. This alleviates task conflicts by separating the conflicting gradients of different tasks into different experts.
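
  A minimal sketch of such a top-$t$ routed MoE layer is shown below (a generic implementation in the spirit of $\text{RM}^2\text{oE}$, not the paper's code; auxiliary load-balancing losses and the task-fusion network are omitted):

```python
# Generic top-t routed mixture-of-experts over flattened BEV tokens.
import torch
import torch.nn as nn

class RoutedMoE(nn.Module):
    def __init__(self, dim=256, hidden=1024, num_experts=8, top_t=2):
        super().__init__()
        self.top_t = top_t
        self.router = nn.Linear(dim, num_experts)            # R: R^D -> R^E
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))
            for _ in range(num_experts))                      # E_i: R^D -> R^D

    def forward(self, x):
        # x: (num_tokens, D) flattened BEV tokens
        scores = torch.softmax(self.router(x), dim=-1)        # routing probabilities
        top_w, top_i = scores.topk(self.top_t, dim=-1)        # keep only t << E experts per token
        out = torch.zeros_like(x)
        for slot in range(self.top_t):
            idx, w = top_i[:, slot], top_w[:, slot:slot + 1]
            for e, expert in enumerate(self.experts):
                mask = idx == e
                if mask.any():                                 # inactive experts are skipped
                    out[mask] += w[mask] * expert(x[mask])
        return out
```

  The $\text{HM}^2\text{oE}$ variant would instead drop the learned router and send each task's tokens through a fixed task-specific expert ($t=1$) before a task-fusion network, as described above.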

3.3 Sensor failure

  This article defines 6 sensor failure modes:

  1. LiDAR Limited Field of View (LF): Due to incorrect collection or partial hardware damage, the lidar can only acquire data from one part of the field of view;
  2. Missing Object (MO): Certain materials prevent lidar point reflections;
  3. Beam Reduction (BR): the number of LiDAR beams is reduced due to limited energy or sensor processing capability;
  4. View Drop (VD): camera views are lost due to camera failure;
  5. View Noise (VN): camera images are corrupted by noise due to camera failure;
  6. Obstacle Occlusion (OO): Objects are occluded from the camera view.

  In addition, this paper also considers two serious sensor missing scenarios: camera loss and lidar loss.

3.4 Switched-modality training

  This paper proposes a switched-modality training scheme: during training, the input is randomly switched among modality combinations according to predefined probabilities, which ensures high accuracy when inference is performed with any of the modalities.
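
  A hedged sketch of what such a training-time switch could look like (the combination list and the helper below are illustrative, not taken from the paper):

```python
import random

# Candidate modality combinations and their sampling probabilities; the
# supplementary material states the probabilities are set equally.
MODALITY_COMBOS = [("camera", "lidar"), ("camera",), ("lidar",)]
PROBS = [1 / 3, 1 / 3, 1 / 3]

def switch_modalities(batch):
    """Randomly keep one modality combination; the other sensor inputs are dropped."""
    combo = random.choices(MODALITY_COMBOS, weights=PROBS, k=1)[0]
    return {k: (v if k in combo or k not in ("camera", "lidar") else None)
            for k, v in batch.items()}
```

  Since the cross-modal attention only sums over the available modalities, the dropped input can simply be omitted for MetaBEV itself.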

4. Experiment

4.2 Performance with full modalities

  Experiments show that for the 3D object detection task, MetaBEV significantly exceeds existing models in the camera-only setting, and achieves performance comparable to SotA in both the LiDAR-only and camera-LiDAR multi-modal settings. For the BEV semantic segmentation task, MetaBEV significantly surpasses previous methods in both the LiDAR-only and camera-LiDAR multi-modal settings.

4.3 Performance when sensor fails

  When a sensor is entirely missing, past methods cannot handle the absent features; this paper replaces the missing features with all-zero values so that these networks can still produce predictions. Experiments show that MetaBEV is much more robust to modality loss: with LiDAR missing, its detection performance significantly exceeds BEVFusion, and with the camera missing, its BEV segmentation performance significantly exceeds BEVFusion. Even with the camera missing, MetaBEV exceeds the performance of the LiDAR-only SotA model.
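
  As a concrete illustration (hypothetical helper, illustrative shapes), the zero-filling used so that the fusion baselines can still run with a missing modality might look like:

```python
import torch

def zero_fill_missing(b_c, b_l, shape=(1, 256, 180, 180)):
    """Replace a missing BEV feature map with zeros before channel-wise fusion."""
    if b_c is None:
        b_c = torch.zeros(shape)
    if b_l is None:
        b_l = torch.zeros(shape)
    return torch.cat([b_c, b_l], dim=1)   # concatenation-style fusion input
```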
  When a sensor partially fails, this paper uses two evaluation protocols: zero-shot testing and in-domain testing. In the former, the trained model is tested directly under partial sensor failures; in the latter, the model is additionally trained with partial sensor failures before testing. Experiments show that MetaBEV surpasses BEVFusion under both protocols.

4.4 Performance of multi-task learning

  Without adding MoE, MetaBEV's performance already reaches SotA. Adding either type of $\text{M}^2\text{oE}$ improves performance further, with $\text{RM}^2\text{oE}$ bringing a larger gain than $\text{HM}^2\text{oE}$.

4.5 Ablation studies

  Network configuration: the optimal structure of the BEV-evolving decoder is first searched, including the combination of layers, the number of sampling points, and the number of experts. Experiments show that a small number of cross-attention layers and sampling points already achieves sufficient performance. In addition, adding a self-attention layer also improves performance because it captures correlations between queries. In $\text{RM}^2\text{oE}$, better performance is achieved by using more experts and assigning more of them to each token.
  Switched-modality training: compared with full-modality training, switched-modality training greatly improves performance when a modality is missing, and also slightly improves performance in the full-modality setting.

Supplementary material

7. Implementation details

7.2 Sensor failure

  1. Limited field of view (LF) : Only input lidar point clouds within a certain angle range.
  2. Object Missing (MO) : Points from objects are dropped probabilistically.
  3. Beam Reduction (BR) : Only points from a subset of the LiDAR beams are kept.
  4. View Noise (VN) : Adds random noise to part or all of the view image.
  5. View Dropout (VD) : Partial views are randomly discarded and replaced with all-zero input.
  6. Obstacle Occlusion (OO) : Generates predefined masks and alpha blends them with the image view.
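
  As an illustration, the LiDAR-side corruptions could be simulated roughly as follows (my own approximations of the described protocol; the angle range, beam indices, and point-cloud column layout are assumptions):

```python
import numpy as np

def limited_fov(points, fov_deg=120.0):
    """LF: keep only points whose azimuth lies inside a limited field of view."""
    azimuth = np.degrees(np.arctan2(points[:, 1], points[:, 0]))
    return points[np.abs(azimuth) <= fov_deg / 2]

def beam_reduction(points, beam_col=4, keep_beams=(0, 8, 16, 24)):
    """BR: keep only points that belong to a subset of the LiDAR beams."""
    return points[np.isin(points[:, beam_col].astype(int), keep_beams)]

def missing_object(points, object_mask, drop_prob=0.5):
    """MO: drop the points of each annotated object with a given probability."""
    keep = np.ones(len(points), dtype=bool)
    for obj_id in np.unique(object_mask[object_mask >= 0]):
        if np.random.rand() < drop_prob:
            keep &= object_mask != obj_id
    return points[keep]
```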

7.3 Training details

  Use standard data augmentation in MMDetection3D for images and lidar; use CBGS to balance classes.
  During multi-task training, the segmentation head is inserted into the pre-trained 3D detection network and the entire network is fine-tuned.
  When training with the switched-modality scheme, the input probabilities of the different modality combinations are set to be equal.


Origin blog.csdn.net/weixin_45657478/article/details/132247523