Research on online high-precision map generation algorithms

1. HDMapNet

The overall network architecture is shown in the figure. The final decoder outputs three branches: one for semantic segmentation, one for instance embedding, and one for direction prediction. This information is then turned into a vectorized road representation through post-processing. For direction prediction, 360 degrees are divided into N bins; in the GT, the two bins pointing toward a node's neighbors are set to 1 and the rest to 0. At inference we take the two predicted directions, multiply them by the step size to get the next point, and greedily connect the points in post-processing.
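
As a rough illustration of this greedy connection step, here is a minimal sketch assuming a per-pixel direction field of shape (H, W, n_bins); `grow_polyline`, the step size, and the bin layout are all illustrative choices, not the paper's actual code.

```python
import numpy as np

def next_points(p, direction_logits, n_bins=36, step=0.5):
    """Take the top-2 direction bins at point p and step along each of them."""
    top2 = np.argsort(direction_logits)[-2:]            # the two active bins
    angles = top2 * (2 * np.pi / n_bins)                # bin index -> angle
    return [p + step * np.array([np.cos(a), np.sin(a)]) for a in angles]

def grow_polyline(seed, direction_field, max_len=100):
    """Greedily connect points, always continuing in the direction that does
    not fold back onto the previous segment."""
    line, prev = [np.asarray(seed, dtype=float)], None
    for _ in range(max_len):
        i, j = np.round(line[-1]).astype(int)
        if not (0 <= i < direction_field.shape[0] and 0 <= j < direction_field.shape[1]):
            break                                       # walked off the map
        cands = next_points(line[-1], direction_field[i, j])
        if prev is None:
            nxt = cands[0]
        else:  # prefer the candidate consistent with the current heading
            nxt = max(cands, key=lambda c: np.dot(c - line[-1], line[-1] - prev))
        prev, line = line[-1], line + [nxt]
    return np.array(line)
```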

For img2bev, the classic approach is IPM, which performs the conversion under the assumption that the ground height is 0. However, because of ground slope and vehicle bumps, this cannot guarantee that lane lines are projected correctly onto the BEV. As for LSS, without explicit depth supervision its accuracy is limited. HDMapNet therefore adopts the VPN approach and uses a fully connected network to learn the view transformation.
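
A minimal sketch of a VPN-style view transformer, assuming the perspective-view feature map is flattened spatially and a two-layer MLP learns the mapping to BEV locations; all sizes and the class name are illustrative:

```python
import torch
import torch.nn as nn

class ViewTransformMLP(nn.Module):
    def __init__(self, pv_hw=(32, 88), bev_hw=(50, 100)):
        super().__init__()
        self.bev_hw = bev_hw
        n_pv, n_bev = pv_hw[0] * pv_hw[1], bev_hw[0] * bev_hw[1]
        # learns "which PV locations contribute to each BEV location"
        self.mlp = nn.Sequential(nn.Linear(n_pv, n_bev), nn.ReLU(),
                                 nn.Linear(n_bev, n_bev))

    def forward(self, pv_feat):                 # (B, C, Hp, Wp)
        b, c = pv_feat.shape[:2]
        flat = pv_feat.flatten(2)               # (B, C, Hp*Wp)
        bev = self.mlp(flat)                    # (B, C, Hb*Wb)
        return bev.view(b, c, *self.bev_hw)     # (B, C, Hb, Wb)
```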

2. MapTR

1. Innovations of the paper:

1. Influenced by DETR, an end-to-end vectorized map construction algorithm is designed. It uses a hierarchical query embedding scheme to flexibly encode instance-level and point-level information, and performs hierarchical bipartite matching to assign instances first and then points. A point-to-point loss and an edge direction loss are also proposed to supervise point-level and edge-level geometric information.
2. A permutation-equivalent modeling method is proposed, which constructs each map element as a point set with a group of equivalent permutations.
3. In contrast to the complex and time-consuming post-processing of HDMapNet: the earlier VectorMapNet also represented each map element as a point sequence, but used an autoregressive decoder to predict points one by one, resulting in long inference time and permutation ambiguity. MapTR predicts the points of all instances simultaneously and outputs them in parallel.

2. Permutation-equivalent modeling

Map elements can be divided into two types: closed-shape polygons and open-shape polylines, both of which can be represented as ordered point sets. A polygon with n points has 2n equivalent arrangements (n starting points × 2 directions), while a polyline has two (forward and reverse). All of them represent the same map element, so it is clearly unreasonable for VectorMapNet to supervise with a single fixed arrangement of the GT point set. MapTR therefore treats all possible arrangements of a map element as candidates for Hungarian matching, which amounts to constructing an equivalent permutation set.
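
A minimal sketch of how such an equivalent permutation set can be enumerated (an illustrative helper, not the paper's code):

```python
import numpy as np

def equivalent_permutations(pts, closed):
    """2 arrangements for an open polyline, 2n for a closed polygon
    (n cyclic starting points x 2 directions)."""
    perms = []
    for seq in (pts, pts[::-1]):                  # both directions
        if closed:
            for s in range(len(seq)):             # every starting point
                perms.append(np.roll(seq, -s, axis=0))
        else:
            perms.append(seq)
    return np.stack(perms)

line = np.array([[0., 0.], [1., 0.], [2., 0.]])
print(equivalent_permutations(line, closed=False).shape)   # (2, 3, 2)
print(equivalent_permutations(line, closed=True).shape)    # (6, 3, 2)
```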

3. Hierarchical matching

First, a fixed-size set of N map elements is predicted in parallel. The GT map elements are padded with empty elements ∅ to length N as well. Each GT map element consists of (an element label, a point set, and all equivalent permutations of that point set), while each predicted map element consists of (an element label, an ordered point set). A hierarchical bipartite matching algorithm is then introduced, performing instance-level matching followed by point-level matching.
Instance-level matching: first find the optimal instance-level assignment between the predicted map elements and the ground-truth elements, minimizing the matching cost. The cost has two parts: the focal loss on the label, and a position matching cost between the two point sets.
There is a subtlety here: what exactly is the position matching cost of the point sets? Is it computed on ordered or unordered point sets? From the formula it should be between the predicted ordered point set and the unordered GT point set. I was curious how this differs from the point-level matching below; after reading the interpretation linked here, it turns out the position matching cost is exactly the same as the point-level matching cost described next.
Reference: MapTR reading notes and surround-view lane lines - Zhihu (zhihu.com)


Point-level matching: after instance-level matching, the predicted map elements matched to non-empty GT instances are treated as positive samples, and point-level matching is performed on them. That is, the matching cost is computed between the point set predicted by the model and every permutation of the GT point set, and the permutation with the minimum cost is selected. The cost used here is the Manhattan distance between paired points.
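Putting the two levels together, here is a minimal sketch of the hierarchical matching under simplified costs (class probability plus point-set L1 distance), using SciPy's Hungarian solver; all names and shapes are illustrative rather than the paper's implementation:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def polyline_permutations(pts):
    """The 2 equivalent arrangements of an open polyline."""
    return np.stack([pts, pts[::-1]])

def point_level_cost(pred_pts, gt_pts):
    """Minimum Manhattan distance over all equivalent GT permutations."""
    perms = polyline_permutations(gt_pts)                   # (2, Nv, 2)
    costs = np.abs(perms - pred_pts[None]).sum(axis=(1, 2))
    k = costs.argmin()
    return costs[k], perms[k]

def hierarchical_match(pred_logits, pred_pts, gt_labels, gt_pts):
    """pred_logits: (N, C), pred_pts: (N, Nv, 2),
    gt_labels: (M,), gt_pts: (M, Nv, 2), with M <= N (the rest match ∅)."""
    N, M = pred_pts.shape[0], gt_pts.shape[0]
    probs = np.exp(pred_logits) / np.exp(pred_logits).sum(-1, keepdims=True)
    cost = np.zeros((N, M))
    for i in range(N):
        for j in range(M):
            pos_cost, _ = point_level_cost(pred_pts[i], gt_pts[j])
            cost[i, j] = -probs[i, gt_labels[j]] + pos_cost  # class + position
    rows, cols = linear_sum_assignment(cost)                 # instance level
    # point level: the best GT permutation for each matched pair
    targets = [point_level_cost(pred_pts[i], gt_pts[j])[1]
               for i, j in zip(rows, cols)]
    return rows, cols, targets
```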

4. Model structure

MapTR generates online high-precision maps from cameras only. For img2bev it uses the GKT method proposed by Horizon Robotics: for each BEV query, its prior position on the image(s) is first obtained through the camera intrinsics and extrinsics (one BEV position may project into multiple views), then the features of a nearby w×h kernel region are extracted and attended to by the BEV query via cross-attention.
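
A rough sketch of this geometric prior, under assumed pinhole-camera conventions (all names here are illustrative, not GKT's actual code):

```python
import torch

def bev_to_pixel(p_ego_xy, K, T_cam_from_ego, z0=0.0):
    """p_ego_xy: (2,) BEV position at an assumed height z0 -> pixel (u, v)."""
    p = torch.tensor([p_ego_xy[0], p_ego_xy[1], z0, 1.0])
    p_cam = T_cam_from_ego @ p                  # 4x4 extrinsic transform
    uvw = K @ p_cam[:3]                         # 3x3 intrinsics
    return uvw[:2] / uvw[2]                     # perspective divide

def kernel_patch(feat, uv, w=7, h=7):
    """Take the w x h patch of image features around pixel uv."""
    u, v = int(uv[0]), int(uv[1])
    return feat[:, v - h // 2 : v + h // 2 + 1, u - w // 2 : u + w // 2 + 1]
```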

After obtaining the BEV features, the paper uses a decoder similar in structure to DETR. The author introduces two kinds of query embeddings: instance-level query vectors (N of them) and point-level query vectors (Nv of them); the point-level queries are shared across all map elements (i.e., instances). The hierarchical query for the j-th point of the i-th map element is then its instance-level query vector plus the corresponding point-level query vector.
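
A minimal sketch of this hierarchical query scheme, with illustrative sizes: the combination q[i, j] = instance_query[i] + point_query[j] is just a broadcast addition.

```python
import torch
import torch.nn as nn

N, Nv, C = 50, 20, 256
instance_queries = nn.Parameter(torch.randn(N, C))   # one per map element
point_queries = nn.Parameter(torch.randn(Nv, C))     # shared by all elements

hier_queries = instance_queries[:, None, :] + point_queries[None, :, :]
print(hier_queries.shape)                            # (N, Nv, 256)
```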

After obtaining the feature of each point, it is first passed through multi-head self-attention so that the hierarchical queries can exchange information (both across instances and within point sets), and then deformable attention lets the queries interact with the BEV features. Each query q_ij predicts a normalized 2D xy coordinate on the BEV, samples the BEV features near that coordinate, and updates the feature of q_ij, iterating in this way. In the end each map element is a set of reference points, which is fed to the prediction head to obtain a classification score and a 2Nv-dimensional vector representing the positions of the Nv points in the predicted point set.
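
A minimal sketch of such a prediction head, with assumed layer sizes (illustrative, not the paper's code): a classification branch plus a regression branch that outputs a 2·Nv vector, i.e. normalized xy for each of the Nv points.

```python
import torch
import torch.nn as nn

class MapElementHead(nn.Module):
    def __init__(self, c=256, num_classes=3, nv=20):
        super().__init__()
        self.cls = nn.Linear(c, num_classes)
        self.reg = nn.Sequential(nn.Linear(c, c), nn.ReLU(),
                                 nn.Linear(c, 2 * nv))

    def forward(self, inst_feat):             # (B, N, C) per-instance features
        scores = self.cls(inst_feat)          # (B, N, num_classes)
        pts = self.reg(inst_feat).sigmoid()   # (B, N, 2*Nv), normalized [0, 1]
        return scores, pts.view(*pts.shape[:-1], -1, 2)   # (B, N, Nv, 2)
```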

5. Loss composition

For the optimal instance-level matching result, focal loss is used as the classification loss.
After point-level matching, the Manhattan distance between paired points is used to constrain the point positions.
However, constraining only the point positions does not constrain the direction of the edges well, so an edge direction loss is added: the author uses the cosine similarity between a predicted edge and the corresponding ground-truth edge of the paired point sets.
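
A minimal sketch of such an edge direction loss (a plausible form based on the description above, with illustrative shapes): each edge is the vector between consecutive points, and the loss pushes the cosine similarity with the matched GT edge toward 1.

```python
import torch
import torch.nn.functional as F

def edge_direction_loss(pred_pts, gt_pts, eps=1e-6):
    """pred_pts, gt_pts: (N, Nv, 2) point sets already paired by matching."""
    pred_edges = pred_pts[:, 1:] - pred_pts[:, :-1]   # (N, Nv-1, 2)
    gt_edges = gt_pts[:, 1:] - gt_pts[:, :-1]
    cos = F.cosine_similarity(pred_edges, gt_edges, dim=-1, eps=eps)
    return (1.0 - cos).mean()
```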

6. Improvements of MapTRv2

1. The self-attention module in the hierarchical query mechanism is decoupled, which greatly reduces memory consumption.

2. A more efficient variant of the cross-attention mechanism is proposed: not only the BEV features but also the PV features are queried.

3. In order to speed up training convergence, a one-to-many auxiliary matching branch is added during training.

4. Three auxiliary dense prediction losses are added to assist training.

On the first point: the original decoder has N × Nv point queries, so the self-attention mechanism has time complexity O((N·Nv)²). The paper decouples this into self-attention among instances and self-attention among the points within each instance, reducing the complexity to O(N² + Nv²).
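
A minimal sketch of this decoupling, assuming standard multi-head attention and illustrative sizes:

```python
import torch
import torch.nn as nn

N, Nv, C = 50, 20, 256
inter_attn = nn.MultiheadAttention(C, num_heads=8, batch_first=True)
intra_attn = nn.MultiheadAttention(C, num_heads=8, batch_first=True)

q = torch.randn(1, N, Nv, C)                   # hierarchical queries

# attention among instances: treat the Nv point slots as batch entries
x = q.permute(0, 2, 1, 3).reshape(Nv, N, C)    # (Nv, N, C)
x, _ = inter_attn(x, x, x)
q = x.reshape(1, Nv, N, C).permute(0, 2, 1, 3)

# attention among the points within each instance
y = q.reshape(N, Nv, C)                        # (N, Nv, C)
y, _ = intra_attn(y, y, y)
q = y.reshape(1, N, Nv, C)
```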

Second point: the original cross-attention predicts a point and uses deformable attention to extract features on the BEV. Now the point is also projected onto the image so that features are extracted from the PV as well, making the features richer.

Third point: the GT is repeated K times as the target of the auxiliary branch to increase the proportion of positive samples. The main branch and the auxiliary branch share the decoder and the point-level queries.

Fourth point: LSS is used, and a depth prediction auxiliary branch, a PV semantic segmentation branch, and a BEV semantic segmentation branch are added. The depth prediction loss follows BEVDepth. An additional BEV segmentation head is added to aid training. Furthermore, to make full use of dense supervision, the map GT is combined with the camera intrinsics and extrinsics to obtain a foreground mask on the perspective view, against which an auxiliary PV segmentation head is trained.

 

7. Some thoughts on MapTR

Compared with pixel-by-pixel prediction, the final distance accuracy of end-to-end prediction is still slightly worse, but tedious post-processing is avoided, so real-time performance is very strong. Because it directly outputs 20 points per instance, the model is also more robust to occlusion than pixel-wise models. A common criticism is that training takes too long. My personal take: during decoding, a query predicts a position on the BEV, but before convergence it does not attend to accurate BEV features. Suppose the true position is (100, 100) but the model initially predicts (20, 20); deformable attention then extracts features near (20, 20). Can the features around (20, 20) really pull the prediction toward (100, 100)? After all, the attention is local rather than global. I suspect this is why training converges so slowly.

3. SuperFusion

SuperFusion performs multi-level fusion of image and point-cloud features.

The first level is data-level fusion: the point cloud is projected onto the image, and the resulting sparse depth map is concatenated with the image. Bilinear interpolation is also used to obtain a dense depth map, which supervises the depth estimation in LSS.
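
A minimal sketch of the projection step, under assumed conventions (intrinsics K, camera-from-lidar extrinsics T; names and shapes are illustrative): lidar points are projected into the image and their depths scattered into a sparse depth map.

```python
import numpy as np

def sparse_depth_map(points_lidar, K, T_cam_from_lidar, hw):
    """points_lidar: (M, 3); returns an (H, W) depth map, 0 where empty."""
    h, w = hw
    pts = np.c_[points_lidar, np.ones(len(points_lidar))] @ T_cam_from_lidar.T
    z = pts[:, 2]
    keep = z > 0.1                                 # points in front of camera
    uvw = pts[keep, :3] @ K.T
    uv = (uvw[:, :2] / uvw[:, 2:3]).astype(int)    # perspective divide
    depth = np.zeros((h, w), dtype=np.float32)
    inside = (uv[:, 0] >= 0) & (uv[:, 0] < w) & (uv[:, 1] >= 0) & (uv[:, 1] < h)
    depth[uv[inside, 1], uv[inside, 0]] = z[keep][inside]
    return depth
```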

For feature-level fusion, the BEV features of the point cloud are used as queries (Q) to attend to the image features; cross-attention yields new BEV features, which are then fused through a series of convolutions to obtain the final point-cloud BEV features.

For BEV-level fusion, the image features are lifted to BEV via LSS and then fused with the point-cloud BEV features. However, because of errors in the intrinsics, extrinsics, and depth estimation, direct concatenation causes feature misalignment. So the two are first concatenated to predict a flow field, the image BEV features are recomputed according to the flow field (a flow vector at each position, with bilinear interpolation producing the warped features as the new image BEV features), and the two are then concatenated.
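
A minimal sketch of this flow-based alignment, assuming a small conv predicts a per-position 2D flow from the concatenated BEV features and the warping uses bilinear `grid_sample`; the module name and channel sizes are illustrative:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BevAlign(nn.Module):
    def __init__(self, c=128):
        super().__init__()
        self.flow = nn.Conv2d(2 * c, 2, kernel_size=3, padding=1)

    def forward(self, img_bev, lidar_bev):      # both (B, C, H, W)
        b, _, h, w = img_bev.shape
        flow = self.flow(torch.cat([img_bev, lidar_bev], dim=1))  # (B, 2, H, W)
        # base sampling grid in normalized [-1, 1] coordinates
        ys, xs = torch.meshgrid(torch.linspace(-1, 1, h),
                                torch.linspace(-1, 1, w), indexing="ij")
        base = torch.stack([xs, ys], dim=-1).expand(b, h, w, 2)
        # shift the grid by the predicted flow (converted to grid units)
        grid = base + flow.permute(0, 2, 3, 1) / torch.tensor([w / 2, h / 2])
        warped = F.grid_sample(img_bev, grid, align_corners=True)
        return torch.cat([warped, lidar_bev], dim=1)   # fuse after alignment
```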

The post-processing is the same as in HDMapNet.

4. MachMap

5. MapVR
