[SuperPoint] Deep learning for feature extraction in semantic SLAM

1. Overview

The author's line of thought is very clear, and the motivation for each technical choice is spelled out. There are three papers in total; the other two were published in 2016 and 2017. Reading all three, one can clearly see how the author's deep-learning approach to pose estimation evolved. To make that evolution clear, we interpret the three papers in chronological order:

1)Deep Image Homography Estimation

2)Toward Geometric Deep SLAM

3)SuperPoint: Self-Supervised Interest Point Detection and Description

2. The first paper: Deep Image Homography Estimation

For a similar discussion, see R TALK | Image Alignment and Its Application, and the available translation of Deep Image Homography Estimation.
2.1. Overview
Deep Image Homography Estimation estimates the homography matrix between a pair of images in an end-to-end manner. The training set is generated by taking a picture from MS-COCO and applying a homography transformation to it to obtain an image pair. In order to obtain a confidence for the estimated transformation (needed, for example, to set variances in SLAM), the author splits the network into two variants with two kinds of output: one outputs a single transformation result, while the other outputs multiple possible transformation results together with a confidence for each; in actual use, the result with the highest confidence is selected.

2.2. Algorithm flow
2.2.1 Basic knowledge
The method proposed in this paper outputs a homography matrix. A homography assumes that the target points in the image lie on a plane; the corresponding relation for points that do not lie on a plane is described by the fundamental matrix.
In the actual slam application, the homography matrix needs to be used in the following three situations:

1. When the camera only rotates without translating, the epipolar constraint between the two views does not hold and the fundamental matrix F degenerates to the zero matrix; the homography matrix H must be used instead.
2. When the points in the scene all lie on the same plane, the homography matrix can be used to compute point matches between the two images.
3. When the camera's translation is small relative to the depth of the scene, the homography matrix H can also be used.

In the familiar ORB-SLAM, initialization estimates the homography matrix and the fundamental matrix simultaneously, computes the reprojection error under each model, and selects the one with the smaller reprojection error as the initialization result.

2.2.2 Building the model
A homography matrix is actually a 3×3 matrix. Through this matrix, a point in one image can be projected onto the corresponding point in the other image of the pair. The corresponding formula is
[u', v', 1]^T ~ H [u, v, 1]^T,  where H is the 3×3 matrix [[H11, H12, H13], [H21, H22, H23], [H31, H32, H33]], defined up to scale.
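As a minimal sketch, projecting a point through a 3×3 homography in homogeneous coordinates can be written in NumPy (the matrix values below are illustrative, not from the paper):

```python
import numpy as np

# An illustrative homography: a slight rotation plus a translation of (10, 5)
H = np.array([[0.998, -0.052, 10.0],
              [0.052,  0.998,  5.0],
              [0.0,    0.0,    1.0]])

def apply_homography(H, pt):
    """Project a 2D point through H using homogeneous coordinates."""
    u, v = pt
    x = H @ np.array([u, v, 1.0])
    return x[0] / x[2], x[1] / x[2]   # divide out the homogeneous scale

u2, v2 = apply_homography(H, (100.0, 50.0))
```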

In this paper, in order to train the model and evaluate the algorithm more conveniently, the author replaces the formula above with an equivalent parameterization. We know that when a picture undergoes a homography transformation, the coordinates of the points on the image change according to the transformation matrix (as above); conversely, if the coordinates of n points before and after the transformation are known, the transformation matrix between the two pictures can be recovered. For a planar relation n = 4, i.e. knowing at least four points is enough. The author therefore uses the displacements of four points to build a new model, as in the following formula
H_4point = (Δu1, Δv1, Δu2, Δv2, Δu3, Δv3, Δu4, Δv4),  where Δui = ui' − ui and Δvi = vi' − vi for the four corner points.

This parameterization has a one-to-one correspondence with the homography matrix:

Given the four correspondences (ui, vi) ↔ (ui', vi'), the 3×3 matrix H can be recovered (up to scale) by solving the resulting linear system, e.g. with a direct linear transform.

The advantage of this is that the matrix relation between the image pair is converted into a relation between points. When evaluating accuracy, the distance between the transformed point coordinates and the ground-truth coordinates can be computed directly as the error metric, and the same quantity can also be used as the loss function of the network.
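The one-to-one correspondence can be sketched in code: given four point pairs, H is recovered by solving a small linear system with the last entry fixed to 1. The helper below is an illustrative DLT-style solve (equivalent in spirit to OpenCV's `cv2.getPerspectiveTransform`), not the paper's code:

```python
import numpy as np

def homography_from_4pt(src, dst):
    """Recover H (with h33 fixed to 1) from four point correspondences.

    Each correspondence (u, v) -> (u', v') yields two linear equations in
    the eight remaining entries of H."""
    A, b = [], []
    for (u, v), (up, vp) in zip(src, dst):
        A.append([u, v, 1, 0, 0, 0, -u * up, -v * up]); b.append(up)
        A.append([0, 0, 0, u, v, 1, -u * vp, -v * vp]); b.append(vp)
    h = np.linalg.solve(np.array(A, float), np.array(b, float))
    return np.append(h, 1.0).reshape(3, 3)

# Four corners displaced by a pure translation of (10, 5)
src = [(0, 0), (100, 0), (100, 100), (0, 100)]
dst = [(10, 5), (110, 5), (110, 105), (10, 105)]
H = homography_from_4pt(src, dst)
```

For this pure translation, the recovered H is the identity rotation with a (10, 5) translation column, which illustrates why the 8 corner offsets carry exactly the same information as the 8 degrees of freedom of H.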

2.2.3 Generating the dataset
The author uses MS-COCO as the dataset, but it contains no image pairs, i.e. no ground-truth homography matrices, so the network cannot be trained directly. Therefore the author automatically generates image pairs from the original images in the dataset. The specific method is shown in the figure below.
[Figure: dataset generation pipeline]

The specific steps are:

  1. Select a square region in the image; the region can be represented by the four-point model above.

  2. Randomly translate the four corner points to obtain a quadrilateral; the homography between the two quadrilaterals is then known.

  3. Transform the image according to this homography and crop the same region again.

  4. The patches obtained in steps 1 and 3 form an image pair whose true homography is known.
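The steps above can be sketched in NumPy. This is a hedged, simplified sketch (nearest-neighbour backward warp, patch size and perturbation range chosen for illustration), not the paper's actual pipeline:

```python
import numpy as np

rng = np.random.default_rng(0)

def solve_h(src, dst):
    # DLT-style solve with h33 fixed to 1 (4 correspondences -> 8 unknowns)
    A, b = [], []
    for (u, v), (up, vp) in zip(src, dst):
        A.append([u, v, 1, 0, 0, 0, -u * up, -v * up]); b.append(up)
        A.append([0, 0, 0, u, v, 1, -u * vp, -v * vp]); b.append(vp)
    h = np.linalg.solve(np.array(A, float), np.array(b, float))
    return np.append(h, 1.0).reshape(3, 3)

def make_pair(img, patch=32, rho=8):
    """Steps 1-4: crop a square patch, perturb its corners by up to `rho`
    pixels, warp the whole image with the inverse homography, and crop the
    same window again.  Returns (patch_a, patch_b, 4-point label)."""
    h, w = img.shape
    x = int(rng.integers(rho, w - patch - rho))   # step 1: random square,
    y = int(rng.integers(rho, h - patch - rho))   # kept away from borders
    corners = [(x, y), (x + patch, y), (x + patch, y + patch), (x, y + patch)]
    offsets = rng.integers(-rho, rho + 1, size=(4, 2))          # step 2
    moved = [(u + du, v + dv) for (u, v), (du, dv) in zip(corners, offsets)]
    H = solve_h(corners, moved)
    # Step 3: applying H^-1 to the image = backward-sampling through H
    ys, xs = np.mgrid[0:h, 0:w]
    q = H @ np.stack([xs.ravel(), ys.ravel(), np.ones(h * w)])
    sx = np.clip((q[0] / q[2]).round().astype(int), 0, w - 1)
    sy = np.clip((q[1] / q[2]).round().astype(int), 0, h - 1)
    warped = img[sy, sx].reshape(h, w)            # nearest-neighbour warp
    a = img[y:y + patch, x:x + patch]             # step 4: the training pair
    b = warped[y:y + patch, x:x + patch]
    return a, b, offsets.ravel()                  # label = 8 corner offsets

img = rng.random((120, 160))
a, b, label = make_pair(img)
```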

2.2.4 Designing the network structure
The network structure of this article is shown in the figure below

insert image description here

The network comes in two variants, Classification HomographyNet and Regression HomographyNet. The latter directly outputs 8 values, which are naturally the x and y displacements of the four points. Its obvious drawback is that the confidence of each coordinate value is unknown; for example, there is no basis for setting variances in SLAM. Therefore Classification HomographyNet keeps the backbone of Regression HomographyNet but changes the output to an 8×21 vector: the 8 still corresponds to the x and y coordinates of the four points, while the 21 enumerates the possible quantized values of each coordinate, each with its probability, so that the confidence can be analyzed quantitatively. The visualization of the confidences output by the network is shown in the figure below.
[Figure: visualization of the corner confidence distributions]
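A sketch of how such an 8×21 classification output could be decoded into offsets plus confidences. The bin centres spanning −10..+10 px and the random logits are assumptions for illustration only; they stand in for the real network's head:

```python
import numpy as np

rng = np.random.default_rng(1)

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

# Hypothetical raw network output: 8 coordinates x 21 quantization bins
logits = rng.normal(size=(8, 21))
probs = softmax(logits, axis=1)

bins = np.arange(21) - 10           # assumed bin centres: -10..+10 pixels
best = probs.argmax(axis=1)         # most likely bin per coordinate
offsets = bins[best]                # decoded 4-point offsets (8 values)
confidence = probs[np.arange(8), best]   # probability mass of chosen bin
```

The per-coordinate probability mass is exactly what the regression variant cannot provide, and is what would be fed into, e.g., a SLAM back-end as a variance proxy.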

2.2.5 Experimental results
The accuracy of the experimental results is evaluated by transforming the coordinates of each corner with the homography matrix, measuring the L2 distance to the ground-truth coordinates, and then averaging the errors of the four corners. The author evaluates the outputs of both network variants against homographies computed from ORB features; the comparison is as follows:
[Figure: comparison of mean corner errors for the two network variants and ORB]
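The evaluation metric itself is simple enough to sketch directly; this illustrative helper computes the mean average corner error described above:

```python
import numpy as np

def mean_corner_error(pred_corners, true_corners):
    """Average L2 distance between the four predicted and ground-truth
    corner positions (the paper's evaluation metric)."""
    pred = np.asarray(pred_corners, float)
    true = np.asarray(true_corners, float)
    return np.linalg.norm(pred - true, axis=1).mean()

pred = [(10, 5), (110, 6), (108, 105), (10, 103)]   # illustrative estimate
true = [(10, 5), (110, 5), (110, 105), (10, 105)]   # ground truth
err = mean_corner_error(pred, true)   # per-corner errors 0, 1, 2, 2 -> 1.25
```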

The table shows no obvious advantage over ORB, but the author also shows several example images, each displaying the corrected box pair, from which the difference is clearly visible: the left is the ORB method, the right is the method of this paper.

2.3. Summary and thoughts
The paper designs an end-to-end homography estimation method, uses the four-point structure as an equivalent representation of the homography matrix, and on top of this structure designs the dataset generation method and the accuracy evaluation; the final results show that the effect is significantly better than estimation from ORB features.

It can be seen that the regression variant works best, but the classification variant yields confidences and can visualize the corrected results, which is advantageous in some applications.

The author summarizes two advantages of this system. First, speed: with an Nvidia Titan GPU it processes images at about 300 frames per second with a batch size of one.

Second, it turns the estimation of the homography matrix, one of the most fundamental problems in computer vision, into a machine learning problem that can be optimized for specific application scenarios such as indoor robot navigation with SLAM. In fact, the homography matrix has important applications in image stitching, the ORB-SLAM algorithm, augmented reality (AR), and camera calibration. The three authors of this paper are all from Magic Leap, an AR company that has received billions of dollars of investment from companies such as Google and Alibaba.

Some thoughts:
1) The design pattern of using deep learning to solve difficulties encountered by traditional methods is worth studying, so that the strengths of traditional and learned approaches can be fully combined.

  1. Ground truth is generated by warping images, and the matrix is then estimated on those same images: is the good performance partly due to overfitting to this synthetic generation process?

  2. The homography generally applies when the features are coplanar, yet the images shown in the paper's final comparison are clearly not coplanar (the displayed data can be understood as distant-view scenes). The reason the network still aligns them well may be that it was trained on exactly this kind of synthetic data, whereas ORB estimates from the real scene with no coplanarity assumption; this raises a question about the fairness of the experimental design.

3. The second paper: Toward Geometric Deep SLAM

4. The third paper: SuperPoint: Self-Supervised Interest Point Detection and Description

Source: blog.csdn.net/Darlingqiang/article/details/131632941