Fourteen Lectures on Visual SLAM---Preliminary Knowledge

What is SLAM?

SLAM is the abbreviation of Simultaneous Localization and Mapping, which means simultaneously localizing and building a map.
It refers to a subject equipped with specific sensors that, without prior information about the environment, builds a model of the environment during movement while estimating its own motion at the same time. If the sensor is mainly a camera, it is called "visual SLAM".

The purpose of SLAM

The purpose of SLAM is to solve the problems of "localization" and "map construction": on the one hand, we need to estimate the position of the sensor itself, and on the other hand, we need to build a model of the surrounding environment.
This requires sensor information. Sensors observe the external world in certain forms, but different sensors observe in different ways.
We hope to perform SLAM in real time and without prior knowledge. When using a camera as the sensor, what we have to do is infer the motion of the camera and the structure of the surrounding environment from a stream of continuously captured images (which form a video).

SLAM application

  • Indoor sweeping robots and mobile robots need localization
  • Self-driving cars outdoors need localization
  • Drones in the sky need localization
  • Virtual reality and augmented reality devices need localization

SLAM system module

  • Visual odometry
  • Backend optimization
  • Mapping
  • Loop closure detection

Sensors are divided into two categories

  • Carried on the robot body, such as the robot's wheel encoders, cameras, laser sensors, etc.
  • Installed in the environment, such as guide rails, QR code markers, etc. Sensing equipment installed in the environment can usually measure the robot's position directly and solves the localization problem simply and effectively. However, since such equipment requires the environment to be set up by humans, it limits the range of situations in which the robot can be used.

Sensors installed in the environment impose constraints on the environment: only when these constraints are satisfied can positioning schemes based on them work, and when they are not satisfied, positioning cannot be performed. Such sensors are simple and reliable, but they cannot provide a universal, one-size-fits-all solution. In contrast, the sensors carried on the robot body, such as laser sensors, cameras, wheel encoders, and inertial measurement units (IMUs), usually measure indirect physical quantities rather than position directly. For example, wheel encoders measure the angle of wheel rotation, IMUs measure the angular velocity and acceleration of motion, and cameras and laser sensors read certain observations of the external environment.

Using the sensors carried on the robot body to complete SLAM is our main concern. In particular, when talking about visual SLAM, we mainly refer to how cameras solve the localization and mapping problems. Likewise, if the sensor is mainly a laser scanner, it is called laser SLAM.

The camera used in SLAM is not the same thing as the SLR cameras we usually see. It is simpler and usually does not carry an expensive lens; it shoots the surrounding environment at a certain rate, forming a continuous video stream. An ordinary camera captures images at about 30 frames per second, while high-speed cameras are faster.

According to how they work, cameras can be divided into three categories: monocular cameras, binocular (stereo) cameras, and depth (RGB-D) cameras. Intuitively, a monocular camera has a single camera and a binocular camera has two. The principle of RGB-D cameras is more complex: in addition to collecting color images, they can also read the distance between each pixel and the camera. Depth cameras usually carry multiple cameras, and their working principle differs from that of ordinary cameras.

There are also special or emerging categories in SLAM such as panoramic cameras and event cameras.

Monocular camera

The practice of using only one camera for SLAM is called monocular SLAM. This kind of sensor has a very simple structure and low cost, so monocular SLAM has attracted great attention from researchers.
Monocular camera data: photos. A photo is essentially a projection of a scene onto the camera's imaging plane. It records the three-dimensional world in two-dimensional form, and this process loses one dimension of the scene, namely depth (or distance). With a monocular camera, we cannot compute the distance between objects in the scene and the camera from a single image, yet this distance is critical information in SLAM. Because we humans have seen a vast number of images, we have developed an intuitive sense of distance (a sense of space) for most scenes, which helps us judge the distance relationships among objects in an image. However, due to perspective, near and far objects of different sizes may appear to be the same size in an image, so a single image cannot determine their true distances.
Since the image captured by a monocular camera is only a two-dimensional projection of three-dimensional space, if we really want to recover the three-dimensional structure, we must change the camera's viewpoint. The same principle applies to monocular SLAM: we must move the camera in order to estimate its motion, and at the same time estimate the distances and sizes of objects in the scene, that is, the structure. Nearby objects move fast in the image, distant objects move slowly, and extremely distant (near-infinity) objects such as the sun and moon appear motionless. Therefore, when the camera moves, the movement of these objects on the image forms disparity (parallax). Through disparity, we can quantitatively determine which objects are far away and which are close. Even so, the distances and sizes we obtain are still only relative values.
The trajectory and map estimated by monocular SLAM differ from the real trajectory and map by an unknown factor, the so-called scale. Since monocular SLAM cannot determine this true scale from images alone, this is also known as scale ambiguity (Scale Ambiguity).
Depth can only be computed after translation, and the true scale cannot be determined. These two things cause a lot of trouble for the application of monocular SLAM. The fundamental reason is that depth cannot be determined from a single image. To obtain depth, binocular cameras and depth cameras are used.
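As a hedged aside, the scale ambiguity can be written down in one line. The pinhole-projection symbols below (K, R, t, P, s) are generic notation introduced only for this illustration; they are not defined in the text above.

```latex
% Scale ambiguity sketch (illustrative notation).
% A 3D point P projects to a pixel u (in homogeneous coordinates, up to a scalar) as
\[ u \simeq K\,(R P + t). \]
% Scaling every scene point and the translation by the same factor s > 0 gives
\[ K\,\bigl(R\,(sP) + s\,t\bigr) = s\,K\,(R P + t) \simeq u , \]
% so all images are unchanged and the factor s cannot be recovered from images alone.
```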

Binocular and depth cameras

The purpose of using a binocular camera and a depth camera is to measure the distance between an object and the camera by some means, overcoming the shortcoming of a monocular camera that cannot know the distance. Once the distance is known, the three-dimensional structure of the scene can be recovered from a single image while eliminating scale uncertainty.
Both are used to measure distance, but binocular cameras and depth cameras measure depth by different principles. A binocular camera consists of two monocular cameras, with a known distance between them (called the baseline). We use this baseline to estimate the spatial position of each pixel, very much like the human eye does.
We humans can judge the distance of an object through the differences between the left-eye and right-eye images, and the same holds on a computer. If you extend the binocular camera, you can build a multi-camera rig, but the essence is no different.
Binocular camera data: a left-eye image and a right-eye image. Through the differences between the left and right images, the distance between objects in the scene and the camera can be judged.
The depth range measured by a binocular camera is related to the baseline: the larger the baseline, the farther the objects that can be measured, which is why binocular cameras mounted on autonomous vehicles are usually quite large. A binocular camera's distance estimate is obtained by comparing the left and right images and does not rely on other sensing devices, so it can be used both indoors and outdoors. The disadvantages of binocular or multi-camera setups are that configuration and calibration are relatively complex, and their depth range and accuracy are limited by the baseline and the image resolution. Moreover, computing disparity is very computationally expensive and requires GPU or FPGA acceleration to output the distance information of the entire image in real time. Therefore, under current conditions, the amount of computation is one of the main problems of binocular vision.
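As a rough illustration of the baseline-depth relationship described above, here is a minimal sketch; the function name and the focal length, baseline, and disparity values are illustrative assumptions, not from the original text.

```python
# Minimal sketch of how a stereo (binocular) camera turns disparity into depth.

def stereo_depth(focal_px: float, baseline_m: float, disparity_px: float) -> float:
    """Depth of a point from its left/right pixel disparity: z = f * b / d."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive")
    return focal_px * baseline_m / disparity_px

# Example: a 700-pixel focal length, 12 cm baseline, and 10-pixel disparity
# give a depth of about 8.4 m. A wider baseline measures farther for the same
# disparity resolution, which is why vehicles carry large stereo rigs.
print(stereo_depth(focal_px=700.0, baseline_m=0.12, disparity_px=10.0))
```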
The biggest feature of a depth camera (also known as an RGB-D camera) is that, like a laser sensor, it actively measures the distance from objects to the camera by emitting light toward them and receiving the returned light, using infrared structured light or the Time-of-Flight (ToF) principle. Unlike a binocular camera, it does not obtain depth through software computation but through physical measurement, so it saves a great deal of computing resources compared with binocular cameras.
Commonly used RGB-D cameras include the Kinect/Kinect V2, Xtion Pro Live, RealSense, etc. They are also used for face recognition on some mobile phones. Most current RGB-D cameras still have problems such as a narrow measurement range, high noise, a small field of view, susceptibility to sunlight interference, and the inability to measure transmissive (e.g., transparent) materials. For SLAM, they are mainly used indoors; outdoor use is more difficult.
RGB-D data: Depth cameras can directly measure the image and distance of an object, thereby recovering the three-dimensional structure.
As the camera moves in the scene, a series of continuously changing images will be obtained. The goal of visual SLAM is to perform positioning and map construction through such images. As long as we input data, we can continuously output positioning and map information.

Classic visual SLAM framework

The entire visual SLAM process includes the following steps:

  • **Sensor information reading.** In visual SLAM this mainly means reading and preprocessing camera images. On a robot, there may also be wheel encoders, inertial sensors, and other information to read and synchronize.
  • **Front-end visual odometry (VO).** The task of visual odometry is to estimate the camera motion between adjacent images and what the local map looks like. VO is also called the front end.
  • **Back-end (nonlinear) optimization.** The backend accepts camera poses measured by visual odometry at different times, as well as loop closure information, and optimizes them to obtain a globally consistent trajectory and map. Because it comes after VO, it is also called the back end.
  • **Loop closure detection.** Loop closure detection determines whether the robot has reached a previously visited position. If a loop is detected, it provides the information to the backend for processing.
  • **Mapping.** It builds a map corresponding to the task requirements based on the estimated trajectory.

If the working environment is limited to static, rigid bodies, with no obvious lighting changes and no human interference, then SLAM technology in such scenarios is already quite mature.

Visual odometry

Visual odometry is concerned with the camera motion between adjacent images. The simplest case is, of course, the motion between two images.
Things that seem intuitively natural to humans are very difficult in computer vision: inside a computer, an image is just a numerical matrix.
In visual SLAM, we can only see pixels one by one and know that they are the result of projecting certain spatial points onto the camera's imaging plane. Therefore, in order to quantitatively estimate camera motion, we must first understand the geometric relationship between the camera and the spatial points.
Visual odometry is able to estimate camera motion from adjacent frames and recover the spatial structure of the scene. It is called an "odometer" because, like an actual odometer, it only accumulates motion over adjacent moments and has no connection with earlier history. In this sense, visual odometry is like a species with only short-term memory (though it need not be limited to two frames; it can use more, such as 5-10 frames).
Now suppose we have a visual odometer that estimates the camera motion between two adjacent images. On the one hand, as long as we "string together" the motions at adjacent moments, we obtain the robot's trajectory and thereby solve the localization problem. On the other hand, from the camera position at each moment we can compute the spatial point corresponding to each pixel, and we obtain the map.
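A minimal sketch of this "stringing together" of adjacent motions, using planar (x, y, θ) poses for simplicity; the values and function names are illustrative, and real visual odometry works with full 3D poses.

```python
# Chain relative motions from visual odometry into a trajectory (planar example).
import math

def compose(pose, delta):
    """Apply a relative motion 'delta' (expressed in the current frame) to 'pose'."""
    x, y, th = pose
    dx, dy, dth = delta
    return (x + math.cos(th) * dx - math.sin(th) * dy,
            y + math.sin(th) * dx + math.cos(th) * dy,
            th + dth)

# Relative motions estimated between consecutive frames (illustrative values).
relative_motions = [(1.0, 0.0, 0.0), (1.0, 0.0, math.pi / 2), (1.0, 0.0, 0.0)]

trajectory = [(0.0, 0.0, 0.0)]          # start at the origin
for delta in relative_motions:
    trajectory.append(compose(trajectory[-1], delta))
print(trajectory)  # small errors in each 'delta' accumulate along this chain (drift)
```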
Visual odometry is indeed key to SLAM, but if we estimate the trajectory only through visual odometry, accumulated drift will inevitably occur. This is because visual odometry, in the simplest case, only estimates the motion between two images.
Each such estimate carries some error, and accumulated errors make the long-term estimate inaccurate; this is called drift, and it prevents us from building a consistent map, so loop closure detection and global correction are required.
In order to solve the drift problem, we need two more techniques: backend optimization and loop closure detection. Loop closure detection is responsible for detecting that "the robot has returned to a previously visited place", while backend optimization corrects the shape of the entire trajectory based on this information.

Backend optimization

Back-end optimization mainly deals with the noise in the SLAM process. In reality, even the most accurate sensor carries a certain amount of noise: cheap sensors have larger measurement errors, expensive ones may have smaller errors, and some sensors are also affected by magnetic fields and temperature. So, in addition to solving "how to estimate camera motion from images", we also need to care about how much noise this estimate contains, how the noise propagates from one moment to the next, and how confident we are in the current estimate.
The issue considered in back-end optimization is how to estimate the state of the entire system from these noisy data, and how uncertain that state estimate is; this is called maximum a posteriori (Maximum-a-Posteriori, MAP) estimation. The state here includes both the robot's own trajectory and the map.
The visual odometry part is sometimes called the "front end". In the SLAM framework, the front end provides the back end with the data to be optimized, as well as initial values for those data. The back end is responsible for the overall optimization process; it often faces only the data and does not need to care about which sensor the data comes from.
In visual SLAM, the front-end is more related to computer vision research fields, such as image feature extraction and matching, while the back-end is mainly filtering and nonlinear optimization algorithms.
In a historical sense, what we now call back-end optimization was, for a long time, simply called "SLAM research". The early SLAM problem was a state estimation problem, exactly what back-end optimization is meant to solve. In the earliest papers proposing SLAM, it was called "estimation of spatial uncertainty", which hints at the essence of SLAM: estimating the spatial uncertainty of the moving subject itself and of the surrounding environment. To solve the SLAM problem, we need state estimation theory to express the uncertainty of localization and mapping, and then use filters or nonlinear optimization to estimate the mean and uncertainty (variance) of the state.
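A schematic statement of the MAP estimate mentioned above; x (the states, i.e., the trajectory and map) and z (all sensor observations) are generic symbols used only for this illustration.

```latex
% MAP estimation sketch (generic notation, not the book's derivation).
\[
  x^{*} \;=\; \arg\max_{x} \; p(x \mid z)
        \;=\; \arg\max_{x} \; \frac{p(z \mid x)\,p(x)}{p(z)}
        \;=\; \arg\max_{x} \; p(z \mid x)\,p(x),
\]
% i.e. the backend seeks the states that best explain the noisy measurements;
% filtering or nonlinear optimization also reports their uncertainty (covariance).
```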

Loop closure detection

Loop closure detection, also known as closed-loop detection, mainly solves the problem of the position estimate drifting over time.
How can it be solved? Suppose that, in reality, the robot returns to the origin after a period of motion, but due to drift its position estimate does not return to the origin. If there were some way to let the robot know that it has "returned to the origin", or to recognize the "origin", we could then "pull" the position estimate back there and eliminate the drift. This is what loop closure detection does.
Loop closure detection is closely related to both "localization" and "mapping".
In fact, we can think of the main purpose of the map as letting the robot know where it has been. One simple way to implement loop closure detection is to place a marker (such as a QR code image) at a known location; whenever the robot sees this marker, it knows it has returned to the starting point. However, such a marker is essentially a sensor installed in the environment, which restricts the application environment.
We would prefer the robot to complete this task using the sensors it carries, that is, using the images themselves (for example, by judging the similarity between images). If loop closure detection succeeds, the accumulated error can be significantly reduced.
Visual loop closure detection is essentially an algorithm for computing the similarity of image data. Since images are very rich in information, the difficulty of correctly detecting loop closures is greatly reduced.
After a loop closure is detected, we tell the back-end optimization algorithm that "A and B are the same point". Based on this new information, the backend adjusts the trajectory and map to match the loop closure result. In this way, with sufficient and correct loop closures, we can eliminate the accumulated error and obtain a globally consistent trajectory and map.
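As a toy sketch of "loop detection computes image similarity": real systems typically use bag-of-words over image features, but the stand-in below just compares grayscale histograms; the function name, bin count, and demo data are illustrative assumptions.

```python
# Toy image-similarity measure as a stand-in for visual loop closure detection.
import numpy as np

def histogram_similarity(img_a: np.ndarray, img_b: np.ndarray, bins: int = 32) -> float:
    """Return a similarity in [0, 1]; higher means the images look more alike."""
    h_a, _ = np.histogram(img_a, bins=bins, range=(0, 256), density=True)
    h_b, _ = np.histogram(img_b, bins=bins, range=(0, 256), density=True)
    # Histogram intersection, normalized so that identical images give 1.0.
    return float(np.minimum(h_a, h_b).sum() / max(h_a.sum(), 1e-12))

rng = np.random.default_rng(0)
frame = rng.integers(0, 256, size=(480, 640))
print(histogram_similarity(frame, frame))  # 1.0 for identical frames

# If the similarity exceeds a threshold, report "current frame matches frame k"
# to the backend, which then pulls the drifted trajectory back into agreement.
```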

Mapping

Mapping refers to the process of building a map. The map is a description of the environment, but this description is not fixed; it depends on the application of SLAM.
We have many different ideas about and needs for maps. Therefore, compared with the visual odometry, back-end optimization, and loop closure detection discussed above, mapping does not have a fixed form or algorithm. A collection of spatial points can be called a map; a beautiful 3D model is also a map; a drawing marked with cities, villages, railways, and rivers is also a map. The form of the map depends on the application of SLAM.
Generally speaking, they can be divided into two types: metric maps and topological maps .

Metric Map

Metric maps emphasize accurately representing the positional relationships of objects in the map, and are usually classified as sparse or dense. Sparse maps abstract the scene to a certain degree and do not necessarily represent all objects; a sparse landmark map is sufficient for localization. Dense maps, by contrast, model everything that is seen, and are often needed for navigation (otherwise, how would we know whether there is a wall between two landmarks?). A dense map usually consists of many small cells (voxels) at a certain resolution, and each cell typically takes one of three states: occupied, free, or unknown, expressing whether there is an object in that cell. When a spatial location is queried, the map can tell whether it is passable. Such maps can be used by various navigation algorithms, such as A*, D*, etc., and are valued by robotics researchers. However, this kind of map needs to store the state of every cell, which consumes a lot of storage space, and in most cases many details of the map are useless. Moreover, large-scale metric maps sometimes suffer from consistency problems: a small steering error may cause the walls of two rooms to overlap, rendering the map invalid.
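A minimal sketch of an occupancy grid with the occupied / free / unknown states described above; the resolution, class names, and query methods are my own illustration, not from the original text.

```python
# Minimal 2D occupancy-grid sketch (dense metric map).
from enum import Enum

class Cell(Enum):
    UNKNOWN = 0
    FREE = 1
    OCCUPIED = 2

class OccupancyGrid2D:
    def __init__(self, width: int, height: int, resolution_m: float = 0.05):
        self.resolution_m = resolution_m                      # side length of one cell
        self.cells = [[Cell.UNKNOWN] * width for _ in range(height)]

    def set(self, x_m: float, y_m: float, state: Cell) -> None:
        self.cells[int(y_m / self.resolution_m)][int(x_m / self.resolution_m)] = state

    def is_passable(self, x_m: float, y_m: float) -> bool:
        """Navigation query: can the robot plan through this location?"""
        return self.cells[int(y_m / self.resolution_m)][int(x_m / self.resolution_m)] == Cell.FREE

grid = OccupancyGrid2D(width=200, height=200)   # a 10 m x 10 m area at 5 cm resolution
grid.set(1.0, 1.0, Cell.FREE)
print(grid.is_passable(1.0, 1.0), grid.is_passable(2.0, 2.0))  # True False (unknown is not passable)
```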

Topological Map

Compared with the precision of metric maps, topological maps emphasize the relationships between map elements. A topological map is a graph, composed of nodes and edges, that only considers the connectivity between nodes (for example, only whether points A and B are connected, without considering how to get from A to B). It relaxes the map's requirement for precise locations, removes map details, and is therefore a more compact representation. However, topological maps are not good at expressing maps with complex structures; how to split a map into nodes and edges, and how to use topological structures for navigation and path planning, are still open research problems.
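A minimal sketch of a topological map as a graph of nodes and edges with a pure connectivity query; the place names and the adjacency-list representation are illustrative assumptions.

```python
# Topological map sketch: nodes and edges only, no metric positions.
from collections import deque

topo_map = {                      # adjacency list: which places connect to which
    "kitchen": ["hallway"],
    "hallway": ["kitchen", "bedroom", "living_room"],
    "bedroom": ["hallway"],
    "living_room": ["hallway"],
}

def reachable(graph: dict, start: str, goal: str) -> bool:
    """Connectivity query: is there any path from start to goal?"""
    seen, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            return True
        for nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False

print(reachable(topo_map, "kitchen", "bedroom"))  # True: connected via the hallway
```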

First, let us explain what the robot's position x is, since we have not yet clearly defined the meaning of "position".
When moving in a plane, the robot can parameterize its position with two coordinates plus a rotation angle.
Motion in three-dimensional space involves three axes, so the robot's motion is described by translation along the three axes and rotation about the three axes, for a total of six degrees of freedom.
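Written out schematically (the symbols below are generic, not from the text above):

```latex
% Pose parameterizations implied by the paragraph above.
% In the plane, a pose is two coordinates plus a heading angle:
\[ x_{2\mathrm{D}} = (x,\; y,\; \theta). \]
% In 3D, a pose has six degrees of freedom: translation along the three axes
% plus rotation about the three axes, e.g.
\[ x_{3\mathrm{D}} = (x,\; y,\; z,\; \mathrm{roll},\; \mathrm{pitch},\; \mathrm{yaw}). \]
```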
