A 7,000-Word Summary of Common Face Recognition Methods and Systems

0. Introduction

Face detection and recognition is a very practical field with many application scenarios. This article gives an overall view of face recognition, covering the recognition algorithms and their evaluation metrics, the composition of a complete face recognition system, and related topics. I hope it will be helpful.

1. Goals of face recognition

To sum up, there are two goals. First, recognize the same person: no matter how a person's appearance changes, the system should still know it is the same person. Second, distinguish different people: two people may look very similar, or both may be wearing makeup, but no matter how their appearance changes, face recognition should know they are two different people.

Face recognition itself is a type of biometric technology that mainly provides a means of identity authentication. In terms of accuracy it is not the highest among biometrics, because it is affected by many external conditions, such as lighting. Its advantage is that it generally requires little cooperation from the user. Surveillance cameras, computer cameras, mobile phone cameras, and other photographic equipment are now everywhere, and this kind of ordinary visible-light device is enough to do face recognition. Therefore, introducing face recognition usually requires very little new investment, which is its advantage.

2. Face recognition process

The core process of face recognition — core meaning that no matter what kind of face recognition system it is, this process is basically always present — is: first, face detection; second, face alignment; third, feature extraction. These three steps are performed on every photo. When comparing, the extracted features are compared to determine whether the two faces belong to the same person.


3. Face detection

Face detection determines whether there is a face in a larger scene, finds the position of the face, and crops it out. It is a type of object detection and is the basis of the entire face perception task. The basic method of face detection is to slide a window over an image pyramid, use a classifier to select candidate windows, and use a regression model to refine their positions.
[Figure: an image pyramid at 0.3×, 0.6×, and 1.0× scale with a fixed-size sliding window]
The three windows drawn above correspond to scales of 0.3×, 0.6×, and 1.0×. When the position and size of the face are unknown, this technique resizes the image itself to different scales while keeping the sliding window the same size. The input size of a deep network is generally fixed, so the sliding window is basically fixed; to let the fixed window cover faces of different sizes, the whole image is scaled to different proportions. The 0.3, 0.6, and 1.0 shown here are just examples; many other scales can be used in practice.
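
To make the pyramid idea concrete, here is a minimal sketch of sliding a fixed-size window over several scales of the same image. The window size, stride, threshold, and the `classify_window` / `regress_window` callables are illustrative placeholders, not a specific published detector:

```python
import cv2
import numpy as np

def detect_faces(image, classify_window, regress_window,
                 scales=(0.3, 0.6, 1.0), win=48, stride=16, threshold=0.9):
    """Slide a fixed-size window over an image pyramid (sketch).

    classify_window(patch) -> face confidence in [0, 1]
    regress_window(patch)  -> (dx, dy, dw, dh) box corrections, assumed here
                              to be pixel offsets in the resized image
    """
    candidates = []
    for s in scales:
        resized = cv2.resize(image, None, fx=s, fy=s)
        h, w = resized.shape[:2]
        for y in range(0, h - win + 1, stride):
            for x in range(0, w - win + 1, stride):
                patch = resized[y:y + win, x:x + win]
                score = classify_window(patch)
                if score < threshold:
                    continue
                dx, dy, dw, dh = regress_window(patch)
                # Map the corrected box back to original image coordinates.
                bx, by = (x + dx) / s, (y + dy) / s
                bw, bh = (win + dw) / s, (win + dh) / s
                candidates.append((bx, by, bw, bh, score))
    return candidates
```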

The classifier looks at each position of the sliding window and decides whether it contains a face. Because the window may not cover the entire face, or may be larger than the face, feeding the window into a regression model helps correct the position and makes face detection more accurate.

The input is a sliding window. The output says, if there is a face inside, in which direction and by how much the window should be corrected: Δx, Δy, Δw, Δh are the corrections to its coordinates, width, and height. Combining these correction amounts with the classifier's decision that the window contains a face yields a more accurate face position.
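
As a concrete illustration of Δx, Δy, Δw, Δh, here is a sketch of applying the corrections under one common parameterization; the exact convention depends on how the regressor was trained and is an assumption here:

```python
import math

def apply_box_correction(x, y, w, h, dx, dy, dw, dh):
    """Apply (dx, dy, dw, dh) corrections to a candidate window.

    Assumes the regressor outputs center offsets normalized by the window
    size and log-scale factors for width/height -- one common box-regression
    convention; other parameterizations (e.g. plain pixel offsets) also exist.
    """
    cx, cy = x + w / 2.0, y + h / 2.0          # current window center
    cx, cy = cx + dx * w, cy + dy * h          # shift the center
    w, h = w * math.exp(dw), h * math.exp(dh)  # rescale width and height
    return cx - w / 2.0, cy - h / 2.0, w, h
```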

The above is the process of face detection; the same approach also applies to other object detection tasks.

4. Evaluation indicators for face detection

No matter what kind of model, evaluation covers two aspects: speed and accuracy.

4.1 Speed

4.1.1 Speed is the detection speed at a specified resolution

The resolution is specified because at every position the sliding window reaches, a classification and regression judgment must be made. The larger the image, the more windows need to be evaluated, and the longer the whole face detection takes.

Therefore, to evaluate an algorithm or model, you have to look at its detection speed at a fixed resolution. Detection speed is usually expressed as the time it takes to detect faces in one image, for example 100 ms, 200 ms, 50 ms, or 30 ms.

Another way to express speed is fps (frames per second), that is, how many images can be processed per second. Ordinary web cameras today typically run at 25 or 30 fps. The benefit of fps is that it makes it easy to judge whether face detection can run in real time: as long as the fps of face detection is greater than the fps of the camera, real-time detection can be achieved; otherwise it cannot.
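
As a quick sketch of that rule, assuming we have measured the average per-image detection latency:

```python
def is_real_time(avg_detection_ms, camera_fps=25.0):
    """Face detection keeps up with the camera if its own fps
    (1000 / latency in ms) is at least the camera's fps."""
    detector_fps = 1000.0 / avg_detection_ms
    return detector_fps >= camera_fps

print(is_real_time(30.0))   # ~33.3 fps vs a 25 fps camera -> True
print(is_real_time(50.0))   # 20.0 fps vs a 25 fps camera -> False
```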

4.1.2 Is the speed affected by the number of faces in the same picture?

In our actual operation, the speed is mostly unaffected, because it is dominated by the number of sliding windows. The number of faces actually detected does not weigh heavily, although it has a slight impact.

4.2 Accuracy

Accuracy is basically characterized by the recall rate, the false detection rate, and the ROC curve. The recall rate is the proportion of real faces in the photos that the model correctly judges to be faces. The false detection rate (the error rate on negative samples) is the proportion of non-faces that are misjudged to be faces.

4.2.1 ACC accuracy

ACC is computed as the number of correctly classified samples divided by the total number of samples. For example, take 10,000 photos for face detection, some containing faces and some not, and measure what proportion is judged correctly.

But there is a problem with this accuracy measure: it ignores the ratio of positive to negative samples. It does not separately care about the correct rate on positive samples or on negative samples; it only cares about the overall total. When a model reports 90% accuracy, you cannot tell how it behaves on positives versus negatives. Also, whether for classification or regression, a classification model generally first produces a confidence score; when the confidence is greater than some threshold the sample is judged positive, and when it is below that threshold it is judged negative.

ACC is also adjustable: changing the confidence threshold changes the accuracy.

Therefore, the ACC value itself is strongly affected by the sample proportions, so using it alone to characterize the quality of a model is somewhat problematic. When a test report claims 99.9%, it is easy to be misled by looking only at this value. In other words, this statistic is biased.
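
To make the difference concrete, here is a small sketch computing ACC together with the per-class rates from confidence scores and ground-truth labels (1 = face, 0 = not a face); ACC mixes the two classes, while TPR and FPR keep them separate:

```python
import numpy as np

def acc_tpr_fpr(scores, labels, threshold):
    """ACC mixes positives and negatives; TPR/FPR report them separately."""
    scores, labels = np.asarray(scores), np.asarray(labels)
    pred = scores >= threshold
    acc = np.mean(pred == labels)
    tpr = np.mean(pred[labels == 1])   # correct rate on positive samples
    fpr = np.mean(pred[labels == 0])   # error rate on negative samples
    return acc, tpr, fpr
```

With the same detector, ACC shifts as the positive/negative mix of the test set changes, whereas the (FPR, TPR) pair at a given threshold does not, which is why the ROC curve below is preferred.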

In order to solve this problem, a curve called ROC is generally used to characterize the accuracy of this model.

4.2.2 ROC receiver operating characteristic curve


The abscissa: FPR (False Positive Rate), the error rate on negative samples.
The ordinate: TPR (True Positive Rate), the correct rate on positive samples.

This way, the algorithm's performance on positive and negative samples can be distinguished, and the shape of the curve has nothing to do with the ratio of positive to negative samples.

The ROC (Receiver Operating Characteristic) curve plots the negative-sample error rate on the abscissa against the positive-sample correct rate on the ordinate. On this graph the same model is not a single point but a line: each point on the line corresponds to one confidence threshold. The higher the threshold, the more stringent the model; the lower, the less stringent. The curve therefore reflects how the model behaves as the confidence threshold changes.
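
Tracing the ROC curve is then just sweeping the threshold and recording the (FPR, TPR) pair at each setting, as in this sketch:

```python
import numpy as np

def roc_curve_points(scores, labels, num_thresholds=101):
    """Sweep the confidence threshold and collect (FPR, TPR) points."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    points = []
    for t in np.linspace(0.0, 1.0, num_thresholds):
        pred = scores >= t
        tpr = np.mean(pred[labels])     # correct rate on positive samples
        fpr = np.mean(pred[~labels])    # error rate on negative samples
        points.append((fpr, tpr))
    return points
```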

In the future, it is best not to ask directly "what is your accuracy", but to look at the ROC curve, which makes it easier to judge the capability of a model.

5. Face alignment

The purpose of face alignment is to adjust the face texture to a standard position as much as possible and reduce the difficulty for the face recognizer.

To artificially reduce the difficulty, the face can first be aligned so that the detected eyes, nose, and mouth all land in roughly the same positions. When the model compares two faces, it then only needs to look at the same locations and judge whether they are similar or very different. That is why the alignment step is done. The common method today is two-dimensional: find key feature points in the image. There are schemes with five points, nineteen points, sixty-some points, even more than eighty points, but for face recognition five points are basically enough.

For the pixels other than these five points, the alignment performs something like an interpolation (a warp) and maps them to the corresponding positions; the result can then be sent to the subsequent face recognizer. This is the general approach, but there are also more cutting-edge approaches. Some research institutions use so-called 3D face alignment: the model is shown what a frontal face looks like and what it looks like rotated, say, 45 degrees. After being trained on such images, when it sees a face rotated 45 degrees to the left or right, it can guess with high probability what that face would look like turned back to frontal.
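
Here is a minimal sketch of the two-dimensional, five-point alignment: estimate a similarity transform that maps the detected landmarks onto a fixed template and warp the image with it. The template coordinates and the 112×112 output size are illustrative assumptions; every system defines its own reference positions:

```python
import cv2
import numpy as np

# Illustrative reference positions of the 5 landmarks in a 112x112 crop:
# left eye, right eye, nose tip, left mouth corner, right mouth corner.
REFERENCE_5PTS = np.float32([
    [38.3, 51.7], [73.5, 51.5], [56.0, 71.7], [41.5, 92.4], [70.7, 92.2],
])

def align_face(image, landmarks_5pts, size=(112, 112)):
    """Warp the face so its 5 detected landmarks land on the template."""
    src = np.float32(landmarks_5pts)
    # Estimate a similarity transform (rotation + scale + translation).
    matrix, _ = cv2.estimateAffinePartial2D(src, REFERENCE_5PTS)
    return cv2.warpAffine(image, matrix, size)
```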

6. Facial feature point extraction algorithms

Earlier traditional methods were the so-called local texture models, global texture models, shape regression models, and the like. What is popular now is the use of deep convolutional neural networks (CNN) or recurrent neural networks (RNN), convolutional networks with 3DMM parameters (the 3DMM parameters carry three-dimensional information), and cascaded deep neural networks.
With a cascaded deep neural network, the task is: given the face, infer the positions of the five landmark points. If a single model had to do this in one shot, the model would need to be very complicated.

But how can the complexity of this model be reduced?

The answer is to use multiple stages. The first network makes a rough guess — an acceptable but not very accurate one — of roughly where the five points of the face are. These five points, together with the original image, are then fed into a second network that outputs the correction amounts. Estimating a correction on top of a rough initial guess is slightly easier than regressing accurate five points directly from the original image. This gradual refinement, cascading multiple networks together, achieves a better balance between speed and accuracy. In practice, two stages are usually about enough.
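
A sketch of the two-stage cascade; `coarse_net` and `refine_net` stand for whatever landmark networks are actually used, and their input/output conventions here are assumptions for illustration:

```python
import numpy as np

def predict_landmarks(face_image, coarse_net, refine_net):
    """Two-stage cascade: rough guess, then a learned correction.

    coarse_net(image)          -> (5, 2) rough landmark coordinates
    refine_net(image, points)  -> (5, 2) corrections (deltas) to those points
    """
    rough = coarse_net(face_image)                 # stage 1: coarse positions
    delta = refine_net(face_image, rough)          # stage 2: residual correction
    return np.asarray(rough) + np.asarray(delta)   # refined 5-point estimate
```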

7. Evaluation indicators for facial feature point extraction

7.1 Accuracy

In order to compare faces of different sizes on an equal footing, a statistic called the normalized root mean square error is used.

NRMSE (Normalized Root Mean Square Error), the normalized root mean square error, is used to measure the difference between the coordinates of each feature point and the labeled coordinates.

For example: we draw five points on paper, and then let the machine predict where those five points are. The closer the predicted positions are to the real ones, the more accurate the prediction. In general the predictions will have some deviation, so how do we express the accuracy? We usually use the average or root-mean-square value of the point-to-point distances. But a problem arises: when the same machine predicts on images of different sizes, the values differ, because the larger the image, the larger the absolute error; the same applies to faces of different sizes. The solution is to take the original size of the face into account. The denominator is usually the distance between the eyes or the diagonal of the face box, and the distance error is divided by that. The result is a value that basically does not change with the size of the face, and that is what is used for evaluation.
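
A sketch of NRMSE, assuming normalization by the interocular distance of the labeled points; dividing by the face-box diagonal instead works the same way:

```python
import numpy as np

def nrmse(pred, truth, left_eye_idx=0, right_eye_idx=1):
    """Normalized RMSE between predicted and labeled landmarks.

    pred, truth: (N, 2) arrays of landmark coordinates.
    Normalized by the interocular distance of the labeled points.
    """
    pred, truth = np.asarray(pred, float), np.asarray(truth, float)
    per_point_error = np.linalg.norm(pred - truth, axis=1)
    rmse = np.sqrt(np.mean(per_point_error ** 2))
    interocular = np.linalg.norm(truth[left_eye_idx] - truth[right_eye_idx])
    return rmse / interocular
```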

8. Face comparison

8.1 Purpose

That is to determine whether the two aligned faces belong to the same person.

8.2 Difficulties

The same face appears in different states under different conditions: it is particularly affected by lighting, smoke or haze, makeup, and so on.

The second difficulty comes from the parameters of the mapping onto a two-dimensional photo. The face itself is fixed, but when the capture device takes the picture, the angle at which the face is presented, the distance from the subject, whether the focus is accurate, the shooting angle, and the lighting and exposure all have an impact, causing the same face to appear in different states.

The third is the influence of age and plastic surgery.

9. Face comparison method

9.1 Traditional methods

Features such as HOG, SIFT, or wavelet transforms are extracted by hand-designed methods. Generally, these extractors have fixed parameters, that is, they require no training or learning: a fixed set of algorithms is applied, and then the resulting features are compared.

9.2 Deep methods

The mainstream approach is the deep method, that is, deep convolutional neural networks (DCNN). The DCNN replaces the earlier feature extraction methods and extracts various features from an image or a face. A DCNN has many parameters, and these parameters are learned rather than specified by people; learned features turn out to be better than those summarized by hand.

The resulting feature vector typically has 128, 256, 512, or 1024 dimensions, and comparison is then performed on these vectors. To judge the distance between feature vectors, Euclidean distance or cosine similarity is generally used.
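
A minimal sketch of comparing two such embeddings; the 0.5 threshold is a placeholder, since the working threshold depends on the model and the target false-accept rate:

```python
import numpy as np

def cosine_similarity(a, b):
    a, b = np.asarray(a, float), np.asarray(b, float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def euclidean_distance(a, b):
    return float(np.linalg.norm(np.asarray(a, float) - np.asarray(b, float)))

def same_person(feat1, feat2, threshold=0.5):
    """Placeholder threshold; tune it on a validation set for the model used."""
    return cosine_similarity(feat1, feat2) >= threshold
```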

The evaluation indicators for face comparison are likewise divided into speed and accuracy. Speed includes the time to compute a single face feature vector and the comparison speed; accuracy again includes ACC and ROC. Since those were introduced earlier, here we focus on comparison speed.

An ordinary 1:1 comparison is a simple operation — computing the distance between two points, essentially one inner product of two vectors. However, in a 1:N comparison, when the library N is very large, a single query photo has to be searched against the whole library. If the library holds one million faces, that is one million comparisons per query, and there are still requirements on the total time, so various techniques are used to speed up this search.
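
One simple speedup for the brute-force case is to precompute an L2-normalized gallery matrix and do the whole 1:N search as a single matrix-vector product, as sketched below; large systems go further with approximate nearest-neighbor indexes, which are not shown here:

```python
import numpy as np

def build_gallery(feature_list):
    """Stack and L2-normalize gallery embeddings once, up front."""
    gallery = np.asarray(feature_list, dtype=np.float32)
    return gallery / np.linalg.norm(gallery, axis=1, keepdims=True)

def search(gallery, query, top_k=5):
    """Cosine similarity against every gallery entry in one matrix product."""
    q = np.asarray(query, dtype=np.float32)
    q = q / np.linalg.norm(q)
    scores = gallery @ q                      # (N,) similarities
    order = np.argsort(-scores)[:top_k]       # indices of the best matches
    return list(zip(order.tolist(), scores[order].tolist()))
```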

10. Other algorithms related to face recognition

These mainly include face tracking, quality assessment, and liveness detection.

10.1 Face tracking

In video face recognition scenarios such as surveillance, if the entire face recognition process were executed on every frame of the same person walking by, it would not only waste computing resources but could also cause misrecognition on low-quality frames. It is therefore necessary to determine which faces across frames belong to the same person and to select suitable photos for recognition, which greatly improves the overall performance of the system.

Nowadays, not only face tracking but also general object tracking, vehicle tracking, and so on use tracking algorithms. Such algorithms do not always rely on detection: for example, after detecting an object at the beginning, they may stop detecting altogether and rely on the tracking algorithm alone. At the same time, achieving very high accuracy and avoiding losing the target makes each tracking step quite time-consuming.

To prevent the tracked face from drifting outside the range the face recognizer can handle, a face detector is generally still run for detection, with relatively lightweight tracking in between detections. In certain scenarios this achieves a balance between speed and quality.

This detection method is called Tracking by Detection: face detection is still performed in each frame, and after detection, faces in adjacent frames are matched based on the four values of each face box — its coordinate position and its width and height. From the positions and sizes of the faces in two consecutive frames, it can be roughly inferred whether they belong to the same moving object.
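
One simple way to do that matching is by the overlap (IoU) of boxes in consecutive frames, as in this sketch; real trackers usually add motion prediction and a more robust assignment step:

```python
def iou(box_a, box_b):
    """Intersection-over-union of two (x, y, w, h) boxes."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    ix = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    iy = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = ix * iy
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

def match_tracks(prev_boxes, curr_boxes, iou_threshold=0.3):
    """Greedily link each current box to the previous box it overlaps most."""
    links = {}
    for j, cur in enumerate(curr_boxes):
        best_i, best_iou = None, iou_threshold
        for i, prev in enumerate(prev_boxes):
            score = iou(prev, cur)
            if score > best_iou:
                best_i, best_iou = i, score
        links[j] = best_i        # None means a new face entered the scene
    return links
```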

10.2 Interval full-frame detection

When doing Tracking by Detection, one way is to run full-frame detection on every frame — scanning the entire image — but this is very time-consuming. So another method is often used: run a full-frame detection only every few frames. In between, the position in the next frame is predicted not to change much, so the previous frame's box is slightly expanded up, down, left, and right and only that region is re-detected. With high probability the face is still found there, and most frames can skip the full scan.

Why do we have to do a full-screen detection every few frames?

This is to catch new objects coming in. If you only search around the positions of existing objects, new objects may not be detected when they enter. To prevent this, a full-frame detection can be run every five or ten frames.
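
A sketch of that loop; `detect(frame, roi=None)` stands for a hypothetical detector that can optionally be restricted to a region of interest and that returns boxes in full-frame coordinates, and the margin and interval values are only illustrative:

```python
def expand_box(box, margin, frame_w, frame_h):
    """Grow a (x, y, w, h) box by a margin on every side, clipped to the frame."""
    x, y, w, h = box
    x0, y0 = max(0, x - margin), max(0, y - margin)
    x1, y1 = min(frame_w, x + w + margin), min(frame_h, y + h + margin)
    return (x0, y0, x1 - x0, y1 - y0)

def track_video(frames, detect, full_every=10, margin=20):
    """Full-frame detection every `full_every` frames, ROI detection otherwise."""
    tracks = []
    prev_boxes = []
    for i, frame in enumerate(frames):
        h, w = frame.shape[:2]
        if i % full_every == 0 or not prev_boxes:
            boxes = detect(frame)                       # catches new faces
        else:
            boxes = []
            for box in prev_boxes:
                roi = expand_box(box, margin, w, h)
                boxes.extend(detect(frame, roi=roi))    # cheap local search
        tracks.append(boxes)
        prev_boxes = boxes
    return tracks
```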

10.3 Face quality assessment

Due to the limitations of the face recognizer's training data, it cannot perform well on faces in all states. Quality assessment judges how well a detected face matches the characteristics the recognizer was trained for, and only faces with a high degree of agreement are sent for identification, improving the overall performance of the system.

Face quality assessment includes the following 4 elements (a simple gating sketch follows the list):

  • Regarding the size of the human face, the recognition effect will be greatly reduced if the face is too small.

  • Facial pose refers to the rotation angles around the three axes, and what is acceptable is generally tied to the data used to train the recognizer. If the training data contained mostly faces with small pose angles, it is best not to select strongly deflected faces at recognition time, otherwise the recognizer will not handle them well.

  • The degree of blur is very important. If the photo has lost information, there will be problems in recognition.

  • Occlusion: if the eyes, nose, etc. are covered, the features of that region cannot be obtained, or the obtained features are wrong — they are features of the occluder — which affects subsequent recognition. If a face can be determined to be occluded, it can be discarded or given special handling, for example not being fed into the recognition model.
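
As the gating sketch mentioned above, here is a minimal version that checks only the two factors measurable without extra models — face size and blur, the latter estimated with the common Laplacian-variance heuristic; the thresholds are placeholders, and pose and occlusion checks would need their own estimators:

```python
import cv2

def quality_ok(face_crop, min_side=80, min_sharpness=100.0):
    """Reject faces that are too small or too blurry before recognition."""
    h, w = face_crop.shape[:2]
    if min(h, w) < min_side:                 # face too small
        return False
    gray = cv2.cvtColor(face_crop, cv2.COLOR_BGR2GRAY)
    sharpness = cv2.Laplacian(gray, cv2.CV_64F).var()   # blur heuristic
    return sharpness >= min_sharpness
```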

10.4 Liveness detection

This is a problem every face recognition system encounters: if it only recognizes faces, a photo of a face can also get through. To prevent the system from being attacked, additional judgments are made to determine whether what is in front of the camera is a real face or a fake one.

Basically there are three current methods:

  • Traditional dynamic recognition: many bank cash machines require the user to cooperate, for example by blinking or turning the head, and the system checks whether the user actually performed the requested blink or head turn. The problem with dynamic recognition is that it requires a lot of user cooperation, so the user experience is somewhat poor.

  • Static recognition does not rely on actions; it judges from the photo itself whether it shows a real face or a fake one, based on the commonly used attack methods, so it is more convenient for the user. For example, an attacker may hold up a mobile phone or a display and attack with the screen. The light emitted by such a screen differs from the light reflected by a real face under natural illumination: even a display with 16 million colors cannot reproduce the continuous, full-band spectrum of visible light. So when such a screen is re-photographed, compared with direct imaging of a real scene, there are changes and a certain unnaturalness that even the human eye can notice. After training a model on this unnaturalness, it can judge from these subtle differences whether it is a real face.

  • Stereoscopic recognition: with two cameras or a camera with depth information, the distance of each captured point from the camera is known, which is equivalent to 3D imaging of the subject. If the attack is made with a screen, the captured "person" is a flat surface, and a flat person is certainly not a real person. Three-dimensional recognition is thus used to exclude flat faces.

11. System composition of face recognition

First, a classification. By comparison form, there are 1:1 and 1:N recognition systems; by comparison object, there are photo comparison and video comparison systems; by deployment form, there are private deployments, cloud deployments, and mobile-device deployments.

11.1 Photo 1:1 recognition system

The 1:1 recognition system is the simplest: take two photos, generate a feature vector for each, and compare the two feature vectors to decide whether they belong to the same person.
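
An end-to-end sketch of such a 1:1 system assembled from the earlier pieces; `detect_one_face`, `align_face`, and `extract_features` stand for whichever detector, aligner, and embedding model are deployed, and their signatures here are assumptions:

```python
import numpy as np

def verify(photo_a, photo_b, detect_one_face, align_face, extract_features,
           threshold=0.5):
    """1:1 verification: detect, align, embed, then compare the two vectors."""
    def embed(photo):
        box, landmarks = detect_one_face(photo)           # detection
        face = align_face(photo, landmarks)               # alignment
        feat = np.asarray(extract_features(face), float)  # feature extraction
        return feat / np.linalg.norm(feat)

    a, b = embed(photo_a), embed(photo_b)
    similarity = float(a @ b)              # cosine similarity on unit vectors
    return similarity >= threshold, similarity
```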

11.2 Photo 1:N recognition system

The 1:N recognition system determines whether a query photo is in a sample library. The sample library is prepared in advance — it may be a whitelist or a blacklist — and contains one photo of each person, from which feature vectors are generated. The uploaded photo is compared with all the features in the sample library to see which person it most resembles. This is a 1:N recognition system.

11.3 Video 1:1 recognition system

The video 1:1 recognition system is similar to the photo 1:1 system, except that the comparison object is not a photo but a video stream. After getting the video stream, detection, tracking, and quality assessment are performed, and once suitable face photos are obtained, the comparison proceeds as before.

11.4 Video 1:N recognition system

The video 1:N recognition system is similar to the photo 1:N system, except that recognition is done on a video stream, and detection, tracking, and quality assessment are likewise required.

Generally speaking, this kind of system composition is not specific to face recognition; most AI systems look roughly the same. The first is the computing resource layer, running on CPU or GPU; running on a GPU may also require supporting software such as CUDA and cuDNN.

The second is the computing tool layer, including the deep learning forward (inference) library, matrix computation libraries, and image processing libraries. Since not everyone writing algorithms can implement their own numerical routines, existing frameworks such as PyTorch, TensorFlow, MXNet, or Caffe are used, or a custom set can be written.

The last is the application algorithm layer , including face detection, feature point positioning, quality assessment and other algorithm implementations. The above is the general system composition.


Reference

https://mp.weixin.qq.com/s/a9hbiyjcJtfQEp86Fn6LQw
