YOLO series algorithm

learning objectives

  • Know the yolo network architecture and understand its inputs and outputs
  • Know how training samples are constructed for the yolo model
  • Understand the loss function of the yolo model
  • Know the improvements made in the yoloV2 model
  • Know the multi-scale detection method of yoloV3
  • Know the network structure and network output of the yoloV3 model
  • Understand the prior box design method of the yoloV3 model
  • Know why the yoloV3 model is suitable for multi-label object classification
  • Understand the yoloV4 model

[Figure: image-20200915142921616.png]

The YOLO series algorithms are typical one-stage object detection algorithms. They use anchor boxes to turn classification and object localization into a single regression problem, achieving high efficiency, flexibility, and good generalization, which is why they are very popular in industry. Next we introduce the YOLO series of algorithms.

yolo algorithm

The Yolo algorithm uses a single CNN model to achieve end-to-end object detection. The core idea is to use the entire image as the input of the network and directly regress the position of the bounding box and its category at the output layer. The whole system is shown in the figure below:

[Figure: image-20200915144129736.png]

First, the input image is resized to 448x448 and fed into the CNN network; finally the network's predictions are processed to obtain the detected objects. Compared with the R-CNN family of algorithms, it is a unified framework and runs faster.

Yolo algorithm idea

Before introducing the Yolo algorithm, let's recall the RCNN model. The RCNN model proposed the Region Proposal approach: it first searches the image for candidate regions that may contain objects (Selective Search), about 2000 of them, and then performs object recognition on each candidate region. This makes processing slow.

[Figure: image-20200915150333995.png]

Yolo means You Only Look Once. It does not actually eliminate the idea of candidate regions; rather, it creatively merges candidate regions and object classification into one step, so that a single glance at the picture tells you which objects are present and where they are.

The Yolo model uses pre-defined prediction regions to perform object detection. Specifically, the original image is divided into 7x7 = 49 grid cells, and each grid cell predicts 2 bounding boxes (rectangular boxes that contain an object), for a total of 49x2 = 98 bounding boxes. We can understand them as 98 prediction regions that roughly cover the whole image; object detection is carried out within these 98 prediction regions.

[Figure: image-20200915150718666.png]

As long as the target classification and regression results of these 98 regions are obtained, the final target detection result can be obtained by performing NMS. How to achieve it?

Yolo's network structure

The structure of YOLO is very simple: plain convolution and pooling layers, with two fully connected layers added at the end. From the perspective of network structure, there is no essential difference from the CNN classification networks introduced earlier. The biggest difference is that the output layer uses a linear activation function, because it needs to predict the positions of bounding boxes (numeric values), not just object probabilities. Roughly speaking, the entire structure of YOLO is: the input image is transformed by the neural network to obtain an output tensor, as shown in the figure below:

[Figure: image-20200915151948836.png]

The network structure is relatively simple, the key point is that we need to understand the relationship between network input and output.

network input

The input to the network is the original image, the only requirement is to scale it to a size of 448x448. The main reason is that in Yolo's network, the convolutional layer is connected to two fully connected layers at the end. The fully connected layer requires a fixed-size vector as input, so the size of Yolo's input image is fixed at 448x448.

network output

The output of the network is a 7x7x30 tensor. So how do we understand this output?

7X7 grid

According to YOLO's design, the input image is divided into 7x7 grids, and the 7x7 in the output tensor corresponds to the 7x7 grid of the input image. Or we regard the 7x7x30 tensor as 7x7=49 30-dimensional vectors, that is, each grid in the input image corresponds to a 30-dimensional vector output. As shown in the figure below, for example, the grid in the upper left corner of the input image corresponds to the vector in the upper left corner of the output tensor.

[Figure: image-20200915152825629.png]

30-dimensional vector

The 30-dimensional vector contains: the position and confidence of 2 bboxes and the probability that the grid belongs to 20 categories

[Figure: image-20200915153123684.png]

  • Positions of the 2 bounding boxes. Each bounding box needs 4 values to represent its position: (Center_x, Center_y, width, height), i.e. the x and y coordinates of the bounding box center and the width and height of the box. The 2 bounding boxes therefore need 8 values in total to represent their positions.
  • Confidences of the 2 bounding boxes. Confidence of a bounding box = probability that the box contains an object × the IOU between the bounding box and the actual bounding box of the object. Expressed as a formula:

[Figure: image-20200915153543735.png]

Pr(Object) is the probability that there is an object in the bounding box
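Written as a formula, this confidence definition is:

$$
\text{Confidence} = \Pr(\text{Object}) \times \text{IOU}_{\text{pred}}^{\text{truth}}
$$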

  • Probability of classifying 20 objects

Yolo supports recognition of 20 different kinds of objects (person, bird, cat, car, chair, etc.), so there are 20 values here, representing the probability of each of the 20 classes at this grid position.
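As a minimal sketch of how the 7x7x30 output could be unpacked (the channel order — box 1, box 2, then the 20 class probabilities — is an assumed layout for illustration; real implementations may order the channels differently):

```python
import numpy as np

# Hypothetical network output for one image: 7x7 grid, 30 values per cell
output = np.random.rand(7, 7, 30)

box1       = output[..., 0:4]     # (7, 7, 4)  position of the 1st bounding box
conf1      = output[..., 4]       # (7, 7)     confidence of the 1st bounding box
box2       = output[..., 5:9]     # (7, 7, 4)  position of the 2nd bounding box
conf2      = output[..., 9]       # (7, 7)     confidence of the 2nd bounding box
class_prob = output[..., 10:30]   # (7, 7, 20) class probabilities of each grid cell

# Class-specific score used at test time: class probability x box confidence
scores1 = class_prob * conf1[..., None]   # (7, 7, 20)
scores2 = class_prob * conf2[..., None]
```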

Yolo model training

When performing model training, we need to construct training samples and design loss functions in order to use gradient descent to train the network.

Construction of training samples

Feeding a picture into the yolo model produces a 7x7x30 tensor as output. When constructing the labels, a 30-dimensional target vector therefore needs to be constructed for each grid cell of the original image. Let's build the target vector with reference to the figure below:

[Figure: image-20200915155204485.png]

  • Probability of classifying 20 objects

For each object in the input image, first find its center point. For example, the center point of the bicycle in the picture above is at the yellow dot, which falls inside the yellow grid cell. Therefore, in the 30-dimensional vector corresponding to the yellow grid cell, the probability of the bicycle is 1 and the probabilities of the other classes are 0; in the 30-dimensional vectors of the other 48 grid cells, the probability of the bicycle is 0. This is what is meant by "the grid cell containing the object's center point is responsible for predicting that object". The classification probabilities of the dog and the car are filled in the same way.

  • The position of the 2 bounding boxes

The bbox position in the training sample should be filled with the real bounding box of the object, but one object corresponds to two predicted bounding boxes; which one should be filled in? It is chosen according to the IOU between each bbox output by the network and the actual bbox of the object, so which bbox to fill in has to be determined dynamically during training.

  • Confidence of 2 bounding boxes

The formula for the prediction confidence is:

[Figure: image-20200915155812745.png]

IOU_pred^truth is computed between each of the two bounding boxes output by the network and the real bounding box of the object. The bounding box with the larger IOU is responsible for predicting whether the object exists: its confidence target value is 1, and the real bounding box of the object is filled in as its position target. The other bounding box, which is not responsible for the prediction, has a confidence target value of 0.
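A minimal sketch of how the 30-dimensional target vector of the responsible grid cell could be filled (the helper `iou`, the argument names, and the channel layout are illustrative assumptions, not the original implementation):

```python
import numpy as np

def iou(box_a, box_b):
    """IOU of two boxes given as (x1, y1, x2, y2)."""
    xa, ya = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    xb, yb = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, xb - xa) * max(0, yb - ya)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def build_cell_target(pred_box1, pred_box2, gt_box, gt_xywh, class_id):
    """Fill the 30-dim target of the cell that contains the object's center."""
    target = np.zeros(30)
    # The predicted box with the larger IOU is "responsible" for the object
    responsible = 0 if iou(pred_box1, gt_box) >= iou(pred_box2, gt_box) else 1
    offset = responsible * 5
    target[offset:offset + 4] = gt_xywh   # real box (cx, cy, w, h) as regression target
    target[offset + 4] = 1.0              # confidence target of the responsible box
    # The other box keeps a confidence target of 0
    target[10 + class_id] = 1.0           # class probability target
    return target
```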

The results corresponding to the grid where the bicycle is located in the above figure are shown in the figure below:

[Figure: image-20200915160053996.png]

loss function

The loss is the deviation between the actual output value of the network and the sample label value:

[Figure: image-20200915160218266.png]

The loss function given by yolo:

[Figure: image-20200915160632201.png]

Note: 1_i^obj indicates whether an object appears in grid cell i, and 1_ij^obj indicates that the j-th bounding box predictor in cell i is responsible for that prediction. YOLO sets λcoord = 5 to increase the weight of the position error, and λnoobj = 0.5 to reduce the weight of the confidence error for bounding boxes that contain no object.
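For reference, the loss in the missing figure above is the standard YOLOv1 loss, which can be written as:

$$
\begin{aligned}
\text{Loss} ={} & \lambda_{coord}\sum_{i=0}^{S^2}\sum_{j=0}^{B}\mathbb{1}_{ij}^{obj}\left[(x_i-\hat{x}_i)^2+(y_i-\hat{y}_i)^2\right] \\
& + \lambda_{coord}\sum_{i=0}^{S^2}\sum_{j=0}^{B}\mathbb{1}_{ij}^{obj}\left[\left(\sqrt{w_i}-\sqrt{\hat{w}_i}\right)^2+\left(\sqrt{h_i}-\sqrt{\hat{h}_i}\right)^2\right] \\
& + \sum_{i=0}^{S^2}\sum_{j=0}^{B}\mathbb{1}_{ij}^{obj}\left(C_i-\hat{C}_i\right)^2
  + \lambda_{noobj}\sum_{i=0}^{S^2}\sum_{j=0}^{B}\mathbb{1}_{ij}^{noobj}\left(C_i-\hat{C}_i\right)^2 \\
& + \sum_{i=0}^{S^2}\mathbb{1}_{i}^{obj}\sum_{c\in \text{classes}}\left(p_i(c)-\hat{p}_i(c)\right)^2
\end{aligned}
$$

with S = 7 and B = 2; the first two sums are the position (regression) loss, the next two are the confidence loss, and the last one is the classification loss.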

[Figure: image-20200915160850102.png]

model training

Yolo first uses the ImageNet dataset to pre-train the first 20 convolutional layers, and then uses the complete network to train object recognition and localization on the PASCAL VOC dataset.

The last layer of Yolo uses a linear activation function, and the other layers use Leaky ReLU. Dropout and data augmentation are used during training to prevent overfitting.

model prediction

Resize the picture to 448x448 and feed it into the yolo network, which outputs a 7x7x30 tensor representing, for every grid cell in the picture, the object class probabilities, the two candidate bounding box positions, and their confidences. The NMS (non-maximum suppression) algorithm is then used to select the results that are most likely to be real objects.
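A minimal sketch of the greedy NMS step (applied per class), assuming boxes have already been converted to corner coordinates and scored:

```python
import numpy as np

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy non-maximum suppression.

    boxes:  (N, 4) array of (x1, y1, x2, y2)
    scores: (N,)   class-specific confidence scores
    Returns the indices of the boxes that are kept.
    """
    order = scores.argsort()[::-1]          # highest score first
    keep = []
    while order.size > 0:
        best = order[0]
        keep.append(best)
        if order.size == 1:
            break
        rest = order[1:]
        # IOU of the best box with all remaining boxes
        x1 = np.maximum(boxes[best, 0], boxes[rest, 0])
        y1 = np.maximum(boxes[best, 1], boxes[rest, 1])
        x2 = np.minimum(boxes[best, 2], boxes[rest, 2])
        y2 = np.minimum(boxes[best, 3], boxes[rest, 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        area_best = (boxes[best, 2] - boxes[best, 0]) * (boxes[best, 3] - boxes[best, 1])
        area_rest = (boxes[rest, 2] - boxes[rest, 0]) * (boxes[rest, 3] - boxes[rest, 1])
        iou = inter / (area_best + area_rest - inter + 1e-9)
        # keep only boxes that do not overlap the best box too much
        order = rest[iou < iou_threshold]
    return keep
```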

yolo summary

advantages

  • The speed is very fast, the processing speed can reach 45fps, and its fast version (with smaller network) can even reach 155fps.
  • Training and prediction can be performed end-to-end, which is very simple.

shortcomings

  • Accuracy is somewhat lower
  • Detection is poor for small objects and for objects that are close to each other

yoloV2

Compared with v1, YOLOv2 improves in three respects while keeping the same processing speed: more accurate prediction (Better), faster speed (Faster), and recognition of more objects (Stronger). "Recognizing more objects" means extending detection to 9000 different objects, which is called YOLO9000. Let's look at the improvements made in yoloV2.

Prediction is more accurate (better)

[Figure: image-20200915164503590.png]

batch normalization

Batch normalization helps to alleviate vanishing and exploding gradients during backpropagation, reduces sensitivity to some hyperparameters, and, since each batch is normalized separately, has a certain regularization effect, leading to better convergence speed and quality. yoloV2 adds Batch Normalization after every convolution, which raises the network's mAP by about 2%.

Fine-tune a classification model using high-resolution images

YOLO v1 uses ImageNet image classification samples at 224x224 as input to train the CNN convolutional layers. Then, when training for object detection, the detection samples use higher-resolution 448x448 images as input. This switch has a certain impact on model performance.

[Figure: image-20200915165005595.png]

After YOLOV2 uses 224x224 images for classification model pre-training, it uses 448x448 high-resolution samples to fine-tune the classification model (10 epochs), so that the network features gradually adapt to the resolution of 448x448. Then use 448x448 detection samples for training, which alleviates the impact caused by sudden resolution switching.

[Figure: image-20200915165015860.png]

After using this technique, the mAP of the network increased by about 4%.

Using Anchor Boxes

YOLOv1 does not use prior boxes: each grid cell predicts only two bounding boxes, 98 for the whole image. YOLOv2 uses 5 prior boxes for each grid cell, for a total of 13x13x5 = 845 prior boxes. Introducing anchor boxes increases the number of predicted boxes (13x13xn).

Clustering to extract anchor box sizes

The anchor ratios chosen by Faster R-CNN are specified manually, so they may not fit the data set perfectly. YOLOv2 instead tries to find prior boxes that better match the object sizes in the samples, which reduces the difficulty of fine-tuning the prior boxes to the actual object positions. YOLOv2's approach is to run cluster analysis on the labelled bounding boxes of the training set to find box sizes that match the samples as well as possible.

[Figure: image-20200915165616802.png]

YoloV2 chose 5 cluster sizes as its anchor boxes.
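A minimal sketch of this clustering, using the distance d(box, centroid) = 1 − IOU(box, centroid) on the labelled (width, height) pairs (function names and details are illustrative):

```python
import numpy as np

def iou_wh(boxes, centroids):
    """IOU between (w, h) pairs, assuming all boxes share the same center."""
    w = np.minimum(boxes[:, None, 0], centroids[None, :, 0])
    h = np.minimum(boxes[:, None, 1], centroids[None, :, 1])
    inter = w * h
    union = (boxes[:, 0] * boxes[:, 1])[:, None] + \
            (centroids[:, 0] * centroids[:, 1])[None, :] - inter
    return inter / union

def kmeans_anchors(boxes_wh, k=5, iters=100):
    """Cluster ground-truth (w, h) pairs with the distance d = 1 - IOU."""
    centroids = boxes_wh[np.random.choice(len(boxes_wh), k, replace=False)]
    for _ in range(iters):
        assign = np.argmax(iou_wh(boxes_wh, centroids), axis=1)  # min distance = max IOU
        new = np.array([boxes_wh[assign == i].mean(axis=0) if np.any(assign == i)
                        else centroids[i] for i in range(k)])
        if np.allclose(new, centroids):
            break
        centroids = new
    return centroids
```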

Prediction of bounding box position

In Yolov2, the results of the bounding box are constrained to a specific grid:

[Figure: image-20200915171150105.png]
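The decoding described here follows the standard YOLOv2 formulas:

$$
\begin{aligned}
b_x &= \sigma(t_x) + c_x \\
b_y &= \sigma(t_y) + c_y \\
b_w &= p_w\, e^{t_w} \\
b_h &= p_h\, e^{t_h} \\
\Pr(\text{object}) \times \text{IOU}(b, \text{object}) &= \sigma(t_o)
\end{aligned}
$$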

where,

  • bx, by, bw, bh are the center coordinates, width, and height of the predicted bounding box.
  • Pr(object)∗IOU(b, object) is the confidence of the predicted box. YOLOv1 predicted the confidence value directly; here the σ transformation of the prediction parameter to is used as the confidence value.
  • cx, cy are the distances from the top-left corner of the current grid cell to the top-left corner of the image; the grid is normalized first, so that each cell has width = 1 and height = 1.
  • pw, ph are the width and height of the prior box.
  • σ is the sigmoid function.
  • tx, ty, tw, th, to are the parameters to be learned, used to predict the box center, width, height, and confidence.

As shown below:

[Figure: image-20200915171632888.png]

Since the σ function constrains σ(tx) and σ(ty) to the range (0,1), the predicted bounding box center (the blue point) is constrained to lie inside the grid cell with the blue background. Constraining the bounding box positions makes the model easier to learn and the predictions more stable.

Assuming the network prediction value is:

[Figure: image-20200915171823875.png]

The anchor box is:

[Figure: image-20200915171844683.png]

Then the position of the target in the feature map:

[Figure: image-20200915171906489.png]

Position in original image:

[Figure: image-20200915171925122.png]
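The concrete numbers of the original worked example were lost with the images; here is a small substitute sketch of the same decoding step with made-up values (tx..to and the anchor size pw, ph are purely illustrative):

```python
import math

sigmoid = lambda x: 1.0 / (1.0 + math.exp(-x))

# Hypothetical raw predictions for one anchor in grid cell (cx, cy) = (1, 1)
tx, ty, tw, th, to = 0.2, 0.1, 0.2, 0.3, 1.5
cx, cy = 1.0, 1.0          # top-left corner of the cell, in grid units
pw, ph = 3.5, 4.0          # prior (anchor) width/height, in grid units
grid, img = 13, 416        # feature-map size and input-image size

# Position on the feature map (grid units)
bx = sigmoid(tx) + cx
by = sigmoid(ty) + cy
bw = pw * math.exp(tw)
bh = ph * math.exp(th)
conf = sigmoid(to)         # confidence of this predicted box

# Position on the original image (pixels)
scale = img / grid
print(bx * scale, by * scale, bw * scale, bh * scale, conf)
```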

Fine-grained feature fusion

Objects in an image can be large or small. The input image is passed through a multi-layer network for feature extraction, and in the final output feature map the features of smaller objects may no longer be obvious or may even be lost. To better detect relatively small objects, the final output feature map needs to retain finer-grained detail information.

YOLO2 introduces a passthrough layer to retain some detailed information in the feature map. Specifically, before the last pooling layer the feature map is 26x26x512; it is split 1-into-4 and passed directly (passthrough) to the feature map after pooling (and after a set of convolutions), and the two are concatenated as the output feature map.

[Figure: image-20200915172541517.png]

The specific split method is as follows:

[Figure: image-20200915172504922.png]
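A minimal sketch of the passthrough (space-to-depth) reorganization, turning the 26×26×512 feature map into 13×13×2048 so it can be concatenated with the 13×13 map (NumPy is used only for illustration; channel counts follow the text above):

```python
import numpy as np

def passthrough(feat, stride=2):
    """Reorganize (H, W, C) into (H/stride, W/stride, C*stride*stride)."""
    h, w, c = feat.shape
    feat = feat.reshape(h // stride, stride, w // stride, stride, c)
    feat = feat.transpose(0, 2, 1, 3, 4)          # group the 2x2 neighborhoods
    return feat.reshape(h // stride, w // stride, c * stride * stride)

fine = np.random.rand(26, 26, 512)
coarse = np.random.rand(13, 13, 1024)             # feature map after pooling/convs
fused = np.concatenate([passthrough(fine), coarse], axis=-1)
print(fused.shape)   # (13, 13, 2048 + 1024) = (13, 13, 3072)
```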

multi-scale training

There are no fully connected layers in YOLO2, so images of any size can be fed in. Because the downsampling factor of the whole network is 32, 10 input sizes {320, 352, ..., 608} are used, corresponding to output feature maps of width and height {10, 11, ..., 19}. During training a new size is chosen at random every 10 batches, so the network learns to detect objects at various scales.
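A minimal sketch of that size-switching schedule (the actual training loop is omitted):

```python
import random

# The 10 candidate input sizes used by YOLOv2 (multiples of 32 from 320 to 608)
sizes = [320 + 32 * i for i in range(10)]          # 320, 352, ..., 608

for batch_idx in range(1000):
    if batch_idx % 10 == 0:                        # change resolution every 10 batches
        input_size = random.choice(sizes)
        feature_map_size = input_size // 32        # 10, 11, ..., 19
    # ... resize the batch images to (input_size, input_size) and train ...
```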

[Figure: image-20200915172724192.png]

Faster

yoloV2 proposes the Darknet-19 network (19 convolutional layers and 5 max-pooling layers) as its feature extraction backbone. Darknet-19 is smaller than VGG-16, with accuracy no weaker than VGG-16, while the number of floating-point operations is reduced to about 1/5, ensuring faster speed.

[Figure: image-20200915173047956.png]

The yoloV2 network contains only convolution + pooling, transforming 416x416x3 into 13x13x5x25. It adds batch normalization and a passthrough layer, removes the fully connected layers, and adopts 5 prior boxes. The output of the network is shown in the figure below:

[Figure: image-20200915173440874.png]

Identify more objects

The VOC data set covers 20 object classes, but in reality there are many more kinds of objects; what is lacking is the corresponding object detection training samples. YOLO2 tries to train jointly on ImageNet's very large number of classification samples and COCO's object detection data set, so that YOLO2 can detect objects even when it has seen few object detection samples for them.

yoloV3

Compared with V1 and V2, the main improvements of yoloV3 are: using multi-scale features for object detection; a richer set of prior boxes; an adjusted network structure; and using logistic outputs instead of softmax for object classification, which is better suited to multi-label classification tasks.

Algorithm Introduction

YOLOv3 is the third version of the YOLO (You Only Look Once) series of target detection algorithms. Compared with the previous algorithms, especially for small targets, the accuracy has been significantly improved.

[Figure: image-20200502103836394.png]

The process of yoloV3 is shown in the figure below. For each input image, YOLOv3 will predict three different scale outputs in order to detect targets of different sizes.

[Figure: image-20200502104048380.png]

multi-scale detection

Often an image contains a variety of different objects, both large and small. Ideally, objects of all sizes can be detected at the same time. Therefore, the network must have the ability to "see" objects of different sizes. Because the deeper the network, the smaller the feature map will be, so the deeper the network, the harder it is to detect small objects.

In practice, as the network gets deeper, the shallow feature maps mainly contain low-level information (object edges, color, coarse position, etc.), while the deep feature maps contain high-level, semantic information (dog, cat, car, etc.). Feature maps at different levels therefore correspond to different scales, so we can perform object detection on feature maps at different levels. The figure below shows several classic multi-scale approaches.

[Figure: image-20200502104855459.png]

(a) This method first establishes an image pyramid, and pyramid images of different scales are input into the corresponding network for detection of objects of different scales. But the result of this is that each level of the pyramid needs to be processed once, which is very slow.

(b) Detection is only performed in the last feature map stage, this structure cannot detect objects of different sizes

(c) Object detection is performed on feature maps of different depths. SSD adopts this structure: small objects are detected on shallow feature maps and large objects on deep feature maps, which covers objects of different scales. The drawback is that each feature map only receives information from the preceding layers and cannot obtain or use the feature information of later layers.

(d) Very close to (c), but with the difference that the feature map of a later ("future") layer is upsampled and fused into the current layer's feature map. With this structure the current feature map can obtain information from later layers, so low-level and high-level features are organically fused, which improves detection accuracy. YOLOv3 uses this method to realize multi-scale detection.

Network Model Structure

For basic image feature extraction, YOLO3 adopts the Darknet-53 network structure (containing 53 convolutional layers). It borrows the idea of the residual network ResNet and sets up shortcut connections between layers to alleviate the gradient problems of deep networks. A shortcut is shown in the figure below: it contains two convolutional layers and one shortcut connection.

[Figure: image-20200502110956252.png]

The model structure of yoloV3 is as follows. In the entire v3 structure there are no pooling layers and no fully connected layers; downsampling is achieved by setting the convolution stride to 2, so the feature map size is halved after passing through such a convolutional layer.

[Figure: image-20210106152945470.png]

Let's look at the network structure:

  • Basic components: the part inside the blue box

1. CBL: the smallest component in the Yolov3 network structure, consisting of Conv + BN + Leaky_relu activation.
2. Res unit: borrows the residual structure from the Resnet network, allowing the network to be built deeper.
3. ResX: consists of one CBL followed by X residual components; it is the large component of Yolov3. The CBL in front of each Res module performs downsampling, so after 5 Res modules the feature map shrinks 416 -> 208 -> 104 -> 52 -> 26 -> 13. (A PyTorch sketch of these components is given after this list.)

  • Other basic operations:

1. Concat: tensor concatenation, which expands the channel dimension; for example, concatenating two tensors of 26×26×256 and 26×26×512 gives 26×26×768.

2. Add: tensor addition, which adds the tensors directly without expanding the dimensions; for example, adding 104×104×128 and 104×104×128 still gives 104×104×128.

  • Number of convolutional layers in Backbone:

Each ResX contains 1 + 2×X convolutional layers, so the entire backbone contains 1 + (1+2×1) + (1+2×2) + (1+2×8) + (1+2×8) + (1+2×4) = 52 convolutional layers; adding one fully connected (FC) layer gives the Darknet53 classification network. In the detection network Yolov3, however, the FC layer is removed, and the backbone is still referred to as the Darknet53 structure.
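As a rough PyTorch sketch of the CBL and Res unit components described above (channel numbers are illustrative, not the exact Darknet-53 configuration):

```python
import torch
import torch.nn as nn

class CBL(nn.Module):
    """Conv + BatchNorm + LeakyReLU, the smallest building block."""
    def __init__(self, in_ch, out_ch, kernel_size=3, stride=1):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size, stride,
                              padding=kernel_size // 2, bias=False)
        self.bn = nn.BatchNorm2d(out_ch)
        self.act = nn.LeakyReLU(0.1, inplace=True)

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

class ResUnit(nn.Module):
    """Residual unit: 1x1 CBL -> 3x3 CBL, plus a shortcut connection."""
    def __init__(self, channels):
        super().__init__()
        self.block = nn.Sequential(
            CBL(channels, channels // 2, kernel_size=1),
            CBL(channels // 2, channels, kernel_size=3),
        )

    def forward(self, x):
        return x + self.block(x)      # Add: shortcut, dimensions unchanged

# ResX = one downsampling CBL (stride 2) followed by X residual units
res2 = nn.Sequential(CBL(64, 128, stride=2), *[ResUnit(128) for _ in range(2)])
```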

prior box

yoloV3 uses K-means clustering to obtain the prior box sizes, setting 3 prior boxes for each scale, so a total of 9 prior box sizes are clustered.

[Figure: image-20200502103458654.png]

The nine prior boxes obtained on the COCO dataset are: (10x13), (16x30), (33x23), (30x61), (62x45), (59x119), (116x90), (156x198), (373x326). The larger prior boxes (116x90), (156x198), (373x326) are applied on the smallest (13x13) feature map, which has the largest receptive field, and are suitable for detecting larger objects. The medium prior boxes (30x61), (62x45), (59x119) are applied on the medium (26x26) feature map (medium receptive field) and are suitable for medium-sized objects. The smaller prior boxes (10x13), (16x30), (33x23) are applied on the larger (52x52) feature map (smallest receptive field) and are suitable for detecting smaller objects.
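Grouped by the output scale that uses them, this assignment looks like:

```python
# COCO prior boxes (w, h) grouped by the feature map scale that uses them
anchors_by_scale = {
    13: [(116, 90), (156, 198), (373, 326)],   # largest receptive field -> large objects
    26: [(30, 61),  (62, 45),  (59, 119)],     # medium receptive field  -> medium objects
    52: [(10, 13),  (16, 30),  (33, 23)],      # smallest receptive field -> small objects
}
```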

To get an intuitive feel for the sizes of the 9 prior boxes: in the figure below, the blue boxes are the prior boxes obtained by clustering, the yellow box is the ground truth, and the red box is the grid cell containing the object's center point.

[Figure: image-20200502103442569.png]

logistic regression

When predicting object categories, softmax is not used; it is replaced by a 1x1 convolutional layer plus logistic (sigmoid) activations. A softmax layer assumes that each output corresponds to exactly one class, but on data sets where classes overlap (such as woman and person), softmax prevents the network from modelling the data well.
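A small illustration of the difference (made-up logits, NumPy for clarity): independent logistic outputs let overlapping labels such as person and woman both score high, while softmax forces the probabilities to compete:

```python
import numpy as np

logits = np.array([3.0, 2.8, -1.0])      # e.g. scores for (person, woman, car)

softmax = np.exp(logits) / np.exp(logits).sum()
sigmoid = 1.0 / (1.0 + np.exp(-logits))

print(softmax)   # probabilities compete and must sum to 1
print(sigmoid)   # each class scored independently: person AND woman can both be ~0.95
```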

[Figure: image-20200502112750112.png]

Input and output of yoloV3 model

The input and output form of YoloV3 is shown in the figure below:

[Figure: image-20200502111721372.png]

A 416×416×3 image is fed in, and the darknet network produces prediction results at three different scales; each scale corresponds to N channels that contain the prediction information: the predictions for every anchor of every size in each grid cell.

YOLOv3 makes 13×13×3 + 26×26×3 + 52×52×3 = 10647 predictions in total. Each prediction is 85-dimensional: 4 (box coordinates), 1 (confidence score), and 80 (COCO class probabilities).
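Worked out numerically (for the 80 COCO classes):

```python
num_classes = 80
per_box = 4 + 1 + num_classes                 # 85 values per prediction

total = sum(s * s * 3 for s in (13, 26, 52))  # 3 anchors per cell at each scale
print(total)                                  # 10647 predictions in total

print(3 * per_box)                            # N = 255 output channels per scale
```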

yoloV4

The father of YOLO announced his withdrawal from the CV field in early 2020, so the author of YOLOv4 is not the original author of the YOLO series. YOLOv4 is a major update to the series: its average precision (AP) and frame rate (FPS) on the COCO data set increased by 10% and 12% respectively, it was officially recognized by Joseph Redmon, and it is considered one of the strongest real-time object detection models available.

yoloV4 surveys most of the available detection tricks, then screens them, combines them, and verifies with experiments (ablation studies) which methods are effective. Overall, Yolov4 does not create brand-new improvements of its own but applies a large number of existing object detection techniques. Here we mainly present its network architecture:

[Figure: image-20200915180225300.png]

The structure of Yolov4 is similar to that of Yolov3, but each sub-structure is improved with various new algorithmic ideas. Let's first sort out the structural components of Yolov4:

  • Basic components:
  • CBM: The smallest component in the Yolov4 network structure, consisting of Conv+Bn+Mish activation function.
  • CBL: It consists of Conv+Bn+Leaky_relu activation function.
  • Res unit: Learn from the residual structure in the Resnet network, so that the network can be built deeper.
  • CSPX: consists of three convolutional layers and X Res unit modules concatenated (Concat).
  • SPP: Use 1×1, 5×5, 9×9, 13×13 maximum pooling methods for multi-scale fusion.
  • Other basic operations:
  • Concat: Tensor splicing, the dimension will be expanded, the same as the explanation in Yolov3, corresponding to the route operation in the cfg file.
  • Add: Tensor addition will not expand the dimension, corresponding to the shortcut operation in the cfg file.
  • Number of convolutional layers in the Backbone: each CSPX contains 3 + 2×X convolutional layers, so the entire backbone contains 2 + (3+2×1) + 2 + (3+2×2) + 2 + (3+2×8) + 2 + (3+2×8) + 2 + (3+2×4) + 1 = 72 convolutional layers.

Notice:

The input size of the network is not fixed. The default input in yoloV3 is 416×416 and in yoloV4 it is 608×608; in real projects it can be modified as needed, e.g. 320×320, as long as it is a multiple of 32. The input image size determines the sizes of the last three feature maps: for a 416×416 input they are 13×13, 26×26, and 52×52; for 608×608 they are 19×19, 38×38, and 76×76.

Summary

  • Know the yolo network architecture and understand its input and output

The entire structure of YOLO is that the input image is transformed by the neural network to obtain an output tensor.

  • Know the method of training sample construction of the yolo model

For each grid cell of the original image, a 30-dimensional target vector is constructed: classification, confidence, and regression target values

  • Understand the loss function of the yolo model

The loss function is divided into 3 parts: classification loss, regression loss, confidence loss

  • Know how to improve the yoloV2 model

Use of BN layers, high-resolution fine-tuning of the classifier, anchor boxes, clustering to obtain anchor box sizes, an improved bounding box prediction method, fine-grained feature fusion, multi-scale training, the Darknet-19 backbone, and use of the ImageNet dataset to recognize more objects

  • Multi-scale detection method of yoloV3

In YOLOv3, the FPN structure is used to improve the accuracy of corresponding multi-scale target detection. The current feature map uses the information of the "future" layer to fuse low-order features with high-order features to improve detection accuracy.

  • The network structure of the yoloV3 model

Based on darknet-53, drawing on the idea of ​​resnet, a residual module is added to the network, which is beneficial to solve the gradient problem of deep network

In the entire v3 structure, there are no pooling layers and fully connected layers, only convolutional layers

The downsampling of the network is achieved by setting the stride of the convolution to 2

  • The prior box design method of the yoloV3 model

K-means clustering is used to obtain the size of the prior frame, and 3 types of prior frames are set for each scale, and a total of 9 sizes of prior frames are clustered.

  • Why is the yoloV3 model suitable for multi-label target classification

Softmax is not used when predicting object categories; logistic (sigmoid) outputs are used instead, so each class is predicted independently and multiple labels can hold at once

  • Input and output of yoloV3 model

For a 416×416×3 input image, 3 prior boxes are set in each grid cell of the feature map at each scale, giving 13×13×3 + 26×26×3 + 52×52×3 = 10647 predictions in total. Each prediction is a (4+1+80) = 85-dimensional vector containing the box coordinates (4 values), the box confidence (1 value), and the object class probabilities (80 classes for the COCO dataset).
