A Detailed Explanation of YOLOv5

1. YOLOv5's Pre-processing Improvement: Anchors

1.1 Improvements to Anchor Generation

  1. First of all, YOLOv3/v4/v5 all generate anchors from the training dataset. In v3/v4 this is done by an independent program that computes the anchors before training starts, but this offline approach is not ideal.

  2. The automatically generated anchors are computed once for a dataset, yet different datasets have different box statistics. YOLOv5 embeds this function into the training pipeline itself (the AutoAnchor step that runs when training starts), so it automatically produces anchors that fit the current dataset more closely.

1.2 The Anchor-Generation Process

  1. Collect the widths and heights of all ground-truth boxes in the training set.

  2. Scale each image so that the larger of its width and height equals the specified size, scaling the smaller side proportionally.

  • The specified size here is a hyperparameter. You can set the input image size via the img_size parameter in the training configuration file; for example, img_size=640 means the input images are 640x640 pixels.

  3. Apply the same scaling and coordinate transformation to the ground-truth bboxes (GT) labeled in the training set to obtain their absolute coordinates.

  4. Filter the bboxes, keeping only those whose w and h are both greater than or equal to 2 pixels.

  • During training, objects that are too small (e.g., under 2 pixels) are unlikely to be detected: they are so small that they are hard to distinguish in the image, and such tiny targets also make training harder, which can hurt the final model. Smaller objects are therefore filtered out and only larger ones are kept. In YOLOv5, only bboxes whose w and h are both at least 2 pixels are used as training data, which improves training efficiency and accuracy.

  5. Run k-means clustering on the remaining widths and heights to obtain n anchors.

  6. Use a genetic algorithm to randomly mutate the anchors' w and h: if the mutation improves the fitness metric, assign the mutated result to the anchors; if it makes things worse, skip it.
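Here is a minimal NumPy sketch of steps 5-6 (k-means followed by genetic mutation). It is illustrative only, not YOLOv5's actual utils/autoanchor.py; the names fitness and evolve_anchors and all the constants are this article's own choices.

```python
# Sketch of anchor search: k-means init, then keep-if-better random mutation.
import numpy as np

def fitness(anchors, wh, thr=4.0):
    """Mean best-anchor ratio: for each GT box, how well the closest anchor
    matches its w/h (1.0 = perfect, 0 if worse than 1/thr on either side)."""
    r = wh[:, None] / anchors[None]            # (n_gt, k, 2) side ratios
    m = np.minimum(r, 1 / r).min(2)            # worst of the two sides
    best = m.max(1)                            # best anchor per GT box
    return (best * (best > 1 / thr)).mean()

def evolve_anchors(wh, k=9, iters=1000, mut_prob=0.9, sigma=0.1, seed=0):
    rng = np.random.default_rng(seed)
    # Step 5: k-means (random GT boxes as centroids, then Lloyd iterations).
    anchors = wh[rng.choice(len(wh), k, replace=False)].copy()
    for _ in range(30):
        d = ((wh[:, None] - anchors[None]) ** 2).sum(2)
        assign = d.argmin(1)
        for j in range(k):
            if (assign == j).any():
                anchors[j] = wh[assign == j].mean(0)
    # Step 6: genetic evolution; keep a mutation only if fitness improves.
    best_f = fitness(anchors, wh)
    for _ in range(iters):
        mut = np.where(rng.random(anchors.shape) < mut_prob,
                       rng.normal(1.0, sigma, anchors.shape), 1.0)
        cand = np.clip(anchors * mut, 2.0, None)   # keep sides >= 2 px
        f = fitness(cand, wh)
        if f > best_f:
            best_f, anchors = f, cand
    return anchors[np.argsort(anchors.prod(1))]    # sort small -> large

# wh = array of all GT (w, h) after rescaling to img_size and 2-px filtering
```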

2. YOLOv5's Pre-processing Improvement: Letterbox

  1. Early YOLO versions commonly used fixed input sizes such as 416×416 and 608×608, so an image like 800×600 would be scaled and padded to that square. The YOLOv5 author observes that in real applications images come in many aspect ratios, so after direct scaling and padding the gray borders at the two ends differ in size; excessive padding introduces information redundancy and can also slow down inference.

  2. YOLOv5's Letterbox function adaptively scales and pads the original image so that each side becomes a multiple of 32 while adding the fewest gray pixels possible, reducing information redundancy and its impact on inference speed. This handles images of different sizes and aspect ratios efficiently and keeps the network stable.

3. A Worked Example of the Calculation (hyperparameter img_size set to 416)


  1. First, for an 800×600 input, calculate the scaling ratio: min(416/800, 416/600) = 0.52

  2. Calculate the width and height of the shrunk image: w, h = 800 × 0.52, 600 × 0.52 = 416, 312

  3. Calculate the number of pixels to pad by rounding 312 up to the next multiple of 32 (i.e., 320) and splitting the difference between the two sides: padding_num = 0.5 × (32 × (1 + int(312 / 32)) − 312) = 0.5 × (320 − 312) = 4

  4. Finally, 4 pixels are padded on the top and 4 on the bottom, so the final size of this picture is 416 × (312 + 4 + 4) = 416 × 320

  5. Summary: the img_size hyperparameter and the stride of 32 determine how many pixels must be padded on each side of the shorter dimension.
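Below is a minimal sketch of exactly the calculation walked through above. It is not the full letterbox() from YOLOv5's utils/augmentations.py (which also performs the actual resize, rounding, and gray fill); letterbox_shape is a name made up for this article.

```python
# Reproduce the worked example: scale ratio, resized size, per-side padding.
import math

def letterbox_shape(w, h, img_size=416, stride=32):
    """Return (resized_w, resized_h, pad_per_side_w, pad_per_side_h)."""
    r = min(img_size / w, img_size / h)          # step 1: scaling ratio
    new_w, new_h = round(w * r), round(h * r)    # step 2: shrunk size
    # step 3: pad each side up to the next multiple of the stride
    pad_w = (math.ceil(new_w / stride) * stride - new_w) / 2
    pad_h = (math.ceil(new_h / stride) * stride - new_h) / 2
    return new_w, new_h, pad_w, pad_h

print(letterbox_shape(800, 600))  # (416, 312, 0.0, 4.0) -> final 416 x 320
```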

4. YOLOv5's Activation Function: SiLU

  1. Mish and SiLU are two different activation functions. Although their curves have similar shapes, their function values and derivatives are computed differently.

  2. The Mish activation function was proposed by Misra in 2019. Its formula is:
    f(x) = x * tanh(softplus(x))
    where the softplus function is defined as: softplus(x) = ln(1 + exp(x)).

  3. The SiLU activation function was proposed by Elfwing et al. in 2018 (the same function was later popularized under the name Swish). Its formula is:
    f(x) = x * sigmoid(x)

where the sigmoid function is defined as: sigmoid(x) = 1 / (1 + exp(-x)).

Although the two curves look similar, Mish's derivative is more expensive to compute, while SiLU's is relatively simple. Using SiLU as the activation function therefore makes YOLOv5's training more efficient; SiLU also evaluates faster than Mish, which further improves inference speed.
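As a quick numerical check of the two formulas above, here is a small plain-NumPy sketch (PyTorch also ships both functions as torch.nn.SiLU and torch.nn.Mish):

```python
# Evaluate Mish and SiLU from their definitions and compare the values.
import numpy as np

def softplus(x):
    return np.log1p(np.exp(x))          # ln(1 + exp(x))

def mish(x):
    return x * np.tanh(softplus(x))     # x * tanh(softplus(x))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def silu(x):
    return x * sigmoid(x)               # x * sigmoid(x)

x = np.linspace(-4, 4, 9)
print(np.round(mish(x), 3))
print(np.round(silu(x), 3))             # similar shape, cheaper to evaluate
```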

5. The CSP Block in YOLOv5

5.1 CBA module

  1. Conv + BN + SiLU: a convolution followed by batch normalization and the SiLU activation.
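A minimal PyTorch sketch of this block (YOLOv5's models/common.py calls it Conv; CBA is simply the name this article uses):

```python
# Conv + BN + SiLU building block, reused by the sketches that follow.
import torch
import torch.nn as nn

class CBA(nn.Module):
    def __init__(self, c_in, c_out, k=1, s=1):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, k, s, padding=k // 2, bias=False)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.SiLU()

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

x = torch.randn(1, 3, 64, 64)
print(CBA(3, 32, k=3, s=2)(x).shape)  # torch.Size([1, 32, 32, 32])
```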

5.2 C3 module

(figure: YOLOv5's C3 module, which looks relatively simple)

(figure: YOLOv4's CSP block, which looks considerably more complicated)

Compared with V4, the BN + Mish convolutions on the branches are removed, and the two paths are simply fused (concatenated) first.
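A condensed PyTorch sketch of the C3 idea, reusing the CBA block defined above; it is simplified from models/common.py (fixed 0.5 channel expansion, no grouped convolutions):

```python
# C3: two parallel CBA branches, a stack of residual Bottlenecks on one of
# them, then concatenation and a fusing 1x1 CBA.
class Bottleneck(nn.Module):
    def __init__(self, c, shortcut=True):
        super().__init__()
        self.cv1 = CBA(c, c, k=1)
        self.cv2 = CBA(c, c, k=3)
        self.add = shortcut

    def forward(self, x):
        y = self.cv2(self.cv1(x))
        return x + y if self.add else y

class C3(nn.Module):
    def __init__(self, c_in, c_out, n=1):
        super().__init__()
        c_hidden = c_out // 2
        self.cv1 = CBA(c_in, c_hidden, k=1)               # main branch
        self.cv2 = CBA(c_in, c_hidden, k=1)               # shortcut branch
        self.m = nn.Sequential(*(Bottleneck(c_hidden) for _ in range(n)))
        self.cv3 = CBA(2 * c_hidden, c_out, k=1)          # fuse after concat

    def forward(self, x):
        return self.cv3(torch.cat((self.m(self.cv1(x)), self.cv2(x)), dim=1))

print(C3(32, 64, n=3)(torch.randn(1, 32, 32, 32)).shape)  # [1, 64, 32, 32]
```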

6. The Complete Backbone

  1. The stage count is how many times the CSP block above is repeated, which can be read from the configuration file. Like YOLOv4's CSPDarkNet53, YOLOv5 stacks the blocks per stage, e.g. stage = [3, 6, 9, 3], with corresponding downsampling rates of 4, 8, 16, 32; see the sketch below.

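A toy sketch of how the stage repeats map to downsampling rates, reusing CBA and C3 from section 5. The channel widths are illustrative, and the stem differs from YOLOv5 v6's actual 6×6 convolution:

```python
# Compose a backbone: /2 stem, then four downsampling stages with C3 repeats.
import torch
import torch.nn as nn
# assumes CBA and C3 from the sketches in section 5

def make_backbone(repeats=(3, 6, 9, 3), widths=(64, 128, 256, 512, 1024)):
    layers = [CBA(3, widths[0], k=3, s=2)]        # stem: /2
    c_prev = widths[0]
    for n, c in zip(repeats, widths[1:]):
        layers += [CBA(c_prev, c, k=3, s=2),      # downsample: /4, /8, /16, /32
                   C3(c, c, n=n)]                 # stage with n repeats
        c_prev = c
    return nn.Sequential(*layers)

print(make_backbone()(torch.randn(1, 3, 640, 640)).shape)  # [1, 1024, 20, 20]
```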

7. YOLOv5's SPP Improvement: SPPF

  1. SPPF/SPP is attached after the backbone for further feature extraction.

  2. Neither SPPF nor SPP changes the spatial size or channel count of the feature map, so removing them would not break the input and output shapes of the overall network; their role is to extract and fuse high-level features.

  3. The main job of this module is to extract and fuse high-level features: during fusion, max pooling is applied several times so as to capture as many high-level semantic features as possible. SPPF produces the same fusion as SPP but faster, by chaining small max-pools instead of running large ones in parallel, as sketched below.
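A condensed sketch of SPPF, again reusing the CBA block from section 5. Two chained 5×5 stride-1 max-pools cover the same receptive field as one 9×9, and three cover a 13×13, which is how SPPF matches SPP's parallel 5/9/13 pooling while sharing computation:

```python
# SPPF: 1x1 reduce, three chained 5x5 max-pools, concat, 1x1 fuse.
import torch
import torch.nn as nn
# assumes CBA from the sketch in section 5

class SPPF(nn.Module):
    def __init__(self, c_in, c_out, k=5):
        super().__init__()
        c_hidden = c_in // 2
        self.cv1 = CBA(c_in, c_hidden, k=1)
        self.m = nn.MaxPool2d(kernel_size=k, stride=1, padding=k // 2)
        self.cv2 = CBA(4 * c_hidden, c_out, k=1)

    def forward(self, x):
        x = self.cv1(x)
        y1 = self.m(x)            # ~5x5 receptive field
        y2 = self.m(y1)           # ~9x9
        y3 = self.m(y2)           # ~13x13
        return self.cv2(torch.cat((x, y1, y2, y3), dim=1))

print(SPPF(256, 256)(torch.randn(1, 256, 20, 20)).shape)  # [1, 256, 20, 20]
```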

8. Comparison of Coordinate Representations (V3/V4 vs. V5)

8.1 Coordinate representation used by v3/v4

(figure: diagram of the anchor-to-bounding-box coordinate transformation)

(Note: the diagram contained a typo; the formulas below are the authoritative ones.)

x = sigmoid(tx) + cx
y = sigmoid(ty) + cy
w = pw * exp(tw)
h = ph * exp(th)

First, understand that the black box on the outside is the anchor, generated by clustering before training; the blue box is the predicted bounding box.

What the YOLO-family networks actually output are the relative quantities tx, ty, tw, th; the center coordinates, width, and height of the bounding box are then recovered through the formulas above.

cx and cy are the grid-cell indices, telling us which cell the prediction belongs to.

The key point to understand is that the bounding box is obtained by transforming the anchor.

8.2 Coordinate representation used by v5

YOLOv5 replaces the unbounded exponential with a bounded sigmoid form:

x = 2 * sigmoid(tx) - 0.5 + cx
y = 2 * sigmoid(ty) - 0.5 + cy
w = pw * (2 * sigmoid(tw))^2
h = ph * (2 * sigmoid(th))^2

In fact, this is simply a different way of converting an anchor into a bounding box; note that w and h are now capped at 4× the anchor size. YOLOv5 has no paper, but the Ultralytics ("U") implementations of YOLOv3/v4 use this same coordinate conversion.
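A plain-NumPy sketch contrasting the two decodings above; decode_v3 and decode_v5 are names made up for this article, not library functions:

```python
# Decode (tx, ty, tw, th) into a box under the v3/v4 and v5 conventions.
import numpy as np

def sigmoid(t):
    return 1 / (1 + np.exp(-t))

def decode_v3(t, cx, cy, pw, ph):
    tx, ty, tw, th = t
    return (sigmoid(tx) + cx, sigmoid(ty) + cy,
            pw * np.exp(tw), ph * np.exp(th))

def decode_v5(t, cx, cy, pw, ph):
    tx, ty, tw, th = t
    return (2 * sigmoid(tx) - 0.5 + cx, 2 * sigmoid(ty) - 0.5 + cy,
            pw * (2 * sigmoid(tw)) ** 2, ph * (2 * sigmoid(th)) ** 2)

t = np.array([0.2, -0.1, 0.3, 0.0])
print(decode_v3(t, cx=5, cy=7, pw=30, ph=60))
print(decode_v5(t, cx=5, cy=7, pw=30, ph=60))  # w, h bounded by 4x the anchor
```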

9. Positive/Negative Sample Matching in YOLOv5 Compared with V3/V4

9.1 Prior knowledge

V3/V4/V5 all make predictions at three scales, and each prediction layer has three anchor boxes.

Increasing the number of positive samples can improve the model's precision and recall, but it does not necessarily speed up convergence directly. During training, more positive samples may increase model complexity, lengthening training time and raising the risk of overfitting. So when adding positive samples, accuracy and complexity must be weighed together, with appropriate adjustment and tuning.

9.2 Positive and negative sample matching in V3

  1. In V3, each scale (i.e., each detection layer) assigns the anchor box with the largest IoU against the GT as the positive sample. A positive sample is thus an anchor box that meets this condition; training then learns the offset parameters from it, and at inference those parameters are used by the formulas above to generate bboxes.

  2. This creates a shortage of positive samples, because each GT in V3 can yield at most 3 positives: the best-matching anchor in each layer. An anchor that is not a positive sample contributes neither localization loss nor class loss, only confidence loss.

  3. V4 increases the number of positive samples.

9.3 Positive and negative sample matching in V4

  1. V4's choice: every anchor whose IoU with the GT exceeds the set threshold becomes a positive sample, which means that across the three scales/prediction layers a GT can have up to 9 positives.

  2. Anchors that V3 would have ignored (above the threshold but not the best match) are now also treated as positive samples.

9.4 Positive and negative sample matching in V5

  1. V5 adds one more operation on top of V4's scheme: each grid cell is divided into four quadrants, and depending on which quadrant the GT center falls in, the two adjacent grid cells are pulled in as well. Pulling in two more cells means 6 more anchors can be matched against the GT for a chance to become positive samples.

  2. Per layer, V3 yields only 1 positive sample, V4 can yield 1-3 through the threshold test, and V5 can yield 3-9 by pulling in the two extra cells.

  3. YOLOv5's matching scheme allocates many more positive samples, which helps speed up training convergence and balance positive and negative samples. And since every feature map checks all GTs against its own anchors, a single GT may be assigned positive samples on multiple feature maps. A toy sketch of the neighbor-cell selection follows.
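The sketch below mirrors the idea of the neighbor-cell selection in YOLOv5's build_targets, not its exact code, and ignores clamping at the feature-map border; extra_cells is a hypothetical name:

```python
# Besides the cell containing the GT center, the two adjacent cells on the
# side the center is closest to may also supply positive anchors.
def extra_cells(gx, gy):
    """gx, gy: GT center in grid units. Returns the 3 contributing cells."""
    cx, cy = int(gx), int(gy)              # cell containing the center
    dx = -1 if (gx - cx) < 0.5 else 1      # nearer horizontal neighbor
    dy = -1 if (gy - cy) < 0.5 else 1      # nearer vertical neighbor
    return [(cx, cy), (cx + dx, cy), (cx, cy + dy)]

print(extra_cells(5.3, 7.8))  # [(5, 7), (4, 7), (5, 8)]
```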

10. Loss Function

Both YOLOv5 and YOLOv4 use CIoU Loss as the bounding-box regression loss, while the classification loss and objectness loss both use cross-entropy (binary cross-entropy in YOLOv5's implementation).

For the regression loss, its mathematical expression is as follows:

L_CIoU = 1 - IoU + d^2 / c^2 + α·v,  where v = (4 / π^2) · (arctan(w_gt / h_gt) - arctan(w / h))^2 and α = v / ((1 - IoU) + v)

In the formula, d is the Euclidean distance between the center points of the predicted and labeled boxes, and c is the diagonal length of the smallest box enclosing both. In this way, CIoU Loss accounts for all three geometric factors a box-regression loss should consider: overlap area, center-point distance, and aspect ratio.
For the classification loss and objectness loss, the mathematical expression is as follows:

L_BCE = -(1/N) · Σ_i [ y_i · log(p_i) + (1 - y_i) · log(1 - p_i) ]
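For concreteness, here is a self-contained sketch of the CIoU term written directly from the formula above; torchvision provides an equivalent as torchvision.ops.complete_box_iou_loss, and this version is purely illustrative:

```python
# 1 - CIoU for a pair of boxes given as (x1, y1, x2, y2).
import math

def ciou_loss(box1, box2):
    # intersection and union -> IoU
    iw = max(0.0, min(box1[2], box2[2]) - max(box1[0], box2[0]))
    ih = max(0.0, min(box1[3], box2[3]) - max(box1[1], box2[1]))
    inter = iw * ih
    a1 = (box1[2] - box1[0]) * (box1[3] - box1[1])
    a2 = (box2[2] - box2[0]) * (box2[3] - box2[1])
    iou = inter / (a1 + a2 - inter + 1e-9)
    # d^2: squared distance between the two box centers
    d2 = ((box1[0] + box1[2]) / 2 - (box2[0] + box2[2]) / 2) ** 2 + \
         ((box1[1] + box1[3]) / 2 - (box2[1] + box2[3]) / 2) ** 2
    # c^2: squared diagonal of the smallest enclosing box
    c2 = (max(box1[2], box2[2]) - min(box1[0], box2[0])) ** 2 + \
         (max(box1[3], box2[3]) - min(box1[1], box2[1])) ** 2 + 1e-9
    # aspect-ratio consistency term v and its weight alpha
    v = (4 / math.pi ** 2) * (
        math.atan((box1[2] - box1[0]) / (box1[3] - box1[1]))
        - math.atan((box2[2] - box2[0]) / (box2[3] - box2[1]))) ** 2
    alpha = v / ((1 - iou) + v + 1e-9)
    return 1 - iou + d2 / c2 + alpha * v

print(round(ciou_loss((0, 0, 4, 4), (1, 1, 5, 5)), 4))  # ~0.6487
```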


Origin: blog.csdn.net/bobchen1017/article/details/129739776