"Target Detection" YOLO V5 (1): study notes

1. Accumulation of basic knowledge points

1.1 Adaptive anchor box

  • In yolov3, k-means and a genetic algorithm are applied to the custom data set to obtain preset anchor boxes suited to predicting the bounding boxes of objects in that data set.
  • In yolov5, the anchor boxes are learned automatically from the training data (Auto-Learning Bounding Box Anchors).
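The k-means step can be sketched as follows. This is a minimal illustration (not yolov5's actual AutoAnchor code), using 1 − IoU as the clustering distance on (width, height) pairs; the genetic-algorithm refinement is omitted:

```python
import numpy as np

def kmeans_anchors(wh, k, iters=30, seed=0):
    """Cluster (width, height) pairs into k anchor boxes.
    Boxes are compared by IoU assuming a shared top-left corner."""
    rng = np.random.default_rng(seed)
    anchors = wh[rng.choice(len(wh), k, replace=False)]  # random initial anchors
    for _ in range(iters):
        # IoU between every box and every anchor
        inter = np.minimum(wh[:, None, :], anchors[None, :, :]).prod(-1)
        union = wh.prod(-1)[:, None] + anchors.prod(-1)[None, :] - inter
        assign = (inter / union).argmax(-1)            # nearest anchor by IoU
        for j in range(k):
            if (assign == j).any():
                anchors[j] = wh[assign == j].mean(0)   # move anchor to cluster mean
    return anchors
```

Running this on a data set whose boxes fall into clearly separated size groups recovers one anchor per group.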

1.2 Activation function

In yolov5, the Leaky ReLU activation function is used in the middle/hidden layers, and the Sigmoid activation function is used in the final detection layer.
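The two activations can be written out directly; a minimal numpy sketch (slope 0.1 is a typical Leaky ReLU choice, assumed here):

```python
import numpy as np

def leaky_relu(x, slope=0.1):
    # Hidden-layer activation: identity for x >= 0, a small slope otherwise
    return np.where(x >= 0, x, slope * x)

def sigmoid(x):
    # Detection-layer activation: squashes raw scores into (0, 1)
    return 1.0 / (1.0 + np.exp(-x))
```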


1.3 Optimizer

yolov5 provides two optimizers, Adam and SGD, each with matching preset training hyperparameters. The default is SGD.
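The default SGD update with momentum can be sketched in a few lines; the momentum value 0.937 mirrors a typical yolov5 hyperparameter preset (assumed here, not stated in these notes):

```python
import numpy as np

def sgd_momentum_step(param, grad, velocity, lr=0.01, momentum=0.937):
    """One SGD-with-momentum update: accumulate the gradient into a velocity
    buffer, then step the parameter against that velocity."""
    velocity = momentum * velocity + grad
    return param - lr * velocity, velocity
```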

1.4 Loss function

  • Loss in the yolo series is computed from three parts: objectness score, class probability, and bounding box regression score.
  • yolov5 uses GIoU Loss as the bounding box loss.
  • yolov5 computes the class-probability and objectness losses with binary cross entropy on logits (BCEWithLogitsLoss). The fl_gamma hyperparameter can also be set to switch these losses to Focal loss.
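The relationship between BCE-on-logits and its focal variant can be sketched as below. This is a simplified illustration, not yolov5's code; gamma = 1.5 is an assumed default for the fl_gamma-style parameter:

```python
import numpy as np

def bce_with_logits(logit, target):
    # Numerically stable binary cross entropy computed on raw logits
    return np.maximum(logit, 0) - logit * target + np.log1p(np.exp(-np.abs(logit)))

def focal_bce(logit, target, gamma=1.5):
    """Focal modulation of BCE: down-weight well-classified examples by
    (1 - p_t)^gamma, so training focuses on hard examples."""
    p = 1.0 / (1.0 + np.exp(-logit))
    p_t = target * p + (1 - target) * (1 - p)   # probability of the true class
    return bce_with_logits(logit, target) * (1 - p_t) ** gamma
```

With gamma = 0 the focal term vanishes and the loss reduces to plain BCE.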

2. Innovation

2.1 Data augmentation

2.1.1 Scaling

2.1.2 Color space adjustment

2.1.3 Image occlusion

  • (1) Random Erase: replace a region of the image with a random value or the mean pixel value of the training set.
  • (2) Cutout: apply a cutout mask only to the input of the first CNN layer.
  • (3) Hide and Seek: divide the image into an S×S grid of patches and randomly hide some of them with a set probability, forcing the model to learn the whole object rather than a single part, e.g. not to rely solely on an animal's face for recognition.
  • (4) GridMask: hide image regions in a grid pattern, again so the model learns all parts of the object.
  • (5) MixUp: a convex combination of image pairs and their labels.
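The first technique above can be sketched in a few lines of numpy. This is a minimal illustration of Random Erase on a single-channel image; passing a constant fill (e.g. the training-set mean) gives the other variant mentioned:

```python
import numpy as np

def random_erase(img, frac=0.3, fill=None, seed=None):
    """Erase a random rectangle covering `frac` of each side.
    fill=None draws random pixel values; a constant fill is the other variant."""
    rng = np.random.default_rng(seed)
    h, w = img.shape[:2]
    eh, ew = int(h * frac), int(w * frac)
    y = rng.integers(0, h - eh + 1)          # random top-left corner
    x = rng.integers(0, w - ew + 1)
    out = img.copy()
    patch = out[y:y + eh, x:x + ew]
    patch[...] = fill if fill is not None else rng.integers(0, 256, patch.shape)
    return out
```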

2.1.4 Multi-picture combination

  • CutMix: paste a cut-out region of one image onto the image being augmented. The cutting forces the model to learn to predict from a wider set of features.
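A minimal CutMix sketch for same-sized, single-label images (detection labels would instead need their boxes remapped; the mixing ratio `lam` is an assumed input):

```python
import numpy as np

def cutmix(img_a, img_b, label_a, label_b, lam=0.7, seed=None):
    """Paste a rectangle of img_b into img_a; labels mix by the kept-area ratio."""
    rng = np.random.default_rng(seed)
    h, w = img_a.shape[:2]
    # rectangle sized so its area is roughly (1 - lam) of the image
    ch, cw = int(h * np.sqrt(1 - lam)), int(w * np.sqrt(1 - lam))
    y = rng.integers(0, h - ch + 1)
    x = rng.integers(0, w - cw + 1)
    out = img_a.copy()
    out[y:y + ch, x:x + cw] = img_b[y:y + ch, x:x + cw]
    lam_adj = 1 - (ch * cw) / (h * w)        # exact kept fraction of img_a
    label = lam_adj * np.asarray(label_a) + (1 - lam_adj) * np.asarray(label_b)
    return out, label
```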

2.1.5 Mosaic data augmentation

  • Where CutMix combines two images, Mosaic combines four training images into one at certain ratios, so the model learns to recognize objects at smaller scales. It also significantly reduces the need for a large batch size, which matters since most people have limited GPU memory.
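The four-image combination can be sketched as a 2×2 tiling. This is a simplified version with a fixed center; yolov5's implementation also jitters the center point, rescales the images, and remaps the box labels:

```python
import numpy as np

def mosaic4(imgs):
    """Tile four equally-sized images into one 2x2 mosaic canvas."""
    h, w = imgs[0].shape[:2]
    canvas = np.zeros((2 * h, 2 * w) + imgs[0].shape[2:], dtype=imgs[0].dtype)
    canvas[:h, :w] = imgs[0]   # top-left
    canvas[:h, w:] = imgs[1]   # top-right
    canvas[h:, :w] = imgs[2]   # bottom-left
    canvas[h:, w:] = imgs[3]   # bottom-right
    return canvas
```

Each source image now occupies a quarter of the canvas, which is why objects appear at smaller scales.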

2.2 Self-adversarial training (SAT)

  • Self-Adversarial Training is a data augmentation technique that resists adversarial attacks to some extent. The CNN computes the loss, then modifies the image via backpropagation to create the illusion that there is no target in it; normal target detection is then performed on the modified image. Note that during SAT's backward pass, the network weights are not changed.
  • Adversarial generation strengthens weak spots in the learned decision boundary and improves model robustness, which is why this augmentation method is adopted by more and more object detection frameworks.
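The key mechanic, back-propagating into the image while keeping the weights fixed, can be shown with a toy model. Here the "network" is just a fixed linear scorer followed by a sigmoid (an assumption for illustration, not a real detector), and FGSM-style sign steps lower its objectness score:

```python
import numpy as np

def sat_perturb(img, weight, step=0.1, iters=5):
    """Toy self-adversarial step: score objectness as sigmoid(sum(w * x)),
    then update the *image* (weights untouched) to make the score drop."""
    x = img.astype(float).copy()
    for _ in range(iters):
        s = 1.0 / (1.0 + np.exp(-(weight * x).sum()))  # objectness score
        grad = s * (1 - s) * weight                    # d(score) / d(pixel)
        x -= step * np.sign(grad)                      # FGSM-style image update
    return x
```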


2.3 Class label smoothing

  • Class label smoothing is a regularization method. If the neural network overfits or becomes overconfident, label smoothing is worth trying. Training labels may contain errors, and fully trusting them means "over-believing" the training samples while discounting other plausible predictions. To avoid such over-confidence, it is more reasonable to encode the class labels so that they express some degree of uncertainty.
  • It is adopted in yolo v4; yolo v5 does not appear to use class label smoothing.
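The encoding itself is one line: the true class keeps 1 − ε and the remaining ε is spread uniformly over all classes. ε = 0.1 is a common choice (assumed here, not from these notes):

```python
import numpy as np

def smooth_labels(onehot, eps=0.1):
    """Soften a one-hot target so it expresses some uncertainty:
    true class -> 1 - eps + eps/K, every class gains eps/K."""
    k = onehot.shape[-1]
    return onehot * (1 - eps) + eps / k
```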

2.4 Adaptive anchor box

  • In YOLO V3 and YOLO V4, k-means and a genetic learning algorithm are applied to the custom data set to obtain preset anchor boxes suited to predicting the bounding boxes of objects in that data set.
  • In YOLO V5, the anchor boxes are learned automatically from the training data.

3. Backbone

Backbone: A convolutional neural network that aggregates and forms image features on different image granularities.

Both yolov4 and yolov5 use CSPDarknet as the backbone to extract rich, informative features from the input image.

CSPNet

  1. Solves the duplicated-gradient problem that network optimization faces in the Backbones of other large convolutional network frameworks.
  2. Specific approach: integrate the gradient changes into the feature map from start to end, reducing the model's parameter count and FLOPS, which preserves inference speed and accuracy while shrinking model size.
  3. CSPNet builds on the DenseNet idea of copying the base-layer feature map: one copy is sent through the dense block to the next stage while the other is kept separate. This effectively alleviates vanishing gradients, supports feature propagation, and encourages the network to reuse features, thereby reducing the number of parameters.
  4. The CSPNet idea can be combined with ResNet, ResNeXt, and DenseNet. Currently there are two main Backbone variants: CSPResNeXt50 and CSPDarknet53.
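The split-and-merge pattern in point 3 can be shown at the shape level. A minimal sketch where `stage_fn` stands in for the dense/conv stage (real CSP blocks also add transition convolutions, omitted here):

```python
import numpy as np

def csp_block(x, stage_fn):
    """Cross-stage-partial pattern on a (C, H, W) feature map: split channels,
    send one half through the stage, keep the other half as a shortcut,
    then concatenate the two paths."""
    c = x.shape[0] // 2
    part1, part2 = x[:c], x[c:]   # channel split of the base-layer feature map
    return np.concatenate([part1, stage_fn(part2)], axis=0)
```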

4. Neck

Neck: A series of network layers that mix and combine image features and pass the image features to the prediction layer.

The Neck is mainly used to generate feature pyramids. Feature pyramids strengthen the model's detection of objects at different scales, so the same object can be recognized at different sizes. Both yolov4 and yolov5 use PANet as the Neck to aggregate features.

PANet

  1. PANet is based on the Mask R-CNN and FPN frameworks while enhancing information propagation.
  2. The network's feature extractor uses a new FPN structure with an enhanced bottom-up path, improving the propagation of low-level features.
  3. Each stage of this third path takes the feature map of the previous stage as input and processes it with a 3×3 convolution; the output is added, via a lateral connection, to the same-stage feature map of the top-down path, and the result feeds the next stage.
  4. Adaptive feature pooling repairs the broken information path between each candidate region and all feature levels: each candidate region is pooled on every feature level, avoiding arbitrary level assignment.
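The two passes, top-down (FPN) then the extra bottom-up path, can be sketched on toy single-channel maps, with nearest-neighbour resampling standing in for the convolutions and lateral 1×1/3×3 layers (all simplifications assumed for illustration):

```python
import numpy as np

def pan_paths(c3, c4, c5):
    """Toy PANet fusion on three pyramid levels (fine c3 ... coarse c5)."""
    up = lambda m: m.repeat(2, 0).repeat(2, 1)   # 2x nearest-neighbour upsample
    down = lambda m: m[::2, ::2]                 # 2x stride downsample
    # top-down pass: high-level semantics flow into the finer maps
    p4 = c4 + up(c5)
    p3 = c3 + up(p4)
    # extra bottom-up pass: low-level localization flows back up
    n3 = p3
    n4 = p4 + down(n3)
    n5 = c5 + down(n4)
    return n3, n4, n5
```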

5. Head

Head: Predict image features, generate bounding boxes and predict categories.

  • The Head is used in the final detection step. It applies the anchor boxes on the feature map and generates the final output vectors with class probabilities, objectness scores, and bounding boxes.
  • yolov3, yolov4, and yolov5 all use the same head structure.

Head

  1. Heads at different scales detect objects of different sizes. Each Head outputs (80 classes + 1 objectness score + 4 coordinates) × 3 anchors = 255 channels in total.
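The channel count is simple arithmetic and generalizes to any class count:

```python
def head_channels(num_classes=80, num_anchors=3):
    """Output channels of one detection head:
    per anchor, num_classes class scores + 1 objectness score + 4 box coords."""
    per_anchor = num_classes + 1 + 4
    return per_anchor * num_anchors
```

With the 80 COCO classes this gives the 255 channels quoted above; a single-class detector would output (1 + 1 + 4) × 3 = 18 channels.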

Reference link

  1. https://zhuanlan.zhihu.com/p/161083602

Origin blog.csdn.net/libo1004/article/details/110928070