YOLO series object detection algorithms - YOLOv4

YOLO series object detection algorithms catalog - article link


This article summarizes:

  1. Many methods and techniques have been proposed to improve the accuracy of neural networks. This paper selects some of them and combines them to achieve the best performance.
  2. Object detectors at this stage mainly consist of two parts, a backbone and a head. In recent years, some layers are usually inserted between the backbone and the head to collect feature maps from different stages; this paper calls this part the neck.
  3. Methods that only change the training strategy or only increase the training cost, without increasing the inference cost, are called the "bag of freebies" (BoF).
  4. Modules and post-processing methods that only slightly increase the inference cost but significantly improve detection accuracy are called the "bag of specials" (BoS).
  5. BoF methods include data augmentation, regularization functions, label refinement networks, bounding-box regression loss functions, etc.
  6. BoS methods include receptive-field enhancement, attention modules, feature integration, activation functions, post-processing, etc.
  7. After comparing various methods and techniques, YOLOv4 finally selects CSPDarknet53 as the backbone, SPP+PAN as the neck, and the YOLOv3 head as the head.
  8. The BoF methods selected and retained are: for the backbone, CutMix and Mosaic data augmentation, DropBlock regularization, and class label smoothing; for the detector, CIoU loss, CmBN, DropBlock regularization, Mosaic data augmentation, Self-Adversarial Training, grid-sensitivity elimination, multiple anchors per ground truth, the cosine annealing scheduler, optimal hyperparameters selected by genetic algorithms, and random training image sizes.
  9. The BoS methods selected and retained are: for the backbone, the Mish activation function, cross-stage partial connections (CSP), and multi-input weighted residual connections (MiWRC); for the detector, the Mish activation function, the SPP block, the SAM block, the PAN path-aggregation block, and DIoU-NMS.

Summary of deep learning knowledge points

Column link:
https://blog.csdn.net/qq_39707285/article/details/124005405

This column mainly summarizes knowledge points in deep learning. Starting from the major dataset competitions, it introduces the champion algorithms over the years; it also summarizes important topics in deep learning, including loss functions, optimizers, various classic algorithms, and algorithm optimization strategies such as the Bag of Freebies (BoF).



5. YOLO series object detection algorithms - YOLOv4

5.1 Introduction

  Currently, there are many methods to improve the accuracy of convolutional neural networks. Some of them are only applicable to certain models and certain problems, or only to small-scale datasets; other methods, such as batch normalization (BN) and residual connections, are applicable to most models, tasks, and datasets. General methods of this kind include Weighted Residual Connections (WRC), Cross-Stage Partial connections (CSP), Cross mini-Batch Normalization (CmBN), Self-Adversarial Training (SAT), Mish activation, etc. By combining methods such as WRC, CSP, CmBN, SAT, Mish activation, Mosaic data augmentation, DropBlock regularization, and CIoU loss, this paper achieves state-of-the-art results on the MS COCO dataset: 43.5% AP (65.7% AP50) at a real-time speed of about 65 FPS on a Tesla V100. The results of YOLOv4 are shown in Figure 1.
[Figure 1: Comparison of YOLOv4 with other state-of-the-art object detectors (speed vs. accuracy on MS COCO)]

The main contributions of YOLOv4 are as follows:

  1. An efficient and powerful object detection model is developed. It enables everyone to train an ultra-fast and accurate object detector directly on a single 1080 Ti or 2080 Ti GPU;
  2. The influence of state-of-the-art Bag-of-Freebies methods (those that do not increase model complexity or inference cost) and Bag-of-Specials methods (those that slightly increase model complexity and inference time) is verified during detector training;
  3. State-of-the-art methods, including CBN, PAN, and SAM, are modified to make them more efficient and suitable for single-GPU training.

  Object detectors at this stage mainly consist of two parts, a backbone and a head. The backbone is usually pre-trained on a dataset such as ImageNet, and the head predicts the object classes and bounding-box coordinates. For GPU platforms, the backbone is generally VGG, ResNet, ResNeXt, or DenseNet; for CPU platforms, it is generally SqueezeNet, MobileNet, or ShuffleNet. Heads are mainly divided into two-stage and one-stage object detectors. Two-stage detectors are mostly anchor-based, mainly the R-CNN series, including Fast R-CNN, Faster R-CNN, R-FCN, and Libra R-CNN; there are also anchor-free ones, such as RepPoints. One-stage detectors are mainly the YOLO series, SSD, RetinaNet, etc., which are anchor-based. In recent years, many anchor-free one-stage detectors have appeared, such as CenterNet, CornerNet, and FCOS.

  Object detectors developed in recent years usually insert some layers between the backbone and the head, which are used to collect feature maps from different stages; this part can be called the neck of the object detector. Typically, a neck consists of several bottom-up paths and several top-down paths. Networks equipped with this mechanism include the Feature Pyramid Network (FPN), the Path Aggregation Network (PAN), BiFPN, NAS-FPN, etc.

  In addition to the above models, some researchers also focus on directly building new backbones (DetNet, DetNAS) or new overall models (SpineNet, HitDetector) for object detection.
In summary, a common object detector consists of the following parts:

  • Input: image, patches, image pyramid
  • Backbones: VGG16, ResNet-50, SpineNet, EfficientNet-B0/B7, CSPResNeXt50, ResNeXt-101, CSPDarknet53
  • Neck:
    • Additional blocks: SPP, ASPP, RFB, SAM
    • Path-aggregation blocks: FPN, PAN, NAS-FPN, fully-connected FPN, BiFPN, ASFF, SFAM
  • Heads:
    • Dense prediction (one-stage):
      • anchor-based: RPN, SSD, YOLO, RetinaNet
      • anchor-free: CornerNet, CenterNet, MatrixNet, FCOS
    • Sparse prediction (two-stage):
      • anchor-based: Faster R-CNN, R-FCN, Mask R-CNN
      • anchor-free: RepPoints

5.2 Bag of freebies

  Usually, conventional object detectors are trained offline. Researchers can exploit this advantage by developing better training methods that make object detectors more accurate without increasing the inference cost. This paper calls these methods, which only change the training strategy or only increase the training cost, the "bag of freebies".

  Data augmentation is commonly used in object detection. Its purpose is to increase the variability of the input images so that the designed detection model is more robust to images obtained from different environments. Photometric distortion and geometric distortion are the two most common categories: photometric distortion adjusts the brightness, contrast, hue, saturation, and noise of an image, while geometric distortion applies random scaling, cropping, flipping, and rotation.

  The data augmentation methods above are pixel-wise adjustments, and all the original pixel information in the adjusted area is preserved. In addition, some data augmentation methods focus on simulating object occlusion and have achieved good results in classification and localization. For example, random erase and CutOut randomly select a rectangular region of an image and fill it with zeros or random pixel values. Hide-and-seek and grid mask randomly or uniformly select multiple rectangular regions in the image and replace them all with zeros.
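  As a minimal sketch of this occlusion-style augmentation (our own illustration, not code from any of the papers), the following NumPy helper zeroes out one randomly placed square in the spirit of CutOut; random erase would fill the region with random pixel values instead:

```python
import numpy as np

def cutout(image: np.ndarray, size: int = 50) -> np.ndarray:
    """Zero out one randomly placed size x size square (CutOut-style sketch).

    Assumes an HWC image array; random erase would instead fill the
    region with random pixel values.
    """
    h, w = image.shape[:2]
    out = image.copy()
    # Sample the square's center anywhere in the image, then clip to bounds.
    cy, cx = np.random.randint(h), np.random.randint(w)
    y1, y2 = max(0, cy - size // 2), min(h, cy + size // 2)
    x1, x2 = max(0, cx - size // 2), min(w, cx + size // 2)
    out[y1:y2, x1:x2] = 0
    return out
```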

  Similar methods applied to feature maps include DropOut, DropConnect, and DropBlock. Additionally, some researchers have proposed using multiple images together for data augmentation. For example, MixUp multiplies two images by different coefficient ratios and overlays them, then adjusts the labels with the same ratios. CutMix pastes a cropped rectangle from one image onto another and adjusts the label according to the size of the mixed region. Beyond these, style-transfer GANs can also be used for data augmentation, which effectively reduces the texture bias learned by CNNs.
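  A rough sketch of the label-mixing logic just described (the function name and the Beta-sampling choice are ours, assuming NumPy arrays and one-hot labels):

```python
import numpy as np

def mixup(img_a, img_b, label_a, label_b, alpha: float = 1.0):
    """MixUp sketch: blend two images and their one-hot labels by one ratio.

    CutMix differs in that it pastes a rectangle from img_b onto img_a
    and sets lam to the ratio of the un-pasted area instead.
    """
    lam = np.random.beta(alpha, alpha)  # blending coefficient in [0, 1]
    image = lam * img_a.astype(np.float32) + (1.0 - lam) * img_b.astype(np.float32)
    label = lam * label_a + (1.0 - lam) * label_b
    return image, label
```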

  Different from the methods above, some other bag-of-freebies methods address the problem that the semantic distribution in the dataset may be biased. One very important instance of this is data imbalance between classes. In two-stage detectors, this is usually solved by hard negative example mining or online hard example mining (OHEM). However, these mining methods are not applicable to one-stage detectors, because such detectors belong to the dense-prediction architecture. Focal loss was therefore proposed to deal with the data imbalance between classes. Another very important problem is that one-hot hard labels, the representation scheme commonly used when annotating data, cannot express the degree of association between different categories. Label smoothing, proposed later, converts hard labels into soft labels for training, which makes the model more robust. To obtain better soft labels, the concept of knowledge distillation was introduced to design a label refinement network.
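  As a small illustration of label smoothing (the standard formulation, not code from the paper), hard one-hot labels can be softened by giving the true class 1 - eps and spreading eps uniformly over all classes:

```python
import numpy as np

def smooth_labels(one_hot: np.ndarray, eps: float = 0.1) -> np.ndarray:
    """Convert hard one-hot labels into soft labels (sketch).

    The true class keeps probability 1 - eps + eps/K; every other
    class receives eps/K, where K is the number of classes.
    """
    num_classes = one_hot.shape[-1]
    return one_hot * (1.0 - eps) + eps / num_classes

# e.g. smooth_labels(np.eye(3)[0]) -> [0.9333, 0.0333, 0.0333]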

  The last bag of freebies is the objective function for bounding box (BBox) regression. Traditional object detectors usually use the mean squared error (MSE) to directly regress the BBox, either its center coordinates and width and height, or its upper-left and lower-right corner coordinates; anchor-based methods predict the corresponding offsets instead. However, directly regressing each coordinate treats the coordinates as independent variables and ignores the integrity of the object itself. To handle this better, researchers proposed the IoU loss, which considers the overlap between the predicted BBox and the ground-truth BBox. Since IoU is scale-invariant, it avoids the problem that the l1 and l2 losses on x, y, w, and h grow as the scale increases. Researchers have since kept improving the IoU loss: GIoU loss considers the shape and orientation of the object in addition to the overlap area; DIoU loss additionally considers the distance between object centers; and CIoU loss simultaneously considers the overlap area, the center-point distance, and the aspect ratio. CIoU achieves better convergence speed and accuracy on the BBox regression problem.
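  The CIoU loss just described can be sketched directly from its published formula, 1 - IoU + ρ²/c² + αv; the helper below (our illustration for single boxes in (x1, y1, x2, y2) format, not an optimized implementation) makes each term explicit:

```python
import math

def ciou_loss(box_p, box_g):
    """CIoU loss sketch for two boxes in (x1, y1, x2, y2) format."""
    px1, py1, px2, py2 = box_p
    gx1, gy1, gx2, gy2 = box_g
    # Overlap area and IoU.
    iw = max(0.0, min(px2, gx2) - max(px1, gx1))
    ih = max(0.0, min(py2, gy2) - max(py1, gy1))
    inter = iw * ih
    area_p = (px2 - px1) * (py2 - py1)
    area_g = (gx2 - gx1) * (gy2 - gy1)
    iou = inter / (area_p + area_g - inter + 1e-9)
    # Squared distance between box centers (the DIoU penalty numerator).
    rho2 = ((px1 + px2) - (gx1 + gx2)) ** 2 / 4 + ((py1 + py2) - (gy1 + gy2)) ** 2 / 4
    # Squared diagonal of the smallest enclosing box.
    cw = max(px2, gx2) - min(px1, gx1)
    ch = max(py2, gy2) - min(py1, gy1)
    c2 = cw ** 2 + ch ** 2 + 1e-9
    # Aspect-ratio consistency term v and its trade-off weight alpha.
    v = (4 / math.pi ** 2) * (math.atan((gx2 - gx1) / (gy2 - gy1 + 1e-9))
                              - math.atan((px2 - px1) / (py2 - py1 + 1e-9))) ** 2
    alpha = v / (1 - iou + v + 1e-9)
    return 1 - iou + rho2 / c2 + alpha * v
```

  Dropping the alpha * v term recovers the DIoU loss, and dropping the rho2 / c2 term as well leaves the plain IoU loss.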

5.3 Bag of specials

  Modules and post-processing methods that increase the inference cost by a small amount but significantly improve object detection accuracy are called the "bag of specials". Generally, these modules enhance certain attributes of the model, such as enlarging the receptive field, introducing an attention mechanism, or strengthening feature-integration capability; post-processing methods screen the model's prediction results.
  Common modules used to enlarge the receptive field are SPP, ASPP, and RFB. The SPP module originates from Spatial Pyramid Matching (SPM). The original SPM method divides the feature map into several d×d equal blocks, where d can be 1, 2, 3, etc., forming a spatial pyramid, and then extracts bag-of-words features. SPP integrates SPM into CNNs and replaces the bag-of-words operation with max pooling. Since the SPP module proposed by Kaiming He et al. outputs a one-dimensional feature vector, it is not applicable to fully convolutional networks (FCN). In the design of YOLOv3, the SPP module was therefore improved into a concatenation of max-pooling outputs with kernel sizes k×k, where k = {1, 5, 9, 13} and stride 1. Under this design, a relatively large k×k max pooling effectively increases the receptive field of the backbone features. After adding the improved SPP module, YOLOv3-608 improves AP50 by 2.7% on the MS COCO object detection task, at an extra computation cost of only 0.5%.
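  A minimal PyTorch sketch of this improved SPP block (our illustration; in the full network the block is surrounded by 1×1 convolutions that are omitted here):

```python
import torch
import torch.nn as nn

class SPPBlock(nn.Module):
    """YOLOv3/v4-style SPP sketch: concatenate stride-1 max-pool outputs.

    With stride 1 and 'same' padding the spatial size is preserved, so
    the pooled maps can be concatenated along the channel axis; k=1 is
    simply the identity branch.
    """

    def __init__(self, kernel_sizes=(1, 5, 9, 13)):
        super().__init__()
        self.pools = nn.ModuleList(
            nn.MaxPool2d(kernel_size=k, stride=1, padding=k // 2)
            for k in kernel_sizes
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Output has len(kernel_sizes) times the input channels.
        return torch.cat([pool(x) for pool in self.pools], dim=1)
```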
  The difference between the ASPP module and the improved SPP module is that ASPP does not use max pooling to obtain feature maps of different receptive fields; instead it uses 3×3 dilated convolutions with different dilation rates to enlarge the receptive field. The rest of the operations are consistent with SPP.
  The RFB module uses k×k convolutions (k = 1, 3, 5) combined with dilated convolutions, with dilation rate equal to k and stride equal to 1, to enlarge the receptive field and fuse feature maps of different receptive fields, obtaining more comprehensive spatial coverage than ASPP. RFB costs only 7% extra inference time but improves the AP50 of SSD on MS COCO by 5.7%.

  The attention modules often used in object detection are mainly divided into channel-wise attention and point-wise attention; the representatives of these two attention models are Squeeze-and-Excitation (SE) and the Spatial Attention Module (SAM). Although the SE module can improve the top-1 accuracy of ResNet50 on the ImageNet classification task by 1% at the cost of only 2% extra computation, it usually increases inference time by about 10% on GPUs, so it is more suitable for mobile devices. SAM, in contrast, needs only 0.1% extra computation to improve the top-1 accuracy of ResNet50-SE by 0.5% on the ImageNet classification task, and, most importantly, it does not affect inference speed on GPUs at all.
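  A sketch of the CBAM-style SAM as commonly described (our reading, not the paper's code): pool along the channel axis, convolve, and gate the input with a sigmoid mask:

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """CBAM-style spatial attention module (SAM) sketch."""

    def __init__(self, kernel_size: int = 7):
        super().__init__()
        # Two pooled maps (avg + max) in, one attention map out.
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        avg_map = x.mean(dim=1, keepdim=True)          # channel-wise average
        max_map = x.max(dim=1, keepdim=True).values    # channel-wise max
        mask = torch.sigmoid(self.conv(torch.cat([avg_map, max_map], dim=1)))
        return x * mask                                # broadcast over channels
```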

  In terms of feature integration, the early practice was to use skip connections or hyper-columns to integrate low-level physical features into high-level semantic features. Since multi-scale prediction methods such as FPN became popular, many lightweight modules that integrate different feature pyramids have been proposed, including SFAM, ASFF, and BiFPN. The main idea of SFAM is to use SE modules to perform channel-wise reweighting on multi-scale concatenated feature maps. ASFF uses softmax for point-wise weighting and then adds feature maps of different scales. BiFPN proposes multi-input weighted residual connections to perform scale-wise reweighting, and then adds feature maps of different scales.

  In deep learning research, some people focus on finding good activation functions. A good activation function allows gradients to propagate more efficiently without incurring too much extra computational cost. ReLU, proposed in 2010, substantially solved the vanishing-gradient problem frequently encountered with the traditional tanh and sigmoid activation functions. Subsequently, LReLU, PReLU, ReLU6, the Scaled Exponential Linear Unit (SELU), Swish, hard-Swish, and Mish were proposed, which also address the vanishing-gradient problem. LReLU and PReLU mainly solve the problem that the gradient of ReLU is zero when the output is less than zero. ReLU6 and hard-Swish are specially designed for quantized networks, and SELU was proposed for self-normalizing neural networks. Notably, both Swish and Mish are continuously differentiable activation functions.
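  For reference, Mish has a one-line definition, x · tanh(softplus(x)); shown here as a small PyTorch sketch:

```python
import torch
import torch.nn.functional as F

def mish(x: torch.Tensor) -> torch.Tensor:
    """Mish activation: x * tanh(softplus(x)).

    Smooth and continuously differentiable, unlike ReLU's hard kink at zero.
    """
    return x * torch.tanh(F.softplus(x))
```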

  A commonly used post-processing method in deep-learning-based object detection is NMS, which filters out BBoxes that predict the same object poorly and keeps only the candidate BBoxes with higher responses. The way NMS has been improved is consistent with the way the objective function has been improved. The originally proposed NMS does not consider context information, so classification confidence was added in R-CNN as a reference, and suppression is performed in descending order of confidence score. Soft-NMS considers the problem that object occlusion may degrade the confidence score in greedy NMS with its IoU-based suppression. DIoU-NMS builds on soft-NMS by adding center-point distance information to the BBox screening process. It is worth mentioning that, since none of the above post-processing methods directly reference the captured image features, such post-processing is no longer required in subsequent anchor-free methods.
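  The DIoU-NMS criterion can be sketched as greedy NMS whose suppression test subtracts the normalized center-distance penalty from the IoU; the NumPy illustration below follows that reading and is not a production implementation:

```python
import numpy as np

def diou_nms(boxes: np.ndarray, scores: np.ndarray, thresh: float = 0.5):
    """Greedy DIoU-NMS sketch; boxes are (N, 4) in (x1, y1, x2, y2).

    A candidate is suppressed when IoU minus the center-distance penalty
    exceeds the threshold, so a distant but overlapping box (likely a
    different, occluded object) can survive ordinary-NMS suppression.
    """
    order = scores.argsort()[::-1]
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        rest = order[1:]
        # IoU of the top-scoring box against the remaining candidates.
        x1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        y1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        x2 = np.minimum(boxes[i, 2], boxes[rest, 2])
        y2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.maximum(0, x2 - x1) * np.maximum(0, y2 - y1)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_r = (boxes[rest, 2] - boxes[rest, 0]) * (boxes[rest, 3] - boxes[rest, 1])
        iou = inter / (area_i + area_r - inter + 1e-9)
        # Center-distance penalty, normalized by the enclosing-box diagonal.
        ci = (boxes[i, :2] + boxes[i, 2:]) / 2
        cr = (boxes[rest, :2] + boxes[rest, 2:]) / 2
        rho2 = ((ci - cr) ** 2).sum(axis=1)
        cw = np.maximum(boxes[i, 2], boxes[rest, 2]) - np.minimum(boxes[i, 0], boxes[rest, 0])
        ch = np.maximum(boxes[i, 3], boxes[rest, 3]) - np.minimum(boxes[i, 1], boxes[rest, 1])
        c2 = cw ** 2 + ch ** 2 + 1e-9
        order = rest[iou - rho2 / c2 <= thresh]
    return keep
```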

5.4 YOLOv4 Methodology

  The basic aim is a fast operating speed of the neural network in production systems and optimization for parallel computation, rather than a low theoretical computation indicator (BFLOP). This paper presents two options for real-time neural networks:

  • For GPUs, use a small number of groups (1-8) in the convolutional layers: CSPResNeXt50 / CSPDarknet53
  • For VPUs (Video Processing Units), use grouped convolutions but avoid Squeeze-and-Excitation (SE) blocks; this includes models such as EfficientNet-lite / MixNet / GhostNet / MobileNetV3

5.4.1 Network structure selection

  The goal here is to find the optimal balance among the input network resolution, the number of convolutional layers, the number of parameters (filter_size² × filters × channels / groups), and the number of layer outputs (filters). For example, numerous studies show that CSPResNeXt50 is considerably better than CSPDarknet53 for classification on the ILSVRC2012 (ImageNet) dataset. Conversely, CSPDarknet53 is better than CSPResNeXt50 for object detection on the MS COCO dataset.

  The next goal is to select additional blocks for increasing the receptive field, and the best method of parameter aggregation from different backbone levels for different detector levels (e.g., FPN, PAN, ASFF, BiFPN).
A model that is optimal for classification is not always optimal for detection. Unlike the classifier, the detector requires the following:

  • Higher input size (resolution): for detecting multiple small objects
  • More layers: for a larger receptive field that covers the increased input size
  • More parameters: for greater model capacity to detect multiple objects of different sizes in a single image

  It can be assumed that a model with a larger receptive field size (with more 3×3 convolutional layers) and more parameters should be selected as the backbone. Table 1 shows the comparison results of CSPResNeXt50, CSPDarknet53 and EfficientNet-B3.
[Table 1: Parameters of neural networks for image classification: CSPResNeXt-50, CSPDarknet-53, EfficientNet-B3]
  CSPResNeXt50 contains only 16 3×3 convolutional layers, a 425×425 receptive field, and 20.6M parameters, while CSPDarknet53 contains 29 3×3 convolutional layers, a 725×725 receptive field, and 27.6M parameters. This theoretical argument, together with extensive experiments, shows that CSPDarknet53 is the optimal model to serve as the detector's backbone.

  The influence of receptive fields of different sizes is summarized as follows:

  • Up to the object size: allows viewing the entire object
  • Up to the network size: allows viewing the context around the object
  • Exceeding the network size: increases the number of connections between an image point and the final activation

  The SPP block is added on top of CSPDarknet53, since it significantly enlarges the receptive field, separates out the most important contextual features, and causes almost no reduction in network operating speed. Instead of the FPN used in YOLOv3, PANet is used as the method of parameter aggregation from different backbone levels for different detector levels.

  In the end, this paper chooses CSPDarknet53 backbone, SPP additional module, PANet Neck and YOLOv3 (anchor-based) Head as the architecture of YOLOv4.

5.4.2 BoF and BoS Selection

  To improve object detection training, CNNs typically use the following methods:

  • Activation functions: ReLU, leaky-ReLU, parametric-ReLU, ReLU6, SELU, Swish, or Mish
  • BBox regression loss functions: MSE, IoU, GIoU, CIoU, DIoU
  • Data augmentation: CutOut, MixUp, CutMix
  • Regularization methods: DropOut, DropPath, Spatial DropOut, DropBlock
  • Normalization methods: Batch Normalization (BN), Cross-GPU Batch Normalization (CGBN or SyncBN), Filter Response Normalization (FRN), Cross-Iteration Batch Normalization (CBN)
  • Skip-connections: residual connections, weighted residual connections, multi-input weighted residual connections, cross-stage partial connections (CSP)

  Since PReLU and SELU are more difficult to train, and ReLU6 is specially designed for quantized networks, the above activation functions are removed from the candidate list.

  Among the regularization methods, the authors of DropBlock compared their method with other methods in detail, and their regularization method won out. Therefore, this paper chooses DropBlock as its regularization method without hesitation.

  For the selection of the normalization method, since this paper only focuses on the training strategy using one GPU, syncBN is not considered.

5.4.3 Additional improvements

  To make the designed detector more suitable for training on a single GPU, the following additional designs and improvements are made:

  • New data augmentation methods: Mosaic and Self-Adversarial Training (SAT)
  • Optimal hyperparameters selected by applying genetic algorithms
  • Some existing methods modified to make the design suitable for efficient training and detection: modified SAM, modified PAN, and Cross mini-Batch Normalization (CmBN)

  Mosaic is a new data augmentation method that mixes 4 training images, thereby mixing 4 different contexts, whereas CutMix mixes only 2 input images. This allows detection of objects outside their normal context. In addition, batch normalization computes activation statistics from 4 different images on each layer, which significantly reduces the need for a large mini-batch size: even when training resources allow only one image at a time, a Mosaic-augmented image effectively carries the information of 4 images.
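  A deliberately simplified sketch of the idea (our illustration; the real Mosaic samples a random split point and random crops, and shifts/clips the box labels accordingly, all omitted here):

```python
import numpy as np

def _resize_nn(img: np.ndarray, h: int, w: int) -> np.ndarray:
    """Nearest-neighbor resize, to keep the sketch dependency-free."""
    ys = (np.arange(h) * img.shape[0] // h).astype(int)
    xs = (np.arange(w) * img.shape[1] // w).astype(int)
    return img[ys][:, xs]

def simple_mosaic(images, size: int = 608) -> np.ndarray:
    """Tile 4 HWC uint8 images into one size x size image (fixed 2x2 grid)."""
    assert len(images) == 4
    half = size // 2
    canvas = np.zeros((size, size, 3), dtype=np.uint8)
    for img, (y, x) in zip(images, [(0, 0), (0, half), (half, 0), (half, half)]):
        canvas[y:y + half, x:x + half] = _resize_nn(img, half, half)
    return canvas
```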

  Self-Adversarial Training (SAT) is also a new data augmentation technique that operates in two forward-backward stages. In the first stage, the neural network alters the original image instead of the network weights; in this way, the network performs an adversarial attack on itself, modifying the original image to create the deception that the desired object is not present. In the second stage, the network is trained to detect objects on the modified image in the normal way.
[Figure 4: Cross mini-Batch Normalization (CmBN) compared with BN and CBN]
  CmBN is a modified version of CBN, defined as Cross mini-Batch Normalization, as shown in Figure 4. It collects statistics only between mini-batches within a single batch.
[Figure 5: Modified SAM]
[Figure 6: Modified PAN]
  SAM is modified from spatial-wise attention to point-wise attention, and the shortcut connections of PAN are replaced with concatenation, as shown in Figure 5 and Figure 6, respectively.
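  A minimal sketch of the modified SAM under our reading of Figure 5: the pooling step is dropped, and a convolution produces a per-point sigmoid mask that gates the input directly:

```python
import torch
import torch.nn as nn

class PointwiseSAM(nn.Module):
    """Point-wise SAM sketch: conv -> sigmoid -> multiply, no pooling."""

    def __init__(self, channels: int):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # The mask has the same shape as x, so attention is per point.
        return x * torch.sigmoid(self.conv(x))
```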

5.5 YOLOv4

  The composition of YOLOv4:

  • Backbone: CSPDarknet53
  • Neck: SPP , PAN
  • Head: YOLOv3

YOLOv4 uses the following BoF and BoS methods:

  • Bag of Freebies (BoF) for the backbone: CutMix and Mosaic data augmentation, DropBlock regularization, class label smoothing
  • Bag of Specials (BoS) for the backbone: Mish activation function, cross-stage partial connections (CSP), multi-input weighted residual connections (MiWRC)
  • Bag of Freebies (BoF) for the detector: CIoU loss, CmBN, DropBlock regularization, Mosaic data augmentation, Self-Adversarial Training, grid-sensitivity elimination, multiple anchors per ground truth, cosine annealing scheduler, optimal hyperparameters selected by genetic algorithms, random training image sizes
  • Bag of Specials (BoS) for the detector: Mish activation function, SPP block, SAM block, PAN path-aggregation block, DIoU-NMS
