YOLO series target detection algorithm-YOLOv2



This article summarizes:

  1. Compared with other object detection algorithms, YOLO makes many localization errors and has low recall, so YOLOv2 focuses on improving recall and localization accuracy;
  2. A range of existing techniques is applied to YOLO and their effect on performance is compared. These include adding BN, a higher-resolution classifier, a fully convolutional network, the anchor mechanism, a new network structure, k-means clustering of box dimensions, direct location prediction, finer-grained feature maps, and multi-scale training;
  3. Based on the experimental comparison, the final model adopts BN, the high-resolution classifier, full convolution, the new network structure, anchors selected by clustering, direct location prediction, finer-grained feature maps, and multi-scale training;
  4. To keep YOLO fast, the network is simplified rather than enlarged so that it is easier to learn, and a new classification network, Darknet-19, is proposed;
  5. Since detection datasets are small and classification datasets are abundant, a method for jointly training on classification and detection data is proposed. During training, images from the detection and classification datasets are mixed. When the network sees an image labeled for detection, it backpropagates the full YOLOv2 loss. When it sees a classification image, it backpropagates only the loss from the classification-specific part of the architecture;
  6. With this method, YOLO9000 is trained. It can localize objects from categories that have no detection labels, and it detects more than 9,000 object categories in real time.

Summary of deep learning knowledge points

Column link:
https://blog.csdn.net/qq_39707285/article/details/124005405

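This column summarizes key topics in deep learning. Starting from the major dataset competitions, it introduces the winning algorithms of each year. It also covers important deep learning topics, including loss functions, optimizers, classic algorithms, and optimization strategies such as Bag of Freebies (BoF).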



**YOLO series target detection algorithm-YOLOv2**

  2016.12.25 《YOLO9000: Better, Faster, Stronger》

1. Introduction

  This paper not only introduces YOLOv2, but also proposes YOLO9000, a real-time object detection system that can detect more than 9000 object categories.

  Datasets for object detection are limited compared to those for other tasks such as classification, and labeling images for detection is far more expensive than labeling them for classification. This paper therefore proposes a new approach that exploits the large amount of classification data already available and uses it to expand the scope of current detection systems. The approach uses a hierarchical view of object classification, which allows distinct datasets to be combined.

  A joint training algorithm is also proposed, which allows object detectors to be trained on both detection and classification data. The method uses labeled detection images to learn to localize objects precisely, while using classification images to increase its vocabulary and robustness.

  First, the base YOLO detection system is improved with a set of existing techniques to obtain YOLOv2. A multi-scale training method is used so that the same YOLOv2 model can run at different input sizes, offering a good trade-off between speed and accuracy. Then the proposed dataset combination method and joint training algorithm are used to train YOLO9000, which can predict detections for object classes that have no labeled detection data and detects more than 9000 object categories in real time.

2. Better

  Compared with state-of-the-art detection systems, YOLO suffers from several shortcomings. Compared with Fast R-CNN, YOLO makes a large number of localization errors. Furthermore, YOLO has relatively low recall compared to region-proposal-based methods. This paper therefore focuses mainly on improving recall and localization while maintaining classification accuracy.

  Computer vision generally trends towards larger, deeper networks, and better performance often depends on training larger networks or ensembling multiple models. YOLOv2, however, aims for a detector that is more accurate but still fast. Instead of scaling up the network, the network is simplified and the representation is made easier to learn. This paper pools a variety of ideas from past work with new concepts to improve the performance of YOLO. The techniques are summarized in Table 2.

  • Batch Normalization
    Batch normalization (BN) significantly improves convergence while eliminating the need for other forms of regularization. Adding batch normalization to all convolutional layers in YOLO improves mAP by more than 2%. BN also helps regularize the model, so dropout can be removed without overfitting.
  • High Resolution Classifier
    All state-of-the-art detection methods use classifiers pre-trained on ImageNet. Starting with AlexNet, most classifiers operate on input images smaller than 256×256. The original YOLO trains the classifier network at 224×224 and increases the resolution to 448 for detection. This means that the network must simultaneously switch to learning object detection and adapt to the new input resolution.

  For YOLOv2, the classification network is first fine-tuned at the full 448×448 resolution for 10 epochs on ImageNet. This gives the network time to adjust its filters to work better on higher-resolution input. The resulting network is then fine-tuned on detection. This high-resolution classification network increases mAP by almost 4%.

  • Convolutional With Anchor Boxes
    YOLO predicts the coordinates of bounding boxes directly, using fully connected layers on top of the convolutional feature extractor. Faster R-CNN, in contrast, predicts bounding boxes using hand-picked priors instead of predicting coordinates directly. The Region Proposal Network (RPN) in Faster R-CNN uses only convolutional layers to predict offsets and confidences for anchor boxes. Since the prediction layer is convolutional, the RPN predicts these offsets at every location of the feature map. Predicting offsets instead of coordinates simplifies the problem and makes it easier for the network to learn.

  This paper removes the fully connected layers from YOLO and uses anchor boxes to predict bounding boxes. First, one pooling layer is removed to make the output of the convolutional layers higher resolution. The network input is also shrunk from 448×448 to 416×416. The purpose is to have an odd number of locations in the feature map, so that there is a single center cell. Objects, especially large objects, tend to occupy the center of the image, so it is better to have a single location right at the center to predict them than four locations that are all nearby. YOLO's convolutional layers downsample the image by a factor of 32, so a 416×416 input yields a 13×13 output feature map.

  With anchor boxes, the class prediction mechanism is also decoupled from the spatial location: class and objectness are predicted for every anchor box instead. As in YOLO, the objectness prediction still predicts the IOU of the ground truth and the proposed box, and the class predictions predict the conditional probability of each class given that there is an object.

  Using anchor boxes causes a small decrease in accuracy. YOLO predicts only 98 boxes per image, but with anchor boxes the model predicts more than a thousand. Without anchor boxes, the model reaches 69.5 mAP with a recall of 81%. With anchor boxes, it reaches 69.2 mAP with a recall of 88%. Although mAP decreases, the increase in recall means that the model has more room to improve.
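  The grid and box-count arithmetic above can be checked with a short Python sketch. The function names are illustrative and not from the paper. The 7×7 grid with 2 boxes per cell is the original YOLO configuration; the 9 anchors per cell used for the anchor-based count is an assumption borrowed from the hand-picked anchor baseline discussed later.

```python
def grid_size(input_size: int, stride: int = 32) -> int:
    """Side length of the output feature map for a square input."""
    if input_size % stride != 0:
        raise ValueError("input size must be a multiple of the stride")
    return input_size // stride


def boxes_per_image(grid: int, boxes_per_cell: int) -> int:
    """Total number of boxes predicted over a grid x grid feature map."""
    return grid * grid * boxes_per_cell


for size in (448, 416):
    g = grid_size(size)
    parity = "odd (single center cell)" if g % 2 == 1 else "even (no single center cell)"
    print(f"{size}x{size} input -> {g}x{g} grid, {parity}")

# Original YOLO: 7x7 grid, 2 boxes per cell.
print(boxes_per_image(7, 2))
# Anchor-based variant: 13x13 grid, 9 anchors per cell (assumed count).
print(boxes_per_image(grid_size(416), 9))
```

  With these assumptions, the 448 input gives an even 14×14 grid, while the 416 input gives the odd 13×13 grid the text describes.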

  • Dimension Clusters
    Two problems arise when using anchors in YOLO. The first is that the box dimensions are hand-picked. The network can learn to adjust the boxes appropriately, but if better priors are chosen for the network to start with, it becomes easier for the network to learn to predict good detections.

  Instead of choosing priors by hand, this paper runs k-means clustering on the training-set bounding boxes to find good priors automatically. With standard k-means and Euclidean distance, larger boxes generate more error than smaller boxes. What we really want are priors that lead to good IOU scores, independent of the size of the box. The distance metric is therefore:

  $$d(\text{box}, \text{centroid}) = 1 - \text{IOU}(\text{box}, \text{centroid})$$

  This paper runs k-means for various values of k and plots the average IOU with the closest centroid. The results are shown in Figure 2.
  [Figure 2: average IOU with the closest centroid for various values of k]

  In the end, k=5 was chosen as a good trade-off between model complexity and high recall. The cluster centroids differ significantly from hand-picked anchor boxes: there are fewer short, wide boxes and more tall, thin boxes.

  Table 1 compares the average IOU to the closest prior for the clustering strategy and for hand-picked anchor boxes.
  [Table 1: average IOU to the closest prior, clustered priors versus hand-picked anchor boxes]
  With 5 priors obtained by clustering, the average IOU is 61.0, similar to the 60.9 of 9 hand-picked anchor boxes. With 9 clustered priors, the average IOU rises to 67.2. This indicates that generating priors with k-means starts the model off with a better representation and makes the task easier to learn.
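  The clustering procedure can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation: the function names are hypothetical, boxes are represented only by (width, height) and compared as if they shared a common corner, and centroids are updated with the cluster mean, which is an assumption.

```python
import numpy as np


def iou_wh(boxes: np.ndarray, centroids: np.ndarray) -> np.ndarray:
    """IOU between (N, 2) boxes and (K, 2) centroids given as (w, h).

    Boxes are treated as aligned at a common corner, so only their
    dimensions matter. Returns an (N, K) matrix.
    """
    inter_w = np.minimum(boxes[:, None, 0], centroids[None, :, 0])
    inter_h = np.minimum(boxes[:, None, 1], centroids[None, :, 1])
    inter = inter_w * inter_h
    area_b = boxes[:, 0] * boxes[:, 1]
    area_c = centroids[:, 0] * centroids[:, 1]
    union = area_b[:, None] + area_c[None, :] - inter
    return inter / union


def kmeans_iou(boxes: np.ndarray, k: int, iters: int = 100, seed: int = 0):
    """k-means on box dimensions with d = 1 - IOU as the distance."""
    rng = np.random.default_rng(seed)
    centroids = boxes[rng.choice(len(boxes), size=k, replace=False)].copy()
    for _ in range(iters):
        dist = 1.0 - iou_wh(boxes, centroids)
        assign = dist.argmin(axis=1)
        new_centroids = centroids.copy()
        for j in range(k):
            members = boxes[assign == j]
            if len(members) > 0:  # keep the old centroid for empty clusters
                new_centroids[j] = members.mean(axis=0)
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    # Average IOU of each box with its closest centroid.
    avg_iou = iou_wh(boxes, centroids).max(axis=1).mean()
    return centroids, avg_iou


if __name__ == "__main__":
    # Synthetic (w, h) pairs stand in for real training-set boxes.
    demo = np.abs(np.random.default_rng(1).normal(size=(500, 2))) + 0.1
    for k in (1, 5, 9):
        _, score = kmeans_iou(demo, k)
        print(k, round(float(score), 3))
```

  Running this for several values of k on real training boxes and plotting the average IOU is one way to reproduce the kind of curve the text describes.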

  • Direct location prediction
    The second problem with anchors in YOLO is model instability, especially during early iterations. Most of the instability comes from predicting the (x, y) location of the box. In a region proposal network, the network predicts values t_x and t_y, and the (x, y) center coordinates are computed from the anchor center (x_a, y_a) and the anchor size (w_a, h_a) as:

    $$x = (t_x \cdot w_a) + x_a, \qquad y = (t_y \cdot h_a) + y_a$$

    For example, a prediction of t_x = 1 shifts the box to the right by the width of the anchor box, and a prediction of t_x = -1 shifts it to the left by the same amount.

  This formulation is unconstrained, so any anchor box can end up at any point in the image, regardless of which location predicted the box. With random initialization, the model takes a long time to stabilize and predict sensible offsets.

  Instead of predicting offsets, this paper follows the YOLO approach and predicts location coordinates relative to the location of the grid cell. This bounds the ground truth to fall between 0 and 1. A logistic (sigmoid) activation constrains the network's predictions to this range.

  The network predicts 5 bounding boxes at each cell of the output feature map, and 5 values for each bounding box: t_x, t_y, t_w, t_h, and t_o. If the cell is offset from the top-left corner of the image by (c_x, c_y) and the bounding box prior has width and height p_w, p_h, then the predictions correspond to:

  $$
  \begin{aligned}
  b_x &= \sigma(t_x) + c_x \\
  b_y &= \sigma(t_y) + c_y \\
  b_w &= p_w e^{t_w} \\
  b_h &= p_h e^{t_h} \\
  \Pr(\text{object}) \cdot \text{IOU}(b, \text{object}) &= \sigma(t_o)
  \end{aligned}
  $$
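  The decoding of one predicted box can be written out as a small Python sketch. The function and argument names are illustrative, and all quantities are assumed to be in grid-cell units (so a 13×13 feature map spans 0 to 13 in each direction).

```python
import math


def sigmoid(v: float) -> float:
    """Logistic function; maps any real value into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-v))


def decode_box(t, cell, prior):
    """Turn raw network outputs into a box, in grid-cell units.

    t     -- (t_x, t_y, t_w, t_h, t_o) raw predictions
    cell  -- (c_x, c_y) offset of the cell from the image's top-left corner
    prior -- (p_w, p_h) width and height of the bounding box prior
    """
    t_x, t_y, t_w, t_h, t_o = t
    c_x, c_y = cell
    p_w, p_h = prior
    b_x = sigmoid(t_x) + c_x      # center stays inside the predicting cell
    b_y = sigmoid(t_y) + c_y
    b_w = p_w * math.exp(t_w)     # prior is scaled, never made negative
    b_h = p_h * math.exp(t_h)
    objectness = sigmoid(t_o)     # Pr(object) * IOU(b, object)
    return b_x, b_y, b_w, b_h, objectness


# All-zero outputs place the box at the cell center with the prior's size.
print(decode_box((0.0, 0.0, 0.0, 0.0, 0.0), cell=(6, 6), prior=(3.0, 4.0)))
```

  Because the sigmoid output lies strictly between 0 and 1, the decoded center cannot leave the cell that predicted it, which is the constraint the unconstrained RPN formulation lacks.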

  • Fine-Grained Feature
    The modified YOLO predicts detections on a 13×13 feature map. While this is sufficient for large objects, the model may benefit from finer-grained features for localizing smaller objects. Faster R-CNN and SSD both run their proposal networks on multiple feature maps in the network to obtain a range of resolutions. This paper takes a different approach and simply adds a passthrough layer that brings in features from an earlier layer at 26×26 resolution.

  The passthrough layer concatenates the higher-resolution features with the low-resolution features by stacking adjacent features into different channels instead of spatial locations, similar to the identity mappings in ResNet. This turns the 26×26×512 feature map into a 13×13×2048 feature map, which can be concatenated with the original features. The detector runs on top of this expanded feature map, so it has access to fine-grained features. This gives a modest 1% performance increase.
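  The shape change performed by the passthrough layer can be illustrated with a NumPy space-to-depth rearrangement. This is a sketch under assumptions: the function name is hypothetical, a height × width × channels layout is used, and the exact channel ordering of the Darknet implementation may differ.

```python
import numpy as np


def passthrough(x: np.ndarray, stride: int = 2) -> np.ndarray:
    """Stack each stride x stride spatial block into the channel axis.

    x has shape (H, W, C); the result has shape
    (H // stride, W // stride, stride * stride * C).
    """
    h, w, c = x.shape
    if h % stride or w % stride:
        raise ValueError("spatial size must be divisible by the stride")
    x = x.reshape(h // stride, stride, w // stride, stride, c)
    x = x.transpose(0, 2, 1, 3, 4)  # (H/s, W/s, s, s, C)
    return x.reshape(h // stride, w // stride, stride * stride * c)


fine = np.arange(26 * 26 * 512, dtype=np.float32).reshape(26, 26, 512)
stacked = passthrough(fine)
print(stacked.shape)  # (13, 13, 2048)

# Concatenate with a coarse map along the channel axis
# (the 1024 channels here are an assumed example size).
coarse = np.zeros((13, 13, 1024), dtype=np.float32)
print(np.concatenate([stacked, coarse], axis=-1).shape)  # (13, 13, 3072)
```

  No values are summed or discarded: each group of four neighboring 26×26 positions becomes four channel groups at one 13×13 position.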

  • Multi-Scale Training
    The original YOLO uses an input resolution of 448×448. With the addition of anchor boxes, the resolution was changed to 416×416. However, since the new model uses only convolutional and pooling layers, it can be resized on the fly. The goal is for YOLOv2 to be robust to images of different sizes, so the model is trained on images of various scales.

  Instead of fixing the input image size, the network input is changed every few iterations. Every 10 batches, the network randomly chooses a new image size. Since the model downsamples by a factor of 32, the sizes are drawn from the multiples of 32: {320, 352, …, 608}. The smallest option is therefore 320×320 and the largest is 608×608. The network is resized to the chosen dimension and training continues.
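  The size-switching schedule can be sketched as a small generator. The names are illustrative, and drawing uniformly at random from the candidate sizes is an assumption, since the text only says the size is chosen randomly.

```python
import random

# Multiples of 32 from 320 to 608 inclusive.
SIZES = list(range(320, 608 + 1, 32))


def size_schedule(num_batches: int, every: int = 10, seed: int = 0):
    """Yield the square input size to use for each training batch.

    A new size is drawn every `every` batches and held until the next draw.
    """
    rng = random.Random(seed)
    size = SIZES[0]
    for batch in range(num_batches):
        if batch % every == 0:
            size = rng.choice(SIZES)  # uniform choice (assumed)
        yield size


if __name__ == "__main__":
    print(SIZES)
    print(list(size_schedule(30)))
```

  In a real training loop, each batch would be resized to the yielded size before the forward pass, and the output grid would change accordingly (size divided by 32).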

  This regime forces the network to learn to predict well over a wide range of input sizes. This means that the same network can predict detections at different resolutions. Networks run faster at smaller sizes, so YOLOv2 provides an easy trade-off between speed and accuracy.

  At low resolutions, YOLOv2 is an inexpensive, reasonably accurate detector. It runs at 90+ FPS at 288×288, and the mAP is almost as good as Fast R-CNN. This makes it ideal for smaller GPUs, high frame rate video, or multiple video streams.

  At high resolution, YOLOv2 is a state-of-the-art detector with 78.6 mAP on VOC 2007 while still operating above real-time speeds. Table 3 and Figure 4 compare YOLOv2 with other frameworks on VOC 2007.
  [Table 3 and Figure 4: YOLOv2 compared with other detection frameworks on VOC 2007]

3. Faster

  Detection should be accurate, but it should also be fast. Most detection applications, such as robotics or self-driving cars, rely on low-latency predictions. To maximize performance, YOLOv2 is designed to be fast from the ground up.

  • Darknet-19
    A new classification model is proposed as the base network of YOLOv2. Similar to the VGG models, it uses mostly 3×3 filters and doubles the number of channels after every pooling step. Following the work on Network in Network (NIN), it uses global average pooling to make predictions and 1×1 filters to compress the feature representation between 3×3 convolutions. Batch normalization is used to stabilize training, speed up convergence, and regularize the model. The final model, called Darknet-19, has 19 convolutional layers and 5 max pooling layers. See Table 6 for a full description. Darknet-19 requires only 5.58 billion operations to process an image, yet achieves 72.9% top-1 accuracy and 91.2% top-5 accuracy on ImageNet.
    [Table 6: Darknet-19 architecture]
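  The layer counts can be checked against a plain-Python listing of the architecture. The listing below follows the published Darknet-19 classification configuration as best recalled; the tuple format and function name are illustrative, and the final global average pooling and softmax are omitted because they have no filters.

```python
# ("conv", filters, kernel_size) or ("max",) for a 2x2, stride-2 max pool.
DARKNET19 = [
    ("conv", 32, 3), ("max",),
    ("conv", 64, 3), ("max",),
    ("conv", 128, 3), ("conv", 64, 1), ("conv", 128, 3), ("max",),
    ("conv", 256, 3), ("conv", 128, 1), ("conv", 256, 3), ("max",),
    ("conv", 512, 3), ("conv", 256, 1), ("conv", 512, 3),
    ("conv", 256, 1), ("conv", 512, 3), ("max",),
    ("conv", 1024, 3), ("conv", 512, 1), ("conv", 1024, 3),
    ("conv", 512, 1), ("conv", 1024, 3),
    ("conv", 1000, 1),  # 1x1 classifier over the 1000 ImageNet classes
]


def summarize(layers, input_size: int = 224):
    """Return (conv layer count, max pool count, final spatial size)."""
    n_conv = sum(1 for layer in layers if layer[0] == "conv")
    n_pool = sum(1 for layer in layers if layer[0] == "max")
    final_size = input_size // (2 ** n_pool)  # each pool halves the size
    return n_conv, n_pool, final_size


print(summarize(DARKNET19))       # (19, 5, 7)
print(summarize(DARKNET19, 448))  # (19, 5, 14)
```

  The pattern of 1×1 layers with half the channels sitting between 3×3 layers is the feature compression the text attributes to NIN.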

  • Training for classification
    The network is trained on the standard ImageNet 1000-class classification dataset for 160 epochs with the Darknet framework, using stochastic gradient descent with a starting learning rate of 0.1, polynomial rate decay with a power of 4, weight decay of 0.0005, and momentum of 0.9. Standard data augmentation is used during training, including random crops, rotations, and hue, saturation, and exposure shifts.
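  The polynomial decay can be sketched as follows. This assumes the common form lr = base_lr · (1 − progress)^power with progress measured per epoch; the framework may apply the same rule per batch instead, and the function name is illustrative.

```python
def poly_lr(epoch: float, total_epochs: int = 160,
            base_lr: float = 0.1, power: float = 4.0) -> float:
    """Polynomial learning-rate decay (assumed form, evaluated per epoch)."""
    progress = min(max(epoch / total_epochs, 0.0), 1.0)
    return base_lr * (1.0 - progress) ** power


for epoch in (0, 40, 80, 120, 160):
    print(epoch, poly_lr(epoch))
```

  With a power of 4 the rate falls quickly: halfway through training it is already at 1/16 of the starting value.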

  As mentioned above, after the initial training on 224×224 images, the network is fine-tuned at a larger size of 448. For this fine-tuning, training continues with the above parameters, but for only 10 epochs and starting at a learning rate of 10^−3. At this higher resolution, the network achieves 76.5% top-1 accuracy and 93.3% top-5 accuracy.

  • Training for detection
    The network is modified for detection by removing the last convolutional layer and instead adding three 3×3 convolutional layers with 1024 filters each, followed by a final 1×1 convolutional layer with the number of outputs needed for detection. For COCO, the model predicts 5 boxes per cell, each with 5 coordinates and 80 classes, so the final layer has 5×(5+80)=425 filters. A passthrough layer is also added from the final 3×3×512 layer to the second-to-last convolutional layer so that the model can use fine-grained features. The network is trained for 160 epochs with a starting learning rate of 10^−3, divided by 10 at epochs 60 and 90. Weight decay is 0.0005 and momentum is 0.9. Data augmentation is similar to that of YOLO and SSD, including random crops and color shifts. The same training strategy is used for both COCO and VOC.
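  The filter count and the step schedule can be expressed as two small helpers. The function names are illustrative; the 20-class example assumes the VOC class count.

```python
def num_output_filters(num_anchors: int, num_classes: int,
                       num_coords: int = 5) -> int:
    """Filters in the final 1x1 layer: anchors * (coords + classes).

    The 5 "coordinates" are t_x, t_y, t_w, t_h and the objectness t_o.
    """
    return num_anchors * (num_coords + num_classes)


def step_lr(epoch: int, base_lr: float = 1e-3,
            milestones=(60, 90), factor: float = 0.1) -> float:
    """Learning rate divided by 10 at each milestone epoch."""
    lr = base_lr
    for milestone in milestones:
        if epoch >= milestone:
            lr *= factor
    return lr


print(num_output_filters(5, 80))  # COCO: 425
print(num_output_filters(5, 20))  # VOC (20 classes): 125
for epoch in (0, 59, 60, 89, 90, 159):
    print(epoch, step_lr(epoch))
```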

4. Stronger

  This paper proposes a method for jointly training on classification and detection data. The method uses images labeled for detection to learn detection-specific information, such as bounding box coordinate prediction and objectness, as well as how to classify common objects. Images with only class labels are then used to expand the number of categories the model can detect.

  During training, images from the detection and classification datasets are mixed. When the network sees an image labeled for detection, it backpropagates the full YOLOv2 loss. When it sees a classification image, it backpropagates only the loss from the classification-specific part of the architecture.
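  The switching rule can be sketched as below. This is a schematic only: the `LossTerms` container, its three fields, and the unweighted sum are hypothetical simplifications, and the actual YOLOv2 loss has more structure than shown here.

```python
from dataclasses import dataclass


@dataclass
class LossTerms:
    """Per-image loss components (hypothetical decomposition)."""
    coord: float           # bounding box coordinate loss
    objectness: float      # objectness loss
    classification: float  # class prediction loss


def joint_loss(terms: LossTerms, has_boxes: bool) -> float:
    """Loss to backpropagate for one image in a mixed batch.

    Detection images contribute every term; classification-only images
    contribute just the classification term.
    """
    if has_boxes:
        return terms.coord + terms.objectness + terms.classification
    return terms.classification


terms = LossTerms(coord=1.0, objectness=2.0, classification=3.0)
print(joint_loss(terms, has_boxes=True))   # 6.0
print(joint_loss(terms, has_boxes=False))  # 3.0
```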

5. Conclusion

  This paper introduces YOLOv2 and YOLO9000, two real-time object detection systems. YOLOv2 is state-of-the-art and faster than other detection systems across a variety of detection datasets. Furthermore, it can run at a variety of image sizes to provide a smooth trade-off between speed and accuracy.
  YOLO9000 detects more than 9,000 object categories by jointly optimizing detection and classification. WordTree is used to combine data from different sources, and the joint optimization technique trains on ImageNet and COCO simultaneously. YOLO9000 is a strong step towards closing the dataset size gap between detection and classification.
  The WordTree representation of ImageNet provides a richer, more detailed output space for image classification. Dataset combination using hierarchical classification would also be useful in the classification and segmentation domains. Training techniques such as multi-scale training could provide benefits across a variety of vision tasks.
  For future work, the authors hope to use similar techniques for weakly supervised image segmentation. They also plan to improve detection results by using more powerful matching strategies for assigning weak labels to classification data during training. Computer vision has an enormous amount of labeled data, and the authors will continue looking for ways to bring different sources and structures of data together to build stronger models of the visual world.


Origin blog.csdn.net/qq_39707285/article/details/127072675