YOLO
YOLO predicts bounding boxes and class probabilities directly from full images in a single evaluation of a single neural network. Because the entire detection pipeline is one network, detection performance can be optimized end-to-end.
YOLO structure: GoogLeNet-style backbone + 4 convolutional layers + 2 fully connected layers
- 1. Scale the input image to 448 × 448
- 2. Run a single convolutional network on the image
- 3. Threshold the resulting detections by the model's confidence
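The three steps can be sketched as follows. This is a minimal sketch, not a real detector: the hypothetical `run_network` stands in for the trained CNN and just returns a random tensor of the right shape.

```python
import numpy as np

S, B, C = 7, 2, 20  # grid size, boxes per cell, classes (YOLO v1 defaults)

def run_network(image):
    # Stand-in for the trained CNN: real YOLO outputs an S x S x (B*5 + C) tensor.
    rng = np.random.default_rng(0)
    return rng.random((S, S, B * 5 + C))

def detect(image, conf_threshold=0.5):
    # Step 1: scale the input to 448 x 448 (assumed already resized in this sketch).
    resized = image
    # Step 2: a single forward pass over the whole image.
    preds = run_network(resized)
    # Step 3: keep only boxes whose confidence clears the threshold.
    detections = []
    for i in range(S):
        for j in range(S):
            for b in range(B):
                conf = preds[i, j, b * 5 + 4]
                if conf > conf_threshold:
                    detections.append((i, j, b, conf))
    return detections

dets = detect(np.zeros((448, 448, 3)))
```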
cell
The network output is a 7 × 7 grid, i.e. 49 cells; each cell corresponds to a square region of the original image. Each cell does two things:
- Each cell predicts two bounding boxes; each box carries 5 predicted values: x, y, w, h, and confidence
- With two (default) bbox positions and two bbox confidences per cell, the network outputs 7 × 7 × 2 = 98 bboxes in total. Each cell's 30-dim vector is (4 + 1) + (4 + 1) + 20: 4 coordinates and 1 confidence per bbox, plus the predicted probabilities for 20 classes
- A cell predicts two bboxes, but during training only one bbox is made responsible for each object (one object, one bbox)
- The 20 class probabilities are predicted once per cell and shared by both of its bboxes
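The 30-number layout of one cell can be unpacked as follows (a sketch; the slice order (x, y, w, h, conf) × 2 followed by 20 class scores matches the note above):

```python
import numpy as np

B, C = 2, 20  # boxes per cell, number of classes

def decode_cell(cell):
    """Split one 30-dim cell vector into its two boxes and the shared class scores."""
    assert cell.shape == (B * 5 + C,)
    boxes = []
    for b in range(B):
        x, y, w, h, conf = cell[b * 5 : b * 5 + 5]
        boxes.append({"x": x, "y": y, "w": w, "h": h, "conf": conf})
    class_probs = cell[B * 5 :]  # 20 class probabilities, shared by both boxes
    return boxes, class_probs

# Toy cell vector 0..29 so the slicing is easy to check by eye.
cell = np.arange(30, dtype=float)
boxes, probs = decode_cell(cell)
```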
confidence
- If there is no object in the grid cell, the confidence is 0
- If there is an object, the confidence equals the IOU between the predicted box and the ground truth (each cell's two bboxes are compared against the ground truth, and the one with the higher IOU becomes the final responsible bbox)
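The IOU comparison that picks the responsible bbox can be sketched like this (box corners as (x1, y1, x2, y2); `responsible_box` is an illustrative helper name, not from the paper):

```python
def iou(box_a, box_b):
    # Boxes as (x1, y1, x2, y2); IOU = intersection area / union area.
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def responsible_box(pred_boxes, gt_box):
    # The predicted box with the higher IOU against the ground truth is
    # "responsible" for the object; its confidence target is that IOU.
    ious = [iou(p, gt_box) for p in pred_boxes]
    best = max(range(len(pred_boxes)), key=lambda i: ious[i])
    return best, ious[best]

idx, conf_target = responsible_box([(0, 0, 2, 2), (0, 0, 4, 4)], (0, 0, 4, 4))
```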
training loss
- Three-part loss: bbox loss + confidence loss + classification loss
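From the YOLO v1 paper, the three parts are combined as a sum of squared errors, with $\lambda_{\text{coord}} = 5$ up-weighting box coordinates and $\lambda_{\text{noobj}} = 0.5$ down-weighting confidence in cells without objects ($\mathbb{1}_{ij}^{\text{obj}}$ means box $j$ of cell $i$ is responsible for an object):

$$
\begin{aligned}
\mathcal{L} ={} & \lambda_{\text{coord}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}} \left[ (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2 \right] \\
& + \lambda_{\text{coord}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}} \left[ (\sqrt{w_i} - \sqrt{\hat{w}_i})^2 + (\sqrt{h_i} - \sqrt{\hat{h}_i})^2 \right] \\
& + \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}} (C_i - \hat{C}_i)^2
  + \lambda_{\text{noobj}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{noobj}} (C_i - \hat{C}_i)^2 \\
& + \sum_{i=0}^{S^2} \mathbb{1}_{i}^{\text{obj}} \sum_{c \in \text{classes}} \left( p_i(c) - \hat{p}_i(c) \right)^2
\end{aligned}
$$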
YOLO V2
YOLOv2 improves on YOLO in several ways: a new training mechanism, a new backbone network (Darknet-19), k-means cluster analysis on the bounding boxes in the training set to pick anchor priors, and direct location prediction.
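The anchor clustering can be sketched with k-means over box (width, height) pairs, using $d = 1 - \text{IOU}$ as the distance so that large and small boxes are treated fairly. A minimal sketch, not the paper's exact implementation:

```python
import numpy as np

def wh_iou(wh, centroids):
    # IOU between boxes that share a corner: only widths/heights matter.
    inter = np.minimum(wh[0], centroids[:, 0]) * np.minimum(wh[1], centroids[:, 1])
    union = wh[0] * wh[1] + centroids[:, 0] * centroids[:, 1] - inter
    return inter / union

def kmeans_anchors(boxes_wh, k, iters=50, seed=0):
    rng = np.random.default_rng(seed)
    centroids = boxes_wh[rng.choice(len(boxes_wh), k, replace=False)]
    for _ in range(iters):
        # Assign each box to the centroid with the smallest 1 - IOU distance
        # (i.e. the largest IOU), then recompute centroids as cluster means.
        assign = np.array([np.argmax(wh_iou(b, centroids)) for b in boxes_wh])
        for c in range(k):
            members = boxes_wh[assign == c]
            if len(members):
                centroids[c] = members.mean(axis=0)
    return centroids

# Toy dataset: two small boxes and two large boxes -> expect two anchors.
boxes = np.array([[10.0, 10.0], [12.0, 11.0], [50.0, 60.0], [55.0, 58.0]])
anchors = kmeans_anchors(boxes, k=2)
```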
YOLO V3
Improvements: a new backbone network (Darknet-53), and independent logistic regression classifiers instead of softmax, which allows multi-label predictions.
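The classifier change can be illustrated as follows: softmax forces classes to compete (probabilities sum to 1), while independent logistic (sigmoid) outputs let overlapping labels both score high, which is the multi-label case YOLOv3's change addresses.

```python
import numpy as np

def softmax(logits):
    e = np.exp(logits - logits.max())  # subtract max for numerical stability
    return e / e.sum()

def sigmoid(logits):
    return 1.0 / (1.0 + np.exp(-logits))

# Two overlapping labels both strongly present (e.g. "person" and "woman").
logits = np.array([3.0, 3.0, -2.0])

p_soft = softmax(logits)  # mutually exclusive: the two strong classes split the mass
p_sig = sigmoid(logits)   # independent: both strong classes can be near 1
```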
Reference:
- https://zhuanlan.zhihu.com/p/94986199
- YOLO paper