[Object Detection] (6) YOLOv2: improvements built on top of YOLOv1

Hello everyone, today I will share the principles of the YOLOv2 object detection algorithm. I suggest learning about YOLOv1 first; you can read my previous article: https://blog.csdn.net/dgvv4/article/details/123767854

Although YOLOv1 is very fast, it suffers from low precision, poor localization, low recall, and weak detection of small and densely packed objects. YOLOv2 therefore makes the following improvements.


1. Batch Normalization

Batch Normalization subtracts the mean from each neuron's output and divides by the standard deviation, so the outputs follow a distribution with mean 0 and standard deviation 1. Many activation functions, such as sigmoid and tanh, are unsaturated near 0; if the input to the activation function is too large or too small, it falls into the saturated region, the gradient vanishes, and training becomes difficult. Batch Normalization forces the neuron outputs to be concentrated around 0.

Batch Normalization behaves differently in the training phase and the testing phase.

In the training phase, if batch_size = 32, each batch processes 32 images and each image produces one response after passing through a given neuron, so that neuron outputs 32 responses per batch. The mean and standard deviation of these 32 responses are computed and used to normalize them; the normalized responses are then multiplied by \gamma and shifted by \beta, and each neuron learns its own pair of \gamma, \beta. In this way the neuron's output is constrained to a distribution with mean 0 and standard deviation 1, which keeps the output in the unsaturated region, speeds up convergence, and helps prevent overfitting.

In the testing phase, the mean, variance, \gamma and \beta all use values obtained globally during training. For example, the mean used at test time is the expectation of the batch means observed during the training phase.

The formula is as follows, where \epsilon is a very small number that prevents the denominator from being 0:

BN(X_{test}) = \gamma \frac{X_{test}-\mu_{test}}{\sqrt{\sigma_{test}^{2}+\epsilon}} + \beta
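To make the difference between the two phases concrete, here is a minimal NumPy sketch (my own illustration, not the original implementation): during training the statistics of the current batch are used and running averages are accumulated, while at test time the stored global statistics are used.

```python
import numpy as np

def batch_norm(x, gamma, beta, running_mean, running_var,
               training, momentum=0.9, eps=1e-5):
    """Minimal BN sketch: x has shape [batch, features]."""
    if training:
        # Use the statistics of the current batch (e.g. 32 responses per neuron).
        mean = x.mean(axis=0)
        var = x.var(axis=0)
        # Accumulate global statistics for the test phase.
        running_mean = momentum * running_mean + (1 - momentum) * mean
        running_var = momentum * running_var + (1 - momentum) * var
    else:
        # Test phase: use the global statistics gathered during training.
        mean, var = running_mean, running_var
    x_hat = (x - mean) / np.sqrt(var + eps)   # mean 0, standard deviation 1
    return gamma * x_hat + beta, running_mean, running_var
```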


2. High-resolution classifiers

Image classification networks are usually trained on the ImageNet dataset at a small resolution such as 224*224, while YOLOv1's detection network takes 448*448 inputs. If a backbone trained at the small resolution is dropped directly into a high-resolution detection model, the network has to switch from low resolution to high resolution during detection training, which hurts performance. YOLOv2 instead fine-tunes the backbone directly on 448*448 classification images, letting the network adapt to the large resolution first; this improves mAP by about 3.5%.
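As a toy illustration of this idea (my own Keras sketch, not YOLOv2's real backbone): because the little classifier below is fully convolutional with global average pooling, the same weights can be loaded at 224*224 and at 448*448, which is what makes the high-resolution fine-tuning step possible.

```python
import tensorflow as tf
from tensorflow.keras import layers

def tiny_classifier(input_size, num_classes=1000):
    """Toy fully-convolutional classifier (a stand-in, not Darknet-19)."""
    inputs = tf.keras.Input(shape=(input_size, input_size, 3))
    x = layers.Conv2D(32, 3, strides=2, padding="same", activation="relu")(inputs)
    x = layers.Conv2D(64, 3, strides=2, padding="same", activation="relu")(x)
    x = layers.Conv2D(num_classes, 1)(x)
    outputs = layers.GlobalAveragePooling2D()(x)   # works at any input resolution
    return tf.keras.Model(inputs, outputs)

model_224 = tiny_classifier(224)
# ... pretrain model_224 on 224*224 classification images ...
model_448 = tiny_classifier(448)
model_448.set_weights(model_224.get_weights())    # conv weights do not depend on resolution
# ... fine-tune model_448 at 448*448, then reuse this backbone for detection ...
```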


3. A priori box

In YOLOv1, the image is divided into a 7x7 grid and each grid cell generates 2 predicted boxes. The predicted box with the larger IoU with the ground-truth box is responsible for fitting that ground truth, and the predicted box with the smaller IoU is discarded. The two predicted boxes have no constraints on width and height, so both vary freely. In YOLOv2, each predicted box starts from an initial reference box (a prior box), so the prediction only needs to fine-tune an offset around that initial position.

In YOLOv2, the image is divided into a 13*13 grid, and each grid cell has 5 prior boxes with different width and height scales, i.e. 5 different aspect ratios. Each prior box corresponds to one predicted box, and the network only has to output the offset of the predicted box relative to its prior box.

The grid cell into which the center of a manually annotated ground-truth box falls is responsible for that object; among the prior boxes generated in that cell, the one with the largest IoU with the ground-truth box makes the prediction, and the prediction is the offset of the predicted box relative to its own prior box.
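Here is a minimal sketch of this matching rule (NumPy; the anchor sizes are just illustrative values in grid units). Since only widths and heights are compared, the IoU is computed as if the prior box and the ground-truth box shared the same center.

```python
import numpy as np

def wh_iou(anchor_wh, gt_wh):
    """IoU of boxes that share the same center, given only (w, h)."""
    inter = np.minimum(anchor_wh[:, 0], gt_wh[0]) * np.minimum(anchor_wh[:, 1], gt_wh[1])
    union = anchor_wh[:, 0] * anchor_wh[:, 1] + gt_wh[0] * gt_wh[1] - inter
    return inter / union

# 5 prior boxes (w, h) in grid units and one ground-truth box (w, h); values are illustrative.
anchors = np.array([[1.3, 1.7], [3.2, 4.0], [5.1, 8.1], [9.5, 4.8], [11.2, 10.0]])
gt_wh = np.array([4.0, 3.5])

responsible = int(np.argmax(wh_iou(anchors, gt_wh)))
print("prior box responsible for this object:", responsible)
```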


4. Model output results

YOLOv1 uses no prior boxes: the image is divided into a 7*7 grid, each grid cell generates 2 predicted boxes, each predicted box contains 4 position parameters and 1 confidence parameter, and each grid cell also holds the conditional probabilities of 20 classes. So each grid cell has 5+5+20 = 30 parameters, and the output feature map has shape [7,7,30].

In YOLOv2, the image is divided into a 13*13 grid, each grid cell generates 5 prior boxes, and each prior box carries 4 position parameters, 1 confidence parameter, and 20 conditional class probabilities. So each grid cell has 5*(4+1+20) = 125 parameters, and the output feature map has shape [13,13,125].

As shown in the figure below, the model's input image has shape [416, 416, 3]. After a series of convolutions, it outputs a 13*13*125 tensor: each grid cell contains 5 prior boxes, each prior box has 5+20 = 25 parameters, and the cell therefore has (5+20)*5 = 125 parameters in total.
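As a quick sanity check of these numbers, here is a small NumPy sketch (the channel layout I assume, 5 anchors of 25 values each, is for illustration) that reshapes the [13,13,125] output into [13,13,5,25] and slices out the box offsets, the confidence, and the class probabilities.

```python
import numpy as np

num_anchors, num_classes = 5, 20
output = np.random.randn(13, 13, num_anchors * (5 + num_classes))    # [13, 13, 125]

# Split the 125 channels into 5 anchors x (tx, ty, tw, th, confidence, 20 class probs).
output = output.reshape(13, 13, num_anchors, 5 + num_classes)        # [13, 13, 5, 25]
txywh      = output[..., 0:4]   # box offsets
confidence = output[..., 4:5]   # objectness confidence
class_prob = output[..., 5:]    # 20 conditional class probabilities
print(txywh.shape, confidence.shape, class_prob.shape)
```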


5. Prediction box fine-tuning

The model outputs the offsets of the predicted box relative to the prior box: the coordinate offsets (tx, ty) and the width/height offsets (tw, th). Since (tx, ty) could take any value from negative infinity to positive infinity, a sigmoid function \sigma(t_x), \sigma(t_y) is applied to the coordinate offsets to keep them between 0 and 1, so that the center of the predicted box is constrained to stay inside its own grid cell.

As shown in the figure below, (cx, cy) are the (normalized) coordinates of the top-left corner of the grid cell containing the prior box's center, and (pw, ph) are the width and height of the prior box with the largest IoU with the ground-truth box. Because the object may be large and the predicted box must be able to grow, the prior box's width and height are multiplied by the exponentials e^{t_w} and e^{t_h}. The decoded box is therefore b_x = \sigma(t_x) + c_x, b_y = \sigma(t_y) + c_y, b_w = p_w e^{t_w}, b_h = p_h e^{t_h}.
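Here is a minimal decoding sketch in NumPy (my own illustration; everything is in grid units and the sample numbers are made up) that follows the formulas above.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def decode_box(t, cell_xy, anchor_wh):
    """t = (tx, ty, tw, th); cell_xy = top-left corner of the grid cell;
    anchor_wh = (pw, ph). All values are in grid units (0..13)."""
    tx, ty, tw, th = t
    cx, cy = cell_xy
    pw, ph = anchor_wh
    bx = sigmoid(tx) + cx          # center stays inside its own grid cell
    by = sigmoid(ty) + cy
    bw = pw * np.exp(tw)           # prior width scaled by e^tw
    bh = ph * np.exp(th)           # prior height scaled by e^th
    return bx, by, bw, bh

print(decode_box((0.2, -0.4, 0.1, 0.3), cell_xy=(6, 6), anchor_wh=(3.2, 4.0)))
```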


6. Loss function

The YOLOv2 loss traverses all predicted boxes of the 13*13 grid and consists of the following terms:

(1) The first term is the background confidence error. Its indicator checks whether the IoU between the prior box and every ground-truth box is below a threshold: 1 if so, 0 otherwise (the IoU is computed with the prior box's center aligned with the ground-truth box's center). This term pushes the confidence of prior boxes that are not responsible for any object as close to 0 as possible.

(2) The second term is the position error between the predicted box and the prior box. Its indicator checks whether we are within the first 12800 iterations, i.e. the early stage of training: 1 if so, 0 otherwise. It pushes the position parameters (x, y, w, h) of the predicted box to stay as close as possible to those of its prior box, which stabilizes early training.

(3) The third term applies to the prior box with the largest IoU, which is responsible for detecting the object: the indicator is 1 if it is responsible and 0 otherwise. Here the ground-truth box is assumed to be predicted by the prior box with the largest IoU; prior boxes whose IoU exceeds the threshold but is not the maximum have their loss ignored. This term contains a localization error that requires the predicted box to be as close as possible to the ground-truth box, a confidence error that requires the predicted confidence to be consistent with the IoU between the box and the ground-truth box, and a classification error that requires the predicted class to match the ground-truth class.
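The sketch below is my own simplified NumPy version of these three terms, not the exact darknet loss: the 0/1 masks are assumed to be precomputed per prior box exactly as described above, and the loss weights are only illustrative defaults.

```python
import numpy as np

def yolov2_loss(pred_box, pred_conf, pred_cls,           # network outputs, shape [13,13,5,*]
                anchor_box, gt_box, gt_cls, best_iou,     # targets on the same grid layout
                noobj_mask, obj_mask, early_mask,          # 0/1 indicators, shape [13,13,5,1]
                w_noobj=1.0, w_prior=0.01, w_coord=1.0, w_obj=5.0, w_cls=1.0):
    """Simplified sum-of-squared-errors version of the three loss terms."""
    # (1) background confidence: prior boxes whose IoU with every ground truth is below the threshold
    loss_noobj = w_noobj * np.sum(noobj_mask * (0.0 - pred_conf) ** 2)
    # (2) only during the first 12800 iterations: keep predictions close to their prior boxes
    loss_prior = w_prior * np.sum(early_mask * (anchor_box - pred_box) ** 2)
    # (3) responsible prior boxes: localization + confidence (target = IoU) + classification
    loss_coord = w_coord * np.sum(obj_mask * (gt_box - pred_box) ** 2)
    loss_conf  = w_obj   * np.sum(obj_mask * (best_iou - pred_conf) ** 2)
    loss_cls   = w_cls   * np.sum(obj_mask * (gt_cls - pred_cls) ** 2)
    return loss_noobj + loss_prior + loss_coord + loss_conf + loss_cls
```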


7. Fine-grained features

The feature map output by a shallower layer is split into two branches: one branch goes through convolution, the other splits the feature map into four parts and stacks them along the channel dimension; the two branches are then concatenated. The resulting feature map contains both the fine-grained information of the lower layer and the high-level information after convolution. Fusing information at different scales in this way helps with the detection of small objects.

This splitting method is equivalent to the Focus operation in YOLOv5. As shown in the figure below, one feature map becomes four: the height and width of each are halved, and the number of channels becomes four times the original.
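Here is a minimal NumPy sketch of this split-and-stack (space-to-depth) step; the shapes and channel counts below are only illustrative.

```python
import numpy as np

def reorg(x):
    """Space-to-depth with stride 2: [H, W, C] -> [H/2, W/2, 4C]."""
    top_left     = x[0::2, 0::2, :]
    top_right    = x[0::2, 1::2, :]
    bottom_left  = x[1::2, 0::2, :]
    bottom_right = x[1::2, 1::2, :]
    return np.concatenate([top_left, top_right, bottom_left, bottom_right], axis=-1)

shallow = np.random.randn(26, 26, 64)            # fine-grained branch (illustrative channels)
passthrough = reorg(shallow)                     # [13, 13, 256]
deep = np.random.randn(13, 13, 1024)             # output of the convolutional branch
fused = np.concatenate([passthrough, deep], axis=-1)   # both fine-grained and high-level features
print(passthrough.shape, fused.shape)
```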
