Classic Object Detection: R-CNN Series (2) Fast R-CNN

  • Fast R-CNN is Ross Girshick's follow-up masterpiece to R-CNN.

  • Also using VGG16 as the backbone, Fast R-CNN trains 9× faster than R-CNN, runs test-time inference 213× faster, and raises mAP from 62% to 66% on the Pascal VOC dataset.

1 Fast R-CNN forward process

The Fast R-CNN algorithm can be divided into three steps:

  • Generate 1K~2K candidate regions per image (using the Selective Search method).

  • Feed the whole image into the network to obtain the corresponding feature map, then project the candidate boxes generated by the Selective Search (SS) algorithm onto the feature map to obtain the corresponding feature matrices.

  • Scale each feature matrix to a 7×7 feature map through the RoI pooling layer, then flatten it and pass it through a series of fully connected layers to obtain the prediction results.
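The three steps above can be sketched as a short pipeline. This is illustrative scaffolding only: `backbone`, `roi_pool`, and `head` are hypothetical stand-ins, not the real VGG16 network.

```python
import numpy as np

def fast_rcnn_forward(image, proposals, backbone, roi_pool, head):
    """Illustrative Fast R-CNN forward pass over one image."""
    feature_map = backbone(image)                          # one CNN pass over the whole image
    feats = [roi_pool(feature_map, p) for p in proposals]  # each RoI -> 7x7 feature
    return [head(f.flatten()) for f in feats]              # FC head -> (class scores, box deltas)

# Toy stand-ins so the sketch runs end to end:
backbone = lambda img: img[::16, ::16]                     # pretend stride-16 feature map
roi_pool = lambda fm, p: np.zeros((7, 7))                  # pretend 7x7 RoI pooling
head     = lambda v: (np.zeros(21), np.zeros((21, 4)))     # pretend class/box outputs
outputs = fast_rcnn_forward(np.zeros((224, 224)), [(0, 0, 100, 100)] * 3,
                            backbone, roi_pool, head)
```

The key point encoded here is that `backbone` runs once per image, while only the cheap `roi_pool` and `head` run once per proposal.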


Apart from region proposal generation, which remains a relatively independent step, Fast R-CNN implements the other four main stages in a single integrated network: convolutional feature extraction, RoI feature extraction, category prediction, and location prediction. The final stage of detection also includes post-processing, which computes absolute bounding-box positions, binds categories to positions, and removes redundant bounding boxes with NMS.


1.1 Convolutional feature extraction

  • R-CNN feeds the candidate regions into the convolutional neural network one at a time to obtain their features.

  • Fast R-CNN feeds the entire image through the network once and then extracts each candidate region's features from the resulting feature map, so these features never need to be recomputed.


  • Fast R-CNN does not restrict the size of the input image.

  • The entire original image is fed into the CNN in a fully convolutional manner, and the output of a chosen convolutional layer is taken as the feature map, yielding the convolutional features.

    • For example, with a VGG16 backbone, the output of convolutional layer conv5_3 is used as the feature map: it has 512 channels and a downsampling stride of 16 (conv1 through conv5 are used only for convolutional feature extraction).
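Under that stride-16 assumption, projecting a proposal onto the feature map is just a coordinate division. A minimal sketch (the rounding scheme here is one common choice, not the only one):

```python
import numpy as np

def project_proposal(box, stride=16):
    """Map an (x1, y1, x2, y2) proposal given in image coordinates onto a
    stride-16 feature map (e.g. VGG16 conv5_3): floor the top-left corner,
    ceil the bottom-right, so the projected RoI covers the whole proposal."""
    x1, y1, x2, y2 = box
    return (int(np.floor(x1 / stride)), int(np.floor(y1 / stride)),
            int(np.ceil(x2 / stride)), int(np.ceil(y2 / stride)))
```

For instance, a 128×256 proposal at image coordinates (32, 64, 160, 320) lands on feature-map cells (2, 4, 10, 20).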


1.2 RoI feature extraction

  • The RoI pooling layer takes the convolutional feature map and the correspondingly scaled region proposals (called RoIs in Fast R-CNN) as input. Each RoI projected onto the feature map is divided into a W×H grid, where W and H are hyperparameters of the RoI pooling layer that set the width and height of the output feature map. Max pooling is then applied within each grid cell, channel by channel; each channel operates independently.

  • Unlike SPP-Net, the RoI pooling layer performs its grid division at a single fixed scale; that is, RoI pooling can be regarded as a single-scale version of the SPP layer in SPP-Net.

  • The feature map produced by RoI pooling is then fed into several fully connected layers for further feature transformation.
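The grid-and-max operation described above can be written out for a single channel in NumPy, assuming the RoI is already in feature-map coordinates:

```python
import numpy as np

def roi_max_pool(feature, roi, H=7, W=7):
    """RoI max pooling over one channel: split the RoI on the feature map
    into an H x W grid of (possibly uneven) cells and take the max in each."""
    x1, y1, x2, y2 = roi                       # RoI in feature-map coordinates
    region = feature[y1:y2, x1:x2]
    h, w = region.shape
    ys = np.linspace(0, h, H + 1).astype(int)  # cell boundaries along height
    xs = np.linspace(0, w, W + 1).astype(int)  # cell boundaries along width
    out = np.empty((H, W))
    for i in range(H):
        for j in range(W):
            cell = region[ys[i]:ys[i + 1], xs[j]:xs[j + 1]]
            out[i, j] = cell.max() if cell.size else 0.0
    return out
```

Because the cell boundaries adapt to the RoI's size, any input RoI produces the same fixed 7×7 output, which is what lets the fully connected layers that follow accept arbitrary image sizes.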


1.3 Category prediction and location prediction

1.3.1 Category prediction

The RoI features obtained are split into two parallel branches, one for category prediction and one for location prediction.

In the category-prediction branch, the RoI feature is fed into a fully connected layer whose output dimension is the number of categories C, followed by a softmax classifier, to obtain the predicted class distribution and thereby determine the category.


1.3.2 Location prediction

  • In the position-prediction branch, the RoI features are fed into a bounding-box regressor implemented as a fully connected layer.

  • The output of the fully connected layer has dimension C×4, where C is the number of detection categories; each group of 4 values is the bounding-box transformation parameters (d_x, d_y, d_w, d_h) for one class.

  • It follows that Fast R-CNN's bounding-box predictions are also category-specific.
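The two branches can be sketched as two fully connected layers reading the same RoI feature vector. Weights here are random placeholders; D = 4096 is assumed to match the VGG16 fully connected feature size, and C = 21 assumes Pascal VOC's 20 classes plus background.

```python
import numpy as np

rng = np.random.default_rng(0)
D, C = 4096, 21                                  # RoI feature size, 20 classes + background

W_cls = rng.normal(scale=0.01, size=(C, D));     b_cls = np.zeros(C)
W_box = rng.normal(scale=0.01, size=(C * 4, D)); b_box = np.zeros(C * 4)

def heads(roi_feat):
    """Parallel branches: softmax class distribution and per-class box deltas."""
    logits = W_cls @ roi_feat + b_cls
    probs = np.exp(logits - logits.max())        # numerically stable softmax
    probs /= probs.sum()
    deltas = (W_box @ roi_feat + b_box).reshape(C, 4)  # (dx, dy, dw, dh) per class
    return probs, deltas

probs, deltas = heads(rng.normal(size=D))
```

At test time, the predicted class index selects which of the C rows of `deltas` is applied to the RoI, which is exactly the category-position binding done in post-processing.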


1.3.3 Post-processing

This step computes absolute bounding-box positions, binds each category to its position, and removes redundant bounding boxes with NMS.
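A standard greedy NMS routine of the kind used in this post-processing step might look like:

```python
import numpy as np

def nms(boxes, scores, iou_thresh=0.3):
    """Greedy NMS: keep the highest-scoring box, drop every remaining box
    whose IoU with it exceeds the threshold, then repeat."""
    x1, y1, x2, y2 = boxes.T
    areas = (x2 - x1) * (y2 - y1)
    order = scores.argsort()[::-1]               # indices, highest score first
    keep = []
    while order.size:
        i = order[0]
        keep.append(i)
        # Intersection of box i with all lower-scoring boxes:
        xx1 = np.maximum(x1[i], x1[order[1:]])
        yy1 = np.maximum(y1[i], y1[order[1:]])
        xx2 = np.minimum(x2[i], x2[order[1:]])
        yy2 = np.minimum(y2[i], y2[order[1:]])
        inter = np.maximum(0, xx2 - xx1) * np.maximum(0, yy2 - yy1)
        iou = inter / (areas[i] + areas[order[1:]] - inter)
        order = order[1:][iou <= iou_thresh]     # suppress heavy overlaps
    return keep
```

For example, two heavily overlapping boxes collapse to the higher-scoring one, while a distant box survives.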

1.3.4 Singular value decomposition

  • In Fast R-CNN, both category prediction and position prediction are implemented with fully connected layers, and a fully connected operation is essentially a linear transformation of a vector.

  • For example, with 2000 RoIs per image (matching the number of Selective Search proposals in R-CNN), the fully connected layer of the position-prediction branch alone requires over 680 million multiplications (2000 × 84 × 4096 = 688,128,000). Computation on this scale carries a large time cost and seriously limits detection speed.

  • Therefore, to improve detection speed, Fast R-CNN compresses these fully connected layers with truncated singular value decomposition (SVD). In practice, this yields roughly a 30% speedup at the cost of about 0.3% mAP.
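The idea can be illustrated with NumPy: factor the FC weight matrix as W ≈ U_t Σ_t V_tᵀ, keeping only t singular values, so one large matrix-vector multiply becomes two thin ones. The shapes below are illustrative (the 84×4096 position-branch layer from the example above); truncating a random matrix this aggressively would lose accuracy, so the point here is only the cost comparison.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(84, 4096))     # position-branch FC weights: 84 outputs, 4096 inputs
x = rng.normal(size=4096)           # one RoI feature vector

U, S, Vt = np.linalg.svd(W, full_matrices=False)
t = 32                              # number of singular values kept (illustrative)
W1 = U[:, :t] * S[:t]               # 84 x t
W2 = Vt[:t]                         # t x 4096

y_full   = W @ x                    # 84 * 4096 = 344,064 multiplies per RoI
y_approx = W1 @ (W2 @ x)            # 84*t + t*4096 = 133,760 multiplies per RoI
```

The single FC layer is effectively replaced by two smaller FC layers with no nonlinearity in between, cutting the per-RoI multiply count by more than half in this configuration.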

2 Loss function of Fast R-CNN

2.1 Classification loss


The Fast R-CNN category prediction branch predicts a probability distribution p = (p_0, p_1, …, p_{C−1}) over the C categories for each RoI.

In models based on neural networks, the probability distribution is generally obtained by a fully connected layer with C outputs and a softmax function.

Assuming the ground-truth class associated with the RoI is u (u = 0, 1, …, C−1), the classification loss is the cross-entropy error:

L_cls(p, u) = −log p_u
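Numerically, the loss simply reads off the probability the model assigned to the ground-truth class:

```python
import numpy as np

def cls_loss(p, u):
    """Cross-entropy classification loss: -log of the probability
    assigned to the ground-truth class u."""
    return -np.log(p[u])
```

For example, with p = (0.1, 0.7, 0.2) and u = 1 the loss is −log 0.7 ≈ 0.357; a more confident correct prediction drives the loss toward 0.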

2.2 Bounding box regression loss


Compared with the L2 loss used by R-CNN and SPP-Net, the smooth L1 loss penalizes large position deviations (outliers) more gently: it grows linearly rather than quadratically for large errors, which prevents the exploding gradients that overly large gradients would cause.
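The smooth L1 function itself is short to write down, in its elementwise form with the standard threshold of 1:

```python
import numpy as np

def smooth_l1(x):
    """Smooth L1 loss: 0.5*x^2 for |x| < 1 (quadratic near zero),
    |x| - 0.5 otherwise (linear, so gradients stay bounded at +/-1)."""
    ax = np.abs(x)
    return np.where(ax < 1.0, 0.5 * x * x, ax - 0.5)
```

At |x| = 1 the two pieces meet with matching value (0.5) and slope (1), so the loss is smooth, unlike plain L1.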

2.3 Advantages and Disadvantages of Fast R-CNN

Advantages

  • Fast R-CNN borrows ideas from SPP-Net and retains its useful support for inputs of arbitrary size.

  • Category prediction and location prediction are produced as parallel outputs of the model, and the corresponding training stages are likewise completed jointly in multi-task fashion.

  • Apart from region proposal generation, most of the pipeline is end-to-end, and both training and testing speed are greatly improved.

  • Fast R-CNN also reaches a very high level of object detection accuracy.

Disadvantages

  • In terms of speed, Fast R-CNN takes only about 0.3 seconds per image from convolutional feature extraction to the final result, but region proposal generation with Selective Search takes 2 to 3 seconds on its own, making this single step the bottleneck of Fast R-CNN's overall speed.

  • In terms of process, region proposal generation remains independent of the CNN in both the training and testing phases, so the pipeline is still not fully end-to-end.


Source: blog.csdn.net/qq_44665283/article/details/131776593