CornerNet background
Current state-of-the-art one-stage and two-stage object detectors are anchor-based, which has two drawbacks: (1) they need a very large number of anchor boxes, which creates an imbalance between positive and negative samples and slows down training; (2) anchors introduce many hyperparameters and design choices (how many anchors to use, and their sizes and aspect ratios).
Differences between one-stage and two-stage detectors:
| | One-stage | Two-stage |
|---|---|---|
| Main algorithms | YOLOv1, YOLOv2, YOLOv3, SSD, RetinaNet | Fast R-CNN, Faster R-CNN, DeNet |
| Detection accuracy | Lower | Higher |
| Detection speed | Faster | Slower |
CornerNet main ideas
The CornerNet paper makes two main contributions.
- Object detection is treated as a keypoint detection problem: the predicted box is obtained from two keypoints, the top-left and bottom-right corners of the object's bounding box, so CornerNet has no notion of anchors.
- The entire detection network is trained from scratch rather than from a pre-trained classification model, so the feature extraction network can be designed freely without being constrained by a pre-trained model.
Most earlier object detectors are anchor-based, for example Faster R-CNN, SSD and YOLO (v2, v3). Introducing anchors noticeably improved detection, but it also brought some disadvantages:
- Anchors cause excessive computation. Many algorithms use thousands of anchors, while an image contains far fewer objects to detect, so most of the computation is redundant. Techniques such as focal loss and subsampling of negative samples are used to mitigate this problem.
- Anchors introduce extra hyperparameters, such as the number of anchors and their sizes and aspect ratios. CornerNet uses no anchors, so it achieves good results without these additional design choices.
CornerNet network structure
First, a 7x7 convolution with stride 2, followed by a stride-2 residual block, reduces the input image to 1/4 of its original size (in the paper the input is 511x511 and the output is 128x128). The extracted features then enter the backbone, an hourglass network made up of two hourglass modules. Each module first shrinks its input through a series of downsampling operations and then upsamples it back to the original resolution. The whole hourglass network is 104 layers deep.
The hourglass network is followed by two output branches, one predicting top-left corners and one predicting bottom-right corners. Each branch contains a corner pooling module and produces three outputs: heatmaps, embeddings and offsets.
Heatmaps: the predicted corner locations. The output is a feature map of size H x W with C channels, where C is the number of object categories (note: there is no background channel). Each channel is a mask for one category, with values between 0 and 1.
Embeddings: used to group the predicted corners, that is, to decide whether a detected top-left corner and a detected bottom-right corner belong to the same object.
Offsets: used to fine-tune the predicted corner positions, because mapping a point from the image to the feature map introduces a quantization error.
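
To make the role of the embeddings concrete, here is a minimal sketch of how corners of one category could be paired. It assumes one-dimensional embeddings and a distance threshold of 0.5; the function name `pair_corners`, the tuple layout and the score averaging are illustrative choices, not the reference implementation.

```python
def pair_corners(top_left, bottom_right, max_dist=0.5):
    """Pair corners of a single category into candidate boxes.

    top_left, bottom_right: lists of (x, y, score, embedding) tuples.
    The tuple layout and max_dist are assumptions made for this sketch.
    """
    boxes = []
    for x1, y1, s1, e1 in top_left:
        for x2, y2, s2, e2 in bottom_right:
            # The bottom-right corner must lie below and to the right.
            if x2 <= x1 or y2 <= y1:
                continue
            # Corners of the same object should have similar embeddings.
            if abs(e1 - e2) > max_dist:
                continue
            # Score the box by the average of the two corner scores.
            boxes.append((x1, y1, x2, y2, (s1 + s2) / 2))
    return boxes


tl = [(10, 10, 0.9, 0.1), (50, 40, 0.8, 2.0)]
br = [(30, 35, 0.7, 0.2), (90, 80, 0.6, 2.1)]
boxes = pair_corners(tl, br)
```

In this toy input the two objects have clearly different embeddings (around 0.1 and around 2.0), so each top-left corner is matched only with the bottom-right corner of its own object.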
The most novel component of CornerNet is a new kind of pooling layer used in the network: corner pooling.
Corner Pooling:
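Corner pooling is calculated as follows: for a top-left corner, each location takes the maximum of all feature values to its right and the maximum of all feature values below it, and the two maxima are added together. The bottom-right corner is handled symmetrically, looking to the left and upward.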
Why not use ordinary max pooling or average pooling in CornerNet?
The reason lies in what CornerNet has to detect: the top-left and bottom-right corners of an object. A corner of a bounding box often lies outside the object itself, so an ordinary pooling layer, which only looks at a small local window, cannot tell whether a location is a corner. For a top-left corner, the information that matters lies to its right and below it; for a bottom-right corner, it lies to its left and above it. Corner pooling was introduced to collect exactly this information.
Corner pooling is computed by directional scans (left pooling scans from right to left, top pooling from bottom to top, right pooling from left to right, and bottom pooling from top to bottom).
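
The scans described above can be sketched with NumPy running maxima. This is a minimal illustration on a single 2-D feature map, not the paper's implementation: the function names are chosen for this example, and the same map is fed to both pooling directions for simplicity.

```python
import numpy as np


def top_pool(x):
    # Scan each column from bottom to top: out[i, j] = max(x[i:, j]).
    return np.maximum.accumulate(x[::-1, :], axis=0)[::-1, :]


def left_pool(x):
    # Scan each row from right to left: out[i, j] = max(x[i, j:]).
    return np.maximum.accumulate(x[:, ::-1], axis=1)[:, ::-1]


def top_left_corner_pool(f_top, f_left):
    # Top-left corner pooling adds the two directional maxima.
    return top_pool(f_top) + left_pool(f_left)


x = np.array([[1, 0, 2],
              [0, 3, 0],
              [4, 0, 0]])
pooled = top_left_corner_pool(x, x)
```

At position (0, 0) the result combines the largest value in the first column and the largest value in the first row, which is how a top-left corner can respond even when the object itself is far from that pixel. Bottom and right pooling follow the same pattern without the reversals.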
The CornerNet loss functions
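$L_{det}$ is the heatmap loss. Its concrete form is as follows:
$$L_{det} = \frac{-1}{N} \sum_{c=1}^{C} \sum_{i=1}^{H} \sum_{j=1}^{W} \begin{cases} (1 - p_{cij})^{\alpha} \log(p_{cij}) & \text{if } y_{cij} = 1 \\ (1 - y_{cij})^{\beta} (p_{cij})^{\alpha} \log(1 - p_{cij}) & \text{otherwise} \end{cases}$$
In this equation $N$ is the number of objects in the image, $p_{cij}$ is the predicted heatmap value at position $(i, j)$ of channel $c$, and $y_{cij}$ is the ground-truth heatmap value, softened with an unnormalized Gaussian $e^{-\frac{x^2 + y^2}{2\sigma^2}}$, where $x$ and $y$ give the position of $(i, j)$ relative to the ground-truth corner, which is taken as the origin. $\alpha$ and $\beta$ are hyperparameters that control how much each location contributes.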
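
The heatmap loss can be sketched in NumPy as follows. This assumes the focal-loss variant from the CornerNet paper with $\alpha = 2$ and $\beta = 4$, and it approximates $N$ by the number of positive locations; the function names, `sigma` and the toy 5x5 map are illustrative assumptions.

```python
import numpy as np


def gaussian_target(h, w, cy, cx, sigma):
    # Ground-truth heatmap for one corner at (cy, cx): an unnormalized
    # Gaussian that equals 1 exactly at the corner location.
    ys, xs = np.ogrid[:h, :w]
    return np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * sigma ** 2))


def det_loss(p, y, alpha=2.0, beta=4.0, eps=1e-12):
    # p: predicted heatmap in (0, 1); y: Gaussian-softened ground truth.
    p = np.clip(p, eps, 1 - eps)
    pos = y == 1
    n = max(int(pos.sum()), 1)  # stand-in for the number of objects
    pos_term = ((1 - p) ** alpha * np.log(p))[pos].sum()
    # (1 - y)^beta reduces the penalty for negatives near a true corner.
    neg_term = ((1 - y) ** beta * p ** alpha * np.log(1 - p))[~pos].sum()
    return -(pos_term + neg_term) / n


y = gaussian_target(5, 5, cy=2, cx=2, sigma=1.0)
good = np.where(y == 1, 0.99, 0.01)   # confident, correct prediction
bad = np.full((5, 5), 0.5)            # uninformative prediction
loss_good, loss_bad = det_loss(good, y), det_loss(bad, y)
```

A prediction that is confident at the true corner and near zero elsewhere gives a much smaller loss than a flat prediction of 0.5 everywhere.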
$L_{pull}$ has the following form:
$$L_{pull} = \frac{1}{N} \sum_{k=1}^{N} \left[ (e_{tk} - e_k)^2 + (e_{bk} - e_k)^2 \right]$$
In this formula $e_{tk}$ is the embedding of the top-left corner of object $k$, $e_{bk}$ is the embedding of the bottom-right corner of object $k$, and $e_k$ is the average of $e_{tk}$ and $e_{bk}$.
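$L_{push}$ has the following form:
$$L_{push} = \frac{1}{N(N-1)} \sum_{k=1}^{N} \sum_{j=1, j \neq k}^{N} \max(0, \Delta - |e_k - e_j|)$$
The purpose of $L_{push}$ is to increase the distance between the embeddings of corners that do not belong to the same object; $\Delta$ is the margin, and $e_k$ and $e_j$ are the average embeddings of objects $k$ and $j$.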
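
A small NumPy sketch of the two embedding losses, assuming one-dimensional embeddings, the squared-distance pull term and the margin-based push term from the CornerNet paper with margin $\Delta = 1$. The function name and the input layout (one embedding per object for each corner type) are assumptions made for this example.

```python
import numpy as np


def pull_push_loss(e_t, e_b, delta=1.0):
    # e_t[k], e_b[k]: embeddings of the top-left and bottom-right
    # corners of object k.
    e_t = np.asarray(e_t, dtype=float)
    e_b = np.asarray(e_b, dtype=float)
    n = len(e_t)
    e_k = (e_t + e_b) / 2  # average embedding per object
    # Pull: bring both corners of an object toward their average.
    pull = ((e_t - e_k) ** 2 + (e_b - e_k) ** 2).sum() / n
    if n < 2:
        return pull, 0.0
    # Push: penalize pairs of objects whose averages are closer than delta.
    dist = np.abs(e_k[:, None] - e_k[None, :])
    margin = np.maximum(0.0, delta - dist)
    off_diag = ~np.eye(n, dtype=bool)  # exclude the j == k terms
    push = margin[off_diag].sum() / (n * (n - 1))
    return pull, push


# Well separated objects with matching corners: both losses vanish.
pull_a, push_a = pull_push_loss([0.0, 2.0], [0.0, 2.0])
# Mismatched corners and identical averages: both losses are large.
pull_b, push_b = pull_push_loss([0.0, 0.5], [1.0, 0.5])
```

The two toy cases show the intended behaviour: the losses are zero when each object's corners agree and different objects are at least $\Delta$ apart, and they grow when corners of one object disagree or different objects collapse to the same embedding.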
The other output of CornerNet is the offsets. They resemble the offsets predicted by anchor-based detectors, yet they mean something quite different. They are similar in that both encode a displacement. They differ in that an anchor-based detector predicts the offset between the predicted box and an anchor, whereas here the offset is the precision lost when coordinates are rounded to integers, as expressed by Equation 2.
The feature map is smaller than the input image. If the downsampling factor is $n$, a point $(x, y)$ in the input image is mapped to the following location on the feature map:
$$\left( \left\lfloor \frac{x}{n} \right\rfloor, \left\lfloor \frac{y}{n} \right\rfloor \right)$$
The symbol $\lfloor \cdot \rfloor$ denotes rounding down. Rounding loses precision, which particularly hurts the regression of small objects; ROI Pooling in Faster R-CNN has a similar precision loss problem. The offset of corner $k$ is therefore computed by Equation 2:
$$o_k = \left( \frac{x_k}{n} - \left\lfloor \frac{x_k}{n} \right\rfloor, \frac{y_k}{n} - \left\lfloor \frac{y_k}{n} \right\rfloor \right)$$
It is then learned under the supervision of the smooth L1 loss in Equation 3, in the same way as the regression branch of common object detectors:
$$L_{off} = \frac{1}{N} \sum_{k=1}^{N} \text{SmoothL1Loss}(o_k, \hat{o}_k)$$
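
As a worked example of the rounding error, the sketch below maps an image point to the feature map and computes the fractional part that the offset branch has to learn. The downsampling factor 4, the sample point and the function names are assumptions for illustration; `smooth_l1` is the standard smooth L1 loss with a threshold of 1.

```python
import math


def corner_offset(x, y, n):
    # Map an image point to the feature map and return both the integer
    # cell and the fractional part lost by rounding down.
    fx, fy = x / n, y / n
    col, row = math.floor(fx), math.floor(fy)
    return (col, row), (fx - col, fy - row)


def smooth_l1(pred, target):
    # Quadratic for small errors, linear for large ones.
    d = abs(pred - target)
    return 0.5 * d * d if d < 1.0 else d - 0.5


cell, offset = corner_offset(103, 58, n=4)
# Adding the offset back before rescaling recovers the original x.
recovered_x = (cell[0] + offset[0]) * 4
```

Without the offset, the corner at x = 103 would be decoded as 25 * 4 = 100, a 3-pixel error that matters a great deal for small boxes; with the offset of 0.75 the position is recovered exactly.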
Experimental results
Table 1 is an ablation study on corner pooling. Adding corner pooling (second row) clearly improves the results, and the gain is especially pronounced for large objects.
Table 2 compares different ways of weighting the loss for negative samples depending on their location. The first row shows the result without this strategy; the second row uses a fixed radius and shows a clear improvement; the third row computes the radius from the object (the approach adopted in the paper) and improves the result further.
Summary
This article introduced CornerNet, a new object detection method that detects bounding boxes as pairs of corners. CornerNet was evaluated on MS COCO and achieves competitive results. However, because boxes are assembled by pairing corners, accuracy can suffer in some cases. For example, two objects that are close together may be mistaken for a single object and enclosed in one box.
To address this problem, CenterNet was proposed, which reduces such errors to some extent.