YOLO_v1

Object detection algorithms can be divided into two categories:

  • One is the R-CNN family based on region proposals (R-CNN, Fast R-CNN, Faster R-CNN), which are two-stage methods. They first use a heuristic method (selective search) or a CNN (RPN) to generate region proposals, and then perform classification and regression on those proposals. High accuracy, but low speed.
  • The other is the one-stage family such as YOLO and SSD, which uses a single CNN to directly predict the categories and locations of different objects. Lower accuracy, but high speed.

The idea behind sliding-window object detection is very simple: it turns detection into classification. Windows of different sizes and aspect ratios are slid across the entire image with a certain stride, and the image region under each window is classified, so the whole image gets examined. The drawback is that the size of the target is not known in advance, so windows of many sizes and aspect ratios must be slid with a suitably small stride. This produces a huge number of sub-regions, each of which must be run through the classifier, so the computation is enormous and the classifier cannot be too complex if speed is to be maintained. One way out is to reduce the number of sub-regions to classify, which is the improvement R-CNN makes: it uses selective search to find the sub-regions most likely to contain a target (region proposals). This can be seen as a heuristic way of filtering out many sub-regions, which improves efficiency.
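To make the procedure concrete, here is a minimal Python sketch of sliding-window detection (illustrative only; the window sizes, stride, and the `classify_patch` classifier are hypothetical placeholders, and `image` is assumed to be a NumPy-style array):

```python
# Minimal sliding-window detection sketch. `classify_patch` is a hypothetical
# classifier returning a (label, score) pair for a cropped patch.

def sliding_window_detect(image, window_sizes, stride, classify_patch, score_thresh=0.5):
    height, width = image.shape[:2]
    detections = []
    for win_w, win_h in window_sizes:
        # Slide a window of this size over the image with the given stride.
        for top in range(0, height - win_h + 1, stride):
            for left in range(0, width - win_w + 1, stride):
                patch = image[top:top + win_h, left:left + win_w]
                label, score = classify_patch(patch)
                if score >= score_thresh:
                    detections.append((left, top, win_w, win_h, label, score))
    return detections
```

Every window size adds another full pass over the image, which is exactly the cost problem described above.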

If a CNN classifier is used, sliding windows are very time consuming. One remedy is to make the network fully convolutional, replacing the fully connected layers with convolutional layers. For example, take a classifier built for 14x14 inputs and feed it a 16x16 image: the output becomes a 2x2 map in which each element corresponds to one 14x14 window taken with stride 2, and the output channels can be read as the predicted class probabilities for that window. In this way a single CNN forward pass computes the class predictions for all sliding-window sub-regions, which is the idea behind the OverFeat algorithm. It works because convolution preserves the spatial layout of the image, so output positions keep their correspondence to input locations. The same idea was also borrowed by R-CNN, giving birth to Fast R-CNN.
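As a rough illustration of turning a fully connected head into a convolution, here is a PyTorch sketch (not code from the original post; the layer sizes are chosen only to reproduce the 14x14-window, stride-2 example above):

```python
import torch
import torch.nn as nn

num_classes = 4

# Tiny classifier for 14x14 single-channel inputs. The final "fully connected"
# layer is expressed as a convolution whose kernel covers the whole 5x5 feature
# map, so the same weights can slide over a larger image in one pass.
backbone = nn.Sequential(
    nn.Conv2d(1, 16, kernel_size=5),                    # 14x14 -> 10x10
    nn.MaxPool2d(2),                                    # 10x10 -> 5x5
)
fc_as_conv = nn.Conv2d(16, num_classes, kernel_size=5)  # 5x5 -> 1x1

small = torch.randn(1, 1, 14, 14)
print(fc_as_conv(backbone(small)).shape)   # [1, 4, 1, 1]: one class prediction

large = torch.randn(1, 1, 16, 16)
print(fc_as_conv(backbone(large)).shape)   # [1, 4, 2, 2]: predictions for four
                                           # 14x14 windows taken with stride 2
```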

While this reduces the computation of sliding windows, it is still tied to a fixed window size and stride, which is not enough.

The YOLO algorithm solves this problem nicely: instead of sliding windows, it divides the original image directly into non-overlapping small squares and then uses convolutions to produce a feature map of exactly that size, so that each element of the feature map corresponds to one small square of the original image. Each element is then used to predict the objects whose center points fall inside its square.

Design

The image is resized to 448x448 and fed into the CNN; the network's output is then processed to obtain the predicted targets.

Specifically, YOLO's CNN divides the input image into an SxS grid, and each cell detects the targets whose center points fall inside that cell. In the figure below, the center point of the dog falls inside a cell in the lower-left corner, so that cell is responsible for predicting the dog.
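For instance, a tiny sketch (illustrative, not from the post) of finding the cell that is responsible for an object whose center is given in pixel coordinates:

```python
def responsible_cell(x_center, y_center, image_w, image_h, S=7):
    """Return the (row, col) of the grid cell containing the object center."""
    col = int(x_center / image_w * S)
    row = int(y_center / image_h * S)
    # Clamp in case the center lies exactly on the right or bottom image border.
    return min(row, S - 1), min(col, S - 1)

# Example: in a 448x448 image, an object centered at (100, 380) falls in cell (5, 1).
print(responsible_cell(100, 380, 448, 448))
```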

Each cell predicts B bounding boxes and a confidence score for each of them. The confidence reflects both how likely the box is to contain an object and how accurate the predicted box is. It is computed as

confidence = Pr(Object) * IOU(pred, truth)

where:

Pr(Object): the probability that the bounding box contains an object; it is 1 when an object is present and 0 otherwise;

IOU(pred, truth): the IoU between the predicted bounding box and the ground-truth box.

Thus each bounding box prediction consists of five values: (x, y, w, h) plus the confidence. Each cell additionally predicts C class probabilities, so the SxS grid predicts SxS x B bounding boxes and SxS sets of C class predictions. The class target is one-hot encoded: if the dog's center falls in a cell, the dog class gets probability 1 and all other classes get probability 0, like the code 1 0 0 0 0 0 0 0 0 0 0 0 ...

The output is therefore an S x S x (5*B + C) tensor.

Note: the class information is per grid cell, while the confidence is per bounding box.

The C class probabilities of each cell are predicted in this way: whichever cell a target's center falls into is responsible for predicting that target's class, so the class is predicted only once per target.

For example, on PASCAL VOC the input image is 448x448, S = 7, B = 2, and there are 20 categories (C = 20), so the output is a 7x7x30 tensor.
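A quick check of that arithmetic (a trivial Python sketch):

```python
# Output tensor size is S x S x (5*B + C); with the PASCAL VOC settings this is 7x7x30.
S, B, C = 7, 2, 20
depth = 5 * B + C          # each box contributes x, y, w, h, confidence -> 5 values
print((S, S, depth))       # (7, 7, 30)
```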

At test time, the class probability predicted by each grid cell is multiplied by the confidence of each predicted bounding box, giving a class-specific confidence score for each box:

Pr(Class_i | Object) * Pr(Object) * IOU(pred, truth) = Pr(Class_i) * IOU(pred, truth)

That is, the class probability predicted by the grid cell times the box confidence gives the probability that the box contains an object of a certain class, and it also encodes how well the box fits the object.

After the class-specific confidence score of each box is obtained, a threshold is applied to filter out low-scoring boxes, and NMS is run on the remaining boxes to produce the final detection result.
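A simplified sketch of this test-time filtering (assumptions: boxes are NumPy arrays in (x1, y1, x2, y2) corner format, and the thresholds are arbitrary example values):

```python
import numpy as np

def iou(box, boxes):
    """IoU between one box and an array of boxes, all as (x1, y1, x2, y2)."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = lambda b: (b[..., 2] - b[..., 0]) * (b[..., 3] - b[..., 1])
    return inter / (area(box) + area(boxes) - inter)

def filter_and_nms(boxes, scores, score_thresh=0.2, iou_thresh=0.5):
    """Drop boxes with low class-specific confidence, then apply NMS."""
    keep = scores >= score_thresh
    boxes, scores = boxes[keep], scores[keep]
    order = np.argsort(-scores)               # highest score first
    selected = []
    while order.size > 0:
        best = order[0]
        selected.append(best)
        rest = order[1:]
        overlaps = iou(boxes[best], boxes[rest])
        order = rest[overlaps < iou_thresh]   # suppress heavily overlapping boxes
    return boxes[selected], scores[selected]
```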

* Since the output layer is a fully connected layer, a trained YOLO model only supports input images with the same resolution as the training images at detection time.

* Although each cell predicts B bounding boxes, only the bounding box with the highest IOU is finally selected as the detection output, i.e. each cell predicts at most one object. When objects occupy only a small portion of the image, such as a herd of animals or a flock of birds, one cell may contain several objects but can detect only one of them. This is a flaw of the YOLO method.

Network Training

Before training on detection, the first 20 convolutional layers are pre-trained on ImageNet as a classification model, with an average-pooling layer and a fully connected layer on top. After pre-training, the 20 pre-trained convolutional layers are kept, and four randomly initialized convolutional layers plus two fully connected layers are added. Since detection generally needs higher-resolution images, the network input is increased from 224x224 to 448x448. The whole process is shown below:

Finally, the feature map is passed through the fully connected layers to produce the 7x7x30 output shown in the figure.
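For orientation, here is a hedged PyTorch sketch of the added detection head; the backbone output size (14x14x1024 for a 448x448 input), the channel counts, and the activations are assumptions made for illustration, not the exact configuration from the paper:

```python
import torch
import torch.nn as nn

S, B, C = 7, 2, 20

class YoloV1Head(nn.Module):
    """Sketch of the detection head grafted onto the 20 pretrained conv layers."""

    def __init__(self, in_channels=1024):
        super().__init__()
        self.extra_convs = nn.Sequential(            # the four added conv layers
            nn.Conv2d(in_channels, 1024, 3, padding=1), nn.LeakyReLU(0.1),
            nn.Conv2d(1024, 1024, 3, stride=2, padding=1), nn.LeakyReLU(0.1),  # 14x14 -> 7x7
            nn.Conv2d(1024, 1024, 3, padding=1), nn.LeakyReLU(0.1),
            nn.Conv2d(1024, 1024, 3, padding=1), nn.LeakyReLU(0.1),
        )
        self.fc = nn.Sequential(                     # the two added fully connected layers
            nn.Flatten(),
            nn.Linear(1024 * S * S, 4096), nn.LeakyReLU(0.1),
            nn.Linear(4096, S * S * (5 * B + C)),
        )

    def forward(self, features):
        x = self.extra_convs(features)
        return self.fc(x).view(-1, S, S, 5 * B + C)

features = torch.randn(1, 1024, 14, 14)   # stand-in for the pretrained backbone output
print(YoloV1Head()(features).shape)       # torch.Size([1, 7, 7, 30])
```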

How are the coordinates regressed?

The author normalizes the coordinates so that all values are on the same scale, which makes them easier to handle.

Suppose a bounding box's center falls in grid cell (row, col); then this cell is responsible for predicting the dog target inside the red box. Let the image width be width_image and its height height_image, and let the red box have its center at (x_c, y_c), width width_box, and height height_box. Then:

(1) The bounding box width and height are normalized as follows, so that the output width and height lie between 0 and 1:

w = width_box / width_image,  h = height_box / height_image

(2) The bounding box center is normalized using the offset of grid cell (row, col):

x = x_c / width_image * S - col,  y = y_c / height_image * S - row

The normalized (x, y, w, h) obtained from the formulas above, together with the confidence mentioned earlier, make up the bounding box that is actually regressed by the network;

at test time, (x, y, w, h) is decoded back to recover the box in image coordinates.
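A small sketch of this encoding and its inverse (function names and argument conventions are hypothetical):

```python
def encode_box(xc, yc, w_box, h_box, img_w, img_h, row, col, S=7):
    """Normalize a ground-truth box relative to its responsible cell (row, col)."""
    x = xc / img_w * S - col      # center offset inside the cell, in [0, 1]
    y = yc / img_h * S - row
    w = w_box / img_w             # width/height relative to the whole image
    h = h_box / img_h
    return x, y, w, h

def decode_box(x, y, w, h, img_w, img_h, row, col, S=7):
    """Invert the encoding to recover the box in image coordinates at test time."""
    xc = (col + x) / S * img_w
    yc = (row + y) / S * img_h
    return xc, yc, w * img_w, h * img_h
```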

Loss Function

The final output is 7x7x30, i.e. 7x7 grid cells, each with a 30-dimensional vector: 8 dimensions are the regressed box coordinates, 2 dimensions are the box confidences, and 20 dimensions are the class probabilities.

The box center coordinates x, y are normalized to 0-1 using the offset of the corresponding grid cell, and w, h are normalized to 0-1 by the image width and height.
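A small sketch of splitting one cell's 30-dimensional prediction (NumPy; the layout of coordinates first, then confidences, then class scores follows the description above, though actual implementations may order the values differently):

```python
import numpy as np

S, B, C = 7, 2, 20
output = np.random.rand(S, S, 5 * B + C)   # stand-in for one network prediction

cell = output[3, 4]                        # the 30-dim vector of one grid cell
boxes = cell[:4 * B].reshape(B, 4)         # 8 values: (x, y, w, h) for each of the B boxes
confidences = cell[4 * B:5 * B]            # 2 box confidence scores
class_probs = cell[5 * B:]                 # 20 conditional class probabilities
```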

The author simply and crudely uses sum-squared error loss for everything. This approach has the following problems:

  • First, treating the 8-dimensional localization error and the 20-dimensional classification error as equally important is clearly unreasonable;
  • Second, if a grid cell contains no object (and most cells in an image do not), its box confidences are pushed toward 0. Compared with the relatively few cells that do contain objects, this is overpowering and can make the network unstable or even cause it to diverge.

The remedies:

  • Put more emphasis on the 8-dimensional coordinate predictions by giving their loss a larger weight, denoted λ_coord; it is set to 5 in PASCAL VOC training.

  • Give the confidence loss of boxes that contain no object a small weight, denoted λ_noobj; it is set to 0.5 in PASCAL VOC training.

  • The confidence loss of boxes that do contain an object and the classification loss keep the normal weight of 1.

  • For boxes of different sizes, a small box being off by a little is far less tolerable than a large box being off by the same amount, yet the sum-squared error loss penalizes the same offset identically.

  • To mitigate this, the author uses a somewhat clever trick: the square roots of the box width and height are regressed instead of the width and height themselves. This is easy to see from the figure below: a small box has a small value on the horizontal axis, so the same offset produces a larger change on the y axis than it does for a large box. (It is only an approximate fix; see the numeric sketch after this list.)
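A numeric illustration of the square-root trick (plain Python; the example widths are chosen arbitrarily):

```python
import math

# The same absolute error in width hurts a small box much more than a large one.
# Plain squared error on w treats both cases the same; squared error on sqrt(w) does not.
w_small_true, w_small_pred = 0.05, 0.10    # small box, predicted width off by 0.05
w_large_true, w_large_pred = 0.80, 0.85    # large box, off by the same 0.05

print((w_small_pred - w_small_true) ** 2,
      (w_large_pred - w_large_true) ** 2)                        # 0.0025 vs 0.0025 (identical)
print((math.sqrt(w_small_pred) - math.sqrt(w_small_true)) ** 2,
      (math.sqrt(w_large_pred) - math.sqrt(w_large_true)) ** 2)  # ~0.0086 vs ~0.0008
```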

 

Each grid cell predicts multiple boxes, and the hope is that each box predictor specializes in predicting a particular object. Concretely, whichever of the cell's predicted boxes has the larger IoU with the ground-truth box becomes responsible for it. This is called the specialization of the box predictors.

  • Classification error is penalized only when the grid cell contains an object.
  • Coordinate error is penalized only for the box predictor that is responsible for a ground-truth box.
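A minimal sketch of picking the responsible predictor by IoU (illustrative; `iou_fn` is assumed to compute the IoU of two boxes, e.g. the helper from the NMS sketch above):

```python
def responsible_predictor(pred_boxes, gt_box, iou_fn):
    """Among a cell's B predicted boxes, pick the one with the highest IoU
    against the ground-truth box; only that predictor receives coordinate loss."""
    ious = [iou_fn(box, gt_box) for box in pred_boxes]
    best = max(range(len(pred_boxes)), key=lambda i: ious[i])
    return best, ious[best]
```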

Drawbacks

YOLO's training depends on annotated object detection data, so its detection performance is poor for objects with unusual shapes or proportions.

YOLO uses several down-sampling layers, so the object features the network learns are not fine-grained, which also hurts detection quality.

In YOLO's loss function, the IOU errors of large objects and small objects contribute roughly equally to the training loss (the square-root trick helps but does not fundamentally solve the problem). For small objects, even a small IOU error therefore has a large effect on optimization, which lowers localization accuracy.

YOLO handles objects that are very close to each other, and small objects appearing in groups, poorly, because each grid cell predicts only two boxes (B = 2) and they share a single class.

Generalization is weak when objects of a known class appear with new, uncommon aspect ratios or in other unusual configurations.

Because of the loss function design, localization error is the main factor limiting detection performance; the handling of objects of different sizes in particular still needs improvement.

 
