One-stage object detection

Excerpted from http://machinethink.net/blog/object-detection/

Object detection is trickier than classification. One of the problems you’ll encounter is that a training image can contain anywhere from zero to dozens of objects, and the model may output more than one prediction, so how do you figure out which prediction should be compared to which ground-truth bounding box in the loss function?

A one-stage detector requires only a single pass through the neural network and predicts all the bounding boxes in one go. The most common examples of one-stage object detectors are YOLO, SSD, SqueezeDet, and DetectNet.

In this blog post I’ll try to explain how these one-stage detectors work and how they are trained and evaluated.

Why object detection is tricky

A classifier takes an image as input and produces a single output, the probability distribution over the classes. But this only gives you a summary of what is in the image as a whole; it doesn’t work so well when the image contains multiple objects of interest.

One-stage detectors such as YOLO, SSD, and DetectNet all solve this problem by assigning each bounding box detector to a specific position in the image. That way the detectors learn to specialize on objects in certain locations. For even better results, we can also let detectors specialize on the shapes and sizes of objects.

Enter the grid

Using a fixed grid of detectors is the main idea that powers one-stage detectors, and what sets them apart from region proposal-based detectors such as R-CNN.

Let’s consider the simplest possible architecture for this kind of model.

On top of the feature extractor are several additional convolutional layers. These are fine-tuned to learn how to predict bounding boxes and class probabilities for the objects inside these bounding boxes. This is the object detection part of the model.

The output of the final layer is a feature map. For our example model this is a 13×13 feature map with 125 channels.

We interpret this feature map as being a grid of 13 by 13 cells. Each cell in the grid has 5 independent object detectors, and each of these detectors predicts a single bounding box.

The key thing here is that the position of a detector is fixed: it can only detect objects located near that cell. With this grid, a detector on the left-hand side of the image will never predict an object that is located on the right-hand side.

Jesse comment: How is a detector restricted to only detect nearby objects? Is the restriction imposed during training? Or does the CNN itself impose it (given that the object-detection layers are all convolutional)?

Each object detector produces 25 numbers:

  • 20 numbers containing the class probabilities
  • 4 bounding box coordinates (center x, center y, width, height)
  • 1 confidence score
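As a concrete illustration, here is a minimal numpy sketch of how the 13×13×125 output can be split into those 25 numbers per detector. The channel ordering assumed here (4 box deltas, 1 confidence, 20 class scores) is my own choice; real implementations may lay the numbers out differently.

import numpy as np

num_classes, num_anchors = 20, 5
features = np.random.randn(13, 13, num_anchors * (5 + num_classes))  # stand-in for the model output

# View the 125 channels as 5 detectors with 25 numbers each.
features = features.reshape(13, 13, num_anchors, 5 + num_classes)

box_deltas   = features[..., 0:4]   # delta_x, delta_y, delta_w, delta_h
confidences  = features[..., 4]     # raw (pre-sigmoid) confidence score
class_scores = features[..., 5:]    # raw (pre-softmax) class scores

print(box_deltas.shape, confidences.shape, class_scores.shape)
# (13, 13, 5, 4) (13, 13, 5) (13, 13, 5, 20)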

Constraints are good

Why does it work?

I’ve already mentioned that assigning each bounding box detector to a fixed position in the image is the trick that makes one-stage object detectors work. We use the 13×13 grid as a spatial constraint, to make it easier for the model to learn how to predict objects.

Using such (architectural) constraints is a useful technique for machine learning.

In fact, convolutions are themselves a constraint too. It’s much harder for a machine learning model to learn about images if we only use plain FC layers. The constraints imposed upon the convolutional layer — it looks only at a few pixels at a time, and the connections share the same weights — help the model to extract knowledge from images. We use these constraints to remove degrees of freedom and to guide the model into learning what we want it to learn.

Likewise, the grid forces the model to learn object detectors that specialize in specific locations. The detector in the top-left cell will only predict objects located near that top-left cell, never objects that are further away. (The model is trained so that the detectors in a given grid cell are responsible only for detecting objects whose center falls inside that grid cell.)

Anchors

The grid is a useful constraint that limits where in the image a detector can find objects. We can also add another constraint that helps the model make better predictions, and that is a constraint on the shape of the object.

Our example model has 13×13 grid cells and each cell has 5 detectors, so there are 845 detectors in total. But why are there 5 detectors per grid cell instead of just one? 

We use the grid to specialize our detectors to look only at certain spatial locations, and by having several different detectors per grid cell, we can make each of these object detectors specialize in a certain object shape as well.

We train the detectors on 5 specific shapes, shown as an image in the original post and sketched in code below.

These five shapes are called the anchors or anchor boxes.
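In code, anchors are nothing more than five fixed (width, height) pairs, measured in grid-cell units. The values below are illustrative only, not necessarily the exact anchors used in the post:

# Five (width, height) pairs in grid-cell units (1 cell = 32 pixels).
# Illustrative values only, ranging from small and roughly square to large and wide.
anchors = [
    (1.1, 1.7),    # small, slightly tall
    (3.2, 4.0),    # medium
    (5.1, 8.1),    # tall (think: a standing person)
    (9.5, 4.8),    # wide (think: a car)
    (11.2, 10.1),  # large, roughly square
]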

It’s no accident there are 5 anchors. There is one anchor for each detector in the grid cells. Anchors force the detectors inside the cells to each specialize in a particular object shape.

It’s important to understand that these anchors are chosen beforehand. They’re constants and they won’t change during training.

Thanks to anchors, the detectors don’t have to work very hard to make pretty good predictions already, because predicting all zeros simply outputs the anchor box, which will be reasonably close to the true object (on average). This makes training a lot easier!

What does the model actually predict?

What the model predicts for each bounding box is not their absolute coordinates in the image but four “delta” values, or offsets:

  • delta_x, delta_y: the center of the box inside the grid cell
  • delta_w, delta_h: scaling factors for the width and height of the anchor box

Each detector makes a prediction relative to its anchor box.

To get the actual width and height of the bounding box in pixel coordinates, we do:

box_w[i, j, b] = anchor_w[b] * exp(delta_w[i, j, b]) * 32
box_h[i, j, b] = anchor_h[b] * exp(delta_h[i, j, b]) * 32

where i and j are the row and column in the grid (0 – 12) and b is the detector index (0 – 4).

It’s OK for the predicted box to be wider and/or taller than the original image, but it does not make sense for the box to have a negative width or height. That’s why we take the exponent of the predicted number.

By the way, we multiply by 32 because the anchor coordinates live on the 13×13 grid, and each grid cell covers 32 pixels in the 416×416 input image.

Note: Interestingly enough, in the loss function we’ll actually use the inverse versions of the above formulas. Instead of doing exp() on the predicted values, we’ll take the log() of the ground-truth values. 

To get the center x,y position of the predicted box in pixel coordinates, we do:

box_x[i, j, b] = (i + sigmoid(delta_x[i, j, b])) * 32
box_y[i, j, b] = (j + sigmoid(delta_y[i, j, b])) * 32

A key feature of YOLO is that it encourages a detector to predict a bounding box only if it finds an object whose center lies inside the detector’s grid cell. This helps to avoid spurious detections, so that multiple neighboring grid cells don’t all find the same object.

To enforce this, delta_x and delta_y must be restricted to a number between 0 and 1 that is a relative position inside the grid cell. That’s what the sigmoid function is for.

Besides coordinates, the model also predicts a confidence score for the bounding box. Because we want this to be a number between 0 and 1, we use the standard trick and stick it through a sigmoid:

confidence[i, j, b] = sigmoid(predicted_confidence[i, j, b])

And finally, we predict the class probabilities. As usual we apply a softmax to make it a nice probability distribution:

classes[i, j, b] = softmax(predicted_classes[i, j, b])

Since we have many more predictions than we need, and most will be no good, we’ll now filter out the predictions with very low scores. In the case of YOLO, we do that by combining the confidence score for the box with the largest class probability.

confidence_in_class[i, j, b] = classes[i, j, b].max() * confidence[i, j, b]
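Putting these formulas together, a minimal numpy sketch of the decoding step for a single detector might look like the following. The 32-pixel stride and the 13×13 grid come from the example model; the name raw for the reshaped network output and the channel layout are my own assumptions.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def decode_detector(raw, i, j, b, anchors, stride=32):
    """Turn the 25 raw numbers of detector (i, j, b) into a box, a score, and a class.
    Assumed layout of raw[i, j, b]: [delta_x, delta_y, delta_w, delta_h, conf, 20 class scores]."""
    delta_x, delta_y, delta_w, delta_h, raw_conf = raw[i, j, b, :5]
    class_scores = raw[i, j, b, 5:]
    anchor_w, anchor_h = anchors[b]

    # Center: sigmoid keeps the offset inside the grid cell, then convert to pixels.
    box_x = (i + sigmoid(delta_x)) * stride
    box_y = (j + sigmoid(delta_y)) * stride

    # Size: exp keeps the width/height positive, scaled by the anchor shape.
    box_w = anchor_w * np.exp(delta_w) * stride
    box_h = anchor_h * np.exp(delta_h) * stride

    confidence = sigmoid(raw_conf)
    classes = softmax(class_scores)

    # Score used for filtering: largest class probability times box confidence.
    score = classes.max() * confidence
    return (box_x, box_y, box_w, box_h), score, classes.argmax()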

It is convolutional, baby!

Having a grid of detectors is actually a natural fit for using a convolutional neural network.

The 13×13 grid is the output of a convolutional layer. Our example model’s final convolutional layer has 125 kernels.

Why 125? There are 5 detectors and each detector has 25 convolution kernels. Each of these 25 kernels predicts one aspect of that detector’s bounding box: x, y, width, height, confidence score, the 20 class probabilities.
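In a framework such as PyTorch, this detection head is just a convolutional layer with 125 output channels. A hedged sketch (the 1×1 kernel size and the 1024 input channels are assumptions, not taken from the original post):

import torch
import torch.nn as nn

num_anchors, num_classes = 5, 20

# Pretend the feature extractor produced a 13×13 feature map with 1024 channels.
features = torch.randn(1, 1024, 13, 13)

# Detection head: one kernel per predicted number, 5 × (4 + 1 + 20) = 125 in total.
head = nn.Conv2d(1024, num_anchors * (5 + num_classes), kernel_size=1)

output = head(features)
print(output.shape)  # torch.Size([1, 125, 13, 13])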

Initially, the 125 numbers that get predicted at every grid position will be totally random and meaningless, but as training progresses the loss function will guide the model to learn to make more meaningful predictions.

Now, even though I keep saying there are 5 detectors in each grid cell, for 845 detectors overall, the model really only learns five detectors in total — not five unique detectors per grid cell. This is because the weights of the convolution layer are the same at each position and are therefore shared between the grid cells.

Jesse comment: For a given prediction of any detector (e.g. the width), the same kernel is used at every position, i.e. the computation is identical and independent of location.

The model really learns one detector for every anchor. Even though we only have 5 unique detectors in total, thanks to the convolution these detectors are independent of where they are in the image and therefore can detect objects regardless of where they are located.

It’s the combination of the input pixels at a given location, with the weights that were learned for that detector / convolution kernel, that determine the final bounding box prediction at that position.

This also explains why the model always predicts where the bounding box is relative to the center of the grid cell. Due to the convolutional nature of this model, it cannot predict absolute coordinates. Since the convolution kernels slide across the image, their predictions are always relative to their current position in the feature map.

How do you train this thing?

The problem is that the number of ground-truth boxes can vary between images, from zero to dozens. During training, we must match each of our detectors with one of these ground-truth boxes, so that we can compute the regression loss for each predicted box.

The solution is to use a grid with a fixed number of detectors, where each detector is only responsible for detecting objects that are located in that part of the image, and is only responsible for objects of a certain size.

Matching ground-truth boxes to detectors

How does this matching work? There are different strategies. The way YOLO does it is to make only one detector responsible for detecting a given object in the image.

First we find the grid cell that the center of the bounding box falls in. That grid cell will be responsible for this object. If any other grid cells also predict this object they will be penalized for it by the loss function.

Each grid cell has multiple detectors and we only want one of these detectors to find the object, so we pick the detector whose anchor box best matches the object’s ground-truth box. This is done with the usual IOU metric.

Only that particular detector in that cell is supposed to predict this object. This rule helps the different detectors to specialize in objects that have a shape and size that is similar to the anchor box.
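A minimal sketch of this matching step, assuming ground-truth boxes given as (center_x, center_y, width, height) in pixels and the 32-pixel grid cells of the example model:

def best_matching_detector(gt_box, anchors, cell_size=32):
    """Pick the (cell_i, cell_j, anchor_index) responsible for a ground-truth box.
    gt_box is (center_x, center_y, width, height) in pixels; anchors are in grid-cell units."""
    cx, cy, w, h = gt_box

    # The cell containing the box center is responsible for this object.
    cell_i = int(cx // cell_size)
    cell_j = int(cy // cell_size)

    # Compare the box shape to each anchor shape (both centered at the origin)
    # and pick the anchor with the highest IOU.
    gt_w, gt_h = w / cell_size, h / cell_size
    best_b, best_iou = 0, 0.0
    for b, (aw, ah) in enumerate(anchors):
        intersection = min(gt_w, aw) * min(gt_h, ah)
        union = gt_w * gt_h + aw * ah - intersection
        iou = intersection / union
        if iou > best_iou:
            best_b, best_iou = b, iou
    return cell_i, cell_j, best_b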

Since the output of the model is a 13×13×125 tensor, the target tensor that will be used by the loss function will also be 13×13×125. Again, that number 125 comes from: 5 detectors that each predict 20 probability values for the object’s class + 4 bounding box coordinates + 1 confidence score.

In the target tensor we only fill in the bounding boxes (and one-hot encoded class vector) for the detectors that are responsible for an object. We set the expected confidence score to 1 (since we’re 100% sure this is a real object).

For all other detectors — the ones with negative examples — the target tensor contains all zeros. The bounding box coordinates and class vector aren’t important here, since they will be ignored by the loss function, and the confidence score is 0.

So when the training loop asks for a new batch of images and their targets, what it gets is a tensor of B×416×416×3 images and a tensor of B×13×13×125 numbers.
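Here is a hedged sketch of filling such a target tensor for one image, reusing the best_matching_detector helper sketched above. The coordinate targets follow the sigmoid/exp encoding from the prediction section (offset inside the cell, log-scale size relative to the anchor); the exact encoding and channel layout vary between implementations, and the post’s own version appears in the loss-function section below.

import numpy as np

def make_target(gt_boxes, gt_classes, anchors, grid=13, num_classes=20, cell=32):
    """gt_boxes: list of (center_x, center_y, width, height) in pixels.
    gt_classes: list of class indices. Returns a grid×grid×125 target tensor."""
    target = np.zeros((grid, grid, len(anchors), 5 + num_classes), dtype=np.float32)

    for (cx, cy, w, h), cls in zip(gt_boxes, gt_classes):
        i, j, b = best_matching_detector((cx, cy, w, h), anchors, cell)
        if target[i, j, b, 4] == 1.0:
            continue  # this detector already has an object; ignore the new box

        aw, ah = anchors[b]
        target[i, j, b, 0] = cx / cell - i          # center offset inside the cell
        target[i, j, b, 1] = cy / cell - j
        target[i, j, b, 2] = np.log(w / cell / aw)  # log-scale width vs. the anchor
        target[i, j, b, 3] = np.log(h / cell / ah)  # log-scale height vs. the anchor
        target[i, j, b, 4] = 1.0                    # expected confidence score
        target[i, j, b, 5 + cls] = 1.0              # one-hot class vector

    return target.reshape(grid, grid, -1)           # 13×13×125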

What happens when there is more than one ground-truth box whose center happens to fall into the same cell?

YOLO solves this by first randomly shuffling the ground-truths, and then it just picks the first one that matches the cell. If a new ground-truth box is matched to a cell that already is responsible for another object, then we simply ignore this new box. Better luck next epoch! This means that in YOLO at most one detector per cell is given an object.

Note that there are other possible matching strategies too. For example, SSD can match the same ground-truth box with multiple detectors: it first picks the detector with the best IOU, but then also chooses any (unassigned) detector whose anchor box has an IOU over 0.5 with this ground-truth. This is supposed to make it easier for the model to learn because it won’t have to choose between which detector should predict an object — multiple detectors now have a shot at predicting this object.

The loss function

As always, the loss function is what really tells the model what it should learn.

For object detection, we want a loss function that encourages the model to predict correct bounding boxes and also the correct classes for these boxes. On the other hand, the model should not predict objects that aren’t there.

Detectors with no ground-truth (negative examples)

This part of the loss function only involves the confidence score — since there is no ground-truth box here, we don’t have any coordinates or class label to compare the prediction to.

In YOLO, it looks like this:

no_object_loss[i, j, b] = no_object_scale * (0 - sigmoid(pred_conf[i, j, b]))**2

Note that we use sigmoid() to restrict the predicted confidence to a number between 0 and 1.

The no_object_scale is a hyperparameter. It’s typically 0.5, so that this part of the loss term doesn’t count as much as the other parts. Because we don’t want the model to learn only about “no objects”, this part of the loss shouldn’t become more important than the loss for the detectors that do have objects.

The above formula is for a single detector in a single cell. To find the aggregate no-object loss, we add up the no_object_loss for all cells ij and all detectors b. For detectors that are responsible for finding an object, the no_object_loss is always 0. In SqueezeDet, the total no-object loss is also divided by the number of “no object” detectors to get the mean value but in YOLO we don’t do that.
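In vectorized form the aggregation might look like this (a sketch, not the actual YOLO code), assuming a responsible mask that is 1 for detectors matched to a ground-truth box and 0 everywhere else:

import numpy as np

def total_no_object_loss(pred_conf, responsible, no_object_scale=0.5):
    """pred_conf: raw confidences of shape (13, 13, 5); responsible: same shape,
    1.0 where a detector was matched to an object, 0.0 elsewhere."""
    conf = 1.0 / (1.0 + np.exp(-pred_conf))           # sigmoid
    per_detector = no_object_scale * (0.0 - conf) ** 2
    # Only detectors without an object contribute; matched detectors contribute 0.
    return np.sum((1.0 - responsible) * per_detector)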

SSD does not have this no-object loss term. Instead, it adds a special background class to the possible classes. If that is the predicted class, then the output of that detector counts as “no object”.

Detectors with a ground-truth (positive examples)

Confidence score

For a detector that is responsible for an object, you might expect the confidence loss to simply push the predicted score towards 1:

object_loss[i, j, b] = object_scale * (1 - sigmoid(pred_conf[i, j, b]))**2

Actually, YOLO does something slightly more interesting:

object_loss[i, j, b] = object_scale * (IOU(truth_coords, pred_coords) - sigmoid(pred_conf[i, j, b]))**2

The predicted confidence score pred_conf[i, j, b] is supposed to represent the IOU between the predicted bounding box and the ground-truth box. This makes sense: if the IOU is small then the confidence should be small.
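Since both the matching step and this loss term rely on IOU, here is a minimal implementation for boxes given as (center_x, center_y, width, height); the function name and box format are my own choices:

def iou(box_a, box_b):
    """Intersection over union of two boxes given as (center_x, center_y, width, height)."""
    ax1, ay1 = box_a[0] - box_a[2] / 2, box_a[1] - box_a[3] / 2
    ax2, ay2 = box_a[0] + box_a[2] / 2, box_a[1] + box_a[3] / 2
    bx1, by1 = box_b[0] - box_b[2] / 2, box_b[1] - box_b[3] / 2
    bx2, by2 = box_b[0] + box_b[2] / 2, box_b[1] + box_b[3] / 2

    # Overlapping area (zero if the boxes don't intersect).
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    intersection = inter_w * inter_h

    union = box_a[2] * box_a[3] + box_b[2] * box_b[3] - intersection
    return intersection / union if union > 0 else 0.0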

Class probabilities

YOLO v1 and v2 used the following loss term for the predicted class probabilities:

class_loss[i, j, b] = class_scale * (true_class - softmax(pred_class))**2

Here, true_class is a one-hot encoded vector of 20 numbers, and pred_class is the vector of predicted logits. Note that even though we apply a softmax to the predictions, this loss term does not use the cross-entropy. (I think maybe they used sum-squared-error because that makes this loss term easier to balance with the other loss terms. In fact, even the softmax is optional.)

YOLO v3 and SSD take a different approach. They don’t see this as a multi-class classification problem but as a multi-label problem. Hence they don’t use softmax (which always chooses a single label to be the winner) but a logistic sigmoid, which allows multiple labels to be chosen. They use a standard binary cross-entropy to compute this loss term.

Since SSD does not predict a confidence score, it has a special “background” class to serve this purpose. If a detector predicts background then that counts as the detector not finding an object (and we simply ignore those predictions).

Bounding box coordinates

coord_loss[i, j, b] = coord_scale * ((true_x[i, j, b] - pred_x[i, j, b])**2
                                   + (true_y[i, j, b] - pred_y[i, j, b])**2
                                   + (true_w[i, j, b] - pred_w[i, j, b])**2
                                   + (true_h[i, j, b] - pred_h[i, j, b])**2)

The scale factor coord_scale is used to make the loss from the bounding box coordinate predictions count more heavily than the other loss terms. A typical value for this hyperparameter is 5.

Because the model doesn’t directly predict valid coordinates, the ground-truth values used in the loss function are not supposed to be true coordinates either.

true_x[i, j, b] = ground_truth.center_x - grid[i, j].center_x
true_y[i, j, b] = ground_truth.center_y - grid[i, j].center_y
true_w[i, j, b] = log(ground_truth.width / anchor_w[b])
true_h[i, j, b] = log(ground_truth.height / anchor_h[b])

SSD again uses a slightly different loss term. Its localization loss is known as the “Smooth L1” loss. 

difference = abs(true_x[i, j, b] - pred_x[i, j, b])
if difference < 1:
    coord_loss_x[i, j, b] = 0.5 * difference**2
else:
    coord_loss_x[i, j, b] = difference - 0.5

And so on for the other coordinates too. This loss is supposed to be less sensitive to outliers.

How well does the model work?

With object detection there are several things we can compute a score for:

  • the classification accuracy of each detected object
  • how much the predicted object overlaps the real object (IOU)
  • whether the model actually finds all the objects in the image (known as “recall”)

True positive, false positive, false negative

Computing the mAP

For the Pascal VOC mAP, first we compute the average precision (or AP) for each of the 20 classes separately, then we take the mean over these 20 numbers to get the final mAP score. So the mAP score is the average of an average.

precision = TP / (TP + FP)

In our case, a false positive happens when a predicted bounding box is too different from any ground-truth box in the image, or when the predicted class is not the same.

recall = TP / (TP + FN)

A false negative happens when no prediction is made for an object, or when the confidence score is too low (which really means the same as “no prediction”).

Here is how to compute the number of TP and FP, in pseudocode:

sort the predictions by confidence score (high to low)

for each prediction:
    true_boxes = get the annotations with same class as the prediction
                     and that are not marked as "difficult"

    find IOUs between true_boxes and prediction
    choose ground-truth box with biggest IOU overlap

    if biggest IOU > threshold (which is 0.5 for Pascal VOC):
        if we do not already have a detection for this ground-truth box:
            TP += 1
        else:
            FP += 1
    else:
        FP += 1

A prediction counts as a true positive if it has the same class as a ground-truth object and their bounding boxes overlap by more than 50%.

The result is the precision-recall curve (shown as a figure in the original post).

Note that precision is given as a function of recall, which is why the area under the curve is the average of the precision for this class. We want to know what the precision is for different values of recall.

A precision-recall curve is always created by computing the precision and recall scores at different thresholds. For a binary classifier this is the threshold after which we conclude that a prediction is positive. In the case of an object detection model, the threshold we’re varying is the confidence score of the predicted boxes.

As the threshold decreases and we’re including more objects in the result, recall increases. Precision can go up and down, although it tends to become lower since now more false positives are found.

I hope this makes it clear there’s always a trade-off between false positives and false negatives predicted by the model. The reason we’re using a precision-recall curve is to measure that trade-off, and to find a good threshold for the confidence score.

  • Choosing a high threshold means we keep fewer predictions, and therefore we’ll have fewer false positives (we make fewer mistakes) but we’ll also have more false negatives (we miss more objects).

  • The lower the threshold, the more predictions we include but they’ll usually be of lower quality.

The final mAP score is simply the average over the average precisions for the 20 classes.
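Here is a hedged sketch of turning the sorted TP/FP decisions from the pseudocode above into an AP score. This version integrates the precision-recall curve with a simple step sum; Pascal VOC 2007 actually used an 11-point interpolation, a detail skipped here.

import numpy as np

def average_precision(is_tp, num_ground_truths):
    """is_tp: one boolean per prediction, already sorted by confidence (high to low).
    num_ground_truths: number of ground-truth boxes of this class in the dataset."""
    if num_ground_truths == 0:
        return 0.0
    tp = np.cumsum(is_tp).astype(float)
    fp = np.cumsum([not x for x in is_tp]).astype(float)

    precision = tp / (tp + fp)
    recall = tp / num_ground_truths

    # Area under the precision-recall curve, stepping through the recall values.
    ap, prev_recall = 0.0, 0.0
    for p, r in zip(precision, recall):
        ap += p * (r - prev_recall)
        prev_recall = r
    return ap

# mAP over the 20 Pascal VOC classes (per_class_tp and per_class_num_gt are assumed inputs):
# mAP = np.mean([average_precision(per_class_tp[c], per_class_num_gt[c]) for c in range(20)])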

For more on mAP, see https://blog.csdn.net/Airfrozen/article/details/104264459
