Computer Vision: Target Detection Theory and Practice

For the part about anchor boxes, you can read my other article: click here. I won't repeat those details below.



Multi-Scale Object Detection

Multi-scale anchor boxes

Reducing the number of anchor boxes on an image is not difficult. For example, we can uniformly sample a small number of pixels from the input image and generate anchor boxes centered on them. Furthermore, at different scales we can generate anchor boxes of different numbers and sizes. Intuitively, smaller objects are more likely to appear on an image than larger ones. For example, objects of size $1\times 1$, $1\times 2$, and $2\times 2$ can appear on a $2\times 2$ image in 4, 2, and 1 possible ways, respectively. Therefore, when detecting smaller objects with smaller anchor boxes we can sample more regions, while for larger objects we can sample fewer regions.

Let's first read an image. Its height and width are 561 and 728 pixels, respectively.

%matplotlib inline
import torch
from d2l import torch as d2l

img = d2l.plt.imread('../img/catdog.jpg')
h, w = img.shape[:2]
h, w

The display_anchors function is defined below. We generate anchor boxes (anchors) on the feature map (fmap), with each unit (pixel) as an anchor box center. Since the $(x, y)$ axis coordinate values in anchors have been divided by the width and height of the feature map (fmap), these values lie between 0 and 1 and indicate the relative positions of the anchor boxes in the feature map.

Since the centers of the anchor boxes (anchors) are spread over all units of the feature map (fmap), these centers must be uniformly distributed over any input image according to their relative spatial positions. More specifically, given the width fmap_w and height fmap_h of the feature map, the following function uniformly samples pixels in fmap_h rows and fmap_w columns of any input image. Centered on these uniformly sampled pixels, anchor boxes of scale s (assuming the list s has length 1) with different aspect ratios will be generated.

def display_anchors(fmap_w, fmap_h, s):
    d2l.set_figsize()
    # The values of the first two dimensions do not affect the output
    fmap = torch.zeros((1, 10, fmap_h, fmap_w))
    anchors = d2l.multibox_prior(fmap, sizes=s, ratios=[1, 2, 0.5])
    bbox_scale = torch.tensor((w, h, w, h))
    d2l.show_bboxes(d2l.plt.imshow(img).axes,
                    anchors[0] * bbox_scale)

First, let's consider detecting small objects. To make the display easier to read, the anchor boxes with different centers do not overlap here: the anchor box scale is set to 0.15 and the height and width of the feature map are set to 4. We can see that the centers of the anchor boxes in 4 rows and 4 columns are evenly distributed over the image.

display_anchors(fmap_w=4, fmap_h=4, s=[0.15])

We then halve the height and width of the feature maps and use larger anchor boxes to detect larger objects. When the scale is set to 0.4, some anchor boxes will overlap each other.

display_anchors(fmap_w=2, fmap_h=2, s=[0.4])


Now that we have generated multi-scale anchor boxes, we will use them to detect objects of various sizes at different scales. Below, we introduce a CNN-based multi-scale object detection method, which we will implement in the rest of this post.

At some scale, suppose we have $c$ feature maps of shape $h \times w$. Using the method above, we generate $hw$ groups of anchor boxes, where each group has $a$ anchor boxes with the same center. For example, at the first scale in the experiment above, given ten (the number of channels) $4 \times 4$ feature maps, we generated 16 groups of anchor boxes, each containing 3 anchor boxes with the same center. Next, each anchor box is labeled with a class and an offset relative to the ground-truth bounding box. At the current scale, the object detection model needs to predict the classes and offsets of the $hw$ groups of anchor boxes on the input image, where different groups have different centers.
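As a quick sanity check of this counting, here is a short sketch using d2l.multibox_prior (the same helper called inside display_anchors above): a $4\times 4$ feature map with one scale and three ratios yields 16 centers with $1 + 3 - 1 = 3$ anchor boxes each, i.e. 48 anchor boxes in total.

fmap = torch.zeros((1, 10, 4, 4))  # 4x4 feature map, 16 centers
anchors = d2l.multibox_prior(fmap, sizes=[0.15], ratios=[1, 2, 0.5])
print(anchors.shape)  # expected: torch.Size([1, 48, 4]) = 16 centers * 3 anchors per center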

When the feature maps of different layers have different sizes of receptive fields on the input image, they can be used to detect objects of different sizes. For example, we can design a neural network where feature map units closer to the output layer have wider receptive fields so that they can detect larger objects from the input image.

In short, we can leverage deep neural networks to hierarchically represent images at multiple levels, enabling multi-scale object detection.

Dataset

Getting the dataset

%matplotlib inline
import os
import pandas as pd
import torch
import torchvision
from d2l import torch as d2l

#@save
d2l.DATA_HUB['banana-detection'] = (
    d2l.DATA_URL + 'banana-detection.zip',
    '5de26c8fce5ccdea9f91267273464dc968d20d72')
#@save
def read_data_bananas(is_train=True):
    """Read the banana detection dataset images and labels."""
    data_dir = d2l.download_extract('banana-detection')
    csv_fname = os.path.join(data_dir, 'bananas_train' if is_train
                             else 'bananas_val', 'label.csv')
    csv_data = pd.read_csv(csv_fname)
    csv_data = csv_data.set_index('img_name')
    images, targets = [], []
    for img_name, target in csv_data.iterrows():
        images.append(torchvision.io.read_image(
            os.path.join(data_dir, 'bananas_train' if is_train else
                         'bananas_val', 'images', f'{img_name}')))
        # Here target contains (class, upper-left x, upper-left y,
        # lower-right x, lower-right y), where all images have the same
        # banana class (index 0)
        targets.append(list(target))
    return images, torch.tensor(targets).unsqueeze(1) / 256
#@save
class BananasDataset(torch.utils.data.Dataset):
    """一个用于加载香蕉检测数据集的自定义数据集"""
    def __init__(self, is_train):
        self.features, self.labels = read_data_bananas(is_train)
        print('read ' + str(len(self.features)) + (f' training examples' if
              is_train else f' validation examples'))

    def __getitem__(self, idx):
        return (self.features[idx].float(), self.labels[idx])

    def __len__(self):
        return len(self.features)
#@save
def load_data_bananas(batch_size):
    """加载香蕉检测数据集"""
    train_iter = torch.utils.data.DataLoader(BananasDataset(is_train=True),
                                             batch_size, shuffle=True)
    val_iter = torch.utils.data.DataLoader(BananasDataset(is_train=False),
                                           batch_size)
    return train_iter, val_iter
batch_size, edge_size = 32, 256
train_iter, _ = load_data_bananas(batch_size)
batch = next(iter(train_iter))
batch[0].shape, batch[1].shape
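For the banana dataset each image contains exactly one banana, so the label minibatch has shape (batch size, 1, 5): one row of (class, x1, y1, x2, y2) per image. The printed shapes should therefore be torch.Size([32, 3, 256, 256]) and torch.Size([32, 1, 5]).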

Display

#@save
def show_bboxes(axes, bboxes, labels=None, colors=None):
    """显示所有边界框"""
    def _make_list(obj, default_values=None):
        if obj is None:
            obj = default_values
        elif not isinstance(obj, (list, tuple)):
            obj = [obj]
        return obj

    labels = _make_list(labels)
    colors = _make_list(colors, ['b', 'g', 'r', 'm', 'c'])
    for i, bbox in enumerate(bboxes):
        color = colors[i % len(colors)]
        rect = d2l.bbox_to_rect(bbox.detach().numpy(), color)
        axes.add_patch(rect)
        if labels and len(labels) > i:
            text_color = 'k' if color == 'w' else 'w'
            axes.text(rect.xy[0], rect.xy[1], labels[i],
                      va='center', ha='center', fontsize=9, color=text_color,
                      bbox=dict(facecolor=color, lw=0))
imgs = (batch[0][0:10].permute(0, 2, 3, 1)) / 255
axes = d2l.show_images(imgs, 2, 5, scale=2)
for ax, label in zip(axes, batch[1][0:10]):
    show_bboxes(ax, [label[0][1:5] * edge_size], colors=['w'])


Single-shot multibox detection (SSD)

Model design

The figure below describes the design of the single-shot multibox detection model. The model mainly consists of a base network followed by several multi-scale feature blocks. The base network extracts features from the input image, so it can be a deep convolutional neural network. The single-shot multibox detection paper (Liu et al., 2016) uses a VGG truncated before the classification layer; ResNet is now often used instead. We can design the base network so that its output has a large height and width, so that many anchor boxes are generated from this feature map, which helps detect smaller objects. Each subsequent multi-scale feature block reduces the height and width of the feature map from the previous layer (for example, by half) and widens the receptive field that each unit of the feature map has on the input image.

(Figure: the single-shot multibox detection model, consisting of a base network followed by several multi-scale feature blocks.)

Since the multi-scale feature maps near the top of the figure above are smaller but have larger receptive fields, they are suitable for detecting fewer but larger objects. In short, through its multi-scale feature blocks, single-shot multibox detection generates anchor boxes of different sizes and detects objects of different sizes by predicting the classes of these anchor boxes and their offsets (hence the bounding boxes); therefore, this is a multi-scale object detection model.

Class prediction layer

Let the number of object classes be $q$. Then anchor boxes have $q+1$ classes, where class 0 is the background. At some scale, let the height and width of the feature maps be $h$ and $w$, respectively. If $a$ anchor boxes are generated with each unit of these feature maps as their center, a total of $hwa$ anchor boxes need to be classified. Using fully connected layers as the output easily leads to too many model parameters, so single-shot multibox detection uses the channels of convolutional layers to output class predictions, which reduces model complexity.

Specifically, the class prediction layer uses a convolutional layer that keeps the input height and width unchanged. In this way, the spatial coordinates of the output and the input correspond one-to-one along the width and height of the feature map. Consider the output and the input at the same spatial coordinate $(x, y)$: the channels at coordinate $(x, y)$ of the output feature map contain the class predictions for all the anchor boxes generated with $(x, y)$ of the input feature map as their center. Therefore, the number of output channels is $a(q+1)$, where the channel with index $i(q+1) + j$ ($0 \le j \le q$) represents the prediction for class index $j$ of the anchor box with index $i$.

Below, we define such a class prediction layer, specifying $a$ and $q$ via the arguments num_anchors and num_classes, respectively. This layer uses a $3\times 3$ convolutional layer with a padding of 1, so the width and height of its input and output remain unchanged.

%matplotlib inline
import torch
import torchvision
from torch import nn
from torch.nn import functional as F
from d2l import torch as d2l


def cls_predictor(num_inputs, num_anchors, num_classes):
    return nn.Conv2d(num_inputs, num_anchors * (num_classes + 1),
                     kernel_size=3, padding=1)
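As a small illustration of the channel layout described above (a sketch; the numbers below are chosen only for this example), with $a = 3$ anchors and $q = 1$ class the layer outputs $a(q+1) = 6$ channels, and the channel for anchor $i$ and class $j$ sits at index $i(q+1) + j$.

layer = cls_predictor(num_inputs=8, num_anchors=3, num_classes=1)
out = layer(torch.zeros((1, 8, 20, 20)))
print(out.shape)  # expected: torch.Size([1, 6, 20, 20]); e.g. channel 4 holds anchor 2, class 0 (background)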

Bounding box prediction layer

The design of the bounding box prediction layer is similar to that of the class prediction layer. The only difference is that here we need to predict 4 offsets for each anchor box rather than $q+1$ classes.

def bbox_predictor(num_inputs, num_anchors):
    return nn.Conv2d(num_inputs, num_anchors * 4, kernel_size=3, padding=1)

Linking Multiscale Predictions

As mentioned, single-shot multibox detection uses multi-scale feature maps to generate anchor boxes and predict their classes and offsets. At different scales, the shapes of the feature maps, or the numbers of anchor boxes centered on the same unit, may differ. Therefore, the shapes of the prediction outputs may differ at different scales.

In the following example, we build feature maps at two different scales, Y1 and Y2, for the same minibatch, where Y2 has half the height and half the width of Y1. Taking class prediction as an example, suppose that 5 and 3 anchor boxes are generated for each unit of Y1 and Y2, respectively. Suppose further that the number of object classes is 10. Then the numbers of channels of the class prediction outputs for feature maps Y1 and Y2 are 55 and 33, respectively, where the shape of either output is (batch size, number of channels, height, width).

def forward(x, block):
    return block(x)

Y1 = forward(torch.zeros((2, 8, 20, 20)), cls_predictor(8, 5, 10))
Y2 = forward(torch.zeros((2, 16, 10, 10)), cls_predictor(16, 3, 10))
Y1.shape, Y2.shape

(Output: torch.Size([2, 55, 20, 20]) and torch.Size([2, 33, 10, 10]))
As we can see, except for the batch size dimension, the other three dimensions all have different sizes. To link these two prediction outputs for more efficient computation, we will convert these tensors into a more consistent format.

The channel dimension holds the predictions for anchor boxes with the same center, so we first move this dimension to the innermost position. Since the batch size remains the same at different scales, we can transform each prediction output into a two-dimensional tensor of shape (batch size, height $\times$ width $\times$ number of channels), so that the outputs can later be concatenated along dimension 1.

def flatten_pred(pred):
    return torch.flatten(pred.permute(0, 2, 3, 1), start_dim=1)

def concat_preds(preds):
    return torch.cat([flatten_pred(p) for p in preds], dim=1)
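For instance, concatenating the two prediction outputs above (a quick check under the shapes assumed in this example): Y1 flattens to 55 * 20 * 20 = 22000 columns and Y2 to 33 * 10 * 10 = 3300, for 25300 in total.

print(concat_preds([Y1, Y2]).shape)  # expected: torch.Size([2, 25300])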

Height and width halving block

The downsampling block is a commonly used convolutional block in object detection. Its function is to halve the height and width of the input feature map while changing (here, increasing) the number of channels. Such a height and width halving block usually consists of convolutional layers and a pooling layer, where the convolutional layers change the number of channels and the pooling layer halves the height and width.

def down_sample_blk(in_channels, out_channels):
    blk = []
    for _ in range(2):
        blk.append(nn.Conv2d(in_channels, out_channels,
                             kernel_size=3, padding=1))
        blk.append(nn.BatchNorm2d(out_channels))
        blk.append(nn.ReLU())
        in_channels = out_channels
    blk.append(nn.MaxPool2d(2))
    return nn.Sequential(*blk)

This code implements a downsampling block that halves the height and width of the input feature map while changing the number of channels. The function takes the number of input channels in_channels and the number of output channels out_channels, and returns a sequence (of type nn.Sequential) made up of convolutional layers, batch normalization layers, ReLU activations, and a max-pooling layer (see the usage sketch below).

  1. First, an empty list blk is created to store the layers of the block.
  2. Then, a for loop appends two groups of convolutional layer, batch normalization layer, and ReLU activation to blk. Each convolutional layer uses a $3\times 3$ kernel with a padding of 1, which keeps the spatial size of the feature map unchanged. The first convolutional layer takes in_channels input channels; each subsequent convolutional layer takes the previous layer's out_channels as its number of input channels.
  3. A $2\times 2$ max-pooling layer is appended to halve the height and width of the feature map.
  4. Finally, all layers in blk are assembled into an nn.Sequential, which is returned.
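A quick usage sketch (the toy input shape here is only for illustration): a block mapping 3 channels to 10 halves a $20\times 20$ input.

print(forward(torch.zeros((2, 3, 20, 20)), down_sample_blk(3, 10)).shape)
# expected: torch.Size([2, 10, 10, 10])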

Base network block

To keep computation simple, we construct a small base network that chains three height and width halving blocks and gradually doubles the number of channels. Given an input image of shape $256\times 256$, this base network outputs $32\times 32$ feature maps.

def base_net():
    blk = []
    num_filters = [3, 16, 32, 64]
    for i in range(len(num_filters) - 1):
        blk.append(down_sample_blk(num_filters[i], num_filters[i+1]))
    return nn.Sequential(*blk)

forward(torch.zeros((2, 3, 256, 256)), base_net()).shape
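With three halving blocks (256 to 128 to 64 to 32 pixels, and 3 to 16 to 32 to 64 channels), the printed shape should be torch.Size([2, 64, 32, 32]).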

The complete model

The complete single-shot multibox detection model consists of five modules. The feature maps produced by each block are used both to generate anchor boxes and to predict the classes and offsets of these anchor boxes. Among the five modules, the first is the base network block, the second to fourth are height and width halving blocks, and the last module uses global max pooling to reduce both the height and the width to 1. Technically speaking, the second to fifth blocks are the multi-scale feature blocks.

def get_blk(i):
    if i == 0:
        blk = base_net()
    elif i == 1:
        blk = down_sample_blk(64, 128)
    elif i == 4:
        blk = nn.AdaptiveMaxPool2d((1,1))
    else:
        blk = down_sample_blk(128, 128)
    return blk

Forward propagation

#@save
def multibox_prior(data, sizes, ratios):
    """生成以每个像素为中心具有不同形状的锚框"""
    in_height, in_width = data.shape[-2:]
    device, num_sizes, num_ratios = data.device, len(sizes), len(ratios)
    boxes_per_pixel = (num_sizes + num_ratios - 1)
    size_tensor = torch.tensor(sizes, device=device)
    ratio_tensor = torch.tensor(ratios, device=device)

    # Offsets are required to move the anchor to the center of a pixel. Since
    # a pixel has height 1 and width 1, we choose to offset our centers by 0.5
    offset_h, offset_w = 0.5, 0.5
    steps_h = 1.0 / in_height  # Scaled steps along the y axis
    steps_w = 1.0 / in_width  # Scaled steps along the x axis

    # Generate all center points of the anchor boxes
    center_h = (torch.arange(in_height, device=device) + offset_h) * steps_h
    center_w = (torch.arange(in_width, device=device) + offset_w) * steps_w
    shift_y, shift_x = torch.meshgrid(center_h, center_w, indexing='ij')
    shift_y, shift_x = shift_y.reshape(-1), shift_x.reshape(-1)

    # Generate "boxes_per_pixel" heights and widths that are later used to
    # create the anchor box corner coordinates (xmin, xmax, ymin, ymax)
    w = torch.cat((size_tensor * torch.sqrt(ratio_tensor[0]),
                   sizes[0] * torch.sqrt(ratio_tensor[1:])))\
                   * in_height / in_width  # Handle rectangular inputs
    h = torch.cat((size_tensor / torch.sqrt(ratio_tensor[0]),
                   sizes[0] / torch.sqrt(ratio_tensor[1:])))
    # Divide by 2 to get half height and half width
    anchor_manipulations = torch.stack((-w, -h, w, h)).T.repeat(
                                        in_height * in_width, 1) / 2

    # Each center point will have "boxes_per_pixel" anchor boxes, so generate
    # a grid containing all anchor box centers, repeated "boxes_per_pixel" times
    out_grid = torch.stack([shift_x, shift_y, shift_x, shift_y],
                dim=1).repeat_interleave(boxes_per_pixel, dim=0)
    output = out_grid + anchor_manipulations
    return output.unsqueeze(0)

For a detailed explanation of the code above, see: click here.

def blk_forward(X, blk, size, ratio, cls_predictor, bbox_predictor):
    Y = blk(X)
    anchors = multibox_prior(Y, sizes=size, ratios=ratio)
    cls_preds = cls_predictor(Y)
    bbox_preds = bbox_predictor(Y)
    return (Y, anchors, cls_preds, bbox_preds)

This code implements the forward propagation of one block (blk): feature extraction, anchor box generation, and class and bounding box prediction. Its arguments are the input feature map X, the block blk, the anchor box scales size, the anchor box aspect ratios ratio, the class predictor cls_predictor, and the bounding box predictor bbox_predictor. The return value is a tuple containing the output feature map Y, the anchor boxes, the class predictions, and the bounding box predictions. (The per-scale anchor settings passed in as size and ratio are defined right after this list.)

  1. The input feature map X is passed through the block blk, which produces the output feature map Y.
  2. The multibox_prior function generates a set of anchor boxes, where the sizes parameter gives the anchor box scales and the ratios parameter gives the aspect ratios. It generates anchors according to the spatial size of the feature map and these parameters, returning a tensor of shape $(1, A, 4)$, where $A$ is the number of anchor boxes.
  3. The output feature map Y is fed to the class predictor cls_predictor, giving the class prediction cls_preds. Since cls_predictor is a convolutional layer, its raw output has shape (batch size, $a(q+1)$, height, width); it is reshaped into per-anchor class scores later, in TinySSD's forward method.
  4. Similarly, Y is fed to the bounding box predictor bbox_predictor, giving the bounding box prediction bbox_preds with shape (batch size, $4a$, height, width).
  5. Finally, the output feature map Y, the anchor boxes, the class predictions, and the bounding box predictions are returned as a tuple.
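The TinySSD class below refers to sizes, ratios, and num_anchors. A minimal sketch of these definitions, following the anchor scale and ratio choices used in the d2l book (treat the exact values as an assumption rather than the only valid choice):

sizes = [[0.2, 0.272], [0.37, 0.447], [0.54, 0.619], [0.71, 0.79],
         [0.88, 0.961]]
ratios = [[1, 2, 0.5]] * 5
num_anchors = len(sizes[0]) + len(ratios[0]) - 1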
class TinySSD(nn.Module):
    def __init__(self, num_classes, **kwargs):
        super(TinySSD, self).__init__(**kwargs)
        self.num_classes = num_classes
        idx_to_in_channels = [64, 128, 128, 128, 128]
        for i in range(5):
            # Equivalent to the assignment self.blk_i = get_blk(i)
            setattr(self, f'blk_{i}', get_blk(i))
            setattr(self, f'cls_{i}', cls_predictor(idx_to_in_channels[i],
                                                    num_anchors, num_classes))
            setattr(self, f'bbox_{i}', bbox_predictor(idx_to_in_channels[i],
                                                      num_anchors))

    def forward(self, X):
        anchors, cls_preds, bbox_preds = [None] * 5, [None] * 5, [None] * 5
        for i in range(5):
            # getattr(self, 'blk_%d' % i) accesses self.blk_i
            X, anchors[i], cls_preds[i], bbox_preds[i] = blk_forward(
                X, getattr(self, f'blk_{i}'), sizes[i], ratios[i],
                getattr(self, f'cls_{i}'), getattr(self, f'bbox_{i}'))
        anchors = torch.cat(anchors, dim=1)
        cls_preds = concat_preds(cls_preds)
        cls_preds = cls_preds.reshape(
            cls_preds.shape[0], -1, self.num_classes + 1)
        bbox_preds = concat_preds(bbox_preds)
        return anchors, cls_preds, bbox_preds

This code defines a class named TinySSD, a small object detection network based on the SSD (Single Shot MultiBox Detector) model that can be used to detect objects in images. The class inherits from PyTorch's nn.Module, so it can take advantage of features such as automatic differentiation and model parameter management provided by PyTorch.

Key members of the TinySSD class include:

  1. The __init__ method initializes the model parameters. Its arguments are the number of object classes num_classes and optional keyword arguments **kwargs. It first calls the parent class's __init__, then sets the model's attributes based on the arguments. Here, idx_to_in_channels is a list giving the number of input channels of each block; get_blk(i) returns the $i$-th block; and cls_predictor and bbox_predictor build the class predictor and the bounding box predictor used to predict the class and location of objects.
  2. The forward method implements the forward propagation of the model. Its input is the image minibatch X, and its outputs are the anchor boxes anchors, the class predictions cls_preds, and the bounding box predictions bbox_preds. It first initializes anchors, cls_preds, and bbox_preds to lists of None, then runs the five blocks one by one to obtain each block's output feature map together with the corresponding anchor boxes, class predictions, and bounding box predictions. Finally, the anchor boxes, class predictions, and bounding box predictions of all blocks are concatenated to obtain the final outputs.
net = TinySSD(num_classes=1)
X = torch.zeros((32, 3, 256, 256))
anchors, cls_preds, bbox_preds = net(X)

print('output anchors:', anchors.shape)
print('output class preds:', cls_preds.shape)
print('output bbox preds:', bbox_preds.shape)
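If the anchor settings sketched above are used, the five feature maps are 32x32, 16x16, 8x8, 4x4, and 1x1 with 4 anchors per unit, i.e. 5444 anchors in total, so the printed shapes should be torch.Size([1, 5444, 4]) for the anchors, torch.Size([32, 5444, 2]) for the class predictions, and torch.Size([32, 21776]) for the bounding box predictions.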


Training

Load the dataset

batch_size = 32
train_iter, _ = d2l.load_data_bananas(batch_size)
device, net = d2l.try_gpu(), TinySSD(num_classes=1)
trainer = torch.optim.SGD(net.parameters(), lr=0.2, weight_decay=5e-4)

Define the loss function

Object detection has two types of losses. The first concerns the classes of the anchor boxes and can reuse the cross-entropy loss from image classification. The second concerns the offsets of the positive (non-background) anchor boxes; this is a regression problem, and here we use the $L_1$ norm loss, i.e., the absolute value of the difference between the predicted and true values. The mask variable bbox_masks filters out negative anchor boxes and padded (illegal) anchor boxes from the offset loss. Finally, we sum the anchor box class loss and offset loss to obtain the model's loss function.

cls_loss = nn.CrossEntropyLoss(reduction='none')
bbox_loss = nn.L1Loss(reduction='none')

def calc_loss(cls_preds, cls_labels, bbox_preds, bbox_labels, bbox_masks):
    batch_size, num_classes = cls_preds.shape[0], cls_preds.shape[2]
    cls = cls_loss(cls_preds.reshape(-1, num_classes),
                   cls_labels.reshape(-1)).reshape(batch_size, -1).mean(dim=1)
    bbox = bbox_loss(bbox_preds * bbox_masks,
                     bbox_labels * bbox_masks).mean(dim=1)
    return cls + bbox
def cls_eval(cls_preds, cls_labels):
    # Since the class prediction results are on the final dimension, argmax
    # needs to specify this dimension
    return float((cls_preds.argmax(dim=-1).type(
        cls_labels.dtype) == cls_labels).sum())

def bbox_eval(bbox_preds, bbox_labels, bbox_masks):
    return float((torch.abs((bbox_labels - bbox_preds) * bbox_masks)).sum())

Utility functions (labeling each anchor box with a class and an offset)

For a code explanation, see: click here.

def box_iou(boxes1, boxes2):
    """Compute pairwise IoU across two lists of anchor or bounding boxes.
    Defined in :numref:`sec_anchor`"""
    """计算两个锚框或边界框列表中成对的交并比"""
    box_area = lambda boxes: ((boxes[:, 2] - boxes[:, 0]) *
                              (boxes[:, 3] - boxes[:, 1]))
    # Shape of `boxes1`, `boxes2`, `areas1`, `areas2`: (no. of boxes1, 4),
    # (no. of boxes2, 4), (no. of boxes1,), (no. of boxes2,)
    areas1 = box_area(boxes1)
    areas2 = box_area(boxes2)
    # Shape of `inter_upperlefts`, `inter_lowerrights`, `inters`: (no. of
    # boxes1, no. of boxes2, 2)
    inter_upperlefts = torch.max(boxes1[:, None, :2], boxes2[:, :2])
    inter_lowerrights = torch.min(boxes1[:, None, 2:], boxes2[:, 2:])
    inters = (inter_lowerrights - inter_upperlefts).clamp(min=0)
    # Shape of `inter_areas` and `union_areas`: (no. of boxes1, no. of boxes2)
    inter_areas = inters[:, :, 0] * inters[:, :, 1]
    union_areas = areas1[:, None] + areas2 - inter_areas
    return inter_areas / union_areas
def assign_anchor_to_bbox(ground_truth, anchors, device, iou_threshold=0.5):
    """Assign closest ground-truth bounding boxes to anchor boxes.
    Defined in :numref:`sec_anchor`
    将最接近的真实边界框分配给锚框"""
    num_anchors, num_gt_boxes = anchors.shape[0], ground_truth.shape[0]
    # Element x_ij in the i-th row and j-th column is the IoU of the anchor
    # box i and the ground-truth bounding box j
    jaccard = box_iou(anchors, ground_truth)
    # Initialize the tensor to hold the assigned ground-truth bounding box for
    # each anchor
    anchors_bbox_map = torch.full((num_anchors,), -1, dtype=torch.long,
                                  device=device)
    # Assign ground-truth bounding boxes according to the threshold
    max_ious, indices = torch.max(jaccard, dim=1)
    anc_i = torch.nonzero(max_ious >= iou_threshold).reshape(-1)
    box_j = indices[max_ious >= iou_threshold]
    anchors_bbox_map[anc_i] = box_j
    col_discard = torch.full((num_anchors,), -1)
    row_discard = torch.full((num_gt_boxes,), -1)
    for _ in range(num_gt_boxes):
        max_idx = torch.argmax(jaccard)  # Find the largest IoU
        box_idx = (max_idx % num_gt_boxes).long()
        anc_idx = (max_idx / num_gt_boxes).long()
        anchors_bbox_map[anc_idx] = box_idx
        jaccard[:, box_idx] = col_discard
        jaccard[anc_idx, :] = row_discard
    return anchors_bbox_map
def box_corner_to_center(boxes):
    """Convert from (upper-left, lower-right) to (center, width, height).

    Defined in :numref:`sec_bbox`"""
    x1, y1, x2, y2 = boxes[:, 0], boxes[:, 1], boxes[:, 2], boxes[:, 3]
    cx = (x1 + x2) / 2
    cy = (y1 + y2) / 2
    w = x2 - x1
    h = y2 - y1
    boxes = d2l.stack((cx, cy, w, h), axis=-1)
    return boxes

def offset_boxes(anchors, assigned_bb, eps=1e-6):
    """Transform for anchor box offsets.

    Defined in :numref:`subsec_labeling-anchor-boxes`"""

    """对锚框偏移量的转换"""

    c_anc = box_corner_to_center(anchors)
    c_assigned_bb = box_corner_to_center(assigned_bb)
    offset_xy = 10 * (c_assigned_bb[:, :2] - c_anc[:, :2]) / c_anc[:, 2:]
    offset_wh = 5 * torch.log(eps + c_assigned_bb[:, 2:] / c_anc[:, 2:])
    offset = torch.cat([offset_xy, offset_wh], axis=1)
    return offset
def multibox_target(anchors, labels):
    """Label anchor boxes using ground-truth bounding boxes.

    Defined in :numref:`subsec_labeling-anchor-boxes`"""

    """使用真实边界框标记锚框"""

    batch_size, anchors = labels.shape[0], anchors.squeeze(0)
    batch_offset, batch_mask, batch_class_labels = [], [], []
    device, num_anchors = anchors.device, anchors.shape[0]
    for i in range(batch_size):
        label = labels[i, :, :]
        anchors_bbox_map = assign_anchor_to_bbox(
            label[:, 1:], anchors, device)
        bbox_mask = ((anchors_bbox_map >= 0).float().unsqueeze(-1)).repeat(
            1, 4)
        # Initialize class labels and assigned bounding box coordinates with
        # zeros
        class_labels = torch.zeros(num_anchors, dtype=torch.long,
                                   device=device)
        assigned_bb = torch.zeros((num_anchors, 4), dtype=torch.float32,
                                  device=device)
        # Label classes of anchor boxes using their assigned ground-truth
        # bounding boxes. If an anchor box is not assigned any, we label its
        # class as background (the value remains zero)
        indices_true = torch.nonzero(anchors_bbox_map >= 0)
        bb_idx = anchors_bbox_map[indices_true]
        class_labels[indices_true] = label[bb_idx, 0].long() + 1
        assigned_bb[indices_true] = label[bb_idx, 1:]
        # Offset transformation
        offset = offset_boxes(anchors, assigned_bb) * bbox_mask
        batch_offset.append(offset.reshape(-1))
        batch_mask.append(bbox_mask.reshape(-1))
        batch_class_labels.append(class_labels)
    bbox_offset = torch.stack(batch_offset)
    bbox_mask = torch.stack(batch_mask)
    class_labels = torch.stack(batch_class_labels)
    return (bbox_offset, bbox_mask, class_labels)
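A small usage sketch of multibox_target (the toy anchors and ground truth below are made-up values chosen only for illustration): it returns a tuple of offsets, masks, and class labels with shapes (batch size, 4 * number of anchors), (batch size, 4 * number of anchors), and (batch size, number of anchors), respectively.

ground_truth = torch.tensor([[[0, 0.1, 0.08, 0.52, 0.92]]])  # one image, one banana: (class, x1, y1, x2, y2)
toy_anchors = torch.tensor([[[0.0, 0.1, 0.2, 0.3],
                             [0.15, 0.2, 0.4, 0.4],
                             [0.63, 0.05, 0.88, 0.98]]])      # shape (1, 3, 4)
offsets, masks, class_labels = multibox_target(toy_anchors, ground_truth)
print(offsets.shape, masks.shape, class_labels.shape)
# expected: torch.Size([1, 12]) torch.Size([1, 12]) torch.Size([1, 3])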

Training loop

num_epochs, timer = 20, d2l.Timer()
animator = d2l.Animator(xlabel='epoch', xlim=[1, num_epochs],
                        legend=['class error', 'bbox mae'])
net = net.to(device)
for epoch in range(num_epochs):
    # Sum of training accuracy, no. of examples in that sum,
    # sum of absolute error, no. of examples in that sum
    metric = d2l.Accumulator(4)
    net.train()
    for features, target in train_iter:
        timer.start()
        trainer.zero_grad()
        X, Y = features.to(device), target.to(device)
        # Generate multi-scale anchor boxes and predict their classes and
        # offsets
        anchors, cls_preds, bbox_preds = net(X)
        # Label the classes and offsets of these anchor boxes
        bbox_labels, bbox_masks, cls_labels = multibox_target(anchors, Y)
        # Calculate the loss function using the predicted and labeled values
        # of the classes and offsets
        l = calc_loss(cls_preds, cls_labels, bbox_preds, bbox_labels,
                      bbox_masks)
        l.mean().backward()
        trainer.step()
        metric.add(cls_eval(cls_preds, cls_labels), cls_labels.numel(),
                   bbox_eval(bbox_preds, bbox_labels, bbox_masks),
                   bbox_labels.numel())
    cls_err, bbox_mae = 1 - metric[0] / metric[1], metric[2] / metric[3]
    animator.add(epoch + 1, (cls_err, bbox_mae))
print(f'class err {cls_err:.2e}, bbox mae {bbox_mae:.2e}')
print(f'{len(train_iter.dataset) / timer.stop():.1f} examples/sec on '
      f'{str(device)}')


Prediction

# Prediction
X = torchvision.io.read_image('banana.jpg').unsqueeze(0).float()
img = X.squeeze(0).permute(1, 2, 0).long()
#@save
def offset_inverse(anchors, offset_preds):
    """根据带有预测偏移量的锚框来预测边界框"""
    anc = d2l.box_corner_to_center(anchors)
    pred_bbox_xy = (offset_preds[:, :2] * anc[:, 2:] / 10) + anc[:, :2]
    pred_bbox_wh = torch.exp(offset_preds[:, 2:] / 5) * anc[:, 2:]
    pred_bbox = torch.cat((pred_bbox_xy, pred_bbox_wh), axis=1)
    predicted_bbox = d2l.box_center_to_corner(pred_bbox)
    return predicted_bbox

#@save
def nms(boxes, scores, iou_threshold):
    """对预测边界框的置信度进行排序"""
    B = torch.argsort(scores, dim=-1, descending=True)
    keep = []  # 保留预测边界框的指标
    while B.numel() > 0:
        i = B[0]
        keep.append(i)
        if B.numel() == 1: break
        iou = box_iou(boxes[i, :].reshape(-1, 4),
                      boxes[B[1:], :].reshape(-1, 4)).reshape(-1)
        inds = torch.nonzero(iou <= iou_threshold).reshape(-1)
        B = B[inds + 1]
    return torch.tensor(keep, device=boxes.device)

#@save
def multibox_detection(cls_probs, offset_preds, anchors, nms_threshold=0.5,
                       pos_threshold=0.009999999):
    """使用非极大值抑制来预测边界框"""
    device, batch_size = cls_probs.device, cls_probs.shape[0]
    anchors = anchors.squeeze(0)
    num_classes, num_anchors = cls_probs.shape[1], cls_probs.shape[2]
    out = []
    for i in range(batch_size):
        cls_prob, offset_pred = cls_probs[i], offset_preds[i].reshape(-1, 4)
        conf, class_id = torch.max(cls_prob[1:], 0)
        predicted_bb = offset_inverse(anchors, offset_pred)
        keep = nms(predicted_bb, conf, nms_threshold)

        # Find all non_keep indices and set their classes to background
        all_idx = torch.arange(num_anchors, dtype=torch.long, device=device)
        combined = torch.cat((keep, all_idx))
        uniques, counts = combined.unique(return_counts=True)
        non_keep = uniques[counts == 1]
        all_id_sorted = torch.cat((keep, non_keep))
        class_id[non_keep] = -1
        class_id = class_id[all_id_sorted]
        conf, predicted_bb = conf[all_id_sorted], predicted_bb[all_id_sorted]
        # pos_threshold is a threshold for positive (non-background) predictions
        below_min_idx = (conf < pos_threshold)
        class_id[below_min_idx] = -1
        conf[below_min_idx] = 1 - conf[below_min_idx]
        pred_info = torch.cat((class_id.unsqueeze(1),
                               conf.unsqueeze(1),
                               predicted_bb), dim=1)
        out.append(pred_info)
    return torch.stack(out)

def predict(X):
    net.eval()
    anchors, cls_preds, bbox_preds = net(X.to(device))
    cls_probs = F.softmax(cls_preds, dim=2).permute(0, 2, 1)
    output = multibox_detection(cls_probs, bbox_preds, anchors)
    idx = [i for i, row in enumerate(output[0]) if row[0] != -1]
    return output[0, idx]

output = predict(X)
def display(img, output, threshold):
    d2l.set_figsize((5, 5))
    fig = d2l.plt.imshow(img)
    for row in output:
        score = float(row[1])
        if score < threshold:
            continue
        h, w = img.shape[0:2]
        bbox = [row[2:6] * torch.tensor((w, h, w, h), device=row.device)]
        d2l.show_bboxes(fig.axes, bbox, '%.2f' % score, 'w')

display(img, output.cpu(), threshold=0.9)


Origin blog.csdn.net/qq_51957239/article/details/130938800