Build It Yourself: YOLOv3 (1), Network Architecture

This is day 6 of my participation in the Juejin "Daily New Plan · August Writing Challenge".

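Preface

With YOLOv3, the network structure of YOLO essentially took its final shape. YOLOv4 then applied many small tricks on top of the YOLOv3 framework. Two years after finishing YOLOv3, the original author announced that he was leaving the CV field, so YOLOv4 and YOLOv5 are not the work of YOLO's original author.

YOLOv3 was originally implemented on the Darknet deep learning framework. I am not familiar with Darknet, so I chose PyTorch, which I know better, to implement YOLOv3. Many people have already implemented it in PyTorch, and this post borrows directly from those implementations.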

import torch
import torch.nn as nn

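Prerequisites

Familiarity with deep learning and the object detection task
Some understanding of what one-stage object detection is
Some knowledge of the YOLOv1 and YOLOv2 network structures, loss functions, and so on; see my earlier videos or articles
A working knowledge of PyTorch syntax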

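Network structure

YOLOv3 provides output feature layers at 3 different scales for detecting objects of different sizes; that is, predictions are made on 3 feature layers of different scales. The backbone downsamples the input image by a factor of 32, so for a $416 \times 416 \times 3$ RGB input the deepest feature map is $13 \times 13$. The convolution blocks in Darknet53 are residual blocks, borrowed from ResNet, and the backbone passes through 5 groups of them. The last 3 groups output feature layers of $52 \times 52$, $26 \times 26$ and $13 \times 13$ respectively. These layers have different resolutions and are used to detect small, medium-sized and large objects, in that order.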
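As a quick check of the sizes mentioned above, the grid size at each detection scale is simply the input size divided by the stride. The following is a minimal sketch in plain Python; the helper name `grid_sizes` is made up for illustration, and it assumes the input size is a multiple of 32.

```python
def grid_sizes(img_size, strides=(32, 16, 8)):
    """Return the feature-map side length for each downsampling stride."""
    if img_size % max(strides) != 0:
        raise ValueError("img_size should be a multiple of the largest stride")
    return [img_size // s for s in strides]


print(grid_sizes(416))  # [13, 26, 52]
```

The coarsest grid (stride 32) has the largest receptive field per cell, which is why it is the one associated with large objects.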

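The YOLOv3 backbone is a standard convolutional neural network. It resembles the earlier Darknet networks, with residual structures added, and an FPN structure is added on top of it. A feature map output by the backbone passes through a series of convolution layers, and a $1\times 1$ convolution layer then does the detection. A higher-level feature layer is upsampled to the same size as the next lower-level feature layer and the two are concatenated, which merges the abstract semantic information of the higher level into the lower-level feature layer.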
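The upsample-then-concat step can be followed on shapes alone. The sketch below works on plain (channels, height, width) tuples rather than tensors; `upsample2x` and `concat_channels` are hypothetical helpers, and the channel counts are assumed from the configuration used later in this post for a 416 x 416 input.

```python
def upsample2x(shape):
    """2x upsampling keeps the channels and doubles height and width."""
    c, h, w = shape
    return (c, 2 * h, 2 * w)


def concat_channels(a, b):
    """Concatenate along the channel axis; spatial sizes must match."""
    if a[1:] != b[1:]:
        raise ValueError(f"spatial sizes differ: {a[1:]} vs {b[1:]}")
    return (a[0] + b[0], a[1], a[2])


high = (256, 13, 13)  # deeper feature layer with more semantic information
low = (512, 26, 26)   # shallower backbone feature layer
fused = concat_channels(upsample2x(high), low)
print(fused)  # (768, 26, 26)
```

Because the two layers are concatenated rather than added, their channel counts do not have to match; only the spatial sizes do.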

YOLOv3

Implementing the network architecture

A list is used to describe the model structure clearly.

config = [
    (32, 3, 1),
    (64, 3, 2),
    ["B", 1],
    (128, 3, 2),
    ["B", 2],
    (256, 3, 2),
    ["B", 8],  # output is saved for a later concat
    (512, 3, 2),
    ["B", 8],
    (1024, 3, 2),
    ["B", 4],  # To this point is Darknet-53
    (512, 1, 1),
    (1024, 3, 1),
    "S",  # scale prediction branch
    (256, 1, 1),
    "U",  # upsample
    (256, 1, 1),
    (512, 3, 1),
    "S",
    (128, 1, 1),
    "U",
    (128, 1, 1),
    (256, 3, 1),
    "S",
]
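The backbone part of this list can be sanity-checked with a few lines of plain Python. The sketch below reads each tuple as one convolution and each ["B", n] entry as n residual units of two convolutions each, which is how the classes later in this post build them. It assumes the backbone ends at the "Darknet-53" comment; the variable names are only illustrative.

```python
# Backbone part of the config list (up to the "Darknet-53" comment).
backbone = [
    (32, 3, 1),
    (64, 3, 2),
    ["B", 1],
    (128, 3, 2),
    ["B", 2],
    (256, 3, 2),
    ["B", 8],
    (512, 3, 2),
    ["B", 8],
    (1024, 3, 2),
    ["B", 4],
]

conv_layers = 0
total_stride = 1
for item in backbone:
    if isinstance(item, tuple):    # plain convolution block
        _, _, stride = item
        conv_layers += 1
        total_stride *= stride
    elif isinstance(item, list):   # repeated residual unit, two convs each
        conv_layers += 2 * item[1]

print(conv_layers, total_stride)  # 52 32
```

The count comes to 52 convolutions; as far as I know, the name Darknet-53 also counts the final classification layer, which is not part of this config. The total stride of 32 matches the 32x downsampling described earlier.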

Briefly, the elements of the list are distinguished by their type:

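tuple: a convolution block. In (32, 3, 1) the three numbers are the number of kernels (output channels), the kernel size and the stride.
list: a repeated residual structure; the second element is the number of repeats. Each residual unit first applies a 1x1 convolution to adjust the number of channels, then a 3x3 convolution, and finally adds the result to its input.
"U": upsampling. The upsampled output is concatenated with the output feature layer of an earlier residual stage. This differs from the traditional FPN, which adds the two feature layers element-wise.
"S": marks where the feature layer of one scale is output for prediction. These outputs belong to the neck module, where FPN is used to fuse the different scales.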

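Modularization

Common components of a network architecture are usually defined as classes so that they can be reused. YOLOv3 is a convolution-based neural network, and one structure recurs throughout it: a convolution layer, followed by a BN layer for normalization, followed by an activation layer. The activation function chosen here is LeakyReLU.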

class CNNBlock(nn.Module):
    """Conv2d followed by optional BatchNorm2d and LeakyReLU."""

    def __init__(self, in_channels, out_channels, bn_act=True, **kwargs):
        super().__init__()
        # BN has its own shift term, so the conv bias is only enabled without BN
        self.conv = nn.Conv2d(in_channels, out_channels, bias=not bn_act, **kwargs)
        self.bn = nn.BatchNorm2d(out_channels)
        self.leakyrelu = nn.LeakyReLU(0.1)
        self.use_bn_act = bn_act

    def forward(self, x):
        if self.use_bn_act:
            return self.leakyrelu(self.bn(self.conv(x)))
        return self.conv(x)

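bn_act controls whether BN and the activation layer are enabled. In the detection head, the convolution that produces the final predictions on the output feature layer needs neither BN nor an activation function. When BN and the activation are used, there is no need to enable the convolution bias.

When learning to implement code or reading source code, it helps to read the paper first and get an impression of its methods. While reading the paper, think about how each proposed method might be implemented, and ideally try implementing it yourself. Then take those questions to the source code and find the author's solutions there. That is how you make progress; otherwise, even if you understood the source at the time, you will soon forget it. After reading enough source code you develop a feel for it, and when you need to implement some feature you will have ideas about how to go about it. Just a small reflection.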
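Why the bias becomes redundant can be seen with a small numeric experiment: normalization subtracts the mean, so a constant added to every output cancels out. This is only a sketch in plain Python on a list of numbers, not on real feature maps; the `batch_norm` helper is made up for illustration and leaves out BN's learnable scale and shift.

```python
import math


def batch_norm(values, eps=1e-5):
    """Normalise a list of numbers to zero mean and unit variance."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(var + eps) for v in values]


conv_out = [0.5, -1.2, 3.0, 0.7]  # stand-in for convolution outputs of one channel
bias = 2.5                        # a constant bias added to every output

without_bias = batch_norm(conv_out)
with_bias = batch_norm([v + bias for v in conv_out])

same = all(math.isclose(a, b, abs_tol=1e-9) for a, b in zip(without_bias, with_bias))
print(same)  # True
```

Since the result is the same with or without the bias, the extra parameter would only be wasted when BN follows the convolution.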

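Residual module

This defines the residual structure, which basically consists of two convolution layers plus a shortcut connection. The first is a $1 \times 1$ convolution layer that halves the number of channels; the second convolution layer doubles it again, restoring the number of channels that entered the residual block. use_residual determines whether the shortcut is used.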

class ResidualBlock(nn.Module):
    """
    A ResidualBlock does not change the number of channels.
    """

    def __init__(self, channels, use_residual=True, num_repeats=1):
        super().__init__()
        self.layers = nn.ModuleList()
        for _ in range(num_repeats):
            self.layers.append(
                nn.Sequential(
                    CNNBlock(channels, channels // 2, kernel_size=1),
                    CNNBlock(channels // 2, channels, kernel_size=3, padding=1),
                )
            )
        self.use_residual = use_residual
        self.num_repeats = num_repeats

    def forward(self, x):
        for layer in self.layers:
            # add the shortcut only when the residual connection is enabled
            x = x + layer(x) if self.use_residual else layer(x)
        return x
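One practical effect of halving the channels before the 3x3 convolution is a smaller weight count. The sketch below compares one residual unit with a single full-width 3x3 convolution; it counts convolution weights only (no bias, no BN parameters), and the helper names are hypothetical.

```python
def conv_weights(in_ch, out_ch, kernel_size):
    """Number of weights in a bias-free 2D convolution."""
    return in_ch * out_ch * kernel_size * kernel_size


def residual_unit_weights(channels):
    """1x1 conv that halves the channels, then 3x3 conv that restores them."""
    squeeze = conv_weights(channels, channels // 2, 1)
    expand = conv_weights(channels // 2, channels, 3)
    return squeeze + expand


channels = 256
print(residual_unit_weights(channels))      # 327680
print(conv_weights(channels, channels, 3))  # 589824
```

For 256 channels, the two-layer unit uses about 5/9 of the weights of one plain 3x3 convolution at full width.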

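Some things look simple, but doing them yourself can turn out harder than you imagined. Other things look out of reach, beyond your ability, but if you settle down and work at them seriously, they turn out not to be as hard as you imagined.

Multi-scale prediction component

Two convolutions are usually applied to the output feature map: first a $3 \times 3$ convolution that doubles the number of channels, then a $1 \times 1$ convolution whose number of output channels is (num_classes + 5) * anchors_per_scale.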

class ScalePrediction(nn.Module):
    """
    in_channels: number of channels of the input feature map
    num_classes: number of object classes
    """

    def __init__(self, in_channels, num_classes, anchors_per_scale=3):
        super().__init__()
        self.num_classes = num_classes
        self.anchors_per_scale = anchors_per_scale
        self.pred = nn.Sequential(
            CNNBlock(in_channels, 2 * in_channels, kernel_size=3, padding=1),
            # per anchor: [prob, x, y, w, h] plus the class scores
            CNNBlock(2 * in_channels, (num_classes + 5) * anchors_per_scale, bn_act=False, kernel_size=1),
        )

    def forward(self, x):
        # output: [batch_size, anchors_per_scale, grid_size, grid_size, 5 + num_classes]
        return (
            self.pred(x)
            .reshape(x.shape[0], self.anchors_per_scale, self.num_classes + 5, x.shape[2], x.shape[3])
            .permute(0, 1, 3, 4, 2)
        )
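The reshape and permute in `forward` are easier to follow on shapes. The sketch below uses plain tuples instead of tensors and assumes 20 classes, 3 anchors and a 13 x 13 grid; `split_channel` is a hypothetical helper that shows how a flat output channel maps to an anchor and a value within that anchor.

```python
num_classes = 20
anchors_per_scale = 3
per_anchor = num_classes + 5  # objectness + 4 box values + class scores

out_channels = per_anchor * anchors_per_scale
print(out_channels)  # 75


def split_channel(c):
    """Map a flat output channel to (anchor index, value index), as the reshape does."""
    return divmod(c, per_anchor)


raw = (2, out_channels, 13, 13)  # (batch, channels, H, W)
reshaped = (raw[0], anchors_per_scale, per_anchor, raw[2], raw[3])
permuted = tuple(reshaped[i] for i in (0, 1, 3, 4, 2))  # same order as .permute(0, 1, 3, 4, 2)
print(permuted)           # (2, 3, 13, 13, 25)
print(split_channel(26))  # (1, 1)
```

Moving the per-anchor values to the last axis means that indexing with [..., 0] or [..., 1:5] selects the same kind of value for every anchor and grid cell.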

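Fingers move nimbly from key to key and the code appears on the screen line by line. Each block of code states a piece of logic, without any emotion; wordy, superfluous description here would only spoil its simplicity.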

class YOLOv3(nn.Module):
    def __init__(self, in_channels=3,num_classes=80):
        super().__init__()
        self.num_classes = num_classes
        self.in_channels = in_channels
        self.layers = self._create_conv_layers()
        
    # forward pass
    def forward(self, x):
        # list that collects the output feature layers, one per scale
        outputs = []
        # backbone outputs saved for the later concat
        route_connections = []
        for layer in self.layers:
            # a ScalePrediction module: add its result to the outputs, then skip the rest and go to the next iteration
            if isinstance(layer, ScalePrediction):
                outputs.append(layer(x))
                continue
            x = layer(x)
            # a residual block repeated 8 times: save its output in route_connections for later use
            if isinstance(layer, ResidualBlock) and layer.num_repeats == 8:
                route_connections.append(x)
            # an upsample layer: concat its output with the most recently saved residual output to fuse the feature layers
            elif isinstance(layer, nn.Upsample):
                x = torch.cat([x, route_connections[-1]], dim=1)
                route_connections.pop()

        return outputs
    
    def _create_conv_layers(self):
        # create a ModuleList first; each layer is added to it as it is built
        layers = nn.ModuleList()
        # number of input channels
        in_channels = self.in_channels
        # walk through the config list
        for module in config:
            # tuple: read the number of kernels, kernel size and stride, and create a convolution block
            if isinstance(module, tuple):
                out_channels, kernel_size, stride = module
                layers.append(
                    CNNBlock(
                        in_channels,
                        out_channels,
                        kernel_size=kernel_size,
                        stride=stride,
                        padding=1 if kernel_size == 3 else 0,
                    )
                )
                # the output channels of this block become the next input channels
                in_channels = out_channels
            elif isinstance(module, list):
                # residual blocks change neither the feature map size nor the number of channels
                num_repeats = module[1]
                layers.append(ResidualBlock(in_channels, num_repeats=num_repeats))
            elif isinstance(module, str):
                # "S" marks an output; each scale first goes through a Res structure (without shortcut)
                if module == "S":
                    layers += [
                        ResidualBlock(in_channels, use_residual=False, num_repeats=1),
                        CNNBlock(in_channels, in_channels // 2, kernel_size=1),
                        ScalePrediction(in_channels // 2, num_classes=self.num_classes),
                    ]
                    in_channels = in_channels // 2
                elif module == "U":
                    layers.append(nn.Upsample(scale_factor=2))
                    # after the concat: upsampled channels plus a route with twice as many
                    in_channels = in_channels * 3

        return layers

# Test function
def test():
    num_classes = 20
    model = YOLOv3(num_classes=num_classes)
    img_size = 416
    x = torch.randn((2,3,img_size,img_size))
    out = model(x)
    assert out[0].shape == (2,3,img_size//32,img_size//32, 5 + num_classes)
    assert out[1].shape == (2,3,img_size//16,img_size//16, 5 + num_classes)
    assert out[2].shape == (2,3,img_size//8,img_size//8, 5 + num_classes)
    
test()
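The channel bookkeeping in `_create_conv_layers`, in particular the `in_channels * 3` after an upsample, can be traced without building any layers. The following is a sketch in plain Python under the same rules as the model code; `trace_channels` is a hypothetical helper, and the config is copied from the list defined earlier in this post.

```python
config = [
    (32, 3, 1), (64, 3, 2), ["B", 1], (128, 3, 2), ["B", 2],
    (256, 3, 2), ["B", 8], (512, 3, 2), ["B", 8], (1024, 3, 2), ["B", 4],
    (512, 1, 1), (1024, 3, 1), "S",
    (256, 1, 1), "U", (256, 1, 1), (512, 3, 1), "S",
    (128, 1, 1), "U", (128, 1, 1), (256, 3, 1), "S",
]


def trace_channels(config, in_channels=3):
    """Follow channel counts through the config without building any layers."""
    routes = []   # channels of the saved backbone outputs (the 8-repeat stages)
    heads = []    # input channels of each scale-prediction head
    concats = []  # (upsampled channels, route channels) at each concat
    for module in config:
        if isinstance(module, tuple):
            in_channels = module[0]
        elif isinstance(module, list):
            if module[1] == 8:
                routes.append(in_channels)
        elif module == "S":
            in_channels //= 2
            heads.append(in_channels)
        elif module == "U":
            skip = routes.pop()
            concats.append((in_channels, skip))
            in_channels += skip
    return heads, concats


heads, concats = trace_channels(config)
print(heads)    # [512, 256, 128]
print(concats)  # [(256, 512), (128, 256)]
```

At both concat points the saved route has exactly twice as many channels as the upsampled tensor, which is why multiplying by 3 gives the right channel count for this particular config. A different config would need the route channels tracked explicitly, as this sketch does.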