YOLOv6 Pro | Modifying the YOLOv6 Network (1): RepGFPN, a Reparameterized Object Detection Neck with Efficient Layer Aggregation (ELAN), from DAMO-YOLO

In the ICLR 2022 paper "GiraffeDet: A Heavy-Neck Paradigm for Object Detection", Alibaba DAMO Academy proposed GiraffeDet, which pairs a very lightweight backbone with a computationally heavy neck, so that the network focuses on the exchange between spatial information in high-resolution feature maps and semantic information in low-resolution feature maps. In DAMO-YOLO, open-sourced at the end of November 2022, they used the GFPN idea again: starting from the queen-fusion GFPN, they added the ideas of the Efficient Layer Aggregation Network (ELAN) and reparameterization to form a new neck network, RepGFPN. While the topic is still hot, this article ports RepGFPN into the YOLOv6 neck within the YOLOv6 Pro framework. The same modification can also be applied to YOLOv5.

As usual, let me first introduce the YOLOv6 Pro framework.

· YOLOv6 Pro is based on the overall architecture of the official YOLOv6 and uses the YOLOv5 way of building networks to construct the YOLOv6 network, including the backbone, neck, and effidehead structures.
· You can modify or add modules freely in the yaml file, and each modified file runs independently, which is meant to make research experiments easier.
· More network structure improvements based on the modules in yolov5 and yoloair will be added in the future.
· The pre-trained weights have been converted from the official weights so that they match.
· A p6 model is pre-released (unofficial).
· Some other improved modules have been added, such as RepGFPN, FocalTransformer, RepGhost, CoAtNet, etc.

The yoloair and YOLOv6 Pro frameworks we used won first place in the IEEE UV 2022 "Vision Meets Algae" object detection competition!

Project link: GitHub - yang-0201/YOLOv6_pro: Make it easier for yolov6 to change the network structure

If you are interested, please Star and Fork the project, and report any problems you run into. The project is at an early stage, so feature suggestions will be considered and developed. Like-minded readers are also welcome to submit PRs to help maintain and develop the project. The project will continue to be updated and improved, so stay tuned!

Now to the main topic!

DAMO-YOLO paper: https://arxiv.org/pdf/2211.15444.pdf

This is the network structure diagram of the whole DAMO-YOLO, which consists of the MAE-NAS backbone obtained through neural architecture search (NAS), RepGFPN, and a ZeroHead. Today we focus on RepGFPN, whose main module is the Fusion Block.

The authors believe that one of the main reasons GFPN works well is that it fully exchanges high-level semantic information and low-level spatial information. In GFPN, multi-scale features are fused with hierarchical features from both the previous layer and the current layer. More importantly, the log2(n) skip-layer connections provide more efficient information transfer and allow the design to scale to deeper networks.
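To make the log2(n) skip-layer connection more concrete, here is a minimal sketch of which earlier layers feed a given layer at the same scale. It assumes the rule described in the GiraffeDet paper, where layer l receives inputs from layers l - 2^n for every n with l - 2^n >= 0. The function name is made up for illustration and does not come from any released code.

```python
def log2n_link_sources(layer):
    """Indices of earlier layers feeding `layer` under the log2(n)-link rule."""
    sources = []
    step = 1
    # Walk back by 1, 2, 4, 8, ... layers until we run past the first layer.
    while layer - step >= 0:
        sources.append(layer - step)
        step *= 2
    return sources


for l in range(1, 9):
    print(l, log2n_link_sources(l))
```

Each layer therefore has at most log2(l) + 1 incoming links, which keeps the number of connections small while still reaching far back into the stack.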

Similarly, when they directly replaced the original neck of modern YOLO-series models with GFPN, they achieved higher accuracy. (You can try swapping in GFPN yourself later.) However, they found that the latency of the GFPN-based model is much higher than that of the model based on an improved PANet, so the accuracy gain may not be worth the cost. They summarize the reasons as follows:

1. Feature maps of different scales share the same channel dimension;

2. Queen-fusion cannot meet the requirements of a real-time detection model;

3. Convolution-based cross-scale feature fusion is inefficient.

Based on GFPN, the authors propose a novel and efficient design, RepGFPN, to meet the requirements of real-time object detection. Their reasoning on each of these points is as follows:

1. Because the FLOPs of feature maps at different scales differ greatly, it is hard to keep the same number of channels for every scale under a limited computational budget.

Therefore, in the authors' neck feature fusion, feature maps at different scales use different channel dimensions. The authors compare the performance of equal and unequal channel settings, as well as the trade-off between neck depth and width, as shown in the following table.

We can see that by flexibly controlling the number of channels at different scales, we can achieve higher accuracy than sharing the same number of channels at all scales. The best performance is obtained when the depth is equal to 3 and the width is equal to (96, 192, 384).
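A rough back-of-the-envelope calculation shows why sharing one width across scales is costly. The sketch below counts the multiply-accumulates of a single 3x3 convolution at strides 8, 16, and 32. The 640x640 input size and the uniform width of 192 are illustrative assumptions, not settings taken from the paper; only the (96, 192, 384) widths come from the text above.

```python
def conv_macs(h, w, c_in, c_out, k=3):
    """Multiply-accumulates of one k x k convolution on an h x w feature map."""
    return h * w * c_in * c_out * k * k


IMG = 640              # assumed input resolution
STRIDES = (8, 16, 32)  # P3, P4, P5


def per_scale_macs(widths):
    # One conv per scale, with equal input and output channels.
    return [conv_macs(IMG // s, IMG // s, c, c) for s, c in zip(STRIDES, widths)]


uniform = per_scale_macs((192, 192, 192))
varied = per_scale_macs((96, 192, 384))
for name, macs in (("uniform", uniform), ("varied", varied)):
    print(name, [round(m / 1e6, 1) for m in macs], "MMACs")
```

With a uniform width, the stride-8 map costs 16 times as much as the stride-32 map. With (96, 192, 384), the channel count doubles each time the resolution halves, so this convolution costs the same at all three scales.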

2. GFPN enhances feature interaction through queen-fusion, but this also brings many additional upsampling and downsampling operations.

The authors compare the performance of these upsampling and downsampling operations, and the results are shown in the table.

We can see that the additional upsampling operation increases latency by 0.6 ms while improving accuracy by only 0.3 mAP, far less than the gain brought by the additional downsampling operation. Therefore, under the constraint of real-time detection, the authors remove the extra upsampling operation in queen-fusion.

3. In the feature fusion block, the authors first replace the original 3x3 convolution-based feature fusion with CSPNet, obtaining a 4.2 mAP gain. They then upgrade CSPNet by combining the reparameterization mechanism with the connections of the Efficient Layer Aggregation Network (ELAN), which achieves higher accuracy without a large additional computational burden. The comparison results are listed in the table.
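The reparameterization mechanism mentioned here can be checked numerically. The following is a minimal sketch, not the DAMO-YOLO implementation: it builds a 3x3 conv+BN branch and a 1x1 conv+BN branch, folds each BatchNorm into its convolution, pads the 1x1 kernel to 3x3, and verifies that a single 3x3 convolution reproduces the sum of the two branches in inference mode. The helper name fuse_conv_bn and the tensor sizes are chosen for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def fuse_conv_bn(conv, bn):
    """Fold BatchNorm statistics into the preceding conv's kernel and bias."""
    std = (bn.running_var + bn.eps).sqrt()
    scale = (bn.weight / std).reshape(-1, 1, 1, 1)
    kernel = conv.weight * scale
    bias = bn.bias - bn.running_mean * bn.weight / std
    return kernel, bias


torch.manual_seed(0)
c = 8
conv3, bn3 = nn.Conv2d(c, c, 3, padding=1, bias=False), nn.BatchNorm2d(c)
conv1, bn1 = nn.Conv2d(c, c, 1, padding=0, bias=False), nn.BatchNorm2d(c)

# Give BN non-trivial statistics so the check is meaningful, then freeze them.
for bn in (bn3, bn1):
    bn.running_mean.uniform_(-1, 1)
    bn.running_var.uniform_(0.5, 2.0)
    bn.weight.data.uniform_(0.5, 1.5)
    bn.bias.data.uniform_(-1, 1)
    bn.eval()

x = torch.randn(2, c, 16, 16)
with torch.no_grad():
    # Training-time structure: two parallel branches that are summed.
    two_branch = bn3(conv3(x)) + bn1(conv1(x))

    # Deploy-time structure: one 3x3 conv with merged kernel and bias.
    k3, b3 = fuse_conv_bn(conv3, bn3)
    k1, b1 = fuse_conv_bn(conv1, bn1)
    kernel = k3 + F.pad(k1, [1, 1, 1, 1])  # place the 1x1 kernel at the 3x3 center
    bias = b3 + b1
    single = F.conv2d(x, kernel, bias, padding=1)

max_diff = (two_branch - single).abs().max().item()
print("max abs difference:", max_diff)
```

The difference should be at the level of floating-point noise. The equivalence only holds with frozen BatchNorm statistics, which is why the fusion is applied after training.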

After this walk through the paper, the idea behind RepGFPN should be clear, so let's take a look at the code!

Following the DAMO-YOLO source code, I added the RepGFPN structure to the YOLOv6 Pro framework, including RepGFPN-T, RepGFPN-S, and RepGFPN-M, using YOLOv6l and YOLOv6t as examples.

First look at the yaml file of the yolov6l+RepGFPN-M structure:

depth_multiple: 1.0  # model depth multiple
width_multiple: 1.0  # layer channel multiple
backbone:
  # [from, number, module, args]
  [[-1, 1, ConvWrapper, [64, 3, 2]],  # 0-P1/2
   [-1, 1, ConvWrapper, [128, 3, 2]],  # 1-P2/4
   [-1, 1, BepC3, [128, 6, "ConvWrapper"]],
   [-1, 1, ConvWrapper, [256, 3, 2]],  # 3-P3/8
   [-1, 1, BepC3, [256, 12, "ConvWrapper"]],
   [-1, 1, ConvWrapper, [512, 3, 2]],  # 5-P4/16
   [-1, 1, BepC3, [512, 18, "ConvWrapper"]],
   [-1, 1, ConvWrapper, [1024, 3, 2]],  # 7-P5/32
   [-1, 1, BepC3, [1024, 6, "ConvWrapper"]],
   [-1, 1, SPPF, [1024, 5]]]  # 9
neck:
   [
    [ 6, 1,ConvBNAct,[ 256, 3, 2, silu ] ],  # 10
    [ [ -1, 9 ], 1, Concat, [ 1 ] ],  # 11, 256+1024 = 1280
    [ -1, 1, RepGFPN, [ 512, 1.5, 1.0, silu ] ],  # 12

    [ -1, 1, nn.Upsample, [ None, 2, 'nearest' ] ],  # 13
    [ 4, 1,ConvBNAct,[ 128, 3, 2, silu ] ],  # 14
    [ [ -1, 6, 13 ], 1, Concat, [ 1 ] ],  # 15, 128+512+512 = 1152
    [ -1, 1, RepGFPN, [ 256, 1.5, 1.0, silu ] ],  # 16, merge_4

    [ -1, 1, nn.Upsample, [ None, 2, 'nearest' ] ],  # 17
    [ [ -1, 4 ], 1, Concat, [ 1 ] ],  # 18, 256+256 = 512
    [ -1, 1, RepGFPN, [ 128, 1.5, 1.0, silu ] ],  # 19, merge_5, out

    [ -1, 1,ConvBNAct,[ 128, 3, 2, silu ] ],  # 20
    [ [ -1, 16 ], 1, Concat, [ 1 ] ],  # 21, 128+256 = 384
    [ -1, 1, RepGFPN, [ 256, 1.5, 1.0, silu ] ],  # 22, merge_7, out

    [ 16, 1,ConvBNAct,[ 256, 3, 2, silu ] ],  # 23
    [ -2, 1,ConvBNAct,[ 256, 3, 2, silu ] ],  # 24
    [ [ -1, 12, -2 ], 1, Concat, [ 1 ] ],  # 25, 256+512+256 = 1024
    [ -1, 1, RepGFPN, [ 512, 1.5, 1.0, silu ] ],  # 26, merge_6, out
   ]

effidehead:
  [[19, 1,Head_out , [128, 16]],
  [22, 1, Head_out, [256, 16]],
  [26, 1, Head_out, [512, 16]],
  [[27, 28, 29], 1, Out, []]]
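The `from` indices in this neck are easy to get wrong, so it can help to trace them by hand. The sketch below is a simplified stand-in for the model parser, not the framework's own code. It assumes that ConvBNAct and RepGFPN output exactly the first number in their args, that Upsample keeps its input channels, that Concat sums its inputs, and that the backbone outputs the channels listed in the yaml at width_multiple 1.0.

```python
# Output channels of backbone layers 0-9 at width_multiple = 1.0.
backbone = [64, 128, 128, 256, 256, 512, 512, 1024, 1024, 1024]

# (from, module, out_channels); out_channels is None for Upsample and Concat.
neck = [
    (6, "ConvBNAct", 256),
    ([-1, 9], "Concat", None),
    (-1, "RepGFPN", 512),
    (-1, "Upsample", None),
    (4, "ConvBNAct", 128),
    ([-1, 6, 13], "Concat", None),
    (-1, "RepGFPN", 256),
    (-1, "Upsample", None),
    ([-1, 4], "Concat", None),
    (-1, "RepGFPN", 128),
    (-1, "ConvBNAct", 128),
    ([-1, 16], "Concat", None),
    (-1, "RepGFPN", 256),
    (16, "ConvBNAct", 256),
    (-2, "ConvBNAct", 256),
    ([-1, 12, -2], "Concat", None),
    (-1, "RepGFPN", 512),
]


def trace(backbone_channels, layers):
    """Return (absolute index, module name, output channels) for each neck layer."""
    ch = list(backbone_channels)
    rows = []
    for frm, module, out in layers:
        sources = frm if isinstance(frm, list) else [frm]
        # Negative indices count back from the layer currently being built.
        in_ch = [ch[s] for s in sources]
        if module == "Concat":
            out = sum(in_ch)
        elif out is None:
            out = in_ch[0]
        rows.append((len(ch), module, out))
        ch.append(out)
    return rows


rows = trace(backbone, neck)
for idx, module, out in rows:
    print(idx, module, out)
```

Under these assumptions the three RepGFPN outputs that feed the head land at absolute indices 19, 22, and 26, which matches the `from` fields in effidehead.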


Compare with the original figure:

The modules of various colors are ConvBNAct, a simple module consisting of a convolution, normalization, and a ReLU/SiLU activation function.

Fusion Block is the main fusion module, corresponding to RepGFPN in the yaml file.

The input is two or three layers. After the concat, 1x1 convolutions reduce the channels and split the features into two branches. One branch goes through an ELAN-style feature aggregation module made of N blocks, each consisting of a 3x3 Rep convolution followed by a 3x3 convolution. The output of every block is kept, and all of them are concatenated with the other branch and fused by a final 1x1 convolution to produce the output.

The arguments passed to RepGFPN are [output channels, depth coefficient, scaling factor for the hidden-layer channels, activation function type]. To keep the idea of a heavy neck and a light head, I removed the decoupled head of YOLOv6 and replaced it with Head_out, a simple output layer.
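As a quick worked example of how these arguments translate into layer sizes, the sketch below repeats the arithmetic from the RepGFPN and CSPStage constructors listed further down. The helper name csp_stage_plan is hypothetical and exists only for this illustration.

```python
def csp_stage_plan(out_channels, depth):
    """Mirror the size arithmetic of RepGFPN.__init__ and CSPStage.__init__."""
    n = round(3 * depth)               # number of stacked blocks
    ch_first = out_channels // 2       # 1x1 branch that bypasses the blocks
    ch_mid = out_channels - ch_first   # width of the branch that runs through the blocks
    fuse_in = ch_mid * n + ch_first    # channels entering the final 1x1 conv
    return {"blocks": n, "ch_first": ch_first, "ch_mid": ch_mid, "fuse_in": fuse_in}


print(csp_stage_plan(512, 1.5))  # settings used in the yolov6l example
print(csp_stage_plan(384, 1.0))  # settings used in the yolov6t example
```

One detail worth knowing: Python 3 rounds halves to the nearest even number, so a depth coefficient of 1.5 gives round(4.5) = 4 blocks, not 5.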

yaml file of yolov6t+RepGFPN-T structure:

depth_multiple: 0.33  # model depth multiple
width_multiple: 0.375  # layer channel multiple
backbone:
  # [from, number, module, args]
  [[-1, 1, RepVGGBlock, [64, 3, 2]],  # 0-P1/2
   [-1, 1, RepVGGBlock, [128, 3, 2]],  # 1-P2/4
   [-1, 6, RepBlock, [128]],
   [-1, 1, RepVGGBlock, [256, 3, 2]],  # 3-P3/8
   [-1, 12, RepBlock, [256]],
   [-1, 1, RepVGGBlock, [512, 3, 2]],  # 5-P4/16
   [-1, 18, RepBlock, [512]],
   [-1, 1, RepVGGBlock, [1024, 3, 2]],  # 7-P5/32
   [-1, 6, RepBlock, [1024]],
   [-1, 1, SimSPPF, [1024, 5]]]  # 9
neck:
   [
    [ 6, 1,ConvBNAct,[ 192, 3, 2 ] ],  # 10
    [ [ -1, 9 ], 1, Concat, [ 1 ] ],  # 11, 576
    [ -1, 1, RepGFPN, [ 384, 1.0, 1.0 ] ],  # 12

    [ -1, 1, nn.Upsample, [ None, 2, 'nearest' ] ],  # 13
    [ 4, 1,ConvBNAct,[ 96, 3, 2 ] ],  # 14
    [ [ -1, 6, 13 ], 1, Concat, [ 1 ] ],  # 15, 672
    [ -1, 1, RepGFPN, [ 192, 1.0, 1.0 ] ],  # 16, merge_4

    [ -1, 1, nn.Upsample, [ None, 2, 'nearest' ] ],  # 17
    [ [ -1, 4 ], 1, Concat, [ 1 ] ],  # 18, 288
    [ -1, 1, RepGFPN, [ 64, 1.0, 1.0 ] ],  # 19, merge_5, out

    [ -1, 1,ConvBNAct,[ 64, 3, 2 ] ],  # 20
    [ [ -1, 16 ], 1, Concat, [ 1 ] ],  # 21, 256
    [ -1, 1, RepGFPN, [ 128, 1.0, 1.0 ] ],  # 22, merge_7, out

    [ 16, 1,ConvBNAct,[ 192, 3, 2 ] ],  # 23
    [ -2, 1,ConvBNAct,[ 128, 3, 2 ] ],  # 24
    [ [ -1, 12, -2 ], 1, Concat, [ 1 ] ],  # 25, 704
    [ -1, 1, RepGFPN, [ 256, 1.0, 1.0 ] ],  # 26, merge_6, out
   ]

effidehead:
  [[19, 1,Head_out , [170, 0]],  ##170 * 0.375 = 64
  [22, 1, Head_out, [341, 0]],   ##341 * 0.375 = 128
  [26, 1, Head_out, [682, 0]],   ##682 * 0.375 = 256
  [[27, 28, 29], 1, Out, []]]
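The head channels 170, 341, and 682 look odd until they are multiplied by width_multiple. The sketch below reproduces the numbers in the comments, assuming the framework rounds scaled channels up to a multiple of 8 the way YOLOv5's make_divisible helper does. That rounding rule is an assumption here and is not confirmed by the article.

```python
import math


def make_divisible(x, divisor=8):
    """Round x up to the nearest multiple of divisor (YOLOv5-style, assumed)."""
    return math.ceil(x / divisor) * divisor


WIDTH_MULTIPLE = 0.375
for c in (170, 341, 682):
    scaled = c * WIDTH_MULTIPLE
    print(c, scaled, make_divisible(scaled))
```

The scaled values 63.75, 127.875, and 255.75 round up to 64, 128, and 256, which are exactly the output channels of the three RepGFPN layers feeding the head.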


Add the following code to common.py, or create a separate file such as RepGFPN.py yourself and then import the module names in common.py:

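# In common.py (only needed when the code lives in a separate file):
from yolov6.layers.damo_yolo import ConvBNAct,RepGFPN

# Module code, e.g. yolov6/layers/damo_yolo.py (everything from here on):
import numpy as np
import torch
import torch.nn as nn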

class RepGFPN(nn.Module):
    def __init__(self,in_channels,out_channels,depth=1.0,hidden_ratio = 1.0,act = 'relu',block_name='BasicBlock_3x3_Reverse',spp = False):
        super(RepGFPN, self).__init__()


        self.merge_3 = CSPStage(block_name,
                                in_channels,
                                hidden_ratio,
                                out_channels,
                                round(3 * depth),
                                act=act)


    def forward(self,x):
        x = self.merge_3(x)
        return  x
class CSPStage(nn.Module):
    def __init__(self,
                 block_fn,
                 ch_in,
                 ch_hidden_ratio,
                 ch_out,
                 n,
                 act='swish',
                 spp=False):
        super(CSPStage, self).__init__()

        split_ratio = 2
        ch_first = int(ch_out // split_ratio)
        ch_mid = int(ch_out - ch_first)
        self.conv1 = ConvBNAct(ch_in, ch_first, 1, act=act)
        self.conv2 = ConvBNAct(ch_in, ch_mid, 1, act=act)
        self.convs = nn.Sequential()

        next_ch_in = ch_mid
        for i in range(n):
            if block_fn == 'BasicBlock_3x3_Reverse':
                self.convs.add_module(
                    str(i),
                    BasicBlock_3x3_Reverse(next_ch_in,
                                           ch_hidden_ratio,
                                           ch_mid,
                                           act=act,
                                           shortcut=True))
            else:
                raise NotImplementedError
            if i == (n - 1) // 2 and spp:
                self.convs.add_module(
                    'spp', SPP(ch_mid * 4, ch_mid, 1, [5, 9, 13], act=act))
            next_ch_in = ch_mid
        self.conv3 = ConvBNAct(ch_mid * n + ch_first, ch_out, 1, act=act)

    def forward(self, x):
        y1 = self.conv1(x)
        y2 = self.conv2(x)

        mid_out = [y1]
        for conv in self.convs:
            y2 = conv(y2)
            mid_out.append(y2)
        y = torch.cat(mid_out, axis=1)
        y = self.conv3(y)
        return y
class ConvBNAct(nn.Module):
    """A Conv2d -> Batchnorm -> silu/leaky relu block"""
    def __init__(
        self,
        in_channels,
        out_channels,
        ksize,
        stride=1,
        act='relu',
        groups=1,
        bias=False,
        norm='bn',
        reparam=False,
    ):
        super().__init__()
        # same padding
        pad = (ksize - 1) // 2
        self.conv = nn.Conv2d(
            in_channels,
            out_channels,
            kernel_size=ksize,
            stride=stride,
            padding=pad,
            groups=groups,
            bias=bias,
        )
        if norm is not None:
            self.bn = get_norm(norm, out_channels, inplace=True)
        if act is not None:
            self.act = get_activation(act, inplace=True)
        self.with_norm = norm is not None
        self.with_act = act is not None

    def forward(self, x):
        x = self.conv(x)
        if self.with_norm:
            x = self.bn(x)
        if self.with_act:
            x = self.act(x)
        return x

    def fuseforward(self, x):
        return self.act(self.conv(x))
class BasicBlock_3x3_Reverse(nn.Module):
    def __init__(self,
                 ch_in,
                 ch_hidden_ratio,
                 ch_out,
                 act='relu',
                 shortcut=True):
        super(BasicBlock_3x3_Reverse, self).__init__()
        assert ch_in == ch_out
        ch_hidden = int(ch_in * ch_hidden_ratio)
        self.conv1 = ConvBNAct(ch_hidden, ch_out, 3, stride=1, act=act)
        self.conv2 = RepConv(ch_in, ch_hidden, 3, stride=1, act=act)
        self.shortcut = shortcut

    def forward(self, x):
        y = self.conv2(x)
        y = self.conv1(y)
        if self.shortcut:
            return x + y
        else:
            return y
def get_norm(name, out_channels, inplace=True):
    if name == 'bn':
        module = nn.BatchNorm2d(out_channels)
    else:
        raise NotImplementedError
    return module
class SPP(nn.Module):
    def __init__(
        self,
        ch_in,
        ch_out,
        k,
        pool_size,
        act='swish',
    ):
        super(SPP, self).__init__()
        self.pool = []
        for i, size in enumerate(pool_size):
            pool = nn.MaxPool2d(kernel_size=size,
                                stride=1,
                                padding=size // 2,
                                ceil_mode=False)
            self.add_module('pool{}'.format(i), pool)
            self.pool.append(pool)
        self.conv = ConvBNAct(ch_in, ch_out, k, act=act)

    def forward(self, x):
        outs = [x]

        for pool in self.pool:
            outs.append(pool(x))
        y = torch.cat(outs, axis=1)

        y = self.conv(y)
        return y
class Swish(nn.Module):
    def __init__(self, inplace=True):
        super(Swish, self).__init__()
        self.inplace = inplace

    def forward(self, x):
        # torch.sigmoid replaces the deprecated F.sigmoid
        if self.inplace:
            x.mul_(torch.sigmoid(x))
            return x
        else:
            return x * torch.sigmoid(x)
class RepConv(nn.Module):
    '''RepConv is a basic rep-style block, including training and deploy status
    Code is based on https://github.com/DingXiaoH/RepVGG/blob/main/repvgg.py
    '''
    def __init__(self,
                 in_channels,
                 out_channels,
                 kernel_size=3,
                 stride=1,
                 padding=1,
                 dilation=1,
                 groups=1,
                 padding_mode='zeros',
                 deploy=False,
                 act='relu',
                 norm=None):
        super(RepConv, self).__init__()
        self.deploy = deploy
        self.groups = groups
        self.in_channels = in_channels
        self.out_channels = out_channels

        assert kernel_size == 3
        assert padding == 1

        padding_11 = padding - kernel_size // 2

        if isinstance(act, str):
            self.nonlinearity = get_activation(act)
        else:
            self.nonlinearity = act

        if deploy:
            self.rbr_reparam = nn.Conv2d(in_channels=in_channels,
                                         out_channels=out_channels,
                                         kernel_size=kernel_size,
                                         stride=stride,
                                         padding=padding,
                                         dilation=dilation,
                                         groups=groups,
                                         bias=True,
                                         padding_mode=padding_mode)

        else:
            self.rbr_identity = None
            self.rbr_dense = conv_bn(in_channels=in_channels,
                                     out_channels=out_channels,
                                     kernel_size=kernel_size,
                                     stride=stride,
                                     padding=padding,
                                     groups=groups)
            self.rbr_1x1 = conv_bn(in_channels=in_channels,
                                   out_channels=out_channels,
                                   kernel_size=1,
                                   stride=stride,
                                   padding=padding_11,
                                   groups=groups)

    def forward(self, inputs):
        '''Forward process'''
        if hasattr(self, 'rbr_reparam'):
            return self.nonlinearity(self.rbr_reparam(inputs))

        if self.rbr_identity is None:
            id_out = 0
        else:
            id_out = self.rbr_identity(inputs)

        return self.nonlinearity(
            self.rbr_dense(inputs) + self.rbr_1x1(inputs) + id_out)

    def get_equivalent_kernel_bias(self):
        kernel3x3, bias3x3 = self._fuse_bn_tensor(self.rbr_dense)
        kernel1x1, bias1x1 = self._fuse_bn_tensor(self.rbr_1x1)
        kernelid, biasid = self._fuse_bn_tensor(self.rbr_identity)
        return kernel3x3 + self._pad_1x1_to_3x3_tensor(
            kernel1x1) + kernelid, bias3x3 + bias1x1 + biasid

    def _pad_1x1_to_3x3_tensor(self, kernel1x1):
        if kernel1x1 is None:
            return 0
        else:
            return torch.nn.functional.pad(kernel1x1, [1, 1, 1, 1])

    def _fuse_bn_tensor(self, branch):
        if branch is None:
            return 0, 0
        if isinstance(branch, nn.Sequential):
            kernel = branch.conv.weight
            running_mean = branch.bn.running_mean
            running_var = branch.bn.running_var
            gamma = branch.bn.weight
            beta = branch.bn.bias
            eps = branch.bn.eps
        else:
            assert isinstance(branch, nn.BatchNorm2d)
            if not hasattr(self, 'id_tensor'):
                input_dim = self.in_channels // self.groups
                kernel_value = np.zeros((self.in_channels, input_dim, 3, 3),
                                        dtype=np.float32)
                for i in range(self.in_channels):
                    kernel_value[i, i % input_dim, 1, 1] = 1
                self.id_tensor = torch.from_numpy(kernel_value).to(
                    branch.weight.device)
            kernel = self.id_tensor
            running_mean = branch.running_mean
            running_var = branch.running_var
            gamma = branch.weight
            beta = branch.bias
            eps = branch.eps
        std = (running_var + eps).sqrt()
        t = (gamma / std).reshape(-1, 1, 1, 1)
        return kernel * t, beta - running_mean * gamma / std

    def switch_to_deploy(self):
        if hasattr(self, 'rbr_reparam'):
            return
        kernel, bias = self.get_equivalent_kernel_bias()
        self.rbr_reparam = nn.Conv2d(
            in_channels=self.rbr_dense.conv.in_channels,
            out_channels=self.rbr_dense.conv.out_channels,
            kernel_size=self.rbr_dense.conv.kernel_size,
            stride=self.rbr_dense.conv.stride,
            padding=self.rbr_dense.conv.padding,
            dilation=self.rbr_dense.conv.dilation,
            groups=self.rbr_dense.conv.groups,
            bias=True)
        self.rbr_reparam.weight.data = kernel
        self.rbr_reparam.bias.data = bias
        for para in self.parameters():
            para.detach_()
        self.__delattr__('rbr_dense')
        self.__delattr__('rbr_1x1')
        if hasattr(self, 'rbr_identity'):
            self.__delattr__('rbr_identity')
        if hasattr(self, 'id_tensor'):
            self.__delattr__('id_tensor')
        self.deploy = True
def get_activation(name='silu', inplace=True):
    if name is None:
        return nn.Identity()

    if isinstance(name, str):
        if name == 'silu':
            module = nn.SiLU(inplace=inplace)
        elif name == 'relu':
            module = nn.ReLU(inplace=inplace)
        elif name == 'lrelu':
            module = nn.LeakyReLU(0.1, inplace=inplace)
        elif name == 'swish':
            module = Swish(inplace=inplace)
        elif name == 'hardsigmoid':
            module = nn.Hardsigmoid(inplace=inplace)
        elif name == 'identity':
            module = nn.Identity()
        else:
            raise AttributeError('Unsupported act type: {}'.format(name))
        return module

    elif isinstance(name, nn.Module):
        return name

    else:
        raise AttributeError('Unsupported act type: {}'.format(name))
def conv_bn(in_channels, out_channels, kernel_size, stride, padding, groups=1):
    '''Basic cell for rep-style block, including conv and bn'''
    result = nn.Sequential()
    result.add_module(
        'conv',
        nn.Conv2d(in_channels=in_channels,
                  out_channels=out_channels,
                  kernel_size=kernel_size,
                  stride=stride,
                  padding=padding,
                  groups=groups,
                  bias=False))
    result.add_module('bn', nn.BatchNorm2d(num_features=out_channels))
    return result

Then, in yolo.py, add a branch for the new modules:

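elif m in [RepGFPN, ConvBNAct]:
    c1 = ch[f]       # input channels, taken from the source layer
    c2 = args[0]     # output channels, used exactly as written in the yaml
    args = [c1, c2, *args[1:]]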

You're done!

Origin blog.csdn.net/qq_43000647/article/details/128293855