[Object detection] (8) Improving the enhanced feature extraction module with ASPP, with complete TensorFlow code

Hello everyone. I recently wanted to improve the SPP enhanced feature extraction module of YOLOv4, and I noticed that many papers replace it with the ASPP module from semantic segmentation. Today I will reproduce that module in TensorFlow.

The YOLOv4 backbone network code can be found in my previous article: https://blog.csdn.net/dgvv4/article/details/123818580

Replace the original SPP module code with the ASPP code from this section.


1. Method introduction

In YOLOv4, the SPP module extracts information from different receptive fields, but it does not fully capture the semantic relationship between global and local information. The ASPP design used in this article replaces the pooling operations of SPP with depthwise separable convolutions combined with atrous (dilated) convolutions at different dilation rates, and runs them in parallel with global average pooling. The result is a new feature pyramid module that aggregates multi-scale context information and improves the model's ability to recognize the same object at different sizes.

The improved YOLOv4 framework combined with ASPP is shown in the figure below. It uses 3*3 dilated convolutions with dilation rates of 1, 3, and 5, implemented as depthwise separable convolutions to reduce the number of parameters. This associates the local features of the previous layer with a wider field of view and helps prevent small-target features from being lost as information is passed on.

The output feature map of the backbone's third effective feature layer, with shape [13, 13, 1024], is the input of the ASPP module. The first branch is a 1*1 standard convolution that keeps the original receptive field. The second to fourth branches are depthwise separable convolutions with different dilation rates, which extract features at different receptive fields. The fifth branch applies global average pooling to the input to obtain global features. Finally, the feature maps of the five branches are concatenated along the channel dimension, and a 1*1 standard convolution fuses the information from the different scales.


2. Atrous convolution

When a traditional deep convolutional neural network processes images, it generally applies convolutions for feature extraction and dimension changes, then pooling to reduce the spatial size. As the network gets deeper, the pooling layers make the feature maps smaller and smaller. When the feature map later has to be enlarged back to the original size by upsampling, information has already been lost: the internal data structure, the spatial hierarchy, and the details needed to reconstruct small targets. As a result, network accuracy may stop improving significantly.

Atrous convolution is computed the same way as ordinary convolution, but it is a variant that introduces a new parameter called the "dilation rate". As the name suggests, the dilation rate describes how far the convolution kernel is expanded, that is, the spacing between adjacent weights in the kernel. Ordinary convolution has a dilation rate of 1, so the kernel is not expanded. Atrous convolution enlarges the receptive field without using a pooling layer, so each convolution output covers a larger range of information, which can improve network performance.
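
As a small illustration of the dilation rate, the sketch below computes the span covered by a dilated kernel using the standard relation k + (k - 1) * (r - 1), and builds the equivalent zero-filled kernel. The helper names `effective_kernel_size` and `dilate_kernel` are hypothetical and only serve this example; they are not part of TensorFlow.

```python
def effective_kernel_size(kernel_size, dilation_rate):
    """Span covered by a dilated kernel: k + (k - 1) * (r - 1)."""
    return kernel_size + (kernel_size - 1) * (dilation_rate - 1)


def dilate_kernel(kernel, dilation_rate):
    """Insert (dilation_rate - 1) zeros between neighbouring weights of a square 2D kernel."""
    k = len(kernel)
    size = effective_kernel_size(k, dilation_rate)
    dilated = [[0] * size for _ in range(size)]
    for i in range(k):
        for j in range(k):
            # Original weight (i, j) moves to (i * r, j * r); everything else stays 0
            dilated[i * dilation_rate][j * dilation_rate] = kernel[i][j]
    return dilated


# Span of a 3*3 kernel at the dilation rates used by the ASPP in this article
for rate in (1, 3, 5):
    print(rate, effective_kernel_size(3, rate))

# A 3*3 kernel of ones with dilation rate 2 still has only 9 nonzero weights
for row in dilate_kernel([[1, 1, 1] for _ in range(3)], 2):
    print(row)
```

Note that at rate 5 a 3*3 kernel spans 11 positions, which is close to the full width of the 13*13 feature map used later in this article.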

The figure below shows atrous convolution on two-dimensional data. The red dots are the inputs of the 3*3 convolution kernel, and the green area is the receptive field captured by those inputs. The receptive field is the size of the region on the input image that a point on a layer's output feature map maps back to.

Figure a shows a 3*3 convolution with dilation rate r = 1, which is computed the same way as an ordinary convolution.

In Figure b the 3*3 convolution has a dilation rate r of 2, which means one hole is inserted between every two adjacent kernel points. On its own it can be regarded as a 5*5 convolution kernel in which only 9 weights are nonzero and the rest are 0. The kernel still has only 3*3 parameters, but the area it covers is larger.

If the layer before the r = 2 convolution is a convolution with r = 1, then each red point in Figure b is itself an output of the r = 1 convolution and already covers a 3*3 region. Stacking the r = 1 and r = 2 convolutions therefore gives a 7*7 receptive field.

In Figure c the 3*3 convolution has a dilation rate of 4. Similarly, taking the output of the stacked r = 1 and r = 2 dilated convolutions as its input gives a 15*15 receptive field.
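
The receptive field sizes quoted for Figures a to c can be checked with a few lines of arithmetic. This is a minimal sketch that assumes every layer uses stride 1, in which case each layer adds (k - 1) * r to the receptive field; the function name `stacked_receptive_field` is hypothetical.

```python
def stacked_receptive_field(dilation_rates, kernel_size=3):
    """Receptive field of stride-1 convolutions applied one after another."""
    field = 1  # a single input pixel before any convolution
    for rate in dilation_rates:
        # Each layer widens the field by the span of its dilated kernel minus one
        field += (kernel_size - 1) * rate
    return field


print(stacked_receptive_field([1]))        # Figure a
print(stacked_receptive_field([1, 2]))     # Figure b
print(stacked_receptive_field([1, 2, 4]))  # Figure c
```

With dilation rates that double at every layer, the receptive field grows exponentially while the number of parameters per layer stays at 3*3.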


3. Code reproduction

First, we build depthwise separable convolution blocks with different dilation rates. If the theory of depthwise separable convolution is unfamiliar, see my MobileNetV3 article: https://blog.csdn.net/dgvv4/article/details/123476899

The 3*3 depthwise convolution (DepthwiseConv) only processes information along the height and width of the feature map, while the 1*1 pointwise convolution (PointConv) only processes information along the channel direction.

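# (1) Depthwise separable convolution + atrous (dilated) convolution
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers, Model

def block(inputs, filters, rate):
    '''
    filters: number of output channels of the 1*1 pointwise convolution
    rate: dilation rate of the atrous convolution
    '''

    # 3*3 depthwise convolution with the given dilation rate
    x = layers.DepthwiseConv2D(kernel_size=(3,3), strides=1, padding='same',
                               dilation_rate=rate, use_bias=False)(inputs)

    x = layers.BatchNormalization()(x)  # batch normalization
    x = layers.Activation('relu')(x)  # activation function

    # 1*1 pointwise convolution adjusts the number of channels
    x = layers.Conv2D(filters, kernel_size=(1,1), strides=1, padding='same', use_bias=False)(x)

    x = layers.BatchNormalization()(x)  # batch normalization
    x = layers.Activation('relu')(x)  # activation function

    return x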
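
To make the parameter saving of the block above concrete, the sketch below compares the weight count of a standard 3*3 convolution with that of the depthwise + pointwise pair, for the 1024 to 512 channel mapping used in this article. It assumes no bias terms (matching `use_bias=False`) and ignores the BatchNormalization parameters; the function names are hypothetical.

```python
def standard_conv_params(in_channels, out_channels, kernel_size=3):
    """Weights of a standard convolution without bias."""
    return kernel_size * kernel_size * in_channels * out_channels


def separable_conv_params(in_channels, out_channels, kernel_size=3):
    """Weights of a depthwise convolution followed by a 1*1 pointwise convolution, no bias."""
    depthwise = kernel_size * kernel_size * in_channels  # one k*k filter per input channel
    pointwise = in_channels * out_channels               # 1*1 convolution mixes channels
    return depthwise + pointwise


standard = standard_conv_params(1024, 512)
separable = separable_conv_params(1024, 512)
print(standard, separable, round(standard / separable, 2))
```

For these channel counts the separable version needs 533,504 weights instead of 4,718,592, roughly 8.8 times fewer. The dilation rate does not change either number, because the holes in the kernel carry no weights.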

Next, build the ASPP module. Its input is the third effective feature layer of the backbone network. One branch applies a 1*1 standard convolution to fuse channel information, three branches apply 3*3 atrous convolutions with different dilation rates to obtain information at different scales, and one branch applies global average pooling to obtain global information.

# (2) ASPP enhanced feature extraction module; inputs is the third effective feature layer of the network [13,13,1024]
def aspp(inputs):

    # Get the size of the input feature map
    b,h,w,c = inputs.shape

    # 1*1 standard convolution reduces the number of channels [13,13,1024]==>[13,13,512]
    x1 = layers.Conv2D(filters=512, kernel_size=(1,1), strides=1, padding='same', use_bias=False)(inputs)
    x1 = layers.BatchNormalization()(x1)  # batch normalization
    x1 = layers.Activation('relu')(x1)  # activation

    # dilation rate = 1
    x2 = block(inputs, filters=512, rate=1)
    # dilation rate = 3
    x3 = block(inputs, filters=512, rate=3)
    # dilation rate = 5
    x4 = block(inputs, filters=512, rate=5)

    # Global average pooling [13,13,1024]==>[None,1024]
    x5 = layers.GlobalAveragePooling2D()(inputs)
    # [None,1024]==>[1,1,1024]
    x5 = layers.Reshape(target_shape=[1,1,-1])(x5)
    # 1*1 convolution reduces the number of channels [1,1,1024]==>[1,1,512]
    x5 = layers.Conv2D(filters=512, kernel_size=(1,1), strides=1, padding='same', use_bias=False)(x5)
    x5 = layers.BatchNormalization()(x5)
    x5 = layers.Activation('relu')(x5)
    # Resize the feature map [1,1,512]==>[13,13,512]
    x5 = tf.image.resize(x5, size=(h,w))

    # Concatenate the 5 parallel branches along the channel axis ==>[13,13,512*5]
    x = layers.concatenate([x1,x2,x3,x4,x5])

    # 1*1 convolution adjusts the number of channels
    x = layers.Conv2D(filters=512, kernel_size=(1,1), strides=1, padding='same', use_bias=False)(x)
    x = layers.BatchNormalization()(x)
    x = layers.Activation('relu')(x)
    # Randomly drop neurons
    x = layers.Dropout(rate=0.1)(x)

    return x
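
The fifth branch can look odd at first: it pools the map down to 1*1 and then resizes it back to 13*13. Because the resized map comes from a single pixel, every spatial position should end up with the same per-channel value, so the branch effectively broadcasts a global descriptor to every location. The pure-Python sketch below illustrates that idea on a tiny nested-list feature map; it leaves out the 1*1 convolution, normalization and activation applied in between, and both helper names are hypothetical.

```python
def global_average_pool(feature_map):
    """Average an [H][W][C] nested list over H and W, giving one value per channel."""
    height, width = len(feature_map), len(feature_map[0])
    channels = len(feature_map[0][0])
    totals = [0.0] * channels
    for row in feature_map:
        for pixel in row:
            for c, value in enumerate(pixel):
                totals[c] += value
    return [total / (height * width) for total in totals]


def broadcast_to_map(vector, height, width):
    """Repeat a per-channel vector at every spatial position of an [H][W][C] map."""
    return [[list(vector) for _ in range(width)] for _ in range(height)]


# Toy 2*2 feature map with 2 channels
toy_map = [[[1, 10], [3, 20]],
           [[5, 30], [7, 40]]]
pooled = global_average_pool(toy_map)       # one global value per channel
restored = broadcast_to_map(pooled, 2, 2)   # same vector at every position
print(pooled)
print(restored)
```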

Inspect the architecture of the ASPP module by constructing an input layer of shape [13, 13, 1024].

# (3) Inspect the network structure
if __name__ == '__main__':

    inputs = keras.Input(shape=[13,13,1024])  # input layer
    outputs = aspp(inputs)  # build the ASPP module

    # Build the network model
    model = Model(inputs, outputs)
    model.summary()

The model architecture is as follows:

Model: "model"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to
==================================================================================================
input_1 (InputLayer)            [(None, 13, 13, 1024 0
__________________________________________________________________________________________________
depthwise_conv2d (DepthwiseConv (None, 13, 13, 1024) 9216        input_1[0][0]
__________________________________________________________________________________________________
depthwise_conv2d_1 (DepthwiseCo (None, 13, 13, 1024) 9216        input_1[0][0]
__________________________________________________________________________________________________
depthwise_conv2d_2 (DepthwiseCo (None, 13, 13, 1024) 9216        input_1[0][0]
__________________________________________________________________________________________________
global_average_pooling2d (Globa (None, 1024)         0           input_1[0][0]
__________________________________________________________________________________________________
batch_normalization_1 (BatchNor (None, 13, 13, 1024) 4096        depthwise_conv2d[0][0]
__________________________________________________________________________________________________
batch_normalization_3 (BatchNor (None, 13, 13, 1024) 4096        depthwise_conv2d_1[0][0]
__________________________________________________________________________________________________
batch_normalization_5 (BatchNor (None, 13, 13, 1024) 4096        depthwise_conv2d_2[0][0]
__________________________________________________________________________________________________
reshape (Reshape)               (None, 1, 1, 1024)   0           global_average_pooling2d[0][0]
__________________________________________________________________________________________________
activation_1 (Activation)       (None, 13, 13, 1024) 0           batch_normalization_1[0][0]
__________________________________________________________________________________________________
activation_3 (Activation)       (None, 13, 13, 1024) 0           batch_normalization_3[0][0]
__________________________________________________________________________________________________
activation_5 (Activation)       (None, 13, 13, 1024) 0           batch_normalization_5[0][0]
__________________________________________________________________________________________________
conv2d_4 (Conv2D)               (None, 1, 1, 512)    524288      reshape[0][0]
__________________________________________________________________________________________________
conv2d (Conv2D)                 (None, 13, 13, 512)  524288      input_1[0][0]
__________________________________________________________________________________________________
conv2d_1 (Conv2D)               (None, 13, 13, 512)  524288      activation_1[0][0]
__________________________________________________________________________________________________
conv2d_2 (Conv2D)               (None, 13, 13, 512)  524288      activation_3[0][0]
__________________________________________________________________________________________________
conv2d_3 (Conv2D)               (None, 13, 13, 512)  524288      activation_5[0][0]
__________________________________________________________________________________________________
batch_normalization_7 (BatchNor (None, 1, 1, 512)    2048        conv2d_4[0][0]
__________________________________________________________________________________________________
batch_normalization (BatchNorma (None, 13, 13, 512)  2048        conv2d[0][0]
__________________________________________________________________________________________________
batch_normalization_2 (BatchNor (None, 13, 13, 512)  2048        conv2d_1[0][0]
__________________________________________________________________________________________________
batch_normalization_4 (BatchNor (None, 13, 13, 512)  2048        conv2d_2[0][0]
__________________________________________________________________________________________________
batch_normalization_6 (BatchNor (None, 13, 13, 512)  2048        conv2d_3[0][0]
__________________________________________________________________________________________________
activation_7 (Activation)       (None, 1, 1, 512)    0           batch_normalization_7[0][0]
__________________________________________________________________________________________________
activation (Activation)         (None, 13, 13, 512)  0           batch_normalization[0][0]
__________________________________________________________________________________________________
activation_2 (Activation)       (None, 13, 13, 512)  0           batch_normalization_2[0][0]
__________________________________________________________________________________________________
activation_4 (Activation)       (None, 13, 13, 512)  0           batch_normalization_4[0][0]
__________________________________________________________________________________________________
activation_6 (Activation)       (None, 13, 13, 512)  0           batch_normalization_6[0][0]
__________________________________________________________________________________________________
tf.image.resize (TFOpLambda)    (None, 13, 13, 512)  0           activation_7[0][0]
__________________________________________________________________________________________________
concatenate (Concatenate)       (None, 13, 13, 2560) 0           activation[0][0]
                                                                 activation_2[0][0]
                                                                 activation_4[0][0]
                                                                 activation_6[0][0]
                                                                 tf.image.resize[0][0]
__________________________________________________________________________________________________
conv2d_5 (Conv2D)               (None, 13, 13, 512)  1310720     concatenate[0][0]
__________________________________________________________________________________________________
batch_normalization_8 (BatchNor (None, 13, 13, 512)  2048        conv2d_5[0][0]
__________________________________________________________________________________________________
activation_8 (Activation)       (None, 13, 13, 512)  0           batch_normalization_8[0][0]
__________________________________________________________________________________________________
dropout (Dropout)               (None, 13, 13, 512)  0           activation_8[0][0]
==================================================================================================
Total params: 3,984,384
Trainable params: 3,972,096
Non-trainable params: 12,288
__________________________________________________________________________________________________
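
The totals in the summary can be reproduced by hand, which is a useful sanity check when changing the number of branches or channels. The sketch below is plain arithmetic and does not call TensorFlow. It assumes bias-free convolutions, as in the code above, and that each BatchNormalization layer holds two trainable values (gamma, beta) and two non-trainable values (moving mean, moving variance) per channel. All function names are hypothetical.

```python
def conv_params(kernel_size, in_channels, out_channels):
    """Weights of a standard convolution without bias."""
    return kernel_size * kernel_size * in_channels * out_channels


def depthwise_params(kernel_size, channels):
    """Weights of a depthwise convolution without bias."""
    return kernel_size * kernel_size * channels


def aspp_param_counts(in_channels=1024, branch_channels=512):
    """Return (trainable, non_trainable) parameter counts for the ASPP module."""
    trainable = 0
    non_trainable = 0

    def add_batchnorm(channels):
        nonlocal trainable, non_trainable
        trainable += 2 * channels      # gamma and beta
        non_trainable += 2 * channels  # moving mean and moving variance

    # Branches 2-4: depthwise 3*3 + BN, then pointwise 1*1 + BN
    for _ in range(3):
        trainable += depthwise_params(3, in_channels)
        add_batchnorm(in_channels)
        trainable += conv_params(1, in_channels, branch_channels)
        add_batchnorm(branch_channels)

    # Branch 1 (1*1 conv) and branch 5 (pooled 1*1 conv), each followed by BN
    for _ in range(2):
        trainable += conv_params(1, in_channels, branch_channels)
        add_batchnorm(branch_channels)

    # Fusion: 1*1 conv over the 5 concatenated branches, followed by BN
    trainable += conv_params(1, 5 * branch_channels, branch_channels)
    add_batchnorm(branch_channels)

    return trainable, non_trainable


trainable, non_trainable = aspp_param_counts()
print(trainable, non_trainable, trainable + non_trainable)
```

With the default arguments this gives 3,972,096 trainable and 12,288 non-trainable parameters, matching the summary.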


Origin blog.csdn.net/dgvv4/article/details/123933184