Mask_RCNN Code Walkthrough (matterport version) Series, Part 1: ResNet

Preface
Common to training and inference modes

ResNet Graph

identity_block
conv_block
resnet_graph

Stage 1
Stage 2
Stage 3
Stage 4
Stage 5

Summary

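Preface

Before diving into these nearly three thousand lines of code, it helps to have a basic picture of the overall model architecture. Here are a few things to keep in mind while reading the code: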

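Mask R-CNN evolved through the lineage R-CNN -> Fast R-CNN -> Faster R-CNN -> Mask R-CNN. This implementation uses ResNet as the backbone for feature extraction, together with a Feature Pyramid Network (FPN) to handle multi-scale object detection.
The repo is written with Keras + TensorFlow. In my own testing, Keras 2.2.2 or TensorFlow 1.8 caused problems, so I recommend Keras 2.1.3 with TensorFlow 1.9.
The model has a training mode and an inference mode, and the two modes take different inputs and produce different outputs.
The files most relevant to the Mask R-CNN architecture are model.py and config.py. This series cross-references the two.
The repo is built on Keras Layers. Where Keras does not support an operation, the author first writes it as a function of TensorFlow ops and then wraps that function in a Keras Layer. For example:
rpn_class_loss_graph is a function composed of TensorFlow ops; here it is wrapped in a Keras Lambda layer so that it is compatible with the other Keras layers.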

rpn_class_loss = KL.Lambda(lambda x: rpn_class_loss_graph(*x), name="rpn_class_loss")(
                [input_rpn_match, rpn_class_logits])

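Some operations work on a single sample rather than a whole batch, such as tf.gather and the author's own apply_box_deltas_graph and clip_boxes_graph. The author works around this with a hack: the batch_slice function in utils.py.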
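The idea behind that hack can be illustrated without TensorFlow. The following is a minimal pure-Python sketch, not the repo's batch_slice: the name batch_slice_sketch, the use of nested lists in place of tensors, and the gather_one helper are all assumptions made for illustration. It slices each sample out of every batched input, runs the single-sample function, and regroups the results per output.

```python
def batch_slice_sketch(inputs, fn, batch_size):
    """Apply a single-sample function to every item of a batch.

    inputs: list of batched inputs, each indexable by sample index.
    fn: function written for ONE sample; returns one value or a tuple of values.
    """
    per_sample = []
    for i in range(batch_size):
        sample_args = [x[i] for x in inputs]  # slice sample i out of every input
        out = fn(*sample_args)
        if not isinstance(out, tuple):
            out = (out,)  # normalize to a tuple of outputs
        per_sample.append(out)
    # Regroup: list of per-sample tuples -> one batched list per output.
    batched = [list(group) for group in zip(*per_sample)]
    return batched[0] if len(batched) == 1 else batched


def gather_one(row, indices):
    # Hypothetical single-sample "gather": pick elements of one row by index.
    return [row[j] for j in indices]


rows = [[10, 20, 30], [40, 50, 60]]
idx = [[2, 0], [1, 1]]
gathered = batch_slice_sketch([rows, idx], gather_one, batch_size=2)
print(gathered)  # [[30, 10], [50, 50]]
```

The function passed in never sees the batch dimension, which is the point of the workaround.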

This post covers only the backbone of Mask R-CNN, i.e. ResNet. The remaining parts will be covered in later posts in this series.

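Common to training and inference modes

The backbone extracts features from the input image. The author provides weights pre-trained on the COCO dataset, so during training we can choose to freeze the backbone and train only the heads. Here is the code from the balloon.py sample that trains the heads: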

model.train(dataset_train, dataset_val,
                learning_rate=config.LEARNING_RATE,
                epochs=30,
                layers='heads')

Here the layers argument of model.train is set to 'heads', which means the backbone is skipped during training and keeps its pre-trained weights.
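One way to picture how a keyword like 'heads' can select layers is name matching with a regular expression. The sketch below is an illustration under assumptions, not the repo's code: the pattern HEADS_REGEX, the helper select_trainable, and the fpn_/rpn_/mrcnn_ layer names are stand-ins chosen for the example, while conv1, bn_conv1 and the res* names follow the naming seen in the backbone code in this post.

```python
import re

# Assumed pattern: treat layers whose names start with these prefixes as "heads".
HEADS_REGEX = r"(mrcnn_.*)|(rpn_.*)|(fpn_.*)"


def select_trainable(layer_names, pattern):
    """Return the layer names that fully match the pattern."""
    return [name for name in layer_names if re.fullmatch(pattern, name)]


layer_names = [
    "conv1", "bn_conv1", "res2a_branch2a", "res4w_out",  # backbone-style names
    "fpn_c5p5", "rpn_model", "mrcnn_mask",               # illustrative head names
]
heads = select_trainable(layer_names, HEADS_REGEX)
print(heads)  # ['fpn_c5p5', 'rpn_model', 'mrcnn_mask']
```

Every backbone-style name fails the match, so those layers would stay frozen.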

Note: the weights the author provides are for the ResNet101 version; see "Backbone of pre-trained COCO weights (mask_rcnn_coco.h5) ?"

ResNet Graph

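ResNet is a CNN proposed by Kaiming He et al. Its defining feature is the skip connection, which addresses the phenomenon where, as a network gets deeper, accuracy first saturates and then degrades rapidly; the paper calls this the degradation problem. Within Mask R-CNN, ResNet plays the role of feature extractor: it transforms the raw input image into more abstract representations.

We first look at ResNet's building blocks, identity_block and conv_block, and then at how resnet_graph assembles them into the full architecture.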

identity_block

def identity_block(input_tensor, kernel_size, filters, stage, block,
                   use_bias=True, train_bn=True):
    """The identity_block is the block that has no conv layer at shortcut
    # Arguments
        input_tensor: input tensor
        kernel_size: default 3, the kernel size of middle conv layer at main path
        filters: list of integers, the nb_filters of 3 conv layer at main path
        stage: integer, current stage label, used for generating layer names
        block: 'a','b'..., current block label, used for generating layer names
        use_bias: Boolean. To use or not use a bias in conv layers.
        train_bn: Boolean. Train or freeze Batch Norm layers
    """
    nb_filter1, nb_filter2, nb_filter3 = filters
    conv_name_base = 'res' + str(stage) + block + '_branch'
    bn_name_base = 'bn' + str(stage) + block + '_branch'

    x = KL.Conv2D(nb_filter1, (1, 1), name=conv_name_base + '2a',
                  use_bias=use_bias)(input_tensor)
    x = BatchNorm(name=bn_name_base + '2a')(x, training=train_bn)
    x = KL.Activation('relu')(x)

    x = KL.Conv2D(nb_filter2, (kernel_size, kernel_size), padding='same',
                  name=conv_name_base + '2b', use_bias=use_bias)(x)
    x = BatchNorm(name=bn_name_base + '2b')(x, training=train_bn)
    x = KL.Activation('relu')(x)

    x = KL.Conv2D(nb_filter3, (1, 1), name=conv_name_base + '2c',
                  use_bias=use_bias)(x)
    x = BatchNorm(name=bn_name_base + '2c')(x, training=train_bn)

    x = KL.Add()([x, input_tensor])
    x = KL.Activation('relu', name='res' + str(stage) + block + '_out')(x)
    return x

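A few observations from this code:

str(stage)+block identifies the whole identity block.
The identity block contains three Conv2D+BatchNorm+ReLU groups (the last one has no ReLU), named 2a, 2b, and 2c.
After passing through the three groups, the result is added to the original input and then passed through a ReLU to form the output. This is how the skip connection, ResNet's key idea, is implemented. Note that the addition uses the original input unchanged, which is why this kind of block is called identity_block.

Let's look more closely at how the spatial size (height and width) changes through this structure. Only Conv2D can change the spatial size, so we focus on it. Layer 2a has kernel_size (1, 1). Layer 2b takes its kernel_size from the function argument (usually 3) and uses same padding. Layer 2c, like 2a, has kernel_size (1, 1). All three use the default stride of 1, so the output of the whole identity block has the same height and width as the input.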
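The size argument above can be checked with a little arithmetic. This is a minimal sketch that assumes the usual output-size rules for a 2D convolution ('valid': floor((n - k) / s) + 1, 'same': ceil(n / s)); the helper names conv_out and identity_block_size are made up for this example.

```python
import math


def conv_out(size, kernel, stride=1, padding="valid"):
    """Output length of a convolution along one spatial dimension."""
    if padding == "same":
        return math.ceil(size / stride)
    return (size - kernel) // stride + 1


def identity_block_size(size, kernel_size=3):
    size = conv_out(size, 1)                             # 2a: 1x1, stride 1
    size = conv_out(size, kernel_size, padding="same")   # 2b: kxk, same padding
    size = conv_out(size, 1)                             # 2c: 1x1, stride 1
    return size


print(identity_block_size(144))  # 144
```

Because the size is preserved, the Add layer can sum the main path with the untouched input.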

conv_block

def conv_block(input_tensor, kernel_size, filters, stage, block,
               strides=(2, 2), use_bias=True, train_bn=True):
    """conv_block is the block that has a conv layer at shortcut
    # Arguments
        input_tensor: input tensor
        kernel_size: default 3, the kernel size of middle conv layer at main path
        filters: list of integers, the nb_filters of 3 conv layer at main path
        stage: integer, current stage label, used for generating layer names
        block: 'a','b'..., current block label, used for generating layer names
        use_bias: Boolean. To use or not use a bias in conv layers.
        train_bn: Boolean. Train or freeze Batch Norm layers
    Note that from stage 3, the first conv layer at main path is with subsample=(2,2)
    And the shortcut should have subsample=(2,2) as well
    """
    nb_filter1, nb_filter2, nb_filter3 = filters
    conv_name_base = 'res' + str(stage) + block + '_branch'
    bn_name_base = 'bn' + str(stage) + block + '_branch'

    x = KL.Conv2D(nb_filter1, (1, 1), strides=strides,
                  name=conv_name_base + '2a', use_bias=use_bias)(input_tensor)
    x = BatchNorm(name=bn_name_base + '2a')(x, training=train_bn)
    x = KL.Activation('relu')(x)

    x = KL.Conv2D(nb_filter2, (kernel_size, kernel_size), padding='same',
                  name=conv_name_base + '2b', use_bias=use_bias)(x)
    x = BatchNorm(name=bn_name_base + '2b')(x, training=train_bn)
    x = KL.Activation('relu')(x)

    x = KL.Conv2D(nb_filter3, (1, 1), name=conv_name_base +
                  '2c', use_bias=use_bias)(x)
    x = BatchNorm(name=bn_name_base + '2c')(x, training=train_bn)

    shortcut = KL.Conv2D(nb_filter3, (1, 1), strides=strides,
                         name=conv_name_base + '1', use_bias=use_bias)(input_tensor)
    shortcut = BatchNorm(name=bn_name_base + '1')(shortcut, training=train_bn)

    x = KL.Add()([x, shortcut])
    x = KL.Activation('relu', name='res' + str(stage) + block + '_out')(x)
    return x

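A first pass over this code shows the following:

Compared with identity_block, conv_block takes an extra strides parameter, which defaults to (2, 2). This indicates that conv_block changes the spatial size of its input.
conv_block likewise consists of three Conv2D+BatchNorm+ReLU groups (the last one has no ReLU).
After passing through the three groups, the result is added to shortcut and passed through a ReLU to form the output. Here shortcut is the input transformed by a Conv2D and a BatchNorm. Because the shortcut itself goes through a convolution, this kind of block is called conv_block.

Now let's compute how the spatial size changes in conv_block. Layer 2a has strides (2, 2) by default, which halves the height and width. Layer 2b has kernel_size 3 but padding='same', so it leaves the size unchanged. Layer 2c has kernel_size (1, 1) and also leaves the size unchanged. The shortcut uses the same strides as 2a, (2, 2), so it halves the size as well. The output of the whole conv_block is therefore half the input's height and width.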
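The same bookkeeping applies to conv_block, with the extra requirement that the main path and the shortcut end up the same size so they can be added. A small sketch under the same assumed output-size rules ('valid': floor((n - k) / s) + 1, 'same': ceil(n / s)); conv_out and conv_block_size are illustrative names, not functions from the repo.

```python
import math


def conv_out(size, kernel, stride=1, padding="valid"):
    """Output length of a convolution along one spatial dimension."""
    if padding == "same":
        return math.ceil(size / stride)
    return (size - kernel) // stride + 1


def conv_block_size(size, kernel_size=3, strides=2):
    main = conv_out(size, 1, stride=strides)             # 2a: strided 1x1
    main = conv_out(main, kernel_size, padding="same")   # 2b: size preserved
    main = conv_out(main, 1)                             # 2c: size preserved
    shortcut = conv_out(size, 1, stride=strides)         # branch 1: strided 1x1
    assert main == shortcut, "Add() needs matching spatial sizes"
    return main


print(conv_block_size(144))             # 72
print(conv_block_size(144, strides=1))  # 144, as used in Stage 2
print(conv_block_size(145))             # 73, odd sizes round up
```

With strides of 1 the block keeps the size, which is how Stage 2 uses it.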

resnet_graph

def resnet_graph(input_image, architecture, stage5=False, train_bn=True):
    """Build a ResNet graph.
        architecture: Can be resnet50 or resnet101
        stage5: Boolean. If False, stage5 of the network is not created
        train_bn: Boolean. Train or freeze Batch Norm layers
    """
    assert architecture in ["resnet50", "resnet101"]
    # Stage 1
    x = KL.ZeroPadding2D((3, 3))(input_image)
    x = KL.Conv2D(64, (7, 7), strides=(2, 2), name='conv1', use_bias=True)(x)
    x = BatchNorm(name='bn_conv1')(x, training=train_bn)
    x = KL.Activation('relu')(x)
    C1 = x = KL.MaxPooling2D((3, 3), strides=(2, 2), padding="same")(x)
    # Stage 2
    x = conv_block(x, 3, [64, 64, 256], stage=2, block='a', strides=(1, 1), train_bn=train_bn)
    x = identity_block(x, 3, [64, 64, 256], stage=2, block='b', train_bn=train_bn)
    C2 = x = identity_block(x, 3, [64, 64, 256], stage=2, block='c', train_bn=train_bn)
    # Stage 3
    x = conv_block(x, 3, [128, 128, 512], stage=3, block='a', train_bn=train_bn)
    x = identity_block(x, 3, [128, 128, 512], stage=3, block='b', train_bn=train_bn)
    x = identity_block(x, 3, [128, 128, 512], stage=3, block='c', train_bn=train_bn)
    C3 = x = identity_block(x, 3, [128, 128, 512], stage=3, block='d', train_bn=train_bn)
    # Stage 4
    x = conv_block(x, 3, [256, 256, 1024], stage=4, block='a', train_bn=train_bn)
    block_count = {"resnet50": 5, "resnet101": 22}[architecture]
    for i in range(block_count):
        x = identity_block(x, 3, [256, 256, 1024], stage=4, block=chr(98 + i), train_bn=train_bn)
    C4 = x
    # Stage 5
    if stage5:
        x = conv_block(x, 3, [512, 512, 2048], stage=5, block='a', train_bn=train_bn)
        x = identity_block(x, 3, [512, 512, 2048], stage=5, block='b', train_bn=train_bn)
        C5 = x = identity_block(x, 3, [512, 512, 2048], stage=5, block='c', train_bn=train_bn)
    else:
        C5 = None
    return [C1, C2, C3, C4, C5]

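A first pass over this code shows the following:

Stage 2 through Stage 5 are built from the identity_block and conv_block defined earlier.
The repo supports two backbones, resnet50 and resnet101.
ResNet has 5 stages (or 4, when stage5=False), and each stage produces a C? output. The outputs C1 to C5 are collected into a list and returned.

On the second point, ResNet50 and ResNet101 differ in depth. More precisely, Stage 4 uses 5 or 22 identity_blocks respectively. We can choose the architecture by setting BACKBONE = "resnet101" or BACKBONE = "resnet50" in config.py.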
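The numbers 50 and 101 can be recovered from the block counts. The sketch below assumes the common counting convention: the stem convolution, three main-path convolutions per block (shortcut convolutions not counted), and the final fully connected layer of the original classification network, which resnet_graph does not build. The helper names are made up for this example.

```python
def stage4_block_names(architecture):
    """Block labels of Stage 4: 'a' for the conv_block, then chr(98 + i)."""
    block_count = {"resnet50": 5, "resnet101": 22}[architecture]
    return ["a"] + [chr(98 + i) for i in range(block_count)]


def nominal_depth(architecture):
    # Blocks in stages 2, 3, 4, 5 (one conv_block plus identity_blocks each).
    blocks = [3, 4, len(stage4_block_names(architecture)), 3]
    convs = 1 + 3 * sum(blocks)  # conv1 + three main-path convs per block
    return convs + 1             # + the classifier's fully connected layer


print(nominal_depth("resnet50"))               # 50
print(nominal_depth("resnet101"))              # 101
print(stage4_block_names("resnet101")[-1])     # 'w'
```

The last label explains layer names such as res4w_out in the ResNet101 graph.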

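On the third point, the outputs C1 to C5 are the features extracted by ResNet, and they are later consumed by the Feature Pyramid Network (FPN from here on), which uses C2 to C5.

From the FPN paper:

There are often many layers producing output maps of the same size and we say these layers are in the same network stage.

That is, within each stage (in fact, every stage except Stage 1), every layer's output has the same height and width. We can verify this.

We already know that identity_block does not change the spatial size, and that conv_block halves it when strides=(2,2).

config.py contains the following comment: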

#The multiple of 64 is needed to ensure smooth scaling of feature
#maps up and down the 6 levels of the FPN pyramid (2**6=64).

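In other words, the input image's height and width must be multiples of 64 (the reason is explained at the end of this section). Here we assume the input image is (576, 576).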
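To make the multiple-of-64 constraint concrete, here is a small sketch that tests a size and rounds an arbitrary size up to the next valid one. Both helpers, is_valid_size and next_valid_size, are hypothetical and not part of the repo.

```python
PYRAMID_FACTOR = 2 ** 6  # 64: six halvings between the image and the coarsest level


def is_valid_size(h, w, factor=PYRAMID_FACTOR):
    """True when both dimensions divide evenly by the factor."""
    return h % factor == 0 and w % factor == 0


def next_valid_size(n, factor=PYRAMID_FACTOR):
    """Smallest multiple of factor that is >= n."""
    return -(-n // factor) * factor  # ceiling division, then scale back up


print(is_valid_size(576, 576))  # True
print(is_valid_size(600, 576))  # False
print(next_valid_size(600))     # 640
```

A size of 576 works because 576 = 9 * 64.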

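Stage 1

ZeroPadding2D: pads 3 pixels on each side, making the size (582, 582).
Conv2D: kernel_size is 7 and stride is 2. By the formula for an unpadded convolution, the output height and width are floor((582-7)/2)+1 = 288. There are 64 filters, so the output shape is (288, 288, 64).
BatchNorm: does not change the spatial size.
ReLU: does not change the spatial size.
MaxPooling2D: stride is 2 with same padding, so the height and width are halved, giving shape (144, 144, 64).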
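The Stage 1 steps above can be traced numerically. This sketch assumes the standard size rules (floor((n - k) / s) + 1 for the unpadded convolution, ceil(n / s) for the same-padded pooling); stage1_sizes is a made-up helper name.

```python
import math


def stage1_sizes(size):
    """Return the side length after padding, conv1, and max pooling."""
    padded = size + 2 * 3            # ZeroPadding2D((3, 3)): 3 pixels per side
    conv = (padded - 7) // 2 + 1     # 7x7 conv, stride 2, no extra padding
    pooled = math.ceil(conv / 2)     # 3x3 max pool, stride 2, padding="same"
    return padded, conv, pooled


print(stage1_sizes(576))   # (582, 288, 144)
print(stage1_sizes(1024))  # (1030, 512, 256)
```

In both cases C1 is one quarter of the input side length.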

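Stage 2

The first conv_block sets strides to (1, 1), so the spatial size is unchanged.
The next two identity_blocks also leave the spatial size unchanged.
The last identity_block has 256 filters, so the output shape is (144, 144, 256).
The output feature map is 1/4 the height and width of the input image, so each cell corresponds to a 4*4 step in the input. Strictly speaking this is the stride relative to the input rather than the receptive field, which is larger.

Stage 3

The first conv_block uses the default strides (2, 2), halving the height and width.
The last identity_block has 512 filters, so the output shape is (72, 72, 512).
The output feature map is 1/8 the height and width of the input image, so the stride relative to the input is 8*8.

Stage 4

The first conv_block uses the default strides (2, 2), halving the height and width.
The last identity_block has 1024 filters, so the output shape is (36, 36, 1024).
The output feature map is 1/16 the height and width of the input image, so the stride relative to the input is 16*16.

Stage 5

The first conv_block uses the default strides (2, 2), halving the height and width.
The last identity_block has 2048 filters, so the output shape is (18, 18, 2048).
The output feature map is 1/32 the height and width of the input image, so the stride relative to the input is 32*32.

config.py has a BACKBONE_STRIDES parameter whose default value is [4, 8, 16, 32, 64], which matches the numbers above.
Note that [4,8,16,32] correspond to C2~C5, while 64 corresponds to P6 (which we will see in the MaskRCNN class).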
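Given those strides, the feature-map size at each level follows directly from the image size. A minimal sketch, assuming each level's size is the image size divided by its stride and rounded up; backbone_shapes is an illustrative helper written for this post, and the list of strides is copied from the default quoted above.

```python
import math

BACKBONE_STRIDES = [4, 8, 16, 32, 64]  # default value quoted from config.py


def backbone_shapes(image_shape, strides=BACKBONE_STRIDES):
    """Height and width of the feature map at each stride."""
    h, w = image_shape[:2]
    return [(math.ceil(h / s), math.ceil(w / s)) for s in strides]


shapes = backbone_shapes((576, 576))
print(shapes)  # [(144, 144), (72, 72), (36, 36), (18, 18), (9, 9)]
```

The first four entries agree with the C2 to C5 shapes derived stage by stage, and the last is the P6 size.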

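Recall that the network only accepts images whose height and width are both multiples of 64. The reason is that after ResNet and the MaxPooling2D that appears later in MaskRCNN, the feature map is 1/64 the size of the input; the restriction avoids sizes that do not divide evenly. Below is the check made in the build function of the MaskRCNN class.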

# Image size must be dividable by 2 multiple times
h, w = config.IMAGE_SHAPE[:2]
if h / 2**6 != int(h / 2**6) or w / 2**6 != int(w / 2**6):
    raise Exception("Image size must be dividable by 2 at least 6 times "
                    "to avoid fractions when downscaling and upscaling."
                    "For example, use 256, 320, 384, 448, 512, ... etc. ")

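Summary

In this post we saw that ResNet abstracts an image into the features C1 to C5, which FPN then consumes. The next post introduces FPN itself and the role it plays in the overall Mask R-CNN architecture.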
