[Semantic Segmentation] DeepLab v3 (Cascaded model, ASPP model, comparison of two ASPPs, Multi-grid, training details)

Rethinking Atrous Convolution for Semantic Image Segmentation

This paper was released in 2017 (arXiv:1706.05587). Compared with DeepLab v2, there are three main changes: ① Multi-grid is introduced; ② the ASPP structure is improved; ③ CRF post-processing is removed.

① Introduction of Multi-grid: Multi-grid aims to make better use of dilated convolution and improve segmentation performance across scales. Compared with previous versions (such as DeepLab v2), DeepLab v3 assigns different dilation rates to the layers within a block, overcoming the limitation of a single fixed dilation rate. This lets the model capture richer contextual information at multiple scales and segment objects of different sizes more effectively.

② Improved ASPP structure (Atrous Spatial Pyramid Pooling): The ASPP structure captures multi-scale context from the feature map. In DeepLab v3 it is improved: in addition to the parallel dilated convolutions with different dilation rates, an image-level feature branch (global average pooling) is introduced, so the module integrates both local detail and global context. This enlarges the model's perceptual range and improves segmentation performance.

③ Removal of CRF post-processing (Conditional Random Fields): In DeepLab v2, CRF post-processing further refines the segmentation result, especially boundary details. However, CRF is computationally expensive and increases model complexity. DeepLab v3 removes the CRF step and instead obtains a more complete and accurate feature representation directly in the network by introducing Multi-grid and improving ASPP, reducing the need for CRF. In this way, DeepLab v3 simplifies the model and the training pipeline while maintaining high performance.


DeepLab v3 Overview

DeepLab v3 is a semantic segmentation model. Its core idea is to use deep convolutional neural networks to achieve high-precision semantic segmentation. It is the third version in the DeepLab series and improves on shortcomings of the previous versions.

The core ideas of DeepLab v3 include the following key points:

  1. Dilated Convolution : To enlarge the receptive field, DeepLab v3 relies on dilated convolution. Traditional convolution only considers a local neighborhood when extracting features, while dilated convolution inserts holes into the convolution kernel (controlled by the dilation rate) so that the kernel covers a wider area, thereby capturing broader contextual information.

  2. Multi-scale information fusion : DeepLab v3 fuses multi-scale information by applying spatial pyramid pooling operations to the feature map at different scales. By upsampling and fusing the resulting feature maps, the model obtains richer information across scales, which helps segment targets more accurately.

  3. Global Average Pooling : DeepLab v3 uses global average pooling on the final feature map to obtain global information of the entire image. This helps further improve the model's context-awareness, which is especially important for images containing a wide range of objects.

  4. No Conditional Random Field (CRF) post-processing : In DeepLab v1/v2, CRF post-processing exploited the spatial correlation and color similarity between pixels to smooth the segmentation result and reduce small errors. DeepLab v3 removes this step: the stronger feature representation makes it unnecessary.
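A minimal PyTorch sketch of point 1: with padding set equal to the dilation rate, a dilated 3×3 convolution enlarges the receptive field while preserving the spatial size of the feature map (the channel and spatial sizes here are arbitrary).

```python
import torch
import torch.nn as nn

# A 3x3 kernel with dilation rate r covers a (2r+1) x (2r+1) window while
# keeping only 9 learnable weights. With padding = dilation, H and W are kept.
x = torch.randn(1, 64, 32, 32)

conv_r1 = nn.Conv2d(64, 64, kernel_size=3, padding=1, dilation=1)  # ordinary 3x3
conv_r2 = nn.Conv2d(64, 64, kernel_size=3, padding=2, dilation=2)  # 1-pixel holes
conv_r4 = nn.Conv2d(64, 64, kernel_size=3, padding=4, dilation=4)  # 3-pixel holes

for conv in (conv_r1, conv_r2, conv_r4):
    assert conv(x).shape == x.shape  # same H and W, larger receptive field
```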

Based on the above points, DeepLab v3 can effectively capture contextual information and multi-scale features in images, and fully fuse and process these features to achieve high accuracy in semantic segmentation tasks.


Abstract

In this work, we revisit atrous (dilated) convolution, a powerful tool that can explicitly adjust the receptive field of the convolution kernel and control the resolution of the feature responses computed by CNNs, in the context of semantic image segmentation. To handle the problem of segmenting objects at multiple scales, we design modules that employ atrous convolution with multiple dilation rates, in cascade or in parallel, to capture multi-scale context. In addition, we augment the previously proposed Atrous Spatial Pyramid Pooling (ASPP) module, which probes convolutional features at multiple scales, with image-level features that encode global context, for a further performance boost. We also elaborate on implementation details and share our experience in training the system. The proposed "DeepLabv3" system significantly outperforms our previous DeepLab versions without DenseCRF post-processing, and attains performance comparable to other state-of-the-art models on the PASCAL VOC 2012 semantic image segmentation benchmark.

1. Alternative architectures for capturing multi-scale contextual information


Figure 2. Alternative architecture for capturing multi-scale context

  • (a) Image pyramid : the most intuitive approach. Scale the image to several sizes, run each through the network for inference, and finally fuse the predictions.
  • (b) Encoder-decoder : the backbone first downsamples the input through a series of stages, then the final feature map is upsampled step by step; during upsampling it is fused with the corresponding backbone feature maps until the original image size is restored.
  • (c) Method in DeepLab v1 : set the stride of the last few convolutional stages in the backbone to 1, then introduce dilated convolution to enlarge the network's receptive field.
  • (d) Method in DeepLab v2 : ASPP (Atrous Spatial Pyramid Pooling) is introduced to improve the model's ability to capture multi-scale information.

2. Two model structures of DeepLab v3

  1. Cascaded Model: cascaded model
  2. ASPP Model: atrous spatial pyramid pooling model
  • The ASPP module is not used in the Cascaded model.
  • The Cascaded blocks module is not used in the ASPP model.

Note that although both structures are proposed in the paper, the author states that the ASPP model is slightly better than the Cascaded model. Most open-source implementations on GitHub also use the ASPP model.

2.1 Cascaded Model

[Figure: Cascaded model structure (Block1 ~ Block4 from ResNet, plus extra Block5 ~ Block7)]

The Cascaded model proposed in the paper is shown in the figure above. Block1, Block2, Block3 and Block4 are the original stages of the ResNet network, but in Block4 the stride of the 3×3 convolutional layer in the first residual unit and of the 1×1 convolutional layer on its shortcut branch is changed from 2 to 1 (i.e., no further downsampling), and all ordinary 3×3 convolutional layers in the residual units are replaced by 3×3 dilated convolutional layers. Block5, Block6 and Block7 are additional stages whose structure is exactly the same as Block4, i.e., each consists of three residual units using dilated convolution.

Note❗️The original paper trains the Cascaded model with `output_stride=16` (the downsampling rate of the feature map relative to the input image) but evaluates with `output_stride=8`. With `output_stride=16` the final feature map's H and W are smaller, which allows a larger `batch_size` and speeds up training; however, the smaller feature map loses detail (the paper says it becomes "coarser"), so `output_stride=8` is used for evaluation. In fact, if your GPU memory and compute are sufficient, you can train directly with `output_stride=8`. In short, the author used a 16× downsampling rate during training and 8× during evaluation, but this compromise is unnecessary if you can afford to set the downsampling rate to 8 throughout.

In addition, note that the `rate` marked in the figure is not the actual dilation rate used by the dilated convolutions. The actual dilation rate is the figure's `rate` multiplied by the `Multi-Grid` parameter. For example, in Block4 with `rate=2` and `Multi-Grid=(1, 2, 4)`, the actual dilation rates are 2×(1, 2, 4) = (2, 4, 8). The Multi-Grid parameter is discussed later.

$r_\mathrm{actual} = \mathrm{rate} \times \mathrm{MultiGrid}$
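The rule above can be written as a one-line helper (a sketch; the base rate of 2 for Block4 and 4 for Block5 follow the cascaded-model figure at `output_stride=16`, where the rate doubles with each extra block):

```python
# Actual dilation inside a block = base rate of the block x Multi-Grid element.
def actual_dilations(rate, multi_grid=(1, 2, 4)):
    return tuple(rate * g for g in multi_grid)

assert actual_dilations(2) == (2, 4, 8)    # Block4: rate=2, Multi-Grid=(1, 2, 4)
assert actual_dilations(4) == (4, 8, 16)   # Block5: rate=4
```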

2.2 ASPP Model

2.2.1 Overall model structure

Although most of the paper talks about the Cascaded Model and its corresponding experiments, the most commonly used one is the ASPP Model, whose model structure is shown in the figure below.

[Figure: ASPP model structure]

Note❗️Like the Cascaded model, the original paper trains with `output_stride=16` (the downsampling rate of the feature map relative to the input image) and evaluates with `output_stride=8`. However, the official PyTorch implementation of DeepLabV3 simply sets `output_stride=8` for training as well.

2.2.2 Comparison of two versions of ASPP

2.2.2.1 ASPP (V2 version)

First, review the ASPP structure in DeepLab v2. It consists of four parallel dilated convolution layers, each branch using a different dilation rate (note that these dilated convolution layers are not followed by BatchNorm, and a bias term is used). The outputs of the four branches are then fused by element-wise addition (⊕).

The dilated convolution layers in ASPP (v2 version) have a bias term but no BN layer
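A minimal sketch of a v2-style ASPP head as described above: dilated 3×3 convolutions with a bias term and no BatchNorm, fused by addition. The dilation rates (6, 12, 18, 24) are the ones DeepLab v2 uses; the channel sizes are arbitrary.

```python
import torch
import torch.nn as nn

class ASPPv2(nn.Module):
    """V2-style ASPP: four parallel dilated 3x3 convs WITH bias, WITHOUT BN,
    fused by element-wise addition (not concatenation)."""
    def __init__(self, in_ch, num_classes, rates=(6, 12, 18, 24)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv2d(in_ch, num_classes, 3, padding=r, dilation=r, bias=True)
            for r in rates)

    def forward(self, x):
        return sum(b(x) for b in self.branches)  # add the four branch outputs

y = ASPPv2(16, 3)(torch.randn(1, 16, 8, 8))
assert y.shape == (1, 3, 8, 8)
```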

2.2.2.2 ASPP (V3 version)

Let's take a look at the ASPP structure in DeepLab V3, as shown in the figure below.

The dilated convolutions in ASPP (v3 version) use the classic Conv → BN → Activation stack

The ASPP structure here has 5 parallel branches, namely:

① one 1×1 convolutional layer
② ~ ④ three 3×3 dilated convolutional layers (with different dilation rates)
⑤ one global average pooling layer (followed by a 1×1 convolutional layer, after which bilinear interpolation restores the feature map to the input's H and W).

Regarding the last global pooling branch, the author explains that it adds global contextual information (Global Contextual Information).

The outputs of these five branches are then concatenated (Concat) along the channel dimension, and finally a 1×1 convolutional layer fuses the information further.
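Putting the five branches together, a minimal sketch of the v3 ASPP. The dilation rates (6, 12, 18) correspond to `output_stride=16`; the 2048 → 256 channel sizes follow the common ResNet setup but are otherwise illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def conv_bn_relu(in_ch, out_ch, k, dilation=1):
    # The Conv -> BN -> ReLU stack used on every ASPP branch in v3.
    pad = 0 if k == 1 else dilation
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, k, padding=pad, dilation=dilation, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True))

class ASPP(nn.Module):
    def __init__(self, in_ch=2048, out_ch=256, rates=(6, 12, 18)):
        super().__init__()
        self.branch1 = conv_bn_relu(in_ch, out_ch, 1)                  # ① 1x1 conv
        self.atrous = nn.ModuleList(
            conv_bn_relu(in_ch, out_ch, 3, r) for r in rates)          # ② ~ ④
        self.image_pool = nn.Sequential(                               # ⑤ global pooling
            nn.AdaptiveAvgPool2d(1), conv_bn_relu(in_ch, out_ch, 1))
        self.project = conv_bn_relu(out_ch * (len(rates) + 2), out_ch, 1)

    def forward(self, x):
        h, w = x.shape[-2:]
        pooled = F.interpolate(self.image_pool(x), size=(h, w),
                               mode="bilinear", align_corners=False)
        feats = [self.branch1(x)] + [b(x) for b in self.atrous] + [pooled]
        return self.project(torch.cat(feats, dim=1))  # concat, then 1x1 fuse

x = torch.randn(2, 2048, 32, 32)
assert ASPP()(x).shape == (2, 256, 32, 32)
```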

3. Multi-grid (multi-level grid method)

Although dilated convolution was already used in DeepLab v1 and v2, the dilation rates were set somewhat arbitrarily. In DeepLab v3 the author runs experiments to find more reasonable settings. The following table uses the Cascaded model (with ResNet-101 as the backbone) to study the effect (mean IoU) of different numbers of cascaded blocks and different Multi-Grid parameters.


Table 3. When the output stride (downsampling multiple) is 16, different configurations of the number of Cascaded blocks are performed on ResNet-101 using the multi-grid method (Multi-grid). Best model performance is shown in bold

Note❗️

  • block5 ~ block7 have the same structure as block4, but with different dilation rates.
  • As I just mentioned when talking about the Cascaded model, the actual expansion coefficient used in blocks should be the rate in the picture multiplied by the Multi-Grid parameter here.
  • How to read this table:
    • Multi-Grid (rows) : the Multi-Grid parameters used in each run (applied to the blocks, even if a listed block does not exist in that configuration)
    • block4 (column) : only block4 is used, no block5 ~ block7
    • block5 (column) : block4 and block5 are used, no block6 ~ block7
    • block6 (column) : block4 ~ block6 are used, no block7
    • block7 (column) : block4 ~ block7 are all used
  • Output stride (`output stride`): downsampling multiple

Found through experiments:

  • When three additional blocks are used (i.e., Block5, Block6 and Block7 are added), Multi-Grid = (1, 2, 1) gives the best result: this is the best Cascaded model.
  • When no additional block is added (no Block5, Block6 or Block7), Multi-Grid = (1, 2, 4) gives the best result: this is the setting relevant to the ASPP model, since the ASPP model adds no extra blocks. The ASPP ablation experiments discussed later use Multi-Grid = (1, 2, 4).

4. Ablation experiment

4.1 Cascaded model ablation experiment

The following table is about ablation experiments of Cascaded model.


Table 4. Inference strategies on the validation set. MG: Multi-grid. OS: output stride (downsampling multiple). MS: multi-scale inputs during testing. Flip: adding left-right flipped inputs

Where:

  • MG stands for Multi-Grid; as mentioned above, MG = (1, 2, 1) works best in the Cascaded model.
  • OS stands for `output_stride` (downsampling multiple); as mentioned above, setting `output_stride` to 8 during evaluation gives better results.
  • MS stands for multi-scale, similar to DeepLab v2, but DeepLab v3 uses more scales: `scales = {0.5, 0.75, 1.0, 1.25, 1.5, 1.75}`
  • Flip means additionally feeding a horizontally flipped image

4.2 ASPP model ablation experiment

The following table is about the ablation experiments of the ASPP model.


Table 6. Inference strategies on the validation set. MG: Multi-grid. ASPP: Atrous spatial pyramid pooling. OS: output stride (downsampling multiple). MS: multi-scale inputs during testing. Flip: adding left-right flipped inputs. COCO: model pretrained on MS-COCO

Where:

  • MG stands for Multi-Grid; as mentioned above, MG = (1, 2, 4) works best in the ASPP model.
  • ASPP was described earlier
  • Image Pooling means adding the global average pooling branch to ASPP
  • OS stands for `output_stride`; as mentioned above, setting `output_stride` to 8 during evaluation gives better results.
  • MS stands for multi-scale, similar to DeepLab v2, but DeepLab v3 uses more scales: `scales = {0.5, 0.75, 1.0, 1.25, 1.5, 1.75}`
  • Flip means additionally feeding a horizontally flipped image
  • COCO means pre-training on the COCO dataset

5. Training details

The following table gives the mean IoU of DeepLab V3 on the PASCAL VOC 2012 test set, as reported in the original paper.

[Table: DeepLab V3 mean IoU on the PASCAL VOC 2012 test set]

Comparison shows that DeepLab V3 improves on V2 by about 6 points. However, the paper does not clearly state whether this refers to the Cascaded model or the ASPP model; it most likely refers to the ASPP model. It is worth asking where these 6 points come from: introducing Multi-Grid, improving the ASPP module and using more scales in multi-scale testing alone should not account for all of them, so changes in the training procedure must also contribute to the increase in mean IoU.

In section A. Effect of hyper-parameters of the paper, the author says:

  • During training, the size of the training input image is increased (note: when a large dilation rate is used, the input image cannot be too small, otherwise the 3×3 dilated convolution may degenerate into an ordinary 1×1 convolution).
  • When computing the loss, the prediction is first upsampled back to the original scale (i.e., the network's final 8× bilinear upsampling) and then compared with the full-resolution ground-truth labels. According to the experiment in Table 8, this improves results by more than 1 point.
    Previously, in v1 and v2, the loss was computed between the 8×-downsampled ground-truth labels and the un-upsampled prediction (to speed up training).
  • After training, the BN layer parameters are frozen and the other layers are fine-tuned. According to the experiment in Table 8, this also improves results by more than 1 point.
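The second point, computing the loss at full resolution, can be sketched as follows (the shapes are illustrative; `ignore_index=255` is the usual VOC convention for void pixels):

```python
import torch
import torch.nn.functional as F

# V3 upsamples the logits 8x back to the label's resolution before the
# cross-entropy loss, instead of downsampling the label 8x as in v1/v2.
logits = torch.randn(2, 21, 64, 64)            # network output at output_stride=8
labels = torch.randint(0, 21, (2, 512, 512))   # ground truth at full resolution

logits_full = F.interpolate(logits, size=labels.shape[-2:],
                            mode="bilinear", align_corners=False)
loss = F.cross_entropy(logits_full, labels, ignore_index=255)
assert loss.dim() == 0  # scalar loss
```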

6. PyTorch officially implements the DeepLab V3 model structure

The following figure shows the network structure drawn by PILIBALA WZ based on the official PyTorch implementation of DeepLab V3 (it differs slightly from the original paper):

  • The official PyTorch implementation of DeepLab V3 does not use Multi-Grid
  • It has an additional FCNHead auxiliary training branch, which can be disabled
  • It uses `output_stride=8` for both training and evaluation
  • The dilation rates of the three dilated convolution branches in ASPP are {12, 24, 36}, because the paper says the dilation rates should be doubled when `output_stride=8`

[Figure: official PyTorch DeepLab V3 network structure]

Sources

  1. https://blog.csdn.net/qq_37541097/article/details/121797301
  2. https://www.bilibili.com/video/BV1Jb4y1q7j7


Origin blog.csdn.net/weixin_44878336/article/details/131978728