Paper notes: "U^2-Net: Going Deeper with Nested U-Structure for Salient Object Detection"

Reference code: U-2-Net

1. Overview

Introduction: This article proposes a new network structure, U^2-Net, for salient object detection (segmentation may be the more accurate term). The network is a two-level nested U-structure: U-shaped sub-modules (ReSidual U-blocks, RSU-L) are assembled into one larger U-shaped network. The article points out two benefits of this design: 1) it greatly improves the network's ability to capture contextual information at different scales, obtained by mixing feature maps of different resolutions inside each block; 2) the pooling operations inside the RSU allow the network to be deepened without incurring a large computational overhead. In addition, the article argues that for this specific task no ImageNet pre-trained backbone is required: the network is trained directly from scratch, saving the cost of pre-training while, judging from the reported results, still performing well. The full-size U^2-Net proposed in the article (176.3 MB) runs at 30 FPS on a 1080Ti with 320x320 input, and the small version (4.7 MB) runs at 40 FPS.

Backbones pre-trained on ImageNet mainly learn to extract the semantic information of the input image needed to complete the classification task, whereas local detail and global contrast information matter more for salient object detection. The article therefore defines a network suited to salient object detection and trains it directly from scratch. The final results show that this approach achieves comparable performance, eliminating the need for pre-training.

Better salient object detection requires feature maps at more scales, together with better ways of extracting features (i.e., a better network structure) at each scale. The U-structure built from RSU modules achieves strong performance at a relatively small model size and computational cost. The figure below compares model size versus performance against other methods:
[Figure: model size vs. performance of U^2-Net compared with other salient object detection methods]

2. Method design

2.1 RSU-L module

Current networks are typically built by stacking modules. The structure of the RSU module designed by the article is shown below:
[Figure: structure of the RSU-L module]
RSU-L$(C_{in}, M, C_{out})$, where the three parameters are the input channels, intermediate channels, and output channels respectively. Its computation can be divided into 3 steps:

  • 1) An input convolution: a $3\times 3$ convolution transforms the input $x$ into an intermediate feature map $\mathcal{F}_1(x)$ whose channel count matches the output ($C_{out}$);
  • 2) A U-Net-like symmetric encoder-decoder of height $L$, where $L$ is the number of down/up-sampling levels. This extracts multi-scale features inside the block, deepens the network, and yields larger receptive fields and richer local and global features. This step is denoted $\mathcal{U}(\mathcal{F}_1(x))$;
  • 3) A residual connection that adds the input feature map back, fusing local and multi-scale information: $\mathcal{F}_1(x) + \mathcal{U}(\mathcal{F}_1(x))$; see the code sketch after this list.
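
To make the three steps concrete, here is a minimal PyTorch sketch of an RSU block of height $L=4$; the class names (`ConvBnReLU`, `RSU4`) and exact layer arrangement are illustrative assumptions, not the reference repository's code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvBnReLU(nn.Module):
    """3x3 conv + BN + ReLU, the basic unit used throughout this sketch."""
    def __init__(self, in_ch, out_ch, dilation=1):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, 3, padding=dilation, dilation=dilation)
        self.bn = nn.BatchNorm2d(out_ch)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.bn(self.conv(x)))

class RSU4(nn.Module):
    """RSU block of height L=4: input conv -> small U-Net -> residual add."""
    def __init__(self, in_ch, mid_ch, out_ch):
        super().__init__()
        self.conv_in = ConvBnReLU(in_ch, out_ch)               # step 1: F1(x)
        # encoder path (step 2, downsampling side)
        self.enc1 = ConvBnReLU(out_ch, mid_ch)
        self.enc2 = ConvBnReLU(mid_ch, mid_ch)
        self.enc3 = ConvBnReLU(mid_ch, mid_ch)
        self.bottom = ConvBnReLU(mid_ch, mid_ch, dilation=2)   # dilated bottom layer
        # decoder path (step 2, upsampling side; inputs are concatenations)
        self.dec3 = ConvBnReLU(mid_ch * 2, mid_ch)
        self.dec2 = ConvBnReLU(mid_ch * 2, mid_ch)
        self.dec1 = ConvBnReLU(mid_ch * 2, out_ch)
        self.pool = nn.MaxPool2d(2, stride=2, ceil_mode=True)

    def forward(self, x):
        fx = self.conv_in(x)
        e1 = self.enc1(fx)
        e2 = self.enc2(self.pool(e1))
        e3 = self.enc3(self.pool(e2))
        b = self.bottom(e3)
        d3 = self.dec3(torch.cat([b, e3], dim=1))
        d3 = F.interpolate(d3, size=e2.shape[2:], mode='bilinear', align_corners=False)
        d2 = self.dec2(torch.cat([d3, e2], dim=1))
        d2 = F.interpolate(d2, size=e1.shape[2:], mode='bilinear', align_corners=False)
        d1 = self.dec1(torch.cat([d2, e1], dim=1))
        return fx + d1  # step 3: F1(x) + U(F1(x))
```

For example, `RSU4(3, 12, 64)` maps an `(N, 3, H, W)` tensor to `(N, 64, H, W)`: the spatial size is preserved, which is what allows the residual addition in step 3.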

Comparing the RSU block with the residual block in ResNet, the difference between them is shown below:
[Figure: the ResNet residual block vs. the RSU block]
In addition, the other module designs in Figure 2 are compared with the article's RSU module:
[Figure 2: existing convolution block designs compared with the RSU module]
Finally, the performance of the RSU module is compared against these other module designs:
[Figure: computational cost comparison of the different module designs]

2.2 U^2-Net network structure

The network structure proposed in the article is obtained by nesting the U-shaped structure. Trading off computation against GPU memory, the article limits the nesting to two levels, yielding the network structure shown below:
[Figure 5: the U^2-Net architecture]
This network structure can be divided into 3 components:

  • 1) Six encoder stages $En\_1$ through $En\_6$. Stages $En\_1$–$En\_4$ use RSU-7, RSU-6, RSU-5, and RSU-4 respectively, where the number (7/6/5/4) is the height $L$ of the RSU block, determined by the resolution of the input feature map (higher-resolution inputs permit more downsampling). Stages $En\_5$ and $En\_6$ receive feature maps whose resolution is already small, so they use the RSU-4F module, in which dilated convolutions replace the pooling and up-sampling operations; the input and output feature maps of RSU-4F therefore have the same size;
  • 2) Five decoder stages $De\_1$ through $De\_5$. As can be seen in Figure 5, their structure mirrors that of the symmetric encoder stages, except that each decoder stage takes as input the concatenation of the up-sampled feature map from the previous stage and the feature map from its symmetric encoder stage;
  • 3) A saliency-map fusion module. The stages above produce feature maps of different scales: $En\_6, De\_5, De\_4, De\_3, De\_2, De\_1$. The article applies a $3\times 3$ convolution and a sigmoid to each of them to estimate the side saliency outputs $S_{side}^{(6)}, S_{side}^{(5)}, S_{side}^{(4)}, S_{side}^{(3)}, S_{side}^{(2)}, S_{side}^{(1)}$. These side outputs are then up-sampled to the input image size and concatenated, and a $1\times 1$ convolution followed by a sigmoid produces the fused output $S_{fuse}$ (a code sketch of this step follows the list).
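
A minimal PyTorch sketch of this fusion step; the class name `SideOutputFusion` and the `stage_channels` argument are illustrative assumptions, and the real module lives in the reference repository:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SideOutputFusion(nn.Module):
    """One 3x3 conv per stage produces a 1-channel side map; all side maps are
    upsampled to the input size, concatenated, and fused by a 1x1 conv."""
    def __init__(self, stage_channels):
        super().__init__()
        # stage_channels: output channels of En_6, De_5, ..., De_1
        self.side_convs = nn.ModuleList(
            nn.Conv2d(c, 1, kernel_size=3, padding=1) for c in stage_channels
        )
        self.fuse_conv = nn.Conv2d(len(stage_channels), 1, kernel_size=1)

    def forward(self, stage_feats, out_size):
        sides = []
        for conv, feat in zip(self.side_convs, stage_feats):
            s = conv(feat)
            s = F.interpolate(s, size=out_size, mode='bilinear', align_corners=False)
            sides.append(s)
        fused = self.fuse_conv(torch.cat(sides, dim=1))
        # a sigmoid turns each map into a per-pixel saliency probability
        return [torch.sigmoid(s) for s in sides], torch.sigmoid(fused)
```

Whether the sigmoid is applied before or after concatenation is an implementation detail; the sketch applies it at the end so the fusion conv operates on logits.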

To suit different application needs, the article designs two versions of U^2-Net, one large and one small, as shown in Table 1:
[Table 1: configurations of the full-size and small U^2-Net]

2.3 Network loss function

Since the network produces outputs at multiple scales, the loss function naturally consists of multiple parts:
$$L = \sum_{m=1}^{M} w_{side}^{(m)} l_{side}^{(m)} + w_{fuse} l_{fuse}$$
Each component loss $l$ is the binary cross-entropy between the predicted saliency map $P_S$ and the ground-truth mask $P_G$, summed over all pixels $(r, c)$ of an $H \times W$ map:
$$l = -\sum_{(r,c)}^{(H,W)} \left[ P_{G(r,c)} \log P_{S(r,c)} + \left(1 - P_{G(r,c)}\right) \log\left(1 - P_{S(r,c)}\right) \right]$$
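
A minimal sketch of this deep-supervision loss in PyTorch, assuming the side and fused outputs have already passed through a sigmoid as described above; the function and argument names are illustrative, and the weights simply default to 1 here:

```python
import torch.nn as nn

bce = nn.BCELoss(reduction='mean')

def u2net_loss(side_outputs, fused_output, target, w_side=None, w_fuse=1.0):
    """Sum of BCE losses over all side outputs plus the fused output.

    side_outputs: list of (N, 1, H, W) probability maps S_side^(m)
    fused_output: (N, 1, H, W) probability map S_fuse
    target:       (N, 1, H, W) binary ground-truth mask
    """
    if w_side is None:
        w_side = [1.0] * len(side_outputs)
    loss = w_fuse * bce(fused_output, target)
    for w, s in zip(w_side, side_outputs):
        loss = loss + w * bce(s, target)
    return loss
```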

3. Experimental results

Result 1:
[Figure: experimental results, part 1]
Result 2:
[Figure: experimental results, part 2]

Origin blog.csdn.net/m_buddy/article/details/111189625