Region Normalization (RN) for Image Inpainting

Paper (AAAI 2020): https://arxiv.org/pdf/1911.10375v1.pdf

Motivation: Traditional image inpainting methods use feature normalization (FN) to help network training, but FN is typically performed over the entire image, without considering the impact of corrupted-region pixels on the mean and variance. This paper proposes Region Normalization (RN), which separates the spatial pixels into different regions according to the mask and then computes the mean and variance within each region.

Key idea: Two RNs are designed for image inpainting, both of which normalize by region:
(1) Basic RN (RN-B), which normalizes the corrupted and uncorrupted areas separately based on the input mask. (2) Learnable RN (RN-L), which automatically detects potentially corrupted and uncorrupted areas, normalizes them separately, and performs a global affine transformation to enhance their fusion. RN-B is used in the shallow layers of the network and RN-L in the deep layers. (The latter is a learnable version that only takes feature maps as input.)

The overall structure of the network:

[Figure: overall network architecture]

It consists of three parts: an Encoder, Residual Blocks, and a Decoder. The Encoder uses RN-B, while the Residual Blocks and the Decoder use RN-L.
Discriminator: it copies the PatchGAN structure (Isola et al. 2017; Zhu et al. 2017) and adopts its loss function, which includes four parts: reconstruction loss, adversarial loss, perceptual loss, and style loss.
There is no innovation in the network structure or loss function; they are assembled from existing, well-proven methods.
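As a hedged summary of how these four terms are typically combined into a single objective (this notation is mine, and the weights $\lambda$ are hyperparameters the post does not specify):

$$\mathcal{L} = \lambda_{rec}\,\mathcal{L}_{rec} + \lambda_{adv}\,\mathcal{L}_{adv} + \lambda_{perc}\,\mathcal{L}_{perc} + \lambda_{style}\,\mathcal{L}_{style}$$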

4. The paper's core innovation: the definition of Region Normalization

Each input feature map has four dimensions: N, C, H, W, representing the batch size, number of channels, height, and width of the feature map. Since this is region normalization, there must be different regions. The author gives the following expression:

$$F_{n,c} = R^1_{n,c} \cup R^2_{n,c} \cup \cdots \cup R^K_{n,c}, \qquad F \in \mathbb{R}^{N \times C \times H \times W}$$

In addition to n and c, there are H and W, so each feature map in the batch is divided into several blocks along the H and W dimensions. The idea of the RN algorithm is relatively easy to understand: the green part represents the corrupted data, the red part represents the uncorrupted data, and the two parts are normalized separately. As shown below:

[Figure: illustration of RN — the regions of each (n, c) slice are normalized separately and then merged]

Using n and c as the index, determine the slice to be processed, divide it into several sub-regions, normalize the regions separately, and then merge them.
The normalization applied to each region is the usual one:

$$\hat{R}^k_{n,c} = \frac{R^k_{n,c} - \mu^k_{n,c}}{\sigma^k_{n,c}}, \qquad
\mu^k_{n,c} = \frac{1}{|R^k_{n,c}|} \sum_{x \in R^k_{n,c}} x, \qquad
\sigma^k_{n,c} = \sqrt{\frac{1}{|R^k_{n,c}|} \sum_{x \in R^k_{n,c}} \left(x - \mu^k_{n,c}\right)^2 + \epsilon}$$

It is the usual process of subtracting the mean and then dividing by the standard deviation. The mean and standard deviation are computed in the traditional way; the only thing to note is that they are computed over the pixels of a single region.
The author explains that this method is actually an extension of Instance Normalization (IN): when the number of regions is 1, RN reduces to IN. In image inpainting the number of regions is set to 2: the intact region and the masked region.
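To make the computation concrete, here is a minimal PyTorch sketch of the two-region case, assuming an (N, C, H, W) feature map and an (N, 1, H, W) binary mask; `region_normalize` is a hypothetical helper written for illustration, not the authors' code:

```python
import torch

def region_normalize(x: torch.Tensor, mask: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
    """Two-region normalization of a feature map.

    x:    features of shape (N, C, H, W)
    mask: binary map of shape (N, 1, H, W); 1 marks the corrupted region.
    Each region is normalized per sample and per channel, i.e. Instance
    Normalization restricted to that region's pixels.
    """
    out = torch.zeros_like(x)
    for region in (mask, 1.0 - mask):  # corrupted region, then intact region
        count = region.sum(dim=(2, 3), keepdim=True).clamp(min=1.0)
        mean = (x * region).sum(dim=(2, 3), keepdim=True) / count
        var = ((x - mean) ** 2 * region).sum(dim=(2, 3), keepdim=True) / count
        out = out + region * (x - mean) / torch.sqrt(var + eps)
    return out
```

With an all-zero mask the intact region covers the whole map and the masked term contributes nothing, so the computation reduces to Instance Normalization, matching the author's remark.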

5. The author applies RN in two forms: RN-B and RN-L

1) RN-B (Basic Region Normalization)

[Figure: structure of RN-B]

This method divides the input into two regions (masked/unmasked) according to the input mask. The specific rule is as follows:

$$x_{n,c,h,w} \in \begin{cases} R^1_{n,c} & \text{if } M(h, w) = 255 \text{ (masked)} \\ R^2_{n,c} & \text{otherwise (unmasked)} \end{cases}$$

That is, wherever the mask pixel value is 255, the pixel is labeled as masked. The two regions are normalized by the method above: the mean and variance are computed separately for each region, and the results are merged into a complete feature map. However, each channel's features now have two sets of learnable parameters, one per region, rather than a single set of weights and biases per channel.
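A minimal sketch of RN-B as a module, under the assumption that each of the two regions gets its own per-channel scale and shift; the class name `RNB` and the parameter shapes are illustrative, not the official implementation:

```python
import torch
import torch.nn as nn

class RNB(nn.Module):
    """Sketch of Basic Region Normalization (RN-B)."""

    def __init__(self, channels: int, eps: float = 1e-5):
        super().__init__()
        # two (gamma, beta) pairs per channel: index 0 = masked, 1 = unmasked
        self.gamma = nn.Parameter(torch.ones(2, 1, channels, 1, 1))
        self.beta = nn.Parameter(torch.zeros(2, 1, channels, 1, 1))
        self.eps = eps

    def forward(self, x: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
        # mask: (N, 1, H, W) with 1 where the input image was masked
        out = torch.zeros_like(x)
        for k, region in enumerate((mask, 1.0 - mask)):
            count = region.sum(dim=(2, 3), keepdim=True).clamp(min=1.0)
            mean = (x * region).sum(dim=(2, 3), keepdim=True) / count
            var = ((x - mean) ** 2 * region).sum(dim=(2, 3), keepdim=True) / count
            x_hat = (x - mean) / torch.sqrt(var + self.eps)
            out = out + region * (self.gamma[k] * x_hat + self.beta[k])
        return out
```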

2) RN-L (Learnable Region Normalization)

In the deep layers of the network, the corrupted and uncorrupted areas become harder and harder to distinguish, and a corresponding mask is difficult to obtain, so RN-L predicts the corrupted region itself and helps fuse the two regions through a global affine transformation. The following is the structure diagram of RN-L:

[Figure: structure of RN-L]

This method no longer requires a manually supplied mask to divide the regions. As shown in the figure above, max pooling and mean pooling are first performed on the input feature maps along the channel axis to obtain two 1×H×W maps. The original text says the two pooling operations are able to obtain an efficient feature descriptor.
The two pooled maps are then convolved and passed through a Sigmoid activation to obtain a spatial response map:

$$M_{sr} = \mathrm{Sigmoid}\left(\mathrm{Conv}\left(\left[\mathrm{MaxPool}(F);\ \mathrm{AvgPool}(F)\right]\right)\right)$$

where $[\,\cdot\,;\,\cdot\,]$ denotes channel-wise concatenation.

Then a threshold of t = 0.8 is applied to M_{sr} to judge whether each location belongs to the masked area (I don't quite understand the rationale for doing this; it may be explained by the two pooling operations mentioned above):

$$M(h, w) = \begin{cases} 1 & \text{if } M_{sr}(h, w) > t \\ 0 & \text{otherwise} \end{cases}$$

The 0.8 threshold only plays a role in forward-propagation inference and does not affect the gradient update in backpropagation. (I haven't fully worked out the effect of this sentence, including the possible impact of the operation; I need to read the code to understand this part.)
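My reading of that sentence, as a hedged sketch: binarizing M_{sr} with a comparison is itself non-differentiable, so the hard mask carries no gradient; gradients still reach M_{sr} through any branch that uses the soft map directly (such as the γ/β convolutions below). The helper name is hypothetical:

```python
import torch

def region_mask_from_response(m_sr: torch.Tensor, t: float = 0.8) -> torch.Tensor:
    """Binarize the spatial response map with a hard threshold.

    m_sr: spatial response map in (0, 1), shape (N, 1, H, W).
    The comparison is non-differentiable, so the resulting mask only
    selects regions in the forward pass and contributes no gradient.
    """
    return (m_sr > t).float()
```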

As for learning the γ (scale) and β (shift) parameters, the text says they are also obtained through convolution operations:

$$\gamma = \mathrm{Conv}_\gamma(M_{sr}), \qquad \beta = \mathrm{Conv}_\beta(M_{sr})$$

How this convolution, and the one applied to the two pooled maps above, are implemented requires reading the code. Do γ and β expand along the channel dimension during the affine transformation? A sketch of one plausible reading follows.
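Putting the pieces together, here is a minimal PyTorch sketch of RN-L under my reading of the figure; the class name `RNL`, kernel sizes, and layer shapes are assumptions for illustration, not the paper's exact implementation:

```python
import torch
import torch.nn as nn

class RNL(nn.Module):
    """Sketch of Learnable Region Normalization (RN-L)."""

    def __init__(self, channels: int, t: float = 0.8, eps: float = 1e-5):
        super().__init__()
        # detects the corrupted region from the two pooled descriptors
        self.detect = nn.Conv2d(2, 1, kernel_size=3, padding=1)
        # predict pixel-wise affine parameters, expanded to C channels
        self.conv_gamma = nn.Conv2d(1, channels, kernel_size=3, padding=1)
        self.conv_beta = nn.Conv2d(1, channels, kernel_size=3, padding=1)
        self.t = t
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # channel-wise max/mean pooling -> two 1xHxW descriptors
        mx = x.max(dim=1, keepdim=True).values
        mn = x.mean(dim=1, keepdim=True)
        m_sr = torch.sigmoid(self.detect(torch.cat([mx, mn], dim=1)))
        mask = (m_sr > self.t).float()  # hard region mask, forward pass only
        out = torch.zeros_like(x)
        for region in (mask, 1.0 - mask):  # normalize each region separately
            count = region.sum(dim=(2, 3), keepdim=True).clamp(min=1.0)
            mean = (x * region).sum(dim=(2, 3), keepdim=True) / count
            var = ((x - mean) ** 2 * region).sum(dim=(2, 3), keepdim=True) / count
            out = out + region * (x - mean) / torch.sqrt(var + self.eps)
        # affine transformation predicted from the soft response map;
        # gradients reach m_sr through these convolutions
        return self.conv_gamma(m_sr) * out + self.conv_beta(m_sr)
```

Under this reading, the answer to the channel question is that the 1-channel response map is expanded to C channels directly by the γ/β convolutions, so the affine parameters vary per pixel and per channel.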

Thoughts

  1. If the corrupted and uncorrupted areas of the shallow network are merged directly after normalization, the result may not be smooth at the boundary. The spatially adaptive loss mentioned in the earlier Prior post could be used as the final loss to help the network smooth the output, but that acts on the network output. As for how to merge smoothly inside RN-B, I think a boundary region could be added to the normalization-and-fusion module, i.e., corrupted/uncorrupted/boundary, with the boundary region handled by averaging over the three.
  2. Regarding the acquisition of the mask for the feature map, simply using a threshold is not convincing enough. At the very least, the thresholds inside and outside the original image's mask should not be the same.
  3. I have some doubts about computing the deep affine coefficients pixel-wise. The function of the affine coefficients is to scale the normalized pixels, so within a local area I think these coefficients should be shared; but deciding which areas should share them does require some form of "segmentation".

