Natural Image Matting via Guided Contextual Attention



1. Background
Deep learning-based methods have made significant progress in natural image matting. Many of them produce visually plausible alpha estimates, but they often yield blurry structures or textures in semi-transparent regions. This is caused by the local ambiguity of transparent objects: a pixel's appearance alone does not determine its opacity.
A natural remedy is to use surrounding context to estimate local opacity. Existing affinity-based and sampling-based matting methods share this strategy of learning from image regions with similar appearance, and differ as follows:
1) affinity-based: transfer the opacity of known regions to unknown regions with a similar appearance;
2) sampling-based: estimate the alpha value of each pixel in the unknown region by combining sampled foreground/background pairs under certain compositing assumptions.
Both families have achieved good results, but neither can handle trimaps that contain only background and unknown regions (i.e., very few known pixels), because they require foreground and background information simultaneously. Moreover, traditional affinity-based methods are computationally expensive, making them ill-suited to high-resolution alpha estimation.
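To make the affinity-based idea concrete, here is a toy sketch (not the paper's algorithm): the alpha of an unknown pixel is estimated as a similarity-weighted average of the alphas of known pixels, using a hypothetical Gaussian appearance kernel over scalar intensity features.

```python
import math

def propagate_alpha(unknown_feature, known, sigma=0.1):
    """Estimate the alpha of an unknown pixel as a similarity-weighted
    average of the alphas of known pixels.

    known: list of (feature, alpha) pairs; features are scalar
    intensities here purely for illustration."""
    weights = [math.exp(-((unknown_feature - f) ** 2) / (2 * sigma ** 2))
               for f, _ in known]
    total = sum(weights)
    if total == 0:
        return 0.5  # no similar known pixel: fall back to a neutral guess
    return sum(w * a for w, (f, a) in zip(weights, known)) / total

# A pixel that looks like the foreground samples gets alpha close to 1.
known_pixels = [(0.9, 1.0), (0.85, 1.0), (0.1, 0.0), (0.15, 0.0)]
print(round(propagate_alpha(0.88, known_pixels), 3))
```

This also illustrates the failure mode mentioned above: if the known set contains only background pixels, the weighted average can never produce an informed foreground estimate.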
2. Content
Inspired by the success of affinity-based methods and of contextual attention in image inpainting, the authors develop a novel end-to-end natural image matting network with a guided contextual attention (GCA) module.
The method directs the network's information flow from the image context straight to the unknown pixels. The GCA module propagates affinity within the network: low-level image features guide the attention over alpha features, transferring features from known regions with similar appearance to the unknown region.
[Figure: qualitative example with the attention maps produced by the two GCA blocks]
The two lower-right images above are the attention maps output by the two GCA blocks in the encoder-decoder, which guide the transfer of opacity. The car behind the sieve is easy to identify: the light pink patches at the center of the sieve indicate features propagated from the car on the left, while the blue parts show features borrowed from the road on the right. These propagated features help subsequent convolutional layers distinguish foreground from background.
3. Network
(1) Network structure
The network is roughly U-shaped: an encoder-decoder with five short-cut connections. The input is a 6-channel tensor composed of the original image and the trimap, and low-level detail features are injected into the middle of the network through the GCA module. The structure is shown in the following figure:
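As a minimal sketch of the 6-channel input described above, the RGB image can be concatenated with a 3-channel one-hot encoding of the trimap (background / unknown / foreground). The one-hot encoding is an assumption for illustration; the paper may encode the trimap differently.

```python
def build_input(rgb, trimap):
    """rgb: H x W list of (r, g, b) tuples; trimap: H x W list of labels
    in {0: background, 128: unknown, 255: foreground}.
    Returns an H x W list of 6-tuples (3 color + 3 trimap channels)."""
    onehot = {0: (1.0, 0.0, 0.0), 128: (0.0, 1.0, 0.0), 255: (0.0, 0.0, 1.0)}
    return [[pixel + onehot[label] for pixel, label in zip(row_rgb, row_tri)]
            for row_rgb, row_tri in zip(rgb, trimap)]

rgb = [[(0.2, 0.3, 0.4), (0.5, 0.5, 0.5)]]
trimap = [[0, 128]]
x = build_input(rgb, trimap)
print(len(x[0][0]))  # each pixel now has 6 channels
```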
Insert picture description here
(2) Loss function
The network uses only an alpha prediction loss, defined as the average absolute difference between the prediction and the ground-truth alpha matte over the unknown region:
$$\mathcal{L}_\alpha = \frac{1}{|\mathcal{U}|} \sum_{i \in \mathcal{U}} \left| \hat{\alpha}_i - \alpha_i \right|$$
Here $\mathcal{U}$ denotes the region marked as unknown in the trimap, and $\hat{\alpha}_i$ and $\alpha_i$ denote the predicted and ground-truth alpha values at pixel $i$.
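The loss above is simply a masked mean absolute error. A pure-Python sketch (a real implementation would use framework tensors):

```python
def alpha_prediction_loss(pred, gt, unknown_mask):
    """Mean absolute difference between predicted and ground-truth alpha,
    restricted to pixels the trimap marks as unknown.

    pred, gt: flat lists of alpha values in [0, 1];
    unknown_mask: flat list of booleans (True = unknown region)."""
    diffs = [abs(p - g) for p, g, u in zip(pred, gt, unknown_mask) if u]
    return sum(diffs) / len(diffs)

pred = [0.8, 0.2, 1.0, 0.0]
gt   = [1.0, 0.0, 1.0, 0.0]
mask = [True, True, False, False]  # only the first two pixels are unknown
print(alpha_prediction_loss(pred, gt, mask))  # ~ (0.2 + 0.2) / 2
```

Note that known pixels contribute nothing: a prediction that is perfect in the known region but wrong in the unknown region is still fully penalized.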
In addition, the article experiments with other loss functions, including the compositional loss, gradient loss, and Gabor loss:
[Table: ablation of loss functions under mean squared error and gradient error]
Measured by mean squared error and gradient error, adding the compositional loss brings no significant difference. Combining the gradient loss with the alpha prediction loss increases both errors. The Gabor loss reduces the gradient error somewhat but slightly increases the mean squared error. The final model therefore uses only the alpha prediction loss.
(3) Data augmentation
The strategies adopted in the article for data augmentation include:
1) With probability 0.5, randomly composite two foreground objects to obtain a new foreground and alpha matte; with probability 0.25, resize the image to a resolution of 640 × 640, so that crops cover a wider context;
2) Apply random affine transformations such as rotation, scaling, and horizontal/vertical flipping. The trimap is generated from the alpha matte by morphological erosion and dilation with random kernel sizes;
3) Randomly crop 512 × 512 sub-images from the foreground image and apply random jitter in HSV color space.
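Step 2's trimap generation can be sketched as follows. This is an illustrative pure-Python version with a fixed structuring-element radius; the paper randomizes the kernel sizes, and a real pipeline would use an image library's morphology operations.

```python
def binary(alpha, thresh=0.0):
    """Binarize an alpha matte: 1 where any opacity, else 0."""
    return [[1 if a > thresh else 0 for a in row] for row in alpha]

def morph(mask, radius, op):
    """Erode (op=min) or dilate (op=max) a binary mask with a square
    structuring element of the given radius."""
    h, w = len(mask), len(mask[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            vals = [mask[j][i]
                    for j in range(max(0, y - radius), min(h, y + radius + 1))
                    for i in range(max(0, x - radius), min(w, x + radius + 1))]
            out[y][x] = op(vals)
    return out

def make_trimap(alpha, radius=1):
    """Sure foreground = eroded mask; unknown = dilated minus eroded band."""
    fg = morph(binary(alpha), radius, min)    # eroded: certain foreground
    band = morph(binary(alpha), radius, max)  # dilated: foreground + band
    return [[255 if f else (128 if b else 0)
             for f, b in zip(frow, brow)]
            for frow, brow in zip(fg, band)]

# 7x7 alpha with a 3x3 opaque block in the middle.
alpha = [[1 if 2 <= y <= 4 and 2 <= x <= 4 else 0 for x in range(7)]
         for y in range(7)]
tri = make_trimap(alpha)
print(tri[3][3], tri[1][1], tri[0][0])  # center / band / far corner
```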
A baseline trained with only image cropping and erosion/dilation is compared against the full pipeline; in the results table, AUG marks models trained with full data augmentation.
4. The GCA module
The GCA module consists of two parts: low-level image feature extraction, and the GCA block that performs the information propagation.
The GCA module is applied at two places in the network so as to make full use of the information in the alpha features.
(1) Low-level image feature extraction
Most affinity-based methods rest on the assumption that regions with the same appearance have the same opacity. Under this assumption, the opacity of unknown regions can be predicted by matching them, through an affinity graph, to known regions with similar appearance.
The article follows this affinity-based idea with two information flows:
1) Alpha feature flow: the feature stream marked by the blue arrows in the network diagram, carrying high-level opacity information;
2) Low-level appearance feature flow: extracted from the original image by three stride-2 convolutions, carrying rich low-level appearance features; it corresponds to the yellow arrows in the network diagram.
Together, these two streams provide both low-level appearance information and high-level opacity information, forming the basis and guidance for affinity-based opacity prediction.
(2) GCA block
The structure of the GCA block proposed in the paper is shown in the figure below:
[Figure: structure of the GCA block]
The GCA block takes both the image features and the alpha features as input.
First, the low-level image features are split into a known part and an unknown part. Patches of size 3 × 3 are then extracted over the entire image feature map and reshaped into convolution kernels, which are used to compute the correlation with the unknown region:
$$s_{i,j} = \left\langle \frac{x_i}{\lVert x_i \rVert},\; \frac{x_j}{\lVert x_j \rVert} \right\rangle$$

where $x_i$ is an image-feature patch in the unknown region and $x_j$ a reshaped patch kernel.
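A small sketch of this correlation step, following the standard contextual-attention formulation (an assumption; the paper implements it as a convolution with the reshaped kernels): 3 × 3 patches are extracted from a single-channel feature map and compared by cosine similarity.

```python
import math

def extract_patches(feat, size=3):
    """feat: H x W single-channel feature map. Returns a dict mapping
    patch-center coordinates to flattened 3x3 patches (valid centers only)."""
    h, w, r = len(feat), len(feat[0]), size // 2
    return {(y, x): [feat[y + dy][x + dx]
                     for dy in range(-r, r + 1) for dx in range(-r, r + 1)]
            for y in range(r, h - r) for x in range(r, w - r)}

def cosine(u, v):
    """Inner product of L2-normalized patches = cosine similarity."""
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return sum(a * b for a, b in zip(u, v)) / (nu * nv + 1e-8)

# A feature map whose left half differs from its right half.
feat = [[1, 1, 2, 2],
        [1, 1, 2, 2],
        [1, 1, 2, 2]]
patches = extract_patches(feat)
s_self = cosine(patches[(1, 1)], patches[(1, 1)])
s_cross = cosine(patches[(1, 1)], patches[(1, 2)])
print(s_self > s_cross)  # a patch matches itself best
```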
The correlation matrix obtained from the formula above is passed through a softmax to produce the attention scores. Applying the softmax directly, however, is unstable, especially when the unknown region is very large or very small. The article therefore weights the scores according to the amount of known area (the larger the known area, the larger the weighting coefficient):
[Equation: attention scores weighted by the proportion of known patches]
Once the guided attention scores have been obtained from the image features, propagation is performed on the alpha features over the affinity graph defined by the guided attention. As with the image features, patches are extracted from the alpha features and reshaped into filter kernels. Propagation is implemented as a deconvolution between the guided attention scores and the reshaped alpha-feature patches, producing a reconstruction of the alpha features in the unknown region; the values of overlapping pixels in the deconvolution are averaged. Finally, the input alpha features are combined with the propagation result by element-wise summation, which acts as a residual connection and stabilizes training.
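The propagation step can be sketched in simplified form for a single unknown location (the deconvolution formulation and overlap averaging are omitted for brevity, and the known-area weighting is not applied): the scores are softmaxed over the known patches, the alpha features are combined with those weights, and the result is added back through the residual connection.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def propagate(scores, known_alpha_feats, input_alpha_feat):
    """scores: similarity of one unknown location to each known patch;
    known_alpha_feats: alpha-feature value of each known patch;
    input_alpha_feat: the unknown location's own alpha feature."""
    attn = softmax(scores)
    borrowed = sum(w * f for w, f in zip(attn, known_alpha_feats))
    return input_alpha_feat + borrowed  # element-wise residual connection

# The unknown location matches the first known patch most strongly, so the
# propagated value is pulled toward that patch's alpha feature.
out = propagate([5.0, 1.0, 1.0], [1.0, 0.0, 0.0], 0.1)
print(round(out, 3))
```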
5. Results
(1) Composition-1k test set
[Tables/figures: quantitative and qualitative results on the Composition-1k test set]
(2) Alphamatting.com benchmark data set
[Tables/figures: results on the alphamatting.com benchmark]


Origin blog.csdn.net/balabalabiubiu/article/details/115317042