AlphaNet: An Attention Guided Deep Network for Automatic Image Matting



Paper link: https://arxiv.org/abs/2003.03613?context=cs.CV
Publication source: CVPR 2020

1. Background
Digital image matting is a technique for high-quality extraction of foreground objects from natural images. It has a wide range of applications in mixed reality, film production, and intelligent content creation. To deliver a smoother and faster user experience, the extraction and compositing of large amounts of data must be automated, and the results must be of high quality.
The disadvantage of semantic segmentation is that it focuses on the coarse semantics of the visual input, so its output is blurry in fine structural details. Image matting, by contrast, produces much better results in those details, but it usually requires user intervention, which introduces processing delay and overhead into the matting workflow and severely limits the applicability of image matting.

2. Contributions
AlphaNet merges semantic segmentation and deep image matting into a single network to extract foreground objects from natural images with high precision.
(1) A new model structure is proposed that unifies down-sampling and up-sampling under an attention mechanism and combines segmentation with matting. Unlike ordinary down-sampling and up-sampling techniques, attention-guided down-sampling and up-sampling can recover high-quality boundary details.
(2) An attention-guided encoder-decoder framework is used, which learns in an unsupervised manner and adaptively generates attention maps from the data to guide the up-sampling and down-sampling operations.
(3) A high-quality alpha matting dataset centered on fashion e-commerce is constructed to facilitate the training and evaluation of image matting models.

3. Network overview
The AlphaNet proposed in this paper consists of two parts: a segmentation network and a matting network.
An RGB image is used as the input to the segmentation network, which generates a binary segmentation mask for the foreground object. The binary mask is used to estimate a bounding box, which, together with the mask, is fed to an erosion-dilation layer to generate the trimap. The trimap generated by this process is coarse and contains many uncertain regions, mainly along the edges of the generated mask.
The trimap is then concatenated with the RGB image as the input to the matting network, an attention-guided model that estimates the alpha matte from the RGB image and the generated coarse trimap. Several loss functions compare the predicted alpha matte with the ground truth, and the resulting gradients are used to optimize the network parameters.
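To make the data flow concrete, here is a minimal PyTorch sketch of the two-stage wiring. The submodule names (`segmentation_net`, `matting_net`, `trimap_fn`) are placeholders of my own, not the authors' code; the actual networks are described in the sections below.

```python
import torch
import torch.nn as nn

class AlphaNetPipeline(nn.Module):
    """High-level wiring of the two-stage pipeline described above.

    segmentation_net, matting_net, and trimap_fn are injected placeholders.
    """

    def __init__(self, segmentation_net, matting_net, trimap_fn):
        super().__init__()
        self.seg = segmentation_net
        self.mat = matting_net
        self.trimap_fn = trimap_fn

    def forward(self, rgb):                    # rgb: (B, 3, H, W)
        mask = self.seg(rgb)                   # binary foreground mask
        trimap = self.trimap_fn(mask)          # erosion-dilation trimap, (B, 1, H, W)
        x = torch.cat([rgb, trimap], dim=1)    # concatenate RGB + trimap
        return self.mat(x)                     # predicted alpha matte
```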
[Figure: overall AlphaNet pipeline — segmentation network, erosion-dilation trimap generation, and attention-guided matting network]

4. Segmentation and trimap estimation
The segmentation network uses the DeepLabV3+ encoder-decoder architecture with a ResNet18 backbone to extract the coarse semantics of the foreground image, with an additional erosion-dilation layer at the end that converts the binary output into a coarse trimap.
Specifically, the trimap is estimated from the output of the segmentation model together with the additional object bounding box. The paper argues that only the region near the mask boundary needs further estimation by the image matting model, so the band between the eroded and dilated versions of the binary mask is marked as the unknown region of the trimap, with α_i = 0.5. The other pixels inside the mask are classified as foreground with α_i = 1.0, and all pixels that are neither unknown nor foreground are assigned α_i = 0.0.
The degree of erosion and dilation depends on the computed object size: the height is approximated by height = bbox[3] - bbox[1], and the width by width = bbox[2] - bbox[0]. The erosion and dilation rates are then fixed as a percentage of the height and width.
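A minimal sketch of this erosion-dilation trimap generation, using OpenCV morphology; the function names and the 5% rate are illustrative assumptions, not values from the paper:

```python
import cv2
import numpy as np

def mask_to_bbox(mask):
    """Bounding box (x0, y0, x1, y1) of the nonzero region of a binary mask."""
    ys, xs = np.nonzero(mask)
    return xs.min(), ys.min(), xs.max() + 1, ys.max() + 1

def make_trimap(mask, rate=0.05):
    """Coarse trimap from a binary uint8 mask (1 = foreground) via erosion-dilation."""
    bbox = mask_to_bbox(mask)
    width = bbox[2] - bbox[0]                     # width  = bbox[2] - bbox[0]
    height = bbox[3] - bbox[1]                    # height = bbox[3] - bbox[1]
    # Kernel size fixed as a percentage of the object size (rate is assumed).
    kernel = np.ones((max(3, int(rate * height)),
                      max(3, int(rate * width))), np.uint8)
    eroded = cv2.erode(mask, kernel)
    dilated = cv2.dilate(mask, kernel)
    trimap = np.zeros(mask.shape, np.float32)     # background: alpha = 0.0
    trimap[dilated > 0] = 0.5                     # unknown band: alpha = 0.5
    trimap[eroded > 0] = 1.0                      # foreground: alpha = 1.0
    return trimap
```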

5. Matting network
The feature extraction module in this paper improves on the DIM framework proposed in Deep Image Matting.
(1) Encoder-decoder
The encoder-decoder is built on MobileNetV2, with an additional attention module that generates attention maps to guide the up-sampling and down-sampling operations.
The pooling and unpooling layers follow the common configuration of a 2×2 kernel with stride 2. The core of the network is the attention module, which takes feature maps from the encoder branch and generates attention maps to guide the down-sampling and up-sampling operations.
(2) Attention module
The attention mechanism consists of a predefined attention block and two normalization layers.
The core of the attention block is a fully convolutional neural network that converts the input feature map into an attention map (that is, the attention map is modeled as a function of the encoder feature map F ∈ R^(H×W×C)); two attention maps are generated, one for up-sampling and one for down-sampling. Each attention map has the same spatial dimensions as the input feature map but only one channel, containing the attention weights A_i ∈ [0, 1]. The specific mapping is as follows:
[Figure: the attention mapping A = f(F), with F ∈ R^(H×W×C) and A ∈ R^(H×W×1), A_i ∈ [0, 1]]
The attention block is followed by two normalization layers, which are responsible for normalizing the attention maps of the encoder and the decoder. The encoder's attention map is first normalized by a sigmoid function and then by a softmax function, so that the weights used for the encoder's down-sampling are properly normalized; the decoder's attention map is normalized only by a sigmoid function.
Once the attention maps are normalized, they are fed into the encoder's pooling and the decoder's unpooling operators, respectively. The main difference between ordinary (un)pooling and the paper's method is that the ordinary operation applies one fixed, learned kernel to all regions, whereas the paper's module applies different kernels to different regions according to the computed attention map.
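The following PyTorch sketch illustrates the idea for the encoder side (my own illustrative code, not the authors'): after sigmoid and per-window softmax normalization, each 2×2 window is reduced by an attention-weighted sum rather than a fixed kernel:

```python
import torch
import torch.nn.functional as F

def attention_pool(feat, attn_logits, k=2):
    """Attention-guided k x k down-sampling (illustrative sketch).

    feat        : (B, C, H, W) encoder feature map
    attn_logits : (B, 1, H, W) raw attention map from the attention block
    """
    B, C, H, W = feat.shape
    attn = torch.sigmoid(attn_logits)                       # first normalization
    # Unfold into non-overlapping k x k windows.
    f = F.unfold(feat, k, stride=k).view(B, C, k * k, -1)
    a = F.unfold(attn, k, stride=k).view(B, 1, k * k, -1)
    a = a.softmax(dim=2)                                    # weights sum to 1 per window
    out = (f * a).sum(dim=2)                                # region-specific weighting
    return out.view(B, C, H // k, W // k)
```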
[Figure: detailed structure of the attention module (AM)]
As shown in the figure, the AM module has the following structure:
First, two sets of four parallel 4×4 group convolutions with stride 2 and padding 1 are applied to the H × W × C feature map, generating a tensor of size H/2 × W/2 × 2C.
This is followed by a set of normalization layers and a ReLU layer for the nonlinear mapping. Two pointwise (1×1) convolution layers then process the resulting tensor to pool the feature map, generating an attention map of size H/2 × W/2 × 1.
The final attention map is assembled from four such down-sampling attention maps (computed without weight sharing) by a pixel-shuffle rearrangement, which interleaves their pixels to produce the up-sampled map.
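Under those dimensions, a hedged PyTorch sketch of the attention block might look as follows; the exact layer choices (BatchNorm, groups=4, PixelShuffle) are my assumptions about the description, not the released implementation:

```python
import torch
import torch.nn as nn

class AttentionBlock(nn.Module):
    """Sketch of the AM structure described above (C must be divisible by groups)."""

    def __init__(self, c, groups=4):
        super().__init__()

        def head():
            # 4x4 group conv, stride 2, padding 1: H x W x C -> H/2 x W/2 x 2C,
            # then normalization + ReLU and two pointwise convolutions that
            # pool the channels down to a single H/2 x W/2 x 1 attention map.
            return nn.Sequential(
                nn.Conv2d(c, 2 * c, 4, stride=2, padding=1, groups=groups),
                nn.BatchNorm2d(2 * c),
                nn.ReLU(inplace=True),
                nn.Conv2d(2 * c, c, 1),
                nn.Conv2d(c, 1, 1),
            )

        # Four heads without weight sharing; pixel shuffle interleaves their
        # H/2 x W/2 outputs into one full-resolution H x W map.
        self.heads = nn.ModuleList(head() for _ in range(4))
        self.shuffle = nn.PixelShuffle(2)

    def forward(self, f):                                  # f: (B, C, H, W)
        maps = torch.cat([h(f) for h in self.heads], 1)    # (B, 4, H/2, W/2)
        down_attn = maps[:, :1]                            # for attention pooling
        up_attn = self.shuffle(maps)                       # (B, 1, H, W) for unpooling
        return down_attn, up_attn
```

For example, `AttentionBlock(64)` maps a (1, 64, 32, 32) feature tensor to a (1, 1, 16, 16) down-sampling map and a (1, 1, 32, 32) up-sampling map.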
6. Results
(1) Visual comparison of different methods
[Figure: visual comparison of alpha mattes produced by different methods]
(2) Quantitative comparison of different methods
[Figure: quantitative comparison of different methods]
(3) Ablation experiment
[Figure: ablation experiment results]


Source: blog.csdn.net/balabalabiubiu/article/details/115023178