Expectation-Maximization Attention Networks for Semantic Segmentation

0. Summary

        Self-attention mechanisms have been widely used in various tasks. They compute the representation of each position as a weighted sum of the features at all positions, and can therefore capture long-range relationships in computer vision tasks. However, this approach is computationally expensive, since the attention map is computed with respect to all other positions. In this paper, we formulate the attention mechanism in an expectation-maximization manner and iteratively estimate a much more compact set of bases upon which the attention maps are computed. Through a weighted summation over these bases, the resulting representation is low-rank and suppresses noisy information in the input. The proposed expectation-maximization attention (EMA) module is robust to the variance of the input and is also friendly in memory and computation. Furthermore, we introduce bases maintenance and normalization methods to stabilize its training process. We conduct extensive experiments on popular semantic segmentation benchmarks, including PASCAL VOC, PASCAL Context, and COCO Stuff, on which we set new records.

1. Introduction

        Semantic segmentation is a fundamental and challenging problem in computer vision, whose goal is to assign a semantic category to each pixel of an image. It is important for various applications such as autonomous driving, image editing, and robot perception. To perform this task effectively, we need to distinguish confusing categories and account for the varied appearance of objects. For example, "grass" and "ground" may have similar colors in some cases, and "person" may appear with different scales, shapes, and clothing in different parts of an image. At the same time, the output label space is very compact, and the number of categories in a given dataset is limited. Therefore, this task can be viewed as projecting data points from a high-dimensional, noisy space into a compact subspace. The essence is to suppress the noise introduced by these variations and capture the most important semantic concepts.

Recently, many state-of-the-art methods based on fully convolutional networks (FCNs) [22] have been proposed to address the above problems. Due to their fixed geometric structure, they are inherently limited to local receptive fields and short-range contextual information. To capture long-range dependencies, some works adopt multi-scale context fusion [17], such as atrous convolution [4], spatial pyramid pooling [37], large-kernel convolution [25], and so on. Furthermore, to preserve more detailed information, encoder-decoder structures [34, 5] have been proposed to fuse mid-level and high-level semantic features. To aggregate information from all spatial locations, attention mechanisms [29, 38, 31] are used, which allow the feature of a single pixel to fuse information from all other positions. However, the original attention-based methods need to generate a large attention map, which has high computational complexity and occupies a large amount of GPU memory. The bottleneck is that both the generation and the usage of the attention map are computed with respect to all positions.

In response to the above problems, this paper rethinks the attention mechanism from the perspective of the expectation-maximization (EM) algorithm [7] and proposes a new attention-based method, namely expectation-maximization attention (EMA). Unlike previous methods [38, 31] that treat all pixels themselves as the reconstruction bases, we use the EM algorithm to find a more compact basis set, which can greatly reduce the computational complexity. Specifically, we treat the bases as the parameters to be learned in the EM algorithm and the attention maps as the latent variables. In this setting, the EM algorithm aims to find the maximum likelihood estimate of the parameters (bases). The expectation (E) step estimates the expectation of the attention map given the current parameters, while the maximization (M) step updates the parameters (bases) by maximizing the complete-data likelihood. The E step and the M step are executed alternately until convergence. After convergence, the output can be computed as a weighted sum of the bases, where the weights are given by the normalized final attention map. The process of EMA is shown in Figure 1. We further embed the proposed EMA method into a neural network module, named the EMA unit. The EMA unit can be implemented with common operations, is lightweight, and can be easily embedded into existing neural networks. In addition, in order to fully exploit its capacity, we propose two methods to stabilize the training process of the EMA unit. We also evaluate its performance on three challenging datasets.

The main contributions of this paper are as follows:
• We reformulate the self-attention mechanism as an expectation-maximization iteration, which learns a more compact basis set and greatly reduces the computational complexity. To the best of our knowledge, this is the first time the EM iteration has been introduced into the attention mechanism.
• We build the proposed expectation-maximization attention into a lightweight neural network module and establish specific bases maintenance and normalization strategies for it.
• Extensive experiments on three challenging semantic segmentation datasets, including PASCAL VOC, PASCAL Context, and COCO Stuff, demonstrate that our method outperforms other state-of-the-art methods.

Figure 1: The flow of the proposed expectation maximization attention method.

2. Related work

        Semantic segmentation. Methods based on fully convolutional networks (FCNs) [22] have made tremendous progress in image semantic segmentation by leveraging the powerful convolutional features of classification networks [14, 15, 33] pre-trained on large-scale data. Several model variants have been proposed to enhance multi-scale context aggregation. For example, DeeplabV2 [4] utilizes Atrous Spatial Pyramid Pooling (ASPP), which consists of parallel dilated convolutions with different dilation rates, to embed contextual information. DeeplabV3 [4] extends ASPP with image-level features to further capture global context. Meanwhile, PSPNet [37] proposes a pyramid pooling module to collect contextual information at different scales. GCN [25] adopts decoupled large-kernel convolutions to obtain a large receptive field on the feature map and capture long-range information. Another category of variants mainly focuses on predicting more detailed outputs. These methods are based on U-Net [27] and combine the advantages of high-level and mid-level features. RefineNet [21] utilizes the Laplacian image pyramid to explicitly capture the information available during downsampling and to output predictions from coarse to fine. DeeplabV3+ [5] adds a decoder on top of DeeplabV3 to refine the segmentation results, especially along object boundaries. ExFuse [36] proposes a new framework to bridge the gap between low-level and high-level features, thereby improving segmentation quality.

        Attention model. Attention has been widely used in various tasks such as machine translation, visual question answering, and video classification. Self-attention methods [2, 29] compute the contextual encoding of a position by taking a weighted sum of the embeddings of all positions in a sentence. Non-local [31] first used the self-attention mechanism as a module for computer vision tasks such as video classification, object detection, and instance segmentation. PSANet [38] learns to aggregate contextual information for each position through predicted attention maps. A2Net [6] proposes double attention blocks to distribute and gather informative global features from the entire spatio-temporal space. DANet [11] applies both spatial and channel attention to aggregate information over the feature maps, at an even higher computational and memory cost than the non-local method. Our approach is inspired by the success of the attention mechanism in the above works. We rethink the attention mechanism from the viewpoint of the EM algorithm and compute the attention map in an iterative, EM-like manner.

3. Preliminaries

        Before introducing our proposed method, we first review three closely related topics, namely the EM algorithm, the Gaussian mixture model, and the non-local module.

3.1. Expectation maximization algorithm

        The expectation-maximization (EM) algorithm [7] aims to find the maximum likelihood solution for models with latent variables. Denote by X = {x1, x2, ..., xN} a data set of N observed samples, where each data point xi has a corresponding latent variable zi. We call {X, Z} the complete data, and its log-likelihood takes the form ln p(X, Z | θ), where θ is the set of all parameters of the model. In practice, our knowledge about the latent variables Z comes only from the posterior distribution p(Z | X, θ). The EM algorithm maximizes the likelihood through two steps, the E step and the M step. In the E step, we use the current parameters θold to compute the posterior distribution of Z, namely p(Z | X, θold). We then use this posterior to compute the expectation of the complete-data log-likelihood:

Q(θ, θold) = Σ_Z p(Z | X, θold) ln p(X, Z | θ).   (1)

In the M step, the revised parameters θnew are then determined by maximizing this function:

θnew = argmax_θ Q(θ, θold).   (2)

The EM algorithm alternates between the E step and the M step until the convergence criterion is satisfied.

3.2. Gaussian mixture model

In a Gaussian mixture model (GMM), the data distribution is modeled as a weighted sum of K Gaussian components, p(xn) = Σk πk N(xn | µk, Σk), where µk and Σk are the mean and covariance of the k-th Gaussian basis and πk is its mixing coefficient. Applying the EM algorithm to a GMM, the E step computes the responsibility of the k-th basis for xn as znk = πk N(xn | µk, Σk) / Σj πj N(xn | µj, Σj), and the M step re-estimates µk, Σk, and πk from the responsibility-weighted data. In practical applications, we can simply replace Σk with the identity matrix I and omit it from the above equations.
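To make the E and M steps concrete, here is a minimal PyTorch sketch of EM for a Gaussian mixture with identity covariances and uniform mixing coefficients, matching the simplification above; the function and variable names are ours, not the paper's.

```python
import torch

def em_isotropic_gmm(X, K, iters=10):
    """Minimal EM for a mixture of Gaussians with identity covariances (illustrative only).

    X: (N, C) data points; K: number of components.
    Returns responsibilities Z of shape (N, K) and means mu of shape (K, C).
    """
    N, C = X.shape
    mu = X[torch.randperm(N)[:K]].clone()        # initialize means from K random points
    for _ in range(iters):
        # E step: responsibilities from squared distances, softmax normalizes over k
        dist2 = torch.cdist(X, mu) ** 2          # (N, K)
        Z = torch.softmax(-0.5 * dist2, dim=1)   # proportional to N(x_n | mu_k, I)
        # M step: re-estimate each mean as a responsibility-weighted average
        mu = (Z.t() @ X) / Z.sum(dim=0, keepdim=True).t().clamp_min(1e-6)
    return Z, mu
```

With K ≪ N, this is exactly the clustering view that EMA later softens and embeds into a network.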

3.3. Non-local

        The non-local module [31] plays the same role as the self-attention mechanism. It can be expressed as

yi = (1 / C(x)) Σj f(xi, xj) g(xj),

where f(·,·) is a general kernel function, C(x) is a normalization factor, and xi is the feature vector at position i. This module is applied to the feature maps of a convolutional neural network (CNN). Viewing N(xn | µk, Σk) in the GMM as a specific kernel function between xn and µk, the non-local formulation is just a specific design of the GMM's responsibility-weighted re-estimation. Therefore, from the perspective of the GMM, the non-local module only re-estimates X, without distinct E and M steps. Specifically, the bases µ are simply chosen to be X itself. In a GMM the number of Gaussian bases is selected manually and usually satisfies K ≪ N, whereas in the non-local module the bases are the data themselves, so K = N. The non-local module therefore has two obvious disadvantages. First, the data lie on a low-dimensional manifold, so taking every pixel as a basis is highly redundant. Second, the computational overhead is heavy and the memory cost is high.
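For reference, here is a minimal PyTorch sketch of a non-local block in the spirit of [31]; the embedded-Gaussian kernel and the layer names are our assumptions, and the point is the N × N attention map that dominates computation and memory.

```python
import torch
import torch.nn as nn

class NonLocalBlock(nn.Module):
    """Simplified non-local block with an embedded-Gaussian kernel (illustrative only)."""
    def __init__(self, c, c_inner=None):
        super().__init__()
        c_inner = c_inner or c // 2
        self.theta = nn.Conv2d(c, c_inner, 1)   # query embedding
        self.phi = nn.Conv2d(c, c_inner, 1)     # key embedding
        self.g = nn.Conv2d(c, c_inner, 1)       # value embedding
        self.out = nn.Conv2d(c_inner, c, 1)

    def forward(self, x):
        b, c, h, w = x.shape
        q = self.theta(x).flatten(2).transpose(1, 2)   # (B, N, C')
        k = self.phi(x).flatten(2)                     # (B, C', N)
        v = self.g(x).flatten(2).transpose(1, 2)       # (B, N, C')
        attn = torch.softmax(q @ k, dim=-1)            # (B, N, N): the O(N^2) attention map
        y = (attn @ v).transpose(1, 2).reshape(b, -1, h, w)
        return x + self.out(y)                         # residual connection
```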

4. Expectation-maximization attention

        Considering the high computational complexity of the attention mechanism and the limitations of the non-local module, we propose the expectation-maximization attention (EMA) method, an enhanced version of self-attention. Unlike the non-local module, which selects all data points as bases, we use EM iterations to find a compact basis set. To simplify the notation, we reshape the input feature map of size C × H × W into a matrix X of size N × C, where N = H × W. The proposed EMA consists of three operations: responsibility estimation (AE), likelihood maximization (AM), and data re-estimation (AR). Simply put, given the input X and the initial bases µ(0), AE estimates the latent variables (the responsibilities) Z as the E step, and AM uses this estimate to update the bases µ as the M step. The AE and AM steps are executed alternately for a prespecified number of iterations. Then, using the converged µ and Z, AR reconstructs the original X as X̃ and outputs it.
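Putting the three operations together, one EMA forward pass over a single image might be sketched as follows; the function and variable names are ours, λ is the temperature discussed in the next subsections, and the L2 normalization of the bases is explained in Section 5.3.

```python
import torch
import torch.nn.functional as F

def ema_forward(X, mu, T=3, lam=1.0):
    """One EMA pass over a single image (illustrative sketch).

    X:  (N, C) flattened pixel features, N = H * W.
    mu: (K, C) initial bases, with K << N.
    Returns the re-estimated features X_tilde of shape (N, C) and the updated bases.
    """
    for _ in range(T):
        # A_E (E step): responsibilities of each basis for each pixel
        Z = torch.softmax(lam * X @ mu.t(), dim=1)                          # (N, K)
        # A_M (M step): each basis becomes a responsibility-weighted sum of pixels
        mu = (Z.t() @ X) / Z.sum(dim=0, keepdim=True).t().clamp_min(1e-6)   # (K, C)
        mu = F.normalize(mu, dim=1)                                         # L2Norm, see Sec. 5.3
    # A_R: re-estimate X from the compact bases
    X_tilde = Z @ mu                                                        # (N, C), low-rank
    return X_tilde, mu
```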

It has been shown that the complete-data likelihood ln p(X, Z) increases monotonically with the EM iterations. Since ln p(X) can be estimated by marginalizing out Z, maximizing ln p(X, Z) is a proxy for maximizing ln p(X). Therefore, as the AE and AM iterations proceed, the updated Z and µ reconstruct the original data X increasingly well, and the reconstructed X̃ captures as much of the important semantics of X as possible. Furthermore, compared with the non-local module, EMA finds a compact set of bases for the pixels of the input image. This compactness is non-trivial: since K ≪ N, X̃ lies in a subspace of X. This mechanism removes a large amount of unnecessary noise and makes the final per-pixel classification more tractable. Moreover, this operation reduces the complexity (in both space and time) from O(N^2) to O(NKT), where T is the number of AE and AM iterations. The convergence of the EM algorithm is also guaranteed. It is worth noting that in our experiments EMA needs only three iterations to obtain good results, so T can be regarded as a small constant, which means the complexity is effectively O(NK).
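To make the saving concrete with a hypothetical example (the sizes below are our own choice, not from the paper): for a 64 × 64 feature map with K = 64 bases and T = 3 iterations,

```python
N, K, T = 64 * 64, 64, 3       # assumed sizes: 64x64 feature map, 64 bases, 3 iterations
print(N * N)                   # 16777216 pairwise terms for a full attention map
print(N * K * T)               # 786432 terms for EMA, roughly 21x fewer
```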

Figure 2: Overall structure of the proposed EMAU. The key component is the EMA operator, in which AE and AM are executed alternately. In addition to the EMA operator, we add two 1×1 convolutions at the beginning and end of the EMA and sum the output with the original input to form a residual-like block. Best viewed on screen.

4.1. Responsibility estimation

        Responsibility estimation (AE) serves as the E step of the EM algorithm. This step computes the expectation of znk, which corresponds to the responsibility of the k-th basis µk for xn, where 1 ≤ k ≤ K and 1 ≤ n ≤ N. We formulate the likelihood of xn given µk as p(xn | µk) = K(xn, µk), where K denotes a general kernel function, so that the responsibility becomes znk = K(xn, µk) / Σj K(xn, µj). There are several plausible choices for K(a, b), such as the inner product a⊤b, the exponential inner product exp(a⊤b), the Euclidean distance ‖a − b‖^2/2, the RBF kernel exp(−‖a − b‖^2/σ^2), and so on. The choice among these functions has a negligible impact on the final results, so in this paper we simply adopt the exponential inner product exp(a⊤b). With this choice, the responsibility estimation can be implemented as a matrix multiplication followed by a softmax layer. In short, in the t-th iteration, the AE operation can be written as Z(t) = softmax(λ X (µ(t−1))⊤), where λ is a hyperparameter that controls the sharpness of the distribution.
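In code, the AE step with this kernel is a single matrix multiplication followed by a softmax over the K bases; an RBF variant is shown alongside for comparison (a sketch with our own helper names).

```python
import torch

def responsibility_exp_inner(X, mu, lam=1.0):
    """AE with the kernel exp(a^T b): one matmul plus a softmax over the bases."""
    return torch.softmax(lam * X @ mu.t(), dim=1)   # (N, K)

def responsibility_rbf(X, mu, sigma=1.0):
    """The same step with an RBF kernel exp(-||a - b||^2 / sigma^2), for comparison."""
    dist2 = torch.cdist(X, mu) ** 2                 # (N, K)
    return torch.softmax(-dist2 / sigma ** 2, dim=1)
```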

4.2. Likelihood maximization

        Likelihood maximization (AM) serves as the M step of the EM algorithm. Using the estimated Z, AM updates µ by maximizing the complete-data likelihood. To keep the bases in the same embedding space as X, we update each basis with a weighted sum of X. In the t-th iteration of AM, the update of µk is therefore µk(t) = Σn znk(t) xn / Σn znk(t).

        It is worth noting that if we let λ → ∞ in the responsibility estimate, then {zn1, zn2, ···, znK} becomes a one-hot vector. In this case, each pixel is assigned to exactly one basis, and each basis is updated as the mean of the pixels assigned to it, which is exactly what the K-means clustering algorithm [10] does. Therefore, the alternation of AE and AM can also be regarded as a soft version of K-means clustering.
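Below is a sketch of the AM step, together with the hard-assignment limit (λ → ∞) described above, which reduces to a K-means style update; the helper names are ours.

```python
import torch
import torch.nn.functional as F

def likelihood_maximization(X, Z, eps=1e-6):
    """AM: each basis becomes the responsibility-weighted average of the pixels."""
    return (Z.t() @ X) / Z.sum(dim=0, keepdim=True).t().clamp_min(eps)   # (K, C)

def hard_assignment_update(X, Z):
    """Limit lambda -> infinity: one-hot responsibilities, i.e. a K-means M step."""
    K = Z.shape[1]
    hard = F.one_hot(Z.argmax(dim=1), K).float()                         # (N, K)
    return likelihood_maximization(X, hard)
```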

4.3. Data re-estimation

        EMA runs AE and AM alternately for a total of T iterations. After that, the final µ(T) and Z(T) are used to re-estimate X. The new estimate X̃ is constructed as X̃ = Z(T) µ(T). Since X̃ is constructed from a compact basis set, it is low-rank compared with the input X. We show an example of X̃ in Figure 2: the X̃ output by AR is very compact in the feature space, and the feature variance inside an object is smaller than that of the input.
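As a quick numerical illustration of the low-rank claim, one can check on random data that the rank of X̃ is at most K (a toy example of our own):

```python
import torch
import torch.nn.functional as F

N, C, K = 4096, 512, 64
X = torch.randn(N, C)
mu = F.normalize(torch.randn(K, C), dim=1)
Z = torch.softmax(X @ mu.t(), dim=1)
X_tilde = Z @ mu                                  # re-estimated features
print(torch.linalg.matrix_rank(X))                # typically C = 512
print(torch.linalg.matrix_rank(X_tilde))          # at most K = 64
```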

5. EMA unit

        To better integrate the proposed EMA with deep neural networks, we further propose the expectation-maximization attention unit (EMAU) and apply it to the semantic segmentation task. In this section, we describe EMAU in detail: we first introduce its overall structure, and then discuss the bases maintenance and normalization mechanisms.

5.1. Structure of the EMA unit

        The overall structure of EMAU is shown in Figure 2. At first glance, EMAU looks like the bottleneck block of ResNet, but it replaces the heavy 3×3 convolution with the EMA operation. The first 1×1 convolution, which has no ReLU activation, is placed in front to transform the value range of the input from [0, +∞) to (−∞, +∞); without it, the estimated µ(T) would also lie in [0, +∞), which halves the capacity compared with general convolution parameters. The last 1×1 convolution is inserted to transform the re-estimated X̃ into the residual space of X. For each of the AE, AM, and AR steps, the computational complexity is O(NKC). Since we set K ≪ C, several iterations of AE and AM plus one AR are of the same order of magnitude as a single 1×1 convolution with C input and output channels. Including the two extra 1×1 convolutions, the overall FLOPs of EMAU are approximately one third of those of a module with a 3×3 convolution with the same numbers of input and output channels. Furthermore, the parameters maintained by EMA amount to only K × C.
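The following is a condensed PyTorch sketch of an EMAU-style block following this description; the layer names, the BatchNorm placement, the random initialization standing in for Kaiming initialization, and other details are our assumptions rather than the authors' released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EMAUnit(nn.Module):
    """Sketch of an EMAU-style block: 1x1 conv -> EMA operator -> 1x1 conv -> residual add."""
    def __init__(self, c, k=64, t=3, momentum=0.9):
        super().__init__()
        self.t, self.momentum = t, momentum
        self.conv_in = nn.Conv2d(c, c, 1, bias=False)          # no ReLU, so features span (-inf, +inf)
        self.conv_out = nn.Sequential(nn.Conv2d(c, c, 1, bias=False), nn.BatchNorm2d(c))
        self.register_buffer('mu', F.normalize(torch.randn(k, c), dim=1))   # K x C bases mu(0)

    def forward(self, x):
        identity = x
        x = self.conv_in(x)
        b, c, h, w = x.shape
        X = x.flatten(2).transpose(1, 2)                       # (B, N, C)
        mu = self.mu[None].repeat(b, 1, 1)                     # per-image copy of mu(0): (B, K, C)
        for _ in range(self.t):
            Z = torch.softmax(X @ mu.transpose(1, 2), dim=2)   # AE: (B, N, K)
            mu = (Z.transpose(1, 2) @ X) / Z.sum(1).unsqueeze(2).clamp_min(1e-6)   # AM: (B, K, C)
            mu = F.normalize(mu, dim=2)                        # L2Norm over channels (Sec. 5.3)
        X_tilde = Z @ mu                                       # AR: (B, N, C)
        y = X_tilde.transpose(1, 2).reshape(b, c, h, w)
        if self.training:                                      # moving-average update of mu(0), Sec. 5.2
            self.mu = (self.momentum * self.mu
                       + (1 - self.momentum) * mu.mean(0)).detach()
        return identity + self.conv_out(y)
```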

5.2. Bases maintenance

        Another issue with the EM algorithm is the initialization of the bases. The EM algorithm is guaranteed to converge, because the complete-data likelihood is bounded and each E and M step increases its current lower bound. However, convergence to the global maximum is not guaranteed, so the initial values of the bases before the iterations are important. Above we only described how EMA processes a single image; for computer vision tasks, however, there are thousands of images in a dataset. Since each image X has a different distribution of pixel features, the µ computed on one image is not suitable for reconstructing the feature maps of other images. Therefore, we run EMA on each image independently.

        For the first mini-batch, we initialize µ(0) with Kaiming's initialization [13], where we treat the matrix multiplication as a 1×1 convolution. For the following mini-batches, a simple choice would be to update µ(0) by standard backpropagation. However, since the iterations of AE and AM can be unrolled like a recurrent neural network (RNN), the gradients propagated through them are prone to vanishing or exploding. The update of µ(0) would therefore be unstable, and the training of the EMA unit may collapse. In this paper, we instead update µ(0) with a moving average during training. After iterating over an image, the resulting µ(T) can be viewed as a biased update of µ(0), where the bias comes from the image sampling process. To make it less biased, we first average µ(T) over the mini-batch to obtain µ̄(T), and then update µ(0) as

µ(0) ← α µ(0) + (1 − α) µ̄(T),

where α ∈ [0, 1] is the momentum. For inference, µ(0) remains fixed. This moving average mechanism is also used in batch normalization (BN) [16].
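In code, this maintenance rule is a one-line momentum update applied only during training (a sketch with our own names; α corresponds to the momentum argument).

```python
import torch

def update_bases(mu0, mu_T_batch, momentum=0.9):
    """Moving-average maintenance of mu(0) from the per-image mu(T) of a mini-batch.

    mu0: (K, C) current bases; mu_T_batch: (B, K, C) converged bases of the batch.
    """
    mu_T_avg = mu_T_batch.mean(dim=0)                 # average over the mini-batch: (K, C)
    return momentum * mu0 + (1 - momentum) * mu_T_avg.detach()
```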

5.3. Bases normalization

        The previous subsection dealt with maintaining µ(0) across mini-batches. However, due to the RNN-like behavior, stable updates of µ(t) within the AE and AM iterations are still not guaranteed. The moving average mechanism above requires that µ̄(T) not differ too much from µ(0); otherwise it will also collapse, just like backpropagation. This requirement constrains the value range of µ(t) for 1 ≤ t ≤ T, so we need to apply normalization to µ(t). At first glance, batch normalization (BN) or layer normalization (LN) [1] seem to be good choices. However, these normalization methods change the direction of each basis µk(t), thereby changing its properties and semantic meaning. To keep the direction of each basis unchanged, we choose Euclidean normalization (L2Norm), which divides each µk(t) by its length. With it applied, each µk(t) lies on the unit hypersphere, on which the sequence {µk(0), µk(1), ···, µk(T)} forms a trajectory.
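The normalization itself is a single call; the small comparison below (a toy check of our own) illustrates that L2Norm rescales each basis without changing its direction, whereas LN in general does change it.

```python
import torch
import torch.nn.functional as F

mu = torch.randn(64, 512)                       # K x C bases
mu_l2 = F.normalize(mu, p=2, dim=1)             # divide each basis by its length
mu_ln = F.layer_norm(mu, (512,))                # layer norm shifts and rescales per basis

cos_l2 = F.cosine_similarity(mu, mu_l2, dim=1)  # approximately all ones: direction preserved
cos_ln = F.cosine_similarity(mu, mu_ln, dim=1)  # generally below one: direction changed
print(cos_l2.min().item(), cos_ln.min().item())
```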

5.4. Comparison with the double attention block

        A2Net [6] proposes the double attention block (A2 block), in which the output Y is computed as Y = [φ(X, Wφ) sfm(θ(X, Wθ))⊤] sfm(ρ(X, Wρ)), where sfm denotes the softmax function, and φ, θ, and ρ denote three 1×1 convolutions with kernels Wφ, Wθ, and Wρ, respectively. If we share the parameters between θ and ρ, we can label both Wθ and Wρ as µ. It can then be seen that sfm(θ(X, Wθ)) computes Z just like the responsibility estimation in the GMM, and the terms inside [·] update µ. The whole A2 block process is thus equivalent to EMA with only one iteration; moreover, Wθ in the A2 block is updated by backpropagation, whereas our EMAU updates it by moving average. In summary, the double attention block can be regarded as a special form of EMAU.

Figure 3: Ablation study of EMAU bases maintenance strategies (left) and normalization methods (right). The experiments use ResNet-50 on the PASCAL VOC dataset, with a batch size of 12 and a training output stride of 16. The number of training iterations T is set to 3. Best viewed on screen.

6. Experiments

        To evaluate the proposed EMAU, we conduct extensive experiments on the PASCAL VOC dataset [9], PASCAL Context dataset [24] and COCO Stuff dataset [3]. In this section, we first introduce the implementation details. We then conduct an ablation study to verify the superiority of the proposed method on the PASCAL VOC dataset. Finally, we report results on the PASCAL Context dataset and COCO Stuff dataset.

6.1. Implementation details

        We use ResNet [14] pre-trained on ImageNet [28] as our backbone network. Following previous work [37, 4, 5], we adopt a polynomial learning rate policy, where the initial learning rate is multiplied by (1 − iter/total_iter)^0.9 after each iteration. The initial learning rate is set to 0.009 for all datasets. The momentum and weight decay coefficients are set to 0.9 and 0.0001, respectively. For data augmentation, we apply the common random scaling (0.5 to 2.0), cropping, and left-right flipping of the training data. The input size for all datasets is set to 513×513. Synchronized batch normalization and the multi-grid strategy [4] are employed in all experiments. For evaluation, we adopt the commonly used mean intersection-over-union (mIoU) metric. The output stride of the backbone is set to 16 when training on PASCAL VOC and PASCAL Context and to 8 when training on COCO Stuff; for evaluation, it is set to 8 on all datasets. To speed up the training process, we perform all ablation studies with ResNet-50 [14] and a batch size of 12. For all models compared with the state of the art, we train with ResNet-101 and a batch size of 16. We train for 30K iterations on PASCAL VOC and COCO Stuff and for 15K iterations on PASCAL Context. We use a 3×3 convolution to reduce the number of channels from 2,048 to 512 and then stack EMAU on top of it; we call the entire network EMANet. As default values for training, we set the number of bases K to 64, λ to 1, and the number of iterations T to 3.
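The polynomial learning-rate policy can be written in a few lines (a minimal sketch of our own, not the released training code).

```python
def poly_lr(base_lr, it, total_iter, power=0.9):
    """Poly policy: multiply the base learning rate by (1 - iter / total_iter) ** power."""
    return base_lr * (1.0 - it / total_iter) ** power

# example: initial learning rate 0.009 over 30K iterations
for it in (0, 10000, 20000, 29999):
    print(it, round(poly_lr(0.009, it, 30000), 6))
```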

Figure 4: Ablation study for iteration number T. The experiment was conducted using ResNet-50 on the PASCAL VOC data set, with a training output stride of 16 and a batch size of 12.

6.2. Results on PASCAL VOC data set

6.2.1. Bases maintenance and normalization

        In this subsection, we first compare the performance of different strategies for maintaining µ(0). We set T = 3 during training and 1 ≤ T ≤ 8 during evaluation. As shown on the left side of Figure 3, the performance of all strategies improves as the number of AE and AM iterations increases, and when T ≥ 4 the gain from more iterations becomes insignificant. The moving average performs best among them: it achieves the highest performance for every number of iterations and outperforms the other strategies by at least 0.9 mIoU. Surprisingly, updating by backpropagation shows no advantage over not updating at all, and even performs worse when T ≥ 3. We then compare no normalization, LN, and L2Norm as described above. The right side of Figure 3 shows that LN is even worse than no normalization, although it can partially alleviate the gradient problem of the RNN-like structure. The performance of both LN and no normalization has little correlation with the number of iterations T. In contrast, the performance of L2Norm increases as the number of iterations grows, and it outperforms LN and no normalization when T ≥ 3.

6.2.2. Ablation study on the number of iterations

        As seen in Figure 3, the performance of EMAU improves with more iterations during evaluation, and the gain becomes insignificant when T > 4. In this subsection, we also study the effect of T during training. We plot the performance matrix over Ttrain and Teval in Figure 4. It is clear from Figure 4 that, no matter what Ttrain is, mIoU increases monotonically with more evaluation iterations and eventually converges to a fixed value. For training, however, this rule does not hold: mIoU peaks at Ttrain = 3 and decreases with more iterations. This phenomenon may be caused by the RNN-like behavior of EMAU; although the moving average and L2Norm can alleviate this problem to some extent, it still exists. We also conducted experiments on the A2 block [6], which can be regarded as a special form of EMAU as noted in Section 5.4. Likewise, the non-local module can be regarded as a special form of EMAU without the AM step, with many more bases and Ttrain = 1. With the same backbone and training scheduler, the A2 block achieves 77.41% mIoU, the non-local module achieves 77.78% mIoU, and EMANet achieves 77.34% with Ttrain = 1 and Teval = 1. These three results differ only slightly, which is consistent with our analysis.

Table 1: Detailed comparison with DeeplabV3/V3+ and PSANet in mIoU (%), using ResNet-101 with output stride 8 on PASCAL VOC. FLOPs and memory are computed for an input size of 513×513. SS: single-scale input during testing. MS: multi-scale inputs. Flip: adding a left-right flipped input. EMANet(256) and EMANet(512) denote EMANet with 256 and 512 input channels, respectively.

Table 2: Comparison on the PASCAL VOC test set.

Table 3: Comparison with state-of-the-art methods on the PASCAL Context test set. "+" indicates pre-training on COCO Stuff.

Table 4: Comparison on the COCO Stuff test set.

6.2.3. Comparison with state-of-the-art methods

        We first thoroughly compare EMANet with three baselines, namely DeeplabV3, DeeplabV3+, and PSANet, on the validation set. We report mIoU, FLOPs, memory cost, and number of parameters in Table 1. EMANet significantly outperforms the three baselines in accuracy while having a much lighter computation and memory burden. We further compare our method with existing methods on the PASCAL VOC test set. Following previous methods [4, 5], we train EMANet sequentially on the COCO, VOC trainaug, and VOC trainval sets, with base learning rates of 0.009, 0.001, and 0.0001, respectively. We train for 150K iterations on COCO and 30K iterations for each of the last two stages. We use multi-scale testing and left-right flipping when inferring on the test set. As shown in Table 2, our EMANet sets a new record on PASCAL VOC, outperforming DeeplabV3 [4] with the same backbone by 2.0% mIoU. EMANet performs best among networks with a ResNet-101 backbone and improves on the previous best result by 0.9%, which is significant given how competitive this benchmark is. Furthermore, it achieves performance comparable to methods based on larger backbones.

Figure 5: Visualization of the responsibilities Z in the last iteration. The first two rows show two examples from the PASCAL VOC validation set; the last two rows show two examples from the PASCAL Context validation set. z·i denotes the responsibilities of the i-th basis for all pixels in the last iteration. i, j, k, and l are four randomly chosen indices with 1 ≤ i, j, k, l ≤ K. Best viewed on screen.

6.3. Results on the PASCAL Context data set

        To verify the generalization ability of our proposed EMANet, we conducted experiments on the PASCAL Context data set. The quantitative results of PASCAL Context are shown in Table 3. To the best of our knowledge, EMANet based on ResNet-101 achieves the highest performance on the PASCAL Context dataset. Even with pre-training on extra data (COCO Stuff), SGR+ still falls short of EMANet.

6.4. Experimental results on the COCO Stuff data set

        To further evaluate the effectiveness of our method, we also conducted experiments on the COCO Stuff dataset. The comparison results with previous state-of-the-art methods are shown in Table 4. Notably, EMANet achieves 39.9% in mIoU and outperforms previous methods by a large margin.

6.5. Visualization of bases' responsibilities

        To gain a deeper understanding of the proposed EMAU, we visualize the responsibility maps Z of the last iteration in Figure 5. For each image, we randomly select four bases (i, j, k, and l) and show their responsibilities for all pixels in the last iteration. Clearly, each basis corresponds to an abstract concept in the image, and as the AE and AM iterations proceed, these abstract concepts become more compact and clear. As shown, the bases converge to specific semantics rather than merely separating foreground and background. Specifically, the bases in the first two rows attend to specific semantics such as people, wine glasses, cutlery, and profiles, while those in the last two rows attend to sailboats, mountains, airplanes, and lanes.

7. Summary

        In this paper, we propose a novel attention mechanism, expectation-maximization attention (EMA), which computes a compact basis set by iteratively executing the EM algorithm. The reconstructed output of EMA is low-rank and robust to variations of the input. We formulate the proposed method as a lightweight module that can be inserted into existing CNNs at little cost. Extensive experiments on a number of benchmark datasets demonstrate the effectiveness and efficiency of the proposed EMAU.
