CCNet: Criss-Cross Attention for Semantic Segmentation

[PyTorch code](https://github.com/speedinghzl/CCNet)

Figure 1 compares two attention-based context aggregation approaches:

  1. For each position (e.g., the blue one), the non-local module generates a dense attention map with H × W weights (green).
  2. For each position (e.g., the blue one), the criss-cross attention module generates a sparse attention map with only H + W − 1 weights. After the recurrent operation, each position in the final output feature map (e.g., the red one) can capture long-range dependencies from all pixels. For clarity, the residual connection is ignored here (see the quick comparison after this list).
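To see the scale of this difference, here is a quick back-of-the-envelope comparison; the 97 × 97 feature-map size is an illustrative assumption (roughly a 769-pixel crop at output stride 8), not a number from the paper:

```python
# Attention weights needed for one pass over an H x W feature map.
H = W = 97
nonlocal_weights   = (H * W) ** 2              # H*W weights per position, dense
crisscross_weights = (H * W) * (H + W - 1)     # H+W-1 weights per position, sparse
print(nonlocal_weights / crisscross_weights)   # ~48.75x fewer attention weights
```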

1. Overall

[Figure 2: overall architecture of CCNet]

  • Figure 2 shows the basic structure of the network. The input image is passed through a deep convolutional neural network (DCNN) to produce a feature map X.
    After obtaining X, we first apply a convolutional layer to obtain a dimensionality-reduced feature map H, then feed H into the criss-cross attention (CCA) module to generate a new feature map H′, in which each pixel aggregates long-range contextual information along its criss-cross path.

  • The feature map H′ only aggregates contextual information in the horizontal and vertical directions, which is not sufficient for semantic segmentation. To obtain richer and denser context, we feed H′ into the criss-cross attention module again and obtain the feature map H″, so each position in H″ actually gathers information from all pixels. The two criss-cross attention modules share the same parameters to avoid adding too many extra parameters. This recurrent structure is named the recurrent criss-cross attention (RCCA) module.

  • Then we concatenate the dense contextual feature H″ with the local representation feature X, followed by one or more convolutional layers with batch normalization and activation for feature fusion. Finally, the fused features are fed into the segmentation layer to generate the final segmentation map (a sketch of this pipeline follows).
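A minimal PyTorch sketch of this pipeline. It assumes a `CrissCrossAttention` module like the one sketched in Section 2 below; the channel sizes, class count, and fusion depth are illustrative assumptions rather than the paper's exact configuration:

```python
import torch
import torch.nn as nn

class CCNetHead(nn.Module):
    def __init__(self, in_ch=2048, mid_ch=512, n_classes=19, recurrence=2):
        super().__init__()
        # convolution producing the reduced feature map H from X
        self.reduce = nn.Sequential(
            nn.Conv2d(in_ch, mid_ch, 3, padding=1, bias=False),
            nn.BatchNorm2d(mid_ch), nn.ReLU(inplace=True))
        self.cca = CrissCrossAttention(mid_ch)   # sketched in Section 2 below
        self.recurrence = recurrence             # R: number of shared-weight loops
        # fuse the dense context H'' with the local representation X
        self.fuse = nn.Sequential(
            nn.Conv2d(in_ch + mid_ch, mid_ch, 3, padding=1, bias=False),
            nn.BatchNorm2d(mid_ch), nn.ReLU(inplace=True))
        self.classify = nn.Conv2d(mid_ch, n_classes, 1)  # segmentation layer

    def forward(self, x):              # x: feature map X from the DCNN backbone
        h = self.reduce(x)             # H
        for _ in range(self.recurrence):
            h = self.cca(h)            # H' after loop 1, H'' after loop 2 (RCCA)
        return self.classify(self.fuse(torch.cat([x, h], dim=1)))
```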

2. Criss-Cross Attention

[Figure 3: the criss-cross attention module]

  • As shown in Fig. 3, given a local feature map $H \in \mathbb{R}^{C \times W \times H}$, the criss-cross attention module first applies two convolutional layers with 1 × 1 filters on H to generate two feature maps Q and K, where $Q, K \in \mathbb{R}^{C' \times W \times H}$. Here C′ is the number of channels, which is smaller than C for dimensionality reduction. After obtaining Q and K, we further generate an attention map $A \in \mathbb{R}^{(H+W-1) \times W \times H}$ through an Affinity operation. At each position u in the spatial dimension of Q, we can obtain a vector $Q_u \in \mathbb{R}^{C'}$. Similarly, we obtain the set $\Omega_u$ by extracting from K the feature vectors that lie in the same row or column as u, so $\Omega_u \in \mathbb{R}^{(H+W-1) \times C'}$, with $\Omega_{i,u} \in \mathbb{R}^{C'}$ its i-th element.

  • The Affinity operation is defined as

    $$d_{i,u} = Q_u \, \Omega_{i,u}^{\top}$$

    where $d_{i,u} \in D$ measures the degree of correlation between $Q_u$ and $\Omega_{i,u}$, $i = 1, \dots, |\Omega_u|$, and $D \in \mathbb{R}^{(H+W-1) \times W \times H}$. Then we apply a softmax layer on D along the channel dimension to calculate the attention map A (illustrated in the first sketch after this list).

  • Another convolutional layer with 1 × 1 filters is then applied on H to generate $V \in \mathbb{R}^{C \times W \times H}$ for feature adaptation. At each position u in the spatial dimension of V, we can obtain a vector $V_u \in \mathbb{R}^{C}$ and a set $\Phi_u \in \mathbb{R}^{(H+W-1) \times C}$, the feature vectors of V that lie in the same row or column as u. Long-range contextual information is then collected by the Aggregation operation:
    $$H'_u = \sum_{i=1}^{H+W-1} A_{i,u} \, \Phi_{i,u} + H_u$$

    where $H'_u$ denotes the feature vector at position u of the output feature map $H' \in \mathbb{R}^{C \times W \times H}$, and $A_{i,u}$ is the scalar value at channel i and position u of A. The contextual information is added to the local feature map H to augment the pixel-wise representation, so each position obtains a wide contextual view and selectively aggregates context according to the spatial attention map (see the module sketch after this list).
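To make the Affinity step concrete, here is a tiny demonstration for a single position u; all shapes and values are made-up examples, not the paper's settings:

```python
import torch

C_, H, W = 4, 5, 6                            # assumed toy sizes
Q = torch.randn(C_, H, W)                     # queries (C' channels)
K = torch.randn(C_, H, W)                     # keys
y, x = 2, 3                                   # position u

Q_u = Q[:, y, x]                              # (C',)
# Omega_u: key vectors in u's column plus those in u's row (u kept once)
col = K[:, :, x].t()                          # (H, C')
row = torch.cat([K[:, y, :x], K[:, y, x + 1:]], dim=1).t()  # (W - 1, C')
omega_u = torch.cat([col, row], dim=0)        # (H + W - 1, C')

d_u = omega_u @ Q_u                           # Affinity: d_{i,u} = Q_u . Omega_{i,u}^T
A_u = torch.softmax(d_u, dim=0)               # u's slice of the attention map A
print(A_u.shape)                              # torch.Size([10]), i.e. H + W - 1
```

Putting Affinity, softmax, and Aggregation together, the sketch below is a compact, einsum-based version of the whole criss-cross attention module. It is a readability-oriented sketch, not the official implementation (the repository linked above uses a more optimized bmm-based layout); following that repository, a learnable scale γ initialized to zero is applied to the contextual term before the residual addition:

```python
import torch
import torch.nn as nn

class CrissCrossAttention(nn.Module):
    def __init__(self, channels, reduction=8):
        super().__init__()
        self.query = nn.Conv2d(channels, channels // reduction, 1)  # Q
        self.key   = nn.Conv2d(channels, channels // reduction, 1)  # K
        self.value = nn.Conv2d(channels, channels, 1)               # V
        self.gamma = nn.Parameter(torch.zeros(1))  # learnable scale on the context

    def forward(self, x):
        B, C, H, W = x.shape
        q, k, v = self.query(x), self.key(x), self.value(x)
        # Affinity: energies against u's column (i indexes rows) and
        # u's row (j indexes columns)
        e_col = torch.einsum('bcyx,bcix->biyx', q, k)        # (B, H, H, W)
        e_row = torch.einsum('bcyx,bcyj->bjyx', q, k)        # (B, W, H, W)
        # u appears both in its column and in its row; mask one copy so each
        # position attends to exactly H + W - 1 locations
        self_mask = torch.eye(H, dtype=torch.bool, device=x.device).view(1, H, H, 1)
        e_col = e_col.masked_fill(self_mask, float('-inf'))
        # softmax over the H + W candidates, i.e. the channel dimension of D
        attn = torch.softmax(torch.cat([e_col, e_row], dim=1), dim=1)
        a_col, a_row = attn[:, :H], attn[:, H:]
        # Aggregation: attention-weighted sum of values over the criss-cross path
        out = (torch.einsum('biyx,bcix->bcyx', a_col, v)
               + torch.einsum('bjyx,bcyj->bcyx', a_row, v))
        return self.gamma * out + x                          # residual back to H
```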

3. Recurrent Criss-Cross Attention


  • Although the criss-cross attention module captures long-range contextual information in the horizontal and vertical directions, the connections between pixels are still sparse. Building on the criss-cross attention module described above, we therefore introduce recurrent criss-cross attention, which can be unrolled into R loops. In the first loop, the criss-cross attention module takes the feature map H extracted by the CNN as input and outputs the feature map H′, where H and H′ have the same shape. In the second loop, it takes H′ as input and outputs H″.
    As shown in Figure 2, two loops (R = 2) are enough to harvest long-range dependencies from all pixels and produce a feature map with dense, rich contextual information (a usage sketch appears at the end of this section).

  • Let A and A′ be the attention maps in loop 1 and loop 2, respectively. Since we are only interested in contextual information spreading in the spatial dimension rather than the channel dimension, the convolutional layers with 1 × 1 filters can be viewed as identity connections. In addition, the function mapping a position (x′, y′) to the weight $A_{i,x,y}$ is defined as

    $$A_{i,x,y} = f(A, x, y, x', y')$$

    (a small indexing helper appears at the end of this section).

  • For any position u in the feature map H″ and any position θ in the feature map H, there is a connection when R = 2. One case is that u and θ are in the same row or column: θ then lies on the criss-cross path of u, so the information of θ propagates to u directly within a single loop.

  • The other case is that u and θ are in neither the same row nor the same column. The information then propagates in two hops, through the intermediate positions (u_x, θ_y) and (θ_x, u_y); Figure 4 shows this propagation path in the spatial dimension:

    $$H''_u \leftarrow \big[\, f(A, u_x, \theta_y, \theta_x, \theta_y) \cdot f(A', u_x, u_y, u_x, \theta_y) + f(A, \theta_x, u_y, \theta_x, \theta_y) \cdot f(A', u_x, u_y, \theta_x, u_y) \,\big] \cdot H_\theta$$

    where ← denotes the add-to operation.

    [Figure 4: propagation path of contextual information in the spatial dimension]

  • In general, our recurrent criss-cross attention module compensates for the inability of a single criss-cross attention module to obtain dense contextual information from all pixels. Compared with a single criss-cross attention pass, the recurrent module (R = 2) introduces no additional parameters and achieves better performance at the cost of only a minor increase in computation (a quick numerical check of the connectivity argument follows).
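A usage sketch of the recurrence, reusing the `CrissCrossAttention` module sketched in Section 2; the shapes are arbitrary examples:

```python
import torch

cca = CrissCrossAttention(64)      # one set of parameters, shared across loops
h  = torch.randn(2, 64, 17, 17)    # feature map H from the CNN
h1 = cca(h)                        # loop 1: H', criss-cross context only
h2 = cca(h1)                       # loop 2: H'', full-image context, no new parameters
assert h2.shape == h.shape
```

To make the indexing of f concrete, here is a hypothetical helper written against the (H + W)-channel layout of the module sketch above (one duplicate self-entry masked, so H + W − 1 effective weights); the layout is an assumption of that sketch, not the paper's exact storage order:

```python
def f(A, x, y, xp, yp):
    """Weight that position (x', y') contributes to position (x, y).

    A: attention map of shape (H + W, H, W) for a single image, where
    channels [0, H) index the rows of (x, y)'s column and channels
    [H, H + W) index the columns of its row. Only defined when (x', y')
    lies on the criss-cross path of (x, y).
    """
    H = A.shape[1]
    if xp == x:          # same column: select row y'
        return A[yp, y, x]
    if yp == y:          # same row: select column x'
        return A[H + xp, y, x]
    raise ValueError("(x', y') is not on the criss-cross path of (x, y)")
```

Finally, the connectivity argument behind R = 2 can be checked numerically. The sketch below ignores the actual attention values and only asks which positions can reach which after one and two criss-cross steps; the grid size is an arbitrary example:

```python
import torch

H, W = 5, 7
ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing='ij')
ys, xs = ys.flatten(), xs.flatten()        # N = H * W positions
# one criss-cross step connects u and theta iff they share a row or column
one_loop = (ys[:, None] == ys[None, :]) | (xs[:, None] == xs[None, :])
# two steps compose: a path u -> intermediate -> theta
two_loops = (one_loop.float() @ one_loop.float()) > 0
print(one_loop.all().item())    # False: a single loop leaves sparse connections
print(two_loops.all().item())   # True: R = 2 connects every pair of positions
```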
