Overview of attention mechanism

1 Introduction

In vision, the core idea of the attention mechanism is to highlight the important features of the object. [From attending to everything to focusing on the key points]

The purpose of the attention mechanism is to introduce attention weights into the structural design of deep neural networks. It can be understood as adding another layer of weights: important parts get larger weights, and unimportant parts get smaller weights. [Few parameters + fast + effective]

There are several types of visual attention, including channel attention, pixel attention, multi-level attention, etc.; self-attention was also introduced from NLP. The core idea is to find correlations within the original data and then highlight its important features.

2 Basic structure of common attention mechanisms

2.1 transformer

Self-attention (also known as intra-attention) is an attention mechanism that relates different positions of a single sequence in order to compute a representation of that sequence. Because of the position invariance of the decoder, position information has to be supplied explicitly; in DETR, for example, each pixel attends to both content information and position information.

Self-Attention is the core of the Transformer. It can be understood as mapping a query and a set of key-value pairs to an output: the output is a weighted sum of the values, where the weight assigned to each value is derived by Self-Attention from the query and the corresponding key.
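
As a minimal PyTorch sketch of this mapping (a hedged illustration with made-up tensor shapes, not the paper's exact implementation):

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(query, key, value):
    """Output is a weighted sum of value; the weights come from query-key similarity."""
    d_k = query.size(-1)
    # similarity between every query and every key, scaled by sqrt(d_k)
    scores = torch.matmul(query, key.transpose(-2, -1)) / d_k ** 0.5
    weights = F.softmax(scores, dim=-1)            # attention weights sum to 1 over the keys
    return torch.matmul(weights, value), weights   # weighted sum of the values

# toy self-attention: a sequence of 5 tokens with dimension 8 attends to itself
x = torch.randn(1, 5, 8)
out, attn = scaled_dot_product_attention(x, x, x)
print(out.shape, attn.shape)  # torch.Size([1, 5, 8]) torch.Size([1, 5, 5])
```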

All encoder layers have the same structure but do not share parameters.

The decoder likewise consists of a stack of identical layers.

2.2 soft-attention

Soft attention assigns continuous weights in [0, 1], paying attention to regions or channels.

It computes a weight (probability) for every key, so each key contributes to the output. This is a global calculation method (it can also be called Global Attention): it is more reasonable because it consults the contents of all keys and weights them, but the computation cost is larger.

Soft attention is deterministic attention: once training is complete it can be produced directly by the network, and it is differentiable. The gradient can therefore be computed through the neural network, and the attention weights can be learned by forward propagation and backpropagation.

Difference from hard attention: a hard-attention model selects a fixed number of input parts, while a soft-attention model weights a dynamic number of input parts to generate the output. [6] Hard attention locates one key exactly and ignores all others, which is equivalent to giving that key probability 1 and every other key probability 0. This one-shot alignment is very demanding: if the alignment is wrong, the impact is large. Moreover, because it is not differentiable, it generally has to be trained with reinforcement learning methods (or with a relaxation such as Gumbel-softmax).
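
To make the soft/hard contrast concrete, here is a small hedged sketch (the scores and values are made up for illustration; the argmax branch is shown only to illustrate why hard attention is not differentiable):

```python
import torch
import torch.nn.functional as F

scores = torch.tensor([2.0, 0.5, -1.0])                 # query-key similarity scores
values = torch.tensor([[1., 0.], [0., 1.], [1., 1.]])   # one value vector per key

# soft attention: every key gets a weight in (0, 1); the output is a differentiable weighted sum
soft_w = F.softmax(scores, dim=0)
soft_out = soft_w @ values

# hard attention: exactly one key is selected (weight 1, all others 0); argmax is not
# differentiable, hence the need for reinforcement learning or a Gumbel-softmax relaxation
hard_idx = torch.argmax(scores)
hard_out = values[hard_idx]

print(soft_w, soft_out, hard_out)
```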

2.2.1 Spatial domain attention (spatial transformer network)

It transforms the spatial information of the original image into another space while retaining the key information. A trained spatial transformer can find the regions of the image that deserve attention, and it can also perform rotation and scaling, so that the important local information of the image can be cropped out and extracted through the transformation.
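
A rough sketch of this idea in PyTorch, assuming a small localization network that predicts a 2x3 affine matrix and the standard affine_grid/grid_sample resampling (layer sizes are illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleSTN(nn.Module):
    """Illustrative spatial transformer: predicts a 2x3 affine matrix per image
    and resamples the input accordingly (rotation, scaling, translation)."""
    def __init__(self, channels):
        super().__init__()
        self.loc = nn.Sequential(
            nn.AdaptiveAvgPool2d(8), nn.Flatten(),
            nn.Linear(channels * 8 * 8, 32), nn.ReLU(),
            nn.Linear(32, 6),
        )
        # initialize the last layer to predict the identity transform
        self.loc[-1].weight.data.zero_()
        self.loc[-1].bias.data.copy_(torch.tensor([1., 0., 0., 0., 1., 0.]))

    def forward(self, x):
        theta = self.loc(x).view(-1, 2, 3)                  # affine parameters
        grid = F.affine_grid(theta, x.size(), align_corners=False)
        return F.grid_sample(x, grid, align_corners=False)  # warped feature map

x = torch.randn(2, 3, 64, 64)
print(SimpleSTN(3)(x).shape)  # torch.Size([2, 3, 64, 64])
```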

2.2.2 Channel Attention (CA)

A typical representative is SENet. Each convolutional layer has many kernels, and each kernel produces a feature channel. Compared with the spatial-transformer attention mechanism, channel attention allocates attention across the convolution channels, at a coarser granularity than the spatial transformer.

2.2.3 Mixed domain model

(1) Residual Attention Network for Image Classification (CVPR 2017 Open Access Repository)

(2) Dual Attention Network for Scene Segmentation (CVPR 2019 Open Access Repository)

2.2.4 non-local

Non-local means the receptive field can be very large, rather than a local region.

A fully connected layer is non-local and global, but it brings a large number of parameters, which makes optimization difficult.
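
For illustration, a hedged sketch of an embedded-Gaussian non-local block in the spirit of the non-local neural network design (channel sizes and names are illustrative):

```python
import torch
import torch.nn as nn

class NonLocalBlock(nn.Module):
    """Sketch of an embedded-Gaussian non-local block: every position
    attends to every other position of the feature map."""
    def __init__(self, channels, reduced=None):
        super().__init__()
        reduced = reduced or channels // 2
        self.theta = nn.Conv2d(channels, reduced, 1)
        self.phi = nn.Conv2d(channels, reduced, 1)
        self.g = nn.Conv2d(channels, reduced, 1)
        self.out = nn.Conv2d(reduced, channels, 1)

    def forward(self, x):
        n, c, h, w = x.shape
        q = self.theta(x).flatten(2).transpose(1, 2)   # (N, HW, C')
        k = self.phi(x).flatten(2)                     # (N, C', HW)
        v = self.g(x).flatten(2).transpose(1, 2)       # (N, HW, C')
        attn = torch.softmax(q @ k, dim=-1)            # (N, HW, HW): global receptive field
        y = (attn @ v).transpose(1, 2).reshape(n, -1, h, w)
        return x + self.out(y)                         # residual connection

print(NonLocalBlock(64)(torch.randn(1, 64, 16, 16)).shape)  # torch.Size([1, 64, 16, 16])
```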

2.2.5 position-wise attention

In DANet above, the attention map computes the similarity between every pixel and every other pixel, so the space complexity is (HxW)x(HxW). Position-wise attention adopts the criss-cross idea and only computes the similarity between each pixel and the pixels in the same row and column (i.e., on the cross). By looping this operation twice, the similarity between every pair of pixels is obtained indirectly, reducing the space complexity to (HxW)x(H+W-1). [Second-order attention can obtain contextual information of the whole image from all pixels and generate new feature maps with dense, rich contextual information.]
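
A quick back-of-the-envelope check of that saving, assuming a 64x64 feature map:

```python
H = W = 64
full_attention = (H * W) ** 2        # dense map: 16,777,216 similarities
criss_cross = H * W * (H + W - 1)    # one criss-cross pass: 520,192 similarities
print(full_attention / criss_cross)  # roughly 32x fewer entries per pass
```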

3. Implementation process of attention mechanism

3.1 Structural level

In terms of structure, attention can be divided into single-layer attention, multi-layer attention and multi-head attention, according to whether a hierarchical relationship is introduced:

1) Single-layer Attention, which is a relatively common approach, uses a query to pay attention to a piece of original text.

2) Multi-layer Attention is generally used in models with hierarchical relationships in text. Suppose we divide a document into multiple sentences. In the first layer, we use attention to calculate a sentence vector for each sentence (that is, a single layer attention); in the second layer, we apply attention to all sentence vectors to calculate a document vector (also a single-layer attention), and finally use this document vector to do the task.

3) Multi-head attention, as introduced in Attention Is All You Need [2], uses multiple queries to attend to the same source text several times. Each query attends to different parts of the text, which is equivalent to running single-layer attention multiple times: head(i) = Attention(q(i), K, V), with the single-layer results concatenated at the end (see the sketch after this list).
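
A minimal sketch using PyTorch's built-in nn.MultiheadAttention (the embedding size, head count and sequence length are illustrative):

```python
import torch
import torch.nn as nn

# 4 heads: each head projects its own query/key/value, attends separately,
# and the per-head outputs are concatenated and projected back to 32 dimensions
mha = nn.MultiheadAttention(embed_dim=32, num_heads=4, batch_first=True)
x = torch.randn(2, 10, 32)       # (batch, sequence length, embedding)
out, weights = mha(x, x, x)      # self-attention: query = key = value = x
print(out.shape, weights.shape)  # torch.Size([2, 10, 32]) torch.Size([2, 10, 10])
```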

3.2 Model aspects

From a model perspective, attention is generally combined with CNNs and LSTMs, but pure attention computations can also be performed directly.

yolov5 + Attention [ref: [9], [10]]

3.2.1 SE

Example of adding it:

(1) Create a new yolov5s_SE.yaml in the yolov5/models folder

(2) Add the SE attention module code to common.py (see the sketch after this list)

(3) Add the class name SE to yolov5/models/yolo.py

(4) Modify yolov5s_SE.yaml to insert the SE attention module at the desired location

(5) Modify the relevant parameters in train.py to start training
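
For step (2), a commonly used SE implementation looks roughly like the sketch below. The (c1, c2, ratio) signature and the reduction ratio of 16 are assumptions chosen to match the usual yolov5 module convention; they must agree with how SE is referenced in the .yaml file and registered in yolo.py:

```python
import torch
import torch.nn as nn

class SE(nn.Module):
    """Squeeze-and-Excitation: global average pooling ("squeeze") followed by two
    fully connected layers ("excitation") that produce one weight per channel."""
    def __init__(self, c1, c2, ratio=16):
        super().__init__()
        self.avgpool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Sequential(
            nn.Linear(c1, c1 // ratio, bias=False),
            nn.ReLU(inplace=True),
            nn.Linear(c1 // ratio, c1, bias=False),
            nn.Sigmoid(),
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        w = self.fc(self.avgpool(x).view(b, c)).view(b, c, 1, 1)
        return x * w  # reweight each feature channel

# quick shape check
print(SE(64, 64)(torch.randn(1, 64, 20, 20)).shape)  # torch.Size([1, 64, 20, 20])
```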

3.2.2 CBAM

3.2.3 ECA

3.2.4 CA

3.3 Others

When computing attention, you need to calculate a score (similarity) between the query and each key. Commonly used methods are listed below (a sketch follows the list):

1) Dot product

2) Matrix multiplication

3) Cosine similarity

4) Concatenate q and k
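
A hedged sketch of these scoring options (the learned matrix W, layer W_c and vector v are illustrative placeholders):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

q = torch.randn(8)   # query vector
k = torch.randn(8)   # key vector

dot = torch.dot(q, k)                    # 1) dot product
W = torch.randn(8, 8)
bilinear = q @ W @ k                     # 2) matrix multiplication (bilinear form q^T W k)
cos = F.cosine_similarity(q, k, dim=0)   # 3) cosine similarity
W_c = nn.Linear(16, 8)                   # 4) concatenate q and k, then score with a small MLP
v = nn.Linear(8, 1, bias=False)
concat_score = v(torch.tanh(W_c(torch.cat([q, k])))).squeeze()

print(dot, bilinear, cos, concat_score)
```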

4 FAQ

4.1 Why does the accuracy sometimes decrease after adding an attention mechanism?

Most attention modules have parameters, so adding one increases the complexity of the model. (1) If the model was underfitting before adding attention, the extra parameters help it learn and performance improves. (2) If the model was already overfitting, the extra parameters may aggravate the overfitting, and performance may stay the same or decline.

ref:

[1] https://zhuanlan.zhihu.com/p/146130215

[2] http://papers.nips.cc/paper/7181-attention-is-all-you-need

[3] https://zhuanlan.zhihu.com/p/48508221

[4] https://github.com/huggingface/transformers

[5] https://blog.csdn.net/xys430381_1/article/details/89323444

[6] https://www.jianshu.com/p/1a78bd494c4a

[7] https://mp.weixin.qq.com/s?__biz=MzI5MDUyMDIxNA

[8] https://www.zhihu.com/question/478301531/answer/2280232845

[9] https://zhuanlan.zhihu.com/p/543231209

[10] https://blog.csdn.net/weixin_50008473/article/details/124590939
