Self-Attention running process

The input of Self-Attention is a sequence of vectors. These vectors may be the input of your entire network, or they may be the output of a hidden layer, so we do not use x to represent them here. (Self-Attention may be used alternately with other layers and can sit at any point inside the network.)

We use a to represent it, indicating that it may already have been processed, i.e. it could be the output of a hidden layer. After taking a sequence of vectors a as input, Self-Attention outputs another sequence of vectors b.

Each b is generated after considering all of the a. That is why so many arrows are deliberately drawn here: b^{1} is generated by considering a^{1} to a^{4}, b^{2} is likewise generated by considering a^{1} to a^{4}, and the same goes for the rest; every b is generated by considering the entire input sequence.

How is the vector b^{1} generated? Once you know how to generate one, you will know how to generate the remaining b^{2}, b^{3} and b^{4}.

There is a special mechanism here. Based on the vector a^{1}, this mechanism finds out which parts of the entire long sequence are important, i.e. which parts are relevant for judging the label: the information needed to decide the class, or to determine the regression value.

The degree of association between each pair of vectors is represented by a value called α.

How does this self-attention module automatically determine the correlation between two vectors? If you give it two vectors, say a^{1} and a^{4}, how does it decide how related they are and produce a value α? You need a module that calculates attention:

that is, a module that takes two vectors as input and directly outputs the value α.

 

 There are two common approaches

1. The first, called dot product, multiplies the two input vectors by two different matrices: the vector on the left is multiplied by the matrix W^{q} to obtain the vector q, and the vector on the right is multiplied by the matrix W^{k} to obtain the vector k. The dot product then multiplies q and k element-wise and sums the results to get a scalar; this scalar is α. This is one way to calculate α.
2. The other calculation method is called Additive. Here the same two vectors are passed through W^{q} and W^{k} to get q and k, but instead of taking the dot product we concatenate them, pass the result through an activation function, and then through a transform to get α. (A small numpy sketch of both scoring functions follows below.)
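To make these two scoring functions concrete, here is a minimal numpy sketch; the dimensions, the tanh activation and the final transform w in the additive variant are illustrative assumptions, not something fixed by the text:

import numpy as np

d, d_k = 4, 3                       # assumed input and projection dimensions
rng = np.random.default_rng(0)
W_q = rng.normal(size=(d_k, d))     # W^q
W_k = rng.normal(size=(d_k, d))     # W^k
a1 = rng.normal(size=d)             # the two input vectors whose relevance we score
a2 = rng.normal(size=d)

q = W_q @ a1                        # q = W^q a^1
k = W_k @ a2                        # k = W^k a^2

# 1. dot-product: multiply q and k element-wise, then sum -> the scalar alpha
alpha_dot = np.sum(q * k)           # identical to np.dot(q, k)

# 2. additive: concatenate q and k, apply an activation, then a linear transform
w = rng.normal(size=2 * d_k)        # final transform (shape is an assumption)
alpha_add = w @ np.tanh(np.concatenate([q, k]))

print(alpha_dot, alpha_add)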


(figure: the dot-product scoring method)

A more detailed explanation can be found in another CSDN blogger's post.

What is element-wise? Just look at this code snippet:

import numpy as np

np1 = np.array([4, 6])
np2 = np.array([-3, 7])
# element-wise multiplication: corresponding entries are multiplied one by one
print(np2 * np1)

# [-12  42]


In short, there are many different methods for calculating attention, that is, for calculating the value of α, the degree of association. In the following discussion we only use the method on the left, the dot product, which is the most commonly used method today and the one used in the Transformer.

Then, for a^{1}, you have to calculate its correlation with each of the other vectors, that is, calculate the α between them.

 

You get q^{1} by multiplying a^{1} by W^{q}. This q has a name: we call it the Query. It is like the keyword you type into a search engine when looking for related articles, so it is called the Query.

Then multiply a^{2}, a^{3} and a^{4} by W^{k} to get the vectors k, which are called Keys. You then take the Query q^{1} and the Key k^{2} and compute their inner product to get α.

We use \alpha_{1,2} to denote the case where the Query is provided by a^{1} and the Key is provided by a^{2}; this correlation between 1 and 2 is called the Attention Score.

Next do the same with a^{3} and a^{4} to get \alpha_{1,3} and \alpha_{1,4}.

In fact, in practice q^{1} also computes the correlation with a^{1} itself. How important is it to compute the correlation with yourself? You can try it yourself when doing your homework and see whether it makes a big difference. After computing the correlation between a^{1} and every vector, a Soft-Max is applied.

This Soft-Max is exactly the same as the Soft-Max used for classification: it takes the row of α values and outputs a normalized row, which we write as \alpha'_{1,i}. You don't have to use Soft-Max here; other alternatives are fine. For example, some people have tried ReLU here instead, and it turned out to work better than Soft-Max in some cases. So you can use whatever activation function you like; Soft-Max is simply the most common choice, and you can experiment yourself to see whether you can get better results.
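As a hedged sketch of these steps for a^{1}, with made-up dimensions and random weights, the scores \alpha_{1,i} and their Soft-Max could be computed like this:

import numpy as np

rng = np.random.default_rng(1)
d, d_k = 4, 3                       # assumed dimensions
A_in = rng.normal(size=(4, d))      # the four input vectors a^1 ... a^4, one per row
W_q = rng.normal(size=(d, d_k))
W_k = rng.normal(size=(d, d_k))

q1 = A_in[0] @ W_q                  # the Query from a^1
K = A_in @ W_k                      # the Keys k^1 ... k^4 (a^1 included, so q^1 also attends to itself)

scores = K @ q1                     # alpha_{1,i}: inner product of q^1 with each k^i
alpha_prime = np.exp(scores) / np.exp(scores).sum()   # Soft-Max over the four scores
print(alpha_prime, alpha_prime.sum())                 # the normalized scores sum to 1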

Next, after getting these \alpha'_{1,i}, we use them to extract the important information in the sequence. From these α' we already know which vectors are most relevant to a^{1}; so how is the important information extracted?

 

1. First, multiply each vector a^{1} to a^{4} by W^{v} to get new vectors, denoted v^{1}, v^{2}, v^{3}, v^{4}.
2. Next, multiply each of v^{1} to v^{4} by its attention score \alpha'_{1,i}.
3. Then add them all up to get b^{1} = \sum_{i} \alpha'_{1,i} v^{i}.

(One thing I don't understand very well: what is the purpose of introducing the v / W^{v} matrix here?)

If a certain vector gets a higher score, for example if a^{1} has a strong correlation with a^{2}, then the \alpha'_{1,2} obtained is very large, and after the weighted sum the resulting b^{1} may be closer to v^{2}. Whoever has the highest attention score will dominate the result. That is how b^{1} is obtained from the whole sequence.
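Continuing the sketch above (same assumed dimensions and random weights), the v vectors and the weighted sum that produces b^{1} might look like this:

import numpy as np

rng = np.random.default_rng(2)
d, d_k = 4, 3
A_in = rng.normal(size=(4, d))              # a^1 ... a^4 as rows
W_q = rng.normal(size=(d, d_k))
W_k = rng.normal(size=(d, d_k))
W_v = rng.normal(size=(d, d_k))

q1 = A_in[0] @ W_q
K = A_in @ W_k
V = A_in @ W_v                              # v^1 ... v^4 as rows

scores = K @ q1                             # alpha_{1,i}
alpha_prime = np.exp(scores) / np.exp(scores).sum()

# b^1 = sum_i alpha'_{1,i} * v^i  (the vector with the largest score dominates)
b1 = alpha_prime @ V
print(b1)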

Generating b^{2} after b^{1} works in exactly the same way; you simply change the protagonist from a^{1} to a^{2}. Note that b^{1} to b^{4} do not have to be generated sequentially: with matrix operations they are all computed at the same time.

The matrix perspective

We already know that each a generates a q, k and v.

Each a is multiplied by a matrix: to get q, it is multiplied by W^{q}. The different a's can be combined into one matrix so that all the q's are computed in a single multiplication.

You can put a^{1}, a^{2}, a^{3}, a^{4} together and view them as a matrix, which we denote by I; the four columns of this matrix are a^{1} to a^{4}.

I is multiplied by W^{q} to get the matrix Q. The columns of the matrix I are the inputs of our Self-attention, a^{1} to a^{4}. W^{q} is actually a parameter of the network that will be learned later (i.e. found from the training data). The four columns of Q are q^{1} to q^{4}.

The operations that generate k and v are exactly the same as for q.

So to get q, k and v for each a, you in fact multiply the input vector sequence by three different matrices, and you get Q, K and V.
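From the matrix perspective this step is just three matrix multiplications. A minimal numpy sketch, using a row convention (one input vector per row, the transpose of the column convention in the figures) and assumed dimensions:

import numpy as np

rng = np.random.default_rng(3)
seq_len, d, d_k = 4, 4, 3            # assumed sizes
I = rng.normal(size=(seq_len, d))    # the input sequence a^1 ... a^4 (one vector per row here)
W_q = rng.normal(size=(d, d_k))
W_k = rng.normal(size=(d, d_k))
W_v = rng.normal(size=(d, d_k))

Q = I @ W_q                          # all the q's at once
K = I @ W_k                          # all the k's at once
V = I @ W_v                          # all the v's at once
print(Q.shape, K.shape, V.shape)     # (4, 3) each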

The next step: each q computes the inner product with each k to get the attention scores.

If you look at obtaining the attention scores from the perspective of matrix operations, what is actually being done?

You take the inner product of q^{1} and k^{1} to get \alpha_{1,1}; that is, \alpha_{1,1} is the product of (k^{1})^{T} and q^{1} (k^{1} is drawn wider in the figure to indicate that it is transposed). In the same way, \alpha_{1,2} is the product of (k^{2})^{T} and q^{1}, and likewise for the third and fourth steps. You can put these four operations together and regard them as the multiplication of a matrix and a vector.

 

These four operations can be regarded as multiplying q^{1} by a matrix whose four rows are k^{1} to k^{4}.

The same goes for q^{2}, q^{3} and q^{4}.

So all the attention scores can be regarded as the product of two matrices: one matrix whose rows are k^{1} to k^{4} (that is, K^{T}) and another matrix whose columns are q^{1} to q^{4} (that is, Q). We then normalize the attention scores, for example with Softmax: softmax is applied to each column here so that the values in each column add up to 1.
As mentioned before, softmax is not the only option; you can choose other operations, such as ReLU, and the results will not necessarily be worse. After passing through softmax the values change slightly, so we use A' to denote the result after softmax.

The matrix A' has now been calculated.

Then multiply by the matrix V, whose columns are v^{1}, v^{2}, v^{3}, v^{4}, and you get the b's.

Multiplying V by the first column of A' means taking each column of V and doing a weighted sum, where the weights are the entries of the first column of A' (each entry multiplies the corresponding column and the results are added up); that gives b^{1}.

So the whole Self-attention operation, as described at the very beginning, is: first generate q, k and v, then use q to find the relevant positions, then do a weighted sum over the v's. This entire series of operations is just a series of matrix multiplications.

Let's review the matrix multiplication we just saw

1. I is the input of Self-attention: a sequence of vectors put together as the columns of the matrix I.

2. This input is multiplied by three matrices, W^{q}, W^{k} and W^{v}, to get Q, K and V.

3. Q and the transpose of K are multiplied to get the matrix A. You may do some processing on A (such as softmax) to get A', which is sometimes called the Attention Matrix. Q is generated in order to get the attention scores α.

4. Then you multiply A' by V to get O. O is the output of the Self-attention layer. V is generated in order to compute the final b's, which form the matrix O. (A full numpy sketch of these four steps follows below.)
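Putting the four steps together, here is a hedged end-to-end sketch in numpy. It uses a row convention (one vector per row), so it computes A = Q K^{T} and applies softmax across each row, which is the transpose of the column convention in the figures; the shapes are assumptions:

import numpy as np

def self_attention(I, W_q, W_k, W_v):
    """Minimal single-head self-attention, I -> O.
    Each row of I is one input vector; W_q, W_k, W_v are the only learned parameters."""
    Q = I @ W_q                          # step 2: Queries
    K = I @ W_k                          #         Keys
    V = I @ W_v                          #         Values
    A = Q @ K.T                          # step 3: raw attention scores
    A_prime = np.exp(A) / np.exp(A).sum(axis=1, keepdims=True)  # Soft-Max per query
    O = A_prime @ V                      # step 4: weighted sums of the values
    return O

rng = np.random.default_rng(4)
I = rng.normal(size=(4, 4))                        # four input vectors a^1 ... a^4
W_q, W_k, W_v = (rng.normal(size=(4, 3)) for _ in range(3))
print(self_attention(I, W_q, W_k, W_v).shape)      # (4, 3): one b for each input vector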

So the input of Self-attention is I and the output is O. You will find that although it is called attention and does quite a lot, the only parameters that need to be learned in the Self-attention layer are W^{q}, W^{k} and W^{v}; only these are unknown and have to be found from our training data.

All the other operations contain no unknown parameters: they are fixed by us and do not need to be learned from the training data. That is the whole self-attention operation: from I to O is self-attention.

 


Origin: blog.csdn.net/jcandzero/article/details/127183502