Self-Attention
Self-Attention Architecture
Self-attention takes a row of vectors as input and outputs a row of vectors.
Each output vector takes the information of all the input vectors into account.
- Self-attention layers can be stacked many times.
- Fully connected (FC) layers and self-attention layers can be used alternately (a minimal sketch follows this list).
- Self-attention processes the information of the entire sequence.
- The FC network focuses on processing the information of a single position.
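As a rough sketch of how self-attention and FC layers can be stacked and alternated, the block below uses PyTorch's built-in nn.MultiheadAttention as a stand-in for the self-attention layer described later; the SelfAttentionBlock name and the layer sizes are made up for illustration.

import torch
import torch.nn as nn

class SelfAttentionBlock(nn.Module):
    """One self-attention layer followed by a position-wise FC layer."""
    def __init__(self, dim=16, num_heads=1):
        super().__init__()
        # self-attention looks at the whole sequence
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # the FC layer processes each position on its own
        self.fc = nn.Linear(dim, dim)

    def forward(self, x):                 # x: (batch, seq_len, dim)
        attn_out, _ = self.attn(x, x, x)  # query = key = value = x
        return torch.relu(self.fc(attn_out))

# the block can be stacked many times
model = nn.Sequential(SelfAttentionBlock(), SelfAttentionBlock())
out = model(torch.randn(2, 5, 16))        # output shape: (2, 5, 16)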
The process of Self-Attention
- Its architecture looks like this: the output is a row of vectors $b$, and each $b$ is computed from the input vectors $a$.
- $b^{1}$ is generated after considering $a^{1},a^{2},a^{3},a^{4}$.
- $b^{2},b^{3},b^{4}$ also consider $a^{1},a^{2},a^{3},a^{4}$; their calculation principle is the same.
Computing the correlation between two input vectors
- There are two common calculation methods: dot product and additive.
- In the dot-product method, the two vectors are each multiplied by a matrix W (the two W's are different: one is $W^{q}$, i.e. weight_q, the other is $W^{k}$, i.e. weight_k) to get $q$ and $k$ respectively; the inner product of $q$ and $k$ gives a score $\alpha_{i,j}$, which represents the relevance between $a^{i}$ and $a^{j}$.
- $a^{1}$ computes its similarity with $a^{2},a^{3},a^{4}$ separately: $a^{1}$'s query $q^{1}$ takes inner products with the keys $k^{2},k^{3},k^{4}$ of $a^{2},a^{3},a^{4}$ to get the attention scores $\alpha_{1,2},\alpha_{1,3},\alpha_{1,4}$. In actual operation, $a^{1}$'s similarity with itself, $\alpha_{1,1}$, is also calculated.
- Then $\alpha_{1,1},\alpha_{1,2},\alpha_{1,3},\alpha_{1,4}$ pass through a softmax layer to get $\alpha_{1,1}^{'},\alpha_{1,2}^{'},\ldots$
- Then each of the vectors $a^{1}$ to $a^{4}$ is multiplied by $W^{v}$ to get a new vector, giving $v^{1},v^{2},v^{3},v^{4}$ respectively.
- Next, each of $v^{1}$ to $v^{4}$ is multiplied by its attention score, and the weighted vectors are summed up to get $b^{1}$:
  $b^{1} = \sum_{i}\alpha_{1,i}^{'}v^{i}$
- Similarly, $a^{2},a^{3},a^{4}$ go through the same operations to get $b^{2},b^{3},b^{4}$.
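Putting the steps above together, here is a minimal vector-level sketch of how $b^{1}$ is computed; the random inputs, the 4-to-3 projection size, and the variable names are arbitrary choices for illustration (each row of a is one input vector).

import torch
from torch.nn.functional import softmax

torch.manual_seed(0)
a = torch.randn(4, 4)                      # four input vectors a^1 ... a^4
w_q, w_k, w_v = torch.randn(4, 3), torch.randn(4, 3), torch.randn(4, 3)

q = a @ w_q                                # q^i for every input
k = a @ w_k                                # k^i for every input
v = a @ w_v                                # v^i for every input

alpha_1 = q[0] @ k.T                       # scores of a^1 against every a^i (itself included)
alpha_1_prime = softmax(alpha_1, dim=-1)   # normalised attention scores

b1 = (alpha_1_prime[:, None] * v).sum(dim=0)   # b^1 = sum_i alpha'_{1,i} v^i
print(b1.shape)                            # torch.Size([3])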
The matrix perspective
Each input vector generates a set of $q^{i},k^{i},v^{i}$.
Therefore, each $a^{i}$ is multiplied by a $W^{q}$ matrix; doing this as a single matrix multiplication gives the matrix Q, and each column of Q is the $q^{i}$ of the corresponding input $a^{i}$.
Similarly, each group of $k^{i},v^{i}$ can also be computed with one matrix multiplication.
We know that each attention score $\alpha_{i,j}$ is the inner product of $q^{i}$ of the i-th input and $k^{j}$ of the j-th input.
So the four scores $\alpha_{1,1},\ldots,\alpha_{1,4}$ can be obtained by multiplying a matrix with the vector $q^{1}$.
Going further, all the attention scores can be computed at once; after softmax this gives $A^{'}$,
and the outputs are obtained by left-multiplying $A^{'}$ by the matrix V whose columns are the $v^{i}$, i.e. $B = VA^{'}$.
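A minimal sketch of this matrix view, using the column-vector convention above (the $a^{i}$ are the columns of the input matrix; the sizes are arbitrary):

import torch
from torch.nn.functional import softmax

torch.manual_seed(0)
I = torch.randn(4, 5)                 # 5 input vectors of dimension 4, stored as columns
W_q, W_k, W_v = torch.randn(3, 4), torch.randn(3, 4), torch.randn(3, 4)

Q = W_q @ I                           # columns are the q^i
K = W_k @ I                           # columns are the k^i
V = W_v @ I                           # columns are the v^i

A = K.T @ Q                           # A[i, j] = k^i . q^j, all attention scores at once
A_prime = softmax(A, dim=0)           # normalise each column over the inputs
B = V @ A_prime                       # B = V A', columns of B are the outputs b^j
print(B.shape)                        # torch.Size([3, 5])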
Review
Figure (15)
- I is the input to self-attention: a row of vectors put together as the columns of the matrix I.
- The input is multiplied by the three matrices $W^{q},W^{k},W^{v}$ to get Q, K, V.
- Next, the transpose of K is multiplied by Q to obtain the matrix A; after softmax processing we get $A^{'}$, which is then left-multiplied by V to get the output. (The code below uses the row-vector equivalent: $QK^{T}$, softmax, then multiply by V on the right.)
- Therefore, the only parameters to learn in self-attention are the W matrices; they are the only part that the network needs to train (a minimal module sketch follows).
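Since the W matrices are the only trainable parameters, the whole layer can be wrapped as a small module. The sketch below is a minimal single-head version (the SimpleSelfAttention name is made up, and the $1/\sqrt{d}$ scaling used in Transformers is omitted); it uses the row-vector convention that the code example below also uses.

import torch
import torch.nn as nn
from torch.nn.functional import softmax

class SimpleSelfAttention(nn.Module):
    def __init__(self, in_dim, head_dim):
        super().__init__()
        # W^q, W^k, W^v are the only parameters the network trains
        self.w_q = nn.Linear(in_dim, head_dim, bias=False)
        self.w_k = nn.Linear(in_dim, head_dim, bias=False)
        self.w_v = nn.Linear(in_dim, head_dim, bias=False)

    def forward(self, x):                  # x: (seq_len, in_dim), rows are the a^i
        Q, K, V = self.w_q(x), self.w_k(x), self.w_v(x)
        A = Q @ K.T                        # attention scores
        A_prime = softmax(A, dim=-1)       # normalise each row
        return A_prime @ V                 # rows are the outputs b^i

layer = SimpleSelfAttention(in_dim=4, head_dim=3)
print(layer(torch.randn(5, 4)).shape)      # torch.Size([5, 3])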
Code example
The example is divided into the following steps:
- Prepare the input
- Initialize the weights
- Derive the key, query, and value representations
- Calculate the attention scores
- Calculate the softmax
- Multiply the attention scores by the values
- Sum the weighted values to get the output
Assume the following input:
Input 1: [1, 0, 1, 0]
Input 2: [0, 2, 0, 2]
Input 3: [1, 1, 1, 1]
Initialize the parameters
Since our input is three 4-dimensional vectors, following figure (15) the input is multiplied by the W matrices, and the dimension of each W is set to (4, 3).
$W^{k},W^{q},W^{v}$ correspond to w_key, w_query, and w_value in the code.
x = [
    [1, 0, 1, 0],  # input 1
    [0, 2, 0, 2],  # input 2
    [1, 1, 1, 1],  # input 3
]
x = torch.tensor(x,dtype=torch.float32)
# initialize the weights
w_key = [
[0, 0, 1],
[1, 1, 0],
[0, 1, 0],
[1, 1, 0]
]
w_query = [
[1, 0, 1],
[1, 0, 0],
[0, 0, 1],
[0, 1, 1]
]
w_value = [
[0, 2, 0],
[0, 3, 0],
[1, 0, 3],
[1, 1, 0]
]
# convert to torch tensors
w_key = torch.tensor(w_key,dtype=torch.float32)
w_query = torch.tensor(w_query, dtype=torch.float32)
w_value = torch.tensor(w_value, dtype=torch.float32)
Compute Q, K, and V
querys = torch.tensor(np.dot(x, w_query), dtype=torch.float32)
keys = torch.tensor(np.dot(x, w_key), dtype=torch.float32)
values = torch.tensor(np.dot(x, w_value), dtype=torch.float32)
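Since x has shape (3, 4) and each weight matrix has shape (4, 3), querys, keys, and values are all (3, 3) matrices, with one row per input vector.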
Calculate the attention scores
# get the attention scores
attention_scores = torch.tensor(np.dot(querys, keys.T))
Apply softmax
# compute the softmax
attention_scores_softmax = softmax(attention_scores, dim=-1)
Multiply the attention scores by the values
weight_values = values[:,None] * attention_scores_softmax.T[:,:,None]
Sum the weighted values to get the output
outputs = weight_values.sum(dim=0)
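The broadcast-and-sum above is equivalent to one matrix multiplication of the softmaxed score matrix with the value matrix and gives the same outputs:

# same result as the broadcast-and-sum version above
outputs = attention_scores_softmax @ values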
Full test code
import torch
import numpy as np
from torch.nn.functional import softmax
def preData():
    # prepare the input: three 4-dimensional vectors
    x = [
        [1, 0, 1, 0],  # input 1
        [0, 2, 0, 2],  # input 2
        [1, 1, 1, 1],  # input 3
    ]
    x = torch.tensor(x, dtype=torch.float32)
    # initialize the weights
    w_key = [
        [0, 0, 1],
        [1, 1, 0],
        [0, 1, 0],
        [1, 1, 0]
    ]
    w_query = [
        [1, 0, 1],
        [1, 0, 0],
        [0, 0, 1],
        [0, 1, 1]
    ]
    w_value = [
        [0, 2, 0],
        [0, 3, 0],
        [1, 0, 3],
        [1, 1, 0]
    ]
    # convert to torch tensors
    w_key = torch.tensor(w_key, dtype=torch.float32)
    w_query = torch.tensor(w_query, dtype=torch.float32)
    w_value = torch.tensor(w_value, dtype=torch.float32)
    # get K, Q, V
    keys = torch.tensor(np.dot(x, w_key), dtype=torch.float32)
    querys = torch.tensor(np.dot(x, w_query), dtype=torch.float32)
    values = torch.tensor(np.dot(x, w_value), dtype=torch.float32)
    # get the attention scores
    attention_scores = torch.tensor(np.dot(querys, keys.T))
    print(attention_scores)
    # compute the softmax
    attention_scores_softmax = softmax(attention_scores, dim=-1)
    print(values.shape)
    # weight each value vector by its attention score, then sum to get the outputs
    weight_values = values[:, None] * attention_scores_softmax.T[:, :, None]
    outputs = weight_values.sum(dim=0)
    return outputs

if __name__ == "__main__":
    b = preData()
    print(b)