Introduction to self-attention, with hand-written code

Self-Attention


Self-Attention Architecture

Self-attention takes a sequence of vectors as input and outputs a sequence of vectors.

Each output vector takes the information of all the input vectors into account.

Figure 1

  • Self-attention can be stacked many times.
    Fully connected (FC) layers and self-attention can be used alternately (see the sketch after this list):
  1. Self-attention processes the information of the entire sequence.
  2. The FC network focuses on processing the information of a single position.
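
As a rough PyTorch sketch of this contrast (hypothetical dimensions, not part of the original figures): a fully connected layer transforms each position on its own, while a self-attention layer lets every output position use the whole sequence.

    import torch
    import torch.nn as nn

    seq = torch.randn(1, 5, 16)  # (batch, sequence length, feature dim) -- made-up sizes

    fc = nn.Linear(16, 16)                                                     # acts on each position independently
    attn = nn.MultiheadAttention(embed_dim=16, num_heads=1, batch_first=True)  # mixes information across positions

    fc_out = fc(seq)                              # output at position t depends only on seq[:, t]
    attn_out, attn_weights = attn(seq, seq, seq)  # each output position attends to the entire sequence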

The process of Self-Attention


  • The architecture looks like this: the output is a sequence of vectors $b$, where each $b$ is computed from the input vectors $a$.

  • $b^{1}$ is generated by taking $a^{1}, a^{2}, a^{3}, a^{4}$ into account.

  • $b^{2}, b^{3}, b^{4}$ also take $a^{1}, a^{2}, a^{3}, a^{4}$ into account; their calculation follows the same principle.

    Computing the correlation between two input vectors


  • There are two common ways to compute it: the dot product and additive attention.

  • In the dot-product method, the two input vectors are first multiplied by two different matrices (the query weight $W^{q}$ and the key weight $W^{k}$) to get $q$ and $k$ respectively; their inner product then gives a score $\alpha_{i,j}$, which indicates the relevance between $a^{i}$ and $a^{j}$.


  • $a^{1}$ then computes its similarity with $a^{2}, a^{3}, a^{4}$: the query $q^{1}$ of $a^{1}$ is dotted with the keys $k^{2}, k^{3}, k^{4}$ of $a^{2}, a^{3}, a^{4}$ to get the attention scores $\alpha_{1,2}, \alpha_{1,3}, \alpha_{1,4}$. In actual operation, $a^{1}$ also computes the similarity with itself, giving $\alpha_{1,1}$.

  • The scores $\alpha_{1,1}, \alpha_{1,2}, \alpha_{1,3}, \alpha_{1,4}$ then pass through a softmax layer to get $\alpha^{'}_{1,1}, \alpha^{'}_{1,2}, \alpha^{'}_{1,3}, \alpha^{'}_{1,4}$.

  • Then each of the vectors $a^{1}$ to $a^{4}$ is multiplied by $W^{v}$ to get a new vector, giving $v^{1}, v^{2}, v^{3}, v^{4}$.

  • Next, each of $v^{1}$ to $v^{4}$ is multiplied by its attention score, and the results are summed to get $b^{1}$:
    $b^{1} = \sum_{i}\alpha_{1,i}^{'}v^{i}$

  • Similarly, $a^{2}, a^{3}, a^{4}$ go through the same operations to get $b^{2}, b^{3}, b^{4}$ (a code sketch of the whole procedure follows this list).
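
Below is a minimal sketch of this per-vector procedure for computing $b^{1}$, using random inputs and weights with made-up dimensions (these are not the values used in the code example later):

    import torch
    from torch.nn.functional import softmax

    # four hypothetical 4-dimensional inputs a^1..a^4 and 3x4 weight matrices
    a = [torch.randn(4) for _ in range(4)]
    W_q, W_k, W_v = torch.randn(3, 4), torch.randn(3, 4), torch.randn(3, 4)

    q1 = W_q @ a[0]               # query of a^1
    k = [W_k @ a_i for a_i in a]  # keys k^1..k^4
    v = [W_v @ a_i for a_i in a]  # values v^1..v^4

    # attention scores alpha_{1,i} = q^1 . k^i (a^1 also attends to itself)
    alpha_1 = torch.stack([q1 @ k_i for k_i in k])
    alpha_1_prime = softmax(alpha_1, dim=0)

    # b^1 = sum_i alpha'_{1,i} * v^i
    b1 = sum(w * v_i for w, v_i in zip(alpha_1_prime, v))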

The matrix perspective

Each input vector $a^{i}$ generates a set of $q^{i}, k^{i}, v^{i}$.


Therefore, if we stack the $a^{i}$ as columns of a matrix and left-multiply by $W^{q}$, a single matrix multiplication gives the matrix $Q$, and each column of $Q$ is the $q^{i}$ of the corresponding input $a^{i}$.

Similarly, each group of $k^{i}, v^{i}$ can also be obtained with a matrix multiplication, by left-multiplying the same stacked input matrix by $W^{k}$ and $W^{v}$.
We know that each attention score $\alpha_{i,j}$ is the inner product of $q^{i}$ (from the $i$-th input) and $k^{j}$ (from the $j$-th input).
Then the four scores $\alpha_{1,1}, \ldots, \alpha_{1,4}$ can be obtained at once by multiplying the matrix whose rows are $k^{1}$ to $k^{4}$ (i.e. $K^{T}$) with the vector $q^{1}$.
Furthermore, we can compute all the attention scores at once as the matrix $A = K^{T}Q$, and after softmax we get $A^{'}$.
The outputs are then obtained by left-multiplying $A^{'}$ by $V$ (whose columns are the $v^{i}$): each column of $VA^{'}$ is one output $b^{i}$.

Review

Figure (15)

  • I is the input to self-attention: a sequence of vectors put together as the columns of a matrix.
  • The input is multiplied by the three matrices $W^{q}, W^{k}, W^{v}$ to get Q, K, V.
  • Next, Q and the transpose of K are multiplied to obtain the matrix A ($A = K^{T}Q$); after softmax processing we get $A^{'}$, which is then left-multiplied by V to get the output.
  • Therefore, the only parameters to learn in self-attention are the W matrices; they are the part the network needs to train.
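
To make the matrix view concrete, here is a minimal sketch following the column convention of Figure (15), with random values and three 4-dimensional inputs as in the code example below (which does the same computation with rows instead of columns):

    import torch
    from torch.nn.functional import softmax

    I = torch.randn(4, 3)   # columns are the input vectors a^1..a^3 (hypothetical 4-dim inputs)
    W_q, W_k, W_v = torch.randn(3, 4), torch.randn(3, 4), torch.randn(3, 4)

    Q = W_q @ I             # each column is q^i
    K = W_k @ I             # each column is k^i
    V = W_v @ I             # each column is v^i

    A = K.T @ Q                   # A[i, j] is the score between k^i and q^j
    A_prime = softmax(A, dim=0)   # normalize the scores of each query (each column)
    O = V @ A_prime               # column j of O is the output b^j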

Code example

The example is divided into the following steps:

  1. Prepare the input
  2. Initialize the weights
  3. Derive the key, query and value representations
  4. Calculate the attention scores
  5. Apply softmax
  6. Multiply the attention scores by the values
  7. Sum the weighted values to get the output

Assume the following input:

    Input 1: [1, 0, 1, 0]     
    Input 2: [0, 2, 0, 2]  
    Input 3: [1, 1, 1, 1]

Initializing the parameters

In Figure (15) the input is left-multiplied by W. Since our input here is three 4-dimensional vectors stored as the rows of x, we instead multiply x by W on the right and set the shape of each W to (4, 3).

$W^{k}, W^{q}, W^{v}$ correspond to w_key, w_query and w_value respectively.

    import torch

    x = [
        [1, 0, 1, 0],  # input 1
        [0, 2, 0, 2],  # input 2
        [1, 1, 1, 1],  # input 3
    ]
    x = torch.tensor(x, dtype=torch.float32)
    # initialize the weights
    w_key = [
        [0, 0, 1],
        [1, 1, 0],
        [0, 1, 0],
        [1, 1, 0],
    ]
    w_query = [
        [1, 0, 1],
        [1, 0, 0],
        [0, 0, 1],
        [0, 1, 1],
    ]
    w_value = [
        [0, 2, 0],
        [0, 3, 0],
        [1, 0, 3],
        [1, 1, 0],
    ]
    # convert to tensors
    w_key = torch.tensor(w_key, dtype=torch.float32)
    w_query = torch.tensor(w_query, dtype=torch.float32)
    w_value = torch.tensor(w_value, dtype=torch.float32)

Compute Q, K and V

querys = x @ w_query
keys = x @ w_key
values = x @ w_value

Calculate the attention scores

# get attention scores
attention_scores = querys @ keys.T

softmax processing

from torch.nn.functional import softmax

# compute softmax over each row of scores
attention_scores_softmax = softmax(attention_scores, dim=-1)

Multiply attention scores by value

# weight each value vector by its attention score (result has shape (3, 3, 3))
weight_values = values[:,None] * attention_scores_softmax.T[:,:,None]

Sum the weighted values to get the output

outputs = weight_values.sum(dim=0)
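
As a side check (my own note, not from the original post): steps 6 and 7 together amount to a single matrix multiplication of the softmaxed scores with the value matrix, so the result can be verified with:

outputs_check = attention_scores_softmax @ values  # should match outputs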

Complete test code


import torch
from torch.nn.functional import softmax


def preData():
    # prepare the input: three 4-dimensional vectors, one per row
    x = [
        [1, 0, 1, 0],  # input 1
        [0, 2, 0, 2],  # input 2
        [1, 1, 1, 1],  # input 3
    ]
    x = torch.tensor(x, dtype=torch.float32)
    # initialize the weights
    w_key = [
        [0, 0, 1],
        [1, 1, 0],
        [0, 1, 0],
        [1, 1, 0],
    ]
    w_query = [
        [1, 0, 1],
        [1, 0, 0],
        [0, 0, 1],
        [0, 1, 1],
    ]
    w_value = [
        [0, 2, 0],
        [0, 3, 0],
        [1, 0, 3],
        [1, 1, 0],
    ]
    # convert to tensors
    w_key = torch.tensor(w_key, dtype=torch.float32)
    w_query = torch.tensor(w_query, dtype=torch.float32)
    w_value = torch.tensor(w_value, dtype=torch.float32)

    # get K, Q, V
    keys = x @ w_key
    querys = x @ w_query
    values = x @ w_value

    # get attention scores
    attention_scores = querys @ keys.T
    print(attention_scores)
    # compute softmax
    attention_scores_softmax = softmax(attention_scores, dim=-1)
    print(values.shape)
    # weight each value by its attention score, then sum to get the outputs
    weight_values = values[:, None] * attention_scores_softmax.T[:, :, None]
    outputs = weight_values.sum(dim=0)
    return outputs


if __name__ == "__main__":
    b = preData()
    print(b)
