Transformer's Q, K, V and Multi-Head Self-Attention (a detailed explanation)

Table of contents

1. What are Q, K, V

2. Multi-Head Self-Attention


The Transformer is extremely popular and has achieved results that cannot be ignored in many fields. Today's large language models (LLMs) are also built on the Transformer, but what exactly are Q, K, V and multi-head attention in the Transformer? This post is a short study note to review and consolidate them.

1. What are Q, K, V

Q, K, and V in the Transformer refer to the three input representation vectors used in the self-attention mechanism.

Q is the query vector, K is the key vector, and V is the value vector. All three are obtained from the original input vectors (usually word embeddings) through linear transformations.

In the self-attention mechanism, the similarity between the query vector Q and all key vectors K is computed to obtain a weight distribution, which is then used to take a weighted sum of the corresponding value vectors V.

The concepts of Q, K, and V come from retrieval systems, where Q is the Query, K is the Key, and V is the Value. Put simply, Q and K are matched by similarity, and the result of a successful match is V. For example, when we search for something on an e-commerce site such as Taobao, the search keywords we enter are Q, the product descriptions are K, and the products returned once Q and K match successfully are V.

In the Transformer, the core attention formula is

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

So where do Q, K, and V come from? They are obtained by linearly transforming the input matrix X. The formulas can be written simply as:

$$Q = XW^Q, \quad K = XW^K, \quad V = XW^V$$

Represented visually:

[Figure: the input X is multiplied by W^Q, W^K, W^V to obtain Q, K, V]

where $W^Q$, $W^K$, and $W^V$ are three trainable parameter matrices. Multiplying the input matrix X by these parameter matrices is equivalent to performing a linear transformation, which yields Q, K, and V.
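As a concrete illustration, here is a minimal NumPy sketch of this linear-transformation step. The shapes (4 tokens, model dimension 8), the random weights, and the variable names are placeholder assumptions, not values from the post.

```python
# A minimal sketch: Q, K, V obtained by linearly transforming the input X.
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model = 4, 8                      # assumed: 4 tokens, embedding size 8

X = rng.normal(size=(seq_len, d_model))      # input matrix (e.g. word embeddings)

W_Q = rng.normal(size=(d_model, d_model))    # trainable parameter matrix W^Q
W_K = rng.normal(size=(d_model, d_model))    # trainable parameter matrix W^K
W_V = rng.normal(size=(d_model, d_model))    # trainable parameter matrix W^V

Q = X @ W_Q   # query matrix
K = X @ W_K   # key matrix
V = X @ W_V   # value matrix
print(Q.shape, K.shape, V.shape)  # (4, 8) (4, 8) (4, 8)
```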

Then Q, K, and V are used to compute the attention. The formula is as follows:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

The figure given in the paper is as follows:

[Figure: Scaled Dot-Product Attention, from the paper]

Q and K go through MatMul to produce a similarity matrix. Each element of the similarity matrix is then divided by $\sqrt{d_k}$, where $d_k$ is the dimension of K; this division is called Scale. When $d_k$ is large, the variance of $QK^T$ becomes large, and scaling reduces the variance so that gradient updates are more stable during training. The scaled scores then pass through SoftMax, and finally a MatMul with V gives the result.
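Below is a minimal sketch of these four steps (MatMul, Scale, SoftMax, MatMul). The `scaled_dot_product_attention` and `softmax` helpers, the shapes, and the random inputs are illustrative assumptions.

```python
# MatMul -> Scale -> SoftMax -> MatMul, written out in NumPy.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # MatMul, then Scale by sqrt(d_k)
    weights = softmax(scores, axis=-1)  # SoftMax: each row is a weight distribution
    return weights @ V                  # MatMul: weighted sum of the value vectors

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))
out = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # (4, 8)
```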

2. Multi-Head Self-Attention

We have now seen what Q, K, and V are and where they come from, but what is multi-head attention?

The multi-head attention formula given in the Transformer paper is as follows:

$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\,W^O$$

$$\text{where } \mathrm{head}_i = \mathrm{Attention}(QW_i^Q, KW_i^K, VW_i^V)$$

As can be seen from the formula, multi-head attention concatenates (Concat) the outputs of the individual heads and then multiplies the result by $W^O$. Each $\mathrm{head}_i$ is obtained by projecting Q, K, and V with its own matrices $W_i^Q$, $W_i^K$, $W_i^V$ and performing the Attention operation. The figure given in the paper is as follows:

[Figure: Multi-Head Attention, from the paper]

Q, K, and V each pass through a Linear layer and then through h parallel Scaled Dot-Product Attention blocks, producing h outputs, where h is the number of attention heads. The h outputs are concatenated (Concat) and passed through a final Linear layer to obtain the result.

[Figure]

In this way you get multiple groups of Q, K, and V, and each group corresponds to one head.
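Here is a minimal sketch of this idea, assuming two heads and randomly initialised per-head matrices; the names and shapes are illustrative assumptions.

```python
# One group of (Q_i, K_i, V_i) per head, each with its own W_i^Q, W_i^K, W_i^V.
import numpy as np

rng = np.random.default_rng(1)
seq_len, d_model, num_heads = 4, 8, 2
d_head = d_model // num_heads              # dimension handled by each head

X = rng.normal(size=(seq_len, d_model))    # input (e.g. word embeddings)

heads = []
for i in range(num_heads):
    W_Qi = rng.normal(size=(d_model, d_head))
    W_Ki = rng.normal(size=(d_model, d_head))
    W_Vi = rng.normal(size=(d_model, d_head))
    heads.append((X @ W_Qi, X @ W_Ki, X @ W_Vi))   # (Q_i, K_i, V_i) for head i

print(len(heads), heads[0][0].shape)  # 2 (4, 4)
```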


The following explanation is quoted from Pilibala Wz, a content creator on Bilibili.

Let's first make some preparations, as shown below.

[Figures]

In the same way, head2 is obtained for the same two inputs, as shown below.

[Figure]

The left side shows head1 for the inputs x1 and x2, the right side shows head2 for the inputs x1 and x2, and b is the bias term.

At this point, you have the parameters $W_i^Q$, $W_i^K$, and $W_i^V$ corresponding to each $\mathrm{head}_i$. Next, apply the same calculation as in Self-Attention to each head to get that head's result.

[Figure: the attention result computed for each head]

Then the results obtained by the heads are concatenated (Concat), and the concatenated result is passed through $W^O$ (a learnable parameter matrix) for fusion.
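Putting the whole walkthrough together, here is a compact sketch of multi-head self-attention: per-head attention, Concat, then fusion with $W^O$. For brevity it uses one projection matrix each for Q, K, and V whose columns are split across the heads, which is mathematically equivalent to giving each head its own $W_i^Q$, $W_i^K$, $W_i^V$; all shapes and weights are illustrative assumptions.

```python
# End-to-end multi-head self-attention sketch in NumPy.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(X, W_Q, W_K, W_V, W_O, num_heads):
    """X: (seq_len, d_model); W_Q/W_K/W_V/W_O: (d_model, d_model)."""
    seq_len, d_model = X.shape
    d_head = d_model // num_heads

    # Project X, then split the last dimension into num_heads groups
    def split(W):
        return (X @ W).reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    Q, K, V = split(W_Q), split(W_K), split(W_V)          # (num_heads, seq_len, d_head)

    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)   # per-head MatMul + Scale
    weights = softmax(scores, axis=-1)                    # per-head SoftMax
    heads = weights @ V                                   # (num_heads, seq_len, d_head)

    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)  # Concat the h heads
    return concat @ W_O                                   # fuse with learnable W^O

# Example usage with random weights
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
W = [rng.normal(size=(8, 8)) for _ in range(4)]
out = multi_head_self_attention(X, *W, num_heads=2)
print(out.shape)  # (4, 8)
```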

As can be seen from the above, the subspaces that different heads attend to are not necessarily the same, so the multi-head mechanism can combine the information learned by the different heads, which gives the model stronger representational power.

More heads tends to go hand in hand with stronger model capability. For example, the LLM Baichuan-13B uses 40 attention heads, while Baichuan-7B uses 32.


Origin: blog.csdn.net/weixin_45303602/article/details/134188049