[Attention] Paper notes, summary 3: self-attention and the Transformer

An explanation of self-attention and the Transformer

Paper: Attention Is All You Need.

References

1. This summary closely follows this blog post, which explains things very well; this write-up is mostly a retelling of it.
2. This Zhihu article was also used as a supplement.

1 Self-attention in detail

Self-attention is attention where K = V = Q. For example, given an input sentence, every word in the sentence computes attention against all the other words. The purpose is to learn the dependencies between words inside the sentence and capture its internal structure.

1.1 Process

  1. Create three vectors from each encoder input vector.
  • For each word, create a query vector, a key vector, and a value vector. These vectors are obtained by multiplying the embedding by three matrices learned during training.

    In most NLP applications, the Key and Value are often the same, i.e. Key = Value.

  • These new vectors have a smaller dimension than the embedding, namely 64. This is an architectural choice that keeps the cost of the multi-head attention computation largely constant.
    self-attention Process 1
  2. Compute a score from the query and key vectors.
  • The score is the dot product of the query vector and the key vector.

    The score for the first word is dot(q1, k1), the second score is dot(q1, k2), and so on.

  3. Divide each score by 8 (in the paper, by the square root of the key vector dimension d_k, i.e. sqrt(64)).

    In order to obtain a more stable gradient.

  4. Pass the result through a softmax.

    This normalizes the scores so they are all positive and sum to 1, and determines how strongly each word is expressed at this position. Clearly the word at this position itself gets the highest softmax score, but sometimes attending to another word related to the current word is useful.

  5. Multiply each value vector by its softmax score.

    This keeps the values of the words we want to focus on intact and drowns out unrelated words.

  6. Sum up the weighted value vectors.

    This gives the output of the self-attention layer at this position (a small code sketch of these six steps follows the list).
    self-attention Process 4
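
To make the six steps concrete, here is a minimal numpy sketch of the computation for a single position. The toy dimensions and the randomly initialized `W_q`, `W_k`, `W_v` matrices are illustrative stand-ins for trained weights, not values from the paper.

```python
import numpy as np

np.random.seed(0)

d_model, d_k = 512, 64                 # embedding size and query/key/value size (as in the paper)
n_words = 3                            # a toy 3-word sentence
X = np.random.randn(n_words, d_model)  # word embeddings (illustrative)

# Step 1: create query/key/value vectors with trained matrices (random placeholders here)
W_q = np.random.randn(d_model, d_k)
W_k = np.random.randn(d_model, d_k)
W_v = np.random.randn(d_model, d_k)
Q, K, V = X @ W_q, X @ W_k, X @ W_v

q1 = Q[0]                              # focus on the first position

# Step 2: score = dot product of the query with every key
scores = np.array([q1 @ K[i] for i in range(n_words)])   # dot(q1, k1), dot(q1, k2), ...

# Step 3: divide by sqrt(d_k) = 8 for more stable gradients
scores = scores / np.sqrt(d_k)

# Step 4: softmax so the weights are positive and sum to 1
weights = np.exp(scores - scores.max())
weights = weights / weights.sum()

# Steps 5-6: weight each value vector and sum them up
z1 = (weights[:, None] * V).sum(axis=0)   # the self-attention output at position 1
print(z1.shape)                           # (64,)
```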

The resulting vector can then be sent on to the feed-forward network. In practice, the computation is carried out in matrix form, so next we look at self-attention in matrix form.

1.2 Matrix calculation of self-attention

process:

  1. Compute the query, key, and value matrices.

Pack the embeddings into a matrix X and multiply it by the trained weight matrices (WQ, WK, WV).
Process matrix self-attention 1
Each row of the matrix X corresponds to a word in the input sentence.

  2. Condense steps 2 through 6 into a single formula.

    self-attention matrix process 2
    self-attention matrix calculation process
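
Continuing the toy example from section 1.1, here is a minimal sketch of the same computation in matrix form (the `self_attention` helper is an illustrative implementation, not code from the paper or the reference blog):

```python
def self_attention(X, W_q, W_k, W_v):
    """Matrix form of self-attention: Z = softmax(Q K^T / sqrt(d_k)) V."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                            # one score per (query, key) pair
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)    # row-wise softmax
    return weights @ V                                         # each row is the output for one word

Z = self_attention(X, W_q, W_k, W_v)
print(Z.shape)   # (3, 64): one output vector per input word
```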

1.3 Scaled Dot-Product Attention

The procedure above, i.e. attention where similarity is computed with a dot product, is Scaled Dot-Product Attention. The only extra piece is the sqrt(d_k) factor, which acts as a scaling term so that the dot products do not grow too large.
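
Written as a single formula (this is the formula from the paper):

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V$$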
Scaled Dot-Product Attention1
Scaled Dot-Product Attention2
Scaled Dot-Product Attention3

2 Transformer structure

To sum up:

  • Both the encoder and the decoder use multi-head attention (Multi-Head Attention) to learn text representations; in particular, when Q = K = V this is the self-attention mechanism (self-attention).
  • The encoder and decoder are aligned through attention for translation: K and V come from the encoder output and Q comes from the decoder, as inputs to a Multi-Head Attention layer.
  • Positional Encoding is introduced to make use of position information.
  • Masked Multi-Head Attention uses a 0-1 mask to eliminate the influence of the words to the right of the current word.
  • Residual connections and layer normalization (Add & Norm) make deeper networks easier to optimize.

Encoder-decoder structure of the Transformer

2.1 Encoder and Multi-headed Attention

Multi-head attention improves the attention layer in two ways:

  1. It expands the model's ability to focus on different positions.
  2. It gives the attention layer multiple "representation subspaces".

    ① There are multiple sets of query/key/value weight matrices (the Transformer uses 8 attention heads, so each encoder/decoder ends up with 8 sets per attention layer).
    ② Each set is randomly initialized.
    ③ After training, each set is used to project the input embeddings (or the vectors from the lower encoder/decoder) into a different representation subspace.

2.1.1 Multi-head attention in detail

multi-headed attention figure 1

With multi-head attention, we maintain a separate set of Q/K/V weight matrices for each head, resulting in different Q/K/V matrices per head. As before, each Q/K/V matrix is obtained by multiplying X by the corresponding WQ/WK/WV matrix.

The same goes for the Queries, Keys, and Values of every head.

If we perform the same self-attention calculation outlined above, just with eight different sets of weight matrices, we end up with eight different Z matrices.
That is, each attention head generates its own Z matrix.
multi-headed attention图2

But the feed-forward layer does not expect eight matrices; it expects a single matrix (one vector per word). So we need a way to compress the eight matrices into one.

Condensing the matrices:

  1. Concatenate all the attention heads.
  2. Multiply by an additional weight matrix W0.
  3. The result is a Z matrix that captures information from all the attention heads; this is the matrix we feed into the FFNN.
    multi-headed attention figure 3

That is the whole of multi-headed self-attention.
Putting the above process together in one picture:
multi-headed attention figure 4
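
Continuing the numpy sketch from section 1, here is a minimal illustration of multi-head self-attention with 8 heads. The randomly initialized weight sets and `W_0` are placeholders for trained parameters, and `self_attention` is the helper defined in section 1.2.

```python
n_heads, d_model, d_k = 8, 512, 64

# one randomly initialized Q/K/V weight set per head (placeholders for trained weights)
heads = [(np.random.randn(d_model, d_k),
          np.random.randn(d_model, d_k),
          np.random.randn(d_model, d_k)) for _ in range(n_heads)]
W_0 = np.random.randn(n_heads * d_k, d_model)   # the additional output projection

# run scaled dot-product attention separately in each head
Zs = [self_attention(X, W_q, W_k, W_v) for (W_q, W_k, W_v) in heads]

# concatenate the 8 Z matrices and project back to d_model with W_0
Z = np.concatenate(Zs, axis=-1) @ W_0
print(Z.shape)   # (3, 512): one vector per word, ready for the feed-forward layer
```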

If we visualize the multiple heads, we can see which other source words the model's representation of a target word is drawing on.

multi-headed attention figure 5

2.1.2 Positional Encoding: representing the order of the sequence

So far, one thing is missing from the model we have described: a way to account for the order of the words in the input sequence.

The Transformer adds a vector, the positional encoding, to each input embedding. These vectors follow a specific pattern that the model learns, which helps it determine the position of each word, or the distance between different words in the sequence.

The intuition is that adding these values to the embeddings provides meaningful distances between the embedding vectors once they are projected into Q/K/V vectors and compared via dot products in attention.

positional encoding 1

If we assume the embedding dimensionality is 4, the actual positional encodings would look like this:
positional encoding 2

The left half of each positional encoding vector is generated by one function (a sine), and the right half by another function (a cosine); concatenating them gives the encoding vector for each position.
The advantage of this approach is that the encoding can scale to sequence lengths never seen during training, for example if our trained model is asked to translate a sentence longer than any sentence in the training set.

A real example of positional encodings: the matrix is 20 × 512 (20 positions, embedding size 512). You can see that it is split in two down the middle. Each row corresponds to one positional encoding vector.
positional encoding 3
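
Continuing the numpy sketch, here is a minimal implementation of the sinusoidal positional encoding as defined in the paper. Note that the figure above follows the reference blog's variant, which concatenates a sine half and a cosine half rather than interleaving sine and cosine dimensions.

```python
def positional_encoding(max_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)), PE[pos, 2i+1] = cos(pos / 10000^(2i/d_model))."""
    pos = np.arange(max_len)[:, None]            # (max_len, 1)
    i = np.arange(d_model // 2)[None, :]         # (1, d_model/2)
    angles = pos / np.power(10000, 2 * i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                 # even dimensions use sine
    pe[:, 1::2] = np.cos(angles)                 # odd dimensions use cosine
    return pe

pe = positional_encoding(20, 512)                # 20 positions, 512 dimensions, as in the figure above
X_with_pos = X + positional_encoding(n_words, d_model)   # the encodings are added to the embeddings
```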

2.1.3 Residual connections

One detail of the encoder architecture worth mentioning: each sub-layer (self-attention, FFNN) in each encoder has a residual connection around it, followed by layer normalization.
Figure:

Residual connection and layer-norm visualization
Residual connection structure
The same applies to the sub-layers of the decoder. If we picture a Transformer consisting of a stack of two encoders and two decoders, it looks like this:
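
A minimal sketch of the Add & Norm wrapper, continuing the earlier numpy example. The bare-bones `layer_norm` omits the learned gain and bias, and the feed-forward weights are illustrative placeholders; note the sub-layer's output must have the same shape as its input, which is why multi-head attention projects back to d_model with W0.

```python
def layer_norm(x, eps=1e-6):
    """Normalize each row to zero mean and unit variance (learned gain/bias omitted for brevity)."""
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def add_and_norm(x, sublayer):
    """Residual connection followed by layer normalization: LayerNorm(x + Sublayer(x))."""
    return layer_norm(x + sublayer(x))

# e.g. wrapping a position-wise feed-forward sub-layer (placeholder weights)
W_1, W_2 = np.random.randn(d_model, 2048), np.random.randn(2048, d_model)
ffn = lambda x: np.maximum(0, x @ W_1) @ W_2      # ReLU feed-forward network
out = add_and_norm(X_with_pos, ffn)               # shape (3, 512), same as the input
```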

2.2 Decoder section

2.2.1 The decoder in detail

The output of the top encoder is transformed into a set of attention vectors K and V, which are used by the "encoder-decoder attention" layer of each decoder. The animated figure below shows the procedure:
Decoder 1
This process is repeated step by step until a special symbol is produced, indicating that the Transformer decoder has completed its output.
The output of each step is fed to the bottom decoder at the next time step, and the decoders bubble their decoding results upward just like the encoders did. And just as with the encoder inputs, we embed the decoder inputs and add positional encodings to them to indicate the position of each word.

Figure 2: https://jalammar.github.io/images/t/transformer_decoding_2.gif

The self-attention layers in the decoder work slightly differently from those in the encoder:

  1. In the decoder, the self-attention layer is only allowed to attend to earlier positions in the output sequence. This is done by masking future positions (setting them to -inf) before the softmax step of the self-attention calculation (I guess this means the scores against those keys, not all keys); see the sketch after this list.
  2. The "Encoder-Decoder Attention" layer works like multi-headed self-attention, except that it creates its Queries matrix from the layer below it and takes the Keys and Values matrices from the output of the encoder stack.
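
A minimal sketch of the masking in point 1, continuing the numpy example from section 1 (an illustrative implementation, not reference code from the paper): scores for future positions are set to -inf before the softmax, so their attention weights come out as 0.

```python
def masked_self_attention(X, W_q, W_k, W_v):
    """Decoder self-attention: each position may only attend to itself and earlier positions."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    n = scores.shape[0]
    future = np.triu(np.ones((n, n)), k=1).astype(bool)   # True above the diagonal = future positions
    scores = np.where(future, -np.inf, scores)            # -inf becomes a softmax weight of 0
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V
```

For point 2, the only difference is where the inputs come from: Q is computed from the decoder layer below, while K and V are computed from the encoder stack's output.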

The final Linear and Softmax layer

The decoder stack outputs a vector of floating-point numbers. How do we turn that into a word? That is the job of the final linear layer, followed by a Softmax layer.

The linear layer is a simple fully connected neural network that projects the vector produced by the decoder stack into a much larger vector, called the logits vector.

This makes the logits vector as wide as the vocabulary, with each cell holding the score of one unique word. That is how we interpret the output of the linear layer.
The Softmax layer then converts those scores into probabilities (all positive, all adding up to 1.0). The cell with the highest probability is selected, and the word associated with it is produced as the output of this time step.
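
A minimal sketch of this final step, continuing the numpy example; the vocabulary size and the projection weights are illustrative placeholders.

```python
vocab_size = 30000                               # illustrative vocabulary size
W_out = np.random.randn(d_model, vocab_size)     # the final linear layer (placeholder weights)

dec_out = np.random.randn(d_model)               # stand-in for the top decoder's output vector
logits = dec_out @ W_out                         # logits vector: one score per vocabulary word
probs = np.exp(logits - logits.max())
probs = probs / probs.sum()                      # softmax: all positive, summing to 1.0
next_word_id = int(np.argmax(probs))             # pick the cell with the highest probability
```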

Decoder 3

Decoder handwriting
