[Notes] Paper summary II on Attention: the essential idea of Attention + the Hard / Soft / Global / Local forms of Attention

Attention summary II:

Papers covered:

  1. Show, Attend and Tell: Neural Image Caption Generation with Visual Attention (uses hard/soft attention)
  2. Effective Approaches to Attention-based Neural Machine Translation (proposes global/local attention)

References for this article:

Attention (II)
Understanding the Attention model and its applications
A summary of attention model approaches
Attention mechanism reading notes: global attention and local attention
Global Attention / Local Attention

Summary of this article:

  1. The essential idea of the attention mechanism
  2. A summary of each kind of attention mechanism (hard / soft / global / local attention)
  3. Other attention-related topics

1 The essential idea of the attention mechanism

For the essential idea, see the referenced articles above, which also cover self-attention.
In short, attention is a (query, key, value) mechanism; in machine translation the key and the value are the same (both are the encoder hidden states).
PS: for the basic idea of applying the attention mechanism to NMT, see the previous paper summary: Attention summary I.
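As a minimal sketch of this (query, key, value) view, assuming plain dot-product scoring (the function and variable names below are illustrative, not from either paper):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention(query, keys, values):
    """Generic attention: score the query against each key,
    normalize the scores with softmax, and return the weighted
    sum of the values."""
    scores = keys @ query               # (n,) dot-product scores
    weights = softmax(scores)           # (n,) attention distribution
    return weights @ values, weights    # context vector, weights

# In classic NMT attention the keys and values are the same:
# both are the encoder hidden states.
enc_states = np.random.randn(6, 8)      # 6 source positions, dim 8
dec_state = np.random.randn(8)          # current decoder hidden state
context, weights = attention(dec_state, enc_states, enc_states)
```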

2 The varieties of attention

The kinds of attention covered:

  1. hard attention
  2. soft attention
  3. global attention
  4. local attention
  5. self-attention: target = source -> multi-head attention (covered in a separate Attention summary)

2.1 Hard / Soft Attention paper

Paper: Show, Attend and Tell: Neural Image Caption Generation with Visual Attention
Figure: hard attention structure
Notes source: A summary of attention model approaches

Soft attention keeps all components and weights them; hard attention selects a subset of components according to some strategy, i.e. it attends to only part of the input.
Soft attention can be trained with ordinary backpropagation.

Characteristics of hard attention:
the hard attention model is non-differentiable and requires more complicated techniques such as variance reduction or reinforcement learning to train

Specifics:

The encoder is a CNN (VGG net) that extracts L D-dimensional vectors a_i, i = 1, 2, ..., L, from the image; each vector represents part of the image.
The decoder is an LSTM; its input at each timestep t consists of three parts: z_t, h_{t-1}, y_{t-1}, where z_t is obtained from the a_i and the weights α_{ti}.
α_{ti} is computed by the attention model f_att, which here is a multilayer perceptron:
e_{ti} = f_att(a_i, h_{t-1}),   α_{ti} = exp(e_{ti}) / Σ_k exp(e_{tk})
from which z_t can then be computed.
There are two ways to obtain z_t from the attention model f_att: stochastic attention and deterministic attention.

2.1.2 Stochastic “Hard” Attention

Let s_t denote the position attended to by the decoder at time t; s_{ti} indicates whether attention focuses on position i at time t. The vector [s_{t1}, s_{t2}, ..., s_{tL}], i = 1, 2, ..., L, is one-hot: attention focuses on exactly one position at each time, which is what makes this attention "hard".
The model generates the sequence y = (y_1, ..., y_C) from a = (a_1, a_2, ..., a_L), where s = {s_1, s_2, ..., s_C} is the sequence of focus positions over time; in principle there are L^C possible such sequences.

PS: the usual deep-learning recipe: first work out the objective function, then work out the gradient of the objective with respect to the parameters.

Because s is not observed explicitly, the objective (maximize log p(y | a)) is transformed using the well-known Jensen inequality to obtain a lower bound on the objective.

This lower bound then replaces log p(y | a) as the objective for computing gradients with respect to the model parameters W, and s is handled by Monte Carlo sampling.
The details involve reinforcement learning.
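A rough sketch of the stochastic "hard" attention step under the description above: sample the focus position from the attention distribution instead of averaging. The REINFORCE-style gradient estimate is only hinted at in a comment, and every name here is illustrative, not the paper's code:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)

def hard_attention_step(a, h_prev, f_att, n_samples=5):
    """Sample the focus position s_t ~ Multinomial(alpha_t) instead of
    taking a weighted average.  alpha_t comes from the attention model f_att.
    Returns Monte Carlo samples of the context vector z_t."""
    scores = np.array([f_att(a_i, h_prev) for a_i in a])   # e_ti
    alpha = softmax(scores)                                 # alpha_ti
    z_samples = []
    for _ in range(n_samples):
        s = rng.choice(len(a), p=alpha)   # one-hot focus position s_t
        z_samples.append(a[s])            # z_t is a single a_i, not a mixture
    # Training would use a REINFORCE-style estimator: weight
    # d log alpha[s] / d params by the resulting log-likelihood
    # (with variance-reduction baselines), as the paper describes.
    return np.mean(z_samples, axis=0), alpha

# toy usage with a random linear scorer standing in for the attention MLP
L, D, H = 4, 8, 8
a = rng.standard_normal((L, D))
h_prev = rng.standard_normal(H)
W = rng.standard_normal(D + H)
f_att = lambda a_i, h: W @ np.concatenate([a_i, h])
z_t, alpha = hard_attention_step(a, h_prev, f_att)
```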

2.1.3 Deterministic “Soft” Attention

Under deterministic attention the whole model is smooth and differentiable: the objective function is differentiable with respect to the attention weights α_{ti}, simply because the objective is differentiable with respect to z_t and z_t is differentiable with respect to α_{ti}, so the chain rule applies. Learning end-to-end is therefore trivial by using standard backpropagation.

In hard attention, the vector [s_{t1}, ..., s_{tL}] at each time t has exactly one element equal to 1 and the rest equal to 0, i.e. a single focus position per time step. Soft attention instead attends to all positions, only with different weights at different positions; z_t is the weighted sum of the a_i:
z_t = Σ_i α_{ti} a_i

In addition, a gating scalar computed from the previous LSTM state h_{t-1} is used to adjust the weight of the context vector relative to h_{t-1} and y_{t-1}.
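A minimal numeric sketch of this deterministic "soft" attention computation, assuming an MLP score of the form v^T tanh(W_a a_i + W_h h_{t-1}) and standing in for the gating scalar with a crude sigmoid (all names are illustrative):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def soft_attention(a, h_prev, W_a, W_h, v):
    """alpha_ti = softmax(e_ti), e_ti = v^T tanh(W_a a_i + W_h h_{t-1});
    z_t is the weighted sum of the annotation vectors a_i."""
    e = np.array([v @ np.tanh(W_a @ a_i + W_h @ h_prev) for a_i in a])
    alpha = softmax(e)
    z_t = alpha @ a                               # weighted sum over all positions
    beta = 1.0 / (1.0 + np.exp(-h_prev.mean()))   # crude stand-in for the gate
    return beta * z_t, alpha

L, D, H, K = 4, 8, 8, 16
rng = np.random.default_rng(1)
a = rng.standard_normal((L, D))
h_prev = rng.standard_normal(H)
z_t, alpha = soft_attention(a, h_prev,
                            rng.standard_normal((K, D)),
                            rng.standard_normal((K, H)),
                            rng.standard_normal(K))
```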

2.1.4 training process

Both attention models are trained with SGD (stochastic gradient descent).


2.2 Global / Local Attention paper

Paper: Effective Approaches to Attention-based Neural Machine Translation

Notes referenced:

  1. Attention mechanism reading notes: global attention and local attention
  2. Global Attention / Local Attention

The paper's pipeline for computing the context vector:

h_t -> a_t -> c_t -> h~_t

Global Attention

Figure: global attention

Global attention considers all of the encoder's hidden states when computing the context vector c_t.

Global attention is similar to the attention in summary I, but simpler. For the difference between the two, refer to the referenced article and the figure there.

Denote the decoder's target hidden state at time t by h_t and the encoder's hidden states by h̄_s, s = 1, 2, ..., n. The vector later derived from them is called the attentional hidden state h~_t.

For the h̄_s, the weights a_t(s) form a variable-length alignment vector whose length equals the number of encoder time steps. It is obtained by comparing the current decoder hidden state h_t with every encoder hidden state h̄_s:

a_t(s) = align(h_t, h̄_s) = exp(score(h_t, h̄_s)) / Σ_{s'} exp(score(h_t, h̄_{s'}))

a_t(s) is thus obtained by comparing a decoder state with an encoder state.
score is a content-based function; the paper gives three ways of computing it (called alignment functions):

  • dot: score(h_t, h̄_s) = h_t^T h̄_s
  • general: score(h_t, h̄_s) = h_t^T W_a h̄_s
  • concat: score(h_t, h̄_s) = v_a^T tanh(W_a [h_t; h̄_s])

Of these, dot works better with global attention, and general works better with local attention.

There is also a location-based variant that uses only h_t: the scores for all positions are produced at once from h_t through a weight matrix W_a, i.e. a_t = softmax(W_a h_t).

Taking the weighted average determined by a_t (a weighted sum of the h̄_s) gives the context vector c_t, which is used in the subsequent steps.

Figure: the global attention process
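A small sketch of global attention with the dot and general alignment functions listed above, ending with the attentional vector h~_t (shapes and names are illustrative):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def score(h_t, h_bar, mode="dot", W_a=None):
    """Content-based alignment functions:
    dot:     h_t . h_bar_s
    general: h_t . (W_a h_bar_s)"""
    if mode == "dot":
        return h_bar @ h_t
    if mode == "general":
        return (h_bar @ W_a.T) @ h_t
    raise ValueError(mode)

def global_attention(h_t, h_bar, mode="dot", W_a=None, W_c=None):
    a_t = softmax(score(h_t, h_bar, mode, W_a))           # alignment weights over source
    c_t = a_t @ h_bar                                     # context vector
    h_tilde = np.tanh(W_c @ np.concatenate([c_t, h_t]))   # attentional vector
    return h_tilde, a_t

n, d = 7, 6
rng = np.random.default_rng(2)
h_bar = rng.standard_normal((n, d))   # encoder hidden states
h_t = rng.standard_normal(d)          # decoder hidden state at time t
h_tilde, a_t = global_attention(h_t, h_bar, mode="general",
                                W_a=rng.standard_normal((d, d)),
                                W_c=rng.standard_normal((d, 2 * d)))
```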

Local Attention

When computing global attention, every decoder state must attend to all encoder states, which makes the computation expensive.
Local attention can be seen as a mixture of hard and soft attention that keeps the advantages of both: its computational cost is lower than global (soft) attention, and unlike hard attention it is differentiable almost everywhere, which makes it easy to train.
Figure: local attention

The local attention mechanism selectively focuses on a small window of context (at each time step it attends to only a small portion of the source positions), which reduces the computational cost.

In this model, for each target word at time t the model first generates an aligned position p_t.
The context vector c_t is then computed from the set of encoder hidden states inside the window [p_t - D, p_t + D]; the size D is chosen empirically.

What differs between these models is how c_t is formed, as summarized under Global vs Local below.

Returning to local attention: p_t is a source position index, which can be understood as the focus of attention predicted by the model. There are two schemes for computing p_t:

  • Monotonic alignment (local-m)

    Set p_t = t, assuming that the source and target sequences are roughly monotonically aligned; the alignment vector a_t is then defined exactly as in global attention: a_t(s) = align(h_t, h̄_s).

  • Predictive alignment (local-p)

    The model predicts the alignment position instead of assuming that the source and target sequences are monotonically aligned:

    p_t = S · sigmoid(v_p^T tanh(W_p h_t))

    W_p and v_p are model parameters trained to predict the position, and S is the source sentence length, so p_t ∈ [0, S].
    To favor alignment points near p_t, a Gaussian distribution centered at p_t is placed over the positions, so the alignment weights a_t(s) become:

    a_t(s) = align(h_t, h̄_s) · exp(-(s - p_t)^2 / (2σ^2)),   with σ = D/2

    Here align is the same alignment function as in global attention. It can be seen that the farther a position s is from the center p_t, the more the weight of its corresponding source hidden state is suppressed.

After obtaining c_t, the attentional vector h~_t is computed by a concatenation layer that combines the context vector c_t with h_t:
h~_t = tanh(W_c [c_t ; h_t])
h~_t is the attentional vector; the predicted word distribution is then produced by:
p(y_t | y_<t, x) = softmax(W_s h~_t)

Figure: the local attention process
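A sketch of the local-p variant just described: predict p_t, restrict attention to the window [p_t - D, p_t + D], and damp the alignment weights with the Gaussian centered at p_t. Parameter names follow the formulas above; everything else (dot scoring, toy dimensions) is an illustrative assumption:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def local_p_attention(h_t, h_bar, W_p, v_p, D=2):
    S = len(h_bar)
    # predicted alignment position: p_t = S * sigmoid(v_p^T tanh(W_p h_t))
    p_t = S * sigmoid(v_p @ np.tanh(W_p @ h_t))
    # window of source positions around p_t
    lo, hi = max(0, int(p_t) - D), min(S, int(p_t) + D + 1)
    window = h_bar[lo:hi]
    s = np.arange(lo, hi)
    # dot-score alignment inside the window, damped by the Gaussian
    a_t = softmax(window @ h_t)
    sigma = D / 2.0
    a_t = a_t * np.exp(-((s - p_t) ** 2) / (2 * sigma ** 2))
    c_t = a_t @ window
    return c_t, a_t, p_t

S, d = 10, 6
rng = np.random.default_rng(3)
h_bar = rng.standard_normal((S, d))   # encoder hidden states
h_t = rng.standard_normal(d)          # decoder hidden state at time t
c_t, a_t, p_t = local_p_attention(h_t, h_bar,
                                  rng.standard_normal((d, d)),
                                  rng.standard_normal(d))
```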

2.2.1 Global vs Local Attention

The global / local distinction is therefore:

  • In the former, the alignment vector a_t has variable size, depending on the length of the encoder input sequence;
  • In the latter, a_t has fixed size: a_t ∈ R^{2D+1}.

Global attention and local attention each have advantages and disadvantages; in practice global attention is used a bit more, because:

  • when the encoder input is not long, local attention does not actually reduce the amount of computation;
  • if the predicted position p_t is not accurate, it directly hurts the accuracy of local attention.

2.2.2 Input-feeding Approach

Input-feeding approach: the attentional vectors h~_t are fed as inputs to the next time steps to inform the model about past alignment decisions. The effect of this is twofold:

  1. it makes the model fully aware of previous alignment choices;
  2. it creates a very deep network spanning both horizontally and vertically (see the sketch below).

Figure: input-feeding approach
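A minimal sketch of the input-feeding idea: the previous attentional vector h~_{t-1} is concatenated with the current input embedding before the decoder step. The single tanh layer is a crude stand-in for the paper's stacked LSTM, and all names are illustrative:

```python
import numpy as np

def decoder_step_with_input_feeding(x_t, h_tilde_prev, h_prev, W_x, W_h):
    """Input-feeding: the previous attentional vector h~_{t-1} is fed
    together with the current word embedding x_t, so the decoder is
    aware of past alignment decisions."""
    rnn_input = np.concatenate([x_t, h_tilde_prev])
    h_t = np.tanh(W_x @ rnn_input + W_h @ h_prev)  # stand-in for an LSTM cell
    # h_t then goes through attention to produce c_t and h~_t,
    # and h~_t is fed back in at the next step.
    return h_t

d_emb, d_hid = 8, 16
rng = np.random.default_rng(4)
h_t = decoder_step_with_input_feeding(
    rng.standard_normal(d_emb),           # x_t: current word embedding
    rng.standard_normal(d_hid),           # h~_{t-1}: previous attentional vector
    rng.standard_normal(d_hid),           # h_{t-1}: previous decoder state
    rng.standard_normal((d_hid, d_emb + d_hid)),
    rng.standard_normal((d_hid, d_hid)))
```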

2.2.3 Technical points used in this paper

  • global / local attention
  • input-feeding approach
  • better alignment functions

2.2.4 Implementation tips from the paper

Concepts and techniques involved in the implementation:

  • Progressive layering of techniques: start from the base model, then add + reverse, + dropout, + global attention, + feed input, + unk replace, and observe how much each addition improves the score. ("reverse" means reversing the source sentence.)
  • Known techniques such as source reversing, dropout, and unknown-word replacement.
  • Ensembling several settings, e.g. 8 different models using different attention methods, with and without dropout.
  • Vocabulary size, e.g. keeping the top 50K words of each language; unknown words are replaced with <unk>; sentences are padded.
  • Parameter initialization, e.g. LSTM parameters initialized uniformly in [-0.1, 0.1]; the gradient is rescaled whenever its norm exceeds 5 (see the sketch below).
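A tiny sketch of that gradient-rescaling rule (global-norm clipping at 5), written in plain numpy as an illustration, not the paper's code:

```python
import numpy as np

def rescale_gradients(grads, max_norm=5.0):
    """Rescale the whole gradient whenever its global norm exceeds max_norm."""
    total_norm = np.sqrt(sum(float((g ** 2).sum()) for g in grads))
    if total_norm > max_norm:
        scale = max_norm / total_norm
        grads = [g * scale for g in grads]
    return grads

grads = [np.random.randn(100, 100), np.random.randn(100)]
clipped = rescale_gradients(grads, max_norm=5.0)
```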

Training method: SGD.
Designed hyperparameters:
the number of LSTM layers and cells per layer (e.g. 1000 cells), the dimensionality of the word embeddings, the number of epochs, the mini-batch size (e.g. 128),
a learning-rate schedule, e.g. start at 1 and halve it every subsequent epoch after 5 epochs; dropout, e.g. 0.2;
with dropout, train for 12 epochs and start halving the learning rate after 8 epochs.

Experimental analysis:

  • learning curves (how the loss decreases)
  • performance on long sentences
  • attentional architectures
  • alignment quality

3 Other

3.1 Attention design

  • location-based attention

    Location-based means that the attention here has no additional object to attend over; the attention is computed from h_i itself:
    s_i = f(h_i) = activation(W^T h_i + b)

  • general attention (not common)

  • concatenation-based attention

    Concatenation-based means that the attention here additionally attends to another object h_t.
    f is designed to measure the correlation between h_i and h_t (both designs are sketched in code below):
    s_i = f(h_i, h_t) = v^T activation(W_1 h_i + W_2 h_t + b)
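A small sketch of the two scoring designs above, taking the activation to be tanh (all shapes and names are illustrative):

```python
import numpy as np

def location_based_score(h_i, W, b):
    """s_i = activation(W^T h_i + b): the attention score depends on
    h_i alone, with no extra object to attend over."""
    return np.tanh(W @ h_i + b)

def concatenation_based_score(h_i, h_t, W1, W2, b, v):
    """s_i = v^T activation(W1 h_i + W2 h_t + b): measures the
    correlation between h_i and another vector h_t."""
    return v @ np.tanh(W1 @ h_i + W2 @ h_t + b)

d, k = 8, 16
rng = np.random.default_rng(5)
h_i, h_t = rng.standard_normal(d), rng.standard_normal(d)
s_loc = location_based_score(h_i, rng.standard_normal(d), rng.standard_normal())
s_cat = concatenation_based_score(h_i, h_t,
                                  rng.standard_normal((k, d)),
                                  rng.standard_normal((k, d)),
                                  rng.standard_normal(k),
                                  rng.standard_normal(k))
```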

3.2 Attention extensions

Consider a document made up of k2 sentences, where each sentence is made up of k1 words (k1 may differ from sentence to sentence).

The first layer: word-level attention.
Each sentence has k1 words with corresponding word vectors w_i; using the approach described in section 2, we obtain a representation vector for each sentence, denoted st_i.
The second layer: sentence-level attention.
From the first layer we get k2 vectors st_i; applying the same approach again yields a representation vector d_i for each document. Of course, we also obtain the weight α_i corresponding to each st_i; with these in hand, we can analyze the specific task.
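A rough sketch of this two-level idea: word-level attention pools each sentence into a vector, then sentence-level attention pools those into a document vector. It uses the simple location-based scoring from section 3.1; every name and dimension is illustrative:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attend(vectors, w):
    """Location-based attention pooling: score each vector, softmax,
    and return the weighted sum as the pooled representation."""
    scores = np.tanh(vectors @ w)
    alpha = softmax(scores)
    return alpha @ vectors, alpha

rng = np.random.default_rng(6)
k2, k1, d = 3, 5, 8                      # sentences per doc, words per sentence, dim
doc = rng.standard_normal((k2, k1, d))   # word vectors w_i for each sentence

w_word = rng.standard_normal(d)
w_sent = rng.standard_normal(d)

# word-level attention: one vector st_i per sentence
sent_vecs = np.stack([attend(sentence, w_word)[0] for sentence in doc])
# sentence-level attention: one vector d_i per document
doc_vec, sent_weights = attend(sent_vecs, w_sent)
```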
