[Note] Attention summary based on the paper: Neural Machine Translation by Jointly Learning to Align and Translate

0 Attention background summary

encoder-decoder

This background section is based on https://blog.csdn.net/u012968002/article/details/78867203 , which explains attention well.

In the Encoder-Decoder framework, the encoder converts the input sentence (Source) into an intermediate semantic representation C through a non-linear transformation. The decoder's task is to generate the word yi at time i from the sentence representation C and the previously generated history y1, y2, ..., yi-1, i.e. yi = g(C, y1, y2, ..., yi-1). Each yi is generated sequentially in this way, so the goal of the whole system is to generate the Target sentence from the input Source sentence.
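As a rough illustration of this generation loop (my own sketch, not the paper's model) in Python, where `encode` and `g` are hypothetical placeholders for the encoder and for the decoder's next-word function:

    # Minimal sketch of the basic encoder-decoder generation loop.
    # `encode` and `g` are hypothetical placeholders, not functions from the paper.
    def greedy_decode(source_tokens, encode, g, eos="</s>", max_len=50):
        C = encode(source_tokens)      # fixed-length semantic representation of Source
        y = []                         # previously generated target words y1 .. y_{i-1}
        for _ in range(max_len):
            y_i = g(C, y)              # yi = g(C, y1, ..., y_{i-1}); here g returns the most likely word
            y.append(y_i)
            if y_i == eos:             # stop at the end-of-sentence token
                break
        return y

Note that in this basic framework the same C is reused at every step; the attention mechanism discussed below replaces it with a step-dependent context c_i.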

If Source is a Chinese sentence and Target is an English sentence, this is the Encoder-Decoder framework for machine translation;
if Source is an article and Target is a few summarizing sentences, this is the Encoder-Decoder framework for text summarization;
if Source is a question and Target is an answer, this is the Encoder-Decoder framework for question answering or dialogue systems.

In the field of text processing, the Encoder-Decoder framework has a wide range of applications.
It is not limited to text: it is also often used in speech recognition, image processing, and other fields. Generally, speech recognition and text processing use an RNN as the encoder, while image processing commonly uses a CNN as the encoder.

0.1 Additional notes

0.1.1 Attention categories

  • soft-attention

      Soft addressing: unlike ordinary addressing, which retrieves a single item from memory, content may be retrieved from every Key address, and the importance of each retrieved item is determined by the similarity between the Query and that Key.
      The Values are then weighted and summed, which yields the final Value, i.e. the Attention value.
      This paper uses soft-attention. (A small sketch contrasting soft and hard attention follows this list.)
    
  • hard-attention

      (To be filled in after reading more papers; it seems harder, because it cannot be trained with backpropagation.)
    
  • self-attention

      Refers not to attention between Target and Source, but to attention among the elements within Source or within Target; it can also be understood as the attention computation in the special case where Target = Source.
      The computation itself is the same; only the objects it is applied to change.
    

    Self-attention can capture semantic or syntactic relations between words in the same sentence, so introducing it makes long-distance dependencies within a sentence easier to capture.
    Moreover, self-attention directly links any two words of the sentence in a single computation step, so the path between long-range dependent features is greatly shortened, which helps the model use these features effectively. In addition, self-attention directly increases computational parallelism. These are the main reasons self-attention is used ever more widely.

  • co-attention (to be summarized later)

  • the various forms of attention in the Transformer (to be added later)
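A small numpy sketch contrasting soft and hard attention over a set of value vectors (my own illustration, not code from the paper); `query`, `keys` and `values` are plain arrays:

    import numpy as np

    def soft_attention(query, keys, values):
        # soft addressing: every position contributes, weighted by query-key similarity
        scores = keys @ query
        weights = np.exp(scores) / np.exp(scores).sum()   # softmax over all positions
        return weights @ values                           # differentiable weighted sum of the values

    def hard_attention(query, keys, values, rng=None):
        # hard attention: pick a single position by sampling; the sampling step
        # is what prevents training with plain backpropagation
        rng = rng or np.random.default_rng()
        scores = keys @ query
        weights = np.exp(scores) / np.exp(scores).sum()
        idx = rng.choice(len(values), p=weights)
        return values[idx]

The paper uses the soft form, which is why the whole model can be trained end to end with gradient descent.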

0.1.2 Development history of attention

[Figure: history of attention]
Attention in NLP:
In essence, the attention function can be described as mapping a query against a series of key-value pairs, as shown below; the Transformer is a typical query/key/value example.
The figure below reflects the essential idea of the attention mechanism. (This note mainly describes attention for machine translation on the basis of the encoder-decoder framework, but the essential idea of attention can be stripped from the encoder-decoder framework and abstracted further, as the figure makes evident; specific forms of attention will be summarized in a later note.)
[Figure: attention as a query over key-value pairs]
The computation of attention can be divided into three steps (a sketch follows the list):

1. First, compute the similarity between the query and each key to obtain a weight; commonly used similarity functions include the dot product, concatenation, and a perceptron.
2. Second, normalize these weights, usually with a softmax function.
3. Finally, take the weighted sum of the weights and the corresponding values to obtain the final attention output. In current NLP research, key and value are often the same, i.e. key = value.
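The three steps in matrix form, as a minimal numpy sketch using dot-product similarity (one of the options in step 1); Q, K and V stack the query, key and value vectors row by row:

    import numpy as np

    def attention(Q, K, V):
        # Q: (n_queries, d), K: (n_keys, d), V: (n_keys, d_v)
        scores = Q @ K.T                                          # step 1: query-key similarity (dot product)
        weights = np.exp(scores)
        weights = weights / weights.sum(axis=-1, keepdims=True)   # step 2: softmax normalization
        return weights @ V                                        # step 3: weighted sum of the values

Passing the same matrix as K and V recovers the key = value case mentioned above.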

1 Paper overview and contributions

This paper uses the attention mechanism (soft attention) for machine translation and is considered the first application of attention in NLP, so it is very instructive.

Disadvantages of the traditional RNN encoder-decoder:

1. Poor handling of long sentences (vanishing gradients).
2. The word-alignment problem.

The paper breaks through the ceiling of the NMT encoder-decoder model, in which an RNN records all the information of a sentence in a single fixed-length vector C. It learns alignment and translation jointly: rather than simply encoding the input sequence into one vector, it adaptively selects a subset of the encoded vectors while decoding the translation. This allows the model to handle long sentences much better.

In this paper, whenever the model generates a word of the translation, it (soft-)searches the source sentence for the positions where the most relevant information is concentrated. It then predicts the target word based on the context vector associated with those source positions and all the previously generated target words.

Attention focuses on the parts of the source sentence most relevant to the word y being generated (a bidirectional RNN produces hidden states h for the input; these hidden states are stored, and the more relevant ones receive larger weights when the context vector c is formed).
The attention mechanism can also solve the alignment problem (the paper includes an alignment model; traditional statistical machine translation usually has a dedicated phrase-alignment step, and the attention model in fact plays the same role).
Overall, the paper is not difficult and is a good starting point for learning attention mechanisms.

1.1 Key points for defining attention

  • In natural language processing applications, the attention model is generally viewed as an alignment model between a word of the output Target sentence and each word of the input Source sentence. This is the core idea of this paper.
  • When generating states on the target side, all the context vectors are used as input.
  • The core point of attention is that the context used to translate each target word (or, say, to predict the category of a product-title text) is different; this is obviously more reasonable.
  • The attention value is a weighted sum over the elements (Values) of Source.
  • The attention in this paper requires the task itself to have both a source and a target; attention is usually defined as the degree of correlation between target and source. But many tasks do not have both concepts. For example, document classification has only the original text and no target, and sentiment analysis (which can be seen as the simplest kind of document classification) likewise has only the original text. How can attention be applied then? This requires a technique called intra-attention (or self-attention), which, as the name suggests, attends to the original text itself.
  • Source and target:
    (1) classification: the context vector and the source sentence are defined within the model;
    (2) two inputs, e.g. QA: both a source and a target exist;
    (3) self-attention: source and target are the text itself, from which some structural information, such as coreference, can be obtained.

In most papers, the attention vector is a weight vector (typically a softmax output) whose dimension equals the length of the context. A larger weight means the corresponding position of the context is more important.

The above is taken from this article: https://blog.csdn.net/fkyyly/article/details/82492433

1.3 Other

The paper uses a common framework: a bi-RNN plus an alignment model, with a method that jointly learns alignment and translation.
A future area for improvement: how to better represent rare words, or words that do not appear in the training corpus. This would let the attention mechanism improve things further.

2 Learning to align and translate

General Structure:

  • encoder: bi-RNN
  • decoder: emulates searching through a source sentence while decoding a translation

2.1 Decoder: general description

[Figure: hand-drawn sketch of the attention structure]
The structure is described as follows:
[Figure: the decoder structure used in this paper]

c_i changes constantly depending on the word currently being generated. For example:
[Figure: how the context vector C is formed for each target word]
Here, the function f2 is the transformation the encoder applies to an input English word. If the encoder is an RNN (for example, the Google translation system at the end of 2016 used 8-layer LSTMs for both the encoder and the decoder), the result of f2 is usually the hidden-state value after the input x_i at a particular time step. g is the function that combines the intermediate word representations into the intermediate semantic representation of the whole sentence; in common practice, g is a weighted sum of its constituent elements.

T_x is the length of the source sentence.
α_ij is the attention weight assigned to the j-th word of the Source input sentence when the Target outputs the i-th word.
h_j is the semantic encoding of the j-th word of the Source input sentence.
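For reference, the corresponding equations from the paper (the context vector c_i is a weighted sum of the annotations h_j, with weights given by a softmax over the alignment scores e_ij):

    c_i = \sum_{j=1}^{T_x} \alpha_{ij} h_j, \qquad
    \alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{T_x} \exp(e_{ik})}, \qquad
    e_{ij} = a(s_{i-1}, h_j)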

The paper also introduces the alignment model, which I understand as a model that computes the degree of matching between input position j and output position i. From related material, the score function (the score e here is computed by this model) comes in the following forms:

1. dot product (dot)
2. bilinear function (general)
3. concatenation (concat)
4. MLP with a hidden layer (perceptron; the form used in this paper)

Here I attach two blog posts ( https://blog.csdn.net/changdejie/article/details/90782040 and https://www.cnblogs.com/robert-dlut/p/8638283.html ) that cover several ways of computing the similarity function:
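For reference, the commonly listed forms of the score function (the naming used in those posts); this paper uses the concat/MLP form, which it writes as e_ij = v_a^T tanh(W_a s_{i-1} + U_a h_j):

    \text{dot:}\quad     e_{ij} = s_{i-1}^{\top} h_j \qquad
    \text{general:}\quad e_{ij} = s_{i-1}^{\top} W_a h_j \qquad
    \text{concat:}\quad  e_{ij} = v_a^{\top} \tanh\!\big( W_a [s_{i-1}; h_j] \big)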

[Figure 3: attention score]
From the figure: what do query, key, and value correspond to in this setting? What does the query represent here, and what do the key and the value represent? See the attention summary in part two.

2.1.2 Computing the attention weight probability distribution

[Figure: computing the attention probability distribution]
For a decoder that uses an RNN: at time i, if we want to generate the word y_i, we already know the target words y_1 ... y_{i-1} generated before time i, as well as the hidden-layer value H_{i-1} output at time i-1 (s_{i-1} in the paper).

Our aim is to compute, when generating y_i, the attention probability distribution over the input-sentence words "Tom", "Chase", "Jerry". To do this we can compare the target hidden state H_{i-1} (s_{i-1}) at time i-1 with each h_j one by one (each h_j having been stored beforehand), i.e. apply a function F(h_j, H_{i-1}) to the RNN hidden state of each word of the Source sentence; in the paper this is e_ij = a(s_{i-1}, h_j). This gives how likely the target word y_i is to be aligned with each input word.
The function F can take different forms in different papers; the outputs of F are then normalized with a softmax to obtain an attention probability distribution whose values sum to one.
That is, for each y_i we obtain a probability distribution over its similarity to every source word (since it is a probability distribution, it must be normalized).
There are many such distributions: for every generated word of the target sentence there is a probability distribution over the words of the input sentence, which can be understood as the probability that the generated target word aligns with each input word.

The alignment model is parameterized as a feedforward neural network that directly computes a soft alignment, so the gradient of the cost function can be backpropagated through it; this gradient is used to train the alignment model jointly with the entire translation model.
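A minimal PyTorch sketch of such a feedforward alignment model, using the additive form e_ij = v_a^T tanh(W_a s_{i-1} + U_a h_j) from the paper; layer sizes and names are illustrative. Because every operation is differentiable, the gradient flows through it during joint training:

    import torch
    import torch.nn as nn

    class AlignmentModel(nn.Module):
        def __init__(self, dec_dim, enc_dim, attn_dim):
            super().__init__()
            self.W_a = nn.Linear(dec_dim, attn_dim, bias=False)   # projects the decoder state s_{i-1}
            self.U_a = nn.Linear(enc_dim, attn_dim, bias=False)   # projects each annotation h_j
            self.v_a = nn.Linear(attn_dim, 1, bias=False)

        def forward(self, s_prev, annotations):
            # s_prev: (batch, dec_dim); annotations: (batch, T_x, enc_dim)
            e = self.v_a(torch.tanh(self.W_a(s_prev).unsqueeze(1) + self.U_a(annotations)))
            alpha = torch.softmax(e.squeeze(-1), dim=-1)          # (batch, T_x) attention weights
            context = torch.bmm(alpha.unsqueeze(1), annotations).squeeze(1)  # c_i = sum_j alpha_ij * h_j
            return context, alpha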

2.2 Encoder: annotating the sequence with a bi-RNN

A bidirectional RNN is used, and the forward and backward hidden states are concatenated to obtain h_j. Each h_j focuses on the information around x_j, so it summarizes not only the preceding words but also the following ones.
This sequence of annotations is used by the decoder and the alignment model to compute the context vector.
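A minimal PyTorch sketch of such an encoder (the paper uses gated recurrent units; a GRU stands in here, and all sizes and names are illustrative). Each annotation h_j is the concatenation of the forward and backward states at position j:

    import torch
    import torch.nn as nn

    class BiRNNEncoder(nn.Module):
        def __init__(self, vocab_size, emb_dim, hidden_dim):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, emb_dim)
            # bidirectional GRU: one pass forward and one pass backward over the source sentence
            self.rnn = nn.GRU(emb_dim, hidden_dim, batch_first=True, bidirectional=True)

        def forward(self, src_ids):
            # src_ids: (batch, T_x) word indices of the source sentence
            emb = self.embed(src_ids)
            annotations, _ = self.rnn(emb)   # (batch, T_x, 2*hidden_dim): h_j = [forward_j; backward_j]
            return annotations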

2.3 Structure Selection

The above is the general structure; components such as the RNN model, its activation function f, and the alignment model a can be chosen freely. The paper makes the following concrete choices:

2.3.1 RNN

RNN: gated hidden unit (reset gate + update gate), with a logistic sigmoid activation function

At each decoder step, a single layer of maxout units followed by normalization (a softmax) is used to compute the output probabilities.
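A minimal PyTorch sketch of a single-maxout-hidden-layer output followed by normalization (sizes and names are illustrative, not the paper's exact configuration):

    import torch
    import torch.nn as nn

    class MaxoutOutput(nn.Module):
        def __init__(self, in_dim, maxout_dim, vocab_size, pool_size=2):
            super().__init__()
            self.pool_size = pool_size
            self.lin = nn.Linear(in_dim, maxout_dim * pool_size)  # linear pieces feeding the maxout units
            self.out = nn.Linear(maxout_dim, vocab_size)

        def forward(self, x):
            # x: (batch, in_dim), e.g. a combination of decoder state, context vector and previous word
            pieces = self.lin(x).view(x.size(0), -1, self.pool_size)
            maxout = pieces.max(dim=-1).values                    # maxout: keep the largest piece per unit
            return torch.log_softmax(self.out(maxout), dim=-1)    # normalized (log-)probabilities over the vocabulary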

2.3.2 Alignment Model

The alignment model has to be evaluated Tx × Ty times, where Tx and Ty are the lengths of the source and target sentences. To keep this computation cheap, a multilayer perceptron (MLP) is used, and its weights are learned.

3 Experiments

  • Dataset: concatenate news-test-2012 and news-test-2013 to make a development (validation) set, and evaluate the models on the test set (news-test-2014) from WMT '14, which consists of 3003 sentences not present in the training data.
    A shortlist of the 30,000 most frequent words in each language is used to train the models; any word not included in the shortlist is mapped to a special token ([UNK]).

  • Model: the RNNsearch-30 and RNNsearch-50 models are trained; their setup:

    • The forward and backward RNNs of RNNsearch have 1000 hidden units each; a multilayer network with a single maxout hidden layer computes the conditional probability of each target word.
    • SGD + Adadelta is used to train the model, with minibatches of 80 sentences, for about five days.
    • After training, beam search is used to find the translation with (approximately) the highest conditional probability (see the sketch below).
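A minimal sketch of beam search over a next-word log-probability function; `log_probs` is a hypothetical stand-in for the trained model, and no length normalization is applied:

    def beam_search(log_probs, beam_size=5, max_len=50, eos="</s>"):
        """log_probs(prefix) -> dict mapping each candidate next word to its log-probability."""
        beams = [([], 0.0)]                     # (partial translation, cumulative log-probability)
        finished = []
        for _ in range(max_len):
            candidates = []
            for prefix, score in beams:
                for word, lp in log_probs(prefix).items():
                    hyp = (prefix + [word], score + lp)
                    (finished if word == eos else candidates).append(hyp)
            if not candidates:
                break
            # keep only the beam_size highest-scoring partial translations
            beams = sorted(candidates, key=lambda b: b[1], reverse=True)[:beam_size]
        best = max(finished + beams, key=lambda b: b[1])
        return best[0]

Real decoders usually also normalize the score by the translation length so that longer hypotheses are not unfairly penalized.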

4 Results

4.1 Results figures

Evaluation: BLEU score
[Figure: BLEU scores of the attention models]
Alignments are examined by visualizing the annotation weights; from these we can see which positions in the source sentence were considered more important when generating each target word.
RNNsearch-30 and RNNsearch-50 are more robust to sentence length. RNNsearch-50, especially, shows no performance deterioration even with sentences of length 50 or more.

4.2 Analysis of results

4.2.1 Alignment

Compared to hard alignment, soft alignment is more useful for translation: for example, when translating [the man] into [l'homme], soft alignment solves the issue naturally by letting the model look at both [the] and [man].
Soft-Alignment benefits:

1. It can attend to several parts of the source and thus find the correct translation.
2. It handles sentences of different lengths.

5 Summary

The paper extends the basic encoder-decoder by letting the model (soft-)search over a set of input words, or their annotations computed by an encoder, when generating each target word.
This lets the model focus only on the information relevant to generating the next target word, and the model can correctly align each target word with the relevant source words or their annotations.

Future challenges:

  • Better handling of unknown or rare words

    This is needed for the model to be used more widely and to match the performance of current state-of-the-art machine translation systems in all contexts.

Origin blog.csdn.net/changreal/article/details/101774872