[Notes] Paper: AS Reader vs. Stanford Attentive Reader

Attention Sum Reader Network

Datasets

 

CNN & Daily Mail

Each article serves as one document; a sentence from the article's summary, with one entity-class word removed, serves as the question; the removed entity word is the answer; and all entity-class words occurring in the document form the candidate answers. Within each sample, named entities are anonymized, i.e. replaced with markers such as "@entity1", and the markers are assigned at random.

Children's Book Test (CBT)

From each children's story, 20 consecutive sentences are extracted as the document; the 21st sentence, with one entity-class word removed, serves as the question; and the removed word is the answer.
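As a rough sketch of this construction (function and placeholder names are illustrative, not the paper's actual preprocessing code), building one CBT-style sample could look like:

```python
# Hypothetical sketch of building a CBT-style (document, question, answer)
# sample from a list of story sentences; names are illustrative.

def make_cbt_sample(sentences, blank_word):
    """Use sentences[0:20] as the document, sentences[20] as the question
    with `blank_word` replaced by a blank marker, and `blank_word` as answer."""
    assert len(sentences) >= 21, "CBT needs 21 consecutive sentences"
    document = sentences[:20]
    question = sentences[20].replace(blank_word, "XXXXX")  # blank marker
    return {"document": document, "question": question, "answer": blank_word}
```

The candidate answers would then be drawn from the words of the 20-sentence document, mirroring the CNN & Daily Mail setup.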

 

Model Introduction

AS Reader is very similar to the Attentive Reader: both are one-dimensional matching models (as is the Stanford Attentive Reader). The main difference lies in the final answer-selection step, where AS Reader applies a Pointer Sum Attention mechanism. The model structure is shown below:

 

Model Details

 

The probability s_i that the answer to question q appears at position i of document d is computed by a softmax over the dot products of the question embedding g(q) with each contextual document embedding f_i(d), i.e. s_i ∝ exp(f_i(d) · g(q)).
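A minimal numpy sketch of this scoring plus the pointer-sum aggregation (the contextual embeddings f_i(d) and g(q) are assumed as given vectors here, standing in for the paper's GRU encoders):

```python
import numpy as np

def as_reader_answer(doc_vecs, q_vec, doc_words):
    """doc_vecs: (n, h) contextual embeddings f_i(d); q_vec: (h,) question
    embedding g(q); doc_words: the n document tokens.
    Returns the predicted answer and P(w | q, d) for every word w."""
    scores = doc_vecs @ q_vec                  # dot-product attention scores
    scores -= scores.max()                     # for numerical stability
    s = np.exp(scores) / np.exp(scores).sum()  # softmax over positions: s_i
    probs = {}
    for word, s_i in zip(doc_words, s):        # pointer sum over occurrences
        probs[word] = probs.get(word, 0.0) + float(s_i)
    return max(probs, key=probs.get), probs
```

Note that a word appearing twice, like "a" in `["a", "b", "a", "c"]`, gets the sum of its two position probabilities, which is exactly the frequency bias discussed below.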

Compared with the Attentive Reader:

  • The attention layer uses dot-product attention, so AS Reader has fewer parameters than the Attentive Reader (no separate attention weight matrix).
  • In a one-dimensional matching model, the attention score after matching can be read directly as the probability that each word of document d answers the question in context. AS Reader softmax-normalizes the per-position scores, accumulates the scores of all occurrences of the same word, and selects the word with the highest accumulated score as the answer (this is the Pointer Sum Attention the authors describe).

  • The attention computation and overall model structure are significantly simpler than the Attentive Reader's, yet achieve better results.

Pointer Sum Attention also implies that the more frequently a word occurs in the document, the more likely it is to be the answer (because it accumulates more attention score). The experimental data show this assumption is reasonable; after all, it is consistent with how answers are distributed in most reading-comprehension data.

Experimental setup

  • Optimizer: Adam
  • Learning rate: 0.001, 0.0005
  • Loss function: -log P(a | q, d)
  • Embedding layer weight matrix initialization range: [-0.1, 0.1]
  • GRU network weight initialization: random orthogonal matrices
  • GRU network bias initialization: 0
  • Batch size: 32
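A small numpy sketch of the random orthogonal initialization listed above for the GRU weights (the paper does not spell out the exact procedure; QR decomposition of a Gaussian matrix is one common way to do it):

```python
import numpy as np

def random_orthogonal(n, seed=0):
    """Random n x n orthogonal matrix, a common choice for initializing
    recurrent (GRU) weight matrices so that repeated multiplication
    neither explodes nor vanishes at the start of training."""
    rng = np.random.default_rng(seed)
    a = rng.standard_normal((n, n))
    q, r = np.linalg.qr(a)
    # Fix column signs so the result is uniform over orthogonal matrices
    return q * np.sign(np.diag(r))
```

The embedding matrix would simply be drawn uniformly from [-0.1, 0.1] and the GRU biases set to zero, per the list above.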

Experimental results

The following figure shows the comparison of model results.

 

Other notes

The pointer sum attention here uses attention as a pointer over discrete tokens in the context document and then directly sums a word's attention across all of its occurrences.

That is, the softmax results are accumulated over all positions where a candidate answer term appears in the document.

This use of attention differs from seq2seq attention (which blends words from the context into an answer representation); the attention here was inspired by Pointer Networks (Ptr-Nets).

Attentive and Impatient Readers

The paper compares the differences from the Attentive Reader;

It mentions Chen et al.

It refers to Memory Networks (MemNNs).

 

Stanford Attentive Reader

Paper:

A Thorough Examination of the CNN/Daily Mail Reading Comprehension Task

Reference 1: https://www.imooc.com/article/28801

Reference 2: https://www.cnblogs.com/sandwichnlp/p/11811396.html#model-2-attentive-sum-reader

Source: https://github.com/danqi/rc-cnn-dailymail

Source resolved: http://www.imooc.com/article/29397

Results: outperforms both AS Reader and the Attentive Reader.

Model Introduction

  1. A deep-learning neural network model for MRC
  2. A boosted decision forest model for MRC

Dataset: CNN & Daily Mail

Machine reading comprehension based on a boosted decision forest model

Feature engineering builds a feature vector f_{p,q}(e) for each candidate entity-class word e. The features include: whether the entity occurs, its position of occurrence, word frequency, n-gram match features, word-distance features, dependency-parse features, sentence co-occurrence features, etc.
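A toy illustration of building such a feature vector (hypothetical feature subset and names; the paper's actual feature set is much richer):

```python
def entity_features(entity, passage_tokens, question_tokens):
    """Toy feature vector for a candidate entity: occurrence flag, frequency,
    first position, question occurrence, and a crude unigram-neighborhood
    match -- a small subset of the hand-crafted features described above."""
    freq = passage_tokens.count(entity)
    first_pos = passage_tokens.index(entity) if freq else -1
    in_question = int(entity in question_tokens)
    # unigram match: do question words appear adjacent to the entity?
    overlap = 0
    for i, tok in enumerate(passage_tokens):
        if tok == entity:
            neighbors = passage_tokens[max(0, i - 1):i + 2]
            overlap += sum(1 for w in neighbors
                           if w != entity and w in question_tokens)
    return [int(freq > 0), freq, first_pos, in_question, overlap]
```

Each candidate entity's vector would then be fed to the ranker described next.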

Machine reading comprehension is treated as a ranking problem, and LambdaMART from the RankLib package is used to build the boosted decision forest model.

 

Deep-learning-based model: Stanford Attentive Reader

Encoding layer

The encoding steps of the Stanford Attentive Reader and AS Reader are basically the same: the document and the question are encoded in the same way.

Attention layer

Unlike AS Reader's dot-product model, the Stanford Attentive Reader uses a bilinear function as its matching function, and then accumulates the similarities of the same word at its different positions in the document. The bilinear function for evaluating the similarity between q and p_i is more flexible than a dot product.
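A minimal numpy sketch contrasting the two matching functions (W is the learned bilinear weight matrix; shapes are illustrative):

```python
import numpy as np

def dot_match(q, p_i):
    """AS Reader style matching: plain dot product, no extra parameters."""
    return float(q @ p_i)

def bilinear_match(q, p_i, W):
    """Stanford Attentive Reader style matching: q^T W p_i. The learned
    matrix W can weight interactions between any pair of dimensions,
    which a plain dot product cannot."""
    return float(q @ W @ p_i)
```

For example, with an off-diagonal W, the bilinear form can score a match between dimension 0 of q and dimension 1 of p_i, where the dot product would give zero.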

Since only the matching function in the attention layer differs, machine reading comprehension models on the CNN & Daily Mail dataset at this point have no major architectural differences; the important point is learning a good matching function.

Parts of the setup that differ from the Attentive Reader:

Experimental setup

  • Optimizer: SGD
  • Word vector dimension: 100 (pre-trained 100-dimensional GloVe word vectors)
  • Learning rate: 0.1
  • Loss function: -log P(a | q, d)
  • GRU network weight initialization: Gaussian distribution N(0, 0.1)
  • Hidden layer size h: CNN 128, Daily Mail 256
  • Attention layer weight matrix initialization range: [-0.01, 0.01]
  • Batch size: 32
  • Dropout: 0.2

 


Origin blog.csdn.net/changreal/article/details/103666402