What is the attention mechanism, and how is it applied in recommendation models (the related models AFM/DIN/DIEN/DST will be introduced), with references
What is the attention mechanism
The essence of the Attention function can be described as a mapping from a query to a series of key-value pairs. Computing attention involves three main steps:
- First, compute the similarity between the query and each key to obtain a weight. Commonly used similarity functions include the dot product, concatenation, and a multi-layer perceptron;
- Second, normalize these weights, usually with a softmax function;
- Third, take the weighted sum of the weights and the corresponding values to obtain the final attention output.
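The three steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not a production implementation; the function name `attention` and the choice of dot-product similarity are assumptions for the example.

```python
import numpy as np

def attention(query, keys, values):
    """Single-query attention: query (d_k,), keys (n, d_k), values (n, d_v)."""
    # Step 1: similarity between the query and each key (dot product here)
    scores = keys @ query                      # shape (n,)
    # Step 2: normalize the weights with softmax (shifted for numerical stability)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    # Step 3: weighted sum of the values gives the attention output
    return weights @ values                    # shape (d_v,)
```

For example, with two keys where the query matches the first key more strongly, the output is pulled toward the first value vector.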
In much current NLP research, the keys and values come from the same elements, i.e. key = value; when the queries also come from that same sequence, this is self-attention (discussed below).
The difference and connection between attention and self-attention
When the query comes from the decoding layer while the keys and values come from the encoding layer, this is called vanilla attention, the most basic form of attention. When the query, key, and value all come from the encoding layer, it is called self-attention.
Taking the Encoder-Decoder framework as an example, the input Source and the output Target have different content. For English-to-Chinese machine translation, Source is an English sentence and Target is the corresponding translated Chinese sentence; attention occurs between the elements of Target (acting as queries) and all elements of Source.
Self-attention does not refer to the attention mechanism between Target and Source, but to attention among the internal elements of Source, or among the internal elements of Target. It can also be understood as attention in the special case Target = Source.
The computation is identical in both cases; only the objects it is computed over differ.
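This last point can be made concrete: the same attention computation serves both variants, and only the sources of the query, key, and value tensors change. The sketch below assumes a batched dot-product formulation; the array shapes and the names `attend`, `source`, and `target` are illustrative choices, not part of any specific model.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax over the given axis
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attend(Q, K, V):
    # identical computation for both variants; only the inputs differ
    return softmax(Q @ K.T) @ V

rng = np.random.default_rng(0)
source = rng.normal(size=(5, 4))   # e.g. encoder states (Source)
target = rng.normal(size=(3, 4))   # e.g. decoder states (Target)

# vanilla attention: queries from the decoder, keys/values from the encoder
ctx = attend(target, source, source)        # shape (3, 4)

# self-attention: queries, keys, and values all from the same sequence
self_ctx = attend(source, source, source)   # shape (5, 4)
```

Note that real Transformer-style self-attention first projects the input through learned Q/K/V matrices; the sketch omits the projections to keep the Target = Source special case visible.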