[Original] Understanding ChatGPT's attention mechanism and getting started with the Transformer

Author: Night Passerby

Time: April 27, 2023

If you want to learn this content coherently, please read the previous articles:

[Original] Understanding the working principle of ChatGPT's GPT

[Original] Understanding ChatGPT's Introduction to Machine Learning

[Original] AIGC's ChatGPT advanced usage skills

What does GPT mean

The full name of GPT is Generative Pre-trained Transformer. It is trained on a large corpus of text data and can generate text that resembles natural human language. The "pre-training" in its name refers to the initial training process on a large text corpus, during which the model learns to predict the next word in a passage. The resulting model can then be used for a wide range of natural language processing tasks, such as text generation, code generation, video generation, question answering, image generation, paper writing, film and television creation, scientific experiment design, and more.

Below, we briefly introduce the working principle of the GPT model in an easy-to-understand way.

As mentioned above, GPT stands for Generative Pre-trained Transformer. A simple breakdown of these three words:

Generative - generates the next word

Pre-trained - pre-trained on text (all kinds of text material from the Internet)

Transformer - based on the Transformer architecture (unsupervised learning)

The general description of GPT is: a model that, after being pre-trained on text with the Transformer architecture, can generate a reasonable continuation of any given text. (A game of "word solitaire", i.e. text continuation.)
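
To make the "word solitaire" idea concrete, here is a toy demonstration. The hand-written bigram table below is a hypothetical stand-in for a trained model: GPT learns such next-word statistics from a huge corpus with the Transformer, not from a lookup table.

```python
import random

# Toy "word solitaire": repeatedly predict the next word from the last one.
# The bigram table is hand-written for illustration only; a real GPT learns
# these statistics from massive text corpora using the Transformer.
bigram = {
    "I": ["eat"],
    "eat": ["apples", "pears"],
    "apples": ["."],
    "pears": ["."],
}

def generate(start, max_words=5):
    words = [start]
    while words[-1] in bigram and len(words) < max_words:
        words.append(random.choice(bigram[words[-1]]))  # next-word prediction
    return " ".join(words)

print(generate("I"))  # e.g. "I eat apples ."
```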

At its core, this unsupervised training relies on the Transformer model. To understand why the Transformer is such an excellent framework, and why it works so well for AI question answering (ChatGPT), it helps to have a general picture of how neural networks have iterated and developed over the years.

The core of ChatGPT's excellent interactive communication is the GPT mechanism. Besides pre-training (Pre-trained) and reinforcement learning from human feedback (RLHF - Reinforcement Learning from Human Feedback), the most fundamental part of the GPT mechanism is the T: the Transformer.

The whole process can be simply understood as an "unsupervised learning" game of "word solitaire": this game trains the core base model, and RLHF reinforcement training then makes this LLM (large language model) appear more and more intelligent.

How RNN developed into the Transformer

Let's take a look at the entire Transformer development roadmap:

The key milestones above are RNN -> LSTM -> Attention -> Transformer: the recurrent neural network (RNN) developed into the long short-term memory network (LSTM), then the attention mechanism (Attention) was born, and finally the Transformer (including Multi-Head Attention, i.e. multi-head self-attention) established the overall framework.

RNN (Recurrent Neural Network)

Let's build a simple intuition for the basic principle of RNN.

For example, when we recognize pictures, each picture is independent: recognizing an "apple" in one picture has no effect on recognizing a "pear" in the next. But with language, order is extremely important. "I eat apples" and "Apples eat me" have completely different meanings, and the order itself carries information: for example, whatever follows "eat" is very likely to be a food noun.

To capture this connection between data points, people invented a model called the recurrent neural network, or RNN for short. An RNN is a kind of neural network that can be thought of as having a small memory box for remembering past data. When new data comes in, the network takes the information stored in the box into account, and the stored information is continuously updated as new data arrives. The information in the memory box is called the "hidden state".

RNNs are most commonly used in natural language processing, for tasks such as machine translation and poetry writing. Machine translation means finding sequences that express the same meaning in different languages (e.g. translating Chinese into English); poetry generation means producing coherent word sequences according to a theme and rules. Change the types of input and output, and feeding in a picture while outputting a sentence becomes image captioning. Speech can also be regarded as a time-series signal, so scenarios such as speech recognition and speech generation are within RNN's capabilities as well; stock price movements can likewise be viewed as a time-varying sequence, and many quantitative trading models are built on this understanding.

(The h in the middle is the hidden state, c is the input, and y is the output)

RNN can process sequence data: both its input and its output can be sequences. This is because the RNN's hidden layer contains a loop, which maintains the network's internal state and continuously updates it as the input sequence is fed in.
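
As a minimal sketch of this loop (illustrative toy shapes and random weights, not the article's code), each step folds the current input into the hidden state:

```python
import numpy as np

# One RNN step: h_t = tanh(W_h @ h_prev + W_x @ x_t + b).
# The hidden state h is the "memory box" carrying past information forward.
def rnn_step(h_prev, x_t, W_h, W_x, b):
    return np.tanh(W_h @ h_prev + W_x @ x_t + b)

hidden, inp = 4, 3
rng = np.random.default_rng(0)
W_h = rng.normal(size=(hidden, hidden))
W_x = rng.normal(size=(hidden, inp))
b = np.zeros(hidden)

h = np.zeros(hidden)                   # initial hidden state
for x_t in rng.normal(size=(5, inp)):  # a toy sequence of 5 input vectors
    h = rnn_step(h, x_t, W_h, W_x, b)  # the state is updated at every step
```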

RNN is powerful: it handles pictures and text well and has many applicable scenarios. But it also has an obvious weakness: the earlier a piece of data is input, the smaller its influence on the hidden state becomes. In other words, if a sentence is very long, the RNN will forget what was said at the beginning. Standard RNNs also suffer from vanishing or exploding gradients on longer sequences, which prevents them from capturing long-term dependencies. For this reason, an improved version of the RNN, the LSTM (Long Short-Term Memory network), was invented later.

LSTM (Long Short-Term Memory network)

RNN has a certain memory ability, but unfortunately it only remembers the short term, and its performance on many tasks is not great. So what can be done?

People turned their attention to humans themselves. Human memory is selective: we don't remember everything that happens at every moment; we selectively keep important things and discard unimportant ones. Drawing on this human memory mechanism, Sepp Hochreiter (together with Jürgen Schmidhuber) redesigned the "memory box" in 1997 and introduced the mechanism of the "gate". A gate is a switch that decides how information is kept; its value lies between 0 and 1, where 1 means keep everything and 0 means discard everything.

There are three gates on the "memory box":

Forget gate: decides how much of the original information in the memory box to keep, i.e. which unimportant memories to discard;

Input gate: decides how much of the current network's information to save into the memory box, i.e. which new things to take in;

Output gate: decides to what degree the information in the memory box is output.

The modified memory box can both take in the current network state through the input gate and retain important information from the past through the forget gate. This is the LSTM, the long short-term memory model.
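
A minimal sketch of a single LSTM step, matching the three gates above (toy shapes and random weights, purely illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One LSTM step over the concatenation of the previous hidden state and
# the current input. c is the "memory box" (cell state), h the hidden state.
def lstm_step(h_prev, c_prev, x_t, W_f, W_i, W_o, W_c, b_f, b_i, b_o, b_c):
    z = np.concatenate([h_prev, x_t])
    f = sigmoid(W_f @ z + b_f)        # forget gate: how much old memory to keep
    i = sigmoid(W_i @ z + b_i)        # input gate: how much new info to write
    o = sigmoid(W_o @ z + b_o)        # output gate: how much memory to expose
    c_tilde = np.tanh(W_c @ z + b_c)  # candidate new memory content
    c = f * c_prev + i * c_tilde      # update the memory box
    h = o * np.tanh(c)                # new hidden state
    return h, c

hidden, inp = 4, 3
rng = np.random.default_rng(0)
W_f, W_i, W_o, W_c = (rng.normal(size=(hidden, hidden + inp)) for _ in range(4))
zeros = np.zeros(hidden)
h, c = zeros.copy(), zeros.copy()
for x_t in rng.normal(size=(5, inp)):
    h, c = lstm_step(h, c, x_t, W_f, W_i, W_o, W_c, zeros, zeros, zeros, zeros)
```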

By changing the structure of the memory box, many LSTM variants have been created, such as the GRU. A GRU has only two gates: the update gate is a combination of the forget gate and the input gate, deciding which old information to discard and which new information to add, while the reset gate decides how much of the previous state to use when forming the new candidate state, capturing short-term dependencies. The GRU's structure is simpler and its computation more efficient, while its performance is comparable to LSTM, so it has become increasingly popular. Other variants add modules of their own, such as a dedicated memory unit responsible for storing information and gates that control how that stored information is updated.
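
A corresponding sketch of one GRU step (again with assumed toy shapes; illustrative only):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One GRU step: the update gate z blends old state and new candidate state,
# the reset gate r controls how much past state feeds into the candidate.
def gru_step(h_prev, x_t, W_z, W_r, W_h, b_z, b_r, b_h):
    zx = np.concatenate([h_prev, x_t])
    z = sigmoid(W_z @ zx + b_z)                                # update gate
    r = sigmoid(W_r @ zx + b_r)                                # reset gate
    h_tilde = np.tanh(W_h @ np.concatenate([r * h_prev, x_t]) + b_h)
    return (1 - z) * h_prev + z * h_tilde                      # blended state

rng = np.random.default_rng(0)
hidden, inp = 4, 3
W = lambda: rng.normal(size=(hidden, hidden + inp))
h = gru_step(np.zeros(hidden), np.ones(inp), W(), W(), W(),
             np.zeros(hidden), np.zeros(hidden), np.zeros(hidden))
```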

A rough LSTM network:

Attention (attention mechanism)

The attention mechanism did not come out of thin air; it, too, is essentially learned from humans themselves.

The attention mechanism in deep learning is, as its name suggests, very similar to the attention mechanism of human vision; in fact, it was borrowed from it. What is their shared core? To focus on important information and ignore unimportant information. This mechanism was formed over the long course of human evolution: extracting a small amount of key information from a massive amount of input is a core human ability, essential, for example, for avoiding danger.

Let's look at an example. When we first see a picture like the one below, we pay attention to the baby's face first, and only then to the teddy bear:

Similarly, for a news article, you actually see the "headline" first, and only then the body text below:

In the two pictures above, we give priority to the baby's face and to the article's headline; these happen to be the core, most critical positions in each picture. After noticing them, we continue to examine those core areas in more detail: for the baby's face, for instance, the skin tone and the expression; for the news, what the headline actually says. Other, irrelevant information we simply choose to ignore.

This core can be summarized as follows: "The visual attention mechanism is a brain signal-processing mechanism unique to human vision. By quickly scanning the global image, human vision locates the target region that deserves focus, generally called the focus of attention, and then devotes more attention resources to this region to obtain more detailed information about the target, while suppressing other useless information."

The shared core of the human attention mechanism and the attention mechanism in deep learning is to "focus on the core part and suppress other useless information". It is a means of using limited attention resources to quickly sift high-value information out of a large volume of data, a survival mechanism formed over humanity's long evolution. The Attention mechanism in deep learning has learned this whole mechanism, and it greatly improves the efficiency and accuracy of information processing.

The attention mechanism can be described anthropomorphically through several characteristics worth noting:

1. Focusing: Just as our visual system can focus on a certain area of the field of view, the attention mechanism can focus on certain parts of the input sequence and give them higher weight. This ability to focus lets the model concentrate on the information that is most important and relevant at the moment.

2. Filtering: Our perceptual system filters out a great deal of irrelevant and unimportant information, keeping only what is critical. Similarly, the attention mechanism can filter out the less relevant elements of the input sequence and select only the most relevant, important information.

3. Context awareness: When humans understand language, they interpret the meaning of a word or phrase correctly based on its context. Likewise, the attention mechanism can incorporate contextual information into the representation of the current input, producing context-dependent output. This makes the model's predictions fit the context of the current input better.

4. Attention drift: Human attention is not fixed; we can change what we attend to at any time according to our needs. The attention mechanism has a similar ability: it can redistribute attention at any time according to the importance of the inputs, dynamically focusing on whatever is currently most relevant. This dynamic allocation of attention makes the model more flexible and powerful.

Therefore, the attention mechanism resembles human attention in characteristics such as focusing, filtering, context awareness, and attention drift. These let it selectively attend to certain parts of the input sequence, filter out less relevant information, and adjust its attention distribution according to context, yielding more accurate outputs. The neural network learns and then simulates these characteristics.

Application of Attention in Deep Learning

Application of Attention to Images and Text

The figure above shows the attention mechanism identifying key information: the white areas are the regions the model attends to when generating text for image recognition:

And an LSTM network with an attention mechanism for text processing:

Given that nature and humans have attention mechanisms, how is attention computed in deep learning? In other words, how do we judge which parts of an image or a text deserve more attention?

Image Attention Calculation

In image processing, the main principle of the attention mechanism is to judge the correlation between the current input and each element of the input (here, each pixel), and to assign each pixel a weight according to that correlation. These weights determine which pixels deserve special attention and focus, and which should be filtered out.

Specifically, the attention mechanism computes the similarity or correlation between the current input feature and each pixel. Pixels with higher similarity receive larger weights, indicating that they matter more to the current input feature and deserve more attention; pixels with lower similarity receive smaller weights, indicating that they have less influence and can be filtered out.

In images, then, the attention mechanism mainly judges the correlation between two inputs in the following ways:

1. Spatial attention: compute the relative spatial position of two pixels; the closer they are, the higher the correlation and the larger the weight. This kind of attention captures spatial structure information.

2. Channel attention: if two pixels have closer values on the RGB channels, their correlation is considered higher and their weight larger. This can learn dependencies between channels.

3. Hierarchical attention: on top of spatial and channel attention, multiple layers of attention can be built, with higher layers combining lower-layer results and correlating them with the input. This lets attention examine the input at different levels of abstraction with greater accuracy.

4. Similarity attention: directly compute the similarity between two pixels using methods such as the dot product or cosine similarity. The higher the similarity, the stronger the correlation and the larger the weight (a minimal sketch follows after the summary below).

In short, in image processing the attention mechanism computes the correlation between the current input feature and each pixel, assigns the pixels different weights, and generates a new representation of the current input feature accordingly. Pixels with higher correlation have greater influence and larger weights, which lets the model selectively focus on the important regions of an input image and filter out the less relevant background. This process mimics how humans allocate attention to important information when comprehending images.
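
Here is a minimal sketch of the fourth method, similarity attention, under assumed toy shapes (illustrative only): each pixel's weight comes from its dot-product similarity to the current query feature.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(1)
pixels = rng.normal(size=(16, 8))  # 16 pixel feature vectors of dimension 8
query = rng.normal(size=8)         # the current input feature

weights = softmax(pixels @ query)  # higher similarity -> larger weight
attended = weights @ pixels        # weighted summary of the attended pixels
print(weights.round(3), attended.shape)  # weights sum to 1; shape (8,)
```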

Text Attention Calculation

In text processing, the main principle of the attention mechanism is to judge the correlation between the current input and each historical element of the input sequence, and to assign each historical element a weight according to that correlation. These weights determine which historical elements deserve special attention and focus, and which should be filtered out.

Specifically, the attention mechanism computes the similarity or correlation between the current input and each historical input. Elements with higher similarity receive larger weights, indicating that they matter more to the current input and deserve more attention; elements with lower similarity receive smaller weights, indicating that they have less influence and can be filtered out.

So how does the attention mechanism judge the correlation between two inputs? Mainly in the following ways:

1. Dot-product attention: compute the dot product between the two inputs' vector representations (embeddings); the larger the dot product, the higher the correlation.

2. Scaled dot-product attention: on top of dot-product attention, divide the dot product by a scaling factor (such as the square root of the vector dimension), which makes the weight distribution more concentrated on the more important elements (see the sketch after this list).

3. Multi-head attention: use multiple attention heads, each with its own Query, Key, and Value, and finally concatenate or average the heads' outputs to produce the final output. This lets the model examine correlations from different angles and be more accurate.

4. Positional encoding: add position information to the input embeddings of the sequence so the model can use position when judging correlation; two inputs located closer together tend to be more correlated.

Therefore, text attention mainly computes the correlation between the current input and the historical inputs, assigns the latter weights, and generates a new representation of the current input accordingly. Historical inputs with higher correlation have greater influence and larger weights, which lets the model selectively focus on the important elements of the input sequence and filter out the less relevant ones. This process mimics the human attention-allocation mechanism when reading text.
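
Here is a minimal sketch of methods 1, 2, and 4 above (toy shapes and random values; illustrative, not the article's code). The sinusoidal positional encoding follows the form used in the original Transformer paper.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Scaled dot-product attention: correlation = dot product of Q and K rows,
# scaled by sqrt(d_k) so the softmax weights stay well concentrated.
def scaled_dot_product_attention(Q, K, V):
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    weights = softmax(scores)   # each row sums to 1 over the historical inputs
    return weights @ V          # weighted mixture of the Value vectors

# Sinusoidal positional encoding, added to token embeddings so the model
# can use position information when judging correlation.
def positional_encoding(seq_len, d_model):
    pos = np.arange(seq_len)[:, None]
    i = np.arange(d_model)[None, :]
    angle = pos / np.power(10000.0, (2 * (i // 2)) / d_model)
    return np.where(i % 2 == 0, np.sin(angle), np.cos(angle))

rng = np.random.default_rng(2)
X = rng.normal(size=(7, 16)) + positional_encoding(7, 16)  # 7 tokens, dim 16
out = scaled_dot_product_attention(X, X, X)                # self-attention
print(out.shape)                                           # (7, 16)
```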

Transformer model overview

Transformer solves the seq2seq problem

In machine learning, the general problems we need to solve mostly look like this: we feed something into the model, and the model outputs something, such as a word or a picture (for example, word translation or classification problems):

Or we input a bunch of things and the model outputs a single thing, one label for the entire input sequence (for example, classification problems or sentiment analysis):

Going further, there may be N input vectors with N output labels (the lengths of input and output are fixed and equal):

Another common case is N input vectors with M output labels. This kind of problem is called Seq2Seq (Sequence to Sequence), a typical setting in machine learning that covers AI question answering, machine translation, and more:

ChatGPT can be viewed as a Seq2Seq problem: the user inputs a prompt, and GPT outputs a stretch of text. Seq2Seq transforms one sequence into another; Google, for example, used a Seq2Seq model plus an attention model to build its translation feature, and the same approach can power a chatbot dialogue model. The classic RNN model fixes the sizes of the input and output sequences, but the Seq2Seq model breaks through this limitation.

The most important property of the Seq2Seq structure is that the lengths of the input and output sequences are variable.

Seq2seq problems are generally handled with an Encoder-Decoder structure: a sequence is fed in, goes through the various stages of the encoder, and the decoder then turns it into the desired target content, as in the classic RNN Encoder-Decoder architecture:

As an example of the Encoder-Decoder architecture, consider a translation scenario:

For ChatGPT, the T is the Transformer framework mentioned above, and the Transformer is a model design for processing seq2seq. The figure above can be read as the basic working structure of the Transformer framework, which is essentially an Encoder-Decoder structure.
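
A conceptual sketch of such an Encoder-Decoder with attention (hypothetical toy functions and shapes, not any real framework's API): the encoder turns a variable-length input into a set of states, and each decoder step attends over them, so the output length can differ from the input length.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def encode(inputs, W):
    # One encoder state per input position (toy non-recurrent encoder).
    return [np.tanh(W @ x) for x in inputs]

def decode_step(dec_state, enc_states):
    # The decoder attends over all encoder states at every output step.
    scores = np.array([dec_state @ h for h in enc_states])
    weights = softmax(scores)                               # attention weights
    return sum(w * h for w, h in zip(weights, enc_states))  # context vector

rng = np.random.default_rng(3)
W = rng.normal(size=(8, 8))
src = [rng.normal(size=8) for _ in range(6)]  # input length 6
enc_states = encode(src, W)
for _ in range(4):                            # output length 4 (M != N)
    context = decode_step(rng.normal(size=8), enc_states)
```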

The difference between Transformer and LSTM/RNN

In essence, RNN or LSTM could also solve this kind of AI question-answering problem, but the Transformer has these advantages over them:

1. Parallel computation: RNN and LSTM are sequential models; the output of each step depends on the output of the previous step, so they cannot be computed in parallel. The Transformer adopts the Attention mechanism, which can compute all timesteps in parallel, greatly improving computation speed.

2. Long-term dependency learning: although the recurrent structure of RNN and LSTM can capture contextual information, they struggle to learn long-term dependencies in longer sequences and suffer from vanishing gradients. By using the Attention mechanism, the Transformer can directly model the dependency between any two timesteps and learns long-term dependencies much better.

3. More stable training: the recurrent structure of RNN and LSTM makes training harder; parameter choice and initialization strongly affect the final result, and exploding gradients occur easily. The Transformer's non-recurrent structure makes its training more stable.

4. Fewer parameters: RNN and LSTM require more parameters, while the Transformer, using the Attention mechanism, can achieve equal or better performance with fewer parameters.

5. No special input/output markers: when RNN and LSTM encode sequences, special start and end tokens usually have to be added at both ends of the input sequence; the Transformer has no such requirement.

The excellence of the Transformer framework lies not only in its Encoder-Decoder mechanism but also in its Multi-Head Attention mechanism. The Transformer architecture relies entirely on the Attention mechanism, which solves the long-range dependency problem between input and output, and its multi-head design provides parallel computing capability, greatly reducing computation time. The self-attention module lets the source sequence and the target sequence each "associate with themselves" first, so that the embedding (word embedding) representations of both sequences carry richer information, and the subsequent FFN (feed-forward network) layers further enhance the model's expressive power. The Multi-Head Attention module is what gives the Encoder side its parallel computing capability.
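
A minimal sketch of multi-head self-attention (illustrative; it omits the learned per-head Query/Key/Value projection matrices that a real Transformer uses, simply slicing the model dimension instead):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V

def multi_head_self_attention(X, num_heads):
    seq_len, d_model = X.shape
    d_head = d_model // num_heads
    heads = []
    for h in range(num_heads):                    # heads are independent of
        sl = slice(h * d_head, (h + 1) * d_head)  # each other, so they can
        part = X[:, sl]                           # run fully in parallel
        heads.append(attention(part, part, part))
    return np.concatenate(heads, axis=-1)         # splice the head outputs

rng = np.random.default_rng(4)
X = rng.normal(size=(10, 32))                     # 10 tokens, model dim 32
out = multi_head_self_attention(X, num_heads=4)   # shape (10, 32)
```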

Compared with RNN and LSTM, the Transformer computes in parallel and is therefore faster, learns long-term dependencies better with less risk of vanishing gradients, trains more stably with less risk of exploding gradients, uses fewer parameters with lower space and computational complexity, and needs no special tokens added to the input and output sequences. It is, in short, a very good neural network model.

The Transformer can be seen either as a model or as an architecture. Viewed through concrete implementations such as BERT and GPT, these are independent models designed on top of the Transformer architecture for different natural language processing tasks, so each can be regarded as a model in its own right. Viewed at a higher level, however, the Transformer essentially proposes an attention-based encoder-decoder framework or architecture. The main components of this architecture, such as multi-head attention, positional encoding, residual connections, and the feed-forward neural network, are all general-purpose building blocks.

In this sense, the Transformer is better thought of as a unified architecture or framework. On top of this architecture, researchers can design variant models for different purposes by choosing different training corpora or tasks, for example:

- BERT: a Transformer model pre-trained on large-scale corpora in a self-supervised way, used for language understanding.

- GPT: a Transformer model pre-trained on large-scale corpora through self-supervised learning, used for language generation.

- Transformer-Align: a Transformer model for sequence alignment tasks.

- Graph Transformer: a Transformer model for processing graph data.

So, overall, my understanding is this: the Transformer proposes a general attention-based neural network architecture, and the various models designed on that architecture, such as BERT and GPT, can be seen as concrete instances of it. By choosing different datasets and training objectives, these instance models can accomplish different natural language processing tasks.

Summary

The main advantages of the Transformer model are as follows:

1. Parallel computation. The Transformer can compute all timesteps in parallel, which makes it very fast; this is its biggest advantage over RNN and LSTM.

2. Learning long-term dependencies. Through the Attention mechanism, the Transformer can directly model the dependency between any two timesteps, so it learns long-term dependencies well and is not prone to vanishing gradients.

3. More stable training. The Transformer's non-recurrent structure makes its training process more stable, with less risk of exploding gradients and more flexibility in parameter choice.

4. Fewer parameters. Compared with RNN and LSTM, the Transformer requires fewer parameters, and the gap is especially obvious on longer sequence tasks.

5. No special input/output markers. The Transformer does not need special start and end tokens added at both ends of the sequence.

The main disadvantages of Transformer are as follows:

1. No recurrence. The Transformer has no recurrent structure, so some strengths of RNN are lost; for example, it cannot model periodic time series as well.

2. Possibly unsuitable for shorter sequences. On shorter sequences, the Transformer has relatively many parameters and is not necessarily better than RNN or LSTM.

3. High computational complexity. The computation cost of Attention in the Transformer is relatively large, which can become a bottleneck when computing resources are limited.

4. Weaker at prosodic and temporal information. Unlike RNN and LSTM, the Transformer has no recurrent structure or hidden state, so it cannot model temporal and prosodic information as well.

Generally speaking, the Transformer's main strengths are parallel computation, learning long-term dependencies, and training stability, but it also has shortcomings: no recurrent structure, weaker performance on short sequences, high computational complexity, and weaker modeling of temporal and prosodic information. Which model to choose must be weighed against the requirements of the specific task and the characteristics of the data.

This article has given an overview of the Transformer's entire development process, the natural intuition behind the Attention mechanism at the heart of the Transformer, and how it works in deep learning.

I hope you now have a first impression of, and some appreciation for, the Transformer model, and an understanding of the basic natural principles behind this remarkable tool.

What replaces you is not AI, but someone who knows AI better than you and can use AI better!

##End##


For more technical content, you can follow the "Dark Night Passerby Technology" public account.


Origin blog.csdn.net/heiyeshuwu/article/details/130377248