Understanding How Large Models Work: ChatGPT as an Example

Preface

After ChatGPT was released on November 30, 2022, it immediately caused a sensation worldwide. AI practitioners and non-practitioners alike have been talking about its impressive interactive experience and astonishing generated content, making the general public realize the potential and value of artificial intelligence once again. For AI practitioners, ChatGPT has broadened horizons: large models are no longer mere leaderboard toys. Everyone now recognizes the importance of high-quality data and firmly believes that "the more human effort you put in, the more intelligence you get out."

ChatGPT performs so well that it can reach SOTA results on many tasks even in zero-shot or few-shot settings, which has led many people to turn to research on large models.

Google has proposed the Bard model to benchmark against ChatGPT, and many large models have emerged in China, such as Baidu's "Wenxin Yiyan", Alibaba's "Tongyi Qianwen", SenseTime's "RiRiXin", Zhihu's "Zhihaitu AI", Tsinghua University's "ChatGLM", and Fudan University's "MOSS", alongside Meta's "Llama 1 & Llama 2" abroad.

The advent of the Alpaca model showed that although a 7-billion-parameter model cannot match the performance of ChatGPT, it greatly reduces the compute cost of large models, making them accessible to ordinary users and ordinary enterprises. The data problem emphasized earlier can now be addressed through the GPT-3.5 or GPT-4 APIs, and the resulting data quality is quite high. If you only need a model with baseline performance, it matters less whether the data is carefully re-verified (of course, better results still require more accurate data).

1. Transformer architecture models

The essence of pre-trained language models is to learn universal representations of language from massive data in order to achieve better results on downstream subtasks. As parameter counts keep increasing, many pre-trained language models are also called large language models (LLMs). Different people define "large" differently, and it is hard to say how many parameters make a model a large language model; usually no strict distinction is drawn between pre-trained language models and large language models.

Pre-trained language models are generally divided by underlying network structure into encoder-only, decoder-only, and encoder-decoder architectures. Encoder-only models include, but are not limited to, BERT, RoBERTa, ERNIE, SpanBERT, and ALBERT; decoder-only models include, but are not limited to, GPT, CPM, PaLM, OPT, BLOOM, and Llama; encoder-decoder models include, but are not limited to, MASS, BART, and T5.
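A minimal sketch (not from the article) of the key mechanical difference between these architectures: encoder-only models let every token attend to every other token, while decoder-only models apply a causal mask so each token attends only to itself and earlier positions.

```python
# Encoder-only (BERT-style) vs decoder-only (GPT-style) attention masks.
# A value of 1 means position i may attend to position j.

def bidirectional_mask(seq_len: int) -> list[list[int]]:
    """Encoder-only: every token attends to every token."""
    return [[1] * seq_len for _ in range(seq_len)]

def causal_mask(seq_len: int) -> list[list[int]]:
    """Decoder-only: token i attends only to positions j <= i."""
    return [[1 if j <= i else 0 for j in range(seq_len)]
            for i in range(seq_len)]
```

An encoder-decoder model combines both: the encoder side uses a bidirectional mask, while the decoder side uses a causal mask plus cross-attention to the encoder outputs.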


2.ChatGPT principle

The training of ChatGPT is divided into three stages: pre-training plus prompt learning, result ranking and reward modeling, and reinforcement-learning-driven self-evolution. The three stages have a clear division of labor, carrying the model through an imitation period, a discipline period, and an autonomy period.


In the first stage, the imitation period, the model focuses on learning various instruction-based tasks. At this stage the model has no sense of self-judgment; it mostly imitates human behavior, acquiring a degree of intelligence by continually learning from human-annotated results. However, pure imitation leaves the machine's learning at the level of a toddler.

In the second stage, the discipline period, the optimization objective shifts direction: the focus changes from teaching the machine what to answer to teaching it to judge the quality of answers. In the first stage, the goal is for the machine, given input X, to imitate and learn to output Y', striving to make Y' match the originally labeled Y. In the second stage, the goal is that when the model produces multiple outputs (Y1, Y2, Y3, Y4) for X, it can judge their relative quality by itself.

Once the model has some judgment ability, it is considered to have completed the second stage of learning and can enter the third stage, the autonomy period. In this period, the model completes its self-evolution through an interplay of two roles: on one side it automatically generates multiple outputs, on the other it judges the quality of those outputs, and based on the differences in quality it updates the parameters of the generation model, thereby completing self-reinforcement learning.
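The "discipline" stage described above is commonly trained with a pairwise ranking objective: given a reward score for a preferred answer and one for a worse answer, the reward model is pushed to widen the gap. A minimal sketch (the function name and the loss form are the standard `-log(sigmoid(r_chosen - r_rejected))` used in RLHF-style reward modeling, not code from the article):

```python
import math

def reward_ranking_loss(r_chosen: float, r_rejected: float) -> float:
    """Pairwise ranking loss for a reward model.

    The loss is small when the reward for the human-preferred answer
    (r_chosen) is much larger than the reward for the worse answer
    (r_rejected), and large when the ordering is wrong.
    """
    margin = r_chosen - r_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

When the two scores are equal the loss is ln 2; as the preferred answer's score pulls ahead, the loss approaches zero, which is exactly the "learning to judge quality" behavior the second stage targets.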

To sum up, the three stages of ChatGPT can be compared to three stages of human growth: the imitation period aims to "learn how things work", the discipline period to "distinguish right from wrong", and the autonomy period to "understand all things".

3. Prompt learning and emergent abilities of large models

After its release, ChatGPT became popular worldwide for its fluent conversational expression, strong context memory, rich knowledge and creativity, and comprehensive problem-solving ability, refreshing the public's understanding of artificial intelligence. Concepts such as prompt learning, in-context learning, and chain of thought (CoT) have also entered the public eye. There is even a new profession on the market, the prompt engineer, dedicated to writing prompt templates for specific tasks.

Prompt learning is considered by most scholars to be the fourth paradigm of natural language processing, after feature engineering, deep learning, and pre-training plus fine-tuning. As language model parameters keep growing, models have exhibited emergent abilities such as in-context learning and chain-of-thought reasoning. Without updating any model parameters, a few demonstration examples are enough to achieve strong results on many natural language processing tasks.

3.1 Prompt learning

Prompt learning appends additional prompt information to the original input text as the new input, converts the downstream prediction task into a language modeling task, and then maps the language model's prediction back to a prediction for the original downstream task.

Taking sentiment analysis as an example: the original task is to determine the emotional polarity of the given input text "I love China". Prompt learning appends a prompt template to the original input, for example "The emotion of this sentence is {mask}.", yielding the new input "I love China. The emotion of this sentence is {mask}." The masked-language-model task then predicts the {mask} token, the predicted token is mapped to an emotional polarity label, and sentiment prediction is thereby achieved.
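The pipeline above can be sketched end to end in a toy form. The template follows the article's example; the mask-filling "model" below is a deliberately fake stand-in (a real system would use a pretrained masked language model such as BERT), and the verbalizer tokens are illustrative:

```python
# Toy sketch of prompt learning for sentiment analysis.
TEMPLATE = "{text} The emotion of this sentence is {mask}."
VERBALIZER = {"great": "positive", "terrible": "negative"}  # token -> label

def toy_mask_filler(prompt: str) -> str:
    # Fake stand-in for a masked language model: pretend it predicts
    # "great" whenever the prompt contains the word "love".
    return "great" if "love" in prompt else "terrible"

def classify(text: str) -> str:
    # 1. Append the prompt template to the original input.
    prompt = TEMPLATE.format(text=text, mask="[MASK]")
    # 2. Let the (toy) MLM predict the masked token.
    predicted_token = toy_mask_filler(prompt)
    # 3. Map the predicted token back to a downstream label.
    return VERBALIZER[predicted_token]
```

The three steps in `classify` mirror the three steps in the paragraph: template construction, mask prediction, and token-to-label mapping (the verbalizer).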

3.2 In-context learning

In-context learning can be regarded as a special case of prompt learning: the demonstration examples are treated as part of a manually written (discrete) prompt template, and the model parameters are not updated.

The core idea of in-context learning is learning by analogy. For a sentiment classification task, some demonstration examples are first drawn from an existing sentiment analysis corpus, including positive and negative texts with their corresponding labels; the demonstrations are then concatenated with the text to be analyzed and fed into a large language model; finally, the model infers the emotional polarity of the text by analogy with the demonstrations.
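The splicing step can be sketched as a simple prompt builder (the `Text:`/`Sentiment:` format is an illustrative convention, not prescribed by the article): demonstrations go in front, the query goes last, and the whole string is sent to the model with no parameter updates.

```python
# Build a few-shot in-context learning prompt by splicing demonstration
# examples in front of the query to be classified.

def build_icl_prompt(demos: list[tuple[str, str]], query: str) -> str:
    blocks = [f"Text: {text}\nSentiment: {label}" for text, label in demos]
    # The query block is left open so the model completes the label.
    blocks.append(f"Text: {query}\nSentiment:")
    return "\n\n".join(blocks)
```

For example, `build_icl_prompt([("I love it", "positive"), ("Awful service", "negative")], "I love China")` produces a single prompt ending in an unfinished `Sentiment:` line, which the model completes by analogy with the demonstrations.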

This learning style is also close to how humans make decisions after learning: by observing how others handle certain events, you can handle the same or similar events easily and well when you encounter them yourself.

3.3 Chain of thought

Large language models have completely changed the paradigm of natural language processing. As model parameters increase, System-1 tasks (tasks humans complete quickly and intuitively), such as sentiment analysis and topic classification, achieve good results even in few-shot and zero-shot settings. But for System-2 tasks (tasks requiring slow, deliberate thinking), such as logical reasoning, mathematical reasoning, and commonsense reasoning, performance remains unsatisfactory even when parameters grow to hundreds of billions, and simply adding more parameters brings no substantial improvement.

Google proposed the concept of chain of thought (CoT) in 2022 to improve the ability of large language models on various reasoning tasks. A chain of thought is essentially a discrete prompt template whose main purpose is to make the model imitate the human thinking process, producing step-by-step reasoning to derive the final answer; the collection of sentences giving the rationale for each step constitutes the chain of thought.

Chain of thought effectively helps a large language model decompose a multi-step problem into intermediate steps that can be solved individually, instead of solving the entire multi-hop problem in a single forward pass.
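The difference between a standard few-shot prompt and a CoT prompt is only in the demonstration: the CoT version spells out the intermediate steps before the answer. A sketch (the arithmetic word problem is the well-known illustrative example from the CoT literature; the exact wording here is made up):

```python
# Standard demonstration: question -> answer only.
STANDARD_DEMO = (
    "Q: Roger has 5 balls and buys 2 cans of 3 balls each. "
    "How many balls does he have now?\n"
    "A: 11."
)

# Chain-of-thought demonstration: intermediate steps precede the answer.
COT_DEMO = (
    "Q: Roger has 5 balls and buys 2 cans of 3 balls each. "
    "How many balls does he have now?\n"
    "A: Roger starts with 5 balls. 2 cans of 3 balls is 6 balls. "
    "5 + 6 = 11. The answer is 11."
)

def build_prompt(demo: str, question: str) -> str:
    """Splice a demonstration in front of a new question."""
    return f"{demo}\n\nQ: {question}\nA:"
```

Prompted with `COT_DEMO`, the model imitates the step-by-step rationale on the new question, solving each intermediate step instead of jumping straight to an answer.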


4. Industry reference suggestions

4.1 Embrace change

Unlike other fields, AIGC is currently one of the fastest-changing fields. In the single week from March 13 to March 19, 2023, Tsinghua University released the ChatGLM-6B open-source model, OpenAI released the GPT-4 API, Baidu held the Wenxin Yiyan press conference, and Microsoft launched Copilot, a brand-new product combining Office with ChatGPT, among other major events.

These events influence the research direction of the industry and trigger further thinking. For example: should the next technical route build on open-source models or pre-train a new model from scratch? How many parameters should it have? Now that Copilot is here, how should developers of office-plug-in AIGC applications respond?

Even so, practitioners are advised to embrace change, adjust strategies quickly, and use cutting-edge resources to accelerate their own work.

4.2 Clear positioning

Be clear about your goals within your chosen segment: whether to work on the application layer or the base-model layer, the consumer (C-end) or business (B-end) market, vertical industry applications or general-purpose tools. Don't be overly ambitious; seize the opportunity and "cut the cake precisely".

Having a clear positioning does not mean you will never hit a wall or change course; it means understanding your own purpose and direction.

4.3 Compliance and controllability

The biggest problem with AIGC is the uncontrollability of its output. If this problem cannot be solved, development will hit a serious bottleneck, and AIGC will not see wide use in either the B-end or C-end market. During product design, attention must be paid to integrating rule engines, strengthening reward and penalty mechanisms, and applying appropriate human intervention. Practitioners should also watch the copyright, ethical, and legal risks of AIGC-generated content.

4.4 Experience accumulation

The purpose of accumulating experience is to build your own moat. Don't pin all your hopes on a single model. For example, we once designed a product around plain-text input to integrate seamlessly with ChatGPT, but the latest GPT-4 already supports multimodal input. Rather than being discouraged, we should quickly embrace the change and use previously accumulated experience (in data, prompts, and interaction design) to upgrade the product for new scenarios and interaction forms. Finally, I would like to recommend an excellent book: "ChatGPT Principles and Practice"!



Source: blog.csdn.net/weixin_63866037/article/details/132818328