Analyzing the history of NLP and the development of chatGPT

1. Historical evolution of NLP

1.1 NLP supervised paradigm

The paradigm of supervised tasks in NLP can be summarized as follows.

The input is a sequence of words, and the key intermediate step is the semantic representation. Once a semantic representation is obtained, it is handed to the downstream model for learning. The development of pre-training technology has therefore been a gradual improvement in how to obtain a good semantic representation.

The computation of semantic features can be divided into three stages:

1. The feature engineering stage, with the bag-of-words model as a typical representative

2. Shallow representation stage, with word2vec as a typical representative

3. The deep representation stage, with the Transformer-based BERT as a typical representative

I am a little unconvinced by this division. Why did we once spend a whole stage on pieces of the semantic problem [such as Chinese word segmentation, part-of-speech tagging, syntactic analysis, semantic parsing]? These are usually called "intermediate tasks", and they existed mainly because NLP technology was not mature enough. Early machine translation, for example, was so difficult that the problem was divided and conquered, decomposed into intermediate stages such as word segmentation, part-of-speech tagging, and syntactic analysis.

Let us expand a little on each of these three stages; the key points can be elaborated further if the opportunity arises.

1.2 Bag of words model

The bag-of-words model counts how many times each dimension (word) appears in the document. The problem is that semantics is reduced to whether two words are literally identical: two words that are strongly related in meaning but different on the surface (say, a word and its synonym) share nothing at all in the bag-of-words representation.
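
A minimal sketch of a bag-of-words representation, using scikit-learn's CountVectorizer (the toy documents are made up for illustration):

```python
# Bag-of-words in miniature: each word is a separate dimension, and a document
# is just the vector of raw counts, so surface form is all that matters.
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "I ate an apple and a pear",
    "this apple pie needs one more apple",
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())  # vocabulary: one dimension per distinct surface form
print(X.toarray())                         # counts; related words share nothing unless spelled identically
```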

1.3 Word vector

Word-vector representations have clustering properties and linear structure. The key problem they solve is semantic representation: a sentence can be represented by vectors that carry semantics. What they cannot solve is contextual semantics; for example, the same "play" in "play music" and "play football" cannot be distinguished, whether you are playing a ball or playing the piano.
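
A minimal sketch with gensim's word2vec showing the limitation: "play" gets one static vector regardless of context (the toy corpus and hyperparameters are illustrative only):

```python
# word2vec in miniature: every word gets a single fixed embedding, so "play"
# in "play music" and "play football" maps to exactly the same vector.
from gensim.models import Word2Vec

sentences = [
    ["play", "music"], ["play", "the", "piano"],
    ["play", "football"], ["play", "basketball"],
]
model = Word2Vec(sentences, vector_size=32, window=2, min_count=1, sg=1, epochs=50)

print(model.wv["play"][:5])                   # one context-independent vector for "play"
print(model.wv.most_similar("play", topn=3))  # neighbours mix the musical and sporting senses
```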

1.4 Pre-trained language model

A pre-trained language model is trained on a large corpus without supervision to extract semantic representation information, and is then fine-tuned for downstream tasks such as cloze, text classification, and QA. The most representative model is BERT, which from 2018 onward swept almost every NLP benchmark.
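
A minimal cloze-style sketch using the Hugging Face transformers pipeline (the model name and example sentence are illustrative assumptions, and the pre-trained weights are downloaded on first use):

```python
# Cloze / fill-mask with a pre-trained BERT: the pre-trained model predicts
# the masked token directly, before any task-specific fine-tuning.
from transformers import pipeline

fill = pipeline("fill-mask", model="bert-base-uncased")
for candidate in fill("The capital of France is [MASK]."):
    print(candidate["token_str"], round(candidate["score"], 3))
```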

Note: viewed along the time axis,

Around 2016: semantic representation was improved mainly by increasing network depth; the more complex the network, the stronger its representational ability.

After 2018: semantic representation focused more on semantic understanding; whoever understands the semantic content better gets the better results.

2. Historical evolution of GPT

2.1 Macro view: from GPT-1/2 to GPT-3

The three generations of text pre-trained models, GPT-1, GPT-2, and GPT-3, all use the Transformer as their core structure. At the network level they differ only in hyperparameters such as the number of layers and the word-vector (embedding) dimension.

2.2 GPT-1: Transformer-decoder pre-training + fine-tuning

GPT (Generative Pre-Training) refers to a generative pre-trained model for NLP.

The training mode is divided into two stages:

Stage 1, pre-training: learn a language model (LM) on large amounts of unlabeled text;

Stage 2, fine-tuning: adapt the pre-trained model to downstream tasks by fine-tuning.

2.3 GPT-2: discard fine-tuning and use zero-shot learning directly

GPT-2 keeps the network design of GPT-1 but uses a larger network and a larger dataset; during training and prediction it predicts one word at a time, yielding a language model capable of zero-shot learning.

N-shot learning for zero-sample / few-sample settings came into being; it can be divided into the following three types (see the prompt sketch after this list):

1) Zero-shot learning: the pre-trained language model completes a specific task without any training samples and without fine-tuning;

Only the task description is given in the prompt, and the model is asked to produce the answer. Note: this zero-sample setting is the most challenging.

2) One-shot learning: the pre-trained language model completes a specific task given a single example;

The prompt contains the task description plus one example, and the model is asked to produce the answer.

3) Few-shot learning: the pre-trained language model completes a specific task given only a small number of examples;

The prompt contains the task description plus a few examples (say, 3), and the model is asked to produce the answer.

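A minimal sketch of what the three settings above look like as raw prompt strings for a GPT-3-style model (the translation task and examples follow the format popularized by the GPT-3 paper; no parameters are updated in any of the three cases):

```python
# Zero-/one-/few-shot prompts differ only in how many solved examples are
# placed before the test input; there is no gradient update in any case.
task = "Translate English to French."

zero_shot = f"{task}\ncheese =>"

one_shot = f"{task}\nsea otter => loutre de mer\ncheese =>"

few_shot = (
    f"{task}\n"
    "sea otter => loutre de mer\n"
    "peppermint => menthe poivrée\n"
    "plush giraffe => girafe en peluche\n"
    "cheese =>"
)

print(few_shot)  # this string is fed to the language model as-is
```
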
2.4 GPT-3: opening the new prompt paradigm in NLP and realizing few-shot learning

The original GPT-3 demonstrated three important capabilities:

  • Language Generation: Follow prompts and generate sentences that complete the prompts. This is also the most common way humans interact with language models today.

  • In-context learning: follow several examples of a given task and then generate the solution for a new test case. It is worth noting that although GPT-3 is a language model, its paper hardly talks about "language modeling" at all; the authors devote their writing to the vision of in-context learning, which is the real point of GPT-3.

  • World knowledge: including factual knowledge and common sense.

  • Fact: Li Bai is a poet, Sima Qian wrote "Historical Records"

  • Common sense: the sun rises from the east, people have two legs + one pair of hands

These capabilities come from the following:

Pre-training stage:

A 175-billion-parameter model is pre-trained on roughly 300 billion tokens of text. The model follows the structure of GPT-2 but greatly increases the network capacity. The training mix consists of five corpora: about 60% (lower-quality) Common Crawl, 22% (higher-quality) WebText2, 16% Books1 and Books2, and 3% Wikipedia.

After pre-training at this scale, the capabilities above can be attributed as follows:

  • The language generation ability comes from the language modeling training objective.

  • World knowledge comes from the 300-billion-token training corpus (a multi-category data source).

  • The model's 175 billion parameters are there to store that knowledge (performance on knowledge-intensive tasks is closely related to model size).

  • Where in-context learning comes from, and why it generalizes, is less clear. The ability may come from the fact that data points of the same task are arranged sequentially within the same batch during training. To appreciate in-context learning it helps to first understand meta-learning: for a task with few samples, the model's initialization matters a great deal; starting from a good initialization, the model can converge quickly and approach a good solution. The core idea of meta-learning is to find a suitable initialization from a small amount of data, so that the model can fit quickly on a limited dataset and still achieve good results.

Prompt stage:

Conventional NLP tasks: QA, sentence similarity, closed-book question answering, pattern analysis, machine translation, etc. all obtain good results.

Tasks from other domains: arithmetic, article generation, code writing, etc. are surprisingly strong.

Note: the motivation of prompt-tuning is to solve two pain points of traditional fine-tuning:

  • Bridge the gap between pre-training and fine-tuning: the pre-training task is mainly masked language modeling (MLM), while downstream tasks reintroduce new training parameters and objectives, so the goals of the two stages usually differ considerably; we therefore need to narrow the gap between the pre-training and fine-tuning objectives;

  • Avoid overfitting of the head: since fine-tuning introduces additional parameters to adapt to each task, the model easily overfits when samples are limited, reducing its generalization ability; the overfitting problem of the pre-trained language model therefore has to be addressed.

We have talked about fine-tuning and now about prompting, so what exactly is the difference between the two?

Fine-tuning: the pre-trained language model is adapted to each downstream task. Concretely, auxiliary task losses are added on top of the pre-trained model and training continues, making the model better suited to the downstream task.

Prompting: the downstream task capabilities are moved up into the pre-trained language model. Each task has to be reformulated so that the pre-trained language model can handle it directly. The downstream tasks make more concessions in this process, but the language-model paradigm becomes more unified. Comparing GPT-3 with in-context learning against BERT with fine-tuning: both are given some examples, but fine-tuning trains on the example data and updates the LLM parameters with back-propagation, whereas in-context learning only lets the LLM look at the examples without updating any parameters. In other words, the LLM goes through no learning process at all, and it is puzzling why merely taking a look can have such a magical effect.

2.5 Prompt Learning vs Instruct Learning

What they have in common:

Both instruction learning and prompt learning aim to mine the knowledge that the language model itself already possesses.

Differences:

Prompting stimulates the completion ability of the language model, e.g. generating the second half of a sentence from the first half, or filling in a cloze blank.

Instruction learning stimulates the understanding ability of the language model, using more explicit instructions to make the model take the correct action.

An example:

Prompt learning: complete this sentence [I bought this diamond ring for my lover, she likes it very much, this diamond ring is too ____]

Instruction learning: judge the sentiment of this sentence [I bought this diamond ring for my lover, she likes it very much. A=good; B=fair; C=bad]

3. BERT

3.1 BERT architecture

Pre-training stage: pre-train a large model on massive amounts of data. How is the pre-training stage evaluated? 1. Masked LM (MLM): mask part of the tokens and then predict the masked tokens, scoring the predictions with a cross-entropy loss; 2. NSP (next sentence prediction): predict whether two sentences are actually adjacent in the corpus. (A minimal sketch of the MLM objective follows.)

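A minimal sketch of the masked-LM objective in PyTorch; the vocabulary size, hidden size, masked positions, and the random stand-in for the encoder output are toy assumptions rather than the exact BERT recipe:

```python
# Masked LM in miniature: mask a few tokens, predict them from the encoder
# output, and score the predictions with cross-entropy.
import torch
import torch.nn.functional as F

vocab_size, seq_len, hidden = 1000, 8, 16
token_ids = torch.randint(0, vocab_size, (1, seq_len))

# Mask two positions (BERT masks ~15% of tokens); unmasked positions get
# label -100 so cross_entropy ignores them and only masked tokens carry loss.
mask = torch.zeros(1, seq_len, dtype=torch.bool)
mask[0, [2, 5]] = True
labels = torch.where(mask, token_ids, torch.full_like(token_ids, -100))

encoder_output = torch.randn(1, seq_len, hidden)   # stand-in for the real encoder
lm_head = torch.nn.Linear(hidden, vocab_size)
logits = lm_head(encoder_output)

loss = F.cross_entropy(logits.view(-1, vocab_size), labels.view(-1), ignore_index=-100)
print(loss.item())
```
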
Fine-tuning stage: for a specific task [QA, NER, text classification, etc.], load the pre-trained model and fine-tune it on the downstream task's sample data.

3.2 Downstream tasks

3.3 Pre-training architecture

First, the text token embeddings plus position encodings are fed into the N-layer encoder network, where each layer applies MHA + FFN, etc.

Looking at the Transformer structure, the model parameters consist of two parts, MHA and FFN (a quick parameter count follows this list):

  • multi-head attention (MHA) accounts for about 1/3 of the total parameters;

  • the feed-forward network (FFN) accounts for about 2/3.
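
A quick back-of-the-envelope check of that 1/3 vs 2/3 split, counting only the weight matrices of a single Transformer layer and assuming the usual FFN expansion factor of 4:

```python
# Per-layer weight count (biases and LayerNorm ignored) for hidden size d
# with the common FFN inner size of 4*d.
d = 768                              # e.g. BERT-base hidden size

mha_params = 4 * d * d               # W_Q, W_K, W_V and the output projection W_O
ffn_params = d * 4 * d + 4 * d * d   # two linear layers: d -> 4d and 4d -> d

total = mha_params + ffn_params
print(round(mha_params / total, 2))  # 0.33 -> MHA is about 1/3
print(round(ffn_params / total, 2))  # 0.67 -> FFN is about 2/3
```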

MHA mainly computes the correlation strength between words or pieces of knowledge and integrates global information; it is better at establishing connections between pieces of knowledge and most likely does not store specific knowledge points itself, while the main body of an LLM's knowledge is stored in the FFN structure.

Why is knowledge stored in the FFN structure? This part is more involved, so I will not expand on it here; the conclusion is that "low-level FFNs store low-level knowledge such as lexical and syntactic patterns, while middle- and high-level FFNs store semantic and factual knowledge".

Note: during pre-training, some FFN layer may end up storing outdated or wrong knowledge, because the corpus itself contains errors. How can that be fixed? I am leaving this question open and will come back to it in a follow-up.

3.4 What is fine-tuning?

The pre-trained language model is adapted to each downstream task through fine-tuning. Concretely, auxiliary task losses are added on top of the pre-trained model and training continues, making the model better suited to the downstream task.

Note: this was already explained above; it is repeated here for convenience.

4. GPT vs BERT

BERT

  • Based on the transformer framework encoder module

  • Bidirectional language model (masked language model)

  • Stronger language comprehension: given the surrounding words, predict the word that has been masked out in the middle, e.g. predicting the word 'are'

GPT

  • Based on the transformer framework decoder module

  • One-way language model (Markov chain)

  • Stronger language generation: given the preceding words, predict the next word, e.g. predicting the word 'you'

Note: Masked Multi-Head-Attention means that you cannot see the following words when processing the current word.

For example, when processing [it], the model cannot see the words after [it], but it does attend to the preceding words such as [a, robot]; attention then computes the weighted sum of the vectors of [a, robot, ..., it] with their attention scores, i.e. the weights derived from Q, K, and V.
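
A minimal sketch of that masked (causal) attention pattern in PyTorch: position i may attend only to positions up to i, which is exactly the "cannot see the following words" behaviour described above (toy sizes, single head, random inputs):

```python
# Causal self-attention in miniature: future positions are set to -inf before
# the softmax, so each token attends only to itself and the tokens before it.
import torch
import torch.nn.functional as F

seq_len, d = 5, 8
q, k, v = torch.randn(seq_len, d), torch.randn(seq_len, d), torch.randn(seq_len, d)

scores = q @ k.T / d ** 0.5                   # raw attention scores
future = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
scores = scores.masked_fill(future, float("-inf"))

weights = F.softmax(scores, dim=-1)           # rows sum to 1, upper triangle is 0
output = weights @ v                          # weighted sum of the value vectors
print(weights)
```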

5. The overall framework of chatGPT

5.1 Capability Demonstration

1) Ability to understand natural language

2) Answer the questions

3) Write articles

4) Programming ability

For example, the scenario of optimizing a model.

5) Take the college entrance examination

Details can be found in the link

6) Self-learning and self-improvement

chatGPT has a degree of self-learning ability: through communication with humans and human feedback, when it finds itself wrong it can improve and learn on its own.

To sum up, chatGPT combines capabilities such as natural language processing, machine translation, chatbot, multi-agent, and dialogue engine. It can answer your questions, give useful feedback, generate interesting content, learn new skills, and so on.

5.2 From GPT-3 to chatGPT

In July 2020, OpenAI released the first version of GPT-3, code-named davinci, and the line has continued to evolve since then.

In July 2021, OpenAI released Codex, fine-tuned from a 12-billion-parameter GPT-3 variant, with the code names Code-davinci-001 and Code-cushman-001.

In March 2022, OpenAI released the InstructGPT [instruction fine-tuned] versions, code-named Instruct-davinci-beta and Text-davinci-001.

From April to July 2022, OpenAI released the LM + code pre-trained version with added instruction fine-tuning, code-named Code-davinci-002; at this stage the line had evolved into the GPT-3.5 series.

From May to June 2022, OpenAI released Text-davinci-002, a supervised instruction fine-tuned model based on Code-davinci-002.

In November 2022, Text-davinci-003 and chatGPT were released at the same time; both are instruction fine-tuned models trained with reinforcement learning from human feedback.

Note: text-davinci-003 recovers some of the in-context learning ability that was lost in text-davinci-002 (though it is still worse than code-davinci-002 in this respect), presumably because it mixes language modeling back into fine-tuning, and it further improves zero-shot ability thanks to RLHF. chatGPT, on the other hand, seems to sacrifice almost all in-context learning ability in exchange for the ability to model dialogue history, which makes it better at dialogue.

From 2020 to November 2022, OpenAI's personnel investment across these projects broke down as follows:

The Codex project involved the largest number of ChatGPT team members;

The team is dominated by men, with only 10% women

5.3 chatGPT architecture

When we use chatGPT and type in a question such as "Which is the highest mountain in the world?", the model may produce several candidate answers, e.g. A1: the Himalayas, A2: Mount Fuji. Why? Because chatGPT does not know which answer is best; it can only find the most probable word combinations from its pre-training data. It therefore needs a teacher to teach it which answer is better, and the teacher has accumulated experienced answers, such as Q1: Which is the highest mountain in the world? A1: the Himalayas; Q2: Where is the deepest sea in the world? A2: the Dead Sea.

How does the teacher teach chatGPT? Mainly by teaching it how many points an answer it generates would earn. This process should give chatGPT the ability to tell good answers from bad ones, and to rank the good answers ahead of the bad ones.

In actual interaction with humans, real-time human feedback is used to adapt flexibly and add personalization. Because everyone has "different tastes", the strategy has to be adjusted constantly, and chatGPT keeps evolving through reward and penalty signals to become the assistant that understands you best.

5.4 chatGPT training process

Problems with current AI

Although prompt-based generative AI models have achieved great success [AI writing novels, AI writing code, AI drawing pictures, even AI making videos], generative models are hard to train. Taking language models as an example, most use "autoregressive generation", decoding the content token by token in a loop. During training, the model simply predicts the next token from the context, and cross-entropy computes a loss for each token. Clearly, this token-level loss cannot guide the optimization direction of the model at the level of the overall output.

The current solution

In the training phase, human preferences (human feedback) can be used directly to compute a reward or loss on the model's overall output. This idea leads to the subject discussed here, RLHF (Reinforcement Learning from Human Feedback): using reinforcement learning to optimize the language model directly with human feedback signals.

chatGPT is therefore trained in the following three stages.

Note: in the original figure, a blue arrow indicates that the data is used to train a model, e.g. the pre-trained model, the RM model, or the RL model.

Phase 1: pre-train a language model (LM)

Initialize a pre-trained model [input: a text sequence; output: the next token]. Given an input prompt, the model can generate a piece of text; the corpus data format is [prompt, text].

The key at this stage is to first build a pre-trained model similar to GPT-3 and then do SFT (supervised fine-tuning): load the pre-trained model, provide part of the dataset, let the model generate text for each prompt, have human annotators write or correct the expected output, and finally fine-tune the GPT-3 model on these demonstrations. (A minimal sketch of the SFT loss follows.)
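
A minimal sketch of the SFT objective under the description above: the prompt and the human demonstration are concatenated, and the next-token cross-entropy is computed only on the demonstration tokens (all ids and sizes are toy values, and a random tensor stands in for the decoder output):

```python
# SFT loss in miniature: concatenate prompt + human demonstration, train with
# next-token cross-entropy, and mask the prompt part so only the demonstration
# tokens contribute to the loss.
import torch
import torch.nn.functional as F

vocab_size, hidden = 1000, 16
prompt_ids = torch.tensor([[11, 42, 7]])    # toy prompt tokens
demo_ids = torch.tensor([[88, 23, 5, 2]])   # toy human-written demonstration
input_ids = torch.cat([prompt_ids, demo_ids], dim=1)

labels = input_ids.clone()
labels[:, : prompt_ids.size(1)] = -100      # ignore the prompt positions

hidden_states = torch.randn(1, input_ids.size(1), hidden)  # stand-in for the decoder
lm_head = torch.nn.Linear(hidden, vocab_size)
logits = lm_head(hidden_states)

# Shift so that the logits at position t predict the token at position t + 1.
shift_logits = logits[:, :-1, :].reshape(-1, vocab_size)
shift_labels = labels[:, 1:].reshape(-1)
loss = F.cross_entropy(shift_logits, shift_labels, ignore_index=-100)
print(loss.item())
```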

Phase 2: collect comparison data and train a reward model (RM)

The goal of the reward model (RM) is to characterize whether the model's output is good from a human point of view: the input is [prompt, text generated by the model] and the output is a scalar score characterizing the quality of the text. Annotators rank the texts generated by the initial language model [no absolute scores, just a relative order].

The approach is pair-wise: for the same prompt, two generated texts are compared to decide which one is better. These ranking results are finally normalized into scalar signals (i.e. point-wise) and fed into model training.

RM model choice: it is based on a pre-trained language model (LM); the input is text and the output is a scalar representing the ranking score of the result.

Ranking model: trained with a pair-wise loss (a minimal sketch follows).
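
A minimal sketch of the pair-wise reward-model loss under these assumptions: for the same prompt the RM scores the human-preferred answer and the worse answer, and the loss pushes the preferred score above the other; the tiny linear head stands in for "LM encoder + scalar head":

```python
# Pair-wise RM loss in miniature:
#   loss = -log(sigmoid(r_chosen - r_rejected))
# i.e. maximize the margin between the preferred and the rejected answer.
import torch
import torch.nn.functional as F

hidden = 16
reward_head = torch.nn.Linear(hidden, 1)     # stand-in for LM encoder + scalar head

chosen_repr = torch.randn(4, hidden)         # pooled features of (prompt + preferred answer)
rejected_repr = torch.randn(4, hidden)       # pooled features of (prompt + worse answer)

r_chosen = reward_head(chosen_repr).squeeze(-1)
r_rejected = reward_head(rejected_repr).squeeze(-1)

loss = -F.logsigmoid(r_chosen - r_rejected).mean()
print(loss.item())
```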

Why introduce the RM model?

The reward used in reinforcement learning does not come entirely from the RM model: one part is the score the RM gives according to human preferences, and the other part penalizes the gap between the GPT being trained with RL and the original SFT version. We do not want that gap to grow too large, so a bias (penalty) term is added; the worry is that during RL training the model might learn to please humans in some tricky way instead of answering the question correctly. Also, the reward function only gives a reward after an action (a complete answer) is finished; no reward is given in the middle of the process.

Phase 3: Fine-tuning the LM via reinforcement learning

The core idea: use the RM to score the model's outputs, feed the score back with RL, keep comparing against the initial GPT model, and iterate continuously.

Policy: the language model itself; it receives a prompt as input and outputs a sequence of text (or a probability distribution over texts);

Action space: all combinations of vocabulary tokens over all output positions (a single position typically has about 50k candidate tokens);

Observation space: the possible input token sequences (prompts), i.e. all combinations of vocabulary tokens over all input positions, which is obviously huge;

Reward function: the trained RM model, combined with some policy-level constraints.

How the reward is computed, specifically:

Sample a prompt from the prepared data and feed it to both the initial (frozen) language model and the language model (policy) currently being trained, obtaining the two models' outputs y1 and y2. Use the reward model RM to score y1 and y2 and judge which is better. The difference in scores can serve as a signal for training the policy parameters, and this signal is generally combined with a KL-divergence term to determine the size of the "reward/penalty": the more y2's score exceeds y1's, the larger the reward, and conversely the larger the penalty. This reward signal reflects the overall generation quality of the text. (A minimal sketch of a KL-shaped reward follows.)
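
A minimal sketch of a KL-shaped reward of the kind described above, in the form commonly used for RLHF (reward = RM score minus beta times the log-probability gap between the trained policy and the frozen reference model); all numbers are toy values:

```python
# Reward for one sampled answer: RM score minus a KL-style penalty that keeps
# the trained policy close to the frozen initial/reference model.
import torch

beta = 0.02                                  # KL penalty coefficient (illustrative)
rm_score = torch.tensor(1.7)                 # scalar score from the reward model

# Per-token log-probs of the generated answer under both models (toy values).
logprob_policy = torch.tensor([-1.2, -0.8, -2.0, -0.5])
logprob_ref = torch.tensor([-1.0, -1.1, -1.8, -0.7])

kl_penalty = (logprob_policy - logprob_ref).sum()  # sample-level KL estimate
reward = rm_score - beta * kl_penalty
print(reward.item())  # PPO then uses this scalar to update the policy
```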

With this reward, the model parameters can be updated according to the Proximal Policy Optimization (PPO) algorithm.

Iteratively updating the reward model (RM) and the policy makes the RM describe the model's output quality more and more accurately, while the policy's outputs gradually move away from the initial model, producing text that is more and more in line with human expectations.

5.5 Datasets

Since OpenAI has not disclosed the details of chatGPT's training data, the following details are taken from the InstructGPT paper.

1) Data source category

  • General generated data accounted for 45.6%, such as: tell a story

  • Open QA questions and answers accounted for 12.4%, such as: what gift to give to your lover on Valentine's Day

  • Brainstorming accounted for 11.2%, e.g.: list 5 workplace suggestions

  • Chat accounted for 8.4%, e.g.: which industry will be the most profitable in 2023

2) Training sample size

  • SFT data, about 13k prompts; source: API requests and manual annotation

  • RM data, about 33k prompts; source: API requests and manual annotation

  • PPO data, about 31k prompts; source: API requests only

For details, see the table below

3) Manual annotation

A team of about 40 annotators was hired through Upwork and ScaleAI.

The annotators were divided into two groups, one that had been trained [helped to answer questions and aligned on preferences for the same task] and one that had not; comparison showed a high agreement of 73±4% between the two groups.

The generated data is required to be "helpful, honest, and harmless", as described below.

5.6 Chain-of-thought

"Chain of thought" has been a high-frequency term recently, so what exactly is it? Chain-of-thought is a kind of discrete prompt learning; more specifically, it is in-context learning with a large model (that is, without any training, examples are prepended to the current input and the model reads all of this text at once to produce the output). Compared with the traditional in-context learning format (x1, y1, x2, y2, ..., then the test input):

a chain-of-thought prompt does not ask the model to predict y directly; it also predicts the "thinking process" r that leads to y (many researchers call this the rationale). Of course, we do not need these thought processes in the end; they are only there to prompt the model toward a better answer, and only the final answer is kept. (A minimal prompt sketch follows.)
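
A minimal sketch of the difference, with made-up arithmetic examples in the style of the chain-of-thought paper: a standard in-context prompt shows only (x, y) pairs, while a chain-of-thought prompt also spells out the rationale r before each y:

```python
# Standard in-context prompt: example questions followed directly by answers.
standard_prompt = (
    "Q: Roger has 5 balls and buys 2 cans of 3 balls each. How many balls does he have?\n"
    "A: 11\n"
    "Q: The cafeteria had 23 apples, used 20 and bought 6 more. How many apples are left?\n"
    "A:"
)

# Chain-of-thought prompt: each example answer includes the reasoning steps (the
# rationale r) before the final answer y, nudging the model to do the same.
cot_prompt = (
    "Q: Roger has 5 balls and buys 2 cans of 3 balls each. How many balls does he have?\n"
    "A: Roger starts with 5 balls. 2 cans of 3 balls is 6 balls. 5 + 6 = 11. The answer is 11.\n"
    "Q: The cafeteria had 23 apples, used 20 and bought 6 more. How many apples are left?\n"
    "A:"
)

print(cot_prompt)  # only the final "The answer is ..." part of the completion is kept
```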

The authors annotated chains of thought for several datasets originally used for in-context learning and ran experiments, finding that doing so significantly improves performance (left figure in the original post), and that the improvement appears abruptly as scale grows (right figure); they later published work calling this property "emergent".

So where does chatGPT get such a powerful ability?

After reading answers on Zhihu, I find the origin of this ability quite confusing.

I discussed the question with a colleague one day, and he told me that this kind of "emergent ability" is most likely acquired through training on massive data, and that there is no particularly good explanation of the reason.

Later I thought about it: could it be that chatGPT was trained on code data? Or is it that the task evaluation metric is simply not smooth? For instance, a predicted string that mostly matches the ground truth but is not an exact match still counts as an error, so once the model is large enough for its outputs to be exactly right, the ability appears to "emerge". Both explanations are plausible [code data being the more likely one]. The reasoning: the chain-of-thought ability of the first-generation GPT-3 is weak or even non-existent, while code-davinci-002 and text-davinci-002 are two models with sufficiently strong chain-of-thought reasoning. So, most likely, the ability to use a chain of thought for complex reasoning is a magical by-product of code training. The reasons are as follows:

  • The original GPT-3 was not trained on code, and it cannot do chain-of-thought reasoning.

  • Although text-davinci-001 was instruction fine-tuned, the first version of the chain-of-thought paper reported that its chain-of-thought reasoning ability is very weak; so instruction fine-tuning is probably not the reason chains of thought exist, and code training is the most likely reason the model can do chain-of-thought reasoning.

  • PaLM has 5% code in its training data, and it can do chain-of-thought reasoning.

  • The Codex paper used 159 GB of code data, roughly 28% of the 570 GB of training data used for the original GPT-3. code-davinci-002 and its subsequent variants can do chain-of-thought reasoning.

All of the above observations are correlations between code training and reasoning ability / chain of thought. This correlation is a very interesting question for the research community, but it is still not well understood, and there is still no conclusive evidence that code training is the cause of CoT and complex reasoning; the origin of the chain of thought remains an open research question. Furthermore, another possible by-product of code training is long-range dependency: as Peter Liu points out, "next-word prediction in natural language is usually very local, whereas code often requires longer dependencies, for example matching opening and closing brackets or referring to a distant function definition". I would add: because of class inheritance in object-oriented programming, code may also help the model build the ability to encode hierarchies.

Note: procedure-oriented programming is similar to the process of human beings solving tasks step by step, and object-oriented programming is similar to the process of decomposing complex tasks into multiple simple tasks.

6. Future thinking

The chatGPT released by OpenAI has overturned a lot of assumptions; it turns out the AI industry can still be played this way, and that is especially true for NLP practitioners. I used to think that "BERT pre-training + downstream fine-tuning" was already the universal recipe: whenever the company had a need [scene classification, translation, information extraction, etc.], you would pre-train or load a public BERT model, fine-tune it downstream on company data, have it running in minutes, tweak a few parameters, and be done ^_^. It felt as if the field had hit a ceiling: if the results were poor, blame the data and collect more, or say the scenario is not a good fit; one could always find an explanation. Nobody expected that the "GPT + RM + RL" iron-triangle combination could produce such a disruptive product. Note: "Google + Facebook + Microsoft + other companies" had explored similar directions before, but with mediocre results. Why? It can be understood along two dimensions, "company + technology".

Company: OpenAI has a fairly clear self-positioning, which is to take on problems at the scale of humanity, so its goals are set high and it explores and pushes forward unswervingly. chatGPT could be built without external interference; another reason is that people are tolerant of a start-up in a way they would not be of a big company.

Technology: OpenAI firmly chose the generative autoregressive GPT model. In fact, GPT-1 came out earlier than BERT. Even after BERT proved that a bidirectional language model beats GPT on NLP understanding tasks, OpenAI stayed on the generation road and kept trying zero-shot / few-shot approaches, whose results were far worse than BERT + fine-tuning at the time. Then GPT-3 appeared with powerful zero-shot / few-shot prompting abilities, the InstructGPT exploration increased the team's confidence, and eventually the chatGPT product appeared. Think about it: during this period, how many people in our country were building on GPT, and how many were choosing "BERT + fine-tuning"? Why does the generative autoregressive model work? The following is quoted from Zhang Junlin, essentially verbatim:

"First of all, Google's T5 model unified the external form of natural language understanding and natural language generation tasks. In the figure (from the original article), the red example is a text classification problem and the yellow one is a regression or classification problem judging sentence similarity; both are typical natural language understanding problems. In the T5 model, these understanding problems share the same input-output form as generation problems: a classification problem can be converted into having the LLM generate the string corresponding to the category, so that understanding and generation tasks are completely unified in form.

This shows that generation tasks can subsume understanding tasks in their form of expression, whereas the reverse is difficult. The advantage is that the same generative LLM can solve almost every NLP problem, while a BERT-style model still cannot handle generation tasks well. That being so, we naturally tend toward generative models; this is the first reason.

The second reason: to do well at zero-shot or few-shot prompting, you must adopt the GPT mode. Studies (see On the Role of Bidirectionality in Language Model Pre-Training) have shown that if downstream tasks are solved by fine-tuning, the BERT mode beats the GPT mode; if they are solved by zero-shot / few-shot prompting, the GPT mode beats the BERT mode. This indicates that generative models are naturally better suited to zero-shot / few-shot prompting, while the BERT mode has an inherent disadvantage there. This is the second reason."

What opportunities are there for Chinese companies? Is there any lesson to take home?

I am a little pessimistic: if someone wants to build a Chinese version of chatGPT, no more than three or four companies could pull it off; this is something only big companies can do [many people are optimistic about Baidu], and entrepreneurs had better sleep on it. The reason is that "data + compute + talent + scenarios" are all indispensable.

Data: massive data of many different kinds is required.

Compute: only a handful of companies in China have more than 1,000 high-bandwidth [200 GB/s] A100-class GPU machines, or the ability to train on a corpus no smaller than chatGPT's while reducing the parameter count [to roughly 1/4 of 175B] without letting the model quality fall below chatGPT's.

Talent: the customary model in China is "one big name leading a group of juniors"; look at the OpenAI team, where everyone is a leader in the field. Note: talent in reinforcement learning deserves particular attention.

Scenarios: thinking from the perspective of human needs, what products can be built in industries such as clothing, food, housing, and travel? The following scenarios are worth considering:

  • Intelligent customer service and Q&A: provide users with personalized question-answering services, such as sentiment analysis, question classification, intelligent Q&A, recommendation, and intelligent chatbots.

  • Social media marketing: use social media platforms for marketing, use various social media tools to promote brands, products and services, and increase company and brand awareness.

  • Content Creation: Create original articles, pictures, videos and other content, so as to attract more users and build up the popularity of the company.

  • Network research: Through network research, we can understand consumer behavior and better meet consumer needs.

  • E-commerce: use the Internet to set up online shopping malls, sell products and services, and expand the influence of enterprises.

  • Mobile application development: develop application software for smartphones, tablet computers, smart devices, etc., to meet users' needs for mobile Internet.

  • Network Security: Provide effective network security protection and management services to ensure the security of corporate networks.

So if you really want to do this, how should you implement it?

Step 1: pre-train a generative large language model (LM); one can start from the nanoGPT code and modify it.

Step 2: train the reward model (RM). The difficulty is labeling the scenario data; for the loss, refer to the pair-wise loss training used in recommendation systems.

Step 3: the reinforcement learning (PPO) model. I am still exploring and implementing this part and am not yet qualified to comment, but one can try to recruit reinforcement learning talent.

Step 4: build an engineering team; then chatGPT can be made. Personally I think OpenAI's engineering team is very strong.

Once OpenAI opens up its API, there is even more that can be done.

  • Follow Microsoft's lead, e.g. a ChatGPT-powered Bing search or a ChatGPT-enabled Azure cloud platform

  • Notion products

  • Private enterprise customization, such as Vanke's cooperation with Microsoft Azure OpenAI and its customer-feedback analysis platform powered by GPT-3

So there are many possible products, especially ones suited to China's own conditions. But I think the training corpus data deserves attention: the public images and text that can be crawled are limited, and the real weapon is the ability to produce industry data in a specific domain.

GPT-4 is coming and is turning to multi-modality, which is even more disruptive; combining image + audio + text opens up even more products. Google released the 562-billion-parameter multi-modal model PaLM-E, which handles robot control, and Microsoft released Visual ChatGPT, where visual models let ChatGPT chat smoothly about images.



Origin blog.csdn.net/stark_summer/article/details/129479791