Decoding strategies for dialogue systems (Top-k & Top-p & Temperature)

Table of contents

1. Case analysis

2. top-k sampling

3. top-p sampling

4. Temperature sampling

5. Joint sampling (top-k & top-p & Temperature) 

6. Supplement

6.1 Beam Search

6.2 Temperature parameter introduction


1. Case analysis

In natural language tasks, we usually use a pre-trained large model (such as GPT) to generate output text (such as an answer or a continuation) from a given input text (such as a question or the beginning of a sentence). To generate the output text, we let the model predict tokens one by one until a termination condition is reached (such as a punctuation mark or a maximum length). At each step, the model produces a probability distribution representing its prediction for the next word.
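
As a rough illustration of this loop, here is a minimal sketch; `model` and `choose_next_token` are hypothetical placeholders, where `model` returns a probability distribution over the vocabulary for the current prefix and `choose_next_token` is whichever decoding strategy we plug in below.

```python
# Minimal sketch of an autoregressive generation loop (placeholders, not a real API).
def generate(model, choose_next_token, prompt_tokens, eos_token, max_len=50):
    tokens = list(prompt_tokens)
    for _ in range(max_len):                  # stop at the maximum length ...
        probs = model(tokens)                 # distribution over the next token
        next_token = choose_next_token(probs) # greedy, random sampling, top-k, ...
        tokens.append(next_token)
        if next_token == eos_token:           # ... or at a termination token
            break
    return tokens
```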

Suppose we train a model that describes personal life preferences, and we want it to complete the sentence "I like beautiful ___ best". The model might give the following probability distribution:

  • Girl: 0.664
  • Shoes: 0.199
  • Elephant: 0.105
  • Watermelon: 0.032

So, how should we choose the next word from this probability distribution? Here are a few commonly used methods:

  • Greedy Decoding: directly select the word with the highest probability. This method is simple and efficient, but the generated text tends to be monotonous and repetitive.
  • Random Sampling: randomly select a word according to the probability distribution. This increases the diversity of the generated text, but the result may be incoherent or meaningless. (A sketch contrasting these two strategies follows this list.)
  • Beam Search: at each time step, instead of retaining only the word with the highest current probability, sort the candidates by probability from high to low and keep the first num_beams of them. This balances the quality and diversity of generation, but it still struggles to avoid repetition. Beam search is introduced in detail in Section 6.1.
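
As a small sketch of the first two strategies, the snippet below applies greedy decoding and random sampling to the toy distribution from the example above (the words and probabilities are only illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

words = ["girl", "shoes", "elephant", "watermelon"]
probs = np.array([0.664, 0.199, 0.105, 0.032])

# Greedy decoding: always pick the single most probable word.
greedy_choice = words[int(np.argmax(probs))]

# Random sampling: draw a word according to the full distribution.
sampled_choice = rng.choice(words, p=probs)

print("greedy :", greedy_choice)   # always "girl"
print("sampled:", sampled_choice)  # usually "girl", occasionally another word
```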

Given the respective problems of the above methods, we need to think about how to make the replies generated by the model more lively and varied. To this end, researchers introduced top-k sampling, top-p sampling, and temperature sampling.

2. top-k sampling

In the above example, if the greedy strategy is used, the selected word will always be "girl". Top-k sampling is an optimization of this greedy strategy: it samples randomly from the k most probable words, so that words with lower probabilities also have a chance to be selected. In many cases, the randomness introduced by this sampling helps improve the quality of the generation.

Here is an example of top-k sampling:

In the example above, we set k = 3, so the model will only select one word from girl, shoes, and elephant, and will never consider watermelon. Specifically, the model first keeps the three words with the highest likelihood values, then recomputes the sampling probabilities from the likelihood values of these three words, and finally samples according to those probabilities.

By adjusting the size of k, the size of the sampling list can be controlled. The "greedy strategy" is actually top-k sampling with k = 1.
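
A minimal sketch of top-k sampling, assuming we start from raw logits (the logit values below are made up for illustration): keep the k largest logits, renormalise them with a Softmax, and sample from the result.

```python
import numpy as np

rng = np.random.default_rng(0)

def top_k_sample(logits: np.ndarray, k: int) -> int:
    """Return the index of a token sampled from the k highest logits."""
    top_indices = np.argsort(logits)[-k:]        # indices of the k largest logits
    top_logits = logits[top_indices]
    top_logits = top_logits - top_logits.max()   # shift for numerical stability
    probs = np.exp(top_logits) / np.exp(top_logits).sum()
    return int(rng.choice(top_indices, p=probs))

# Illustrative logits for girl, shoes, elephant, watermelon (made-up values).
logits = np.array([3.0, 1.8, 1.2, 0.1])
print(top_k_sample(logits, k=3))                 # never returns index 3 (watermelon)
```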

To summarize, top-k sampling has the following advantages:

  • It can control the diversity and quality of generation by adjusting k. Generally speaking, the larger k is, the higher the diversity of the generated text but the lower its quality; the smaller k is, the higher the quality but the lower the diversity. We can therefore choose an appropriate k for different tasks and scenarios.
  • It can be used in conjunction with other decoding strategies, such as Temperature Scaling, Repetition Penalty, Length Penalty, etc., to further improve the generation results.

But top-k sampling also has some disadvantages, such as:

  • It may result in generated text that does not follow common sense or logic. This is because top-k sampling only considers the probability of words, but does not consider the semantic and grammatical relationships between words.
  • It can result in generated text that is too plain or boring. This is because top-k sampling only considers the k words with the highest probability and ignores lower-probability but meaningful or creative words. For example, if the input text is "I like to eat", then apples, dumplings, and hot pot are all reasonable completions, but they are not necessarily the most interesting or surprising ones; the user may prefer some special or novel food.

Therefore, we usually consider top-k sampling combined with other strategies, such as top-p sampling.

3. top-p sampling

Top-k sampling has a flaw: what is the optimal value of k? This is very difficult to determine. As a result, a strategy that dynamically sets the size of the candidate list emerged, namely top-p sampling, also known as Nucleus Sampling. This is also the sampling method used by ChatGPT.

The idea of top-p sampling is to set a probability threshold p in advance. At each step, the candidate words are sorted by probability from high to low, and words are added to a candidate set one by one. The construction rule is: if adding the current word keeps the cumulative probability less than or equal to p, the word is put into the set; if adding it would make the cumulative probability exceed p, the word is discarded and the construction stops. The model then samples a word from this set and ignores all words outside it.

The above figure shows the effect of top-p sampling with p = 0.9. It is worth noting that top-k sampling and top-p sampling can be used at the same time; in that case, top-p is applied after top-k.
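
Here is a minimal sketch of top-p sampling that follows the construction rule described above (add words while the cumulative probability stays at or below p); the probabilities reuse the toy distribution from the earlier example.

```python
import numpy as np

rng = np.random.default_rng(0)

def top_p_sample(probs: np.ndarray, p: float) -> int:
    """Sample a token index from the set whose cumulative probability is <= p."""
    order = np.argsort(probs)[::-1]              # most probable words first
    cumulative = np.cumsum(probs[order])
    nucleus = order[cumulative <= p]
    if nucleus.size == 0:                        # always keep at least the top word
        nucleus = order[:1]
    nucleus_probs = probs[nucleus] / probs[nucleus].sum()
    return int(rng.choice(nucleus, p=nucleus_probs))

# The toy distribution from the earlier example: girl, shoes, elephant, watermelon.
probs = np.array([0.664, 0.199, 0.105, 0.032])
print(top_p_sample(probs, p=0.9))                # only index 0 (girl) or 1 (shoes)
```

With p = 0.9, only "girl" and "shoes" survive, which matches the joint-sampling example in Section 5.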

4. Temperature sampling

Temperature sampling is inspired by statistical thermodynamics, where a higher temperature means that higher-energy (and therefore less likely) states are more easily encountered. In a probability model, the logits play the role of (negative) energies. We can implement temperature sampling by dividing the logits by the temperature and then feeding them into the Softmax function to obtain the sampling probabilities.

The temperature in Temperature sampling is related to the Boltzmann distribution , and its formula is as follows:

\rho_{i} = \frac{1}{Q} e^{-\epsilon_{i}/kT} = \frac{e^{-\epsilon_{i}/kT}}{\sum_{j=1}^{M} e^{-\epsilon_{j}/kT}}

where \rho_{i} is the probability of state i, \epsilon_{i} is the energy of state i, k is Boltzmann's constant, T is the temperature of the system, and M is the number of quantum states the system can reach.

Readers with a machine learning background will find this formula familiar at first sight: it has the same form as the Softmax function:

Softmax(z_{i}) = \frac{e^{z_{i}}}{\sum_{c=1}^{C}e^{z_{c}}}

Essentially, temperature sampling adds a temperature parameter T to the Softmax function: the logits are scaled by the temperature and then passed to the Softmax to obtain a new probability distribution.
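
Written out, this temperature-scaled Softmax is:

Softmax_{T}(z_{i}) = \frac{e^{z_{i}/T}}{\sum_{c=1}^{C}e^{z_{c}/T}}

When T = 1 this reduces to the ordinary Softmax; T < 1 sharpens the distribution towards the most likely word, while T > 1 flattens it towards a uniform distribution.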

In the earlier example "I like beautiful ___", the initial temperature is T = 1. Let's see what happens to the probabilities when T takes different values:

We can clearly see from the figure that as the temperature decreases, the model becomes more and more inclined to choose "girl"; conversely, as the temperature increases, the distribution becomes more and more uniform. When T = 50, the probability of choosing "watermelon" is almost the same as that of choosing "girl".

Generally speaking, temperature is said to be related to the "creativity" of the model, but this is not strictly true. Temperature simply reshapes the probability distribution over words. The net macroscopic effect is that the model is more deterministic at lower temperatures and less deterministic at higher temperatures.
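
A minimal sketch of temperature scaling, again over illustrative (made-up) logits; it only reshapes the distribution, as described above:

```python
import numpy as np

def temperature_probs(logits: np.ndarray, temperature: float) -> np.ndarray:
    """Softmax of logits scaled by 1/temperature."""
    scaled = logits / temperature
    scaled = scaled - scaled.max()               # shift for numerical stability
    exp = np.exp(scaled)
    return exp / exp.sum()

# Illustrative logits for girl, shoes, elephant, watermelon (made-up values).
logits = np.array([3.0, 1.8, 1.2, 0.1])
for T in (0.1, 1.0, 50.0):
    print(T, np.round(temperature_probs(logits, T), 3))
# T = 0.1  -> almost all probability mass on "girl"
# T = 1.0  -> the unscaled Softmax distribution
# T = 50.0 -> close to a uniform distribution
```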

5. Joint sampling (top-k & top-p & Temperature) 

Usually we use top-k, top-p, and Temperature in combination, in the order top-k -> top-p -> Temperature.

Let’s still take the previous example.

First, we set top-k = 3, which means we keep the 3 words with the highest probability. This retains the three words girl, shoes, and elephant:

  • Girl: 0.664
  • Shoes: 0.199
  • Elephant: 0.105

Next, we use the top-p method to construct the candidate set, which selects the two words girl and shoes. Then we apply a temperature of T = 0.7 and renormalize with the Softmax, and the probabilities of these two words become:

  • Girl: 0.660
  • Shoes: 0.340

Then, we can randomly sample from the above distribution and select a word as the final generation result.
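
Putting the three steps together, here is a minimal sketch of the joint pipeline that reproduces the worked numbers above; following the example, the temperature Softmax is applied directly to the scores that survive top-k and top-p.

```python
import numpy as np

rng = np.random.default_rng(0)

words = np.array(["girl", "shoes", "elephant", "watermelon"])
scores = np.array([0.664, 0.199, 0.105, 0.032])

# 1) top-k: keep the 3 highest-scoring words (girl, shoes, elephant).
k = 3
keep = np.argsort(scores)[::-1][:k]
words, scores = words[keep], scores[keep]

# 2) top-p: keep the prefix whose cumulative score stays <= p (girl, shoes).
p = 0.9
mask = np.cumsum(scores) <= p
words, scores = words[mask], scores[mask]

# 3) temperature: Softmax over the surviving scores with T = 0.7, giving
#    roughly 0.660 for "girl" and 0.340 for "shoes", as in the example.
T = 0.7
probs = np.exp(scores / T) / np.exp(scores / T).sum()

print(dict(zip(words, np.round(probs, 3))))
print("sampled:", rng.choice(words, p=probs))
```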

6. Supplement

6.1 Beam Search

This section serves as supplementary content for interested readers.

Beam Search is an improvement on the greedy strategy. The idea is simple: slightly widen the scope of the search. At each time step, instead of retaining only the single word with the highest current probability, we retain the num_beams best candidates. When num_beams = 1, beam search degenerates into greedy search.

The figure below gives a concrete example. At each time step there are 5 possible outputs: A, B, C, D, and E. In the figure num_beams = 2, which means that at every time step we retain the two sequences with the highest conditional probability up to the current step.

  • In the first time step, A and C are the two best candidates, so the two results [A] and [C] are kept and the other three are discarded;
  • In the second step, generation continues from these two results. Branch A yields 5 candidate sequences [AA], [AB], [AC], [AD], [AE], and branch C likewise yields 5. These 10 candidates are ranked together and the best two are kept, namely [AB] and [CE] in the figure;
  • The third step proceeds in the same way: the best two of the new 10 candidates are kept, and finally the two results [ABD] and [CED] are obtained.

Note that at each step beam search has to examine num_beams times as many candidates as greedy search, so it is a method that trades time for performance.
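
For interested readers, here is a minimal sketch of beam search over a toy stand-in for a language model; `next_token_probs` is a hypothetical placeholder that always returns the same fixed distribution, so the surviving sequences will not match the [ABD]/[CED] example in the figure.

```python
import numpy as np

VOCAB = ["A", "B", "C", "D", "E"]

def next_token_probs(prefix):
    """Toy stand-in for a language model: a fixed next-token distribution."""
    return np.array([0.4, 0.25, 0.2, 0.1, 0.05])

def beam_search(num_beams=2, max_len=3):
    beams = [((), 0.0)]                          # (sequence, log-probability)
    for _ in range(max_len):
        candidates = []
        for seq, logp in beams:
            probs = next_token_probs(seq)
            for token, prob in zip(VOCAB, probs):
                candidates.append((seq + (token,), logp + np.log(prob)))
        # Rank all num_beams * |VOCAB| candidates and keep the best num_beams.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:num_beams]
    return beams

for seq, logp in beam_search():
    print("".join(seq), round(float(np.exp(logp)), 4))
```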

6.2  Temperature parameter introduction

Temperature is a parameter used to control the level of creativity of AI-generated text. By adjusting the temperature, you can influence the model's probability distribution to make the text more focused or more diverse.

Consider the following example: an AI model must complete the sentence "A cat is ____." The candidate next tokens have the following probabilities:

  • Play: 0.5
  • Sleep: 0.25
  • Eat: 0.15
  • Drive: 0.05
  • Fly: 0.05

  • Low temperature (e.g. 0.2): the model becomes more focused and deterministic, almost always choosing the token with the highest probability, such as "play".
  • Medium temperature (e.g. 1.0): the model maintains a balance between creativity and focus, selecting tokens according to their probabilities without an obvious bias, e.g. "play", "sleep", or "eat".
  • High temperature (e.g. 2.0): the model becomes more adventurous, increasing the chance of choosing less likely tokens such as "drive" and "fly". (A numerical sketch follows this list.)
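
A small numerical sketch of the cat example under these temperatures; since the example gives probabilities rather than logits, their logarithms are treated as logits here, which is an assumption made only for illustration.

```python
import numpy as np

tokens = ["play", "sleep", "eat", "drive", "fly"]
probs = np.array([0.5, 0.25, 0.15, 0.05, 0.05])

for T in (0.2, 1.0, 2.0):
    scaled = np.exp(np.log(probs) / T)   # temperature applied to log-probabilities
    scaled /= scaled.sum()
    print(T, dict(zip(tokens, np.round(scaled, 3))))
# T = 0.2 -> "play" gets almost all of the probability mass
# T = 1.0 -> the original distribution is unchanged
# T = 2.0 -> "drive" and "fly" rise from 0.05 to roughly 0.11 each
```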

If the temperature is low, the probability of sampling tokens other than the one with the highest log probability is small, and the model tends to output the most "correct" text, which is rather boring and shows little variation.

If the temperature is high, the model can select tokens that do not have the highest probability. The generated text is more diverse, but grammatical errors and nonsense are more likely.

References:

Detailed explanation of ChatGPT model sampling algorithm-Alibaba Cloud Developer Community

Large model text generation - decoding strategy (Top-k & Top-p & Temperature) - Zhihu
