Foreword
- This post briefly presents the main conclusions of the scaling-law paper. Original paper: Scaling Laws for Neural Language Models
- Personally, I think the specific values of the constants in the formulas matter less than the relationships and proportions between the different factors.
Summary
- Performance depends strongly on scale, weakly on model shape
  - scale: number of parameters $N$, dataset size $D$, compute budget $C$
  - shape: model depth, width, number of self-attention heads, etc.
- Smooth power laws: for each of the three factors $N$, $D$, $C$, when the other two are unconstrained, model performance has a power-law relationship with that factor
- Universality of overfitting: as long as $N$ and $D$ are scaled up together, performance improves predictably; but if one is fixed while the other grows, performance degrades. The penalty depends roughly on the ratio $N^{0.74}/D$, which means that every time the model is scaled up 8x, the data only needs to grow about 5x to avoid performance degradation (overfitting)
- Universality of training: with the number of model parameters fixed, training is predictable: by extrapolating the early part of the training curve, one can roughly estimate how the model will perform after training much longer
- Transfer improves with test performance: when the model is evaluated on text from a different distribution, the result correlates strongly with the validation loss, offset by a roughly constant amount. This suggests that using the validation loss as an evaluation metric is reasonable
- Sample efficiency: large models reach the same performance in fewer optimization steps and with less data (Figure 4)
- Convergence is inefficient: with a fixed compute budget but no limit on model size or data, the best performance is obtained with a large model that is still far from convergence. This maximally compute-efficient training is more sample-efficient than training a small model to convergence, and the data requirement grows only slowly with compute, as $D \sim C^{0.27}$
- Optimal batch size: the ideal batch size has a power-law relationship with the loss and is also determined by the gradient noise scale
In general, LLM performance improves smoothly and predictably as model size, data volume, and compute increase.
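The "universality of training" point above can be illustrated with a small sketch: fit a power law to a few early loss measurements in log-log space and extrapolate to a longer run. The training curve here is synthetic, generated from made-up constants, not data from the paper.

```python
import numpy as np

# Synthetic "early training curve": loss(S) = 10 * S**(-0.1).
# The constants 10 and 0.1 are invented purely for illustration.
steps = np.array([100.0, 200.0, 400.0, 800.0, 1600.0])
loss = 10.0 * steps ** -0.1

# A power law is a straight line in log-log space: log L = log a - b * log S
slope, intercept = np.polyfit(np.log(steps), np.log(loss), 1)
a_fit, b_fit = np.exp(intercept), -slope

# Extrapolate to a run 10x longer than anything observed
predicted_loss = a_fit * 16000.0 ** -b_fit
print(f"fitted exponent ~ {b_fit:.3f}, predicted loss at 16k steps ~ {predicted_loss:.3f}")
```

On real training curves the fit is of course noisy, but the same log-log regression is how such extrapolations are usually done.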
Summary of Scaling Laws
When performance is bottlenecked by only one of the non-embedding parameter count $N$, the dataset size $D$, or the compute budget $C_{min}$, the test loss of an autoregressive Transformer can be predicted with a power law.
- When model parameters are the limit: $L(N) = \left( N_c / N \right)^{\alpha_N}$, with $\alpha_N \sim 0.076$, $N_c \sim 8.8 \times 10^{13}$
- When the amount of data is the limit: $L(D) = \left( D_c / D \right)^{\alpha_D}$, with $\alpha_D \sim 0.095$, $D_c \sim 5.4 \times 10^{13}$
- When the amount of compute is the limit: $L(C_{min}) = \left( C_c^{min} / C_{min} \right)^{\alpha_C^{min}}$, with $\alpha_C^{min} \sim 0.050$, $C_c^{min} \sim 3.1 \times 10^8$ (PF-days)
The power-law exponents $\alpha_N, \alpha_D, \alpha_C^{min}$ represent how strongly model performance improves as we increase model parameters, data volume, and compute respectively (the bigger, the better); the specific values of $N_c, D_c, C_c^{min}$ have no practical significance.
- As the exponents show, scaling data helps the most, followed by model parameters, and finally compute
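The three single-factor laws can be turned into a few lines of Python. The constants are the fitted values reported in the paper; as noted above, only the exponents carry real meaning:

```python
# Single-factor power laws from the paper: L(X) = (X_c / X) ** alpha_X
ALPHA_N, N_C = 0.076, 8.8e13   # N: non-embedding parameters
ALPHA_D, D_C = 0.095, 5.4e13   # D: dataset size in tokens
ALPHA_C, C_C = 0.050, 3.1e8    # C_min: compute in PF-days

def loss_from_params(n: float) -> float:
    return (N_C / n) ** ALPHA_N

def loss_from_data(d: float) -> float:
    return (D_C / d) ** ALPHA_D

def loss_from_compute(c: float) -> float:
    return (C_C / c) ** ALPHA_C

# Doubling a factor lowers the loss by 1 - 2**-alpha: ~6.4% for data,
# ~5.1% for parameters, ~3.4% for compute -- data first, then N, then C.
for alpha in (ALPHA_D, ALPHA_N, ALPHA_C):
    print(f"loss reduction per doubling: {1 - 2 ** -alpha:.1%}")
```

The per-doubling comparison makes the ordering in the bullet above concrete.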
The critical batch size also has a power-law relationship with the model's test loss $L$: $B_{crit} \approx B_* / L^{1/\alpha_B}$, with $B_* \sim 2 \times 10^8$ tokens and $\alpha_B \sim 0.21$.
- Combining the formulas for model parameters and data volume: when increasing the model parameters, the data volume should grow in the ratio $D \sim N^{\alpha_N / \alpha_D} \sim N^{0.74}$. The paper combines the two into one equation (Figure 4, left): $L(N, D) = \left[ \left( N_c / N \right)^{\alpha_N / \alpha_D} + D_c / D \right]^{\alpha_D}$
- With a limited number of update steps $S$, the relationship between the test loss and $N, S$ is (Figure 4, right): $L(N, S) = \left( N_c / N \right)^{\alpha_N} + \left( S_c / S_{min}(S) \right)^{\alpha_S}$
- $S_c \sim 2.1 \times 10^3$, $\alpha_S \sim 0.76$
- $S_{min}(S)$ is the smallest possible number of optimization steps
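A minimal sketch of this step-limited law. $S_c$ and $\alpha_S$ are the values above; since the note does not record the joint-fit $N_c$ for this equation, the single-factor value is reused here as an approximation:

```python
# L(N, S) = (N_c/N)**alpha_N + (S_c/S_min)**alpha_S  (Figure 4, right)
ALPHA_N, N_C = 0.076, 8.8e13   # N_C reused from the single-factor fit (approximation)
ALPHA_S, S_C = 0.76, 2.1e3

def loss(n: float, s_min: float) -> float:
    return (N_C / n) ** ALPHA_N + (S_C / s_min) ** ALPHA_S

# The step term decays quickly: 10x more optimization steps shrink it
# by a factor of 10**0.76 ~ 5.8, after which the parameter term dominates.
print(f"step-term shrinkage per 10x steps: {10 ** ALPHA_S:.2f}x")
```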
When the compute budget $C$ is limited and the other factors are not, the optimal $N, B, S, D$ scale with $C$ as $N \propto C^{0.73}$, $B \propto C^{0.24}$, $S \propto C^{0.03}$, and $D = B \cdot S \propto C^{0.27}$
- When compute increases, most of it should go into model size, not training time or data volume. This again shows that larger models are more sample-efficient (a large model can be trained with relatively little data)
- In practice, however, due to hardware limitations, people often train smaller models for longer rather than pursuing maximal compute efficiency
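The compute-efficient exponents above can be wrapped in a small allocation helper. The exponents are the paper's; the helper itself is just illustrative:

```python
# Compute-efficient frontier: N ~ C**0.73, B ~ C**0.24, S ~ C**0.03,
# and hence D = B * S ~ C**0.27.
EXPONENTS = {"N (model size)": 0.73, "B (batch size)": 0.24,
             "S (steps)": 0.03, "D (data)": 0.27}

def growth(compute_multiplier: float) -> dict:
    """Factor by which each quantity grows when compute grows by `compute_multiplier`."""
    return {name: compute_multiplier ** e for name, e in EXPONENTS.items()}

# With 100x more compute, almost all of it goes into model size:
# the model grows ~29x, the data only ~3.5x, the step count barely at all.
for name, factor in growth(100.0).items():
    print(f"{name}: x{factor:.1f}")
```

The exponents are consistent by construction: $0.24 + 0.03 = 0.27$, i.e. the data exponent is just the batch-size and step exponents combined.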