【Paper Reading】Scaling Laws for Neural Language Models

Foreword

  • This post briefly summarizes the main conclusions of the scaling-laws paper
  • Original paper: Scaling Laws for Neural Language Models
  • Personally, I think the exact numerical values of the symbols in the formulas matter less than the relationships and proportions between the different factors

Summary

  • Performance depends strongly on scale, weakly on model shape

    • scale: number of parameters $N$, dataset size $D$, amount of compute $C$
    • shape: model depth, width, number of self-attention heads, etc.
  • Smooth power laws: among the three factors $N$, $D$, $C$, when the other two are not bottlenecks, model performance follows a power-law relationship with each individual factor

    [Figure: test loss falls as a power law with compute, dataset size, and number of parameters]

  • Universality of overfitting: as long as we scale up $N$ and $D$ together, performance improves predictably, but if one of them is held fixed while the other grows, we hit diminishing returns (an overfitting penalty). The penalty depends roughly on the ratio $N^{0.74}/D$, which means that every time the model size is increased 8x, the amount of data only needs to grow by about 5x ($8^{0.74} \approx 4.7$) to avoid a performance penalty (overfitting); see the quick check after this list

    [Figure: overfitting penalty when $D$ is too small relative to $N$]

  • Universality of training: training curves follow predictable power laws whose parameters are roughly independent of model size; by extrapolating the early part of a training curve, we can roughly predict the loss the model would reach after training much longer

  • Transfer improves with test performance: when the model is evaluated on text from a different distribution, the result is strongly correlated with the result on the validation set, with a roughly constant offset in the loss. This suggests that using the validation loss as the evaluation metric is reasonable

  • Sample efficiency: large models reach the same performance with fewer optimization steps and less data (Figure 4)

    [Figure: larger models reach a given test loss with fewer steps and fewer processed tokens]

  • Convergence is inefficient: with a fixed compute budget but no limit on model size or data, the best performance comes from very large models that are stopped well short of convergence. This maximally compute-efficient training is much more sample efficient than training a small model to convergence, and the data requirement grows only slowly with compute, $D \sim C^{0.27}$

  • Optimal batch size: the ideal batch size is roughly a power law of the loss, and can be determined by measuring the gradient noise scale
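
As a quick check of the 8x/5x rule of thumb above: keeping $N^{0.74}/D$ constant directly gives the required data growth. A minimal arithmetic sketch (the only assumption is the 0.74 exponent quoted in this post):

```python
# Quick arithmetic check of the overfitting rule of thumb: keep N^0.74 / D constant.
# The 0.74 exponent (= alpha_N / alpha_D) is the ratio quoted in this post.

model_growth = 8                      # scale the number of parameters N up 8x
data_growth = model_growth ** 0.74    # data must grow by the same factor as N^0.74

print(f"8x larger model -> about {data_growth:.1f}x more data")  # ~4.7, i.e. roughly 5x
```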

In general, LLM performance improves smoothly and predictably as model size, data volume, and compute increase

Summary of Scaling Laws

When performance is limited by only one of the number of non-embedding model parameters $N$, the dataset size $D$, or the compute budget $C_{min}$, the test loss of an autoregressive Transformer can be predicted with a power law.

  • When the model parameters are limited:

    $L(N) = \left(\frac{N_c}{N}\right)^{\alpha_N}$, with $\alpha_N \sim 0.076$ and $N_c \sim 8.8 \times 10^{13}$ (non-embedding parameters)

  • When the amount of data is limited:

    $L(D) = \left(\frac{D_c}{D}\right)^{\alpha_D}$, with $\alpha_D \sim 0.095$ and $D_c \sim 5.4 \times 10^{13}$ (tokens)

  • When the amount of compute is limited:

    $L(C_{min}) = \left(\frac{C_c^{min}}{C_{min}}\right)^{\alpha_C^{min}}$, with $\alpha_C^{min} \sim 0.050$ and $C_c^{min} \sim 3.1 \times 10^{8}$ (PF-days)

The power-law exponents $\alpha_N, \alpha_D, \alpha_C^{min}$ measure how much model performance improves as we scale up model parameters, data volume, and compute respectively (the larger, the better), while the specific values of $N_c, D_c, C_c^{min}$ have no intrinsic meaning

  • From these exponents, increasing the amount of data yields the largest improvement, followed by model parameters, and finally compute (a short code sketch evaluating the three fits follows below)
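
As a small illustration, the three single-factor fits above can be evaluated directly. A minimal Python sketch using the fitted constants reported in the paper ($\alpha_N \sim 0.076$, $N_c \sim 8.8 \times 10^{13}$; $\alpha_D \sim 0.095$, $D_c \sim 5.4 \times 10^{13}$; $\alpha_C^{min} \sim 0.050$, $C_c^{min} \sim 3.1 \times 10^8$ PF-days); the example inputs are arbitrary illustrative choices:

```python
# A minimal sketch evaluating the three single-factor power laws above,
# using the fitted constants reported in the paper (loss in nats/token).

def loss_from_params(N, alpha_N=0.076, N_c=8.8e13):
    """L(N) = (N_c / N)^alpha_N, with N = non-embedding parameters."""
    return (N_c / N) ** alpha_N

def loss_from_data(D, alpha_D=0.095, D_c=5.4e13):
    """L(D) = (D_c / D)^alpha_D, with D = dataset size in tokens."""
    return (D_c / D) ** alpha_D

def loss_from_compute(C_min, alpha_C=0.050, C_c=3.1e8):
    """L(C_min) = (C_c^min / C_min)^alpha_C^min, with C_min in PF-days."""
    return (C_c / C_min) ** alpha_C

# Arbitrary example inputs, purely for illustration
print(loss_from_params(1.5e9))   # ~2.3 when a GPT-2-scale parameter count is the bottleneck
print(loss_from_data(2.2e10))    # predicted loss when ~22B tokens of data is the bottleneck
print(loss_from_compute(1.0))    # predicted loss for a compute-optimal 1 PF-day run
```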

The critical batch size and the model's performance on the test set $L$ are also related by a power law:

$B_{crit}(L) \approx \frac{B_*}{L^{1/\alpha_B}}$, with $B_* \sim 2 \times 10^{8}$ tokens and $\alpha_B \sim 0.21$
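
A small sketch of this fit, assuming the paper's reported values $B_* \sim 2 \times 10^8$ tokens and $\alpha_B \sim 0.21$; the loss values fed in are arbitrary illustrative numbers:

```python
# Sketch of the critical-batch-size fit above: B_crit(L) ~ B_* / L^(1/alpha_B).

def critical_batch_size_tokens(loss, B_star=2e8, alpha_B=0.21):
    """Critical batch size (in tokens) as a power law of the loss alone."""
    return B_star / loss ** (1.0 / alpha_B)

for L in (4.0, 3.0, 2.5):  # arbitrary illustrative loss values
    print(f"L = {L}: B_crit ~ {critical_batch_size_tokens(L):.2e} tokens")
# lower loss -> larger critical batch size, so the useful batch size grows as training progresses
```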

  • Combining the formulas for model parameters and data volume: when the model size is increased, the amount of data should be increased in proportion to $N^{\frac{\alpha_N}{\alpha_D}} \sim N^{0.74}$. The equation below combines the two factors (Fig. 4, left); both combined fits are also sketched in code after this list:

    $L(N, D) = \left[\left(\frac{N_c}{N}\right)^{\frac{\alpha_N}{\alpha_D}} + \frac{D_c}{D}\right]^{\alpha_D}$

  • With a limited number of update steps $S$, the relationship between the test loss and $N, S$ is (Figure 4, right):

    $L(N, S) = \left(\frac{N_c}{N}\right)^{\alpha_N} + \left(\frac{S_c}{S_{min}(S)}\right)^{\alpha_S}$

    • $S_c \sim 2.1 \times 10^3$, $\alpha_S \sim 0.76$
    • $S_{min}(S)$ is the minimum possible number of optimization steps
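
Here is a minimal sketch of the two combined fits. The constants reuse the single-variable fits above purely for illustration (the paper refits them jointly for the combined equations, which is where the $\alpha_N/\alpha_D \sim 0.74$ ratio comes from), so the printed numbers are only indicative:

```python
# Sketch of the two combined fits above, L(N, D) and L(N, S).
# Constants are the single-variable fits, reused here only for illustration.

def loss_N_D(N, D, ratio=0.74, alpha_D=0.095, N_c=8.8e13, D_c=5.4e13):
    """L(N, D) = [ (N_c / N)^(alpha_N / alpha_D) + D_c / D ]^alpha_D"""
    return ((N_c / N) ** ratio + D_c / D) ** alpha_D

def loss_N_S(N, S_min, alpha_N=0.076, alpha_S=0.76, N_c=8.8e13, S_c=2.1e3):
    """L(N, S) = (N_c / N)^alpha_N + (S_c / S_min(S))^alpha_S"""
    return (N_c / N) ** alpha_N + (S_c / S_min) ** alpha_S

N = 1.5e9                        # arbitrary illustrative model size
print(loss_N_D(N, D=1e12))       # plenty of data: loss dominated by the model-size term (~2.2)
print(loss_N_D(N, D=1e9))        # scarce data: the overfitting penalty raises the loss (~2.8)
print(loss_N_S(N, S_min=1e5))    # enough optimization steps: close to the pure L(N) limit
```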

When the compute budget $C$ is limited and the other factors are unrestricted, the optimal $N, B, S, D$ scale with $C$ as:

$N \propto C^{0.73}, \quad B \propto C^{0.24}, \quad S \propto C^{0.03}, \quad D = B \cdot S \propto C^{0.27}$

  • As compute increases, most of it should go into increasing model size rather than training time or data volume. This again shows that larger models are more sample efficient (a large model needs only a relatively small amount of data to match a small model trained on far more data); see the numeric sketch below
  • However, in practice, due to hardware constraints, people often train smaller models for longer instead of pursuing maximal compute efficiency
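
To make this allocation concrete, here is a minimal sketch of how each quantity grows when the compute budget is scaled up, using the approximate exponents above; the 10x budget factor is an arbitrary example:

```python
# How the compute-optimal allocation above scales when the budget grows,
# using the approximate exponents quoted above.

exponents = {
    "model size N": 0.73,
    "batch size B": 0.24,
    "training steps S": 0.03,
    "data D = B*S": 0.27,
}

budget_growth = 10  # multiply the compute budget C by 10 (arbitrary example)
for name, a in exponents.items():
    print(f"{name}: x{budget_growth ** a:.2f}")
# 10x compute -> ~5.4x larger model, ~1.7x batch size, ~1.1x steps, ~1.9x data
```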


Origin blog.csdn.net/qq_52852138/article/details/131697352